identity-canon/research/entity-resolution-privacy/deterministic-vs-probabilistic-matching.md
tegwick 1c1b5c9bc6 Complete IDENTITY-WP-0003 corpus backfill and model refinement
Backfill all 23 research source notes with terminology extracts, modeling
assumptions, conflicts, canonical mappings, and references. Refresh terminology
artifacts, refine the conceptual model with explicit scenario paths, reconcile
canon surfaces and open questions, and mark the workplan finished.
2026-06-21 20:22:20 +02:00

Deterministic vs Probabilistic Matching

Source Type

Academic and industry practice. Entity resolution, record linkage, and duplicate detection literature plus operational MDM patterns.

Domain

Entity resolution, record linkage, duplicate detection, and account matching confidence.

Why This Source Matters

The entity resolution literature distinguishes deterministic key matching from probabilistic matching. That distinction is the foundation for modeling weak vs. strong synonymity.

Key Concepts

  • Deterministic matching: records match when agreed key fields are equal (exact email, government ID, OIDC iss+sub).
  • Probabilistic matching: a match score computed from weighted field similarity (Fellegi-Sunter weights, Jaro-Winkler string similarity, ML classifiers).
  • Record linkage: identifying records across datasets that refer to the same entity.
  • Blocking: reduce comparison space by bucketing on partial keys.
  • Match threshold: score above which records are linked or flagged.
  • False positive / false negative tradeoff: precision vs. recall in linking.
  • Golden record / survivor: MDM pattern that selects a canonical merged record.
  • Non-destructive link: associate records without merge (preferred in modern MDM).
  • Human review queue: ambiguous matches escalated for operator decision.
  • Master data management (MDM): operational discipline around entity resolution.
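
The difference between the first two concepts can be made concrete with a minimal sketch. It assumes a hypothetical `Record` type, email as the agreed deterministic key, and exact field agreement instead of a string similarity such as Jaro-Winkler. The m/u probabilities and both thresholds are illustrative values, not figures from the literature or from this canon.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str
    record_id: str
    email: str
    name: str
    birth_year: int


def _normalize(value) -> str:
    return str(value).strip().lower()


def deterministic_match(a: Record, b: Record) -> bool:
    """Match only when the agreed key field (here: normalized email) is equal."""
    return _normalize(a.email) == _normalize(b.email)


# Hypothetical per-field probabilities (assumptions for illustration):
#   m = P(field agrees | records refer to the same entity)
#   u = P(field agrees | records refer to different entities)
FIELD_PARAMS = {
    "email": (0.95, 0.001),
    "name": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}

# Hypothetical thresholds on the summed log2 weight.
UPPER_THRESHOLD = 8.0
LOWER_THRESHOLD = 0.0


def match_score(a: Record, b: Record) -> float:
    """Sum Fellegi-Sunter style agreement/disagreement weights over the fields."""
    score = 0.0
    for field_name, (m, u) in FIELD_PARAMS.items():
        agrees = _normalize(getattr(a, field_name)) == _normalize(getattr(b, field_name))
        score += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return score


def classify(score: float) -> str:
    """Link above the upper threshold, reject below the lower, else escalate."""
    if score >= UPPER_THRESHOLD:
        return "link"
    if score <= LOWER_THRESHOLD:
        return "non-link"
    return "review"
```

With these assumed parameters, two records that agree on name and birth year but disagree on email fall between the thresholds and would go to the human review queue, while the deterministic rule simply reports no match.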

Relevant Terminology

| Term | Source meaning |
| --- | --- |
| Deterministic match | Equality on predefined key fields. |
| Probabilistic match | Scored similarity above threshold. |
| Record linkage | Cross-dataset entity correspondence. |
| Blocking key | Partial key for candidate pair generation. |
| Match score | Confidence metric for probable same entity. |
| Duplicate | Records hypothesized to refer to same entity. |
| Merge | Combine records into one (destructive). |
| Link | Associate records preserving sources (non-destructive). |
| Survivor record | Chosen primary after merge. |
| Quarantine | Hold ambiguous matches for review. |
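
To illustrate how a blocking key generates candidate pairs, here is a small sketch. The key used (first letter of the surname plus birth year) and the record layout are assumptions chosen for the example; real deployments pick blocking keys per domain.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Hypothetical partial key: first letter of surname + birth year."""
    return f"{record['surname'][:1].lower()}{record['birth_year']}"


def candidate_pairs(records: list) -> list:
    """Return ID pairs that share a blocking key, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = []
    for ids in blocks.values():
        # Only records inside the same bucket are compared with each other.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)
```

Only the pairs returned here are passed on to the more expensive comparison step. The tradeoff is that true matches whose blocking keys differ (for example, a mistyped birth year) are never compared.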

Modeling Assumptions

  • In probabilistic approaches, "same entity" remains a hypothesis until verified.
  • Deterministic rules are domain-specific (no universal golden key).
  • Merge destroys provenance unless carefully audited, so it is increasingly avoided.
  • Confidence is continuous or banded (weak/medium/strong).
  • Source system identity must be preserved for compliance and undo.
  • Human review is part of high-assurance linking.
  • Privacy regulations constrain which fields can be matched.
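
Several of these assumptions (banded confidence, preserved source identity, undo) can be sketched together as a non-destructive link store. The class and function names are hypothetical, and the band cut-offs of 0.6 and 0.9 are placeholders: the note leaves the choice of bands open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRef:
    """Identity of a record in its source system; never rewritten by linking."""
    system: str
    record_id: str


def confidence_band(confidence: float) -> str:
    """Map a continuous confidence in [0, 1] to a band (cut-offs are assumptions)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= 0.9:
        return "strong"
    if confidence >= 0.6:
        return "medium"
    return "weak"


class LinkStore:
    """Keeps links beside the source records, so a link can be undone."""

    def __init__(self) -> None:
        self._links = {}  # link_id -> (a, b, confidence)
        self._next_id = 1

    def link(self, a: SourceRef, b: SourceRef, confidence: float) -> int:
        link_id = self._next_id
        self._next_id += 1
        self._links[link_id] = (a, b, confidence)
        return link_id

    def unlink(self, link_id: int) -> None:
        """Undo a link; the source records themselves were never modified."""
        del self._links[link_id]

    def linked_to(self, ref: SourceRef) -> set:
        """Return the records directly linked to ref."""
        result = set()
        for a, b, _ in self._links.values():
            if a == ref:
                result.add(b)
            elif b == ref:
                result.add(a)
        return result
```

Because a link is a separate object keyed by its own ID, removing a false positive is a single delete rather than an attempt to split a merged record.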

Identity-Canon Implications

  • Deterministic match maps to strong Synonymity Assertion when keys are authoritative (S13).
  • Probabilistic match maps to weak Synonymity Assertion with confidence score and method (S12).
  • Link without merge is the canonical preferred pattern (P7).
  • Match score maps to confidence/strength on Synonymity Assertion.
  • Blocking/method maps to Evidence Source metadata.
  • Quarantine maps to Lifecycle State "proposed" on the assertion.
  • Golden record is a downstream MDM pattern; canon should not require merge.
  • Human review maps to Evidence Source (operator decision).
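
The mappings above could be expressed as a data structure along the following lines. This is a sketch, not the canon's schema: all field names are hypothetical, the `ACCEPTED` state is invented for illustration (the note names only "proposed"), and treating a match on a non-authoritative key as weak and proposed is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strength(Enum):
    STRONG = "strong"
    WEAK = "weak"


class LifecycleState(Enum):
    PROPOSED = "proposed"  # named in the note (quarantine / review queue)
    ACCEPTED = "accepted"  # hypothetical state, not defined in the note


@dataclass(frozen=True)
class EvidenceSource:
    method: str                         # e.g. matcher or key name
    blocking_key: Optional[str] = None  # blocking metadata, if any
    operator: Optional[str] = None      # set when a human reviewer decided


@dataclass(frozen=True)
class SynonymityAssertion:
    left_identifier: str
    right_identifier: str
    strength: Strength
    confidence: Optional[float]
    lifecycle_state: LifecycleState
    evidence: EvidenceSource


def from_deterministic(left: str, right: str, key_name: str,
                       authoritative: bool) -> SynonymityAssertion:
    """Strong only when the key is authoritative; otherwise weak and proposed."""
    return SynonymityAssertion(
        left_identifier=left,
        right_identifier=right,
        strength=Strength.STRONG if authoritative else Strength.WEAK,
        confidence=None,
        lifecycle_state=(LifecycleState.ACCEPTED if authoritative
                         else LifecycleState.PROPOSED),
        evidence=EvidenceSource(method=f"deterministic:{key_name}"),
    )


def from_probabilistic(left: str, right: str, score: float, method: str,
                       blocking_key: str, quarantined: bool) -> SynonymityAssertion:
    """Weak assertion carrying the match score and the method that produced it."""
    return SynonymityAssertion(
        left_identifier=left,
        right_identifier=right,
        strength=Strength.WEAK,
        confidence=score,
        lifecycle_state=(LifecycleState.PROPOSED if quarantined
                         else LifecycleState.ACCEPTED),
        evidence=EvidenceSource(method=method, blocking_key=blocking_key),
    )
```

Keeping the method and blocking key on the evidence object, rather than on the assertion itself, mirrors the note's split between confidence/strength and Evidence Source metadata.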

Terminology Conflicts

  • Duplicate vs. Synonymity: duplicates imply merge; synonymity allows coexistence.
  • Match vs. Link: industry uses the terms interchangeably; canon distinguishes strength.
  • Entity vs. Actor: resolution literature says entity; canon prefers Actor target.
  • Identity vs. Record: matching is between records, not persons directly.
  • Deterministic vs. Strong: a deterministic match can still be wrong if the key is shared (for example, a shared email address).

Candidate Canonical Mappings

| Entity resolution concept | Candidate canonical concept |
| --- | --- |
| Deterministic match | Strong Synonymity Assertion |
| Probabilistic match | Weak Synonymity Assertion |
| Match score | Confidence / strength metadata |
| Link (non-destructive) | Synonymity Assertion |
| Merge | Downstream anti-pattern (avoid) |
| Blocking key | Evidence Source method |
| Review queue | Lifecycle State proposed |
| Source record ID | Identifier |
| Golden record | Downstream projection only |
| False positive handling | Revocation / supersession of assertion |
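
One way to read "golden record as downstream projection only" is that the golden view is recomputed from linked source records on demand and never written back. The sketch below assumes a hypothetical survivorship rule (most recently updated non-empty value wins) and hypothetical record fields `source_id` and `updated_at`, with dates as ISO 8601 strings.

```python
from typing import Iterable, List


def project_golden_record(records: Iterable[dict], fields: List[str]) -> dict:
    """Build a read-only golden view; the input records are not modified.

    Assumed survivorship rule: for each field, take the non-empty value from
    the most recently updated record (ISO 8601 dates sort lexicographically).
    """
    ordered = sorted(records, key=lambda r: r["updated_at"], reverse=True)
    values = {}
    provenance = {}
    for field_name in fields:
        for record in ordered:
            value = record.get(field_name)
            if value not in (None, ""):
                values[field_name] = value
                # Remember which source record supplied the surviving value.
                provenance[field_name] = record["source_id"]
                break
    return {"values": values, "provenance": provenance}
```

If a link later turns out to be a false positive, the assertion is revoked and the projection is recomputed from the remaining records; nothing has to be un-merged.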

Open Questions

  • What confidence bands (weak/medium/strong) should canon standardize?
  • Which deterministic keys are authoritative per source family (OIDC iss+sub, persistent SAML NameID, verified email)?
  • Should probabilistic matchers be required to store feature-level Evidence Source?
  • How should shared-attribute false positives (family email) be classified?

References

  • Fellegi-Sunter model (1969) — foundational probabilistic record linkage
  • Christen, "Data Matching" (2012) — entity resolution textbook
  • NIST SP 800-63A evidence requirements — https://pages.nist.gov/800-63-4/sp800-63A.html
  • MDM Institute duplicate management practices — industry reference