# Deterministic vs Probabilistic Matching

## Source Type

Academic and industry practice. Entity resolution, record linkage, and duplicate detection literature plus operational MDM patterns.

## Domain

Entity resolution, record linkage, duplicate detection, and account matching confidence.

## Why This Source Matters

Entity resolution literature distinguishes deterministic keys from probabilistic matching, the foundation for weak vs. strong synonymity modeling.

## Key Concepts

- **Deterministic matching**: records match when agreed key fields are equal (exact email, government ID, OIDC `iss`+`sub`).
- **Probabilistic matching**: match score from weighted field similarity (Fellegi-Sunter, Jaro-Winkler, ML classifiers).
- **Record linkage**: identifying records across datasets referring to same entity.
- **Blocking**: reduce comparison space by bucketing on partial keys.
- **Match threshold**: score above which records are linked or flagged.
- **False positive / false negative tradeoff**: precision vs. recall in linking.
- **Golden record / survivor**: MDM pattern selecting canonical merged record.
- **Non-destructive link**: associate records without merge (preferred in modern MDM).
- **Human review queue**: ambiguous matches escalated for operator decision.
- **Master data management (MDM)**: operational discipline around entity resolution.

## Relevant Terminology

| Term | Source meaning |
| --- | --- |
| Deterministic match | Equality on predefined key fields. |
| Probabilistic match | Scored similarity above threshold. |
| Record linkage | Cross-dataset entity correspondence. |
| Blocking key | Partial key for candidate pair generation. |
| Match score | Confidence metric for probable same entity. |
| Duplicate | Records hypothesized to refer to same entity. |
| Merge | Combine records into one (destructive). |
| Link | Associate records preserving sources (non-destructive). |
| Survivor record | Chosen primary after merge. |
| Quarantine | Hold ambiguous matches for review. |

## Modeling Assumptions

- **Same entity is hypothesis until verified** in probabilistic approaches.
- **Deterministic rules are domain-specific** (no universal golden key).
- **Merge destroys provenance** unless carefully audited; increasingly avoided.
- **Confidence is continuous or banded** (weak/medium/strong).
- **Source system identity must be preserved** for compliance and undo.
- **Human review is part of high-assurance linking.**
- **Privacy regulations constrain** which fields can be matched.

## Identity-Canon Implications

- Deterministic match maps to **strong Synonymity Assertion** when keys are authoritative (S13).
- Probabilistic match maps to **weak Synonymity Assertion** with confidence score and method (S12).
- **Link without merge** is the canonical preferred pattern (**P7**).
- **Match score** maps to confidence/strength on Synonymity Assertion.
- **Blocking/method** maps to Evidence Source metadata.
- **Quarantine** maps to Lifecycle State `proposed` on assertion.
- **Golden record** is downstream MDM pattern; canon should not require merge.
- **Human review** maps to Evidence Source (operator decision).

## Terminology Conflicts

- **Duplicate vs. Synonymity**: duplicates imply merge; synonymity allows coexistence.
- **Match vs. Link**: industry uses interchangeably; canon distinguishes strength.
- **Entity vs. Actor**: resolution literature says entity; canon prefers Actor target.
- **Identity vs. Record**: matching is between records, not persons directly.
- **Deterministic vs. Strong**: deterministic can still be wrong if key is shared (shared email).

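
### Illustrative Sketch

The deterministic/probabilistic split, blocking, and thresholding listed under Key Concepts can be sketched in Python. This is a minimal illustration under stated assumptions, not an implementation taken from the cited literature or from the canon. The `Record` fields, the m/u probabilities, both thresholds, the prefix blocking key, and the output keys (`strength`, `state`, `method`, `score`) are all hypothetical. Only the `proposed` state is named in this note; the `active` state is an assumed placeholder. Scoring uses Fellegi-Sunter-style log-likelihood agreement weights with exact (case-insensitive) comparison rather than a string comparator such as Jaro-Winkler.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations
from math import log2
from typing import Iterator, Optional


@dataclass(frozen=True)
class Record:
    source: str                  # source system, kept so links stay non-destructive
    record_id: str
    email: Optional[str] = None
    name: Optional[str] = None
    postcode: Optional[str] = None
    iss: Optional[str] = None    # OIDC issuer
    sub: Optional[str] = None    # OIDC subject


# Hypothetical per-field probabilities:
# m = P(fields agree | same entity), u = P(fields agree | different entities).
FIELD_PROBS = {"email": (0.95, 0.01), "name": (0.90, 0.05), "postcode": (0.85, 0.10)}
LINK_THRESHOLD = 8.0     # assumed: at or above this, link automatically
REVIEW_THRESHOLD = 2.0   # assumed: at or above this, quarantine for review


def deterministic_match(a: Record, b: Record) -> bool:
    """Equality on a predefined key, here OIDC iss+sub (both must be present)."""
    return a.iss is not None and a.sub is not None and (a.iss, a.sub) == (b.iss, b.sub)


def blocking_key(r: Record) -> str:
    """Partial key used to bucket records: first three postcode characters."""
    return (r.postcode or "")[:3].upper()


def candidate_pairs(records: list[Record]) -> Iterator[tuple[Record, Record]]:
    """Only compare records that share a non-empty blocking key."""
    buckets: dict[str, list[Record]] = defaultdict(list)
    for r in records:
        key = blocking_key(r)
        if key:
            buckets[key].append(r)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)


def match_weight(a: Record, b: Record) -> float:
    """Sum of log2 agreement/disagreement weights over comparable fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        va, vb = getattr(a, field), getattr(b, field)
        if va is None or vb is None:
            continue  # a missing value contributes no evidence either way
        if va.strip().lower() == vb.strip().lower():
            total += log2(m / u)
        else:
            total += log2((1 - m) / (1 - u))
    return total


def classify(a: Record, b: Record) -> Optional[dict]:
    """Return a hypothetical assertion dict, or None when no link is proposed."""
    if deterministic_match(a, b):
        return {"strength": "strong", "state": "active",
                "method": "deterministic:iss+sub", "score": None}
    score = match_weight(a, b)
    if score >= LINK_THRESHOLD:
        return {"strength": "weak", "state": "active",
                "method": "probabilistic:fellegi-sunter", "score": score}
    if score >= REVIEW_THRESHOLD:
        return {"strength": "weak", "state": "proposed",  # quarantine
                "method": "probabilistic:fellegi-sunter", "score": score}
    return None
```

With these assumed weights, two records that share an email and postcode but differ on name (the shared family email case) score below the link threshold and land in `proposed` rather than being linked automatically. Neither path merges records: each result only describes an association between two source records that remain intact.
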
## Candidate Canonical Mappings

| Entity resolution concept | Candidate canonical concept |
| --- | --- |
| Deterministic match | Strong Synonymity Assertion |
| Probabilistic match | Weak Synonymity Assertion |
| Match score | Confidence / strength metadata |
| Link (non-destructive) | Synonymity Assertion |
| Merge | Downstream anti-pattern (avoid) |
| Blocking key | Evidence Source method |
| Review queue | Lifecycle State `proposed` |
| Source record ID | Identifier |
| Golden record | Downstream projection only |
| False positive handling | Revocation / supersession of assertion |

## Open Questions

- What confidence bands (weak/medium/strong) should canon standardize?
- Which deterministic keys are authoritative per source family (OIDC iss+sub, persistent SAML NameID, verified email)?
- Should probabilistic matchers be required to store feature-level Evidence Source?
- How should shared-attribute false positives (family email) be classified?

## References

- Fellegi-Sunter model (1969): foundational probabilistic record linkage
- Christen, "Data Matching" (2012): entity resolution textbook
- NIST SP 800-63A evidence requirements: https://pages.nist.gov/800-63-4/sp800-63A.html
- MDM Institute duplicate management practices: industry reference