identity-canon/research/entity-resolution-privacy/deterministic-vs-probabilistic-matching.md
tegwick 1c1b5c9bc6 Complete IDENTITY-WP-0003 corpus backfill and model refinement
Backfill all 23 research source notes with terminology extracts, modeling
assumptions, conflicts, canonical mappings, and references. Refresh terminology
artifacts, refine the conceptual model with explicit scenario paths, reconcile
canon surfaces and open questions, and mark the workplan finished.
2026-06-21 20:22:20 +02:00


# Deterministic vs Probabilistic Matching
## Source Type
Academic and industry practice. Entity resolution, record linkage, and
duplicate detection literature plus operational MDM patterns.
## Domain
Entity resolution, record linkage, duplicate detection, and account matching
confidence.
## Why This Source Matters
Entity resolution literature distinguishes deterministic key matching from
probabilistic matching, a distinction that underpins weak vs. strong synonymity
modeling.
## Key Concepts
- **Deterministic matching**: records match when agreed key fields are equal
(exact email, government ID, OIDC `iss`+`sub`).
- **Probabilistic matching**: match score computed from weighted field similarity
(Fellegi-Sunter weights, Jaro-Winkler string similarity, ML classifiers).
- **Record linkage**: identifying records across datasets that refer to the same
entity.
- **Blocking**: reduce comparison space by bucketing on partial keys.
- **Match threshold**: score above which records are linked or flagged.
- **False positive / false negative tradeoff**: precision vs. recall in linking.
- **Golden record / survivor**: MDM pattern that selects a canonical merged record.
- **Non-destructive link**: associate records without merge (preferred in modern MDM).
- **Human review queue**: ambiguous matches escalated for operator decision.
- **Master data management (MDM)**: operational discipline around entity resolution.
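
The first two concepts above can be sketched side by side. This is a minimal illustration, not a reference implementation: the field names, the `m`/`u` probabilities, and `MATCH_THRESHOLD` are all hypothetical values chosen for the example, and missing fields are simply treated as disagreements.

```python
import math

def deterministic_match(a: dict, b: dict, key_fields: tuple) -> bool:
    """True when every agreed key field is present and equal in both records."""
    return all(
        a.get(f) is not None and a.get(f) == b.get(f)
        for f in key_fields
    )

# Hypothetical per-field parameters (assumed, not taken from any dataset):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELD_PARAMS = {
    "email": (0.95, 0.001),
    "last_name": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}

def fellegi_sunter_score(a: dict, b: dict, params: dict = FIELD_PARAMS) -> float:
    """Sum of log2 agreement/disagreement weights over the compared fields."""
    score = 0.0
    for field, (m, u) in params.items():
        if a.get(field) is not None and a.get(field) == b.get(field):
            score += math.log2(m / u)              # agreement weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight
    return score

MATCH_THRESHOLD = 6.0  # illustrative cut-off, not a recommended value

r1 = {"iss": "https://idp.example", "sub": "abc123",
      "email": "ann@example.com", "last_name": "Lee", "birth_year": 1990}
r2 = {"iss": "https://idp.example", "sub": "abc123",
      "email": "ann@example.com", "last_name": "Lee", "birth_year": 1990}
r3 = {"iss": "https://other.example", "sub": "zzz",
      "email": "a.lee@example.org", "last_name": "Lee", "birth_year": 1990}

same_key = deterministic_match(r1, r2, ("iss", "sub"))
score_r1_r3 = fellegi_sunter_score(r1, r3)
```

Here `r1` and `r2` match deterministically on `iss`+`sub`, while `r1` and `r3` share no key and can only be compared by score; with these assumed parameters the differing email pulls the pair below the threshold.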
## Relevant Terminology
| Term | Source meaning |
| --- | --- |
| Deterministic match | Equality on predefined key fields. |
| Probabilistic match | Scored similarity above threshold. |
| Record linkage | Cross-dataset entity correspondence. |
| Blocking key | Partial key for candidate pair generation. |
| Match score | Confidence metric for probable same entity. |
| Duplicate | Records hypothesized to refer to the same entity. |
| Merge | Combine records into one (destructive). |
| Link | Associate records preserving sources (non-destructive). |
| Survivor record | Chosen primary after merge. |
| Quarantine | Hold ambiguous matches for review. |
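
As a rough illustration of the blocking key entry, the sketch below buckets records on a partial key and only generates candidate pairs inside each bucket. The key choice (surname initial plus birth year) and the record shape are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    """Hypothetical partial key: lower-cased surname initial plus birth year."""
    return f"{record['last_name'][:1].lower()}{record['birth_year']}"

def candidate_pairs(records: list) -> list:
    """Group record IDs by blocking key, then pair IDs within each block only."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs

records = [
    {"id": "crm:1", "last_name": "Lee", "birth_year": 1990},
    {"id": "hr:7", "last_name": "lee", "birth_year": 1990},
    {"id": "crm:2", "last_name": "Lopez", "birth_year": 1972},
    {"id": "hr:9", "last_name": "Khan", "birth_year": 1985},
]

pairs = candidate_pairs(records)
```

Four records would need six comparisons without blocking; with this key only `crm:1` and `hr:7` share a bucket, so a single candidate pair goes on to scoring. The tradeoff is that true matches whose blocking keys differ are never compared.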
## Modeling Assumptions
- **Same entity is a hypothesis until verified** in probabilistic approaches.
- **Deterministic rules are domain-specific** (no universal golden key).
- **Merge destroys provenance** unless carefully audited — increasingly avoided.
- **Confidence is continuous or banded** (weak/medium/strong).
- **Source system identity must be preserved** for compliance and undo.
- **Human review is part of high-assurance linking.**
- **Privacy regulations constrain** which fields can be matched.
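
One way to read the banded-confidence and human-review assumptions together is a two-threshold routing rule, in the spirit of the three decision regions of the Fellegi-Sunter model (link, possible link, non-link). The function name, return labels, and threshold values below are hypothetical.

```python
def route(score: float, lower: float, upper: float) -> str:
    """Route a match score into one of three bands using two thresholds."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "link"      # confident enough to assert without review
    if score >= lower:
        return "review"    # ambiguous: hold for operator decision
    return "non_link"      # treated as different entities

# Illustrative thresholds only; real values depend on the data and risk appetite.
decisions = [route(s, lower=0.0, upper=10.0) for s in (19.7, 5.5, -3.0)]
```

Widening the gap between `lower` and `upper` sends more pairs to the review queue, trading operator workload for fewer automatic false positives and false negatives.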
## Identity-Canon Implications
- Deterministic match maps to **strong Synonymity Assertion** when keys are
authoritative (S13).
- Probabilistic match maps to **weak Synonymity Assertion** with confidence
score and method (S12).
- **Link without merge** is the canonical preferred pattern (**P7**).
- **Match score** maps to confidence/strength on Synonymity Assertion.
- **Blocking/method** maps to Evidence Source metadata.
- **Quarantine** maps to Lifecycle State `proposed` on assertion.
- **Golden record** is a downstream MDM pattern; canon should not require merge.
- **Human review** maps to Evidence Source (operator decision).
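
The implications above might be pictured as a single record type that links two source records without merging them. This is a sketch only: the class name follows the canon term, but every field name, the `method` string format, and the `active` state are assumptions (only `proposed` is named in this note).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SynonymityAssertion:
    """Non-destructive link between two source records (hypothetical shape)."""
    left_id: str
    right_id: str
    strength: str                 # "strong" or "weak"
    method: str                   # Evidence Source metadata, e.g. "deterministic:iss+sub"
    confidence: Optional[float]   # match score, if the method produces one
    lifecycle_state: str          # "proposed" while quarantined, else "active" (assumed)

def from_deterministic(left_id: str, right_id: str, key_name: str) -> SynonymityAssertion:
    """Deterministic match on an authoritative key -> strong assertion."""
    return SynonymityAssertion(left_id, right_id, "strong",
                               f"deterministic:{key_name}", None, "active")

def from_probabilistic(left_id: str, right_id: str, score: float,
                       method: str, quarantined: bool) -> SynonymityAssertion:
    """Probabilistic match -> weak assertion carrying its score and method."""
    state = "proposed" if quarantined else "active"
    return SynonymityAssertion(left_id, right_id, "weak",
                               f"probabilistic:{method}", score, state)

strong = from_deterministic("idp:abc123", "crm:1", "iss+sub")
weak = from_probabilistic("crm:1", "hr:7", 0.72, "fellegi-sunter", quarantined=True)
```

Because the assertion only references record IDs, both source records stay intact and the link can later be withdrawn without an un-merge step.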
## Terminology Conflicts
- **Duplicate vs. Synonymity**: duplicates imply merge; synonymity allows coexistence.
- **Match vs. Link**: industry uses the terms interchangeably; canon distinguishes strength.
- **Entity vs. Actor**: resolution literature says entity; canon prefers Actor target.
- **Identity vs. Record**: matching is between records, not persons directly.
- **Deterministic vs. Strong**: a deterministic match can still be wrong if the
key is shared (e.g., a shared email address).
## Candidate Canonical Mappings
| Entity resolution concept | Candidate canonical concept |
| --- | --- |
| Deterministic match | Strong Synonymity Assertion |
| Probabilistic match | Weak Synonymity Assertion |
| Match score | Confidence / strength metadata |
| Link (non-destructive) | Synonymity Assertion |
| Merge | Downstream anti-pattern (avoid) |
| Blocking key | Evidence Source method |
| Review queue | Lifecycle State `proposed` |
| Source record ID | Identifier |
| Golden record | Downstream projection only |
| False positive handling | Revocation / supersession of assertion |
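
To illustrate "golden record as downstream projection only" and "revocation of assertion" from the table, the sketch below derives clusters from links with a small union-find, counting only links in an `active` state. The tuple layout and the `active`/`revoked` state names are assumptions for the example.

```python
def clusters(record_ids: list, links: list) -> list:
    """Group record IDs connected by active links; source records are never modified."""
    parent = {r: r for r in record_ids}

    def find(x: str) -> str:
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, state in links:
        if state != "active":
            continue  # proposed or revoked assertions do not affect the projection
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())

ids = ["crm:1", "hr:7", "web:3", "web:4"]
links = [
    ("crm:1", "hr:7", "active"),
    ("hr:7", "web:3", "revoked"),    # false positive that was withdrawn
    ("web:3", "web:4", "proposed"),  # still awaiting review
]

projection = clusters(ids, links)
```

Handling a false positive then amounts to changing one assertion's state and recomputing the projection; nothing has to be split back out of a merged record.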
## Open Questions
- What confidence bands (weak/medium/strong) should canon standardize?
- Which deterministic keys are authoritative per source family (OIDC iss+sub,
persistent SAML NameID, verified email)?
- Should probabilistic matchers be required to store feature-level Evidence Source?
- How should shared-attribute false positives (family email) be classified?
## References
- Fellegi-Sunter model (1969) — foundational probabilistic record linkage
- Christen, "Data Matching" (2012) — entity resolution textbook
- NIST SP 800-63A evidence requirements — https://pages.nist.gov/800-63-4/sp800-63A.html
- MDM Institute duplicate management practices — industry reference