Deterministic vs Probabilistic Matching
Source Type
Academic and industry practice. Entity resolution, record linkage, and duplicate detection literature plus operational MDM patterns.
Domain
Entity resolution, record linkage, duplicate detection, and account matching confidence.
Why This Source Matters
Entity resolution literature distinguishes deterministic (key-based) matching from probabilistic (score-based) matching, which is the foundation for modeling weak vs. strong synonymity.
Key Concepts
- Deterministic matching: records match when agreed key fields are equal (exact email, government ID, OIDC `iss+sub`).
- Probabilistic matching: match score from weighted field similarity (Fellegi-Sunter, Jaro-Winkler, ML classifiers).
- Record linkage: identifying records across datasets referring to same entity.
- Blocking: reduce comparison space by bucketing on partial keys.
- Match threshold: score above which records are linked or flagged.
- False positive / false negative tradeoff: precision vs. recall in linking.
- Golden record / survivor: MDM pattern selecting canonical merged record.
- Non-destructive link: associate records without merge (preferred in modern MDM).
- Human review queue: ambiguous matches escalated for operator decision.
- Master data management (MDM): operational discipline around entity resolution.
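
The matching concepts above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the field names, the m/u probabilities in `FIELD_PARAMS`, and the `upper`/`lower` thresholds are all hypothetical values chosen for the example. The score follows the Fellegi-Sunter idea of summing log-likelihood agreement and disagreement weights per field, and missing values are treated as disagreement for simplicity.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical per-field parameters (assumed, not from the source):
# m = P(field agrees | same entity), u = P(field agrees | different entities).
FIELD_PARAMS = {
    "email": (0.95, 0.001),
    "last_name": (0.90, 0.05),
    "postcode": (0.85, 0.10),
}


def deterministic_match(a, b, key_fields=("issuer", "subject")):
    """Deterministic matching: every agreed key field is present and equal."""
    return all(a.get(f) is not None and a.get(f) == b.get(f) for f in key_fields)


def match_score(a, b):
    """Probabilistic matching: sum of Fellegi-Sunter-style log2 weights."""
    score = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if a.get(field) is not None and a.get(field) == b.get(field):
            score += math.log2(m / u)              # agreement weight (positive)
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return score


def candidate_pairs(records, blocking_key):
    """Blocking: only compare records that share a blocking key value."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[blocking_key(rec)].append(rec)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)


def classify(score, upper=8.0, lower=0.0):
    """Match thresholds: link, send to the human review queue, or reject."""
    if score >= upper:
        return "link"
    if score >= lower:
        return "review"
    return "non-link"


if __name__ == "__main__":
    r1 = {"issuer": "https://idp.example", "subject": "abc",
          "email": "a@example.com", "last_name": "Ng", "postcode": "10001"}
    r2 = dict(r1, subject="xyz", email="other@example.com")
    for left, right in candidate_pairs([r1, r2], lambda r: r["postcode"]):
        print(deterministic_match(left, right), classify(match_score(left, right)))
```

Raising `upper` trades recall for precision (fewer false positives, more pairs left in review), which is the false positive / false negative tradeoff listed above.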
Relevant Terminology
| Term | Source meaning |
|---|---|
| Deterministic match | Equality on predefined key fields. |
| Probabilistic match | Scored similarity above threshold. |
| Record linkage | Cross-dataset entity correspondence. |
| Blocking key | Partial key for candidate pair generation. |
| Match score | Confidence metric for probable same entity. |
| Duplicate | Records hypothesized to refer to same entity. |
| Merge | Combine records into one (destructive). |
| Link | Associate records preserving sources (non-destructive). |
| Survivor record | Chosen primary after merge. |
| Quarantine | Hold ambiguous matches for review. |
Modeling Assumptions
- In probabilistic approaches, "same entity" is a hypothesis until verified.
- Deterministic rules are domain-specific (no universal golden key).
- Merge destroys provenance unless carefully audited — increasingly avoided.
- Confidence is continuous or banded (weak/medium/strong).
- Source system identity must be preserved for compliance and undo.
- Human review is part of high-assurance linking.
- Privacy regulations constrain which fields can be matched.
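
To make the provenance and undo assumptions concrete, here is a minimal sketch of a non-destructive link store. All names (`RecordRef`, `Link`, `LinkStore`, `confidence_band`) and the band cut-offs are hypothetical; the source does not prescribe a data model. The point is only that links reference source records instead of rewriting them, so a link can be deactivated later without losing the audit trail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordRef:
    """Reference to a record that stays untouched in its source system."""
    source_system: str
    record_id: str


@dataclass
class Link:
    left: RecordRef
    right: RecordRef
    confidence: float   # continuous score in [0, 1] (assumed scale)
    method: str         # how the link was derived, kept for audit
    active: bool = True


def confidence_band(confidence, medium=0.6, strong=0.9):
    """Band a continuous confidence; the cut-offs are illustrative only."""
    if confidence >= strong:
        return "strong"
    if confidence >= medium:
        return "medium"
    return "weak"


class LinkStore:
    def __init__(self):
        self._links = []

    def link(self, left, right, confidence, method):
        """Associate two records without modifying or merging either one."""
        new_link = Link(left, right, confidence, method)
        self._links.append(new_link)
        return new_link

    def unlink(self, link):
        """Undo a link by deactivating it; the entry is kept for audit."""
        link.active = False

    def linked_to(self, ref):
        """Records directly linked to ref (no transitive closure)."""
        found = set()
        for link in self._links:
            if not link.active:
                continue
            if link.left == ref:
                found.add(link.right)
            elif link.right == ref:
                found.add(link.left)
        return found

    def history(self):
        """All links ever recorded, including deactivated ones."""
        return list(self._links)
```

A merge-based design would have to reconstruct the original records to undo a false positive; here `unlink` is enough, because the source records were never changed.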
Identity-Canon Implications
- Deterministic match maps to strong Synonymity Assertion when keys are authoritative (S13).
- Probabilistic match maps to weak Synonymity Assertion with confidence score and method (S12).
- Link without merge is the canonical preferred pattern (P7).
- Match score maps to confidence/strength on Synonymity Assertion.
- Blocking/method maps to Evidence Source metadata.
- Quarantine maps to Lifecycle State `proposed` on assertion.
- Golden record is downstream MDM pattern; canon should not require merge.
- Human review maps to Evidence Source (operator decision).
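
The mappings above could be expressed roughly as follows. This is a sketch under stated assumptions: the class and field names are hypothetical, only the `proposed` state is named in this note (`active` and `revoked` are assumed for illustration), the `link_threshold` value is arbitrary, and treating a deterministic match on a non-authoritative key as weak and `proposed` is one possible policy, not something the canon has settled.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Strength(Enum):
    STRONG = "strong"
    WEAK = "weak"


class LifecycleState(Enum):
    PROPOSED = "proposed"   # named in this note (quarantine / review queue)
    ACTIVE = "active"       # assumed state name
    REVOKED = "revoked"     # assumed state name


@dataclass(frozen=True)
class EvidenceSource:
    method: str             # e.g. blocking/matching method or "operator-review"


@dataclass(frozen=True)
class SynonymityAssertion:
    identifier_a: str
    identifier_b: str
    strength: Strength
    confidence: Optional[float]   # None when no score applies
    evidence: EvidenceSource
    state: LifecycleState


def assertion_from_match(id_a, id_b, *, deterministic, method,
                         authoritative_key=False, score=None,
                         link_threshold=0.9):
    """Map a match result to an assertion; nothing is merged."""
    evidence = EvidenceSource(method)
    if deterministic and authoritative_key:
        return SynonymityAssertion(id_a, id_b, Strength.STRONG, None,
                                   evidence, LifecycleState.ACTIVE)
    if deterministic:
        # Assumed policy: a possibly shared key (e.g. family email) is held
        # as a weak, proposed assertion until reviewed.
        return SynonymityAssertion(id_a, id_b, Strength.WEAK, None,
                                   evidence, LifecycleState.PROPOSED)
    # Probabilistic: weak assertion carrying its score; below the threshold
    # it is quarantined as proposed.
    state = (LifecycleState.ACTIVE
             if score is not None and score >= link_threshold
             else LifecycleState.PROPOSED)
    return SynonymityAssertion(id_a, id_b, Strength.WEAK, score, evidence, state)


def revoke(assertion):
    """False positive handling: return a revoked copy, original untouched."""
    return replace(assertion, state=LifecycleState.REVOKED)
```

Because the assertion is immutable and `revoke` returns a new value, earlier states remain available to whatever audit log stores them.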
Terminology Conflicts
- Duplicate vs. Synonymity: duplicates imply merge; synonymity allows coexistence.
- Match vs. Link: industry uses the terms interchangeably; canon distinguishes strength.
- Entity vs. Actor: resolution literature says entity; canon prefers Actor target.
- Identity vs. Record: matching is between records, not persons directly.
- Deterministic vs. Strong: deterministic can still be wrong if key is shared (shared email).
Candidate Canonical Mappings
| Entity resolution concept | Candidate canonical concept |
|---|---|
| Deterministic match | Strong Synonymity Assertion |
| Probabilistic match | Weak Synonymity Assertion |
| Match score | Confidence / strength metadata |
| Link (non-destructive) | Synonymity Assertion |
| Merge | Downstream anti-pattern (avoid) |
| Blocking key | Evidence Source method |
| Review queue | Lifecycle State proposed |
| Source record ID | Identifier |
| Golden record | Downstream projection only |
| False positive handling | Revocation / supersession of assertion |
Open Questions
- What confidence bands (weak/medium/strong) should canon standardize?
- Which deterministic keys are authoritative per source family (OIDC iss+sub, persistent SAML NameID, verified email)?
- Should probabilistic matchers be required to store feature-level Evidence Source?
- How should shared-attribute false positives (family email) be classified?
References
- Fellegi-Sunter model (1969) — foundational probabilistic record linkage
- Christen, "Data Matching" (2012) — entity resolution textbook
- NIST SP 800-63A evidence requirements — https://pages.nist.gov/800-63-4/sp800-63A.html
- MDM Institute duplicate management practices — industry reference