identity-canon/research/entity-resolution-privacy/deterministic-vs-probabilistic-matching.md

# Deterministic vs Probabilistic Matching

## Source Type

Academic and industry practice. Entity resolution, record linkage, and
duplicate detection literature plus operational MDM patterns.

## Domain

Entity resolution, record linkage, duplicate detection, and account matching
confidence.

## Why This Source Matters

Entity resolution literature distinguishes deterministic (key-equality) matching
from probabilistic (scored) matching. That distinction is the foundation for
weak vs. strong synonymity modeling.

## Key Concepts

- **Deterministic matching**: records match when agreed key fields are equal
  (exact email, government ID, OIDC `iss`+`sub`).
- **Probabilistic matching**: match score from weighted field similarity
  (Fellegi-Sunter model, string comparators such as Jaro-Winkler, ML classifiers).
- **Record linkage**: identifying records across datasets that refer to the same
  entity.
- **Blocking**: reduce comparison space by bucketing on partial keys.
- **Match threshold**: score above which records are linked or flagged.
- **False positive / false negative tradeoff**: precision vs. recall in linking.
- **Golden record / survivor**: MDM pattern selecting canonical merged record.
- **Non-destructive link**: associate records without merge (preferred in modern MDM).
- **Human review queue**: ambiguous matches escalated for operator decision.
- **Master data management (MDM)**: operational discipline around entity resolution.
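
The first two concepts and the match threshold can be sketched together. The following is a minimal illustration, not a reference implementation: the `m`/`u` probabilities, field names, and threshold values are hypothetical, and exact equality stands in for a string comparator such as Jaro-Winkler.

```python
import math

# Hypothetical per-field parameters (illustrative values only):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELD_PARAMS = {
    "email": (0.95, 0.001),
    "last_name": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def deterministic_match(a, b, keys=("iss", "sub")):
    """True when every key field is present and equal in both records."""
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in keys)


def fellegi_sunter_score(a, b, params=FIELD_PARAMS):
    """Sum log2 agreement/disagreement weights over the compared fields."""
    score = 0.0
    for field, (m, u) in params.items():
        if a.get(field) is None or b.get(field) is None:
            continue  # a missing value contributes no evidence either way
        if a[field] == b[field]:
            score += math.log2(m / u)              # agreement weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight
    return score


def classify(score, upper=8.0, lower=0.0):
    """Two thresholds: link, non-link, or escalate to the review queue."""
    if score >= upper:
        return "link"
    if score <= lower:
        return "non-link"
    return "review"


a = {"iss": "https://idp.example", "sub": "123",
     "email": "kim@example.com", "last_name": "Lee", "birth_year": 1990}
b = dict(a)                               # identical record from another dataset
c = dict(a, email="k.lee@example.org")    # same person hypothesis, new email

print(deterministic_match(a, b))              # True
print(classify(fellegi_sunter_score(a, b)))   # link
print(classify(fellegi_sunter_score(a, c)))   # review
```

Raising `upper` trades recall for precision, while the gap between `lower` and `upper` determines how much work lands in the human review queue.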

## Relevant Terminology

| Term | Source meaning |
| --- | --- |
| Deterministic match | Equality on predefined key fields. |
| Probabilistic match | Scored similarity above threshold. |
| Record linkage | Cross-dataset entity correspondence. |
| Blocking key | Partial key for candidate pair generation. |
| Match score | Confidence metric for probable same entity. |
| Duplicate | Records hypothesized to refer to same entity. |
| Merge | Combine records into one (destructive). |
| Link | Associate records preserving sources (non-destructive). |
| Survivor record | Chosen primary after merge. |
| Quarantine | Hold ambiguous matches for review. |
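
As an illustration of how a blocking key generates candidate pairs, here is a short sketch. The key definition (first letter of the last name plus birth year) and the record fields are assumptions chosen for the example, not a recommended blocking scheme.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical partial key: first letter of last name plus birth year."""
    last = (record.get("last_name") or "").strip().lower()
    year = record.get("birth_year")
    if not last or year is None:
        return None  # record cannot be placed in a block
    return f"{last[0]}|{year}"


def candidate_pairs(records):
    """Return ID pairs that share a block; only these pairs get scored."""
    blocks = defaultdict(list)
    for rec in records:
        key = blocking_key(rec)
        if key is not None:
            blocks[key].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": "r1", "last_name": "Lee", "birth_year": 1990},
    {"id": "r2", "last_name": "Li", "birth_year": 1990},
    {"id": "r3", "last_name": "Smith", "birth_year": 1990},
    {"id": "r4", "last_name": "Lee", "birth_year": 1985},
]
print(candidate_pairs(records))  # {('r1', 'r2')}
```

Four records would give six pairs under exhaustive comparison; blocking leaves one. The cost is recall: a true match whose blocking fields differ or are missing (here `r1` and `r4` if the birth year were mistyped) is never compared.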

## Modeling Assumptions

- **"Same entity" is a hypothesis until verified** in probabilistic approaches.
- **Deterministic rules are domain-specific** (no universal golden key).
- **Merge destroys provenance** unless carefully audited, so it is increasingly avoided.
- **Confidence is continuous or banded** (weak/medium/strong).
- **Source system identity must be preserved** for compliance and undo.
- **Human review is part of high-assurance linking.**
- **Privacy regulations constrain** which fields can be matched.
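
The provenance and undo assumptions above can be made concrete with a small link store. This is a sketch under stated assumptions: `SourceRef`, `Link`, and `LinkStore` are hypothetical names, and an in-memory list stands in for whatever persistence a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRef:
    """Identity of a record in its source system (never rewritten)."""
    system: str
    record_id: str


@dataclass
class Link:
    left: SourceRef
    right: SourceRef
    score: float
    method: str
    active: bool = True


class LinkStore:
    """Keeps links beside the source records instead of merging them."""

    def __init__(self):
        self._links = []

    def add(self, link):
        self._links.append(link)
        return link

    def undo(self, link):
        # Deactivate rather than delete, so the audit trail survives.
        link.active = False

    def linked_to(self, ref):
        """Active counterparts of a source record."""
        out = []
        for link in self._links:
            if not link.active:
                continue
            if link.left == ref:
                out.append(link.right)
            elif link.right == ref:
                out.append(link.left)
        return out

    def history(self):
        """All links ever asserted, including undone ones."""
        return list(self._links)


crm = SourceRef("crm", "42")
billing = SourceRef("billing", "A7")
store = LinkStore()
link = store.add(Link(crm, billing, score=0.93, method="fellegi-sunter"))
print(store.linked_to(crm))   # [SourceRef(system='billing', record_id='A7')]
store.undo(link)
print(store.linked_to(crm))   # []
```

Because neither source record is modified, undoing a false positive is a single state change; a destructive merge would instead require reconstructing the original records from an audit log.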

## Identity-Canon Implications

- Deterministic match maps to **strong Synonymity Assertion** when keys are
  authoritative (S13).
- Probabilistic match maps to **weak Synonymity Assertion** with confidence
  score and method (S12).
- **Link without merge** is the canonical preferred pattern (**P7**).
- **Match score** maps to confidence/strength on Synonymity Assertion.
- **Blocking/method** maps to Evidence Source metadata.
- **Quarantine** maps to Lifecycle State `proposed` on assertion.
- **Golden record** is a downstream MDM pattern; canon should not require merge.
- **Human review** maps to Evidence Source (operator decision).
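
One way to read the mappings above is as a function from a match outcome to an assertion. The sketch below is illustrative only: the field names, the `to_assertion` helper, and the `active` state are assumptions (only `proposed` appears in this note), and treating a deterministic match on a non-authoritative key as weak is one possible interpretation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SynonymityAssertion:
    left_id: str
    right_id: str
    strength: str                # "strong" or "weak"
    confidence: Optional[float]  # match score, when the method produces one
    evidence_method: str         # recorded as Evidence Source metadata
    lifecycle_state: str         # "proposed" per this note; "active" is assumed


def to_assertion(left_id, right_id, *, method, score=None,
                 key_is_authoritative=False, needs_review=False):
    """Map a match outcome to an assertion without merging any records."""
    if method == "deterministic" and key_is_authoritative:
        strength = "strong"
    else:
        strength = "weak"
    # Quarantined matches stay proposed until an operator decides.
    state = "proposed" if needs_review else "active"
    return SynonymityAssertion(left_id, right_id, strength, score, method, state)


strong = to_assertion("idp:123", "crm:42", method="deterministic",
                      key_is_authoritative=True)
weak = to_assertion("crm:42", "billing:A7", method="probabilistic",
                    score=0.72, needs_review=True)
print(strong.strength, strong.lifecycle_state)  # strong active
print(weak.strength, weak.lifecycle_state)      # weak proposed
```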

## Terminology Conflicts

- **Duplicate vs. Synonymity**: duplicates imply merge; synonymity allows coexistence.
- **Match vs. Link**: industry uses the terms interchangeably; canon distinguishes strength.
- **Entity vs. Actor**: resolution literature says entity; canon prefers Actor target.
- **Identity vs. Record**: matching is between records, not persons directly.
- **Deterministic vs. Strong**: a deterministic match can still be wrong if the
  key is shared (for example, a family email address).
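
The last conflict can be illustrated with a guard that refuses to treat a widely shared value as a deterministic key. This is a hypothetical heuristic, not a practice taken from the cited sources: the `max_records` cutoff and the function names are assumptions for the example.

```python
from collections import Counter


def shared_values(records, field, max_records=2):
    """Values of `field` carried by more than `max_records` records."""
    counts = Counter(r[field] for r in records if r.get(field))
    return {value for value, n in counts.items() if n > max_records}


def email_key_match(a, b, shared):
    """Exact email equality, unless the address is known to be shared."""
    email = a.get("email")
    if not email or email != b.get("email"):
        return False
    return email not in shared


records = [
    {"id": "p1", "email": "family@example.com"},
    {"id": "p2", "email": "family@example.com"},
    {"id": "p3", "email": "family@example.com"},
    {"id": "p4", "email": "solo@example.com"},
    {"id": "p5", "email": "solo@example.com"},
]
shared = shared_values(records, "email")
print(shared)                                            # {'family@example.com'}
print(email_key_match(records[0], records[1], shared))   # False
print(email_key_match(records[3], records[4], shared))   # True
```

A pair rejected by the guard is not necessarily a non-match; it could still be scored probabilistically and recorded as a weak assertion.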

## Candidate Canonical Mappings

| Entity resolution concept | Candidate canonical concept |
| --- | --- |
| Deterministic match | Strong Synonymity Assertion |
| Probabilistic match | Weak Synonymity Assertion |
| Match score | Confidence / strength metadata |
| Link (non-destructive) | Synonymity Assertion |
| Merge | Downstream anti-pattern (avoid) |
| Blocking key | Evidence Source method |
| Review queue | Lifecycle State `proposed` |
| Source record ID | Identifier |
| Golden record | Downstream projection only |
| False positive handling | Revocation / supersession of assertion |

## Open Questions

- What confidence bands (weak/medium/strong) should canon standardize?
- Which deterministic keys are authoritative per source family (OIDC `iss`+`sub`,
  persistent SAML NameID, verified email)?
- Should probabilistic matchers be required to store feature-level Evidence Source?
- How should shared-attribute false positives (family email) be classified?

## References

- Fellegi-Sunter model (1969) — foundational probabilistic record linkage
- Christen, "Data Matching" (2012) — entity resolution textbook
- NIST SP 800-63A evidence requirements — https://pages.nist.gov/800-63-4/sp800-63A.html
- MDM Institute duplicate management practices — industry reference