tegwick 1c1b5c9bc6 Complete IDENTITY-WP-0003 corpus backfill and model refinement

Backfill all 23 research source notes with terminology extracts, modeling
assumptions, conflicts, canonical mappings, and references. Refresh terminology
artifacts, refine the conceptual model with explicit scenario paths, reconcile
canon surfaces and open questions, and mark the workplan finished.

2026-06-21 20:22:20 +02:00

Deterministic vs Probabilistic Matching

Source Type

Academic and industry practice. Entity resolution, record linkage, and duplicate detection literature plus operational MDM patterns.

Domain

Entity resolution, record linkage, duplicate detection, and account matching confidence.

Why This Source Matters

Entity resolution literature distinguishes deterministic key matching from probabilistic matching. That distinction is the foundation for modeling weak vs. strong synonymity.

Key Concepts

Deterministic matching: records match when agreed key fields are equal (exact email, government ID, OIDC iss+sub).
Probabilistic matching: match score from weighted field similarity (Fellegi-Sunter, Jaro-Winkler, ML classifiers).
Record linkage: identifying records across datasets that refer to the same entity.
Blocking: reducing the comparison space by bucketing records on partial keys, so only records in the same bucket are compared.
Match threshold: score above which records are linked or flagged.
False positive / false negative tradeoff: precision vs. recall in linking.
Golden record / survivor: MDM pattern that selects a canonical merged record.
Non-destructive link: associate records without merge (preferred in modern MDM).
Human review queue: ambiguous matches escalated for operator decision.
Master data management (MDM): operational discipline around entity resolution.
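
The first concepts above (deterministic matching, blocking, probabilistic scoring against a threshold) can be sketched in a few lines of Python. This is a minimal illustration, not a production matcher: the records, field names, and the 0.8 threshold are assumptions made up for the example, and the similarity function is the standard library's `difflib.SequenceMatcher` ratio, used here as a stand-in for Jaro-Winkler, which the standard library does not provide.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records; all field names and values are illustrative only.
RECORDS = [
    {"id": "a1", "iss": "https://idp.example", "sub": "42", "name": "Jon Smith", "postcode": "1011"},
    {"id": "b7", "iss": "https://idp.example", "sub": "42", "name": "Jonathan Smith", "postcode": "1011"},
    {"id": "c3", "iss": None, "sub": None, "name": "Jon Smyth", "postcode": "1011"},
    {"id": "d9", "iss": None, "sub": None, "name": "Maria Lopez", "postcode": "9000"},
]

THRESHOLD = 0.8  # assumed match threshold, not a recommended value


def deterministic_match(a, b):
    """True when both records carry the same non-empty (iss, sub) key."""
    key_a, key_b = (a["iss"], a["sub"]), (b["iss"], b["sub"])
    return None not in key_a and key_a == key_b


def candidate_pairs(records, field):
    """Blocking: only pair up records that share the same value for `field`."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[field]].append(record)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)


def name_similarity(a, b):
    """Similarity in [0, 1] on the name field (stand-in for Jaro-Winkler)."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def classify(a, b):
    if deterministic_match(a, b):
        return "deterministic"
    if name_similarity(a, b) >= THRESHOLD:
        return "probabilistic"
    return "no-match"


# Blocking on postcode yields 3 candidate pairs instead of all 6 possible pairs.
results = {
    (a["id"], b["id"]): classify(a, b)
    for a, b in candidate_pairs(RECORDS, "postcode")
}
```

Here `a1` and `b7` match deterministically on `iss`+`sub` despite different names, while `a1` and `c3` only clear the similarity threshold. The record `d9` is never compared with the others because it falls in a different block.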

Relevant Terminology

Term	Source meaning
Deterministic match	Equality on predefined key fields.
Probabilistic match	Scored similarity above threshold.
Record linkage	Cross-dataset entity correspondence.
Blocking key	Partial key for candidate pair generation.
Match score	Confidence metric for probable same entity.
Duplicate	Records hypothesized to refer to same entity.
Merge	Combine records into one (destructive).
Link	Associate records preserving sources (non-destructive).
Survivor record	Chosen primary after merge.
Quarantine	Hold ambiguous matches for review.
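
As a rough illustration of how a match score relates to thresholds and quarantine, the sketch below computes a Fellegi-Sunter style weight: each field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m is the probability of agreement among true matches and u the probability of agreement among non-matches. The m/u values, field names, and both thresholds are assumptions chosen for the example and are not estimated from any data. Treating a missing value as a disagreement is also a simplifying assumption.

```python
import math

# Assumed per-field probabilities (illustrative only):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELD_PARAMS = {
    "email": {"m": 0.95, "u": 0.001},
    "surname": {"m": 0.90, "u": 0.05},
    "postcode": {"m": 0.85, "u": 0.10},
}

UPPER = 8.0  # assumed: at or above this, link
LOWER = 0.0  # assumed: below this, do not link


def field_weight(m, u, agrees):
    """Log-likelihood ratio contributed by one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_score(a, b):
    """Sum of field weights; missing values count as disagreement."""
    total = 0.0
    for name, p in FIELD_PARAMS.items():
        agrees = a.get(name) is not None and a.get(name) == b.get(name)
        total += field_weight(p["m"], p["u"], agrees)
    return total


def decide(score):
    """Three-way decision: link, hold for review (quarantine), or non-link."""
    if score >= UPPER:
        return "link"
    if score >= LOWER:
        return "review"
    return "non-link"


x = {"email": "pat@example.org", "surname": "lee", "postcode": "1011"}
same = dict(x)
new_email = {**x, "email": "p.lee@example.net"}
stranger = {"email": "kim@example.com", "surname": "park", "postcode": "9000"}

decisions = {
    "same": decide(match_score(x, same)),
    "new_email": decide(match_score(x, new_email)),
    "stranger": decide(match_score(x, stranger)),
}
```

With these assumed parameters, full agreement links, agreement on surname and postcode with a different email lands in the review band, and full disagreement is a non-link.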

Modeling Assumptions

In probabilistic approaches, "same entity" remains a hypothesis until verified.
Deterministic rules are domain-specific (no universal golden key).
Merge destroys provenance unless carefully audited, so it is increasingly avoided.
Confidence is continuous or banded (weak/medium/strong).
Source system identity must be preserved for compliance and undo.
Human review is part of high-assurance linking.
Privacy regulations constrain which fields can be matched.

Identity-Canon Implications

Deterministic match maps to strong Synonymity Assertion when keys are authoritative (S13).
Probabilistic match maps to weak Synonymity Assertion with confidence score and method (S12).
Link without merge is the canonical preferred pattern (P7).
Match score maps to confidence/strength on Synonymity Assertion.
Blocking/method maps to Evidence Source metadata.
Quarantine maps to Lifecycle State `proposed` on the assertion.
Golden record is a downstream MDM pattern; canon should not require merge.
Human review maps to Evidence Source (operator decision).
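
The mappings above could be represented roughly as follows. This is a sketch under stated assumptions, not the canon's schema: every class name, field name, and method string is hypothetical, and the `active` lifecycle value is an assumed placeholder, since this note only names `proposed`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceSource:
    """How the match was produced (hypothetical shape)."""
    method: str                        # e.g. "oidc-iss-sub", "fellegi-sunter", "operator-decision"
    blocking_key: Optional[str] = None


@dataclass(frozen=True)
class SynonymityAssertion:
    """A non-destructive claim that two identifiers refer to the same Actor."""
    left_id: str
    right_id: str
    strength: str                      # "strong" or "weak"
    confidence: Optional[float]        # match score, when one exists
    evidence: EvidenceSource
    lifecycle_state: str               # "proposed" or "active" ("active" is an assumed name)


def assertion_from_match(left_id, right_id, kind, *, method, score=None,
                         key_is_authoritative=False, needs_review=False,
                         blocking_key=None):
    """Map a match outcome onto an assertion.

    Only a deterministic match on an authoritative key becomes strong;
    everything else, including equality on a shareable key, stays weak.
    """
    if kind == "deterministic" and key_is_authoritative:
        strength, confidence = "strong", None
    else:
        strength, confidence = "weak", score
    state = "proposed" if needs_review else "active"
    return SynonymityAssertion(
        left_id, right_id, strength, confidence,
        EvidenceSource(method, blocking_key), state,
    )


oidc = assertion_from_match("crm:17", "idp:42", "deterministic",
                            method="oidc-iss-sub", key_is_authoritative=True)
shared_email = assertion_from_match("crm:17", "crm:18", "deterministic",
                                    method="exact-email", needs_review=True)
scored = assertion_from_match("crm:17", "billing:903", "probabilistic",
                              method="fellegi-sunter", score=2.9,
                              needs_review=True, blocking_key="postcode")
```

The `shared_email` case reflects the caveat that deterministic equality on a key that can be shared should not automatically produce a strong assertion.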

Terminology Conflicts

Duplicate vs. Synonymity: duplicates imply merge; synonymity allows coexistence.
Match vs. Link: industry uses the terms interchangeably; canon distinguishes strength.
Entity vs. Actor: resolution literature says entity; canon prefers Actor target.
Identity vs. Record: matching is between records, not persons directly.
Deterministic vs. Strong: a deterministic match can still be wrong if the key is shared (for example, a shared email address).

Candidate Canonical Mappings

Entity resolution concept	Candidate canonical concept
Deterministic match	Strong Synonymity Assertion
Probabilistic match	Weak Synonymity Assertion
Match score	Confidence / strength metadata
Link (non-destructive)	Synonymity Assertion
Merge	Downstream anti-pattern (avoid)
Blocking key	Evidence Source method
Review queue	Lifecycle State `proposed`
Source record ID	Identifier
Golden record	Downstream projection only
False positive handling	Revocation / supersession of assertion
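
To make the last few rows concrete, the sketch below keeps source records untouched, stores links separately, computes a golden record only as a projection over the currently linked records, and handles a false positive by revoking the link rather than un-merging data. All names (`LinkStore`, `golden_record`, the record IDs and fields) are hypothetical, and the "first non-empty value in preference order" survivorship rule is an assumption chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class Link:
    left: str
    right: str
    revoked: bool = False


class LinkStore:
    def __init__(self, records):
        # Source records are copied once and never modified afterwards.
        self.records = {rid: dict(rec) for rid, rec in records.items()}
        self.links = []

    def link(self, left, right):
        new_link = Link(left, right)
        self.links.append(new_link)
        return new_link

    def revoke(self, link):
        # False positive handling: mark the link revoked, keep it for audit.
        link.revoked = True

    def cluster(self, record_id):
        """All record IDs reachable from record_id over non-revoked links."""
        seen, frontier = {record_id}, [record_id]
        while frontier:
            current = frontier.pop()
            for l in self.links:
                if l.revoked or current not in (l.left, l.right):
                    continue
                other = l.right if l.left == current else l.left
                if other not in seen:
                    seen.add(other)
                    frontier.append(other)
        return seen

    def golden_record(self, record_id, preference):
        """Projection only: first non-empty value per field, in preference order."""
        members = self.cluster(record_id)
        merged = {}
        for rid in preference:
            if rid not in members:
                continue
            for key, value in self.records[rid].items():
                if value and key not in merged:
                    merged[key] = value
        return merged


store = LinkStore({
    "crm:17": {"email": "pat@example.org", "phone": ""},
    "billing:903": {"email": "pat@example.org", "phone": "555-0100"},
})
order = ["crm:17", "billing:903"]

lnk = store.link("crm:17", "billing:903")
before = store.golden_record("crm:17", order)

store.revoke(lnk)
after = store.golden_record("crm:17", order)
```

Because nothing was merged, revoking the link is enough to undo the association: `before` carries the phone number from the billing record, while `after` contains only what the CRM record itself provides.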

Open Questions

What confidence bands (weak/medium/strong) should canon standardize?
Which deterministic keys are authoritative per source family (OIDC iss+sub, persistent SAML NameID, verified email)?
Should probabilistic matchers be required to store feature-level Evidence Source?
How should shared-attribute false positives (family email) be classified?

References

Fellegi-Sunter model (1969) — foundational probabilistic record linkage
Christen, "Data Matching" (2012) — entity resolution textbook
NIST SP 800-63A evidence requirements — https://pages.nist.gov/800-63-4/sp800-63A.html
MDM Institute duplicate management practices — industry reference
