# InfoTechCanon Data Model
**Short Name:** `ITC-DATA`
**Document Status:** Seed Standard Release Candidate 1
**Version:** RC1-seed
**Date:** 2026-05-22
**Repository Context:** `info-tech-canon`
**Document Type:** InfoTechCanon Domain Standard
**Intended Audience:** Data architects, data engineers, data stewards, platform engineers, governance designers, security architects, application architects, product owners, knowledge-system builders, compliance reviewers, AI/analytics teams, and agentic tooling.
---
# 1. Purpose
The **InfoTechCanon Data Model** defines a canonical seed model for representing data as a managed, governed, discoverable, classifiable, lineage-bearing, quality-assessable, and reusable information asset.
It exists to give data its own canonical domain instead of leaving data semantics scattered across landscape, security, governance, DevSecOps, observability, and application models.
This standard provides a canonical vocabulary for:
- data domains,
- datasets,
- data products,
- data objects,
- records,
- fields,
- schemas,
- data elements,
- code lists,
- data stores as references,
- data flows,
- data lineage,
- data quality,
- metadata,
- catalogs,
- distributions,
- data services,
- data classification,
- sensitivity,
- residency,
- retention,
- processing purpose,
- data ownership and stewardship references,
- data contracts,
- and data evidence.
---
# 2. Position in InfoTechCanon
The Data Model is a **domain standard** within InfoTechCanon.
It depends on the existing seed standards as follows:
```text
Landscape = where data is stored, processed, moved, and exposed.
Organization = data owners, stewards, custodians, producers, consumers.
Governance = data policies, obligations, controls, evidence, exceptions.
Security = data exposure, data-security findings, data attack paths.
Access Control = permissions and grants to data resources.
Task = data-quality work, migration work, remediation, reviews.
Tagging = lightweight classification and retrieval.
Data = datasets, schemas, metadata, lineage, quality, classification, retention.
```
```text
InfoTechCanon
├── InfoTechCanonCore
├── InfoTechCanonLandscapeModel
├── InfoTechCanonOrganizationModel
├── InfoTechCanonGovernanceModel
├── InfoTechCanonTaskModel
├── InfoTechCanonTaggingStandard
├── InfoTechCanonAccessControlModel
├── InfoTechCanonSecurityModel
├── InfoTechCanonDataModel <-- this standard
├── InfoTechCanonDevSecOpsModel
├── InfoTechCanonNetworkModel
├── InfoTechCanonObservabilityModel
├── InfoTechCanonPatternLanguage
└── Application Profiles
```
---
# 3. Boundary with Adjacent Standards
## 3.1 Boundary with Landscape
The Landscape Model owns:
```text
DataStore
DatabaseInstance
ObjectBucket
FileShare
Queue
Cache
RuntimeResource
ApplicationService
IntegrationFlow
Endpoint
```
The Data Model owns:
```text
Dataset
DataProduct
DataObject
Schema
Field
DataElement
DataFlow
DataLineage
DataClassification
DataQualityRule
DataContract
DataDistribution
```
Boundary rule:
```text
Landscape owns the technical and runtime places where data lives or moves.
Data owns the semantic, structural, quality, classification, and lineage meaning of data.
```
## 3.2 Boundary with Governance
The Governance Model owns:
```text
Policy
Requirement
Obligation
Control
Risk
Exception
Evidence
Review
Approval
ComplianceRequirement
```
The Data Model owns data-specific structures that are governed:
```text
RetentionRuleReference
ProcessingPurpose
DataClassification
DataQualityRule
DataContract
DataLineage
```
Boundary rule:
```text
Governance defines why data must be governed.
Data defines what data is and how it is described, classified, measured, and traced.
```
## 3.3 Boundary with Security
The Security Model owns:
```text
DataSecurityFinding
ExposureFinding
CredentialExposure
SecurityIncident
AttackPath
Mitigation
```
The Data Model owns:
```text
Sensitivity
DataClassification
DataResidency
DataSubjectCategory
DataCategory
DataLineage
```
The Security Model may use these concepts for posture analysis.
## 3.4 Boundary with Access Control
Access Control owns permissions, grants, authorization decisions, and enforcement.
Data owns data resources and classifications that access policies may use.
Example:
```text
Dataset classified_as Confidential
AccessPolicy permits Role to read Dataset
AuthorizationDecision permits read on Dataset
```
## 3.5 Boundary with Organization
Organization owns actors and responsibilities.
Data references Organization concepts for:
```text
DataOwner
DataSteward
DataCustodian
DataProducer
DataConsumer
DataTrustee
```
## 3.6 Boundary with DevSecOps
DevSecOps owns source, build, artifact, pipeline, release, deployment, SBOM, and attestation semantics.
Data owns data contracts, schema evolution, migration data, test data, synthetic data, lineage, and data-quality semantics.
---
# 4. Research Basis and External Alignment
This seed standard draws on multiple data-management bodies of knowledge.
## 4.1 DAMA-DMBOK
DAMA-DMBOK is a broad reference for data management disciplines including data governance, architecture, modeling, storage, security, integration, documents/content, reference/master data, warehousing/BI, metadata, and data quality. InfoTechCanon uses it as a broad mapping and assimilation target, not as a direct controlling model.
## 4.2 DCAT
W3C DCAT defines a vocabulary for data catalogs. DCAT Version 3 organizes catalog access around datasets, distributions, data services, and dataset series. This is highly relevant for InfoTechCanon catalog, dataset, distribution, and data-service concepts.
## 4.3 PROV-O
W3C PROV-O models provenance using entities, activities, and agents. This is highly relevant for data lineage, derivation, generation, transformation, and responsibility.
## 4.4 ISO/IEC 11179
ISO/IEC 11179 provides a metadata registry framework for data elements, naming, identification, definitions, classification, and registration. It is an important mapping target for data element, representation, data definition, code list, and metadata registry concepts.
## 4.5 Data Mesh and Data Products
Data product thinking emphasizes ownership, discoverability, quality, fitness for use, service-like interfaces, and domain responsibility. InfoTechCanon should support data products without requiring a specific data-mesh organizational model.
## 4.6 Data Contracts
Data contracts define expectations between producers and consumers around schema, semantics, quality, delivery, compatibility, ownership, and change management. They are critical for reliable information-processing systems.
## 4.7 Privacy and Data Protection Practice
Privacy and data-protection practice contributes concepts such as personal data, sensitive data, data subject, processing purpose, lawful basis, retention, residency, and minimization. The Data Model provides data semantics, while Governance owns legal obligations and Security owns data exposure and incident semantics.
---
# 5. Seed Standard Design Stance
This standard is a **seed standard**, not a full data-governance or database-design manual.
It shall:
1. define canonical data semantics,
2. distinguish data from storage infrastructure,
3. distinguish dataset, data product, data object, schema, field, and data element,
4. support data classification, lineage, quality, retention, residency, and processing purpose,
5. support catalog and discovery concepts,
6. support data contracts and schema evolution,
7. support operational, analytical, reference, master, event, and document data,
8. support mappings to external standards without becoming subordinate to them,
9. remain markdown-first and agent-retrievable,
10. and support future assimilation of data standards, platforms, regulations, and product schemas.
---
# 6. Scope
## 6.1 In Scope
This standard covers canonical representation of:
- data domains,
- data products,
- datasets,
- dataset series,
- data distributions,
- data services,
- data objects,
- entities,
- records,
- fields,
- attributes,
- data elements,
- schemas,
- schema versions,
- code lists,
- reference data,
- master data references,
- metadata,
- catalogs,
- data lineage,
- data flows,
- data transformations,
- data quality rules,
- data quality results,
- data contracts,
- data classification,
- sensitivity,
- confidentiality level,
- integrity expectation,
- availability expectation,
- retention rules as data semantics,
- data residency,
- data minimization,
- processing purpose,
- data subject categories,
- data provenance,
- data ownership and stewardship references,
- and data lifecycle states.
## 6.2 Out of Scope
This standard does not fully define:
- database engine internals,
- storage infrastructure,
- full data warehouse architecture,
- full analytics modeling,
- full privacy-law interpretation,
- full data-governance process,
- full security incident handling,
- all ontology modeling,
- all semantic-web representation,
- complete ETL/ELT implementation,
- or every vendor-specific data catalog schema.
Those may be mapped, assimilated, profiled, or handled by adjacent standards.
---
# 7. Normative Language
The following terms are used normatively:
- **SHALL** indicates a mandatory rule for conformance.
- **SHOULD** indicates a recommended practice.
- **MAY** indicates an optional capability.
- **MUST NOT** indicates a prohibited practice.
- **SEED** marks a concept defined provisionally here but open to later refinement.
- **EXTRACT** marks a concept that may later move to a more specialized standard.
---
# 8. Core Principles
## 8.1 Data Is Not Its Store
A dataset is not the same thing as a database, bucket, table, file, topic, or API.
Storage and runtime locations are Landscape concepts. Data semantics belong here.
## 8.2 Dataset Is Not Schema
A dataset may have one or more schemas, distributions, versions, contracts, lineage records, and quality expectations.
## 8.3 Schema Is Not Meaning
A schema describes structure. It does not fully define business meaning, ownership, usage constraints, quality, or purpose.
## 8.4 Classification Is First-Class
Data classification and sensitivity SHOULD be explicit where data has security, privacy, compliance, operational, or business significance.
## 8.5 Lineage Is Evidence-Carrying
Lineage SHOULD identify source data, transformations, activities, agents, and derived outputs with confidence and evidence where possible.
## 8.6 Data Quality Is Contextual
Data quality depends on intended use, domain meaning, contract expectations, and consumer needs.
## 8.7 Data Contracts Make Data Reliable
Producer-consumer expectations SHOULD be explicit when data is reused across system boundaries.
## 8.8 External Standards Are Mapped, Not Obeyed
The Data Model MAY map to DAMA-DMBOK, DCAT, PROV-O, ISO/IEC 11179, schema.org, OpenLineage, DataHub, OpenMetadata, dbt, Great Expectations, or similar standards and tools.
It MUST NOT subordinate its internal semantics to any single external model.
---
# 9. Canonical Seed Metadata
Every data artifact SHOULD support structured metadata.
Recommended front matter:
```yaml
---
id: itc-data:Dataset
type: concept
standard: InfoTechCanonDataModel
standard_version: RC1-seed
status: candidate
canonical_owner: InfoTechCanonDataModel
preferred_label: Dataset
related:
- itc-data:DataProduct
- itc-data:Schema
- itc-data:DataDistribution
- itc-data:DataLineage
mappings:
- itc-map:dataset-to-dcat-dataset
---
```
Recommended artifact statuses:
```text
idea
draft
candidate
release-candidate
adopted
stable
deprecated
retired
```
Recommended concept statuses:
```text
proposed
experimental
candidate
canonical
deprecated
retired
```
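The two status lists above can be checked mechanically. The following is a minimal sketch, assuming the front matter has already been parsed into a dictionary; the function name `status_is_recommended` and the rule that `type: concept` selects the concept statuses are illustrative assumptions, not part of this standard.

```python
ARTIFACT_STATUSES = {
    "idea", "draft", "candidate", "release-candidate",
    "adopted", "stable", "deprecated", "retired",
}
CONCEPT_STATUSES = {
    "proposed", "experimental", "candidate",
    "canonical", "deprecated", "retired",
}


def status_is_recommended(front_matter: dict) -> bool:
    """Return True if `status` is in the list matching the entry's `type`.

    Assumption: `type: concept` selects the concept statuses and any other
    type falls back to the artifact statuses.
    """
    is_concept = front_matter.get("type") == "concept"
    allowed = CONCEPT_STATUSES if is_concept else ARTIFACT_STATUSES
    return front_matter.get("status") in allowed


# Mirrors the front-matter example for itc-data:Dataset.
dataset_entry = {"id": "itc-data:Dataset", "type": "concept", "status": "candidate"}
```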
---
# 10. Root Data Taxonomy
```text
DataEntity
├── DataAssetEntity
│ ├── DataDomain
│ ├── DataProduct
│ ├── Dataset
│ ├── DatasetSeries
│ ├── DataDistribution
│ ├── DataService
│ ├── DataObject
│ ├── Record
│ └── DocumentData
├── StructureEntity
│ ├── Schema
│ ├── SchemaVersion
│ ├── Field
│ ├── Attribute
│ ├── DataElement
│ ├── DataElementConcept
│ ├── Representation
│ ├── DataType
│ ├── Constraint
│ └── CodeList
├── SemanticEntity
│ ├── BusinessTerm
│ ├── GlossaryTerm
│ ├── ConceptualEntity
│ ├── DataDefinition
│ ├── ReferenceData
│ ├── MasterDataReference
│ └── CanonicalValue
├── GovernanceReferenceEntity
│ ├── DataClassification
│ ├── Sensitivity
│ ├── DataCategory
│ ├── DataSubjectCategory
│ ├── ProcessingPurpose
│ ├── RetentionRuleReference
│ ├── DataResidency
│ └── DataUsageConstraint
├── QualityEntity
│ ├── DataQualityDimension
│ ├── DataQualityRule
│ ├── DataQualityCheck
│ ├── DataQualityResult
│ ├── DataQualityIssue
│ └── FitnessForUse
├── LineageEntity
│ ├── DataFlow
│ ├── DataLineage
│ ├── Transformation
│ ├── Derivation
│ ├── SourceDataset
│ ├── TargetDataset
│ └── ProvenanceRecord
├── ContractEntity
│ ├── DataContract
│ ├── ProducerExpectation
│ ├── ConsumerExpectation
│ ├── CompatibilityRule
│ ├── BreakingChange
│ └── SchemaEvolutionPolicy
└── OperationalDataEntity
├── DataPipelineReference
├── DataStoreReference
├── QueryReference
├── DataAccessPattern
├── DataFreshness
└── DataAvailability
```
---
# 11. Core Concepts
## 11.1 DataEntity
A **DataEntity** is any identifiable concept used to represent data, metadata, structure, classification, quality, lineage, contract, or data lifecycle.
Recommended attributes:
```yaml
id:
entity_type:
canonical_name:
display_name:
lifecycle_state:
source_system:
created_at:
updated_at:
```
Optional attributes:
```yaml
owner:
steward:
data_domain:
classification:
source_confidence:
valid_from:
valid_to:
tags:
external_references:
```
---
## 11.2 DataDomain
A **DataDomain** is a bounded area of data meaning, ownership, stewardship, or subject matter.
Examples:
```text
customer
billing
product
identity
orders
support
security
operations
finance
```
---
## 11.3 DataProduct
A **DataProduct** is a managed data asset or set of data assets offered for use by consumers with explicit ownership, quality expectations, documentation, interfaces, and lifecycle.
Recommended attributes:
```yaml
owner:
steward:
producer:
consumers:
service_level_expectations:
quality_expectations:
contract:
distribution_methods:
```
---
## 11.4 Dataset
A **Dataset** is a coherent collection of data published, managed, processed, analyzed, or consumed as a unit.
A dataset may have:
```text
schema
distribution
catalog entry
classification
lineage
quality rules
owner
steward
contract
retention expectation
```
Canonical rule:
```text
Dataset MUST NOT be treated as identical to its storage location.
```
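One way to honor this rule in code is to keep the dataset and its storage locations as separate objects. The sketch below is illustrative only: the class shapes and the `stored_in` list are assumptions that echo the vocabulary of this standard, not a normative schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataStoreReference:
    """Pointer to a Landscape-owned storage resource (hypothetical shape)."""
    id: str
    kind: str  # e.g. "table", "bucket", "topic"


@dataclass
class Dataset:
    """Semantic data asset; its identity does not depend on where it is stored."""
    id: str
    canonical_name: str
    classification: str
    stored_in: list[DataStoreReference] = field(default_factory=list)


orders = Dataset("ds-orders", "orders", "confidential")
orders.stored_in.append(DataStoreReference("pg-orders-table", "table"))
# Replicating the data adds a store reference; the dataset identity is unchanged.
orders.stored_in.append(DataStoreReference("s3-orders-export", "bucket"))
```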
---
## 11.5 DatasetSeries
A **DatasetSeries** is a sequence or family of related datasets organized over time, version, geography, domain, or release.
---
## 11.6 DataDistribution
A **DataDistribution** is an accessible representation of a dataset.
Examples:
```text
CSV file
Parquet file
API response
database table export
event stream
report download
object storage path
```
---
## 11.7 DataService
A **DataService** is a service that provides access to data or operations over data.
Examples:
```text
query API
data product API
metadata API
streaming endpoint
analytics service
```
---
## 11.8 DataObject
A **DataObject** is a meaningful object or structure represented in data.
Examples:
```text
Customer
Invoice
Order
Payment
Product
Device
UserProfile
AccessGrant
SecurityFinding
```
---
## 11.9 Record
A **Record** is an instance-level representation of data about an entity, event, relationship, or observation.
---
## 11.10 Field
A **Field** is a named component of a schema, record, message, or table.
---
## 11.11 Attribute
An **Attribute** is a property of a data object or conceptual entity.
A Field may represent an Attribute, but a field is structural while an attribute is semantic.
---
## 11.12 DataElement
A **DataElement** is a defined unit of data with meaning, representation, and expected usage.
It may map to ISO/IEC 11179 data element concepts.
Recommended attributes:
```yaml
object_class:
property:
representation:
data_type:
definition:
permitted_values:
```
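The recommended attributes could be carried by a small record type. This is a sketch under stated assumptions: the class layout, the `concept()` helper, and the sample definition text are hypothetical and only illustrate how meaning (object class plus property) can be kept apart from representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElement:
    object_class: str    # the thing described, e.g. "Customer"
    property: str        # the characteristic, e.g. "birth date"
    representation: str  # how the value is represented, e.g. "date"
    data_type: str
    definition: str
    permitted_values: Optional[tuple[str, ...]] = None  # None means unrestricted

    def concept(self) -> str:
        """Representation-independent meaning (compare DataElementConcept)."""
        return f"{self.object_class} {self.property}"


birth_date = DataElement(
    object_class="Customer",
    property="birth date",
    representation="date",
    data_type="date",
    definition="Calendar date on which the customer was born.",
)
```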
---
## 11.13 DataElementConcept
A **DataElementConcept** is the semantic idea of a data element independent of representation.
Example:
```text
Customer birth date
Invoice total amount
Repository default branch name
```
---
## 11.14 Representation
A **Representation** describes how a data element is represented.
Examples:
```text
string
integer
decimal
boolean
date
timestamp
code
identifier
URI
```
---
## 11.15 DataType
A **DataType** specifies the technical or logical type of a field or data element.
---
## 11.16 Constraint
A **Constraint** is a rule limiting valid data.
Examples:
```text
required
unique
minimum
maximum
regex
foreign key
enum
format
cardinality
```
---
## 11.17 CodeList
A **CodeList** is a controlled set of allowed values with definitions.
Examples:
```text
country codes
currency codes
status codes
classification labels
risk levels
```
---
## 11.18 BusinessTerm
A **BusinessTerm** is a term used by domain actors to describe data meaning.
---
## 11.19 GlossaryTerm
A **GlossaryTerm** is a documented term in a glossary with definition, synonyms, ownership, and mappings.
---
## 11.20 DataDefinition
A **DataDefinition** is a textual or structured definition explaining the meaning, scope, and intended use of a data concept.
---
## 11.21 ReferenceData
**ReferenceData** is data used to classify, categorize, or constrain other data.
Examples:
```text
country list
currency list
product category list
status code list
business unit list
```
---
## 11.22 MasterDataReference
A **MasterDataReference** points to a controlled source of core business entities.
Examples:
```text
customer master
product master
supplier master
employee master
```
The Data Model references master-data semantics but does not require a specific MDM architecture.
---
## 11.23 DataClassification
A **DataClassification** is a classification assigned to data based on sensitivity, confidentiality, regulatory concern, operational criticality, or business significance.
Examples:
```text
public
internal
confidential
restricted
regulated
personal
sensitive personal
secret
```
---
## 11.24 Sensitivity
**Sensitivity** indicates potential harm, obligation, or restriction associated with data disclosure, modification, loss, misuse, or processing.
---
## 11.25 DataCategory
A **DataCategory** groups data by semantic, legal, operational, or analytical type.
Examples:
```text
personal data
financial data
health data
authentication data
transaction data
telemetry data
metadata
content data
```
---
## 11.26 DataSubjectCategory
A **DataSubjectCategory** identifies the kind of person or entity data is about.
Examples:
```text
customer
employee
applicant
supplier contact
child
patient
user
administrator
```
---
## 11.27 ProcessingPurpose
A **ProcessingPurpose** describes why data is collected, stored, transformed, shared, or used.
Examples:
```text
billing
support
security monitoring
analytics
product improvement
legal compliance
identity verification
```
---
## 11.28 RetentionRuleReference
A **RetentionRuleReference** links data to governance-defined retention obligations, policies, or rules.
The Data Model may record a retention expectation, but Governance owns the policy and the obligation.
---
## 11.29 DataResidency
**DataResidency** describes where data is stored, processed, transferred, or legally required to remain.
Examples:
```text
EU
Germany
customer region
cloud region
on-premises only
```
---
## 11.30 DataUsageConstraint
A **DataUsageConstraint** describes a restriction on how data may be used.
Examples:
```text
not for training
not for export
internal analytics only
production use prohibited
no cross-border transfer
only aggregated use
```
---
## 11.31 DataQualityDimension
A **DataQualityDimension** is an aspect of data quality.
Common dimensions:
```text
accuracy
completeness
consistency
timeliness
validity
uniqueness
freshness
integrity
fitness_for_use
```
---
## 11.32 DataQualityRule
A **DataQualityRule** is a testable expectation about data quality.
Examples:
```text
customer_id must not be null
invoice_total must be >= 0
country_code must be in ISO country code list
event_timestamp must be within expected delay window
```
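Because a rule is a testable expectation, it can be expressed as a predicate over a record. The sketch below encodes three of the example rules; the `RULES` table, the `failed_rules` helper, and the tiny `COUNTRY_CODES` set are assumptions made for illustration, and the timestamp rule is omitted because it needs a delay window that this standard does not specify.

```python
from typing import Callable

Record = dict  # a single row represented as a plain mapping

# Hypothetical subset of country codes, for illustration only.
COUNTRY_CODES = {"DE", "FR", "US"}


def _non_negative_total(record: Record) -> bool:
    total = record.get("invoice_total")
    return isinstance(total, (int, float)) and total >= 0


RULES: dict[str, Callable[[Record], bool]] = {
    "customer_id must not be null": lambda r: r.get("customer_id") is not None,
    "invoice_total must be >= 0": _non_negative_total,
    "country_code must be in ISO country code list": lambda r: r.get("country_code") in COUNTRY_CODES,
}


def failed_rules(record: Record) -> list[str]:
    """Return the names of all rules the record violates, in declaration order."""
    return [name for name, check in RULES.items() if not check(record)]


good = {"customer_id": "c-1", "invoice_total": 10.0, "country_code": "DE"}
bad = {"customer_id": None, "invoice_total": -5, "country_code": "XX"}
```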
---
## 11.33 DataQualityCheck
A **DataQualityCheck** is an execution of one or more data quality rules.
---
## 11.34 DataQualityResult
A **DataQualityResult** is the outcome of a data quality check.
---
## 11.35 DataQualityIssue
A **DataQualityIssue** is a finding indicating data does not meet a quality rule or fitness expectation.
It may create Task Model remediation work.
---
## 11.36 FitnessForUse
**FitnessForUse** is the degree to which data is suitable for a specific purpose or consumer context.
---
## 11.37 DataFlow
A **DataFlow** is movement or transfer of data between sources, systems, stores, services, actors, or processes.
---
## 11.38 DataLineage
**DataLineage** describes the origin, movement, transformation, derivation, and usage path of data.
Lineage may include:
```text
source dataset
transformation
activity
agent
target dataset
time
evidence
confidence
```
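A lineage path can be held as a list of edges and walked to find everything a dataset depends on. This is a minimal sketch, assuming a hypothetical `LineageEdge` record and dataset names invented for the example; time, agent, and evidence are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    source_dataset: str
    transformation: str
    target_dataset: str
    confidence: str = "declared"  # one of the lineage confidence states


def upstream(target: str, edges: list[LineageEdge]) -> set[str]:
    """Return every dataset that `target` is directly or transitively derived from."""
    found: set[str] = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for edge in edges:
            if edge.target_dataset == current and edge.source_dataset not in found:
                found.add(edge.source_dataset)
                frontier.append(edge.source_dataset)
    return found


edges = [
    LineageEdge("raw_orders", "clean", "orders"),
    LineageEdge("orders", "aggregate", "daily_revenue", confidence="observed"),
    LineageEdge("fx_rates", "aggregate", "daily_revenue"),
]
```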
---
## 11.39 Transformation
A **Transformation** is an activity that changes data structure, content, format, aggregation, classification, or meaning.
---
## 11.40 Derivation
A **Derivation** is a relationship where one data entity is derived from another.
---
## 11.41 ProvenanceRecord
A **ProvenanceRecord** records information about how data came to exist, who or what generated it, what activity produced it, and what source influenced it.
---
## 11.42 DataContract
A **DataContract** is an explicit agreement between data producers and consumers about data structure, semantics, quality, delivery, compatibility, ownership, and change expectations.
---
## 11.43 ProducerExpectation
A **ProducerExpectation** describes what a data producer commits to provide.
Examples:
```text
schema stability
freshness
completeness
availability
documentation
change notice
```
---
## 11.44 ConsumerExpectation
A **ConsumerExpectation** describes what a data consumer expects or is allowed to assume.
---
## 11.45 CompatibilityRule
A **CompatibilityRule** describes what changes are considered compatible or breaking.
---
## 11.46 BreakingChange
A **BreakingChange** is a data, schema, semantic, quality, or delivery change that violates consumer expectations or compatibility rules.
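
A compatibility rule can be made executable by comparing two schema versions. The sketch below is only one possible rule set and is an assumption, not something this standard defines: removing a field or changing its type counts as breaking, while adding a field counts as compatible. The function name and the sample schemas are hypothetical.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Compare two schema versions given as {field_name: data_type}.

    Assumed rule set (not defined by this standard): removing a field or
    changing its type is breaking; adding a field is compatible.
    """
    problems = []
    for name, old_type in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name] != old_type:
            problems.append(f"type changed: {name} {old_type} -> {new[name]}")
    return problems


v1 = {"order_id": "string", "total": "decimal"}
v2 = {"order_id": "string", "total": "integer", "currency": "string"}
```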
---
## 11.47 SchemaEvolutionPolicy
A **SchemaEvolutionPolicy** defines rules for how schemas may change over time.
---
## 11.48 DataStoreReference
A **DataStoreReference** points to a Landscape data store or storage resource.
Examples:
```text
database
table
bucket
file share
topic
queue
index
warehouse
lakehouse table
```
---
## 11.49 DataAccessPattern
A **DataAccessPattern** describes how data is accessed.
Examples:
```text
batch export
API query
event stream
direct database query
file download
replication
analytics dashboard
```
---
## 11.50 DataFreshness
**DataFreshness** describes how current data is relative to a defined expectation.
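
Freshness can be evaluated by comparing the age of the data with the expected maximum age. The sketch below assumes a hypothetical helper `is_fresh` and timezone-aware timestamps; the sample times are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_updated: datetime, expected_max_age: timedelta, now: datetime) -> bool:
    """Return True if the data's age at `now` is within the expected maximum age."""
    return now - last_updated <= expected_max_age


now = datetime(2026, 5, 22, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2026, 5, 22, 9, 30, tzinfo=timezone.utc)  # 2.5 hours before `now`
```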
---
## 11.51 DataAvailability
**DataAvailability** describes whether data is accessible according to expectations.
---
# 12. Core Relationship Vocabulary
Recommended root relationship types:
```text
contains
part_of
describes
classified_as
has_schema
has_field
has_distribution
provided_by
consumed_by
stored_in
accessed_via
flows_to
derived_from
generated_by
transformed_by
governed_by
constrained_by
subject_to
owned_by
stewarded_by
produced_by
validated_by
violates
satisfies
maps_to
```
Relationship records SHOULD support:
```yaml
id:
relationship_type:
source_entity:
target_entity:
scope:
valid_from:
valid_to:
source_system:
confidence:
evidence:
rationale:
```
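A relationship record can reject types outside the agreed vocabulary at construction time. This is a sketch, assuming a hypothetical `Relationship` class that carries only some of the fields above and a subset of the relationship vocabulary; a real implementation would load the full list.

```python
from dataclasses import dataclass
from typing import Optional

# Subset of the recommended root relationship types, for illustration.
RELATIONSHIP_TYPES = {
    "has_schema", "has_field", "classified_as", "stored_in",
    "derived_from", "owned_by", "stewarded_by", "maps_to",
}


@dataclass(frozen=True)
class Relationship:
    id: str
    relationship_type: str
    source_entity: str
    target_entity: str
    confidence: Optional[str] = None
    evidence: Optional[str] = None

    def __post_init__(self) -> None:
        # Fail early on relationship types outside the known vocabulary.
        if self.relationship_type not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {self.relationship_type}")


rel = Relationship("rel-1", "has_schema", "ds-orders", "schema-orders-v1")
```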
---
# 13. Data State Models
## 13.1 Dataset Lifecycle States
```text
proposed
designed
active
deprecated
retired
archived
deleted
```
## 13.2 Schema States
```text
draft
candidate
active
deprecated
superseded
retired
```
## 13.3 Data Quality States
```text
unknown
unchecked
passing
warning
failing
waived
remediating
verified
```
## 13.4 Data Contract States
```text
draft
under_review
active
violated
deprecated
superseded
retired
```
## 13.5 Lineage Confidence States
```text
unknown
declared
inferred
observed
verified
conflicting
```
---
# 14. Data Patterns
## 14.1 Pattern: Data Is Not Its Store
**Context:** Teams model data by pointing at tables, buckets, or files.
**Problem:** Storage location does not explain semantic meaning, ownership, classification, quality, or lineage.
**Solution:** Model Dataset, Schema, DataDistribution, DataStoreReference, and DataLineage separately.
---
## 14.2 Pattern: Dataset Catalog Entry
**Context:** Data consumers need to discover and understand data.
**Problem:** Data assets remain invisible or are known only through tribal knowledge.
**Solution:** Provide a catalog entry with:
```text
dataset name
description
owner
steward
classification
schema
distribution
quality expectations
lineage
access method
usage constraints
```
---
## 14.3 Pattern: Data Contract at Boundary
**Context:** Data crosses a team, service, product, or system boundary.
**Problem:** Consumers break when producers change data unexpectedly.
**Solution:** Define a DataContract with schema, semantic expectations, quality rules, compatibility rules, and change process.
---
## 14.4 Pattern: Classification Drives Controls
**Context:** Data has different sensitivity and obligations.
**Problem:** Systems apply uniform controls or rely on ad hoc judgment.
**Solution:** Classify data and map classifications to governance controls, access policies, security measures, and retention expectations.
---
## 14.5 Pattern: Lineage as Evidence
**Context:** A derived dataset is used for decisions or compliance.
**Problem:** Consumers cannot determine origin, transformations, or trustworthiness.
**Solution:** Model lineage with source datasets, transformations, activities, agents, target datasets, and evidence.
---
## 14.6 Pattern: Quality Rule to Remediation
**Context:** Data quality checks fail.
**Problem:** Failures stay on dashboards instead of triggering corrective action.
**Solution:**
```text
DataQualityRule
-> DataQualityCheck
-> DataQualityResult
-> DataQualityIssue
-> RemediationTask
-> VerificationEvidence
```
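The first steps of this chain can be sketched in a few lines. The class shapes and the helpers `run_check` and `to_task` are hypothetical; the DataQualityIssue step is folded into the task creation and the VerificationEvidence step is omitted to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DataQualityRule:
    name: str
    predicate: Callable[[dict], bool]


@dataclass(frozen=True)
class DataQualityResult:
    rule: str
    failed_rows: int


@dataclass(frozen=True)
class RemediationTask:
    title: str


def run_check(rule: DataQualityRule, rows: list[dict]) -> DataQualityResult:
    """DataQualityCheck: execute one rule over a batch of rows."""
    failed = sum(1 for row in rows if not rule.predicate(row))
    return DataQualityResult(rule.name, failed)


def to_task(result: DataQualityResult) -> Optional[RemediationTask]:
    """Turn a failing result into remediation work; passing results yield None."""
    if result.failed_rows == 0:
        return None
    return RemediationTask(f"Fix {result.failed_rows} row(s) violating: {result.rule}")


rule = DataQualityRule("customer_id must not be null", lambda r: r.get("customer_id") is not None)
result = run_check(rule, [{"customer_id": "c-1"}, {"customer_id": None}])
task = to_task(result)
```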
---
## 14.7 Pattern: Semantic Term and Field Split
**Context:** Database columns are treated as business terms.
**Problem:** Field names do not fully encode business meaning.
**Solution:** Link Field to DataElement, BusinessTerm, and DataDefinition.
---
## 14.8 Pattern: Retention with Governance Reference
**Context:** Data must be kept or deleted according to obligations.
**Problem:** Retention is encoded as undocumented operational behavior.
**Solution:** Link Dataset or DataObject to RetentionRuleReference and keep the governing obligation in Governance.
---
# 15. Data Profiles
## 15.1 Profile Format
A Data Profile SHALL declare:
```yaml
id:
profile_name:
status:
implements:
- InfoTechCanonDataModel
target_context:
included_concepts:
required_relationships:
required_metadata:
state_model:
source_of_truth_rules:
mapping_files:
validation_rules:
examples:
known_deviations:
```
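A profile file can be checked against this declaration list before it is accepted. The sketch below assumes the profile has already been parsed into a dictionary; the helper name and the sample profile identifier are hypothetical.

```python
REQUIRED_PROFILE_KEYS = [
    "id", "profile_name", "status", "implements", "target_context",
    "included_concepts", "required_relationships", "required_metadata",
    "state_model", "source_of_truth_rules", "mapping_files",
    "validation_rules", "examples", "known_deviations",
]


def missing_profile_keys(profile: dict) -> list[str]:
    """Return the keys from the profile format that the profile does not declare."""
    return [key for key in REQUIRED_PROFILE_KEYS if key not in profile]


partial_profile = {
    "id": "itc-data-profile:small-saas",  # hypothetical identifier
    "profile_name": "Small SaaS Data Profile",
    "status": "draft",
    "implements": ["InfoTechCanonDataModel"],
}
```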
---
## 15.2 Seed Profile: Small SaaS Data Profile
Purpose:
```text
Provide a minimal data model for a small SaaS platform moving toward production readiness.
```
Included concepts:
```text
DataDomain
Dataset
DataObject
Schema
Field
DataClassification
DataStoreReference
DataFlow
DataQualityRule
RetentionRuleReference
DataOwnerReference
DataStewardReference
```
Required relationships:
```text
Dataset has_schema Schema
Schema has_field Field
Dataset classified_as DataClassification
Dataset stored_in DataStoreReference
Dataset owned_by DataOwnerReference
Dataset stewarded_by DataStewardReference
DataFlow moves Dataset
RetentionRuleReference applies_to Dataset
```
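The Dataset-sourced relationships above lend themselves to a simple completeness check. This is a sketch, assuming relationships are available as `(source, relationship_type, target)` triples; the helper name and the identifiers in the sample data are invented for illustration.

```python
# Relationship types a Dataset must be the source of under this profile.
REQUIRED_FROM_DATASET = {"has_schema", "classified_as", "stored_in", "owned_by", "stewarded_by"}


def missing_relationships(dataset_id: str, triples: list[tuple[str, str, str]]) -> set[str]:
    """Return required relationship types that have no triple starting at dataset_id."""
    present = {rel for source, rel, _target in triples if source == dataset_id}
    return REQUIRED_FROM_DATASET - present


triples = [
    ("ds-orders", "has_schema", "schema-orders-v1"),
    ("ds-orders", "classified_as", "confidential"),
    ("ds-orders", "stored_in", "pg-orders-table"),
]
```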
---
## 15.3 Seed Profile: Data Catalog Profile
Purpose:
```text
Represent data catalog entries for discoverability and reuse.
```
Included concepts:
```text
Catalog
Dataset
DatasetSeries
DataDistribution
DataService
DataOwnerReference
DataStewardReference
DataClassification
DataQualitySummary
DataLineageSummary
```
Mapping targets:
```text
DCAT
DCAT-AP
DataHub
OpenMetadata
Amundsen
Collibra / catalog tools
```
---
## 15.4 Seed Profile: Data Contract Profile
Purpose:
```text
Represent data producer-consumer agreements.
```
Included concepts:
```text
DataContract
ProducerExpectation
ConsumerExpectation
Schema
SchemaVersion
DataQualityRule
CompatibilityRule
BreakingChange
ChangeNotice
DataContractViolation
```
Required relationships:
```text
DataContract applies_to Dataset
ProducerExpectation constrains Producer
ConsumerExpectation informs Consumer
CompatibilityRule governs SchemaEvolution
BreakingChange violates DataContract
```
---
## 15.5 Seed Profile: Data Lineage Profile
Purpose:
```text
Represent lineage across datasets, transformations, pipelines, and systems.
```
Included concepts:
```text
Dataset
SourceDataset
TargetDataset
Transformation
DataFlow
DataLineage
ProvenanceRecord
DataPipelineReference
ActivityReference
AgentReference
```
Mapping targets:
```text
PROV-O
OpenLineage
Marquez
dbt exposures/models/sources
DataHub lineage
```
---
## 15.6 Seed Profile: Privacy-Relevant Data Profile
Purpose:
```text
Represent data concepts relevant to privacy, data protection, retention, and processing.
```
Included concepts:
```text
PersonalDataCategory
SensitiveDataCategory
DataSubjectCategory
ProcessingPurpose
DataResidency
RetentionRuleReference
DataUsageConstraint
DataMinimizationExpectation
```
Governance owns legal obligations and lawful-basis interpretation.
---
## 15.7 Seed Profile: Analytics Dataset Profile
Purpose:
```text
Represent analytical datasets, metrics, dimensions, facts, models, and reports.
```
Included concepts:
```text
Dataset
Metric
Dimension
Fact
Measure
AggregationRule
ReportReference
DashboardReference
DataQualityRule
FreshnessExpectation
```
---
# 16. Mapping Model for the Data Standard
Mappings relate InfoTechCanon data concepts to external standards, frameworks, products, and regulations.
## 16.1 Mapping Types
Recommended mapping types:
```text
exactMatch
closeMatch
broadMatch
narrowMatch
relatedMatch
conflictMatch
gapMatch
derivedFrom
regulatoryReference
toolEquivalent
```
## 16.2 Mapping Record
Example:
```yaml
id: itc-map:dataset-to-dcat-dataset
source_concept: itc-data:Dataset
target_body: W3C DCAT
target_version: "3"
target_concept: dcat:Dataset
mapping_type: closeMatch
scope:
- data catalog interoperability
not_valid_for:
- all internal schema semantics
- all data product lifecycle semantics
rationale: >
DCAT Dataset is a strong catalog-oriented match for InfoTechCanon Dataset,
but InfoTechCanon includes additional governance, quality, contract,
and lineage expectations that may not be required by DCAT.
confidence: high
status: candidate
owner: InfoTechCanonDataModel
```
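Mapping records of this shape can be linted before they are merged. The sketch below assumes the record has been parsed into a dictionary; the helper `mapping_problems` and the choice of which keys to require are assumptions, not rules stated by this standard.

```python
MAPPING_TYPES = {
    "exactMatch", "closeMatch", "broadMatch", "narrowMatch", "relatedMatch",
    "conflictMatch", "gapMatch", "derivedFrom", "regulatoryReference", "toolEquivalent",
}


def mapping_problems(record: dict) -> list[str]:
    """Return human-readable problems found in a mapping record."""
    problems = []
    if record.get("mapping_type") not in MAPPING_TYPES:
        problems.append("mapping_type is not a recommended mapping type")
    # Assumption: every mapping should name both ends and explain itself.
    for key in ("source_concept", "target_concept", "rationale"):
        if not record.get(key):
            problems.append(f"missing {key}")
    return problems


dcat_mapping = {
    "id": "itc-map:dataset-to-dcat-dataset",
    "source_concept": "itc-data:Dataset",
    "target_concept": "dcat:Dataset",
    "mapping_type": "closeMatch",
    "rationale": "Catalog-oriented match; InfoTechCanon adds further expectations.",
}
```

An empty list from `mapping_problems(dcat_mapping)` indicates that the record passes these particular checks; it says nothing about whether the mapping itself is semantically sound.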
## 16.3 Seed Mapping Targets
The Data Model SHOULD maintain mappings to:
```text
DAMA-DMBOK
W3C DCAT 3
DCAT-AP
W3C PROV-O
ISO/IEC 11179
schema.org Dataset
OpenLineage
DataHub metadata model
OpenMetadata
dbt sources/models/exposures
Great Expectations
Apache Atlas
Collibra / data catalog concepts
GDPR / privacy-regulation references
Dublin Core metadata
SPDX / CycloneDX data references where relevant
```
---
# 17. Assimilation Hooks
The Data Model SHALL be able to receive new data standards, platforms, regulations, product schemas, and practices through the InfoTechCanon assimilation process.
## 17.1 Assimilation Triggers
Assimilation may be triggered by:
```text
new data catalog model
new data lineage standard
new metadata registry standard
new privacy regulation
new data-quality tool
new data-contract practice
new data-product pattern
new analytics modeling method
new data platform integration
new recurring data classification conflict
```
## 17.2 Data Assimilation Output
A data assimilation SHOULD produce:
```text
source summary
extracted data concepts
concept comparison matrix
gap list
conflict list
mapping file
candidate new concepts
candidate relationship changes
candidate pattern changes
candidate profile changes
open questions
```
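The output list above can serve as a completeness checklist. The sketch below assumes an assimilation result is held as a dictionary; the snake_case keys and the helper name `missing_parts` are hypothetical and only mirror the listed parts.

```python
from typing import Dict, List

# Output parts listed in section 17.2, expressed as dictionary keys.
# Assumption: these snake_case names are illustrative, not canonical.
EXPECTED_PARTS = (
    "source_summary",
    "extracted_data_concepts",
    "concept_comparison_matrix",
    "gap_list",
    "conflict_list",
    "mapping_file",
    "candidate_new_concepts",
    "candidate_relationship_changes",
    "candidate_pattern_changes",
    "candidate_profile_changes",
    "open_questions",
)


def missing_parts(output: Dict[str, object]) -> List[str]:
    """List expected parts that are absent from an assimilation output.

    A part counts as present when its key exists, even if its value is
    empty: an empty gap list is still a finding.
    """
    return [part for part in EXPECTED_PARTS if part not in output]


draft = {
    "source_summary": "W3C DCAT 3 overview",
    "gap_list": [],
    "mapping_file": "mappings/dcat.yaml",
}
print(missing_parts(draft))
```

Treating an empty value as present is a design choice of this sketch; a stricter workflow might require a reviewer to confirm each empty part explicitly.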
## 17.3 Recommended First Assimilation Candidates
```text
W3C DCAT 3
PROV-O
ISO/IEC 11179
DAMA-DMBOK
OpenLineage
DataHub
OpenMetadata
Great Expectations
dbt semantic layer / metadata
GDPR data categories and processing concepts
```
---
# 18. Integration with Other InfoTechCanon Standards
## 18.1 Landscape Model
Data references Landscape concepts for:
```text
data store
database
bucket
queue
topic
pipeline
runtime service
application service
endpoint
environment
```
## 18.2 Organization Model
Data imports Organization concepts for:
```text
data owner
data steward
data custodian
data producer
data consumer
data trustee
responsible team
```
## 18.3 Governance Model
Data imports Governance concepts for:
```text
policy
retention requirement
processing obligation
control
exception
evidence
review
compliance requirement
```
## 18.4 Security Model
Security imports data concepts for:
```text
classification
sensitivity
data category
data subject category
data exposure
residency
data security finding
```
## 18.5 Access Control Model
Access Control imports data concepts for:
```text
dataset
data object
data classification
data usage constraint
data access pattern
```
## 18.6 Task Model
Data creates or references tasks such as:
```text
data-quality remediation
schema migration
contract review
lineage clarification
classification review
retention cleanup
data incident investigation
```
## 18.7 Tagging Standard
Tagging supports data discovery and classification but MUST NOT replace data classification, schema, lineage, quality, or governance records.
---
# 19. Canon Interface Card Usage
Subsystems that implement or produce data knowledge SHOULD publish a Canon Interface Card.
Example:
```yaml
subsystem: data-catalog-importer
implements:
- InfoTechCanonDataModel
- DataCatalogProfile
produces:
- Dataset
- Schema
- Field
- DataDistribution
- DataOwnerReference
consumes:
- Team
- DataStoreReference
- Policy
relations:
- Dataset has_schema Schema
- Schema has_field Field
- Dataset stored_in DataStoreReference
- Dataset owned_by Team
source_of_truth:
  dataset_catalog_entries: data-catalog
known_deviations:
- lineage is summary-only
- data quality checks are imported from separate system
```
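One consistency check a card makes possible is that every concept named in `relations` is also declared under `produces` or `consumes`. The sketch below assumes the card has already been loaded into a dictionary and that each relation is a three-token `Subject predicate Object` string; the helper name `undeclared_concepts` is hypothetical.

```python
from typing import Dict, List

# The example card above, reduced to the keys this sketch inspects.
card = {
    "produces": ["Dataset", "Schema", "Field", "DataDistribution", "DataOwnerReference"],
    "consumes": ["Team", "DataStoreReference", "Policy"],
    "relations": [
        "Dataset has_schema Schema",
        "Schema has_field Field",
        "Dataset stored_in DataStoreReference",
        "Dataset owned_by Team",
    ],
}


def undeclared_concepts(card: Dict[str, List[str]]) -> List[str]:
    """Return concepts used in relations but missing from produces/consumes."""
    declared = set(card.get("produces", [])) | set(card.get("consumes", []))
    undeclared: List[str] = []
    for relation in card.get("relations", []):
        # Assumption: exactly three whitespace-separated tokens per relation;
        # anything else raises ValueError here.
        subject, _predicate, obj = relation.split()
        for concept in (subject, obj):
            if concept not in declared and concept not in undeclared:
                undeclared.append(concept)
    return undeclared


print(undeclared_concepts(card))
```

For the example card the result is an empty list. Adding a relation such as `Dataset governed_by Control` would report `Control`, since the card neither produces nor consumes it.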
---
# 20. Retrieval Requirements
The Data Model is designed for Markdown-based infospaces.
## 20.1 Required Retrieval Properties
Every major concept SHOULD provide:
- stable heading,
- stable identifier,
- short definition,
- longer explanation,
- examples,
- distinction notes,
- relationship examples,
- mapping hooks,
- profile references,
- and common mistakes.
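These properties can be checked per concept page. The sketch below assumes a repository convention in which each property is a heading of its own; the heading names in `REQUIRED_SECTIONS` are hypothetical, and the stable heading itself (the page title) is not checked.

```python
import re
from typing import List

# Assumption: one heading per retrieval property. The names below are
# hypothetical and would be fixed by a repository convention.
REQUIRED_SECTIONS = (
    "Identifier", "Definition", "Explanation", "Examples", "Distinctions",
    "Relationships", "Mappings", "Profiles", "Common Mistakes",
)

# ATX-style Markdown heading: one to six '#' characters, then the title.
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown: str) -> List[str]:
    """Return required section names that have no matching heading."""
    found = {title.lower() for title in HEADING.findall(markdown)}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in found]


concept_page = """# Dataset
## Identifier
itc-data:Dataset
## Definition
(short definition goes here)
## Examples
"""
print(missing_sections(concept_page))
```

For this page the check reports the six sections that are still absent, from `Explanation` through `Common Mistakes`.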
## 20.2 Agent Brief
A mature Data Model SHOULD include an `agent-brief.md` file with:
```text
purpose
scope
owned concepts
imported concepts
core distinctions
do / do not rules
relationship patterns
minimal examples
common mistakes
profile list
mapping list
```
## 20.3 Indexes
The data information space SHOULD provide indexes by:
```text
concept
relationship
data domain
dataset
schema
field
classification
quality rule
lineage
contract
profile
pattern
mapping target
status
source system
```
---
# 21. Conformance Levels
## 21.1 Reference-Conformant
A document or system is reference-conformant if it uses Data Model terminology consistently, even if it does not implement structured metadata or validation rules.
## 21.2 Metadata-Conformant
A system is metadata-conformant if it uses stable identifiers, concept names, lifecycle states, source metadata, and relationship types.
## 21.3 Catalog-Conformant
A system is catalog-conformant if datasets, distributions, data services, owners, stewards, descriptions, and classifications are represented.
## 21.4 Lineage-Conformant
A system is lineage-conformant if it represents data sources, transformations, targets, provenance, and confidence.
## 21.5 Quality-Conformant
A system is quality-conformant if it represents data quality rules, checks, results, and issues.
## 21.6 Contract-Conformant
A system is contract-conformant if producer and consumer expectations are represented as DataContracts.
## 21.7 Profile-Conformant
A system is profile-conformant if it implements a declared Data Profile and passes its validation rules.
## 21.8 Assimilation-Conformant
A system or repository is assimilation-conformant if it can accept external data concepts through the InfoTechCanon assimilation workflow and produce mappings, gaps, conflicts, and proposed changes.
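Levels 21.2 through 21.5 each enumerate what a system must represent, so they can be evaluated as set inclusion. The sketch below is illustrative only: the snake_case labels and the function name `satisfied_levels` are assumptions, and the remaining levels are omitted because they depend on profiles, contracts, or workflow rather than a fixed list of elements.

```python
from typing import Dict, FrozenSet, List, Set

# Elements each level requires, taken from sections 21.2 to 21.5.
# Assumption: the snake_case labels are illustrative names for them.
LEVEL_REQUIREMENTS: Dict[str, FrozenSet[str]] = {
    "metadata-conformant": frozenset({
        "stable_identifiers", "concept_names", "lifecycle_states",
        "source_metadata", "relationship_types",
    }),
    "catalog-conformant": frozenset({
        "datasets", "distributions", "data_services", "owners",
        "stewards", "descriptions", "classifications",
    }),
    "lineage-conformant": frozenset({
        "sources", "transformations", "targets", "provenance", "confidence",
    }),
    "quality-conformant": frozenset({
        "quality_rules", "quality_checks", "quality_results", "quality_issues",
    }),
}


def satisfied_levels(represented: Set[str]) -> List[str]:
    """Return the levels whose required elements are all represented."""
    return [
        level
        for level, needed in LEVEL_REQUIREMENTS.items()
        if needed <= represented  # subset test
    ]


# A catalog importer that also records lineage sources and targets only.
importer = {
    "datasets", "distributions", "data_services", "owners", "stewards",
    "descriptions", "classifications", "sources", "targets",
}
print(satisfied_levels(importer))
```

The importer satisfies only the catalog level: it represents lineage sources and targets but not transformations, provenance, or confidence.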
---
# 22. Validation Rules
Initial validation rules:
```text
VAL-DATA-001: Dataset SHOULD NOT be modeled as identical to DataStoreReference.
VAL-DATA-002: Dataset SHOULD have owner or steward reference when used for operational or governed purposes.
VAL-DATA-003: Dataset SHOULD have classification when it may contain sensitive, regulated, operationally critical, or business-critical data.
VAL-DATA-004: Schema SHOULD have version when used across system boundaries.
VAL-DATA-005: Field SHOULD be distinguishable from DataElement where semantic precision matters.
VAL-DATA-006: DataQualityRule SHOULD declare the dataset, field, or data object it applies to.
VAL-DATA-007: DataQualityResult SHOULD reference the executed rule and check.
VAL-DATA-008: DataLineage SHOULD distinguish declared, inferred, observed, and verified lineage.
VAL-DATA-009: DataContract SHOULD declare producer, consumer, dataset, schema or semantic expectations, quality expectations, and compatibility rules where applicable.
VAL-DATA-010: BreakingChange SHOULD reference the DataContract or CompatibilityRule it violates.
VAL-DATA-011: RetentionRuleReference SHOULD point to Governance concepts rather than embedding legal interpretation in Data.
VAL-DATA-012: DataResidency SHOULD reference region, jurisdiction, environment, or storage/processing scope where available.
VAL-DATA-013: Tags MUST NOT replace DataClassification, Schema, Lineage, Quality, or Contract records.
VAL-DATA-014: External data concepts SHOULD be represented through mapping records rather than silently reused.
VAL-DATA-015: Profiles MUST NOT redefine canonical concepts; they MAY constrain them.
VAL-DATA-016: Data used for AI training, analytics, or automation SHOULD declare usage constraints and provenance where relevant.
```
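Several of these rules can be evaluated directly against a dataset record. The sketch below covers VAL-DATA-002, VAL-DATA-004, and VAL-DATA-013 only. The record shape (`governed`, `crosses_system_boundary`, `schema`, `tags`, `classification`) and the set of classification-like tags are assumptions made for illustration; SHOULD rules are reported as warnings and MUST NOT rules as errors.

```python
from typing import Any, Dict, List

# Assumption: tags that look like classifications; the set is illustrative.
CLASSIFICATION_LIKE_TAGS = {"confidential", "restricted", "internal", "public"}


def check_dataset(dataset: Dict[str, Any]) -> List[str]:
    """Evaluate a few VAL-DATA rules against a hypothetical dataset record."""
    findings: List[str] = []

    # VAL-DATA-002: owner or steward reference for operational or governed use.
    if dataset.get("governed") and not (dataset.get("owner") or dataset.get("steward")):
        findings.append("VAL-DATA-002 warning: no owner or steward reference")

    # VAL-DATA-004: schema version when used across system boundaries.
    schema = dataset.get("schema") or {}
    if dataset.get("crosses_system_boundary") and not schema.get("version"):
        findings.append("VAL-DATA-004 warning: schema has no version")

    # VAL-DATA-013: tags must not stand in for a DataClassification record.
    tags = set(dataset.get("tags", []))
    if tags & CLASSIFICATION_LIKE_TAGS and not dataset.get("classification"):
        findings.append("VAL-DATA-013 error: classification expressed by tag only")

    return findings


orders = {
    "id": "orders",
    "governed": True,
    "steward": "team-sales-data",
    "crosses_system_boundary": True,
    "schema": {"name": "orders"},
    "tags": ["confidential"],
}
for finding in check_dataset(orders):
    print(finding)
```

The `orders` record passes VAL-DATA-002 because it names a steward, but it is flagged for an unversioned schema and for carrying a `confidential` tag without a classification record, which is also the anti-pattern described in section 23.3.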
---
# 23. Anti-Patterns
## 23.1 Table Equals Dataset
Treating every table as a complete dataset and every dataset as a table.
## 23.2 Schema Equals Meaning
Assuming column names and types fully define business meaning.
## 23.3 Classification by Tag Only
Using tags such as `confidential` without a governed DataClassification record.
## 23.4 Lineage by Diagram Only
Drawing flows without source, transformation, target, evidence, or confidence.
## 23.5 Quality Dashboard Graveyard
Tracking quality failures without owners, tasks, remediation, or fitness-for-use decisions.
## 23.6 Contract-Free Integration
Letting consumers depend on producer data without explicit compatibility expectations.
## 23.7 Hidden Retention Logic
Deleting or keeping data based on undocumented scripts or tribal knowledge.
## 23.8 Catalog Without Trust
Cataloging datasets without owner, freshness, classification, quality, or lineage.
## 23.9 Privacy in Free Text
Recording processing purpose, data subject category, residency, or sensitivity as unstructured notes only.
## 23.10 Vendor Model Capture
Letting one data catalog, warehouse, or governance product define the internal data model.
---
# 24. Initial Repository Placement
Recommended repository layout:
```text
info-tech-canon/
  standards/
    data/
      InfoTechCanonDataModel.md
      agent-brief.md
      concepts/
      relationships/
      patterns/
      profiles/
      mappings/
      assimilation/
      examples/
      validation/
Seed files:
```text
standards/data/InfoTechCanonDataModel.md
standards/data/agent-brief.md
standards/data/concepts/dataset.md
standards/data/concepts/data-product.md
standards/data/concepts/schema.md
standards/data/concepts/data-element.md
standards/data/concepts/data-classification.md
standards/data/concepts/data-lineage.md
standards/data/concepts/data-quality-rule.md
standards/data/concepts/data-contract.md
standards/data/patterns/data-is-not-its-store.md
standards/data/patterns/dataset-catalog-entry.md
standards/data/patterns/data-contract-at-boundary.md
standards/data/patterns/lineage-as-evidence.md
standards/data/profiles/small-saas-data-profile.md
standards/data/profiles/data-catalog-profile.md
standards/data/profiles/data-contract-profile.md
standards/data/profiles/data-lineage-profile.md
standards/data/mappings/dcat.yaml
standards/data/mappings/prov-o.yaml
standards/data/mappings/iso-11179.yaml
standards/data/mappings/dama-dmbok.yaml
```
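The directory part of this layout can be created with a few lines of scripting. The sketch below assumes all eight subdirectories sit directly under `standards/data/`, which the seed file paths confirm for `concepts`, `patterns`, `profiles`, and `mappings` and which is assumed for the rest. The function name `scaffold_data_standard` is hypothetical, and the example writes into a temporary directory rather than a real checkout.

```python
import tempfile
from pathlib import Path
from typing import List

# Subdirectories from the recommended layout.
# Assumption: all of them live directly under standards/data/.
SUBDIRECTORIES = (
    "concepts", "relationships", "patterns", "profiles",
    "mappings", "assimilation", "examples", "validation",
)


def scaffold_data_standard(repo_root: Path) -> List[Path]:
    """Create the standards/data/ subdirectories and return their paths."""
    data_dir = repo_root / "standards" / "data"
    created: List[Path] = []
    for name in SUBDIRECTORIES:
        path = data_dir / name
        # exist_ok makes the function safe to re-run on an existing checkout.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_data_standard(Path(tmp) / "info-tech-canon")
    print([path.name for path in created])
```

The seed files themselves are left out on purpose, since their content is defined by the standard and not by tooling.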
---
# 25. Roadmap
## Phase 1: Seed Stabilization
- Establish this standard as `InfoTechCanonDataModel`.
- Add seed concepts, relationship vocabulary, patterns, and profiles.
- Define validation rules.
- Align with Landscape, Governance, Security, Access Control, Task, and Tagging.
## Phase 2: First Assimilations
Recommended first assimilations:
```text
W3C DCAT 3
PROV-O
ISO/IEC 11179
DAMA-DMBOK
OpenLineage
DataHub
OpenMetadata
Great Expectations
dbt metadata
GDPR data category concepts
```
## Phase 3: Profile Maturation
- Mature Small SaaS Data Profile.
- Mature Data Catalog Profile.
- Mature Data Contract Profile.
- Mature Data Lineage Profile.
- Mature Privacy-Relevant Data Profile.
- Mature Analytics Dataset Profile.
## Phase 4: Tooling Integration
- Generate concept indexes.
- Generate agent brief.
- Create machine-readable YAML/JSON exports.
- Add validation scripts.
- Integrate data catalog, lineage, data-quality, schema registry, and contract tooling.
## Phase 5: Data Intelligence Loop
- Connect datasets to services and repositories.
- Connect classification to access control and security.
- Connect quality issues to tasks.
- Connect lineage to provenance and assurance.
- Connect data contracts to DevSecOps and release workflows.
- Connect privacy and retention to governance obligations.
---
# 26. Summary
The InfoTechCanon Data Model is the seed standard for representing data as a managed, governed, discoverable, reusable, classifiable, lineage-bearing, and quality-assessable asset.
Its most important commitments are:
```text
Separate data from storage.
Separate dataset, schema, field, data element, data object, and data product.
Treat classification, lineage, quality, retention, residency, and processing purpose as first-class concerns.
Use data contracts at producer-consumer boundaries.
Import governance, access-control, security, task, tagging, organization, and landscape concepts
  instead of redefining them.
Map to DCAT, PROV-O, ISO/IEC 11179, DAMA-DMBOK, OpenLineage, and catalog tools
  without surrendering internal semantic autonomy.
Use profiles to make the model practical for SaaS systems, catalogs, contracts,
  lineage, privacy-relevant data, analytics, and AI/agentic workflows.
```
This makes the Data Model a core seed for information architecture, data governance, security posture, AI readiness, analytics reliability, and interoperable information-processing systems.