InfoTechCanon Data Model
Short Name: ITC-DATA
Document Status: Seed Standard Release Candidate 1
Version: RC1-seed
Date: 2026-05-22
Repository Context: info-tech-canon
Document Type: InfoTechCanon Domain Standard
Intended Audience: Data architects, data engineers, data stewards, platform engineers, governance designers, security architects, application architects, product owners, knowledge-system builders, compliance reviewers, AI/analytics teams, and agentic tooling.
1. Purpose
The InfoTechCanon Data Model defines a canonical seed model for representing data as a managed, governed, discoverable, classifiable, lineage-bearing, quality-assessable, and reusable information asset.
It exists to give data its own canonical domain instead of leaving data semantics scattered across landscape, security, governance, DevSecOps, observability, and application models.
This standard provides a canonical vocabulary for:
- data domains,
- datasets,
- data products,
- data objects,
- records,
- fields,
- schemas,
- data elements,
- code lists,
- data stores as references,
- data flows,
- data lineage,
- data quality,
- metadata,
- catalogs,
- distributions,
- data services,
- data classification,
- sensitivity,
- residency,
- retention,
- processing purpose,
- data ownership and stewardship references,
- data contracts,
- and data evidence.
2. Position in InfoTechCanon
The Data Model is a domain standard within InfoTechCanon.
It depends on the existing seed standards as follows:
Landscape = where data is stored, processed, moved, and exposed.
Organization = data owners, stewards, custodians, producers, consumers.
Governance = data policies, obligations, controls, evidence, exceptions.
Security = data exposure, data-security findings, data attack paths.
Access Control = permissions and grants to data resources.
Task = data-quality work, migration work, remediation, reviews.
Tagging = lightweight classification and retrieval.
Data = datasets, schemas, metadata, lineage, quality, classification, retention.
InfoTechCanon
├── InfoTechCanonCore
├── InfoTechCanonLandscapeModel
├── InfoTechCanonOrganizationModel
├── InfoTechCanonGovernanceModel
├── InfoTechCanonTaskModel
├── InfoTechCanonTaggingStandard
├── InfoTechCanonAccessControlModel
├── InfoTechCanonSecurityModel
├── InfoTechCanonDataModel <-- this standard
├── InfoTechCanonDevSecOpsModel
├── InfoTechCanonNetworkModel
├── InfoTechCanonObservabilityModel
├── InfoTechCanonPatternLanguage
└── Application Profiles
3. Boundary with Adjacent Standards
3.1 Boundary with Landscape
The Landscape Model owns:
DataStore
DatabaseInstance
ObjectBucket
FileShare
Queue
Cache
RuntimeResource
ApplicationService
IntegrationFlow
Endpoint
The Data Model owns:
Dataset
DataProduct
DataObject
Schema
Field
DataElement
DataFlow
DataLineage
DataClassification
DataQualityRule
DataContract
DataDistribution
Boundary rule:
Landscape owns the technical and runtime places where data lives or moves.
Data owns the semantic, structural, quality, classification, and lineage meaning of data.
3.2 Boundary with Governance
The Governance Model owns:
Policy
Requirement
Obligation
Control
Risk
Exception
Evidence
Review
Approval
ComplianceRequirement
The Data Model owns data-specific structures that are governed:
RetentionRuleReference
ProcessingPurpose
DataClassification
DataQualityRule
DataContract
DataLineage
Boundary rule:
Governance defines why data must be governed.
Data defines what data is and how it is described, classified, measured, and traced.
3.3 Boundary with Security
The Security Model owns:
DataSecurityFinding
ExposureFinding
CredentialExposure
SecurityIncident
AttackPath
Mitigation
The Data Model owns:
Sensitivity
DataClassification
DataResidency
DataSubjectCategory
DataCategory
DataLineage
Security may use these for posture analysis.
3.4 Boundary with Access Control
Access Control owns permissions, grants, authorization decisions, and enforcement.
Data owns data resources and classifications that access policies may use.
Example:
Dataset classified_as Confidential
AccessPolicy permits Role to read Dataset
AuthorizationDecision permits read on Dataset
3.5 Boundary with Organization
Organization owns actors and responsibilities.
Data references Organization concepts for:
DataOwner
DataSteward
DataCustodian
DataProducer
DataConsumer
DataTrustee
3.6 Boundary with DevSecOps
DevSecOps owns source, build, artifact, pipeline, release, deployment, SBOM, and attestation semantics.
Data owns data contracts, schema evolution, migration data, test data, synthetic data, lineage, and data-quality semantics.
4. Research Basis and External Alignment
This seed standard draws on multiple data-management bodies of knowledge.
4.1 DAMA-DMBOK
DAMA-DMBOK is a broad reference for data management disciplines including data governance, architecture, modeling, storage, security, integration, documents/content, reference/master data, warehousing/BI, metadata, and data quality. InfoTechCanon uses it as a broad mapping and assimilation target, not as a direct controlling model.
4.2 DCAT
W3C DCAT defines a vocabulary for data catalogs. DCAT Version 3 organizes catalog access around datasets, distributions, data services, and dataset series. This is highly relevant for InfoTechCanon catalog, dataset, distribution, and data-service concepts.
4.3 PROV-O
W3C PROV-O models provenance using entities, activities, and agents. This is highly relevant for data lineage, derivation, generation, transformation, and responsibility.
4.4 ISO/IEC 11179
ISO/IEC 11179 provides a metadata registry framework for data elements, naming, identification, definitions, classification, and registration. It is an important mapping target for data element, representation, data definition, code list, and metadata registry concepts.
4.5 Data Mesh and Data Products
Data product thinking emphasizes ownership, discoverability, quality, fitness for use, service-like interfaces, and domain responsibility. InfoTechCanon should support data products without requiring a specific data-mesh organizational model.
4.6 Data Contracts
Data contracts define expectations between producers and consumers around schema, semantics, quality, delivery, compatibility, ownership, and change management. They are critical for reliable information-processing systems.
4.7 Privacy and Data Protection Practice
Privacy and data-protection practice contributes concepts such as personal data, sensitive data, data subject, processing purpose, lawful basis, retention, residency, and minimization. The Data Model provides data semantics, while Governance owns legal obligations and Security owns data exposure and incident semantics.
5. Seed Standard Design Stance
This standard is a seed standard, not a full data-governance or database-design manual.
It shall:
- define canonical data semantics,
- distinguish data from storage infrastructure,
- distinguish dataset, data product, data object, schema, field, and data element,
- support data classification, lineage, quality, retention, residency, and processing purpose,
- support catalog and discovery concepts,
- support data contracts and schema evolution,
- support operational, analytical, reference, master, event, and document data,
- support mappings to external standards without becoming subordinate to them,
- remain markdown-first and agent-retrievable,
- and support future assimilation of data standards, platforms, regulations, and product schemas.
6. Scope
6.1 In Scope
This standard covers canonical representation of:
- data domains,
- data products,
- datasets,
- dataset series,
- data distributions,
- data services,
- data objects,
- entities,
- records,
- fields,
- attributes,
- data elements,
- schemas,
- schema versions,
- code lists,
- reference data,
- master data references,
- metadata,
- catalogs,
- data lineage,
- data flows,
- data transformations,
- data quality rules,
- data quality results,
- data contracts,
- data classification,
- sensitivity,
- confidentiality level,
- integrity expectation,
- availability expectation,
- retention rules as data semantics,
- data residency,
- data minimization,
- processing purpose,
- data subject categories,
- data provenance,
- data ownership and stewardship references,
- and data lifecycle states.
6.2 Out of Scope
This standard does not fully define:
- database engine internals,
- storage infrastructure,
- full data warehouse architecture,
- full analytics modeling,
- full privacy-law interpretation,
- full data-governance process,
- full security incident handling,
- all ontology modeling,
- all semantic-web representation,
- complete ETL/ELT implementation,
- or every vendor-specific data catalog schema.
Those may be mapped, assimilated, profiled, or handled by adjacent standards.
7. Normative Language
The following terms are used normatively:
- SHALL indicates a mandatory rule for conformance.
- SHOULD indicates a recommended practice.
- MAY indicates an optional capability.
- MUST NOT indicates a prohibited practice.
- SEED marks a concept defined provisionally here but open to later refinement.
- EXTRACT marks a concept that may later move to a more specialized standard.
8. Core Principles
8.1 Data Is Not Its Store
A dataset is not the same thing as a database, bucket, table, file, topic, or API.
Storage and runtime locations are Landscape concepts. Data semantics belong here.
8.2 Dataset Is Not Schema
A dataset may have one or more schemas, distributions, versions, contracts, lineage records, and quality expectations.
8.3 Schema Is Not Meaning
A schema describes structure. It does not fully define business meaning, ownership, usage constraints, quality, or purpose.
8.4 Classification Is First-Class
Data classification and sensitivity SHOULD be explicit where data has security, privacy, compliance, operational, or business significance.
8.5 Lineage Is Evidence-Carrying
Lineage SHOULD identify source data, transformations, activities, agents, and derived outputs with confidence and evidence where possible.
8.6 Data Quality Is Contextual
Data quality depends on intended use, domain meaning, contract expectations, and consumer needs.
8.7 Data Contracts Make Data Reliable
Producer-consumer expectations SHOULD be explicit when data is reused across system boundaries.
8.8 External Standards Are Mapped, Not Obeyed
The Data Model MAY map to DAMA-DMBOK, DCAT, PROV-O, ISO/IEC 11179, schema.org, OpenLineage, DataHub, OpenMetadata, dbt, Great Expectations, or similar standards and tools.
It MUST NOT subordinate its internal semantics to any single external model.
9. Canonical Seed Metadata
Every data artifact SHOULD support structured metadata.
Recommended front matter:
---
id: itc-data:Dataset
type: concept
standard: InfoTechCanonDataModel
standard_version: RC1-seed
status: candidate
canonical_owner: InfoTechCanonDataModel
preferred_label: Dataset
related:
- itc-data:DataProduct
- itc-data:Schema
- itc-data:DataDistribution
- itc-data:DataLineage
mappings:
- itc-map:dataset-to-dcat-dataset
---
Recommended artifact statuses:
idea
draft
candidate
release-candidate
adopted
stable
deprecated
retired
Recommended concept statuses:
proposed
experimental
candidate
canonical
deprecated
retired
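As a minimal sketch, the two status vocabularies above could be checked mechanically once the front matter has been parsed into a mapping. The function name `check_status` and the convention that `type: concept` selects the concept vocabulary are illustrative assumptions; only the status values come from this section.

```python
from __future__ import annotations

# Status vocabularies copied from this section.
ARTIFACT_STATUSES = {
    "idea", "draft", "candidate", "release-candidate",
    "adopted", "stable", "deprecated", "retired",
}
CONCEPT_STATUSES = {
    "proposed", "experimental", "candidate",
    "canonical", "deprecated", "retired",
}


def check_status(front_matter: dict) -> list[str]:
    """Return a list of problems found in a parsed front-matter mapping."""
    problems = []
    # Assumption: `type: concept` selects the concept vocabulary,
    # anything else is treated as an artifact.
    if front_matter.get("type") == "concept":
        allowed = CONCEPT_STATUSES
    else:
        allowed = ARTIFACT_STATUSES
    status = front_matter.get("status")
    if status is None:
        problems.append("missing status")
    elif status not in allowed:
        problems.append(f"unknown status: {status}")
    if not front_matter.get("id"):
        problems.append("missing id")
    return problems


dataset_front_matter = {
    "id": "itc-data:Dataset",
    "type": "concept",
    "status": "candidate",
}
print(check_status(dataset_front_matter))  # []
```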
10. Root Data Taxonomy
DataEntity
├── DataAssetEntity
│ ├── DataDomain
│ ├── DataProduct
│ ├── Dataset
│ ├── DatasetSeries
│ ├── DataDistribution
│ ├── DataService
│ ├── DataObject
│ ├── Record
│ └── DocumentData
├── StructureEntity
│ ├── Schema
│ ├── SchemaVersion
│ ├── Field
│ ├── Attribute
│ ├── DataElement
│ ├── DataElementConcept
│ ├── Representation
│ ├── DataType
│ ├── Constraint
│ └── CodeList
├── SemanticEntity
│ ├── BusinessTerm
│ ├── GlossaryTerm
│ ├── ConceptualEntity
│ ├── DataDefinition
│ ├── ReferenceData
│ ├── MasterDataReference
│ └── CanonicalValue
├── GovernanceReferenceEntity
│ ├── DataClassification
│ ├── Sensitivity
│ ├── DataCategory
│ ├── DataSubjectCategory
│ ├── ProcessingPurpose
│ ├── RetentionRuleReference
│ ├── DataResidency
│ └── DataUsageConstraint
├── QualityEntity
│ ├── DataQualityDimension
│ ├── DataQualityRule
│ ├── DataQualityCheck
│ ├── DataQualityResult
│ ├── DataQualityIssue
│ └── FitnessForUse
├── LineageEntity
│ ├── DataFlow
│ ├── DataLineage
│ ├── Transformation
│ ├── Derivation
│ ├── SourceDataset
│ ├── TargetDataset
│ └── ProvenanceRecord
├── ContractEntity
│ ├── DataContract
│ ├── ProducerExpectation
│ ├── ConsumerExpectation
│ ├── CompatibilityRule
│ ├── BreakingChange
│ └── SchemaEvolutionPolicy
└── OperationalDataEntity
  ├── DataPipelineReference
  ├── DataStoreReference
  ├── QueryReference
  ├── DataAccessPattern
  ├── DataFreshness
  └── DataAvailability
11. Core Concepts
11.1 DataEntity
A DataEntity is any identifiable concept used to represent data, metadata, structure, classification, quality, lineage, contract, or data lifecycle.
Recommended attributes:
id:
entity_type:
canonical_name:
display_name:
lifecycle_state:
source_system:
created_at:
updated_at:
Optional attributes:
owner:
steward:
data_domain:
classification:
source_confidence:
valid_from:
valid_to:
tags:
external_references:
11.2 DataDomain
A DataDomain is a bounded area of data meaning, ownership, stewardship, or subject matter.
Examples:
customer
billing
product
identity
orders
support
security
operations
finance
11.3 DataProduct
A DataProduct is a managed data asset or set of data assets offered for use by consumers with explicit ownership, quality expectations, documentation, interfaces, and lifecycle.
Recommended attributes:
owner:
steward:
producer:
consumers:
service_level_expectations:
quality_expectations:
contract:
distribution_methods:
11.4 Dataset
A Dataset is a coherent collection of data published, managed, processed, analyzed, or consumed as a unit.
A dataset may have:
schema
distribution
catalog entry
classification
lineage
quality rules
owner
steward
contract
retention expectation
Canonical rule:
Dataset MUST NOT be treated as identical to its storage location.
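One way to honor this rule in code is to give the dataset its own identity and hold storage only as references. The sketch below is illustrative: the class fields and the identifier strings are hypothetical, not part of the standard.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataStoreReference:
    """Pointer to a Landscape-owned storage resource (hypothetical fields)."""
    id: str
    kind: str  # e.g. "table", "bucket", "topic"


@dataclass
class Dataset:
    """Semantic description of the data; storage is only referenced."""
    id: str
    canonical_name: str
    classification: str
    stored_in: list[DataStoreReference] = field(default_factory=list)


orders = Dataset(
    id="itc-data:dataset/orders",  # hypothetical identifier
    canonical_name="orders",
    classification="confidential",
)
# The same dataset can live in more than one place without changing identity.
orders.stored_in.append(DataStoreReference("landscape:table/orders_v1", "table"))
orders.stored_in.append(DataStoreReference("landscape:bucket/orders-export", "bucket"))

print(orders.id, len(orders.stored_in))  # itc-data:dataset/orders 2
```

Moving or replicating the data changes only the `stored_in` list; ownership, classification, and lineage stay attached to the dataset itself.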
11.5 DatasetSeries
A DatasetSeries is a sequence or family of related datasets organized over time, version, geography, domain, or release.
11.6 DataDistribution
A DataDistribution is an accessible representation of a dataset.
Examples:
CSV file
Parquet file
API response
database table export
event stream
report download
object storage path
11.7 DataService
A DataService is a service that provides access to data or operations over data.
Examples:
query API
data product API
metadata API
streaming endpoint
analytics service
11.8 DataObject
A DataObject is a meaningful object or structure represented in data.
Examples:
Customer
Invoice
Order
Payment
Product
Device
UserProfile
AccessGrant
SecurityFinding
11.9 Record
A Record is an instance-level representation of data about an entity, event, relationship, or observation.
11.10 Field
A Field is a named component of a schema, record, message, or table.
11.11 Attribute
An Attribute is a property of a data object or conceptual entity.
A field may represent an attribute, but a field is structural, whereas an attribute is semantic.
11.12 DataElement
A DataElement is a defined unit of data with meaning, representation, and expected usage.
It may map to ISO/IEC 11179 data element concepts.
Recommended attributes:
object_class:
property:
representation:
data_type:
definition:
permitted_values:
11.13 DataElementConcept
A DataElementConcept is the semantic idea of a data element independent of representation.
Examples:
Customer birth date
Invoice total amount
Repository default branch name
11.14 Representation
A Representation describes how a data element is represented.
Examples:
string
integer
decimal
boolean
date
timestamp
code
identifier
URI
11.15 DataType
A DataType specifies the technical or logical type of a field or data element.
11.16 Constraint
A Constraint is a rule limiting valid data.
Examples:
required
unique
minimum
maximum
regex
foreign key
enum
format
cardinality
11.17 CodeList
A CodeList is a controlled set of allowed values with definitions.
Examples:
country codes
currency codes
status codes
classification labels
risk levels
11.18 BusinessTerm
A BusinessTerm is a term used by domain actors to describe data meaning.
11.19 GlossaryTerm
A GlossaryTerm is a documented term in a glossary with definition, synonyms, ownership, and mappings.
11.20 DataDefinition
A DataDefinition is a textual or structured definition explaining the meaning, scope, and intended use of a data concept.
11.21 ReferenceData
ReferenceData is data used to classify, categorize, or constrain other data.
Examples:
country list
currency list
product category list
status code list
business unit list
11.22 MasterDataReference
A MasterDataReference points to a controlled source of core business entities.
Examples:
customer master
product master
supplier master
employee master
The Data Model references master-data semantics but does not require a specific MDM architecture.
11.23 DataClassification
A DataClassification is a classification assigned to data based on sensitivity, confidentiality, regulatory concern, operational criticality, or business significance.
Examples:
public
internal
confidential
restricted
regulated
personal
sensitive personal
secret
11.24 Sensitivity
Sensitivity indicates potential harm, obligation, or restriction associated with data disclosure, modification, loss, misuse, or processing.
11.25 DataCategory
A DataCategory groups data by semantic, legal, operational, or analytical type.
Examples:
personal data
financial data
health data
authentication data
transaction data
telemetry data
metadata
content data
11.26 DataSubjectCategory
A DataSubjectCategory identifies the kind of person or entity data is about.
Examples:
customer
employee
applicant
supplier contact
child
patient
user
administrator
11.27 ProcessingPurpose
A ProcessingPurpose describes why data is collected, stored, transformed, shared, or used.
Examples:
billing
support
security monitoring
analytics
product improvement
legal compliance
identity verification
11.28 RetentionRuleReference
A RetentionRuleReference links data to governance-defined retention obligations, policies, or rules.
The Data Model may model retention expectation, but Governance owns the policy and obligation.
11.29 DataResidency
DataResidency describes where data is stored, processed, transferred, or legally required to remain.
Examples:
EU
Germany
customer region
cloud region
on-premises only
11.30 DataUsageConstraint
A DataUsageConstraint describes a restriction on how data may be used.
Examples:
not for training
not for export
internal analytics only
production use prohibited
no cross-border transfer
only aggregated use
11.31 DataQualityDimension
A DataQualityDimension is an aspect of data quality.
Common dimensions:
accuracy
completeness
consistency
timeliness
validity
uniqueness
freshness
integrity
fitness_for_use
11.32 DataQualityRule
A DataQualityRule is a testable expectation about data quality.
Examples:
customer_id must not be null
invoice_total must be >= 0
country_code must be in ISO country code list
event_timestamp must be within expected delay window
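The first three example rules can be expressed as named predicates over a record. This is a sketch under assumptions: records are plain dictionaries, `COUNTRY_CODES` is a tiny illustrative subset rather than the full ISO list, and the timestamp rule is omitted because it needs a delay window this section does not define.

```python
from __future__ import annotations

from typing import Callable

# Illustrative subset only; a real check would load the full ISO 3166 list.
COUNTRY_CODES = {"DE", "FR", "US"}

# Each rule is a name plus a predicate that returns True when the record passes.
RULES: dict[str, Callable[[dict], bool]] = {
    "customer_id must not be null":
        lambda r: r.get("customer_id") is not None,
    "invoice_total must be >= 0":
        lambda r: isinstance(r.get("invoice_total"), (int, float))
        and r["invoice_total"] >= 0,
    "country_code must be in ISO country code list":
        lambda r: r.get("country_code") in COUNTRY_CODES,
}


def failed_rules(record: dict) -> list[str]:
    """Return the names of all rules the record violates, in declaration order."""
    return [name for name, passes in RULES.items() if not passes(record)]


print(failed_rules({"customer_id": "c-1", "invoice_total": -5, "country_code": "DE"}))
# ['invoice_total must be >= 0']
```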
11.33 DataQualityCheck
A DataQualityCheck is an execution of one or more data quality rules.
11.34 DataQualityResult
A DataQualityResult is the outcome of a data quality check.
11.35 DataQualityIssue
A DataQualityIssue is a finding indicating data does not meet a quality rule or fitness expectation.
It may create Task Model remediation work.
11.36 FitnessForUse
FitnessForUse is the degree to which data is suitable for a specific purpose or consumer context.
11.37 DataFlow
A DataFlow is movement or transfer of data between sources, systems, stores, services, actors, or processes.
11.38 DataLineage
DataLineage describes the origin, movement, transformation, derivation, and usage path of data.
Lineage may include:
source dataset
transformation
activity
agent
target dataset
time
evidence
confidence
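A minimal sketch of how such lineage records might be queried: each edge carries a subset of the elements listed above, and a helper walks the edges backwards to find everything a dataset derives from. The class name `LineageEdge`, the function `upstream`, and the dataset names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    source_dataset: str
    target_dataset: str
    transformation: str
    confidence: str  # e.g. "declared", "inferred", "observed"


def upstream(dataset: str, edges: list[LineageEdge]) -> set[str]:
    """Return every dataset that `dataset` is directly or transitively derived from."""
    found: set[str] = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop()
        for edge in edges:
            # The `found` check also guards against cycles in the edge list.
            if edge.target_dataset == current and edge.source_dataset not in found:
                found.add(edge.source_dataset)
                frontier.append(edge.source_dataset)
    return found


edges = [
    LineageEdge("raw_orders", "clean_orders", "deduplicate", "observed"),
    LineageEdge("clean_orders", "revenue_report", "aggregate", "declared"),
    LineageEdge("customers", "revenue_report", "join", "inferred"),
]
print(sorted(upstream("revenue_report", edges)))
# ['clean_orders', 'customers', 'raw_orders']
```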
11.39 Transformation
A Transformation is an activity that changes data structure, content, format, aggregation, classification, or meaning.
11.40 Derivation
A Derivation is a relationship where one data entity is derived from another.
11.41 ProvenanceRecord
A ProvenanceRecord records information about how data came to exist, who or what generated it, what activity produced it, and what source influenced it.
11.42 DataContract
A DataContract is an explicit agreement between data producers and consumers about data structure, semantics, quality, delivery, compatibility, ownership, and change expectations.
11.43 ProducerExpectation
A ProducerExpectation describes what a data producer commits to provide.
Examples:
schema stability
freshness
completeness
availability
documentation
change notice
11.44 ConsumerExpectation
A ConsumerExpectation describes what a data consumer expects or is allowed to assume.
11.45 CompatibilityRule
A CompatibilityRule describes what changes are considered compatible or breaking.
11.46 BreakingChange
A BreakingChange is a data, schema, semantic, quality, or delivery change that violates consumer expectations or compatibility rules.
11.47 SchemaEvolutionPolicy
A SchemaEvolutionPolicy defines rules for how schemas may change over time.
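To make the relationship between CompatibilityRule and BreakingChange concrete, the sketch below compares two schema versions under one assumed policy: removing a field or changing its type is breaking, while adding a field is compatible. That policy is an illustration only; this standard does not prescribe it, and the schema representation as `{field_name: data_type}` is a simplification.

```python
from __future__ import annotations


def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Compare two schema versions given as {field_name: data_type}."""
    changes = []
    for name, data_type in old.items():
        if name not in new:
            changes.append(f"removed field: {name}")
        elif new[name] != data_type:
            changes.append(f"type changed: {name} {data_type} -> {new[name]}")
    # Fields present only in `new` are additions and treated as compatible.
    return changes


v1 = {"order_id": "string", "total": "decimal"}
v2 = {"order_id": "string", "total": "string", "currency": "code"}
print(breaking_changes(v1, v2))  # ['type changed: total decimal -> string']
```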
11.48 DataStoreReference
A DataStoreReference points to a Landscape data store or storage resource.
Examples:
database
table
bucket
file share
topic
queue
index
warehouse
lakehouse table
11.49 DataAccessPattern
A DataAccessPattern describes how data is accessed.
Examples:
batch export
API query
event stream
direct database query
file download
replication
analytics dashboard
11.50 DataFreshness
DataFreshness describes how current data is relative to a defined expectation.
11.51 DataAvailability
DataAvailability describes whether data is accessible according to expectations.
12. Core Relationship Vocabulary
Recommended root relationship types:
contains
part_of
describes
classified_as
has_schema
has_field
has_distribution
provided_by
consumed_by
stored_in
accessed_via
flows_to
derived_from
generated_by
transformed_by
governed_by
constrained_by
subject_to
owned_by
stewarded_by
produced_by
validated_by
violates
satisfies
maps_to
Relationship records SHOULD support:
id:
relationship_type:
source_entity:
target_entity:
scope:
valid_from:
valid_to:
source_system:
confidence:
evidence:
rationale:
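A relationship record of this shape could be represented as follows. This is a sketch, not a prescribed implementation: the class name, the subset of fields, and the choice to reject unknown relationship types at construction time are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Root relationship types from this section.
RELATIONSHIP_TYPES = {
    "contains", "part_of", "describes", "classified_as", "has_schema",
    "has_field", "has_distribution", "provided_by", "consumed_by",
    "stored_in", "accessed_via", "flows_to", "derived_from", "generated_by",
    "transformed_by", "governed_by", "constrained_by", "subject_to",
    "owned_by", "stewarded_by", "produced_by", "validated_by", "violates",
    "satisfies", "maps_to",
}


@dataclass
class RelationshipRecord:
    id: str
    relationship_type: str
    source_entity: str
    target_entity: str
    confidence: Optional[str] = None
    evidence: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: unknown types are rejected rather than silently stored.
        if self.relationship_type not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {self.relationship_type}")


rel = RelationshipRecord(
    id="rel-001",  # hypothetical identifier
    relationship_type="classified_as",
    source_entity="itc-data:dataset/orders",
    target_entity="confidential",
    confidence="declared",
)
print(rel.relationship_type)  # classified_as
```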
13. Data State Models
13.1 Dataset Lifecycle States
proposed
designed
active
deprecated
retired
archived
deleted
13.2 Schema States
draft
candidate
active
deprecated
superseded
retired
13.3 Data Quality States
unknown
unchecked
passing
warning
failing
waived
remediating
verified
13.4 Data Contract States
draft
under_review
active
violated
deprecated
superseded
retired
13.5 Lineage Confidence States
unknown
declared
inferred
observed
verified
conflicting
14. Data Patterns
14.1 Pattern: Data Is Not Its Store
Context: Teams model data by pointing at tables, buckets, or files.
Problem: Storage location does not explain semantic meaning, ownership, classification, quality, or lineage.
Solution: Model Dataset, Schema, DataDistribution, DataStoreReference, and DataLineage separately.
14.2 Pattern: Dataset Catalog Entry
Context: Data consumers need to discover and understand data.
Problem: Data assets remain invisible or are known only through tribal knowledge.
Solution: Provide a catalog entry with:
dataset name
description
owner
steward
classification
schema
distribution
quality expectations
lineage
access method
usage constraints
14.3 Pattern: Data Contract at Boundary
Context: Data crosses a team, service, product, or system boundary.
Problem: Consumers break when producers change data unexpectedly.
Solution: Define a DataContract with schema, semantic expectations, quality rules, compatibility rules, and change process.
14.4 Pattern: Classification Drives Controls
Context: Data has different sensitivity and obligations.
Problem: Systems apply uniform controls or rely on ad hoc judgment.
Solution: Classify data and map classifications to governance controls, access policies, security measures, and retention expectations.
14.5 Pattern: Lineage as Evidence
Context: A derived dataset is used for decisions or compliance.
Problem: Consumers cannot determine origin, transformations, or trustworthiness.
Solution: Model lineage with source datasets, transformations, activities, agents, target datasets, and evidence.
14.6 Pattern: Quality Rule to Remediation
Context: Data quality checks fail.
Problem: Failures stay on dashboards instead of leading to corrective action.
Solution:
DataQualityRule
-> DataQualityCheck
-> DataQualityResult
-> DataQualityIssue
-> RemediationTask
-> VerificationEvidence
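The chain above can be sketched end to end in a few lines. The example is deliberately reduced: DataQualityIssue and VerificationEvidence are folded away, and all class fields and function names are hypothetical rather than defined by this standard.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DataQualityRule:
    name: str
    passes: Callable[[dict], bool]


@dataclass
class DataQualityResult:
    rule: DataQualityRule
    failed_records: int


@dataclass
class RemediationTask:
    title: str
    status: str = "open"


def run_check(rule: DataQualityRule, records: list[dict]) -> DataQualityResult:
    """A DataQualityCheck: execute one rule over a batch of records."""
    failed = sum(1 for r in records if not rule.passes(r))
    return DataQualityResult(rule, failed)


def to_task(result: DataQualityResult) -> Optional[RemediationTask]:
    """Create remediation work only when the check found violations."""
    if result.failed_records == 0:
        return None
    return RemediationTask(
        f"Fix {result.failed_records} record(s) violating: {result.rule.name}"
    )


rule = DataQualityRule(
    "customer_id must not be null",
    lambda r: r.get("customer_id") is not None,
)
result = run_check(rule, [{"customer_id": "c-1"}, {"customer_id": None}])
task = to_task(result)
print(task.title)  # Fix 1 record(s) violating: customer_id must not be null
```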
14.7 Pattern: Semantic Term and Field Split
Context: Database columns are treated as business terms.
Problem: Field names do not fully encode business meaning.
Solution: Link Field to DataElement, BusinessTerm, and DataDefinition.
14.8 Pattern: Retention with Governance Reference
Context: Data must be kept or deleted according to obligations.
Problem: Retention is encoded as undocumented operational behavior.
Solution: Link Dataset or DataObject to RetentionRuleReference and keep the governing obligation in Governance.
15. Data Profiles
15.1 Profile Format
A Data Profile SHALL declare:
id:
profile_name:
status:
implements:
- InfoTechCanonDataModel
target_context:
included_concepts:
required_relationships:
required_metadata:
state_model:
source_of_truth_rules:
mapping_files:
validation_rules:
examples:
known_deviations:
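A profile declaration can be checked for completeness once parsed into a mapping. The sketch below assumes the profile has already been loaded (for example from YAML) and that `implements` must list InfoTechCanonDataModel, as the format above suggests; the function name is hypothetical.

```python
from __future__ import annotations

# Keys a Data Profile declares, in the order given in this section.
REQUIRED_PROFILE_KEYS = [
    "id", "profile_name", "status", "implements", "target_context",
    "included_concepts", "required_relationships", "required_metadata",
    "state_model", "source_of_truth_rules", "mapping_files",
    "validation_rules", "examples", "known_deviations",
]


def profile_problems(profile: dict) -> list[str]:
    """Return missing keys and an error if the base standard is not implemented."""
    problems = [f"missing key: {k}" for k in REQUIRED_PROFILE_KEYS if k not in profile]
    implements = profile.get("implements")
    if implements is not None and "InfoTechCanonDataModel" not in implements:
        problems.append("implements must list InfoTechCanonDataModel")
    return problems


partial = {
    "id": "profile:small-saas",  # hypothetical identifier
    "profile_name": "Small SaaS Data Profile",
    "implements": ["InfoTechCanonDataModel"],
}
print(len(profile_problems(partial)))  # 11
```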
15.2 Seed Profile: Small SaaS Data Profile
Purpose:
Provide a minimal data model for a small SaaS platform moving toward production readiness.
Included concepts:
DataDomain
Dataset
DataObject
Schema
Field
DataClassification
DataStoreReference
DataFlow
DataQualityRule
RetentionRuleReference
DataOwnerReference
DataStewardReference
Required relationships:
Dataset has_schema Schema
Schema has_field Field
Dataset classified_as DataClassification
Dataset stored_in DataStoreReference
Dataset owned_by DataOwnerReference
Dataset stewarded_by DataStewardReference
DataFlow moves Dataset
RetentionRuleReference applies_to Dataset
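As an illustration of how the required relationships might be validated, the sketch below checks the five relationships that originate from a Dataset. It assumes relationships are available as `(source, relationship_type, target)` triples; the identifiers are hypothetical, and the DataFlow and RetentionRuleReference relationships are not covered.

```python
from __future__ import annotations

# Dataset-sourced relationships required by the Small SaaS Data Profile.
DATASET_REQUIRED = ["has_schema", "classified_as", "stored_in", "owned_by", "stewarded_by"]


def missing_for_dataset(dataset_id: str, facts: list[tuple[str, str, str]]) -> list[str]:
    """Return required relationship types the dataset does not yet have."""
    present = {rel for source, rel, _target in facts if source == dataset_id}
    return [rel for rel in DATASET_REQUIRED if rel not in present]


facts = [
    ("dataset/orders", "has_schema", "schema/orders-v1"),
    ("dataset/orders", "classified_as", "confidential"),
    ("dataset/orders", "stored_in", "store/orders-db"),
    ("schema/orders-v1", "has_field", "field/order_id"),
]
print(missing_for_dataset("dataset/orders", facts))  # ['owned_by', 'stewarded_by']
```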
15.3 Seed Profile: Data Catalog Profile
Purpose:
Represent data catalog entries for discoverability and reuse.
Included concepts:
Catalog
Dataset
DatasetSeries
DataDistribution
DataService
DataOwnerReference
DataStewardReference
DataClassification
DataQualitySummary
DataLineageSummary
Mapping targets:
DCAT
DCAT-AP
DataHub
OpenMetadata
Amundsen
Collibra / catalog tools
15.4 Seed Profile: Data Contract Profile
Purpose:
Represent data producer-consumer agreements.
Included concepts:
DataContract
ProducerExpectation
ConsumerExpectation
Schema
SchemaVersion
DataQualityRule
CompatibilityRule
BreakingChange
ChangeNotice
DataContractViolation
Required relationships:
DataContract applies_to Dataset
ProducerExpectation constrains Producer
ConsumerExpectation informs Consumer
CompatibilityRule governs SchemaEvolution
BreakingChange violates DataContract
15.5 Seed Profile: Data Lineage Profile
Purpose:
Represent lineage across datasets, transformations, pipelines, and systems.
Included concepts:
Dataset
SourceDataset
TargetDataset
Transformation
DataFlow
DataLineage
ProvenanceRecord
DataPipelineReference
ActivityReference
AgentReference
Mapping targets:
PROV-O
OpenLineage
Marquez
dbt exposures/models/sources
DataHub lineage
15.6 Seed Profile: Privacy-Relevant Data Profile
Purpose:
Represent data concepts relevant to privacy, data protection, retention, and processing.
Included concepts:
PersonalDataCategory
SensitiveDataCategory
DataSubjectCategory
ProcessingPurpose
DataResidency
RetentionRuleReference
DataUsageConstraint
DataMinimizationExpectation
Governance owns legal obligations and lawful-basis interpretation.
15.7 Seed Profile: Analytics Dataset Profile
Purpose:
Represent analytical datasets, metrics, dimensions, facts, models, and reports.
Included concepts:
Dataset
Metric
Dimension
Fact
Measure
AggregationRule
ReportReference
DashboardReference
DataQualityRule
FreshnessExpectation
16. Mapping Model for the Data Standard
Mappings relate InfoTechCanon data concepts to external standards, frameworks, products, and regulations.
16.1 Mapping Types
Recommended mapping types:
exactMatch
closeMatch
broadMatch
narrowMatch
relatedMatch
conflictMatch
gapMatch
derivedFrom
regulatoryReference
toolEquivalent
16.2 Mapping Record
Example:
id: itc-map:dataset-to-dcat-dataset
source_concept: itc-data:Dataset
target_body: W3C DCAT
target_version: "3"
target_concept: dcat:Dataset
mapping_type: closeMatch
scope:
- data catalog interoperability
not_valid_for:
- all internal schema semantics
- all data product lifecycle semantics
rationale: >
  DCAT Dataset is a strong catalog-oriented match for InfoTechCanon Dataset,
  but InfoTechCanon includes additional governance, quality, contract,
  and lineage expectations that may not be required by DCAT.
confidence: high
status: candidate
owner: InfoTechCanonDataModel
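A mapping record like the one above could be held in a small typed structure that rejects mapping types outside the recommended list. The class name, the reduced field set, and the helper `mappings_for` are illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass

# Mapping types recommended in section 16.1.
MAPPING_TYPES = {
    "exactMatch", "closeMatch", "broadMatch", "narrowMatch", "relatedMatch",
    "conflictMatch", "gapMatch", "derivedFrom", "regulatoryReference",
    "toolEquivalent",
}


@dataclass(frozen=True)
class MappingRecord:
    id: str
    source_concept: str
    target_body: str
    target_concept: str
    mapping_type: str
    confidence: str = "unknown"

    def __post_init__(self) -> None:
        if self.mapping_type not in MAPPING_TYPES:
            raise ValueError(f"unknown mapping type: {self.mapping_type}")


def mappings_for(concept: str, records: list[MappingRecord]) -> list[MappingRecord]:
    """Return all mapping records whose source is the given concept."""
    return [r for r in records if r.source_concept == concept]


dcat = MappingRecord(
    id="itc-map:dataset-to-dcat-dataset",
    source_concept="itc-data:Dataset",
    target_body="W3C DCAT",
    target_concept="dcat:Dataset",
    mapping_type="closeMatch",
    confidence="high",
)
print([r.target_concept for r in mappings_for("itc-data:Dataset", [dcat])])
# ['dcat:Dataset']
```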
16.3 Seed Mapping Targets
The Data Model SHOULD maintain mappings to:
DAMA-DMBOK
W3C DCAT 3
DCAT-AP
W3C PROV-O
ISO/IEC 11179
schema.org Dataset
OpenLineage
DataHub metadata model
OpenMetadata
dbt sources/models/exposures
Great Expectations
Apache Atlas
Collibra / data catalog concepts
GDPR / privacy-regulation references
Dublin Core metadata
SPDX / CycloneDX data references where relevant
17. Assimilation Hooks
The Data Model SHALL be able to receive new data standards, platforms, regulations, product schemas, and practices through the InfoTechCanon assimilation process.
17.1 Assimilation Triggers
Assimilation may be triggered by:
new data catalog model
new data lineage standard
new metadata registry standard
new privacy regulation
new data-quality tool
new data-contract practice
new data-product pattern
new analytics modeling method
new data platform integration
new recurring data classification conflict
17.2 Data Assimilation Output
A data assimilation SHOULD produce:
source summary
extracted data concepts
concept comparison matrix
gap list
conflict list
mapping file
candidate new concepts
candidate relationship changes
candidate pattern changes
candidate profile changes
open questions
17.3 Recommended First Assimilation Candidates
W3C DCAT 3
PROV-O
ISO/IEC 11179
DAMA-DMBOK
OpenLineage
DataHub
OpenMetadata
Great Expectations
dbt semantic layer / metadata
GDPR data categories and processing concepts
18. Integration with Other InfoTechCanon Standards
18.1 Landscape Model
Data references Landscape concepts for:
data store
database
bucket
queue
topic
pipeline
runtime service
application service
endpoint
environment
18.2 Organization Model
Data imports organization concepts for:
data owner
data steward
data custodian
data producer
data consumer
data trustee
responsible team
18.3 Governance Model
Data imports governance concepts for:
policy
retention requirement
processing obligation
control
exception
evidence
review
compliance requirement
18.4 Security Model
Security imports data concepts for:
classification
sensitivity
data category
data subject category
data exposure
residency
data security finding
18.5 Access Control Model
Access Control imports data concepts for:
dataset
data object
data classification
data usage constraint
data access pattern
18.6 Task Model
Data creates or references tasks such as:
data-quality remediation
schema migration
contract review
lineage clarification
classification review
retention cleanup
data incident investigation
18.7 Tagging Standard
Tagging supports data discovery and classification but MUST NOT replace data classification, schema, lineage, quality, or governance records.
19. Canon Interface Card Usage
Subsystems that implement or produce data knowledge SHOULD publish a Canon Interface Card.
Example:
subsystem: data-catalog-importer
implements:
- InfoTechCanonDataModel
- DataCatalogProfile
produces:
- Dataset
- Schema
- Field
- DataDistribution
- DataOwnerReference
consumes:
- Team
- DataStoreReference
- Policy
relations:
- Dataset has_schema Schema
- Schema has_field Field
- Dataset stored_in DataStoreReference
- Dataset owned_by Team
source_of_truth:
  dataset_catalog_entries: data-catalog
known_deviations:
- lineage is summary-only
- data quality checks are imported from a separate system
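A card like this can be checked mechanically. The sketch below assumes the card has been parsed into a Python dict and that each relation is a three-token string of the form `Subject predicate Object`. It reports concepts that appear in a relation without being listed under `produces` or `consumes`. This consistency rule is an assumption made for illustration, not a requirement stated in this standard.

```python
# The example card, written as the dict a YAML parser might return (assumed shape).
CARD = {
    "subsystem": "data-catalog-importer",
    "implements": ["InfoTechCanonDataModel", "DataCatalogProfile"],
    "produces": ["Dataset", "Schema", "Field", "DataDistribution", "DataOwnerReference"],
    "consumes": ["Team", "DataStoreReference", "Policy"],
    "relations": [
        "Dataset has_schema Schema",
        "Schema has_field Field",
        "Dataset stored_in DataStoreReference",
        "Dataset owned_by Team",
    ],
}


def undeclared_relation_concepts(card):
    """Return concepts used in relations but not declared as produced or consumed."""
    declared = set(card.get("produces", [])) | set(card.get("consumes", []))
    missing = []
    for relation in card.get("relations", []):
        parts = relation.split()
        if len(parts) != 3:
            raise ValueError(f"expected 'Subject predicate Object', got: {relation!r}")
        subject, _predicate, obj = parts
        for concept in (subject, obj):
            if concept not in declared and concept not in missing:
                missing.append(concept)
    return missing


print(undeclared_relation_concepts(CARD))
```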
20. Retrieval Requirements
The Data Model is designed for markdown-based infospaces.
20.1 Required Retrieval Properties
Every major concept SHOULD provide:
- stable heading,
- stable identifier,
- short definition,
- longer explanation,
- examples,
- distinction notes,
- relationship examples,
- mapping hooks,
- profile references,
- and common mistakes.
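The list above lends itself to a simple completeness check. The following is a minimal sketch that assumes each concept page has been loaded into a dict; the key names are hypothetical labels chosen for this example, one per required property.

```python
# One hypothetical key per retrieval property listed in 20.1.
REQUIRED_PROPERTIES = (
    "heading",
    "identifier",
    "short_definition",
    "explanation",
    "examples",
    "distinction_notes",
    "relationship_examples",
    "mapping_hooks",
    "profile_references",
    "common_mistakes",
)


def missing_retrieval_properties(concept):
    """Return the required properties that are absent or empty on a concept page."""
    return [name for name in REQUIRED_PROPERTIES if not concept.get(name)]


dataset_page = {
    "heading": "Dataset",
    "identifier": "data.dataset",
    "short_definition": "A managed collection of data.",
    "examples": ["customer orders"],
}
print(missing_retrieval_properties(dataset_page))
```

Because the properties are SHOULD-level, the result is better treated as a list of gaps to close than as a hard failure.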
20.2 Agent Brief
A mature Data Model SHOULD include an agent-brief.md file with:
purpose
scope
owned concepts
imported concepts
core distinctions
do / do not rules
relationship patterns
minimal examples
common mistakes
profile list
mapping list
20.3 Indexes
The data information space SHOULD provide indexes by:
concept
relationship
data domain
dataset
schema
field
classification
quality rule
lineage
contract
profile
pattern
mapping target
status
source system
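One way to produce such indexes is to group record identifiers by a chosen attribute. The sketch below assumes records are dicts with an `id` key; the attribute names used (`classification`, `source_system`) are illustrative only.

```python
from collections import defaultdict


def build_index(records, key):
    """Group record ids by the value of `key`, skipping records that lack it."""
    index = defaultdict(list)
    for record in records:
        value = record.get(key)
        if value is not None:
            index[value].append(record["id"])
    return dict(index)


records = [
    {"id": "ds-orders", "classification": "confidential", "source_system": "erp"},
    {"id": "ds-invoices", "classification": "confidential", "source_system": "erp"},
    {"id": "ds-weather", "classification": "public"},
]
print(build_index(records, "classification"))
print(build_index(records, "source_system"))
```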
21. Conformance Levels
21.1 Reference-Conformant
A document or system is reference-conformant if it uses Data Model terminology consistently but does not implement structured metadata or validation rules.
21.2 Metadata-Conformant
A system is metadata-conformant if it uses stable identifiers, concept names, lifecycle states, source metadata, and relationship types.
21.3 Catalog-Conformant
A system is catalog-conformant if datasets, distributions, data services, owners, stewards, descriptions, and classifications are represented.
21.4 Lineage-Conformant
A system is lineage-conformant if it represents data sources, transformations, targets, provenance, and confidence.
21.5 Quality-Conformant
A system is quality-conformant if it represents data quality rules, checks, results, and issues.
21.6 Contract-Conformant
A system is contract-conformant if producer and consumer expectations are represented as DataContracts.
21.7 Profile-Conformant
A system is profile-conformant if it implements a declared Data Profile and passes its validation rules.
21.8 Assimilation-Conformant
A system or repository is assimilation-conformant if it can accept external data concepts through the InfoTechCanon assimilation workflow and produce mappings, gaps, conflicts, and proposed changes.
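Several of these levels are defined by which concepts a system represents, so they can be approximated as set-inclusion checks. The sketch below covers only the catalog, lineage, quality, and contract levels. The capability labels are hypothetical names derived from the wording of 21.3 to 21.6, and the remaining levels are left out because they depend on terminology use, validation rules, or workflow behavior rather than on a concept set.

```python
# Required capabilities per level, using hypothetical labels based on 21.3-21.6.
REQUIRED_BY_LEVEL = {
    "catalog-conformant": {
        "dataset", "distribution", "data_service", "owner",
        "steward", "description", "classification",
    },
    "lineage-conformant": {"source", "transformation", "target", "provenance", "confidence"},
    "quality-conformant": {"quality_rule", "quality_check", "quality_result", "quality_issue"},
    "contract-conformant": {"data_contract"},
}


def conformance_levels(represented):
    """Return the levels whose required capabilities are all represented."""
    represented = set(represented)
    return [level for level, required in REQUIRED_BY_LEVEL.items() if required <= represented]


def missing_for(level, represented):
    """Return the capabilities still missing for a given level, sorted by name."""
    return sorted(REQUIRED_BY_LEVEL[level] - set(represented))


system = {
    "dataset", "distribution", "data_service", "owner", "steward",
    "description", "classification", "quality_rule", "quality_check",
}
print(conformance_levels(system))
print(missing_for("quality-conformant", system))
```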
22. Validation Rules
Initial validation rules:
VAL-DATA-001: Dataset SHOULD NOT be modeled as identical to DataStoreReference.
VAL-DATA-002: Dataset SHOULD have owner or steward reference when used for operational or governed purposes.
VAL-DATA-003: Dataset SHOULD have classification when it may contain sensitive, regulated, operationally critical, or business-critical data.
VAL-DATA-004: Schema SHOULD have version when used across system boundaries.
VAL-DATA-005: Field SHOULD be distinguishable from DataElement where semantic precision matters.
VAL-DATA-006: DataQualityRule SHOULD declare the dataset, field, or data object it applies to.
VAL-DATA-007: DataQualityResult SHOULD reference the executed rule and check.
VAL-DATA-008: DataLineage SHOULD distinguish declared, inferred, observed, and verified lineage.
VAL-DATA-009: DataContract SHOULD declare producer, consumer, dataset, schema or semantic expectations, quality expectations, and compatibility rules where applicable.
VAL-DATA-010: BreakingChange SHOULD reference the DataContract or CompatibilityRule it violates.
VAL-DATA-011: RetentionRuleReference SHOULD point to Governance concepts rather than embedding legal interpretation in Data.
VAL-DATA-012: DataResidency SHOULD reference region, jurisdiction, environment, or storage/processing scope where available.
VAL-DATA-013: Tags MUST NOT replace DataClassification, Schema, Lineage, Quality, or Contract records.
VAL-DATA-014: External data concepts SHOULD be represented through mapping records rather than silently reused.
VAL-DATA-015: Profiles MUST NOT redefine canonical concepts. They MAY constrain them.
VAL-DATA-016: Data used for AI training, analytics, or automation SHOULD declare usage constraints and provenance where relevant.
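A few of these rules can be expressed directly as checks. The following minimal sketch covers VAL-DATA-002, VAL-DATA-003, and VAL-DATA-004 only. The dataclass fields (`governed`, `may_contain_sensitive_data`, `crosses_system_boundary`, and the reference fields) are assumptions made for the example, not attributes defined by this standard. Because the rules are SHOULD-level, the functions return findings instead of raising errors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dataset:
    id: str
    governed: bool = False                    # used for operational or governed purposes
    may_contain_sensitive_data: bool = False  # sensitive, regulated, or critical data
    owner_ref: Optional[str] = None
    steward_ref: Optional[str] = None
    classification: Optional[str] = None


@dataclass
class Schema:
    id: str
    crosses_system_boundary: bool = False
    version: Optional[str] = None


def check_dataset(ds: Dataset) -> List[str]:
    """Return the ids of SHOULD-level rules the dataset does not satisfy."""
    findings = []
    if ds.governed and not (ds.owner_ref or ds.steward_ref):
        findings.append("VAL-DATA-002")
    if ds.may_contain_sensitive_data and ds.classification is None:
        findings.append("VAL-DATA-003")
    return findings


def check_schema(schema: Schema) -> List[str]:
    """Return VAL-DATA-004 when a boundary-crossing schema has no version."""
    if schema.crosses_system_boundary and not schema.version:
        return ["VAL-DATA-004"]
    return []


orders = Dataset(id="ds-orders", governed=True, may_contain_sensitive_data=True)
print(check_dataset(orders))
print(check_schema(Schema(id="orders-v?", crosses_system_boundary=True)))
```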
23. Anti-Patterns
23.1 Table Equals Dataset
Treating every table as a complete dataset and every dataset as a table.
23.2 Schema Equals Meaning
Assuming column names and types fully define business meaning.
23.3 Classification by Tag Only
Using tags such as "confidential" without a governed DataClassification record.
23.4 Lineage by Diagram Only
Drawing flows without source, transformation, target, evidence, or confidence.
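To make the contrast concrete, the sketch below models a single lineage edge with the five elements named above and reports which ones are missing. The class and field names are hypothetical. The status values follow the declared, inferred, observed, and verified distinction from VAL-DATA-008.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class LineageStatus(Enum):
    DECLARED = "declared"
    INFERRED = "inferred"
    OBSERVED = "observed"
    VERIFIED = "verified"


@dataclass
class LineageEdge:
    source: Optional[str] = None
    transformation: Optional[str] = None
    target: Optional[str] = None
    evidence: Optional[str] = None
    confidence: Optional[float] = None  # assumed to be a value between 0.0 and 1.0
    status: LineageStatus = LineageStatus.DECLARED


def diagram_only_gaps(edge: LineageEdge) -> List[str]:
    """Return the lineage elements that are missing from an edge."""
    gaps = [
        name
        for name in ("source", "transformation", "target", "evidence")
        if not getattr(edge, name)
    ]
    # Confidence is checked separately because 0.0 is a legitimate value.
    if edge.confidence is None:
        gaps.append("confidence")
    return gaps


arrow_on_a_diagram = LineageEdge(source="ds-orders", target="ds-revenue")
print(diagram_only_gaps(arrow_on_a_diagram))
```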
23.5 Quality Dashboard Graveyard
Tracking quality failures without owners, tasks, remediation, or fitness-for-use decisions.
23.6 Contract-Free Integration
Letting consumers depend on producer data without explicit compatibility expectations.
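An explicit contract makes such expectations checkable. The following sketch compares a proposed field set against a contract and emits BreakingChange records that reference the contract, in the spirit of VAL-DATA-010. The compatibility rule used here (removing or retyping a contracted field is breaking, adding a field is not) is an assumption for illustration. Real contracts would declare their own CompatibilityRules, and the class shapes are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataContract:
    id: str
    producer: str
    consumer: str
    dataset: str
    fields: Dict[str, str]  # field name -> expected type


@dataclass(frozen=True)
class BreakingChange:
    contract_id: str
    field: str
    reason: str


def find_breaking_changes(contract: DataContract, proposed: Dict[str, str]) -> List[BreakingChange]:
    """Compare proposed fields against the contract under the assumed rule."""
    changes = []
    for name, expected_type in contract.fields.items():
        if name not in proposed:
            changes.append(BreakingChange(contract.id, name, "field removed"))
        elif proposed[name] != expected_type:
            reason = f"type changed from {expected_type} to {proposed[name]}"
            changes.append(BreakingChange(contract.id, name, reason))
    return changes


contract = DataContract(
    id="dc-orders-billing",
    producer="orders-service",
    consumer="billing-service",
    dataset="ds-orders",
    fields={"order_id": "string", "amount": "decimal", "currency": "string"},
)
proposed = {"order_id": "string", "amount": "float", "note": "string"}
for change in find_breaking_changes(contract, proposed):
    print(change)
```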
23.7 Hidden Retention Logic
Deleting or keeping data based on undocumented scripts or tribal knowledge.
23.8 Catalog Without Trust
Cataloging datasets without owner, freshness, classification, quality, or lineage.
23.9 Privacy in Free Text
Recording processing purpose, data subject category, residency, or sensitivity as unstructured notes only.
23.10 Vendor Model Capture
Letting one data catalog, warehouse, or governance product define the internal data model.
24. Initial Repository Placement
Recommended repository layout:
info-tech-canon/
  standards/
    data/
      InfoTechCanonDataModel.md
      agent-brief.md
      concepts/
      relationships/
      patterns/
      profiles/
      mappings/
      assimilation/
      examples/
      validation/
Seed files:
standards/data/InfoTechCanonDataModel.md
standards/data/agent-brief.md
standards/data/concepts/dataset.md
standards/data/concepts/data-product.md
standards/data/concepts/schema.md
standards/data/concepts/data-element.md
standards/data/concepts/data-classification.md
standards/data/concepts/data-lineage.md
standards/data/concepts/data-quality-rule.md
standards/data/concepts/data-contract.md
standards/data/patterns/data-is-not-its-store.md
standards/data/patterns/dataset-catalog-entry.md
standards/data/patterns/data-contract-at-boundary.md
standards/data/patterns/lineage-as-evidence.md
standards/data/profiles/small-saas-data-profile.md
standards/data/profiles/data-catalog-profile.md
standards/data/profiles/data-contract-profile.md
standards/data/profiles/data-lineage-profile.md
standards/data/mappings/dcat.yaml
standards/data/mappings/prov-o.yaml
standards/data/mappings/iso-11179.yaml
standards/data/mappings/dama-dmbok.yaml
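A repository can be checked against this seed list with a short script. The sketch below assumes the paths are relative to the repository root; the function name is hypothetical.

```python
from pathlib import Path
from typing import List

# Seed files as listed in this section, relative to the repository root.
SEED_FILES = [
    "standards/data/InfoTechCanonDataModel.md",
    "standards/data/agent-brief.md",
    "standards/data/concepts/dataset.md",
    "standards/data/concepts/data-product.md",
    "standards/data/concepts/schema.md",
    "standards/data/concepts/data-element.md",
    "standards/data/concepts/data-classification.md",
    "standards/data/concepts/data-lineage.md",
    "standards/data/concepts/data-quality-rule.md",
    "standards/data/concepts/data-contract.md",
    "standards/data/patterns/data-is-not-its-store.md",
    "standards/data/patterns/dataset-catalog-entry.md",
    "standards/data/patterns/data-contract-at-boundary.md",
    "standards/data/patterns/lineage-as-evidence.md",
    "standards/data/profiles/small-saas-data-profile.md",
    "standards/data/profiles/data-catalog-profile.md",
    "standards/data/profiles/data-contract-profile.md",
    "standards/data/profiles/data-lineage-profile.md",
    "standards/data/mappings/dcat.yaml",
    "standards/data/mappings/prov-o.yaml",
    "standards/data/mappings/iso-11179.yaml",
    "standards/data/mappings/dama-dmbok.yaml",
]


def missing_seed_files(repo_root: Path) -> List[str]:
    """Return the seed file paths that do not exist under repo_root."""
    return [rel for rel in SEED_FILES if not (repo_root / rel).is_file()]


if __name__ == "__main__":
    # Checks the current working directory, assumed to be the repository root.
    for rel in missing_seed_files(Path(".")):
        print(f"missing: {rel}")
```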
25. Roadmap
Phase 1: Seed Stabilization
- Establish this standard as InfoTechCanonDataModel.
- Add seed concepts, relationship vocabulary, patterns, and profiles.
- Define validation rules.
- Align with Landscape, Organization, Governance, Security, Access Control, Task, and Tagging.
Phase 2: First Assimilations
Recommended first assimilations:
W3C DCAT 3
PROV-O
ISO/IEC 11179
DAMA-DMBOK
OpenLineage
DataHub
OpenMetadata
Great Expectations
dbt metadata
GDPR data category concepts
Phase 3: Profile Maturation
- Mature Small SaaS Data Profile.
- Mature Data Catalog Profile.
- Mature Data Contract Profile.
- Mature Data Lineage Profile.
- Mature Privacy-Relevant Data Profile.
- Mature Analytics Dataset Profile.
Phase 4: Tooling Integration
- Generate concept indexes.
- Generate agent brief.
- Create machine-readable YAML/JSON exports.
- Add validation scripts.
- Integrate data catalog, lineage, data-quality, schema registry, and contract tooling.
Phase 5: Data Intelligence Loop
- Connect datasets to services and repositories.
- Connect classification to access control and security.
- Connect quality issues to tasks.
- Connect lineage to provenance and assurance.
- Connect data contracts to DevSecOps and release workflows.
- Connect privacy and retention to governance obligations.
26. Summary
The InfoTechCanon Data Model is the seed standard for representing data as a managed, governed, discoverable, reusable, classifiable, lineage-bearing, and quality-assessable asset.
Its most important commitments are:
Separate data from storage.
Separate dataset, schema, field, data element, data object, and data product.
Treat classification, lineage, quality, retention, residency, and processing purpose as first-class concerns.
Use data contracts at producer-consumer boundaries.
Import governance, access-control, security, task, tagging, organization, and landscape concepts instead of redefining them.
Map to DCAT, PROV-O, ISO/IEC 11179, DAMA-DMBOK, OpenLineage, and catalog tools without surrendering internal semantic autonomy.
Use profiles to make the model practical for SaaS systems, catalogs, contracts, lineage, privacy-relevant data, analytics, and AI/agentic workflows.
This makes the Data Model a core seed for information architecture, data governance, security posture, AI readiness, analytics reliability, and interoperable information-processing systems.