# InfoTechCanon Data Model

**Short Name:** `ITC-DATA`
**Document Status:** Seed Standard Release Candidate 1
**Version:** RC1-seed
**Date:** 2026-05-22
**Repository Context:** `info-tech-canon`
**Document Type:** InfoTechCanon Domain Standard
**Intended Audience:** Data architects, data engineers, data stewards, platform engineers, governance designers, security architects, application architects, product owners, knowledge-system builders, compliance reviewers, AI/analytics teams, and agentic tooling.

---

# 1. Purpose

The **InfoTechCanon Data Model** defines a canonical seed model for representing data as a managed, governed, discoverable, classifiable, lineage-bearing, quality-assessable, and reusable information asset.

It exists to give data its own canonical domain instead of leaving data semantics scattered across landscape, security, governance, DevSecOps, observability, and application models.

This standard provides a canonical vocabulary for:

- data domains,
- datasets,
- data products,
- data objects,
- records,
- fields,
- schemas,
- data elements,
- code lists,
- data stores as references,
- data flows,
- data lineage,
- data quality,
- metadata,
- catalogs,
- distributions,
- data services,
- data classification,
- sensitivity,
- residency,
- retention,
- processing purpose,
- data ownership and stewardship references,
- data contracts,
- and data evidence.

---

# 2. Position in InfoTechCanon

The Data Model is a **domain standard** within InfoTechCanon.

It depends on the existing seed standards as follows:

```text
Landscape = where data is stored, processed, moved, and exposed.
Organization = data owners, stewards, custodians, producers, consumers.
Governance = data policies, obligations, controls, evidence, exceptions.
Security = data exposure, data-security findings, data attack paths.
Access Control = permissions and grants to data resources.
Task = data-quality work, migration work, remediation, reviews.
Tagging = lightweight classification and retrieval.
Data = datasets, schemas, metadata, lineage, quality, classification, retention.
```

```text
InfoTechCanon
├── InfoTechCanonCore
├── InfoTechCanonLandscapeModel
├── InfoTechCanonOrganizationModel
├── InfoTechCanonGovernanceModel
├── InfoTechCanonTaskModel
├── InfoTechCanonTaggingStandard
├── InfoTechCanonAccessControlModel
├── InfoTechCanonSecurityModel
├── InfoTechCanonDataModel <-- this standard
├── InfoTechCanonDevSecOpsModel
├── InfoTechCanonNetworkModel
├── InfoTechCanonObservabilityModel
├── InfoTechCanonPatternLanguage
└── Application Profiles
```

---

# 3. Boundary with Adjacent Standards

## 3.1 Boundary with Landscape

The Landscape Model owns:

```text
DataStore
DatabaseInstance
ObjectBucket
FileShare
Queue
Cache
RuntimeResource
ApplicationService
IntegrationFlow
Endpoint
```

The Data Model owns:

```text
Dataset
DataProduct
DataObject
Schema
Field
DataElement
DataFlow
DataLineage
DataClassification
DataQualityRule
DataContract
DataDistribution
```

Boundary rule:

```text
Landscape owns the technical and runtime places where data lives or moves.
Data owns the semantic, structural, quality, classification, and lineage meaning of data.
```

## 3.2 Boundary with Governance

The Governance Model owns:

```text
Policy
Requirement
Obligation
Control
Risk
Exception
Evidence
Review
Approval
ComplianceRequirement
```

The Data Model owns data-specific structures that are governed:

```text
RetentionRuleReference
ProcessingPurpose
DataClassification
DataQualityRule
DataContract
DataLineage
```

Boundary rule:

```text
Governance defines why data must be governed.
Data defines what data is and how it is described, classified, measured, and traced.
```

## 3.3 Boundary with Security

The Security Model owns:

```text
DataSecurityFinding
ExposureFinding
CredentialExposure
SecurityIncident
AttackPath
Mitigation
```

The Data Model owns:

```text
Sensitivity
Classification
DataResidency
DataSubjectCategory
DataCategory
DataLineage
```

Security may use these for posture analysis.

## 3.4 Boundary with Access Control

Access Control owns permissions, grants, authorization decisions, and enforcement.

Data owns data resources and classifications that access policies may use.

Example:

```text
Dataset classified_as Confidential
AccessPolicy permits Role to read Dataset
AuthorizationDecision permits read on Dataset
```

## 3.5 Boundary with Organization

Organization owns actors and responsibilities.

Data references Organization concepts for:

```text
DataOwner
DataSteward
DataCustodian
DataProducer
DataConsumer
DataTrustee
```

## 3.6 Boundary with DevSecOps

DevSecOps owns source, build, artifact, pipeline, release, deployment, SBOM, and attestation semantics.

Data owns data contracts, schema evolution, migration data, test data, synthetic data, lineage, and data-quality semantics.

---

# 4. Research Basis and External Alignment

This seed standard draws on multiple data-management bodies of knowledge.

## 4.1 DAMA-DMBOK

DAMA-DMBOK is a broad reference for data management disciplines including data governance, architecture, modeling, storage, security, integration, documents/content, reference/master data, warehousing/BI, metadata, and data quality. InfoTechCanon uses it as a broad mapping and assimilation target, not as a direct controlling model.

## 4.2 DCAT

W3C DCAT defines a vocabulary for data catalogs. DCAT Version 3 organizes catalog access around datasets, distributions, data services, and dataset series. This is highly relevant for InfoTechCanon catalog, dataset, distribution, and data-service concepts.

## 4.3 PROV-O

W3C PROV-O models provenance using entities, activities, and agents.
This is highly relevant for data lineage, derivation, generation, transformation, and responsibility.

## 4.4 ISO/IEC 11179

ISO/IEC 11179 provides a metadata registry framework for data elements, naming, identification, definitions, classification, and registration. It is an important mapping target for data element, representation, data definition, code list, and metadata registry concepts.

## 4.5 Data Mesh and Data Products

Data product thinking emphasizes ownership, discoverability, quality, fitness for use, service-like interfaces, and domain responsibility. InfoTechCanon should support data products without requiring a specific data-mesh organizational model.

## 4.6 Data Contracts

Data contracts define expectations between producers and consumers around schema, semantics, quality, delivery, compatibility, ownership, and change management. They are critical for reliable information-processing systems.

## 4.7 Privacy and Data Protection Practice

Privacy and data-protection practice contributes concepts such as personal data, sensitive data, data subject, processing purpose, lawful basis, retention, residency, and minimization. The Data Model provides data semantics, while Governance owns legal obligations and Security owns data exposure and incident semantics.

---

# 5. Seed Standard Design Stance

This standard is a **seed standard**, not a full data-governance or database-design manual.

It shall:

1. define canonical data semantics,
2. distinguish data from storage infrastructure,
3. distinguish dataset, data product, data object, schema, field, and data element,
4. support data classification, lineage, quality, retention, residency, and processing purpose,
5. support catalog and discovery concepts,
6. support data contracts and schema evolution,
7. support operational, analytical, reference, master, event, and document data,
8. support mappings to external standards without becoming subordinate to them,
9. remain markdown-first and agent-retrievable,
10. and support future assimilation of data standards, platforms, regulations, and product schemas.

---

# 6. Scope

## 6.1 In Scope

This standard covers canonical representation of:

- data domains,
- data products,
- datasets,
- dataset series,
- data distributions,
- data services,
- data objects,
- entities,
- records,
- fields,
- attributes,
- data elements,
- schemas,
- schema versions,
- code lists,
- reference data,
- master data references,
- metadata,
- catalogs,
- data lineage,
- data flows,
- data transformations,
- data quality rules,
- data quality results,
- data contracts,
- data classification,
- sensitivity,
- confidentiality level,
- integrity expectation,
- availability expectation,
- retention rules as data semantics,
- data residency,
- data minimization,
- processing purpose,
- data subject categories,
- data provenance,
- data ownership and stewardship references,
- and data lifecycle states.

## 6.2 Out of Scope

This standard does not fully define:

- database engine internals,
- storage infrastructure,
- full data warehouse architecture,
- full analytics modeling,
- full privacy-law interpretation,
- full data-governance process,
- full security incident handling,
- all ontology modeling,
- all semantic-web representation,
- complete ETL/ELT implementation,
- or every vendor-specific data catalog schema.

Those may be mapped, assimilated, profiled, or handled by adjacent standards.

---

# 7. Normative Language

The following terms are used normatively:

- **SHALL** indicates a mandatory rule for conformance.
- **SHOULD** indicates a recommended practice.
- **MAY** indicates an optional capability.
- **MUST NOT** indicates a prohibited practice.
- **SEED** marks a concept defined provisionally here but open to later refinement.
- **EXTRACT** marks a concept that may later move to a more specialized standard.

---

# 8. Core Principles

## 8.1 Data Is Not Its Store

A dataset is not the same thing as a database, bucket, table, file, topic, or API.
Storage and runtime locations are Landscape concepts. Data semantics belong here.

## 8.2 Dataset Is Not Schema

A dataset may have one or more schemas, distributions, versions, contracts, lineage records, and quality expectations.

## 8.3 Schema Is Not Meaning

A schema describes structure. It does not fully define business meaning, ownership, usage constraints, quality, or purpose.

## 8.4 Classification Is First-Class

Data classification and sensitivity SHOULD be explicit where data has security, privacy, compliance, operational, or business significance.

## 8.5 Lineage Is Evidence-Carrying

Lineage SHOULD identify source data, transformations, activities, agents, and derived outputs with confidence and evidence where possible.

## 8.6 Data Quality Is Contextual

Data quality depends on intended use, domain meaning, contract expectations, and consumer needs.

## 8.7 Data Contracts Make Data Reliable

Producer-consumer expectations SHOULD be explicit when data is reused across system boundaries.

## 8.8 External Standards Are Mapped, Not Obeyed

The Data Model MAY map to DAMA-DMBOK, DCAT, PROV-O, ISO/IEC 11179, schema.org, OpenLineage, DataHub, OpenMetadata, dbt, Great Expectations, or similar standards and tools. It MUST NOT subordinate its internal semantics to any single external model.

---

# 9. Canonical Seed Metadata

Every data artifact SHOULD support structured metadata.

Recommended front matter:

```yaml
---
id: itc-data:Dataset
type: concept
standard: InfoTechCanonDataModel
standard_version: RC1-seed
status: candidate
canonical_owner: InfoTechCanonDataModel
preferred_label: Dataset
related:
  - itc-data:DataProduct
  - itc-data:Schema
  - itc-data:DataDistribution
  - itc-data:DataLineage
mappings:
  - itc-map:dataset-to-dcat-dataset
---
```

Recommended artifact statuses:

```text
idea
draft
candidate
release-candidate
adopted
stable
deprecated
retired
```

Recommended concept statuses:

```text
proposed
experimental
candidate
canonical
deprecated
retired
```

---

# 10. Root Data Taxonomy

```text
DataEntity
├── DataAssetEntity
│   ├── DataDomain
│   ├── DataProduct
│   ├── Dataset
│   ├── DatasetSeries
│   ├── DataDistribution
│   ├── DataService
│   ├── DataObject
│   ├── Record
│   └── DocumentData
├── StructureEntity
│   ├── Schema
│   ├── SchemaVersion
│   ├── Field
│   ├── Attribute
│   ├── DataElement
│   ├── DataElementConcept
│   ├── Representation
│   ├── DataType
│   ├── Constraint
│   └── CodeList
├── SemanticEntity
│   ├── BusinessTerm
│   ├── GlossaryTerm
│   ├── ConceptualEntity
│   ├── DataDefinition
│   ├── ReferenceData
│   ├── MasterDataReference
│   └── CanonicalValue
├── GovernanceReferenceEntity
│   ├── DataClassification
│   ├── Sensitivity
│   ├── DataCategory
│   ├── DataSubjectCategory
│   ├── ProcessingPurpose
│   ├── RetentionRuleReference
│   ├── DataResidency
│   └── DataUsageConstraint
├── QualityEntity
│   ├── DataQualityDimension
│   ├── DataQualityRule
│   ├── DataQualityCheck
│   ├── DataQualityResult
│   ├── DataQualityIssue
│   └── FitnessForUse
├── LineageEntity
│   ├── DataFlow
│   ├── DataLineage
│   ├── Transformation
│   ├── Derivation
│   ├── SourceDataset
│   ├── TargetDataset
│   └── ProvenanceRecord
├── ContractEntity
│   ├── DataContract
│   ├── ProducerExpectation
│   ├── ConsumerExpectation
│   ├── CompatibilityRule
│   ├── BreakingChange
│   └── SchemaEvolutionPolicy
└── OperationalDataEntity
    ├── DataPipelineReference
    ├── DataStoreReference
    ├── QueryReference
    ├── DataAccessPattern
    ├── DataFreshness
    └── DataAvailability
```

---

# 11. Core Concepts

## 11.1 DataEntity

A **DataEntity** is any identifiable concept used to represent data, metadata, structure, classification, quality, lineage, contract, or data lifecycle.

Recommended attributes:

```yaml
id:
entity_type:
canonical_name:
display_name:
lifecycle_state:
source_system:
created_at:
updated_at:
```

Optional attributes:

```yaml
owner:
steward:
data_domain:
classification:
source_confidence:
valid_from:
valid_to:
tags:
external_references:
```

---

## 11.2 DataDomain

A **DataDomain** is a bounded area of data meaning, ownership, stewardship, or subject matter.

Examples:

```text
customer billing product identity orders support security operations finance
```

---

## 11.3 DataProduct

A **DataProduct** is a managed data asset or set of data assets offered for use by consumers with explicit ownership, quality expectations, documentation, interfaces, and lifecycle.

Recommended attributes:

```yaml
owner:
steward:
producer:
consumers:
service_level_expectations:
quality_expectations:
contract:
distribution_methods:
```

---

## 11.4 Dataset

A **Dataset** is a coherent collection of data published, managed, processed, analyzed, or consumed as a unit.

A dataset may have:

```text
schema
distribution
catalog entry
classification
lineage
quality rules
owner
steward
contract
retention expectation
```

Canonical rule:

```text
Dataset MUST NOT be treated as identical to its storage location.
```

---

## 11.5 DatasetSeries

A **DatasetSeries** is a sequence or family of related datasets organized over time, version, geography, domain, or release.

---

## 11.6 DataDistribution

A **DataDistribution** is an accessible representation of a dataset.

Examples:

```text
CSV file Parquet file API response database table export event stream report download object storage path
```

---

## 11.7 DataService

A **DataService** is a service that provides access to data or operations over data.

Examples:

```text
query API
data product API
metadata API
streaming endpoint
analytics service
```

---

## 11.8 DataObject

A **DataObject** is a meaningful object or structure represented in data.

Examples:

```text
Customer
Invoice
Order
Payment
Product
Device
UserProfile
AccessGrant
SecurityFinding
```

---

## 11.9 Record

A **Record** is an instance-level representation of data about an entity, event, relationship, or observation.

---

## 11.10 Field

A **Field** is a named component of a schema, record, message, or table.

---

## 11.11 Attribute

An **Attribute** is a property of a data object or conceptual entity.

A field may represent an attribute, but field is structural while attribute is semantic.

---

## 11.12 DataElement

A **DataElement** is a defined unit of data with meaning, representation, and expected usage.

It may map to ISO/IEC 11179 data element concepts.

Recommended attributes:

```yaml
object_class:
property:
representation:
data_type:
definition:
permitted_values:
```

---

## 11.13 DataElementConcept

A **DataElementConcept** is the semantic idea of a data element independent of representation.

Example:

```text
Customer birth date
Invoice total amount
Repository default branch name
```

---

## 11.14 Representation

A **Representation** describes how a data element is represented.

Examples:

```text
string
integer
decimal
boolean
date
timestamp
code
identifier
URI
```

---

## 11.15 DataType

A **DataType** specifies the technical or logical type of a field or data element.

---

## 11.16 Constraint

A **Constraint** is a rule limiting valid data.

Examples:

```text
required
unique
minimum
maximum
regex
foreign key
enum
format
cardinality
```

---

## 11.17 CodeList

A **CodeList** is a controlled set of allowed values with definitions.

Examples:

```text
country codes
currency codes
status codes
classification labels
risk levels
```

---

## 11.18 BusinessTerm

A **BusinessTerm** is a term used by domain actors to describe data meaning.

---

## 11.19 GlossaryTerm

A **GlossaryTerm** is a documented term in a glossary with definition, synonyms, ownership, and mappings.
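
The structural/semantic split described in 11.10 through 11.13 can be illustrated with a minimal Python sketch. It is not a normative binding: the class layout, the `fields_by_element` helper, the rename of `property` to `property_name`, and all schema and column names are assumptions made for illustration. It shows two differently named and typed fields resolving to the same data element, "Customer birth date".

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElement:
    """Semantic unit of data (11.12), using a subset of the recommended attributes."""
    object_class: str
    property_name: str  # 'property' in 11.12; renamed here to avoid the Python builtin
    representation: str
    data_type: str
    definition: str


@dataclass(frozen=True)
class Field:
    """Structural component of a schema (11.10)."""
    schema: str
    name: str
    data_type: str
    data_element: Optional[DataElement] = None  # link from structure to meaning


def fields_by_element(fields):
    """Group qualified field names by the DataElement they represent (hypothetical helper)."""
    grouped = defaultdict(list)
    for f in fields:
        if f.data_element is None:
            key = "<unlinked>"
        else:
            key = f"{f.data_element.object_class} {f.data_element.property_name}"
        grouped[key].append(f"{f.schema}.{f.name}")
    return dict(grouped)


birth_date = DataElement(
    object_class="Customer",
    property_name="birth date",
    representation="date",
    data_type="date",
    definition="Calendar date on which the customer was born.",
)

# Illustrative schemas and columns; none of these names come from the standard.
fields = [
    Field("crm.customers", "dob", "DATE", birth_date),
    Field("billing.accounts", "cust_birth_dt", "VARCHAR(10)", birth_date),
    Field("billing.accounts", "tmp_flag", "CHAR(1)"),
]

index = fields_by_element(fields)
print(index)
```

Fields left under `<unlinked>` are the ones whose names alone carry their meaning, which is the situation the Field/DataElement distinction is meant to expose.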

---

## 11.20 DataDefinition

A **DataDefinition** is a textual or structured definition explaining the meaning, scope, and intended use of a data concept.

---

## 11.21 ReferenceData

**ReferenceData** is data used to classify, categorize, or constrain other data.

Examples:

```text
country list
currency list
product category list
status code list
business unit list
```

---

## 11.22 MasterDataReference

A **MasterDataReference** points to a controlled source of core business entities.

Examples:

```text
customer master
product master
supplier master
employee master
```

The Data Model references master-data semantics but does not require a specific MDM architecture.

---

## 11.23 DataClassification

A **DataClassification** is a classification assigned to data based on sensitivity, confidentiality, regulatory concern, operational criticality, or business significance.

Examples:

```text
public internal confidential restricted regulated personal sensitive personal secret
```

---

## 11.24 Sensitivity

**Sensitivity** indicates potential harm, obligation, or restriction associated with data disclosure, modification, loss, misuse, or processing.

---

## 11.25 DataCategory

A **DataCategory** groups data by semantic, legal, operational, or analytical type.

Examples:

```text
personal data
financial data
health data
authentication data
transaction data
telemetry data
metadata
content data
```

---

## 11.26 DataSubjectCategory

A **DataSubjectCategory** identifies the kind of person or entity data is about.

Examples:

```text
customer employee applicant supplier contact child patient user administrator
```

---

## 11.27 ProcessingPurpose

A **ProcessingPurpose** describes why data is collected, stored, transformed, shared, or used.

Examples:

```text
billing support security monitoring analytics product improvement legal compliance identity verification
```

---

## 11.28 RetentionRuleReference

A **RetentionRuleReference** links data to governance-defined retention obligations, policies, or rules.

The Data Model may model retention expectation, but Governance owns the policy and obligation.

---

## 11.29 DataResidency

**DataResidency** describes where data is stored, processed, transferred, or legally required to remain.

Examples:

```text
EU
Germany
customer region
cloud region
on-premises only
```

---

## 11.30 DataUsageConstraint

A **DataUsageConstraint** describes a restriction on how data may be used.

Examples:

```text
not for training
not for export
internal analytics only
production use prohibited
no cross-border transfer
only aggregated use
```

---

## 11.31 DataQualityDimension

A **DataQualityDimension** is an aspect of data quality.

Common dimensions:

```text
accuracy
completeness
consistency
timeliness
validity
uniqueness
freshness
integrity
fitness_for_use
```

---

## 11.32 DataQualityRule

A **DataQualityRule** is a testable expectation about data quality.

Examples:

```text
customer_id must not be null
invoice_total must be >= 0
country_code must be in ISO country code list
event_timestamp must be within expected delay window
```

---

## 11.33 DataQualityCheck

A **DataQualityCheck** is an execution of one or more data quality rules.

---

## 11.34 DataQualityResult

A **DataQualityResult** is the outcome of a data quality check.

---

## 11.35 DataQualityIssue

A **DataQualityIssue** is a finding indicating data does not meet a quality rule or fitness expectation.

It may create Task Model remediation work.

---

## 11.36 FitnessForUse

**FitnessForUse** is the degree to which data is suitable for a specific purpose or consumer context.

---

## 11.37 DataFlow

A **DataFlow** is movement or transfer of data between sources, systems, stores, services, actors, or processes.
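
Looking back at the quality concepts in 11.32 through 11.35, the rule, check, result, and issue chain can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the rule identifiers (`DQ-001`, `DQ-002`), the `run_check` function, the record layout, and the reduction of an issue to a rule identifier are hypothetical. The two rules are taken from the 11.32 examples, and the `passing` and `failing` labels are borrowed from the Data Quality States in 13.3.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DataQualityRule:
    """A testable expectation about one field (11.32)."""
    rule_id: str          # hypothetical identifier scheme
    applies_to: str       # field the rule applies to (compare VAL-DATA-006)
    description: str
    predicate: Callable[[object], bool]


@dataclass(frozen=True)
class DataQualityResult:
    """Outcome of executing one rule during a check (11.34)."""
    rule_id: str
    checked: int
    failed: int

    @property
    def state(self) -> str:
        # Only two of the states from 13.3 are modeled in this sketch.
        return "passing" if self.failed == 0 else "failing"


def run_check(rules, records):
    """A DataQualityCheck (11.33): execute every rule against every record."""
    records = list(records)
    results = []
    for rule in rules:
        failed = sum(
            1 for record in records
            if not rule.predicate(record.get(rule.applies_to))
        )
        results.append(DataQualityResult(rule.rule_id, len(records), failed))
    return results


rules = [
    DataQualityRule("DQ-001", "customer_id", "customer_id must not be null",
                    lambda v: v is not None),
    DataQualityRule("DQ-002", "invoice_total", "invoice_total must be >= 0",
                    lambda v: v is not None and v >= 0),
]

# Illustrative records; the second one violates DQ-001 only.
records = [
    {"customer_id": "c-1", "invoice_total": 10.0},
    {"customer_id": None, "invoice_total": 5.0},
    {"customer_id": "c-3", "invoice_total": 0},
]

results = run_check(rules, records)
# A DataQualityIssue (11.35) is reduced here to the id of each failing rule.
issues = [r.rule_id for r in results if r.state == "failing"]
print(results, issues)
```

In a fuller model each issue would reference the result that produced it and could then create Task Model remediation work.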

---

## 11.38 DataLineage

**DataLineage** describes the origin, movement, transformation, derivation, and usage path of data.

Lineage may include:

```text
source dataset
transformation
activity
agent
target dataset
time
evidence
confidence
```

---

## 11.39 Transformation

A **Transformation** is an activity that changes data structure, content, format, aggregation, classification, or meaning.

---

## 11.40 Derivation

A **Derivation** is a relationship where one data entity is derived from another.

---

## 11.41 ProvenanceRecord

A **ProvenanceRecord** records information about how data came to exist, who or what generated it, what activity produced it, and what source influenced it.

---

## 11.42 DataContract

A **DataContract** is an explicit agreement between data producers and consumers about data structure, semantics, quality, delivery, compatibility, ownership, and change expectations.

---

## 11.43 ProducerExpectation

A **ProducerExpectation** describes what a data producer commits to provide.

Examples:

```text
schema stability
freshness
completeness
availability
documentation
change notice
```

---

## 11.44 ConsumerExpectation

A **ConsumerExpectation** describes what a data consumer expects or is allowed to assume.

---

## 11.45 CompatibilityRule

A **CompatibilityRule** describes what changes are considered compatible or breaking.

---

## 11.46 BreakingChange

A **BreakingChange** is a data, schema, semantic, quality, or delivery change that violates consumer expectations or compatibility rules.

---

## 11.47 SchemaEvolutionPolicy

A **SchemaEvolutionPolicy** defines rules for how schemas may change over time.

---

## 11.48 DataStoreReference

A **DataStoreReference** points to a Landscape data store or storage resource.

Examples:

```text
database table bucket file share topic queue index warehouse lakehouse table
```

---

## 11.49 DataAccessPattern

A **DataAccessPattern** describes how data is accessed.

Examples:

```text
batch export
API query
event stream
direct database query
file download
replication
analytics dashboard
```

---

## 11.50 DataFreshness

**DataFreshness** describes how current data is relative to a defined expectation.

---

## 11.51 DataAvailability

**DataAvailability** describes whether data is accessible according to expectations.

---

# 12. Core Relationship Vocabulary

Recommended root relationship types:

```text
contains
part_of
describes
classified_as
has_schema
has_field
has_distribution
provided_by
consumed_by
stored_in
accessed_via
flows_to
derived_from
generated_by
transformed_by
governed_by
constrained_by
subject_to
owned_by
stewarded_by
produced_by
validated_by
violates
satisfies
maps_to
```

Relationship records SHOULD support:

```yaml
id:
relationship_type:
source_entity:
target_entity:
scope:
valid_from:
valid_to:
source_system:
confidence:
evidence:
rationale:
```

---

# 13. Data State Models

## 13.1 Dataset Lifecycle States

```text
proposed
designed
active
deprecated
retired
archived
deleted
```

## 13.2 Schema States

```text
draft
candidate
active
deprecated
superseded
retired
```

## 13.3 Data Quality States

```text
unknown
unchecked
passing
warning
failing
waived
remediating
verified
```

## 13.4 Data Contract States

```text
draft
under_review
active
violated
deprecated
superseded
retired
```

## 13.5 Lineage Confidence States

```text
unknown
declared
inferred
observed
verified
conflicting
```

---

# 14. Data Patterns

## 14.1 Pattern: Data Is Not Its Store

**Context:** Teams model data by pointing at tables, buckets, or files.

**Problem:** Storage location does not explain semantic meaning, ownership, classification, quality, or lineage.

**Solution:** Model Dataset, Schema, Distribution, StoreReference, and Lineage separately.

---

## 14.2 Pattern: Dataset Catalog Entry

**Context:** Data consumers need to discover and understand data.

**Problem:** Data assets remain invisible or only known by tribal knowledge.

**Solution:** Provide a catalog entry with:

```text
dataset name
description
owner
steward
classification
schema
distribution
quality expectations
lineage
access method
usage constraints
```

---

## 14.3 Pattern: Data Contract at Boundary

**Context:** Data crosses a team, service, product, or system boundary.

**Problem:** Consumers break when producers change data unexpectedly.

**Solution:** Define a DataContract with schema, semantic expectations, quality rules, compatibility rules, and change process.

---

## 14.4 Pattern: Classification Drives Controls

**Context:** Data has different sensitivity and obligations.

**Problem:** Systems apply uniform controls or rely on ad hoc judgment.

**Solution:** Classify data and map classifications to governance controls, access policies, security measures, and retention expectations.

---

## 14.5 Pattern: Lineage as Evidence

**Context:** A derived dataset is used for decisions or compliance.

**Problem:** Consumers cannot determine origin, transformations, or trustworthiness.

**Solution:** Model lineage with source datasets, transformations, activities, agents, target datasets, and evidence.

---

## 14.6 Pattern: Quality Rule to Remediation

**Context:** Data quality checks fail.

**Problem:** Failures remain dashboards instead of corrective action.

**Solution:**

```text
DataQualityRule -> DataQualityCheck -> DataQualityResult -> DataQualityIssue -> RemediationTask -> VerificationEvidence
```

---

## 14.7 Pattern: Semantic Term and Field Split

**Context:** Database columns are treated as business terms.

**Problem:** Field names do not fully encode business meaning.

**Solution:** Link Field to DataElement, BusinessTerm, and DataDefinition.

---

## 14.8 Pattern: Retention with Governance Reference

**Context:** Data must be kept or deleted according to obligations.

**Problem:** Retention is encoded as undocumented operational behavior.

**Solution:** Link Dataset or DataObject to RetentionRuleReference and keep the governing obligation in Governance.

---

# 15. Data Profiles

## 15.1 Profile Format

A Data Profile SHALL declare:

```yaml
id:
profile_name:
status:
implements:
  - InfoTechCanonDataModel
target_context:
included_concepts:
required_relationships:
required_metadata:
state_model:
source_of_truth_rules:
mapping_files:
validation_rules:
examples:
known_deviations:
```

---

## 15.2 Seed Profile: Small SaaS Data Profile

Purpose:

```text
Provide a minimal data model for a small SaaS platform moving toward production readiness.
```

Included concepts:

```text
DataDomain
Dataset
DataObject
Schema
Field
DataClassification
DataStoreReference
DataFlow
DataQualityRule
RetentionRuleReference
DataOwnerReference
DataStewardReference
```

Required relationships:

```text
Dataset has_schema Schema
Schema has_field Field
Dataset classified_as DataClassification
Dataset stored_in DataStoreReference
Dataset owned_by DataOwnerReference
Dataset stewarded_by DataStewardReference
DataFlow moves Dataset
RetentionRuleReference applies_to Dataset
```

---

## 15.3 Seed Profile: Data Catalog Profile

Purpose:

```text
Represent data catalog entries for discoverability and reuse.
```

Included concepts:

```text
Catalog
Dataset
DatasetSeries
DataDistribution
DataService
DataOwnerReference
DataStewardReference
DataClassification
DataQualitySummary
DataLineageSummary
```

Mapping targets:

```text
DCAT
DCAT-AP
DataHub
OpenMetadata
Amundsen
Collibra / catalog tools
```

---

## 15.4 Seed Profile: Data Contract Profile

Purpose:

```text
Represent data producer-consumer agreements.
```

Included concepts:

```text
DataContract
ProducerExpectation
ConsumerExpectation
Schema
SchemaVersion
DataQualityRule
CompatibilityRule
BreakingChange
ChangeNotice
DataContractViolation
```

Required relationships:

```text
DataContract applies_to Dataset
ProducerExpectation constrains Producer
ConsumerExpectation informs Consumer
CompatibilityRule governs SchemaEvolution
BreakingChange violates DataContract
```

---

## 15.5 Seed Profile: Data Lineage Profile

Purpose:

```text
Represent lineage across datasets, transformations, pipelines, and systems.
```

Included concepts:

```text
Dataset
SourceDataset
TargetDataset
Transformation
DataFlow
DataLineage
ProvenanceRecord
DataPipelineReference
ActivityReference
AgentReference
```

Mapping targets:

```text
PROV-O
OpenLineage
Marquez
dbt exposures/models/sources
DataHub lineage
```

---

## 15.6 Seed Profile: Privacy-Relevant Data Profile

Purpose:

```text
Represent data concepts relevant to privacy, data protection, retention, and processing.
```

Included concepts:

```text
PersonalDataCategory
SensitiveDataCategory
DataSubjectCategory
ProcessingPurpose
DataResidency
RetentionRuleReference
DataUsageConstraint
DataMinimizationExpectation
```

Governance owns legal obligations and lawful-basis interpretation.

---

## 15.7 Seed Profile: Analytics Dataset Profile

Purpose:

```text
Represent analytical datasets, metrics, dimensions, facts, models, and reports.
```

Included concepts:

```text
Dataset
Metric
Dimension
Fact
Measure
AggregationRule
ReportReference
DashboardReference
DataQualityRule
FreshnessExpectation
```

---

# 16. Mapping Model for the Data Standard

Mappings relate InfoTechCanon data concepts to external standards, frameworks, products, and regulations.
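
As a rough illustration of how a mapping could be handled in tooling, the Python sketch below models a mapping record with a subset of the fields used in the 16.2 example and rejects any mapping type outside the recommended list in 16.1. The `MappingRecord` class, its validation behavior, and the rejected `sameAs` value are assumptions for illustration, not part of this standard.

```python
from dataclasses import dataclass

# Recommended mapping types, copied from 16.1.
MAPPING_TYPES = frozenset({
    "exactMatch", "closeMatch", "broadMatch", "narrowMatch", "relatedMatch",
    "conflictMatch", "gapMatch", "derivedFrom", "regulatoryReference",
    "toolEquivalent",
})


@dataclass(frozen=True)
class MappingRecord:
    """Hypothetical in-memory form of a mapping record (field names follow 16.2)."""
    id: str
    source_concept: str
    target_body: str
    target_concept: str
    mapping_type: str
    confidence: str = "unknown"
    status: str = "candidate"

    def __post_init__(self):
        # Fail early on mapping types the standard does not recommend.
        if self.mapping_type not in MAPPING_TYPES:
            raise ValueError(f"unsupported mapping_type: {self.mapping_type!r}")


record = MappingRecord(
    id="itc-map:dataset-to-dcat-dataset",
    source_concept="itc-data:Dataset",
    target_body="W3C DCAT",
    target_concept="dcat:Dataset",
    mapping_type="closeMatch",
    confidence="high",
)

try:
    # 'sameAs' is deliberately not in the 16.1 list.
    MappingRecord(
        id="itc-map:example-invalid",
        source_concept="itc-data:Dataset",
        target_body="W3C DCAT",
        target_concept="dcat:Dataset",
        mapping_type="sameAs",
    )
except ValueError as exc:
    rejected = str(exc)
else:
    rejected = None

print(record.mapping_type, rejected)
```

Whether a profile should treat the 16.1 list as closed is left open here; a looser variant could warn instead of raising.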

## 16.1 Mapping Types

Recommended mapping types:

```text
exactMatch
closeMatch
broadMatch
narrowMatch
relatedMatch
conflictMatch
gapMatch
derivedFrom
regulatoryReference
toolEquivalent
```

## 16.2 Mapping Record

Example:

```yaml
id: itc-map:dataset-to-dcat-dataset
source_concept: itc-data:Dataset
target_body: W3C DCAT
target_version: "3"
target_concept: dcat:Dataset
mapping_type: closeMatch
scope:
  - data catalog interoperability
not_valid_for:
  - all internal schema semantics
  - all data product lifecycle semantics
rationale: >
  DCAT Dataset is a strong catalog-oriented match for InfoTechCanon Dataset,
  but InfoTechCanon includes additional governance, quality, contract, and
  lineage expectations that may not be required by DCAT.
confidence: high
status: candidate
owner: InfoTechCanonDataModel
```

## 16.3 Seed Mapping Targets

The Data Model SHOULD maintain mappings to:

```text
DAMA-DMBOK
W3C DCAT 3
DCAT-AP
W3C PROV-O
ISO/IEC 11179
schema.org Dataset
OpenLineage
DataHub metadata model
OpenMetadata
dbt sources/models/exposures
Great Expectations
Apache Atlas
Collibra / data catalog concepts
GDPR / privacy-regulation references
Dublin Core metadata
SPDX / CycloneDX data references where relevant
```

---

# 17. Assimilation Hooks

The Data Model SHALL be able to receive new data standards, platforms, regulations, product schemas, and practices through the InfoTechCanon assimilation process.

## 17.1 Assimilation Triggers

Assimilation may be triggered by:

```text
new data catalog model
new data lineage standard
new metadata registry standard
new privacy regulation
new data-quality tool
new data-contract practice
new data-product pattern
new analytics modeling method
new data platform integration
new recurring data classification conflict
```

## 17.2 Data Assimilation Output

A data assimilation SHOULD produce:

```text
source summary
extracted data concepts
concept comparison matrix
gap list
conflict list
mapping file
candidate new concepts
candidate relationship changes
candidate pattern changes
candidate profile changes
open questions
```

## 17.3 Recommended First Assimilation Candidates

```text
W3C DCAT 3
PROV-O
ISO/IEC 11179
DAMA-DMBOK
OpenLineage
DataHub
OpenMetadata
Great Expectations
dbt semantic layer / metadata
GDPR data categories and processing concepts
```

---

# 18. Integration with Other InfoTechCanon Standards

## 18.1 Landscape Model

Data references Landscape concepts for:

```text
data store database bucket queue topic pipeline runtime service application service endpoint environment
```

## 18.2 Organization Model

Data imports organization concepts for:

```text
data owner
data steward
data custodian
data producer
data consumer
data trustee
responsible team
```

## 18.3 Governance Model

Data imports governance concepts for:

```text
policy
retention requirement
processing obligation
control
exception
evidence
review
compliance requirement
```

## 18.4 Security Model

Security imports data concepts for:

```text
classification
sensitivity
data category
data subject category
data exposure
residency
data security finding
```

## 18.5 Access Control Model

Access Control imports data concepts for:

```text
dataset
data object
data classification
data usage constraint
data access pattern
```

## 18.6 Task Model

Data creates or references tasks such as:

```text
data-quality remediation
schema migration
contract review
lineage clarification
classification review
retention cleanup
data incident investigation
```

## 18.7 Tagging Standard

Tagging supports data discovery and lightweight classification but MUST NOT replace data classification, schema, lineage, quality, or governance records.

---

# 19. Canon Interface Card Usage

Subsystems that implement or produce data knowledge SHOULD publish a Canon Interface Card.

Example:

```yaml
subsystem: data-catalog-importer
implements:
  - InfoTechCanonDataModel
  - DataCatalogProfile
produces:
  - Dataset
  - Schema
  - Field
  - DataDistribution
  - DataOwnerReference
consumes:
  - Team
  - DataStoreReference
  - Policy
relations:
  - Dataset has_schema Schema
  - Schema has_field Field
  - Dataset stored_in DataStoreReference
  - Dataset owned_by Team
source_of_truth:
  dataset_catalog_entries: data-catalog
known_deviations:
  - lineage is summary-only
  - data quality checks are imported from separate system
```

---

# 20. Retrieval Requirements

The Data Model is designed for markdown-based infospaces.

## 20.1 Required Retrieval Properties

Every major concept SHOULD provide:

- stable heading,
- stable identifier,
- short definition,
- longer explanation,
- examples,
- distinction notes,
- relationship examples,
- mapping hooks,
- profile references,
- and common mistakes.

## 20.2 Agent Brief

A mature Data Model SHOULD include an `agent-brief.md` file with:

```text
purpose
scope
owned concepts
imported concepts
core distinctions
do / do not rules
relationship patterns
minimal examples
common mistakes
profile list
mapping list
```

## 20.3 Indexes

The data information space SHOULD provide indexes by:

```text
concept relationship data domain dataset schema field classification quality rule lineage contract profile pattern mapping target status source system
```

---

# 21. Conformance Levels

## 21.1 Reference-Conformant

A document or system is reference-conformant if it uses Data Model terminology consistently but does not implement structured metadata or validation rules.
## 21.2 Metadata-Conformant

A system is metadata-conformant if it uses stable identifiers, concept names, lifecycle states, source metadata, and relationship types.

## 21.3 Catalog-Conformant

A system is catalog-conformant if datasets, distributions, data services, owners, stewards, descriptions, and classifications are represented.

## 21.4 Lineage-Conformant

A system is lineage-conformant if it represents data sources, transformations, targets, provenance, and confidence.

## 21.5 Quality-Conformant

A system is quality-conformant if it represents data quality rules, checks, results, and issues.

## 21.6 Contract-Conformant

A system is contract-conformant if producer and consumer expectations are represented as DataContracts.

## 21.7 Profile-Conformant

A system is profile-conformant if it implements a declared Data Profile and passes its validation rules.

## 21.8 Assimilation-Conformant

A system or repository is assimilation-conformant if it can accept external data concepts through the InfoTechCanon assimilation workflow and produce mappings, gaps, conflicts, and proposed changes.

---

# 22. Validation Rules

Initial validation rules:

```text
VAL-DATA-001: Dataset SHOULD NOT be modeled as identical to DataStoreReference.
VAL-DATA-002: Dataset SHOULD have owner or steward reference when used for operational or governed purposes.
VAL-DATA-003: Dataset SHOULD have classification when it may contain sensitive, regulated, operationally critical, or business-critical data.
VAL-DATA-004: Schema SHOULD have version when used across system boundaries.
VAL-DATA-005: Field SHOULD be distinguishable from DataElement where semantic precision matters.
VAL-DATA-006: DataQualityRule SHOULD declare the dataset, field, or data object it applies to.
VAL-DATA-007: DataQualityResult SHOULD reference the executed rule and check.
VAL-DATA-008: DataLineage SHOULD distinguish declared, inferred, observed, and verified lineage.
VAL-DATA-009: DataContract SHOULD declare producer, consumer, dataset, schema or semantic expectations, quality expectations, and compatibility rules where applicable.
VAL-DATA-010: BreakingChange SHOULD reference the DataContract or CompatibilityRule it violates.
VAL-DATA-011: RetentionRuleReference SHOULD point to Governance concepts rather than embedding legal interpretation in Data.
VAL-DATA-012: DataResidency SHOULD reference region, jurisdiction, environment, or storage/processing scope where available.
VAL-DATA-013: Tags MUST NOT replace DataClassification, Schema, Lineage, Quality, or Contract records.
VAL-DATA-014: External data concepts SHOULD be represented through mapping records rather than silently reused.
VAL-DATA-015: Profiles MUST NOT redefine canonical concepts. They may constrain them.
VAL-DATA-016: Data used for AI training, analytics, or automation SHOULD declare usage constraints and provenance where relevant.
```

---

# 23. Anti-Patterns

## 23.1 Table Equals Dataset

Treating every table as a complete dataset and every dataset as a table.

## 23.2 Schema Equals Meaning

Assuming column names and types fully define business meaning.

## 23.3 Classification by Tag Only

Using tags such as `confidential` without a governed DataClassification record.

## 23.4 Lineage by Diagram Only

Drawing flows without source, transformation, target, evidence, or confidence.

## 23.5 Quality Dashboard Graveyard

Tracking quality failures without owners, tasks, remediation, or fitness-for-use decisions.

## 23.6 Contract-Free Integration

Letting consumers depend on producer data without explicit compatibility expectations.

## 23.7 Hidden Retention Logic

Deleting or keeping data based on undocumented scripts or tribal knowledge.

## 23.8 Catalog Without Trust

Cataloging datasets without owner, freshness, classification, quality, or lineage.

## 23.9 Privacy in Free Text

Recording processing purpose, data subject category, residency, or sensitivity as unstructured notes only.
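
Several of the anti-patterns above correspond directly to rules in Section 22, for example 23.3 to VAL-DATA-013 and 23.8 to VAL-DATA-002 and VAL-DATA-003. As a minimal sketch of how such rules might be checked mechanically, the function below inspects a dataset record held as a plain dictionary. The record keys (`governed`, `may_be_sensitive`, `owner`, `steward`, `classification`, `tags`), the tag vocabulary, and the function name are all assumptions for illustration; the standard does not define a record shape. SHOULD-level rules are advisory, so the result is better read as a list of findings than as a pass/fail verdict.

```python
# Hypothetical record shape and tag vocabulary; not defined by the standard.
SENSITIVITY_TAGS = {"confidential", "restricted", "pii"}


def validate_dataset(ds: dict) -> list[str]:
    """Return the IDs of Section 22 rules the record appears to violate."""
    violations = []
    # VAL-DATA-002: governed or operational datasets need an owner or steward.
    if ds.get("governed") and not (ds.get("owner") or ds.get("steward")):
        violations.append("VAL-DATA-002")
    # VAL-DATA-003: potentially sensitive data needs a classification.
    if ds.get("may_be_sensitive") and not ds.get("classification"):
        violations.append("VAL-DATA-003")
    # VAL-DATA-013 (anti-pattern 23.3): a sensitivity tag with no
    # DataClassification record behind it.
    tags = {t.lower() for t in ds.get("tags", [])}
    if tags & SENSITIVITY_TAGS and not ds.get("classification"):
        violations.append("VAL-DATA-013")
    return violations


tag_only = {"id": "ds:customer-orders", "governed": True, "tags": ["Confidential"]}
described = {
    "id": "ds:customer-orders",
    "governed": True,
    "may_be_sensitive": True,
    "owner": "team:order-platform",
    "classification": "dc:confidential",
    "tags": ["confidential"],
}

print(validate_dataset(tag_only))
print(validate_dataset(described))
```

Reporting rule IDs rather than free-text messages keeps findings traceable to the rule list and makes it straightforward to turn each finding into a task, as Section 18.6 anticipates.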
## 23.10 Vendor Model Capture

Letting one data catalog, warehouse, or governance product define the internal data model.

---

# 24. Initial Repository Placement

Recommended repository layout:

```text
info-tech-canon/
  standards/
    data/
      InfoTechCanonDataModel.md
      agent-brief.md
      concepts/
      relationships/
      patterns/
      profiles/
      mappings/
      assimilation/
      examples/
      validation/
```

Seed files:

```text
standards/data/InfoTechCanonDataModel.md
standards/data/agent-brief.md
standards/data/concepts/dataset.md
standards/data/concepts/data-product.md
standards/data/concepts/schema.md
standards/data/concepts/data-element.md
standards/data/concepts/data-classification.md
standards/data/concepts/data-lineage.md
standards/data/concepts/data-quality-rule.md
standards/data/concepts/data-contract.md
standards/data/patterns/data-is-not-its-store.md
standards/data/patterns/dataset-catalog-entry.md
standards/data/patterns/data-contract-at-boundary.md
standards/data/patterns/lineage-as-evidence.md
standards/data/profiles/small-saas-data-profile.md
standards/data/profiles/data-catalog-profile.md
standards/data/profiles/data-contract-profile.md
standards/data/profiles/data-lineage-profile.md
standards/data/mappings/dcat.yaml
standards/data/mappings/prov-o.yaml
standards/data/mappings/iso-11179.yaml
standards/data/mappings/dama-dmbok.yaml
```

---

# 25. Roadmap

## Phase 1: Seed Stabilization

- Establish this standard as `InfoTechCanonDataModel`.
- Add seed concepts, relationship vocabulary, patterns, and profiles.
- Define validation rules.
- Align with Landscape, Governance, Security, Access Control, Task, and Tagging.

## Phase 2: First Assimilations

Recommended first assimilations:

```text
W3C DCAT 3
PROV-O
ISO/IEC 11179
DAMA-DMBOK
OpenLineage
DataHub
OpenMetadata
Great Expectations
dbt metadata
GDPR data category concepts
```

## Phase 3: Profile Maturation

- Mature Small SaaS Data Profile.
- Mature Data Catalog Profile.
- Mature Data Contract Profile.
- Mature Data Lineage Profile.
- Mature Privacy-Relevant Data Profile.
- Mature Analytics Dataset Profile.

## Phase 4: Tooling Integration

- Generate concept indexes.
- Generate agent brief.
- Create machine-readable YAML/JSON exports.
- Add validation scripts.
- Integrate data catalog, lineage, data-quality, schema registry, and contract tooling.

## Phase 5: Data Intelligence Loop

- Connect datasets to services and repositories.
- Connect classification to access control and security.
- Connect quality issues to tasks.
- Connect lineage to provenance and assurance.
- Connect data contracts to DevSecOps and release workflows.
- Connect privacy and retention to governance obligations.

---

# 26. Summary

The InfoTechCanon Data Model is the seed standard for representing data as a managed, governed, discoverable, reusable, classifiable, lineage-bearing, and quality-assessable asset.

Its most important commitments are:

```text
Separate data from storage.
Separate dataset, schema, field, data element, data object, and data product.
Treat classification, lineage, quality, retention, residency, and processing purpose as first-class concerns.
Use data contracts at producer-consumer boundaries.
Import governance, access-control, security, task, tagging, organization, and landscape concepts instead of redefining them.
Map to DCAT, PROV-O, ISO/IEC 11179, DAMA-DMBOK, OpenLineage, and catalog tools without surrendering internal semantic autonomy.
Use profiles to make the model practical for SaaS systems, catalogs, contracts, lineage, privacy-relevant data, analytics, and AI/agentic workflows.
```

This makes the Data Model a core seed for information architecture, data governance, security posture, AI readiness, analytics reliability, and interoperable information-processing systems.