# InfoTechCanon Observability Model

**Short Name:** `ITC-OBS`

**Document Status:** Seed Standard Release Candidate 1

**Version:** RC1-seed

**Date:** 2026-05-23

**Repository Context:** `info-tech-canon`

**Document Type:** InfoTechCanon Domain Standard

**Intended Audience:** SREs, platform engineers, DevSecOps teams, service owners, observability engineers, incident responders, network operators, security analysts, product owners, governance designers, knowledge-system builders, and agentic tooling.

---

# 1. Purpose

The **InfoTechCanon Observability Model** defines a canonical seed model for representing telemetry, signals, events, logs, metrics, traces, profiles, health, service levels, alerts, incidents as observed phenomena, dashboards, runbooks, investigations, and operational evidence.

It exists to make runtime understanding interoperable across systems, services, platforms, networks, security, delivery pipelines, data products, and agentic operations.

This standard provides a canonical vocabulary for:

- telemetry sources,
- resources,
- signals,
- metrics,
- logs,
- events,
- traces,
- spans,
- profiles,
- exemplars,
- attributes,
- dimensions,
- correlation context,
- service level indicators,
- service level objectives,
- error budgets,
- health states,
- alerts,
- notifications,
- incidents,
- investigations,
- dashboards,
- runbooks,
- observability evidence,
- and feedback loops.

---

# 2. Position in InfoTechCanon

The Observability Model is a **domain standard** within InfoTechCanon.

It depends on the existing seed standards as follows:

```text
Landscape      = services, runtime resources, environments, endpoints, workloads.
Organization   = owners, on-call actors, responders, teams, accountable roles.
Governance     = policies, controls, evidence, reviews, assurance, obligations.
Task           = incident work, remediation work, investigation, follow-up tasks.
Tagging        = lightweight classification of signals, alerts, incidents, dashboards.
Access Control = access to telemetry, dashboards, logs, admin actions, incident tools.
Security       = security signals, detections, alerts, incidents, forensic evidence.
Data           = telemetry as data, retention, classification, quality, lineage.
DevSecOps      = deployment events, delivery metrics, verification, change failures.
Network        = flow logs, reachability tests, network metrics, DNS logs, latency.
Observability  = signals, telemetry, correlation, health, SLOs, alerts, operational evidence.
```

```text
InfoTechCanon
├── InfoTechCanonCore
├── InfoTechCanonLandscapeModel
├── InfoTechCanonOrganizationModel
├── InfoTechCanonGovernanceModel
├── InfoTechCanonTaskModel
├── InfoTechCanonTaggingStandard
├── InfoTechCanonAccessControlModel
├── InfoTechCanonSecurityModel
├── InfoTechCanonDataModel
├── InfoTechCanonDevSecOpsModel
├── InfoTechCanonNetworkModel
├── InfoTechCanonObservabilityModel   <-- this standard
├── InfoTechCanonPatternLanguage
└── Application Profiles
```

---

# 3. Boundary with Adjacent Standards

## 3.1 Boundary with Landscape

The Landscape Model owns the entities being observed:

```text
ApplicationService
TechnicalService
RuntimeWorkload
Environment
Endpoint
NetworkEntity
DataStore
DeploymentRecord
```

The Observability Model owns telemetry and signals about those entities:

```text
Metric
LogRecord
Trace
Span
Event
Profile
Alert
HealthState
SLI
SLO
Dashboard
IncidentSignal
```

Boundary rule:

```text
Landscape owns what exists.
Observability owns what is observed, measured, correlated, alerted, and evidenced.
```

## 3.2 Boundary with Security

The Security Model owns security interpretation:

```text
SecurityFinding
Detection
SecurityIncident
Threat
AttackPath
SecurityEvidence
```

Observability owns telemetry substrate and operational signals.

Example:

```text
LogRecord may be evidence for SecurityFinding.
SecurityDetection may be derived from ObservabilitySignal.
SecurityIncident may reference Alert, Trace, LogRecord, or Event.
```

## 3.3 Boundary with Governance

Governance owns policies, controls, evidence, reviews, assurance, and compliance claims.

Observability provides evidence and indicators.

Example:

```text
SLOEvidence supports ServiceReview.
Metric supports ControlResult.
AlertPolicy implements Governance Policy.
```

## 3.4 Boundary with Task

Task owns work semantics.

Observability creates or references tasks:

```text
Alert creates IncidentTask
Incident creates RemediationTask
Investigation creates FollowUpTask
SLOBurn creates ReliabilityTask
```

## 3.5 Boundary with DevSecOps

DevSecOps owns delivery events and deployment records.

Observability owns runtime signals used to verify deployments and measure change impact.

Example:

```text
DeploymentRecord produces DeploymentEvent
DeploymentHealthSignal verifies DeploymentRecord
ChangeFailure detected_by ObservabilitySignal
```

## 3.6 Boundary with Data

Data owns dataset, classification, lineage, quality, and retention semantics.

Observability telemetry may itself be data, but Observability owns telemetry-specific semantics.

Example:

```text
LogDataset classified_as Restricted
MetricStream has_retention RetentionRuleReference
TraceSample derived_from RuntimeWorkload
```

---

# 4. Research Basis and External Alignment

This seed standard draws on several mature observability and operations bodies of knowledge.

## 4.1 OpenTelemetry

OpenTelemetry provides a broad observability framework covering traces, metrics, logs, baggage, resources, semantic conventions, instrumentation, collection, and export. Its semantic conventions define common attributes that give meaning to telemetry across systems.

## 4.2 SRE and Service Level Objectives

SRE practice distinguishes Service Level Indicators, Service Level Objectives, Service Level Agreements, and error budgets. It emphasizes that SLOs should measure user-relevant reliability and guide operational decision-making.
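
As a rough illustration of how an SLO target, an error budget, and a burn rate relate, the arithmetic can be sketched in Python. The function names and the event-based formulation (counting good and bad requests) are assumptions made for this example; the standard does not prescribe them.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Return the allowed failure fraction implied by an SLO target (0.999 -> ~0.001)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the unspent share of the error budget (negative once overspent)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = error_budget_fraction(slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


def burn_rate(slo_target: float, observed_error_rate: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are accruing."""
    return observed_error_rate / error_budget_fraction(slo_target)


# Hypothetical service: 99.9% target, 1,000,000 requests in the window, 250 failed.
remaining = budget_remaining(0.999, 1_000_000, 250)   # roughly 0.75 of the budget left
rate = burn_rate(0.999, 0.005)                        # roughly 5x the sustainable rate
```

A burn rate of 1 means the budget would be spent exactly at the end of the measurement window; values above 1 mean it runs out earlier.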
## 4.3 Prometheus and OpenMetrics

Prometheus and OpenMetrics influence metric naming, metric exposition, labels, time series, counters, gauges, histograms, summaries, and scraping/pull-based metric collection.

## 4.4 CloudEvents

CloudEvents standardizes common event metadata for interoperability across services, platforms, and systems. It is a strong mapping target for event structure and routing metadata.

## 4.5 IT Operations and Incident Management

IT operations practice distinguishes alerts, incidents, problems, changes, runbooks, on-call, escalation, resolution, and post-incident review. The Observability Model provides signal semantics while Task and Governance own work and decision semantics.

## 4.6 AIOps and Event Correlation

AIOps practice emphasizes correlation, anomaly detection, event deduplication, root-cause analysis, topology-aware alerting, and automated remediation. These are advanced profiles rather than mandatory core concepts.

---

# 5. Seed Standard Design Stance

This standard is a **seed standard**, not a vendor-specific observability schema.

It shall:

1. define canonical observability semantics,
2. distinguish telemetry, signal, event, log, metric, trace, span, profile, alert, and incident,
3. support OpenTelemetry alignment without being limited to it,
4. support SLOs, SLIs, and error budgets,
5. support correlation across services, runtime, network, security, data, and delivery,
6. support operational evidence and feedback loops,
7. support human and agentic operations,
8. map to external standards and tools without becoming subordinate to them,
9. remain markdown-first and agent-retrievable,
10. and support future assimilation of observability tools, standards, and practices.

---

# 6. Scope

## 6.1 In Scope

This standard covers canonical representation of:

- telemetry,
- telemetry sources,
- observed resources,
- observability signals,
- metrics,
- time series,
- metric points,
- metric instruments,
- logs,
- log records,
- events,
- event envelopes,
- traces,
- spans,
- span links,
- trace context,
- profiles,
- exemplars,
- attributes,
- dimensions,
- labels,
- correlation context,
- service-level indicators,
- service-level objectives,
- service-level agreements as references,
- error budgets,
- burn rates,
- health states,
- alert rules,
- alerts,
- notifications,
- alert routes,
- incidents as observed operational objects,
- investigations,
- dashboards,
- runbooks,
- telemetry pipelines,
- collectors,
- exporters,
- sampling,
- retention,
- and observability evidence.

## 6.2 Out of Scope

This standard does not fully define:

- all monitoring tool schemas,
- all incident-management process details,
- all SRE organizational practice,
- complete AIOps algorithms,
- all logging formats,
- all SIEM detection content,
- full OpenTelemetry SDK implementation,
- all Prometheus query semantics,
- complete data-retention law,
- complete security incident-response methodology,
- or every vendor-specific telemetry backend.

Those may be mapped, assimilated, profiled, or handled by adjacent standards.

---

# 7. Normative Language

The following terms are used normatively:

- **SHALL** indicates a mandatory rule for conformance.
- **SHOULD** indicates a recommended practice.
- **MAY** indicates an optional capability.
- **MUST NOT** indicates a prohibited practice.
- **SEED** marks a concept defined provisionally here but open to later refinement.
- **EXTRACT** marks a concept that may later move to a more specialized standard.

---

# 8. Core Principles

## 8.1 Observability Is More Than Monitoring

Monitoring checks known conditions.
Observability supports understanding system behavior, including unknown or emergent failure modes, through signals and correlation.

## 8.2 Telemetry Is Not Insight

Raw telemetry becomes useful through context, correlation, aggregation, interpretation, and action.

## 8.3 Signal Is Not Incident

A signal, alert, or event may indicate a possible problem. An incident is an operationally relevant situation requiring response.

## 8.4 Alert Is Not Evidence by Itself

An alert indicates that a rule fired or condition was detected. Evidence should include the underlying signals, query, thresholds, state, and context.

## 8.5 Metrics, Logs, Traces, Events, and Profiles Are Distinct

Each signal type has different strengths and should not be collapsed into one generic “event” concept.

## 8.6 Service Levels Must Be Explicit

SLIs, SLOs, and error budgets SHOULD be modeled explicitly when reliability is important.

## 8.7 Correlation Requires Identity

Telemetry SHOULD be linked to canonical landscape entities, deployment records, network endpoints, data resources, or security entities where possible.

## 8.8 Observability Must Support Feedback

Observability should feed tasks, incidents, governance reviews, deployment verification, security detection, reliability improvement, and standard evolution.

## 8.9 External Standards Are Mapped, Not Obeyed

The Observability Model MAY map to OpenTelemetry, Prometheus, OpenMetrics, CloudEvents, SRE SLO concepts, ITIL incident practices, and vendor schemas. It MUST NOT subordinate its internal semantics to any single external model.

---

# 9. Canonical Seed Metadata

Every observability artifact SHOULD support structured metadata.
Recommended front matter:

```yaml
---
id: itc-obs:Metric
type: concept
standard: InfoTechCanonObservabilityModel
standard_version: RC1-seed
status: candidate
canonical_owner: InfoTechCanonObservabilityModel
preferred_label: Metric
related:
  - itc-obs:TimeSeries
  - itc-obs:SLI
  - itc-obs:AlertRule
mappings:
  - itc-map:metric-to-opentelemetry-metric
---
```

Recommended artifact statuses:

```text
idea
draft
candidate
release-candidate
adopted
stable
deprecated
retired
```

Recommended concept statuses:

```text
proposed
experimental
candidate
canonical
deprecated
retired
```

---

# 10. Root Observability Taxonomy

```text
ObservabilityEntity
├── TelemetryEntity
│   ├── Telemetry
│   ├── TelemetrySource
│   ├── ObservedResource
│   ├── ResourceAttribute
│   ├── Signal
│   ├── SignalSource
│   └── TelemetryPipeline
├── MetricEntity
│   ├── Metric
│   ├── MetricInstrument
│   ├── TimeSeries
│   ├── MetricPoint
│   ├── Counter
│   ├── Gauge
│   ├── Histogram
│   ├── Summary
│   └── Exemplar
├── LogEntity
│   ├── Log
│   ├── LogRecord
│   ├── LogStream
│   ├── LogLevel
│   ├── LogContext
│   └── StructuredLogField
├── TraceEntity
│   ├── Trace
│   ├── Span
│   ├── SpanEvent
│   ├── SpanLink
│   ├── TraceContext
│   ├── Baggage
│   └── TraceSample
├── EventEntity
│   ├── Event
│   ├── EventEnvelope
│   ├── EventSource
│   ├── EventType
│   ├── EventConsumer
│   └── EventCorrelationKey
├── ProfileEntity
│   ├── Profile
│   ├── ProfilingSample
│   ├── ResourceProfile
│   └── PerformanceProfile
├── ReliabilityEntity
│   ├── SLI
│   ├── SLO
│   ├── SLAReference
│   ├── ErrorBudget
│   ├── BurnRate
│   ├── HealthState
│   └── AvailabilityWindow
├── AlertingEntity
│   ├── AlertRule
│   ├── Alert
│   ├── Notification
│   ├── AlertRoute
│   ├── AlertSuppression
│   ├── AlertCorrelation
│   └── EscalationReference
├── OperationsEntity
│   ├── ObservedIncident
│   ├── Investigation
│   ├── Timeline
│   ├── Runbook
│   ├── Dashboard
│   ├── OperationalView
│   └── PostIncidentObservation
└── EvidenceEntity
    ├── ObservabilityEvidence
    ├── Query
    ├── QueryResult
    ├── Snapshot
    ├── Annotation
    ├── Correlation
    └── RootCauseHypothesis
```

---

# 11. Core Concepts

## 11.1 ObservabilityEntity

An **ObservabilityEntity** is any identifiable concept used to represent telemetry, signals, correlation, health, service levels, alerts, incidents as observed phenomena, dashboards, runbooks, or operational evidence.

Recommended attributes:

```yaml
id:
entity_type:
canonical_name:
display_name:
lifecycle_state:
source_system:
created_at:
updated_at:
```

Optional attributes:

```yaml
owner:
steward:
observed_resource:
service:
environment:
source_confidence:
valid_from:
valid_to:
tags:
external_references:
```

---

## 11.2 Telemetry

**Telemetry** is machine-generated or manually recorded operational data about system behavior, state, performance, events, or activity.

Examples:

```text
metric sample
log record
trace span
event
profile sample
flow record
health check result
```

---

## 11.3 TelemetrySource

A **TelemetrySource** is a system, component, agent, collector, service, device, pipeline, or actor that emits or provides telemetry.

---

## 11.4 ObservedResource

An **ObservedResource** is the entity about which telemetry is emitted or collected.

Observed resources SHOULD map to Landscape, Network, Data, Security, or DevSecOps entities where possible.

---

## 11.5 ResourceAttribute

A **ResourceAttribute** is an attribute describing an observed resource.

Examples:

```text
service.name
service.version
deployment.environment
host.name
cloud.region
k8s.cluster.name
container.image.name
```

---

## 11.6 Signal

A **Signal** is an interpretable unit or stream of observability data.

Signal types include:

```text
metric
log
trace
event
profile
alert
health check
synthetic result
```

---

## 11.7 SignalSource

A **SignalSource** is the origin of a signal.

---

## 11.8 TelemetryPipeline

A **TelemetryPipeline** is a flow that collects, processes, transforms, samples, enriches, routes, stores, or exports telemetry.

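
To make the pipeline idea concrete, the following minimal sketch models a TelemetryPipeline as an ordered list of stages, each consuming and yielding records. The record shape (plain dictionaries), the stage names `enrich` and `drop_below`, and the `run_pipeline` helper are all hypothetical and chosen only for illustration; the severity order reuses the LogLevel examples from this standard.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict  # one telemetry record, here a log record as key/value pairs
Stage = Callable[[Iterable[Record]], Iterator[Record]]

SEVERITY_ORDER = ["trace", "debug", "info", "warn", "error", "fatal"]


def enrich(resource_attributes: dict) -> Stage:
    """Build a stage that adds resource attributes without overwriting record fields."""
    def stage(records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            yield {**resource_attributes, **record}
    return stage


def drop_below(min_severity: str) -> Stage:
    """Build a stage that filters out records less severe than min_severity."""
    threshold = SEVERITY_ORDER.index(min_severity)

    def stage(records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            if SEVERITY_ORDER.index(record.get("severity", "info")) >= threshold:
                yield record
    return stage


def run_pipeline(records: Iterable[Record], stages: List[Stage]) -> List[Record]:
    """Chain the stages lazily, then materialize the surviving records."""
    for stage in stages:
        records = stage(records)
    return list(records)


logs = [
    {"severity": "debug", "message": "cache miss"},
    {"severity": "error", "message": "upstream timeout"},
]
# Enrich every record with a resource attribute, then keep warn and above.
out = run_pipeline(logs, [enrich({"service.name": "checkout"}), drop_below("warn")])
```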

---

## 11.9 Metric

A **Metric** is a measurement of a system, service, resource, or process over time.

Metrics may be used for alerting, dashboards, SLOs, capacity planning, anomaly detection, and evidence.

---

## 11.10 MetricInstrument

A **MetricInstrument** defines the kind of measurement instrument.

Seed instrument types:

```text
counter
gauge
histogram
summary
up_down_counter
observable_gauge
```

---

## 11.11 TimeSeries

A **TimeSeries** is a sequence of metric points over time for a metric and a set of dimensions or labels.

---

## 11.12 MetricPoint

A **MetricPoint** is a single measurement value at a time.

---

## 11.13 Counter

A **Counter** is a monotonically increasing measurement of occurrences or accumulated quantity.

---

## 11.14 Gauge

A **Gauge** is a measurement that can go up or down.

---

## 11.15 Histogram

A **Histogram** is a distribution of measurements across buckets or ranges.

---

## 11.16 Summary

A **Summary** is a metric representation of observations including quantiles or summary statistics.

---

## 11.17 Exemplar

An **Exemplar** is a representative sample connecting an aggregate metric point to a trace, log, or other detailed signal.

---

## 11.18 Log

A **Log** is a stream or collection of timestamped records describing events, state, actions, or messages.

---

## 11.19 LogRecord

A **LogRecord** is a single log entry.

Recommended attributes:

```yaml
timestamp:
severity:
message:
body:
resource:
trace_id:
span_id:
attributes:
source:
```

---

## 11.20 LogStream

A **LogStream** is a sequence of log records from a source or resource.

---

## 11.21 LogLevel

A **LogLevel** is a severity or importance category for log records.

Examples:

```text
trace
debug
info
warn
error
fatal
```

---

## 11.22 LogContext

**LogContext** is contextual metadata attached to log records.
Examples:

```text
request id
trace id
user id reference
tenant id
deployment version
environment
component
```

---

## 11.23 Trace

A **Trace** is a representation of a request, transaction, workflow, or operation as it moves through a distributed system.

---

## 11.24 Span

A **Span** is a single timed operation within a trace.

Recommended attributes:

```yaml
trace_id:
span_id:
parent_span_id:
name:
kind:
start_time:
end_time:
status:
attributes:
events:
links:
```

---

## 11.25 SpanEvent

A **SpanEvent** is a timestamped event attached to a span.

---

## 11.26 SpanLink

A **SpanLink** connects a span to another span or trace context.

---

## 11.27 TraceContext

**TraceContext** is propagation metadata that links operations across process, service, or network boundaries.

---

## 11.28 Baggage

**Baggage** is contextual metadata propagated across process boundaries.

Baggage SHOULD be governed carefully when it may contain sensitive data.

---

## 11.29 TraceSample

A **TraceSample** is a selected trace or subset of trace data retained or analyzed.

---

## 11.30 Event

An **Event** is a record of an occurrence and its context.

Events may be operational, domain, security, deployment, infrastructure, or business events.

---

## 11.31 EventEnvelope

An **EventEnvelope** is structured metadata around event data.

CloudEvents is a primary mapping target.

---

## 11.32 EventSource

An **EventSource** is the producer or origin of an event.

---

## 11.33 EventType

An **EventType** classifies the kind of occurrence represented by an event.

---

## 11.34 EventConsumer

An **EventConsumer** is an actor, system, service, or pipeline that consumes events.

---

## 11.35 EventCorrelationKey

An **EventCorrelationKey** links events to related traces, logs, requests, incidents, deployments, or resources.

---

## 11.36 Profile

A **Profile** is sampled performance or resource-use data.

This concept is observability-specific and distinct from InfoTechCanon application profiles.


---

## 11.37 ProfilingSample

A **ProfilingSample** is one sample of profiling data.

---

## 11.38 ResourceProfile

A **ResourceProfile** describes resource use over time or sampled execution.

Examples:

```text
CPU profile
memory profile
allocation profile
lock profile
I/O profile
```

---

## 11.39 PerformanceProfile

A **PerformanceProfile** describes performance characteristics of a system, component, or operation.

---

## 11.40 SLI

A **Service Level Indicator** is a quantitative measure of a service level.

Examples:

```text
availability
latency
error rate
throughput
correctness
freshness
durability
```

---

## 11.41 SLO

A **Service Level Objective** is a target value or range for an SLI over a defined measurement window.

Recommended attributes:

```yaml
service:
sli:
target:
window:
scope:
owner:
evidence_source:
```

---

## 11.42 SLAReference

An **SLAReference** points to a contractual or formal service-level agreement.

Governance owns contractual obligation semantics. Observability owns measured service-level signals.

---

## 11.43 ErrorBudget

An **ErrorBudget** is the allowed amount of unreliability implied by an SLO over a measurement window.

---

## 11.44 BurnRate

**BurnRate** is the rate at which an error budget is being consumed.

---

## 11.45 HealthState

**HealthState** is an assessed operational state of a resource, service, dependency, or system.

Seed health states:

```text
unknown
healthy
degraded
unhealthy
down
recovering
maintenance
```

---

## 11.46 AvailabilityWindow

An **AvailabilityWindow** is the time period over which availability or service level is measured.

---

## 11.47 AlertRule

An **AlertRule** defines conditions under which an alert is created.

Recommended attributes:

```yaml
query:
condition:
threshold:
window:
severity:
for_duration:
labels:
annotations:
owner:
runbook:
```

---

## 11.48 Alert

An **Alert** is an instance of an alert rule firing or resolving.
Seed alert states:

```text
pending
firing
acknowledged
suppressed
resolved
expired
```

---

## 11.49 Notification

A **Notification** is a message sent to humans, agents, or systems about an alert, incident, or operational state.

---

## 11.50 AlertRoute

An **AlertRoute** defines how alerts are routed to responders, teams, tools, or escalation paths.

---

## 11.51 AlertSuppression

**AlertSuppression** is a rule or state that suppresses notifications for known, duplicate, maintenance, or intentionally ignored alert conditions.

---

## 11.52 AlertCorrelation

**AlertCorrelation** groups related alerts or signals.

---

## 11.53 EscalationReference

An **EscalationReference** points to Organization, Task, or Governance concepts defining who should respond and how escalation works.

---

## 11.54 ObservedIncident

An **ObservedIncident** is an operationally significant situation inferred or declared from observability signals.

Task and ITSM systems may own incident work records. Observability owns the signal-derived incident view.

---

## 11.55 Investigation

An **Investigation** is analysis of signals, alerts, telemetry, incidents, or hypotheses to understand cause, scope, impact, and remediation.

---

## 11.56 Timeline

A **Timeline** is an ordered sequence of events, signals, decisions, actions, and observations.

---

## 11.57 Runbook

A **Runbook** is an operational procedure used to investigate, respond, recover, or verify a condition.

---

## 11.58 Dashboard

A **Dashboard** is a visual or structured view of observability data.

---

## 11.59 OperationalView

An **OperationalView** is a purpose-specific view of system state, health, risk, or performance.

---

## 11.60 PostIncidentObservation

A **PostIncidentObservation** is a signal, fact, lesson, or finding captured after an incident.

---

## 11.61 ObservabilityEvidence

**ObservabilityEvidence** is telemetry, query output, screenshot, dashboard state, trace, log, metric, or event used to support a claim.

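
Principle 8.4 states that an alert is not evidence by itself and that evidence should include the underlying signals, query, thresholds, state, and context. One possible way to check this is sketched below. The `AlertEvidence` class, its field names, and the `missing_parts` method are assumptions for this example only, not part of the canonical model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AlertEvidence:
    """Hypothetical evidence record bundling an alert with what made it fire."""
    alert_id: str
    signal_refs: list[str] = field(default_factory=list)   # ids of metrics, logs, traces
    query: str = ""                                        # expression that was evaluated
    threshold: Optional[float] = None                      # value the rule compared against
    alert_state: str = ""                                  # e.g. "firing" or "resolved"
    context: dict[str, str] = field(default_factory=dict)  # service, environment, ...
    captured_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def missing_parts(self) -> list[str]:
        """List the parts named in principle 8.4 that this record still lacks."""
        missing = []
        if not self.signal_refs:
            missing.append("signals")
        if not self.query:
            missing.append("query")
        if self.threshold is None:
            missing.append("threshold")
        if not self.alert_state:
            missing.append("state")
        if not self.context:
            missing.append("context")
        return missing


# An alert id plus query and threshold is still incomplete as evidence.
partial = AlertEvidence(alert_id="alert-123", query="error_rate > 0.05", threshold=0.05)
gaps = partial.missing_parts()
```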

---

## 11.62 Query

A **Query** is an expression used to retrieve or calculate observability data.

Examples:

```text
PromQL query
LogQL query
SQL query
trace search
SIEM query
dashboard panel query
```

---

## 11.63 QueryResult

A **QueryResult** is the result of executing a query.

---

## 11.64 Snapshot

A **Snapshot** is a captured state of telemetry, dashboard, trace, log, metric, or query result at a time.

---

## 11.65 Annotation

An **Annotation** is a human, agent, or system-added note attached to telemetry, dashboard, timeline, incident, deployment, or event.

---

## 11.66 Correlation

A **Correlation** is a relationship linking signals, resources, events, deployments, incidents, or hypotheses.

---

## 11.67 RootCauseHypothesis

A **RootCauseHypothesis** is a candidate explanation for an observed issue.

Canonical rule:

```text
RootCauseHypothesis SHOULD remain distinguishable from verified cause.
```

---

# 12. Core Relationship Vocabulary

Recommended root relationship types:

```text
emitted_by
observes
measures
describes
correlates_with
derived_from
generated_by
triggered_by
alerts_on
routes_to
acknowledged_by
suppressed_by
resolves
affects
indicates
supports
evidences
verifies
invalidates
samples
aggregates
annotates
links_to
maps_to
```

Relationship records SHOULD support:

```yaml
id:
relationship_type:
source_entity:
target_entity:
scope:
time_window:
state_context:
valid_from:
valid_to:
source_system:
confidence:
evidence:
rationale:
```

---

# 13. Observability State Models

## 13.1 Signal States

```text
unknown
emitting
missing
delayed
partial
degraded
invalid
stale
```

## 13.2 Alert States

```text
pending
firing
acknowledged
suppressed
resolved
expired
```

## 13.3 Incident Observation States

```text
suspected
confirmed
investigating
mitigating
recovering
resolved
post_review
closed
```

## 13.4 Health States

```text
unknown
healthy
degraded
unhealthy
down
recovering
maintenance
```

## 13.5 SLO States

```text
not_measured
within_budget
burning_fast
at_risk
violated
paused
retired
```

## 13.6 Telemetry Pipeline States

```text
configured
active
degraded
dropping_data
stalled
misconfigured
retired
```

---

# 14. Observability Patterns

## 14.1 Pattern: Resource-Linked Telemetry

**Context:** Telemetry is collected from many systems.

**Problem:** Signals are hard to interpret if they cannot be linked to canonical resources.

**Solution:** Attach telemetry to ObservedResource references mapped to Landscape, Network, DevSecOps, Security, or Data entities.

---

## 14.2 Pattern: Signal-to-Alert-to-Task

**Context:** A condition needs human or agent response.

**Problem:** Alerts fire but do not become accountable work.

**Solution:**

```text
Signal
  -> AlertRule
  -> Alert
  -> ObservedIncident or Task
  -> Investigation
  -> RemediationTask
  -> VerificationEvidence
```

---

## 14.3 Pattern: SLO as Reliability Contract

**Context:** Service reliability must be operationally meaningful.

**Problem:** Teams alert on low-level metrics that do not represent user experience.

**Solution:** Define SLIs and SLOs for user-meaningful service behavior and use error budgets to guide action.

---

## 14.4 Pattern: Deployment Health Verification

**Context:** A deployment has completed.

**Problem:** Successful deployment command does not prove healthy service behavior.

**Solution:** Link DeploymentRecord to DeploymentHealthSignal, SLO state, traces, logs, metrics, and verification evidence.

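
One possible reading of the Deployment Health Verification pattern is a comparison of error rates in equal windows before and after the deployment. The sketch below is illustrative only: the function names, the `(errors, requests)` tuple shape, and the default thresholds are assumptions, and the returned strings reuse the seed health states. Too little post-deployment traffic yields `unknown` rather than a false `healthy`.

```python
from typing import Tuple

Window = Tuple[int, int]  # (error_count, request_count) for one observation window


def error_rate(window: Window) -> float:
    """Return errors divided by requests, or 0.0 for an empty window."""
    errors, requests = window
    return errors / requests if requests else 0.0


def verify_deployment(
    before: Window,
    after: Window,
    max_increase: float = 0.01,   # hypothetical tolerated rise in error rate
    min_requests: int = 100,      # hypothetical minimum traffic for a verdict
) -> str:
    """Return a seed health state for the deployed service."""
    if after[1] < min_requests:
        return "unknown"          # not enough signal to verify anything
    delta = error_rate(after) - error_rate(before)
    return "healthy" if delta <= max_increase else "degraded"


# 0.5% errors before, 0.8% after: within the tolerated increase.
verdict = verify_deployment(before=(5, 1000), after=(8, 1000))
```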

---

## 14.5 Pattern: Correlated Timeline

**Context:** Incidents require understanding what happened.

**Problem:** Logs, alerts, deployments, changes, and network events are scattered.

**Solution:** Build Timeline from correlated events, alerts, traces, deployment records, annotations, and task actions.

---

## 14.6 Pattern: Alert with Runbook

**Context:** An alert requires response.

**Problem:** Responders waste time discovering what the alert means.

**Solution:** AlertRule SHOULD reference owner, runbook, dashboard, likely causes, and escalation path.

---

## 14.7 Pattern: Metric with Exemplar

**Context:** Aggregate metrics show a problem.

**Problem:** Aggregates hide individual requests or traces.

**Solution:** Link MetricPoint or histogram bucket to trace/log exemplar.

---

## 14.8 Pattern: Observability as Governance Evidence

**Context:** Governance requires proof that controls or SLOs are operating.

**Problem:** Compliance claims rely on manual screenshots or weak assertions.

**Solution:** Use query results, snapshots, dashboards, and telemetry evidence as structured ObservabilityEvidence.

---

## 14.9 Pattern: Missing Signal as Signal

**Context:** A telemetry source goes silent.

**Problem:** Systems only alert on bad values, not missing data.

**Solution:** Model missing, stale, or delayed telemetry as signal states and potential alerts.

---

# 15. Observability Profiles

## 15.1 Profile Format

An Observability Profile SHALL declare:

```yaml
id:
profile_name:
status:
implements:
  - InfoTechCanonObservabilityModel
target_context:
included_concepts:
required_relationships:
required_metadata:
state_model:
source_of_truth_rules:
mapping_files:
validation_rules:
examples:
known_deviations:
```

---

## 15.2 Seed Profile: Small SaaS Observability Profile

Purpose:

```text
Provide a minimal observability model for a small SaaS platform moving toward production readiness.
```

Included concepts:

```text
ObservedResource
Metric
LogRecord
Trace
Span
Event
AlertRule
Alert
Dashboard
Runbook
SLI
SLO
HealthState
ObservedIncident
ObservabilityEvidence
```

Required relationships:

```text
Metric emitted_by ObservedResource
LogRecord emitted_by ObservedResource
Trace observes Service
Alert triggered_by AlertRule
Alert affects Service
SLO measures Service
Dashboard displays Metric
Runbook supports Alert
ObservabilityEvidence supports Investigation
```

---

## 15.3 Seed Profile: OpenTelemetry Profile

Purpose:

```text
Map OpenTelemetry resources, traces, metrics, logs, attributes, baggage, and semantic conventions into InfoTechCanon.
```

Example mappings:

```text
Resource             -> ObservedResource
Resource attributes  -> ResourceAttribute
Metric               -> Metric
LogRecord            -> LogRecord
Trace                -> Trace
Span                 -> Span
Span event           -> SpanEvent
Span link            -> SpanLink
Baggage              -> Baggage
Semantic conventions -> Mapping / Attribute vocabulary
Collector            -> TelemetryPipeline component
Exporter             -> TelemetryPipeline component
```

---

## 15.4 Seed Profile: Prometheus / OpenMetrics Profile

Purpose:

```text
Represent metrics, labels, time series, scrape targets, alert rules, and query results.
```

Example mappings:

```text
metric name    -> Metric
labels         -> dimensions / attributes
sample         -> MetricPoint
time series    -> TimeSeries
PromQL         -> Query
recording rule -> DerivedMetric / Query
alerting rule  -> AlertRule
target         -> TelemetrySource / ObservedResource
```

---

## 15.5 Seed Profile: CloudEvents Profile

Purpose:

```text
Represent event metadata and event envelopes.
```

Example mappings:

```text
id              -> Event id
source          -> EventSource
type            -> EventType
specversion     -> EventEnvelope version
subject         -> Event subject
time            -> Event timestamp
datacontenttype -> Event data content type
data            -> Event data
```

---

## 15.6 Seed Profile: SRE Reliability Profile

Purpose:

```text
Represent SLIs, SLOs, error budgets, burn rates, and reliability decisions.
```

Included concepts:

```text
SLI
SLO
ErrorBudget
BurnRate
AvailabilityWindow
AlertRule
ReliabilityReview
ServiceHealthState
ErrorBudgetPolicyReference
```

Required relationships:

```text
SLO applies_to Service
SLI measures Service
ErrorBudget derived_from SLO
BurnRate measures ErrorBudgetConsumption
AlertRule alerts_on BurnRate
ReliabilityReview reviews SLOState
```

---

## 15.7 Seed Profile: Incident Observability Profile

Purpose:

```text
Represent telemetry, alerts, timelines, dashboards, and evidence for incident response.
```

Included concepts:

```text
Alert
ObservedIncident
Timeline
Investigation
Dashboard
Runbook
Annotation
RootCauseHypothesis
ObservabilityEvidence
PostIncidentObservation
```

---

## 15.8 Seed Profile: Network Observability Profile

Purpose:

```text
Represent network metrics, flow logs, reachability tests, DNS logs, and latency signals.
```

Included concepts:

```text
NetworkMetric
ObservedFlowSignal
DNSLogRecord
ReachabilityTestResult
LatencyMetric
PacketLossMetric
EndpointHealthSignal
```

Mapping targets:

```text
NetFlow/IPFIX
VPC Flow Logs
Kubernetes CNI telemetry
service mesh telemetry
DNS logs
synthetic probes
```

---

## 15.9 Seed Profile: Security Observability Profile

Purpose:

```text
Represent observability signals used for security detection, investigation, and evidence.
```

Included concepts:

```text
SecuritySignal
SecurityLogRecord
DetectionEvent
Alert
TraceEvidence
AccessSessionLog
AuditLogReference
SecurityEvidence
```

Security interpretation remains owned by the Security Model.

---

# 16. Mapping Model for the Observability Standard

Mappings relate InfoTechCanon observability concepts to external standards, tools, and products.

## 16.1 Mapping Types

Recommended mapping types:

```text
exactMatch
closeMatch
broadMatch
narrowMatch
relatedMatch
conflictMatch
gapMatch
derivedFrom
regulatoryReference
toolEquivalent
```

## 16.2 Mapping Record

Example:

```yaml
id: itc-map:span-to-opentelemetry-span
source_concept: itc-obs:Span
target_body: OpenTelemetry
target_version: "current"
target_concept: Span
mapping_type: closeMatch
scope:
  - distributed tracing
not_valid_for:
  - all event or log semantics
rationale: >
  OpenTelemetry Span is the primary mapping target for timed operations in traces.
  InfoTechCanon keeps Span as a canonical concept to allow mappings to other tracing systems.
confidence: high
status: candidate
owner: InfoTechCanonObservabilityModel
```

## 16.3 Seed Mapping Targets

The Observability Model SHOULD maintain mappings to:

```text
OpenTelemetry
OpenTelemetry Semantic Conventions
Prometheus
OpenMetrics / Prometheus exposition format
CloudEvents
W3C Trace Context
Google SRE SLI/SLO/Error Budget concepts
Grafana dashboards and alerting
Prometheus Alertmanager
Loki / LogQL
Jaeger
Tempo
Elastic Observability
Datadog
New Relic
Splunk
OpenSearch
ITIL incident concepts
NetFlow / IPFIX
VPC Flow Logs
Kubernetes events and metrics
service mesh telemetry
```

---

# 17. Assimilation Hooks

The Observability Model SHALL be able to receive new observability standards, tool models, telemetry schemas, incident practices, and operational patterns through the InfoTechCanon assimilation process.
## 17.1 Assimilation Triggers

Assimilation may be triggered by:

```text
new telemetry standard
new observability backend
new incident-management tool
new SLO practice
new dashboard model
new alerting model
new tracing model
new logging schema
new AIOps product
new runtime verification practice
new recurring signal classification conflict
```

## 17.2 Observability Assimilation Output

An observability assimilation SHOULD produce:

```text
source summary
extracted observability concepts
concept comparison matrix
gap list
conflict list
mapping file
candidate new concepts
candidate relationship changes
candidate pattern changes
candidate profile changes
open questions
```

## 17.3 Recommended First Assimilation Candidates

```text
OpenTelemetry specification and semantic conventions
Prometheus / OpenMetrics
CloudEvents
W3C Trace Context
Google SRE SLO chapters
Grafana dashboard and alerting models
Prometheus Alertmanager
Kubernetes events and metrics
VPC Flow Logs / NetFlow / IPFIX
ITIL incident management concepts
```

---

# 18. Integration with Other InfoTechCanon Standards

## 18.1 Landscape Model

Observability links signals to:

```text
ApplicationService
TechnicalService
RuntimeWorkload
Environment
Endpoint
DataStore
DeploymentRecord
NetworkEntity
```

## 18.2 Organization Model

Observability imports organization concepts for:

```text
service owner
on-call responder
team
escalation target
runbook owner
incident commander
```

## 18.3 Governance Model

Observability imports governance concepts for:

```text
evidence
control result
review
assurance
policy
SLA obligation
audit evidence
```

## 18.4 Task Model

Observability creates or references:

```text
incident task
investigation task
remediation task
follow-up task
reliability improvement task
```

## 18.5 Tagging Standard

Observability uses tags for:

```text
service
environment
severity
signal type
dashboard category
incident category
team
```

Tags must not replace ObservedResource, AlertRule, SLO, or Evidence records.
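
The distinction between a tag and a record can be sketched as a small check. This is a minimal illustration only: `SignalRecord`, its field names, and `relies_on_tags_for_identity` are hypothetical, and the rule shown covers just the case where a `service` tag stands in for a missing ObservedResource reference.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SignalRecord:
    """Hypothetical telemetry record; field names are illustrative only."""
    signal_id: str
    observed_resource_id: Optional[str] = None  # typed link to an ObservedResource
    tags: dict = field(default_factory=dict)    # lightweight classification only


def relies_on_tags_for_identity(record):
    """True when a `service` tag is present but no ObservedResource is referenced."""
    return record.observed_resource_id is None and "service" in record.tags


tagged_only = SignalRecord("sig-1", tags={"service": "checkout", "severity": "high"})
linked = SignalRecord(
    "sig-2",
    observed_resource_id="res-checkout-api",
    tags={"service": "checkout"},
)
print(relies_on_tags_for_identity(tagged_only), relies_on_tags_for_identity(linked))
```

In this sketch the tag remains useful for filtering and grouping, while the typed reference is what carries resource identity.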
## 18.6 Access Control Model

Observability imports access concepts for:

```text
dashboard access
log access
trace access
incident tool access
telemetry pipeline access
sensitive telemetry access
```

## 18.7 Security Model

Security imports observability concepts for:

```text
security signal
detection evidence
security alert
audit log
trace evidence
incident timeline
```

## 18.8 Data Model

Data imports observability concepts when telemetry is treated as a dataset and for data freshness, quality, and lineage signals.

## 18.9 DevSecOps Model

DevSecOps imports observability concepts for:

```text
deployment verification
change failure detection
delivery metric
runtime feedback
SLO impact
```

## 18.10 Network Model

Network imports observability concepts for:

```text
flow logs
reachability test results
latency
packet loss
DNS logs
endpoint health
```

---

# 19. Canon Interface Card Usage

Subsystems that implement or produce observability knowledge SHOULD publish a Canon Interface Card.

Example:

```yaml
subsystem: prometheus-importer
implements:
  - InfoTechCanonObservabilityModel
  - PrometheusOpenMetricsProfile
produces:
  - Metric
  - TimeSeries
  - MetricPoint
  - AlertRule
  - Alert
  - QueryResult
consumes:
  - ObservedResource
  - Service
  - Environment
relations:
  - Metric emitted_by ObservedResource
  - Alert triggered_by AlertRule
  - Alert affects Service
source_of_truth:
  metric_samples: Prometheus
  alert_rule_state: Prometheus
known_deviations:
  - resource identity depends on labels
  - long-term retention may be external
```

---

# 20. Retrieval Requirements

The Observability Model is designed for markdown-based infospaces.

## 20.1 Required Retrieval Properties

Every major concept SHOULD provide:

- stable heading,
- stable identifier,
- short definition,
- longer explanation,
- examples,
- distinction notes,
- relationship examples,
- mapping hooks,
- profile references,
- and common mistakes.
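
The list above can be treated as a checklist. The following sketch assumes a concept page has already been parsed into a dictionary keyed by property name. That representation, the key spellings, the sample values, and the `missing_properties` helper are all assumptions made for illustration.

```python
# Required retrieval properties from section 20.1 (key spelling is an assumption).
REQUIRED_PROPERTIES = [
    "stable heading",
    "stable identifier",
    "short definition",
    "longer explanation",
    "examples",
    "distinction notes",
    "relationship examples",
    "mapping hooks",
    "profile references",
    "common mistakes",
]


def missing_properties(concept_page):
    """Return required properties that are absent or empty in a concept page.

    `concept_page` is a hypothetical dict mapping property name to its content,
    for example as produced by parsing a concept markdown file.
    """
    return [name for name in REQUIRED_PROPERTIES if not concept_page.get(name)]


# Illustrative, incomplete page for the Span concept.
span_page = {
    "stable heading": "Span",
    "stable identifier": "itc-obs:Span",
    "short definition": "A timed operation within a trace.",
}
print(missing_properties(span_page))
```

Because these properties are SHOULD-level, a check like this is better suited to reporting gaps than to rejecting a page.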
## 20.2 Agent Brief

A mature Observability Model SHOULD include an `agent-brief.md` file with:

```text
purpose
scope
owned concepts
imported concepts
core distinctions
do / do not rules
relationship patterns
minimal examples
common mistakes
profile list
mapping list
```

## 20.3 Indexes

The observability information space SHOULD provide indexes by:

```text
concept
relationship
signal type
metric
log
trace
event
resource
service
alert
SLO
dashboard
incident
profile
pattern
mapping target
status
source system
```

---

# 21. Conformance Levels

## 21.1 Reference-Conformant

A document or system is reference-conformant if it uses Observability Model terminology consistently but does not implement structured metadata or validation rules.

## 21.2 Metadata-Conformant

A system is metadata-conformant if it uses stable identifiers, concept names, lifecycle states, source metadata, and relationship types.

## 21.3 Signal-Conformant

A system is signal-conformant if it distinguishes metrics, logs, traces, events, profiles, alerts, and health signals.

## 21.4 Resource-Correlated

A system is resource-correlated if observability signals can be linked to observed resources and canonical landscape entities.

## 21.5 SLO-Conformant

A system is SLO-conformant if it represents SLIs, SLOs, error budgets, burn rates, and measurement windows.

## 21.6 Evidence-Conformant

A system is evidence-conformant if observability claims, incidents, alerts, and service-level states can be linked to evidence.

## 21.7 Profile-Conformant

A system is profile-conformant if it implements a declared Observability Profile and passes its validation rules.

## 21.8 Assimilation-Conformant

A system or repository is assimilation-conformant if it can accept external observability concepts through the InfoTechCanon assimilation workflow and produce mappings, gaps, conflicts, and proposed changes.

---

# 22. Validation Rules

Initial validation rules:

```text
VAL-OBS-001: Metric, LogRecord, Trace, Span, Event, Profile, Alert, and Incident SHOULD be modeled as distinct concepts.
VAL-OBS-002: Telemetry SHOULD reference an ObservedResource where possible.
VAL-OBS-003: ObservedResource SHOULD map to a Landscape, Network, Data, Security, or DevSecOps entity where possible.
VAL-OBS-004: Metric SHOULD declare unit, instrument type, source, and dimensions where available.
VAL-OBS-005: TimeSeries SHOULD distinguish metric identity from labels/dimensions.
VAL-OBS-006: LogRecord SHOULD include timestamp, severity, source, and body where available.
VAL-OBS-007: Span SHOULD include trace id, span id, timing, name, status, and parent/link references where available.
VAL-OBS-008: Event SHOULD distinguish event data from event context metadata.
VAL-OBS-009: Alert SHOULD reference AlertRule or source condition where available.
VAL-OBS-010: AlertRule SHOULD reference query or condition, threshold, time window, owner, and runbook where applicable.
VAL-OBS-011: SLO SHOULD reference SLI, target, measurement window, service, and evidence source.
VAL-OBS-012: ErrorBudget SHOULD derive from an SLO.
VAL-OBS-013: Dashboard SHOULD NOT be treated as evidence unless a Snapshot or QueryResult is captured.
VAL-OBS-014: Incident SHOULD NOT be inferred solely from one alert unless profile permits it.
VAL-OBS-015: RootCauseHypothesis SHOULD remain distinguishable from verified cause.
VAL-OBS-016: Missing, stale, or delayed telemetry SHOULD be representable as signal state.
VAL-OBS-017: Tags MUST NOT replace resource identity, SLO definitions, alert rules, or evidence.
VAL-OBS-018: Imported external observability concepts SHOULD be represented through mapping records rather than silently reused.
VAL-OBS-019: Profiles MUST NOT redefine canonical concepts. They may constrain them.
VAL-OBS-020: Telemetry containing sensitive data SHOULD reference Data, Security, Access Control, or Governance constraints where relevant.
```

---

# 23. Anti-Patterns

## 23.1 Dashboard as Truth

Treating a dashboard view as evidence without preserving query, time window, data source, or snapshot.

## 23.2 Alert Equals Incident

Treating every alert as an incident.

## 23.3 Metric Soup

Collecting many metrics without ownership, resource identity, interpretation, or action path.

## 23.4 Logs Without Context

Logging messages that cannot be correlated to service, request, trace, tenant, deployment, or resource.

## 23.5 Traces Without Boundaries

Tracing calls without linking them to service ownership, deployment version, or runtime resource.

## 23.6 SLO Theater

Creating SLOs that do not reflect user experience or guide operational decisions.

## 23.7 Alert Without Runbook

Creating alerts without ownership, runbook, dashboard, or response expectation.

## 23.8 Missing Signal Blindness

Failing to alert when telemetry stops arriving.

## 23.9 Tool-Native Capture

Letting one observability backend define the internal observability model.

## 23.10 Telemetry Without Governance

Collecting sensitive logs, traces, or profiles without classification, retention, access control, or privacy consideration.

---

# 24. Initial Repository Placement

Recommended repository layout:

```text
info-tech-canon/
  standards/
    observability/
      InfoTechCanonObservabilityModel.md
      agent-brief.md
      concepts/
      relationships/
      patterns/
      profiles/
      mappings/
      assimilation/
      examples/
      validation/
```

Seed files:

```text
standards/observability/InfoTechCanonObservabilityModel.md
standards/observability/agent-brief.md
standards/observability/concepts/telemetry.md
standards/observability/concepts/metric.md
standards/observability/concepts/log-record.md
standards/observability/concepts/trace.md
standards/observability/concepts/span.md
standards/observability/concepts/event.md
standards/observability/concepts/sli.md
standards/observability/concepts/slo.md
standards/observability/concepts/alert.md
standards/observability/concepts/observability-evidence.md
standards/observability/patterns/resource-linked-telemetry.md
standards/observability/patterns/signal-to-alert-to-task.md
standards/observability/patterns/slo-as-reliability-contract.md
standards/observability/patterns/deployment-health-verification.md
standards/observability/profiles/small-saas-observability-profile.md
standards/observability/profiles/opentelemetry-profile.md
standards/observability/profiles/prometheus-openmetrics-profile.md
standards/observability/profiles/sre-reliability-profile.md
standards/observability/mappings/opentelemetry.yaml
standards/observability/mappings/prometheus-openmetrics.yaml
standards/observability/mappings/cloudevents.yaml
standards/observability/mappings/sre-slo.yaml
```

---

# 25. Roadmap

## Phase 1: Seed Stabilization

- Establish this standard as `InfoTechCanonObservabilityModel`.
- Add seed concepts, relationship vocabulary, patterns, and profiles.
- Define validation rules.
- Align with Landscape, Network, DevSecOps, Security, Data, Governance, Task, Access Control, and Tagging.
## Phase 2: First Assimilations

Recommended first assimilations:

```text
OpenTelemetry specification and semantic conventions
Prometheus / OpenMetrics
CloudEvents
W3C Trace Context
Google SRE SLO chapters
Grafana dashboard and alerting model
Prometheus Alertmanager
Kubernetes events and metrics
VPC Flow Logs / NetFlow / IPFIX
ITIL incident management concepts
```

## Phase 3: Profile Maturation

- Mature Small SaaS Observability Profile.
- Mature OpenTelemetry Profile.
- Mature Prometheus / OpenMetrics Profile.
- Mature CloudEvents Profile.
- Mature SRE Reliability Profile.
- Mature Incident Observability Profile.
- Mature Network Observability Profile.
- Mature Security Observability Profile.

## Phase 4: Tooling Integration

- Generate concept indexes.
- Generate agent brief.
- Create machine-readable YAML/JSON exports.
- Add validation scripts.
- Integrate telemetry pipelines, metrics, logs, traces, dashboards, alerts, incident tools, and service catalogs.

## Phase 5: Operational Intelligence Loop

- Connect telemetry to canonical resources.
- Connect alerts to tasks and incidents.
- Connect SLOs to governance and service ownership.
- Connect deployment records to runtime health signals.
- Connect security detections to security incidents.
- Connect network flows to reachability and exposure.
- Connect post-incident observations to improvements and standard evolution.

---

# 26. Summary

The InfoTechCanon Observability Model is the seed standard for representing telemetry, signals, metrics, logs, traces, events, profiles, alerts, SLOs, health, incidents as observed phenomena, and operational evidence.

Its most important commitments are:

```text
Separate telemetry, signal, metric, log, trace, span, event, profile, alert, and incident.
Link signals to canonical resources and landscape entities.
Treat SLOs, SLIs, error budgets, burn rates, and health states as first-class reliability concepts.
Use observability evidence to support governance, security, delivery, incident response, and operational review.
Map to OpenTelemetry, Prometheus/OpenMetrics, CloudEvents, SRE practices, and observability tools without surrendering internal semantic autonomy.
Use profiles to make the model practical for SaaS systems, OpenTelemetry, Prometheus, SRE reliability, incident response, network observability, and security observability.
```

This makes the Observability Model a core seed for runtime intelligence, production readiness, SRE practice, incident response, deployment verification, security detection, and agent-supported operations.