# InfoTechCanon Observability Model
**Short Name:** `ITC-OBS`
**Document Status:** Seed Standard Release Candidate 1
**Version:** RC1-seed
**Date:** 2026-05-23
**Repository Context:** `info-tech-canon`
**Document Type:** InfoTechCanon Domain Standard
**Intended Audience:** SREs, platform engineers, DevSecOps teams, service owners, observability engineers, incident responders, network operators, security analysts, product owners, governance designers, knowledge-system builders, and agentic tooling.
---
# 1. Purpose
The **InfoTechCanon Observability Model** defines a canonical seed model for representing telemetry, signals, events, logs, metrics, traces, profiles, health, service levels, alerts, incidents as observed phenomena, dashboards, runbooks, investigations, and operational evidence.
It exists to make runtime understanding interoperable across systems, services, platforms, networks, security, delivery pipelines, data products, and agentic operations.
This standard provides a canonical vocabulary for:
- telemetry sources,
- resources,
- signals,
- metrics,
- logs,
- events,
- traces,
- spans,
- profiles,
- exemplars,
- attributes,
- dimensions,
- correlation context,
- service level indicators,
- service level objectives,
- error budgets,
- health states,
- alerts,
- notifications,
- incidents,
- investigations,
- dashboards,
- runbooks,
- observability evidence,
- and feedback loops.
---
# 2. Position in InfoTechCanon
The Observability Model is a **domain standard** within InfoTechCanon.
It depends on the existing seed standards as follows:
```text
Landscape = services, runtime resources, environments, endpoints, workloads.
Organization = owners, on-call actors, responders, teams, accountable roles.
Governance = policies, controls, evidence, reviews, assurance, obligations.
Task = incident work, remediation work, investigation, follow-up tasks.
Tagging = lightweight classification of signals, alerts, incidents, dashboards.
Access Control = access to telemetry, dashboards, logs, admin actions, incident tools.
Security = security signals, detections, alerts, incidents, forensic evidence.
Data = telemetry as data, retention, classification, quality, lineage.
DevSecOps = deployment events, delivery metrics, verification, change failures.
Network = flow logs, reachability tests, network metrics, DNS logs, latency.
Observability = signals, telemetry, correlation, health, SLOs, alerts, operational evidence.
```
```text
InfoTechCanon
├── InfoTechCanonCore
├── InfoTechCanonLandscapeModel
├── InfoTechCanonOrganizationModel
├── InfoTechCanonGovernanceModel
├── InfoTechCanonTaskModel
├── InfoTechCanonTaggingStandard
├── InfoTechCanonAccessControlModel
├── InfoTechCanonSecurityModel
├── InfoTechCanonDataModel
├── InfoTechCanonDevSecOpsModel
├── InfoTechCanonNetworkModel
├── InfoTechCanonObservabilityModel <-- this standard
├── InfoTechCanonPatternLanguage
└── Application Profiles
```
---
# 3. Boundary with Adjacent Standards
## 3.1 Boundary with Landscape
The Landscape Model owns the entities being observed:
```text
ApplicationService
TechnicalService
RuntimeWorkload
Environment
Endpoint
NetworkEntity
DataStore
DeploymentRecord
```
The Observability Model owns telemetry and signals about those entities:
```text
Metric
LogRecord
Trace
Span
Event
Profile
Alert
HealthState
SLI
SLO
Dashboard
IncidentSignal
```
Boundary rule:
```text
Landscape owns what exists.
Observability owns what is observed, measured, correlated, alerted, and evidenced.
```
## 3.2 Boundary with Security
The Security Model owns security interpretation:
```text
SecurityFinding
Detection
SecurityIncident
Threat
AttackPath
SecurityEvidence
```
Observability owns the telemetry substrate and operational signals.
Example:
```text
LogRecord may be evidence for SecurityFinding.
SecurityDetection may be derived from ObservabilitySignal.
SecurityIncident may reference Alert, Trace, LogRecord, or Event.
```
## 3.3 Boundary with Governance
Governance owns policies, controls, evidence, reviews, assurance, and compliance claims.
Observability provides evidence and indicators.
Example:
```text
SLOEvidence supports ServiceReview.
Metric supports ControlResult.
AlertPolicy implements Governance Policy.
```
## 3.4 Boundary with Task
Task owns work semantics.
Observability creates or references tasks:
```text
Alert creates IncidentTask
Incident creates RemediationTask
Investigation creates FollowUpTask
SLOBurn creates ReliabilityTask
```
## 3.5 Boundary with DevSecOps
DevSecOps owns delivery events and deployment records.
Observability owns runtime signals used to verify deployments and measure change impact.
Example:
```text
DeploymentRecord produces DeploymentEvent
DeploymentHealthSignal verifies DeploymentRecord
ChangeFailure detected_by ObservabilitySignal
```
## 3.6 Boundary with Data
Data owns dataset, classification, lineage, quality, and retention semantics.
Observability telemetry may itself be data, but Observability owns telemetry-specific semantics.
Example:
```text
LogDataset classified_as Restricted
MetricStream has_retention RetentionRuleReference
TraceSample derived_from RuntimeWorkload
```
---
# 4. Research Basis and External Alignment
This seed standard draws on several mature observability and operations bodies of knowledge.
## 4.1 OpenTelemetry
OpenTelemetry provides a broad observability framework covering traces, metrics, logs, baggage, resources, semantic conventions, instrumentation, collection, and export. Its semantic conventions define common attributes that give meaning to telemetry across systems.
## 4.2 SRE and Service Level Objectives
SRE practice distinguishes Service Level Indicators, Service Level Objectives, Service Level Agreements, and error budgets. It emphasizes that SLOs should measure user-relevant reliability and guide operational decision-making.
## 4.3 Prometheus and OpenMetrics
Prometheus and OpenMetrics influence metric naming, metric exposition, labels, time series, counters, gauges, histograms, summaries, and scraping/pull-based metric collection.
## 4.4 CloudEvents
CloudEvents standardizes common event metadata for interoperability across services, platforms, and systems. It is a strong mapping target for event structure and routing metadata.
## 4.5 IT Operations and Incident Management
IT operations practice distinguishes alerts, incidents, problems, changes, runbooks, on-call, escalation, resolution, and post-incident review. The Observability Model provides signal semantics while Task and Governance own work and decision semantics.
## 4.6 AIOps and Event Correlation
AIOps practice emphasizes correlation, anomaly detection, event deduplication, root-cause analysis, topology-aware alerting, and automated remediation. These are advanced profiles rather than mandatory core concepts.
---
# 5. Seed Standard Design Stance
This standard is a **seed standard**, not a vendor-specific observability schema.
It shall:
1. define canonical observability semantics,
2. distinguish telemetry, signal, event, log, metric, trace, span, profile, alert, and incident,
3. support OpenTelemetry alignment without being limited to it,
4. support SLOs, SLIs, and error budgets,
5. support correlation across services, runtime, network, security, data, and delivery,
6. support operational evidence and feedback loops,
7. support human and agentic operations,
8. map to external standards and tools without becoming subordinate to them,
9. remain markdown-first and agent-retrievable,
10. and support future assimilation of observability tools, standards, and practices.
---
# 6. Scope
## 6.1 In Scope
This standard covers canonical representation of:
- telemetry,
- telemetry sources,
- observed resources,
- observability signals,
- metrics,
- time series,
- metric points,
- metric instruments,
- logs,
- log records,
- events,
- event envelopes,
- traces,
- spans,
- span links,
- trace context,
- profiles,
- exemplars,
- attributes,
- dimensions,
- labels,
- correlation context,
- service-level indicators,
- service-level objectives,
- service-level agreements as references,
- error budgets,
- burn rates,
- health states,
- alert rules,
- alerts,
- notifications,
- alert routes,
- incidents as observed operational objects,
- investigations,
- dashboards,
- runbooks,
- telemetry pipelines,
- collectors,
- exporters,
- sampling,
- retention,
- and observability evidence.
## 6.2 Out of Scope
This standard does not fully define:
- all monitoring tool schemas,
- all incident-management process details,
- all SRE organizational practice,
- complete AIOps algorithms,
- all logging formats,
- all SIEM detection content,
- full OpenTelemetry SDK implementation,
- all Prometheus query semantics,
- complete data-retention law,
- complete security incident-response methodology,
- or every vendor-specific telemetry backend.
Those may be mapped, assimilated, profiled, or handled by adjacent standards.
---
# 7. Normative Language
The following terms are used normatively:
- **SHALL** indicates a mandatory rule for conformance.
- **SHOULD** indicates a recommended practice.
- **MAY** indicates an optional capability.
- **MUST NOT** indicates a prohibited practice.
- **SEED** marks a concept defined provisionally here but open to later refinement.
- **EXTRACT** marks a concept that may later move to a more specialized standard.
---
# 8. Core Principles
## 8.1 Observability Is More Than Monitoring
Monitoring checks known conditions.
Observability supports understanding system behavior, including unknown or emergent failure modes, through signals and correlation.
## 8.2 Telemetry Is Not Insight
Raw telemetry becomes useful through context, correlation, aggregation, interpretation, and action.
## 8.3 Signal Is Not Incident
A signal, alert, or event may indicate a possible problem.
An incident is an operationally relevant situation requiring response.
## 8.4 Alert Is Not Evidence by Itself
An alert indicates that a rule fired or a condition was detected.
Evidence should include the underlying signals, query, thresholds, state, and context.
## 8.5 Metrics, Logs, Traces, Events, and Profiles Are Distinct
Each signal type has different strengths and should not be collapsed into one generic “event” concept.
## 8.6 Service Levels Must Be Explicit
SLIs, SLOs, and error budgets SHOULD be modeled explicitly when reliability is important.
## 8.7 Correlation Requires Identity
Telemetry SHOULD be linked to canonical landscape entities, deployment records, network endpoints, data resources, or security entities where possible.
## 8.8 Observability Must Support Feedback
Observability should feed tasks, incidents, governance reviews, deployment verification, security detection, reliability improvement, and standard evolution.
## 8.9 External Standards Are Mapped, Not Obeyed
The Observability Model MAY map to OpenTelemetry, Prometheus, OpenMetrics, CloudEvents, SRE SLO concepts, ITIL incident practices, and vendor schemas.
It MUST NOT subordinate its internal semantics to any single external model.
---
# 9. Canonical Seed Metadata
Every observability artifact SHOULD support structured metadata.
Recommended front matter:
```yaml
---
id: itc-obs:Metric
type: concept
standard: InfoTechCanonObservabilityModel
standard_version: RC1-seed
status: candidate
canonical_owner: InfoTechCanonObservabilityModel
preferred_label: Metric
related:
- itc-obs:TimeSeries
- itc-obs:SLI
- itc-obs:AlertRule
mappings:
- itc-map:metric-to-opentelemetry-metric
---
```
Recommended artifact statuses:
```text
idea
draft
candidate
release-candidate
adopted
stable
deprecated
retired
```
Recommended concept statuses:
```text
proposed
experimental
candidate
canonical
deprecated
retired
```
---
# 10. Root Observability Taxonomy
```text
ObservabilityEntity
├── TelemetryEntity
│ ├── Telemetry
│ ├── TelemetrySource
│ ├── ObservedResource
│ ├── ResourceAttribute
│ ├── Signal
│ ├── SignalSource
│ └── TelemetryPipeline
├── MetricEntity
│ ├── Metric
│ ├── MetricInstrument
│ ├── TimeSeries
│ ├── MetricPoint
│ ├── Counter
│ ├── Gauge
│ ├── Histogram
│ ├── Summary
│ └── Exemplar
├── LogEntity
│ ├── Log
│ ├── LogRecord
│ ├── LogStream
│ ├── LogLevel
│ ├── LogContext
│ └── StructuredLogField
├── TraceEntity
│ ├── Trace
│ ├── Span
│ ├── SpanEvent
│ ├── SpanLink
│ ├── TraceContext
│ ├── Baggage
│ └── TraceSample
├── EventEntity
│ ├── Event
│ ├── EventEnvelope
│ ├── EventSource
│ ├── EventType
│ ├── EventConsumer
│ └── EventCorrelationKey
├── ProfileEntity
│ ├── Profile
│ ├── ProfilingSample
│ ├── ResourceProfile
│ └── PerformanceProfile
├── ReliabilityEntity
│ ├── SLI
│ ├── SLO
│ ├── SLAReference
│ ├── ErrorBudget
│ ├── BurnRate
│ ├── HealthState
│ └── AvailabilityWindow
├── AlertingEntity
│ ├── AlertRule
│ ├── Alert
│ ├── Notification
│ ├── AlertRoute
│ ├── AlertSuppression
│ ├── AlertCorrelation
│ └── EscalationReference
├── OperationsEntity
│ ├── ObservedIncident
│ ├── Investigation
│ ├── Timeline
│ ├── Runbook
│ ├── Dashboard
│ ├── OperationalView
│ └── PostIncidentObservation
└── EvidenceEntity
  ├── ObservabilityEvidence
  ├── Query
  ├── QueryResult
  ├── Snapshot
  ├── Annotation
  ├── Correlation
  └── RootCauseHypothesis
```
---
# 11. Core Concepts
## 11.1 ObservabilityEntity
An **ObservabilityEntity** is any identifiable concept used to represent telemetry, signals, correlation, health, service levels, alerts, incidents as observed phenomena, dashboards, runbooks, or operational evidence.
Recommended attributes:
```yaml
id:
entity_type:
canonical_name:
display_name:
lifecycle_state:
source_system:
created_at:
updated_at:
```
Optional attributes:
```yaml
owner:
steward:
observed_resource:
service:
environment:
source_confidence:
valid_from:
valid_to:
tags:
external_references:
```
---
## 11.2 Telemetry
**Telemetry** is machine-generated or manually recorded operational data about system behavior, state, performance, events, or activity.
Examples:
```text
metric sample
log record
trace span
event
profile sample
flow record
health check result
```
---
## 11.3 TelemetrySource
A **TelemetrySource** is a system, component, agent, collector, service, device, pipeline, or actor that emits or provides telemetry.
---
## 11.4 ObservedResource
An **ObservedResource** is the entity about which telemetry is emitted or collected.
Observed resources SHOULD map to Landscape, Network, Data, Security, or DevSecOps entities where possible.
---
## 11.5 ResourceAttribute
A **ResourceAttribute** is an attribute describing an observed resource.
Examples:
```text
service.name
service.version
deployment.environment
host.name
cloud.region
k8s.cluster.name
container.image.name
```
---
## 11.6 Signal
A **Signal** is an interpretable unit or stream of observability data.
Signal types include:
```text
metric
log
trace
event
profile
alert
health check
synthetic result
```
---
## 11.7 SignalSource
A **SignalSource** is the origin of a signal.
---
## 11.8 TelemetryPipeline
A **TelemetryPipeline** is a flow that collects, processes, transforms, samples, enriches, routes, stores, or exports telemetry.
---
## 11.9 Metric
A **Metric** is a measurement of a system, service, resource, or process over time.
Metrics may be used for alerting, dashboards, SLOs, capacity planning, anomaly detection, and evidence.
---
## 11.10 MetricInstrument
A **MetricInstrument** defines the kind of instrument used to record a metric, which determines how its values are aggregated and interpreted.
Seed instrument types:
```text
counter
gauge
histogram
summary
up_down_counter
observable_gauge
```
---
## 11.11 TimeSeries
A **TimeSeries** is a sequence of metric points over time for a metric and a set of dimensions or labels.
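The following is a minimal sketch of that definition, assuming a series is identified by the metric name plus its complete label set. The names `MetricPoint`, `label_set`, and `group_into_series` are illustrative and not part of this standard.
```python
from collections import defaultdict
from typing import NamedTuple


class MetricPoint(NamedTuple):
    metric: str        # metric name
    labels: tuple      # sorted (key, value) pairs, so the label set is hashable
    timestamp: float   # seconds since epoch
    value: float


def label_set(labels: dict) -> tuple:
    """Normalize labels so that key order does not affect series identity."""
    return tuple(sorted(labels.items()))


def group_into_series(points):
    """Group points by (metric, label set) and order each series by time."""
    series = defaultdict(list)
    for point in points:
        series[(point.metric, point.labels)].append(point)
    for series_points in series.values():
        series_points.sort(key=lambda p: p.timestamp)
    return dict(series)


# Two points share a label set (in different key order); the third differs by "code".
points = [
    MetricPoint("http_requests_total", label_set({"method": "GET", "code": "200"}), 20.0, 7.0),
    MetricPoint("http_requests_total", label_set({"code": "200", "method": "GET"}), 10.0, 3.0),
    MetricPoint("http_requests_total", label_set({"method": "GET", "code": "500"}), 10.0, 1.0),
]
series = group_into_series(points)
```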
---
## 11.12 MetricPoint
A **MetricPoint** is a single measurement value recorded at a specific point in time.
---
## 11.13 Counter
A **Counter** is a cumulative measurement of occurrences or accumulated quantity whose value only increases, except when it resets (for example, when the emitting process restarts).
---
## 11.14 Gauge
A **Gauge** is a measurement that can go up or down.
---
## 11.15 Histogram
A **Histogram** is a distribution of measurements across buckets or ranges.
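As a rough illustration, bucketing can be sketched as below. The function name `bucket_counts`, the bucket bounds, and the sample latencies are assumptions chosen for the example. The standard does not prescribe whether counts are stored per bucket or cumulatively, so both forms are shown.
```python
from bisect import bisect_left
from itertools import accumulate


def bucket_counts(observations, upper_bounds):
    """Count observations per bucket; the last slot is the overflow (+Inf) bucket."""
    bounds = sorted(upper_bounds)
    counts = [0] * (len(bounds) + 1)
    for value in observations:
        # bisect_left returns the index of the first bound >= value,
        # which is the bucket whose rule is "value <= upper bound".
        counts[bisect_left(bounds, value)] += 1
    return counts


latencies = [0.05, 0.1, 0.3, 0.7, 2.0]            # seconds, illustrative
per_bucket = bucket_counts(latencies, [0.1, 0.5, 1.0])
cumulative = list(accumulate(per_bucket))          # "less than or equal" style counts
```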
---
## 11.16 Summary
A **Summary** is a metric that represents a set of observations through precomputed quantiles or summary statistics such as count and sum.
---
## 11.17 Exemplar
An **Exemplar** is a representative sample connecting an aggregate metric point to a trace, log, or other detailed signal.
---
## 11.18 Log
A **Log** is a stream or collection of timestamped records describing events, state, actions, or messages.
---
## 11.19 LogRecord
A **LogRecord** is a single log entry.
Recommended attributes:
```yaml
timestamp:
severity:
message:
body:
resource:
trace_id:
span_id:
attributes:
source:
```
---
## 11.20 LogStream
A **LogStream** is a sequence of log records from a source or resource.
---
## 11.21 LogLevel
A **LogLevel** is a severity or importance category for log records.
Examples:
```text
trace
debug
info
warn
error
fatal
```
---
## 11.22 LogContext
**LogContext** is contextual metadata attached to log records.
Examples:
```text
request id
trace id
user id reference
tenant id
deployment version
environment
component
```
---
## 11.23 Trace
A **Trace** is a representation of a request, transaction, workflow, or operation as it moves through a distributed system.
---
## 11.24 Span
A **Span** is a single timed operation within a trace.
Recommended attributes:
```yaml
trace_id:
span_id:
parent_span_id:
name:
kind:
start_time:
end_time:
status:
attributes:
events:
links:
```
---
## 11.25 SpanEvent
A **SpanEvent** is a timestamped event attached to a span.
---
## 11.26 SpanLink
A **SpanLink** connects a span to another span or trace context.
---
## 11.27 TraceContext
**TraceContext** is propagation metadata that links operations across process, service, or network boundaries.
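One concrete carrier for such metadata is the W3C Trace Context `traceparent` header, which this standard lists only as a mapping target. The sketch below parses the version `00` layout of that header. The `TraceContext` tuple and `parse_traceparent` function are illustrative names, not canonical ones.
```python
import re
from typing import NamedTuple, Optional

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is not usable."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero identifiers are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of trace-flags is the "sampled" flag.
    return TraceContext(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```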
---
## 11.28 Baggage
**Baggage** is contextual metadata propagated across process boundaries.
Baggage SHOULD be governed carefully when it may contain sensitive data.
---
## 11.29 TraceSample
A **TraceSample** is a selected trace or subset of trace data retained or analyzed.
---
## 11.30 Event
An **Event** is a record of an occurrence and its context.
Events may be operational, domain, security, deployment, infrastructure, or business events.
---
## 11.31 EventEnvelope
An **EventEnvelope** is structured metadata around event data.
CloudEvents is a primary mapping target.
---
## 11.32 EventSource
An **EventSource** is the producer or origin of an event.
---
## 11.33 EventType
An **EventType** classifies the kind of occurrence represented by an event.
---
## 11.34 EventConsumer
An **EventConsumer** is an actor, system, service, or pipeline that consumes events.
---
## 11.35 EventCorrelationKey
An **EventCorrelationKey** links events to related traces, logs, requests, incidents, deployments, or resources.
---
## 11.36 Profile
A **Profile** is sampled performance or resource-use data.
This concept is observability-specific and distinct from InfoTechCanon application profiles.
---
## 11.37 ProfilingSample
A **ProfilingSample** is one sample of profiling data.
---
## 11.38 ResourceProfile
A **ResourceProfile** describes resource use over time or sampled execution.
Examples:
```text
CPU profile
memory profile
allocation profile
lock profile
I/O profile
```
---
## 11.39 PerformanceProfile
A **PerformanceProfile** describes performance characteristics of a system, component, or operation.
---
## 11.40 SLI
A **Service Level Indicator** (SLI) is a quantitative measure of some aspect of the level of service provided.
Examples:
```text
availability
latency
error rate
throughput
correctness
freshness
durability
```
---
## 11.41 SLO
A **Service Level Objective** is a target value or range for an SLI over a defined measurement window.
Recommended attributes:
```yaml
service:
sli:
target:
window:
scope:
owner:
evidence_source:
```
---
## 11.42 SLAReference
An **SLAReference** points to a contractual or formal service-level agreement.
Governance owns contractual obligation semantics. Observability owns measured service-level signals.
---
## 11.43 ErrorBudget
An **ErrorBudget** is the allowed amount of unreliability implied by an SLO over a measurement window.
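For a time-based availability SLO, the arithmetic can be sketched as follows. This assumes the budget is simply `(1 - target)` of the measurement window; the function name `error_budget` and the 99.9% over 30 days figures are illustrative.
```python
from datetime import timedelta


def error_budget(target: float, window: timedelta) -> timedelta:
    """Allowed unreliable time for an availability target over a window."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    return window * (1.0 - target)


# A 99.9% target over 30 days leaves roughly 43 minutes of budget.
budget = error_budget(0.999, timedelta(days=30))
budget_minutes = budget.total_seconds() / 60
```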
---
## 11.44 BurnRate
**BurnRate** is the rate at which an error budget is being consumed.
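A common way to quantify this in SRE practice, and an assumption here rather than a rule of this standard, is to divide the observed error ratio by the error ratio the SLO allows. A value of 1.0 then means the budget would be exactly used up by the end of the window. The function name and event counts below are illustrative.
```python
def burn_rate(bad_events: int, total_events: int, target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic, so no budget was consumed
    error_ratio = bad_events / total_events
    return error_ratio / (1.0 - target)


# 50 failures in 10,000 requests against a 99.9% target burns budget about 5x too fast.
rate = burn_rate(bad_events=50, total_events=10_000, target=0.999)
```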
---
## 11.45 HealthState
**HealthState** is an assessed operational state of a resource, service, dependency, or system.
Seed health states:
```text
unknown
healthy
degraded
unhealthy
down
recovering
maintenance
```
---
## 11.46 AvailabilityWindow
An **AvailabilityWindow** is the time period over which availability or service level is measured.
---
## 11.47 AlertRule
An **AlertRule** defines conditions under which an alert is created.
Recommended attributes:
```yaml
query:
condition:
threshold:
window:
severity:
for_duration:
labels:
annotations:
owner:
runbook:
```
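The interplay of `threshold` and `for_duration` can be sketched as a small evaluator. This is one plausible reading, assuming the condition is "value above threshold" and that an alert stays `pending` until the condition has held for `for_duration`. The `AlertTracker` class is hypothetical and only a subset of the recommended attributes is modeled.
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float      # condition is "value > threshold" in this sketch
    for_duration: float   # seconds the condition must hold before firing


@dataclass
class AlertTracker:
    rule: AlertRule
    breached_since: Optional[float] = None
    firing: bool = False

    def evaluate(self, value: float, now: float) -> Optional[str]:
        """Return the alert state after this evaluation, or None if no alert is active."""
        if value <= self.rule.threshold:
            was_firing = self.firing
            self.breached_since, self.firing = None, False
            return "resolved" if was_firing else None
        if self.breached_since is None:
            self.breached_since = now
        self.firing = now - self.breached_since >= self.rule.for_duration
        return "firing" if self.firing else "pending"


tracker = AlertTracker(AlertRule(threshold=0.05, for_duration=60.0))
samples = [(0.01, 0.0), (0.10, 30.0), (0.12, 60.0), (0.11, 90.0), (0.02, 120.0), (0.02, 150.0)]
states = [tracker.evaluate(value, now) for value, now in samples]
```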
---
## 11.48 Alert
An **Alert** is an instance of an alert rule firing or resolving.
Seed alert states:
```text
pending
firing
acknowledged
suppressed
resolved
expired
```
---
## 11.49 Notification
A **Notification** is a message sent to humans, agents, or systems about an alert, incident, or operational state.
---
## 11.50 AlertRoute
An **AlertRoute** defines how alerts are routed to responders, teams, tools, or escalation paths.
---
## 11.51 AlertSuppression
**AlertSuppression** is a rule or state that suppresses notifications for known, duplicate, maintenance, or intentionally ignored alert conditions.
---
## 11.52 AlertCorrelation
**AlertCorrelation** groups related alerts or signals.
---
## 11.53 EscalationReference
An **EscalationReference** points to Organization, Task, or Governance concepts defining who should respond and how escalation works.
---
## 11.54 ObservedIncident
An **ObservedIncident** is an operationally significant situation inferred or declared from observability signals.
Task and ITSM systems may own incident work records. Observability owns the signal-derived incident view.
---
## 11.55 Investigation
An **Investigation** is analysis of signals, alerts, telemetry, incidents, or hypotheses to understand cause, scope, impact, and remediation.
---
## 11.56 Timeline
A **Timeline** is an ordered sequence of events, signals, decisions, actions, and observations.
---
## 11.57 Runbook
A **Runbook** is an operational procedure used to investigate, respond, recover, or verify a condition.
---
## 11.58 Dashboard
A **Dashboard** is a visual or structured view of observability data.
---
## 11.59 OperationalView
An **OperationalView** is a purpose-specific view of system state, health, risk, or performance.
---
## 11.60 PostIncidentObservation
A **PostIncidentObservation** is a signal, fact, lesson, or finding captured after an incident.
---
## 11.61 ObservabilityEvidence
**ObservabilityEvidence** is any telemetry, query output, screenshot, dashboard state, trace, log, metric, or event used to support a claim.
---
## 11.62 Query
A **Query** is an expression used to retrieve or calculate observability data.
Examples:
```text
PromQL query
LogQL query
SQL query
trace search
SIEM query
dashboard panel query
```
---
## 11.63 QueryResult
A **QueryResult** is the result of executing a query.
---
## 11.64 Snapshot
A **Snapshot** is the captured state of telemetry, a dashboard, a trace, a log, a metric, or a query result at a point in time.
---
## 11.65 Annotation
An **Annotation** is a note added by a human, agent, or system and attached to telemetry, a dashboard, a timeline, an incident, a deployment, or an event.
---
## 11.66 Correlation
A **Correlation** is a relationship linking signals, resources, events, deployments, incidents, or hypotheses.
---
## 11.67 RootCauseHypothesis
A **RootCauseHypothesis** is a candidate explanation for an observed issue.
Canonical rule:
```text
RootCauseHypothesis SHOULD remain distinguishable from verified cause.
```
---
# 12. Core Relationship Vocabulary
Recommended root relationship types:
```text
emitted_by
observes
measures
describes
correlates_with
derived_from
generated_by
triggered_by
alerts_on
routes_to
acknowledged_by
suppressed_by
resolves
affects
indicates
supports
evidences
verifies
invalidates
samples
aggregates
annotates
links_to
maps_to
```
Relationship records SHOULD support:
```yaml
id:
relationship_type:
source_entity:
target_entity:
scope:
time_window:
state_context:
valid_from:
valid_to:
source_system:
confidence:
evidence:
rationale:
```
---
# 13. Observability State Models
## 13.1 Signal States
```text
unknown
emitting
missing
delayed
partial
degraded
invalid
stale
```
## 13.2 Alert States
```text
pending
firing
acknowledged
suppressed
resolved
expired
```
## 13.3 Incident Observation States
```text
suspected
confirmed
investigating
mitigating
recovering
resolved
post_review
closed
```
## 13.4 Health States
```text
unknown
healthy
degraded
unhealthy
down
recovering
maintenance
```
## 13.5 SLO States
```text
not_measured
within_budget
burning_fast
at_risk
violated
paused
retired
```
## 13.6 Telemetry Pipeline States
```text
configured
active
degraded
dropping_data
stalled
misconfigured
retired
```
---
# 14. Observability Patterns
## 14.1 Pattern: Resource-Linked Telemetry
**Context:** Telemetry is collected from many systems.
**Problem:** Signals are hard to interpret if they cannot be linked to canonical resources.
**Solution:** Attach telemetry to ObservedResource references mapped to Landscape, Network, DevSecOps, Security, or Data entities.
---
## 14.2 Pattern: Signal-to-Alert-to-Task
**Context:** A condition needs human or agent response.
**Problem:** Alerts fire but do not become accountable work.
**Solution:**
```text
Signal
-> AlertRule
-> Alert
-> ObservedIncident or Task
-> Investigation
-> RemediationTask
-> VerificationEvidence
```
---
## 14.3 Pattern: SLO as Reliability Contract
**Context:** Service reliability must be operationally meaningful.
**Problem:** Teams alert on low-level metrics that do not represent user experience.
**Solution:** Define SLIs and SLOs for user-meaningful service behavior and use error budgets to guide action.
---
## 14.4 Pattern: Deployment Health Verification
**Context:** A deployment has completed.
**Problem:** A successful deployment command does not prove healthy service behavior.
**Solution:** Link DeploymentRecord to DeploymentHealthSignal, SLO state, traces, logs, metrics, and verification evidence.
---
## 14.5 Pattern: Correlated Timeline
**Context:** Incidents require understanding what happened.
**Problem:** Logs, alerts, deployments, changes, and network events are scattered.
**Solution:** Build a Timeline from correlated events, alerts, traces, deployment records, annotations, and task actions.
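A minimal sketch of the merge step, assuming each source already yields entries in time order and that correlation (for example by service or trace id) has been done upstream. `TimelineEntry` and `build_timeline` are illustrative names, and the sample entries are invented.
```python
import heapq
from typing import NamedTuple


class TimelineEntry(NamedTuple):
    timestamp: float   # seconds since epoch
    kind: str          # e.g. "deployment", "alert", "annotation"
    summary: str


def build_timeline(*sources):
    """Merge already time-ordered sources into one ordered timeline."""
    return list(heapq.merge(*sources, key=lambda entry: entry.timestamp))


deployments = [TimelineEntry(100.0, "deployment", "checkout v42 rolled out")]
alerts = [
    TimelineEntry(130.0, "alert", "error-rate alert firing"),
    TimelineEntry(400.0, "alert", "error-rate alert resolved"),
]
annotations = [TimelineEntry(250.0, "annotation", "rollback started")]
timeline = build_timeline(deployments, alerts, annotations)
```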
---
## 14.6 Pattern: Alert with Runbook
**Context:** An alert requires response.
**Problem:** Responders waste time discovering what the alert means.
**Solution:** AlertRule SHOULD reference owner, runbook, dashboard, likely causes, and escalation path.
---
## 14.7 Pattern: Metric with Exemplar
**Context:** Aggregate metrics show a problem.
**Problem:** Aggregates hide individual requests or traces.
**Solution:** Link a MetricPoint or histogram bucket to a trace or log exemplar.
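One way to picture this is a histogram bucket that remembers the most recent traced observation that fell into it. The sketch below is an assumption-laden illustration: the class names, the "keep the latest exemplar" policy, and the trace id value are all hypothetical.
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Exemplar:
    value: float
    timestamp: float
    trace_id: str


@dataclass
class HistogramBucket:
    upper_bound: float
    count: int = 0
    exemplar: Optional[Exemplar] = None


def record(buckets: List[HistogramBucket], value: float, timestamp: float,
           trace_id: Optional[str] = None) -> HistogramBucket:
    """Count the observation and, if it was traced, keep it as the bucket's exemplar."""
    for bucket in buckets:  # buckets are ordered by upper_bound; the last one is +Inf
        if value <= bucket.upper_bound:
            bucket.count += 1
            if trace_id is not None:
                bucket.exemplar = Exemplar(value, timestamp, trace_id)
            return bucket
    raise ValueError("buckets must end with an infinite upper bound")


buckets = [HistogramBucket(0.5), HistogramBucket(1.0), HistogramBucket(float("inf"))]
record(buckets, 0.2, timestamp=10.0)
slow = record(buckets, 3.4, timestamp=11.0, trace_id="4bf92f3577b34da6a3ce929d0e0e4736")
```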
---
## 14.8 Pattern: Observability as Governance Evidence
**Context:** Governance requires proof that controls or SLOs are operating.
**Problem:** Compliance claims rely on manual screenshots or weak assertions.
**Solution:** Use query results, snapshots, dashboards, and telemetry evidence as structured ObservabilityEvidence.
---
## 14.9 Pattern: Missing Signal as Signal
**Context:** A telemetry source goes silent.
**Problem:** Systems alert only on bad values, not on missing data.
**Solution:** Model missing, stale, or delayed telemetry as signal states and potential alerts.
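A hedged sketch of such a classification, using the signal states from section 13.1. The ordering `delayed` before `stale` before `missing` and the multipliers `stale_after` and `missing_after` are assumptions made for this example, not values defined by the standard.
```python
from typing import Optional


def signal_state(last_seen: Optional[float], now: float, expected_interval: float,
                 stale_after: float = 3.0, missing_after: float = 10.0) -> str:
    """Classify a signal by how long ago its source last reported."""
    if last_seen is None:
        return "unknown"           # the source has never reported
    age = now - last_seen
    if age <= expected_interval:
        return "emitting"
    if age <= stale_after * expected_interval:
        return "delayed"
    if age <= missing_after * expected_interval:
        return "stale"
    return "missing"               # a candidate condition for an alert of its own


# A source expected every 15 seconds, checked at various ages.
observed = [
    signal_state(None, now=100.0, expected_interval=15.0),
    signal_state(90.0, now=100.0, expected_interval=15.0),
    signal_state(70.0, now=100.0, expected_interval=15.0),
    signal_state(0.0, now=100.0, expected_interval=15.0),
    signal_state(0.0, now=1000.0, expected_interval=15.0),
]
```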
---
# 15. Observability Profiles
## 15.1 Profile Format
An Observability Profile SHALL declare:
```yaml
id:
profile_name:
status:
implements:
- InfoTechCanonObservabilityModel
target_context:
included_concepts:
required_relationships:
required_metadata:
state_model:
source_of_truth_rules:
mapping_files:
validation_rules:
examples:
known_deviations:
```
---
## 15.2 Seed Profile: Small SaaS Observability Profile
Purpose:
```text
Provide a minimal observability model for a small SaaS platform moving toward production readiness.
```
Included concepts:
```text
ObservedResource
Metric
LogRecord
Trace
Span
Event
AlertRule
Alert
Dashboard
Runbook
SLI
SLO
HealthState
ObservedIncident
ObservabilityEvidence
```
Required relationships:
```text
Metric emitted_by ObservedResource
LogRecord emitted_by ObservedResource
Trace observes Service
Alert triggered_by AlertRule
Alert affects Service
SLO measures Service
Dashboard displays Metric
Runbook supports Alert
ObservabilityEvidence supports Investigation
```
---
## 15.3 Seed Profile: OpenTelemetry Profile
Purpose:
```text
Map OpenTelemetry resources, traces, metrics, logs, attributes, baggage, and semantic conventions into InfoTechCanon.
```
Example mappings:
```text
Resource -> ObservedResource
Resource attributes -> ResourceAttribute
Metric -> Metric
LogRecord -> LogRecord
Trace -> Trace
Span -> Span
Span event -> SpanEvent
Span link -> SpanLink
Baggage -> Baggage
Semantic conventions -> Mapping / Attribute vocabulary
Collector -> TelemetryPipeline component
Exporter -> TelemetryPipeline component
```
---
## 15.4 Seed Profile: Prometheus / OpenMetrics Profile
Purpose:
```text
Represent metrics, labels, time series, scrape targets, alert rules, and query results.
```
Example mappings:
```text
metric name -> Metric
labels -> dimensions / attributes
sample -> MetricPoint
time series -> TimeSeries
PromQL -> Query
recording rule -> DerivedMetric / Query
alerting rule -> AlertRule
target -> TelemetrySource / ObservedResource
```
---
## 15.5 Seed Profile: CloudEvents Profile
Purpose:
```text
Represent event metadata and event envelopes.
```
Example mappings:
```text
id -> Event id
source -> EventSource
type -> EventType
specversion -> EventEnvelope version
subject -> Event subject
time -> Event timestamp
datacontenttype -> Event data content type
data -> Event data
```
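The mapping table above could be applied roughly as follows. The output field names (`event_id`, `envelope_version`, and so on) are hypothetical, since this standard does not fix attribute names for EventEnvelope. The required-attribute check follows CloudEvents 1.0, where `id`, `source`, `specversion`, and `type` are mandatory.
```python
REQUIRED_ATTRIBUTES = ("id", "source", "type", "specversion")


def to_event_envelope(cloud_event: dict) -> dict:
    """Translate CloudEvents context attributes into illustrative envelope fields."""
    missing = [name for name in REQUIRED_ATTRIBUTES if name not in cloud_event]
    if missing:
        raise ValueError(f"missing required CloudEvents attributes: {missing}")
    return {
        "event_id": cloud_event["id"],
        "event_source": cloud_event["source"],
        "event_type": cloud_event["type"],
        "envelope_version": cloud_event["specversion"],
        "subject": cloud_event.get("subject"),                  # optional
        "timestamp": cloud_event.get("time"),                   # optional
        "data_content_type": cloud_event.get("datacontenttype"),  # optional
        "data": cloud_event.get("data"),                        # optional
    }


envelope = to_event_envelope({
    "specversion": "1.0",
    "id": "evt-0001",
    "source": "/deployments/checkout",
    "type": "com.example.deployment.finished",
    "time": "2026-05-23T10:00:00Z",
    "data": {"version": "v42"},
})
```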
---
## 15.6 Seed Profile: SRE Reliability Profile
Purpose:
```text
Represent SLIs, SLOs, error budgets, burn rates, and reliability decisions.
```
Included concepts:
```text
SLI
SLO
ErrorBudget
BurnRate
AvailabilityWindow
AlertRule
ReliabilityReview
ServiceHealthState
ErrorBudgetPolicyReference
```
Required relationships:
```text
SLO applies_to Service
SLI measures Service
ErrorBudget derived_from SLO
BurnRate measures ErrorBudgetConsumption
AlertRule alerts_on BurnRate
ReliabilityReview reviews SLOState
```
---
## 15.7 Seed Profile: Incident Observability Profile
Purpose:
```text
Represent telemetry, alerts, timelines, dashboards, and evidence for incident response.
```
Included concepts:
```text
Alert
ObservedIncident
Timeline
Investigation
Dashboard
Runbook
Annotation
RootCauseHypothesis
ObservabilityEvidence
PostIncidentObservation
```
---
## 15.8 Seed Profile: Network Observability Profile
Purpose:
```text
Represent network metrics, flow logs, reachability tests, DNS logs, and latency signals.
```
Included concepts:
```text
NetworkMetric
ObservedFlowSignal
DNSLogRecord
ReachabilityTestResult
LatencyMetric
PacketLossMetric
EndpointHealthSignal
```
Mapping targets:
```text
NetFlow/IPFIX
VPC Flow Logs
Kubernetes CNI telemetry
service mesh telemetry
DNS logs
synthetic probes
```
---
## 15.9 Seed Profile: Security Observability Profile
Purpose:
```text
Represent observability signals used for security detection, investigation, and evidence.
```
Included concepts:
```text
SecuritySignal
SecurityLogRecord
DetectionEvent
Alert
TraceEvidence
AccessSessionLog
AuditLogReference
SecurityEvidence
```
Security interpretation remains owned by the Security Model.
---
# 16. Mapping Model for the Observability Standard
Mappings relate InfoTechCanon observability concepts to external standards, tools, and products.
## 16.1 Mapping Types
Recommended mapping types:
```text
exactMatch
closeMatch
broadMatch
narrowMatch
relatedMatch
conflictMatch
gapMatch
derivedFrom
regulatoryReference
toolEquivalent
```
## 16.2 Mapping Record
Example:
```yaml
id: itc-map:span-to-opentelemetry-span
source_concept: itc-obs:Span
target_body: OpenTelemetry
target_version: "current"
target_concept: Span
mapping_type: closeMatch
scope:
  - distributed tracing
not_valid_for:
  - all event or log semantics
rationale: >
  OpenTelemetry Span is the primary mapping target for timed operations in traces.
  InfoTechCanon keeps Span as a canonical concept to allow mappings to other tracing systems.
confidence: high
status: candidate
owner: InfoTechCanonObservabilityModel
```
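A mapping record of this shape can be checked mechanically. The sketch below is illustrative only: the record is represented as a plain Python dictionary, and the choice of `REQUIRED_FIELDS` is an assumption, since this standard does not yet state which fields are mandatory. The allowed mapping types are taken from section 16.1.

```python
# Mapping types listed in section 16.1.
MAPPING_TYPES = frozenset({
    "exactMatch", "closeMatch", "broadMatch", "narrowMatch", "relatedMatch",
    "conflictMatch", "gapMatch", "derivedFrom", "regulatoryReference",
    "toolEquivalent",
})

# Assumption: the standard does not define a required-field list; this is a guess.
REQUIRED_FIELDS = ("id", "source_concept", "target_body", "target_concept", "mapping_type")


def check_mapping_record(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not record.get(name)]
    mapping_type = record.get("mapping_type")
    if mapping_type and mapping_type not in MAPPING_TYPES:
        problems.append(f"unknown mapping_type: {mapping_type}")
    return problems


record = {
    "id": "itc-map:span-to-opentelemetry-span",
    "source_concept": "itc-obs:Span",
    "target_body": "OpenTelemetry",
    "target_concept": "Span",
    "mapping_type": "closeMatch",
}
problems = check_mapping_record(record)
bad = check_mapping_record({**record, "mapping_type": "sameAs"})
print(problems, bad)
```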
## 16.3 Seed Mapping Targets
The Observability Model SHOULD maintain mappings to:
```text
OpenTelemetry
OpenTelemetry Semantic Conventions
Prometheus
OpenMetrics / Prometheus exposition format
CloudEvents
W3C Trace Context
Google SRE SLI/SLO/Error Budget concepts
Grafana dashboards and alerting
Prometheus Alertmanager
Loki / LogQL
Jaeger
Tempo
Elastic Observability
Datadog
New Relic
Splunk
OpenSearch
ITIL incident concepts
NetFlow / IPFIX
VPC Flow Logs
Kubernetes events and metrics
service mesh telemetry
```
---
# 17. Assimilation Hooks
The Observability Model SHALL be able to receive new observability standards, tool models, telemetry schemas, incident practices, and operational patterns through the InfoTechCanon assimilation process.
## 17.1 Assimilation Triggers
Assimilation may be triggered by:
```text
new telemetry standard
new observability backend
new incident-management tool
new SLO practice
new dashboard model
new alerting model
new tracing model
new logging schema
new AIOps product
new runtime verification practice
new recurring signal classification conflict
```
## 17.2 Observability Assimilation Output
An observability assimilation SHOULD produce:
```text
source summary
extracted observability concepts
concept comparison matrix
gap list
conflict list
mapping file
candidate new concepts
candidate relationship changes
candidate pattern changes
candidate profile changes
open questions
```
## 17.3 Recommended First Assimilation Candidates
```text
OpenTelemetry specification and semantic conventions
Prometheus / OpenMetrics
CloudEvents
W3C Trace Context
Google SRE SLO chapters
Grafana dashboard and alerting models
Prometheus Alertmanager
Kubernetes events and metrics
VPC Flow Logs / NetFlow / IPFIX
ITIL incident management concepts
```
---
# 18. Integration with Other InfoTechCanon Standards
## 18.1 Landscape Model
Observability links signals to:
```text
ApplicationService
TechnicalService
RuntimeWorkload
Environment
Endpoint
DataStore
DeploymentRecord
NetworkEntity
```
## 18.2 Organization Model
Observability imports organization concepts for:
```text
service owner
on-call responder
team
escalation target
runbook owner
incident commander
```
## 18.3 Governance Model
Observability imports governance concepts for:
```text
evidence
control result
review
assurance
policy
SLA obligation
audit evidence
```
## 18.4 Task Model
Observability creates or references:
```text
incident task
investigation task
remediation task
follow-up task
reliability improvement task
```
## 18.5 Tagging Standard
Observability uses tags for:
```text
service
environment
severity
signal type
dashboard category
incident category
team
```
Tags MUST NOT replace ObservedResource, AlertRule, SLO, or Evidence records.
## 18.6 Access Control Model
Observability imports access concepts for:
```text
dashboard access
log access
trace access
incident tool access
telemetry pipeline access
sensitive telemetry access
```
## 18.7 Security Model
Security imports observability concepts for:
```text
security signal
detection evidence
security alert
audit log
trace evidence
incident timeline
```
## 18.8 Data Model
The Data Model imports observability concepts in two cases: when telemetry itself is treated as a dataset, and when signals describe data freshness, quality, or lineage.
## 18.9 DevSecOps Model
DevSecOps imports observability concepts for:
```text
deployment verification
change failure detection
delivery metric
runtime feedback
SLO impact
```
## 18.10 Network Model
Network imports observability concepts for:
```text
flow logs
reachability test results
latency
packet loss
DNS logs
endpoint health
```
---
# 19. Canon Interface Card Usage
Subsystems that implement or produce observability knowledge SHOULD publish a Canon Interface Card.
Example:
```yaml
subsystem: prometheus-importer
implements:
  - InfoTechCanonObservabilityModel
  - PrometheusOpenMetricsProfile
produces:
  - Metric
  - TimeSeries
  - MetricPoint
  - AlertRule
  - Alert
  - QueryResult
consumes:
  - ObservedResource
  - Service
  - Environment
relations:
  - Metric emitted_by ObservedResource
  - Alert triggered_by AlertRule
  - Alert affects Service
source_of_truth:
  metric_samples: Prometheus
  alert_rule_state: Prometheus
known_deviations:
  - resource identity depends on labels
  - long-term retention may be external
```
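One consistency check a card like this invites is whether every concept named in `relations` is also declared under `produces` or `consumes`. The sketch below assumes that rule, which this standard does not itself state, and represents the card as a plain dictionary; `undeclared_concepts` is a hypothetical helper name.

```python
card = {
    "produces": ["Metric", "TimeSeries", "MetricPoint", "AlertRule", "Alert", "QueryResult"],
    "consumes": ["ObservedResource", "Service", "Environment"],
    "relations": [
        "Metric emitted_by ObservedResource",
        "Alert triggered_by AlertRule",
        "Alert affects Service",
    ],
}


def undeclared_concepts(card: dict) -> set[str]:
    """Return concepts used in relations but absent from produces/consumes."""
    declared = set(card.get("produces", [])) | set(card.get("consumes", []))
    used = set()
    for relation in card.get("relations", []):
        # Each relation is expected to be "<Subject> <predicate> <Object>";
        # any other shape raises ValueError on unpacking.
        subject, _predicate, obj = relation.split()
        used.update((subject, obj))
    return used - declared


print(undeclared_concepts(card))
```

For the card above the result is empty. Adding a relation such as `Alert documented_by Runbook` without declaring `Runbook` would be reported.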
---
# 20. Retrieval Requirements
The Observability Model is designed for Markdown-based infospaces.
## 20.1 Required Retrieval Properties
Every major concept SHOULD provide:
- stable heading,
- stable identifier,
- short definition,
- longer explanation,
- examples,
- distinction notes,
- relationship examples,
- mapping hooks,
- profile references,
- and common mistakes.
## 20.2 Agent Brief
A mature Observability Model SHOULD include an `agent-brief.md` file with:
```text
purpose
scope
owned concepts
imported concepts
core distinctions
do / do not rules
relationship patterns
minimal examples
common mistakes
profile list
mapping list
```
## 20.3 Indexes
The observability information space SHOULD provide indexes by:
```text
concept
relationship
signal type
metric
log
trace
event
resource
service
alert
SLO
dashboard
incident
profile
pattern
mapping target
status
source system
```
---
# 21. Conformance Levels
## 21.1 Reference-Conformant
A document or system is reference-conformant if it uses Observability Model terminology consistently but does not implement structured metadata or validation rules.
## 21.2 Metadata-Conformant
A system is metadata-conformant if it uses stable identifiers, concept names, lifecycle states, source metadata, and relationship types.
## 21.3 Signal-Conformant
A system is signal-conformant if it distinguishes metrics, logs, traces, events, profiles, alerts, and health signals.
## 21.4 Resource-Correlated
A system is resource-correlated if observability signals can be linked to observed resources and canonical landscape entities.
## 21.5 SLO-Conformant
A system is SLO-conformant if it represents SLIs, SLOs, error budgets, burn rates, and measurement windows.
## 21.6 Evidence-Conformant
A system is evidence-conformant if observability claims, incidents, alerts, and service-level states can be linked to evidence.
## 21.7 Profile-Conformant
A system is profile-conformant if it implements a declared Observability Profile and passes its validation rules.
## 21.8 Assimilation-Conformant
A system or repository is assimilation-conformant if it can accept external observability concepts through the InfoTechCanon assimilation workflow and produce mappings, gaps, conflicts, and proposed changes.
---
# 22. Validation Rules
Initial validation rules:
```text
VAL-OBS-001: Metric, LogRecord, Trace, Span, Event, Profile, Alert, and Incident SHOULD be modeled as distinct concepts.
VAL-OBS-002: Telemetry SHOULD reference an ObservedResource where possible.
VAL-OBS-003: ObservedResource SHOULD map to a Landscape, Network, Data, Security, or DevSecOps entity where possible.
VAL-OBS-004: Metric SHOULD declare unit, instrument type, source, and dimensions where available.
VAL-OBS-005: TimeSeries SHOULD distinguish metric identity from labels/dimensions.
VAL-OBS-006: LogRecord SHOULD include timestamp, severity, source, and body where available.
VAL-OBS-007: Span SHOULD include trace id, span id, timing, name, status, and parent/link references where available.
VAL-OBS-008: Event SHOULD distinguish event data from event context metadata.
VAL-OBS-009: Alert SHOULD reference AlertRule or source condition where available.
VAL-OBS-010: AlertRule SHOULD reference query or condition, threshold, time window, owner, and runbook where applicable.
VAL-OBS-011: SLO SHOULD reference SLI, target, measurement window, service, and evidence source.
VAL-OBS-012: ErrorBudget SHOULD derive from an SLO.
VAL-OBS-013: Dashboard SHOULD NOT be treated as evidence unless a Snapshot or QueryResult is captured.
VAL-OBS-014: Incident SHOULD NOT be inferred solely from one alert unless the applicable profile permits it.
VAL-OBS-015: RootCauseHypothesis SHOULD remain distinguishable from verified cause.
VAL-OBS-016: Missing, stale, or delayed telemetry SHOULD be representable as signal state.
VAL-OBS-017: Tags MUST NOT replace resource identity, SLO definitions, alert rules, or evidence.
VAL-OBS-018: Imported external observability concepts SHOULD be represented through mapping records rather than silently reused.
VAL-OBS-019: Profiles MUST NOT redefine canonical concepts. Profiles MAY constrain them.
VAL-OBS-020: Telemetry containing sensitive data SHOULD reference Data, Security, Access Control, or Governance constraints where relevant.
```
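Several of these rules can be checked automatically once records are available in structured form. The following sketch covers only VAL-OBS-004 and VAL-OBS-012 and rests on assumptions: records are dictionaries with hypothetical `id`, `concept`, and `derived_from` keys, and the metric field names are illustrative rather than defined by this standard.

```python
# Assumed field names for the properties VAL-OBS-004 asks a Metric to declare.
METRIC_FIELDS = ("unit", "instrument_type", "source", "dimensions")


def validate(records: list[dict]) -> list[str]:
    """Return one finding per rule violation, in record order."""
    by_id = {r["id"]: r for r in records}
    findings = []
    for r in records:
        if r["concept"] == "Metric":
            for field in METRIC_FIELDS:
                if field not in r:
                    findings.append(f"VAL-OBS-004: {r['id']} does not declare {field}")
        if r["concept"] == "ErrorBudget":
            # The budget must point at an existing record whose concept is SLO.
            source = by_id.get(r.get("derived_from"))
            if source is None or source["concept"] != "SLO":
                findings.append(f"VAL-OBS-012: {r['id']} is not derived from an SLO")
    return findings


records = [
    {"id": "slo:checkout-availability", "concept": "SLO"},
    {"id": "budget:checkout", "concept": "ErrorBudget",
     "derived_from": "slo:checkout-availability"},
    {"id": "budget:orphan", "concept": "ErrorBudget"},
    {"id": "metric:http_requests", "concept": "Metric", "unit": "1",
     "source": "prometheus", "dimensions": ["method"]},
]
findings = validate(records)
for finding in findings:
    print(finding)
```

Because both rules use SHOULD, a real validator would likely report these findings as warnings rather than failures.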
---
# 23. Anti-Patterns
## 23.1 Dashboard as Truth
Treating a dashboard view as evidence without preserving query, time window, data source, or snapshot.
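One way to avoid this anti-pattern is to capture a snapshot record whenever a dashboard view is cited as evidence. The sketch below is an assumption-laden illustration: the `Snapshot` fields, the `capture_snapshot` helper, and the `prometheus-prod` data source name are hypothetical, and the query result is assumed to be JSON-serializable.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical evidence record preserving what a dashboard panel showed."""
    query: str
    data_source: str
    window_start: datetime
    window_end: datetime
    result_digest: str    # SHA-256 over a canonical JSON form of the result
    captured_at: datetime


def capture_snapshot(query, data_source, window_start, window_end, result) -> Snapshot:
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    # Sorting keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return Snapshot(
        query=query,
        data_source=data_source,
        window_start=window_start,
        window_end=window_end,
        result_digest=hashlib.sha256(canonical).hexdigest(),
        captured_at=datetime.now(timezone.utc),
    )


start = datetime(2026, 5, 23, 10, 0, tzinfo=timezone.utc)
end = datetime(2026, 5, 23, 11, 0, tzinfo=timezone.utc)
snapshot = capture_snapshot(
    query='sum(rate(http_requests_total{status=~"5.."}[5m]))',
    data_source="prometheus-prod",
    window_start=start,
    window_end=end,
    result={"value": 0.42, "unit": "1/s"},
)
print(snapshot.result_digest)
```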
## 23.2 Alert Equals Incident
Treating every alert as an incident.
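A system can keep alerts and incidents separate by promoting alerts to incident candidates only when an explicit policy is met. The policy in this sketch is purely an assumption chosen for illustration: a service becomes a candidate when at least two distinct alert rules fire for it within ten minutes. The dictionary keys and the `incident_candidates` helper are hypothetical.

```python
from datetime import datetime, timedelta


def incident_candidates(alerts, min_rules=2, window=timedelta(minutes=10)):
    """Return services where enough distinct rules fired within the window."""
    by_service = {}
    for alert in sorted(alerts, key=lambda a: a["fired_at"]):
        by_service.setdefault(alert["service"], []).append(alert)

    candidates = []
    for service, items in by_service.items():
        for i, first in enumerate(items):
            # Distinct rules that fired no later than `window` after `first`.
            rules = {a["rule"] for a in items[i:]
                     if a["fired_at"] - first["fired_at"] <= window}
            if len(rules) >= min_rules:
                candidates.append(service)
                break
    return candidates


t0 = datetime(2026, 5, 23, 12, 0)
alerts = [
    {"service": "checkout-api", "rule": "HighErrorRate", "fired_at": t0},
    {"service": "checkout-api", "rule": "HighLatency", "fired_at": t0 + timedelta(minutes=4)},
    {"service": "search-api", "rule": "HighLatency", "fired_at": t0},
    {"service": "search-api", "rule": "HighLatency", "fired_at": t0 + timedelta(minutes=1)},
]
candidates = incident_candidates(alerts)
print(candidates)
```

Here `search-api` has two alerts from the same rule, so it stays an alert stream rather than becoming an incident candidate.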
## 23.3 Metric Soup
Collecting many metrics without ownership, resource identity, interpretation, or action path.
## 23.4 Logs Without Context
Logging messages that cannot be correlated to service, request, trace, tenant, deployment, or resource.
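As a contrast, the sketch below emits log records that carry correlation context. It uses Python's standard `logging` module; the `JsonContextFormatter` class, the chosen context field names, and the example values are assumptions for illustration, not part of this standard.

```python
import io
import json
import logging


class JsonContextFormatter(logging.Formatter):
    """Render each record as one JSON object that carries correlation fields."""

    CONTEXT_FIELDS = ("service", "trace_id", "deployment")  # assumed field names

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "body": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record;
        # absent fields are emitted as null so gaps stay visible.
        for name in self.CONTEXT_FIELDS:
            payload[name] = getattr(record, name, None)
        return json.dumps(payload)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonContextFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info(
    "payment authorized",
    extra={
        "service": "checkout-api",
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "deployment": "v42",
    },
)
entry = json.loads(stream.getvalue())
print(entry["service"], entry["trace_id"])
```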
## 23.5 Traces Without Boundaries
Tracing calls without linking them to service ownership, deployment version, or runtime resource.
## 23.6 SLO Theater
Creating SLOs that do not reflect user experience or guide operational decisions.
## 23.7 Alert Without Runbook
Creating alerts without ownership, runbook, dashboard, or response expectation.
## 23.8 Missing Signal Blindness
Failing to alert when telemetry stops arriving.
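Detecting absent telemetry requires treating staleness as a signal state in its own right, in line with VAL-OBS-016. The following is a minimal sketch under stated assumptions: the state names `missing`, `stale`, and `fresh`, the `signal_state` helper, and the default grace factor are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def signal_state(
    last_seen: Optional[datetime],
    now: datetime,
    expected_interval: timedelta,
    grace: float = 2.0,
) -> str:
    """Classify a signal by how long ago its last sample arrived.

    'missing': no sample has ever been received.
    'stale':   the last sample is older than expected_interval * grace.
    'fresh':   otherwise.
    """
    if last_seen is None:
        return "missing"
    if now - last_seen > expected_interval * grace:
        return "stale"
    return "fresh"


now = datetime(2026, 5, 23, 12, 0, tzinfo=timezone.utc)
interval = timedelta(seconds=60)
print(signal_state(now - timedelta(seconds=300), now, interval))
```

An alert rule can then fire on `stale` or `missing` in the same way it fires on a threshold breach.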
## 23.9 Tool-Native Capture
Letting one observability backend define the internal observability model.
## 23.10 Telemetry Without Governance
Collecting sensitive logs, traces, or profiles without classification, retention, access control, or privacy consideration.
---
# 24. Initial Repository Placement
Recommended repository layout:
```text
info-tech-canon/
  standards/
    observability/
      InfoTechCanonObservabilityModel.md
      agent-brief.md
      concepts/
      relationships/
      patterns/
      profiles/
      mappings/
      assimilation/
      examples/
      validation/
Seed files:
```text
standards/observability/InfoTechCanonObservabilityModel.md
standards/observability/agent-brief.md
standards/observability/concepts/telemetry.md
standards/observability/concepts/metric.md
standards/observability/concepts/log-record.md
standards/observability/concepts/trace.md
standards/observability/concepts/span.md
standards/observability/concepts/event.md
standards/observability/concepts/sli.md
standards/observability/concepts/slo.md
standards/observability/concepts/alert.md
standards/observability/concepts/observability-evidence.md
standards/observability/patterns/resource-linked-telemetry.md
standards/observability/patterns/signal-to-alert-to-task.md
standards/observability/patterns/slo-as-reliability-contract.md
standards/observability/patterns/deployment-health-verification.md
standards/observability/profiles/small-saas-observability-profile.md
standards/observability/profiles/opentelemetry-profile.md
standards/observability/profiles/prometheus-openmetrics-profile.md
standards/observability/profiles/sre-reliability-profile.md
standards/observability/mappings/opentelemetry.yaml
standards/observability/mappings/prometheus-openmetrics.yaml
standards/observability/mappings/cloudevents.yaml
standards/observability/mappings/sre-slo.yaml
```
---
# 25. Roadmap
## Phase 1: Seed Stabilization
- Establish this standard as `InfoTechCanonObservabilityModel`.
- Add seed concepts, relationship vocabulary, patterns, and profiles.
- Define validation rules.
- Align with Landscape, Network, DevSecOps, Security, Data, Governance, Task, Access Control, and Tagging.
## Phase 2: First Assimilations
Recommended first assimilations:
```text
OpenTelemetry specification and semantic conventions
Prometheus / OpenMetrics
CloudEvents
W3C Trace Context
Google SRE SLO chapters
Grafana dashboard and alerting model
Prometheus Alertmanager
Kubernetes events and metrics
VPC Flow Logs / NetFlow / IPFIX
ITIL incident management concepts
```
## Phase 3: Profile Maturation
- Mature Small SaaS Observability Profile.
- Mature OpenTelemetry Profile.
- Mature Prometheus / OpenMetrics Profile.
- Mature CloudEvents Profile.
- Mature SRE Reliability Profile.
- Mature Incident Observability Profile.
- Mature Network Observability Profile.
- Mature Security Observability Profile.
## Phase 4: Tooling Integration
- Generate concept indexes.
- Generate agent brief.
- Create machine-readable YAML/JSON exports.
- Add validation scripts.
- Integrate with telemetry pipelines, metric, log, and trace backends, dashboards, alerting, incident tools, and service catalogs.
## Phase 5: Operational Intelligence Loop
- Connect telemetry to canonical resources.
- Connect alerts to tasks and incidents.
- Connect SLOs to governance and service ownership.
- Connect deployment records to runtime health signals.
- Connect security detections to security incidents.
- Connect network flows to reachability and exposure.
- Connect post-incident observations to improvements and standard evolution.
---
# 26. Summary
The InfoTechCanon Observability Model is the seed standard for representing telemetry, signals, metrics, logs, traces, events, profiles, alerts, SLOs, health, incidents as observed phenomena, and operational evidence.
Its most important commitments are:
```text
Separate telemetry, signal, metric, log, trace, span, event, profile, alert, and incident.
Link signals to canonical resources and landscape entities.
Treat SLOs, SLIs, error budgets, burn rates, and health states as first-class reliability concepts.
Use observability evidence to support governance, security, delivery, incident response, and operational review.
Map to OpenTelemetry, Prometheus/OpenMetrics, CloudEvents, SRE practices, and observability tools without surrendering internal semantic autonomy.
Use profiles to make the model practical for SaaS systems, OpenTelemetry, Prometheus, SRE reliability, incident response, network observability, and security observability.
```
This makes the Observability Model a core seed for runtime intelligence, production readiness, SRE practice, incident response, deployment verification, security detection, and agent-supported operations.