# InfoTechCanon Observability Model

**Short Name:** `ITC-OBS`  
**Document Status:** Seed Standard Release Candidate 1  
**Version:** RC1-seed  
**Date:** 2026-05-23  
**Repository Context:** `info-tech-canon`  
**Document Type:** InfoTechCanon Domain Standard  
**Intended Audience:** SREs, platform engineers, DevSecOps teams, service owners, observability engineers, incident responders, network operators, security analysts, product owners, governance designers, knowledge-system builders, and agentic tooling.

---

# 1. Purpose

The **InfoTechCanon Observability Model** defines a canonical seed model for representing telemetry, signals, events, logs, metrics, traces, profiles, health, service levels, alerts, incidents as observed phenomena, dashboards, runbooks, investigations, and operational evidence.

It exists to make runtime understanding interoperable across systems, services, platforms, networks, security, delivery pipelines, data products, and agentic operations.

This standard provides a canonical vocabulary for:

- telemetry sources,
- resources,
- signals,
- metrics,
- logs,
- events,
- traces,
- spans,
- profiles,
- exemplars,
- attributes,
- dimensions,
- correlation context,
- service level indicators,
- service level objectives,
- error budgets,
- health states,
- alerts,
- notifications,
- incidents,
- investigations,
- dashboards,
- runbooks,
- observability evidence,
- and feedback loops.

---

# 2. Position in InfoTechCanon

The Observability Model is a **domain standard** within InfoTechCanon.

It depends on the existing seed standards as follows:

```text
Landscape = services, runtime resources, environments, endpoints, workloads.
Organization = owners, on-call actors, responders, teams, accountable roles.
Governance = policies, controls, evidence, reviews, assurance, obligations.
Task = incident work, remediation work, investigation, follow-up tasks.
Tagging = lightweight classification of signals, alerts, incidents, dashboards.
Access Control = access to telemetry, dashboards, logs, admin actions, incident tools.
Security = security signals, detections, alerts, incidents, forensic evidence.
Data = telemetry as data, retention, classification, quality, lineage.
DevSecOps = deployment events, delivery metrics, verification, change failures.
Network = flow logs, reachability tests, network metrics, DNS logs, latency.
Observability = signals, telemetry, correlation, health, SLOs, alerts, operational evidence.
```

```text
InfoTechCanon
├── InfoTechCanonCore
├── InfoTechCanonLandscapeModel
├── InfoTechCanonOrganizationModel
├── InfoTechCanonGovernanceModel
├── InfoTechCanonTaskModel
├── InfoTechCanonTaggingStandard
├── InfoTechCanonAccessControlModel
├── InfoTechCanonSecurityModel
├── InfoTechCanonDataModel
├── InfoTechCanonDevSecOpsModel
├── InfoTechCanonNetworkModel
├── InfoTechCanonObservabilityModel   <-- this standard
├── InfoTechCanonPatternLanguage
└── Application Profiles
```

---

# 3. Boundary with Adjacent Standards

## 3.1 Boundary with Landscape

The Landscape Model owns the entities being observed:

```text
ApplicationService
TechnicalService
RuntimeWorkload
Environment
Endpoint
NetworkEntity
DataStore
DeploymentRecord
```

The Observability Model owns telemetry and signals about those entities:

```text
Metric
LogRecord
Trace
Span
Event
Profile
Alert
HealthState
SLI
SLO
Dashboard
IncidentSignal
```

Boundary rule:

```text
Landscape owns what exists.
Observability owns what is observed, measured, correlated, alerted, and evidenced.
```

## 3.2 Boundary with Security

The Security Model owns security interpretation:

```text
SecurityFinding
Detection
SecurityIncident
Threat
AttackPath
SecurityEvidence
```

Observability owns telemetry substrate and operational signals.

Example:

```text
LogRecord may be evidence for SecurityFinding.
SecurityDetection may be derived from ObservabilitySignal.
SecurityIncident may reference Alert, Trace, LogRecord, or Event.
```

## 3.3 Boundary with Governance

Governance owns policies, controls, evidence, reviews, assurance, and compliance claims.

Observability provides evidence and indicators.

Example:

```text
SLOEvidence supports ServiceReview.
Metric supports ControlResult.
AlertPolicy implements Governance Policy.
```

## 3.4 Boundary with Task

Task owns work semantics.

Observability creates or references tasks:

```text
Alert creates IncidentTask
Incident creates RemediationTask
Investigation creates FollowUpTask
SLOBurn creates ReliabilityTask
```

## 3.5 Boundary with DevSecOps

DevSecOps owns delivery events and deployment records.

Observability owns runtime signals used to verify deployments and measure change impact.

Example:

```text
DeploymentRecord produces DeploymentEvent
DeploymentHealthSignal verifies DeploymentRecord
ChangeFailure detected_by ObservabilitySignal
```

## 3.6 Boundary with Data

Data owns dataset, classification, lineage, quality, and retention semantics.

Observability telemetry may itself be data, but Observability owns telemetry-specific semantics.

Example:

```text
LogDataset classified_as Restricted
MetricStream has_retention RetentionRuleReference
TraceSample derived_from RuntimeWorkload
```

---

# 4. Research Basis and External Alignment

This seed standard draws on several mature observability and operations bodies of knowledge.

## 4.1 OpenTelemetry

OpenTelemetry provides a broad observability framework covering traces, metrics, logs, baggage, resources, semantic conventions, instrumentation, collection, and export. Its semantic conventions define common attributes that give meaning to telemetry across systems.

## 4.2 SRE and Service Level Objectives

SRE practice distinguishes Service Level Indicators, Service Level Objectives, Service Level Agreements, and error budgets. It emphasizes that SLOs should measure user-relevant reliability and guide operational decision-making.

## 4.3 Prometheus and OpenMetrics

Prometheus and OpenMetrics influence metric naming, metric exposition, labels, time series, counters, gauges, histograms, summaries, and scraping/pull-based metric collection.

## 4.4 CloudEvents

CloudEvents standardizes common event metadata for interoperability across services, platforms, and systems. It is a strong mapping target for event structure and routing metadata.

## 4.5 IT Operations and Incident Management

IT operations practice distinguishes alerts, incidents, problems, changes, runbooks, on-call, escalation, resolution, and post-incident review. The Observability Model provides signal semantics while Task and Governance own work and decision semantics.

## 4.6 AIOps and Event Correlation

AIOps practice emphasizes correlation, anomaly detection, event deduplication, root-cause analysis, topology-aware alerting, and automated remediation. These are advanced profiles rather than mandatory core concepts.

---

# 5. Seed Standard Design Stance

This standard is a **seed standard**, not a vendor-specific observability schema.

It shall:

1. define canonical observability semantics,
2. distinguish telemetry, signal, event, log, metric, trace, span, profile, alert, and incident,
3. support OpenTelemetry alignment without being limited to it,
4. support SLOs, SLIs, and error budgets,
5. support correlation across services, runtime, network, security, data, and delivery,
6. support operational evidence and feedback loops,
7. support human and agentic operations,
8. map to external standards and tools without becoming subordinate to them,
9. remain markdown-first and agent-retrievable,
10. and support future assimilation of observability tools, standards, and practices.

---

# 6. Scope

## 6.1 In Scope

This standard covers canonical representation of:

- telemetry,
- telemetry sources,
- observed resources,
- observability signals,
- metrics,
- time series,
- metric points,
- metric instruments,
- logs,
- log records,
- events,
- event envelopes,
- traces,
- spans,
- span links,
- trace context,
- profiles,
- exemplars,
- attributes,
- dimensions,
- labels,
- correlation context,
- service-level indicators,
- service-level objectives,
- service-level agreements as references,
- error budgets,
- burn rates,
- health states,
- alert rules,
- alerts,
- notifications,
- alert routes,
- incidents as observed operational objects,
- investigations,
- dashboards,
- runbooks,
- telemetry pipelines,
- collectors,
- exporters,
- sampling,
- retention,
- and observability evidence.

## 6.2 Out of Scope

This standard does not fully define:

- all monitoring tool schemas,
- all incident-management process details,
- all SRE organizational practice,
- complete AIOps algorithms,
- all logging formats,
- all SIEM detection content,
- full OpenTelemetry SDK implementation,
- all Prometheus query semantics,
- complete data-retention law,
- complete security incident-response methodology,
- or every vendor-specific telemetry backend.

Those may be mapped, assimilated, profiled, or handled by adjacent standards.

---

# 7. Normative Language

The following terms are used normatively:

- **SHALL** indicates a mandatory rule for conformance.
- **SHOULD** indicates a recommended practice.
- **MAY** indicates an optional capability.
- **MUST NOT** indicates a prohibited practice.
- **SEED** marks a concept defined provisionally here but open to later refinement.
- **EXTRACT** marks a concept that may later move to a more specialized standard.

---

# 8. Core Principles

## 8.1 Observability Is More Than Monitoring

Monitoring checks known conditions.

Observability supports understanding system behavior, including unknown or emergent failure modes, through signals and correlation.

## 8.2 Telemetry Is Not Insight

Raw telemetry becomes useful through context, correlation, aggregation, interpretation, and action.

## 8.3 Signal Is Not Incident

A signal, alert, or event may indicate a possible problem.

An incident is an operationally relevant situation requiring response.

## 8.4 Alert Is Not Evidence by Itself

An alert indicates that a rule fired or condition was detected.

Evidence should include the underlying signals, query, thresholds, state, and context.

## 8.5 Metrics, Logs, Traces, Events, and Profiles Are Distinct

Each signal type has different strengths and should not be collapsed into one generic “event” concept.

## 8.6 Service Levels Must Be Explicit

SLIs, SLOs, and error budgets SHOULD be modeled explicitly when reliability is important.

## 8.7 Correlation Requires Identity

Telemetry SHOULD be linked to canonical landscape entities, deployment records, network endpoints, data resources, or security entities where possible.

## 8.8 Observability Must Support Feedback

Observability should feed tasks, incidents, governance reviews, deployment verification, security detection, reliability improvement, and standard evolution.

## 8.9 External Standards Are Mapped, Not Obeyed

The Observability Model MAY map to OpenTelemetry, Prometheus, OpenMetrics, CloudEvents, SRE SLO concepts, ITIL incident practices, and vendor schemas.

It MUST NOT subordinate its internal semantics to any single external model.

---

# 9. Canonical Seed Metadata

Every observability artifact SHOULD support structured metadata.

Recommended front matter:

```yaml
---
id: itc-obs:Metric
type: concept
standard: InfoTechCanonObservabilityModel
standard_version: RC1-seed
status: candidate
canonical_owner: InfoTechCanonObservabilityModel
preferred_label: Metric
related:
  - itc-obs:TimeSeries
  - itc-obs:SLI
  - itc-obs:AlertRule
mappings:
  - itc-map:metric-to-opentelemetry-metric
---
```

Recommended artifact statuses:

```text
idea
draft
candidate
release-candidate
adopted
stable
deprecated
retired
```

Recommended concept statuses:

```text
proposed
experimental
candidate
canonical
deprecated
retired
```
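
The front matter and status lists above lend themselves to a simple automated check. The following is a minimal sketch, assuming the front matter has already been parsed into a dictionary; the function name `validate_front_matter` and the choice of `REQUIRED_KEYS` are hypothetical, since this section only recommends the fields and does not mandate them.

```python
from typing import Any

# Status vocabularies copied from the lists in this section.
ARTIFACT_STATUSES = {
    "idea", "draft", "candidate", "release-candidate",
    "adopted", "stable", "deprecated", "retired",
}
CONCEPT_STATUSES = {
    "proposed", "experimental", "candidate",
    "canonical", "deprecated", "retired",
}

# Assumption: the standard only *recommends* these keys; treating them
# as required is a choice made for this illustration.
REQUIRED_KEYS = ("id", "type", "standard", "standard_version", "status")


def validate_front_matter(meta: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_KEYS:
        if not meta.get(key):
            problems.append(f"missing key: {key}")
    # Concepts use the concept status list; everything else is treated
    # as an artifact (assumption).
    allowed = CONCEPT_STATUSES if meta.get("type") == "concept" else ARTIFACT_STATUSES
    status = meta.get("status")
    if status and status not in allowed:
        problems.append(f"unknown status: {status}")
    return problems


metric_meta = {
    "id": "itc-obs:Metric",
    "type": "concept",
    "standard": "InfoTechCanonObservabilityModel",
    "standard_version": "RC1-seed",
    "status": "candidate",
}
print(validate_front_matter(metric_meta))
```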

---

# 10. Root Observability Taxonomy

```text
ObservabilityEntity
├── TelemetryEntity
│   ├── Telemetry
│   ├── TelemetrySource
│   ├── ObservedResource
│   ├── ResourceAttribute
│   ├── Signal
│   ├── SignalSource
│   └── TelemetryPipeline
├── MetricEntity
│   ├── Metric
│   ├── MetricInstrument
│   ├── TimeSeries
│   ├── MetricPoint
│   ├── Counter
│   ├── Gauge
│   ├── Histogram
│   ├── Summary
│   └── Exemplar
├── LogEntity
│   ├── Log
│   ├── LogRecord
│   ├── LogStream
│   ├── LogLevel
│   ├── LogContext
│   └── StructuredLogField
├── TraceEntity
│   ├── Trace
│   ├── Span
│   ├── SpanEvent
│   ├── SpanLink
│   ├── TraceContext
│   ├── Baggage
│   └── TraceSample
├── EventEntity
│   ├── Event
│   ├── EventEnvelope
│   ├── EventSource
│   ├── EventType
│   ├── EventConsumer
│   └── EventCorrelationKey
├── ProfileEntity
│   ├── Profile
│   ├── ProfilingSample
│   ├── ResourceProfile
│   └── PerformanceProfile
├── ReliabilityEntity
│   ├── SLI
│   ├── SLO
│   ├── SLAReference
│   ├── ErrorBudget
│   ├── BurnRate
│   ├── HealthState
│   └── AvailabilityWindow
├── AlertingEntity
│   ├── AlertRule
│   ├── Alert
│   ├── Notification
│   ├── AlertRoute
│   ├── AlertSuppression
│   ├── AlertCorrelation
│   └── EscalationReference
├── OperationsEntity
│   ├── ObservedIncident
│   ├── Investigation
│   ├── Timeline
│   ├── Runbook
│   ├── Dashboard
│   ├── OperationalView
│   └── PostIncidentObservation
└── EvidenceEntity
    ├── ObservabilityEvidence
    ├── Query
    ├── QueryResult
    ├── Snapshot
    ├── Annotation
    ├── Correlation
    └── RootCauseHypothesis
```

---

# 11. Core Concepts

## 11.1 ObservabilityEntity

An **ObservabilityEntity** is any identifiable concept used to represent telemetry, signals, correlation, health, service levels, alerts, incidents as observed phenomena, dashboards, runbooks, or operational evidence.

Recommended attributes:

```yaml
id:
entity_type:
canonical_name:
display_name:
lifecycle_state:
source_system:
created_at:
updated_at:
```

Optional attributes:

```yaml
owner:
steward:
observed_resource:
service:
environment:
source_confidence:
valid_from:
valid_to:
tags:
external_references:
```
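
The recommended and optional attributes above can be carried by a small record type. The sketch below is illustrative only: the field types, the `is_valid_at` helper, and the example values (including the `active` lifecycle state and the `example-collector` source system) are assumptions, since the standard names the attributes without fixing their types or value sets.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ObservabilityEntity:
    # Recommended attributes (types are assumptions).
    id: str
    entity_type: str
    canonical_name: str
    display_name: str
    lifecycle_state: str
    source_system: str
    created_at: datetime
    updated_at: datetime
    # Optional attributes.
    owner: Optional[str] = None
    steward: Optional[str] = None
    observed_resource: Optional[str] = None
    service: Optional[str] = None
    environment: Optional[str] = None
    source_confidence: Optional[float] = None
    valid_from: Optional[datetime] = None
    valid_to: Optional[datetime] = None
    tags: list[str] = field(default_factory=list)
    external_references: list[str] = field(default_factory=list)

    def is_valid_at(self, moment: datetime) -> bool:
        """True if `moment` lies in [valid_from, valid_to); missing bounds are open."""
        if self.valid_from is not None and moment < self.valid_from:
            return False
        if self.valid_to is not None and moment >= self.valid_to:
            return False
        return True


created = datetime(2026, 5, 23, tzinfo=timezone.utc)
entity = ObservabilityEntity(
    id="itc-obs:example-metric",          # hypothetical identifier
    entity_type="Metric",
    canonical_name="http_request_duration",
    display_name="HTTP request duration",
    lifecycle_state="active",             # hypothetical state value
    source_system="example-collector",    # hypothetical source
    created_at=created,
    updated_at=created,
    valid_to=datetime(2027, 1, 1, tzinfo=timezone.utc),
)
print(entity.is_valid_at(created))
```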

---

## 11.2 Telemetry

**Telemetry** is machine-generated or manually recorded operational data about system behavior, state, performance, events, or activity.

Examples:

```text
metric sample
log record
trace span
event
profile sample
flow record
health check result
```

---

## 11.3 TelemetrySource

A **TelemetrySource** is a system, component, agent, collector, service, device, pipeline, or actor that emits or provides telemetry.

---

## 11.4 ObservedResource

An **ObservedResource** is the entity about which telemetry is emitted or collected.

Observed resources SHOULD map to Landscape, Network, Data, Security, or DevSecOps entities where possible.

---

## 11.5 ResourceAttribute

A **ResourceAttribute** is an attribute describing an observed resource.

Examples:

```text
service.name
service.version
deployment.environment
host.name
cloud.region
k8s.cluster.name
container.image.name
```

---

## 11.6 Signal

A **Signal** is an interpretable unit or stream of observability data.

Signal types include:

```text
metric
log
trace
event
profile
alert
health check
synthetic result
```

---

## 11.7 SignalSource

A **SignalSource** is the origin of a signal.

---

## 11.8 TelemetryPipeline

A **TelemetryPipeline** is a flow that collects, processes, transforms, samples, enriches, routes, stores, or exports telemetry.

---

## 11.9 Metric

A **Metric** is a measurement of a system, service, resource, or process over time.

Metrics may be used for alerting, dashboards, SLOs, capacity planning, anomaly detection, and evidence.

---

## 11.10 MetricInstrument

A **MetricInstrument** defines the kind of measurement instrument.

Seed instrument types:

```text
counter
gauge
histogram
summary
up_down_counter
observable_gauge
```
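
To make the difference between three of the instrument types listed above concrete, here is a minimal in-memory sketch of a counter, a gauge, and a histogram. It is not the API of any metrics library; the class and method names are hypothetical, and the histogram uses inclusive upper bounds with a trailing overflow bucket as one plausible bucketing convention.

```python
import bisect


class Counter:
    """Monotonically increasing value; negative increments are rejected."""

    def __init__(self) -> None:
        self.value = 0.0

    def add(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Value that may go up or down; only the latest value is kept."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; bucket i covers (bound[i-1], bound[i]]."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket at the end for values above the largest bound.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # bisect_left makes the upper bound inclusive.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.count += 1
        self.sum += value


requests_total = Counter()
requests_total.add()
requests_total.add(2)

latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.1, 0.3, 2.0):
    latency.observe(seconds)
print(requests_total.value, latency.bucket_counts)
```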

---

## 11.11 TimeSeries

A **TimeSeries** is a sequence of metric points over time for a metric and a set of dimensions or labels.

---

## 11.12 MetricPoint

A **MetricPoint** is a single measurement value at a point in time.

---

## 11.13 Counter

A **Counter** is a monotonically increasing measurement of occurrences or accumulated quantity.

---

## 11.14 Gauge

A **Gauge** is a measurement that can go up or down.

---

## 11.15 Histogram

A **Histogram** is a distribution of measurements across buckets or ranges.

---

## 11.16 Summary

A **Summary** is a metric representation of observations including quantiles or summary statistics.

---

## 11.17 Exemplar

An **Exemplar** is a representative sample connecting an aggregate metric point to a trace, log, or other detailed signal.

---

## 11.18 Log

A **Log** is a stream or collection of timestamped records describing events, state, actions, or messages.

---

## 11.19 LogRecord

A **LogRecord** is a single log entry.

Recommended attributes:

```yaml
timestamp:
severity:
message:
body:
resource:
trace_id:
span_id:
attributes:
source:
```
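
As an illustration of the LogRecord attributes above, the following sketch models a record and serializes it as one JSON line, a common structured-logging layout. The field types, the ISO 8601 timestamp string, and the `to_json_line` helper are assumptions made for this example; the standard does not prescribe a wire format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class LogRecord:
    timestamp: str                      # assumed ISO 8601 string
    severity: str
    message: str
    body: Any = None
    resource: dict[str, str] = field(default_factory=dict)
    trace_id: Optional[str] = None      # enables log-to-trace correlation
    span_id: Optional[str] = None
    attributes: dict[str, Any] = field(default_factory=dict)
    source: Optional[str] = None

    def to_json_line(self) -> str:
        """Serialize to a single JSON line, omitting unset optional fields."""
        data = {
            key: value
            for key, value in asdict(self).items()
            if value is not None and value != {}
        }
        return json.dumps(data, sort_keys=True)


record = LogRecord(
    timestamp="2026-05-23T10:00:00Z",
    severity="error",
    message="payment authorization failed",
    resource={"service.name": "checkout"},   # hypothetical service
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
)
print(record.to_json_line())
```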

---

## 11.20 LogStream

A **LogStream** is a sequence of log records from a source or resource.

---

## 11.21 LogLevel

A **LogLevel** is a severity or importance category for log records.

Examples:

```text
trace
debug
info
warn
error
fatal
```

---

## 11.22 LogContext

**LogContext** is contextual metadata attached to log records.

Examples:

```text
request id
trace id
user id reference
tenant id
deployment version
environment
component
```

---

## 11.23 Trace

A **Trace** is a representation of a request, transaction, workflow, or operation as it moves through a distributed system.

---

## 11.24 Span

A **Span** is a single timed operation within a trace.

Recommended attributes:

```yaml
trace_id:
span_id:
parent_span_id:
name:
kind:
start_time:
end_time:
status:
attributes:
events:
links:
```
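
The `parent_span_id` attribute above is what turns a flat set of spans into a trace tree. The sketch below shows one way to rebuild and print that tree; it keeps only a subset of the recommended attributes, assumes times are plain seconds, and uses hypothetical helper names (`children_by_parent`, `render_trace`).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]   # None marks the root span
    name: str
    start_time: float               # seconds (assumption)
    end_time: float

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


def children_by_parent(spans: list[Span]) -> dict[Optional[str], list[Span]]:
    """Index spans by the id of their parent span."""
    index: dict[Optional[str], list[Span]] = defaultdict(list)
    for span in spans:
        index[span.parent_span_id].append(span)
    return index


def render_trace(spans: list[Span]) -> list[str]:
    """Return indented lines, one per span, children ordered by start time."""
    index = children_by_parent(spans)
    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(index.get(parent_id, []), key=lambda s: s.start_time):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration:.3f}s)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


example_spans = [
    Span("t1", "a", None, "GET /checkout", 0.0, 0.5),
    Span("t1", "b", "a", "db.query", 0.1, 0.3),
    Span("t1", "c", "a", "cache.get", 0.05, 0.08),
]
print("\n".join(render_trace(example_spans)))
```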

---

## 11.25 SpanEvent

A **SpanEvent** is a timestamped event attached to a span.

---

## 11.26 SpanLink

A **SpanLink** connects a span to another span or trace context.

---

## 11.27 TraceContext

**TraceContext** is propagation metadata that links operations across process, service, or network boundaries.

---

## 11.28 Baggage

**Baggage** is contextual metadata propagated across process boundaries.

Baggage SHOULD be governed carefully when it may contain sensitive data.

---

## 11.29 TraceSample

A **TraceSample** is a selected trace or subset of trace data retained or analyzed.

---

## 11.30 Event

An **Event** is a record of an occurrence and its context.

Events may be operational, domain, security, deployment, infrastructure, or business events.

---

## 11.31 EventEnvelope

An **EventEnvelope** is structured metadata around event data.

CloudEvents is a primary mapping target.

---

## 11.32 EventSource

An **EventSource** is the producer or origin of an event.

---

## 11.33 EventType

An **EventType** classifies the kind of occurrence represented by an event.

---

## 11.34 EventConsumer

An **EventConsumer** is an actor, system, service, or pipeline that consumes events.

---

## 11.35 EventCorrelationKey

An **EventCorrelationKey** links events to related traces, logs, requests, incidents, deployments, or resources.

---

## 11.36 Profile

A **Profile** is sampled performance or resource-use data.

This concept is observability-specific and distinct from InfoTechCanon application profiles.

---

## 11.37 ProfilingSample

A **ProfilingSample** is one sample of profiling data.

---

## 11.38 ResourceProfile

A **ResourceProfile** describes resource use over time or sampled execution.

Examples:

```text
CPU profile
memory profile
allocation profile
lock profile
I/O profile
```

---

## 11.39 PerformanceProfile

A **PerformanceProfile** describes performance characteristics of a system, component, or operation.

---

## 11.40 SLI

A **Service Level Indicator** is a quantitative measure of a service level.

Examples:

```text
availability
latency
error rate
throughput
correctness
freshness
durability
```

---

## 11.41 SLO

A **Service Level Objective** is a target value or range for an SLI over a defined measurement window.

Recommended attributes:

```yaml
service:
sli:
target:
window:
scope:
owner:
evidence_source:
```
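
A short worked example may help connect the `target` and `window` attributes above to the error budget they imply. The sketch assumes a ratio-style SLI (good events divided by total events), a `target` expressed as a fraction, and a window in days; `SLO`, `error_budget_fraction`, and `budget_remaining` are hypothetical names, and only a subset of the recommended attributes is modeled.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    service: str
    sli: str
    target: float        # e.g. 0.999 means 99.9 % of events must be good
    window_days: int     # assumption: window expressed in days


def error_budget_fraction(slo: SLO) -> float:
    """Share of events that may be bad without violating the SLO."""
    return 1.0 - slo.target


def budget_remaining(slo: SLO, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent; negative once overspent."""
    allowed_bad = error_budget_fraction(slo) * total_events
    if allowed_bad == 0:
        # A 100 % target or an empty window leaves no budget to spend.
        return 0.0 if bad_events == 0 else float("-inf")
    return 1.0 - bad_events / allowed_bad


checkout_slo = SLO(
    service="checkout",            # hypothetical service
    sli="availability",
    target=0.999,
    window_days=30,
)
# 1,000,000 requests allow roughly 1,000 failures; 250 failures spend
# about a quarter of the budget.
print(round(budget_remaining(checkout_slo, 1_000_000, 250), 3))
```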

---

## 11.42 SLAReference

An **SLAReference** points to a contractual or formal service-level agreement.

Governance owns contractual obligation semantics. Observability owns measured service-level signals.

---

## 11.43 ErrorBudget

An **ErrorBudget** is the allowed amount of unreliability implied by an SLO over a measurement window.

---

## 11.44 BurnRate

**BurnRate** is the rate at which an error budget is being consumed.

---

## 11.45 HealthState

**HealthState** is an assessed operational state of a resource, service, dependency, or system.

Seed health states:

```text
unknown
healthy
degraded
unhealthy
down
recovering
maintenance
```

---

## 11.46 AvailabilityWindow

An **AvailabilityWindow** is the time period over which availability or service level is measured.

---

## 11.47 AlertRule

An **AlertRule** defines conditions under which an alert is created.

Recommended attributes:

```yaml
query:
condition:
threshold:
window:
severity:
for_duration:
labels:
annotations:
owner:
runbook:
```
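
The interplay of `condition`, `threshold`, and `for_duration` above can be sketched as a small evaluator. This is an illustration under stated assumptions, not a prescribed algorithm: the comparator strings, the seconds-based `for_duration`, and the `evaluate` function are hypothetical, and the query is assumed to have already produced timestamped samples. The returned `pending` and `firing` values match the seed alert states; `None` means no alert instance exists.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Assumption: conditions are simple comparisons against the threshold.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass
class AlertRule:
    query: str
    condition: str
    threshold: float
    for_duration: float      # seconds the condition must hold before firing
    severity: str = "warning"


def evaluate(rule: AlertRule, samples: list[tuple[float, float]]) -> Optional[str]:
    """Evaluate (timestamp, value) samples, oldest first.

    Returns None when the condition does not currently hold, "pending"
    while it has held for less than for_duration, and "firing" afterwards.
    """
    compare = COMPARATORS[rule.condition]
    breach_started: Optional[float] = None
    for timestamp, value in samples:
        if compare(value, rule.threshold):
            if breach_started is None:
                breach_started = timestamp
        else:
            breach_started = None   # the breach must be continuous
    if breach_started is None:
        return None
    latest_timestamp = samples[-1][0]
    if latest_timestamp - breach_started >= rule.for_duration:
        return "firing"
    return "pending"


error_rate_rule = AlertRule(
    query="checkout error ratio over 5m",   # placeholder query text
    condition=">",
    threshold=0.05,
    for_duration=120,
    severity="critical",
)
print(evaluate(error_rate_rule, [(0, 0.01), (60, 0.08), (120, 0.09)]))
```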

---

## 11.48 Alert

An **Alert** is an instance of an alert rule firing or resolving.

Seed alert states:

```text
pending
firing
acknowledged
suppressed
resolved
expired
```

---

## 11.49 Notification

A **Notification** is a message sent to humans, agents, or systems about an alert, incident, or operational state.

---

## 11.50 AlertRoute

An **AlertRoute** defines how alerts are routed to responders, teams, tools, or escalation paths.

---

## 11.51 AlertSuppression

**AlertSuppression** is a rule or state that suppresses notifications for known, duplicate, maintenance, or intentionally ignored alert conditions.

---

## 11.52 AlertCorrelation

**AlertCorrelation** groups related alerts or signals.

---

## 11.53 EscalationReference

An **EscalationReference** points to Organization, Task, or Governance concepts defining who should respond and how escalation works.

---

## 11.54 ObservedIncident

An **ObservedIncident** is an operationally significant situation inferred or declared from observability signals.

Task and ITSM systems may own incident work records. Observability owns the signal-derived incident view.

---

## 11.55 Investigation

An **Investigation** is analysis of signals, alerts, telemetry, incidents, or hypotheses to understand cause, scope, impact, and remediation.

---

## 11.56 Timeline

A **Timeline** is an ordered sequence of events, signals, decisions, actions, and observations.

---

## 11.57 Runbook

A **Runbook** is an operational procedure used to investigate, respond to, recover from, or verify a condition.

---

## 11.58 Dashboard

A **Dashboard** is a visual or structured view of observability data.

---

## 11.59 OperationalView

An **OperationalView** is a purpose-specific view of system state, health, risk, or performance.

---

## 11.60 PostIncidentObservation

A **PostIncidentObservation** is a signal, fact, lesson, or finding captured after an incident.

---

## 11.61 ObservabilityEvidence

**ObservabilityEvidence** is telemetry, query output, screenshot, dashboard state, trace, log, metric, or event used to support a claim.

---

## 11.62 Query

A **Query** is an expression used to retrieve or calculate observability data.

Examples:

```text
PromQL query
LogQL query
SQL query
trace search
SIEM query
dashboard panel query
```

---

## 11.63 QueryResult

A **QueryResult** is the result of executing a query.

---

## 11.64 Snapshot

A **Snapshot** is a captured state of telemetry, dashboard, trace, log, metric, or query result at a point in time.

---

## 11.65 Annotation

An **Annotation** is a note added by a human, agent, or system and attached to telemetry, a dashboard, timeline, incident, deployment, or event.

---

## 11.66 Correlation

A **Correlation** is a relationship linking signals, resources, events, deployments, incidents, or hypotheses.

---

## 11.67 RootCauseHypothesis

A **RootCauseHypothesis** is a candidate explanation for an observed issue.

Canonical rule:

```text
RootCauseHypothesis SHOULD remain distinguishable from verified cause.
```

---

# 12. Core Relationship Vocabulary

Recommended root relationship types:

```text
emitted_by
observes
measures
describes
correlates_with
derived_from
generated_by
triggered_by
alerts_on
routes_to
acknowledged_by
suppressed_by
resolves
affects
indicates
supports
evidences
verifies
invalidates
samples
aggregates
annotates
links_to
maps_to
```

Relationship records SHOULD support:

```yaml
id:
relationship_type:
source_entity:
target_entity:
scope:
time_window:
state_context:
valid_from:
valid_to:
source_system:
confidence:
evidence:
rationale:
```
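
A relationship record as described above can be checked against the root vocabulary at construction time. The following is a minimal sketch: it models only some of the listed fields, treats `confidence` as a number between 0 and 1 (an assumption, since the standard does not define its scale), and the entity identifiers in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Root relationship types copied from the vocabulary list in this section.
RELATIONSHIP_TYPES = frozenset(
    """
    emitted_by observes measures describes correlates_with derived_from
    generated_by triggered_by alerts_on routes_to acknowledged_by
    suppressed_by resolves affects indicates supports evidences verifies
    invalidates samples aggregates annotates links_to maps_to
    """.split()
)


@dataclass(frozen=True)
class Relationship:
    id: str
    relationship_type: str
    source_entity: str
    target_entity: str
    source_system: Optional[str] = None
    confidence: Optional[float] = None   # assumed scale: 0.0 to 1.0
    evidence: tuple[str, ...] = ()
    rationale: Optional[str] = None

    def __post_init__(self) -> None:
        if self.relationship_type not in RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {self.relationship_type}")
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


link = Relationship(
    id="rel-001",                                  # hypothetical ids
    relationship_type="alerts_on",
    source_entity="itc-obs:alert-rule/high-error-rate",
    target_entity="itc-obs:metric/checkout-error-ratio",
    confidence=0.9,
)
print(link.relationship_type, len(RELATIONSHIP_TYPES))
```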

---

# 13. Observability State Models

## 13.1 Signal States

```text
unknown
emitting
missing
delayed
partial
degraded
invalid
stale
```
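
The standard lists the signal states without defining when each applies. As one plausible reading, the sketch below derives `unknown`, `emitting`, `delayed`, and `missing` from the age of the most recent sample; the two thresholds and the function name `classify_signal` are assumptions, and the remaining states (`partial`, `degraded`, `invalid`, `stale`) would need content-level checks that are not modeled here.

```python
from typing import Optional


def classify_signal(
    last_seen: Optional[float],
    now: float,
    expected_interval: float,
    missing_after: float,
) -> str:
    """Map the age of the latest sample (in seconds) to a seed signal state.

    - "unknown":  nothing has ever been received
    - "emitting": the latest sample arrived within the expected interval
    - "delayed":  overdue, but not yet past `missing_after`
    - "missing":  overdue by more than `missing_after`
    """
    if last_seen is None:
        return "unknown"
    age = now - last_seen
    if age <= expected_interval:
        return "emitting"
    if age <= missing_after:
        return "delayed"
    return "missing"


# A source expected every 30 s that is treated as missing after 120 s.
for last_seen in (None, 90.0, 50.0, -100.0):
    print(last_seen, classify_signal(last_seen, now=100.0,
                                     expected_interval=30.0,
                                     missing_after=120.0))
```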

## 13.2 Alert States

```text
pending
firing
acknowledged
suppressed
resolved
expired
```
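
The alert states above are listed without transition rules. The sketch below shows how a profile might constrain them with an explicit transition table; the table itself is entirely hypothetical (for instance, it treats `resolved` and `expired` as terminal), and `transition` is an illustrative helper rather than part of the standard.

```python
# Hypothetical transition table over the seed alert states. The standard
# lists the states only; which moves are legal is an assumption here.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "pending": {"firing", "resolved", "expired"},
    "firing": {"acknowledged", "suppressed", "resolved", "expired"},
    "acknowledged": {"suppressed", "resolved", "expired"},
    "suppressed": {"firing", "resolved", "expired"},
    "resolved": set(),   # terminal in this sketch
    "expired": set(),    # terminal in this sketch
}


def transition(current: str, target: str) -> str:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown alert state: {current}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"transition not allowed: {current} -> {target}")
    return target


state = "pending"
for next_state in ("firing", "acknowledged", "resolved"):
    state = transition(state, next_state)
print(state)
```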

## 13.3 Incident Observation States

```text
suspected
confirmed
investigating
mitigating
recovering
resolved
post_review
closed
```

## 13.4 Health States

```text
unknown
healthy
degraded
unhealthy
down
recovering
maintenance
```

## 13.5 SLO States

```text
not_measured
within_budget
burning_fast
at_risk
violated
paused
retired
```
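
To show how measured values might map onto the SLO states above, here is a small sketch based on burn rate, taken as the observed bad-event ratio divided by the ratio the SLO allows. The thresholds (`fast_burn_threshold`, `at_risk_fraction`), the precedence of the checks, and both function names are assumptions for illustration; `paused` and `retired` are administrative states and are not derived from measurements here.

```python
from typing import Optional


def burn_rate(bad_events: int, total_events: int, target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window.
    Assumes target < 1.0.
    """
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - target)


def slo_state(
    budget_consumed: Optional[float],
    rate: float,
    fast_burn_threshold: float = 10.0,   # hypothetical threshold
    at_risk_fraction: float = 0.8,       # hypothetical threshold
) -> str:
    """Derive a seed SLO state from budget consumption (0.0 to 1.0+) and burn rate."""
    if budget_consumed is None:
        return "not_measured"
    if budget_consumed >= 1.0:
        return "violated"
    if rate >= fast_burn_threshold:
        return "burning_fast"
    if budget_consumed >= at_risk_fraction:
        return "at_risk"
    return "within_budget"


# 50 failures in 10,000 requests against a 99.9 % target burns the budget
# at roughly five times the sustainable rate.
recent_rate = burn_rate(bad_events=50, total_events=10_000, target=0.999)
print(round(recent_rate, 2), slo_state(0.3, recent_rate))
```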
|
|
|
|
## 13.6 Telemetry Pipeline States
|
|
|
|
```text
|
|
configured
|
|
active
|
|
degraded
|
|
dropping_data
|
|
stalled
|
|
misconfigured
|
|
retired
|
|
```
|
|
|
|
---
|
|
|
|
# 14. Observability Patterns
|
|
|
|
## 14.1 Pattern: Resource-Linked Telemetry
|
|
|
|
**Context:** Telemetry is collected from many systems.
|
|
|
|
**Problem:** Signals are hard to interpret if they cannot be linked to canonical resources.
|
|
|
|
**Solution:** Attach telemetry to ObservedResource references mapped to Landscape, Network, DevSecOps, Security, or Data entities.
|
|
|
|
---
|
|
|
|
## 14.2 Pattern: Signal-to-Alert-to-Task
|
|
|
|
**Context:** A condition needs human or agent response.
|
|
|
|
**Problem:** Alerts fire but do not become accountable work.
|
|
|
|
**Solution:**
|
|
|
|
```text
|
|
Signal
|
|
-> AlertRule
|
|
-> Alert
|
|
-> ObservedIncident or Task
|
|
-> Investigation
|
|
-> RemediationTask
|
|
-> VerificationEvidence
|
|
```
|
|
|
|
---
|
|
|
|
## 14.3 Pattern: SLO as Reliability Contract
|
|
|
|
**Context:** Service reliability must be operationally meaningful.
|
|
|
|
**Problem:** Teams alert on low-level metrics that do not represent user experience.
|
|
|
|
**Solution:** Define SLIs and SLOs for user-meaningful service behavior and use error budgets to guide action.
|
|
|
|
---
|
|
|
|
## 14.4 Pattern: Deployment Health Verification
|
|
|
|
**Context:** A deployment has completed.
|
|
|
|
**Problem:** Successful deployment command does not prove healthy service behavior.
|
|
|
|
**Solution:** Link DeploymentRecord to DeploymentHealthSignal, SLO state, traces, logs, metrics, and verification evidence.

---

## 14.5 Pattern: Correlated Timeline

**Context:** Incidents require understanding what happened.

**Problem:** Logs, alerts, deployments, changes, and network events are scattered.

**Solution:** Build Timeline from correlated events, alerts, traces, deployment records, annotations, and task actions.

---

## 14.6 Pattern: Alert with Runbook

**Context:** An alert requires response.

**Problem:** Responders waste time discovering what the alert means.

**Solution:** AlertRule SHOULD reference owner, runbook, dashboard, likely causes, and escalation path.
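
A sketch of an AlertRule that carries the references listed in the solution. The field names, paths, and team names are hypothetical.

```yaml
# Hypothetical AlertRule; field names, ids, and paths are illustrative only.
alert_rule:
  id: itc-obs:alert-rule/checkout-api-high-error-ratio
  condition: error ratio above 5% for 10 minutes
  severity: page
  owner: team/payments
  runbook: runbooks/checkout-api-high-error-ratio.md
  dashboard: dashboards/checkout-api-overview
  likely_causes:
    - recent deployment
    - payment provider outage
    - database connection exhaustion
  escalation_path:
    - team/payments-oncall
    - role/incident-commander
```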

---

## 14.7 Pattern: Metric with Exemplar

**Context:** Aggregate metrics show a problem.

**Problem:** Aggregates hide individual requests or traces.

**Solution:** Link MetricPoint or histogram bucket to trace/log exemplar.
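
For example, a histogram bucket could point at one slow request through an exemplar. The record layout is an assumption; the trace and span ids are placeholder values in the W3C Trace Context format.

```yaml
# Hypothetical MetricPoint with an exemplar; field names are illustrative only.
metric_point:
  metric: http.server.request.duration
  type: histogram_bucket
  upper_bound_seconds: 2.5
  count: 17
  timestamp: "2026-05-23T10:15:00Z"
  exemplar:
    value_seconds: 2.31
    trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
    span_id: 00f067aa0ba902b7
```

The exemplar lets a responder jump from "17 requests landed in the slow bucket" to one concrete trace that shows why.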

---

## 14.8 Pattern: Observability as Governance Evidence

**Context:** Governance requires proof that controls or SLOs are operating.

**Problem:** Compliance claims rely on manual screenshots or weak assertions.

**Solution:** Use query results, snapshots, dashboards, and telemetry evidence as structured ObservabilityEvidence.

---

## 14.9 Pattern: Missing Signal as Signal

**Context:** A telemetry source goes silent.

**Problem:** Systems only alert on bad values, not missing data.

**Solution:** Model missing, stale, or delayed telemetry as signal states and potential alerts.
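
A minimal sketch of a silent source represented as a signal state. The state names, field names, and thresholds are assumptions chosen for illustration.

```yaml
# Hypothetical signal-state record; field names and values are illustrative only.
signal_state:
  signal: itc-obs:signal/checkout-api-heartbeat
  observed_resource: itc-obs:observed-resource/checkout-api-prod
  expected_interval: 60s
  last_received: "2026-05-23T10:02:00Z"
  evaluated_at: "2026-05-23T10:15:00Z"
  state: missing          # assumed vocabulary: fresh, delayed, stale, missing
alert_rule:
  id: itc-obs:alert-rule/checkout-api-telemetry-silent
  condition: state is stale or missing for longer than 5 minutes
```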

---

# 15. Observability Profiles

## 15.1 Profile Format

An Observability Profile SHALL declare:

```yaml
id:
profile_name:
status:
implements:
  - InfoTechCanonObservabilityModel
target_context:
included_concepts:
required_relationships:
required_metadata:
state_model:
source_of_truth_rules:
mapping_files:
validation_rules:
examples:
known_deviations:
```
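
A filled-in profile might look like the following. This is a hedged sketch: the keys come from the format above, but every value (identifier scheme, state names, selected rules) is an illustrative assumption.

```yaml
# Hypothetical filled-in profile; all values are illustrative only.
id: itc-obs-profile:small-saas
profile_name: Small SaaS Observability Profile
status: candidate
implements:
  - InfoTechCanonObservabilityModel
target_context: small SaaS platform moving toward production readiness
included_concepts:
  - ObservedResource
  - Metric
  - AlertRule
  - SLO
required_relationships:
  - Metric emitted_by ObservedResource
  - Alert triggered_by AlertRule
required_metadata:
  - id
  - owner
state_model: [draft, candidate, active, retired]
source_of_truth_rules:
  - alert rule state is owned by the alerting backend
mapping_files:
  - standards/observability/mappings/prometheus-openmetrics.yaml
validation_rules:
  - VAL-OBS-002
  - VAL-OBS-009
examples: []
known_deviations: []
```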

---

## 15.2 Seed Profile: Small SaaS Observability Profile

Purpose:

```text
Provide a minimal observability model for a small SaaS platform moving toward production readiness.
```

Included concepts:

```text
ObservedResource
Metric
LogRecord
Trace
Span
Event
AlertRule
Alert
Dashboard
Runbook
SLI
SLO
HealthState
ObservedIncident
ObservabilityEvidence
```

Required relationships:

```text
Metric emitted_by ObservedResource
LogRecord emitted_by ObservedResource
Trace observes Service
Alert triggered_by AlertRule
Alert affects Service
SLO measures Service
Dashboard displays Metric
Runbook supports Alert
ObservabilityEvidence supports Investigation
```

---

## 15.3 Seed Profile: OpenTelemetry Profile

Purpose:

```text
Map OpenTelemetry resources, traces, metrics, logs, attributes, baggage, and semantic conventions into InfoTechCanon.
```

Example mappings:

```text
Resource -> ObservedResource
Resource attributes -> ResourceAttribute
Metric -> Metric
LogRecord -> LogRecord
Trace -> Trace
Span -> Span
Span event -> SpanEvent
Span link -> SpanLink
Baggage -> Baggage
Semantic conventions -> Mapping / Attribute vocabulary
Collector -> TelemetryPipeline component
Exporter -> TelemetryPipeline component
```

---

## 15.4 Seed Profile: Prometheus / OpenMetrics Profile

Purpose:

```text
Represent metrics, labels, time series, scrape targets, alert rules, and query results.
```

Example mappings:

```text
metric name -> Metric
labels -> dimensions / attributes
sample -> MetricPoint
time series -> TimeSeries
PromQL -> Query
recording rule -> DerivedMetric / Query
alerting rule -> AlertRule
target -> TelemetrySource / ObservedResource
```
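
To make the mappings concrete, the sketch below takes one Prometheus sample and splits it into the canonical concepts. The sample is invented for illustration and the record field names are assumptions; `job` and `instance` are the standard Prometheus target labels.

```yaml
# Illustrative only: the Prometheus sample
#   http_requests_total{job="checkout-api", instance="10.0.0.5:8080", code="500"} 1027
# expressed with hypothetical field names.
metric:
  name: http_requests_total
time_series:
  metric: http_requests_total
  dimensions:
    job: checkout-api
    instance: "10.0.0.5:8080"
    code: "500"
metric_point:
  value: 1027
telemetry_source:
  target: "10.0.0.5:8080"
  observed_resource: itc-obs:observed-resource/checkout-api-prod
```

Note that resource identity here is inferred from labels, which is why an importer has to resolve the target to an ObservedResource explicitly.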

---

## 15.5 Seed Profile: CloudEvents Profile

Purpose:

```text
Represent event metadata and event envelopes.
```

Example mappings:

```text
id -> Event id
source -> EventSource
type -> EventType
specversion -> EventEnvelope version
subject -> Event subject
time -> Event timestamp
datacontenttype -> Event data content type
data -> Event data
```
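
As a sketch, an Event record built from a CloudEvents 1.0 envelope could look like this. The record's field names and all values are hypothetical; the trailing comments name the CloudEvents attribute each field would be read from.

```yaml
# Hypothetical Event record; field names and values are illustrative only.
event:
  id: A234-1234-1234                      # from `id`
  event_source: /services/checkout-api    # from `source`
  event_type: com.example.order.created   # from `type`
  envelope_version: "1.0"                 # from `specversion`
  subject: order/8841                     # from `subject`
  timestamp: "2026-05-23T10:15:00Z"       # from `time`
  data_content_type: application/json     # from `datacontenttype`
  data:                                   # from `data`
    order_id: 8841
```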

---

## 15.6 Seed Profile: SRE Reliability Profile

Purpose:

```text
Represent SLIs, SLOs, error budgets, burn rates, and reliability decisions.
```

Included concepts:

```text
SLI
SLO
ErrorBudget
BurnRate
AvailabilityWindow
AlertRule
ReliabilityReview
ServiceHealthState
ErrorBudgetPolicyReference
```

Required relationships:

```text
SLO applies_to Service
SLI measures Service
ErrorBudget derived_from SLO
BurnRate measures ErrorBudgetConsumption
AlertRule alerts_on BurnRate
ReliabilityReview reviews SLOState
```
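
A worked example of the relationships above, assuming a 99.9% target over a 30-day window. Field names and identifiers are illustrative; the burn rate is the observed failure ratio divided by the allowed failure ratio.

```yaml
# Hypothetical records; field names are illustrative only.
slo:
  id: itc-obs:slo/checkout-availability
  applies_to: checkout-api
  target: 0.999
  measurement_window: 30d
error_budget:
  derived_from: itc-obs:slo/checkout-availability
  allowed_failure_ratio: 0.001      # 1 - 0.999
burn_rate:
  measures: error budget consumption
  lookback_window: 1h
  observed_failure_ratio: 0.0144
  value: 14.4                       # 0.0144 / 0.001
alert_rule:
  id: itc-obs:alert-rule/checkout-fast-burn
  alerts_on: burn_rate
  condition: value >= 14.4 over 1h
  severity: page
```

A burn rate of 14.4 sustained for one hour consumes 14.4 × (1 / 720) = 2% of a 30-day budget, since 30 days is 720 hours.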

---

## 15.7 Seed Profile: Incident Observability Profile

Purpose:

```text
Represent telemetry, alerts, timelines, dashboards, and evidence for incident response.
```

Included concepts:

```text
Alert
ObservedIncident
Timeline
Investigation
Dashboard
Runbook
Annotation
RootCauseHypothesis
ObservabilityEvidence
PostIncidentObservation
```

---

## 15.8 Seed Profile: Network Observability Profile

Purpose:

```text
Represent network metrics, flow logs, reachability tests, DNS logs, and latency signals.
```

Included concepts:

```text
NetworkMetric
ObservedFlowSignal
DNSLogRecord
ReachabilityTestResult
LatencyMetric
PacketLossMetric
EndpointHealthSignal
```

Mapping targets:

```text
NetFlow/IPFIX
VPC Flow Logs
Kubernetes CNI telemetry
service mesh telemetry
DNS logs
synthetic probes
```

---

## 15.9 Seed Profile: Security Observability Profile

Purpose:

```text
Represent observability signals used for security detection, investigation, and evidence.
```

Included concepts:

```text
SecuritySignal
SecurityLogRecord
DetectionEvent
Alert
TraceEvidence
AccessSessionLog
AuditLogReference
SecurityEvidence
```

Security interpretation remains owned by the Security Model.

---

# 16. Mapping Model for the Observability Standard

Mappings relate InfoTechCanon observability concepts to external standards, tools, and products.

## 16.1 Mapping Types

Recommended mapping types:

```text
exactMatch
closeMatch
broadMatch
narrowMatch
relatedMatch
conflictMatch
gapMatch
derivedFrom
regulatoryReference
toolEquivalent
```

## 16.2 Mapping Record

Example:

```yaml
id: itc-map:span-to-opentelemetry-span
source_concept: itc-obs:Span
target_body: OpenTelemetry
target_version: "current"
target_concept: Span
mapping_type: closeMatch
scope:
  - distributed tracing
not_valid_for:
  - all event or log semantics
rationale: >
  OpenTelemetry Span is the primary mapping target for timed operations in traces.
  InfoTechCanon keeps Span as a canonical concept to allow mappings to other tracing systems.
confidence: high
status: candidate
owner: InfoTechCanonObservabilityModel
```
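
A second record in the same shape, sketched here for an alerting concept, shows how `not_valid_for` can fence off what the mapping does not cover. The identifier, scope entries, and confidence value are assumptions, not an adopted mapping.

```yaml
# Hypothetical mapping record; values are illustrative only.
id: itc-map:alertrule-to-prometheus-alerting-rule
source_concept: itc-obs:AlertRule
target_body: Prometheus
target_version: "current"
target_concept: alerting rule
mapping_type: closeMatch
scope:
  - metric-based alert conditions
not_valid_for:
  - notification routing and silencing
rationale: >
  A Prometheus alerting rule pairs a PromQL expression with a duration and labels,
  which corresponds to the condition and time window of an AlertRule.
  Routing and silencing are handled by Alertmanager and need separate mappings.
confidence: medium
status: candidate
owner: InfoTechCanonObservabilityModel
```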

## 16.3 Seed Mapping Targets

The Observability Model SHOULD maintain mappings to:

```text
OpenTelemetry
OpenTelemetry Semantic Conventions
Prometheus
OpenMetrics / Prometheus exposition format
CloudEvents
W3C Trace Context
Google SRE SLI/SLO/Error Budget concepts
Grafana dashboards and alerting
Prometheus Alertmanager
Loki / LogQL
Jaeger
Tempo
Elastic Observability
Datadog
New Relic
Splunk
OpenSearch
ITIL incident concepts
NetFlow / IPFIX
VPC Flow Logs
Kubernetes events and metrics
service mesh telemetry
```

---

# 17. Assimilation Hooks

The Observability Model SHALL be able to receive new observability standards, tool models, telemetry schemas, incident practices, and operational patterns through the InfoTechCanon assimilation process.

## 17.1 Assimilation Triggers

Assimilation may be triggered by:

```text
new telemetry standard
new observability backend
new incident-management tool
new SLO practice
new dashboard model
new alerting model
new tracing model
new logging schema
new AIOps product
new runtime verification practice
new recurring signal classification conflict
```

## 17.2 Observability Assimilation Output

An observability assimilation SHOULD produce:

```text
source summary
extracted observability concepts
concept comparison matrix
gap list
conflict list
mapping file
candidate new concepts
candidate relationship changes
candidate pattern changes
candidate profile changes
open questions
```

## 17.3 Recommended First Assimilation Candidates

```text
OpenTelemetry specification and semantic conventions
Prometheus / OpenMetrics
CloudEvents
W3C Trace Context
Google SRE SLO chapters
Grafana dashboard and alerting models
Prometheus Alertmanager
Kubernetes events and metrics
VPC Flow Logs / NetFlow / IPFIX
ITIL incident management concepts
```

---

# 18. Integration with Other InfoTechCanon Standards

## 18.1 Landscape Model

Observability links signals to:

```text
ApplicationService
TechnicalService
RuntimeWorkload
Environment
Endpoint
DataStore
DeploymentRecord
NetworkEntity
```

## 18.2 Organization Model

Observability imports organization concepts for:

```text
service owner
on-call responder
team
escalation target
runbook owner
incident commander
```

## 18.3 Governance Model

Observability imports governance concepts for:

```text
evidence
control result
review
assurance
policy
SLA obligation
audit evidence
```

## 18.4 Task Model

Observability creates or references:

```text
incident task
investigation task
remediation task
follow-up task
reliability improvement task
```

## 18.5 Tagging Standard

Observability uses tags for:

```text
service
environment
severity
signal type
dashboard category
incident category
team
```

Tags MUST NOT replace ObservedResource, AlertRule, SLO, or Evidence records.

## 18.6 Access Control Model

Observability imports access concepts for:

```text
dashboard access
log access
trace access
incident tool access
telemetry pipeline access
sensitive telemetry access
```

## 18.7 Security Model

Security imports observability concepts for:

```text
security signal
detection evidence
security alert
audit log
trace evidence
incident timeline
```

## 18.8 Data Model

Data imports observability concepts when telemetry is treated as a dataset and for data freshness, quality, and lineage signals.

## 18.9 DevSecOps Model

DevSecOps imports observability concepts for:

```text
deployment verification
change failure detection
delivery metric
runtime feedback
SLO impact
```

## 18.10 Network Model

Network imports observability concepts for:

```text
flow logs
reachability test results
latency
packet loss
DNS logs
endpoint health
```

---

# 19. Canon Interface Card Usage

Subsystems that implement or produce observability knowledge SHOULD publish a Canon Interface Card.

Example:

```yaml
subsystem: prometheus-importer
implements:
  - InfoTechCanonObservabilityModel
  - PrometheusOpenMetricsProfile
produces:
  - Metric
  - TimeSeries
  - MetricPoint
  - AlertRule
  - Alert
  - QueryResult
consumes:
  - ObservedResource
  - Service
  - Environment
relations:
  - Metric emitted_by ObservedResource
  - Alert triggered_by AlertRule
  - Alert affects Service
source_of_truth:
  metric_samples: Prometheus
  alert_rule_state: Prometheus
known_deviations:
  - resource identity depends on labels
  - long-term retention may be external
```

---

# 20. Retrieval Requirements

The Observability Model is designed for markdown-based infospaces.

## 20.1 Required Retrieval Properties

Every major concept SHOULD provide:

- stable heading,
- stable identifier,
- short definition,
- longer explanation,
- examples,
- distinction notes,
- relationship examples,
- mapping hooks,
- profile references,
- and common mistakes.

## 20.2 Agent Brief

A mature Observability Model SHOULD include an `agent-brief.md` file with:

```text
purpose
scope
owned concepts
imported concepts
core distinctions
do / do not rules
relationship patterns
minimal examples
common mistakes
profile list
mapping list
```

## 20.3 Indexes

The observability information space SHOULD provide indexes by:

```text
concept
relationship
signal type
metric
log
trace
event
resource
service
alert
SLO
dashboard
incident
profile
pattern
mapping target
status
source system
```

---

# 21. Conformance Levels

## 21.1 Reference-Conformant

A document or system is reference-conformant if it uses Observability Model terminology consistently but does not implement structured metadata or validation rules.

## 21.2 Metadata-Conformant

A system is metadata-conformant if it uses stable identifiers, concept names, lifecycle states, source metadata, and relationship types.

## 21.3 Signal-Conformant

A system is signal-conformant if it distinguishes metrics, logs, traces, events, profiles, alerts, and health signals.

## 21.4 Resource-Correlated

A system is resource-correlated if observability signals can be linked to observed resources and canonical landscape entities.

## 21.5 SLO-Conformant

A system is SLO-conformant if it represents SLIs, SLOs, error budgets, burn rates, and measurement windows.

## 21.6 Evidence-Conformant

A system is evidence-conformant if observability claims, incidents, alerts, and service-level states can be linked to evidence.

## 21.7 Profile-Conformant

A system is profile-conformant if it implements a declared Observability Profile and passes its validation rules.

## 21.8 Assimilation-Conformant

A system or repository is assimilation-conformant if it can accept external observability concepts through the InfoTechCanon assimilation workflow and produce mappings, gaps, conflicts, and proposed changes.

---

# 22. Validation Rules

Initial validation rules:

```text
VAL-OBS-001: Metric, LogRecord, Trace, Span, Event, Profile, Alert, and Incident SHOULD be modeled as distinct concepts.

VAL-OBS-002: Telemetry SHOULD reference an ObservedResource where possible.

VAL-OBS-003: ObservedResource SHOULD map to a Landscape, Network, Data, Security, or DevSecOps entity where possible.

VAL-OBS-004: Metric SHOULD declare unit, instrument type, source, and dimensions where available.

VAL-OBS-005: TimeSeries SHOULD distinguish metric identity from labels/dimensions.

VAL-OBS-006: LogRecord SHOULD include timestamp, severity, source, and body where available.

VAL-OBS-007: Span SHOULD include trace id, span id, timing, name, status, and parent/link references where available.

VAL-OBS-008: Event SHOULD distinguish event data from event context metadata.

VAL-OBS-009: Alert SHOULD reference AlertRule or source condition where available.

VAL-OBS-010: AlertRule SHOULD reference query or condition, threshold, time window, owner, and runbook where applicable.

VAL-OBS-011: SLO SHOULD reference SLI, target, measurement window, service, and evidence source.

VAL-OBS-012: ErrorBudget SHOULD derive from an SLO.

VAL-OBS-013: Dashboard SHOULD NOT be treated as evidence unless a Snapshot or QueryResult is captured.

VAL-OBS-014: Incident SHOULD NOT be inferred solely from one alert unless the profile permits it.

VAL-OBS-015: RootCauseHypothesis SHOULD remain distinguishable from verified cause.

VAL-OBS-016: Missing, stale, or delayed telemetry SHOULD be representable as signal state.

VAL-OBS-017: Tags MUST NOT replace resource identity, SLO definitions, alert rules, or evidence.

VAL-OBS-018: Imported external observability concepts SHOULD be represented through mapping records rather than silently reused.

VAL-OBS-019: Profiles MUST NOT redefine canonical concepts. They may constrain them.

VAL-OBS-020: Telemetry containing sensitive data SHOULD reference Data, Security, Access Control, or Governance constraints where relevant.
```
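
To illustrate how a record might satisfy VAL-OBS-013, the sketch below captures a QueryResult alongside the dashboard reference instead of citing the dashboard alone. Field names, identifiers, and the measured value are hypothetical.

```yaml
# Hypothetical evidence record; field names and ids are illustrative only.
observability_evidence:
  id: itc-obs:evidence/checkout-availability-2026-05
  supports: itc-obs:slo/checkout-availability
  dashboard: dashboards/checkout-api-overview   # context only, not the evidence itself
  query_result:
    query: checkout success ratio over the measurement window
    data_source: metrics backend
    time_window:
      from: "2026-05-01T00:00:00Z"
      to: "2026-05-23T00:00:00Z"
    captured_at: "2026-05-23T00:05:00Z"
    value: 0.9994
```

Because the query, data source, time window, and captured value are preserved, the claim can be re-checked later even if the dashboard changes.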

---

# 23. Anti-Patterns

## 23.1 Dashboard as Truth

Treating a dashboard view as evidence without preserving query, time window, data source, or snapshot.

## 23.2 Alert Equals Incident

Treating every alert as an incident.

## 23.3 Metric Soup

Collecting many metrics without ownership, resource identity, interpretation, or action path.

## 23.4 Logs Without Context

Logging messages that cannot be correlated to service, request, trace, tenant, deployment, or resource.

## 23.5 Traces Without Boundaries

Tracing calls without linking them to service ownership, deployment version, or runtime resource.

## 23.6 SLO Theater

Creating SLOs that do not reflect user experience or guide operational decisions.

## 23.7 Alert Without Runbook

Creating alerts without ownership, runbook, dashboard, or response expectation.

## 23.8 Missing Signal Blindness

Failing to alert when telemetry stops arriving.

## 23.9 Tool-Native Capture

Letting one observability backend define the internal observability model.

## 23.10 Telemetry Without Governance

Collecting sensitive logs, traces, or profiles without classification, retention, access control, or privacy consideration.

---

# 24. Initial Repository Placement

Recommended repository layout:

```text
info-tech-canon/
  standards/
    observability/
      InfoTechCanonObservabilityModel.md
      agent-brief.md
      concepts/
      relationships/
      patterns/
      profiles/
      mappings/
      assimilation/
      examples/
      validation/
```

Seed files:

```text
standards/observability/InfoTechCanonObservabilityModel.md
standards/observability/agent-brief.md
standards/observability/concepts/telemetry.md
standards/observability/concepts/metric.md
standards/observability/concepts/log-record.md
standards/observability/concepts/trace.md
standards/observability/concepts/span.md
standards/observability/concepts/event.md
standards/observability/concepts/sli.md
standards/observability/concepts/slo.md
standards/observability/concepts/alert.md
standards/observability/concepts/observability-evidence.md
standards/observability/patterns/resource-linked-telemetry.md
standards/observability/patterns/signal-to-alert-to-task.md
standards/observability/patterns/slo-as-reliability-contract.md
standards/observability/patterns/deployment-health-verification.md
standards/observability/profiles/small-saas-observability-profile.md
standards/observability/profiles/opentelemetry-profile.md
standards/observability/profiles/prometheus-openmetrics-profile.md
standards/observability/profiles/sre-reliability-profile.md
standards/observability/mappings/opentelemetry.yaml
standards/observability/mappings/prometheus-openmetrics.yaml
standards/observability/mappings/cloudevents.yaml
standards/observability/mappings/sre-slo.yaml
```

---

# 25. Roadmap

## Phase 1: Seed Stabilization

- Establish this standard as `InfoTechCanonObservabilityModel`.
- Add seed concepts, relationship vocabulary, patterns, and profiles.
- Define validation rules.
- Align with Landscape, Network, DevSecOps, Security, Data, Governance, Task, Access Control, and Tagging.

## Phase 2: First Assimilations

Recommended first assimilations:

```text
OpenTelemetry specification and semantic conventions
Prometheus / OpenMetrics
CloudEvents
W3C Trace Context
Google SRE SLO chapters
Grafana dashboard and alerting model
Prometheus Alertmanager
Kubernetes events and metrics
VPC Flow Logs / NetFlow / IPFIX
ITIL incident management concepts
```

## Phase 3: Profile Maturation

- Mature Small SaaS Observability Profile.
- Mature OpenTelemetry Profile.
- Mature Prometheus / OpenMetrics Profile.
- Mature CloudEvents Profile.
- Mature SRE Reliability Profile.
- Mature Incident Observability Profile.
- Mature Network Observability Profile.
- Mature Security Observability Profile.

## Phase 4: Tooling Integration

- Generate concept indexes.
- Generate agent brief.
- Create machine-readable YAML/JSON exports.
- Add validation scripts.
- Integrate telemetry pipelines, metrics, logs, traces, dashboards, alerts, incident tools, and service catalogs.

## Phase 5: Operational Intelligence Loop

- Connect telemetry to canonical resources.
- Connect alerts to tasks and incidents.
- Connect SLOs to governance and service ownership.
- Connect deployment records to runtime health signals.
- Connect security detections to security incidents.
- Connect network flows to reachability and exposure.
- Connect post-incident observations to improvements and standard evolution.

---

# 26. Summary

The InfoTechCanon Observability Model is the seed standard for representing telemetry, signals, metrics, logs, traces, events, profiles, alerts, SLOs, health, incidents as observed phenomena, and operational evidence.

Its most important commitments are:

```text
Separate telemetry, signal, metric, log, trace, span, event, profile, alert, and incident.

Link signals to canonical resources and landscape entities.

Treat SLOs, SLIs, error budgets, burn rates, and health states as first-class reliability concepts.

Use observability evidence to support governance, security, delivery, incident response, and operational review.

Map to OpenTelemetry, Prometheus/OpenMetrics, CloudEvents, SRE practices, and observability tools
without surrendering internal semantic autonomy.

Use profiles to make the model practical for SaaS systems, OpenTelemetry, Prometheus,
SRE reliability, incident response, network observability, and security observability.
```

This makes the Observability Model a core seed for runtime intelligence, production readiness, SRE practice, incident response, deployment verification, security detection, and agent-supported operations.