tegwick 9883a99f78 Implement infospace scaffold and service baseline

2026-05-23 03:12:02 +02:00

InfoTechCanon Observability Model

Short Name: ITC-OBS
Document Status: Seed Standard Release Candidate 1
Version: RC1-seed
Date: 2026-05-23
Repository Context: info-tech-canon
Document Type: InfoTechCanon Domain Standard
Intended Audience: SREs, platform engineers, DevSecOps teams, service owners, observability engineers, incident responders, network operators, security analysts, product owners, governance designers, knowledge-system builders, and agentic tooling.

1. Purpose

The InfoTechCanon Observability Model defines a canonical seed model for representing telemetry, signals, events, logs, metrics, traces, profiles, health, service levels, alerts, incidents as observed phenomena, dashboards, runbooks, investigations, and operational evidence.

It exists to make runtime understanding interoperable across systems, services, platforms, networks, security, delivery pipelines, data products, and agentic operations.

This standard provides a canonical vocabulary for:

telemetry sources,
resources,
signals,
metrics,
logs,
events,
traces,
spans,
profiles,
exemplars,
attributes,
dimensions,
correlation context,
service level indicators,
service level objectives,
error budgets,
health states,
alerts,
notifications,
incidents,
investigations,
dashboards,
runbooks,
observability evidence,
and feedback loops.

2. Position in InfoTechCanon

The Observability Model is a domain standard within InfoTechCanon.

It depends on the existing seed standards as follows:

Landscape      = services, runtime resources, environments, endpoints, workloads.
Organization   = owners, on-call actors, responders, teams, accountable roles.
Governance     = policies, controls, evidence, reviews, assurance, obligations.
Task           = incident work, remediation work, investigation, follow-up tasks.
Tagging        = lightweight classification of signals, alerts, incidents, dashboards.
Access Control = access to telemetry, dashboards, logs, admin actions, incident tools.
Security       = security signals, detections, alerts, incidents, forensic evidence.
Data           = telemetry as data, retention, classification, quality, lineage.
DevSecOps      = deployment events, delivery metrics, verification, change failures.
Network        = flow logs, reachability tests, network metrics, DNS logs, latency.
Observability  = signals, telemetry, correlation, health, SLOs, alerts, operational evidence.

InfoTechCanon
├── InfoTechCanonCore
├── InfoTechCanonLandscapeModel
├── InfoTechCanonOrganizationModel
├── InfoTechCanonGovernanceModel
├── InfoTechCanonTaskModel
├── InfoTechCanonTaggingStandard
├── InfoTechCanonAccessControlModel
├── InfoTechCanonSecurityModel
├── InfoTechCanonDataModel
├── InfoTechCanonDevSecOpsModel
├── InfoTechCanonNetworkModel
├── InfoTechCanonObservabilityModel     <-- this standard
├── InfoTechCanonPatternLanguage
└── Application Profiles

3. Boundary with Adjacent Standards

3.1 Boundary with Landscape

The Landscape Model owns the entities being observed:

ApplicationService
TechnicalService
RuntimeWorkload
Environment
Endpoint
NetworkEntity
DataStore
DeploymentRecord

The Observability Model owns telemetry and signals about those entities:

Metric
LogRecord
Trace
Span
Event
Profile
Alert
HealthState
SLI
SLO
Dashboard
IncidentSignal

Boundary rule:

Landscape owns what exists.
Observability owns what is observed, measured, correlated, alerted, and evidenced.

3.2 Boundary with Security

The Security Model owns security interpretation:

SecurityFinding
Detection
SecurityIncident
Threat
AttackPath
SecurityEvidence

Observability owns telemetry substrate and operational signals.

Example:

LogRecord may be evidence for SecurityFinding.
Detection may be derived from ObservabilitySignal.
SecurityIncident may reference Alert, Trace, LogRecord, or Event.

3.3 Boundary with Governance

Governance owns policies, controls, evidence, reviews, assurance, and compliance claims.

Observability provides evidence and indicators.

Example:

SLOEvidence supports ServiceReview.
Metric supports ControlResult.
AlertPolicy implements Governance Policy.

3.4 Boundary with Task

Task owns work semantics.

Observability creates or references tasks:

Alert creates IncidentTask
Incident creates RemediationTask
Investigation creates FollowUpTask
SLOBurn creates ReliabilityTask

3.5 Boundary with DevSecOps

DevSecOps owns delivery events and deployment records.

Observability owns runtime signals used to verify deployments and measure change impact.

Example:

DeploymentRecord produces DeploymentEvent
DeploymentHealthSignal verifies DeploymentRecord
ChangeFailure detected_by ObservabilitySignal

3.6 Boundary with Data

Data owns dataset, classification, lineage, quality, and retention semantics.

Observability telemetry may itself be data, but Observability owns telemetry-specific semantics.

Example:

LogDataset classified_as Restricted
MetricStream has_retention RetentionRuleReference
TraceSample derived_from RuntimeWorkload

4. Research Basis and External Alignment

This seed standard draws on several mature observability and operations bodies of knowledge.

4.1 OpenTelemetry

OpenTelemetry provides a broad observability framework covering traces, metrics, logs, baggage, resources, semantic conventions, instrumentation, collection, and export. Its semantic conventions define common attributes that give meaning to telemetry across systems.

4.2 SRE and Service Level Objectives

SRE practice distinguishes Service Level Indicators, Service Level Objectives, Service Level Agreements, and error budgets. It emphasizes that SLOs should measure user-relevant reliability and guide operational decision-making.

4.3 Prometheus and OpenMetrics

Prometheus and OpenMetrics influence metric naming, metric exposition, labels, time series, counters, gauges, histograms, summaries, and scraping/pull-based metric collection.

4.4 CloudEvents

CloudEvents standardizes common event metadata for interoperability across services, platforms, and systems. It is a strong mapping target for event structure and routing metadata.

4.5 IT Operations and Incident Management

IT operations practice distinguishes alerts, incidents, problems, changes, runbooks, on-call, escalation, resolution, and post-incident review. The Observability Model provides signal semantics while Task and Governance own work and decision semantics.

4.6 AIOps and Event Correlation

AIOps practice emphasizes correlation, anomaly detection, event deduplication, root-cause analysis, topology-aware alerting, and automated remediation. These are advanced profiles rather than mandatory core concepts.

5. Seed Standard Design Stance

This standard is a seed standard, not a vendor-specific observability schema.

It shall:

define canonical observability semantics,
distinguish telemetry, signal, event, log, metric, trace, span, profile, alert, and incident,
support OpenTelemetry alignment without being limited to it,
support SLOs, SLIs, and error budgets,
support correlation across services, runtime, network, security, data, and delivery,
support operational evidence and feedback loops,
support human and agentic operations,
map to external standards and tools without becoming subordinate to them,
remain markdown-first and agent-retrievable,
and support future assimilation of observability tools, standards, and practices.

6. Scope

6.1 In Scope

This standard covers canonical representation of:

telemetry,
telemetry sources,
observed resources,
observability signals,
metrics,
time series,
metric points,
metric instruments,
logs,
log records,
events,
event envelopes,
traces,
spans,
span links,
trace context,
profiles,
exemplars,
attributes,
dimensions,
labels,
correlation context,
service-level indicators,
service-level objectives,
service-level agreements as references,
error budgets,
burn rates,
health states,
alert rules,
alerts,
notifications,
alert routes,
incidents as observed operational objects,
investigations,
dashboards,
runbooks,
telemetry pipelines,
collectors,
exporters,
sampling,
retention,
and observability evidence.

6.2 Out of Scope

This standard does not fully define:

all monitoring tool schemas,
all incident-management process details,
all SRE organizational practice,
complete AIOps algorithms,
all logging formats,
all SIEM detection content,
full OpenTelemetry SDK implementation,
all Prometheus query semantics,
complete data-retention law,
complete security incident-response methodology,
or every vendor-specific telemetry backend.

Those may be mapped, assimilated, profiled, or handled by adjacent standards.

7. Normative Language

The following terms are used normatively:

SHALL indicates a mandatory rule for conformance.
SHOULD indicates a recommended practice.
MAY indicates an optional capability.
MUST NOT indicates a prohibited practice.
SEED marks a concept defined provisionally here but open to later refinement.
EXTRACT marks a concept that may later move to a more specialized standard.

8. Core Principles

8.1 Observability Is More Than Monitoring

Monitoring checks known conditions.

Observability supports understanding system behavior, including unknown or emergent failure modes, through signals and correlation.

8.2 Telemetry Is Not Insight

Raw telemetry becomes useful through context, correlation, aggregation, interpretation, and action.

8.3 Signal Is Not Incident

A signal, alert, or event may indicate a possible problem.

An incident is an operationally relevant situation requiring response.

8.4 Alert Is Not Evidence by Itself

An alert indicates that a rule fired or condition was detected.

Evidence SHOULD include the underlying signals, query, thresholds, state, and context.

8.5 Metrics, Logs, Traces, Events, and Profiles Are Distinct

Each signal type has different strengths and should not be collapsed into one generic “event” concept.

8.6 Service Levels Must Be Explicit

SLIs, SLOs, and error budgets SHOULD be modeled explicitly when reliability is important.

8.7 Correlation Requires Identity

Telemetry SHOULD be linked to canonical landscape entities, deployment records, network endpoints, data resources, or security entities where possible.

8.8 Observability Must Support Feedback

Observability SHOULD feed tasks, incidents, governance reviews, deployment verification, security detection, reliability improvement, and standard evolution.

8.9 External Standards Are Mapped, Not Obeyed

The Observability Model MAY map to OpenTelemetry, Prometheus, OpenMetrics, CloudEvents, SRE SLO concepts, ITIL incident practices, and vendor schemas.

It MUST NOT subordinate its internal semantics to any single external model.

9. Canonical Seed Metadata

Every observability artifact SHOULD support structured metadata.

Recommended front matter:

---
id: itc-obs:Metric
type: concept
standard: InfoTechCanonObservabilityModel
standard_version: RC1-seed
status: candidate
canonical_owner: InfoTechCanonObservabilityModel
preferred_label: Metric
related:
  - itc-obs:TimeSeries
  - itc-obs:SLI
  - itc-obs:AlertRule
mappings:
  - itc-map:metric-to-opentelemetry-metric
---

Recommended artifact statuses:

idea
draft
candidate
release-candidate
adopted
stable
deprecated
retired

Recommended concept statuses:

proposed
experimental
candidate
canonical
deprecated
retired

10. Root Observability Taxonomy

ObservabilityEntity
├── TelemetryEntity
│   ├── Telemetry
│   ├── TelemetrySource
│   ├── ObservedResource
│   ├── ResourceAttribute
│   ├── Signal
│   ├── SignalSource
│   └── TelemetryPipeline
├── MetricEntity
│   ├── Metric
│   ├── MetricInstrument
│   ├── TimeSeries
│   ├── MetricPoint
│   ├── Counter
│   ├── Gauge
│   ├── Histogram
│   ├── Summary
│   └── Exemplar
├── LogEntity
│   ├── Log
│   ├── LogRecord
│   ├── LogStream
│   ├── LogLevel
│   ├── LogContext
│   └── StructuredLogField
├── TraceEntity
│   ├── Trace
│   ├── Span
│   ├── SpanEvent
│   ├── SpanLink
│   ├── TraceContext
│   ├── Baggage
│   └── TraceSample
├── EventEntity
│   ├── Event
│   ├── EventEnvelope
│   ├── EventSource
│   ├── EventType
│   ├── EventConsumer
│   └── EventCorrelationKey
├── ProfileEntity
│   ├── Profile
│   ├── ProfilingSample
│   ├── ResourceProfile
│   └── PerformanceProfile
├── ReliabilityEntity
│   ├── SLI
│   ├── SLO
│   ├── SLAReference
│   ├── ErrorBudget
│   ├── BurnRate
│   ├── HealthState
│   └── AvailabilityWindow
├── AlertingEntity
│   ├── AlertRule
│   ├── Alert
│   ├── Notification
│   ├── AlertRoute
│   ├── AlertSuppression
│   ├── AlertCorrelation
│   └── EscalationReference
├── OperationsEntity
│   ├── ObservedIncident
│   ├── Investigation
│   ├── Timeline
│   ├── Runbook
│   ├── Dashboard
│   ├── OperationalView
│   └── PostIncidentObservation
└── EvidenceEntity
    ├── ObservabilityEvidence
    ├── Query
    ├── QueryResult
    ├── Snapshot
    ├── Annotation
    ├── Correlation
    └── RootCauseHypothesis

11. Core Concepts

11.1 ObservabilityEntity

An ObservabilityEntity is any identifiable concept used to represent telemetry, signals, correlation, health, service levels, alerts, incidents as observed phenomena, dashboards, runbooks, or operational evidence.

Recommended attributes:

id:
entity_type:
canonical_name:
display_name:
lifecycle_state:
source_system:
created_at:
updated_at:

Optional attributes:

owner:
steward:
observed_resource:
service:
environment:
source_confidence:
valid_from:
valid_to:
tags:
external_references:

11.2 Telemetry

Telemetry is machine-generated or manually recorded operational data about system behavior, state, performance, events, or activity.

Examples:

metric sample
log record
trace span
event
profile sample
flow record
health check result

11.3 TelemetrySource

A TelemetrySource is a system, component, agent, collector, service, device, pipeline, or actor that emits or provides telemetry.

11.4 ObservedResource

An ObservedResource is the entity about which telemetry is emitted or collected.

Observed resources SHOULD map to Landscape, Network, Data, Security, or DevSecOps entities where possible.

11.5 ResourceAttribute

A ResourceAttribute is an attribute describing an observed resource.

Examples:

service.name
service.version
deployment.environment
host.name
cloud.region
k8s.cluster.name
container.image.name

11.6 Signal

A Signal is an interpretable unit or stream of observability data.

Signal types include:

metric
log
trace
event
profile
alert
health check
synthetic result

11.7 SignalSource

A SignalSource is the origin of a signal.

11.8 TelemetryPipeline

A TelemetryPipeline is a flow that collects, processes, transforms, samples, enriches, routes, stores, or exports telemetry.

11.9 Metric

A Metric is a measurement of a system, service, resource, or process over time.

Metrics may be used for alerting, dashboards, SLOs, capacity planning, anomaly detection, and evidence.

11.10 MetricInstrument

A MetricInstrument defines the kind of measurement instrument.

Seed instrument types:

counter
gauge
histogram
summary
up_down_counter
observable_gauge

11.11 TimeSeries

A TimeSeries is a sequence of metric points over time for a metric and a set of dimensions or labels.

11.12 MetricPoint

A MetricPoint is a single measurement value at a point in time.

11.13 Counter

A Counter is a monotonically increasing measurement of occurrences or accumulated quantity.
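
As a hedged illustration of the Counter definition, the sketch below computes the total increase across a sequence of counter samples. The reset rule used here (a drop in value is read as a restart from zero) is an assumption for this example and is not defined by this standard; counter_increase is a hypothetical helper name.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase over consecutive counter samples.

    Assumption: a sample lower than its predecessor means the counter was
    reset to zero, so the new value itself counts as the increase.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # Counter reset: everything counted since the restart.
            total += current
    return total
```

For example, the samples 10, 15, 3, 7 contain one reset between 15 and 3, and the helper adds 5, 3, and 4.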

11.14 Gauge

A Gauge is a measurement that can go up or down.

11.15 Histogram

A Histogram is a distribution of measurements across buckets or ranges.
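
A minimal sketch of how observations could be placed into histogram buckets. The inclusive upper bounds and the trailing overflow bucket are assumptions chosen for the example; bucket_counts is a hypothetical helper, not part of the standard.

```python
import bisect
from typing import List, Sequence


def bucket_counts(values: Sequence[float], upper_bounds: Sequence[float]) -> List[int]:
    """Count observations per bucket.

    upper_bounds must be sorted ascending. Bucket i holds values that are
    <= upper_bounds[i] and greater than the previous bound. One extra bucket
    at the end holds values above the last bound.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for value in values:
        # bisect_left returns the first bound that is >= value.
        counts[bisect.bisect_left(upper_bounds, value)] += 1
    return counts
```

The counts returned here are per bucket, not cumulative; a backend that stores cumulative buckets would need a running sum over the result.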

11.16 Summary

A Summary is a metric that represents observations as precomputed quantiles or other summary statistics.

11.17 Exemplar

An Exemplar is a representative sample connecting an aggregate metric point to a trace, log, or other detailed signal.

11.18 Log

A Log is a stream or collection of timestamped records describing events, state, actions, or messages.

11.19 LogRecord

A LogRecord is a single log entry.

Recommended attributes:

timestamp:
severity:
message:
body:
resource:
trace_id:
span_id:
attributes:
source:
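
The recommended attributes above can be sketched as a small record type. This is an illustrative shape only: the field types, the defaults, and the to_json_line helper are assumptions, and carrying trace_id and span_id on the record is what lets a log line be correlated with a Span.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class LogRecord:
    timestamp: str                      # ISO 8601 string (assumed encoding)
    severity: str                       # e.g. "info", "warn", "error"
    message: str
    resource: str                       # reference to an ObservedResource
    source: str                         # reference to a TelemetrySource
    body: Any = None
    trace_id: Optional[str] = None      # correlation with a Trace
    span_id: Optional[str] = None       # correlation with a Span
    attributes: Dict[str, Any] = field(default_factory=dict)


def to_json_line(record: LogRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```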

11.20 LogStream

A LogStream is a sequence of log records from a source or resource.

11.21 LogLevel

A LogLevel is a severity or importance category for log records.

Examples:

trace
debug
info
warn
error
fatal

11.22 LogContext

LogContext is contextual metadata attached to log records.

Examples:

request id
trace id
user id reference
tenant id
deployment version
environment
component

11.23 Trace

A Trace is a representation of a request, transaction, workflow, or operation as it moves through a distributed system.

11.24 Span

A Span is a single timed operation within a trace.

Recommended attributes:

trace_id:
span_id:
parent_span_id:
name:
kind:
start_time:
end_time:
status:
attributes:
events:
links:

11.25 SpanEvent

A SpanEvent is a timestamped event attached to a span.

11.26 SpanLink

A SpanLink connects a span to another span or trace context.

11.27 TraceContext

TraceContext is propagation metadata that links operations across process, service, or network boundaries.
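
One common carrier for this propagation metadata is the W3C Trace Context traceparent header, which is listed among the mapping targets of this standard. The sketch below parses the version 00 layout (version, trace id, parent span id, flags). The TraceContext tuple and parse_traceparent are hypothetical names, and the parser deliberately ignores the extra fields that future header versions may add.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-flags(2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None when the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero identifiers are invalid per the W3C format.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest flag bit marks the trace as sampled.
    return TraceContext(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```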

11.28 Baggage

Baggage is contextual metadata propagated across process boundaries.

Baggage SHOULD be governed carefully when it may contain sensitive data.

11.29 TraceSample

A TraceSample is a selected trace or subset of trace data retained or analyzed.

11.30 Event

An Event is a record of an occurrence and its context.

Events may be operational, domain, security, deployment, infrastructure, or business events.

11.31 EventEnvelope

An EventEnvelope is structured metadata around event data.

CloudEvents is a primary mapping target.

11.32 EventSource

An EventSource is the producer or origin of an event.

11.33 EventType

An EventType classifies the kind of occurrence represented by an event.

11.34 EventConsumer

An EventConsumer is an actor, system, service, or pipeline that consumes events.

11.35 EventCorrelationKey

An EventCorrelationKey links events to related traces, logs, requests, incidents, deployments, or resources.

11.36 Profile

A Profile is sampled performance or resource-use data.

This concept is observability-specific and distinct from InfoTechCanon application profiles.

11.37 ProfilingSample

A ProfilingSample is one sample of profiling data.

11.38 ResourceProfile

A ResourceProfile describes resource use over time or sampled execution.

Examples:

CPU profile
memory profile
allocation profile
lock profile
I/O profile

11.39 PerformanceProfile

A PerformanceProfile describes performance characteristics of a system, component, or operation.

11.40 SLI

A Service Level Indicator is a quantitative measure of a service level.

Examples:

availability
latency
error rate
throughput
correctness
freshness
durability

11.41 SLO

A Service Level Objective is a target value or range for an SLI over a defined measurement window.

Recommended attributes:

service:
sli:
target:
window:
scope:
owner:
evidence_source:

11.42 SLAReference

An SLAReference points to a contractual or formal service-level agreement.

Governance owns contractual obligation semantics. Observability owns measured service-level signals.

11.43 ErrorBudget

An ErrorBudget is the allowed amount of unreliability implied by an SLO over a measurement window.

11.44 BurnRate

BurnRate is the rate at which an error budget is being consumed.
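
The relationship between SLO, ErrorBudget, and BurnRate can be sketched numerically. The example assumes an event-based SLI (good events divided by total events) and defines burn rate as the observed error ratio divided by the budgeted error ratio, which is a common SRE convention rather than a rule of this standard. Function names are hypothetical.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Allowed fraction of bad events implied by an SLO target such as 0.999."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the budgeted error ratio.

    A value of 1.0 means the budget is consumed at exactly the pace that
    would exhaust it at the end of the SLO window; higher values exhaust
    it sooner.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    observed_error_ratio = bad_events / total_events
    return observed_error_ratio / error_budget_fraction(slo_target)
```

With a 99.9% target, 5 failed requests out of 1000 is an error ratio of 0.5% against a budget of 0.1%, so the burn rate is about 5.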

11.45 HealthState

HealthState is an assessed operational state of a resource, service, dependency, or system.

Seed health states:

unknown
healthy
degraded
unhealthy
down
recovering
maintenance

11.46 AvailabilityWindow

An AvailabilityWindow is the time period over which availability or service level is measured.

11.47 AlertRule

An AlertRule defines conditions under which an alert is created.

Recommended attributes:

query:
condition:
threshold:
window:
severity:
for_duration:
labels:
annotations:
owner:
runbook:
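
A hedged sketch of how threshold and for_duration might interact when a rule is evaluated. It assumes the condition is "value greater than threshold" and that for_duration is expressed in seconds; both are illustration choices. The evaluate function and its return shape are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AlertRule:
    threshold: float      # assumed condition: value > threshold
    for_duration: float   # seconds the condition must hold before firing


def evaluate(
    rule: AlertRule,
    value: float,
    now: float,
    breached_since: Optional[float],
) -> Tuple[Optional[str], Optional[float]]:
    """Return (alert_state, breached_since) for one evaluation cycle.

    alert_state is None while the condition is not met, "pending" while it
    has held for less than for_duration, and "firing" afterwards.
    """
    if value <= rule.threshold:
        return None, None
    since = now if breached_since is None else breached_since
    state = "firing" if now - since >= rule.for_duration else "pending"
    return state, since
```

The caller keeps breached_since between evaluations, so the rule itself stays stateless.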

11.48 Alert

An Alert is an instance of an alert rule firing or resolving.

Seed alert states:

pending
firing
acknowledged
suppressed
resolved
expired

11.49 Notification

A Notification is a message sent to humans, agents, or systems about an alert, incident, or operational state.

11.50 AlertRoute

An AlertRoute defines how alerts are routed to responders, teams, tools, or escalation paths.

11.51 AlertSuppression

AlertSuppression is a rule or state that suppresses notifications for known, duplicate, maintenance, or intentionally ignored alert conditions.

11.52 AlertCorrelation

AlertCorrelation groups related alerts or signals.

11.53 EscalationReference

An EscalationReference points to Organization, Task, or Governance concepts defining who should respond and how escalation works.

11.54 ObservedIncident

An ObservedIncident is an operationally significant situation inferred or declared from observability signals.

Task and ITSM systems may own incident work records. Observability owns the signal-derived incident view.

11.55 Investigation

An Investigation is analysis of signals, alerts, telemetry, incidents, or hypotheses to understand cause, scope, impact, and remediation.

11.56 Timeline

A Timeline is an ordered sequence of events, signals, decisions, actions, and observations.

11.57 Runbook

A Runbook is an operational procedure used to investigate, respond, recover, or verify a condition.

11.58 Dashboard

A Dashboard is a visual or structured view of observability data.

11.59 OperationalView

An OperationalView is a purpose-specific view of system state, health, risk, or performance.

11.60 PostIncidentObservation

A PostIncidentObservation is a signal, fact, lesson, or finding captured after an incident.

11.61 ObservabilityEvidence

ObservabilityEvidence is telemetry, query output, screenshot, dashboard state, trace, log, metric, or event used to support a claim.

11.62 Query

A Query is an expression used to retrieve or calculate observability data.

Examples:

PromQL query
LogQL query
SQL query
trace search
SIEM query
dashboard panel query

11.63 QueryResult

A QueryResult is the result of executing a query.

11.64 Snapshot

A Snapshot is a captured state of telemetry, dashboard, trace, log, metric, or query result at a point in time.

11.65 Annotation

An Annotation is a note added by a human, agent, or system and attached to telemetry, a dashboard, a timeline, an incident, a deployment, or an event.

11.66 Correlation

A Correlation is a relationship linking signals, resources, events, deployments, incidents, or hypotheses.

11.67 RootCauseHypothesis

A RootCauseHypothesis is a candidate explanation for an observed issue.

Canonical rule:

RootCauseHypothesis SHOULD remain distinguishable from verified cause.

12. Core Relationship Vocabulary

Recommended root relationship types:

emitted_by
observes
measures
describes
correlates_with
derived_from
generated_by
triggered_by
alerts_on
routes_to
acknowledged_by
suppressed_by
resolves
affects
indicates
supports
evidences
verifies
invalidates
samples
aggregates
annotates
links_to
maps_to

Relationship records SHOULD support:

id:
relationship_type:
source_entity:
target_entity:
scope:
time_window:
state_context:
valid_from:
valid_to:
source_system:
confidence:
evidence:
rationale:
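
A relationship record with these fields could be represented as follows. This is a sketch: only the field names and the root relationship types come from this section, while the types, defaults, and the validation in __post_init__ are assumptions. Profiles that add their own relationship types would need to extend the set.

```python
from dataclasses import dataclass
from typing import Optional

ROOT_RELATIONSHIP_TYPES = frozenset({
    "emitted_by", "observes", "measures", "describes", "correlates_with",
    "derived_from", "generated_by", "triggered_by", "alerts_on", "routes_to",
    "acknowledged_by", "suppressed_by", "resolves", "affects", "indicates",
    "supports", "evidences", "verifies", "invalidates", "samples",
    "aggregates", "annotates", "links_to", "maps_to",
})


@dataclass(frozen=True)
class RelationshipRecord:
    id: str
    relationship_type: str
    source_entity: str
    target_entity: str
    scope: Optional[str] = None
    time_window: Optional[str] = None
    state_context: Optional[str] = None
    valid_from: Optional[str] = None
    valid_to: Optional[str] = None
    source_system: Optional[str] = None
    confidence: Optional[str] = None
    evidence: Optional[str] = None
    rationale: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject types outside the root vocabulary (assumed strictness).
        if self.relationship_type not in ROOT_RELATIONSHIP_TYPES:
            raise ValueError(f"unknown relationship type: {self.relationship_type}")
```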

13. Observability State Models

13.1 Signal States

unknown
emitting
missing
delayed
partial
degraded
invalid
stale

13.2 Alert States

pending
firing
acknowledged
suppressed
resolved
expired
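
The standard lists the alert states but does not define which transitions between them are allowed. The table below is therefore purely an assumption, offered as a sketch of how a profile might constrain the lifecycle; ALERT_TRANSITIONS and transition are hypothetical names.

```python
# Hypothetical transition table; the standard only enumerates the states.
ALERT_TRANSITIONS = {
    "pending": {"firing", "resolved", "expired"},
    "firing": {"acknowledged", "suppressed", "resolved", "expired"},
    "acknowledged": {"suppressed", "resolved", "expired"},
    "suppressed": {"firing", "resolved", "expired"},
    "resolved": set(),   # treated as terminal in this sketch
    "expired": set(),    # treated as terminal in this sketch
}


def transition(current: str, target: str) -> str:
    """Return the target state if the move is allowed, else raise ValueError."""
    allowed = ALERT_TRANSITIONS.get(current)
    if allowed is None:
        raise ValueError(f"unknown alert state: {current}")
    if target not in allowed:
        raise ValueError(f"transition {current} -> {target} is not allowed")
    return target
```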

13.3 Incident Observation States

suspected
confirmed
investigating
mitigating
recovering
resolved
post_review
closed

13.4 Health States

unknown
healthy
degraded
unhealthy
down
recovering
maintenance

13.5 SLO States

not_measured
within_budget
burning_fast
at_risk
violated
paused
retired

13.6 Telemetry Pipeline States

configured
active
degraded
dropping_data
stalled
misconfigured
retired

14. Observability Patterns

14.1 Pattern: Resource-Linked Telemetry

Context: Telemetry is collected from many systems.

Problem: Signals are hard to interpret if they cannot be linked to canonical resources.

Solution: Attach telemetry to ObservedResource references mapped to Landscape, Network, DevSecOps, Security, or Data entities.

14.2 Pattern: Signal-to-Alert-to-Task

Context: A condition needs human or agent response.

Problem: Alerts fire but do not become accountable work.

Solution:

Signal
  -> AlertRule
  -> Alert
  -> ObservedIncident or Task
  -> Investigation
  -> RemediationTask
  -> VerificationEvidence

14.3 Pattern: SLO as Reliability Contract

Context: Service reliability must be operationally meaningful.

Problem: Teams alert on low-level metrics that do not represent user experience.

Solution: Define SLIs and SLOs for user-meaningful service behavior and use error budgets to guide action.

14.4 Pattern: Deployment Health Verification

Context: A deployment has completed.

Problem: A successful deployment command does not prove healthy service behavior.

Solution: Link DeploymentRecord to DeploymentHealthSignal, SLO state, traces, logs, metrics, and verification evidence.

14.5 Pattern: Correlated Timeline

Context: Incidents require understanding what happened.

Problem: Logs, alerts, deployments, changes, and network events are scattered.

Solution: Build a Timeline from correlated events, alerts, traces, deployment records, annotations, and task actions.
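
A minimal sketch of the Correlated Timeline pattern, assuming each source already yields its entries in timestamp order. TimelineEntry and build_timeline are hypothetical names, and timestamps are plain numbers for brevity.

```python
import heapq
from typing import Iterable, List, NamedTuple


class TimelineEntry(NamedTuple):
    timestamp: float   # seconds since some epoch (assumed encoding)
    kind: str          # e.g. "alert", "deployment", "annotation"
    summary: str


def build_timeline(*streams: Iterable[TimelineEntry]) -> List[TimelineEntry]:
    """Merge per-source streams into one list ordered by timestamp.

    Each input stream must already be sorted by timestamp.
    """
    return list(heapq.merge(*streams, key=lambda entry: entry.timestamp))
```

Merging keeps each entry's kind, so a responder can see, for instance, that a deployment preceded the first alert.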

14.6 Pattern: Alert with Runbook

Context: An alert requires response.

Problem: Responders waste time discovering what the alert means.

Solution: AlertRule SHOULD reference owner, runbook, dashboard, likely causes, and escalation path.

14.7 Pattern: Metric with Exemplar

Context: Aggregate metrics show a problem.

Problem: Aggregates hide individual requests or traces.

Solution: Link a MetricPoint or histogram bucket to a trace or log Exemplar.

14.8 Pattern: Observability as Governance Evidence

Context: Governance requires proof that controls or SLOs are operating.

Problem: Compliance claims rely on manual screenshots or weak assertions.

Solution: Use query results, snapshots, dashboards, and telemetry evidence as structured ObservabilityEvidence.

14.9 Pattern: Missing Signal as Signal

Context: A telemetry source goes silent.

Problem: Systems alert only on bad values, not on missing data.

Solution: Model missing, stale, or delayed telemetry as signal states and potential alerts.
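
The pattern can be sketched as a classification of a telemetry source by the age of its last sample. The state names come from the Signal States list; the thresholds (one expected interval for emitting, a configurable multiple for delayed) are assumptions made for this example, and classify_signal_state is a hypothetical helper.

```python
from typing import Optional


def classify_signal_state(
    last_seen: Optional[float],
    now: float,
    expected_interval: float,
    stale_factor: float = 3.0,
) -> str:
    """Map the age of the last sample to a signal state.

    last_seen is None when no sample was ever received.
    """
    if expected_interval <= 0:
        raise ValueError("expected_interval must be positive")
    if last_seen is None:
        return "missing"
    age = now - last_seen
    if age <= expected_interval:
        return "emitting"
    if age <= expected_interval * stale_factor:
        return "delayed"
    return "stale"
```

An AlertRule could then alert on any state other than emitting, which turns silence into a signal of its own.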

15. Observability Profiles

15.1 Profile Format

An Observability Profile SHALL declare:

id:
profile_name:
status:
implements:
  - InfoTechCanonObservabilityModel
target_context:
included_concepts:
required_relationships:
required_metadata:
state_model:
source_of_truth_rules:
mapping_files:
validation_rules:
examples:
known_deviations:
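
Because a profile SHALL declare these keys, a simple conformance check is possible. The sketch below assumes the profile has already been loaded into a dictionary (for example from YAML front matter); the helper names are hypothetical and the check covers key presence only, not value validity.

```python
from typing import Any, Dict, List

REQUIRED_PROFILE_KEYS = (
    "id", "profile_name", "status", "implements", "target_context",
    "included_concepts", "required_relationships", "required_metadata",
    "state_model", "source_of_truth_rules", "mapping_files",
    "validation_rules", "examples", "known_deviations",
)


def missing_profile_keys(profile: Dict[str, Any]) -> List[str]:
    """Return the required keys that the profile does not declare."""
    return [key for key in REQUIRED_PROFILE_KEYS if key not in profile]


def implements_observability_model(profile: Dict[str, Any]) -> bool:
    """True when the profile lists this standard under implements."""
    return "InfoTechCanonObservabilityModel" in (profile.get("implements") or [])
```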

15.2 Seed Profile: Small SaaS Observability Profile

Purpose:

Provide a minimal observability model for a small SaaS platform moving toward production readiness.

Included concepts:

ObservedResource
Metric
LogRecord
Trace
Span
Event
AlertRule
Alert
Dashboard
Runbook
SLI
SLO
HealthState
ObservedIncident
ObservabilityEvidence

Required relationships:

Metric emitted_by ObservedResource
LogRecord emitted_by ObservedResource
Trace observes Service
Alert triggered_by AlertRule
Alert affects Service
SLO measures Service
Dashboard displays Metric
Runbook supports Alert
ObservabilityEvidence supports Investigation

15.3 Seed Profile: OpenTelemetry Profile

Purpose:

Map OpenTelemetry resources, traces, metrics, logs, attributes, baggage, and semantic conventions into InfoTechCanon.

Example mappings:

Resource -> ObservedResource
Resource attributes -> ResourceAttribute
Metric -> Metric
LogRecord -> LogRecord
Trace -> Trace
Span -> Span
Span event -> SpanEvent
Span link -> SpanLink
Baggage -> Baggage
Semantic conventions -> Mapping / Attribute vocabulary
Collector -> TelemetryPipeline component
Exporter -> TelemetryPipeline component

15.4 Seed Profile: Prometheus / OpenMetrics Profile

Purpose:

Represent metrics, labels, time series, scrape targets, alert rules, and query results.

Example mappings:

metric name -> Metric
labels -> dimensions / attributes
sample -> MetricPoint
time series -> TimeSeries
PromQL -> Query
recording rule -> DerivedMetric / Query
alerting rule -> AlertRule
target -> TelemetrySource / ObservedResource

15.5 Seed Profile: CloudEvents Profile

Purpose:

Represent event metadata and event envelopes.

Example mappings:

id -> Event id
source -> EventSource
type -> EventType
specversion -> EventEnvelope version
subject -> Event subject
time -> Event timestamp
datacontenttype -> Event data content type
data -> Event data
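
The mappings above can be sketched as a translation from a CloudEvents attribute dictionary to a canonical event record. The four required context attributes (id, source, specversion, type) follow the CloudEvents specification; the target key names and the to_canonical_event helper are hypothetical and not defined by this standard.

```python
from typing import Any, Dict

REQUIRED_CLOUDEVENT_ATTRIBUTES = ("id", "source", "specversion", "type")


def to_canonical_event(cloud_event: Dict[str, Any]) -> Dict[str, Any]:
    """Translate CloudEvents attributes into an illustrative Event record."""
    missing = [name for name in REQUIRED_CLOUDEVENT_ATTRIBUTES if name not in cloud_event]
    if missing:
        raise ValueError(f"missing required CloudEvents attributes: {missing}")
    return {
        "event_id": cloud_event["id"],
        "event_source": cloud_event["source"],
        "event_type": cloud_event["type"],
        "envelope_version": cloud_event["specversion"],
        # Optional attributes map to None when absent.
        "subject": cloud_event.get("subject"),
        "timestamp": cloud_event.get("time"),
        "data_content_type": cloud_event.get("datacontenttype"),
        "data": cloud_event.get("data"),
    }
```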

15.6 Seed Profile: SRE Reliability Profile

Purpose:

Represent SLIs, SLOs, error budgets, burn rates, and reliability decisions.

Included concepts:

SLI
SLO
ErrorBudget
BurnRate
AvailabilityWindow
AlertRule
ReliabilityReview
ServiceHealthState
ErrorBudgetPolicyReference

Required relationships:

SLO applies_to Service
SLI measures Service
ErrorBudget derived_from SLO
BurnRate measures ErrorBudgetConsumption
AlertRule alerts_on BurnRate
ReliabilityReview reviews SLOState

15.7 Seed Profile: Incident Observability Profile

Purpose:

Represent telemetry, alerts, timelines, dashboards, and evidence for incident response.

Included concepts:

Alert
ObservedIncident
Timeline
Investigation
Dashboard
Runbook
Annotation
RootCauseHypothesis
ObservabilityEvidence
PostIncidentObservation

15.8 Seed Profile: Network Observability Profile

Purpose:

Represent network metrics, flow logs, reachability tests, DNS logs, and latency signals.

Included concepts:

NetworkMetric
ObservedFlowSignal
DNSLogRecord
ReachabilityTestResult
LatencyMetric
PacketLossMetric
EndpointHealthSignal

Mapping targets:

NetFlow/IPFIX
VPC Flow Logs
Kubernetes CNI telemetry
service mesh telemetry
DNS logs
synthetic probes

15.9 Seed Profile: Security Observability Profile

Purpose:

Represent observability signals used for security detection, investigation, and evidence.

Included concepts:

SecuritySignal
SecurityLogRecord
DetectionEvent
Alert
TraceEvidence
AccessSessionLog
AuditLogReference
SecurityEvidence

Security interpretation remains owned by the Security Model.

16. Mapping Model for the Observability Standard

Mappings relate InfoTechCanon observability concepts to external standards, tools, and products.

16.1 Mapping Types

Recommended mapping types:

exactMatch
closeMatch
broadMatch
narrowMatch
relatedMatch
conflictMatch
gapMatch
derivedFrom
regulatoryReference
toolEquivalent

16.2 Mapping Record

Example:

id: itc-map:span-to-opentelemetry-span
source_concept: itc-obs:Span
target_body: OpenTelemetry
target_version: "current"
target_concept: Span
mapping_type: closeMatch
scope:
  - distributed tracing
not_valid_for:
  - all event or log semantics
rationale: >
  OpenTelemetry Span is the primary mapping target for timed operations in traces.
  InfoTechCanon keeps Span as a canonical concept to allow mappings to other tracing systems.
confidence: high
status: candidate
owner: InfoTechCanonObservabilityModel

16.3 Seed Mapping Targets

The Observability Model SHOULD maintain mappings to:

OpenTelemetry
OpenTelemetry Semantic Conventions
Prometheus
OpenMetrics / Prometheus exposition format
CloudEvents
W3C Trace Context
Google SRE SLI/SLO/Error Budget concepts
Grafana dashboards and alerting
Prometheus Alertmanager
Loki / LogQL
Jaeger
Tempo
Elastic Observability
Datadog
New Relic
Splunk
OpenSearch
ITIL incident concepts
NetFlow / IPFIX
VPC Flow Logs
Kubernetes events and metrics
service mesh telemetry

17. Assimilation Hooks

The Observability Model SHALL be able to receive new observability standards, tool models, telemetry schemas, incident practices, and operational patterns through the InfoTechCanon assimilation process.

17.1 Assimilation Triggers

Assimilation may be triggered by:

new telemetry standard
new observability backend
new incident-management tool
new SLO practice
new dashboard model
new alerting model
new tracing model
new logging schema
new AIOps product
new runtime verification practice
new recurring signal classification conflict

17.2 Observability Assimilation Output

An observability assimilation SHOULD produce:

source summary
extracted observability concepts
concept comparison matrix
gap list
conflict list
mapping file
candidate new concepts
candidate relationship changes
candidate pattern changes
candidate profile changes
open questions

17.3 Recommended First Assimilation Candidates

OpenTelemetry specification and semantic conventions
Prometheus / OpenMetrics
CloudEvents
W3C Trace Context
Google SRE SLO chapters
Grafana dashboard and alerting models
Prometheus Alertmanager
Kubernetes events and metrics
VPC Flow Logs / NetFlow / IPFIX
ITIL incident management concepts

18. Integration with Other InfoTechCanon Standards

18.1 Landscape Model

Observability links signals to:

ApplicationService
TechnicalService
RuntimeWorkload
Environment
Endpoint
DataStore
DeploymentRecord
NetworkEntity

18.2 Organization Model

Observability imports organization concepts for:

service owner
on-call responder
team
escalation target
runbook owner
incident commander

18.3 Governance Model

Observability imports governance concepts for:

evidence
control result
review
assurance
policy
SLA obligation
audit evidence

18.4 Task Model

Observability creates or references:

incident task
investigation task
remediation task
follow-up task
reliability improvement task

18.5 Tagging Standard

Observability uses tags for:

service
environment
severity
signal type
dashboard category
incident category
team

Tags MUST NOT replace ObservedResource, AlertRule, SLO, or Evidence records.

18.6 Access Control Model

Observability imports access concepts for:

dashboard access
log access
trace access
incident tool access
telemetry pipeline access
sensitive telemetry access

18.7 Security Model

Security imports observability concepts for:

security signal
detection evidence
security alert
audit log
trace evidence
incident timeline

18.8 Data Model

Data imports observability concepts in two cases: when telemetry is itself treated as a dataset, and when signals describe data freshness, quality, or lineage.

18.9 DevSecOps Model

DevSecOps imports observability concepts for:

deployment verification
change failure detection
delivery metric
runtime feedback
SLO impact

18.10 Network Model

Network imports observability concepts for:

flow logs
reachability test results
latency
packet loss
DNS logs
endpoint health

19. Canon Interface Card Usage

Subsystems that implement or produce observability knowledge SHOULD publish a Canon Interface Card.

Example:

subsystem: prometheus-importer
implements:
  - InfoTechCanonObservabilityModel
  - PrometheusOpenMetricsProfile
produces:
  - Metric
  - TimeSeries
  - MetricPoint
  - AlertRule
  - Alert
  - QueryResult
consumes:
  - ObservedResource
  - Service
  - Environment
relations:
  - Metric emitted_by ObservedResource
  - Alert triggered_by AlertRule
  - Alert affects Service
source_of_truth:
  metric_samples: Prometheus
  alert_rule_state: Prometheus
known_deviations:
  - resource identity depends on labels
  - long-term retention may be external

20. Retrieval Requirements

The Observability Model is designed for markdown-based infospaces.

20.1 Required Retrieval Properties

Every major concept SHOULD provide:

stable heading,
stable identifier,
short definition,
longer explanation,
examples,
distinction notes,
relationship examples,
mapping hooks,
profile references,
and common mistakes.

20.2 Agent Brief

A mature Observability Model SHOULD include an agent-brief.md file with:

purpose
scope
owned concepts
imported concepts
core distinctions
do / do not rules
relationship patterns
minimal examples
common mistakes
profile list
mapping list

20.3 Indexes

The observability information space SHOULD provide indexes by:

concept
relationship
signal type
metric
log
trace
event
resource
service
alert
SLO
dashboard
incident
profile
pattern
mapping target
status
source system

21. Conformance Levels

21.1 Reference-Conformant

A document or system is reference-conformant if it uses Observability Model terminology consistently but does not implement structured metadata or validation rules.

21.2 Metadata-Conformant

A system is metadata-conformant if it uses stable identifiers, concept names, lifecycle states, source metadata, and relationship types.

21.3 Signal-Conformant

A system is signal-conformant if it distinguishes metrics, logs, traces, events, profiles, alerts, and health signals.

21.4 Resource-Correlated

A system is resource-correlated if observability signals can be linked to observed resources and canonical landscape entities.

21.5 SLO-Conformant

A system is SLO-conformant if it represents SLIs, SLOs, error budgets, burn rates, and measurement windows.
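
To make the relationship between these concepts concrete, the sketch below uses the common SRE convention that the error budget is the fraction of events allowed to fail (one minus the SLO target) and that the burn rate is the observed failure fraction divided by that budget. The function names and the event-count inputs are assumptions for illustration; the standard does not fix a formula.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Fraction of events allowed to fail under the SLO, e.g. 0.001 for 99.9%."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed failure fraction divided by the allowed failure fraction.

    A value of 1.0 means the budget is consumed exactly at the rate the
    SLO allows over the measurement window; values above 1.0 mean faster.
    """
    budget = error_budget_fraction(slo_target)
    if total_events == 0:
        return 0.0  # Assumption: no traffic in the window burns no budget.
    return (bad_events / total_events) / budget


# 5 failed requests out of 1000 against a 99% target.
print(burn_rate(5, 1000, 0.99))
```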

21.6 Evidence-Conformant

A system is evidence-conformant if observability claims, incidents, alerts, and service-level states can be linked to evidence.

21.7 Profile-Conformant

A system is profile-conformant if it implements a declared Observability Profile and passes its validation rules.

21.8 Assimilation-Conformant

A system or repository is assimilation-conformant if it can accept external observability concepts through the InfoTechCanon assimilation workflow and produce mappings, gaps, conflicts, and proposed changes.

22. Validation Rules

Initial validation rules:

VAL-OBS-001: Metric, LogRecord, Trace, Span, Event, Profile, Alert, and Incident SHOULD be modeled as distinct concepts.

VAL-OBS-002: Telemetry SHOULD reference an ObservedResource where possible.

VAL-OBS-003: ObservedResource SHOULD map to a Landscape, Network, Data, Security, or DevSecOps entity where possible.

VAL-OBS-004: Metric SHOULD declare unit, instrument type, source, and dimensions where available.

VAL-OBS-005: TimeSeries SHOULD distinguish metric identity from labels/dimensions.

VAL-OBS-006: LogRecord SHOULD include timestamp, severity, source, and body where available.

VAL-OBS-007: Span SHOULD include trace id, span id, timing, name, status, and parent/link references where available.

VAL-OBS-008: Event SHOULD distinguish event data from event context metadata.

VAL-OBS-009: Alert SHOULD reference AlertRule or source condition where available.

VAL-OBS-010: AlertRule SHOULD reference query or condition, threshold, time window, owner, and runbook where applicable.

VAL-OBS-011: SLO SHOULD reference SLI, target, measurement window, service, and evidence source.

VAL-OBS-012: ErrorBudget SHOULD derive from an SLO.

VAL-OBS-013: Dashboard SHOULD NOT be treated as evidence unless a Snapshot or QueryResult is captured.

VAL-OBS-014: Incident SHOULD NOT be inferred solely from one alert unless the applicable profile permits it.

VAL-OBS-015: RootCauseHypothesis SHOULD remain distinguishable from verified cause.

VAL-OBS-016: Missing, stale, or delayed telemetry SHOULD be representable as signal state.

VAL-OBS-017: Tags MUST NOT replace resource identity, SLO definitions, alert rules, or evidence.

VAL-OBS-018: Imported external observability concepts SHOULD be represented through mapping records rather than silently reused.

VAL-OBS-019: Profiles MUST NOT redefine canonical concepts, though they may constrain them.

VAL-OBS-020: Telemetry containing sensitive data SHOULD reference Data, Security, Access Control, or Governance constraints where relevant.
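
Several of these rules reduce to checking that a record carries its recommended fields. The following sketch shows how VAL-OBS-006 and VAL-OBS-011 might be expressed as data. The snake_case field names and the function `check_recommended_fields` are hypothetical, and because both rules are SHOULD-level the output is a list of warnings rather than errors.

```python
# Rule id and recommended fields per concept. The field names are
# assumptions derived from the wording of VAL-OBS-006 and VAL-OBS-011.
RULES = {
    "LogRecord": ("VAL-OBS-006", ("timestamp", "severity", "source", "body")),
    "SLO": ("VAL-OBS-011", ("sli", "target", "measurement_window",
                            "service", "evidence_source")),
}


def check_recommended_fields(concept: str, record: dict) -> list[str]:
    """Return one warning per recommended field that is absent or empty."""
    rule = RULES.get(concept)
    if rule is None:
        return []  # No field rule is defined for this concept in the sketch.
    rule_id, fields = rule
    return [
        f"{rule_id}: {concept} is missing '{name}'"
        for name in fields
        if record.get(name) in (None, "")
    ]


log_record = {
    "timestamp": "2026-05-23T03:12:02+02:00",
    "severity": "ERROR",
    "body": "connection refused",
}
print(check_recommended_fields("LogRecord", log_record))
```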

23. Anti-Patterns

23.1 Dashboard as Truth

Treating a dashboard view as evidence without preserving query, time window, data source, or snapshot.

23.2 Alert Equals Incident

Treating every alert as an incident.

23.3 Metric Soup

Collecting many metrics without ownership, resource identity, interpretation, or action path.

23.4 Logs Without Context

Logging messages that cannot be correlated to service, request, trace, tenant, deployment, or resource.

23.5 Traces Without Boundaries

Tracing calls without linking them to service ownership, deployment version, or runtime resource.

23.6 SLO Theater

Creating SLOs that do not reflect user experience or guide operational decisions.

23.7 Alert Without Runbook

Creating alerts without ownership, runbook, dashboard, or response expectation.

23.8 Missing Signal Blindness

Failing to alert when telemetry stops arriving.
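
A minimal sketch of how missing or stale telemetry could be represented as signal state (compare VAL-OBS-016), so that silence becomes something an alert rule can act on. The state names `fresh`, `stale`, and `missing`, the `grace` multiplier, and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def signal_state(last_seen: Optional[datetime], now: datetime,
                 expected_interval: timedelta, grace: float = 2.0) -> str:
    """Classify a telemetry source by how recently it reported.

    'missing' means no data was ever received, 'stale' means the last
    data point is older than grace * expected_interval, and 'fresh'
    covers everything else.
    """
    if last_seen is None:
        return "missing"
    if now - last_seen > expected_interval * grace:
        return "stale"
    return "fresh"


now = datetime(2026, 5, 23, 12, 0, 0, tzinfo=timezone.utc)
scrape_interval = timedelta(seconds=60)
print(signal_state(now - timedelta(seconds=30), now, scrape_interval))
print(signal_state(now - timedelta(minutes=5), now, scrape_interval))
print(signal_state(None, now, scrape_interval))
```

An alert rule evaluated on this state, rather than on the metric value alone, would fire when a source goes quiet instead of silently reporting nothing.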

23.9 Tool-Native Capture

Letting one observability backend define the internal observability model.

23.10 Telemetry Without Governance

Collecting sensitive logs, traces, or profiles without classification, retention, access control, or privacy consideration.

24. Initial Repository Placement

Recommended repository layout:

info-tech-canon/
  standards/
    observability/
      InfoTechCanonObservabilityModel.md
      agent-brief.md
      concepts/
      relationships/
      patterns/
      profiles/
      mappings/
      assimilation/
      examples/
      validation/

Seed files:

standards/observability/InfoTechCanonObservabilityModel.md
standards/observability/agent-brief.md
standards/observability/concepts/telemetry.md
standards/observability/concepts/metric.md
standards/observability/concepts/log-record.md
standards/observability/concepts/trace.md
standards/observability/concepts/span.md
standards/observability/concepts/event.md
standards/observability/concepts/sli.md
standards/observability/concepts/slo.md
standards/observability/concepts/alert.md
standards/observability/concepts/observability-evidence.md
standards/observability/patterns/resource-linked-telemetry.md
standards/observability/patterns/signal-to-alert-to-task.md
standards/observability/patterns/slo-as-reliability-contract.md
standards/observability/patterns/deployment-health-verification.md
standards/observability/profiles/small-saas-observability-profile.md
standards/observability/profiles/opentelemetry-profile.md
standards/observability/profiles/prometheus-openmetrics-profile.md
standards/observability/profiles/sre-reliability-profile.md
standards/observability/mappings/opentelemetry.yaml
standards/observability/mappings/prometheus-openmetrics.yaml
standards/observability/mappings/cloudevents.yaml
standards/observability/mappings/sre-slo.yaml

25. Roadmap

Phase 1: Seed Stabilization

Establish this standard as InfoTechCanonObservabilityModel.
Add seed concepts, relationship vocabulary, patterns, and profiles.
Define validation rules.
Align with Landscape, Network, DevSecOps, Security, Data, Governance, Task, Access Control, and Tagging.

Phase 2: First Assimilations

Recommended first assimilations:

OpenTelemetry specification and semantic conventions
Prometheus / OpenMetrics
CloudEvents
W3C Trace Context
Google SRE SLO chapters
Grafana dashboard and alerting model
Prometheus Alertmanager
Kubernetes events and metrics
VPC Flow Logs / NetFlow / IPFIX
ITIL incident management concepts

Phase 3: Profile Maturation

Mature Small SaaS Observability Profile.
Mature OpenTelemetry Profile.
Mature Prometheus / OpenMetrics Profile.
Mature CloudEvents Profile.
Mature SRE Reliability Profile.
Mature Incident Observability Profile.
Mature Network Observability Profile.
Mature Security Observability Profile.

Phase 4: Tooling Integration

Generate concept indexes.
Generate agent brief.
Create machine-readable YAML/JSON exports.
Add validation scripts.
Integrate telemetry pipelines, metrics, logs, traces, dashboards, alerts, incident tools, and service catalogs.

Phase 5: Operational Intelligence Loop

Connect telemetry to canonical resources.
Connect alerts to tasks and incidents.
Connect SLOs to governance and service ownership.
Connect deployment records to runtime health signals.
Connect security detections to security incidents.
Connect network flows to reachability and exposure.
Connect post-incident observations to improvements and standard evolution.

26. Summary

The InfoTechCanon Observability Model is the seed standard for representing telemetry, signals, metrics, logs, traces, events, profiles, alerts, SLOs, health, incidents as observed phenomena, and operational evidence.

Its most important commitments are:

Separate telemetry, signal, metric, log, trace, span, event, profile, alert, and incident.

Link signals to canonical resources and landscape entities.

Treat SLOs, SLIs, error budgets, burn rates, and health states as first-class reliability concepts.

Use observability evidence to support governance, security, delivery, incident response, and operational review.

Map to OpenTelemetry, Prometheus/OpenMetrics, CloudEvents, SRE practices, and observability tools without surrendering internal semantic autonomy.

Use profiles to make the model practical for SaaS systems, OpenTelemetry, Prometheus, SRE reliability, incident response, network observability, and security observability.

This makes the Observability Model a core seed for runtime intelligence, production readiness, SRE practice, incident response, deployment verification, security detection, and agent-supported operations.
