state-hub/policies/service-dom.md
tegwick f14c225dd9 STATE-WP-0062 T4: Service DoM uses "Level" not "Tier"
Rename Tier 1/2/3 -> Level 1/2/3 (Core/Standard/Full) in the Service DoM policy
and the checklist header to "Level", aligning with the service_catalog
maturity_level column. The DoI tier subsystem is intentionally untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 21:03:35 +02:00


# Service Definition of Mature (DoM)
A long-running **service** (an HTTP API, MCP server, worker, or daemon — as
opposed to a repository or a workstream) is considered **mature** when all
criteria below are satisfied. This is the service-level companion to the
Repository *Definition of Integrated* (`repo-doi.md`) and the Workstream
*Definition of Done* (`workstream-dod.md`).

Criteria are grouped by **Service Maturity Level**: a service that meets all
**Core** criteria is *operable*; meeting **Standard** criteria makes it
*observable*; meeting **Full** criteria makes it *mature*.

---
## Level 1 — Core (Operable)
The minimum for a service to be run and reasoned about by agents and operators.
- [ ] **Health endpoint** — the service exposes an unauthenticated health route
that allows efficient "is the service available?" probing **without** running
business logic. It returns `200` with a small JSON body
(`{"status": "ok", ...}`) when ready, and a non-2xx status (e.g. `503`) when a hard
dependency is unavailable. For the State Hub API this is
`GET /state/health`, which also reports DB connectivity (`{"db": "connected"}`).
Agents should probe this **before** assuming the service is offline (see the
session protocol fallback in `CLAUDE.md`).
- [ ] **Start command documented** — a single documented command brings the
service up from a clean checkout (for the State Hub API: `make api`, with
`make db` first if Postgres is not running).
- [ ] **Bound address known** — the listen host/port is fixed and documented
(State Hub API: `http://127.0.0.1:8000`; remote via ops-bridge:
`http://127.0.0.1:18000`).
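
The health-endpoint criterion can be illustrated with a minimal sketch that uses only the Python standard library. It is not the State Hub implementation: `health_response`, `make_handler`, and the `db_is_connected` callable are hypothetical names, and the `503` body is an assumed shape, since the policy only specifies the fields of the `200` body.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Tuple


def health_response(db_is_connected: Callable[[], bool]) -> Tuple[int, dict]:
    """Build the (status code, JSON body) pair for a health probe.

    `db_is_connected` is a hypothetical callable that performs a cheap
    connectivity check and runs no business logic.
    """
    try:
        connected = bool(db_is_connected())
    except Exception:
        # A failing check is reported as "dependency unavailable", not a crash.
        connected = False
    if connected:
        return 200, {"status": "ok", "db": "connected"}
    # Assumed shape for the failure body; the policy only requires a non-2xx.
    return 503, {"status": "unavailable", "db": "disconnected"}


def make_handler(db_is_connected: Callable[[], bool]):
    """Return a handler class that serves GET /state/health without auth."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/state/health":
                self.send_error(404)
                return
            code, body = health_response(db_is_connected)
            payload = json.dumps(body).encode("utf-8")
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep frequent probes out of the log

    return HealthHandler
```

Keeping the status decision in a plain function, separate from the HTTP handler, means it can be unit-tested without opening a socket.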
---
## Level 2 — Standard (Observable)
The service can be monitored and integrated by other agents and tooling.
- [ ] **Health route is tested** — an automated test asserts the health route
returns a success status and the expected shape, so regressions that silently
make the service un-probeable are caught.
- [ ] **Dependencies declared** — external service dependencies are declared in
`tpsc.yaml` and ingested (`make ingest-tpsc REPO={slug}`); an empty
`services: []` is used when there are none, to make the absence explicit.
- [ ] **Remote reachability path** — if the service is consumed across machines,
the tunnel/bridge route is documented (ops-bridge port map) and the health
endpoint is reachable over it.
- [ ] **Graceful dependency failure** — when a hard dependency (DB, broker) is
down, the service reports it via the health route rather than crashing or
hanging callers.
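
As a rough illustration of the "Health route is tested" and "Graceful dependency failure" criteria, the sketch below starts a stub server on an ephemeral loopback port and asserts both the healthy and the degraded response. `StubHealthHandler`, `fetch_health`, and `test_health_route` are hypothetical names, and the stub stands in for the real service; an actual test would target the service's own application object.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StubHealthHandler(BaseHTTPRequestHandler):
    """Stand-in for the service under test; not the real State Hub API."""

    db_up = True  # toggled by the test to simulate a DB outage

    def do_GET(self):
        if self.path != "/state/health":
            self.send_error(404)
            return
        if StubHealthHandler.db_up:
            code, body = 200, {"status": "ok", "db": "connected"}
        else:
            # Assumed failure body; the policy only requires a non-2xx status.
            code, body = 503, {"status": "unavailable", "db": "disconnected"}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # silence per-request logging during tests


# Opener that ignores proxy environment variables for loopback requests.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_health(base_url, timeout=2.0):
    """Return (status_code, parsed_json) for the health route."""
    try:
        with _OPENER.open(base_url + "/state/health", timeout=timeout) as resp:
            return resp.status, json.loads(resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises on non-2xx responses; the body is still readable.
        return err.code, json.loads(err.read())


def test_health_route():
    server = ThreadingHTTPServer(("127.0.0.1", 0), StubHealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    try:
        StubHealthHandler.db_up = True
        status, body = fetch_health(base)
        assert status == 200
        assert body["status"] == "ok" and body["db"] == "connected"

        StubHealthHandler.db_up = False
        status, body = fetch_health(base)
        assert status == 503  # the outage is reported, not a crash or hang
    finally:
        server.shutdown()
        server.server_close()
```

The explicit timeout in `fetch_health` matters: a health route that hangs when the dependency is down would fail the test instead of stalling it.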
---
## Level 3 — Full (Mature)
The service participates safely in the wider ecosystem over time.
- [ ] **Versioned interface** — breaking interface changes are published via the
interface-change tracker (`publish_interface_change`) so consumers are warned.
- [ ] **Authn/authz boundary documented** — which routes are public (e.g. health)
versus authenticated is explicit, and credential needs route through the
standard channels (`credential-routing.md` / `warden route`).
- [ ] **Recovery documented** — the runbook for restart and for restoring a
failed dependency is captured (for the State Hub API: `make db` then
`make api`; consistency repair via `make fix-consistency`).
- [ ] **Progress/telemetry on lifecycle** — significant lifecycle events
(deploys, migrations, outages) are recorded so the hub reflects service state.
---
## Maturity Checklist (Quick Reference)
| # | Criterion | Level | Verified by |
|---|---|---|---|
| 1 | Health endpoint | 1 · Core | `curl -s $BASE/state/health` → `200`, `{"status":"ok"}` |
| 2 | Start command documented | 1 · Core | `make api` from clean checkout |
| 3 | Bound address known | 1 · Core | docs / `CLAUDE.md` |
| 4 | Health route is tested | 2 · Standard | `tests/` asserts health route |
| 5 | Dependencies declared | 2 · Standard | `make ingest-tpsc` |
| 6 | Remote reachability path | 2 · Standard | ops-bridge health probe |
| 7 | Graceful dependency failure | 2 · Standard | health returns `503` when DB down |
| 8 | Versioned interface | 3 · Full | `publish_interface_change` |
| 9 | Authn/authz boundary documented | 3 · Full | docs review |
| 10 | Recovery documented | 3 · Full | runbook present |
| 11 | Progress/telemetry on lifecycle | 3 · Full | `add_progress_event` on lifecycle |
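
One way to read the checklist is as a function from the set of verified criteria to the highest level reached. The sketch below assumes that levels are cumulative (a service reaches Level 2 only if every Level 1 criterion also holds), which this policy implies but does not state outright. `CRITERION_LEVEL`, `LEVEL_NAMES`, and `maturity_level` are hypothetical names, not part of any State Hub tooling.

```python
from typing import Iterable

# Criterion number -> level, copied from the quick-reference table.
CRITERION_LEVEL = {
    1: 1, 2: 1, 3: 1,           # Core
    4: 2, 5: 2, 6: 2, 7: 2,     # Standard
    8: 3, 9: 3, 10: 3, 11: 3,   # Full
}
LEVEL_NAMES = {0: "None", 1: "Core", 2: "Standard", 3: "Full"}


def maturity_level(passed: Iterable[int]) -> int:
    """Return the highest level whose criteria, and all lower ones, pass."""
    passed_set = set(passed)
    level = 0
    for candidate in (1, 2, 3):
        required = {n for n, lvl in CRITERION_LEVEL.items() if lvl == candidate}
        if not required <= passed_set:
            break  # assumed cumulative: a gap at this level blocks higher ones
        level = candidate
    return level


if __name__ == "__main__":
    verified = [1, 2, 3, 4, 5, 6, 7, 8]
    print(LEVEL_NAMES[maturity_level(verified)])
```

Criterion 6 applies only to services consumed across machines; for a purely local service, a caller of this sketch would count it as passed.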
---
## Notes
- The DoM is enforced by convention, not by automated gates.
- The **health endpoint** (Core #1) is the load-bearing criterion: it is what
lets agents and monitors distinguish *"service down"* from *"service up but
the request is wrong,"* cheaply and without side effects.
- "Service" here means a process exposing an interface over its lifetime — the
State Hub API and the FastMCP server each qualify. A one-shot CLI or a
migration script is **not** a service and is out of scope for the DoM.
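
The distinction drawn in the notes above can be sketched as a small client-side probe. The default base URL and the `/state/health` path come from this policy; the function names and the three labels (`"up"`, `"unhealthy"`, `"down"`) are assumptions made for illustration.

```python
import urllib.error
import urllib.request
from typing import Optional


def classify(status: Optional[int]) -> str:
    """Map a health probe outcome to a coarse service state.

    `status` is the HTTP status code, or None when no response arrived.
    """
    if status is None:
        return "down"        # nothing answered: connection refused or timeout
    if 200 <= status < 300:
        return "up"          # ready; a failing request elsewhere is the caller's bug
    return "unhealthy"       # process is alive but reports a problem (e.g. 503)


def probe(base_url: str = "http://127.0.0.1:8000", timeout: float = 2.0) -> str:
    """Probe the health route once and classify the result."""
    try:
        with urllib.request.urlopen(base_url + "/state/health", timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # Must be caught before URLError, which is its parent class.
        return classify(err.code)
    except (urllib.error.URLError, OSError):
        return classify(None)
```

An agent would call `probe()` first and treat a later error from a business route as a problem with its own request only when the probe returned `"up"`.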