Add Service Definition of Mature policy and health-route test

Establish policies/service-dom.md as the service-level companion to the repo
DoI and workstream DoD. Its load-bearing Core criterion is a cheap,
side-effect-free health endpoint for availability probing, which the existing
GET /state/health already satisfies (DB readiness, 200/503). The policy
document is served automatically at /policy/service-dom by the existing
policy router.

Add a regression test asserting /state/health returns 200 with the expected
shape, since none existed (DoM Standard criterion #4).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 15:58:51 +02:00
parent 044141de48
commit e4126bc755
2 changed files with 123 additions and 0 deletions

policies/service-dom.md

@@ -0,0 +1,106 @@
# Service Definition of Mature (DoM)
A long-running **service** (an HTTP API, MCP server, worker, or daemon — as
opposed to a repository or a workstream) is considered **mature** when all
criteria below are satisfied. This is the service-level companion to the
Repository *Definition of Integrated* (`repo-doi.md`) and the Workstream
*Definition of Done* (`workstream-dod.md`).
Criteria are grouped into cumulative tiers: a service that meets all **Core**
criteria is *operable*; one that also meets all **Standard** criteria is
*observable*; one that also meets all **Full** criteria is *mature*.
---
## Tier 1 — Core (Operable)
The minimum for a service to be run and reasoned about by agents and operators.
- [ ] **Health endpoint** — the service exposes an unauthenticated health route
that allows efficient "is the service available?" probing **without** running
business logic. It returns `200` with a small JSON body
(`{"status": "ok", ...}`) when ready, and a non-2xx (e.g. `503`) when a hard
dependency is unavailable. For the State Hub API this is
`GET /state/health`, which also reports DB connectivity (`{"db": "connected"}`).
Agents should probe this **before** assuming the service is offline (see the
session protocol fallback in `CLAUDE.md`).
- [ ] **Start command documented** — a single documented command brings the
service up from a clean checkout (for the State Hub API: `make api`, with
`make db` first if Postgres is not running).
- [ ] **Bound address known** — the listen host/port is fixed and documented
(State Hub API: `http://127.0.0.1:8000`; remote via ops-bridge:
`http://127.0.0.1:18000`).
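
The probing behaviour described in the health-endpoint criterion can be sketched as a small client-side helper. This is a minimal illustration using only the Python standard library, not the project's actual tooling: the function name `probe_health`, the three return labels, and the 2-second default timeout are assumptions. Only the `/state/health` path and the `200` / non-2xx contract come from the criterion above.

```python
import json
import urllib.error
import urllib.request


def probe_health(base_url: str, timeout: float = 2.0) -> str:
    """Classify a service as "ok", "degraded", or "unreachable".

    Hypothetical helper: the labels and the timeout are illustrative choices.
    """
    url = base_url.rstrip("/") + "/state/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            raw = resp.read()
    except urllib.error.HTTPError:
        # The service answered, but with a non-2xx status such as 503.
        return "degraded"
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: nothing answered.
        return "unreachable"

    try:
        body = json.loads(raw)
    except ValueError:
        # A 2xx response that is not valid JSON does not match the contract.
        return "degraded"
    if isinstance(body, dict) and body.get("status") == "ok":
        return "ok"
    return "degraded"
```

A caller would typically run `probe_health("http://127.0.0.1:8000")` and treat only `"unreachable"` as evidence that the service is offline.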
---
## Tier 2 — Standard (Observable)
The service can be monitored and integrated by other agents and tooling.
- [ ] **Health route is tested** — an automated test asserts the health route
returns a success status and the expected shape, so that regressions that
silently make the service un-probeable are caught.
- [ ] **Dependencies declared** — external service dependencies are declared in
`tpsc.yaml` and ingested (`make ingest-tpsc REPO={slug}`); an empty
`services: []` is used when there are none, to make the absence explicit.
- [ ] **Remote reachability path** — if the service is consumed across machines,
the tunnel/bridge route is documented (ops-bridge port map) and the health
endpoint is reachable over it.
- [ ] **Graceful dependency failure** — when a hard dependency (DB, broker) is
down, the service reports it via the health route rather than crashing or
hanging callers.
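
The graceful-failure criterion can be illustrated with a framework-independent sketch of the decision a health handler makes. The function `health_response`, the `check_db` callable, and the `503` body fields (`"status": "error"`, `"db": "disconnected"`) are assumptions made for illustration. The policy itself only specifies `200` with `{"status": "ok", ...}` when ready and a non-2xx status when a hard dependency is unavailable.

```python
from typing import Callable, Dict, Tuple


def health_response(check_db: Callable[[], None]) -> Tuple[int, Dict[str, str]]:
    """Return (status_code, body) for a health route.

    `check_db` is a hypothetical callable that raises when the database is
    unreachable, for example by running a trivial query with a short timeout.
    """
    try:
        check_db()
    except Exception:
        # Report the failure through the health route instead of letting the
        # exception propagate and crash or hang the caller.
        return 503, {"status": "error", "db": "disconnected"}
    return 200, {"status": "ok", "db": "connected"}
```

Keeping the decision in a plain function like this makes the `503` path easy to test without stopping a real database.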
---
## Tier 3 — Full (Mature)
The service participates safely in the wider ecosystem over time.
- [ ] **Versioned interface** — breaking interface changes are published via the
interface-change tracker (`publish_interface_change`) so consumers are warned.
- [ ] **Authn/authz boundary documented** — which routes are public (e.g. health)
versus authenticated is explicit, and credential needs route through the
standard channels (`credential-routing.md` / `warden route`).
- [ ] **Recovery documented** — the runbook for restart and for restoring a
failed dependency is captured (for the State Hub API: `make db` then
`make api`; consistency repair via `make fix-consistency`).
- [ ] **Progress/telemetry on lifecycle** — significant lifecycle events
(deploys, migrations, outages) are recorded so the hub reflects service state.
---
## Maturity Checklist (Quick Reference)
| # | Criterion | Tier | Verified by |
|---|---|---|---|
| 1 | Health endpoint | Core | `curl -s $BASE/state/health` returns `200` with `{"status":"ok"}` |
| 2 | Start command documented | Core | `make api` from clean checkout |
| 3 | Bound address known | Core | docs / `CLAUDE.md` |
| 4 | Health route is tested | Standard | `tests/` asserts health route |
| 5 | Dependencies declared | Standard | `make ingest-tpsc` |
| 6 | Remote reachability path | Standard | ops-bridge health probe |
| 7 | Graceful dependency failure | Standard | health returns `503` when DB down |
| 8 | Versioned interface | Full | `publish_interface_change` |
| 9 | Authn/authz boundary documented | Full | docs review |
| 10 | Recovery documented | Full | runbook present |
| 11 | Progress/telemetry on lifecycle | Full | `add_progress_event` on lifecycle |
---
## Notes
- The DoM is enforced by convention, not by automated gates.
- The **health endpoint** (Core #1) is the load-bearing criterion: it is what
lets agents and monitors distinguish *"service down"* from *"service up but
the request is wrong,"* cheaply and without side effects.
- "Service" here means a process exposing an interface over its lifetime — the
State Hub API and the FastMCP server each qualify. A one-shot CLI or a
migration script is **not** a service and is out of scope for the DoM.


@@ -1512,3 +1512,20 @@ class TestFabricGraphReadModel:
        summary = r.json()
        assert summary["schema_version"] is None
        assert summary["nodes_by_fabric"] == {}
# ---------------------------------------------------------------------------
# Health route — Service Definition of Mature (policies/service-dom.md), Core #1
# ---------------------------------------------------------------------------
async def test_health_route_reports_ok_when_db_reachable(client):
    """The health endpoint is a cheap availability probe with no business logic.

    It must return 200 and a small JSON body so agents and monitors can tell
    "service available" from "request wrong" without side effects.
    """
    r = await client.get("/state/health")
    assert r.status_code == 200, r.text
    body = r.json()
    assert body["status"] == "ok"
    assert body["db"] == "connected"