From e4126bc75593d8698f50289b336eadce08974d4a Mon Sep 17 00:00:00 2001
From: tegwick
Date: Fri, 19 Jun 2026 15:58:51 +0200
Subject: [PATCH] Add Service Definition of Mature policy and health-route test
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Establish policies/service-dom.md as the service-level companion to the
repo DoI and workstream DoD. Its load-bearing Core criterion is a cheap,
side-effect free health endpoint for availability probing — satisfied by
the existing GET /state/health (DB readiness, 200/503). Served
automatically at /policy/service-dom by the existing policy router.

Add a regression test asserting /state/health returns 200 with the
expected shape, since none existed (DoM Standard criterion #4).

Co-Authored-By: Claude Opus 4.8
---
 policies/service-dom.md    | 106 +++++++++++++++++++++++++++++++++++++
 tests/test_routers_core.py |  17 ++++++
 2 files changed, 123 insertions(+)
 create mode 100644 policies/service-dom.md

diff --git a/policies/service-dom.md b/policies/service-dom.md
new file mode 100644
index 0000000..8ab3f79
--- /dev/null
+++ b/policies/service-dom.md
@@ -0,0 +1,106 @@
+# Service Definition of Mature (DoM)
+
+A long-running **service** (an HTTP API, MCP server, worker, or daemon — as
+opposed to a repository or a workstream) is considered **mature** when all
+criteria below are satisfied. This is the service-level companion to the
+Repository *Definition of Integrated* (`repo-doi.md`) and the Workstream
+*Definition of Done* (`workstream-dod.md`).
+
+Criteria are grouped by tier: a service that meets all **Core** criteria is
+*operable*; meeting **Standard** criteria makes it *observable*; meeting
+**Full** criteria makes it *mature*.
+
+---
+
+## Tier 1 — Core (Operable)
+
+The minimum for a service to be run and reasoned about by agents and operators.
+
+- [ ] **Health endpoint** — the service exposes an unauthenticated health route
+  that allows efficient "is the service available?" probing **without** running
+  business logic. It returns `200` with a small JSON body
+  (`{"status": "ok", ...}`) when ready, and a non-2xx (e.g. `503`) when a hard
+  dependency is unavailable. For the State Hub API this is
+  `GET /state/health`, which also reports DB connectivity (`{"db": "connected"}`).
+  Agents should probe this **before** assuming the service is offline (see the
+  session protocol fallback in `CLAUDE.md`).
+
+- [ ] **Start command documented** — a single documented command brings the
+  service up from a clean checkout (for the State Hub API: `make api`, with
+  `make db` first if Postgres is not running).
+
+- [ ] **Bound address known** — the listen host/port is fixed and documented
+  (State Hub API: `http://127.0.0.1:8000`; remote via ops-bridge:
+  `http://127.0.0.1:18000`).
+
+---
+
+## Tier 2 — Standard (Observable)
+
+The service can be monitored and integrated by other agents and tooling.
+
+- [ ] **Health route is tested** — an automated test asserts the health route
+  returns a success status and the expected shape, so regressions that take the
+  service silently un-probeable are caught.
+
+- [ ] **Dependencies declared** — external service dependencies are declared in
+  `tpsc.yaml` and ingested (`make ingest-tpsc REPO={slug}`); an empty
+  `services: []` is used when there are none, to make the absence explicit.
+
+- [ ] **Remote reachability path** — if the service is consumed across machines,
+  the tunnel/bridge route is documented (ops-bridge port map) and the health
+  endpoint is reachable over it.
+
+- [ ] **Graceful dependency failure** — when a hard dependency (DB, broker) is
+  down, the service reports it via the health route rather than crashing or
+  hanging callers.
+
+---
+
+## Tier 3 — Full (Mature)
+
+The service participates safely in the wider ecosystem over time.
+
+- [ ] **Versioned interface** — breaking interface changes are published via the
+  interface-change tracker (`publish_interface_change`) so consumers are warned.
+
+- [ ] **Authn/authz boundary documented** — which routes are public (e.g. health)
+  versus authenticated is explicit, and credential needs route through the
+  standard channels (`credential-routing.md` / `warden route`).
+
+- [ ] **Recovery documented** — the runbook for restart and for restoring a
+  failed dependency is captured (for the State Hub API: `make db` then
+  `make api`; consistency repair via `make fix-consistency`).
+
+- [ ] **Progress/telemetry on lifecycle** — significant lifecycle events
+  (deploys, migrations, outages) are recorded so the hub reflects service state.
+
+---
+
+## Maturity Checklist (Quick Reference)
+
+| # | Criterion | Tier | Verified by |
+|---|---|---|---|
+| 1 | Health endpoint | Core | `curl -s $BASE/state/health` → `200`, `{"status":"ok"}` |
+| 2 | Start command documented | Core | `make api` from clean checkout |
+| 3 | Bound address known | Core | docs / `CLAUDE.md` |
+| 4 | Health route is tested | Standard | `tests/` asserts health route |
+| 5 | Dependencies declared | Standard | `make ingest-tpsc` |
+| 6 | Remote reachability path | Standard | ops-bridge health probe |
+| 7 | Graceful dependency failure | Standard | health returns `503` when DB down |
+| 8 | Versioned interface | Full | `publish_interface_change` |
+| 9 | Authn/authz boundary documented | Full | docs review |
+| 10 | Recovery documented | Full | runbook present |
+| 11 | Lifecycle telemetry | Full | `add_progress_event` on lifecycle |
+
+---
+
+## Notes
+
+- The DoM is enforced by convention, not by automated gates.
+- The **health endpoint** (Core #1) is the load-bearing criterion: it is what
+  lets agents and monitors distinguish *"service down"* from *"service up but
+  the request is wrong,"* cheaply and without side effects.
+- "Service" here means a process exposing an interface over its lifetime — the
+  State Hub API and the FastMCP server each qualify. A one-shot CLI or a
+  migration script is **not** a service and is out of scope for the DoM.
diff --git a/tests/test_routers_core.py b/tests/test_routers_core.py
index a31c507..aac9e53 100644
--- a/tests/test_routers_core.py
+++ b/tests/test_routers_core.py
@@ -1512,3 +1512,20 @@ class TestFabricGraphReadModel:
         summary = r.json()
         assert summary["schema_version"] is None
         assert summary["nodes_by_fabric"] == {}
+
+
+# ---------------------------------------------------------------------------
+# Health route — Service Definition of Mature (policies/service-dom.md), Core #1
+# ---------------------------------------------------------------------------
+
+async def test_health_route_reports_ok_when_db_reachable(client):
+    """The health endpoint is a cheap availability probe with no business logic.
+
+    It must return 200 and a small JSON body so agents and monitors can tell
+    "service available" from "request wrong" without side effects.
+    """
+    r = await client.get("/state/health")
+    assert r.status_code == 200, r.text
+    body = r.json()
+    assert body["status"] == "ok"
+    assert body["db"] == "connected"
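
Core #1 asks agents to probe the health route before assuming the service is offline. As an illustration of the consumer side of that contract, here is a minimal sketch of such a probe using only the Python standard library. It is not part of this patch. The names `probe_health`, `http_fetch`, and `BASE_URL`, the two-second timeout, and the three result labels (`available`, `degraded`, `unreachable`) are all assumptions made for the example; only the route `/state/health`, the local address, and the `200` plus `{"status": "ok"}` shape come from the policy text.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Tuple

# Local bind address documented in the policy; treated here as an assumed default.
BASE_URL = "http://127.0.0.1:8000"

# A fetch function takes (url, timeout) and returns (status_code, raw_body).
Fetch = Callable[[str, float], Tuple[int, bytes]]


def http_fetch(url: str, timeout: float) -> Tuple[int, bytes]:
    """GET the URL and return (status, body) for any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx, but the response is still readable.
        return exc.code, exc.read()


def probe_health(
    base_url: str = BASE_URL,
    timeout: float = 2.0,
    fetch: Fetch = http_fetch,
) -> str:
    """Classify the service as 'available', 'degraded', or 'unreachable'.

    The labels are hypothetical; the policy only specifies the HTTP contract.
    """
    try:
        status, raw = fetch(f"{base_url}/state/health", timeout)
    except OSError:
        # URLError, refused connections, and socket timeouts are OSError subclasses.
        return "unreachable"
    if status != 200:
        # The process answered, so it is up, but it reports a problem (e.g. 503).
        return "degraded"
    try:
        body = json.loads(raw)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return "degraded"
    if isinstance(body, dict) and body.get("status") == "ok":
        return "available"
    return "degraded"
```

Passing `fetch` as a parameter keeps the classification logic testable without a running server, and separating `unreachable` from `degraded` mirrors the distinction the Notes section draws between a service that is down and one that is up but not ready.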