---
id: STATE-WP-0068
type: workplan
title: "State Hub offline write buffer and edge relay"
domain: infotech
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-23"
updated: "2026-06-23"
finished: "2026-06-23"
state_hub_workstream_id: "189508bd-b3cb-4caf-ac95-30bf2823201d"
---
# STATE-WP-0068 - State Hub offline write buffer and edge relay
## Summary
Build a durable client-side write buffer for State Hub so agents can keep
recording progress, decisions, messages, and safe status updates when the
central State Hub deployment or its private tunnel is offline.
The improved design is deliberately split into two layers:
- **Central HA** makes the primary State Hub deployment fail less often
(`CUST-WP-0011`, `CUST-WP-0038`).
- **Edge buffering** makes agent write attempts durable when the central
deployment is still unreachable.
The central service cannot buffer requests it never receives. The buffer must
live close to the callers: operator workstation, agent host, bridge host, or
MCP wrapper. State Hub should therefore provide a small local relay/outbox that
accepts sanctioned writes, persists them locally, and replays them to the
central API when connectivity returns.
## Critical Review of the Original Suggestion
The suggestion is directionally right but incomplete if phrased as "the central
State Hub buffers while offline." If the central endpoint is unreachable, the
client needs somewhere else to put the write.
The robust version is:
1. Agents send writes to a local State Hub edge relay, not directly to the
remote central endpoint.
2. The relay forwards immediately while the central API is reachable.
3. On outage, the relay stores a durable, non-secret write envelope in a local
SQLite outbox and returns an explicit queued receipt.
4. A replay worker flushes the outbox with idempotency keys when the central
API recovers.
5. The central API deduplicates retries and rejects or flags conflicting stale
writes instead of silently overwriting newer state.
This keeps State Hub local-first and file-canon aligned. It does not create a
multi-master database, and it never reports a queued write as a successful
central commit.
## Goals
- Preserve session-close writes during central State Hub or tunnel outages.
- Make offline write state observable to operators and agents.
- Prevent duplicate progress/events when a replay retries after partial
success.
- Detect stale/conflicting replace-style writes, especially task status and
decision resolution changes.
- Keep secrets out of the buffer.
- Reuse the existing REST contract and MCP write-layer reliability work.
## Non-Goals
- Replacing `CUST-WP-0038` high availability, backup, restore, or failover
work.
- Accepting arbitrary offline edits as authoritative current state.
- Queuing destructive deletes, imports, repo syncs, or bulk maintenance jobs in
v1.
- Publicly exposing State Hub.
- Adding Redis, Kafka, or NATS as a required edge dependency. The edge path
should work during local bootstrap with only Python and SQLite.
## Target Architecture
```
Codex / Claude / agent process
  -> MCP server or REST client
    -> local statehub-edge relay
      -> central State Hub API when reachable
      -> local SQLite outbox when unreachable
        -> replay worker
          -> central State Hub API with idempotency key
            -> normal DB commit and lifecycle event publication
```
The relay is a local process with an explicit listen port, for example
`127.0.0.1:18080`, configured with an upstream central API such as
`http://127.0.0.1:18000` or the local development API.
## Write Classification
### Offline-safe append-only writes
These should be queueable in v1:
- `POST /progress/`
- `POST /messages/`
- `PATCH /messages/{id}/read` when message id is already known
- `POST /token-events/`
- `POST /decisions/` with an idempotency key and no immediate dependency on the
generated decision id
### Offline-safe replace-style writes with conflict checks
These may be queueable only with an expected revision or last-observed
timestamp:
- `PATCH /tasks/{task_id}`
- `POST /tasks/bulk-status-sync`, either decomposed into per-task envelopes or
replayed as an ordered batch
- `PATCH /decisions/{decision_id}` and `POST /decisions/{decision_id}/resolve`
- `PATCH /workplans/{workplan_id}` for lifecycle/status fields
Replay must mark these as conflicted when the central row changed after the
client's observed revision and the update is not a monotonic no-op.
### Online-only writes in v1
These should fail fast while offline:
- `DELETE` endpoints
- repository sync/import/ingest endpoints
- consistency sweep mutation endpoints
- fabric graph exports
- schema/bootstrap/admin operations
- any request containing authorization tokens, credentials, attachments, or
large opaque payloads
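
The three route classes above can be expressed as a small lookup. The following is a minimal sketch, not the project's shared route classification: the `classify` function, the pattern tables, and the class names `append`, `replace`, and `online_only` are hypothetical, and the patterns are copied from the lists in this section only.

```python
import re

# Hypothetical allowlist copied from the route lists in this section.
# The reviewed inventory (T01) is the real source of truth.
APPEND_ONLY = [
    ("POST", r"/progress/"),
    ("POST", r"/messages/"),
    ("PATCH", r"/messages/[^/]+/read"),
    ("POST", r"/token-events/"),
    ("POST", r"/decisions/"),  # only with an idempotency key
]
REPLACE_STYLE = [
    ("PATCH", r"/tasks/[^/]+"),
    ("POST", r"/tasks/bulk-status-sync"),
    ("PATCH", r"/decisions/[^/]+"),
    ("POST", r"/decisions/[^/]+/resolve"),
    ("PATCH", r"/workplans/[^/]+"),
]


def classify(method: str, path: str) -> str:
    """Return 'append', 'replace', or 'online_only' for a write request."""
    method = method.upper()
    for allowed_method, pattern in APPEND_ONLY:
        if allowed_method == method and re.fullmatch(pattern, path):
            return "append"
    for allowed_method, pattern in REPLACE_STYLE:
        if allowed_method == method and re.fullmatch(pattern, path):
            return "replace"
    # Anything not explicitly allowlisted fails fast while offline.
    return "online_only"
```

Defaulting to `online_only` keeps the sketch fail-closed: a new route is not queueable until someone adds it to a list.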
## Conflict Policy
- Append-only writes use idempotency keys, so each write is applied exactly once
from the caller's point of view even when the relay sends it more than once.
- Replace-style writes include `expected_updated_at`, `expected_status`, or a
route-specific revision field where available.
- Supersedable queued writes, such as multiple task status patches for the same
task, may be coalesced for replay while preserving local audit entries.
- If central state is newer and the replay cannot prove the queued write is
still safe, mark the envelope `conflict` and surface it in relay status.
- Workplan-file canon remains authoritative. After recovery, operators should
run `make fix-consistency REPO=state-hub` so file-backed task/workplan state
wins over stale queued task updates.
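
As an illustration of the replace-style rule, the check below decides whether a queued status write may still be applied. It is a sketch under assumptions: the envelope and row shapes, the `check_replace` name, and the three return values are hypothetical, and only `expected_updated_at` comes from the policy above. Timestamps are assumed to be ISO 8601 strings in one consistent timezone convention.

```python
from datetime import datetime


def check_replace(envelope: dict, central_row: dict) -> str:
    """Return 'apply', 'noop', or 'conflict' for a queued replace-style write."""
    desired = envelope["body"].get("status")
    # No-op: the central row already holds the value the client wanted.
    if desired is not None and central_row.get("status") == desired:
        return "noop"
    expected = envelope.get("expected_updated_at")
    if expected is None:
        # Without an observed revision the replay cannot prove safety.
        return "conflict"
    central_ts = datetime.fromisoformat(central_row["updated_at"])
    if central_ts > datetime.fromisoformat(expected):
        # Central state changed after the client last observed it.
        return "conflict"
    return "apply"
```

A `conflict` result leaves the envelope in the outbox for an operator to inspect; it is never silently applied or dropped.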
## T01 - Write Safety ADR and Route Inventory
```task
id: STATE-WP-0068-T01
status: done
priority: high
state_hub_task_id: "07aa2d43-0305-45ca-8b5a-bf6f96f716a9"
```
Create a short ADR or design doc that classifies State Hub write routes as
append-only, replace-style, supersedable, or online-only.
Deliverables:
- Route inventory generated from `api/routers/*` and MCP sanctioned writes.
- V1 safe-write allowlist with request/response examples.
- Conflict policy per route class.
- Explicit statement that queued receipts are pending evidence, not successful
central commits.
- Operator decision on the local relay port, default outbox location, and
retention window.
Done when implementation tasks can refer to a reviewed allowlist instead of
guessing route safety.
## T02 - Central Idempotency and Replay Acceptance
```task
id: STATE-WP-0068-T02
status: done
priority: high
state_hub_task_id: "f0060859-e9a7-441c-91cc-1e838c5ba60f"
```
Add central API support for idempotent replay.
Expected implementation:
- Migration for a `write_idempotency_keys` table storing key, method, path,
request hash, response status/body, source host/agent, first seen, last seen,
and expiry.
- Middleware or route dependency that accepts `Idempotency-Key` on allowlisted
write endpoints.
- Same-key/same-request replay returns the original response.
- Same-key/different-request returns HTTP 409.
- Replay metadata is available for diagnostics without logging request secrets.
- Tests cover success, retry, hash mismatch, expiry, and unsupported routes.
Done when append-only writes can be retried after a transport failure without
duplicating central records.
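
The same-key semantics can be modelled in a few lines. This is an in-memory sketch, not the middleware or table described above: `IdempotencyStore`, `request_hash`, and the `commit` callback are hypothetical names, the hash recipe is an assumption, and expiry is omitted.

```python
import hashlib
import json


class IdempotencyStore:
    """In-memory model of the replay rules; a real store would be a DB table."""

    def __init__(self):
        self._seen = {}  # key -> (request_hash, status, response_body)

    @staticmethod
    def request_hash(method, path, body):
        # Canonical JSON so key order in the body does not change the hash.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        raw = f"{method.upper()} {path}\n{canonical}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def handle(self, key, method, path, body, commit):
        digest = self.request_hash(method, path, body)
        if key in self._seen:
            stored_hash, status, response = self._seen[key]
            if stored_hash != digest:
                # Same key, different request: refuse instead of guessing.
                return 409, {"detail": "idempotency key reused with a different request"}
            # Same key, same request: return the original response.
            return status, response
        status, response = commit(body)  # first time: perform the write
        self._seen[key] = (digest, status, response)
        return status, response
```

Calling `handle` twice with the same key and body invokes `commit` once and returns the stored response the second time.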
## T03 - Durable Local Outbox Store
```task
id: STATE-WP-0068-T03
status: done
priority: high
state_hub_task_id: "6897dd71-6252-4eed-bb0c-350e8c566b3b"
```
Implement a local SQLite-backed outbox module used by the relay and CLI.
Minimum schema:
- envelope id and idempotency key
- method, path, scrubbed JSON body, route class
- source agent, source host, repo slug, session id when known
- observed revision fields for conflict checks
- status: `queued`, `sending`, `acked`, `conflict`, `dead`, `cancelled`
- attempt count, next retry time, last error, central response summary
- created, updated, acked timestamps
Safety requirements:
- Create the DB with owner-only permissions where the platform supports it.
- Never persist authorization headers, API keys, bearer tokens, cookies, or
secret-looking fields.
- Cap payload size and reject large opaque bodies.
- Provide export/import of non-secret envelopes for operator debugging.
Done when unit tests prove enqueue, status transitions, coalescing metadata,
scrubbing, and corruption-safe startup behavior.
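
A reduced version of the outbox could look like the sketch below. It covers only part of the minimum schema, and the table layout, the `SECRET_HINTS` list, the 64 KiB cap, and the function names are all assumptions made for illustration rather than the implemented module.

```python
import json
import os
import sqlite3
import time
import uuid

# Illustrative key fragments only; a real scrubber needs a reviewed list.
SECRET_HINTS = ("authorization", "api_key", "apikey", "bearer", "cookie",
                "password", "secret")
MAX_BODY_BYTES = 64 * 1024  # assumed cap, not taken from the workplan

SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
    id TEXT PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    method TEXT NOT NULL,
    path TEXT NOT NULL,
    body TEXT NOT NULL,
    route_class TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued'
        CHECK (status IN ('queued', 'sending', 'acked',
                          'conflict', 'dead', 'cancelled')),
    attempts INTEGER NOT NULL DEFAULT 0,
    last_error TEXT,
    created_at REAL NOT NULL
)
"""


def scrub(value):
    """Recursively drop dict keys whose names look secret."""
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()
                if not any(hint in k.lower() for hint in SECRET_HINTS)}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    return value


def open_outbox(path):
    if path != ":memory:" and not os.path.exists(path):
        # Pre-create the file owner-only; platforms without POSIX modes ignore it.
        os.close(os.open(path, os.O_CREAT | os.O_RDWR, 0o600))
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, method, path, body, route_class):
    payload = json.dumps(scrub(body), sort_keys=True)
    if len(payload.encode("utf-8")) > MAX_BODY_BYTES:
        raise ValueError("payload too large to queue")
    envelope_id, key = str(uuid.uuid4()), str(uuid.uuid4())
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO outbox (id, idempotency_key, method, path, body,"
            " route_class, created_at) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (envelope_id, key, method.upper(), path, payload, route_class,
             time.time()),
        )
    return envelope_id, key
```

Scrubbing happens before the size check and before the insert, so a secret-looking field never reaches disk even when the write is later rejected.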
## T04 - Edge Relay HTTP Surface
```task
id: STATE-WP-0068-T04
status: done
priority: high
state_hub_task_id: "deb883df-b312-4e8f-b559-718bb8a94035"
```
Create a local `statehub-edge` relay process that exposes a small HTTP surface.
Behavior:
- Online path: forward allowlisted writes to upstream and return the upstream
response.
- Offline path: enqueue allowlisted writes and return a clear queued receipt:
`{"queued": true, "outbox_id": "...", "idempotency_key": "...",
"upstream": "unreachable"}`.
- Online-only path during outage: return a deterministic error explaining that
the route is not queueable.
- Read path: proxy selected reads while online; optionally serve cached
`/state/summary` metadata with stale markers while offline.
- Health/status: expose relay health, upstream reachability, pending count,
oldest pending age, and conflict count.
Done when agents can point `API_BASE` at the relay and receive either the
normal REST shape or an explicit queued/error shape.
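
The online, offline, and online-only paths reduce to one decision function. The sketch below assumes injected `forward` and `enqueue` callables, a two-route `QUEUEABLE` subset, and the status codes 202 and 503; all of these are hypothetical choices, while the receipt fields follow the shape quoted above.

```python
import uuid

# Hypothetical subset of the allowlist, enough to show both outage branches.
QUEUEABLE = {("POST", "/progress/"), ("POST", "/messages/")}


def relay_write(method, path, body, forward, enqueue):
    """Forward a write upstream, or queue it when upstream is unreachable.

    forward(method, path, body, key) returns (status, json) or raises
    ConnectionError; enqueue(method, path, body, key) returns an outbox id.
    """
    key = str(uuid.uuid4())
    try:
        # Online path: pass the upstream response through unchanged.
        return forward(method, path, body, key)
    except ConnectionError:
        if (method.upper(), path) not in QUEUEABLE:
            # Online-only route during an outage: deterministic error.
            return 503, {
                "queued": False,
                "error": "route is not queueable while upstream is unreachable",
                "upstream": "unreachable",
            }
        outbox_id = enqueue(method, path, body, key)
        # Offline path: explicit receipt, never a fake success body.
        return 202, {
            "queued": True,
            "outbox_id": outbox_id,
            "idempotency_key": key,
            "upstream": "unreachable",
        }
```

The key is generated before the first forward attempt so that a later replay of the same envelope reuses it.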
## T05 - Replay Worker and Conflict Handling
```task
id: STATE-WP-0068-T05
status: done
priority: high
state_hub_task_id: "6c3916c1-4a9f-4b1d-a8b1-a356a6edf3db"
```
Implement the replay loop.
Requirements:
- Exponential backoff with jitter for transport failures.
- Single-flight sending per envelope.
- Preserve per-entity order for replace-style writes.
- Coalesce superseded task/workplan status writes before replay when safe.
- Use `Idempotency-Key` for every replayed write.
- Mark conflicts without dropping the original envelope.
- Provide commands to retry, cancel, or mark-dead individual envelopes.
Done when an integration test can simulate central outage, enqueue writes,
restore central service, replay successfully, and surface one intentionally
stale task update as a conflict.
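
Two pieces of the loop are easy to isolate: the retry delay and the mapping from an upstream response to the next envelope status. The sketch below uses the "full jitter" variant of exponential backoff as one possible choice; the base, the cap, and the treatment of 409 and other 4xx responses are assumptions, not the implemented policy.

```python
import random


def next_delay(attempt, base=1.0, cap=300.0, rng=random.random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def outcome_status(http_status):
    """Map an upstream HTTP status to the next outbox status (assumed mapping)."""
    if 200 <= http_status < 300:
        return "acked"
    if http_status == 409:
        return "conflict"  # keep the envelope for operator review
    if http_status >= 500:
        return "queued"    # transient upstream failure, retry later
    return "dead"          # other 4xx: retrying the same request will not help
```

Injecting `rng` keeps the delay deterministic in tests, for example `next_delay(3, rng=lambda: 0.5)` with the default base.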
## T06 - MCP and Agent UX Integration
```task
id: STATE-WP-0068-T06
status: done
priority: high
state_hub_task_id: "8ccac4f9-f457-4f87-9195-1d8619043c0f"
```
Update MCP tooling and agent-facing docs so offline buffering is usable without
surprise.
Expected changes:
- MCP write helpers recognize relay queued receipts and return them clearly.
- Automatic progress-event side effects do not duplicate queued primary writes.
- Session-close guidance says to check relay status when writes were queued.
- `mcp_server/TOOLS.md` documents online, queued, and conflict outcomes.
- Repo `AGENTS.md` template can point agents at the relay when enabled.
Done when an agent can complete a session during a central outage, see that the
progress write is queued, and verify later that it was replayed.
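
On the client side, a write helper mainly needs to tell a queued receipt apart from a committed record. The function below is a hypothetical sketch of that branch: `interpret_write_result` and its output keys are illustrative and are not the MCP server's actual helper.

```python
def interpret_write_result(response: dict) -> dict:
    """Distinguish a relay queued receipt from a normal committed response."""
    if response.get("queued") is True:
        return {
            "outcome": "queued",
            "committed": False,
            # The primary write is only pending, so no automatic
            # progress event should be emitted on its behalf.
            "emit_progress_side_effect": False,
            "outbox_id": response.get("outbox_id"),
            "hint": "check relay status before closing the session",
        }
    return {
        "outcome": "committed",
        "committed": True,
        "emit_progress_side_effect": True,
        "record": response,
    }
```

Returning `committed: False` explicitly lets the agent report the write as pending evidence instead of as a finished central commit.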
## T07 - Operator Observability
```task
id: STATE-WP-0068-T07
status: done
priority: medium
state_hub_task_id: "62c0ca4f-b3e2-49f7-ba70-365016195e83"
```
Expose pending offline writes to humans and automations.
Deliverables:
- CLI commands: `statehub outbox status`, `statehub outbox list`,
`statehub outbox replay`, `statehub outbox export`.
- Optional dashboard panel or docs page showing edge relay health, if the
dashboard can reach the relay.
- Prometheus-style or JSON metrics for pending count, oldest age, replay
failures, and conflicts.
- Progress event after replay recovery summarizing non-secret results.
Done when the operator can see whether any host still has unsent State Hub
writes before declaring an outage recovered.
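
The headline numbers can come from two queries against the outbox. This sketch assumes a hypothetical `outbox` table with a `status` column and a `created_at` column holding epoch seconds; the function name and result keys are illustrative.

```python
import sqlite3
import time


def outbox_summary(conn, now=None):
    """Return pending count, oldest pending age in seconds, and conflict count."""
    now = time.time() if now is None else now
    pending, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM outbox"
        " WHERE status IN ('queued', 'sending')"
    ).fetchone()
    (conflicts,) = conn.execute(
        "SELECT COUNT(*) FROM outbox WHERE status = 'conflict'"
    ).fetchone()
    return {
        "pending": pending,
        # None when nothing is pending, so dashboards can show "clear".
        "oldest_pending_age_seconds":
            None if oldest is None else max(0.0, now - oldest),
        "conflicts": conflicts,
    }
```

An outage would count as recovered on a host only when `pending` is zero and every conflict has been reviewed.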
## T08 - Chaos and Regression Test Suite
```task
id: STATE-WP-0068-T08
status: done
priority: high
state_hub_task_id: "2a12614f-8923-45b1-b8e9-ad8c818b23d3"
```
Add tests that make offline behavior boring.
Coverage:
- Unit tests for route allowlist, payload scrubbing, idempotency hash behavior,
outbox state transitions, and coalescing decisions.
- Integration test with a fake upstream returning connection errors, 5xx, 409,
and success.
- End-to-end test for MCP write through relay during outage and replay.
- Drill script that can be run locally without touching production data.
Done when CI can prove no duplicate append-only records are produced across
retry and no replace-style conflict is silently applied.
## T09 - Runbooks, Cutover, and Recovery Drill
```task
id: STATE-WP-0068-T09
status: done
priority: medium
state_hub_task_id: "fedea85e-c720-4814-9691-affa6c944954"
```
Document and rehearse the operator workflow.
Runbook content:
- How to start the relay on an operator workstation or agent host.
- How to configure MCP/REST clients to use the relay.
- What queued receipts mean during session close.
- How to inspect, replay, export, cancel, and resolve conflicted envelopes.
- Recovery checklist after central State Hub returns.
- Interaction with `make fix-consistency REPO=state-hub`.
Done when a controlled drill queues at least one progress event and one task
status update during a forced outage, replays the progress event, flags or
applies the task update according to the conflict policy, and records the
results without exposing secrets.
## Dependencies and References
- `CUST-WP-0011` - pragmatic railiance01 State Hub migration.
- `CUST-WP-0038` - long-term ThreePhoenix HA State Hub target.
- `STATE-WP-0059` - MCP write-layer reliability and explicit API failure
handling.
- `STATE-WP-0066` - summary cache and stale-while-revalidate for read paths.
- `docs/activity-core-delegation.md` - JetStream buffering covers State Hub to
activity-core events after commit; this work covers agent to State Hub writes
before commit.
- `mcp_server/TOOLS.md` - current MCP/REST parity and failure handling contract.
After this workplan is synced, run:
```bash
make fix-consistency REPO=state-hub
```
## Implementation Notes
Completed 2026-06-23. The implementation provides the first full offline-write
buffering path:
- Central idempotency support through `WriteIdempotencyMiddleware`, the
`write_idempotency_keys` model, and migration `e9f0a1b2c3d4`. Exact duplicate
writes replay the original response; same key with a different request returns
HTTP 409.
- Shared route classification for queueable append and replace-style writes.
- Local SQLite outbox with payload scrubbing, payload size limits, private file
permissions where supported, status transitions, retry/cancel/export support,
and latest replace-write coalescing.
- State Hub edge relay app with online forwarding, offline queue receipts,
health/status, replay endpoint, and replay worker.
- `statehub outbox` CLI commands for status, list, export, replay, retry, and
cancel.
- MCP queued receipt handling so queued primary writes do not trigger automatic
progress side effects.
- Operator documentation in `docs/offline-write-buffer.md`, MCP tool docs, and the
Codex agent instruction template.
Verification:
- Focused suite: 22 passed in 19.51s.
- Full suite: 446 passed, 1 warning in 287.20s. The warning was a SQLAlchemy
`RuntimeWarning` in `tests/test_summary_cache.py`; it was not accompanied by any
failing assertion.
- Syntax checks passed for the new and touched Python modules.
- `git diff --check` passed.