| id | type | title | domain | repo | status | owner | topic_slug | created | updated | finished | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| STATE-WP-0068 | workplan | State Hub offline write buffer and edge relay | infotech | state-hub | finished | codex | custodian | 2026-06-23 | 2026-06-23 | 2026-06-23 | 189508bd-b3cb-4caf-ac95-30bf2823201d |
STATE-WP-0068 - State Hub offline write buffer and edge relay
Summary
Build a durable client-side write buffer for State Hub so agents can keep recording progress, decisions, messages, and safe status updates when the central State Hub deployment or its private tunnel is offline.
The improved design is deliberately split into two layers:
- Central HA makes the primary State Hub deployment fail less often (`CUST-WP-0011`, `CUST-WP-0038`).
- Edge buffering makes agent write attempts durable when the central deployment is still unreachable.
The central service cannot buffer requests it never receives. The buffer must live close to the callers: operator workstation, agent host, bridge host, or MCP wrapper. State Hub should therefore provide a small local relay/outbox that accepts sanctioned writes, persists them locally, and replays them to the central API when connectivity returns.
Critical Review of the Original Suggestion
The suggestion is directionally right but incomplete if phrased as "the central State Hub buffers while offline." If the central endpoint is unreachable, the client needs somewhere else to put the write.
The robust version is:
- Agents send writes to a local State Hub edge relay, not directly to the remote central endpoint.
- The relay forwards immediately while the central API is reachable.
- On outage, the relay stores a durable, non-secret write envelope in a local SQLite outbox and returns an explicit queued receipt.
- A replay worker flushes the outbox with idempotency keys when the central API recovers.
- The central API deduplicates retries and rejects or flags conflicting stale writes instead of silently overwriting newer state.
This keeps State Hub local-first and file-canon aligned. It does not make a multi-master database, and it does not turn queued writes into pretend success.
Goals
- Preserve session-close writes during central State Hub or tunnel outages.
- Make offline write state observable to operators and agents.
- Prevent duplicate progress/events when a replay retries after partial success.
- Detect stale/conflicting replace-style writes, especially task status and decision resolution changes.
- Keep secrets out of the buffer.
- Reuse the existing REST contract and MCP write-layer reliability work.
Non-Goals
- Replacing `CUST-WP-0038` high availability, backup, restore, or failover work.
- Accepting arbitrary offline edits as authoritative current state.
- Queuing destructive deletes, imports, repo syncs, or bulk maintenance jobs in v1.
- Publicly exposing State Hub.
- Adding Redis, Kafka, or NATS as a required edge dependency. The edge path should work during local bootstrap with only Python and SQLite.
Target Architecture
Codex / Claude / agent process
-> MCP server or REST client
-> local statehub-edge relay
-> central State Hub API when reachable
-> local SQLite outbox when unreachable
-> replay worker
-> central State Hub API with idempotency key
-> normal DB commit and lifecycle event publication
The relay is a local process with an explicit listen port, for example `127.0.0.1:18080`, configured with an upstream central API such as `http://127.0.0.1:18000` or the local development API.
Write Classification
Offline-safe append-only writes
These should be queueable in v1:
- `POST /progress/`
- `POST /messages/`
- `PATCH /messages/{id}/read` when message id is already known
- `POST /token-events/`
- `POST /decisions/` with an idempotency key and no immediate dependency on the generated decision id
Offline-safe replace-style writes with conflict checks
These may be queueable only with an expected revision or last-observed timestamp:
- `PATCH /tasks/{task_id}`
- `POST /tasks/bulk-status-sync` decomposed into per-task envelopes or replayed as an ordered batch
- `PATCH /decisions/{decision_id}` and `POST /decisions/{decision_id}/resolve`
- `PATCH /workplans/{workplan_id}` for lifecycle/status fields
Replay must mark these as conflicted when the central row changed after the client's observed revision and the update is not a monotonic no-op.
Online-only writes in v1
These should fail fast while offline:
- `DELETE` endpoints
- consistency sweep mutation endpoints
- fabric graph exports
- schema/bootstrap/admin operations
- any request containing authorization tokens, credentials, attachments, or large opaque payloads
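The three route classes above can be expressed as a small lookup that defaults to online-only. This is a minimal sketch: the `RouteClass` enum and `classify` helper are hypothetical names, and the path patterns are transcribed from the lists in this section rather than generated from `api/routers/*`.

```python
import re
from enum import Enum


class RouteClass(Enum):
    APPEND = "append"            # queueable with an idempotency key
    REPLACE = "replace"          # queueable only with an observed revision
    ONLINE_ONLY = "online_only"  # must fail fast while offline


# (method, path pattern, class); patterns mirror the lists in this section.
_RULES = [
    ("POST", r"/progress/?", RouteClass.APPEND),
    ("POST", r"/messages/?", RouteClass.APPEND),
    ("PATCH", r"/messages/[^/]+/read/?", RouteClass.APPEND),
    ("POST", r"/token-events/?", RouteClass.APPEND),
    ("POST", r"/decisions/?", RouteClass.APPEND),
    ("PATCH", r"/tasks/[^/]+/?", RouteClass.REPLACE),
    ("POST", r"/tasks/bulk-status-sync/?", RouteClass.REPLACE),
    ("PATCH", r"/decisions/[^/]+/?", RouteClass.REPLACE),
    ("POST", r"/decisions/[^/]+/resolve/?", RouteClass.REPLACE),
    ("PATCH", r"/workplans/[^/]+/?", RouteClass.REPLACE),
]
_COMPILED = [(method, re.compile(pattern), cls) for method, pattern, cls in _RULES]


def classify(method: str, path: str) -> RouteClass:
    """Return the route class; anything not explicitly listed is online-only."""
    method = method.upper()
    for rule_method, pattern, cls in _COMPILED:
        if rule_method == method and pattern.fullmatch(path):
            return cls
    return RouteClass.ONLINE_ONLY
```

Defaulting to `ONLINE_ONLY` keeps the allowlist closed: a new route is never queued until someone classifies it. Payload-based rejections (credentials, attachments, large bodies) would need a separate check on the request body.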
Conflict Policy
- Append-only writes use idempotency keys and replay exactly once from the caller's point of view.
- Replace-style writes include `expected_updated_at`, `expected_status`, or a route-specific revision field where available.
- Supersedable queued writes, such as multiple task status patches for the same task, may be coalesced for replay while preserving local audit entries.
- If central state is newer and the replay cannot prove the queued write is still safe, mark the envelope `conflict` and surface it in relay status.
- Workplan-file canon remains authoritative. After recovery, operators should run `make fix-consistency REPO=state-hub` so file-backed task/workplan state wins over stale queued task updates.
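The replace-style rule can be illustrated with a small decision function. This is a sketch under assumptions: `QueuedStatusPatch` and `decide_replay` are hypothetical names, the comparison uses only `expected_updated_at`, and a "monotonic no-op" is interpreted here as the central row already holding the requested status.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueuedStatusPatch:
    expected_updated_at: datetime  # central updated_at the client last observed
    new_status: str                # status the queued write wants to set


def decide_replay(patch: QueuedStatusPatch,
                  central_updated_at: datetime,
                  central_status: str) -> str:
    """Return 'apply', 'noop', or 'conflict' for one queued status patch."""
    if central_updated_at <= patch.expected_updated_at:
        # Central row is unchanged since the client looked: safe to apply.
        return "apply"
    if central_status == patch.new_status:
        # Central moved on but already holds the requested value.
        return "noop"
    # Central is newer and differs: never overwrite silently.
    return "conflict"
```

A `conflict` result leaves the envelope in the outbox for an operator to resolve; it is not retried automatically.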
T01 - Write Safety ADR and Route Inventory
id: STATE-WP-0068-T01
status: done
priority: high
state_hub_task_id: "07aa2d43-0305-45ca-8b5a-bf6f96f716a9"
Create a short ADR or design doc that classifies State Hub write routes as append-only, replace-style, supersedable, or online-only.
Deliverables:
- Route inventory generated from `api/routers/*` and MCP sanctioned writes.
- V1 safe-write allowlist with request/response examples.
- Conflict policy per route class.
- Explicit statement that queued receipts are pending evidence, not successful central commits.
- Operator decision on the local relay port, default outbox location, and retention window.
Done when implementation tasks can refer to a reviewed allowlist instead of guessing route safety.
T02 - Central Idempotency and Replay Acceptance
id: STATE-WP-0068-T02
status: done
priority: high
state_hub_task_id: "f0060859-e9a7-441c-91cc-1e838c5ba60f"
Add central API support for idempotent replay.
Expected implementation:
- Migration for a `write_idempotency_keys` table storing key, method, path, request hash, response status/body, source host/agent, first seen, last seen, and expiry.
- Middleware or route dependency that accepts `Idempotency-Key` on allowlisted write endpoints.
- Same-key/same-request replay returns the original response.
- Same-key/different-request returns HTTP 409.
- Replay metadata is available for diagnostics without logging request secrets.
- Tests cover success, retry, hash mismatch, expiry, and unsupported routes.
Done when append-only writes can be retried after a transport failure without duplicating central records.
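The same-key behavior described above can be sketched with an in-memory store. This is an illustration only, not the actual middleware: `IdempotencyStore`, `request_hash`, and the `commit` callback are hypothetical, the 409 body is made up, and expiry and persistence are left out.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class StoredResponse:
    request_hash: str
    status: int
    body: dict


def request_hash(method: str, path: str, body: dict) -> str:
    """Hash a canonical form of the request so key order does not matter."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{method.upper()} {path}\n{canonical}".encode()).hexdigest()


class IdempotencyStore:
    def __init__(self):
        self._seen = {}  # idempotency key -> StoredResponse

    def handle(self, key, method, path, body, commit):
        """Run commit(body) at most once per key; replay or reject repeats."""
        digest = request_hash(method, path, body)
        prior = self._seen.get(key)
        if prior is not None:
            if prior.request_hash == digest:
                # Same key, same request: return the original response.
                return prior.status, prior.body
            # Same key, different request: refuse rather than guess.
            return 409, {"detail": "Idempotency-Key reused with a different request"}
        status, response = commit(body)
        self._seen[key] = StoredResponse(digest, status, response)
        return status, response
```

In a real deployment the lookup and the commit would need to share a database transaction so that a crash between them cannot produce a committed write with no stored key.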
T03 - Durable Local Outbox Store
id: STATE-WP-0068-T03
status: done
priority: high
state_hub_task_id: "6897dd71-6252-4eed-bb0c-350e8c566b3b"
Implement a local SQLite-backed outbox module used by the relay and CLI.
Minimum schema:
- envelope id and idempotency key
- method, path, scrubbed JSON body, route class
- source agent, source host, repo slug, session id when known
- observed revision fields for conflict checks
- status: `queued`, `sending`, `acked`, `conflict`, `dead`, `cancelled`
- attempt count, next retry time, last error, central response summary
- created, updated, acked timestamps
Safety requirements:
- Create the DB with owner-only permissions where the platform supports it.
- Never persist authorization headers, API keys, bearer tokens, cookies, or secret-looking fields.
- Cap payload size and reject large opaque bodies.
- Provide export/import of non-secret envelopes for operator debugging.
Done when unit tests prove enqueue, status transitions, coalescing metadata, scrubbing, and corruption-safe startup behavior.
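A reduced version of the outbox schema and the enqueue path might look like the following. It is a sketch, not the shipped module: the table and column names, the `SECRET_HINTS` list, and the 64 KiB cap are assumptions, and several schema fields from the list above (source agent, observed revisions, retry timing) plus the file-permission handling are omitted for brevity.

```python
import json
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
    id TEXT PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    method TEXT NOT NULL,
    path TEXT NOT NULL,
    body TEXT NOT NULL,
    route_class TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued',
    attempts INTEGER NOT NULL DEFAULT 0,
    last_error TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""

# Hypothetical key fragments treated as secret-looking; tune per deployment.
SECRET_HINTS = ("authorization", "api_key", "apikey", "password", "secret", "cookie", "bearer")
MAX_BODY_BYTES = 64 * 1024  # assumed cap, not a documented limit


def scrub(value):
    """Recursively drop dict keys whose names look like secrets."""
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()
                if not any(hint in k.lower() for hint in SECRET_HINTS)}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


def enqueue(conn, method, path, body, route_class):
    """Persist one scrubbed envelope and return its outbox id."""
    payload = json.dumps(scrub(body), sort_keys=True)
    if len(payload.encode("utf-8")) > MAX_BODY_BYTES:
        raise ValueError("payload too large to queue")
    envelope_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO outbox (id, idempotency_key, method, path, body, route_class) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (envelope_id, str(uuid.uuid4()), method, path, payload, route_class),
        )
    return envelope_id
```

Scrubbing happens before the size check and before the insert, so a secret-looking field never reaches disk even when the write is later rejected.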
T04 - Edge Relay HTTP Surface
id: STATE-WP-0068-T04
status: done
priority: high
state_hub_task_id: "deb883df-b312-4e8f-b559-718bb8a94035"
Create a local statehub-edge relay process that exposes a small HTTP surface.
Behavior:
- Online path: forward allowlisted writes to upstream and return the upstream response.
- Offline path: enqueue allowlisted writes and return a clear queued receipt: `{"queued": true, "outbox_id": "...", "idempotency_key": "...", "upstream": "unreachable"}`.
- Online-only path during outage: return a deterministic error explaining that the route is not queueable.
- Read path: proxy selected reads while online; optionally serve cached `/state/summary` metadata with stale markers while offline.
- Health/status: expose relay health, upstream reachability, pending count, oldest pending age, and conflict count.
Done when agents can point API_BASE at the relay and receive either the
normal REST shape or an explicit queued/error shape.
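The forward-or-queue decision can be sketched without any HTTP framework. Everything here is illustrative: `handle_write`, the injected `send` callable, the in-memory `outbox` list, the tiny `QUEUEABLE` set, and the 202/503 status codes are assumptions. Only the shape of the queued receipt is taken from the behavior list above.

```python
import uuid
from typing import Callable, Optional, Tuple

# Tiny stand-in for the reviewed allowlist.
QUEUEABLE = {("POST", "/progress/"), ("POST", "/messages/")}


def handle_write(method: str, path: str, body: Optional[dict],
                 send: Callable[[str, str, Optional[dict]], Tuple[int, dict]],
                 outbox: list) -> Tuple[int, dict]:
    """Forward while upstream answers; otherwise queue or fail fast."""
    try:
        # Online path: return the upstream response unchanged.
        return send(method, path, body)
    except ConnectionError:
        if (method, path) not in QUEUEABLE:
            # Online-only route during an outage: deterministic error.
            return 503, {"queued": False,
                         "error": "route is not queueable while upstream is unreachable"}
        envelope = {
            "outbox_id": str(uuid.uuid4()),
            "idempotency_key": str(uuid.uuid4()),
            "method": method,
            "path": path,
            "body": body,
        }
        outbox.append(envelope)
        # Queued receipt: pending evidence, not a central commit.
        return 202, {"queued": True,
                     "outbox_id": envelope["outbox_id"],
                     "idempotency_key": envelope["idempotency_key"],
                     "upstream": "unreachable"}
```

Returning a distinct status code for queued writes is one way to keep callers from mistaking a receipt for a committed record; clients should still key off the `queued` field.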
T05 - Replay Worker and Conflict Handling
id: STATE-WP-0068-T05
status: done
priority: high
state_hub_task_id: "6c3916c1-4a9f-4b1d-a8b1-a356a6edf3db"
Implement the replay loop.
Requirements:
- Exponential backoff with jitter for transport failures.
- Single-flight sending per envelope.
- Preserve per-entity order for replace-style writes.
- Coalesce superseded task/workplan status writes before replay when safe.
- Use `Idempotency-Key` for every replayed write.
- Mark conflicts without dropping the original envelope.
- Provide commands to retry, cancel, or mark-dead individual envelopes.
Done when an integration test can simulate central outage, enqueue writes, restore central service, replay successfully, and surface one intentionally stale task update as a conflict.
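Two of the requirements above, backoff with jitter and coalescing of superseded writes, lend themselves to short helpers. This is a sketch with hypothetical names and defaults (`backoff_delay`, `coalesce_latest`, a 1 s base, a 300 s cap). It uses the "full jitter" variant, and coalescing is keyed on method and path only.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng=random.random) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def coalesce_latest(envelopes: list) -> list:
    """Keep only the last queued write per (method, path), preserving order.

    Superseded envelopes are only left out of the send list; the outbox rows
    themselves should stay in place as local audit entries.
    """
    latest_index = {}
    for index, envelope in enumerate(envelopes):
        latest_index[(envelope["method"], envelope["path"])] = index
    keep = set(latest_index.values())
    return [envelope for index, envelope in enumerate(envelopes) if index in keep]
```

Coalescing like this is only appropriate for replace-style writes to the same entity; append-only envelopes should each be replayed.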
T06 - MCP and Agent UX Integration
id: STATE-WP-0068-T06
status: done
priority: high
state_hub_task_id: "8ccac4f9-f457-4f87-9195-1d8619043c0f"
Update MCP tooling and agent-facing docs so offline buffering is usable without surprise.
Expected changes:
- MCP write helpers recognize relay queued receipts and return them clearly.
- Automatic progress-event side effects do not duplicate queued primary writes.
- Session-close guidance says to check relay status when writes were queued.
- `mcp_server/TOOLS.md` documents online, queued, and conflict outcomes.
- Repo `AGENTS.md` template can point agents at the relay when enabled.
Done when an agent can complete a session during a central outage, see that the progress write is queued, and verify later that it was replayed.
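On the client side, a write helper needs to tell the three outcomes apart and suppress the automatic progress side effect for anything that was not committed centrally. The sketch below shows one way to do that. The function name and the returned field names are hypothetical; only the `queued` and `outbox_id` receipt fields come from this workplan.

```python
def interpret_write_result(status_code: int, payload: dict) -> dict:
    """Map a relay or central response onto online, queued, conflict, or error."""
    if payload.get("queued") is True:
        # Pending evidence only: do not emit a follow-up progress event.
        return {"outcome": "queued",
                "outbox_id": payload.get("outbox_id"),
                "emit_progress_side_effect": False}
    if status_code == 409:
        return {"outcome": "conflict",
                "outbox_id": None,
                "emit_progress_side_effect": False}
    if 200 <= status_code < 300:
        return {"outcome": "online",
                "outbox_id": None,
                "emit_progress_side_effect": True}
    return {"outcome": "error",
            "outbox_id": None,
            "emit_progress_side_effect": False}
```

Checking the `queued` field before the status code means the helper stays correct whichever status the relay chooses for its receipts.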
T07 - Operator Observability
id: STATE-WP-0068-T07
status: done
priority: medium
state_hub_task_id: "62c0ca4f-b3e2-49f7-ba70-365016195e83"
Expose pending offline writes to humans and automations.
Deliverables:
- CLI commands: `statehub outbox status`, `statehub outbox list`, `statehub outbox replay`, `statehub outbox export`.
- Optional dashboard panel or docs page showing edge relay health, if the dashboard can reach the relay.
- Prometheus-style or JSON metrics for pending count, oldest age, replay failures, and conflicts.
- Progress event after replay recovery summarizing non-secret results.
Done when the operator can see whether any host still has unsent State Hub writes before declaring an outage recovered.
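The JSON metrics listed above reduce to a small aggregation over outbox rows. This is a sketch: `outbox_metrics` and the metric key names are hypothetical, and rows are plain dicts with a `status` and a timezone-aware `created_at` rather than real SQLite rows.

```python
from datetime import datetime, timedelta, timezone


def outbox_metrics(rows: list, now: datetime) -> dict:
    """Summarize pending and conflicted envelopes for status output."""
    pending = [row for row in rows if row["status"] in ("queued", "sending")]
    oldest = min((row["created_at"] for row in pending), default=None)
    return {
        "pending_count": len(pending),
        "oldest_pending_age_seconds":
            (now - oldest).total_seconds() if oldest is not None else 0.0,
        "conflict_count": sum(1 for row in rows if row["status"] == "conflict"),
    }


# Example with made-up rows: two pending envelopes and one conflict.
now = datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc)
sample_rows = [
    {"status": "queued", "created_at": now - timedelta(minutes=5)},
    {"status": "sending", "created_at": now - timedelta(minutes=1)},
    {"status": "conflict", "created_at": now - timedelta(minutes=9)},
    {"status": "acked", "created_at": now - timedelta(minutes=30)},
]
print(outbox_metrics(sample_rows, now))
```

An operator declaring an outage recovered would want `pending_count` and `conflict_count` at zero on every host running a relay.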
T08 - Chaos and Regression Test Suite
id: STATE-WP-0068-T08
status: done
priority: high
state_hub_task_id: "2a12614f-8923-45b1-b8e9-ad8c818b23d3"
Add tests that make offline behavior boring.
Coverage:
- Unit tests for route allowlist, payload scrubbing, idempotency hash behavior, outbox state transitions, and coalescing decisions.
- Integration test with a fake upstream returning connection errors, 5xx, 409, and success.
- End-to-end test for MCP write through relay during outage and replay.
- Drill script that can be run locally without touching production data.
Done when CI can prove no duplicate append-only records are produced across retry and no replace-style conflict is silently applied.
T09 - Runbooks, Cutover, and Recovery Drill
id: STATE-WP-0068-T09
status: done
priority: medium
state_hub_task_id: "fedea85e-c720-4814-9691-affa6c944954"
Document and rehearse the operator workflow.
Runbook content:
- How to start the relay on an operator workstation or agent host.
- How to configure MCP/REST clients to use the relay.
- What queued receipts mean during session close.
- How to inspect, replay, export, cancel, and resolve conflicted envelopes.
- Recovery checklist after central State Hub returns.
- Interaction with `make fix-consistency REPO=state-hub`.
Done when a controlled drill queues at least one progress event and one task status update during a forced outage, replays the progress event, flags or applies the task update according to the conflict policy, and records the results without exposing secrets.
Dependencies and References
- `CUST-WP-0011` - pragmatic railiance01 State Hub migration.
- `CUST-WP-0038` - long-term ThreePhoenix HA State Hub target.
- `STATE-WP-0059` - MCP write-layer reliability and explicit API failure handling.
- `STATE-WP-0066` - summary cache and stale-while-revalidate for read paths.
- `docs/activity-core-delegation.md` - JetStream buffering covers State Hub to activity-core events after commit; this work covers agent to State Hub writes before commit.
- `mcp_server/TOOLS.md` - current MCP/REST parity and failure handling contract.
After this workplan is synced, run:
make fix-consistency REPO=state-hub
Implementation Notes
Completed 2026-06-23. The implementation provides the first full offline-write buffering path:
- Central idempotency support through WriteIdempotencyMiddleware, the write_idempotency_keys model, and migration e9f0a1b2c3d4. Exact duplicate writes replay the original response; same key with a different request returns HTTP 409.
- Shared route classification for queueable append and replace-style writes.
- Local SQLite outbox with payload scrubbing, payload size limits, private file permissions where supported, status transitions, retry/cancel/export support, and latest replace-write coalescing.
- State Hub edge relay app with online forwarding, offline queue receipts, health/status, replay endpoint, and replay worker.
- statehub outbox CLI commands for status, list, export, replay, retry, and cancel.
- MCP queued receipt handling so queued primary writes do not trigger automatic progress side effects.
- Operator documentation in docs/offline-write-buffer.md, MCP tool docs, and the Codex agent instruction template.
Verification:
- Focused suite: 22 passed in 19.51s.
- Full suite: 446 passed, 1 warning in 287.20s. The warning was a SQLAlchemy RuntimeWarning in tests/test_summary_cache.py and did not correspond to any failing assertion.
- Syntax checks passed for the new and touched Python modules.
- git diff --check passed.