feat(statehub): add offline write buffer relay
workplans/STATE-WP-0068-offline-write-buffer-and-edge-relay.md
@@ -0,0 +1,428 @@
---
id: STATE-WP-0068
type: workplan
title: "State Hub offline write buffer and edge relay"
domain: infotech
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-23"
updated: "2026-06-23"
finished: "2026-06-23"
state_hub_workstream_id: "189508bd-b3cb-4caf-ac95-30bf2823201d"
---

# STATE-WP-0068 - State Hub offline write buffer and edge relay

## Summary

Build a durable client-side write buffer for State Hub so agents can keep
recording progress, decisions, messages, and safe status updates when the
central State Hub deployment or its private tunnel is offline.

The improved design is deliberately split into two layers:

- **Central HA** makes the primary State Hub deployment fail less often
  (`CUST-WP-0011`, `CUST-WP-0038`).
- **Edge buffering** makes agent write attempts durable when the central
  deployment is still unreachable.

The central service cannot buffer requests it never receives. The buffer must
live close to the callers: operator workstation, agent host, bridge host, or
MCP wrapper. State Hub should therefore provide a small local relay/outbox that
accepts sanctioned writes, persists them locally, and replays them to the
central API when connectivity returns.

## Critical Review of the Original Suggestion

The suggestion is directionally right but incomplete if phrased as "the central
State Hub buffers while offline." If the central endpoint is unreachable, the
client needs somewhere else to put the write.

The robust version is:

1. Agents send writes to a local State Hub edge relay, not directly to the
   remote central endpoint.
2. The relay forwards immediately while the central API is reachable.
3. On outage, the relay stores a durable, non-secret write envelope in a local
   SQLite outbox and returns an explicit queued receipt.
4. A replay worker flushes the outbox with idempotency keys when the central
   API recovers.
5. The central API deduplicates retries and rejects or flags conflicting stale
   writes instead of silently overwriting newer state.

This keeps State Hub local-first and file-canon aligned. It does not create a
multi-master database, and it does not present queued writes as successful
central commits.

## Goals

- Preserve session-close writes during central State Hub or tunnel outages.
- Make offline write state observable to operators and agents.
- Prevent duplicate progress/events when a replay retries after partial
  success.
- Detect stale/conflicting replace-style writes, especially task status and
  decision resolution changes.
- Keep secrets out of the buffer.
- Reuse the existing REST contract and MCP write-layer reliability work.

## Non-Goals

- Replacing `CUST-WP-0038` high availability, backup, restore, or failover
  work.
- Accepting arbitrary offline edits as authoritative current state.
- Queuing destructive deletes, imports, repo syncs, or bulk maintenance jobs in
  v1.
- Publicly exposing State Hub.
- Adding Redis, Kafka, or NATS as a required edge dependency. The edge path
  should work during local bootstrap with only Python and SQLite.

## Target Architecture

```
Codex / Claude / agent process
  -> MCP server or REST client
  -> local statehub-edge relay
       -> central State Hub API when reachable
       -> local SQLite outbox when unreachable
            -> replay worker
                 -> central State Hub API with idempotency key
                      -> normal DB commit and lifecycle event publication
```

The relay is a local process with an explicit listen port, for example
`127.0.0.1:18080`, configured with an upstream central API such as
`http://127.0.0.1:18000` or the local development API.

## Write Classification

### Offline-safe append-only writes

These should be queueable in v1:

- `POST /progress/`
- `POST /messages/`
- `PATCH /messages/{id}/read` when message id is already known
- `POST /token-events/`
- `POST /decisions/` with an idempotency key and no immediate dependency on the
  generated decision id

### Offline-safe replace-style writes with conflict checks

These may be queueable only with an expected revision or last-observed
timestamp:

- `PATCH /tasks/{task_id}`
- `POST /tasks/bulk-status-sync` decomposed into per-task envelopes or replayed
  as an ordered batch
- `PATCH /decisions/{decision_id}` and `POST /decisions/{decision_id}/resolve`
- `PATCH /workplans/{workplan_id}` for lifecycle/status fields

Replay must mark these as conflicted when the central row changed after the
client's observed revision and the update is not a monotonic no-op.

### Online-only writes in v1

These should fail fast while offline:

- `DELETE` endpoints
- repository sync/import/ingest endpoints
- consistency sweep mutation endpoints
- fabric graph exports
- schema/bootstrap/admin operations
- any request containing authorization tokens, credentials, attachments, or
  large opaque payloads
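
The three route classes above could be encoded as a small allowlist table. The following is a minimal sketch, not the shipped classifier: the `RouteClass` enum, the `_RULES` table layout, and the `classify_route` name are all hypothetical, and only the method/path pairs come from the lists above. Anything that is not explicitly allowlisted falls through to online-only, which matches the fail-fast intent.

```python
import re
from enum import Enum


class RouteClass(Enum):
    APPEND_ONLY = "append_only"
    REPLACE_STYLE = "replace_style"
    ONLINE_ONLY = "online_only"


# (method, path regex, class). The routes mirror the lists in this section;
# the table structure itself is an illustrative assumption.
_RULES = [
    ("POST", r"/progress/", RouteClass.APPEND_ONLY),
    ("POST", r"/messages/", RouteClass.APPEND_ONLY),
    ("PATCH", r"/messages/[^/]+/read", RouteClass.APPEND_ONLY),
    ("POST", r"/token-events/", RouteClass.APPEND_ONLY),
    ("POST", r"/decisions/", RouteClass.APPEND_ONLY),
    ("PATCH", r"/tasks/[^/]+", RouteClass.REPLACE_STYLE),
    ("POST", r"/tasks/bulk-status-sync", RouteClass.REPLACE_STYLE),
    ("PATCH", r"/decisions/[^/]+", RouteClass.REPLACE_STYLE),
    ("POST", r"/decisions/[^/]+/resolve", RouteClass.REPLACE_STYLE),
    ("PATCH", r"/workplans/[^/]+", RouteClass.REPLACE_STYLE),
]


def classify_route(method: str, path: str) -> RouteClass:
    """Return the queueing class for a write; default to online-only."""
    for rule_method, pattern, route_class in _RULES:
        if method.upper() == rule_method and re.fullmatch(pattern, path):
            return route_class
    # DELETE, sync/import, admin, and unknown routes are never queued.
    return RouteClass.ONLINE_ONLY
```

Defaulting to `ONLINE_ONLY` means a newly added route is not queued until someone reviews it and adds it to the table. Payload-based rejections (credentials, attachments, large bodies) would need a separate check, since this sketch looks only at method and path.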

## Conflict Policy

- Append-only writes use idempotency keys and replay exactly once from the
  caller's point of view.
- Replace-style writes include `expected_updated_at`, `expected_status`, or a
  route-specific revision field where available.
- Supersedable queued writes, such as multiple task status patches for the same
  task, may be coalesced for replay while preserving local audit entries.
- If central state is newer and the replay cannot prove the queued write is
  still safe, mark the envelope `conflict` and surface it in relay status.
- Workplan-file canon remains authoritative. After recovery, operators should
  run `make fix-consistency REPO=state-hub` so file-backed task/workplan state
  wins over stale queued task updates.
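
To make the replace-style rule concrete, here is a hedged sketch of the decision a replay step might take for one queued patch. The function name `replay_decision`, the dictionary shapes, and the three return labels are assumptions for illustration; only the `expected_updated_at` field name comes from the policy above. Timestamps are assumed to be timezone-aware ISO 8601 strings.

```python
from datetime import datetime


def replay_decision(queued: dict, central: dict) -> str:
    """Decide what to do with a queued replace-style write.

    queued:  {"expected_updated_at": <ISO timestamp>, "patch": {field: value}}
    central: the current central row, including "updated_at".
    Returns "noop", "conflict", or "apply" (hypothetical labels).
    """
    expected = datetime.fromisoformat(queued["expected_updated_at"])
    current = datetime.fromisoformat(central["updated_at"])
    patch = queued["patch"]

    # If the central row already holds every patched value, replaying
    # changes nothing, so it is safe to treat the envelope as a no-op.
    if all(central.get(field) == value for field, value in patch.items()):
        return "noop"

    # The row changed after the client last observed it: do not overwrite.
    if current > expected:
        return "conflict"

    return "apply"
```

For example, a queued `{"status": "done"}` patch observed at 10:00 is a `conflict` against a row updated at 10:05 with a different status, but a `noop` if the row already says `done`.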

## T01 - Write Safety ADR and Route Inventory

```task
id: STATE-WP-0068-T01
status: done
priority: high
state_hub_task_id: "07aa2d43-0305-45ca-8b5a-bf6f96f716a9"
```

Create a short ADR or design doc that classifies State Hub write routes as
append-only, replace-style, supersedable, or online-only.

Deliverables:

- Route inventory generated from `api/routers/*` and MCP sanctioned writes.
- V1 safe-write allowlist with request/response examples.
- Conflict policy per route class.
- Explicit statement that queued receipts are pending evidence, not successful
  central commits.
- Operator decision on the local relay port, default outbox location, and
  retention window.

Done when implementation tasks can refer to a reviewed allowlist instead of
guessing route safety.

## T02 - Central Idempotency and Replay Acceptance

```task
id: STATE-WP-0068-T02
status: done
priority: high
state_hub_task_id: "f0060859-e9a7-441c-91cc-1e838c5ba60f"
```

Add central API support for idempotent replay.

Expected implementation:

- Migration for a `write_idempotency_keys` table storing key, method, path,
  request hash, response status/body, source host/agent, first seen, last seen,
  and expiry.
- Middleware or route dependency that accepts `Idempotency-Key` on allowlisted
  write endpoints.
- Same-key/same-request replay returns the original response.
- Same-key/different-request returns HTTP 409.
- Replay metadata is available for diagnostics without logging request secrets.
- Tests cover success, retry, hash mismatch, expiry, and unsupported routes.

Done when append-only writes can be retried after a transport failure without
duplicating central records.
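
The same-key behavior described above can be illustrated with a small in-process store. This is a sketch under stated assumptions, not the actual middleware: `IdempotencyStore`, `request_hash`, the reduced column set, and the `commit` callback are hypothetical, and SQLite stands in for whatever database the central API uses. Expiry and source metadata are omitted for brevity.

```python
import hashlib
import json
import sqlite3


def request_hash(method: str, path: str, body: dict) -> str:
    """Hash a canonical form of the request so key reuse can be detected."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    raw = f"{method.upper()} {path}\n{canonical}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class IdempotencyStore:
    """Hypothetical, simplified stand-in for the idempotency table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS write_idempotency_keys ("
            "idempotency_key TEXT PRIMARY KEY, request_hash TEXT NOT NULL, "
            "response_status INTEGER NOT NULL, response_body TEXT NOT NULL)"
        )

    def handle(self, key, method, path, body, commit):
        """Run commit(body) once per key; replay the stored response after."""
        digest = request_hash(method, path, body)
        row = self.conn.execute(
            "SELECT request_hash, response_status, response_body "
            "FROM write_idempotency_keys WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
        if row is not None:
            if row[0] != digest:
                # Same key, different request: refuse rather than guess.
                return 409, {"detail": "idempotency key reused with a different request"}
            return row[1], json.loads(row[2])
        status, response = commit(body)
        self.conn.execute(
            "INSERT INTO write_idempotency_keys VALUES (?, ?, ?, ?)",
            (key, digest, status, json.dumps(response)),
        )
        self.conn.commit()
        return status, response
```

In a real deployment the lookup, the write, and the key insert would need to share one transaction so that a crash between them cannot produce a committed record without a stored key.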

## T03 - Durable Local Outbox Store

```task
id: STATE-WP-0068-T03
status: done
priority: high
state_hub_task_id: "6897dd71-6252-4eed-bb0c-350e8c566b3b"
```

Implement a local SQLite-backed outbox module used by the relay and CLI.

Minimum schema:

- envelope id and idempotency key
- method, path, scrubbed JSON body, route class
- source agent, source host, repo slug, session id when known
- observed revision fields for conflict checks
- status: `queued`, `sending`, `acked`, `conflict`, `dead`, `cancelled`
- attempt count, next retry time, last error, central response summary
- created, updated, acked timestamps

Safety requirements:

- Create the DB with owner-only permissions where the platform supports it.
- Never persist authorization headers, API keys, bearer tokens, cookies, or
  secret-looking fields.
- Cap payload size and reject large opaque bodies.
- Provide export/import of non-secret envelopes for operator debugging.

Done when unit tests prove enqueue, status transitions, coalescing metadata,
scrubbing, and corruption-safe startup behavior.
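
A reduced version of the outbox could look like the sketch below. It assumes a subset of the minimum schema; the table name `outbox`, the column names, the `SECRET_HINTS` heuristic, and the 64 KiB cap are illustrative choices, not values taken from the implementation.

```python
import json
import os
import sqlite3
import time
import uuid

# Heuristic substrings for secret-looking field names (an assumption).
SECRET_HINTS = ("authorization", "password", "secret", "api_key", "bearer", "cookie")
MAX_BODY_BYTES = 64 * 1024  # hypothetical payload cap

SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
    envelope_id     TEXT PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    method          TEXT NOT NULL,
    path            TEXT NOT NULL,
    body            TEXT NOT NULL,
    route_class     TEXT NOT NULL,
    status          TEXT NOT NULL DEFAULT 'queued'
        CHECK (status IN ('queued', 'sending', 'acked',
                          'conflict', 'dead', 'cancelled')),
    attempts        INTEGER NOT NULL DEFAULT 0,
    next_retry_at   REAL,
    last_error      TEXT,
    created_at      REAL NOT NULL
)
"""


def scrub(value):
    """Recursively drop dict keys whose names look like secrets."""
    if isinstance(value, dict):
        return {
            key: scrub(item)
            for key, item in value.items()
            if not any(hint in key.lower() for hint in SECRET_HINTS)
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


def open_outbox(path: str) -> sqlite3.Connection:
    """Open the outbox, creating the file owner-only where supported."""
    if path != ":memory:" and not os.path.exists(path):
        os.close(os.open(path, os.O_CREAT | os.O_WRONLY, 0o600))
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def enqueue(conn, method, path, body, route_class) -> str:
    """Scrub, size-check, and persist one write envelope; return its id."""
    payload = json.dumps(scrub(body), sort_keys=True)
    if len(payload.encode("utf-8")) > MAX_BODY_BYTES:
        raise ValueError("payload too large to queue")
    envelope_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO outbox (envelope_id, idempotency_key, method, path, "
        "body, route_class, created_at) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (envelope_id, str(uuid.uuid4()), method, path, payload,
         route_class, time.time()),
    )
    conn.commit()
    return envelope_id
```

Name-based scrubbing is only a heuristic: it will miss secrets stored under innocuous field names, which is one reason the allowlist also rejects routes that are expected to carry credentials.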

## T04 - Edge Relay HTTP Surface

```task
id: STATE-WP-0068-T04
status: done
priority: high
state_hub_task_id: "deb883df-b312-4e8f-b559-718bb8a94035"
```

Create a local `statehub-edge` relay process that exposes a small HTTP surface.

Behavior:

- Online path: forward allowlisted writes to upstream and return the upstream
  response.
- Offline path: enqueue allowlisted writes and return a clear queued receipt:
  `{"queued": true, "outbox_id": "...", "idempotency_key": "...",
  "upstream": "unreachable"}`.
- Online-only path during outage: return a deterministic error explaining that
  the route is not queueable.
- Read path: proxy selected reads while online; optionally serve cached
  `/state/summary` metadata with stale markers while offline.
- Health/status: expose relay health, upstream reachability, pending count,
  oldest pending age, and conflict count.

Done when agents can point `API_BASE` at the relay and receive either the
normal REST shape or an explicit queued/error shape.
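
The online, offline, and online-only paths above reduce to one forward-or-enqueue decision. The sketch below shows that decision with the transport and the outbox injected as callables so it can run without a network. The `handle_write` name, the callback signatures, and the 202/503 status codes are assumptions; the queued receipt shape is the one given in this section.

```python
import uuid
from typing import Callable, Tuple


def handle_write(
    method: str,
    path: str,
    body: dict,
    *,
    queueable: bool,
    send: Callable[[str, str, dict, str], Tuple[int, dict]],
    enqueue: Callable[[str, str, dict, str], str],
) -> Tuple[int, dict]:
    """Forward a write upstream, or queue it if upstream is unreachable.

    send(method, path, body, idempotency_key) should raise ConnectionError
    when the central API cannot be reached (a hypothetical contract).
    """
    key = str(uuid.uuid4())
    try:
        # Online path: return whatever the central API returned.
        return send(method, path, body, key)
    except ConnectionError:
        if not queueable:
            # Online-only route during an outage: fail fast and say why.
            return 503, {
                "queued": False,
                "error": "route is not queueable while upstream is unreachable",
            }
        outbox_id = enqueue(method, path, body, key)
        return 202, {
            "queued": True,
            "outbox_id": outbox_id,
            "idempotency_key": key,
            "upstream": "unreachable",
        }
```

Generating the idempotency key before the first send attempt matters: if the request reached the central API but the response was lost, the later replay carries the same key and is deduplicated rather than applied twice.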

## T05 - Replay Worker and Conflict Handling

```task
id: STATE-WP-0068-T05
status: done
priority: high
state_hub_task_id: "6c3916c1-4a9f-4b1d-a8b1-a356a6edf3db"
```

Implement the replay loop.

Requirements:

- Exponential backoff with jitter for transport failures.
- Single-flight sending per envelope.
- Preserve per-entity order for replace-style writes.
- Coalesce superseded task/workplan status writes before replay when safe.
- Use `Idempotency-Key` for every replayed write.
- Mark conflicts without dropping the original envelope.
- Provide commands to retry, cancel, or mark-dead individual envelopes.

Done when an integration test can simulate central outage, enqueue writes,
restore central service, replay successfully, and surface one intentionally
stale task update as a conflict.
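
Two of the requirements above, backoff with jitter and coalescing, are easy to get subtly wrong, so a small sketch may help. It uses the "full jitter" variant of exponential backoff, which is one common choice and not necessarily the one the worker uses. The function names, the default `base` and `cap` values, and the envelope dictionary shape are assumptions.

```python
import random


def next_delay(attempt: int, base: float = 1.0, cap: float = 300.0, rng=random) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def coalesce(envelopes: list) -> list:
    """Keep only the latest replace-style envelope per (method, path).

    Append-only envelopes are always kept, and the relative order of the
    surviving envelopes is preserved. Each envelope is a dict with
    "method", "path", and "route_class" keys (a hypothetical shape).
    """
    latest = {}
    for index, env in enumerate(envelopes):
        if env["route_class"] == "replace_style":
            latest[(env["method"], env["path"])] = index

    kept = []
    for index, env in enumerate(envelopes):
        is_replace = env["route_class"] == "replace_style"
        if not is_replace or latest[(env["method"], env["path"])] == index:
            kept.append(env)
    return kept
```

The sketch returns the envelopes to send and leaves the superseded ones untouched in the input, which fits the requirement that local audit entries survive coalescing. A real worker would also record which envelope superseded which.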

## T06 - MCP and Agent UX Integration

```task
id: STATE-WP-0068-T06
status: done
priority: high
state_hub_task_id: "8ccac4f9-f457-4f87-9195-1d8619043c0f"
```

Update MCP tooling and agent-facing docs so offline buffering is usable without
surprise.

Expected changes:

- MCP write helpers recognize relay queued receipts and return them clearly.
- Automatic progress-event side effects do not duplicate queued primary writes.
- Session-close guidance says to check relay status when writes were queued.
- `mcp_server/TOOLS.md` documents online, queued, and conflict outcomes.
- Repo `AGENTS.md` template can point agents at the relay when enabled.

Done when an agent can complete a session during a central outage, see that the
progress write is queued, and verify later that it was replayed.
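
On the MCP side, the online, queued, and conflict outcomes could be distinguished with a helper along these lines. This is a hypothetical sketch: the function names and outcome labels are invented for illustration, and it assumes the queued receipt carries `"queued": true` and that conflicts surface as HTTP 409.

```python
def interpret_write_result(status_code: int, payload: object) -> str:
    """Map a relay response onto one of four outcome labels."""
    if isinstance(payload, dict) and payload.get("queued") is True:
        return "queued"
    if status_code == 409:
        return "conflict"
    if 200 <= status_code < 300:
        return "committed"
    return "error"


def should_emit_progress_side_effect(outcome: str) -> bool:
    """Only a confirmed central commit may trigger automatic side effects."""
    return outcome == "committed"
```

Checking the receipt body before the status code keeps a queued write from being mistaken for a commit even if the relay answers with a 2xx code.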

## T07 - Operator Observability

```task
id: STATE-WP-0068-T07
status: done
priority: medium
state_hub_task_id: "62c0ca4f-b3e2-49f7-ba70-365016195e83"
```

Expose pending offline writes to humans and automations.

Deliverables:

- CLI commands: `statehub outbox status`, `statehub outbox list`,
  `statehub outbox replay`, `statehub outbox export`.
- Optional dashboard panel or docs page showing edge relay health, if the
  dashboard can reach the relay.
- Prometheus-style or JSON metrics for pending count, oldest age, replay
  failures, and conflicts.
- Progress event after replay recovery summarizing non-secret results.

Done when the operator can see whether any host still has unsent State Hub
writes before declaring an outage recovered.
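
The JSON metrics listed above can be derived directly from the outbox table. The sketch below assumes a table named `outbox` with `status` and `created_at` (Unix seconds) columns; the function name and the output keys are hypothetical, not the CLI's actual output format.

```python
import sqlite3
import time


def outbox_status(conn: sqlite3.Connection, now=None) -> dict:
    """Summarize pending and conflicted envelopes for operators."""
    now = time.time() if now is None else now
    pending, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM outbox "
        "WHERE status IN ('queued', 'sending')"
    ).fetchone()
    conflicts = conn.execute(
        "SELECT COUNT(*) FROM outbox WHERE status = 'conflict'"
    ).fetchone()[0]
    return {
        "pending_count": pending,
        # None when nothing is pending, so "0 seconds" is never ambiguous.
        "oldest_pending_age_seconds": (
            None if oldest is None else max(0.0, now - oldest)
        ),
        "conflict_count": conflicts,
    }
```

An operator check before declaring recovery would then amount to confirming that `pending_count` and `conflict_count` are both zero on every host running a relay.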

## T08 - Chaos and Regression Test Suite

```task
id: STATE-WP-0068-T08
status: done
priority: high
state_hub_task_id: "2a12614f-8923-45b1-b8e9-ad8c818b23d3"
```

Add tests that make offline behavior boring.

Coverage:

- Unit tests for route allowlist, payload scrubbing, idempotency hash behavior,
  outbox state transitions, and coalescing decisions.
- Integration test with a fake upstream returning connection errors, 5xx, 409,
  and success.
- End-to-end test for MCP write through relay during outage and replay.
- Drill script that can be run locally without touching production data.

Done when CI can prove no duplicate append-only records are produced across
retry and no replace-style conflict is silently applied.

## T09 - Runbooks, Cutover, and Recovery Drill

```task
id: STATE-WP-0068-T09
status: done
priority: medium
state_hub_task_id: "fedea85e-c720-4814-9691-affa6c944954"
```

Document and rehearse the operator workflow.

Runbook content:

- How to start the relay on an operator workstation or agent host.
- How to configure MCP/REST clients to use the relay.
- What queued receipts mean during session close.
- How to inspect, replay, export, cancel, and resolve conflicted envelopes.
- Recovery checklist after central State Hub returns.
- Interaction with `make fix-consistency REPO=state-hub`.

Done when a controlled drill queues at least one progress event and one task
status update during a forced outage, replays the progress event, flags or
applies the task update according to the conflict policy, and records the
results without exposing secrets.

## Dependencies and References

- `CUST-WP-0011` - pragmatic railiance01 State Hub migration.
- `CUST-WP-0038` - long-term ThreePhoenix HA State Hub target.
- `STATE-WP-0059` - MCP write-layer reliability and explicit API failure
  handling.
- `STATE-WP-0066` - summary cache and stale-while-revalidate for read paths.
- `docs/activity-core-delegation.md` - JetStream buffering covers State Hub to
  activity-core events after commit; this work covers agent to State Hub writes
  before commit.
- `mcp_server/TOOLS.md` - current MCP/REST parity and failure handling contract.

After this workplan is synced, run:

```bash
make fix-consistency REPO=state-hub
```

## Implementation Notes

Completed 2026-06-23. The implementation provides the first full offline-write
buffering path:

- Central idempotency support through `WriteIdempotencyMiddleware`, the
  `write_idempotency_keys` model, and migration `e9f0a1b2c3d4`. Exact duplicate
  writes replay the original response; same key with a different request returns
  HTTP 409.
- Shared route classification for queueable append and replace-style writes.
- Local SQLite outbox with payload scrubbing, payload size limits, private file
  permissions where supported, status transitions, retry/cancel/export support,
  and latest replace-write coalescing.
- State Hub edge relay app with online forwarding, offline queue receipts,
  health/status, replay endpoint, and replay worker.
- `statehub outbox` CLI commands for status, list, export, replay, retry, and
  cancel.
- MCP queued receipt handling so queued primary writes do not trigger automatic
  progress side effects.
- Operator documentation in `docs/offline-write-buffer.md`, MCP tool docs, and the
  Codex agent instruction template.

Verification:

- Focused suite: 22 passed in 19.51s.
- Full suite: 446 passed, 1 warning in 287.20s. The warning was a SQLAlchemy
  `RuntimeWarning` in `tests/test_summary_cache.py` and was not introduced by a
  failing assertion.
- Syntax checks passed for the new and touched Python modules.
- `git diff --check` passed.