generated from coulomb/repo-seed
docs(ACTIVITY-WP-0014): rescope T05 to thin client under State Hub beachhead model
Resilience (queue/cache) is handed to custodian/state-hub as a per-machine beachhead; activity-core keeps only idempotent writes + adopt-beachhead-endpoint and retires its bespoke actcore-state-hub-bridge proxy. Proposal sent to state-hub.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -151,7 +151,7 @@ multi-day outage should not flood the triage feed). Update the Railiance runtime
 ConfigMap / bundle, redeploy, and document the run-miss options + per-definition
 guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).

-## Resilient State Hub sinks/resolvers (real incident fix)
+## Keep activity-core thin under the State Hub beachhead model

 ```task
 id: ACTIVITY-WP-0014-T05
@@ -160,17 +160,28 @@ priority: high
 state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
 ```

-T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub
-**`Connection refused` at the report sink** (and chronic resolver timeouts) because
-railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
-is asleep at 07:20 Berlin. Misfire policies do not help: the run fires and fails
-the same way. Make activity-core resilient to transient State Hub unavailability:
+**Architecture decision (Bernd, 2026-06-23):** the resilience that this incident
+needs (queuing writes and caching reads while State Hub is unreachable) must
+**not** be a burden carried by client repos. It belongs to State Hub as a
+**per-machine local "beachhead"** (transparent read cache + write outbox, possibly
+with State-Hub federation), owned by custodian/state-hub. It handles all three
+failure modes: network interruption, central State Hub crash, central machine
+down. This is handed off to state-hub (see the coordination message / proposal);
+**do not build client-side queue/cache logic in activity-core.**

-- Report sinks should retry with backoff and **not hard-fail the workflow** when
-  the only failure is transient State Hub delivery; preserve the generated report
-  (working-memory note + a deferred/outbox state-hub-progress) for later flush.
-- Required State Hub context resolvers should retry/backoff and surface a clear,
-  single diagnostic rather than a bare `timed out`.
-- Separately (out of this repo): give railiance01 a State Hub endpoint that does
-  not depend on the workstation being awake, or run the triage at a time the
-  workstation is reliably up. Owner decision needed.
+activity-core's only responsibilities under this model are thin:
+
+- **Idempotent writes (do now, in-repo):** attach a stable idempotency key
+  (e.g. `run_id` + `instruction_id` + `event_type`) to every State Hub write so a
+  beachhead flush, possibly replayed after an outage, cannot create duplicate
+  `daily_triage`/progress events. The report sink already does a read-based dedup
+  check (`_progress_exists`); make the guarantee explicit and not dependent on a
+  live read.
+- **Adopt the beachhead endpoint (blocked on state-hub):** keep `STATE_HUB_URL`
+  pointed at the local beachhead, and **retire the bespoke
+  `actcore-state-hub-bridge` proxy** (the inline `hostNetwork` proxy in
+  `k8s/railiance/20-runtime.yaml`) once the state-hub-owned beachhead exists; it
+  is a primitive precursor of the beachhead and should not be extended here.
+
+Blocked on the state-hub beachhead capability for the second item; the
+idempotent-writes item can proceed independently.
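
The idempotent-writes item can be illustrated with a minimal Python sketch. It assumes the key is a SHA-256 hash over the three fields the task suggests (`run_id`, `instruction_id`, `event_type`). The helper `idempotency_key`, the `InMemoryHub` class, and the sample values are hypothetical and are not part of activity-core or State Hub; how State Hub or the beachhead would actually accept and enforce such a key is not specified in this change.

```python
import hashlib
import json


def idempotency_key(run_id: str, instruction_id: str, event_type: str) -> str:
    """Derive a stable key from the fields that identify one logical write.

    Hypothetical helper: the field choice follows the example in the task
    text; the hashing scheme is an assumption made for this sketch.
    """
    # Encode the parts as a JSON list so field boundaries stay unambiguous
    # ("a" + "bc" must not collide with "ab" + "c").
    payload = json.dumps([run_id, instruction_id, event_type], separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InMemoryHub:
    """Hypothetical stand-in for a receiver that deduplicates on the key."""

    def __init__(self) -> None:
        self.events: dict[str, dict] = {}

    def write(self, key: str, event: dict) -> bool:
        """Store the event unless the key was already seen; True if stored."""
        if key in self.events:
            return False
        self.events[key] = event
        return True


# Simulate an outbox flush that is replayed after an outage.
hub = InMemoryHub()
key = idempotency_key("run-2026-06-23", "daily-triage", "daily_triage")
first = hub.write(key, {"event_type": "daily_triage"})
replayed = hub.write(key, {"event_type": "daily_triage"})
```

Because the key is computed only from data the writer already holds, the duplicate check does not need a live read against State Hub at write time, which is the property the task asks to make explicit.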