docs(ACTIVITY-WP-0014): rescope T05 to thin client under State Hub beachhead model

Resilience (queue/cache) is handed to custodian/state-hub as a per-machine
beachhead; activity-core keeps only idempotent writes + adopt-beachhead-endpoint
and retires its bespoke actcore-state-hub-bridge proxy. Proposal sent to state-hub.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:18:01 +02:00
parent cf7a11dcd9
commit f90591c5f1

@@ -151,7 +151,7 @@ multi-day outage should not flood the triage feed). Update the Railiance runtime
ConfigMap / bundle, redeploy, and document the run-miss options + per-definition
guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).
## Resilient State Hub sinks/resolvers (real incident fix)
## Keep activity-core thin under the State Hub beachhead model
```task
id: ACTIVITY-WP-0014-T05
@@ -160,17 +160,28 @@ priority: high
state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
```
T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub
**`Connection refused` at the report sink** (and chronic resolver timeouts) because
railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
is asleep at 07:20 Berlin. Misfire policies do not help: the run fires and fails
the same way. Make activity-core resilient to transient State Hub unavailability:
**Architecture decision (Bernd, 2026-06-23):** the resilience that this incident
needs — queuing writes and caching reads while State Hub is unreachable — must
**not** be a burden carried by client repos. It belongs to State Hub as a
**per-machine local "beachhead"** (transparent read cache + write outbox, possibly
with State-Hub federation), owned by custodian/state-hub. It handles all three
failure modes: network interruption, central State Hub crash, central machine
down. This is handed off to state-hub (see the coordination message / proposal);
**do not build client-side queue/cache logic in activity-core.**
- Report sinks should retry with backoff and **not hard-fail the workflow** when
the only failure is transient State Hub delivery; preserve the generated report
(working-memory note + a deferred/outbox state-hub-progress) for later flush.
- Required State Hub context resolvers should retry/backoff and surface a clear,
single diagnostic rather than a bare `timed out`.
- Separately (out of this repo): give railiance01 a State Hub endpoint that does
not depend on the workstation being awake, or run the triage at a time the
workstation is reliably up. Owner decision needed.
activity-core's only responsibilities under this model are thin:
- **Idempotent writes (do now, in-repo):** attach a stable idempotency key
(e.g. `run_id` + `instruction_id` + `event_type`) to every State Hub write so a
beachhead flush — possibly replayed after an outage — cannot create duplicate
`daily_triage`/progress events. The report sink already does a read-based dedup
check (`_progress_exists`); make the guarantee explicit and not dependent on a
live read.
- **Adopt the beachhead endpoint (blocked on state-hub):** keep `STATE_HUB_URL`
pointed at the local beachhead, and **retire the bespoke
`actcore-state-hub-bridge` proxy** (the inline `hostNetwork` proxy in
`k8s/railiance/20-runtime.yaml`) once the state-hub-owned beachhead exists — it
is a primitive precursor of the beachhead and should not be extended here.
The second item is blocked on the state-hub beachhead capability; the
idempotent-writes item can proceed independently.
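
The idempotent-writes item above could be sketched roughly as follows. This is a minimal illustration, not activity-core's actual code: the function name `idempotency_key`, the SHA-256 derivation, and the `InMemorySink` stand-in for the receiving side are all assumptions made for the example. Only the three key components (`run_id`, `instruction_id`, `event_type`) come from the task text, and how State Hub or the beachhead would actually accept such a key is not specified here.

```python
import hashlib


def idempotency_key(run_id: str, instruction_id: str, event_type: str) -> str:
    """Derive a stable key from the three fields named in the task (hypothetical helper).

    Each part is length-prefixed so that different field combinations can
    never collapse into the same canonical string.
    """
    parts = (run_id, instruction_id, event_type)
    canonical = "|".join(f"{len(p)}:{p}" for p in parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemorySink:
    """Hypothetical stand-in for a receiver that deduplicates on the key."""

    def __init__(self) -> None:
        self.events: dict[str, dict] = {}

    def write(self, key: str, payload: dict) -> bool:
        # A replayed write with a known key is ignored instead of duplicated.
        if key in self.events:
            return False
        self.events[key] = payload
        return True


sink = InMemorySink()
key = idempotency_key("run-1", "instr-7", "daily_triage")
first = sink.write(key, {"summary": "triage report"})
replayed = sink.write(key, {"summary": "triage report"})  # e.g. an outbox flush replay
```

Because the key depends only on the write's identity and not on a live read, a replayed flush would be recognized as a duplicate even if a `_progress_exists`-style check could not reach State Hub at the time of the original write.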