docs(ACTIVITY-WP-0014): rescope T05 to thin client under State Hub beachhead model

Resilience (queue/cache) is handed to custodian/state-hub as a per-machine beachhead; activity-core keeps only idempotent writes + adopt-beachhead-endpoint and retires its bespoke actcore-state-hub-bridge proxy. Proposal sent to state-hub.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:18:01 +02:00
parent cf7a11dcd9
commit f90591c5f1
1 changed file with 25 additions and 14 deletions
--- a/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
+++ b/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
@@ -151,7 +151,7 @@ multi-day outage should not flood the triage feed). Update the Railiance runtime
 ConfigMap / bundle, redeploy, and document the run-miss options + per-definition
 guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).

-## Resilient State Hub sinks/resolvers (real incident fix)
+## Keep activity-core thin under the State Hub beachhead model

 ```task
 id: ACTIVITY-WP-0014-T05
@@ -160,17 +160,28 @@ priority: high
 state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
 ```

-T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub
-**`Connection refused` at the report sink** (and chronic resolver timeouts) because
-railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
-is asleep at 07:20 Berlin. Misfire policies do not help: the run fires and fails
-the same way. Make activity-core resilient to transient State Hub unavailability:
+**Architecture decision (Bernd, 2026-06-23):** the resilience that this incident
+needs — queuing writes and caching reads while State Hub is unreachable — must
+**not** be a burden carried by client repos. It belongs to State Hub as a
+**per-machine local "beachhead"** (transparent read cache + write outbox, possibly
+with State-Hub federation), owned by custodian/state-hub. It handles all three
+failure modes: network interruption, central State Hub crash, central machine
+down. This is handed off to state-hub (see the coordination message / proposal);
+**do not build client-side queue/cache logic in activity-core.**

- Report sinks should retry with backoff and **not hard-fail the workflow** when
-  the only failure is transient State Hub delivery; preserve the generated report
-  (working-memory note + a deferred/outbox state-hub-progress) for later flush.
- Required State Hub context resolvers should retry/backoff and surface a clear,
-  single diagnostic rather than a bare `timed out`.
- Separately (out of this repo): give railiance01 a State Hub endpoint that does
-  not depend on the workstation being awake, or run the triage at a time the
-  workstation is reliably up. Owner decision needed.
+activity-core's only responsibilities under this model are thin:
+
+- **Idempotent writes (do now, in-repo):** attach a stable idempotency key
+  (e.g. `run_id` + `instruction_id` + `event_type`) to every State Hub write so a
+  beachhead flush — possibly replayed after an outage — cannot create duplicate
+  `daily_triage`/progress events. The report sink already does a read-based dedup
+  check (`_progress_exists`); make the guarantee explicit and not dependent on a
+  live read.
+- **Adopt the beachhead endpoint (blocked on state-hub):** keep `STATE_HUB_URL`
+  pointed at the local beachhead, and **retire the bespoke
+  `actcore-state-hub-bridge` proxy** (the inline `hostNetwork` proxy in
+  `k8s/railiance/20-runtime.yaml`) once the state-hub-owned beachhead exists — it
+  is a primitive precursor of the beachhead and should not be extended here.
+
+The second item is blocked on the state-hub beachhead capability; the
+idempotent-writes item can proceed independently.
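
Note on the idempotent-writes item: the key can be derived purely from the fields the workplan names (`run_id`, `instruction_id`, `event_type`), so a replayed flush produces the same key without a live read. The sketch below is illustrative only and is not taken from the repo. `KEY_NAMESPACE`, `idempotency_key`, `StateHubStub` and `replay_demo` are hypothetical names, and how State Hub or the beachhead would actually accept and enforce such a key is an assumption that the state-hub proposal has to settle.

```python
import json
import uuid

# Hypothetical namespace for this sketch; any fixed UUID works as long as it
# never changes, because changing it would change every derived key.
KEY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "activity-core.example")


def idempotency_key(run_id: str, instruction_id: str, event_type: str) -> str:
    """Derive a stable key from the three fields named in the workplan."""
    # JSON-encode the fields as a list so that ("a", "bc") and ("ab", "c")
    # cannot collapse into the same string.
    material = json.dumps([run_id, instruction_id, event_type], separators=(",", ":"))
    return str(uuid.uuid5(KEY_NAMESPACE, material))


class StateHubStub:
    """In-memory stand-in (assumption) for a receiver that dedups on the key."""

    def __init__(self) -> None:
        self.events: dict[str, dict] = {}

    def write(self, key: str, event: dict) -> bool:
        """Store the event once; return False if the key was already seen."""
        if key in self.events:
            return False
        self.events[key] = event
        return True


def replay_demo() -> int:
    """Write the same event twice, as a replayed flush would, and count rows."""
    hub = StateHubStub()
    key = idempotency_key("run-1", "instr-1", "daily_triage")
    event = {"event_type": "daily_triage", "summary": "example"}
    hub.write(key, event)  # first delivery
    hub.write(key, event)  # replay after an outage
    return len(hub.events)
```

Because the key depends only on values the workflow already holds, the client stays thin: it computes the key and attaches it to the write, and deduplication is left to whichever component receives the write.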