From f90591c5f1e17c04f8c70652af0b4549f2a086e6 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Tue, 23 Jun 2026 21:18:01 +0200
Subject: [PATCH] docs(ACTIVITY-WP-0014): rescope T05 to thin client under State Hub beachhead model

Resilience (queue/cache) is handed to custodian/state-hub as a
per-machine beachhead; activity-core keeps only idempotent writes +
adopt-beachhead-endpoint and retires its bespoke
actcore-state-hub-bridge proxy. Proposal sent to state-hub.

Co-Authored-By: Claude Opus 4.8
---
 ...ITY-WP-0014-schedule-misfire-robustness.md | 39 ++++++++++++-------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md b/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
index 75fad1f..9ae43da 100644
--- a/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
+++ b/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
@@ -151,7 +151,7 @@ multi-day outage should not flood the triage feed). Update the Railiance runtime
 ConfigMap / bundle, redeploy, and document the run-miss options + per-definition
 guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).
 
-## Resilient State Hub sinks/resolvers (real incident fix)
+## Keep activity-core thin under the State Hub beachhead model
 
 ```task
 id: ACTIVITY-WP-0014-T05
@@ -160,17 +160,28 @@ priority: high
 state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
 ```
 
-T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub
-**`Connection refused` at the report sink** (and chronic resolver timeouts) because
-railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
-is asleep at 07:20 Berlin. Misfire policies do not help: the run fires and fails
-the same way. Make activity-core resilient to transient State Hub unavailability:
+**Architecture decision (Bernd, 2026-06-23):** the resilience that this incident
+needs — queuing writes and caching reads while State Hub is unreachable — must
+**not** be a burden carried by client repos. It belongs to State Hub as a
+**per-machine local "beachhead"** (transparent read cache + write outbox, possibly
+with State-Hub federation), owned by custodian/state-hub. It handles all three
+failure modes: network interruption, central State Hub crash, central machine
+down. This is handed off to state-hub (see the coordination message / proposal);
+**do not build client-side queue/cache logic in activity-core.**
 
-- Report sinks should retry with backoff and **not hard-fail the workflow** when
-  the only failure is transient State Hub delivery; preserve the generated report
-  (working-memory note + a deferred/outbox state-hub-progress) for later flush.
-- Required State Hub context resolvers should retry/backoff and surface a clear,
-  single diagnostic rather than a bare `timed out`.
-- Separately (out of this repo): give railiance01 a State Hub endpoint that does
-  not depend on the workstation being awake, or run the triage at a time the
-  workstation is reliably up. Owner decision needed.
+activity-core's only responsibilities under this model are thin:
+
+- **Idempotent writes (do now, in-repo):** attach a stable idempotency key
+  (e.g. `run_id` + `instruction_id` + `event_type`) to every State Hub write so a
+  beachhead flush — possibly replayed after an outage — cannot create duplicate
+  `daily_triage`/progress events. The report sink already does a read-based dedup
+  check (`_progress_exists`); make the guarantee explicit and not dependent on a
+  live read.
+- **Adopt the beachhead endpoint (blocked on state-hub):** keep `STATE_HUB_URL`
+  pointed at the local beachhead, and **retire the bespoke
+  `actcore-state-hub-bridge` proxy** (the inline `hostNetwork` proxy in
+  `k8s/railiance/20-runtime.yaml`) once the state-hub-owned beachhead exists — it
+  is a primitive precursor of the beachhead and should not be extended here.
+
+Blocked on the state-hub beachhead capability for the second item; the
+idempotent-writes item can proceed independently.
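
Note on the idempotent-writes item: a minimal sketch of how a stable key could be derived from `run_id`, `instruction_id` and `event_type`, so that a replayed flush is recognised without a live read. This is an illustration under stated assumptions, not activity-core's code: `idempotency_key`, `FakeStateHub` and the example field values are hypothetical, and whether State Hub or the beachhead deduplicates on such a key is not confirmed by this patch.

```python
import hashlib
from dataclasses import dataclass, field


def idempotency_key(run_id: str, instruction_id: str, event_type: str) -> str:
    """Derive a stable key from the fields that identify one logical write.

    Hypothetical helper: the three fields are the ones the workplan suggests.
    """
    # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot collide.
    parts = (run_id, instruction_id, event_type)
    material = "".join(f"{len(p)}:{p}" for p in parts)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


@dataclass
class FakeStateHub:
    """In-memory stand-in for a receiver that deduplicates on the key.

    Assumption for illustration only; the real State Hub API is not shown here.
    """
    events: dict = field(default_factory=dict)

    def write(self, key: str, payload: dict) -> bool:
        # Return True if the event was stored, False if the key was seen before.
        if key in self.events:
            return False
        self.events[key] = payload
        return True


hub = FakeStateHub()
key = idempotency_key("run-2026-06-23", "daily-triage", "daily_triage")
first = hub.write(key, {"summary": "report"})
# A flush replayed after an outage resends the same write with the same key.
replayed = hub.write(key, {"summary": "report"})
```

Because the key depends only on the write's identifying fields, the sender never needs to query State Hub before writing; the deduplication decision moves to whichever side receives the flush.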