diff --git a/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md b/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
index 197074c..cb7d089 100644
--- a/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
+++ b/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
@@ -64,18 +64,45 @@ Proposed mapping to a new `misfire_policy` value set (names open to review):
 
 ```task
 id: ACTIVITY-WP-0014-T01
-status: todo
+status: done
 priority: high
 state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab"
 ```
 
-Bring up the ops-bridge tunnel (`bridge up state-hub-coulombcore`,
-`bridge up state-hub-mcp-coulombcore`; tunnel to railiance01 was down from the
-workstation during diagnosis). Inspect the live Temporal schedule
-`activity-schedule-`: paused state, configured catchup window,
-recent action/fire history, and worker pod status. Confirm whether the
-06-22/06-23 07:20 UTC fires were dropped vs. failed. Backfill the missed runs if
-calibration evidence is still wanted. Record findings in the workplan.
+Inspected via `ssh railiance01` + in-node `kubectl`/`temporal` (no k3s tunnel is
+defined for railiance01; the documented access path is SSH to the host).
+
+**Findings (2026-06-23) — the WP-0014 premise was wrong for this incident:**
+
+- All pods healthy; `actcore-worker` up 44h, 0 restarts. Not a crash.
+- The daily-triage Temporal schedule (`activity-schedule-6fca51fa-…`) is
+  **healthy**: `Paused false`, `OverlapPolicy Skip`, **`CatchupWindow 365d`**
+  (Temporal's *default* when unset), `ActionCounts {Total:8, MissedCatchupWindow:0}`.
+  So fires were **not** silently dropped — my original "no catchup window → silent
+  drop" hypothesis does not hold; the server default is already 365d.
+- The `2026-06-23T05:20:00Z` fire **did fire and ran**, then **Failed at the report
+  sink**: `report sink failure: state-hub-progress … '[Errno 111] Connection
+  refused'`. The run produced a report but could not deliver it to State Hub, so
+  no `daily_triage` progress event (not even a "could not run" one) was posted →
+  the silence. The 06-22 fire has no execution in retention (bridge likely down
+  then too / schedule update window at `LastUpdateAt 1d ago`).
+- Root cause is **State Hub connectivity from railiance01**, not Temporal. The
+  in-cluster `actcore-state-hub-bridge` (`hostNetwork`) proxies to
+  `127.0.0.1:18000` on the node — the local end of the ops-bridge **reverse tunnel
+  back to the workstation's State Hub**. At 07:20 Europe/Berlin (= 05:20 UTC) the
+  workstation/tunnel was unreachable → `Connection refused`. Chronic flakiness
+  confirmed: 102 State Hub resolver timeouts in 24h (69 `recently_on_scope`,
+  33 `consistency_sweep`).
+
+**Implication:** the trigger *is* independent of the laptop, but the triage's
+**data dependencies (State Hub context resolution + report delivery) still route
+back to the workstation State Hub**, which is asleep at 07:20 Berlin. WP-0014's
+misfire policies are still good robustness, but the real fix is (a) State Hub
+reachable from railiance01 independent of the workstation, and/or (b) sinks/
+resolvers resilient to transient State Hub unavailability (retry/backoff,
+store-and-forward) instead of hard-failing the workflow. Tracked as follow-up
+below. Backfill deferred: a replay only succeeds while the workstation State Hub
+is reachable.
 
 ## Implement explicit misfire recovery modes
 
@@ -122,3 +149,27 @@ Choose and set the appropriate `misfire_policy` for `daily-statehub-wsjf-triage`
 multi-day outage should not flood the triage feed). Update the Railiance runtime
 ConfigMap / bundle, redeploy, and document the run-miss options + per-definition
 guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).
+
+## Resilient State Hub sinks/resolvers (real incident fix)
+
+```task
+id: ACTIVITY-WP-0014-T05
+status: todo
+priority: high
+state_hub_task_id: ""
+```
+
+T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub
+**`Connection refused` at the report sink** (and chronic resolver timeouts) because
+railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
+is asleep at 07:20 Berlin. Misfire policies do not help: the run fires and fails
+the same way. Make activity-core resilient to transient State Hub unavailability:
+
+- Report sinks should retry with backoff and **not hard-fail the workflow** when
+  the only failure is transient State Hub delivery; preserve the generated report
+  (working-memory note + a deferred/outbox state-hub-progress) for later flush.
+- Required State Hub context resolvers should retry/backoff and surface a clear,
+  single diagnostic rather than a bare `timed out`.
+- Separately (out of this repo): give railiance01 a State Hub endpoint that does
+  not depend on the workstation being awake, or run the triage at a time the
+  workstation is reliably up. Owner decision needed.
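
Note on T05: the sink behaviour described in the first bullet (retry with backoff, then park the report instead of hard-failing the workflow) could be sketched roughly as follows. This is an illustration only, not activity-core's actual code. Python is assumed from the `[Errno 111]` error format, and every name here (`Outbox`, `deliver_with_outbox`, the `send` callable, the default attempt count and delays) is hypothetical. A real outbox would also need durable storage rather than an in-memory list.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Outbox:
    """Hypothetical in-memory store-and-forward queue for undelivered reports."""

    pending: list[dict] = field(default_factory=list)

    def flush(self, send: Callable[[dict], None]) -> int:
        """Deliver queued reports in order; stop at the first transient failure."""
        delivered = 0
        while self.pending:
            try:
                send(self.pending[0])
            except (ConnectionError, TimeoutError):
                break  # State Hub still unreachable; keep the rest for later
            self.pending.pop(0)
            delivered += 1
        return delivered


def deliver_with_outbox(
    report: dict,
    send: Callable[[dict], None],
    outbox: Outbox,
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if delivered now, False if the report was parked in the outbox."""
    for attempt in range(attempts):
        try:
            send(report)
            return True
        except (ConnectionError, TimeoutError):
            # ConnectionRefusedError (errno 111) is a subclass of ConnectionError.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    # Transient delivery failure only: preserve the report, do not raise.
    outbox.pending.append(report)
    return False
```

In this sketch the caller treats a `False` return as "report generated, delivery deferred" and lets the workflow complete; a later run or a periodic task would call `outbox.flush(send)` once State Hub is reachable again. Non-transient errors (anything other than connection or timeout failures) still propagate, so genuine bugs are not silently swallowed.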