diff --git a/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md b/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
index cb7d089..75fad1f 100644
--- a/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
+++ b/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
@@ -21,27 +21,28 @@ plus detection/alerting when a scheduled fire is missed.
 ## Motivation
 
 On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
-(cron `20 7 * * *` UTC, projected into the Railiance runtime ConfigMap
+(cron `20 7 * * *` Europe/Berlin, projected into the Railiance runtime ConfigMap
 `actcore-external-activity-definitions`) produced **no `daily_triage` progress
 event at all** — neither a success nor a `could not run; operator review
-required` failure. Complete silence means the workflow never started, i.e. the
-miss is at the **schedule/catchup level**, not the instruction level.
+required` failure.
 
-Root cause (code-level, confirmed in `schedule_manager.py`):
-`_build_schedule()` sets `SchedulePolicy(overlap=...)` but **never sets
-`catchup_window`**. Temporal's default catchup window is short, so if the worker
-pod or in-cluster Temporal (`actcore-temporal:7233`) was briefly unavailable at
-07:20 UTC, the fire is **silently dropped with no recovery and no signal**. The
-existing `misfire_policy` (`skip`/`catchup`/`compress`) only maps to
-`ScheduleOverlapPolicy` (concurrency of overlapping runs) — it does **not**
-control missed-fire recovery. The only recovery path is a manual 1-hour
-`backfill` that runs solely at `upsert_schedule` time and only when
-`misfire_policy == "catchup"` (the triage def uses the default `skip`).
+> **Corrected by T01 (2026-06-23).** The initial hypothesis below — that
+> `_build_schedule()` never set `catchup_window`, so a short-default catchup
+> window silently dropped the fire — was **disproven on the live cluster**. The
+> Temporal schedule is healthy with `CatchupWindow 365d` (the server default) and
+> `0 MissedCatchupWindow`. The real cause is that the run **fired and ran but
+> failed at the report sink** with `Connection refused` posting to State Hub,
+> because railiance01 reaches State Hub via a reverse tunnel back to the
+> workstation, which is asleep at 07:20 Berlin. See the T01 findings and T05.
 
 The trigger now originates entirely on **railiance01** (in-cluster Temporal
-Schedule, ConfigMap-projected definition). The workstation laptop is **not**
-required at trigger time — so any miss is a runtime-robustness gap, not a
-"laptop was off" problem.
+Schedule, ConfigMap-projected definition) and is **not** laptop-dependent — but
+the triage's State Hub *data dependencies* (context resolution and report
+delivery) still route back to the workstation State Hub.
+
+This workplan still delivers worthwhile robustness — explicit run-miss recovery
+policies (T02) and missed-fire detection (T03) — but the fix for *this* incident
+is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).
 
 ## Desired run-miss options (from Bernd)
 
@@ -156,7 +157,7 @@ guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).
 id: ACTIVITY-WP-0014-T05
 status: todo
 priority: high
-state_hub_task_id: ""
+state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
 ```
 
 T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub