docs(ACTIVITY-WP-0014): correct Motivation to match T01 findings

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 17:16:17 +02:00
parent 99e5d525a8
commit cf7a11dcd9
1 changed file with 18 additions and 17 deletions
--- a/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
+++ b/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
@@ -21,27 +21,28 @@ plus detection/alerting when a scheduled fire is missed.
 ## Motivation

 On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
-(cron `20 7 * * *` UTC, projected into the Railiance runtime ConfigMap
+(cron `20 7 * * *` Europe/Berlin, projected into the Railiance runtime ConfigMap
 `actcore-external-activity-definitions`) produced **no `daily_triage` progress
 event at all** — neither a success nor a `could not run; operator review
-required` failure. Complete silence means the workflow never started, i.e. the
-miss is at the **schedule/catchup level**, not the instruction level.
+required` failure.

-Root cause (code-level, confirmed in `schedule_manager.py`):
-`_build_schedule()` sets `SchedulePolicy(overlap=...)` but **never sets
-`catchup_window`**. Temporal's default catchup window is short, so if the worker
-pod or in-cluster Temporal (`actcore-temporal:7233`) was briefly unavailable at
-07:20 UTC, the fire is **silently dropped with no recovery and no signal**. The
-existing `misfire_policy` (`skip`/`catchup`/`compress`) only maps to
-`ScheduleOverlapPolicy` (concurrency of overlapping runs) — it does **not**
-control missed-fire recovery. The only recovery path is a manual 1-hour
-`backfill` that runs solely at `upsert_schedule` time and only when
-`misfire_policy == "catchup"` (the triage def uses the default `skip`).
+> **Corrected by T01 (2026-06-23).** The initial hypothesis below — that
+> `_build_schedule()` never set `catchup_window`, so a short-default catchup
+> window silently dropped the fire — was **disproven on the live cluster**. The
+> Temporal schedule is healthy with `CatchupWindow 365d` (the server default) and
+> `0 MissedCatchupWindow`. The real cause is that the run **fired and ran but
+> failed at the report sink** with `Connection refused` posting to State Hub,
+> because railiance01 reaches State Hub via a reverse tunnel back to the
+> workstation, which is asleep at 07:20 Berlin. See the T01 findings and T05.

 The trigger now originates entirely on **railiance01** (in-cluster Temporal
-Schedule, ConfigMap-projected definition). The workstation laptop is **not**
-required at trigger time — so any miss is a runtime-robustness gap, not a
-"laptop was off" problem.
+Schedule, ConfigMap-projected definition) and is **not** laptop-dependent — but
+the triage's State Hub *data dependencies* (context resolution and report
+delivery) still route back to the workstation State Hub.
+
+This workplan still delivers worthwhile robustness — explicit run-miss recovery
+policies (T02) and missed-fire detection (T03) — but the fix for *this* incident
+is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).

 ## Desired run-miss options (from Bernd)

@@ -156,7 +157,7 @@ guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).
 id: ACTIVITY-WP-0014-T05
 status: todo
 priority: high
-state_hub_task_id: ""
+state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
 ```

 T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub