docs(ACTIVITY-WP-0014): correct Motivation to match T01 findings

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 17:16:17 +02:00
parent 99e5d525a8
commit cf7a11dcd9

@@ -21,27 +21,28 @@ plus detection/alerting when a scheduled fire is missed.
## Motivation
On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
(cron `20 7 * * *` UTC, projected into the Railiance runtime ConfigMap
(cron `20 7 * * *` Europe/Berlin, projected into the Railiance runtime ConfigMap
`actcore-external-activity-definitions`) produced **no `daily_triage` progress
event at all** — neither a success nor a `could not run; operator review
required` failure. Complete silence means the workflow never started, i.e. the
miss is at the **schedule/catchup level**, not the instruction level.
required` failure.
Root cause (code-level, confirmed in `schedule_manager.py`):
`_build_schedule()` sets `SchedulePolicy(overlap=...)` but **never sets
`catchup_window`**. Temporal's default catchup window is short, so if the worker
pod or in-cluster Temporal (`actcore-temporal:7233`) was briefly unavailable at
07:20 UTC, the fire is **silently dropped with no recovery and no signal**. The
existing `misfire_policy` (`skip`/`catchup`/`compress`) only maps to
`ScheduleOverlapPolicy` (concurrency of overlapping runs) — it does **not**
control missed-fire recovery. The only recovery path is a manual 1-hour
`backfill` that runs solely at `upsert_schedule` time and only when
`misfire_policy == "catchup"` (the triage def uses the default `skip`).
> **Corrected by T01 (2026-06-23).** The initial hypothesis below — that
> `_build_schedule()` never set `catchup_window`, so a short-default catchup
> window silently dropped the fire — was **disproven on the live cluster**. The
> Temporal schedule is healthy with `CatchupWindow 365d` (the server default) and
> `0 MissedCatchupWindow`. The real cause is that the run **fired and ran but
> failed at the report sink** with `Connection refused` posting to State Hub,
> because railiance01 reaches State Hub via a reverse tunnel back to the
> workstation, which is asleep at 07:20 Europe/Berlin. See the T01 findings and T05.
The trigger now originates entirely on **railiance01** (in-cluster Temporal
Schedule, ConfigMap-projected definition). The workstation laptop is **not**
required at trigger time — so any miss is a runtime-robustness gap, not a
"laptop was off" problem.
Schedule, ConfigMap-projected definition) and is **not** laptop-dependent — but
the triage's State Hub *data dependencies* (context resolution and report
delivery) still route back to the workstation State Hub.
This workplan still delivers worthwhile robustness — explicit run-miss recovery
policies (T02) and missed-fire detection (T03) — but the fix for *this* incident
is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).
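As a rough illustration of what a "resilient sink" in T05 could look like, the sketch below retries a report delivery with exponential backoff and spools the payload locally if State Hub stays unreachable. It is not the project's actual sink code: `deliver_report`, `post`, `spool`, and the retry parameters are hypothetical names chosen for this example.

```python
import time
from typing import Any, Callable, List, Optional


def deliver_report(
    post: Callable[[Any], None],          # hypothetical: callable that posts to State Hub
    payload: Any,
    *,
    attempts: int = 4,                    # assumed retry budget, not a project setting
    base_delay: float = 1.0,              # seconds before the first retry
    sleep: Callable[[float], None] = time.sleep,
    spool: Optional[List[Any]] = None,    # hypothetical local buffer for later redelivery
) -> bool:
    """Post a report, retrying on connection errors; spool it if all attempts fail."""
    for attempt in range(attempts):
        try:
            post(payload)
            return True
        except ConnectionError:
            # ConnectionRefusedError is a subclass of ConnectionError.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    if spool is not None:
        spool.append(payload)  # keep the report instead of losing it silently
    return False
```

With a sink shaped like this, a `Connection refused` at report time would leave a spooled payload and a `False` return value to alert on, rather than complete silence. Retries alone do not help while the workstation is asleep for hours, which is why T05 also calls for a workstation-independent State Hub endpoint.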
## Desired run-miss options (from Bernd)
@@ -156,7 +157,7 @@ guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).
id: ACTIVITY-WP-0014-T05
status: todo
priority: high
state_hub_task_id: ""
state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
```
T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub