docs(ACTIVITY-WP-0014): correct Motivation to match T01 findings

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 17:16:17 +02:00
parent 99e5d525a8
commit cf7a11dcd9

@@ -21,27 +21,28 @@ plus detection/alerting when a scheduled fire is missed.
 ## Motivation

 On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
-(cron `20 7 * * *` UTC, projected into the Railiance runtime ConfigMap
+(cron `20 7 * * *` Europe/Berlin, projected into the Railiance runtime ConfigMap
 `actcore-external-activity-definitions`) produced **no `daily_triage` progress
 event at all** — neither a success nor a `could not run; operator review
-required` failure. Complete silence means the workflow never started, i.e. the
-miss is at the **schedule/catchup level**, not the instruction level.
+required` failure.

-Root cause (code-level, confirmed in `schedule_manager.py`):
-`_build_schedule()` sets `SchedulePolicy(overlap=...)` but **never sets
-`catchup_window`**. Temporal's default catchup window is short, so if the worker
-pod or in-cluster Temporal (`actcore-temporal:7233`) was briefly unavailable at
-07:20 UTC, the fire is **silently dropped with no recovery and no signal**. The
-existing `misfire_policy` (`skip`/`catchup`/`compress`) only maps to
-`ScheduleOverlapPolicy` (concurrency of overlapping runs) — it does **not**
-control missed-fire recovery. The only recovery path is a manual 1-hour
-`backfill` that runs solely at `upsert_schedule` time and only when
-`misfire_policy == "catchup"` (the triage def uses the default `skip`).
+> **Corrected by T01 (2026-06-23).** The initial hypothesis below — that
+> `_build_schedule()` never set `catchup_window`, so a short-default catchup
+> window silently dropped the fire — was **disproven on the live cluster**. The
+> Temporal schedule is healthy with `CatchupWindow 365d` (the server default) and
+> `0 MissedCatchupWindow`. The real cause is that the run **fired and ran but
+> failed at the report sink** with `Connection refused` posting to State Hub,
+> because railiance01 reaches State Hub via a reverse tunnel back to the
+> workstation, which is asleep at 07:20 Berlin. See the T01 findings and T05.

 The trigger now originates entirely on **railiance01** (in-cluster Temporal
-Schedule, ConfigMap-projected definition). The workstation laptop is **not**
-required at trigger time — so any miss is a runtime-robustness gap, not a
-"laptop was off" problem.
+Schedule, ConfigMap-projected definition) and is **not** laptop-dependent — but
+the triage's State Hub *data dependencies* (context resolution and report
+delivery) still route back to the workstation State Hub.
+
+This workplan still delivers worthwhile robustness — explicit run-miss recovery
+policies (T02) and missed-fire detection (T03) — but the fix for *this* incident
+is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).

 ## Desired run-miss options (from Bernd)

@@ -156,7 +157,7 @@ guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).
 id: ACTIVITY-WP-0014-T05
 status: todo
 priority: high
-state_hub_task_id: ""
+state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
 ```

 T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub