generated from coulomb/repo-seed
docs(ACTIVITY-WP-0014): correct Motivation to match T01 findings
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -21,27 +21,28 @@ plus detection/alerting when a scheduled fire is missed.
 ## Motivation
 
 On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
-(cron `20 7 * * *` UTC, projected into the Railiance runtime ConfigMap
+(cron `20 7 * * *` Europe/Berlin, projected into the Railiance runtime ConfigMap
 `actcore-external-activity-definitions`) produced **no `daily_triage` progress
 event at all** — neither a success nor a `could not run; operator review
-required` failure. Complete silence means the workflow never started, i.e. the
-miss is at the **schedule/catchup level**, not the instruction level.
+required` failure.
 
-Root cause (code-level, confirmed in `schedule_manager.py`):
-`_build_schedule()` sets `SchedulePolicy(overlap=...)` but **never sets
-`catchup_window`**. Temporal's default catchup window is short, so if the worker
-pod or in-cluster Temporal (`actcore-temporal:7233`) was briefly unavailable at
-07:20 UTC, the fire is **silently dropped with no recovery and no signal**. The
-existing `misfire_policy` (`skip`/`catchup`/`compress`) only maps to
-`ScheduleOverlapPolicy` (concurrency of overlapping runs) — it does **not**
-control missed-fire recovery. The only recovery path is a manual 1-hour
-`backfill` that runs solely at `upsert_schedule` time and only when
-`misfire_policy == "catchup"` (the triage def uses the default `skip`).
+> **Corrected by T01 (2026-06-23).** The initial hypothesis below — that
+> `_build_schedule()` never set `catchup_window`, so a short-default catchup
+> window silently dropped the fire — was **disproven on the live cluster**. The
+> Temporal schedule is healthy with `CatchupWindow 365d` (the server default) and
+> `0 MissedCatchupWindow`. The real cause is that the run **fired and ran but
+> failed at the report sink** with `Connection refused` posting to State Hub,
+> because railiance01 reaches State Hub via a reverse tunnel back to the
+> workstation, which is asleep at 07:20 Berlin. See the T01 findings and T05.
 
 The trigger now originates entirely on **railiance01** (in-cluster Temporal
-Schedule, ConfigMap-projected definition). The workstation laptop is **not**
-required at trigger time — so any miss is a runtime-robustness gap, not a
-"laptop was off" problem.
+Schedule, ConfigMap-projected definition) and is **not** laptop-dependent — but
+the triage's State Hub *data dependencies* (context resolution and report
+delivery) still route back to the workstation State Hub.
 
+This workplan still delivers worthwhile robustness — explicit run-miss recovery
+policies (T02) and missed-fire detection (T03) — but the fix for *this* incident
+is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).
+
 ## Desired run-miss options (from Bernd)
 
@@ -156,7 +157,7 @@ guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).
 id: ACTIVITY-WP-0014-T05
 status: todo
 priority: high
-state_hub_task_id: ""
+state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
 ```
 
 T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub