feat(ACTIVITY-WP-0014): plan schedule misfire robustness & run-miss options

Cron fires are silently dropped: _build_schedule() sets SchedulePolicy(overlap=...) but never sets catchup_window, so a brief worker/Temporal outage at trigger time drops the fire with no recovery and no signal (root cause of the missing 06-22/06-23 daily triage runs).

Define three explicit run-miss policies (skip, catchup_all, catchup_latest) plus missed-fire detection.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 13:46:19 +02:00
parent 59b3b73061
commit ffc0ee2cb7
1 changed file with 124 additions and 0 deletions
--- a/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
+++ b/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
@@ -0,0 +1,124 @@
+---
+id: ACTIVITY-WP-0014
+type: workplan
+title: "Schedule Misfire Robustness & Run-Miss Recovery Options"
+domain: infotech
+repo: activity-core
+status: proposed
+owner: claude
+topic_slug: activity-core
+created: "2026-06-23"
+updated: "2026-06-23"
+state_hub_workstream_id: ""
+---
+
+# Schedule Misfire Robustness & Run-Miss Recovery Options
+
+Make cron-triggered ActivityDefinitions robust to missed fires (worker/Temporal
+unavailable at trigger time) with explicit, per-definition recovery behaviour,
+plus detection/alerting when a scheduled fire is missed.
+
+## Motivation
+
+On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
+(cron `20 7 * * *` UTC, projected into the Railiance runtime ConfigMap
+`actcore-external-activity-definitions`) produced **no `daily_triage` progress
+event at all** — neither a success nor a `could not run; operator review
+required` failure. Complete silence means the workflow never started, i.e. the
+miss is at the **schedule/catchup level**, not the instruction level.
+
+Root cause (code-level, confirmed in `schedule_manager.py`):
+`_build_schedule()` sets `SchedulePolicy(overlap=...)` but **never sets
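+`catchup_window`**. The schedule therefore runs with Temporal's default catchup
+window, which this analysis takes to be short (the actual default depends on the
+SDK and server version and should be checked in T01). If the worker pod or
+in-cluster Temporal (`actcore-temporal:7233`) was briefly unavailable at
+07:20 UTC, the fire is **silently dropped with no recovery and no signal**. The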
+existing `misfire_policy` (`skip`/`catchup`/`compress`) only maps to
+`ScheduleOverlapPolicy` (concurrency of overlapping runs) — it does **not**
+control missed-fire recovery. The only recovery path is a manual 1-hour
+`backfill` that runs solely at `upsert_schedule` time and only when
+`misfire_policy == "catchup"` (the triage def uses the default `skip`).
+
+The trigger now originates entirely on **railiance01** (in-cluster Temporal
+Schedule, ConfigMap-projected definition). The workstation laptop is **not**
+required at trigger time — so any miss is a runtime-robustness gap, not a
+"laptop was off" problem.
+
+## Desired run-miss options (from Bernd)
+
+Three explicit, per-definition behaviours when a fire is missed:
+
+1. **Run on trigger or skip** — never recover a missed fire.
+2. **Run on trigger or later if missed** — recover **all** missed fires when back up.
+3. **Run on trigger or later if missed, but skip if next trigger reached** —
+   recover only the **most recent** missed fire; do not accumulate a backlog.
+
+Proposed mapping to a new `misfire_policy` value set (names open to review):
+
+| Policy | Semantics | Temporal mapping |
+| --- | --- | --- |
+| `skip` | Run on trigger or skip | `catchup_window ≈ 0`, `overlap=SKIP` |
+| `catchup_all` | Run on trigger or all missed later | `catchup_window=<long>`, `overlap=BUFFER_ALL` |
+| `catchup_latest` | Run on trigger or only the latest missed | `catchup_window ≈ 1 interval`, `overlap=BUFFER_ONE` |
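
The mapping in the table above could be encoded roughly as follows. This is a minimal sketch, not the code in `schedule_manager.py`: `OverlapPolicy` is a local stand-in for Temporal's `ScheduleOverlapPolicy`, `resolve_misfire_policy` is a hypothetical helper name, and the concrete window lengths (10 seconds, 30 days) are placeholder assumptions that the workplan leaves open.

```python
from __future__ import annotations

from datetime import timedelta
from enum import Enum


class OverlapPolicy(Enum):
    """Local stand-in for Temporal's ScheduleOverlapPolicy (assumption)."""

    SKIP = "skip"
    BUFFER_ONE = "buffer_one"
    BUFFER_ALL = "buffer_all"


# Placeholder durations, not taken from the workplan.
MIN_CATCHUP = timedelta(seconds=10)  # stands in for "catchup_window ≈ 0"
LONG_CATCHUP = timedelta(days=30)    # stands in for "catchup_window=<long>"


def resolve_misfire_policy(
    policy: str, interval: timedelta
) -> tuple[timedelta, OverlapPolicy]:
    """Return (catchup_window, overlap) for one of the three proposed modes."""
    if policy == "skip":
        return MIN_CATCHUP, OverlapPolicy.SKIP
    if policy == "catchup_all":
        return LONG_CATCHUP, OverlapPolicy.BUFFER_ALL
    if policy == "catchup_latest":
        # About one interval: once the next trigger is reached, the older
        # missed fire falls outside the window and is not recovered.
        return interval, OverlapPolicy.BUFFER_ONE
    raise ValueError(f"unknown misfire_policy: {policy!r}")
```

For a daily definition, `resolve_misfire_policy("catchup_latest", timedelta(days=1))` yields a one-day window with `BUFFER_ONE`. In the real builder the returned pair would feed `SchedulePolicy(overlap=..., catchup_window=...)`.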
+
+## Confirm root cause on Railiance01
+
+```task
+id: ACTIVITY-WP-0014-T01
+status: todo
+priority: high
+state_hub_task_id: ""
+```
+
+Bring up the ops-bridge tunnel (`bridge up state-hub-coulombcore`,
+`bridge up state-hub-mcp-coulombcore`; tunnel to railiance01 was down from the
+workstation during diagnosis). Inspect the live Temporal schedule
+`activity-schedule-<daily-triage-id>`: paused state, configured catchup window,
+recent action/fire history, and worker pod status. Confirm whether the
+06-22/06-23 07:20 UTC fires were dropped vs. failed. Backfill the missed runs if
+calibration evidence is still wanted. Record findings in the workplan.
+
+## Implement explicit misfire recovery modes
+
+```task
+id: ACTIVITY-WP-0014-T02
+status: todo
+priority: high
+state_hub_task_id: ""
+```
+
+Add `catchup_window_seconds` to `CronTriggerConfig` and redefine `misfire_policy`
+into the three explicit modes above. In `_build_schedule()` set
+`SchedulePolicy(overlap=..., catchup_window=timedelta(...))` per mode. Remove the
+ad-hoc 1-hour `backfill` hack in favour of native catchup-window semantics. Keep
+backward compatibility for existing `skip`/`catchup`/`compress` values (alias
+map). Add unit tests for each mode's `(catchup_window, overlap)` mapping.
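
The backward-compatibility alias map might look like the sketch below. The targets chosen for the legacy values (`catchup` to `catchup_all`, `compress` to `catchup_latest`) are assumptions for illustration only; the workplan does not say which new mode each legacy value should resolve to. `normalize_misfire_policy` is a hypothetical helper name.

```python
from __future__ import annotations

# The three proposed modes from this workplan.
NEW_MODES = frozenset({"skip", "catchup_all", "catchup_latest"})

# Hypothetical alias targets; the actual choices are open to review.
LEGACY_ALIASES = {
    "catchup": "catchup_all",
    "compress": "catchup_latest",
}


def normalize_misfire_policy(value: str) -> str:
    """Map a configured misfire_policy (new or legacy) to a new mode name."""
    if value in NEW_MODES:
        # "skip" exists in both value sets and passes through unchanged.
        return value
    if value in LEGACY_ALIASES:
        return LEGACY_ALIASES[value]
    raise ValueError(f"unknown misfire_policy: {value!r}")
```

Normalizing once, at config-load time, would let `_build_schedule()` handle only the three new modes while existing definitions keep loading.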
+
+## Missed-fire detection & alert sink
+
+```task
+id: ACTIVITY-WP-0014-T03
+status: todo
+priority: medium
+state_hub_task_id: ""
+```
+
+Detect when a scheduled definition has no successful run within its expected
+interval + tolerance, and emit a signal (State Hub progress event and/or
+agent-inbox message) so a miss is visible even under `skip`. This is the
+observability the current silent-drop behaviour lacks — a miss should never again
+be invisible.
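
The core check described above ("no successful run within expected interval + tolerance") can be sketched as a pure function. The name `is_fire_missed`, its parameters, and the choice to treat a definition with no recorded success as missed are all assumptions; how the last successful run is looked up and where the signal is sent are left out.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fire_missed(
    last_success: Optional[datetime],
    now: datetime,
    interval: timedelta,
    tolerance: timedelta,
) -> bool:
    """True when no successful run lies within interval + tolerance of now."""
    if last_success is None:
        # Assumption: a definition that has never succeeded counts as missed.
        return True
    return now - last_success > interval + tolerance


# Daily triage at 07:20 UTC, checked with a 30-minute tolerance.
last_ok = datetime(2026, 6, 21, 7, 20, tzinfo=timezone.utc)
check_early = datetime(2026, 6, 22, 7, 40, tzinfo=timezone.utc)
check_late = datetime(2026, 6, 22, 8, 0, tzinfo=timezone.utc)
day, slack = timedelta(days=1), timedelta(minutes=30)

missed_early = is_fire_missed(last_ok, check_early, day, slack)  # still in tolerance
missed_late = is_fire_missed(last_ok, check_late, day, slack)    # past tolerance
```

Because the check depends only on the last success time, it works under every mode, including `skip`, where Temporal itself never retries the fire.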
+
+## Apply policy to runtime definitions & document
+
+```task
+id: ACTIVITY-WP-0014-T04
+status: wait
+priority: medium
+state_hub_task_id: ""
+```
+
+Choose and set the appropriate `misfire_policy` for `daily-statehub-wsjf-triage`
+(likely `catchup_latest` — one missed daily run should still run, but a
+multi-day outage should not flood the triage feed). Update the Railiance runtime
+ConfigMap / bundle, redeploy, and document the run-miss options + per-definition
+guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).