--- id: ACTIVITY-WP-0014 type: workplan title: "Schedule Misfire Robustness & Run-Miss Recovery Options" domain: infotech repo: activity-core status: active owner: claude topic_slug: activity-core created: "2026-06-23" updated: "2026-06-23" state_hub_workstream_id: "91b64686-5d17-4c86-bc9e-3d0ee6720cf5" --- # Schedule Misfire Robustness & Run-Miss Recovery Options Make cron-triggered ActivityDefinitions robust to missed fires (worker/Temporal unavailable at trigger time) with explicit, per-definition recovery behaviour, plus detection/alerting when a scheduled fire is missed. ## Motivation On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition (cron `20 7 * * *` UTC, projected into the Railiance runtime ConfigMap `actcore-external-activity-definitions`) produced **no `daily_triage` progress event at all** — neither a success nor a `could not run; operator review required` failure. Complete silence means the workflow never started, i.e. the miss is at the **schedule/catchup level**, not the instruction level. Root cause (code-level, confirmed in `schedule_manager.py`): `_build_schedule()` sets `SchedulePolicy(overlap=...)` but **never sets `catchup_window`**. Temporal's default catchup window is short, so if the worker pod or in-cluster Temporal (`actcore-temporal:7233`) was briefly unavailable at 07:20 UTC, the fire is **silently dropped with no recovery and no signal**. The existing `misfire_policy` (`skip`/`catchup`/`compress`) only maps to `ScheduleOverlapPolicy` (concurrency of overlapping runs) — it does **not** control missed-fire recovery. The only recovery path is a manual 1-hour `backfill` that runs solely at `upsert_schedule` time and only when `misfire_policy == "catchup"` (the triage def uses the default `skip`). The trigger now originates entirely on **railiance01** (in-cluster Temporal Schedule, ConfigMap-projected definition). The workstation laptop is **not** required at trigger time — so any miss is a runtime-robustness gap, not a "laptop was off" problem. ## Desired run-miss options (from Bernd) Three explicit, per-definition behaviours when a fire is missed: 1. **Run on trigger or skip** — never recover a missed fire. 2. **Run on trigger or later if missed** — recover **all** missed fires when back up. 3. **Run on trigger or later if missed, but skip if next trigger reached** — recover only the **most recent** missed fire; do not accumulate a backlog. Proposed mapping to a new `misfire_policy` value set (names open to review): | Policy | Semantics | Temporal mapping | | --- | --- | --- | | `skip` | Run on trigger or skip | `catchup_window ≈ 0`, `overlap=SKIP` | | `catchup_all` | Run on trigger or all missed later | `catchup_window=`, `overlap=BUFFER_ALL` | | `catchup_latest` | Run on trigger or only the latest missed | `catchup_window ≈ 1 interval`, `overlap=BUFFER_ONE` | ## Confirm root cause on Railiance01 ```task id: ACTIVITY-WP-0014-T01 status: todo priority: high state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab" ``` Bring up the ops-bridge tunnel (`bridge up state-hub-coulombcore`, `bridge up state-hub-mcp-coulombcore`; tunnel to railiance01 was down from the workstation during diagnosis). Inspect the live Temporal schedule `activity-schedule-`: paused state, configured catchup window, recent action/fire history, and worker pod status. Confirm whether the 06-22/06-23 07:20 UTC fires were dropped vs. failed. Backfill the missed runs if calibration evidence is still wanted. Record findings in the workplan. ## Implement explicit misfire recovery modes ```task id: ACTIVITY-WP-0014-T02 status: done priority: high state_hub_task_id: "19615562-4cb2-4f25-872f-505d6e40dcc5" ``` Add `catchup_window_seconds` to `CronTriggerConfig` and redefine `misfire_policy` into the three explicit modes above. In `_build_schedule()` set `SchedulePolicy(overlap=..., catchup_window=timedelta(...))` per mode. Remove the ad-hoc 1-hour `backfill` hack in favour of native catchup-window semantics. Keep backward compatibility for existing `skip`/`catchup`/`compress` values (alias map). Unit tests for each mode's `(catchup_window, overlap)` mapping. ## Missed-fire detection & alert sink ```task id: ACTIVITY-WP-0014-T03 status: todo priority: medium state_hub_task_id: "dbedd96a-59ca-4b83-bce6-35755b076807" ``` Detect when a scheduled definition has no successful run within its expected interval + tolerance, and emit a signal (State Hub progress event and/or agent-inbox message) so a miss is visible even under `skip`. This is the observability the current silent-drop behaviour lacks — a miss should never again be invisible. ## Apply policy to runtime definitions & document ```task id: ACTIVITY-WP-0014-T04 status: wait priority: medium state_hub_task_id: "04e9d1d2-1192-4402-9402-b12c5d7d44e5" ``` Choose and set the appropriate `misfire_policy` for `daily-statehub-wsjf-triage` (likely `catchup_latest` — one missed daily run should still run, but a multi-day outage should not flood the triage feed). Update the Railiance runtime ConfigMap / bundle, redeploy, and document the run-miss options + per-definition guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).