From ffc0ee2cb74895bcf7ed38b50f67265943f0f3de Mon Sep 17 00:00:00 2001
From: tegwick
Date: Tue, 23 Jun 2026 13:46:19 +0200
Subject: [PATCH] feat(ACTIVITY-WP-0014): plan schedule misfire robustness & run-miss options

Cron fires are silently dropped: _build_schedule() sets SchedulePolicy(overlap=)
but never catchup_window, so a brief worker/Temporal outage at trigger time
drops the fire with no recovery and no signal (root cause of missing
06-22/06-23 daily triage runs).

Define three explicit run-miss policies: skip, catchup_all, catchup_latest,
plus missed-fire detection.

Co-Authored-By: Claude Opus 4.8
---
 ...ITY-WP-0014-schedule-misfire-robustness.md | 124 ++++++++++++++++++
 1 file changed, 124 insertions(+)
 create mode 100644 workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md

diff --git a/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md b/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
new file mode 100644
index 0000000..c092ef5
--- /dev/null
+++ b/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
+---
+id: ACTIVITY-WP-0014
+type: workplan
+title: "Schedule Misfire Robustness & Run-Miss Recovery Options"
+domain: infotech
+repo: activity-core
+status: proposed
+owner: claude
+topic_slug: activity-core
+created: "2026-06-23"
+updated: "2026-06-23"
+state_hub_workstream_id: ""
+---
+
+# Schedule Misfire Robustness & Run-Miss Recovery Options
+
+Make cron-triggered ActivityDefinitions robust to missed fires (worker/Temporal
+unavailable at trigger time) with explicit, per-definition recovery behaviour,
+plus detection/alerting when a scheduled fire is missed.
+
+## Motivation
+
+On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
+(cron `20 7 * * *` UTC, projected into the Railiance runtime ConfigMap
+`actcore-external-activity-definitions`) produced **no `daily_triage` progress
+event at all**: neither a success nor a `could not run; operator review
+required` failure.
+Complete silence means the workflow never started, i.e. the
+miss is at the **schedule/catchup level**, not the instruction level.
+
+Root cause (code-level, confirmed in `schedule_manager.py`):
+`_build_schedule()` sets `SchedulePolicy(overlap=...)` but **never sets
+`catchup_window`**. Temporal's default catchup window is short, so if the worker
+pod or in-cluster Temporal (`actcore-temporal:7233`) was briefly unavailable at
+07:20 UTC, the fire is **silently dropped with no recovery and no signal**. The
+existing `misfire_policy` (`skip`/`catchup`/`compress`) only maps to
+`ScheduleOverlapPolicy` (concurrency of overlapping runs); it does **not**
+control missed-fire recovery. The only recovery path is a manual 1-hour
+`backfill` that runs solely at `upsert_schedule` time and only when
+`misfire_policy == "catchup"` (the triage def uses the default `skip`).
+
+The trigger now originates entirely on **railiance01** (in-cluster Temporal
+Schedule, ConfigMap-projected definition). The workstation laptop is **not**
+required at trigger time, so any miss is a runtime-robustness gap, not a
+"laptop was off" problem.
+
+## Desired run-miss options (from Bernd)
+
+Three explicit, per-definition behaviours when a fire is missed:
+
+1. **Run on trigger or skip**: never recover a missed fire.
+2. **Run on trigger or later if missed**: recover **all** missed fires when back up.
+3. **Run on trigger or later if missed, but skip if next trigger reached**:
+   recover only the **most recent** missed fire; do not accumulate a backlog.
+
+Proposed mapping to a new `misfire_policy` value set (names open to review):
+
+| Policy | Semantics | Temporal mapping |
+| --- | --- | --- |
+| `skip` | Run on trigger or skip | `catchup_window ≈ 0`, `overlap=SKIP` |
+| `catchup_all` | Run on trigger or all missed later | `catchup_window=`, `overlap=BUFFER_ALL` |
+| `catchup_latest` | Run on trigger or only the latest missed | `catchup_window ≈ 1 interval`, `overlap=BUFFER_ONE` |
+
+## Confirm root cause on Railiance01
+
+```task
+id: ACTIVITY-WP-0014-T01
+status: todo
+priority: high
+state_hub_task_id: ""
+```
+
+Bring up the ops-bridge tunnel (`bridge up state-hub-coulombcore`,
+`bridge up state-hub-mcp-coulombcore`; tunnel to railiance01 was down from the
+workstation during diagnosis). Inspect the live Temporal schedule
+`activity-schedule-`: paused state, configured catchup window,
+recent action/fire history, and worker pod status. Confirm whether the
+06-22/06-23 07:20 UTC fires were dropped vs. failed. Backfill the missed runs if
+calibration evidence is still wanted. Record findings in the workplan.
+
+## Implement explicit misfire recovery modes
+
+```task
+id: ACTIVITY-WP-0014-T02
+status: todo
+priority: high
+state_hub_task_id: ""
+```
+
+Add `catchup_window_seconds` to `CronTriggerConfig` and redefine `misfire_policy`
+into the three explicit modes above. In `_build_schedule()` set
+`SchedulePolicy(overlap=..., catchup_window=timedelta(...))` per mode. Remove the
+ad-hoc 1-hour `backfill` hack in favour of native catchup-window semantics. Keep
+backward compatibility for existing `skip`/`catchup`/`compress` values (alias
+map). Unit tests for each mode's `(catchup_window, overlap)` mapping.
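A minimal sketch of the per-mode `(catchup_window, overlap)` mapping that T02 describes, kept free of the Temporal SDK so it runs standalone. Everything named here is an assumption for illustration: the `Overlap` enum is a local stand-in for `ScheduleOverlapPolicy`, `MisfireSettings`, `resolve_misfire` and `LEGACY_ALIASES` are hypothetical names, the legacy-to-new alias choices are a guess the workplan leaves open, and the 10-second value is an assumed floor used to approximate `catchup_window ≈ 0`.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class Overlap(Enum):
    """Local stand-in for Temporal's ScheduleOverlapPolicy (assumption)."""
    SKIP = "SKIP"
    BUFFER_ONE = "BUFFER_ONE"
    BUFFER_ALL = "BUFFER_ALL"


@dataclass(frozen=True)
class MisfireSettings:
    """Hypothetical result type: the two values _build_schedule() would pass on."""
    catchup_window: timedelta
    overlap: Overlap


# Hypothetical alias map for legacy values; the real choice is open to review.
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}

# Assumed smallest usable window, standing in for "catchup_window ≈ 0".
MIN_CATCHUP_WINDOW = timedelta(seconds=10)


def resolve_misfire(
    policy: str,
    interval: timedelta,
    catchup_window_seconds: Optional[int] = None,
) -> MisfireSettings:
    """Translate a misfire_policy value into (catchup_window, overlap)."""
    mode = LEGACY_ALIASES.get(policy, policy)
    if mode == "skip":
        # Never recover a missed fire.
        return MisfireSettings(MIN_CATCHUP_WINDOW, Overlap.SKIP)
    if mode == "catchup_all":
        # Recover every missed fire inside the configured window.
        if catchup_window_seconds is None:
            raise ValueError("catchup_all requires catchup_window_seconds")
        window = timedelta(seconds=catchup_window_seconds)
        return MisfireSettings(window, Overlap.BUFFER_ALL)
    if mode == "catchup_latest":
        # Recover only the most recent fire: window of one schedule interval.
        return MisfireSettings(interval, Overlap.BUFFER_ONE)
    raise ValueError(f"unknown misfire_policy: {policy!r}")
```

For the daily triage definition, `resolve_misfire("catchup_latest", timedelta(days=1))` would yield a one-day window with `BUFFER_ONE`. Keeping the mapping in a pure function like this is one way to make the per-mode unit tests trivial, since no Temporal client is needed.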
+
+## Missed-fire detection & alert sink
+
+```task
+id: ACTIVITY-WP-0014-T03
+status: todo
+priority: medium
+state_hub_task_id: ""
+```
+
+Detect when a scheduled definition has no successful run within its expected
+interval + tolerance, and emit a signal (State Hub progress event and/or
+agent-inbox message) so a miss is visible even under `skip`. This is the
+observability the current silent-drop behaviour lacks; a miss should never again
+be invisible.
+
+## Apply policy to runtime definitions & document
+
+```task
+id: ACTIVITY-WP-0014-T04
+status: wait
+priority: medium
+state_hub_task_id: ""
+```
+
+Choose and set the appropriate `misfire_policy` for `daily-statehub-wsjf-triage`
+(likely `catchup_latest`: one missed daily run should still run, but a
+multi-day outage should not flood the triage feed). Update the Railiance runtime
+ConfigMap / bundle, redeploy, and document the run-miss options + per-definition
+guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).
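The missed-fire check planned under T03 ("no successful run within its expected interval + tolerance") could be sketched as a pure predicate like the one below. The function name `is_fire_missed`, its parameters, and the decision to treat "never succeeded" as a miss are all assumptions for illustration; how the last success time is obtained and where the alert is sent are left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fire_missed(
    last_success: Optional[datetime],
    expected_interval: timedelta,
    tolerance: timedelta,
    now: datetime,
) -> bool:
    """Return True when no successful run falls within interval + tolerance.

    A definition that has never succeeded is reported as missed (assumption).
    """
    if last_success is None:
        return True
    return now - last_success > expected_interval + tolerance


# Example: daily 07:20 UTC schedule, checked at 08:00 the next day
# with a 30-minute tolerance and no run since the previous morning.
last_ok = datetime(2026, 6, 21, 7, 20, tzinfo=timezone.utc)
checked_at = datetime(2026, 6, 22, 8, 0, tzinfo=timezone.utc)
missed = is_fire_missed(last_ok, timedelta(days=1), timedelta(minutes=30), checked_at)
print(missed)  # True: 24h40m since last success exceeds 24h30m
```

Because the check depends only on run history and not on the schedule's recovery mode, it would flag a miss under `skip` as well, which is the visibility the task asks for.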