Updated by fix-consistency on 2026-06-23: - ACTIVITY-WP-0014-T04: progress → wait
| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|
| ACTIVITY-WP-0014 | workplan | Schedule Misfire Robustness & Run-Miss Recovery Options | infotech | activity-core | active | claude | activity-core | 2026-06-23 | 2026-06-23 | 91b64686-5d17-4c86-bc9e-3d0ee6720cf5 |
Schedule Misfire Robustness & Run-Miss Recovery Options
Make cron-triggered ActivityDefinitions robust to missed fires (worker/Temporal unavailable at trigger time) with explicit, per-definition recovery behaviour, plus detection/alerting when a scheduled fire is missed.
Motivation
On 2026-06-22 and 2026-06-23 the daily-statehub-wsjf-triage definition
(cron 20 7 * * * UTC, projected into the Railiance runtime ConfigMap
actcore-external-activity-definitions) produced no daily_triage progress
event at all: neither a success nor a "could not run; operator review required"
failure. Complete silence means the workflow never started, i.e. the
miss is at the schedule/catchup level, not the instruction level.
Root cause (code-level, confirmed in schedule_manager.py):
_build_schedule() sets SchedulePolicy(overlap=...) but never sets
catchup_window. Temporal's default catchup window is short, so if the worker
pod or in-cluster Temporal (actcore-temporal:7233) was briefly unavailable at
07:20 UTC, the fire is silently dropped with no recovery and no signal. The
existing misfire_policy (skip/catchup/compress) only maps to
ScheduleOverlapPolicy (concurrency of overlapping runs) — it does not
control missed-fire recovery. The only recovery path is a manual 1-hour
backfill that runs solely at upsert_schedule time and only when
misfire_policy == "catchup" (the triage def uses the default skip).
The trigger now originates entirely on railiance01 (in-cluster Temporal Schedule, ConfigMap-projected definition). The workstation laptop is not required at trigger time — so any miss is a runtime-robustness gap, not a "laptop was off" problem.
Desired run-miss options (from Bernd)
Three explicit, per-definition behaviours when a fire is missed:
- Run on trigger or skip — never recover a missed fire.
- Run on trigger or later if missed — recover all missed fires when back up.
- Run on trigger or later if missed, but skip if next trigger reached — recover only the most recent missed fire; do not accumulate a backlog.
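To make the three behaviours concrete, here is a minimal sketch of which missed fires each one would recover once the runtime is back. It is an illustration only, not code from activity-core: the function name `fires_to_recover` is hypothetical, and the mode names are borrowed from the proposed `misfire_policy` values (skip, catchup_all, catchup_latest).

```python
from datetime import datetime, timezone


def fires_to_recover(missed: list[datetime], mode: str) -> list[datetime]:
    """Return the missed fire times that should still run after an outage.

    Hypothetical helper for illustration; mode names follow the proposal.
    """
    if mode == "skip":
        # Run on trigger or skip: never recover a missed fire.
        return []
    if mode == "catchup_all":
        # Run on trigger or later if missed: recover every missed fire.
        return sorted(missed)
    if mode == "catchup_latest":
        # Recover only the most recent missed fire; no backlog.
        return [max(missed)] if missed else []
    raise ValueError(f"unknown mode: {mode!r}")


# The two 07:20 UTC fires that produced no event.
missed = [
    datetime(2026, 6, 22, 7, 20, tzinfo=timezone.utc),
    datetime(2026, 6, 23, 7, 20, tzinfo=timezone.utc),
]

recovered_skip = fires_to_recover(missed, "skip")
recovered_all = fires_to_recover(missed, "catchup_all")
recovered_latest = fires_to_recover(missed, "catchup_latest")
```

For the two-day miss, skip recovers nothing, catchup_all recovers both fires, and catchup_latest recovers only the 06-23 fire.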
Proposed mapping to a new misfire_policy value set (names open to review):
| Policy | Semantics | Temporal mapping |
|---|---|---|
| skip | Run on trigger or skip | catchup_window ≈ 0, overlap=SKIP |
| catchup_all | Run on trigger or all missed later | catchup_window=<long>, overlap=BUFFER_ALL |
| catchup_latest | Run on trigger or only the latest missed | catchup_window ≈ 1 interval, overlap=BUFFER_ONE |
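The mapping could be expressed as a small lookup, sketched below with stdlib stand-ins so it runs without the Temporal SDK. Everything here is an assumption for illustration: `Overlap` mirrors only the three overlap values named in the table, `PolicySettings` and `settings_for` are hypothetical names, and the concrete durations (10 seconds for "≈ 0", 7 days for "<long>") are placeholders that the workplan does not fix.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Overlap(Enum):
    """Stand-in for the overlap policies named in the table."""
    SKIP = "SKIP"
    BUFFER_ONE = "BUFFER_ONE"
    BUFFER_ALL = "BUFFER_ALL"


@dataclass(frozen=True)
class PolicySettings:
    catchup_window: timedelta
    overlap: Overlap


# Placeholder durations (assumptions, not values from the workplan).
NEAR_ZERO_CATCHUP = timedelta(seconds=10)  # stands in for "≈ 0"
LONG_CATCHUP = timedelta(days=7)           # stands in for "<long>"


def settings_for(policy: str, interval: timedelta) -> PolicySettings:
    """Map a proposed misfire_policy value to (catchup_window, overlap)."""
    if policy == "skip":
        return PolicySettings(NEAR_ZERO_CATCHUP, Overlap.SKIP)
    if policy == "catchup_all":
        return PolicySettings(LONG_CATCHUP, Overlap.BUFFER_ALL)
    if policy == "catchup_latest":
        # About one schedule interval, so only the latest miss is in range.
        return PolicySettings(interval, Overlap.BUFFER_ONE)
    raise ValueError(f"unknown misfire_policy: {policy!r}")


daily = timedelta(days=1)
latest = settings_for("catchup_latest", daily)
```

With a daily cron, `latest` carries a one-day catchup window and BUFFER_ONE; the real implementation would pass the equivalent values into SchedulePolicy.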
Confirm root cause on Railiance01
id: ACTIVITY-WP-0014-T01
status: todo
priority: high
state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab"
Bring up the ops-bridge tunnel (bridge up state-hub-coulombcore,
bridge up state-hub-mcp-coulombcore; tunnel to railiance01 was down from the
workstation during diagnosis). Inspect the live Temporal schedule
activity-schedule-<daily-triage-id>: paused state, configured catchup window,
recent action/fire history, and worker pod status. Confirm whether the
06-22/06-23 07:20 UTC fires were dropped vs. failed. Backfill the missed runs if
calibration evidence is still wanted. Record findings in the workplan.
Implement explicit misfire recovery modes
id: ACTIVITY-WP-0014-T02
status: done
priority: high
state_hub_task_id: "19615562-4cb2-4f25-872f-505d6e40dcc5"
Add catchup_window_seconds to CronTriggerConfig and redefine misfire_policy
into the three explicit modes above. In _build_schedule() set
SchedulePolicy(overlap=..., catchup_window=timedelta(...)) per mode. Remove the
ad-hoc 1-hour backfill hack in favour of native catchup-window semantics. Keep
backward compatibility for existing skip/catchup/compress values (alias
map). Unit tests for each mode's (catchup_window, overlap) mapping.
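The backward-compatibility alias map might look like the sketch below. The workplan does not say which new mode each legacy value should resolve to, so the pairing here (catchup to catchup_all, compress to catchup_latest) is an assumption that needs review, and `normalize_misfire_policy` is a hypothetical name.

```python
# New canonical modes proposed in this workplan.
MODES = frozenset({"skip", "catchup_all", "catchup_latest"})

# Assumed pairing of legacy values to new modes; open to review.
LEGACY_ALIASES = {
    "catchup": "catchup_all",
    "compress": "catchup_latest",
}


def normalize_misfire_policy(value: str) -> str:
    """Resolve a legacy or new misfire_policy value to a canonical mode."""
    canonical = LEGACY_ALIASES.get(value, value)
    if canonical not in MODES:
        raise ValueError(f"unknown misfire_policy: {value!r}")
    return canonical


resolved = [normalize_misfire_policy(v) for v in ("skip", "catchup", "compress")]
```

Resolving aliases in one place keeps `_build_schedule()` working with the three canonical modes only, and existing definitions that still say catchup or compress continue to load.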
Missed-fire detection & alert sink
id: ACTIVITY-WP-0014-T03
status: todo
priority: medium
state_hub_task_id: "dbedd96a-59ca-4b83-bce6-35755b076807"
Detect when a scheduled definition has no successful run within its expected
interval + tolerance, and emit a signal (State Hub progress event and/or
agent-inbox message) so a miss is visible even under skip. This is the
observability the current silent-drop behaviour lacks — a miss should never again
be invisible.
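The core check can be sketched as a pure function, shown below. The name `is_missed`, the 30-minute tolerance, and the sample timestamps are all hypothetical; how the last successful run is looked up and where the signal is sent are left open, as in the task description.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_missed(
    last_success: Optional[datetime],
    now: datetime,
    interval: timedelta,
    tolerance: timedelta,
) -> bool:
    """True if no successful run falls within interval + tolerance of now."""
    if last_success is None:
        # A definition with no recorded success is treated as missed here.
        return True
    return now - last_success > interval + tolerance


# Hypothetical sample values for a daily 07:20 UTC schedule.
last_ok = datetime(2026, 6, 21, 7, 20, tzinfo=timezone.utc)
interval = timedelta(days=1)
tolerance = timedelta(minutes=30)

on_time_check = is_missed(
    last_ok, datetime(2026, 6, 22, 7, 40, tzinfo=timezone.utc), interval, tolerance
)
late_check = is_missed(
    last_ok, datetime(2026, 6, 22, 8, 0, tzinfo=timezone.utc), interval, tolerance
)
```

At 07:40 the run is only 20 minutes overdue, inside the tolerance, so nothing is flagged. At 08:00 it is 40 minutes overdue and the check reports a miss, which is the point at which an event or inbox message would be emitted.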
Apply policy to runtime definitions & document
id: ACTIVITY-WP-0014-T04
status: wait
priority: medium
state_hub_task_id: "04e9d1d2-1192-4402-9402-b12c5d7d44e5"
Choose and set the appropriate misfire_policy for daily-statehub-wsjf-triage
(likely catchup_latest — one missed daily run should still run, but a
multi-day outage should not flood the triage feed). Update the Railiance runtime
ConfigMap / bundle, redeploy, and document the run-miss options + per-definition
guidance in docs/runbook.md. Depends on T01 (confirm) and T02 (modes exist).