| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|
| ACTIVITY-WP-0014 | workplan | Schedule Misfire Robustness & Run-Miss Recovery Options | infotech | activity-core | active | claude | activity-core | 2026-06-23 | 2026-06-23 | 91b64686-5d17-4c86-bc9e-3d0ee6720cf5 |
Schedule Misfire Robustness & Run-Miss Recovery Options
Make cron-triggered ActivityDefinitions robust to missed fires (worker/Temporal unavailable at trigger time) with explicit, per-definition recovery behaviour, plus detection/alerting when a scheduled fire is missed.
Motivation
On 2026-06-22 and 2026-06-23 the daily-statehub-wsjf-triage definition
(cron 20 7 * * * Europe/Berlin, projected into the Railiance runtime ConfigMap
actcore-external-activity-definitions) produced no daily_triage progress
event at all: neither a success nor a "could not run; operator review required" failure.
Corrected by T01 (2026-06-23). The initial hypothesis below (that
_build_schedule() never set catchup_window, so a short default catchup window silently dropped the fire) was disproven on the live cluster. The Temporal schedule is healthy with CatchupWindow 365d (the server default) and 0 MissedCatchupWindow. The real cause is that the run fired and ran but failed at the report sink with Connection refused when posting to State Hub, because railiance01 reaches State Hub via a reverse tunnel back to the workstation, which is asleep at 07:20 Berlin. See the T01 findings and T05.
The trigger now originates entirely on railiance01 (in-cluster Temporal Schedule, ConfigMap-projected definition) and is not laptop-dependent — but the triage's State Hub data dependencies (context resolution and report delivery) still route back to the workstation State Hub.
This workplan still delivers worthwhile robustness — explicit run-miss recovery policies (T02) and missed-fire detection (T03) — but the fix for this incident is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).
Desired run-miss options (from Bernd)
Three explicit, per-definition behaviours when a fire is missed:
- Run on trigger or skip — never recover a missed fire.
- Run on trigger or later if missed — recover all missed fires when back up.
- Run on trigger or later if missed, but skip if next trigger reached — recover only the most recent missed fire; do not accumulate a backlog.
Proposed mapping to a new misfire_policy value set (names open to review):
| Policy | Semantics | Temporal mapping |
|---|---|---|
| skip | Run on trigger or skip | catchup_window ≈ 0, overlap=SKIP |
| catchup_all | Run on trigger or all missed later | catchup_window=<long>, overlap=BUFFER_ALL |
| catchup_latest | Run on trigger or only the latest missed | catchup_window ≈ 1 interval, overlap=BUFFER_ONE |
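The mapping in the table can be sketched as a small lookup. This is an illustration only, not the code in _build_schedule(): the Overlap enum is a stand-in that mirrors the overlap names in the table, and MisfireSettings, misfire_settings, LONG_CATCHUP and MIN_CATCHUP are hypothetical names. The concrete values for "<long>" and "≈ 0" are assumptions, since the table leaves them open.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Overlap(Enum):
    """Stand-in for the overlap policy names used in the table."""
    SKIP = "SKIP"
    BUFFER_ONE = "BUFFER_ONE"
    BUFFER_ALL = "BUFFER_ALL"


@dataclass(frozen=True)
class MisfireSettings:
    catchup_window: timedelta
    overlap: Overlap


# Assumed value for "<long>"; the table does not fix one.
LONG_CATCHUP = timedelta(days=365)
# Assumed small value for "≈ 0"; the table does not fix one.
MIN_CATCHUP = timedelta(seconds=10)


def misfire_settings(policy: str, interval: timedelta) -> MisfireSettings:
    """Translate a misfire_policy name into (catchup_window, overlap)."""
    if policy == "skip":
        return MisfireSettings(MIN_CATCHUP, Overlap.SKIP)
    if policy == "catchup_all":
        return MisfireSettings(LONG_CATCHUP, Overlap.BUFFER_ALL)
    if policy == "catchup_latest":
        # One schedule interval: only the most recent missed fire is in range.
        return MisfireSettings(interval, Overlap.BUFFER_ONE)
    raise ValueError(f"unknown misfire_policy: {policy!r}")
```

For a daily definition, misfire_settings("catchup_latest", timedelta(days=1)) yields a one-day catchup window with BUFFER_ONE, so a multi-day outage recovers a single run rather than a backlog.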
Confirm root cause on Railiance01
id: ACTIVITY-WP-0014-T01
status: done
priority: high
state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab"
Inspected via ssh railiance01 + in-node kubectl/temporal (no k3s tunnel is
defined for railiance01; the documented access path is SSH to the host).
Findings (2026-06-23) — the WP-0014 premise was wrong for this incident:
- All pods healthy; actcore-worker up 44h, 0 restarts. Not a crash.
- The daily-triage Temporal schedule (activity-schedule-6fca51fa-…) is healthy: Paused false, OverlapPolicy Skip, CatchupWindow 365d (Temporal's default when unset), ActionCounts {Total:8, MissedCatchupWindow:0}. So fires were not silently dropped; my original "no catchup window → silent drop" hypothesis does not hold, because the server default is already 365d.
- The 2026-06-23T05:20:00Z fire did fire and ran, then Failed at the report sink: report sink failure: state-hub-progress … '[Errno 111] Connection refused'. The run produced a report but could not deliver it to State Hub, so no daily_triage progress event (not even a "could not run" one) was posted → the silence. The 06-22 fire has no execution in retention (bridge likely down then too / schedule update window at LastUpdateAt 1d ago).
- Root cause is State Hub connectivity from railiance01, not Temporal. The in-cluster actcore-state-hub-bridge (hostNetwork) proxies to 127.0.0.1:18000 on the node, which is the local end of the ops-bridge reverse tunnel back to the workstation's State Hub. At 07:20 Europe/Berlin (= 05:20 UTC) the workstation/tunnel was unreachable → Connection refused. Chronic flakiness confirmed: 102 State Hub resolver timeouts in 24h (69 recently_on_scope, 33 consistency_sweep).
Implication: the trigger is independent of the laptop, but the triage's data dependencies (State Hub context resolution + report delivery) still route back to the workstation State Hub, which is asleep at 07:20 Berlin. WP-0014's misfire policies are still good robustness, but the real fix is (a) State Hub reachable from railiance01 independent of the workstation, and/or (b) sinks/resolvers resilient to transient State Hub unavailability (retry/backoff, store-and-forward) instead of hard-failing the workflow. Tracked as follow-up below. Backfill deferred: a replay only succeeds while the workstation State Hub is reachable.
Implement explicit misfire recovery modes
id: ACTIVITY-WP-0014-T02
status: done
priority: high
state_hub_task_id: "19615562-4cb2-4f25-872f-505d6e40dcc5"
Add catchup_window_seconds to CronTriggerConfig and redefine misfire_policy
into the three explicit modes above. In _build_schedule() set
SchedulePolicy(overlap=..., catchup_window=timedelta(...)) per mode. Remove the
ad-hoc 1-hour backfill hack in favour of native catchup-window semantics. Keep
backward compatibility for existing skip/catchup/compress values (alias
map). Unit tests for each mode's (catchup_window, overlap) mapping.
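A minimal sketch of the backward-compatibility alias map follows. The task only states that the legacy skip/catchup/compress values must keep working; which new mode each legacy value resolves to is an assumption here, and LEGACY_ALIASES, CANONICAL and normalize_misfire_policy are hypothetical names.

```python
# Assumed legacy -> canonical mapping; the workplan does not specify it.
LEGACY_ALIASES = {
    "catchup": "catchup_all",
    "compress": "catchup_latest",
}

CANONICAL = frozenset({"skip", "catchup_all", "catchup_latest"})


def normalize_misfire_policy(value: str) -> str:
    """Return the canonical mode name for a new or legacy policy value."""
    key = value.strip().lower()
    if key in CANONICAL:
        return key
    if key in LEGACY_ALIASES:
        return LEGACY_ALIASES[key]
    raise ValueError(f"unknown misfire_policy: {value!r}")
```

Normalising once at config-load time would let the rest of the scheduling code see only the three canonical names, which keeps the per-mode unit tests small.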
Missed-fire detection & alert sink
id: ACTIVITY-WP-0014-T03
status: done
priority: medium
state_hub_task_id: "dbedd96a-59ca-4b83-bce6-35755b076807"
Detect when a scheduled definition has no successful run within its expected
interval + tolerance, and emit a signal (State Hub progress event and/or
agent-inbox message) so a miss is visible even under skip. This is the
observability the current silent-drop behaviour lacks — a miss should never again
be invisible.
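The detection rule described above (no successful run within the expected interval plus tolerance) can be sketched as a pure function. This is an illustration under assumptions: is_missed and its parameters are hypothetical names, and treating "never succeeded" as a miss is a design choice the task does not settle.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_missed(
    last_success: Optional[datetime],
    now: datetime,
    interval: timedelta,
    tolerance: timedelta,
) -> bool:
    """True when no successful run falls inside interval + tolerance."""
    if last_success is None:
        # Assumption: a definition with no recorded success counts as missed.
        return True
    return now - last_success > interval + tolerance


# Example: a daily definition checked with a one-hour tolerance.
now = datetime(2026, 6, 23, 6, 30, tzinfo=timezone.utc)
stale = datetime(2026, 6, 21, 5, 20, tzinfo=timezone.utc)
fresh = datetime(2026, 6, 23, 5, 21, tzinfo=timezone.utc)
daily, slack = timedelta(days=1), timedelta(hours=1)
stale_missed = is_missed(stale, now, daily, slack)
fresh_missed = is_missed(fresh, now, daily, slack)
```

Because the check keys on the last successful run rather than on the schedule firing, it would also flag runs that fired and then failed, which is the case T01 found.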
Apply policy to runtime definitions & document
id: ACTIVITY-WP-0014-T04
status: progress
priority: medium
state_hub_task_id: "04e9d1d2-1192-4402-9402-b12c5d7d44e5"
Choose and set the appropriate misfire_policy for daily-statehub-wsjf-triage
(likely catchup_latest — one missed daily run should still run, but a
multi-day outage should not flood the triage feed). Update the Railiance runtime
ConfigMap / bundle, redeploy, and document the run-miss options + per-definition
guidance in docs/runbook.md. Depends on T01 (confirm) and T02 (modes exist).
Resilient State Hub sinks/resolvers (real incident fix)
id: ACTIVITY-WP-0014-T05
status: todo
priority: high
state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
T01 proved the 06-22/06-23 silence was not a Temporal misfire but a State Hub
Connection refused at the report sink (and chronic resolver timeouts) because
railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
is asleep at 07:20 Berlin. Misfire policies do not help: the run fires and fails
the same way. Make activity-core resilient to transient State Hub unavailability:
- Report sinks should retry with backoff and not hard-fail the workflow when the only failure is transient State Hub delivery; preserve the generated report (working-memory note + a deferred/outbox state-hub-progress) for later flush.
- Required State Hub context resolvers should retry/backoff and surface a clear, single diagnostic rather than a bare timed out.
- Separately (out of this repo): give railiance01 a State Hub endpoint that does not depend on the workstation being awake, or run the triage at a time the workstation is reliably up. Owner decision needed.
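The first bullet (retry with backoff, then keep the report for a later flush instead of failing the workflow) can be sketched as below. This is not activity-core's sink API: deliver_with_retry, the send callable and the in-memory outbox list are hypothetical, and a real outbox would need durable storage.

```python
import time
from typing import Callable, Dict, List

Report = Dict[str, object]


def deliver_with_retry(
    send: Callable[[Report], None],
    report: Report,
    outbox: List[Report],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver; on repeated transient failure, park the report."""
    for attempt in range(attempts):
        try:
            send(report)
            return True
        except (ConnectionError, TimeoutError):
            # ConnectionRefusedError (Errno 111) is a ConnectionError subclass.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    # All attempts failed: preserve the report instead of raising.
    outbox.append(report)
    return False
```

Returning False rather than raising lets the caller record a degraded-but-completed run, while the parked report stays available for a flush once State Hub is reachable again.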