activity-core/workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
2026-06-23 17:16:17 +02:00

id: ACTIVITY-WP-0014
type: workplan
title: Schedule Misfire Robustness & Run-Miss Recovery Options
domain: infotech
repo: activity-core
status: active
owner: claude
topic_slug: activity-core
created: 2026-06-23
updated: 2026-06-23
state_hub_workstream_id: "91b64686-5d17-4c86-bc9e-3d0ee6720cf5"

Schedule Misfire Robustness & Run-Miss Recovery Options

Make cron-triggered ActivityDefinitions robust to missed fires (worker/Temporal unavailable at trigger time) with explicit, per-definition recovery behaviour, plus detection/alerting when a scheduled fire is missed.

Motivation

On 2026-06-22 and 2026-06-23 the daily-statehub-wsjf-triage definition (cron 20 7 * * * Europe/Berlin, projected into the Railiance runtime ConfigMap actcore-external-activity-definitions) produced no daily_triage progress event at all: neither a success nor a "could not run; operator review required" failure.

Corrected by T01 (2026-06-23). The initial hypothesis below, that _build_schedule() never set catchup_window and a short default catchup window therefore silently dropped the fire, was disproven on the live cluster. The Temporal schedule is healthy, with CatchupWindow 365d (the server default) and MissedCatchupWindow 0. The real cause: the run fired and executed but failed at the report sink with "Connection refused" when posting to State Hub, because railiance01 reaches State Hub via a reverse tunnel back to the workstation, which is asleep at 07:20 Berlin. See the T01 findings and T05.

The trigger now originates entirely on railiance01 (in-cluster Temporal Schedule, ConfigMap-projected definition) and is not laptop-dependent — but the triage's State Hub data dependencies (context resolution and report delivery) still route back to the workstation State Hub.

This workplan still delivers worthwhile robustness — explicit run-miss recovery policies (T02) and missed-fire detection (T03) — but the fix for this incident is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).

Desired run-miss options (from Bernd)

Three explicit, per-definition behaviours when a fire is missed:

  1. Run on trigger or skip — never recover a missed fire.
  2. Run on trigger or later if missed — recover all missed fires when back up.
  3. Run on trigger or later if missed, but skip if next trigger reached — recover only the most recent missed fire; do not accumulate a backlog.

Proposed mapping to a new misfire_policy value set (names open to review):

| Policy | Semantics | Temporal mapping |
| --- | --- | --- |
| skip | Run on trigger or skip | catchup_window ≈ 0, overlap=SKIP |
| catchup_all | Run on trigger or all missed later | catchup_window=<long>, overlap=BUFFER_ALL |
| catchup_latest | Run on trigger or only the latest missed | catchup_window ≈ 1 interval, overlap=BUFFER_ONE |

Confirm root cause on Railiance01

id: ACTIVITY-WP-0014-T01
status: done
priority: high
state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab"

Inspected via ssh railiance01 + in-node kubectl/temporal (no k3s tunnel is defined for railiance01; the documented access path is SSH to the host).

Findings (2026-06-23) — the WP-0014 premise was wrong for this incident:

  • All pods healthy; actcore-worker up 44h, 0 restarts. Not a crash.
  • The daily-triage Temporal schedule (activity-schedule-6fca51fa-…) is healthy: Paused false, OverlapPolicy Skip, CatchupWindow 365d (Temporal's default when unset), ActionCounts {Total:8, MissedCatchupWindow:0}. Fires were therefore not silently dropped. My original "no catchup window → silent drop" hypothesis does not hold; the server default is already 365d.
  • The 2026-06-23T05:20:00Z fire did trigger and the run executed, but it ended Failed at the report sink: report sink failure: state-hub-progress … '[Errno 111] Connection refused'. The run produced a report but could not deliver it to State Hub, so no daily_triage progress event (not even a "could not run" one) was posted. That is the silence. The 06-22 fire has no execution in retention, likely because the bridge was down then too or because the fire fell in the schedule-update window (LastUpdateAt 1d ago).
  • Root cause is State Hub connectivity from railiance01, not Temporal. The in-cluster actcore-state-hub-bridge (hostNetwork) proxies to 127.0.0.1:18000 on the node — the local end of the ops-bridge reverse tunnel back to the workstation's State Hub. At 07:20 Europe/Berlin (= 05:20 UTC) the workstation/tunnel was unreachable → Connection refused. Chronic flakiness confirmed: 102 State Hub resolver timeouts in 24h (69 recently_on_scope, 33 consistency_sweep).

Implication: the trigger is independent of the laptop, but the triage's data dependencies (State Hub context resolution and report delivery) still route back to the workstation State Hub, which is asleep at 07:20 Berlin. WP-0014's misfire policies remain worthwhile robustness, but the real fix is (a) making State Hub reachable from railiance01 independently of the workstation, and/or (b) making sinks/resolvers resilient to transient State Hub unavailability (retry/backoff, store-and-forward) instead of hard-failing the workflow. Tracked as follow-up T05 below. Backfill is deferred: a replay only succeeds while the workstation State Hub is reachable.
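The local-to-UTC arithmetic behind the 05:20 UTC fire time can be checked with the standard library. This is a small sketch, not code from activity-core; the variable names are illustrative. It also shows that the UTC instant of a 07:20 Europe/Berlin fire moves by an hour in winter, which matters when reading execution timestamps.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")

# Summer fire from the incident: Berlin is on CEST (UTC+2) in June.
summer_fire = datetime(2026, 6, 23, 7, 20, tzinfo=BERLIN)
summer_utc = summer_fire.astimezone(timezone.utc)

# Same wall-clock time in winter: Berlin is on CET (UTC+1) in January.
winter_fire = datetime(2026, 1, 15, 7, 20, tzinfo=BERLIN)
winter_utc = winter_fire.astimezone(timezone.utc)

print(summer_utc.isoformat())  # 2026-06-23T05:20:00+00:00
print(winter_utc.isoformat())  # 2026-01-15T06:20:00+00:00
```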

Implement explicit misfire recovery modes

id: ACTIVITY-WP-0014-T02
status: done
priority: high
state_hub_task_id: "19615562-4cb2-4f25-872f-505d6e40dcc5"

Add catchup_window_seconds to CronTriggerConfig and redefine misfire_policy as the three explicit modes above. In _build_schedule(), set SchedulePolicy(overlap=..., catchup_window=timedelta(...)) per mode. Remove the ad-hoc 1-hour backfill hack in favour of native catchup-window semantics. Keep backward compatibility for the existing skip/catchup/compress values via an alias map. Add unit tests for each mode's (catchup_window, overlap) mapping.
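A minimal sketch of the per-mode mapping follows, kept free of the Temporal SDK so it runs on its own. Everything named here is hypothetical: Overlap stands in for Temporal's overlap policy enum, PolicySettings and resolve_policy are illustrative helpers, the pairing of the legacy catchup and compress values with the new modes is an assumption, and the 10-second and 365-day windows are placeholder choices for "≈ 0" and "<long>". In the real _build_schedule() the resolved pair would be passed to SchedulePolicy(overlap=..., catchup_window=...).

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class Overlap(Enum):
    # Stand-ins named after the Temporal overlap policies used in the table.
    SKIP = "SKIP"
    BUFFER_ONE = "BUFFER_ONE"
    BUFFER_ALL = "BUFFER_ALL"


@dataclass(frozen=True)
class PolicySettings:
    catchup_window: timedelta
    overlap: Overlap


# Assumption: how legacy values map onto the new modes is not stated in the plan.
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}

MIN_WINDOW = timedelta(seconds=10)  # assumed stand-in for "catchup_window ≈ 0"
LONG_WINDOW = timedelta(days=365)   # assumed stand-in for "<long>"


def resolve_policy(misfire_policy: str, interval: timedelta,
                   catchup_window_seconds: Optional[int] = None) -> PolicySettings:
    """Translate a misfire_policy name into (catchup_window, overlap)."""
    name = LEGACY_ALIASES.get(misfire_policy, misfire_policy)
    if name == "skip":
        settings = PolicySettings(MIN_WINDOW, Overlap.SKIP)
    elif name == "catchup_all":
        settings = PolicySettings(LONG_WINDOW, Overlap.BUFFER_ALL)
    elif name == "catchup_latest":
        # One schedule interval: only the most recent missed fire is in range.
        settings = PolicySettings(interval, Overlap.BUFFER_ONE)
    else:
        raise ValueError(f"unknown misfire_policy: {misfire_policy!r}")
    if catchup_window_seconds is not None:
        # An explicit per-definition window overrides the mode's default.
        settings = PolicySettings(timedelta(seconds=catchup_window_seconds),
                                  settings.overlap)
    return settings
```

For a daily definition, resolve_policy("catchup_latest", timedelta(days=1)) yields a one-day window with BUFFER_ONE, and the legacy value "compress" resolves to the same pair under the assumed alias map.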

Missed-fire detection & alert sink

id: ACTIVITY-WP-0014-T03
status: done
priority: medium
state_hub_task_id: "dbedd96a-59ca-4b83-bce6-35755b076807"

Detect when a scheduled definition has no successful run within its expected interval + tolerance, and emit a signal (State Hub progress event and/or agent-inbox message) so a miss is visible even under skip. This is the observability the current behaviour lacks, where a missed or failed fire leaves no trace: a miss should never again be invisible.
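The "no successful run within interval + tolerance" check can be sketched as a pure function. The name is_missed and its parameters are hypothetical, and treating a definition that has never succeeded as missed is an assumption that a real detector might want to relax for newly created definitions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_missed(last_success: Optional[datetime], now: datetime,
              interval: timedelta, tolerance: timedelta) -> bool:
    """Return True when the last success is older than interval + tolerance."""
    if last_success is None:
        # Assumption: a definition with no recorded success counts as missed.
        return True
    return now - last_success > interval + tolerance


# Daily definition with a 30-minute tolerance, checked after a skipped fire.
last_ok = datetime(2026, 6, 21, 5, 20, tzinfo=timezone.utc)
checked_at = datetime(2026, 6, 22, 6, 0, tzinfo=timezone.utc)
missed = is_missed(last_ok, checked_at,
                   interval=timedelta(days=1), tolerance=timedelta(minutes=30))
print(missed)  # True: 1 day 40 minutes have passed, the limit is 1 day 30 minutes
```

Because the check looks only at successful runs, it flags a fire that ran and failed at the sink just as it flags a fire that never triggered.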

Apply policy to runtime definitions & document

id: ACTIVITY-WP-0014-T04
status: progress
priority: medium
state_hub_task_id: "04e9d1d2-1192-4402-9402-b12c5d7d44e5"

Choose and set the appropriate misfire_policy for daily-statehub-wsjf-triage (likely catchup_latest — one missed daily run should still run, but a multi-day outage should not flood the triage feed). Update the Railiance runtime ConfigMap / bundle, redeploy, and document the run-miss options + per-definition guidance in docs/runbook.md. Depends on T01 (confirm) and T02 (modes exist).

Resilient State Hub sinks/resolvers (real incident fix)

id: ACTIVITY-WP-0014-T05
status: todo
priority: high
state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"

T01 proved that the 06-22/06-23 silence was not a Temporal misfire but a State Hub "Connection refused" at the report sink (plus chronic resolver timeouts), because railiance01 reaches State Hub via a reverse tunnel back to the workstation, which is asleep at 07:20 Berlin. Misfire policies do not help here: a recovered run fires and fails the same way. Make activity-core resilient to transient State Hub unavailability:

  • Report sinks should retry with backoff and not hard-fail the workflow when the only failure is transient State Hub delivery; preserve the generated report (working-memory note + a deferred/outbox state-hub-progress) for later flush.
  • Required State Hub context resolvers should retry with backoff and surface a single, clear diagnostic rather than a bare "timed out".
  • Separately (out of this repo): give railiance01 a State Hub endpoint that does not depend on the workstation being awake, or run the triage at a time the workstation is reliably up. Owner decision needed.
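The retry-then-defer behaviour described for report sinks could look roughly like the sketch below. It is illustrative only: deliver_with_outbox, flush_outbox, the in-memory list used as the outbox, and the retry counts are all hypothetical, and a real sink would need durable storage for deferred reports. The send callable stands in for whatever posts a report to State Hub.

```python
import time
from typing import Callable, List


def deliver_with_outbox(report: dict, send: Callable[[dict], None],
                        outbox: List[dict], attempts: int = 3,
                        base_delay: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to send a report; on repeated connection failure, park it in the outbox."""
    for attempt in range(attempts):
        try:
            send(report)
            return True
        except ConnectionError:  # includes ConnectionRefusedError (Errno 111)
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    # Preserve the report instead of failing the workflow.
    outbox.append(report)
    return False


def flush_outbox(outbox: List[dict], send: Callable[[dict], None]) -> int:
    """Retry deferred reports once each; return how many remain undelivered."""
    remaining = []
    for report in outbox:
        try:
            send(report)
        except ConnectionError:
            remaining.append(report)
    outbox[:] = remaining
    return len(remaining)
```

With the defaults, a sink that keeps refusing connections is tried three times with 1 s and 2 s pauses, after which the report sits in the outbox and the caller gets False instead of an exception. A later flush_outbox call, made once State Hub is reachable again, delivers the deferred reports.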