---
id: ACTIVITY-WP-0014
type: workplan
title: "Schedule Misfire Robustness & Run-Miss Recovery Options"
domain: infotech
repo: activity-core
status: active
owner: claude
topic_slug: activity-core
created: "2026-06-23"
updated: "2026-06-23"
state_hub_workstream_id: "91b64686-5d17-4c86-bc9e-3d0ee6720cf5"
---

# Schedule Misfire Robustness & Run-Miss Recovery Options

Make cron-triggered ActivityDefinitions robust to missed fires (worker/Temporal
unavailable at trigger time) with explicit, per-definition recovery behaviour,
plus detection/alerting when a scheduled fire is missed.

## Motivation

On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
(cron `20 7 * * *` Europe/Berlin, projected into the Railiance runtime ConfigMap
`actcore-external-activity-definitions`) produced **no `daily_triage` progress
event at all**: neither a success nor a `could not run; operator review
required` failure.

> **Corrected by T01 (2026-06-23).** The initial hypothesis, that
> `_build_schedule()` never set `catchup_window` so a short default catchup
> window silently dropped the fire, was **disproven on the live cluster**. The
> Temporal schedule is healthy with `CatchupWindow 365d` (the server default) and
> `0 MissedCatchupWindow`. The real cause is that the run **fired and ran but
> failed at the report sink** with `Connection refused` posting to State Hub,
> because railiance01 reaches State Hub via a reverse tunnel back to the
> workstation, which is asleep at 07:20 Berlin. See the T01 findings and T05.

The trigger now originates entirely on **railiance01** (in-cluster Temporal
Schedule, ConfigMap-projected definition) and is **not** laptop-dependent, but
the triage's State Hub *data dependencies* (context resolution and report
delivery) still route back to the workstation State Hub.

This workplan still delivers worthwhile robustness (explicit run-miss recovery
policies in T02 and missed-fire detection in T03), but the fix for *this* incident
is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).

## Desired run-miss options (from Bernd)

Three explicit, per-definition behaviours when a fire is missed:

1. **Run on trigger or skip**: never recover a missed fire.
2. **Run on trigger or later if missed**: recover **all** missed fires when back up.
3. **Run on trigger or later if missed, but skip if next trigger reached**:
   recover only the **most recent** missed fire; do not accumulate a backlog.

Proposed mapping to a new `misfire_policy` value set (names open to review):

| Policy | Semantics | Temporal mapping |
| --- | --- | --- |
| `skip` | Run on trigger or skip | `catchup_window ≈ 0`, `overlap=SKIP` |
| `catchup_all` | Run on trigger or all missed later | `catchup_window=<long>`, `overlap=BUFFER_ALL` |
| `catchup_latest` | Run on trigger or only the latest missed | `catchup_window ≈ 1 interval`, `overlap=BUFFER_ONE` |
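
The table can be read as a small lookup from policy name to a `(catchup_window, overlap)` pair. The sketch below is illustrative only and is not the code in `_build_schedule()`. `OverlapPolicy` is a local stand-in enum so the snippet runs without the Temporal SDK, and `resolve_misfire_policy`, `MIN_CATCHUP` and `LONG_CATCHUP` are hypothetical names. The concrete values are assumptions: 10 seconds stands in for "≈ 0" (Temporal's documentation describes a ten-second minimum catchup window, as far as we know) and 365 days stands in for `<long>`.

```python
from datetime import timedelta
from enum import Enum


class OverlapPolicy(Enum):
    """Local stand-in for the scheduler's overlap policy enum (assumption)."""
    SKIP = "SKIP"
    BUFFER_ONE = "BUFFER_ONE"
    BUFFER_ALL = "BUFFER_ALL"


# Assumed values: the workplan only says "approximately 0" and "<long>".
MIN_CATCHUP = timedelta(seconds=10)
LONG_CATCHUP = timedelta(days=365)


def resolve_misfire_policy(policy: str, interval: timedelta) -> tuple[timedelta, OverlapPolicy]:
    """Return (catchup_window, overlap) for one of the three proposed modes."""
    if policy == "skip":
        return MIN_CATCHUP, OverlapPolicy.SKIP
    if policy == "catchup_all":
        return LONG_CATCHUP, OverlapPolicy.BUFFER_ALL
    if policy == "catchup_latest":
        # About one interval: a fire older than the next trigger is not recovered.
        return interval, OverlapPolicy.BUFFER_ONE
    raise ValueError(f"unknown misfire_policy: {policy!r}")
```

For the daily triage (`20 7 * * *`), `resolve_misfire_policy("catchup_latest", timedelta(days=1))` would yield a one-day window with `BUFFER_ONE`.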

## Confirm root cause on Railiance01

```task
id: ACTIVITY-WP-0014-T01
status: done
priority: high
state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab"
```

Inspected via `ssh railiance01` + in-node `kubectl`/`temporal` (no k3s tunnel is
defined for railiance01; the documented access path is SSH to the host).

**Findings (2026-06-23; the WP-0014 premise was wrong for this incident):**

- All pods healthy; `actcore-worker` up 44h, 0 restarts. Not a crash.
- The daily-triage Temporal schedule (`activity-schedule-6fca51fa-…`) is
  **healthy**: `Paused false`, `OverlapPolicy Skip`, **`CatchupWindow 365d`**
  (Temporal's *default* when unset), `ActionCounts {Total:8, MissedCatchupWindow:0}`.
  So fires were **not** silently dropped: the original "no catchup window → silent
  drop" hypothesis does not hold, because the server default is already 365d.
- The `2026-06-23T05:20:00Z` fire **did fire and ran**, then **failed at the report
  sink**: `report sink failure: state-hub-progress … '[Errno 111] Connection
  refused'`. The run produced a report but could not deliver it to State Hub, so
  no `daily_triage` progress event (not even a "could not run" one) was posted,
  hence the silence. The 06-22 fire has no execution in retention (likely
  because the bridge was down then too, or because the fire fell inside the
  schedule-update window at `LastUpdateAt 1d ago`).
- Root cause is **State Hub connectivity from railiance01**, not Temporal. The
  in-cluster `actcore-state-hub-bridge` (`hostNetwork`) proxies to
  `127.0.0.1:18000` on the node, the local end of the ops-bridge **reverse tunnel
  back to the workstation's State Hub**. At 07:20 Europe/Berlin (= 05:20 UTC) the
  workstation/tunnel was unreachable → `Connection refused`. Chronic flakiness
  confirmed: 102 State Hub resolver timeouts in 24h (69 `recently_on_scope`,
  33 `consistency_sweep`).

**Implication:** the trigger *is* independent of the laptop, but the triage's
**data dependencies (State Hub context resolution + report delivery) still route
back to the workstation State Hub**, which is asleep at 07:20 Berlin. WP-0014's
misfire policies are still good robustness, but the real fix is (a) State Hub
reachable from railiance01 independent of the workstation, and/or (b)
sinks/resolvers resilient to transient State Hub unavailability (retry/backoff,
store-and-forward) instead of hard-failing the workflow. Tracked as follow-up
T05 below. Backfill deferred: a replay only succeeds while the workstation State
Hub is reachable.

## Implement explicit misfire recovery modes

```task
id: ACTIVITY-WP-0014-T02
status: done
priority: high
state_hub_task_id: "19615562-4cb2-4f25-872f-505d6e40dcc5"
```

Add `catchup_window_seconds` to `CronTriggerConfig` and redefine `misfire_policy`
into the three explicit modes above. In `_build_schedule()` set
`SchedulePolicy(overlap=..., catchup_window=timedelta(...))` per mode. Remove the
ad-hoc 1-hour `backfill` hack in favour of native catchup-window semantics. Keep
backward compatibility for existing `skip`/`catchup`/`compress` values (alias
map). Unit tests for each mode's `(catchup_window, overlap)` mapping.
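
A possible shape for the alias map and the new config field is sketched below. It is a guess at the design, not the merged implementation. The legacy-to-new pairing (`catchup` → `catchup_all`, `compress` → `catchup_latest`) is an assumption, since the workplan does not say which legacy value maps to which mode. The `CronTriggerConfig` shown here is a trimmed, hypothetical stand-in: only `misfire_policy` and `catchup_window_seconds` come from the text above, and the other fields and defaults are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical alias map: the pairing of legacy names to new modes is an
# assumption, not something the workplan specifies.
LEGACY_ALIASES = {
    "skip": "skip",
    "catchup": "catchup_all",
    "compress": "catchup_latest",
}
VALID_POLICIES = {"skip", "catchup_all", "catchup_latest"}


def normalize_misfire_policy(value: str) -> str:
    """Map a legacy or new policy name onto one of the three explicit modes."""
    canonical = LEGACY_ALIASES.get(value, value)
    if canonical not in VALID_POLICIES:
        raise ValueError(f"unknown misfire_policy: {value!r}")
    return canonical


@dataclass(frozen=True)
class CronTriggerConfig:
    """Illustrative stand-in; field names other than the two policy fields are assumed."""
    cron: str
    timezone: str = "UTC"
    misfire_policy: str = "skip"
    catchup_window_seconds: Optional[int] = None  # None = use the mode's default

    def __post_init__(self) -> None:
        # Frozen dataclass, so the normalized value is written via object.__setattr__.
        object.__setattr__(
            self, "misfire_policy", normalize_misfire_policy(self.misfire_policy)
        )
        if self.catchup_window_seconds is not None and self.catchup_window_seconds < 0:
            raise ValueError("catchup_window_seconds must be non-negative")
```

Normalizing at construction time means downstream code such as `_build_schedule()` would only ever see the three new names, which keeps the per-mode unit tests small.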

## Missed-fire detection & alert sink

```task
id: ACTIVITY-WP-0014-T03
status: done
priority: medium
state_hub_task_id: "dbedd96a-59ca-4b83-bce6-35755b076807"
```

Detect when a scheduled definition has no successful run within its expected
interval + tolerance, and emit a signal (State Hub progress event and/or
agent-inbox message) so a miss is visible even under `skip`. This is the
observability the current silent behaviour lacks: a miss should never again
be invisible.
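
The detection rule ("no successful run within interval + tolerance") reduces to one comparison. The function below is a minimal sketch under stated assumptions: `is_fire_missed` is a hypothetical name, the interval is passed in as a fixed `timedelta` rather than derived from the cron expression, and a definition that has never succeeded is treated as missed.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fire_missed(
    last_success: Optional[datetime],
    now: datetime,
    interval: timedelta,
    tolerance: timedelta,
) -> bool:
    """True when no successful run falls within interval + tolerance of now."""
    if last_success is None:
        return True  # never succeeded: treated as missed (design assumption)
    return now - last_success > interval + tolerance


# Illustrative timestamps for a daily schedule with a 30-minute tolerance.
now = datetime(2026, 6, 23, 6, 0, tzinfo=timezone.utc)
fresh = datetime(2026, 6, 23, 5, 20, tzinfo=timezone.utc)
stale = datetime(2026, 6, 21, 5, 20, tzinfo=timezone.utc)
daily, slack = timedelta(days=1), timedelta(minutes=30)

print(is_fire_missed(fresh, now, daily, slack))  # False
print(is_fire_missed(stale, now, daily, slack))  # True
```

Because the check keys on the last *successful* run rather than on Temporal's missed-action counter, it would also flag a run that fired but failed before reporting, which is the failure shape T01 found.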

## Apply policy to runtime definitions & document

```task
id: ACTIVITY-WP-0014-T04
status: progress
priority: medium
state_hub_task_id: "04e9d1d2-1192-4402-9402-b12c5d7d44e5"
```

Choose and set the appropriate `misfire_policy` for `daily-statehub-wsjf-triage`
(likely `catchup_latest`: one missed daily run should still run, but a
multi-day outage should not flood the triage feed). Update the Railiance runtime
ConfigMap / bundle, redeploy, and document the run-miss options + per-definition
guidance in `docs/runbook.md`. Depends on T01 (confirm) and T02 (modes exist).

## Resilient State Hub sinks/resolvers (real incident fix)

```task
id: ACTIVITY-WP-0014-T05
status: todo
priority: high
state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
```

T01 proved the 06-22/06-23 silence was **not** a Temporal misfire but a State Hub
**`Connection refused` at the report sink** (and chronic resolver timeouts) because
railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
is asleep at 07:20 Berlin. Misfire policies do not help: the run fires and fails
the same way. Make activity-core resilient to transient State Hub unavailability:

- Report sinks should retry with backoff and **not hard-fail the workflow** when
  the only failure is transient State Hub delivery; preserve the generated report
  (working-memory note + a deferred/outbox state-hub-progress) for later flush.
- Required State Hub context resolvers should retry/backoff and surface a clear,
  single diagnostic rather than a bare `timed out`.
- Separately (out of this repo): give railiance01 a State Hub endpoint that does
  not depend on the workstation being awake, or run the triage at a time the
  workstation is reliably up. Owner decision needed.
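
The first bullet (retry with backoff, then defer instead of failing) could look roughly like the sketch below. It is an illustration under assumptions, not the planned implementation: `deliver_with_retry` and `flush_outbox` are hypothetical names, the outbox is a plain in-memory list where a real one would need durable storage, and the `send` callable is assumed to raise the built-in `ConnectionError` or `TimeoutError` on transient failure (an HTTP client library may raise its own exception types instead).

```python
import time
from typing import Callable, List


def deliver_with_retry(
    send: Callable[[dict], None],
    report: dict,
    outbox: List[dict],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver a report; after repeated transient failure, park it in the outbox.

    Returns True if delivered and False if deferred. Connection-level errors
    are swallowed so the calling workflow does not hard-fail on them.
    """
    for attempt in range(attempts):
        try:
            send(report)
            return True
        except (ConnectionError, TimeoutError):
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    outbox.append(report)  # preserved for a later flush
    return False


def flush_outbox(send: Callable[[dict], None], outbox: List[dict]) -> int:
    """Re-send deferred reports in order; stop at the first transient failure."""
    delivered = 0
    while outbox:
        try:
            send(outbox[0])
        except (ConnectionError, TimeoutError):
            break
        outbox.pop(0)
        delivered += 1
    return delivered
```

Injecting `sleep` keeps the backoff testable without real delays. Only connection and timeout errors are caught, so a genuine bug in the sink would still surface as a failure.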