feat(ACTIVITY-WP-0014): explicit run-miss recovery policies (T02, T04)

Set Temporal catchup_window on cron schedules so a fire missed during a
worker/Temporal outage is no longer silently dropped. Redefine misfire_policy
into three explicit modes — skip, catchup_all, catchup_latest — mapping to
(catchup_window, overlap) pairs; legacy catchup/compress are kept as aliases. Add
catchup_window_seconds override. Remove the ad-hoc upsert-time 1h backfill in
favour of native catchup. Apply catchup_latest to daily-statehub-wsjf-triage in
the Railiance runtime manifest and document run-miss policies in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 14:15:45 +02:00
parent ffc0ee2cb7
commit a83b117f60
6 changed files with 181 additions and 29 deletions


@@ -333,6 +333,31 @@ the same durable consumer name provides automatic failover.
---
## Run-miss recovery policies (cron triggers)
A cron fire is **missed** when the worker or Temporal is unavailable at trigger
time. `trigger_config.misfire_policy` selects what happens when the system
recovers. Each policy combines a Temporal **catchup window** (how far back missed
fires are recovered) with an **overlap policy** (what to do if a recovered fire
would start while a prior run is still executing):
| `misfire_policy` | Behaviour | Default catchup window | Overlap |
| --- | --- | --- | --- |
| `skip` | Run at trigger time or not at all; a fire missed by more than the grace window is never recovered | 60s grace | `SKIP` |
| `catchup_all` | Recover **every** fire missed during the outage | 365 days | `BUFFER_ALL` |
| `catchup_latest` | Recover only the **most recent** missed fire; no backlog | 24h | `BUFFER_ONE` |
Set `trigger_config.catchup_window_seconds` to override the per-policy default
(e.g. an hourly definition using `catchup_latest` should set it to ~3600 so a
single missed hour is recovered but older ones are not).
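The hourly example above can be checked with a little arithmetic. The sketch below is a simplified model, not Temporal's actual implementation: it assumes a missed fire is eligible for recovery only if its scheduled time falls within `catchup_window_seconds` of the moment the system recovers. The function name `fires_within_window` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def fires_within_window(missed_fires, recovered_at, catchup_window_seconds):
    """Return the missed fires whose scheduled time is inside the catchup window.

    Simplified model (an assumption, not Temporal's code): a fire is eligible
    if it was scheduled no earlier than `recovered_at - catchup_window`.
    """
    cutoff = recovered_at - timedelta(seconds=catchup_window_seconds)
    return [fire for fire in missed_fires if cutoff <= fire <= recovered_at]


# Hypothetical outage: hourly fires at 09:00-12:00 UTC were missed,
# and the system recovers at 12:30 UTC.
recovered_at = datetime(2026, 6, 23, 12, 30, tzinfo=timezone.utc)
missed = [datetime(2026, 6, 23, hour, 0, tzinfo=timezone.utc) for hour in (9, 10, 11, 12)]

# With a ~1h window only the 12:00 fire is still eligible.
one_hour = fires_within_window(missed, recovered_at, 3600)

# With the 24h default of catchup_latest, all four fires fall inside the window;
# the overlap policy then decides how many of them actually run.
one_day = fires_within_window(missed, recovered_at, 24 * 3600)
```

Under this model, narrowing the window to roughly one schedule interval is what keeps older fires out of the recovery set before the overlap policy is even consulted.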
Legacy values are still accepted: `catchup` → `catchup_all`,
`compress` → `catchup_latest`.
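As an illustration of how the table, the override, and the legacy aliases fit together, here is a minimal sketch of a resolver. It is not the code from this commit: `resolve_misfire_policy` and `ResolvedPolicy` are hypothetical names, the overlap values are plain strings rather than Temporal SDK enums, and the sketch assumes `misfire_policy` is always present in `trigger_config`.

```python
from typing import NamedTuple


class ResolvedPolicy(NamedTuple):
    name: str                    # canonical policy name
    catchup_window_seconds: int  # effective catchup window
    overlap: str                 # overlap policy, as named in the table


# Per-policy defaults taken from the table above.
_DEFAULTS = {
    "skip": (60, "SKIP"),
    "catchup_all": (365 * 24 * 3600, "BUFFER_ALL"),
    "catchup_latest": (24 * 3600, "BUFFER_ONE"),
}

# Legacy values map onto the new canonical names.
_LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}


def resolve_misfire_policy(trigger_config: dict) -> ResolvedPolicy:
    """Resolve a trigger_config dict into (name, window, overlap).

    Hypothetical helper: assumes `misfire_policy` is set explicitly.
    """
    raw = trigger_config["misfire_policy"]
    name = _LEGACY_ALIASES.get(raw, raw)
    if name not in _DEFAULTS:
        raise ValueError(f"unknown misfire_policy: {raw!r}")
    default_window, overlap = _DEFAULTS[name]
    override = trigger_config.get("catchup_window_seconds")
    window = default_window if override is None else int(override)
    return ResolvedPolicy(name, window, overlap)


# An hourly definition using the legacy `compress` value with a 1h override.
hourly = resolve_misfire_policy(
    {"misfire_policy": "compress", "catchup_window_seconds": 3600}
)
```

Here `hourly` resolves to `catchup_latest` with a 3600-second window and `BUFFER_ONE` overlap; omitting the override would fall back to the 24h default.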
> **Why this exists:** before ACTIVITY-WP-0014 no catchup window was set, so a
> brief outage at trigger time silently dropped the fire with no recovery and no
> log line. The `daily-statehub-wsjf-triage` definition now uses `catchup_latest`.
## Troubleshooting
### Worker fails to start: "ACTCORE_DB_URL is required"
@@ -342,6 +367,9 @@ Set the environment variable before running the worker.
1. Check Temporal UI → Schedules tab for the schedule status.
2. Ensure `enabled=True` on the ActivityDefinition (paused schedules don't fire).
3. Verify the cron expression with: `docker exec temporal-admin-tools temporal schedule describe --schedule-id activity-schedule-<uuid>`
4. If a fire was **missed entirely** (no run, no failure event) during an outage,
check `misfire_policy` — under `skip` missed fires are dropped by design. Use
`catchup_all` or `catchup_latest` to recover them. See *Run-miss recovery policies*.
### Event not routing
1. Check NATS monitoring: http://localhost:8222/jsz to verify the `ACTIVITY_EVENTS` stream exists.