feat(ACTIVITY-WP-0014): explicit run-miss recovery policies (T02, T04)
Set Temporal catchup_window on cron schedules so a fire missed during a worker/Temporal outage is no longer silently dropped. Redefine misfire_policy into three explicit modes (skip, catchup_all, catchup_latest), each mapping to a (catchup_window, overlap) pair; the legacy values catchup and compress are aliased. Add a catchup_window_seconds override. Remove the ad-hoc upsert-time 1h backfill in favour of native catchup. Apply catchup_latest to daily-statehub-wsjf-triage in the Railiance runtime manifest and document run-miss policies in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -333,6 +333,31 @@ the same durable consumer name provides automatic failover.
---
## Run-miss recovery policies (cron triggers)
A cron fire is **missed** when the worker or Temporal is unavailable at trigger
time. `trigger_config.misfire_policy` selects what happens when the system
recovers. Each policy combines a Temporal **catchup window** (how far back missed
fires are recovered) with an **overlap policy** (what to do if a recovered fire
would start while a prior run is still executing):

| `misfire_policy` | Behaviour | Default catchup window | Overlap |
| --- | --- | --- | --- |
| `skip` | Run on trigger or skip; a missed fire is never recovered | 60s grace | `SKIP` |
| `catchup_all` | Recover **every** fire missed during the outage | 365 days | `BUFFER_ALL` |
| `catchup_latest` | Recover only the **most recent** missed fire; no backlog | 24h | `BUFFER_ONE` |

Set `trigger_config.catchup_window_seconds` to override the per-policy default
(e.g. an hourly definition using `catchup_latest` should set it to ~3600 so a
single missed hour is recovered but older ones are not).

Legacy values are still accepted: `catchup` → `catchup_all`,
`compress` → `catchup_latest`.
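The table, the override, and the legacy aliases can be read together as one small resolution step. The following is a minimal sketch of that step, not the project's actual code: `ResolvedPolicy` and `resolve_misfire_policy` are hypothetical names, the overlap policy is kept as a plain string rather than a Temporal SDK type, and a missing `misfire_policy` key is treated as an error because the runbook does not state a default.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Any, Mapping

# Per-policy defaults, copied from the runbook table.
_DEFAULTS = {
    "skip": (timedelta(seconds=60), "SKIP"),
    "catchup_all": (timedelta(days=365), "BUFFER_ALL"),
    "catchup_latest": (timedelta(hours=24), "BUFFER_ONE"),
}

# Legacy values that are still accepted.
_LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}


@dataclass(frozen=True)
class ResolvedPolicy:
    """Hypothetical container for the resolved (catchup_window, overlap) pair."""

    name: str
    catchup_window: timedelta
    overlap: str


def resolve_misfire_policy(trigger_config: Mapping[str, Any]) -> ResolvedPolicy:
    """Resolve a trigger_config into a canonical policy (illustrative only)."""
    # ASSUMPTION: the key is required here; the runbook names no default policy.
    raw = trigger_config["misfire_policy"]
    name = _LEGACY_ALIASES.get(raw, raw)
    if name not in _DEFAULTS:
        raise ValueError(f"unknown misfire_policy: {raw!r}")

    window, overlap = _DEFAULTS[name]
    override = trigger_config.get("catchup_window_seconds")
    if override is not None:
        # An explicit override replaces the per-policy default window.
        window = timedelta(seconds=override)
    return ResolvedPolicy(name=name, catchup_window=window, overlap=overlap)
```

Under these assumptions, an hourly definition configured as `{"misfire_policy": "compress", "catchup_window_seconds": 3600}` resolves to `catchup_latest` with a one-hour window and `BUFFER_ONE` overlap.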
> **Why this exists:** before ACTIVITY-WP-0014 no catchup window was set, so a
> brief outage at trigger time silently dropped the fire with no recovery and no
> log line. The `daily-statehub-wsjf-triage` definition now uses `catchup_latest`.
## Troubleshooting
### Worker fails to start: "ACTCORE_DB_URL is required"
@@ -342,6 +367,9 @@ Set the environment variable before running the worker.
1. Check Temporal UI → Schedules tab for the schedule status.
2. Ensure `enabled=True` on the ActivityDefinition (paused schedules don't fire).
3. Verify the cron expression with: `docker exec temporal-admin-tools temporal schedule describe --schedule-id activity-schedule-<uuid>`
4. If a fire was **missed entirely** (no run, no failure event) during an outage,
   check `misfire_policy`: under `skip`, missed fires are dropped by design. Use
   `catchup_all` or `catchup_latest` to recover them. See *Run-miss recovery policies*.
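When working through step 4, it can help to check by hand whether a missed fire was still inside the catchup window when the system came back. The sketch below models only the behaviour described in the runbook's policy table (window filtering, plus "most recent only" for `catchup_latest`); it is not a model of Temporal's scheduler internals. The function name `recoverable_fires` and its parameters are hypothetical, and treating the window boundary as inclusive is an assumption.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def recoverable_fires(
    missed_fires: Sequence[datetime],
    recovered_at: datetime,
    catchup_window: timedelta,
    latest_only: bool = False,
) -> List[datetime]:
    """Return the missed fire times that fall inside the catchup window.

    latest_only=True approximates `catchup_latest` (keep the newest fire only);
    latest_only=False approximates `catchup_all`.
    """
    # ASSUMPTION: a fire exactly on the cutoff still counts as inside the window.
    cutoff = recovered_at - catchup_window
    in_window = sorted(f for f in missed_fires if cutoff <= f <= recovered_at)
    if latest_only and in_window:
        return [in_window[-1]]
    return in_window
```

For an hourly schedule that missed 01:00, 02:00, and 03:00 and recovered at 03:30, a 3600-second window leaves only the 03:00 fire inside the window, while the 60-second grace of `skip` leaves none.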
### Event not routing
1. Check NATS monitoring: http://localhost:8222/jsz to verify the `ACTIVITY_EVENTS` stream exists.