generated from coulomb/repo-seed
feat(ACTIVITY-WP-0014): explicit run-miss recovery policies (T02, T04)
Set Temporal catchup_window on cron schedules so a fire missed during a worker/Temporal outage is no longer silently dropped. Redefine misfire_policy into three explicit modes — skip, catchup_all, catchup_latest — mapping to (catchup_window, overlap) pairs; legacy catchup/compress aliased. Add catchup_window_seconds override. Remove the ad-hoc upsert-time 1h backfill in favour of native catchup. Apply catchup_latest to daily-statehub-wsjf-triage in the Railiance runtime manifest and document run-miss policies in the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
@@ -333,6 +333,31 @@ the same durable consumer name provides automatic failover.
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Run-miss recovery policies (cron triggers)
|
||||||
|
|
||||||
|
A cron fire is **missed** when the worker or Temporal is unavailable at trigger
|
||||||
|
time. `trigger_config.misfire_policy` selects what happens when the system
|
||||||
|
recovers. Each policy combines a Temporal **catchup window** (how far back missed
|
||||||
|
fires are recovered) with an **overlap policy** (what to do if a recovered fire
|
||||||
|
would start while a prior run is still executing):
|
||||||
|
|
||||||
|
| `misfire_policy` | Behaviour | Default catchup window | Overlap |
|
||||||
|
| --- | --- | --- | --- |
|
||||||
|
| `skip` | Run on trigger or skip — a missed fire is never recovered | 60s grace | `SKIP` |
|
||||||
|
| `catchup_all` | Recover **every** fire missed during the outage | 365 days | `BUFFER_ALL` |
|
||||||
|
| `catchup_latest` | Recover only the **most recent** missed fire; no backlog | 24h | `BUFFER_ONE` |
|
||||||
|
|
||||||
|
Set `trigger_config.catchup_window_seconds` to override the per-policy default
|
||||||
|
(e.g. an hourly definition using `catchup_latest` should set it to ~3600 so a
|
||||||
|
single missed hour is recovered but older ones are not).
|
||||||
|
|
||||||
|
Legacy values are still accepted: `catchup` → `catchup_all`,
|
||||||
|
`compress` → `catchup_latest`.
|
||||||
|
|
||||||
|
> **Why this exists:** before ACTIVITY-WP-0014 no catchup window was set, so a
|
||||||
|
> brief outage at trigger time silently dropped the fire with no recovery and no
|
||||||
|
> log line. The `daily-statehub-wsjf-triage` definition now uses `catchup_latest`.
|
||||||
|
|
||||||
## Troubleshooting
|
## Troubleshooting
|
||||||
|
|
||||||
### Worker fails to start: "ACTCORE_DB_URL is required"
|
### Worker fails to start: "ACTCORE_DB_URL is required"
|
||||||
@@ -342,6 +367,9 @@ Set the environment variable before running the worker.
|
|||||||
1. Check Temporal UI → Schedules tab for the schedule status.
|
1. Check Temporal UI → Schedules tab for the schedule status.
|
||||||
2. Ensure `enabled=True` on the ActivityDefinition (paused schedules don't fire).
|
2. Ensure `enabled=True` on the ActivityDefinition (paused schedules don't fire).
|
||||||
3. Verify the cron expression with: `docker exec temporal-admin-tools temporal schedule describe --schedule-id activity-schedule-<uuid>`
|
3. Verify the cron expression with: `docker exec temporal-admin-tools temporal schedule describe --schedule-id activity-schedule-<uuid>`
|
||||||
|
4. If a fire was **missed entirely** (no run, no failure event) during an outage,
|
||||||
|
check `misfire_policy` — under `skip` missed fires are dropped by design. Use
|
||||||
|
`catchup_all` or `catchup_latest` to recover them. See *Run-miss recovery policies*.
|
||||||
|
|
||||||
### Event not routing
|
### Event not routing
|
||||||
1. Check NATS monitoring: http://localhost:8222/jsz to verify the `ACTIVITY_EVENTS` stream exists.
|
1. Check NATS monitoring: http://localhost:8222/jsz to verify the `ACTIVITY_EVENTS` stream exists.
|
||||||
|
|||||||
@@ -47,7 +47,10 @@ data:
|
|||||||
type: cron
|
type: cron
|
||||||
cron_expression: "20 7 * * *"
|
cron_expression: "20 7 * * *"
|
||||||
timezone: Europe/Berlin
|
timezone: Europe/Berlin
|
||||||
misfire_policy: skip
|
# ACTIVITY-WP-0014: recover the most recent missed daily fire when the
|
||||||
|
# worker/Temporal was unavailable at trigger time, without accumulating a
|
||||||
|
# backlog after a multi-day outage.
|
||||||
|
misfire_policy: catchup_latest
|
||||||
context_sources:
|
context_sources:
|
||||||
- type: static
|
- type: static
|
||||||
bind_to: context.prompt_path
|
bind_to: context.prompt_path
|
||||||
|
|||||||
@@ -49,7 +49,18 @@ class CronTriggerConfig(BaseModel):
|
|||||||
)
|
)
|
||||||
timezone: str = Field(default="UTC", description="IANA timezone name.")
|
timezone: str = Field(default="UTC", description="IANA timezone name.")
|
||||||
jitter_seconds: int = Field(default=0, ge=0)
|
jitter_seconds: int = Field(default=0, ge=0)
|
||||||
misfire_policy: Literal["skip", "catchup", "compress"] = Field(default="skip")
|
# Run-miss recovery behaviour (ACTIVITY-WP-0014). What happens when a fire is
|
||||||
|
# missed because the worker / Temporal was unavailable at trigger time:
|
||||||
|
# skip - run on trigger or skip; a missed fire is never recovered
|
||||||
|
# catchup_all - recover every fire missed during the outage window
|
||||||
|
# catchup_latest - recover only the most recent missed fire; do not accumulate
|
||||||
|
# Legacy aliases are accepted: catchup → catchup_all, compress → catchup_latest.
|
||||||
|
misfire_policy: Literal[
|
||||||
|
"skip", "catchup_all", "catchup_latest", "catchup", "compress"
|
||||||
|
] = Field(default="skip")
|
||||||
|
# Override the per-policy default catchup window (how far back Temporal will
|
||||||
|
# recover missed fires after an outage). None uses the policy default.
|
||||||
|
catchup_window_seconds: int | None = Field(default=None, ge=0)
|
||||||
|
|
||||||
|
|
||||||
class EventTriggerConfig(BaseModel):
|
class EventTriggerConfig(BaseModel):
|
||||||
|
|||||||
@@ -17,7 +17,6 @@ from temporalio.client import (
|
|||||||
Schedule,
|
Schedule,
|
||||||
ScheduleActionStartWorkflow,
|
ScheduleActionStartWorkflow,
|
||||||
ScheduleAlreadyRunningError,
|
ScheduleAlreadyRunningError,
|
||||||
ScheduleBackfill,
|
|
||||||
ScheduleCalendarSpec,
|
ScheduleCalendarSpec,
|
||||||
ScheduleHandle,
|
ScheduleHandle,
|
||||||
ScheduleOverlapPolicy,
|
ScheduleOverlapPolicy,
|
||||||
@@ -38,13 +37,49 @@ _ORCHESTRATOR_TASK_QUEUE = "orchestrator-tq"
|
|||||||
# RunActivityWorkflow detects this value and derives run dedup key from workflow_id.
|
# RunActivityWorkflow detects this value and derives run dedup key from workflow_id.
|
||||||
SCHEDULED_TRIGGER_KEY = "scheduled"
|
SCHEDULED_TRIGGER_KEY = "scheduled"
|
||||||
|
|
||||||
# T24: misfire_policy → ScheduleOverlapPolicy
|
# ACTIVITY-WP-0014: misfire_policy → run-miss recovery behaviour.
|
||||||
_MISFIRE_TO_OVERLAP: dict[str, ScheduleOverlapPolicy] = {
|
#
|
||||||
"skip": ScheduleOverlapPolicy.SKIP,
|
# A "missed fire" happens when the worker / Temporal is unavailable at trigger
|
||||||
"catchup": ScheduleOverlapPolicy.BUFFER_ALL,
|
# time. Two Temporal levers together define the behaviour:
|
||||||
"compress": ScheduleOverlapPolicy.BUFFER_ONE,
|
# - catchup_window: how far back the server will recover missed fires once it
|
||||||
|
# is healthy again. The previous code never set this, so a brief outage at
|
||||||
|
# trigger time silently dropped the fire with no recovery and no signal.
|
||||||
|
# - overlap: what to do when a (recovered) fire would start while a prior run
|
||||||
|
# is still executing.
|
||||||
|
#
|
||||||
|
# Legacy values (catchup, compress) are aliased onto the explicit names.
|
||||||
|
_MISFIRE_ALIASES: dict[str, str] = {
|
||||||
|
"catchup": "catchup_all",
|
||||||
|
"compress": "catchup_latest",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# overlap policy + default catchup window (seconds) per normalised policy.
|
||||||
|
_SKIP_WINDOW_SECONDS = 60
|
||||||
|
_CATCHUP_ALL_WINDOW_SECONDS = 365 * 24 * 3600
|
||||||
|
_CATCHUP_LATEST_WINDOW_SECONDS = 24 * 3600
|
||||||
|
|
||||||
|
_MISFIRE_TO_OVERLAP: dict[str, ScheduleOverlapPolicy] = {
|
||||||
|
# Run on trigger or skip — recover nothing past a tiny grace window.
|
||||||
|
"skip": ScheduleOverlapPolicy.SKIP,
|
||||||
|
# Run on trigger or recover every missed fire during the outage window.
|
||||||
|
"catchup_all": ScheduleOverlapPolicy.BUFFER_ALL,
|
||||||
|
# Run on trigger or recover the most recent missed fire only; BUFFER_ONE
|
||||||
|
# buffers at most one start and drops the rest, so a backlog never accumulates.
|
||||||
|
"catchup_latest": ScheduleOverlapPolicy.BUFFER_ONE,
|
||||||
|
}
|
||||||
|
|
||||||
|
_MISFIRE_DEFAULT_WINDOW: dict[str, int] = {
|
||||||
|
"skip": _SKIP_WINDOW_SECONDS,
|
||||||
|
"catchup_all": _CATCHUP_ALL_WINDOW_SECONDS,
|
||||||
|
"catchup_latest": _CATCHUP_LATEST_WINDOW_SECONDS,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _normalize_misfire_policy(misfire_policy: str) -> str:
|
||||||
|
"""Map legacy aliases onto the explicit run-miss policy names."""
|
||||||
|
canonical = _MISFIRE_ALIASES.get(misfire_policy, misfire_policy)
|
||||||
|
return canonical if canonical in _MISFIRE_TO_OVERLAP else "skip"
|
||||||
|
|
||||||
|
|
||||||
def schedule_id(activity_id: str | UUID) -> str:
|
def schedule_id(activity_id: str | UUID) -> str:
|
||||||
"""Return the canonical Temporal Schedule ID for an ActivityDefinition."""
|
"""Return the canonical Temporal Schedule ID for an ActivityDefinition."""
|
||||||
@@ -57,7 +92,15 @@ def smoke_schedule_id(activity_id: str | UUID) -> str:
|
|||||||
|
|
||||||
|
|
||||||
def _overlap_policy(misfire_policy: str) -> ScheduleOverlapPolicy:
|
def _overlap_policy(misfire_policy: str) -> ScheduleOverlapPolicy:
|
||||||
return _MISFIRE_TO_OVERLAP.get(misfire_policy, ScheduleOverlapPolicy.SKIP)
|
return _MISFIRE_TO_OVERLAP[_normalize_misfire_policy(misfire_policy)]
|
||||||
|
|
||||||
|
|
||||||
|
def _catchup_window(cfg: CronTriggerConfig) -> timedelta:
|
||||||
|
"""Resolve the catchup window: explicit override, else the policy default."""
|
||||||
|
if cfg.catchup_window_seconds is not None:
|
||||||
|
return timedelta(seconds=cfg.catchup_window_seconds)
|
||||||
|
policy = _normalize_misfire_policy(cfg.misfire_policy)
|
||||||
|
return timedelta(seconds=_MISFIRE_DEFAULT_WINDOW[policy])
|
||||||
|
|
||||||
|
|
||||||
def _build_schedule(defn: ActivityDefinition) -> Schedule:
|
def _build_schedule(defn: ActivityDefinition) -> Schedule:
|
||||||
@@ -80,7 +123,10 @@ def _build_schedule(defn: ActivityDefinition) -> Schedule:
|
|||||||
jitter=timedelta(seconds=cfg.jitter_seconds) if cfg.jitter_seconds else None,
|
jitter=timedelta(seconds=cfg.jitter_seconds) if cfg.jitter_seconds else None,
|
||||||
)
|
)
|
||||||
|
|
||||||
policy = SchedulePolicy(overlap=_overlap_policy(cfg.misfire_policy))
|
policy = SchedulePolicy(
|
||||||
|
overlap=_overlap_policy(cfg.misfire_policy),
|
||||||
|
catchup_window=_catchup_window(cfg),
|
||||||
|
)
|
||||||
state = ScheduleState(paused=not defn.enabled)
|
state = ScheduleState(paused=not defn.enabled)
|
||||||
|
|
||||||
return Schedule(action=action, spec=spec, policy=policy, state=state)
|
return Schedule(action=action, spec=spec, policy=policy, state=state)
|
||||||
@@ -282,18 +328,10 @@ async def upsert_schedule(client: Client, defn: ActivityDefinition) -> ScheduleH
|
|||||||
else:
|
else:
|
||||||
await handle.pause(note="disabled via upsert_schedule")
|
await handle.pause(note="disabled via upsert_schedule")
|
||||||
|
|
||||||
# T24 catchup: backfill any fires missed in the last hour.
|
# ACTIVITY-WP-0014: missed-fire recovery is now handled natively by the
|
||||||
if isinstance(defn.trigger_config, CronTriggerConfig):
|
# schedule's catchup_window (see _build_schedule), which the server applies
|
||||||
if defn.trigger_config.misfire_policy == "catchup":
|
# continuously after any outage — not only at upsert time. The previous
|
||||||
now = datetime.now(tz=timezone.utc)
|
# ad-hoc 1-hour backfill is therefore no longer needed.
|
||||||
backfill_start = now - timedelta(hours=1)
|
|
||||||
await handle.backfill(
|
|
||||||
ScheduleBackfill(
|
|
||||||
start_at=backfill_start,
|
|
||||||
end_at=now,
|
|
||||||
overlap=ScheduleOverlapPolicy.BUFFER_ALL,
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
return handle
|
return handle
|
||||||
|
|
||||||
|
|||||||
@@ -37,6 +37,7 @@ def _make_defn(
|
|||||||
misfire_policy: str = "skip",
|
misfire_policy: str = "skip",
|
||||||
enabled: bool = True,
|
enabled: bool = True,
|
||||||
jitter: int = 0,
|
jitter: int = 0,
|
||||||
|
catchup_window_seconds: int | None = None,
|
||||||
) -> ActivityDefinition:
|
) -> ActivityDefinition:
|
||||||
return ActivityDefinition(
|
return ActivityDefinition(
|
||||||
id=uuid.uuid4(),
|
id=uuid.uuid4(),
|
||||||
@@ -46,6 +47,7 @@ def _make_defn(
|
|||||||
cron_expression=cron,
|
cron_expression=cron,
|
||||||
misfire_policy=misfire_policy,
|
misfire_policy=misfire_policy,
|
||||||
jitter_seconds=jitter,
|
jitter_seconds=jitter,
|
||||||
|
catchup_window_seconds=catchup_window_seconds,
|
||||||
),
|
),
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -186,6 +188,76 @@ async def test_misfire_policy_compress_sets_overlap_buffer_one(env: WorkflowEnvi
|
|||||||
await delete_schedule(env.client, defn.id)
|
await delete_schedule(env.client, defn.id)
|
||||||
|
|
||||||
|
|
||||||
|
# ── ACTIVITY-WP-0014: explicit run-miss policies + catchup window ────────────
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_skip_sets_short_catchup_window(env: WorkflowEnvironment) -> None:
|
||||||
|
"""skip = run on trigger or skip: tiny grace window, no real recovery."""
|
||||||
|
defn = _make_defn(misfire_policy="skip")
|
||||||
|
await upsert_schedule(env.client, defn)
|
||||||
|
|
||||||
|
desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
|
||||||
|
assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.SKIP
|
||||||
|
assert desc.schedule.policy.catchup_window == timedelta(seconds=60)
|
||||||
|
|
||||||
|
await delete_schedule(env.client, defn.id)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_catchup_all_recovers_full_window(env: WorkflowEnvironment) -> None:
|
||||||
|
"""catchup_all = recover every missed fire: long window, BUFFER_ALL."""
|
||||||
|
defn = _make_defn(misfire_policy="catchup_all")
|
||||||
|
await upsert_schedule(env.client, defn)
|
||||||
|
|
||||||
|
desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
|
||||||
|
assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.BUFFER_ALL
|
||||||
|
assert desc.schedule.policy.catchup_window == timedelta(days=365)
|
||||||
|
|
||||||
|
await delete_schedule(env.client, defn.id)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_catchup_latest_does_not_accumulate(env: WorkflowEnvironment) -> None:
|
||||||
|
"""catchup_latest = recover only the most recent missed fire: BUFFER_ONE."""
|
||||||
|
defn = _make_defn(misfire_policy="catchup_latest")
|
||||||
|
await upsert_schedule(env.client, defn)
|
||||||
|
|
||||||
|
desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
|
||||||
|
assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.BUFFER_ONE
|
||||||
|
assert desc.schedule.policy.catchup_window == timedelta(hours=24)
|
||||||
|
|
||||||
|
await delete_schedule(env.client, defn.id)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_legacy_aliases_map_to_explicit_policies(env: WorkflowEnvironment) -> None:
|
||||||
|
"""Legacy catchup/compress keep working and pick up the new catchup windows."""
|
||||||
|
catchup = _make_defn(misfire_policy="catchup")
|
||||||
|
compress = _make_defn(misfire_policy="compress")
|
||||||
|
await upsert_schedule(env.client, catchup)
|
||||||
|
await upsert_schedule(env.client, compress)
|
||||||
|
|
||||||
|
d1 = await env.client.get_schedule_handle(schedule_id(catchup.id)).describe()
|
||||||
|
d2 = await env.client.get_schedule_handle(schedule_id(compress.id)).describe()
|
||||||
|
assert d1.schedule.policy.catchup_window == timedelta(days=365)
|
||||||
|
assert d2.schedule.policy.catchup_window == timedelta(hours=24)
|
||||||
|
|
||||||
|
await delete_schedule(env.client, catchup.id)
|
||||||
|
await delete_schedule(env.client, compress.id)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_explicit_catchup_window_override(env: WorkflowEnvironment) -> None:
|
||||||
|
"""An explicit catchup_window_seconds overrides the per-policy default."""
|
||||||
|
defn = _make_defn(misfire_policy="skip", catchup_window_seconds=7200)
|
||||||
|
await upsert_schedule(env.client, defn)
|
||||||
|
|
||||||
|
desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
|
||||||
|
assert desc.schedule.policy.catchup_window == timedelta(hours=2)
|
||||||
|
|
||||||
|
await delete_schedule(env.client, defn.id)
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.asyncio
|
@pytest.mark.asyncio
|
||||||
async def test_schedule_smoke_test_creates_one_shot_schedule(
|
async def test_schedule_smoke_test_creates_one_shot_schedule(
|
||||||
env: WorkflowEnvironment,
|
env: WorkflowEnvironment,
|
||||||
|
|||||||
@@ -9,7 +9,7 @@ owner: claude
|
|||||||
topic_slug: activity-core
|
topic_slug: activity-core
|
||||||
created: "2026-06-23"
|
created: "2026-06-23"
|
||||||
updated: "2026-06-23"
|
updated: "2026-06-23"
|
||||||
state_hub_workstream_id: ""
|
state_hub_workstream_id: "91b64686-5d17-4c86-bc9e-3d0ee6720cf5"
|
||||||
---
|
---
|
||||||
|
|
||||||
# Schedule Misfire Robustness & Run-Miss Recovery Options
|
# Schedule Misfire Robustness & Run-Miss Recovery Options
|
||||||
@@ -66,7 +66,7 @@ Proposed mapping to a new `misfire_policy` value set (names open to review):
|
|||||||
id: ACTIVITY-WP-0014-T01
|
id: ACTIVITY-WP-0014-T01
|
||||||
status: todo
|
status: todo
|
||||||
priority: high
|
priority: high
|
||||||
state_hub_task_id: ""
|
state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab"
|
||||||
```
|
```
|
||||||
|
|
||||||
Bring up the ops-bridge tunnel (`bridge up state-hub-coulombcore`,
|
Bring up the ops-bridge tunnel (`bridge up state-hub-coulombcore`,
|
||||||
@@ -81,9 +81,9 @@ calibration evidence is still wanted. Record findings in the workplan.
|
|||||||
|
|
||||||
```task
|
```task
|
||||||
id: ACTIVITY-WP-0014-T02
|
id: ACTIVITY-WP-0014-T02
|
||||||
status: todo
|
status: done
|
||||||
priority: high
|
priority: high
|
||||||
state_hub_task_id: ""
|
state_hub_task_id: "19615562-4cb2-4f25-872f-505d6e40dcc5"
|
||||||
```
|
```
|
||||||
|
|
||||||
Add `catchup_window_seconds` to `CronTriggerConfig` and redefine `misfire_policy`
|
Add `catchup_window_seconds` to `CronTriggerConfig` and redefine `misfire_policy`
|
||||||
@@ -99,7 +99,7 @@ map). Unit tests for each mode's `(catchup_window, overlap)` mapping.
|
|||||||
id: ACTIVITY-WP-0014-T03
|
id: ACTIVITY-WP-0014-T03
|
||||||
status: todo
|
status: todo
|
||||||
priority: medium
|
priority: medium
|
||||||
state_hub_task_id: ""
|
state_hub_task_id: "dbedd96a-59ca-4b83-bce6-35755b076807"
|
||||||
```
|
```
|
||||||
|
|
||||||
Detect when a scheduled definition has no successful run within its expected
|
Detect when a scheduled definition has no successful run within its expected
|
||||||
@@ -112,9 +112,9 @@ be invisible.
|
|||||||
|
|
||||||
```task
|
```task
|
||||||
id: ACTIVITY-WP-0014-T04
|
id: ACTIVITY-WP-0014-T04
|
||||||
status: wait
|
status: progress
|
||||||
priority: medium
|
priority: medium
|
||||||
state_hub_task_id: ""
|
state_hub_task_id: "04e9d1d2-1192-4402-9402-b12c5d7d44e5"
|
||||||
```
|
```
|
||||||
|
|
||||||
Choose and set the appropriate `misfire_policy` for `daily-statehub-wsjf-triage`
|
Choose and set the appropriate `misfire_policy` for `daily-statehub-wsjf-triage`
|
||||||
|
|||||||
Reference in New Issue
Block a user