Implement post-triage operational hardening

2026-06-04 12:15:07 +02:00
parent 8a33ec44b6
commit 20d4f26166
11 changed files with 775 additions and 31 deletions
--- a/docs/runbook.md
+++ b/docs/runbook.md
@@ -147,6 +147,55 @@ docker exec temporal-admin-tools temporal workflow list \

 ---

+## Daily State Hub WSJF triage verification
+
+Use this when answering: "did today's daily triage run happen?"
+
+Set the ActivityDefinition id when known. If it is not known, pass the
+definition name used in the environment and let the live helper resolve it from
+Postgres.
+
+```bash
+export DAILY_TRIAGE_ACTIVITY_ID=<daily-triage-activity-definition-uuid>
+
+# Dry-run checklist; safe from any shell because it only prints checks.
+uv run python scripts/verify_daily_triage.py \
+  --activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
+  --date "$(date -u +%F)"
+
+# Live check from a shell with Temporal, DB, State Hub, and working-memory access.
+ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
+TEMPORAL_HOST=localhost:7233 \
+STATE_HUB_URL=http://127.0.0.1:8000 \
+  uv run python scripts/verify_daily_triage.py \
+    --activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
+    --working-memory-dir /home/worsch/the-custodian/working-memory \
+    --live
+```
+
+The verification is complete when all of these agree:
+
+- Temporal schedule `activity-schedule-$DAILY_TRIAGE_ACTIVITY_ID` exists, is not
+  paused, and uses the `skip` overlap policy.
+- The latest workflow found with `ActivityId="$DAILY_TRIAGE_ACTIVITY_ID"` either
+  completed or is visibly retrying a failed activity in history.
+- `activity_runs` has a row for the daily triage ActivityDefinition with today's
+  `scheduled_for` or `fired_at` date.
+- State Hub `/progress/` contains a `daily_triage` event whose detail includes
+  the same `activity_core_run_id`.
+- The working-memory sink wrote `daily-triage-YYYY-MM-DD-<run>.md` and its
+  frontmatter contains the same `activity_core_run_id`.
+- The ActivityDefinition's instruction model, token budget, and sink timeouts fit
+  under `ACTIVITY_TIMEOUT_SECONDS` (default 900 seconds). Temporal retries each
+  activity up to 10 attempts, so a slow LLM or sink failure should show as
+  workflow retry history rather than a silent missing report.
+
+Expected missed-run behavior: the daily triage definition should use
+`misfire_policy: skip`. Planned downtime does not catch up missed daily reports;
+the next scheduled fire is the next authoritative run.
+
+---
+
 ## Scale-out

 ### Multiple worker replicas
@@ -204,6 +253,44 @@ Set the environment variable before running the worker.
 2. `curl http://localhost:9090/metrics` should return Temporal SDK metrics.
 3. If port 9090 conflicts with Prometheus server, set `PROMETHEUS_BIND_ADDR=0.0.0.0:9091`.

+### Production alerting and failure modes
+
+Kubernetes health expectations:
+
+```bash
+kubectl -n activity-core get deploy actcore-worker actcore-api actcore-event-router
+kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core
+kubectl -n activity-core port-forward svc/actcore-worker-metrics 9090:9090
+curl -sf http://127.0.0.1:9090/metrics
+```
+
+Page an operator when:
+
+- `actcore-worker` has no ready pod, cannot connect to Temporal, or cannot reach
+  Postgres.
+- The daily triage schedule is missing or paused outside an approved maintenance
+  window.
+- The expected daily triage run is absent from Temporal and `activity_runs`
+  after the retry window.
+- Both State Hub progress and working-memory report sinks are missing for a
+  completed run.
+- Report sink or task emission failures repeat across Temporal retries.
+
+Leave a State Hub progress note, but do not page, when:
+
+- A planned outage caused one skipped run and the schedule is healthy again.
+- A sink idempotency check reports `exists` for the expected run id.
+- The report completed but calibration feedback says the recommendations were
+  noisy, too long, or under-sensitive.
+
+Handle in the next operator session:
+
+- Prompt/schema tuning, loose-end sensitivity, and stale-but-parked work
+  calibration.
+- Non-urgent schedule jitter or timeout adjustments.
+- Moving a task sink from `ISSUE_SINK_TYPE=null` to the real issue-core endpoint
+  after a dry-run contract check has passed.
+
 ### DB migration drift
 ```bash
 uv run alembic current    # show current revision