generated from coulomb/repo-seed
Implement post-triage operational hardening
@@ -147,6 +147,55 @@ docker exec temporal-admin-tools temporal workflow list \

---

## Daily State Hub WSJF triage verification

Use this when answering: "did today's daily triage run happen?"

Set the ActivityDefinition id when known. If it is not known, pass the
definition name used in the environment and let the live helper resolve it from
Postgres.

```bash
export DAILY_TRIAGE_ACTIVITY_ID=<daily-triage-activity-definition-uuid>

# Dry-run checklist; safe from any shell because it only prints checks.
uv run python scripts/verify_daily_triage.py \
  --activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
  --date "$(date -u +%F)"

# Live check from a shell with Temporal, DB, State Hub, and working-memory access.
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
TEMPORAL_HOST=localhost:7233 \
STATE_HUB_URL=http://127.0.0.1:8000 \
uv run python scripts/verify_daily_triage.py \
  --activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
  --working-memory-dir /home/worsch/the-custodian/working-memory \
  --live
```

The verification is complete when all of these agree:

- Temporal schedule `activity-schedule-$DAILY_TRIAGE_ACTIVITY_ID` exists, is not
  paused, and uses the `skip` overlap policy.
- The latest workflow found with `ActivityId="$DAILY_TRIAGE_ACTIVITY_ID"` either
  completed or is visibly retrying a failed activity in history.
- `activity_runs` has a row for the daily triage ActivityDefinition with today's
  `scheduled_for` or `fired_at` date.
- State Hub `/progress/` contains a `daily_triage` event whose detail includes
  the same `activity_core_run_id`.
- The working-memory sink wrote `daily-triage-YYYY-MM-DD-<run>.md` and its
  frontmatter contains the same `activity_core_run_id`.
- The ActivityDefinition's instruction model, token budget, and sink timeouts fit
  under `ACTIVITY_TIMEOUT_SECONDS` (default 900 seconds). Temporal retries each
  activity up to 10 attempts, so a slow LLM or sink failure should show as
  workflow retry history rather than a silent missing report.
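
The three `activity_core_run_id` checks above reduce to one comparison. The following is a minimal sketch of that comparison, not the logic of `scripts/verify_daily_triage.py`. It assumes the report frontmatter is a leading block delimited by `---` lines with simple `key: value` entries, and that the State Hub event detail is available as a dict. The function names are hypothetical.

```python
from typing import Mapping, Optional

RUN_ID_KEY = "activity_core_run_id"


def frontmatter_run_id(markdown_text: str) -> Optional[str]:
    """Return the run id from a leading '---' frontmatter block, or None.

    Assumption: frontmatter uses plain `key: value` lines; nested YAML
    is not handled in this sketch.
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter, key not found
        key, sep, value = line.partition(":")
        if sep and key.strip() == RUN_ID_KEY:
            return value.strip().strip("\"'") or None
    return None


def run_ids_agree(
    activity_runs_id: Optional[str],
    progress_detail: Mapping[str, object],
    report_text: str,
) -> bool:
    """True only when Postgres, State Hub, and the report share one run id."""
    if not activity_runs_id:
        return False
    return (
        progress_detail.get(RUN_ID_KEY) == activity_runs_id
        and frontmatter_run_id(report_text) == activity_runs_id
    )
```

A mismatch in any one source makes `run_ids_agree` return `False`, which matches the "all of these agree" rule: a report file from a different run does not count as evidence for today's run.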

Expected missed-run behavior: the daily triage definition should use
`misfire_policy: skip`. Planned downtime does not catch up missed daily reports;
the next scheduled fire is the next authoritative run.
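
To make the skip behavior concrete, here is a small sketch that partitions scheduled fire times around a downtime window. It only illustrates the expected outcome and does not use Temporal's or activity-core's scheduling code. The 06:00 UTC fire time and the function names are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


def daily_fires(first: datetime, days: int) -> List[datetime]:
    """Build `days` consecutive daily fire times starting at `first`."""
    return [first + timedelta(days=i) for i in range(days)]


def split_fires(
    fires: List[datetime], down_start: datetime, down_end: datetime
) -> Tuple[List[datetime], List[datetime]]:
    """Split fires into (skipped, kept) for downtime [down_start, down_end).

    Under a skip policy, fires inside the window are dropped, not replayed.
    """
    skipped = [f for f in fires if down_start <= f < down_end]
    kept = [f for f in fires if not (down_start <= f < down_end)]
    return skipped, kept


def next_authoritative_run(
    fires: List[datetime], down_start: datetime, down_end: datetime
) -> Optional[datetime]:
    """Return the first scheduled fire at or after the end of the downtime."""
    later = [f for f in fires if f >= down_end]
    return min(later) if later else None


# Hypothetical schedule: daily at 06:00 UTC, downtime spanning the second fire.
fires = daily_fires(datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc), days=3)
down_start = datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc)
down_end = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
skipped, kept = split_fires(fires, down_start, down_end)
```

In this example the January 2 fire is skipped and the January 3 fire is the next authoritative run, so a missing January 2 report is expected rather than a fault.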

---

## Scale-out

### Multiple worker replicas
@@ -204,6 +253,44 @@ Set the environment variable before running the worker.
2. `curl http://localhost:9090/metrics` should return Temporal SDK metrics.
3. If port 9090 conflicts with Prometheus server, set `PROMETHEUS_BIND_ADDR=0.0.0.0:9091`.

### Production alerting and failure modes

Kubernetes health expectations:

```bash
kubectl -n activity-core get deploy actcore-worker actcore-api actcore-event-router
kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core
kubectl -n activity-core port-forward svc/actcore-worker-metrics 9090:9090
curl -sf http://127.0.0.1:9090/metrics
```
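
A successful `curl` only proves the endpoint answers. As a hedged sketch, the response body can also be checked for Temporal SDK series by parsing the Prometheus text exposition format. The `temporal_` prefix is an assumption about how the SDK names its metrics in this deployment, and the sample payload is illustrative, not captured output.

```python
from typing import Set


def metric_names(exposition_text: str) -> Set[str]:
    """Collect metric names from Prometheus text exposition output.

    Comment lines (# HELP, # TYPE) and blank lines are ignored. A sample
    line looks like `name{labels} value` or `name value`.
    """
    names: Set[str] = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Cut at the label block first so spaces inside label values are safe.
        name = line.split("{", 1)[0].split(None, 1)[0]
        names.add(name)
    return names


def has_sdk_metrics(exposition_text: str, prefix: str = "temporal_") -> bool:
    """True if at least one metric name starts with the assumed SDK prefix."""
    return any(name.startswith(prefix) for name in metric_names(exposition_text))


# Illustrative payload only; real output will contain different series.
SAMPLE = """\
# HELP temporal_request Count of requests
# TYPE temporal_request counter
temporal_request{namespace="default",operation="Poll Task"} 12
process_cpu_seconds_total 3.5
"""
```

If `has_sdk_metrics` is false while the endpoint responds, the port is probably serving something other than the worker's metrics, for example a Prometheus server bound to the same port.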

Page an operator when:

- `actcore-worker` has no ready pod, cannot connect to Temporal, or cannot reach
  Postgres.
- The daily triage schedule is missing or paused outside an approved maintenance
  window.
- The expected daily triage run is absent from Temporal and `activity_runs`
  after the retry window.
- Both State Hub progress and working-memory report sinks are missing for a
  completed run.
- Report sink or task emission failures repeat across Temporal retries.

Leave a State Hub progress note, but do not page, when:

- A planned outage caused one skipped run and the schedule is healthy again.
- A sink idempotency check reports `exists` for the expected run id.
- The report completed but calibration feedback says the recommendations were
  noisy, too long, or under-sensitive.

Handle in the next operator session:

- Prompt/schema tuning, loose-end sensitivity, and stale-but-parked work
  calibration.
- Non-urgent schedule jitter or timeout adjustments.
- Moving a task sink from `ISSUE_SINK_TYPE=null` to the real issue-core endpoint
  after a dry-run contract check has passed.
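
The page and progress-note rules above can be read as a small decision function. The sketch below is one possible encoding, not part of activity-core. Every field name is hypothetical, and it assumes the maintenance-window exception applies to both a missing and a paused schedule.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE = "page"
    NOTE = "progress_note"
    NONE = "none"


@dataclass
class TriageHealth:
    # All field names are hypothetical; defaults describe a healthy system.
    worker_ready: bool = True
    temporal_reachable: bool = True
    postgres_reachable: bool = True
    schedule_present: bool = True
    schedule_paused: bool = False
    in_maintenance_window: bool = False
    run_present_after_retry_window: bool = True
    run_completed: bool = True
    state_hub_sink_present: bool = True
    working_memory_sink_present: bool = True
    repeated_sink_failures: bool = False
    skipped_by_planned_outage: bool = False
    sink_reports_exists: bool = False
    calibration_feedback_noisy: bool = False


def classify(h: TriageHealth) -> Action:
    """Map observed health to an operator action, paging rules first."""
    if not (h.worker_ready and h.temporal_reachable and h.postgres_reachable):
        return Action.PAGE
    schedule_healthy = h.schedule_present and not h.schedule_paused
    if not schedule_healthy and not h.in_maintenance_window:
        return Action.PAGE
    if not h.run_present_after_retry_window:
        # One skipped run after a planned outage is a note, not a page.
        if h.skipped_by_planned_outage and schedule_healthy:
            return Action.NOTE
        return Action.PAGE
    both_sinks_missing = not (
        h.state_hub_sink_present or h.working_memory_sink_present
    )
    if h.run_completed and both_sinks_missing:
        return Action.PAGE
    if h.repeated_sink_failures:
        return Action.PAGE
    if h.sink_reports_exists or h.calibration_feedback_noisy:
        return Action.NOTE
    return Action.NONE
```

Paging conditions are evaluated before note conditions so that a noisy-report note never masks an unreachable worker. The "next operator session" items are planned work rather than observed conditions, so they are left out of the function.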

### DB migration drift
```bash
uv run alembic current   # show current revision