Files
activity-core/docs/runbook.md

424 lines
14 KiB
Markdown

# activity-core Operational Runbook
## Dev environment — quick start
```bash
# 1. Start the full stack (Temporal + PostgreSQL + Elasticsearch + NATS)
docker compose -f docker-compose.dev.yml up -d
# 2. Apply DB migrations
uv run alembic upgrade head
# 3. Seed initial ActivityDefinitions
uv run python src/activity_core/seed.py
# 4. Register custom Temporal search attributes (one-time per namespace)
docker exec temporal temporal operator search-attribute create \
--name ActivityId --type Keyword \
--name ActivityName --type Keyword \
--address temporal:7233
# 5. Start the worker (syncs schedules automatically on startup)
TEMPORAL_HOST=localhost:7233 \
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
uv run python -m activity_core.worker
# 6. Start the Event Router (in a second terminal)
TEMPORAL_HOST=localhost:7233 \
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
NATS_URL=nats://localhost:4222 \
uv run python -m activity_core.event_router
# 7. Start the REST API (in a third terminal)
TEMPORAL_HOST=localhost:7233 \
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
uv run uvicorn activity_core.api:app --port 8010 --reload
```
---
## Endpoints
| Service | URL |
|---------|-----|
| Temporal Web UI | http://localhost:8080 |
| REST API docs (Swagger) | http://localhost:8010/docs |
| NATS monitoring | http://localhost:8222 |
| Prometheus metrics (worker) | http://localhost:9090/metrics |
---
## REST API — common operations
```bash
# List all ActivityDefinitions
curl http://localhost:8010/activity-definitions/
# Create a cron ActivityDefinition (fires every weekday at 09:00 Berlin time)
curl -s -X POST http://localhost:8010/activity-definitions/ \
-H "Content-Type: application/json" -d '{
"name": "daily-report",
"trigger_config": {
"trigger_type": "cron",
"cron_expression": "0 9 * * 1-5",
"timezone": "Europe/Berlin",
"misfire_policy": "skip"
}
}'
# Create an event-triggered ActivityDefinition
curl -s -X POST http://localhost:8010/activity-definitions/ \
-H "Content-Type: application/json" -d '{
"name": "user-onboarding",
"trigger_config": {
"trigger_type": "event",
"event_type": "user.created",
"filters": {"tier": "pro"}
}
}'
# Manually trigger a one-shot run
curl -s -X POST http://localhost:8010/activity-definitions/<id>/trigger
# Disable an activity (pauses its schedule)
curl -s -X PUT http://localhost:8010/activity-definitions/<id> \
-H "Content-Type: application/json" -d '{"enabled": false}'
```
---
## Publishing events to the Event Router
The Event Router subscribes to the `activity.>` NATS subject on the `ACTIVITY_EVENTS` stream.
```python
import asyncio, json, nats
from datetime import datetime, timezone
import uuid
async def publish():
nc = await nats.connect("nats://localhost:4222")
js = nc.jetstream()
envelope = {
"event_id": str(uuid.uuid4()),
"type": "user.created",
"source": "user-service",
"occurred_at": datetime.now(tz=timezone.utc).isoformat(),
"subject": "user/42",
"trace_id": str(uuid.uuid4()),
"payload": {"tier": "pro", "region": "eu"},
}
await js.publish("activity.user.created", json.dumps(envelope).encode())
await nc.drain()
asyncio.run(publish())
```
---
## Syncing schedules manually
```bash
TEMPORAL_HOST=localhost:7233 \
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
uv run python -m activity_core.sync_schedules
```
This reconciles all Temporal Schedules with the `activity_definitions` table:
- Upserts schedules for every enabled cron definition
- Creates paused schedules for disabled cron definitions
- Deletes orphaned schedules with no matching DB row
After adding or changing a recurring ActivityDefinition or workflow activity
wiring, run a smoke schedule before trusting the next real fire:
```bash
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
TEMPORAL_HOST=localhost:7233 \
uv run python scripts/smoke_test_schedule.py \
--activity-id <activity-definition-uuid> \
--recreate-recurring
```
The smoke command deletes and recreates the recurring Temporal Schedule when
`--recreate-recurring` is set, creates a distinct one-shot smoke Schedule one
minute in the future, waits for the smoke workflow to complete, and exits
non-zero if the workflow fails or times out. Use this after worker deployments
that add workflow imports or new activities; it catches stale-worker and missing
activity registration issues before the next scheduled run.
---
## Weekly maintenance definitions
`weekly-sbom-staleness` is the canonical rule-only weekly maintenance schedule.
It runs Mondays at 09:00 Europe/Berlin, resolves State Hub SBOM status for all
repos, and emits one automated task per stale repo through explicit
`for_each: context.repos.repos`.
`weekly-coding-retro` follows the same cron -> context resolver -> per-repo task
pattern for coding-session retrospection. It runs Saturdays at 19:00
Europe/Berlin and resolves the latest State Hub `/progress/` item with
`event_type=coding_retro` into `context.retro.suggestions`. Each positive-score
suggestion emits one task to `context.s.repo` with labels
`coding-retro`, `improvement`, and `automated`.
Keep `weekly-coding-retro` disabled until Helix Forge publishes the
`coding_retro` read model and a smoke run confirms the resolver returns a
non-empty suggestion set with no duplicate target tasks on re-run.
---
## Temporal UI — filtering by activity
With search attributes registered, you can filter in the Temporal Web UI:
```
ActivityId = "your-activity-uuid"
```
Or via `tctl`:
```bash
docker exec temporal-admin-tools temporal workflow list \
--query 'ActivityId="<uuid>"' \
--address temporal:7233
```
---
## Daily State Hub WSJF triage verification
Use this when answering: "did today's daily triage run happen?"
Set the ActivityDefinition id when known. If it is not known, pass the
definition name used in the environment and let the live helper resolve it from
Postgres.
```bash
export DAILY_TRIAGE_ACTIVITY_ID=<daily-triage-activity-definition-uuid>
# Dry-run checklist; safe from any shell because it only prints checks.
uv run python scripts/verify_daily_triage.py \
--activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
--date "$(date -u +%F)"
# Live check from a shell with Temporal, DB, State Hub, and working-memory access.
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
TEMPORAL_HOST=localhost:7233 \
STATE_HUB_URL=http://127.0.0.1:8000 \
uv run python scripts/verify_daily_triage.py \
--activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
--working-memory-dir /home/worsch/the-custodian/memory/working \
--live
```
The verification is complete when all of these agree:
- Temporal schedule `activity-schedule-$DAILY_TRIAGE_ACTIVITY_ID` exists, is not
paused, and uses the `skip` overlap policy.
- The latest workflow found with `ActivityId="$DAILY_TRIAGE_ACTIVITY_ID"` either
completed or is visibly retrying a failed activity in history.
- `activity_runs` has a row for the daily triage ActivityDefinition with today's
`scheduled_for` or `fired_at` date.
- State Hub `/progress/` contains a `daily_triage` event whose detail includes
the same `activity_core_run_id` and its `output_validated` flag.
- The working-memory sink wrote `daily-triage-YYYY-MM-DD-<run>.md` and its
frontmatter contains the same `activity_core_run_id` and validation metadata.
- The ActivityDefinition's instruction model, token budget, and sink timeouts fit
under `ACTIVITY_TIMEOUT_SECONDS` (default 900 seconds). Temporal retries each
activity up to 10 attempts, so a slow LLM or sink failure should show as
workflow retry history rather than a silent missing report.
Expected missed-run behavior: the daily triage definition should use
`misfire_policy: skip`. Planned downtime does not catch up missed daily reports;
the next scheduled fire is the next authoritative run.
---
## Scale-out
### Multiple worker replicas
Temporal workers are stateless and horizontally scalable. Run additional worker
processes to increase throughput on `orchestrator-tq` and `task-execution-tq`.
Each worker registers the same workflows/activities — Temporal distributes tasks
across all pollers automatically.
**Important:** Only one process should call `sync_schedules` at startup to avoid
race conditions. Consider disabling the startup sync on secondary worker replicas
via an env var:
```bash
SKIP_SCHEDULE_SYNC=true uv run python -m activity_core.worker
```
(Implement the `SKIP_SCHEDULE_SYNC` check in `worker.py` when needed.)
### Multiple Event Router replicas
The durable NATS consumer (`activity-core-event-router`) ensures that only one
subscriber processes each message. Running multiple `event_router` processes with
the same durable consumer name provides automatic failover.
---
## Troubleshooting
### Worker fails to start: "ACTCORE_DB_URL is required"
Set the environment variable before running the worker.
### Schedule not firing
1. Check Temporal UI → Schedules tab for the schedule status.
2. Ensure `enabled=True` on the ActivityDefinition (paused schedules don't fire).
3. Verify the cron expression with: `docker exec temporal-admin-tools temporal schedule describe --schedule-id activity-schedule-<uuid>`
### Event not routing
1. Check NATS monitoring: http://localhost:8222/jsz to verify the `ACTIVITY_EVENTS` stream exists.
2. Verify the consumer is active: http://localhost:8222/jsz?consumers=true
3. Check Event Router logs for "matched no definitions" — the event type may not match any enabled ActivityDefinition.
4. Check `trigger_config.filters` — all key/value pairs must match the event payload exactly.
### Workflow stuck / not completing
1. Open Temporal UI → find the workflow by ID or ActivityId search attribute.
2. Check the workflow history for failed activities.
3. Common causes:
- DB connection lost during `load_activity_definition` or `log_run`
- Activity retry exhausted (check `maximum_attempts=10`)
- `ActivityDefinition` row was deleted while workflow was running
### Prometheus metrics not appearing
1. Confirm the worker is running with `PROMETHEUS_BIND_ADDR` set.
2. `curl http://localhost:9090/metrics` should return Temporal SDK metrics.
3. If port 9090 conflicts with Prometheus server, set `PROMETHEUS_BIND_ADDR=0.0.0.0:9091`.
### Production alerting and failure modes
Kubernetes health expectations:
```bash
kubectl -n activity-core get deploy actcore-worker actcore-api actcore-event-router
kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core
kubectl -n activity-core port-forward svc/actcore-worker-metrics 9090:9090
curl -sf http://127.0.0.1:9090/metrics
```
Page an operator when:
- `actcore-worker` has no ready pod, cannot connect to Temporal, or cannot reach
Postgres.
- The daily triage schedule is missing or paused outside an approved maintenance
window.
- The expected daily triage run is absent from Temporal and `activity_runs`
after the retry window.
- Both State Hub progress and working-memory report sinks are missing for a
completed run.
- Report sink or task emission failures repeat across Temporal retries.
Leave a State Hub progress note, but do not page, when:
- A planned outage caused one skipped run and the schedule is healthy again.
- A sink idempotency check reports `exists` for the expected run id.
- An instruction report has `output_validated=false` but still emitted a
validation-failure note preserving partial model output for review.
- The report completed but calibration feedback says the recommendations were
noisy, too long, or under-sensitive.
Handle in the next operator session:
- Prompt/schema tuning, loose-end sensitivity, and stale-but-parked work
calibration.
- Non-urgent schedule jitter or timeout adjustments.
- Moving a task sink from `ISSUE_SINK_TYPE=null` to the real issue-core endpoint
after a dry-run contract check has passed.
### DB migration drift
```bash
uv run alembic current # show current revision
uv run alembic upgrade head # apply pending migrations
uv run alembic history # show full migration history
```
---
## Railiance Deployment
### Pre-requisites
- Docker ≥ 24 with Compose v2 (`docker compose` not `docker-compose`)
- ≥ 4 GB RAM available (Temporal server takes ~1 GB)
- Ports available: 4222 (NATS), 7233 (Temporal gRPC), 8010 (API), 8080 (Temporal UI),
9090 (Prometheus metrics)
### First-time setup
```bash
# 1. Copy and edit the env file — fill in all secrets and URLs
cp .env.example .env
# 2. Build the image and start all services
make railiance-up
# 3. Wait for health (retry until 200)
curl -sf http://localhost:8010/health # → {"status":"ok","db":true,"temporal":true}
# 4. Register Temporal search attributes (one-time per namespace)
docker exec actcore-temporal temporal operator search-attribute create \
--name ActivityId --type Keyword \
--name ActivityName --type Keyword \
--address temporal:7233
# 5. Load event types and activity definitions
make sync-all
```
### Upgrade procedure
```bash
git pull
make railiance-up # rebuilds image, restarts changed services
make migrate # apply any new migrations (safe to run when none pending)
curl -sf http://localhost:8010/health
```
### Health verification
```bash
# API health (db + temporal probes)
curl -s http://localhost:8010/health | python3 -m json.tool
# Temporal UI
open http://localhost:8080
# Prometheus metrics
curl -s http://localhost:9090/metrics | head -20
```
### Common ops
```bash
# Follow logs for one service
docker compose -f docker-compose.railiance.yml logs -f actcore-worker
# Restart one service without bringing down others
docker compose -f docker-compose.railiance.yml restart actcore-api
# Re-run migrations manually
docker compose -f docker-compose.railiance.yml run --rm actcore-migrate
# Wipe and reset (DESTRUCTIVE — deletes all volumes including DB data)
make railiance-down
docker volume rm activity-core_temporal-db-data activity-core_app-db-data activity-core_nats-data
make railiance-up
```
---
## Wipe and restart dev stack
```bash
docker compose -f docker-compose.dev.yml down -v # removes all volumes
docker compose -f docker-compose.dev.yml up -d
uv run alembic upgrade head
uv run python src/activity_core/seed.py
# Re-register search attributes (see Dev environment step 4)
```