# activity-core Operational Runbook

## Dev environment — quick start

```bash
# 1. Start the full stack (Temporal + PostgreSQL + Elasticsearch + NATS)
docker compose -f docker-compose.dev.yml up -d

# 2. Apply DB migrations
uv run alembic upgrade head

# 3. Seed initial ActivityDefinitions
uv run python src/activity_core/seed.py

# 4. Register custom Temporal search attributes (one-time per namespace)
docker exec temporal temporal operator search-attribute create \
  --name ActivityId --type Keyword \
  --name ActivityName --type Keyword \
  --address temporal:7233

# 5. Start the worker (syncs schedules automatically on startup)
TEMPORAL_HOST=localhost:7233 \
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
uv run python -m activity_core.worker

# 6. Start the Event Router (in a second terminal)
TEMPORAL_HOST=localhost:7233 \
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
NATS_URL=nats://localhost:4222 \
uv run python -m activity_core.event_router

# 7. Start the REST API (in a third terminal)
TEMPORAL_HOST=localhost:7233 \
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
uv run uvicorn activity_core.api:app --port 8010 --reload
```

---

## Endpoints

| Service | URL |
|---------|-----|
| Temporal Web UI | http://localhost:8080 |
| REST API docs (Swagger) | http://localhost:8010/docs |
| NATS monitoring | http://localhost:8222 |
| Prometheus metrics (worker) | http://localhost:9090/metrics |

---

## REST API — common operations

```bash
# List all ActivityDefinitions
curl http://localhost:8010/activity-definitions/

# Create a cron ActivityDefinition (fires every weekday at 09:00 Berlin time)
curl -s -X POST http://localhost:8010/activity-definitions/ \
  -H "Content-Type: application/json" -d '{
    "name": "daily-report",
    "trigger_config": {
      "trigger_type": "cron",
      "cron_expression": "0 9 * * 1-5",
      "timezone": "Europe/Berlin",
      "misfire_policy": "skip"
    }
  }'

# Create an event-triggered ActivityDefinition
curl -s -X POST http://localhost:8010/activity-definitions/ \
  -H "Content-Type: application/json" -d '{
    "name": "user-onboarding",
    "trigger_config": {
      "trigger_type": "event",
      "event_type": "user.created",
      "filters": {"tier": "pro"}
    }
  }'

# Manually trigger a one-shot run
curl -s -X POST http://localhost:8010/activity-definitions//trigger

# Disable an activity (pauses its schedule)
curl -s -X PUT http://localhost:8010/activity-definitions/ \
  -H "Content-Type: application/json" -d '{"enabled": false}'
```

---

## Publishing events to the Event Router

The Event Router subscribes to the `activity.>` NATS subject on the `ACTIVITY_EVENTS` stream.

```python
import asyncio
import json
import uuid
from datetime import datetime, timezone

import nats


async def publish():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    # Event envelope; "type" and "payload" are what trigger_config is matched against.
    envelope = {
        "event_id": str(uuid.uuid4()),
        "type": "user.created",
        "source": "user-service",
        "occurred_at": datetime.now(tz=timezone.utc).isoformat(),
        "subject": "user/42",
        "trace_id": str(uuid.uuid4()),
        "payload": {"tier": "pro", "region": "eu"},
    }

    # Publish under the activity.> subject space of the ACTIVITY_EVENTS stream.
    await js.publish("activity.user.created", json.dumps(envelope).encode())
    await nc.drain()


asyncio.run(publish())
```

---

## Syncing schedules manually

```bash
TEMPORAL_HOST=localhost:7233 \
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
uv run python -m activity_core.sync_schedules
```

This reconciles all Temporal Schedules with the `activity_definitions` table:

- Upserts schedules for every enabled cron definition
- Creates paused schedules for disabled cron definitions
- Deletes orphaned schedules with no matching DB row

---

## Temporal UI — filtering by activity

With search attributes registered, you can filter in the Temporal Web UI:

```
ActivityId = "your-activity-uuid"
```

Or via the `temporal` CLI:

```bash
docker exec temporal-admin-tools temporal workflow list \
  --query 'ActivityId=""' \
  --address temporal:7233
```

---

## Scale-out

### Multiple worker replicas

Temporal workers are stateless and horizontally scalable. Run additional worker processes to increase throughput on `orchestrator-tq` and `task-execution-tq`. Each worker registers the same workflows/activities, and Temporal distributes tasks across all pollers automatically.

**Important:** Only one process should call `sync_schedules` at startup to avoid race conditions. Consider disabling the startup sync on secondary worker replicas via an env var:

```bash
SKIP_SCHEDULE_SYNC=true uv run python -m activity_core.worker
```

(Implement the `SKIP_SCHEDULE_SYNC` check in `worker.py` when needed.)

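The `SKIP_SCHEDULE_SYNC` check is left unimplemented above, so here is a minimal sketch of what it might look like. The helper name `should_sync_schedules` and the set of accepted truthy spellings are assumptions for illustration; neither exists in `worker.py` today.

```python
import os

# Hypothetical: spellings treated as "skip the sync". Adjust to taste.
_TRUTHY = {"1", "true", "yes", "on"}


def should_sync_schedules(env=None):
    """Return False when SKIP_SCHEDULE_SYNC is set to a truthy value.

    `env` defaults to os.environ; passing a plain dict makes the check
    easy to unit-test without touching the real environment.
    """
    env = os.environ if env is None else env
    raw = env.get("SKIP_SCHEDULE_SYNC", "")
    return raw.strip().lower() not in _TRUTHY
```

The worker would call `should_sync_schedules()` once at startup and run the schedule sync only when it returns `True`. With the variable unset the sync still runs, so the primary replica needs no extra configuration.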
### Multiple Event Router replicas

The durable NATS consumer (`activity-core-event-router`) ensures that only one subscriber processes each message. Running multiple `event_router` processes with the same durable consumer name provides automatic failover.

---

## Troubleshooting

### Worker fails to start: "ACTCORE_DB_URL is required"

Set the environment variable before running the worker.

### Schedule not firing

1. Check Temporal UI → Schedules tab for the schedule status.
2. Ensure `enabled=True` on the ActivityDefinition (paused schedules don't fire).
3. Verify the cron expression with: `docker exec temporal-admin-tools temporal schedule describe --schedule-id activity-schedule-`

### Event not routing

1. Check NATS monitoring: http://localhost:8222/jsz to verify the `ACTIVITY_EVENTS` stream exists.
2. Verify the consumer is active: http://localhost:8222/jsz?consumers=true
3. Check Event Router logs for "matched no definitions": the event type may not match any enabled ActivityDefinition.
4. Check `trigger_config.filters`: all key/value pairs must match the event payload exactly.

### Workflow stuck / not completing

1. Open Temporal UI → find the workflow by ID or ActivityId search attribute.
2. Check the workflow history for failed activities.
3. Common causes:
   - DB connection lost during `load_activity_definition` or `log_run`
   - Activity retries exhausted (check `maximum_attempts=10`)
   - `ActivityDefinition` row was deleted while the workflow was running

### Prometheus metrics not appearing

1. Confirm the worker is running with `PROMETHEUS_BIND_ADDR` set.
2. `curl http://localhost:9090/metrics` should return Temporal SDK metrics.
3. If port 9090 conflicts with a Prometheus server, set `PROMETHEUS_BIND_ADDR=0.0.0.0:9091`.

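When debugging "Event not routing", it can help to reproduce the filter check from step 4 of that section by hand. The sketch below assumes flat filters compared by plain equality against top-level payload keys, and that an empty filter set matches every event. The function name `filters_match` is hypothetical, and the real Event Router may behave differently (for example with nested keys).

```python
def filters_match(filters, payload):
    """Return True when every filter key is present in payload with an equal value.

    Comparison is exact: "pro" does not match "Pro", and a missing key
    counts as a mismatch. An empty filter dict matches any payload.
    """
    return all(
        key in payload and payload[key] == expected
        for key, expected in filters.items()
    )
```

With the payload from the publishing example, `{"tier": "pro", "region": "eu"}`, the filter `{"tier": "pro"}` matches, while `{"tier": "Pro"}` or `{"plan": "pro"}` does not.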
### DB migration drift

```bash
uv run alembic current        # show current revision
uv run alembic upgrade head   # apply pending migrations
uv run alembic history        # show full migration history
```

---

## Railiance Deployment

### Pre-requisites

- Docker ≥ 24 with Compose v2 (`docker compose` not `docker-compose`)
- ≥ 4 GB RAM available (Temporal server takes ~1 GB)
- Ports available: 4222 (NATS), 7233 (Temporal gRPC), 8010 (API), 8080 (Temporal UI), 9090 (Prometheus metrics)

### First-time setup

```bash
# 1. Copy and edit the env file — fill in all secrets and URLs
cp .env.example .env

# 2. Build the image and start all services
make railiance-up

# 3. Wait for health (retry until 200)
curl -sf http://localhost:8010/health
# → {"status":"ok","db":true,"temporal":true}

# 4. Register Temporal search attributes (one-time per namespace)
docker exec actcore-temporal temporal operator search-attribute create \
  --name ActivityId --type Keyword \
  --name ActivityName --type Keyword \
  --address temporal:7233

# 5. Load event types and activity definitions
make sync-all
```

### Upgrade procedure

```bash
git pull
make railiance-up   # rebuilds image, restarts changed services
make migrate        # apply any new migrations (safe to run when none pending)
curl -sf http://localhost:8010/health
```

### Health verification

```bash
# API health (db + temporal probes)
curl -s http://localhost:8010/health | python3 -m json.tool

# Temporal UI
open http://localhost:8080

# Prometheus metrics
curl -s http://localhost:9090/metrics | head -20
```

### Common ops

```bash
# Follow logs for one service
docker compose -f docker-compose.railiance.yml logs -f actcore-worker

# Restart one service without bringing down others
docker compose -f docker-compose.railiance.yml restart actcore-api

# Re-run migrations manually
docker compose -f docker-compose.railiance.yml run --rm actcore-migrate

# Wipe and reset (DESTRUCTIVE — deletes all volumes including DB data)
make railiance-down
docker volume rm activity-core_temporal-db-data activity-core_app-db-data activity-core_nats-data
make railiance-up
```

---

## Wipe and restart dev stack

```bash
docker compose -f docker-compose.dev.yml down -v   # removes all volumes
docker compose -f docker-compose.dev.yml up -d
uv run alembic upgrade head
uv run python src/activity_core/seed.py
# Re-register search attributes (see Dev environment step 4)
```
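
After a reset or a `make railiance-up`, the "retry until 200" health wait from the first-time setup can be scripted rather than repeated by hand. The following is a sketch using only the standard library. The function names, attempt count, and delay are assumptions, and only the `/health` URL comes from this runbook.

```python
import time
import urllib.error
import urllib.request


def probe_health(url="http://localhost:8010/health", timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def wait_for_health(probe=probe_health, attempts=30, delay=2.0, sleep=time.sleep):
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries.

    Returns True as soon as a probe succeeds, False if all attempts fail.
    `probe` and `sleep` are injectable so the loop can be tested offline.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_for_health()` with the defaults polls the API for roughly a minute before giving up. Its boolean result can be turned into an exit code in a deployment script.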