---
id: custodian-WP-0004
type: workplan
domain: custodian
repo: activity-core
status: done
state_hub_workstream_id: 759b1255-aa78-42b5-8fab-c5fcf168f5f4
tasks:
  - id: T58
    title: Dockerfile (multi-stage, uv-based)
    status: done
    priority: high
    state_hub_task_id: 2a81f9ba-47cb-480f-a5d3-af74fe238485
  - id: T59
    title: docker-compose.railiance.yml (full stack, no Elasticsearch)
    status: done
    priority: high
    state_hub_task_id: 67981de4-b766-4c47-9985-9573297ac464
  - id: T60
    title: "GET /health endpoint"
    status: done
    priority: high
    state_hub_task_id: 7c7a1617-29e9-4ea5-a675-fd75e325c451
  - id: T61
    title: .env.example — complete env var reference
    status: done
    priority: medium
    state_hub_task_id: 2a2a8d02-19dc-4b09-8792-f855ed48388a
  - id: T62
    title: Makefile ops targets
    status: done
    priority: medium
    state_hub_task_id: 55ac37b5-5606-42e5-bf11-e8dae3a188b7
  - id: T63
    title: SIGTERM graceful shutdown (worker + event_router)
    status: done
    priority: medium
    state_hub_task_id: 65b3229b-5f03-450d-a263-f2c205be3d28
  - id: T64
    title: docs/runbook.md — railiance deployment section
    status: done
    priority: medium
    state_hub_task_id: 83dcd765-5715-499f-be23-f728d0261cfb
created: "2026-05-14"
---

# activity-core WP-0004 — Railiance Deployment & Service Ops

**Hub workstream:** `759b1255-aa78-42b5-8fab-c5fcf168f5f4`

**Goal:** Package activity-core as a fully standalone deployable service, runnable on railiance with no imports from or requirements against other custodian repos.

## Context

WP-0001 through WP-0003 built the complete event-bridge implementation. The service is functionally complete. This workplan makes it operationally deployable: containerised, healthchecked, gracefully shutting down, and documented for railiance.

Runtime dependencies are infrastructure only (Temporal, PostgreSQL, NATS). The optional soft dependencies (state-hub, repo-scoping, issue-core) are already gracefully degraded in the code — they bind `{}` on failure and never abort a run.
## Existing assets (no rework needed)

- `docker-compose.dev.yml` — full dev stack (Temporal + ES + PG + NATS); keep as-is
- `worker.py`, `api.py`, `event_router.py` — entry points exist and are functional
- `migrations/`, `alembic.ini` — schema management is in place
- `docs/runbook.md` — dev quick-start section exists; extend for railiance

## Build Order

```
T58 (Dockerfile) → T59 (railiance compose — depends on image)
T60, T61, T62 — parallel, no deps
T63 — independent, improves T59 (clean shutdown in compose)
T64 — last, documents T58-T63
```

---

## T58: Dockerfile (multi-stage, uv-based)

**File:** `Dockerfile`

Two-stage build to keep the runtime image lean.

```dockerfile
# Stage 1 — install Python deps
FROM python:3.12-slim AS builder
RUN pip install uv --no-cache-dir
WORKDIR /app
COPY pyproject.toml uv.lock ./
COPY src/ ./src/
RUN uv sync --no-dev --frozen

# Stage 2 — runtime image
FROM python:3.12-slim AS runtime
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY --from=builder /app/src /app/src
# Include definition files that ship with the repo
COPY activity-definitions/ ./activity-definitions/
COPY event-types/ ./event-types/
COPY tasks/ ./tasks/
ENV PATH="/app/.venv/bin:$PATH"
ENV PYTHONPATH="/app/src"
CMD ["python", "-m", "activity_core.worker"]
```

Also add `.dockerignore`:

```
.venv/
__pycache__/
*.pyc
.git/
tests/
*.egg-info/
.env
```

The three processes share the same image; docker-compose overrides `command:`:

- worker: default (`python -m activity_core.worker`)
- api: `uvicorn activity_core.api:app --host 0.0.0.0 --port 8010`
- event-router: `python -m activity_core.event_router`

---

## T59: docker-compose.railiance.yml (full stack, no Elasticsearch)

**File:** `docker-compose.railiance.yml`

Self-contained production stack. Uses PostgreSQL-based Temporal visibility (no Elasticsearch required). All activity-core services read env from `env_file: .env`.
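Because every activity-core service takes its configuration from `.env`, a fail-fast check at process start can catch a missing value before the container reports as running. The sketch below is illustrative only: `Settings` and `load_settings` are hypothetical names, the variable names and defaults are taken from `.env.example` (T61), and treating only `ACTCORE_DB_URL` as mandatory is an assumption based on that file's "Required" section.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the core connection settings."""

    db_url: str
    temporal_host: str
    temporal_namespace: str
    nats_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, failing fast if the DB URL is absent."""
    db_url = env.get("ACTCORE_DB_URL", "").strip()
    if not db_url:
        # Assumption: only the database URL has no usable default.
        raise RuntimeError("ACTCORE_DB_URL is required but not set")
    return Settings(
        db_url=db_url,
        temporal_host=env.get("TEMPORAL_HOST", "temporal:7233"),
        temporal_namespace=env.get("TEMPORAL_NAMESPACE", "default"),
        nats_url=env.get("NATS_URL", "nats://nats:4222"),
    )


if __name__ == "__main__":
    demo_env = {"ACTCORE_DB_URL": "postgresql+asyncpg://actcore:actcore@app-db:5432/actcore"}
    print(load_settings(demo_env))
```

Passing the mapping explicitly keeps the function testable without touching the real process environment.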
Services:

- `temporal-db` — postgres:16 (Temporal schema + visibility)
- `temporal` — temporalio/auto-setup:1.29.1 with `ENABLE_ES: "false"`, `DB: postgres12`, `VISIBILITY_DBNAME: temporal_visibility`
- `temporal-ui` — temporalio/ui on port 8080
- `nats` — nats:2.10-alpine with `-js` and persistent volume for JetStream state
- `app-db` — postgres:16 (activity-core application data)
- `actcore-migrate` — one-shot service: `build: .`, `command: alembic upgrade head`, `restart: "no"` (quoted, because a bare `no` is parsed as a YAML boolean); runs migrations then exits; other actcore services depend on it
- `actcore-worker` — worker process, metrics port 9090
- `actcore-api` — API server, port 8010, with healthcheck against `/health`
- `actcore-event-router` — event router process

All three actcore processes share one image (`build: .`) and depend on:

- `temporal` (condition: service_healthy)
- `app-db` (condition: service_healthy)
- `nats` (condition: service_healthy)
- `actcore-migrate` (condition: service_completed_successfully)

Persistent volumes: `temporal-db-data`, `app-db-data`, `nats-data`.

Network: `actcore-net` (bridge).

---

## T60: GET /health endpoint

**File:** `src/activity_core/api.py` (add route)

```python
from fastapi.responses import JSONResponse
from sqlalchemy import text


@app.get("/health")
async def health() -> JSONResponse:
    db_ok = False
    temporal_ok = False

    # Probe the application database with a trivial query.
    try:
        async with _get_db()() as session:
            await session.execute(text("SELECT 1"))
        db_ok = True
    except Exception:
        pass  # any failure is reported as db=false

    # Probe Temporal by describing the configured namespace.
    # Assumes _get_temporal() returns an object exposing describe_namespace();
    # on a raw temporalio Client the RPC lives under client.workflow_service.
    try:
        await _get_temporal().describe_namespace(TEMPORAL_NAMESPACE)
        temporal_ok = True
    except Exception:
        pass  # any failure is reported as temporal=false

    # 200 only when both dependencies respond; 503 otherwise.
    status = "ok" if db_ok and temporal_ok else "degraded"
    code = 200 if status == "ok" else 503
    return JSONResponse(
        {"status": status, "db": db_ok, "temporal": temporal_ok},
        status_code=code,
    )
```

No authentication.
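The status and HTTP code mapping in the handler above can be pulled out into a pure function, which makes it checkable without a database or a Temporal server. A minimal sketch; `summarize_health` is a hypothetical name and is not part of the existing module.

```python
from typing import Any


def summarize_health(db_ok: bool, temporal_ok: bool) -> tuple[dict[str, Any], int]:
    """Map the two probe results to the response body and HTTP status code.

    Mirrors the rule in the /health handler: "ok" and 200 only when both
    probes succeed, otherwise "degraded" and 503.
    """
    status = "ok" if db_ok and temporal_ok else "degraded"
    code = 200 if status == "ok" else 503
    return {"status": status, "db": db_ok, "temporal": temporal_ok}, code


if __name__ == "__main__":
    for db_ok in (True, False):
        for temporal_ok in (True, False):
            print(db_ok, temporal_ok, summarize_health(db_ok, temporal_ok))
```

With this split, the route itself only runs the two probes and wraps the returned body and code in a `JSONResponse`.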
Used by:

- docker-compose healthcheck: `test: ["CMD-SHELL", "curl -sf http://localhost:8010/health"]` (`python:3.12-slim` does not ship `curl`, so the runtime stage must install it or the check must use Python instead)
- Railiance monitoring (external HTTP probe)

---

## T61: .env.example — complete env var reference

**File:** `.env.example`

```bash
# ── Required ──────────────────────────────────────────────────────────────────
# PostgreSQL connection string for activity-core application data.
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@app-db:5432/actcore

# ── Temporal ──────────────────────────────────────────────────────────────────
# Temporal frontend gRPC address.
TEMPORAL_HOST=temporal:7233
# Temporal namespace (must exist before workers start).
TEMPORAL_NAMESPACE=default

# ── NATS ──────────────────────────────────────────────────────────────────────
# NATS server URL. JetStream must be enabled (-js flag).
NATS_URL=nats://nats:4222

# ── Service integrations (gracefully degraded if unavailable) ─────────────────
# Note: inside a compose container, 127.0.0.1 is the container itself;
# point these at a hostname reachable from the container.
# State Hub — used by the state-hub context adapter. Binds {} on failure.
STATE_HUB_URL=http://127.0.0.1:8000
# Repo scoping — used by the repo-scoping context adapter. Binds {} on failure.
REPO_SCOPING_URL=http://127.0.0.1:8020
# Issue Core — task emission backend.
ISSUE_CORE_URL=http://127.0.0.1:8010
# Sink type: 'rest' (POST to issue-core) or 'null' (discard, for dry-run).
ISSUE_SINK_TYPE=rest

# ── Activity definitions ───────────────────────────────────────────────────────
# Colon-separated paths to additional activity-definitions/ directories.
# The local activity-definitions/ directory is always scanned.
ACTIVITY_DEFINITION_DIRS=

# ── Observability ─────────────────────────────────────────────────────────────
# Prometheus metrics bind address (Temporal SDK metrics).
PROMETHEUS_BIND_ADDR=0.0.0.0:9090

# ── Security (webhook receiver) ───────────────────────────────────────────────
# HMAC-SHA256 secret for Gitea webhook signature validation.
WEBHOOK_SECRET_GITEA=
# HMAC-SHA256 secret for GitHub webhook signature validation.
WEBHOOK_SECRET_GITHUB=

# ── Curator gate ──────────────────────────────────────────────────────────────
# 'disabled': accepts active + pending event types (pending logged as warning).
# 'required': only active event types accepted; pending events are discarded.
ACTIVITY_CURATOR_GATE=disabled
```

---

## T62: Makefile ops targets

**File:** `Makefile` (extend)

New targets to add:

```makefile
# ── Infrastructure ────────────────────────────────────────────────────────────
dev-up:             ## Start full dev stack (Temporal + ES + PG + NATS)
dev-down:           ## Stop and remove dev stack containers
railiance-up:       ## Start full railiance stack (builds image first)
railiance-down:     ## Stop and remove railiance stack containers

# ── Database ──────────────────────────────────────────────────────────────────
migrate:            ## Apply all pending Alembic migrations
sync-all:           ## Sync event types and activity definitions (runs both)

# ── Local dev processes ───────────────────────────────────────────────────────
start-worker:       ## Start Temporal worker (uses ACTCORE_DB_URL from env)
start-api:          ## Start FastAPI server on :8010 (hot reload)
start-event-router: ## Start NATS event router

# ── Help ──────────────────────────────────────────────────────────────────────
help:               ## Show this help message
```

The `help` target uses `grep -E '^[a-zA-Z_-]+:.*?##' Makefile` to extract the `## description` comments and format them as a table. This makes `make help` the entry point for operators.

`start-*` targets load `.env` if it exists (an `-include .env` line followed by a bare `export` directive on its own line) so developers can run locally without setting env vars manually.

---

## T63: SIGTERM graceful shutdown (worker + event_router)

**Files:** `src/activity_core/worker.py`, `src/activity_core/event_router.py`

Replace the `await asyncio.Future()` pattern, which blocks forever and installs no SIGTERM handler (so a process running as PID 1 in its container never reacts to `docker stop`), with a signal-aware stop event:

```python
import asyncio
import signal


async def run() -> None:
    ...
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    # Both signals set the same event (add_signal_handler is Unix-only).
    loop.add_signal_handler(signal.SIGTERM, stop.set)
    loop.add_signal_handler(signal.SIGINT, stop.set)

    async with orchestrator_worker, task_worker:  # (worker.py)
        logger.info("Workers running — waiting for shutdown signal")
        await stop.wait()
        logger.info("Shutdown signal received — draining workers")
    # Leaving the async with block shuts both workers down.
    logger.info("Workers stopped cleanly")
```

For `event_router.py`, the same pattern applies: `stop.wait()` replaces the infinite future; on signal, the NATS consumer is unsubscribed before exiting.

This lets `docker stop` (SIGTERM → 10-second grace period → SIGKILL) complete within the grace window: in-flight Temporal tasks and NATS messages are drained before the process exits. If draining can take longer than 10 seconds, raise `stop_grace_period` for the service in the compose file.

---

## T64: docs/runbook.md — Railiance Deployment section

**File:** `docs/runbook.md` (extend with new section)

Add after the existing Dev environment section:

```markdown
## Railiance Deployment

### Pre-requisites
- Docker ≥ 24 with Compose v2 (`docker compose` not `docker-compose`)
- ≥ 4 GB RAM available (Temporal server takes ~1 GB)
- Ports available: 7233 (Temporal gRPC), 8010 (API), 8080 (Temporal UI),
  9090 (Prometheus metrics)

### First-time setup
1. `cp .env.example .env` — edit all values, especially secrets
2. `make railiance-up` — builds image and starts all services
3. Wait for health: `curl -sf http://localhost:8010/health` → `{"status":"ok",...}`
4. Register Temporal search attributes (one-time per namespace):
   `docker exec actcore-temporal temporal operator search-attribute create \`
   `  --name ActivityId --type Keyword --name ActivityName --type Keyword \`
   `  --address temporal:7233`
5. `make sync-all` — load event types and activity definitions

### Upgrade procedure
1. `git pull`
2. `make railiance-up` — rebuilds image and restarts changed services
3. `make migrate` — apply any new migrations (safe to run even if none pending)
4. `curl -sf http://localhost:8010/health` — verify health

### Health verification
- API health: `curl -s http://localhost:8010/health | python3 -m json.tool`
- Temporal UI: http://localhost:8080
- Prometheus metrics: http://localhost:9090/metrics

### Common ops
- View logs: `docker compose -f docker-compose.railiance.yml logs -f actcore-worker`
- Restart one service: `docker compose -f docker-compose.railiance.yml restart actcore-api`
- Wipe and reset (destructive): `make railiance-down && docker volume rm ...`
```

---

## Completion Criteria

1. `docker build -t activity-core .` succeeds and image is < 400 MB
2. `make railiance-up` starts all 8 long-running services and all reach a healthy state; the one-shot `actcore-migrate` service exits successfully
3. `curl http://localhost:8010/health` returns `{"status":"ok",...}` with HTTP 200
4. `docker stop actcore-worker` causes graceful drain (no SIGKILL within 10s)
5. `make help` prints a clean table of all targets with descriptions
6. `.env.example` covers every env var used anywhere in the codebase

## New Files Produced

| Path | Task |
|---|---|
| `Dockerfile` | T58 |
| `.dockerignore` | T58 |
| `docker-compose.railiance.yml` | T59 |
| `.env.example` | T61 |

## Modified Files

| Path | Task | Change |
|---|---|---|
| `src/activity_core/api.py` | T60 | Add `/health` route |
| `Makefile` | T62 | Add ops targets |
| `src/activity_core/worker.py` | T63 | SIGTERM handler |
| `src/activity_core/event_router.py` | T63 | SIGTERM handler |
| `docs/runbook.md` | T64 | Railiance deployment section |

## Change History

- v1.0 (2026-05-14): Initial workplan.