activity-core/workplans/custodian-WP-0004-railiance-ops.md
2026-05-15 00:05:01 +02:00

| Field | Value |
| --- | --- |
| id | custodian-WP-0004 |
| type | workplan |
| domain | custodian |
| repo | activity-core |
| status | done |
| state_hub_workstream_id | 759b1255-aa78-42b5-8fab-c5fcf168f5f4 |
| created | 2026-05-14 |

Tasks:

| id | title | status | priority | state_hub_task_id |
| --- | --- | --- | --- | --- |
| T58 | Dockerfile (multi-stage, uv-based) | done | high | 2a81f9ba-47cb-480f-a5d3-af74fe238485 |
| T59 | docker-compose.railiance.yml (full stack, no Elasticsearch) | done | high | 67981de4-b766-4c47-9985-9573297ac464 |
| T60 | GET /health endpoint | done | high | 7c7a1617-29e9-4ea5-a675-fd75e325c451 |
| T61 | .env.example — complete env var reference | done | medium | 2a2a8d02-19dc-4b09-8792-f855ed48388a |
| T62 | Makefile ops targets | done | medium | 55ac37b5-5606-42e5-bf11-e8dae3a188b7 |
| T63 | SIGTERM graceful shutdown (worker + event_router) | done | medium | 65b3229b-5f03-450d-a263-f2c205be3d28 |
| T64 | docs/runbook.md — railiance deployment section | done | medium | 83dcd765-5715-499f-be23-f728d0261cfb |

activity-core WP-0004 — Railiance Deployment & Service Ops

Hub workstream: 759b1255-aa78-42b5-8fab-c5fcf168f5f4

Goal: Package activity-core as a fully standalone deployable service, runnable on railiance with no imports from or requirements against other custodian repos.

Context

WP-0001 through WP-0003 built the complete event-bridge implementation. The service is functionally complete. This workplan makes it operationally deployable: containerised, healthchecked, gracefully shutting down, and documented for railiance.

Runtime dependencies are infrastructure only (Temporal, PostgreSQL, NATS). The optional soft dependencies (state-hub, repo-scoping, issue-core) are already gracefully degraded in the code — they bind {} on failure and never abort a run.
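The degradation rule for soft dependencies can be sketched as follows. This is a minimal illustration, not the code in the repo: `fetch_context`, `http_get_json`, and the `fetch` parameter are hypothetical names, and only the behaviour (bind {} on failure, never abort the run) comes from this workplan.

```python
import json
import logging
import urllib.request
from typing import Any, Callable

logger = logging.getLogger("context_adapter_sketch")


def http_get_json(url: str, timeout: float = 2.0) -> dict[str, Any]:
    """Fetch a JSON document over HTTP; raises on any network or parse error."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read())


def fetch_context(
    url: str,
    fetch: Callable[[str], Any] = http_get_json,  # hypothetical injection point
) -> dict[str, Any]:
    """Return the adapter's context, or {} if the soft dependency is unavailable."""
    try:
        result = fetch(url)
    except Exception as exc:  # soft dependency: log and degrade, never abort
        logger.warning("context source %s unavailable: %s", url, exc)
        return {}
    # Anything that is not a JSON object is also treated as "no context".
    return result if isinstance(result, dict) else {}
```

Passing the fetch function as a parameter keeps the failure path easy to exercise in tests without a running state-hub or repo-scoping service.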

Existing assets (no rework needed)

  • docker-compose.dev.yml — full dev stack (Temporal + ES + PG + NATS); keep as-is
  • worker.py, api.py, event_router.py — entry points exist and are functional
  • migrations/, alembic.ini — schema management is in place
  • docs/runbook.md — dev quick-start section exists; extend for railiance

Build Order

T58 (Dockerfile) → T59 (railiance compose — depends on image)
T60, T61, T62 — parallel, no deps
T63 — independent, improves T59 (clean shutdown in compose)
T64 — last, documents T58-T63
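The build order can also be expressed as a dependency graph and grouped into parallel waves. The sketch below uses the standard-library graphlib module; the graph encodes only the hard dependencies stated above (T63 is treated as independent), and `build_waves` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it must wait for.
DEPENDS_ON = {
    "T58": set(),
    "T59": {"T58"},  # compose file needs the image
    "T60": set(),
    "T61": set(),
    "T62": set(),
    "T63": set(),    # independent; only improves T59
    "T64": {"T58", "T59", "T60", "T61", "T62", "T63"},  # documents the rest
}


def build_waves(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; every task within a wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(build_waves(DEPENDS_ON))
# [['T58', 'T60', 'T61', 'T62', 'T63'], ['T59'], ['T64']]
```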

T58: Dockerfile (multi-stage, uv-based)

File: Dockerfile

Two-stage build to keep the runtime image lean.

# Stage 1 — install Python deps
FROM python:3.12-slim AS builder
RUN pip install uv --no-cache-dir
WORKDIR /app
COPY pyproject.toml uv.lock ./
COPY src/ ./src/
RUN uv sync --no-dev --frozen

# Stage 2 — runtime image
FROM python:3.12-slim AS runtime
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY --from=builder /app/src /app/src
# Include definition files that ship with the repo
COPY activity-definitions/ ./activity-definitions/
COPY event-types/ ./event-types/
COPY tasks/ ./tasks/
ENV PATH="/app/.venv/bin:$PATH"
ENV PYTHONPATH="/app/src"
CMD ["python", "-m", "activity_core.worker"]

Also add .dockerignore:

.venv/
__pycache__/
*.pyc
.git/
tests/
*.egg-info/
.env

The three processes share the same image; docker-compose overrides command: per service:

  • worker: default (python -m activity_core.worker)
  • api: uvicorn activity_core.api:app --host 0.0.0.0 --port 8010
  • event-router: python -m activity_core.event_router

T59: docker-compose.railiance.yml (full stack, no Elasticsearch)

File: docker-compose.railiance.yml

Self-contained production stack. Uses PostgreSQL-based Temporal visibility (no Elasticsearch required). All activity-core services read env from env_file: .env.

Services:

  • temporal-db — postgres:16 (Temporal schema + visibility)
  • temporal — temporalio/auto-setup:1.29.1 with ENABLE_ES: "false", DB: postgres12, VISIBILITY_DBNAME: temporal_visibility
  • temporal-ui — temporalio/ui on port 8080
  • nats — nats:2.10-alpine with -js and persistent volume for JetStream state
  • app-db — postgres:16 (activity-core application data)
  • actcore-migrate — one-shot service: build: ., command: alembic upgrade head, restart: "no" (quoted, because a bare no is parsed as a YAML boolean); runs migrations, then exits; the other actcore services depend on it
  • actcore-worker — worker process, metrics port 9090
  • actcore-api — API server, port 8010, with healthcheck against /health
  • actcore-event-router — event router process

All three actcore processes share one image (build: .) and depend on:

  • temporal (condition: service_healthy)
  • app-db (condition: service_healthy)
  • nats (condition: service_healthy)
  • actcore-migrate (condition: service_completed_successfully)

Persistent volumes: temporal-db-data, app-db-data, nats-data.

Network: actcore-net (bridge).


T60: GET /health endpoint

File: src/activity_core/api.py (add route)

from fastapi.responses import JSONResponse
from sqlalchemy import text

@app.get("/health")
async def health() -> JSONResponse:
    db_ok = False
    temporal_ok = False

    try:
        async with _get_db()() as session:
            await session.execute(text("SELECT 1"))
        db_ok = True
    except Exception:
        pass

    try:
        await _get_temporal().describe_namespace(TEMPORAL_NAMESPACE)
        temporal_ok = True
    except Exception:
        pass

    status = "ok" if db_ok and temporal_ok else "degraded"
    code = 200 if status == "ok" else 503
    return JSONResponse(
        {"status": status, "db": db_ok, "temporal": temporal_ok},
        status_code=code,
    )

No authentication. Used by:

  • docker-compose healthcheck: test: ["CMD-SHELL", "curl -sf http://localhost:8010/health"] (curl must be installed in the runtime image; python:3.12-slim does not include it)
  • Railiance monitoring (external HTTP probe)
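An external probe can treat the endpoint as a simple pass/fail signal. The sketch below is one possible client using only the standard library; `probe_health` is a hypothetical name, and the response shape it expects ({"status": ..., "db": ..., "temporal": ...} with HTTP 200 or 503) is taken from the route above.

```python
import json
import urllib.error
import urllib.request
from typing import Any


def probe_health(url: str, timeout: float = 3.0) -> tuple[bool, dict[str, Any]]:
    """Return (healthy, body). Healthy means HTTP 200 and status == "ok"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code, raw = resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        # A 503 from /health still carries the JSON body with the failing flags.
        code, raw = exc.code, exc.read()
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout: the service is not reachable.
        return False, {"status": "unreachable"}

    try:
        body = json.loads(raw)
    except ValueError:
        body = {}
    if not isinstance(body, dict):
        body = {}
    return code == 200 and body.get("status") == "ok", body
```

HTTPError is caught before URLError because it is a subclass; that ordering is what lets the probe read the db and temporal flags from a degraded response instead of reporting it as unreachable.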

T61: .env.example — complete env var reference

File: .env.example

# ── Required ──────────────────────────────────────────────────────────────────
# PostgreSQL connection string for activity-core application data.
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@app-db:5432/actcore

# ── Temporal ──────────────────────────────────────────────────────────────────
# Temporal frontend gRPC address.
TEMPORAL_HOST=temporal:7233
# Temporal namespace (must exist before workers start).
TEMPORAL_NAMESPACE=default

# ── NATS ──────────────────────────────────────────────────────────────────────
# NATS server URL. JetStream must be enabled (-js flag).
NATS_URL=nats://nats:4222

# ── Service integrations (gracefully degraded if unavailable) ─────────────────
# State Hub — used by the state-hub context adapter. Binds {} on failure.
STATE_HUB_URL=http://127.0.0.1:8000
# Repo scoping — used by the repo-scoping context adapter. Binds {} on failure.
REPO_SCOPING_URL=http://127.0.0.1:8020
# Issue Core — task emission backend.
ISSUE_CORE_URL=http://127.0.0.1:8010
# Sink type: 'rest' (POST to issue-core) or 'null' (discard, for dry-run).
ISSUE_SINK_TYPE=rest

# ── Activity definitions ───────────────────────────────────────────────────────
# Colon-separated paths to additional activity-definitions/ directories.
# The local activity-definitions/ directory is always scanned.
ACTIVITY_DEFINITION_DIRS=

# ── Observability ─────────────────────────────────────────────────────────────
# Prometheus metrics bind address (Temporal SDK metrics).
PROMETHEUS_BIND_ADDR=0.0.0.0:9090

# ── Security (webhook receiver) ───────────────────────────────────────────────
# HMAC-SHA256 secret for Gitea webhook signature validation.
WEBHOOK_SECRET_GITEA=
# HMAC-SHA256 secret for GitHub webhook signature validation.
WEBHOOK_SECRET_GITHUB=

# ── Curator gate ──────────────────────────────────────────────────────────────
# 'disabled': accepts active + pending event types (pending logged as warning).
# 'required': only active event types accepted; pending events are discarded.
ACTIVITY_CURATOR_GATE=disabled
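A process could read these variables along the following lines. This is a sketch under stated assumptions, not the repo's configuration code: the `Settings` class, `load_settings`, and `accepts_event_type` are hypothetical names, and the fallback defaults simply mirror the example values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

VALID_GATES = ("disabled", "required")


@dataclass(frozen=True)
class Settings:
    db_url: str
    temporal_host: str
    temporal_namespace: str
    nats_url: str
    definition_dirs: tuple[str, ...]
    curator_gate: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read a subset of the documented variables; fail fast on bad values."""
    db_url = env.get("ACTCORE_DB_URL", "")
    if not db_url:
        raise ValueError("ACTCORE_DB_URL is required")
    gate = env.get("ACTIVITY_CURATOR_GATE", "disabled")
    if gate not in VALID_GATES:
        raise ValueError(f"ACTIVITY_CURATOR_GATE must be one of {VALID_GATES}")
    # Colon-separated list; empty segments (including an empty value) are dropped.
    extra_dirs = tuple(
        p for p in env.get("ACTIVITY_DEFINITION_DIRS", "").split(":") if p
    )
    return Settings(
        db_url=db_url,
        temporal_host=env.get("TEMPORAL_HOST", "temporal:7233"),  # assumed default
        temporal_namespace=env.get("TEMPORAL_NAMESPACE", "default"),
        nats_url=env.get("NATS_URL", "nats://nats:4222"),  # assumed default
        definition_dirs=extra_dirs,
        curator_gate=gate,
    )


def accepts_event_type(gate: str, status: str) -> bool:
    """'required' admits only active event types; 'disabled' also admits pending."""
    if status == "active":
        return True
    return gate == "disabled" and status == "pending"
```

Validating the gate value at startup surfaces a typo in .env immediately rather than as silently discarded events later.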

T62: Makefile ops targets

File: Makefile (extend)

New targets to add:

# ── Infrastructure ────────────────────────────────────────────────────────────
dev-up:         ## Start full dev stack (Temporal + ES + PG + NATS)
dev-down:       ## Stop and remove dev stack containers
railiance-up:   ## Start full railiance stack (builds image first)
railiance-down: ## Stop and remove railiance stack containers

# ── Database ──────────────────────────────────────────────────────────────────
migrate:        ## Apply all pending Alembic migrations
sync-all:       ## Sync event types and activity definitions (runs both)

# ── Local dev processes ───────────────────────────────────────────────────────
start-worker:        ## Start Temporal worker (uses ACTCORE_DB_URL from env)
start-api:           ## Start FastAPI server on :8010 (hot reload)
start-event-router:  ## Start NATS event router

# ── Help ──────────────────────────────────────────────────────────────────────
help:           ## Show this help message

The help target uses grep -E '^[a-zA-Z_-]+:.*?##' Makefile to extract the ## description comments and formats them as a table. This makes make help the entry point for operators.

start-* targets load .env if it exists (-include .env; export) so developers can run locally without setting env vars manually.
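To show what the help pattern matches, here is the same extraction written in Python. It is only an illustration of the regex; the real target would stay in Make with grep, and `help_table` is a hypothetical name.

```python
import re

# Same shape as the grep pattern: a target name, a colon, then a ## comment.
HELP_LINE = re.compile(r"^([a-zA-Z_-]+):.*?##\s*(.*)$")


def help_table(makefile_text: str) -> str:
    """Return 'target  description' rows for every documented target."""
    rows = []
    for line in makefile_text.splitlines():
        match = HELP_LINE.match(line)
        if match:
            rows.append((match.group(1), match.group(2).strip()))
    if not rows:
        return ""
    width = max(len(target) for target, _ in rows)
    return "\n".join(f"{target.ljust(width)}  {desc}" for target, desc in rows)
```

Section banners that begin with # and recipe lines that begin with a tab never match, because the pattern requires a target name at the start of the line.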


T63: SIGTERM graceful shutdown (worker + event_router)

Files: src/activity_core/worker.py, src/activity_core/event_router.py

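Replace the await asyncio.Future() pattern, which blocks forever and installs no SIGTERM handler (as PID 1 in a container, the signal is then typically ignored until docker stop escalates to SIGKILL), with a signal-aware stop event: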

import asyncio
import logging
import signal

logger = logging.getLogger(__name__)

async def run() -> None:
    ...
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    loop.add_signal_handler(signal.SIGTERM, stop.set)
    loop.add_signal_handler(signal.SIGINT, stop.set)

    async with orchestrator_worker, task_worker:  # (worker.py)
        logger.info("Workers running — waiting for shutdown signal")
        await stop.wait()
        logger.info("Shutdown signal received — draining workers")
    logger.info("Workers stopped cleanly")

For event_router.py, the same pattern applies: stop.wait() replaces the infinite future; on signal, the NATS consumer is unsubscribed before exiting.

This lets docker stop (SIGTERM → 10-second grace period → SIGKILL) complete within the grace window: the process drains in-flight Temporal tasks and NATS messages and then exits. Work that needs longer than 10 seconds to drain requires a larger stop_grace_period in the compose file.
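For the event router, the shutdown path might look like the sketch below. `FakeSubscription` is a hypothetical stand-in for the real NATS subscription object, and `install_stop_handlers` and `run_router` are illustrative names; only the sequence (wait for the stop event, then unsubscribe, then exit) comes from this workplan.

```python
import asyncio
import signal


class FakeSubscription:
    """Stand-in for the router's NATS subscription (hypothetical)."""

    def __init__(self) -> None:
        self.unsubscribed = False

    async def unsubscribe(self) -> None:
        self.unsubscribed = True


def install_stop_handlers(stop: asyncio.Event) -> None:
    """Make SIGTERM and SIGINT set the stop event (Unix, main thread only)."""
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)


async def run_router(sub, stop: asyncio.Event) -> None:
    """Block until stop is set, then unsubscribe before returning."""
    try:
        await stop.wait()
    finally:
        # Runs on normal shutdown and on cancellation alike.
        await sub.unsubscribe()


async def main() -> None:
    stop = asyncio.Event()
    install_stop_handlers(stop)
    await run_router(FakeSubscription(), stop)
```

Keeping signal installation separate from the wait loop means the loop can be tested by setting the event directly, without sending real signals to the test process.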


T64: docs/runbook.md — Railiance Deployment section

File: docs/runbook.md (extend with new section)

Add after the existing Dev environment section:

## Railiance Deployment

### Pre-requisites
- Docker ≥ 24 with Compose v2 (`docker compose` not `docker-compose`)
- ≥ 4 GB RAM available (Temporal server takes ~1 GB)
- Ports available: 7233 (Temporal gRPC), 8010 (API), 8080 (Temporal UI),
  9090 (Prometheus metrics)

### First-time setup
1. `cp .env.example .env` — edit all values, especially secrets
2. `make railiance-up` — builds image and starts all services
3. Wait for health: `curl -sf http://localhost:8010/health` → `{"status":"ok",...}`
4. Register Temporal search attributes (one-time per namespace):
   `docker exec actcore-temporal temporal operator search-attribute create \`
   `  --name ActivityId --type Keyword --name ActivityName --type Keyword \`
   `  --address temporal:7233`
5. `make sync-all` — load event types and activity definitions

### Upgrade procedure
1. `git pull`
2. `make railiance-up` — rebuilds image and restarts changed services
3. `make migrate` — apply any new migrations (safe to run even if none pending)
4. `curl -sf http://localhost:8010/health` — verify health

### Health verification
- API health: `curl -s http://localhost:8010/health | python3 -m json.tool`
- Temporal UI: http://localhost:8080
- Prometheus metrics: http://localhost:9090/metrics

### Common ops
- View logs: `docker compose -f docker-compose.railiance.yml logs -f actcore-worker`
- Restart one service: `docker compose -f docker-compose.railiance.yml restart actcore-api`
- Wipe and reset (destructive): `make railiance-down && docker volume rm ...`
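The "wait for health" step in first-time setup can be automated with a small polling loop instead of re-running curl by hand. The sketch below is one way to do it; `wait_until_healthy` and its parameters are hypothetical, and the `check` callable is assumed to perform the GET against /health and return True on HTTP 200.

```python
import time
from typing import Callable


def wait_until_healthy(
    check: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # A failing probe counts as "not healthy yet", not as a fatal error.
            pass
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

A monotonic clock is used so that wall-clock adjustments during startup cannot shorten or extend the timeout.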

Completion Criteria

  1. docker build -t activity-core . succeeds and image is < 400 MB
  2. make railiance-up starts all 9 services; the 8 long-running services reach a healthy state and the one-shot actcore-migrate exits successfully
  3. curl http://localhost:8010/health returns {"status":"ok",...} with HTTP 200
  4. docker stop actcore-worker triggers a graceful drain and the container exits before the 10-second grace period expires, so no SIGKILL is sent
  5. make help prints a clean table of all targets with descriptions
  6. .env.example covers every env var used anywhere in the codebase

New Files Produced

| Path | Task |
| --- | --- |
| Dockerfile | T58 |
| .dockerignore | T58 |
| docker-compose.railiance.yml | T59 |
| .env.example | T61 |

Modified Files

| Path | Task | Change |
| --- | --- | --- |
| src/activity_core/api.py | T60 | Add /health route |
| Makefile | T62 | Add ops targets |
| src/activity_core/worker.py | T63 | SIGTERM handler |
| src/activity_core/event_router.py | T63 | SIGTERM handler |
| docs/runbook.md | T64 | Railiance deployment section |

Change History

  • v1.0 (2026-05-14): Initial workplan.