diff --git a/workplans/CUST-WP-0045-cutover-runbook.md b/workplans/CUST-WP-0045-cutover-runbook.md
new file mode 100644
index 0000000..37b15a7
--- /dev/null
+++ b/workplans/CUST-WP-0045-cutover-runbook.md
@@ -0,0 +1,239 @@
+---
+id: CUST-WP-0045-cutover-runbook
+type: runbook
+title: "CUST-WP-0045 T06 cutover — exact command sequence"
+parent_workplan: CUST-WP-0045
+created: "2026-06-01"
+---
+
+# CUST-WP-0045 T06 cutover — exact command sequence
+
+Runbook for finishing `CUST-WP-0045-T06` (Canary Cutover And Disable Codex
+Automation). The fixes for the 2026-05-21 real-LLM canary failure are merged
+(activity-core `cf92f0d`/`5c4f96e`, llm-connect `b12d1af`/`82e3c07`); this
+runbook reruns the patched canary and, if it passes, performs the cutover.
+
+Definition UUID: `6fca51fa-387a-4fd0-bc4e-d62c29eb859a` (stable UUIDv5 from the
+slug; the same id used in the 2026-05-19 mocked canary).
+
+## Heads-up before you start
+
+- **Concurrent-session collision.** A real-LLM canary calls Claude Code through
+  llm-connect. If you trigger this from inside an active Claude Code session,
+  the canary competes for Claude CLI quota and can hit the same
+  `Execution error` / HTTP 500 path that broke the 2026-05-21 attempts. End any
+  active Claude Code session in the same shell account first, or run from a
+  fresh terminal with that risk in mind.
+- The repo `.env` uses Docker network hostnames (`temporal:7233`,
+  `app-db:5432`, `nats:4222`). The `make` targets auto-load that file, so the
+  host-mode processes below pass explicit env vars on the command line to
+  override.
+- Volumes from the 2026-05-19 dev run persist (`temporal-db-data`,
+  `elasticsearch-data`, `app-db-data`). The ActivityDefinition row and the
+  paused Schedule should already be there. Re-syncing is idempotent.
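The `.env` bullet above is easy to trip over: a value inherited from that file looks valid but only resolves inside the Docker network. Below is a minimal sketch of a pre-flight check for host-mode processes. Both function names are hypothetical (nothing like them is known to exist in the repo), and the pattern list covers only the three hostnames named in the bullet.

```bash
# Hypothetical helper (not part of activity-core): succeeds when a value still
# points at one of the Docker-network endpoints listed in the bullet above.
uses_docker_hostname() {
  case "$1" in
    *temporal:7233*|*app-db:5432*|*nats:4222*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print a one-line verdict for a named setting; returns non-zero on a
# Docker-only value so callers can stop before starting a host-mode process.
check_host_mode_var() {
  # $1: variable name (used in the message only), $2: its current value
  if uses_docker_hostname "$2"; then
    echo "$1 still uses a Docker hostname: $2"
    return 1
  fi
  echo "$1 looks host-reachable: $2"
}
```

One possible use is `check_host_mode_var TEMPORAL_HOST "$TEMPORAL_HOST"` right after the `export` lines in Step 2, before anything is started.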
+
+## Step 0 — Sanity check dependencies
+
+```bash
+# State Hub on :8000 (if down: cd ~/state-hub && make api)
+curl -sf http://127.0.0.1:8000/state/health && echo " state-hub OK"
+
+# Claude CLI resolves to user install, schema file exists
+ls -l /home/worsch/.local/bin/claude
+ls -l /home/worsch/the-custodian/schemas/daily-triage-report.json
+```
+
+## Step 1 — Bring the activity-core dev stack up
+
+```bash
+cd /home/worsch/activity-core
+make dev-up
+docker compose -f docker-compose.dev.yml ps
+```
+
+Wait until `temporal` reports `(healthy)`.
+
+## Step 2 — Migrate and re-sync the definition from the-custodian dir
+
+```bash
+cd /home/worsch/activity-core
+export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
+export TEMPORAL_HOST=localhost:7233
+export TEMPORAL_NAMESPACE=default
+export ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian
+
+uv run alembic upgrade head
+uv run python -m activity_core.sync_activity_definitions
+```
+
+The sync should report parsing `daily-statehub-wsjf-triage.md` with trigger,
+instruction block, and report sinks. The Temporal schedule stays paused
+(`enabled: false`).
+
+## Step 3 — Start the activity-core worker (foreground, leave running)
+
+```bash
+cd /home/worsch/activity-core
+export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
+export TEMPORAL_HOST=localhost:7233
+export TEMPORAL_NAMESPACE=default
+export STATE_HUB_URL=http://127.0.0.1:8000
+export LLM_CONNECT_URL=http://127.0.0.1:8080
+export ISSUE_SINK_TYPE=null  # canary must not spawn real tasks
+
+uv run python -m activity_core.worker
+```
+
+Watch for "registered RunActivityWorkflow" / polling lines.
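The "wait until `temporal` reports `(healthy)`" check in Step 1 can be scripted rather than eyeballed. The sketch below is a generic retry loop plus one probe. `wait_until` and `temporal_is_healthy` are hypothetical names, and the probe assumes `temporal` is the Compose service name in `docker-compose.dev.yml`, which this runbook implies but does not state.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
wait_until() {
  # $1: max attempts, $2: seconds to sleep between attempts, rest: the command
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Assumed probe: the status column of `docker compose ps` contains "(healthy)"
# once the container's healthcheck passes. Run from /home/worsch/activity-core.
temporal_is_healthy() {
  docker compose -f docker-compose.dev.yml ps temporal | grep -q '(healthy)'
}
```

For example, `wait_until 30 2 temporal_is_healthy || echo "temporal never became healthy"` gives up after about a minute instead of blocking forever.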
+
+## Step 4 — Start the activity-core admin API (foreground, separate terminal)
+
+```bash
+cd /home/worsch/activity-core
+export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
+export TEMPORAL_HOST=localhost:7233
+export TEMPORAL_NAMESPACE=default
+
+uv run uvicorn activity_core.api:app --host 127.0.0.1 --port 8010
+```
+
+## Step 5 — Start llm-connect server with the patched Claude Code adapter
+
+```bash
+cd /home/worsch/llm-connect
+export LLM_CONNECT_CLAUDE_CLI_PATH=/home/worsch/.local/bin/claude
+
+uv run python -m llm_connect.server --host 127.0.0.1 --port 8080 --provider claude-code
+```
+
+The explicit `LLM_CONNECT_CLAUDE_CLI_PATH` is belt-and-suspenders against the
+2026-05-21 failure (the adapter resolved `/usr/bin/claude` and got the literal
+string `Execution error`).
+
+## Step 6 — Smoke probe llm-connect before risking the full canary
+
+```bash
+curl -sS -X POST http://127.0.0.1:8080/execute \
+  -H "Content-Type: application/json" \
+  -d '{
+    "prompt": "Return ONLY JSON: {\"ok\": true}",
+    "config": {
+      "model_params": {
+        "json_schema": "{\"type\":\"object\",\"required\":[\"ok\"],\"properties\":{\"ok\":{\"type\":\"boolean\"}}}"
+      }
+    }
+  }'
+```
+
+Want: HTTP 200 with content that parses as `{"ok": true}`.
+- Literal `Execution error` → wrong CLI; recheck Step 5.
+- HTTP 500 → look at llm-connect terminal output before continuing.
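The pass/fail rules for the smoke probe can be written down as a small classifier. This is only a sketch: `classify_probe_content` is a hypothetical name, it takes the model's text content as an argument (how that content is nested inside the llm-connect HTTP response is not specified here, so extracting it is left out), and it uses a whitespace-tolerant regex rather than a real JSON parser.

```bash
# Hypothetical helper: classify the text content returned by the smoke probe.
classify_probe_content() {
  # $1: the content string taken from the /execute response
  case "$1" in
    *"Execution error"*)
      # The literal string from the 2026-05-21 failure: wrong CLI resolved.
      echo "wrong-cli"
      return 1
      ;;
  esac
  # Crude check for exactly {"ok": true}, allowing only whitespace variations.
  if printf '%s\n' "$1" | grep -Eq '^[[:space:]]*\{[[:space:]]*"ok"[[:space:]]*:[[:space:]]*true[[:space:]]*\}[[:space:]]*$'; then
    echo "pass"
    return 0
  fi
  echo "unexpected"
  return 1
}
```

A result of `wrong-cli` points back to Step 5; `unexpected` is a reason to read the llm-connect terminal output before triggering the canary.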
+
+## Optional — Inspect the digest the model will see
+
+```bash
+cd /home/worsch/activity-core
+STATE_HUB_URL=http://127.0.0.1:8000 uv run python -c "
+import json
+from activity_core.context_resolvers.state_hub import _daily_triage_digest
+d = json.loads(_daily_triage_digest({
+    'refresh': False, 'to_agent': 'hub', 'unread_only': True,
+    'max_workstreams': 12, 'max_next_steps': 8,
+}))
+print(json.dumps(d, indent=2))
+" > /tmp/digest.json
+echo "digest: $(wc -c < /tmp/digest.json) bytes"
+```
+
+## Step 7 — Trigger the canary
+
+```bash
+curl -sS -X POST \
+  http://127.0.0.1:8010/activity-definitions/6fca51fa-387a-4fd0-bc4e-d62c29eb859a/trigger \
+  -H "Content-Type: application/json"
+```
+
+Returns `{"workflow_id": "activity-6fca51fa-...:manual-", "trigger_key": "manual-"}`.
+Note the `workflow_id` for Step 8.
+
+Watch the worker terminal: `RunActivityWorkflow` start → resolvers
+(`state_hub`, `prompt_path`) → llm-connect call → schema validation → two
+report sinks (`working-memory`, `state-hub-progress`).
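Copying the `workflow_id` by hand invites paste errors, so it can be captured from the trigger response instead. The sketch below assumes the response is a single-line JSON object shaped like the example in Step 7. `extract_workflow_id` is a hypothetical name, and the `sed` pattern is a deliberate shortcut rather than a JSON parser.

```bash
# Hypothetical helper: read a trigger response on stdin and print the value of
# its "workflow_id" field, or nothing if the field is absent.
extract_workflow_id() {
  sed -n 's/.*"workflow_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

Piping the Step 7 `curl` into it, for example `WORKFLOW_ID=$(curl -sS -X POST ... | extract_workflow_id)`, leaves the id in a variable for Step 8. An empty `WORKFLOW_ID` means the trigger did not return the expected shape.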
+
+## Step 8 — Verify all three evidence surfaces
+
+```bash
+# (a) Working-memory note dated today (Europe/Berlin), with run-id suffix
+TODAY=$(TZ=Europe/Berlin date +%F)
+ls -lat /home/worsch/the-custodian/memory/working/ | head
+ls /home/worsch/the-custodian/memory/working/daily-triage-${TODAY}-*.md
+
+# (b) State Hub daily_triage event for this run
+curl -s "http://127.0.0.1:8000/progress/?event_type=daily_triage&limit=5" \
+  | python3 -m json.tool
+
+# (c) ActivityRun row + Temporal workflow status
+docker exec -i actcore-app-db psql -U actcore -d actcore -c \
+  "SELECT id, activity_definition_id, status, started_at, completed_at
+   FROM activity_runs
+   ORDER BY started_at DESC LIMIT 5;"
+
+# Replace the empty -w "" value with the workflow_id noted in Step 7
+docker exec temporal-admin-tools \
+  temporal workflow show -w "" --address temporal:7233 | head -40
+```
+
+**Acceptance for T06's canary half:** working-memory note + `daily_triage`
+progress event + `activity_runs` row all reference the same `run_id`, and the
+report is a real WSJF triage (not the `"summary":"ok"` stub from 2026-05-19).
+
+## Recursive trap to watch for
+
+The digest tells the model that `cust-wp-0045` is itself among the
+highest-priority open workplans. A real LLM may recommend `work-next` on T06 —
+"finish the activity-core daily triage cutover." That's correct, but it also
+means a successful canary will literally be a report telling you to finish the
+canary. Treat that as evidence the model is reading the digest, not as a new
+recommendation.
+
+## Step 9 — Only after Step 8 is clean: cutover
+
+Re-check the report content with a human before running these — they pause real
+automation and are visible to operators.
+
+```bash
+# Pause the Codex automation. There's no CLI for it; do it from the Claude
+# Desktop "Tasks/Automations" panel. The entry is named
+# "daily-state-hub-wsjf-triage". Leaving Codex running while activity-core is
+# enabled violates the workplan's single-runner rule.
+
+# Flip the ActivityDefinition to enabled and re-sync
+sed -i 's/^enabled: false$/enabled: true/' \
+  /home/worsch/the-custodian/activity-definitions/daily-statehub-wsjf-triage.md
+
+cd /home/worsch/activity-core
+export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
+export TEMPORAL_HOST=localhost:7233
+export ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian
+
+uv run python -m activity_core.sync_activity_definitions
+uv run python -m activity_core.sync_schedules
+
+# Confirm Temporal schedule is now unpaused and points at RunActivityWorkflow
+docker exec temporal-admin-tools \
+  temporal schedule describe \
+  --schedule-id "activity-6fca51fa-387a-4fd0-bc4e-d62c29eb859a" \
+  --address temporal:7233
+```
+
+Then commit the the-custodian change, push, and watch the next 07:20
+Europe/Berlin tick land all three evidence surfaces.
+
+## Cleanup if you abort
+
+```bash
+cd /home/worsch/activity-core
+make dev-down        # keeps volumes
+# make dev-down -v   # ONLY if you want to wipe state — would lose the
+                     # daily-triage ActivityDefinition row and any history
+```
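One caveat on the Step 9 flip: `sed -i` exits 0 even when no line matches, so a definition that is already enabled, or whose line is spelled slightly differently, passes silently and the re-sync changes nothing. Below is a minimal sketch of a guarded flip. `flip_enabled` is a hypothetical name, and it assumes GNU `sed` (as on the Linux host used throughout this runbook).

```bash
# Hypothetical helper: flip "enabled: false" to "enabled: true" in a definition
# file and fail loudly if there was nothing to flip.
flip_enabled() {
  # $1: path to the ActivityDefinition markdown file
  local file="$1"
  if ! grep -q '^enabled: false$' "$file"; then
    echo "no 'enabled: false' line in $file; nothing flipped" >&2
    return 1
  fi
  sed -i 's/^enabled: false$/enabled: true/' "$file"
  # Confirm the edit landed before any re-sync runs.
  grep -q '^enabled: true$' "$file"
}
```

Called as `flip_enabled /home/worsch/the-custodian/activity-definitions/daily-statehub-wsjf-triage.md && uv run python -m activity_core.sync_activity_definitions`, the re-sync only runs when the file actually changed.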