the-custodian/workplans/CUST-WP-0045-cutover-runbook.md
tegwick 1874ab52bb Pin llm-connect to :8088 in CUST-WP-0045 cutover runbook
llm-connect's CLI default port (:8080) collides with the dev stack's
temporal-ui container. Hit during the 2026-06-01 cutover attempt with
OSError: Address already in use. Update Steps 3, 5, and 6 to use :8088
and note the conflict reason inline so the next operator does not
rediscover this the slow way.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 03:32:01 +02:00


id: CUST-WP-0045-cutover-runbook
type: runbook
title: CUST-WP-0045 T06 cutover — exact command sequence
parent_workplan: CUST-WP-0045
created: 2026-06-01

CUST-WP-0045 T06 cutover — exact command sequence

Runbook for finishing CUST-WP-0045-T06 (Canary Cutover And Disable Codex Automation). The fixes for the 2026-05-21 real-LLM canary failure are merged (activity-core cf92f0d/5c4f96e, llm-connect b12d1af/82e3c07); this runbook reruns the patched canary and, if it passes, performs the cutover.

Definition UUID: 6fca51fa-387a-4fd0-bc4e-d62c29eb859a (stable UUIDv5 from the slug; the same id used in the 2026-05-19 mocked canary).

Heads-up before you start

  • Concurrent-session collision. A real-LLM canary calls Claude Code through llm-connect. If you trigger this from inside an active Claude Code session, the canary competes for Claude CLI quota and can hit the same Execution error / HTTP 500 path that broke the 2026-05-21 attempts. End any active Claude Code session in the same shell account first, or run from a fresh terminal with that risk in mind.
  • The repo .env uses Docker network hostnames (temporal:7233, app-db:5432, nats:4222), and the make targets auto-load that file. The host-mode processes below therefore export explicit env vars that override those values.
  • Volumes from the 2026-05-19 dev run persist (temporal-db-data, elasticsearch-data, app-db-data). The ActivityDefinition row and the paused Schedule should already be there. Re-syncing is idempotent.
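
A guard against the .env hostname trap above can be sketched in shell. The helper names is_host_mode and check_host_env are hypothetical and not part of activity-core; the sketch only assumes that a host-mode value contains localhost or 127.0.0.1.

```shell
# is_host_mode VALUE: succeed if VALUE points at the local host rather than
# a Docker-network hostname such as temporal:7233 or app-db:5432.
is_host_mode() {
  case "$1" in
    *localhost*|*127.0.0.1*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_host_env NAME...: check each named variable and report any that
# still carries a non-local address. Returns non-zero if any check fails.
check_host_env() {
  local name value status=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if ! is_host_mode "$value"; then
      echo "$name='$value' does not look like a host-mode address" >&2
      status=1
    fi
  done
  return "$status"
}

# Example, after the exports in a step:
#   check_host_env ACTCORE_DB_URL TEMPORAL_HOST || echo "fix env before continuing"
```

Running it right after the export lines of a step may catch a shell that inherited the Docker hostnames instead.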

Step 0 — Sanity check dependencies

# State Hub on :8000 (if down: cd ~/state-hub && make api)
curl -sf http://127.0.0.1:8000/state/health && echo " state-hub OK"

# Claude CLI resolves to user install, schema file exists
ls -l /home/worsch/.local/bin/claude
ls -l /home/worsch/the-custodian/schemas/daily-triage-report.json
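
The ls check confirms the file exists but not which binary a lookup would pick. One way to make that explicit is to compare the PATH result with the expected path. This is a sketch: cli_resolves_to is a hypothetical helper, and it assumes the relevant lookup follows PATH the way command -v does, which this runbook does not confirm.

```shell
# cli_resolves_to EXPECTED NAME: succeed only if NAME resolves on PATH to
# exactly EXPECTED.
cli_resolves_to() {
  local expected="$1" name="$2" actual
  actual="$(command -v "$name" 2>/dev/null)" || return 1
  [ "$actual" = "$expected" ]
}

# Example:
#   cli_resolves_to /home/worsch/.local/bin/claude claude \
#     || echo "claude resolves elsewhere: $(command -v claude)"
```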

Step 1 — Bring the activity-core dev stack up

cd /home/worsch/activity-core
make dev-up
docker compose -f docker-compose.dev.yml ps

Wait until temporal reports (healthy).
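
Instead of rerunning ps by hand, the wait can be sketched as a small retry loop. wait_for is a hypothetical helper; the example in the comment assumes the compose service is named temporal, which matches the hostname in .env but is not stated in this runbook.

```shell
# wait_for ATTEMPTS DELAY CMD...: rerun CMD until it succeeds or ATTEMPTS
# tries have been used, sleeping DELAY seconds between tries.
wait_for() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (service name "temporal" is an assumption):
#   wait_for 30 2 sh -c \
#     'docker compose -f docker-compose.dev.yml ps temporal | grep -q "(healthy)"'
```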

Step 2 — Migrate and re-sync the definition from the-custodian dir

cd /home/worsch/activity-core
export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
export TEMPORAL_HOST=localhost:7233
export TEMPORAL_NAMESPACE=default
export ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian

uv run alembic upgrade head
uv run python -m activity_core.sync_activity_definitions

The sync should report parsing daily-statehub-wsjf-triage.md with trigger, instruction block, and report sinks. The Temporal schedule stays paused (enabled: false).

Step 3 — Start the activity-core worker (foreground, leave running)

cd /home/worsch/activity-core
export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
export TEMPORAL_HOST=localhost:7233
export TEMPORAL_NAMESPACE=default
export STATE_HUB_URL=http://127.0.0.1:8000
export LLM_CONNECT_URL=http://127.0.0.1:8088   # NOT :8080 — taken by temporal-ui
export ISSUE_SINK_TYPE=null   # canary must not spawn real tasks

uv run python -m activity_core.worker

Watch for "registered RunActivityWorkflow" / polling lines.

Step 4 — Start the activity-core admin API (foreground, separate terminal)

cd /home/worsch/activity-core
export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
export TEMPORAL_HOST=localhost:7233
export TEMPORAL_NAMESPACE=default

uv run uvicorn activity_core.api:app --host 127.0.0.1 --port 8010

Step 5 — Start llm-connect server with the patched Claude Code adapter

cd /home/worsch/llm-connect
export LLM_CONNECT_CLAUDE_CLI_PATH=/home/worsch/.local/bin/claude

uv run python -m llm_connect.server --host 127.0.0.1 --port 8088 --provider claude-code

Port :8088 is chosen because the dev stack already binds :8080 for the Temporal UI (temporal-ui container). llm-connect's CLI default is :8080, and accepting it fails with OSError: [Errno 98] Address already in use.
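
To see a port conflict before the server hits it, the listener table can be checked first. The sketch below filters the output of ss -ltn; port_listed is a hypothetical helper, and it assumes the usual ss layout where the fourth column is the local address and port.

```shell
# port_listed PORT: read `ss -ltn` output on stdin and succeed if any
# listener's local address ends in :PORT.
port_listed() {
  awk -v p=":$1" '$4 ~ (p "$") { found = 1 } END { exit !found }'
}

# Example:
#   if ss -ltn | port_listed 8088; then echo "port 8088 is already taken"; fi
```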

The explicit LLM_CONNECT_CLAUDE_CLI_PATH is belt-and-suspenders against the 2026-05-21 failure (the adapter resolved /usr/bin/claude and got the literal string Execution error).

Step 6 — Smoke probe llm-connect before risking the full canary

curl -sS -X POST http://127.0.0.1:8088/execute \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Return ONLY JSON: {\"ok\": true}",
    "config": {
      "model_params": {
        "json_schema": "{\"type\":\"object\",\"required\":[\"ok\"],\"properties\":{\"ok\":{\"type\":\"boolean\"}}}"
      }
    }
  }'

Want: HTTP 200 with content that parses as {"ok": true}. Plain curl -sS does not print the status code; add -i to see the status line.

  • Literal Execution error → wrong CLI; recheck Step 5.
  • HTTP 500 → look at llm-connect terminal output before continuing.
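
The outcomes above can be turned into a scripted check. This is a rough sketch: classify_probe is a hypothetical helper, it does not parse the JSON body, and probe.json in the example is an assumed file holding the payload shown in the probe.

```shell
# classify_probe STATUS BODY: map a smoke-probe result onto the outcomes
# listed in this step. Prints a label; returns non-zero on failure.
classify_probe() {
  local status="$1" body="$2"
  case "$body" in
    *"Execution error"*) echo "wrong-cli"; return 1 ;;
  esac
  if [ "$status" != "200" ]; then
    echo "http-$status"
    return 1
  fi
  echo "ok"
}

# Example wiring (probe.json is hypothetical):
#   body_file="$(mktemp)"
#   status="$(curl -sS -o "$body_file" -w '%{http_code}' -X POST \
#     http://127.0.0.1:8088/execute -H 'Content-Type: application/json' \
#     -d @probe.json)"
#   classify_probe "$status" "$(cat "$body_file")"
```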

Optional — Inspect the digest the model will see

cd /home/worsch/activity-core
STATE_HUB_URL=http://127.0.0.1:8000 uv run python -c "
import json
from activity_core.context_resolvers.state_hub import _daily_triage_digest
d = json.loads(_daily_triage_digest({
    'refresh': False, 'to_agent': 'hub', 'unread_only': True,
    'max_workstreams': 12, 'max_next_steps': 8,
}))
print(json.dumps(d, indent=2))
" > /tmp/digest.json
echo "digest: $(wc -c < /tmp/digest.json) bytes"

Step 7 — Trigger the canary

curl -sS -X POST \
  http://127.0.0.1:8010/activity-definitions/6fca51fa-387a-4fd0-bc4e-d62c29eb859a/trigger \
  -H "Content-Type: application/json"

Returns {"workflow_id": "activity-6fca51fa-...:manual-<uuid>", "trigger_key": "manual-<uuid>"}. Note the workflow_id for Step 8.
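
To avoid copying the id by hand, the response can be piped through a small extractor. The sketch below uses sed and assumes the flat, single-line response shape shown above; extract_workflow_id is a hypothetical helper, and a real JSON parser would be more robust.

```shell
# extract_workflow_id: read the trigger response on stdin and print the
# value of its "workflow_id" field.
extract_workflow_id() {
  sed -n 's/.*"workflow_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example:
#   WORKFLOW_ID="$(curl -sS -X POST \
#     http://127.0.0.1:8010/activity-definitions/6fca51fa-387a-4fd0-bc4e-d62c29eb859a/trigger \
#     -H 'Content-Type: application/json' | extract_workflow_id)"
```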

Watch the worker terminal: RunActivityWorkflow start → resolvers (state_hub, prompt_path) → llm-connect call → schema validation → two report sinks (working-memory, state-hub-progress).

Step 8 — Verify all three evidence surfaces

# (a) Working-memory note dated today (Europe/Berlin), with run-id suffix
TODAY=$(TZ=Europe/Berlin date +%F)
ls -lat /home/worsch/the-custodian/memory/working/ | head
ls /home/worsch/the-custodian/memory/working/daily-triage-${TODAY}-*.md

# (b) State Hub daily_triage event for this run
curl -s "http://127.0.0.1:8000/progress/?event_type=daily_triage&limit=5" \
  | python3 -m json.tool

# (c) ActivityRun row + Temporal workflow status
docker exec -i actcore-app-db psql -U actcore -d actcore -c \
  "SELECT id, activity_definition_id, status, started_at, completed_at
   FROM activity_runs
   ORDER BY started_at DESC LIMIT 5;"

# Replace <workflow_id> with the value from Step 7
docker exec temporal-admin-tools \
  temporal workflow show -w "<workflow_id>" --address temporal:7233 | head -40

Acceptance for T06's canary half: working-memory note + daily_triage progress event + activity_runs row all reference the same run_id, and the report is a real WSJF triage (not the "summary":"ok" stub from 2026-05-19).
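
The "same run_id" part of the acceptance check can be sketched as a grep over saved copies of the three surfaces. all_mention is a hypothetical helper; how the run_id is formatted and where it appears in each surface is not specified here, so the sketch only tests for a literal match.

```shell
# all_mention NEEDLE FILE...: succeed only if every FILE contains NEEDLE as
# a literal string; report each file that does not.
all_mention() {
  local needle="$1" f status=0
  shift
  for f in "$@"; do
    if ! grep -qF -- "$needle" "$f"; then
      echo "missing $needle in $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Example (file names are hypothetical; save the output of (a), (b), (c) first):
#   all_mention "$RUN_ID" /tmp/note.md /tmp/progress.json /tmp/runs.txt
```

This covers only the run_id match; whether the report is a real WSJF triage still needs a human read.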

Recursive trap to watch for

The digest tells the model that cust-wp-0045 is itself among the highest-priority open workplans. A real LLM may recommend work-next on T06 — "finish the activity-core daily triage cutover." That's correct, but it also means a successful canary will literally be a report telling you to finish the canary. Treat that as evidence the model is reading the digest, not as a new recommendation.

Step 9 — Only after Step 8 is clean: cutover

Re-check the report content with a human before running these — they pause real automation and are visible to operators.

# Pause the Codex automation. There's no CLI for it; do it from the Claude
# Desktop "Tasks/Automations" panel. The entry is named
# "daily-state-hub-wsjf-triage". Leaving Codex running while activity-core is
# enabled violates the workplan's single-runner rule.

# Flip the ActivityDefinition to enabled and re-sync
sed -i 's/^enabled: false$/enabled: true/' \
  /home/worsch/the-custodian/activity-definitions/daily-statehub-wsjf-triage.md

cd /home/worsch/activity-core
export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
export TEMPORAL_HOST=localhost:7233
export TEMPORAL_NAMESPACE=default   # same as Step 2, which runs the same sync
export ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian

uv run python -m activity_core.sync_activity_definitions
uv run python -m activity_core.sync_schedules

# Confirm Temporal schedule is now unpaused and points at RunActivityWorkflow
docker exec temporal-admin-tools \
  temporal schedule describe \
    --schedule-id "activity-6fca51fa-387a-4fd0-bc4e-d62c29eb859a" \
    --address temporal:7233

Then commit the the-custodian change, push, and watch the next 07:20 Europe/Berlin tick land all three evidence surfaces.
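
The sed call in this step exits successfully even when no line matches, so a definition file with different spacing would stay disabled without any warning. A guarded variant might look like the sketch below; flip_enabled is a hypothetical helper, and it assumes GNU sed (for -i) and exactly one enabled: false line in the file.

```shell
# flip_enabled FILE: replace the single "enabled: false" line with
# "enabled: true", refusing to touch the file unless exactly one such line
# exists, and confirming the result afterwards.
flip_enabled() {
  local file="$1" count
  count="$(grep -c '^enabled: false$' "$file")" || true
  if [ "$count" != "1" ]; then
    echo "expected exactly one 'enabled: false' line in $file, found '$count'" >&2
    return 1
  fi
  sed -i 's/^enabled: false$/enabled: true/' "$file"
  grep -q '^enabled: true$' "$file"
}

# Example:
#   flip_enabled /home/worsch/the-custodian/activity-definitions/daily-statehub-wsjf-triage.md
```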

Cleanup if you abort

cd /home/worsch/activity-core
make dev-down            # keeps volumes
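# ONLY if you want to wipe state (loses the daily-triage ActivityDefinition
# row and any history). "make dev-down -v" does not do this: GNU make reads
# -v as --version. Remove the volumes through compose directly:
# docker compose -f docker-compose.dev.yml down -v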