llm-connect's CLI default port (:8080) collides with the dev stack's temporal-ui container. Hit during the 2026-06-01 cutover attempt with OSError: Address already in use. Update Steps 3, 5, and 6 to use :8088 and note the conflict reason inline so the next operator does not rediscover this the slow way. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
244 lines
8.8 KiB
Markdown
---
id: CUST-WP-0045-cutover-runbook
type: runbook
title: "CUST-WP-0045 T06 cutover — exact command sequence"
parent_workplan: CUST-WP-0045
created: "2026-06-01"
---

# CUST-WP-0045 T06 cutover — exact command sequence

Runbook for finishing `CUST-WP-0045-T06` (Canary Cutover And Disable Codex
Automation). The fixes for the 2026-05-21 real-LLM canary failure are merged
(activity-core `cf92f0d`/`5c4f96e`, llm-connect `b12d1af`/`82e3c07`); this
runbook reruns the patched canary and, if it passes, performs the cutover.

Definition UUID: `6fca51fa-387a-4fd0-bc4e-d62c29eb859a` (stable UUIDv5 from the
slug; the same id used in the 2026-05-19 mocked canary).

## Heads-up before you start

- **Concurrent-session collision.** A real-LLM canary calls Claude Code through
  llm-connect. If you trigger this from inside an active Claude Code session,
  the canary competes for Claude CLI quota and can hit the same
  `Execution error` / HTTP 500 path that broke the 2026-05-21 attempts. End any
  active Claude Code session in the same shell account first, or run from a
  fresh terminal with that risk in mind.
- The repo `.env` uses Docker network hostnames (`temporal:7233`,
  `app-db:5432`, `nats:4222`). The `make` targets auto-load that file, so the
  host-mode processes below pass explicit env vars on the command line to
  override those hostnames with `localhost` addresses.
- Volumes from the 2026-05-19 dev run persist (`temporal-db-data`,
  `elasticsearch-data`, `app-db-data`). The ActivityDefinition row and the
  paused Schedule should already be there. Re-syncing is idempotent, so
  repeating Step 2 is safe.

## Step 0 — Sanity check dependencies

```bash
# State Hub on :8000 (if down: cd ~/state-hub && make api)
curl -sf http://127.0.0.1:8000/state/health && echo " state-hub OK"

# Claude CLI resolves to user install, schema file exists
ls -l /home/worsch/.local/bin/claude
ls -l /home/worsch/the-custodian/schemas/daily-triage-report.json
```

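The two `ls` checks in Step 0 can be turned into a pass/fail preflight so a missing file stops the run before any service starts. This is a minimal sketch, not something shipped with activity-core or llm-connect: `require_executable` and `require_file` are hypothetical helper names, and the paths in the usage comment are the ones from Step 0.

```bash
# Hypothetical preflight helpers (not part of either repo).

# Succeeds only if the path exists and is executable.
require_executable() {
  [ -x "$1" ] || { echo "missing or not executable: $1" >&2; return 1; }
}

# Succeeds only if the path is a regular, non-empty file.
require_file() {
  { [ -f "$1" ] && [ -s "$1" ]; } || { echo "missing or empty: $1" >&2; return 1; }
}

# Usage, with the paths from Step 0:
#   require_executable /home/worsch/.local/bin/claude &&
#   require_file /home/worsch/the-custodian/schemas/daily-triage-report.json &&
#   echo "preflight OK"
```

Chaining the calls with `&&` means the final `echo` only runs when every check passed.
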
## Step 1 — Bring the activity-core dev stack up

```bash
cd /home/worsch/activity-core
make dev-up
docker compose -f docker-compose.dev.yml ps
```

Wait until `temporal` reports `(healthy)`.

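Waiting for the health status in Step 1 can be scripted with a small retry loop instead of rerunning `ps` by hand. The sketch below is an illustration only: `wait_for` is a hypothetical helper, and the usage comment assumes the compose service is named `temporal`, matching the name in the `ps` output.

```bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for <attempts> <sleep_seconds> <command> [args...]
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0      # stop as soon as the command succeeds
    sleep "$delay"
    i=$((i + 1))
  done
  return 1                # every attempt failed
}

# Usage for Step 1 (service name `temporal` is an assumption):
#   wait_for 30 2 sh -c \
#     'docker compose -f docker-compose.dev.yml ps temporal | grep -q "(healthy)"' \
#     && echo "temporal healthy"
```

A bounded attempt count keeps the loop from hanging forever if the container never becomes healthy.
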
## Step 2 — Migrate and re-sync the definition from the-custodian dir

```bash
cd /home/worsch/activity-core
export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
export TEMPORAL_HOST=localhost:7233
export TEMPORAL_NAMESPACE=default
export ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian

uv run alembic upgrade head
uv run python -m activity_core.sync_activity_definitions
```

The sync should report parsing `daily-statehub-wsjf-triage.md` with trigger,
instruction block, and report sinks. The Temporal schedule stays paused
(`enabled: false`).

## Step 3 — Start the activity-core worker (foreground, leave running)

```bash
cd /home/worsch/activity-core
export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
export TEMPORAL_HOST=localhost:7233
export TEMPORAL_NAMESPACE=default
export STATE_HUB_URL=http://127.0.0.1:8000
export LLM_CONNECT_URL=http://127.0.0.1:8088   # NOT :8080 (taken by temporal-ui)
export ISSUE_SINK_TYPE=null                    # canary must not spawn real tasks

uv run python -m activity_core.worker
```

Watch the worker output for the "registered RunActivityWorkflow" line and the
polling lines that follow it.

## Step 4 — Start the activity-core admin API (foreground, separate terminal)

```bash
cd /home/worsch/activity-core
export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
export TEMPORAL_HOST=localhost:7233
export TEMPORAL_NAMESPACE=default

uv run uvicorn activity_core.api:app --host 127.0.0.1 --port 8010
```

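Steps 2, 3, and 4 repeat the same three host-mode overrides in separate terminals, which is where a typo can creep in. One way to keep them in sync is a shell function sourced in each terminal. This is a sketch only: `actcore_host_env` is a hypothetical name, and the values are copied from the steps above rather than read from `.env`.

```bash
# Hypothetical helper: export the host-mode overrides shared by Steps 2, 3, and 4.
# Values are copied verbatim from the runbook; nothing here reads the repo .env.
actcore_host_env() {
  export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
  export TEMPORAL_HOST=localhost:7233
  export TEMPORAL_NAMESPACE=default
}

# Usage: call it once per terminal, then add the step-specific variables, e.g.
#   actcore_host_env
#   export STATE_HUB_URL=http://127.0.0.1:8000
```
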
## Step 5 — Start llm-connect server with the patched Claude Code adapter

```bash
cd /home/worsch/llm-connect
export LLM_CONNECT_CLAUDE_CLI_PATH=/home/worsch/.local/bin/claude

uv run python -m llm_connect.server --host 127.0.0.1 --port 8088 --provider claude-code
```

Port `:8088` is chosen because the dev stack already binds `:8080` for the
Temporal UI (`temporal-ui` container). llm-connect's CLI default is `:8080`,
and the server fails with `OSError: [Errno 98] Address already in use` if you
accept that default.

The explicit `LLM_CONNECT_CLAUDE_CLI_PATH` is belt-and-suspenders against the
2026-05-21 failure (the adapter resolved `/usr/bin/claude` and got the literal
string `Execution error`).

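The port conflict described in Step 5 can be detected before the server is started, rather than from the `OSError`. The sketch below assumes a Linux host with the iproute2 `ss` tool; `port_is_bound` is a hypothetical helper that reads `ss -ltnH` output on stdin and checks the fourth column (local address and port).

```bash
# Hypothetical helper: succeed if a TCP listener on the given port appears in
# `ss -ltnH` output supplied on stdin (column 4 is "Local Address:Port").
port_is_bound() {
  awk -v suffix=":$1" '
    {
      # Compare only the tail of the column so :8080 does not match :18080.
      start = length($4) - length(suffix) + 1
      if (start > 0 && substr($4, start) == suffix) found = 1
    }
    END { exit !found }
  '
}

# Usage before Step 5:
#   if ss -ltnH | port_is_bound 8088; then
#     echo "port 8088 is already taken; pick another" >&2
#   fi
```

Reading from stdin keeps the check independent of how the listener list was produced, so the same function works on saved output.
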
## Step 6 — Smoke probe llm-connect before risking the full canary

```bash
curl -sS -X POST http://127.0.0.1:8088/execute \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Return ONLY JSON: {\"ok\": true}",
    "config": {
      "model_params": {
        "json_schema": "{\"type\":\"object\",\"required\":[\"ok\"],\"properties\":{\"ok\":{\"type\":\"boolean\"}}}"
      }
    }
  }'
```

Expected: HTTP 200 with content that parses as `{"ok": true}`.
- Literal `Execution error` → wrong CLI; recheck Step 5.
- HTTP 500 → look at llm-connect terminal output before continuing.

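The pass condition for the smoke probe can be checked mechanically instead of by eye. This is a sketch under a stated assumption: the text piped into `is_ok_json` (a hypothetical helper) is the model content itself. The llm-connect response envelope is not shown in this runbook, so if the server wraps the content, extract it first.

```bash
# Hypothetical helper: succeed only if stdin is a JSON object whose "ok" field
# is the boolean true. Anything that is not valid JSON (for example the literal
# string "Execution error") fails.
is_ok_json() {
  python3 -c '
import json, sys
try:
    doc = json.load(sys.stdin)
except ValueError:
    sys.exit(1)
sys.exit(0 if isinstance(doc, dict) and doc.get("ok") is True else 1)
'
}

# Usage (assumes /tmp/smoke-content.json holds the content from the probe):
#   is_ok_json < /tmp/smoke-content.json && echo "smoke probe OK"
```
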
## Optional — Inspect the digest the model will see

```bash
cd /home/worsch/activity-core
STATE_HUB_URL=http://127.0.0.1:8000 uv run python -c "
import json
from activity_core.context_resolvers.state_hub import _daily_triage_digest
d = json.loads(_daily_triage_digest({
    'refresh': False, 'to_agent': 'hub', 'unread_only': True,
    'max_workstreams': 12, 'max_next_steps': 8,
}))
print(json.dumps(d, indent=2))
" > /tmp/digest.json
echo "digest: $(wc -c < /tmp/digest.json) bytes"
```

## Step 7 — Trigger the canary

```bash
curl -sS -X POST \
  http://127.0.0.1:8010/activity-definitions/6fca51fa-387a-4fd0-bc4e-d62c29eb859a/trigger \
  -H "Content-Type: application/json"
```

Returns `{"workflow_id": "activity-6fca51fa-...:manual-<uuid>", "trigger_key": "manual-<uuid>"}`.
Note the `workflow_id` for Step 8.

Watch the worker terminal: `RunActivityWorkflow` start → resolvers
(`state_hub`, `prompt_path`) → llm-connect call → schema validation → two
report sinks (`working-memory`, `state-hub-progress`).

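Since Step 8 needs the `workflow_id` from the trigger response, it can be captured into a shell variable rather than copied by hand. A minimal sketch, assuming the response has the shape shown in Step 7; `extract_workflow_id` is a hypothetical helper name.

```bash
# Hypothetical helper: print the workflow_id field of the trigger response
# supplied on stdin. Exits non-zero if the input is not JSON or lacks the field.
extract_workflow_id() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["workflow_id"])'
}

# Usage for Step 7 (same request as above, captured into a variable):
#   WORKFLOW_ID=$(curl -sS -X POST \
#     http://127.0.0.1:8010/activity-definitions/6fca51fa-387a-4fd0-bc4e-d62c29eb859a/trigger \
#     -H "Content-Type: application/json" | extract_workflow_id)
#   echo "$WORKFLOW_ID"
```
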
## Step 8 — Verify all three evidence surfaces

```bash
# (a) Working-memory note dated today (Europe/Berlin), with run-id suffix
TODAY=$(TZ=Europe/Berlin date +%F)
ls -lat /home/worsch/the-custodian/memory/working/ | head
ls /home/worsch/the-custodian/memory/working/daily-triage-${TODAY}-*.md

# (b) State Hub daily_triage event for this run
curl -s "http://127.0.0.1:8000/progress/?event_type=daily_triage&limit=5" \
  | python3 -m json.tool

# (c) ActivityRun row + Temporal workflow status
docker exec -i actcore-app-db psql -U actcore -d actcore -c \
  "SELECT id, activity_definition_id, status, started_at, completed_at
     FROM activity_runs
    ORDER BY started_at DESC LIMIT 5;"

# Replace <workflow_id> with the value from Step 7
docker exec temporal-admin-tools \
  temporal workflow show -w "<workflow_id>" --address temporal:7233 | head -40
```

**Acceptance for T06's canary half:** working-memory note + `daily_triage`
progress event + `activity_runs` row all reference the same `run_id`, and the
report is a real WSJF triage (not the `"summary":"ok"` stub from 2026-05-19).

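Check (a) in Step 8 can be made pass/fail rather than an eyeballed listing. The sketch below is illustrative only: `todays_triage_notes` is a hypothetical helper, and it reuses the `daily-triage-<date>-*.md` naming and the Europe/Berlin date from the commands above.

```bash
# Hypothetical helper: list working-memory notes for today's Europe/Berlin date
# in the given directory. Returns non-zero when no matching note exists.
todays_triage_notes() {
  ls "$1"/daily-triage-"$(TZ=Europe/Berlin date +%F)"-*.md 2>/dev/null
}

# Usage for Step 8 (a):
#   todays_triage_notes /home/worsch/the-custodian/memory/working \
#     || echo "no working-memory note for today yet" >&2
```

This only confirms that a note for today exists; matching its `run_id` against the other two surfaces still needs the checks in (b) and (c).
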
## Recursive trap to watch for

The digest tells the model that `cust-wp-0045` is itself among the
highest-priority open workplans. A real LLM may recommend `work-next` on T06:
"finish the activity-core daily triage cutover." That's correct, but it also
means a successful canary will literally be a report telling you to finish the
canary. Treat that as evidence the model is reading the digest, not as a new
recommendation.

## Step 9 — Only after Step 8 is clean: cutover

Re-check the report content with a human before running these commands: they
pause real automation and are visible to operators.

```bash
# Pause the Codex automation. There's no CLI for it; do it from the Claude
# Desktop "Tasks/Automations" panel. The entry is named
# "daily-state-hub-wsjf-triage". Leaving Codex running while activity-core is
# enabled violates the workplan's single-runner rule.

# Flip the ActivityDefinition to enabled and re-sync
sed -i 's/^enabled: false$/enabled: true/' \
  /home/worsch/the-custodian/activity-definitions/daily-statehub-wsjf-triage.md

cd /home/worsch/activity-core
export ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore
export TEMPORAL_HOST=localhost:7233
export ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian

uv run python -m activity_core.sync_activity_definitions
uv run python -m activity_core.sync_schedules

# Confirm Temporal schedule is now unpaused and points at RunActivityWorkflow
docker exec temporal-admin-tools \
  temporal schedule describe \
    --schedule-id "activity-6fca51fa-387a-4fd0-bc4e-d62c29eb859a" \
    --address temporal:7233
```

Then commit the change in the-custodian, push, and watch the next 07:20
Europe/Berlin tick land all three evidence surfaces.

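The `sed -i` flip in Step 9 succeeds silently even when the pattern matches nothing, for example if the line was already changed or is indented. A guarded variant is sketched below; `flip_enabled` is a hypothetical helper, and it uses the same GNU `sed -i` form and the same pattern as the command above.

```bash
# Hypothetical helper: flip `enabled: false` to `enabled: true` in one file,
# refusing to touch it unless exactly one such line exists.
flip_enabled() {
  count=$(grep -c '^enabled: false$' "$1") || true
  if [ "$count" != "1" ]; then
    echo "expected exactly one 'enabled: false' line in $1, found ${count:-0}" >&2
    return 1
  fi
  sed -i 's/^enabled: false$/enabled: true/' "$1"
  # Confirm the edit actually landed before the re-sync runs.
  grep -q '^enabled: true$' "$1"
}

# Usage for Step 9:
#   flip_enabled /home/worsch/the-custodian/activity-definitions/daily-statehub-wsjf-triage.md \
#     && echo "definition enabled; re-sync next"
```

Running it a second time fails on purpose, which makes an accidental repeat of the cutover visible.
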
## Cleanup if you abort

```bash
cd /home/worsch/activity-core
make dev-down            # keeps volumes
# make dev-down -v       # ONLY if you want to wipe state; would lose the
#                        # daily-triage ActivityDefinition row and any history
```