| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|
| CUST-WP-0011 | workplan | Migrate Custodian State Hub to ThreePhoenix Cluster | custodian | the-custodian | active | custodian | custodian | 2026-03-11 | 2026-03-11 | 967baafb-d92d-405a-ba0b-0d00d37c4940 |
Migrate Custodian State Hub to ThreePhoenix Cluster
Goal
Move the Custodian State Hub (FastAPI + PostgreSQL) from its current home on the WSL2 operator workstation to the ThreePhoenix Kubernetes cluster (Railiance01/02/03), making it available to Claude Code sessions running on any machine with cluster access — without public internet exposure.
The State Hub is irreplaceable episodic memory. This migration must be executed with zero tolerance for data loss and a tested rollback path at every stage.
Pre-conditions (gate — do not start until all satisfied)
- ThreePhoenix cluster has three healthy nodes (Railiance01 confirmed, Railiance02 + Railiance03 joined)
- Longhorn distributed storage installed and verified (replication factor ≥ 2)
- HA failover test passes (tests/test_ha_failover.sh exits 0 on the cluster)
- S2 integrated backup operational and tested on the cluster
- A full WSL2 State Hub backup has been taken and restore-drilled within 24h of starting this workplan
These gates are mandatory. A single-node cluster or unverified storage is not an acceptable migration target for the Custodian.
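The node-count gate can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the output of kubectl get nodes --no-headers is piped in on stdin and that a healthy, schedulable node shows exactly Ready in the STATUS column. Both function names are illustrative and not part of any existing tooling.

```shell
# Count nodes whose STATUS column (field 2) is exactly "Ready".
# Input: output of "kubectl get nodes --no-headers" on stdin.
# Cordoned nodes ("Ready,SchedulingDisabled") are deliberately not counted.
count_ready_nodes() {
    awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Gate: succeed only if at least three nodes are Ready.
cluster_has_three_ready_nodes() {
    ready=$(count_ready_nodes)
    [ "$ready" -ge 3 ]
}
```

Example use: kubectl get nodes --no-headers | cluster_has_three_ready_nodes || echo "gate not met". This covers only the first pre-condition; Longhorn, failover and backup gates still need their own checks.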
Architecture after migration
COULOMBCORE / operator workstation (WSL2)
└─ Claude Code
└─ MCP server subprocess (Python, local clone of the-custodian)
└─ HTTP → ssh -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
└─ Railiance01 k3s
└─ state-hub ClusterIP service
├─ FastAPI pod (1–2 replicas)
└─ PostgreSQL PVC (Longhorn, 2-way replicated)
Key properties:
- Not publicly exposed — ClusterIP only; access via SSH port-forward
- Replicated storage — Longhorn replicates the PG data volume across nodes
- WSL2 instance retained as DR fallback during the stabilisation period
- MCP config unchanged: subprocess still calls http://127.0.0.1:8000; the SSH port-forward provides the binding
Backup and disaster recovery contract
Before and during migration, the following must hold at all times:
| Asset | Backup mechanism | RPO | Tested? |
|---|---|---|---|
| State Hub PostgreSQL DB | make backup (pg_dump → age-encrypted, Nextcloud offsite) | Daily | Must be drilled before T03 |
| State Hub DB on cluster | Longhorn snapshot + age-encrypted copy to /opt/backup/ | Daily | Must be drilled before T06 |
| WSL2 instance | Remains live during stabilisation period | — | Running |
Rollback rule: at any task boundary, if something is wrong, revert to WSL2. No task should leave the system in a state where both WSL2 and cluster are broken.
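The daily RPO and the 24h pre-condition both come down to how old the newest backup file is. The following is a minimal sketch of that check, assuming GNU stat and date (as found on WSL2 and typical Linux nodes). The function name backup_is_fresh is hypothetical. It checks only the file's age; it does not prove the backup can be restored.

```shell
# Succeed if FILE exists and was modified within the last MAX_AGE seconds
# (default 86400 = 24h). Uses GNU stat; -L follows a "latest" symlink.
backup_is_fresh() {
    file=$1
    max_age=${2:-86400}
    [ -f "$file" ] || return 1
    now=$(date +%s)
    mtime=$(stat -L -c %Y "$file") || return 1
    [ $((now - mtime)) -le "$max_age" ]
}
```

Example use: backup_is_fresh /opt/backup/custodian/state-hub-latest.sql.gz.age || echo "backup older than 24h".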
Tasks
T01 — Drill WSL2 backup restore end-to-end
id: T01
status: todo
priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
Before touching anything, prove the current backup can actually be restored:
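# In the-custodian/state-hub/
make backup # take fresh backup
# Spin up a test postgres container; POSTGRES_DB creates the state_hub database
docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test \
  -e POSTGRES_DB=state_hub -p 5433:5432 postgres:16
# Wait until the server accepts TCP connections
until pg_isready -h 127.0.0.1 -p 5433 -q; do sleep 1; done
export PGPASSWORD=test # matches POSTGRES_PASSWORD above, avoids the password prompt
# Decrypt and restore
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
  gunzip | psql -h 127.0.0.1 -p 5433 -U postgres state_hub
# Spot-check: count topics
psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT COUNT(*) FROM topics;" state_hub
# Remove the test container
docker rm -f pg-restore-test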
Done when: restore completes, topic count matches production, drill logged
in memory/episodic/.
T02 — Helm chart for State Hub (new: railiance-platform)
id: T02
status: todo
priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
Create helm/state-hub/ in railiance-platform (S3 layer owns platform
services). The chart must deploy:
- FastAPI deployment: image built from the-custodian/state-hub/, 1 replica initially (scale to 2 after T06)
- PostgreSQL StatefulSet: single instance backed by a Longhorn PVC (minimum 5 Gi); HA not required here because Longhorn replication IS the HA
- ClusterIP service state-hub on port 8000
- ConfigMap for non-secret config (DB URL template, log level)
- Secret for DB credentials (SOPS-encrypted values file)
- Liveness/readiness probe: GET /state/health
Values:
image:
repository: gitea.local/custodian/state-hub
tag: latest
postgres:
storageClass: longhorn
size: 5Gi
replicaCount: 1
Done when: helm lint passes; chart committed in railiance-platform.
T03 — Build and push State Hub container image
id: T03
status: todo
priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
Add state-hub/Dockerfile to the-custodian:
FROM python:3.12-slim
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen --no-dev
COPY api/ ./api/
COPY mcp_server/ ./mcp_server/
CMD ["uv", "run", "uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
Build and push to the cluster-local Gitea registry:
docker build -t gitea.local/custodian/state-hub:latest .
docker push gitea.local/custodian/state-hub:latest
Done when: image available in Gitea registry; helm install --dry-run
renders the chart with this image reference (a dry run does not pull the image, so confirm the pull separately).
T04 — Deploy to cluster and run Alembic migrations
id: T04
status: todo
priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
# From operator workstation via SSH port-forward to k3s API
helm install state-hub ./helm/state-hub/ \
-n custodian --create-namespace \
-f helm/state-hub/values-production.yaml
# Wait for pods
kubectl -n custodian rollout status deployment/state-hub
# Run migrations inside the pod
kubectl -n custodian exec -it deploy/state-hub -- \
uv run alembic upgrade head
Done when: pod Running, /state/health returns 200, Alembic reports
"head" from inside the pod.
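The Alembic part of this exit criterion can be scripted. The following is a minimal sketch, assuming alembic current marks an up-to-date database by appending (head) to the revision id and that its output is piped in on stdin. The function name alembic_at_head is hypothetical.

```shell
# Succeed if the "alembic current" output on stdin contains the "(head)" marker.
alembic_at_head() {
    grep -qF '(head)'
}
```

Example use: kubectl -n custodian exec deploy/state-hub -- uv run alembic current | alembic_at_head || echo "migrations not at head".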
T05 — Migrate data from WSL2 to cluster
id: T05
status: todo
priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
This is the point of no return for the DB — execute with care:
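# 1. Take final WSL2 backup
make -C ~/the-custodian/state-hub backup
# 2. Decrypt it to a plain SQL file (same pipeline as the T01 drill)
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/custodian/state-hub-latest.sql.gz.age | gunzip > /tmp/state-hub-migration.sql
# 3. Copy dump into the cluster postgres pod (kubectl cp needs a bare pod name, not pod/<name>)
PG_POD=$(kubectl -n custodian get pod -l app=state-hub-postgres \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n custodian cp /tmp/state-hub-migration.sql "$PG_POD":/tmp/state-hub-migration.sql
# 4. Restore
kubectl -n custodian exec -i "$PG_POD" -- \
  psql -U postgres -d state_hub -f /tmp/state-hub-migration.sql
# 5. Spot-check counts match WSL2 (run in the postgres pod; the T03 image does not install psql)
kubectl -n custodian exec -i "$PG_POD" -- \
  psql -U postgres -d state_hub -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"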
Rollback: if counts differ, delete cluster DB data, re-run from T04. WSL2 is still live and unchanged.
Done when: all table row counts match the WSL2 instance.
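Note that n_live_tup in pg_stat_user_tables is a statistics estimate, so an exact comparison is better based on SELECT COUNT(*) per table. The following is a minimal sketch of the comparison step, assuming each side has been exported to a text file with one "table count" pair per line. The function name and the file names in the usage note are hypothetical.

```shell
# Compare two listings of "table count" lines (source first, target second).
# Prints one line per discrepancy; exits 0 if identical, 1 otherwise.
# Line order in the files does not matter.
compare_counts() {
    awk '
        FILENAME == ARGV[1] { src[$1] = $2; next }
        {
            seen[$1] = 1
            if (!($1 in src)) { print $1 ": missing in source"; bad = 1 }
            else if (src[$1] != $2) { print $1 ": " src[$1] " != " $2; bad = 1 }
        }
        END {
            for (t in src) if (!(t in seen)) { print t ": missing in target"; bad = 1 }
            exit bad + 0
        }' "$1" "$2"
}
```

Each listing could be produced with something like psql -At -F ' ' -c "SELECT 'topics', COUNT(*) FROM topics;", repeated per table, then checked with compare_counts wsl2-counts.txt cluster-counts.txt.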
T06 — Drill cluster backup restore
id: T06
status: todo
priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
Before cutting over, prove the cluster backup can be restored:
# Trigger a backup via the cluster cron (or manually)
kubectl -n custodian create job --from=cronjob/state-hub-backup backup-drill-01
# Verify output in /opt/backup/ on the node holding the PVC
# Decrypt and restore to a test namespace
kubectl create ns restore-test
# ... restore steps similar to T01 but against cluster postgres
Done when: restore drill passes; drill logged.
T07 — Cutover: redirect MCP config to cluster
id: T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
Point every operator workstation (WSL2, COULOMBCORE) at the cluster state hub via SSH port-forward instead of the local process.
The MCP server subprocess still runs locally (Python, same server.py).
Only the API endpoint it calls changes — via a persistent port-forward:
# On operator workstation — keep this running (add to tunnel-daemon or tunnel-loop)
ssh -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
No change to .mcp.json needed — subprocess still calls http://127.0.0.1:8000.
Alternatively: update the MCP server's API_BASE env var to point directly
to the port-forward. Either approach is valid; document the chosen one.
Done when: claude /mcp shows state-hub connected; get_state_summary()
returns live cluster data.
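For a forward that is meant to stay up, a few OpenSSH client options help: -N (no remote command), ExitOnForwardFailure (fail fast if the local port is taken) and ServerAliveInterval/ServerAliveCountMax (detect a dead connection so a supervisor can restart it). The following is a minimal sketch that only prints the command. TUNNEL_TARGET and TUNNEL_HOST are hypothetical variable names. The forward destination is resolved on Railiance01, so whether that node can resolve cluster.local names is an assumption to verify; if it cannot, TUNNEL_TARGET can be set to the service's ClusterIP and port.

```shell
# Print the ssh invocation for the State Hub port-forward.
# Defaults mirror the values used in this workplan; override via environment.
tunnel_command() {
    port=${TUNNEL_PORT:-8000}
    target=${TUNNEL_TARGET:-state-hub.custodian.svc.cluster.local:8000}
    host=${TUNNEL_HOST:-tegwick@92.205.62.239}
    printf 'ssh -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -o ServerAliveCountMax=3 -L 127.0.0.1:%s:%s %s\n' \
        "$port" "$target" "$host"
}
```

Example use: eval "$(tunnel_command)". Changing TUNNEL_PORT away from 8000 would also require changing the address the MCP subprocess calls.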
T08 — Stabilisation period (2 weeks minimum)
id: T08
status: todo
priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
Run the cluster state hub as the primary for two weeks before retiring WSL2:
- Keep WSL2 state hub running (but frozen — no writes) as DR fallback
- Monitor cluster pod restarts, storage health, backup cron
- Run get_state_summary() at the start of each session; confirm data is live
- Test failover: kill the FastAPI pod; verify it restarts and responds within 60s
Done when: two weeks elapsed with no data loss events; all backup drills passed.
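The 60-second failover check can be timed with a small polling helper. The following is a minimal sketch, assuming a POSIX shell with date +%s available. The helper name wait_until is hypothetical, and the probe command is supplied by the caller.

```shell
# Run CMD repeatedly until it succeeds or TIMEOUT seconds have elapsed.
# Usage: wait_until TIMEOUT INTERVAL CMD [ARGS...]
# Returns 0 on the first success, 1 if the deadline passes first.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    while :; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep "$interval"
    done
}
```

Example use, after deleting the FastAPI pod and with the port-forward up: wait_until 60 2 curl -fsS http://127.0.0.1:8000/state/health || echo "no response within 60s".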
T09 — Retire WSL2 instance
id: T09
status: todo
priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
Once T08 stabilisation passes:
- Take a final WSL2 backup (archive, keep indefinitely)
- Stop the WSL2 Docker container: make -C ~/the-custodian/state-hub clean
- Update CLAUDE.md (global and project) to remove WSL2 state hub start instructions
- Update MEMORY.md: state hub is now cluster-hosted
- Record a decision in the state hub: "State Hub WSL2 instance retired"
Done when: WSL2 state hub no longer running; documentation updated.
References
- Constitution constraint: irreversible actions require human approval — T05 (data migration) and T09 (WSL2 retirement) require explicit sign-off
- OAS layer: S3 Platform Services (railiance-platform)
- DR dependency: Longhorn storage (railiance-cluster WP to be linked)
- Extension point: EP-RAIL-005 (full-stack backup); state hub must implement the make backup / make restore standard interface before T06
- Domain goal: 6f96c712-60e6-4ea9-ab06-168878eafbce (Three-Phoenix Secure Kubernetes Infrastructure)