
tegwick 890b2f9fc7 feat(ops+workplans): fix tunnel targets, plan custodian migration, close legacy ADR-001 gaps

Tunnel (state-hub/Makefile):
- Replace interactive `make tunnel` (now non-blocking with -N flag)
- Add tunnel-daemon (autossh background), tunnel-loop (reconnect fallback),
  tunnel-status, tunnel-stop
- Default COULOMBCORE=tegwick@92.205.130.254; TUNNEL_PORT configurable
- Clarified server topology: COULOMBCORE=92.205.130.254 (old),
  Railiance01=92.205.62.239 (ThreePhoenix node 1)

Workplans:
- CUST-WP-0011: Migrate Custodian State Hub to ThreePhoenix cluster —
  9-task plan with hard pre-condition gates (3-node cluster, Longhorn HA,
  backup drill), data migration, 2-week stabilisation, WSL2 retirement
- CUST-WP-0000: Retroactive record for state-hub v0.1 (pre-ADR-001)
- CUST-WP-0000b: Retroactive record for state-hub v0.2 (pre-ADR-001)

Consistency: repo now ✓ PASS (0 fail, 18 warn — all pre-ADR-001 C-12 history)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-11 01:09:07 +01:00


id	type	title	domain	repo	status	owner	topic_slug	created	updated	state_hub_workstream_id
CUST-WP-0011	workplan	Migrate Custodian State Hub to ThreePhoenix Cluster	custodian	the-custodian	active	custodian	custodian	2026-03-11	2026-03-11	967baafb-d92d-405a-ba0b-0d00d37c4940

Migrate Custodian State Hub to ThreePhoenix Cluster

Goal

Move the Custodian State Hub (FastAPI + PostgreSQL) from its current home on the WSL2 operator workstation to the ThreePhoenix Kubernetes cluster (Railiance01/02/03), making it available to Claude Code sessions running on any machine with cluster access — without public internet exposure.

The State Hub is irreplaceable episodic memory. This migration must be executed with zero tolerance for data loss and a tested rollback path at every stage.

Pre-conditions (gate — do not start until all satisfied)

ThreePhoenix cluster has three healthy nodes (Railiance01 confirmed, Railiance02 + Railiance03 joined)
Longhorn distributed storage installed and verified (replication factor ≥ 2)
HA failover test passes (tests/test_ha_failover.sh exits 0 on the cluster)
S2 integrated backup operational and tested on the cluster
A full WSL2 State Hub backup has been taken and restore-drilled in the 24 h before this workplan starts

These gates are mandatory. A single-node cluster or unverified storage is not an acceptable migration target for the Custodian.
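The numeric gates above could be checked mechanically before T01 starts. The following is a minimal sketch, not part of the repository: `check_gates` and its argument order are hypothetical, the operator supplies the four inputs, and the S2 backup gate still needs a manual sign-off.

```shell
#!/usr/bin/env bash
# Hypothetical helper: evaluates the numeric pre-condition gates.
# Usage: check_gates READY_NODES LONGHORN_REPLICAS DRILL_AGE_HOURS HA_TEST_EXIT
check_gates() {
  ready_nodes=$1
  replicas=$2
  drill_age_h=$3
  ha_exit=$4
  failed=0

  if [ "$ready_nodes" -lt 3 ]; then
    echo "FAIL: ${ready_nodes} Ready node(s), need 3"; failed=1
  fi
  if [ "$replicas" -lt 2 ]; then
    echo "FAIL: Longhorn replication factor ${replicas}, need >= 2"; failed=1
  fi
  if [ "$ha_exit" -ne 0 ]; then
    echo "FAIL: tests/test_ha_failover.sh exited ${ha_exit}"; failed=1
  fi
  if [ "$drill_age_h" -gt 24 ]; then
    echo "FAIL: WSL2 restore drill is ${drill_age_h}h old, need <= 24h"; failed=1
  fi

  if [ "$failed" -eq 0 ]; then
    echo "All checked gates satisfied"
  fi
  return "$failed"
}

# Example inputs (assumed commands, adjust to the cluster):
#   ready=$(kubectl get nodes --no-headers | awk '$2 == "Ready"' | wc -l)
#   check_gates "$ready" 2 6 0
```

The function reports every failed gate rather than stopping at the first, so one run shows everything that still blocks the migration.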

Architecture after migration

COULOMBCORE / operator workstation (WSL2)
  └─ Claude Code
       └─ MCP server subprocess (Python, local clone of the-custodian)
            └─ HTTP → ssh -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
                          └─ Railiance01 k3s
                               └─ state-hub ClusterIP service
                                    ├─ FastAPI pod (1–2 replicas)
                                    └─ PostgreSQL PVC (Longhorn, 2-way replicated)

Key properties:

Not publicly exposed — ClusterIP only; access via SSH port-forward
Replicated storage — Longhorn replicates the PG data volume across nodes
WSL2 instance retained as DR fallback during the stabilisation period
MCP config unchanged — subprocess still calls http://127.0.0.1:8000; the SSH port-forward provides the binding

Backup and disaster recovery contract

Before and during migration, the following must hold at all times:

Asset	Backup mechanism	RPO	Tested?
State Hub PostgreSQL DB	`make backup` (pg_dump → age-encrypted, Nextcloud offsite)	Daily	Must be drilled before T03
State Hub DB on cluster	Longhorn snapshot + age-encrypted copy to `/opt/backup/`	Daily	Must be drilled before T06
WSL2 instance	Remains live during stabilisation period	—	Running

Rollback rule: at any task boundary, if something is wrong, revert to WSL2. No task should leave the system in a state where both WSL2 and cluster are broken.
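One way to watch the daily RPO in the table above is to check the modification time of the newest backup file. This is a sketch under assumptions: `backup_is_fresh` is a hypothetical helper, and the path in the usage comment is the one T01 reads from, which may differ for the cluster backup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if FILE exists and was modified within the RPO.
# Usage: backup_is_fresh FILE [MAX_AGE_MINUTES]   (default 1440 = one day)
backup_is_fresh() {
  file=$1
  max_age_min=${2:-1440}
  [ -f "$file" ] || return 1
  # find prints the path only if it was modified less than max_age_min minutes ago
  [ -n "$(find "$file" -mmin "-${max_age_min}" -print)" ]
}

# Example (path taken from T01):
#   backup_is_fresh /opt/backup/custodian/state-hub-latest.sql.gz.age \
#     || echo "backup older than RPO, do not proceed"
```

A fresh timestamp only shows that a backup ran; the restore drills in T01 and T06 remain the actual proof that it is usable.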

Tasks

T01 — Drill WSL2 backup restore end-to-end

id: T01
status: todo
priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"

Before touching anything, prove the current backup can actually be restored:

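# In the-custodian/state-hub/
make backup                         # take fresh backup
# Spin up a test postgres container
docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test \
  -p 5433:5432 postgres:16
export PGPASSWORD=test              # matches POSTGRES_PASSWORD above
# Wait until the container accepts connections, then create the target DB
# (a fresh container only has the default "postgres" database)
until pg_isready -h 127.0.0.1 -p 5433 -q; do sleep 1; done
createdb -h 127.0.0.1 -p 5433 -U postgres state_hub
# Decrypt and restore
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
  gunzip | psql -h 127.0.0.1 -p 5433 -U postgres state_hub
# Spot-check: count topics
psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT COUNT(*) FROM topics;" state_hub
docker rm -f pg-restore-test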

Done when: restore completes, topic count matches production, drill logged in memory/episodic/.

T02 — Helm chart for State Hub (new: railiance-platform)

id: T02
status: todo
priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"

Create helm/state-hub/ in railiance-platform (S3 layer owns platform services). The chart must deploy:

FastAPI deployment — image built from the-custodian/state-hub/, 1 replica initially (scale to 2 after T06)
PostgreSQL StatefulSet — single instance backed by a Longhorn PVC (minimum 5 Gi); HA not required here — Longhorn replication IS the HA
ClusterIP service state-hub on port 8000
ConfigMap for non-secret config (DB URL template, log level)
Secret for DB credentials (SOPS-encrypted values file)
Liveness/readiness probe — GET /state/health

Values:

image:
  repository: gitea.local/custodian/state-hub
  tag: latest
postgres:
  storageClass: longhorn
  size: 5Gi
replicaCount: 1

Done when: helm lint passes; chart committed in railiance-platform.

T03 — Build and push State Hub container image

id: T03
status: todo
priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"

Add state-hub/Dockerfile to the-custodian:

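FROM python:3.12-slim
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen --no-dev
COPY api/ ./api/
COPY mcp_server/ ./mcp_server/
# T04 runs `uv run alembic upgrade head` inside this image, so the Alembic
# config and migration scripts must also be copied if they live outside api/.
CMD ["uv", "run", "uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]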

Build and push to the cluster-local Gitea registry:

docker build -t gitea.local/custodian/state-hub:latest .
docker push gitea.local/custodian/state-hub:latest

Done when: image available in Gitea registry; helm install --dry-run resolves the image.

T04 — Deploy to cluster and run Alembic migrations

id: T04
status: todo
priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"

# From operator workstation via SSH port-forward to k3s API
helm install state-hub ./helm/state-hub/ \
  -n custodian --create-namespace \
  -f helm/state-hub/values-production.yaml

# Wait for pods
kubectl -n custodian rollout status deployment/state-hub

# Run migrations inside the pod
kubectl -n custodian exec -it deploy/state-hub -- \
  uv run alembic upgrade head

Done when: pod Running, /state/health returns 200, Alembic reports "head" from inside the pod.

T05 — Migrate data from WSL2 to cluster

id: T05
status: todo
priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"

This is the point of no return for the DB — execute with care:

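# 1. Take final WSL2 backup
make -C ~/the-custodian/state-hub backup

# 2. Decrypt the backup to a plain SQL dump (same pipeline as T01)
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
  gunzip > /tmp/state-hub-migration.sql

# 3. Copy dump into the cluster postgres pod
#    (kubectl cp needs a bare pod name, not the pod/<name> form from -o name)
PG_POD=$(kubectl -n custodian get pod -l app=state-hub-postgres \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n custodian cp /tmp/state-hub-migration.sql "$PG_POD":/tmp/

# 4. Restore
kubectl -n custodian exec -it "$PG_POD" -- \
  psql -U postgres -d state_hub -f /tmp/state-hub-migration.sql

# 5. Spot-check counts match WSL2 (run in the postgres pod, which has psql;
#    ANALYZE first because n_live_tup is a statistics estimate)
kubectl -n custodian exec -it "$PG_POD" -- \
  psql -U postgres -d state_hub -c "ANALYZE;" \
  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"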

Rollback: if counts differ, delete cluster DB data, re-run from T04. WSL2 is still live and unchanged.

Done when: all table row counts match the WSL2 instance.
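Because `n_live_tup` is only an estimate, the "done when" check is more reliable with exact `count(*)` output compared file against file. This is a minimal sketch; `counts_match` and the "table|count" file format are assumptions, not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compares two files of "table|count" lines, ignoring order.
# Each file could be produced per table with, for example:
#   psql -At -c "SELECT 'topics', count(*) FROM topics" state_hub >> counts.txt
# Usage: counts_match WSL2_FILE CLUSTER_FILE
counts_match() {
  wsl2_sorted=$(sort "$1") || return 2
  cluster_sorted=$(sort "$2") || return 2
  if [ "$wsl2_sorted" = "$cluster_sorted" ]; then
    echo "OK: row counts match"
    return 0
  fi
  echo "MISMATCH: WSL2 vs cluster"
  printf '%s\n' "--- WSL2" "$wsl2_sorted" "--- cluster" "$cluster_sorted"
  return 1
}
```

Exit status 0 means the counts match, 1 means they differ (the rollback case), and 2 means an input file could not be read.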

T06 — Drill cluster backup restore

id: T06
status: todo
priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"

Before cutting over, prove the cluster backup can be restored:

# Trigger a backup via the cluster cron (or manually)
kubectl -n custodian create job --from=cronjob/state-hub-backup backup-drill-01

# Verify output in /opt/backup/ on the node holding the PVC
# Decrypt and restore to a test namespace
kubectl create ns restore-test
# ... restore steps similar to T01 but against cluster postgres

Done when: restore drill passes; drill logged.

T07 — Cutover: redirect MCP config to cluster

id: T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"

Update the MCP config on every operator workstation (WSL2, COULOMBCORE) to reach the cluster state hub via SSH port-forward instead of the local process.

The MCP server subprocess still runs locally (Python, same server.py). Only the API endpoint it calls changes — via a persistent port-forward:

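# On operator workstation: keep this running (add to tunnel-daemon or tunnel-loop)
# -N forwards only, without opening a remote shell. The service name is resolved
# on Railiance01, so the node itself must be able to resolve cluster DNS names.
ssh -N -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239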

No change to .mcp.json needed — subprocess still calls http://127.0.0.1:8000.

Alternatively, if the port-forward binds a different local port, set the MCP server's API_BASE env var to that address. Either approach is valid; document the chosen one.

Done when: claude /mcp shows state-hub connected; get_state_summary() returns live cluster data.
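For the tunnel targets, it may help to generate the forward command from parameters so every workstation uses the same options. This is a sketch under assumptions: `tunnel_cmd` is a hypothetical helper, the defaults mirror the values in this task, and the two `-o` settings are standard OpenSSH client options rather than something the existing Makefile is known to use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the ssh port-forward command for the State Hub.
# Usage: tunnel_cmd [LOCAL_PORT] [SSH_TARGET] [SERVICE_HOST]
tunnel_cmd() {
  port=${1:-8000}
  target=${2:-tegwick@92.205.62.239}
  svc=${3:-state-hub.custodian.svc.cluster.local}
  # -N: no remote shell; ExitOnForwardFailure: fail fast if the local port is
  # taken; ServerAliveInterval: notice a dead link so a wrapper can reconnect.
  printf 'ssh -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -L 127.0.0.1:%s:%s:8000 %s\n' \
    "$port" "$svc" "$target"
}

# Example: print the command for review, then run it
#   tunnel_cmd
#   sh -c "$(tunnel_cmd)"
```

Binding explicitly to 127.0.0.1 keeps the forwarded port off other network interfaces, in line with the "not publicly exposed" property.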

T08 — Stabilisation period (2 weeks minimum)

id: T08
status: todo
priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"

Run the cluster state hub as the primary for two weeks before retiring WSL2:

Keep WSL2 state hub running (but frozen — no writes) as DR fallback
Monitor cluster pod restarts, storage health, backup cron
Run get_state_summary() at the start of each session; confirm data is live
Test failover: kill the FastAPI pod; verify it restarts and responds within 60s

Done when: two weeks elapsed with no data loss events; all backup drills passed.
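The 60-second failover check can be scripted as a deadline loop around the health probe. This is a minimal sketch: `wait_until` is a hypothetical helper, and the pod label in the usage comment is an assumption about the chart.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retries COMMAND once per second until it succeeds
# or TIMEOUT_SECONDS have elapsed.
# Usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
wait_until() {
  timeout_s=$1
  shift
  deadline=$(( $(date +%s) + timeout_s ))
  while :; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# Example failover drill (label app=state-hub is an assumption):
#   kubectl -n custodian delete pod -l app=state-hub
#   wait_until 60 curl -fsS http://127.0.0.1:8000/state/health \
#     && echo "recovered within 60s" || echo "FAILED: no response after 60s"
```

The probe goes through the SSH port-forward, so the drill tests the same path a Claude Code session would use.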

T09 — Retire WSL2 instance

id: T09
status: todo
priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"

Once T08 stabilisation passes:

Take a final WSL2 backup (archive, keep indefinitely)
Stop the WSL2 Docker container: make -C ~/the-custodian/state-hub clean
Update the global and project CLAUDE.md files to remove the WSL2 state hub start instructions
Update MEMORY.md — state hub is now cluster-hosted
Record a decision in the state hub: "State Hub WSL2 instance retired"

Done when: WSL2 state hub no longer running; documentation updated.

References

Constitution constraint: irreversible actions require human approval — T05 (data migration) and T09 (WSL2 retirement) require explicit sign-off
OAS layer: S3 Platform Services (railiance-platform)
DR dependency: Longhorn storage (railiance-cluster WP to be linked)
Extension point: EP-RAIL-005 (full-stack backup) — state hub must implement make backup / make restore standard interface before T06
Domain goal: 6f96c712-60e6-4ea9-ab06-168878eafbce (Three-Phoenix Secure Kubernetes Infrastructure)
