the-custodian/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md

---
id: CUST-WP-0011
type: workplan
title: "Migrate Custodian State Hub to ThreePhoenix Cluster"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
created: "2026-03-11"
updated: "2026-03-11"
state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
---

# Migrate Custodian State Hub to ThreePhoenix Cluster

## Goal

Move the Custodian State Hub (FastAPI + PostgreSQL) from its current home on
the WSL2 operator workstation to the ThreePhoenix Kubernetes cluster
(Railiance01/02/03), making it available to Claude Code sessions running on
any machine with cluster access — without public internet exposure.

The State Hub is **irreplaceable episodic memory**. This migration must be
executed with zero tolerance for data loss and a tested rollback path at
every stage.

## Pre-conditions (gate — do not start until all satisfied)

- [ ] ThreePhoenix cluster has three healthy nodes (Railiance01 confirmed, Railiance02 + Railiance03 joined)
- [ ] Longhorn distributed storage installed and verified (replication factor ≥ 2)
- [ ] HA failover test passes (`tests/test_ha_failover.sh` exits 0 on the cluster)
- [ ] S2 integrated backup operational and tested on the cluster
- [ ] A full WSL2 State Hub backup has been taken and restore-drilled **within 24h of starting this workplan**

These gates are mandatory. A single-node cluster or unverified storage is not
an acceptable migration target for the Custodian.
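
The node-count and backup-age gates can be checked mechanically. The following is a minimal sketch, assuming bash and GNU `stat` on the operator workstation; `count_ready_nodes` and `backup_is_fresh` are hypothetical helper names that do not exist in the repository, and the backup path in the usage comment is the one used in T01.

```bash
#!/usr/bin/env bash
# Hypothetical gate helpers; neither function exists in the repository.

# Count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Succeed only if FILE exists and was modified within MAX_AGE seconds
# (default 86400 = 24h). Uses GNU stat; -L follows a "latest" symlink.
backup_is_fresh() {
  local file="$1"
  local max_age="${2:-86400}"
  local now mtime
  [ -f "$file" ] || return 1
  now=$(date +%s)
  mtime=$(stat -L -c %Y "$file") || return 1
  [ $(( now - mtime )) -le "$max_age" ]
}

# Usage (not run here):
#   [ "$(kubectl get nodes --no-headers | count_ready_nodes)" -ge 3 ]
#   backup_is_fresh /opt/backup/custodian/state-hub-latest.sql.gz.age
```

A node reported as `Ready,SchedulingDisabled` is deliberately not counted, which errs on the side of blocking the migration. File age only shows that a backup was taken recently; the restore drill itself still has to be run and logged by hand.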

## Architecture after migration

```
COULOMBCORE / operator workstation (WSL2)
  └─ Claude Code
       └─ MCP server subprocess (Python, local clone of the-custodian)
            └─ HTTP → ssh -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
                          └─ Railiance01 k3s
                               └─ state-hub ClusterIP service
                                    ├─ FastAPI pod (1–2 replicas)
                                    └─ PostgreSQL PVC (Longhorn, 2-way replicated)
```

Key properties:
- **Not publicly exposed** — ClusterIP only; access via SSH port-forward
- **Replicated storage** — Longhorn replicates the PG data volume across nodes
- **WSL2 instance retained as DR fallback** during the stabilisation period
- **MCP config unchanged** — subprocess still calls `http://127.0.0.1:8000`;
  the SSH port-forward provides the binding

## Backup and disaster recovery contract

Before and during migration, the following must hold at all times:

| Asset | Backup mechanism | RPO | Tested? |
|---|---|---|---|
| State Hub PostgreSQL DB | `make backup` (pg_dump → age-encrypted, Nextcloud offsite) | Daily | Must be drilled before T03 |
| State Hub DB on cluster | Longhorn snapshot + age-encrypted copy to `/opt/backup/` | Daily | Must be drilled in T06, before the T07 cutover |
| WSL2 instance | Remains live during stabilisation period | — | Running |

**Rollback rule:** at any task boundary, if something is wrong, revert to
WSL2. No task should leave the system in a state where both WSL2 and cluster
are broken.

---

## Tasks

### T01 — Drill WSL2 backup restore end-to-end

```task
id: T01
status: todo
priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
```

Before touching anything, prove the current backup can actually be restored:

```bash
# In the-custodian/state-hub/
make backup                         # take fresh backup
# Spin up a test postgres container with an empty state_hub database
docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test \
  -e POSTGRES_DB=state_hub -p 5433:5432 postgres:16
export PGPASSWORD=test              # matches POSTGRES_PASSWORD above
# Wait until the container accepts TCP connections
until pg_isready -h 127.0.0.1 -p 5433 -q; do sleep 1; done
# Decrypt and restore
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
  gunzip | psql -h 127.0.0.1 -p 5433 -U postgres state_hub
# Spot-check: count topics
psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT COUNT(*) FROM topics;" state_hub
docker rm -f pg-restore-test
```

**Done when:** restore completes, topic count matches production, drill logged
in `memory/episodic/`.

---

### T02 — Helm chart for State Hub (new: railiance-platform)

```task
id: T02
status: todo
priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
```

Create `helm/state-hub/` in `railiance-platform` (S3 layer owns platform
services). The chart must deploy:

- **FastAPI deployment** — image built from `the-custodian/state-hub/`,
  1 replica initially (scale to 2 after T06)
- **PostgreSQL StatefulSet** — single instance backed by a Longhorn PVC
  (minimum 5 Gi); database-level HA is not required here, because Longhorn
  volume replication lets the pod restart on another node with its data intact
- **ClusterIP service** `state-hub` on port 8000
- **ConfigMap** for non-secret config (DB URL template, log level)
- **Secret** for DB credentials (SOPS-encrypted values file)
- **Liveness/readiness probe** — `GET /state/health`

Values:
```yaml
image:
  repository: gitea.local/custodian/state-hub
  tag: latest
postgres:
  storageClass: longhorn
  size: 5Gi
replicaCount: 1
```

**Done when:** `helm lint` passes; chart committed in railiance-platform.

---

### T03 — Build and push State Hub container image

```task
id: T03
status: todo
priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
```

Add `state-hub/Dockerfile` to the-custodian:

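```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen --no-dev
COPY api/ ./api/
COPY mcp_server/ ./mcp_server/
# T04 runs Alembic inside the pod, so the migrations must be in the image.
# Assumed paths: adjust to where alembic.ini and the migrations directory live.
COPY alembic.ini ./
COPY alembic/ ./alembic/
CMD ["uv", "run", "uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
```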

Build and push to the cluster-local Gitea registry:

```bash
docker build -t gitea.local/custodian/state-hub:latest .
docker push gitea.local/custodian/state-hub:latest
```

**Done when:** image can be pulled from the Gitea registry; `helm install --dry-run`
renders the Deployment with the expected image reference (a dry run does not
pull the image).

---

### T04 — Deploy to cluster and run Alembic migrations

```task
id: T04
status: todo
priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
```

```bash
# From operator workstation via SSH port-forward to k3s API
helm install state-hub ./helm/state-hub/ \
  -n custodian --create-namespace \
  -f helm/state-hub/values-production.yaml

# Wait for pods
kubectl -n custodian rollout status deployment/state-hub

# Run migrations inside the pod
kubectl -n custodian exec -it deploy/state-hub -- \
  uv run alembic upgrade head
```

**Done when:** pod Running, `/state/health` returns 200, `alembic current`
run inside the pod reports the head revision.

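The Alembic part of the done-criterion can be checked without reading the output by eye. This is a minimal sketch, assuming `alembic current` marks the head revision with the literal text `(head)` on stdout; `is_at_head` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if `alembic current` output (read from stdin)
# marks the current revision as "(head)".
is_at_head() {
  grep -q '(head)'
}

# Usage (not run here; exec is called without -it so the output can be piped):
#   kubectl -n custodian exec deploy/state-hub -- uv run alembic current \
#     | is_at_head && echo "schema is at head"
```
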
---

### T05 — Migrate data from WSL2 to cluster

```task
id: T05
status: todo
priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
```

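This step loads production data into the cluster database. WSL2 stays
authoritative until T07, so any write made to WSL2 after the final backup
below will be missing from the cluster copy. Execute with care: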

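```bash
# 1. Take final WSL2 backup, then decrypt it to a plain SQL file
make -C ~/the-custodian/state-hub backup
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
  gunzip > /tmp/state-hub-migration.sql

# 2. Copy dump into the cluster postgres pod
#    (kubectl cp needs a bare pod name, not the "pod/<name>" form from -o name)
PG_POD=$(kubectl -n custodian get pod -l app=state-hub-postgres \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n custodian cp /tmp/state-hub-migration.sql "$PG_POD":/tmp/

# 3. Restore. T04 already created the schema via Alembic, so a full dump will
#    report "already exists" errors; review every error in the psql output,
#    because a failed COPY means that table's data was not loaded.
kubectl -n custodian exec -it "$PG_POD" -- \
  psql -U postgres -d state_hub -f /tmp/state-hub-migration.sql

# 4. Spot-check counts match WSL2 (run in the postgres pod, which has psql;
#    n_live_tup is an estimate, so refresh statistics with ANALYZE first)
kubectl -n custodian exec -it "$PG_POD" -- \
  psql -U postgres -d state_hub -c "ANALYZE;" \
  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
```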

**Rollback:** if counts differ, delete cluster DB data, re-run from T04.
WSL2 is still live and unchanged.

**Done when:** all table row counts match the WSL2 instance.
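
Because `n_live_tup` is a statistics estimate, the done-criterion is better served by exact counts. The following is a minimal sketch, assuming all application tables live in the `public` schema and that PostgreSQL was built with XML support (needed by `query_to_xml`); `COUNT_SQL`, `compare_counts`, and the `/tmp/*.tsv` file names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical verification helpers; names and file layout are assumptions.

# Exact per-table row counts for the public schema, one "table<TAB>count" per
# line when run with: psql -At -F "$(printf '\t')" -c "$COUNT_SQL"
COUNT_SQL=$(cat <<'SQL'
SELECT table_name,
       (xpath('/row/c/text()',
              query_to_xml(format('SELECT count(*) AS c FROM %I.%I',
                                  table_schema, table_name),
                           false, true, '')))[1]::text::bigint
FROM information_schema.tables
WHERE table_schema = 'public' AND table_type = 'BASE TABLE'
ORDER BY table_name;
SQL
)

# Compare two "table<TAB>count" files. Prints one MISMATCH line per table that
# differs or exists on only one side; returns non-zero if any were found.
compare_counts() {
  awk -F '\t' '
    FILENAME == ARGV[1] { src[$1] = $2; next }   # first file: source counts
    { dst[$1] = $2 }                             # second file: target counts
    END {
      for (t in src)
        if (!(t in dst))           { print "MISMATCH " t " source=" src[t] " target=missing"; bad = 1 }
        else if (src[t] != dst[t]) { print "MISMATCH " t " source=" src[t] " target=" dst[t]; bad = 1 }
      for (t in dst)
        if (!(t in src))           { print "MISMATCH " t " source=missing target=" dst[t]; bad = 1 }
      exit bad + 0
    }' "$1" "$2"
}

# Usage (not run here):
#   psql -At -F "$(printf '\t')" <WSL2 connection flags> -d state_hub \
#     -c "$COUNT_SQL" > /tmp/wsl2.tsv
#   kubectl -n custodian exec <postgres-pod> -- psql -At -F "$(printf '\t')" \
#     -U postgres -d state_hub -c "$COUNT_SQL" > /tmp/cluster.tsv
#   compare_counts /tmp/wsl2.tsv /tmp/cluster.tsv
```

Matching counts show that no rows were dropped or duplicated, but they do not prove the row contents are identical; a checksum per table would be the next step if that assurance is needed.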

---

### T06 — Drill cluster backup restore

```task
id: T06
status: todo
priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
```

Before cutting over, prove the cluster backup can be restored:

```bash
# Trigger a backup via the cluster cron (or manually)
kubectl -n custodian create job --from=cronjob/state-hub-backup backup-drill-01

# Verify output in /opt/backup/ on the node holding the PVC
# Decrypt and restore to a test namespace
kubectl create ns restore-test
# ... restore steps similar to T01 but against cluster postgres
```

**Done when:** restore drill passes; drill logged.

---

### T07 — Cutover: redirect MCP config to cluster

```task
id: T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
```

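Point every operator workstation (WSL2, COULOMBCORE) at the cluster state hub
by replacing the local API process with an SSH port-forward.

The MCP server subprocess still runs locally (Python, same `server.py`).
Only the API endpoint it reaches changes, via a persistent port-forward:

```bash
# On operator workstation: keep this running (add to tunnel-daemon or tunnel-loop).
# -N skips the remote shell; ExitOnForwardFailure exits if port 8000 is taken.
# The hostname is resolved on the remote node, which must resolve cluster DNS.
ssh -N -o ExitOnForwardFailure=yes \
  -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
```

If the node cannot resolve cluster DNS names, forward to the service's
ClusterIP instead. Local port 8000 must be free, so the local state hub cannot
stay bound to it while the tunnel is up.

No change to `.mcp.json` needed: the subprocess still calls `http://127.0.0.1:8000`.

Alternatively, set the MCP server's `API_BASE` env var explicitly to the
port-forward address. Either approach is valid; document the chosen one.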

**Done when:** `claude /mcp` shows `state-hub` connected; `get_state_summary()`
returns live cluster data.
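
A persistent forward needs something to restart `ssh` when the connection drops. This is a minimal sketch of such a supervisor, assuming bash and OpenSSH; it is not the existing tunnel-daemon or tunnel-loop, whose contents are not shown in this workplan, and `forward_spec` and `run_tunnel` are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical tunnel supervisor; not the existing tunnel-daemon/tunnel-loop.

# Print the -L forward spec, binding the local side to loopback only.
forward_spec() {
  local local_port="${1:-8000}"
  local target="${2:-state-hub.custodian.svc.cluster.local:8000}"
  printf '127.0.0.1:%s:%s\n' "$local_port" "$target"
}

# Re-establish the forward whenever ssh exits (network drop, node reboot).
run_tunnel() {
  local host="${1:-tegwick@92.205.62.239}"
  local spec
  spec=$(forward_spec "${2:-8000}")
  while true; do
    # ServerAlive* makes ssh notice a dead connection instead of hanging.
    ssh -N -o ExitOnForwardFailure=yes \
        -o ServerAliveInterval=30 -o ServerAliveCountMax=3 \
        -L "$spec" "$host"
    echo "tunnel exited with status $?; retrying in 5s" >&2
    sleep 5
  done
}

# Usage (not run here): run_tunnel &
```

Binding explicitly to `127.0.0.1` keeps the forwarded port off the workstation's other interfaces, in line with the requirement that the State Hub is not publicly exposed.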

---

### T08 — Stabilisation period (2 weeks minimum)

```task
id: T08
status: todo
priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
```

Run the cluster state hub as the primary for two weeks before retiring WSL2:

- Keep WSL2 state hub running (but frozen — no writes) as DR fallback
- Monitor cluster pod restarts, storage health, backup cron
- Run `get_state_summary()` at the start of each session; confirm data is live
- Test failover: kill the FastAPI pod; verify it restarts and responds within 60s

**Done when:** two weeks elapsed with no data loss events; all backup drills
passed.
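
The failover test in the list above can be scripted so that each run is repeatable. This is a minimal sketch, assuming `curl` is installed, the port-forward from T07 is up, and the FastAPI pods carry the label `app=state-hub` (an assumption about the chart); `wait_until` and `failover_drill` are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical drill helpers; the label app=state-hub is an assumption about
# how the chart labels the FastAPI pods.

# Retry a command once per second until it succeeds or TIMEOUT seconds pass.
wait_until() {
  local timeout="$1"; shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 1
  done
  return 0
}

# Delete the FastAPI pod(s), then wait for /state/health to answer through
# the local port-forward. Prints the total time measured from the kill.
failover_drill() {
  local start
  start=$(date +%s)
  # kubectl delete blocks until the old pod is gone.
  kubectl -n custodian delete pod -l app=state-hub || return 1
  wait_until 60 curl -fsS http://127.0.0.1:8000/state/health || return 1
  echo "recovered in $(( $(date +%s) - start ))s"
}

# Usage (not run here): failover_drill
```

The printed duration includes the time the old pod takes to terminate, so compare it against the 60s target with that in mind.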

---

### T09 — Retire WSL2 instance

```task
id: T09
status: todo
priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
```

Once T08 stabilisation passes:

1. Take a final WSL2 backup (archive, keep indefinitely)
2. Stop the WSL2 Docker container: `make -C ~/the-custodian/state-hub clean`
3. Update the global and project `CLAUDE.md` files to remove the WSL2 state hub start instructions
4. Update `MEMORY.md` to record that the state hub is now cluster-hosted
5. Record a decision in the state hub: "State Hub WSL2 instance retired"

**Done when:** WSL2 state hub no longer running; documentation updated.

---

## References

- Constitution constraint: irreversible actions require human approval — T05
  (data migration) and T09 (WSL2 retirement) require explicit sign-off
- OAS layer: S3 Platform Services (railiance-platform)
- DR dependency: Longhorn storage (railiance-cluster WP to be linked)
- Extension point: EP-RAIL-005 (full-stack backup) — state hub must implement
  `make backup` / `make restore` standard interface before T06
- Domain goal: `6f96c712-60e6-4ea9-ab06-168878eafbce` (Three-Phoenix Secure
  Kubernetes Infrastructure)