the-custodian/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
tegwick 890b2f9fc7 feat(ops+workplans): fix tunnel targets, plan custodian migration, close legacy ADR-001 gaps
Tunnel (state-hub/Makefile):
- Replace interactive `make tunnel` (now non-blocking with -N flag)
- Add tunnel-daemon (autossh background), tunnel-loop (reconnect fallback),
  tunnel-status, tunnel-stop
- Default COULOMBCORE=tegwick@92.205.130.254; TUNNEL_PORT configurable
- Clarified server topology: COULOMBCORE=92.205.130.254 (old),
  Railiance01=92.205.62.239 (ThreePhoenix node 1)

Workplans:
- CUST-WP-0011: Migrate Custodian State Hub to ThreePhoenix cluster —
  9-task plan with hard pre-condition gates (3-node cluster, Longhorn HA,
  backup drill), data migration, 2-week stabilisation, WSL2 retirement
- CUST-WP-0000: Retroactive record for state-hub v0.1 (pre-ADR-001)
- CUST-WP-0000b: Retroactive record for state-hub v0.2 (pre-ADR-001)

Consistency: repo now ✓ PASS (0 fail, 18 warn — all pre-ADR-001 C-12 history)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-11 01:09:07 +01:00

---
id: CUST-WP-0011
type: workplan
title: "Migrate Custodian State Hub to ThreePhoenix Cluster"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
created: "2026-03-11"
updated: "2026-03-11"
state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
---
# Migrate Custodian State Hub to ThreePhoenix Cluster
## Goal
Move the Custodian State Hub (FastAPI + PostgreSQL) from its current home on
the WSL2 operator workstation to the ThreePhoenix Kubernetes cluster
(Railiance01/02/03), making it available to Claude Code sessions running on
any machine with cluster access, without public internet exposure.

The State Hub is **irreplaceable episodic memory**. This migration must be
executed with zero tolerance for data loss and a tested rollback path at
every stage.
## Pre-conditions (gate — do not start until all satisfied)
- [ ] ThreePhoenix cluster has three healthy nodes (Railiance01 confirmed, Railiance02 + Railiance03 joined)
- [ ] Longhorn distributed storage installed and verified (replication factor ≥ 2)
- [ ] HA failover test passes (`tests/test_ha_failover.sh` exits 0 on the cluster)
- [ ] S2 integrated backup operational and tested on the cluster
- [ ] A full WSL2 State Hub backup has been taken and restore-drilled **within the 24 hours before this workplan starts**

These gates are mandatory. A single-node cluster or unverified storage is not
an acceptable migration target for the Custodian.
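
The 24-hour backup gate can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the time of the last successful restore drill is recorded somewhere as a timezone-aware timestamp; the function name `backup_gate_ok` and the constant `MAX_BACKUP_AGE` are illustrative and not part of the State Hub codebase.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the gate is "drill completed no more than 24 hours before start".
MAX_BACKUP_AGE = timedelta(hours=24)


def backup_gate_ok(drilled_at: datetime, now: datetime,
                   max_age: timedelta = MAX_BACKUP_AGE) -> bool:
    """Return True if the restore drill finished within max_age before now."""
    if drilled_at.tzinfo is None or now.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    age = now - drilled_at
    # A drill timestamp in the future is treated as invalid, not as fresh.
    return timedelta(0) <= age <= max_age


if __name__ == "__main__":
    start = datetime.now(timezone.utc)
    print(backup_gate_ok(start - timedelta(hours=3), start))
```

Rejecting naive datetimes avoids a silent off-by-hours error when the drill is logged in local time and the check runs in UTC.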
## Architecture after migration
```
COULOMBCORE / operator workstation (WSL2)
└─ Claude Code
   └─ MCP server subprocess (Python, local clone of the-custodian)
      └─ HTTP → ssh -L 8000:state-hub-svc:8000 tegwick@92.205.62.239
         └─ Railiance01 k3s
            └─ state-hub ClusterIP service
               ├─ FastAPI pod (1-2 replicas)
               └─ PostgreSQL PVC (Longhorn, 2-way replicated)
```
Key properties:
- **Not publicly exposed** — ClusterIP only; access via SSH port-forward
- **Replicated storage** — Longhorn replicates the PG data volume across nodes
- **WSL2 instance retained as DR fallback** during the stabilisation period
- **MCP config unchanged** — subprocess still calls `http://127.0.0.1:8000`;
the SSH port-forward provides the binding
## Backup and disaster recovery contract
Before and during migration, the following must hold at all times:
| Asset | Backup mechanism | RPO | Tested? |
|---|---|---|---|
| State Hub PostgreSQL DB | `make backup` (pg_dump → age-encrypted, Nextcloud offsite) | Daily | Must be drilled before T03 |
| State Hub DB on cluster | Longhorn snapshot + age-encrypted copy to `/opt/backup/` | Daily | Must be drilled before T06 |
| WSL2 instance | Remains live during stabilisation period | — | Running |
**Rollback rule:** at any task boundary, if something is wrong, revert to
WSL2. No task should leave the system in a state where both WSL2 and cluster
are broken.

---
## Tasks
### T01 — Drill WSL2 backup restore end-to-end
```task
id: T01
status: todo
priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
```
Before touching anything, prove the current backup can actually be restored:
```bash
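# In the-custodian/state-hub/
make backup   # take a fresh backup
# Spin up a throwaway postgres container
docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test \
  -p 5433:5432 postgres:16
export PGPASSWORD=test   # matches POSTGRES_PASSWORD above; avoids password prompts
# Wait until the server accepts connections, then create the target database
# (skip createdb if the dump itself creates state_hub)
until pg_isready -h 127.0.0.1 -p 5433 -q; do sleep 1; done
createdb -h 127.0.0.1 -p 5433 -U postgres state_hub
# Decrypt and restore
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
  gunzip | psql -h 127.0.0.1 -p 5433 -U postgres state_hub
# Spot-check: count topics
psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT COUNT(*) FROM topics;" state_hub
docker rm -f pg-restore-test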
```
**Done when:** restore completes, topic count matches production, drill logged
in `memory/episodic/`.

---
### T02 — Helm chart for State Hub (new: railiance-platform)
```task
id: T02
status: todo
priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
```
Create `helm/state-hub/` in `railiance-platform` (S3 layer owns platform
services). The chart must deploy:
- **FastAPI deployment** — image built from `the-custodian/state-hub/`,
1 replica initially (scale to 2 after T06)
- **PostgreSQL StatefulSet** — single instance backed by a Longhorn PVC
(minimum 5 Gi); HA not required here — Longhorn replication IS the HA
- **ClusterIP service** `state-hub` on port 8000
- **ConfigMap** for non-secret config (DB URL template, log level)
- **Secret** for DB credentials (SOPS-encrypted values file)
- **Liveness/readiness probe** — `GET /state/health`
Values:
```yaml
image:
  repository: gitea.local/custodian/state-hub
  tag: latest
postgres:
  storageClass: longhorn
  size: 5Gi
replicaCount: 1
```
**Done when:** `helm lint` passes; chart committed in railiance-platform.

---
### T03 — Build and push State Hub container image
```task
id: T03
status: todo
priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
```
Add `state-hub/Dockerfile` to the-custodian:
```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen --no-dev
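COPY api/ ./api/
COPY mcp_server/ ./mcp_server/
# NOTE: T04 runs `uv run alembic upgrade head` in this image, so the Alembic config
# and migration scripts must be copied in as well (their paths are not listed here).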
CMD ["uv", "run", "uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Build and push to the cluster-local Gitea registry:
```bash
docker build -t gitea.local/custodian/state-hub:latest .
docker push gitea.local/custodian/state-hub:latest
```
**Done when:** image can be pulled from the Gitea registry; `helm install --dry-run`
renders the chart with that image reference (a dry run does not pull the image).

---
### T04 — Deploy to cluster and run Alembic migrations
```task
id: T04
status: todo
priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
```
```bash
# From operator workstation via SSH port-forward to k3s API
helm install state-hub ./helm/state-hub/ \
-n custodian --create-namespace \
-f helm/state-hub/values-production.yaml
# Wait for pods
kubectl -n custodian rollout status deployment/state-hub
# Run migrations inside the pod
kubectl -n custodian exec -it deploy/state-hub -- \
uv run alembic upgrade head
```
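
The health criterion for this task can be checked from the workstation once a port-forward to the service is up. The following is a minimal sketch using only the Python standard library. The `/state/health` path comes from the probe defined in T02; the function name `state_hub_healthy` and the default base URL are assumptions for illustration.

```python
import urllib.error
import urllib.request


def state_hub_healthy(base_url: str = "http://127.0.0.1:8000",
                      timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/state/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/state/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and 4xx/5xx answers (HTTPError).
        return False
```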
**Done when:** pod Running, `/state/health` returns 200, Alembic reports
"head" from inside the pod.

---
### T05 — Migrate data from WSL2 to cluster
```task
id: T05
status: todo
priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
```
This is the point of no return for the DB — execute with care:
```bash
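# 1. Take final WSL2 backup
make -C ~/the-custodian/state-hub backup
# 2. Decrypt it to a plain SQL dump (same artifact path as in T01)
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
  gunzip > /tmp/state-hub-migration.sql
# 3. Copy dump into the cluster postgres pod (kubectl cp needs the bare pod name)
PG_POD=$(kubectl -n custodian get pod -l app=state-hub-postgres \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n custodian cp /tmp/state-hub-migration.sql "$PG_POD":/tmp/state-hub-migration.sql
# 4. Restore
kubectl -n custodian exec -i "$PG_POD" -- \
  psql -U postgres -d state_hub -f /tmp/state-hub-migration.sql
# 5. Spot-check counts against WSL2. Run in the postgres pod (the API image has no psql).
#    n_live_tup is an estimate; use SELECT COUNT(*) per table for an exact comparison.
kubectl -n custodian exec -i "$PG_POD" -- \
  psql -U postgres -d state_hub -c "ANALYZE;" \
  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"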
```
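
The spot-check above reads `n_live_tup`, which PostgreSQL maintains as an estimate, so the final comparison is safer on exact `SELECT COUNT(*)` results per table. The following is a minimal sketch of that comparison, assuming the exact counts from WSL2 and from the cluster have already been collected into two dictionaries keyed by table name. How they are collected is not shown, and `diff_row_counts` is an illustrative name.

```python
from typing import Dict, Optional, Tuple


def diff_row_counts(
    source: Dict[str, int], target: Dict[str, int]
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {table: (source_count, target_count)} for every mismatch.

    A table missing on one side is reported with None for that side.
    An empty result means every table matches.
    """
    mismatches = {}
    for table in sorted(set(source) | set(target)):
        src_count = source.get(table)
        dst_count = target.get(table)
        if src_count != dst_count:
            mismatches[table] = (src_count, dst_count)
    return mismatches
```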
**Rollback:** if counts differ, delete cluster DB data, re-run from T04.
WSL2 is still live and unchanged.

**Done when:** all table row counts match the WSL2 instance.

---
### T06 — Drill cluster backup restore
```task
id: T06
status: todo
priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
```
Before cutting over, prove the cluster backup can be restored:
```bash
# Trigger a backup via the cluster cron (or manually)
kubectl -n custodian create job --from=cronjob/state-hub-backup backup-drill-01
# Verify output in /opt/backup/ on the node holding the PVC
# Decrypt and restore to a test namespace
kubectl create ns restore-test
# ... restore steps similar to T01 but against cluster postgres
```
**Done when:** restore drill passes; drill logged.

---
### T07 — Cutover: redirect MCP config to cluster
```task
id: T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
```
Point the MCP setup on every operator workstation (WSL2, COULOMBCORE) at the
cluster state hub via SSH port-forward instead of the local process.

The MCP server subprocess still runs locally (Python, same `server.py`).
Only the API endpoint it calls changes, via a persistent port-forward:
```bash
# On operator workstation: keep this running (add to tunnel-daemon or tunnel-loop).
# -N forwards only; the service hostname is resolved on Railiance01, not locally.
ssh -N -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
```
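No change to `.mcp.json` is needed while the port-forward binds local port 8000:
the subprocess still calls `http://127.0.0.1:8000`.

Alternatively, bind the forward to a different local port and point the MCP
server's `API_BASE` env var at it. This is the option to use if the local WSL2
state hub still occupies port 8000 on that workstation during T08. Either
approach is valid; document the chosen one.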
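
If the `API_BASE` route is chosen, the lookup can stay backwards compatible by falling back to the current address. The following is a minimal sketch, assuming `API_BASE` holds a full base URL. How `server.py` actually reads the variable is not shown in this workplan, and `resolve_api_base` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Current behaviour described above: the subprocess calls this address.
DEFAULT_API_BASE = "http://127.0.0.1:8000"


def resolve_api_base(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the State Hub base URL, preferring the API_BASE env var."""
    env = os.environ if env is None else env
    # Strip a trailing slash so callers can append paths such as /state/health.
    return env.get("API_BASE", DEFAULT_API_BASE).rstrip("/")
```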
**Done when:** `claude /mcp` shows `state-hub` connected; `get_state_summary()`
returns live cluster data.

---
### T08 — Stabilisation period (2 weeks minimum)
```task
id: T08
status: todo
priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
```
Run the cluster state hub as the primary for two weeks before retiring WSL2:
- Keep WSL2 state hub running (but frozen — no writes) as DR fallback
- Monitor cluster pod restarts, storage health, backup cron
- Run `get_state_summary()` at the start of each session; confirm data is live
- Test failover: kill the FastAPI pod; verify it restarts and responds within 60s
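
The failover check in the list above can be timed rather than eyeballed. The following is a minimal sketch, assuming some callable `probe` that returns True when the hub answers (for example a wrapper around a `GET /state/health` request). The name `wait_until_healthy` and the 2-second polling interval are illustrative choices; the 60-second budget comes from the list above.

```python
import time
from typing import Callable, Optional


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe() until it returns True.

    Returns the seconds elapsed until the first success, or None if the
    probe was still failing once the timeout had passed.
    """
    start = clock()
    deadline = start + timeout
    while True:
        if probe():
            return clock() - start
        if clock() >= deadline:
            return None
        sleep(interval)
```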
**Done when:** two weeks elapsed with no data loss events; all backup drills
passed.

---
### T09 — Retire WSL2 instance
```task
id: T09
status: todo
priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
```
Once T08 stabilisation passes:
1. Take a final WSL2 backup (archive, keep indefinitely)
2. Stop the WSL2 Docker container: `make -C ~/the-custodian/state-hub clean`
3. Update the global and project `CLAUDE.md` files to remove WSL2 state hub start instructions
4. Update MEMORY.md — state hub is now cluster-hosted
5. Record a decision in the state hub: "State Hub WSL2 instance retired"
**Done when:** WSL2 state hub no longer running; documentation updated.

---
## References
- Constitution constraint: irreversible actions require human approval — T05
(data migration) and T09 (WSL2 retirement) require explicit sign-off
- OAS layer: S3 Platform Services (railiance-platform)
- DR dependency: Longhorn storage (railiance-cluster WP to be linked)
- Extension point: EP-RAIL-005 (full-stack backup) — state hub must implement
`make backup` / `make restore` standard interface before T06
- Domain goal: `6f96c712-60e6-4ea9-ab06-168878eafbce` (Three-Phoenix Secure
Kubernetes Infrastructure)