Tunnel (state-hub/Makefile):
- Replace interactive `make tunnel` (now non-blocking with -N flag)
- Add tunnel-daemon (autossh background), tunnel-loop (reconnect fallback), tunnel-status, tunnel-stop
- Default COULOMBCORE=tegwick@92.205.130.254; TUNNEL_PORT configurable
- Clarified server topology: COULOMBCORE=92.205.130.254 (old), Railiance01=92.205.62.239 (ThreePhoenix node 1)

Workplans:
- CUST-WP-0011: Migrate Custodian State Hub to ThreePhoenix cluster — 9-task plan with hard pre-condition gates (3-node cluster, Longhorn HA, backup drill), data migration, 2-week stabilisation, WSL2 retirement
- CUST-WP-0000: Retroactive record for state-hub v0.1 (pre-ADR-001)
- CUST-WP-0000b: Retroactive record for state-hub v0.2 (pre-ADR-001)

Consistency: repo now ✓ PASS (0 fail, 18 warn — all pre-ADR-001 C-12 history)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---
id: CUST-WP-0011
type: workplan
title: "Migrate Custodian State Hub to ThreePhoenix Cluster"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
created: "2026-03-11"
updated: "2026-03-11"
state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
---

# Migrate Custodian State Hub to ThreePhoenix Cluster

## Goal

Move the Custodian State Hub (FastAPI + PostgreSQL) from its current home on
the WSL2 operator workstation to the ThreePhoenix Kubernetes cluster
(Railiance01/02/03), making it available to Claude Code sessions running on
any machine with cluster access — without public internet exposure.

The State Hub is **irreplaceable episodic memory**. This migration must be
executed with zero tolerance for data loss and a tested rollback path at
every stage.

## Pre-conditions (gate — do not start until all satisfied)

- [ ] ThreePhoenix cluster has three healthy nodes (Railiance01 confirmed, Railiance02 + Railiance03 joined)
- [ ] Longhorn distributed storage installed and verified (replication factor ≥ 2)
- [ ] HA failover test passes (`tests/test_ha_failover.sh` exits 0 on the cluster)
- [ ] S2 integrated backup operational and tested on the cluster
- [ ] A full WSL2 State Hub backup has been taken and restore-drilled **within 24h of starting this workplan**

These gates are mandatory. A single-node cluster or unverified storage is not
an acceptable migration target for the Custodian.

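The first gate (three healthy nodes) can be checked mechanically. The following is a minimal sketch, not part of the existing tooling: the function names and the `REQUIRED_NODES` variable are hypothetical, and it assumes the standard column layout of `kubectl get nodes --no-headers` (name, status, roles, age, version).

```bash
# Count nodes whose STATUS column is exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# "NotReady" and "Ready,SchedulingDisabled" are deliberately not counted.
count_ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Succeed only if at least REQUIRED_NODES nodes are Ready
# (REQUIRED_NODES is a hypothetical variable, default 3).
cluster_gate() {
  required="${REQUIRED_NODES:-3}"
  ready="$(count_ready_nodes)"
  if [ "$ready" -ge "$required" ]; then
    echo "PASS: $ready/$required nodes Ready"
  else
    echo "FAIL: only $ready/$required nodes Ready"
    return 1
  fi
}

# Usage (on a machine with cluster access):
#   kubectl get nodes --no-headers | cluster_gate
```
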
## Architecture after migration

```
COULOMBCORE / operator workstation (WSL2)
  └─ Claude Code
       └─ MCP server subprocess (Python, local clone of the-custodian)
            └─ HTTP → ssh -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
                 └─ Railiance01 k3s
                      └─ state-hub ClusterIP service
                           ├─ FastAPI pod (1–2 replicas)
                           └─ PostgreSQL PVC (Longhorn, 2-way replicated)
```

Key properties:
- **Not publicly exposed** — ClusterIP only; access via SSH port-forward
- **Replicated storage** — Longhorn replicates the PG data volume across nodes
- **WSL2 instance retained as DR fallback** during the stabilisation period
- **MCP config unchanged** — subprocess still calls `http://127.0.0.1:8000`;
  the SSH port-forward provides the binding

## Backup and disaster recovery contract

Before and during migration, the following must hold at all times:

| Asset | Backup mechanism | RPO | Tested? |
|---|---|---|---|
| State Hub PostgreSQL DB | `make backup` (pg_dump → age-encrypted, Nextcloud offsite) | Daily | Must be drilled before T03 |
| State Hub DB on cluster | Longhorn snapshot + age-encrypted copy to `/opt/backup/` | Daily | Must be drilled before T06 |
| WSL2 instance | Remains live during stabilisation period | — | Running |

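A daily RPO only holds if the newest backup file really is less than a day old. As a rough sketch of such a check (the function name is hypothetical, and it assumes a `find` that supports `-mmin`, as GNU and BSD `find` do):

```bash
# Succeed if FILE exists and was modified less than MAX_MIN minutes ago.
# MAX_MIN defaults to 1440 (24 hours), matching the daily RPO.
backup_is_fresh() {
  file="$1"
  max_min="${2:-1440}"
  [ -f "$file" ] || return 1
  # find prints the path only when the mtime falls inside the window
  [ -n "$(find "$file" -mmin "-$max_min" 2>/dev/null)" ]
}

# Usage (path taken from the T01 restore drill):
#   backup_is_fresh /opt/backup/custodian/state-hub-latest.sql.gz.age \
#     || echo "backup is stale or missing"
```
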
**Rollback rule:** at any task boundary, if something is wrong, revert to
WSL2. No task should leave the system in a state where both WSL2 and cluster
are broken.

---

## Tasks

### T01 — Drill WSL2 backup restore end-to-end

```task
id: T01
status: todo
priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
```

Before touching anything, prove the current backup can actually be restored:

```bash
# In the-custodian/state-hub/
make backup   # take fresh backup
# Spin up a test postgres container
docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test \
  -p 5433:5432 postgres:16
export PGPASSWORD=test   # matches POSTGRES_PASSWORD above; avoids the password prompt
# A fresh container only has the default `postgres` database, so create the
# target first (assumes the dump was taken without --create; give postgres a
# few seconds to start before running this)
createdb -h 127.0.0.1 -p 5433 -U postgres state_hub
# Decrypt and restore
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
  gunzip | psql -h 127.0.0.1 -p 5433 -U postgres state_hub
# Spot-check: count topics
psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT COUNT(*) FROM topics;" state_hub
docker rm -f pg-restore-test
```

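The spot-check is easier to trust when the comparison is scripted rather than read by eye. A minimal sketch, assuming both counts come from `psql -At` (unaligned, tuples-only output); the function name and the production connection details are hypothetical:

```bash
# Compare two row counts, tolerating surrounding whitespace.
# Returns 0 on match, 1 on mismatch, 2 if either value is empty or non-numeric
# (so a failed query cannot pass silently).
counts_match() {
  prod="$(printf '%s' "$1" | tr -d '[:space:]')"
  restored="$(printf '%s' "$2" | tr -d '[:space:]')"
  case "$prod$restored" in ''|*[!0-9]*) return 2 ;; esac
  [ -n "$prod" ] && [ -n "$restored" ] || return 2
  [ "$prod" -eq "$restored" ]
}

# Usage (production connection flags are an assumption):
#   prod=$(psql -At -c "SELECT COUNT(*) FROM topics;" state_hub)
#   test=$(psql -At -h 127.0.0.1 -p 5433 -U postgres \
#            -c "SELECT COUNT(*) FROM topics;" state_hub)
#   counts_match "$prod" "$test" && echo "restore drill OK"
```
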
**Done when:** restore completes, topic count matches production, drill logged
in `memory/episodic/`.

---

### T02 — Helm chart for State Hub (new: railiance-platform)

```task
id: T02
status: todo
priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
```

Create `helm/state-hub/` in `railiance-platform` (S3 layer owns platform
services). The chart must deploy:

- **FastAPI deployment** — image built from `the-custodian/state-hub/`,
  1 replica initially (scale to 2 after T06)
- **PostgreSQL StatefulSet** — single instance backed by a Longhorn PVC
  (minimum 5 Gi); HA not required here — Longhorn replication IS the HA
- **ClusterIP service** `state-hub` on port 8000
- **ConfigMap** for non-secret config (DB URL template, log level)
- **Secret** for DB credentials (SOPS-encrypted values file)
- **Liveness/readiness probe** — `GET /state/health`

Values:
```yaml
image:
  repository: gitea.local/custodian/state-hub
  tag: latest
postgres:
  storageClass: longhorn
  size: 5Gi
replicaCount: 1
```

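Beyond `helm lint`, it can help to confirm that the rendered chart actually contains each object listed above. This is only a sketch: the function name is hypothetical, and it assumes the Secret is rendered by the chart rather than injected separately by SOPS tooling.

```bash
# Print each expected Kubernetes kind that is absent from the manifest on stdin.
# Prints nothing when all five kinds are present.
missing_kinds() {
  manifest="$(cat)"
  for kind in Deployment StatefulSet Service ConfigMap Secret; do
    # Anchor both ends so "Service" does not match "ServiceAccount"
    printf '%s\n' "$manifest" | grep -q "^kind: $kind\$" || echo "$kind"
  done
}

# Usage (run from the railiance-platform checkout):
#   helm template state-hub ./helm/state-hub/ | missing_kinds
```
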
**Done when:** `helm lint` passes; chart committed in railiance-platform.

---

### T03 — Build and push State Hub container image

```task
id: T03
status: todo
priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
```

Add `state-hub/Dockerfile` to the-custodian:

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen --no-dev
COPY api/ ./api/
COPY mcp_server/ ./mcp_server/
CMD ["uv", "run", "uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Build and push to the cluster-local Gitea registry:

```bash
docker build -t gitea.local/custodian/state-hub:latest .
docker push gitea.local/custodian/state-hub:latest
```

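Because `latest` is a mutable tag, a rollback is easier to reason about if each build also carries an immutable tag. The workplan does not prescribe this; the helper below is a hypothetical sketch that only composes the image reference, with the git short SHA as one possible tag.

```bash
# Compose an image reference from a registry path and a tag.
# Defaults mirror the values used in this workplan.
image_ref() {
  repo="${1:-gitea.local/custodian/state-hub}"
  tag="${2:-latest}"
  printf '%s:%s\n' "$repo" "$tag"
}

# Usage (from the-custodian/state-hub/):
#   ref="$(image_ref gitea.local/custodian/state-hub "$(git rev-parse --short HEAD)")"
#   docker build -t "$ref" . && docker push "$ref"
```
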
**Done when:** image available in Gitea registry; `helm install --dry-run`
renders the chart with this image reference.

---

### T04 — Deploy to cluster and run Alembic migrations

```task
id: T04
status: todo
priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
```

```bash
# From operator workstation via SSH port-forward to k3s API
helm install state-hub ./helm/state-hub/ \
  -n custodian --create-namespace \
  -f helm/state-hub/values-production.yaml

# Wait for pods
kubectl -n custodian rollout status deployment/state-hub

# Run migrations inside the pod
kubectl -n custodian exec -it deploy/state-hub -- \
  uv run alembic upgrade head
```

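The "Alembic reports head" criterion can be checked with `alembic current`, which marks the active revision with `(head)` when the database is fully migrated. A small sketch (the function name is hypothetical):

```bash
# Succeed if `alembic current` output on stdin marks a revision as head.
alembic_at_head() {
  grep -q '(head)'
}

# Usage:
#   kubectl -n custodian exec deploy/state-hub -- uv run alembic current \
#     | alembic_at_head && echo "schema is at head"
```
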
**Done when:** pod Running, `/state/health` returns 200, Alembic reports
"head" from inside the pod.

---

### T05 — Migrate data from WSL2 to cluster

```task
id: T05
status: todo
priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
```

This is the point of no return for the DB — execute with care:

```bash
# 1. Take final WSL2 backup
make -C ~/the-custodian/state-hub backup

# 2. Decrypt the backup to a plain SQL file (assumes the same key and path as in T01)
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
  gunzip > /tmp/state-hub-migration.sql

# 3. Copy dump into the cluster postgres pod
#    (kubectl cp needs a bare pod name, so use jsonpath rather than -o name)
PG_POD=$(kubectl -n custodian get pod -l app=state-hub-postgres \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n custodian cp /tmp/state-hub-migration.sql "$PG_POD":/tmp/

# 4. Restore
kubectl -n custodian exec -it "$PG_POD" -- \
  psql -U postgres -d state_hub -f /tmp/state-hub-migration.sql

# 5. Spot-check counts match WSL2 (run in the postgres pod; n_live_tup is an estimate)
kubectl -n custodian exec -it "$PG_POD" -- \
  psql -U postgres -d state_hub \
  -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
```

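Since `n_live_tup` in `pg_stat_user_tables` is an estimated figure, the final comparison is more reliable if exact counts are saved from both sides and diffed. A hedged sketch of the comparison step only: the function name and the two input files are hypothetical, and each file is assumed to hold one `table|rows` line per table, produced by the same count query on WSL2 and on the cluster.

```bash
# Compare two "table|rows" listings regardless of line order.
# Prints the diff plus MISMATCH and returns 1 when they differ; prints MATCH otherwise.
compare_counts() {
  tmp_a="$(mktemp)"
  tmp_b="$(mktemp)"
  sort "$1" > "$tmp_a"
  sort "$2" > "$tmp_b"
  if diff "$tmp_a" "$tmp_b"; then
    status=0
    echo "MATCH"
  else
    status=1
    echo "MISMATCH"
  fi
  rm -f "$tmp_a" "$tmp_b"
  return "$status"
}

# Usage (file names are placeholders):
#   compare_counts wsl2-counts.txt cluster-counts.txt
```
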
**Rollback:** if counts differ, delete cluster DB data, re-run from T04.
WSL2 is still live and unchanged.

**Done when:** all table row counts match the WSL2 instance.

---

### T06 — Drill cluster backup restore

```task
id: T06
status: todo
priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
```

Before cutting over, prove the cluster backup can be restored:

```bash
# Trigger a backup via the cluster cron (or manually)
kubectl -n custodian create job --from=cronjob/state-hub-backup backup-drill-01

# Verify output in /opt/backup/ on the node holding the PVC
# Decrypt and restore to a test namespace
kubectl create ns restore-test
# ... restore steps similar to T01 but against cluster postgres
```

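Both drills (T01 and T06) end with "drill logged". One lightweight way to keep those records consistent is an append-only log with a UTC timestamp. The file name and tab-separated format below are purely illustrative assumptions, not an existing convention of `memory/episodic/`.

```bash
# Append a dated drill record to a log file.
# Arguments: LOG_FILE DRILL_NAME RESULT (all hypothetical conventions).
log_drill() {
  log_file="$1"
  drill="$2"
  result="$3"
  mkdir -p "$(dirname "$log_file")"
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$drill" "$result" >> "$log_file"
}

# Usage:
#   log_drill memory/episodic/backup-drills.tsv T06-cluster-restore PASS
```
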
**Done when:** restore drill passes; drill logged.

---

### T07 — Cutover: redirect MCP config to cluster

```task
id: T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
```

Update the MCP config on every operator workstation (WSL2, COULOMBCORE) to
reach the cluster state hub via SSH port-forward instead of the local process.

The MCP server subprocess still runs locally (Python, same `server.py`).
Only the API endpoint it calls changes — via a persistent port-forward:

```bash
# On operator workstation — keep this running (add to tunnel-daemon or tunnel-loop)
ssh -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
```

No change to `.mcp.json` needed — subprocess still calls `http://127.0.0.1:8000`.

Alternatively: update the MCP server's `API_BASE` env var to point directly
to the port-forward. Either approach is valid; document the chosen one.

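For a tunnel that runs unattended, the ssh invocation usually benefits from `-N` (no remote command), `ExitOnForwardFailure` (fail fast if the local port is taken) and `ServerAliveInterval` (detect dead links). The sketch below only prints such a command so it can be reviewed or handed to a supervisor; `STATE_HUB_TARGET` and `CLUSTER_SSH` are hypothetical variable names, and `TUNNEL_PORT` is assumed to mean the local port. Whether the service DNS name resolves from the node's sshd is not confirmed here; if it does not, the ClusterIP could be supplied through `STATE_HUB_TARGET` instead.

```bash
# Print the ssh command for a forward-only tunnel to the State Hub service.
# Binds to 127.0.0.1 only, so the forward is not reachable from other hosts.
tunnel_cmd() {
  port="${TUNNEL_PORT:-8000}"
  target="${STATE_HUB_TARGET:-state-hub.custodian.svc.cluster.local:8000}"
  host="${CLUSTER_SSH:-tegwick@92.205.62.239}"
  printf 'ssh -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -L %s:%s %s\n' \
    "127.0.0.1:$port" "$target" "$host"
}

# Usage:
#   tunnel_cmd            # inspect the command
#   eval "$(tunnel_cmd)"  # run it in the foreground
```

Changing `TUNNEL_PORT` away from 8000 would also require pointing the MCP server at the new local port.
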
**Done when:** `claude /mcp` shows `state-hub` connected; `get_state_summary()`
returns live cluster data.

---

### T08 — Stabilisation period (2 weeks minimum)

```task
id: T08
status: todo
priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
```

Run the cluster state hub as the primary for two weeks before retiring WSL2:

- Keep WSL2 state hub running (but frozen — no writes) as DR fallback
- Monitor cluster pod restarts, storage health, backup cron
- Run `get_state_summary()` at the start of each session; confirm data is live
- Test failover: kill the FastAPI pod; verify it restarts and responds within 60s

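The 60-second failover check in the last bullet can be scripted with a simple polling loop. This is a sketch under assumptions: the helper name and the `app=state-hub` pod label are hypothetical, and the elapsed time counts only the one-second sleeps, so it slightly underestimates real time when the probe itself is slow.

```bash
# Poll a command once per second until it succeeds or TIMEOUT seconds elapse.
# Usage: wait_until TIMEOUT command [args...]
wait_until() {
  timeout="$1"
  shift
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ok after ${elapsed}s"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "timed out after ${timeout}s"
  return 1
}

# Usage (with the SSH port-forward from T07 running):
#   kubectl -n custodian delete pod -l app=state-hub   # label is an assumption
#   wait_until 60 curl -fsS http://127.0.0.1:8000/state/health
```
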
**Done when:** two weeks elapsed with no data loss events; all backup drills
passed.

---

### T09 — Retire WSL2 instance

```task
id: T09
status: todo
priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
```

Once T08 stabilisation passes:

1. Take a final WSL2 backup (archive, keep indefinitely)
2. Stop the WSL2 Docker container: `make -C ~/the-custodian/state-hub clean`
3. Update the global and project `CLAUDE.md` files to remove WSL2 state hub start instructions
4. Update `MEMORY.md` to record that the state hub is now cluster-hosted
5. Record a decision in the state hub: "State Hub WSL2 instance retired"

**Done when:** WSL2 state hub no longer running; documentation updated.

---

## References

- Constitution constraint: irreversible actions require human approval — T05
  (data migration) and T09 (WSL2 retirement) require explicit sign-off
- OAS layer: S3 Platform Services (railiance-platform)
- DR dependency: Longhorn storage (railiance-cluster WP to be linked)
- Extension point: EP-RAIL-005 (full-stack backup) — state hub must implement
  `make backup` / `make restore` standard interface before T06
- Domain goal: `6f96c712-60e6-4ea9-ab06-168878eafbce` (Three-Phoenix Secure
  Kubernetes Infrastructure)