Updated workplans for migrating the custodian to Railiance01
@@ -1,80 +1,107 @@
---
id: CUST-WP-0011
type: workplan
title: "Migrate Custodian State Hub to ThreePhoenix Cluster"
title: "Pragmatic State Hub Migration to railiance01"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
created: "2026-03-11"
updated: "2026-03-11"
updated: "2026-05-02"
state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
supersedes_intent_from: "Migrate Custodian State Hub to ThreePhoenix Cluster"
follow_up_workplan: CUST-WP-0038
---

# Migrate Custodian State Hub to ThreePhoenix Cluster
# Pragmatic State Hub Migration to railiance01

## Goal

Move the Custodian State Hub (FastAPI + PostgreSQL) from its current home on
the WSL2 operator workstation to the ThreePhoenix Kubernetes cluster
(Railiance01/02/03), making it available to Claude Code sessions running on
any machine with cluster access — without public internet exposure.
Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator
workstation to the current railiance01 Kubernetes environment, using the
Railiance production-readiness path that exists now:

The State Hub is **irreplaceable episodic memory**. This migration must be
executed with zero tolerance for data loss and a tested rollback path at
every stage.
- CloudNative PG (`cnpg`) for the State Hub database in the `databases`
  namespace.
- State Hub as an S5 workload in `railiance-apps`.
- Platform/database ownership in `railiance-platform`.
- Access through the existing private tunnel/ops-bridge model, not public
  exposure.
- WSL2 retained as a disaster-recovery fallback until the cluster deployment
  has proven stable.

## Pre-conditions (gate — do not start until all satisfied)
This is a deliberate pragmatic step. It improves durability and multi-machine
access before the full ThreePhoenix target is ready. The ultimate multi-node,
replicated, long-term cluster goal is preserved in `CUST-WP-0038`.

- [ ] ThreePhoenix cluster has three healthy nodes (Railiance01 confirmed, Railiance02 + Railiance03 joined)
- [ ] Longhorn distributed storage installed and verified (replication factor ≥ 2)
- [ ] HA failover test passes (`tests/test_ha_failover.sh` exits 0 on the cluster)
- [ ] S2 integrated backup operational and tested on the cluster
- [ ] A full WSL2 State Hub backup has been taken and restore-drilled **within 24h of starting this workplan**
## Context Update

These gates are mandatory. A single-node cluster or unverified storage is not
an acceptable migration target for the Custodian.
The original 2026-03-11 version of this workplan targeted a future
ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before
starting. That was correct as an end-state, but it blocks useful progress now.

## Architecture after migration
The current Railiance architecture has moved on:

- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
  supersedes the older Bitnami PostgreSQL HA platform baseline.
- CloudNative PG is the deployed database operator.
- `RAIL-HO-WP-0004-T09` is the Railiance-side task for deploying State Hub to
  the cluster, and it still requires human decisions before live data
  migration.

This workplan is now the Custodian-side coordination and safety plan for that
T09 effort.

## Safety Contract

State Hub is irreplaceable episodic memory. This migration may prepare, deploy,
test, and compare as much as needed, but it must not make the cluster the only
source of truth until the explicit cutover gate is satisfied.

Rules:

- A fresh WSL2 backup and restore drill is mandatory before data migration.
- The WSL2 State Hub remains available as rollback until stabilisation passes.
- Any task that changes the live writer endpoint requires explicit human
  approval.
- A failed cluster deploy must leave the WSL2 instance untouched and usable.
- Row counts and key API checks must match before cutover.

## Target Architecture After This Workplan

```
COULOMBCORE / operator workstation (WSL2)
└─ Claude Code
   └─ MCP server subprocess (Python, local clone of the-custodian)
      └─ HTTP → ssh -L 8000:state-hub-svc:8000 tegwick@92.205.62.239
         └─ Railiance01 k3s
            └─ state-hub ClusterIP service
               ├─ FastAPI pod (1–2 replicas)
               └─ PostgreSQL PVC (Longhorn, 2-way replicated)
Operator workstation / COULOMBCORE / other agent hosts
  -> local MCP server subprocess
    -> http://127.0.0.1:8000 or configured API_BASE
      -> private tunnel / ops-bridge
        -> railiance01 k3s
          -> state-hub Service
            -> FastAPI Deployment
              -> state-hub-db CloudNative PG Cluster
```

Key properties:
- **Not publicly exposed** — ClusterIP only; access via SSH port-forward
- **Replicated storage** — Longhorn replicates the PG data volume across nodes
- **WSL2 instance retained as DR fallback** during the stabilisation period
- **MCP config unchanged** — subprocess still calls `http://127.0.0.1:8000`;
  the SSH port-forward provides the binding

## Backup and disaster recovery contract
- Single-node pragmatic deployment on railiance01.
- No public unauthenticated exposure.
- Database managed by cnpg, not an ad-hoc Postgres StatefulSet.
- WSL2 retained as DR fallback during stabilisation.
- Future multi-node HA and storage replication are deferred to `CUST-WP-0038`.

Before and during migration, the following must hold at all times:
## Open Human Decisions

| Asset | Backup mechanism | RPO | Tested? |
|---|---|---|---|
| State Hub PostgreSQL DB | `make backup` (pg_dump → age-encrypted, Nextcloud offsite) | Daily | Must be drilled before T03 |
| State Hub DB on cluster | Longhorn snapshot + age-encrypted copy to `/opt/backup/` | Daily | Must be drilled before T06 |
| WSL2 instance | Remains live during stabilisation period | — | Running |
Resolve these before T04/T05 can become live migration work:

**Rollback rule:** at any task boundary, if something is wrong, revert to
WSL2. No task should leave the system in a state where both WSL2 and cluster
are broken.

---
1. Final State Hub hostname or tunnel-only endpoint.
2. Container registry choice: Gitea registry vs external interim registry.
3. Exposure model: ClusterIP plus tunnel, private ingress, or both.
4. Approval window for freezing WSL2 writes and migrating the production DB.
5. Stabilisation duration before WSL2 can be considered non-primary fallback.

## Tasks

### T01 — Drill WSL2 backup restore end-to-end
### T01 — Drill WSL2 State Hub backup restore

```task
id: T01
@@ -83,29 +110,23 @@ priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
```

Before touching anything, prove the current backup can actually be restored:
Take a fresh State Hub backup from the current WSL2 instance and restore it
into an isolated test PostgreSQL instance.

```bash
# In the-custodian/state-hub/
make backup # take fresh backup
# Spin up a test postgres container
docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test \
  -p 5433:5432 postgres:16
# Decrypt and restore
age -d -i ~/.config/sops/age/keys.txt \
  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
  gunzip | psql -h 127.0.0.1 -p 5433 -U postgres state_hub
# Spot-check: count topics
psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT COUNT(*) FROM topics;" state_hub
docker rm -f pg-restore-test
```
Minimum checks:

**Done when:** restore completes, topic count matches production, drill logged
in `memory/episodic/`.
- Restore completes without errors.
- Core table row counts match the live WSL2 database.
- `/state/summary` can be served from the restored copy if wired to a test API.
- Drill result is recorded in State Hub progress and, if useful, episodic
  memory.

**Done when:** backup and restore are proven within 24 hours of live migration
work.
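
The row-count check in this task is easier to trust with exact counts than with planner statistics. The following is a minimal sketch, assuming the core tables can be listed by name. Only `topics` appears in this workplan; the function name `count_sql` and any other table name passed to it are hypothetical. It builds a single exact-count query that can be run unchanged against both the live WSL2 database and the restored copy.

```bash
# Build one SQL statement that returns an exact row count per table.
# Table names are passed as arguments; the caller decides which tables are "core".
count_sql() (
  if [ "$#" -eq 0 ]; then
    echo "usage: count_sql TABLE [TABLE...]" >&2
    exit 1
  fi
  sep=""
  for table in "$@"; do
    printf "%sSELECT '%s' AS table_name, COUNT(*) AS row_count FROM %s" \
      "$sep" "$table" "$table"
    sep=" UNION ALL "
  done
  printf ' ORDER BY table_name;\n'
)
```

For example, `psql -h 127.0.0.1 -p 5433 -U postgres -At -c "$(count_sql topics)" state_hub` would print one `table|count` line per table for the restored copy. Running the same query against the live database gives the listing to compare it with.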

---

### T02 — Helm chart for State Hub (new: railiance-platform)
### T02 — Align with Railiance deployment plan

```task
id: T02
@@ -114,34 +135,22 @@ priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
```

Create `helm/state-hub/` in `railiance-platform` (S3 layer owns platform
services). The chart must deploy:
Update the cross-repo plan so this Custodian workplan and
`RAIL-HO-WP-0004-T09` point to the same architecture.

- **FastAPI deployment** — image built from `the-custodian/state-hub/`,
  1 replica initially (scale to 2 after T06)
- **PostgreSQL StatefulSet** — single instance backed by a Longhorn PVC
  (minimum 5 Gi); HA not required here — Longhorn replication IS the HA
- **ClusterIP service** `state-hub` on port 8000
- **ConfigMap** for non-secret config (DB URL template, log level)
- **Secret** for DB credentials (SOPS-encrypted values file)
- **Liveness/readiness probe** — `GET /state/health`
Expected outputs:

Values:
```yaml
image:
  repository: gitea.local/custodian/state-hub
  tag: latest
postgres:
  storageClass: longhorn
  size: 5Gi
replicaCount: 1
```
- `RAIL-HO-WP-0004-T09` remains the Railiance-side execution task.
- This workplan remains the Custodian-side safety/cutover task list.
- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the
  near-term migration plan.
- The future HA goal is referenced through `CUST-WP-0038`.

**Done when:** `helm lint` passes; chart committed in railiance-platform.
**Done when:** both workplans describe compatible responsibilities and gates.

---

### T03 — Build and push State Hub container image
### T03 — Build and publish State Hub container image

```task
id: T03
@@ -150,31 +159,22 @@ priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
```

Add `state-hub/Dockerfile` to the-custodian:
Package `state-hub/` as a production image.

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen --no-dev
COPY api/ ./api/
COPY mcp_server/ ./mcp_server/
CMD ["uv", "run", "uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Requirements:

Build and push to the cluster-local Gitea registry:
- Dockerfile builds from the current Python/uv project.
- Alembic and runtime dependencies are available inside the image.
- Image exposes the FastAPI service on port 8000.
- Image tag is pushed to the chosen registry.
- Build provenance is documented in the commit/workplan.

```bash
docker build -t gitea.local/custodian/state-hub:latest .
docker push gitea.local/custodian/state-hub:latest
```

**Done when:** image available in Gitea registry; `helm install --dry-run`
resolves the image.
**Done when:** railiance01 can pull the image and a dry-run deployment resolves
it.

---

### T04 — Deploy to cluster and run Alembic migrations
### T04 — Define State Hub database and app manifests

```task
id: T04
@@ -183,26 +183,20 @@ priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
```

```bash
# From operator workstation via SSH port-forward to k3s API
helm install state-hub ./helm/state-hub/ \
  -n custodian --create-namespace \
  -f helm/state-hub/values-production.yaml
Create the cluster-side deployment assets using current Railiance boundaries:

# Wait for pods
kubectl -n custodian rollout status deployment/state-hub
- `railiance-platform`: `state-hub-db` cnpg cluster and database credentials.
- `railiance-apps`: State Hub Deployment, Service, ConfigMap, Secret/External
  Secret reference, and optional private Ingress.
- Health probes use `GET /state/health`.
- Environment includes `DATABASE_URL` and any required API settings.

# Run migrations inside the pod
kubectl -n custodian exec -it deploy/state-hub -- \
  uv run alembic upgrade head
```

**Done when:** pod Running, `/state/health` returns 200, Alembic reports
"head" from inside the pod.
**Done when:** manifests lint/apply in a non-destructive dry run and ownership
boundaries are documented.
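
As an illustration of the `state-hub-db` piece, the sketch below prints a minimal CloudNative PG `Cluster` manifest from a shell function. The function name, the single instance, the 5Gi size, and the `state_hub` database and owner names are assumptions made for this example. The real values are owned by `railiance-platform`.

```bash
# Print a minimal CloudNative PG Cluster manifest for the State Hub database.
# Arguments (both optional): cluster name, namespace.
# Assumed for illustration: 1 instance, 5Gi storage, database/owner "state_hub".
state_hub_db_manifest() {
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ${1:-state-hub-db}
  namespace: ${2:-databases}
spec:
  instances: 1
  storage:
    size: 5Gi
  bootstrap:
    initdb:
      database: state_hub
      owner: state_hub
EOF
}
```

Piping the output to `kubectl apply --dry-run=client -f -` is one way to exercise the manifest without changing the cluster, assuming cluster access and that the cnpg CRDs are installed.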

---

### T05 — Migrate data from WSL2 to cluster
### T05 — Deploy empty State Hub and run migrations on railiance01

```task
id: T05
@@ -211,33 +205,21 @@ priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
```

This is the point of no return for the DB — execute with care:
Deploy State Hub against an empty `state-hub-db` cnpg database and run Alembic
migrations in the cluster environment.

```bash
# 1. Take final WSL2 backup
make -C ~/the-custodian/state-hub backup
Checks:

# 2. Copy dump into the cluster postgres pod
kubectl -n custodian cp /tmp/state-hub-migration.sql \
  $(kubectl -n custodian get pod -l app=state-hub-postgres -o name):/tmp/
- Pod reaches Ready.
- `/state/health` returns healthy through the intended private access path.
- Alembic reports head.
- Logs show no repeated crash/restart loop.

# 3. Restore
kubectl -n custodian exec -it deploy/state-hub-postgres -- \
  psql -U postgres -d state_hub -f /tmp/state-hub-migration.sql

# 4. Spot-check counts match WSL2
kubectl -n custodian exec -it deploy/state-hub -- \
  psql -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
```

**Rollback:** if counts differ, delete cluster DB data, re-run from T04.
WSL2 is still live and unchanged.

**Done when:** all table row counts match the WSL2 instance.
**Done when:** an empty but structurally valid State Hub runs on railiance01.
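
A freshly deployed pod may need several seconds before `/state/health` answers, so a single probe can fail spuriously. The sketch below is one way to retry a check a bounded number of times. The helper name `retry_until_ok` and its argument order are illustrative and not part of the State Hub tooling.

```bash
# Retry a command until it succeeds or the attempt budget is used up.
# Usage: retry_until_ok MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until_ok() (
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      exit 0
    fi
    # Sleep between attempts, but not after the final failure.
    [ "$attempt" -ge "$max_attempts" ] || sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "gave up after $max_attempts attempts: $*" >&2
  exit 1
)
```

With the private access path bound to the local port, a call such as `retry_until_ok 30 2 curl -fsS http://127.0.0.1:8000/state/health` would wait up to about a minute for a healthy response and return non-zero otherwise.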

---

### T06 — Drill cluster backup restore
### T06 — Restore WSL2 data copy into cluster and compare

```task
id: T06
@@ -246,53 +228,49 @@ priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
```

Before cutting over, prove the cluster backup can be restored:
Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live
source of truth.

```bash
# Trigger a backup via the cluster cron (or manually)
kubectl -n custodian create job --from=cronjob/state-hub-backup backup-drill-01
Required comparison:

# Verify output in /opt/backup/ on the node holding the PVC
# Decrypt and restore to a test namespace
kubectl create ns restore-test
# ... restore steps similar to T01 but against cluster postgres
```
- Table row counts match.
- Representative workstreams, tasks, decisions, progress events, repos, and
  token events are queryable.
- Dashboard and MCP summary calls return expected data through the cluster API.
- Any mismatch is investigated before proceeding.

**Done when:** restore drill passes; drill logged.
**Done when:** cluster data is a verified copy of WSL2, but not yet the only
writer.
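
The row-count comparison can be sketched as a small gate that fails loudly on any difference. It assumes two text files of `table|count` lines (for example, the output of `psql -At`), one captured from WSL2 and one from the cluster. The function name and file layout are illustrative only.

```bash
# Compare two "table|count" listings.
# Prints the differing lines and returns non-zero when the listings disagree.
compare_counts() (
  tmpdir="$(mktemp -d)" || exit 1
  # Sort both listings so the row order produced by psql does not matter.
  sort "$1" > "$tmpdir/wsl2"
  sort "$2" > "$tmpdir/cluster"
  status=0
  diff "$tmpdir/wsl2" "$tmpdir/cluster" || status=1
  rm -rf "$tmpdir"
  if [ "$status" -eq 0 ]; then
    echo "row counts match"
  else
    echo "row count mismatch: investigate before proceeding" >&2
  fi
  exit "$status"
)
```

Because the function returns non-zero on any mismatch, it can guard later steps, for example `compare_counts wsl2.txt cluster.txt && echo "safe to continue"`, where both file names are placeholders.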

---

### T07 — Cutover: redirect MCP config to cluster
### T07 — Cut over private access to cluster State Hub

```task
id: T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
needs_human: true
intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint."
```

Update the MCP config on every operator workstation (WSL2, COULOMBCORE) to
reach the cluster state hub via SSH port-forward instead of the local process.
With human approval, freeze WSL2 writes, take a final dump, restore it to the
cluster, compare counts again, and redirect the active private access path to
the cluster API.

The MCP server subprocess still runs locally (Python, same `server.py`).
Only the API endpoint it calls changes — via a persistent port-forward:
Accepted approaches:

```bash
# On operator workstation — keep this running (add to tunnel-daemon or tunnel-loop)
ssh -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
```
- Keep local MCP config pointed at `http://127.0.0.1:8000` and move that port
  to an ops-bridge/SSH tunnel.
- Or set the MCP server `API_BASE` to the chosen private cluster endpoint.

No change to `.mcp.json` needed — subprocess still calls `http://127.0.0.1:8000`.

Alternatively: update the MCP server's `API_BASE` env var to point directly
to the port-forward. Either approach is valid; document the chosen one.

**Done when:** `claude /mcp` shows `state-hub` connected; `get_state_summary()`
returns live cluster data.
**Done when:** `get_state_summary()` and dashboard live data are served by the
cluster State Hub, and WSL2 is no longer receiving normal writes.
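
For the first accepted approach, the tunnel command can be generated rather than typed by hand, so that every operator machine uses the same options. This is a sketch only: the function name is hypothetical, and the SSH target and remote service host are left as arguments because the final endpoint is still an open human decision. The command is printed, not executed.

```bash
# Print (do not run) an SSH command that binds a local port to the cluster API.
# Usage: tunnel_command SSH_TARGET REMOTE_HOST [LOCAL_PORT] [REMOTE_PORT]
# -N: no remote command; ExitOnForwardFailure: fail fast if the port is taken;
# ServerAliveInterval: notice a dead connection instead of hanging silently.
tunnel_command() {
  printf 'ssh -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -L 127.0.0.1:%s:%s:%s %s\n' \
    "${3:-8000}" "$2" "${4:-8000}" "$1"
}
```

Binding to `127.0.0.1` keeps the forwarded port off the workstation's other interfaces, which matches the requirement that access stays private.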

---

### T08 — Stabilisation period (2 weeks minimum)
### T08 — Stabilise with WSL2 retained as fallback

```task
id: T08
@@ -301,19 +279,23 @@ priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
```

Run the cluster state hub as the primary for two weeks before retiring WSL2:
Run the cluster State Hub as primary while keeping the WSL2 instance available
as a fallback.

- Keep WSL2 state hub running (but frozen — no writes) as DR fallback
- Monitor cluster pod restarts, storage health, backup cron
- Run `get_state_summary()` at the start of each session; confirm data is live
- Test failover: kill the FastAPI pod; verify it restarts and responds within 60s
Monitor:

**Done when:** two weeks elapsed with no data loss events; all backup drills
passed.
- State Hub pod restarts.
- cnpg cluster health.
- Backup job success.
- Dashboard and MCP behavior from each operator machine.
- Consistency sync behavior for file-backed workplans.

**Done when:** the agreed stabilisation window passes without data loss or
unresolved operational defects.
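
Pod restarts are the simplest of these signals to check mechanically. The sketch below reads `POD RESTARTS` pairs on standard input and flags any pod above a threshold. The function name and the threshold are illustrative, and how the pairs are produced is left to the caller.

```bash
# Read "POD RESTARTS" lines on stdin and print pods whose restart count
# exceeds the threshold (first argument, default 0).
# Returns non-zero if any pod is flagged.
flag_restarts() {
  awk -v limit="${1:-0}" '
    $2 + 0 > limit + 0 { print $1 " restarted " $2 " times"; bad = 1 }
    END { exit bad }
  '
}
```

One possible source of input, assuming single-container pods and a namespace held in a hypothetical `$NAMESPACE` variable, is `kubectl -n "$NAMESPACE" get pods --no-headers -o 'custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount' | flag_restarts 3`.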

---

### T09 — Retire WSL2 instance
### T09 — Document operating model and defer final WSL2 retirement

```task
id: T09
@@ -322,25 +304,24 @@ priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
```

Once T08 stabilisation passes:
Document the new operating model:

1. Take a final WSL2 backup (archive, keep indefinitely)
2. Stop the WSL2 Docker container: `make -C ~/the-custodian/state-hub clean`
3. Update `CLAUDE.md` global and project to remove WSL2 state hub start instructions
4. Update MEMORY.md — state hub is now cluster-hosted
5. Record a decision in the state hub: "State Hub WSL2 instance retired"
- How agents reach State Hub.
- How backups and restores work.
- How to roll back to WSL2 if needed.
- Which parts remain pragmatic/single-node.
- Which long-term requirements moved to `CUST-WP-0038`.

**Done when:** WSL2 state hub no longer running; documentation updated.
Do not permanently retire WSL2 in this workplan unless a separate human
decision is recorded. Retirement belongs after proven stability or in the
future HA workplan.

---
**Done when:** runbooks and project instructions match the deployed reality.

## References

- Constitution constraint: irreversible actions require human approval — T05
  (data migration) and T09 (WSL2 retirement) require explicit sign-off
- OAS layer: S3 Platform Services (railiance-platform)
- DR dependency: Longhorn storage (railiance-cluster WP to be linked)
- Extension point: EP-RAIL-005 (full-stack backup) — state hub must implement
  `make backup` / `make restore` standard interface before T06
- Domain goal: `6f96c712-60e6-4ea9-ab06-168878eafbce` (Three-Phoenix Secure
  Kubernetes Infrastructure)
- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
- `RAIL-HO-WP-0004-T09` — Railiance-side State Hub deployment task
- `CUST-WP-0038` — future full ThreePhoenix HA State Hub migration
- Constitution constraint: production data migration and fallback retirement
  require explicit human approval
246
workplans/CUST-WP-0038-state-hub-threephoenix-ha.md
Normal file
@@ -0,0 +1,246 @@
---
id: CUST-WP-0038
type: workplan
title: "State Hub Full ThreePhoenix HA Migration"
domain: custodian
repo: the-custodian
status: proposed
owner: custodian
topic_slug: custodian
created: "2026-05-02"
updated: "2026-05-02"
depends_on: CUST-WP-0011
state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
---

# State Hub Full ThreePhoenix HA Migration

## Goal

Preserve the original long-term State Hub infrastructure goal while
`CUST-WP-0011` takes the pragmatic railiance01 path.

This workplan completes the migration from a useful single-node cluster-hosted
State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes,
replicated storage, tested failover, tested restore, and retirement of the WSL2
fallback only after operational confidence is earned.

## Why This Exists

The near-term State Hub migration should not wait for every HA precondition,
because the workstation-hosted State Hub is already a bottleneck for
multi-machine work.

But the original requirement remains valid:

- State Hub is irreplaceable episodic memory.
- A single node is not a final home.
- Backup and restore must be drilled, not assumed.
- Long-term operations must survive node loss and operator-machine loss.

`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan
keeps the ultimate target visible and reviewable.

## Entry Criteria

- `CUST-WP-0011` completed or explicitly superseded.
- Cluster-hosted State Hub has passed its stabilisation period.
- railiance01 is not the only planned durable node.
- Railiance architecture decision for storage replication is current:
  Longhorn, cnpg replication, external backup, or a documented replacement.
- Backup and restore tooling has an owner and runbook.

## Target Properties

- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03.
- State Hub database survives loss of one node.
- State Hub API recovers from pod loss without manual repair.
- Backups are encrypted, off-node, and restorable into a test namespace.
- Agent access remains private.
- WSL2 is no longer needed as the primary disaster-recovery fallback.

## Tasks

### T01 — Confirm ThreePhoenix cluster readiness

```task
id: T01
status: todo
priority: high
state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
```

Verify the target cluster state:

- Three nodes are joined and Ready.
- Control-plane and worker roles are documented.
- Cluster version and node resources are recorded.
- Smoke tests pass from the operator machine and from CoulombCore.

**Done when:** a current readiness report exists and no node is marked
NotReady or operationally unknown.

---

### T02 — Establish replicated storage/database strategy

```task
id: T02
status: todo
priority: high
state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
```

Choose and document the durable data strategy for State Hub:

- cnpg multi-instance PostgreSQL cluster, and/or
- Longhorn-backed storage with suitable replication, and/or
- another explicitly approved architecture.

The decision must define RPO, RTO, failover behavior, and restore procedure.

**Done when:** the selected architecture is documented and approved before any
production data movement.

---

### T03 — Implement HA State Hub database

```task
id: T03
status: todo
priority: high
state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
```

Apply the chosen database/storage architecture to State Hub.

Requirements:

- Database credentials remain SOPS/secret-managed.
- The database has automated backup configured.
- The database exposes a stable service endpoint for the API.
- Health and replication status are observable.

**Done when:** State Hub can run against the HA database in a test or staging
namespace.

---

### T04 — Add State Hub API high-availability behavior

```task
id: T04
status: todo
priority: medium
state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
```

Run State Hub API with the right availability posture for its workload:

- At least one replica, optionally more if DB/session behavior permits.
- Readiness and liveness probes.
- Rolling update behavior documented.
- Resource requests/limits set.

**Done when:** killing an API pod does not require manual recovery.

---

### T05 — Drill database failover

```task
id: T05
status: todo
priority: high
state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
```

Perform a controlled failover drill for the State Hub database.

Checks:

- Failure trigger is documented.
- API behavior during failover is observed.
- Recovery time is measured.
- No data loss is detected after recovery.

**Done when:** the failover drill passes and results are logged.
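
Measuring recovery time is more repeatable when the stopwatch is scripted. The following sketch polls a probe command until it succeeds and prints the elapsed whole seconds, giving up after a timeout. The function name and arguments are illustrative, and the probe itself is whatever the drill defines as "recovered".

```bash
# Poll a probe command until it succeeds, then print elapsed whole seconds.
# Usage: measure_recovery INTERVAL_SECONDS TIMEOUT_SECONDS COMMAND [ARGS...]
# Returns non-zero if the probe has not succeeded within the timeout.
measure_recovery() (
  interval="$1"
  timeout="$2"
  shift 2
  start="$(date +%s)"
  until "$@"; do
    now="$(date +%s)"
    if [ "$((now - start))" -ge "$timeout" ]; then
      echo "no recovery within ${timeout}s" >&2
      exit 1
    fi
    sleep "$interval"
  done
  end="$(date +%s)"
  echo "$((end - start))"
)
```

Started immediately after the failure trigger, a call such as `measure_recovery 1 120 curl -fsS -o /dev/null http://127.0.0.1:8000/state/health` would report how long the API took to answer again. The probe's own output is discarded so that only the elapsed time is printed.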

---

### T06 — Drill backup restore to isolated namespace

```task
id: T06
status: todo
priority: high
state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
```

Restore the latest encrypted State Hub backup into an isolated namespace or
separate test database.

Checks:

- Backup can be decrypted with the documented key path.
- Restore completes from off-node backup material.
- Row counts and representative records match.
- Restored API can serve `/state/health` and `/state/summary` when pointed at
  the restored database.

**Done when:** restore drill passes without depending on the live database.

---

### T07 — Update agent access and runbooks for HA endpoint

```task
id: T07
status: todo
priority: medium
state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
```

Update the private access model after the HA endpoint is available:

- ops-bridge or tunnel target.
- MCP `API_BASE` or local port-forward convention.
- Dashboard access.
- Operator recovery instructions.

**Done when:** each active operator machine can reach the HA State Hub endpoint
through the documented path.

---

### T08 — Retire WSL2 fallback after explicit approval

```task
id: T08
status: todo
priority: low
needs_human: true
intervention_note: "Requires explicit approval after HA failover and restore drills pass."
state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
```

Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA
cluster path has passed drills.

Steps:

1. Take and archive a final WSL2 backup.
2. Stop local WSL2 State Hub services.
3. Update global and repo instructions.
4. Record the retirement decision in State Hub.

**Done when:** WSL2 is no longer part of the normal or fallback operating
model, and the cluster runbook is the source of truth.

## References

- `CUST-WP-0011` — pragmatic railiance01 migration
- Railiance ThreePhoenix infrastructure goal
- State Hub backup/restore runbooks
- Constitution constraint: irreversible retirement requires human approval