From dd887c8c81322248c392134c2715b8bf8d796515 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Sat, 2 May 2026 23:38:56 +0200
Subject: [PATCH] Updated workplans for migrating the custodian to Railiance01

---
 ...P-0011-state-hub-threephoenix-migration.md | 363 +++++++++---------
 .../CUST-WP-0038-state-hub-threephoenix-ha.md | 246 ++++++++++++
 2 files changed, 418 insertions(+), 191 deletions(-)
 create mode 100644 workplans/CUST-WP-0038-state-hub-threephoenix-ha.md

diff --git a/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md b/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
index b96308c..5ff3174 100644
--- a/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
+++ b/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
@@ -1,80 +1,107 @@
 ---
 id: CUST-WP-0011
 type: workplan
-title: "Migrate Custodian State Hub to ThreePhoenix Cluster"
+title: "Pragmatic State Hub Migration to railiance01"
 domain: custodian
 repo: the-custodian
 status: active
 owner: custodian
 topic_slug: custodian
 created: "2026-03-11"
-updated: "2026-03-11"
+updated: "2026-05-02"
 state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
+supersedes_intent_from: "Migrate Custodian State Hub to ThreePhoenix Cluster"
+follow_up_workplan: CUST-WP-0038
 ---

-# Migrate Custodian State Hub to ThreePhoenix Cluster
+# Pragmatic State Hub Migration to railiance01

 ## Goal

-Move the Custodian State Hub (FastAPI + PostgreSQL) from its current home on
-the WSL2 operator workstation to the ThreePhoenix Kubernetes cluster
-(Railiance01/02/03), making it available to Claude Code sessions running on
-any machine with cluster access — without public internet exposure.
+Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator
+workstation to the current railiance01 Kubernetes environment, using the
+Railiance production-readiness path that exists now:

-The State Hub is **irreplaceable episodic memory**. This migration must be
-executed with zero tolerance for data loss and a tested rollback path at
-every stage.
+- CloudNative PG (`cnpg`) for the State Hub database in the `databases`
+  namespace.
+- State Hub as an S5 workload in `railiance-apps`.
+- Platform/database ownership in `railiance-platform`.
+- Access through the existing private tunnel/ops-bridge model, not public
+  exposure.
+- WSL2 retained as a disaster-recovery fallback until the cluster deployment
+  has proven stable.

-## Pre-conditions (gate — do not start until all satisfied)
+This is a deliberate pragmatic step. It improves durability and multi-machine
+access before the full ThreePhoenix target is ready. The ultimate multi-node,
+replicated, long-term cluster goal is preserved in `CUST-WP-0038`.

-- [ ] ThreePhoenix cluster has three healthy nodes (Railiance01 confirmed, Railiance02 + Railiance03 joined)
-- [ ] Longhorn distributed storage installed and verified (replication factor ≥ 2)
-- [ ] HA failover test passes (`tests/test_ha_failover.sh` exits 0 on the cluster)
-- [ ] S2 integrated backup operational and tested on the cluster
-- [ ] A full WSL2 State Hub backup has been taken and restore-drilled **within 24h of starting this workplan**
+## Context Update

-These gates are mandatory. A single-node cluster or unverified storage is not
-an acceptable migration target for the Custodian.
+The original 2026-03-11 version of this workplan targeted a future
+ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before
+starting. That was correct as an end-state, but it blocks useful progress now.

-## Architecture after migration
+The current Railiance architecture has moved on:
+
+- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
+  supersedes the older Bitnami PostgreSQL HA platform baseline.
+- CloudNative PG is the deployed database operator.
+- `RAIL-HO-WP-0004-T09` is the Railiance-side task for deploying State Hub to
+  the cluster, and it still requires human decisions before live data
+  migration.
+
+This workplan is now the Custodian-side coordination and safety plan for that
+T09 effort.
+
+## Safety Contract
+
+State Hub is irreplaceable episodic memory. This migration may prepare, deploy,
+test, and compare as much as needed, but it must not make the cluster the only
+source of truth until the explicit cutover gate is satisfied.
+
+Rules:
+
+- A fresh WSL2 backup and restore drill is mandatory before data migration.
+- The WSL2 State Hub remains available as rollback until stabilisation passes.
+- Any task that changes the live writer endpoint requires explicit human
+  approval.
+- A failed cluster deploy must leave the WSL2 instance untouched and usable.
+- Row counts and key API checks must match before cutover.
+
+## Target Architecture After This Workplan

 ```
-COULOMBCORE / operator workstation (WSL2)
-  └─ Claude Code
-       └─ MCP server subprocess (Python, local clone of the-custodian)
-            └─ HTTP → ssh -L 8000:state-hub-svc:8000 tegwick@92.205.62.239
-                 └─ Railiance01 k3s
-                      └─ state-hub ClusterIP service
-                           ├─ FastAPI pod (1–2 replicas)
-                           └─ PostgreSQL PVC (Longhorn, 2-way replicated)
+Operator workstation / COULOMBCORE / other agent hosts
+  -> local MCP server subprocess
+    -> http://127.0.0.1:8000 or configured API_BASE
+      -> private tunnel / ops-bridge
+        -> railiance01 k3s
+          -> state-hub Service
+            -> FastAPI Deployment
+            -> state-hub-db CloudNative PG Cluster
 ```

 Key properties:
-- **Not publicly exposed** — ClusterIP only; access via SSH port-forward
-- **Replicated storage** — Longhorn replicates the PG data volume across nodes
-- **WSL2 instance retained as DR fallback** during the stabilisation period
-- **MCP config unchanged** — subprocess still calls `http://127.0.0.1:8000`;
-  the SSH port-forward provides the binding

-## Backup and disaster recovery contract
+- Single-node pragmatic deployment on railiance01.
+- No public unauthenticated exposure.
+- Database managed by cnpg, not an ad-hoc Postgres StatefulSet.
+- WSL2 retained as DR fallback during stabilisation.
+- Future multi-node HA and storage replication are deferred to `CUST-WP-0038`.

-Before and during migration, the following must hold at all times:
+## Open Human Decisions

-| Asset | Backup mechanism | RPO | Tested? |
-|---|---|---|---|
-| State Hub PostgreSQL DB | `make backup` (pg_dump → age-encrypted, Nextcloud offsite) | Daily | Must be drilled before T03 |
-| State Hub DB on cluster | Longhorn snapshot + age-encrypted copy to `/opt/backup/` | Daily | Must be drilled before T06 |
-| WSL2 instance | Remains live during stabilisation period | — | Running |
+Resolve these before T04/T05 can become live migration work:

-**Rollback rule:** at any task boundary, if something is wrong, revert to
-WSL2. No task should leave the system in a state where both WSL2 and cluster
-are broken.
-
----
+1. Final State Hub hostname or tunnel-only endpoint.
+2. Container registry choice: Gitea registry vs external interim registry.
+3. Exposure model: ClusterIP plus tunnel, private ingress, or both.
+4. Approval window for freezing WSL2 writes and migrating the production DB.
+5. Stabilisation duration before WSL2 can be considered non-primary fallback.

 ## Tasks

-### T01 — Drill WSL2 backup restore end-to-end
+### T01 — Drill WSL2 State Hub backup restore

 ```task
 id: T01
@@ -83,29 +110,23 @@ priority: high
 state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
 ```

-Before touching anything, prove the current backup can actually be restored:
+Take a fresh State Hub backup from the current WSL2 instance and restore it
+into an isolated test PostgreSQL instance.

-```bash
-# In the-custodian/state-hub/
-make backup # take fresh backup
-# Spin up a test postgres container
-docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test \
-  -p 5433:5432 postgres:16
-# Decrypt and restore
-age -d -i ~/.config/sops/age/keys.txt \
-  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
-  gunzip | psql -h 127.0.0.1 -p 5433 -U postgres state_hub
-# Spot-check: count topics
-psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT COUNT(*) FROM topics;" state_hub
-docker rm -f pg-restore-test
-```
+Minimum checks:

-**Done when:** restore completes, topic count matches production, drill logged
-in `memory/episodic/`.
+- Restore completes without errors.
+- Core table row counts match the live WSL2 database.
+- `/state/summary` can be served from the restored copy if wired to a test API.
+- Drill result is recorded in State Hub progress and, if useful, episodic
+  memory.
+
+**Done when:** backup and restore are proven within 24 hours of live migration
+work.

 ---

-### T02 — Helm chart for State Hub (new: railiance-platform)
+### T02 — Align with Railiance deployment plan

 ```task
 id: T02
@@ -114,34 +135,22 @@ priority: high
 state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
 ```

-Create `helm/state-hub/` in `railiance-platform` (S3 layer owns platform
-services). The chart must deploy:
+Update the cross-repo plan so this Custodian workplan and
+`RAIL-HO-WP-0004-T09` point to the same architecture.

-- **FastAPI deployment** — image built from `the-custodian/state-hub/`,
-  1 replica initially (scale to 2 after T06)
-- **PostgreSQL StatefulSet** — single instance backed by a Longhorn PVC
-  (minimum 5 Gi); HA not required here — Longhorn replication IS the HA
-- **ClusterIP service** `state-hub` on port 8000
-- **ConfigMap** for non-secret config (DB URL template, log level)
-- **Secret** for DB credentials (SOPS-encrypted values file)
-- **Liveness/readiness probe** — `GET /state/health`
+Expected outputs:

-Values:
-```yaml
-image:
-  repository: gitea.local/custodian/state-hub
-  tag: latest
-postgres:
-  storageClass: longhorn
-  size: 5Gi
-replicaCount: 1
-```
+- `RAIL-HO-WP-0004-T09` remains the Railiance-side execution task.
+- This workplan remains the Custodian-side safety/cutover task list.
+- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the
+  near-term migration plan.
+- The future HA goal is referenced through `CUST-WP-0038`.

-**Done when:** `helm lint` passes; chart committed in railiance-platform.
+**Done when:** both workplans describe compatible responsibilities and gates.

 ---

-### T03 — Build and push State Hub container image
+### T03 — Build and publish State Hub container image

 ```task
 id: T03
@@ -150,31 +159,22 @@ priority: high
 state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
 ```

-Add `state-hub/Dockerfile` to the-custodian:
+Package `state-hub/` as a production image.

-```dockerfile
-FROM python:3.12-slim
-WORKDIR /app
-COPY pyproject.toml uv.lock ./
-RUN pip install uv && uv sync --frozen --no-dev
-COPY api/ ./api/
-COPY mcp_server/ ./mcp_server/
-CMD ["uv", "run", "uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
-```
+Requirements:

-Build and push to the cluster-local Gitea registry:
+- Dockerfile builds from the current Python/uv project.
+- Alembic and runtime dependencies are available inside the image.
+- Image exposes the FastAPI service on port 8000.
+- Image tag is pushed to the chosen registry.
+- Build provenance is documented in the commit/workplan.

-```bash
-docker build -t gitea.local/custodian/state-hub:latest .
-docker push gitea.local/custodian/state-hub:latest
-```
-
-**Done when:** image available in Gitea registry; `helm install --dry-run`
-resolves the image.
+**Done when:** railiance01 can pull the image and a dry-run deployment resolves
+it.

 ---

-### T04 — Deploy to cluster and run Alembic migrations
+### T04 — Define State Hub database and app manifests

 ```task
 id: T04
@@ -183,26 +183,20 @@ priority: high
 state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
 ```

-```bash
-# From operator workstation via SSH port-forward to k3s API
-helm install state-hub ./helm/state-hub/ \
-  -n custodian --create-namespace \
-  -f helm/state-hub/values-production.yaml
+Create the cluster-side deployment assets using current Railiance boundaries:

-# Wait for pods
-kubectl -n custodian rollout status deployment/state-hub
+- `railiance-platform`: `state-hub-db` cnpg cluster and database credentials.
+- `railiance-apps`: State Hub Deployment, Service, ConfigMap, Secret/External
+  Secret reference, and optional private Ingress.
+- Health probes use `GET /state/health`.
+- Environment includes `DATABASE_URL` and any required API settings.

-# Run migrations inside the pod
-kubectl -n custodian exec -it deploy/state-hub -- \
-  uv run alembic upgrade head
-```
-
-**Done when:** pod Running, `/state/health` returns 200, Alembic reports
-"head" from inside the pod.
+**Done when:** manifests lint/apply in a non-destructive dry run and ownership
+boundaries are documented.

 ---

-### T05 — Migrate data from WSL2 to cluster
+### T05 — Deploy empty State Hub and run migrations on railiance01

 ```task
 id: T05
@@ -211,33 +205,21 @@ priority: high
 state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
 ```

-This is the point of no return for the DB — execute with care:
+Deploy State Hub against an empty `state-hub-db` cnpg database and run Alembic
+migrations in the cluster environment.

-```bash
-# 1. Take final WSL2 backup
-make -C ~/the-custodian/state-hub backup
+Checks:

-# 2. Copy dump into the cluster postgres pod
-kubectl -n custodian cp /tmp/state-hub-migration.sql \
-  $(kubectl -n custodian get pod -l app=state-hub-postgres -o name):/tmp/
+- Pod reaches Ready.
+- `/state/health` returns healthy through the intended private access path.
+- Alembic reports head.
+- Logs show no repeated crash/restart loop.

-# 3. Restore
-kubectl -n custodian exec -it deploy/state-hub-postgres -- \
-  psql -U postgres -d state_hub -f /tmp/state-hub-migration.sql
-
-# 4. Spot-check counts match WSL2
-kubectl -n custodian exec -it deploy/state-hub -- \
-  psql -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
-```
-
-**Rollback:** if counts differ, delete cluster DB data, re-run from T04.
-WSL2 is still live and unchanged.
-
-**Done when:** all table row counts match the WSL2 instance.
+**Done when:** an empty but structurally valid State Hub runs on railiance01.

 ---

-### T06 — Drill cluster backup restore
+### T06 — Restore WSL2 data copy into cluster and compare

 ```task
 id: T06
@@ -246,53 +228,49 @@ priority: high
 state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
 ```

-Before cutting over, prove the cluster backup can be restored:
+Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live
+source of truth.

-```bash
-# Trigger a backup via the cluster cron (or manually)
-kubectl -n custodian create job --from=cronjob/state-hub-backup backup-drill-01
+Required comparison:

-# Verify output in /opt/backup/ on the node holding the PVC
-# Decrypt and restore to a test namespace
-kubectl create ns restore-test
-# ... restore steps similar to T01 but against cluster postgres
-```
+- Table row counts match.
+- Representative workstreams, tasks, decisions, progress events, repos, and
+  token events are queryable.
+- Dashboard and MCP summary calls return expected data through the cluster API.
+- Any mismatch is investigated before proceeding.

-**Done when:** restore drill passes; drill logged.
+**Done when:** cluster data is a verified copy of WSL2, but not yet the only
+writer.

 ---

-### T07 — Cutover: redirect MCP config to cluster
+### T07 — Cut over private access to cluster State Hub

 ```task
 id: T07
 status: todo
 priority: medium
 state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
+needs_human: true
+intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint."
 ```

-Update the MCP config on every operator workstation (WSL2, COULOMBCORE) to
-reach the cluster state hub via SSH port-forward instead of the local process.
+With human approval, freeze WSL2 writes, take a final dump, restore it to the
+cluster, compare counts again, and redirect the active private access path to
+the cluster API.

-The MCP server subprocess still runs locally (Python, same `server.py`).
-Only the API endpoint it calls changes — via a persistent port-forward:
+Accepted approaches:

-```bash
-# On operator workstation — keep this running (add to tunnel-daemon or tunnel-loop)
-ssh -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
-```
+- Keep local MCP config pointed at `http://127.0.0.1:8000` and move that port
+  to an ops-bridge/SSH tunnel.
+- Or set the MCP server `API_BASE` to the chosen private cluster endpoint.

-No change to `.mcp.json` needed — subprocess still calls `http://127.0.0.1:8000`.
-
-Alternatively: update the MCP server's `API_BASE` env var to point directly
-to the port-forward. Either approach is valid; document the chosen one.
-
-**Done when:** `claude /mcp` shows `state-hub` connected; `get_state_summary()`
-returns live cluster data.
+**Done when:** `get_state_summary()` and dashboard live data are served by the
+cluster State Hub, and WSL2 is no longer receiving normal writes.

 ---

-### T08 — Stabilisation period (2 weeks minimum)
+### T08 — Stabilise with WSL2 retained as fallback

 ```task
 id: T08
@@ -301,19 +279,23 @@ priority: medium
 state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
 ```

-Run the cluster state hub as the primary for two weeks before retiring WSL2:
+Run the cluster State Hub as primary while keeping the WSL2 instance available
+as a fallback.

-- Keep WSL2 state hub running (but frozen — no writes) as DR fallback
-- Monitor cluster pod restarts, storage health, backup cron
-- Run `get_state_summary()` at the start of each session; confirm data is live
-- Test failover: kill the FastAPI pod; verify it restarts and responds within 60s
+Monitor:

-**Done when:** two weeks elapsed with no data loss events; all backup drills
-passed.
+- State Hub pod restarts.
+- cnpg cluster health.
+- Backup job success.
+- Dashboard and MCP behavior from each operator machine.
+- Consistency sync behavior for file-backed workplans.
+
+**Done when:** the agreed stabilisation window passes without data loss or
+unresolved operational defects.

 ---

-### T09 — Retire WSL2 instance
+### T09 — Document operating model and defer final WSL2 retirement

 ```task
 id: T09
@@ -322,25 +304,24 @@ priority: low
 state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
 ```

-Once T08 stabilisation passes:
+Document the new operating model:

-1. Take a final WSL2 backup (archive, keep indefinitely)
-2. Stop the WSL2 Docker container: `make -C ~/the-custodian/state-hub clean`
-3. Update `CLAUDE.md` global and project to remove WSL2 state hub start instructions
-4. Update MEMORY.md — state hub is now cluster-hosted
-5. Record a decision in the state hub: "State Hub WSL2 instance retired"
+- How agents reach State Hub.
+- How backups and restores work.
+- How to roll back to WSL2 if needed.
+- Which parts remain pragmatic/single-node.
+- Which long-term requirements moved to `CUST-WP-0038`.

-**Done when:** WSL2 state hub no longer running; documentation updated.
+Do not permanently retire WSL2 in this workplan unless a separate human
+decision is recorded. Retirement belongs after proven stability or in the
+future HA workplan.

----
+**Done when:** runbooks and project instructions match the deployed reality.

 ## References

-- Constitution constraint: irreversible actions require human approval — T05
-  (data migration) and T09 (WSL2 retirement) require explicit sign-off
-- OAS layer: S3 Platform Services (railiance-platform)
-- DR dependency: Longhorn storage (railiance-cluster WP to be linked)
-- Extension point: EP-RAIL-005 (full-stack backup) — state hub must implement
-  `make backup` / `make restore` standard interface before T06
-- Domain goal: `6f96c712-60e6-4ea9-ab06-168878eafbce` (Three-Phoenix Secure
-  Kubernetes Infrastructure)
+- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
+- `RAIL-HO-WP-0004-T09` — Railiance-side State Hub deployment task
+- `CUST-WP-0038` — future full ThreePhoenix HA State Hub migration
+- Constitution constraint: production data migration and fallback retirement
+  require explicit human approval
diff --git a/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md b/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md
new file mode 100644
index 0000000..94dfc23
--- /dev/null
+++ b/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md
@@ -0,0 +1,246 @@
+---
+id: CUST-WP-0038
+type: workplan
+title: "State Hub Full ThreePhoenix HA Migration"
+domain: custodian
+repo: the-custodian
+status: proposed
+owner: custodian
+topic_slug: custodian
+created: "2026-05-02"
+updated: "2026-05-02"
+depends_on: CUST-WP-0011
+state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
+---
+
+# State Hub Full ThreePhoenix HA Migration
+
+## Goal
+
+Preserve the original long-term State Hub infrastructure goal while
+`CUST-WP-0011` takes the pragmatic railiance01 path.
+
+This workplan completes the migration from a useful single-node cluster-hosted
+State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes,
+replicated storage, tested failover, tested restore, and retirement of the WSL2
+fallback only after operational confidence is earned.
+
+## Why This Exists
+
+The near-term State Hub migration should not wait for every HA precondition,
+because the workstation-hosted State Hub is already a bottleneck for
+multi-machine work.
+
+But the original requirement remains valid:
+
+- State Hub is irreplaceable episodic memory.
+- A single node is not a final home.
+- Backup and restore must be drilled, not assumed.
+- Long-term operations must survive node loss and operator-machine loss.
+
+`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan
+keeps the ultimate target visible and reviewable.
+
+## Entry Criteria
+
+- `CUST-WP-0011` completed or explicitly superseded.
+- Cluster-hosted State Hub has passed its stabilisation period.
+- railiance01 is not the only planned durable node.
+- Railiance architecture decision for storage replication is current:
+  Longhorn, cnpg replication, external backup, or a documented replacement.
+- Backup and restore tooling has an owner and runbook.
+
+## Target Properties
+
+- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03.
+- State Hub database survives loss of one node.
+- State Hub API recovers from pod loss without manual repair.
+- Backups are encrypted, off-node, and restorable into a test namespace.
+- Agent access remains private.
+- WSL2 is no longer needed as the primary disaster-recovery fallback.
+
+## Tasks
+
+### T01 — Confirm ThreePhoenix cluster readiness
+
+```task
+id: T01
+status: todo
+priority: high
+state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
+```
+
+Verify the target cluster state:
+
+- Three nodes are joined and Ready.
+- Control-plane and worker roles are documented.
+- Cluster version and node resources are recorded.
+- Smoke tests pass from the operator machine and from CoulombCore.
+
+**Done when:** a current readiness report exists and no node is marked
+NotReady or operationally unknown.
+
+---
+
+### T02 — Establish replicated storage/database strategy
+
+```task
+id: T02
+status: todo
+priority: high
+state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
+```
+
+Choose and document the durable data strategy for State Hub:
+
+- cnpg multi-instance PostgreSQL cluster, and/or
+- Longhorn-backed storage with suitable replication, and/or
+- another explicitly approved architecture.
+
+The decision must define RPO, RTO, failover behavior, and restore procedure.
+
+**Done when:** the selected architecture is documented and approved before any
+production data movement.
+
+---
+
+### T03 — Implement HA State Hub database
+
+```task
+id: T03
+status: todo
+priority: high
+state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
+```
+
+Apply the chosen database/storage architecture to State Hub.
+
+Requirements:
+
+- Database credentials remain SOPS/secret-managed.
+- The database has automated backup configured.
+- The database exposes a stable service endpoint for the API.
+- Health and replication status are observable.
+
+**Done when:** State Hub can run against the HA database in a test or staging
+namespace.
+
+---
+
+### T04 — Add State Hub API high-availability behavior
+
+```task
+id: T04
+status: todo
+priority: medium
+state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
+```
+
+Run State Hub API with the right availability posture for its workload:
+
+- At least one replica, optionally more if DB/session behavior permits.
+- Readiness and liveness probes.
+- Rolling update behavior documented.
+- Resource requests/limits set.
+
+**Done when:** killing an API pod does not require manual recovery.
+
+---
+
+### T05 — Drill database failover
+
+```task
+id: T05
+status: todo
+priority: high
+state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
+```
+
+Perform a controlled failover drill for the State Hub database.
+
+Checks:
+
+- Failure trigger is documented.
+- API behavior during failover is observed.
+- Recovery time is measured.
+- No data loss is detected after recovery.
+
+**Done when:** the failover drill passes and results are logged.
+
+---
+
+### T06 — Drill backup restore to isolated namespace
+
+```task
+id: T06
+status: todo
+priority: high
+state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
+```
+
+Restore the latest encrypted State Hub backup into an isolated namespace or
+separate test database.
+
+Checks:
+
+- Backup can be decrypted with the documented key path.
+- Restore completes from off-node backup material.
+- Row counts and representative records match.
+- Restored API can serve `/state/health` and `/state/summary` when pointed at
+  the restored database.
+
+**Done when:** restore drill passes without depending on the live database.
+
+---
+
+### T07 — Update agent access and runbooks for HA endpoint
+
+```task
+id: T07
+status: todo
+priority: medium
+state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
+```
+
+Update the private access model after the HA endpoint is available:
+
+- ops-bridge or tunnel target.
+- MCP `API_BASE` or local port-forward convention.
+- Dashboard access.
+- Operator recovery instructions.
+
+**Done when:** each active operator machine can reach the HA State Hub endpoint
+through the documented path.
+
+---
+
+### T08 — Retire WSL2 fallback after explicit approval
+
+```task
+id: T08
+status: todo
+priority: low
+needs_human: true
+intervention_note: "Requires explicit approval after HA failover and restore drills pass."
+state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
+```
+
+Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA
+cluster path has passed drills.
+
+Steps:
+
+1. Take and archive a final WSL2 backup.
+2. Stop local WSL2 State Hub services.
+3. Update global and repo instructions.
+4. Record the retirement decision in State Hub.
+
+**Done when:** WSL2 is no longer part of the normal or fallback operating
+model, and the cluster runbook is the source of truth.
+
+## References
+
+- `CUST-WP-0011` — pragmatic railiance01 migration
+- Railiance ThreePhoenix infrastructure goal
+- State Hub backup/restore runbooks
+- Constitution constraint: irreversible retirement requires human approval
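
Both workplans gate cutover and restore drills on matching row counts between the WSL2 database and its cluster copy. A minimal sketch of that comparison is given here, assuming the per-table counts have already been collected into plain dictionaries (for example with one `SELECT COUNT(*)` per table on each side). The function name `compare_row_counts` and the table names in the usage example are hypothetical and do not come from the repository.

```python
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical helper, not part of the-custodian: it only compares counts
# that the caller has already gathered from each database.
Mismatch = Tuple[Optional[int], Optional[int]]


def compare_row_counts(
    source: Mapping[str, int], target: Mapping[str, int]
) -> Dict[str, Mismatch]:
    """Return {table: (source_count, target_count)} for every difference.

    A table present on only one side is reported with None for the
    missing side. An empty result means the two sides agree.
    """
    mismatches: Dict[str, Mismatch] = {}
    for table in sorted(set(source) | set(target)):
        src = source.get(table)
        dst = target.get(table)
        if src != dst:
            mismatches[table] = (src, dst)
    return mismatches


if __name__ == "__main__":
    # Illustrative numbers only; table names are assumptions.
    wsl2_counts = {"topics": 42, "tasks": 120, "decisions": 17}
    cluster_counts = {"topics": 42, "tasks": 119, "decisions": 17}
    print(compare_row_counts(wsl2_counts, cluster_counts))
    # → {'tasks': (120, 119)}
```

Under the Safety Contract in `CUST-WP-0011`, a non-empty result would stop the cutover and leave WSL2 as the source of truth until the mismatch is explained.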