Updated workplans for migrating the custodian to Railiance01

2026-05-02 23:38:56 +02:00
parent 94916cbfb0
commit dd887c8c81
2 changed files with 418 additions and 191 deletions
--- a/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
+++ b/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
@@ -1,80 +1,107 @@
 ---
 id: CUST-WP-0011
 type: workplan
-title: "Migrate Custodian State Hub to ThreePhoenix Cluster"
+title: "Pragmatic State Hub Migration to railiance01"
 domain: custodian
 repo: the-custodian
 status: active
 owner: custodian
 topic_slug: custodian
 created: "2026-03-11"
-updated: "2026-03-11"
+updated: "2026-05-02"
 state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
+supersedes_intent_from: "Migrate Custodian State Hub to ThreePhoenix Cluster"
+follow_up_workplan: CUST-WP-0038
 ---

-# Migrate Custodian State Hub to ThreePhoenix Cluster
+# Pragmatic State Hub Migration to railiance01

 ## Goal

-Move the Custodian State Hub (FastAPI + PostgreSQL) from its current home on
-the WSL2 operator workstation to the ThreePhoenix Kubernetes cluster
-(Railiance01/02/03), making it available to Claude Code sessions running on
-any machine with cluster access — without public internet exposure.
+Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator
+workstation to the current railiance01 Kubernetes environment, using the
+Railiance production-readiness path that exists now:

-The State Hub is **irreplaceable episodic memory**. This migration must be
-executed with zero tolerance for data loss and a tested rollback path at
-every stage.
+- CloudNative PG (`cnpg`) for the State Hub database in the `databases`
+  namespace.
+- State Hub as an S5 workload in `railiance-apps`.
+- Platform/database ownership in `railiance-platform`.
+- Access through the existing private tunnel/ops-bridge model, not public
+  exposure.
+- WSL2 retained as a disaster-recovery fallback until the cluster deployment
+  has proven stable.

-## Pre-conditions (gate — do not start until all satisfied)
+This is a deliberate pragmatic step. It improves durability and multi-machine
+access before the full ThreePhoenix target is ready. The ultimate multi-node,
+replicated, long-term cluster goal is preserved in `CUST-WP-0038`.

- [ ] ThreePhoenix cluster has three healthy nodes (Railiance01 confirmed, Railiance02 + Railiance03 joined)
- [ ] Longhorn distributed storage installed and verified (replication factor ≥ 2)
- [ ] HA failover test passes (`tests/test_ha_failover.sh` exits 0 on the cluster)
- [ ] S2 integrated backup operational and tested on the cluster
- [ ] A full WSL2 State Hub backup has been taken and restore-drilled **within 24h of starting this workplan**
+## Context Update

-These gates are mandatory. A single-node cluster or unverified storage is not
-an acceptable migration target for the Custodian.
+The original 2026-03-11 version of this workplan targeted a future
+ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before
+starting. That was correct as an end-state, but it blocks useful progress now.

-## Architecture after migration
+The current Railiance architecture has moved on:
+
+- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
+  supersedes the older Bitnami PostgreSQL HA platform baseline.
+- CloudNative PG is the deployed database operator.
+- `RAIL-HO-WP-0004-T09` is the Railiance-side task for deploying State Hub to
+  the cluster, and it still requires human decisions before live data
+  migration.
+
+This workplan is now the Custodian-side coordination and safety plan for that
+T09 effort.
+
+## Safety Contract
+
+State Hub is irreplaceable episodic memory. This migration may prepare, deploy,
+test, and compare as much as needed, but it must not make the cluster the only
+source of truth until the explicit cutover gate is satisfied.
+
+Rules:
+
+- A fresh WSL2 backup and restore drill is mandatory before data migration.
+- The WSL2 State Hub remains available as rollback until stabilisation passes.
+- Any task that changes the live writer endpoint requires explicit human
+  approval.
+- A failed cluster deploy must leave the WSL2 instance untouched and usable.
+- Row counts and key API checks must match before cutover.
+
+## Target Architecture After This Workplan

 ```
-COULOMBCORE / operator workstation (WSL2)
-  └─ Claude Code
-       └─ MCP server subprocess (Python, local clone of the-custodian)
-            └─ HTTP → ssh -L 8000:state-hub-svc:8000 tegwick@92.205.62.239
-                          └─ Railiance01 k3s
-                               └─ state-hub ClusterIP service
-                                    ├─ FastAPI pod (1–2 replicas)
-                                    └─ PostgreSQL PVC (Longhorn, 2-way replicated)
+Operator workstation / COULOMBCORE / other agent hosts
+  -> local MCP server subprocess
+     -> http://127.0.0.1:8000 or configured API_BASE
+        -> private tunnel / ops-bridge
+           -> railiance01 k3s
+              -> state-hub Service
+                 -> FastAPI Deployment
+                 -> state-hub-db CloudNative PG Cluster
 ```

 Key properties:
- **Not publicly exposed** — ClusterIP only; access via SSH port-forward
- **Replicated storage** — Longhorn replicates the PG data volume across nodes
- **WSL2 instance retained as DR fallback** during the stabilisation period
- **MCP config unchanged** — subprocess still calls `http://127.0.0.1:8000`;
-  the SSH port-forward provides the binding

-## Backup and disaster recovery contract
+- Single-node pragmatic deployment on railiance01.
+- No public unauthenticated exposure.
+- Database managed by cnpg, not an ad-hoc Postgres StatefulSet.
+- WSL2 retained as DR fallback during stabilisation.
+- Future multi-node HA and storage replication are deferred to `CUST-WP-0038`.

-Before and during migration, the following must hold at all times:
+## Open Human Decisions

-| Asset | Backup mechanism | RPO | Tested? |
-|---|---|---|---|
-| State Hub PostgreSQL DB | `make backup` (pg_dump → age-encrypted, Nextcloud offsite) | Daily | Must be drilled before T03 |
-| State Hub DB on cluster | Longhorn snapshot + age-encrypted copy to `/opt/backup/` | Daily | Must be drilled before T06 |
-| WSL2 instance | Remains live during stabilisation period | — | Running |
+Resolve these before T04/T05 can become live migration work:

-**Rollback rule:** at any task boundary, if something is wrong, revert to
-WSL2. No task should leave the system in a state where both WSL2 and cluster
-are broken.
-
---
+1. Final State Hub hostname or tunnel-only endpoint.
+2. Container registry choice: Gitea registry vs external interim registry.
+3. Exposure model: ClusterIP plus tunnel, private ingress, or both.
+4. Approval window for freezing WSL2 writes and migrating the production DB.
+5. Stabilisation duration before WSL2 can be considered non-primary fallback.

 ## Tasks

-### T01 — Drill WSL2 backup restore end-to-end
+### T01 — Drill WSL2 State Hub backup restore

 ```task
 id: T01
@@ -83,29 +110,23 @@ priority: high
 state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
 ```

-Before touching anything, prove the current backup can actually be restored:
+Take a fresh State Hub backup from the current WSL2 instance and restore it
+into an isolated test PostgreSQL instance.

-```bash
-# In the-custodian/state-hub/
-make backup                         # take fresh backup
-# Spin up a test postgres container
-docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test \
-  -p 5433:5432 postgres:16
-# Decrypt and restore
-age -d -i ~/.config/sops/age/keys.txt \
-  /opt/backup/custodian/state-hub-latest.sql.gz.age | \
-  gunzip | psql -h 127.0.0.1 -p 5433 -U postgres state_hub
-# Spot-check: count topics
-psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT COUNT(*) FROM topics;" state_hub
-docker rm -f pg-restore-test
-```
+Minimum checks:

-**Done when:** restore completes, topic count matches production, drill logged
-in `memory/episodic/`.
+- Restore completes without errors.
+- Core table row counts match the live WSL2 database.
+- `/state/summary` can be served from the restored copy if wired to a test API.
+- Drill result is recorded in State Hub progress and, if useful, episodic
+  memory.
+
+**Done when:** backup and restore are proven no more than 24 hours before live
+migration work begins.

 ---

-### T02 — Helm chart for State Hub (new: railiance-platform)
+### T02 — Align with Railiance deployment plan

 ```task
 id: T02
@@ -114,34 +135,22 @@ priority: high
 state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
 ```

-Create `helm/state-hub/` in `railiance-platform` (S3 layer owns platform
-services). The chart must deploy:
+Update the cross-repo plan so this Custodian workplan and
+`RAIL-HO-WP-0004-T09` point to the same architecture.

- **FastAPI deployment** — image built from `the-custodian/state-hub/`,
-  1 replica initially (scale to 2 after T06)
- **PostgreSQL StatefulSet** — single instance backed by a Longhorn PVC
-  (minimum 5 Gi); HA not required here — Longhorn replication IS the HA
- **ClusterIP service** `state-hub` on port 8000
- **ConfigMap** for non-secret config (DB URL template, log level)
- **Secret** for DB credentials (SOPS-encrypted values file)
- **Liveness/readiness probe** — `GET /state/health`
+Expected outputs:

-Values:
-```yaml
-image:
-  repository: gitea.local/custodian/state-hub
-  tag: latest
-postgres:
-  storageClass: longhorn
-  size: 5Gi
-replicaCount: 1
-```
+- `RAIL-HO-WP-0004-T09` remains the Railiance-side execution task.
+- This workplan remains the Custodian-side safety/cutover task list.
+- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the
+  near-term migration plan.
+- The future HA goal is referenced through `CUST-WP-0038`.

-**Done when:** `helm lint` passes; chart committed in railiance-platform.
+**Done when:** both workplans describe compatible responsibilities and gates.

 ---

-### T03 — Build and push State Hub container image
+### T03 — Build and publish State Hub container image

 ```task
 id: T03
@@ -150,31 +159,22 @@ priority: high
 state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
 ```

-Add `state-hub/Dockerfile` to the-custodian:
+Package `state-hub/` as a production image.

-```dockerfile
-FROM python:3.12-slim
-WORKDIR /app
-COPY pyproject.toml uv.lock ./
-RUN pip install uv && uv sync --frozen --no-dev
-COPY api/ ./api/
-COPY mcp_server/ ./mcp_server/
-CMD ["uv", "run", "uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
-```
+Requirements:

-Build and push to the cluster-local Gitea registry:
+- Dockerfile builds from the current Python/uv project.
+- Alembic and runtime dependencies are available inside the image.
+- Image exposes the FastAPI service on port 8000.
+- Image tag is pushed to the chosen registry.
+- Build provenance is documented in the commit/workplan.

-```bash
-docker build -t gitea.local/custodian/state-hub:latest .
-docker push gitea.local/custodian/state-hub:latest
-```
-
-**Done when:** image available in Gitea registry; `helm install --dry-run`
-resolves the image.
+**Done when:** railiance01 can pull the image and a dry-run deployment resolves
+it.

 ---

-### T04 — Deploy to cluster and run Alembic migrations
+### T04 — Define State Hub database and app manifests

 ```task
 id: T04
@@ -183,26 +183,20 @@ priority: high
 state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
 ```

-```bash
-# From operator workstation via SSH port-forward to k3s API
-helm install state-hub ./helm/state-hub/ \
-  -n custodian --create-namespace \
-  -f helm/state-hub/values-production.yaml
+Create the cluster-side deployment assets using current Railiance boundaries:

-# Wait for pods
-kubectl -n custodian rollout status deployment/state-hub
+- `railiance-platform`: `state-hub-db` cnpg cluster and database credentials.
+- `railiance-apps`: State Hub Deployment, Service, ConfigMap, Secret/External
+  Secret reference, and optional private Ingress.
+- Health probes use `GET /state/health`.
+- Environment includes `DATABASE_URL` and any required API settings.

-# Run migrations inside the pod
-kubectl -n custodian exec -it deploy/state-hub -- \
-  uv run alembic upgrade head
-```
-
-**Done when:** pod Running, `/state/health` returns 200, Alembic reports
-"head" from inside the pod.
+**Done when:** manifests lint/apply in a non-destructive dry run and ownership
+boundaries are documented.

 ---

-### T05 — Migrate data from WSL2 to cluster
+### T05 — Deploy empty State Hub and run migrations on railiance01

 ```task
 id: T05
@@ -211,33 +205,21 @@ priority: high
 state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
 ```

-This is the point of no return for the DB — execute with care:
+Deploy State Hub against an empty `state-hub-db` cnpg database and run Alembic
+migrations in the cluster environment.

-```bash
-# 1. Take final WSL2 backup
-make -C ~/the-custodian/state-hub backup
+Checks:

-# 2. Copy dump into the cluster postgres pod
-kubectl -n custodian cp /tmp/state-hub-migration.sql \
-  $(kubectl -n custodian get pod -l app=state-hub-postgres -o name):/tmp/
+- Pod reaches Ready.
+- `/state/health` returns healthy through the intended private access path.
+- Alembic reports head.
+- Logs show no repeated crash/restart loop.

-# 3. Restore
-kubectl -n custodian exec -it deploy/state-hub-postgres -- \
-  psql -U postgres -d state_hub -f /tmp/state-hub-migration.sql
-
-# 4. Spot-check counts match WSL2
-kubectl -n custodian exec -it deploy/state-hub -- \
-  psql -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
-```
-
-**Rollback:** if counts differ, delete cluster DB data, re-run from T04.
-WSL2 is still live and unchanged.
-
-**Done when:** all table row counts match the WSL2 instance.
+**Done when:** an empty but structurally valid State Hub runs on railiance01.

 ---

-### T06 — Drill cluster backup restore
+### T06 — Restore WSL2 data copy into cluster and compare

 ```task
 id: T06
@@ -246,53 +228,49 @@ priority: high
 state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
 ```

-Before cutting over, prove the cluster backup can be restored:
+Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live
+source of truth.

-```bash
-# Trigger a backup via the cluster cron (or manually)
-kubectl -n custodian create job --from=cronjob/state-hub-backup backup-drill-01
+Required comparison:

-# Verify output in /opt/backup/ on the node holding the PVC
-# Decrypt and restore to a test namespace
-kubectl create ns restore-test
-# ... restore steps similar to T01 but against cluster postgres
-```
+- Table row counts match.
+- Representative workstreams, tasks, decisions, progress events, repos, and
+  token events are queryable.
+- Dashboard and MCP summary calls return expected data through the cluster API.
+- Any mismatch is investigated before proceeding.

-**Done when:** restore drill passes; drill logged.
+**Done when:** cluster data is a verified copy of WSL2, but not yet the only
+writer.

 ---

-### T07 — Cutover: redirect MCP config to cluster
+### T07 — Cut over private access to cluster State Hub

 ```task
 id: T07
 status: todo
 priority: medium
 state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
+needs_human: true
+intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint."
 ```

-Update the MCP config on every operator workstation (WSL2, COULOMBCORE) to
-reach the cluster state hub via SSH port-forward instead of the local process.
+With human approval, freeze WSL2 writes, take a final dump, restore it to the
+cluster, compare counts again, and redirect the active private access path to
+the cluster API.

-The MCP server subprocess still runs locally (Python, same `server.py`).
-Only the API endpoint it calls changes — via a persistent port-forward:
+Accepted approaches:

-```bash
-# On operator workstation — keep this running (add to tunnel-daemon or tunnel-loop)
-ssh -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
-```
+- Keep local MCP config pointed at `http://127.0.0.1:8000` and move that port
+  to an ops-bridge/SSH tunnel.
+- Or set the MCP server `API_BASE` to the chosen private cluster endpoint.

-No change to `.mcp.json` needed — subprocess still calls `http://127.0.0.1:8000`.
-
-Alternatively: update the MCP server's `API_BASE` env var to point directly
-to the port-forward. Either approach is valid; document the chosen one.
-
-**Done when:** `claude /mcp` shows `state-hub` connected; `get_state_summary()`
-returns live cluster data.
+**Done when:** `get_state_summary()` and dashboard live data are served by the
+cluster State Hub, and WSL2 is no longer receiving normal writes.

 ---

-### T08 — Stabilisation period (2 weeks minimum)
+### T08 — Stabilise with WSL2 retained as fallback

 ```task
 id: T08
@@ -301,19 +279,23 @@ priority: medium
 state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
 ```

-Run the cluster state hub as the primary for two weeks before retiring WSL2:
+Run the cluster State Hub as primary while keeping the WSL2 instance available
+as a fallback.

- Keep WSL2 state hub running (but frozen — no writes) as DR fallback
- Monitor cluster pod restarts, storage health, backup cron
- Run `get_state_summary()` at the start of each session; confirm data is live
- Test failover: kill the FastAPI pod; verify it restarts and responds within 60s
+Monitor:

-**Done when:** two weeks elapsed with no data loss events; all backup drills
-passed.
+- State Hub pod restarts.
+- cnpg cluster health.
+- Backup job success.
+- Dashboard and MCP behavior from each operator machine.
+- Consistency sync behavior for file-backed workplans.
+
+**Done when:** the agreed stabilisation window passes without data loss or
+unresolved operational defects.

 ---

-### T09 — Retire WSL2 instance
+### T09 — Document operating model and defer final WSL2 retirement

 ```task
 id: T09
@@ -322,25 +304,24 @@ priority: low
 state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
 ```

-Once T08 stabilisation passes:
+Document the new operating model:

-1. Take a final WSL2 backup (archive, keep indefinitely)
-2. Stop the WSL2 Docker container: `make -C ~/the-custodian/state-hub clean`
-3. Update `CLAUDE.md` global and project to remove WSL2 state hub start instructions
-4. Update MEMORY.md — state hub is now cluster-hosted
-5. Record a decision in the state hub: "State Hub WSL2 instance retired"
+- How agents reach State Hub.
+- How backups and restores work.
+- How to roll back to WSL2 if needed.
+- Which parts remain pragmatic/single-node.
+- Which long-term requirements moved to `CUST-WP-0038`.

-**Done when:** WSL2 state hub no longer running; documentation updated.
+Do not permanently retire WSL2 in this workplan unless a separate human
+decision is recorded. Retirement belongs after proven stability or in the
+future HA workplan.

---
+**Done when:** runbooks and project instructions match the deployed reality.

 ## References

- Constitution constraint: irreversible actions require human approval — T05
-  (data migration) and T09 (WSL2 retirement) require explicit sign-off
- OAS layer: S3 Platform Services (railiance-platform)
- DR dependency: Longhorn storage (railiance-cluster WP to be linked)
- Extension point: EP-RAIL-005 (full-stack backup) — state hub must implement
-  `make backup` / `make restore` standard interface before T06
- Domain goal: `6f96c712-60e6-4ea9-ab06-168878eafbce` (Three-Phoenix Secure
-  Kubernetes Infrastructure)
+- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
+- `RAIL-HO-WP-0004-T09` — Railiance-side State Hub deployment task
+- `CUST-WP-0038` — future full ThreePhoenix HA State Hub migration
+- Constitution constraint: production data migration and fallback retirement
+  require explicit human approval
--- a/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md
+++ b/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md
@@ -0,0 +1,246 @@
+---
+id: CUST-WP-0038
+type: workplan
+title: "State Hub Full ThreePhoenix HA Migration"
+domain: custodian
+repo: the-custodian
+status: proposed
+owner: custodian
+topic_slug: custodian
+created: "2026-05-02"
+updated: "2026-05-02"
+depends_on: CUST-WP-0011
+state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
+---
+
+# State Hub Full ThreePhoenix HA Migration
+
+## Goal
+
+Preserve the original long-term State Hub infrastructure goal while
+`CUST-WP-0011` takes the pragmatic railiance01 path.
+
+This workplan completes the migration from a useful single-node cluster-hosted
+State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes,
+replicated storage, tested failover, tested restore, and retirement of the WSL2
+fallback only after operational confidence is earned.
+
+## Why This Exists
+
+The near-term State Hub migration should not wait for every HA precondition,
+because the workstation-hosted State Hub is already a bottleneck for
+multi-machine work.
+
+But the original requirement remains valid:
+
+- State Hub is irreplaceable episodic memory.
+- A single node is not a final home.
+- Backup and restore must be drilled, not assumed.
+- Long-term operations must survive node loss and operator-machine loss.
+
+`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan
+keeps the ultimate target visible and reviewable.
+
+## Entry Criteria
+
+- `CUST-WP-0011` completed or explicitly superseded.
+- Cluster-hosted State Hub has passed its stabilisation period.
+- railiance01 is not the only planned durable node.
+- A current Railiance architecture decision exists for storage replication:
+  Longhorn, cnpg replication, external backup, or a documented replacement.
+- Backup and restore tooling has an owner and runbook.
+
+## Target Properties
+
+- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03.
+- State Hub database survives loss of one node.
+- State Hub API recovers from pod loss without manual repair.
+- Backups are encrypted, off-node, and restorable into a test namespace.
+- Agent access remains private.
+- WSL2 is no longer needed as the primary disaster-recovery fallback.
+
+## Tasks
+
+### T01 — Confirm ThreePhoenix cluster readiness
+
+```task
+id: T01
+status: todo
+priority: high
+state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
+```
+
+Verify the target cluster state:
+
+- Three nodes are joined and Ready.
+- Control-plane and worker roles are documented.
+- Cluster version and node resources are recorded.
+- Smoke tests pass from the operator machine and from COULOMBCORE.
+
+**Done when:** a current readiness report exists and no node is marked
+NotReady or operationally unknown.
+
+---
+
+### T02 — Establish replicated storage/database strategy
+
+```task
+id: T02
+status: todo
+priority: high
+state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
+```
+
+Choose and document the durable data strategy for State Hub:
+
+- cnpg multi-instance PostgreSQL cluster, and/or
+- Longhorn-backed storage with suitable replication, and/or
+- another explicitly approved architecture.
+
+The decision must define RPO, RTO, failover behavior, and restore procedure.
+
+**Done when:** the selected architecture is documented and approved before any
+production data movement.
+
+---
+
+### T03 — Implement HA State Hub database
+
+```task
+id: T03
+status: todo
+priority: high
+state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
+```
+
+Apply the chosen database/storage architecture to State Hub.
+
+Requirements:
+
+- Database credentials remain SOPS/secret-managed.
+- The database has automated backup configured.
+- The database exposes a stable service endpoint for the API.
+- Health and replication status are observable.
+
+**Done when:** State Hub can run against the HA database in a test or staging
+namespace.
+
+---
+
+### T04 — Add State Hub API high-availability behavior
+
+```task
+id: T04
+status: todo
+priority: medium
+state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
+```
+
+Run State Hub API with the right availability posture for its workload:
+
+- At least one replica, optionally more if DB/session behavior permits.
+- Readiness and liveness probes.
+- Rolling update behavior documented.
+- Resource requests/limits set.
+
+**Done when:** killing an API pod does not require manual recovery.
+
+---
+
+### T05 — Drill database failover
+
+```task
+id: T05
+status: todo
+priority: high
+state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
+```
+
+Perform a controlled failover drill for the State Hub database.
+
+Checks:
+
+- Failure trigger is documented.
+- API behavior during failover is observed.
+- Recovery time is measured.
+- No data loss is detected after recovery.
+
+**Done when:** the failover drill passes and results are logged.
+
+---
+
+### T06 — Drill backup restore to isolated namespace
+
+```task
+id: T06
+status: todo
+priority: high
+state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
+```
+
+Restore the latest encrypted State Hub backup into an isolated namespace or
+separate test database.
+
+Checks:
+
+- Backup can be decrypted with the documented key path.
+- Restore completes from off-node backup material.
+- Row counts and representative records match.
+- Restored API can serve `/state/health` and `/state/summary` when pointed at
+  the restored database.
+
+**Done when:** restore drill passes without depending on the live database.
+
+---
+
+### T07 — Update agent access and runbooks for HA endpoint
+
+```task
+id: T07
+status: todo
+priority: medium
+state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
+```
+
+Update the private access model after the HA endpoint is available:
+
+- ops-bridge or tunnel target.
+- MCP `API_BASE` or local port-forward convention.
+- Dashboard access.
+- Operator recovery instructions.
+
+**Done when:** each active operator machine can reach the HA State Hub endpoint
+through the documented path.
+
+---
+
+### T08 — Retire WSL2 fallback after explicit approval
+
+```task
+id: T08
+status: todo
+priority: low
+needs_human: true
+intervention_note: "Requires explicit approval after HA failover and restore drills pass."
+state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
+```
+
+Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA
+cluster path has passed drills.
+
+Steps:
+
+1. Take and archive a final WSL2 backup.
+2. Stop local WSL2 State Hub services.
+3. Update global and repo instructions.
+4. Record the retirement decision in State Hub.
+
+**Done when:** WSL2 is no longer part of the normal or fallback operating
+model, and the cluster runbook is the source of truth.
+
+## References
+
+- `CUST-WP-0011` — pragmatic railiance01 migration
+- Railiance ThreePhoenix infrastructure goal
+- State Hub backup/restore runbooks
+- Constitution constraint: irreversible retirement requires human approval
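
Both workplans gate cutover and restore drills on matching row counts (the Safety Contract and T06/T07 in `CUST-WP-0011`, T06 in `CUST-WP-0038`). A minimal sketch of that comparison is shown here, assuming the per-table counts have already been collected from each database with exact `SELECT COUNT(*)` queries. The function name `compare_row_counts` and the sample table names and figures are hypothetical and are not part of the State Hub codebase.

```python
"""Compare per-table row counts from two State Hub databases before cutover."""


def compare_row_counts(source: dict[str, int], target: dict[str, int]) -> list[str]:
    """Return human-readable mismatches; an empty list means the counts match."""
    mismatches = []
    # Walk the union of table names so a table missing on either side is reported.
    for table in sorted(set(source) | set(target)):
        src = source.get(table)
        dst = target.get(table)
        if src != dst:
            mismatches.append(f"{table}: source={src} target={dst}")
    return mismatches


if __name__ == "__main__":
    # Hypothetical counts for illustration only; real values would come from
    # the WSL2 database (source) and the cluster database (target).
    wsl2 = {"topics": 12, "tasks": 340, "decisions": 58}
    cluster = {"topics": 12, "tasks": 339, "decisions": 58}
    for line in compare_row_counts(wsl2, cluster):
        print(line)
```

Exact counts are preferable here to `n_live_tup` from `pg_stat_user_tables`, which PostgreSQL maintains as an estimate and which can differ between a live database and a freshly restored copy even when the data is identical.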