Updated workplans for migrating the custodian to Railiance01

2026-05-02 23:38:56 +02:00
parent 94916cbfb0
commit dd887c8c81
2 changed files with 418 additions and 191 deletions

@@ -1,80 +1,107 @@
---
id: CUST-WP-0011
type: workplan
title: "Migrate Custodian State Hub to ThreePhoenix Cluster"
title: "Pragmatic State Hub Migration to railiance01"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
created: "2026-03-11"
updated: "2026-03-11"
updated: "2026-05-02"
state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
supersedes_intent_from: "Migrate Custodian State Hub to ThreePhoenix Cluster"
follow_up_workplan: CUST-WP-0038
---
# Migrate Custodian State Hub to ThreePhoenix Cluster
# Pragmatic State Hub Migration to railiance01
## Goal
Move the Custodian State Hub (FastAPI + PostgreSQL) from its current home on
the WSL2 operator workstation to the ThreePhoenix Kubernetes cluster
(Railiance01/02/03), making it available to Claude Code sessions running on
any machine with cluster access — without public internet exposure.
Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator
workstation to the current railiance01 Kubernetes environment, using the
Railiance production-readiness path that exists now:
The State Hub is **irreplaceable episodic memory**. This migration must be
executed with zero tolerance for data loss and a tested rollback path at
every stage.
- CloudNative PG (`cnpg`) for the State Hub database in the `databases`
namespace.
- State Hub as an S5 workload in `railiance-apps`.
- Platform/database ownership in `railiance-platform`.
- Access through the existing private tunnel/ops-bridge model, not public
exposure.
- WSL2 retained as a disaster-recovery fallback until the cluster deployment
has proven stable.
## Pre-conditions (gate — do not start until all satisfied)
This is a deliberate pragmatic step. It improves durability and multi-machine
access before the full ThreePhoenix target is ready. The ultimate multi-node,
replicated, long-term cluster goal is preserved in `CUST-WP-0038`.
- [ ] ThreePhoenix cluster has three healthy nodes (Railiance01 confirmed, Railiance02 + Railiance03 joined)
- [ ] Longhorn distributed storage installed and verified (replication factor ≥ 2)
- [ ] HA failover test passes (`tests/test_ha_failover.sh` exits 0 on the cluster)
- [ ] S2 integrated backup operational and tested on the cluster
- [ ] A full WSL2 State Hub backup has been taken and restore-drilled **within 24h of starting this workplan**
## Context Update
These gates are mandatory. A single-node cluster or unverified storage is not
an acceptable migration target for the Custodian.
The original 2026-03-11 version of this workplan targeted a future
ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before
starting. That was correct as an end-state, but it blocks useful progress now.
## Architecture after migration
The current Railiance architecture has moved on:
- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
supersedes the older Bitnami PostgreSQL HA platform baseline.
- CloudNative PG is the deployed database operator.
- `RAIL-HO-WP-0004-T09` is the Railiance-side task for deploying State Hub to
the cluster, and it still requires human decisions before live data
migration.
This workplan is now the Custodian-side coordination and safety plan for that
T09 effort.
## Safety Contract
State Hub is irreplaceable episodic memory. This migration may prepare, deploy,
test, and compare as much as needed, but it must not make the cluster the only
source of truth until the explicit cutover gate is satisfied.
Rules:
- A fresh WSL2 backup and restore drill is mandatory before data migration.
- The WSL2 State Hub remains available as rollback until stabilisation passes.
- Any task that changes the live writer endpoint requires explicit human
approval.
- A failed cluster deploy must leave the WSL2 instance untouched and usable.
- Row counts and key API checks must match before cutover.
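The rules above can be read as a single gate that either passes or lists what is still blocking cutover. The following is a minimal sketch under stated assumptions: `CutoverEvidence` and `cutover_blockers` are hypothetical names, not part of the State Hub codebase, and the 24-hour freshness window is taken from the T01 completion criterion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class CutoverEvidence:
    """Hypothetical record of the evidence collected before cutover."""
    last_restore_drill: datetime  # when the WSL2 backup was last restore-drilled
    row_counts_match: bool        # WSL2 and cluster table counts agree
    api_checks_pass: bool         # key API checks agree on both sides
    human_approved: bool          # explicit approval to move the live writer


def cutover_blockers(ev: CutoverEvidence, now: datetime) -> List[str]:
    """Return the reasons cutover must not proceed; an empty list means go.

    `now` and `ev.last_restore_drill` must both be naive or both be
    timezone-aware, otherwise the subtraction raises TypeError.
    """
    blockers = []
    if now - ev.last_restore_drill > timedelta(hours=24):
        blockers.append("restore drill is older than 24 hours")
    if not ev.row_counts_match:
        blockers.append("row counts do not match")
    if not ev.api_checks_pass:
        blockers.append("key API checks do not pass")
    if not ev.human_approved:
        blockers.append("no explicit human approval recorded")
    return blockers
```

Returning the list of blockers rather than a bare boolean keeps the reason for a refused cutover visible in whatever log or progress event records the decision.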
## Target Architecture After This Workplan
```
COULOMBCORE / operator workstation (WSL2)
└─ Claude Code
└─ MCP server subprocess (Python, local clone of the-custodian)
└─ HTTP → ssh -L 8000:state-hub-svc:8000 tegwick@92.205.62.239
└─ Railiance01 k3s
└─ state-hub ClusterIP service
├─ FastAPI pod (1-2 replicas)
└─ PostgreSQL PVC (Longhorn, 2-way replicated)
Operator workstation / COULOMBCORE / other agent hosts
-> local MCP server subprocess
-> http://127.0.0.1:8000 or configured API_BASE
-> private tunnel / ops-bridge
-> railiance01 k3s
-> state-hub Service
-> FastAPI Deployment
-> state-hub-db CloudNative PG Cluster
```
Key properties:
- **Not publicly exposed** — ClusterIP only; access via SSH port-forward
- **Replicated storage** — Longhorn replicates the PG data volume across nodes
- **WSL2 instance retained as DR fallback** during the stabilisation period
- **MCP config unchanged** — subprocess still calls `http://127.0.0.1:8000`;
the SSH port-forward provides the binding
## Backup and disaster recovery contract
- Single-node pragmatic deployment on railiance01.
- No public unauthenticated exposure.
- Database managed by cnpg, not an ad-hoc Postgres StatefulSet.
- WSL2 retained as DR fallback during stabilisation.
- Future multi-node HA and storage replication are deferred to `CUST-WP-0038`.
Before and during migration, the following must hold at all times:
## Open Human Decisions
| Asset | Backup mechanism | RPO | Tested? |
|---|---|---|---|
| State Hub PostgreSQL DB | `make backup` (pg_dump → age-encrypted, Nextcloud offsite) | Daily | Must be drilled before T03 |
| State Hub DB on cluster | Longhorn snapshot + age-encrypted copy to `/opt/backup/` | Daily | Must be drilled before T06 |
| WSL2 instance | Remains live during stabilisation period | — | Running |
Resolve these before T04/T05 can become live migration work:
**Rollback rule:** at any task boundary, if something is wrong, revert to
WSL2. No task should leave the system in a state where both WSL2 and cluster
are broken.
---
1. Final State Hub hostname or tunnel-only endpoint.
2. Container registry choice: Gitea registry vs external interim registry.
3. Exposure model: ClusterIP plus tunnel, private ingress, or both.
4. Approval window for freezing WSL2 writes and migrating the production DB.
5. Stabilisation duration before WSL2 can be considered non-primary fallback.
## Tasks
### T01 — Drill WSL2 backup restore end-to-end
### T01 — Drill WSL2 State Hub backup restore
```task
id: T01
@@ -83,29 +110,23 @@ priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
```
Before touching anything, prove the current backup can actually be restored:
Take a fresh State Hub backup from the current WSL2 instance and restore it
into an isolated test PostgreSQL instance.
```bash
# In the-custodian/state-hub/
make backup # take fresh backup
# Spin up a test postgres container
docker run -d --name pg-restore-test -e POSTGRES_PASSWORD=test \
-p 5433:5432 postgres:16
# Decrypt and restore
age -d -i ~/.config/sops/age/keys.txt \
/opt/backup/custodian/state-hub-latest.sql.gz.age | \
gunzip | psql -h 127.0.0.1 -p 5433 -U postgres state_hub
# Spot-check: count topics
psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT COUNT(*) FROM topics;" state_hub
docker rm -f pg-restore-test
```
Minimum checks:
**Done when:** restore completes, topic count matches production, drill logged
in `memory/episodic/`.
- Restore completes without errors.
- Core table row counts match the live WSL2 database.
- `/state/summary` can be served from the restored copy if wired to a test API.
- Drill result is recorded in State Hub progress and, if useful, episodic
memory.
**Done when:** backup and restore are proven within 24 hours of live migration
work.
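The row-count check could be scripted so that the drill produces a reviewable list of mismatches instead of a visual comparison. This is a sketch only: `diff_row_counts` is a hypothetical helper, and it assumes the per-table counts have already been collected from both databases, ideally with exact `COUNT(*)` queries, since `n_live_tup` in `pg_stat_user_tables` is an estimate.

```python
from typing import Dict, List


def diff_row_counts(live: Dict[str, int], restored: Dict[str, int]) -> List[str]:
    """Compare two table -> row-count maps and describe every mismatch."""
    problems = []
    # Walk the union of table names so tables missing on either side are reported.
    for table in sorted(set(live) | set(restored)):
        live_n = live.get(table)
        restored_n = restored.get(table)
        if live_n is None:
            problems.append(
                f"{table}: present only in restored copy ({restored_n} rows)"
            )
        elif restored_n is None:
            problems.append(
                f"{table}: missing from restored copy ({live_n} rows live)"
            )
        elif live_n != restored_n:
            problems.append(f"{table}: live={live_n} restored={restored_n}")
    return problems
```

An empty result means the counts agree; any non-empty result should fail the drill. The same helper would apply to the WSL2-versus-cluster comparisons later in this workplan.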
---
### T02 — Helm chart for State Hub (new: railiance-platform)
### T02 — Align with Railiance deployment plan
```task
id: T02
@@ -114,34 +135,22 @@ priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
```
Create `helm/state-hub/` in `railiance-platform` (S3 layer owns platform
services). The chart must deploy:
Update the cross-repo plan so this Custodian workplan and
`RAIL-HO-WP-0004-T09` point to the same architecture.
- **FastAPI deployment** — image built from `the-custodian/state-hub/`,
1 replica initially (scale to 2 after T06)
- **PostgreSQL StatefulSet** — single instance backed by a Longhorn PVC
(minimum 5 Gi); HA not required here — Longhorn replication IS the HA
- **ClusterIP service** `state-hub` on port 8000
- **ConfigMap** for non-secret config (DB URL template, log level)
- **Secret** for DB credentials (SOPS-encrypted values file)
- **Liveness/readiness probe** — `GET /state/health`
Expected outputs:
Values:
```yaml
image:
repository: gitea.local/custodian/state-hub
tag: latest
postgres:
storageClass: longhorn
size: 5Gi
replicaCount: 1
```
- `RAIL-HO-WP-0004-T09` remains the Railiance-side execution task.
- This workplan remains the Custodian-side safety/cutover task list.
- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the
near-term migration plan.
- The future HA goal is referenced through `CUST-WP-0038`.
**Done when:** `helm lint` passes; chart committed in railiance-platform.
**Done when:** both workplans describe compatible responsibilities and gates.
---
### T03 — Build and push State Hub container image
### T03 — Build and publish State Hub container image
```task
id: T03
@@ -150,31 +159,22 @@ priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
```
Add `state-hub/Dockerfile` to the-custodian:
Package `state-hub/` as a production image.
```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen --no-dev
COPY api/ ./api/
COPY mcp_server/ ./mcp_server/
CMD ["uv", "run", "uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Requirements:
Build and push to the cluster-local Gitea registry:
- Dockerfile builds from the current Python/uv project.
- Alembic and runtime dependencies are available inside the image.
- Image exposes the FastAPI service on port 8000.
- Image tag is pushed to the chosen registry.
- Build provenance is documented in the commit/workplan.
```bash
docker build -t gitea.local/custodian/state-hub:latest .
docker push gitea.local/custodian/state-hub:latest
```
**Done when:** image available in Gitea registry; `helm install --dry-run`
resolves the image.
**Done when:** railiance01 can pull the image and a dry-run deployment resolves
it.
---
### T04 — Deploy to cluster and run Alembic migrations
### T04 — Define State Hub database and app manifests
```task
id: T04
@@ -183,26 +183,20 @@ priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
```
```bash
# From operator workstation via SSH port-forward to k3s API
helm install state-hub ./helm/state-hub/ \
-n custodian --create-namespace \
-f helm/state-hub/values-production.yaml
Create the cluster-side deployment assets using current Railiance boundaries:
# Wait for pods
kubectl -n custodian rollout status deployment/state-hub
- `railiance-platform`: `state-hub-db` cnpg cluster and database credentials.
- `railiance-apps`: State Hub Deployment, Service, ConfigMap, Secret/External
Secret reference, and optional private Ingress.
- Health probes use `GET /state/health`.
- Environment includes `DATABASE_URL` and any required API settings.
# Run migrations inside the pod
kubectl -n custodian exec -it deploy/state-hub -- \
uv run alembic upgrade head
```
**Done when:** pod Running, `/state/health` returns 200, Alembic reports
"head" from inside the pod.
**Done when:** manifests pass linting and a non-destructive dry-run apply, and
ownership boundaries are documented.
---
### T05 — Migrate data from WSL2 to cluster
### T05 — Deploy empty State Hub and run migrations on railiance01
```task
id: T05
@@ -211,33 +205,21 @@ priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
```
This is the point of no return for the DB — execute with care:
Deploy State Hub against an empty `state-hub-db` cnpg database and run Alembic
migrations in the cluster environment.
```bash
# 1. Take final WSL2 backup
make -C ~/the-custodian/state-hub backup
Checks:
# 2. Copy dump into the cluster postgres pod
kubectl -n custodian cp /tmp/state-hub-migration.sql \
$(kubectl -n custodian get pod -l app=state-hub-postgres -o name):/tmp/
- Pod reaches Ready.
- `/state/health` returns healthy through the intended private access path.
- Alembic reports head.
- Logs show no repeated crash/restart loop.
# 3. Restore
kubectl -n custodian exec -it deploy/state-hub-postgres -- \
psql -U postgres -d state_hub -f /tmp/state-hub-migration.sql
# 4. Spot-check counts match WSL2
kubectl -n custodian exec -it deploy/state-hub -- \
psql -c "SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC;"
```
**Rollback:** if counts differ, delete cluster DB data, re-run from T04.
WSL2 is still live and unchanged.
**Done when:** all table row counts match the WSL2 instance.
**Done when:** an empty but structurally valid State Hub runs on railiance01.
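The health check through the private access path could be automated with a small poller. The sketch below assumes that a healthy State Hub answers `GET /state/health` with HTTP 200 and that the tunnel is bound to `http://127.0.0.1:8000`; `http_status` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, timeout, DNS failure, tunnel down


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay: float = 3.0,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 or the attempts are used up."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to wait after the final attempt
    return False
```

A call such as `wait_until_healthy("http://127.0.0.1:8000/state/health")` would then serve as the Ready check once the tunnel is up. The `probe` and `sleep` parameters exist only so the loop can be exercised without a network.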
---
### T06 — Drill cluster backup restore
### T06 — Restore WSL2 data copy into cluster and compare
```task
id: T06
@@ -246,53 +228,49 @@ priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
```
Before cutting over, prove the cluster backup can be restored:
Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live
source of truth.
```bash
# Trigger a backup via the cluster cron (or manually)
kubectl -n custodian create job --from=cronjob/state-hub-backup backup-drill-01
Required comparison:
# Verify output in /opt/backup/ on the node holding the PVC
# Decrypt and restore to a test namespace
kubectl create ns restore-test
# ... restore steps similar to T01 but against cluster postgres
```
- Table row counts match.
- Representative workstreams, tasks, decisions, progress events, repos, and
token events are queryable.
- Dashboard and MCP summary calls return expected data through the cluster API.
- Any mismatch is investigated before proceeding.
**Done when:** restore drill passes; drill logged.
**Done when:** cluster data is a verified copy of WSL2, but not yet the only
writer.
---
### T07 — Cutover: redirect MCP config to cluster
### T07 — Cut over private access to cluster State Hub
```task
id: T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
needs_human: true
intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint."
```
Update the MCP config on every operator workstation (WSL2, COULOMBCORE) to
reach the cluster state hub via SSH port-forward instead of the local process.
With human approval, freeze WSL2 writes, take a final dump, restore it to the
cluster, compare counts again, and redirect the active private access path to
the cluster API.
The MCP server subprocess still runs locally (Python, same `server.py`).
Only the API endpoint it calls changes — via a persistent port-forward:
Accepted approaches:
```bash
# On operator workstation — keep this running (add to tunnel-daemon or tunnel-loop)
ssh -L 8000:state-hub.custodian.svc.cluster.local:8000 tegwick@92.205.62.239
```
- Keep local MCP config pointed at `http://127.0.0.1:8000` and move that port
to an ops-bridge/SSH tunnel.
- Or set the MCP server `API_BASE` to the chosen private cluster endpoint.
No change to `.mcp.json` needed — subprocess still calls `http://127.0.0.1:8000`.
Alternatively: update the MCP server's `API_BASE` env var to point directly
to the port-forward. Either approach is valid; document the chosen one.
**Done when:** `claude /mcp` shows `state-hub` connected; `get_state_summary()`
returns live cluster data.
**Done when:** `get_state_summary()` and dashboard live data are served by the
cluster State Hub, and WSL2 is no longer receiving normal writes.
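Both accepted approaches come down to the same rule on the client side: use `API_BASE` if it is set, otherwise fall back to the local tunnel address. How the real MCP server reads its configuration is not shown in this workplan, so the following is only an illustrative sketch with a hypothetical `resolve_api_base` helper.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

# Local address that an ops-bridge/SSH tunnel would bind, per this workplan.
DEFAULT_API_BASE = "http://127.0.0.1:8000"


def resolve_api_base(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the State Hub base URL from API_BASE, defaulting to the tunnel."""
    env = os.environ if env is None else env
    base = env.get("API_BASE", "").strip() or DEFAULT_API_BASE
    parsed = urlparse(base)
    # Reject values that would otherwise fail later with a confusing error.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"API_BASE is not a valid http(s) URL: {base!r}")
    return base.rstrip("/")
```

Keeping the default at `http://127.0.0.1:8000` means a workstation with no `API_BASE` set continues to work unchanged once the tunnel occupies that port.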
---
### T08 — Stabilisation period (2 weeks minimum)
### T08 — Stabilise with WSL2 retained as fallback
```task
id: T08
@@ -301,19 +279,23 @@ priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
```
Run the cluster state hub as the primary for two weeks before retiring WSL2:
Run the cluster State Hub as primary while keeping the WSL2 instance available
as a fallback.
- Keep WSL2 state hub running (but frozen — no writes) as DR fallback
- Monitor cluster pod restarts, storage health, backup cron
- Run `get_state_summary()` at the start of each session; confirm data is live
- Test failover: kill the FastAPI pod; verify it restarts and responds within 60s
Monitor:
**Done when:** two weeks elapsed with no data loss events; all backup drills
passed.
- State Hub pod restarts.
- cnpg cluster health.
- Backup job success.
- Dashboard and MCP behavior from each operator machine.
- Consistency sync behavior for file-backed workplans.
**Done when:** the agreed stabilisation window passes without data loss or
unresolved operational defects.
---
### T09 — Retire WSL2 instance
### T09 — Document operating model and defer final WSL2 retirement
```task
id: T09
@@ -322,25 +304,24 @@ priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
```
Once T08 stabilisation passes:
Document the new operating model:
1. Take a final WSL2 backup (archive, keep indefinitely)
2. Stop the WSL2 Docker container: `make -C ~/the-custodian/state-hub clean`
3. Update `CLAUDE.md` global and project to remove WSL2 state hub start instructions
4. Update MEMORY.md — state hub is now cluster-hosted
5. Record a decision in the state hub: "State Hub WSL2 instance retired"
- How agents reach State Hub.
- How backups and restores work.
- How to roll back to WSL2 if needed.
- Which parts remain pragmatic/single-node.
- Which long-term requirements moved to `CUST-WP-0038`.
**Done when:** WSL2 state hub no longer running; documentation updated.
Do not permanently retire WSL2 in this workplan unless a separate human
decision is recorded. Retirement belongs after proven stability or in the
future HA workplan.
---
**Done when:** runbooks and project instructions match the deployed reality.
## References
- Constitution constraint: irreversible actions require human approval — T05
(data migration) and T09 (WSL2 retirement) require explicit sign-off
- OAS layer: S3 Platform Services (railiance-platform)
- DR dependency: Longhorn storage (railiance-cluster WP to be linked)
- Extension point: EP-RAIL-005 (full-stack backup) — state hub must implement
`make backup` / `make restore` standard interface before T06
- Domain goal: `6f96c712-60e6-4ea9-ab06-168878eafbce` (Three-Phoenix Secure
Kubernetes Infrastructure)
- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
- `RAIL-HO-WP-0004-T09` — Railiance-side State Hub deployment task
- `CUST-WP-0038` — future full ThreePhoenix HA State Hub migration
- Constitution constraint: production data migration and fallback retirement
require explicit human approval

@@ -0,0 +1,246 @@
---
id: CUST-WP-0038
type: workplan
title: "State Hub Full ThreePhoenix HA Migration"
domain: custodian
repo: the-custodian
status: proposed
owner: custodian
topic_slug: custodian
created: "2026-05-02"
updated: "2026-05-02"
depends_on: CUST-WP-0011
state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
---
# State Hub Full ThreePhoenix HA Migration
## Goal
Preserve the original long-term State Hub infrastructure goal while
`CUST-WP-0011` takes the pragmatic railiance01 path.
This workplan completes the migration from a useful single-node cluster-hosted
State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes,
replicated storage, tested failover, tested restore, and retirement of the WSL2
fallback only after operational confidence is earned.
## Why This Exists
The near-term State Hub migration should not wait for every HA precondition,
because the workstation-hosted State Hub is already a bottleneck for
multi-machine work.
But the original requirement remains valid:
- State Hub is irreplaceable episodic memory.
- A single node is not a final home.
- Backup and restore must be drilled, not assumed.
- Long-term operations must survive node loss and operator-machine loss.
`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan
keeps the ultimate target visible and reviewable.
## Entry Criteria
- `CUST-WP-0011` completed or explicitly superseded.
- Cluster-hosted State Hub has passed its stabilisation period.
- railiance01 is not the only planned durable node.
- Railiance architecture decision for storage replication is current:
Longhorn, cnpg replication, external backup, or a documented replacement.
- Backup and restore tooling has an owner and runbook.
## Target Properties
- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03.
- State Hub database survives loss of one node.
- State Hub API recovers from pod loss without manual repair.
- Backups are encrypted, off-node, and restorable into a test namespace.
- Agent access remains private.
- WSL2 is no longer needed as the primary disaster-recovery fallback.
## Tasks
### T01 — Confirm ThreePhoenix cluster readiness
```task
id: T01
status: todo
priority: high
state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
```
Verify the target cluster state:
- Three nodes are joined and Ready.
- Control-plane and worker roles are documented.
- Cluster version and node resources are recorded.
- Smoke tests pass from the operator machine and from CoulombCore.
**Done when:** a current readiness report exists and no node is marked
NotReady or operationally unknown.
---
### T02 — Establish replicated storage/database strategy
```task
id: T02
status: todo
priority: high
state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
```
Choose and document the durable data strategy for State Hub:
- cnpg multi-instance PostgreSQL cluster, and/or
- Longhorn-backed storage with suitable replication, and/or
- another explicitly approved architecture.
The decision must define RPO, RTO, failover behavior, and restore procedure.
**Done when:** the selected architecture is documented and approved before any
production data movement.
---
### T03 — Implement HA State Hub database
```task
id: T03
status: todo
priority: high
state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
```
Apply the chosen database/storage architecture to State Hub.
Requirements:
- Database credentials remain SOPS/secret-managed.
- The database has automated backup configured.
- The database exposes a stable service endpoint for the API.
- Health and replication status are observable.
**Done when:** State Hub can run against the HA database in a test or staging
namespace.
---
### T04 — Add State Hub API high-availability behavior
```task
id: T04
status: todo
priority: medium
state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
```
Run State Hub API with the right availability posture for its workload:
- At least one replica, optionally more if DB/session behavior permits.
- Readiness and liveness probes.
- Rolling update behavior documented.
- Resource requests/limits set.
**Done when:** killing an API pod does not require manual recovery.
---
### T05 — Drill database failover
```task
id: T05
status: todo
priority: high
state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
```
Perform a controlled failover drill for the State Hub database.
Checks:
- Failure trigger is documented.
- API behavior during failover is observed.
- Recovery time is measured.
- No data loss is detected after recovery.
**Done when:** the failover drill passes and results are logged.
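Recovery time can be measured by starting a timer when the failure is triggered and polling until the service answers again. The sketch below is one possible shape for that measurement, not a prescribed tool: `measure_recovery` is a hypothetical helper, and `probe` stands for whatever health check the drill uses, for example a request to `/state/health`.

```python
import time
from typing import Callable, Optional


def measure_recovery(
    probe: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds until `probe()` first succeeds, or None on timeout.

    Call this immediately after triggering the failure. The result is only
    as precise as `interval`, because the service is sampled, not watched.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

A monotonic clock is used so that wall-clock adjustments during the drill cannot distort the measured duration.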
---
### T06 — Drill backup restore to isolated namespace
```task
id: T06
status: todo
priority: high
state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
```
Restore the latest encrypted State Hub backup into an isolated namespace or
separate test database.
Checks:
- Backup can be decrypted with the documented key path.
- Restore completes from off-node backup material.
- Row counts and representative records match.
- Restored API can serve `/state/health` and `/state/summary` when pointed at
the restored database.
**Done when:** restore drill passes without depending on the live database.
---
### T07 — Update agent access and runbooks for HA endpoint
```task
id: T07
status: todo
priority: medium
state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
```
Update the private access model after the HA endpoint is available:
- ops-bridge or tunnel target.
- MCP `API_BASE` or local port-forward convention.
- Dashboard access.
- Operator recovery instructions.
**Done when:** each active operator machine can reach the HA State Hub endpoint
through the documented path.
---
### T08 — Retire WSL2 fallback after explicit approval
```task
id: T08
status: todo
priority: low
needs_human: true
intervention_note: "Requires explicit approval after HA failover and restore drills pass."
state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
```
Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA
cluster path has passed drills.
Steps:
1. Take and archive a final WSL2 backup.
2. Stop local WSL2 State Hub services.
3. Update global and repo instructions.
4. Record the retirement decision in State Hub.
**Done when:** WSL2 is no longer part of the normal or fallback operating
model, and the cluster runbook is the source of truth.
## References
- `CUST-WP-0011` — pragmatic railiance01 migration
- Railiance ThreePhoenix infrastructure goal
- State Hub backup/restore runbooks
- Constitution constraint: irreversible retirement requires human approval