---
id: CUST-WP-0038
type: workplan
title: "State Hub Full ThreePhoenix HA Migration"
domain: custodian
repo: state-hub
status: active
owner: custodian
topic_slug: custodian
created: "2026-05-02"
updated: "2026-05-17"
depends_on: CUST-WP-0011
state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
---

# State Hub Full ThreePhoenix HA Migration

## Goal

Preserve the original long-term State Hub infrastructure goal while `CUST-WP-0011` takes the pragmatic railiance01 path.

This workplan completes the migration from a useful single-node cluster-hosted State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes, replicated storage, tested failover, tested restore, and retirement of the WSL2 fallback only after operational confidence is earned.

## Why This Exists

The near-term State Hub migration should not wait for every HA precondition, because the workstation-hosted State Hub is already a bottleneck for multi-machine work. But the original requirement remains valid:

- State Hub is irreplaceable episodic memory.
- A single node is not a final home.
- Backup and restore must be drilled, not assumed.
- Long-term operations must survive node loss and operator-machine loss.

`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan keeps the ultimate target visible and reviewable.

## Entry Criteria

- `CUST-WP-0011` completed or explicitly superseded.
- Cluster-hosted State Hub has passed its stabilisation period.
- railiance01 is not the only planned durable node.
- Railiance architecture decision for storage replication is current: Longhorn, cnpg replication, external backup, or a documented replacement.
- Backup and restore tooling has an owner and runbook.

## Target Properties

- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03.
- State Hub database survives loss of one node.
- State Hub API recovers from pod loss without manual repair.
- Backups are encrypted, off-node, and restorable into a test namespace.
- Agent access remains private.
- WSL2 is no longer needed as the primary disaster-recovery fallback.

## Tasks

### T01 — Confirm ThreePhoenix cluster readiness

```task
id: CUST-WP-0038-T01
status: todo
priority: high
state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
```

Verify the target cluster state:

- Three nodes are joined and Ready.
- Control-plane and worker roles are documented.
- Cluster version and node resources are recorded.
- Smoke tests pass from the operator machine and from CoulombCore.

**Done when:** a current readiness report exists and no node is marked NotReady or operationally unknown.

---

### T02 — Establish replicated storage/database strategy

```task
id: CUST-WP-0038-T02
status: todo
priority: high
state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
```

Choose and document the durable data strategy for State Hub:

- cnpg multi-instance PostgreSQL cluster, and/or
- Longhorn-backed storage with suitable replication, and/or
- another explicitly approved architecture.

The decision must define RPO, RTO, failover behavior, and restore procedure.

**Done when:** the selected architecture is documented and approved before any production data movement.

---

### T03 — Implement HA State Hub database

```task
id: CUST-WP-0038-T03
status: todo
priority: high
state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
```

Apply the chosen database/storage architecture to State Hub.

Requirements:

- Database credentials remain SOPS/secret-managed.
- The database has automated backup configured.
- The database exposes a stable service endpoint for the API.
- Health and replication status are observable.

**Done when:** State Hub can run against the HA database in a test or staging namespace.

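The T03 requirement that health and replication status be observable can be illustrated with a minimal sketch. It assumes the T02 decision lands on a primary/replica PostgreSQL layout, which this workplan has not yet decided. `InstanceStatus`, `summarize_replication`, the field names, and the 30-second lag threshold are all hypothetical; they are not cnpg or Longhorn APIs, and the real threshold should follow from the RPO defined in T02.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceStatus:
    """Hypothetical snapshot of one database instance (not a real cnpg or Longhorn type)."""
    name: str
    role: str                        # "primary" or "replica"
    ready: bool
    replay_lag_seconds: float = 0.0  # only meaningful for replicas


def summarize_replication(instances, max_lag_seconds=30.0, min_ready_replicas=1):
    """Return (healthy, problems) for a list of InstanceStatus snapshots.

    The defaults are placeholders; the real limits belong to the T02 decision.
    """
    problems = []

    # Exactly one ready primary is expected in a primary/replica layout.
    primaries = [i for i in instances if i.role == "primary" and i.ready]
    if len(primaries) != 1:
        problems.append(f"expected exactly 1 ready primary, found {len(primaries)}")

    # At least one ready replica is needed to survive the loss of one node.
    ready_replicas = [i for i in instances if i.role == "replica" and i.ready]
    if len(ready_replicas) < min_ready_replicas:
        problems.append(
            f"expected at least {min_ready_replicas} ready replica(s), "
            f"found {len(ready_replicas)}"
        )

    # A replica that lags too far behind would lose data on failover.
    for replica in ready_replicas:
        if replica.replay_lag_seconds > max_lag_seconds:
            problems.append(
                f"{replica.name} replay lag {replica.replay_lag_seconds:.0f}s "
                f"exceeds {max_lag_seconds:.0f}s"
            )

    return (not problems, problems)
```

A check of this shape could feed the readiness report from T01 or the failover drill log from T05. How the snapshots are collected depends on the architecture chosen in T02 and is left open here.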

---

### T04 — Add State Hub API high-availability behavior

```task
id: CUST-WP-0038-T04
status: todo
priority: medium
state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
```

Run State Hub API with the right availability posture for its workload:

- At least one replica, optionally more if DB/session behavior permits.
- Readiness and liveness probes.
- Rolling update behavior documented.
- Resource requests/limits set.

**Done when:** killing an API pod does not require manual recovery.

---

### T05 — Drill database failover

```task
id: CUST-WP-0038-T05
status: todo
priority: high
state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
```

Perform a controlled failover drill for the State Hub database.

Checks:

- Failure trigger is documented.
- API behavior during failover is observed.
- Recovery time is measured.
- No data loss is detected after recovery.

**Done when:** the failover drill passes and results are logged.

---

### T06 — Drill backup restore to isolated namespace

```task
id: CUST-WP-0038-T06
status: todo
priority: high
state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
```

Restore the latest encrypted State Hub backup into an isolated namespace or separate test database.

Checks:

- Backup can be decrypted with the documented key path.
- Restore completes from off-node backup material.
- Row counts and representative records match.
- Restored API can serve `/state/health` and `/state/summary` when pointed at the restored database.

**Done when:** restore drill passes without depending on the live database.

---

### T07 — Update agent access and runbooks for HA endpoint

```task
id: CUST-WP-0038-T07
status: todo
priority: medium
state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
```

Update the private access model after the HA endpoint is available:

- ops-bridge or tunnel target.
- MCP `API_BASE` or local port-forward convention.
- Dashboard access.
- Operator recovery instructions.


**Done when:** each active operator machine can reach the HA State Hub endpoint through the documented path.

---

### T08 — Retire WSL2 fallback after explicit approval

```task
id: CUST-WP-0038-T08
status: todo
priority: low
needs_human: true
intervention_note: "Requires explicit approval after HA failover and restore drills pass."
state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
```

Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA cluster path has passed drills.

Steps:

1. Take and archive a final WSL2 backup.
2. Stop local WSL2 State Hub services.
3. Update global and repo instructions.
4. Record the retirement decision in State Hub.

**Done when:** WSL2 is no longer part of the normal or fallback operating model, and the cluster runbook is the source of truth.

## References

- `CUST-WP-0011` — pragmatic railiance01 migration
- Railiance ThreePhoenix infrastructure goal
- State Hub backup/restore runbooks
- Constitution constraint: irreversible retirement requires human approval