---
id: CUST-WP-0038
type: workplan
title: "State Hub Full ThreePhoenix HA Migration"
domain: custodian
repo: state-hub
status: active
owner: custodian
topic_slug: custodian
created: "2026-05-02"
updated: "2026-05-17"
depends_on: CUST-WP-0011
state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
---
# State Hub Full ThreePhoenix HA Migration
## Goal
Preserve the original long-term State Hub infrastructure goal while
`CUST-WP-0011` takes the pragmatic railiance01 path.
This workplan completes the migration from a useful single-node cluster-hosted
State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes,
replicated storage, tested failover, tested restore, and retirement of the WSL2
fallback only after operational confidence is earned.
## Why This Exists
The near-term State Hub migration should not wait for every HA precondition,
because the workstation-hosted State Hub is already a bottleneck for
multi-machine work.
The original requirements remain valid, however:
- State Hub is irreplaceable episodic memory.
- A single node is not a final home.
- Backup and restore must be drilled, not assumed.
- Long-term operations must survive node loss and operator-machine loss.

`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan
keeps the ultimate target visible and reviewable.
## Entry Criteria
- `CUST-WP-0011` completed or explicitly superseded.
- Cluster-hosted State Hub has passed its stabilisation period.
- railiance01 is not the only planned durable node.
- Railiance architecture decision for storage replication is current:
Longhorn, cnpg replication, external backup, or a documented replacement.
- Backup and restore tooling has an owner and runbook.
## Target Properties
- Three healthy Kubernetes nodes: railiance01, railiance02, railiance03.
- State Hub database survives loss of one node.
- State Hub API recovers from pod loss without manual repair.
- Backups are encrypted, off-node, and restorable into a test namespace.
- Agent access remains private.
- WSL2 is no longer needed as the primary disaster-recovery fallback.
## Tasks
### T01 — Confirm ThreePhoenix cluster readiness
```task
id: CUST-WP-0038-T01
status: todo
priority: high
state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
```
Verify the target cluster state:
- Three nodes are joined and Ready.
- Control-plane and worker roles are documented.
- Cluster version and node resources are recorded.
- Smoke tests pass from the operator machine and from CoulombCore.
**Done when:** a current readiness report exists and no node is marked
NotReady or operationally unknown.
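
The Ready check above could be automated along these lines. This is a minimal sketch, assuming the readiness report is built from the JSON printed by `kubectl get nodes -o json`; the function name and the sample data are hypothetical and not part of any existing State Hub tooling.

```python
def not_ready_nodes(node_list: dict) -> list[str]:
    """Return names of nodes whose Ready condition is not "True"."""
    problems = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition is treated as "operationally unknown".
        if ready is None or ready.get("status") != "True":
            problems.append(name)
    return problems


# Hypothetical sample in the shape of `kubectl get nodes -o json`.
sample = {
    "items": [
        {"metadata": {"name": "railiance01"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "railiance02"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
        {"metadata": {"name": "railiance03"}, "status": {}},
    ]
}
print(not_ready_nodes(sample))  # ['railiance02', 'railiance03']
```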
---
### T02 — Establish replicated storage/database strategy
```task
id: CUST-WP-0038-T02
status: todo
priority: high
state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
```
Choose and document the durable data strategy for State Hub:
- cnpg multi-instance PostgreSQL cluster, and/or
- Longhorn-backed storage with suitable replication, and/or
- another explicitly approved architecture.
The decision must define RPO, RTO, failover behavior, and restore procedure.
**Done when:** the selected architecture is documented and approved before any
production data movement.
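
One way to keep the decision reviewable is to record it as structured data and refuse to treat it as complete while any required field is empty. The sketch below is illustrative only: the `StorageDecision` class and its field names are assumptions derived from the list above, not an existing State Hub schema.

```python
from dataclasses import dataclass, fields


@dataclass
class StorageDecision:
    """Hypothetical record of the T02 architecture decision."""
    architecture: str
    rpo: str
    rto: str
    failover_behavior: str
    restore_procedure: str
    approved_by: str = ""


def missing_fields(decision: StorageDecision) -> list[str]:
    """List the fields that are still blank, in declaration order."""
    return [f.name for f in fields(decision)
            if not getattr(decision, f.name).strip()]


# A draft that names an architecture but defines nothing else yet.
draft = StorageDecision(
    architecture="cnpg multi-instance PostgreSQL cluster",
    rpo="", rto="", failover_behavior="", restore_procedure="",
)
print(missing_fields(draft))
```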
---
### T03 — Implement HA State Hub database
```task
id: CUST-WP-0038-T03
status: todo
priority: high
state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
```
Apply the chosen database/storage architecture to State Hub.
Requirements:
- Database credentials remain SOPS/secret-managed.
- The database has automated backup configured.
- The database exposes a stable service endpoint for the API.
- Health and replication status are observable.
**Done when:** State Hub can run against the HA database in a test or staging
namespace.
---
### T04 — Add State Hub API high-availability behavior
```task
id: CUST-WP-0038-T04
status: todo
priority: medium
state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
```
Run State Hub API with the right availability posture for its workload:
- At least one replica, optionally more if DB/session behavior permits.
- Readiness and liveness probes.
- Rolling update behavior documented.
- Resource requests/limits set.
**Done when:** killing an API pod does not require manual recovery.
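
The probe and resource items above can be checked against a Deployment manifest before rollout. This is a rough sketch that assumes the manifest has already been loaded into a Python dict (for example from YAML); the container name, port, and resource values in the sample are hypothetical.

```python
def availability_gaps(deployment: dict) -> list[str]:
    """Report missing probes and resource settings in a Deployment dict."""
    gaps = []
    spec = deployment.get("spec", {})
    # Kubernetes defaults to one replica when the field is omitted.
    if spec.get("replicas", 1) < 1:
        gaps.append("replicas is below 1")
    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        name = container["name"]
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                gaps.append(f"{name}: missing {probe}")
        resources = container.get("resources", {})
        for key in ("requests", "limits"):
            if not resources.get(key):
                gaps.append(f"{name}: missing resources.{key}")
    return gaps


# Hypothetical manifest fragment; names and values are placeholders.
manifest = {
    "spec": {
        "replicas": 1,
        "template": {"spec": {"containers": [{
            "name": "state-hub-api",
            "readinessProbe": {"httpGet": {"path": "/state/health", "port": 8000}},
            "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
        }]}},
    }
}
print(availability_gaps(manifest))
```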
---
### T05 — Drill database failover
```task
id: CUST-WP-0038-T05
status: todo
priority: high
state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
```
Perform a controlled failover drill for the State Hub database.
Checks:
- Failure trigger is documented.
- API behavior during failover is observed.
- Recovery time is measured.
- No data loss is detected after recovery.
**Done when:** the failover drill passes and results are logged.
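
Recovery time can be derived from health probes sampled at a fixed interval during the drill. The following sketch assumes each sample is a `(timestamp_seconds, healthy)` pair collected by some external poller; the function and the sample values are hypothetical. The result is accurate only to within one polling interval.

```python
def longest_outage(samples: list[tuple[float, bool]]) -> float:
    """Longest span from a first failed probe to the next successful one.

    `samples` must be in chronological order.
    """
    longest = 0.0
    outage_start = None
    for timestamp, healthy in samples:
        if not healthy and outage_start is None:
            outage_start = timestamp          # outage begins
        elif healthy and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None               # outage ends
    if outage_start is not None:
        raise ValueError("service did not recover within the sample window")
    return longest


# Hypothetical drill: probes every 5 s, failures from t=10 s until t=25 s.
drill = [(0.0, True), (5.0, True), (10.0, False), (15.0, False),
         (20.0, False), (25.0, True), (30.0, True)]
print(longest_outage(drill))  # 15.0
```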
---
### T06 — Drill backup restore to isolated namespace
```task
id: CUST-WP-0038-T06
status: todo
priority: high
state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
```
Restore the latest encrypted State Hub backup into an isolated namespace or
separate test database.
Checks:
- Backup can be decrypted with the documented key path.
- Restore completes from off-node backup material.
- Row counts and representative records match.
- Restored API can serve `/state/health` and `/state/summary` when pointed at
the restored database.
**Done when:** restore drill passes without depending on the live database.
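
The row-count check can be kept independent of the live database by comparing the restored database against counts recorded when the backup was taken. This is a sketch under that assumption; the table names and numbers are made up for illustration, and how the counts are collected is left open.

```python
def row_count_mismatches(expected: dict, restored: dict) -> dict:
    """Map each differing table to (expected_count, restored_count).

    A table missing on either side shows up with None for that side.
    """
    mismatches = {}
    for table in sorted(set(expected) | set(restored)):
        want, got = expected.get(table), restored.get(table)
        if want != got:
            mismatches[table] = (want, got)
    return mismatches


# Hypothetical counts: `expected` was recorded at backup time.
expected = {"tasks": 120, "workstreams": 14, "events": 3051}
restored = {"tasks": 120, "workstreams": 14, "events": 3049}
print(row_count_mismatches(expected, restored))  # {'events': (3051, 3049)}
```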
---
### T07 — Update agent access and runbooks for HA endpoint
```task
id: CUST-WP-0038-T07
status: todo
priority: medium
state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
```
Update the private access model after the HA endpoint is available:
- ops-bridge or tunnel target.
- MCP `API_BASE` or local port-forward convention.
- Dashboard access.
- Operator recovery instructions.
**Done when:** each active operator machine can reach the HA State Hub endpoint
through the documented path.
---
### T08 — Retire WSL2 fallback after explicit approval
```task
id: CUST-WP-0038-T08
status: todo
priority: low
needs_human: true
intervention_note: "Requires explicit approval after HA failover and restore drills pass."
state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
```
Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA
cluster path has passed drills.
Steps:
1. Take and archive a final WSL2 backup.
2. Stop local WSL2 State Hub services.
3. Update global and repo instructions.
4. Record the retirement decision in State Hub.
**Done when:** WSL2 is no longer part of the normal or fallback operating
model, and the cluster runbook is the source of truth.
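
The preconditions for retirement can be made explicit as a gate that lists every remaining blocker before any service is stopped. The sketch below is illustrative only: the `RetirementGate` class and its fields are assumptions based on the task description above, and the approval itself still has to come from a human.

```python
from dataclasses import dataclass


@dataclass
class RetirementGate:
    """Hypothetical checklist evaluated before stopping WSL2 services."""
    failover_drill_passed: bool
    restore_drill_passed: bool
    human_approved: bool
    final_backup_archived: bool

    def blockers(self) -> list[str]:
        """Return the reasons retirement must not proceed yet."""
        checks = {
            "failover drill has not passed": self.failover_drill_passed,
            "restore drill has not passed": self.restore_drill_passed,
            "explicit human approval is missing": self.human_approved,
            "final WSL2 backup is not archived": self.final_backup_archived,
        }
        return [reason for reason, ok in checks.items() if not ok]


gate = RetirementGate(
    failover_drill_passed=True,
    restore_drill_passed=True,
    human_approved=False,
    final_backup_archived=True,
)
print(gate.blockers())  # ['explicit human approval is missing']
```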
## References
- `CUST-WP-0011` — pragmatic railiance01 migration
- Railiance ThreePhoenix infrastructure goal
- State Hub backup/restore runbooks
- Constitution constraint: irreversible retirement requires human approval