| id | type | title | domain | repo | status | owner | topic_slug | created | updated | depends_on | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CUST-WP-0038 | workplan | State Hub Full ThreePhoenix HA Migration | custodian | the-custodian | active | custodian | custodian | 2026-05-02 | 2026-05-02 | CUST-WP-0011 | 8d0c1b5d-44da-4b91-8357-e6526d3e0a85 |
State Hub Full ThreePhoenix HA Migration
Goal
Preserve the original long-term State Hub infrastructure goal while
CUST-WP-0011 takes the pragmatic railiance01 path.
This workplan completes the migration from a useful single-node cluster-hosted State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes, replicated storage, tested failover, tested restore, and retirement of the WSL2 fallback only after operational confidence is earned.
Why This Exists
The near-term State Hub migration should not wait for every HA precondition, because the workstation-hosted State Hub is already a bottleneck for multi-machine work.
But the original requirement remains valid:
- State Hub is irreplaceable episodic memory.
- A single node is not a final home.
- Backup and restore must be drilled, not assumed.
- Long-term operations must survive node loss and operator-machine loss.
CUST-WP-0011 moves State Hub to railiance01 pragmatically. This workplan
keeps the ultimate target visible and reviewable.
Entry Criteria
- CUST-WP-0011 completed or explicitly superseded.
- Cluster-hosted State Hub has passed its stabilisation period.
- railiance01 is not the only planned durable node.
- Railiance architecture decision for storage replication is current: Longhorn, cnpg replication, external backup, or a documented replacement.
- Backup and restore tooling has an owner and runbook.
Target Properties
- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03.
- State Hub database survives loss of one node.
- State Hub API recovers from pod loss without manual repair.
- Backups are encrypted, off-node, and restorable into a test namespace.
- Agent access remains private.
- WSL2 is no longer needed as the primary disaster-recovery fallback.
Tasks
T01 — Confirm ThreePhoenix cluster readiness
id: T01
status: todo
priority: high
state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
Verify the target cluster state:
- Three nodes are joined and Ready.
- Control-plane and worker roles are documented.
- Cluster version and node resources are recorded.
- Smoke tests pass from the operator machine and from CoulombCore.
Done when: a current readiness report exists and no node is marked NotReady or operationally unknown.
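The node checks above could be turned into a small readiness report. The following is a minimal sketch, assuming the input is the standard Kubernetes node list JSON (for example from `kubectl get nodes -o json`); the function names and the report shape are hypothetical, and the expected count of three comes from the Target Properties.

```python
import json


def node_ready_status(node: dict) -> str:
    """Return "True", "False", or "Unknown" from the node's Ready condition."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status", "Unknown")
    return "Unknown"  # node reported no Ready condition at all


def readiness_report(node_list_json: str, expected_nodes: int = 3) -> dict:
    """Summarise node readiness; anything other than Ready=True counts as a failure."""
    nodes = json.loads(node_list_json).get("items", [])
    statuses = {n["metadata"]["name"]: node_ready_status(n) for n in nodes}
    not_ready = sorted(name for name, s in statuses.items() if s != "True")
    return {
        "node_count": len(statuses),
        "not_ready": not_ready,
        "passed": len(statuses) == expected_nodes and not not_ready,
    }
```

Treating "Unknown" the same as "False" mirrors the done condition, which rejects nodes that are NotReady or operationally unknown.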
T02 — Establish replicated storage/database strategy
id: T02
status: todo
priority: high
state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
Choose and document the durable data strategy for State Hub:
- cnpg multi-instance PostgreSQL cluster, and/or
- Longhorn-backed storage with suitable replication, and/or
- another explicitly approved architecture.
The decision must define RPO, RTO, failover behavior, and restore procedure.
Done when: the selected architecture is documented and approved before any production data movement.
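A reviewer could check mechanically that a decision record covers the four required items before approving it. This is a minimal sketch, assuming the decision is captured as a simple record; the `DataStrategyDecision` class and its field names are hypothetical and not part of State Hub.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DataStrategyDecision:
    """Hypothetical record of the T02 decision."""
    architecture: str
    rpo_seconds: Optional[int] = None   # maximum tolerated data loss
    rto_seconds: Optional[int] = None   # maximum tolerated downtime
    failover_behavior: str = ""
    restore_procedure: str = ""


def missing_fields(decision: DataStrategyDecision) -> List[str]:
    """List fields that are still unset; an RPO or RTO of 0 counts as defined."""
    missing = []
    for f in fields(decision):
        value = getattr(decision, f.name)
        if value is None or value == "":
            missing.append(f.name)
    return missing
```

An empty result from `missing_fields` only shows that the record is complete; approval itself remains a human decision.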
T03 — Implement HA State Hub database
id: T03
status: todo
priority: high
state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
Apply the chosen database/storage architecture to State Hub.
Requirements:
- Database credentials remain SOPS/secret-managed.
- The database has automated backup configured.
- The database exposes a stable service endpoint for the API.
- Health and replication status are observable.
Done when: State Hub can run against the HA database in a test or staging namespace.
T04 — Add State Hub API high-availability behavior
id: T04
status: todo
priority: medium
state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
Run State Hub API with the right availability posture for its workload:
- At least one replica, optionally more if DB/session behavior permits.
- Readiness and liveness probes.
- Rolling update behavior documented.
- Resource requests/limits set.
Done when: killing an API pod does not require manual recovery.
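The checklist above can be verified against the API's Deployment manifest before rollout. The sketch below assumes the manifest has been loaded into a dictionary with the standard `apps/v1` Deployment layout; the function name `deployment_gaps` is hypothetical.

```python
from typing import List


def deployment_gaps(deployment: dict) -> List[str]:
    """Return human-readable gaps against the T04 checklist (empty list = none found)."""
    gaps = []
    spec = deployment.get("spec", {})
    # Kubernetes defaults replicas to 1 when the field is omitted.
    if spec.get("replicas", 1) < 1:
        gaps.append("replicas < 1")
    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    if not containers:
        gaps.append("no containers defined")
    for c in containers:
        name = c.get("name", "<unnamed>")
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in c:
                gaps.append(f"{name}: missing {probe}")
        resources = c.get("resources", {})
        for key in ("requests", "limits"):
            if not resources.get(key):
                gaps.append(f"{name}: missing resources.{key}")
    return gaps
```

This only confirms that the fields are present. Whether the probes and limits are suitable for the workload still needs the pod-kill test named in the done condition.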
T05 — Drill database failover
id: T05
status: todo
priority: high
state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
Perform a controlled failover drill for the State Hub database.
Checks:
- Failure trigger is documented.
- API behavior during failover is observed.
- Recovery time is measured.
- No data loss is detected after recovery.
Done when: the failover drill passes and results are logged.
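Recovery time can be measured by polling a health probe from before the failure is triggered until the service answers again. The following is a minimal sketch; `is_healthy` is a hypothetical callable supplied by the operator (for example a request to the API's health route), and the clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Optional


def measure_recovery(
    is_healthy: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Seconds from the first failed probe until the next healthy probe.

    Returns 0.0 if no outage was observed before the timeout, and None if
    an outage started but the service did not recover in time.
    """
    start = clock()
    outage_started = None
    while clock() - start < timeout_s:
        if is_healthy():
            if outage_started is not None:
                return clock() - outage_started
        elif outage_started is None:
            outage_started = clock()
        sleep(interval_s)
    return 0.0 if outage_started is None else None
```

The measured value is only as precise as the polling interval, so the interval should be well below the RTO being verified.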
T06 — Drill backup restore to isolated namespace
id: T06
status: todo
priority: high
state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
Restore the latest encrypted State Hub backup into an isolated namespace or separate test database.
Checks:
- Backup can be decrypted with the documented key path.
- Restore completes from off-node backup material.
- Row counts and representative records match.
- Restored API can serve /state/health and /state/summary when pointed at the restored database.
Done when: restore drill passes without depending on the live database.
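The row-count check can run without touching the live database if the expected counts are recorded when the backup is taken. A minimal sketch, assuming both sides are available as table-to-count mappings; the function name and the table names in the usage example are hypothetical.

```python
from typing import Dict, Optional, Tuple


def compare_row_counts(
    expected: Dict[str, int],
    restored: Dict[str, int],
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Map each mismatching table to (expected, restored); None marks a missing table."""
    mismatches = {}
    for table in sorted(set(expected) | set(restored)):
        want, got = expected.get(table), restored.get(table)
        if want != got:
            mismatches[table] = (want, got)
    return mismatches
```

For example, `compare_row_counts({"tasks": 10, "events": 5}, {"tasks": 10})` reports `events` as missing from the restore. Matching counts do not prove the contents are identical, which is why the checklist also asks for representative records.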
T07 — Update agent access and runbooks for HA endpoint
id: T07
status: todo
priority: medium
state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
Update the private access model after the HA endpoint is available:
- ops-bridge or tunnel target.
- MCP API_BASE or local port-forward convention.
- Dashboard access.
- Operator recovery instructions.
Done when: each active operator machine can reach the HA State Hub endpoint through the documented path.
T08 — Retire WSL2 fallback after explicit approval
id: T08
status: todo
priority: low
needs_human: true
intervention_note: "Requires explicit approval after HA failover and restore drills pass."
state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA cluster path has passed drills.
Steps:
- Take and archive a final WSL2 backup.
- Stop local WSL2 State Hub services.
- Update global and repo instructions.
- Record the retirement decision in State Hub.
Done when: WSL2 is no longer part of the normal or fallback operating model, and the cluster runbook is the source of truth.
References
- CUST-WP-0011: pragmatic railiance01 migration
- Railiance ThreePhoenix infrastructure goal
- State Hub backup/restore runbooks
- Constitution constraint: irreversible retirement requires human approval