---
id: CUST-WP-0038
type: workplan
title: "State Hub Full ThreePhoenix HA Migration"
domain: custodian
repo: state-hub
status: active
owner: custodian
topic_slug: custodian
created: "2026-05-02"
updated: "2026-05-17"
depends_on: CUST-WP-0011
state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
---

# State Hub Full ThreePhoenix HA Migration

## Goal

Preserve the original long-term State Hub infrastructure goal while
`CUST-WP-0011` takes the pragmatic railiance01 path.

This workplan completes the migration from a useful single-node cluster-hosted
State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes,
replicated storage, tested failover, tested restore, and retirement of the WSL2
fallback only after operational confidence is earned.

## Why This Exists

The near-term State Hub migration should not wait for every HA precondition,
because the workstation-hosted State Hub is already a bottleneck for
multi-machine work.

But the original requirement remains valid:

- State Hub is irreplaceable episodic memory.
- A single node is not a final home.
- Backup and restore must be drilled, not assumed.
- Long-term operations must survive node loss and operator-machine loss.

`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan
keeps the ultimate target visible and reviewable.

## Entry Criteria

- `CUST-WP-0011` completed or explicitly superseded.
- Cluster-hosted State Hub has passed its stabilisation period.
- railiance01 is not the only planned durable node.
- Railiance architecture decision for storage replication is current:
  Longhorn, cnpg replication, external backup, or a documented replacement.
- Backup and restore tooling has an owner and runbook.

## Target Properties

- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03.
- State Hub database survives loss of one node.
- State Hub API recovers from pod loss without manual repair.
- Backups are encrypted, off-node, and restorable into a test namespace.
- Agent access remains private.
- WSL2 is no longer needed as the primary disaster-recovery fallback.

## Tasks

### T01 — Confirm ThreePhoenix cluster readiness

```task
id: CUST-WP-0038-T01
status: todo
priority: high
state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
```

Verify the target cluster state:

- Three nodes are joined and Ready.
- Control-plane and worker roles are documented.
- Cluster version and node resources are recorded.
- Smoke tests pass from the operator machine and from CoulombCore.

**Done when:** a current readiness report exists and no node is marked
NotReady or operationally unknown.
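One way to turn the node check into a repeatable report is sketched below. It assumes the input is the JSON printed by `kubectl get nodes -o json`; the function names `node_readiness` and `cluster_is_ready` are hypothetical helpers, not part of any existing State Hub tooling.

```python
import json


def node_readiness(nodes_json: str) -> dict:
    """Map each node name to "Ready", "NotReady", or "Unknown"."""
    report = {}
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        # Each node carries a condition of type "Ready" whose status is
        # "True", "False", or "Unknown".
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        status = ready.get("status") if ready else "Unknown"
        report[name] = {"True": "Ready", "False": "NotReady"}.get(status, "Unknown")
    return report


def cluster_is_ready(report: dict, expected_nodes: int = 3) -> bool:
    """True only if at least `expected_nodes` nodes exist and all are Ready."""
    return len(report) >= expected_nodes and all(
        state == "Ready" for state in report.values()
    )
```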

---

### T02 — Establish replicated storage/database strategy

```task
id: CUST-WP-0038-T02
status: todo
priority: high
state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
```

Choose and document the durable data strategy for State Hub:

- cnpg multi-instance PostgreSQL cluster, and/or
- Longhorn-backed storage with suitable replication, and/or
- another explicitly approved architecture.

The decision must define RPO, RTO, failover behavior, and restore procedure.

**Done when:** the selected architecture is documented and approved before any
production data movement.
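The required contents of the decision can be captured as a small record so that an incomplete decision is easy to spot. This is only a sketch: `DurabilityDecision`, its field names, and `missing_fields` are hypothetical and do not describe an existing State Hub schema.

```python
from dataclasses import dataclass, fields
from datetime import timedelta
from typing import Optional


@dataclass
class DurabilityDecision:
    """What the T02 decision must define before it can be approved."""
    architecture: str = ""
    rpo: Optional[timedelta] = None   # maximum tolerated data loss
    rto: Optional[timedelta] = None   # maximum tolerated downtime
    failover_behavior: str = ""
    restore_procedure: str = ""
    approved_by: str = ""


def missing_fields(decision: DurabilityDecision) -> list:
    """Names of fields still unset. An RPO of zero counts as set."""
    return [
        f.name for f in fields(decision)
        if getattr(decision, f.name) in (None, "")
    ]
```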

---

### T03 — Implement HA State Hub database

```task
id: CUST-WP-0038-T03
status: todo
priority: high
state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
```

Apply the chosen database/storage architecture to State Hub.

Requirements:

- Database credentials remain SOPS/secret-managed.
- The database has automated backup configured.
- The database exposes a stable service endpoint for the API.
- Health and replication status are observable.

**Done when:** State Hub can run against the HA database in a test or staging
namespace.

---

### T04 — Add State Hub API high-availability behavior

```task
id: CUST-WP-0038-T04
status: todo
priority: medium
state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
```

Run State Hub API with the right availability posture for its workload:

- At least one replica, optionally more if DB/session behavior permits.
- Readiness and liveness probes.
- Rolling update behavior documented.
- Resource requests/limits set.

**Done when:** killing an API pod does not require manual recovery.
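The probe and resource items in the list above can be checked mechanically against a Deployment manifest. The sketch below assumes the manifest has already been loaded into a dict (for example from `kubectl get deployment <name> -o json`); `availability_gaps` is a hypothetical helper, and it does not cover the rolling-update documentation item.

```python
def availability_gaps(deployment: dict) -> list:
    """Return human-readable gaps in a Deployment's availability settings."""
    gaps = []
    spec = deployment.get("spec", {})
    # Kubernetes defaults `replicas` to 1 when the field is omitted.
    if spec.get("replicas", 1) < 1:
        gaps.append("replicas is below 1")
    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    if not containers:
        gaps.append("no containers defined")
    for container in containers:
        name = container.get("name", "<unnamed>")
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                gaps.append(f"{name}: missing {probe}")
        resources = container.get("resources", {})
        for key in ("requests", "limits"):
            if not resources.get(key):
                gaps.append(f"{name}: missing resources.{key}")
    return gaps
```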

---

### T05 — Drill database failover

```task
id: CUST-WP-0038-T05
status: todo
priority: high
state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
```

Perform a controlled failover drill for the State Hub database.

Checks:

- Failure trigger is documented.
- API behavior during failover is observed.
- Recovery time is measured.
- No data loss is detected after recovery.

**Done when:** the failover drill passes and results are logged.
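Recovery time can be derived from a log of periodic health probes taken during the drill. The sketch below assumes each sample is a `(timestamp, healthy)` pair in time order; how the samples are collected is left open, and `recovery_time` is a hypothetical helper. The result is only as precise as the probe interval.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def recovery_time(
    samples: Iterable[Tuple[datetime, bool]],
) -> Optional[timedelta]:
    """Time from the first failed probe to the next healthy probe.

    Returns None if no outage was observed. Raises RuntimeError if the
    service failed and never became healthy again in the sampled window.
    """
    outage_start = None
    for timestamp, healthy in samples:
        if not healthy and outage_start is None:
            outage_start = timestamp
        elif healthy and outage_start is not None:
            return timestamp - outage_start
    if outage_start is not None:
        raise RuntimeError("service never recovered within the sampled window")
    return None
```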

---

### T06 — Drill backup restore to isolated namespace

```task
id: CUST-WP-0038-T06
status: todo
priority: high
state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
```

Restore the latest encrypted State Hub backup into an isolated namespace or
separate test database.

Checks:

- Backup can be decrypted with the documented key path.
- Restore completes from off-node backup material.
- Row counts and representative records match.
- Restored API can serve `/state/health` and `/state/summary` when pointed at
  the restored database.

**Done when:** restore drill passes without depending on the live database.
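The row-count check can be expressed as a comparison of two per-table count maps. This sketch assumes the expected counts were recorded when the backup was taken, so the comparison does not touch the live database; `row_count_mismatches` is a hypothetical helper and the table names are whatever the State Hub schema defines.

```python
def row_count_mismatches(expected: dict, restored: dict) -> dict:
    """Map each differing table to (expected_count, restored_count).

    A table missing on one side is reported with None for that side.
    An empty result means every table matched.
    """
    tables = sorted(set(expected) | set(restored))
    return {
        table: (expected.get(table), restored.get(table))
        for table in tables
        if expected.get(table) != restored.get(table)
    }
```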

---

### T07 — Update agent access and runbooks for HA endpoint

```task
id: CUST-WP-0038-T07
status: todo
priority: medium
state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
```

Update the private access model after the HA endpoint is available:

- ops-bridge or tunnel target.
- MCP `API_BASE` or local port-forward convention.
- Dashboard access.
- Operator recovery instructions.

**Done when:** each active operator machine can reach the HA State Hub endpoint
through the documented path.

---

### T08 — Retire WSL2 fallback after explicit approval

```task
id: CUST-WP-0038-T08
status: todo
priority: low
needs_human: true
intervention_note: "Requires explicit approval after HA failover and restore drills pass."
state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
```

Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA
cluster path has passed drills.

Steps:

1. Take and archive a final WSL2 backup.
2. Stop local WSL2 State Hub services.
3. Update global and repo instructions.
4. Record the retirement decision in State Hub.

**Done when:** WSL2 is no longer part of the normal or fallback operating
model, and the cluster runbook is the source of truth.
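Because retirement is irreversible and gated on both drills plus human approval, the precondition can be modelled as an explicit gate that is evaluated before step 1 begins. This is a sketch only; `RetirementGate` and its field names are hypothetical and not an existing State Hub structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetirementGate:
    """Preconditions for retiring the WSL2 fallback."""
    failover_drill_passed: bool
    restore_drill_passed: bool
    human_approved: bool

    def blockers(self) -> list:
        """Reasons retirement must not proceed yet (empty if none)."""
        checks = {
            "failover drill has not passed": self.failover_drill_passed,
            "restore drill has not passed": self.restore_drill_passed,
            "explicit human approval is missing": self.human_approved,
        }
        return [reason for reason, ok in checks.items() if not ok]

    def may_retire(self) -> bool:
        return not self.blockers()
```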

## References

- `CUST-WP-0011` — pragmatic railiance01 migration
- Railiance ThreePhoenix infrastructure goal
- State Hub backup/restore runbooks
- Constitution constraint: irreversible retirement requires human approval