state-hub/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md

---
id: CUST-WP-0038
type: workplan
title: "State Hub Full ThreePhoenix HA Migration"
domain: custodian
repo: state-hub
status: active
owner: custodian
topic_slug: custodian
created: "2026-05-02"
updated: "2026-05-17"
depends_on: CUST-WP-0011
state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
---

# State Hub Full ThreePhoenix HA Migration

## Goal

Preserve the original long-term State Hub infrastructure goal while
`CUST-WP-0011` takes the pragmatic railiance01 path.

This workplan completes the migration from a useful single-node cluster-hosted
State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes,
replicated storage, tested failover, tested restore, and retirement of the WSL2
fallback only after operational confidence is earned.

## Why This Exists

The near-term State Hub migration should not wait for every HA precondition,
because the workstation-hosted State Hub is already a bottleneck for
multi-machine work.

The original requirements remain valid:

- State Hub is irreplaceable episodic memory.
- A single node is not a final home.
- Backup and restore must be drilled, not assumed.
- Long-term operations must survive node loss and operator-machine loss.

`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan
keeps the ultimate target visible and reviewable.

## Entry Criteria

- `CUST-WP-0011` completed or explicitly superseded.
- Cluster-hosted State Hub has passed its stabilisation period.
- railiance01 is not the only planned durable node.
- Railiance architecture decision for storage replication is current:
  Longhorn, cnpg replication, external backup, or a documented replacement.
- Backup and restore tooling has an owner and runbook.

## Target Properties

- Three healthy Kubernetes nodes: railiance01, railiance02, railiance03.
- State Hub database survives loss of one node.
- State Hub API recovers from pod loss without manual repair.
- Backups are encrypted, off-node, and restorable into a test namespace.
- Agent access remains private.
- WSL2 is no longer needed as the primary disaster-recovery fallback.

## Tasks

### T01 — Confirm ThreePhoenix cluster readiness

```task
id: CUST-WP-0038-T01
status: todo
priority: high
state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
```

Verify the target cluster state:

- Three nodes are joined and Ready.
- Control-plane and worker roles are documented.
- Cluster version and node resources are recorded.
- Smoke tests pass from the operator machine and from CoulombCore.

**Done when:** a current readiness report exists and no node is marked
NotReady or operationally unknown.
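
As a minimal sketch of the "no node is NotReady or unknown" check, the function below inspects a node list shaped like the output of `kubectl get nodes -o json`. The sample data is illustrative only, and treating a missing `Ready` condition as a problem is an assumption about what "operationally unknown" should mean here.

```python
def unready_nodes(node_list: dict) -> list[str]:
    """Return names of nodes whose Ready condition is not "True"."""
    problems = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # Assumption: a node without a Ready condition counts as unknown.
        if ready is None or ready.get("status") != "True":
            problems.append(name)
    return problems


# Illustrative sample, not real cluster output.
sample = {
    "items": [
        {"metadata": {"name": "railiance01"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "railiance02"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "railiance03"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
}

problems = unready_nodes(sample)
print(problems)
```

In a real run the dictionary would come from parsing the `kubectl` JSON output; an empty result is one input to the readiness report, not a substitute for it.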

---

### T02 — Establish replicated storage/database strategy

```task
id: CUST-WP-0038-T02
status: todo
priority: high
state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
```

Choose and document the durable data strategy for State Hub:

- cnpg multi-instance PostgreSQL cluster, and/or
- Longhorn-backed storage with suitable replication, and/or
- another explicitly approved architecture.

The decision must define RPO, RTO, failover behavior, and restore procedure.

**Done when:** the selected architecture is documented and approved before any
production data movement.
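
One way to keep the decision reviewable is to record it as structured data and check that nothing is left blank. The sketch below is hypothetical: the `StorageDecision` class, its field names, and the example values are assumptions for illustration, not an existing State Hub schema.

```python
from dataclasses import dataclass, fields


@dataclass
class StorageDecision:
    """Hypothetical record of the T02 decision; every field must be filled."""
    architecture: str = ""
    rpo: str = ""
    rto: str = ""
    failover_behavior: str = ""
    restore_procedure: str = ""
    approved_by: str = ""


def missing_fields(decision: StorageDecision) -> list[str]:
    """Return the names of fields that are still empty or whitespace."""
    return [f.name for f in fields(decision)
            if not getattr(decision, f.name).strip()]


# Illustrative draft; the values are placeholders, not a recommendation.
draft = StorageDecision(
    architecture="cnpg multi-instance PostgreSQL cluster",
    rpo="placeholder: maximum tolerated data loss",
    failover_behavior="placeholder: automatic promotion of a replica",
)

print(missing_fields(draft))
```

A draft with any missing field would not meet the "documented and approved" bar above.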

---

### T03 — Implement HA State Hub database

```task
id: CUST-WP-0038-T03
status: todo
priority: high
state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
```

Apply the chosen database/storage architecture to State Hub.

Requirements:

- Database credentials remain SOPS/secret-managed.
- The database has automated backup configured.
- The database exposes a stable service endpoint for the API.
- Health and replication status are observable.

**Done when:** State Hub can run against the HA database in a test or staging
namespace.

---

### T04 — Add State Hub API high-availability behavior

```task
id: CUST-WP-0038-T04
status: todo
priority: medium
state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
```

Run State Hub API with the right availability posture for its workload:

- At least one replica, optionally more if DB/session behavior permits.
- Readiness and liveness probes.
- Rolling update behavior documented.
- Resource requests/limits set.

**Done when:** killing an API pod does not require manual recovery.
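
The checklist above can be partly verified against the Deployment manifest. The following sketch assumes the manifest has been loaded into a dictionary with the standard Kubernetes Deployment layout; the deployment name, port, and use of `/state/health` as a probe path are hypothetical choices for the example.

```python
def availability_gaps(deployment: dict) -> list[str]:
    """List missing probes and resource settings in a Deployment manifest."""
    gaps = []
    spec = deployment.get("spec", {})
    # Kubernetes defaults to one replica when the field is omitted.
    if spec.get("replicas", 1) < 1:
        gaps.append("replicas < 1")
    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    if not containers:
        gaps.append("no containers defined")
    for container in containers:
        name = container.get("name", "<unnamed>")
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                gaps.append(f"{name}: missing {probe}")
        resources = container.get("resources", {})
        for key in ("requests", "limits"):
            if not resources.get(key):
                gaps.append(f"{name}: missing resources.{key}")
    return gaps


# Illustrative manifest fragment with two deliberate gaps.
sample = {
    "kind": "Deployment",
    "metadata": {"name": "state-hub-api"},
    "spec": {
        "replicas": 2,
        "template": {"spec": {"containers": [{
            "name": "state-hub-api",
            "readinessProbe": {"httpGet": {"path": "/state/health", "port": 8000}},
            "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
        }]}},
    },
}

gaps = availability_gaps(sample)
print(gaps)
```

A clean result only shows that the settings exist; the pod-kill test in the done-when criterion is still what demonstrates recovery.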

---

### T05 — Drill database failover

```task
id: CUST-WP-0038-T05
status: todo
priority: high
state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
```

Perform a controlled failover drill for the State Hub database.

Checks:

- Failure trigger is documented.
- API behavior during failover is observed.
- Recovery time is measured.
- No data loss is detected after recovery.

**Done when:** the failover drill passes and results are logged.
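
To make "recovery time is measured" concrete, the sketch below polls a health probe and reports the time between the first failed poll and the next successful one. The `measure_outage` helper and its parameters are hypothetical; in a real drill `probe` might wrap an HTTP request to the State Hub health endpoint, and the result is only as precise as the polling interval.

```python
import time
from typing import Callable, Optional


def measure_outage(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds from the first failed poll to the next successful poll.

    Returns 0.0 if no failure was seen before the timeout, and None if the
    service went down and had not recovered when the timeout expired.
    """
    start = clock()
    down_since = None
    while clock() - start < timeout_s:
        ok = probe()
        now = clock()
        if not ok and down_since is None:
            down_since = now            # outage begins
        elif ok and down_since is not None:
            return now - down_since     # outage ends
        sleep(interval_s)
    return None if down_since is not None else 0.0


class FakeClock:
    """Simulated time source so the example runs without waiting."""
    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Simulated drill: healthy, two failed polls, then healthy again.
results = iter([True, False, False, True])
fake = FakeClock()
outage = measure_outage(lambda: next(results), clock=fake, sleep=fake.sleep)
print(outage)
```

Injecting the clock and sleep functions keeps the measurement logic testable without touching the live service.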

---

### T06 — Drill backup restore to isolated namespace

```task
id: CUST-WP-0038-T06
status: todo
priority: high
state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
```

Restore the latest encrypted State Hub backup into an isolated namespace or
separate test database.

Checks:

- Backup can be decrypted with the documented key path.
- Restore completes from off-node backup material.
- Row counts and representative records match.
- Restored API can serve `/state/health` and `/state/summary` when pointed at
  the restored database.

**Done when:** restore drill passes without depending on the live database.
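
The row-count check can be sketched as a comparison between counts recorded when the backup was taken and counts read from the restored database, so the drill does not need the live database. The table names and numbers below are hypothetical, and recording a count manifest at backup time is an assumption rather than a documented feature of the existing tooling.

```python
from typing import Optional


def row_count_mismatches(
    expected: dict[str, int],
    restored: dict[str, int],
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Map each differing table to (expected_count, restored_count).

    A table missing on either side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(set(expected) | set(restored)):
        want, got = expected.get(table), restored.get(table)
        if want != got:
            mismatches[table] = (want, got)
    return mismatches


# Hypothetical counts captured at backup time and after restore.
expected = {"events": 1200, "tasks": 42, "workstreams": 7}
restored = {"events": 1200, "tasks": 40}

mismatches = row_count_mismatches(expected, restored)
print(mismatches)
```

Matching counts are necessary but not sufficient, which is why the checklist also calls for comparing representative records.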

---

### T07 — Update agent access and runbooks for HA endpoint

```task
id: CUST-WP-0038-T07
status: todo
priority: medium
state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
```

Update the private access model after the HA endpoint is available:

- ops-bridge or tunnel target.
- MCP `API_BASE` or local port-forward convention.
- Dashboard access.
- Operator recovery instructions.

**Done when:** each active operator machine can reach the HA State Hub endpoint
through the documented path.

---

### T08 — Retire WSL2 fallback after explicit approval

```task
id: CUST-WP-0038-T08
status: todo
priority: low
needs_human: true
intervention_note: "Requires explicit approval after HA failover and restore drills pass."
state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
```

Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA
cluster path has passed drills.

Steps:

1. Take and archive a final WSL2 backup.
2. Stop local WSL2 State Hub services.
3. Update global and repo instructions.
4. Record the retirement decision in State Hub.

**Done when:** WSL2 is no longer part of the normal or fallback operating
model, and the cluster runbook is the source of truth.

## References

- `CUST-WP-0011` — pragmatic railiance01 migration
- Railiance ThreePhoenix infrastructure goal
- State Hub backup/restore runbooks
- Constitution constraint: irreversible retirement requires human approval