the-custodian/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md
2026-05-02 23:56:20 +02:00


id: CUST-WP-0038
type: workplan
title: State Hub Full ThreePhoenix HA Migration
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
created: 2026-05-02
updated: 2026-05-02
depends_on: CUST-WP-0011
state_hub_workstream_id: 8d0c1b5d-44da-4b91-8357-e6526d3e0a85

State Hub Full ThreePhoenix HA Migration

Goal

Preserve the original long-term State Hub infrastructure goal while CUST-WP-0011 takes the pragmatic railiance01 path.

This workplan completes the migration from a useful single-node cluster-hosted State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes, replicated storage, tested failover, tested restore, and retirement of the WSL2 fallback only after operational confidence is earned.

Why This Exists

The near-term State Hub migration should not wait for every HA precondition, because the workstation-hosted State Hub is already a bottleneck for multi-machine work.

But the original requirement remains valid:

  • State Hub is irreplaceable episodic memory.
  • A single node is not a final home.
  • Backup and restore must be drilled, not assumed.
  • Long-term operations must survive node loss and operator-machine loss.

CUST-WP-0011 moves State Hub to railiance01 pragmatically. This workplan keeps the ultimate target visible and reviewable.

Entry Criteria

  • CUST-WP-0011 completed or explicitly superseded.
  • Cluster-hosted State Hub has passed its stabilisation period.
  • railiance01 is not the only planned durable node.
  • Railiance architecture decision for storage replication is current: Longhorn, CloudNativePG (cnpg) replication, external backup, or a documented replacement.
  • Backup and restore tooling has an owner and runbook.

Target Properties

  • Three healthy Kubernetes nodes: railiance01, railiance02, railiance03.
  • State Hub database survives loss of one node.
  • State Hub API recovers from pod loss without manual repair.
  • Backups are encrypted, off-node, and restorable into a test namespace.
  • Agent access remains private.
  • WSL2 is no longer needed as the primary disaster-recovery fallback.

Tasks

T01 — Confirm ThreePhoenix cluster readiness

id: T01
status: todo
priority: high
state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"

Verify the target cluster state:

  • Three nodes are joined and Ready.
  • Control-plane and worker roles are documented.
  • Cluster version and node resources are recorded.
  • Smoke tests pass from the operator machine and from CoulombCore.

Done when: a current readiness report exists and no node is marked NotReady or operationally unknown.
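
One way to turn the node checks above into a repeatable report is sketched below. It assumes the node list has been exported with `kubectl get nodes -o json` and parsed into a dict; the function name `readiness_problems`, the default node count, and the sample data are hypothetical and only illustrate the shape of the check.

```python
from typing import Any


def readiness_problems(node_list: dict[str, Any], expected_nodes: int = 3) -> list[str]:
    """Return human-readable problems found in a parsed `kubectl get nodes -o json` dict."""
    problems = []
    items = node_list.get("items", [])
    if len(items) != expected_nodes:
        problems.append(f"expected {expected_nodes} nodes, found {len(items)}")
    for node in items:
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None:
            # No Ready condition at all: treat the node as operationally unknown.
            problems.append(f"{name}: Ready condition missing (operationally unknown)")
        elif ready.get("status") != "True":
            problems.append(f"{name}: Ready={ready.get('status')}")
    return problems


# Hypothetical sample data, not taken from a real cluster.
sample = {
    "items": [
        {"metadata": {"name": "railiance01"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "railiance02"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
        {"metadata": {"name": "railiance03"}, "status": {}},
    ]
}

for problem in readiness_problems(sample):
    print(problem)
```

An empty result would correspond to the "Done when" condition; role documentation, version records, and smoke tests still need their own evidence.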


T02 — Establish replicated storage/database strategy

id: T02
status: todo
priority: high
state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"

Choose and document the durable data strategy for State Hub:

  • cnpg multi-instance PostgreSQL cluster, and/or
  • Longhorn-backed storage with suitable replication, and/or
  • another explicitly approved architecture.

The decision must define the recovery point objective (RPO), the recovery time objective (RTO), failover behavior, and the restore procedure.

Done when: the selected architecture is documented and approved before any production data movement.
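
The completeness requirement on the decision can be checked mechanically before review. The sketch below is illustrative only: the `StorageDecision` record and `missing_fields` helper are hypothetical names, and the draft deliberately leaves the undecided fields empty rather than inventing values.

```python
from dataclasses import dataclass, fields


@dataclass
class StorageDecision:
    """Hypothetical record of what the T02 decision must define."""
    architecture: str
    rpo: str
    rto: str
    failover_behavior: str
    restore_procedure: str


def missing_fields(decision: StorageDecision) -> list[str]:
    """Return the names of fields that are still empty or whitespace-only."""
    return [f.name for f in fields(decision) if not getattr(decision, f.name).strip()]


# A draft that names an architecture but has not yet defined the rest.
draft = StorageDecision(
    architecture="cnpg multi-instance PostgreSQL cluster",
    rpo="",
    rto="",
    failover_behavior="",
    restore_procedure="",
)

print(missing_fields(draft))
```

A decision would only go forward for approval once `missing_fields` returns an empty list.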


T03 — Implement HA State Hub database

id: T03
status: todo
priority: high
state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"

Apply the chosen database/storage architecture to State Hub.

Requirements:

  • Database credentials remain SOPS/secret-managed.
  • The database has automated backup configured.
  • The database exposes a stable service endpoint for the API.
  • Health and replication status are observable.

Done when: State Hub can run against the HA database in a test or staging namespace.


T04 — Add State Hub API high-availability behavior

id: T04
status: todo
priority: medium
state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"

Run State Hub API with the right availability posture for its workload:

  • At least one replica, optionally more if DB/session behavior permits.
  • Readiness and liveness probes.
  • Rolling update behavior documented.
  • Resource requests/limits set.

Done when: killing an API pod does not require manual recovery.
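
The availability posture listed above maps onto standard Kubernetes Deployment fields (`spec.replicas`, `readinessProbe`, `livenessProbe`, `resources.requests`, `resources.limits`). As a rough sketch, a manifest parsed into a dict could be linted as follows; the helper name `availability_gaps`, the container name, and the port are assumptions made for illustration.

```python
from typing import Any


def availability_gaps(deployment: dict[str, Any]) -> list[str]:
    """List availability settings missing from a parsed Deployment manifest."""
    gaps = []
    spec = deployment.get("spec", {})
    # Kubernetes defaults replicas to 1 when the field is omitted.
    if spec.get("replicas", 1) < 1:
        gaps.append("replicas < 1")
    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        name = container.get("name", "?")
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                gaps.append(f"{name}: missing {probe}")
        resources = container.get("resources", {})
        for key in ("requests", "limits"):
            if not resources.get(key):
                gaps.append(f"{name}: missing resources.{key}")
    return gaps


# Hypothetical manifest fragment: name, port, and resource values are placeholders.
manifest = {
    "spec": {
        "replicas": 1,
        "template": {"spec": {"containers": [{
            "name": "state-hub-api",
            "readinessProbe": {"httpGet": {"path": "/state/health", "port": 8000}},
            "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
        }]}},
    }
}

print(availability_gaps(manifest))
```

This only checks that the settings exist; the pod-kill test in the "Done when" line remains the real acceptance criterion.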


T05 — Drill database failover

id: T05
status: todo
priority: high
state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"

Perform a controlled failover drill for the State Hub database.

Checks:

  • Failure trigger is documented.
  • API behavior during failover is observed.
  • Recovery time is measured.
  • No data loss is detected after recovery.

Done when: the failover drill passes and results are logged.
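
Recovery time can be derived from a log of periodic health probes taken during the drill. The sketch below assumes samples are recorded as `(timestamp_seconds, healthy)` pairs in chronological order; the function name `recovery_seconds` and the sample values are hypothetical.

```python
from typing import Optional


def recovery_seconds(samples: list[tuple[float, bool]]) -> Optional[float]:
    """Time from the first failed probe to the next healthy probe.

    Returns None if there was no outage or the service never recovered
    within the recorded samples.
    """
    outage_start = None
    for timestamp, healthy in samples:
        if not healthy and outage_start is None:
            outage_start = timestamp
        elif healthy and outage_start is not None:
            return timestamp - outage_start
    return None


# Hypothetical probe log: one probe every 5 seconds.
probe_log = [(0.0, True), (5.0, True), (10.0, False), (15.0, False), (20.0, True)]

print(recovery_seconds(probe_log))  # 10.0
```

The result is only as precise as the probe interval, so the interval used belongs in the drill log next to the measured value.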


T06 — Drill backup restore to isolated namespace

id: T06
status: todo
priority: high
state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"

Restore the latest encrypted State Hub backup into an isolated namespace or separate test database.

Checks:

  • Backup can be decrypted with the documented key path.
  • Restore completes from off-node backup material.
  • Row counts and representative records match.
  • Restored API can serve /state/health and /state/summary when pointed at the restored database.

Done when: restore drill passes without depending on the live database.
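
The row-count check can stay independent of the live database if the expected counts are recorded when the backup is taken. A minimal sketch under that assumption follows; the function name `row_count_mismatches` and the table names are hypothetical and do not reflect the actual State Hub schema.

```python
from typing import Optional


def row_count_mismatches(
    expected: dict[str, int], restored: dict[str, int]
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Map each differing table to (expected_count, restored_count).

    A count of None means the table is absent on that side.
    """
    tables = sorted(set(expected) | set(restored))
    return {
        table: (expected.get(table), restored.get(table))
        for table in tables
        if expected.get(table) != restored.get(table)
    }


# Hypothetical counts: `expected` recorded at backup time, `restored` queried after restore.
expected = {"tasks": 120, "workstreams": 8, "events": 5000}
restored = {"tasks": 120, "workstreams": 7}

print(row_count_mismatches(expected, restored))
```

Matching counts are necessary but not sufficient, which is why the checks above also compare representative records.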


T07 — Update agent access and runbooks for HA endpoint

id: T07
status: todo
priority: medium
state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"

Update the private access model after the HA endpoint is available:

  • ops-bridge or tunnel target.
  • MCP API_BASE or local port-forward convention.
  • Dashboard access.
  • Operator recovery instructions.

Done when: each active operator machine can reach the HA State Hub endpoint through the documented path.


T08 — Retire WSL2 fallback after explicit approval

id: T08
status: todo
priority: low
needs_human: true
intervention_note: "Requires explicit approval after HA failover and restore drills pass."
state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"

Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA cluster path has passed drills.

Steps:

  1. Take and archive a final WSL2 backup.
  2. Stop local WSL2 State Hub services.
  3. Update global and repo instructions.
  4. Record the retirement decision in State Hub.

Done when: WSL2 is no longer part of the normal or fallback operating model, and the cluster runbook is the source of truth.
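
Because retirement is irreversible and needs human approval, the preconditions can be made explicit before step 2 is carried out. The following is a sketch only: the `RetirementGate` record and `blockers` helper are hypothetical, and how each flag is actually established is left to the runbook.

```python
from dataclasses import dataclass, fields


@dataclass
class RetirementGate:
    """Hypothetical preconditions for stopping the WSL2 State Hub services."""
    failover_drill_passed: bool
    restore_drill_passed: bool
    human_approved: bool
    final_backup_archived: bool


def blockers(gate: RetirementGate) -> list[str]:
    """Return the names of preconditions that are not yet satisfied."""
    return [f.name for f in fields(gate) if not getattr(gate, f.name)]


# Drills have passed, but approval and the final backup are still outstanding.
gate = RetirementGate(
    failover_drill_passed=True,
    restore_drill_passed=True,
    human_approved=False,
    final_backup_archived=False,
)

print(blockers(gate))
```

Services would be stopped only when `blockers` returns an empty list.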

References

  • CUST-WP-0011 — pragmatic railiance01 migration
  • Railiance ThreePhoenix infrastructure goal
  • State Hub backup/restore runbooks
  • Constitution constraint: irreversible retirement requires human approval