tegwick 49696cb0c2 Backup and restore drill

2026-05-02 23:56:20 +02:00

id	type	title	domain	repo	status	owner	topic_slug	created	updated	depends_on	state_hub_workstream_id
CUST-WP-0038	workplan	State Hub Full ThreePhoenix HA Migration	custodian	the-custodian	active	custodian	custodian	2026-05-02	2026-05-02	CUST-WP-0011	8d0c1b5d-44da-4b91-8357-e6526d3e0a85

State Hub Full ThreePhoenix HA Migration

Goal

Preserve the original long-term State Hub infrastructure goal while CUST-WP-0011 takes the pragmatic railiance01 path.

This workplan completes the migration from a useful single-node cluster-hosted State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes, replicated storage, tested failover, tested restore, and retirement of the WSL2 fallback only after operational confidence is earned.

Why This Exists

The near-term State Hub migration should not wait for every HA precondition, because the workstation-hosted State Hub is already a bottleneck for multi-machine work.

But the original requirement remains valid:

State Hub is irreplaceable episodic memory.
A single node is not a final home.
Backup and restore must be drilled, not assumed.
Long-term operations must survive node loss and operator-machine loss.

CUST-WP-0011 moves State Hub to railiance01 pragmatically. This workplan keeps the ultimate target visible and reviewable.

Entry Criteria

CUST-WP-0011 completed or explicitly superseded.
Cluster-hosted State Hub has passed its stabilisation period.
railiance01 is not the only planned durable node.
Railiance architecture decision for storage replication is current: Longhorn, cnpg replication, external backup, or a documented replacement.
Backup and restore tooling has an owner and runbook.

Target Properties

Three healthy Kubernetes nodes: railiance01, railiance02, railiance03.
State Hub database survives loss of one node.
State Hub API recovers from pod loss without manual repair.
Backups are encrypted, off-node, and restorable into a test namespace.
Agent access remains private.
WSL2 is no longer needed as the primary disaster-recovery fallback.

Tasks

T01 — Confirm ThreePhoenix cluster readiness

id: T01
status: todo
priority: high
state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"

Verify the target cluster state:

Three nodes are joined and Ready.
Control-plane and worker roles are documented.
Cluster version and node resources are recorded.
Smoke tests pass from the operator machine and from CoulombCore.

Done when: a current readiness report exists and no node is marked NotReady or operationally unknown.
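The "no node is marked NotReady or operationally unknown" condition could be checked mechanically. The following is a minimal sketch, assuming the node list has been exported with `kubectl get nodes -o json` and parsed into a dict; the function names and the sample data are illustrative, not part of any existing tooling.

```python
import json


def unready_nodes(node_list: dict) -> list[str]:
    """Return names of nodes whose Ready condition is anything other than "True"."""
    problems = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition, "False", and "Unknown" all fail the check.
        if ready is None or ready.get("status") != "True":
            problems.append(name)
    return problems


def readiness_ok(node_list: dict, expected_count: int = 3) -> bool:
    """True only if the expected number of nodes exist and all are Ready."""
    items = node_list.get("items", [])
    return len(items) == expected_count and not unready_nodes(node_list)


# Illustrative sample shaped like `kubectl get nodes -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "railiance01"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "railiance02"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "railiance03"},
   "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}}
]}
""")

if __name__ == "__main__":
    print("not ready:", unready_nodes(sample))
    print("readiness ok:", readiness_ok(sample))
```

Treating "Unknown" the same as "False" matches the wording of the done condition, which excludes both NotReady and operationally unknown nodes.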

T02 — Establish replicated storage/database strategy

id: T02
status: todo
priority: high
state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"

Choose and document the durable data strategy for State Hub:

cnpg multi-instance PostgreSQL cluster, and/or
Longhorn-backed storage with suitable replication, and/or
another explicitly approved architecture.

The decision must define RPO, RTO, failover behavior, and restore procedure.

Done when: the selected architecture is documented and approved before any production data movement.
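One way to keep the decision reviewable is to record it as structured data and refuse to treat it as complete while any required element is blank. This is a sketch only: the `StorageDecision` type, its field names, and the draft values are hypothetical and chosen to mirror the four elements listed above plus an approval field.

```python
from dataclasses import dataclass, fields


@dataclass
class StorageDecision:
    """Hypothetical record of the durable data strategy decision."""
    architecture: str
    rpo: str
    rto: str
    failover_behavior: str
    restore_procedure: str
    approved_by: str = ""


def missing_fields(decision: StorageDecision) -> list[str]:
    """Return the names of fields that are still empty or whitespace-only."""
    return [f.name for f in fields(decision) if not getattr(decision, f.name).strip()]


# Illustrative draft: RTO and approval have not been filled in yet.
draft = StorageDecision(
    architecture="cnpg multi-instance PostgreSQL cluster",
    rpo="to be agreed",
    rto="",
    failover_behavior="automatic promotion of a replica",
    restore_procedure="see State Hub backup/restore runbook",
)

if __name__ == "__main__":
    print("still missing:", missing_fields(draft))
```

A non-empty result would mean the decision is not yet ready for approval, and production data should not move.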

T03 — Implement HA State Hub database

id: T03
status: todo
priority: high
state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"

Apply the chosen database/storage architecture to State Hub.

Requirements:

Database credentials remain SOPS/secret-managed.
The database has automated backup configured.
The database exposes a stable service endpoint for the API.
Health and replication status are observable.

Done when: State Hub can run against the HA database in a test or staging namespace.

T04 — Add State Hub API high-availability behavior

id: T04
status: todo
priority: medium
state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"

Run State Hub API with the right availability posture for its workload:

At least one replica, optionally more if DB/session behavior permits.
Readiness and liveness probes.
Rolling update behavior documented.
Resource requests/limits set.

Done when: killing an API pod does not require manual recovery.
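The four items above map onto standard fields of a Kubernetes `apps/v1` Deployment. The sketch below builds such a manifest as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts. The resource name, labels, image reference, port, probe timings, and resource figures are all assumptions for illustration; using `/state/health` as the probe path is also an assumption, based on that endpoint being named in the restore drill.

```python
import json


def api_deployment(image: str, replicas: int = 1, port: int = 8000) -> dict:
    """Build an illustrative Deployment manifest for the State Hub API."""
    probe = {"httpGet": {"path": "/state/health", "port": port}}  # assumed probe path
    labels = {"app": "state-hub-api"}  # hypothetical label
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "state-hub-api", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Start the replacement pod before removing the old one.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "api",
                        "image": image,
                        "ports": [{"containerPort": port}],
                        "readinessProbe": {**probe, "periodSeconds": 5},
                        "livenessProbe": {
                            **probe,
                            "initialDelaySeconds": 15,
                            "periodSeconds": 10,
                        },
                        # Placeholder figures; real values need measurement.
                        "resources": {
                            "requests": {"cpu": "100m", "memory": "128Mi"},
                            "limits": {"cpu": "500m", "memory": "512Mi"},
                        },
                    }]
                },
            },
        },
    }


if __name__ == "__main__":
    # The image reference is a placeholder, not a real registry path.
    print(json.dumps(api_deployment("registry.example/state-hub-api:TAG"), indent=2))
```

With `maxUnavailable: 0` and `maxSurge: 1`, even a single-replica Deployment rolls without a gap, because the new pod must pass its readiness probe before the old one is terminated.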

T05 — Drill database failover

id: T05
status: todo
priority: high
state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"

Perform a controlled failover drill for the State Hub database.

Checks:

Failure trigger is documented.
API behavior during failover is observed.
Recovery time is measured.
No data loss is detected after recovery.

Done when: the failover drill passes and results are logged.
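Measuring recovery time can be as simple as polling a health probe from the moment the failure is triggered until it succeeds again. The following is a minimal sketch, assuming the caller supplies a `probe` callable that returns True once the API is healthy; the function name, the defaults, and the injectable `clock` and `sleep` parameters are illustrative.

```python
import time
from typing import Callable, Optional


def measure_recovery(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it returns True.

    Returns the elapsed seconds at the first success, or None if the
    timeout is reached first. Resolution is bounded by `interval_s`.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

In a real drill, `probe` might wrap an HTTP request to the API's health endpoint and catch connection errors as failures. Injecting `clock` and `sleep` lets the timing logic be exercised without waiting in real time. The measured value is what gets compared against the RTO defined in T02.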

T06 — Drill backup restore to isolated namespace

id: T06
status: todo
priority: high
state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"

Restore the latest encrypted State Hub backup into an isolated namespace or separate test database.

Checks:

Backup can be decrypted with the documented key path.
Restore completes from off-node backup material.
Row counts and representative records match.
Restored API can serve /state/health and /state/summary when pointed at the restored database.

Done when: restore drill passes without depending on the live database.
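The row-count check can be sketched as a comparison of two per-table count mappings. This assumes the reference counts were recorded when the backup was taken, since the live database may have changed since then; the function name and the table names in the example are hypothetical.

```python
from typing import Optional


def compare_row_counts(
    expected: dict[str, int], restored: dict[str, int]
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Return {table: (expected_count, restored_count)} for every mismatch.

    A table present on only one side is reported with None for the other.
    An empty result means all counts agree.
    """
    mismatches = {}
    for table in sorted(set(expected) | set(restored)):
        want, got = expected.get(table), restored.get(table)
        if want != got:
            mismatches[table] = (want, got)
    return mismatches


if __name__ == "__main__":
    # Hypothetical table names and counts, for illustration only.
    at_backup = {"tasks": 120, "workstreams": 14, "events": 5300}
    after_restore = {"tasks": 120, "workstreams": 13}
    print(compare_row_counts(at_backup, after_restore))
```

Matching counts are necessary but not sufficient, which is why the checklist also calls for comparing representative records.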

T07 — Update agent access and runbooks for HA endpoint

id: T07
status: todo
priority: medium
state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"

Update the private access model after the HA endpoint is available:

ops-bridge or tunnel target.
MCP API_BASE or local port-forward convention.
Dashboard access.
Operator recovery instructions.

Done when: each active operator machine can reach the HA State Hub endpoint through the documented path.

T08 — Retire WSL2 fallback after explicit approval

id: T08
status: todo
priority: low
needs_human: true
intervention_note: "Requires explicit approval after HA failover and restore drills pass."
state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"

Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA cluster path has passed drills.

Steps:

Take and archive a final WSL2 backup.
Stop local WSL2 State Hub services.
Update global and repo instructions.
Record the retirement decision in State Hub.

Done when: WSL2 is no longer part of the normal or fallback operating model, and the cluster runbook is the source of truth.
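Because retirement is irreversible and requires human approval, it may help to make the preconditions explicit before any service is stopped. The sketch below is illustrative: the `RetirementGate` type and its field names are hypothetical and simply restate the conditions in this task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetirementGate:
    """Hypothetical checklist of preconditions for retiring the WSL2 fallback."""
    failover_drill_passed: bool
    restore_drill_passed: bool
    final_backup_archived: bool
    human_approved: bool

    def blockers(self) -> list[str]:
        """Return the names of conditions that are not yet satisfied."""
        return [name for name, ok in vars(self).items() if not ok]

    def may_retire(self) -> bool:
        """True only when every precondition holds."""
        return not self.blockers()


if __name__ == "__main__":
    gate = RetirementGate(
        failover_drill_passed=True,
        restore_drill_passed=True,
        final_backup_archived=True,
        human_approved=False,
    )
    print("may retire:", gate.may_retire(), "blockers:", gate.blockers())
```

The gate does not replace the human decision; it only records whether that decision and the drills it depends on are in place.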

References

CUST-WP-0011 — pragmatic railiance01 migration
Railiance ThreePhoenix infrastructure goal
State Hub backup/restore runbooks
Constitution constraint: irreversible retirement requires human approval
