state-hub/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md

id: CUST-WP-0011
type: workplan
title: Pragmatic State Hub Migration to railiance01
domain: custodian
repo: state-hub
status: active
owner: custodian
topic_slug: custodian
created: 2026-03-11
updated: 2026-05-17
state_hub_workstream_id: 967baafb-d92d-405a-ba0b-0d00d37c4940
supersedes_intent_from: Migrate Custodian State Hub to ThreePhoenix Cluster
follow_up_workplan: CUST-WP-0038

Pragmatic State Hub Migration to railiance01

Goal

Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator workstation to the current railiance01 Kubernetes environment, using the Railiance production-readiness path that exists now:

  • CloudNative PG (cnpg) for the State Hub database in the databases namespace.
  • State Hub as an S5 workload in railiance-apps.
  • Platform/database ownership in railiance-platform.
  • Access through the existing private tunnel/ops-bridge model, not public exposure.
  • WSL2 retained as a disaster-recovery fallback until the cluster deployment has proven stable.

This is a deliberate pragmatic step. It improves durability and multi-machine access before the full ThreePhoenix target is ready. The ultimate multi-node, replicated, long-term cluster goal is preserved in CUST-WP-0038.

Context Update

The original 2026-03-11 version of this workplan targeted a future ThreePhoenix cluster and required Railiance01/02/03, Longhorn, and full HA gates to be in place before any migration work could start. That remains correct as an end state, but waiting for it blocks useful progress now.

The current Railiance architecture has moved on:

  • railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md supersedes the older Bitnami PostgreSQL HA platform baseline.
  • CloudNative PG is the deployed database operator.
  • RAIL-HO-WP-0004-T09 is the Railiance-side task for deploying State Hub to the cluster, and it still requires human decisions before live data migration.

This workplan is now the Custodian-side coordination and safety plan for that T09 effort.

Safety Contract

State Hub is irreplaceable episodic memory. This migration may prepare, deploy, test, and compare as much as needed, but it must not make the cluster the only source of truth until the explicit cutover gate is satisfied.

Rules:

  • A fresh WSL2 backup and restore drill is mandatory before data migration.
  • The WSL2 State Hub remains available as rollback until stabilisation passes.
  • Any task that changes the live writer endpoint requires explicit human approval.
  • A failed cluster deploy must leave the WSL2 instance untouched and usable.
  • Row counts and key API checks must match before cutover.
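
The rules above amount to a gate in which every condition must hold before the live writer endpoint changes. A minimal sketch in Python, assuming a hypothetical `CutoverGate` record whose field names are illustrative and not taken from the State Hub codebase; the 24-hour drill freshness default is borrowed from the T01 done criterion:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class CutoverGate:
    """Hypothetical snapshot of the safety-contract conditions."""

    last_restore_drill: datetime  # when the latest WSL2 backup/restore drill passed
    row_counts_match: bool        # cluster copy matches WSL2 row counts
    api_checks_match: bool        # key API checks agree on both sides
    human_approved: bool          # explicit approval to change the live writer
    wsl2_fallback_usable: bool    # WSL2 instance is untouched and usable

    def blockers(
        self, now: datetime, max_drill_age: timedelta = timedelta(hours=24)
    ) -> List[str]:
        """Return the unmet conditions; an empty list means the gate is open."""
        problems = []
        if now - self.last_restore_drill > max_drill_age:
            problems.append("restore drill is stale")
        if not self.row_counts_match:
            problems.append("row counts differ")
        if not self.api_checks_match:
            problems.append("key API checks differ")
        if not self.wsl2_fallback_usable:
            problems.append("WSL2 fallback is not usable")
        if not self.human_approved:
            problems.append("human approval missing")
        return problems
```

Returning the list of blockers rather than a single boolean keeps the reason for a refused cutover visible in whatever progress note records the attempt.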

Target Architecture After This Workplan

Operator workstation / COULOMBCORE / other agent hosts
  -> local MCP server subprocess
     -> http://127.0.0.1:8000 or configured API_BASE
        -> private tunnel / ops-bridge
           -> railiance01 k3s
              -> state-hub Service
                 -> FastAPI Deployment
                 -> state-hub-db CloudNative PG Cluster

Key properties:

  • Single-node pragmatic deployment on railiance01.
  • No public unauthenticated exposure.
  • Database managed by cnpg, not an ad-hoc Postgres StatefulSet.
  • WSL2 retained as DR fallback during stabilisation.
  • Future multi-node HA and storage replication are deferred to CUST-WP-0038.

Open Human Decisions

Resolve these before T04/T05 can become live migration work:

  1. Final State Hub hostname or tunnel-only endpoint.
  2. Container registry choice: Gitea registry vs external interim registry.
  3. Exposure model: ClusterIP plus tunnel, private ingress, or both.
  4. Approval window for freezing WSL2 writes and migrating the production DB.
  5. Stabilisation duration before WSL2 can be considered a non-primary fallback.

Tasks

T01 — Drill WSL2 State Hub backup restore

id: CUST-WP-0011-T01
status: done
priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
completed: "2026-05-02"

Take a fresh State Hub backup from the current WSL2 instance and restore it into an isolated test PostgreSQL instance.

Minimum checks:

  • Restore completes without errors.
  • Core table row counts match the live WSL2 database.
  • /state/summary can be served from the restored copy if wired to a test API.
  • Drill result is recorded in State Hub progress and, if useful, episodic memory.

Done when: backup and restore are proven within the 24 hours before live migration work begins.

Result: completed 2026-05-02. A fresh dump from infra-postgres-1 was restored into the disposable container state-hub-restore-test on 127.0.0.1:5433. Application health and summary checks against the restored database returned HTTP 200. Restored row counts matched production exactly, including 117 workstreams, 989 tasks, 1423 progress events, and 208 token events.
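
A drill of this kind can be scripted so that it is repeatable before the live migration. The sketch below assumes the dump is taken with `pg_dump` in custom format and restored with `pg_restore`, which this workplan does not state; the connection URLs and the dump path are placeholders supplied by the caller, and the function names are hypothetical.

```python
import subprocess
from typing import List


def dump_command(source_url: str, dump_path: str) -> List[str]:
    """Build a pg_dump call that writes a custom-format archive."""
    return [
        "pg_dump",
        "--format=custom",
        f"--file={dump_path}",
        f"--dbname={source_url}",
    ]


def restore_command(target_url: str, dump_path: str) -> List[str]:
    """Build a pg_restore call for an isolated test instance."""
    return [
        "pg_restore",
        "--clean",       # drop objects before recreating them
        "--if-exists",   # do not fail when an object is absent
        "--no-owner",    # ignore ownership recorded in the dump
        f"--dbname={target_url}",
        dump_path,
    ]


def run_drill(source_url: str, target_url: str, dump_path: str) -> None:
    """Run dump then restore; raises CalledProcessError on any failure."""
    subprocess.run(dump_command(source_url, dump_path), check=True)
    subprocess.run(restore_command(target_url, dump_path), check=True)
```

The target URL should always point at a disposable instance such as the restore-test container on port 5433, never at the live WSL2 database, because `--clean` drops existing objects.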


T02 — Align with Railiance deployment plan

id: CUST-WP-0011-T02
status: done
priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
completed: "2026-05-02"

Update the cross-repo plan so this Custodian workplan and RAIL-HO-WP-0004-T09 point to the same architecture.

Expected outputs:

  • RAIL-HO-WP-0004-T09 remains the Railiance-side execution task.
  • This workplan remains the Custodian-side safety/cutover task list.
  • Any stale Longhorn/Postgres StatefulSet assumptions are removed from the near-term migration plan.
  • The future HA goal is referenced through CUST-WP-0038.

Done when: both workplans describe compatible responsibilities and gates.

Result: completed 2026-05-02. RAIL-HO-WP-0004-T09 now names the same pragmatic railiance01 path: cnpg database, S5 State Hub workload, restore drill precondition, empty deploy before data copy, explicit human approval before freezing WSL2 writes, and WSL2 retained as fallback. Full ThreePhoenix HA stays deferred to CUST-WP-0038.


T03 — Build and publish State Hub container image

id: CUST-WP-0011-T03
status: in_progress
priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"

Package this repository as a production image.

Requirements:

  • Dockerfile builds from the current Python/uv project.
  • Alembic and runtime dependencies are available inside the image.
  • Image exposes the FastAPI service on port 8000.
  • Image tag is pushed to the chosen registry.
  • Build provenance is documented in the commit/workplan.

Done when: railiance01 can pull the image and a dry-run deployment resolves it.

Progress 2026-05-03: added Dockerfile, .dockerignore, and docs/container-image.md. Built local image state-hub:local successfully: sha256:e96dbd1e7d2b63e4fb17584c8c2216088a2c9937bfe880c2ad565c7a9f51c0fc (~106 MB). Verified container /state/health returns HTTP 200 against the current database when run locally with host networking. Verified Alembic is available in-image and reports current revision r5m6n7o8p9q0 (head).

Progress 2026-05-03: registry target decision resolved to the self-hosted Gitea registry. A local SSH tunnel to the NodePort can reach Gitea, but Docker login and push still receive HTTP 404 from /v2/. Runtime inspection shows the live Gitea app.ini has no [packages] section, so package registry enablement and routing must be applied before publishing state-hub:local.

Progress 2026-05-15: rebuilt the image from current State Hub sources as state-hub:local with digest sha256:039d29654ccb3754c6ecdbe497c6364bbd8452edcdcb7fa937dd9debf5b734ff (106004480 bytes). Verified /state/health returns {"status":"ok","db":"connected"} from a temporary container on host port 18000 and confirmed in-image Alembic reports t7o8p9q0r1s2 (head). Build provenance is recorded in docs/container-image.md.

Remaining: enable the Gitea package/container registry, then tag, push, and pull the image from railiance01.
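
The health verification recorded in the progress notes can be repeated after each rebuild or pull. A minimal sketch using only the Python standard library; the expected payload is the one quoted in the 2026-05-15 note, while the function names are hypothetical:

```python
import json
from urllib.request import urlopen


def is_healthy(body: str) -> bool:
    """True only for a JSON object with status "ok" and db "connected"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(payload, dict)
        and payload.get("status") == "ok"
        and payload.get("db") == "connected"
    )


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """Fetch GET /state/health; any network or HTTP error counts as unhealthy."""
    url = f"{base_url.rstrip('/')}/state/health"
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200 and is_healthy(response.read().decode("utf-8"))
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False
```

For the temporary container described above, the call would be `check_health("http://127.0.0.1:18000")`.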


T04 — Define State Hub database and app manifests

id: CUST-WP-0011-T04
status: todo
priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"

Create the cluster-side deployment assets using current Railiance boundaries:

  • railiance-platform: state-hub-db cnpg cluster and database credentials.
  • railiance-apps: State Hub Deployment, Service, ConfigMap, Secret/External Secret reference, and optional private Ingress.
  • Health probes use GET /state/health.
  • Environment includes DATABASE_URL and any required API settings.

Done when: manifests pass linting and a non-destructive dry-run apply, and ownership boundaries are documented.
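
To make the app-side requirements concrete, the sketch below builds a Deployment manifest as a Python dictionary. It is an illustration under assumptions, not the Railiance manifest: the namespace, labels, and the Secret name `state-hub-database` with key `DATABASE_URL` are hypothetical, and only the port, probe path, and `DATABASE_URL` variable come from this workplan.

```python
import json


def state_hub_deployment(image: str, namespace: str = "state-hub") -> dict:
    """Build a single-replica Deployment for the FastAPI service."""
    probe = {"httpGet": {"path": "/state/health", "port": 8000}}
    container = {
        "name": "state-hub",
        "image": image,
        "ports": [{"containerPort": 8000}],
        "env": [
            {
                "name": "DATABASE_URL",
                # Hypothetical Secret; the real name depends on the
                # Secret/External Secret decision in railiance-apps.
                "valueFrom": {
                    "secretKeyRef": {"name": "state-hub-database", "key": "DATABASE_URL"}
                },
            }
        ],
        "readinessProbe": probe,
        "livenessProbe": probe,
    }
    labels = {"app": "state-hub"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "state-hub", "namespace": namespace},
        "spec": {
            "replicas": 1,  # single-node pragmatic deployment
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }
```

Because kubectl accepts JSON as well as YAML, the output of `json.dumps(state_hub_deployment(image), indent=2)` can be piped to `kubectl apply --dry-run=server -f -` for the non-destructive check.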


T05 — Deploy empty State Hub and run migrations on railiance01

id: CUST-WP-0011-T05
status: todo
priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"

Deploy State Hub against an empty state-hub-db cnpg database and run Alembic migrations in the cluster environment.

Checks:

  • Pod reaches Ready.
  • /state/health returns healthy through the intended private access path.
  • Alembic reports head.
  • Logs show no repeated crash/restart loop.

Done when: an empty but structurally valid State Hub runs on railiance01.
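
The "Alembic reports head" check can be automated by inspecting the text printed by `alembic current`. A small sketch, assuming the output has the shape seen in the T03 progress notes (a revision id followed by `(head)`), possibly preceded by `INFO` log lines; the function name is hypothetical and multi-head or branched histories are not handled:

```python
def is_at_head(alembic_current_output: str) -> bool:
    """True when a non-log line of `alembic current` ends with '(head)'."""
    for raw_line in alembic_current_output.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("INFO"):
            continue  # skip blank lines and logging output
        if line.endswith("(head)"):
            return True
    return False
```

In the cluster, the input would typically be the captured stdout of running `alembic current` inside the State Hub pod.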


T06 — Restore WSL2 data copy into cluster and compare

id: CUST-WP-0011-T06
status: todo
priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"

Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live source of truth.

Required comparison:

  • Table row counts match.
  • Representative workstreams, tasks, decisions, progress events, repos, and token events are queryable.
  • Dashboard and MCP summary calls return expected data through the cluster API.
  • Any mismatch is investigated before proceeding.

Done when: the cluster holds a verified copy of the WSL2 data but is not yet the only writer.
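
The row-count comparison is easiest to audit when it reports every differing table rather than stopping at the first. A minimal sketch, assuming the per-table counts have already been collected into dictionaries; the table names in the usage example are hypothetical labels for the figures recorded in the T01 result:

```python
from typing import Dict, Optional, Tuple

Mismatch = Tuple[Optional[int], Optional[int]]


def diff_row_counts(source: Dict[str, int], target: Dict[str, int]) -> Dict[str, Mismatch]:
    """Map each differing table to (source_count, target_count).

    A table missing on one side is reported with None for that side.
    An empty result means the counts match.
    """
    mismatches: Dict[str, Mismatch] = {}
    for table in sorted(set(source) | set(target)):
        source_count = source.get(table)
        target_count = target.get(table)
        if source_count != target_count:
            mismatches[table] = (source_count, target_count)
    return mismatches


# Usage with the counts from the T01 drill (table names are illustrative).
wsl2_counts = {"workstreams": 117, "tasks": 989, "progress_events": 1423, "token_events": 208}
cluster_counts = dict(wsl2_counts)
print(diff_row_counts(wsl2_counts, cluster_counts))  # prints {}
```

Matching counts are necessary but not sufficient, which is why the list above also asks for representative records to be queried through the cluster API.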


T07 — Cut over private access to cluster State Hub

id: CUST-WP-0011-T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
needs_human: true
intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint."

With human approval, freeze WSL2 writes, take a final dump, restore it to the cluster, compare counts again, and redirect the active private access path to the cluster API.

Accepted approaches:

  • Keep local MCP config pointed at http://127.0.0.1:8000 and move that port to an ops-bridge/SSH tunnel.
  • Or set the MCP server API_BASE to the chosen private cluster endpoint.

Done when: get_state_summary() and dashboard live data are served by the cluster State Hub, and WSL2 is no longer receiving normal writes.
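
Both accepted approaches come down to which base URL the MCP server uses. The sketch below assumes the server reads `API_BASE` from its environment and falls back to the local address; that lookup mechanism and the function name are assumptions, not confirmed behaviour of the MCP server.

```python
from typing import Mapping

DEFAULT_API_BASE = "http://127.0.0.1:8000"


def resolve_api_base(env: Mapping[str, str]) -> str:
    """Return the configured API_BASE, or the local default when unset or blank."""
    configured = env.get("API_BASE", "").strip()
    return (configured or DEFAULT_API_BASE).rstrip("/")
```

With the first approach the environment stays empty and the tunnel takes over port 8000, so `resolve_api_base({})` keeps returning the local address. With the second, a call such as `resolve_api_base({"API_BASE": "https://state-hub.internal.example/"})` returns the private endpoint, where the hostname shown is a placeholder for the decision still open above.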


T08 — Stabilise with WSL2 retained as fallback

id: CUST-WP-0011-T08
status: todo
priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"

Run the cluster State Hub as primary while keeping the WSL2 instance available as a fallback.

Monitor:

  • State Hub pod restarts.
  • cnpg cluster health.
  • Backup job success.
  • Dashboard and MCP behavior from each operator machine.
  • Consistency sync behavior for file-backed workplans.

Done when: the agreed stabilisation window passes without data loss or unresolved operational defects.
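
The done criterion can be stated as a simple predicate once the window length has been agreed. A sketch under assumptions: the function and parameter names are hypothetical, and the window duration is an input because it is still one of the open human decisions.

```python
from datetime import datetime, timedelta


def stabilisation_passed(
    cutover_at: datetime,
    now: datetime,
    window: timedelta,
    data_loss_events: int,
    unresolved_defects: int,
) -> bool:
    """True when the full window has elapsed with no data loss and no open defects."""
    window_elapsed = now - cutover_at >= window
    return window_elapsed and data_loss_events == 0 and unresolved_defects == 0
```

The defect and data-loss counts would come from the monitoring items listed above; how they are collected is left to the runbooks in T09.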


T09 — Document operating model and defer final WSL2 retirement

id: CUST-WP-0011-T09
status: todo
priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"

Document the new operating model:

  • How agents reach State Hub.
  • How backups and restores work.
  • How to roll back to WSL2 if needed.
  • Which parts remain pragmatic/single-node.
  • Which long-term requirements moved to CUST-WP-0038.

Do not permanently retire WSL2 in this workplan unless a separate human decision is recorded. Retirement belongs after proven stability or in the future HA workplan.

Done when: runbooks and project instructions match the deployed reality.

References

  • railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md
  • RAIL-HO-WP-0004-T09 — Railiance-side State Hub deployment task
  • CUST-WP-0038 — future full ThreePhoenix HA State Hub migration
  • Constitution constraint: production data migration and fallback retirement require explicit human approval