| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_id | supersedes_intent_from | follow_up_workplan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUST-WP-0011 | workplan | Pragmatic State Hub Migration to railiance01 | custodian | state-hub | active | custodian | custodian | 2026-03-11 | 2026-05-17 | 967baafb-d92d-405a-ba0b-0d00d37c4940 | Migrate Custodian State Hub to ThreePhoenix Cluster | CUST-WP-0038 |
Pragmatic State Hub Migration to railiance01
Goal
Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator workstation to the current railiance01 Kubernetes environment, using the Railiance production-readiness path that exists now:
- CloudNative PG (cnpg) for the State Hub database in the databases namespace.
- State Hub as an S5 workload in railiance-apps.
- Platform/database ownership in railiance-platform.
- Access through the existing private tunnel/ops-bridge model, not public exposure.
- WSL2 retained as a disaster-recovery fallback until the cluster deployment has proven stable.
This is a deliberate pragmatic step. It improves durability and multi-machine
access before the full ThreePhoenix target is ready. The ultimate multi-node,
replicated, long-term cluster goal is preserved in CUST-WP-0038.
Context Update
The original 2026-03-11 version of this workplan targeted a future ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before starting. That was correct as an end-state, but it blocks useful progress now.
The current Railiance architecture has moved on:
- railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md supersedes the older Bitnami PostgreSQL HA platform baseline.
- CloudNative PG is the deployed database operator.
- RAIL-HO-WP-0004-T09 is the Railiance-side task for deploying State Hub to the cluster, and it still requires human decisions before live data migration.
This workplan is now the Custodian-side coordination and safety plan for that T09 effort.
Safety Contract
State Hub is irreplaceable episodic memory. This migration may prepare, deploy, test, and compare as much as needed, but it must not make the cluster the only source of truth until the explicit cutover gate is satisfied.
Rules:
- A fresh WSL2 backup and restore drill is mandatory before data migration.
- The WSL2 State Hub remains available as rollback until stabilisation passes.
- Any task that changes the live writer endpoint requires explicit human approval.
- A failed cluster deploy must leave the WSL2 instance untouched and usable.
- Row counts and key API checks must match before cutover.
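The rules above amount to a conjunction of preconditions for cutover. A minimal sketch of that gate follows; the class and field names are hypothetical and only mirror the rules listed here, they are not taken from the State Hub codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CutoverGate:
    """Hypothetical record of the Safety Contract preconditions."""

    restore_drill_passed: bool     # fresh WSL2 backup restored successfully
    row_counts_match: bool         # cluster row counts equal WSL2 row counts
    api_checks_match: bool         # key API checks agree on both sides
    human_approved: bool           # explicit approval to move the live writer
    wsl2_fallback_available: bool  # WSL2 instance still usable for rollback


def unmet_conditions(gate: CutoverGate) -> list[str]:
    """Return the name of every condition that is still False."""
    return [name for name, ok in vars(gate).items() if not ok]


def may_cut_over(gate: CutoverGate) -> bool:
    """Cutover is allowed only when no condition is unmet."""
    return not unmet_conditions(gate)
```

Listing the unmet conditions, rather than returning a bare boolean, makes it easier to record in State Hub progress why a cutover attempt was refused.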
Target Architecture After This Workplan
Operator workstation / COULOMBCORE / other agent hosts
  -> local MCP server subprocess
    -> http://127.0.0.1:8000 or configured API_BASE
      -> private tunnel / ops-bridge
        -> railiance01 k3s
          -> state-hub Service
            -> FastAPI Deployment
              -> state-hub-db CloudNative PG Cluster
Key properties:
- Single-node pragmatic deployment on railiance01.
- No public unauthenticated exposure.
- Database managed by cnpg, not an ad-hoc Postgres StatefulSet.
- WSL2 retained as DR fallback during stabilisation.
- Future multi-node HA and storage replication are deferred to CUST-WP-0038.
Open Human Decisions
Resolve these before T04/T05 can become live migration work:
- Final State Hub hostname or tunnel-only endpoint.
- Container registry choice: Gitea registry vs external interim registry.
- Exposure model: ClusterIP plus tunnel, private ingress, or both.
- Approval window for freezing WSL2 writes and migrating the production DB.
- Stabilisation duration before WSL2 can be considered non-primary fallback.
Tasks
T01 — Drill WSL2 State Hub backup restore
id: CUST-WP-0011-T01
status: done
priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
completed: "2026-05-02"
Take a fresh State Hub backup from the current WSL2 instance and restore it into an isolated test PostgreSQL instance.
Minimum checks:
- Restore completes without errors.
- Core table row counts match the live WSL2 database.
- /state/summary can be served from the restored copy if wired to a test API.
- Drill result is recorded in State Hub progress and, if useful, episodic memory.
Done when: backup and restore are proven within 24 hours of live migration work.
Result: completed 2026-05-02. A fresh dump from infra-postgres-1 restored
into disposable container state-hub-restore-test on 127.0.0.1:5433.
Application health and summary checks against the restored database returned
HTTP 200. Restored row counts matched production exactly, including 117
workstreams, 989 tasks, 1423 progress events, and 208 token events.
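The row-count check from this drill can be sketched as a plain dictionary comparison. The counts below are the ones reported in the result above; the table names are assumptions, and how the counts are collected (for example one SELECT COUNT(*) per table) is left out.

```python
def compare_row_counts(
    source: dict[str, int], restored: dict[str, int]
) -> dict[str, tuple]:
    """Return {table: (source_count, restored_count)} for every mismatch.

    A table missing on one side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(set(source) | set(restored)):
        a, b = source.get(table), restored.get(table)
        if a != b:
            mismatches[table] = (a, b)
    return mismatches


# Counts from the 2026-05-02 drill; the table names are assumed.
live = {
    "workstreams": 117,
    "tasks": 989,
    "progress_events": 1423,
    "token_events": 208,
}
restored = dict(live)
print(compare_row_counts(live, restored))  # {}
```

An empty result means the counts agree. Any non-empty result names the tables to investigate before proceeding.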
T02 — Align with Railiance deployment plan
id: CUST-WP-0011-T02
status: done
priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
completed: "2026-05-02"
Update the cross-repo plan so this Custodian workplan and
RAIL-HO-WP-0004-T09 point to the same architecture.
Expected outputs:
- RAIL-HO-WP-0004-T09 remains the Railiance-side execution task.
- This workplan remains the Custodian-side safety/cutover task list.
- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the near-term migration plan.
- The future HA goal is referenced through CUST-WP-0038.
Done when: both workplans describe compatible responsibilities and gates.
Result: completed 2026-05-02. RAIL-HO-WP-0004-T09 now names the same
pragmatic railiance01 path: cnpg database, S5 State Hub workload, restore drill
precondition, empty deploy before data copy, explicit human approval before
freezing WSL2 writes, and WSL2 retained as fallback. Full ThreePhoenix HA stays
deferred to CUST-WP-0038.
T03 — Build and publish State Hub container image
id: CUST-WP-0011-T03
status: in_progress
priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
Package this repository as a production image.
Requirements:
- Dockerfile builds from the current Python/uv project.
- Alembic and runtime dependencies are available inside the image.
- Image exposes the FastAPI service on port 8000.
- Image tag is pushed to the chosen registry.
- Build provenance is documented in the commit/workplan.
Done when: railiance01 can pull the image and a dry-run deployment resolves it.
Progress 2026-05-03: added Dockerfile,
.dockerignore, and docs/container-image.md. Built
local image state-hub:local successfully:
sha256:e96dbd1e7d2b63e4fb17584c8c2216088a2c9937bfe880c2ad565c7a9f51c0fc
(~106 MB). Verified container /state/health returns HTTP 200 against the
current database when run locally with host networking. Verified Alembic is
available in-image and reports current revision r5m6n7o8p9q0 (head).
Progress 2026-05-03: registry target decision resolved to the self-hosted
Gitea registry. A local SSH tunnel to the NodePort can reach Gitea, but Docker
login/push still receives HTTP 404 from /v2/. Runtime inspection shows the
live Gitea app.ini has no [packages] section, so package registry
enablement/routing must be applied before publishing state-hub:local.
Progress 2026-05-15: rebuilt the image from current State Hub sources as
state-hub:local with digest
sha256:039d29654ccb3754c6ecdbe497c6364bbd8452edcdcb7fa937dd9debf5b734ff
(106004480 bytes). Verified /state/health returns
{"status":"ok","db":"connected"} from a temporary container on host port
18000 and confirmed in-image Alembic reports t7o8p9q0r1s2 (head). Build
provenance is recorded in docs/container-image.md.
Remaining: enable the Gitea package/container registry, then tag, push, and pull the image from railiance01.
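The HTTP 404 from /v2/ noted above is the usual sign that a host does not serve the container registry API at that path. A small probe can confirm when that changes. This is a sketch: the function names are hypothetical, and the base URL passed in would be whatever the local tunnel exposes.

```python
import urllib.error
import urllib.request


def classify_registry_status(status: int) -> str:
    """Interpret the HTTP status of GET /v2/ on a container registry."""
    if status == 200:
        return "available"       # registry API is served, no auth needed
    if status == 401:
        return "auth_required"   # registry API is served, login required
    if status == 404:
        return "not_enabled"     # no registry API at this path
    return "unexpected"


def probe_registry(base_url: str, timeout: float = 5.0) -> str:
    """Probe base_url + /v2/ and classify the response status.

    Connection failures (urllib.error.URLError) are left to propagate.
    """
    url = base_url.rstrip("/") + "/v2/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_registry_status(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx; the status code is still informative.
        return classify_registry_status(exc.code)
```

Once the package registry is enabled, the probe would be expected to report "available" or "auth_required" rather than "not_enabled", at which point tag and push can be retried.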
T04 — Define State Hub database and app manifests
id: CUST-WP-0011-T04
status: todo
priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
Create the cluster-side deployment assets using current Railiance boundaries:
- railiance-platform: state-hub-db cnpg cluster and database credentials.
- railiance-apps: State Hub Deployment, Service, ConfigMap, Secret/External Secret reference, and optional private Ingress.
- Health probes use GET /state/health.
- Environment includes DATABASE_URL and any required API settings.
Done when: manifests lint/apply in a non-destructive dry run and ownership boundaries are documented.
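As an illustration of the app-side asset, the sketch below builds a minimal Deployment manifest as a Python dictionary and prints it as JSON. The namespace, labels, Secret name, and Secret key are placeholders, not decisions recorded in this workplan. Only the port, the probe path, and the DATABASE_URL variable come from the requirements above.

```python
import json


def state_hub_deployment(
    image: str,
    namespace: str = "state-hub",             # placeholder namespace
    secret_name: str = "state-hub-database",  # placeholder Secret name
) -> dict:
    """Build a minimal single-replica Deployment for the FastAPI service."""
    probe = {"httpGet": {"path": "/state/health", "port": 8000}}
    container = {
        "name": "state-hub",
        "image": image,
        "ports": [{"containerPort": 8000}],
        "env": [
            {
                "name": "DATABASE_URL",
                "valueFrom": {
                    # Placeholder key; the real Secret layout is undecided.
                    "secretKeyRef": {"name": secret_name, "key": "DATABASE_URL"}
                },
            }
        ],
        "readinessProbe": dict(probe),
        "livenessProbe": dict(probe),
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "state-hub", "namespace": namespace},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": "state-hub"}},
            "template": {
                "metadata": {"labels": {"app": "state-hub"}},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    # The image reference is a placeholder until the registry push succeeds.
    print(json.dumps(state_hub_deployment("registry.example/state-hub:tag"), indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output could be piped into a client-side dry run (kubectl apply --dry-run=client -f -) to satisfy the non-destructive check.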
T05 — Deploy empty State Hub and run migrations on railiance01
id: CUST-WP-0011-T05
status: todo
priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
Deploy State Hub against an empty state-hub-db cnpg database and run Alembic
migrations in the cluster environment.
Checks:
- Pod reaches Ready.
- /state/health returns healthy through the intended private access path.
- Alembic reports head.
- Logs show no repeated crash/restart loop.
Done when: an empty but structurally valid State Hub runs on railiance01.
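The health check can be sketched as a small polling loop. It assumes the /state/health payload has the shape seen in the T03 local test, {"status":"ok","db":"connected"}. The fetch callable is a hypothetical hook that returns the response body as text, so the same loop works through whichever private access path is chosen.

```python
import json
import time
from typing import Callable


def is_healthy(body: str) -> bool:
    """True if the body is JSON reporting status ok and a connected DB."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return (
        isinstance(payload, dict)
        and payload.get("status") == "ok"
        and payload.get("db") == "connected"
    )


def wait_until_healthy(
    fetch: Callable[[], str],
    attempts: int = 10,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call fetch() up to `attempts` times until it returns a healthy body."""
    for attempt in range(attempts):
        try:
            if is_healthy(fetch()):
                return True
        except OSError:
            pass  # connection refused, timeout, and similar: retry
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Injecting fetch and sleep keeps the loop testable without a running cluster.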
T06 — Restore WSL2 data copy into cluster and compare
id: CUST-WP-0011-T06
status: todo
priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live source of truth.
Required comparison:
- Table row counts match.
- Representative workstreams, tasks, decisions, progress events, repos, and token events are queryable.
- Dashboard and MCP summary calls return expected data through the cluster API.
- Any mismatch is investigated before proceeding.
Done when: cluster data is a verified copy of WSL2, but not yet the only writer.
T07 — Cut over private access to cluster State Hub
id: CUST-WP-0011-T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
needs_human: true
intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint."
With human approval, freeze WSL2 writes, take a final dump, restore it to the cluster, compare counts again, and redirect the active private access path to the cluster API.
Accepted approaches:
- Keep local MCP config pointed at http://127.0.0.1:8000 and move that port to an ops-bridge/SSH tunnel.
- Or set the MCP server API_BASE to the chosen private cluster endpoint.
Done when: get_state_summary() and dashboard live data are served by the
cluster State Hub, and WSL2 is no longer receiving normal writes.
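Both accepted approaches reduce to the same client-side rule: use API_BASE when it is set, otherwise fall back to the local port. The sketch below assumes API_BASE is read from the environment, which this workplan does not confirm, and the function name is hypothetical.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit

DEFAULT_API_BASE = "http://127.0.0.1:8000"


def resolve_api_base(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the State Hub base URL, preferring API_BASE when it is set."""
    env = os.environ if env is None else env
    base = (env.get("API_BASE") or DEFAULT_API_BASE).rstrip("/")
    parts = urlsplit(base)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"API_BASE is not an http(s) URL: {base!r}")
    return base
```

With the first approach nothing changes on the client and the tunnel takes over port 8000. With the second, only the API_BASE value changes. In both cases rolling back to WSL2 is a matter of restoring the previous tunnel or value.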
T08 — Stabilise with WSL2 retained as fallback
id: CUST-WP-0011-T08
status: todo
priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
Run the cluster State Hub as primary while keeping the WSL2 instance available as a fallback.
Monitor:
- State Hub pod restarts.
- cnpg cluster health.
- Backup job success.
- Dashboard and MCP behavior from each operator machine.
- Consistency sync behavior for file-backed workplans.
Done when: the agreed stabilisation window passes without data loss or unresolved operational defects.
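The exit condition can be written down as a small predicate. The window length is still an open human decision, so it is a parameter here; the function name and inputs are hypothetical.

```python
from datetime import date, timedelta


def stabilisation_passed(
    cutover: date,
    today: date,
    window_days: int,
    open_defects: int,
    data_loss_events: int,
) -> bool:
    """True once the agreed window has elapsed with no loss or open defects."""
    window_elapsed = today >= cutover + timedelta(days=window_days)
    return window_elapsed and open_defects == 0 and data_loss_events == 0
```

A single data-loss event or unresolved defect keeps the predicate false regardless of elapsed time, which matches the intent that WSL2 stays available as a fallback until stability is actually demonstrated.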
T09 — Document operating model and defer final WSL2 retirement
id: CUST-WP-0011-T09
status: todo
priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
Document the new operating model:
- How agents reach State Hub.
- How backups and restores work.
- How to roll back to WSL2 if needed.
- Which parts remain pragmatic/single-node.
- Which long-term requirements moved to CUST-WP-0038.
Do not permanently retire WSL2 in this workplan unless a separate human decision is recorded. Retirement belongs after proven stability or in the future HA workplan.
Done when: runbooks and project instructions match the deployed reality.
References
- railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md
- RAIL-HO-WP-0004-T09: Railiance-side State Hub deployment task
- CUST-WP-0038: future full ThreePhoenix HA State Hub migration
- Constitution constraint: production data migration and fallback retirement require explicit human approval