---
id: CUST-WP-0011
type: workplan
title: "Pragmatic State Hub Migration to railiance01"
domain: custodian
repo: state-hub
status: active
owner: custodian
topic_slug: custodian
created: "2026-03-11"
updated: "2026-05-17"
state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
supersedes_intent_from: "Migrate Custodian State Hub to ThreePhoenix Cluster"
follow_up_workplan: CUST-WP-0038
---

# Pragmatic State Hub Migration to railiance01

## Goal

Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator
workstation to the current railiance01 Kubernetes environment, using the
Railiance production-readiness path that exists now:

- CloudNative PG (`cnpg`) for the State Hub database in the `databases`
  namespace.
- State Hub as an S5 workload in `railiance-apps`.
- Platform/database ownership in `railiance-platform`.
- Access through the existing private tunnel/ops-bridge model, not public
  exposure.
- WSL2 retained as a disaster-recovery fallback until the cluster deployment
  has proven stable.

This is a deliberate pragmatic step. It improves durability and multi-machine
access before the full ThreePhoenix target is ready. The ultimate multi-node,
replicated, long-term cluster goal is preserved in `CUST-WP-0038`.

## Context Update

The original 2026-03-11 version of this workplan targeted a future
ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before
starting. That was correct as an end-state, but it blocks useful progress now.

The current Railiance architecture has moved on:

- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
  supersedes the older Bitnami PostgreSQL HA platform baseline.
- CloudNative PG is the deployed database operator.
- `RAIL-HO-WP-0004-T09` is the Railiance-side task for deploying State Hub to
  the cluster, and it still requires human decisions before live data
  migration.

This workplan is now the Custodian-side coordination and safety plan for that
T09 effort.

## Safety Contract

State Hub is irreplaceable episodic memory. This migration may prepare, deploy,
test, and compare as much as needed, but it must not make the cluster the only
source of truth until the explicit cutover gate is satisfied.

Rules:

- A fresh WSL2 backup and restore drill is mandatory before data migration.
- The WSL2 State Hub remains available as rollback until stabilisation passes.
- Any task that changes the live writer endpoint requires explicit human
  approval.
- A failed cluster deploy must leave the WSL2 instance untouched and usable.
- Row counts and key API checks must match before cutover.

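
The safety rules amount to a gate that has to hold before the live writer endpoint changes. The following is a minimal sketch of such a gate. It assumes a hypothetical `CutoverEvidence` record; none of these class, field, or function names exist in State Hub, they only illustrate the contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CutoverEvidence:
    """Hypothetical record of the facts the safety contract asks for."""
    restore_drill_passed: bool   # fresh WSL2 backup was restored successfully
    wsl2_fallback_usable: bool   # WSL2 instance is still available for rollback
    human_approved: bool         # explicit approval to move the live writer
    row_counts_match: bool       # cluster row counts equal the WSL2 row counts
    api_checks_match: bool       # key API checks agree on both sides


def cutover_blockers(evidence: CutoverEvidence) -> list[str]:
    """Return the unmet rules; an empty list means the gate is satisfied."""
    checks = {
        "restore drill not proven": evidence.restore_drill_passed,
        "WSL2 fallback not usable": evidence.wsl2_fallback_usable,
        "no explicit human approval": evidence.human_approved,
        "row counts differ": evidence.row_counts_match,
        "key API checks differ": evidence.api_checks_match,
    }
    return [reason for reason, satisfied in checks.items() if not satisfied]
```

Returning the list of unmet rules rather than a single boolean keeps the reason for a blocked cutover visible to whoever has to approve it.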
## Target Architecture After This Workplan

```
Operator workstation / COULOMBCORE / other agent hosts
  -> local MCP server subprocess
  -> http://127.0.0.1:8000 or configured API_BASE
  -> private tunnel / ops-bridge
  -> railiance01 k3s
  -> state-hub Service
  -> FastAPI Deployment
  -> state-hub-db CloudNative PG Cluster
```

Key properties:

- Single-node pragmatic deployment on railiance01.
- No public unauthenticated exposure.
- Database managed by cnpg, not an ad-hoc Postgres StatefulSet.
- WSL2 retained as DR fallback during stabilisation.
- Future multi-node HA and storage replication are deferred to `CUST-WP-0038`.

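
In this access chain a client needs only one base URL, which can stay on the local port when a tunnel is in place. A small sketch of that resolution follows. It assumes the setting is read from an `API_BASE` environment variable: the workplan names `API_BASE` but does not say how it is supplied, and `resolve_api_base` and `health_url` are hypothetical helpers.

```python
import os
from urllib.parse import urlsplit

# Local port named in the architecture diagram.
DEFAULT_API_BASE = "http://127.0.0.1:8000"


def resolve_api_base(env=None) -> str:
    """Return the configured API base, falling back to the local default."""
    env = os.environ if env is None else env
    base = (env.get("API_BASE") or DEFAULT_API_BASE).rstrip("/")
    parts = urlsplit(base)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"API_BASE is not an http(s) URL: {base!r}")
    return base


def health_url(env=None) -> str:
    """Build the health endpoint URL used by the checks in this workplan."""
    return resolve_api_base(env) + "/state/health"
```

With no override the helper keeps pointing at `http://127.0.0.1:8000`, so moving that port onto a tunnel would need no client change.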
## Open Human Decisions

Resolve these before T04/T05 can become live migration work:

1. Final State Hub hostname or tunnel-only endpoint.
2. Container registry choice: Gitea registry vs external interim registry.
3. Exposure model: ClusterIP plus tunnel, private ingress, or both.
4. Approval window for freezing WSL2 writes and migrating the production DB.
5. Stabilisation duration before WSL2 can be considered non-primary fallback.

## Tasks

### T01 — Drill WSL2 State Hub backup restore

```task
id: CUST-WP-0011-T01
status: done
priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
completed: "2026-05-02"
```

Take a fresh State Hub backup from the current WSL2 instance and restore it
into an isolated test PostgreSQL instance.

Minimum checks:

- Restore completes without errors.
- Core table row counts match the live WSL2 database.
- `/state/summary` can be served from the restored copy if wired to a test API.
- Drill result is recorded in State Hub progress and, if useful, episodic
  memory.

**Done when:** backup and restore are proven within 24 hours of live migration
work.

Result: completed 2026-05-02. A fresh dump from `infra-postgres-1` restored
into disposable container `state-hub-restore-test` on `127.0.0.1:5433`.
Application health and summary checks against the restored database returned
HTTP 200. Restored row counts matched production exactly, including 117
workstreams, 989 tasks, 1423 progress events, and 208 token events.

---

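
For the T01 row-count check, the comparison itself can be kept independent of how the counts are collected. The sketch below takes two mappings of table name to row count and reports only the differences. The table names are assumed from the prose of the T01 result and may not match the real schema; `compare_row_counts` is a hypothetical helper.

```python
def compare_row_counts(live: dict[str, int], restored: dict[str, int]) -> dict[str, tuple]:
    """Map each table whose counts differ to (live_count, restored_count).

    A table missing on one side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(set(live) | set(restored)):
        pair = (live.get(table), restored.get(table))
        if pair[0] != pair[1]:
            mismatches[table] = pair
    return mismatches


# Counts recorded in the T01 result; the table names are assumptions.
live_counts = {
    "workstreams": 117,
    "tasks": 989,
    "progress_events": 1423,
    "token_events": 208,
}
```

An empty result means the restored copy matches; anything else names the table and both counts, which is the detail needed to investigate a mismatch.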
### T02 — Align with Railiance deployment plan

```task
id: CUST-WP-0011-T02
status: done
priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
completed: "2026-05-02"
```

Update the cross-repo plan so this Custodian workplan and
`RAIL-HO-WP-0004-T09` point to the same architecture.

Expected outputs:

- `RAIL-HO-WP-0004-T09` remains the Railiance-side execution task.
- This workplan remains the Custodian-side safety/cutover task list.
- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the
  near-term migration plan.
- The future HA goal is referenced through `CUST-WP-0038`.

**Done when:** both workplans describe compatible responsibilities and gates.

Result: completed 2026-05-02. `RAIL-HO-WP-0004-T09` now names the same
pragmatic railiance01 path: cnpg database, S5 State Hub workload, restore drill
precondition, empty deploy before data copy, explicit human approval before
freezing WSL2 writes, and WSL2 retained as fallback. Full ThreePhoenix HA stays
deferred to `CUST-WP-0038`.

---

### T03 — Build and publish State Hub container image

```task
id: CUST-WP-0011-T03
status: in_progress
priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
```

Package this repository as a production image.

Requirements:

- Dockerfile builds from the current Python/uv project.
- Alembic and runtime dependencies are available inside the image.
- Image exposes the FastAPI service on port 8000.
- Image tag is pushed to the chosen registry.
- Build provenance is documented in the commit/workplan.

**Done when:** railiance01 can pull the image and a dry-run deployment resolves
it.

Progress 2026-05-03: added `Dockerfile`,
`.dockerignore`, and `docs/container-image.md`. Built
local image `state-hub:local` successfully:
`sha256:e96dbd1e7d2b63e4fb17584c8c2216088a2c9937bfe880c2ad565c7a9f51c0fc`
(~106 MB). Verified container `/state/health` returns HTTP 200 against the
current database when run locally with host networking. Verified Alembic is
available in-image and reports current revision `r5m6n7o8p9q0 (head)`.

Progress 2026-05-03: registry target decision resolved to the self-hosted
Gitea registry. A local SSH tunnel to the NodePort can reach Gitea, but Docker
login/push still receives HTTP 404 from `/v2/`. Runtime inspection shows the
live Gitea `app.ini` has no `[packages]` section, so package registry
enablement/routing must be applied before publishing `state-hub:local`.

Progress 2026-05-15: rebuilt the image from current State Hub sources as
`state-hub:local` with digest
`sha256:039d29654ccb3754c6ecdbe497c6364bbd8452edcdcb7fa937dd9debf5b734ff`
(106004480 bytes). Verified `/state/health` returns
`{"status":"ok","db":"connected"}` from a temporary container on host port
18000 and confirmed in-image Alembic reports `t7o8p9q0r1s2 (head)`. Build
provenance is recorded in `docs/container-image.md`.

Remaining: enable the Gitea package/container registry, then tag, push, and
pull the image from railiance01.

---

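
For the remaining T03 tag-and-push step, the target reference and the recorded digest can be checked before anything is pushed. The sketch below assumes the usual `registry/owner/name:tag` layout for a container registry. The registry host and owner shown in the usage are placeholders, since the final hostname is still an open decision, and both helpers are hypothetical.

```python
import re

# Shape of a SHA-256 image digest such as the ones recorded under T03.
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def image_reference(registry: str, owner: str, name: str, tag: str) -> str:
    """Build `registry/owner/name:tag` for use with `docker tag` and `docker push`."""
    parts = (("registry", registry), ("owner", owner), ("name", name), ("tag", tag))
    for label, value in parts:
        if not value or "/" in value or " " in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return f"{registry}/{owner}/{name}:{tag}"


def is_sha256_digest(value: str) -> bool:
    """True when `value` is a well-formed sha256 digest string."""
    return DIGEST_RE.fullmatch(value) is not None


if __name__ == "__main__":
    # Placeholder host and owner; replace once the registry endpoint is decided.
    print(image_reference("gitea.example.internal:3000", "custodian", "state-hub", "2026-05-15"))
```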
### T04 — Define State Hub database and app manifests

```task
id: CUST-WP-0011-T04
status: todo
priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
```

Create the cluster-side deployment assets using current Railiance boundaries:

- `railiance-platform`: `state-hub-db` cnpg cluster and database credentials.
- `railiance-apps`: State Hub Deployment, Service, ConfigMap, Secret/External
  Secret reference, and optional private Ingress.
- Health probes use `GET /state/health`.
- Environment includes `DATABASE_URL` and any required API settings.

**Done when:** manifests lint/apply in a non-destructive dry run and ownership
boundaries are documented.

---

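
As an illustration of the T04 app-side manifest, the sketch below builds a Deployment as a plain Python dictionary and prints it as JSON, which `kubectl` accepts alongside YAML. The port, probe path, and `DATABASE_URL` come from this workplan. The namespace, labels, secret name, and the `uri` secret key are assumptions chosen for the example, not decided values.

```python
import json


def state_hub_deployment(image: str, db_secret: str, namespace: str) -> dict:
    """Return a single-replica Deployment with HTTP probes on /state/health."""
    probe = {"httpGet": {"path": "/state/health", "port": 8000}}
    labels = {"app": "state-hub"}  # assumed label scheme
    container = {
        "name": "state-hub",
        "image": image,
        "ports": [{"containerPort": 8000}],
        "env": [{
            "name": "DATABASE_URL",
            # Assumes the secret exposes a full connection URI under the key "uri".
            "valueFrom": {"secretKeyRef": {"name": db_secret, "key": "uri"}},
        }],
        "readinessProbe": probe,
        "livenessProbe": probe,
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "state-hub", "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    manifest = state_hub_deployment(
        "registry.example/custodian/state-hub:tag",  # placeholder image
        "state-hub-db-app",                          # placeholder secret name
        "state-hub",                                 # placeholder namespace
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be piped to `kubectl apply --dry-run=client -f -` to exercise the non-destructive dry run. If the application expects a driver-specific URL scheme, the URI from the secret may need adapting first.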
### T05 — Deploy empty State Hub and run migrations on railiance01

```task
id: CUST-WP-0011-T05
status: todo
priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
```

Deploy State Hub against an empty `state-hub-db` cnpg database and run Alembic
migrations in the cluster environment.

Checks:

- Pod reaches Ready.
- `/state/health` returns healthy through the intended private access path.
- Alembic reports head.
- Logs show no repeated crash/restart loop.

**Done when:** an empty but structurally valid State Hub runs on railiance01.

---

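
The T05 health check through the private access path could be polled with the standard library alone. This sketch treats the payload `{"status":"ok","db":"connected"}` observed in T03 as the definition of healthy, which is an assumption about the endpoint's contract. `is_healthy` and `wait_for_health` are hypothetical helpers, and the retry numbers are arbitrary.

```python
import json
import time
import urllib.request


def is_healthy(body: bytes) -> bool:
    """True when the payload matches the shape observed in T03."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return (
        isinstance(payload, dict)
        and payload.get("status") == "ok"
        and payload.get("db") == "connected"
    )


def wait_for_health(url: str, attempts: int = 10, delay: float = 3.0, fetch=None) -> bool:
    """Poll `url` until it reports healthy or the attempts run out."""
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u, timeout=5).read()
    for attempt in range(attempts):
        try:
            if is_healthy(fetch(url)):
                return True
        except OSError:
            pass  # connection refused, timeout, or HTTP error: not ready yet
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The `fetch` parameter exists so the polling logic can be exercised without a live endpoint; in normal use it is left unset and the URL is the tunnel-facing `/state/health` address.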
### T06 — Restore WSL2 data copy into cluster and compare

```task
id: CUST-WP-0011-T06
status: todo
priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
```

Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live
source of truth.

Required comparison:

- Table row counts match.
- Representative workstreams, tasks, decisions, progress events, repos, and
  token events are queryable.
- Dashboard and MCP summary calls return expected data through the cluster API.
- Any mismatch is investigated before proceeding.

**Done when:** cluster data is a verified copy of WSL2, but not yet the only
writer.

---

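
Row counts alone would not catch a T06 restore where the rows exist but their contents differ. One way to compare representative records is an order-independent fingerprint per table, sketched below. It assumes the rows have already been fetched from each side as dictionaries; how they are fetched, the table names, and both function names are hypothetical.

```python
import hashlib
import json


def fingerprint(rows) -> str:
    """Order-independent SHA-256 over rows given as JSON-serialisable dicts."""
    # Serialise each row with sorted keys, then sort the rows themselves,
    # so neither column order nor row order affects the result.
    canonical = sorted(json.dumps(row, sort_keys=True, default=str) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def differing_tables(wsl2: dict, cluster: dict) -> list[str]:
    """Names of tables whose sampled rows differ between the two sides.

    A table missing on one side compares equal to an empty table on the other.
    """
    tables = set(wsl2) | set(cluster)
    return sorted(
        t for t in tables
        if fingerprint(wsl2.get(t, [])) != fingerprint(cluster.get(t, []))
    )
```

`default=str` lets values such as timestamps or UUIDs be serialised, at the cost of comparing their string forms rather than their native types.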
### T07 — Cut over private access to cluster State Hub

```task
id: CUST-WP-0011-T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
needs_human: true
intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint."
```

With human approval, freeze WSL2 writes, take a final dump, restore it to the
cluster, compare counts again, and redirect the active private access path to
the cluster API.

Accepted approaches:

- Keep local MCP config pointed at `http://127.0.0.1:8000` and move that port
  to an ops-bridge/SSH tunnel.
- Or set the MCP server `API_BASE` to the chosen private cluster endpoint.

**Done when:** `get_state_summary()` and dashboard live data are served by the
cluster State Hub, and WSL2 is no longer receiving normal writes.

---

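
The T07 cutover is an ordered sequence that must not start without approval and must stop at the first failed step. A minimal sketch of that control flow follows. The step names paraphrase the T07 description; `run_cutover`, `ApprovalRequired`, and the `actions` mapping are hypothetical, and the real steps would be operator procedures rather than Python callables.

```python
class ApprovalRequired(RuntimeError):
    """Raised when the cutover is attempted without recorded human approval."""


CUTOVER_STEPS = (
    "freeze WSL2 writes",
    "take final dump",
    "restore dump to cluster",
    "compare row counts",
    "redirect private access path",
)


def run_cutover(approved_by, actions, log=print):
    """Run the steps in order; `actions` maps step name to a callable returning bool."""
    if not approved_by:
        raise ApprovalRequired("cutover needs explicit human approval")
    done = []
    for step in CUTOVER_STEPS:
        log(f"[approved by {approved_by}] {step}")
        if not actions[step]():
            # Stop here so no later step (in particular the redirect) runs.
            raise RuntimeError(f"step failed, stopping before later steps: {step}")
        done.append(step)
    return done
```

Because the redirect is the last step, a failed comparison leaves the access path pointing at WSL2, which matches the rule that a failed cluster deploy must leave the WSL2 instance usable.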
### T08 — Stabilise with WSL2 retained as fallback

```task
id: CUST-WP-0011-T08
status: todo
priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
```

Run the cluster State Hub as primary while keeping the WSL2 instance available
as a fallback.

Monitor:

- State Hub pod restarts.
- cnpg cluster health.
- Backup job success.
- Dashboard and MCP behavior from each operator machine.
- Consistency sync behavior for file-backed workplans.

**Done when:** the agreed stabilisation window passes without data loss or
unresolved operational defects.

---

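
The T08 exit condition combines elapsed time with a clean record. A small sketch of that check is below. The window length is still an open human decision, so it is a parameter here, and `stabilisation_passed` is a hypothetical helper rather than anything State Hub provides.

```python
from datetime import date, timedelta


def stabilisation_passed(
    cutover: date,
    today: date,
    window_days: int,
    open_defects: int,
    data_loss_events: int,
) -> bool:
    """True once the window has fully elapsed with no data loss or open defects."""
    window_elapsed = today >= cutover + timedelta(days=window_days)
    return window_elapsed and open_defects == 0 and data_loss_events == 0
```

For example, with a cutover date of 1 June and an illustrative 14-day window, the check first passes on 15 June, and only if both counters are zero.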
### T09 — Document operating model and defer final WSL2 retirement

```task
id: CUST-WP-0011-T09
status: todo
priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
```

Document the new operating model:

- How agents reach State Hub.
- How backups and restores work.
- How to roll back to WSL2 if needed.
- Which parts remain pragmatic/single-node.
- Which long-term requirements moved to `CUST-WP-0038`.

Do not permanently retire WSL2 in this workplan unless a separate human
decision is recorded. Retirement belongs after proven stability or in the
future HA workplan.

**Done when:** runbooks and project instructions match the deployed reality.

## References

- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
- `RAIL-HO-WP-0004-T09` — Railiance-side State Hub deployment task
- `CUST-WP-0038` — future full ThreePhoenix HA State Hub migration
- Constitution constraint: production data migration and fallback retirement
  require explicit human approval