Remove migrated State Hub workplans

2026-05-18 01:33:25 +02:00
parent be9ccf1074
commit 5d478cf746
4 changed files with 0 additions and 1044 deletions


@@ -1,186 +0,0 @@
---
id: CUST-WP-0003
type: workplan
title: "State Hub v0.4 — Workstream Health Index (WHI) KPI Card"
domain: custodian
status: active
owner: custodian
topic_slug: custodian
state_hub_workstream_id: 9cc32158-2f5c-4ef6-9713-aacce4623d5e
created: "2026-02-26"
updated: "2026-02-28"
---
# State Hub v0.4 — Workstream Health Index (WHI) KPI Card
## Summary
Implement the Workstream Health Index (WHI) — a composite structural-health
KPI — as a live card injected into the TOC sidebar of the Workstreams
dashboard page. All six metrics are computable client-side from data
already fetched by `workstreams.md`; no API or schema changes required.
## Context
The WHI formula and metric definitions are specified in
`state-hub/dashboard/src/docs/workstream-kpi.md`. This workplan covers
only the implementation of that spec as running dashboard code.
The six base metrics:
- **DD** — Dependency Density: edge count / open workstream count
- **BR** — Blocked Ratio: blocked workstreams / open count
- **SPR** — Single Point of Risk: max inbound edges / open count
- **PEP** — Progression Enablement Proportion: ready-to-start workstreams / open count
- **CDDR** — Cross-Domain Dependency Ratio: cross-domain edges / total edges
- **CPI** — Cycle Penalty Indicator: 1 if any cycle detected, 0 otherwise
WHI formula: `0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR)`
CPI penalty: `WHI = WHI * 0.5` if CPI=1.
## Tasks
### P1 — Verify dependency edge fields in open_workstreams
```task
id: CUST-WP-0003-T01
state_hub_task_id: 243646e0-b77a-41e7-ac51-82c5828e63d2
status: todo
priority: high
```
Confirm that `summary.open_workstreams[].depends_on[]` and `blocks[]`
each carry `workstream_id`, `workstream_slug`, and `workstream_title`.
Verify these fields are sufficient to build a complete directed dependency
graph client-side without additional API calls. (Already verified during
workplan design — open_workstreams is the confirmed data source.)
### P2.1 — Build directed dependency graph from openWs + completedIds
```task
id: CUST-WP-0003-T02
state_hub_task_id: 6dbef71f-d2d7-44ee-abb8-279dbaeec505
status: todo
priority: high
```
In `workstreams.md`: derive `completedIds`, a `Set` holding the IDs of all
workstreams whose status is completed. Build an adjacency list: for each entry
in openWs, map workstream id → array of `depends_on[].workstream_id`. Build a
reverse map (prerequisite id → list of dependent ids) for SPR computation.
Also build an `idToDomain` map from `data[]` for CDDR.
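A minimal sketch of this step, assuming `openWs` is `summary.open_workstreams` and `data` is the full workstream list. The helper name `buildGraph` and the `id`, `status`, and `domain` fields on `data[]` are assumptions for illustration, not confirmed by the spec:

```javascript
// Sketch only: buildGraph is a hypothetical helper name, and the
// id/status/domain fields on data[] are assumed, not confirmed.
function buildGraph(openWs, data) {
  // IDs of workstreams already completed (used later by PEP).
  const completedIds = new Set(
    data.filter((w) => w.status === "completed").map((w) => w.id)
  );

  const adjacency = new Map(); // workstream id -> prerequisite ids
  const reverse = new Map();   // prerequisite id -> dependent ids (for SPR)
  for (const w of openWs) {
    const deps = (w.depends_on ?? []).map((d) => d.workstream_id);
    adjacency.set(w.id, deps);
    for (const dep of deps) {
      if (!reverse.has(dep)) reverse.set(dep, []);
      reverse.get(dep).push(w.id);
    }
  }

  // Domain lookup for the cross-domain ratio (CDDR).
  const idToDomain = Object.fromEntries(data.map((w) => [w.id, w.domain]));
  return { completedIds, adjacency, reverse, idToDomain };
}
```

Keeping both the forward and the reverse map means SPR can read inbound counts directly instead of rescanning every edge.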
### P2.2 — Implement DFS cycle detection (CPI)
```task
id: CUST-WP-0003-T03
state_hub_task_id: f0d5c107-6029-4ad0-af00-645d35ce7db0
status: todo
priority: high
```
Implement a DFS-based topological sort over the dependency adjacency list.
Detect back edges using visited / inStack colour sets. Return `CPI = 1`
if any cycle found, `CPI = 0` otherwise. Only nodes in openWs participate
(completed/archived workstreams excluded). Edge case: isolated nodes (no
deps, no dependents) are valid and never form cycles.
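The back-edge check could be sketched as follows. This is a simplified recursive DFS that only reports whether a cycle exists, without producing the topological order. The function name `detectCycle` is hypothetical, and `adjacency` is assumed to be a `Map` from open workstream id to an array of prerequisite ids:

```javascript
// Sketch only: detectCycle is a hypothetical name. Returns 1 if the
// open-workstream dependency graph contains a cycle, 0 otherwise.
function detectCycle(adjacency) {
  const visited = new Set(); // nodes whose exploration has started
  const inStack = new Set(); // nodes on the current DFS path

  const visit = (id) => {
    if (inStack.has(id)) return true;   // back edge: cycle found
    if (visited.has(id)) return false;  // already fully explored
    visited.add(id);
    inStack.add(id);
    for (const dep of adjacency.get(id) ?? []) {
      // Prerequisites outside openWs (completed/archived) do not participate.
      if (!adjacency.has(dep)) continue;
      if (visit(dep)) return true;
    }
    inStack.delete(id);
    return false;
  };

  for (const id of adjacency.keys()) {
    if (visit(id)) return 1;
  }
  return 0;
}
```

Isolated nodes have an empty prerequisite list, so they are visited once and never contribute a back edge.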
### P2.3 — Compute DD, BR, SPR, PEP, CDDR
```task
id: CUST-WP-0003-T04
state_hub_task_id: 6da60567-cc46-4a32-9855-b07bafe2faeb
status: todo
priority: high
```
Using the graph from P2.1:
- `DD`: totalEdges / openCount, where totalEdges = openWs.flatMap(w=>w.depends_on).length
- `BR`: openWs.filter(w=>w.status==="blocked").length / openCount
- `SPR`: max inbound-edge count across prerequisite workstreams in openWs / openCount
- `PEP`: openWs.filter(w=>active && all depends_on are in completedIds).length / openCount
- `CDDR`: crossDomainEdges / totalEdges (edge with different domain endpoints); 0 if no edges
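The five ratios above can be sketched in one pass. The name `computeBaseMetrics` is hypothetical, "active" is assumed to mean `status === "active"`, and returning zeros when there are no open workstreams is an assumption the workplan does not state:

```javascript
// Sketch only: computeBaseMetrics is a hypothetical name. Treating
// status === "active" as "active" and returning zeros for an empty
// input are assumptions.
function computeBaseMetrics(openWs, completedIds, idToDomain) {
  const openCount = openWs.length;
  if (openCount === 0) {
    return { dd: 0, br: 0, spr: 0, pep: 0, cddr: 0, openCount: 0, edgeCount: 0 };
  }
  const openIds = new Set(openWs.map((w) => w.id));
  const edges = openWs.flatMap((w) =>
    (w.depends_on ?? []).map((d) => ({ from: w.id, to: d.workstream_id }))
  );
  const edgeCount = edges.length;

  // Inbound-edge counts for prerequisites that are themselves open.
  const inbound = new Map();
  for (const e of edges) {
    if (openIds.has(e.to)) inbound.set(e.to, (inbound.get(e.to) ?? 0) + 1);
  }
  const maxInbound = Math.max(0, ...inbound.values());

  const blocked = openWs.filter((w) => w.status === "blocked").length;
  const ready = openWs.filter(
    (w) =>
      w.status === "active" &&
      (w.depends_on ?? []).every((d) => completedIds.has(d.workstream_id))
  ).length;
  const crossDomain = edges.filter(
    (e) => idToDomain[e.from] !== idToDomain[e.to]
  ).length;

  return {
    dd: edgeCount / openCount,
    br: blocked / openCount,
    spr: maxInbound / openCount,
    pep: ready / openCount,
    cddr: edgeCount === 0 ? 0 : crossDomain / edgeCount,
    openCount,
    edgeCount,
  };
}
```

Note that under this reading an active workstream with no dependencies counts as ready, because `every` is true for an empty list.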
### P2.4 — WHI formula: normalisation + CPI penalty
```task
id: CUST-WP-0003-T05
state_hub_task_id: 29b2dbbd-5d60-49b6-ae84-3dbf22167df7
status: todo
priority: high
```
Implement the weighted aggregation:
```
DDnorm = min(1, DD / 1.0) // DD_critical = 1.0
WHI = 0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR)
if CPI === 1: WHI = WHI * 0.5
```
Clamp to [0, 1]. Return `{whi, dd, ddNorm, br, spr, pep, cddr, cpi, openCount, edgeCount}`.
Factor into `computeWHI(nodes, edges, idToDomain)` for reuse in per-domain scope.
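As an illustration, the aggregation step alone might look like the sketch below. It takes already-computed metrics rather than the `(nodes, edges, idToDomain)` signature named above, and the helper name `aggregateWHI` is hypothetical:

```javascript
// Sketch only: aggregateWHI is a hypothetical helper covering just the
// weighted sum, CPI penalty, and clamp from the formula above.
const DD_CRITICAL = 1.0;

function aggregateWHI({ dd, br, spr, pep, cddr, cpi }) {
  const ddNorm = Math.min(1, dd / DD_CRITICAL);
  let whi =
    0.30 * (1 - ddNorm) +
    0.25 * (1 - br) +
    0.15 * (1 - spr) +
    0.20 * pep +
    0.10 * (1 - cddr);
  if (cpi === 1) whi *= 0.5;            // cycle penalty halves the score
  whi = Math.min(1, Math.max(0, whi));  // clamp to [0, 1]
  return { whi, ddNorm };
}
```

The weights sum to 1.0, so a graph with no edges, no blocked workstreams, and every workstream ready scores the maximum; any detected cycle caps the result at 0.5.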
### P2.5 — Per-domain WHI breakdown
```task
id: CUST-WP-0003-T06
state_hub_task_id: 8ce5ef74-5eb8-4259-9b11-dde13bf84a89
status: todo
priority: medium
```
For each domain present in openWs, compute a domain-scoped WHI:
- `domainNodes = openWs.filter(w => idToDomain[w.id] === domain)`
- `domainEdges = domainNodes.flatMap(w => w.depends_on.filter(d => idToDomain[d.workstream_id] === domain))`
- `result = computeWHI(domainNodes, domainEdges, idToDomain)`
Store as `[{domain, whi, br, pep, cpi, openCount}]`. Skip domains with
`openCount === 0`.
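The per-domain loop could be sketched as below. `computeWHI` is passed in as a parameter so the example stays self-contained; the wrapper name `perDomainWHI` is hypothetical:

```javascript
// Sketch only: perDomainWHI is a hypothetical wrapper. computeWHI is
// injected and assumed to return {whi, br, pep, cpi, openCount, ...}.
function perDomainWHI(openWs, idToDomain, computeWHI) {
  const domains = [...new Set(openWs.map((w) => idToDomain[w.id]))];
  const rows = [];
  for (const domain of domains) {
    const domainNodes = openWs.filter((w) => idToDomain[w.id] === domain);
    // Keep only edges whose prerequisite lives in the same domain.
    const domainEdges = domainNodes.flatMap((w) =>
      (w.depends_on ?? []).filter(
        (d) => idToDomain[d.workstream_id] === domain
      )
    );
    const r = computeWHI(domainNodes, domainEdges, idToDomain);
    if (r.openCount === 0) continue; // skip empty domains
    rows.push({
      domain,
      whi: r.whi,
      br: r.br,
      pep: r.pep,
      cpi: r.cpi,
      openCount: r.openCount,
    });
  }
  return rows;
}
```

Because cross-domain edges are dropped before scoring, CDDR is always 0 inside a domain scope, which slightly inflates domain scores relative to the global WHI.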
### P3 — WHI KPI card UI
```task
id: CUST-WP-0003-T07
state_hub_task_id: 91efba5c-3be2-4bfe-b5ef-1b261e9423f2
status: todo
priority: high
```
Build the `_whiBox` element in `workstreams.md` (mirrors `_kpiBox` in
`decisions.md`):
- Card title: "Workstream Health"
- Main WHI value with health state label: GREEN ≥ 0.75 / ORANGE ≥ 0.50 / RED < 0.50
- Sub-metric rows for DD, BR, SPR, PEP, CDDR with individual warning colours
- Cycle alert row (red ⚠) when CPI=1
- Domain breakdown: compact rows with domain name + coloured score
- Empty state if openCount=0 or no edges
Inject via `injectTocTop("whi-kpi-box", _whiBox)`. Wire
`withDocHelp(_whiBox, "/docs/workstream-health-index")`.
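The threshold mapping for the health state label can be sketched as a small pure function. The name `healthState` is hypothetical; the thresholds are the ones listed above:

```javascript
// Sketch only: healthState is a hypothetical helper name.
function healthState(whi) {
  if (whi >= 0.75) return "GREEN";
  if (whi >= 0.50) return "ORANGE";
  return "RED";
}
```

Keeping this separate from the DOM code makes the boundary values (exactly 0.75 and exactly 0.50) easy to check in isolation.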
### P4.1 — Create src/docs/workstream-health-index.md
```task
id: CUST-WP-0003-T08
state_hub_task_id: 4c898472-e4ae-49a2-b6cd-7aa1a3c7604a
status: todo
priority: medium
```
Reference documentation for the WHI KPI card. Cover: purpose, all six
metrics (formula + interpretation), WHI aggregation formula with CPI
penalty, DD normalisation, health state thresholds, domain breakdown,
cycle detection, and how to improve a poor score. Update
`workstream-kpi.md` to link to this doc.
### P4.2 — Wire withDocHelp and add to Reference nav
```task
id: CUST-WP-0003-T09
state_hub_task_id: 20976663-7ac9-4909-8029-a479190f52ff
status: todo
priority: low
```
Confirm `withDocHelp(_whiBox, "/docs/workstream-health-index")` is wired
(from P3). Add `{ name: "Workstream Health", path: "/docs/workstream-health-index" }`
to the Reference pages array in `observablehq.config.js`. Verify
Reference nav renders correctly in `npm run dev`.


@@ -1,366 +0,0 @@
---
id: CUST-WP-0011
type: workplan
title: "Pragmatic State Hub Migration to railiance01"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
created: "2026-03-11"
updated: "2026-05-15"
state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
supersedes_intent_from: "Migrate Custodian State Hub to ThreePhoenix Cluster"
follow_up_workplan: CUST-WP-0038
---
# Pragmatic State Hub Migration to railiance01
## Goal
Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator
workstation to the current railiance01 Kubernetes environment, using the
Railiance production-readiness path that exists now:
- CloudNative PG (`cnpg`) for the State Hub database in the `databases`
namespace.
- State Hub as an S5 workload in `railiance-apps`.
- Platform/database ownership in `railiance-platform`.
- Access through the existing private tunnel/ops-bridge model, not public
exposure.
- WSL2 retained as a disaster-recovery fallback until the cluster deployment
has proven stable.
This is a deliberate pragmatic step. It improves durability and multi-machine
access before the full ThreePhoenix target is ready. The ultimate multi-node,
replicated, long-term cluster goal is preserved in `CUST-WP-0038`.
## Context Update
The original 2026-03-11 version of this workplan targeted a future
ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before
starting. That was correct as an end-state, but it blocks useful progress now.
The current Railiance architecture has moved on:
- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
supersedes the older Bitnami PostgreSQL HA platform baseline.
- CloudNative PG is the deployed database operator.
- `RAIL-HO-WP-0004-T09` is the Railiance-side task for deploying State Hub to
the cluster, and it still requires human decisions before live data
migration.
This workplan is now the Custodian-side coordination and safety plan for that
T09 effort.
## Safety Contract
State Hub is irreplaceable episodic memory. This migration may prepare, deploy,
test, and compare as much as needed, but it must not make the cluster the only
source of truth until the explicit cutover gate is satisfied.
Rules:
- A fresh WSL2 backup and restore drill is mandatory before data migration.
- The WSL2 State Hub remains available as rollback until stabilisation passes.
- Any task that changes the live writer endpoint requires explicit human
approval.
- A failed cluster deploy must leave the WSL2 instance untouched and usable.
- Row counts and key API checks must match before cutover.
## Target Architecture After This Workplan
```
Operator workstation / COULOMBCORE / other agent hosts
-> local MCP server subprocess
-> http://127.0.0.1:8000 or configured API_BASE
-> private tunnel / ops-bridge
-> railiance01 k3s
-> state-hub Service
-> FastAPI Deployment
-> state-hub-db CloudNative PG Cluster
```
Key properties:
- Single-node pragmatic deployment on railiance01.
- No public unauthenticated exposure.
- Database managed by cnpg, not an ad-hoc Postgres StatefulSet.
- WSL2 retained as DR fallback during stabilisation.
- Future multi-node HA and storage replication are deferred to `CUST-WP-0038`.
## Open Human Decisions
Resolve these before T04/T05 can become live migration work:
1. Final State Hub hostname or tunnel-only endpoint.
2. Container registry choice: Gitea registry vs external interim registry.
3. Exposure model: ClusterIP plus tunnel, private ingress, or both.
4. Approval window for freezing WSL2 writes and migrating the production DB.
5. Stabilisation duration before WSL2 can be considered non-primary fallback.
## Tasks
### T01 — Drill WSL2 State Hub backup restore
```task
id: T01
status: done
priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
completed: "2026-05-02"
```
Take a fresh State Hub backup from the current WSL2 instance and restore it
into an isolated test PostgreSQL instance.
Minimum checks:
- Restore completes without errors.
- Core table row counts match the live WSL2 database.
- `/state/summary` can be served from the restored copy if wired to a test API.
- Drill result is recorded in State Hub progress and, if useful, episodic
memory.
**Done when:** backup and restore are proven within 24 hours of live migration
work.
Result: completed 2026-05-02. A fresh dump from `infra-postgres-1` restored
into disposable container `state-hub-restore-test` on `127.0.0.1:5433`.
Application health and summary checks against the restored database returned
HTTP 200. Restored row counts matched production exactly, including 117
workstreams, 989 tasks, 1423 progress events, and 208 token events.
---
### T02 — Align with Railiance deployment plan
```task
id: T02
status: done
priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
completed: "2026-05-02"
```
Update the cross-repo plan so this Custodian workplan and
`RAIL-HO-WP-0004-T09` point to the same architecture.
Expected outputs:
- `RAIL-HO-WP-0004-T09` remains the Railiance-side execution task.
- This workplan remains the Custodian-side safety/cutover task list.
- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the
near-term migration plan.
- The future HA goal is referenced through `CUST-WP-0038`.
**Done when:** both workplans describe compatible responsibilities and gates.
Result: completed 2026-05-02. `RAIL-HO-WP-0004-T09` now names the same
pragmatic railiance01 path: cnpg database, S5 State Hub workload, restore drill
precondition, empty deploy before data copy, explicit human approval before
freezing WSL2 writes, and WSL2 retained as fallback. Full ThreePhoenix HA stays
deferred to `CUST-WP-0038`.
---
### T03 — Build and publish State Hub container image
```task
id: T03
status: in_progress
priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
```
Package `state-hub/` as a production image.
Requirements:
- Dockerfile builds from the current Python/uv project.
- Alembic and runtime dependencies are available inside the image.
- Image exposes the FastAPI service on port 8000.
- Image tag is pushed to the chosen registry.
- Build provenance is documented in the commit/workplan.
**Done when:** railiance01 can pull the image and a dry-run deployment resolves
it.
Progress 2026-05-03: added `state-hub/Dockerfile`,
`state-hub/.dockerignore`, and `state-hub/docs/container-image.md`. Built
local image `state-hub:local` successfully:
`sha256:e96dbd1e7d2b63e4fb17584c8c2216088a2c9937bfe880c2ad565c7a9f51c0fc`
(~106 MB). Verified container `/state/health` returns HTTP 200 against the
current database when run locally with host networking. Verified Alembic is
available in-image and reports current revision `r5m6n7o8p9q0 (head)`.
Progress 2026-05-03: registry target decision resolved to the self-hosted
Gitea registry. A local SSH tunnel to the NodePort can reach Gitea, but Docker
login/push still receives HTTP 404 from `/v2/`. Runtime inspection shows the
live Gitea `app.ini` has no `[packages]` section, so package registry
enablement/routing must be applied before publishing `state-hub:local`.
Progress 2026-05-15: rebuilt the image from current `state-hub/` sources as
`state-hub:local` with digest
`sha256:039d29654ccb3754c6ecdbe497c6364bbd8452edcdcb7fa937dd9debf5b734ff`
(106004480 bytes). Verified `/state/health` returns
`{"status":"ok","db":"connected"}` from a temporary container on host port
18000 and confirmed in-image Alembic reports `t7o8p9q0r1s2 (head)`. Build
provenance is recorded in `state-hub/docs/container-image.md`.
Remaining: enable the Gitea package/container registry, then tag, push, and
pull the image from railiance01.
---
### T04 — Define State Hub database and app manifests
```task
id: T04
status: todo
priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
```
Create the cluster-side deployment assets using current Railiance boundaries:
- `railiance-platform`: `state-hub-db` cnpg cluster and database credentials.
- `railiance-apps`: State Hub Deployment, Service, ConfigMap, Secret/External
Secret reference, and optional private Ingress.
- Health probes use `GET /state/health`.
- Environment includes `DATABASE_URL` and any required API settings.
**Done when:** manifests lint/apply in a non-destructive dry run and ownership
boundaries are documented.
---
### T05 — Deploy empty State Hub and run migrations on railiance01
```task
id: T05
status: todo
priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
```
Deploy State Hub against an empty `state-hub-db` cnpg database and run Alembic
migrations in the cluster environment.
Checks:
- Pod reaches Ready.
- `/state/health` returns healthy through the intended private access path.
- Alembic reports head.
- Logs show no repeated crash/restart loop.
**Done when:** an empty but structurally valid State Hub runs on railiance01.
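A small sketch of how the health check could be evaluated, assuming the endpoint keeps returning the `{"status":"ok","db":"connected"}` payload recorded under T03. The function name `isHealthy` is hypothetical:

```javascript
// Sketch only: isHealthy is a hypothetical helper. It assumes the
// /state/health payload shape recorded in the T03 progress notes.
function isHealthy(httpStatus, body) {
  return (
    httpStatus === 200 &&
    body?.status === "ok" &&
    body?.db === "connected"
  );
}
```

Checking the `db` field as well as the HTTP status distinguishes a running API pod from one that has actually reached the cnpg database.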
---
### T06 — Restore WSL2 data copy into cluster and compare
```task
id: T06
status: todo
priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
```
Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live
source of truth.
Required comparison:
- Table row counts match.
- Representative workstreams, tasks, decisions, progress events, repos, and
token events are queryable.
- Dashboard and MCP summary calls return expected data through the cluster API.
- Any mismatch is investigated before proceeding.
**Done when:** cluster data is a verified copy of WSL2, but not yet the only
writer.
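The row-count comparison could be sketched as a pure function over two count maps. The function name `compareRowCounts` and the table names in the usage note are hypothetical; how the counts are collected from each database is out of scope here:

```javascript
// Sketch only: compareRowCounts is a hypothetical helper. Inputs are
// plain objects mapping table name -> row count.
function compareRowCounts(source, target) {
  const tables = new Set([...Object.keys(source), ...Object.keys(target)]);
  const mismatches = [];
  for (const table of [...tables].sort()) {
    const s = source[table] ?? null; // null marks a table missing on that side
    const t = target[table] ?? null;
    if (s !== t) mismatches.push({ table, source: s, target: t });
  }
  return { ok: mismatches.length === 0, mismatches };
}
```

For example, comparing `{ workstreams: 117, tasks: 989 }` from WSL2 against the same object from the cluster yields `ok: true`, while a table missing on either side is reported as a mismatch rather than silently ignored.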
---
### T07 — Cut over private access to cluster State Hub
```task
id: T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
needs_human: true
intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint."
```
With human approval, freeze WSL2 writes, take a final dump, restore it to the
cluster, compare counts again, and redirect the active private access path to
the cluster API.
Accepted approaches:
- Keep local MCP config pointed at `http://127.0.0.1:8000` and move that port
to an ops-bridge/SSH tunnel.
- Or set the MCP server `API_BASE` to the chosen private cluster endpoint.
**Done when:** `get_state_summary()` and dashboard live data are served by the
cluster State Hub, and WSL2 is no longer receiving normal writes.
---
### T08 — Stabilise with WSL2 retained as fallback
```task
id: T08
status: todo
priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
```
Run the cluster State Hub as primary while keeping the WSL2 instance available
as a fallback.
Monitor:
- State Hub pod restarts.
- cnpg cluster health.
- Backup job success.
- Dashboard and MCP behavior from each operator machine.
- Consistency sync behavior for file-backed workplans.
**Done when:** the agreed stabilisation window passes without data loss or
unresolved operational defects.
---
### T09 — Document operating model and defer final WSL2 retirement
```task
id: T09
status: todo
priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
```
Document the new operating model:
- How agents reach State Hub.
- How backups and restores work.
- How to roll back to WSL2 if needed.
- Which parts remain pragmatic/single-node.
- Which long-term requirements moved to `CUST-WP-0038`.
Do not permanently retire WSL2 in this workplan unless a separate human
decision is recorded. Retirement belongs after proven stability or in the
future HA workplan.
**Done when:** runbooks and project instructions match the deployed reality.
## References
- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
- `RAIL-HO-WP-0004-T09` — Railiance-side State Hub deployment task
- `CUST-WP-0038` — future full ThreePhoenix HA State Hub migration
- Constitution constraint: production data migration and fallback retirement
require explicit human approval


@@ -1,246 +0,0 @@
---
id: CUST-WP-0012
type: workplan
title: "Multi-User Onboarding and Environment Bootstrap"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
state_hub_workstream_id: "a28d9e29-4119-4b73-9469-f921920253ef"
created: "2026-03-11"
updated: "2026-03-11"
---
# Multi-User Onboarding and Environment Bootstrap
## Goal
Make the Custodian system accessible to collaborators beyond the primary
operator. A new user (or a new machine for the existing operator) should
be able to go from zero to a productive Claude Code session with full
State Hub MCP connectivity in a single session, without manual steps or
undocumented tribal knowledge.
## Context
Several friction points surfaced during the 2026-03-11 session:
- No SSH key for Railiance01 on WSL2 → blocked `make tunnel-loop`
- No `~/.railiance_gitea.conf` → blocked repo creation script
- Token missing `read:user` scope → blocked org repo creation
- No `git credential.helper` → credentials required on every push
- MCP registration is manual and documented only in `CLAUDE.md`
Each of these is a solved problem in isolation. This workstream collects
them into a repeatable, documented bootstrap experience.
## Scope
Two personas:
| Persona | Access level | Typical machine |
|---------|-------------|-----------------|
| Primary operator | Full access, all domains | WSL2 workstation |
| Domain collaborator | Read + write to one domain | COULOMBCORE, remote laptop |
## Tasks
### T01 — Git credential.helper for Gitea access
```task
id: CUST-WP-0012-T01
state_hub_task_id: 71628269-9a75-4dae-a347-e64a86040322
status: todo
priority: medium
```
Document and automate `git credential.helper` setup for Gitea
(`http://92.205.130.254:32166`). Recommend `libsecret` (keyring-backed)
on machines that support it; fall back to `credential.helper=store`
(persistent, plaintext `~/.git-credentials`) on headless servers.
Include in bootstrap script (T04) and onboarding guide (T05).
```bash
# Pick ONE of the options below; each command overwrites the previous setting.
# Preferred: libsecret (GNOME keyring, WSL2 with keyring daemon)
sudo apt-get install -y libsecret-1-0 libsecret-1-dev
sudo make -C /usr/share/doc/git/contrib/credential/libsecret
git config --global credential.helper \
  /usr/share/doc/git/contrib/credential/libsecret/git-credential-libsecret
# Fallback: store (persistent, plaintext ~/.git-credentials)
git config --global credential.helper store
# Headless server alternative: cache (in-memory, 1h timeout)
git config --global credential.helper 'cache --timeout=3600'
```
**Done when:** included in bootstrap script; push to Gitea works without
re-entering credentials on second attempt.
---
### T02 — SSH key generation and authorization automation
```task
id: CUST-WP-0012-T02
state_hub_task_id: fea965e9-8a8f-439c-9096-8f7756eb71ed
status: todo
priority: medium
```
Script or Ansible task that:
1. Generates an `ed25519` key pair on the new machine if none exists
2. Displays the public key with copy instructions
3. Authorizes it on all managed hosts (Railiance01, COULOMBCORE) via
`ssh-copy-id` or Ansible `authorized_key` module
Surfaced by: RAIL-PL-WP-0001 T01 — no SSH key on WSL2 blocked
`make tunnel-loop HOST=tegwick@92.205.62.239`.
```bash
# Generate if missing
[[ -f ~/.ssh/id_ed25519 ]] || ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
# Authorize on a target host (requires existing access once)
ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.62.239
ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.130.254
```
**Done when:** included in bootstrap script; documented in onboarding guide.
---
### T03 — Claude Code MCP registration automation
```task
id: CUST-WP-0012-T03
state_hub_task_id: 60318e9a-972e-45c8-afde-82ed0625f594
status: todo
priority: medium
```
Automate the state-hub MCP server registration on a new machine.
Currently this is a multi-step manual process documented in
`~/.claude/CLAUDE.md`. It should be a single `make` target or script:
```bash
# In the-custodian/state-hub/
make register-mcp # idempotent; safe to re-run
```
The script should:
1. Detect whether `state-hub` is already in `~/.claude.json`
2. Extract the server config from `.mcp.json`
3. Run `claude mcp add-json -s user state-hub <config>`
4. Run `patch_mcp_cwd.py` to restore the cwd field
5. Print instructions to restart Claude Code
Should also detect whether the state hub is reachable directly
(`http://127.0.0.1:8000`) or needs a tunnel (via ops-bridge), and emit
a warning if neither is available.
**Done when:** `make register-mcp` works on a clean machine; documented
in onboarding guide.
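The detection and command-building steps (1 to 3) could be sketched as a pure planning function. The name `planMcpRegistration` is hypothetical, and the `mcpServers` key is an assumption about the layout of both `~/.claude.json` and `.mcp.json`; reading the files and running the command are left to the caller:

```javascript
// Sketch only: planMcpRegistration is a hypothetical helper. The
// "mcpServers" key in both JSON documents is an assumed layout.
function planMcpRegistration(claudeConfig, projectMcp, name = "state-hub") {
  // Step 1: already registered? Then re-running is a no-op.
  if (claudeConfig?.mcpServers?.[name]) {
    return { action: "skip", reason: `${name} already registered` };
  }
  // Step 2: extract the server config from .mcp.json.
  const serverConfig = projectMcp?.mcpServers?.[name];
  if (!serverConfig) {
    return { action: "fail", reason: `${name} not found in .mcp.json` };
  }
  // Step 3: argv for `claude mcp add-json -s user state-hub <config>`.
  return {
    action: "add",
    argv: [
      "claude", "mcp", "add-json", "-s", "user",
      name, JSON.stringify(serverConfig),
    ],
  };
}
```

Returning a plan instead of executing it keeps the idempotency check testable without touching a real `~/.claude.json`.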
---
### T04 — Environment bootstrap script
```task
id: CUST-WP-0012-T04
state_hub_task_id: 84a94761-e424-4470-a9a2-64d9cabadb7f
status: todo
priority: high
```
Single idempotent script: `state-hub/scripts/bootstrap-env.sh`
Checks/installs prerequisites and configures the environment:
| Step | What |
|------|------|
| Prerequisites | git, sops, age, helm, kubectl, uv, claude CLI |
| Git credential | `credential.helper` (libsecret or store) |
| SSH key | Generate ed25519 if missing; display public key |
| MCP registration | `make register-mcp` (T03) |
| Gitea config | Prompt for token; write `~/.railiance_gitea.conf` |
| Health check | `curl /state/health`; warn if tunnel needed |
Design constraints:
- Idempotent: safe to run on an already-configured machine
- No silent failures: each step prints ✓ / ✗ / ⚠
- Minimal dependencies: bash + curl only to get started
**Done when:** running the script on a clean Ubuntu 24.04 machine
produces a working Custodian environment with no additional manual steps.
---
### T05 — Onboarding guide and user journey documentation
```task
id: CUST-WP-0012-T05
state_hub_task_id: b0839802-659a-475b-8b84-ab7341ea3d15
status: todo
priority: medium
```
Write `docs/onboarding.md` in the-custodian covering the full journey
for both personas:
**Primary operator (new machine):**
1. Prerequisites (git, SSH client)
2. Clone `the-custodian`
3. Run `make bootstrap-env` (T04)
4. Restart Claude Code → verify MCP is active
5. First session: `get_state_summary()` → orient → work
**Domain collaborator (new person):**
1. Prerequisites + Gitea account
2. `ssh-copy-id` to get access to Railiance01 (or just COULOMBCORE)
3. Set up ops-bridge tunnel to reach state hub
4. Clone domain repo
5. First Claude Code session with MCP via tunnel
6. Contributing a workplan (ADR-001 convention)
**Done when:** a new collaborator can follow the guide without
clarification from the primary operator.
---
### T06 — State Hub multi-user model (deferred)
```task
id: CUST-WP-0012-T06
state_hub_task_id: d5df3302-67b9-4765-a8d8-ea2df53dff6e
status: todo
priority: low
```
Design a lightweight user/role model for the state hub:
| Role | Permissions |
|------|-------------|
| Primary operator | Full read/write, all domains |
| Domain collaborator | Read all; write to own domain only |
| Observer | Read-only |
Decision needed: enforce at API layer (HTTP Basic / token auth per
domain) or rely on Gitea repo permissions as the authoritative boundary
(simpler — the hub is a read model anyway).
**Deferred until:** first external collaborator is actively onboarding.
Implement T01 to T05 first; multi-user access control is only needed when
there is more than one user.
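If the API-layer option were chosen, the role table above might reduce to a check like this sketch. The role keys, the `user` object shape, and the function name `canWrite` are all hypothetical, since the design is still undecided:

```javascript
// Sketch only: role keys, user shape, and canWrite are hypothetical
// illustrations of the role table, not a decided design.
const ROLES = {
  primary_operator: { write: "all" },
  domain_collaborator: { write: "own" },
  observer: { write: "none" },
};

function canWrite(user, targetDomain) {
  const role = ROLES[user.role];
  if (!role) return false;              // unknown roles get no write access
  if (role.write === "all") return true;
  if (role.write === "own") return user.domain === targetDomain;
  return false;
}
```

All three roles can read everything in the table, so only the write path needs a domain comparison.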
---
## References
- ops-bridge repo: `ops-bridge` (tunnel lifecycle management)
- MCP registration: `~/.claude/CLAUDE.md` (current manual procedure)
- Gitea repo creation: `railiance-cluster/tools/create_railiance_repo.sh`
- ADR-001: workplans as repo artefacts
- Surfaced by: RAIL-PL-WP-0001 T01 execution, 2026-03-11


@@ -1,246 +0,0 @@
---
id: CUST-WP-0038
type: workplan
title: "State Hub Full ThreePhoenix HA Migration"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
created: "2026-05-02"
updated: "2026-05-02"
depends_on: CUST-WP-0011
state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
---
# State Hub Full ThreePhoenix HA Migration
## Goal
Preserve the original long-term State Hub infrastructure goal while
`CUST-WP-0011` takes the pragmatic railiance01 path.
This workplan completes the migration from a useful single-node cluster-hosted
State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes,
replicated storage, tested failover, tested restore, and retirement of the WSL2
fallback only after operational confidence is earned.
## Why This Exists
The near-term State Hub migration should not wait for every HA precondition,
because the workstation-hosted State Hub is already a bottleneck for
multi-machine work.
But the original requirement remains valid:
- State Hub is irreplaceable episodic memory.
- A single node is not a final home.
- Backup and restore must be drilled, not assumed.
- Long-term operations must survive node loss and operator-machine loss.
`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan
keeps the ultimate target visible and reviewable.
## Entry Criteria
- `CUST-WP-0011` completed or explicitly superseded.
- Cluster-hosted State Hub has passed its stabilisation period.
- railiance01 is not the only planned durable node.
- Railiance architecture decision for storage replication is current:
Longhorn, cnpg replication, external backup, or a documented replacement.
- Backup and restore tooling has an owner and runbook.
## Target Properties
- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03.
- State Hub database survives loss of one node.
- State Hub API recovers from pod loss without manual repair.
- Backups are encrypted, off-node, and restorable into a test namespace.
- Agent access remains private.
- WSL2 is no longer needed as the primary disaster-recovery fallback.
## Tasks
### T01 — Confirm ThreePhoenix cluster readiness
```task
id: T01
status: todo
priority: high
state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
```
Verify the target cluster state:
- Three nodes are joined and Ready.
- Control-plane and worker roles are documented.
- Cluster version and node resources are recorded.
- Smoke tests pass from the operator machine and from COULOMBCORE.
**Done when:** a current readiness report exists and no node is marked
NotReady or operationally unknown.
---
### T02 — Establish replicated storage/database strategy
```task
id: T02
status: todo
priority: high
state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
```
Choose and document the durable data strategy for State Hub:
- cnpg multi-instance PostgreSQL cluster, and/or
- Longhorn-backed storage with suitable replication, and/or
- another explicitly approved architecture.
The decision must define RPO, RTO, failover behavior, and restore procedure.
**Done when:** the selected architecture is documented and approved before any
production data movement.
---
### T03 — Implement HA State Hub database
```task
id: T03
status: todo
priority: high
state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
```
Apply the chosen database/storage architecture to State Hub.
Requirements:
- Database credentials remain SOPS/secret-managed.
- The database has automated backup configured.
- The database exposes a stable service endpoint for the API.
- Health and replication status are observable.
**Done when:** State Hub can run against the HA database in a test or staging
namespace.
---
### T04 — Add State Hub API high-availability behavior
```task
id: T04
status: todo
priority: medium
state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
```
Run State Hub API with the right availability posture for its workload:
- At least one replica, optionally more if DB/session behavior permits.
- Readiness and liveness probes.
- Rolling update behavior documented.
- Resource requests/limits set.
**Done when:** killing an API pod does not require manual recovery.
---
### T05 — Drill database failover
```task
id: T05
status: todo
priority: high
state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
```
Perform a controlled failover drill for the State Hub database.
Checks:
- Failure trigger is documented.
- API behavior during failover is observed.
- Recovery time is measured.
- No data loss is detected after recovery.
**Done when:** the failover drill passes and results are logged.
---
### T06 — Drill backup restore to isolated namespace
```task
id: T06
status: todo
priority: high
state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
```
Restore the latest encrypted State Hub backup into an isolated namespace or
separate test database.
Checks:
- Backup can be decrypted with the documented key path.
- Restore completes from off-node backup material.
- Row counts and representative records match.
- Restored API can serve `/state/health` and `/state/summary` when pointed at
the restored database.
**Done when:** restore drill passes without depending on the live database.
---
### T07 — Update agent access and runbooks for HA endpoint
```task
id: T07
status: todo
priority: medium
state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
```
Update the private access model after the HA endpoint is available:
- ops-bridge or tunnel target.
- MCP `API_BASE` or local port-forward convention.
- Dashboard access.
- Operator recovery instructions.
**Done when:** each active operator machine can reach the HA State Hub endpoint
through the documented path.
---
### T08 — Retire WSL2 fallback after explicit approval
```task
id: T08
status: todo
priority: low
needs_human: true
intervention_note: "Requires explicit approval after HA failover and restore drills pass."
state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
```
Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA
cluster path has passed drills.
Steps:
1. Take and archive a final WSL2 backup.
2. Stop local WSL2 State Hub services.
3. Update global and repo instructions.
4. Record the retirement decision in State Hub.
**Done when:** WSL2 is no longer part of the normal or fallback operating
model, and the cluster runbook is the source of truth.
## References
- `CUST-WP-0011` — pragmatic railiance01 migration
- Railiance ThreePhoenix infrastructure goal
- State Hub backup/restore runbooks
- Constitution constraint: irreversible retirement requires human approval