From febad058e639c24d7358d4ce013a1ff0312c8ae8 Mon Sep 17 00:00:00 2001
From: tegwick <bernd.worsch@gmail.com>
Date: Mon, 18 May 2026 01:32:29 +0200
Subject: [PATCH] Migrate State Hub workplans

---
 workplans/CUST-WP-0003-whi-kpi-card.md        | 188 +++++++++
 ...P-0011-state-hub-threephoenix-migration.md | 366 ++++++++++++++++++
 .../CUST-WP-0012-multi-user-onboarding.md     | 246 ++++++++++++
 .../CUST-WP-0038-state-hub-threephoenix-ha.md | 246 ++++++++++++
 4 files changed, 1046 insertions(+)
 create mode 100644 workplans/CUST-WP-0003-whi-kpi-card.md
 create mode 100644 workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
 create mode 100644 workplans/CUST-WP-0012-multi-user-onboarding.md
 create mode 100644 workplans/CUST-WP-0038-state-hub-threephoenix-ha.md

diff --git a/workplans/CUST-WP-0003-whi-kpi-card.md b/workplans/CUST-WP-0003-whi-kpi-card.md
new file mode 100644
index 0000000..87dffef
--- /dev/null
+++ b/workplans/CUST-WP-0003-whi-kpi-card.md
@@ -0,0 +1,188 @@
+---
+id: CUST-WP-0003
+type: workplan
+title: "State Hub v0.4 — Workstream Health Index (WHI) KPI Card"
+domain: custodian
+repo: state-hub
+status: active
+owner: custodian
+topic_slug: custodian
+state_hub_workstream_id: 9cc32158-2f5c-4ef6-9713-aacce4623d5e
+created: "2026-02-26"
+updated: "2026-05-17"
+---
+
+# State Hub v0.4 — Workstream Health Index (WHI) KPI Card
+
+## Summary
+
+Implement the Workstream Health Index (WHI) — a composite structural-health
+KPI — as a live card injected into the TOC sidebar of the Workstreams
+dashboard page. All six metrics are computable client-side from data
+already fetched by `dashboard/src/workstreams.md`; no API or schema changes
+required.
+
+## Context
+
+The WHI formula and metric definitions are specified in
+`dashboard/src/docs/workstream-kpi.md`. This workplan covers
+only the implementation of that spec as running dashboard code.
+
+The six base metrics:
+- **DD** — Dependency Density: edge count / open workstream count
+- **BR** — Blocked Ratio: blocked workstreams / open count
+- **SPR** — Single Point of Risk: max inbound edges / open count
+- **PEP** — Progression Enablement Proportion: ready-to-start workstreams / open count
+- **CDDR** — Cross-Domain Dependency Ratio: cross-domain edges / total edges
+- **CPI** — Cycle Penalty Indicator: 1 if any cycle detected, 0 otherwise
+
+WHI formula: `0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR)`
+CPI penalty: `WHI = WHI * 0.5` if CPI=1.
+
+## Tasks
+
+### P1 — Verify dependency edge fields in open_workstreams
+
+```task
+id: CUST-WP-0003-T01
+state_hub_task_id: 243646e0-b77a-41e7-ac51-82c5828e63d2
+status: todo
+priority: high
+```
+
+Confirm that `summary.open_workstreams[].depends_on[]` and `blocks[]`
+each carry `workstream_id`, `workstream_slug`, and `workstream_title`.
+Verify these fields are sufficient to build a complete directed dependency
+graph client-side without additional API calls. (Already verified during
+workplan design — open_workstreams is the confirmed data source.)
+
+### P2.1 — Build directed dependency graph from openWs + completedIds
+
+```task
+id: CUST-WP-0003-T02
+state_hub_task_id: 6dbef71f-d2d7-44ee-abb8-279dbaeec505
+status: todo
+priority: high
+```
+
+In `dashboard/src/workstreams.md`: derive `completedIds`, a `Set` of the IDs of
+workstreams with status `completed`. Build an adjacency list: for each entry in openWs,
+map workstream id → array of `depends_on[].workstream_id`. Build reverse
+map (prerequisite id → list of dependent ids) for SPR computation. Also
+build `idToDomain` map from `data[]` for CDDR.
+
+### P2.2 — Implement DFS cycle detection (CPI)
+
+```task
+id: CUST-WP-0003-T03
+state_hub_task_id: f0d5c107-6029-4ad0-af00-645d35ce7db0
+status: todo
+priority: high
+```
+
+Implement a DFS-based topological sort over the dependency adjacency list.
+Detect back edges using visited / inStack colour sets. Return `CPI = 1`
+if any cycle found, `CPI = 0` otherwise. Only nodes in openWs participate
+(completed/archived workstreams excluded). Edge case: isolated nodes (no
+deps, no dependents) are valid and never form cycles.
+
+### P2.3 — Compute DD, BR, SPR, PEP, CDDR
+
+```task
+id: CUST-WP-0003-T04
+state_hub_task_id: 6da60567-cc46-4a32-9855-b07bafe2faeb
+status: todo
+priority: high
+```
+
+Using the graph from P2.1:
+- `DD`: totalEdges / openCount, where totalEdges = openWs.flatMap(w=>w.depends_on).length
+- `BR`: openWs.filter(w=>w.status==="blocked").length / openCount
+- `SPR`: max inbound-edge count across prerequisite workstreams in openWs / openCount
+- `PEP`: openWs.filter(w=>active && all depends_on are in completedIds).length / openCount
+- `CDDR`: crossDomainEdges / totalEdges (edges whose endpoints are in different domains); 0 if no edges
+
+### P2.4 — WHI formula: normalisation + CPI penalty
+
+```task
+id: CUST-WP-0003-T05
+state_hub_task_id: 29b2dbbd-5d60-49b6-ae84-3dbf22167df7
+status: todo
+priority: high
+```
+
+Implement the weighted aggregation:
+```
+DDnorm = min(1, DD / 1.0)   // DD_critical = 1.0
+WHI    = 0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR)
+if CPI === 1: WHI = WHI * 0.5
+```
+Clamp to [0, 1]. Return `{whi, dd, ddNorm, br, spr, pep, cddr, cpi, openCount, edgeCount}`.
+Factor into `computeWHI(nodes, edges, idToDomain)` for reuse in per-domain scope.
+
+### P2.5 — Per-domain WHI breakdown
+
+```task
+id: CUST-WP-0003-T06
+state_hub_task_id: 8ce5ef74-5eb8-4259-9b11-dde13bf84a89
+status: todo
+priority: medium
+```
+
+For each domain present in openWs, compute a domain-scoped WHI:
+- `domainNodes = openWs.filter(w => idToDomain[w.id] === domain)`
+- `domainEdges = domainNodes.flatMap(w => w.depends_on.filter(d => idToDomain[d.workstream_id] === domain))`
+- `result = computeWHI(domainNodes, domainEdges, idToDomain)`
+
+Store as `[{domain, whi, br, pep, cpi, openCount}]`. Skip domains with
+`openCount === 0`.
+
+### P3 — WHI KPI card UI
+
+```task
+id: CUST-WP-0003-T07
+state_hub_task_id: 91efba5c-3be2-4bfe-b5ef-1b261e9423f2
+status: todo
+priority: high
+```
+
+Build the `_whiBox` element in `dashboard/src/workstreams.md` (mirrors `_kpiBox` in
+`decisions.md`):
+- Card title: "Workstream Health"
+- Main WHI value with health state label: GREEN ≥ 0.75 / ORANGE ≥ 0.50 / RED < 0.50
+- Sub-metric rows for DD, BR, SPR, PEP, CDDR with individual warning colours
+- Cycle alert row (red ⚠) when CPI=1
+- Domain breakdown: compact rows with domain name + coloured score
+- Empty state if openCount=0 or no edges
+
+Inject via `injectTocTop("whi-kpi-box", _whiBox)`. Wire
+`withDocHelp(_whiBox, "/docs/workstream-health-index")`.
+
+### P4.1 — Create src/docs/workstream-health-index.md
+
+```task
+id: CUST-WP-0003-T08
+state_hub_task_id: 4c898472-e4ae-49a2-b6cd-7aa1a3c7604a
+status: todo
+priority: medium
+```
+
+Reference documentation for the WHI KPI card. Cover: purpose, all six
+metrics (formula + interpretation), WHI aggregation formula with CPI
+penalty, DD normalisation, health state thresholds, domain breakdown,
+cycle detection, and how to improve a poor score. Update
+`workstream-kpi.md` to link to this doc.
+
+### P4.2 — Wire withDocHelp and add to Reference nav
+
+```task
+id: CUST-WP-0003-T09
+state_hub_task_id: 20976663-7ac9-4909-8029-a479190f52ff
+status: todo
+priority: low
+```
+
+Confirm `withDocHelp(_whiBox, "/docs/workstream-health-index")` is wired
+(from P3). Add `{ name: "Workstream Health", path: "/docs/workstream-health-index" }`
+to the Reference pages array in `observablehq.config.js`. Verify
+Reference nav renders correctly in `npm run dev`.
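
The graph, cycle-detection, and aggregation steps in P2.1 to P2.4 of `CUST-WP-0003` can be sketched as a small reference model. This is a sketch under stated assumptions, not the dashboard implementation: it is written in Python for checking numbers, whereas the card itself is planned as client-side code in `dashboard/src/workstreams.md`. The function names `has_cycle`, `compute_whi`, and `health_state`, the extra `completed_ids` parameter, and the status strings `"active"` and `"blocked"` are assumptions made for illustration.

```python
from collections import defaultdict


def has_cycle(adjacency):
    """Detect a back edge with an iterative three-colour DFS."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in adjacency}
    for start in adjacency:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(adjacency[start]))]
        while stack:
            node, neighbours = stack[-1]
            nxt = next(neighbours, None)
            if nxt is None:            # all neighbours explored
                colour[node] = BLACK
                stack.pop()
            elif colour[nxt] == GREY:  # neighbour is on the current DFS path
                return True
            elif colour[nxt] == WHITE:
                colour[nxt] = GREY
                stack.append((nxt, iter(adjacency[nxt])))
    return False


def compute_whi(nodes, edges, id_to_domain, completed_ids, dd_critical=1.0):
    """nodes: open workstream dicts; edges: (dependent_id, prerequisite_id) pairs."""
    open_count = len(nodes)
    if open_count == 0:
        return None  # caller renders the empty state
    open_ids = {w["id"] for w in nodes}
    adjacency = {w["id"]: [] for w in nodes}
    inbound = defaultdict(int)
    for dependent, prerequisite in edges:
        # Only open workstreams take part in cycle detection and SPR.
        if dependent in open_ids and prerequisite in open_ids:
            adjacency[dependent].append(prerequisite)
            inbound[prerequisite] += 1
    edge_count = len(edges)

    dd = edge_count / open_count
    br = sum(w["status"] == "blocked" for w in nodes) / open_count
    spr = max(inbound.values(), default=0) / open_count
    pep = sum(
        w["status"] == "active"
        and all(d["workstream_id"] in completed_ids for d in w["depends_on"])
        for w in nodes
    ) / open_count
    cross = sum(id_to_domain.get(a) != id_to_domain.get(b) for a, b in edges)
    cddr = cross / edge_count if edge_count else 0.0
    cpi = 1 if has_cycle(adjacency) else 0

    dd_norm = min(1.0, dd / dd_critical)
    whi = (0.30 * (1 - dd_norm) + 0.25 * (1 - br) + 0.15 * (1 - spr)
           + 0.20 * pep + 0.10 * (1 - cddr))
    if cpi == 1:
        whi *= 0.5  # cycle penalty
    whi = max(0.0, min(1.0, whi))
    return {"whi": whi, "dd": dd, "ddNorm": dd_norm, "br": br, "spr": spr,
            "pep": pep, "cddr": cddr, "cpi": cpi,
            "openCount": open_count, "edgeCount": edge_count}


def health_state(whi):
    """Map a WHI value to the card's label thresholds."""
    if whi >= 0.75:
        return "GREEN"
    return "ORANGE" if whi >= 0.50 else "RED"
```

For the per-domain breakdown in P2.5, the same function would be called with the nodes and edges filtered to a single domain.
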
diff --git a/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md b/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
new file mode 100644
index 0000000..a33e267
--- /dev/null
+++ b/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
@@ -0,0 +1,366 @@
+---
+id: CUST-WP-0011
+type: workplan
+title: "Pragmatic State Hub Migration to railiance01"
+domain: custodian
+repo: state-hub
+status: active
+owner: custodian
+topic_slug: custodian
+created: "2026-03-11"
+updated: "2026-05-17"
+state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
+supersedes_intent_from: "Migrate Custodian State Hub to ThreePhoenix Cluster"
+follow_up_workplan: CUST-WP-0038
+---
+
+# Pragmatic State Hub Migration to railiance01
+
+## Goal
+
+Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator
+workstation to the current railiance01 Kubernetes environment, using the
+Railiance production-readiness path that exists now:
+
+- CloudNative PG (`cnpg`) for the State Hub database in the `databases`
+  namespace.
+- State Hub as an S5 workload in `railiance-apps`.
+- Platform/database ownership in `railiance-platform`.
+- Access through the existing private tunnel/ops-bridge model, not public
+  exposure.
+- WSL2 retained as a disaster-recovery fallback until the cluster deployment
+  has proven stable.
+
+This is a deliberate pragmatic step. It improves durability and multi-machine
+access before the full ThreePhoenix target is ready. The ultimate multi-node,
+replicated, long-term cluster goal is preserved in `CUST-WP-0038`.
+
+## Context Update
+
+The original 2026-03-11 version of this workplan targeted a future
+ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before
+starting. That was correct as an end-state, but it blocks useful progress now.
+
+The current Railiance architecture has moved on:
+
+- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
+  supersedes the older Bitnami PostgreSQL HA platform baseline.
+- CloudNative PG is the deployed database operator.
+- `RAIL-HO-WP-0004-T09` is the Railiance-side task for deploying State Hub to
+  the cluster, and it still requires human decisions before live data
+  migration.
+
+This workplan is now the Custodian-side coordination and safety plan for that
+T09 effort.
+
+## Safety Contract
+
+State Hub is irreplaceable episodic memory. This migration may prepare, deploy,
+test, and compare as much as needed, but it must not make the cluster the only
+source of truth until the explicit cutover gate is satisfied.
+
+Rules:
+
+- A fresh WSL2 backup and restore drill is mandatory before data migration.
+- The WSL2 State Hub remains available as rollback until stabilisation passes.
+- Any task that changes the live writer endpoint requires explicit human
+  approval.
+- A failed cluster deploy must leave the WSL2 instance untouched and usable.
+- Row counts and key API checks must match before cutover.
+
+## Target Architecture After This Workplan
+
+```
+Operator workstation / COULOMBCORE / other agent hosts
+  -> local MCP server subprocess
+     -> http://127.0.0.1:8000 or configured API_BASE
+        -> private tunnel / ops-bridge
+           -> railiance01 k3s
+              -> state-hub Service
+                 -> FastAPI Deployment
+                 -> state-hub-db CloudNative PG Cluster
+```
+
+Key properties:
+
+- Single-node pragmatic deployment on railiance01.
+- No public unauthenticated exposure.
+- Database managed by cnpg, not an ad-hoc Postgres StatefulSet.
+- WSL2 retained as DR fallback during stabilisation.
+- Future multi-node HA and storage replication are deferred to `CUST-WP-0038`.
+
+## Open Human Decisions
+
+Resolve these before T04/T05 can become live migration work:
+
+1. Final State Hub hostname or tunnel-only endpoint.
+2. Container registry choice: Gitea registry vs external interim registry.
+3. Exposure model: ClusterIP plus tunnel, private ingress, or both.
+4. Approval window for freezing WSL2 writes and migrating the production DB.
+5. Stabilisation duration before WSL2 can be considered non-primary fallback.
+
+## Tasks
+
+### T01 — Drill WSL2 State Hub backup restore
+
+```task
+id: T01
+status: done
+priority: high
+state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
+completed: "2026-05-02"
+```
+
+Take a fresh State Hub backup from the current WSL2 instance and restore it
+into an isolated test PostgreSQL instance.
+
+Minimum checks:
+
+- Restore completes without errors.
+- Core table row counts match the live WSL2 database.
+- `/state/summary` can be served from the restored copy if wired to a test API.
+- Drill result is recorded in State Hub progress and, if useful, episodic
+  memory.
+
+**Done when:** backup and restore are proven within 24 hours of live migration
+work.
+
+Result: completed 2026-05-02. A fresh dump from `infra-postgres-1` restored
+into disposable container `state-hub-restore-test` on `127.0.0.1:5433`.
+Application health and summary checks against the restored database returned
+HTTP 200. Restored row counts matched production exactly, including 117
+workstreams, 989 tasks, 1423 progress events, and 208 token events.
+
+---
+
+### T02 — Align with Railiance deployment plan
+
+```task
+id: T02
+status: done
+priority: high
+state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
+completed: "2026-05-02"
+```
+
+Update the cross-repo plan so this Custodian workplan and
+`RAIL-HO-WP-0004-T09` point to the same architecture.
+
+Expected outputs:
+
+- `RAIL-HO-WP-0004-T09` remains the Railiance-side execution task.
+- This workplan remains the Custodian-side safety/cutover task list.
+- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the
+  near-term migration plan.
+- The future HA goal is referenced through `CUST-WP-0038`.
+
+**Done when:** both workplans describe compatible responsibilities and gates.
+
+Result: completed 2026-05-02. `RAIL-HO-WP-0004-T09` now names the same
+pragmatic railiance01 path: cnpg database, S5 State Hub workload, restore drill
+precondition, empty deploy before data copy, explicit human approval before
+freezing WSL2 writes, and WSL2 retained as fallback. Full ThreePhoenix HA stays
+deferred to `CUST-WP-0038`.
+
+---
+
+### T03 — Build and publish State Hub container image
+
+```task
+id: T03
+status: in_progress
+priority: high
+state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
+```
+
+Package this repository as a production image.
+
+Requirements:
+
+- Dockerfile builds from the current Python/uv project.
+- Alembic and runtime dependencies are available inside the image.
+- Image exposes the FastAPI service on port 8000.
+- Image tag is pushed to the chosen registry.
+- Build provenance is documented in the commit/workplan.
+
+**Done when:** railiance01 can pull the image and a dry-run deployment resolves
+it.
+
+Progress 2026-05-03: added `Dockerfile`,
+`.dockerignore`, and `docs/container-image.md`. Built
+local image `state-hub:local` successfully:
+`sha256:e96dbd1e7d2b63e4fb17584c8c2216088a2c9937bfe880c2ad565c7a9f51c0fc`
+(~106 MB). Verified container `/state/health` returns HTTP 200 against the
+current database when run locally with host networking. Verified Alembic is
+available in-image and reports current revision `r5m6n7o8p9q0 (head)`.
+
+Progress 2026-05-03: registry target decision resolved to the self-hosted
+Gitea registry. A local SSH tunnel to the NodePort can reach Gitea, but Docker
+login/push still receives HTTP 404 from `/v2/`. Runtime inspection shows the
+live Gitea `app.ini` has no `[packages]` section, so package registry
+enablement/routing must be applied before publishing `state-hub:local`.
+
+Progress 2026-05-15: rebuilt the image from current State Hub sources as
+`state-hub:local` with digest
+`sha256:039d29654ccb3754c6ecdbe497c6364bbd8452edcdcb7fa937dd9debf5b734ff`
+(106004480 bytes). Verified `/state/health` returns
+`{"status":"ok","db":"connected"}` from a temporary container on host port
+18000 and confirmed in-image Alembic reports `t7o8p9q0r1s2 (head)`. Build
+provenance is recorded in `docs/container-image.md`.
+
+Remaining: enable the Gitea package/container registry, then tag, push, and
+pull the image from railiance01.
+
+---
+
+### T04 — Define State Hub database and app manifests
+
+```task
+id: T04
+status: todo
+priority: high
+state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
+```
+
+Create the cluster-side deployment assets using current Railiance boundaries:
+
+- `railiance-platform`: `state-hub-db` cnpg cluster and database credentials.
+- `railiance-apps`: State Hub Deployment, Service, ConfigMap, Secret/External
+  Secret reference, and optional private Ingress.
+- Health probes use `GET /state/health`.
+- Environment includes `DATABASE_URL` and any required API settings.
+
+**Done when:** manifests lint/apply in a non-destructive dry run and ownership
+boundaries are documented.
+
+---
+
+### T05 — Deploy empty State Hub and run migrations on railiance01
+
+```task
+id: T05
+status: todo
+priority: high
+state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
+```
+
+Deploy State Hub against an empty `state-hub-db` cnpg database and run Alembic
+migrations in the cluster environment.
+
+Checks:
+
+- Pod reaches Ready.
+- `/state/health` returns healthy through the intended private access path.
+- Alembic reports head.
+- Logs show no repeated crash/restart loop.
+
+**Done when:** an empty but structurally valid State Hub runs on railiance01.
+
+---
+
+### T06 — Restore WSL2 data copy into cluster and compare
+
+```task
+id: T06
+status: todo
+priority: high
+state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
+```
+
+Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live
+source of truth.
+
+Required comparison:
+
+- Table row counts match.
+- Representative workstreams, tasks, decisions, progress events, repos, and
+  token events are queryable.
+- Dashboard and MCP summary calls return expected data through the cluster API.
+- Any mismatch is investigated before proceeding.
+
+**Done when:** cluster data is a verified copy of WSL2, but not yet the only
+writer.
+
+---
+
+### T07 — Cut over private access to cluster State Hub
+
+```task
+id: T07
+status: todo
+priority: medium
+state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
+needs_human: true
+intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint."
+```
+
+With human approval, freeze WSL2 writes, take a final dump, restore it to the
+cluster, compare counts again, and redirect the active private access path to
+the cluster API.
+
+Accepted approaches:
+
+- Keep local MCP config pointed at `http://127.0.0.1:8000` and move that port
+  to an ops-bridge/SSH tunnel.
+- Or set the MCP server `API_BASE` to the chosen private cluster endpoint.
+
+**Done when:** `get_state_summary()` and dashboard live data are served by the
+cluster State Hub, and WSL2 is no longer receiving normal writes.
+
+---
+
+### T08 — Stabilise with WSL2 retained as fallback
+
+```task
+id: T08
+status: todo
+priority: medium
+state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
+```
+
+Run the cluster State Hub as primary while keeping the WSL2 instance available
+as a fallback.
+
+Monitor:
+
+- State Hub pod restarts.
+- cnpg cluster health.
+- Backup job success.
+- Dashboard and MCP behavior from each operator machine.
+- Consistency sync behavior for file-backed workplans.
+
+**Done when:** the agreed stabilisation window passes without data loss or
+unresolved operational defects.
+
+---
+
+### T09 — Document operating model and defer final WSL2 retirement
+
+```task
+id: T09
+status: todo
+priority: low
+state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
+```
+
+Document the new operating model:
+
+- How agents reach State Hub.
+- How backups and restores work.
+- How to roll back to WSL2 if needed.
+- Which parts remain pragmatic/single-node.
+- Which long-term requirements moved to `CUST-WP-0038`.
+
+Do not permanently retire WSL2 in this workplan unless a separate human
+decision is recorded. Retirement belongs after proven stability or in the
+future HA workplan.
+
+**Done when:** runbooks and project instructions match the deployed reality.
+
+## References
+
+- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
+- `RAIL-HO-WP-0004-T09` — Railiance-side State Hub deployment task
+- `CUST-WP-0038` — future full ThreePhoenix HA State Hub migration
+- Constitution constraint: production data migration and fallback retirement
+  require explicit human approval
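
The safety contract in `CUST-WP-0011` requires matching row counts and explicit human approval before cutover (T06, T07). A minimal sketch of that gate follows, assuming per-table counts have already been collected from both databases into plain dictionaries. The function names and the table names in the usage note are hypothetical; only the counts themselves come from the T01 drill result.

```python
def compare_row_counts(source, target):
    """Return {table: (source_count, target_count)} for every table that differs.

    A table missing on one side is reported with None for that side.
    """
    mismatches = {}
    for table in sorted(set(source) | set(target)):
        src, dst = source.get(table), target.get(table)
        if src != dst:
            mismatches[table] = (src, dst)
    return mismatches


def cutover_allowed(source, target, human_approved):
    """Cutover needs both an explicit approval and zero count mismatches."""
    return bool(human_approved) and not compare_row_counts(source, target)
```

With the T01 figures, `compare_row_counts({"workstreams": 117, "tasks": 989}, {"workstreams": 117, "tasks": 989})` yields an empty dictionary, and `cutover_allowed` still returns `False` until approval is recorded.
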
diff --git a/workplans/CUST-WP-0012-multi-user-onboarding.md b/workplans/CUST-WP-0012-multi-user-onboarding.md
new file mode 100644
index 0000000..a754d8b
--- /dev/null
+++ b/workplans/CUST-WP-0012-multi-user-onboarding.md
@@ -0,0 +1,246 @@
+---
+id: CUST-WP-0012
+type: workplan
+title: "Multi-User Onboarding and Environment Bootstrap"
+domain: custodian
+repo: state-hub
+status: active
+owner: custodian
+topic_slug: custodian
+state_hub_workstream_id: "a28d9e29-4119-4b73-9469-f921920253ef"
+created: "2026-03-11"
+updated: "2026-05-17"
+---
+
+# Multi-User Onboarding and Environment Bootstrap
+
+## Goal
+
+Make the Custodian system accessible to collaborators beyond the primary
+operator. A new user (or a new machine for the existing operator) should
+be able to go from zero to a productive Claude Code session with full
+State Hub MCP connectivity in a single session, without manual steps or
+undocumented tribal knowledge.
+
+## Context
+
+Several friction points surfaced during the 2026-03-11 session:
+
+- No SSH key for Railiance01 on WSL2 → blocked `make tunnel-loop`
+- No `~/.railiance_gitea.conf` → blocked repo creation script
+- Token missing `read:user` scope → blocked org repo creation
+- No `git credential.helper` → credentials required on every push
+- MCP registration is manual and documented only in `CLAUDE.md`
+
+Each of these is a solved problem in isolation. This workstream collects
+them into a repeatable, documented bootstrap experience.
+
+## Scope
+
+Two personas:
+
+| Persona | Access level | Typical machine |
+|---------|-------------|-----------------|
+| Primary operator | Full access, all domains | WSL2 workstation |
+| Domain collaborator | Read + write to one domain | COULOMBCORE, remote laptop |
+
+## Tasks
+
+### T01 — Git credential.helper for Gitea access
+
+```task
+id: CUST-WP-0012-T01
+state_hub_task_id: 71628269-9a75-4dae-a347-e64a86040322
+status: todo
+priority: medium
+```
+
+Document and automate `git credential.helper` setup for Gitea
+(`http://92.205.130.254:32166`). Recommend `libsecret` (keyring-backed)
+on machines that support it; fall back to `credential.helper=store`
+(persistent, plaintext `~/.git-credentials`) on headless servers.
+
+Include in bootstrap script (T04) and onboarding guide (T05).
+
+```bash
+# Preferred: libsecret (GNOME keyring, WSL2 with keyring daemon)
+sudo apt-get install -y build-essential libsecret-1-0 libsecret-1-dev
+sudo make -C /usr/share/doc/git/contrib/credential/libsecret
+git config --global credential.helper \
+  /usr/share/doc/git/contrib/credential/libsecret/git-credential-libsecret
+
+# Fallback: store (plaintext, suitable for headless servers)
+git config --global credential.helper store
+
+# Headless server alternative: cache (in-memory, 1h timeout)
+git config --global credential.helper 'cache --timeout=3600'
+```
+
+**Done when:** included in bootstrap script; push to Gitea works without
+re-entering credentials on second attempt.
+
+---
+
+### T02 — SSH key generation and authorization automation
+
+```task
+id: CUST-WP-0012-T02
+state_hub_task_id: fea965e9-8a8f-439c-9096-8f7756eb71ed
+status: todo
+priority: medium
+```
+
+Script or Ansible task that:
+1. Generates an `ed25519` key pair on the new machine if none exists
+2. Displays the public key with copy instructions
+3. Authorizes it on all managed hosts (Railiance01, COULOMBCORE) via
+   `ssh-copy-id` or Ansible `authorized_key` module
+
+Surfaced by: RAIL-PL-WP-0001 T01 — no SSH key on WSL2 blocked
+`make tunnel-loop HOST=tegwick@92.205.62.239`.
+
+```bash
+# Generate if missing
+[[ -f ~/.ssh/id_ed25519 ]] || ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
+
+# Authorize on a target host (requires existing access once)
+ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.62.239
+ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.130.254
+```
+
+**Done when:** included in bootstrap script; documented in onboarding guide.
+
+---
+
+### T03 — Claude Code MCP registration automation
+
+```task
+id: CUST-WP-0012-T03
+state_hub_task_id: 60318e9a-972e-45c8-afde-82ed0625f594
+status: todo
+priority: medium
+```
+
+Automate the state-hub MCP server registration on a new machine.
+Currently this is a multi-step manual process documented in
+`~/.claude/CLAUDE.md`. It should be a single `make` target or script:
+
+```bash
+# In /home/worsch/state-hub/
+make register-mcp   # idempotent; safe to re-run
+```
+
+The script should:
+1. Detect whether `state-hub` is already in `~/.claude.json`
+2. Extract the server config from `.mcp.json`
+3. Run `claude mcp add-json -s user state-hub <config>`
+4. Run `patch_mcp_cwd.py` to restore the cwd field
+5. Print instructions to restart Claude Code
+
+Should also detect whether the state hub is reachable directly
+(`http://127.0.0.1:8000`) or needs a tunnel (via ops-bridge), and emit
+a warning if neither is available.
+
+**Done when:** `make register-mcp` works on a clean machine; documented
+in onboarding guide.
+
+---
+
+### T04 — Environment bootstrap script
+
+```task
+id: CUST-WP-0012-T04
+state_hub_task_id: 84a94761-e424-4470-a9a2-64d9cabadb7f
+status: todo
+priority: high
+```
+
+Single idempotent script: `scripts/bootstrap-env.sh`
+
+Checks/installs prerequisites and configures the environment:
+
+| Step | What |
+|------|------|
+| Prerequisites | git, sops, age, helm, kubectl, uv, claude CLI |
+| Git credential | `credential.helper` (libsecret or store) |
+| SSH key | Generate ed25519 if missing; display public key |
+| MCP registration | `make register-mcp` (T03) |
+| Gitea config | Prompt for token; write `~/.railiance_gitea.conf` |
+| Health check | `curl http://127.0.0.1:8000/state/health`; warn if tunnel needed |
+
+Design constraints:
+- Idempotent: safe to run on an already-configured machine
+- No silent failures: each step prints ✓ / ✗ / ⚠
+- Minimal dependencies: bash + curl only to get started
+
+**Done when:** running the script on a clean Ubuntu 24.04 machine
+produces a working Custodian environment with no additional manual steps.
+
+---
+
+### T05 — Onboarding guide and user journey documentation
+
+```task
+id: CUST-WP-0012-T05
+state_hub_task_id: b0839802-659a-475b-8b84-ab7341ea3d15
+status: todo
+priority: medium
+```
+
+Write `docs/onboarding.md` in this repository covering the full journey
+for both personas:
+
+**Primary operator (new machine):**
+1. Prerequisites (git, SSH client)
+2. Clone `state-hub` and the relevant domain repository
+3. Run `make bootstrap-env` (T04)
+4. Restart Claude Code → verify MCP is active
+5. First session: `get_state_summary()` → orient → work
+
+**Domain collaborator (new person):**
+1. Prerequisites + Gitea account
+2. `ssh-copy-id` to get access to Railiance01 (or just COULOMBCORE)
+3. Set up ops-bridge tunnel to reach state hub
+4. Clone domain repo
+5. First Claude Code session with MCP via tunnel
+6. Contributing a workplan (ADR-001 convention)
+
+**Done when:** a new collaborator can follow the guide without
+clarification from the primary operator.
+
+---
+
+### T06 — State Hub multi-user model (deferred)
+
+```task
+id: CUST-WP-0012-T06
+state_hub_task_id: d5df3302-67b9-4765-a8d8-ea2df53dff6e
+status: todo
+priority: low
+```
+
+Design a lightweight user/role model for the state hub:
+
+| Role | Permissions |
+|------|-------------|
+| Primary operator | Full read/write, all domains |
+| Domain collaborator | Read all; write to own domain only |
+| Observer | Read-only |
+
+Decision needed: enforce at API layer (HTTP Basic / token auth per
+domain) or rely on Gitea repo permissions as the authoritative boundary
+(simpler — the hub is a read model anyway).
+
+**Deferred until:** first external collaborator is actively onboarding.
+Implement T01–T05 first; multi-user access control is only needed when
+there is more than one user.
+
+---
+
+## References
+
+- ops-bridge repo: `ops-bridge` (tunnel lifecycle management)
+- MCP registration: `~/.claude/CLAUDE.md` (current manual procedure)
+- Gitea repo creation: `railiance-cluster/tools/create_railiance_repo.sh`
+- ADR-001: workplans as repo artefacts
+- Surfaced by: RAIL-PL-WP-0001 T01 execution, 2026-03-11
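
The idempotency check behind `make register-mcp` (T03 in `CUST-WP-0012`, steps 1 to 3) might look like the sketch below. It assumes both `.mcp.json` and `~/.claude.json` keep servers under a top-level `mcpServers` object; that layout and the helper names are assumptions and should be checked against the installed Claude Code version. The sketch only builds the `claude mcp add-json` argument list named in the workplan; it does not execute it.

```python
import json


def extract_server_config(mcp_json_text, name="state-hub"):
    """Pull one server entry out of the repo's .mcp.json content (or None)."""
    return json.loads(mcp_json_text).get("mcpServers", {}).get(name)


def is_registered(claude_json_text, name="state-hub"):
    """True if the user-level config already lists the server."""
    if not claude_json_text.strip():
        return False  # missing or empty config file: nothing registered yet
    return name in json.loads(claude_json_text).get("mcpServers", {})


def registration_command(mcp_json_text, claude_json_text, name="state-hub"):
    """Return the argv to run, or None when re-running would be a no-op."""
    if is_registered(claude_json_text, name):
        return None
    config = extract_server_config(mcp_json_text, name)
    if config is None:
        raise KeyError(f"{name} not found in .mcp.json")
    return ["claude", "mcp", "add-json", "-s", "user", name, json.dumps(config)]
```

Returning `None` for an already registered server is what makes a second run safe; the cwd patch and restart instructions (steps 4 and 5) would follow only when a command was actually issued.
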
diff --git a/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md b/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md
new file mode 100644
index 0000000..f0260fb
--- /dev/null
+++ b/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md
@@ -0,0 +1,246 @@
+---
+id: CUST-WP-0038
+type: workplan
+title: "State Hub Full ThreePhoenix HA Migration"
+domain: custodian
+repo: state-hub
+status: active
+owner: custodian
+topic_slug: custodian
+created: "2026-05-02"
+updated: "2026-05-17"
+depends_on: CUST-WP-0011
+state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
+---
+
+# State Hub Full ThreePhoenix HA Migration
+
+## Goal
+
+Preserve the original long-term State Hub infrastructure goal while
+`CUST-WP-0011` takes the pragmatic railiance01 path.
+
+This workplan completes the migration from a useful single-node cluster-hosted
+State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes,
+replicated storage, tested failover, tested restore, and retirement of the WSL2
+fallback only after operational confidence is earned.
+
+## Why This Exists
+
+The near-term State Hub migration should not wait for every HA precondition,
+because the workstation-hosted State Hub is already a bottleneck for
+multi-machine work.
+
+But the original requirement remains valid:
+
+- State Hub is irreplaceable episodic memory.
+- A single node is not a final home.
+- Backup and restore must be drilled, not assumed.
+- Long-term operations must survive node loss and operator-machine loss.
+
+`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan
+keeps the ultimate target visible and reviewable.
+
+## Entry Criteria
+
+- `CUST-WP-0011` completed or explicitly superseded.
+- Cluster-hosted State Hub has passed its stabilisation period.
+- railiance01 is not the only planned durable node.
+- Railiance architecture decision for storage replication is current:
+  Longhorn, cnpg replication, external backup, or a documented replacement.
+- Backup and restore tooling has an owner and runbook.
+
+## Target Properties
+
+- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03.
+- State Hub database survives loss of one node.
+- State Hub API recovers from pod loss without manual repair.
+- Backups are encrypted, off-node, and restorable into a test namespace.
+- Agent access remains private.
+- WSL2 is no longer needed as the primary disaster-recovery fallback.
+
+## Tasks
+
+### T01 — Confirm ThreePhoenix cluster readiness
+
+```task
+id: T01
+status: todo
+priority: high
+state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
+```
+
+Verify the target cluster state:
+
+- Three nodes are joined and Ready.
+- Control-plane and worker roles are documented.
+- Cluster version and node resources are recorded.
+- Smoke tests pass from the operator machine and from CoulombCore.
+
+**Done when:** a current readiness report exists and no node is marked
+NotReady or operationally unknown.
+
+---
+
+### T02 — Establish replicated storage/database strategy
+
+```task
+id: T02
+status: todo
+priority: high
+state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
+```
+
+Choose and document the durable data strategy for State Hub:
+
+- cnpg multi-instance PostgreSQL cluster, and/or
+- Longhorn-backed storage with suitable replication, and/or
+- another explicitly approved architecture.
+
+The decision must define RPO, RTO, failover behavior, and restore procedure.
+
+**Done when:** the selected architecture is documented and approved before any
+production data movement.
+
+---
+
+### T03 — Implement HA State Hub database
+
+```task
+id: T03
+status: todo
+priority: high
+state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
+```
+
+Apply the chosen database/storage architecture to State Hub.
+
+Requirements:
+
+- Database credentials remain SOPS/secret-managed.
+- The database has automated backup configured.
+- The database exposes a stable service endpoint for the API.
+- Health and replication status are observable.
+
+**Done when:** State Hub can run against the HA database in a test or staging
+namespace.
+
+---
+
+### T04 — Add State Hub API high-availability behavior
+
+```task
+id: T04
+status: todo
+priority: medium
+state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
+```
+
+Run State Hub API with the right availability posture for its workload:
+
+- At least one replica, optionally more if DB/session behavior permits.
+- Readiness and liveness probes.
+- Rolling update behavior documented.
+- Resource requests/limits set.
+
+**Done when:** killing an API pod does not require manual recovery.
+
+---
+
+### T05 — Drill database failover
+
+```task
+id: T05
+status: todo
+priority: high
+state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
+```
+
+Perform a controlled failover drill for the State Hub database.
+
+Checks:
+
+- Failure trigger is documented.
+- API behavior during failover is observed.
+- Recovery time is measured.
+- No data loss is detected after recovery.
+
+**Done when:** the failover drill passes and results are logged.
+
+---
+
+### T06 — Drill backup restore to isolated namespace
+
+```task
+id: T06
+status: todo
+priority: high
+state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
+```
+
+Restore the latest encrypted State Hub backup into an isolated namespace or
+separate test database.
+
+Checks:
+
+- Backup can be decrypted with the documented key path.
+- Restore completes from off-node backup material.
+- Row counts and representative records match.
+- Restored API can serve `/state/health` and `/state/summary` when pointed at
+  the restored database.
+
+**Done when:** restore drill passes without depending on the live database.
+
+---
+
+### T07 — Update agent access and runbooks for HA endpoint
+
+```task
+id: T07
+status: todo
+priority: medium
+state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
+```
+
+Update the private access model after the HA endpoint is available:
+
+- ops-bridge or tunnel target.
+- MCP `API_BASE` or local port-forward convention.
+- Dashboard access.
+- Operator recovery instructions.
+
+**Done when:** each active operator machine can reach the HA State Hub endpoint
+through the documented path.
+
+---
+
+### T08 — Retire WSL2 fallback after explicit approval
+
+```task
+id: T08
+status: todo
+priority: low
+needs_human: true
+intervention_note: "Requires explicit approval after HA failover and restore drills pass."
+state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
+```
+
+Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA
+cluster path has passed drills.
+
+Steps:
+
+1. Take and archive a final WSL2 backup.
+2. Stop local WSL2 State Hub services.
+3. Update global and repo instructions.
+4. Record the retirement decision in State Hub.
+
+**Done when:** WSL2 is no longer part of the normal or fallback operating
+model, and the cluster runbook is the source of truth.
+
+## References
+
+- `CUST-WP-0011` — pragmatic railiance01 migration
+- Railiance ThreePhoenix infrastructure goal
+- State Hub backup/restore runbooks
+- Constitution constraint: irreversible retirement requires human approval
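
The failover drill in T05 of `CUST-WP-0038` asks for a measured recovery time and a no-data-loss check. One way to evaluate a drill log is sketched below, assuming the API was probed at regular intervals and that writes were frozen during the drill so row counts are comparable. The probe format, the function names, and the RTO value passed in are illustrative assumptions; the actual RPO and RTO are to be defined in T02.

```python
def longest_outage(probes):
    """probes: (timestamp_seconds, healthy) tuples sorted by time.

    Returns the longest span from the first failed probe of an outage to the
    next healthy probe, or infinity if the service never recovered.
    """
    longest = 0.0
    outage_start = None
    for timestamp, healthy in probes:
        if not healthy and outage_start is None:
            outage_start = timestamp
        elif healthy and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        return float("inf")  # still down at the end of the observation window
    return longest


def drill_passes(probes, rto_seconds, rows_before, rows_after):
    """Pass only if recovery stayed within the RTO and no rows went missing."""
    return longest_outage(probes) <= rto_seconds and rows_before == rows_after
```

The measured outage is bounded by the probe interval, so a shorter interval gives a tighter recovery-time figure for the drill record.
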