Remove migrated State Hub workplans

2026-05-18 01:33:25 +02:00
parent be9ccf1074
commit 5d478cf746
4 changed files with 0 additions and 1044 deletions
--- a/workplans/CUST-WP-0003-whi-kpi-card.md
+++ b/workplans/CUST-WP-0003-whi-kpi-card.md
@@ -1,186 +0,0 @@
---
-id: CUST-WP-0003
-type: workplan
-title: "State Hub v0.4 — Workstream Health Index (WHI) KPI Card"
-domain: custodian
-status: active
-owner: custodian
-topic_slug: custodian
-state_hub_workstream_id: 9cc32158-2f5c-4ef6-9713-aacce4623d5e
-created: "2026-02-26"
-updated: "2026-02-28"
---
-
-# State Hub v0.4 — Workstream Health Index (WHI) KPI Card
-
-## Summary
-
-Implement the Workstream Health Index (WHI) — a composite structural-health
-KPI — as a live card injected into the TOC sidebar of the Workstreams
-dashboard page. All six metrics are computable client-side from data
-already fetched by `workstreams.md`; no API or schema changes required.
-
-## Context
-
-The WHI formula and metric definitions are specified in
-`state-hub/dashboard/src/docs/workstream-kpi.md`. This workplan covers
-only the implementation of that spec as running dashboard code.
-
-The six base metrics:
- **DD** — Dependency Density: edge count / open workstream count
- **BR** — Blocked Ratio: blocked workstreams / open count
- **SPR** — Single Point of Risk: max inbound edges / open count
- **PEP** — Progression Enablement Proportion: ready-to-start workstreams / open count
- **CDDR** — Cross-Domain Dependency Ratio: cross-domain edges / total edges
- **CPI** — Cycle Penalty Indicator: 1 if any cycle detected, 0 otherwise
-
-WHI formula: `0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR)`
-CPI penalty: `WHI = WHI * 0.5` if CPI=1.
-
-## Tasks
-
-### P1 — Verify dependency edge fields in open_workstreams
-
-```task
-id: CUST-WP-0003-T01
-state_hub_task_id: 243646e0-b77a-41e7-ac51-82c5828e63d2
-status: todo
-priority: high
-```
-
-Confirm that `summary.open_workstreams[].depends_on[]` and `blocks[]`
-each carry `workstream_id`, `workstream_slug`, and `workstream_title`.
-Verify these fields are sufficient to build a complete directed dependency
-graph client-side without additional API calls. (Already verified during
-workplan design — open_workstreams is the confirmed data source.)
-
-### P2.1 — Build directed dependency graph from openWs + completedIds
-
-```task
-id: CUST-WP-0003-T02
-state_hub_task_id: 6dbef71f-d2d7-44ee-abb8-279dbaeec505
-status: todo
-priority: high
-```
-
-In `workstreams.md`: derive `completedIds`, a `Set` of the IDs of workstreams
-with status completed. Build an adjacency list: for each entry in openWs,
-map workstream id → array of `depends_on[].workstream_id`. Build a reverse
-map (prerequisite id → list of dependent ids) for SPR computation. Also
-build an `idToDomain` map from `data[]` for CDDR.
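The graph-building step described in P2.1 could look roughly like the sketch below. It is an illustration under assumptions, not the dashboard's code: the function name `buildGraph` is hypothetical, and the `id`, `status`, and `domain` fields on each entry of `data[]` are assumed from the task text.

```javascript
// Sketch only. `openWs` is the list of open workstreams, `allWs` stands in
// for the `data[]` array mentioned in the task. Field names are assumptions.
function buildGraph(openWs, allWs) {
  // IDs of workstreams whose status is "completed".
  const completedIds = new Set(
    allWs.filter((w) => w.status === "completed").map((w) => w.id)
  );

  const adjacency = new Map();  // workstream id -> prerequisite ids
  const dependents = new Map(); // prerequisite id -> dependent ids (for SPR)

  for (const w of openWs) {
    const deps = (w.depends_on ?? []).map((d) => d.workstream_id);
    adjacency.set(w.id, deps);
    for (const dep of deps) {
      if (!dependents.has(dep)) dependents.set(dep, []);
      dependents.get(dep).push(w.id);
    }
  }

  // Domain lookup used later for the cross-domain ratio (CDDR).
  const idToDomain = Object.fromEntries(allWs.map((w) => [w.id, w.domain]));

  return { completedIds, adjacency, dependents, idToDomain };
}
```

Keeping both the forward and the reverse map means the inbound-edge count for SPR can be read off `dependents` without a second pass over the edges.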
-
-### P2.2 — Implement DFS cycle detection (CPI)
-
-```task
-id: CUST-WP-0003-T03
-state_hub_task_id: f0d5c107-6029-4ad0-af00-645d35ce7db0
-status: todo
-priority: high
-```
-
-Implement a DFS-based topological sort over the dependency adjacency list.
-Detect back edges using visited / inStack colour sets. Return `CPI = 1`
-if any cycle is found, `CPI = 0` otherwise. Only nodes in openWs participate
-(completed/archived workstreams excluded). Edge case: isolated nodes (no
-deps, no dependents) are valid and never form cycles.
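A minimal sketch of the cycle check, assuming the adjacency list is a `Map` from workstream id to an array of prerequisite ids. The name `detectCycle` is hypothetical. Prerequisites that are not keys of the map (completed or archived workstreams) are skipped, matching the rule that only open nodes participate.

```javascript
// Sketch only: returns 1 if the open-workstream graph contains a cycle, else 0.
function detectCycle(adjacency) {
  const visited = new Set(); // nodes whose DFS has started (or finished)
  const inStack = new Set(); // nodes on the current DFS path

  const visit = (id) => {
    if (inStack.has(id)) return true;  // back edge: cycle found
    if (visited.has(id)) return false; // already explored, no cycle via here
    visited.add(id);
    inStack.add(id);
    for (const dep of adjacency.get(id) ?? []) {
      if (!adjacency.has(dep)) continue; // prerequisite is not an open node
      if (visit(dep)) return true;
    }
    inStack.delete(id);
    return false;
  };

  for (const id of adjacency.keys()) {
    if (visit(id)) return 1;
  }
  return 0;
}
```

The recursion depth equals the longest dependency chain, which should be small for a workstream graph; an explicit stack would be the safer choice for very deep graphs.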
-
-### P2.3 — Compute DD, BR, SPR, PEP, CDDR
-
-```task
-id: CUST-WP-0003-T04
-state_hub_task_id: 6da60567-cc46-4a32-9855-b07bafe2faeb
-status: todo
-priority: high
-```
-
-Using the graph from P2.1:
- `DD`: `totalEdges / openCount`, where `totalEdges = openWs.flatMap(w => w.depends_on).length`
- `BR`: `openWs.filter(w => w.status === "blocked").length / openCount`
- `SPR`: max inbound-edge count across prerequisite workstreams in openWs, divided by `openCount`
- `PEP`: number of active workstreams in openWs whose `depends_on` entries are all in `completedIds`, divided by `openCount`
- `CDDR`: `crossDomainEdges / totalEdges`, where a cross-domain edge has endpoints in different domains; 0 if there are no edges
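The five ratios could be computed along these lines. This is a sketch under assumptions: the name `baseMetrics` is hypothetical, the status value `"active"` is inferred from the PEP description, and returning all zeros when there are no open workstreams is a choice made here to avoid division by zero, not something the workplan specifies.

```javascript
// Sketch only: computes DD, BR, SPR, PEP, and CDDR from open workstreams.
function baseMetrics(openWs, completedIds, idToDomain) {
  const openCount = openWs.length;
  if (openCount === 0) {
    // Assumption: an empty graph yields zeros rather than NaN.
    return { dd: 0, br: 0, spr: 0, pep: 0, cddr: 0, openCount, edgeCount: 0 };
  }

  // One edge per depends_on entry, pointing from dependent to prerequisite.
  const edges = openWs.flatMap((w) =>
    (w.depends_on ?? []).map((d) => ({ from: w.id, to: d.workstream_id }))
  );

  // Inbound-edge counts, restricted to prerequisites that are still open.
  const openIds = new Set(openWs.map((w) => w.id));
  const inbound = new Map();
  for (const e of edges) {
    if (openIds.has(e.to)) inbound.set(e.to, (inbound.get(e.to) ?? 0) + 1);
  }
  const maxInbound = Math.max(0, ...inbound.values());

  // Ready to start: active and every prerequisite already completed.
  const ready = openWs.filter(
    (w) =>
      w.status === "active" &&
      (w.depends_on ?? []).every((d) => completedIds.has(d.workstream_id))
  );

  const cross = edges.filter((e) => idToDomain[e.from] !== idToDomain[e.to]);

  return {
    dd: edges.length / openCount,
    br: openWs.filter((w) => w.status === "blocked").length / openCount,
    spr: maxInbound / openCount,
    pep: ready.length / openCount,
    cddr: edges.length === 0 ? 0 : cross.length / edges.length,
    openCount,
    edgeCount: edges.length,
  };
}
```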
-
-### P2.4 — WHI formula: normalisation + CPI penalty
-
-```task
-id: CUST-WP-0003-T05
-state_hub_task_id: 29b2dbbd-5d60-49b6-ae84-3dbf22167df7
-status: todo
-priority: high
-```
-
-Implement the weighted aggregation:
-```
-DDnorm = min(1, DD / 1.0)   // DD_critical = 1.0
-WHI    = 0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR)
-if CPI === 1: WHI = WHI * 0.5
-```
-Clamp to [0, 1]. Return `{whi, dd, ddNorm, br, spr, pep, cddr, cpi, openCount, edgeCount}`.
-Factor into `computeWHI(nodes, edges, idToDomain)` for reuse in per-domain scope.
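The aggregation itself is small enough to sketch directly from the formula above. The function name `aggregateWHI` and the `ddCritical` parameter are illustrative; the weights, the 0.5 cycle penalty, and the clamp come from the task text.

```javascript
// Sketch only: weighted aggregation with DD normalisation and CPI penalty.
function aggregateWHI({ dd, br, spr, pep, cddr, cpi }, ddCritical = 1.0) {
  const ddNorm = Math.min(1, dd / ddCritical);
  let whi =
    0.30 * (1 - ddNorm) +
    0.25 * (1 - br) +
    0.15 * (1 - spr) +
    0.20 * pep +
    0.10 * (1 - cddr);
  if (cpi === 1) whi *= 0.5; // halve the score when a cycle exists
  whi = Math.min(1, Math.max(0, whi)); // clamp to [0, 1]
  return { whi, ddNorm };
}
```

For example, with `dd = 1`, `br = 0.5`, `spr = 0.5`, `pep = 0`, `cddr = 0.5`, and no cycle, the terms are 0 + 0.125 + 0.075 + 0 + 0.05, giving a WHI of 0.25.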
-
-### P2.5 — Per-domain WHI breakdown
-
-```task
-id: CUST-WP-0003-T06
-state_hub_task_id: 8ce5ef74-5eb8-4259-9b11-dde13bf84a89
-status: todo
-priority: medium
-```
-
-For each domain present in openWs, compute a domain-scoped WHI:
- `domainNodes = openWs.filter(w => idToDomain[w.id] === domain)`
- `domainEdges = domainNodes.flatMap(w => w.depends_on.filter(d => idToDomain[d.workstream_id] === domain))`
- `result = computeWHI(domainNodes, domainEdges, idToDomain)`
-
-Store as `[{domain, whi, br, pep, cpi, openCount}]`. Skip domains with
-`openCount === 0`.
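A sketch of the per-domain loop, assuming `computeWHI` is passed in as a callback with the signature given in P2.4. The wrapper name `perDomainWHI` is hypothetical. Note that the filtered `depends_on` entries carry only the prerequisite id, so a real `computeWHI` would need some way to recover the dependent side of each edge.

```javascript
// Sketch only: one domain-scoped WHI record per domain present in openWs.
function perDomainWHI(openWs, idToDomain, computeWHI) {
  const domains = [...new Set(openWs.map((w) => idToDomain[w.id]))];
  return domains
    .map((domain) => {
      const domainNodes = openWs.filter((w) => idToDomain[w.id] === domain);
      // Keep only edges whose prerequisite is in the same domain.
      const domainEdges = domainNodes.flatMap((w) =>
        (w.depends_on ?? []).filter(
          (d) => idToDomain[d.workstream_id] === domain
        )
      );
      const { whi, br, pep, cpi, openCount } = computeWHI(
        domainNodes,
        domainEdges,
        idToDomain
      );
      return { domain, whi, br, pep, cpi, openCount };
    })
    .filter((r) => r.openCount > 0); // skip domains with nothing open
}
```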
-
-### P3 — WHI KPI card UI
-
-```task
-id: CUST-WP-0003-T07
-state_hub_task_id: 91efba5c-3be2-4bfe-b5ef-1b261e9423f2
-status: todo
-priority: high
-```
-
-Build the `_whiBox` element in `workstreams.md` (mirrors `_kpiBox` in
-`decisions.md`):
- Card title: "Workstream Health"
- Main WHI value with health state label: GREEN ≥ 0.75 / ORANGE ≥ 0.50 / RED < 0.50
- Sub-metric rows for DD, BR, SPR, PEP, CDDR with individual warning colours
- Cycle alert row (red ⚠) when CPI=1
- Domain breakdown: compact rows with domain name + coloured score
- Empty state if openCount=0 or no edges
-
-Inject via `injectTocTop("whi-kpi-box", _whiBox)`. Wire
-`withDocHelp(_whiBox, "/docs/workstream-health-index")`.
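The health state label follows directly from the thresholds listed for the card. A minimal sketch, with the hypothetical helper name `healthState`:

```javascript
// Sketch only: maps a WHI value in [0, 1] to the card's health state label.
function healthState(whi) {
  if (whi >= 0.75) return "GREEN";
  if (whi >= 0.5) return "ORANGE";
  return "RED";
}
```

Both boundaries are inclusive on the upper band, so a WHI of exactly 0.75 is GREEN and exactly 0.50 is ORANGE.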
-
-### P4.1 — Create src/docs/workstream-health-index.md
-
-```task
-id: CUST-WP-0003-T08
-state_hub_task_id: 4c898472-e4ae-49a2-b6cd-7aa1a3c7604a
-status: todo
-priority: medium
-```
-
-Reference documentation for the WHI KPI card. Cover: purpose, all six
-metrics (formula + interpretation), WHI aggregation formula with CPI
-penalty, DD normalisation, health state thresholds, domain breakdown,
-cycle detection, and how to improve a poor score. Update
-`workstream-kpi.md` to link to this doc.
-
-### P4.2 — Wire withDocHelp and add to Reference nav
-
-```task
-id: CUST-WP-0003-T09
-state_hub_task_id: 20976663-7ac9-4909-8029-a479190f52ff
-status: todo
-priority: low
-```
-
-Confirm `withDocHelp(_whiBox, "/docs/workstream-health-index")` is wired
-(from P3). Add `{ name: "Workstream Health", path: "/docs/workstream-health-index" }`
-to the Reference pages array in `observablehq.config.js`. Verify
-Reference nav renders correctly in `npm run dev`.
--- a/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
+++ b/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md
@@ -1,366 +0,0 @@
---
-id: CUST-WP-0011
-type: workplan
-title: "Pragmatic State Hub Migration to railiance01"
-domain: custodian
-repo: the-custodian
-status: active
-owner: custodian
-topic_slug: custodian
-created: "2026-03-11"
-updated: "2026-05-15"
-state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
-supersedes_intent_from: "Migrate Custodian State Hub to ThreePhoenix Cluster"
-follow_up_workplan: CUST-WP-0038
---
-
-# Pragmatic State Hub Migration to railiance01
-
-## Goal
-
-Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator
-workstation to the current railiance01 Kubernetes environment, using the
-Railiance production-readiness path that exists now:
-
- CloudNative PG (`cnpg`) for the State Hub database in the `databases`
-  namespace.
- State Hub as an S5 workload in `railiance-apps`.
- Platform/database ownership in `railiance-platform`.
- Access through the existing private tunnel/ops-bridge model, not public
-  exposure.
- WSL2 retained as a disaster-recovery fallback until the cluster deployment
-  has proven stable.
-
-This is a deliberate pragmatic step. It improves durability and multi-machine
-access before the full ThreePhoenix target is ready. The ultimate multi-node,
-replicated, long-term cluster goal is preserved in `CUST-WP-0038`.
-
-## Context Update
-
-The original 2026-03-11 version of this workplan targeted a future
-ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before
-starting. That was correct as an end-state, but it blocks useful progress now.
-
-The current Railiance architecture has moved on:
-
- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
-  supersedes the older Bitnami PostgreSQL HA platform baseline.
- CloudNative PG is the deployed database operator.
- `RAIL-HO-WP-0004-T09` is the Railiance-side task for deploying State Hub to
-  the cluster, and it still requires human decisions before live data
-  migration.
-
-This workplan is now the Custodian-side coordination and safety plan for that
-T09 effort.
-
-## Safety Contract
-
-State Hub is irreplaceable episodic memory. This migration may prepare, deploy,
-test, and compare as much as needed, but it must not make the cluster the only
-source of truth until the explicit cutover gate is satisfied.
-
-Rules:
-
- A fresh WSL2 backup and restore drill is mandatory before data migration.
- The WSL2 State Hub remains available as rollback until stabilisation passes.
- Any task that changes the live writer endpoint requires explicit human
-  approval.
- A failed cluster deploy must leave the WSL2 instance untouched and usable.
- Row counts and key API checks must match before cutover.
-
-## Target Architecture After This Workplan
-
-```
-Operator workstation / COULOMBCORE / other agent hosts
-  -> local MCP server subprocess
-     -> http://127.0.0.1:8000 or configured API_BASE
-        -> private tunnel / ops-bridge
-           -> railiance01 k3s
-              -> state-hub Service
-                 -> FastAPI Deployment
-                 -> state-hub-db CloudNative PG Cluster
-```
-
-Key properties:
-
- Single-node pragmatic deployment on railiance01.
- No public unauthenticated exposure.
- Database managed by cnpg, not an ad-hoc Postgres StatefulSet.
- WSL2 retained as DR fallback during stabilisation.
- Future multi-node HA and storage replication are deferred to `CUST-WP-0038`.
-
-## Open Human Decisions
-
-Resolve these before T04/T05 can become live migration work:
-
-1. Final State Hub hostname or tunnel-only endpoint.
-2. Container registry choice: Gitea registry vs external interim registry.
-3. Exposure model: ClusterIP plus tunnel, private ingress, or both.
-4. Approval window for freezing WSL2 writes and migrating the production DB.
-5. Stabilisation duration before WSL2 can be considered non-primary fallback.
-
-## Tasks
-
-### T01 — Drill WSL2 State Hub backup restore
-
-```task
-id: T01
-status: done
-priority: high
-state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
-completed: "2026-05-02"
-```
-
-Take a fresh State Hub backup from the current WSL2 instance and restore it
-into an isolated test PostgreSQL instance.
-
-Minimum checks:
-
- Restore completes without errors.
- Core table row counts match the live WSL2 database.
- `/state/summary` can be served from the restored copy if wired to a test API.
- Drill result is recorded in State Hub progress and, if useful, episodic
-  memory.
-
-**Done when:** backup and restore are proven within 24 hours of live migration
-work.
-
-Result: completed 2026-05-02. A fresh dump from `infra-postgres-1` restored
-into disposable container `state-hub-restore-test` on `127.0.0.1:5433`.
-Application health and summary checks against the restored database returned
-HTTP 200. Restored row counts matched production exactly, including 117
-workstreams, 989 tasks, 1423 progress events, and 208 token events.
-
---
-
-### T02 — Align with Railiance deployment plan
-
-```task
-id: T02
-status: done
-priority: high
-state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
-completed: "2026-05-02"
-```
-
-Update the cross-repo plan so this Custodian workplan and
-`RAIL-HO-WP-0004-T09` point to the same architecture.
-
-Expected outputs:
-
- `RAIL-HO-WP-0004-T09` remains the Railiance-side execution task.
- This workplan remains the Custodian-side safety/cutover task list.
- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the
-  near-term migration plan.
- The future HA goal is referenced through `CUST-WP-0038`.
-
-**Done when:** both workplans describe compatible responsibilities and gates.
-
-Result: completed 2026-05-02. `RAIL-HO-WP-0004-T09` now names the same
-pragmatic railiance01 path: cnpg database, S5 State Hub workload, restore drill
-precondition, empty deploy before data copy, explicit human approval before
-freezing WSL2 writes, and WSL2 retained as fallback. Full ThreePhoenix HA stays
-deferred to `CUST-WP-0038`.
-
---
-
-### T03 — Build and publish State Hub container image
-
-```task
-id: T03
-status: in_progress
-priority: high
-state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
-```
-
-Package `state-hub/` as a production image.
-
-Requirements:
-
- Dockerfile builds from the current Python/uv project.
- Alembic and runtime dependencies are available inside the image.
- Image exposes the FastAPI service on port 8000.
- Image tag is pushed to the chosen registry.
- Build provenance is documented in the commit/workplan.
-
-**Done when:** railiance01 can pull the image and a dry-run deployment resolves
-it.
-
-Progress 2026-05-03: added `state-hub/Dockerfile`,
-`state-hub/.dockerignore`, and `state-hub/docs/container-image.md`. Built
-local image `state-hub:local` successfully:
-`sha256:e96dbd1e7d2b63e4fb17584c8c2216088a2c9937bfe880c2ad565c7a9f51c0fc`
-(~106 MB). Verified container `/state/health` returns HTTP 200 against the
-current database when run locally with host networking. Verified Alembic is
-available in-image and reports current revision `r5m6n7o8p9q0 (head)`.
-
-Progress 2026-05-03: registry target decision resolved to the self-hosted
-Gitea registry. A local SSH tunnel to the NodePort can reach Gitea, but Docker
-login/push still receives HTTP 404 from `/v2/`. Runtime inspection shows the
-live Gitea `app.ini` has no `[packages]` section, so package registry
-enablement/routing must be applied before publishing `state-hub:local`.
-
-Progress 2026-05-15: rebuilt the image from current `state-hub/` sources as
-`state-hub:local` with digest
-`sha256:039d29654ccb3754c6ecdbe497c6364bbd8452edcdcb7fa937dd9debf5b734ff`
-(106004480 bytes). Verified `/state/health` returns
-`{"status":"ok","db":"connected"}` from a temporary container on host port
-18000 and confirmed in-image Alembic reports `t7o8p9q0r1s2 (head)`. Build
-provenance is recorded in `state-hub/docs/container-image.md`.
-
-Remaining: enable the Gitea package/container registry, then tag, push, and
-pull the image from railiance01.
-
---
-
-### T04 — Define State Hub database and app manifests
-
-```task
-id: T04
-status: todo
-priority: high
-state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
-```
-
-Create the cluster-side deployment assets using current Railiance boundaries:
-
- `railiance-platform`: `state-hub-db` cnpg cluster and database credentials.
- `railiance-apps`: State Hub Deployment, Service, ConfigMap, Secret/External
-  Secret reference, and optional private Ingress.
- Health probes use `GET /state/health`.
- Environment includes `DATABASE_URL` and any required API settings.
-
-**Done when:** manifests lint/apply in a non-destructive dry run and ownership
-boundaries are documented.
-
---
-
-### T05 — Deploy empty State Hub and run migrations on railiance01
-
-```task
-id: T05
-status: todo
-priority: high
-state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
-```
-
-Deploy State Hub against an empty `state-hub-db` cnpg database and run Alembic
-migrations in the cluster environment.
-
-Checks:
-
- Pod reaches Ready.
- `/state/health` returns healthy through the intended private access path.
- Alembic reports head.
- Logs show no repeated crash/restart loop.
-
-**Done when:** an empty but structurally valid State Hub runs on railiance01.
-
---
-
-### T06 — Restore WSL2 data copy into cluster and compare
-
-```task
-id: T06
-status: todo
-priority: high
-state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
-```
-
-Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live
-source of truth.
-
-Required comparison:
-
- Table row counts match.
- Representative workstreams, tasks, decisions, progress events, repos, and
-  token events are queryable.
- Dashboard and MCP summary calls return expected data through the cluster API.
- Any mismatch is investigated before proceeding.
-
-**Done when:** cluster data is a verified copy of WSL2, but not yet the only
-writer.
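The row-count comparison could be expressed as a small helper that reports every table whose counts differ. This is a sketch only: `diffRowCounts` is a hypothetical name, and the counts are assumed to have been collected into plain objects keyed by table name by some other means.

```javascript
// Sketch only: lists tables whose row counts differ between two snapshots.
// A table missing on one side is reported with a null count for that side.
function diffRowCounts(source, target) {
  const tables = new Set([...Object.keys(source), ...Object.keys(target)]);
  const mismatches = [];
  for (const table of [...tables].sort()) {
    const s = source[table] ?? null;
    const t = target[table] ?? null;
    if (s !== t) mismatches.push({ table, source: s, target: t });
  }
  return mismatches;
}
```

An empty result means every table present on either side has the same count; any entry would be a mismatch to investigate before proceeding.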
-
---
-
-### T07 — Cut over private access to cluster State Hub
-
-```task
-id: T07
-status: todo
-priority: medium
-state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
-needs_human: true
-intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint."
-```
-
-With human approval, freeze WSL2 writes, take a final dump, restore it to the
-cluster, compare counts again, and redirect the active private access path to
-the cluster API.
-
-Accepted approaches:
-
- Keep local MCP config pointed at `http://127.0.0.1:8000` and move that port
-  to an ops-bridge/SSH tunnel.
- Or set the MCP server `API_BASE` to the chosen private cluster endpoint.
-
-**Done when:** `get_state_summary()` and dashboard live data are served by the
-cluster State Hub, and WSL2 is no longer receiving normal writes.
-
---
-
-### T08 — Stabilise with WSL2 retained as fallback
-
-```task
-id: T08
-status: todo
-priority: medium
-state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
-```
-
-Run the cluster State Hub as primary while keeping the WSL2 instance available
-as a fallback.
-
-Monitor:
-
- State Hub pod restarts.
- cnpg cluster health.
- Backup job success.
- Dashboard and MCP behavior from each operator machine.
- Consistency sync behavior for file-backed workplans.
-
-**Done when:** the agreed stabilisation window passes without data loss or
-unresolved operational defects.
-
---
-
-### T09 — Document operating model and defer final WSL2 retirement
-
-```task
-id: T09
-status: todo
-priority: low
-state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
-```
-
-Document the new operating model:
-
- How agents reach State Hub.
- How backups and restores work.
- How to roll back to WSL2 if needed.
- Which parts remain pragmatic/single-node.
- Which long-term requirements moved to `CUST-WP-0038`.
-
-Do not permanently retire WSL2 in this workplan unless a separate human
-decision is recorded. Retirement belongs after proven stability or in the
-future HA workplan.
-
-**Done when:** runbooks and project instructions match the deployed reality.
-
-## References
-
- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
- `RAIL-HO-WP-0004-T09` — Railiance-side State Hub deployment task
- `CUST-WP-0038` — future full ThreePhoenix HA State Hub migration
- Constitution constraint: production data migration and fallback retirement
-  require explicit human approval
--- a/workplans/CUST-WP-0012-multi-user-onboarding.md
+++ b/workplans/CUST-WP-0012-multi-user-onboarding.md
@@ -1,246 +0,0 @@
---
-id: CUST-WP-0012
-type: workplan
-title: "Multi-User Onboarding and Environment Bootstrap"
-domain: custodian
-repo: the-custodian
-status: active
-owner: custodian
-topic_slug: custodian
-state_hub_workstream_id: "a28d9e29-4119-4b73-9469-f921920253ef"
-created: "2026-03-11"
-updated: "2026-03-11"
---
-
-# Multi-User Onboarding and Environment Bootstrap
-
-## Goal
-
-Make the Custodian system accessible to collaborators beyond the primary
-operator. A new user (or a new machine for the existing operator) should
-be able to go from zero to a productive Claude Code session with full
-State Hub MCP connectivity in a single session, without manual steps or
-undocumented tribal knowledge.
-
-## Context
-
-Several friction points surfaced during the 2026-03-11 session:
-
- No SSH key for Railiance01 on WSL2 → blocked `make tunnel-loop`
- No `~/.railiance_gitea.conf` → blocked repo creation script
- Token missing `read:user` scope → blocked org repo creation
- No `git credential.helper` → credentials required on every push
- MCP registration is manual and documented only in `CLAUDE.md`
-
-Each of these is a solved problem in isolation. This workstream collects
-them into a repeatable, documented bootstrap experience.
-
-## Scope
-
-Two personas:
-
-| Persona | Access level | Typical machine |
-|---------|-------------|-----------------|
-| Primary operator | Full access, all domains | WSL2 workstation |
-| Domain collaborator | Read + write to one domain | COULOMBCORE, remote laptop |
-
-## Tasks
-
-### T01 — Git credential.helper for Gitea access
-
-```task
-id: CUST-WP-0012-T01
-state_hub_task_id: 71628269-9a75-4dae-a347-e64a86040322
-status: todo
-priority: medium
-```
-
-Document and automate `git credential.helper` setup for Gitea
-(`http://92.205.130.254:32166`). Recommend `libsecret` (keyring-backed)
-on machines that support it; fall back to `credential.helper=store`
-(persistent, plaintext `~/.git-credentials`) on headless servers.
-
-Include in bootstrap script (T04) and onboarding guide (T05).
-
-```bash
-# Preferred: libsecret (GNOME keyring, WSL2 with keyring daemon)
-sudo apt-get install -y libsecret-1-0 libsecret-1-dev
-sudo make -C /usr/share/doc/git/contrib/credential/libsecret
-git config --global credential.helper \
-  /usr/share/doc/git/contrib/credential/libsecret/git-credential-libsecret
-
-# Fallback: store (plaintext, suitable for headless servers)
-git config --global credential.helper store
-
-# Headless server alternative: cache (in-memory, 1h timeout)
-git config --global credential.helper 'cache --timeout=3600'
-```
-
-**Done when:** included in bootstrap script; push to Gitea works without
-re-entering credentials on second attempt.
-
---
-
-### T02 — SSH key generation and authorization automation
-
-```task
-id: CUST-WP-0012-T02
-state_hub_task_id: fea965e9-8a8f-439c-9096-8f7756eb71ed
-status: todo
-priority: medium
-```
-
-Script or Ansible task that:
-1. Generates an `ed25519` key pair on the new machine if none exists
-2. Displays the public key with copy instructions
-3. Authorizes it on all managed hosts (Railiance01, COULOMBCORE) via
-   `ssh-copy-id` or Ansible `authorized_key` module
-
-Surfaced by: RAIL-PL-WP-0001 T01 — no SSH key on WSL2 blocked
-`make tunnel-loop HOST=tegwick@92.205.62.239`.
-
-```bash
-# Generate if missing
-[[ -f ~/.ssh/id_ed25519 ]] || ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
-
-# Authorize on a target host (requires existing access once)
-ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.62.239
-ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.130.254
-```
-
-**Done when:** included in bootstrap script; documented in onboarding guide.
-
---
-
-### T03 — Claude Code MCP registration automation
-
-```task
-id: CUST-WP-0012-T03
-state_hub_task_id: 60318e9a-972e-45c8-afde-82ed0625f594
-status: todo
-priority: medium
-```
-
-Automate the state-hub MCP server registration on a new machine.
-Currently this is a multi-step manual process documented in
-`~/.claude/CLAUDE.md`. It should be a single `make` target or script:
-
-```bash
-# In the-custodian/state-hub/
-make register-mcp   # idempotent; safe to re-run
-```
-
-The script should:
-1. Detect whether `state-hub` is already in `~/.claude.json`
-2. Extract the server config from `.mcp.json`
-3. Run `claude mcp add-json -s user state-hub <config>`
-4. Run `patch_mcp_cwd.py` to restore the cwd field
-5. Print instructions to restart Claude Code
-
-Should also detect whether the state hub is reachable directly
-(`http://127.0.0.1:8000`) or needs a tunnel (via ops-bridge), and emit
-a warning if neither is available.
-
-**Done when:** `make register-mcp` works on a clean machine; documented
-in onboarding guide.
-
---
-
-### T04 — Environment bootstrap script
-
-```task
-id: CUST-WP-0012-T04
-state_hub_task_id: 84a94761-e424-4470-a9a2-64d9cabadb7f
-status: todo
-priority: high
-```
-
-Single idempotent script: `state-hub/scripts/bootstrap-env.sh`
-
-Checks/installs prerequisites and configures the environment:
-
-| Step | What |
-|------|------|
-| Prerequisites | git, sops, age, helm, kubectl, uv, claude CLI |
-| Git credential | `credential.helper` (libsecret or store) |
-| SSH key | Generate ed25519 if missing; display public key |
-| MCP registration | `make register-mcp` (T03) |
-| Gitea config | Prompt for token; write `~/.railiance_gitea.conf` |
-| Health check | `curl /state/health`; warn if tunnel needed |
-
-Design constraints:
- Idempotent: safe to run on an already-configured machine
- No silent failures: each step prints ✓ / ✗ / ⚠
- Minimal dependencies: bash + curl only to get started
-
-**Done when:** running the script on a clean Ubuntu 24.04 machine
-produces a working Custodian environment with no additional manual steps.
-
---
-
-### T05 — Onboarding guide and user journey documentation
-
-```task
-id: CUST-WP-0012-T05
-state_hub_task_id: b0839802-659a-475b-8b84-ab7341ea3d15
-status: todo
-priority: medium
-```
-
-Write `docs/onboarding.md` in the-custodian covering the full journey
-for both personas:
-
-**Primary operator (new machine):**
-1. Prerequisites (git, SSH client)
-2. Clone `the-custodian`
-3. Run `make bootstrap-env` (T04)
-4. Restart Claude Code → verify MCP is active
-5. First session: `get_state_summary()` → orient → work
-
-**Domain collaborator (new person):**
-1. Prerequisites + Gitea account
-2. `ssh-copy-id` to get access to Railiance01 (or just COULOMBCORE)
-3. Set up ops-bridge tunnel to reach state hub
-4. Clone domain repo
-5. First Claude Code session with MCP via tunnel
-6. Contributing a workplan (ADR-001 convention)
-
-**Done when:** a new collaborator can follow the guide without
-clarification from the primary operator.
-
---
-
-### T06 — State Hub multi-user model (deferred)
-
-```task
-id: CUST-WP-0012-T06
-state_hub_task_id: d5df3302-67b9-4765-a8d8-ea2df53dff6e
-status: todo
-priority: low
-```
-
-Design a lightweight user/role model for the state hub:
-
-| Role | Permissions |
-|------|-------------|
-| Primary operator | Full read/write, all domains |
-| Domain collaborator | Read all; write to own domain only |
-| Observer | Read-only |
-
-Decision needed: enforce at API layer (HTTP Basic / token auth per
-domain) or rely on Gitea repo permissions as the authoritative boundary
-(simpler — the hub is a read model anyway).
-
-**Deferred until:** first external collaborator is actively onboarding.
-Implement T01–T05 first; multi-user access control is only needed when
-there is more than one user.
-
---
-
-## References
-
- ops-bridge repo: `ops-bridge` (tunnel lifecycle management)
- MCP registration: `~/.claude/CLAUDE.md` (current manual procedure)
- Gitea repo creation: `railiance-cluster/tools/create_railiance_repo.sh`
- ADR-001: workplans as repo artefacts
- Surfaced by: RAIL-PL-WP-0001 T01 execution, 2026-03-11
--- a/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md
+++ b/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md
@@ -1,246 +0,0 @@
---
-id: CUST-WP-0038
-type: workplan
-title: "State Hub Full ThreePhoenix HA Migration"
-domain: custodian
-repo: the-custodian
-status: active
-owner: custodian
-topic_slug: custodian
-created: "2026-05-02"
-updated: "2026-05-02"
-depends_on: CUST-WP-0011
-state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
---
-
-# State Hub Full ThreePhoenix HA Migration
-
-## Goal
-
-Preserve the original long-term State Hub infrastructure goal while
-`CUST-WP-0011` takes the pragmatic railiance01 path.
-
-This workplan completes the migration from a useful single-node cluster-hosted
-State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes,
-replicated storage, tested failover, tested restore, and retirement of the WSL2
-fallback only after operational confidence is earned.
-
-## Why This Exists
-
-The near-term State Hub migration should not wait for every HA precondition,
-because the workstation-hosted State Hub is already a bottleneck for
-multi-machine work.
-
-But the original requirement remains valid:
-
- State Hub is irreplaceable episodic memory.
- A single node is not a final home.
- Backup and restore must be drilled, not assumed.
- Long-term operations must survive node loss and operator-machine loss.
-
-`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan
-keeps the ultimate target visible and reviewable.
-
-## Entry Criteria
-
- `CUST-WP-0011` completed or explicitly superseded.
- Cluster-hosted State Hub has passed its stabilisation period.
- railiance01 is not the only planned durable node.
- Railiance architecture decision for storage replication is current:
-  Longhorn, cnpg replication, external backup, or a documented replacement.
- Backup and restore tooling has an owner and runbook.
-
-## Target Properties
-
- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03.
- State Hub database survives loss of one node.
- State Hub API recovers from pod loss without manual repair.
- Backups are encrypted, off-node, and restorable into a test namespace.
- Agent access remains private.
- WSL2 is no longer needed as the primary disaster-recovery fallback.
-
-## Tasks
-
-### T01 — Confirm ThreePhoenix cluster readiness
-
-```task
-id: T01
-status: todo
-priority: high
-state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
-```
-
-Verify the target cluster state:
-
- Three nodes are joined and Ready.
- Control-plane and worker roles are documented.
- Cluster version and node resources are recorded.
- Smoke tests pass from the operator machine and from COULOMBCORE.
-
-**Done when:** a current readiness report exists and no node is marked
-NotReady or operationally unknown.
-
---
-
-### T02 — Establish replicated storage/database strategy
-
-```task
-id: T02
-status: todo
-priority: high
-state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
-```
-
-Choose and document the durable data strategy for State Hub:
-
- cnpg multi-instance PostgreSQL cluster, and/or
- Longhorn-backed storage with suitable replication, and/or
- another explicitly approved architecture.
-
-The decision must define RPO, RTO, failover behavior, and restore procedure.
-
-**Done when:** the selected architecture is documented and approved before any
-production data movement.
-
---
-
-### T03 — Implement HA State Hub database
-
-```task
-id: T03
-status: todo
-priority: high
-state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
-```
-
-Apply the chosen database/storage architecture to State Hub.
-
-Requirements:
-
- Database credentials remain SOPS/secret-managed.
- The database has automated backup configured.
- The database exposes a stable service endpoint for the API.
- Health and replication status are observable.
-
-**Done when:** State Hub can run against the HA database in a test or staging
-namespace.
-
---
-
-### T04 — Add State Hub API high-availability behavior
-
-```task
-id: T04
-status: todo
-priority: medium
-state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
-```
-
-Run State Hub API with the right availability posture for its workload:
-
- At least one replica, optionally more if DB/session behavior permits.
- Readiness and liveness probes.
- Rolling update behavior documented.
- Resource requests/limits set.
-
-**Done when:** killing an API pod does not require manual recovery.
-
---
-
-### T05 — Drill database failover
-
-```task
-id: T05
-status: todo
-priority: high
-state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
-```
-
-Perform a controlled failover drill for the State Hub database.
-
-Checks:
-
- Failure trigger is documented.
- API behavior during failover is observed.
- Recovery time is measured.
- No data loss is detected after recovery.
-
-**Done when:** the failover drill passes and results are logged.
-
---
-
-### T06 — Drill backup restore to isolated namespace
-
-```task
-id: T06
-status: todo
-priority: high
-state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
-```
-
-Restore the latest encrypted State Hub backup into an isolated namespace or
-separate test database.
-
-Checks:
-
- Backup can be decrypted with the documented key path.
- Restore completes from off-node backup material.
- Row counts and representative records match.
- Restored API can serve `/state/health` and `/state/summary` when pointed at
-  the restored database.
-
-**Done when:** restore drill passes without depending on the live database.
-
---
-
-### T07 — Update agent access and runbooks for HA endpoint
-
-```task
-id: T07
-status: todo
-priority: medium
-state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
-```
-
-Update the private access model after the HA endpoint is available:
-
- ops-bridge or tunnel target.
- MCP `API_BASE` or local port-forward convention.
- Dashboard access.
- Operator recovery instructions.
-
-**Done when:** each active operator machine can reach the HA State Hub endpoint
-through the documented path.
-
---
-
-### T08 — Retire WSL2 fallback after explicit approval
-
-```task
-id: T08
-status: todo
-priority: low
-needs_human: true
-intervention_note: "Requires explicit approval after HA failover and restore drills pass."
-state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
-```
-
-Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA
-cluster path has passed drills.
-
-Steps:
-
-1. Take and archive a final WSL2 backup.
-2. Stop local WSL2 State Hub services.
-3. Update global and repo instructions.
-4. Record the retirement decision in State Hub.
-
-**Done when:** WSL2 is no longer part of the normal or fallback operating
-model, and the cluster runbook is the source of truth.
-
-## References
-
- `CUST-WP-0011` — pragmatic railiance01 migration
- Railiance ThreePhoenix infrastructure goal
- State Hub backup/restore runbooks
- Constitution constraint: irreversible retirement requires human approval