From febad058e639c24d7358d4ce013a1ff0312c8ae8 Mon Sep 17 00:00:00 2001 From: tegwick Date: Mon, 18 May 2026 01:32:29 +0200 Subject: [PATCH] Migrate State Hub workplans --- workplans/CUST-WP-0003-whi-kpi-card.md | 188 +++++++++ ...P-0011-state-hub-threephoenix-migration.md | 366 ++++++++++++++++++ .../CUST-WP-0012-multi-user-onboarding.md | 246 ++++++++++++ .../CUST-WP-0038-state-hub-threephoenix-ha.md | 246 ++++++++++++ 4 files changed, 1046 insertions(+) create mode 100644 workplans/CUST-WP-0003-whi-kpi-card.md create mode 100644 workplans/CUST-WP-0011-state-hub-threephoenix-migration.md create mode 100644 workplans/CUST-WP-0012-multi-user-onboarding.md create mode 100644 workplans/CUST-WP-0038-state-hub-threephoenix-ha.md diff --git a/workplans/CUST-WP-0003-whi-kpi-card.md b/workplans/CUST-WP-0003-whi-kpi-card.md new file mode 100644 index 0000000..87dffef --- /dev/null +++ b/workplans/CUST-WP-0003-whi-kpi-card.md @@ -0,0 +1,188 @@ +--- +id: CUST-WP-0003 +type: workplan +title: "State Hub v0.4 — Workstream Health Index (WHI) KPI Card" +domain: custodian +repo: state-hub +status: active +owner: custodian +topic_slug: custodian +state_hub_workstream_id: 9cc32158-2f5c-4ef6-9713-aacce4623d5e +created: "2026-02-26" +updated: "2026-05-17" +--- + +# State Hub v0.4 — Workstream Health Index (WHI) KPI Card + +## Summary + +Implement the Workstream Health Index (WHI) — a composite structural-health +KPI — as a live card injected into the TOC sidebar of the Workstreams +dashboard page. All six metrics are computable client-side from data +already fetched by `dashboard/src/workstreams.md`; no API or schema changes +required. + +## Context + +The WHI formula and metric definitions are specified in +`dashboard/src/docs/workstream-kpi.md`. This workplan covers +only the implementation of that spec as running dashboard code. 
+ +The six base metrics: +- **DD** — Dependency Density: edge count / open workstream count +- **BR** — Blocked Ratio: blocked workstreams / open count +- **SPR** — Single Point of Risk: max inbound edges / open count +- **PEP** — Progression Enablement Proportion: ready-to-start workstreams +- **CDDR** — Cross-Domain Dependency Ratio: cross-domain edges / total edges +- **CPI** — Cycle Penalty Indicator: 1 if any cycle detected, 0 otherwise + +WHI formula: `0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR)` +CPI penalty: `WHI = WHI * 0.5` if CPI=1. + +## Tasks + +### P1 — Verify dependency edge fields in open_workstreams + +```task +id: CUST-WP-0003-T01 +state_hub_task_id: 243646e0-b77a-41e7-ac51-82c5828e63d2 +status: todo +priority: high +``` + +Confirm that `summary.open_workstreams[].depends_on[]` and `blocks[]` +each carry `workstream_id`, `workstream_slug`, and `workstream_title`. +Verify these fields are sufficient to build a complete directed dependency +graph client-side without additional API calls. (Already verified during +workplan design — open_workstreams is the confirmed data source.) + +### P2.1 — Build directed dependency graph from openWs + completedIds + +```task +id: CUST-WP-0003-T02 +state_hub_task_id: 6dbef71f-d2d7-44ee-abb8-279dbaeec505 +status: todo +priority: high +``` + +In `dashboard/src/workstreams.md`: derive `completedIds = new Set` of IDs of workstreams +with status completed. Build an adjacency list: for each entry in openWs, +map workstream id → array of `depends_on[].workstream_id`. Build reverse +map (prerequisite id → list of dependent ids) for SPR computation. Also +build `idToDomain` map from `data[]` for CDDR. + +### P2.2 — Implement DFS cycle detection (CPI) + +```task +id: CUST-WP-0003-T03 +state_hub_task_id: f0d5c107-6029-4ad0-af00-645d35ce7db0 +status: todo +priority: high +``` + +Implement a DFS-based topological sort over the dependency adjacency list. 
+Detect back edges using visited / inStack colour sets. Return `CPI = 1` +if any cycle found, `CPI = 0` otherwise. Only nodes in openWs participate +(completed/archived workstreams excluded). Edge case: isolated nodes (no +deps, no dependents) are valid and never form cycles. + +### P2.3 — Compute DD, BR, SPR, PEP, CDDR + +```task +id: CUST-WP-0003-T04 +state_hub_task_id: 6da60567-cc46-4a32-9855-b07bafe2faeb +status: todo +priority: high +``` + +Using the graph from P2.1: +- `DD`: totalEdges / openCount, where totalEdges = openWs.flatMap(w=>w.depends_on).length +- `BR`: openWs.filter(w=>w.status==="blocked").length / openCount +- `SPR`: max inbound-edge count across prerequisite workstreams in openWs / openCount +- `PEP`: openWs.filter(w=>active && all depends_on are in completedIds).length / openCount +- `CDDR`: crossDomainEdges / totalEdges (edge with different domain endpoints); 0 if no edges + +### P2.4 — WHI formula: normalisation + CPI penalty + +```task +id: CUST-WP-0003-T05 +state_hub_task_id: 29b2dbbd-5d60-49b6-ae84-3dbf22167df7 +status: todo +priority: high +``` + +Implement the weighted aggregation: +``` +DDnorm = min(1, DD / 1.0) // DD_critical = 1.0 +WHI = 0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR) +if CPI === 1: WHI = WHI * 0.5 +``` +Clamp to [0, 1]. Return `{whi, dd, ddNorm, br, spr, pep, cddr, cpi, openCount, edgeCount}`. +Factor into `computeWHI(nodes, edges, idToDomain)` for reuse in per-domain scope. 
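As a rough illustration of P2.1 through P2.4, the sketch below shows one way the graph, the cycle check, and the weighted aggregation could fit together in plain JavaScript. It is not the dashboard implementation. The helper names `buildEdges` and `hasCycle`, the `{from, to}` edge shape, the fourth `completedIds` parameter, the `"active"` status string, and the `null` return for an empty input are all assumptions made for the example; only the field names and the formula come from this workplan.

```javascript
// Sketch only. Assumed (not from the workplan): {from, to} edge pairs,
// the completedIds parameter, the "active" status value, null on empty input.

// Flatten depends_on[] into {from, to} pairs (dependent -> prerequisite).
function buildEdges(nodes) {
  return nodes.flatMap((w) =>
    (w.depends_on ?? []).map((d) => ({ from: w.id, to: d.workstream_id }))
  );
}

// CPI: depth-first search with visited / inStack sets over open nodes only.
function hasCycle(openIds, edges) {
  const adj = new Map([...openIds].map((id) => [id, []]));
  for (const { from, to } of edges) {
    // Edges that touch completed or archived workstreams are ignored here.
    if (adj.has(from) && adj.has(to)) adj.get(from).push(to);
  }
  const visited = new Set();
  const inStack = new Set();
  const visit = (id) => {
    if (inStack.has(id)) return true; // back edge found, so there is a cycle
    if (visited.has(id)) return false;
    visited.add(id);
    inStack.add(id);
    for (const next of adj.get(id)) {
      if (visit(next)) return true;
    }
    inStack.delete(id);
    return false;
  };
  return [...openIds].some((id) => visit(id));
}

function computeWHI(nodes, edges, idToDomain, completedIds = new Set()) {
  const openCount = nodes.length;
  if (openCount === 0) return null; // empty state: nothing to score

  const openIds = new Set(nodes.map((w) => w.id));
  const edgeCount = edges.length;

  const dd = edgeCount / openCount;
  const ddNorm = Math.min(1, dd / 1.0); // DD_critical = 1.0
  const br = nodes.filter((w) => w.status === "blocked").length / openCount;

  // SPR: highest number of dependents on any open prerequisite.
  const inbound = new Map();
  for (const { to } of edges) {
    if (openIds.has(to)) inbound.set(to, (inbound.get(to) ?? 0) + 1);
  }
  const spr = Math.max(0, ...inbound.values()) / openCount;

  // PEP: active workstreams whose prerequisites are all completed.
  const ready = nodes.filter(
    (w) =>
      w.status === "active" &&
      (w.depends_on ?? []).every((d) => completedIds.has(d.workstream_id))
  );
  const pep = ready.length / openCount;

  const cross = edges.filter((e) => idToDomain[e.from] !== idToDomain[e.to]);
  const cddr = edgeCount === 0 ? 0 : cross.length / edgeCount;

  const cpi = hasCycle(openIds, edges) ? 1 : 0;

  let whi =
    0.3 * (1 - ddNorm) +
    0.25 * (1 - br) +
    0.15 * (1 - spr) +
    0.2 * pep +
    0.1 * (1 - cddr);
  if (cpi === 1) whi *= 0.5;
  whi = Math.min(1, Math.max(0, whi)); // clamp to [0, 1]

  return { whi, dd, ddNorm, br, spr, pep, cddr, cpi, openCount, edgeCount };
}
```

Carrying both endpoints in each edge is a deliberate choice in this sketch: CDDR needs the domain of the dependent as well as the prerequisite, which a bare `depends_on[]` entry does not provide. A global score would then be `computeWHI(openWs, buildEdges(openWs), idToDomain, completedIds)`.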
+ +### P2.5 — Per-domain WHI breakdown + +```task +id: CUST-WP-0003-T06 +state_hub_task_id: 8ce5ef74-5eb8-4259-9b11-dde13bf84a89 +status: todo +priority: medium +``` + +For each domain present in openWs, compute a domain-scoped WHI: +- `domainNodes = openWs.filter(w => idToDomain[w.id] === domain)` +- `domainEdges = domainNodes.flatMap(w => w.depends_on.filter(d => idToDomain[d.workstream_id] === domain))` +- `result = computeWHI(domainNodes, domainEdges, idToDomain)` + +Store as `[{domain, whi, br, pep, cpi, openCount}]`. Skip domains with +`openCount === 0`. + +### P3 — WHI KPI card UI + +```task +id: CUST-WP-0003-T07 +state_hub_task_id: 91efba5c-3be2-4bfe-b5ef-1b261e9423f2 +status: todo +priority: high +``` + +Build the `_whiBox` element in `dashboard/src/workstreams.md` (mirrors `_kpiBox` in +`decisions.md`): +- Card title: "Workstream Health" +- Main WHI value with health state label: GREEN ≥ 0.75 / ORANGE ≥ 0.50 / RED < 0.50 +- Sub-metric rows for DD, BR, SPR, PEP, CDDR with individual warning colours +- Cycle alert row (red ⚠) when CPI=1 +- Domain breakdown: compact rows with domain name + coloured score +- Empty state if openCount=0 or no edges + +Inject via `injectTocTop("whi-kpi-box", _whiBox)`. Wire +`withDocHelp(_whiBox, "/docs/workstream-health-index")`. + +### P4.1 — Create src/docs/workstream-health-index.md + +```task +id: CUST-WP-0003-T08 +state_hub_task_id: 4c898472-e4ae-49a2-b6cd-7aa1a3c7604a +status: todo +priority: medium +``` + +Reference documentation for the WHI KPI card. Cover: purpose, all six +metrics (formula + interpretation), WHI aggregation formula with CPI +penalty, DD normalisation, health state thresholds, domain breakdown, +cycle detection, and how to improve a poor score. Update +`workstream-kpi.md` to link to this doc. 
+ +### P4.2 — Wire withDocHelp and add to Reference nav + +```task +id: CUST-WP-0003-T09 +state_hub_task_id: 20976663-7ac9-4909-8029-a479190f52ff +status: todo +priority: low +``` + +Confirm `withDocHelp(_whiBox, "/docs/workstream-health-index")` is wired +(from P3). Add `{ name: "Workstream Health", path: "/docs/workstream-health-index" }` +to the Reference pages array in `observablehq.config.js`. Verify +Reference nav renders correctly in `npm run dev`. diff --git a/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md b/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md new file mode 100644 index 0000000..a33e267 --- /dev/null +++ b/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md @@ -0,0 +1,366 @@ +--- +id: CUST-WP-0011 +type: workplan +title: "Pragmatic State Hub Migration to railiance01" +domain: custodian +repo: state-hub +status: active +owner: custodian +topic_slug: custodian +created: "2026-03-11" +updated: "2026-05-17" +state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940" +supersedes_intent_from: "Migrate Custodian State Hub to ThreePhoenix Cluster" +follow_up_workplan: CUST-WP-0038 +--- + +# Pragmatic State Hub Migration to railiance01 + +## Goal + +Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator +workstation to the current railiance01 Kubernetes environment, using the +Railiance production-readiness path that exists now: + +- CloudNative PG (`cnpg`) for the State Hub database in the `databases` + namespace. +- State Hub as an S5 workload in `railiance-apps`. +- Platform/database ownership in `railiance-platform`. +- Access through the existing private tunnel/ops-bridge model, not public + exposure. +- WSL2 retained as a disaster-recovery fallback until the cluster deployment + has proven stable. + +This is a deliberate pragmatic step. It improves durability and multi-machine +access before the full ThreePhoenix target is ready. 
The ultimate multi-node, +replicated, long-term cluster goal is preserved in `CUST-WP-0038`. + +## Context Update + +The original 2026-03-11 version of this workplan targeted a future +ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before +starting. That was correct as an end-state, but it blocks useful progress now. + +The current Railiance architecture has moved on: + +- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md` + supersedes the older Bitnami PostgreSQL HA platform baseline. +- CloudNative PG is the deployed database operator. +- `RAIL-HO-WP-0004-T09` is the Railiance-side task for deploying State Hub to + the cluster, and it still requires human decisions before live data + migration. + +This workplan is now the Custodian-side coordination and safety plan for that +T09 effort. + +## Safety Contract + +State Hub is irreplaceable episodic memory. This migration may prepare, deploy, +test, and compare as much as needed, but it must not make the cluster the only +source of truth until the explicit cutover gate is satisfied. + +Rules: + +- A fresh WSL2 backup and restore drill is mandatory before data migration. +- The WSL2 State Hub remains available as rollback until stabilisation passes. +- Any task that changes the live writer endpoint requires explicit human + approval. +- A failed cluster deploy must leave the WSL2 instance untouched and usable. +- Row counts and key API checks must match before cutover. + +## Target Architecture After This Workplan + +``` +Operator workstation / COULOMBCORE / other agent hosts + -> local MCP server subprocess + -> http://127.0.0.1:8000 or configured API_BASE + -> private tunnel / ops-bridge + -> railiance01 k3s + -> state-hub Service + -> FastAPI Deployment + -> state-hub-db CloudNative PG Cluster +``` + +Key properties: + +- Single-node pragmatic deployment on railiance01. +- No public unauthenticated exposure. +- Database managed by cnpg, not an ad-hoc Postgres StatefulSet. 
+- WSL2 retained as DR fallback during stabilisation. +- Future multi-node HA and storage replication are deferred to `CUST-WP-0038`. + +## Open Human Decisions + +Resolve these before T04/T05 can become live migration work: + +1. Final State Hub hostname or tunnel-only endpoint. +2. Container registry choice: Gitea registry vs external interim registry. +3. Exposure model: ClusterIP plus tunnel, private ingress, or both. +4. Approval window for freezing WSL2 writes and migrating the production DB. +5. Stabilisation duration before WSL2 can be considered non-primary fallback. + +## Tasks + +### T01 — Drill WSL2 State Hub backup restore + +```task +id: T01 +status: done +priority: high +state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf" +completed: "2026-05-02" +``` + +Take a fresh State Hub backup from the current WSL2 instance and restore it +into an isolated test PostgreSQL instance. + +Minimum checks: + +- Restore completes without errors. +- Core table row counts match the live WSL2 database. +- `/state/summary` can be served from the restored copy if wired to a test API. +- Drill result is recorded in State Hub progress and, if useful, episodic + memory. + +**Done when:** backup and restore are proven within 24 hours of live migration +work. + +Result: completed 2026-05-02. A fresh dump from `infra-postgres-1` restored +into disposable container `state-hub-restore-test` on `127.0.0.1:5433`. +Application health and summary checks against the restored database returned +HTTP 200. Restored row counts matched production exactly, including 117 +workstreams, 989 tasks, 1423 progress events, and 208 token events. + +--- + +### T02 — Align with Railiance deployment plan + +```task +id: T02 +status: done +priority: high +state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c" +completed: "2026-05-02" +``` + +Update the cross-repo plan so this Custodian workplan and +`RAIL-HO-WP-0004-T09` point to the same architecture. 
+ +Expected outputs: + +- `RAIL-HO-WP-0004-T09` remains the Railiance-side execution task. +- This workplan remains the Custodian-side safety/cutover task list. +- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the + near-term migration plan. +- The future HA goal is referenced through `CUST-WP-0038`. + +**Done when:** both workplans describe compatible responsibilities and gates. + +Result: completed 2026-05-02. `RAIL-HO-WP-0004-T09` now names the same +pragmatic railiance01 path: cnpg database, S5 State Hub workload, restore drill +precondition, empty deploy before data copy, explicit human approval before +freezing WSL2 writes, and WSL2 retained as fallback. Full ThreePhoenix HA stays +deferred to `CUST-WP-0038`. + +--- + +### T03 — Build and publish State Hub container image + +```task +id: T03 +status: in_progress +priority: high +state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a" +``` + +Package this repository as a production image. + +Requirements: + +- Dockerfile builds from the current Python/uv project. +- Alembic and runtime dependencies are available inside the image. +- Image exposes the FastAPI service on port 8000. +- Image tag is pushed to the chosen registry. +- Build provenance is documented in the commit/workplan. + +**Done when:** railiance01 can pull the image and a dry-run deployment resolves +it. + +Progress 2026-05-03: added `Dockerfile`, +`.dockerignore`, and `docs/container-image.md`. Built +local image `state-hub:local` successfully: +`sha256:e96dbd1e7d2b63e4fb17584c8c2216088a2c9937bfe880c2ad565c7a9f51c0fc` +(~106 MB). Verified container `/state/health` returns HTTP 200 against the +current database when run locally with host networking. Verified Alembic is +available in-image and reports current revision `r5m6n7o8p9q0 (head)`. + +Progress 2026-05-03: registry target decision resolved to the self-hosted +Gitea registry. 
A local SSH tunnel to the NodePort can reach Gitea, but Docker +login/push still receives HTTP 404 from `/v2/`. Runtime inspection shows the +live Gitea `app.ini` has no `[packages]` section, so package registry +enablement/routing must be applied before publishing `state-hub:local`. + +Progress 2026-05-15: rebuilt the image from current State Hub sources as +`state-hub:local` with digest +`sha256:039d29654ccb3754c6ecdbe497c6364bbd8452edcdcb7fa937dd9debf5b734ff` +(106004480 bytes). Verified `/state/health` returns +`{"status":"ok","db":"connected"}` from a temporary container on host port +18000 and confirmed in-image Alembic reports `t7o8p9q0r1s2 (head)`. Build +provenance is recorded in `docs/container-image.md`. + +Remaining: enable the Gitea package/container registry, then tag, push, and +pull the image from railiance01. + +--- + +### T04 — Define State Hub database and app manifests + +```task +id: T04 +status: todo +priority: high +state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844" +``` + +Create the cluster-side deployment assets using current Railiance boundaries: + +- `railiance-platform`: `state-hub-db` cnpg cluster and database credentials. +- `railiance-apps`: State Hub Deployment, Service, ConfigMap, Secret/External + Secret reference, and optional private Ingress. +- Health probes use `GET /state/health`. +- Environment includes `DATABASE_URL` and any required API settings. + +**Done when:** manifests lint/apply in a non-destructive dry run and ownership +boundaries are documented. + +--- + +### T05 — Deploy empty State Hub and run migrations on railiance01 + +```task +id: T05 +status: todo +priority: high +state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1" +``` + +Deploy State Hub against an empty `state-hub-db` cnpg database and run Alembic +migrations in the cluster environment. + +Checks: + +- Pod reaches Ready. +- `/state/health` returns healthy through the intended private access path. +- Alembic reports head. 
+- Logs show no repeated crash/restart loop. + +**Done when:** an empty but structurally valid State Hub runs on railiance01. + +--- + +### T06 — Restore WSL2 data copy into cluster and compare + +```task +id: T06 +status: todo +priority: high +state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060" +``` + +Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live +source of truth. + +Required comparison: + +- Table row counts match. +- Representative workstreams, tasks, decisions, progress events, repos, and + token events are queryable. +- Dashboard and MCP summary calls return expected data through the cluster API. +- Any mismatch is investigated before proceeding. + +**Done when:** cluster data is a verified copy of WSL2, but not yet the only +writer. + +--- + +### T07 — Cut over private access to cluster State Hub + +```task +id: T07 +status: todo +priority: medium +state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e" +needs_human: true +intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint." +``` + +With human approval, freeze WSL2 writes, take a final dump, restore it to the +cluster, compare counts again, and redirect the active private access path to +the cluster API. + +Accepted approaches: + +- Keep local MCP config pointed at `http://127.0.0.1:8000` and move that port + to an ops-bridge/SSH tunnel. +- Or set the MCP server `API_BASE` to the chosen private cluster endpoint. + +**Done when:** `get_state_summary()` and dashboard live data are served by the +cluster State Hub, and WSL2 is no longer receiving normal writes. + +--- + +### T08 — Stabilise with WSL2 retained as fallback + +```task +id: T08 +status: todo +priority: medium +state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2" +``` + +Run the cluster State Hub as primary while keeping the WSL2 instance available +as a fallback. + +Monitor: + +- State Hub pod restarts. +- cnpg cluster health. 
+- Backup job success. +- Dashboard and MCP behavior from each operator machine. +- Consistency sync behavior for file-backed workplans. + +**Done when:** the agreed stabilisation window passes without data loss or +unresolved operational defects. + +--- + +### T09 — Document operating model and defer final WSL2 retirement + +```task +id: T09 +status: todo +priority: low +state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681" +``` + +Document the new operating model: + +- How agents reach State Hub. +- How backups and restores work. +- How to roll back to WSL2 if needed. +- Which parts remain pragmatic/single-node. +- Which long-term requirements moved to `CUST-WP-0038`. + +Do not permanently retire WSL2 in this workplan unless a separate human +decision is recorded. Retirement belongs after proven stability or in the +future HA workplan. + +**Done when:** runbooks and project instructions match the deployed reality. + +## References + +- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md` +- `RAIL-HO-WP-0004-T09` — Railiance-side State Hub deployment task +- `CUST-WP-0038` — future full ThreePhoenix HA State Hub migration +- Constitution constraint: production data migration and fallback retirement + require explicit human approval diff --git a/workplans/CUST-WP-0012-multi-user-onboarding.md b/workplans/CUST-WP-0012-multi-user-onboarding.md new file mode 100644 index 0000000..a754d8b --- /dev/null +++ b/workplans/CUST-WP-0012-multi-user-onboarding.md @@ -0,0 +1,246 @@ +--- +id: CUST-WP-0012 +type: workplan +title: "Multi-User Onboarding and Environment Bootstrap" +domain: custodian +repo: state-hub +status: active +owner: custodian +topic_slug: custodian +state_hub_workstream_id: "a28d9e29-4119-4b73-9469-f921920253ef" +created: "2026-03-11" +updated: "2026-05-17" +--- + +# Multi-User Onboarding and Environment Bootstrap + +## Goal + +Make the Custodian system accessible to collaborators beyond the primary +operator. 
A new user (or a new machine for the existing operator) should +be able to go from zero to a productive Claude Code session with full +State Hub MCP connectivity in a single session, without manual steps or +undocumented tribal knowledge. + +## Context + +Several friction points surfaced during the 2026-03-11 session: + +- No SSH key for Railiance01 on WSL2 → blocked `make tunnel-loop` +- No `~/.railiance_gitea.conf` → blocked repo creation script +- Token missing `read:user` scope → blocked org repo creation +- No `git credential.helper` → credentials required on every push +- MCP registration is manual and documented only in `CLAUDE.md` + +Each of these is a solved problem in isolation. This workstream collects +them into a repeatable, documented bootstrap experience. + +## Scope + +Two personas: + +| Persona | Access level | Typical machine | +|---------|-------------|-----------------| +| Primary operator | Full access, all domains | WSL2 workstation | +| Domain collaborator | Read + write to one domain | COULOMBCORE, remote laptop | + +## Tasks + +### T01 — Git credential.helper for Gitea access + +```task +id: CUST-WP-0012-T01 +state_hub_task_id: 71628269-9a75-4dae-a347-e64a86040322 +status: todo +priority: medium +``` + +Document and automate `git credential.helper` setup for Gitea +(`http://92.205.130.254:32166`). Recommend `libsecret` (keyring-backed) +on machines that support it; fall back to `credential.helper=store` +(persistent, plaintext `~/.git-credentials`) on headless servers. + +Include in bootstrap script (T04) and onboarding guide (T05). 
+ +```bash +# Preferred: libsecret (GNOME keyring, WSL2 with keyring daemon) +sudo apt-get install -y libsecret-1-0 libsecret-1-dev +sudo make -C /usr/share/doc/git/contrib/credential/libsecret +git config --global credential.helper \ + /usr/share/doc/git/contrib/credential/libsecret/git-credential-libsecret + +# Fallback: store (plaintext, suitable for headless servers) +git config --global credential.helper store + +# Headless server alternative: cache (in-memory, 1h timeout) +git config --global credential.helper 'cache --timeout=3600' +``` + +**Done when:** included in bootstrap script; push to Gitea works without +re-entering credentials on second attempt. + +--- + +### T02 — SSH key generation and authorization automation + +```task +id: CUST-WP-0012-T02 +state_hub_task_id: fea965e9-8a8f-439c-9096-8f7756eb71ed +status: todo +priority: medium +``` + +Script or Ansible task that: +1. Generates an `ed25519` key pair on the new machine if none exists +2. Displays the public key with copy instructions +3. Authorizes it on all managed hosts (Railiance01, COULOMBCORE) via + `ssh-copy-id` or Ansible `authorized_key` module + +Surfaced by: RAIL-PL-WP-0001 T01 — no SSH key on WSL2 blocked +`make tunnel-loop HOST=tegwick@92.205.62.239`. + +```bash +# Generate if missing +[[ -f ~/.ssh/id_ed25519 ]] || ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" + +# Authorize on a target host (requires existing access once) +ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.62.239 +ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.130.254 +``` + +**Done when:** included in bootstrap script; documented in onboarding guide. + +--- + +### T03 — Claude Code MCP registration automation + +```task +id: CUST-WP-0012-T03 +state_hub_task_id: 60318e9a-972e-45c8-afde-82ed0625f594 +status: todo +priority: medium +``` + +Automate the state-hub MCP server registration on a new machine. +Currently this is a multi-step manual process documented in +`~/.claude/CLAUDE.md`. 
It should be a single `make` target or script: + +```bash +# In /home/worsch/state-hub/ +make register-mcp # idempotent; safe to re-run +``` + +The script should: +1. Detect whether `state-hub` is already in `~/.claude.json` +2. Extract the server config from `.mcp.json` +3. Run `claude mcp add-json -s user state-hub ` +4. Run `patch_mcp_cwd.py` to restore the cwd field +5. Print instructions to restart Claude Code + +Should also detect whether the state hub is reachable directly +(`http://127.0.0.1:8000`) or needs a tunnel (via ops-bridge), and emit +a warning if neither is available. + +**Done when:** `make register-mcp` works on a clean machine; documented +in onboarding guide. + +--- + +### T04 — Environment bootstrap script + +```task +id: CUST-WP-0012-T04 +state_hub_task_id: 84a94761-e424-4470-a9a2-64d9cabadb7f +status: todo +priority: high +``` + +Single idempotent script: `scripts/bootstrap-env.sh` + +Checks/installs prerequisites and configures the environment: + +| Step | What | +|------|------| +| Prerequisites | git, sops, age, helm, kubectl, uv, claude CLI | +| Git credential | `credential.helper` (libsecret or store) | +| SSH key | Generate ed25519 if missing; display public key | +| MCP registration | `make register-mcp` (T03) | +| Gitea config | Prompt for token; write `~/.railiance_gitea.conf` | +| Health check | `curl /state/health`; warn if tunnel needed | + +Design constraints: +- Idempotent: safe to run on an already-configured machine +- No silent failures: each step prints ✓ / ✗ / ⚠ +- Minimal dependencies: bash + curl only to get started + +**Done when:** running the script on a clean Ubuntu 24.04 machine +produces a working Custodian environment with no additional manual steps. 
+ +--- + +### T05 — Onboarding guide and user journey documentation + +```task +id: CUST-WP-0012-T05 +state_hub_task_id: b0839802-659a-475b-8b84-ab7341ea3d15 +status: todo +priority: medium +``` + +Write `docs/onboarding.md` in this repository covering the full journey +for both personas: + +**Primary operator (new machine):** +1. Prerequisites (git, SSH client) +2. Clone `state-hub` and the relevant domain repository +3. Run `make bootstrap-env` (T04) +4. Restart Claude Code → verify MCP is active +5. First session: `get_state_summary()` → orient → work + +**Domain collaborator (new person):** +1. Prerequisites + Gitea account +2. `ssh-copy-id` to get access to Railiance01 (or just COULOMBCORE) +3. Set up ops-bridge tunnel to reach state hub +4. Clone domain repo +5. First Claude Code session with MCP via tunnel +6. Contributing a workplan (ADR-001 convention) + +**Done when:** a new collaborator can follow the guide without +clarification from the primary operator. + +--- + +### T06 — State Hub multi-user model (deferred) + +```task +id: CUST-WP-0012-T06 +state_hub_task_id: d5df3302-67b9-4765-a8d8-ea2df53dff6e +status: todo +priority: low +``` + +Design a lightweight user/role model for the state hub: + +| Role | Permissions | +|------|-------------| +| Primary operator | Full read/write, all domains | +| Domain collaborator | Read all; write to own domain only | +| Observer | Read-only | + +Decision needed: enforce at API layer (HTTP Basic / token auth per +domain) or rely on Gitea repo permissions as the authoritative boundary +(simpler — the hub is a read model anyway). + +**Deferred until:** first external collaborator is actively onboarding. +Implement T01–T05 first; multi-user access control is only needed when +there is more than one user. 
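If the API-layer option is chosen, the role table above could reduce to a small permission check. The following is a language-neutral sketch written in JavaScript, not an existing State Hub API: the role keys, the `user` object shape (`role`, `domain`), and the function names `canRead` and `canWrite` are hypothetical.

```javascript
// Hypothetical role keys and user shape, for illustration only.
const WRITE_SCOPE = new Map([
  ["primary_operator", "all"],     // full read/write, all domains
  ["domain_collaborator", "own"],  // read all; write to own domain only
  ["observer", "none"],            // read-only
]);

// Every defined role may read all domains; unknown roles are denied.
function canRead(user) {
  return WRITE_SCOPE.has(user.role);
}

// Write access depends on the role's scope and, for collaborators,
// on whether the target domain is the user's own.
function canWrite(user, targetDomain) {
  const scope = WRITE_SCOPE.get(user.role);
  if (scope === "all") return true;
  if (scope === "own") return user.domain === targetDomain;
  return false; // observers and unknown roles
}
```

Denying unknown roles by default keeps the check safe if a role name is misspelled or added later without a matching table entry.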
+ +--- + +## References + +- ops-bridge repo: `ops-bridge` (tunnel lifecycle management) +- MCP registration: `~/.claude/CLAUDE.md` (current manual procedure) +- Gitea repo creation: `railiance-cluster/tools/create_railiance_repo.sh` +- ADR-001: workplans as repo artefacts +- Surfaced by: RAIL-PL-WP-0001 T01 execution, 2026-03-11 diff --git a/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md b/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md new file mode 100644 index 0000000..f0260fb --- /dev/null +++ b/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md @@ -0,0 +1,246 @@ +--- +id: CUST-WP-0038 +type: workplan +title: "State Hub Full ThreePhoenix HA Migration" +domain: custodian +repo: state-hub +status: active +owner: custodian +topic_slug: custodian +created: "2026-05-02" +updated: "2026-05-17" +depends_on: CUST-WP-0011 +state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85" +--- + +# State Hub Full ThreePhoenix HA Migration + +## Goal + +Preserve the original long-term State Hub infrastructure goal while +`CUST-WP-0011` takes the pragmatic railiance01 path. + +This workplan completes the migration from a useful single-node cluster-hosted +State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes, +replicated storage, tested failover, tested restore, and retirement of the WSL2 +fallback only after operational confidence is earned. + +## Why This Exists + +The near-term State Hub migration should not wait for every HA precondition, +because the workstation-hosted State Hub is already a bottleneck for +multi-machine work. + +But the original requirement remains valid: + +- State Hub is irreplaceable episodic memory. +- A single node is not a final home. +- Backup and restore must be drilled, not assumed. +- Long-term operations must survive node loss and operator-machine loss. + +`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan +keeps the ultimate target visible and reviewable. 
+ +## Entry Criteria + +- `CUST-WP-0011` completed or explicitly superseded. +- Cluster-hosted State Hub has passed its stabilisation period. +- railiance01 is not the only planned durable node. +- Railiance architecture decision for storage replication is current: + Longhorn, cnpg replication, external backup, or a documented replacement. +- Backup and restore tooling has an owner and runbook. + +## Target Properties + +- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03. +- State Hub database survives loss of one node. +- State Hub API recovers from pod loss without manual repair. +- Backups are encrypted, off-node, and restorable into a test namespace. +- Agent access remains private. +- WSL2 is no longer needed as the primary disaster-recovery fallback. + +## Tasks + +### T01 — Confirm ThreePhoenix cluster readiness + +```task +id: T01 +status: todo +priority: high +state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110" +``` + +Verify the target cluster state: + +- Three nodes are joined and Ready. +- Control-plane and worker roles are documented. +- Cluster version and node resources are recorded. +- Smoke tests pass from the operator machine and from CoulombCore. + +**Done when:** a current readiness report exists and no node is marked +NotReady or operationally unknown. + +--- + +### T02 — Establish replicated storage/database strategy + +```task +id: T02 +status: todo +priority: high +state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140" +``` + +Choose and document the durable data strategy for State Hub: + +- cnpg multi-instance PostgreSQL cluster, and/or +- Longhorn-backed storage with suitable replication, and/or +- another explicitly approved architecture. + +The decision must define RPO, RTO, failover behavior, and restore procedure. + +**Done when:** the selected architecture is documented and approved before any +production data movement. 
+ +--- + +### T03 — Implement HA State Hub database + +```task +id: T03 +status: todo +priority: high +state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6" +``` + +Apply the chosen database/storage architecture to State Hub. + +Requirements: + +- Database credentials remain SOPS/secret-managed. +- The database has automated backup configured. +- The database exposes a stable service endpoint for the API. +- Health and replication status are observable. + +**Done when:** State Hub can run against the HA database in a test or staging +namespace. + +--- + +### T04 — Add State Hub API high-availability behavior + +```task +id: T04 +status: todo +priority: medium +state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24" +``` + +Run State Hub API with the right availability posture for its workload: + +- At least one replica, optionally more if DB/session behavior permits. +- Readiness and liveness probes. +- Rolling update behavior documented. +- Resource requests/limits set. + +**Done when:** killing an API pod does not require manual recovery. + +--- + +### T05 — Drill database failover + +```task +id: T05 +status: todo +priority: high +state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86" +``` + +Perform a controlled failover drill for the State Hub database. + +Checks: + +- Failure trigger is documented. +- API behavior during failover is observed. +- Recovery time is measured. +- No data loss is detected after recovery. + +**Done when:** the failover drill passes and results are logged. + +--- + +### T06 — Drill backup restore to isolated namespace + +```task +id: T06 +status: todo +priority: high +state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74" +``` + +Restore the latest encrypted State Hub backup into an isolated namespace or +separate test database. + +Checks: + +- Backup can be decrypted with the documented key path. +- Restore completes from off-node backup material. +- Row counts and representative records match. 
+- Restored API can serve `/state/health` and `/state/summary` when pointed at + the restored database. + +**Done when:** restore drill passes without depending on the live database. + +--- + +### T07 — Update agent access and runbooks for HA endpoint + +```task +id: T07 +status: todo +priority: medium +state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c" +``` + +Update the private access model after the HA endpoint is available: + +- ops-bridge or tunnel target. +- MCP `API_BASE` or local port-forward convention. +- Dashboard access. +- Operator recovery instructions. + +**Done when:** each active operator machine can reach the HA State Hub endpoint +through the documented path. + +--- + +### T08 — Retire WSL2 fallback after explicit approval + +```task +id: T08 +status: todo +priority: low +needs_human: true +intervention_note: "Requires explicit approval after HA failover and restore drills pass." +state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add" +``` + +Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA +cluster path has passed drills. + +Steps: + +1. Take and archive a final WSL2 backup. +2. Stop local WSL2 State Hub services. +3. Update global and repo instructions. +4. Record the retirement decision in State Hub. + +**Done when:** WSL2 is no longer part of the normal or fallback operating +model, and the cluster runbook is the source of truth. + +## References + +- `CUST-WP-0011` — pragmatic railiance01 migration +- Railiance ThreePhoenix infrastructure goal +- State Hub backup/restore runbooks +- Constitution constraint: irreversible retirement requires human approval