diff --git a/workplans/CUST-WP-0003-whi-kpi-card.md b/workplans/CUST-WP-0003-whi-kpi-card.md deleted file mode 100644 index b38f27c..0000000 --- a/workplans/CUST-WP-0003-whi-kpi-card.md +++ /dev/null @@ -1,186 +0,0 @@ ---- -id: CUST-WP-0003 -type: workplan -title: "State Hub v0.4 — Workstream Health Index (WHI) KPI Card" -domain: custodian -status: active -owner: custodian -topic_slug: custodian -state_hub_workstream_id: 9cc32158-2f5c-4ef6-9713-aacce4623d5e -created: "2026-02-26" -updated: "2026-02-28" ---- - -# State Hub v0.4 — Workstream Health Index (WHI) KPI Card - -## Summary - -Implement the Workstream Health Index (WHI) — a composite structural-health -KPI — as a live card injected into the TOC sidebar of the Workstreams -dashboard page. All six metrics are computable client-side from data -already fetched by `workstreams.md`; no API or schema changes required. - -## Context - -The WHI formula and metric definitions are specified in -`state-hub/dashboard/src/docs/workstream-kpi.md`. This workplan covers -only the implementation of that spec as running dashboard code. - -The six base metrics: -- **DD** — Dependency Density: edge count / open workstream count -- **BR** — Blocked Ratio: blocked workstreams / open count -- **SPR** — Single Point of Risk: max inbound edges / open count -- **PEP** — Progression Enablement Proportion: ready-to-start workstreams -- **CDDR** — Cross-Domain Dependency Ratio: cross-domain edges / total edges -- **CPI** — Cycle Penalty Indicator: 1 if any cycle detected, 0 otherwise - -WHI formula: `0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR)` -CPI penalty: `WHI = WHI * 0.5` if CPI=1. 
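The aggregation above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the dashboard's actual code: the function name `aggregateWHI` and the constant name `DD_CRITICAL` are hypothetical, the value 1.0 is taken from the normalisation note in task P2.4 of this workplan, and all inputs other than `dd` are assumed to be ratios already in [0, 1].

```javascript
// Hypothetical sketch of the WHI aggregation; names are illustrative only.
// DD_CRITICAL = 1.0 follows the normalisation note in task P2.4.
const DD_CRITICAL = 1.0;

function aggregateWHI({ dd, br, spr, pep, cddr, cpi }) {
  // Dependency density is unbounded, so cap its normalised value at 1.
  const ddNorm = Math.min(1, dd / DD_CRITICAL);

  // Weighted sum: "bad" ratios are inverted, PEP counts positively.
  let whi =
    0.30 * (1 - ddNorm) +
    0.25 * (1 - br) +
    0.15 * (1 - spr) +
    0.20 * pep +
    0.10 * (1 - cddr);

  // Any detected cycle halves the score.
  if (cpi === 1) whi *= 0.5;

  // Clamp to [0, 1] to absorb floating-point drift.
  return Math.min(1, Math.max(0, whi));
}

// Worked example: 0.15 + 0.20 + 0.1125 + 0.10 + 0.10 = 0.6625
const example = aggregateWHI({ dd: 0.5, br: 0.2, spr: 0.25, pep: 0.5, cddr: 0, cpi: 0 });
console.log(example.toFixed(4));
```

With the same inputs and `cpi: 1`, the result halves to 0.33125, which shows how strongly a single cycle dominates the other terms.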
- -## Tasks - -### P1 — Verify dependency edge fields in open_workstreams - -```task -id: CUST-WP-0003-T01 -state_hub_task_id: 243646e0-b77a-41e7-ac51-82c5828e63d2 -status: todo -priority: high -``` - -Confirm that `summary.open_workstreams[].depends_on[]` and `blocks[]` -each carry `workstream_id`, `workstream_slug`, and `workstream_title`. -Verify these fields are sufficient to build a complete directed dependency -graph client-side without additional API calls. (Already verified during -workplan design — open_workstreams is the confirmed data source.) - -### P2.1 — Build directed dependency graph from openWs + completedIds - -```task -id: CUST-WP-0003-T02 -state_hub_task_id: 6dbef71f-d2d7-44ee-abb8-279dbaeec505 -status: todo -priority: high -``` - -In `workstreams.md`: derive `completedIds = new Set` of IDs of workstreams -with status completed. Build an adjacency list: for each entry in openWs, -map workstream id → array of `depends_on[].workstream_id`. Build reverse -map (prerequisite id → list of dependent ids) for SPR computation. Also -build `idToDomain` map from `data[]` for CDDR. - -### P2.2 — Implement DFS cycle detection (CPI) - -```task -id: CUST-WP-0003-T03 -state_hub_task_id: f0d5c107-6029-4ad0-af00-645d35ce7db0 -status: todo -priority: high -``` - -Implement a DFS-based topological sort over the dependency adjacency list. -Detect back edges using visited / inStack colour sets. Return `CPI = 1` -if any cycle found, `CPI = 0` otherwise. Only nodes in openWs participate -(completed/archived workstreams excluded). Edge case: isolated nodes (no -deps, no dependents) are valid and never form cycles. 
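The cycle check described in P2.2 could look roughly like the sketch below. It assumes the adjacency list is a `Map` from workstream id to an array of prerequisite ids, as P2.1 describes; the function name `detectCycle` and the three-colour encoding are illustrative choices, not confirmed implementation details. Edges pointing at ids outside the open set are skipped, matching the rule that completed and archived workstreams do not participate.

```javascript
// Hypothetical sketch: DFS cycle detection over an adjacency Map
// (workstream id -> array of prerequisite ids). Returns CPI as 1 or 0.
function detectCycle(adjacency) {
  const WHITE = 0; // not yet visited
  const GREY = 1;  // on the current DFS stack
  const BLACK = 2; // fully explored

  const colour = new Map();
  for (const id of adjacency.keys()) colour.set(id, WHITE);

  const visit = (id) => {
    colour.set(id, GREY);
    for (const dep of adjacency.get(id) ?? []) {
      // Skip prerequisites that are not open workstreams.
      if (!colour.has(dep)) continue;
      const c = colour.get(dep);
      // A grey neighbour is a back edge, so a cycle exists.
      if (c === GREY) return true;
      if (c === WHITE && visit(dep)) return true;
    }
    colour.set(id, BLACK);
    return false;
  };

  for (const id of adjacency.keys()) {
    if (colour.get(id) === WHITE && visit(id)) return 1;
  }
  return 0;
}

// Usage: b depends on a, a depends on b.
const cyclic = new Map([["a", ["b"]], ["b", ["a"]]]);
console.log(detectCycle(cyclic));
```

The recursion depth equals the longest dependency chain, which should be fine for a dashboard-sized graph; an explicit stack would be the safer choice if the graph could grow to thousands of nodes.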
- -### P2.3 — Compute DD, BR, SPR, PEP, CDDR - -```task -id: CUST-WP-0003-T04 -state_hub_task_id: 6da60567-cc46-4a32-9855-b07bafe2faeb -status: todo -priority: high -``` - -Using the graph from P2.1: -- `DD`: totalEdges / openCount, where totalEdges = openWs.flatMap(w=>w.depends_on).length -- `BR`: openWs.filter(w=>w.status==="blocked").length / openCount -- `SPR`: max inbound-edge count across prerequisite workstreams in openWs / openCount -- `PEP`: openWs.filter(w=>active && all depends_on are in completedIds).length / openCount -- `CDDR`: crossDomainEdges / totalEdges (edge with different domain endpoints); 0 if no edges - -### P2.4 — WHI formula: normalisation + CPI penalty - -```task -id: CUST-WP-0003-T05 -state_hub_task_id: 29b2dbbd-5d60-49b6-ae84-3dbf22167df7 -status: todo -priority: high -``` - -Implement the weighted aggregation: -``` -DDnorm = min(1, DD / 1.0) // DD_critical = 1.0 -WHI = 0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR) -if CPI === 1: WHI = WHI * 0.5 -``` -Clamp to [0, 1]. Return `{whi, dd, ddNorm, br, spr, pep, cddr, cpi, openCount, edgeCount}`. -Factor into `computeWHI(nodes, edges, idToDomain)` for reuse in per-domain scope. - -### P2.5 — Per-domain WHI breakdown - -```task -id: CUST-WP-0003-T06 -state_hub_task_id: 8ce5ef74-5eb8-4259-9b11-dde13bf84a89 -status: todo -priority: medium -``` - -For each domain present in openWs, compute a domain-scoped WHI: -- `domainNodes = openWs.filter(w => idToDomain[w.id] === domain)` -- `domainEdges = domainNodes.flatMap(w => w.depends_on.filter(d => idToDomain[d.workstream_id] === domain))` -- `result = computeWHI(domainNodes, domainEdges, idToDomain)` - -Store as `[{domain, whi, br, pep, cpi, openCount}]`. Skip domains with -`openCount === 0`. 
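The five base metrics from P2.3 can be derived in a single pass over the open workstreams. The sketch below is an assumption-laden illustration: `computeBaseMetrics` is a hypothetical name, each entry is assumed to expose `id`, `status`, and `depends_on[].workstream_id`, `idToDomain` is treated as a plain object, and the bare `active` in the PEP definition is read as `status === "active"`. Returning `null` for an empty open set is also a choice made here, so that the caller can render the empty state.

```javascript
// Hypothetical sketch of the P2.3 metrics. Field names and the
// "active" status check are assumptions, not confirmed by the workplan.
function computeBaseMetrics(openWs, completedIds, idToDomain) {
  const openCount = openWs.length;
  if (openCount === 0) return null; // caller is expected to show the empty state

  const openIds = new Set(openWs.map((w) => w.id));
  const inbound = new Map(); // prerequisite id -> number of dependents
  let edgeCount = 0;
  let crossDomainEdges = 0;

  for (const w of openWs) {
    for (const d of w.depends_on ?? []) {
      edgeCount += 1;
      if (idToDomain[w.id] !== idToDomain[d.workstream_id]) crossDomainEdges += 1;
      // SPR only considers prerequisites that are themselves open.
      if (openIds.has(d.workstream_id)) {
        inbound.set(d.workstream_id, (inbound.get(d.workstream_id) ?? 0) + 1);
      }
    }
  }

  const blocked = openWs.filter((w) => w.status === "blocked").length;
  const ready = openWs.filter(
    (w) =>
      w.status === "active" &&
      (w.depends_on ?? []).every((d) => completedIds.has(d.workstream_id))
  ).length;

  return {
    dd: edgeCount / openCount,
    br: blocked / openCount,
    spr: Math.max(0, ...inbound.values()) / openCount,
    pep: ready / openCount,
    cddr: edgeCount === 0 ? 0 : crossDomainEdges / edgeCount,
    openCount,
    edgeCount,
  };
}

// Usage with four open workstreams and one completed prerequisite "c".
const sampleWs = [
  { id: "a", status: "active", depends_on: [{ workstream_id: "c" }] },
  { id: "b", status: "blocked", depends_on: [{ workstream_id: "a" }] },
  { id: "d", status: "active", depends_on: [{ workstream_id: "a" }, { workstream_id: "c" }] },
  { id: "e", status: "active", depends_on: [] },
];
const sampleDomains = { a: "x", b: "x", c: "x", d: "y", e: "y" };
console.log(computeBaseMetrics(sampleWs, new Set(["c"]), sampleDomains));
```

Note that a workstream with no prerequisites counts as ready under this reading, because `every` is vacuously true on an empty list.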
- -### P3 — WHI KPI card UI - -```task -id: CUST-WP-0003-T07 -state_hub_task_id: 91efba5c-3be2-4bfe-b5ef-1b261e9423f2 -status: todo -priority: high -``` - -Build the `_whiBox` element in `workstreams.md` (mirrors `_kpiBox` in -`decisions.md`): -- Card title: "Workstream Health" -- Main WHI value with health state label: GREEN ≥ 0.75 / ORANGE ≥ 0.50 / RED < 0.50 -- Sub-metric rows for DD, BR, SPR, PEP, CDDR with individual warning colours -- Cycle alert row (red ⚠) when CPI=1 -- Domain breakdown: compact rows with domain name + coloured score -- Empty state if openCount=0 or no edges - -Inject via `injectTocTop("whi-kpi-box", _whiBox)`. Wire -`withDocHelp(_whiBox, "/docs/workstream-health-index")`. - -### P4.1 — Create src/docs/workstream-health-index.md - -```task -id: CUST-WP-0003-T08 -state_hub_task_id: 4c898472-e4ae-49a2-b6cd-7aa1a3c7604a -status: todo -priority: medium -``` - -Reference documentation for the WHI KPI card. Cover: purpose, all six -metrics (formula + interpretation), WHI aggregation formula with CPI -penalty, DD normalisation, health state thresholds, domain breakdown, -cycle detection, and how to improve a poor score. Update -`workstream-kpi.md` to link to this doc. - -### P4.2 — Wire withDocHelp and add to Reference nav - -```task -id: CUST-WP-0003-T09 -state_hub_task_id: 20976663-7ac9-4909-8029-a479190f52ff -status: todo -priority: low -``` - -Confirm `withDocHelp(_whiBox, "/docs/workstream-health-index")` is wired -(from P3). Add `{ name: "Workstream Health", path: "/docs/workstream-health-index" }` -to the Reference pages array in `observablehq.config.js`. Verify -Reference nav renders correctly in `npm run dev`. 
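The health-state thresholds listed for the P3 card can be expressed as a small classifier. This is only a sketch: the function name `healthState` is hypothetical, and the boundary handling simply follows the stated rule (GREEN at or above 0.75, ORANGE at or above 0.50, RED below 0.50).

```javascript
// Hypothetical helper mapping a WHI score to the card's health label.
function healthState(whi) {
  if (whi >= 0.75) return "GREEN";
  if (whi >= 0.5) return "ORANGE";
  return "RED";
}

console.log(healthState(0.6625));
```

Keeping the thresholds in one function would let the main score and the per-domain rows share the same colouring rule.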
diff --git a/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md b/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md deleted file mode 100644 index 19767eb..0000000 --- a/workplans/CUST-WP-0011-state-hub-threephoenix-migration.md +++ /dev/null @@ -1,366 +0,0 @@ ---- -id: CUST-WP-0011 -type: workplan -title: "Pragmatic State Hub Migration to railiance01" -domain: custodian -repo: the-custodian -status: active -owner: custodian -topic_slug: custodian -created: "2026-03-11" -updated: "2026-05-15" -state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940" -supersedes_intent_from: "Migrate Custodian State Hub to ThreePhoenix Cluster" -follow_up_workplan: CUST-WP-0038 ---- - -# Pragmatic State Hub Migration to railiance01 - -## Goal - -Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator -workstation to the current railiance01 Kubernetes environment, using the -Railiance production-readiness path that exists now: - -- CloudNative PG (`cnpg`) for the State Hub database in the `databases` - namespace. -- State Hub as an S5 workload in `railiance-apps`. -- Platform/database ownership in `railiance-platform`. -- Access through the existing private tunnel/ops-bridge model, not public - exposure. -- WSL2 retained as a disaster-recovery fallback until the cluster deployment - has proven stable. - -This is a deliberate pragmatic step. It improves durability and multi-machine -access before the full ThreePhoenix target is ready. The ultimate multi-node, -replicated, long-term cluster goal is preserved in `CUST-WP-0038`. - -## Context Update - -The original 2026-03-11 version of this workplan targeted a future -ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before -starting. That was correct as an end-state, but it blocks useful progress now. 
- -The current Railiance architecture has moved on: - -- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md` - supersedes the older Bitnami PostgreSQL HA platform baseline. -- CloudNative PG is the deployed database operator. -- `RAIL-HO-WP-0004-T09` is the Railiance-side task for deploying State Hub to - the cluster, and it still requires human decisions before live data - migration. - -This workplan is now the Custodian-side coordination and safety plan for that -T09 effort. - -## Safety Contract - -State Hub is irreplaceable episodic memory. This migration may prepare, deploy, -test, and compare as much as needed, but it must not make the cluster the only -source of truth until the explicit cutover gate is satisfied. - -Rules: - -- A fresh WSL2 backup and restore drill is mandatory before data migration. -- The WSL2 State Hub remains available as rollback until stabilisation passes. -- Any task that changes the live writer endpoint requires explicit human - approval. -- A failed cluster deploy must leave the WSL2 instance untouched and usable. -- Row counts and key API checks must match before cutover. - -## Target Architecture After This Workplan - -``` -Operator workstation / COULOMBCORE / other agent hosts - -> local MCP server subprocess - -> http://127.0.0.1:8000 or configured API_BASE - -> private tunnel / ops-bridge - -> railiance01 k3s - -> state-hub Service - -> FastAPI Deployment - -> state-hub-db CloudNative PG Cluster -``` - -Key properties: - -- Single-node pragmatic deployment on railiance01. -- No public unauthenticated exposure. -- Database managed by cnpg, not an ad-hoc Postgres StatefulSet. -- WSL2 retained as DR fallback during stabilisation. -- Future multi-node HA and storage replication are deferred to `CUST-WP-0038`. - -## Open Human Decisions - -Resolve these before T04/T05 can become live migration work: - -1. Final State Hub hostname or tunnel-only endpoint. -2. 
Container registry choice: Gitea registry vs external interim registry. -3. Exposure model: ClusterIP plus tunnel, private ingress, or both. -4. Approval window for freezing WSL2 writes and migrating the production DB. -5. Stabilisation duration before WSL2 can be considered non-primary fallback. - -## Tasks - -### T01 — Drill WSL2 State Hub backup restore - -```task -id: T01 -status: done -priority: high -state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf" -completed: "2026-05-02" -``` - -Take a fresh State Hub backup from the current WSL2 instance and restore it -into an isolated test PostgreSQL instance. - -Minimum checks: - -- Restore completes without errors. -- Core table row counts match the live WSL2 database. -- `/state/summary` can be served from the restored copy if wired to a test API. -- Drill result is recorded in State Hub progress and, if useful, episodic - memory. - -**Done when:** backup and restore are proven within 24 hours of live migration -work. - -Result: completed 2026-05-02. A fresh dump from `infra-postgres-1` restored -into disposable container `state-hub-restore-test` on `127.0.0.1:5433`. -Application health and summary checks against the restored database returned -HTTP 200. Restored row counts matched production exactly, including 117 -workstreams, 989 tasks, 1423 progress events, and 208 token events. - ---- - -### T02 — Align with Railiance deployment plan - -```task -id: T02 -status: done -priority: high -state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c" -completed: "2026-05-02" -``` - -Update the cross-repo plan so this Custodian workplan and -`RAIL-HO-WP-0004-T09` point to the same architecture. - -Expected outputs: - -- `RAIL-HO-WP-0004-T09` remains the Railiance-side execution task. -- This workplan remains the Custodian-side safety/cutover task list. -- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the - near-term migration plan. -- The future HA goal is referenced through `CUST-WP-0038`. 
- -**Done when:** both workplans describe compatible responsibilities and gates. - -Result: completed 2026-05-02. `RAIL-HO-WP-0004-T09` now names the same -pragmatic railiance01 path: cnpg database, S5 State Hub workload, restore drill -precondition, empty deploy before data copy, explicit human approval before -freezing WSL2 writes, and WSL2 retained as fallback. Full ThreePhoenix HA stays -deferred to `CUST-WP-0038`. - ---- - -### T03 — Build and publish State Hub container image - -```task -id: T03 -status: in_progress -priority: high -state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a" -``` - -Package `state-hub/` as a production image. - -Requirements: - -- Dockerfile builds from the current Python/uv project. -- Alembic and runtime dependencies are available inside the image. -- Image exposes the FastAPI service on port 8000. -- Image tag is pushed to the chosen registry. -- Build provenance is documented in the commit/workplan. - -**Done when:** railiance01 can pull the image and a dry-run deployment resolves -it. - -Progress 2026-05-03: added `state-hub/Dockerfile`, -`state-hub/.dockerignore`, and `state-hub/docs/container-image.md`. Built -local image `state-hub:local` successfully: -`sha256:e96dbd1e7d2b63e4fb17584c8c2216088a2c9937bfe880c2ad565c7a9f51c0fc` -(~106 MB). Verified container `/state/health` returns HTTP 200 against the -current database when run locally with host networking. Verified Alembic is -available in-image and reports current revision `r5m6n7o8p9q0 (head)`. - -Progress 2026-05-03: registry target decision resolved to the self-hosted -Gitea registry. A local SSH tunnel to the NodePort can reach Gitea, but Docker -login/push still receives HTTP 404 from `/v2/`. Runtime inspection shows the -live Gitea `app.ini` has no `[packages]` section, so package registry -enablement/routing must be applied before publishing `state-hub:local`. 
- -Progress 2026-05-15: rebuilt the image from current `state-hub/` sources as -`state-hub:local` with digest -`sha256:039d29654ccb3754c6ecdbe497c6364bbd8452edcdcb7fa937dd9debf5b734ff` -(106004480 bytes). Verified `/state/health` returns -`{"status":"ok","db":"connected"}` from a temporary container on host port -18000 and confirmed in-image Alembic reports `t7o8p9q0r1s2 (head)`. Build -provenance is recorded in `state-hub/docs/container-image.md`. - -Remaining: enable the Gitea package/container registry, then tag, push, and -pull the image from railiance01. - ---- - -### T04 — Define State Hub database and app manifests - -```task -id: T04 -status: todo -priority: high -state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844" -``` - -Create the cluster-side deployment assets using current Railiance boundaries: - -- `railiance-platform`: `state-hub-db` cnpg cluster and database credentials. -- `railiance-apps`: State Hub Deployment, Service, ConfigMap, Secret/External - Secret reference, and optional private Ingress. -- Health probes use `GET /state/health`. -- Environment includes `DATABASE_URL` and any required API settings. - -**Done when:** manifests lint/apply in a non-destructive dry run and ownership -boundaries are documented. - ---- - -### T05 — Deploy empty State Hub and run migrations on railiance01 - -```task -id: T05 -status: todo -priority: high -state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1" -``` - -Deploy State Hub against an empty `state-hub-db` cnpg database and run Alembic -migrations in the cluster environment. - -Checks: - -- Pod reaches Ready. -- `/state/health` returns healthy through the intended private access path. -- Alembic reports head. -- Logs show no repeated crash/restart loop. - -**Done when:** an empty but structurally valid State Hub runs on railiance01. 
- ---- - -### T06 — Restore WSL2 data copy into cluster and compare - -```task -id: T06 -status: todo -priority: high -state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060" -``` - -Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live -source of truth. - -Required comparison: - -- Table row counts match. -- Representative workstreams, tasks, decisions, progress events, repos, and - token events are queryable. -- Dashboard and MCP summary calls return expected data through the cluster API. -- Any mismatch is investigated before proceeding. - -**Done when:** cluster data is a verified copy of WSL2, but not yet the only -writer. - ---- - -### T07 — Cut over private access to cluster State Hub - -```task -id: T07 -status: todo -priority: medium -state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e" -needs_human: true -intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint." -``` - -With human approval, freeze WSL2 writes, take a final dump, restore it to the -cluster, compare counts again, and redirect the active private access path to -the cluster API. - -Accepted approaches: - -- Keep local MCP config pointed at `http://127.0.0.1:8000` and move that port - to an ops-bridge/SSH tunnel. -- Or set the MCP server `API_BASE` to the chosen private cluster endpoint. - -**Done when:** `get_state_summary()` and dashboard live data are served by the -cluster State Hub, and WSL2 is no longer receiving normal writes. - ---- - -### T08 — Stabilise with WSL2 retained as fallback - -```task -id: T08 -status: todo -priority: medium -state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2" -``` - -Run the cluster State Hub as primary while keeping the WSL2 instance available -as a fallback. - -Monitor: - -- State Hub pod restarts. -- cnpg cluster health. -- Backup job success. -- Dashboard and MCP behavior from each operator machine. 
-- Consistency sync behavior for file-backed workplans. - -**Done when:** the agreed stabilisation window passes without data loss or -unresolved operational defects. - ---- - -### T09 — Document operating model and defer final WSL2 retirement - -```task -id: T09 -status: todo -priority: low -state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681" -``` - -Document the new operating model: - -- How agents reach State Hub. -- How backups and restores work. -- How to roll back to WSL2 if needed. -- Which parts remain pragmatic/single-node. -- Which long-term requirements moved to `CUST-WP-0038`. - -Do not permanently retire WSL2 in this workplan unless a separate human -decision is recorded. Retirement belongs after proven stability or in the -future HA workplan. - -**Done when:** runbooks and project instructions match the deployed reality. - -## References - -- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md` -- `RAIL-HO-WP-0004-T09` — Railiance-side State Hub deployment task -- `CUST-WP-0038` — future full ThreePhoenix HA State Hub migration -- Constitution constraint: production data migration and fallback retirement - require explicit human approval diff --git a/workplans/CUST-WP-0012-multi-user-onboarding.md b/workplans/CUST-WP-0012-multi-user-onboarding.md deleted file mode 100644 index beb982d..0000000 --- a/workplans/CUST-WP-0012-multi-user-onboarding.md +++ /dev/null @@ -1,246 +0,0 @@ ---- -id: CUST-WP-0012 -type: workplan -title: "Multi-User Onboarding and Environment Bootstrap" -domain: custodian -repo: the-custodian -status: active -owner: custodian -topic_slug: custodian -state_hub_workstream_id: "a28d9e29-4119-4b73-9469-f921920253ef" -created: "2026-03-11" -updated: "2026-03-11" ---- - -# Multi-User Onboarding and Environment Bootstrap - -## Goal - -Make the Custodian system accessible to collaborators beyond the primary -operator. 
A new user (or a new machine for the existing operator) should -be able to go from zero to a productive Claude Code session with full -State Hub MCP connectivity in a single session, without manual steps or -undocumented tribal knowledge. - -## Context - -Several friction points surfaced during the 2026-03-11 session: - -- No SSH key for Railiance01 on WSL2 → blocked `make tunnel-loop` -- No `~/.railiance_gitea.conf` → blocked repo creation script -- Token missing `read:user` scope → blocked org repo creation -- No `git credential.helper` → credentials required on every push -- MCP registration is manual and documented only in `CLAUDE.md` - -Each of these is a solved problem in isolation. This workstream collects -them into a repeatable, documented bootstrap experience. - -## Scope - -Two personas: - -| Persona | Access level | Typical machine | -|---------|-------------|-----------------| -| Primary operator | Full access, all domains | WSL2 workstation | -| Domain collaborator | Read + write to one domain | COULOMBCORE, remote laptop | - -## Tasks - -### T01 — Git credential.helper for Gitea access - -```task -id: CUST-WP-0012-T01 -state_hub_task_id: 71628269-9a75-4dae-a347-e64a86040322 -status: todo -priority: medium -``` - -Document and automate `git credential.helper` setup for Gitea -(`http://92.205.130.254:32166`). Recommend `libsecret` (keyring-backed) -on machines that support it; fall back to `credential.helper=store` -(persistent, plaintext `~/.git-credentials`) on headless servers. - -Include in bootstrap script (T04) and onboarding guide (T05). 
- -```bash -# Preferred: libsecret (GNOME keyring, WSL2 with keyring daemon) -sudo apt-get install -y libsecret-1-0 libsecret-1-dev -sudo make -C /usr/share/doc/git/contrib/credential/libsecret -git config --global credential.helper \ - /usr/share/doc/git/contrib/credential/libsecret/git-credential-libsecret - -# Fallback: store (plaintext, suitable for headless servers) -git config --global credential.helper store - -# Headless server alternative: cache (in-memory, 1h timeout) -git config --global credential.helper 'cache --timeout=3600' -``` - -**Done when:** included in bootstrap script; push to Gitea works without -re-entering credentials on second attempt. - ---- - -### T02 — SSH key generation and authorization automation - -```task -id: CUST-WP-0012-T02 -state_hub_task_id: fea965e9-8a8f-439c-9096-8f7756eb71ed -status: todo -priority: medium -``` - -Script or Ansible task that: -1. Generates an `ed25519` key pair on the new machine if none exists -2. Displays the public key with copy instructions -3. Authorizes it on all managed hosts (Railiance01, COULOMBCORE) via - `ssh-copy-id` or Ansible `authorized_key` module - -Surfaced by: RAIL-PL-WP-0001 T01 — no SSH key on WSL2 blocked -`make tunnel-loop HOST=tegwick@92.205.62.239`. - -```bash -# Generate if missing -[[ -f ~/.ssh/id_ed25519 ]] || ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" - -# Authorize on a target host (requires existing access once) -ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.62.239 -ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.130.254 -``` - -**Done when:** included in bootstrap script; documented in onboarding guide. - ---- - -### T03 — Claude Code MCP registration automation - -```task -id: CUST-WP-0012-T03 -state_hub_task_id: 60318e9a-972e-45c8-afde-82ed0625f594 -status: todo -priority: medium -``` - -Automate the state-hub MCP server registration on a new machine. -Currently this is a multi-step manual process documented in -`~/.claude/CLAUDE.md`. 
It should be a single `make` target or script: - -```bash -# In the-custodian/state-hub/ -make register-mcp # idempotent; safe to re-run -``` - -The script should: -1. Detect whether `state-hub` is already in `~/.claude.json` -2. Extract the server config from `.mcp.json` -3. Run `claude mcp add-json -s user state-hub ` -4. Run `patch_mcp_cwd.py` to restore the cwd field -5. Print instructions to restart Claude Code - -Should also detect whether the state hub is reachable directly -(`http://127.0.0.1:8000`) or needs a tunnel (via ops-bridge), and emit -a warning if neither is available. - -**Done when:** `make register-mcp` works on a clean machine; documented -in onboarding guide. - ---- - -### T04 — Environment bootstrap script - -```task -id: CUST-WP-0012-T04 -state_hub_task_id: 84a94761-e424-4470-a9a2-64d9cabadb7f -status: todo -priority: high -``` - -Single idempotent script: `state-hub/scripts/bootstrap-env.sh` - -Checks/installs prerequisites and configures the environment: - -| Step | What | -|------|------| -| Prerequisites | git, sops, age, helm, kubectl, uv, claude CLI | -| Git credential | `credential.helper` (libsecret or store) | -| SSH key | Generate ed25519 if missing; display public key | -| MCP registration | `make register-mcp` (T03) | -| Gitea config | Prompt for token; write `~/.railiance_gitea.conf` | -| Health check | `curl /state/health`; warn if tunnel needed | - -Design constraints: -- Idempotent: safe to run on an already-configured machine -- No silent failures: each step prints ✓ / ✗ / ⚠ -- Minimal dependencies: bash + curl only to get started - -**Done when:** running the script on a clean Ubuntu 24.04 machine -produces a working Custodian environment with no additional manual steps. 
- ---- - -### T05 — Onboarding guide and user journey documentation - -```task -id: CUST-WP-0012-T05 -state_hub_task_id: b0839802-659a-475b-8b84-ab7341ea3d15 -status: todo -priority: medium -``` - -Write `docs/onboarding.md` in the-custodian covering the full journey -for both personas: - -**Primary operator (new machine):** -1. Prerequisites (git, SSH client) -2. Clone `the-custodian` -3. Run `make bootstrap-env` (T04) -4. Restart Claude Code → verify MCP is active -5. First session: `get_state_summary()` → orient → work - -**Domain collaborator (new person):** -1. Prerequisites + Gitea account -2. `ssh-copy-id` to get access to Railiance01 (or just COULOMBCORE) -3. Set up ops-bridge tunnel to reach state hub -4. Clone domain repo -5. First Claude Code session with MCP via tunnel -6. Contributing a workplan (ADR-001 convention) - -**Done when:** a new collaborator can follow the guide without -clarification from the primary operator. - ---- - -### T06 — State Hub multi-user model (deferred) - -```task -id: CUST-WP-0012-T06 -state_hub_task_id: d5df3302-67b9-4765-a8d8-ea2df53dff6e -status: todo -priority: low -``` - -Design a lightweight user/role model for the state hub: - -| Role | Permissions | -|------|-------------| -| Primary operator | Full read/write, all domains | -| Domain collaborator | Read all; write to own domain only | -| Observer | Read-only | - -Decision needed: enforce at API layer (HTTP Basic / token auth per -domain) or rely on Gitea repo permissions as the authoritative boundary -(simpler — the hub is a read model anyway). - -**Deferred until:** first external collaborator is actively onboarding. -Implement T01–T05 first; multi-user access control is only needed when -there is more than one user. 
- ---- - -## References - -- ops-bridge repo: `ops-bridge` (tunnel lifecycle management) -- MCP registration: `~/.claude/CLAUDE.md` (current manual procedure) -- Gitea repo creation: `railiance-cluster/tools/create_railiance_repo.sh` -- ADR-001: workplans as repo artefacts -- Surfaced by: RAIL-PL-WP-0001 T01 execution, 2026-03-11 diff --git a/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md b/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md deleted file mode 100644 index 2eef1d6..0000000 --- a/workplans/CUST-WP-0038-state-hub-threephoenix-ha.md +++ /dev/null @@ -1,246 +0,0 @@ ---- -id: CUST-WP-0038 -type: workplan -title: "State Hub Full ThreePhoenix HA Migration" -domain: custodian -repo: the-custodian -status: active -owner: custodian -topic_slug: custodian -created: "2026-05-02" -updated: "2026-05-02" -depends_on: CUST-WP-0011 -state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85" ---- - -# State Hub Full ThreePhoenix HA Migration - -## Goal - -Preserve the original long-term State Hub infrastructure goal while -`CUST-WP-0011` takes the pragmatic railiance01 path. - -This workplan completes the migration from a useful single-node cluster-hosted -State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes, -replicated storage, tested failover, tested restore, and retirement of the WSL2 -fallback only after operational confidence is earned. - -## Why This Exists - -The near-term State Hub migration should not wait for every HA precondition, -because the workstation-hosted State Hub is already a bottleneck for -multi-machine work. - -But the original requirement remains valid: - -- State Hub is irreplaceable episodic memory. -- A single node is not a final home. -- Backup and restore must be drilled, not assumed. -- Long-term operations must survive node loss and operator-machine loss. - -`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan -keeps the ultimate target visible and reviewable. 
- -## Entry Criteria - -- `CUST-WP-0011` completed or explicitly superseded. -- Cluster-hosted State Hub has passed its stabilisation period. -- railiance01 is not the only planned durable node. -- Railiance architecture decision for storage replication is current: - Longhorn, cnpg replication, external backup, or a documented replacement. -- Backup and restore tooling has an owner and runbook. - -## Target Properties - -- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03. -- State Hub database survives loss of one node. -- State Hub API recovers from pod loss without manual repair. -- Backups are encrypted, off-node, and restorable into a test namespace. -- Agent access remains private. -- WSL2 is no longer needed as the primary disaster-recovery fallback. - -## Tasks - -### T01 — Confirm ThreePhoenix cluster readiness - -```task -id: T01 -status: todo -priority: high -state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110" -``` - -Verify the target cluster state: - -- Three nodes are joined and Ready. -- Control-plane and worker roles are documented. -- Cluster version and node resources are recorded. -- Smoke tests pass from the operator machine and from CoulombCore. - -**Done when:** a current readiness report exists and no node is marked -NotReady or operationally unknown. - ---- - -### T02 — Establish replicated storage/database strategy - -```task -id: T02 -status: todo -priority: high -state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140" -``` - -Choose and document the durable data strategy for State Hub: - -- cnpg multi-instance PostgreSQL cluster, and/or -- Longhorn-backed storage with suitable replication, and/or -- another explicitly approved architecture. - -The decision must define RPO, RTO, failover behavior, and restore procedure. - -**Done when:** the selected architecture is documented and approved before any -production data movement. 
- ---- - -### T03 — Implement HA State Hub database - -```task -id: T03 -status: todo -priority: high -state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6" -``` - -Apply the chosen database/storage architecture to State Hub. - -Requirements: - -- Database credentials remain SOPS/secret-managed. -- The database has automated backup configured. -- The database exposes a stable service endpoint for the API. -- Health and replication status are observable. - -**Done when:** State Hub can run against the HA database in a test or staging -namespace. - ---- - -### T04 — Add State Hub API high-availability behavior - -```task -id: T04 -status: todo -priority: medium -state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24" -``` - -Run State Hub API with the right availability posture for its workload: - -- At least one replica, optionally more if DB/session behavior permits. -- Readiness and liveness probes. -- Rolling update behavior documented. -- Resource requests/limits set. - -**Done when:** killing an API pod does not require manual recovery. - ---- - -### T05 — Drill database failover - -```task -id: T05 -status: todo -priority: high -state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86" -``` - -Perform a controlled failover drill for the State Hub database. - -Checks: - -- Failure trigger is documented. -- API behavior during failover is observed. -- Recovery time is measured. -- No data loss is detected after recovery. - -**Done when:** the failover drill passes and results are logged. - ---- - -### T06 — Drill backup restore to isolated namespace - -```task -id: T06 -status: todo -priority: high -state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74" -``` - -Restore the latest encrypted State Hub backup into an isolated namespace or -separate test database. - -Checks: - -- Backup can be decrypted with the documented key path. -- Restore completes from off-node backup material. -- Row counts and representative records match. 
-- Restored API can serve `/state/health` and `/state/summary` when pointed at - the restored database. - -**Done when:** restore drill passes without depending on the live database. - ---- - -### T07 — Update agent access and runbooks for HA endpoint - -```task -id: T07 -status: todo -priority: medium -state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c" -``` - -Update the private access model after the HA endpoint is available: - -- ops-bridge or tunnel target. -- MCP `API_BASE` or local port-forward convention. -- Dashboard access. -- Operator recovery instructions. - -**Done when:** each active operator machine can reach the HA State Hub endpoint -through the documented path. - ---- - -### T08 — Retire WSL2 fallback after explicit approval - -```task -id: T08 -status: todo -priority: low -needs_human: true -intervention_note: "Requires explicit approval after HA failover and restore drills pass." -state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add" -``` - -Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA -cluster path has passed drills. - -Steps: - -1. Take and archive a final WSL2 backup. -2. Stop local WSL2 State Hub services. -3. Update global and repo instructions. -4. Record the retirement decision in State Hub. - -**Done when:** WSL2 is no longer part of the normal or fallback operating -model, and the cluster runbook is the source of truth. - -## References - -- `CUST-WP-0011` — pragmatic railiance01 migration -- Railiance ThreePhoenix infrastructure goal -- State Hub backup/restore runbooks -- Constitution constraint: irreversible retirement requires human approval