Remove migrated State Hub workplans
@@ -1,186 +0,0 @@
---
id: CUST-WP-0003
type: workplan
title: "State Hub v0.4 — Workstream Health Index (WHI) KPI Card"
domain: custodian
status: active
owner: custodian
topic_slug: custodian
state_hub_workstream_id: 9cc32158-2f5c-4ef6-9713-aacce4623d5e
created: "2026-02-26"
updated: "2026-02-28"
---

# State Hub v0.4 — Workstream Health Index (WHI) KPI Card

## Summary

Implement the Workstream Health Index (WHI), a composite structural-health
KPI, as a live card injected into the TOC sidebar of the Workstreams
dashboard page. All six metrics are computable client-side from data
already fetched by `workstreams.md`; no API or schema changes are required.

## Context

The WHI formula and metric definitions are specified in
`state-hub/dashboard/src/docs/workstream-kpi.md`. This workplan covers
only the implementation of that spec as running dashboard code.

The six base metrics:
- **DD** (Dependency Density): edge count / open workstream count
- **BR** (Blocked Ratio): blocked workstreams / open count
- **SPR** (Single Point of Risk): max inbound edges / open count
- **PEP** (Progression Enablement Proportion): ready-to-start workstreams / open count
- **CDDR** (Cross-Domain Dependency Ratio): cross-domain edges / total edges
- **CPI** (Cycle Penalty Indicator): 1 if any cycle detected, 0 otherwise

WHI formula: `0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR)`
CPI penalty: `WHI = WHI * 0.5` if CPI=1.

## Tasks

### P1 — Verify dependency edge fields in open_workstreams

```task
id: CUST-WP-0003-T01
state_hub_task_id: 243646e0-b77a-41e7-ac51-82c5828e63d2
status: todo
priority: high
```

Confirm that `summary.open_workstreams[].depends_on[]` and `blocks[]`
each carry `workstream_id`, `workstream_slug`, and `workstream_title`.
Verify these fields are sufficient to build a complete directed dependency
graph client-side without additional API calls. (Already verified during
workplan design: `open_workstreams` is the confirmed data source.)

### P2.1 — Build directed dependency graph from openWs + completedIds

```task
id: CUST-WP-0003-T02
state_hub_task_id: 6dbef71f-d2d7-44ee-abb8-279dbaeec505
status: todo
priority: high
```

In `workstreams.md`: derive `completedIds`, a `Set` of the IDs of workstreams
with status completed. Build an adjacency list: for each entry in `openWs`,
map workstream id → array of `depends_on[].workstream_id`. Build a reverse
map (prerequisite id → list of dependent ids) for the SPR computation. Also
build an `idToDomain` map from `data[]` for CDDR.

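
The graph construction described in P2.1 could be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual code: the helper name `buildGraph` is hypothetical, and the assumption that each `data[]` entry carries `id`, `status`, and `domain` fields is inferred from the task text rather than confirmed.

```javascript
// Hypothetical helper; field names (id, status, domain, depends_on,
// workstream_id) are assumed from the workplan text.
function buildGraph(openWs, data) {
  // IDs of workstreams whose status is "completed".
  const completedIds = new Set(
    data.filter((w) => w.status === "completed").map((w) => w.id)
  );
  // Plain object so that idToDomain[id] lookups work as written in P2.5.
  const idToDomain = Object.fromEntries(data.map((w) => [w.id, w.domain]));

  const dependsOn = new Map();  // workstream id -> prerequisite ids
  const dependents = new Map(); // prerequisite id -> dependent ids (for SPR)
  for (const ws of openWs) {
    const prereqs = (ws.depends_on ?? []).map((d) => d.workstream_id);
    dependsOn.set(ws.id, prereqs);
    for (const p of prereqs) {
      if (!dependents.has(p)) dependents.set(p, []);
      dependents.get(p).push(ws.id);
    }
  }
  return { completedIds, idToDomain, dependsOn, dependents };
}
```

Returning all four structures from one pass keeps the later metric steps free of repeated traversals of `openWs`.
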
### P2.2 — Implement DFS cycle detection (CPI)

```task
id: CUST-WP-0003-T03
state_hub_task_id: f0d5c107-6029-4ad0-af00-645d35ce7db0
status: todo
priority: high
```

Implement a DFS-based topological sort over the dependency adjacency list.
Detect back edges using visited / inStack colour sets. Return `CPI = 1`
if any cycle is found, `CPI = 0` otherwise. Only nodes in `openWs` participate
(completed/archived workstreams are excluded). Edge case: isolated nodes (no
deps, no dependents) are valid and never form cycles.

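
A minimal sketch of the back-edge check described in P2.2, assuming the adjacency list is a `Map` from workstream id to an array of prerequisite ids. The function name `computeCPI` is hypothetical, and the recursion assumes dependency chains short enough not to exhaust the call stack.

```javascript
// Hypothetical helper: returns 1 if the dependency graph has a cycle, else 0.
// `dependsOn` is assumed to be Map<id, id[]> whose keys are the open workstreams.
function computeCPI(dependsOn) {
  const visited = new Set(); // nodes whose subtree is fully explored
  const inStack = new Set(); // nodes on the current DFS path

  const hasCycleFrom = (id) => {
    if (inStack.has(id)) return true;  // back edge: id is already on the path
    if (visited.has(id)) return false; // already explored, no cycle through it
    inStack.add(id);
    for (const next of dependsOn.get(id) ?? []) {
      // Edges to completed/archived workstreams leave the open graph; skip them.
      if (!dependsOn.has(next)) continue;
      if (hasCycleFrom(next)) return true;
    }
    inStack.delete(id);
    visited.add(id);
    return false;
  };

  for (const id of dependsOn.keys()) {
    if (hasCycleFrom(id)) return 1;
  }
  return 0;
}
```

Isolated nodes pass straight through the loop with no edges to follow, so they never contribute to a cycle.
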
### P2.3 — Compute DD, BR, SPR, PEP, CDDR

```task
id: CUST-WP-0003-T04
state_hub_task_id: 6da60567-cc46-4a32-9855-b07bafe2faeb
status: todo
priority: high
```

Using the graph from P2.1:
- `DD`: totalEdges / openCount, where `totalEdges = openWs.flatMap(w => w.depends_on).length`
- `BR`: `openWs.filter(w => w.status === "blocked").length / openCount`
- `SPR`: max inbound-edge count across prerequisite workstreams in openWs / openCount
- `PEP`: count of openWs entries that are active and whose `depends_on` entries are all in `completedIds`, divided by openCount
- `CDDR`: crossDomainEdges / totalEdges (an edge is cross-domain when its endpoints have different domains); 0 if there are no edges

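
The five ratios listed in P2.3 could be computed along these lines. This is a sketch under assumptions: the name `computeBaseMetrics` is hypothetical, "active" is taken to mean `status === "active"`, an active workstream with no dependencies is counted as ready, and an empty `openWs` returns zeros to avoid division by zero.

```javascript
// Hypothetical helper. idToDomain is assumed to be a plain object (id -> domain)
// and completedIds a Set of ids, as described in P2.1.
function computeBaseMetrics(openWs, completedIds, idToDomain) {
  const openCount = openWs.length;
  if (openCount === 0) {
    return { dd: 0, br: 0, spr: 0, pep: 0, cddr: 0, openCount: 0, edgeCount: 0 };
  }
  // One edge per depends_on entry: dependent -> prerequisite.
  const edges = openWs.flatMap((w) =>
    (w.depends_on ?? []).map((d) => ({ from: w.id, to: d.workstream_id }))
  );
  // Inbound-edge counts, restricted to prerequisites that are still open.
  const openIds = new Set(openWs.map((w) => w.id));
  const inbound = new Map();
  for (const e of edges) {
    if (openIds.has(e.to)) inbound.set(e.to, (inbound.get(e.to) ?? 0) + 1);
  }
  const maxInbound = Math.max(0, ...inbound.values());
  const blocked = openWs.filter((w) => w.status === "blocked").length;
  const ready = openWs.filter(
    (w) =>
      w.status === "active" &&
      (w.depends_on ?? []).every((d) => completedIds.has(d.workstream_id))
  ).length;
  const crossDomain = edges.filter(
    (e) => idToDomain[e.from] !== idToDomain[e.to]
  ).length;
  return {
    dd: edges.length / openCount,
    br: blocked / openCount,
    spr: maxInbound / openCount,
    pep: ready / openCount,
    cddr: edges.length === 0 ? 0 : crossDomain / edges.length,
    openCount,
    edgeCount: edges.length,
  };
}
```
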
### P2.4 — WHI formula: normalisation + CPI penalty

```task
id: CUST-WP-0003-T05
state_hub_task_id: 29b2dbbd-5d60-49b6-ae84-3dbf22167df7
status: todo
priority: high
```

Implement the weighted aggregation:
```
DDnorm = min(1, DD / 1.0)  // DD_critical = 1.0
WHI = 0.30*(1-DDnorm) + 0.25*(1-BR) + 0.15*(1-SPR) + 0.20*PEP + 0.10*(1-CDDR)
if CPI === 1: WHI = WHI * 0.5
```
Clamp to [0, 1]. Return `{whi, dd, ddNorm, br, spr, pep, cddr, cpi, openCount, edgeCount}`.
Factor into `computeWHI(nodes, edges, idToDomain)` for reuse in per-domain scope.

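
As a worked instance of the aggregation above, the following sketch applies the weights, the CPI penalty, and the clamp to already-computed metrics. The name `aggregateWHI` and its argument shape are assumptions for illustration; it covers only the arithmetic step, not the full `computeWHI(nodes, edges, idToDomain)` function named in the task.

```javascript
// DD_critical from the workplan: DD at or above this value normalises to 1.
const DD_CRITICAL = 1.0;

// Hypothetical helper: combines the base metrics into a WHI score in [0, 1].
function aggregateWHI({ dd, br, spr, pep, cddr, cpi }) {
  const ddNorm = Math.min(1, dd / DD_CRITICAL);
  let whi =
    0.30 * (1 - ddNorm) +
    0.25 * (1 - br) +
    0.15 * (1 - spr) +
    0.20 * pep +
    0.10 * (1 - cddr);
  if (cpi === 1) whi *= 0.5;             // cycle penalty halves the score
  whi = Math.min(1, Math.max(0, whi));   // clamp to [0, 1]
  return { whi, ddNorm };
}
```

For example, with DD = 0.5, BR = 0.2, SPR = 0.25, PEP = 0.5, CDDR = 0, and CPI = 0, the terms are 0.15 + 0.20 + 0.1125 + 0.10 + 0.10 = 0.6625; with CPI = 1 the result halves to 0.33125.
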
### P2.5 — Per-domain WHI breakdown

```task
id: CUST-WP-0003-T06
state_hub_task_id: 8ce5ef74-5eb8-4259-9b11-dde13bf84a89
status: todo
priority: medium
```

For each domain present in openWs, compute a domain-scoped WHI:
- `domainNodes = openWs.filter(w => idToDomain[w.id] === domain)`
- `domainEdges = domainNodes.flatMap(w => w.depends_on.filter(d => idToDomain[d.workstream_id] === domain))`
- `result = computeWHI(domainNodes, domainEdges, idToDomain)`

Store as `[{domain, whi, br, pep, cpi, openCount}]`. Skip domains with
`openCount === 0`.

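
The per-domain loop above might look like this. To keep the sketch self-contained, `computeWHI` is passed in as a parameter rather than defined here; the wrapper name `perDomainBreakdown` is hypothetical, and `idToDomain` is assumed to be a plain object.

```javascript
// Hypothetical wrapper around the computeWHI(nodes, edges, idToDomain)
// function named in P2.4, which is injected by the caller.
function perDomainBreakdown(openWs, idToDomain, computeWHI) {
  const domains = [...new Set(openWs.map((w) => idToDomain[w.id]))];
  return domains
    .map((domain) => {
      const domainNodes = openWs.filter((w) => idToDomain[w.id] === domain);
      // Keep only edges whose prerequisite lives in the same domain.
      const domainEdges = domainNodes.flatMap((w) =>
        (w.depends_on ?? []).filter(
          (d) => idToDomain[d.workstream_id] === domain
        )
      );
      const { whi, br, pep, cpi, openCount } = computeWHI(
        domainNodes,
        domainEdges,
        idToDomain
      );
      return { domain, whi, br, pep, cpi, openCount };
    })
    .filter((row) => row.openCount > 0); // skip empty domains
}
```

Because cross-domain edges are filtered out before the call, a domain-scoped score reflects only dependencies internal to that domain.
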
### P3 — WHI KPI card UI

```task
id: CUST-WP-0003-T07
state_hub_task_id: 91efba5c-3be2-4bfe-b5ef-1b261e9423f2
status: todo
priority: high
```

Build the `_whiBox` element in `workstreams.md` (mirrors `_kpiBox` in
`decisions.md`):
- Card title: "Workstream Health"
- Main WHI value with health state label: GREEN ≥ 0.75 / ORANGE ≥ 0.50 / RED < 0.50
- Sub-metric rows for DD, BR, SPR, PEP, CDDR with individual warning colours
- Cycle alert row (red ⚠) when CPI=1
- Domain breakdown: compact rows with domain name + coloured score
- Empty state if openCount=0 or no edges

Inject via `injectTocTop("whi-kpi-box", _whiBox)`. Wire
`withDocHelp(_whiBox, "/docs/workstream-health-index")`.

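
The health state thresholds listed for the card can be expressed as a small pure function. The name `healthState` is hypothetical; only the three labels and the 0.75 / 0.50 cut-offs come from the task description.

```javascript
// Hypothetical helper mapping a WHI score to the label shown on the card.
function healthState(whi) {
  if (whi >= 0.75) return "GREEN";
  if (whi >= 0.5) return "ORANGE";
  return "RED";
}
```

Keeping the mapping separate from the DOM code would let the main value and the per-domain rows share one definition of the thresholds.
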
### P4.1 — Create src/docs/workstream-health-index.md

```task
id: CUST-WP-0003-T08
state_hub_task_id: 4c898472-e4ae-49a2-b6cd-7aa1a3c7604a
status: todo
priority: medium
```

Reference documentation for the WHI KPI card. Cover: purpose, all six
metrics (formula + interpretation), WHI aggregation formula with CPI
penalty, DD normalisation, health state thresholds, domain breakdown,
cycle detection, and how to improve a poor score. Update
`workstream-kpi.md` to link to this doc.

### P4.2 — Wire withDocHelp and add to Reference nav

```task
id: CUST-WP-0003-T09
state_hub_task_id: 20976663-7ac9-4909-8029-a479190f52ff
status: todo
priority: low
```

Confirm `withDocHelp(_whiBox, "/docs/workstream-health-index")` is wired
(from P3). Add `{ name: "Workstream Health", path: "/docs/workstream-health-index" }`
to the Reference pages array in `observablehq.config.js`. Verify
Reference nav renders correctly in `npm run dev`.
@@ -1,366 +0,0 @@
---
id: CUST-WP-0011
type: workplan
title: "Pragmatic State Hub Migration to railiance01"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
created: "2026-03-11"
updated: "2026-05-15"
state_hub_workstream_id: "967baafb-d92d-405a-ba0b-0d00d37c4940"
supersedes_intent_from: "Migrate Custodian State Hub to ThreePhoenix Cluster"
follow_up_workplan: CUST-WP-0038
---

# Pragmatic State Hub Migration to railiance01

## Goal

Move the Custodian State Hub (FastAPI + PostgreSQL) from the WSL2 operator
workstation to the current railiance01 Kubernetes environment, using the
Railiance production-readiness path that exists now:

- CloudNative PG (`cnpg`) for the State Hub database in the `databases`
  namespace.
- State Hub as an S5 workload in `railiance-apps`.
- Platform/database ownership in `railiance-platform`.
- Access through the existing private tunnel/ops-bridge model, not public
  exposure.
- WSL2 retained as a disaster-recovery fallback until the cluster deployment
  has proven stable.

This is a deliberate pragmatic step. It improves durability and multi-machine
access before the full ThreePhoenix target is ready. The ultimate multi-node,
replicated, long-term cluster goal is preserved in `CUST-WP-0038`.

## Context Update

The original 2026-03-11 version of this workplan targeted a future
ThreePhoenix cluster with Railiance01/02/03, Longhorn, and full HA gates before
starting. That was correct as an end-state, but it blocks useful progress now.

The current Railiance architecture has moved on:

- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
  supersedes the older Bitnami PostgreSQL HA platform baseline.
- CloudNative PG is the deployed database operator.
- `RAIL-HO-WP-0004-T09` is the Railiance-side task for deploying State Hub to
  the cluster, and it still requires human decisions before live data
  migration.

This workplan is now the Custodian-side coordination and safety plan for that
T09 effort.

## Safety Contract

State Hub is irreplaceable episodic memory. This migration may prepare, deploy,
test, and compare as much as needed, but it must not make the cluster the only
source of truth until the explicit cutover gate is satisfied.

Rules:

- A fresh WSL2 backup and restore drill is mandatory before data migration.
- The WSL2 State Hub remains available as rollback until stabilisation passes.
- Any task that changes the live writer endpoint requires explicit human
  approval.
- A failed cluster deploy must leave the WSL2 instance untouched and usable.
- Row counts and key API checks must match before cutover.

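
The row-count rule above could be checked with a small comparison helper such as the sketch below. The function name `compareRowCounts` and the input shape (one object of table name to row count per database) are assumptions; how the counts are collected from each PostgreSQL instance is left out.

```javascript
// Hypothetical helper: lists every table whose row count differs between the
// source (WSL2) and target (cluster) databases, including tables that exist
// on only one side (reported with a null count for the missing side).
function compareRowCounts(source, target) {
  const tables = new Set([...Object.keys(source), ...Object.keys(target)]);
  const mismatches = [];
  for (const table of [...tables].sort()) {
    const sourceCount = source[table] ?? null;
    const targetCount = target[table] ?? null;
    if (sourceCount !== targetCount) {
      mismatches.push({ table, source: sourceCount, target: targetCount });
    }
  }
  return mismatches;
}
```

An empty result would be the precondition for proceeding; any entry in the list should be investigated before cutover.
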
## Target Architecture After This Workplan

```
Operator workstation / COULOMBCORE / other agent hosts
  -> local MCP server subprocess
    -> http://127.0.0.1:8000 or configured API_BASE
      -> private tunnel / ops-bridge
        -> railiance01 k3s
          -> state-hub Service
            -> FastAPI Deployment
              -> state-hub-db CloudNative PG Cluster
```

Key properties:

- Single-node pragmatic deployment on railiance01.
- No public unauthenticated exposure.
- Database managed by cnpg, not an ad-hoc Postgres StatefulSet.
- WSL2 retained as DR fallback during stabilisation.
- Future multi-node HA and storage replication are deferred to `CUST-WP-0038`.

## Open Human Decisions

Resolve these before T04/T05 can become live migration work:

1. Final State Hub hostname or tunnel-only endpoint.
2. Container registry choice: Gitea registry vs external interim registry.
3. Exposure model: ClusterIP plus tunnel, private ingress, or both.
4. Approval window for freezing WSL2 writes and migrating the production DB.
5. Stabilisation duration before WSL2 can be considered non-primary fallback.

## Tasks

### T01 — Drill WSL2 State Hub backup restore

```task
id: T01
status: done
priority: high
state_hub_task_id: "b0caf112-dc1d-43a8-9f27-d627dd4aa2bf"
completed: "2026-05-02"
```

Take a fresh State Hub backup from the current WSL2 instance and restore it
into an isolated test PostgreSQL instance.

Minimum checks:

- Restore completes without errors.
- Core table row counts match the live WSL2 database.
- `/state/summary` can be served from the restored copy if wired to a test API.
- Drill result is recorded in State Hub progress and, if useful, episodic
  memory.

**Done when:** backup and restore are proven within 24 hours of live migration
work.

Result: completed 2026-05-02. A fresh dump from `infra-postgres-1` restored
into disposable container `state-hub-restore-test` on `127.0.0.1:5433`.
Application health and summary checks against the restored database returned
HTTP 200. Restored row counts matched production exactly, including 117
workstreams, 989 tasks, 1423 progress events, and 208 token events.

---

### T02 — Align with Railiance deployment plan

```task
id: T02
status: done
priority: high
state_hub_task_id: "24887dd9-7d50-4cc4-add7-bffa1454b80c"
completed: "2026-05-02"
```

Update the cross-repo plan so this Custodian workplan and
`RAIL-HO-WP-0004-T09` point to the same architecture.

Expected outputs:

- `RAIL-HO-WP-0004-T09` remains the Railiance-side execution task.
- This workplan remains the Custodian-side safety/cutover task list.
- Any stale Longhorn/Postgres StatefulSet assumptions are removed from the
  near-term migration plan.
- The future HA goal is referenced through `CUST-WP-0038`.

**Done when:** both workplans describe compatible responsibilities and gates.

Result: completed 2026-05-02. `RAIL-HO-WP-0004-T09` now names the same
pragmatic railiance01 path: cnpg database, S5 State Hub workload, restore drill
precondition, empty deploy before data copy, explicit human approval before
freezing WSL2 writes, and WSL2 retained as fallback. Full ThreePhoenix HA stays
deferred to `CUST-WP-0038`.

---

### T03 — Build and publish State Hub container image

```task
id: T03
status: in_progress
priority: high
state_hub_task_id: "79908ade-3e38-451b-a403-2361a16a3f3a"
```

Package `state-hub/` as a production image.

Requirements:

- Dockerfile builds from the current Python/uv project.
- Alembic and runtime dependencies are available inside the image.
- Image exposes the FastAPI service on port 8000.
- Image tag is pushed to the chosen registry.
- Build provenance is documented in the commit/workplan.

**Done when:** railiance01 can pull the image and a dry-run deployment resolves
it.

Progress 2026-05-03: added `state-hub/Dockerfile`,
`state-hub/.dockerignore`, and `state-hub/docs/container-image.md`. Built
local image `state-hub:local` successfully:
`sha256:e96dbd1e7d2b63e4fb17584c8c2216088a2c9937bfe880c2ad565c7a9f51c0fc`
(~106 MB). Verified container `/state/health` returns HTTP 200 against the
current database when run locally with host networking. Verified Alembic is
available in-image and reports current revision `r5m6n7o8p9q0 (head)`.

Progress 2026-05-03: registry target decision resolved to the self-hosted
Gitea registry. A local SSH tunnel to the NodePort can reach Gitea, but Docker
login/push still receives HTTP 404 from `/v2/`. Runtime inspection shows the
live Gitea `app.ini` has no `[packages]` section, so package registry
enablement/routing must be applied before publishing `state-hub:local`.

Progress 2026-05-15: rebuilt the image from current `state-hub/` sources as
`state-hub:local` with digest
`sha256:039d29654ccb3754c6ecdbe497c6364bbd8452edcdcb7fa937dd9debf5b734ff`
(106004480 bytes). Verified `/state/health` returns
`{"status":"ok","db":"connected"}` from a temporary container on host port
18000 and confirmed in-image Alembic reports `t7o8p9q0r1s2 (head)`. Build
provenance is recorded in `state-hub/docs/container-image.md`.

Remaining: enable the Gitea package/container registry, then tag, push, and
pull the image from railiance01.

---

### T04 — Define State Hub database and app manifests

```task
id: T04
status: todo
priority: high
state_hub_task_id: "a7baf2eb-abd7-4aa3-b2cb-a5370ac09844"
```

Create the cluster-side deployment assets using current Railiance boundaries:

- `railiance-platform`: `state-hub-db` cnpg cluster and database credentials.
- `railiance-apps`: State Hub Deployment, Service, ConfigMap, Secret/External
  Secret reference, and optional private Ingress.
- Health probes use `GET /state/health`.
- Environment includes `DATABASE_URL` and any required API settings.

**Done when:** manifests lint/apply in a non-destructive dry run and ownership
boundaries are documented.

---

### T05 — Deploy empty State Hub and run migrations on railiance01

```task
id: T05
status: todo
priority: high
state_hub_task_id: "a307dd46-a8e2-49df-b016-c187759ebcf1"
```

Deploy State Hub against an empty `state-hub-db` cnpg database and run Alembic
migrations in the cluster environment.

Checks:

- Pod reaches Ready.
- `/state/health` returns healthy through the intended private access path.
- Alembic reports head.
- Logs show no repeated crash/restart loop.

**Done when:** an empty but structurally valid State Hub runs on railiance01.

---

### T06 — Restore WSL2 data copy into cluster and compare

```task
id: T06
status: todo
priority: high
state_hub_task_id: "03753b88-824c-4448-97b2-f7315d145060"
```

Restore a fresh WSL2 dump into the cluster database while WSL2 remains the live
source of truth.

Required comparison:

- Table row counts match.
- Representative workstreams, tasks, decisions, progress events, repos, and
  token events are queryable.
- Dashboard and MCP summary calls return expected data through the cluster API.
- Any mismatch is investigated before proceeding.

**Done when:** cluster data is a verified copy of WSL2, but not yet the only
writer.

---

### T07 — Cut over private access to cluster State Hub

```task
id: T07
status: todo
priority: medium
state_hub_task_id: "ff1de25e-c301-4b86-9420-84dfe72e565e"
needs_human: true
intervention_note: "Requires explicit approval to freeze WSL2 writes and make the cluster State Hub the primary endpoint."
```

With human approval, freeze WSL2 writes, take a final dump, restore it to the
cluster, compare counts again, and redirect the active private access path to
the cluster API.

Accepted approaches:

- Keep local MCP config pointed at `http://127.0.0.1:8000` and move that port
  to an ops-bridge/SSH tunnel.
- Or set the MCP server `API_BASE` to the chosen private cluster endpoint.

**Done when:** `get_state_summary()` and dashboard live data are served by the
cluster State Hub, and WSL2 is no longer receiving normal writes.

---

### T08 — Stabilise with WSL2 retained as fallback

```task
id: T08
status: todo
priority: medium
state_hub_task_id: "e06a59a0-5310-4c1c-9ba5-7cfaadda62e2"
```

Run the cluster State Hub as primary while keeping the WSL2 instance available
as a fallback.

Monitor:

- State Hub pod restarts.
- cnpg cluster health.
- Backup job success.
- Dashboard and MCP behaviour from each operator machine.
- Consistency sync behaviour for file-backed workplans.

**Done when:** the agreed stabilisation window passes without data loss or
unresolved operational defects.

---

### T09 — Document operating model and defer final WSL2 retirement

```task
id: T09
status: todo
priority: low
state_hub_task_id: "d75a2d49-f3b1-4bdd-b9e1-a1c6a9744681"
```

Document the new operating model:

- How agents reach State Hub.
- How backups and restores work.
- How to roll back to WSL2 if needed.
- Which parts remain pragmatic/single-node.
- Which long-term requirements moved to `CUST-WP-0038`.

Do not permanently retire WSL2 in this workplan unless a separate human
decision is recorded. Retirement belongs after proven stability or in the
future HA workplan.

**Done when:** runbooks and project instructions match the deployed reality.

## References

- `railiance-infra/workplans/RAIL-HO-WP-0004-production-readiness.md`
- `RAIL-HO-WP-0004-T09`: Railiance-side State Hub deployment task
- `CUST-WP-0038`: future full ThreePhoenix HA State Hub migration
- Constitution constraint: production data migration and fallback retirement
  require explicit human approval
@@ -1,246 +0,0 @@
---
id: CUST-WP-0012
type: workplan
title: "Multi-User Onboarding and Environment Bootstrap"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
state_hub_workstream_id: "a28d9e29-4119-4b73-9469-f921920253ef"
created: "2026-03-11"
updated: "2026-03-11"
---

# Multi-User Onboarding and Environment Bootstrap

## Goal

Make the Custodian system accessible to collaborators beyond the primary
operator. A new user (or a new machine for the existing operator) should
be able to go from zero to a productive Claude Code session with full
State Hub MCP connectivity in a single session, without manual steps or
undocumented tribal knowledge.

## Context

Several friction points surfaced during the 2026-03-11 session:

- No SSH key for Railiance01 on WSL2 → blocked `make tunnel-loop`
- No `~/.railiance_gitea.conf` → blocked repo creation script
- Token missing `read:user` scope → blocked org repo creation
- No `git credential.helper` → credentials required on every push
- MCP registration is manual and documented only in `CLAUDE.md`

Each of these is a solved problem in isolation. This workstream collects
them into a repeatable, documented bootstrap experience.

## Scope

Two personas:

| Persona | Access level | Typical machine |
|---------|-------------|-----------------|
| Primary operator | Full access, all domains | WSL2 workstation |
| Domain collaborator | Read + write to one domain | COULOMBCORE, remote laptop |

## Tasks

### T01 — Git credential.helper for Gitea access

```task
id: CUST-WP-0012-T01
state_hub_task_id: 71628269-9a75-4dae-a347-e64a86040322
status: todo
priority: medium
```

Document and automate `git credential.helper` setup for Gitea
(`http://92.205.130.254:32166`). Recommend `libsecret` (keyring-backed)
on machines that support it; fall back to `credential.helper=store`
(persistent, plaintext `~/.git-credentials`) on headless servers.

Include in bootstrap script (T04) and onboarding guide (T05).

```bash
# Choose ONE option below; each `git config` call overrides the previous one.

# Preferred: libsecret (GNOME keyring, WSL2 with keyring daemon)
sudo apt-get install -y libsecret-1-0 libsecret-1-dev
sudo make -C /usr/share/doc/git/contrib/credential/libsecret
git config --global credential.helper \
  /usr/share/doc/git/contrib/credential/libsecret/git-credential-libsecret

# Fallback: store (persistent plaintext in ~/.git-credentials, for headless servers)
git config --global credential.helper store

# Headless server alternative: cache (in-memory, 1h timeout)
git config --global credential.helper 'cache --timeout=3600'
```

**Done when:** included in bootstrap script; push to Gitea works without
re-entering credentials on second attempt.

---

### T02 — SSH key generation and authorization automation

```task
id: CUST-WP-0012-T02
state_hub_task_id: fea965e9-8a8f-439c-9096-8f7756eb71ed
status: todo
priority: medium
```

Script or Ansible task that:
1. Generates an `ed25519` key pair on the new machine if none exists
2. Displays the public key with copy instructions
3. Authorizes it on all managed hosts (Railiance01, COULOMBCORE) via
   `ssh-copy-id` or Ansible `authorized_key` module

Surfaced by: RAIL-PL-WP-0001 T01, where the missing SSH key on WSL2 blocked
`make tunnel-loop HOST=tegwick@92.205.62.239`.

```bash
# Generate if missing
[[ -f ~/.ssh/id_ed25519 ]] || ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""

# Authorize on a target host (requires existing access once)
ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.62.239
ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.130.254
```

**Done when:** included in bootstrap script; documented in onboarding guide.

---

### T03 — Claude Code MCP registration automation

```task
id: CUST-WP-0012-T03
state_hub_task_id: 60318e9a-972e-45c8-afde-82ed0625f594
status: todo
priority: medium
```

Automate the state-hub MCP server registration on a new machine.
Currently this is a multi-step manual process documented in
`~/.claude/CLAUDE.md`. It should be a single `make` target or script:

```bash
# In the-custodian/state-hub/
make register-mcp    # idempotent; safe to re-run
```

The script should:
1. Detect whether `state-hub` is already in `~/.claude.json`
2. Extract the server config from `.mcp.json`
3. Run `claude mcp add-json -s user state-hub <config>`
4. Run `patch_mcp_cwd.py` to restore the cwd field
5. Print instructions to restart Claude Code

It should also detect whether the state hub is reachable directly
(`http://127.0.0.1:8000`) or needs a tunnel (via ops-bridge), and emit
a warning if neither is available.

**Done when:** `make register-mcp` works on a clean machine; documented
in onboarding guide.

---

### T04 — Environment bootstrap script

```task
id: CUST-WP-0012-T04
state_hub_task_id: 84a94761-e424-4470-a9a2-64d9cabadb7f
status: todo
priority: high
```

Single idempotent script: `state-hub/scripts/bootstrap-env.sh`

Checks/installs prerequisites and configures the environment:

| Step | What |
|------|------|
| Prerequisites | git, sops, age, helm, kubectl, uv, claude CLI |
| Git credential | `credential.helper` (libsecret or store) |
| SSH key | Generate ed25519 if missing; display public key |
| MCP registration | `make register-mcp` (T03) |
| Gitea config | Prompt for token; write `~/.railiance_gitea.conf` |
| Health check | `curl /state/health`; warn if tunnel needed |

Design constraints:
- Idempotent: safe to run on an already-configured machine
- No silent failures: each step prints ✓ / ✗ / ⚠
- Minimal dependencies: bash + curl only to get started

**Done when:** running the script on a clean Ubuntu 24.04 machine
produces a working Custodian environment with no additional manual steps.

---

### T05 — Onboarding guide and user journey documentation

```task
id: CUST-WP-0012-T05
state_hub_task_id: b0839802-659a-475b-8b84-ab7341ea3d15
status: todo
priority: medium
```

Write `docs/onboarding.md` in the-custodian covering the full journey
for both personas:

**Primary operator (new machine):**
1. Prerequisites (git, SSH client)
2. Clone `the-custodian`
3. Run `make bootstrap-env` (T04)
4. Restart Claude Code → verify MCP is active
5. First session: `get_state_summary()` → orient → work

**Domain collaborator (new person):**
1. Prerequisites + Gitea account
2. `ssh-copy-id` to get access to Railiance01 (or just COULOMBCORE)
3. Set up ops-bridge tunnel to reach state hub
4. Clone domain repo
5. First Claude Code session with MCP via tunnel
6. Contributing a workplan (ADR-001 convention)

**Done when:** a new collaborator can follow the guide without
clarification from the primary operator.

---

### T06 — State Hub multi-user model (deferred)

```task
id: CUST-WP-0012-T06
state_hub_task_id: d5df3302-67b9-4765-a8d8-ea2df53dff6e
status: todo
priority: low
```

Design a lightweight user/role model for the state hub:

| Role | Permissions |
|------|-------------|
| Primary operator | Full read/write, all domains |
| Domain collaborator | Read all; write to own domain only |
| Observer | Read-only |

Decision needed: enforce at API layer (HTTP Basic / token auth per
domain) or rely on Gitea repo permissions as the authoritative boundary
(simpler, since the hub is a read model anyway).
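
If the API-layer option were chosen, the role table above could reduce to a
check along these lines. This is only a sketch of the table's semantics: the
`Role` enum, the function names, and the idea that each user carries a single
home domain are all assumptions, and nothing here is needed if Gitea
permissions end up being the boundary.

```python
from enum import Enum


class Role(Enum):
    PRIMARY_OPERATOR = "primary_operator"
    DOMAIN_COLLABORATOR = "domain_collaborator"
    OBSERVER = "observer"


def can_read(role):
    """Every role in the table may read all domains."""
    return isinstance(role, Role)


def can_write(role, user_domain, target_domain):
    """Apply the write column of the role table."""
    if role is Role.PRIMARY_OPERATOR:
        return True  # full read/write, all domains
    if role is Role.DOMAIN_COLLABORATOR:
        return user_domain == target_domain  # own domain only
    return False  # observers are read-only
```

For example, a collaborator whose home domain is `custodian` could write to
`custodian` workstreams but not to those of another domain.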

**Deferred until:** first external collaborator is actively onboarding.
Implement T01–T05 first; multi-user access control is only needed when
there is more than one user.

---

## References

- ops-bridge repo: `ops-bridge` (tunnel lifecycle management)
- MCP registration: `~/.claude/CLAUDE.md` (current manual procedure)
- Gitea repo creation: `railiance-cluster/tools/create_railiance_repo.sh`
- ADR-001: workplans as repo artefacts
- Surfaced by: RAIL-PL-WP-0001 T01 execution, 2026-03-11
@@ -1,246 +0,0 @@
---
id: CUST-WP-0038
type: workplan
title: "State Hub Full ThreePhoenix HA Migration"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
created: "2026-05-02"
updated: "2026-05-02"
depends_on: CUST-WP-0011
state_hub_workstream_id: "8d0c1b5d-44da-4b91-8357-e6526d3e0a85"
---

# State Hub Full ThreePhoenix HA Migration

## Goal

Preserve the original long-term State Hub infrastructure goal while
`CUST-WP-0011` takes the pragmatic railiance01 path.

This workplan completes the migration from a useful single-node cluster-hosted
State Hub to a full ThreePhoenix-grade service: multi-node Kubernetes,
replicated storage, tested failover, tested restore, and retirement of the WSL2
fallback only after operational confidence is earned.

## Why This Exists

The near-term State Hub migration should not wait for every HA precondition,
because the workstation-hosted State Hub is already a bottleneck for
multi-machine work.

But the original requirement remains valid:

- State Hub is irreplaceable episodic memory.
- A single node is not a final home.
- Backup and restore must be drilled, not assumed.
- Long-term operations must survive node loss and operator-machine loss.

`CUST-WP-0011` moves State Hub to railiance01 pragmatically. This workplan
keeps the ultimate target visible and reviewable.

## Entry Criteria

- `CUST-WP-0011` completed or explicitly superseded.
- Cluster-hosted State Hub has passed its stabilisation period.
- railiance01 is not the only planned durable node.
- Railiance architecture decision for storage replication is current:
  Longhorn, cnpg replication, external backup, or a documented replacement.
- Backup and restore tooling has an owner and runbook.

## Target Properties

- Three healthy Kubernetes nodes: Railiance01, Railiance02, Railiance03.
- State Hub database survives loss of one node.
- State Hub API recovers from pod loss without manual repair.
- Backups are encrypted, off-node, and restorable into a test namespace.
- Agent access remains private.
- WSL2 is no longer needed as the primary disaster-recovery fallback.

## Tasks

### T01 — Confirm ThreePhoenix cluster readiness

```task
id: T01
status: todo
priority: high
state_hub_task_id: "aa1bf291-3b6c-4940-a4f5-7680b0349110"
```

Verify the target cluster state:

- Three nodes are joined and Ready.
- Control-plane and worker roles are documented.
- Cluster version and node resources are recorded.
- Smoke tests pass from the operator machine and from CoulombCore.

**Done when:** a current readiness report exists and no node is marked
NotReady or operationally unknown.
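
One way to turn the node check into a repeatable report is to parse the output
of `kubectl get nodes -o json`. The sketch below assumes the standard Node
object layout (`metadata.name`, `status.conditions` with a `Ready` condition);
the lowercase node names are an assumption based on the Target Properties
list, and the function names are hypothetical.

```python
# Assumed node names, taken from the Target Properties list (casing is a guess).
EXPECTED_NODES = ("railiance01", "railiance02", "railiance03")


def node_readiness(node_list, expected=EXPECTED_NODES):
    """Map each expected node to 'Ready', 'NotReady', 'Unknown' or 'Missing'.

    node_list is the parsed JSON from `kubectl get nodes -o json`.
    """
    seen = {}
    for item in node_list.get("items", []):
        name = item["metadata"]["name"]
        status = "Unknown"  # default when no Ready condition is reported
        for cond in item.get("status", {}).get("conditions", []):
            if cond.get("type") == "Ready":
                status = {"True": "Ready", "False": "NotReady"}.get(
                    cond.get("status"), "Unknown")
        seen[name] = status
    return {name: seen.get(name, "Missing") for name in expected}


def cluster_is_ready(report):
    """True only if every expected node is present and Ready."""
    return all(state == "Ready" for state in report.values())
```

The report dictionary can be written into the readiness report as-is; any
value other than `Ready` corresponds to the "NotReady or operationally
unknown" case in the done criterion.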

---

### T02 — Establish replicated storage/database strategy

```task
id: T02
status: todo
priority: high
state_hub_task_id: "5575f244-5cef-47aa-a168-24027cd08140"
```

Choose and document the durable data strategy for State Hub:

- cnpg multi-instance PostgreSQL cluster, and/or
- Longhorn-backed storage with suitable replication, and/or
- another explicitly approved architecture.

The decision must define RPO, RTO, failover behavior, and restore procedure.

**Done when:** the selected architecture is documented and approved before any
production data movement.

---

### T03 — Implement HA State Hub database

```task
id: T03
status: todo
priority: high
state_hub_task_id: "5330fcc3-684b-49f6-8d28-ea8c929733d6"
```

Apply the chosen database/storage architecture to State Hub.

Requirements:

- Database credentials remain SOPS/secret-managed.
- The database has automated backup configured.
- The database exposes a stable service endpoint for the API.
- Health and replication status are observable.

**Done when:** State Hub can run against the HA database in a test or staging
namespace.

---

### T04 — Add State Hub API high-availability behavior

```task
id: T04
status: todo
priority: medium
state_hub_task_id: "64175ed0-af36-47ea-9401-74c4b15ffe24"
```

Run State Hub API with the right availability posture for its workload:

- At least one replica, optionally more if DB/session behavior permits.
- Readiness and liveness probes.
- Rolling update behavior documented.
- Resource requests/limits set.

**Done when:** killing an API pod does not require manual recovery.

---

### T05 — Drill database failover

```task
id: T05
status: todo
priority: high
state_hub_task_id: "73c5008a-380e-42bf-ad57-1c9d0bda3a86"
```

Perform a controlled failover drill for the State Hub database.

Checks:

- Failure trigger is documented.
- API behavior during failover is observed.
- Recovery time is measured.
- No data loss is detected after recovery.

**Done when:** the failover drill passes and results are logged.
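
The "recovery time is measured" check can be made reproducible with a small
polling helper. This is a sketch under assumptions: `measure_recovery` is a
hypothetical name, and `probe` is any caller-supplied callable that returns
`True` once the API is healthy again (for instance, a request to
`/state/health`). The clock and sleep functions are injectable so the helper
can be exercised without waiting.

```python
import time


def measure_recovery(probe, timeout_s=300.0, interval_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    Returns the elapsed seconds since this function was called, or None if
    probe() is still failing once timeout_s has passed. Call it at the
    moment the failure is triggered so the result approximates recovery time.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

The measured value is bounded by the polling interval, so a shorter
`interval_s` gives a tighter estimate. The result can then be compared with
the RTO defined in T02 and logged with the drill results.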

---

### T06 — Drill backup restore to isolated namespace

```task
id: T06
status: todo
priority: high
state_hub_task_id: "4e5b97ff-ef1c-414d-812b-39b87b242c74"
```

Restore the latest encrypted State Hub backup into an isolated namespace or
separate test database.

Checks:

- Backup can be decrypted with the documented key path.
- Restore completes from off-node backup material.
- Row counts and representative records match.
- Restored API can serve `/state/health` and `/state/summary` when pointed at
  the restored database.

**Done when:** restore drill passes without depending on the live database.
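
The row-count check can be expressed as a comparison of two per-table count
maps. The sketch below is illustrative only: `compare_row_counts` is a
hypothetical helper, and it assumes the source counts were recorded alongside
the backup when it was taken, so the drill does not need to query the live
database.

```python
def compare_row_counts(source, restored):
    """Return {table: (source_count, restored_count)} for every mismatch.

    A count of None means the table is absent on that side. An empty
    result means all tables match.
    """
    mismatches = {}
    for table in sorted(set(source) | set(restored)):
        expected = source.get(table)
        actual = restored.get(table)
        if expected != actual:
            mismatches[table] = (expected, actual)
    return mismatches
```

An empty dictionary satisfies the row-count half of the check; the
representative-record comparison still has to be done separately.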

---

### T07 — Update agent access and runbooks for HA endpoint

```task
id: T07
status: todo
priority: medium
state_hub_task_id: "959062d8-decb-4969-a60b-0d3b618a8d6c"
```

Update the private access model after the HA endpoint is available:

- ops-bridge or tunnel target.
- MCP `API_BASE` or local port-forward convention.
- Dashboard access.
- Operator recovery instructions.

**Done when:** each active operator machine can reach the HA State Hub endpoint
through the documented path.

---

### T08 — Retire WSL2 fallback after explicit approval

```task
id: T08
status: todo
priority: low
needs_human: true
intervention_note: "Requires explicit approval after HA failover and restore drills pass."
state_hub_task_id: "d4a7ca26-c338-48a1-b8b1-85a356550add"
```

Retire the WSL2 State Hub as a disaster-recovery fallback only after the HA
cluster path has passed drills.

Steps:

1. Take and archive a final WSL2 backup.
2. Stop local WSL2 State Hub services.
3. Update global and repo instructions.
4. Record the retirement decision in State Hub.

**Done when:** WSL2 is no longer part of the normal or fallback operating
model, and the cluster runbook is the source of truth.

## References

- `CUST-WP-0011`: pragmatic railiance01 migration
- Railiance ThreePhoenix infrastructure goal
- State Hub backup/restore runbooks
- Constitution constraint: irreversible retirement requires human approval