From a2b41e7d0a6e26d4ac9b8aaaa1d0cb3f2d7b5a8c Mon Sep 17 00:00:00 2001
From: codex
Date: Fri, 3 Jul 2026 13:25:26 +0200
Subject: [PATCH] CUST-WP-0054 (proposed): workstation independence and fleet role realignment

Co-Authored-By: Claude Fable 5
---
 ...tion-independence-and-fleet-realignment.md | 289 ++++++++++++++++++
 1 file changed, 289 insertions(+)
 create mode 100644 workplans/CUST-WP-0054-workstation-independence-and-fleet-realignment.md

diff --git a/workplans/CUST-WP-0054-workstation-independence-and-fleet-realignment.md b/workplans/CUST-WP-0054-workstation-independence-and-fleet-realignment.md
new file mode 100644
index 0000000..110e21a
--- /dev/null
+++ b/workplans/CUST-WP-0054-workstation-independence-and-fleet-realignment.md
@@ -0,0 +1,289 @@
+---
+id: CUST-WP-0054
+type: workplan
+title: "Workstation Independence and Fleet Role Realignment"
+domain: infotech
+repo: the-custodian
+status: proposed
+owner: codex
+topic_slug: custodian
+planning_priority: high
+planning_order: 54
+created: "2026-07-03"
+updated: "2026-07-03"
+---
+
+# CUST-WP-0054 - Workstation Independence and Fleet Role Realignment
+
+## Goal
+
+Remove every runtime dependency and always-on workload from the workstation,
+realign machine roles, and make the production loop run uninterrupted with the
+workstation powered off:
+
+| Machine | Target role |
+| --- | --- |
+| **railiance01** | Production workloads (the first node of the growing Railiance fleet). |
+| **coulombcore** | Early/legacy/experimental workloads only; frozen for new production; eventually decommissioned and **reborn as railiance02** (a machine-scale phoenix rotation). |
+| **workstation** | Local, temporary dev environment to build and evolve repos — nothing depends on it being on. |
+
+The acceptance proof for the whole plan is T10: production runs for 24h+ with
+the workstation offline.
+
+## Current-State Findings (2026-07-03 inventory)
+
+These are the concrete dependencies this plan removes:
+
+1. **The workstation is a production network hub.** All 16 ops-bridge tunnels
+   originate here. Two production paths *chain through* it:
+   - State Hub primary: everything reaches the cluster hub via the
+     workstation's `state-hub-primary` forward tunnel, and railiance01's
+     automations via `state-hub-railiance01` (reverse) → workstation → cluster.
+   - issue-core emission: activity-core (railiance01) → workstation →
+     issue-core (coulombcore).
+   A workstation reboot breaks daily triage evidence, consistency sweeps, and
+   task emission.
+2. **Production landed on the condemned machine.** The 2026-07-02/03 cutovers
+   correctly moved State Hub and Core Hub off the workstation — but onto
+   coulombcore (State Hub cluster primary, Core Hub behind
+   `hub.coulomb.social`, issue-core, OpenBao, ArgoCD, ESO, the identity stack,
+   and Gitea all run there). Under the new roles these are production
+   workloads on a legacy machine.
+3. **Source of truth for code is on the condemned machine.** Every repo's
+   canonical remote is `gitea.coulomb.social` (coulombcore). The Forgejo
+   production migration (`RAIL-HO-WP-0005`) already targets railiance01.
+4. **The consistency sweep and daily triage depend on workstation repo
+   checkouts** (`/home/worsch/*` clones are what fix-consistency writes back
+   to) and the triage working-memory sink references a
+   `/home/worsch/the-custodian/memory/working` path identity.
+5. **Images are built and published from the workstation** (docker build +
+   push by hand); there is no CI runner, so releases require this machine.
+6. **WSL2 State Hub fallback** still exists on the workstation
+   (`CUST-WP-0011-T08/T09` stabilization window) — expected, retired by that
+   workplan, not this one.
+
+## Deployment Strategy (proposed — improves on the rough beachhead idea)
+
+**A. Files-first beachhead, not database replication.** ADR-001 already
+declares that the State Hub is a read model rebuildable from repo files. So
+the dev-environment story should be: `make dev-hub` starts a **local ephemeral
+hub** (compose: postgres + API + MCP, exactly what exists today) whose content
+is **rebuilt from the local repo checkouts** via the C-06 registration path —
+not synced from the fleet database. Offline writes (progress events, task
+status) accumulate in a **write buffer** and relay upstream on reconnect
+(`STATE-WP-0068 offline-write-buffer-and-edge-relay` is the existing seed for
+this). Two hubs never replicate; files reconcile them. This makes the
+workstation genuinely temporary: clone repos → `make dev-hub` → work offline →
+push + relay when connected.
+
+**B. Fleet mesh instead of workstation-hub networking.** Machine-to-machine
+paths (railiance01 ↔ coulombcore/railiance02) get direct, persistent links
+owned by the machines themselves — ops-bridge units running *on the fleet
+machines* under `atm-` actor certs (ops-warden), or a minimal WireGuard mesh
+if tunnel count grows. The workstation becomes a mesh *client* when present,
+never a relay.
+
+**C. Promotion-gated production, ThreePhoenix ideas incrementally.** A
+workload counts as "production on railiance01" only when it conforms to the
+already-finished staged-promotion contract (`RAIL-BS-WP-0006`: overlay repo,
+`railiance/app.toml`, stage commands, rollback). ThreePhoenix ideas adopted
+now, without waiting for three nodes:
+- **CNPG-managed Postgres** per workload (already proven on coulombcore);
+- **greenfield rebuild automation** (NET-WP-0020 unseal + bootstrap chain) as
+  the standing "phoenix drill" for single machines;
+- **phoenix rotation at machine scale**: the coulombcore → railiance02
+  rebirth *is* the first full phoenix and should be executed with the
+  same automation that a future 3-node weekly rotation would use;
+- Longhorn distributed storage and PG streaming HA remain deferred until a
+  third node exists (`RAIL-BS-WP-0007` / `CUST-WP-0038` stay the follow-on).
+
+**D. Decommission by attrition, not by big-bang.** coulombcore is frozen for
+new production immediately (policy), drained workload-by-workload via staged
+promotion onto railiance01, and only rebuilt as railiance02 when the last
+production dependency (likely identity/OpenBao) has moved.
+
+## Task: Target architecture and dependency register
+
+```task
+id: CUST-WP-0054-T01
+status: todo
+priority: high
+```
+
+Write the canon-adjacent architecture note (`docs/` first; promote to
+`canon/architecture/` after review) fixing the three machine roles, the fleet
+mesh topology, the promotion gate for "production", and the phoenix path
+coulombcore → railiance02. Include the full dependency register: every
+workload, tunnel, repo remote, sink path, and build pipeline with its current
+host and target host. Done when every row has a target and a migration owner
+(this plan's task or an existing workplan reference).
+
+## Task: De-hub the network — fleet-owned direct tunnels
+
+```task
+id: CUST-WP-0054-T02
+status: todo
+priority: high
+```
+
+Remove the workstation from all production data paths:
+
+- Run ops-bridge (or systemd ssh units) **on railiance01** with `atm-` actor
+  certs for the two live cross-machine lanes: railiance01 → coulombcore
+  issue-core, and railiance01 → cluster State Hub (replacing the
+  workstation-chained `state-hub-railiance01` + `state-hub-primary` pair).
+- Re-point `actcore-state-hub-bridge` and `actcore-issue-core-bridge` at the
+  machine-local tunnel ports.
+- Workstation tunnels remain only for interactive dev access (k3s API, hub
+  client) and may drop at any time without production impact.
+- Evaluate WireGuard mesh as the successor if unit count exceeds ~5.
+
+Done when killing every workstation tunnel leaves triage, sweeps, and
+emission working (partial T10 rehearsal).
+
+## Task: Production placement plan and freeze policy
+
+```task
+id: CUST-WP-0054-T03
+status: todo
+priority: high
+```
+
+Declare coulombcore frozen for new production (policy note in canon). Produce
+the drain sequence with per-workload target and method (staged-promotion
+overlay or documented exception): State Hub, Core Hub, issue-core, OpenBao,
+identity stack (KeyCape/Authelia/privacyIDEA/lldap), ESO/ArgoCD control
+plane, Gitea/Forgejo, CNPG databases. Explicitly order them by coupling
+(forge and State Hub early; identity + OpenBao last, since everything
+authenticates through them).
+
+## Task: Forge to railiance01 + CI runners (kill workstation builds)
+
+```task
+id: CUST-WP-0054-T04
+status: todo
+priority: high
+```
+
+Execute/absorb `RAIL-HO-WP-0005`: Forgejo production on railiance01 becomes
+the canonical remote for all repos; coulombcore Gitea becomes a read-only
+mirror until decommission. Stand up Actions runners so container images
+(state-hub, core-hub, issue-core, activity-core) build and push in CI from
+tags — the workstation stops being the build/publish host. Done when a
+release ships with the workstation off.
+
+## Task: State Hub production home on railiance01
+
+```task
+id: CUST-WP-0054-T05
+status: todo
+priority: high
+```
+
+Move the State Hub primary from coulombcore to railiance01 using the proven
+CUST-WP-0011-T07 playbook (freeze → exact-count restore → rewire). This makes
+the automation loop machine-local: activity-core, Temporal, and the hub share
+one machine, so daily triage and sweeps survive any other machine being down.
+Prereq: railiance01 CNPG + storage reviewed (T03). Also relocate the
+consistency-sweep repo checkouts to railiance01 (clones from the T04 forge)
+so file writebacks no longer touch workstation paths.
+
+## Task: Working-memory and sink path decoupling
+
+```task
+id: CUST-WP-0054-T06
+status: todo
+priority: medium
+```
+
+Remove `/home/worsch/...` path identities from runtime contracts: the triage
+working-memory sink and any prompt/context paths become repo-relative or
+PVC-native with a defined sync-to-repo step (commit via the sweep). Done when
+no ActivityDefinition or sink references a workstation-specific absolute path.
+
+## Task: Dev-environment beachhead (files-first)
+
+```task
+id: CUST-WP-0054-T07
+status: todo
+priority: high
+```
+
+Implement strategy A: `make dev-hub` (or `custodian dev up`) starts the local
+compose hub, registers the locally cloned repos, and rebuilds workplan/task
+state from files via C-06 — no fleet connection required. Implement the
+offline write buffer + reconnect relay for progress/task events (align with
+`STATE-WP-0068`; keep the buffer file-backed and idempotent). MCP config
+gains an explicit `dev`/`fleet` profile switch. Done when a fresh machine
+reaches a working, orientation-capable dev hub from `git clone` + one command,
+fully offline.
+
+## Task: ThreePhoenix increment — phoenix drill automation
+
+```task
+id: CUST-WP-0054-T08
+status: todo
+priority: medium
+```
+
+Compose the existing pieces (NET-WP-0020 unseal automation, S1–S3 bootstrap
+chain, staged-promotion overlays, CNPG restore drills) into one rehearsable
+"phoenix a machine" runbook + automation entrypoint, proven on a disposable
+target (haskelseed or a VM). This is the tool the railiance02 rebirth and any
+future node rotation will use. Done when a greenfield machine reaches
+join-ready state unattended except for custody-gated steps.
+
+## Task: coulombcore decommission readiness → railiance02
+
+```task
+id: CUST-WP-0054-T09
+status: wait
+priority: medium
+```
+
+Gated on T03–T05 drains reaching identity/OpenBao. Final inventory sweep,
+data archival (episodic memory of the machine's history), DNS/cert plan for
+`*.coulomb.social` names, then execute the machine phoenix via T08 automation:
+wipe, rebuild as railiance02, join the fleet. Longhorn/PG-HA
+(`RAIL-BS-WP-0007`, `CUST-WP-0038`) unlock once railiance01 + railiance02 are
+both fleet nodes.
+
+## Task: Workstation-off acceptance test
+
+```task
+id: CUST-WP-0054-T10
+status: wait
+priority: high
+```
+
+The plan's proof: workstation fully offline for 24h+ (no tunnels, no
+processes). Verify afterwards from evidence alone: scheduled triage ran and
+validated, consistency sweeps ran, issue emission works, hub/API/dashboards
+served, forge and CI available. Then verify the inverse: workstation boots,
+`custodian dev up` gives a working offline dev hub, and reconnect relays
+buffered events. Done when both directions pass without manual repair.
+
+## Sequencing
+
+```
+T01 (architecture + register)
+ ├─ T02 de-hub network ── unblocks most of T10's first half
+ ├─ T03 placement plan/freeze ─┬─ T04 forge + CI ─┐
+ │                             └─ T05 hub home    ├─ T09 decommission → railiance02
+ ├─ T06 sink decoupling                           │
+ ├─ T07 dev beachhead                             │
+ └─ T08 phoenix drill ────────────────────────────┘
+                                                  T10 acceptance (both halves)
+```
+
+## Relationship to existing workplans
+
+- Absorbs the *sequencing* of `RAIL-HO-WP-0005` (Forgejo) and
+  `CUST-WP-0011-T08/T09` (WSL2 fallback retirement folds into T10).
+- Builds on finished `RAIL-BS-WP-0006` (staged promotion) and `NET-WP-0020`
+  (unseal automation).
+- Defers to `RAIL-BS-WP-0007` / `CUST-WP-0038` for true 3-node HA once
+  railiance02 exists.
+- Extends `STATE-WP-0068` (offline write buffer) as the T07 relay mechanism.
+- `CORE-WP-0005` stabilization + `CORE-WP-0007` Haskell retirement proceed
+  independently; Core Hub simply appears in the T03 drain sequence.
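
Review note on the T07 relay mechanism: the workplan asks for an offline write buffer that is "file-backed and idempotent" but does not say what that looks like. The Python sketch below is one possible reading, not the `STATE-WP-0068` design. The names `WriteBuffer` and `FakeHub`, the JSON-lines file layout, and the use of a per-event UUID as the idempotency key are all assumptions made for illustration; the real hub API is not shown in this patch.

```python
from __future__ import annotations

import json
import os
import tempfile
import uuid
from pathlib import Path
from typing import Callable


class WriteBuffer:
    """Hypothetical append-only JSON-lines buffer for events written offline."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def append(self, kind: str, payload: dict) -> str:
        """Persist one event and return its id (used as idempotency key)."""
        event_id = str(uuid.uuid4())
        record = {"id": event_id, "kind": kind, "payload": payload}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, sort_keys=True) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # the event must survive a power-off
        return event_id

    def pending(self) -> list[dict]:
        """Return buffered events in the order they were written."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def relay(self, send: Callable[[dict], bool]) -> int:
        """Send events in order and keep whatever upstream did not accept.

        A crash between a successful send and the rewrite below re-sends
        the event on the next relay, so `send` must deduplicate by id.
        """
        events = self.pending()
        sent = 0
        for event in events:
            if not send(event):
                break  # upstream unreachable: stop and keep the rest
            sent += 1
        # Rewrite the buffer atomically so a crash never leaves a partial file.
        fd, tmp = tempfile.mkstemp(dir=self.path.parent)
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for event in events[sent:]:
                fh.write(json.dumps(event, sort_keys=True) + "\n")
        os.replace(tmp, self.path)
        return sent


class FakeHub:
    """Stand-in for the upstream hub: accepts events and deduplicates by id."""

    def __init__(self) -> None:
        self.online = True
        self.seen: dict[str, dict] = {}

    def ingest(self, event: dict) -> bool:
        if not self.online:
            return False
        self.seen.setdefault(event["id"], event)  # duplicate ids are no-ops
        return True
```

In use, the dev hub would call `append` for each progress or task-status event while disconnected, then call `relay(hub.ingest)` on reconnect. The design choice worth reviewing is where idempotency lives: here the buffer only guarantees at-least-once delivery in order, and the receiving side is expected to drop duplicates by event id.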