CUST-WP-0054 (proposed): workstation independence and fleet role realignment
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@@ -0,0 +1,289 @@
---
id: CUST-WP-0054
type: workplan
title: "Workstation Independence and Fleet Role Realignment"
domain: infotech
repo: the-custodian
status: proposed
owner: codex
topic_slug: custodian
planning_priority: high
planning_order: 54
created: "2026-07-03"
updated: "2026-07-03"
---

# CUST-WP-0054 - Workstation Independence and Fleet Role Realignment

## Goal

Remove every runtime dependency and always-on workload from the workstation,
realign machine roles, and make the production loop run uninterrupted with the
workstation powered off:

| Machine | Target role |
| --- | --- |
| **railiance01** | Production workloads (the first node of the growing Railiance fleet). |
| **coulombcore** | Early/legacy/experimental workloads only; frozen for new production; eventually decommissioned and **reborn as railiance02** (a machine-scale phoenix rotation). |
| **workstation** | Local, temporary dev environment to build and evolve repos — nothing depends on it being on. |

The acceptance proof for the whole plan is T10: production runs for 24h+ with
the workstation offline.

## Current-State Findings (2026-07-03 inventory)

These are the concrete dependencies this plan removes:

1. **The workstation is a production network hub.** All 16 ops-bridge tunnels
   originate here. Two production paths *chain through* it:
   - State Hub primary: everything reaches the cluster hub via the
     workstation's `state-hub-primary` forward tunnel, and railiance01's
     automations via `state-hub-railiance01` (reverse) → workstation → cluster.
   - issue-core emission: activity-core (railiance01) → workstation →
     issue-core (coulombcore).
   A workstation reboot breaks daily triage evidence, consistency sweeps, and
   task emission.
2. **Production landed on the condemned machine.** The 2026-07-02/03 cutovers
   correctly moved State Hub and Core Hub off the workstation — but onto
   coulombcore (State Hub cluster primary, Core Hub behind
   `hub.coulomb.social`, issue-core, OpenBao, ArgoCD, ESO, the identity stack,
   and Gitea all run there). Under the new roles these are production
   workloads on a legacy machine.
3. **Source of truth for code is on the condemned machine.** Every repo's
   canonical remote is `gitea.coulomb.social` (coulombcore). The Forgejo
   production migration (`RAIL-HO-WP-0005`) already targets railiance01.
4. **The consistency sweep and daily triage depend on workstation repo
   checkouts** (`/home/worsch/*` clones are what fix-consistency writes back
   to) and the triage working-memory sink references a
   `/home/worsch/the-custodian/memory/working` path identity.
5. **Images are built and published from the workstation** (docker build +
   push by hand); there is no CI runner, so releases require this machine.
6. **WSL2 State Hub fallback** still exists on the workstation
   (`CUST-WP-0011-T08/T09` stabilization window) — expected, retired by that
   workplan, not this one.

## Deployment Strategy (proposed — improves on the rough beachhead idea)

**A. Files-first beachhead, not database replication.** ADR-001 already
declares that the State Hub is a read model rebuildable from repo files. So
the dev-environment story should be: `make dev-hub` starts a **local ephemeral
hub** (compose: postgres + API + MCP, exactly what exists today) whose content
is **rebuilt from the local repo checkouts** via the C-06 registration path —
not synced from the fleet database. Offline writes (progress events, task
status) accumulate in a **write buffer** and relay upstream on reconnect
(`STATE-WP-0068 offline-write-buffer-and-edge-relay` is the existing seed for
this). Two hubs never replicate; files reconcile them. This makes the
workstation genuinely temporary: clone repos → `make dev-hub` → work offline →
push + relay when connected.

**B. Fleet mesh instead of workstation-hub networking.** Machine-to-machine
paths (railiance01 ↔ coulombcore/railiance02) get direct, persistent links
owned by the machines themselves — ops-bridge units running *on the fleet
machines* under `atm-` actor certs (ops-warden), or a minimal WireGuard mesh
if tunnel count grows. The workstation becomes a mesh *client* when present,
never a relay.

**C. Promotion-gated production, ThreePhoenix ideas incrementally.** A
workload counts as "production on railiance01" only when it conforms to the
already-finished staged-promotion contract (`RAIL-BS-WP-0006`: overlay repo,
`railiance/app.toml`, stage commands, rollback). ThreePhoenix ideas adopted
now, without waiting for three nodes:

- **CNPG-managed Postgres** per workload (already proven on coulombcore);
- **greenfield rebuild automation** (NET-WP-0020 unseal + bootstrap chain) as
  the standing "phoenix drill" for single machines;
- **phoenix rotation at machine scale**: the coulombcore → railiance02
  rebirth *is* the first full phoenix and should be executed with the
  same automation that a future 3-node weekly rotation would use;
- Longhorn distributed storage and PG streaming HA remain deferred until a
  third node exists (`RAIL-BS-WP-0007` / `CUST-WP-0038` stay the follow-on).

**D. Decommission by attrition, not by big-bang.** coulombcore is frozen for
new production immediately (policy), drained workload-by-workload via staged
promotion onto railiance01, and only rebuilt as railiance02 when the last
production dependency (likely identity/OpenBao) has moved.

## Task: Target architecture and dependency register

```task
id: CUST-WP-0054-T01
status: todo
priority: high
```

Write the canon-adjacent architecture note (`docs/` first; promote to
`canon/architecture/` after review) fixing the three machine roles, the fleet
mesh topology, the promotion gate for "production", and the phoenix path
coulombcore → railiance02. Include the full dependency register: every
workload, tunnel, repo remote, sink path, and build pipeline with its current
host and target host. Done when every row has a target and a migration owner
(this plan's task or an existing workplan reference).

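The T01 "done" criterion can be checked mechanically once the register is structured data. The following is a minimal sketch under that assumption: the `RegisterRow` schema, the `incomplete_rows` helper, and the sample rows are hypothetical illustrations, not the register's actual format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegisterRow:
    """One dependency-register entry (hypothetical schema)."""
    name: str
    kind: str                     # workload | tunnel | repo-remote | sink-path | build-pipeline
    current_host: str
    target_host: Optional[str] = None
    migration_owner: Optional[str] = None   # a task of this plan or an existing workplan id


def incomplete_rows(rows):
    """Return the names of rows still missing a target host or a migration owner."""
    return [r.name for r in rows if not r.target_host or not r.migration_owner]


# Illustrative rows only. Hosts and owners follow this plan's text where it
# states them and are left empty where it does not.
rows = [
    RegisterRow("state-hub-primary tunnel", "tunnel", "workstation",
                "railiance01", "CUST-WP-0054-T02"),
    RegisterRow("canonical git remote", "repo-remote", "coulombcore",
                "railiance01", "RAIL-HO-WP-0005"),
    RegisterRow("triage working-memory sink", "sink-path", "workstation"),
]

print(incomplete_rows(rows))  # ['triage working-memory sink']
```

An empty result would correspond to the "every row has a target and a migration owner" condition.
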
## Task: De-hub the network — fleet-owned direct tunnels

```task
id: CUST-WP-0054-T02
status: todo
priority: high
```

Remove the workstation from all production data paths:

- Run ops-bridge (or systemd ssh units) **on railiance01** with `atm-` actor
  certs for the two live cross-machine lanes: railiance01 → coulombcore
  issue-core, and railiance01 → cluster State Hub (replacing the
  workstation-chained `state-hub-railiance01` + `state-hub-primary` pair).
- Re-point `actcore-state-hub-bridge` and `actcore-issue-core-bridge` at the
  machine-local tunnel ports.
- Workstation tunnels remain only for interactive dev access (k3s API, hub
  client) and may drop at any time without production impact.
- Evaluate WireGuard mesh as the successor if unit count exceeds ~5.

Done when killing every workstation tunnel leaves triage, sweeps, and
emission working (partial T10 rehearsal).

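One way to state the T02 goal as a check is to model each production lane as a list of hops and ask which lanes still touch the workstation. This is a sketch only: the lane names and the `lanes_touching` helper are hypothetical, and the hop lists are simplified from the findings and bullets in this plan.

```python
# Hop lists simplified from this plan; lane names are illustrative.
CURRENT_LANES = {
    "issue-core emission": ["railiance01", "workstation", "coulombcore"],
    "state-hub automations": ["railiance01", "workstation", "cluster"],
}
TARGET_LANES = {
    "issue-core emission": ["railiance01", "coulombcore"],
    "state-hub automations": ["railiance01", "cluster"],
}


def lanes_touching(host, lanes):
    """Return the sorted names of lanes whose hop list includes `host`."""
    return sorted(name for name, hops in lanes.items() if host in hops)


print(lanes_touching("workstation", CURRENT_LANES))
# ['issue-core emission', 'state-hub automations']
print(lanes_touching("workstation", TARGET_LANES))
# []
```
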
## Task: Production placement plan and freeze policy

```task
id: CUST-WP-0054-T03
status: todo
priority: high
```

Declare coulombcore frozen for new production (policy note in canon). Produce
the drain sequence with per-workload target and method (staged-promotion
overlay or documented exception): State Hub, Core Hub, issue-core, OpenBao,
identity stack (KeyCape/Authelia/privacyIDEA/lldap), ESO/ArgoCD control
plane, Gitea/Forgejo, CNPG databases. Explicitly order them by coupling
(forge and State Hub early; identity + OpenBao last, since everything
authenticates through them).

## Task: Forge to railiance01 + CI runners (kill workstation builds)

```task
id: CUST-WP-0054-T04
status: todo
priority: high
```

Execute/absorb `RAIL-HO-WP-0005`: Forgejo production on railiance01 becomes
the canonical remote for all repos; coulombcore Gitea becomes a read-only
mirror until decommission. Stand up Actions runners so container images
(state-hub, core-hub, issue-core, activity-core) build and push in CI from
tags — the workstation stops being the build/publish host. Done when a
release ships with the workstation off.

## Task: State Hub production home on railiance01

```task
id: CUST-WP-0054-T05
status: todo
priority: high
```

Move the State Hub primary from coulombcore to railiance01 using the proven
CUST-WP-0011-T07 playbook (freeze → exact-count restore → rewire). This makes
the automation loop machine-local: activity-core, Temporal, and the hub share
one machine, so daily triage and sweeps survive any other machine being down.
Prereq: railiance01 CNPG + storage reviewed (T03). Also relocate the
consistency-sweep repo checkouts to railiance01 (clones from the T04 forge)
so file writebacks no longer touch workstation paths.

## Task: Working-memory and sink path decoupling

```task
id: CUST-WP-0054-T06
status: todo
priority: medium
```

Remove `/home/worsch/...` path identities from runtime contracts: the triage
working-memory sink and any prompt/context paths become repo-relative or
PVC-native with a defined sync-to-repo step (commit via the sweep). Done when
no ActivityDefinition or sink references a workstation-specific absolute path.

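The T06 "done" condition is essentially a search for one path prefix. A minimal sketch of such a scan follows; the document names and YAML keys in the sample are hypothetical, and only the `/home/worsch/` prefix comes from this plan.

```python
import re

# Matches the workstation-specific home prefix named in this plan,
# up to the next whitespace or quote character.
WORKSTATION_PATH = re.compile(r"/home/worsch/[^\s\"']*")


def find_workstation_paths(documents):
    """Map each document name to the workstation-specific paths found in its text."""
    hits = {}
    for name, text in documents.items():
        found = WORKSTATION_PATH.findall(text)
        if found:
            hits[name] = found
    return hits


# Hypothetical definition files, inlined as strings to keep the sketch self-contained.
docs = {
    "triage-sink.yaml": "sink_path: /home/worsch/the-custodian/memory/working\n",
    "sweep.yaml": "sink_path: memory/working\n",
}

print(find_workstation_paths(docs))
# {'triage-sink.yaml': ['/home/worsch/the-custodian/memory/working']}
```

An empty dictionary over all ActivityDefinitions and sink configs would match the acceptance wording above.
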
## Task: Dev-environment beachhead (files-first)

```task
id: CUST-WP-0054-T07
status: todo
priority: high
```

Implement strategy A: `make dev-hub` (or `custodian dev up`) starts the local
compose hub, registers the locally cloned repos, and rebuilds workplan/task
state from files via C-06 — no fleet connection required. Implement the
offline write buffer + reconnect relay for progress/task events (align with
`STATE-WP-0068`; keep the buffer file-backed and idempotent). MCP config
gains an explicit `dev`/`fleet` profile switch. Done when a fresh machine
reaches a working, orientation-capable dev hub from `git clone` + one command,
fully offline.

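To make "file-backed and idempotent" concrete, here is a minimal sketch of one possible buffer shape. It is not the `STATE-WP-0068` design: the `WriteBuffer` class, the JSON-lines layout, the event fields, and the `send` callback are all assumptions for illustration. Idempotency here relies on each event carrying a stable id that the upstream side can deduplicate on.

```python
import json
import os
import tempfile
import uuid


class WriteBuffer:
    """Append-only JSON-lines buffer for events recorded while offline (hypothetical)."""

    def __init__(self, path):
        self.path = path

    def append(self, kind, payload):
        """Record one event under a fresh id and return that id."""
        event = {"id": str(uuid.uuid4()), "kind": kind, "payload": payload}
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
        return event["id"]

    def pending(self):
        """Return all buffered events in write order."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def relay(self, send):
        """Send every buffered event and keep the ones whose send fails.

        `send(event)` is assumed to be idempotent upstream (keyed on
        event["id"]), so re-sending after a crash between send and rewrite
        is harmless. Returns the number of events still buffered.
        """
        remaining = []
        for event in self.pending():
            try:
                send(event)
            except OSError:          # includes ConnectionError
                remaining.append(event)
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            for event in remaining:
                fh.write(json.dumps(event) + "\n")
        os.replace(tmp, self.path)   # atomic swap of the buffer file
        return len(remaining)


# Usage: two offline writes, then a relay into a stand-in upstream that
# deduplicates by event id.
buffer = WriteBuffer(os.path.join(tempfile.mkdtemp(), "buffer.jsonl"))
buffer.append("task_status", {"task": "CUST-WP-0054-T07", "status": "todo"})
buffer.append("progress", {"note": "worked offline"})

received = {}


def send(event):
    received[event["id"]] = event


left = buffer.relay(send)
print(left, len(received))  # 0 2
```

Rewriting the file through a temporary name keeps the buffer intact if the process dies mid-relay; the cost is that already-sent events may be sent again, which is why the upstream must deduplicate.
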
## Task: ThreePhoenix increment — phoenix drill automation

```task
id: CUST-WP-0054-T08
status: todo
priority: medium
```

Compose the existing pieces (NET-WP-0020 unseal automation, S1–S3 bootstrap
chain, staged-promotion overlays, CNPG restore drills) into one rehearsable
"phoenix a machine" runbook + automation entrypoint, proven on a disposable
target (haskelseed or a VM). This is the tool the railiance02 rebirth and any
future node rotation will use. Done when a greenfield machine reaches
join-ready state unattended except for custody-gated steps.

## Task: coulombcore decommission readiness → railiance02

```task
id: CUST-WP-0054-T09
status: wait
priority: medium
```

Gated on T03–T05 drains reaching identity/OpenBao. Final inventory sweep,
data archival (episodic memory of the machine's history), DNS/cert plan for
`*.coulomb.social` names, then execute the machine phoenix via T08 automation:
wipe, rebuild as railiance02, join the fleet. Longhorn/PG-HA
(`RAIL-BS-WP-0007`, `CUST-WP-0038`) unlock once railiance01 + railiance02 are
both fleet nodes.

## Task: Workstation-off acceptance test

```task
id: CUST-WP-0054-T10
status: wait
priority: high
```

The plan's proof: workstation fully offline for 24h+ (no tunnels, no
processes). Verify afterwards from evidence alone: scheduled triage ran and
validated, consistency sweeps ran, issue emission works, hub/API/dashboards
served, forge and CI available. Then verify the inverse: workstation boots,
`custodian dev up` gives a working offline dev hub, and reconnect relays
buffered events. Done when both directions pass without manual repair.

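"Verify from evidence alone" can be phrased as a pure function over timestamps. The sketch below covers only the first half of T10 and rests on assumptions: the check names in `REQUIRED`, the shape of the `evidence` mapping, and the sample dates are hypothetical, and how evidence timestamps are actually collected is out of scope.

```python
from datetime import datetime, timedelta

# Hypothetical names for the evidence streams listed in the task text.
REQUIRED = ("triage", "consistency-sweep", "issue-emission")


def acceptance_gaps(window_start, window_end, evidence):
    """Return the reasons the offline window fails; an empty list means it passes."""
    gaps = []
    if window_end - window_start < timedelta(hours=24):
        gaps.append("window shorter than 24h")
    for check in REQUIRED:
        stamps = evidence.get(check, [])
        if not any(window_start <= t <= window_end for t in stamps):
            gaps.append(f"no {check} evidence inside window")
    return gaps


# Illustrative data: a 26h offline window with one stream missing.
start = datetime(2026, 7, 10, 8, 0)
end = start + timedelta(hours=26)
evidence = {
    "triage": [start + timedelta(hours=22)],
    "consistency-sweep": [start + timedelta(hours=3)],
    "issue-emission": [start - timedelta(hours=1)],  # before the window, so it does not count
}

print(acceptance_gaps(start, end, evidence))
# ['no issue-emission evidence inside window']
```
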
## Sequencing

```
T01 (architecture + register)
├─ T02 de-hub network ── unblocks most of T10's first half
├─ T03 placement plan/freeze ─┬─ T04 forge + CI ─┐
│                             └─ T05 hub home    ├─ T09 decommission → railiance02
├─ T06 sink decoupling                           │
├─ T07 dev beachhead                             │
└─ T08 phoenix drill ────────────────────────────┘
T10 acceptance (both halves)
```

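The same ordering can be expressed as a dependency map and grouped into waves of tasks that may run in parallel. This is a sketch using Python's standard `graphlib`; the edges for T01 through T09 are read off the diagram, while the T10 edges are an assumption (the diagram only lists T10 last), chosen as the tasks whose results T10 verifies.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it waits on.
DEPENDS_ON = {
    "T01": set(),
    "T02": {"T01"},
    "T03": {"T01"},
    "T04": {"T03"},
    "T05": {"T03"},
    "T06": {"T01"},
    "T07": {"T01"},
    "T08": {"T01"},
    "T09": {"T04", "T05", "T08"},
    "T10": {"T02", "T04", "T05", "T06", "T07"},  # assumption, see lead-in
}


def waves(depends_on):
    """Group tasks into successive waves whose members have no unmet dependencies."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()                      # raises CycleError if the map has a cycle
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


for number, wave in enumerate(waves(DEPENDS_ON), start=1):
    print(number, wave)
# 1 ['T01']
# 2 ['T02', 'T03', 'T06', 'T07', 'T08']
# 3 ['T04', 'T05']
# 4 ['T09', 'T10']
```
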
## Relationship to existing workplans

- Absorbs the *sequencing* of `RAIL-HO-WP-0005` (Forgejo) and
  `CUST-WP-0011-T08/T09` (WSL2 fallback retirement folds into T10).
- Builds on finished `RAIL-BS-WP-0006` (staged promotion) and `NET-WP-0020`
  (unseal automation).
- Defers to `RAIL-BS-WP-0007` / `CUST-WP-0038` for true 3-node HA once
  railiance02 exists.
- Extends `STATE-WP-0068` (offline write buffer) as the T07 relay mechanism.
- `CORE-WP-0005` stabilization + `CORE-WP-0007` Haskell retirement proceed
  independently; Core Hub simply appears in the T03 drain sequence.