the-custodian/workplans/CUST-WP-0054-workstation-independence-and-fleet-realignment.md
codex e8a7f49bde Record ADR-004 in-cluster Forgejo runner decision for T04
Updates forgejo-production-decisions and CUST-WP-0054-T04 partial progress.
2026-07-03 22:29:28 +02:00

---
id: CUST-WP-0054
type: workplan
title: "Workstation Independence and Fleet Role Realignment"
domain: infotech
repo: the-custodian
status: active
owner: codex
topic_slug: custodian
planning_priority: high
planning_order: 54
created: "2026-07-03"
updated: "2026-07-03"
state_hub_workstream_id: "8a828444-dd49-4d7b-a2d1-9952b5bc929d"
---
# CUST-WP-0054 - Workstation Independence and Fleet Role Realignment
## Goal
Remove every runtime dependency and always-on workload from the workstation,
realign machine roles, and make the production loop run uninterrupted with the
workstation powered off:

| Machine | Target role |
| --- | --- |
| **railiance01** | Production workloads (the first node of the growing Railiance fleet). |
| **coulombcore** | Early/legacy/experimental workloads only; frozen for new production; eventually decommissioned and **reborn as railiance02** (a machine-scale phoenix rotation). |
| **workstation** | Local, temporary dev environment to build and evolve repos — nothing depends on it being on. |

The acceptance proof for the whole plan is T10: production runs for 24h+ with
the workstation offline.
## Current-State Findings (2026-07-03 inventory)
These are the concrete dependencies this plan removes:
1. **The workstation is a production network hub.** All 16 ops-bridge tunnels
   originate here. Two production paths *chain through* it:
   - State Hub primary: everything reaches the cluster hub via the
     workstation's `state-hub-primary` forward tunnel, and railiance01's
     automations via `state-hub-railiance01` (reverse) → workstation → cluster.
   - issue-core emission: activity-core (railiance01) → workstation →
     issue-core (coulombcore).

   A workstation reboot breaks daily triage evidence, consistency sweeps, and
   task emission.
2. **Production landed on the condemned machine.** The 2026-07-02/03 cutovers
   correctly moved State Hub and Core Hub off the workstation — but onto
   coulombcore (State Hub cluster primary, Core Hub behind
   `hub.coulomb.social`, issue-core, OpenBao, ArgoCD, ESO, the identity stack,
   and Gitea all run there). Under the new roles these are production
   workloads on a legacy machine.
3. **Source of truth for code is on the condemned machine.** Every repo's
   canonical remote is `gitea.coulomb.social` (coulombcore). The Forgejo
   production migration (`RAIL-HO-WP-0005`) already targets railiance01.
4. **The consistency sweep and daily triage depend on workstation repo
   checkouts** (`/home/worsch/*` clones are what fix-consistency writes back
   to) and the triage working-memory sink references a
   `/home/worsch/the-custodian/memory/working` path identity.
5. **Images are built and published from the workstation** (docker build +
   push by hand); there is no CI runner, so releases require this machine.
6. **WSL2 State Hub fallback** still exists on the workstation
   (`CUST-WP-0011-T08/T09` stabilization window) — expected, retired by that
   workplan, not this one.
## Deployment Strategy (proposed — improves on the rough beachhead idea)
**A. Files-first beachhead, not database replication.** ADR-001 already
declares that the State Hub is a read model rebuildable from repo files. So
the dev-environment story should be: `make dev-hub` starts a **local ephemeral
hub** (compose: postgres + API + MCP, exactly what exists today) whose content
is **rebuilt from the local repo checkouts** via the C-06 registration path —
not synced from the fleet database. Offline writes (progress events, task
status) accumulate in a **write buffer** and relay upstream on reconnect
(`STATE-WP-0068 offline-write-buffer-and-edge-relay` is the existing seed for
this). Two hubs never replicate; files reconcile them. This makes the
workstation genuinely temporary: clone repos → `make dev-hub` → work offline →
push + relay when connected.

**B. Fleet mesh instead of workstation-hub networking.** Machine-to-machine
paths (railiance01 ↔ coulombcore/railiance02) get direct, persistent links
owned by the machines themselves — ops-bridge units running *on the fleet
machines* under `atm-` actor certs (ops-warden), or a minimal WireGuard mesh
if tunnel count grows. The workstation becomes a mesh *client* when present,
never a relay.

**C. Promotion-gated production, ThreePhoenix ideas incrementally.** A
workload counts as "production on railiance01" only when it conforms to the
already-finished staged-promotion contract (`RAIL-BS-WP-0006`: overlay repo,
`railiance/app.toml`, stage commands, rollback). ThreePhoenix ideas adopted
now, without waiting for three nodes:
- **CNPG-managed Postgres** per workload (already proven on coulombcore);
- **greenfield rebuild automation** (NET-WP-0020 unseal + bootstrap chain) as
the standing "phoenix drill" for single machines;
- **phoenix rotation at machine scale**: the coulombcore → railiance02
rebirth *is* the first full phoenix and should be executed with the
same automation that a future 3-node weekly rotation would use;
- Longhorn distributed storage and PG streaming HA remain deferred until a
  third node exists (`RAIL-BS-WP-0007` / `CUST-WP-0038` stay the follow-on).

**D. Decommission by attrition, not by big-bang.** coulombcore is frozen for
new production immediately (policy), drained workload-by-workload via staged
promotion onto railiance01, and only rebuilt as railiance02 when the last
production dependency (likely identity/OpenBao) has moved.
## Task: Target architecture and dependency register
```task
id: CUST-WP-0054-T01
status: done
priority: high
state_hub_task_id: "67b91b18-9ad0-4917-990a-056a7007a2d4"
```
Write the canon-adjacent architecture note (`docs/` first; promote to
`canon/architecture/` after review) fixing the three machine roles, the fleet
mesh topology, the promotion gate for "production", and the phoenix path
coulombcore → railiance02. Include the full dependency register: every
workload, tunnel, repo remote, sink path, and build pipeline with its current
host and target host. Done when every row has a target and a migration owner
(this plan's task or an existing workplan reference).
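
The "done" condition above can be checked mechanically once the register exists as data. The following is a minimal sketch, assuming the register is kept as structured rows; the `RegisterRow` type, its field names, and the sample rows are hypothetical illustrations, not the format of the actual architecture note.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegisterRow:
    """One dependency: a workload, tunnel, repo remote, sink path, or build pipeline."""
    name: str
    kind: str
    current_host: str
    target_host: Optional[str] = None
    migration_owner: Optional[str] = None  # a task of this plan or another workplan id


def incomplete_rows(rows):
    """Return the names of rows still missing a target host or a migration owner."""
    return [r.name for r in rows if not (r.target_host and r.migration_owner)]


# Illustrative rows only (hypothetical register content).
REGISTER = [
    RegisterRow("State Hub primary", "workload", "coulombcore",
                "railiance01", "CUST-WP-0054-T05"),
    RegisterRow("canonical git remote", "repo remote", "coulombcore",
                "railiance01", "CUST-WP-0054-T04"),
    # No target decided yet in this sample, so the row is reported as incomplete.
    RegisterRow("triage working-memory sink", "sink path", "workstation",
                None, "CUST-WP-0054-T06"),
]

print(incomplete_rows(REGISTER))
```

T01 would count as done when `incomplete_rows` returns an empty list for the full register.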
## Task: De-hub the network — fleet-owned direct tunnels
```task
id: CUST-WP-0054-T02
status: done
priority: high
state_hub_task_id: "4f2ae1f1-f9ad-44bb-bae7-151030634f56"
```
Remove the workstation from all production data paths:
- Run ops-bridge (or systemd ssh units) **on railiance01** with `atm-` actor
certs for the two live cross-machine lanes: railiance01 → coulombcore
issue-core, and railiance01 → cluster State Hub (replacing the
workstation-chained `state-hub-railiance01` + `state-hub-primary` pair).
- Re-point `actcore-state-hub-bridge` and `actcore-issue-core-bridge` at the
machine-local tunnel ports.
- Workstation tunnels remain only for interactive dev access (k3s API, hub
client) and may drop at any time without production impact.
- Evaluate WireGuard mesh as the successor if unit count exceeds ~5.

Done when killing every workstation tunnel leaves triage, sweeps, and
emission working (partial T10 rehearsal).
## Task: Production placement plan and freeze policy
```task
id: CUST-WP-0054-T03
status: done
priority: high
state_hub_task_id: "70a25fbd-71d7-4d74-a04b-30e775984feb"
```
Declare coulombcore frozen for new production (policy note in canon). Produce
the drain sequence with per-workload target and method (staged-promotion
overlay or documented exception): State Hub, Core Hub, issue-core, OpenBao,
identity stack (KeyCape/Authelia/privacyIDEA/lldap), ESO/ArgoCD control
plane, Gitea/Forgejo, CNPG databases. Explicitly order them by coupling
(forge and State Hub early; identity + OpenBao last, since everything
authenticates through them).
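
Ordering by coupling amounts to a topological sort: a workload is drained only after everything that depends on it has moved. The sketch below uses Python's standard `graphlib`; the workload names and the coupling map are simplified assumptions for illustration, not the real drain sequence.

```python
from graphlib import TopologicalSorter

# Hypothetical coupling map: workload -> workloads that must be drained BEFORE it.
# Identity and OpenBao list every other workload, since everything
# authenticates through them.
OTHERS = ["forge", "state-hub", "core-hub", "issue-core"]
DRAIN_AFTER = {name: set() for name in OTHERS}
DRAIN_AFTER["identity-stack"] = set(OTHERS)
DRAIN_AFTER["openbao"] = set(OTHERS)


def drain_order(drain_after):
    """Return one valid drain sequence; raises graphlib.CycleError on circular coupling."""
    return list(TopologicalSorter(drain_after).static_order())


order = drain_order(DRAIN_AFTER)
print(order)
```

A cycle in the map would surface as a `CycleError`, which is a useful signal that two workloads need a documented exception rather than a simple ordering.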
## Task: Forge to railiance01 + CI runners (kill workstation builds)
```task
id: CUST-WP-0054-T04
status: progress
priority: high
state_hub_task_id: "79b9ee4d-f792-434c-a2ea-2fe216a948ca"
```
Execute/absorb `RAIL-HO-WP-0005`: Forgejo production on railiance01 becomes
the canonical remote for all repos; coulombcore Gitea becomes a read-only
mirror until decommission. Stand up Actions runners so container images
(state-hub, core-hub, issue-core, activity-core) build and push in CI from
tags — the workstation stops being the build/publish host.

**Partial (2026-07-03):** The ADR-004 in-cluster runner (`railiance01-build-01` +
DinD) replaces the interim coulombcore host runner. Remaining: image-build
workflow on the runner, repo migration, and a release with the workstation off.
## Task: State Hub production home on railiance01
```task
id: CUST-WP-0054-T05
status: todo
priority: high
state_hub_task_id: "e91db8d0-973d-4a31-b3c2-ca37fd002ec7"
```
Move the State Hub primary from coulombcore to railiance01 using the proven
CUST-WP-0011-T07 playbook (freeze → exact-count restore → rewire). This makes
the automation loop machine-local: activity-core, Temporal, and the hub share
one machine, so daily triage and sweeps survive any other machine being down.
Prereq: railiance01 CNPG + storage reviewed (T03). Also relocate the
consistency-sweep repo checkouts to railiance01 (clones from the T04 forge)
so file writebacks no longer touch workstation paths.
## Task: Working-memory and sink path decoupling
```task
id: CUST-WP-0054-T06
status: todo
priority: medium
state_hub_task_id: "f2c5dd4b-9af4-4e8c-8619-6814e7d1666e"
```
Remove `/home/worsch/...` path identities from runtime contracts: the triage
working-memory sink and any prompt/context paths become repo-relative or
PVC-native with a defined sync-to-repo step (commit via the sweep). Done when
no ActivityDefinition or sink references a workstation-specific absolute path.
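
The "done" condition lends itself to an automated scan. This is a minimal sketch, assuming the definitions and sink configs live as text files in a directory tree; the file suffixes and the function name are assumptions, and only the `/home/worsch` prefix comes from this plan.

```python
import re
from pathlib import Path

# Workstation-specific prefix named in this plan; further prefixes could be added.
WORKSTATION_PATH = re.compile(r"/home/worsch\b")


def find_workstation_paths(root, suffixes=(".yaml", ".yml", ".toml", ".json")):
    """Yield (file, line_number, line) for each line that mentions a workstation path."""
    for path in sorted(Path(root).rglob("*")):
        # The suffix filter is an assumption about how definitions are stored.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if WORKSTATION_PATH.search(line):
                yield path, number, line.strip()
```

Run against the directory holding the definitions, an empty result would indicate that no scanned file still carries a workstation-specific absolute path.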
## Task: Dev-environment beachhead (files-first)
```task
id: CUST-WP-0054-T07
status: todo
priority: high
state_hub_task_id: "0eaf1961-a4e7-459e-b710-3e72042cdf50"
```
Implement strategy A: `make dev-hub` (or `custodian dev up`) starts the local
compose hub, registers the locally cloned repos, and rebuilds workplan/task
state from files via C-06 — no fleet connection required. Implement the
offline write buffer + reconnect relay for progress/task events (align with
`STATE-WP-0068`; keep the buffer file-backed and idempotent). MCP config
gains an explicit `dev`/`fleet` profile switch. Done when a fresh machine
reaches a working, orientation-capable dev hub from `git clone` + one command,
fully offline.
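
One way to keep the buffer file-backed and idempotent is an append-only JSON-lines file in which every event carries a unique id, so a replay after a partial relay is harmless as long as the receiver deduplicates by id. The sketch below is an illustration under those assumptions; the `WriteBuffer` class, the event shape, and the `send` callback are hypothetical and not taken from `STATE-WP-0068`.

```python
import json
import uuid
from pathlib import Path


class WriteBuffer:
    """Append-only JSON-lines buffer; each event has an id so replays can be deduplicated."""

    def __init__(self, path):
        self.path = Path(path)

    def append(self, kind, payload):
        """Record one event locally and return its id."""
        event = {"id": str(uuid.uuid4()), "kind": kind, "payload": payload}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
        return event["id"]

    def pending(self):
        """Return all buffered events in the order they were written."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def relay(self, send):
        """Send events in order; stop at the first network failure and keep the rest."""
        events = self.pending()
        sent = 0
        for event in events:
            try:
                send(event)
            except OSError:  # includes ConnectionError and timeouts
                break
            sent += 1
        remaining = events[sent:]
        # Rewrite via a temp file so an interrupted relay never truncates the buffer.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text("".join(json.dumps(e) + "\n" for e in remaining), encoding="utf-8")
        tmp.replace(self.path)
        return len(remaining)
```

If the process dies between a successful `send` and the rewrite, the event is sent again on the next relay. That is why the upstream side is assumed to ignore ids it has already seen.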
## Task: ThreePhoenix increment — phoenix drill automation
```task
id: CUST-WP-0054-T08
status: todo
priority: medium
state_hub_task_id: "ede6713e-8552-469c-bfe5-b17b015e1809"
```
Compose the existing pieces (NET-WP-0020 unseal automation, S1–S3 bootstrap
chain, staged-promotion overlays, CNPG restore drills) into one rehearsable
"phoenix a machine" runbook + automation entrypoint, proven on a disposable
target (haskelseed or a VM). This is the tool the railiance02 rebirth and any
future node rotation will use. Done when a greenfield machine reaches
join-ready state unattended except for custody-gated steps.
## Task: coulombcore decommission readiness → railiance02
```task
id: CUST-WP-0054-T09
status: wait
priority: medium
state_hub_task_id: "c6b0d0a7-88c7-46f6-9d05-5d1078df3c8c"
```
Gated on T03–T05 drains reaching identity/OpenBao. Final inventory sweep,
data archival (episodic memory of the machine's history), DNS/cert plan for
`*.coulomb.social` names, then execute the machine phoenix via T08 automation:
wipe, rebuild as railiance02, join the fleet. Longhorn/PG-HA
(`RAIL-BS-WP-0007`, `CUST-WP-0038`) unlock once railiance01 + railiance02 are
both fleet nodes.
## Task: Workstation-off acceptance test
```task
id: CUST-WP-0054-T10
status: wait
priority: high
state_hub_task_id: "cd6a31e8-c99e-4191-97d7-68d0389137b0"
```
The plan's proof: workstation fully offline for 24h+ (no tunnels, no
processes). Verify afterwards from evidence alone: scheduled triage ran and
validated, consistency sweeps ran, issue emission works, hub/API/dashboards
served, forge and CI available. Then verify the inverse: workstation boots,
`custodian dev up` gives a working offline dev hub, and reconnect relays
buffered events. Done when both directions pass without manual repair.
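
The first half of the test ("verify afterwards from evidence alone") can be reduced to one rule: the offline window lasted at least 24 hours, and every required check left at least one timestamped record inside it. The sketch below illustrates that rule; the check names and the shape of the evidence mapping are assumptions, not the hub's actual evidence schema.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical check names, one per item in the acceptance list above.
REQUIRED = ("triage", "consistency-sweep", "issue-emission", "hub-api", "forge-ci")


def first_half_passes(offline_start, offline_end, evidence):
    """evidence maps a check name to the timestamps of runs recorded by the fleet."""
    if offline_end - offline_start < timedelta(hours=24):
        return False
    return all(
        any(offline_start <= ts <= offline_end for ts in evidence.get(name, ()))
        for name in REQUIRED
    )


# Example window: exactly 25 hours with one record per check at the midpoint.
start = datetime(2026, 7, 10, 8, 0, tzinfo=timezone.utc)
end = start + timedelta(hours=25)
sample = {name: [start + timedelta(hours=12)] for name in REQUIRED}
print(first_half_passes(start, end, sample))
```

The second half (workstation boots, the dev hub comes up offline, and buffered events relay) still needs its own check and is not covered by this sketch.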
## Sequencing
```
T01 (architecture + register)
├─ T02 de-hub network ── unblocks most of T10's first half
├─ T03 placement plan/freeze ─┬─ T04 forge + CI ─┐
│                             └─ T05 hub home    ├─ T09 decommission → railiance02
├─ T06 sink decoupling                           │
├─ T07 dev beachhead                             │
└─ T08 phoenix drill ────────────────────────────┘
T10 acceptance (both halves)
```
## Relationship to existing workplans
- Absorbs the *sequencing* of `RAIL-HO-WP-0005` (Forgejo) and
`CUST-WP-0011-T08/T09` (WSL2 fallback retirement folds into T10).
- Builds on finished `RAIL-BS-WP-0006` (staged promotion) and `NET-WP-0020`
(unseal automation).
- Defers to `RAIL-BS-WP-0007` / `CUST-WP-0038` for true 3-node HA once
railiance02 exists.
- Extends `STATE-WP-0068` (offline write buffer) as the T07 relay mechanism.
- `CORE-WP-0005` stabilization + `CORE-WP-0007` Haskell retirement proceed
independently; Core Hub simply appears in the T03 drain sequence.