the-custodian/docs/workstation-independence-fleet-architecture.md
codex cf4be716e1 CUST-WP-0054 T01-T03: fleet architecture, de-hub runbook, drain plan
Documents the three-machine role model, fleet mesh topology, coulombcore
freeze policy, and ordered drain sequence. Adds railiance01 systemd tunnel
install assets and refreshes ops service inventory to reflect 2026-07-03
production placement (cluster State Hub, fleet mesh, draining coulombcore).
2026-07-04 00:29:55 +02:00

Workstation Independence and Fleet Role Architecture

Date: 2026-07-03
Status: draft (canon-adjacent; promote to canon/architecture/ after review)
Workplan: CUST-WP-0054 T01
Related: ADR-001, ADR-004, RAIL-BS-WP-0006, RAIL-HO-WP-0005, CUST-WP-0011

Purpose

Pin down the three-machine role model, the fleet mesh topology, the promotion gate for "production", and the phoenix path from coulombcore to railiance02. Provide a dependency register so that every workload, tunnel, repo remote, sink path, and build pipeline has a current host, a target host, and a migration owner.

The acceptance proof for the whole plan is CUST-WP-0054-T10: production runs 24h+ with the workstation fully offline.

Machine Roles

| Machine | IP / identity | Current role (2026-07-03) | Target role |
|---|---|---|---|
| railiance01 | 92.205.62.239 | First ThreePhoenix foundation node; hosts activity-core production, partial State Hub cluster footprint, automation schedules | Production home: first node of the growing Railiance fleet; hosts State Hub primary, forge, CI runners, and the automation loop |
| coulombcore | 92.205.130.254 | De-facto production host: State Hub cluster primary, Core Hub (hub.coulomb.social), issue-core, OpenBao, identity stack, ESO/ArgoCD, Gitea/registry | Frozen legacy: no new production; drain workload-by-workload; eventually wiped and reborn as railiance02 |
| workstation | bnt-lap001 / WSL2 | Production network hub (all 16 ops-bridge tunnels), State Hub client endpoint (127.0.0.1:8000), consistency-sweep writebacks, image build/publish, dev checkouts for 74 registered repos | Temporary dev environment: clone repos, run make dev-hub, push when connected; nothing in the production loop may depend on it being on |

Role invariants

  1. Production workloads authenticate, schedule, emit, and reconcile without the workstation.
  2. coulombcore is frozen for new production immediately (policy; see T03).
  3. A workload counts as "production on railiance01" only after passing the staged-promotion gate (see below).
  4. Files remain authoritative per ADR-001; fleet databases are disposable caches.

Fleet Mesh Topology

Current topology (workstation as hub)

All ops-bridge tunnels originate on the workstation. Two production data paths chain through it:

railiance01                          workstation                         coulombcore
───────────                          ───────────                         ───────────
activity-core ──(state-hub-railiance01 reverse)──► :18000 ──(state-hub-primary forward)──► State Hub cluster
activity-core ──(issue-core-railiance01 reverse)──► :local ──(issue-core-coulombcore forward)──► issue-core

Live tunnel inventory (2026-07-03, bridge status):

| Tunnel | Direction | Actor | Production-critical? |
|---|---|---|---|
| state-hub-primary | workstation → coulombcore cluster | agt-claude-coulombcore | yes: MCP/agents reach cluster hub via 127.0.0.1:8000 |
| state-hub-cluster-coulombcore | workstation → coulombcore | agt-claude-coulombcore | dev/ops access |
| state-hub-railiance01 | railiance01 → workstation (reverse) | agt-claude-railiance01 | yes: activity-core reaches hub |
| state-hub-mcp-railiance01 | railiance01 → workstation (reverse) | agt-claude-railiance01 | dev MCP |
| issue-core-railiance01 | railiance01 → workstation (reverse) | agt-claude-railiance01 | yes: emission lane |
| issue-core-coulombcore | workstation → coulombcore | agt-claude-coulombcore | yes: completes emission chain |
| state-hub-coulombcore | workstation → coulombcore | agt-claude-coulombcore | legacy/dev |
| state-hub-mcp-coulombcore | workstation → coulombcore | agt-claude-coulombcore | dev MCP |
| k3s-api-coulombcore | workstation → coulombcore | agt-claude-coulombcore | operator dev |
| k3s-api-haskelseed | workstation → haskelseed | agt-claude-haskelseed | experimental |
| flex-auth-coulombcore | workstation → coulombcore | agt-claude-coulombcore | identity dev |
| core-hub-staging-coulombcore | workstation → coulombcore | agt-claude-coulombcore | staging |
| inter-hub-coulombcore | workstation → coulombcore | agt-claude-coulombcore | legacy Inter-Hub |
| state-hub-haskelseed | haskelseed → workstation | agt-claude-haskelseed | experimental |
| state-hub-mcp-haskelseed | haskelseed → workstation | agt-claude-haskelseed | experimental |
| nix-daemon-haskelseed | haskelseed → workstation | agt-claude-haskelseed | build dev |

A workstation reboot breaks daily triage evidence, consistency sweeps, and issue emission until tunnels recover.
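
To make the blast radius of a workstation outage concrete, the inventory can be modelled as data and queried. The following is a minimal sketch, not part of ops-bridge: the `Tunnel` record and `broken_by_outage` helper are hypothetical names, and only a subset of the rows above is included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tunnel:
    name: str
    source: str      # machine where the tunnel originates
    target: str      # machine where the tunnel terminates
    critical: bool   # "Production-critical?" column


# Subset of the inventory table above (illustrative, not exhaustive).
TUNNELS = [
    Tunnel("state-hub-primary", "workstation", "coulombcore", True),
    Tunnel("state-hub-railiance01", "railiance01", "workstation", True),
    Tunnel("issue-core-railiance01", "railiance01", "workstation", True),
    Tunnel("issue-core-coulombcore", "workstation", "coulombcore", True),
    Tunnel("k3s-api-coulombcore", "workstation", "coulombcore", False),
    Tunnel("nix-daemon-haskelseed", "haskelseed", "workstation", False),
]


def broken_by_outage(tunnels, host):
    """Names of production-critical tunnels that have `host` as an endpoint."""
    return sorted(
        t.name for t in tunnels
        if t.critical and host in (t.source, t.target)
    )


if __name__ == "__main__":
    # All four production-critical lanes in this subset touch the workstation.
    print(broken_by_outage(TUNNELS, "workstation"))
```

In the target topology the same query for `"workstation"` should return an empty list.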

Target topology (fleet-owned mesh)

railiance01 ◄────────────────────────────────────► coulombcore (draining)
     │   direct atm- tunnels (ops-bridge on-host)      │
     │   State Hub API                                   │ legacy until drain complete
     │   issue-core REST                                 │
     └─ activity-core, Temporal, sweep checkouts        └─ identity, OpenBao (last to move)

workstation (optional client)
     │  interactive-only: k3s API, hub read, dev-hub
     └─ may disconnect without production impact

Implementation owner: CUST-WP-0054-T02.

Key changes:

  • ops-bridge (or systemd ssh units) runs on railiance01 with atm- actor certs for cross-machine lanes.
  • actcore-state-hub-bridge and actcore-issue-core-bridge point at machine-local tunnel ports, not workstation forwards.
  • Workstation tunnels remain for interactive dev only.
  • Evaluate WireGuard mesh when persistent unit count exceeds ~5.

This posture extends ADR-004 (connectivity-first) from "workstation connects everything" to "fleet machines connect each other; workstation is a client."
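
Once the bridges point at machine-local tunnel ports, a simple on-host probe can confirm that each lane is reachable without the workstation. This is a sketch under assumptions: the port numbers in `BRIDGE_PORTS` are hypothetical placeholders, and the probe only checks that something accepts TCP connections, not that the service behind it is healthy.

```python
import socket

# Hypothetical machine-local ports for the two bridges named above.
# The real port assignments are owned by T02 and are not specified here.
BRIDGE_PORTS = {
    "actcore-state-hub-bridge": 18000,
    "actcore-issue-core-bridge": 18001,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_bridges(bridges: dict, host: str = "127.0.0.1") -> dict:
    """Map each bridge name to whether its local tunnel port accepts connections."""
    return {name: port_is_open(host, port) for name, port in bridges.items()}


if __name__ == "__main__":
    for name, ok in check_bridges(BRIDGE_PORTS).items():
        print(f"{name}: {'up' if ok else 'DOWN'}")
```

Run on railiance01 with the workstation powered off, every entry reporting "up" would be one piece of evidence toward the T10 acceptance proof.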

Production Promotion Gate

A workload is production on railiance01 only when it conforms to the finished staged-promotion contract (RAIL-BS-WP-0006):

| Gate | Requirement |
|---|---|
| Overlay repo | `railiance/<app>/` with app.toml and stage manifests |
| Stage commands | stage deploy, stage observe, stage promote, stage rollback proven |
| Evidence | Backup/restore drill, canary observation, operator approval recorded |
| Registry | Image in forge OCI registry with immutable tag |

Exceptions must be documented in the placement plan (T03) with explicit rollback. No exception bypasses backup evidence for stateful workloads.

coulombcore workloads still running in production today are grandfathered legacy until their drain task completes — not newly promoted production.
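
The gate can be read as a checklist with one hard rule on exceptions. The sketch below is one possible encoding, not the RAIL-BS-WP-0006 contract itself: `GateEvidence`, `gate_failures`, and `is_production` are hypothetical names, and the way a documented exception waives non-backup items is an interpretation of the paragraph above.

```python
from dataclasses import dataclass, field

REQUIRED_STAGE_COMMANDS = frozenset({"deploy", "observe", "promote", "rollback"})


@dataclass
class GateEvidence:
    overlay_repo: bool = False
    stage_commands_proven: set = field(default_factory=set)
    backup_restore_drill: bool = False
    canary_observed: bool = False
    operator_approved: bool = False
    immutable_image_tag: bool = False


def gate_failures(e: GateEvidence) -> list:
    """List every gate requirement that the evidence does not satisfy."""
    failures = []
    if not e.overlay_repo:
        failures.append("overlay repo")
    missing = REQUIRED_STAGE_COMMANDS - set(e.stage_commands_proven)
    if missing:
        failures.append("stage commands: " + ", ".join(sorted(missing)))
    if not e.backup_restore_drill:
        failures.append("backup/restore drill")
    if not e.canary_observed:
        failures.append("canary observation")
    if not e.operator_approved:
        failures.append("operator approval")
    if not e.immutable_image_tag:
        failures.append("immutable image tag")
    return failures


def is_production(e: GateEvidence, stateful: bool,
                  documented_exception: bool = False) -> bool:
    """Apply the gate; a documented exception never waives backup evidence
    for a stateful workload (assumed reading of the exception rule)."""
    failures = gate_failures(e)
    if not failures:
        return True
    if not documented_exception:
        return False
    return not (stateful and "backup/restore drill" in failures)
```

Grandfathered coulombcore workloads would simply never be passed through this check; they stay legacy until drained.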

Phoenix Path: coulombcore → railiance02

Machine-scale phoenix rotation reuses the same automation intended for future 3-node weekly rotations (RAIL-BS-WP-0007, CUST-WP-0038 deferred until railiance02 exists).

Preconditions (drain complete)

All production dependencies moved off coulombcore per T03 ordering:

  1. Forge + CI (T04) — repos and images no longer depend on gitea.coulomb.social
  2. State Hub primary (T05) — cluster DB and sweep checkouts on railiance01
  3. Core Hub, issue-core, Inter-Hub legacy — per T03 sequence
  4. Identity + OpenBao — last (everything authenticates through them)
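
The ordering above can be checked mechanically against a placement map. This is a minimal sketch: the workload keys and the `placement` structure are hypothetical labels chosen for illustration, and the authoritative sequence remains the T03 placement plan.

```python
LEGACY_HOST = "coulombcore"

# Drain steps in the order listed above; workload keys are illustrative labels.
DRAIN_STEPS = [
    ("1 Forge + CI (T04)", ["forge", "ci"]),
    ("2 State Hub primary (T05)", ["state-hub-primary"]),
    ("3 Core Hub, issue-core, Inter-Hub", ["core-hub", "issue-core", "inter-hub"]),
    ("4 Identity + OpenBao", ["identity", "openbao"]),
]


def next_drain_step(placement: dict):
    """Return (step label, workloads still on the legacy host) for the first
    unfinished step, or None when every step is drained."""
    for label, workloads in DRAIN_STEPS:
        pending = [w for w in workloads if placement.get(w) == LEGACY_HOST]
        if pending:
            return label, pending
    return None


def phoenix_ready(placement: dict) -> bool:
    """The phoenix precondition holds only when nothing is left to drain."""
    return next_drain_step(placement) is None


if __name__ == "__main__":
    placement = {
        "forge": "railiance01", "ci": "railiance01",
        "state-hub-primary": "coulombcore",
        "core-hub": "coulombcore", "issue-core": "coulombcore",
        "inter-hub": "coulombcore",
        "identity": "coulombcore", "openbao": "coulombcore",
    }
    print(next_drain_step(placement))
    print(phoenix_ready(placement))
```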

Phoenix execution

Owner: CUST-WP-0054-T09, automation: CUST-WP-0054-T08.

| Phase | Action | Tooling |
|---|---|---|
| S0 | Final inventory sweep, DNS/cert plan for `*.coulomb.social`, data archival | T09 |
| S1 | Wipe and greenfield rebuild | NET-WP-0020 unseal + bootstrap chain |
| S2 | Join as railiance02 | railiance-cluster overlay, atm- certs |
| S3 | Prove join-ready | Phoenix drill on disposable target first (T08) |

Longhorn distributed storage and PG streaming HA unlock once railiance01 + railiance02 are both fleet nodes.

Dev Environment (Files-First Beachhead)

Strategy A from the workplan; owner: CUST-WP-0054-T07.

git clone → make dev-hub → local ephemeral hub (compose)
                │
                ├─ C-06 registration rebuilds workplan/task state from files
                ├─ offline write buffer (STATE-WP-0068) for progress/task events
                └─ reconnect relay upstream; files reconcile, databases do not replicate

MCP config gains an explicit dev / fleet profile switch. The workstation is genuinely temporary: no fleet DB sync is required for orientation.
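
The offline write buffer and reconnect relay in the flow above could look roughly like the following. This is a sketch of the general technique (an append-only JSON Lines queue drained in order), not the STATE-WP-0068 design; the `OfflineBuffer` class and the `send` callback are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable


class OfflineBuffer:
    """Append-only JSON Lines queue for progress/task events written offline."""

    def __init__(self, path: Path):
        self.path = path

    def append(self, event: dict) -> None:
        """Record one event locally; works with no upstream connection."""
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")

    def pending(self) -> list:
        """Return buffered events in the order they were written."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def relay(self, send: Callable[[dict], bool]) -> int:
        """Send events upstream in order; stop at the first failure and
        keep the unsent remainder in the buffer. Returns the count sent."""
        events = self.pending()
        sent = 0
        for event in events:
            if not send(event):
                break
            sent += 1
        remaining = events[sent:]
        self.path.write_text(
            "".join(json.dumps(e) + "\n" for e in remaining),
            encoding="utf-8",
        )
        return sent
```

Only events are relayed this way; workplan and task state are still rebuilt from files on registration, which is why no database replication is needed.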

Dependency Register

Workloads

| Workload | Current host | Target host | Migration owner | Method / notes |
|---|---|---|---|---|
| State Hub API (primary) | coulombcore CNPG cluster via workstation tunnel state-hub-primary (127.0.0.1:8000) | railiance01 | CUST-WP-0054-T05 | CUST-WP-0011-T07 playbook: freeze → exact-count restore → rewire |
| State Hub API (WSL2 fallback) | workstation WSL2 | retired | CUST-WP-0011-T08/T09 → absorbed by CUST-WP-0054-T10 | Stabilization window; not part of target architecture |
| activity-core | railiance01 k3s (activity-core ns) | railiance01 (retain) | | Already on target machine; fix bridges in T02 |
| issue-core | coulombcore k3s | railiance01 | CUST-WP-0054-T03 drain seq. | ISSUE-WP-0003 live; emission chain fixed in T02 |
| Core Hub | coulombcore (hub.coulomb.social) | railiance01 | CORE-WP-0005 + CUST-WP-0054-T03 | Staging on coulombcore; production cutover human-gated |
| Inter-Hub (legacy Haskell) | coulombcore external | retired | CORE-WP-0007 | Rollback-only after Core Hub cutover |
| Gitea + OCI registry | coulombcore k3s | railiance01 Forgejo | RAIL-HO-WP-0005 / CUST-WP-0054-T04 | Read-only mirror on coulombcore until decommission |
| OpenBao | coulombcore | railiance01 | CUST-WP-0054-T03 (last) | NET-WP-0020 unseal automation |
| Identity stack (KeyCape, Authelia, privacyIDEA, lldap) | coulombcore | railiance01 | CUST-WP-0054-T03 (last) | Coupled to OpenBao |
| ESO + ArgoCD control plane | coulombcore | railiance01 | CUST-WP-0054-T03 | GitOps follows forge move |
| CNPG databases (per workload) | coulombcore / railiance01 | railiance01 per workload | CUST-WP-0054-T03, CUST-WP-0054-T05 | CNPG pattern proven; migrate with workload |
| llm-connect | TBD cluster | railiance01 | near-term lanes board | CCR-2026-0003 credential lane active |
| ops-hub (widget/evidence) | files + Inter-Hub widgets | railiance01 via Core Hub | CUST-WP-0025, CUST-WP-0049 | Not blocking workstation independence |
| Temporal (activity-core) | railiance01 | railiance01 (retain) | | Co-locate with activity-core |
| NATS (activity-core) | railiance01 | railiance01 (retain) | | Co-locate with activity-core |

Network tunnels (production-critical)

| Lane | Current path | Target path | Owner |
|---|---|---|---|
| activity-core → State Hub | railiance01 reverse → workstation → state-hub-primary → coulombcore | railiance01 atm- forward → railiance01 State Hub (local or short hop) | CUST-WP-0054-T02 |
| Agents/MCP → State Hub | workstation 127.0.0.1:8000 (state-hub-primary) → coulombcore | workstation 127.0.0.1:8000 → tunnel to railiance01 hub (dev client) or fleet endpoint | CUST-WP-0054-T05 + T07 profiles |
| railiance01 automations → State Hub | :18000 chain via workstation | railiance01-local bridge port | CUST-WP-0054-T02 |
| activity-core → issue-core | railiance01 reverse → workstation → issue-core-coulombcore | railiance01 atm- forward → issue-core (on railiance01 post-drain) | CUST-WP-0054-T02, then T03 |
| Operator k3s access | workstation forwards (`k3s-api-*`) | workstation interactive (non-critical) | |

Repo remotes

All checked 2026-07-03; pattern is uniform:

| Repo (sample) | Current remote | Target remote | Owner |
|---|---|---|---|
| the-custodian | gitea.coulomb.social/coulomb/the-custodian.git | forgejo.coulomb.social/coulomb/the-custodian.git | CUST-WP-0054-T04 |
| state-hub | gitea.coulomb.social/coulomb/state-hub.git | forgejo.coulomb.social/coulomb/state-hub.git | CUST-WP-0054-T04 |
| activity-core | gitea.coulomb.social/coulomb/activity-core.git | forgejo.coulomb.social/coulomb/activity-core.git | CUST-WP-0054-T04 |
| issue-core | gitea.coulomb.social/coulomb/issue-core.git | forgejo.coulomb.social/coulomb/issue-core.git | CUST-WP-0054-T04 |
| ops-bridge | gitea.coulomb.social/coulomb/ops-bridge.git | forgejo.coulomb.social/coulomb/ops-bridge.git | CUST-WP-0054-T04 |
| ops-warden | gitea.coulomb.social/coulomb/ops-warden.git | forgejo.coulomb.social/coulomb/ops-warden.git | CUST-WP-0054-T04 |
| core-hub | gitea.coulomb.social/coulomb/core-hub.git | forgejo.coulomb.social/coulomb/core-hub.git | CUST-WP-0054-T04 |
| (all 74 registered repos) | `gitea.coulomb.social/coulomb/<slug>.git` | `forgejo.coulomb.social/coulomb/<slug>.git` | CUST-WP-0054-T04 |
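
Because the pattern is uniform, retargeting a remote is a host substitution. The helper below is a hedged sketch: `retarget` is a hypothetical function, the URL forms it accepts (bare, https, and scp-style ssh) are assumptions about how remotes are configured, and the Forgejo hostname is taken from the table although the production hostname is still listed as an open decision.

```python
import re

OLD_HOST = "gitea.coulomb.social"
NEW_HOST = "forgejo.coulomb.social"  # from the table; final hostname is still open

# Optional scheme, optional user@, then the old host followed by ':' or '/'.
_REMOTE = re.compile(
    r"^(?P<prefix>(?:https?://|ssh://)?(?:[^@/]+@)?)"
    + re.escape(OLD_HOST)
    + r"(?P<rest>[:/].*)$"
)


def retarget(remote: str) -> str:
    """Rewrite a remote on the old forge host to the new one.
    Remotes on any other host are returned unchanged."""
    m = _REMOTE.match(remote)
    if m is None:
        return remote
    return f"{m.group('prefix')}{NEW_HOST}{m.group('rest')}"


if __name__ == "__main__":
    print(retarget("https://gitea.coulomb.social/coulomb/state-hub.git"))
    # https://forgejo.coulomb.social/coulomb/state-hub.git
```

The result would then be applied per checkout with `git remote set-url origin <new-url>`.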

State Hub repo checkout paths

| Concern | Current | Target | Owner |
|---|---|---|---|
| local_path for 74 repos | `/home/worsch/<repo>` on workstation | railiance01 clone tree (e.g. `/home/tegwick/<repo>` or gitops-managed path) | CUST-WP-0054-T05 |
| Consistency sweep writeback host | workstation (consistency_check.py --remote via API) | railiance01 checkouts from forge | CUST-WP-0054-T05, STATE-WP-0064 |
| COULOMBCORE host_paths | `/home/tegwick/<repo>` (11 repos, CUST-WP-0021) | retired with coulombcore drain | CUST-WP-0054-T09 |
| Multi-host path resolution | host_paths map per hostname | fleet-primary host only + dev-hub local | CUST-WP-0054-T07 |

Sink and prompt paths

| Sink / path | Current | Target | Owner |
|---|---|---|---|
| Daily triage working-memory | /home/worsch/the-custodian/memory/working (ActivityDefinition + PVC mount) | repo-relative or PVC-native path + sweep sync-to-repo | CUST-WP-0054-T06 |
| Daily triage State Hub progress | cluster hub via workstation tunnel | railiance01 hub direct | CUST-WP-0054-T02, T05 |
| Consistency sweep progress | event via workstation-hosted sweep | railiance01-hosted sweep | CUST-WP-0054-T05, STATE-WP-0064 |
| Agent session traces (runtime/agent.py) | `memory/working/agent-session-*.md` on workstation | dev-hub local buffer; commit on reconnect | CUST-WP-0054-T07 |
| output_schema in ActivityDefinitions | absolute paths under /home/worsch/the-custodian/ | repo-relative resolution in activity-core | CUST-WP-0054-T06 |
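
Moving from absolute workstation paths to repo-relative resolution is a two-step transformation: strip the legacy root once, then join against whichever checkout root the running host uses. The sketch below assumes POSIX paths; `to_repo_relative` and `resolve` are hypothetical helpers, not activity-core functions, and the railiance01 checkout root is only the example path from the register.

```python
from pathlib import PurePosixPath

# Absolute prefix currently baked into ActivityDefinitions on the workstation.
LEGACY_ROOT = PurePosixPath("/home/worsch/the-custodian")


def to_repo_relative(path: str, legacy_root: PurePosixPath = LEGACY_ROOT) -> str:
    """Convert a legacy absolute path into a repo-relative one.
    Paths that are already relative pass through unchanged."""
    p = PurePosixPath(path)
    if not p.is_absolute():
        return p.as_posix()
    try:
        return p.relative_to(legacy_root).as_posix()
    except ValueError:
        raise ValueError(f"{path} is outside {legacy_root}") from None


def resolve(relative: str, checkout_root: str) -> str:
    """Join a repo-relative path onto the checkout root of the current host."""
    return (PurePosixPath(checkout_root) / relative).as_posix()


if __name__ == "__main__":
    rel = to_repo_relative("/home/worsch/the-custodian/memory/working")
    print(rel)  # memory/working
    print(resolve(rel, "/home/tegwick/the-custodian"))
```

Storing only the relative form in definitions keeps them valid on any host, including a dev-hub checkout.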

Build and publish pipelines

| Image / artifact | Current build host | Current registry | Target build | Target registry | Owner |
|---|---|---|---|---|---|
| state-hub | workstation docker build | gitea.coulomb.social/coulomb/state-hub | Forgejo Actions runner on railiance01 | railiance01 forge OCI | CUST-WP-0054-T04 |
| core-hub | workstation / railiance-forge docs | gitea.coulomb.social/coulomb/core-hub | CI runner | railiance01 forge OCI | CUST-WP-0054-T04 |
| activity-core | workstation manual rebuild + scp | railiance01 k3s import / Gitea | CI on tag push | railiance01 forge OCI | CUST-WP-0054-T04 |
| issue-core | workstation / manual | gitea.coulomb.social/coulomb/issue-core | CI runner | railiance01 forge OCI | CUST-WP-0054-T04 |
| Haskell build agent | workstation VM (haskell-build-vm) | n/a | retired (CORE-WP-0007) | n/a | CORE-WP-0007 |

Done criterion for T01: every row above has a target and migration owner. ✓

Drain Sequence

Detailed plan: docs/coulombcore-drain-placement-plan.md
Freeze policy: canon/standards/coulombcore-production-freeze_v0.1.md

Wave 1  Forge + CI (T04)
Wave 2  State Hub primary (T05)
Wave 3  Core Hub (CORE-WP-0005)
Wave 4  issue-core
Wave 5  ESO / ArgoCD
Wave 6  Supporting apps
Wave 7  OpenBao + identity (LAST)
Wave 8  coulombcore phoenix → railiance02 (T09)

Sequencing Map

T01 (this document) ✓
 ├─ T02 de-hub network ✓
 ├─ T03 placement plan / freeze ✓
 │    ├─ T04 forge + CI
 │    └─ T05 State Hub home on railiance01
 ├─ T06 sink decoupling
 ├─ T07 dev beachhead
 └─ T08 phoenix drill
      └─ T09 coulombcore → railiance02
           └─ T10 workstation-off acceptance
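
The tree above is a dependency graph, so a valid execution order and the set of currently unblocked tasks can be derived from it. The sketch below encodes only the parent/child edges shown in the tree, using the standard-library `graphlib` module (Python 3.9+); `ready_tasks` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on, read directly from the tree above.
DEPENDS_ON = {
    "T02": {"T01"},
    "T03": {"T01"},
    "T04": {"T03"},
    "T05": {"T03"},
    "T06": {"T01"},
    "T07": {"T01"},
    "T08": {"T01"},
    "T09": {"T08"},
    "T10": {"T09"},
}


def ready_tasks(done: set) -> list:
    """Tasks that are not done yet and whose dependencies are all done."""
    return sorted(
        task for task, deps in DEPENDS_ON.items()
        if task not in done and deps <= done
    )


if __name__ == "__main__":
    # One valid order; independent tasks may appear in any relative order.
    print(list(TopologicalSorter(DEPENDS_ON).static_order()))
    # With T01-T03 marked done, as in the map above:
    print(ready_tasks({"T01", "T02", "T03"}))
    # ['T04', 'T05', 'T06', 'T07', 'T08']
```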

Evidence and Inventory Sources

  • Live tunnel state: bridge status (2026-07-03)
  • State Hub health: http://127.0.0.1:8000/state/health (cluster primary via tunnel)
  • Registered repos: GET /repos/ — 74 repos, all local_path under /home/worsch/
  • ops/service-inventory.yml (2026-06-05; predates cluster cutover — refresh in T03)
  • docs/infrastructure-stabilization-pickup-checkpoint.md (2026-07-03 metaplan closeout)
  • Activity definitions: activity-definitions/daily-statehub-wsjf-triage.md, activity-definitions/state-hub-consistency-sweep.md

Open Gaps (not T01 blockers)

| Gap | Follow-on |
|---|---|
| Forgejo production hostname / SMTP / exposure decisions | RAIL-HO-WP-0005-T02 (human) |
| ops/service-inventory.yml stale environment labels | Refresh during T03 |
| Core Hub widget-type registry prerequisite | CORE-WP-0005-T04 |
| HA Postgres / Longhorn across 2+ nodes | RAIL-BS-WP-0007, CUST-WP-0038 after railiance02 |

Promotion to Canon

After operator review:

  1. Move to canon/architecture/adr-006-workstation-independence-fleet-roles.md (or equivalent ADR number).
  2. Update ops/service-inventory.yml environment and service rows to match.
  3. Link from SCOPE.md and .custodian-brief.md generation inputs.