Documents the three-machine role model, fleet mesh topology, coulombcore freeze policy, and ordered drain sequence. Adds railiance01 systemd tunnel install assets and refreshes ops service inventory to reflect 2026-07-03 production placement (cluster State Hub, fleet mesh, draining coulombcore).
Workstation Independence and Fleet Role Architecture
Date: 2026-07-03
Status: draft (canon-adjacent; promote to canon/architecture/ after review)
Workplan: CUST-WP-0054 T01
Related: ADR-001, ADR-004, RAIL-BS-WP-0006, RAIL-HO-WP-0005, CUST-WP-0011
Purpose
Fix the three-machine role model, the fleet mesh topology, the promotion gate
for "production", and the phoenix path coulombcore → railiance02. Provide a
dependency register so every workload, tunnel, repo remote, sink path, and
build pipeline has a current host, target host, and migration owner.
The acceptance proof for the whole plan is CUST-WP-0054-T10: production runs
24h+ with the workstation fully offline.
Machine Roles
| Machine | IP / identity | Current role (2026-07-03) | Target role |
|---|---|---|---|
| railiance01 | 92.205.62.239 | First ThreePhoenix foundation node; hosts activity-core production, partial State Hub cluster footprint, automation schedules | Production home: first node of the growing Railiance fleet; hosts State Hub primary, forge, CI runners, and the automation loop |
| coulombcore | 92.205.130.254 | De-facto production host: State Hub cluster primary, Core Hub (hub.coulomb.social), issue-core, OpenBao, identity stack, ESO/ArgoCD, Gitea/registry | Frozen legacy: no new production; drain workload-by-workload; eventually wiped and reborn as railiance02 |
| workstation | bnt-lap001 / WSL2 | Production network hub (all 16 ops-bridge tunnels), State Hub client endpoint (127.0.0.1:8000), consistency-sweep writebacks, image build/publish, dev checkouts for 74 registered repos | Temporary dev environment: clone repos, run make dev-hub, push when connected; nothing in the production loop may depend on it being on |
Role invariants
- Production workloads authenticate, schedule, emit, and reconcile without the workstation.
- coulombcore is frozen for new production immediately (policy; see T03).
- A workload counts as "production on railiance01" only after passing the staged-promotion gate (see below).
- Files remain authoritative per ADR-001; fleet databases are disposable caches.
Fleet Mesh Topology
Current topology (workstation as hub)
All ops-bridge tunnels originate on the workstation. Two production data paths chain through it:
railiance01                                         workstation                             coulombcore
───────────                                         ───────────                             ───────────
activity-core ──(state-hub-railiance01 reverse)──►  :18000 ──(state-hub-primary forward)──► State Hub cluster
activity-core ──(issue-core-railiance01 reverse)──► :local ──(issue-core-coulombcore forward)──► issue-core
Live tunnel inventory (2026-07-03, bridge status):
| Tunnel | Direction | Actor | Production-critical? |
|---|---|---|---|
| state-hub-primary | workstation → coulombcore cluster | agt-claude-coulombcore | yes: MCP/agents reach cluster hub via 127.0.0.1:8000 |
| state-hub-cluster-coulombcore | workstation → coulombcore | agt-claude-coulombcore | dev/ops access |
| state-hub-railiance01 | railiance01 → workstation (reverse) | agt-claude-railiance01 | yes: activity-core reaches hub |
| state-hub-mcp-railiance01 | railiance01 → workstation (reverse) | agt-claude-railiance01 | dev MCP |
| issue-core-railiance01 | railiance01 → workstation (reverse) | agt-claude-railiance01 | yes: emission lane |
| issue-core-coulombcore | workstation → coulombcore | agt-claude-coulombcore | yes: completes emission chain |
| state-hub-coulombcore | workstation → coulombcore | agt-claude-coulombcore | legacy/dev |
| state-hub-mcp-coulombcore | workstation → coulombcore | agt-claude-coulombcore | dev MCP |
| k3s-api-coulombcore | workstation → coulombcore | agt-claude-coulombcore | operator dev |
| k3s-api-haskelseed | workstation → haskelseed | agt-claude-haskelseed | experimental |
| flex-auth-coulombcore | workstation → coulombcore | agt-claude-coulombcore | identity dev |
| core-hub-staging-coulombcore | workstation → coulombcore | agt-claude-coulombcore | staging |
| inter-hub-coulombcore | workstation → coulombcore | agt-claude-coulombcore | legacy Inter-Hub |
| state-hub-haskelseed | haskelseed → workstation | agt-claude-haskelseed | experimental |
| state-hub-mcp-haskelseed | haskelseed → workstation | agt-claude-haskelseed | experimental |
| nix-daemon-haskelseed | haskelseed → workstation | agt-claude-haskelseed | build dev |
A workstation reboot breaks daily triage evidence, consistency sweeps, and issue emission until tunnels recover.
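The workstation dependency in the inventory can be checked mechanically. The following is a minimal sketch, assuming a few rows of the table are transcribed by hand into Python; the `Tunnel` type and the `workstation_dependent` helper are hypothetical names for illustration and are not part of ops-bridge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tunnel:
    name: str
    source: str
    target: str
    production_critical: bool


# Hand-transcribed subset of the 2026-07-03 inventory: the four
# production-critical rows plus one dev row for contrast.
# Illustrative data only, not parsed from real bridge output.
INVENTORY = [
    Tunnel("state-hub-primary", "workstation", "coulombcore", True),
    Tunnel("state-hub-railiance01", "railiance01", "workstation", True),
    Tunnel("issue-core-railiance01", "railiance01", "workstation", True),
    Tunnel("issue-core-coulombcore", "workstation", "coulombcore", True),
    Tunnel("k3s-api-coulombcore", "workstation", "coulombcore", False),
]


def workstation_dependent(tunnels):
    """Names of production-critical tunnels that have the workstation as an endpoint."""
    return [
        t.name
        for t in tunnels
        if t.production_critical and "workstation" in (t.source, t.target)
    ]


if __name__ == "__main__":
    for name in workstation_dependent(INVENTORY):
        print(f"production-critical lane depends on workstation: {name}")
```

In the target topology this list should come back empty; any name it still returns is a lane that T02 has not yet moved.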
Target topology (fleet-owned mesh)
railiance01 ◄────────────────────────────────────► coulombcore (draining)
│   direct atm- tunnels (ops-bridge on-host)       │
│   State Hub API                                  │   legacy until drain complete
│   issue-core REST                                │
└─ activity-core, Temporal, sweep checkouts        └─ identity, OpenBao (last to move)

workstation (optional client)
│   interactive-only: k3s API, hub read, dev-hub
└─ may disconnect without production impact
Implementation owner: CUST-WP-0054-T02.
Key changes:
- ops-bridge (or systemd ssh units) runs on railiance01 with atm- actor certs for cross-machine lanes.
- actcore-state-hub-bridge and actcore-issue-core-bridge point at machine-local tunnel ports, not workstation forwards.
- Workstation tunnels remain for interactive dev only.
- Evaluate WireGuard mesh when persistent unit count exceeds ~5.
This posture extends ADR-004 (connectivity-first) from "workstation connects everything" to "fleet machines connect each other; workstation is a client."
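Two of the key changes can be expressed as simple checks. The sketch below is an assumption-heavy illustration: the bridge URLs and the 18001 port are hypothetical, and the threshold constant treats the "~5" in the text as a soft limit. It only inspects the address form of a bridge target; it cannot tell whether the process listening on that port is a fleet-owned tunnel or a leftover workstation forward.

```python
import ipaddress
from urllib.parse import urlsplit

# "~5" in the text, treated here as a soft limit (assumption).
WIREGUARD_REVIEW_THRESHOLD = 5

# Hypothetical bridge targets; the real configuration is not shown in this document.
BRIDGE_TARGETS = {
    "actcore-state-hub-bridge": "http://127.0.0.1:18000",
    "actcore-issue-core-bridge": "http://127.0.0.1:18001",
}


def is_machine_local(url: str) -> bool:
    """True if the URL host is 'localhost' or a loopback IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name is not provably machine-local.
        return False


def non_local_bridges(targets):
    """Names of bridges whose target is not a machine-local address."""
    return [name for name, url in targets.items() if not is_machine_local(url)]


def should_evaluate_wireguard(persistent_units: int) -> bool:
    """True once the persistent unit count exceeds the review threshold."""
    return persistent_units > WIREGUARD_REVIEW_THRESHOLD
```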
Production Promotion Gate
A workload is production on railiance01 only when it conforms to the
finished staged-promotion contract (RAIL-BS-WP-0006):
| Gate | Requirement |
|---|---|
| Overlay repo | railiance/<app>/ with app.toml and stage manifests |
| Stage commands | stage deploy, stage observe, stage promote, stage rollback proven |
| Evidence | Backup/restore drill, canary observation, operator approval recorded |
| Registry | Image in forge OCI registry with immutable tag |
Exceptions must be documented in the placement plan (T03) with explicit rollback. No exception bypasses backup evidence for stateful workloads.
coulombcore workloads still running in production today are grandfathered
legacy until their drain task completes — not newly promoted production.
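The gate table and the exception rule can be read as a small decision function. This is a sketch under stated assumptions: `GateEvidence`, its field names, and the gate labels are hypothetical and do not reflect the actual RAIL-BS-WP-0006 tooling. It encodes only what the text says: every gate must pass unless a documented exception waives it, and no exception can waive backup evidence for a stateful workload.

```python
from dataclasses import dataclass


@dataclass
class GateEvidence:
    overlay_repo: bool            # railiance/<app>/ with app.toml and stage manifests
    stage_commands_proven: bool   # deploy, observe, promote, rollback all exercised
    backup_restore_drill: bool
    canary_observed: bool
    operator_approved: bool
    immutable_image_tag: bool     # image in forge OCI registry
    stateful: bool = False
    documented_exceptions: frozenset = frozenset()


def unmet_gates(e: GateEvidence) -> list:
    """Return the gates that still block promotion, after applying exceptions."""
    checks = {
        "overlay_repo": e.overlay_repo,
        "stage_commands": e.stage_commands_proven,
        "backup_restore_drill": e.backup_restore_drill,
        "canary_observation": e.canary_observed,
        "operator_approval": e.operator_approved,
        "registry": e.immutable_image_tag,
    }
    unmet = []
    for gate, passed in checks.items():
        if passed:
            continue
        waived = gate in e.documented_exceptions
        # Backup evidence can never be waived for stateful workloads.
        if gate == "backup_restore_drill" and e.stateful:
            waived = False
        if not waived:
            unmet.append(gate)
    return unmet
```

A workload would count as production on railiance01 only when `unmet_gates` returns an empty list.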
Phoenix Path: coulombcore → railiance02
Machine-scale phoenix rotation reuses the same automation intended for future
3-node weekly rotations (RAIL-BS-WP-0007, CUST-WP-0038 deferred until
railiance02 exists).
Preconditions (drain complete)
All production dependencies moved off coulombcore per T03 ordering:
- Forge + CI (T04): repos and images no longer depend on gitea.coulomb.social
- State Hub primary (T05): cluster DB and sweep checkouts on railiance01
- Core Hub, issue-core, Inter-Hub legacy: per T03 sequence
- Identity + OpenBao: last (everything authenticates through them)
Phoenix execution
Owner: CUST-WP-0054-T09, automation: CUST-WP-0054-T08.
| Phase | Action | Tooling |
|---|---|---|
| S0 | Final inventory sweep, DNS/cert plan for *.coulomb.social, data archival | T09 |
| S1 | Wipe and greenfield rebuild | NET-WP-0020 unseal + bootstrap chain |
| S2 | Join as railiance02 | railiance-cluster overlay, atm- certs |
| S3 | Prove join-ready | Phoenix drill on disposable target first (T08) |
Longhorn distributed storage and PG streaming HA unlock once railiance01 + railiance02 are both fleet nodes.
Dev Environment (Files-First Beachhead)
Strategy A from the workplan; owner: CUST-WP-0054-T07.
git clone → make dev-hub → local ephemeral hub (compose)
│
├─ C-06 registration rebuilds workplan/task state from files
├─ offline write buffer (STATE-WP-0068) for progress/task events
└─ reconnect relay upstream; files reconcile, databases do not replicate
MCP config gains explicit dev / fleet profile switch. The workstation is
genuinely temporary: no fleet DB sync required for orientation.
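The offline write buffer and reconnect relay can be pictured as an append-only file that is drained in order once the hub is reachable. The following is a minimal sketch, not the STATE-WP-0068 design: the `OfflineBuffer` class, the JSON Lines file format, and the `send` callback are all assumptions made for illustration.

```python
import json
from pathlib import Path


class OfflineBuffer:
    """Append-only JSON Lines buffer for progress/task events written while offline."""

    def __init__(self, path):
        self.path = Path(path)

    def append(self, event: dict) -> None:
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event, sort_keys=True) + "\n")

    def pending(self) -> list:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def relay(self, send) -> int:
        """Send buffered events in order; events not yet sent stay in the buffer.

        If send raises, the exception propagates after the buffer is rewritten
        with the unsent remainder, so a later relay resumes where this one stopped.
        """
        events = self.pending()
        sent = 0
        try:
            for event in events:
                send(event)
                sent += 1
        finally:
            remaining = events[sent:]
            if remaining:
                self.path.write_text(
                    "".join(json.dumps(e, sort_keys=True) + "\n" for e in remaining),
                    encoding="utf-8",
                )
            elif self.path.exists():
                self.path.unlink()
        return sent
```

Only events travel through the buffer. Workplan and task state are still rebuilt from files, consistent with the files-first rule above.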
Dependency Register
Workloads
| Workload | Current host | Target host | Migration owner | Method / notes |
|---|---|---|---|---|
| State Hub API (primary) | coulombcore CNPG cluster via workstation tunnel state-hub-primary → 127.0.0.1:8000 | railiance01 | CUST-WP-0054-T05 | CUST-WP-0011-T07 playbook: freeze → exact-count restore → rewire |
| State Hub API (WSL2 fallback) | workstation WSL2 | retired | CUST-WP-0011-T08/T09 → absorbed by CUST-WP-0054-T10 | Stabilization window; not part of target architecture |
| activity-core | railiance01 k3s (activity-core ns) | railiance01 (retain) | n/a | Already on target machine; fix bridges in T02 |
| issue-core | coulombcore k3s | railiance01 | CUST-WP-0054-T03 drain seq. | ISSUE-WP-0003 live; emission chain fixed in T02 |
| Core Hub | coulombcore (hub.coulomb.social) | railiance01 | CORE-WP-0005 + CUST-WP-0054-T03 | Staging on coulombcore; production cutover human-gated |
| Inter-Hub (legacy Haskell) | coulombcore external | retired | CORE-WP-0007 | Rollback-only after Core Hub cutover |
| Gitea + OCI registry | coulombcore k3s | railiance01 Forgejo | RAIL-HO-WP-0005 / CUST-WP-0054-T04 | Read-only mirror on coulombcore until decommission |
| OpenBao | coulombcore | railiance01 | CUST-WP-0054-T03 (last) | NET-WP-0020 unseal automation |
| Identity stack (KeyCape, Authelia, privacyIDEA, lldap) | coulombcore | railiance01 | CUST-WP-0054-T03 (last) | Coupled to OpenBao |
| ESO + ArgoCD control plane | coulombcore | railiance01 | CUST-WP-0054-T03 | GitOps follows forge move |
| CNPG databases (per workload) | coulombcore / railiance01 | railiance01 per workload | CUST-WP-0054-T03, CUST-WP-0054-T05 | CNPG pattern proven; migrate with workload |
| llm-connect | TBD cluster | railiance01 | near-term lanes board | CCR-2026-0003 credential lane active |
| ops-hub (widget/evidence) | files + Inter-Hub widgets | railiance01 via Core Hub | CUST-WP-0025, CUST-WP-0049 | Not blocking workstation independence |
| Temporal (activity-core) | railiance01 | railiance01 (retain) | n/a | Co-locate with activity-core |
| NATS (activity-core) | railiance01 | railiance01 (retain) | n/a | Co-locate with activity-core |
Network tunnels (production-critical)
| Lane | Current path | Target path | Owner |
|---|---|---|---|
| activity-core → State Hub | railiance01 reverse → workstation → state-hub-primary → coulombcore | railiance01 atm- forward → railiance01 State Hub (local or short hop) | CUST-WP-0054-T02 |
| Agents/MCP → State Hub | workstation 127.0.0.1:8000 → state-hub-primary → coulombcore | workstation 127.0.0.1:8000 → tunnel to railiance01 hub (dev client) or fleet endpoint | CUST-WP-0054-T05 + T07 profiles |
| railiance01 automations → State Hub | :18000 chain via workstation | railiance01-local bridge port | CUST-WP-0054-T02 |
| activity-core → issue-core | railiance01 reverse → workstation → issue-core-coulombcore | railiance01 atm- forward → issue-core (on railiance01 post-drain) | CUST-WP-0054-T02, then T03 |
| Operator k3s access | workstation forwards (k3s-api-*) | workstation interactive (non-critical) | n/a |
Repo remotes
All checked 2026-07-03; pattern is uniform:
| Repo (sample) | Current remote | Target remote | Owner |
|---|---|---|---|
| the-custodian | gitea.coulomb.social/coulomb/the-custodian.git | forgejo.coulomb.social/coulomb/the-custodian.git | CUST-WP-0054-T04 |
| state-hub | gitea.coulomb.social/coulomb/state-hub.git | forgejo.coulomb.social/coulomb/state-hub.git | CUST-WP-0054-T04 |
| activity-core | gitea.coulomb.social/coulomb/activity-core.git | forgejo.coulomb.social/coulomb/activity-core.git | CUST-WP-0054-T04 |
| issue-core | gitea.coulomb.social/coulomb/issue-core.git | forgejo.coulomb.social/coulomb/issue-core.git | CUST-WP-0054-T04 |
| ops-bridge | gitea.coulomb.social/coulomb/ops-bridge.git | forgejo.coulomb.social/coulomb/ops-bridge.git | CUST-WP-0054-T04 |
| ops-warden | gitea.coulomb.social/coulomb/ops-warden.git | forgejo.coulomb.social/coulomb/ops-warden.git | CUST-WP-0054-T04 |
| core-hub | gitea.coulomb.social/coulomb/core-hub.git | forgejo.coulomb.social/coulomb/core-hub.git | CUST-WP-0054-T04 |
| (all 74 registered repos) | gitea.coulomb.social/coulomb/<slug>.git | forgejo.coulomb.social/coulomb/<slug>.git | CUST-WP-0054-T04 |
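Because the pattern is uniform, the target remote is a pure host substitution. The sketch below assumes the target host stays forgejo.coulomb.social as listed in the table (the production hostname is still an open decision) and that remotes appear either scheme-less, as in the table, or with a `scheme://` prefix. The `target_remote` helper is a hypothetical name; scp-style remotes such as `git@host:path` are rejected rather than guessed at.

```python
OLD_HOST = "gitea.coulomb.social"
NEW_HOST = "forgejo.coulomb.social"  # as listed in the table; may change after review


def target_remote(current: str) -> str:
    """Rewrite only the host of a remote, preserving any scheme and the repo path."""
    scheme, sep, remainder = current.rpartition("://")
    host, slash, path = remainder.partition("/")
    if host != OLD_HOST or not slash:
        raise ValueError(f"unexpected remote: {current!r}")
    return f"{scheme}{sep}{NEW_HOST}/{path}"


if __name__ == "__main__":
    for slug in ("the-custodian", "state-hub", "activity-core"):
        print(target_remote(f"{OLD_HOST}/coulomb/{slug}.git"))
```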
State Hub repo checkout paths
| Concern | Current | Target | Owner |
|---|---|---|---|
| local_path for 74 repos | /home/worsch/<repo> on workstation | railiance01 clone tree (e.g. /home/tegwick/<repo> or gitops-managed path) | CUST-WP-0054-T05 |
| Consistency sweep writeback host | workstation (consistency_check.py --remote via API) | railiance01 checkouts from forge | CUST-WP-0054-T05, STATE-WP-0064 |
| COULOMBCORE host_paths | /home/tegwick/<repo> (11 repos, CUST-WP-0021) | retired with coulombcore drain | CUST-WP-0054-T09 |
| Multi-host path resolution | host_paths map per hostname | fleet-primary host only + dev-hub local | CUST-WP-0054-T07 |
Sink and prompt paths
| Sink / path | Current | Target | Owner |
|---|---|---|---|
| Daily triage working-memory | /home/worsch/the-custodian/memory/working (ActivityDefinition + PVC mount) | repo-relative or PVC-native path + sweep sync-to-repo | CUST-WP-0054-T06 |
| Daily triage State Hub progress | cluster hub via workstation tunnel | railiance01 hub direct | CUST-WP-0054-T02, T05 |
| Consistency sweep progress event | via workstation-hosted sweep | railiance01-hosted sweep | CUST-WP-0054-T05, STATE-WP-0064 |
| Agent session traces (runtime/agent.py) | memory/working/agent-session-*.md on workstation | dev-hub local buffer; commit on reconnect | CUST-WP-0054-T07 |
| output_schema in ActivityDefinitions | absolute paths under /home/worsch/the-custodian/ | repo-relative resolution in activity-core | CUST-WP-0054-T06 |
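The move from absolute workstation paths to repo-relative resolution amounts to stripping one checkout root and re-applying another. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the railiance01 root used in the example is the "e.g." path from the checkout table, not a confirmed location.

```python
from pathlib import PurePosixPath


def to_repo_relative(absolute: str, repo_root: str) -> PurePosixPath:
    """Strip the checkout root; raises ValueError if the path is outside the repo."""
    return PurePosixPath(absolute).relative_to(PurePosixPath(repo_root))


def resolve_in_checkout(relative, repo_root: str) -> PurePosixPath:
    """Re-anchor a repo-relative path under whichever checkout the host uses."""
    return PurePosixPath(repo_root) / relative


if __name__ == "__main__":
    rel = to_repo_relative(
        "/home/worsch/the-custodian/memory/working",
        "/home/worsch/the-custodian",
    )
    print(rel)
    # Example target root only (assumption).
    print(resolve_in_checkout(rel, "/home/tegwick/the-custodian"))
```

Storing only the relative form in ActivityDefinitions would let the same definition run from any checkout root.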
Build and publish pipelines
| Image / artifact | Current build host | Current registry | Target build | Target registry | Owner |
|---|---|---|---|---|---|
| state-hub | workstation docker build | gitea.coulomb.social/coulomb/state-hub | Forgejo Actions runner on railiance01 | railiance01 forge OCI | CUST-WP-0054-T04 |
| core-hub | workstation / railiance-forge docs | gitea.coulomb.social/coulomb/core-hub | CI runner | railiance01 forge OCI | CUST-WP-0054-T04 |
| activity-core | workstation manual rebuild + scp | railiance01 k3s import / Gitea | CI on tag push | railiance01 forge OCI | CUST-WP-0054-T04 |
| issue-core | workstation / manual | gitea.coulomb.social/coulomb/issue-core | CI runner | railiance01 forge OCI | CUST-WP-0054-T04 |
| Haskell build agent | workstation VM (haskell-build-vm) | n/a | retired (CORE-WP-0007) | n/a | CORE-WP-0007 |
Done criterion for T01: every row above has a target and migration owner. ✓
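The done criterion is easy to re-check whenever the register changes. The sketch below assumes rows are held as plain dictionaries with hypothetical keys `name`, `target`, and `owner`. It also assumes that rows whose target is marked "(retain)" need no migration owner, which is one reading of the rows that list no owner.

```python
RETAIN_MARKER = "(retain)"


def rows_missing_fields(rows):
    """Map row name to the fields it still lacks (target and/or owner)."""
    problems = {}
    for row in rows:
        target = row.get("target", "").strip()
        owner = row.get("owner", "").strip()
        missing = []
        if not target:
            missing.append("target")
        # Assumption: a workload that stays where it is needs no migration owner.
        if RETAIN_MARKER not in target and not owner:
            missing.append("owner")
        if missing:
            problems[row["name"]] = missing
    return problems


# Two rows from the register plus one deliberately incomplete, made-up row.
SAMPLE = [
    {"name": "issue-core", "target": "railiance01", "owner": "CUST-WP-0054-T03"},
    {"name": "activity-core", "target": "railiance01 (retain)", "owner": ""},
    {"name": "hypothetical-app", "target": "", "owner": ""},
]
```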
Drain Sequence
Detailed plan: docs/coulombcore-drain-placement-plan.md
Freeze policy: canon/standards/coulombcore-production-freeze_v0.1.md
Wave 1 Forge + CI (T04)
Wave 2 State Hub primary (T05)
Wave 3 Core Hub (CORE-WP-0005)
Wave 4 issue-core
Wave 5 ESO / ArgoCD
Wave 6 Supporting apps
Wave 7 OpenBao + identity (LAST)
Wave 8 coulombcore phoenix → railiance02 (T09)
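The wave list reads as a strict order, with identity and the phoenix step held until everything else has moved. A minimal sketch of an ordering guard follows, assuming waves complete one at a time; the `next_wave` helper is hypothetical, and the detailed placement plan may allow overlap that this does not model.

```python
WAVES = [
    "Forge + CI",
    "State Hub primary",
    "Core Hub",
    "issue-core",
    "ESO / ArgoCD",
    "Supporting apps",
    "OpenBao + identity",
    "coulombcore phoenix",
]


def next_wave(completed):
    """Return the next wave to drain, or None when all are done.

    Raises ValueError if a later wave is marked complete while an
    earlier one is still open, since that would break the drain order.
    """
    for index, wave in enumerate(WAVES):
        if wave in completed:
            continue
        out_of_order = [w for w in WAVES[index + 1:] if w in completed]
        if out_of_order:
            raise ValueError(f"{out_of_order} completed before {wave!r}")
        return wave
    return None
```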
Sequencing Map
T01 (this document) ✓
├─ T02 de-hub network ✓
├─ T03 placement plan / freeze ✓
│  ├─ T04 forge + CI
│  └─ T05 State Hub home on railiance01
├─ T06 sink decoupling
├─ T07 dev beachhead
└─ T08 phoenix drill
   └─ T09 coulombcore → railiance02
      └─ T10 workstation-off acceptance
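The map can be turned into a dependency graph to derive a valid execution order. The sketch below uses the standard-library `graphlib` module (Python 3.9+). It assumes the tree nests as T08 → T09 → T10, and that each child depends only on its parent; any cross-dependencies that the tree does not draw are not modelled.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (assumed from the tree).
DEPENDS_ON = {
    "T02": {"T01"},
    "T03": {"T01"},
    "T04": {"T03"},
    "T05": {"T03"},
    "T06": {"T01"},
    "T07": {"T01"},
    "T08": {"T01"},
    "T09": {"T08"},
    "T10": {"T09"},
}


def execution_order(graph=DEPENDS_ON):
    """Return one valid order in which every task follows its prerequisites."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order()))
```

`static_order` raises `graphlib.CycleError` if the graph contains a cycle, which gives a cheap sanity check when the map is edited.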
Evidence and Inventory Sources
- Live tunnel state: bridge status (2026-07-03)
- State Hub health: http://127.0.0.1:8000/state/health (cluster primary via tunnel)
- Registered repos: GET /repos/ (74 repos, all local_path under /home/worsch/)
- ops/service-inventory.yml (2026-06-05; predates cluster cutover; refresh in T03)
- docs/infrastructure-stabilization-pickup-checkpoint.md (2026-07-03 metaplan closeout)
- Activity definitions: activity-definitions/daily-statehub-wsjf-triage.md, activity-definitions/state-hub-consistency-sweep.md
Open Gaps (not T01 blockers)
| Gap | Follow-on |
|---|---|
| Forgejo production hostname / SMTP / exposure decisions | RAIL-HO-WP-0005-T02 (human) |
| ops/service-inventory.yml stale environment labels | Refresh during T03 |
| Core Hub widget-type registry prerequisite | CORE-WP-0005-T04 |
| HA Postgres / Longhorn across 2+ nodes | RAIL-BS-WP-0007, CUST-WP-0038 after railiance02 |
Promotion to Canon
After operator review:
- Move to canon/architecture/adr-006-workstation-independence-fleet-roles.md (or equivalent ADR number).
- Update ops/service-inventory.yml environment and service rows to match.
- Link from SCOPE.md and .custodian-brief.md generation inputs.