Fleet Mesh De-Hub Runbook (CUST-WP-0054-T02)
Date: 2026-07-03
Workplan: CUST-WP-0054-T02
Architecture: docs/workstation-independence-fleet-architecture.md
Goal
Remove the workstation from production data paths between railiance01 (activity-core) and coulombcore (State Hub cluster, issue-core). Workstation tunnels become interactive dev access only.
Before (workstation hub)
railiance01:18000 ──reverse──► workstation:8000 ──forward──► coulombcore cluster State Hub
railiance01:18765 ──reverse──► workstation:18765 ──forward──► coulombcore cluster issue-core
After (fleet-owned)
railiance01:18000 ──forward via SSH to coulombcore──► 10.43.170.94:8000 (State Hub)
railiance01:18765 ──forward via SSH to coulombcore──► 10.43.103.154:8765 (issue-core)
activity-core actcore-state-hub-bridge and actcore-issue-core-bridge keep
proxying to 127.0.0.1:18000 and 127.0.0.1:18765 on the railiance01 node.
Prerequisites
| Item | Check |
|---|---|
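| ops-bridge installed on railiance01 | which bridge (confirm it is ops-bridge; railiance01 ships the iproute2 bridge utility under the same name) |
| SSH key authorized on coulombcore | From railiance01: ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 true |
| ClusterIPs current | Compare against the targets of the state-hub-primary and issue-core-coulombcore workstation tunnels |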
| warden atm-fleet-mesh (target) | cert_command migration after static-key smoke passes |
Reference config: infra/fleet-mesh/railiance01-tunnels.yaml
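Because the Install section notes that railiance01 ships the iproute2 bridge utility, a successful which bridge does not by itself show that ops-bridge is present. The following is a minimal sketch of a check that tells the two apart. It assumes the iproute2 tool prints a version banner containing "iproute2" for bridge -V; how ops-bridge responds to -V is not documented here, so anything else is reported only as "other".

```shell
# Classify a `bridge -V` banner. Assumption: the iproute2 tool prints a
# banner containing "iproute2"; any other output is reported as "other".
classify_bridge_banner() {
  case "$1" in
    *iproute2*) echo "iproute2" ;;
    *)          echo "other" ;;
  esac
}

# Inspect whatever `bridge` resolves to on the current host.
# Prints "missing" when no `bridge` binary is on PATH.
check_bridge() {
  if command -v bridge >/dev/null 2>&1; then
    classify_bridge_banner "$(bridge -V 2>&1 || true)"
  else
    echo "missing"
  fi
}
```

Run check_bridge on railiance01. A result of "iproute2" or "missing" means the systemd user units are the path to use.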
Install (railiance01)
railiance01 ships the kernel bridge utility (iproute2), not ops-bridge. Use the
systemd user units in infra/fleet-mesh/systemd/ (or the installer script).
# From the-custodian repo on the workstation
bash infra/fleet-mesh/install-railiance01.sh railiance01
The installer copies:
- infra/fleet-mesh/systemd/*.service → ~/.config/systemd/user/
- infra/fleet-mesh/railiance01-tunnels.yaml → ~/.config/bridge/tunnels.yaml (reference for future ops-bridge install)
- ~/.ssh/id_ops → railiance01 (static key interim; migrate to atm-fleet-mesh + cert_command)
Enable lingering so user units survive logout/reboot:
ssh railiance01 'sudo loginctl enable-linger tegwick'
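The unit files in infra/fleet-mesh/systemd/ are not reproduced in this runbook. As an illustration only, a forward-tunnel user unit could look like the output of the sketch below. The ports, ClusterIPs, user, host, and key path come from this document; the SSH options, the loopback bind address, and the restart policy are assumptions, and render_tunnel_unit is a hypothetical helper, not part of the installer.

```shell
# Print a hypothetical systemd user unit for one forward tunnel.
# Arguments: description, local port, ClusterIP:port target.
# SSH options and restart policy are assumptions, not the shipped units.
render_tunnel_unit() {
  cat <<EOF
[Unit]
Description=$1

[Service]
ExecStart=/usr/bin/ssh -N -i %h/.ssh/id_ops -o BatchMode=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -L 127.0.0.1:$2:$3 tegwick@92.205.130.254
Restart=always
RestartSec=5

[Install]
WantedBy=default.target
EOF
}

# Example: the State Hub forward from the "After" diagram.
render_tunnel_unit "Fleet mesh: State Hub via coulombcore" 18000 10.43.170.94:8000
```

ExitOnForwardFailure makes ssh exit when the local port cannot be bound, so Restart=always can retry instead of leaving a connected session with no listener.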
Cutover
# 1. Stop workstation reverse tunnels (one at a time — ops-bridge CLI)
bridge down state-hub-railiance01
bridge down issue-core-railiance01
# 2. Start fleet-owned forward tunnels on railiance01 (systemd)
ssh railiance01 'systemctl --user enable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
# 3. Smoke from railiance01 node
ssh railiance01 'curl -sf http://127.0.0.1:18000/state/health && curl -sf http://127.0.0.1:18765/healthz'
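The tunnels may need a few seconds to come up after step 2, so a single curl can fail spuriously. A minimal retry wrapper, sketched below, could be run on railiance01 in place of the one-shot smoke. The attempt count and delay are arbitrary assumptions, and retry and smoke_fleet_tunnels are hypothetical helper names.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: retry <attempts> <delay-seconds> <command...>
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Probe the same health endpoints as the smoke step, up to 10 tries each.
smoke_fleet_tunnels() {
  retry 10 3 curl -sf -o /dev/null http://127.0.0.1:18000/state/health &&
  retry 10 3 curl -sf -o /dev/null http://127.0.0.1:18765/healthz
}
```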
Cutover evidence (2026-07-03): workstation reverse tunnels stopped;
railiance01 systemd forwards healthy; actcore-*-bridge pods 1/1; progress
write through fleet path succeeded (event 647b70c0).
Verify production (partial T10 rehearsal)
With workstation reverse tunnels down, confirm:
# Bridge pods healthy
ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core | grep bridge'
# Consistency sweep API (from railiance01 cluster network)
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-state-hub-bridge:8000/state/health\").read().decode())
"'
# Issue-core bridge
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-issue-core-bridge:8765/healthz\").read().decode())
"'
Optional emission smoke (safe label only): trigger a known-safe activity-core
run or use the issue-core REST sink checklist from
near-term-production-service-lanes-status.md.
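The bridge pod check in this section relies on reading the grep output by eye. As a sketch, the READY column can be checked mechanically. bridge_pods_ready is a hypothetical helper that assumes the default kubectl get pods column layout (NAME first, READY second).

```shell
# Read `kubectl get pods` text on stdin. Succeeds only if at least one pod
# whose name contains "bridge" is listed and every such pod is fully ready
# (READY column like 1/1). Other lines, including the header, are ignored.
bridge_pods_ready() {
  awk '
    $1 ~ /bridge/ {
      seen = 1
      split($2, r, "/")
      if (r[1] != r[2]) bad = 1
    }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}
```

Usage from the workstation: ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core' | bridge_pods_ready && echo "bridge pods ready"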
Persist across reboot
Systemd user units are enabled via install-railiance01.sh. Confirm:
ssh railiance01 'loginctl show-user tegwick -p Linger; systemctl --user is-enabled fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
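The confirm command above prints a Linger= line followed by one state per unit. A minimal sketch that turns that output into a pass/fail result follows. persistence_ok is a hypothetical helper, and it treats any state other than "enabled" as a failure.

```shell
# Read the combined output of `loginctl show-user ... -p Linger` and
# `systemctl --user is-enabled ...` on stdin. Succeeds only if lingering
# is on and every reported unit state is exactly "enabled".
persistence_ok() {
  awk '
    /^Linger=/ { linger = ($0 == "Linger=yes"); next }
    NF         { units++; if ($0 != "enabled") bad = 1 }
    END        { if (linger && units > 0 && !bad) exit 0; exit 1 }
  '
}
```

Pipe the output of the confirm command into persistence_ok; a non-zero exit means the tunnels would not come back after a reboot.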
When ops-bridge is installed on railiance01, railiance01-tunnels.yaml is the
drop-in config; until then systemd units are the production implementation.
Rollback
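# 1. Stop fleet-owned forward tunnels on railiance01 (systemd user units; ops-bridge is not installed there)
ssh railiance01 'systemctl --user disable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
# 2. Restore workstation reverse tunnels (ops-bridge CLI, one at a time)
bridge up state-hub-railiance01
bridge up issue-core-railiance01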
Workstation tunnel policy after cutover
| Keep (interactive dev) | Retire from production dependency |
|---|---|
| state-hub-primary (MCP/agents) | state-hub-railiance01 |
| k3s-api-* | issue-core-railiance01 |
| state-hub-mcp-* | (none) |
| issue-core-coulombcore (workstation dev only) | (none) |
Production on railiance01 must not depend on any workstation tunnel.
WireGuard evaluation
The current fleet mesh uses two forward tunnels (two systemd user units). A WireGuard successor is deferred until the persistent unit count exceeds roughly 5, per workplan T02.
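The unit-count trigger can be checked with a short script. This sketch assumes fleet tunnel units keep the fleet- name prefix used by the two current units; wireguard_due is a hypothetical helper and the default threshold of 5 comes from the workplan note.

```shell
# Read unit names on stdin (one per line) and succeed when the number of
# fleet-* units exceeds the threshold (default 5).
wireguard_due() {
  threshold=${1:-5}
  count=$(grep -c '^fleet-' || true)
  [ "$count" -gt "$threshold" ]
}
```

Usage on railiance01: systemctl --user list-unit-files 'fleet-*' --no-legend | awk '{print $1}' | wireguard_due && echo "re-evaluate WireGuard"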
cert_command migration (follow-on)
Replace static id_ops with atm-fleet-mesh + cert_command:
- Register atm-fleet-mesh in warden inventory and CoulombCore ssh_principals.yaml
- Generate dedicated keypair on railiance01
- Set cert_command: "warden sign atm-fleet-mesh --pubkey ..." per ops-warden/wiki/playbooks/ops-bridge-tunnel-cert.md
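The keypair step can be sketched with ssh-keygen. The key file name id_atm_fleet_mesh and the empty passphrase (so unattended units can use the key) are assumptions, and gen_fleet_key is a hypothetical helper. The warden signing step is left out because its arguments are elided in this runbook; follow the playbook for that.

```shell
# Generate a dedicated ed25519 keypair for the fleet mesh identity.
# Assumptions: file name id_atm_fleet_mesh, empty passphrase.
# Argument: target directory (defaults to ~/.ssh).
gen_fleet_key() {
  dir=${1:-"$HOME/.ssh"}
  ssh-keygen -q -t ed25519 -N '' -C atm-fleet-mesh -f "$dir/id_atm_fleet_mesh"
}
```

Run gen_fleet_key on railiance01; the resulting id_atm_fleet_mesh.pub is the public key to submit for signing.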