
codex cf4be716e1 CUST-WP-0054 T01-T03: fleet architecture, de-hub runbook, drain plan

Documents the three-machine role model, fleet mesh topology, coulombcore
freeze policy, and ordered drain sequence. Adds railiance01 systemd tunnel
install assets and refreshes ops service inventory to reflect 2026-07-03
production placement (cluster State Hub, fleet mesh, draining coulombcore).

2026-07-04 00:29:55 +02:00


Fleet Mesh De-Hub Runbook (CUST-WP-0054-T02)

Date: 2026-07-03
Workplan: CUST-WP-0054-T02
Architecture: docs/workstation-independence-fleet-architecture.md

Goal

Remove the workstation from production data paths between railiance01 (activity-core) and coulombcore (State Hub cluster, issue-core). Workstation tunnels become interactive dev access only.

Before (workstation hub)

railiance01:18000 ──reverse──► workstation:8000 ──forward──► coulombcore cluster State Hub
railiance01:18765 ──reverse──► workstation:18765 ──forward──► coulombcore cluster issue-core

After (fleet-owned)

railiance01:18000 ──forward via SSH to coulombcore──► 10.43.170.94:8000 (State Hub)
railiance01:18765 ──forward via SSH to coulombcore──► 10.43.103.154:8765 (issue-core)

The activity-core bridges actcore-state-hub-bridge and actcore-issue-core-bridge keep proxying to 127.0.0.1:18000 and 127.0.0.1:18765 on the railiance01 node; only the tunnel that listens on those ports changes.
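
As a rough illustration of the "After" topology, each fleet-owned path amounts to one SSH local forward opened from railiance01 through coulombcore. The sketch below only prints the command; it does not start a tunnel. The user, host, and key path are taken from the prerequisites table, while the function name `forward_cmd` and the specific `-o` options are assumptions made for illustration and may differ from the real unit files.

```shell
# Sketch only: print the ssh command for one forward tunnel (does not run ssh).
# forward_cmd is a hypothetical helper; option choices are illustrative.
forward_cmd() {
  local local_port="$1"                          # loopback port on railiance01
  local target="$2"                              # ClusterIP:port as seen from coulombcore
  local key="${3:-$HOME/.ssh/id_ops}"            # static key (interim)
  local gateway="${4:-tegwick@92.205.130.254}"   # coulombcore SSH endpoint
  printf 'ssh -N -i %s -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -L 127.0.0.1:%s:%s %s\n' \
    "$key" "$local_port" "$target" "$gateway"
}

forward_cmd 18000 10.43.170.94:8000    # State Hub
forward_cmd 18765 10.43.103.154:8765   # issue-core
```

Binding the listener to 127.0.0.1 keeps the forwarded ports private to the railiance01 node, which matches the addresses the bridge pods already proxy to.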

Prerequisites

Item	Check
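ops-bridge installed on railiance01 (optional; systemd units are used until then)	`which bridge` (railiance01 ships the iproute2 `bridge` binary, so a hit alone does not prove ops-bridge is present)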
SSH key authorized on coulombcore	`ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 true` from railiance01
ClusterIPs current	Compare against the targets of the `state-hub-primary` and `issue-core-coulombcore` workstation tunnels
warden `atm-fleet-mesh` (target)	`cert_command` migration after static-key smoke passes

Reference config: infra/fleet-mesh/railiance01-tunnels.yaml

Install (railiance01)

railiance01 ships the kernel bridge utility (iproute2), not ops-bridge. Use the systemd user units in infra/fleet-mesh/systemd/ (or the installer script).

# From the-custodian repo on the workstation
bash infra/fleet-mesh/install-railiance01.sh railiance01

The installer copies:

infra/fleet-mesh/systemd/*.service → ~/.config/systemd/user/
infra/fleet-mesh/railiance01-tunnels.yaml → ~/.config/bridge/tunnels.yaml (reference for future ops-bridge install)
~/.ssh/id_ops → railiance01 (static key interim; migrate to atm-fleet-mesh + cert_command)

Enable lingering so user units survive logout/reboot:

ssh railiance01 'sudo loginctl enable-linger tegwick'
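
For orientation, a user unit for one of these tunnels could look roughly like the output of the sketch below. This is not a copy of the files in infra/fleet-mesh/systemd/; the helper name `render_unit`, the description text, and the `Restart=` settings are assumptions. `WantedBy=default.target` together with lingering is what lets a user unit start at boot without a login session.

```shell
# Sketch only: render a minimal systemd user unit to stdout.
# render_unit is a hypothetical helper; the real unit files may differ.
render_unit() {
  local description="$1"   # free-text description
  local exec_start="$2"    # full command line that holds the tunnel open
  cat <<EOF
[Unit]
Description=${description}

[Service]
ExecStart=${exec_start}
Restart=always
RestartSec=5

[Install]
WantedBy=default.target
EOF
}

# %h is the systemd specifier for the user's home directory.
render_unit "Fleet forward: State Hub via coulombcore" \
  "/usr/bin/ssh -N -i %h/.ssh/id_ops -o ExitOnForwardFailure=yes -L 127.0.0.1:18000:10.43.170.94:8000 tegwick@92.205.130.254"
```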

Cutover

# 1. Stop workstation reverse tunnels (one at a time — ops-bridge CLI)
bridge down state-hub-railiance01
bridge down issue-core-railiance01

# 2. Start fleet-owned forward tunnels on railiance01 (systemd)
ssh railiance01 'systemctl --user enable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'

# 3. Smoke from railiance01 node
ssh railiance01 'curl -sf http://127.0.0.1:18000/state/health && curl -sf http://127.0.0.1:18765/healthz'

Cutover evidence (2026-07-03): workstation reverse tunnels stopped; railiance01 systemd forwards healthy; actcore-*-bridge pods 1/1; progress write through fleet path succeeded (event 647b70c0).
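
A tunnel can take a few seconds to come up after `enable --now`, so a single curl right after step 2 may fail spuriously. A small retry wrapper is one way to make the smoke step less timing-sensitive. The helper name `retry` and the attempt and delay values in the usage comment are illustrative assumptions, not part of the runbook.

```shell
# Sketch only: run a command until it succeeds or attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS command [args...]
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0                      # command succeeded
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"                # wait before the next attempt
    fi
    i=$((i + 1))
  done
  return 1                          # every attempt failed
}

# Example (not executed here):
# retry 10 3 ssh railiance01 'curl -sf http://127.0.0.1:18000/state/health'
```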

Verify production (partial T10 rehearsal)

With workstation reverse tunnels down, confirm:

# Bridge pods healthy
ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core | grep bridge'

# Consistency sweep API (from railiance01 cluster network)
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-state-hub-bridge:8000/state/health\").read().decode())
"'

# Issue-core bridge
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-issue-core-bridge:8765/healthz\").read().decode())
"'
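
The pod check above relies on reading the READY column by eye. As a sketch, the same check can be turned into an exit status by piping `kubectl get pods --no-headers` output through a small filter. The helper name `all_ready` is an assumption; it expects the default column order (NAME, READY, STATUS, ...).

```shell
# Sketch only: succeed if every pod line on stdin is fully ready and Running.
# Fails on empty input, so a missing bridge pod is treated as an error.
all_ready() {
  awk '
    {
      seen++
      split($2, r, "/")                          # READY column, e.g. 1/1
      if (r[1] != r[2] || $3 != "Running") bad++
    }
    END {
      if (seen > 0 && bad == 0) exit 0
      exit 1
    }
  '
}

# Example (not executed here):
# ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core --no-headers' \
#   | grep bridge | all_ready && echo "bridge pods ready"
```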

Optional emission smoke (safe label only): trigger a known-safe activity-core run or use the issue-core REST sink checklist from near-term-production-service-lanes-status.md.

Persist across reboot

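The systemd user units are installed by install-railiance01.sh and enabled during cutover (step 2); with lingering on, they start at boot without a login session. Confirm: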

ssh railiance01 'loginctl show-user tegwick -p Linger; systemctl --user is-enabled fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
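
The confirm command prints one `Linger=...` line followed by one state line per unit. A minimal sketch for turning that output into a pass/fail result is shown below; the helper name `check_persist` is an assumption, and it treats anything other than `Linger=yes` plus all-`enabled` as a failure.

```shell
# Sketch only: read the confirm command's output on stdin and
# succeed only if lingering is on and every listed unit is enabled.
check_persist() {
  awk '
    /^Linger=/ { if ($0 == "Linger=yes") linger = 1; next }
    NF         { units++; if ($0 != "enabled") bad++ }
    END {
      if (linger && units > 0 && bad == 0) exit 0
      exit 1
    }
  '
}

# Example (not executed here):
# ssh railiance01 'loginctl show-user tegwick -p Linger; systemctl --user is-enabled fleet-state-hub-coulombcore fleet-issue-core-coulombcore' \
#   | check_persist && echo "tunnels will survive reboot"
```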

When ops-bridge is installed on railiance01, railiance01-tunnels.yaml is the drop-in config; until then systemd units are the production implementation.

Rollback

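# 1. Stop the fleet-owned forward tunnels on railiance01 (systemd user units, not ops-bridge)
ssh railiance01 'systemctl --user disable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'

# 2. Restore the workstation reverse tunnels (ops-bridge CLI on the workstation, one at a time)
bridge up state-hub-railiance01
bridge up issue-core-railiance01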
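
The forward tunnels and the workstation reverse tunnels both listen on railiance01 ports 18000 and 18765, so the reverse tunnels cannot bind until the forwards have released those ports. A sketch of a check on `ss -ltnH` output follows; the helper name `port_listening` is an assumption, and it reads the local address from the fourth column.

```shell
# Sketch only: succeed if stdin (output of `ss -ltnH`) shows a listener on PORT.
port_listening() {
  awk -v port="$1" '
    {
      n = split($4, a, ":")            # local address column, e.g. 127.0.0.1:18000
      if (a[n] == port) found = 1
    }
    END {
      if (found) exit 0
      exit 1
    }
  '
}

# Example (not executed here): warn if a forward still holds the port.
# ssh railiance01 'ss -ltnH' | port_listening 18000 && echo "18000 still bound; stop the forward first"
```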

Workstation tunnel policy after cutover

Keep (interactive dev)	Retire from production dependency
`state-hub-primary` (MCP/agents)	`state-hub-railiance01`
`k3s-api-*`	`issue-core-railiance01`
`state-hub-mcp-*`	—
`issue-core-coulombcore` (workstation dev only)	—

Production on railiance01 must not depend on any workstation tunnel.

WireGuard evaluation

The current fleet mesh uses two forward tunnels (two persistent units). Per workplan T02, a WireGuard successor is deferred until the persistent unit count exceeds about 5.

cert_command migration (follow-on)

Replace static id_ops with atm-fleet-mesh + cert_command:

Register atm-fleet-mesh in warden inventory and CoulombCore ssh_principals.yaml
Generate dedicated keypair on railiance01
Set cert_command: "warden sign atm-fleet-mesh --pubkey ..." per ops-warden/wiki/playbooks/ops-bridge-tunnel-cert.md
