# Fleet Mesh De-Hub Runbook (CUST-WP-0054-T02)

Date: 2026-07-03

Workplan: `CUST-WP-0054-T02`

Architecture: `docs/workstation-independence-fleet-architecture.md`

## Goal

Remove the workstation from production data paths between railiance01 (activity-core) and coulombcore (State Hub cluster, issue-core). Workstation tunnels become interactive dev access only.

## Before (workstation hub)

```
railiance01:18000 ──reverse──► workstation:8000 ──forward──► coulombcore cluster State Hub
railiance01:18765 ──reverse──► workstation:18765 ──forward──► coulombcore cluster issue-core
```

## After (fleet-owned)

```
railiance01:18000 ──forward via SSH to coulombcore──► 10.43.170.94:8000 (State Hub)
railiance01:18765 ──forward via SSH to coulombcore──► 10.43.103.154:8765 (issue-core)
```

The activity-core bridges `actcore-state-hub-bridge` and `actcore-issue-core-bridge` keep proxying to `127.0.0.1:18000` and `127.0.0.1:18765` on the railiance01 node.

## Prerequisites

| Item | Check |
| --- | --- |
| ops-bridge installed on railiance01 | `which bridge` (on railiance01 this can resolve to the iproute2 `bridge` instead; see Install) |
| SSH key authorized on coulombcore | `ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 true` from railiance01 |
| ClusterIPs current | `state-hub-primary` and `issue-core-coulombcore` workstation tunnels |
| warden `atm-fleet-mesh` (target) | `cert_command` migration after static-key smoke passes |

Reference config: `infra/fleet-mesh/railiance01-tunnels.yaml`

## Install (railiance01)

railiance01 ships the kernel `bridge` utility (`iproute2`), not ops-bridge. Use the systemd user units in `infra/fleet-mesh/systemd/` (or the installer script).
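
The unit files themselves are not reproduced in this runbook. As a rough orientation only, each forward in the "After" diagram corresponds to an OpenSSH local forward (`ssh -L`). The sketch below prints the equivalent commands without running them. It assumes plain OpenSSH, the static key and login from the Prerequisites table, and keepalive options chosen for illustration; `forward_cmd` is a hypothetical helper, and the actual units in `infra/fleet-mesh/systemd/` remain authoritative.

```bash
# Assumptions (not confirmed by the unit files): plain OpenSSH local forwards,
# the static key and login from the Prerequisites table, illustrative keepalives.
SSH_TARGET="tegwick@92.205.130.254"
SSH_KEY="$HOME/.ssh/id_ops"

# Print (do not run) the ssh command for one forward tunnel.
# $1 = local port on railiance01, $2 = ClusterIP:port as seen from coulombcore
forward_cmd() {
  printf 'ssh -N -i %s -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -L 127.0.0.1:%s:%s %s\n' \
    "$SSH_KEY" "$1" "$2" "$SSH_TARGET"
}

forward_cmd 18000 10.43.170.94:8000    # State Hub
forward_cmd 18765 10.43.103.154:8765   # issue-core
```

Binding to `127.0.0.1` keeps the forwarded ports local to the railiance01 node, which matches the addresses the bridge pods proxy to.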

```bash
# From the-custodian repo on the workstation
bash infra/fleet-mesh/install-railiance01.sh railiance01
```

The installer copies:

- `infra/fleet-mesh/systemd/*.service` → `~/.config/systemd/user/`
- `infra/fleet-mesh/railiance01-tunnels.yaml` → `~/.config/bridge/tunnels.yaml` (reference for future ops-bridge install)
- `~/.ssh/id_ops` → railiance01 (static key interim; migrate to `atm-fleet-mesh` + `cert_command`)

Enable lingering so user units survive logout/reboot:

```bash
ssh railiance01 'sudo loginctl enable-linger tegwick'
```

## Cutover

```bash
# 1. Stop workstation reverse tunnels (one at a time, ops-bridge CLI)
bridge down state-hub-railiance01
bridge down issue-core-railiance01

# 2. Start fleet-owned forward tunnels on railiance01 (systemd)
ssh railiance01 'systemctl --user enable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'

# 3. Smoke from railiance01 node
ssh railiance01 'curl -sf http://127.0.0.1:18000/state/health && curl -sf http://127.0.0.1:18765/healthz'
```

**Cutover evidence (2026-07-03):** workstation reverse tunnels stopped; railiance01 systemd forwards healthy; `actcore-*-bridge` pods 1/1; progress write through fleet path succeeded (event `647b70c0`).
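
A tunnel may need a moment to come up after `systemctl --user enable --now`, so a smoke check that runs immediately can fail spuriously. The following is a minimal sketch of a retrying smoke check against the same two health endpoints. The `retry` and `smoke_fleet_tunnels` helpers, the attempt count, and the delay are illustrative assumptions, not part of the installer.

```bash
# retry ATTEMPTS DELAY_SECONDS CMD...
# Rerun CMD until it succeeds or the attempts are used up.
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Check both forwarded health endpoints; meant to run on railiance01.
# 10 attempts, 2 s apart, are arbitrary illustrative values.
smoke_fleet_tunnels() {
  retry 10 2 curl -sf -o /dev/null http://127.0.0.1:18000/state/health &&
    retry 10 2 curl -sf -o /dev/null http://127.0.0.1:18765/healthz
}

# Uncomment when running on railiance01:
# smoke_fleet_tunnels && echo "fleet tunnels healthy"
```

One way to use it, assuming the block is saved as a hypothetical `smoke.sh` with the last line uncommented: `ssh railiance01 'bash -s' < smoke.sh`.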

## Verify production (partial T10 rehearsal)

With workstation reverse tunnels **down**, confirm:

```bash
# Bridge pods healthy
ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core | grep bridge'

# Consistency sweep API (from railiance01 cluster network)
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-state-hub-bridge:8000/state/health\").read().decode())
"'

# Issue-core bridge
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-issue-core-bridge:8765/healthz\").read().decode())
"'
```

Optional emission smoke (safe label only): trigger a known-safe activity-core run or use the issue-core REST sink checklist from `near-term-production-service-lanes-status.md`.

## Persist across reboot

Systemd user units are enabled via `install-railiance01.sh`. Confirm:

```bash
ssh railiance01 'loginctl show-user tegwick -p Linger; systemctl --user is-enabled fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
```

When ops-bridge is installed on railiance01, `railiance01-tunnels.yaml` is the drop-in config; until then systemd units are the production implementation.

## Rollback

The fleet-owned forwards on railiance01 are systemd user units, and `bridge` on that node is the iproute2 utility, so stop them with `systemctl` rather than the ops-bridge CLI. Then restore the workstation reverse tunnels one at a time:

```bash
ssh railiance01 'systemctl --user disable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
bridge up state-hub-railiance01
bridge up issue-core-railiance01
```

## Workstation tunnel policy after cutover

| Keep (interactive dev) | Retire from production dependency |
| --- | --- |
| `state-hub-primary` (MCP/agents) | `state-hub-railiance01` |
| `k3s-api-*` | `issue-core-railiance01` |
| `state-hub-mcp-*` | - |
| `issue-core-coulombcore` (workstation dev only) | - |

Production on railiance01 must not depend on any workstation tunnel.

## WireGuard evaluation

The current fleet mesh uses two forward tunnels (~2 units).
Per workplan T02, a WireGuard successor is deferred until the persistent unit count exceeds ~5.

## cert_command migration (follow-on)

Replace static `id_ops` with `atm-fleet-mesh` + `cert_command`:

1. Register `atm-fleet-mesh` in warden inventory and CoulombCore `ssh_principals.yaml`
2. Generate dedicated keypair on railiance01
3. Set `cert_command: "warden sign atm-fleet-mesh --pubkey ..."` per `ops-warden/wiki/playbooks/ops-bridge-tunnel-cert.md`
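
Step 2 can be sketched with standard `ssh-keygen`. This is an illustration under stated assumptions: the key type (ed25519), the file name `id_fleet_mesh`, the key comment, and the `gen_fleet_key` helper are all hypothetical, and the playbook referenced in step 3 takes precedence if it specifies otherwise.

```bash
# Hypothetical helper: create a dedicated keypair for the fleet mesh tunnels.
# $1 = directory to hold the key (for example "$HOME/.ssh" on railiance01).
# The name id_fleet_mesh, the ed25519 type, and the comment are assumptions.
gen_fleet_key() {
  key_dir="$1"
  key_path="$key_dir/id_fleet_mesh"
  mkdir -p "$key_dir" && chmod 700 "$key_dir" || return 1
  # Refuse to overwrite an existing key.
  if [ -e "$key_path" ]; then
    echo "refusing to overwrite $key_path" >&2
    return 1
  fi
  # -N '' creates the key without a passphrase so an unattended unit can use it.
  ssh-keygen -q -t ed25519 -N '' -C 'atm-fleet-mesh@railiance01' -f "$key_path"
}

# Example (run on railiance01):
# gen_fleet_key "$HOME/.ssh" && cat "$HOME/.ssh/id_fleet_mesh.pub"
```

The public half (`id_fleet_mesh.pub` in this sketch) is what a signing step would consume; the private key stays on railiance01.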