# Fleet Mesh De-Hub Runbook (CUST-WP-0054-T02)

Date: 2026-07-03  
Workplan: `CUST-WP-0054-T02`  
Architecture: `docs/workstation-independence-fleet-architecture.md`

## Goal

Remove the workstation from production data paths between railiance01
(activity-core) and coulombcore (State Hub cluster, issue-core). Workstation
tunnels become interactive dev access only.

## Before (workstation hub)

```
railiance01:18000 ──reverse──► workstation:8000 ──forward──► coulombcore cluster State Hub
railiance01:18765 ──reverse──► workstation:18765 ──forward──► coulombcore cluster issue-core
```

## After (fleet-owned)

```
railiance01:18000 ──forward via SSH to coulombcore──► 10.43.170.94:8000 (State Hub)
railiance01:18765 ──forward via SSH to coulombcore──► 10.43.103.154:8765 (issue-core)
```

The activity-core bridges `actcore-state-hub-bridge` and `actcore-issue-core-bridge`
keep proxying to `127.0.0.1:18000` and `127.0.0.1:18765` on the railiance01 node.
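
For orientation, each fleet-owned forward amounts to an SSH local port forward from railiance01 through coulombcore to a ClusterIP. The sketch below only prints the equivalent `ssh` command; it is not the content of the units in `infra/fleet-mesh/systemd/`. The helper name `fleet_forward_cmd`, the `127.0.0.1` bind address, and the keepalive options are assumptions. The key path, user, host, and targets come from this runbook.

```bash
# Hypothetical helper: print (do not run) the ssh command for one forward.
# Usage: fleet_forward_cmd <local_port> <cluster_ip:port>
fleet_forward_cmd() {
  local local_port="$1" target="$2"
  # -N: no remote command; -L: local forward. ExitOnForwardFailure makes ssh
  # exit if the local port cannot be bound, so a supervisor can restart it.
  printf 'ssh -N -i ~/.ssh/id_ops -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -L 127.0.0.1:%s:%s tegwick@92.205.130.254\n' \
    "$local_port" "$target"
}

fleet_forward_cmd 18000 10.43.170.94:8000    # State Hub
fleet_forward_cmd 18765 10.43.103.154:8765   # issue-core
```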

## Prerequisites

| Item | Check |
| --- | --- |
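| ops-bridge installed on railiance01 (optional; systemd units are the interim path) | `which bridge` (on railiance01 this currently resolves to the `iproute2` utility, not ops-bridge) |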
| SSH key authorized on coulombcore | `ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 true` from railiance01 |
| ClusterIPs current | Match the targets of the `state-hub-primary` and `issue-core-coulombcore` workstation tunnels |
| warden `atm-fleet-mesh` (target) | `cert_command` migration after static-key smoke passes |

Reference config: `infra/fleet-mesh/railiance01-tunnels.yaml`

## Install (railiance01)

railiance01 ships the kernel `bridge` utility (`iproute2`), not ops-bridge. Use the
systemd user units in `infra/fleet-mesh/systemd/` (or the installer script).

```bash
# From the-custodian repo on the workstation
bash infra/fleet-mesh/install-railiance01.sh railiance01
```

The installer copies:

- `infra/fleet-mesh/systemd/*.service` → `~/.config/systemd/user/`
- `infra/fleet-mesh/railiance01-tunnels.yaml` → `~/.config/bridge/tunnels.yaml` (reference for future ops-bridge install)
- `~/.ssh/id_ops` → railiance01 (static key interim; migrate to `atm-fleet-mesh` + `cert_command`)

Enable lingering so user units survive logout/reboot:

```bash
ssh railiance01 'sudo loginctl enable-linger tegwick'
```

## Cutover

```bash
# 1. On the workstation: stop the reverse tunnels (ops-bridge CLI, one at a time)
bridge down state-hub-railiance01
bridge down issue-core-railiance01

# 2. Start fleet-owned forward tunnels on railiance01 (systemd)
ssh railiance01 'systemctl --user enable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'

# 3. Smoke from railiance01 node
ssh railiance01 'curl -sf http://127.0.0.1:18000/state/health && curl -sf http://127.0.0.1:18765/healthz'
```
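
The forwards may need a moment to come up after `enable --now`, so a single `curl` can fail spuriously. A minimal retry wrapper, meant to run on railiance01, could look like the sketch below. The helper name `wait_healthy`, the default of 10 attempts, and the 2-second delay are assumptions, not part of the installer.

```bash
# Hypothetical helper: poll a health URL until it answers or attempts run out.
# Usage: wait_healthy <url> [attempts] [delay_seconds]
wait_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f turns HTTP errors (status >= 400) into a non-zero exit status.
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "unhealthy after $attempts attempts: $url" >&2
  return 1
}
```

On railiance01 this would be called as `wait_healthy http://127.0.0.1:18000/state/health && wait_healthy http://127.0.0.1:18765/healthz`.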

**Cutover evidence (2026-07-03):** workstation reverse tunnels stopped;
railiance01 systemd forwards healthy; `actcore-*-bridge` pods 1/1; progress
write through fleet path succeeded (event `647b70c0`).

## Verify production (partial T10 rehearsal)

With workstation reverse tunnels **down**, confirm:

```bash
# Bridge pods healthy
ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core | grep bridge'

# Consistency sweep API (from railiance01 cluster network)
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-state-hub-bridge:8000/state/health\").read().decode())
"'

# Issue-core bridge
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-issue-core-bridge:8765/healthz\").read().decode())
"'
```
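
The `grep bridge` check above only lists pods and leaves the judgement to the reader. As a sketch, the same output can be turned into a pass/fail result. The helper name `bridge_pods_ready` is hypothetical, and the parsing assumes the default `kubectl get pods` column order (NAME, READY, STATUS, ...).

```bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin.
# Succeeds only if at least one bridge pod is listed and every bridge pod is
# Running with all of its containers ready (READY column n/n).
bridge_pods_ready() {
  awk '
    BEGIN { seen = 0; bad = 0 }
    $1 ~ /bridge/ {
      seen++
      split($2, ready, "/")
      if (ready[1] != ready[2] || $3 != "Running") bad++
    }
    END { if (seen > 0 && bad == 0) exit 0; exit 1 }
  '
}
```

From the workstation: `ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core --no-headers' | bridge_pods_ready`.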

Optional emission smoke (safe label only): trigger a known-safe activity-core
run or use the issue-core REST sink checklist from
`near-term-production-service-lanes-status.md`.

## Persist across reboot

`install-railiance01.sh` installs the systemd user units, and cutover step 2 enables them. Confirm lingering and enablement:

```bash
ssh railiance01 'loginctl show-user tegwick -p Linger; systemctl --user is-enabled fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
```
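
The confirm command prints one `Linger=...` line followed by one state line per unit. A small parser, sketched below, can reduce that to a single exit status. The helper name `persist_ok` is hypothetical.

```bash
# Hypothetical helper: read the confirm command's output on stdin.
# Succeeds only if lingering is on and every unit state line reads "enabled".
persist_ok() {
  awk '
    BEGIN      { linger = 0; units = 0; bad = 0 }
    /^Linger=/ { linger = ($0 == "Linger=yes"); next }
    NF         { units++; if ($0 != "enabled") bad++ }
    END        { if (linger && units > 0 && bad == 0) exit 0; exit 1 }
  '
}
```

Pipe the output of the `ssh railiance01 '...'` confirm command into `persist_ok`. A non-zero status means either lingering is off or at least one unit is not enabled.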

When ops-bridge is installed on railiance01, `railiance01-tunnels.yaml` is the
drop-in config; until then systemd units are the production implementation.

## Rollback

```bash
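# railiance01 runs systemd units, not ops-bridge; free ports 18000/18765 first
ssh railiance01 'systemctl --user disable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
bridge up state-hub-railiance01
bridge up issue-core-railiance01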
```

## Workstation tunnel policy after cutover

| Keep (interactive dev) | Retire from production dependency |
| --- | --- |
| `state-hub-primary` (MCP/agents) | `state-hub-railiance01` |
| `k3s-api-*` | `issue-core-railiance01` |
| `state-hub-mcp-*` | — |
| `issue-core-coulombcore` (workstation dev only) | — |

Production on railiance01 must not depend on any workstation tunnel.

## WireGuard evaluation

The current fleet mesh uses two forward tunnels (two persistent systemd units).
A WireGuard successor is deferred until the persistent unit count exceeds about
five, per workplan T02.
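
The threshold can be checked mechanically. The sketch below assumes that all persistent fleet mesh units share the `fleet-` name prefix, as the two current units do. The helper name `wireguard_review_due` and the default threshold of 5 are illustrative.

```bash
# Hypothetical helper: succeed when the unit count exceeds the threshold.
# Usage: wireguard_review_due <unit_count> [threshold]
wireguard_review_due() {
  local count="$1" threshold="${2:-5}"
  [ "$count" -gt "$threshold" ]
}
```

On railiance01 the count could come from `systemctl --user list-unit-files 'fleet-*.service' --state=enabled --no-legend | wc -l`.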

## cert_command migration (follow-on)

Replace static `id_ops` with `atm-fleet-mesh` + `cert_command`:

1. Register `atm-fleet-mesh` in warden inventory and CoulombCore `ssh_principals.yaml`
2. Generate dedicated keypair on railiance01
3. Set `cert_command: "warden sign atm-fleet-mesh --pubkey ..."` per
   `ops-warden/wiki/playbooks/ops-bridge-tunnel-cert.md`
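
Step 2 can be sketched as follows. The helper name `gen_fleet_key`, the key path, and the key comment are assumptions. The empty passphrase is chosen so that an unattended tunnel can load the key. The exact `warden sign` invocation stays as documented in the playbook.

```bash
# Hypothetical helper: create an Ed25519 keypair with no passphrase.
# Refuses to overwrite an existing key file.
# Usage: gen_fleet_key <key_path>
gen_fleet_key() {
  local key="$1"
  if [ -e "$key" ]; then
    echo "refusing to overwrite $key" >&2
    return 1
  fi
  # -q: quiet; -N '': empty passphrase; -C: comment stored in the public key.
  ssh-keygen -q -t ed25519 -N '' -C 'atm-fleet-mesh@railiance01' -f "$key"
}
```

Run on railiance01, for example as `gen_fleet_key ~/.ssh/atm-fleet-mesh`. The resulting `.pub` file is the public key that the signing step consumes.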