Documents the three-machine role model, fleet mesh topology, coulombcore freeze policy, and ordered drain sequence. Adds railiance01 systemd tunnel install assets and refreshes ops service inventory to reflect 2026-07-03 production placement (cluster State Hub, fleet mesh, draining coulombcore).
# Fleet Mesh De-Hub Runbook (CUST-WP-0054-T02)

Date: 2026-07-03
Workplan: `CUST-WP-0054-T02`
Architecture: `docs/workstation-independence-fleet-architecture.md`

## Goal

Remove the workstation from production data paths between railiance01
(activity-core) and coulombcore (State Hub cluster, issue-core). Workstation
tunnels become interactive dev access only.

## Before (workstation hub)

```
railiance01:18000 ──reverse──► workstation:8000 ──forward──► coulombcore cluster State Hub
railiance01:18765 ──reverse──► workstation:18765 ──forward──► coulombcore cluster issue-core
```

## After (fleet-owned)

```
railiance01:18000 ──forward via SSH to coulombcore──► 10.43.170.94:8000 (State Hub)
railiance01:18765 ──forward via SSH to coulombcore──► 10.43.103.154:8765 (issue-core)
```

The activity-core bridges `actcore-state-hub-bridge` and `actcore-issue-core-bridge`
keep proxying to `127.0.0.1:18000` and `127.0.0.1:18765` on the railiance01 node;
only what listens behind those ports changes.

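
For orientation, each line in the diagram corresponds to one OpenSSH local forward started on railiance01. The sketch below only prints the command lines and does not open a tunnel. The `fleet_forward_cmd` helper and the keepalive options are illustrative assumptions; the units in `infra/fleet-mesh/systemd/` remain the source of truth.

```bash
# Sketch only: print the ssh command line for one fleet-owned forward.
# The flags are standard OpenSSH options; whether the shipped systemd units
# use exactly these flags is an assumption.
fleet_forward_cmd() {
  local local_port="$1" target="$2"   # target is CLUSTER_IP:PORT reached via coulombcore
  printf 'ssh -N -i ~/.ssh/id_ops -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -L 127.0.0.1:%s:%s tegwick@92.205.130.254\n' \
    "$local_port" "$target"
}

fleet_forward_cmd 18000 10.43.170.94:8000    # State Hub
fleet_forward_cmd 18765 10.43.103.154:8765   # issue-core
```

Binding the listener to `127.0.0.1` keeps the forwarded ports reachable only from the railiance01 node itself, which is where the bridges expect them.
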
## Prerequisites

| Item | Check |
| --- | --- |
| ops-bridge installed on railiance01 | `which bridge` (on railiance01 this can resolve to the `iproute2` kernel utility instead; see Install) |
| SSH key authorized on coulombcore | `ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 true` from railiance01 |
| ClusterIPs current | Match the targets used by the `state-hub-primary` and `issue-core-coulombcore` workstation tunnels |
| warden `atm-fleet-mesh` (target) | `cert_command` migration after static-key smoke passes |

Reference config: `infra/fleet-mesh/railiance01-tunnels.yaml`

## Install (railiance01)

railiance01 ships the kernel `bridge` utility (`iproute2`), not ops-bridge. Use the
systemd user units in `infra/fleet-mesh/systemd/` (or the installer script).

```bash
# From the-custodian repo on the workstation
bash infra/fleet-mesh/install-railiance01.sh railiance01
```

The installer copies:

- `infra/fleet-mesh/systemd/*.service` → `~/.config/systemd/user/`
- `infra/fleet-mesh/railiance01-tunnels.yaml` → `~/.config/bridge/tunnels.yaml` (reference for future ops-bridge install)
- `~/.ssh/id_ops` → railiance01 (static key interim; migrate to `atm-fleet-mesh` + `cert_command`)

Enable lingering so user units survive logout/reboot:

```bash
ssh railiance01 'sudo loginctl enable-linger tegwick'
```

## Cutover

```bash
# 1. Stop workstation reverse tunnels (one at a time, ops-bridge CLI)
bridge down state-hub-railiance01
bridge down issue-core-railiance01

# 2. Start fleet-owned forward tunnels on railiance01 (systemd)
ssh railiance01 'systemctl --user enable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'

# 3. Smoke from railiance01 node
ssh railiance01 'curl -sf http://127.0.0.1:18000/state/health && curl -sf http://127.0.0.1:18765/healthz'
```

**Cutover evidence (2026-07-03):** workstation reverse tunnels stopped;
railiance01 systemd forwards healthy; `actcore-*-bridge` pods 1/1; progress
write through fleet path succeeded (event `647b70c0`).

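
The smoke check in step 3 can race the tunnel start in step 2, since an SSH forward may take a moment to come up. A minimal retry wrapper, assuming a hypothetical `wait_healthy` helper that is not part of the repo, might look like this:

```bash
# Sketch: poll a probe command until it succeeds or attempts run out.
# Usage: wait_healthy ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_healthy() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Run the probe quietly; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}

# On railiance01, after step 2 of the cutover:
#   wait_healthy 10 3 curl -sf http://127.0.0.1:18000/state/health
#   wait_healthy 10 3 curl -sf http://127.0.0.1:18765/healthz
```

The attempt count and delay are arbitrary starting values; a non-zero exit after the last attempt is the signal to consider the rollback section.
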
|
## Verify production (partial T10 rehearsal)

With workstation reverse tunnels **down**, confirm:

```bash
# Bridge pods healthy
ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core | grep bridge'

# Consistency sweep API (from railiance01 cluster network)
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-state-hub-bridge:8000/state/health\").read().decode())
"'

# Issue-core bridge
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-issue-core-bridge:8765/healthz\").read().decode())
"'
```

Optional emission smoke (safe label only): trigger a known-safe activity-core
run or use the issue-core REST sink checklist from
`near-term-production-service-lanes-status.md`.

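
The `grep bridge` check above still needs a human to read the READY column. As a sketch, the same output could be piped through a small filter that fails unless every bridge pod is fully ready. `bridge_pods_ready` is a hypothetical helper, and it assumes the default `kubectl get pods` column layout (NAME first, READY second).

```bash
# Sketch: read `kubectl get pods` output on stdin and succeed only if at least
# one pod with "bridge" in its name is listed and all such pods are READY n/n.
bridge_pods_ready() {
  awk '
    $1 ~ /bridge/ {
      seen++
      split($2, r, "/")                      # READY column, e.g. "1/1"
      if (r[1] != r[2] || r[1] == 0) bad++   # not all containers ready
    }
    END { exit (seen > 0 && bad == 0) ? 0 : 1 }
  '
}

# Example, run from the workstation:
#   ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core' \
#     | bridge_pods_ready && echo "bridge pods ready"
```

Treating "no bridge pods listed" as a failure is deliberate, so an empty or mislabeled selector does not pass silently.
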
|
## Persist across reboot

Systemd user units are enabled via `install-railiance01.sh`. Confirm:

```bash
ssh railiance01 'loginctl show-user tegwick -p Linger; systemctl --user is-enabled fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
```

When ops-bridge is installed on railiance01, `railiance01-tunnels.yaml` is the
drop-in config; until then systemd units are the production implementation.

|
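## Rollback

The fleet forwards on railiance01 are systemd user units (ops-bridge is not
installed there), so stop them with `systemctl`, then bring the workstation
reverse tunnels back up one at a time with the ops-bridge CLI:

```bash
ssh railiance01 'systemctl --user disable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
bridge up state-hub-railiance01
bridge up issue-core-railiance01
```
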
|
## Workstation tunnel policy after cutover

| Keep (interactive dev) | Retire from production dependency |
| --- | --- |
| `state-hub-primary` (MCP/agents) | `state-hub-railiance01` |
| `k3s-api-*` | `issue-core-railiance01` |
| `state-hub-mcp-*` | - |
| `issue-core-coulombcore` (workstation dev only) | - |

Production on railiance01 must not depend on any workstation tunnel.

|
## WireGuard evaluation

The current fleet mesh uses two forward tunnels (two systemd units). A WireGuard
successor is deferred until the persistent unit count exceeds about five, per
workplan T02.

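
To revisit that threshold later without counting by hand, a check along these lines could be run on railiance01. It is only a sketch: `wireguard_due` is a hypothetical helper, and the `fleet-*.service` naming pattern is inferred from the two unit names used in this runbook.

```bash
# Sketch: count fleet-*.service names on stdin and compare with the threshold.
# Exit status 0 means the count exceeds the threshold (time to re-evaluate).
wireguard_due() {
  local threshold="${1:-5}" count
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  count="$(grep -c '^fleet-.*\.service' || true)"
  echo "fleet units: $count (threshold: $threshold)"
  [ "$count" -gt "$threshold" ]
}

# Example, run on railiance01:
#   systemctl --user list-unit-files 'fleet-*.service' --no-legend | wireguard_due 5
```
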
|
## cert_command migration (follow-on)

Replace static `id_ops` with `atm-fleet-mesh` + `cert_command`:

1. Register `atm-fleet-mesh` in warden inventory and CoulombCore `ssh_principals.yaml`
2. Generate dedicated keypair on railiance01
3. Set `cert_command: "warden sign atm-fleet-mesh --pubkey ..."` per
   `ops-warden/wiki/playbooks/ops-bridge-tunnel-cert.md`
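
Step 2 can be sketched with plain `ssh-keygen`. The `gen_fleet_key` helper and the default path `~/.ssh/id_fleet_mesh` are hypothetical names chosen for illustration; the playbook referenced in step 3 may prescribe a different key type or location.

```bash
# Sketch of step 2: create a dedicated ed25519 keypair with no passphrase.
# The default key path is a hypothetical name, not taken from the runbook.
gen_fleet_key() {
  local key_path="${1:-$HOME/.ssh/id_fleet_mesh}"
  # Never clobber an existing key; rotate deliberately instead.
  if [ -e "$key_path" ]; then
    echo "refusing to overwrite $key_path" >&2
    return 1
  fi
  ssh-keygen -q -t ed25519 -N '' -C atm-fleet-mesh -f "$key_path"
  echo "$key_path.pub"   # public key path, for the signing step
}

# Example, run on railiance01:
#   gen_fleet_key
```

An empty passphrase is assumed here because the tunnels start unattended under systemd; the printed `.pub` path is the input that a signing step would need.
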