CUST-WP-0054 T01-T03: fleet architecture, de-hub runbook, drain plan
Documents the three-machine role model, fleet mesh topology, coulombcore freeze policy, and ordered drain sequence. Adds railiance01 systemd tunnel install assets and refreshes ops service inventory to reflect 2026-07-03 production placement (cluster State Hub, fleet mesh, draining coulombcore).
docs/fleet-mesh-dehub-runbook.md
# Fleet Mesh De-Hub Runbook (CUST-WP-0054-T02)

Date: 2026-07-03
Workplan: `CUST-WP-0054-T02`
Architecture: `docs/workstation-independence-fleet-architecture.md`

## Goal

Remove the workstation from production data paths between railiance01
(activity-core) and coulombcore (State Hub cluster, issue-core). Workstation
tunnels become interactive dev access only.

## Before (workstation hub)

```
railiance01:18000 ──reverse──► workstation:8000 ──forward──► coulombcore cluster State Hub
railiance01:18765 ──reverse──► workstation:18765 ──forward──► coulombcore cluster issue-core
```

## After (fleet-owned)

```
railiance01:18000 ──forward via SSH to coulombcore──► 10.43.170.94:8000 (State Hub)
railiance01:18765 ──forward via SSH to coulombcore──► 10.43.103.154:8765 (issue-core)
```

The activity-core `actcore-state-hub-bridge` and `actcore-issue-core-bridge` keep
proxying to `127.0.0.1:18000` and `127.0.0.1:18765` on the railiance01 node.

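
To make the fleet-owned topology concrete, the sketch below prints the kind of `ssh -L` command each forward tunnel could run. It is a dry run only: the host, key path, ports, and ClusterIPs come from this runbook, while the `forward_cmd` helper and the choice of `-N` and `ExitOnForwardFailure=yes` are assumptions, not the contents of the shipped systemd units.

```bash
# Sketch only: prints candidate forward-tunnel commands, it does not open them.
SSH_TARGET='tegwick@92.205.130.254'   # coulombcore, as used in the prerequisites check
SSH_KEY='~/.ssh/id_ops'               # static key (interim), expanded when the command is run

# forward_cmd LOCAL_PORT CLUSTER_IP REMOTE_PORT  (hypothetical helper)
forward_cmd() {
  local local_port="$1" cluster_ip="$2" remote_port="$3"
  printf 'ssh -N -o ExitOnForwardFailure=yes -i %s -L 127.0.0.1:%s:%s:%s %s\n' \
    "$SSH_KEY" "$local_port" "$cluster_ip" "$remote_port" "$SSH_TARGET"
}

forward_cmd 18000 10.43.170.94 8000    # State Hub
forward_cmd 18765 10.43.103.154 8765   # issue-core
```

Binding to `127.0.0.1` matches the addresses the bridge pods already proxy to, so nothing on the activity-core side would need to change under this assumption.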
## Prerequisites

| Item | Check |
| --- | --- |
| ops-bridge installed on railiance01 (not required while the systemd units are in use) | `which bridge` (railiance01 also ships the iproute2 `bridge` binary, so a hit alone does not prove ops-bridge is present) |
| SSH key authorized on coulombcore | `ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 true` from railiance01 |
| ClusterIPs current | Compare with the targets of the `state-hub-primary` and `issue-core-coulombcore` workstation tunnels |
| warden `atm-fleet-mesh` (target) | `cert_command` migration after static-key smoke passes |

Reference config: `infra/fleet-mesh/railiance01-tunnels.yaml`

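
Because `which bridge` cannot tell ops-bridge from the kernel tool, a version-banner check can help. The sketch below assumes the iproute2 binary identifies itself with a "bridge utility" banner when run with `-V`; the `classify_bridge` helper is hypothetical, and an `other` result only means "not iproute2", not "ops-bridge confirmed".

```bash
# classify_bridge VERSION_TEXT  (hypothetical helper)
# Prints "iproute2" when the text looks like the kernel bridge tool's banner, else "other".
classify_bridge() {
  case "$1" in
    *iproute2*|*"bridge utility"*) echo "iproute2" ;;
    *) echo "other" ;;
  esac
}

# Usage on railiance01:
# classify_bridge "$(bridge -V 2>&1)"
```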
## Install (railiance01)

railiance01 ships the kernel `bridge` utility (`iproute2`), not ops-bridge. Use the
systemd user units in `infra/fleet-mesh/systemd/` (or the installer script).

```bash
# From the-custodian repo on the workstation
bash infra/fleet-mesh/install-railiance01.sh railiance01
```

The installer copies:

- `infra/fleet-mesh/systemd/*.service` → `~/.config/systemd/user/`
- `infra/fleet-mesh/railiance01-tunnels.yaml` → `~/.config/bridge/tunnels.yaml` (reference for future ops-bridge install)
- `~/.ssh/id_ops` → railiance01 (static key interim; migrate to `atm-fleet-mesh` + `cert_command`)

Enable lingering so user units survive logout/reboot:

```bash
ssh railiance01 'sudo loginctl enable-linger tegwick'
```

## Cutover

```bash
# 1. Stop workstation reverse tunnels (one at a time, ops-bridge CLI)
bridge down state-hub-railiance01
bridge down issue-core-railiance01

# 2. Start fleet-owned forward tunnels on railiance01 (systemd)
ssh railiance01 'systemctl --user enable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'

# 3. Smoke from railiance01 node
ssh railiance01 'curl -sf http://127.0.0.1:18000/state/health && curl -sf http://127.0.0.1:18765/healthz'
```

**Cutover evidence (2026-07-03):** workstation reverse tunnels stopped;
railiance01 systemd forwards healthy; `actcore-*-bridge` pods 1/1; progress
write through fleet path succeeded (event `647b70c0`).

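
A tunnel may need a moment to come up after `enable --now`, so a one-shot smoke can fail spuriously. The retry wrapper below is a minimal sketch of one way to handle that; `wait_for` is a hypothetical helper, and the attempt count and delay in the usage lines are arbitrary choices, not values from the workplan.

```bash
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt fails.
wait_for() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      echo "ok after $i attempt(s)"
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if ((i < attempts)); then sleep "$delay"; fi
  done
  echo "failed after $attempts attempt(s)" >&2
  return 1
}

# Usage on railiance01, with the endpoints from the smoke step:
# wait_for 10 3 curl -sf http://127.0.0.1:18000/state/health
# wait_for 10 3 curl -sf http://127.0.0.1:18765/healthz
```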
## Verify production (partial T10 rehearsal)

With workstation reverse tunnels **down**, confirm:

```bash
# Bridge pods healthy
ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core | grep bridge'

# Consistency sweep API (from railiance01 cluster network)
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-state-hub-bridge:8000/state/health\").read().decode())
"'

# Issue-core bridge
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-issue-core-bridge:8765/healthz\").read().decode())
"'
```

Optional emission smoke (safe label only): trigger a known-safe activity-core
run or use the issue-core REST sink checklist from
`near-term-production-service-lanes-status.md`.

## Persist across reboot

Systemd user units are enabled via `install-railiance01.sh`. Confirm:

```bash
ssh railiance01 'loginctl show-user tegwick -p Linger; systemctl --user is-enabled fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
```

When ops-bridge is installed on railiance01, `railiance01-tunnels.yaml` is the
drop-in config; until then systemd units are the production implementation.

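
The confirm command prints a `Linger=...` line followed by one state per unit, which is easy to misread at a glance. The sketch below turns that output into a pass/fail result; `check_persist` is a hypothetical helper and assumes the two commands are captured separately, as in the usage lines.

```bash
# check_persist LINGER_LINE UNIT_STATES  (hypothetical helper)
# LINGER_LINE: output of `loginctl show-user <user> -p Linger`
# UNIT_STATES: output of `systemctl --user is-enabled <units...>`, one state per line
check_persist() {
  local linger="$1" states="$2" state
  [ "$linger" = "Linger=yes" ] || { echo "linger not enabled: $linger" >&2; return 1; }
  [ -n "$states" ] || { echo "no unit states given" >&2; return 1; }
  while IFS= read -r state; do
    [ "$state" = "enabled" ] || { echo "unit not enabled: $state" >&2; return 1; }
  done <<<"$states"
  echo "persistent"
}

# Usage from the workstation:
# check_persist "$(ssh railiance01 'loginctl show-user tegwick -p Linger')" \
#   "$(ssh railiance01 'systemctl --user is-enabled fleet-state-hub-coulombcore fleet-issue-core-coulombcore')"
```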
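## Rollback

The fleet tunnels on railiance01 run as systemd user units, and `bridge` on that
host is the iproute2 tool, so stop them with `systemctl` rather than `bridge down`.

```bash
# Stop the fleet-owned units on railiance01, then restore the workstation reverse tunnels (ops-bridge CLI)
ssh railiance01 'systemctl --user disable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
bridge up state-hub-railiance01
bridge up issue-core-railiance01
```
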
## Workstation tunnel policy after cutover

| Keep (interactive dev) | Retire from production dependency |
| --- | --- |
| `state-hub-primary` (MCP/agents) | `state-hub-railiance01` |
| `k3s-api-*` | `issue-core-railiance01` |
| `state-hub-mcp-*` | — |
| `issue-core-coulombcore` (workstation dev only) | — |

Production on railiance01 must not depend on any workstation tunnel.

## WireGuard evaluation

The current fleet mesh uses two forward tunnels (two systemd units). Per workplan
T02, a WireGuard successor is deferred until the persistent unit count exceeds
roughly five.

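
The deferral rule can be checked mechanically. The sketch below counts unit names on stdin and compares them with a threshold; `needs_wireguard_review` is a hypothetical helper, the `fleet-*.service` glob in the usage line assumes all fleet units share that prefix, and the hard threshold of 5 is a stand-in for the workplan's approximate "~5".

```bash
# needs_wireguard_review THRESHOLD  (hypothetical helper)
# Reads one unit per line on stdin; exits 0 when the count exceeds THRESHOLD.
needs_wireguard_review() {
  local threshold="$1" count
  # grep -c exits non-zero on a count of 0, so tolerate that case.
  count="$(grep -c . || true)"
  echo "persistent units: $count (threshold: $threshold)"
  ((count > threshold))
}

# Usage on railiance01:
# systemctl --user list-unit-files 'fleet-*.service' --no-legend | needs_wireguard_review 5
```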
## cert_command migration (follow-on)

Replace static `id_ops` with `atm-fleet-mesh` + `cert_command`:

1. Register `atm-fleet-mesh` in warden inventory and CoulombCore `ssh_principals.yaml`
2. Generate dedicated keypair on railiance01
3. Set `cert_command: "warden sign atm-fleet-mesh --pubkey ..."` per
   `ops-warden/wiki/playbooks/ops-bridge-tunnel-cert.md`