CUST-WP-0054 T01-T03: fleet architecture, de-hub runbook, drain plan

Documents the three-machine role model, fleet mesh topology, coulombcore
freeze policy, and ordered drain sequence. Adds railiance01 systemd tunnel
install assets and refreshes ops service inventory to reflect 2026-07-03
production placement (cluster State Hub, fleet mesh, draining coulombcore).
codex
2026-07-04 00:29:55 +02:00
parent 0a77483861
commit cf4be716e1
10 changed files with 1050 additions and 34 deletions


@@ -0,0 +1,298 @@
# Workstation Independence and Fleet Role Architecture
Date: 2026-07-03
Status: draft (canon-adjacent; promote to `canon/architecture/` after review)
Workplan: `CUST-WP-0054` T01
Related: `ADR-001`, `ADR-004`, `RAIL-BS-WP-0006`, `RAIL-HO-WP-0005`, `CUST-WP-0011`
## Purpose
Pin down the three-machine role model, the fleet mesh topology, the promotion gate
for "production", and the phoenix path `coulombcore → railiance02`. Provide a
dependency register so every workload, tunnel, repo remote, sink path, and
build pipeline has a **current host**, **target host**, and **migration owner**.
The acceptance proof for the whole plan is `CUST-WP-0054-T10`: production runs
24h+ with the workstation fully offline.
## Machine Roles
| Machine | IP / identity | Current role (2026-07-03) | Target role |
| --- | --- | --- | --- |
| **railiance01** | `92.205.62.239` | First ThreePhoenix foundation node; hosts activity-core production, partial State Hub cluster footprint, automation schedules | **Production home** — first node of the growing Railiance fleet; hosts State Hub primary, forge, CI runners, and the automation loop |
| **coulombcore** | `92.205.130.254` | De-facto production host: State Hub cluster primary, Core Hub (`hub.coulomb.social`), issue-core, OpenBao, identity stack, ESO/ArgoCD, Gitea/registry | **Frozen legacy** — no new production; drain workload-by-workload; eventually wiped and **reborn as railiance02** |
| **workstation** | `bnt-lap001` / WSL2 | Production network hub (all 16 ops-bridge tunnels), State Hub client endpoint (`127.0.0.1:8000`), consistency-sweep writebacks, image build/publish, dev checkouts for 74 registered repos | **Temporary dev environment** — clone repos, run `make dev-hub`, push when connected; nothing in the production loop may depend on it being on |
### Role invariants
1. Production workloads authenticate, schedule, emit, and reconcile without the
workstation.
2. `coulombcore` is frozen for new production immediately (policy; see T03).
3. A workload counts as "production on railiance01" only after passing the
staged-promotion gate (see below).
4. Files remain authoritative per ADR-001; fleet databases are disposable caches.
## Fleet Mesh Topology
### Current topology (workstation as hub)
All ops-bridge tunnels originate on the workstation. Two production data paths
**chain through** it:
```
railiance01                                         workstation                                  coulombcore
───────────                                         ───────────                                  ───────────
activity-core ──(state-hub-railiance01 reverse)───► :18000 ──(state-hub-primary forward)───────► State Hub cluster
activity-core ──(issue-core-railiance01 reverse)──► :local ──(issue-core-coulombcore forward)──► issue-core
```
Live tunnel inventory (2026-07-03, `bridge status`):
| Tunnel | Direction | Actor | Production-critical? |
| --- | --- | --- | --- |
| `state-hub-primary` | workstation → coulombcore cluster | `agt-claude-coulombcore` | **yes** — MCP/agents reach cluster hub via `127.0.0.1:8000` |
| `state-hub-cluster-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | dev/ops access |
| `state-hub-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | **yes** — activity-core reaches hub |
| `state-hub-mcp-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | dev MCP |
| `issue-core-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | **yes** — emission lane |
| `issue-core-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | **yes** — completes emission chain |
| `state-hub-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | legacy/dev |
| `state-hub-mcp-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | dev MCP |
| `k3s-api-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | operator dev |
| `k3s-api-haskelseed` | workstation → haskelseed | `agt-claude-haskelseed` | experimental |
| `flex-auth-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | identity dev |
| `core-hub-staging-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | staging |
| `inter-hub-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | legacy Inter-Hub |
| `state-hub-haskelseed` | haskelseed → workstation | `agt-claude-haskelseed` | experimental |
| `state-hub-mcp-haskelseed` | haskelseed → workstation | `agt-claude-haskelseed` | experimental |
| `nix-daemon-haskelseed` | haskelseed → workstation | `agt-claude-haskelseed` | build dev |
A workstation reboot breaks daily triage evidence, consistency sweeps, and
issue emission until tunnels recover.
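
The blast radius of a workstation outage can be read off the inventory mechanically. The sketch below is illustrative only: the `Tunnel` record and `broken_by_outage` helper are hypothetical names, and the list is a hand-copied subset of the table above, not output parsed from `bridge status`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tunnel:
    name: str
    source: str
    dest: str
    production_critical: bool


# Hand-copied subset of the inventory table (not the full list of 16).
TUNNELS = [
    Tunnel("state-hub-primary", "workstation", "coulombcore", True),
    Tunnel("state-hub-railiance01", "railiance01", "workstation", True),
    Tunnel("issue-core-railiance01", "railiance01", "workstation", True),
    Tunnel("issue-core-coulombcore", "workstation", "coulombcore", True),
    Tunnel("k3s-api-coulombcore", "workstation", "coulombcore", False),
    Tunnel("nix-daemon-haskelseed", "haskelseed", "workstation", False),
]


def broken_by_outage(tunnels, host):
    """Names of production-critical tunnels that have `host` as an endpoint."""
    return sorted(
        t.name
        for t in tunnels
        if t.production_critical and host in (t.source, t.dest)
    )


print(broken_by_outage(TUNNELS, "workstation"))
```

For this subset, every production-critical tunnel has the workstation as an endpoint, which is the dependency the target topology is meant to remove.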
### Target topology (fleet-owned mesh)
```
railiance01 ◄────────────────────────────────────► coulombcore (draining)
    │      direct atm- tunnels (ops-bridge on-host)     │
    │  State Hub API                                    │  legacy until drain complete
    │  issue-core REST                                  │
    └─ activity-core, Temporal, sweep checkouts         └─ identity, OpenBao (last to move)

workstation (optional client)
    │  interactive-only: k3s API, hub read, dev-hub
    └─ may disconnect without production impact
```
Implementation owner: `CUST-WP-0054-T02`.
Key changes:
- ops-bridge (or systemd ssh units) runs **on railiance01** with `atm-` actor
certs for cross-machine lanes.
- `actcore-state-hub-bridge` and `actcore-issue-core-bridge` point at
machine-local tunnel ports, not workstation forwards.
- Workstation tunnels remain for interactive dev only.
- Evaluate WireGuard mesh when persistent unit count exceeds ~5.
This posture extends ADR-004 (connectivity-first) from "workstation connects
everything" to "fleet machines connect each other; workstation is a client."
## Production Promotion Gate
A workload is **production on railiance01** only when it conforms to the
finished staged-promotion contract (`RAIL-BS-WP-0006`):
| Gate | Requirement |
| --- | --- |
| Overlay repo | `railiance/<app>/` with `app.toml` and stage manifests |
| Stage commands | `stage deploy`, `stage observe`, `stage promote`, `stage rollback` proven |
| Evidence | Backup/restore drill, canary observation, operator approval recorded |
| Registry | Image in forge OCI registry with immutable tag |
**Exceptions** must be documented in the placement plan (T03) with explicit
rollback. No exception bypasses backup evidence for stateful workloads.
`coulombcore` workloads still running in production today are **grandfathered
legacy** until their drain task completes — not newly promoted production.
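
The gate and its exception rule can be expressed as a small predicate. This is a minimal sketch under stated assumptions: the `Workload` fields and function names are hypothetical, and it assumes a documented exception may waive any unmet gate except backup evidence for a stateful workload. The text above fixes only that last restriction; the rest is one possible reading.

```python
from dataclasses import dataclass

# Gate names are hypothetical labels for the rows of the table above.
GATES = (
    "overlay_repo",
    "stage_commands_proven",
    "backup_restore_drill",
    "canary_observed",
    "operator_approval",
    "immutable_image_tag",
)


@dataclass
class Workload:
    name: str
    stateful: bool
    overlay_repo: bool = False
    stage_commands_proven: bool = False
    backup_restore_drill: bool = False
    canary_observed: bool = False
    operator_approval: bool = False
    immutable_image_tag: bool = False
    # True when the placement plan (T03) records an exception with rollback.
    documented_exception: bool = False


def missing_gates(w):
    """List the gates this workload has not yet satisfied."""
    return [g for g in GATES if not getattr(w, g)]


def is_production(w):
    """Apply the promotion gate, honouring documented exceptions."""
    missing = missing_gates(w)
    if not missing:
        return True
    if not w.documented_exception:
        return False
    # No exception bypasses backup evidence for stateful workloads.
    return not (w.stateful and "backup_restore_drill" in missing)
```

Under this reading, a stateless workload with a documented exception passes, while a stateful one without a backup/restore drill never does.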
## Phoenix Path: coulombcore → railiance02
Machine-scale phoenix rotation reuses the same automation intended for future
3-node weekly rotations (`RAIL-BS-WP-0007`, `CUST-WP-0038` deferred until
railiance02 exists).
### Preconditions (drain complete)
All production dependencies moved off coulombcore per T03 ordering:
1. Forge + CI (T04) — repos and images no longer depend on `gitea.coulomb.social`
2. State Hub primary (T05) — cluster DB and sweep checkouts on railiance01
3. Core Hub, issue-core, Inter-Hub legacy — per T03 sequence
4. Identity + OpenBao — **last** (everything authenticates through them)
### Phoenix execution
Owner: `CUST-WP-0054-T09`, automation: `CUST-WP-0054-T08`.
| Phase | Action | Tooling |
| --- | --- | --- |
| S0 | Final inventory sweep, DNS/cert plan for `*.coulomb.social`, data archival | T09 |
| S1 | Wipe and greenfield rebuild | `NET-WP-0020` unseal + bootstrap chain |
| S2 | Join as `railiance02` | `railiance-cluster` overlay, `atm-` certs |
| S3 | Prove join-ready | Phoenix drill on disposable target first (T08) |
Longhorn distributed storage and PG streaming HA unlock once railiance01 +
railiance02 are both fleet nodes.
## Dev Environment (Files-First Beachhead)
Strategy A from the workplan; owner: `CUST-WP-0054-T07`.
```
git clone → make dev-hub → local ephemeral hub (compose)
├─ C-06 registration rebuilds workplan/task state from files
├─ offline write buffer (STATE-WP-0068) for progress/task events
└─ reconnect relay upstream; files reconcile, databases do not replicate
```
The MCP config gains an explicit `dev` / `fleet` profile switch. The workstation is
genuinely temporary: orientation requires no fleet DB sync.
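
The buffer-then-relay behaviour in the diagram above could look roughly like the following. This is a sketch, not the `STATE-WP-0068` design: the `OfflineBuffer` class, the JSON Lines file format, and the use of `ConnectionError` as the "still offline" signal are all assumptions made for illustration.

```python
import json
from pathlib import Path


class OfflineBuffer:
    """Append-only local buffer for progress/task events (hypothetical)."""

    def __init__(self, path):
        self.path = Path(path)

    def append(self, event):
        # One JSON document per line, so partial relays are easy to rewrite.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")

    def pending(self):
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def relay(self, send):
        """Send events in order; stop at the first failure and keep the rest."""
        events = self.pending()
        sent = 0
        for event in events:
            try:
                send(event)
            except ConnectionError:
                break
            sent += 1
        remaining = events[sent:]
        self.path.write_text(
            "".join(json.dumps(e) + "\n" for e in remaining),
            encoding="utf-8",
        )
        return sent
```

Here `send` stands in for whatever upstream call the fleet profile would use. Only events are relayed; workplan and task state are still rebuilt from files, in line with ADR-001.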
## Dependency Register
### Workloads
| Workload | Current host | Target host | Migration owner | Method / notes |
| --- | --- | --- | --- | --- |
| State Hub API (primary) | coulombcore CNPG cluster via workstation tunnel `state-hub-primary` → `127.0.0.1:8000` | railiance01 | `CUST-WP-0054-T05` | `CUST-WP-0011-T07` playbook: freeze → exact-count restore → rewire |
| State Hub API (WSL2 fallback) | workstation WSL2 | retired | `CUST-WP-0011-T08/T09` → absorbed by `CUST-WP-0054-T10` | Stabilization window; not part of target architecture |
| activity-core | railiance01 k3s (`activity-core` ns) | railiance01 (retain) | — | Already on target machine; fix bridges in T02 |
| issue-core | coulombcore k3s | railiance01 | `CUST-WP-0054-T03` drain seq. | `ISSUE-WP-0003` live; emission chain fixed in T02 |
| Core Hub | coulombcore (`hub.coulomb.social`) | railiance01 | `CORE-WP-0005` + `CUST-WP-0054-T03` | Staging on coulombcore; production cutover human-gated |
| Inter-Hub (legacy Haskell) | coulombcore external | retired | `CORE-WP-0007` | Rollback-only after Core Hub cutover |
| Gitea + OCI registry | coulombcore k3s | railiance01 Forgejo | `RAIL-HO-WP-0005` / `CUST-WP-0054-T04` | Read-only mirror on coulombcore until decommission |
| OpenBao | coulombcore | railiance01 | `CUST-WP-0054-T03` (last) | NET-WP-0020 unseal automation |
| Identity stack (KeyCape, Authelia, privacyIDEA, lldap) | coulombcore | railiance01 | `CUST-WP-0054-T03` (last) | Coupled to OpenBao |
| ESO + ArgoCD control plane | coulombcore | railiance01 | `CUST-WP-0054-T03` | GitOps follows forge move |
| CNPG databases (per workload) | coulombcore / railiance01 | railiance01 per workload | `CUST-WP-0054-T03`, `CUST-WP-0054-T05` | CNPG pattern proven; migrate with workload |
| llm-connect | TBD cluster | railiance01 | near-term lanes board | `CCR-2026-0003` credential lane active |
| ops-hub (widget/evidence) | files + Inter-Hub widgets | railiance01 via Core Hub | `CUST-WP-0025`, `CUST-WP-0049` | Not blocking workstation independence |
| Temporal (activity-core) | railiance01 | railiance01 (retain) | — | Co-locate with activity-core |
| NATS (activity-core) | railiance01 | railiance01 (retain) | — | Co-locate with activity-core |
### Network tunnels (production-critical)
| Lane | Current path | Target path | Owner |
| --- | --- | --- | --- |
| activity-core → State Hub | railiance01 reverse → workstation → `state-hub-primary` → coulombcore | railiance01 `atm-` forward → railiance01 State Hub (local or short hop) | `CUST-WP-0054-T02` |
| Agents/MCP → State Hub | workstation `127.0.0.1:8000` → `state-hub-primary` → coulombcore | workstation `127.0.0.1:8000` → tunnel to railiance01 hub (dev client) or fleet endpoint | `CUST-WP-0054-T05` + T07 profiles |
| railiance01 automations → State Hub | `:18000` chain via workstation | railiance01-local bridge port | `CUST-WP-0054-T02` |
| activity-core → issue-core | railiance01 reverse → workstation → `issue-core-coulombcore` | railiance01 `atm-` forward → issue-core (on railiance01 post-drain) | `CUST-WP-0054-T02`, then T03 |
| Operator k3s access | workstation forwards (`k3s-api-*`) | workstation interactive (non-critical) | — |
### Repo remotes
All checked 2026-07-03; pattern is uniform:
| Repo (sample) | Current remote | Target remote | Owner |
| --- | --- | --- | --- |
| the-custodian | `gitea.coulomb.social/coulomb/the-custodian.git` | `forgejo.coulomb.social/coulomb/the-custodian.git` | `CUST-WP-0054-T04` |
| state-hub | `gitea.coulomb.social/coulomb/state-hub.git` | `forgejo.coulomb.social/coulomb/state-hub.git` | `CUST-WP-0054-T04` |
| activity-core | `gitea.coulomb.social/coulomb/activity-core.git` | `forgejo.coulomb.social/coulomb/activity-core.git` | `CUST-WP-0054-T04` |
| issue-core | `gitea.coulomb.social/coulomb/issue-core.git` | `forgejo.coulomb.social/coulomb/issue-core.git` | `CUST-WP-0054-T04` |
| ops-bridge | `gitea.coulomb.social/coulomb/ops-bridge.git` | `forgejo.coulomb.social/coulomb/ops-bridge.git` | `CUST-WP-0054-T04` |
| ops-warden | `gitea.coulomb.social/coulomb/ops-warden.git` | `forgejo.coulomb.social/coulomb/ops-warden.git` | `CUST-WP-0054-T04` |
| core-hub | `gitea.coulomb.social/coulomb/core-hub.git` | `forgejo.coulomb.social/coulomb/core-hub.git` | `CUST-WP-0054-T04` |
| *(all 74 registered repos)* | `gitea.coulomb.social/coulomb/<slug>.git` | `forgejo.coulomb.social/coulomb/<slug>.git` | `CUST-WP-0054-T04` |
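
Because the pattern is uniform, the rewrite is a pure host substitution. A minimal sketch, assuming the target hostname stays `forgejo.coulomb.social` as listed in the table (the Forgejo production hostname is still an open decision); `retarget_remote` is a hypothetical helper name.

```python
OLD_HOST = "gitea.coulomb.social"
NEW_HOST = "forgejo.coulomb.social"  # taken from the table; may still change


def retarget_remote(remote):
    """Swap the forge host, leaving scheme and path untouched.

    Remotes on any other host are returned unchanged.
    """
    scheme, sep, rest = remote.rpartition("://")
    host, slash, path = rest.partition("/")
    if host != OLD_HOST:
        return remote
    return f"{scheme}{sep}{NEW_HOST}{slash}{path}"


print(retarget_remote("gitea.coulomb.social/coulomb/state-hub.git"))
```

Matching on the exact host rather than doing a blind string replace keeps unrelated remotes, and paths that happen to contain the old hostname, untouched.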
### State Hub repo checkout paths
| Concern | Current | Target | Owner |
| --- | --- | --- | --- |
| `local_path` for 74 repos | `/home/worsch/<repo>` on workstation | railiance01 clone tree (e.g. `/home/tegwick/<repo>` or gitops-managed path) | `CUST-WP-0054-T05` |
| Consistency sweep writeback host | workstation (`consistency_check.py --remote` via API) | railiance01 checkouts from forge | `CUST-WP-0054-T05`, `STATE-WP-0064` |
| COULOMBCORE `host_paths` | `/home/tegwick/<repo>` (11 repos, `CUST-WP-0021`) | retired with coulombcore drain | `CUST-WP-0054-T09` |
| Multi-host path resolution | `host_paths` map per hostname | fleet-primary host only + dev-hub local | `CUST-WP-0054-T07` |
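
The current multi-host resolution amounts to a per-hostname lookup with `local_path` as the fallback. The sketch below assumes a repo record shaped as a plain dict; only the `local_path` and `host_paths` field names come from this document, while the record layout, the lowercase hostname keys, and the `resolve_checkout` name are assumptions.

```python
def resolve_checkout(repo, hostname):
    """Return the checkout path for `hostname`, falling back to local_path."""
    host_paths = repo.get("host_paths") or {}
    return host_paths.get(hostname, repo["local_path"])


# Illustrative record; not copied from the State Hub database.
repo = {
    "slug": "state-hub",
    "local_path": "/home/worsch/state-hub",
    "host_paths": {"coulombcore": "/home/tegwick/state-hub"},
}

print(resolve_checkout(repo, "coulombcore"))
print(resolve_checkout(repo, "bnt-lap001"))
```

In the target model the map collapses to the fleet-primary host plus a dev-hub local path, so the fallback branch is the part expected to change in T07.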
### Sink and prompt paths
| Sink / path | Current | Target | Owner |
| --- | --- | --- | --- |
| Daily triage working-memory | `/home/worsch/the-custodian/memory/working` (ActivityDefinition + PVC mount) | repo-relative or PVC-native path + sweep sync-to-repo | `CUST-WP-0054-T06` |
| Daily triage State Hub progress | cluster hub via workstation tunnel | railiance01 hub direct | `CUST-WP-0054-T02`, `T05` |
| Consistency sweep progress event | via workstation-hosted sweep | railiance01-hosted sweep | `CUST-WP-0054-T05`, `STATE-WP-0064` |
| Agent session traces (`runtime/agent.py`) | `memory/working/agent-session-*.md` on workstation | dev-hub local buffer; commit on reconnect | `CUST-WP-0054-T07` |
| `output_schema` in ActivityDefinitions | absolute paths under `/home/worsch/the-custodian/` | repo-relative resolution in activity-core | `CUST-WP-0054-T06` |
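
Decoupling the sinks comes down to storing paths relative to the repo root and resolving them against whichever checkout is local. A minimal sketch using the standard library's `pathlib`; the helper names are hypothetical and the railiance01 checkout root in the usage lines is only an example.

```python
from pathlib import PurePosixPath


def to_repo_relative(absolute, repo_root):
    """Strip the host-specific repo root; raises ValueError if outside it."""
    return PurePosixPath(absolute).relative_to(PurePosixPath(repo_root)).as_posix()


def resolve(relative, repo_root):
    """Re-anchor a repo-relative path under the local checkout."""
    return (PurePosixPath(repo_root) / relative).as_posix()


rel = to_repo_relative(
    "/home/worsch/the-custodian/memory/working",
    "/home/worsch/the-custodian",
)
print(rel)
print(resolve(rel, "/home/tegwick/the-custodian"))
```

Letting `relative_to` raise for paths outside the repo root is deliberate here: a sink that points elsewhere on the workstation should fail loudly rather than be silently rewritten.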
### Build and publish pipelines
| Image / artifact | Current build host | Current registry | Target build | Target registry | Owner |
| --- | --- | --- | --- | --- | --- |
| state-hub | workstation `docker build` | `gitea.coulomb.social/coulomb/state-hub` | Forgejo Actions runner on railiance01 | railiance01 forge OCI | `CUST-WP-0054-T04` |
| core-hub | workstation / railiance-forge docs | `gitea.coulomb.social/coulomb/core-hub` | CI runner | railiance01 forge OCI | `CUST-WP-0054-T04` |
| activity-core | workstation manual rebuild + scp | railiance01 k3s import / Gitea | CI on tag push | railiance01 forge OCI | `CUST-WP-0054-T04` |
| issue-core | workstation / manual | `gitea.coulomb.social/coulomb/issue-core` | CI runner | railiance01 forge OCI | `CUST-WP-0054-T04` |
| Haskell build agent | workstation VM (`haskell-build-vm`) | n/a | retired (`CORE-WP-0007`) | n/a | `CORE-WP-0007` |
Done criterion for T01: every row above has a target and migration owner. ✓
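
The done criterion can be checked mechanically once the register is available as data. The sketch below is one possible reading: it assumes rows are plain dicts, that a row whose target equals its current host (the "retain" rows) needs no migration owner, and that `incomplete_rows` is a hypothetical helper. The sample rows are abbreviated from the workloads table.

```python
def incomplete_rows(rows):
    """Workloads lacking a target, or lacking an owner when they must move."""
    bad = []
    for row in rows:
        stays_put = row["target"] == row["current"]
        if not row["target"] or (not stays_put and not row["owner"]):
            bad.append(row["workload"])
    return bad


ROWS = [
    {"workload": "issue-core", "current": "coulombcore",
     "target": "railiance01", "owner": "CUST-WP-0054-T03"},
    {"workload": "Temporal", "current": "railiance01",
     "target": "railiance01", "owner": None},
    {"workload": "OpenBao", "current": "coulombcore",
     "target": "railiance01", "owner": "CUST-WP-0054-T03"},
]

print(incomplete_rows(ROWS))
```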
## Drain Sequence
Detailed plan: `docs/coulombcore-drain-placement-plan.md`
Freeze policy: `canon/standards/coulombcore-production-freeze_v0.1.md`
```
Wave 1 Forge + CI (T04)
Wave 2 State Hub primary (T05)
Wave 3 Core Hub (CORE-WP-0005)
Wave 4 issue-core
Wave 5 ESO / ArgoCD
Wave 6 Supporting apps
Wave 7 OpenBao + identity (LAST)
Wave 8 coulombcore phoenix → railiance02 (T09)
```
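
Read as a strict ordering, the waves imply a simple guard: a wave may start only when every earlier wave is complete. This sketch assumes the waves are strictly sequential, which the list suggests but the detailed placement plan may relax; the wave labels are shortened and the function names are hypothetical.

```python
from typing import Optional

WAVES = [
    "Forge + CI",
    "State Hub primary",
    "Core Hub",
    "issue-core",
    "ESO / ArgoCD",
    "Supporting apps",
    "OpenBao + identity",
    "coulombcore phoenix",
]


def can_start(wave, completed):
    """True when every wave ordered before `wave` is in `completed`."""
    idx = WAVES.index(wave)
    return all(w in completed for w in WAVES[:idx])


def next_wave(completed) -> Optional[str]:
    """First wave not yet completed, or None when the drain is finished."""
    for wave in WAVES:
        if wave not in completed:
            return wave
    return None


print(next_wave({"Forge + CI"}))
print(can_start("OpenBao + identity", {"Forge + CI", "State Hub primary"}))
```

The guard makes the "identity last" rule explicit: OpenBao and the identity stack cannot begin draining while any earlier wave is still open.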
## Sequencing Map
```
T01 (this document) ✓
 ├─ T02 de-hub network ✓
 ├─ T03 placement plan / freeze ✓
 │   ├─ T04 forge + CI
 │   └─ T05 State Hub home on railiance01
 ├─ T06 sink decoupling
 ├─ T07 dev beachhead
 └─ T08 phoenix drill
     └─ T09 coulombcore → railiance02
         └─ T10 workstation-off acceptance
```
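
The map can also be treated as a dependency graph to find which tasks are unblocked. The edges below are one reading of the tree (T04 and T05 under T03, T09 under T08, T10 under T09) and should be checked against the workplan; `graphlib` is in the Python standard library from 3.9, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (assumed from the tree above)
DEPENDS_ON = {
    "T01": set(),
    "T02": {"T01"},
    "T03": {"T01"},
    "T04": {"T03"},
    "T05": {"T03"},
    "T06": {"T01"},
    "T07": {"T01"},
    "T08": {"T01"},
    "T09": {"T08"},
    "T10": {"T09"},
}


def execution_order(deps):
    """One valid order in which every task follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


def ready(deps, done):
    """Tasks not yet done whose prerequisites are all done."""
    return sorted(t for t, pre in deps.items() if t not in done and pre <= done)


print(ready(DEPENDS_ON, {"T01", "T02", "T03"}))
```

With T01 to T03 marked done, as in the map, this reading leaves T04 through T08 unblocked in parallel, while T09 and T10 wait on the phoenix drill.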
## Evidence and Inventory Sources
- Live tunnel state: `bridge status` (2026-07-03)
- State Hub health: `http://127.0.0.1:8000/state/health` (cluster primary via tunnel)
- Registered repos: `GET /repos/` — 74 repos, all `local_path` under `/home/worsch/`
- `ops/service-inventory.yml` (2026-06-05; predates cluster cutover — refresh in T03)
- `docs/infrastructure-stabilization-pickup-checkpoint.md` (2026-07-03 metaplan closeout)
- Activity definitions: `activity-definitions/daily-statehub-wsjf-triage.md`,
`activity-definitions/state-hub-consistency-sweep.md`
## Open Gaps (not T01 blockers)
| Gap | Follow-on |
| --- | --- |
| Forgejo production hostname / SMTP / exposure decisions | `RAIL-HO-WP-0005-T02` (human) |
| `ops/service-inventory.yml` stale environment labels | Refresh during T03 |
| Core Hub widget-type registry prerequisite | `CORE-WP-0005-T04` |
| HA Postgres / Longhorn across 2+ nodes | `RAIL-BS-WP-0007`, `CUST-WP-0038` after railiance02 |
## Promotion to Canon
After operator review:
1. Move to `canon/architecture/adr-006-workstation-independence-fleet-roles.md`
(or equivalent ADR number).
2. Update `ops/service-inventory.yml` environment and service rows to match.
3. Link from `SCOPE.md` and `.custodian-brief.md` generation inputs.