Documents the three-machine role model, fleet mesh topology, coulombcore freeze policy, and ordered drain sequence. Adds railiance01 systemd tunnel install assets and refreshes ops service inventory to reflect 2026-07-03 production placement (cluster State Hub, fleet mesh, draining coulombcore).
# Workstation Independence and Fleet Role Architecture

Date: 2026-07-03
Status: draft (canon-adjacent; promote to `canon/architecture/` after review)
Workplan: `CUST-WP-0054` T01
Related: `ADR-001`, `ADR-004`, `RAIL-BS-WP-0006`, `RAIL-HO-WP-0005`, `CUST-WP-0011`

## Purpose

Define the three-machine role model, the fleet mesh topology, the promotion gate
for "production", and the phoenix path `coulombcore → railiance02`. Provide a
dependency register so every workload, tunnel, repo remote, sink path, and
build pipeline has a **current host**, **target host**, and **migration owner**.

The acceptance proof for the whole plan is `CUST-WP-0054-T10`: production runs
24h+ with the workstation fully offline.

## Machine Roles

| Machine | IP / identity | Current role (2026-07-03) | Target role |
| --- | --- | --- | --- |
| **railiance01** | `92.205.62.239` | First ThreePhoenix foundation node; hosts activity-core production, partial State Hub cluster footprint, automation schedules | **Production home**: first node of the growing Railiance fleet; hosts State Hub primary, forge, CI runners, and the automation loop |
| **coulombcore** | `92.205.130.254` | De-facto production host: State Hub cluster primary, Core Hub (`hub.coulomb.social`), issue-core, OpenBao, identity stack, ESO/ArgoCD, Gitea/registry | **Frozen legacy**: no new production; drain workload-by-workload; eventually wiped and **reborn as railiance02** |
| **workstation** | `bnt-lap001` / WSL2 | Production network hub (all 16 ops-bridge tunnels), State Hub client endpoint (`127.0.0.1:8000`), consistency-sweep writebacks, image build/publish, dev checkouts for 74 registered repos | **Temporary dev environment**: clone repos, run `make dev-hub`, push when connected; nothing in the production loop may depend on it being on |

### Role invariants

1. Production workloads authenticate, schedule, emit, and reconcile without the
   workstation.
2. `coulombcore` is frozen for new production immediately (policy; see T03).
3. A workload counts as "production on railiance01" only after passing the
   staged-promotion gate (see below).
4. Files remain authoritative per ADR-001; fleet databases are disposable caches.

## Fleet Mesh Topology

### Current topology (workstation as hub)

All ops-bridge tunnels originate on the workstation. Two production data paths
**chain through** it:

```
railiance01                                           workstation                                    coulombcore
───────────                                           ───────────                                    ───────────
activity-core ──(state-hub-railiance01 reverse)──►    :18000 ──(state-hub-primary forward)──►        State Hub cluster
activity-core ──(issue-core-railiance01 reverse)──►   :local ──(issue-core-coulombcore forward)──►   issue-core
```

Live tunnel inventory (2026-07-03, `bridge status`):

| Tunnel | Direction | Actor | Production-critical? |
| --- | --- | --- | --- |
| `state-hub-primary` | workstation → coulombcore cluster | `agt-claude-coulombcore` | **yes**: MCP/agents reach cluster hub via `127.0.0.1:8000` |
| `state-hub-cluster-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | dev/ops access |
| `state-hub-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | **yes**: activity-core reaches hub |
| `state-hub-mcp-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | dev MCP |
| `issue-core-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | **yes**: emission lane |
| `issue-core-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | **yes**: completes emission chain |
| `state-hub-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | legacy/dev |
| `state-hub-mcp-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | dev MCP |
| `k3s-api-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | operator dev |
| `k3s-api-haskelseed` | workstation → haskelseed | `agt-claude-haskelseed` | experimental |
| `flex-auth-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | identity dev |
| `core-hub-staging-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | staging |
| `inter-hub-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | legacy Inter-Hub |
| `state-hub-haskelseed` | haskelseed → workstation | `agt-claude-haskelseed` | experimental |
| `state-hub-mcp-haskelseed` | haskelseed → workstation | `agt-claude-haskelseed` | experimental |
| `nix-daemon-haskelseed` | haskelseed → workstation | `agt-claude-haskelseed` | build dev |

A workstation reboot breaks daily triage evidence, consistency sweeps, and
issue emission until tunnels recover.

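
The workstation dependency in the inventory above can be made concrete with a small sketch. It assumes a few rows of the table are transcribed by hand into Python; the `Tunnel` type and the `workstation_critical` helper are hypothetical illustrations, not part of ops-bridge, and the `bridge status` output format is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tunnel:
    name: str
    source: str
    dest: str
    critical: bool  # "Production-critical?" column


# Hand-transcribed subset of the inventory table (assumption: not parsed
# from `bridge status`; the four critical rows plus two dev rows).
INVENTORY = [
    Tunnel("state-hub-primary", "workstation", "coulombcore", True),
    Tunnel("state-hub-railiance01", "railiance01", "workstation", True),
    Tunnel("issue-core-railiance01", "railiance01", "workstation", True),
    Tunnel("issue-core-coulombcore", "workstation", "coulombcore", True),
    Tunnel("k3s-api-coulombcore", "workstation", "coulombcore", False),
    Tunnel("nix-daemon-haskelseed", "haskelseed", "workstation", False),
]


def workstation_critical(tunnels):
    """Names of production-critical tunnels with the workstation as an endpoint."""
    return [
        t.name
        for t in tunnels
        if t.critical and "workstation" in (t.source, t.dest)
    ]
```

Under these assumptions the helper returns all four production-critical tunnels, which is the dependency the target topology is meant to remove: after the T02 work the same check should return an empty list.
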
### Target topology (fleet-owned mesh)

```
railiance01 ◄────────────────────────────────────► coulombcore (draining)
│  direct atm- tunnels (ops-bridge on-host)        │
│  State Hub API                                   │  legacy until drain complete
│  issue-core REST                                 │
└─ activity-core, Temporal, sweep checkouts        └─ identity, OpenBao (last to move)

workstation (optional client)
│  interactive-only: k3s API, hub read, dev-hub
└─ may disconnect without production impact
```

Implementation owner: `CUST-WP-0054-T02`.

Key changes:

- ops-bridge (or systemd ssh units) runs **on railiance01** with `atm-` actor
  certs for cross-machine lanes.
- `actcore-state-hub-bridge` and `actcore-issue-core-bridge` point at
  machine-local tunnel ports, not workstation forwards.
- Workstation tunnels remain for interactive dev only.
- Evaluate a WireGuard mesh once the persistent unit count exceeds ~5.

This posture extends ADR-004 (connectivity-first) from "workstation connects
everything" to "fleet machines connect each other; workstation is a client."

## Production Promotion Gate

A workload is **production on railiance01** only when it conforms to the
finished staged-promotion contract (`RAIL-BS-WP-0006`):

| Gate | Requirement |
| --- | --- |
| Overlay repo | `railiance/<app>/` with `app.toml` and stage manifests |
| Stage commands | `stage deploy`, `stage observe`, `stage promote`, `stage rollback` proven |
| Evidence | Backup/restore drill, canary observation, operator approval recorded |
| Registry | Image in forge OCI registry with immutable tag |

**Exceptions** must be documented in the placement plan (T03) with explicit
rollback. No exception bypasses backup evidence for stateful workloads.

`coulombcore` workloads still running in production today are **grandfathered
legacy** until their drain task completes, not newly promoted production.

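
As an illustration of how the four gates and the exception rule interact, here is a minimal sketch. The `Workload` and `Evidence` types, the gate keys, and the idea of recording exceptions as a set of gate names are all assumptions made for this example; the actual contract lives in `RAIL-BS-WP-0006` and may be expressed quite differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    backup_restore_drill: bool
    canary_observed: bool
    operator_approved: bool


@dataclass
class Workload:
    name: str
    stateful: bool
    overlay_repo: bool      # railiance/<app>/ with app.toml and stage manifests
    stage_commands: bool    # deploy / observe / promote / rollback proven
    evidence: Evidence
    immutable_image: bool   # image in forge OCI registry with immutable tag
    # Hypothetical: gate names waived in the placement plan (T03).
    exceptions: set[str] = field(default_factory=set)


def blocking_gates(w: Workload) -> list[str]:
    """Return the gates that still block promotion after applying exceptions."""
    e = w.evidence
    results = {
        "overlay_repo": w.overlay_repo,
        "stage_commands": w.stage_commands,
        "evidence": e.backup_restore_drill and e.canary_observed and e.operator_approved,
        "registry": w.immutable_image,
    }
    blocked = []
    for gate, passed in results.items():
        if passed:
            continue
        # Missing backup evidence can never be waived for stateful workloads.
        unwaivable = gate == "evidence" and w.stateful and not e.backup_restore_drill
        if gate in w.exceptions and not unwaivable:
            continue
        blocked.append(gate)
    return blocked


def is_production(w: Workload) -> bool:
    return not blocking_gates(w)
```

In this sketch a stateful workload with a documented evidence exception but no backup/restore drill stays blocked, while a stateless workload with the same exception passes.
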
## Phoenix Path: coulombcore → railiance02

Machine-scale phoenix rotation reuses the same automation intended for future
3-node weekly rotations (`RAIL-BS-WP-0007` and `CUST-WP-0038`, both deferred
until railiance02 exists).

### Preconditions (drain complete)

All production dependencies must have moved off coulombcore, following the T03
ordering:

1. Forge + CI (T04): repos and images no longer depend on `gitea.coulomb.social`
2. State Hub primary (T05): cluster DB and sweep checkouts on railiance01
3. Core Hub, issue-core, Inter-Hub legacy: per T03 sequence
4. Identity + OpenBao: **last** (everything authenticates through them)

### Phoenix execution

Owner: `CUST-WP-0054-T09`, automation: `CUST-WP-0054-T08`.

| Phase | Action | Tooling |
| --- | --- | --- |
| S0 | Final inventory sweep, DNS/cert plan for `*.coulomb.social`, data archival | T09 |
| S1 | Wipe and greenfield rebuild | `NET-WP-0020` unseal + bootstrap chain |
| S2 | Join as `railiance02` | `railiance-cluster` overlay, `atm-` certs |
| S3 | Prove join-ready | Phoenix drill on disposable target first (T08) |

Longhorn distributed storage and PG streaming HA unlock once railiance01 +
railiance02 are both fleet nodes.

## Dev Environment (Files-First Beachhead)

Strategy A from the workplan; owner: `CUST-WP-0054-T07`.

```
git clone → make dev-hub → local ephemeral hub (compose)
    │
    ├─ C-06 registration rebuilds workplan/task state from files
    ├─ offline write buffer (STATE-WP-0068) for progress/task events
    └─ reconnect relay upstream; files reconcile, databases do not replicate
```

The MCP config gains an explicit `dev` / `fleet` profile switch. The workstation
is genuinely temporary: no fleet DB sync is required for orientation.

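
The offline write buffer and reconnect relay can be pictured with the following sketch. It is a generic append-only JSONL buffer, not the `STATE-WP-0068` design: the `OfflineBuffer` class, the file format, and the `send` callback are all assumptions chosen to show the buffer-then-relay idea.

```python
import json
from pathlib import Path
from typing import Callable


class OfflineBuffer:
    """Append-only local buffer for progress/task events (hypothetical)."""

    def __init__(self, path: Path):
        self.path = path

    def append(self, event: dict) -> None:
        # One JSON document per line, so partial relays are easy to resume.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event, sort_keys=True) + "\n")

    def pending(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def relay(self, send: Callable[[dict], None]) -> int:
        """Send buffered events in order; keep whatever could not be sent."""
        events = self.pending()
        sent = 0
        try:
            for event in events:
                send(event)
                sent += 1
        except ConnectionError:
            pass  # still offline: stop and keep the remainder buffered
        remaining = events[sent:]
        self.path.write_text(
            "".join(json.dumps(e, sort_keys=True) + "\n" for e in remaining),
            encoding="utf-8",
        )
        return sent
```

The buffer only carries events upstream; it does not replicate database state, which matches the rule that files reconcile and databases do not replicate.
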
## Dependency Register

### Workloads

| Workload | Current host | Target host | Migration owner | Method / notes |
| --- | --- | --- | --- | --- |
| State Hub API (primary) | coulombcore CNPG cluster via workstation tunnel `state-hub-primary` → `127.0.0.1:8000` | railiance01 | `CUST-WP-0054-T05` | `CUST-WP-0011-T07` playbook: freeze → exact-count restore → rewire |
| State Hub API (WSL2 fallback) | workstation WSL2 | retired | `CUST-WP-0011-T08/T09` → absorbed by `CUST-WP-0054-T10` | Stabilization window; not part of target architecture |
| activity-core | railiance01 k3s (`activity-core` ns) | railiance01 (retain) | n/a | Already on target machine; fix bridges in T02 |
| issue-core | coulombcore k3s | railiance01 | `CUST-WP-0054-T03` drain seq. | `ISSUE-WP-0003` live; emission chain fixed in T02 |
| Core Hub | coulombcore (`hub.coulomb.social`) | railiance01 | `CORE-WP-0005` + `CUST-WP-0054-T03` | Staging on coulombcore; production cutover human-gated |
| Inter-Hub (legacy Haskell) | coulombcore external | retired | `CORE-WP-0007` | Rollback-only after Core Hub cutover |
| Gitea + OCI registry | coulombcore k3s | railiance01 Forgejo | `RAIL-HO-WP-0005` / `CUST-WP-0054-T04` | Read-only mirror on coulombcore until decommission |
| OpenBao | coulombcore | railiance01 | `CUST-WP-0054-T03` (last) | NET-WP-0020 unseal automation |
| Identity stack (KeyCape, Authelia, privacyIDEA, lldap) | coulombcore | railiance01 | `CUST-WP-0054-T03` (last) | Coupled to OpenBao |
| ESO + ArgoCD control plane | coulombcore | railiance01 | `CUST-WP-0054-T03` | GitOps follows forge move |
| CNPG databases (per workload) | coulombcore / railiance01 | railiance01 per workload | `CUST-WP-0054-T03`, `CUST-WP-0054-T05` | CNPG pattern proven; migrate with workload |
| llm-connect | TBD cluster | railiance01 | near-term lanes board | `CCR-2026-0003` credential lane active |
| ops-hub (widget/evidence) | files + Inter-Hub widgets | railiance01 via Core Hub | `CUST-WP-0025`, `CUST-WP-0049` | Not blocking workstation independence |
| Temporal (activity-core) | railiance01 | railiance01 (retain) | n/a | Co-locate with activity-core |
| NATS (activity-core) | railiance01 | railiance01 (retain) | n/a | Co-locate with activity-core |

### Network tunnels (production-critical)

| Lane | Current path | Target path | Owner |
| --- | --- | --- | --- |
| activity-core → State Hub | railiance01 reverse → workstation → `state-hub-primary` → coulombcore | railiance01 `atm-` forward → railiance01 State Hub (local or short hop) | `CUST-WP-0054-T02` |
| Agents/MCP → State Hub | workstation `127.0.0.1:8000` → `state-hub-primary` → coulombcore | workstation `127.0.0.1:8000` → tunnel to railiance01 hub (dev client) or fleet endpoint | `CUST-WP-0054-T05` + T07 profiles |
| railiance01 automations → State Hub | `:18000` chain via workstation | railiance01-local bridge port | `CUST-WP-0054-T02` |
| activity-core → issue-core | railiance01 reverse → workstation → `issue-core-coulombcore` | railiance01 `atm-` forward → issue-core (on railiance01 post-drain) | `CUST-WP-0054-T02`, then T03 |
| Operator k3s access | workstation forwards (`k3s-api-*`) | workstation interactive (non-critical) | n/a |

### Repo remotes

All remotes were checked on 2026-07-03; the pattern is uniform:

| Repo (sample) | Current remote | Target remote | Owner |
| --- | --- | --- | --- |
| the-custodian | `gitea.coulomb.social/coulomb/the-custodian.git` | `forgejo.coulomb.social/coulomb/the-custodian.git` | `CUST-WP-0054-T04` |
| state-hub | `gitea.coulomb.social/coulomb/state-hub.git` | `forgejo.coulomb.social/coulomb/state-hub.git` | `CUST-WP-0054-T04` |
| activity-core | `gitea.coulomb.social/coulomb/activity-core.git` | `forgejo.coulomb.social/coulomb/activity-core.git` | `CUST-WP-0054-T04` |
| issue-core | `gitea.coulomb.social/coulomb/issue-core.git` | `forgejo.coulomb.social/coulomb/issue-core.git` | `CUST-WP-0054-T04` |
| ops-bridge | `gitea.coulomb.social/coulomb/ops-bridge.git` | `forgejo.coulomb.social/coulomb/ops-bridge.git` | `CUST-WP-0054-T04` |
| ops-warden | `gitea.coulomb.social/coulomb/ops-warden.git` | `forgejo.coulomb.social/coulomb/ops-warden.git` | `CUST-WP-0054-T04` |
| core-hub | `gitea.coulomb.social/coulomb/core-hub.git` | `forgejo.coulomb.social/coulomb/core-hub.git` | `CUST-WP-0054-T04` |
| *(all 74 registered repos)* | `gitea.coulomb.social/coulomb/<slug>.git` | `forgejo.coulomb.social/coulomb/<slug>.git` | `CUST-WP-0054-T04` |

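
Because the remote pattern is uniform, the rewrite is a pure host swap. The sketch below shows that mapping as a function; `target_remote` is a hypothetical helper, and `forgejo.coulomb.social` is taken from the table even though the Forgejo production hostname is still listed as an open decision.

```python
OLD_HOST = "gitea.coulomb.social"
NEW_HOST = "forgejo.coulomb.social"  # tentative: hostname decision is still open


def target_remote(current: str) -> str:
    """Map a current remote (host/path, no scheme) to its target remote."""
    host, sep, path = current.partition("/")
    if host != OLD_HOST or not sep or not path:
        raise ValueError(f"unexpected remote: {current!r}")
    return f"{NEW_HOST}/{path}"
```

Rejecting anything that does not match the expected host keeps a bulk rewrite over all 74 repos from silently touching a remote that was already migrated or points elsewhere.
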
### State Hub repo checkout paths

| Concern | Current | Target | Owner |
| --- | --- | --- | --- |
| `local_path` for 74 repos | `/home/worsch/<repo>` on workstation | railiance01 clone tree (e.g. `/home/tegwick/<repo>` or gitops-managed path) | `CUST-WP-0054-T05` |
| Consistency sweep writeback host | workstation (`consistency_check.py --remote` via API) | railiance01 checkouts from forge | `CUST-WP-0054-T05`, `STATE-WP-0064` |
| COULOMBCORE `host_paths` | `/home/tegwick/<repo>` (11 repos, `CUST-WP-0021`) | retired with coulombcore drain | `CUST-WP-0054-T09` |
| Multi-host path resolution | `host_paths` map per hostname | fleet-primary host only + dev-hub local | `CUST-WP-0054-T07` |

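
To show what the current per-hostname resolution amounts to, here is a minimal sketch. The `local_path` and `host_paths` field names come from the table; everything else is assumed, including the record shape, the upper-cased hostname keys, and the fallback to `local_path`. State Hub's actual lookup may differ.

```python
def resolve_checkout(repo: dict, hostname: str) -> str:
    """Pick the checkout path for a host, falling back to local_path."""
    host_paths = repo.get("host_paths") or {}
    # Assumption: host_paths is keyed by upper-cased hostname.
    return host_paths.get(hostname.upper(), repo["local_path"])
```

The target state drops this map in favour of a single fleet-primary path plus a dev-hub local path, so a lookup like this would no longer need a per-host branch.
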
### Sink and prompt paths

| Sink / path | Current | Target | Owner |
| --- | --- | --- | --- |
| Daily triage working-memory | `/home/worsch/the-custodian/memory/working` (ActivityDefinition + PVC mount) | repo-relative or PVC-native path + sweep sync-to-repo | `CUST-WP-0054-T06` |
| Daily triage State Hub progress | cluster hub via workstation tunnel | railiance01 hub direct | `CUST-WP-0054-T02`, `T05` |
| Consistency sweep progress event | via workstation-hosted sweep | railiance01-hosted sweep | `CUST-WP-0054-T05`, `STATE-WP-0064` |
| Agent session traces (`runtime/agent.py`) | `memory/working/agent-session-*.md` on workstation | dev-hub local buffer; commit on reconnect | `CUST-WP-0054-T07` |
| `output_schema` in ActivityDefinitions | absolute paths under `/home/worsch/the-custodian/` | repo-relative resolution in activity-core | `CUST-WP-0054-T06` |

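
Repo-relative resolution can be sketched as follows. The legacy root `/home/worsch/the-custodian` comes from the table; the function names and the idea of passing the repo root explicitly are assumptions for illustration, not the activity-core implementation.

```python
from pathlib import PurePosixPath

# Workstation-specific prefix that currently appears in ActivityDefinitions.
LEGACY_ROOT = PurePosixPath("/home/worsch/the-custodian")


def to_repo_relative(path: str, legacy_root: PurePosixPath = LEGACY_ROOT) -> PurePosixPath:
    """Strip the legacy absolute prefix; leave already-relative paths alone."""
    p = PurePosixPath(path)
    if not p.is_absolute():
        return p
    # Raises ValueError if the path is absolute but outside the legacy root.
    return p.relative_to(legacy_root)


def resolve(path: str, repo_root: str) -> PurePosixPath:
    """Resolve a sink path against whichever checkout the host actually has."""
    return PurePosixPath(repo_root) / to_repo_relative(path)
```

For example, `resolve("/home/worsch/the-custodian/memory/working", "/home/tegwick/the-custodian")` yields the same `memory/working` directory under the other checkout, so the definition no longer encodes a workstation home directory.
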
### Build and publish pipelines

| Image / artifact | Current build host | Current registry | Target build | Target registry | Owner |
| --- | --- | --- | --- | --- | --- |
| state-hub | workstation `docker build` | `gitea.coulomb.social/coulomb/state-hub` | Forgejo Actions runner on railiance01 | railiance01 forge OCI | `CUST-WP-0054-T04` |
| core-hub | workstation / railiance-forge docs | `gitea.coulomb.social/coulomb/core-hub` | CI runner | railiance01 forge OCI | `CUST-WP-0054-T04` |
| activity-core | workstation manual rebuild + scp | railiance01 k3s import / Gitea | CI on tag push | railiance01 forge OCI | `CUST-WP-0054-T04` |
| issue-core | workstation / manual | `gitea.coulomb.social/coulomb/issue-core` | CI runner | railiance01 forge OCI | `CUST-WP-0054-T04` |
| Haskell build agent | workstation VM (`haskell-build-vm`) | n/a | retired (`CORE-WP-0007`) | n/a | `CORE-WP-0007` |

Done criterion for T01: every row above has a target and migration owner. ✓

## Drain Sequence

Detailed plan: `docs/coulombcore-drain-placement-plan.md`
Freeze policy: `canon/standards/coulombcore-production-freeze_v0.1.md`

```
Wave 1  Forge + CI (T04)
Wave 2  State Hub primary (T05)
Wave 3  Core Hub (CORE-WP-0005)
Wave 4  issue-core
Wave 5  ESO / ArgoCD
Wave 6  Supporting apps
Wave 7  OpenBao + identity (LAST)
Wave 8  coulombcore phoenix → railiance02 (T09)
```

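
The wave ordering can be encoded so that a later wave cannot start early. This sketch assumes the waves run strictly in sequence, which the numbering suggests but the placement plan may relax; `WAVES`, `can_start`, and `next_wave` are hypothetical names.

```python
from typing import Optional

WAVES = [
    (1, "Forge + CI"),
    (2, "State Hub primary"),
    (3, "Core Hub"),
    (4, "issue-core"),
    (5, "ESO / ArgoCD"),
    (6, "Supporting apps"),
    (7, "OpenBao + identity"),
    (8, "coulombcore phoenix -> railiance02"),
]


def can_start(number: int, completed: set[int]) -> bool:
    """A wave may start only when every lower-numbered wave is complete."""
    return all(n in completed for n, _ in WAVES if n < number)


def next_wave(completed: set[int]) -> Optional[tuple[int, str]]:
    """First wave not yet completed, or None when the drain is finished."""
    for number, name in WAVES:
        if number not in completed:
            return number, name
    return None
```

Under this assumption the phoenix wave is blocked until OpenBao and identity have moved, which is the precondition the phoenix path depends on.
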
## Sequencing Map

```
T01 (this document) ✓
 ├─ T02 de-hub network ✓
 ├─ T03 placement plan / freeze ✓
 │   ├─ T04 forge + CI
 │   └─ T05 State Hub home on railiance01
 ├─ T06 sink decoupling
 ├─ T07 dev beachhead
 └─ T08 phoenix drill
     └─ T09 coulombcore → railiance02
         └─ T10 workstation-off acceptance
```

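
Read as a dependency graph, the map above can be checked mechanically. The sketch assumes each task depends only on its parent in the tree (the workplan may define further cross-dependencies that the tree does not show) and uses the standard-library `graphlib` module, available from Python 3.9.

```python
from graphlib import TopologicalSorter

# Task -> set of tasks it depends on, transcribed from the tree.
DEPENDS_ON = {
    "T02": {"T01"},
    "T03": {"T01"},
    "T04": {"T03"},
    "T05": {"T03"},
    "T06": {"T01"},
    "T07": {"T01"},
    "T08": {"T01"},
    "T09": {"T08"},
    "T10": {"T09"},
}

ALL_TASKS = set(DEPENDS_ON) | {d for deps in DEPENDS_ON.values() for d in deps}


def execution_order() -> list[str]:
    """One valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


def ready_tasks(done: set[str]) -> list[str]:
    """Tasks not yet done whose dependencies are all done."""
    return sorted(
        t for t in ALL_TASKS - done if DEPENDS_ON.get(t, set()) <= done
    )
```

With T01 through T03 marked done, `ready_tasks` returns T04 through T08, matching the branches that are open in the map.
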
## Evidence and Inventory Sources

- Live tunnel state: `bridge status` (2026-07-03)
- State Hub health: `http://127.0.0.1:8000/state/health` (cluster primary via tunnel)
- Registered repos: `GET /repos/` returns 74 repos, all with `local_path` under `/home/worsch/`
- `ops/service-inventory.yml` (2026-06-05; predates cluster cutover; refresh in T03)
- `docs/infrastructure-stabilization-pickup-checkpoint.md` (2026-07-03 metaplan closeout)
- Activity definitions: `activity-definitions/daily-statehub-wsjf-triage.md`,
  `activity-definitions/state-hub-consistency-sweep.md`

## Open Gaps (not T01 blockers)

| Gap | Follow-on |
| --- | --- |
| Forgejo production hostname / SMTP / exposure decisions | `RAIL-HO-WP-0005-T02` (human) |
| `ops/service-inventory.yml` stale environment labels | Refresh during T03 |
| Core Hub widget-type registry prerequisite | `CORE-WP-0005-T04` |
| HA Postgres / Longhorn across 2+ nodes | `RAIL-BS-WP-0007`, `CUST-WP-0038` after railiance02 |

## Promotion to Canon

After operator review:

1. Move to `canon/architecture/adr-006-workstation-independence-fleet-roles.md`
   (or equivalent ADR number).
2. Update `ops/service-inventory.yml` environment and service rows to match.
3. Link from `SCOPE.md` and `.custodian-brief.md` generation inputs.