CUST-WP-0054 T01-T03: fleet architecture, de-hub runbook, drain plan

Documents the three-machine role model, fleet mesh topology, coulombcore
freeze policy, and ordered drain sequence. Adds railiance01 systemd tunnel
install assets and refreshes ops service inventory to reflect 2026-07-03
production placement (cluster State Hub, fleet mesh, draining coulombcore).
codex
2026-07-04 00:29:55 +02:00
parent 0a77483861
commit cf4be716e1
10 changed files with 1050 additions and 34 deletions


@@ -0,0 +1,118 @@
---
id: canon-coulombcore-production-freeze
type: standard
title: "CoulombCore Production Freeze v0.1"
domain: custodian
status: active
version: "0.1"
created: "2026-07-03"
decided_by: custodian
tags: ["infrastructure", "coulombcore", "railiance01", "production", "freeze", "drain"]
related_workplans:
- CUST-WP-0054
- RAIL-HO-WP-0005
---
# CoulombCore Production Freeze v0.1
## Status
**Active from 2026-07-03.** CoulombCore (`92.205.130.254`) is frozen for new
production workloads.
## Context
Under the fleet role model (`CUST-WP-0054`, `docs/workstation-independence-fleet-architecture.md`):
| Machine | Role |
| --- | --- |
| **railiance01** | Production home — growing Railiance fleet |
| **coulombcore** | Legacy/experimental only; drain then phoenix to **railiance02** |
| **workstation** | Temporary dev environment |
Despite the role model, coulombcore still hosts production-critical workloads
(State Hub cluster primary, Core Hub, issue-core, Gitea, OpenBao, identity
stack, GitOps control plane). This freeze stops the problem from growing while
the drain sequence in `docs/coulombcore-drain-placement-plan.md` executes.
## Policy
### Frozen (blocked without exception)
No **new** production workloads may be introduced on coulombcore after
2026-07-03:
- New Helm releases, ArgoCD Applications, or CNPG clusters intended as
long-lived production
- New public DNS names under `*.coulomb.social` for production services
- New credential lanes whose **primary** runtime home is coulombcore
- New CI/CD publish targets that make coulombcore the canonical registry or
forge (canonical target is railiance01 Forgejo per `RAIL-HO-WP-0005`)
- New automation schedules that **require** coulombcore as the sole runtime
host (activity-core production is already on railiance01)
### Grandfathered (existing production may run)
Workloads already in production on coulombcore before 2026-07-03 may continue
until their drain step completes. They are **not** newly promoted production —
they are legacy carry-over on a condemned host.
### Allowed on coulombcore during drain
| Category | Examples |
| --- | --- |
| Drain migrations | Staged-promotion overlays targeting railiance01; cutover drills in isolated namespaces |
| Read-only mirrors | Gitea read-only rollback mirror after Forgejo cutover |
| Short-lived probes | Disposable Forgejo/restore namespaces per `RAIL-HO-WP-0005` probe strategy |
| Experimental / non-prod | Staging profiles, smoke namespaces, operator-attended bootstrap |
| Fleet mesh transit | Forward tunnels from railiance01 to coulombcore cluster services until those services move (T02 interim) |
### Promotion gate
A workload counts as **production on railiance01** only after passing the
staged-promotion contract (`RAIL-BS-WP-0006`). Coulombcore deployments do not
satisfy this gate after 2026-07-03.
## Enforcement
1. **Workplan review** — new workplans proposing coulombcore production require
an explicit exception row in the drain plan with rollback evidence.
2. **ArgoCD / GitOps** — new Applications with production intent must target
`railiance01-k3s`, not `coulombcore-k3s`, unless tagged `drain-migration`
or `experimental`.
3. **Agent instructions** — coding agents must not deploy new production
services to coulombcore; route to railiance01 overlays or file an exception
request via State Hub `needs_human`.
4. **Inventory drift** — `ops/service-inventory.yml` rows for coulombcore
production services carry `lifecycle_state: draining` after their drain
wave starts.
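As an illustration of enforcement rule 2, the check could be sketched as below. This is a minimal sketch, not the actual GitOps tooling: the `AppSpec` record and its `cluster` and `tags` fields are hypothetical stand-ins for whatever the real Application manifests expose. Only the cluster names and the two exempting tags come from this standard.

```python
from dataclasses import dataclass, field
from typing import Optional

FROZEN_CLUSTER = "coulombcore-k3s"
PRODUCTION_CLUSTER = "railiance01-k3s"
EXEMPT_TAGS = {"drain-migration", "experimental"}


@dataclass
class AppSpec:
    """Hypothetical, simplified view of a proposed Application."""
    name: str
    cluster: str
    tags: set = field(default_factory=set)


def freeze_violation(app: AppSpec) -> Optional[str]:
    """Return a reason string if the app breaks the freeze, else None."""
    if app.cluster != FROZEN_CLUSTER:
        return None  # not targeting the frozen cluster
    if app.tags & EXEMPT_TAGS:
        return None  # drain migrations and experiments are allowed
    return (
        f"{app.name}: targets {FROZEN_CLUSTER} without a drain-migration "
        f"or experimental tag; target {PRODUCTION_CLUSTER} instead"
    )
```

A reviewer or pre-sync hook could call `freeze_violation` on each new Application and route any non-`None` result to the exception process described below.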
## Exceptions
Document each exception in `docs/coulombcore-drain-placement-plan.md` under
**Documented exceptions** with:
- workload id
- reason the drain sequence cannot absorb it yet
- target host and target date
- rollback method
- approving workplan or operator decision id
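The required fields above lend themselves to a simple completeness check. The sketch below assumes exception entries are available as dictionaries; the key names are hypothetical labels for the listed fields, not an existing schema.

```python
# Hypothetical key names, one per required field in the list above.
REQUIRED_FIELDS = (
    "workload_id",
    "reason",
    "target_host",
    "target_date",
    "rollback_method",
    "approval_id",
)


def missing_exception_fields(record: dict) -> list:
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing
```

An entry with an empty result is complete; anything else names the fields still to be filled in before the exception can be approved.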
## Exit criteria (lift freeze)
The freeze lifts for coulombcore as a **host** when:
1. All drain waves in the placement plan reach `retired` or `migrated`
2. Identity + OpenBao (last wave) run on railiance01
3. `CUST-WP-0054-T09` phoenix begins — coulombcore is wiped and rebuilt as
railiance02, not returned to production
After phoenix, the machine identity is **railiance02**; the coulombcore freeze
standard applies only to the historical drain period.
## Related documents
- Drain sequence: `docs/coulombcore-drain-placement-plan.md`
- Architecture: `docs/workstation-independence-fleet-architecture.md`
- Forgejo migration: `RAIL-HO-WP-0005` in `railiance-infra`
- Staged promotion: `RAIL-BS-WP-0006` (finished)


@@ -0,0 +1,200 @@
# CoulombCore Drain and Production Placement Plan
Date: 2026-07-03
Workplan: `CUST-WP-0054-T03`
Freeze policy: `canon/standards/coulombcore-production-freeze_v0.1.md`
Architecture: `docs/workstation-independence-fleet-architecture.md`
## Purpose
Ordered drain sequence for every production workload on coulombcore
(`92.205.130.254`, `coulombcore-k3s`). Each row names current placement,
target placement, migration method, owner workplan, and prerequisites.
**Coupling rule:** forge and State Hub move early; identity + OpenBao move
last because everything authenticates through them.
## Wave overview
```
Wave 0 Freeze policy (this document + canon) — effective 2026-07-03
Wave 1 Source forge + CI runners ─────────── RAIL-HO-WP-0005 / CUST-WP-0054-T04
Wave 2 State Hub primary + sweep checkouts ── CUST-WP-0054-T05 / CUST-WP-0011
Wave 3 Core Hub production ────────────────── CORE-WP-0005
Wave 4 issue-core ─────────────────────────── ISSUE-WP-0003 + overlay
Wave 5 GitOps control plane (ESO, ArgoCD) ─── railiance-cluster overlays
Wave 6 Application stragglers ─────────────── per-app overlays
Wave 7 OpenBao + identity stack ───────────── NET-WP-0020 + key-cape (LAST)
Wave 8 coulombcore phoenix → railiance02 ─── CUST-WP-0054-T09
```
## Placement register
| # | Workload | Current (2026-07-03) | Target | Method | Owner | Wave | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | **Gitea + OCI registry** | coulombcore-k3s `default`; `gitea.coulomb.social` | railiance01 **`forgejo.coulomb.social`** | Staged-promotion S5 overlay; `RAIL-HO-WP-0005` probe → production; Gitea → read-only mirror | `RAIL-HO-WP-0005`, `CUST-WP-0054-T04` | 1 | grandfathered |
| 2 | **Forgejo Actions / CI runners** | none (workstation manual build) | railiance01 | New S5 overlay; image build on tag push | `CUST-WP-0054-T04` | 1 | planned |
| 3 | **Gitea DB + PVC** | coulombcore `databases` / `gitea-shared-storage` | railiance01 CNPG + PVC | Migrate with Forgejo; backup/restore drill required | `RAIL-HO-WP-0005` | 1 | grandfathered |
| 4 | **State Hub API (primary)** | coulombcore CNPG `state-hub-db`; cluster Svc `10.43.170.94:8000` | railiance01 CNPG + Deployment | `CUST-WP-0011-T07` playbook: freeze → exact-count restore → rewire; staged-promotion overlay | `CUST-WP-0054-T05`, `CUST-WP-0011` | 2 | grandfathered |
| 5 | **State Hub sweep checkouts** | workstation `/home/worsch/*` (74 repos) | railiance01 clone tree from forge | Relocate `host_paths` / `local_path`; no workstation writeback | `CUST-WP-0054-T05`, `STATE-WP-0064` | 2 | planned |
| 6 | **WSL2 State Hub fallback** | workstation WSL2 | retired | Stop after railiance01 primary stabilizes | `CUST-WP-0011-T08/T09`, `CUST-WP-0054-T10` | 2 | grandfathered |
| 7 | **Core Hub** | coulombcore `core-hub-staging`; public `hub.coulomb.social` | railiance01 | Staged-promotion overlay; dual-run prerequisite (`CORE-WP-0005-T04`) | `CORE-WP-0005` | 3 | grandfathered |
| 8 | **Inter-Hub (Haskell)** | coulombcore external | retired | Rollback-only after Core Hub cutover | `CORE-WP-0007` | 3 | grandfathered |
| 9 | **issue-core** | coulombcore `issue-core` ns; ClusterIP `10.43.103.154:8765` | railiance01 | Staged-promotion overlay; shorten fleet tunnel to local svc | `ISSUE-WP-0003`, `CUST-WP-0054-T03` | 4 | grandfathered |
| 10 | **issue-core CNPG** | coulombcore | railiance01 | Migrate with issue-core workload | `railiance-platform` | 4 | grandfathered |
| 11 | **External Secrets Operator** | coulombcore | railiance01 | GitOps follows forge; ESO stores point at railiance01 OpenBao post-Wave 7 or interim coulombcore path documented | `railiance-platform` | 5 | grandfathered |
| 12 | **ArgoCD** | coulombcore (boundary: should be S4) | railiance01 | Staged-promotion; repoint repo URLs to Forgejo | `railiance-cluster` | 5 | grandfathered |
| 13 | **llm-connect** | railiance01 `activity-core` ns (partial) | railiance01 | Already on target machine; complete in-cluster profile | `CCR-2026-0003` lane | 6 | observed |
| 14 | **activity-core** | railiance01 `activity-core` ns | railiance01 (retain) | No move; update sinks (T06) and hub URL post-Wave 2 | — | — | **on target** |
| 15 | **Temporal / NATS** | railiance01 | railiance01 (retain) | Co-located with activity-core | — | — | **on target** |
| 16 | **ops-hub evidence / widgets** | files + Core Hub path | railiance01 via Core Hub | Follows Core Hub; not coulombcore-blocking | `CUST-WP-0025`, `CUST-WP-0049` | 6 | planned |
| 17 | **artifact-store / MinIO lane** | assessment only | railiance01 or compatible endpoint | Compatibility-profile per `ARTIFACT-STORE-WP-0007` | `ARTIFACT-STORE-WP-0007` | 6 | planned |
| 18 | **OpenBao** | coulombcore | railiance01 | **Last infrastructure wave**; `NET-WP-0020` unseal automation; CNPG + seal migration | `NET-WP-0020`, `railiance-platform` | 7 | grandfathered |
| 19 | **KeyCape** | coulombcore | railiance01 | Follows OpenBao; OIDC/MFA paths | `key-cape` | 7 | grandfathered |
| 20 | **Authelia** | coulombcore | railiance01 | Identity front door | `key-cape` / `railiance-platform` | 7 | grandfathered |
| 21 | **privacyIDEA** | coulombcore | railiance01 | MFA backend | `key-cape` | 7 | grandfathered |
| 22 | **lldap** | coulombcore | railiance01 | LDAP directory | `key-cape` / `railiance-platform` | 7 | grandfathered |
| 23 | **flex-auth** | coulombcore | railiance01 | Policy registry follows identity | `flex-auth` | 7 | grandfathered |
| 24 | **Fleet mesh transit tunnels** | railiance01 systemd → coulombcore ClusterIPs | railiance01-local services | Retire when Waves 2+4 complete (hub + issue-core local) | `CUST-WP-0054-T02` | 2–4 | **interim active** |
| 25 | **CNPG operator** | coulombcore (boundary note) | railiance01 | Platform operator moves with Wave 2+ workloads | `railiance-platform` | 2–7 | grandfathered |
| 26 | **coulombcore host identity** | coulombcore | railiance02 | Machine phoenix after Wave 7 | `CUST-WP-0054-T09`, `CUST-WP-0054-T08` | 8 | wait |
## Per-wave detail
### Wave 1 — Source forge + CI (unblocks repos and images)
**Goal:** All repos and container images publish from railiance01; coulombcore
Gitea becomes read-only mirror.
| Step | Action | Done when |
| --- | --- | --- |
| 1.1 | Resolve `RAIL-HO-WP-0005-T02` production decisions (hostname **decided:** `forgejo.coulomb.social`; SMTP, runners, backup still open) | `docs/forgejo-production-decisions.md` |
| 1.2 | Disposable Forgejo probe namespace + restore drill | Backup/restore evidence id recorded |
| 1.3 | Production Forgejo cutover | All 74 repo remotes point at Forgejo; push/pull verified |
| 1.4 | Actions runners for `state-hub`, `core-hub`, `activity-core`, `issue-core` | Tag-triggered image lands in forge OCI |
| 1.5 | Gitea → read-only mirror on coulombcore | Rollback window documented; no new writes |
**Blocks:** Wave 2 sweep checkouts (needs forge clones on railiance01).
### Wave 2 — State Hub home on railiance01
**Goal:** Automation loop machine-local; consistency sweeps write back to
railiance01 checkouts, not workstation paths.
| Step | Action | Done when |
| --- | --- | --- |
| 2.1 | CNPG + storage review on railiance01 | Platform sign-off |
| 2.2 | `CUST-WP-0011-T07` cutover to railiance01 primary | Row counts match; `127.0.0.1:8000` serves railiance01 hub |
| 2.3 | Clone/register 74 repos on railiance01 from Forgejo | `fix-consistency` writebacks use railiance01 paths |
| 2.4 | Retire fleet tunnel `fleet-state-hub-coulombcore` | activity-core reaches hub without coulombcore hop |
| 2.5 | WSL2 fallback retirement (optional, after stabilization) | `CUST-WP-0011-T08/T09` |
**Prereq:** Wave 1 forge (clone source).
### Wave 3 — Core Hub production
**Goal:** `hub.coulomb.social` served from railiance01 Core Hub.
| Step | Action | Done when |
| --- | --- | --- |
| 3.1 | Close `CORE-WP-0005-T04` prerequisites (widget types, auth posture) | Catalog gap resolved |
| 3.2 | Operator-approved cutover with rollback plan | Deployed smoke + activity-core sink green |
| 3.3 | Inter-Hub marked rollback-only | `CORE-WP-0007` unblocks |
**Prereq:** Wave 1 (images via forge CI).
### Wave 4 — issue-core
**Goal:** Emission path is railiance01-local; no coulombcore ClusterIP in path.
| Step | Action | Done when |
| --- | --- | --- |
| 4.1 | Staged-promotion overlay on railiance01 | ArgoCD sync healthy |
| 4.2 | Migrate CNPG + secrets | ExternalSecret Ready |
| 4.3 | Point `ISSUE_CORE_URL` at in-cluster svc | Retire `fleet-issue-core-coulombcore` tunnel |
| 4.4 | Safe emission smoke | HTTP 201 + Gitea/Forgejo issue created |
**Prereq:** Wave 1 (image + gitops); credential lane `CCR-2026-0002` active.
### Wave 5 — GitOps control plane
**Goal:** ArgoCD and ESO run on railiance01 and track Forgejo repos.
| Step | Action | Done when |
| --- | --- | --- |
| 5.1 | ArgoCD overlay on railiance01 | Sync from Forgejo remotes |
| 5.2 | ESO → SecretStore paths updated | Workloads on railiance01 pull secrets |
| 5.3 | Decommission coulombcore ArgoCD Applications | No new syncs to coulombcore-k3s |
**Prereq:** Waves 1–2 (forge URLs, hub coordination).
### Wave 6 — Application stragglers
Low-coupling apps and evidence lanes that do not block earlier waves:
- llm-connect production profile completion
- ops-hub widget evidence via Core Hub
- artifact-store compatibility endpoint (if approved)
Each uses staged-promotion unless listed under **Documented exceptions**.
### Wave 7 — OpenBao + identity (LAST)
**Goal:** Authentication and secret custody off coulombcore.
| Step | Action | Done when |
| --- | --- | --- |
| 7.1 | OpenBao staged-promotion to railiance01 | Unseal automation (`NET-WP-0020`) proven |
| 7.2 | KeyCape / Authelia / privacyIDEA / lldap migration | OIDC login smoke on railiance01 |
| 7.3 | flex-auth registry points at new identity endpoints | Credential lanes re-pointed |
| 7.4 | CCR/applier paths verified | No production secret reads from coulombcore OpenBao |
**Gate:** `CUST-WP-0054-T09` cannot start until Wave 7 completes.
### Wave 8 — Phoenix to railiance02
Execute `CUST-WP-0054-T09` via T08 automation: wipe coulombcore, rebuild as
railiance02, join fleet. DNS/cert plan for remaining `*.coulomb.social` names.
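The prerequisites stated in the per-wave detail can be checked mechanically. The sketch below encodes them as a dependency graph and derives a valid execution order with Python's standard `graphlib`. Two entries are interpretations rather than statements from the plan: Wave 6 has no stated prerequisite, and Wave 7's "LAST" is read here as "after Waves 1 to 6".

```python
from graphlib import TopologicalSorter

# wave -> set of waves that must complete first
WAVE_PREREQS = {
    1: set(),
    2: {1},                  # needs forge clones on railiance01
    3: {1},                  # images via forge CI
    4: {1},                  # image + gitops
    5: {1, 2},               # forge URLs, hub coordination
    6: set(),                # assumption: no prerequisite stated
    7: {1, 2, 3, 4, 5, 6},   # assumption: "LAST" means after Waves 1-6
    8: {7},                  # phoenix gated on Wave 7
}


def ready_waves(done: set) -> list:
    """Waves not yet done whose prerequisites are all complete."""
    return sorted(
        wave for wave, prereqs in WAVE_PREREQS.items()
        if wave not in done and prereqs <= done
    )


# One valid serial order; raises graphlib.CycleError if the plan is cyclic.
order = list(TopologicalSorter(WAVE_PREREQS).static_order())
```

`ready_waves` answers "what can start now" for a given set of completed waves, which is useful when waves run in parallel rather than strictly in sequence.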
## Documented exceptions
| Workload | Reason | Target date | Rollback | Approval |
| --- | --- | --- | --- | --- |
| Fleet mesh systemd tunnels | Wave 2/4 not complete; railiance01 reaches coulombcore ClusterIPs | Until Waves 2+4 done | Re-enable workstation reverse tunnels per `docs/fleet-mesh-dehub-runbook.md` | `CUST-WP-0054-T02` cutover 2026-07-03 |
| Core Hub staging on coulombcore | Pre-cutover smoke environment | Until Wave 3 cutover | Keep staging namespace | `CORE-WP-0005` |
| Static `id_ops` SSH key on railiance01 fleet units | `atm-fleet-mesh` cert_command blocked on VAULT_TOKEN | Until warden sign available | ops-bridge or rotated key | `CUST-WP-0054-T02` interim |
No other exceptions as of 2026-07-03. New exceptions require a State Hub
decision or workplan amendment.
## Staged-promotion method (default)
Per `RAIL-BS-WP-0006` (finished):
1. `railiance/<app>/app.toml` + overlay in owning repo
2. Stage 1 deploy → observe → promote with evidence
3. Backup/restore drill before production promotion
4. Rollback revision documented
Apps without overlays yet must get an overlay scaffold before Wave execution.
## Inventory sync
`ops/service-inventory.yml` updated 2026-07-03 for:
- coulombcore `lifecycle_state: draining` on grandfathered production services
- State Hub primary on coulombcore cluster (not workstation)
- railiance01 fleet-mesh and activity-core placement
- ops-bridge on railiance01 via systemd (not workstation hub)
Regenerate catalog view: `make ops-inventory-view`
## Human gates (not agent-executable)
| Gate | Owner | Blocks |
| --- | --- | --- |
| Forgejo T02 production decisions | operator | Wave 1 |
| State Hub railiance01 cutover approval | operator; `CUST-WP-0011-T07` | Wave 2 |
| Core Hub production cutover | operator; `CORE-WP-0005-T04` | Wave 3 |
| OpenBao/identity migration approval | operator + custody | Wave 7 |
| coulombcore phoenix approval | operator | Wave 8 |


@@ -0,0 +1,147 @@
# Fleet Mesh De-Hub Runbook (CUST-WP-0054-T02)
Date: 2026-07-03
Workplan: `CUST-WP-0054-T02`
Architecture: `docs/workstation-independence-fleet-architecture.md`
## Goal
Remove the workstation from production data paths between railiance01
(activity-core) and coulombcore (State Hub cluster, issue-core). Workstation
tunnels become interactive dev access only.
## Before (workstation hub)
```
railiance01:18000 ──reverse──► workstation:8000 ──forward──► coulombcore cluster State Hub
railiance01:18765 ──reverse──► workstation:18765 ──forward──► coulombcore cluster issue-core
```
## After (fleet-owned)
```
railiance01:18000 ──forward via SSH to coulombcore──► 10.43.170.94:8000 (State Hub)
railiance01:18765 ──forward via SSH to coulombcore──► 10.43.103.154:8765 (issue-core)
```
activity-core `actcore-state-hub-bridge` and `actcore-issue-core-bridge` keep
proxying to `127.0.0.1:18000` and `127.0.0.1:18765` on the railiance01 node.
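For orientation, each fleet-owned forward corresponds to one long-running `ssh -N -L` process. The sketch below builds those command lines from the ports and ClusterIPs shown above. It is an assumption about what the systemd units run, not a copy of their contents; the keepalive and fail-fast options in particular are illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Forward:
    unit: str          # systemd user unit name
    local_port: int    # port on railiance01 loopback
    remote_host: str   # coulombcore ClusterIP
    remote_port: int


FORWARDS = [
    Forward("fleet-state-hub-coulombcore", 18000, "10.43.170.94", 8000),
    Forward("fleet-issue-core-coulombcore", 18765, "10.43.103.154", 8765),
]


def ssh_command(fwd: Forward,
                target: str = "tegwick@92.205.130.254",
                key: str = "~/.ssh/id_ops") -> list:
    """Argument vector for one forward tunnel (illustrative options)."""
    return [
        "ssh", "-N",                           # no remote command, forward only
        "-o", "ExitOnForwardFailure=yes",      # fail fast if the bind fails
        "-o", "ServerAliveInterval=30",        # detect dead connections
        "-i", key,
        "-L", f"127.0.0.1:{fwd.local_port}:{fwd.remote_host}:{fwd.remote_port}",
        target,
    ]
```

Binding to `127.0.0.1` keeps the forwarded ports reachable only from the railiance01 node itself, which matches how the bridge pods consume them.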
## Prerequisites
| Item | Check |
| --- | --- |
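| ops-bridge installed on railiance01 | `which bridge` (must resolve to ops-bridge, not the `iproute2` `bridge` utility; otherwise use the systemd units under Install) |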
| SSH key authorized on coulombcore | `ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 true` from railiance01 |
| ClusterIPs current | `state-hub-primary` and `issue-core-coulombcore` workstation tunnels |
| warden `atm-fleet-mesh` (target) | `cert_command` migration after static-key smoke passes |
Reference config: `infra/fleet-mesh/railiance01-tunnels.yaml`
## Install (railiance01)
railiance01 ships the kernel `bridge` utility (`iproute2`), not ops-bridge. Use the
systemd user units in `infra/fleet-mesh/systemd/` (or the installer script).
```bash
# From the-custodian repo on the workstation
bash infra/fleet-mesh/install-railiance01.sh railiance01
```
The installer copies:
- `infra/fleet-mesh/systemd/*.service` → `~/.config/systemd/user/`
- `infra/fleet-mesh/railiance01-tunnels.yaml` → `~/.config/bridge/tunnels.yaml` (reference for future ops-bridge install)
- `~/.ssh/id_ops` → railiance01 (static key interim; migrate to `atm-fleet-mesh` + `cert_command`)
Enable lingering so user units survive logout/reboot:
```bash
ssh railiance01 'sudo loginctl enable-linger tegwick'
```
## Cutover
```bash
# 1. Stop workstation reverse tunnels (one at a time — ops-bridge CLI)
bridge down state-hub-railiance01
bridge down issue-core-railiance01
# 2. Start fleet-owned forward tunnels on railiance01 (systemd)
ssh railiance01 'systemctl --user enable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
# 3. Smoke from railiance01 node
ssh railiance01 'curl -sf http://127.0.0.1:18000/state/health && curl -sf http://127.0.0.1:18765/healthz'
```
**Cutover evidence (2026-07-03):** workstation reverse tunnels stopped;
railiance01 systemd forwards healthy; `actcore-*-bridge` pods 1/1; progress
write through fleet path succeeded (event `647b70c0`).
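The step 3 smoke can also be expressed as a small reusable probe, for example to run it on a schedule from the railiance01 node. This is a sketch under the assumption that both endpoints return HTTP 200 when healthy, as the service inventory records; the function names are illustrative.

```python
import urllib.request

ENDPOINTS = {
    "state-hub": "http://127.0.0.1:18000/state/health",
    "issue-core": "http://127.0.0.1:18765/healthz",
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, or 0 when the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return 0


def smoke(endpoints: dict, fetch=http_status) -> dict:
    """Map each endpoint name to True when it answers with HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Calling `smoke(ENDPOINTS)` on railiance01 should yield `True` for both names while the forwards are healthy. The `fetch` parameter exists so the logic can be exercised without a live tunnel.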
## Verify production (partial T10 rehearsal)
With workstation reverse tunnels **down**, confirm:
```bash
# Bridge pods healthy
ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core | grep bridge'
# Consistency sweep API (from railiance01 cluster network)
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-state-hub-bridge:8000/state/health\").read().decode())
"'
# Issue-core bridge
ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c "
import urllib.request
print(urllib.request.urlopen(\"http://actcore-issue-core-bridge:8765/healthz\").read().decode())
"'
```
Optional emission smoke (safe label only): trigger a known-safe activity-core
run or use the issue-core REST sink checklist from
`near-term-production-service-lanes-status.md`.
## Persist across reboot
Systemd user units are enabled via `install-railiance01.sh`. Confirm:
```bash
ssh railiance01 'loginctl show-user tegwick -p Linger; systemctl --user is-enabled fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
```
When ops-bridge is installed on railiance01, `railiance01-tunnels.yaml` is the
drop-in config; until then systemd units are the production implementation.
## Rollback
```bash
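# Stop the fleet-owned systemd forwards on railiance01 (ops-bridge is not installed there)
ssh railiance01 'systemctl --user disable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'
# Restart the workstation reverse tunnels (ops-bridge CLI on the workstation)
bridge up state-hub-railiance01 issue-core-railiance01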
```
## Workstation tunnel policy after cutover
| Keep (interactive dev) | Retire from production dependency |
| --- | --- |
| `state-hub-primary` (MCP/agents) | `state-hub-railiance01` |
| `k3s-api-*` | `issue-core-railiance01` |
| `state-hub-mcp-*` | — |
| `issue-core-coulombcore` (workstation dev only) | — |
Production on railiance01 must not depend on any workstation tunnel.
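The keep/retire table can be turned into a lookup for auditing tunnel lists. The sketch below matches tunnel names against the patterns in the table using shell-style wildcards; the `unlisted` result for names the table does not mention is an assumption, not a policy statement.

```python
from fnmatch import fnmatch

KEEP_DEV = ["state-hub-primary", "k3s-api-*", "state-hub-mcp-*",
            "issue-core-coulombcore"]
RETIRE = ["state-hub-railiance01", "issue-core-railiance01"]


def tunnel_policy(name: str) -> str:
    """Classify a workstation tunnel name per the table above."""
    if any(fnmatch(name, pattern) for pattern in RETIRE):
        return "retire"
    if any(fnmatch(name, pattern) for pattern in KEEP_DEV):
        return "keep-dev"
    return "unlisted"  # not covered by the table; needs a human decision
```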
## WireGuard evaluation
The current fleet mesh runs two forward tunnels (two systemd units). Per workplan
T02, a WireGuard successor is deferred until the persistent unit count exceeds ~5.
## cert_command migration (follow-on)
Replace static `id_ops` with `atm-fleet-mesh` + `cert_command`:
1. Register `atm-fleet-mesh` in warden inventory and CoulombCore `ssh_principals.yaml`
2. Generate dedicated keypair on railiance01
3. Set `cert_command: "warden sign atm-fleet-mesh --pubkey ..."` per
`ops-warden/wiki/playbooks/ops-bridge-tunnel-cert.md`


@@ -3,7 +3,7 @@
<!-- generated by ops/render_service_inventory.py; edit ops/service-inventory.yml instead -->
Source: `ops/service-inventory.yml`
-Inventory last reviewed: `2026-06-05`
+Inventory last reviewed: `2026-07-03`
This is the repo-native first view for `CUST-WP-0047`. It exists so an
operator can answer what is running where before the full standalone
@@ -16,9 +16,9 @@ operator can answer what is running where before the full standalone
| Environments | 4 |
| Hosts | 3 |
| Clusters | 3 |
-| Services | 8 |
-| Services: observed_ok | 2 |
-| Services: unknown | 6 |
+| Services | 11 |
+| Services: observed_ok | 6 |
+| Services: unknown | 5 |
## Service Catalog
@@ -27,10 +27,13 @@ operator can answer what is running where before the full standalone
| Gitea (gitea) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: default | railiance-apps | https://gitea.coulomb.social/v2/<br>Expected: status 401, OCI registry auth challenge | unknown<br>2026-05-16: Inventory draft records Helm release gitea, namespace default, app version 1.25.4, NodePort 32166, and registry auth challenge. | database:gitea-db<br>pvc:default/gitea-shared-storage | k8s: unknown (coulombcore-k3s/default) | Package token and push/pull verification need current evidence. |
| Gitea Database (gitea-database) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: databases | railiance-platform | - | unknown<br>2026-05-16: /home/worsch/helix-forge/wiki/OpsHubInventory.md | - | k8s: unknown (coulombcore-k3s/databases) | Backup and restore evidence not recorded in ops inventory. |
| Gitea Shared Storage (gitea-shared-storage) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: default | railiance-platform<br>railiance-apps | - | unknown<br>2026-05-16: /home/worsch/helix-forge/wiki/OpsHubInventory.md | - | k8s: unknown (coulombcore-k3s/default/pvc/gitea-shared-storage) | Package blob backup and restore evidence not confirmed. |
-| State Hub (state-hub) | Local Workstation<br>type: local-process; host: local-workstation; ports: 8000 | state-hub<br>the-custodian | http://127.0.0.1:8000/state/health<br>Expected: status 200, health response | observed_ok<br>2026-06-05: State Hub accepted inbox, task, and progress API calls. | postgresql:state-hub | http: observed_ok (http://127.0.0.1:8000) | Future cluster deployment readiness still needs ops evidence. |
+| State Hub (state-hub) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: state-hub | state-hub<br>the-custodian | http://127.0.0.1:8000/state/health<br>Expected: status 200, health response | observed_ok<br>2026-07-03: Cluster hub healthy; railiance01 reaches via fleet forward tunnel. | postgresql:state-hub-db | http: observed_ok (workstation tunnel state-hub-primary → cluster)<br>tunnel: observed_ok (railiance01 systemd fleet-state-hub-coulombcore → cluster) | Primary home must move to railiance01 per CUST-WP-0054-T05. |
+| issue-core (issue-core) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: issue-core | issue-core | http://127.0.0.1:8765/healthz<br>Expected: status 200, version response | observed_ok<br>2026-07-02: REST emission live via cross-machine fleet path. | postgresql:issue-core | tunnel: observed_ok (railiance01 fleet-issue-core-coulombcore → cluster) | Target railiance01 overlay per CUST-WP-0054 drain Wave 4. |
+| Core Hub (core-hub) | CoulombCore<br>type: k3s; cluster: coulombcore-k3s; namespace: core-hub-staging | core-hub | https://hub.coulomb.social/api/v2/hubs<br>Expected: status 200, hub list when authenticated | observed_ok<br>2026-07-02: Staging deployed; production cutover gated on CORE-WP-0005-T04. | postgresql:core-hub | k8s: observed_ok (coulombcore-k3s/core-hub-staging) | Production cutover to railiance01 pending operator approval. |
+| Fleet Mesh (railiance01) (fleet-mesh-railiance01) | Railiance01<br>type: systemd; host: railiance01 | the-custodian<br>ops-bridge | http://127.0.0.1:18000/state/health<br>Expected: status 200 | observed_ok<br>2026-07-03: Workstation reverse tunnels stopped; systemd forwards healthy. | - | ssh-tunnel: observed_ok (railiance01 → coulombcore ClusterIPs) | Migrate to atm-fleet-mesh cert_command when VAULT_TOKEN available. |
| Inter-Hub (inter-hub) | ThreePhoenix Production<br>type: external; public_endpoint: https://hub.coulomb.social | inter-hub | https://hub.coulomb.social/api/v2/openapi.json<br>Expected: status 200, OpenAPI document | unknown<br>2026-05-16: /home/worsch/helix-forge/wiki/OpsHubInventory.md | - | https: unknown (https://hub.coulomb.social) | ops-hub bootstrap requires authenticated UI flow or deployment-side migration. |
| activity-core (activity-core) | Railiance01<br>type: k3s; cluster: railiance01-k3s; namespace: activity-core | activity-core<br>the-custodian | activity-core API health endpoint<br>Expected: status 200, healthy DB and Temporal status | observed_ok<br>2026-05-23: API health, worker rollout, Temporal CLI schedule listing, and State Hub bridge were verified. | postgresql:activity-core<br>temporal:activity-core<br>nats:railiance01 | k8s: observed_ok (railiance01-k3s/activity-core) | Add explicit ops inventory probes and evidence events. |
-| Ops Bridge (ops-bridge) | Local Workstation<br>type: bridge; host: local-workstation | ops-bridge | - | unknown<br>2026-05-16: Bridge is useful for connected-server visibility but is not itself the service catalog. | - | ssh-tunnel: unknown (connected remote servers) | Emit reachability evidence into ops-hub instead of relying on bridge state as inventory. |
+| Ops Bridge (ops-bridge) | Local Workstation<br>type: bridge; host: local-workstation | ops-bridge | - | observed_ok<br>2026-07-03: state-hub-railiance01 and issue-core-railiance01 stopped; not production-critical. | - | ssh-tunnel: observed_ok (interactive dev tunnels only (k3s-api, state-hub-primary)) | Install ops-bridge on railiance01 or keep systemd fleet-mesh units. |
| Haskell Build Agent (haskell-build-agent) | Local Workstation<br>type: systemd; host: haskell-build-vm | the-custodian | http://127.0.0.1:18000<br>Expected: VM can reach State Hub through SSH forward | unknown<br>undated: Build agent is a systemd service and registers with State Hub on boot. | - | ssh: unknown (local workstation reverse tunnel port 12222) | Current tunnel and capability registration need live evidence in ops-hub. |
## Open Operating Gaps
@@ -50,7 +53,21 @@ operator can answer what is running where before the full standalone
### State Hub (`state-hub`)
-- Future cluster deployment readiness still needs ops evidence.
+- Primary home must move to railiance01 per CUST-WP-0054-T05.
+- Consistency sweep writebacks still target workstation paths.
+### issue-core (`issue-core`)
+- Target railiance01 overlay per CUST-WP-0054 drain Wave 4.
+### Core Hub (`core-hub`)
+- Production cutover to railiance01 pending operator approval.
+### Fleet Mesh (railiance01) (`fleet-mesh-railiance01`)
+- Migrate to atm-fleet-mesh cert_command when VAULT_TOKEN available.
+- Retire when State Hub and issue-core move to railiance01.
### Inter-Hub (`inter-hub`)
@@ -62,7 +79,7 @@ operator can answer what is running where before the full standalone
### Ops Bridge (`ops-bridge`)
-- Emit reachability evidence into ops-hub instead of relying on bridge state as inventory.
+- Install ops-bridge on railiance01 or keep systemd fleet-mesh units.
### Haskell Build Agent (`haskell-build-agent`)


@@ -0,0 +1,298 @@
# Workstation Independence and Fleet Role Architecture
Date: 2026-07-03
Status: draft (canon-adjacent; promote to `canon/architecture/` after review)
Workplan: `CUST-WP-0054` T01
Related: `ADR-001`, `ADR-004`, `RAIL-BS-WP-0006`, `RAIL-HO-WP-0005`, `CUST-WP-0011`
## Purpose
Fix the three-machine role model, the fleet mesh topology, the promotion gate
for "production", and the phoenix path `coulombcore → railiance02`. Provide a
dependency register so every workload, tunnel, repo remote, sink path, and
build pipeline has a **current host**, **target host**, and **migration owner**.
The acceptance proof for the whole plan is `CUST-WP-0054-T10`: production runs
24h+ with the workstation fully offline.
## Machine Roles
| Machine | IP / identity | Current role (2026-07-03) | Target role |
| --- | --- | --- | --- |
| **railiance01** | `92.205.62.239` | First ThreePhoenix foundation node; hosts activity-core production, partial State Hub cluster footprint, automation schedules | **Production home** — first node of the growing Railiance fleet; hosts State Hub primary, forge, CI runners, and the automation loop |
| **coulombcore** | `92.205.130.254` | De-facto production host: State Hub cluster primary, Core Hub (`hub.coulomb.social`), issue-core, OpenBao, identity stack, ESO/ArgoCD, Gitea/registry | **Frozen legacy** — no new production; drain workload-by-workload; eventually wiped and **reborn as railiance02** |
| **workstation** | `bnt-lap001` / WSL2 | Production network hub (all 16 ops-bridge tunnels), State Hub client endpoint (`127.0.0.1:8000`), consistency-sweep writebacks, image build/publish, dev checkouts for 74 registered repos | **Temporary dev environment** — clone repos, run `make dev-hub`, push when connected; nothing in the production loop may depend on it being on |
### Role invariants
1. Production workloads authenticate, schedule, emit, and reconcile without the
workstation.
2. `coulombcore` is frozen for new production immediately (policy; see T03).
3. A workload counts as "production on railiance01" only after passing the
staged-promotion gate (see below).
4. Files remain authoritative per ADR-001; fleet databases are disposable caches.
## Fleet Mesh Topology
### Current topology (workstation as hub)
All ops-bridge tunnels originate on the workstation. Two production data paths
**chain through** it:
```
railiance01                                         workstation                                   coulombcore
───────────                                         ───────────                                   ───────────
activity-core ──(state-hub-railiance01 reverse)──►  :18000 ──(state-hub-primary forward)──►       State Hub cluster
activity-core ──(issue-core-railiance01 reverse)──► :local ──(issue-core-coulombcore forward)──►  issue-core
```
Live tunnel inventory (2026-07-03, `bridge status`):
| Tunnel | Direction | Actor | Production-critical? |
| --- | --- | --- | --- |
| `state-hub-primary` | workstation → coulombcore cluster | `agt-claude-coulombcore` | **yes** — MCP/agents reach cluster hub via `127.0.0.1:8000` |
| `state-hub-cluster-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | dev/ops access |
| `state-hub-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | **yes** — activity-core reaches hub |
| `state-hub-mcp-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | dev MCP |
| `issue-core-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | **yes** — emission lane |
| `issue-core-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | **yes** — completes emission chain |
| `state-hub-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | legacy/dev |
| `state-hub-mcp-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | dev MCP |
| `k3s-api-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | operator dev |
| `k3s-api-haskelseed` | workstation → haskelseed | `agt-claude-haskelseed` | experimental |
| `flex-auth-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | identity dev |
| `core-hub-staging-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | staging |
| `inter-hub-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | legacy Inter-Hub |
| `state-hub-haskelseed` | haskelseed → workstation | `agt-claude-haskelseed` | experimental |
| `state-hub-mcp-haskelseed` | haskelseed → workstation | `agt-claude-haskelseed` | experimental |
| `nix-daemon-haskelseed` | haskelseed → workstation | `agt-claude-haskelseed` | build dev |
A workstation reboot breaks daily triage evidence, consistency sweeps, and
issue emission until tunnels recover.
### Target topology (fleet-owned mesh)
```
railiance01 ◄────────────────────────────────────► coulombcore (draining)
  │   direct atm- tunnels (ops-bridge on-host)       │
  │   State Hub API                                  │ legacy until drain complete
  │   issue-core REST                                │
  └─ activity-core, Temporal, sweep checkouts        └─ identity, OpenBao (last to move)
workstation (optional client)
  │ interactive-only: k3s API, hub read, dev-hub
  └─ may disconnect without production impact
```
Implementation owner: `CUST-WP-0054-T02`.
Key changes:
- ops-bridge (or systemd ssh units) runs **on railiance01** with `atm-` actor
certs for cross-machine lanes.
- `actcore-state-hub-bridge` and `actcore-issue-core-bridge` point at
machine-local tunnel ports, not workstation forwards.
- Workstation tunnels remain for interactive dev only.
- Evaluate WireGuard mesh when persistent unit count exceeds ~5.
This posture extends ADR-004 (connectivity-first) from "workstation connects
everything" to "fleet machines connect each other; workstation is a client."
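The "machine-local tunnel ports" change can be spot-checked from railiance01 itself. The following is a minimal sketch, assuming the local ports and health paths used by the fleet-mesh install script in this change (`18000` with `/state/health`, `18765` with `/healthz`). The `CURL` override and both function names are hypothetical conveniences, not part of ops-bridge.

```shell
#!/usr/bin/env bash
# Sketch: probe the machine-local tunnel ports that the activity-core bridges
# point at. CURL is an overridable hook (an assumption of this sketch) so the
# logic can be exercised without a live tunnel.

check_lane() {
  # $1 = lane label, $2 = health URL on the local forward
  local name="$1" url="$2"
  if "${CURL:-curl}" -sf --max-time 5 "$url" >/dev/null; then
    echo "$name ok"
  else
    echo "$name FAIL"
    return 1
  fi
}

check_fleet_lanes() {
  # Probe every lane even if an earlier one fails, then report overall status.
  local rc=0
  check_lane state-hub  http://127.0.0.1:18000/state/health || rc=1
  check_lane issue-core http://127.0.0.1:18765/healthz      || rc=1
  return "$rc"
}
```

Running `check_fleet_lanes` on railiance01 with the workstation powered off would be one way to gather evidence for invariant 1. It only proves reachability, not that the bridges are actually configured to use these ports.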
## Production Promotion Gate
A workload is **production on railiance01** only when it conforms to the
finished staged-promotion contract (`RAIL-BS-WP-0006`):
| Gate | Requirement |
| --- | --- |
| Overlay repo | `railiance/<app>/` with `app.toml` and stage manifests |
| Stage commands | `stage deploy`, `stage observe`, `stage promote`, `stage rollback` proven |
| Evidence | Backup/restore drill, canary observation, operator approval recorded |
| Registry | Image in forge OCI registry with immutable tag |
**Exceptions** must be documented in the placement plan (T03) with explicit
rollback. No exception bypasses backup evidence for stateful workloads.
`coulombcore` workloads still running in production today are **grandfathered
legacy** until their drain task completes — not newly promoted production.
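The first gate row lends itself to a mechanical pre-check. The sketch below assumes a checkout that contains `railiance/<app>/app.toml`, as the table states. The location of the stage manifests is not specified in this document, so the `stages/*.yaml` default is a placeholder assumption, and `overlay_gate_ok` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch of the "Overlay repo" gate row only. The stage-command, evidence, and
# registry rows need their own checks and are not covered here.

overlay_gate_ok() {
  # $1 = checkout root, $2 = app name,
  # $3 = manifest glob relative to the overlay dir (ASSUMPTION: stages/*.yaml)
  local root="$1" app="$2" manifest_glob="${3:-stages/*.yaml}"
  local dir="$root/railiance/$app"

  if [ ! -f "$dir/app.toml" ]; then
    echo "missing $dir/app.toml"
    return 1
  fi

  # Expand the glob into the positional parameters; if nothing matches, the
  # pattern stays literal and the -e test fails.
  set -- "$dir"/$manifest_glob
  if [ ! -e "${1:-}" ]; then
    echo "no stage manifests in $dir"
    return 1
  fi

  echo "overlay ok: $app"
}
```

A passing result says only that the overlay has the expected shape. Promotion still depends on the proven stage commands and the recorded evidence.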
## Phoenix Path: coulombcore → railiance02
Machine-scale phoenix rotation reuses the same automation intended for future
3-node weekly rotations (`RAIL-BS-WP-0007`, `CUST-WP-0038` deferred until
railiance02 exists).
### Preconditions (drain complete)
All production dependencies moved off coulombcore per T03 ordering:
1. Forge + CI (T04) — repos and images no longer depend on `gitea.coulomb.social`
2. State Hub primary (T05) — cluster DB and sweep checkouts on railiance01
3. Core Hub, issue-core, Inter-Hub legacy — per T03 sequence
4. Identity + OpenBao — **last** (everything authenticates through them)
### Phoenix execution
Owner: `CUST-WP-0054-T09`, automation: `CUST-WP-0054-T08`.
| Phase | Action | Tooling |
| --- | --- | --- |
| S0 | Final inventory sweep, DNS/cert plan for `*.coulomb.social`, data archival | T09 |
| S1 | Wipe and greenfield rebuild | `NET-WP-0020` unseal + bootstrap chain |
| S2 | Join as `railiance02` | `railiance-cluster` overlay, `atm-` certs |
| S3 | Prove join-ready | Phoenix drill on disposable target first (T08) |
Longhorn distributed storage and PG streaming HA unlock once railiance01 +
railiance02 are both fleet nodes.
## Dev Environment (Files-First Beachhead)
Strategy A from the workplan; owner: `CUST-WP-0054-T07`.
```
git clone → make dev-hub → local ephemeral hub (compose)
├─ C-06 registration rebuilds workplan/task state from files
├─ offline write buffer (STATE-WP-0068) for progress/task events
└─ reconnect relay upstream; files reconcile, databases do not replicate
```
MCP config gains an explicit `dev` / `fleet` profile switch. The workstation is
genuinely temporary: no fleet DB sync is required for orientation.
## Dependency Register
### Workloads
| Workload | Current host | Target host | Migration owner | Method / notes |
| --- | --- | --- | --- | --- |
| State Hub API (primary) | coulombcore CNPG cluster via workstation tunnel `state-hub-primary` → `127.0.0.1:8000` | railiance01 | `CUST-WP-0054-T05` | `CUST-WP-0011-T07` playbook: freeze → exact-count restore → rewire |
| State Hub API (WSL2 fallback) | workstation WSL2 | retired | `CUST-WP-0011-T08/T09` → absorbed by `CUST-WP-0054-T10` | Stabilization window; not part of target architecture |
| activity-core | railiance01 k3s (`activity-core` ns) | railiance01 (retain) | — | Already on target machine; fix bridges in T02 |
| issue-core | coulombcore k3s | railiance01 | `CUST-WP-0054-T03` drain seq. | `ISSUE-WP-0003` live; emission chain fixed in T02 |
| Core Hub | coulombcore (`hub.coulomb.social`) | railiance01 | `CORE-WP-0005` + `CUST-WP-0054-T03` | Staging on coulombcore; production cutover human-gated |
| Inter-Hub (legacy Haskell) | coulombcore external | retired | `CORE-WP-0007` | Rollback-only after Core Hub cutover |
| Gitea + OCI registry | coulombcore k3s | railiance01 Forgejo | `RAIL-HO-WP-0005` / `CUST-WP-0054-T04` | Read-only mirror on coulombcore until decommission |
| OpenBao | coulombcore | railiance01 | `CUST-WP-0054-T03` (last) | NET-WP-0020 unseal automation |
| Identity stack (KeyCape, Authelia, privacyIDEA, lldap) | coulombcore | railiance01 | `CUST-WP-0054-T03` (last) | Coupled to OpenBao |
| ESO + ArgoCD control plane | coulombcore | railiance01 | `CUST-WP-0054-T03` | GitOps follows forge move |
| CNPG databases (per workload) | coulombcore / railiance01 | railiance01 per workload | `CUST-WP-0054-T03`, `CUST-WP-0054-T05` | CNPG pattern proven; migrate with workload |
| llm-connect | TBD cluster | railiance01 | near-term lanes board | `CCR-2026-0003` credential lane active |
| ops-hub (widget/evidence) | files + Inter-Hub widgets | railiance01 via Core Hub | `CUST-WP-0025`, `CUST-WP-0049` | Not blocking workstation independence |
| Temporal (activity-core) | railiance01 | railiance01 (retain) | — | Co-locate with activity-core |
| NATS (activity-core) | railiance01 | railiance01 (retain) | — | Co-locate with activity-core |
### Network tunnels (production-critical)
| Lane | Current path | Target path | Owner |
| --- | --- | --- | --- |
| activity-core → State Hub | railiance01 reverse → workstation → `state-hub-primary` → coulombcore | railiance01 `atm-` forward → railiance01 State Hub (local or short hop) | `CUST-WP-0054-T02` |
| Agents/MCP → State Hub | workstation `127.0.0.1:8000` → `state-hub-primary` → coulombcore | workstation `127.0.0.1:8000` → tunnel to railiance01 hub (dev client) or fleet endpoint | `CUST-WP-0054-T05` + T07 profiles |
| railiance01 automations → State Hub | `:18000` chain via workstation | railiance01-local bridge port | `CUST-WP-0054-T02` |
| activity-core → issue-core | railiance01 reverse → workstation → `issue-core-coulombcore` | railiance01 `atm-` forward → issue-core (on railiance01 post-drain) | `CUST-WP-0054-T02`, then T03 |
| Operator k3s access | workstation forwards (`k3s-api-*`) | workstation interactive (non-critical) | — |
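For the two `CUST-WP-0054-T02` lanes, the interim "railiance01 forward" is a plain `ssh -L` local forward bound to loopback. The sketch below shows how the tunnel fields map onto that command line. The ports, ClusterIP, SSH target, and keepalive options are the ones used by the systemd units in this change; `forward_spec` and `tunnel_command` are hypothetical helper names, and the key path is left out for brevity.

```shell
#!/usr/bin/env bash
# Sketch: derive the ssh local-forward command for a fleet-mesh lane.

forward_spec() {
  # $1 = local port, $2 = remote host (as seen from the SSH server), $3 = remote port
  # Binding to 127.0.0.1 keeps the forwarded port off external interfaces.
  printf '127.0.0.1:%s:%s:%s\n' "$1" "$2" "$3"
}

tunnel_command() {
  # $4 = user@host of the machine that can reach the remote service
  local spec
  spec="$(forward_spec "$1" "$2" "$3")"
  # ExitOnForwardFailure makes ssh exit if the port cannot be bound, so a
  # supervisor (for example systemd with Restart=always) can retry.
  printf 'ssh -N -L %s -o ServerAliveInterval=10 -o ServerAliveCountMax=3 -o ExitOnForwardFailure=yes %s\n' \
    "$spec" "$4"
}
```

For example, `tunnel_command 18000 10.43.170.94 8000 tegwick@92.205.130.254` prints the State Hub lane. Once State Hub and issue-core run on railiance01, the forward is expected to shrink to a local or short hop, as the target-path column says.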
### Repo remotes
All checked 2026-07-03; pattern is uniform:
| Repo (sample) | Current remote | Target remote | Owner |
| --- | --- | --- | --- |
| the-custodian | `gitea.coulomb.social/coulomb/the-custodian.git` | `forgejo.coulomb.social/coulomb/the-custodian.git` | `CUST-WP-0054-T04` |
| state-hub | `gitea.coulomb.social/coulomb/state-hub.git` | `forgejo.coulomb.social/coulomb/state-hub.git` | `CUST-WP-0054-T04` |
| activity-core | `gitea.coulomb.social/coulomb/activity-core.git` | `forgejo.coulomb.social/coulomb/activity-core.git` | `CUST-WP-0054-T04` |
| issue-core | `gitea.coulomb.social/coulomb/issue-core.git` | `forgejo.coulomb.social/coulomb/issue-core.git` | `CUST-WP-0054-T04` |
| ops-bridge | `gitea.coulomb.social/coulomb/ops-bridge.git` | `forgejo.coulomb.social/coulomb/ops-bridge.git` | `CUST-WP-0054-T04` |
| ops-warden | `gitea.coulomb.social/coulomb/ops-warden.git` | `forgejo.coulomb.social/coulomb/ops-warden.git` | `CUST-WP-0054-T04` |
| core-hub | `gitea.coulomb.social/coulomb/core-hub.git` | `forgejo.coulomb.social/coulomb/core-hub.git` | `CUST-WP-0054-T04` |
| *(all 74 registered repos)* | `gitea.coulomb.social/coulomb/<slug>.git` | `forgejo.coulomb.social/coulomb/<slug>.git` | `CUST-WP-0054-T04` |
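Because the pattern is uniform, the remote change reduces to a host swap per checkout. This is a minimal sketch, assuming the target hostname stays `forgejo.coulomb.social` as shown in the table (the production hostname is still listed as an open decision) and that the remote is named `origin`. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: repoint a checkout from the Gitea host to the Forgejo host.

target_remote() {
  # Swap only the host part; scheme, org, and slug are left untouched.
  printf '%s\n' "$1" | sed 's#gitea\.coulomb\.social#forgejo.coulomb.social#'
}

repoint_origin() {
  # $1 = path to a git checkout. Assumes the remote is called "origin".
  local current
  current="$(git -C "$1" remote get-url origin)"
  git -C "$1" remote set-url origin "$(target_remote "$current")"
}
```

`repoint_origin` is safe to run twice: a URL that no longer contains the Gitea host passes through `target_remote` unchanged.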
### State Hub repo checkout paths
| Concern | Current | Target | Owner |
| --- | --- | --- | --- |
| `local_path` for 74 repos | `/home/worsch/<repo>` on workstation | railiance01 clone tree (e.g. `/home/tegwick/<repo>` or gitops-managed path) | `CUST-WP-0054-T05` |
| Consistency sweep writeback host | workstation (`consistency_check.py --remote` via API) | railiance01 checkouts from forge | `CUST-WP-0054-T05`, `STATE-WP-0064` |
| COULOMBCORE `host_paths` | `/home/tegwick/<repo>` (11 repos, `CUST-WP-0021`) | retired with coulombcore drain | `CUST-WP-0054-T09` |
| Multi-host path resolution | `host_paths` map per hostname | fleet-primary host only + dev-hub local | `CUST-WP-0054-T07` |
### Sink and prompt paths
| Sink / path | Current | Target | Owner |
| --- | --- | --- | --- |
| Daily triage working-memory | `/home/worsch/the-custodian/memory/working` (ActivityDefinition + PVC mount) | repo-relative or PVC-native path + sweep sync-to-repo | `CUST-WP-0054-T06` |
| Daily triage State Hub progress | cluster hub via workstation tunnel | railiance01 hub direct | `CUST-WP-0054-T02`, `T05` |
| Consistency sweep progress event | via workstation-hosted sweep | railiance01-hosted sweep | `CUST-WP-0054-T05`, `STATE-WP-0064` |
| Agent session traces (`runtime/agent.py`) | `memory/working/agent-session-*.md` on workstation | dev-hub local buffer; commit on reconnect | `CUST-WP-0054-T07` |
| `output_schema` in ActivityDefinitions | absolute paths under `/home/worsch/the-custodian/` | repo-relative resolution in activity-core | `CUST-WP-0054-T06` |
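Several rows above replace an absolute workstation path with a repo-relative one. The conversion itself is simple prefix stripping, sketched here under the assumption that the repo root is `/home/worsch/the-custodian` as in the table. How activity-core will actually resolve relative paths is not specified in this document, and `repo_relative` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: turn an absolute workstation path into a repo-relative one.

repo_relative() {
  # $1 = absolute path, $2 = repo root (defaults to the workstation checkout)
  local path="$1" root="${2:-/home/worsch/the-custodian}"
  case "$path" in
    "$root"/*)
      # Strip the root and the separating slash.
      printf '%s\n' "${path#"$root"/}"
      ;;
    *)
      # Refuse paths outside the repo instead of guessing.
      echo "outside repo: $path" >&2
      return 1
      ;;
  esac
}
```

For example, `repo_relative /home/worsch/the-custodian/memory/working` yields `memory/working`, which resolves the same way on any host that has a checkout.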
### Build and publish pipelines
| Image / artifact | Current build host | Current registry | Target build | Target registry | Owner |
| --- | --- | --- | --- | --- | --- |
| state-hub | workstation `docker build` | `gitea.coulomb.social/coulomb/state-hub` | Forgejo Actions runner on railiance01 | railiance01 forge OCI | `CUST-WP-0054-T04` |
| core-hub | workstation / railiance-forge docs | `gitea.coulomb.social/coulomb/core-hub` | CI runner | railiance01 forge OCI | `CUST-WP-0054-T04` |
| activity-core | workstation manual rebuild + scp | railiance01 k3s import / Gitea | CI on tag push | railiance01 forge OCI | `CUST-WP-0054-T04` |
| issue-core | workstation / manual | `gitea.coulomb.social/coulomb/issue-core` | CI runner | railiance01 forge OCI | `CUST-WP-0054-T04` |
| Haskell build agent | workstation VM (`haskell-build-vm`) | n/a | retired (`CORE-WP-0007`) | n/a | `CORE-WP-0007` |
Done criterion for T01: every row above has a target and migration owner. ✓
## Drain Sequence
Detailed plan: `docs/coulombcore-drain-placement-plan.md`
Freeze policy: `canon/standards/coulombcore-production-freeze_v0.1.md`
```
Wave 1 Forge + CI (T04)
Wave 2 State Hub primary (T05)
Wave 3 Core Hub (CORE-WP-0005)
Wave 4 issue-core
Wave 5 ESO / ArgoCD
Wave 6 Supporting apps
Wave 7 OpenBao + identity (LAST)
Wave 8 coulombcore phoenix → railiance02 (T09)
```
## Sequencing Map
```
T01 (this document) ✓
  ├─ T02 de-hub network ✓
  ├─ T03 placement plan / freeze ✓
  │    ├─ T04 forge + CI
  │    └─ T05 State Hub home on railiance01
  ├─ T06 sink decoupling
  ├─ T07 dev beachhead
  └─ T08 phoenix drill
       └─ T09 coulombcore → railiance02
            └─ T10 workstation-off acceptance
```
## Evidence and Inventory Sources
- Live tunnel state: `bridge status` (2026-07-03)
- State Hub health: `http://127.0.0.1:8000/state/health` (cluster primary via tunnel)
- Registered repos: `GET /repos/` — 74 repos, all `local_path` under `/home/worsch/`
- `ops/service-inventory.yml` (2026-06-05; predates cluster cutover — refresh in T03)
- `docs/infrastructure-stabilization-pickup-checkpoint.md` (2026-07-03 metaplan closeout)
- Activity definitions: `activity-definitions/daily-statehub-wsjf-triage.md`,
`activity-definitions/state-hub-consistency-sweep.md`
## Open Gaps (not T01 blockers)
| Gap | Follow-on |
| --- | --- |
| Forgejo production hostname / SMTP / exposure decisions | `RAIL-HO-WP-0005-T02` (human) |
| `ops/service-inventory.yml` stale environment labels | Refresh during T03 |
| Core Hub widget-type registry prerequisite | `CORE-WP-0005-T04` |
| HA Postgres / Longhorn across 2+ nodes | `RAIL-BS-WP-0007`, `CUST-WP-0038` after railiance02 |
## Promotion to Canon
After operator review:
1. Move to `canon/architecture/adr-006-workstation-independence-fleet-roles.md`
(or equivalent ADR number).
2. Update `ops/service-inventory.yml` environment and service rows to match.
3. Link from `SCOPE.md` and `.custodian-brief.md` generation inputs.


@@ -0,0 +1,26 @@
#!/usr/bin/env bash
# Install fleet-mesh systemd user units on railiance01 (CUST-WP-0054-T02).
set -euo pipefail
# Optional first argument: SSH host alias of the target (defaults to railiance01).
REMOTE="${1:-railiance01}"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Create target directories, then copy the tunnel config, unit files, and ops key pair.
ssh "$REMOTE" 'mkdir -p ~/.config/bridge ~/.config/systemd/user ~/.ssh'
scp "$SCRIPT_DIR/railiance01-tunnels.yaml" "$REMOTE:~/.config/bridge/tunnels.yaml"
scp "$SCRIPT_DIR/systemd/"*.service "$REMOTE:~/.config/systemd/user/"
scp "${HOME}/.ssh/id_ops" "${HOME}/.ssh/id_ops.pub" "$REMOTE:~/.ssh/"
# Restrict the private key and tunnel config to the owning user.
ssh "$REMOTE" 'chmod 600 ~/.ssh/id_ops ~/.config/bridge/tunnels.yaml'
# Lingering keeps user units running without a login session; a failure here is ignored.
ssh "$REMOTE" 'sudo loginctl enable-linger tegwick 2>/dev/null || true'
# Enable both tunnels, then abort the install if either health endpoint is unreachable.
ssh "$REMOTE" bash -s <<'EOF'
set -euo pipefail
systemctl --user daemon-reload
systemctl --user enable --now fleet-state-hub-coulombcore.service
systemctl --user enable --now fleet-issue-core-coulombcore.service
sleep 2
curl -sf http://127.0.0.1:18000/state/health
curl -sf http://127.0.0.1:18765/healthz
systemctl --user --no-pager status fleet-state-hub-coulombcore.service fleet-issue-core-coulombcore.service
EOF
echo "Fleet mesh tunnels active on $REMOTE"


@@ -0,0 +1,51 @@
# Fleet-owned production tunnels on railiance01 (CUST-WP-0054-T02).
# Install to: ~/.config/bridge/tunnels.yaml on railiance01
#
# Replaces workstation reverse tunnels state-hub-railiance01 and
# issue-core-railiance01 with machine-local forward tunnels through coulombcore.
#
# activity-core bridge proxies (unchanged):
# actcore-state-hub-bridge -> 127.0.0.1:18000
# actcore-issue-core-bridge -> 127.0.0.1:18765
tunnels:
  fleet-state-hub-coulombcore:
    host: 92.205.130.254
    remote_port: 8000
    local_port: 18000
    direction: local
    remote_host: 10.43.170.94
    ssh_user: tegwick
    ssh_key: ~/.ssh/id_ops
    actor: atm-fleet-mesh
    health_check:
      url: http://127.0.0.1:18000/state/health
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0
      backoff_initial: 5
      backoff_max: 60
  fleet-issue-core-coulombcore:
    host: 92.205.130.254
    remote_port: 8765
    local_port: 18765
    direction: local
    remote_host: 10.43.103.154
    ssh_user: tegwick
    ssh_key: ~/.ssh/id_ops
    actor: atm-fleet-mesh
    health_check:
      url: http://127.0.0.1:18765/healthz
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0
      backoff_initial: 5
      backoff_max: 60
actors:
  atm-fleet-mesh:
    class: atm
    description: Railiance01 fleet mesh — direct production lanes to coulombcore cluster services


@@ -0,0 +1,21 @@
[Unit]
Description=Fleet mesh issue-core forward tunnel (railiance01 to coulombcore cluster)
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=0
[Service]
Type=simple
ExecStart=/usr/bin/ssh -N \
  -L 127.0.0.1:18765:10.43.103.154:8765 \
  -i /home/tegwick/.ssh/id_ops \
  -o ServerAliveInterval=10 \
  -o ServerAliveCountMax=3 \
  -o ExitOnForwardFailure=yes \
  -o StrictHostKeyChecking=accept-new \
  tegwick@92.205.130.254
Restart=always
RestartSec=5
[Install]
WantedBy=default.target


@@ -0,0 +1,21 @@
[Unit]
Description=Fleet mesh State Hub forward tunnel (railiance01 to coulombcore cluster)
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=0
[Service]
Type=simple
ExecStart=/usr/bin/ssh -N \
  -L 127.0.0.1:18000:10.43.170.94:8000 \
  -i /home/tegwick/.ssh/id_ops \
  -o ServerAliveInterval=10 \
  -o ServerAliveCountMax=3 \
  -o ExitOnForwardFailure=yes \
  -o StrictHostKeyChecking=accept-new \
  tegwick@92.205.130.254
Restart=always
RestartSec=5
[Install]
WantedBy=default.target


@@ -1,5 +1,5 @@
version: 1
last_reviewed: "2026-06-05"
last_reviewed: "2026-07-03"
policy:
non_secret_inventory: true
secrets_rule: "Do not store credentials, tokens, private addresses that are not already operationally documented, or command output containing secrets."
@@ -20,11 +20,11 @@ environments:
lifecycle_state: observed
- id: coulombcore
name: "CoulombCore"
role: "Transitional production-like runtime"
lifecycle_state: observed
role: "Legacy production host — frozen for new workloads; draining per CUST-WP-0054-T03"
lifecycle_state: draining
- id: railiance01
name: "Railiance01"
role: "First ThreePhoenix foundation node"
role: "Production home — activity-core, fleet mesh, target for drain waves"
lifecycle_state: observed
- id: threephoenix-prod
name: "ThreePhoenix Production"
@@ -77,7 +77,7 @@ services:
- id: gitea
name: "Gitea"
kind: application
lifecycle_state: observed
lifecycle_state: draining
health_status: unknown
environment: coulombcore
owner_repos:
@@ -173,9 +173,9 @@ services:
- id: state-hub
name: "State Hub"
kind: coordination-service
lifecycle_state: observed
lifecycle_state: draining
health_status: observed_ok
environment: local
environment: coulombcore
owner_repos:
- state-hub
- the-custodian
@@ -183,29 +183,146 @@ services:
- "/home/worsch/state-hub"
- "/home/worsch/the-custodian/state-hub/README.md"
runtime:
type: local-process
host: local-workstation
ports:
- 8000
type: k3s
cluster: coulombcore-k3s
namespace: state-hub
workload_refs:
- "cnpg:state-hub-db"
- "svc:10.43.170.94:8000"
endpoints:
- id: state-hub-local-api
- id: state-hub-cluster-api
type: http
url: "http://127.0.0.1:8000/state/health"
expected_status: 200
expected_signal: "health response"
- id: state-hub-railiance01-fleet
type: tunnel
url: "http://127.0.0.1:18000/state/health"
expected_status: 200
expected_signal: "reachable from railiance01 fleet mesh"
backing_stores:
- "postgresql:state-hub"
- "postgresql:state-hub-db"
access_paths:
- type: http
target: "http://127.0.0.1:8000"
target: "workstation tunnel state-hub-primary → cluster"
status: observed_ok
- type: tunnel
target: "railiance01 systemd fleet-state-hub-coulombcore → cluster"
status: observed_ok
evidence:
- type: session-probe
observed_at: "2026-06-05"
source: "Codex session curl to local State Hub"
summary: "State Hub accepted inbox, task, and progress API calls."
observed_at: "2026-07-03"
source: "CUST-WP-0054-T02 fleet mesh + cluster primary"
summary: "Cluster hub healthy; railiance01 reaches via fleet forward tunnel."
gaps:
- "Future cluster deployment readiness still needs ops evidence."
- "Primary home must move to railiance01 per CUST-WP-0054-T05."
- "Consistency sweep writebacks still target workstation paths."
- id: issue-core
name: "issue-core"
kind: application
lifecycle_state: draining
health_status: observed_ok
environment: coulombcore
owner_repos:
- issue-core
runtime:
type: k3s
cluster: coulombcore-k3s
namespace: issue-core
workload_refs:
- "svc:10.43.103.154:8765"
endpoints:
- id: issue-core-api
type: http
url: "http://127.0.0.1:8765/healthz"
expected_status: 200
expected_signal: "version response"
backing_stores:
- "postgresql:issue-core"
access_paths:
- type: tunnel
target: "railiance01 fleet-issue-core-coulombcore → cluster"
status: observed_ok
evidence:
- type: workplan-note
observed_at: "2026-07-02"
source: "ISSUE-WP-0003 completion — Gitea issue 176 emission"
summary: "REST emission live via cross-machine fleet path."
gaps:
- "Target railiance01 overlay per CUST-WP-0054 drain Wave 4."
- id: core-hub
name: "Core Hub"
kind: governance-service
lifecycle_state: draining
health_status: observed_ok
environment: coulombcore
owner_repos:
- core-hub
runtime:
type: k3s
cluster: coulombcore-k3s
namespace: core-hub-staging
endpoints:
- id: core-hub-public
type: https
url: "https://hub.coulomb.social/api/v2/hubs"
expected_status: 200
expected_signal: "hub list when authenticated"
backing_stores:
- "postgresql:core-hub"
access_paths:
- type: k8s
target: "coulombcore-k3s/core-hub-staging"
status: observed_ok
evidence:
- type: workplan-note
observed_at: "2026-07-02"
source: "CUST-WP-0051 metaplan closeout"
summary: "Staging deployed; production cutover gated on CORE-WP-0005-T04."
gaps:
- "Production cutover to railiance01 pending operator approval."
- id: fleet-mesh-railiance01
name: "Fleet Mesh (railiance01)"
kind: connectivity-service
lifecycle_state: observed
health_status: observed_ok
environment: railiance01
owner_repos:
- the-custodian
- ops-bridge
desired_state_sources:
- "/home/worsch/the-custodian/infra/fleet-mesh/"
runtime:
type: systemd
host: railiance01
workload_refs:
- "fleet-state-hub-coulombcore.service"
- "fleet-issue-core-coulombcore.service"
endpoints:
- id: fleet-state-hub-local
type: http
url: "http://127.0.0.1:18000/state/health"
expected_status: 200
- id: fleet-issue-core-local
type: http
url: "http://127.0.0.1:18765/healthz"
expected_status: 200
backing_stores: []
access_paths:
- type: ssh-tunnel
target: "railiance01 → coulombcore ClusterIPs"
status: observed_ok
evidence:
- type: session-probe
observed_at: "2026-07-03"
source: "CUST-WP-0054-T02 cutover"
summary: "Workstation reverse tunnels stopped; systemd forwards healthy."
gaps:
- "Migrate to atm-fleet-mesh cert_command when VAULT_TOKEN available."
- "Retire when State Hub and issue-core move to railiance01."
- id: inter-hub
name: "Inter-Hub"
@@ -287,7 +404,7 @@ services:
name: "Ops Bridge"
kind: connectivity-service
lifecycle_state: observed
health_status: unknown
health_status: observed_ok
environment: local
owner_repos:
- ops-bridge
@@ -298,15 +415,15 @@ services:
backing_stores: []
access_paths:
- type: ssh-tunnel
target: "connected remote servers"
status: unknown
target: "interactive dev tunnels only (k3s-api, state-hub-primary)"
status: observed_ok
evidence:
- type: document
observed_at: "2026-05-16"
source: "/home/worsch/helix-forge/wiki/OpsHubInventory.md"
summary: "Bridge is useful for connected-server visibility but is not itself the service catalog."
- type: session-probe
observed_at: "2026-07-03"
source: "CUST-WP-0054-T02 — production reverse tunnels retired"
summary: "state-hub-railiance01 and issue-core-railiance01 stopped; not production-critical."
gaps:
- "Emit reachability evidence into ops-hub instead of relying on bridge state as inventory."
- "Install ops-bridge on railiance01 or keep systemd fleet-mesh units."
- id: haskell-build-agent
name: "Haskell Build Agent"