diff --git a/canon/standards/coulombcore-production-freeze_v0.1.md b/canon/standards/coulombcore-production-freeze_v0.1.md new file mode 100644 index 0000000..efe9ac1 --- /dev/null +++ b/canon/standards/coulombcore-production-freeze_v0.1.md @@ -0,0 +1,118 @@ +--- +id: canon-coulombcore-production-freeze +type: standard +title: "CoulombCore Production Freeze v0.1" +domain: custodian +status: active +version: "0.1" +created: "2026-07-03" +decided_by: custodian +tags: ["infrastructure", "coulombcore", "railiance01", "production", "freeze", "drain"] +related_workplans: + - CUST-WP-0054 + - RAIL-HO-WP-0005 +--- + +# CoulombCore Production Freeze v0.1 + +## Status + +**Active from 2026-07-03.** CoulombCore (`92.205.130.254`) is frozen for new +production workloads. + +## Context + +Under the fleet role model (`CUST-WP-0054`, `docs/workstation-independence-fleet-architecture.md`): + +| Machine | Role | +| --- | --- | +| **railiance01** | Production home — growing Railiance fleet | +| **coulombcore** | Legacy/experimental only; drain then phoenix to **railiance02** | +| **workstation** | Temporary dev environment | + +Despite the role model, coulombcore still hosts production-critical workloads +(State Hub cluster primary, Core Hub, issue-core, Gitea, OpenBao, identity +stack, GitOps control plane). This freeze stops the problem from growing while +the drain sequence in `docs/coulombcore-drain-placement-plan.md` executes. 
+ +## Policy + +### Frozen (blocked unless a documented exception applies) + +No **new** production workloads may be introduced on coulombcore on or after +2026-07-03: + +- New Helm releases, ArgoCD Applications, or CNPG clusters intended as + long-lived production +- New public DNS names under `*.coulomb.social` for production services +- New credential lanes whose **primary** runtime home is coulombcore +- New CI/CD publish targets that make coulombcore the canonical registry or + forge (canonical target is railiance01 Forgejo per `RAIL-HO-WP-0005`) +- New automation schedules that **require** coulombcore as the sole runtime + host (activity-core production is already on railiance01) + +### Grandfathered (existing production may run) + +Workloads already in production on coulombcore before 2026-07-03 may continue +until their drain step completes. They are **not** newly promoted production — +they are legacy carry-over on a condemned host. + +### Allowed on coulombcore during drain + +| Category | Examples | +| --- | --- | +| Drain migrations | Staged-promotion overlays targeting railiance01; cutover drills in isolated namespaces | +| Read-only mirrors | Gitea read-only rollback mirror after Forgejo cutover | +| Short-lived probes | Disposable Forgejo/restore namespaces per `RAIL-HO-WP-0005` probe strategy | +| Experimental / non-prod | Staging profiles, smoke namespaces, operator-attended bootstrap | +| Fleet mesh transit | Forward tunnels from railiance01 to coulombcore cluster services until those services move (T02 interim) | + +### Promotion gate + +A workload counts as **production on railiance01** only after passing the +staged-promotion contract (`RAIL-BS-WP-0006`). Coulombcore deployments do not +satisfy this gate on or after 2026-07-03. + +## Enforcement + +1. **Workplan review** — new workplans proposing coulombcore production require + an explicit exception row in the drain plan with rollback evidence. +2. 
**ArgoCD / GitOps** — new Applications with production intent must target + `railiance01-k3s`, not `coulombcore-k3s`, unless tagged `drain-migration` + or `experimental`. +3. **Agent instructions** — coding agents must not deploy new production + services to coulombcore; route to railiance01 overlays or file an exception + request via State Hub `needs_human`. +4. **Inventory drift** — `ops/service-inventory.yml` rows for coulombcore + production services carry `lifecycle_state: draining` after their drain + wave starts. + +## Exceptions + +Document each exception in `docs/coulombcore-drain-placement-plan.md` under +**Documented exceptions** with: + +- workload id +- reason the drain sequence cannot absorb it yet +- target host and target date +- rollback method +- approving workplan or operator decision id + +## Exit criteria (lift freeze) + +The freeze lifts for coulombcore as a **host** when: + +1. All drain waves in the placement plan reach `retired` or `migrated` +2. Identity + OpenBao (last wave) run on railiance01 +3. `CUST-WP-0054-T09` phoenix begins — coulombcore is wiped and rebuilt as + railiance02, not returned to production + +After phoenix, the machine identity is **railiance02**; the coulombcore freeze +standard applies only to the historical drain period. 
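The ArgoCD / GitOps rule in the Enforcement section (new production Applications target `railiance01-k3s`, not `coulombcore-k3s`, unless tagged `drain-migration` or `experimental`) could be checked mechanically. The following is a minimal sketch only, not the repository's actual tooling: the `Application` record, its field names, the `violates_freeze` helper, and the sample entries are all hypothetical. It also assumes the freeze covers anything created on or after 2026-07-03, and it does not model exception rows from the drain plan.

```python
from dataclasses import dataclass, field
from datetime import date

FREEZE_DATE = date(2026, 7, 3)            # standard is "Active from 2026-07-03"
FROZEN_CLUSTER = "coulombcore-k3s"
EXEMPT_TAGS = {"drain-migration", "experimental"}


@dataclass
class Application:
    """Hypothetical, simplified view of an ArgoCD Application (not the real CRD)."""
    name: str
    cluster: str
    created: date
    tags: set = field(default_factory=set)


def violates_freeze(app: Application) -> bool:
    """Return True if the Application breaks the freeze rule sketched above."""
    if app.cluster != FROZEN_CLUSTER:
        return False                      # other clusters are not frozen
    if app.created < FREEZE_DATE:
        return False                      # grandfathered carry-over
    return not (app.tags & EXEMPT_TAGS)   # allowed only when explicitly tagged


# Illustrative sample data (names and dates are made up for the example).
apps = [
    Application("gitea", "coulombcore-k3s", date(2026, 5, 16)),
    Application("forgejo-probe", "coulombcore-k3s", date(2026, 7, 10), {"experimental"}),
    Application("new-api", "coulombcore-k3s", date(2026, 7, 10)),
    Application("forgejo", "railiance01-k3s", date(2026, 7, 10)),
]
blocked = [a.name for a in apps if violates_freeze(a)]
print(blocked)
```

A check like this would only flag candidates for review; whether a flagged workload is covered by a documented exception remains a human decision.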
+ +## Related documents + +- Drain sequence: `docs/coulombcore-drain-placement-plan.md` +- Architecture: `docs/workstation-independence-fleet-architecture.md` +- Forgejo migration: `RAIL-HO-WP-0005` in `railiance-infra` +- Staged promotion: `RAIL-BS-WP-0006` (finished) \ No newline at end of file diff --git a/docs/coulombcore-drain-placement-plan.md b/docs/coulombcore-drain-placement-plan.md new file mode 100644 index 0000000..a6eaa2f --- /dev/null +++ b/docs/coulombcore-drain-placement-plan.md @@ -0,0 +1,200 @@ +# CoulombCore Drain and Production Placement Plan + +Date: 2026-07-03 +Workplan: `CUST-WP-0054-T03` +Freeze policy: `canon/standards/coulombcore-production-freeze_v0.1.md` +Architecture: `docs/workstation-independence-fleet-architecture.md` + +## Purpose + +Ordered drain sequence for every production workload on coulombcore +(`92.205.130.254`, `coulombcore-k3s`). Each row names current placement, +target placement, migration method, owner workplan, and prerequisites. + +**Coupling rule:** forge and State Hub move early; identity + OpenBao move +last because everything authenticates through them. 
+ +## Wave overview + +``` +Wave 0 Freeze policy (this document + canon) — effective 2026-07-03 +Wave 1 Source forge + CI runners ─────────── RAIL-HO-WP-0005 / CUST-WP-0054-T04 +Wave 2 State Hub primary + sweep checkouts ── CUST-WP-0054-T05 / CUST-WP-0011 +Wave 3 Core Hub production ────────────────── CORE-WP-0005 +Wave 4 issue-core ─────────────────────────── ISSUE-WP-0003 + overlay +Wave 5 GitOps control plane (ESO, ArgoCD) ─── railiance-cluster overlays +Wave 6 Application stragglers ─────────────── per-app overlays +Wave 7 OpenBao + identity stack ───────────── NET-WP-0020 + key-cape (LAST) +Wave 8 coulombcore phoenix → railiance02 ─── CUST-WP-0054-T09 +``` + +## Placement register + +| # | Workload | Current (2026-07-03) | Target | Method | Owner | Wave | Status | +| --- | --- | --- | --- | --- | --- | --- | --- | +| 1 | **Gitea + OCI registry** | coulombcore-k3s `default`; `gitea.coulomb.social` | railiance01 **`forgejo.coulomb.social`** | Staged-promotion S5 overlay; `RAIL-HO-WP-0005` probe → production; Gitea → read-only mirror | `RAIL-HO-WP-0005`, `CUST-WP-0054-T04` | 1 | grandfathered | +| 2 | **Forgejo Actions / CI runners** | none (workstation manual build) | railiance01 | New S5 overlay; image build on tag push | `CUST-WP-0054-T04` | 1 | planned | +| 3 | **Gitea DB + PVC** | coulombcore `databases` / `gitea-shared-storage` | railiance01 CNPG + PVC | Migrate with Forgejo; backup/restore drill required | `RAIL-HO-WP-0005` | 1 | grandfathered | +| 4 | **State Hub API (primary)** | coulombcore CNPG `state-hub-db`; cluster Svc `10.43.170.94:8000` | railiance01 CNPG + Deployment | `CUST-WP-0011-T07` playbook: freeze → exact-count restore → rewire; staged-promotion overlay | `CUST-WP-0054-T05`, `CUST-WP-0011` | 2 | grandfathered | +| 5 | **State Hub sweep checkouts** | workstation `/home/worsch/*` (74 repos) | railiance01 clone tree from forge | Relocate `host_paths` / `local_path`; no workstation writeback | `CUST-WP-0054-T05`, `STATE-WP-0064` | 2 | planned 
| +| 6 | **WSL2 State Hub fallback** | workstation WSL2 | retired | Stop after railiance01 primary stabilizes | `CUST-WP-0011-T08/T09`, `CUST-WP-0054-T10` | 2 | grandfathered | +| 7 | **Core Hub** | coulombcore `core-hub-staging`; public `hub.coulomb.social` | railiance01 | Staged-promotion overlay; dual-run prerequisite (`CORE-WP-0005-T04`) | `CORE-WP-0005` | 3 | grandfathered | +| 8 | **Inter-Hub (Haskell)** | coulombcore external | retired | Rollback-only after Core Hub cutover | `CORE-WP-0007` | 3 | grandfathered | +| 9 | **issue-core** | coulombcore `issue-core` ns; ClusterIP `10.43.103.154:8765` | railiance01 | Staged-promotion overlay; shorten fleet tunnel to local svc | `ISSUE-WP-0003`, `CUST-WP-0054-T03` | 4 | grandfathered | +| 10 | **issue-core CNPG** | coulombcore | railiance01 | Migrate with issue-core workload | `railiance-platform` | 4 | grandfathered | +| 11 | **External Secrets Operator** | coulombcore | railiance01 | GitOps follows forge; ESO stores point at railiance01 OpenBao post-Wave 7 or interim coulombcore path documented | `railiance-platform` | 5 | grandfathered | +| 12 | **ArgoCD** | coulombcore (boundary: should be S4) | railiance01 | Staged-promotion; repoint repo URLs to Forgejo | `railiance-cluster` | 5 | grandfathered | +| 13 | **llm-connect** | railiance01 `activity-core` ns (partial) | railiance01 | Already on target machine; complete in-cluster profile | `CCR-2026-0003` lane | 6 | observed | +| 14 | **activity-core** | railiance01 `activity-core` ns | railiance01 (retain) | No move; update sinks (T06) and hub URL post-Wave 2 | — | — | **on target** | +| 15 | **Temporal / NATS** | railiance01 | railiance01 (retain) | Co-located with activity-core | — | — | **on target** | +| 16 | **ops-hub evidence / widgets** | files + Core Hub path | railiance01 via Core Hub | Follows Core Hub; not coulombcore-blocking | `CUST-WP-0025`, `CUST-WP-0049` | 6 | planned | +| 17 | **artifact-store / MinIO lane** | assessment only | railiance01 or 
compatible endpoint | Compatibility-profile per `ARTIFACT-STORE-WP-0007` | `ARTIFACT-STORE-WP-0007` | 6 | planned | +| 18 | **OpenBao** | coulombcore | railiance01 | **Last infrastructure wave**; `NET-WP-0020` unseal automation; CNPG + seal migration | `NET-WP-0020`, `railiance-platform` | 7 | grandfathered | +| 19 | **KeyCape** | coulombcore | railiance01 | Follows OpenBao; OIDC/MFA paths | `key-cape` | 7 | grandfathered | +| 20 | **Authelia** | coulombcore | railiance01 | Identity front door | `key-cape` / `railiance-platform` | 7 | grandfathered | +| 21 | **privacyIDEA** | coulombcore | railiance01 | MFA backend | `key-cape` | 7 | grandfathered | +| 22 | **lldap** | coulombcore | railiance01 | LDAP directory | `key-cape` / `railiance-platform` | 7 | grandfathered | +| 23 | **flex-auth** | coulombcore | railiance01 | Policy registry follows identity | `flex-auth` | 7 | grandfathered | +| 24 | **Fleet mesh transit tunnels** | railiance01 systemd → coulombcore ClusterIPs | railiance01-local services | Retire when Waves 2+4 complete (hub + issue-core local) | `CUST-WP-0054-T02` | 2–4 | **interim active** | +| 25 | **CNPG operator** | coulombcore (boundary note) | railiance01 | Platform operator moves with Wave 2+ workloads | `railiance-platform` | 2–7 | grandfathered | +| 26 | **coulombcore host identity** | coulombcore | railiance02 | Machine phoenix after Wave 7 | `CUST-WP-0054-T09`, `CUST-WP-0054-T08` | 8 | wait | + +## Per-wave detail + +### Wave 1 — Source forge + CI (unblocks repos and images) + +**Goal:** All repos and container images publish from railiance01; coulombcore +Gitea becomes read-only mirror. 
+ +| Step | Action | Done when | +| --- | --- | --- | +| 1.1 | Resolve `RAIL-HO-WP-0005-T02` production decisions (hostname **decided:** `forgejo.coulomb.social`; SMTP, runners, backup still open) | `docs/forgejo-production-decisions.md` | +| 1.2 | Disposable Forgejo probe namespace + restore drill | Backup/restore evidence id recorded | +| 1.3 | Production Forgejo cutover | All 74 repo remotes point at Forgejo; push/pull verified | +| 1.4 | Actions runners for `state-hub`, `core-hub`, `activity-core`, `issue-core` | Tag-triggered image lands in forge OCI | +| 1.5 | Gitea → read-only mirror on coulombcore | Rollback window documented; no new writes | + +**Blocks:** Wave 2 sweep checkouts (needs forge clones on railiance01). + +### Wave 2 — State Hub home on railiance01 + +**Goal:** Automation loop machine-local; consistency sweeps write back to +railiance01 checkouts, not workstation paths. + +| Step | Action | Done when | +| --- | --- | --- | +| 2.1 | CNPG + storage review on railiance01 | Platform sign-off | +| 2.2 | `CUST-WP-0011-T07` cutover to railiance01 primary | Row counts match; `127.0.0.1:8000` serves railiance01 hub | +| 2.3 | Clone/register 74 repos on railiance01 from Forgejo | `fix-consistency` writebacks use railiance01 paths | +| 2.4 | Retire fleet tunnel `fleet-state-hub-coulombcore` | activity-core reaches hub without coulombcore hop | +| 2.5 | WSL2 fallback retirement (optional, after stabilization) | `CUST-WP-0011-T08/T09` | + +**Prereq:** Wave 1 forge (clone source). + +### Wave 3 — Core Hub production + +**Goal:** `hub.coulomb.social` served from railiance01 Core Hub. + +| Step | Action | Done when | +| --- | --- | --- | +| 3.1 | Close `CORE-WP-0005-T04` prerequisites (widget types, auth posture) | Catalog gap resolved | +| 3.2 | Operator-approved cutover with rollback plan | Deployed smoke + activity-core sink green | +| 3.3 | Inter-Hub marked rollback-only | `CORE-WP-0007` unblocks | + +**Prereq:** Wave 1 (images via forge CI). 
+ +### Wave 4 — issue-core + +**Goal:** Emission path is railiance01-local; no coulombcore ClusterIP in path. + +| Step | Action | Done when | +| --- | --- | --- | +| 4.1 | Staged-promotion overlay on railiance01 | ArgoCD sync healthy | +| 4.2 | Migrate CNPG + secrets | ExternalSecret Ready | +| 4.3 | Point `ISSUE_CORE_URL` at in-cluster svc | Retire `fleet-issue-core-coulombcore` tunnel | +| 4.4 | Safe emission smoke | HTTP 201 + Gitea/Forgejo issue created | + +**Prereq:** Wave 1 (image + gitops); credential lane `CCR-2026-0002` active. + +### Wave 5 — GitOps control plane + +**Goal:** ArgoCD and ESO run on railiance01 and track Forgejo repos. + +| Step | Action | Done when | +| --- | --- | --- | +| 5.1 | ArgoCD overlay on railiance01 | Sync from Forgejo remotes | +| 5.2 | ESO → SecretStore paths updated | Workloads on railiance01 pull secrets | +| 5.3 | Decommission coulombcore ArgoCD Applications | No new syncs to coulombcore-k3s | + +**Prereq:** Waves 1–2 (forge URLs, hub coordination). + +### Wave 6 — Application stragglers + +Low-coupling apps and evidence lanes that do not block earlier waves: + +- llm-connect production profile completion +- ops-hub widget evidence via Core Hub +- artifact-store compatibility endpoint (if approved) + +Each uses staged-promotion unless listed under **Documented exceptions**. + +### Wave 7 — OpenBao + identity (LAST) + +**Goal:** Authentication and secret custody off coulombcore. + +| Step | Action | Done when | +| --- | --- | --- | +| 7.1 | OpenBao staged-promotion to railiance01 | Unseal automation (`NET-WP-0020`) proven | +| 7.2 | KeyCape / Authelia / privacyIDEA / lldap migration | OIDC login smoke on railiance01 | +| 7.3 | flex-auth registry points at new identity endpoints | Credential lanes re-pointed | +| 7.4 | CCR/applier paths verified | No production secret reads from coulombcore OpenBao | + +**Gate:** `CUST-WP-0054-T09` cannot start until Wave 7 completes. 
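The prerequisites stated in the per-wave sections form a small dependency graph, which can be sketched as a lookup table. This is an illustrative sketch under stated assumptions, not part of the plan's tooling: the names `PREREQS`, `can_start`, and `startable` are hypothetical, no prerequisite is stated for Wave 6 so it is modelled with none, and Wave 7 is modelled as depending on Waves 1 to 6 as one reading of "LAST". Human gates and credential-lane prerequisites are not modelled.

```python
# Prerequisites as stated in the per-wave sections; entries marked "assumed"
# are this sketch's reading of the plan, not stated prerequisites.
PREREQS = {
    1: set(),
    2: {1},                   # Wave 1 forge (clone source)
    3: {1},                   # images via forge CI
    4: {1},                   # image + gitops
    5: {1, 2},                # forge URLs, hub coordination
    6: set(),                 # assumed: no prerequisite is stated for Wave 6
    7: {1, 2, 3, 4, 5, 6},    # assumed: "LAST" read as "after every other drain wave"
    8: {7},                   # gate: T09 cannot start until Wave 7 completes
}


def can_start(wave: int, completed: set) -> bool:
    """A wave may start once all of its prerequisite waves are completed."""
    return PREREQS[wave] <= completed


def startable(completed: set) -> list:
    """Waves not yet completed whose prerequisites are all met."""
    return sorted(w for w in PREREQS if w not in completed and can_start(w, completed))


print(startable(set()))                    # waves with no unmet prerequisites
print(startable({1}))                      # what Wave 1 unblocks
print(can_start(8, {1, 2, 3, 4, 5, 6}))    # phoenix stays blocked without Wave 7
```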
+ +### Wave 8 — Phoenix to railiance02 + +Execute `CUST-WP-0054-T09` via T08 automation: wipe coulombcore, rebuild as +railiance02, join fleet. DNS/cert plan for remaining `*.coulomb.social` names. + +## Documented exceptions + +| Workload | Reason | Target date | Rollback | Approval | +| --- | --- | --- | --- | --- | +| Fleet mesh systemd tunnels | Wave 2/4 not complete; railiance01 reaches coulombcore ClusterIPs | Until Waves 2+4 done | Re-enable workstation reverse tunnels per `docs/fleet-mesh-dehub-runbook.md` | `CUST-WP-0054-T02` cutover 2026-07-03 | +| Core Hub staging on coulombcore | Pre-cutover smoke environment | Until Wave 3 cutover | Keep staging namespace | `CORE-WP-0005` | +| Static `id_ops` SSH key on railiance01 fleet units | `atm-fleet-mesh` cert_command blocked on VAULT_TOKEN | Until warden sign available | ops-bridge or rotated key | `CUST-WP-0054-T02` interim | + +No other exceptions as of 2026-07-03. New exceptions require a State Hub +decision or workplan amendment. + +## Staged-promotion method (default) + +Per `RAIL-BS-WP-0006` (finished): + +1. `railiance//app.toml` + overlay in owning repo +2. Stage 1 deploy → observe → promote with evidence +3. Backup/restore drill before production promotion +4. Rollback revision documented + +Apps without overlays yet must get an overlay scaffold before Wave execution. 
+ +## Inventory sync + +`ops/service-inventory.yml` updated 2026-07-03 for: + +- coulombcore `lifecycle_state: draining` on grandfathered production services +- State Hub primary on coulombcore cluster (not workstation) +- railiance01 fleet-mesh and activity-core placement +- ops-bridge on railiance01 via systemd (not workstation hub) + +Regenerate catalog view: `make ops-inventory-view` + +## Human gates (not agent-executable) + +| Gate | Owner | Blocks | +| --- | --- | --- | +| Forgejo T02 production decisions | operator | Wave 1 | +| State Hub railiance01 cutover approval | operator; `CUST-WP-0011-T07` | Wave 2 | +| Core Hub production cutover | operator; `CORE-WP-0005-T04` | Wave 3 | +| OpenBao/identity migration approval | operator + custody | Wave 7 | +| coulombcore phoenix approval | operator | Wave 8 | \ No newline at end of file diff --git a/docs/fleet-mesh-dehub-runbook.md b/docs/fleet-mesh-dehub-runbook.md new file mode 100644 index 0000000..c21f9ae --- /dev/null +++ b/docs/fleet-mesh-dehub-runbook.md @@ -0,0 +1,147 @@ +# Fleet Mesh De-Hub Runbook (CUST-WP-0054-T02) + +Date: 2026-07-03 +Workplan: `CUST-WP-0054-T02` +Architecture: `docs/workstation-independence-fleet-architecture.md` + +## Goal + +Remove the workstation from production data paths between railiance01 +(activity-core) and coulombcore (State Hub cluster, issue-core). Workstation +tunnels become interactive dev access only. 
+ +## Before (workstation hub) + +``` +railiance01:18000 ──reverse──► workstation:8000 ──forward──► coulombcore cluster State Hub +railiance01:18765 ──reverse──► workstation:18765 ──forward──► coulombcore cluster issue-core +``` + +## After (fleet-owned) + +``` +railiance01:18000 ──forward via SSH to coulombcore──► 10.43.170.94:8000 (State Hub) +railiance01:18765 ──forward via SSH to coulombcore──► 10.43.103.154:8765 (issue-core) +``` + +activity-core `actcore-state-hub-bridge` and `actcore-issue-core-bridge` keep +proxying to `127.0.0.1:18000` and `127.0.0.1:18765` on the railiance01 node. + +## Prerequisites + +| Item | Check | +| --- | --- | +| ops-bridge installed on railiance01 | `which bridge` | +| SSH key authorized on coulombcore | `ssh -i ~/.ssh/id_ops tegwick@92.205.130.254 true` from railiance01 | +| ClusterIPs current | `state-hub-primary` and `issue-core-coulombcore` workstation tunnels | +| warden `atm-fleet-mesh` (target) | `cert_command` migration after static-key smoke passes | + +Reference config: `infra/fleet-mesh/railiance01-tunnels.yaml` + +## Install (railiance01) + +railiance01 ships the kernel `bridge` utility (`iproute2`), not ops-bridge. Use the +systemd user units in `infra/fleet-mesh/systemd/` (or the installer script). + +```bash +# From the-custodian repo on the workstation +bash infra/fleet-mesh/install-railiance01.sh railiance01 +``` + +The installer copies: + +- `infra/fleet-mesh/systemd/*.service` → `~/.config/systemd/user/` +- `infra/fleet-mesh/railiance01-tunnels.yaml` → `~/.config/bridge/tunnels.yaml` (reference for future ops-bridge install) +- `~/.ssh/id_ops` → railiance01 (static key interim; migrate to `atm-fleet-mesh` + `cert_command`) + +Enable lingering so user units survive logout/reboot: + +```bash +ssh railiance01 'sudo loginctl enable-linger tegwick' +``` + +## Cutover + +```bash +# 1. 
Stop workstation reverse tunnels (one at a time — ops-bridge CLI) +bridge down state-hub-railiance01 +bridge down issue-core-railiance01 + +# 2. Start fleet-owned forward tunnels on railiance01 (systemd) +ssh railiance01 'systemctl --user enable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore' + +# 3. Smoke from railiance01 node +ssh railiance01 'curl -sf http://127.0.0.1:18000/state/health && curl -sf http://127.0.0.1:18765/healthz' +``` + +**Cutover evidence (2026-07-03):** workstation reverse tunnels stopped; +railiance01 systemd forwards healthy; `actcore-*-bridge` pods 1/1; progress +write through fleet path succeeded (event `647b70c0`). + +## Verify production (partial T10 rehearsal) + +With workstation reverse tunnels **down**, confirm: + +```bash +# Bridge pods healthy +ssh railiance01 'kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core | grep bridge' + +# Consistency sweep API (from railiance01 cluster network) +ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c " +import urllib.request +print(urllib.request.urlopen(\"http://actcore-state-hub-bridge:8000/state/health\").read().decode()) +"' + +# Issue-core bridge +ssh railiance01 'kubectl -n activity-core exec deploy/actcore-api -- python -c " +import urllib.request +print(urllib.request.urlopen(\"http://actcore-issue-core-bridge:8765/healthz\").read().decode()) +"' +``` + +Optional emission smoke (safe label only): trigger a known-safe activity-core +run or use the issue-core REST sink checklist from +`near-term-production-service-lanes-status.md`. + +## Persist across reboot + +Systemd user units are enabled via `install-railiance01.sh`. 
Confirm: + +```bash +ssh railiance01 'loginctl show-user tegwick -p Linger; systemctl --user is-enabled fleet-state-hub-coulombcore fleet-issue-core-coulombcore' +``` + +When ops-bridge is installed on railiance01, `railiance01-tunnels.yaml` is the +drop-in config; until then systemd units are the production implementation. + +## Rollback + +```bash +ssh railiance01 'systemctl --user disable --now fleet-state-hub-coulombcore fleet-issue-core-coulombcore'  # systemd units; railiance01 has no ops-bridge CLI +bridge up state-hub-railiance01 issue-core-railiance01 +``` + +## Workstation tunnel policy after cutover + +| Keep (interactive dev) | Retire from production dependency | +| --- | --- | +| `state-hub-primary` (MCP/agents) | `state-hub-railiance01` | +| `k3s-api-*` | `issue-core-railiance01` | +| `state-hub-mcp-*` | — | +| `issue-core-coulombcore` (workstation dev only) | — | + +Production on railiance01 must not depend on any workstation tunnel. + +## WireGuard evaluation + +Current fleet mesh uses two forward tunnels (two systemd units). A WireGuard successor is +deferred until the persistent unit count exceeds ~5, per workplan T02. + +## cert_command migration (follow-on) + +Replace static `id_ops` with `atm-fleet-mesh` + `cert_command`: + +1. Register `atm-fleet-mesh` in warden inventory and CoulombCore `ssh_principals.yaml` +2. Generate dedicated keypair on railiance01 +3. Set `cert_command: "warden sign atm-fleet-mesh --pubkey ..."` per + `ops-warden/wiki/playbooks/ops-bridge-tunnel-cert.md` \ No newline at end of file diff --git a/docs/ops-hub-service-catalog.md b/docs/ops-hub-service-catalog.md index 0a6d186..c01d55b 100644 --- a/docs/ops-hub-service-catalog.md +++ b/docs/ops-hub-service-catalog.md @@ -3,7 +3,7 @@ Source: `ops/service-inventory.yml` -Inventory last reviewed: `2026-06-05` +Inventory last reviewed: `2026-07-03` This is the repo-native first view for `CUST-WP-0047`. 
It exists so an operator can answer what is running where before the full standalone @@ -16,9 +16,9 @@ operator can answer what is running where before the full standalone | Environments | 4 | | Hosts | 3 | | Clusters | 3 | -| Services | 8 | -| Services: observed_ok | 2 | -| Services: unknown | 6 | +| Services | 11 | +| Services: observed_ok | 6 | +| Services: unknown | 5 | ## Service Catalog @@ -27,10 +27,13 @@ operator can answer what is running where before the full standalone | Gitea (gitea) | CoulombCore
type: k3s; cluster: coulombcore-k3s; namespace: default | railiance-apps | https://gitea.coulomb.social/v2/
Expected: status 401, OCI registry auth challenge | unknown
2026-05-16: Inventory draft records Helm release gitea, namespace default, app version 1.25.4, NodePort 32166, and registry auth challenge. | database:gitea-db
pvc:default/gitea-shared-storage | k8s: unknown (coulombcore-k3s/default) | Package token and push/pull verification need current evidence. | | Gitea Database (gitea-database) | CoulombCore
type: k3s; cluster: coulombcore-k3s; namespace: databases | railiance-platform | - | unknown
2026-05-16: /home/worsch/helix-forge/wiki/OpsHubInventory.md | - | k8s: unknown (coulombcore-k3s/databases) | Backup and restore evidence not recorded in ops inventory. | | Gitea Shared Storage (gitea-shared-storage) | CoulombCore
type: k3s; cluster: coulombcore-k3s; namespace: default | railiance-platform
railiance-apps | - | unknown
2026-05-16: /home/worsch/helix-forge/wiki/OpsHubInventory.md | - | k8s: unknown (coulombcore-k3s/default/pvc/gitea-shared-storage) | Package blob backup and restore evidence not confirmed. | -| State Hub (state-hub) | Local Workstation
type: local-process; host: local-workstation; ports: 8000 | state-hub
the-custodian | http://127.0.0.1:8000/state/health
Expected: status 200, health response | observed_ok
2026-06-05: State Hub accepted inbox, task, and progress API calls. | postgresql:state-hub | http: observed_ok (http://127.0.0.1:8000) | Future cluster deployment readiness still needs ops evidence. | +| State Hub (state-hub) | CoulombCore
type: k3s; cluster: coulombcore-k3s; namespace: state-hub | state-hub
the-custodian | http://127.0.0.1:8000/state/health
Expected: status 200, health response | observed_ok
2026-07-03: Cluster hub healthy; railiance01 reaches via fleet forward tunnel. | postgresql:state-hub-db | http: observed_ok (workstation tunnel state-hub-primary → cluster)
tunnel: observed_ok (railiance01 systemd fleet-state-hub-coulombcore → cluster) | Primary home must move to railiance01 per CUST-WP-0054-T05. | +| issue-core (issue-core) | CoulombCore
type: k3s; cluster: coulombcore-k3s; namespace: issue-core | issue-core | http://127.0.0.1:8765/healthz
Expected: status 200, version response | observed_ok
2026-07-02: REST emission live via cross-machine fleet path. | postgresql:issue-core | tunnel: observed_ok (railiance01 fleet-issue-core-coulombcore → cluster) | Target railiance01 overlay per CUST-WP-0054 drain Wave 4. | +| Core Hub (core-hub) | CoulombCore
type: k3s; cluster: coulombcore-k3s; namespace: core-hub-staging | core-hub | https://hub.coulomb.social/api/v2/hubs
Expected: status 200, hub list when authenticated | observed_ok
2026-07-02: Staging deployed; production cutover gated on CORE-WP-0005-T04. | postgresql:core-hub | k8s: observed_ok (coulombcore-k3s/core-hub-staging) | Production cutover to railiance01 pending operator approval. | +| Fleet Mesh (railiance01) (fleet-mesh-railiance01) | Railiance01
type: systemd; host: railiance01 | the-custodian
ops-bridge | http://127.0.0.1:18000/state/health
Expected: status 200 | observed_ok
2026-07-03: Workstation reverse tunnels stopped; systemd forwards healthy. | - | ssh-tunnel: observed_ok (railiance01 → coulombcore ClusterIPs) | Migrate to atm-fleet-mesh cert_command when VAULT_TOKEN available. | | Inter-Hub (inter-hub) | ThreePhoenix Production
type: external; public_endpoint: https://hub.coulomb.social | inter-hub | https://hub.coulomb.social/api/v2/openapi.json
Expected: status 200, OpenAPI document | unknown
2026-05-16: /home/worsch/helix-forge/wiki/OpsHubInventory.md | - | https: unknown (https://hub.coulomb.social) | ops-hub bootstrap requires authenticated UI flow or deployment-side migration. | | activity-core (activity-core) | Railiance01
type: k3s; cluster: railiance01-k3s; namespace: activity-core | activity-core
the-custodian | activity-core API health endpoint
Expected: status 200, healthy DB and Temporal status | observed_ok
2026-05-23: API health, worker rollout, Temporal CLI schedule listing, and State Hub bridge were verified. | postgresql:activity-core
temporal:activity-core
nats:railiance01 | k8s: observed_ok (railiance01-k3s/activity-core) | Add explicit ops inventory probes and evidence events. | -| Ops Bridge (ops-bridge) | Local Workstation
type: bridge; host: local-workstation | ops-bridge | - | unknown
2026-05-16: Bridge is useful for connected-server visibility but is not itself the service catalog. | - | ssh-tunnel: unknown (connected remote servers) | Emit reachability evidence into ops-hub instead of relying on bridge state as inventory. | +| Ops Bridge (ops-bridge) | Local Workstation
type: bridge; host: local-workstation | ops-bridge | - | observed_ok
2026-07-03: state-hub-railiance01 and issue-core-railiance01 stopped; not production-critical. | - | ssh-tunnel: observed_ok (interactive dev tunnels only: k3s-api, state-hub-primary) | Install ops-bridge on railiance01 or keep systemd fleet-mesh units. | | Haskell Build Agent (haskell-build-agent) | Local Workstation
type: systemd; host: haskell-build-vm | the-custodian | http://127.0.0.1:18000
Expected: VM can reach State Hub through SSH forward | unknown
undated: Build agent is a systemd service and registers with State Hub on boot. | - | ssh: unknown (local workstation reverse tunnel port 12222) | Current tunnel and capability registration need live evidence in ops-hub. | ## Open Operating Gaps @@ -50,7 +53,21 @@ operator can answer what is running where before the full standalone ### State Hub (`state-hub`) -- Future cluster deployment readiness still needs ops evidence. +- Primary home must move to railiance01 per CUST-WP-0054-T05. +- Consistency sweep writebacks still target workstation paths. + +### issue-core (`issue-core`) + +- Target railiance01 overlay per CUST-WP-0054 drain Wave 4. + +### Core Hub (`core-hub`) + +- Production cutover to railiance01 pending operator approval. + +### Fleet Mesh (railiance01) (`fleet-mesh-railiance01`) + +- Migrate to atm-fleet-mesh cert_command when VAULT_TOKEN available. +- Retire when State Hub and issue-core move to railiance01. ### Inter-Hub (`inter-hub`) @@ -62,7 +79,7 @@ operator can answer what is running where before the full standalone ### Ops Bridge (`ops-bridge`) -- Emit reachability evidence into ops-hub instead of relying on bridge state as inventory. +- Install ops-bridge on railiance01 or keep systemd fleet-mesh units. ### Haskell Build Agent (`haskell-build-agent`) diff --git a/docs/workstation-independence-fleet-architecture.md b/docs/workstation-independence-fleet-architecture.md new file mode 100644 index 0000000..6f04cd6 --- /dev/null +++ b/docs/workstation-independence-fleet-architecture.md @@ -0,0 +1,298 @@ +# Workstation Independence and Fleet Role Architecture + +Date: 2026-07-03 +Status: draft (canon-adjacent; promote to `canon/architecture/` after review) +Workplan: `CUST-WP-0054` T01 +Related: `ADR-001`, `ADR-004`, `RAIL-BS-WP-0006`, `RAIL-HO-WP-0005`, `CUST-WP-0011` + +## Purpose + +Fix the three-machine role model, the fleet mesh topology, the promotion gate +for "production", and the phoenix path `coulombcore → railiance02`. 
Provide a +dependency register so every workload, tunnel, repo remote, sink path, and +build pipeline has a **current host**, **target host**, and **migration owner**. + +The acceptance proof for the whole plan is `CUST-WP-0054-T10`: production runs +24h+ with the workstation fully offline. + +## Machine Roles + +| Machine | IP / identity | Current role (2026-07-03) | Target role | +| --- | --- | --- | --- | +| **railiance01** | `92.205.62.239` | First ThreePhoenix foundation node; hosts activity-core production, partial State Hub cluster footprint, automation schedules | **Production home** — first node of the growing Railiance fleet; hosts State Hub primary, forge, CI runners, and the automation loop | +| **coulombcore** | `92.205.130.254` | De-facto production host: State Hub cluster primary, Core Hub (`hub.coulomb.social`), issue-core, OpenBao, identity stack, ESO/ArgoCD, Gitea/registry | **Frozen legacy** — no new production; drain workload-by-workload; eventually wiped and **reborn as railiance02** | +| **workstation** | `bnt-lap001` / WSL2 | Production network hub (all 16 ops-bridge tunnels), State Hub client endpoint (`127.0.0.1:8000`), consistency-sweep writebacks, image build/publish, dev checkouts for 74 registered repos | **Temporary dev environment** — clone repos, run `make dev-hub`, push when connected; nothing in the production loop may depend on it being on | + +### Role invariants + +1. Production workloads authenticate, schedule, emit, and reconcile without the + workstation. +2. `coulombcore` is frozen for new production immediately (policy; see T03). +3. A workload counts as "production on railiance01" only after passing the + staged-promotion gate (see below). +4. Files remain authoritative per ADR-001; fleet databases are disposable caches. + +## Fleet Mesh Topology + +### Current topology (workstation as hub) + +All ops-bridge tunnels originate on the workstation. 
Two production data paths +**chain through** it: + +``` +railiance01 workstation coulombcore +─────────── ─────────── ─────────── +activity-core ──(state-hub-railiance01 reverse)──► :18000 ──(state-hub-primary forward)──► State Hub cluster +activity-core ──(issue-core-railiance01 reverse)──► :local ──(issue-core-coulombcore forward)──► issue-core +``` + +Live tunnel inventory (2026-07-03, `bridge status`): + +| Tunnel | Direction | Actor | Production-critical? | +| --- | --- | --- | --- | +| `state-hub-primary` | workstation → coulombcore cluster | `agt-claude-coulombcore` | **yes** — MCP/agents reach cluster hub via `127.0.0.1:8000` | +| `state-hub-cluster-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | dev/ops access | +| `state-hub-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | **yes** — activity-core reaches hub | +| `state-hub-mcp-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | dev MCP | +| `issue-core-railiance01` | railiance01 → workstation (reverse) | `agt-claude-railiance01` | **yes** — emission lane | +| `issue-core-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | **yes** — completes emission chain | +| `state-hub-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | legacy/dev | +| `state-hub-mcp-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | dev MCP | +| `k3s-api-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | operator dev | +| `k3s-api-haskelseed` | workstation → haskelseed | `agt-claude-haskelseed` | experimental | +| `flex-auth-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | identity dev | +| `core-hub-staging-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | staging | +| `inter-hub-coulombcore` | workstation → coulombcore | `agt-claude-coulombcore` | legacy Inter-Hub | +| `state-hub-haskelseed` | haskelseed → workstation | 
`agt-claude-haskelseed` | experimental | +| `state-hub-mcp-haskelseed` | haskelseed → workstation | `agt-claude-haskelseed` | experimental | +| `nix-daemon-haskelseed` | haskelseed → workstation | `agt-claude-haskelseed` | build dev | + +A workstation reboot breaks daily triage evidence, consistency sweeps, and +issue emission until tunnels recover. + +### Target topology (fleet-owned mesh) + +``` +railiance01 ◄────────────────────────────────────► coulombcore (draining) + │ direct atm- tunnels (ops-bridge on-host) │ + │ State Hub API │ legacy until drain complete + │ issue-core REST │ + └─ activity-core, Temporal, sweep checkouts └─ identity, OpenBao (last to move) + +workstation (optional client) + │ interactive-only: k3s API, hub read, dev-hub + └─ may disconnect without production impact +``` + +Implementation owner: `CUST-WP-0054-T02`. + +Key changes: + +- ops-bridge (or systemd ssh units) runs **on railiance01** with `atm-` actor + certs for cross-machine lanes. +- `actcore-state-hub-bridge` and `actcore-issue-core-bridge` point at + machine-local tunnel ports, not workstation forwards. +- Workstation tunnels remain for interactive dev only. +- Evaluate WireGuard mesh when persistent unit count exceeds ~5. + +This posture extends ADR-004 (connectivity-first) from "workstation connects +everything" to "fleet machines connect each other; workstation is a client." + +## Production Promotion Gate + +A workload is **production on railiance01** only when it conforms to the +finished staged-promotion contract (`RAIL-BS-WP-0006`): + +| Gate | Requirement | +| --- | --- | +| Overlay repo | `railiance//` with `app.toml` and stage manifests | +| Stage commands | `stage deploy`, `stage observe`, `stage promote`, `stage rollback` proven | +| Evidence | Backup/restore drill, canary observation, operator approval recorded | +| Registry | Image in forge OCI registry with immutable tag | + +**Exceptions** must be documented in the placement plan (T03) with explicit +rollback. 
No exception bypasses backup evidence for stateful workloads. + +`coulombcore` workloads still running in production today are **grandfathered +legacy** until their drain task completes — not newly promoted production. + +## Phoenix Path: coulombcore → railiance02 + +Machine-scale phoenix rotation reuses the same automation intended for future +3-node weekly rotations (`RAIL-BS-WP-0007`, `CUST-WP-0038` deferred until +railiance02 exists). + +### Preconditions (drain complete) + +All production dependencies moved off coulombcore per T03 ordering: + +1. Forge + CI (T04) — repos and images no longer depend on `gitea.coulomb.social` +2. State Hub primary (T05) — cluster DB and sweep checkouts on railiance01 +3. Core Hub, issue-core, Inter-Hub legacy — per T03 sequence +4. Identity + OpenBao — **last** (everything authenticates through them) + +### Phoenix execution + +Owner: `CUST-WP-0054-T09`, automation: `CUST-WP-0054-T08`. + +| Phase | Action | Tooling | +| --- | --- | --- | +| S0 | Final inventory sweep, DNS/cert plan for `*.coulomb.social`, data archival | T09 | +| S1 | Wipe and greenfield rebuild | `NET-WP-0020` unseal + bootstrap chain | +| S2 | Join as `railiance02` | `railiance-cluster` overlay, `atm-` certs | +| S3 | Prove join-ready | Phoenix drill on disposable target first (T08) | + +Longhorn distributed storage and PG streaming HA unlock once railiance01 + +railiance02 are both fleet nodes. + +## Dev Environment (Files-First Beachhead) + +Strategy A from the workplan; owner: `CUST-WP-0054-T07`. + +``` +git clone → make dev-hub → local ephemeral hub (compose) + │ + ├─ C-06 registration rebuilds workplan/task state from files + ├─ offline write buffer (STATE-WP-0068) for progress/task events + └─ reconnect relay upstream; files reconcile, databases do not replicate +``` + +MCP config gains explicit `dev` / `fleet` profile switch. The workstation is +genuinely temporary: no fleet DB sync required for orientation. 
+ +## Dependency Register + +### Workloads + +| Workload | Current host | Target host | Migration owner | Method / notes | +| --- | --- | --- | --- | --- | +| State Hub API (primary) | coulombcore CNPG cluster via workstation tunnel `state-hub-primary` → `127.0.0.1:8000` | railiance01 | `CUST-WP-0054-T05` | `CUST-WP-0011-T07` playbook: freeze → exact-count restore → rewire | +| State Hub API (WSL2 fallback) | workstation WSL2 | retired | `CUST-WP-0011-T08/T09` → absorbed by `CUST-WP-0054-T10` | Stabilization window; not part of target architecture | +| activity-core | railiance01 k3s (`activity-core` ns) | railiance01 (retain) | — | Already on target machine; fix bridges in T02 | +| issue-core | coulombcore k3s | railiance01 | `CUST-WP-0054-T03` drain seq. | `ISSUE-WP-0003` live; emission chain fixed in T02 | +| Core Hub | coulombcore (`hub.coulomb.social`) | railiance01 | `CORE-WP-0005` + `CUST-WP-0054-T03` | Staging on coulombcore; production cutover human-gated | +| Inter-Hub (legacy Haskell) | coulombcore external | retired | `CORE-WP-0007` | Rollback-only after Core Hub cutover | +| Gitea + OCI registry | coulombcore k3s | railiance01 Forgejo | `RAIL-HO-WP-0005` / `CUST-WP-0054-T04` | Read-only mirror on coulombcore until decommission | +| OpenBao | coulombcore | railiance01 | `CUST-WP-0054-T03` (last) | NET-WP-0020 unseal automation | +| Identity stack (KeyCape, Authelia, privacyIDEA, lldap) | coulombcore | railiance01 | `CUST-WP-0054-T03` (last) | Coupled to OpenBao | +| ESO + ArgoCD control plane | coulombcore | railiance01 | `CUST-WP-0054-T03` | GitOps follows forge move | +| CNPG databases (per workload) | coulombcore / railiance01 | railiance01 per workload | `CUST-WP-0054-T03`, `CUST-WP-0054-T05` | CNPG pattern proven; migrate with workload | +| llm-connect | TBD cluster | railiance01 | near-term lanes board | `CCR-2026-0003` credential lane active | +| ops-hub (widget/evidence) | files + Inter-Hub widgets | railiance01 via Core Hub | `CUST-WP-0025`, 
`CUST-WP-0049` | Not blocking workstation independence | +| Temporal (activity-core) | railiance01 | railiance01 (retain) | — | Co-locate with activity-core | +| NATS (activity-core) | railiance01 | railiance01 (retain) | — | Co-locate with activity-core | + +### Network tunnels (production-critical) + +| Lane | Current path | Target path | Owner | +| --- | --- | --- | --- | +| activity-core → State Hub | railiance01 reverse → workstation → `state-hub-primary` → coulombcore | railiance01 `atm-` forward → railiance01 State Hub (local or short hop) | `CUST-WP-0054-T02` | +| Agents/MCP → State Hub | workstation `127.0.0.1:8000` → `state-hub-primary` → coulombcore | workstation `127.0.0.1:8000` → tunnel to railiance01 hub (dev client) or fleet endpoint | `CUST-WP-0054-T05` + T07 profiles | +| railiance01 automations → State Hub | `:18000` chain via workstation | railiance01-local bridge port | `CUST-WP-0054-T02` | +| activity-core → issue-core | railiance01 reverse → workstation → `issue-core-coulombcore` | railiance01 `atm-` forward → issue-core (on railiance01 post-drain) | `CUST-WP-0054-T02`, then T03 | +| Operator k3s access | workstation forwards (`k3s-api-*`) | workstation interactive (non-critical) | — | + +### Repo remotes + +All checked 2026-07-03; pattern is uniform: + +| Repo (sample) | Current remote | Target remote | Owner | +| --- | --- | --- | --- | +| the-custodian | `gitea.coulomb.social/coulomb/the-custodian.git` | `forgejo.coulomb.social/coulomb/the-custodian.git` | `CUST-WP-0054-T04` | +| state-hub | `gitea.coulomb.social/coulomb/state-hub.git` | `forgejo.coulomb.social/coulomb/state-hub.git` | `CUST-WP-0054-T04` | +| activity-core | `gitea.coulomb.social/coulomb/activity-core.git` | `forgejo.coulomb.social/coulomb/activity-core.git` | `CUST-WP-0054-T04` | +| issue-core | `gitea.coulomb.social/coulomb/issue-core.git` | `forgejo.coulomb.social/coulomb/issue-core.git` | `CUST-WP-0054-T04` | +| ops-bridge | `gitea.coulomb.social/coulomb/ops-bridge.git` 
| `forgejo.coulomb.social/coulomb/ops-bridge.git` | `CUST-WP-0054-T04` | +| ops-warden | `gitea.coulomb.social/coulomb/ops-warden.git` | `forgejo.coulomb.social/coulomb/ops-warden.git` | `CUST-WP-0054-T04` | +| core-hub | `gitea.coulomb.social/coulomb/core-hub.git` | `forgejo.coulomb.social/coulomb/core-hub.git` | `CUST-WP-0054-T04` | +| *(all 74 registered repos)* | `gitea.coulomb.social/coulomb/.git` | `forgejo.coulomb.social/coulomb/.git` | `CUST-WP-0054-T04` | + +### State Hub repo checkout paths + +| Concern | Current | Target | Owner | +| --- | --- | --- | --- | +| `local_path` for 74 repos | `/home/worsch/` on workstation | railiance01 clone tree (e.g. `/home/tegwick/` or gitops-managed path) | `CUST-WP-0054-T05` | +| Consistency sweep writeback host | workstation (`consistency_check.py --remote` via API) | railiance01 checkouts from forge | `CUST-WP-0054-T05`, `STATE-WP-0064` | +| COULOMBCORE `host_paths` | `/home/tegwick/` (11 repos, `CUST-WP-0021`) | retired with coulombcore drain | `CUST-WP-0054-T09` | +| Multi-host path resolution | `host_paths` map per hostname | fleet-primary host only + dev-hub local | `CUST-WP-0054-T07` | + +### Sink and prompt paths + +| Sink / path | Current | Target | Owner | +| --- | --- | --- | --- | +| Daily triage working-memory | `/home/worsch/the-custodian/memory/working` (ActivityDefinition + PVC mount) | repo-relative or PVC-native path + sweep sync-to-repo | `CUST-WP-0054-T06` | +| Daily triage State Hub progress | cluster hub via workstation tunnel | railiance01 hub direct | `CUST-WP-0054-T02`, `T05` | +| Consistency sweep progress event | via workstation-hosted sweep | railiance01-hosted sweep | `CUST-WP-0054-T05`, `STATE-WP-0064` | +| Agent session traces (`runtime/agent.py`) | `memory/working/agent-session-*.md` on workstation | dev-hub local buffer; commit on reconnect | `CUST-WP-0054-T07` | +| `output_schema` in ActivityDefinitions | absolute paths under `/home/worsch/the-custodian/` | repo-relative resolution in 
activity-core | `CUST-WP-0054-T06` | + +### Build and publish pipelines + +| Image / artifact | Current build host | Current registry | Target build | Target registry | Owner | +| --- | --- | --- | --- | --- | --- | +| state-hub | workstation `docker build` | `gitea.coulomb.social/coulomb/state-hub` | Forgejo Actions runner on railiance01 | railiance01 forge OCI | `CUST-WP-0054-T04` | +| core-hub | workstation / railiance-forge docs | `gitea.coulomb.social/coulomb/core-hub` | CI runner | railiance01 forge OCI | `CUST-WP-0054-T04` | +| activity-core | workstation manual rebuild + scp | railiance01 k3s import / Gitea | CI on tag push | railiance01 forge OCI | `CUST-WP-0054-T04` | +| issue-core | workstation / manual | `gitea.coulomb.social/coulomb/issue-core` | CI runner | railiance01 forge OCI | `CUST-WP-0054-T04` | +| Haskell build agent | workstation VM (`haskell-build-vm`) | n/a | retired (`CORE-WP-0007`) | n/a | `CORE-WP-0007` | + +Done criterion for T01: every row above has a target and migration owner. 
✓ + +## Drain Sequence + +Detailed plan: `docs/coulombcore-drain-placement-plan.md` +Freeze policy: `canon/standards/coulombcore-production-freeze_v0.1.md` + +``` +Wave 1 Forge + CI (T04) +Wave 2 State Hub primary (T05) +Wave 3 Core Hub (CORE-WP-0005) +Wave 4 issue-core +Wave 5 ESO / ArgoCD +Wave 6 Supporting apps +Wave 7 OpenBao + identity (LAST) +Wave 8 coulombcore phoenix → railiance02 (T09) +``` + +## Sequencing Map + +``` +T01 (this document) ✓ + ├─ T02 de-hub network ✓ + ├─ T03 placement plan / freeze ✓ + │ ├─ T04 forge + CI + │ └─ T05 State Hub home on railiance01 + ├─ T06 sink decoupling + ├─ T07 dev beachhead + └─ T08 phoenix drill + └─ T09 coulombcore → railiance02 + └─ T10 workstation-off acceptance +``` + +## Evidence and Inventory Sources + +- Live tunnel state: `bridge status` (2026-07-03) +- State Hub health: `http://127.0.0.1:8000/state/health` (cluster primary via tunnel) +- Registered repos: `GET /repos/` — 74 repos, all `local_path` under `/home/worsch/` +- `ops/service-inventory.yml` (2026-06-05; predates cluster cutover — refresh in T03) +- `docs/infrastructure-stabilization-pickup-checkpoint.md` (2026-07-03 metaplan closeout) +- Activity definitions: `activity-definitions/daily-statehub-wsjf-triage.md`, + `activity-definitions/state-hub-consistency-sweep.md` + +## Open Gaps (not T01 blockers) + +| Gap | Follow-on | +| --- | --- | +| Forgejo production hostname / SMTP / exposure decisions | `RAIL-HO-WP-0005-T02` (human) | +| `ops/service-inventory.yml` stale environment labels | Refresh during T03 | +| Core Hub widget-type registry prerequisite | `CORE-WP-0005-T04` | +| HA Postgres / Longhorn across 2+ nodes | `RAIL-BS-WP-0007`, `CUST-WP-0038` after railiance02 | + +## Promotion to Canon + +After operator review: + +1. Move to `canon/architecture/adr-006-workstation-independence-fleet-roles.md` + (or equivalent ADR number). +2. Update `ops/service-inventory.yml` environment and service rows to match. +3. 
Link from `SCOPE.md` and `.custodian-brief.md` generation inputs. \ No newline at end of file diff --git a/infra/fleet-mesh/install-railiance01.sh b/infra/fleet-mesh/install-railiance01.sh new file mode 100755 index 0000000..0a421d7 --- /dev/null +++ b/infra/fleet-mesh/install-railiance01.sh @@ -0,0 +1,26 @@ +#!/usr/bin/env bash +# Install fleet-mesh systemd user units on railiance01 (CUST-WP-0054-T02). +set -euo pipefail + +REMOTE="${1:-railiance01}" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +ssh "$REMOTE" 'mkdir -p ~/.config/bridge ~/.config/systemd/user ~/.ssh' +scp "$SCRIPT_DIR/railiance01-tunnels.yaml" "$REMOTE:~/.config/bridge/tunnels.yaml" +scp "$SCRIPT_DIR/systemd/"*.service "$REMOTE:~/.config/systemd/user/" +scp "${HOME}/.ssh/id_ops" "${HOME}/.ssh/id_ops.pub" "$REMOTE:~/.ssh/" +ssh "$REMOTE" 'chmod 600 ~/.ssh/id_ops ~/.config/bridge/tunnels.yaml' +ssh "$REMOTE" 'sudo loginctl enable-linger tegwick 2>/dev/null || true' + +ssh "$REMOTE" bash -s <<'EOF' +set -euo pipefail +systemctl --user daemon-reload +systemctl --user enable --now fleet-state-hub-coulombcore.service +systemctl --user enable --now fleet-issue-core-coulombcore.service +sleep 2 +curl -sf http://127.0.0.1:18000/state/health +curl -sf http://127.0.0.1:18765/healthz +systemctl --user --no-pager status fleet-state-hub-coulombcore.service fleet-issue-core-coulombcore.service +EOF + +echo "Fleet mesh tunnels active on $REMOTE" \ No newline at end of file diff --git a/infra/fleet-mesh/railiance01-tunnels.yaml b/infra/fleet-mesh/railiance01-tunnels.yaml new file mode 100644 index 0000000..0df3d2b --- /dev/null +++ b/infra/fleet-mesh/railiance01-tunnels.yaml @@ -0,0 +1,51 @@ +# Fleet-owned production tunnels on railiance01 (CUST-WP-0054-T02). +# Install to: ~/.config/bridge/tunnels.yaml on railiance01 +# +# Replaces workstation reverse tunnels state-hub-railiance01 and +# issue-core-railiance01 with machine-local forward tunnels through coulombcore. 
+# +# activity-core bridge proxies (unchanged): +# actcore-state-hub-bridge -> 127.0.0.1:18000 +# actcore-issue-core-bridge -> 127.0.0.1:18765 + +tunnels: + fleet-state-hub-coulombcore: + host: 92.205.130.254 + remote_port: 8000 + local_port: 18000 + direction: local + remote_host: 10.43.170.94 + ssh_user: tegwick + ssh_key: ~/.ssh/id_ops + actor: atm-fleet-mesh + health_check: + url: http://127.0.0.1:18000/state/health + interval_seconds: 30 + timeout_seconds: 5 + reconnect: + max_attempts: 0 + backoff_initial: 5 + backoff_max: 60 + + fleet-issue-core-coulombcore: + host: 92.205.130.254 + remote_port: 8765 + local_port: 18765 + direction: local + remote_host: 10.43.103.154 + ssh_user: tegwick + ssh_key: ~/.ssh/id_ops + actor: atm-fleet-mesh + health_check: + url: http://127.0.0.1:18765/healthz + interval_seconds: 30 + timeout_seconds: 5 + reconnect: + max_attempts: 0 + backoff_initial: 5 + backoff_max: 60 + +actors: + atm-fleet-mesh: + class: atm + description: Railiance01 fleet mesh — direct production lanes to coulombcore cluster services \ No newline at end of file diff --git a/infra/fleet-mesh/systemd/fleet-issue-core-coulombcore.service b/infra/fleet-mesh/systemd/fleet-issue-core-coulombcore.service new file mode 100644 index 0000000..0beab96 --- /dev/null +++ b/infra/fleet-mesh/systemd/fleet-issue-core-coulombcore.service @@ -0,0 +1,21 @@ +[Unit] +Description=Fleet mesh issue-core forward tunnel (railiance01 to coulombcore cluster) +After=network-online.target +Wants=network-online.target +StartLimitIntervalSec=0 + +[Service] +Type=simple +ExecStart=/usr/bin/ssh -N \ + -L 127.0.0.1:18765:10.43.103.154:8765 \ + -i /home/tegwick/.ssh/id_ops \ + -o ServerAliveInterval=10 \ + -o ServerAliveCountMax=3 \ + -o ExitOnForwardFailure=yes \ + -o StrictHostKeyChecking=accept-new \ + tegwick@92.205.130.254 +Restart=always +RestartSec=5 + +[Install] +WantedBy=default.target \ No newline at end of file diff --git 
a/infra/fleet-mesh/systemd/fleet-state-hub-coulombcore.service b/infra/fleet-mesh/systemd/fleet-state-hub-coulombcore.service new file mode 100644 index 0000000..4c60607 --- /dev/null +++ b/infra/fleet-mesh/systemd/fleet-state-hub-coulombcore.service @@ -0,0 +1,21 @@ +[Unit] +Description=Fleet mesh State Hub forward tunnel (railiance01 to coulombcore cluster) +After=network-online.target +Wants=network-online.target +StartLimitIntervalSec=0 + +[Service] +Type=simple +ExecStart=/usr/bin/ssh -N \ + -L 127.0.0.1:18000:10.43.170.94:8000 \ + -i /home/tegwick/.ssh/id_ops \ + -o ServerAliveInterval=10 \ + -o ServerAliveCountMax=3 \ + -o ExitOnForwardFailure=yes \ + -o StrictHostKeyChecking=accept-new \ + tegwick@92.205.130.254 +Restart=always +RestartSec=5 + +[Install] +WantedBy=default.target \ No newline at end of file diff --git a/ops/service-inventory.yml b/ops/service-inventory.yml index 5a4b30f..69f7059 100644 --- a/ops/service-inventory.yml +++ b/ops/service-inventory.yml @@ -1,5 +1,5 @@ version: 1 -last_reviewed: "2026-06-05" +last_reviewed: "2026-07-03" policy: non_secret_inventory: true secrets_rule: "Do not store credentials, tokens, private addresses that are not already operationally documented, or command output containing secrets." 
@@ -20,11 +20,11 @@ environments: lifecycle_state: observed - id: coulombcore name: "CoulombCore" - role: "Transitional production-like runtime" - lifecycle_state: observed + role: "Legacy production host — frozen for new workloads; draining per CUST-WP-0054-T03" + lifecycle_state: draining - id: railiance01 name: "Railiance01" - role: "First ThreePhoenix foundation node" + role: "Production home — activity-core, fleet mesh, target for drain waves" lifecycle_state: observed - id: threephoenix-prod name: "ThreePhoenix Production" @@ -77,7 +77,7 @@ services: - id: gitea name: "Gitea" kind: application - lifecycle_state: observed + lifecycle_state: draining health_status: unknown environment: coulombcore owner_repos: @@ -173,9 +173,9 @@ services: - id: state-hub name: "State Hub" kind: coordination-service - lifecycle_state: observed + lifecycle_state: draining health_status: observed_ok - environment: local + environment: coulombcore owner_repos: - state-hub - the-custodian @@ -183,29 +183,146 @@ services: - "/home/worsch/state-hub" - "/home/worsch/the-custodian/state-hub/README.md" runtime: - type: local-process - host: local-workstation - ports: - - 8000 + type: k3s + cluster: coulombcore-k3s + namespace: state-hub + workload_refs: + - "cnpg:state-hub-db" + - "svc:10.43.170.94:8000" endpoints: - - id: state-hub-local-api + - id: state-hub-cluster-api type: http url: "http://127.0.0.1:8000/state/health" expected_status: 200 expected_signal: "health response" + - id: state-hub-railiance01-fleet + type: tunnel + url: "http://127.0.0.1:18000/state/health" + expected_status: 200 + expected_signal: "reachable from railiance01 fleet mesh" backing_stores: - - "postgresql:state-hub" + - "postgresql:state-hub-db" access_paths: - type: http - target: "http://127.0.0.1:8000" + target: "workstation tunnel state-hub-primary → cluster" + status: observed_ok + - type: tunnel + target: "railiance01 systemd fleet-state-hub-coulombcore → cluster" status: observed_ok evidence: - type: 
session-probe - observed_at: "2026-06-05" - source: "Codex session curl to local State Hub" - summary: "State Hub accepted inbox, task, and progress API calls." + observed_at: "2026-07-03" + source: "CUST-WP-0054-T02 fleet mesh + cluster primary" + summary: "Cluster hub healthy; railiance01 reaches via fleet forward tunnel." gaps: - - "Future cluster deployment readiness still needs ops evidence." + - "Primary home must move to railiance01 per CUST-WP-0054-T05." + - "Consistency sweep writebacks still target workstation paths." + + - id: issue-core + name: "issue-core" + kind: application + lifecycle_state: draining + health_status: observed_ok + environment: coulombcore + owner_repos: + - issue-core + runtime: + type: k3s + cluster: coulombcore-k3s + namespace: issue-core + workload_refs: + - "svc:10.43.103.154:8765" + endpoints: + - id: issue-core-api + type: http + url: "http://127.0.0.1:8765/healthz" + expected_status: 200 + expected_signal: "version response" + backing_stores: + - "postgresql:issue-core" + access_paths: + - type: tunnel + target: "railiance01 fleet-issue-core-coulombcore → cluster" + status: observed_ok + evidence: + - type: workplan-note + observed_at: "2026-07-02" + source: "ISSUE-WP-0003 completion — Gitea issue 176 emission" + summary: "REST emission live via cross-machine fleet path." + gaps: + - "Target railiance01 overlay per CUST-WP-0054 drain Wave 4." 
+ + - id: core-hub + name: "Core Hub" + kind: governance-service + lifecycle_state: draining + health_status: observed_ok + environment: coulombcore + owner_repos: + - core-hub + runtime: + type: k3s + cluster: coulombcore-k3s + namespace: core-hub-staging + endpoints: + - id: core-hub-public + type: https + url: "https://hub.coulomb.social/api/v2/hubs" + expected_status: 200 + expected_signal: "hub list when authenticated" + backing_stores: + - "postgresql:core-hub" + access_paths: + - type: k8s + target: "coulombcore-k3s/core-hub-staging" + status: observed_ok + evidence: + - type: workplan-note + observed_at: "2026-07-02" + source: "CUST-WP-0051 metaplan closeout" + summary: "Staging deployed; production cutover gated on CORE-WP-0005-T04." + gaps: + - "Production cutover to railiance01 pending operator approval." + + - id: fleet-mesh-railiance01 + name: "Fleet Mesh (railiance01)" + kind: connectivity-service + lifecycle_state: observed + health_status: observed_ok + environment: railiance01 + owner_repos: + - the-custodian + - ops-bridge + desired_state_sources: + - "/home/worsch/the-custodian/infra/fleet-mesh/" + runtime: + type: systemd + host: railiance01 + workload_refs: + - "fleet-state-hub-coulombcore.service" + - "fleet-issue-core-coulombcore.service" + endpoints: + - id: fleet-state-hub-local + type: http + url: "http://127.0.0.1:18000/state/health" + expected_status: 200 + - id: fleet-issue-core-local + type: http + url: "http://127.0.0.1:18765/healthz" + expected_status: 200 + backing_stores: [] + access_paths: + - type: ssh-tunnel + target: "railiance01 → coulombcore ClusterIPs" + status: observed_ok + evidence: + - type: session-probe + observed_at: "2026-07-03" + source: "CUST-WP-0054-T02 cutover" + summary: "Workstation reverse tunnels stopped; systemd forwards healthy." + gaps: + - "Migrate to atm-fleet-mesh cert_command when VAULT_TOKEN available." + - "Retire when State Hub and issue-core move to railiance01." 
- id: inter-hub name: "Inter-Hub" @@ -287,7 +404,7 @@ services: name: "Ops Bridge" kind: connectivity-service lifecycle_state: observed - health_status: unknown + health_status: observed_ok environment: local owner_repos: - ops-bridge @@ -298,15 +415,15 @@ services: backing_stores: [] access_paths: - type: ssh-tunnel - target: "connected remote servers" - status: unknown + target: "interactive dev tunnels only (k3s-api, state-hub-primary)" + status: observed_ok evidence: - - type: document - observed_at: "2026-05-16" - source: "/home/worsch/helix-forge/wiki/OpsHubInventory.md" - summary: "Bridge is useful for connected-server visibility but is not itself the service catalog." + - type: session-probe + observed_at: "2026-07-03" + source: "CUST-WP-0054-T02 — production reverse tunnels retired" + summary: "state-hub-railiance01 and issue-core-railiance01 stopped; not production-critical." gaps: - - "Emit reachability evidence into ops-hub instead of relying on bridge state as inventory." + - "Install ops-bridge on railiance01 or keep systemd fleet-mesh units." - id: haskell-build-agent name: "Haskell Build Agent"