CUST-WP-0054 T01-T03: fleet architecture, de-hub runbook, drain plan

Documents the three-machine role model, fleet mesh topology, coulombcore
freeze policy, and ordered drain sequence. Adds railiance01 systemd tunnel
install assets and refreshes ops service inventory to reflect 2026-07-03
production placement (cluster State Hub, fleet mesh, draining coulombcore).
codex
2026-07-04 00:29:55 +02:00
parent 0a77483861
commit cf4be716e1
10 changed files with 1050 additions and 34 deletions


@@ -0,0 +1,200 @@
# CoulombCore Drain and Production Placement Plan
Date: 2026-07-03
Workplan: `CUST-WP-0054-T03`
Freeze policy: `canon/standards/coulombcore-production-freeze_v0.1.md`
Architecture: `docs/workstation-independence-fleet-architecture.md`
## Purpose
Ordered drain sequence for every production workload on coulombcore
(`92.205.130.254`, `coulombcore-k3s`). Each row names current placement,
target placement, migration method, owner workplan, and prerequisites.
**Coupling rule:** forge and State Hub move early; identity + OpenBao move
last because everything authenticates through them.
## Wave overview
```
Wave 0 Freeze policy (this document + canon) — effective 2026-07-03
Wave 1 Source forge + CI runners ─────────── RAIL-HO-WP-0005 / CUST-WP-0054-T04
Wave 2 State Hub primary + sweep checkouts ── CUST-WP-0054-T05 / CUST-WP-0011
Wave 3 Core Hub production ────────────────── CORE-WP-0005
Wave 4 issue-core ─────────────────────────── ISSUE-WP-0003 + overlay
Wave 5 GitOps control plane (ESO, ArgoCD) ─── railiance-cluster overlays
Wave 6 Application stragglers ─────────────── per-app overlays
Wave 7 OpenBao + identity stack ───────────── NET-WP-0020 + key-cape (LAST)
Wave 8 coulombcore phoenix → railiance02 ─── CUST-WP-0054-T09
```
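The ordering above can be checked mechanically. The following is a minimal sketch, assuming the prerequisites are the ones stated in the per-wave `Prereq` and `Gate` lines of this plan. The `PREREQS` mapping and `runnable_waves` helper are hypothetical names, and the dependencies given for Waves 1, 6, and 7 are assumptions derived from the freeze policy and the coupling rule rather than from an explicit prerequisite line.

```python
from graphlib import TopologicalSorter

# Hypothetical encoding of the prerequisites stated in this plan.
# Assumptions: Waves 1 and 6 only need the Wave 0 freeze; Wave 7 waits for
# Waves 1-6 because identity + OpenBao move last (coupling rule).
PREREQS = {
    1: {0},
    2: {1},
    3: {1},
    4: {1},
    5: {1, 2},
    6: {0},
    7: {1, 2, 3, 4, 5, 6},
    8: {7},
}


def runnable_waves(done: set[int]) -> list[int]:
    """Return waves not yet done whose prerequisites are all complete."""
    return sorted(w for w, deps in PREREQS.items() if w not in done and deps <= done)


# One full order consistent with the prerequisites (raises CycleError on a cycle).
order = list(TopologicalSorter(PREREQS).static_order())
```

Calling `runnable_waves` with the set of completed waves shows what may start next, which is useful when waves overlap in practice.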
## Placement register
| # | Workload | Current (2026-07-03) | Target | Method | Owner | Wave | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | **Gitea + OCI registry** | coulombcore-k3s `default`; `gitea.coulomb.social` | railiance01 **`forgejo.coulomb.social`** | Staged-promotion S5 overlay; `RAIL-HO-WP-0005` probe → production; Gitea → read-only mirror | `RAIL-HO-WP-0005`, `CUST-WP-0054-T04` | 1 | grandfathered |
| 2 | **Forgejo Actions / CI runners** | none (workstation manual build) | railiance01 | New S5 overlay; image build on tag push | `CUST-WP-0054-T04` | 1 | planned |
| 3 | **Gitea DB + PVC** | coulombcore `databases` / `gitea-shared-storage` | railiance01 CNPG + PVC | Migrate with Forgejo; backup/restore drill required | `RAIL-HO-WP-0005` | 1 | grandfathered |
| 4 | **State Hub API (primary)** | coulombcore CNPG `state-hub-db`; cluster Svc `10.43.170.94:8000` | railiance01 CNPG + Deployment | `CUST-WP-0011-T07` playbook: freeze → exact-count restore → rewire; staged-promotion overlay | `CUST-WP-0054-T05`, `CUST-WP-0011` | 2 | grandfathered |
| 5 | **State Hub sweep checkouts** | workstation `/home/worsch/*` (74 repos) | railiance01 clone tree from forge | Relocate `host_paths` / `local_path`; no workstation writeback | `CUST-WP-0054-T05`, `STATE-WP-0064` | 2 | planned |
| 6 | **WSL2 State Hub fallback** | workstation WSL2 | retired | Stop after railiance01 primary stabilizes | `CUST-WP-0011-T08/T09`, `CUST-WP-0054-T10` | 2 | grandfathered |
| 7 | **Core Hub** | coulombcore `core-hub-staging`; public `hub.coulomb.social` | railiance01 | Staged-promotion overlay; dual-run prerequisite (`CORE-WP-0005-T04`) | `CORE-WP-0005` | 3 | grandfathered |
| 8 | **Inter-Hub (Haskell)** | coulombcore external | retired | Rollback-only after Core Hub cutover | `CORE-WP-0007` | 3 | grandfathered |
| 9 | **issue-core** | coulombcore `issue-core` ns; ClusterIP `10.43.103.154:8765` | railiance01 | Staged-promotion overlay; shorten fleet tunnel to local svc | `ISSUE-WP-0003`, `CUST-WP-0054-T03` | 4 | grandfathered |
| 10 | **issue-core CNPG** | coulombcore | railiance01 | Migrate with issue-core workload | `railiance-platform` | 4 | grandfathered |
| 11 | **External Secrets Operator** | coulombcore | railiance01 | GitOps follows forge; ESO stores point at railiance01 OpenBao post-Wave 7, or at a documented interim coulombcore path until then | `railiance-platform` | 5 | grandfathered |
| 12 | **ArgoCD** | coulombcore (boundary: should be S4) | railiance01 | Staged-promotion; repoint repo URLs to Forgejo | `railiance-cluster` | 5 | grandfathered |
| 13 | **llm-connect** | railiance01 `activity-core` ns (partial) | railiance01 | Already on target machine; complete in-cluster profile | `CCR-2026-0003` lane | 6 | observed |
| 14 | **activity-core** | railiance01 `activity-core` ns | railiance01 (retain) | No move; update sinks (T06) and hub URL post-Wave 2 | — | — | **on target** |
| 15 | **Temporal / NATS** | railiance01 | railiance01 (retain) | Co-located with activity-core | — | — | **on target** |
| 16 | **ops-hub evidence / widgets** | files + Core Hub path | railiance01 via Core Hub | Follows Core Hub; not coulombcore-blocking | `CUST-WP-0025`, `CUST-WP-0049` | 6 | planned |
| 17 | **artifact-store / MinIO lane** | assessment only | railiance01 or compatible endpoint | Compatibility-profile per `ARTIFACT-STORE-WP-0007` | `ARTIFACT-STORE-WP-0007` | 6 | planned |
| 18 | **OpenBao** | coulombcore | railiance01 | **Last infrastructure wave**; `NET-WP-0020` unseal automation; CNPG + seal migration | `NET-WP-0020`, `railiance-platform` | 7 | grandfathered |
| 19 | **KeyCape** | coulombcore | railiance01 | Follows OpenBao; OIDC/MFA paths | `key-cape` | 7 | grandfathered |
| 20 | **Authelia** | coulombcore | railiance01 | Identity front door | `key-cape` / `railiance-platform` | 7 | grandfathered |
| 21 | **privacyIDEA** | coulombcore | railiance01 | MFA backend | `key-cape` | 7 | grandfathered |
| 22 | **lldap** | coulombcore | railiance01 | LDAP directory | `key-cape` / `railiance-platform` | 7 | grandfathered |
| 23 | **flex-auth** | coulombcore | railiance01 | Policy registry follows identity | `flex-auth` | 7 | grandfathered |
| 24 | **Fleet mesh transit tunnels** | railiance01 systemd → coulombcore ClusterIPs | railiance01-local services | Retire when Waves 2+4 complete (hub + issue-core local) | `CUST-WP-0054-T02` | 2+4 | **interim active** |
| 25 | **CNPG operator** | coulombcore (boundary note) | railiance01 | Platform operator moves with Wave 2+ workloads | `railiance-platform` | 2-7 | grandfathered |
| 26 | **coulombcore host identity** | coulombcore | railiance02 | Machine phoenix after Wave 7 | `CUST-WP-0054-T09`, `CUST-WP-0054-T08` | 8 | wait |
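A register like this is easy to lint. The sketch below validates a Wave cell against the waves in the overview (0 through 8). It assumes multi-wave cells are written as `2+4` (a list) or `2-7` (an inclusive range), and that an em dash marks rows already on target. `parse_wave_cell` is a hypothetical helper, not an existing tool.

```python
import re

# Waves defined in the overview are 0 through 8.
VALID_WAVES = set(range(0, 9))

# "\u2014" is the em dash used in the register for rows that do not move.
NO_WAVE = {"\u2014", "-", ""}


def parse_wave_cell(cell: str) -> set[int]:
    """Parse a Wave cell such as '3', '2+4', or '2-7' into a set of wave numbers."""
    cell = cell.strip()
    if cell in NO_WAVE:
        return set()
    m = re.fullmatch(r"(\d)\s*[-\u2013]\s*(\d)", cell)
    if m:  # inclusive range such as '2-7'
        waves = set(range(int(m.group(1)), int(m.group(2)) + 1))
    else:  # single wave or '+'-joined list such as '2+4'
        waves = {int(part) for part in cell.split("+")}
    unknown = waves - VALID_WAVES
    if unknown:
        raise ValueError(f"unknown wave(s) in {cell!r}: {sorted(unknown)}")
    return waves
```

A cell whose separator was lost (for example a two-digit number) fails with `ValueError`, because no such wave exists.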
## Per-wave detail
### Wave 1 — Source forge + CI (unblocks repos and images)
**Goal:** All repos and container images publish from railiance01; coulombcore
Gitea becomes read-only mirror.
| Step | Action | Done when |
| --- | --- | --- |
| 1.1 | Resolve `RAIL-HO-WP-0005-T02` production decisions (hostname **decided:** `forgejo.coulomb.social`; SMTP, runners, backup still open) | `docs/forgejo-production-decisions.md` |
| 1.2 | Disposable Forgejo probe namespace + restore drill | Backup/restore evidence id recorded |
| 1.3 | Production Forgejo cutover | All 74 repo remotes point at Forgejo; push/pull verified |
| 1.4 | Actions runners for `state-hub`, `core-hub`, `activity-core`, `issue-core` | Tag-triggered image lands in forge OCI |
| 1.5 | Gitea → read-only mirror on coulombcore | Rollback window documented; no new writes |
**Blocks:** Wave 2 sweep checkouts (needs forge clones on railiance01).
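The "done when" condition for step 1.3 can be approximated with a remote audit. This is a sketch only: it assumes the remotes have already been collected into a mapping of repo name to URL (for example from `git remote get-url origin`), and `remote_host` and `stale_remotes` are hypothetical helpers. It checks the host only, not push/pull.

```python
from urllib.parse import urlsplit

FORGE_HOST = "forgejo.coulomb.social"  # hostname decided in step 1.1


def remote_host(url: str) -> str:
    """Extract the host from an https/ssh URL or an scp-style 'user@host:path' remote."""
    if "://" in url:
        return urlsplit(url).hostname or ""
    # scp-style syntax: [user@]host:path
    host_part = url.split(":", 1)[0]
    return host_part.rsplit("@", 1)[-1]


def stale_remotes(remotes: dict[str, str]) -> dict[str, str]:
    """Return the repos whose remote does not point at the Forgejo host."""
    return {repo: url for repo, url in remotes.items() if remote_host(url) != FORGE_HOST}
```

An empty result across all 74 repos covers the host half of the check; push/pull still needs a real round trip.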
### Wave 2 — State Hub home on railiance01
**Goal:** Automation loop machine-local; consistency sweeps write back to
railiance01 checkouts, not workstation paths.
| Step | Action | Done when |
| --- | --- | --- |
| 2.1 | CNPG + storage review on railiance01 | Platform sign-off |
| 2.2 | `CUST-WP-0011-T07` cutover to railiance01 primary | Row counts match; `127.0.0.1:8000` serves railiance01 hub |
| 2.3 | Clone/register 74 repos on railiance01 from Forgejo | `fix-consistency` writebacks use railiance01 paths |
| 2.4 | Retire fleet tunnel `fleet-state-hub-coulombcore` | activity-core reaches hub without coulombcore hop |
| 2.5 | WSL2 fallback retirement (optional, after stabilization) | `CUST-WP-0011-T08/T09` |
**Prereq:** Wave 1 forge (clone source).
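Step 2.2 hinges on row counts matching between the frozen source and the restored target. A minimal sketch of that comparison follows, assuming the per-table counts have already been gathered on each side. The `CUST-WP-0011-T07` playbook remains authoritative for how they are collected, and `count_mismatches` is a hypothetical helper.

```python
from typing import Optional


def count_mismatches(
    source: dict[str, int], target: dict[str, int]
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Return {table: (source_count, target_count)} for every table that differs.

    A table missing on one side is reported with None for that side.
    """
    tables = source.keys() | target.keys()
    return {
        t: (source.get(t), target.get(t))
        for t in sorted(tables)
        if source.get(t) != target.get(t)
    }
```

An empty dict means the exact-count condition holds; anything else names the tables to investigate before rewiring.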
### Wave 3 — Core Hub production
**Goal:** `hub.coulomb.social` served from railiance01 Core Hub.
| Step | Action | Done when |
| --- | --- | --- |
| 3.1 | Close `CORE-WP-0005-T04` prerequisites (widget types, auth posture) | Catalog gap resolved |
| 3.2 | Operator-approved cutover with rollback plan | Deployed smoke + activity-core sink green |
| 3.3 | Inter-Hub marked rollback-only | `CORE-WP-0007` unblocks |
**Prereq:** Wave 1 (images via forge CI).
### Wave 4 — issue-core
**Goal:** Emission path is railiance01-local; no coulombcore ClusterIP in path.
| Step | Action | Done when |
| --- | --- | --- |
| 4.1 | Staged-promotion overlay on railiance01 | ArgoCD sync healthy |
| 4.2 | Migrate CNPG + secrets | ExternalSecret Ready |
| 4.3 | Point `ISSUE_CORE_URL` at in-cluster svc | Retire `fleet-issue-core-coulombcore` tunnel |
| 4.4 | Safe emission smoke | HTTP 201 + Gitea/Forgejo issue created |
**Prereq:** Wave 1 (image + gitops); credential lane `CCR-2026-0002` active.
### Wave 5 — GitOps control plane
**Goal:** ArgoCD and ESO run on railiance01 and track Forgejo repos.
| Step | Action | Done when |
| --- | --- | --- |
| 5.1 | ArgoCD overlay on railiance01 | Sync from Forgejo remotes |
| 5.2 | ESO → SecretStore paths updated | Workloads on railiance01 pull secrets |
| 5.3 | Decommission coulombcore ArgoCD Applications | No new syncs to coulombcore-k3s |
**Prereq:** Waves 1-2 (forge URLs, hub coordination).
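Repointing ArgoCD repo URLs amounts to a host swap. The sketch below assumes repository paths are unchanged between `gitea.coulomb.social` and `forgejo.coulomb.social`, which this plan does not state. `repoint` is a hypothetical helper and leaves scp-style remotes (`git@host:path`) untouched.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "gitea.coulomb.social"
NEW_HOST = "forgejo.coulomb.social"


def repoint(url: str) -> str:
    """Swap the Gitea host for the Forgejo host in an http(s):// or ssh:// URL.

    URLs on any other host, and scp-style remotes, are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname != OLD_HOST:
        return url
    netloc = parts.netloc.replace(OLD_HOST, NEW_HOST, 1)  # keeps user@ and :port
    return urlunsplit(parts._replace(netloc=netloc))
```

Applying this to each Application's repo URL and diffing the result gives a reviewable change set before step 5.1.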
### Wave 6 — Application stragglers
Low-coupling apps and evidence lanes that do not block earlier waves:
- llm-connect production profile completion
- ops-hub widget evidence via Core Hub
- artifact-store compatibility endpoint (if approved)

Each uses staged-promotion unless listed under **Documented exceptions**.
### Wave 7 — OpenBao + identity (LAST)
**Goal:** Authentication and secret custody off coulombcore.
| Step | Action | Done when |
| --- | --- | --- |
| 7.1 | OpenBao staged-promotion to railiance01 | Unseal automation (`NET-WP-0020`) proven |
| 7.2 | KeyCape / Authelia / privacyIDEA / lldap migration | OIDC login smoke on railiance01 |
| 7.3 | flex-auth registry points at new identity endpoints | Credential lanes re-pointed |
| 7.4 | CCR/applier paths verified | No production secret reads from coulombcore OpenBao |
**Gate:** `CUST-WP-0054-T09` cannot start until Wave 7 completes.
### Wave 8 — Phoenix to railiance02
Execute `CUST-WP-0054-T09` via T08 automation: wipe coulombcore, rebuild as
railiance02, join fleet. Requires a DNS/cert plan for the remaining
`*.coulomb.social` names.
## Documented exceptions
| Workload | Reason | Target date | Rollback | Approval |
| --- | --- | --- | --- | --- |
| Fleet mesh systemd tunnels | Wave 2/4 not complete; railiance01 reaches coulombcore ClusterIPs | Until Waves 2+4 done | Re-enable workstation reverse tunnels per `docs/fleet-mesh-dehub-runbook.md` | `CUST-WP-0054-T02` cutover 2026-07-03 |
| Core Hub staging on coulombcore | Pre-cutover smoke environment | Until Wave 3 cutover | Keep staging namespace | `CORE-WP-0005` |
| Static `id_ops` SSH key on railiance01 fleet units | `atm-fleet-mesh` cert_command blocked on VAULT_TOKEN | Until warden sign available | ops-bridge or rotated key | `CUST-WP-0054-T02` interim |
No other exceptions as of 2026-07-03. New exceptions require a State Hub
decision or workplan amendment.
## Staged-promotion method (default)
Per `RAIL-BS-WP-0006` (finished):
1. `railiance/<app>/app.toml` + overlay in owning repo
2. Stage 1 deploy → observe → promote with evidence
3. Backup/restore drill before production promotion
4. Rollback revision documented

Apps without overlays yet must get an overlay scaffold before Wave execution.
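The four steps above work as a checklist that gates production promotion. This is a rough sketch of that gate, assuming each step leaves behind an evidence id or revision string. `PromotionRecord` and its field names are illustrative only and do not come from `RAIL-BS-WP-0006`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromotionRecord:
    """Hypothetical evidence record for one app; field names are illustrative."""

    has_overlay: bool = False                # step 1: app.toml + overlay exist
    stage1_evidence: Optional[str] = None    # step 2: stage 1 observation evidence id
    restore_drill: Optional[str] = None      # step 3: backup/restore drill evidence id
    rollback_revision: Optional[str] = None  # step 4: documented rollback revision


def missing_for_production(rec: PromotionRecord) -> list[str]:
    """List the staged-promotion steps that still block production promotion."""
    checks = [
        ("overlay", rec.has_overlay),
        ("stage 1 evidence", bool(rec.stage1_evidence)),
        ("backup/restore drill", bool(rec.restore_drill)),
        ("rollback revision", bool(rec.rollback_revision)),
    ]
    return [name for name, ok in checks if not ok]
```

An app is ready for production promotion only when the returned list is empty.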
## Inventory sync
`ops/service-inventory.yml` updated 2026-07-03 for:
- coulombcore `lifecycle_state: draining` on grandfathered production services
- State Hub primary on coulombcore cluster (not workstation)
- railiance01 fleet-mesh and activity-core placement
- ops-bridge on railiance01 via systemd (not workstation hub)

Regenerate catalog view: `make ops-inventory-view`
## Human gates (not agent-executable)
| Gate | Owner | Blocks |
| --- | --- | --- |
| Forgejo T02 production decisions | operator | Wave 1 |
| State Hub railiance01 cutover approval | operator; `CUST-WP-0011-T07` | Wave 2 |
| Core Hub production cutover | operator; `CORE-WP-0005-T04` | Wave 3 |
| OpenBao/identity migration approval | operator + custody | Wave 7 |
| coulombcore phoenix approval | operator | Wave 8 |
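An agent can use the gate table to decide which waves it must not start. The sketch below encodes the table as data; the short gate keys are hypothetical labels invented for illustration, and approval itself stays with the listed human owners.

```python
# Gate -> wave it blocks, as listed in the table above.
# The gate keys are hypothetical labels, not identifiers from any workplan.
HUMAN_GATES = {
    "forgejo-t02-decisions": 1,
    "state-hub-cutover": 2,
    "core-hub-cutover": 3,
    "openbao-identity-migration": 7,
    "coulombcore-phoenix": 8,
}


def blocked_waves(approved: set[str]) -> list[int]:
    """Return the waves still waiting on a human gate approval."""
    return sorted(w for gate, w in HUMAN_GATES.items() if gate not in approved)
```

Waves 4, 5, and 6 never appear in the result because the table lists no human gate for them.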