# CoulombCore Drain and Production Placement Plan
Date: 2026-07-03
Workplan: `CUST-WP-0054-T03`
Freeze policy: `canon/standards/coulombcore-production-freeze_v0.1.md`
Architecture: `docs/workstation-independence-fleet-architecture.md`
## Purpose
Ordered drain sequence for every production workload on coulombcore
(`92.205.130.254`, `coulombcore-k3s`). Each row names current placement,
target placement, migration method, owner workplan, and prerequisites.
**Coupling rule:** forge and State Hub move early; identity + OpenBao move
last because everything authenticates through them.
## Wave overview
```
Wave 0 Freeze policy (this document + canon) — effective 2026-07-03
Wave 1 Source forge + CI runners ─────────── RAIL-HO-WP-0005 / CUST-WP-0054-T04
Wave 2 State Hub primary + sweep checkouts ── CUST-WP-0054-T05 / CUST-WP-0011
Wave 3 Core Hub production ────────────────── CORE-WP-0005
Wave 4 issue-core ─────────────────────────── ISSUE-WP-0003 + overlay
Wave 5 GitOps control plane (ESO, ArgoCD) ─── railiance-cluster overlays
Wave 6 Application stragglers ─────────────── per-app overlays
Wave 7 OpenBao + identity stack ───────────── NET-WP-0020 + key-cape (LAST)
Wave 8 coulombcore phoenix → railiance02 ─── CUST-WP-0054-T09
```
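
The wave order can be checked mechanically against the prerequisites stated in the per-wave sections. The sketch below is illustrative only. The `PREREQS` mapping is transcribed by hand from this plan, with two assumptions: the Wave 5 prerequisite is read as Waves 1 and 2, and the "move last" coupling rule is modeled as Wave 7 depending on Waves 1 through 6. `startable` and `order_is_valid` are hypothetical helpers, not part of any fleet tooling.

```python
# Prerequisites transcribed from the per-wave sections of this plan.
# Assumption: "move last" for Wave 7 is modeled as depending on Waves 1-6.
# Wave 0 (freeze policy) is already in effect and is left out.
PREREQS = {
    1: set(),
    2: {1},
    3: {1},
    4: {1},
    5: {1, 2},
    6: set(),
    7: {1, 2, 3, 4, 5, 6},
    8: {7},
}


def startable(done):
    """Return waves not yet done whose prerequisites are all complete."""
    return sorted(
        wave for wave, needs in PREREQS.items()
        if wave not in done and needs <= done
    )


def order_is_valid(order):
    """True if every wave in `order` comes after all of its prerequisites."""
    seen = set()
    for wave in order:
        if not PREREQS[wave] <= seen:
            return False
        seen.add(wave)
    return True


print(startable(set()))                   # waves with no prerequisites
print(startable({1}))                     # waves unblocked by the forge move
print(order_is_valid(list(range(1, 9))))  # the numbered order itself
```

A check like this only covers ordering. The human gates and per-step "Done when" criteria still decide whether a wave may actually start.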
## Placement register
| # | Workload | Current (2026-07-03) | Target | Method | Owner | Wave | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | **Gitea + OCI registry** | coulombcore-k3s `default`; `gitea.coulomb.social` | railiance01 **`forgejo.coulomb.social`** | Staged-promotion S5 overlay; `RAIL-HO-WP-0005` probe → production; Gitea → read-only mirror | `RAIL-HO-WP-0005`, `CUST-WP-0054-T04` | 1 | grandfathered |
| 2 | **Forgejo Actions / CI runners** | none (workstation manual build) | railiance01 | New S5 overlay; image build on tag push | `CUST-WP-0054-T04` | 1 | planned |
| 3 | **Gitea DB + PVC** | coulombcore `databases` / `gitea-shared-storage` | railiance01 CNPG + PVC | Migrate with Forgejo; backup/restore drill required | `RAIL-HO-WP-0005` | 1 | grandfathered |
| 4 | **State Hub API (primary)** | coulombcore CNPG `state-hub-db`; cluster Svc `10.43.170.94:8000` | railiance01 CNPG + Deployment | `CUST-WP-0011-T07` playbook: freeze → exact-count restore → rewire; staged-promotion overlay | `CUST-WP-0054-T05`, `CUST-WP-0011` | 2 | grandfathered |
| 5 | **State Hub sweep checkouts** | workstation `/home/worsch/*` (74 repos) | railiance01 clone tree from forge | Relocate `host_paths` / `local_path`; no workstation writeback | `CUST-WP-0054-T05`, `STATE-WP-0064` | 2 | planned |
| 6 | **WSL2 State Hub fallback** | workstation WSL2 | retired | Stop after railiance01 primary stabilizes | `CUST-WP-0011-T08/T09`, `CUST-WP-0054-T10` | 2 | grandfathered |
| 7 | **Core Hub** | coulombcore `core-hub-staging`; public `hub.coulomb.social` | railiance01 | Staged-promotion overlay; dual-run prerequisite (`CORE-WP-0005-T04`) | `CORE-WP-0005` | 3 | grandfathered |
| 8 | **Inter-Hub (Haskell)** | coulombcore external | retired | Rollback-only after Core Hub cutover | `CORE-WP-0007` | 3 | grandfathered |
| 9 | **issue-core** | coulombcore `issue-core` ns; ClusterIP `10.43.103.154:8765` | railiance01 | Staged-promotion overlay; shorten fleet tunnel to local svc | `ISSUE-WP-0003`, `CUST-WP-0054-T03` | 4 | grandfathered |
| 10 | **issue-core CNPG** | coulombcore | railiance01 | Migrate with issue-core workload | `railiance-platform` | 4 | grandfathered |
| 11 | **External Secrets Operator** | coulombcore | railiance01 | GitOps follows forge; ESO stores point at railiance01 OpenBao after Wave 7, and until then the interim coulombcore path must be documented | `railiance-platform` | 5 | grandfathered |
| 12 | **ArgoCD** | coulombcore (boundary: should be S4) | railiance01 | Staged-promotion; repoint repo URLs to Forgejo | `railiance-cluster` | 5 | grandfathered |
| 13 | **llm-connect** | railiance01 `activity-core` ns (partial) | railiance01 | Already on target machine; complete in-cluster profile | `CCR-2026-0003` lane | 6 | observed |
| 14 | **activity-core** | railiance01 `activity-core` ns | railiance01 (retain) | No move; update sinks (T06) and hub URL post-Wave 2 | — | — | **on target** |
| 15 | **Temporal / NATS** | railiance01 | railiance01 (retain) | Co-located with activity-core | — | — | **on target** |
| 16 | **ops-hub evidence / widgets** | files + Core Hub path | railiance01 via Core Hub | Follows Core Hub; not coulombcore-blocking | `CUST-WP-0025`, `CUST-WP-0049` | 6 | planned |
| 17 | **artifact-store / MinIO lane** | assessment only | railiance01 or compatible endpoint | Compatibility-profile per `ARTIFACT-STORE-WP-0007` | `ARTIFACT-STORE-WP-0007` | 6 | planned |
| 18 | **OpenBao** | coulombcore | railiance01 | **Last infrastructure wave**; `NET-WP-0020` unseal automation; CNPG + seal migration | `NET-WP-0020`, `railiance-platform` | 7 | grandfathered |
| 19 | **KeyCape** | coulombcore | railiance01 | Follows OpenBao; OIDC/MFA paths | `key-cape` | 7 | grandfathered |
| 20 | **Authelia** | coulombcore | railiance01 | Identity front door | `key-cape` / `railiance-platform` | 7 | grandfathered |
| 21 | **privacyIDEA** | coulombcore | railiance01 | MFA backend | `key-cape` | 7 | grandfathered |
| 22 | **lldap** | coulombcore | railiance01 | LDAP directory | `key-cape` / `railiance-platform` | 7 | grandfathered |
| 23 | **flex-auth** | coulombcore | railiance01 | Policy registry follows identity | `flex-auth` | 7 | grandfathered |
| 24 | **Fleet mesh transit tunnels** | railiance01 systemd → coulombcore ClusterIPs | railiance01-local services | Retire when Waves 2+4 complete (hub + issue-core local) | `CUST-WP-0054-T02` | 2+4 | **interim active** |
| 25 | **CNPG operator** | coulombcore (boundary note) | railiance01 | Platform operator moves with Wave 2+ workloads | `railiance-platform` | 2–7 | grandfathered |
| 26 | **coulombcore host identity** | coulombcore | railiance02 | Machine phoenix after Wave 7 | `CUST-WP-0054-T09`, `CUST-WP-0054-T08` | 8 | wait |
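
The register can also be treated as data when tracking what still has to leave coulombcore. The following is a minimal sketch, assuming a simplified record shape: `Placement`, its field names, and `still_to_drain` are hypothetical, the "Current" column is reduced to a bare host name, and only a hand-copied subset of rows is included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Placement:
    workload: str
    current_host: str      # simplified from the "Current" column
    target_host: str
    wave: Optional[int]    # None where the register shows no wave
    status: str


# A hand-copied subset of the register, not the full table.
REGISTER = [
    Placement("Gitea + OCI registry", "coulombcore", "railiance01", 1, "grandfathered"),
    Placement("State Hub API (primary)", "coulombcore", "railiance01", 2, "grandfathered"),
    Placement("activity-core", "railiance01", "railiance01", None, "on target"),
    Placement("OpenBao", "coulombcore", "railiance01", 7, "grandfathered"),
    Placement("coulombcore host identity", "coulombcore", "railiance02", 8, "wait"),
]


def still_to_drain(rows):
    """Workloads whose current host differs from the target, in wave order."""
    pending = [r for r in rows if r.current_host != r.target_host]
    return sorted(pending, key=lambda r: r.wave if r.wave is not None else 0)


for row in still_to_drain(REGISTER):
    print(f"wave {row.wave}: {row.workload} -> {row.target_host}")
```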
## Per-wave detail
### Wave 1 — Source forge + CI (unblocks repos and images)
**Goal:** All repos and container images publish from railiance01; the coulombcore
Gitea becomes a read-only mirror.
| Step | Action | Done when |
| --- | --- | --- |
| 1.1 | Resolve `RAIL-HO-WP-0005-T02` production decisions (hostname **decided:** `forgejo.coulomb.social`; SMTP, runners, backup still open) | `docs/forgejo-production-decisions.md` |
| 1.2 | Disposable Forgejo probe namespace + restore drill | Backup/restore evidence id recorded |
| 1.3 | Production Forgejo cutover | All 74 repo remotes point at Forgejo; push/pull verified |
| 1.4 | Actions runners for `state-hub`, `core-hub`, `activity-core`, `issue-core` | Tag-triggered image lands in forge OCI |
| 1.5 | Gitea → read-only mirror on coulombcore | Rollback window documented; no new writes |
**Blocks:** Wave 2 sweep checkouts (needs forge clones on railiance01).
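
The "Done when" condition for step 1.3 (all repo remotes point at Forgejo) can be verified by comparing remote hosts. This is a sketch under stated assumptions: `remote_host` and `stale_remotes` are hypothetical helpers, the `example-org` path segment is made up, and the remote URLs would in practice come from each checkout's git configuration.

```python
from urllib.parse import urlparse

FORGE_HOST = "forgejo.coulomb.social"  # hostname decided in step 1.1


def remote_host(url):
    """Extract the host from a URL with a scheme or an scp-style git remote."""
    if "://" in url:
        return urlparse(url).hostname or ""
    # scp-style remote: [user@]host:path
    user_host = url.split(":", 1)[0]
    return user_host.rsplit("@", 1)[-1]


def stale_remotes(remotes):
    """Names of repos whose remote does not point at the Forgejo host."""
    return sorted(
        name for name, url in remotes.items()
        if remote_host(url) != FORGE_HOST
    )


# Illustrative remotes; the org path is a placeholder.
remotes = {
    "state-hub": "https://forgejo.coulomb.social/example-org/state-hub.git",
    "core-hub": "git@gitea.coulomb.social:example-org/core-hub.git",
    "issue-core": "ssh://git@forgejo.coulomb.social/example-org/issue-core.git",
}
print(stale_remotes(remotes))
```

An empty result across all 74 repos would correspond to the first half of the step 1.3 criterion. Push/pull verification is a separate check.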
### Wave 2 — State Hub home on railiance01
**Goal:** Automation loop machine-local; consistency sweeps write back to
railiance01 checkouts, not workstation paths.
| Step | Action | Done when |
| --- | --- | --- |
| 2.1 | CNPG + storage review on railiance01 | Platform sign-off |
| 2.2 | `CUST-WP-0011-T07` cutover to railiance01 primary | Row counts match; `127.0.0.1:8000` serves railiance01 hub |
| 2.3 | Clone/register 74 repos on railiance01 from Forgejo | `fix-consistency` writebacks use railiance01 paths |
| 2.4 | Retire fleet tunnel `fleet-state-hub-coulombcore` | activity-core reaches hub without coulombcore hop |
| 2.5 | WSL2 fallback retirement (optional, after stabilization) | `CUST-WP-0011-T08/T09` |
**Prereq:** Wave 1 forge (clone source).
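
Step 2.2 is done when row counts match between the old and new primaries. A minimal sketch of that comparison follows, assuming per-table counts have already been collected from both databases. `count_mismatches` is a hypothetical helper, and the table names and numbers are made up for the example.

```python
def count_mismatches(source, target):
    """Map each table whose counts differ, or that is missing on one side,
    to a (source_count, target_count) pair."""
    tables = sorted(set(source) | set(target))
    return {
        table: (source.get(table), target.get(table))
        for table in tables
        if source.get(table) != target.get(table)
    }


# Illustrative numbers only; table names are invented for the example.
source_counts = {"workplans": 120, "tasks": 840, "decisions": 37}
target_counts = {"workplans": 120, "tasks": 839, "decisions": 37}

mismatches = count_mismatches(source_counts, target_counts)
print(mismatches or "row counts match")
```

Because the playbook freezes writes before the restore, any non-empty result would point at an incomplete restore rather than at drift.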
### Wave 3 — Core Hub production
**Goal:** `hub.coulomb.social` served from railiance01 Core Hub.
| Step | Action | Done when |
| --- | --- | --- |
| 3.1 | Close `CORE-WP-0005-T04` prerequisites (widget types, auth posture) | Catalog gap resolved |
| 3.2 | Operator-approved cutover with rollback plan | Deployed smoke + activity-core sink green |
| 3.3 | Inter-Hub marked rollback-only | `CORE-WP-0007` unblocks |
**Prereq:** Wave 1 (images via forge CI).
### Wave 4 — issue-core
**Goal:** Emission path is railiance01-local; no coulombcore ClusterIP in path.
| Step | Action | Done when |
| --- | --- | --- |
| 4.1 | Staged-promotion overlay on railiance01 | ArgoCD sync healthy |
| 4.2 | Migrate CNPG + secrets | ExternalSecret Ready |
| 4.3 | Point `ISSUE_CORE_URL` at in-cluster svc | Retire `fleet-issue-core-coulombcore` tunnel |
| 4.4 | Safe emission smoke | HTTP 201 + Gitea/Forgejo issue created |
**Prereq:** Wave 1 (image + gitops); credential lane `CCR-2026-0002` active.
### Wave 5 — GitOps control plane
**Goal:** ArgoCD and ESO run on railiance01 and track Forgejo repos.
| Step | Action | Done when |
| --- | --- | --- |
| 5.1 | ArgoCD overlay on railiance01 | Sync from Forgejo remotes |
| 5.2 | ESO → SecretStore paths updated | Workloads on railiance01 pull secrets |
| 5.3 | Decommission coulombcore ArgoCD Applications | No new syncs to coulombcore-k3s |
**Prereq:** Waves 1–2 (forge URLs, hub coordination).
### Wave 6 — Application stragglers
Low-coupling apps and evidence lanes that do not block earlier waves:
- llm-connect production profile completion
- ops-hub widget evidence via Core Hub
- artifact-store compatibility endpoint (if approved)
Each uses staged-promotion unless listed under **Documented exceptions**.
### Wave 7 — OpenBao + identity (LAST)
**Goal:** Authentication and secret custody off coulombcore.
| Step | Action | Done when |
| --- | --- | --- |
| 7.1 | OpenBao staged-promotion to railiance01 | Unseal automation (`NET-WP-0020`) proven |
| 7.2 | KeyCape / Authelia / privacyIDEA / lldap migration | OIDC login smoke on railiance01 |
| 7.3 | flex-auth registry points at new identity endpoints | Credential lanes re-pointed |
| 7.4 | CCR/applier paths verified | No production secret reads from coulombcore OpenBao |
**Gate:** `CUST-WP-0054-T09` cannot start until Wave 7 completes.
### Wave 8 — Phoenix to railiance02
Execute `CUST-WP-0054-T09` via T08 automation: wipe coulombcore, rebuild it as
railiance02, and join it to the fleet. This wave also covers the DNS/cert plan
for the remaining `*.coulomb.social` names.
## Documented exceptions
| Workload | Reason | Target date | Rollback | Approval |
| --- | --- | --- | --- | --- |
| Fleet mesh systemd tunnels | Wave 2/4 not complete; railiance01 reaches coulombcore ClusterIPs | Until Waves 2+4 done | Re-enable workstation reverse tunnels per `docs/fleet-mesh-dehub-runbook.md` | `CUST-WP-0054-T02` cutover 2026-07-03 |
| Core Hub staging on coulombcore | Pre-cutover smoke environment | Until Wave 3 cutover | Keep staging namespace | `CORE-WP-0005` |
| Static `id_ops` SSH key on railiance01 fleet units | `atm-fleet-mesh` cert_command blocked on VAULT_TOKEN | Until warden sign available | ops-bridge or rotated key | `CUST-WP-0054-T02` interim |
No other exceptions as of 2026-07-03. New exceptions require a State Hub
decision or workplan amendment.
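
The fleet mesh tunnel exception closes once both named tunnels are retired (steps 2.4 and 4.3). The sketch below shows one way to track that. `TUNNEL_RETIRED_BY`, `tunnels_still_needed`, and `exception_can_close` are hypothetical names, and the mapping is copied by hand from the per-wave tables.

```python
# Tunnel name -> wave whose completion retires it (steps 2.4 and 4.3).
TUNNEL_RETIRED_BY = {
    "fleet-state-hub-coulombcore": 2,
    "fleet-issue-core-coulombcore": 4,
}


def tunnels_still_needed(completed_waves):
    """Tunnels whose retiring wave has not completed yet."""
    return sorted(
        name for name, wave in TUNNEL_RETIRED_BY.items()
        if wave not in completed_waves
    )


def exception_can_close(completed_waves):
    """True once no transit tunnel to coulombcore is still required."""
    return not tunnels_still_needed(completed_waves)


print(tunnels_still_needed({1, 2}))
print(exception_can_close({1, 2, 3, 4}))
```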
## Staged-promotion method (default)
Per `RAIL-BS-WP-0006` (finished):
1. `railiance/<app>/app.toml` + overlay in owning repo
2. Stage 1 deploy → observe → promote with evidence
3. Backup/restore drill before production promotion
4. Rollback revision documented
Apps without overlays yet must get an overlay scaffold before Wave execution.
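
The four steps above amount to a checklist that must be complete before production promotion. As a rough illustration, assuming evidence is tracked as simple identifiers, the check could look like this. `PromotionRecord`, its fields, and the evidence id `EV-EXAMPLE-1` are hypothetical and are not defined by `RAIL-BS-WP-0006`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromotionRecord:
    app: str
    has_overlay: bool                         # step 1: app.toml + overlay exist
    stage1_evidence: Optional[str] = None     # step 2: evidence from observe phase
    restore_drill: Optional[str] = None       # step 3: backup/restore drill evidence
    rollback_revision: Optional[str] = None   # step 4: documented rollback revision


def missing_for_production(rec):
    """List the staged-promotion steps that still lack evidence, in step order."""
    checks = [
        ("overlay scaffold", rec.has_overlay),
        ("stage 1 evidence", rec.stage1_evidence),
        ("backup/restore drill", rec.restore_drill),
        ("rollback revision", rec.rollback_revision),
    ]
    return [name for name, value in checks if not value]


# Example record with a made-up evidence id.
rec = PromotionRecord(app="issue-core", has_overlay=True, stage1_evidence="EV-EXAMPLE-1")
print(missing_for_production(rec))
```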
## Inventory sync
`ops/service-inventory.yml` updated 2026-07-03 for:
- coulombcore `lifecycle_state: draining` on grandfathered production services
- State Hub primary on coulombcore cluster (not workstation)
- railiance01 fleet-mesh and activity-core placement
- ops-bridge on railiance01 via systemd (not workstation hub)
Regenerate catalog view: `make ops-inventory-view`
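
A consistency check over the inventory could flag coulombcore services that are not yet marked as draining. This is only a sketch: the real schema of `ops/service-inventory.yml` is not shown in this plan, so the `name` and `host` keys and the `active` value are assumptions. Only `lifecycle_state: draining` comes from the list above. The entries are inlined as Python dicts to avoid depending on a YAML parser.

```python
# Assumed entry shape; only `lifecycle_state: draining` is taken from this plan.
inventory = [
    {"name": "gitea", "host": "coulombcore", "lifecycle_state": "draining"},
    {"name": "openbao", "host": "coulombcore", "lifecycle_state": "active"},
    {"name": "activity-core", "host": "railiance01", "lifecycle_state": "active"},
]


def not_marked_draining(entries, host="coulombcore"):
    """Names of services on `host` whose lifecycle_state is not 'draining'."""
    return [
        entry["name"] for entry in entries
        if entry["host"] == host and entry.get("lifecycle_state") != "draining"
    ]


print(not_marked_draining(inventory))
```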
## Human gates (not agent-executable)
| Gate | Owner | Blocks |
| --- | --- | --- |
| Forgejo T02 production decisions | operator | Wave 1 |
| State Hub railiance01 cutover approval | operator; `CUST-WP-0011-T07` | Wave 2 |
| Core Hub production cutover | operator; `CORE-WP-0005-T04` | Wave 3 |
| OpenBao/identity migration approval | operator + custody | Wave 7 |
| coulombcore phoenix approval | operator | Wave 8 |
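
Since these gates are not agent-executable, automation should refuse to start a gated wave until the approval is recorded. A minimal sketch, assuming approvals are tracked as a set of gate names copied from the table above. `HUMAN_GATES` and `waves_awaiting_approval` are hypothetical names.

```python
# Gate -> wave it blocks, copied from the human gates table.
HUMAN_GATES = {
    "Forgejo T02 production decisions": 1,
    "State Hub railiance01 cutover approval": 2,
    "Core Hub production cutover": 3,
    "OpenBao/identity migration approval": 7,
    "coulombcore phoenix approval": 8,
}


def waves_awaiting_approval(approved):
    """Waves still blocked because their human gate is not approved."""
    return sorted(
        wave for gate, wave in HUMAN_GATES.items()
        if gate not in approved
    )


print(waves_awaiting_approval({"Forgejo T02 production decisions"}))
```

Waves 4, 5, and 6 have no entry in the table, so they never appear in the result.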