codex e8a7f49bde Record ADR-004 in-cluster Forgejo runner decision for T04

Updates forgejo-production-decisions and CUST-WP-0054-T04 partial progress.

2026-07-03 22:29:28 +02:00

id	type	title	domain	repo	status	owner	topic_slug	planning_priority	planning_order	created	updated	state_hub_workstream_id
CUST-WP-0054	workplan	Workstation Independence and Fleet Role Realignment	infotech	the-custodian	active	codex	custodian	high	54	2026-07-03	2026-07-03	8a828444-dd49-4d7b-a2d1-9952b5bc929d

CUST-WP-0054 - Workstation Independence and Fleet Role Realignment

Goal

Remove every runtime dependency and always-on workload from the workstation, realign machine roles, and make the production loop run uninterrupted with the workstation powered off:

Machine	Target role
railiance01	Production workloads (the first node of the growing Railiance fleet).
coulombcore	Early/legacy/experimental workloads only; frozen for new production; eventually decommissioned and reborn as railiance02 (a machine-scale phoenix rotation).
workstation	Local, temporary dev environment to build and evolve repos — nothing depends on it being on.

The acceptance proof for the whole plan is T10: production runs for 24h+ with the workstation offline.

Current-State Findings (2026-07-03 inventory)

These are the concrete dependencies this plan removes:

The workstation is a production network hub. All 16 ops-bridge tunnels originate here. Two production paths chain through it:
- State Hub primary: everything reaches the cluster hub via the workstation's state-hub-primary forward tunnel, and railiance01's automations reach it via state-hub-railiance01 (reverse) → workstation → cluster.
- issue-core emission: activity-core (railiance01) → workstation → issue-core (coulombcore). A workstation reboot breaks daily triage evidence, consistency sweeps, and task emission.
Production landed on the condemned machine. The 2026-07-02/03 cutovers correctly moved State Hub and Core Hub off the workstation — but onto coulombcore (State Hub cluster primary, Core Hub behind hub.coulomb.social, issue-core, OpenBao, ArgoCD, ESO, the identity stack, and Gitea all run there). Under the new roles these are production workloads on a legacy machine.
Source of truth for code is on the condemned machine. Every repo's canonical remote is gitea.coulomb.social (coulombcore). The Forgejo production migration (RAIL-HO-WP-0005) already targets railiance01.
The consistency sweep and daily triage depend on workstation repo checkouts: the /home/worsch/* clones are what fix-consistency writes back to, and the triage working-memory sink is bound to the workstation path /home/worsch/the-custodian/memory/working.
Images are built and published from the workstation (docker build + push by hand); there is no CI runner, so releases require this machine.
WSL2 State Hub fallback still exists on the workstation (CUST-WP-0011-T08/T09 stabilization window) — expected, retired by that workplan, not this one.

Deployment Strategy (proposed — improves on the rough beachhead idea)

A. Files-first beachhead, not database replication. ADR-001 already declares that the State Hub is a read model rebuildable from repo files. The dev-environment story should therefore be: the make dev-hub command starts a local ephemeral hub (compose: postgres + API + MCP, exactly what exists today) whose content is rebuilt from the local repo checkouts via the C-06 registration path, not synced from the fleet database. Offline writes (progress events, task status) accumulate in a write buffer and relay upstream on reconnect (STATE-WP-0068 offline-write-buffer-and-edge-relay is the existing seed for this). The two hubs never replicate with each other; the repo files reconcile them. This makes the workstation genuinely temporary: clone repos → make dev-hub → work offline → push + relay when connected.

B. Fleet mesh instead of workstation-hub networking. Machine-to-machine paths (railiance01 ↔ coulombcore/railiance02) get direct, persistent links owned by the machines themselves — ops-bridge units running on the fleet machines under atm- actor certs (ops-warden), or a minimal WireGuard mesh if tunnel count grows. The workstation becomes a mesh client when present, never a relay.

C. Promotion-gated production, ThreePhoenix ideas incrementally. A workload counts as "production on railiance01" only when it conforms to the already-finished staged-promotion contract (RAIL-BS-WP-0006: overlay repo, railiance/app.toml, stage commands, rollback). ThreePhoenix ideas adopted now, without waiting for three nodes:

CNPG-managed Postgres per workload (already proven on coulombcore);
greenfield rebuild automation (NET-WP-0020 unseal + bootstrap chain) as the standing "phoenix drill" for single machines;
phoenix rotation at machine scale: the coulombcore → railiance02 rebirth is the first full phoenix and should be executed with the same automation that a future 3-node weekly rotation would use;
Longhorn distributed storage and PG streaming HA remain deferred until a third node exists (RAIL-BS-WP-0007 / CUST-WP-0038 stay the follow-on).

D. Decommission by attrition, not by big-bang. coulombcore is frozen for new production immediately (policy), drained workload-by-workload via staged promotion onto railiance01, and only rebuilt as railiance02 when the last production dependency (likely identity/OpenBao) has moved.

Task: Target architecture and dependency register

id: CUST-WP-0054-T01
status: done
priority: high
state_hub_task_id: "67b91b18-9ad0-4917-990a-056a7007a2d4"

Write the canon-adjacent architecture note (docs/ first; promote to canon/architecture/ after review) fixing the three machine roles, the fleet mesh topology, the promotion gate for "production", and the phoenix path coulombcore → railiance02. Include the full dependency register: every workload, tunnel, repo remote, sink path, and build pipeline with its current host and target host. Done when every row has a target and a migration owner (this plan's task or an existing workplan reference).
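
The "done when" rule for the register can be checked mechanically. The following is a minimal sketch, assuming the register is available as rows with a target host and a migration owner; the RegisterRow type, its field names, and the sample rows are illustrative and not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegisterRow:
    """One dependency-register row (field names are hypothetical)."""
    item: str                        # workload, tunnel, repo remote, sink path, or build pipeline
    current_host: str
    target_host: Optional[str]       # None or "" means not yet decided
    migration_owner: Optional[str]   # a task in this plan or an existing workplan reference


def incomplete_rows(rows: list[RegisterRow]) -> list[str]:
    """Return the items that still lack a target host or a migration owner."""
    return [
        row.item
        for row in rows
        if not (row.target_host and row.migration_owner)
    ]


# Sample rows for illustration only; the missing target is made up for the demo.
register = [
    RegisterRow("State Hub primary", "coulombcore", "railiance01", "CUST-WP-0054-T05"),
    RegisterRow("canonical git remote", "coulombcore", "railiance01", "RAIL-HO-WP-0005"),
    RegisterRow("triage working-memory sink", "workstation", None, "CUST-WP-0054-T06"),
]

missing = incomplete_rows(register)
print(missing)  # ['triage working-memory sink']
```

An empty result would correspond to the T01 completion criterion.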

Task: De-hub the network — fleet-owned direct tunnels

id: CUST-WP-0054-T02
status: done
priority: high
state_hub_task_id: "4f2ae1f1-f9ad-44bb-bae7-151030634f56"

Remove the workstation from all production data paths:

Run ops-bridge (or systemd ssh units) on railiance01 with atm- actor certs for the two live cross-machine lanes: railiance01 → coulombcore issue-core, and railiance01 → cluster State Hub (replacing the workstation-chained state-hub-railiance01 + state-hub-primary pair).
Re-point actcore-state-hub-bridge and actcore-issue-core-bridge at the machine-local tunnel ports.
Workstation tunnels remain only for interactive dev access (k3s API, hub client) and may drop at any time without production impact.
Evaluate WireGuard mesh as the successor if unit count exceeds ~5.

Done when killing every workstation tunnel leaves triage, sweeps, and emission working (partial T10 rehearsal).

Task: Production placement plan and freeze policy

id: CUST-WP-0054-T03
status: done
priority: high
state_hub_task_id: "70a25fbd-71d7-4d74-a04b-30e775984feb"

Declare coulombcore frozen for new production (policy note in canon). Produce the drain sequence with per-workload target and method (staged-promotion overlay or documented exception): State Hub, Core Hub, issue-core, OpenBao, identity stack (KeyCape/Authelia/privacyIDEA/lldap), ESO/ArgoCD control plane, Gitea/Forgejo, CNPG databases. Explicitly order them by coupling (forge and State Hub early; identity + OpenBao last, since everything authenticates through them).

Task: Forge to railiance01 + CI runners (kill workstation builds)

id: CUST-WP-0054-T04
status: progress
priority: high
state_hub_task_id: "79b9ee4d-f792-434c-a2ea-2fe216a948ca"

Execute/absorb RAIL-HO-WP-0005: Forgejo production on railiance01 becomes the canonical remote for all repos; coulombcore Gitea becomes a read-only mirror until decommission. Stand up Actions runners so container images (state-hub, core-hub, issue-core, activity-core) build and push in CI from tags — the workstation stops being the build/publish host.

Partial (2026-07-03): ADR-004 in-cluster runner (railiance01-build-01 + DinD) replaces interim coulombcore host runner. Remaining: image-build workflow on runner, repo migration, release with workstation off.

Task: State Hub production home on railiance01

id: CUST-WP-0054-T05
status: todo
priority: high
state_hub_task_id: "e91db8d0-973d-4a31-b3c2-ca37fd002ec7"

Move the State Hub primary from coulombcore to railiance01 using the proven CUST-WP-0011-T07 playbook (freeze → exact-count restore → rewire). This makes the automation loop machine-local: activity-core, Temporal, and the hub share one machine, so daily triage and sweeps survive any other machine being down. Prereq: railiance01 CNPG + storage reviewed (T03). Also relocate the consistency-sweep repo checkouts to railiance01 (clones from the T04 forge) so file writebacks no longer touch workstation paths.

Task: Working-memory and sink path decoupling

id: CUST-WP-0054-T06
status: todo
priority: medium
state_hub_task_id: "f2c5dd4b-9af4-4e8c-8619-6814e7d1666e"

Remove /home/worsch/... path identities from runtime contracts: the triage working-memory sink and any prompt/context paths become repo-relative or PVC-native with a defined sync-to-repo step (commit via the sweep). Done when no ActivityDefinition or sink references a workstation-specific absolute path.
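
The completion check for this task could be scripted along these lines. This is a sketch only: it assumes that "workstation-specific absolute path" means anything under /home/<user>/ and that definitions live in text files with the listed suffixes. The function names and the file layout are hypothetical.

```python
import re
from pathlib import Path

# Assumption: workstation-specific paths are absolute paths under /home/<user>/.
WORKSTATION_PATH = re.compile(r"/home/[A-Za-z0-9_.-]+/\S*")


def find_workstation_paths(text: str) -> list[str]:
    """Return every /home/<user>/... path found in the given text."""
    return WORKSTATION_PATH.findall(text)


def scan_tree(root: Path, suffixes=(".yaml", ".yml", ".json", ".toml")) -> dict[str, list[str]]:
    """Map each definition file under root to the workstation paths it contains."""
    hits: dict[str, list[str]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            found = find_workstation_paths(path.read_text(encoding="utf-8", errors="replace"))
            if found:
                hits[str(path)] = found
    return hits


sample = "sink: /home/worsch/the-custodian/memory/working\nprompt: memory/working/prompt.md\n"
print(find_workstation_paths(sample))
# ['/home/worsch/the-custodian/memory/working']
```

Running scan_tree over the directory that holds the ActivityDefinitions and sink configs and getting an empty mapping would match the stated criterion; repo-relative paths such as the second line of the sample are not flagged.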

Task: Dev-environment beachhead (files-first)

id: CUST-WP-0054-T07
status: todo
priority: high
state_hub_task_id: "0eaf1961-a4e7-459e-b710-3e72042cdf50"

Implement strategy A: make dev-hub (or custodian dev up) starts the local compose hub, registers the locally cloned repos, and rebuilds workplan/task state from files via C-06 — no fleet connection required. Implement the offline write buffer + reconnect relay for progress/task events (align with STATE-WP-0068; keep the buffer file-backed and idempotent). MCP config gains an explicit dev/fleet profile switch. Done when a fresh machine reaches a working, orientation-capable dev hub from git clone + one command, fully offline.
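
One possible shape for a file-backed, idempotent buffer is sketched below. It is not the STATE-WP-0068 design: the WriteBuffer class, the JSON-lines layout, the separate acknowledgement file, and the send callback are all assumptions made for illustration. Idempotency here relies on each event carrying a unique id that the upstream side is assumed to deduplicate on.

```python
import json
import tempfile
import uuid
from pathlib import Path
from typing import Callable


class WriteBuffer:
    """Append-only JSON-lines buffer; each event carries a unique id."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.acked_path = path.with_suffix(".acked")

    def append(self, kind: str, payload: dict) -> str:
        """Record an offline write and return its event id."""
        event = {"id": str(uuid.uuid4()), "kind": kind, "payload": payload}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
        return event["id"]

    def _acked(self) -> set[str]:
        """Ids that were already delivered in an earlier relay pass."""
        if not self.acked_path.exists():
            return set()
        return set(self.acked_path.read_text(encoding="utf-8").split())

    def relay(self, send: Callable[[dict], bool]) -> int:
        """Send every unacknowledged event in order; return how many were delivered.

        `send` stands in for the upstream call and returns True on success.
        A crash between send and ack only causes a resend of the same id.
        """
        if not self.path.exists():
            return 0
        acked = self._acked()
        delivered = 0
        for line in self.path.read_text(encoding="utf-8").splitlines():
            event = json.loads(line)
            if event["id"] in acked:
                continue
            if not send(event):
                break  # still offline: stop, keep order, retry on the next pass
            with self.acked_path.open("a", encoding="utf-8") as fh:
                fh.write(event["id"] + "\n")
            delivered += 1
        return delivered


received: list[str] = []


def upstream(event: dict) -> bool:
    """Stand-in for the fleet hub; always accepts."""
    received.append(event["id"])
    return True


with tempfile.TemporaryDirectory() as tmp:
    buf = WriteBuffer(Path(tmp) / "events.jsonl")
    buf.append("task_status", {"task": "CUST-WP-0054-T07", "status": "progress"})
    buf.append("progress", {"note": "worked offline"})
    first_pass = buf.relay(upstream)
    second_pass = buf.relay(upstream)

print(first_pass, second_pass)  # 2 0
```

The second relay pass delivers nothing because both ids are already acknowledged, which is the property the reconnect path needs.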

Task: ThreePhoenix increment — phoenix drill automation

id: CUST-WP-0054-T08
status: todo
priority: medium
state_hub_task_id: "ede6713e-8552-469c-bfe5-b17b015e1809"

Compose the existing pieces (NET-WP-0020 unseal automation, S1–S3 bootstrap chain, staged-promotion overlays, CNPG restore drills) into one rehearsable "phoenix a machine" runbook + automation entrypoint, proven on a disposable target (haskelseed or a VM). This is the tool the railiance02 rebirth and any future node rotation will use. Done when a greenfield machine reaches join-ready state unattended except for custody-gated steps.

Task: coulombcore decommission readiness → railiance02

id: CUST-WP-0054-T09
status: wait
priority: medium
state_hub_task_id: "c6b0d0a7-88c7-46f6-9d05-5d1078df3c8c"

Gated on T03–T05 drains reaching identity/OpenBao. Final inventory sweep, data archival (episodic memory of the machine's history), DNS/cert plan for *.coulomb.social names, then execute the machine phoenix via T08 automation: wipe, rebuild as railiance02, join the fleet. Longhorn/PG-HA (RAIL-BS-WP-0007, CUST-WP-0038) unlock once railiance01 + railiance02 are both fleet nodes.

Task: Workstation-off acceptance test

id: CUST-WP-0054-T10
status: wait
priority: high
state_hub_task_id: "cd6a31e8-c99e-4191-97d7-68d0389137b0"

The plan's proof: workstation fully offline for 24h+ (no tunnels, no processes). Verify afterwards from evidence alone: scheduled triage ran and validated, consistency sweeps ran, issue emission works, hub/API/dashboards served, forge and CI available. Then verify the inverse: workstation boots, custodian dev up gives a working offline dev hub, and reconnect relays buffered events. Done when both directions pass without manual repair.
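
The "verify from evidence alone" step could be reduced to a timestamp check like the sketch below. It assumes each check (triage, sweeps, emission) leaves timestamped evidence that can be collected into a mapping; the function name, the check names, and all timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def checks_without_evidence(evidence, window_start, window_end):
    """Return the check names that have no evidence timestamp inside the window."""
    return sorted(
        name
        for name, stamps in evidence.items()
        if not any(window_start <= ts <= window_end for ts in stamps)
    )


# Hypothetical 24h offline window and evidence timestamps.
start = datetime(2026, 1, 1, 8, 0, tzinfo=timezone.utc)
end = start + timedelta(hours=24)
evidence = {
    "triage": [start + timedelta(hours=3)],
    "consistency_sweep": [start + timedelta(hours=5)],
    "issue_emission": [start - timedelta(hours=1)],  # ran only before the window
}

print(checks_without_evidence(evidence, start, end))  # ['issue_emission']
```

An empty list would indicate that every listed check produced evidence while the workstation was offline.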

Sequencing

T01 (architecture + register)
 ├─ T02 de-hub network        ── unblocks most of T10's first half
 ├─ T03 placement plan/freeze ─┬─ T04 forge + CI ─┐
 │                             └─ T05 hub home    ├─ T09 decommission → railiance02
 ├─ T06 sink decoupling                           │
 ├─ T07 dev beachhead                             │
 └─ T08 phoenix drill ────────────────────────────┘
                                        T10 acceptance (both halves)
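
The diagram above can also be read as a dependency graph. The sketch below encodes one reading of it and groups the tasks into waves that could run in parallel. The edges for T01 through T09 are read off the diagram; placing T10 after every other task is an assumption, since the diagram only draws it last.

```python
from graphlib import TopologicalSorter

# predecessors[task] = tasks that must finish first (as read off the diagram).
predecessors = {
    "T01": set(),
    "T02": {"T01"},
    "T03": {"T01"},
    "T04": {"T03"},
    "T05": {"T03"},
    "T06": {"T01"},
    "T07": {"T01"},
    "T08": {"T01"},
    "T09": {"T04", "T05", "T08"},
}
# Assumption: the acceptance test runs after every other task.
predecessors["T10"] = set(predecessors)


def ready_waves(graph):
    """Group tasks into waves; every task in a wave has all predecessors done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        wave = sorted(sorter.get_ready())
        waves.append(wave)
        sorter.done(*wave)
    return waves


waves = ready_waves(predecessors)
for wave in waves:
    print(wave)
# ['T01']
# ['T02', 'T03', 'T06', 'T07', 'T08']
# ['T04', 'T05']
# ['T09']
# ['T10']
```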

Relationship to existing workplans

Absorbs the sequencing of RAIL-HO-WP-0005 (Forgejo) and CUST-WP-0011-T08/T09 (WSL2 fallback retirement folds into T10).
Builds on finished RAIL-BS-WP-0006 (staged promotion) and NET-WP-0020 (unseal automation).
Defers to RAIL-BS-WP-0007 / CUST-WP-0038 for true 3-node HA once railiance02 exists.
Extends STATE-WP-0068 (offline write buffer) as the T07 relay mechanism.
CORE-WP-0005 stabilization + CORE-WP-0007 Haskell retirement proceed independently; Core Hub simply appears in the T03 drain sequence.
