the-custodian/workplans/CUST-WP-0054-workstation-independence-and-fleet-realignment.md
codex a912679675 Document glas-harness Forgejo migration pilot routing
Records what works (SSH forgejo-remote, CI smoke, HTTPS mirror) and what is
still blocked before state-hub cutover. Updates CUST-WP-0054-T04 progress.
2026-07-04 00:59:53 +02:00


id: CUST-WP-0054
type: workplan
title: Workstation Independence and Fleet Role Realignment
domain: infotech
repo: the-custodian
status: active
owner: codex
topic_slug: custodian
planning_priority: high
planning_order: 54
created: 2026-07-03
updated: 2026-07-03
state_hub_workstream_id: 8a828444-dd49-4d7b-a2d1-9952b5bc929d

CUST-WP-0054 - Workstation Independence and Fleet Role Realignment

Goal

Remove every runtime dependency and always-on workload from the workstation, realign machine roles, and make the production loop run uninterrupted with the workstation powered off:

| Machine | Target role |
| --- | --- |
| railiance01 | Production workloads (the first node of the growing Railiance fleet). |
| coulombcore | Early/legacy/experimental workloads only; frozen for new production; eventually decommissioned and reborn as railiance02 (a machine-scale phoenix rotation). |
| workstation | Local, temporary dev environment to build and evolve repos; nothing depends on it being on. |

The acceptance proof for the whole plan is T10: production runs for 24h+ with the workstation offline.

Current-State Findings (2026-07-03 inventory)

These are the concrete dependencies this plan removes:

  1. The workstation is a production network hub. All 16 ops-bridge tunnels originate here. Two production paths chain through it:
    • State Hub primary: every client reaches the cluster hub through the workstation's state-hub-primary forward tunnel, and railiance01's automations reach it through state-hub-railiance01 (reverse) → workstation → cluster.
    • issue-core emission: activity-core (railiance01) → workstation → issue-core (coulombcore). A workstation reboot breaks daily triage evidence, consistency sweeps, and task emission.
  2. Production landed on the condemned machine. The 2026-07-02/03 cutovers correctly moved State Hub and Core Hub off the workstation — but onto coulombcore (State Hub cluster primary, Core Hub behind hub.coulomb.social, issue-core, OpenBao, ArgoCD, ESO, the identity stack, and Gitea all run there). Under the new roles these are production workloads on a legacy machine.
  3. Source of truth for code is on the condemned machine. Every repo's canonical remote is gitea.coulomb.social (coulombcore). The Forgejo production migration (RAIL-HO-WP-0005) already targets railiance01.
  4. The consistency sweep and daily triage depend on workstation repo checkouts: fix-consistency writes back to the /home/worsch/* clones, and the triage working-memory sink is identified by the path /home/worsch/the-custodian/memory/working.
  5. Images are built and published from the workstation (docker build + push by hand); there is no CI runner, so releases require this machine.
  6. WSL2 State Hub fallback still exists on the workstation (CUST-WP-0011-T08/T09 stabilization window) — expected, retired by that workplan, not this one.

Deployment Strategy (proposed — improves on the rough beachhead idea)

A. Files-first beachhead, not database replication. ADR-001 already declares that the State Hub is a read model rebuildable from repo files. So the dev-environment story should be: make dev-hub starts a local ephemeral hub (compose: postgres + API + MCP, exactly what exists today) whose content is rebuilt from the local repo checkouts via the C-06 registration path — not synced from the fleet database. Offline writes (progress events, task status) accumulate in a write buffer and relay upstream on reconnect (STATE-WP-0068 offline-write-buffer-and-edge-relay is the existing seed for this). Two hubs never replicate; files reconcile them. This makes the workstation genuinely temporary: clone repos → make dev-hub → work offline → push + relay when connected.
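The files-first rebuild can be sketched as follows. This is a minimal illustration, not the C-06 registration path itself: it assumes workplan files carry task metadata in the `id:` / `status:` / `priority:` / `state_hub_task_id:` form used in this workplan, and the function names and the `*/workplans/*.md` layout are hypothetical.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumption: task ids end in -T<number>, as in CUST-WP-0054-T01.
TASK_ID = re.compile(r"^id:\s*(\S+-T\d+)\s*$")
FIELD = re.compile(r"^(status|priority|state_hub_task_id):\s*(.+?)\s*$")


def parse_tasks(text: str) -> dict[str, dict[str, str]]:
    """Collect task metadata blocks from one workplan, keyed by task id."""
    tasks: dict[str, dict[str, str]] = {}
    current: dict[str, str] | None = None
    for raw in text.splitlines():
        line = raw.strip()
        match = TASK_ID.match(line)
        if match:
            current = tasks.setdefault(match.group(1), {})
            continue
        match = FIELD.match(line)
        if match and current is not None:
            current[match.group(1)] = match.group(2).strip('"')
        else:
            current = None  # any other line ends the metadata block
    return tasks


def rebuild_from_checkouts(root: Path) -> dict[str, dict[str, str]]:
    """Rebuild task state from local clones only; no fleet database involved."""
    state: dict[str, dict[str, str]] = {}
    for path in sorted(root.glob("*/workplans/*.md")):
        state.update(parse_tasks(path.read_text(encoding="utf-8")))
    return state
```

Because the result is derived entirely from files, running it on two machines with the same checkouts yields the same state, which is the property that lets files reconcile the hubs instead of database replication.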

B. Fleet mesh instead of workstation-hub networking. Machine-to-machine paths (railiance01 ↔ coulombcore/railiance02) get direct, persistent links owned by the machines themselves — ops-bridge units running on the fleet machines under atm- actor certs (ops-warden), or a minimal WireGuard mesh if tunnel count grows. The workstation becomes a mesh client when present, never a relay.

C. Promotion-gated production, ThreePhoenix ideas incrementally. A workload counts as "production on railiance01" only when it conforms to the already-finished staged-promotion contract (RAIL-BS-WP-0006: overlay repo, railiance/app.toml, stage commands, rollback). ThreePhoenix ideas adopted now, without waiting for three nodes:

  • CNPG-managed Postgres per workload (already proven on coulombcore);
  • greenfield rebuild automation (NET-WP-0020 unseal + bootstrap chain) as the standing "phoenix drill" for single machines;
  • phoenix rotation at machine scale: the coulombcore → railiance02 rebirth is the first full phoenix and should be executed with the same automation that a future 3-node weekly rotation would use;
  • Longhorn distributed storage and PG streaming HA remain deferred until a third node exists (RAIL-BS-WP-0007 / CUST-WP-0038 stay the follow-on).

D. Decommission by attrition, not by big-bang. coulombcore is frozen for new production immediately (policy), drained workload-by-workload via staged promotion onto railiance01, and only rebuilt as railiance02 when the last production dependency (likely identity/OpenBao) has moved.

Task: Target architecture and dependency register

id: CUST-WP-0054-T01
status: done
priority: high
state_hub_task_id: "67b91b18-9ad0-4917-990a-056a7007a2d4"

Write the canon-adjacent architecture note (docs/ first; promote to canon/architecture/ after review) fixing the three machine roles, the fleet mesh topology, the promotion gate for "production", and the phoenix path coulombcore → railiance02. Include the full dependency register: every workload, tunnel, repo remote, sink path, and build pipeline with its current host and target host. Done when every row has a target and a migration owner (this plan's task or an existing workplan reference).
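The done-criterion for the register can be checked mechanically. The sketch below is one possible shape for a register row, assuming a simple in-memory list; the class, field names, and sample rows are illustrative and not taken from the actual register.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterRow:
    item: str
    kind: str  # workload | tunnel | repo remote | sink path | build pipeline
    current_host: str
    target_host: str | None = None
    migration_owner: str | None = None  # a task of this plan or another workplan


def incomplete_rows(rows: list[RegisterRow]) -> list[RegisterRow]:
    """Rows that still lack a target host or a migration owner."""
    return [r for r in rows if not r.target_host or not r.migration_owner]


# Illustrative rows only; the last one is deliberately left incomplete.
SAMPLE = [
    RegisterRow("canonical git remote", "repo remote", "coulombcore",
                "railiance01", "RAIL-HO-WP-0005"),
    RegisterRow("image build + push", "build pipeline", "workstation",
                "railiance01", "CUST-WP-0054-T04"),
    RegisterRow("triage working-memory sink", "sink path", "workstation"),
]
```

T01 counts as done under this check when `incomplete_rows` returns an empty list for the full register.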

Task: De-hub the network — fleet-owned direct tunnels

id: CUST-WP-0054-T02
status: done
priority: high
state_hub_task_id: "4f2ae1f1-f9ad-44bb-bae7-151030634f56"

Remove the workstation from all production data paths:

  • Run ops-bridge (or systemd ssh units) on railiance01 with atm- actor certs for the two live cross-machine lanes: railiance01 → coulombcore issue-core, and railiance01 → cluster State Hub (replacing the workstation-chained state-hub-railiance01 + state-hub-primary pair).
  • Re-point actcore-state-hub-bridge and actcore-issue-core-bridge at the machine-local tunnel ports.
  • Workstation tunnels remain only for interactive dev access (k3s API, hub client) and may drop at any time without production impact.
  • Evaluate WireGuard mesh as the successor if unit count exceeds ~5.

Done when killing every workstation tunnel leaves triage, sweeps, and emission working (partial T10 rehearsal).
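Before killing tunnels for real, the same condition can be checked on paper against the dependency register. The following sketch assumes each data path is recorded as an ordered list of hops; the path names and the helper are hypothetical.

```python
def relayed_through(paths: dict[str, list[str]], host: str = "workstation") -> list[str]:
    """Names of paths where `host` appears as an intermediate hop.

    A path that merely starts or ends at the host (interactive dev access)
    is not reported, since nothing in production depends on it.
    """
    return sorted(name for name, hops in paths.items() if host in hops[1:-1])


# Hop lists as described in the current-state findings (names are illustrative).
BEFORE = {
    "issue-core emission": ["railiance01", "workstation", "coulombcore"],
    "state-hub from railiance01": ["railiance01", "workstation", "cluster"],
    "dev k3s API": ["workstation", "cluster"],
}

# Target state after this task: direct machine-to-machine lanes.
AFTER = {
    "issue-core emission": ["railiance01", "coulombcore"],
    "state-hub from railiance01": ["railiance01", "cluster"],
    "dev k3s API": ["workstation", "cluster"],
}
```

An empty result for the full register is a necessary condition for the rehearsal to pass, though only the live test proves it.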

Task: Production placement plan and freeze policy

id: CUST-WP-0054-T03
status: done
priority: high
state_hub_task_id: "70a25fbd-71d7-4d74-a04b-30e775984feb"

Declare coulombcore frozen for new production (policy note in canon). Produce the drain sequence with per-workload target and method (staged-promotion overlay or documented exception): State Hub, Core Hub, issue-core, OpenBao, identity stack (KeyCape/Authelia/privacyIDEA/lldap), ESO/ArgoCD control plane, Gitea/Forgejo, CNPG databases. Explicitly order them by coupling (forge and State Hub early; identity + OpenBao last, since everything authenticates through them).
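Ordering by coupling is a topological sort over "must be drained before" edges. The sketch below encodes only the one constraint stated above (identity and OpenBao move last because everything authenticates through them); the function name is hypothetical, and any further edges would come from the dependency register.

```python
from graphlib import TopologicalSorter

WORKLOADS = [
    "Gitea/Forgejo",
    "State Hub",
    "Core Hub",
    "issue-core",
    "CNPG databases",
    "ESO/ArgoCD control plane",
    "identity stack",
    "OpenBao",
]

# Everything authenticates through these, so they are drained after all others.
AUTH_ROOTS = {"identity stack", "OpenBao"}


def drain_order(workloads: list[str], drain_last: set[str]) -> list[str]:
    """Return a drain sequence in which `drain_last` workloads come after the rest."""
    sorter = TopologicalSorter()
    earlier = [w for w in workloads if w not in drain_last]
    for workload in workloads:
        if workload in drain_last:
            # Predecessors are drained before this node.
            sorter.add(workload, *earlier)
        else:
            sorter.add(workload)
    return list(sorter.static_order())
```

The relative order of unconstrained workloads is not guaranteed by the sort; placing the forge and State Hub early would need explicit edges or a manual choice within the first tier.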

Task: Forge to railiance01 + CI runners (kill workstation builds)

id: CUST-WP-0054-T04
status: progress
priority: high
state_hub_task_id: "79b9ee4d-f792-434c-a2ea-2fe216a948ca"

Execute/absorb RAIL-HO-WP-0005: Forgejo production on railiance01 becomes the canonical remote for all repos; coulombcore Gitea becomes a read-only mirror until decommission. Stand up Actions runners so container images (state-hub, core-hub, issue-core, activity-core) build and push in CI from tags — the workstation stops being the build/publish host.

Partial (2026-07-03): The ADR-004 in-cluster runner (railiance01-build-01 + DinD) replaces the interim coulombcore host runner, which is now disabled. Org Actions secrets (REGISTRY_USER, REGISTRY_TOKEN) are set. The coulomb/forgejo-actions-probe image-build workflow builds and pushes to forgejo.coulomb.social/coulomb/forgejo-actions-probe (static docker-cli + DinD; actions/checkout@v4 is not used, the job runs git clone instead).

Pilot (2026-07-03): coulomb/glas-harness migrated to Forgejo (origin=forgejo-remote); CI smoke is green on host and container. See docs/forgejo-repo-migration-pilot-glas-harness.md.

Remaining: promote more repos using the pilot routing; then migrate the production set (state-hub, core-hub, issue-core, activity-core); then cut a release with the workstation off.

Task: State Hub production home on railiance01

id: CUST-WP-0054-T05
status: todo
priority: high
state_hub_task_id: "e91db8d0-973d-4a31-b3c2-ca37fd002ec7"

Move the State Hub primary from coulombcore to railiance01 using the proven CUST-WP-0011-T07 playbook (freeze → exact-count restore → rewire). This makes the automation loop machine-local: activity-core, Temporal, and the hub share one machine, so daily triage and sweeps survive any other machine being down. Prereq: railiance01 CNPG + storage reviewed (T03). Also relocate the consistency-sweep repo checkouts to railiance01 (clones from the T04 forge) so file writebacks no longer touch workstation paths.

Task: Working-memory and sink path decoupling

id: CUST-WP-0054-T06
status: todo
priority: medium
state_hub_task_id: "f2c5dd4b-9af4-4e8c-8619-6814e7d1666e"

Remove /home/worsch/... path identities from runtime contracts: the triage working-memory sink and any prompt/context paths become repo-relative or PVC-native with a defined sync-to-repo step (commit via the sweep). Done when no ActivityDefinition or sink references a workstation-specific absolute path.
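The done-criterion lends itself to a simple scan. This sketch assumes the ActivityDefinitions and sink configs are available as text files under one directory; the file patterns and the function name are assumptions, not part of the existing tooling.

```python
import re
from pathlib import Path

# Any absolute path under the workstation user's home directory.
WORKSTATION_PATH = re.compile(r"/home/worsch/[^\s\"']*")


def find_workstation_paths(root: Path, patterns=("*.yaml", "*.yml", "*.json", "*.md")):
    """Return (file, line number, matched path) for every workstation path found."""
    hits = []
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                for match in WORKSTATION_PATH.finditer(line):
                    hits.append((path, lineno, match.group(0)))
    return hits
```

An empty result over the runtime contracts would satisfy the criterion; run as part of the sweep, the same scan could keep new workstation paths from creeping back in.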

Task: Dev-environment beachhead (files-first)

id: CUST-WP-0054-T07
status: todo
priority: high
state_hub_task_id: "0eaf1961-a4e7-459e-b710-3e72042cdf50"

Implement strategy A: make dev-hub (or custodian dev up) starts the local compose hub, registers the locally cloned repos, and rebuilds workplan/task state from files via C-06 — no fleet connection required. Implement the offline write buffer + reconnect relay for progress/task events (align with STATE-WP-0068; keep the buffer file-backed and idempotent). MCP config gains an explicit dev/fleet profile switch. Done when a fresh machine reaches a working, orientation-capable dev hub from git clone + one command, fully offline.
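A file-backed, idempotent buffer can be as small as an append-only JSONL file. The sketch below is one possible shape, not the STATE-WP-0068 design: the class, the event fields, and the `send` callback are hypothetical, and it assumes the upstream hub deduplicates on `event_id`.

```python
import json
import uuid
from pathlib import Path


class WriteBuffer:
    """Append-only JSONL buffer; every event carries a client-generated id."""

    def __init__(self, path: Path):
        self.path = path

    def append(self, kind: str, payload: dict) -> str:
        event = {"event_id": str(uuid.uuid4()), "kind": kind, "payload": payload}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
        return event["event_id"]

    def pending(self) -> list:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def relay(self, send) -> int:
        """Send buffered events in order; keep the ones that fail for the next try."""
        remaining, sent = [], 0
        for event in self.pending():
            try:
                send(event)
                sent += 1
            except OSError:  # includes ConnectionError and timeouts
                remaining.append(event)
        # Rewrite atomically so a crash never leaves a half-written buffer.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text("".join(json.dumps(e) + "\n" for e in remaining), encoding="utf-8")
        tmp.replace(self.path)
        return sent
```

If the process dies after `send` succeeds but before the rewrite, the event is sent again on the next relay; that is safe only because the receiver is assumed to ignore an `event_id` it has already seen.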

Task: ThreePhoenix increment — phoenix drill automation

id: CUST-WP-0054-T08
status: todo
priority: medium
state_hub_task_id: "ede6713e-8552-469c-bfe5-b17b015e1809"

Compose the existing pieces (NET-WP-0020 unseal automation, S1–S3 bootstrap chain, staged-promotion overlays, CNPG restore drills) into one rehearsable "phoenix a machine" runbook + automation entrypoint, proven on a disposable target (haskelseed or a VM). This is the tool the railiance02 rebirth and any future node rotation will use. Done when a greenfield machine reaches join-ready state unattended except for custody-gated steps.

Task: coulombcore decommission readiness → railiance02

id: CUST-WP-0054-T09
status: wait
priority: medium
state_hub_task_id: "c6b0d0a7-88c7-46f6-9d05-5d1078df3c8c"

Gated on T03–T05 drains reaching identity/OpenBao. Final inventory sweep, data archival (episodic memory of the machine's history), DNS/cert plan for *.coulomb.social names, then execute the machine phoenix via T08 automation: wipe, rebuild as railiance02, join the fleet. Longhorn/PG-HA (RAIL-BS-WP-0007, CUST-WP-0038) unlock once railiance01 + railiance02 are both fleet nodes.

Task: Workstation-off acceptance test

id: CUST-WP-0054-T10
status: wait
priority: high
state_hub_task_id: "cd6a31e8-c99e-4191-97d7-68d0389137b0"

The plan's proof: workstation fully offline for 24h+ (no tunnels, no processes). Verify afterwards from evidence alone: scheduled triage ran and validated, consistency sweeps ran, issue emission works, hub/API/dashboards served, forge and CI available. Then verify the inverse: workstation boots, custodian dev up gives a working offline dev hub, and reconnect relays buffered events. Done when both directions pass without manual repair.
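"Verify from evidence alone" can be made concrete for the scheduled parts of the loop. The sketch below assumes each check leaves timestamped evidence that can be collected afterwards; the check names and the function are hypothetical, and the availability checks (hub/API/dashboards, forge, CI) would need their own probes.

```python
from datetime import datetime, timedelta

# Scheduled activities that must have run while the workstation was off.
REQUIRED = ("triage", "consistency_sweep", "issue_emission")


def verify_offline_window(
    offline_start: datetime,
    offline_end: datetime,
    evidence: dict,
    min_duration: timedelta = timedelta(hours=24),
) -> list:
    """Return a list of failures; an empty list means the first half passes."""
    failures = []
    if offline_end - offline_start < min_duration:
        failures.append("offline window shorter than the required minimum")
    for check in REQUIRED:
        inside = [
            t for t in evidence.get(check, [])
            if offline_start <= t <= offline_end
        ]
        if not inside:
            failures.append(f"no {check} evidence inside the offline window")
    return failures
```

Requiring evidence timestamps to fall inside the window keeps a run that happened just before shutdown from counting as proof.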

Sequencing

T01 (architecture + register)
 ├─ T02 de-hub network        ── unblocks most of T10's first half
 ├─ T03 placement plan/freeze ─┬─ T04 forge + CI ─┐
 │                             └─ T05 hub home    ├─ T09 decommission → railiance02
 ├─ T06 sink decoupling                           │
 ├─ T07 dev beachhead                             │
 └─ T08 phoenix drill ────────────────────────────┘
                                        T10 acceptance (both halves)

Relationship to existing workplans

  • Absorbs the sequencing of RAIL-HO-WP-0005 (Forgejo) and CUST-WP-0011-T08/T09 (WSL2 fallback retirement folds into T10).
  • Builds on finished RAIL-BS-WP-0006 (staged promotion) and NET-WP-0020 (unseal automation).
  • Defers to RAIL-BS-WP-0007 / CUST-WP-0038 for true 3-node HA once railiance02 exists.
  • Extends STATE-WP-0068 (offline write buffer) as the T07 relay mechanism.
  • CORE-WP-0005 stabilization + CORE-WP-0007 Haskell retirement proceed independently; Core Hub simply appears in the T03 drain sequence.