ADR and Runbook artefacts

2026-03-27 00:16:09 +01:00
parent b19896a9a9
commit 29b84de13c
2 changed files with 262 additions and 0 deletions
--- a/canon/architecture/adr-004-connectivity-first-network-posture.md
+++ b/canon/architecture/adr-004-connectivity-first-network-posture.md
@@ -0,0 +1,143 @@
 ---
 id: ADR-004
 type: architecture-decision-record
 title: "Connectivity-First Network Posture for Custodian Infrastructure"
 status: accepted
 decided_by: Bernd Worsch
 date: "2026-03-26"
 tags: ["architecture", "network", "ops-bridge", "connectivity", "isolation", "security-posture"]
 ---
 # ADR-004: Connectivity-First Network Posture for Custodian Infrastructure
 ## Status
 Accepted.
 ## Context
 The Custodian infrastructure spans multiple machines: a primary workstation, a
 shared compute node (COULOMBCORE), and Railiance cluster nodes. Agents running
 on remote machines need to reach the state-hub API and MCP server, which live
 on the workstation. Human operators and agents also need to reach remote
 services (k3s API, Gitea, Temporal) from the workstation.
 Two network postures were considered for how these components communicate:
 **Option A — Connectivity-first:** Components are connected by default via
 controlled, observable access paths (reverse SSH tunnels managed by ops-bridge).
 Isolation is added selectively where there is a specific threat model or
 compliance reason to do so.
 **Option B — Isolation-first (zero-trust):** No component trusts any other by
 default. Every connection requires mutual authentication, short-lived
 credentials, and explicit authorisation at the point of use. Connectivity is
 earned, not assumed.
 This decision is architectural policy — it governs how ops-bridge tunnels are
 designed, how agent-to-hub communication works, and how new infrastructure
 components are onboarded.
 ## Decision
 **Connectivity-first, with isolation as a deliberate option.**
 The default posture for Custodian infrastructure is: components that need to
 work together are connected. Access paths are explicit, observable, and managed
 (via ops-bridge), but they are persistent by default rather than ephemeral.
 Isolation is introduced where there is a specific, articulated reason — not as
 a blanket policy applied uniformly.
 ## Rationale
 ### 1. Scale and team size
 The infrastructure is operated by a single human and a bounded set of
 automation agents. The overhead of zero-trust (credential rotation, SPIFFE/SPIRE
 identity, mTLS everywhere, short-lived tokens per request) is disproportionate
 for this scale. It would add significant operational complexity without
 a commensurate security return.
 ### 2. Observability over perimeter hardening
 The primary security control at this scale is **observability**: knowing what
 connected, when, from where, and what it did. ops-bridge provides this — every
 tunnel is named, actor-attributed, health-checked, and audited. A perimeter of
 invisible short-lived connections would actually reduce observability.
 ### 3. The threat model does not require zero-trust today
 The main threats are:
 - A runaway agent consuming resources (mitigated by nproc limits and memory cgroups)
 - A compromised workload reaching state-hub and corrupting state (mitigated by
  the read-model design of state-hub — write surface is narrow and sanctioned)
 - An external attacker reaching internal services (mitigated by the tunnels
  being reverse SSH — no inbound ports exposed)
 Zero-trust would address a different threat model: lateral movement between
 hostile tenants, or untrusted code running in the same environment as sensitive
 data. That is not the current situation.
 ### 4. Degrade-gracefully requires persistent connectivity
 The Custodian's foundational value of **local-first, degrade-gracefully**
 requires that agents can orient themselves even when some connections are slow
 or partially degraded. Ephemeral connectivity (zero-trust tokens, service mesh
 sidecars) introduces additional failure modes that conflict with graceful
 degradation. Persistent SSH tunnels with auto-reconnect are more resilient to
 intermittent conditions.
 ### 5. Isolation remains the right choice in specific cases
 Connectivity-first does not mean no isolation. The following cases call for
 explicit isolation and are handled separately:
 - **Tenant separation** (when/if multi-user or multi-org) — each tenant gets
  its own network segment
 - **Privileged execution** — CI runners and agent actions with write access to
  production systems run in ephemeral, isolated environments (per the
  Privileged Execution Control standard)
 - **Secrets** — credentials are never transmitted over tunnels in plaintext;
  they are age-encrypted at rest, with SOPS used for config
 ## Consequences
 ### Immediate
 - ops-bridge tunnels are **persistent** (max_attempts: 0, auto-reconnect) and
  are treated as infrastructure, not one-off connections
 - Agents on remote machines check tunnel health at session start and restore
  dropped tunnels before accessing state-hub (documented in global CLAUDE.md)
 - New infrastructure components are onboarded with a named tunnel entry in
  `~/.config/bridge/tunnels.yaml` — not ad-hoc SSH commands
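 The persistent-tunnel behaviour above can be sketched as a small supervision loop. This is an illustration only, not the ops-bridge implementation: the function name `keep_alive`, the `RETRY_DELAY` variable, and the reading of `max_attempts: 0` as "retry without limit" are assumptions.
 ```bash
 # Hypothetical sketch: re-run a command until it exits cleanly.
 # Assumption: max_attempts=0 means "retry without limit".
 keep_alive() {
   local max_attempts=$1; shift
   local attempt=0
   while :; do
     attempt=$((attempt + 1))
     "$@" && return 0                    # clean exit: stop supervising
     echo "attempt ${attempt} exited non-zero" >&2
     if [ "$max_attempts" -gt 0 ] && [ "$attempt" -ge "$max_attempts" ]; then
       return 1                          # bounded mode: give up
     fi
     sleep "${RETRY_DELAY:-5}"           # fixed pause before reconnecting
   done
 }
 ```
 A reverse tunnel could then be held open with something like `keep_alive 0 ssh -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -R 18000:localhost:8000 user@host`, where the ports and host are placeholders.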
 ### Deferred
 - If the infrastructure grows to multi-tenant or multi-operator, this decision
  should be revisited. At that point, isolation-first becomes proportionate.
 - If a security audit identifies a specific lateral movement risk, targeted
  isolation (network policy, mTLS for that service) is the response — not a
  wholesale posture change.
 ## Alternatives Rejected
 ### Zero-trust / isolation-first
 Rejected for current scale. The operational overhead (credential lifecycle,
 service mesh, mutual TLS) is disproportionate, observability would decrease,
 and the threat model does not require it. Noted for re-evaluation at
 multi-tenant scale.
 ### VPN (WireGuard / Tailscale)
 Considered briefly. VPN would solve the connectivity problem but introduces
 a persistent network layer that all traffic traverses, reducing the
 explicitness of individual access paths. ops-bridge tunnels are per-service
 and per-actor, which gives better observability and blast-radius control.
 VPN is not ruled out as a future complement but is not the primary approach.
 ### Ad-hoc SSH (no ops-bridge)
 The pre-ops-bridge approach. Rejected because it has no health checks, no
 actor attribution, no audit log, and requires manual intervention to restore.
 ops-bridge formalises the same SSH tunnel pattern with operational discipline.
--- a/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md
+++ b/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md
@@ -0,0 +1,119 @@
 ---
 title: "INC-002: COULOMBCORE node overload — runaway Claude Code agent"
 date: 2026-03-26
 severity: high
 status: resolved
 affected: gitea (http://92.205.130.254:32166), k3s API, SSH access
 environment: COULOMBCORE k3s cluster
 duration: ~15 minutes (detected ~21:00, SSH restored ~21:08 UTC)
 resolved_by: Bernd Worsch / Claude
 ---
 # INC-002: COULOMBCORE node overload — runaway Claude Code agent
 ## Summary
 The COULOMBCORE node (92.205.130.254) became completely unresponsive under extreme load
 generated by a runaway Claude Code agent process. Load average peaked at **417.43** (1m).
 99.8% of CPU time was spent in kernel mode (context switching). SSH connections timed out
 during banner exchange. k3s API was unreachable (TLS handshake timeout). Gitea was
 technically still running as a process but unable to serve requests. The node had no swap,
 so memory exhaustion amplified the impact — kswapd0 was consuming ~22% CPU trying to
 reclaim pages with nowhere to put them.
 ---
 ## Timeline
 | Time (approx UTC) | Event |
 |------|-------|
 | ~20:45 | Runaway Claude agent (PID 2457456, user tegwick) begins spawning hundreds of subprocesses |
 | ~21:00 | Load average passes 300; SSH banner exchange starts timing out |
 | ~21:00 | User attempts git operations; git repo service unreachable |
 | ~21:00 | Remote diagnosis begins via ops-bridge (state-hub reverse tunnel still alive) |
 | ~21:00 | k3s API confirmed unresponsive (TLS handshake timeout via local tunnel :16443) |
 | ~21:05 | User obtains console/VNC access via hosting provider |
 | ~21:05 | `top` output shared: load 417, 530 tasks (104 running), 34 zombies, 99.8% sy |
 | ~21:06 | `kill -9 2457456` + `kill -9 2579133` (stuck apport) executed via console |
 | ~21:08 | SSH accepting connections again (load 85, still declining) |
 | ~21:09 | kubectl connectivity restored; PostgreSQL HA nodes resyncing |
 | ~21:10 | Gitea accessible; incident resolved |
 ---
 ## Root Cause
 A Claude Code agent running on COULOMBCORE under the `tegwick` user (PID 2457456,
 VIRT 71.1GB) spawned approximately 500 child processes. The likely cause is an unbounded
 Ralph loop or parallel agent task expansion without a completion condition or iteration cap.
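 A loop with both of the missing controls, a completion condition and an iteration cap, might look like the sketch below. The name `run_bounded` is hypothetical and this is not the Ralph loop implementation; it only illustrates the shape of the guard.
 ```bash
 # Hypothetical sketch: repeat a step until it reports completion or the cap is hit.
 # Usage: run_bounded MAX_ITERATIONS COMMAND [ARGS...]
 run_bounded() {
   local max_iterations=$1; shift
   local i=1
   while [ "$i" -le "$max_iterations" ]; do
     if "$@"; then                       # completion condition: exit 0 means done
       echo "done after ${i} iteration(s)"
       return 0
     fi
     i=$((i + 1))
   done
   echo "iteration cap (${max_iterations}) reached without completion" >&2
   return 1
 }
 ```
 With a guard of this kind, a step that never reports completion stops after the cap instead of expanding indefinitely.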
 With no swap configured on a 3.9GB machine, the kernel had no reclaim target. kswapd0
 ran at ~22% CPU continuously. systemd was at ~17% CPU processing unit state changes for
 the constant process churn. The combination of ~500 tasks competing for 2 vCPUs with no
 memory headroom caused the runaway context switching that buried the node.
 **The ops-bridge reverse tunnels survived** because they were established before the
 overload began and require no new SSH connections to stay alive. This was the only
 out-of-band visibility channel available once SSH stopped accepting new connections.
 ---
 ## Impact
 - Gitea and git operations unavailable for ~15 minutes
 - SSH access to COULOMBCORE unavailable (required console)
 - k3s API unresponsive (no pod management possible)
 - PostgreSQL HA nodes 0 and 2 restarted under load (recovered on their own)
 - No data loss
 ---
 ## Resolution Steps
 ```bash
 # Via console/VNC — SSH was not available
 # 1. Identify the runaway process (top showed PID 2457456 at 71GB VIRT, 6.8% CPU)
 #    Indicator: massive VIRT, hundreds of children, 99.8% sy in top
 # 2. Kill the runaway agent and stuck crash reporter
 kill -9 2457456     # runaway claude process
 kill -9 2579133     # apport in D-state, consuming CPU
 # 3. Wait ~60s — load drops, SSH accepts connections
 # 4. Verify PostgreSQL HA recovery (may take 2-3 min to resync)
 kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'
 # 5. Verify Gitea
 curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/
 ```
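 Step 1 relied on reading `top` by eye. Where a shell is still usable, parent processes can be ranked by child count instead. The helper below is a sketch with a hypothetical name, `top_parents`; it only counts PPIDs fed to it on stdin.
 ```bash
 # Hypothetical helper: rank parent PIDs by number of child processes.
 # Reads one PPID per line on stdin; prints "COUNT PPID", largest count first.
 # Usage: top_parents [HOW_MANY]   (defaults to 5)
 top_parents() {
   awk '{ print $1 }' |                  # strip the padding that ps adds
     sort -n | uniq -c |                 # count occurrences of each PPID
     sort -k1,1nr -k2,2n |               # highest count first, ties by PID
     awk '{ print $1, $2 }' |
     head -n "${1:-5}"
 }
 ```
 On a live node with a procps-style `ps`, `ps -eo ppid= | top_parents 5` lists the five parents with the most children.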
 ---
 ## Follow-up Actions
 - [ ] Add swap to COULOMBCORE: `fallocate -l 4G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile && echo '/swapfile none swap sw 0 0' >> /etc/fstab`
 - [ ] Set nproc ulimit for tegwick user: `/etc/security/limits.conf` → `tegwick hard nproc 512`
 - [ ] Set memory limit on the tegwick user slice: `systemctl set-property "user-$(id -u tegwick).slice" MemoryMax=1G` (as root), or use a dedicated slice
 - [ ] Add cluster-wide pod health alerting (cron on COULOMBCORE) that catches any crashlooping pod, not just Gitea; see runbook "Robustness §5"
 - [ ] Ensure all Ralph loops on remote agents use `/ralph-workplan` (bounded, HEUREKA stop), never raw `/ralph-loop`
 - [ ] Consider adding a `bridge check` cron on workstation that alerts when node load > threshold via state-hub API
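 The load-threshold check in the last item could be as small as the following sketch. It reads the 1-minute load average from `/proc/loadavg` (Linux only). The function name `load_exceeds` and any threshold value are assumptions, and delivery of the alert via the state-hub API is left out because that API is not described here.
 ```bash
 # Hypothetical check: exit 0 when the 1-minute load average exceeds a threshold.
 # Usage: load_exceeds THRESHOLD [LOADAVG_FILE]
 load_exceeds() {
   local threshold=$1
   local file=${2:-/proc/loadavg}
   # Field 1 of /proc/loadavg is the 1-minute load average.
   awk -v t="$threshold" 'NR == 1 { found = ($1 > t) } END { exit !found }' "$file"
 }
 ```
 A cron entry could then run something like `load_exceeds 8 && echo "load high"`, with the threshold chosen relative to the node's 2 vCPUs.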
 ---
 ## Lessons Learned
 1. **No swap = amplified blast radius.** A machine with no swap has zero buffer between
   "memory pressure" and "complete kernel thrash". A 4GB swapfile costs little and buys
   significant time for intervention.
 2. **Reverse tunnels are the last line of visibility.** SSH and the k3s API both died.
   The state-hub reverse tunnel (established from COULOMBCORE outbound) survived and
   confirmed the node was alive. This was critical for triage.
 3. **Remote agents need hard resource ceilings.** A Claude Code agent that spawns
   subprocesses has no built-in rate limit. nproc + systemd MemoryMax on the user session
   are the right controls for this environment.
 4. **Console access is non-negotiable.** Once SSH dies, the only recovery path is the
   out-of-band (OOB) console. Ensure hosting provider console credentials are always accessible.
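 Lesson 3 is only effective if the ceiling is actually configured. A small audit helper, sketched below under the assumption that limits live in a single `limits.conf`-style file (`domain type item value` per line), can confirm that a hard nproc limit exists for a user. The name `hard_nproc_for` is hypothetical, and the helper ignores `/etc/security/limits.d/`, group entries, and wildcards.
 ```bash
 # Hypothetical audit helper: print the hard nproc limit configured for a user,
 # or exit non-zero if no such entry exists.
 # Usage: hard_nproc_for USER [LIMITS_FILE]
 hard_nproc_for() {
   local user=$1
   local file=${2:-/etc/security/limits.conf}
   awk -v u="$user" '
     /^[[:space:]]*#/ { next }                        # skip comment lines
     $1 == u && $2 == "hard" && $3 == "nproc" { v = $4 }
     END { if (v == "") exit 1; print v }
   ' "$file"
 }
 ```
 For example, `hard_nproc_for tegwick || echo "no hard nproc limit set"` flags a node where the follow-up action has not yet been applied.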