diff --git a/canon/architecture/adr-004-connectivity-first-network-posture.md b/canon/architecture/adr-004-connectivity-first-network-posture.md
new file mode 100644
index 0000000..d5190b9
--- /dev/null
+++ b/canon/architecture/adr-004-connectivity-first-network-posture.md
@@ -0,0 +1,143 @@
+---
+id: ADR-004
+type: architecture-decision-record
+title: "Connectivity-First Network Posture for Custodian Infrastructure"
+status: accepted
+decided_by: Bernd Worsch
+date: "2026-03-26"
+tags: ["architecture", "network", "ops-bridge", "connectivity", "isolation", "security-posture"]
+---
+
+# ADR-004: Connectivity-First Network Posture for Custodian Infrastructure
+
+## Status
+
+Accepted.
+
+## Context
+
+The Custodian infrastructure spans multiple machines: a primary workstation, a
+shared compute node (COULOMBCORE), and Railiance cluster nodes. Agents running
+on remote machines need to reach the state-hub API and MCP server, which live
+on the workstation. Human operators and agents also need to reach remote
+services (k3s API, Gitea, Temporal) from the workstation.
+
+Two network postures were considered for how these components communicate:
+
+**Option A — Connectivity-first:** Components are connected by default via
+controlled, observable access paths (reverse SSH tunnels managed by ops-bridge).
+Isolation is added selectively where there is a specific threat model or
+compliance reason to do so.
+
+**Option B — Isolation-first (zero-trust):** No component trusts any other by
+default. Every connection requires mutual authentication, short-lived
+credentials, and explicit authorisation at the point of use. Connectivity is
+earned, not assumed.
+
+This decision is architectural policy — it governs how ops-bridge tunnels are
+designed, how agent-to-hub communication works, and how new infrastructure
+components are onboarded.
+
+## Decision
+
+**Connectivity-first, with isolation as a deliberate option.**
+
+The default posture for Custodian infrastructure is: components that need to
+work together are connected. Access paths are explicit, observable, and managed
+(via ops-bridge), but they are persistent by default rather than ephemeral.
+Isolation is introduced where there is a specific, articulated reason — not as
+a blanket policy applied uniformly.
+
+## Rationale
+
+### 1. Scale and team size
+
+The infrastructure is operated by a single human and a bounded set of
+automation agents. The overhead of zero-trust (credential rotation, SPIFFE/SPIRE
+identity, mTLS everywhere, short-lived tokens per request) is disproportionate
+for this scale. It would consume significant operational complexity without
+a commensurate security return.
+
+### 2. Observability over perimeter hardening
+
+The primary security control at this scale is **observability**: knowing what
+connected, when, from where, and what it did. ops-bridge provides this — every
+tunnel is named, actor-attributed, health-checked, and audited. A perimeter of
+invisible short-lived connections would actually reduce observability.
+
+### 3. The threat model does not require zero-trust today
+
+The main threats are:
+- A runaway agent consuming resources (mitigated by nproc/memory cgroups)
+- A compromised workload reaching state-hub and corrupting state (mitigated by
+  the read-model design of state-hub — write surface is narrow and sanctioned)
+- An external attacker reaching internal services (mitigated by the tunnels
+  being reverse SSH — no inbound ports exposed)
+
+Zero-trust would address a different threat model: lateral movement between
+hostile tenants, or untrusted code running in the same environment as sensitive
+data. That is not the current situation.
+
+### 4. Degrade-gracefully requires persistent connectivity
+
+The Custodian's foundational value of **local-first, degrade-gracefully**
+requires that agents can orient themselves even when some connections are slow
+or partially degraded. Ephemeral connectivity (zero-trust tokens, service mesh
+sidecars) introduces additional failure modes that conflict with graceful
+degradation. Persistent SSH tunnels with auto-reconnect are more resilient to
+intermittent conditions.
+
+### 5. Isolation remains the right choice in specific cases
+
+Connectivity-first does not mean no isolation. The following cases call for
+explicit isolation and are handled separately:
+
+- **Tenant separation** (when/if multi-user or multi-org) — each tenant gets
+  its own network segment
+- **Privileged execution** — CI runners and agent actions with write access to
+  production systems run in ephemeral, isolated environments (per the
+  Privileged Execution Control standard)
+- **Secrets** — credentials are never transmitted over tunnels in plaintext;
+  age-encrypted at rest, SOPS for config
+
+## Consequences
+
+### Immediate
+
+- ops-bridge tunnels are **persistent** (max_attempts: 0, auto-reconnect) and
+  are treated as infrastructure, not one-off connections
+- Agents on remote machines check tunnel health at session start and restore
+  dropped tunnels before accessing state-hub (documented in global CLAUDE.md)
+- New infrastructure components are onboarded with a named tunnel entry in
+  `~/.config/bridge/tunnels.yaml` — not ad-hoc SSH commands
+
+### Deferred
+
+- If the infrastructure grows to multi-tenant or multi-operator, this decision
+  should be revisited. At that point, isolation-first becomes proportionate.
+- If a security audit identifies a specific lateral movement risk, targeted
+  isolation (network policy, mTLS for that service) is the response — not a
+  wholesale posture change.
+
+## Alternatives Rejected
+
+### Zero-trust / isolation-first
+
+Rejected for current scale. The operational overhead (credential lifecycle,
+service mesh, mutual TLS) is disproportionate, observability would decrease,
+and the threat model does not require it. Noted for re-evaluation at
+multi-tenant scale.
+
+### VPN (WireGuard / Tailscale)
+
+Considered briefly. VPN would solve the connectivity problem but introduces
+a persistent network layer that all traffic traverses, reducing the
+explicitness of individual access paths. ops-bridge tunnels are per-service
+and per-actor, which gives better observability and blast-radius control.
+VPN is not ruled out as a future complement but is not the primary approach.
+
+### Ad-hoc SSH (no ops-bridge)
+
+The pre-ops-bridge approach. Rejected because it has no health checks, no
+actor attribution, no audit log, and requires manual intervention to restore.
+ops-bridge formalises the same SSH tunnel pattern with operational discipline.
diff --git a/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md b/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md
new file mode 100644
index 0000000..054dca0
--- /dev/null
+++ b/ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md
@@ -0,0 +1,119 @@
+---
+title: "INC-002: COULOMBCORE node overload — runaway Claude Code agent"
+date: 2026-03-26
+severity: high
+status: resolved
+affected: gitea (http://92.205.130.254:32166), k3s API, SSH access
+environment: COULOMBCORE k3s cluster
+duration: ~15 minutes (detected ~21:00, SSH restored ~21:08 UTC)
+resolved_by: Bernd Worsch / Claude
+---
+
+# INC-002: COULOMBCORE node overload — runaway Claude Code agent
+
+## Summary
+
+The COULOMBCORE node (92.205.130.254) became completely unresponsive under extreme load
+generated by a runaway Claude Code agent process. Load average peaked at **417.43** (1m).
+99.8% of CPU time was spent in kernel mode (context switching). SSH connections timed out
+during banner exchange. k3s API was unreachable (TLS handshake timeout). Gitea was
+technically still running as a process but unable to serve requests. The node had no swap,
+so memory exhaustion amplified the impact — kswapd0 was consuming ~22% CPU trying to
+reclaim pages with nowhere to put them.
+
+---
+
+## Timeline
+
+| Time (approx UTC) | Event |
+|------|-------|
+| ~20:45 | Runaway claude agent (PID 2457456, user tegwick) spawning hundreds of subprocesses |
+| ~21:00 | Load average passes 300; SSH banner exchange starts timing out |
+| ~21:00 | User attempts git operations; git repo service unreachable |
+| ~21:00 | Remote diagnosis begins via ops-bridge (state-hub reverse tunnel still alive) |
+| ~21:00 | k3s API confirmed unresponsive (TLS handshake timeout via local tunnel :16443) |
+| ~21:05 | User obtains console/VNC access via hosting provider |
+| ~21:05 | `top` output shared: load 417, 530 tasks (104 running), 34 zombies, 99.8% sy |
+| ~21:06 | `kill -9 2457456` + `kill -9 2579133` (stuck apport) executed via console |
+| ~21:08 | SSH accepting connections again (load 85, still declining) |
+| ~21:09 | kubectl connectivity restored; PostgreSQL HA nodes resyncing |
+| ~21:10 | Gitea accessible; incident resolved |
+
+---
+
+## Root Cause
+
+A Claude Code agent running on COULOMBCORE under the `tegwick` user (PID 2457456,
+VIRT 71.1GB) spawned approximately 500 child processes. The likely cause is an unbounded
+Ralph loop or parallel agent task expansion without a completion condition or iteration cap.
+
+With no swap configured on a 3.9GB machine, the kernel had no reclaim target. kswapd0
+ran at ~22% CPU continuously. systemd was at ~17% CPU processing unit state changes for
+the constant process churn. The combination of ~500 tasks competing for 2 vCPUs with no
+memory headroom caused the runaway context switching that buried the node.
+
+**The ops-bridge reverse tunnels survived** because they were established before the
+overload began and require no new SSH connections to stay alive. This was the only
+out-of-band visibility channel available once SSH stopped accepting new connections.
+
+---
+
+## Impact
+
+- Gitea and git operations unavailable for ~15 minutes
+- SSH access to COULOMBCORE unavailable (required console)
+- k3s API unresponsive (no pod management possible)
+- PostgreSQL HA nodes 0 and 2 restarted under load (recovered on their own)
+- No data loss
+
+---
+
+## Resolution Steps
+
+```bash
+# Via console/VNC — SSH was not available
+
+# 1. Identify the runaway process (top showed PID 2457456 at 71GB VIRT, 6.8% CPU)
+#    Indicator: massive VIRT, hundreds of children, 99.8% sy in top
+
+# 2. Kill the runaway agent and stuck crash reporter
+kill -9 2457456 # runaway claude process
+kill -9 2579133 # apport in D-state, consuming CPU
+
+# 3. Wait ~60s — load drops, SSH accepts connections
+# 4. Verify PostgreSQL HA recovery (may take 2-3 min to resync)
+kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'
+
+# 5. Verify Gitea
+curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/
+```
+
+---
+
+## Follow-up Actions
+
+- [ ] Add swap to COULOMBCORE: `fallocate -l 4G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile && echo '/swapfile none swap sw 0 0' >> /etc/fstab`
+- [ ] Set nproc ulimit for tegwick user: `/etc/security/limits.conf` → `tegwick hard nproc 512`
+- [ ] Set memory limit on tegwick systemd user session: `systemctl --user set-property "" MemoryMax=1G` or a dedicated slice
+- [ ] Add cluster-wide pod health alerting (cron on CoulombCore) — catches any crashlooping pod, not just Gitea; see runbook "Robustness §5"
+- [ ] Ensure all Ralph loops on remote agents use `/ralph-workplan` (bounded, HEUREKA stop) never raw `/ralph-loop`
+- [ ] Consider adding a `bridge check` cron on workstation that alerts when node load > threshold via state-hub API
+
+---
+
+## Lessons Learned
+
+1. **No swap = amplified blast radius.** A machine with no swap has zero buffer between
+   "memory pressure" and "complete kernel thrash". A 4GB swapfile costs nothing and buys
+   significant time for intervention.
+
+2. **Reverse tunnels are the last line of visibility.** SSH and the k3s API both died.
+   The state-hub reverse tunnel (established from COULOMBCORE outbound) survived and
+   confirmed the node was alive. This was critical for triage.
+
+3. **Remote agents need hard resource ceilings.** A Claude Code agent that spawns
+   subprocesses has no built-in rate limit. nproc + systemd MemoryMax on the user session
+   are the right controls for this environment.
+
+4. **Console access is non-negotiable.** Once SSH dies the only recovery path is OOB
+   console. Ensure hosting provider console credentials are always accessible.