ADR and Runbook artefacts
143
canon/architecture/adr-004-connectivity-first-network-posture.md
Normal file
@@ -0,0 +1,143 @@
---
id: ADR-004
type: architecture-decision-record
title: "Connectivity-First Network Posture for Custodian Infrastructure"
status: accepted
decided_by: Bernd Worsch
date: "2026-03-26"
tags: ["architecture", "network", "ops-bridge", "connectivity", "isolation", "security-posture"]
---

# ADR-004: Connectivity-First Network Posture for Custodian Infrastructure

## Status

Accepted.

## Context

The Custodian infrastructure spans multiple machines: a primary workstation, a
shared compute node (COULOMBCORE), and Railiance cluster nodes. Agents running
on remote machines need to reach the state-hub API and MCP server, which live
on the workstation. Human operators and agents also need to reach remote
services (k3s API, Gitea, Temporal) from the workstation.

Two network postures were considered for how these components communicate:

**Option A (connectivity-first):** Components are connected by default via
controlled, observable access paths (reverse SSH tunnels managed by ops-bridge).
Isolation is added selectively where there is a specific threat model or
compliance reason to do so.

**Option B (isolation-first, zero-trust):** No component trusts any other by
default. Every connection requires mutual authentication, short-lived
credentials, and explicit authorisation at the point of use. Connectivity is
earned, not assumed.

This decision is architectural policy: it governs how ops-bridge tunnels are
designed, how agent-to-hub communication works, and how new infrastructure
components are onboarded.
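
As a rough illustration of the access path in Option A, a reverse SSH tunnel can be expressed as a single `ssh -R` invocation. The sketch below only prints the command rather than running it. The host name, user, and port numbers are hypothetical, and this ADR does not show how ops-bridge actually invokes `ssh` or which machine initiates the connection.

```bash
#!/bin/sh
# Sketch only: build (not run) an ssh command line for a reverse tunnel.
# Every value passed in below is a hypothetical placeholder.
reverse_tunnel_cmd() {
    remote_port=$1   # port opened on the remote machine's loopback interface
    local_port=$2    # port of the service on the machine that runs ssh
    target=$3        # user@host of the remote machine
    # -N: run no remote command; -R: forward remote_port back to local_port.
    # ServerAliveInterval and ExitOnForwardFailure are standard OpenSSH options.
    printf 'ssh -N -R %s:localhost:%s -o ServerAliveInterval=30 -o ExitOnForwardFailure=yes %s\n' \
        "$remote_port" "$local_port" "$target"
}

reverse_tunnel_cmd 8000 8000 agent@remote.example
```

Because the connection is opened outbound by the machine that runs `ssh`, the forwarded service needs no listening port exposed to the network.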

## Decision

**Connectivity-first, with isolation as a deliberate option.**

The default posture for Custodian infrastructure is: components that need to
work together are connected. Access paths are explicit, observable, and managed
(via ops-bridge), but they are persistent by default rather than ephemeral.
Isolation is introduced where there is a specific, articulated reason, not as
a blanket policy applied uniformly.

## Rationale

### 1. Scale and team size

The infrastructure is operated by a single human and a bounded set of
automation agents. The overhead of zero-trust (credential rotation, SPIFFE/SPIRE
identity, mTLS everywhere, short-lived tokens per request) is disproportionate
for this scale. It would add significant operational complexity without
a commensurate security return.

### 2. Observability over perimeter hardening

The primary security control at this scale is **observability**: knowing what
connected, when, from where, and what it did. ops-bridge provides this: every
tunnel is named, actor-attributed, health-checked, and audited. A perimeter of
invisible short-lived connections would actually reduce observability.

### 3. The threat model does not require zero-trust today

The main threats are:
- A runaway agent consuming resources (mitigated by nproc limits and memory
  cgroups)
- A compromised workload reaching state-hub and corrupting state (mitigated by
  the read-model design of state-hub, whose write surface is narrow and
  sanctioned)
- An external attacker reaching internal services (mitigated by the tunnels
  being reverse SSH, so no inbound ports are exposed)

Zero-trust would address a different threat model: lateral movement between
hostile tenants, or untrusted code running in the same environment as sensitive
data. That is not the current situation.

### 4. Degrade-gracefully requires persistent connectivity

The Custodian's foundational value of **local-first, degrade-gracefully**
requires that agents can orient themselves even when some connections are slow
or partially degraded. Ephemeral connectivity (zero-trust tokens, service mesh
sidecars) introduces additional failure modes that conflict with graceful
degradation. Persistent SSH tunnels with auto-reconnect are more resilient to
intermittent conditions.
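
A minimal sketch of the auto-reconnect idea, assuming that an attempt limit of `0` means "never give up" (the meaning suggested by the `max_attempts: 0` setting mentioned under Consequences). The function name `keep_alive` and its arguments are hypothetical; this is not ops-bridge's implementation.

```bash
#!/bin/sh
# Sketch only: rerun a long-lived command every time it exits.
# max_attempts=0 is assumed to mean "reconnect forever".
keep_alive() {
    max_attempts=$1   # 0 = unlimited
    delay=$2          # seconds to wait before reconnecting
    shift 2
    attempt=0
    while :; do
        attempt=$((attempt + 1))
        "$@"          # blocks for as long as the connection stays up
        if [ "$max_attempts" -gt 0 ] && [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        sleep "$delay"
    done
}

# Hypothetical usage (loops until interrupted):
#   keep_alive 0 5 ssh -N -R 8000:localhost:8000 agent@remote.example
```

A fixed delay keeps the sketch short. A real supervisor would more likely back off between attempts so that a dead peer is not hammered with reconnects.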

### 5. Isolation remains the right choice in specific cases

Connectivity-first does not mean no isolation. The following cases call for
explicit isolation and are handled separately:

- **Tenant separation** (when/if multi-user or multi-org): each tenant gets
  its own network segment
- **Privileged execution**: CI runners and agent actions with write access to
  production systems run in ephemeral, isolated environments (per the
  Privileged Execution Control standard)
- **Secrets**: credentials are never transmitted over tunnels in plaintext;
  they are age-encrypted at rest, with SOPS for config

## Consequences

### Immediate

- ops-bridge tunnels are **persistent** (max_attempts: 0, auto-reconnect) and
  are treated as infrastructure, not one-off connections
- Agents on remote machines check tunnel health at session start and restore
  dropped tunnels before accessing state-hub (documented in global CLAUDE.md)
- New infrastructure components are onboarded with a named tunnel entry in
  `~/.config/bridge/tunnels.yaml`, not ad-hoc SSH commands
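
The session-start behaviour in the list above (check, then restore, then re-check) could be sketched as follows. The function name `ensure_tunnel` is hypothetical, and the check and restore commands are passed in as strings because this ADR does not specify the commands ops-bridge provides for either step.

```bash
#!/bin/sh
# Sketch only: verify a named tunnel and try to restore it once if it is down.
# check_cmd and restore_cmd are caller-supplied shell commands (assumptions).
ensure_tunnel() {
    name=$1
    check_cmd=$2
    restore_cmd=$3
    if sh -c "$check_cmd" >/dev/null 2>&1; then
        echo "$name: healthy"
        return 0
    fi
    echo "$name: down, restoring"
    sh -c "$restore_cmd" >/dev/null 2>&1
    if sh -c "$check_cmd" >/dev/null 2>&1; then
        echo "$name: restored"
        return 0
    fi
    echo "$name: still down" >&2
    return 1
}
```

An agent would call this once per tunnel it depends on and refuse to touch state-hub when the function returns non-zero. The check could be as simple as an HTTP probe against the forwarded port.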

### Deferred

- If the infrastructure grows to multi-tenant or multi-operator, this decision
  should be revisited. At that point, isolation-first becomes proportionate.
- If a security audit identifies a specific lateral movement risk, targeted
  isolation (network policy, mTLS for that service) is the response, not a
  wholesale posture change.

## Alternatives Rejected

### Zero-trust / isolation-first

Rejected for current scale. The operational overhead (credential lifecycle,
service mesh, mutual TLS) is disproportionate, observability would decrease,
and the threat model does not require it. Noted for re-evaluation at
multi-tenant scale.

### VPN (WireGuard / Tailscale)

Considered briefly. VPN would solve the connectivity problem but introduces
a persistent network layer that all traffic traverses, reducing the
explicitness of individual access paths. ops-bridge tunnels are per-service
and per-actor, which gives better observability and blast-radius control.
VPN is not ruled out as a future complement but is not the primary approach.

### Ad-hoc SSH (no ops-bridge)

The pre-ops-bridge approach. Rejected because it has no health checks, no
actor attribution, no audit log, and requires manual intervention to restore.
ops-bridge formalises the same SSH tunnel pattern with operational discipline.
119
ops/incidents/2026-03-26-coulombcore-runaway-agent-overload.md
Normal file
@@ -0,0 +1,119 @@
---
title: "INC-002: COULOMBCORE node overload (runaway Claude Code agent)"
date: 2026-03-26
severity: high
status: resolved
affected: gitea (http://92.205.130.254:32166), k3s API, SSH access
environment: COULOMBCORE k3s cluster
duration: ~15 minutes (detected ~21:00, SSH restored ~21:08 UTC)
resolved_by: Bernd Worsch / Claude
---

# INC-002: COULOMBCORE node overload (runaway Claude Code agent)

## Summary

The COULOMBCORE node (92.205.130.254) became completely unresponsive under extreme load
generated by a runaway Claude Code agent process. Load average peaked at **417.43** (1m).
99.8% of CPU time was spent in kernel mode (context switching). SSH connections timed out
during banner exchange. k3s API was unreachable (TLS handshake timeout). Gitea was
technically still running as a process but unable to serve requests. The node had no swap,
so memory exhaustion amplified the impact: kswapd0 was consuming ~22% CPU trying to
reclaim pages with nowhere to put them.

---

## Timeline

| Time (approx. UTC) | Event |
|------|-------|
| ~20:45 | Runaway Claude agent (PID 2457456, user tegwick) spawning hundreds of subprocesses |
| ~21:00 | Load average passes 300; SSH banner exchange starts timing out |
| ~21:00 | User attempts git operations; git repo service unreachable |
| ~21:00 | Remote diagnosis begins via ops-bridge (state-hub reverse tunnel still alive) |
| ~21:00 | k3s API confirmed unresponsive (TLS handshake timeout via local tunnel :16443) |
| ~21:05 | User obtains console/VNC access via hosting provider |
| ~21:05 | `top` output shared: load 417, 530 tasks (104 running), 34 zombies, 99.8% sy |
| ~21:06 | `kill -9 2457456` + `kill -9 2579133` (stuck apport) executed via console |
| ~21:08 | SSH accepting connections again (load 85, still declining) |
| ~21:09 | kubectl connectivity restored; PostgreSQL HA nodes resyncing |
| ~21:10 | Gitea accessible; incident resolved |

---

## Root Cause

A Claude Code agent running on COULOMBCORE under the `tegwick` user (PID 2457456,
VIRT 71.1GB) spawned approximately 500 child processes. The likely cause is an unbounded
Ralph loop or parallel agent task expansion without a completion condition or iteration cap.

With no swap configured on a 3.9GB machine, the kernel had no reclaim target. kswapd0
ran at ~22% CPU continuously. systemd was at ~17% CPU processing unit state changes for
the constant process churn. The combination of ~500 tasks competing for 2 vCPUs with no
memory headroom caused the runaway context switching that buried the node.
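
To put the peak load in proportion to the hardware, the load average can be divided by the CPU count. The helper below is a small sketch (the name `load_per_cpu` is hypothetical) applied to the figures recorded in this report: a 1-minute load of 417.43 on 2 vCPUs.

```bash
#!/bin/sh
# Sketch only: express a load average as queued tasks per CPU.
load_per_cpu() {
    # $1 = load average, $2 = number of CPUs
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.1f\n", load / cpus }'
}

load_per_cpu 417.43 2   # figures from this incident; prints 208.7
```

On a live Linux host the same function could be fed real values with `load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`. A result far above 1.0, as here, means tasks are queuing rather than running.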

**The ops-bridge reverse tunnels survived** because they were established before the
overload began and require no new SSH connections to stay alive. This was the only
out-of-band visibility channel available once SSH stopped accepting new connections.

---

## Impact

- Gitea and git operations unavailable for ~15 minutes
- SSH access to COULOMBCORE unavailable (required console)
- k3s API unresponsive (no pod management possible)
- PostgreSQL HA nodes 0 and 2 restarted under load (recovered on their own)
- No data loss

---

## Resolution Steps

```bash
# Via console/VNC (SSH was not available)

# 1. Identify the runaway process (top showed PID 2457456 at 71GB VIRT, 6.8% CPU)
#    Indicator: massive VIRT, hundreds of children, 99.8% sy in top

# 2. Kill the runaway agent and stuck crash reporter
kill -9 2457456   # runaway claude process
kill -9 2579133   # apport in D-state, consuming CPU

# 3. Wait ~60s: load drops, SSH accepts connections
# 4. Verify PostgreSQL HA recovery (may take 2-3 min to resync)
kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'

# 5. Verify Gitea
curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/
```
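
Step 1 above relied on reading `top` by eye. As a possible aid for future incidents, the sketch below ranks parent PIDs by their number of direct children, which is one way to spot a process that is spawning hundreds of subprocesses. The function name `top_parents` is hypothetical and was not part of the actual response.

```bash
#!/bin/sh
# Sketch only: rank parent PIDs by number of direct children.
# Reads one parent PID per line on stdin and prints "<count> <ppid>" lines.
top_parents() {
    awk '{ print $1 }' \
        | sort -n \
        | uniq -c \
        | sort -rn \
        | head -n "${1:-5}" \
        | awk '{ print $1, $2 }'
}

# On a live host the input would come from ps:
#   ps -eo ppid= | top_parents 5
```

The pipeline itself starts several processes, so on a node that is already thrashing it may be slow to return.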

---

## Follow-up Actions

- [ ] Add swap to COULOMBCORE: `fallocate -l 4G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile && echo '/swapfile none swap sw 0 0' >> /etc/fstab`
- [ ] Set nproc ulimit for tegwick user: `/etc/security/limits.conf` → `tegwick hard nproc 512`
- [ ] Set memory limit on tegwick systemd user session: `systemctl --user set-property "" MemoryMax=1G` or a dedicated slice
- [ ] Add cluster-wide pod health alerting (cron on COULOMBCORE); this catches any crashlooping pod, not just Gitea; see runbook "Robustness §5"
- [ ] Ensure all Ralph loops on remote agents use `/ralph-workplan` (bounded, HEUREKA stop), never raw `/ralph-loop`
- [ ] Consider adding a `bridge check` cron on workstation that alerts when node load > threshold via state-hub API
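
The load-threshold alert in the last item could start from a comparison like the one below. This is a sketch under stated assumptions: the function name `load_exceeds` and the threshold of 300 are illustrative, and how the alert would be delivered through the state-hub API is left out because this report does not describe that API.

```bash
#!/bin/sh
# Sketch only: exit 0 when a load average exceeds a threshold, non-zero otherwise.
load_exceeds() {
    # $1 = threshold, $2 = load average (defaults to the 1-minute value on Linux)
    load=${2:-$(cut -d' ' -f1 /proc/loadavg)}
    awk -v l="$load" -v t="$1" 'BEGIN { exit !(l > t) }'
}

# Using this incident's peak load against an illustrative threshold of 300:
if load_exceeds 300 417.43; then
    echo "ALERT: load above threshold"
fi
```

A cron entry would call `load_exceeds` without the second argument so that it reads the current value from `/proc/loadavg`.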

---

## Lessons Learned

1. **No swap = amplified blast radius.** A machine with no swap has zero buffer between
   "memory pressure" and "complete kernel thrash". A 4GB swapfile costs nothing and buys
   significant time for intervention.

2. **Reverse tunnels are the last line of visibility.** SSH and the k3s API both died.
   The state-hub reverse tunnel (established from COULOMBCORE outbound) survived and
   confirmed the node was alive. This was critical for triage.

3. **Remote agents need hard resource ceilings.** A Claude Code agent that spawns
   subprocesses has no built-in rate limit. nproc + systemd MemoryMax on the user session
   are the right controls for this environment.

4. **Console access is non-negotiable.** Once SSH dies the only recovery path is OOB
   console. Ensure hosting provider console credentials are always accessible.
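
Lesson 1 is easy to verify mechanically. The sketch below reads the `SwapTotal` field from a meminfo-format file (`/proc/meminfo` on Linux) and reports whether any swap is configured. The function names are hypothetical.

```bash
#!/bin/sh
# Sketch only: report configured swap from a meminfo-format file.
swap_total_kb() {
    # $1 = file to read (defaults to /proc/meminfo on Linux)
    awk '$1 == "SwapTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

has_swap() {
    kb=$(swap_total_kb "$@")
    [ -n "$kb" ] && [ "$kb" -gt 0 ]
}

# Example:
#   has_swap || echo "WARNING: no swap configured"
```

Running this check as part of node onboarding would flag a swapless machine before it is put under agent load.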