ADR and Runbook artefacts

This commit is contained in:
2026-03-27 00:16:09 +01:00
parent b19896a9a9
commit 29b84de13c
2 changed files with 262 additions and 0 deletions

@@ -0,0 +1,143 @@
---
id: ADR-004
type: architecture-decision-record
title: "Connectivity-First Network Posture for Custodian Infrastructure"
status: accepted
decided_by: Bernd Worsch
date: "2026-03-26"
tags: ["architecture", "network", "ops-bridge", "connectivity", "isolation", "security-posture"]
---
# ADR-004: Connectivity-First Network Posture for Custodian Infrastructure
## Status
Accepted.
## Context
The Custodian infrastructure spans multiple machines: a primary workstation, a
shared compute node (COULOMBCORE), and Railiance cluster nodes. Agents running
on remote machines need to reach the state-hub API and MCP server, which live
on the workstation. Human operators and agents also need to reach remote
services (k3s API, Gitea, Temporal) from the workstation.
Two network postures were considered for how these components communicate:
**Option A — Connectivity-first:** Components are connected by default via
controlled, observable access paths (reverse SSH tunnels managed by ops-bridge).
Isolation is added selectively where there is a specific threat model or
compliance reason to do so.
**Option B — Isolation-first (zero-trust):** No component trusts any other by
default. Every connection requires mutual authentication, short-lived
credentials, and explicit authorisation at the point of use. Connectivity is
earned, not assumed.
This decision is architectural policy — it governs how ops-bridge tunnels are
designed, how agent-to-hub communication works, and how new infrastructure
components are onboarded.
## Decision
**Connectivity-first, with isolation as a deliberate option.**
The default posture for Custodian infrastructure is: components that need to
work together are connected. Access paths are explicit, observable, and managed
(via ops-bridge), but they are persistent by default rather than ephemeral.
Isolation is introduced where there is a specific, articulated reason — not as
a blanket policy applied uniformly.
## Rationale
### 1. Scale and team size
The infrastructure is operated by a single human and a bounded set of
automation agents. The overhead of zero-trust (credential rotation, SPIFFE/SPIRE
identity, mTLS everywhere, short-lived tokens per request) is disproportionate
for this scale. It would add significant operational complexity without
a commensurate security return.
### 2. Observability over perimeter hardening
The primary security control at this scale is **observability**: knowing what
connected, when, from where, and what it did. ops-bridge provides this — every
tunnel is named, actor-attributed, health-checked, and audited. A perimeter of
invisible short-lived connections would actually reduce observability.
### 3. The threat model does not require zero-trust today
The main threats are:
- A runaway agent consuming resources (mitigated by nproc limits and memory cgroups)
- A compromised workload reaching state-hub and corrupting state (mitigated by
the read-model design of state-hub — write surface is narrow and sanctioned)
- An external attacker reaching internal services (mitigated by the tunnels
being reverse SSH — no inbound ports exposed)
Zero-trust would address a different threat model: lateral movement between
hostile tenants, or untrusted code running in the same environment as sensitive
data. That is not the current situation.
### 4. Degrade-gracefully requires persistent connectivity
The Custodian's foundational value of **local-first, degrade-gracefully**
requires that agents can orient themselves even when some connections are slow
or partially degraded. Ephemeral connectivity (zero-trust tokens, service mesh
sidecars) introduces additional failure modes that conflict with graceful
degradation. Persistent SSH tunnels with auto-reconnect are more resilient to
intermittent conditions.
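
The auto-reconnect behaviour argued for here can be pictured as a retry loop around a long-running `ssh` process. The following is a minimal sketch under stated assumptions, not ops-bridge's implementation: the function name `run_with_retry`, the backoff values, and the reading of an attempt limit of `0` as "retry indefinitely" are all illustrative.

```bash
#!/usr/bin/env bash
# Sketch only: a reconnect loop around a long-running command.
# run_with_retry and its backoff values are illustrative, not part of ops-bridge.

run_with_retry() {
  # $1 = attempt limit (0 is assumed to mean "retry indefinitely")
  # remaining arguments = the command to keep alive
  local max="$1"; shift
  local attempt=0 delay=1
  while :; do
    "$@" && return 0            # command exited cleanly: stop retrying
    attempt=$((attempt + 1))
    if [ "$max" -gt 0 ] && [ "$attempt" -ge "$max" ]; then
      return 1                  # attempt limit reached
    fi
    sleep "$delay"
    # double the wait after each failure until it reaches 30s or more
    if [ "$delay" -lt 30 ]; then delay=$((delay * 2)); fi
  done
}

# Hypothetical usage (host and ports are placeholders):
# run_with_retry 0 ssh -N -o ServerAliveInterval=15 -o ExitOnForwardFailure=yes \
#   -R 8000:localhost:8000 user@remote-host
```

The point of the sketch is that a dropped tunnel has a single failure mode (the process exits) and a single recovery action (start it again), which is what makes this pattern tolerant of intermittent links.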
### 5. Isolation remains the right choice in specific cases
Connectivity-first does not mean no isolation. The following cases call for
explicit isolation and are handled separately:
- **Tenant separation** (when/if multi-user or multi-org) — each tenant gets
its own network segment
- **Privileged execution** — CI runners and agent actions with write access to
production systems run in ephemeral, isolated environments (per the
Privileged Execution Control standard)
- **Secrets** — credentials are never transmitted over tunnels in plaintext;
they are age-encrypted at rest, and config is encrypted with SOPS
## Consequences
### Immediate
- ops-bridge tunnels are **persistent** (max_attempts: 0, auto-reconnect) and
are treated as infrastructure, not one-off connections
- Agents on remote machines check tunnel health at session start and restore
dropped tunnels before accessing state-hub (documented in global CLAUDE.md)
- New infrastructure components are onboarded with a named tunnel entry in
`~/.config/bridge/tunnels.yaml` — not ad-hoc SSH commands
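
The session-start health check in the list above could look roughly like the probe below. This is a hedged sketch: `tunnel_alive` and `check_tunnel` are hypothetical helper names, the 3-second timeout is arbitrary, and the restore step is deliberately left out because the ops-bridge command for it is not shown in this document.

```bash
#!/usr/bin/env bash
# Sketch only: probe a local tunnel endpoint before touching state-hub.
# Helper names and the timeout are assumptions, not ops-bridge's interface.

tunnel_alive() {
  local host="$1" port="$2"
  # bash's /dev/tcp pseudo-device opens a TCP connection;
  # timeout(1) bounds the wait if the endpoint hangs
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

check_tunnel() {
  local name="$1" host="$2" port="$3"
  if tunnel_alive "$host" "$port"; then
    echo "ok ${name}"
  else
    echo "down ${name}"
    return 1   # caller decides how to restore the tunnel
  fi
}

# Example: the k3s API tunnel on local port 16443
# check_tunnel k3s-api 127.0.0.1 16443
```

A TCP connect only shows that the local end of the tunnel is listening. A fuller check would also issue a request through the tunnel, since a listener can outlive a dead remote end.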
### Deferred
- If the infrastructure grows to multi-tenant or multi-operator, this decision
should be revisited. At that point, isolation-first becomes proportionate.
- If a security audit identifies a specific lateral movement risk, targeted
isolation (network policy, mTLS for that service) is the response — not a
wholesale posture change.
## Alternatives Rejected
### Zero-trust / isolation-first
Rejected for current scale. The operational overhead (credential lifecycle,
service mesh, mutual TLS) is disproportionate, observability would decrease,
and the threat model does not require it. Noted for re-evaluation at
multi-tenant scale.
### VPN (WireGuard / Tailscale)
Considered briefly. VPN would solve the connectivity problem but introduces
a persistent network layer that all traffic traverses, reducing the
explicitness of individual access paths. ops-bridge tunnels are per-service
and per-actor, which gives better observability and blast-radius control.
VPN is not ruled out as a future complement but is not the primary approach.
### Ad-hoc SSH (no ops-bridge)
The pre-ops-bridge approach. Rejected because it has no health checks, no
actor attribution, no audit log, and requires manual intervention to restore.
ops-bridge formalises the same SSH tunnel pattern with operational discipline.

@@ -0,0 +1,119 @@
---
title: "INC-002: COULOMBCORE node overload — runaway Claude Code agent"
date: 2026-03-26
severity: high
status: resolved
affected: gitea (http://92.205.130.254:32166), k3s API, SSH access
environment: COULOMBCORE k3s cluster
duration: ~15 minutes (detected ~21:00, SSH restored ~21:08 UTC)
resolved_by: Bernd Worsch / Claude
---
# INC-002: COULOMBCORE node overload — runaway Claude Code agent
## Summary
The COULOMBCORE node (92.205.130.254) became completely unresponsive under extreme load
generated by a runaway Claude Code agent process. Load average peaked at **417.43** (1m).
99.8% of CPU time was spent in kernel mode (context switching). SSH connections timed out
during banner exchange. k3s API was unreachable (TLS handshake timeout). Gitea was
technically still running as a process but unable to serve requests. The node had no swap,
so memory exhaustion amplified the impact — kswapd0 was consuming ~22% CPU trying to
reclaim pages with nowhere to put them.
---
## Timeline
| Time (approx UTC) | Event |
|------|-------|
| ~20:45 | Runaway claude agent (PID 2457456, user tegwick) spawning hundreds of subprocesses |
| ~21:00 | Load average passes 300; SSH banner exchange starts timing out |
| ~21:00 | User attempts git operations; Gitea unreachable |
| ~21:00 | Remote diagnosis begins via ops-bridge (state-hub reverse tunnel still alive) |
| ~21:00 | k3s API confirmed unresponsive (TLS handshake timeout via local tunnel :16443) |
| ~21:05 | User obtains console/VNC access via hosting provider |
| ~21:05 | `top` output shared: load 417, 530 tasks (104 running), 34 zombies, 99.8% sy |
| ~21:06 | `kill -9 2457456` + `kill -9 2579133` (stuck apport) executed via console |
| ~21:08 | SSH accepting connections again (load 85, still declining) |
| ~21:09 | kubectl connectivity restored; PostgreSQL HA nodes resyncing |
| ~21:10 | Gitea accessible; incident resolved |
---
## Root Cause
A Claude Code agent running on COULOMBCORE under the `tegwick` user (PID 2457456,
VIRT 71.1GB) spawned approximately 500 child processes. The likely cause is an unbounded
Ralph loop or parallel agent task expansion without a completion condition or iteration cap.
With no swap configured on a 3.9GB machine, the kernel had no reclaim target. kswapd0
ran at ~22% CPU continuously. systemd was at ~17% CPU processing unit state changes for
the constant process churn. The combination of ~500 tasks competing for 2 vCPUs with no
memory headroom caused the runaway context switching that buried the node.
**The ops-bridge reverse tunnels survived** because they were established before the
overload began and require no new SSH connections to stay alive. This was the only
out-of-band visibility channel available once SSH stopped accepting new connections.
---
## Impact
- Gitea and git operations unavailable for ~15 minutes
- SSH access to COULOMBCORE unavailable (required console)
- k3s API unresponsive (no pod management possible)
- PostgreSQL HA nodes 0 and 2 restarted under load (recovered on their own)
- No data loss
---
## Resolution Steps
```bash
# Via console/VNC — SSH was not available
# 1. Identify the runaway process (top showed PID 2457456 at 71GB VIRT, 6.8% CPU)
# Indicator: massive VIRT, hundreds of children, 99.8% sy in top
# 2. Kill the runaway agent and stuck crash reporter
kill -9 2457456 # runaway claude process
kill -9 2579133 # apport in D-state, consuming CPU
# 3. Wait ~60s — load drops, SSH accepts connections
# 4. Verify PostgreSQL HA recovery (may take 2-3 min to resync)
kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'
# 5. Verify Gitea
curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/
```
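
Step 1 relied on reading `top` by eye. As a complement, the "hundreds of children" indicator can be checked directly by counting direct children per parent PID. The helper below is an illustrative sketch (the name `top_parents` is made up here); it reads parent PIDs on stdin so it can be fed from `ps -eo ppid=`.

```bash
#!/usr/bin/env bash
# Sketch only: rank parent PIDs by number of direct children.

top_parents() {
  # stdin:  one parent PID per line (e.g. from `ps -eo ppid=`)
  # stdout: "<child count> <parent pid>" for the N busiest parents
  local n="${1:-5}"
  awk '{ c[$1]++ } END { for (p in c) print c[p], p }' | sort -rn | head -n "$n"
}

# Usage on a live system:
# ps -eo ppid= | top_parents 5
```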
---
## Follow-up Actions
- [ ] Add swap to COULOMBCORE: `fallocate -l 4G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile && echo '/swapfile none swap sw 0 0' >> /etc/fstab`
- [ ] Set nproc ulimit for tegwick user in `/etc/security/limits.conf`: `tegwick hard nproc 512`
- [ ] Set memory limit on tegwick systemd user session: `systemctl --user set-property <unit> MemoryMax=1G` or a dedicated slice
- [ ] Add cluster-wide pod health alerting (cron on COULOMBCORE) that catches any crashlooping pod, not just Gitea; see runbook "Robustness §5"
- [ ] Ensure all Ralph loops on remote agents use `/ralph-workplan` (bounded, HEUREKA stop), never raw `/ralph-loop`
- [ ] Consider adding a `bridge check` cron on workstation that alerts when node load > threshold via state-hub API
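
The load-threshold check in the last item could be as small as the comparison below. This is a sketch under assumptions: `load_exceeds` and `current_load_1m` are hypothetical names, the threshold of 8 is a placeholder, and how the alert reaches the state-hub API is left open because that interface is not described here.

```bash
#!/usr/bin/env bash
# Sketch only: compare the 1-minute load average against a threshold.

load_exceeds() {
  # $1 = threshold, $2 = load value; exit status 0 when load > threshold.
  # awk does the comparison because shell arithmetic is integer-only.
  awk -v t="$1" -v l="$2" 'BEGIN { exit !(l > t) }'
}

current_load_1m() {
  # first field of /proc/loadavg is the 1-minute load average (Linux only)
  cut -d' ' -f1 /proc/loadavg
}

# Hypothetical cron body; the threshold is a placeholder:
# if load_exceeds 8 "$(current_load_1m)"; then echo "ALERT: load high"; fi
```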
---
## Lessons Learned
1. **No swap = amplified blast radius.** A machine with no swap has zero buffer between
"memory pressure" and "complete kernel thrash". A 4GB swapfile costs nothing and buys
significant time for intervention.
2. **Reverse tunnels are the last line of visibility.** SSH and the k3s API both died.
The state-hub reverse tunnel (established from COULOMBCORE outbound) survived and
confirmed the node was alive. This was critical for triage.
3. **Remote agents need hard resource ceilings.** A Claude Code agent that spawns
subprocesses has no built-in rate limit. nproc + systemd MemoryMax on the user session
are the right controls for this environment.
4. **Console access is non-negotiable.** Once SSH dies the only recovery path is OOB
console. Ensure hosting provider console credentials are always accessible.
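
Lesson 1 lends itself to an automated check. The sketch below reads `SwapTotal` from a meminfo-format file and reports whether any swap is configured. The function names are illustrative, and the optional file argument exists only so the check can be exercised against sample data.

```bash
#!/usr/bin/env bash
# Sketch only: detect whether a Linux host has swap configured.

swap_total_kb() {
  # $1 = meminfo-format file; defaults to the live /proc/meminfo
  awk '$1 == "SwapTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

has_swap() {
  # succeeds only when SwapTotal is present and greater than zero
  [ "$(swap_total_kb "$1")" -gt 0 ] 2>/dev/null
}

# Usage on a live system:
# has_swap || echo "WARNING: no swap configured"
```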