| id | type | title | status | decided_by | date | tags |
|---|---|---|---|---|---|---|
| ADR-004 | architecture-decision-record | Connectivity-First Network Posture for Custodian Infrastructure | accepted | Bernd Worsch | 2026-03-26 | |
ADR-004: Connectivity-First Network Posture for Custodian Infrastructure
Status
Accepted.
Context
The Custodian infrastructure spans multiple machines: a primary workstation, a shared compute node (COULOMBCORE), and Railiance cluster nodes. Agents running on remote machines need to reach the state-hub API and MCP server, which live on the workstation. Human operators and agents also need to reach remote services (k3s API, Gitea, Temporal) from the workstation.
Two network postures were considered for how these components communicate:
Option A — Connectivity-first: Components are connected by default via controlled, observable access paths (reverse SSH tunnels managed by ops-bridge). Isolation is added selectively where there is a specific threat model or compliance reason to do so.
Option B — Isolation-first (zero-trust): No component trusts any other by default. Every connection requires mutual authentication, short-lived credentials, and explicit authorisation at the point of use. Connectivity is earned, not assumed.
This decision is architectural policy — it governs how ops-bridge tunnels are designed, how agent-to-hub communication works, and how new infrastructure components are onboarded.
Decision
Connectivity-first, with isolation as a deliberate option.
The default posture for Custodian infrastructure is: components that need to work together are connected. Access paths are explicit, observable, and managed (via ops-bridge), but they are persistent by default rather than ephemeral. Isolation is introduced where there is a specific, articulated reason — not as a blanket policy applied uniformly.
Rationale
1. Scale and team size
The infrastructure is operated by a single human and a bounded set of automation agents. The overhead of zero-trust (credential rotation, SPIFFE/SPIRE identity, mTLS everywhere, short-lived tokens per request) is disproportionate for this scale. It would add significant operational complexity without a commensurate security return.
2. Observability over perimeter hardening
The primary security control at this scale is observability: knowing what connected, when, from where, and what it did. ops-bridge provides this — every tunnel is named, actor-attributed, health-checked, and audited. A perimeter of invisible short-lived connections would actually reduce observability.
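To make "named, actor-attributed, health-checked, and audited" concrete, here is a minimal sketch of what one audit entry could look like. The field names, the event vocabulary, and the JSON-lines format are assumptions made for illustration; this ADR does not specify the actual ops-bridge log schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TunnelEvent:
    """One audit entry. All field names are hypothetical, not the ops-bridge schema."""
    tunnel: str     # name of the tunnel the event belongs to
    actor: str      # human or agent that triggered the event
    event: str      # e.g. "opened", "closed", "health_ok", "health_failed"
    source: str     # machine the event originated from
    timestamp: str  # ISO 8601 timestamp in UTC


def make_event(tunnel: str, actor: str, event: str, source: str) -> TunnelEvent:
    """Stamp the event with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    return TunnelEvent(tunnel, actor, event, source, now)


def to_audit_line(ev: TunnelEvent) -> str:
    """Serialise as one JSON object per line, which suits an append-only log."""
    return json.dumps(asdict(ev), sort_keys=True)


if __name__ == "__main__":
    # "agent-01" is an invented actor name used only for this example.
    print(to_audit_line(make_event("state-hub", "agent-01", "opened", "COULOMBCORE")))
```

A record of this shape answers the four questions above (what connected, when, from where, acting as whom) with a single line per event.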
3. The threat model does not require zero-trust today
The main threats are:
- A runaway agent consuming resources (mitigated by nproc/memory cgroups)
- A compromised workload reaching state-hub and corrupting state (mitigated by the read-model design of state-hub — write surface is narrow and sanctioned)
- An external attacker reaching internal services (mitigated by the tunnels being reverse SSH — no inbound ports exposed)
Zero-trust would address a different threat model: lateral movement between hostile tenants, or untrusted code running in the same environment as sensitive data. That is not the current situation.
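The "no inbound ports exposed" property follows from how a reverse SSH tunnel is established: the machine hosting the service dials out, and the forwarded port appears on the far side. The sketch below only builds the OpenSSH command line for such a tunnel. The host name, user, and port numbers are hypothetical, and this is not a description of how ops-bridge itself invokes ssh.

```python
import shlex


def reverse_tunnel_argv(remote_host: str, remote_port: int, local_port: int,
                        user: str, keepalive_s: int = 30) -> list[str]:
    """Build an OpenSSH command that exposes a local service on a remote host.

    The connection is outbound from the machine running the service, so that
    machine does not need to accept inbound connections for the tunnel.
    """
    return [
        "ssh",
        "-N",                                        # forward only, run no remote command
        "-o", "ExitOnForwardFailure=yes",            # fail fast if the port cannot be bound
        "-o", f"ServerAliveInterval={keepalive_s}",  # detect dead links
        "-R", f"{remote_port}:localhost:{local_port}",
        f"{user}@{remote_host}",
    ]


if __name__ == "__main__":
    # Hypothetical values: a service on local port 8000, reachable on the
    # remote node at port 18000 once the tunnel is up.
    argv = reverse_tunnel_argv("compute.example", 18000, 8000, "ops")
    print(shlex.join(argv))
```

By default OpenSSH binds the remote end of a `-R` forward to the loopback interface, so the forwarded port is reachable only from the remote machine itself unless the server is configured otherwise.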
4. Degrade-gracefully requires persistent connectivity
The Custodian's foundational value of local-first, degrade-gracefully requires that agents can orient themselves even when some connections are slow or partially degraded. Ephemeral connectivity (zero-trust tokens, service mesh sidecars) introduces additional failure modes that conflict with graceful degradation. Persistent SSH tunnels with auto-reconnect are more resilient to intermittent conditions.
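As an illustration of why auto-reconnect tolerates intermittent conditions, the following sketch retries a connection with capped exponential backoff. It assumes that `max_attempts = 0` means "retry without limit", which is a guess at the semantics of the setting rather than documented ops-bridge behaviour. The `connect` callable stands in for whatever actually starts the tunnel.

```python
import itertools
import time
from typing import Callable


def keep_alive(connect: Callable[[], bool], max_attempts: int = 0,
               base_delay: float = 1.0, max_delay: float = 60.0,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """Call connect() until it succeeds and return the number of attempts used.

    max_attempts=0 is treated as "retry forever" (an assumption for this sketch).
    """
    for attempt in itertools.count(1):
        if connect():
            return attempt
        if max_attempts and attempt >= max_attempts:
            raise ConnectionError(f"gave up after {attempt} attempts")
        # Double the wait after each failure, capped at max_delay.
        sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


if __name__ == "__main__":
    # Simulated link that fails twice and then comes up.
    outcomes = iter([False, False, True])
    print(keep_alive(lambda: next(outcomes), sleep=lambda _delay: None))
```

The cap on the delay keeps a long outage from pushing the next retry arbitrarily far into the future, so the tunnel comes back soon after the link does.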
5. Isolation remains the right choice in specific cases
Connectivity-first does not mean no isolation. The following cases call for explicit isolation and are handled separately:
- Tenant separation (when/if multi-user or multi-org) — each tenant gets its own network segment
- Privileged execution — CI runners and agent actions with write access to production systems run in ephemeral, isolated environments (per the Privileged Execution Control standard)
- Secrets — credentials are never transmitted over tunnels in plaintext; age-encrypted at rest, SOPS for config
Consequences
Immediate
- ops-bridge tunnels are persistent (max_attempts: 0, auto-reconnect) and are treated as infrastructure, not one-off connections
- Agents on remote machines check tunnel health at session start and restore dropped tunnels before accessing state-hub (documented in global CLAUDE.md)
- New infrastructure components are onboarded with a named tunnel entry in ~/.config/bridge/tunnels.yaml, not via ad-hoc SSH commands
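The session-start health check described in the list above could look roughly like the sketch below, which probes the local end of each tunnel over TCP and reports the ones that do not answer. The tunnel names and port numbers are invented for the example, and the real check and the layout of tunnels.yaml may differ.

```python
import socket
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def dropped_tunnels(tunnels: dict[str, int],
                    probe: Callable[[str, int], bool] = port_is_open) -> list[str]:
    """Return the names of tunnels whose local port does not accept connections."""
    return sorted(name for name, port in tunnels.items()
                  if not probe("127.0.0.1", port))


if __name__ == "__main__":
    # Hypothetical mapping of tunnel name to local port.
    expected = {"state-hub": 18000, "gitea": 13000}
    for name in dropped_tunnels(expected):
        print(f"tunnel {name!r} is down; restore it before accessing state-hub")
```

A successful TCP connect only shows that something is listening on the local port. A stricter check would also request a known endpoint through the tunnel to confirm that the far side responds.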
Deferred
- If the infrastructure grows to multi-tenant or multi-operator, this decision should be revisited. At that point, isolation-first becomes proportionate.
- If a security audit identifies a specific lateral movement risk, targeted isolation (network policy, mTLS for that service) is the response — not a wholesale posture change.
Alternatives Rejected
Zero-trust / isolation-first
Rejected for current scale. The operational overhead (credential lifecycle, service mesh, mutual TLS) is disproportionate, observability would decrease, and the threat model does not require it. Noted for re-evaluation at multi-tenant scale.
VPN (WireGuard / Tailscale)
Considered briefly. VPN would solve the connectivity problem but introduces a persistent network layer that all traffic traverses, reducing the explicitness of individual access paths. ops-bridge tunnels are per-service and per-actor, which gives better observability and blast-radius control. VPN is not ruled out as a future complement but is not the primary approach.
Ad-hoc SSH (no ops-bridge)
The pre-ops-bridge approach. Rejected because it has no health checks, no actor attribution, no audit log, and requires manual intervention to restore. ops-bridge formalises the same SSH tunnel pattern with operational discipline.