the-custodian/canon/architecture/adr-004-connectivity-first-network-posture.md
2026-03-27 00:16:09 +01:00

---
id: ADR-004
type: architecture-decision-record
title: "Connectivity-First Network Posture for Custodian Infrastructure"
status: accepted
decided_by: Bernd Worsch
date: "2026-03-26"
tags: ["architecture", "network", "ops-bridge", "connectivity", "isolation", "security-posture"]
---
# ADR-004: Connectivity-First Network Posture for Custodian Infrastructure
## Status
Accepted.
## Context
The Custodian infrastructure spans multiple machines: a primary workstation, a
shared compute node (COULOMBCORE), and Railiance cluster nodes. Agents running
on remote machines need to reach the state-hub API and MCP server, which live
on the workstation. Human operators and agents also need to reach remote
services (k3s API, Gitea, Temporal) from the workstation.
Two network postures were considered for how these components communicate:
**Option A — Connectivity-first:** Components are connected by default via
controlled, observable access paths (reverse SSH tunnels managed by ops-bridge).
Isolation is added selectively where there is a specific threat model or
compliance reason to do so.
**Option B — Isolation-first (zero-trust):** No component trusts any other by
default. Every connection requires mutual authentication, short-lived
credentials, and explicit authorisation at the point of use. Connectivity is
earned, not assumed.
This decision is architectural policy — it governs how ops-bridge tunnels are
designed, how agent-to-hub communication works, and how new infrastructure
components are onboarded.
## Decision
**Connectivity-first, with isolation as a deliberate option.**
The default posture for Custodian infrastructure is: components that need to
work together are connected. Access paths are explicit, observable, and managed
(via ops-bridge), but they are persistent by default rather than ephemeral.
Isolation is introduced where there is a specific, articulated reason — not as
a blanket policy applied uniformly.
## Rationale
### 1. Scale and team size
The infrastructure is operated by a single human and a bounded set of
automation agents. The overhead of zero-trust (credential rotation, SPIFFE/SPIRE
identity, mTLS everywhere, short-lived tokens per request) is disproportionate
for this scale. It would add significant operational complexity without
a commensurate security return.
### 2. Observability over perimeter hardening
The primary security control at this scale is **observability**: knowing what
connected, when, from where, and what it did. ops-bridge provides this — every
tunnel is named, actor-attributed, health-checked, and audited. Replacing them
with a perimeter of short-lived connections that leave no named, persistent
record would reduce observability, not improve it.
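
To make "named, actor-attributed, health-checked, and audited" concrete, the following is a minimal sketch of what one audit entry could look like. The field names and the JSON-lines format are assumptions for illustration only; the actual ops-bridge audit schema is not specified in this ADR.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class TunnelEvent:
    """One audit entry for a named tunnel (hypothetical schema)."""
    tunnel: str     # tunnel name, e.g. "state-hub"
    actor: str      # who or what opened or used the tunnel
    source: str     # host the connection came from
    action: str     # e.g. "opened", "health-check", "dropped"
    healthy: bool   # result of the most recent health check
    timestamp: str  # ISO 8601 timestamp in UTC


def make_event(tunnel, actor, source, action, healthy, now=None):
    """Build an event, stamping it with the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return TunnelEvent(tunnel, actor, source, action, healthy, now.isoformat())


def to_audit_line(event):
    """Serialise an event as one JSON line with a stable key order."""
    return json.dumps(asdict(event), sort_keys=True)
```

An append-only log of such lines would answer the four questions above (what connected, when, from where, and what it did) with a plain text search.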
### 3. The threat model does not require zero-trust today
The main threats are:
- A runaway agent consuming resources (mitigated by nproc limits and memory cgroups)
- A compromised workload reaching state-hub and corrupting state (mitigated by
the read-model design of state-hub — write surface is narrow and sanctioned)
- An external attacker reaching internal services (mitigated by the tunnels
being reverse SSH — no inbound ports exposed)
Zero-trust would address a different threat model: lateral movement between
hostile tenants, or untrusted code running in the same environment as sensitive
data. That is not the current situation.
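
The "no inbound ports exposed" property follows from the direction of the connection: the workstation dials out and asks the remote side to forward a loopback port back to it. A minimal sketch of the underlying OpenSSH invocation, assuming hypothetical host, user, and port values (how ops-bridge actually builds its command line is not described in this ADR):

```python
import shlex


def reverse_tunnel_argv(remote_host, remote_port, local_port,
                        user=None, keepalive_s=30):
    """Build an ssh command that exposes a workstation service on the remote
    host's loopback interface via a reverse (-R) forward."""
    target = f"{user}@{remote_host}" if user else remote_host
    return [
        "ssh",
        "-N",                                       # forwarding only, no remote command
        "-o", "ExitOnForwardFailure=yes",           # fail fast if the port cannot be bound
        "-o", f"ServerAliveInterval={keepalive_s}", # detect dead connections
        "-R", f"127.0.0.1:{remote_port}:127.0.0.1:{local_port}",
        target,
    ]


if __name__ == "__main__":
    # Hypothetical values: state-hub on workstation port 8000, remote host "coulombcore".
    print(shlex.join(reverse_tunnel_argv("coulombcore", 8000, 8000, user="ops")))
```

Because the workstation initiates the session, an agent on the remote host reaches the service at its own `127.0.0.1:<remote_port>` while the workstation keeps no listening port open to the network.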
### 4. Degrade-gracefully requires persistent connectivity
The Custodian's foundational value of **local-first, degrade-gracefully**
requires that agents can orient themselves even when some connections are slow
or partially degraded. Ephemeral connectivity (zero-trust tokens, service mesh
sidecars) introduces additional failure modes that conflict with graceful
degradation. Persistent SSH tunnels with auto-reconnect are more resilient to
intermittent conditions.
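
Auto-reconnect for a persistent tunnel can be sketched as a retry loop with capped exponential backoff. This is an illustration, not the ops-bridge implementation: the function names are hypothetical, and treating `max_attempts=0` as "retry without limit" is an assumption based on the setting mentioned under Consequences.

```python
def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield an endless sequence of delays: base, base*factor, ... capped at cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def keep_connected(connect, sleep, max_attempts=0, base=1.0, cap=60.0):
    """Call connect() until it returns True and return the attempt count.

    max_attempts=0 is assumed to mean "never give up"; any positive value
    raises ConnectionError once that many attempts have failed.
    """
    delays = backoff_delays(base=base, cap=cap)
    attempt = 0
    while True:
        attempt += 1
        if connect():
            return attempt
        if max_attempts and attempt >= max_attempts:
            raise ConnectionError(f"gave up after {attempt} attempts")
        sleep(next(delays))  # wait before the next attempt
```

Passing `time.sleep` as `sleep` gives real waiting; injecting it keeps the loop testable. The cap bounds the worst-case delay before a recovered link is noticed.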
### 5. Isolation remains the right choice in specific cases
Connectivity-first does not mean no isolation. The following cases call for
explicit isolation and are handled separately:
- **Tenant separation** (if and when the infrastructure becomes multi-user or
multi-org): each tenant gets its own network segment
- **Privileged execution**: CI runners and agent actions with write access to
production systems run in ephemeral, isolated environments (per the
Privileged Execution Control standard)
- **Secrets**: credentials are never transmitted over tunnels in plaintext;
they are age-encrypted at rest, and SOPS is used for config
## Consequences
### Immediate
- ops-bridge tunnels are **persistent** (`max_attempts: 0`, auto-reconnect) and
are treated as infrastructure, not one-off connections
- Agents on remote machines check tunnel health at session start and restore
dropped tunnels before accessing state-hub (documented in global CLAUDE.md)
- New infrastructure components are onboarded with a named tunnel entry in
`~/.config/bridge/tunnels.yaml` — not ad-hoc SSH commands
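
The session-start check described above could look roughly like the sketch below. It assumes a plain TCP connect is an adequate health probe and that restoring a tunnel is available as a callable; both are illustrative assumptions, and the tunnel names and ports in the usage note are hypothetical.

```python
import socket


def tunnel_is_healthy(host, port, timeout_s=2.0):
    """Return True if a TCP connection to the tunnel endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def ensure_tunnels(tunnels, restore, probe=tunnel_is_healthy):
    """Probe every tunnel and restore the ones that fail.

    tunnels: mapping of tunnel name -> (host, port) of its local endpoint.
    restore: callable taking a tunnel name (hypothetical hook into the
             tunnel manager; not a documented ops-bridge API).
    Returns the names of the tunnels that had to be restored.
    """
    restored = []
    for name, (host, port) in tunnels.items():
        if not probe(host, port):
            restore(name)
            restored.append(name)
    return restored
```

An agent would call `ensure_tunnels({"state-hub": ("127.0.0.1", 8000)}, restore=...)` before its first state-hub request and proceed only once the call returns.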
### Deferred
- If the infrastructure grows to multi-tenant or multi-operator, this decision
should be revisited. At that point, isolation-first becomes proportionate.
- If a security audit identifies a specific lateral movement risk, targeted
isolation (network policy, mTLS for that service) is the response — not a
wholesale posture change.
## Alternatives Rejected
### Zero-trust / isolation-first
Rejected for current scale. The operational overhead (credential lifecycle,
service mesh, mutual TLS) is disproportionate, observability would decrease,
and the threat model does not require it. Noted for re-evaluation at
multi-tenant scale.
### VPN (WireGuard / Tailscale)
Considered briefly. A VPN would solve the connectivity problem but introduces
a persistent network layer that all traffic traverses, reducing the
explicitness of individual access paths. ops-bridge tunnels are per-service
and per-actor, which gives better observability and blast-radius control.
A VPN is not ruled out as a future complement but is not the primary approach.
### Ad-hoc SSH (no ops-bridge)
The pre-ops-bridge approach. Rejected because it has no health checks, no
actor attribution, no audit log, and requires manual intervention to restore
dropped tunnels.
ops-bridge formalises the same SSH tunnel pattern with operational discipline.