---
id: ADR-004
type: architecture-decision-record
title: "Connectivity-First Network Posture for Custodian Infrastructure"
status: accepted
decided_by: Bernd Worsch
date: "2026-03-26"
tags: ["architecture", "network", "ops-bridge", "connectivity", "isolation", "security-posture"]
---

# ADR-004: Connectivity-First Network Posture for Custodian Infrastructure

## Status

Accepted.

## Context

The Custodian infrastructure spans multiple machines: a primary workstation, a
shared compute node (COULOMBCORE), and Railiance cluster nodes. Agents running
on remote machines need to reach the state-hub API and MCP server, which live
on the workstation. Human operators and agents also need to reach remote
services (k3s API, Gitea, Temporal) from the workstation.

Two network postures were considered for how these components communicate:

**Option A (connectivity-first):** Components are connected by default via
controlled, observable access paths (reverse SSH tunnels managed by ops-bridge).
Isolation is added selectively where there is a specific threat model or
compliance reason to do so.

**Option B (isolation-first, zero-trust):** No component trusts any other by
default. Every connection requires mutual authentication, short-lived
credentials, and explicit authorisation at the point of use. Connectivity is
earned, not assumed.

This decision is architectural policy: it governs how ops-bridge tunnels are
designed, how agent-to-hub communication works, and how new infrastructure
components are onboarded.

## Decision

**Connectivity-first, with isolation as a deliberate option.**

The default posture for Custodian infrastructure is: components that need to
work together are connected. Access paths are explicit, observable, and managed
(via ops-bridge), but they are persistent by default rather than ephemeral.
Isolation is introduced where there is a specific, articulated reason, not as
a blanket policy applied uniformly.

## Rationale

### 1. Scale and team size

The infrastructure is operated by a single human and a bounded set of
automation agents. The overhead of zero-trust (credential rotation, SPIFFE/SPIRE
identity, mTLS everywhere, short-lived tokens per request) is disproportionate
for this scale. It would add significant operational complexity without
a commensurate security return.

### 2. Observability over perimeter hardening

The primary security control at this scale is **observability**: knowing what
connected, when, from where, and what it did. ops-bridge provides this: every
tunnel is named, actor-attributed, health-checked, and audited. A perimeter of
invisible short-lived connections would actually reduce observability.

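
To make "named, actor-attributed, health-checked, and audited" concrete, the sketch below models one audit record per tunnel event. The schema, field names, and event vocabulary are assumptions made for illustration; this ADR does not specify the format ops-bridge actually writes.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TunnelEvent:
    """One audit record for a tunnel state change (hypothetical schema)."""
    tunnel: str     # tunnel name: answers "what connected"
    actor: str      # owning human or agent: answers "who"
    source: str     # originating host: answers "from where"
    event: str      # e.g. "opened", "closed", "health_ok", "health_failed"
    timestamp: str  # ISO 8601 in UTC: answers "when"


def make_event(tunnel: str, actor: str, source: str, event: str,
               now: Optional[datetime] = None) -> TunnelEvent:
    """Build an event, stamped with the current UTC time unless `now` is given."""
    when = now if now is not None else datetime.now(timezone.utc)
    return TunnelEvent(tunnel, actor, source, event, when.isoformat())


def to_audit_line(record: TunnelEvent) -> str:
    """Serialise an event as a single JSON line with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Because every record carries a tunnel name and an actor, questions such as "which agent opened the state-hub tunnel from COULOMBCORE" reduce to filtering a log, which is the property this section argues for.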
### 3. The threat model does not require zero-trust today

The main threats are:

- A runaway agent consuming resources (mitigated by nproc/memory cgroups)
- A compromised workload reaching state-hub and corrupting state (mitigated by
  the read-model design of state-hub: the write surface is narrow and sanctioned)
- An external attacker reaching internal services (mitigated by the tunnels
  being reverse SSH, so no inbound ports are exposed)

Zero-trust would address a different threat model: lateral movement between
hostile tenants, or untrusted code running in the same environment as sensitive
data. That is not the current situation.

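
The "no inbound ports" property follows from how a reverse tunnel is established: the machine hosting the service dials out and asks the remote `sshd` to listen on its behalf. The sketch below builds such an OpenSSH invocation. It assumes the workstation initiates the connection; the host, user, and port values are hypothetical, and ops-bridge may assemble its command differently.

```python
from typing import List


def reverse_tunnel_command(remote_host: str, user: str,
                           remote_port: int, local_port: int) -> List[str]:
    """Build an ssh argv that exposes a local service on a remote host.

    The remote side gains a listener on its own loopback interface at
    `remote_port`; traffic is forwarded back to `local_port` on this machine.
    Nothing on this machine has to accept inbound connections.
    """
    return [
        "ssh",
        "-N",                              # forwarding only, no remote command
        "-o", "ExitOnForwardFailure=yes",  # exit if the remote listener cannot bind
        "-o", "ServerAliveInterval=30",    # probe the link every 30 seconds
        "-o", "ServerAliveCountMax=3",     # give up after 3 unanswered probes
        "-R", f"127.0.0.1:{remote_port}:127.0.0.1:{local_port}",
        f"{user}@{remote_host}",
    ]


# Hypothetical values: expose a local API on port 8000 to a remote node.
print(" ".join(reverse_tunnel_command("coulombcore", "ops", 8000, 8000)))
```

Binding the remote listener to `127.0.0.1` keeps the forwarded port reachable only from processes on the remote host itself, not from its network.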
### 4. Degrade-gracefully requires persistent connectivity

The Custodian's foundational value of **local-first, degrade-gracefully**
requires that agents can orient themselves even when some connections are slow
or partially degraded. Ephemeral connectivity (zero-trust tokens, service mesh
sidecars) introduces additional failure modes that conflict with graceful
degradation. Persistent SSH tunnels with auto-reconnect are more resilient to
intermittent conditions.

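
One way to picture auto-reconnect under intermittent conditions is as a capped exponential backoff schedule. The sketch below assumes that a `max_attempts` of 0 means "retry indefinitely", which this ADR implies but does not state; the base delay and cap are hypothetical values, not ops-bridge defaults.

```python
import itertools
from typing import Iterator


def reconnect_delays(max_attempts: int = 0, base: float = 1.0,
                     cap: float = 60.0) -> Iterator[float]:
    """Yield the wait in seconds before each reconnect attempt.

    Assumption: max_attempts == 0 means the tunnel never gives up.
    Delays double from `base` and are clamped at `cap`.
    """
    delay = min(base, cap)
    attempt = 0
    while max_attempts == 0 or attempt < max_attempts:
        yield delay
        delay = min(delay * 2, cap)  # clamp so the value never grows unbounded
        attempt += 1


# First eight waits of an unbounded schedule.
print(list(itertools.islice(reconnect_delays(), 8)))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The cap matters for a persistent tunnel: once the link comes back, the worst-case wait before the next attempt is bounded by `cap` rather than growing with the length of the outage.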
### 5. Isolation remains the right choice in specific cases

Connectivity-first does not mean no isolation. The following cases call for
explicit isolation and are handled separately:

- **Tenant separation** (when/if multi-user or multi-org): each tenant gets
  its own network segment
- **Privileged execution**: CI runners and agent actions with write access to
  production systems run in ephemeral, isolated environments (per the
  Privileged Execution Control standard)
- **Secrets**: credentials are never transmitted over tunnels in plaintext;
  they are age-encrypted at rest, with SOPS used for config

## Consequences

### Immediate

- ops-bridge tunnels are **persistent** (`max_attempts: 0`, auto-reconnect) and
  are treated as infrastructure, not one-off connections
- Agents on remote machines check tunnel health at session start and restore
  dropped tunnels before accessing state-hub (documented in global CLAUDE.md)
- New infrastructure components are onboarded with a named tunnel entry in
  `~/.config/bridge/tunnels.yaml`, not with ad-hoc SSH commands

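
The session-start check in the second item can be sketched as a plain TCP probe against each tunnel's local endpoint. This assumes tunnels surface as TCP listeners on the agent's machine; the tunnel names and ports are hypothetical, and the actual procedure is the one documented in the global CLAUDE.md.

```python
import socket
from typing import Dict, List, Tuple


def tunnel_is_healthy(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the tunnel endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False


def tunnels_to_restore(endpoints: Dict[str, Tuple[str, int]]) -> List[str]:
    """Return the names of tunnels whose endpoint does not accept connections."""
    return sorted(
        name for name, (host, port) in endpoints.items()
        if not tunnel_is_healthy(host, port)
    )


# Hypothetical endpoints an agent might check before touching state-hub.
dropped = tunnels_to_restore({"state-hub": ("127.0.0.1", 8000)})
print("tunnels needing restore:", dropped)
```

A successful connect only shows that something is listening on the port. It does not prove the service at the far end of the tunnel is responding, so an application-level request would be a stronger check.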
### Deferred

- If the infrastructure grows to multi-tenant or multi-operator, this decision
  should be revisited. At that point, isolation-first becomes proportionate.
- If a security audit identifies a specific lateral movement risk, targeted
  isolation (network policy, mTLS for that service) is the response, not a
  wholesale posture change.

## Alternatives Rejected

### Zero-trust / isolation-first

Rejected for current scale. The operational overhead (credential lifecycle,
service mesh, mutual TLS) is disproportionate, observability would decrease,
and the threat model does not require it. Noted for re-evaluation at
multi-tenant scale.

### VPN (WireGuard / Tailscale)

Considered briefly. A VPN would solve the connectivity problem but introduces
a persistent network layer that all traffic traverses, reducing the
explicitness of individual access paths. ops-bridge tunnels are per-service
and per-actor, which gives better observability and blast-radius control.
A VPN is not ruled out as a future complement but is not the primary approach.

### Ad-hoc SSH (no ops-bridge)

The pre-ops-bridge approach. Rejected because it has no health checks, no
actor attribution, no audit log, and requires manual intervention to restore
dropped tunnels. ops-bridge formalises the same SSH tunnel pattern with
operational discipline.