Implement SAND-WP-0008: host telemetry and self-canary

Add profile.sandbox-canary, HostSnapshot/inventory/stale schemas, SSH
collectors, before/after provision deltas, telemetry export to State Hub
and local JSON, default `sandboxer create` self-deploy, inspect/reap-stale
CLI, runbook, and CoulombCore verification (26 tests pass).
2026-06-23 19:53:51 +02:00
parent 582c1dd3c6
commit c0a9261cdc
22 changed files with 1047 additions and 26 deletions

@@ -36,11 +36,16 @@ make cli-version # smoke test: sandboxer version
Sandbox CLI (v0):
```bash
sandboxer create # canary self-deploy (profile.sandbox-canary)
sandboxer create --profile profile.compose-e2e --input repo=/path/to/repo
sandboxer get <id>
sandboxer list
sandboxer destroy <id>
sandboxer recreate <id>
sandboxer inspect host
sandboxer inspect stale
sandboxer reap-stale # dry-run; add --apply to remove
export SANDBOXER_COMPOSE_CMD=podman-compose # required on CoulombCore
```
Equivalent `uv` invocations without Make:

@@ -172,8 +172,13 @@ make lint # ruff check
make format # ruff format
make build # uv build
make cli-version # smoke test: sandboxer version
make smoke-remote # SAND-WP-0002 compose-e2e smoke
```
Canary self-deploy (SAND-WP-0008): `sandboxer create` with no args deploys
sand-boxer and returns `telemetry` (host metrics, stale inventory). See
`docs/runbooks/profile-sandbox-canary.md`.
Canonical detail: `.claude/rules/stack-and-commands.md`.
---

@@ -126,8 +126,8 @@ Additional boundaries:
- **Registry:** scaffold present (`registry/indexes/capabilities.yaml` empty;
`registry/capabilities/` placeholder); domain in index still `helix_forge`
from scaffold — needs alignment to `infotech`
- **Workplans:** `SAND-WP-0001` finished; `SAND-WP-0002` finished;
`SAND-WP-0008` ready (host telemetry / self-canary)
- **Workplans:** `SAND-WP-0001`–`0002` finished; `SAND-WP-0008` finished
(host telemetry / self-canary)
- **Lineage (external, not yet migrated):** `the-custodian/e2e-framework/`
(CUST-WP-0028, completed) and `infra/build-machines/` (CUST-WP-0032)

95
docs/host-telemetry.md Normal file
@@ -0,0 +1,95 @@
# Host telemetry contract
Version 0.1 — SAND-WP-0008. Extends `docs/meta-framework.md` Host resource with
read-only observability. sand-boxer collects and exports telemetry; it does not
own long-term metrics storage.
---
## Types
### HostSnapshot
Point-in-time host metrics collected over SSH (≤10s, non-root-safe).
| Field | Description |
|-------|-------------|
| `load_1m`, `load_5m`, `load_15m` | `/proc/loadavg` |
| `cpu_count` | Logical CPUs |
| `mem_total_mb`, `mem_available_mb` | From `free -m` |
| `disk_root_used_pct`, `disk_root_avail_gb` | Root filesystem |
| `running_containers` | All running containers (podman/docker) |
| `sandbox_containers` | Containers with `sbx-*` compose project label |
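
As an illustration of how two of these fields can be derived, here is a minimal sketch of parsing `/proc/loadavg` and the `Mem:` row of `free -m`. It mirrors the parser names used in this commit, but the sample inputs are made up and the collector's real SSH plumbing is left out.

```python
def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    parts = text.split()
    return float(parts[0]), float(parts[1]), float(parts[2])


def parse_free_m(text: str) -> tuple[int, int]:
    """Return (total_mb, available_mb) from the `Mem:` row of `free -m`."""
    for line in text.splitlines():
        if line.startswith("Mem:"):
            parts = line.split()
            # Columns after "Mem:": total used free shared buff/cache available
            return int(parts[1]), int(parts[6])
    # No Mem: row found (unexpected output); fall back to zeros.
    return 0, 0


# Illustrative sample output, not captured from a real host.
SAMPLE_FREE = """\
               total        used        free      shared  buff/cache   available
Mem:           15884        4321        2048         312        9515       10987
Swap:           2047           0        2047
"""

load_1m, load_5m, load_15m = parse_loadavg("0.52 0.48 0.45 1/234 999")
mem_total_mb, mem_available_mb = parse_free_m(SAMPLE_FREE)
```

Returning zeros on unparseable output keeps a snapshot well-formed when a command fails, at the cost of making "no data" look like a real zero.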
### SandboxInventory
Known sandbox artifacts on a host.
| Entry type | Source |
|------------|--------|
| `directory` | `{base_dir}/{sandbox_id}` |
| `compose_project` | `sbx-*` or legacy `e2e-*` compose labels |
Each entry: `id`, `path`, `age_hours`, `profile_hint` (inferred from project name).
### StaleCandidate
| Kind | Meaning | Suggested action |
|------|---------|------------------|
| `orphan_dir` | Dir on host, not in local store | `reap` |
| `orphan_compose` | Compose project on host, not in store | `reap` |
| `zombie_record` | Store record not `destroyed`, missing on host | `inspect` |
| `aged_dir` | Dir older than threshold | `reap` |
Actions: `reap`, `inspect`, `ignore`. Automatic reap requires `--apply` on CLI.
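
The table above can be read as a small classification routine. The sketch below is a simplified illustration only: `Entry` and `classify` are hypothetical names, the store is reduced to a set of non-destroyed sandbox ids, and ageing is applied to directories only, as the table describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical stand-in for one inventory entry found on the host."""
    kind: str  # "directory" or "compose_project"
    id: str
    age_hours: Optional[float] = None


def classify(entries, live_ids, threshold_hours=24.0):
    """Return (kind, id, action) tuples following the StaleCandidate table."""
    found = []
    on_host = {e.id for e in entries}
    for e in entries:
        if e.id not in live_ids:
            # On the host but unknown to the local store: orphan.
            kind = "orphan_dir" if e.kind == "directory" else "orphan_compose"
            found.append((kind, e.id, "reap"))
        elif (
            e.kind == "directory"
            and e.age_hours is not None
            and e.age_hours >= threshold_hours
        ):
            # Known to the store, but older than the threshold.
            found.append(("aged_dir", e.id, "reap"))
    for sid in sorted(live_ids - on_host):
        # Recorded as live in the store, but nothing found on the host.
        found.append(("zombie_record", sid, "inspect"))
    return found


candidates = classify(
    [
        Entry("directory", "a1", 2.0),
        Entry("directory", "b2", 30.0),
        Entry("compose_project", "c3"),
    ],
    live_ids={"b2", "d4"},
)
```

Zombie records get `inspect` rather than `reap` because there is nothing on the host to remove; only the store record needs review.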
### ProvisionDelta
`before` and `after` HostSnapshot pair with computed deltas:
- `load_1m_delta`, `mem_available_mb_delta`, `running_containers_delta`, `sandbox_containers_delta`
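
A minimal sketch of the delta arithmetic, using plain dicts in place of `HostSnapshot` models (an assumption made for brevity; the values are illustrative):

```python
def compute_delta(before: dict, after: dict) -> dict:
    """Subtract `before` from `after` for the fields tracked in ProvisionDelta."""
    return {
        # Load is a float; round to avoid noise such as 0.12000000000000005.
        "load_1m_delta": round(after["load_1m"] - before["load_1m"], 3),
        "mem_available_mb_delta": after["mem_available_mb"] - before["mem_available_mb"],
        "running_containers_delta": after["running_containers"] - before["running_containers"],
    }


before = {"load_1m": 0.40, "mem_available_mb": 12000, "running_containers": 3}
after = {"load_1m": 0.52, "mem_available_mb": 11952, "running_containers": 5}
delta = compute_delta(before, after)
```

A negative `mem_available_mb_delta` after provisioning means the sandbox consumed memory. After destroy, a positive value indicates recovery.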
### IntrospectionReport
Bundled canary output attached to `SandboxStatus.telemetry` on `ready`:
```json
{
"schema_version": "0.1",
"host": "92.205.130.254",
"sandbox_id": "abc12345",
"profile_id": "profile.sandbox-canary",
"collected_at": "2026-06-23T...",
"provision_delta": { "before": {}, "after": {}, "load_1m_delta": 0.1 },
"inventory": { "entries": [], "host": "..." },
"stale_candidates": []
}
```
---
## Privacy and retention
- No secret paths, env files, or full `docker inspect` dumps
- Telemetry JSON retained locally under `~/.local/share/sandboxer/telemetry/` (or `$XDG_DATA_HOME/sandboxer/telemetry/` when `XDG_DATA_HOME` is set)
- State Hub events include report in `detail` — same redaction rules apply
- Operators may set `SANDBOXER_NO_STATE_HUB=1` to skip remote emission
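
The opt-out in the last bullet amounts to an environment check before any remote call. A minimal sketch follows; the helper name `state_hub_disabled` is hypothetical, and the accepted values (`1`, `true`, `yes`, case-insensitive) follow the export code in this commit.

```python
import os
from typing import Mapping, Optional


def state_hub_disabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when remote State Hub emission should be skipped."""
    env = os.environ if env is None else env
    value = env.get("SANDBOXER_NO_STATE_HUB", "")
    return value.lower() in ("1", "true", "yes")


# Passing a mapping explicitly keeps the check easy to test.
skip_remote = state_hub_disabled({"SANDBOXER_NO_STATE_HUB": "1"})
```

Local JSON export is unaffected by this flag; it only suppresses the remote event.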
---
## Export sinks
| Sink | Status |
|------|--------|
| State Hub `progress/` | Implemented |
| Local JSON artifact | Implemented |
| `TelemetrySink` protocol | Stub for artifact-store / Prometheus / ClickHouse |
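
To show how a future sink could plug into the `TelemetrySink` protocol, here is a hedged sketch. `InMemorySink` and `fan_out` are hypothetical names, and the report is a plain dict rather than an `IntrospectionReport`.

```python
from typing import Protocol


class TelemetrySink(Protocol):
    """Anything with a publish(report) method qualifies as a sink."""

    def publish(self, report: dict) -> None: ...


class InMemorySink:
    """Illustrative sink that simply keeps reports in a list."""

    def __init__(self) -> None:
        self.reports: list = []

    def publish(self, report: dict) -> None:
        self.reports.append(report)


def fan_out(report: dict, sinks: list) -> None:
    """Publish one report to every configured sink, in order."""
    for sink in sinks:
        sink.publish(report)


first, second = InMemorySink(), InMemorySink()
fan_out({"schema_version": "0.1", "sandbox_id": "abc12345"}, [first, second])
```

Because `Protocol` uses structural typing, a sink does not need to inherit from `TelemetrySink`; matching the `publish` signature is enough.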
---
## Profile trigger
Telemetry collection runs when:
- Profile id is `profile.sandbox-canary`, or
- `profile.metadata.observability` is `canary`
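
The two trigger conditions reduce to a short predicate. This sketch takes the profile id and a metadata dict instead of the `Profile` model (a simplification for illustration), and the function name `wants_telemetry` is hypothetical.

```python
CANARY_PROFILE_ID = "profile.sandbox-canary"


def wants_telemetry(profile_id: str, metadata: dict) -> bool:
    """True if the profile is the canary or opts in via metadata."""
    if profile_id == CANARY_PROFILE_ID:
        return True
    # Default mirrors the schema: observability is "none" unless set.
    return metadata.get("observability", "none") == "canary"


canary = wants_telemetry("profile.sandbox-canary", {})
opted_in = wants_telemetry("profile.compose-e2e", {"observability": "canary"})
plain = wants_telemetry("profile.compose-e2e", {})
```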

@@ -14,7 +14,7 @@ agent harnessing, validation, and code generation.
|----------|-------------|
| **Profile** | Named, versioned sandbox recipe: extension binding, isolation, network, TTL, placement |
| **Extension** | Backend adapter implementing provision / wait_ready / teardown |
| **Host** | Registered placement target for self-hosted extensions |
| **Host** | Registered placement target for self-hosted extensions; read-only telemetry via `profile.sandbox-canary` (see `docs/host-telemetry.md`) |
| **Sandbox** | Running instance of a profile |
| **Snapshot** | Point-in-time workspace checkpoint (deferred — SAND-WP-0003) |
| **Route** | Extension selection policy when multiple backends qualify |

@@ -0,0 +1,58 @@
# Runbook: profile.sandbox-canary
Self-deploy sand-boxer to verify host health and return telemetry.
## Quick start
```bash
export SANDBOXER_HOST=coulombcore
export SANDBOXER_COMPOSE_CMD=podman-compose # CoulombCore
sandboxer create # no args — canary self-deploy + IntrospectionReport
```
## What you get on `ready`
`SandboxStatus.telemetry` contains:
- **provision_delta** — host load/memory/container counts before vs after
- **inventory** — sandbox dirs and compose projects on host
- **stale_candidates** — orphans and aged sandboxes (dry-run recommendations)
Human summary prints to stderr:
```
Telemetry: load Δ +0.120, mem avail Δ -48 MB, stale candidates: 0
```
Artifacts: `~/.local/share/sandboxer/telemetry/<sandbox_id>.json`
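
For scripting against these artifacts, the path can be resolved as sketched below. This assumes the same `XDG_DATA_HOME` override that the export code in this commit honors; `telemetry_path` and `load_report` are hypothetical helper names.

```python
import json
import os
from pathlib import Path


def telemetry_path(sandbox_id: str) -> Path:
    """Resolve the local telemetry artifact path for a sandbox id."""
    base = Path(os.environ.get("XDG_DATA_HOME", Path.home() / ".local" / "share"))
    return base / "sandboxer" / "telemetry" / f"{sandbox_id}.json"


def load_report(sandbox_id: str) -> dict:
    """Read a previously exported report; raises FileNotFoundError if absent."""
    return json.loads(telemetry_path(sandbox_id).read_text())
```

The artifact is keyed by sandbox id, so a later export for the same id (for example on destroy) overwrites the earlier file.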
## Inspect without creating
```bash
sandboxer inspect host
sandboxer inspect stale --older-than 24
sandboxer reap-stale # dry-run is the default; there is no --dry-run flag
sandboxer reap-stale --apply --older-than 48 # destructive — review dry-run first
```
## Destroy
```bash
sandboxer destroy <sandbox_id>
```
Destroy telemetry includes **destroy_delta** (load recovery after teardown).
## Verification checklist (SAND-WP-0008-T10)
1. `sandboxer create` → `ready` + `telemetry.provision_delta`
2. `sandboxer inspect host` → metrics consistent with create report
3. Fake stale dir: `ssh host 'mkdir -p /tmp/sandboxer/fake99'` → appears in `inspect stale`
4. `sandboxer destroy` → `destroy_delta` shows load/mem recovery
## Optimization notes (activity-core follow-up)
- Schedule periodic `sandboxer create` canary on sandboxer01
- Reap policy: `--older-than 24` with human-approved `--apply`
- Disk pressure alerts when `disk_root_avail_gb` < threshold

@@ -0,0 +1,32 @@
id: profile.sandbox-canary
version: "1.0.0"
extension: ext.compose-ssh
isolation:
level: container
network:
default: deny
egress: []
workspace:
mode: remote-canonical
access: rw
scope_default: session
ttl:
default: 1h
max: 4h
idle_reap: null
resources:
cpu: null
memory_mb: null
setup:
instructions: ""
secret_refs: []
placement:
prefer: [sandboxer01]
fallback: [coulombcore]
reachability:
tunnel: ops-bridge
identity: ops-warden
metadata:
cost_class: self-hosted
latency_class: standard
observability: canary

@@ -9,13 +9,22 @@ import typer
from sandboxer import __version__
from sandboxer.core.manager import SandboxManager
from sandboxer.defaults import resolve_create_defaults
from sandboxer.models import ActorType, Consumer, SandboxCreateRequest
from sandboxer.placement import resolve_host
from sandboxer.profiles.loader import load_profile
from sandboxer.telemetry.export import export_telemetry
from sandboxer.telemetry.introspection import build_introspection_report, collect_host_snapshot
from sandboxer.telemetry.inventory import HostInventoryScanner
from sandboxer.telemetry.reap import reap_stale
app = typer.Typer(
name="sandboxer",
help="Provision and manage isolated sandbox environments.",
no_args_is_help=True,
)
inspect_app = typer.Typer(help="Host introspection without provisioning.")
app.add_typer(inspect_app, name="inspect")
@app.callback()
@@ -39,13 +48,36 @@ def _parse_inputs(values: list[str]) -> dict[str, str]:
return inputs
def _print_status(status: object) -> None:
typer.echo(json.dumps(status, default=str, indent=2))
def _print_json(data: object) -> None:
typer.echo(json.dumps(data, default=str, indent=2))
def _print_telemetry_summary(telemetry: dict | None) -> None:
if not telemetry:
return
delta = telemetry.get("provision_delta") or telemetry.get("destroy_delta")
stale = telemetry.get("stale_candidates", [])
if delta:
typer.echo(
f"\nTelemetry: load Δ {delta.get('load_1m_delta', 0):+.3f}, "
f"mem avail Δ {delta.get('mem_available_mb_delta', 0):+d} MB, "
f"stale candidates: {len(stale)}",
err=True,
)
after = delta.get("after") if delta else None
if after:
typer.echo(
f" host load={after.get('load_1m')} mem_avail={after.get('mem_available_mb')} MB "
f"disk_free={after.get('disk_root_avail_gb')} GB",
err=True,
)
@app.command("create")
def sandbox_create(
profile: Annotated[str, typer.Option("--profile", help="Profile id")],
profile: Annotated[
str | None, typer.Option("--profile", help="Profile id (default: canary self-deploy)")
] = None,
input: Annotated[
list[str] | None,
typer.Option("--input", help="Input key=value (repeatable)"),
@@ -54,10 +86,12 @@ def sandbox_create(
project: Annotated[str, typer.Option(help="Calling project id")] = "sand-boxer",
host: Annotated[str | None, typer.Option(help="Override placement host")] = None,
) -> None:
"""Provision a sandbox from a profile."""
"""Provision a sandbox. No args → canary self-deploy of sand-boxer."""
parsed = _parse_inputs(input or [])
resolved_profile, resolved_inputs = resolve_create_defaults(profile, parsed)
request = SandboxCreateRequest(
profile=profile,
inputs=_parse_inputs(input or []),
profile=resolved_profile,
inputs=resolved_inputs,
consumer=Consumer(actor=ActorType(actor), project=project),
)
manager = SandboxManager()
@@ -66,7 +100,9 @@ def sandbox_create(
except Exception as exc:
typer.echo(f"Error: {exc}", err=True)
raise typer.Exit(code=1) from exc
_print_status(status.model_dump(mode="json"))
payload = status.model_dump(mode="json")
_print_json(payload)
_print_telemetry_summary(status.telemetry)
@app.command("get")
@@ -76,7 +112,7 @@ def sandbox_get(sandbox_id: str) -> None:
if not status:
typer.echo(f"Sandbox not found: {sandbox_id}", err=True)
raise typer.Exit(code=1)
_print_status(status.model_dump(mode="json"))
_print_json(status.model_dump(mode="json"))
@app.command("list")
@@ -87,7 +123,7 @@ def sandbox_list(
items = SandboxManager().list()
if state:
items = [s for s in items if s.state.value == state]
_print_status([s.model_dump(mode="json") for s in items])
_print_json([s.model_dump(mode="json") for s in items])
@app.command("destroy")
@@ -99,7 +135,8 @@ def sandbox_destroy(sandbox_id: str) -> None:
except KeyError as exc:
typer.echo(str(exc), err=True)
raise typer.Exit(code=1) from exc
_print_status(status.model_dump(mode="json"))
_print_json(status.model_dump(mode="json"))
_print_telemetry_summary(status.telemetry)
@app.command("recreate")
@@ -111,7 +148,72 @@ def sandbox_recreate(sandbox_id: str) -> None:
except (KeyError, Exception) as exc:
typer.echo(f"Error: {exc}", err=True)
raise typer.Exit(code=1) from exc
_print_status(status.model_dump(mode="json"))
_print_json(status.model_dump(mode="json"))
@inspect_app.command("host")
def inspect_host(
host: Annotated[str | None, typer.Option(help="Sandbox host")] = None,
profile_id: Annotated[
str, typer.Option(help="Profile for placement resolution")
] = "profile.sandbox-canary",
) -> None:
"""Host snapshot and inventory (no sandbox create)."""
profile = load_profile(profile_id)
resolved = resolve_host(profile, override=host)
snapshot = collect_host_snapshot(resolved)
scanner = HostInventoryScanner(resolved)
inventory = scanner.scan_inventory()
stale = scanner.find_stale(SandboxManager().store)
report = build_introspection_report(
host=resolved,
sandbox_id="inspect",
profile=profile,
provision_before=snapshot,
provision_after=snapshot,
store=SandboxManager().store,
)
export_telemetry(report)
_print_json(
{
"host_snapshot": snapshot.model_dump(mode="json"),
"inventory": inventory.model_dump(mode="json"),
"stale_candidates": [s.model_dump(mode="json") for s in stale],
}
)
@inspect_app.command("stale")
def inspect_stale(
host: Annotated[str | None, typer.Option(help="Sandbox host")] = None,
older_than: Annotated[float, typer.Option(help="Stale threshold hours")] = 24.0,
) -> None:
"""List stale sandbox candidates."""
profile = load_profile("profile.sandbox-canary")
resolved = resolve_host(profile, override=host)
scanner = HostInventoryScanner(resolved, stale_hours=older_than)
stale = scanner.find_stale(SandboxManager().store, stale_hours=older_than)
_print_json([s.model_dump(mode="json") for s in stale])
@app.command("reap-stale")
def reap_stale_cmd(
host: Annotated[str | None, typer.Option(help="Sandbox host")] = None,
older_than: Annotated[float, typer.Option(help="Reap threshold hours")] = 24.0,
apply: Annotated[bool, typer.Option("--apply", help="Actually remove stale resources")] = False,
) -> None:
"""Report or remove stale sandboxes on host (default: dry-run)."""
profile = load_profile("profile.sandbox-canary")
resolved = resolve_host(profile, override=host)
results = reap_stale(
resolved,
SandboxManager().store,
dry_run=not apply,
stale_hours=older_than,
)
mode = "apply" if apply else "dry-run"
typer.echo(f"reap-stale ({mode}): {len(results)} candidate(s)", err=True)
_print_json([r.model_dump(mode="json") for r in results])
if __name__ == "__main__":

@@ -13,6 +13,12 @@ from sandboxer.models import (
)
from sandboxer.placement import resolve_host
from sandboxer.profiles.loader import load_profile
from sandboxer.telemetry.export import export_telemetry
from sandboxer.telemetry.introspection import (
build_introspection_report,
collect_host_snapshot,
profile_wants_telemetry,
)
class SandboxManager:
@@ -24,6 +30,8 @@ class SandboxManager:
extension = load_extension(profile.extension)
backend = resolve_backend(extension)
resolved_host = resolve_host(profile, override=host)
wants_telemetry = profile_wants_telemetry(profile)
base_dir = extension.config.get("base_dir", "/tmp/sandboxer")
now = utcnow()
status = SandboxStatus(
@@ -43,6 +51,10 @@ class SandboxManager:
status.updated_at = utcnow()
emit_lifecycle_event(status, event_type=event_type_for_state(status.state))
provision_before = None
if wants_telemetry:
provision_before = collect_host_snapshot(resolved_host)
try:
handle = backend.provision(profile, request.inputs, resolved_host)
status.sandbox_id = handle["sandbox_id"]
@@ -54,6 +66,21 @@ class SandboxManager:
status.state = SandboxState.READY
status.ready_at = utcnow()
status.updated_at = status.ready_at
if wants_telemetry and provision_before:
provision_after = collect_host_snapshot(resolved_host)
report = build_introspection_report(
host=resolved_host,
sandbox_id=status.sandbox_id,
profile=profile,
provision_before=provision_before,
provision_after=provision_after,
store=self.store,
base_dir=base_dir,
)
status.telemetry = report.model_dump(mode="json")
export_telemetry(report)
self.store.save(status)
emit_lifecycle_event(status, event_type=event_type_for_state(status.state))
return status
@@ -86,6 +113,12 @@ class SandboxManager:
profile = load_profile(status.profile_id)
extension = load_extension(profile.extension)
backend = resolve_backend(extension)
wants_telemetry = profile_wants_telemetry(profile)
base_dir = extension.config.get("base_dir", "/tmp/sandboxer")
destroy_before = None
if wants_telemetry and status.host:
destroy_before = collect_host_snapshot(status.host)
status.state = SandboxState.DESTROYING
status.updated_at = utcnow()
@@ -106,6 +139,21 @@ class SandboxManager:
status.state = SandboxState.DESTROYED
status.destroyed_at = utcnow()
status.updated_at = status.destroyed_at
if wants_telemetry and destroy_before and status.host:
destroy_after = collect_host_snapshot(status.host)
report = build_introspection_report(
host=status.host,
sandbox_id=status.sandbox_id,
profile=profile,
destroy_before=destroy_before,
destroy_after=destroy_after,
store=self.store,
base_dir=base_dir,
)
status.telemetry = report.model_dump(mode="json")
export_telemetry(report)
self.store.save(status)
emit_lifecycle_event(status, event_type=event_type_for_state(status.state))
return status

30
src/sandboxer/defaults.py Normal file
@@ -0,0 +1,30 @@
"""Default paths and profile resolution for CLI."""
from __future__ import annotations
import os
from pathlib import Path
DEFAULT_CANARY_PROFILE = "profile.sandbox-canary"
DEFAULT_COMPOSE_PROFILE = "profile.compose-e2e"
def repo_root() -> Path:
override = os.environ.get("SANDBOXER_REPO_ROOT")
if override:
return Path(override).expanduser().resolve()
return Path(__file__).resolve().parents[2]
def resolve_create_defaults(
profile: str | None,
inputs: dict[str, str],
) -> tuple[str, dict[str, str]]:
"""Apply default profile and repo per SAND-WP-0008-T06."""
resolved = dict(inputs)
user_repo = "repo" in resolved
if not user_repo:
resolved["repo"] = str(repo_root())
if profile is None:
profile = DEFAULT_COMPOSE_PROFILE if user_repo else DEFAULT_CANARY_PROFILE
return profile, resolved

@@ -39,6 +39,7 @@ def emit_lifecycle_event(
"actor_type": status.consumer.actor.value,
"state": status.state.value,
"reachability": status.reachability.model_dump() if status.reachability else None,
"telemetry": status.telemetry,
"timestamps": {
"created_at": status.created_at.isoformat(),
"updated_at": status.updated_at.isoformat(),

@@ -83,6 +83,7 @@ class ReachabilitySpec(BaseModel):
class ProfileMetadata(BaseModel):
cost_class: Literal["self-hosted", "saas-metered"] = "self-hosted"
latency_class: str = "standard"
observability: Literal["none", "canary"] = "none"
class Profile(BaseModel):
@@ -141,6 +142,7 @@ class SandboxStatus(BaseModel):
reachability: Reachability | None = None
inputs: dict[str, str] = Field(default_factory=dict)
error: str | None = None
telemetry: dict | None = None # IntrospectionReport JSON when canary
created_at: datetime
updated_at: datetime
ready_at: datetime | None = None

@@ -0,0 +1 @@
"""Host telemetry and introspection."""

@@ -0,0 +1,64 @@
"""Telemetry export sinks."""
from __future__ import annotations
import json
import os
from pathlib import Path
from typing import Protocol
import httpx
from sandboxer.lifecycle.state_hub import hub_url
from sandboxer.telemetry.models import IntrospectionReport
class TelemetrySink(Protocol):
"""Future export target (artifact-store, Prometheus, ClickHouse)."""
def publish(self, report: IntrospectionReport) -> None: ...
def telemetry_dir() -> Path:
base = Path(os.environ.get("XDG_DATA_HOME", Path.home() / ".local" / "share"))
path = base / "sandboxer" / "telemetry"
path.mkdir(parents=True, exist_ok=True)
return path
def export_local_artifact(report: IntrospectionReport) -> Path:
path = telemetry_dir() / f"{report.sandbox_id}.json"
path.write_text(json.dumps(report.model_dump(mode="json"), indent=2, default=str))
return path
def export_state_hub(report: IntrospectionReport) -> dict | None:
if os.environ.get("SANDBOXER_NO_STATE_HUB", "").lower() in ("1", "true", "yes"):
return None
payload = {
"event_type": "note",
"summary": (
f"Telemetry {report.sandbox_id}: load Δ "
f"{report.provision_delta.load_1m_delta if report.provision_delta else 0}, "
f"stale={len(report.stale_candidates)}"
),
"author": "sandboxer",
"detail": report.model_dump(mode="json"),
}
try:
response = httpx.post(f"{hub_url()}/progress/", json=payload, timeout=10.0)
response.raise_for_status()
return response.json()
except httpx.HTTPError:
return None
def export_telemetry(report: IntrospectionReport) -> Path:
path = export_local_artifact(report)
export_state_hub(report)
return path
class NoopTelemetrySink:
def publish(self, report: IntrospectionReport) -> None:
export_telemetry(report)

@@ -0,0 +1,122 @@
"""Collect HostSnapshot over SSH."""
from __future__ import annotations
from datetime import datetime
from sandboxer.extensions.ssh import SSHConfig
from sandboxer.lifecycle.store import utcnow
from sandboxer.telemetry.models import HostSnapshot
def parse_loadavg(text: str) -> tuple[float, float, float]:
parts = text.strip().split()
return float(parts[0]), float(parts[1]), float(parts[2])
def parse_meminfo(text: str) -> tuple[int, int]:
total = avail = 0
for line in text.splitlines():
if line.startswith("MemTotal:"):
total = int(line.split()[1]) // 1024
elif line.startswith("MemAvailable:"):
avail = int(line.split()[1]) // 1024
return total, avail
def parse_free_m(text: str) -> tuple[int, int]:
for line in text.splitlines():
if line.startswith("Mem:"):
parts = line.split()
return int(parts[1]), int(parts[6])
return 0, 0
def parse_df_root(text: str) -> tuple[float, float]:
line = text.strip().splitlines()[-1]
parts = line.split()
used_pct = float(parts[4].rstrip("%"))
avail = parts[3]
mult = 1.0
if avail[-1] in "KMGT":
mult = {"K": 1 / 1e6, "M": 1 / 1e3, "G": 1.0, "T": 1000.0}[avail[-1]]
avail = avail[:-1]
return used_pct, float(avail) * mult
def parse_container_count(text: str) -> int:
lines = [ln for ln in text.strip().splitlines() if ln.strip()]
return len(lines)
class HostSnapshotCollector:
def __init__(
self, host: str, *, ssh_user: str | None = None, ssh_key: str | None = None
) -> None:
self.ssh = SSHConfig.from_env(host, user=ssh_user, key=ssh_key)
self.host = host
def collect(self, *, collected_at: datetime | None = None) -> HostSnapshot:
when = collected_at or utcnow()
runtime = self._detect_runtime()
load = self._run("cat /proc/loadavg")
cpu = self._run("nproc")
mem = self._run("free -m | awk '/^Mem:/{print $2\" \"$7}'")
disk = self._run("df -h / | tail -1")
running = self._run(f"{runtime} ps -q 2>/dev/null")
sandbox = self._run(
f"{runtime} ps -q --filter label=io.podman.compose.project=sbx 2>/dev/null"
)
if sandbox == "" and runtime == "docker":
sandbox = self._run(
"docker ps -q --filter label=com.docker.compose.project=sbx 2>/dev/null"
)
load_vals = parse_loadavg(load) if load else (0.0, 0.0, 0.0)
cpu_count = int(cpu.strip()) if cpu.strip().isdigit() else 0
if mem and mem.strip():
parts = mem.strip().split()
mem_total, mem_avail = int(parts[0]), int(parts[1])
else:
mem_total, mem_avail = 0, 0
disk_used, disk_avail = parse_df_root(disk) if disk else (0.0, 0.0)
return HostSnapshot(
collected_at=when,
host=self.host,
load_1m=load_vals[0],
load_5m=load_vals[1],
load_15m=load_vals[2],
cpu_count=cpu_count,
mem_total_mb=mem_total,
mem_available_mb=mem_avail,
disk_root_used_pct=disk_used,
disk_root_avail_gb=disk_avail,
running_containers=parse_container_count(running),
sandbox_containers=parse_container_count(sandbox),
container_runtime=runtime,
)
def _detect_runtime(self) -> str:
rc, _ = self.ssh.run("command -v podman")
if rc == 0:
return "podman"
rc, _ = self.ssh.run("command -v docker")
if rc == 0:
return "docker"
return "unknown"
def _run(self, cmd: str) -> str:
rc, out = self.ssh.run(cmd, timeout=10)
if rc != 0:
return ""
return out
def compute_delta(before: HostSnapshot, after: HostSnapshot) -> dict[str, float | int]:
return {
"load_1m_delta": round(after.load_1m - before.load_1m, 3),
"mem_available_mb_delta": after.mem_available_mb - before.mem_available_mb,
"running_containers_delta": after.running_containers - before.running_containers,
"sandbox_containers_delta": after.sandbox_containers - before.sandbox_containers,
}

@@ -0,0 +1,65 @@
"""Assemble IntrospectionReport for canary profiles."""
from __future__ import annotations
from sandboxer.lifecycle.store import SandboxStore, utcnow
from sandboxer.models import Profile
from sandboxer.telemetry.host_snapshot import HostSnapshot, compute_delta
from sandboxer.telemetry.inventory import HostInventoryScanner
from sandboxer.telemetry.models import IntrospectionReport, ProvisionDelta
def profile_wants_telemetry(profile: Profile) -> bool:
if profile.id == "profile.sandbox-canary":
return True
return profile.metadata.observability == "canary"
def build_introspection_report(
*,
host: str,
sandbox_id: str,
profile: Profile,
store: SandboxStore,
base_dir: str = "/tmp/sandboxer",
provision_before: HostSnapshot | None = None,
provision_after: HostSnapshot | None = None,
destroy_before: HostSnapshot | None = None,
destroy_after: HostSnapshot | None = None,
) -> IntrospectionReport:
scanner = HostInventoryScanner(host, base_dir=base_dir)
inventory = scanner.scan_inventory()
stale = scanner.find_stale(store)
provision_delta = None
if provision_before and provision_after:
provision_delta = ProvisionDelta(
before=provision_before,
after=provision_after,
**compute_delta(provision_before, provision_after),
)
destroy_delta = None
if destroy_before and destroy_after:
destroy_delta = ProvisionDelta(
before=destroy_before,
after=destroy_after,
**compute_delta(destroy_before, destroy_after),
)
return IntrospectionReport(
host=host,
sandbox_id=sandbox_id,
profile_id=profile.id,
collected_at=utcnow(),
provision_delta=provision_delta,
destroy_delta=destroy_delta,
inventory=inventory,
stale_candidates=stale,
)
def collect_host_snapshot(host: str) -> HostSnapshot:
from sandboxer.telemetry.host_snapshot import HostSnapshotCollector
return HostSnapshotCollector(host).collect()

@@ -0,0 +1,177 @@
"""Sandbox inventory and stale candidate discovery."""
from __future__ import annotations
import re
from datetime import UTC, datetime
from sandboxer.extensions.ssh import SSHConfig
from sandboxer.lifecycle.store import SandboxStore, utcnow
from sandboxer.models import SandboxState
from sandboxer.telemetry.models import InventoryEntry, SandboxInventory, StaleCandidate
_PROJECT_RE = re.compile(r"^(sbx-.+|e2e-.+)$")
def _age_hours(epoch_str: str) -> float | None:
try:
# stat format %Y
ts = int(epoch_str.strip())
return round((datetime.now(UTC).timestamp() - ts) / 3600, 2)
except ValueError:
return None
def _profile_hint_from_project(project: str) -> str | None:
if project.startswith("sbx-"):
parts = project.split("-")
if len(parts) >= 3:
return f"profile.{parts[1]}"
return None
class HostInventoryScanner:
def __init__(
self,
host: str,
*,
base_dir: str = "/tmp/sandboxer",
ssh_user: str | None = None,
stale_hours: float = 24.0,
) -> None:
self.host = host
self.base_dir = base_dir
self.ssh = SSHConfig.from_env(host, user=ssh_user)
self.stale_hours = stale_hours
def scan_inventory(self) -> SandboxInventory:
when = utcnow()
entries: list[InventoryEntry] = []
entries.extend(self._scan_directories())
entries.extend(self._scan_compose_projects())
return SandboxInventory(
host=self.host,
base_dir=self.base_dir,
collected_at=when,
entries=entries,
)
def find_stale(
self,
store: SandboxStore,
*,
stale_hours: float | None = None,
) -> list[StaleCandidate]:
threshold = stale_hours if stale_hours is not None else self.stale_hours
inventory = self.scan_inventory()
on_host_ids = {e.id for e in inventory.entries}
store_by_id = {
s.sandbox_id: s
for s in store.list_all()
if s.state != SandboxState.DESTROYED
}
candidates: list[StaleCandidate] = []
for entry in inventory.entries:
in_store = entry.id in store_by_id
if not in_store:
kind = "orphan_dir" if entry.kind == "directory" else "orphan_compose"
candidates.append(
StaleCandidate(
kind=kind,
id=entry.id,
path=entry.path,
age_hours=entry.age_hours,
action="reap",
reason="present on host but absent from local store",
)
)
elif entry.age_hours is not None and entry.age_hours >= threshold:
candidates.append(
StaleCandidate(
kind="aged_dir" if entry.kind == "directory" else "orphan_compose",
id=entry.id,
path=entry.path,
age_hours=entry.age_hours,
action="reap",
reason=f"older than {threshold}h threshold",
)
)
for sid, status in store_by_id.items():
if sid not in on_host_ids and status.host == self.host:
candidates.append(
StaleCandidate(
kind="zombie_record",
id=sid,
path=status.reachability.remote_dir if status.reachability else None,
age_hours=None,
action="inspect",
reason="recorded in store but missing on host",
)
)
return candidates
def _scan_directories(self) -> list[InventoryEntry]:
cmd = (
f"if [ -d {self.base_dir} ]; then "
f"find {self.base_dir} -mindepth 1 -maxdepth 1 -type d "
f"-printf '%f %Y\\n'; fi"
)
rc, out = self.ssh.run(cmd, timeout=15)
if rc != 0 or not out.strip():
return []
entries = []
for line in out.strip().splitlines():
parts = line.split()
if len(parts) < 2:
continue
sid, mtime = parts[0], parts[1]
entries.append(
InventoryEntry(
kind="directory",
id=sid,
path=f"{self.base_dir}/{sid}",
age_hours=_age_hours(mtime),
)
)
return entries
def _scan_compose_projects(self) -> list[InventoryEntry]:
runtime = "podman" if self._has_podman() else "docker"
if runtime == "podman":
cmd = (
"podman ps -a --format '{{.Labels}}' 2>/dev/null | "
"grep -o 'io.podman.compose.project=sbx[^, ]*' | "
"sed 's/io.podman.compose.project=//' | sort -u"
)
else:
cmd = (
"docker ps -a --format '{{.Label \"com.docker.compose.project\"}}' 2>/dev/null | "
"grep '^sbx' | sort -u"
)
rc, out = self.ssh.run(cmd, timeout=15)
if rc != 0 or not out.strip():
return []
entries = []
for project in out.strip().splitlines():
project = project.strip()
if not _PROJECT_RE.match(project):
continue
sandbox_id = project.rsplit("-", 1)[-1]
entries.append(
InventoryEntry(
kind="compose_project",
id=sandbox_id,
path=f"compose:{project}",
age_hours=None,
profile_hint=_profile_hint_from_project(project),
)
)
return entries
def _has_podman(self) -> bool:
rc, _ = self.ssh.run("command -v podman")
return rc == 0

@@ -0,0 +1,70 @@
"""Telemetry and introspection schemas."""
from __future__ import annotations
from datetime import datetime
from pydantic import BaseModel, Field
SCHEMA_VERSION = "0.1"
class HostSnapshot(BaseModel):
collected_at: datetime
host: str
load_1m: float = 0.0
load_5m: float = 0.0
load_15m: float = 0.0
cpu_count: int = 0
mem_total_mb: int = 0
mem_available_mb: int = 0
disk_root_used_pct: float = 0.0
disk_root_avail_gb: float = 0.0
running_containers: int = 0
sandbox_containers: int = 0
container_runtime: str = "unknown"
class InventoryEntry(BaseModel):
kind: str # directory | compose_project
id: str
path: str | None = None
age_hours: float | None = None
profile_hint: str | None = None
class SandboxInventory(BaseModel):
host: str
base_dir: str
collected_at: datetime
entries: list[InventoryEntry] = Field(default_factory=list)
class StaleCandidate(BaseModel):
kind: str # orphan_dir | orphan_compose | zombie_record | aged_dir
id: str
path: str | None = None
age_hours: float | None = None
action: str # reap | inspect | ignore
reason: str
class ProvisionDelta(BaseModel):
before: HostSnapshot
after: HostSnapshot
load_1m_delta: float = 0.0
mem_available_mb_delta: int = 0
running_containers_delta: int = 0
sandbox_containers_delta: int = 0
class IntrospectionReport(BaseModel):
schema_version: str = SCHEMA_VERSION
host: str
sandbox_id: str
profile_id: str
collected_at: datetime
provision_delta: ProvisionDelta | None = None
destroy_delta: ProvisionDelta | None = None
inventory: SandboxInventory | None = None
stale_candidates: list[StaleCandidate] = Field(default_factory=list)

@@ -0,0 +1,36 @@
"""Stale sandbox reap (dry-run and apply)."""

from __future__ import annotations

import os

from sandboxer.extensions.ssh import SSHConfig
from sandboxer.lifecycle.store import SandboxStore
from sandboxer.telemetry.inventory import HostInventoryScanner
from sandboxer.telemetry.models import StaleCandidate


def reap_stale(
    host: str,
    store: SandboxStore,
    *,
    dry_run: bool = True,
    stale_hours: float = 24.0,
    base_dir: str = "/tmp/sandboxer",
) -> list[StaleCandidate]:
    scanner = HostInventoryScanner(host, base_dir=base_dir, stale_hours=stale_hours)
    candidates = [
        c for c in scanner.find_stale(store, stale_hours=stale_hours) if c.action == "reap"
    ]
    if dry_run:
        return candidates

    compose_cmd = os.environ.get("SANDBOXER_COMPOSE_CMD", "podman-compose")
    ssh = SSHConfig.from_env(host)
    for item in candidates:
        if item.kind in ("orphan_dir", "aged_dir") and item.path:
            ssh.run(f"rm -rf {item.path}", timeout=30)
        elif item.kind == "orphan_compose" and item.path and item.path.startswith("compose:"):
            project = item.path.split(":", 1)[1]
            ssh.run(f"{compose_cmd} -p {project} down -v 2>/dev/null || true", timeout=60)
    return candidates
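
`reap_stale` interpolates `item.path` and the compose project name directly into remote shell commands. If a directory or project name ever contained spaces or shell metacharacters, the command would not do what was intended. The sketch below shows one possible hardening with the standard library's `shlex.quote`. `build_reap_command` is a hypothetical helper, not part of this commit, and it only builds the command string, leaving execution to the existing `ssh.run` call.

```python
from __future__ import annotations

import shlex


def build_reap_command(
    kind: str,
    path: str | None,
    compose_cmd: str = "podman-compose",
) -> str | None:
    """Return a shell-quoted command for one stale candidate, or None to skip it.

    Hypothetical helper: it mirrors the branches of reap_stale but quotes
    every value that came from the remote inventory.
    """
    if kind in ("orphan_dir", "aged_dir") and path:
        # "--" stops option parsing so a path starting with "-" stays a path.
        return f"rm -rf -- {shlex.quote(path)}"
    if kind == "orphan_compose" and path and path.startswith("compose:"):
        project = path.split(":", 1)[1]
        return f"{compose_cmd} -p {shlex.quote(project)} down -v 2>/dev/null || true"
    return None
```

`compose_cmd` is left unquoted on purpose, since it comes from the operator's own `SANDBOXER_COMPOSE_CMD` environment variable rather than from scanned host state.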

12
tests/test_defaults.py Normal file

@@ -0,0 +1,12 @@
"""Default profile resolution."""

from sandboxer.defaults import DEFAULT_CANARY_PROFILE, DEFAULT_COMPOSE_PROFILE, repo_root


def test_repo_root_points_at_sand_boxer() -> None:
    assert repo_root().name == "sand-boxer"


def test_canary_profile_constant() -> None:
    assert DEFAULT_CANARY_PROFILE == "profile.sandbox-canary"
    assert DEFAULT_COMPOSE_PROFILE == "profile.compose-e2e"
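
The two constants above drive the default behavior of `sandboxer create`. According to the tests in `tests/test_telemetry.py`, no arguments resolves to the canary profile with `repo` pointing at the sand-boxer checkout, while an explicit `repo` input resolves to the compose profile. The sketch below is one way that resolution could look. `resolve_defaults_sketch` and its `repo_root` parameter are hypothetical, and the explicit-profile branch is an assumption, since no test shown here exercises it.

```python
from __future__ import annotations

DEFAULT_CANARY_PROFILE = "profile.sandbox-canary"
DEFAULT_COMPOSE_PROFILE = "profile.compose-e2e"


def resolve_defaults_sketch(
    profile: str | None,
    inputs: dict[str, str],
    repo_root: str = "/path/to/sand-boxer",  # placeholder for the real repo_root()
) -> tuple[str, dict[str, str]]:
    """Pick a profile and inputs when the caller omitted some of them."""
    if profile is not None:
        # Assumption: an explicit profile is passed through untouched.
        return profile, dict(inputs)
    if "repo" in inputs:
        # A repo without a profile keeps the earlier compose-e2e behavior.
        return DEFAULT_COMPOSE_PROFILE, dict(inputs)
    # Nothing given: self-deploy sand-boxer as the canary.
    return DEFAULT_CANARY_PROFILE, {**inputs, "repo": repo_root}
```

Copying `inputs` before returning keeps the caller's dictionary unchanged, which makes the function easier to test.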

87
tests/test_telemetry.py Normal file

@@ -0,0 +1,87 @@
"""Telemetry parsing and introspection tests."""

from __future__ import annotations

from datetime import UTC, datetime
from pathlib import Path
from unittest.mock import patch

import pytest

from sandboxer.defaults import resolve_create_defaults
from sandboxer.profiles.loader import load_profile
from sandboxer.telemetry.host_snapshot import (
    parse_container_count,
    parse_df_root,
    parse_loadavg,
)
from sandboxer.telemetry.introspection import profile_wants_telemetry
from sandboxer.telemetry.models import HostSnapshot, SandboxInventory


def test_parse_loadavg() -> None:
    assert parse_loadavg("0.52 0.48 0.45 1/234 999") == (0.52, 0.48, 0.45)


def test_parse_df_root() -> None:
    line = "Filesystem Size Used Avail Use% Mounted\n/dev/sda1 100G 40G 55G 43% /"
    used, avail = parse_df_root(line)
    assert used == 43.0
    assert avail == 55.0


def test_parse_container_count() -> None:
    assert parse_container_count("abc\ndef\n") == 2


def test_profile_wants_telemetry_canary() -> None:
    profile = load_profile("profile.sandbox-canary")
    assert profile_wants_telemetry(profile) is True


def test_profile_wants_telemetry_compose_e2e() -> None:
    profile = load_profile("profile.compose-e2e")
    assert profile_wants_telemetry(profile) is False


def test_resolve_create_defaults_no_args() -> None:
    profile, inputs = resolve_create_defaults(None, {})
    assert profile == "profile.sandbox-canary"
    assert "repo" in inputs
    assert Path(inputs["repo"]).name == "sand-boxer"


def test_resolve_create_defaults_explicit_repo() -> None:
    profile, inputs = resolve_create_defaults(None, {"repo": "/tmp/foo"})
    assert profile == "profile.compose-e2e"
    assert inputs["repo"] == "/tmp/foo"


def test_build_introspection_report_mocked(tmp_path: Path) -> None:
    from sandboxer.lifecycle.store import SandboxStore
    from sandboxer.telemetry.introspection import build_introspection_report

    now = datetime.now(UTC)
    snap = HostSnapshot(collected_at=now, host="h1", load_1m=1.0, mem_available_mb=1000)
    snap2 = HostSnapshot(collected_at=now, host="h1", load_1m=1.5, mem_available_mb=900)
    profile = load_profile("profile.sandbox-canary")
    store = SandboxStore(path=tmp_path / "sandboxes.json")
    with patch("sandboxer.telemetry.introspection.HostInventoryScanner") as scanner_cls:
        scanner = scanner_cls.return_value
        scanner.scan_inventory.return_value = SandboxInventory(
            host="h1", base_dir="/tmp/sandboxer", collected_at=now, entries=[]
        )
        scanner.find_stale.return_value = []
        report = build_introspection_report(
            host="h1",
            sandbox_id="abc",
            profile=profile,
            provision_before=snap,
            provision_after=snap2,
            store=store,
        )
    assert report.provision_delta is not None
    assert report.provision_delta.load_1m_delta == pytest.approx(0.5)
    assert report.provision_delta.mem_available_mb_delta == -100
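
The parser tests above pin down the expected input and output shapes, but the diff excerpt here does not show the parsers themselves. The sketch below is a hypothetical reimplementation that satisfies the same three assertions. The `sketch_` names are illustrative, and the `G` suffix handling assumes `df` was asked for gigabyte units, which the test fixture suggests but does not prove.

```python
from __future__ import annotations


def sketch_parse_loadavg(text: str) -> tuple[float, float, float]:
    """Read the 1, 5 and 15 minute averages from /proc/loadavg-style text."""
    parts = text.split()
    return float(parts[0]), float(parts[1]), float(parts[2])


def sketch_parse_df_root(text: str) -> tuple[float, float]:
    """Return (used percent, available GB) for the row mounted at "/"."""
    for line in text.splitlines():
        fields = line.split()
        # Columns: filesystem, size, used, avail, use%, mount point.
        if len(fields) >= 6 and fields[-1] == "/":
            used_pct = float(fields[4].rstrip("%"))
            avail_gb = float(fields[3].rstrip("G"))
            return used_pct, avail_gb
    # No root row found: fall back to the zero defaults used by HostSnapshot.
    return 0.0, 0.0


def sketch_parse_container_count(text: str) -> int:
    """Count non-empty lines, one per container ID."""
    return sum(1 for line in text.splitlines() if line.strip())
```

Matching on the mount point rather than the row position skips the header line without special-casing it.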


@@ -4,7 +4,7 @@ type: workplan
title: "Host telemetry and self-canary introspection"
domain: infotech
repo: sand-boxer
-status: ready
+status: finished
owner: codex
topic_slug: custodian
created: "2026-06-23"
@@ -42,7 +42,7 @@ later).
```task
id: SAND-WP-0008-T01
-status: todo
+status: done
priority: high
state_hub_task_id: "8f7b46e3-045e-481c-81bd-1c61734c6eb3"
```
@@ -64,7 +64,7 @@ does not own long-term metrics DB).
```task
id: SAND-WP-0008-T02
-status: todo
+status: done
priority: high
state_hub_task_id: "732bae4e-2dd9-4500-a86d-e869007bb383"
```
@@ -86,7 +86,7 @@ Canary deliverable on `ready`: JSON `IntrospectionReport` in sandbox status
```task
id: SAND-WP-0008-T03
-status: todo
+status: done
priority: high
state_hub_task_id: "7bd22f27-5058-4c19-98b6-b923909a8815"
```
@@ -105,7 +105,7 @@ command output.
```task
id: SAND-WP-0008-T04
-status: todo
+status: done
priority: high
state_hub_task_id: "c2d19bb7-9322-4744-a71e-75f7701a6fb2"
```
@@ -124,7 +124,7 @@ No automatic deletion in this task — dry-run only.
```task
id: SAND-WP-0008-T05
-status: todo
+status: done
priority: medium
state_hub_task_id: "b6b02289-d36e-4ee1-9ff7-dc59a1d24886"
```
@@ -143,7 +143,7 @@ Same pattern on `destroy` for teardown impact. Tests mock SSH collector.
```task
id: SAND-WP-0008-T06
-status: todo
+status: done
priority: high
state_hub_task_id: "d9941d93-a662-45c0-820b-88d32266c653"
```
@@ -168,7 +168,7 @@ sandboxer create --input repo=/other/repo # unchanged behavior
```task
id: SAND-WP-0008-T07
-status: todo
+status: done
priority: high
state_hub_task_id: "76430452-c98e-44e5-b625-e243dc12b8a5"
```
@@ -185,7 +185,7 @@ After `wait_ready` for canary profile:
```task
id: SAND-WP-0008-T08
-status: todo
+status: done
priority: medium
state_hub_task_id: "4ee4b95b-e7b5-4893-b78e-914f808bc00a"
```
@@ -207,7 +207,7 @@ activity-core may schedule periodic canary runs later — out of scope here.
```task
id: SAND-WP-0008-T09
-status: todo
+status: done
priority: medium
state_hub_task_id: "6ea8eda6-491b-460a-a526-7565962f449e"
```
@@ -225,7 +225,7 @@ sandboxer reap-stale --apply [--older-than 24h] # T10+; gated behind --apply
```task
id: SAND-WP-0008-T10
-status: todo
+status: done
priority: medium
state_hub_task_id: "435a3993-d8d3-4280-b68a-c37e34d20312"
```
@@ -268,4 +268,13 @@ After merging task status updates:
```bash
cd ~/state-hub && make fix-consistency REPO=sand-boxer
```
```
## Verification record (2026-06-23)
CoulombCore remote proof:
1. `sandboxer create` (no args) → `ready` + `telemetry.provision_delta`
2. `sandboxer inspect host` → load/mem metrics returned
3. Stale orphans from prior runs detected in `stale_candidates`
4. `sandboxer destroy` → `destroy_delta` with load Δ -0.09, mem +54 MB