---
id: SAND-WP-0008
type: workplan
title: "Host telemetry and self-canary introspection"
domain: infotech
repo: sand-boxer
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-23"
updated: "2026-06-23"
state_hub_workstream_id: "afbcbc84-5ec7-4f8b-ae21-4cbda0d05195"
---

# Host telemetry and self-canary introspection

Use sand-boxer as its own trial deployment to prove provision/teardown **and** return actionable host and sandbox intelligence: resource metrics, load before/after, stale sandbox inventory, and structured telemetry for centralized analysis.

**Charter:** `INTENT.md` (host topology, observable lifecycle)
**Spec:** `docs/meta-framework.md` (Host resource, Meter — extend for self-hosted)
**Predecessor:** SAND-WP-0002 (`ext.compose-ssh`, CLI v0, State Hub events)
**Related:** SAND-WP-0002-T10 (remote smoke), activity-core (scheduled reap jobs)

## Problem

Today `sandboxer create` proves SSH + compose for an arbitrary repo but returns only lifecycle state and reachability. Operators lack:

- Host load and capacity **before** accepting new sandboxes
- **After** metrics to quantify sandbox cost
- Inventory of **stale** sandboxes (`/tmp/sandboxer/*`, orphaned compose projects)
- A **default smoke path** that does not depend on another repo's `e2e/` layout

sand-boxer should dogfood itself: deploy the sand-boxer tree, run a bounded introspection bundle on the remote host, and emit telemetry suitable for a central datastore (State Hub first; export to artifact-store or metrics pipeline later).
## Design host telemetry contract

```task
id: SAND-WP-0008-T01
status: done
priority: high
state_hub_task_id: "8f7b46e3-045e-481c-81bd-1c61734c6eb3"
```

Author `docs/host-telemetry.md` defining:

- **HostSnapshot** — point-in-time host metrics (load, CPU%, mem, disk, docker stats summary)
- **SandboxInventory** — known sandboxes on host (compose projects matching `sbx-*`, directories under configured `base_dir`, age, owning profile if inferable)
- **StaleCandidate** — entries exceeding TTL, idle threshold, or missing store record
- **ProvisionDelta** — `before` / `after` HostSnapshot pair around create/destroy
- **IntrospectionReport** — bundled output attached to sandbox `ready` response
- Retention and privacy rules (no secret paths, no full `docker inspect` dumps by default)

Extend meta-framework spec with `Host` observability fields (read-only; sand-boxer does not own long-term metrics DB).

## Define profile.sandbox-canary and introspection schema

```task
id: SAND-WP-0008-T02
status: done
priority: high
state_hub_task_id: "732bae4e-2dd9-4500-a86d-e869007bb383"
```

Add:

- `profiles/profile.sandbox-canary.yaml` — lightweight compose or no-compose introspection profile bound to `ext.compose-ssh` (or thin `ext.ssh-introspect` if compose is unnecessary for canary)
- Pydantic models: `HostSnapshot`, `SandboxInventory`, `StaleCandidate`, `ProvisionDelta`, `IntrospectionReport`
- Default inputs: `repo` optional; when omitted, resolve to sand-boxer repo root (package parent path or `SANDBOXER_REPO_ROOT`)

Canary deliverable on `ready`: JSON `IntrospectionReport` in sandbox status `detail` / reachability extension field.
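
The model shapes listed above might look roughly like the following. This is a minimal sketch, not the shipped schema: it uses stdlib `dataclasses` as a stand-in for the Pydantic models so that it runs without third-party packages, and every field name other than `before`, `after`, `host`, `sandbox_id`, `profile_id`, `collected_at`, `provision_delta`, and `stale_candidates` is an assumption made for illustration.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class HostSnapshot:
    """Point-in-time host metrics (field names are hypothetical)."""
    load_1m: float
    cpu_count: int
    mem_available_mb: int
    mem_total_mb: int
    disk_used_pct: float
    running_containers: int


@dataclass
class StaleCandidate:
    """One entry flagged by stale discovery (field names are hypothetical)."""
    name: str              # directory or compose project name
    age_hours: float
    reason: str            # hypothetical values: "ttl", "orphan", "zombie"
    suggested_action: str  # "reap", "inspect", or "ignore"


@dataclass
class ProvisionDelta:
    """Before/after snapshot pair around create or destroy."""
    before: HostSnapshot
    after: HostSnapshot


@dataclass
class IntrospectionReport:
    """Bundle attached to the sandbox `ready` response."""
    schema_version: str
    host: str
    sandbox_id: str
    profile_id: str
    collected_at: str  # ISO 8601 timestamp
    provision_delta: Optional[ProvisionDelta] = None
    stale_candidates: List[StaleCandidate] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists of them.
        return json.dumps(asdict(self), sort_keys=True)
```

With Pydantic, the same shapes would subclass `BaseModel` and gain validation on construction; the nesting and field layout would stay the same.
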
## Implement remote host metrics collector

```task
id: SAND-WP-0008-T03
status: done
priority: high
state_hub_task_id: "7bd22f27-5058-4c19-98b6-b923909a8815"
```

SSH-side collection (shell + structured parse, no extra daemon on host):

- Load average, CPU count, mem available/total, root disk use
- `docker system df` / running container count
- Optional: `docker stats --no-stream` aggregate for sbx-* projects only
- Bounded runtime (e.g. ≤10s) and non-root-safe commands

Module: `src/sandboxer/telemetry/host_snapshot.py` with unit tests using fixture command output.

## Implement stale sandbox discovery

```task
id: SAND-WP-0008-T04
status: done
priority: high
state_hub_task_id: "c2d19bb7-9322-4744-a71e-75f7701a6fb2"
```

Scan remote host for:

- Directories under `base_dir` (default `/tmp/sandboxer`) with mtime age
- `docker compose ls` projects matching `sbx-*` / `e2e-*` legacy patterns
- Cross-check against local `SandboxStore` — flag **orphans** (on host, not in store) and **zombies** (in store, not on host)

Output `StaleCandidate` list with suggested action: `reap`, `inspect`, `ignore`. No automatic deletion in this task — dry-run only.

## Capture before/after load around provision

```task
id: SAND-WP-0008-T05
status: done
priority: medium
state_hub_task_id: "b6b02289-d36e-4ee1-9ff7-dc59a1d24886"
```

Integrate into `SandboxManager.create` / `destroy` when profile metadata requests telemetry (`metadata.observability: canary` or profile id `profile.sandbox-canary`):

1. `HostSnapshot` before extension `provision`
2. Run provision + wait_ready
3. `HostSnapshot` after ready
4. Compute `ProvisionDelta` (load/mem/disk/container deltas)

Same pattern on `destroy` for teardown impact. Tests mock SSH collector.
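
The before/after pattern above reduces to a field-wise subtraction of two snapshots. The sketch below illustrates that step under stated assumptions: the `HostSnapshot` fields, the `parse_loadavg` helper, and `compute_delta` are hypothetical names, and the only external format relied on is the Linux `/proc/loadavg` layout, whose first field is the 1-minute load average.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class HostSnapshot:
    """Reduced snapshot for illustration (field names are hypothetical)."""
    load_1m: float
    mem_available_mb: int
    disk_used_pct: float
    running_containers: int


def parse_loadavg(text: str) -> float:
    """Return the 1-minute load from /proc/loadavg-style text."""
    return float(text.split()[0])


def compute_delta(before: HostSnapshot, after: HostSnapshot) -> Dict[str, float]:
    """Subtract `before` from `after`, field by field.

    Positive values mean the metric increased across the operation.
    Floats are rounded to two decimals to keep reports readable.
    """
    return {
        "load_1m": round(after.load_1m - before.load_1m, 2),
        "mem_available_mb": after.mem_available_mb - before.mem_available_mb,
        "disk_used_pct": round(after.disk_used_pct - before.disk_used_pct, 2),
        "running_containers": after.running_containers - before.running_containers,
    }
```

In unit tests the two snapshots can be built from fixture command output, so the delta logic is exercised without an SSH connection.
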
## Default repo: deploy sand-boxer itself

```task
id: SAND-WP-0008-T06
status: done
priority: high
state_hub_task_id: "d9941d93-a662-45c0-820b-88d32266c653"
```

When `create` has no `repo` input:

- Resolve default to sand-boxer repository root (`SANDBOXER_REPO_ROOT` override)
- Use `profile.sandbox-canary` as default profile when `--profile` omitted **and** no `repo` given (document precedence: explicit flags win)
- Ship minimal `e2e/e2e.yml` or `docker-compose.canary.yml` in sand-boxer repo if compose-up is required for parity with `ext.compose-ssh`

CLI examples:

```bash
sandboxer create                                    # canary self-deploy
sandboxer create --profile profile.sandbox-canary
sandboxer create --input repo=/other/repo           # unchanged behavior
```

## Wire introspection into canary provision flow

```task
id: SAND-WP-0008-T07
status: done
priority: high
state_hub_task_id: "76430452-c98e-44e5-b625-e243dc12b8a5"
```

After `wait_ready` for canary profile:

- Rsync includes `src/sandboxer/telemetry/` introspection entry script or invoke collector modules via SSH one-liner
- Assemble `IntrospectionReport` (inventory + deltas + stale candidates)
- Attach to `SandboxStatus` (new optional `telemetry` field)
- Print human summary in CLI (load delta, stale count, disk headroom)

## Telemetry export for centralized analysis

```task
id: SAND-WP-0008-T08
status: done
priority: medium
state_hub_task_id: "4ee4b95b-e7b5-4893-b78e-914f808bc00a"
```

Emit structured telemetry to:

1. **State Hub** — `progress/` events with `detail` containing `IntrospectionReport` (extend existing lifecycle emitter)
2. **Local artifact** — `~/.local/share/sandboxer/telemetry/.json` for offline analysis
3. **Export hook** (stub) — `TelemetrySink` protocol for future artifact-store / Prometheus / ClickHouse; document contract only

Include: `host`, `sandbox_id`, `profile_id`, `collected_at`, schema version.

activity-core may schedule periodic canary runs later — out of scope here.
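
One way the `TelemetrySink` contract could be expressed is as a `typing.Protocol` with a single method, with the local JSON artifact as its first implementation. This is a sketch only: the `emit` method name, the `LocalJsonSink` class, the `schema_version` key spelling, and the choice to name each file after `sandbox_id` are all assumptions made for illustration and are not taken from the workplan.

```python
import json
from pathlib import Path
from typing import Any, Dict, Protocol, runtime_checkable

# Keys every exported record should carry, per the list above.
# "schema_version" is an assumed spelling of "schema version".
REQUIRED_KEYS = ("host", "sandbox_id", "profile_id", "collected_at", "schema_version")


@runtime_checkable
class TelemetrySink(Protocol):
    """Anything that can accept one telemetry record (hypothetical contract)."""

    def emit(self, record: Dict[str, Any]) -> None: ...


class LocalJsonSink:
    """Writes each record as a JSON file in a directory (illustrative)."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory

    def emit(self, record: Dict[str, Any]) -> None:
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"telemetry record missing keys: {missing}")
        self.directory.mkdir(parents=True, exist_ok=True)
        # Assumption: one file per sandbox, named after sandbox_id.
        path = self.directory / f"{record['sandbox_id']}.json"
        path.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
```

Because the protocol is structural, a later artifact-store or metrics sink would only need to provide a compatible `emit` method; no shared base class is required.
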
## CLI inspect and stale reap commands

```task
id: SAND-WP-0008-T09
status: done
priority: medium
state_hub_task_id: "6ea8eda6-491b-460a-a526-7565962f449e"
```

```bash
sandboxer inspect host [--host coulombcore]        # HostSnapshot + inventory, no create
sandboxer inspect stale [--host ...] [--json]      # StaleCandidate list
sandboxer reap-stale --dry-run [--host ...]        # report only
sandboxer reap-stale --apply [--older-than 24h]    # T10+; gated behind --apply
```

`inspect` does not require a running sandbox — SSH + read-only collectors only.

## Runbook, tests, and CoulombCore verification

```task
id: SAND-WP-0008-T10
status: done
priority: medium
state_hub_task_id: "435a3993-d8d3-4280-b68a-c37e34d20312"
```

- `docs/runbooks/profile-sandbox-canary.md`
- Integration test: mock SSH fixtures for full report assembly
- Manual proof on CoulombCore:
  1. `sandboxer create` (no args) → `ready` + `IntrospectionReport`
  2. `sandboxer inspect host` matches report host metrics
  3. Introduce fake stale dir → appears in `inspect stale`
  4. `destroy` → after snapshot shows load recovery
- Satisfies SAND-WP-0002-T10 smoke variant when canary path used

Record optimization hypotheses (disk pressure, stale reap policy) for phase-2 automation via activity-core.
---

## Out of scope

| Item | Target |
|------|--------|
| Long-term metrics database / dashboards | artifact-store or observability stack (separate workplan) |
| Automatic scheduled reap without human gate | activity-core instruction (after dry-run proven) |
| wise-validator migration | SAND-WP-0003 |
| SaaS metering | SAND-WP-0006 |

## Completion criteria

- `sandboxer create` with no `repo` deploys sand-boxer and returns `IntrospectionReport` on `ready`
- Before/after host snapshots captured for canary creates
- Stale sandbox inventory with dry-run reap CLI
- Telemetry lands in State Hub `detail` and local JSON artifact
- Runbook and tests merged; operator runs `make fix-consistency REPO=sand-boxer`

## Operator note

After merging task status updates:

```bash
cd ~/state-hub && make fix-consistency REPO=sand-boxer
```

## Verification record (2026-06-23)

CoulombCore remote proof:

1. `sandboxer create` (no args) → `ready` + `telemetry.provision_delta`
2. `sandboxer inspect host` → load/mem metrics returned
3. Stale orphans from prior runs detected in `stale_candidates`
4. `sandboxer destroy` → `destroy_delta` with load Δ -0.09, mem +54 MB