sand-boxer/workplans/SAND-WP-0008-host-telemetry-and-self-canary.md
tegwick c0a9261cdc Implement SAND-WP-0008: host telemetry and self-canary
Add profile.sandbox-canary, HostSnapshot/inventory/stale schemas, SSH
collectors, before/after provision deltas, telemetry export to State Hub
and local JSON, default `sandboxer create` self-deploy, inspect/reap-stale
CLI, runbook, and CoulombCore verification (26 tests pass).
2026-06-23 19:53:51 +02:00


---
id: SAND-WP-0008
type: workplan
title: "Host telemetry and self-canary introspection"
domain: infotech
repo: sand-boxer
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-23"
updated: "2026-06-23"
state_hub_workstream_id: "afbcbc84-5ec7-4f8b-ae21-4cbda0d05195"
---
# Host telemetry and self-canary introspection
Use sand-boxer as its own trial deployment to prove provision/teardown **and**
return actionable host and sandbox intelligence: resource metrics, load before/after,
stale sandbox inventory, and structured telemetry for centralized analysis.
**Charter:** `INTENT.md` (host topology, observable lifecycle)
**Spec:** `docs/meta-framework.md` (Host resource, Meter — extend for self-hosted)
**Predecessor:** SAND-WP-0002 (`ext.compose-ssh`, CLI v0, State Hub events)
**Related:** SAND-WP-0002-T10 (remote smoke), activity-core (scheduled reap jobs)
## Problem
Today `sandboxer create` proves SSH + compose for an arbitrary repo but returns
only lifecycle state and reachability. Operators lack:
- Host load and capacity **before** accepting new sandboxes
- **After** metrics to quantify sandbox cost
- Inventory of **stale** sandboxes (`/tmp/sandboxer/*`, orphaned compose projects)
- A **default smoke path** that does not depend on another repo's `e2e/` layout
sand-boxer should dogfood itself: deploy the sand-boxer tree, run a bounded
introspection bundle on the remote host, and emit telemetry suitable for a
central datastore (State Hub first; export to artifact-store or metrics pipeline
later).
## Design host telemetry contract
```task
id: SAND-WP-0008-T01
status: done
priority: high
state_hub_task_id: "8f7b46e3-045e-481c-81bd-1c61734c6eb3"
```
Author `docs/host-telemetry.md` defining:
- **HostSnapshot** — point-in-time host metrics (load, CPU%, mem, disk, docker stats summary)
- **SandboxInventory** — known sandboxes on host (compose projects matching `sbx-*`,
directories under configured `base_dir`, age, owning profile if inferable)
- **StaleCandidate** — entries exceeding TTL, idle threshold, or missing store record
- **ProvisionDelta** — `before` / `after` HostSnapshot pair around create/destroy
- **IntrospectionReport** — bundled output attached to sandbox `ready` response
- Retention and privacy rules (no secret paths, no full `docker inspect` dumps by default)
Extend meta-framework spec with `Host` observability fields (read-only; sand-boxer
does not own long-term metrics DB).
## Define profile.sandbox-canary and introspection schema
```task
id: SAND-WP-0008-T02
status: done
priority: high
state_hub_task_id: "732bae4e-2dd9-4500-a86d-e869007bb383"
```
Add:
- `profiles/profile.sandbox-canary.yaml` — lightweight compose or no-compose
introspection profile bound to `ext.compose-ssh` (or thin `ext.ssh-introspect`
if compose is unnecessary for canary)
- Pydantic models: `HostSnapshot`, `SandboxInventory`, `StaleCandidate`,
`ProvisionDelta`, `IntrospectionReport`
- Default inputs: `repo` optional; when omitted, resolve to sand-boxer repo root
(package parent path or `SANDBOXER_REPO_ROOT`)
Canary deliverable on `ready`: JSON `IntrospectionReport` in sandbox status
`detail` / reachability extension field.
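
The model set above can be sketched as follows. This is a minimal illustration only: it uses stdlib dataclasses as a stand-in for the Pydantic models so that it runs without dependencies, and every field name (`load_1m`, `mem_available_mb`, `reason`, and so on) is an assumption rather than the schema defined in `docs/host-telemetry.md`. `SandboxInventory` is omitted for brevity.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class HostSnapshot:
    """Point-in-time host metrics (field names are hypothetical)."""
    host: str
    collected_at: str          # ISO 8601 timestamp
    load_1m: float
    mem_available_mb: int
    mem_total_mb: int
    disk_used_pct: float
    running_containers: int


@dataclass(frozen=True)
class StaleCandidate:
    name: str
    reason: str                # hypothetical values: "orphan", "zombie", "ttl_exceeded"
    suggested_action: str      # "reap", "inspect", or "ignore"


@dataclass(frozen=True)
class ProvisionDelta:
    before: HostSnapshot
    after: HostSnapshot


@dataclass(frozen=True)
class IntrospectionReport:
    schema_version: str
    sandbox_id: str
    profile_id: str
    provision_delta: ProvisionDelta
    stale_candidates: list[StaleCandidate] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), sort_keys=True)
```

A report built this way serializes to a single JSON object, which is the shape a status `detail` field would need to carry.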
## Implement remote host metrics collector
```task
id: SAND-WP-0008-T03
status: done
priority: high
state_hub_task_id: "7bd22f27-5058-4c19-98b6-b923909a8815"
```
SSH-side collection (shell + structured parse, no extra daemon on host):
- Load average, CPU count, mem available/total, root disk use
- `docker system df` / running container count
- Optional: `docker stats --no-stream` aggregate for sbx-* projects only
- Bounded runtime (e.g. ≤10s) and non-root-safe commands
Module: `src/sandboxer/telemetry/host_snapshot.py` with unit tests using fixture
command output.
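
As an illustration of the "shell + structured parse" approach, the sketch below parses two fixture outputs. It assumes a Linux host where the collector reads `/proc/loadavg` and `/proc/meminfo`; the function names are hypothetical and are not taken from `host_snapshot.py`.

```python
from __future__ import annotations


def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Parse /proc/loadavg content, e.g. '0.42 0.30 0.25 1/512 12345'."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo content into {field: integer value}.

    Most fields are reported in kB; the unit suffix is dropped.
    """
    values: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip blank or malformed lines
        values[key.strip()] = int(rest.split()[0])
    return values
```

Keeping the parsers as pure functions over strings means unit tests can feed them captured command output without opening an SSH session.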
## Implement stale sandbox discovery
```task
id: SAND-WP-0008-T04
status: done
priority: high
state_hub_task_id: "c2d19bb7-9322-4744-a71e-75f7701a6fb2"
```
Scan remote host for:
- Directories under `base_dir` (default `/tmp/sandboxer`) with mtime age
- `docker compose ls` projects matching `sbx-*` / `e2e-*` legacy patterns
- Cross-check against local `SandboxStore` — flag **orphans** (on host, not in store)
and **zombies** (in store, not on host)
Output `StaleCandidate` list with suggested action: `reap`, `inspect`, `ignore`.
No automatic deletion in this task — dry-run only.
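
The orphan/zombie cross-check reduces to two set differences. The sketch below assumes the host scan and the store each yield a set of sandbox names; the helper names and the choice of `inspect` as the default suggested action are assumptions made for illustration.

```python
from __future__ import annotations

from fnmatch import fnmatchcase

# Current and legacy project-name patterns from the scan description.
PROJECT_PATTERNS = ("sbx-*", "e2e-*")


def matching_projects(names: list[str]) -> set[str]:
    """Keep only compose project names that match a sandbox pattern."""
    return {n for n in names if any(fnmatchcase(n, p) for p in PROJECT_PATTERNS)}


def cross_check(on_host: set[str], in_store: set[str]) -> list[dict[str, str]]:
    """Flag orphans (on host, not in store) and zombies (in store, not on host)."""
    candidates: list[dict[str, str]] = []
    for name in sorted(on_host - in_store):
        candidates.append({"name": name, "reason": "orphan", "suggested_action": "inspect"})
    for name in sorted(in_store - on_host):
        candidates.append({"name": name, "reason": "zombie", "suggested_action": "inspect"})
    return candidates
```

The function only reports; nothing here touches the host, which matches the dry-run constraint of this task.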
## Capture before/after load around provision
```task
id: SAND-WP-0008-T05
status: done
priority: medium
state_hub_task_id: "b6b02289-d36e-4ee1-9ff7-dc59a1d24886"
```
Integrate into `SandboxManager.create` / `destroy` when profile metadata requests
telemetry (`metadata.observability: canary` or profile id `profile.sandbox-canary`):
1. `HostSnapshot` before extension `provision`
2. Run provision + wait_ready
3. `HostSnapshot` after ready
4. Compute `ProvisionDelta` (load/mem/disk/container deltas)
Same pattern on `destroy` for teardown impact. Tests mock SSH collector.
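
Step 4 can be sketched as a field-by-field subtraction of two snapshots. The `Snapshot` type and its field names below are hypothetical stand-ins for `HostSnapshot`, kept small so that the example is self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical subset of HostSnapshot used for delta computation."""
    load_1m: float
    mem_available_mb: int
    disk_used_mb: int
    running_containers: int


def compute_delta(before: Snapshot, after: Snapshot) -> dict[str, float]:
    """Return after minus before for each metric."""
    return {
        "load_1m": round(after.load_1m - before.load_1m, 2),
        "mem_available_mb": after.mem_available_mb - before.mem_available_mb,
        "disk_used_mb": after.disk_used_mb - before.disk_used_mb,
        "running_containers": after.running_containers - before.running_containers,
    }
```

With this sign convention, a create that consumes memory shows a negative `mem_available_mb`, and a destroy that frees it shows a positive value.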
## Default repo: deploy sand-boxer itself
```task
id: SAND-WP-0008-T06
status: done
priority: high
state_hub_task_id: "d9941d93-a662-45c0-820b-88d32266c653"
```
When `create` has no `repo` input:
- Resolve default to sand-boxer repository root (`SANDBOXER_REPO_ROOT` override)
- Use `profile.sandbox-canary` as default profile when `--profile` omitted **and**
no `repo` given (document precedence: explicit flags win)
- Ship minimal `e2e/e2e.yml` or `docker-compose.canary.yml` in sand-boxer repo if
compose-up is required for parity with `ext.compose-ssh`
CLI examples:
```bash
sandboxer create # canary self-deploy
sandboxer create --profile profile.sandbox-canary
sandboxer create --input repo=/other/repo # unchanged behavior
```
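
The precedence rule (explicit flags win) might be expressed as the function below. It is a sketch under assumptions: the function name and signature are hypothetical, and when an explicit `repo` is passed without a profile it simply returns `None` for the profile so that the pre-existing behavior applies.

```python
from __future__ import annotations

from pathlib import Path
from typing import Mapping, Optional

CANARY_PROFILE = "profile.sandbox-canary"


def resolve_create_inputs(
    repo: Optional[str],
    profile: Optional[str],
    env: Mapping[str, str],
    package_parent: Path,
) -> tuple[Path, Optional[str]]:
    """Resolve the repo path and profile id for `sandboxer create`."""
    if repo is not None:
        # Explicit repo: leave the profile untouched (unchanged behavior).
        return Path(repo), profile
    # No repo: SANDBOXER_REPO_ROOT overrides the package parent path.
    root = Path(env.get("SANDBOXER_REPO_ROOT") or package_parent)
    # No repo and no profile: fall back to the canary profile.
    return root, profile or CANARY_PROFILE
```

Passing `env` and `package_parent` as arguments rather than reading `os.environ` inside the function keeps the precedence logic easy to test.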
## Wire introspection into canary provision flow
```task
id: SAND-WP-0008-T07
status: done
priority: high
state_hub_task_id: "76430452-c98e-44e5-b625-e243dc12b8a5"
```
After `wait_ready` for canary profile:
- Include the `src/sandboxer/telemetry/` introspection entry script in the rsync
payload, or invoke collector modules via an SSH one-liner
- Assemble `IntrospectionReport` (inventory + deltas + stale candidates)
- Attach to `SandboxStatus` (new optional `telemetry` field)
- Print human summary in CLI (load delta, stale count, disk headroom)
## Telemetry export for centralized analysis
```task
id: SAND-WP-0008-T08
status: done
priority: medium
state_hub_task_id: "4ee4b95b-e7b5-4893-b78e-914f808bc00a"
```
Emit structured telemetry to:
1. **State Hub**: `progress/` events with `detail` containing `IntrospectionReport`
(extend existing lifecycle emitter)
2. **Local artifact**: `~/.local/share/sandboxer/telemetry/<sandbox_id>.json` for
offline analysis
3. **Export hook** (stub): `TelemetrySink` protocol for future artifact-store /
Prometheus / ClickHouse; document contract only
Include: `host`, `sandbox_id`, `profile_id`, `collected_at`, schema version.
activity-core may schedule periodic canary runs later — out of scope here.
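
One possible shape for the `TelemetrySink` contract, together with a local JSON sink, is sketched below. The `emit` method name, the `LocalJsonSink` class, and the use of a plain mapping for the report are assumptions; the workplan only states that a protocol with this name exists and that the contract is documented.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Mapping, Protocol


class TelemetrySink(Protocol):
    """Hypothetical contract: accept one report per call."""

    def emit(self, report: Mapping[str, Any]) -> None: ...


class LocalJsonSink:
    """Write each report to <base_dir>/<sandbox_id>.json."""

    def __init__(self, base_dir: Path) -> None:
        self.base_dir = base_dir

    def emit(self, report: Mapping[str, Any]) -> None:
        self.base_dir.mkdir(parents=True, exist_ok=True)
        path = self.base_dir / f"{report['sandbox_id']}.json"
        path.write_text(json.dumps(dict(report), indent=2, sort_keys=True), encoding="utf-8")


def fan_out(report: Mapping[str, Any], sinks: list[TelemetrySink]) -> None:
    """Deliver the same report to every configured sink."""
    for sink in sinks:
        sink.emit(report)
```

Because `TelemetrySink` is a structural protocol, a future Prometheus or ClickHouse sink would only need to provide a compatible `emit` method, with no shared base class.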
## CLI inspect and stale reap commands
```task
id: SAND-WP-0008-T09
status: done
priority: medium
state_hub_task_id: "6ea8eda6-491b-460a-a526-7565962f449e"
```
```bash
sandboxer inspect host [--host coulombcore] # HostSnapshot + inventory, no create
sandboxer inspect stale [--host ...] [--json] # StaleCandidate list
sandboxer reap-stale --dry-run [--host ...] # report only
sandboxer reap-stale --apply [--older-than 24h] # T10+; gated behind --apply
```
`inspect` does not require a running sandbox — SSH + read-only collectors only.
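
The `--older-than 24h` value has to be turned into a duration before it can be compared with directory age. The parser below is a sketch: only the `24h` form appears in this workplan, so the other unit suffixes and the function name are assumptions.

```python
import re
from datetime import timedelta

# Assumed suffixes; only "h" is shown in the CLI examples.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_older_than(value: str) -> timedelta:
    """Convert a string such as '24h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})
```

Rejecting unrecognized input with an error, rather than guessing a default, is the safer choice for a command that can delete data when run with `--apply`.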
## Runbook, tests, and CoulombCore verification
```task
id: SAND-WP-0008-T10
status: done
priority: medium
state_hub_task_id: "435a3993-d8d3-4280-b68a-c37e34d20312"
```
- `docs/runbooks/profile-sandbox-canary.md`
- Integration test: mock SSH fixtures for full report assembly
- Manual proof on CoulombCore:
  1. `sandboxer create` (no args) → `ready` + `IntrospectionReport`
  2. `sandboxer inspect host` matches report host metrics
  3. Introduce fake stale dir → appears in `inspect stale`
  4. `destroy` → after snapshot shows load recovery
- Satisfies SAND-WP-0002-T10 smoke variant when canary path used
Record optimization hypotheses (disk pressure, stale reap policy) for phase-2
automation via activity-core.
---
## Out of scope
| Item | Target |
|------|--------|
| Long-term metrics database / dashboards | artifact-store or observability stack (separate workplan) |
| Automatic scheduled reap without human gate | activity-core instruction (after dry-run proven) |
| wise-validator migration | SAND-WP-0003 |
| SaaS metering | SAND-WP-0006 |
## Completion criteria
- `sandboxer create` with no `repo` deploys sand-boxer and returns
`IntrospectionReport` on `ready`
- Before/after host snapshots captured for canary creates
- Stale sandbox inventory with dry-run reap CLI
- Telemetry lands in State Hub `detail` and local JSON artifact
- Runbook and tests merged; operator runs `make fix-consistency REPO=sand-boxer`
## Operator note
After merging task status updates:
```bash
cd ~/state-hub && make fix-consistency REPO=sand-boxer
```
## Verification record (2026-06-23)
CoulombCore remote proof:
1. `sandboxer create` (no args) → `ready` + `telemetry.provision_delta`
2. `sandboxer inspect host` → load/mem metrics returned
3. Stale orphans from prior runs detected in `stale_candidates`
4. `sandboxer destroy` → `destroy_delta` with load Δ -0.09, mem +54 MB