From 20e25726d717b0baa2ec34f1ae539297f4d3bfe9 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Tue, 23 Jun 2026 14:25:05 +0200
Subject: [PATCH] Add SAND-WP-0008: host telemetry and self-canary introspection

Workplan for default sand-boxer self-deploy, before/after host metrics,
stale sandbox inventory, and telemetry export for centralized analysis.
---
 SCOPE.md                                      |   3 +-
 .../SAND-WP-0002-meta-framework-foundation.md |   1 +
 ...-WP-0008-host-telemetry-and-self-canary.md | 260 ++++++++++++++++++
 3 files changed, 263 insertions(+), 1 deletion(-)
 create mode 100644 workplans/SAND-WP-0008-host-telemetry-and-self-canary.md

diff --git a/SCOPE.md b/SCOPE.md
index acea92c..7161254 100644
--- a/SCOPE.md
+++ b/SCOPE.md
@@ -126,7 +126,8 @@ Additional boundaries:
 - **Registry:** scaffold present (`registry/indexes/capabilities.yaml` empty;
   `registry/capabilities/` placeholder); domain in index still `helix_forge`
   from scaffold — needs alignment to `infotech`
-- **Workplans:** `SAND-WP-0001` (State Hub bootstrap) in `ready`
+- **Workplans:** `SAND-WP-0001` finished; `SAND-WP-0002` active;
+  `SAND-WP-0008` proposed (host telemetry / self-canary)
 - **Lineage (external, not yet migrated):** `the-custodian/e2e-framework/`
   (CUST-WP-0028, completed) and `infra/build-machines/` (CUST-WP-0032)
 
diff --git a/workplans/SAND-WP-0002-meta-framework-foundation.md b/workplans/SAND-WP-0002-meta-framework-foundation.md
index 56ed93c..96be00b 100644
--- a/workplans/SAND-WP-0002-meta-framework-foundation.md
+++ b/workplans/SAND-WP-0002-meta-framework-foundation.md
@@ -248,6 +248,7 @@ test (steps 1–4) pending operator run against CoulombCore/sandboxer01.
 | `ext.vm-packer` (build-machines) | SAND-WP-0005 |
 | SaaS extensions + payments layer | SAND-WP-0006 |
 | Snapshot / restore / checkpoint profiles | SAND-WP-0007 |
+| Host telemetry, self-canary, stale sandbox inventory | SAND-WP-0008 |
 | Coulomb-native runtime (phase 5) | Backlog |
 
 ## Completion criteria
diff --git a/workplans/SAND-WP-0008-host-telemetry-and-self-canary.md b/workplans/SAND-WP-0008-host-telemetry-and-self-canary.md
new file mode 100644
index 0000000..f8ec299
--- /dev/null
+++ b/workplans/SAND-WP-0008-host-telemetry-and-self-canary.md
@@ -0,0 +1,260 @@
+---
+id: SAND-WP-0008
+type: workplan
+title: "Host telemetry and self-canary introspection"
+domain: infotech
+repo: sand-boxer
+status: ready
+owner: codex
+topic_slug: custodian
+created: "2026-06-23"
+updated: "2026-06-23"
+---
+
+# Host telemetry and self-canary introspection
+
+Use sand-boxer as its own trial deployment to prove provision/teardown **and**
+return actionable host and sandbox intelligence: resource metrics, load before/after,
+stale sandbox inventory, and structured telemetry for centralized analysis.
+
+**Charter:** `INTENT.md` (host topology, observable lifecycle)
+**Spec:** `docs/meta-framework.md` (Host resource, Meter — extend for self-hosted)
+**Predecessor:** SAND-WP-0002 (`ext.compose-ssh`, CLI v0, State Hub events)
+**Related:** SAND-WP-0002-T10 (remote smoke), activity-core (scheduled reap jobs)
+
+## Problem
+
+Today `sandboxer create` proves SSH + compose for an arbitrary repo but returns
+only lifecycle state and reachability. Operators lack:
+
+- Host load and capacity **before** accepting new sandboxes
+- **After** metrics to quantify sandbox cost
+- Inventory of **stale** sandboxes (`/tmp/sandboxer/*`, orphaned compose projects)
+- A **default smoke path** that does not depend on another repo's `e2e/` layout
+
+sand-boxer should dogfood itself: deploy the sand-boxer tree, run a bounded
+introspection bundle on the remote host, and emit telemetry suitable for a
+central datastore (State Hub first; export to artifact-store or metrics pipeline
+later).
+
+## Design host telemetry contract
+
+```task
+id: SAND-WP-0008-T01
+status: todo
+priority: high
+```
+
+Author `docs/host-telemetry.md` defining:
+
+- **HostSnapshot** — point-in-time host metrics (load, CPU%, mem, disk, docker stats summary)
+- **SandboxInventory** — known sandboxes on host (compose projects matching `sbx-*`,
+  directories under configured `base_dir`, age, owning profile if inferable)
+- **StaleCandidate** — entries exceeding TTL, idle threshold, or missing store record
+- **ProvisionDelta** — `before` / `after` HostSnapshot pair around create/destroy
+- **IntrospectionReport** — bundled output attached to sandbox `ready` response
+- Retention and privacy rules (no secret paths, no full `docker inspect` dumps by default)
+
+Extend meta-framework spec with `Host` observability fields (read-only; sand-boxer
+does not own long-term metrics DB).
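Reviewer sketch, not part of the patch: the `HostSnapshot` / `ProvisionDelta` pairing described in T01 could look roughly like the following. The workplan calls for Pydantic models; this sketch uses stdlib dataclasses only to stay dependency-free, and every field and property name (`load_1m`, `mem_available_mb`, `disk_used_pct`, `running_containers`, the `*_delta` properties) is an assumption for illustration, not something the workplan fixes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class HostSnapshot:
    """Point-in-time host metrics. Field names are hypothetical."""
    collected_at: datetime
    load_1m: float
    mem_available_mb: int
    disk_used_pct: float
    running_containers: int


@dataclass(frozen=True)
class ProvisionDelta:
    """A before/after HostSnapshot pair around create or destroy."""
    before: HostSnapshot
    after: HostSnapshot

    @property
    def load_delta(self) -> float:
        # Positive means the host is busier after the operation.
        return self.after.load_1m - self.before.load_1m

    @property
    def mem_available_delta_mb(self) -> int:
        # Negative means the operation consumed memory.
        return self.after.mem_available_mb - self.before.mem_available_mb

    @property
    def container_delta(self) -> int:
        return self.after.running_containers - self.before.running_containers


# Usage: two snapshots taken around a hypothetical `create`.
now = datetime.now(timezone.utc)
before = HostSnapshot(now, load_1m=0.5, mem_available_mb=4096,
                      disk_used_pct=40.0, running_containers=3)
after = HostSnapshot(now, load_1m=1.0, mem_available_mb=3584,
                     disk_used_pct=41.0, running_containers=5)
delta = ProvisionDelta(before=before, after=after)
```

Keeping the delta as a derived view over two immutable snapshots means the raw measurements stay available for export, and the same pair type can serve both `create` and `destroy`.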
+
+## Define profile.sandbox-canary and introspection schema
+
+```task
+id: SAND-WP-0008-T02
+status: todo
+priority: high
+```
+
+Add:
+
+- `profiles/profile.sandbox-canary.yaml` — lightweight compose or no-compose
+  introspection profile bound to `ext.compose-ssh` (or thin `ext.ssh-introspect`
+  if compose is unnecessary for canary)
+- Pydantic models: `HostSnapshot`, `SandboxInventory`, `StaleCandidate`,
+  `ProvisionDelta`, `IntrospectionReport`
+- Default inputs: `repo` optional; when omitted, resolve to sand-boxer repo root
+  (package parent path or `SANDBOXER_REPO_ROOT`)
+
+Canary deliverable on `ready`: JSON `IntrospectionReport` in sandbox status
+`detail` / reachability extension field.
+
+## Implement remote host metrics collector
+
+```task
+id: SAND-WP-0008-T03
+status: todo
+priority: high
+```
+
+SSH-side collection (shell + structured parse, no extra daemon on host):
+
+- Load average, CPU count, mem available/total, root disk use
+- `docker system df` / running container count
+- Optional: `docker stats --no-stream` aggregate for sbx-* projects only
+- Bounded runtime (e.g. ≤10s) and non-root-safe commands
+
+Module: `src/sandboxer/telemetry/host_snapshot.py` with unit tests using fixture
+command output.
+
+## Implement stale sandbox discovery
+
+```task
+id: SAND-WP-0008-T04
+status: todo
+priority: high
+```
+
+Scan remote host for:
+
+- Directories under `base_dir` (default `/tmp/sandboxer`) with mtime age
+- `docker compose ls` projects matching `sbx-*` / `e2e-*` legacy patterns
+- Cross-check against local `SandboxStore` — flag **orphans** (on host, not in store)
+  and **zombies** (in store, not on host)
+
+Output `StaleCandidate` list with suggested action: `reap`, `inspect`, `ignore`.
+No automatic deletion in this task — dry-run only.
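Reviewer sketch, not part of the patch: the orphan/zombie cross-check in T04 reduces to two set differences between what the host reports and what the local store knows. The function name `find_stale`, the `kind` field, and the mapping from kind to suggested action are all assumptions made for this example; the workplan only names the three actions and the two categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaleCandidate:
    """One dry-run finding. Field names are hypothetical."""
    sandbox_id: str
    kind: str              # "orphan" (on host, not in store) or "zombie" (in store, not on host)
    suggested_action: str  # one of "reap", "inspect", "ignore"


def find_stale(host_ids: set[str], store_ids: set[str]) -> list[StaleCandidate]:
    """Classify sandboxes by comparing host inventory with store records.

    Read-only: returns a report and never deletes anything.
    """
    # Assumed policy: orphans are reap candidates, zombies need a human look.
    orphans = [StaleCandidate(i, "orphan", "reap")
               for i in sorted(host_ids - store_ids)]
    zombies = [StaleCandidate(i, "zombie", "inspect")
               for i in sorted(store_ids - host_ids)]
    return orphans + zombies


# Usage: sbx-a exists only on the host, sbx-c only in the store.
report = find_stale({"sbx-a", "sbx-b"}, {"sbx-b", "sbx-c"})
for candidate in report:
    print(candidate.sandbox_id, candidate.kind, candidate.suggested_action)
```

Sorting the IDs keeps the report stable between runs, which makes fixture-based tests and diffs of successive dry runs easier to read.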
+
+## Capture before/after load around provision
+
+```task
+id: SAND-WP-0008-T05
+status: todo
+priority: medium
+```
+
+Integrate into `SandboxManager.create` / `destroy` when profile metadata requests
+telemetry (`metadata.observability: canary` or profile id `profile.sandbox-canary`):
+
+1. `HostSnapshot` before extension `provision`
+2. Run provision + wait_ready
+3. `HostSnapshot` after ready
+4. Compute `ProvisionDelta` (load/mem/disk/container deltas)
+
+Same pattern on `destroy` for teardown impact. Tests mock SSH collector.
+
+## Default repo: deploy sand-boxer itself
+
+```task
+id: SAND-WP-0008-T06
+status: todo
+priority: high
+```
+
+When `create` has no `repo` input:
+
+- Resolve default to sand-boxer repository root (`SANDBOXER_REPO_ROOT` override)
+- Use `profile.sandbox-canary` as default profile when `--profile` omitted **and**
+  no `repo` given (document precedence: explicit flags win)
+- Ship minimal `e2e/e2e.yml` or `docker-compose.canary.yml` in sand-boxer repo if
+  compose-up is required for parity with `ext.compose-ssh`
+
+CLI examples:
+
+```bash
+sandboxer create                          # canary self-deploy
+sandboxer create --profile profile.sandbox-canary
+sandboxer create --input repo=/other/repo  # unchanged behavior
+```
+
+## Wire introspection into canary provision flow
+
+```task
+id: SAND-WP-0008-T07
+status: todo
+priority: high
+```
+
+After `wait_ready` for canary profile:
+
+- Rsync includes `src/sandboxer/telemetry/` introspection entry script or invoke
+  collector modules via SSH one-liner
+- Assemble `IntrospectionReport` (inventory + deltas + stale candidates)
+- Attach to `SandboxStatus` (new optional `telemetry` field)
+- Print human summary in CLI (load delta, stale count, disk headroom)
+
+## Telemetry export for centralized analysis
+
+```task
+id: SAND-WP-0008-T08
+status: todo
+priority: medium
+```
+
+Emit structured telemetry to:
+
+1. **State Hub** — `progress/` events with `detail` containing `IntrospectionReport`
+   (extend existing lifecycle emitter)
+2. **Local artifact** — `~/.local/share/sandboxer/telemetry/.json` for
+   offline analysis
+3. **Export hook** (stub) — `TelemetrySink` protocol for future artifact-store /
+   Prometheus / ClickHouse; document contract only
+
+Include: `host`, `sandbox_id`, `profile_id`, `collected_at`, schema version.
+
+activity-core may schedule periodic canary runs later — out of scope here.
+
+## CLI inspect and stale reap commands
+
+```task
+id: SAND-WP-0008-T09
+status: todo
+priority: medium
+```
+
+```bash
+sandboxer inspect host [--host coulombcore]     # HostSnapshot + inventory, no create
+sandboxer inspect stale [--host ...] [--json]   # StaleCandidate list
+sandboxer reap-stale --dry-run [--host ...]     # report only
+sandboxer reap-stale --apply [--older-than 24h] # T10+; gated behind --apply
+```
+
+`inspect` does not require a running sandbox — SSH + read-only collectors only.
+
+## Runbook, tests, and CoulombCore verification
+
+```task
+id: SAND-WP-0008-T10
+status: todo
+priority: medium
+```
+
+- `docs/runbooks/profile-sandbox-canary.md`
+- Integration test: mock SSH fixtures for full report assembly
+- Manual proof on CoulombCore:
+  1. `sandboxer create` (no args) → `ready` + `IntrospectionReport`
+  2. `sandboxer inspect host` matches report host metrics
+  3. Introduce fake stale dir → appears in `inspect stale`
+  4. `destroy` → after snapshot shows load recovery
+- Satisfies SAND-WP-0002-T10 smoke variant when canary path used
+
+Record optimization hypotheses (disk pressure, stale reap policy) for phase-2
+automation via activity-core.
+
+---
+
+## Out of scope
+
+| Item | Target |
+|------|--------|
+| Long-term metrics database / dashboards | artifact-store or observability stack (separate workplan) |
+| Automatic scheduled reap without human gate | activity-core instruction (after dry-run proven) |
+| wise-validator migration | SAND-WP-0003 |
+| SaaS metering | SAND-WP-0006 |
+
+## Completion criteria
+
+- `sandboxer create` with no `repo` deploys sand-boxer and returns
+  `IntrospectionReport` on `ready`
+- Before/after host snapshots captured for canary creates
+- Stale sandbox inventory with dry-run reap CLI
+- Telemetry lands in State Hub `detail` and local JSON artifact
+- Runbook and tests merged; operator runs `make fix-consistency REPO=sand-boxer`
+
+## Operator note
+
+After merging task status updates:
+
+```bash
+cd ~/state-hub && make fix-consistency REPO=sand-boxer
+```
\ No newline at end of file