Add SAND-WP-0008: host telemetry and self-canary introspection

Workplan for default sand-boxer self-deploy, before/after host metrics,
stale sandbox inventory, and telemetry export for centralized analysis.
2026-06-23 14:25:05 +02:00
parent 939c4e1aff
commit 20e25726d7
3 changed files with 263 additions and 1 deletions

@@ -126,7 +126,8 @@ Additional boundaries:
- **Registry:** scaffold present (`registry/indexes/capabilities.yaml` empty;
`registry/capabilities/` placeholder); domain in index still `helix_forge`
from scaffold — needs alignment to `infotech`
- **Workplans:** `SAND-WP-0001` (State Hub bootstrap) in `ready`
- **Workplans:** `SAND-WP-0001` finished; `SAND-WP-0002` active;
`SAND-WP-0008` proposed (host telemetry / self-canary)
- **Lineage (external, not yet migrated):** `the-custodian/e2e-framework/`
(CUST-WP-0028, completed) and `infra/build-machines/` (CUST-WP-0032)

@@ -248,6 +248,7 @@ test (steps 14) pending operator run against CoulombCore/sandboxer01.
| `ext.vm-packer` (build-machines) | SAND-WP-0005 |
| SaaS extensions + payments layer | SAND-WP-0006 |
| Snapshot / restore / checkpoint profiles | SAND-WP-0007 |
| Host telemetry, self-canary, stale sandbox inventory | SAND-WP-0008 |
| Coulomb-native runtime (phase 5) | Backlog |
## Completion criteria

@@ -0,0 +1,260 @@
---
id: SAND-WP-0008
type: workplan
title: "Host telemetry and self-canary introspection"
domain: infotech
repo: sand-boxer
status: ready
owner: codex
topic_slug: custodian
created: "2026-06-23"
updated: "2026-06-23"
---
# Host telemetry and self-canary introspection
Use sand-boxer as its own trial deployment to prove provision/teardown **and**
return actionable host and sandbox intelligence: resource metrics, load before/after,
stale sandbox inventory, and structured telemetry for centralized analysis.
**Charter:** `INTENT.md` (host topology, observable lifecycle)
**Spec:** `docs/meta-framework.md` (Host resource, Meter — extend for self-hosted)
**Predecessor:** SAND-WP-0002 (`ext.compose-ssh`, CLI v0, State Hub events)
**Related:** SAND-WP-0002-T10 (remote smoke), activity-core (scheduled reap jobs)
## Problem
Today `sandboxer create` proves SSH + compose for an arbitrary repo but returns
only lifecycle state and reachability. Operators lack:
- Host load and capacity **before** accepting new sandboxes
- **After** metrics to quantify sandbox cost
- Inventory of **stale** sandboxes (`/tmp/sandboxer/*`, orphaned compose projects)
- A **default smoke path** that does not depend on another repo's `e2e/` layout

sand-boxer should dogfood itself: deploy the sand-boxer tree, run a bounded
introspection bundle on the remote host, and emit telemetry suitable for a
central datastore (State Hub first; export to artifact-store or metrics pipeline
later).
## Design host telemetry contract
```task
id: SAND-WP-0008-T01
status: todo
priority: high
```
Author `docs/host-telemetry.md` defining:
- **HostSnapshot** — point-in-time host metrics (load, CPU%, mem, disk, docker stats summary)
- **SandboxInventory** — known sandboxes on host (compose projects matching `sbx-*`,
directories under configured `base_dir`, age, owning profile if inferable)
- **StaleCandidate** — entries exceeding TTL, idle threshold, or missing store record
- **ProvisionDelta** — `before` / `after` HostSnapshot pair around create/destroy
- **IntrospectionReport** — bundled output attached to sandbox `ready` response
- Retention and privacy rules (no secret paths, no full `docker inspect` dumps by default)

Extend the meta-framework spec with `Host` observability fields (read-only; sand-boxer
does not own a long-term metrics DB).
## Define profile.sandbox-canary and introspection schema
```task
id: SAND-WP-0008-T02
status: todo
priority: high
```
Add:
- `profiles/profile.sandbox-canary.yaml` — lightweight compose or no-compose
introspection profile bound to `ext.compose-ssh` (or thin `ext.ssh-introspect`
if compose is unnecessary for canary)
- Pydantic models: `HostSnapshot`, `SandboxInventory`, `StaleCandidate`,
`ProvisionDelta`, `IntrospectionReport`
- Default inputs: `repo` optional; when omitted, resolve to sand-boxer repo root
(package parent path or `SANDBOXER_REPO_ROOT`)

Canary deliverable on `ready`: a JSON `IntrospectionReport` in the sandbox status
`detail` / reachability extension field.
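
The schema listed above could be prototyped roughly as follows. This is a minimal sketch that uses stdlib dataclasses so it runs without dependencies; the workplan calls for Pydantic models, and every field name here (`load_1m`, `mem_available_kb`, `reason`, `schema_version`, and so on) is an assumption rather than the agreed contract. `SandboxInventory` is omitted for brevity.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class HostSnapshot:
    """Point-in-time host metrics (field names are illustrative)."""
    collected_at: str            # ISO 8601 timestamp
    load_1m: float
    cpu_count: int
    mem_total_kb: int
    mem_available_kb: int
    root_disk_used_pct: float
    running_containers: int


@dataclass(frozen=True)
class StaleCandidate:
    name: str
    reason: str                  # e.g. "ttl_exceeded", "orphan", "zombie"
    suggested_action: str        # "reap", "inspect", or "ignore"


@dataclass(frozen=True)
class ProvisionDelta:
    before: HostSnapshot
    after: HostSnapshot


@dataclass
class IntrospectionReport:
    schema_version: str
    host: str
    sandbox_id: str
    profile_id: str
    delta: Optional[ProvisionDelta] = None
    stale: list[StaleCandidate] = field(default_factory=list)

    def to_detail(self) -> dict:
        """Return a plain dict suitable for a JSON `detail` payload."""
        return asdict(self)
```

`to_detail()` relies on `dataclasses.asdict`, which recurses into nested dataclasses and lists, so the result can be passed straight to `json.dumps`.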
## Implement remote host metrics collector
```task
id: SAND-WP-0008-T03
status: todo
priority: high
```
SSH-side collection (shell + structured parse, no extra daemon on host):
- Load average, CPU count, mem available/total, root disk use
- `docker system df` / running container count
- Optional: `docker stats --no-stream` aggregate for sbx-* projects only
- Bounded runtime (e.g. ≤10s) and non-root-safe commands

Module: `src/sandboxer/telemetry/host_snapshot.py` with unit tests using fixture
command output.
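
The "shell + structured parse" approach might look like the sketch below. It assumes a Linux host and that the collector runs `cat /proc/loadavg`, `cat /proc/meminfo`, and `df -P /` over SSH; the function names are hypothetical and only the parsing side is shown, which is the part that fixture-based unit tests would exercise.

```python
from __future__ import annotations


def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Parse the 1/5/15-minute load averages from /proc/loadavg output."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_meminfo(text: str) -> dict[str, int]:
    """Extract MemTotal and MemAvailable (in kB) from /proc/meminfo output."""
    wanted = {"MemTotal", "MemAvailable"}
    values: dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            values[key] = int(rest.split()[0])
    return values


def parse_df_root(text: str) -> float:
    """Return the used percentage from `df -P /` output.

    POSIX format: the second line holds the filesystem row and the
    fifth column is the capacity, e.g. "50%".
    """
    row = text.strip().splitlines()[1].split()
    return float(row[4].rstrip("%"))
```

Keeping the parsers as pure functions over captured text means the tests need no SSH connection, only recorded command output.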
## Implement stale sandbox discovery
```task
id: SAND-WP-0008-T04
status: todo
priority: high
```
Scan remote host for:
- Directories under `base_dir` (default `/tmp/sandboxer`) with mtime age
- `docker compose ls` projects matching `sbx-*` / `e2e-*` legacy patterns
- Cross-check against local `SandboxStore` — flag **orphans** (on host, not in store)
and **zombies** (in store, not on host)

Output a `StaleCandidate` list, each entry carrying a suggested action: `reap`, `inspect`, or `ignore`.
This task performs no automatic deletion; it is dry-run only.
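
The orphan/zombie cross-check reduces to two set differences. The sketch below is illustrative only: the `find_stale` name, the `kind` field, the 24-hour default TTL, and the rule that maps age to a suggested action are all assumptions, not decisions recorded in this workplan.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StaleCandidate:
    name: str
    kind: str              # "orphan" or "zombie"
    suggested_action: str  # "reap", "inspect", or "ignore"


def find_stale(
    host_ages: dict[str, float],   # sandbox name -> age in seconds, as seen on the host
    store_ids: set[str],           # sandbox names known to the local store
    ttl_seconds: float = 24 * 3600,
) -> list[StaleCandidate]:
    """Classify entries that exist on only one side. Never deletes anything."""
    candidates: list[StaleCandidate] = []
    # Orphans: present on the host, missing from the store.
    for name in sorted(set(host_ages) - store_ids):
        action = "reap" if host_ages[name] > ttl_seconds else "inspect"
        candidates.append(StaleCandidate(name, "orphan", action))
    # Zombies: recorded in the store, missing from the host.
    for name in sorted(store_ids - set(host_ages)):
        candidates.append(StaleCandidate(name, "zombie", "inspect"))
    return candidates
```

The function only reports; acting on a `reap` suggestion is left to a separate, explicitly gated command.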
## Capture before/after load around provision
```task
id: SAND-WP-0008-T05
status: todo
priority: medium
```
Integrate into `SandboxManager.create` / `destroy` when profile metadata requests
telemetry (`metadata.observability: canary` or profile id `profile.sandbox-canary`):
1. `HostSnapshot` before extension `provision`
2. Run provision + wait_ready
3. `HostSnapshot` after ready
4. Compute `ProvisionDelta` (load/mem/disk/container deltas)

The same pattern applies on `destroy` to measure teardown impact. Tests mock the SSH collector.
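
The four steps above can be sketched as a small wrapper that takes the collector and the lifecycle action as callables, which is also what makes the SSH collector easy to mock. All names here (`capture_delta`, the `*_change` properties, the snapshot fields) are hypothetical and not the actual `SandboxManager` API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HostSnapshot:
    load_1m: float
    mem_available_kb: int
    root_disk_used_pct: float
    running_containers: int


@dataclass(frozen=True)
class ProvisionDelta:
    before: HostSnapshot
    after: HostSnapshot

    @property
    def load_1m_change(self) -> float:
        return self.after.load_1m - self.before.load_1m

    @property
    def mem_available_kb_change(self) -> int:
        return self.after.mem_available_kb - self.before.mem_available_kb

    @property
    def root_disk_used_pct_change(self) -> float:
        return self.after.root_disk_used_pct - self.before.root_disk_used_pct

    @property
    def running_containers_change(self) -> int:
        return self.after.running_containers - self.before.running_containers


def capture_delta(
    collect: Callable[[], HostSnapshot],
    action: Callable[[], None],
) -> ProvisionDelta:
    """Snapshot the host, run the action (provision or destroy), snapshot again."""
    before = collect()
    action()
    after = collect()
    return ProvisionDelta(before, after)
```

Passing `destroy` as the action gives the teardown measurement with no extra code.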
## Default repo: deploy sand-boxer itself
```task
id: SAND-WP-0008-T06
status: todo
priority: high
```
When `create` has no `repo` input:
- Resolve default to sand-boxer repository root (`SANDBOXER_REPO_ROOT` override)
- Use `profile.sandbox-canary` as default profile when `--profile` omitted **and**
no `repo` given (document precedence: explicit flags win)
- Ship minimal `e2e/e2e.yml` or `docker-compose.canary.yml` in sand-boxer repo if
compose-up is required for parity with `ext.compose-ssh`

CLI examples:
```bash
sandboxer create # canary self-deploy
sandboxer create --profile profile.sandbox-canary
sandboxer create --input repo=/other/repo # unchanged behavior
```
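
One way to encode the precedence rule ("explicit flags win") is sketched below. The function name and the `fallback_profile` parameter are hypothetical: this workplan does not say which profile applies today when `repo` is given without `--profile`, so the sketch takes that value as an input rather than guessing it.

```python
from __future__ import annotations

from pathlib import Path
from typing import Mapping, Optional

CANARY_PROFILE = "profile.sandbox-canary"


def resolve_create_inputs(
    repo: Optional[str],
    profile: Optional[str],
    env: Mapping[str, str],
    package_root: Path,
    fallback_profile: str,   # assumption: whatever default `create` uses today
) -> tuple[Path, str]:
    """Return (repo path, profile id) for `sandboxer create`."""
    if repo is not None:
        # Explicit repo: behavior is unchanged; canary is not implied.
        return Path(repo), profile or fallback_profile
    # No repo: self-deploy, with SANDBOXER_REPO_ROOT overriding the package root.
    root = Path(env.get("SANDBOXER_REPO_ROOT") or package_root)
    return root, profile or CANARY_PROFILE
```

Taking `env` as a mapping instead of reading `os.environ` directly keeps the resolver testable without patching the process environment.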
## Wire introspection into canary provision flow
```task
id: SAND-WP-0008-T07
status: todo
priority: high
```
After `wait_ready` for canary profile:
- Either include the `src/sandboxer/telemetry/` introspection entry script in the
rsync payload, or invoke the collector modules via an SSH one-liner
- Assemble `IntrospectionReport` (inventory + deltas + stale candidates)
- Attach to `SandboxStatus` (new optional `telemetry` field)
- Print human summary in CLI (load delta, stale count, disk headroom)
## Telemetry export for centralized analysis
```task
id: SAND-WP-0008-T08
status: todo
priority: medium
```
Emit structured telemetry to:
1. **State Hub:** `progress/` events with `detail` containing `IntrospectionReport`
(extend existing lifecycle emitter)
2. **Local artifact:** `~/.local/share/sandboxer/telemetry/<sandbox_id>.json` for
offline analysis
3. **Export hook** (stub) — `TelemetrySink` protocol for future artifact-store /
Prometheus / ClickHouse; document contract only

Each record includes `host`, `sandbox_id`, `profile_id`, `collected_at`, and a schema version.
activity-core may schedule periodic canary runs later; that is out of scope here.
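
A possible shape for the `TelemetrySink` contract and the local JSON artifact is sketched here. The `emit` method name, the `schema_version` key, and the `emit_all` helper are assumptions; the workplan only names the protocol and the required fields.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Iterable, Mapping, Protocol

REQUIRED_FIELDS = ("host", "sandbox_id", "profile_id", "collected_at", "schema_version")


class TelemetrySink(Protocol):
    """Anything that can accept one telemetry record."""

    def emit(self, record: Mapping[str, Any]) -> None: ...


class LocalJsonSink:
    """Writes one `<sandbox_id>.json` file per record under `root`."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def emit(self, record: Mapping[str, Any]) -> None:
        self.root.mkdir(parents=True, exist_ok=True)
        path = self.root / f"{record['sandbox_id']}.json"
        path.write_text(json.dumps(dict(record), indent=2, sort_keys=True), encoding="utf-8")


def emit_all(record: Mapping[str, Any], sinks: Iterable[TelemetrySink]) -> None:
    """Validate the required envelope fields, then fan out to every sink."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"telemetry record missing fields: {missing}")
    for sink in sinks:
        sink.emit(record)
```

Because `TelemetrySink` is a structural protocol, a future artifact-store or metrics exporter only needs an `emit` method and no shared base class.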
## CLI inspect and stale reap commands
```task
id: SAND-WP-0008-T09
status: todo
priority: medium
```
```bash
sandboxer inspect host [--host coulombcore] # HostSnapshot + inventory, no create
sandboxer inspect stale [--host ...] [--json] # StaleCandidate list
sandboxer reap-stale --dry-run [--host ...] # report only
sandboxer reap-stale --apply [--older-than 24h] # T10+; gated behind --apply
```
`inspect` does not require a running sandbox — SSH + read-only collectors only.
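
The command surface above could be wired with `argparse` roughly as follows. This is a sketch under assumptions: the real CLI may use a different framework, the duration units other than `h` are invented for illustration, and `--older-than` is accepted here with either mode although the listing shows it only with `--apply`.

```python
import argparse
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_age(text: str) -> int:
    """Convert a duration such as '24h' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise argparse.ArgumentTypeError(f"invalid duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sandboxer")
    sub = parser.add_subparsers(dest="command", required=True)

    inspect_cmd = sub.add_parser("inspect", help="read-only host introspection")
    inspect_cmd.add_argument("target", choices=["host", "stale"])
    inspect_cmd.add_argument("--host")
    inspect_cmd.add_argument("--json", action="store_true")

    reap = sub.add_parser("reap-stale", help="report or remove stale sandboxes")
    # Exactly one mode must be chosen, so deletion is never the implicit default.
    mode = reap.add_mutually_exclusive_group(required=True)
    mode.add_argument("--dry-run", action="store_true")
    mode.add_argument("--apply", action="store_true")
    reap.add_argument("--host")
    reap.add_argument("--older-than", type=parse_age, default=None)
    return parser
```

Making the mode group required means a bare `sandboxer reap-stale` fails fast instead of silently picking a behavior.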
## Runbook, tests, and CoulombCore verification
```task
id: SAND-WP-0008-T10
status: todo
priority: medium
```
- `docs/runbooks/profile-sandbox-canary.md`
- Integration test: mock SSH fixtures for full report assembly
- Manual proof on CoulombCore:
1. `sandboxer create` (no args) → `ready` + `IntrospectionReport`
2. `sandboxer inspect host` matches report host metrics
3. Introduce fake stale dir → appears in `inspect stale`
4. `destroy` → after snapshot shows load recovery
- Satisfies the SAND-WP-0002-T10 smoke variant when the canary path is used

Record optimization hypotheses (disk pressure, stale reap policy) for phase-2
automation via activity-core.

---
## Out of scope
| Item | Target |
|------|--------|
| Long-term metrics database / dashboards | artifact-store or observability stack (separate workplan) |
| Automatic scheduled reap without human gate | activity-core instruction (after dry-run proven) |
| wise-validator migration | SAND-WP-0003 |
| SaaS metering | SAND-WP-0006 |
## Completion criteria
- `sandboxer create` with no `repo` deploys sand-boxer and returns
`IntrospectionReport` on `ready`
- Before/after host snapshots captured for canary creates
- Stale sandbox inventory with dry-run reap CLI
- Telemetry lands in State Hub `detail` and local JSON artifact
- Runbook and tests merged; operator runs `make fix-consistency REPO=sand-boxer`
## Operator note
After merging task status updates:
```bash
cd ~/state-hub && make fix-consistency REPO=sand-boxer
```