---
id: SAND-WP-0008
type: workplan
title: "Host telemetry and self-canary introspection"
domain: infotech
repo: sand-boxer
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-23"
updated: "2026-06-23"
state_hub_workstream_id: "afbcbc84-5ec7-4f8b-ae21-4cbda0d05195"
---

# Host telemetry and self-canary introspection

Use sand-boxer as its own trial deployment to prove provision/teardown **and**
return actionable host and sandbox intelligence: resource metrics, load before/after,
stale sandbox inventory, and structured telemetry for centralized analysis.

**Charter:** `INTENT.md` (host topology, observable lifecycle)
**Spec:** `docs/meta-framework.md` (Host resource, Meter; extend for self-hosted)
**Predecessor:** SAND-WP-0002 (`ext.compose-ssh`, CLI v0, State Hub events)
**Related:** SAND-WP-0002-T10 (remote smoke), activity-core (scheduled reap jobs)

## Problem

Today `sandboxer create` proves SSH + compose for an arbitrary repo but returns
only lifecycle state and reachability. Operators lack:

- Host load and capacity **before** accepting new sandboxes
- **After** metrics to quantify sandbox cost
- Inventory of **stale** sandboxes (`/tmp/sandboxer/*`, orphaned compose projects)
- A **default smoke path** that does not depend on another repo's `e2e/` layout

sand-boxer should dogfood itself: deploy the sand-boxer tree, run a bounded
introspection bundle on the remote host, and emit telemetry suitable for a
central datastore (State Hub first; export to artifact-store or metrics pipeline
later).

## Design host telemetry contract

```task
id: SAND-WP-0008-T01
status: done
priority: high
state_hub_task_id: "8f7b46e3-045e-481c-81bd-1c61734c6eb3"
```

Author `docs/host-telemetry.md` defining:

- **HostSnapshot**: point-in-time host metrics (load, CPU%, mem, disk, docker stats summary)
- **SandboxInventory**: known sandboxes on host (compose projects matching `sbx-*`,
  directories under configured `base_dir`, age, owning profile if inferable)
- **StaleCandidate**: entries exceeding TTL, idle threshold, or missing store record
- **ProvisionDelta**: `before` / `after` HostSnapshot pair around create/destroy
- **IntrospectionReport**: bundled output attached to sandbox `ready` response
- Retention and privacy rules (no secret paths, no full `docker inspect` dumps by default)

Extend meta-framework spec with `Host` observability fields (read-only; sand-boxer
does not own long-term metrics DB).

## Define profile.sandbox-canary and introspection schema

```task
id: SAND-WP-0008-T02
status: done
priority: high
state_hub_task_id: "732bae4e-2dd9-4500-a86d-e869007bb383"
```

Add:

- `profiles/profile.sandbox-canary.yaml`: lightweight compose or no-compose
  introspection profile bound to `ext.compose-ssh` (or thin `ext.ssh-introspect`
  if compose is unnecessary for canary)
- Pydantic models: `HostSnapshot`, `SandboxInventory`, `StaleCandidate`,
  `ProvisionDelta`, `IntrospectionReport`
- Default inputs: `repo` optional; when omitted, resolve to sand-boxer repo root
  (package parent path or `SANDBOXER_REPO_ROOT`)

Canary deliverable on `ready`: JSON `IntrospectionReport` in sandbox status
`detail` / reachability extension field.

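A minimal sketch of how the five models could nest, using stdlib dataclasses as a stand-in for the Pydantic models named above. Every field name here is an assumption made for illustration; the authoritative shapes belong in `docs/host-telemetry.md`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class HostSnapshot:
    # Hypothetical fields: point-in-time host metrics.
    collected_at: str
    load_1m: float
    cpu_count: int
    mem_available_mb: int
    mem_total_mb: int
    disk_used_pct: float
    running_containers: int


@dataclass
class SandboxInventory:
    compose_projects: List[str] = field(default_factory=list)
    directories: List[str] = field(default_factory=list)


@dataclass
class StaleCandidate:
    name: str
    reason: str            # e.g. "ttl_exceeded", "orphan", "zombie"
    suggested_action: str  # "reap", "inspect", or "ignore"


@dataclass
class ProvisionDelta:
    before: HostSnapshot
    after: HostSnapshot


@dataclass
class IntrospectionReport:
    schema_version: str
    host: str
    inventory: SandboxInventory
    provision_delta: Optional[ProvisionDelta] = None
    stale_candidates: List[StaleCandidate] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping `provision_delta` optional in this sketch lets the same report type serve read-only inspection, where no provision happens and only the inventory and stale candidates are populated.
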
## Implement remote host metrics collector

```task
id: SAND-WP-0008-T03
status: done
priority: high
state_hub_task_id: "7bd22f27-5058-4c19-98b6-b923909a8815"
```

SSH-side collection (shell + structured parse, no extra daemon on host):

- Load average, CPU count, mem available/total, root disk use
- `docker system df` / running container count
- Optional: `docker stats --no-stream` aggregate for `sbx-*` projects only
- Bounded runtime (e.g. ≤10s) and non-root-safe commands

Module: `src/sandboxer/telemetry/host_snapshot.py` with unit tests using fixture
command output.

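The "shell + structured parse" approach can be sketched as pure functions over captured command output, which is what makes fixture-based unit tests possible. The sketch below assumes the collector reads `/proc/loadavg`, `nproc`, `/proc/meminfo`, and `df -P /`; the command choice, function names, and `HostSnapshot` fields are illustrative, not the actual contents of `host_snapshot.py`. Docker metrics are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSnapshot:
    # Field names are illustrative, not the project's actual schema.
    load_1m: float
    cpu_count: int
    mem_total_mb: int
    mem_available_mb: int
    root_disk_used_pct: int


def parse_loadavg(text: str) -> float:
    """1-minute load average: the first field of /proc/loadavg."""
    return float(text.split()[0])


def parse_meminfo(text: str) -> dict:
    """MemTotal and MemAvailable from /proc/meminfo, converted from kB to MiB."""
    wanted = {"MemTotal", "MemAvailable"}
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            values[key] = int(rest.split()[0]) // 1024
    return values


def parse_df_root(text: str) -> int:
    """Use% of / from `df -P /`; the data row is the second line."""
    capacity = text.strip().splitlines()[1].split()[4]
    return int(capacity.rstrip("%"))


def build_snapshot(loadavg: str, nproc: str, meminfo: str, df_root: str) -> HostSnapshot:
    mem = parse_meminfo(meminfo)
    return HostSnapshot(
        load_1m=parse_loadavg(loadavg),
        cpu_count=int(nproc.strip()),
        mem_total_mb=mem["MemTotal"],
        mem_available_mb=mem["MemAvailable"],
        root_disk_used_pct=parse_df_root(df_root),
    )


# Fixture text standing in for output captured over SSH.
FIXTURE_LOADAVG = "0.42 0.36 0.30 1/512 12345\n"
FIXTURE_NPROC = "4\n"
FIXTURE_MEMINFO = (
    "MemTotal:       16384000 kB\n"
    "MemFree:         1024000 kB\n"
    "MemAvailable:    8192000 kB\n"
)
FIXTURE_DF = (
    "Filesystem     1024-blocks     Used Available Capacity Mounted on\n"
    "/dev/sda1         41152736 20576368  20576368      50% /\n"
)
```

All four sources are readable without root, and each parser touches only the fields it needs, so unexpected extra lines in `/proc/meminfo` do not break collection.
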
## Implement stale sandbox discovery

```task
id: SAND-WP-0008-T04
status: done
priority: high
state_hub_task_id: "c2d19bb7-9322-4744-a71e-75f7701a6fb2"
```

Scan remote host for:

- Directories under `base_dir` (default `/tmp/sandboxer`) with mtime age
- `docker compose ls` projects matching `sbx-*` / `e2e-*` legacy patterns
- Cross-check against local `SandboxStore`: flag **orphans** (on host, not in store)
  and **zombies** (in store, not on host)

Output `StaleCandidate` list with suggested action: `reap`, `inspect`, `ignore`.
No automatic deletion in this task; dry-run only.

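The orphan/zombie cross-check reduces to a set comparison between what the host reports and what the store knows. A minimal sketch follows; the `find_stale` name, the `kind` labels, and the rule mapping age to a suggested action are assumptions for illustration, not the project's actual policy.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping


@dataclass(frozen=True)
class StaleCandidate:
    name: str
    kind: str              # "orphan", "zombie", or "ttl_exceeded"
    suggested_action: str  # "reap", "inspect", or "ignore"


def find_stale(
    host_ages: Mapping[str, float],  # sandbox name -> age in seconds, as seen on host
    store_ids: Iterable[str],        # sandbox names recorded in the local store
    ttl_seconds: float,
) -> List[StaleCandidate]:
    store = set(store_ids)
    candidates = []
    for name in sorted(host_ages):
        age = host_ages[name]
        if name not in store:
            # Orphan: on host, not in store. Only old orphans are suggested for reaping.
            action = "reap" if age > ttl_seconds else "inspect"
            candidates.append(StaleCandidate(name, "orphan", action))
        elif age > ttl_seconds:
            candidates.append(StaleCandidate(name, "ttl_exceeded", "inspect"))
    for name in sorted(store - set(host_ages)):
        # Zombie: in store, not on host. Nothing to delete remotely, so inspect.
        candidates.append(StaleCandidate(name, "zombie", "inspect"))
    return candidates
```

The function only returns suggestions and never deletes anything, which matches the dry-run constraint of this task.
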
## Capture before/after load around provision

```task
id: SAND-WP-0008-T05
status: done
priority: medium
state_hub_task_id: "b6b02289-d36e-4ee1-9ff7-dc59a1d24886"
```

Integrate into `SandboxManager.create` / `destroy` when profile metadata requests
telemetry (`metadata.observability: canary` or profile id `profile.sandbox-canary`):

1. `HostSnapshot` before extension `provision`
2. Run provision + wait_ready
3. `HostSnapshot` after ready
4. Compute `ProvisionDelta` (load/mem/disk/container deltas)

Same pattern on `destroy` for teardown impact. Tests mock SSH collector.

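The four steps above can be sketched as a wrapper that takes a snapshot on each side of an arbitrary action. This is a simplified illustration: `with_snapshots`, `compute_delta`, and the field names are hypothetical, and the delta here stores only the differences rather than the full `before` / `after` pair.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class HostSnapshot:
    load_1m: float
    mem_available_mb: int
    disk_used_mb: int
    running_containers: int


@dataclass(frozen=True)
class ProvisionDelta:
    load_1m: float
    mem_available_mb: int
    disk_used_mb: int
    running_containers: int


def compute_delta(before: HostSnapshot, after: HostSnapshot) -> ProvisionDelta:
    """Each field is after minus before; negative mem means memory was consumed."""
    return ProvisionDelta(
        load_1m=round(after.load_1m - before.load_1m, 2),
        mem_available_mb=after.mem_available_mb - before.mem_available_mb,
        disk_used_mb=after.disk_used_mb - before.disk_used_mb,
        running_containers=after.running_containers - before.running_containers,
    )


def with_snapshots(
    collect: Callable[[], HostSnapshot],
    action: Callable[[], T],
) -> Tuple[T, ProvisionDelta]:
    """Snapshot, run the action (provision or destroy), snapshot again."""
    before = collect()
    result = action()
    after = collect()
    return result, compute_delta(before, after)
```

Passing the collector in as a callable is what lets tests substitute a mock for the SSH collector, and the same wrapper covers `destroy` by swapping the action.
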
## Default repo: deploy sand-boxer itself

```task
id: SAND-WP-0008-T06
status: done
priority: high
state_hub_task_id: "d9941d93-a662-45c0-820b-88d32266c653"
```

When `create` has no `repo` input:

- Resolve default to sand-boxer repository root (`SANDBOXER_REPO_ROOT` override)
- Use `profile.sandbox-canary` as default profile when `--profile` omitted **and**
  no `repo` given (document precedence: explicit flags win)
- Ship minimal `e2e/e2e.yml` or `docker-compose.canary.yml` in sand-boxer repo if
  compose-up is required for parity with `ext.compose-ssh`

CLI examples:

```bash
sandboxer create                                    # canary self-deploy
sandboxer create --profile profile.sandbox-canary
sandboxer create --input repo=/other/repo           # unchanged behavior
```

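The precedence described above ("explicit flags win") can be written down as a small pure function. This is a sketch under assumptions: `resolve_create_inputs` is a hypothetical name, and returning `None` for the profile when a repo is given stands in for whatever profile selection `create` already performs in that case.

```python
from typing import Mapping, Optional, Tuple

CANARY_PROFILE = "profile.sandbox-canary"


def resolve_create_inputs(
    profile: Optional[str],
    repo: Optional[str],
    env: Mapping[str, str],
    package_parent: str,
) -> Tuple[Optional[str], str]:
    """Return (profile, repo) after applying canary defaults.

    profile / repo are the explicit CLI values, or None when omitted.
    package_parent is the sand-boxer checkout inferred from the package path.
    """
    if repo is not None:
        # Explicit repo: behavior unchanged, profile is left exactly as given.
        return profile, repo
    # No repo: self-deploy. The env override beats the inferred package path.
    default_repo = env.get("SANDBOXER_REPO_ROOT") or package_parent
    # The canary profile applies only when --profile was also omitted.
    return profile or CANARY_PROFILE, default_repo
```

Taking `env` as a parameter instead of reading `os.environ` directly keeps the function deterministic under test.
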
## Wire introspection into canary provision flow

```task
id: SAND-WP-0008-T07
status: done
priority: high
state_hub_task_id: "76430452-c98e-44e5-b625-e243dc12b8a5"
```

After `wait_ready` for canary profile:

- Include the `src/sandboxer/telemetry/` introspection entry script in the rsync
  payload, or invoke collector modules via an SSH one-liner
- Assemble `IntrospectionReport` (inventory + deltas + stale candidates)
- Attach to `SandboxStatus` (new optional `telemetry` field)
- Print human summary in CLI (load delta, stale count, disk headroom)

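As an illustration of the human summary line, the sketch below formats the three values the list names. The `summarize` function, its parameters, and the exact wording of the output are assumptions; disk headroom is taken here to mean 100 minus the root disk use percentage.

```python
def summarize(load_delta: float, stale_count: int, root_disk_used_pct: int) -> str:
    """One-line operator summary of an introspection report."""
    headroom = max(0, 100 - root_disk_used_pct)
    # The "+" format flag keeps the sign visible for both rising and falling load.
    return (
        f"load delta {load_delta:+.2f}, "
        f"stale candidates {stale_count}, "
        f"disk headroom {headroom}%"
    )
```
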
## Telemetry export for centralized analysis

```task
id: SAND-WP-0008-T08
status: done
priority: medium
state_hub_task_id: "4ee4b95b-e7b5-4893-b78e-914f808bc00a"
```

Emit structured telemetry to:

1. **State Hub**: `progress/` events with `detail` containing `IntrospectionReport`
   (extend existing lifecycle emitter)
2. **Local artifact**: `~/.local/share/sandboxer/telemetry/<sandbox_id>.json` for
   offline analysis
3. **Export hook** (stub): `TelemetrySink` protocol for future artifact-store /
   Prometheus / ClickHouse; document contract only

Include: `host`, `sandbox_id`, `profile_id`, `collected_at`, schema version.

activity-core may schedule periodic canary runs later; out of scope here.

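One way the `TelemetrySink` contract and the local JSON artifact could fit together is sketched below. Only the `TelemetrySink` name and the envelope keys come from the text above; the `emit` method, `build_envelope`, the two sink classes, and the schema version value `"1"` are assumptions.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Protocol


class TelemetrySink(Protocol):
    """Anything that can accept one telemetry envelope."""

    def emit(self, envelope: Dict[str, Any]) -> None: ...


def build_envelope(
    host: str,
    sandbox_id: str,
    profile_id: str,
    collected_at: str,
    report: Dict[str, Any],
) -> Dict[str, Any]:
    # Envelope keys follow the "Include" list; the version value is a placeholder.
    return {
        "schema_version": "1",
        "host": host,
        "sandbox_id": sandbox_id,
        "profile_id": profile_id,
        "collected_at": collected_at,
        "report": report,
    }


class LocalJsonSink:
    """Writes <root>/<sandbox_id>.json, one file per sandbox."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def emit(self, envelope: Dict[str, Any]) -> None:
        self.root.mkdir(parents=True, exist_ok=True)
        path = self.root / f"{envelope['sandbox_id']}.json"
        path.write_text(json.dumps(envelope, indent=2, sort_keys=True))


class InMemorySink:
    """Collects envelopes in a list; convenient for tests."""

    def __init__(self) -> None:
        self.envelopes: List[Dict[str, Any]] = []

    def emit(self, envelope: Dict[str, Any]) -> None:
        self.envelopes.append(envelope)


def export(envelope: Dict[str, Any], sinks: List[TelemetrySink]) -> None:
    for sink in sinks:
        sink.emit(envelope)
```

In this sketch the local artifact would be produced by `LocalJsonSink(Path.home() / ".local/share/sandboxer/telemetry")`. Because `TelemetrySink` is a structural protocol, a future artifact-store or Prometheus sink only needs an `emit` method and no shared base class.
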
## CLI inspect and stale reap commands

```task
id: SAND-WP-0008-T09
status: done
priority: medium
state_hub_task_id: "6ea8eda6-491b-460a-a526-7565962f449e"
```

```bash
sandboxer inspect host [--host coulombcore]      # HostSnapshot + inventory, no create
sandboxer inspect stale [--host ...] [--json]    # StaleCandidate list
sandboxer reap-stale --dry-run [--host ...]      # report only
sandboxer reap-stale --apply [--older-than 24h]  # T10+; gated behind --apply
```

`inspect` does not require a running sandbox; SSH + read-only collectors only.

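A value such as `--older-than 24h` has to be turned into a duration before it can be compared with sandbox age. The sketch below shows one possible parser; the source only shows the `24h` form, so the accepted unit set (`s`, `m`, `h`, `d`) and the `parse_older_than` name are assumptions.

```python
import re
from datetime import timedelta

# Assumed unit suffixes; only "h" appears in the CLI example.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_older_than(value: str) -> timedelta:
    """Parse strings like "24h" or "7d" into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})
```

Rejecting anything that does not match the pattern, rather than guessing, is a deliberate choice for a flag that gates deletion.
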
## Runbook, tests, and CoulombCore verification

```task
id: SAND-WP-0008-T10
status: done
priority: medium
state_hub_task_id: "435a3993-d8d3-4280-b68a-c37e34d20312"
```

- `docs/runbooks/profile-sandbox-canary.md`
- Integration test: mock SSH fixtures for full report assembly
- Manual proof on CoulombCore:
  1. `sandboxer create` (no args) → `ready` + `IntrospectionReport`
  2. `sandboxer inspect host` matches report host metrics
  3. Introduce fake stale dir → appears in `inspect stale`
  4. `destroy` → after snapshot shows load recovery
- Satisfies SAND-WP-0002-T10 smoke variant when canary path used

Record optimization hypotheses (disk pressure, stale reap policy) for phase-2
automation via activity-core.

---

## Out of scope

| Item | Target |
|------|--------|
| Long-term metrics database / dashboards | artifact-store or observability stack (separate workplan) |
| Automatic scheduled reap without human gate | activity-core instruction (after dry-run proven) |
| wise-validator migration | SAND-WP-0003 |
| SaaS metering | SAND-WP-0006 |

## Completion criteria

- `sandboxer create` with no `repo` deploys sand-boxer and returns
  `IntrospectionReport` on `ready`
- Before/after host snapshots captured for canary creates
- Stale sandbox inventory with dry-run reap CLI
- Telemetry lands in State Hub `detail` and local JSON artifact
- Runbook and tests merged; operator runs `make fix-consistency REPO=sand-boxer`

## Operator note

After merging task status updates:

```bash
cd ~/state-hub && make fix-consistency REPO=sand-boxer
```

## Verification record (2026-06-23)

CoulombCore remote proof:

1. `sandboxer create` (no args) → `ready` + `telemetry.provision_delta`
2. `sandboxer inspect host` → load/mem metrics returned
3. Stale orphans from prior runs detected in `stale_candidates`
4. `sandboxer destroy` → `destroy_delta` with load Δ -0.09, mem +54 MB