generated from coulomb/repo-seed
Add SAND-WP-0008: host telemetry and self-canary introspection
Workplan for default sand-boxer self-deploy, before/after host metrics, stale sandbox inventory, and telemetry export for centralized analysis.
3
SCOPE.md
@@ -126,7 +126,8 @@ Additional boundaries:
 - **Registry:** scaffold present (`registry/indexes/capabilities.yaml` empty;
   `registry/capabilities/` placeholder); domain in index still `helix_forge`
   from scaffold — needs alignment to `infotech`
-- **Workplans:** `SAND-WP-0001` (State Hub bootstrap) in `ready`
+- **Workplans:** `SAND-WP-0001` finished; `SAND-WP-0002` active;
+  `SAND-WP-0008` proposed (host telemetry / self-canary)
 - **Lineage (external, not yet migrated):** `the-custodian/e2e-framework/`
   (CUST-WP-0028, completed) and `infra/build-machines/` (CUST-WP-0032)

@@ -248,6 +248,7 @@ test (steps 1–4) pending operator run against CoulombCore/sandboxer01.
 | `ext.vm-packer` (build-machines) | SAND-WP-0005 |
 | SaaS extensions + payments layer | SAND-WP-0006 |
 | Snapshot / restore / checkpoint profiles | SAND-WP-0007 |
+| Host telemetry, self-canary, stale sandbox inventory | SAND-WP-0008 |
 | Coulomb-native runtime (phase 5) | Backlog |

 ## Completion criteria

260
workplans/SAND-WP-0008-host-telemetry-and-self-canary.md
Normal file
@@ -0,0 +1,260 @@
---
id: SAND-WP-0008
type: workplan
title: "Host telemetry and self-canary introspection"
domain: infotech
repo: sand-boxer
status: ready
owner: codex
topic_slug: custodian
created: "2026-06-23"
updated: "2026-06-23"
---

# Host telemetry and self-canary introspection

Use sand-boxer as its own trial deployment to prove provision/teardown **and**
return actionable host and sandbox intelligence: resource metrics, load before/after,
stale sandbox inventory, and structured telemetry for centralized analysis.

**Charter:** `INTENT.md` (host topology, observable lifecycle)
**Spec:** `docs/meta-framework.md` (Host resource, Meter — extend for self-hosted)
**Predecessor:** SAND-WP-0002 (`ext.compose-ssh`, CLI v0, State Hub events)
**Related:** SAND-WP-0002-T10 (remote smoke), activity-core (scheduled reap jobs)

## Problem

Today `sandboxer create` proves SSH + compose for an arbitrary repo but returns
only lifecycle state and reachability. Operators lack:

- Host load and capacity **before** accepting new sandboxes
- **After** metrics to quantify sandbox cost
- Inventory of **stale** sandboxes (`/tmp/sandboxer/*`, orphaned compose projects)
- A **default smoke path** that does not depend on another repo's `e2e/` layout

sand-boxer should dogfood itself: deploy the sand-boxer tree, run a bounded
introspection bundle on the remote host, and emit telemetry suitable for a
central datastore (State Hub first; export to artifact-store or metrics pipeline
later).

## Design host telemetry contract

```task
id: SAND-WP-0008-T01
status: todo
priority: high
```

Author `docs/host-telemetry.md` defining:

- **HostSnapshot** — point-in-time host metrics (load, CPU%, mem, disk, docker stats summary)
- **SandboxInventory** — known sandboxes on host (compose projects matching `sbx-*`,
  directories under configured `base_dir`, age, owning profile if inferable)
- **StaleCandidate** — entries exceeding TTL, idle threshold, or missing store record
- **ProvisionDelta** — `before` / `after` HostSnapshot pair around create/destroy
- **IntrospectionReport** — bundled output attached to sandbox `ready` response
- Retention and privacy rules (no secret paths, no full `docker inspect` dumps by default)

Extend meta-framework spec with `Host` observability fields (read-only; sand-boxer
does not own long-term metrics DB).

## Define profile.sandbox-canary and introspection schema

```task
id: SAND-WP-0008-T02
status: todo
priority: high
```

Add:

- `profiles/profile.sandbox-canary.yaml` — lightweight compose or no-compose
  introspection profile bound to `ext.compose-ssh` (or thin `ext.ssh-introspect`
  if compose is unnecessary for canary)
- Pydantic models: `HostSnapshot`, `SandboxInventory`, `StaleCandidate`,
  `ProvisionDelta`, `IntrospectionReport`
- Default inputs: `repo` optional; when omitted, resolve to sand-boxer repo root
  (package parent path or `SANDBOXER_REPO_ROOT`)

Canary deliverable on `ready`: JSON `IntrospectionReport` in sandbox status
`detail` / reachability extension field.

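A minimal sketch of how the five models might nest. It uses stdlib dataclasses as a stand-in for the Pydantic models this task calls for, so it runs without extra dependencies. Every field name below is an assumption for illustration; the real fields are to be fixed in `docs/host-telemetry.md` (T01).

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class HostSnapshot:
    """Point-in-time host metrics (field names are illustrative)."""
    load_1m: float
    cpu_count: int
    mem_available_kb: int
    mem_total_kb: int
    root_disk_used_pct: float
    running_containers: int


@dataclass
class StaleCandidate:
    name: str
    reason: str            # e.g. "ttl_exceeded", "missing_store_record" (assumed values)
    suggested_action: str  # "reap" | "inspect" | "ignore"


@dataclass
class SandboxInventory:
    compose_projects: list[str] = field(default_factory=list)
    directories: list[str] = field(default_factory=list)


@dataclass
class ProvisionDelta:
    before: HostSnapshot
    after: HostSnapshot


@dataclass
class IntrospectionReport:
    schema_version: str
    inventory: SandboxInventory
    delta: ProvisionDelta
    stale: list[StaleCandidate] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), sort_keys=True)
```

With Pydantic the same shapes would gain validation and schema export; the nesting shown here is the part that carries over.
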
## Implement remote host metrics collector

```task
id: SAND-WP-0008-T03
status: todo
priority: high
```

SSH-side collection (shell + structured parse, no extra daemon on host):

- Load average, CPU count, mem available/total, root disk use
- `docker system df` / running container count
- Optional: `docker stats --no-stream` aggregate for sbx-* projects only
- Bounded runtime (e.g. ≤10s) and non-root-safe commands

Module: `src/sandboxer/telemetry/host_snapshot.py` with unit tests using fixture
command output.

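The "shell + structured parse" split could look like the sketch below: the SSH side only runs plain commands, and pure functions turn the captured text into numbers, which is what makes fixture-based unit tests possible. It assumes a Linux host and that the collector captures `cat /proc/loadavg`, `cat /proc/meminfo`, and `df -P /`; the function names are hypothetical, not the module's actual API.

```python
from __future__ import annotations


def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Parse the first three fields of /proc/loadavg (1, 5, 15 minute load)."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo into {key: kB} for the two fields needed here."""
    wanted = {"MemTotal", "MemAvailable"}
    out: dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            out[key] = int(rest.split()[0])
    return out


def parse_df_root(text: str) -> int:
    """Return the use percentage for the row mounted on / in `df -P /` output."""
    for line in text.splitlines()[1:]:  # skip the header row
        cols = line.split()
        if cols and cols[-1] == "/":
            return int(cols[4].rstrip("%"))
    raise ValueError("no root filesystem row in df output")
```

Because the parsers never touch SSH, a unit test can feed them recorded command output directly.
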
## Implement stale sandbox discovery

```task
id: SAND-WP-0008-T04
status: todo
priority: high
```

Scan remote host for:

- Directories under `base_dir` (default `/tmp/sandboxer`) with mtime age
- `docker compose ls` projects matching `sbx-*` / `e2e-*` legacy patterns
- Cross-check against local `SandboxStore` — flag **orphans** (on host, not in store)
  and **zombies** (in store, not on host)

Output `StaleCandidate` list with suggested action: `reap`, `inspect`, `ignore`.
No automatic deletion in this task — dry-run only.

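The orphan/zombie cross-check reduces to two set differences once host names are filtered to the managed patterns. The sketch below assumes names have already been collected from the host and from `SandboxStore`; `classify` and `PATTERNS` are hypothetical names, and nothing here deletes anything.

```python
from __future__ import annotations

from fnmatch import fnmatch

# Name patterns treated as sand-boxer managed, taken from the list above.
PATTERNS = ("sbx-*", "e2e-*")


def classify(host_names: set[str], store_names: set[str]) -> dict[str, list[str]]:
    """Cross-check names seen on the host against names in the local store."""
    managed = {n for n in host_names if any(fnmatch(n, p) for p in PATTERNS)}
    return {
        "orphans": sorted(managed - store_names),  # on host, not in store
        "zombies": sorted(store_names - managed),  # in store, not on host
    }
```

Filtering before the difference keeps unrelated compose projects on a shared host out of the orphan list.
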
## Capture before/after load around provision

```task
id: SAND-WP-0008-T05
status: todo
priority: medium
```

Integrate into `SandboxManager.create` / `destroy` when profile metadata requests
telemetry (`metadata.observability: canary` or profile id `profile.sandbox-canary`):

1. `HostSnapshot` before extension `provision`
2. Run provision + wait_ready
3. `HostSnapshot` after ready
4. Compute `ProvisionDelta` (load/mem/disk/container deltas)

Same pattern on `destroy` for teardown impact. Tests mock SSH collector.

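The four steps above can be sketched as a small wrapper that takes the collector and the action as callables, which is also how a test can substitute a mocked SSH collector. Snapshots are plain dicts here for brevity; `with_snapshots` and `compute_delta` are hypothetical names, not part of `SandboxManager`.

```python
from typing import Callable, Dict, Tuple

Snapshot = Dict[str, float]


def compute_delta(before: Snapshot, after: Snapshot) -> Snapshot:
    """Per-metric difference (after - before) for keys present in both."""
    return {k: after[k] - before[k] for k in before if k in after}


def with_snapshots(
    collect: Callable[[], Snapshot],
    action: Callable[[], None],
) -> Tuple[Snapshot, Snapshot, Snapshot]:
    """Run `action` between two collector calls; return (before, after, delta)."""
    before = collect()   # step 1: snapshot before provision
    action()             # step 2: stands in for provision + wait_ready
    after = collect()    # step 3: snapshot after ready
    return before, after, compute_delta(before, after)  # step 4
```

The same wrapper would serve `destroy` by passing the teardown call as `action`.
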
## Default repo: deploy sand-boxer itself

```task
id: SAND-WP-0008-T06
status: todo
priority: high
```

When `create` has no `repo` input:

- Resolve default to sand-boxer repository root (`SANDBOXER_REPO_ROOT` override)
- Use `profile.sandbox-canary` as default profile when `--profile` omitted **and**
  no `repo` given (document precedence: explicit flags win)
- Ship minimal `e2e/e2e.yml` or `docker-compose.canary.yml` in sand-boxer repo if
  compose-up is required for parity with `ext.compose-ssh`

CLI examples:

```bash
sandboxer create                                    # canary self-deploy
sandboxer create --profile profile.sandbox-canary
sandboxer create --input repo=/other/repo           # unchanged behavior
```

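One way to read the precedence rule ("explicit flags win") is sketched below. The function name and the `package_parent` parameter are hypothetical; what happens to the profile when `repo` is given but `--profile` is omitted is left untouched here, on the assumption that it stays whatever the CLI does today.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional, Tuple

CANARY_PROFILE = "profile.sandbox-canary"


def resolve_create_inputs(
    repo: Optional[str],
    profile: Optional[str],
    package_parent: str,
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Return (repo, profile) after applying the canary defaults.

    Explicit values always win. The canary profile is implied only when
    both `repo` and `profile` were omitted.
    """
    if repo is None:
        if profile is None:
            profile = CANARY_PROFILE
        # SANDBOXER_REPO_ROOT overrides the package parent path.
        repo = env.get("SANDBOXER_REPO_ROOT") or package_parent
    return repo, profile
```

Passing `env` as a parameter keeps the resolver testable without mutating the process environment.
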
## Wire introspection into canary provision flow

```task
id: SAND-WP-0008-T07
status: todo
priority: high
```

After `wait_ready` for canary profile:

- Include the `src/sandboxer/telemetry/` introspection entry script in the rsync
  payload, or invoke the collector modules via an SSH one-liner
- Assemble `IntrospectionReport` (inventory + deltas + stale candidates)
- Attach to `SandboxStatus` (new optional `telemetry` field)
- Print human summary in CLI (load delta, stale count, disk headroom)

## Telemetry export for centralized analysis

```task
id: SAND-WP-0008-T08
status: todo
priority: medium
```

Emit structured telemetry to:

1. **State Hub** — `progress/` events with `detail` containing `IntrospectionReport`
   (extend existing lifecycle emitter)
2. **Local artifact** — `~/.local/share/sandboxer/telemetry/<sandbox_id>.json` for
   offline analysis
3. **Export hook** (stub) — `TelemetrySink` protocol for future artifact-store /
   Prometheus / ClickHouse; document contract only

Include: `host`, `sandbox_id`, `profile_id`, `collected_at`, schema version.

activity-core may schedule periodic canary runs later — out of scope here.

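A sketch of what the `TelemetrySink` contract and the local-artifact sink might look like. The `emit` method name, the envelope layout, and the `"1"` schema version are assumptions; only the five identifying fields and the `<sandbox_id>.json` file naming come from the list above.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Protocol

SCHEMA_VERSION = "1"  # assumed versioning scheme


def make_envelope(host: str, sandbox_id: str, profile_id: str,
                  report: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a report with the identifying fields every sink receives."""
    return {
        "schema_version": SCHEMA_VERSION,
        "host": host,
        "sandbox_id": sandbox_id,
        "profile_id": profile_id,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "report": report,
    }


class TelemetrySink(Protocol):
    """Structural contract: anything with a matching emit() is a sink."""

    def emit(self, envelope: Dict[str, Any]) -> None: ...


class LocalJsonSink:
    """Writes <base_dir>/<sandbox_id>.json, one file per sandbox."""

    def __init__(self, base_dir: Path) -> None:
        self.base_dir = base_dir

    def emit(self, envelope: Dict[str, Any]) -> None:
        self.base_dir.mkdir(parents=True, exist_ok=True)
        path = self.base_dir / f"{envelope['sandbox_id']}.json"
        path.write_text(json.dumps(envelope, indent=2), encoding="utf-8")
```

Using a `Protocol` rather than a base class means a future Prometheus or ClickHouse sink needs no import from sand-boxer to conform.
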
## CLI inspect and stale reap commands

```task
id: SAND-WP-0008-T09
status: todo
priority: medium
```

```bash
sandboxer inspect host [--host coulombcore]       # HostSnapshot + inventory, no create
sandboxer inspect stale [--host ...] [--json]     # StaleCandidate list
sandboxer reap-stale --dry-run [--host ...]       # report only
sandboxer reap-stale --apply [--older-than 24h]   # T10+; gated behind --apply
```

`inspect` does not require a running sandbox — SSH + read-only collectors only.

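The `--older-than 24h` flag implies a duration parser and an age filter. The sketch below assumes a single integer followed by one of `s`, `m`, `h`, `d`; the workplan only shows `24h`, so the accepted syntax and both function names are assumptions.

```python
from __future__ import annotations

import re
from datetime import timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_older_than(value: str) -> timedelta:
    """Parse values such as '24h' or '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def select_older(ages: dict[str, timedelta], older_than: timedelta) -> list[str]:
    """Return names whose age strictly exceeds the threshold, sorted."""
    return sorted(name for name, age in ages.items() if age > older_than)
```

Selection stays separate from deletion, so `--dry-run` and `--apply` can share the same candidate list.
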
## Runbook, tests, and CoulombCore verification

```task
id: SAND-WP-0008-T10
status: todo
priority: medium
```

- `docs/runbooks/profile-sandbox-canary.md`
- Integration test: mock SSH fixtures for full report assembly
- Manual proof on CoulombCore:
  1. `sandboxer create` (no args) → `ready` + `IntrospectionReport`
  2. `sandboxer inspect host` matches report host metrics
  3. Introduce fake stale dir → appears in `inspect stale`
  4. `destroy` → after snapshot shows load recovery
- Satisfies SAND-WP-0002-T10 smoke variant when canary path used

Record optimization hypotheses (disk pressure, stale reap policy) for phase-2
automation via activity-core.

---

## Out of scope

| Item | Target |
|------|--------|
| Long-term metrics database / dashboards | artifact-store or observability stack (separate workplan) |
| Automatic scheduled reap without human gate | activity-core instruction (after dry-run proven) |
| wise-validator migration | SAND-WP-0003 |
| SaaS metering | SAND-WP-0006 |

## Completion criteria

- `sandboxer create` with no `repo` deploys sand-boxer and returns
  `IntrospectionReport` on `ready`
- Before/after host snapshots captured for canary creates
- Stale sandbox inventory with dry-run reap CLI
- Telemetry lands in State Hub `detail` and local JSON artifact
- Runbook and tests merged; operator runs `make fix-consistency REPO=sand-boxer`

## Operator note

After merging task status updates:

```bash
cd ~/state-hub && make fix-consistency REPO=sand-boxer
```