sand-boxer/workplans/SAND-WP-0008-host-telemetry-and-self-canary.md
tegwick c0a9261cdc Implement SAND-WP-0008: host telemetry and self-canary
Add profile.sandbox-canary, HostSnapshot/inventory/stale schemas, SSH
collectors, before/after provision deltas, telemetry export to State Hub
and local JSON, default `sandboxer create` self-deploy, inspect/reap-stale
CLI, runbook, and CoulombCore verification (26 tests pass).
2026-06-23 19:53:51 +02:00

id: SAND-WP-0008
type: workplan
title: Host telemetry and self-canary introspection
domain: infotech
repo: sand-boxer
status: finished
owner: codex
topic_slug: custodian
created: 2026-06-23
updated: 2026-06-23
state_hub_workstream_id: afbcbc84-5ec7-4f8b-ae21-4cbda0d05195

Host telemetry and self-canary introspection

Use sand-boxer as its own trial deployment to prove provision/teardown and return actionable host and sandbox intelligence: resource metrics, load before/after, stale sandbox inventory, and structured telemetry for centralized analysis.

Charter: INTENT.md (host topology, observable lifecycle)
Spec: docs/meta-framework.md (Host resource, Meter — extend for self-hosted)
Predecessor: SAND-WP-0002 (ext.compose-ssh, CLI v0, State Hub events)
Related: SAND-WP-0002-T10 (remote smoke), activity-core (scheduled reap jobs)

Problem

Today sandboxer create proves SSH + compose for an arbitrary repo but returns only lifecycle state and reachability. Operators lack:

  • Host load and capacity before accepting new sandboxes
  • Post-provision metrics to quantify what each sandbox costs the host
  • Inventory of stale sandboxes (/tmp/sandboxer/*, orphaned compose projects)
  • A default smoke path that does not depend on another repo's e2e/ layout

sand-boxer should dogfood itself: deploy the sand-boxer tree, run a bounded introspection bundle on the remote host, and emit telemetry suitable for a central datastore (State Hub first; export to artifact-store or metrics pipeline later).

Design host telemetry contract

id: SAND-WP-0008-T01
status: done
priority: high
state_hub_task_id: "8f7b46e3-045e-481c-81bd-1c61734c6eb3"

Author docs/host-telemetry.md defining:

  • HostSnapshot — point-in-time host metrics (load, CPU%, mem, disk, docker stats summary)
  • SandboxInventory — known sandboxes on host (compose projects matching sbx-*, directories under configured base_dir, age, owning profile if inferable)
  • StaleCandidate — entries exceeding TTL, idle threshold, or missing store record
  • ProvisionDelta — before / after HostSnapshot pair around create/destroy
  • IntrospectionReport — bundled output attached to sandbox ready response
  • Retention and privacy rules (no secret paths, no full docker inspect dumps by default)

Extend meta-framework spec with Host observability fields (read-only; sand-boxer does not own long-term metrics DB).

Define profile.sandbox-canary and introspection schema

id: SAND-WP-0008-T02
status: done
priority: high
state_hub_task_id: "732bae4e-2dd9-4500-a86d-e869007bb383"

Add:

  • profiles/profile.sandbox-canary.yaml — lightweight compose or no-compose introspection profile bound to ext.compose-ssh (or thin ext.ssh-introspect if compose is unnecessary for canary)
  • Pydantic models: HostSnapshot, SandboxInventory, StaleCandidate, ProvisionDelta, IntrospectionReport
  • Default inputs: repo optional; when omitted, resolve to sand-boxer repo root (package parent path or SANDBOXER_REPO_ROOT)

Canary deliverable on ready: JSON IntrospectionReport in sandbox status detail / reachability extension field.
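A minimal sketch of how the report schema might nest, written with stdlib dataclasses so it runs without dependencies (the workplan itself calls for Pydantic models). Every field name, the `SCHEMA_VERSION` value, and the `sbx-demo` id are assumptions for illustration; docs/host-telemetry.md is the authority on the real contract.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

SCHEMA_VERSION = "1"  # hypothetical version string


@dataclass
class HostSnapshot:
    """Point-in-time host metrics (field names are illustrative)."""
    load_1m: float
    cpu_count: int
    mem_available_mb: int
    mem_total_mb: int
    root_disk_used_pct: int
    running_containers: int


@dataclass
class StaleCandidate:
    name: str
    reason: str            # e.g. "orphan", "zombie", "ttl"
    suggested_action: str  # "reap", "inspect", or "ignore"


@dataclass
class IntrospectionReport:
    host: str
    sandbox_id: str
    profile_id: str
    before: HostSnapshot
    after: HostSnapshot
    stale_candidates: list[StaleCandidate] = field(default_factory=list)
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    schema_version: str = SCHEMA_VERSION

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), sort_keys=True)


report = IntrospectionReport(
    host="coulombcore",
    sandbox_id="sbx-demo",
    profile_id="profile.sandbox-canary",
    before=HostSnapshot(0.40, 8, 8000, 16000, 50, 3),
    after=HostSnapshot(0.65, 8, 7700, 16000, 51, 5),
    stale_candidates=[StaleCandidate("sbx-old", "orphan", "inspect")],
)
payload = json.loads(report.to_json())
```

The serialized `payload` is the kind of object that could be placed in the sandbox status detail field once the sandbox is ready.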

Implement remote host metrics collector

id: SAND-WP-0008-T03
status: done
priority: high
state_hub_task_id: "7bd22f27-5058-4c19-98b6-b923909a8815"

SSH-side collection (shell + structured parse, no extra daemon on host):

  • Load average, CPU count, mem available/total, root disk use
  • docker system df / running container count
  • Optional: docker stats --no-stream aggregate for sbx-* projects only
  • Bounded runtime (e.g. ≤10s) and non-root-safe commands

Module: src/sandboxer/telemetry/host_snapshot.py with unit tests using fixture command output.
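The "shell + structured parse" approach could look roughly like the sketch below, which parses fixture output the way a unit test would. It assumes the collector reads `/proc/loadavg`, `/proc/meminfo`, and `df -P /` over SSH; the workplan does not name the exact commands, and the function names and fixture values are hypothetical.

```python
def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Parse /proc/loadavg contents into 1/5/15-minute load averages."""
    parts = text.split()
    return float(parts[0]), float(parts[1]), float(parts[2])


def parse_meminfo(text: str) -> dict[str, int]:
    """Extract MemTotal and MemAvailable from /proc/meminfo, in MiB."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemTotal", "MemAvailable"):
            values[key] = int(rest.split()[0]) // 1024  # kB -> MiB
    return values


def parse_df_root(text: str) -> int:
    """Return root disk use (percent) from `df -P /` output."""
    fields = text.strip().splitlines()[-1].split()
    return int(fields[4].rstrip("%"))  # POSIX format: fifth column is Capacity


# Fixture command output (made-up values, shaped like the real commands).
LOADAVG = "0.42 0.36 0.30 1/512 12345\n"
MEMINFO = (
    "MemTotal:       16384000 kB\n"
    "MemFree:         1024000 kB\n"
    "MemAvailable:    8192000 kB\n"
)
DF_ROOT = (
    "Filesystem 1024-blocks     Used Available Capacity Mounted on\n"
    "/dev/sda1    102400000 51200000  51200000      50% /\n"
)

load = parse_loadavg(LOADAVG)
mem = parse_meminfo(MEMINFO)
disk_pct = parse_df_root(DF_ROOT)
```

Keeping the parsers pure (text in, values out) is what lets the tests run against fixtures without an SSH connection.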

Implement stale sandbox discovery

id: SAND-WP-0008-T04
status: done
priority: high
state_hub_task_id: "c2d19bb7-9322-4744-a71e-75f7701a6fb2"

Scan remote host for:

  • Directories under base_dir (default /tmp/sandboxer) with mtime age
  • docker compose ls projects matching sbx-* / e2e-* legacy patterns
  • Cross-check against local SandboxStore — flag orphans (on host, not in store) and zombies (in store, not on host)

Output StaleCandidate list with suggested action: reap, inspect, ignore. No automatic deletion in this task — dry-run only.
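The cross-check against the store reduces to two set differences. The sketch below assumes host entries arrive as a name-to-age mapping and that the suggested action depends only on age versus a TTL; that policy, the 24-hour default, and all names are illustrative assumptions, not the shipped rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaleCandidate:
    name: str
    kind: str    # "orphan" (on host, not in store) or "zombie" (in store, not on host)
    action: str  # "reap", "inspect", or "ignore"


def find_stale(
    host_ages_h: dict[str, float],
    store_ids: set[str],
    ttl_h: float = 24.0,
) -> list[StaleCandidate]:
    """Dry-run only: report candidates, never delete anything."""
    candidates = []
    for name, age in sorted(host_ages_h.items()):
        if name not in store_ids:
            # Hypothetical policy: old orphans are reap candidates,
            # young ones may belong to an in-flight create.
            action = "reap" if age > ttl_h else "inspect"
            candidates.append(StaleCandidate(name, "orphan", action))
    for name in sorted(store_ids - set(host_ages_h)):
        candidates.append(StaleCandidate(name, "zombie", "inspect"))
    return candidates


host_ages = {"sbx-a": 30.0, "sbx-b": 2.0, "sbx-c": 5.0}  # name -> age in hours
store = {"sbx-c", "sbx-d"}
stale = find_stale(host_ages, store)
```

Here `sbx-c` is known on both sides and is not reported; `sbx-a` and `sbx-b` are orphans, and `sbx-d` is a zombie.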

Capture before/after load around provision

id: SAND-WP-0008-T05
status: done
priority: medium
state_hub_task_id: "b6b02289-d36e-4ee1-9ff7-dc59a1d24886"

Integrate into SandboxManager.create / destroy when profile metadata requests telemetry (metadata.observability: canary or profile id profile.sandbox-canary):

  1. HostSnapshot before extension provision
  2. Run provision + wait_ready
  3. HostSnapshot after ready
  4. Compute ProvisionDelta (load/mem/disk/container deltas)

Same pattern on destroy for teardown impact. Tests mock SSH collector.
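The four steps above can be sketched as a small wrapper that takes a snapshot collector and the action to measure. This is an assumption about shape only: `run_with_delta`, the field names, and the fixture values are hypothetical, and the real integration lives inside SandboxManager.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class HostSnapshot:
    load_1m: float
    mem_available_mb: int
    root_disk_used_pct: int
    running_containers: int


def provision_delta(before: HostSnapshot, after: HostSnapshot) -> dict:
    """Compute after-minus-before deltas for each tracked metric."""
    return {
        "load_1m": round(after.load_1m - before.load_1m, 2),
        "mem_available_mb": after.mem_available_mb - before.mem_available_mb,
        "root_disk_used_pct": after.root_disk_used_pct - before.root_disk_used_pct,
        "running_containers": after.running_containers - before.running_containers,
    }


def run_with_delta(
    collect: Callable[[], HostSnapshot],
    action: Callable[[], T],
) -> tuple[T, dict]:
    before = collect()   # 1. snapshot before provision
    result = action()    # 2. provision + wait_ready (or destroy)
    after = collect()    # 3. snapshot after
    return result, provision_delta(before, after)  # 4. compute delta


# A mocked collector, as the tests would use in place of SSH.
snapshots = iter([
    HostSnapshot(0.40, 8000, 50, 3),
    HostSnapshot(0.65, 7700, 51, 5),
])
result, delta = run_with_delta(lambda: next(snapshots), lambda: "ready")
```

Because the wrapper is agnostic about the action, the same function covers the destroy path by passing the teardown callable instead.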

Default repo: deploy sand-boxer itself

id: SAND-WP-0008-T06
status: done
priority: high
state_hub_task_id: "d9941d93-a662-45c0-820b-88d32266c653"

When create has no repo input:

  • Resolve default to sand-boxer repository root (SANDBOXER_REPO_ROOT override)
  • Use profile.sandbox-canary as default profile when --profile omitted and no repo given (document precedence: explicit flags win)
  • Ship minimal e2e/e2e.yml or docker-compose.canary.yml in sand-boxer repo if compose-up is required for parity with ext.compose-ssh

CLI examples:

sandboxer create                              # canary self-deploy
sandboxer create --profile profile.sandbox-canary
sandboxer create --input repo=/other/repo     # unchanged behavior
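The precedence rule (explicit flags win; the canary profile applies only when neither repo nor profile is given) might be expressed as below. The function name, its parameters, and the `/opt/...` and `/srv/...` paths are hypothetical; only `SANDBOXER_REPO_ROOT` and `profile.sandbox-canary` come from the workplan.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

CANARY_PROFILE = "profile.sandbox-canary"


def resolve_create_inputs(
    repo: Optional[str],
    profile: Optional[str],
    package_parent: Path,
    env: Optional[Mapping[str, str]] = None,
) -> tuple[Path, Optional[str]]:
    """Return (repo_root, profile_id); profile None means 'existing default'."""
    env = os.environ if env is None else env
    if repo is not None:
        # Explicit repo: behavior is unchanged, no canary default applied.
        return Path(repo), profile
    # No repo: self-deploy, honoring the environment override first.
    root = Path(env.get("SANDBOXER_REPO_ROOT") or package_parent)
    return root, profile or CANARY_PROFILE


pkg = Path("/opt/sand-boxer")  # stand-in for the package parent path

bare = resolve_create_inputs(None, None, pkg, env={})
overridden = resolve_create_inputs(
    None, None, pkg, env={"SANDBOXER_REPO_ROOT": "/srv/sand-boxer"}
)
explicit = resolve_create_inputs("/other/repo", None, pkg, env={})
```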

Wire introspection into canary provision flow

id: SAND-WP-0008-T07
status: done
priority: high
state_hub_task_id: "76430452-c98e-44e5-b625-e243dc12b8a5"

After wait_ready for canary profile:

  • Include the src/sandboxer/telemetry/ introspection entry script in the rsync, or invoke the collector modules via an SSH one-liner
  • Assemble IntrospectionReport (inventory + deltas + stale candidates)
  • Attach to SandboxStatus (new optional telemetry field)
  • Print human summary in CLI (load delta, stale count, disk headroom)

Telemetry export for centralized analysis

id: SAND-WP-0008-T08
status: done
priority: medium
state_hub_task_id: "4ee4b95b-e7b5-4893-b78e-914f808bc00a"

Emit structured telemetry to:

  1. State Hub — progress/ events with detail containing IntrospectionReport (extend existing lifecycle emitter)
  2. Local artifact — ~/.local/share/sandboxer/telemetry/<sandbox_id>.json for offline analysis
  3. Export hook (stub) — TelemetrySink protocol for future artifact-store / Prometheus / ClickHouse; document contract only

Include: host, sandbox_id, profile_id, collected_at, schema version.
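One possible shape for the TelemetrySink contract, with a local JSON sink as the only concrete implementation. The `emit` method name, the `export` helper, and the record layout are assumptions; the workplan documents the protocol as a stub only. The demo writes to a temporary directory rather than the real telemetry path.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Protocol

REQUIRED_FIELDS = {"host", "sandbox_id", "profile_id", "collected_at", "schema_version"}


class TelemetrySink(Protocol):
    """Anything with an emit(record) method can receive telemetry."""

    def emit(self, record: dict) -> None: ...


class LocalJsonSink:
    """Writes one <sandbox_id>.json file per record under base_dir."""

    def __init__(self, base_dir: Path) -> None:
        self.base_dir = base_dir

    def emit(self, record: dict) -> None:
        self.base_dir.mkdir(parents=True, exist_ok=True)
        path = self.base_dir / f"{record['sandbox_id']}.json"
        path.write_text(json.dumps(record, indent=2, sort_keys=True))


def export(record: dict, sinks: Iterable[TelemetrySink]) -> None:
    """Validate the envelope fields, then fan out to every sink."""
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        raise ValueError(f"telemetry record missing fields: {sorted(missing)}")
    for sink in sinks:
        sink.emit(record)


record = {
    "host": "coulombcore",
    "sandbox_id": "sbx-demo",               # hypothetical id
    "profile_id": "profile.sandbox-canary",
    "collected_at": "2026-06-23T00:00:00+00:00",  # fixture timestamp
    "schema_version": "1",
    "report": {"stale_count": 0},
}

with tempfile.TemporaryDirectory() as tmp:
    export(record, [LocalJsonSink(Path(tmp))])
    written = json.loads((Path(tmp) / "sbx-demo.json").read_text())
```

A structural Protocol keeps future sinks (artifact-store, Prometheus, ClickHouse) decoupled: they only need to provide `emit`, with no shared base class.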

activity-core may schedule periodic canary runs later — out of scope here.

CLI inspect and stale reap commands

id: SAND-WP-0008-T09
status: done
priority: medium
state_hub_task_id: "6ea8eda6-491b-460a-a526-7565962f449e"

sandboxer inspect host [--host coulombcore]     # HostSnapshot + inventory, no create
sandboxer inspect stale [--host ...] [--json]   # StaleCandidate list
sandboxer reap-stale --dry-run [--host ...]     # report only
sandboxer reap-stale --apply [--older-than 24h] # T10+; gated behind --apply

inspect does not require a running sandbox — SSH + read-only collectors only.

Runbook, tests, and CoulombCore verification

id: SAND-WP-0008-T10
status: done
priority: medium
state_hub_task_id: "435a3993-d8d3-4280-b68a-c37e34d20312"

  • docs/runbooks/profile-sandbox-canary.md
  • Integration test: mock SSH fixtures for full report assembly
  • Manual proof on CoulombCore:
    1. sandboxer create (no args) → ready + IntrospectionReport
    2. sandboxer inspect host matches report host metrics
    3. Introduce fake stale dir → appears in inspect stale
    4. destroy → after snapshot shows load recovery
  • Satisfies SAND-WP-0002-T10 smoke variant when canary path used

Record optimization hypotheses (disk pressure, stale reap policy) for phase-2 automation via activity-core.


Out of scope

Item | Target
Long-term metrics database / dashboards | artifact-store or observability stack (separate workplan)
Automatic scheduled reap without human gate | activity-core instruction (after dry-run proven)
wise-validator migration | SAND-WP-0003
SaaS metering | SAND-WP-0006

Completion criteria

  • sandboxer create with no repo deploys sand-boxer and returns IntrospectionReport on ready
  • Before/after host snapshots captured for canary creates
  • Stale sandbox inventory with dry-run reap CLI
  • Telemetry lands in State Hub detail and local JSON artifact
  • Runbook and tests merged; operator runs make fix-consistency REPO=sand-boxer

Operator note

After merging task status updates:

cd ~/state-hub && make fix-consistency REPO=sand-boxer

Verification record (2026-06-23)

CoulombCore remote proof:

  1. sandboxer create (no args) → ready + telemetry.provision_delta
  2. sandboxer inspect host → load/mem metrics returned
  3. Stale orphans from prior runs detected in stale_candidates
  4. sandboxer destroy → destroy_delta with load Δ -0.09, mem +54 MB