the-custodian/workplans/CUST-WP-0028-e2e-sandbox-framework.md
tegwick d061c777d1 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-03-27:
  - update .custodian-brief.md for the-custodian
2026-03-27 00:52:18 +01:00

---
id: CUST-WP-0028
type: workplan
title: "Cross-Repo E2E Sandbox Framework"
domain: railiance
repo: the-custodian
status: active
owner: custodian
topic_slug: railiance
created: "2026-03-27"
updated: "2026-03-27"
state_hub_workstream_id: "b68de20b-e397-4f97-b1be-ad30711fc2a6"
---
# Cross-Repo E2E Sandbox Framework
## Problem
End-to-end tests that require a real running stack (Temporal, Postgres, workers)
cannot be automated in CI or run locally without significant setup friction.
Each repo has to reinvent its own e2e story. activity-core T21 is the immediate
trigger: the full RunActivityWorkflow flow can't be exercised without a live
Temporal cluster.
## Goal
A **convention + runtime** that any repo can opt into by dropping in an `e2e/`
folder. The shared framework, living in `the-custodian/e2e-framework/`, handles:
- Provisioning an isolated sandbox on a remote host (railiance01)
- `rsync` + `docker compose up` with a unique project name (no port conflicts)
- Health polling until the stack is ready
- Running the repo's test command and capturing results
- `docker compose down` (even on failure)
- Reporting structured results to the state-hub
Each repo just provides: `e2e/e2e.yml` + `e2e/compose.yml` + `e2e/tests/`.
The sandbox host defaults to the value of the `RAILIANCE01_HOST` environment variable (an SSH alias or IP address).
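The lifecycle listed above can be sketched as a single function with a `try`/`finally`, so that teardown runs even when the tests fail or a health check times out. This is a minimal sketch, not the framework's actual code: `project_name`, `run_lifecycle`, and the injected `run` and `wait_healthy` callables are hypothetical names, the `e2e-` prefix is an assumption, and the sketch always cleans up (it ignores `--keep` and the `cleanup` setting).

```python
import uuid
from typing import Callable, List


def project_name(repo_slug: str) -> str:
    """Return a unique, lowercase Compose project name for one sandbox run."""
    # Assumption: an "e2e-" prefix plus 8 random hex chars is unique enough.
    return f"e2e-{repo_slug.lower()}-{uuid.uuid4().hex[:8]}"


def run_lifecycle(
    repo_slug: str,
    compose_file: str,
    test_command: str,
    run: Callable[[str], int],
    wait_healthy: Callable[[], None],
) -> bool:
    """Run up -> health-wait -> test -> down and report whether the tests passed.

    `run` executes one shell command on the sandbox host and returns its exit
    code; `wait_healthy` raises if the stack never becomes ready.
    """
    compose = f"docker compose -p {project_name(repo_slug)} -f {compose_file}"
    try:
        if run(f"{compose} up -d") != 0:
            return False
        wait_healthy()
        return run(test_command) == 0
    finally:
        # Runs on success, on test failure, and when an exception propagates.
        run(f"{compose} down")


if __name__ == "__main__":
    issued: List[str] = []

    def fake_run(command: str) -> int:
        issued.append(command)  # record the command instead of executing it
        return 0

    passed = run_lifecycle(
        "activity-core",
        "e2e/compose.yml",
        "python -m pytest e2e/tests/ -v --tb=short",
        fake_run,
        lambda: None,
    )
    print("passed:", passed)
    for command in issued:
        print(command)
```

Injecting `run` keeps the sketch testable without SSH or Docker: the demo records the three commands that a real run would send to the sandbox host.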
## Architecture
```
the-custodian/
  e2e-framework/
    schema.py      # parse and validate e2e.yml
    sandbox.py     # provision/teardown remote sandbox dir via SSH
    runner.py      # rsync, compose up, health-wait, run tests, compose down
    reporter.py    # push structured result to state-hub
    cli.py         # entry point: python -m e2e_framework <repo-path>
  Makefile         # e2e target: make e2e REPO=activity-core

<repo>/
  e2e/
    e2e.yml        # metadata: compose_file, health_checks, test_command, timeout
    compose.yml    # stack definition (may symlink docker-compose.dev.yml)
    tests/         # test scripts (pytest, shell, etc.)
```
## e2e.yml contract
```yaml
name: <repo-slug>
compose_file: e2e/compose.yml      # relative to repo root
health_checks:
  - name: <label>
    url: http://localhost:<port>   # checked from remote machine
    timeout: 120                   # seconds to wait for this check
test_command: python -m pytest e2e/tests/ -v --tb=short
timeout: 300                       # hard timeout for test_command
cleanup: always                    # always | on_success | never
```
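One way to model this contract in Python is a pair of dataclasses plus a validating constructor. The sketch below is illustrative only: the names `HealthCheck`, `E2EConfig`, and `parse_config` are hypothetical, the choice of required keys is an assumption, and the defaults (120, 300, `always`) are simply the example values from the contract above. It takes an already-parsed mapping so that it has no third-party dependencies; in practice the mapping would come from whatever YAML parser the framework uses.

```python
from dataclasses import dataclass, field
from typing import Any, List, Mapping

CLEANUP_POLICIES = ("always", "on_success", "never")


@dataclass(frozen=True)
class HealthCheck:
    name: str
    url: str
    timeout: int = 120  # seconds to wait for this check (assumed default)


@dataclass(frozen=True)
class E2EConfig:
    name: str
    compose_file: str
    test_command: str
    health_checks: List[HealthCheck] = field(default_factory=list)
    timeout: int = 300  # hard timeout for test_command (assumed default)
    cleanup: str = "always"

    def should_cleanup(self, tests_passed: bool) -> bool:
        """Apply the always | on_success | never policy to one run."""
        if self.cleanup == "always":
            return True
        if self.cleanup == "on_success":
            return tests_passed
        return False


def parse_config(raw: Mapping[str, Any]) -> E2EConfig:
    """Validate a parsed e2e.yml mapping and convert it to dataclasses."""
    # Assumption: these three keys are mandatory; the rest have defaults.
    missing = [k for k in ("name", "compose_file", "test_command") if k not in raw]
    if missing:
        raise ValueError(f"e2e.yml is missing required keys: {', '.join(missing)}")
    cleanup = raw.get("cleanup", "always")
    if cleanup not in CLEANUP_POLICIES:
        raise ValueError(f"cleanup must be one of {CLEANUP_POLICIES}, got {cleanup!r}")
    checks = [
        HealthCheck(name=c["name"], url=c["url"], timeout=int(c.get("timeout", 120)))
        for c in raw.get("health_checks", [])
    ]
    return E2EConfig(
        name=raw["name"],
        compose_file=raw["compose_file"],
        test_command=raw["test_command"],
        health_checks=checks,
        timeout=int(raw.get("timeout", 300)),
        cleanup=cleanup,
    )
```

Rejecting an unknown `cleanup` value at load time means a typo fails before anything is provisioned on the sandbox host.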
## Tasks
### T01 — e2e-framework core: schema, sandbox, runner
```task
id: CUST-WP-0028-T01
status: done
priority: high
state_hub_task_id: "61dbb674-5933-4185-a7af-f9274bdd43c1"
```
Write `the-custodian/e2e-framework/`:
- `schema.py` — dataclasses + YAML loader for `e2e.yml`
- `sandbox.py` — SSH wrapper: `provision()` (mkdir + rsync), `run()`, `teardown()`
- `runner.py` — full lifecycle: up → health-wait → test → down
The SSH transport shells out to the system `ssh` binary via `subprocess` (no extra dependencies); `rsync` also runs over SSH.
Health checks run `curl` on the remote machine via `sandbox.run()`.
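The transport and the health-wait described here can be sketched with the standard library alone. This is a hedged illustration rather than the code that was written for T01: `ssh_command`, `rsync_command`, `run_remote`, and `wait_healthy` are hypothetical names, and the specific flags (`BatchMode=yes`, `-az --delete`, `curl -fsS`) are assumptions about reasonable defaults. The `run` parameter stands in for `sandbox.run()` already bound to a host.

```python
import shlex
import subprocess
import time
from typing import Callable, List


def ssh_command(host: str, remote_command: str) -> List[str]:
    """Build the argv for running one command on the sandbox host."""
    # BatchMode makes ssh fail instead of waiting at a password prompt.
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def rsync_command(local_dir: str, host: str, remote_dir: str) -> List[str]:
    """Build the argv for mirroring a local directory into the sandbox dir."""
    # Trailing slashes sync the directory's contents, not the directory itself.
    return [
        "rsync", "-az", "--delete", "-e", "ssh",
        f"{local_dir.rstrip('/')}/",
        f"{host}:{remote_dir.rstrip('/')}/",
    ]


def run_remote(host: str, remote_command: str) -> int:
    """Run a command over SSH and return its exit code."""
    return subprocess.run(ssh_command(host, remote_command)).returncode


def wait_healthy(
    url: str,
    timeout: float,
    run: Callable[[str], int],
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` with curl on the remote machine until it answers or time runs out."""
    # -f turns HTTP errors into a non-zero exit code; -sS stays quiet but shows errors.
    probe = f"curl -fsS -o /dev/null {shlex.quote(url)}"
    deadline = time.monotonic() + timeout
    while True:
        if run(probe) == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```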
---
### T02 — reporter, CLI, Makefile target
```task
id: CUST-WP-0028-T02
status: done
priority: high
state_hub_task_id: "cd845d62-0ab7-4180-bc87-59789883258d"
```
Write:
- `reporter.py` — POST structured result to state-hub `add_progress_event`
- `cli.py` — `python -m e2e_framework <repo-path> [--host HOST] [--keep]`
- `the-custodian/Makefile` — `make e2e REPO=<slug>` target
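The command line shown for `cli.py` maps directly onto `argparse`, and the reporter needs some serialisable result to send. The sketch below shows one possible shape under stated assumptions: `build_parser`, `parse_args`, and `build_result` are hypothetical names, the help texts are invented for illustration, and every field name in the result payload is made up, since the state-hub `add_progress_event` schema is not described here. The HTTP call itself is left out for the same reason.

```python
import argparse
import json
import os
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="e2e_framework",
        description="Run a repo's e2e suite in a remote sandbox.",
    )
    parser.add_argument(
        "repo_path", metavar="repo-path",
        help="path to the repo that contains e2e/e2e.yml",
    )
    parser.add_argument(
        "--host", default=os.environ.get("RAILIANCE01_HOST"),
        help="sandbox host (SSH alias or IP); defaults to $RAILIANCE01_HOST",
    )
    parser.add_argument(
        "--keep", action="store_true",
        help="leave the sandbox in place after the run",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse CLI arguments and insist that a sandbox host is known."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.host:
        parser.error("no sandbox host: pass --host or set RAILIANCE01_HOST")
    return args


def build_result(repo: str, passed: bool, exit_code: int, duration_s: float) -> str:
    """Serialise one run as JSON. All field names here are hypothetical."""
    return json.dumps({
        "repo": repo,
        "status": "passed" if passed else "failed",
        "exit_code": exit_code,
        "duration_s": round(duration_s, 1),
    })
```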
---
### T03 — activity-core e2e contract
```task
id: CUST-WP-0028-T03
status: done
priority: high
state_hub_task_id: "2ed2b805-2245-4f7e-84d8-4345c6c5455a"
```
In `activity-core/`:
- `e2e/e2e.yml` — references `docker-compose.dev.yml`, declares Temporal UI health check
- `e2e/compose.yml` — symlink to `../docker-compose.dev.yml`
---
### T04 — activity-core test script (closes T21)
```task
id: CUST-WP-0028-T04
status: done
priority: high
state_hub_task_id: "0d92ac62-dc04-495c-85fa-13a66ffe611a"
```
Write `activity-core/e2e/tests/test_full_flow.py`:
- Seeds one ActivityDefinition
- Triggers RunActivityWorkflow via Temporal client
- Polls for workflow completion
- Asserts run log written to DB
- Clear pass/fail output per step
Updates activity-core WP-0001 T21 status to `done`.
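The "poll for completion" and "clear pass/fail output per step" requirements can be met with two small helpers that are independent of Temporal and the database. This is a generic sketch, not the contents of `test_full_flow.py`: `poll_until` and `run_steps` are hypothetical names, and the step bodies in the demo are stubs standing in for the real seeding, workflow trigger, and DB assertion.

```python
import time
from typing import Callable, List, Tuple


def poll_until(predicate: Callable[[], bool], timeout: float = 60.0,
               interval: float = 1.0) -> bool:
    """Call `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True


def run_steps(steps: List[Tuple[str, Callable[[], None]]]) -> bool:
    """Run steps in order, print PASS or FAIL for each, stop at the first failure."""
    for label, step in steps:
        try:
            step()
        except Exception as exc:  # any error counts as a failed step
            print(f"FAIL  {label}: {exc}")
            return False
        print(f"PASS  {label}")
    return True


if __name__ == "__main__":
    state = {"polls": 0}

    def workflow_finished() -> bool:
        # Stub: pretend the workflow completes on the third poll.
        state["polls"] += 1
        return state["polls"] >= 3

    def wait_for_completion() -> None:
        if not poll_until(workflow_finished, timeout=5.0, interval=0.01):
            raise TimeoutError("workflow did not complete in time")

    ok = run_steps([
        ("seed ActivityDefinition", lambda: None),        # stub
        ("trigger RunActivityWorkflow", lambda: None),    # stub
        ("wait for workflow completion", wait_for_completion),
        ("assert run log written to DB", lambda: None),   # stub
    ])
    raise SystemExit(0 if ok else 1)
```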
---
### T05 — runbook + smoke test instructions
```task
id: CUST-WP-0028-T05
status: done
priority: medium
state_hub_task_id: "bbb106bd-bd89-4c11-b136-276e4d670097"
```
Write `e2e-framework/RUNBOOK.md`:
- Prerequisites (SSH access to sandbox host, Docker installed)
- First run: `export RAILIANCE01_HOST=<ip>; make e2e REPO=activity-core`
- Troubleshooting: sandbox cleanup, docker compose project list, log locations
Note: manual validation on railiance01 (the first live run) is still needed.
## Done Criteria
- [ ] `make e2e REPO=activity-core` runs the full stack on railiance01, reports pass/fail
- [ ] Sandbox is always cleaned up (compose down + dir removed) unless `--keep`
- [ ] Results posted to state-hub as progress event
- [ ] activity-core T21 closed by the automated test script
- [ ] Any repo can opt in by adding `e2e/e2e.yml`