diff --git a/.custodian-brief.md b/.custodian-brief.md
index 5251d63..c8939c3 100644
--- a/.custodian-brief.md
+++ b/.custodian-brief.md
@@ -1,17 +1,16 @@
 # Custodian Brief — guide-board
-**Domain:** markitect
-**Last synced:** 2026-05-15 11:49 UTC
+**Domain:** markitect
+**Last synced:** 2026-05-15 12:35 UTC
 **State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
 ## Active Workstreams
 ### Assessment Operations Baseline
-Progress: 3/6 done | workstream_id: `fc5b1573-91b2-4a19-b6a9-dd4d17057d9b`
+Progress: 4/6 done | workstream_id: `fc5b1573-91b2-4a19-b6a9-dd4d17057d9b`
 **Open tasks:**
-- · D2.4 - Service Job Durability Contract `10e4003c`
 - · D2.5 - Container Smoke Acceptance `9e2e7fa7`
 - · D2.6 - External Extension Acceptance Path `65fbf1df`
diff --git a/README.md b/README.md
index f0d061e..941c2c3 100644
--- a/README.md
+++ b/README.md
@@ -56,5 +56,6 @@ See:
 - [docs/CONTAINER.md](docs/CONTAINER.md)
 - [docs/EXTENSION-SDK.md](docs/EXTENSION-SDK.md)
 - [docs/LOCAL-SERVICE-API.md](docs/LOCAL-SERVICE-API.md)
+- [docs/SERVICE-JOB-DURABILITY.md](docs/SERVICE-JOB-DURABILITY.md)
 - [extensions/CANDIDATES.md](extensions/CANDIDATES.md)
 - [workplans/GUIDE-BOARD-WP-0001-bootstrapping.md](workplans/GUIDE-BOARD-WP-0001-bootstrapping.md)
diff --git a/docs/ASSESSMENT-OPERATIONS.md b/docs/ASSESSMENT-OPERATIONS.md
index ca6d61f..bb3d059 100644
--- a/docs/ASSESSMENT-OPERATIONS.md
+++ b/docs/ASSESSMENT-OPERATIONS.md
@@ -147,7 +147,8 @@ curl -sf http://127.0.0.1:8080/runs/JOB_ID/reports | python3 -m json.tool
 Service job state is currently in memory for the running service process. Run
 artifacts are durable in the output directory and can still be inspected after a
-service restart.
+service restart. See `docs/SERVICE-JOB-DURABILITY.md` for the restart and
+recovery contract.
 ## Status Vocabulary
diff --git a/docs/LOCAL-SERVICE-API.md b/docs/LOCAL-SERVICE-API.md
index fb9bf8b..481142f 100644
--- a/docs/LOCAL-SERVICE-API.md
+++ b/docs/LOCAL-SERVICE-API.md
@@ -87,7 +87,8 @@ run directory; the assessment result itself is still reported separately as
 ### `GET /runs`
-Lists known in-memory jobs for the current service process.
+Lists known in-memory jobs for the current service process. Job records are not
+durable across service restarts.
 ### `GET /runs/{job_id}`
@@ -111,4 +112,5 @@ podman run --rm -p 8080:8080 \
 ```
 The service keeps job state in memory. Durable run evidence remains in the
-mounted output directory.
+mounted output directory. See `docs/SERVICE-JOB-DURABILITY.md` for the explicit
+restart and recovery contract.
diff --git a/docs/SERVICE-JOB-DURABILITY.md b/docs/SERVICE-JOB-DURABILITY.md
new file mode 100644
index 0000000..e206f0f
--- /dev/null
+++ b/docs/SERVICE-JOB-DURABILITY.md
@@ -0,0 +1,90 @@
+# Service Job Durability
+
+Status: draft
+Created: 2026-05-15
+
+## Decision
+
+The guide-board local service keeps HTTP job state in memory for the baseline.
+This is intentional. The service is a thin local transport over the CLI
+contracts, not a workflow database.
+
+Durable state lives in run directories:
+
+- `run.json`
+- `plan.json`
+- `retention-summary.json`
+- `normalized/evidence.json`
+- `normalized/findings.json`
+- `normalized/mappings.json`
+- `reports/assessment-package.json`
+- `reports/report.md`
+- `artifacts/`
+
+The durable recovery index is the set of `retention-summary.json` files under a
+runs directory.
+
+## Why In-Memory Jobs Stay The Baseline
+
+In-memory service jobs keep the first service layer dependency-light and easy to
+embed in local, container, and extension-specific environments. Operators can
+restart the service without migrating or repairing a service database, and the
+CLI remains the source of truth for execution semantics.
+
+This also keeps interrupted service runs easy to reason about:
+
+- if the process exits before a run completes, the HTTP job record is gone,
+- any partial run directory remains for inspection,
+- completed runs are recoverable through retained run summaries,
+- repeated runs should use a new output directory or an intentional overwrite
+  policy chosen by the operator.
+
+## Restart Semantics
+
+After a service restart:
+
+- `GET /runs` returns only jobs created since the new service process started,
+- old `job_id` values are invalid,
+- `GET /runs/{job_id}` cannot recover pre-restart job metadata,
+- `GET /runs/{job_id}/reports` only works for jobs known to the current process,
+- run artifacts from earlier service processes remain available on disk.
+
+Operators should recover previous results with the CLI run-history commands:
+
+```sh
+PYTHONPATH=src python3 -m guide_board runs list --runs-dir runs
+PYTHONPATH=src python3 -m guide_board runs latest --runs-dir runs
+PYTHONPATH=src python3 -m guide_board runs report --runs-dir runs --run-id RUN_ID
+```
+
+## Recovery Flow
+
+Use this flow when the service process restarted or a browser/UI lost its job
+state:
+
+1. Identify the output directory passed to `POST /runs`.
+2. Confirm whether `retention-summary.json` exists.
+3. If it exists, use `guide-board runs report --runs-dir ` to retrieve
+   report paths.
+4. If only partial files exist, inspect `run.json`, `plan.json`, and artifacts
+   before rerunning.
+5. Rerun into a fresh output directory when the prior status is unclear.
+
+## Future Durable Index Option
+
+A future durable service index may be added if UI or automation workflows need
+cross-restart job lookup. If added, it should remain reconstructable from run
+directories and should not become the authority for assessment results.
+
+The minimum acceptable durable index would contain:
+
+- job id,
+- request payload,
+- job transport status,
+- run id,
+- output directory,
+- result paths,
+- error summary.
+
+The index should be optional, dependency-light, and repairable by scanning
+retained run summaries.
diff --git a/workplans/GUIDE-BOARD-WP-0002-assessment-operations-baseline.md b/workplans/GUIDE-BOARD-WP-0002-assessment-operations-baseline.md
index 19dc0fa..2329ce8 100644
--- a/workplans/GUIDE-BOARD-WP-0002-assessment-operations-baseline.md
+++ b/workplans/GUIDE-BOARD-WP-0002-assessment-operations-baseline.md
@@ -120,7 +120,7 @@ Progress:
 ```task
 id: GUIDE-BOARD-WP-0002-T004
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "10e4003c-dc11-4a8e-aecc-7815559ac439"
 ```
@@ -134,6 +134,18 @@ Acceptance:
 - If durable indexing is added, keep it dependency-light and reconstructable
   from retained run artifacts.
+Decision:
+
+- Keep local service job state intentionally in-memory for the baseline.
+- Treat run directories and `retention-summary.json` as the durable recovery
+  source.
+
+Progress:
+
+- Added `docs/SERVICE-JOB-DURABILITY.md`.
+- Linked the contract from README, the local service API docs, and the
+  assessment operations guide.
+
 ## D2.5 - Container Smoke Acceptance
 ```task
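
Review note: the durability contract added in this diff treats the `retention-summary.json` files under a runs directory as the durable recovery index, and says a future index should be repairable by scanning them. The following is a minimal sketch of such a scan, not the project's implementation. It assumes each run directory is a direct child of the runs directory, and it treats the summary contents as opaque JSON because the diff does not show their schema. The function name `scan_run_directories` is hypothetical.

```python
import json
from pathlib import Path


def scan_run_directories(runs_dir):
    """Split run directories into recoverable and partial ones.

    Assumption (not confirmed by the diff): every run directory is a
    direct child of runs_dir. A run counts as recoverable only when it
    holds a parseable retention-summary.json; anything else is reported
    as partial so an operator can inspect it before rerunning.
    """
    recoverable = []
    partial = []
    for run_dir in sorted(p for p in Path(runs_dir).iterdir() if p.is_dir()):
        summary_path = run_dir / "retention-summary.json"
        if not summary_path.is_file():
            partial.append(run_dir.name)
            continue
        try:
            # The summary is kept as-is; its fields are not interpreted here.
            summary = json.loads(summary_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # A truncated summary suggests the run was interrupted mid-write.
            partial.append(run_dir.name)
            continue
        recoverable.append({"run_dir": run_dir.name, "summary": summary})
    return recoverable, partial
```

Calling `scan_run_directories("runs")` after a service restart would give one list of runs that can be handed to the CLI run-history commands and one list of directories that need manual inspection. Keeping the scan read-only matches the stated goal that any index stays reconstructable from run directories rather than becoming the authority for results.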