feat(STATE-WP-0064): add consistency sweep remote-all API endpoint

Expose POST /consistency/sweep/remote-all so activity-core can trigger
the workstation ADR-001 remote-all sweep via the bridge tunnel pattern.
Records consistency_sweep_remote_all progress events and documents the
cutover runbook while the local custodian-sync timer remains interim.
This commit is contained in:
2026-06-21 20:19:22 +02:00
parent 0fdebc6aa8
commit 5a7a6ef5ee
9 changed files with 599 additions and 50 deletions

View File

@@ -1,8 +1,9 @@
# State Hub Cron → activity-core ActivityDefinition Migration (Design Stub)
# State Hub Cron → activity-core ActivityDefinition Migration
> CUST-WP-0040 T04. **Design stub — not yet implemented.**
> Migration depends on activity-core WP-0003 reaching the
> "ActivityDefinition file ingestion + cron trigger executor" milestone.
> CUST-WP-0040 T04. **Partially implemented** as of `STATE-WP-0064`.
> The consistency sweep API surface and ActivityDefinition are landed;
> cluster cutover still requires manual canary, parallel week, and local
> timer retirement.
The state hub currently runs two recurring maintenance jobs and one
per-repo event hook. Once activity-core is ready, each becomes an
@@ -36,41 +37,30 @@ run them on a schedule.
## 2. Target ActivityDefinitions
### A. `state-hub-consistency-sweep`
### A. `state-hub-consistency-sweep` (implemented)
```yaml
# activity-definitions/state-hub-consistency-sweep.yaml
id: the-custodian.state-hub-consistency-sweep
description: |
Sweep all registered repos: pull, reconcile workplan files ↔ DB,
apply writeback (C-15), respect pull gate (C-16). Mirrors the
existing custodian-sync systemd timer.
trigger:
trigger_type: cron
cron_expression: "*/15 * * * *"
timezone: UTC
misfire_policy: skip # if a prior run is still active, skip
context:
- kind: http_get # confirm state-hub API is reachable
url: http://127.0.0.1:8000/state/health
bind: hub_health
rule:
when:
- "hub_health.status == 'ok'"
instruction:
kind: shell
cmd: >-
cd /home/worsch/state-hub &&
.venv/bin/python scripts/consistency_check.py --remote --all --max-seconds 300
on_failure: log_and_continue # warn-only sweeps must not page on transient failures
```
Landed in `the-custodian/activity-definitions/state-hub-consistency-sweep.md`
with `enabled: false` until canary and cutover.
Invocation path (matches the hourly RecentlyOnScope pattern):
- activity-core context query: `consistency_sweep_remote_all`
- State Hub endpoint: `POST /consistency/sweep/remote-all`
- payload: `{"max_seconds": 300}`
- progress event: `consistency_sweep_remote_all`
State Hub runs `scripts/consistency_check.py --remote --all --json` on the
workstation host. activity-core does **not** shell into the laptop repo
checkout from the cluster.
Operator runbook: [`docs/consistency-sweep-runbook.md`](consistency-sweep-runbook.md).
Notes:
- Replaces the `custodian-sync.service` + `custodian-sync.timer` pair.
- Replaces the `custodian-sync.service` + `custodian-sync.timer` pair
after parallel week and cutover.
- Lock semantics (`/tmp/custodian-consistency-remote-all.lock`) stay in
the script — activity-core just sets the cadence.
- Once active, `infra/README.md` is updated to instruct users to delete
the systemd timer.
- Local timer retirement is tracked in `STATE-WP-0064-T04`.
### B. `state-hub-stale-task-cleanup`
@@ -119,16 +109,16 @@ trigger:
## 3. Required context queries
Both A and B want to confirm the state hub is reachable before running.
A reusable context source should be added to activity-core for this:
Implemented for A:
- `consistency_sweep_remote_all``POST /consistency/sweep/remote-all`
with a 330s resolver timeout (sweep budget default 300s).
Still optional for B and future splits:
- `state-hub.health``GET /state/health``{status, db, ...}`
- (optional) `state-hub.repos``GET /repos/?status=active` for the
sweep's per-repo branching, if we later split A into one
ActivityDefinition per repo.
These belong to the state-hub adapter referenced in the workplan's
out-of-scope note ("/sbom/status context query endpoint" etc.).
- (optional) `state-hub.repos``GET /repos/?status=active` for per-repo
ActivityDefinitions if the monolithic sweep is split later.
---
@@ -140,8 +130,8 @@ out-of-scope note ("/sbom/status context query endpoint" etc.).
| activity-core shell instruction kind with on_failure semantics | activity-core | activity-core/`src/...` |
| state-hub adapter exposing `state-hub.health` as a context source | activity-core | activity-core/adapters/ |
Until these land, the state hub continues to schedule jobs via systemd
timer + cron entries.
Until B lands and A is cut over, the state hub continues to schedule the
consistency sweep via the local systemd timer.
---