feat(STATE-WP-0064): add consistency sweep remote-all API endpoint

Expose POST /consistency/sweep/remote-all so activity-core can trigger
the workstation ADR-001 remote-all sweep via the bridge tunnel pattern.
Records consistency_sweep_remote_all progress events and documents the
cutover runbook while the local custodian-sync timer remains interim.
This commit is contained in:
2026-06-21 20:19:22 +02:00
parent 0fdebc6aa8
commit 5a7a6ef5ee
9 changed files with 599 additions and 50 deletions

View File

@@ -0,0 +1,105 @@
# State Hub Consistency Sweep Runbook
## Purpose
This runbook answers whether the 15-minute State Hub consistency sync ran
without relying on the local `custodian-sync.timer`.
The intended steady state after `STATE-WP-0064` cutover is:
- activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and
ActivityRun audit trail.
- State Hub on the workstation owns `scripts/consistency_check.py`, lock
semantics, reconciliation, and the `consistency_sweep_remote_all`
progress event.
- The local systemd timer is disabled after the parallel week passes.
## API Surface
Manual or cluster-triggered invocation:
```bash
curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
-H "Content-Type: application/json" \
-d '{"max_seconds": 300}' | python3 -m json.tool
```
From Railiance01 through the bridge tunnel, use the `STATE_HUB_URL`
configured for activity-core (for example the `actcore-state-hub-bridge`
service target).
## Schedule Check
From the activity-core host, confirm the definition is synced and the
Temporal schedule exists:
```bash
cd ~/activity-core
ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian make sync-activity-definitions
make sync-schedules
```
Expected definition:
- name: `State Hub Consistency Sweep`
- trigger: `*/15 * * * *`
- timezone: `UTC`
- misfire policy: `skip`
- enabled: `false` until manual canary passes, then `true` after cutover
## Progress Event Check
Query State Hub for the latest sweep event:
```bash
curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
| python3 -m json.tool
```
Healthy evidence includes:
- `lock_skipped: false` on normal runs
- `repos_processed` entries only for repos that needed action
- `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when
applicable
- `exit_code: 0` for warn-only remote-all sweeps
A `lock_skipped: true` response is normal when the local timer and the
cluster schedule overlap during the parallel week.
## ActivityRun Check
Query the activity-core database for the most recent run of the sweep
definition:
```sql
select
run_id,
fired_at,
scheduled_for,
context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;
```
## Manual Canary
Before enabling the cluster schedule:
1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
2. Trigger one manual ActivityRun or POST the API through the bridge URL.
3. Verify the progress event and ActivityRun context snapshot.
4. Confirm idempotence when the local timer also fires (lock skip is OK).
## Cutover
After one parallel week (`STATE-WP-0064-T03`):
```bash
systemctl --user disable --now custodian-sync.timer
```
Then enable the activity-core definition and treat the cluster schedule
as the sole primary runner.

View File

@@ -1,8 +1,9 @@
# State Hub Cron → activity-core ActivityDefinition Migration (Design Stub)
# State Hub Cron → activity-core ActivityDefinition Migration
> CUST-WP-0040 T04. **Design stub — not yet implemented.**
> Migration depends on activity-core WP-0003 reaching the
> "ActivityDefinition file ingestion + cron trigger executor" milestone.
> CUST-WP-0040 T04. **Partially implemented** as of `STATE-WP-0064`.
> The consistency sweep API surface and ActivityDefinition are landed;
> cluster cutover still requires manual canary, parallel week, and local
> timer retirement.
The state hub currently runs two recurring maintenance jobs and one
per-repo event hook. Once activity-core is ready, each becomes an
@@ -36,41 +37,30 @@ run them on a schedule.
## 2. Target ActivityDefinitions
### A. `state-hub-consistency-sweep`
### A. `state-hub-consistency-sweep` (implemented)
```yaml
# activity-definitions/state-hub-consistency-sweep.yaml
id: the-custodian.state-hub-consistency-sweep
description: |
Sweep all registered repos: pull, reconcile workplan files ↔ DB,
apply writeback (C-15), respect pull gate (C-16). Mirrors the
existing custodian-sync systemd timer.
trigger:
trigger_type: cron
cron_expression: "*/15 * * * *"
timezone: UTC
misfire_policy: skip # if a prior run is still active, skip
context:
- kind: http_get # confirm state-hub API is reachable
url: http://127.0.0.1:8000/state/health
bind: hub_health
rule:
when:
- "hub_health.status == 'ok'"
instruction:
kind: shell
cmd: >-
cd /home/worsch/state-hub &&
.venv/bin/python scripts/consistency_check.py --remote --all --max-seconds 300
on_failure: log_and_continue # warn-only sweeps must not page on transient failures
```
Landed in `the-custodian/activity-definitions/state-hub-consistency-sweep.md`
with `enabled: false` until canary and cutover.
Invocation path (matches the hourly RecentlyOnScope pattern):
- activity-core context query: `consistency_sweep_remote_all`
- State Hub endpoint: `POST /consistency/sweep/remote-all`
- payload: `{"max_seconds": 300}`
- progress event: `consistency_sweep_remote_all`
State Hub runs `scripts/consistency_check.py --remote --all --json` on the
workstation host. activity-core does **not** shell into the laptop repo
checkout from the cluster.
Operator runbook: [`docs/consistency-sweep-runbook.md`](consistency-sweep-runbook.md).
Notes:
- Replaces the `custodian-sync.service` + `custodian-sync.timer` pair.
- Replaces the `custodian-sync.service` + `custodian-sync.timer` pair
after parallel week and cutover.
- Lock semantics (`/tmp/custodian-consistency-remote-all.lock`) stay in
the script — activity-core just sets the cadence.
- Once active, `infra/README.md` is updated to instruct users to delete
the systemd timer.
- Local timer retirement is tracked in `STATE-WP-0064-T04`.
### B. `state-hub-stale-task-cleanup`
@@ -119,16 +109,16 @@ trigger:
## 3. Required context queries
Both A and B want to confirm the state hub is reachable before running.
A reusable context source should be added to activity-core for this:
Implemented for A:
- `consistency_sweep_remote_all``POST /consistency/sweep/remote-all`
with a 330s resolver timeout (sweep budget default 300s).
Still optional for B and future splits:
- `state-hub.health``GET /state/health``{status, db, ...}`
- (optional) `state-hub.repos``GET /repos/?status=active` for the
sweep's per-repo branching, if we later split A into one
ActivityDefinition per repo.
These belong to the state-hub adapter referenced in the workplan's
out-of-scope note ("/sbom/status context query endpoint" etc.).
- (optional) `state-hub.repos``GET /repos/?status=active` for per-repo
ActivityDefinitions if the monolithic sweep is split later.
---
@@ -140,8 +130,8 @@ out-of-scope note ("/sbom/status context query endpoint" etc.).
| activity-core shell instruction kind with on_failure semantics | activity-core | activity-core/`src/...` |
| state-hub adapter exposing `state-hub.health` as a context source | activity-core | activity-core/adapters/ |
Until these land, the state hub continues to schedule jobs via systemd
timer + cron entries.
Until B lands and A is cut over, the state hub continues to schedule the
consistency sweep via the local systemd timer.
---