generated from coulomb/repo-seed
feat(STATE-WP-0064): add consistency sweep remote-all API endpoint
Expose POST /consistency/sweep/remote-all so activity-core can trigger the workstation ADR-001 remote-all sweep via the bridge tunnel pattern. Records consistency_sweep_remote_all progress events and documents the cutover runbook while the local custodian-sync timer remains interim.
105
docs/consistency-sweep-runbook.md
Normal file
@@ -0,0 +1,105 @@
# State Hub Consistency Sweep Runbook

## Purpose

This runbook answers whether the 15-minute State Hub consistency sync ran
without relying on the local `custodian-sync.timer`.

The intended steady state after `STATE-WP-0064` cutover is:

- activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and
  ActivityRun audit trail.
- State Hub on the workstation owns `scripts/consistency_check.py`, lock
  semantics, reconciliation, and the `consistency_sweep_remote_all`
  progress event.
- The local systemd timer is disabled after the parallel week passes.

## API Surface

Manual or cluster-triggered invocation:

```bash
curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
  -H "Content-Type: application/json" \
  -d '{"max_seconds": 300}' | python3 -m json.tool
```

From Railiance01 through the bridge tunnel, use the `STATE_HUB_URL`
configured for activity-core (for example the `actcore-state-hub-bridge`
service target).
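
The same invocation can be scripted. The block below is a minimal Python sketch, not the project's client. It assumes the base URL is read from `STATE_HUB_URL` with a fallback to the localhost address from the curl example. The helper names (`resolve_base_url`, `build_sweep_request`, `trigger_sweep`) and the 30-second client-side slack are illustrative assumptions.

```python
import json
import urllib.request

# Assumption: fallback mirrors the localhost address used in the curl example.
DEFAULT_BASE_URL = "http://127.0.0.1:8000"


def resolve_base_url(env: dict) -> str:
    """Pick the State Hub base URL from an environment mapping (e.g. os.environ)."""
    return env.get("STATE_HUB_URL", DEFAULT_BASE_URL)


def build_sweep_request(base_url: str, max_seconds: int = 300) -> urllib.request.Request:
    """Build the POST request for the remote-all sweep without sending it."""
    body = json.dumps({"max_seconds": max_seconds}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/consistency/sweep/remote-all",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_sweep(base_url: str, max_seconds: int = 300, slack_seconds: int = 30) -> dict:
    """Send the sweep request and return the decoded JSON response."""
    request = build_sweep_request(base_url, max_seconds)
    # Hypothetical choice: wait a little longer than the sweep budget so the
    # client does not give up before the server finishes.
    with urllib.request.urlopen(request, timeout=max_seconds + slack_seconds) as response:
        return json.load(response)
```

A caller would typically run `trigger_sweep(resolve_base_url(os.environ))` and pretty-print the result, which corresponds to piping the curl output through `python3 -m json.tool`.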

## Schedule Check

From the activity-core host, confirm the definition is synced and the
Temporal schedule exists:

```bash
cd ~/activity-core
ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian make sync-activity-definitions
make sync-schedules
```

Expected definition:

- name: `State Hub Consistency Sweep`
- trigger: `*/15 * * * *`
- timezone: `UTC`
- misfire policy: `skip`
- enabled: `false` until manual canary passes, then `true` after cutover
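
The expected values above can be compared against a synced definition mechanically. This is a small sketch under stated assumptions: the dictionary keys (`name`, `trigger`, `timezone`, `misfire_policy`, `enabled`) are hypothetical and may not match how activity-core actually exposes a definition, and `definition_mismatches` is an illustrative helper, not an existing function.

```python
# Expected values copied from the runbook list; the key names are hypothetical.
EXPECTED = {
    "name": "State Hub Consistency Sweep",
    "trigger": "*/15 * * * *",
    "timezone": "UTC",
    "misfire_policy": "skip",
}


def definition_mismatches(definition: dict, cutover_done: bool) -> list:
    """Return readable mismatches between a synced definition and the runbook."""
    problems = []
    for key, expected in EXPECTED.items():
        actual = definition.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    # enabled stays false until the manual canary passes, then true after cutover.
    actual_enabled = definition.get("enabled")
    if actual_enabled != cutover_done:
        problems.append(f"enabled: expected {cutover_done!r}, got {actual_enabled!r}")
    return problems
```

An empty list means the definition matches the runbook for the given phase. Any entry names the field to investigate before enabling the schedule.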

## Progress Event Check

Query State Hub for the latest sweep event:

```bash
curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
  | python3 -m json.tool
```

Healthy evidence includes:

- `lock_skipped: false` on normal runs
- `repos_processed` entries only for repos that needed action
- `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when
  applicable
- `exit_code: 0` for warn-only remote-all sweeps

A `lock_skipped: true` response is normal when the local timer and the
cluster schedule overlap during the parallel week.
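
The criteria above can be turned into a quick triage helper. The sketch below assumes that `lock_skipped` and `exit_code` sit at the top level of each event dictionary; the real payload may nest them differently. The three labels and the name `classify_sweep_event` are illustrative.

```python
def classify_sweep_event(event: dict) -> str:
    """Classify one progress event as 'lock_skipped', 'healthy', or 'unhealthy'."""
    # A lock skip is expected while the local timer and cluster schedule overlap.
    if event.get("lock_skipped") is True:
        return "lock_skipped"
    # A normal run reports lock_skipped: false together with exit_code: 0.
    if event.get("lock_skipped") is False and event.get("exit_code") == 0:
        return "healthy"
    # Anything else (missing fields, non-zero exit code) deserves a closer look.
    return "unhealthy"
```

Applied to the five most recent events, a mix of `healthy` and `lock_skipped` is consistent with the parallel week. Repeated `unhealthy` results suggest the sweep itself needs attention.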

## ActivityRun Check

Query the activity-core database for the most recent run of the sweep
definition:

```sql
select
  run_id,
  fired_at,
  scheduled_for,
  context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;
```
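
To answer whether the sweep ran recently, compare the newest `fired_at` from the query above against the 15-minute cadence. This is a minimal sketch: the 5-minute grace period is an assumption, not a documented threshold, and `sweep_is_overdue` is a hypothetical helper that expects timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The schedule fires every 15 minutes (*/15 * * * * UTC).
SCHEDULE_INTERVAL = timedelta(minutes=15)


def sweep_is_overdue(
    latest_fired_at: Optional[datetime],
    now: datetime,
    grace: timedelta = timedelta(minutes=5),  # assumed tolerance, tune as needed
) -> bool:
    """Return True when no run fired within one interval plus the grace period."""
    if latest_fired_at is None:
        # No rows at all: the schedule has never produced a run.
        return True
    return now - latest_fired_at > SCHEDULE_INTERVAL + grace
```

Because the misfire policy is `skip`, a single skipped slot can legitimately stretch the gap to about 30 minutes, so one overdue result may be a prompt to check the progress events rather than proof of a failure.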

## Manual Canary

Before enabling the cluster schedule:

1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
2. Trigger one manual ActivityRun or POST the API through the bridge URL.
3. Verify the progress event and ActivityRun context snapshot.
4. Confirm idempotence when the local timer also fires (lock skip is OK).

## Cutover

After one parallel week (`STATE-WP-0064-T03`):

```bash
systemctl --user disable --now custodian-sync.timer
```

Then enable the activity-core definition and treat the cluster schedule
as the sole primary runner.
@@ -1,8 +1,9 @@
-# State Hub Cron → activity-core ActivityDefinition Migration (Design Stub)
+# State Hub Cron → activity-core ActivityDefinition Migration

-> CUST-WP-0040 T04. **Design stub — not yet implemented.**
-> Migration depends on activity-core WP-0003 reaching the
-> "ActivityDefinition file ingestion + cron trigger executor" milestone.
+> CUST-WP-0040 T04. **Partially implemented** as of `STATE-WP-0064`.
+> The consistency sweep API surface and ActivityDefinition are landed;
+> cluster cutover still requires manual canary, parallel week, and local
+> timer retirement.

 The state hub currently runs two recurring maintenance jobs and one
 per-repo event hook. Once activity-core is ready, each becomes an
@@ -36,41 +37,30 @@ run them on a schedule.

 ## 2. Target ActivityDefinitions

-### A. `state-hub-consistency-sweep`
+### A. `state-hub-consistency-sweep` (implemented)

-```yaml
-# activity-definitions/state-hub-consistency-sweep.yaml
-id: the-custodian.state-hub-consistency-sweep
-description: |
-  Sweep all registered repos: pull, reconcile workplan files ↔ DB,
-  apply writeback (C-15), respect pull gate (C-16). Mirrors the
-  existing custodian-sync systemd timer.
-trigger:
-  trigger_type: cron
-  cron_expression: "*/15 * * * *"
-  timezone: UTC
-  misfire_policy: skip # if a prior run is still active, skip
-context:
-  - kind: http_get # confirm state-hub API is reachable
-    url: http://127.0.0.1:8000/state/health
-    bind: hub_health
-rule:
-  when:
-    - "hub_health.status == 'ok'"
-instruction:
-  kind: shell
-  cmd: >-
-    cd /home/worsch/state-hub &&
-    .venv/bin/python scripts/consistency_check.py --remote --all --max-seconds 300
-  on_failure: log_and_continue # warn-only sweeps must not page on transient failures
-```
+Landed in `the-custodian/activity-definitions/state-hub-consistency-sweep.md`
+with `enabled: false` until canary and cutover.
+
+Invocation path (matches the hourly RecentlyOnScope pattern):
+
+- activity-core context query: `consistency_sweep_remote_all`
+- State Hub endpoint: `POST /consistency/sweep/remote-all`
+- payload: `{"max_seconds": 300}`
+- progress event: `consistency_sweep_remote_all`
+
+State Hub runs `scripts/consistency_check.py --remote --all --json` on the
+workstation host. activity-core does **not** shell into the laptop repo
+checkout from the cluster.
+
+Operator runbook: [`docs/consistency-sweep-runbook.md`](consistency-sweep-runbook.md).

 Notes:
-- Replaces the `custodian-sync.service` + `custodian-sync.timer` pair.
+- Replaces the `custodian-sync.service` + `custodian-sync.timer` pair
+  after parallel week and cutover.
 - Lock semantics (`/tmp/custodian-consistency-remote-all.lock`) stay in
   the script — activity-core just sets the cadence.
-- Once active, `infra/README.md` is updated to instruct users to delete
-  the systemd timer.
+- Local timer retirement is tracked in `STATE-WP-0064-T04`.

 ### B. `state-hub-stale-task-cleanup`

@@ -119,16 +109,16 @@ trigger:

 ## 3. Required context queries

-Both A and B want to confirm the state hub is reachable before running.
-A reusable context source should be added to activity-core for this:
+Implemented for A:
+
+- `consistency_sweep_remote_all` — `POST /consistency/sweep/remote-all`
+  with a 330s resolver timeout (sweep budget default 300s).
+
+Still optional for B and future splits:

 - `state-hub.health` — `GET /state/health` → `{status, db, ...}`
-- (optional) `state-hub.repos` — `GET /repos/?status=active` for the
-  sweep's per-repo branching, if we later split A into one
-  ActivityDefinition per repo.
-
-These belong to the state-hub adapter referenced in the workplan's
-out-of-scope note ("/sbom/status context query endpoint" etc.).
+- (optional) `state-hub.repos` — `GET /repos/?status=active` for per-repo
+  ActivityDefinitions if the monolithic sweep is split later.

 ---

@@ -140,8 +130,8 @@ out-of-scope note ("/sbom/status context query endpoint" etc.).
 | activity-core shell instruction kind with on_failure semantics | activity-core | activity-core/`src/...` |
 | state-hub adapter exposing `state-hub.health` as a context source | activity-core | activity-core/adapters/ |

-Until these land, the state hub continues to schedule jobs via systemd
-timer + cron entries.
+Until B lands and A is cut over, the state hub continues to schedule the
+consistency sweep via the local systemd timer.

 ---
