generated from coulomb/repo-seed
finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures
STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine — automation error vs assessment failure:
- C-00 is an automation error; C-01..C-23 assessment failures are recorded
  for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and
  assessment_failures separately from exit_code
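
The split between automation errors and assessment failures described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/consistency_check.py`: the `SweepOutcome` type, the `classify` function, the input shape (failed check IDs per repo), and the choice of `1` as the failing exit code are all assumptions made for the example. Only the C-00 versus C-01..C-23 distinction and the "exit 0 on assessment failures" rule come from the commit message.

```python
from dataclasses import dataclass

# Per the commit message, C-00 is the automation-error check;
# C-01..C-23 are assessment checks recorded for follow-up.
AUTOMATION_CHECK = "C-00"


@dataclass(frozen=True)
class SweepOutcome:
    """Hypothetical summary of one --remote --all sweep."""
    automation_error: bool
    assessment_failures: int  # number of repos with at least one C-01..C-23 failure
    exit_code: int


def classify(failed_checks_by_repo: dict[str, list[str]]) -> SweepOutcome:
    """Derive the exit code from automation errors only (assumed input shape)."""
    automation_error = any(
        AUTOMATION_CHECK in checks for checks in failed_checks_by_repo.values()
    )
    assessment_failures = sum(
        1
        for checks in failed_checks_by_repo.values()
        if any(check != AUTOMATION_CHECK for check in checks)
    )
    # Assumption: any non-zero code signals failure to the scheduler; 1 is illustrative.
    exit_code = 1 if automation_error else 0
    return SweepOutcome(automation_error, assessment_failures, exit_code)
```

With this shape, a sweep in which two repos have hygiene gaps but nothing infrastructural went wrong would still report exit code 0, while a single C-00 failure would fail the run regardless of the assessment count.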
@@ -84,7 +84,9 @@ unset.
 the rule lives in activity-core.
 
 See [`docs/cron-migration.md`](cron-migration.md) for the
-ActivityDefinition drafts and cutover plan.
+ActivityDefinition drafts and cutover plan. The consistency sweep schedule
+is live on Railiance01 — operator runbook:
+[`docs/consistency-sweep-runbook.md`](consistency-sweep-runbook.md).
 
 ## What must never happen
 
@@ -3,16 +3,16 @@
 ## Purpose
 
 This runbook answers whether the 15-minute State Hub consistency sync ran
-without relying on the local `custodian-sync.timer`.
+without relying on the local `custodian-sync.timer` (retired 2026-06-21).
 
-The intended steady state after `STATE-WP-0064` cutover is:
+**Steady state** (`STATE-WP-0064` cutover complete):
 
 - activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and
 ActivityRun audit trail.
 - State Hub on the workstation owns `scripts/consistency_check.py`, lock
 semantics, reconciliation, and the `consistency_sweep_remote_all`
 progress event.
-- The local systemd timer is disabled after the parallel week passes.
+- The local systemd timer is **disabled**; cluster is the sole scheduler.
 
 ## API Surface
 
@@ -65,7 +65,7 @@ Expected definition:
 - trigger: `*/15 * * * *`
 - timezone: `UTC`
 - misfire policy: `skip`
-- enabled: `true` during parallel week (T03); local timer retired after T04
+- enabled: `true`
 
 ## Progress Event Check
 
@@ -78,14 +78,17 @@ curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all
 
 Healthy evidence includes:
 
+- `detail.source: activity-core` on scheduled runs
 - `lock_skipped: false` on normal runs
 - `repos_processed` entries only for repos that needed action
 - `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when
 applicable
-- `exit_code: 0` for warn-only remote-all sweeps
+- `exit_code: 0` when automation completed (assessment failures are OK)
+- `automation_error: true` only for infrastructure faults (API down, C-00, etc.)
+- `assessment_failures` counts repos with hygiene gaps (C-01..C-23) for follow-up
 
-A `lock_skipped: true` response is normal when the local timer and the
-cluster schedule overlap during the parallel week.
+A `lock_skipped: true` response is normal when a sweep is already in flight.
+Assessment failures do not fail the scheduler; automation errors do.
 
 ## ActivityRun Check
 
@@ -106,40 +109,26 @@ limit 5;
 
 ## Manual Canary
 
-Before enabling the cluster schedule:
+Before enabling or after changing the cluster schedule:
 
 1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
 2. Trigger one manual ActivityRun or POST the API through the bridge URL.
 3. Verify the progress event and ActivityRun context snapshot.
-4. Confirm idempotence when the local timer also fires (lock skip is OK).
 
-## Parallel week observability (T03)
+## Observability
 
-Both runners call the same API and tag progress events with `detail.source`:
-
-| Source | Runner |
-|--------|--------|
-| `local-timer` | `custodian-sync.timer` on the workstation |
-| `activity-core` | Railiance01 Temporal schedule |
-
-Summarise evidence:
+Summarise recent sweep events by source:
 
 ```bash
 cd ~/state-hub
 uv run python scripts/compare_consistency_sweep_parallel.py --since-hours 24
 ```
 
-Expect some `lock_skipped: true` events when both schedules overlap — that is
-healthy idempotence, not duplicate work.
+After cutover, expect only `activity-core` (and manual) sources — no new
+`local-timer` events.
 
-Parallel window: **2026-06-21 → 2026-06-28** (review before T04 cutover).
+## Local fallback (emergency only)
 
-## Cutover
-
-After one parallel week (`STATE-WP-0064-T03`):
-
-```bash
-systemctl --user disable --now custodian-sync.timer
-```
-
-The cluster definition stays enabled; disable only the local timer.
+If cluster scheduling is broken, temporarily re-enable the archived systemd
+units per [`infra/systemd/archived/README.md`](../infra/systemd/archived/README.md).
+Disable again once cluster scheduling is restored.
@@ -1,9 +1,8 @@
 # State Hub Cron → activity-core ActivityDefinition Migration
 
-> CUST-WP-0040 T04. **Partially implemented** as of `STATE-WP-0064`.
-> The consistency sweep API surface and ActivityDefinition are landed;
-> cluster cutover still requires manual canary, parallel week, and local
-> timer retirement.
+> CUST-WP-0040 T04. **Consistency sweep cut over** as of `STATE-WP-0064`
+> (2026-06-21). Scheduling is on activity-core (Railiance01); the local
+> `custodian-sync.timer` is retired. Stale-task cleanup (B) is still pending.
 
 The state hub currently runs two recurring maintenance jobs and one
 per-repo event hook. Once activity-core is ready, each becomes an
@@ -16,7 +15,7 @@ keeps the underlying scripts; only the *scheduling* moves.
 
 | # | Source | Trigger today | Script invoked | What it does |
 | - | ------------------- | -------------------------------------------------------- | -------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
-| 1 | systemd user timer | every 15 min | `scripts/consistency_check.py --remote --all` | Pull every registered repo, reconcile workplan files ↔ DB, run C-15 writeback + C-16 pull gate |
+| 1 | activity-core cron | every 15 min (Railiance01) | `POST /consistency/sweep/remote-all` → `consistency_check.py --remote --all` | Pull every registered repo, reconcile workplan files ↔ DB, run C-15 writeback + C-16 pull gate |
 | 2 | manual / daily cron | `make cleanup-stale` (suggested `0 3 * * *`) | `scripts/cleanup_stale_tasks.py` | Cancel tasks still open in finished/archived workstreams; emits `org.statehub.task.stale` |
 | 3 | git post-commit | every commit in a registered repo | `make fix-consistency REPO=<slug>` | Per-repo workplan ↔ DB sync immediately after a commit |
 
@@ -40,7 +39,7 @@ run them on a schedule.
 ### A. `state-hub-consistency-sweep` (implemented)
 
 Landed in `the-custodian/activity-definitions/state-hub-consistency-sweep.md`
-with `enabled: false` until canary and cutover.
+with `enabled: true` on Railiance01 since 2026-06-21 cutover.
 
 Invocation path (matches the hourly RecentlyOnScope pattern):
 
@@ -56,11 +55,10 @@ checkout from the cluster.
 Operator runbook: [`docs/consistency-sweep-runbook.md`](consistency-sweep-runbook.md).
 
 Notes:
-- Replaces the `custodian-sync.service` + `custodian-sync.timer` pair
-after parallel week and cutover.
+- Replaced the `custodian-sync.service` + `custodian-sync.timer` pair
+(local timer disabled 2026-06-21; units archived under `infra/systemd/archived/`).
 - Lock semantics (`/tmp/custodian-consistency-remote-all.lock`) stay in
 the script — activity-core just sets the cadence.
-- Local timer retirement is tracked in `STATE-WP-0064-T04`.
 
 ### B. `state-hub-stale-task-cleanup`
 
@@ -130,8 +128,8 @@ Still optional for B and future splits:
 | activity-core shell instruction kind with on_failure semantics | activity-core | activity-core/`src/...` |
 | state-hub adapter exposing `state-hub.health` as a context source | activity-core | activity-core/adapters/ |
 
-Until B lands and A is cut over, the state hub continues to schedule the
-consistency sweep via the local systemd timer.
+A is cut over. Until B lands, stale-task cleanup remains on-demand via
+`make cleanup-stale` (or a manual daily cron).
 
 ---
 
@@ -142,11 +140,9 @@ consistency sweep via the local systemd timer.
 same DB / NATS effects as the current cron entries.
 3. Run both in parallel for one week (cron + ActivityDefinition). The
 scripts are idempotent — duplicate runs are no-ops on a clean state.
-4. Disable the systemd timer:
-`systemctl --user disable --now custodian-sync.timer`
-5. Remove the cleanup-stale cron entry from `crontab -e`.
-6. Update `infra/README.md` to point at the ActivityDefinitions and
-archive the systemd unit files.
+4. ~~Disable the systemd timer~~ — **done** 2026-06-21 (`STATE-WP-0064`).
+5. Remove the cleanup-stale cron entry from `crontab -e` (when B is enabled).
+6. ~~Update `infra/README.md` and archive systemd unit files~~ — **done**.
 7. Per-commit hook stays until a `repo.commit.pushed` event exists.
 
 ---
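
The "Healthy evidence" list in the runbook change above could be turned into a small post-cutover check along these lines. This is a sketch under stated assumptions: the field names (`source`, `lock_skipped`, `exit_code`, `automation_error`, `assessment_failures`) are taken from the runbook text, but the JSON layout (all of them nested under a `detail` object) and the helper name `sweep_event_problems` are hypothetical and may not match the real progress event payload.

```python
def sweep_event_problems(event: dict) -> list[str]:
    """Return reasons a consistency_sweep_remote_all event looks unhealthy.

    Assumes every field of interest sits under event["detail"]; adjust the
    lookups if the real payload is shaped differently.
    """
    detail = event.get("detail", {})
    problems = []

    # After cutover only activity-core (and manual) sources are expected.
    if detail.get("source") == "local-timer":
        problems.append("local-timer event after cutover")

    # Automation errors (API down, C-00, ...) are the only scheduler failures.
    if detail.get("automation_error"):
        problems.append("automation error (infrastructure fault)")

    if detail.get("exit_code", 0) != 0:
        problems.append(f"non-zero exit_code {detail['exit_code']}")

    # lock_skipped: true and assessment_failures > 0 are deliberately not
    # flagged: the runbook treats both as normal or follow-up items.
    return problems
```

An empty list means the event matches the healthy profile; `assessment_failures` is left for follow-up reporting rather than alerting, in line with the rule that assessment failures do not fail the scheduler.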