State Hub Cron → activity-core ActivityDefinition Migration
CUST-WP-0040 T04. Consistency sweep cut over as of STATE-WP-0064 (2026-06-21). Scheduling is on activity-core (Railiance01); the local `custodian-sync.timer` is retired. Stale-task cleanup (B) is still pending.
The state hub currently runs two recurring maintenance jobs and one per-repo event hook. Once activity-core is ready, each becomes an ActivityDefinition file checked into the appropriate repo. The state hub keeps the underlying scripts; only the scheduling moves.
1. Inventory of current maintenance automations
| # | Source | Trigger today | Script invoked | What it does |
|---|---|---|---|---|
| 1 | activity-core cron | every 15 min (Railiance01) | `POST /consistency/sweep/remote-all` → `consistency_check.py --remote --all` | Pull every registered repo, reconcile workplan files ↔ DB, run C-15 writeback + C-16 pull gate |
| 2 | manual / daily cron | `make cleanup-stale` (suggested `0 3 * * *`) | `scripts/cleanup_stale_tasks.py` | Cancel tasks still open in finished/archived workstreams; emits `org.statehub.task.stale` |
| 3 | git post-commit | every commit in a registered repo | `make fix-consistency REPO=<slug>` | Per-repo workplan ↔ DB sync immediately after a commit |
Honourable mentions (not currently scheduled, on-demand only — listed for completeness so they don't get mistakenly picked up):
- `scripts/ingest_sbom.py`: invoked via `make ingest-sbom REPO=<slug>`.
- `scripts/ingest_capabilities.py`: invoked via `make ingest-capabilities[-all]`.
- `scripts/check_doi.py`: invoked via `make check-doi[-all]`.
- `scripts/validate_repo_adr.py`: invoked manually for canon promotion.
- `scripts/ingest_tpsc.py`: invoked via `make ingest-tpsc[-all]`.
These are not in scope for cron migration — they remain on-demand operator/CI commands. They become candidates only if we later decide to run them on a schedule.
2. Target ActivityDefinitions
A. state-hub-consistency-sweep (implemented)
Landed in `the-custodian/activity-definitions/state-hub-consistency-sweep.md`
with `enabled: true` on Railiance01 since the 2026-06-21 cutover.
Invocation path (matches the hourly RecentlyOnScope pattern):
- activity-core context query: `consistency_sweep_remote_all`
- State Hub endpoint: `POST /consistency/sweep/remote-all`
- payload: `{"max_seconds": 300}`
- progress event: `consistency_sweep_remote_all`
State Hub runs scripts/consistency_check.py --remote --all --json on the
workstation host. activity-core does not shell into the laptop repo
checkout from the cluster.
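The invocation path above can be sketched as a plain HTTP call. This is only an illustration, not activity-core's resolver: the base URL is a placeholder, the helper names are hypothetical, and the assumption that the endpoint answers with a JSON body is not confirmed by this document. The endpoint, the payload, and the 300s/330s figures are the ones stated in this document.

```python
import json
import urllib.request

# Placeholder: the real State Hub address is not given in this document.
STATE_HUB_URL = "http://localhost:8000"

SWEEP_BUDGET_SECONDS = 300      # payload max_seconds (sweep budget default)
RESOLVER_TIMEOUT_SECONDS = 330  # caller waits slightly longer than the budget


def build_sweep_request(base_url=STATE_HUB_URL, max_seconds=SWEEP_BUDGET_SECONDS):
    """Build the POST request for the remote-all sweep without sending it."""
    body = json.dumps({"max_seconds": max_seconds}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/consistency/sweep/remote-all",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_sweep(base_url=STATE_HUB_URL):
    """Send the request; assumes (unverified) that the response is JSON."""
    req = build_sweep_request(base_url)
    with urllib.request.urlopen(req, timeout=RESOLVER_TIMEOUT_SECONDS) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping the client timeout above the sweep budget gives State Hub room to stop the sweep itself and report back before the caller gives up.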
Operator runbook: docs/consistency-sweep-runbook.md.
Notes:
- Replaced the `custodian-sync.service` + `custodian-sync.timer` pair (local timer disabled 2026-06-21; units archived under `infra/systemd/archived/`).
- Lock semantics (`/tmp/custodian-consistency-remote-all.lock`) stay in the script; activity-core just sets the cadence.
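For readers unfamiliar with script-side locking, the following is a minimal sketch of how a script can guard itself with a lock file so that overlapping scheduled runs do not collide. It assumes a non-blocking `flock` on the lock path named above; how `consistency_check.py` actually takes its lock is not shown in this document, and `sweep_lock` is a hypothetical name.

```python
import fcntl
import os
from contextlib import contextmanager

LOCK_PATH = "/tmp/custodian-consistency-remote-all.lock"


@contextmanager
def sweep_lock(path=LOCK_PATH):
    """Yield True if this run holds the lock, False if another run does."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking: fail immediately instead of queueing behind a run.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A scheduled run would wrap its work in `with sweep_lock() as held:` and exit early when `held` is false, which is why the scheduler only needs to set the cadence.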
B. state-hub-stale-task-cleanup
    # activity-definitions/state-hub-stale-task-cleanup.yaml
    id: the-custodian.state-hub-stale-task-cleanup
    description: |
      Daily sweep that cancels tasks still `wait|todo|progress` inside
      finished or archived workstreams. Each cancellation also emits
      org.statehub.task.stale on NATS for downstream reaction.
    trigger:
      trigger_type: cron
      cron_expression: "0 3 * * *"
      timezone: UTC
    instruction:
      kind: shell
      cmd: >-
        cd /home/worsch/state-hub &&
        .venv/bin/python scripts/cleanup_stale_tasks.py
Notes:
- Replaces the documented (`Cron example: 0 3 * * * …`) daily run.
- The script already emits NATS events (see CUST-WP-0040 T03), so downstream ActivityDefinitions can react per-task without a second pass.
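The selection rule described for this job can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of `scripts/cleanup_stale_tasks.py`: the `Task` shape, the `cancelled` status name, the event payload, and the `emit` callback are all hypothetical. Only the open states, the workstream states, and the event subject come from this document.

```python
from dataclasses import dataclass

OPEN_TASK_STATES = {"wait", "todo", "progress"}
CLOSED_WORKSTREAM_STATES = {"finished", "archived"}
STALE_SUBJECT = "org.statehub.task.stale"


@dataclass
class Task:
    id: str
    status: str
    workstream_status: str


def select_stale(tasks):
    """Open tasks whose workstream is already finished or archived."""
    return [
        t for t in tasks
        if t.status in OPEN_TASK_STATES
        and t.workstream_status in CLOSED_WORKSTREAM_STATES
    ]


def cancel_stale(tasks, emit):
    """Cancel each stale task and emit one event per cancellation."""
    cancelled = []
    for task in select_stale(tasks):
        task.status = "cancelled"  # assumed status name
        emit(STALE_SUBJECT, {"task_id": task.id})  # assumed payload shape
        cancelled.append(task.id)
    return cancelled
```

Because cancelled tasks no longer match the selection, a second pass over the same data cancels nothing and emits nothing, which is the property a parallel cron and ActivityDefinition run relies on.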
C. Per-commit consistency sync (currently a git hook)
The git post-commit hook installed by state-hub/scripts/install_hooks.sh
is event-driven, not cron-based. Migrating it to activity-core would
require a repo.commit.pushed event channel that doesn't exist yet.
Recommendation: keep the git hook as-is for now. Revisit once an event source (e.g. Gitea webhook fed into NATS) is available, at which point an event-triggered ActivityDefinition can replace it cleanly:
    trigger:
      trigger_type: event
      event_type: org.repo.commit.pushed
      filters:
        repo_slug: "*"
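To make the intent of the `filters` block concrete, here is one possible reading of how such a trigger could be matched against an incoming event. It assumes filter values are shell-style glob patterns compared against top-level event fields; activity-core's real filter semantics are not documented here, and `matches` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matches(trigger, event):
    """Return True if the event satisfies an event-type trigger."""
    if trigger.get("trigger_type") != "event":
        return False
    if event.get("event_type") != trigger.get("event_type"):
        return False
    for key, pattern in trigger.get("filters", {}).items():
        value = event.get(key)
        # A missing field never matches, even against "*".
        if value is None or not fnmatchcase(str(value), pattern):
            return False
    return True


trigger = {
    "trigger_type": "event",
    "event_type": "org.repo.commit.pushed",
    "filters": {"repo_slug": "*"},
}
```

Under this reading, `repo_slug: "*"` fires for every registered repo, and a narrower pattern such as `state-*` would limit the trigger to matching slugs.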
3. Required context queries
Implemented for A:
- `consistency_sweep_remote_all`: `POST /consistency/sweep/remote-all` with a 330s resolver timeout (sweep budget default 300s).
Still optional for B and future splits:
- `state-hub.health`: `GET /state/health` → `{status, db, ...}`
- (optional) `state-hub.repos`: `GET /repos/?status=active` for per-repo ActivityDefinitions if the monolithic sweep is split later.
4. Blockers / sequencing
| Blocker | Owner | Where it lands |
|---|---|---|
| activity-core ActivityDefinition file ingestion + cron executor (WP-0003) | activity-core | activity-core/src/... |
| activity-core shell instruction kind with on_failure semantics | activity-core | activity-core/src/... |
| state-hub adapter exposing `state-hub.health` as a context source | activity-core | activity-core/adapters/ |
A is cut over. Until B lands, stale-task cleanup remains on-demand via
make cleanup-stale (or a manual daily cron).
5. Cutover plan (when ready)
- Land ActivityDefinitions A + B in activity-core.
- Enable them in staging; verify they fire on schedule and produce the same DB / NATS effects as the current cron entries.
- Run both in parallel for one week (cron + ActivityDefinition). The scripts are idempotent — duplicate runs are no-ops on a clean state.
- Disable the systemd timer: done 2026-06-21 (STATE-WP-0064).
- Remove the cleanup-stale cron entry from `crontab -e` (when B is enabled).
- Update `infra/README.md` and archive systemd unit files: done.
- Per-commit hook stays until a `repo.commit.pushed` event exists.
6. Open questions
- Locking: should activity-core wrap shell instructions with a process lock (today the script self-locks via `/tmp/...`)? If yes, the state-hub script's lock can be removed.
- Failure surfacing: today systemd journals capture stderr. Where does an ActivityDefinition's shell stderr go (logs? activity history?)? This needs activity-core docs before cutover.
- Per-repo split: do we split A into one ActivityDefinition per registered repo (so failures don't poison the sweep), or keep the monolithic `--all` mode? The latter is simpler and matches today's behaviour; the former gives better observability.