
tegwick 39ed5459b9 finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures

STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine — automation error vs assessment failure:
- C-00 is an automation error; C-01..C-23 assessment failures are recorded
  for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and
  assessment_failures separately from exit_code

2026-06-22 01:20:59 +02:00

7.3 KiB


State Hub Cron → activity-core ActivityDefinition Migration

CUST-WP-0040 T04. Consistency sweep cut over as of STATE-WP-0064 (2026-06-21). Scheduling is on activity-core (Railiance01); the local custodian-sync.timer is retired. Stale-task cleanup (B) is still pending.

The state hub has two recurring maintenance jobs and one per-repo event hook. Each recurring job becomes an ActivityDefinition file checked into the appropriate repo as activity-core support lands: the consistency sweep (A) is already migrated, and the stale-task cleanup (B) follows. The state hub keeps the underlying scripts; only the scheduling moves.

1. Inventory of current maintenance automations

| # | Source | Trigger today | Script invoked | What it does |
| --- | --- | --- | --- | --- |
| 1 | activity-core cron | every 15 min (Railiance01) | `POST /consistency/sweep/remote-all` → `consistency_check.py --remote --all` | Pull every registered repo, reconcile workplan files ↔ DB, run C-15 writeback + C-16 pull gate |
| 2 | manual / daily cron | `make cleanup-stale` (suggested `0 3 * * *`) | `scripts/cleanup_stale_tasks.py` | Cancel tasks still open in finished/archived workstreams; emits `org.statehub.task.stale` |
| 3 | git post-commit | every commit in a registered repo | `make fix-consistency REPO=<slug>` | Per-repo workplan ↔ DB sync immediately after a commit |

Honourable mentions (not currently scheduled, on-demand only — listed for completeness so they don't get mistakenly picked up):

scripts/ingest_sbom.py — invoked via make ingest-sbom REPO=<slug>.
scripts/ingest_capabilities.py — invoked via make ingest-capabilities[-all].
scripts/check_doi.py — invoked via make check-doi[-all].
scripts/validate_repo_adr.py — invoked manually for canon promotion.
scripts/ingest_tpsc.py — invoked via make ingest-tpsc[-all].

These are not in scope for cron migration — they remain on-demand operator/CI commands. They become candidates only if we later decide to run them on a schedule.

2. Target ActivityDefinitions

A. `state-hub-consistency-sweep` (implemented)

Landed in the-custodian/activity-definitions/state-hub-consistency-sweep.md with enabled: true; live on Railiance01 since the 2026-06-21 cutover.

Invocation path (matches the hourly RecentlyOnScope pattern):

activity-core context query: consistency_sweep_remote_all
State Hub endpoint: POST /consistency/sweep/remote-all
payload: {"max_seconds": 300}
progress event: consistency_sweep_remote_all

State Hub runs scripts/consistency_check.py --remote --all --json on the workstation host. activity-core does not shell into the laptop repo checkout from the cluster.
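For reference, the call that activity-core makes can be sketched from the client side. This is a minimal illustration, not the resolver's actual code: the base URL is a hypothetical placeholder, the helper names are invented here, and the JSON response body is an assumption. Only the path, the payload, and the 300s/330s figures come from this document.

```python
import json
import urllib.request

# Assumption: hypothetical base URL; this document only names the path.
STATE_HUB_BASE_URL = "http://localhost:8000"


def build_sweep_request(base_url: str = STATE_HUB_BASE_URL,
                        max_seconds: int = 300) -> urllib.request.Request:
    """Build the POST request that asks State Hub to run the remote-all sweep."""
    body = json.dumps({"max_seconds": max_seconds}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/consistency/sweep/remote-all",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_sweep(base_url: str = STATE_HUB_BASE_URL,
                  max_seconds: int = 300,
                  timeout: float = 330.0) -> dict:
    """Send the request; the client timeout sits above the sweep budget.

    Assumption: the endpoint answers with a JSON object.
    """
    request = build_sweep_request(base_url, max_seconds)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the client timeout (330s) above the sweep budget (300s) lets the sweep finish and report its own result before the caller gives up.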

Operator runbook: docs/consistency-sweep-runbook.md.

Notes:

Replaced the custodian-sync.service + custodian-sync.timer pair (local timer disabled 2026-06-21; units archived under infra/systemd/archived/).
Lock semantics (/tmp/custodian-consistency-remote-all.lock) stay in the script — activity-core just sets the cadence.
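A self-locking script of this kind can be sketched as follows. How consistency_check.py implements its lock is not stated here, so the use of `fcntl.flock` (Linux/Unix only) and the names `sweep_lock` and `SweepAlreadyRunning` are assumptions for illustration; only the lock path comes from this document.

```python
import fcntl
import os
from contextlib import contextmanager

LOCK_PATH = "/tmp/custodian-consistency-remote-all.lock"


class SweepAlreadyRunning(RuntimeError):
    """Raised when another sweep already holds the lock (hypothetical name)."""


@contextmanager
def sweep_lock(path: str = LOCK_PATH):
    """Hold an exclusive, non-blocking advisory lock for the duration of a sweep."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes a concurrent run fail fast instead of queueing up.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise SweepAlreadyRunning(path) from exc
        yield
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

With a lock like this inside the script, an overlapping scheduled run exits early regardless of which scheduler fired it, which is why the cadence can move to activity-core without changing the script.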

B. `state-hub-stale-task-cleanup`

# activity-definitions/state-hub-stale-task-cleanup.yaml
id: the-custodian.state-hub-stale-task-cleanup
description: |
  Daily sweep that cancels tasks still `wait|todo|progress` inside
  finished or archived workstreams. Each cancellation also emits
  org.statehub.task.stale on NATS for downstream reaction.
trigger:
  trigger_type: cron
  cron_expression: "0 3 * * *"
  timezone: UTC
instruction:
  kind: shell
  cmd: >-
    cd /home/worsch/state-hub &&
    .venv/bin/python scripts/cleanup_stale_tasks.py

Notes:

Replaces the documented daily cron run (cron example: 0 3 * * * …).
The script already emits NATS events (see CUST-WP-0040 T03), so downstream ActivityDefinitions can react per-task without a second pass.
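The selection rule in the description above can be sketched as a pure function. This is not the code of scripts/cleanup_stale_tasks.py: the `Task` shape, the `cancelled` status name, and the event dictionary layout are hypothetical. The open statuses, the workstream states, and the event type come from this document.

```python
from dataclasses import dataclass, replace

OPEN_TASK_STATUSES = {"wait", "todo", "progress"}
CLOSED_WORKSTREAM_STATUSES = {"finished", "archived"}
STALE_EVENT_TYPE = "org.statehub.task.stale"


@dataclass(frozen=True)
class Task:
    # Hypothetical minimal shape; the real schema is not shown in this document.
    task_id: str
    status: str
    workstream_status: str


def is_stale(task: Task) -> bool:
    """A task is stale when it is still open inside a closed workstream."""
    return (task.status in OPEN_TASK_STATUSES
            and task.workstream_status in CLOSED_WORKSTREAM_STATUSES)


def cancel_stale_tasks(tasks: list[Task]) -> tuple[list[Task], list[dict]]:
    """Return the updated tasks plus one event per cancellation."""
    updated, events = [], []
    for task in tasks:
        if is_stale(task):
            # Assumption: "cancelled" is the terminal status name.
            updated.append(replace(task, status="cancelled"))
            events.append({"type": STALE_EVENT_TYPE, "task_id": task.task_id})
        else:
            updated.append(task)
    return updated, events
```

Running `cancel_stale_tasks` a second time on its own output cancels nothing and emits no events, which is the idempotency property a parallel cron and ActivityDefinition run relies on.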

C. Per-commit consistency sync (currently a git hook)

The git post-commit hook installed by state-hub/scripts/install_hooks.sh is event-driven, not cron-based. Migrating it to activity-core would require a repo.commit.pushed event channel that doesn't exist yet.

Recommendation: keep the git hook as-is for now. Revisit once an event source (e.g. Gitea webhook fed into NATS) is available, at which point an event-triggered ActivityDefinition can replace it cleanly:

trigger:
  trigger_type: event
  event_type: org.repo.commit.pushed
  filters:
    repo_slug: "*"

3. Required context queries

Implemented for A:

consistency_sweep_remote_all — POST /consistency/sweep/remote-all with a 330s resolver timeout (sweep budget default 300s).

Still optional for B and future splits:

state-hub.health — GET /state/health → {status, db, ...}
(optional) state-hub.repos — GET /repos/?status=active for per-repo ActivityDefinitions if the monolithic sweep is split later.

4. Blockers / sequencing

| Blocker | Owner | Where it lands |
| --- | --- | --- |
| activity-core ActivityDefinition file ingestion + cron executor (WP-0003) | activity-core | activity-core/`src/...` |
| activity-core shell instruction kind with on_failure semantics | activity-core | activity-core/`src/...` |
| state-hub adapter exposing `state-hub.health` as a context source | activity-core | activity-core/adapters/ |

A is cut over. Until B lands, stale-task cleanup remains on-demand via make cleanup-stale (or a manual daily cron).

5. Cutover plan (when ready)

1. Land ActivityDefinitions A + B in activity-core.
2. Enable them in staging; verify they fire on schedule and produce the same DB / NATS effects as the current cron entries.
3. Run both in parallel for one week (cron + ActivityDefinition). The scripts are idempotent, so duplicate runs are no-ops on a clean state.
4. ~~Disable the systemd timer~~ (done 2026-06-21, STATE-WP-0064).
5. Remove the cleanup-stale cron entry from crontab -e (when B is enabled).
6. ~~Update infra/README.md and archive systemd unit files~~ (done).
7. Per-commit hook stays until a repo.commit.pushed event exists.

6. Open questions

Locking: should activity-core wrap shell instructions with a process lock (today the script self-locks via /tmp/...)? If yes, the state-hub script's lock can be removed.
Failure surfacing: today systemd journals capture stderr. Where does an ActivityDefinition's shell stderr go (logs, activity history)? This needs activity-core docs before cutover.
Per-repo split: do we split A into one ActivityDefinition per registered repo (so failures don't poison the sweep), or keep the monolithic --all mode? The latter is simpler and matches today's behaviour; the former gives better observability.
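On the per-repo question, a possible middle ground is to keep the monolithic --all run but isolate each repo's failure inside the loop. The sketch below only illustrates that idea; `sweep_all` and the `check_repo` callable are hypothetical names and do not describe how consistency_check.py behaves today.

```python
from typing import Callable, Iterable


def sweep_all(repo_slugs: Iterable[str],
              check_repo: Callable[[str], list[str]]) -> dict:
    """Run check_repo for every slug; one failing repo does not stop the rest."""
    results: dict[str, list[str]] = {}
    errors: dict[str, str] = {}
    for slug in repo_slugs:
        try:
            results[slug] = check_repo(slug)
        except Exception as exc:
            # Record the error per repo and continue with the next one.
            errors[slug] = f"{type(exc).__name__}: {exc}"
    return {"results": results, "errors": errors}
```

This keeps a single schedule and a single lock while still reporting which repo failed. It does not give per-repo activity history, which is what a real split into separate ActivityDefinitions would add.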
