state-hub/docs/cron-migration.md
tegwick 39ed5459b9 finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures
STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine — automation error vs assessment failure:
- C-00 is an automation error; C-01..C-23 assessment failures are recorded
  for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and
  assessment_failures separately from exit_code
2026-06-22 01:20:59 +02:00


# State Hub Cron → activity-core ActivityDefinition Migration
> CUST-WP-0040 T04. **Consistency sweep cut over** as of `STATE-WP-0064`
> (2026-06-21). Scheduling is on activity-core (Railiance01); the local
> `custodian-sync.timer` is retired. Stale-task cleanup (B) is still pending.
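
The state hub has two recurring maintenance jobs and one per-repo event
hook. Each recurring job moves to an ActivityDefinition file checked into
the appropriate repo: the consistency sweep (A) has already moved, and
stale-task cleanup (B) follows once activity-core is ready. The per-commit
hook stays a git hook for now (section 2C). The state hub keeps the
underlying scripts; only the *scheduling* moves.

---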
## 1. Inventory of current maintenance automations
| # | Source | Trigger today | Script invoked | What it does |
| - | ------------------- | -------------------------------------------------------- | -------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| 1 | activity-core cron  | every 15 min (Railiance01)                               | `POST /consistency/sweep/remote-all` → `consistency_check.py --remote --all` | Pull every registered repo, reconcile workplan files ↔ DB, run C-15 writeback + C-16 pull gate |
| 2 | manual / daily cron | `make cleanup-stale` (suggested `0 3 * * *`) | `scripts/cleanup_stale_tasks.py` | Cancel tasks still open in finished/archived workstreams; emits `org.statehub.task.stale` |
| 3 | git post-commit     | every commit in a registered repo                        | `make fix-consistency REPO=<slug>`                       | Per-repo workplan ↔ DB sync immediately after a commit                                             |

Honourable mentions (not currently scheduled, on-demand only; listed for
completeness so they don't get picked up by mistake):
- `scripts/ingest_sbom.py`: invoked via `make ingest-sbom REPO=<slug>`.
- `scripts/ingest_capabilities.py`: invoked via `make ingest-capabilities[-all]`.
- `scripts/check_doi.py`: invoked via `make check-doi[-all]`.
- `scripts/validate_repo_adr.py`: invoked manually for canon promotion.
- `scripts/ingest_tpsc.py`: invoked via `make ingest-tpsc[-all]`.

These are **not in scope** for cron migration; they remain on-demand
operator/CI commands. They become candidates only if we later decide to
run them on a schedule.

---
## 2. Target ActivityDefinitions
### A. `state-hub-consistency-sweep` (implemented)
Landed in `the-custodian/activity-definitions/state-hub-consistency-sweep.md`
with `enabled: true` on Railiance01 since the 2026-06-21 cutover.

Invocation path (matches the hourly RecentlyOnScope pattern):

- activity-core context query: `consistency_sweep_remote_all`
- State Hub endpoint: `POST /consistency/sweep/remote-all`
- payload: `{"max_seconds": 300}`
- progress event: `consistency_sweep_remote_all`

State Hub runs `scripts/consistency_check.py --remote --all --json` on the
workstation host. activity-core does **not** shell into the laptop repo
checkout from the cluster.

Operator runbook: [`docs/consistency-sweep-runbook.md`](consistency-sweep-runbook.md).
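
To make the invocation path above concrete, here is a minimal Python sketch of the request a client could send. The endpoint path and payload come from this document, and the 330-second client timeout mirrors the resolver timeout noted in section 3. The base URL, the function names, and the assumption that the endpoint replies with a JSON object are illustrative only and are not confirmed by this document.

```python
import json
import urllib.request

SWEEP_PATH = "/consistency/sweep/remote-all"  # endpoint named in this document
SWEEP_BUDGET_SECONDS = 300                    # payload value from this document
RESOLVER_TIMEOUT_SECONDS = 330                # client-side timeout from section 3


def build_sweep_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST that starts a remote-all sweep."""
    body = json.dumps({"max_seconds": SWEEP_BUDGET_SECONDS}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + SWEEP_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sweep(base_url: str) -> dict:
    """Send the request; assumes (not confirmed here) a JSON object reply."""
    req = build_sweep_request(base_url)
    # The client waits slightly longer than the sweep budget so the server
    # can finish and answer before the caller gives up.
    with urllib.request.urlopen(req, timeout=RESOLVER_TIMEOUT_SECONDS) as resp:
        return json.loads(resp.read().decode("utf-8"))
```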
Notes:
- Replaced the `custodian-sync.service` + `custodian-sync.timer` pair
(local timer disabled 2026-06-21; units archived under `infra/systemd/archived/`).
- Lock semantics (`/tmp/custodian-consistency-remote-all.lock`) stay in
the script — activity-core just sets the cadence.
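
The script's locking code is not reproduced in this document. As a rough sketch of how a self-lock on that path could work, assuming an advisory `flock` (an assumption; the real script may use a different mechanism), something like the following would let overlapping runs bail out early. `sweep_lock` and `SweepAlreadyRunning` are hypothetical names, and `fcntl` is POSIX-only.

```python
import fcntl
import os
from contextlib import contextmanager

LOCK_PATH = "/tmp/custodian-consistency-remote-all.lock"  # path from this document


class SweepAlreadyRunning(RuntimeError):
    """Raised when another holder already has the lock (hypothetical name)."""


@contextmanager
def sweep_lock(path: str = LOCK_PATH):
    """Hold an exclusive, non-blocking advisory lock for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise SweepAlreadyRunning(path) from exc
        yield fd
    finally:
        os.close(fd)  # closing the descriptor also releases the flock
```

With a guard like this, a scheduler that fires while a previous sweep is still running gets a fast, explicit failure rather than two sweeps racing each other, which is why the cadence can live in activity-core while mutual exclusion stays in the script.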
### B. `state-hub-stale-task-cleanup`
```yaml
# activity-definitions/state-hub-stale-task-cleanup.yaml
id: the-custodian.state-hub-stale-task-cleanup
description: |
  Daily sweep that cancels tasks still `wait|todo|progress` inside
  finished or archived workstreams. Each cancellation also emits
  org.statehub.task.stale on NATS for downstream reaction.
trigger:
  trigger_type: cron
  cron_expression: "0 3 * * *"
  timezone: UTC
instruction:
  kind: shell
  cmd: >-
    cd /home/worsch/state-hub &&
    .venv/bin/python scripts/cleanup_stale_tasks.py
```
Notes:
- Replaces the documented (`Cron example: 0 3 * * * …`) daily run.
- The script already emits NATS events (see CUST-WP-0040 T03), so
downstream ActivityDefinitions can react per-task without a second pass.
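
The selection rule in the description can be sketched in a few lines of Python. This is not the code in `scripts/cleanup_stale_tasks.py`, which works against the database. The `Task` record, its field names, and `select_stale` are hypothetical and only illustrate the rule "task still open, workstream already closed".

```python
from dataclasses import dataclass

OPEN_TASK_STATUSES = {"wait", "todo", "progress"}       # from the description
CLOSED_WORKSTREAM_STATUSES = {"finished", "archived"}   # from the description


@dataclass(frozen=True)
class Task:
    # Hypothetical record shape; the real schema is not shown in this document.
    task_id: str
    status: str
    workstream_status: str


def select_stale(tasks):
    """Return tasks that are still open inside a finished or archived workstream."""
    return [
        t for t in tasks
        if t.status in OPEN_TASK_STATUSES
        and t.workstream_status in CLOSED_WORKSTREAM_STATUSES
    ]
```

Once the selected tasks are cancelled they no longer match the open statuses, so a second pass over the same data selects nothing. That is the property the cutover plan relies on when it runs cron and the ActivityDefinition in parallel.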
### C. Per-commit consistency sync (currently a git hook)
The git `post-commit` hook installed by `state-hub/scripts/install_hooks.sh`
is **event-driven, not cron-based**. Migrating it to activity-core would
require a `repo.commit.pushed` event channel that doesn't exist yet.
Recommendation: **keep the git hook as-is for now**. Revisit once an
event source (e.g. Gitea webhook fed into NATS) is available, at which
point an event-triggered ActivityDefinition can replace it cleanly:
```yaml
trigger:
  trigger_type: event
  event_type: org.repo.commit.pushed
  filters:
    repo_slug: "*"
```
---
## 3. Required context queries
Implemented for A:

- `consistency_sweep_remote_all` → `POST /consistency/sweep/remote-all`
  with a 330s resolver timeout (sweep budget default 300s).

Still optional for B and future splits:

- `state-hub.health` → `GET /state/health` → `{status, db, ...}`
- (optional) `state-hub.repos` → `GET /repos/?status=active` for per-repo
  ActivityDefinitions if the monolithic sweep is split later.
---
## 4. Blockers / sequencing
| Blocker | Owner | Where it lands |
| ------------------------------------------------------------------------- | -------------- | -------------------------- |
| activity-core ActivityDefinition file ingestion + cron executor (WP-0003) | activity-core | activity-core/`src/...` |
| activity-core shell instruction kind with on_failure semantics | activity-core | activity-core/`src/...` |
| state-hub adapter exposing `state-hub.health` as a context source         | activity-core  | activity-core/adapters/    |

A is cut over. Until B lands, stale-task cleanup remains on-demand via
`make cleanup-stale` (or a manual daily cron).

---
## 5. Cutover plan (when ready)
1. Land ActivityDefinitions A + B in activity-core.
2. Enable them in staging; verify they fire on schedule and produce the
same DB / NATS effects as the current cron entries.
3. Run both in parallel for one week (cron + ActivityDefinition). The
scripts are idempotent — duplicate runs are no-ops on a clean state.
4. ~~Disable the systemd timer~~ **done** 2026-06-21 (`STATE-WP-0064`).
5. Remove the cleanup-stale cron entry from `crontab -e` (when B is enabled).
6. ~~Update `infra/README.md` and archive systemd unit files~~ **done**.
7. Per-commit hook stays until a `repo.commit.pushed` event exists.
---
## 6. Open questions
- **Locking**: should activity-core wrap shell instructions with a
process lock (today the script self-locks via `/tmp/...`)? If yes, the
state-hub script's lock can be removed.
- **Failure surfacing**: today systemd journals capture stderr. Where
  does an ActivityDefinition's shell stderr go (logs, activity history)?
  This needs activity-core docs before cutover.
- **Per-repo split**: do we split A into one ActivityDefinition per
registered repo (so failures don't poison the sweep), or keep the
monolithic `--all` mode? The latter is simpler and matches today's
behaviour; the former gives better observability.