the-custodian/state-hub/docs/cron-migration.md
tegwick ca8a09ed04 feat(state-hub): CUST-WP-0040 — NATS lifecycle event publishing for activity-core
Makes the state hub an event publisher so activity-core can drive
maintenance automation declaratively via ActivityDefinitions, rather
than the hub creating tasks itself.

- api/events/: lazy JetStream publisher + EventEnvelope mirroring
  activity-core's contract; no-op when NATS_URL unset, fire-and-forget
  with logged failures so publishing never breaks an API request.
- Wired publishers on the five v1.0 lifecycle events:
    org.statehub.repo.registered        (POST /repos/)
    org.statehub.workstream.completed   (PATCH /workstreams/* on transition)
    org.statehub.decision.resolved      (POST /decisions/*/resolve)
    org.statehub.domain.goal.activated  (POST /domain-goals/*/activate)
    org.statehub.task.stale             (scripts/cleanup_stale_tasks.py)
- docs/nats-event-subjects.md: subject naming convention + catalog.
- docs/cron-migration.md: design stub for replacing custodian-sync
  systemd timer and cleanup-stale cron with ActivityDefinitions
  (depends on activity-core WP-0003).
- docs/activity-core-delegation.md: protocol, invariants, cutover plan.
- SCOPE.md: declares activity-core as downstream event consumer and
  restates that the state hub stays a read model, not a task factory.

Workplan: workplans/CUST-WP-0040-state-hub-nats-activity-core-integration.md
242 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 05:49:29 +02:00

State Hub Cron → activity-core ActivityDefinition Migration (Design Stub)

CUST-WP-0040 T04. Design stub — not yet implemented. Migration depends on activity-core WP-0003 reaching the "ActivityDefinition file ingestion + cron trigger executor" milestone.

The state hub currently runs two recurring maintenance jobs and one per-repo event hook. Once activity-core is ready, each becomes an ActivityDefinition file checked into the appropriate repo. The state hub keeps the underlying scripts; only the scheduling moves.


1. Inventory of current maintenance automations

| # | Source | Trigger today | Script invoked | What it does |
|---|--------|---------------|----------------|--------------|
| 1 | systemd user timer | every 15 min | scripts/consistency_check.py --remote --all | Pull every registered repo, reconcile workplan files ↔ DB, run C-15 writeback + C-16 pull gate |
| 2 | manual / daily cron | make cleanup-stale (suggested 0 3 * * *) | scripts/cleanup_stale_tasks.py | Cancel tasks still open in completed/archived workstreams; emits org.statehub.task.stale |
| 3 | git post-commit | every commit in a registered repo | make fix-consistency REPO=<slug> | Per-repo workplan ↔ DB sync immediately after a commit |

Honourable mentions (not currently scheduled, on-demand only — listed for completeness so they don't get mistakenly picked up):

  • scripts/ingest_sbom.py — invoked via make ingest-sbom REPO=<slug>.
  • scripts/ingest_capabilities.py — invoked via make ingest-capabilities[-all].
  • scripts/check_doi.py — invoked via make check-doi[-all].
  • scripts/validate_repo_adr.py — invoked manually for canon promotion.
  • scripts/ingest_tpsc.py — invoked via make ingest-tpsc[-all].

These are not in scope for cron migration — they remain on-demand operator/CI commands. They become candidates only if we later decide to run them on a schedule.


2. Target ActivityDefinitions

A. state-hub-consistency-sweep

# activity-definitions/the-custodian/state-hub-consistency-sweep.yaml
id: the-custodian.state-hub-consistency-sweep
description: |
  Sweep all registered repos: pull, reconcile workplan files ↔ DB,
  apply writeback (C-15), respect pull gate (C-16). Mirrors the
  existing custodian-sync systemd timer.
trigger:
  trigger_type: cron
  cron_expression: "*/15 * * * *"
  timezone: UTC
  misfire_policy: skip            # if a prior run is still active, skip
context:
  - kind: http_get                # confirm state-hub API is reachable
    url: http://127.0.0.1:8000/state/health
    bind: hub_health
rule:
  when:
    - "hub_health.status == 'ok'"
instruction:
  kind: shell
  cmd: >-
    cd /home/worsch/the-custodian/state-hub &&
    .venv/bin/python scripts/consistency_check.py --remote --all --max-seconds 300
  on_failure: log_and_continue    # warn-only sweeps must not page on transient failures

Notes:

  • Replaces the custodian-sync.service + custodian-sync.timer pair.
  • Lock semantics (/tmp/custodian-consistency-remote-all.lock) stay in the script — activity-core just sets the cadence.
  • Once active, infra/README.md is updated to instruct users to delete the systemd timer.
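
The self-locking mentioned in the notes above could look roughly like the following. This is a minimal sketch, assuming an advisory flock-style lock on a POSIX host; the actual mechanism inside consistency_check.py is not shown in this document, and single_instance is a hypothetical helper name. Only the lock path comes from the note above.

```python
import fcntl
import os
from contextlib import contextmanager

# Lock path taken from the note above; the helper itself is hypothetical.
LOCK_PATH = "/tmp/custodian-consistency-remote-all.lock"


@contextmanager
def single_instance(lock_path: str = LOCK_PATH):
    """Yield True if this run acquired the lock, False if another run holds it."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking exclusive lock: fails immediately if already held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A caller would wrap the sweep in `with single_instance() as acquired:` and return early when `acquired` is False. With a guard of this kind in the script, an overlapping trigger from activity-core degrades to a no-op regardless of how misfire_policy is implemented.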

B. state-hub-stale-task-cleanup

# activity-definitions/the-custodian/state-hub-stale-task-cleanup.yaml
id: the-custodian.state-hub-stale-task-cleanup
description: |
  Daily sweep that cancels tasks still 'todo|in_progress|blocked' inside
  completed or archived workstreams. Each cancellation also emits
  org.statehub.task.stale on NATS for downstream reaction.
trigger:
  trigger_type: cron
  cron_expression: "0 3 * * *"
  timezone: UTC
instruction:
  kind: shell
  cmd: >-
    cd /home/worsch/the-custodian/state-hub &&
    .venv/bin/python scripts/cleanup_stale_tasks.py

Notes:

  • Replaces the daily run currently documented only as a cron example (0 3 * * * …).
  • The script already emits NATS events (see CUST-WP-0040 T03), so downstream ActivityDefinitions can react per-task without a second pass.
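
The selection rule described for this job can be sketched as below. This is an illustration under stated assumptions, not the code in cleanup_stale_tasks.py: the Task shape, the "cancelled" status value, and the event payload fields are hypothetical. Only the open statuses, the workstream states, and the subject name come from this document.

```python
from dataclasses import dataclass

OPEN_STATUSES = {"todo", "in_progress", "blocked"}
CLOSED_WORKSTREAM_STATUSES = {"completed", "archived"}
STALE_SUBJECT = "org.statehub.task.stale"


@dataclass
class Task:
    # Hypothetical in-memory stand-in for a task row.
    id: str
    status: str
    workstream_status: str


def cancel_stale(tasks: list[Task]) -> list[dict]:
    """Cancel stale tasks in place and return one event per cancellation."""
    events = []
    for task in tasks:
        is_open = task.status in OPEN_STATUSES
        in_closed_ws = task.workstream_status in CLOSED_WORKSTREAM_STATUSES
        if is_open and in_closed_ws:
            task.status = "cancelled"  # assumed status name
            events.append({"subject": STALE_SUBJECT, "task_id": task.id})
    return events
```

Because cancelled tasks no longer match the open statuses, a second pass over the same data returns no events. That is the property the parallel-run step of the cutover relies on.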

C. Per-commit consistency sync (currently a git hook)

The git post-commit hook installed by state-hub/scripts/install_hooks.sh is event-driven, not cron-based. Migrating it to activity-core would require a repo.commit.pushed event channel that doesn't exist yet.

Recommendation: keep the git hook as-is for now. Revisit once an event source (e.g. Gitea webhook fed into NATS) is available, at which point an event-triggered ActivityDefinition can replace it cleanly:

trigger:
  trigger_type: event
  event_type: org.repo.commit.pushed
  filters:
    repo_slug: "*"
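
How such a trigger might be matched against an incoming event is sketched below. The matching semantics are an assumption: this document does not specify how activity-core evaluates filters, so glob-style wildcards and the flat event shape used here are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(trigger: dict, event: dict) -> bool:
    """Return True if the event satisfies the trigger's type and filters."""
    if event.get("event_type") != trigger["event_type"]:
        return False
    for key, pattern in trigger.get("filters", {}).items():
        value = event.get(key)
        # A filtered field must be present and match the glob pattern.
        if value is None or not fnmatchcase(str(value), pattern):
            return False
    return True
```

Under these assumed semantics, repo_slug: "*" accepts any event that carries a repo_slug and rejects events that lack the field.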

3. Required context queries

Both A and B should confirm the state hub is reachable before running; the draft definitions above wire this check into A only. A reusable context source should be added to activity-core for this:

  • state-hub.health → GET /state/health → {status, db, ...}
  • (optional) state-hub.repos → GET /repos/?status=active, for the sweep's per-repo branching if we later split A into one ActivityDefinition per repo.

These belong to the state-hub adapter referenced in the workplan's out-of-scope note ("/sbom/status context query endpoint" etc.).
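
A health context source of this kind might be sketched as follows. This is not activity-core's adapter API, which is not described here: fetch_health, rule_passes, the injectable opener, and the "unreachable" fallback value are all hypothetical. The URL and the status == 'ok' rule come from definition A.

```python
import json
from urllib.request import urlopen

HEALTH_URL = "http://127.0.0.1:8000/state/health"  # from definition A


def fetch_health(url: str = HEALTH_URL, opener=urlopen, timeout: float = 5.0) -> dict:
    """Return the health payload, or a fallback dict if the hub is unreachable."""
    try:
        with opener(url, timeout=timeout) as resp:
            data = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection errors, timeouts, and malformed JSON all land here.
        return {"status": "unreachable"}  # assumed fallback value
    return data if isinstance(data, dict) else {"status": "unreachable"}


def rule_passes(hub_health: dict) -> bool:
    """Mirror of the rule in definition A: hub_health.status == 'ok'."""
    return hub_health.get("status") == "ok"
```

Treating an unreachable hub as a failed rule rather than an error means the sweep is skipped quietly, which fits the warn-only intent of on_failure: log_and_continue.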


4. Blockers / sequencing

| Blocker | Owner | Where it lands |
|---------|-------|----------------|
| activity-core ActivityDefinition file ingestion + cron executor (WP-0003) | activity-core | activity-core/src/... |
| activity-core shell instruction kind with on_failure semantics | activity-core | activity-core/src/... |
| state-hub adapter exposing state-hub.health as a context source | activity-core | activity-core/adapters/ |

Until these land, the state hub continues to schedule jobs via systemd timer + cron entries.


5. Cutover plan (when ready)

  1. Land ActivityDefinitions A + B in activity-core.
  2. Enable them in staging; verify they fire on schedule and produce the same DB / NATS effects as the current cron entries.
  3. Run both in parallel for one week (cron + ActivityDefinition). The scripts are idempotent — duplicate runs are no-ops on a clean state.
  4. Disable the systemd timer: systemctl --user disable --now custodian-sync.timer
  5. Remove the cleanup-stale cron entry from crontab -e.
  6. Update infra/README.md to point at the ActivityDefinitions and archive the systemd unit files.
  7. Per-commit hook stays until a repo.commit.pushed event exists.

6. Open questions

  • Locking: should activity-core wrap shell instructions with a process lock (today the script self-locks via /tmp/...)? If yes, the state-hub script's lock can be removed.
  • Failure surfacing: today systemd journals capture stderr. Where does an ActivityDefinition's shell stderr go (logs? activity history?)? This needs activity-core docs before cutover.
  • Per-repo split: do we split A into one ActivityDefinition per registered repo (so failures don't poison the sweep), or keep the monolithic --all mode? The latter is simpler and matches today's behaviour; the former gives better observability.
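
For the per-repo question, one middle ground is to keep a single definition but isolate failures inside the loop. The sketch below illustrates that idea only; whether consistency_check.py --all already behaves this way is not stated in this document, and sweep and check_repo are hypothetical names.

```python
from typing import Callable, Iterable


def sweep(repos: Iterable[str], check_repo: Callable[[str], None]) -> dict[str, str]:
    """Run check_repo for each slug; record failures instead of aborting."""
    results: dict[str, str] = {}
    for slug in repos:
        try:
            check_repo(slug)
            results[slug] = "ok"
        except Exception as exc:  # isolate one repo's failure from the rest
            results[slug] = f"failed: {exc}"
    return results
```

The per-slug result map gives some of the observability of a per-repo split without multiplying ActivityDefinitions, at the cost of still sharing one schedule and one timeout budget.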