feat(state-hub): CUST-WP-0040 — NATS lifecycle event publishing for activity-core

Makes the state hub an event publisher so activity-core can drive maintenance automation declaratively via ActivityDefinitions, rather than the hub creating tasks itself.

- api/events/: lazy JetStream publisher + EventEnvelope mirroring activity-core's contract; no-op when NATS_URL unset, fire-and-forget with logged failures so publishing never breaks an API request.
- Wired publishers on the five v1.0 lifecycle events:
    org.statehub.repo.registered (POST /repos/)
    org.statehub.workstream.completed (PATCH /workstreams/* on transition)
    org.statehub.decision.resolved (POST /decisions/*/resolve)
    org.statehub.domain.goal.activated (POST /domain-goals/*/activate)
    org.statehub.task.stale (scripts/cleanup_stale_tasks.py)
- docs/nats-event-subjects.md: subject naming convention + catalog.
- docs/cron-migration.md: design stub for replacing custodian-sync systemd timer and cleanup-stale cron with ActivityDefinitions (depends on activity-core WP-0003).
- docs/activity-core-delegation.md: protocol, invariants, cutover plan.
- SCOPE.md: declares activity-core as downstream event consumer and restates that the state hub stays a read model, not a task factory.

Workplan: workplans/CUST-WP-0040-state-hub-nats-activity-core-integration.md

242 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 05:49:29 +02:00
parent 2bc7fd8ce7
commit ca8a09ed04
16 changed files with 770 additions and 9 deletions
--- a/state-hub/docs/activity-core-delegation.md
+++ b/state-hub/docs/activity-core-delegation.md
@@ -0,0 +1,151 @@
+# State Hub → activity-core Delegation Protocol
+
+> CUST-WP-0040 T05. Cross-reference:
+> [`docs/nats-event-subjects.md`](nats-event-subjects.md),
+> [`docs/cron-migration.md`](cron-migration.md), and activity-core's
+> `docs/adr/adr-001-event-bridge-architecture.md`.
+
+## TL;DR
+
+The state hub is a **read model** for cross-domain state. It is not a
+task factory. Maintenance automations that *create new work in response
+to state transitions* belong in activity-core as `ActivityDefinition`
+files. The state hub's only job in that flow is to **publish lifecycle
+events** on NATS JetStream so activity-core can react.
+
+```
+                                          NATS JetStream
+                                          subject: org.statehub.>
+                                          stream:  ACTIVITY_EVENTS
+                                          ┌──────────────────────┐
+   POST /repos/                            │                      │
+   PATCH /workstreams/*  ─────publish───▶  │                      │ ───consume───▶  activity-core
+   POST /decisions/*/resolve               │                      │                 EventRouter
+   POST /domain-goals/*/activate           │                      │                       │
+   scripts/cleanup_stale_tasks.py          │                      │                       ▼
+                                           └──────────────────────┘                 RunActivityWorkflow
+   the-custodian/state-hub                                                          (creates tasks in
+                                                                                     issue-core, etc.)
+```
+
+## Why delegate?
+
+| Concern                                  | Lives in the state hub today  | Lives in activity-core after migration                      |
+| ---------------------------------------- | ----------------------------- | ----------------------------------------------------------- |
+| "When should this maintenance run?"      | cron/systemd timers           | `ActivityDefinition.trigger` (cron + event triggers)        |
+| "What rule decides whether to act?"      | hard-coded in the script      | `ActivityDefinition.rule.when` expressions                  |
+| "What task / side-effect should we run?" | hard-coded in the script      | `ActivityDefinition.instruction` (shell / workflow / etc.)  |
+| "Where do we audit what fired?"          | journalctl + ad hoc logs      | activity-core history + Temporal workflow runs              |
+| "How is it changed safely?"              | edit Python + redeploy hub    | edit YAML in the repo, PR-reviewable, hot-reloadable        |
+
+Concentrating maintenance logic in declarative `ActivityDefinition`
+files makes the rules **auditable**, **testable**, and **modifiable
+without redeploying the state hub**.
+
+## Published lifecycle events (v1.0)
+
+Authoritative list and attributes live in
+[`docs/nats-event-subjects.md`](nats-event-subjects.md). At v1.0 the
+state hub publishes:
+
+| Subject                              | Trigger site (file:fn)                                          |
+| ------------------------------------ | --------------------------------------------------------------- |
+| `org.statehub.repo.registered`       | `api/routers/repos.py:register_repo`                            |
+| `org.statehub.workstream.completed`  | `api/routers/workstreams.py:update_workstream` (on transition)  |
+| `org.statehub.decision.resolved`     | `api/routers/decisions.py:resolve_decision_action`              |
+| `org.statehub.domain.goal.activated` | `api/routers/domain_goals.py:activate_domain_goal`              |
+| `org.statehub.task.stale`            | `scripts/cleanup_stale_tasks.py` (per cancelled task)           |
+
+All events use the shared `EventEnvelope` schema (`api/events/envelope.py`)
+and are published via `publish_event(subject, envelope)`. Publishing is
+fire-and-forget: failures are logged but **never break the API request
+that triggered them**, and the publisher no-ops when `NATS_URL` is
+unset.
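A minimal sketch of the fire-and-forget behaviour described above, not the actual `api/events/` code. The `transport` parameter is a hypothetical stand-in for the JetStream publish call, and the plain-dict envelope replaces the real `EventEnvelope`; both are assumptions made so the example runs without a NATS server.

```python
import asyncio
import json
import logging
import os
from typing import Awaitable, Callable, Optional

logger = logging.getLogger("statehub.events")

# Hypothetical transport hook: an async callable that delivers (subject, payload).
# In a real publisher this would wrap a JetStream publish call.
Transport = Callable[[str, bytes], Awaitable[None]]


async def publish_event(
    subject: str,
    envelope: dict,
    transport: Optional[Transport] = None,
) -> bool:
    """Try to publish; return True on delivery, False otherwise. Never raises."""
    # No-op when NATS is not configured (dev environments, test runs).
    if not os.environ.get("NATS_URL"):
        logger.debug("NATS_URL unset; skipping publish of %s", subject)
        return False
    if transport is None:
        logger.warning("no transport available; skipping publish of %s", subject)
        return False
    try:
        await transport(subject, json.dumps(envelope).encode("utf-8"))
        return True
    except Exception:
        # Log and swallow so the calling API request is never affected.
        logger.exception("publish of %s failed; continuing", subject)
        return False
```

The boolean return value is only there to make the three outcomes (skipped, failed, delivered) easy to observe; a caller in a router would normally ignore it.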
+
+## What stays in the state hub
+
+- DB schema + Alembic migrations
+- API endpoints (CRUD + status transitions + read-model queries)
+- MCP tools (read + sanctioned writes: `resolve_decision`,
+  `add_progress_event`, `get_next_steps`)
+- The consistency engine (`scripts/consistency_check.py`) — it owns
+  ADR-001 reconciliation between workplan files and the DB.
+- The `cleanup_stale_tasks.py` *script* (not its schedule) — it owns
+  the lifecycle rule for cancelling orphaned tasks.
+
+## What moves to activity-core
+
+- The *schedule* for the consistency sweep (`*/15 * * * *`) →
+  `the-custodian.state-hub-consistency-sweep` ActivityDefinition.
+- The *schedule* for stale-task cleanup (`0 3 * * *`) →
+  `the-custodian.state-hub-stale-task-cleanup` ActivityDefinition.
+- Any future "when X happens, create a task" logic. The state hub must
+  **not** add such rules to its routers — it publishes the event and
+  the rule lives in activity-core.
+
+See [`docs/cron-migration.md`](cron-migration.md) for the
+ActivityDefinition drafts and cutover plan.
+
+## What must never happen
+
+- **State hub writes directly to activity-core's DB.** All
+  communication is via NATS events.
+- **State hub creates issue-core / Temporal tasks itself.** That is
+  activity-core's job.
+- **Routers publish before committing.** Always publish after
+  `await session.commit()` succeeds. (Otherwise a transaction rollback
+  would still leak an event.)
+- **A publish failure breaks the API response.** The publisher logs and
+  swallows; lost events are recovered by activity-core re-reading state
+  on next sweep, not by the API retrying.
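The ordering rules above can be sketched as follows. This is an illustration under stated assumptions, not the router code: `FakeSession` and `complete_workstream` are hypothetical names, and `publish` is any async callable with a `(subject, attributes)` shape.

```python
import asyncio


class FakeSession:
    """Stand-in for an async DB session; commit() can be made to fail."""

    def __init__(self, fail_commit: bool = False):
        self.fail_commit = fail_commit
        self.committed = False

    async def commit(self) -> None:
        if self.fail_commit:
            raise RuntimeError("commit failed")
        self.committed = True


async def complete_workstream(session, publish, workstream_id: str) -> dict:
    # 1. Persist the transition first. If commit raises, nothing below
    #    runs, so a rolled-back transaction cannot leak an event.
    await session.commit()

    # 2. Publish only after the commit succeeded. A publish failure is
    #    swallowed here (real code would log it) so the response is unaffected.
    try:
        await publish(
            "org.statehub.workstream.completed",
            {"workstream_id": workstream_id},
        )
    except Exception:
        pass

    return {"id": workstream_id, "status": "completed"}
```

With a failing commit the handler raises before any event is sent; with a failing publisher the handler still returns its normal response.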
+
+## Operational checklist — migrating a cron to an ActivityDefinition
+
+1. Identify the cron's current side-effects. If any of them
+   *create work* (a task, an issue, a ticket), it is a delegation
+   candidate. Pure consistency reconciliation can stay as a shell-cron
+   for now if simpler.
+2. Decide the trigger: keep it as `cron`, or upgrade it to `event` by
+   first identifying / publishing the state hub lifecycle event the
+   cron is effectively polling for.
+3. Add a row to [`docs/nats-event-subjects.md`](nats-event-subjects.md)
+   if a new event type is being introduced.
+4. Wire `publish_event(...)` at the transition site in the appropriate
+   router. Verify with `nats sub 'org.statehub.>'`.
+5. Land the `ActivityDefinition` in activity-core; enable it in
+   staging.
+6. Run both old cron and new ActivityDefinition in parallel for one
+   week. Both side-effects must be idempotent for this to be safe — if
+   they aren't, fix that first.
+7. Disable the old cron / systemd timer, archive the unit files.
+8. Update [`SCOPE.md`](../../SCOPE.md) "Often used with" to mention the
+   activity-core handoff if a new event type was added.
+
+## Bootstrap and partial-availability behaviour
+
+- **No NATS configured (`NATS_URL` unset)**: publisher is a logged
+  no-op. The state hub remains fully functional. Useful for dev
+  environments and `make test`.
+- **NATS reachable but stream missing**: publisher creates the
+  `ACTIVITY_EVENTS` stream with subject filter `org.>` on first
+  publish, so the state hub can come up before activity-core. In
+  production both should target the same NATS cluster.
+- **activity-core down**: events queue in JetStream and are replayed
+  when the consumer reconnects. The state hub is unaffected.
+- **State hub down**: scheduled ActivityDefinitions in activity-core
+  still fire; ones that need `state-hub.health` context will skip
+  cleanly per their rule.
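The "stream missing" case can be illustrated with a small create-if-absent helper. This is a sketch only: `InMemoryJetStream`, `StreamNotFound`, and the method names `stream_info` / `add_stream` are assumptions modelled loosely on a JetStream management client, not the publisher's actual code.

```python
import asyncio


class StreamNotFound(Exception):
    """Raised by the stand-in client when a stream does not exist."""


class InMemoryJetStream:
    """Hypothetical stand-in for a JetStream management client."""

    def __init__(self):
        self.streams = {}

    async def stream_info(self, name: str) -> dict:
        if name not in self.streams:
            raise StreamNotFound(name)
        return self.streams[name]

    async def add_stream(self, name: str, subjects) -> dict:
        self.streams[name] = {"name": name, "subjects": list(subjects)}
        return self.streams[name]


async def ensure_stream(js, name: str = "ACTIVITY_EVENTS", subjects=("org.>",)) -> bool:
    """Create the stream if it is missing. Returns True when it was created."""
    try:
        await js.stream_info(name)
        return False  # already present, e.g. created by activity-core
    except StreamNotFound:
        await js.add_stream(name, subjects)
        return True
```

Calling `ensure_stream` twice against the same client creates the stream once and then leaves it alone, which is what lets either service start first.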
+
+## Verifying end-to-end
+
+```bash
+# Subscribe to lifecycle events
+nats sub 'org.statehub.>'
+
+# Trigger an event (in another terminal)
+curl -X POST http://127.0.0.1:8000/repos/<slug>/sync
+
+# Observe the envelope on the subscriber. Sample shape, here for a completed
+# workstream (`type` always equals the subject):
+# {"id":"...","type":"org.statehub.workstream.completed","version":"1.0",
+#  "timestamp":"...","publisher":"the-custodian/state-hub","attributes":{...}}
+```
--- a/state-hub/docs/cron-migration.md
+++ b/state-hub/docs/cron-migration.md
@@ -0,0 +1,175 @@
+# State Hub Cron → activity-core ActivityDefinition Migration (Design Stub)
+
+> CUST-WP-0040 T04. **Design stub — not yet implemented.**
+> Migration depends on activity-core WP-0003 reaching the
+> "ActivityDefinition file ingestion + cron trigger executor" milestone.
+
+The state hub currently runs two recurring maintenance jobs and one
+per-repo event hook. Once activity-core is ready, each becomes an
+ActivityDefinition file checked into the appropriate repo. The state hub
+keeps the underlying scripts; only the *scheduling* moves.
+
+---
+
+## 1. Inventory of current maintenance automations
+
+| # | Source              | Trigger today                                            | Script invoked                                           | What it does                                                                                       |
+| - | ------------------- | -------------------------------------------------------- | -------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
+| 1 | systemd user timer  | every 15 min                                             | `scripts/consistency_check.py --remote --all`            | Pull every registered repo, reconcile workplan files ↔ DB, run C-15 writeback + C-16 pull gate    |
+| 2 | manual / daily cron | `make cleanup-stale` (suggested `0 3 * * *`)             | `scripts/cleanup_stale_tasks.py`                         | Cancel tasks still open in completed/archived workstreams; emits `org.statehub.task.stale`        |
+| 3 | git post-commit     | every commit in a registered repo                        | `make fix-consistency REPO=<slug>`                       | Per-repo workplan ↔ DB sync immediately after a commit                                            |
+
+Honourable mentions (not currently scheduled, on-demand only — listed for
+completeness so they don't get mistakenly picked up):
+
+- `scripts/ingest_sbom.py` — invoked via `make ingest-sbom REPO=<slug>`.
+- `scripts/ingest_capabilities.py` — invoked via `make ingest-capabilities[-all]`.
+- `scripts/check_doi.py` — invoked via `make check-doi[-all]`.
+- `scripts/validate_repo_adr.py` — invoked manually for canon promotion.
+- `scripts/ingest_tpsc.py` — invoked via `make ingest-tpsc[-all]`.
+
+These are **not in scope** for cron migration — they remain on-demand
+operator/CI commands. They become candidates only if we later decide to
+run them on a schedule.
+
+---
+
+## 2. Target ActivityDefinitions
+
+### A. `state-hub-consistency-sweep`
+
+```yaml
+# activity-definitions/the-custodian/state-hub-consistency-sweep.yaml
+id: the-custodian.state-hub-consistency-sweep
+description: |
+  Sweep all registered repos: pull, reconcile workplan files ↔ DB,
+  apply writeback (C-15), respect pull gate (C-16). Mirrors the
+  existing custodian-sync systemd timer.
+trigger:
+  trigger_type: cron
+  cron_expression: "*/15 * * * *"
+  timezone: UTC
+  misfire_policy: skip            # if a prior run is still active, skip
+context:
+  - kind: http_get                # confirm state-hub API is reachable
+    url: http://127.0.0.1:8000/state/health
+    bind: hub_health
+rule:
+  when:
+    - "hub_health.status == 'ok'"
+instruction:
+  kind: shell
+  cmd: >-
+    cd /home/worsch/the-custodian/state-hub &&
+    .venv/bin/python scripts/consistency_check.py --remote --all --max-seconds 300
+  on_failure: log_and_continue    # warn-only sweeps must not page on transient failures
+```
+
+Notes:
+- Replaces the `custodian-sync.service` + `custodian-sync.timer` pair.
+- Lock semantics (`/tmp/custodian-consistency-remote-all.lock`) stay in
+  the script — activity-core just sets the cadence.
+- Once active, `infra/README.md` is updated to instruct users to delete
+  the systemd timer.
+
+### B. `state-hub-stale-task-cleanup`
+
+```yaml
+# activity-definitions/the-custodian/state-hub-stale-task-cleanup.yaml
+id: the-custodian.state-hub-stale-task-cleanup
+description: |
+  Daily sweep that cancels tasks still 'todo|in_progress|blocked' inside
+  completed or archived workstreams. Each cancellation also emits
+  org.statehub.task.stale on NATS for downstream reaction.
+trigger:
+  trigger_type: cron
+  cron_expression: "0 3 * * *"
+  timezone: UTC
+instruction:
+  kind: shell
+  cmd: >-
+    cd /home/worsch/the-custodian/state-hub &&
+    .venv/bin/python scripts/cleanup_stale_tasks.py
+```
+
+Notes:
+- Replaces the daily run documented as `Cron example: 0 3 * * * …`.
+- The script already emits NATS events (see CUST-WP-0040 T03), so
+  downstream ActivityDefinitions can react per-task without a second pass.
+
+### C. Per-commit consistency sync (currently a git hook)
+
+The git `post-commit` hook installed by `state-hub/scripts/install_hooks.sh`
+is **event-driven, not cron-based**. Migrating it to activity-core would
+require a `repo.commit.pushed` event channel that doesn't exist yet.
+
+Recommendation: **keep the git hook as-is for now**. Revisit once an
+event source (e.g. Gitea webhook fed into NATS) is available, at which
+point an event-triggered ActivityDefinition can replace it cleanly:
+
+```yaml
+trigger:
+  trigger_type: event
+  event_type: org.repo.commit.pushed
+  filters:
+    repo_slug: "*"
+```
+
+---
+
+## 3. Required context queries
+
+Both A and B want to confirm the state hub is reachable before running
+(draft A already does so via an inline `http_get` context; draft B does
+not declare one yet). A reusable context source should be added to
+activity-core for this:
+
+- `state-hub.health` — `GET /state/health` → `{status, db, ...}`
+- (optional) `state-hub.repos` — `GET /repos/?status=active` for the
+  sweep's per-repo branching, if we later split A into one
+  ActivityDefinition per repo.
+
+These belong to the state-hub adapter referenced in the workplan's
+out-of-scope note ("/sbom/status context query endpoint" etc.).
+
+---
+
+## 4. Blockers / sequencing
+
+| Blocker                                                                   | Owner          | Where it lands             |
+| ------------------------------------------------------------------------- | -------------- | -------------------------- |
+| activity-core ActivityDefinition file ingestion + cron executor (WP-0003) | activity-core  | activity-core/`src/...`    |
+| activity-core shell instruction kind with on_failure semantics            | activity-core  | activity-core/`src/...`    |
+| state-hub adapter exposing `state-hub.health` as a context source         | activity-core  | activity-core/adapters/    |
+
+Until these land, the state hub continues to schedule jobs via systemd
+timer + cron entries.
+
+---
+
+## 5. Cutover plan (when ready)
+
+1. Land ActivityDefinitions A + B in activity-core.
+2. Enable them in staging; verify they fire on schedule and produce the
+   same DB / NATS effects as the current cron entries.
+3. Run both in parallel for one week (cron + ActivityDefinition). The
+   scripts are idempotent — duplicate runs are no-ops on a clean state.
+4. Disable the systemd timer:
+   `systemctl --user disable --now custodian-sync.timer`
+5. Remove the cleanup-stale cron entry from `crontab -e`.
+6. Update `infra/README.md` to point at the ActivityDefinitions and
+   archive the systemd unit files.
+7. Per-commit hook stays until a `repo.commit.pushed` event exists.
+
+---
+
+## 6. Open questions
+
+- **Locking**: should activity-core wrap shell instructions with a
+  process lock (today the script self-locks via `/tmp/...`)? If yes, the
+  state-hub script's lock can be removed.
+- **Failure surfacing**: today systemd journals capture stderr. Where
+  does an ActivityDefinition's shell stderr go (logs, activity
+  history)? This needs activity-core docs before cutover.
+- **Per-repo split**: do we split A into one ActivityDefinition per
+  registered repo (so failures don't poison the sweep), or keep the
+  monolithic `--all` mode? The latter is simpler and matches today's
+  behaviour; the former gives better observability.
--- a/state-hub/docs/nats-event-subjects.md
+++ b/state-hub/docs/nats-event-subjects.md
@@ -0,0 +1,98 @@
+# NATS Event Subjects — State Hub
+
+> Part of CUST-WP-0040. Cross-reference: activity-core's
+> `event-types/` registry and ADR-001 (event bridge architecture).
+
+The state hub publishes lifecycle events to NATS JetStream so that
+activity-core can drive maintenance and reaction automation declaratively,
+via `ActivityDefinition` rules — rather than the state hub creating tasks
+itself.
+
+This document is the authoritative subject naming convention for state hub
+events. When adding a new event, add a row to the table below first and
+keep the activity-core `event-types/` registry in sync.
+
+---
+
+## Naming convention
+
+```
+org.{producer}.{noun}.{verb}[.{qualifier}]
+```
+
+- **`org`** — top-level namespace shared with activity-core (`org.>`)
+- **`{producer}`** — the publisher subsystem; the state hub uses `statehub`
+- **`{noun}`** — entity the event is about (`repo`, `workstream`, `task`, …)
+- **`{verb}`** — past-tense state transition (`registered`, `completed`, `resolved`, …)
+- **`{qualifier}`** — optional refinement (e.g. `goal.activated`)
+  In `org.statehub.domain.goal.activated` the extra segment refines the
+  noun (`domain.goal`), so the verb `activated` still comes last.
+
+All segments are lowercase ASCII. No camelCase, no dashes inside segments.
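The convention above can be checked mechanically. The pattern below is a sketch: allowing digits after the first letter of a segment is an assumption (the text only says lowercase ASCII without dashes), and `is_valid_subject` is a hypothetical helper name.

```python
import re

# "org" followed by 3 or 4 dot-separated segments:
# producer, noun, verb, and one optional extra segment.
# Each segment: lowercase ASCII letter first, then letters or digits
# (digits are an assumption; dashes and uppercase are rejected).
SUBJECT_RE = re.compile(r"org(?:\.[a-z][a-z0-9]*){3,4}")


def is_valid_subject(subject: str) -> bool:
    """Return True if the subject follows the naming convention."""
    return SUBJECT_RE.fullmatch(subject) is not None
```

All five v1.0 subjects pass this check, while `org.state-hub.repo.registered` and `org.statehub.Repo.registered` are rejected.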
+
+### Why a `statehub` namespace?
+
+Activity-core listens to `activity.>` for its internal task lifecycle and
+`org.>` for org-wide lifecycle events. Multiple publishers will eventually
+share `org.>` (e.g. railiance, kaizen). The `{producer}` segment keeps
+those publishers from colliding on the same `{noun}.{verb}` shape.
+
+---
+
+## Published subjects (v1.0)
+
+| Subject                              | When                                                         | Required attributes                                                                                                          |
+| ------------------------------------ | ------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------- |
+| `org.statehub.repo.registered`       | A new repo is registered via `POST /repos/`                  | `repo_id`, `repo_slug`, `domain_slug`, `remote_url?`, `local_path?`                                                          |
+| `org.statehub.workstream.completed`  | A workstream transitions to status `completed`               | `workstream_id`, `slug`, `title`, `topic_id`, `repo_id?`, `repo_goal_id?`                                                    |
+| `org.statehub.decision.resolved`     | A decision is resolved via `POST /decisions/{id}/resolve`    | `decision_id`, `title`, `topic_id?`, `workstream_id?`, `decided_by`, `rationale_snippet`                                     |
+| `org.statehub.domain.goal.activated` | A domain goal transitions to `active`                        | `goal_id`, `domain_id`, `domain_slug`, `title`, `superseded_goal_ids[]`                                                      |
+| `org.statehub.task.stale`            | `scripts/cleanup_stale_tasks.py` cancels an out-of-date task | `task_id`, `workstream_id`, `workstream_status`, `task_title`, `task_status_before`                                          |
+
+### Envelope shape
+
+Each message body conforms to the `EventEnvelope` schema in
+`api/events/envelope.py`, mirrored from
+`activity-core/src/activity_core/models.py`:
+
+```json
+{
+  "id": "uuid v4 — stable, used for at-least-once dedup",
+  "type": "org.statehub.repo.registered",
+  "version": "1.0",
+  "timestamp": "2026-05-17T14:00:00Z",
+  "publisher": "the-custodian/state-hub",
+  "attributes": { "...": "event-specific" }
+}
+```
+
+`type` matches the subject. `publisher` is always
+`the-custodian/state-hub` for events emitted from this repo.
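A minimal sketch of an envelope with this shape, using a standard-library dataclass. The real `api/events/envelope.py` may be built differently (for example on a validation library); only the field names, the `new(subject, attributes)` constructor, and the fixed `publisher` value come from this document, and `to_json` is a hypothetical helper.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class EventEnvelope:
    id: str
    type: str
    version: str
    timestamp: str
    publisher: str
    attributes: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def new(cls, subject: str, attributes: Dict[str, Any]) -> "EventEnvelope":
        """Build an envelope whose `type` equals the NATS subject."""
        return cls(
            id=str(uuid.uuid4()),  # consumers can use this for dedup
            type=subject,
            version="1.0",
            timestamp=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            publisher="the-custodian/state-hub",
            attributes=dict(attributes),
        )

    def to_json(self) -> bytes:
        """Serialize to the JSON message body (hypothetical helper)."""
        return json.dumps(asdict(self)).encode("utf-8")
```

For example, `EventEnvelope.new("org.statehub.repo.registered", {"repo_slug": "demo"}).to_json()` yields a body with the six top-level keys shown in the JSON above.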
+
+---
+
+## Stream
+
+State hub events are published into the **`ACTIVITY_EVENTS`** JetStream
+stream (subject filter `org.>`). The stream is owned by activity-core; the state
+hub will auto-create it on first publish if it does not exist, so the
+publisher works in dev environments without bootstrapping activity-core
+first. In production both services point at the same NATS cluster and
+activity-core's `EventRouter` consumes the stream durably.
+
+---
+
+## Adding a new event
+
+1. Pick a subject following the convention above.
+2. Add a row to the table in this file (subject, trigger, attributes).
+3. Add a matching `event-types/` entry in activity-core.
+4. Wire `publish_event(subject, EventEnvelope.new(subject, attributes))`
+   at the site of the state transition, in the same handler but only
+   after `await session.commit()` has succeeded. Never publish
+   optimistically.
+5. Verify locally: run `nats sub 'org.statehub.>'` while triggering the
+   transition.
+
+## Versioning
+
+`version` is bumped only when an attribute is removed or its semantics
+change. Adding optional attributes does **not** require a version bump.
+Activity-core consumers must tolerate unknown attribute keys.
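On the consuming side, the versioning rule implies a parser that checks the required attributes and ignores everything else. The sketch below assumes a hypothetical `parse_event` helper and a hand-written `REQUIRED` map covering two of the subjects from the table; it is not activity-core's actual consumer code.

```python
import json

# Required (non-optional) attributes per subject, copied from the table
# of published subjects. Only two subjects are listed to keep this short.
REQUIRED = {
    "org.statehub.repo.registered": ("repo_id", "repo_slug", "domain_slug"),
    "org.statehub.task.stale": (
        "task_id",
        "workstream_id",
        "workstream_status",
        "task_title",
        "task_status_before",
    ),
}


def parse_event(raw: bytes) -> dict:
    """Extract the required attributes; unknown attribute keys are ignored."""
    envelope = json.loads(raw)
    required = REQUIRED.get(envelope["type"], ())
    attributes = envelope.get("attributes", {})
    missing = [key for key in required if key not in attributes]
    if missing:
        raise ValueError(f"{envelope['type']} is missing attributes: {missing}")
    return {key: attributes[key] for key in required}
```

Because extra keys are dropped rather than rejected, a publisher can add an optional attribute without a version bump and existing consumers keep working.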