feat(state-hub): CUST-WP-0040 — NATS lifecycle event publishing for activity-core

Makes the state hub an event publisher so activity-core can drive
maintenance automation declaratively via ActivityDefinitions, rather
than the hub creating tasks itself.

- api/events/: lazy JetStream publisher + EventEnvelope mirroring
  activity-core's contract; no-op when NATS_URL unset, fire-and-forget
  with logged failures so publishing never breaks an API request.
- Wired publishers on the five v1.0 lifecycle events:
    org.statehub.repo.registered        (POST /repos/)
    org.statehub.workstream.completed   (PATCH /workstreams/* on transition)
    org.statehub.decision.resolved      (POST /decisions/*/resolve)
    org.statehub.domain.goal.activated  (POST /domain-goals/*/activate)
    org.statehub.task.stale             (scripts/cleanup_stale_tasks.py)
- docs/nats-event-subjects.md: subject naming convention + catalog.
- docs/cron-migration.md: design stub for replacing custodian-sync
  systemd timer and cleanup-stale cron with ActivityDefinitions
  (depends on activity-core WP-0003).
- docs/activity-core-delegation.md: protocol, invariants, cutover plan.
- SCOPE.md: declares activity-core as downstream event consumer and
  restates that the state hub stays a read model, not a task factory.

Workplan: workplans/CUST-WP-0040-state-hub-nats-activity-core-integration.md

242 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
151
state-hub/docs/activity-core-delegation.md
Normal file
@@ -0,0 +1,151 @@
# State Hub → activity-core Delegation Protocol

> CUST-WP-0040 T05. Cross-reference:
> [`docs/nats-event-subjects.md`](nats-event-subjects.md),
> [`docs/cron-migration.md`](cron-migration.md), and activity-core's
> `docs/adr/adr-001-event-bridge-architecture.md`.

## TL;DR

The state hub is a **read model** for cross-domain state. It is not a
task factory. Maintenance automations that *create new work in response
to state transitions* belong in activity-core as `ActivityDefinition`
files. The state hub's only job in that flow is to **publish lifecycle
events** on NATS JetStream so activity-core can react.

```
                                             NATS JetStream
                                        subject: org.statehub.>
                                        stream:  ACTIVITY_EVENTS
                                        ┌──────────────────────┐
POST /repos/                            │                      │
PATCH /workstreams/*  ─────publish───▶  │                      │  ───consume───▶  activity-core
POST /decisions/*/resolve               │                      │                  EventRouter
POST /domain-goals/*/activate           │                      │                       │
scripts/cleanup_stale_tasks.py          │                      │                       ▼
                                        └──────────────────────┘                  RunActivityWorkflow
    the-custodian/state-hub                                                       (creates tasks in
                                                                                   issue-core, etc.)
```

## Why delegate?

| Concern                                  | Lives in the state hub today  | Lives in activity-core after migration                      |
| ---------------------------------------- | ----------------------------- | ----------------------------------------------------------- |
| "When should this maintenance run?"      | cron/systemd timers           | `ActivityDefinition.trigger` (cron + event triggers)        |
| "What rule decides whether to act?"      | hard-coded in the script      | `ActivityDefinition.rule.when` expressions                  |
| "What task / side-effect should we run?" | hard-coded in the script      | `ActivityDefinition.instruction` (shell / workflow / etc.)  |
| "Where do we audit what fired?"          | journalctl + ad hoc logs      | activity-core history + Temporal workflow runs              |
| "How is it changed safely?"              | edit Python + redeploy hub    | edit YAML in the repo, PR-reviewable, hot-reloadable        |

Concentrating maintenance logic in declarative `ActivityDefinition`
files makes the rules **auditable**, **testable**, and **modifiable
without redeploying the state hub**.

## Published lifecycle events (v1.0)

The authoritative list of subjects and attributes lives in
[`docs/nats-event-subjects.md`](nats-event-subjects.md). At v1.0 the
state hub publishes:

| Subject                              | Trigger site (file:fn)                                          |
| ------------------------------------ | --------------------------------------------------------------- |
| `org.statehub.repo.registered`       | `api/routers/repos.py:register_repo`                            |
| `org.statehub.workstream.completed`  | `api/routers/workstreams.py:update_workstream` (on transition)  |
| `org.statehub.decision.resolved`     | `api/routers/decisions.py:resolve_decision_action`              |
| `org.statehub.domain.goal.activated` | `api/routers/domain_goals.py:activate_domain_goal`              |
| `org.statehub.task.stale`            | `scripts/cleanup_stale_tasks.py` (per cancelled task)           |

All events use the shared `EventEnvelope` schema (`api/events/envelope.py`)
and are published via `publish_event(subject, envelope)`. Publishing is
fire-and-forget: failures are logged but **never break the API request
that triggered them**, and the publisher no-ops when `NATS_URL` is
unset.

## What stays in the state hub

- DB schema + Alembic migrations
- API endpoints (CRUD + status transitions + read-model queries)
- MCP tools (read + sanctioned writes: `resolve_decision`,
  `add_progress_event`, `get_next_steps`)
- The consistency engine (`scripts/consistency_check.py`) — it owns
  ADR-001 reconciliation between workplan files and the DB.
- The `cleanup_stale_tasks.py` *script* (not its schedule) — it owns
  the lifecycle rule for cancelling orphaned tasks.

## What moves to activity-core

- The *schedule* for the consistency sweep (`*/15 * * * *`) →
  `the-custodian.state-hub-consistency-sweep` ActivityDefinition.
- The *schedule* for stale-task cleanup (`0 3 * * *`) →
  `the-custodian.state-hub-stale-task-cleanup` ActivityDefinition.
- Any future "when X happens, create a task" logic. The state hub must
  **not** add such rules to its routers — it publishes the event and
  the rule lives in activity-core.

See [`docs/cron-migration.md`](cron-migration.md) for the
ActivityDefinition drafts and cutover plan.

## What must never happen

- **State hub writes directly to activity-core's DB.** All
  communication is via NATS events.
- **State hub creates issue-core / Temporal tasks itself.** That is
  activity-core's job.
- **Routers publish before committing.** Always publish after
  `await session.commit()` succeeds. (Otherwise a transaction rollback
  would still leak an event.)
- **A publish failure breaks the API response.** The publisher logs and
  swallows the error; lost events are recovered by activity-core
  re-reading state on the next sweep, not by the API retrying.
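
The commit-then-publish ordering can be illustrated with a small stand-alone sketch. `FakeSession`, the stub `publish_event`, and `complete_workstream` are hypothetical stand-ins for illustration; the real routers and session type are not shown in this document.

```python
import asyncio


class FakeSession:
    """Stand-in for an async DB session (hypothetical, for illustration)."""

    def __init__(self, fail_commit: bool = False):
        self.fail_commit = fail_commit
        self.committed = False

    async def commit(self) -> None:
        if self.fail_commit:
            raise RuntimeError("commit failed; transaction rolled back")
        self.committed = True


published = []  # subjects recorded by the stub publisher below


async def publish_event(subject: str, envelope: dict) -> None:
    """Stub publisher that records subjects instead of talking to NATS."""
    published.append(subject)


async def complete_workstream(session: FakeSession, workstream_id: str) -> None:
    # 1. Persist the state transition first.
    await session.commit()
    # 2. Only reached when the commit succeeded, so a rollback cannot leak an event.
    await publish_event(
        "org.statehub.workstream.completed",
        {"attributes": {"workstream_id": workstream_id}},
    )
```

If `commit()` raises, the exception propagates before `publish_event` is ever called, which is exactly the property the rule above asks for.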

## Operational checklist — migrating a cron to an ActivityDefinition

1. Identify the cron's current side-effects. If any of them
   *create work* (a task, an issue, a ticket), it is a delegation
   candidate. Pure consistency reconciliation can stay as a shell-cron
   for now if that is simpler.
2. Decide the trigger: keep it as `cron`, or upgrade it to `event` by
   first identifying / publishing the state hub lifecycle event the
   cron is effectively polling for.
3. Add a row to [`docs/nats-event-subjects.md`](nats-event-subjects.md)
   if a new event type is being introduced.
4. Wire `publish_event(...)` at the transition site in the appropriate
   router. Verify with `nats sub 'org.statehub.>'`.
5. Land the `ActivityDefinition` in activity-core; enable it in
   staging.
6. Run both old cron and new ActivityDefinition in parallel for one
   week. Both side-effects must be idempotent for this to be safe — if
   they aren't, fix that first.
7. Disable the old cron / systemd timer and archive the unit files.
8. Update [`SCOPE.md`](../../SCOPE.md) "Often used with" to mention the
   activity-core handoff if a new event type was added.

## Bootstrap and partial-availability behaviour

- **No NATS configured (`NATS_URL` unset)**: publisher is a logged
  no-op. The state hub remains fully functional. Useful for dev
  environments and `make test`.
- **NATS reachable but stream missing**: publisher creates the
  `ACTIVITY_EVENTS` stream with subject filter `org.>` on first
  publish, so the state hub can come up before activity-core. In
  production both should target the same NATS cluster.
- **activity-core down**: events queue in JetStream and are replayed
  when the consumer reconnects. The state hub is unaffected.
- **State hub down**: scheduled ActivityDefinitions in activity-core
  still fire; ones that need `state-hub.health` context will skip
  cleanly per their rule.

## Verifying end-to-end

```bash
# Subscribe to lifecycle events
nats sub 'org.statehub.>'

# Trigger an event (in another terminal)
curl -X POST http://127.0.0.1:8000/repos/<slug>/sync

# Observe the envelope on the subscriber. Sample shape:
# {"id":"...","type":"org.statehub.workstream.completed","version":"1.0",
#  "timestamp":"...","publisher":"the-custodian/state-hub","attributes":{...}}
```
175
state-hub/docs/cron-migration.md
Normal file
@@ -0,0 +1,175 @@
# State Hub Cron → activity-core ActivityDefinition Migration (Design Stub)

> CUST-WP-0040 T04. **Design stub — not yet implemented.**
> Migration depends on activity-core WP-0003 reaching the
> "ActivityDefinition file ingestion + cron trigger executor" milestone.

The state hub currently runs two recurring maintenance jobs and one
per-repo event hook. Once activity-core is ready, each becomes an
ActivityDefinition file checked into the appropriate repo. The state hub
keeps the underlying scripts; only the *scheduling* moves.

---

## 1. Inventory of current maintenance automations

| # | Source              | Trigger today                                            | Script invoked                                           | What it does                                                                                       |
| - | ------------------- | -------------------------------------------------------- | -------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| 1 | systemd user timer  | every 15 min                                             | `scripts/consistency_check.py --remote --all`            | Pull every registered repo, reconcile workplan files ↔ DB, run C-15 writeback + C-16 pull gate     |
| 2 | manual / daily cron | `make cleanup-stale` (suggested `0 3 * * *`)             | `scripts/cleanup_stale_tasks.py`                         | Cancel tasks still open in completed/archived workstreams; emits `org.statehub.task.stale`         |
| 3 | git post-commit     | every commit in a registered repo                        | `make fix-consistency REPO=<slug>`                       | Per-repo workplan ↔ DB sync immediately after a commit                                             |

Honourable mentions (not currently scheduled, on-demand only — listed for
completeness so they don't get mistakenly picked up):

- `scripts/ingest_sbom.py` — invoked via `make ingest-sbom REPO=<slug>`.
- `scripts/ingest_capabilities.py` — invoked via `make ingest-capabilities[-all]`.
- `scripts/check_doi.py` — invoked via `make check-doi[-all]`.
- `scripts/validate_repo_adr.py` — invoked manually for canon promotion.
- `scripts/ingest_tpsc.py` — invoked via `make ingest-tpsc[-all]`.

These are **not in scope** for cron migration — they remain on-demand
operator/CI commands. They become candidates only if we later decide to
run them on a schedule.

---

## 2. Target ActivityDefinitions

### A. `state-hub-consistency-sweep`

```yaml
# activity-definitions/the-custodian/state-hub-consistency-sweep.yaml
id: the-custodian.state-hub-consistency-sweep
description: |
  Sweep all registered repos: pull, reconcile workplan files ↔ DB,
  apply writeback (C-15), respect pull gate (C-16). Mirrors the
  existing custodian-sync systemd timer.
trigger:
  trigger_type: cron
  cron_expression: "*/15 * * * *"
  timezone: UTC
  misfire_policy: skip   # if a prior run is still active, skip
context:
  - kind: http_get       # confirm state-hub API is reachable
    url: http://127.0.0.1:8000/state/health
    bind: hub_health
rule:
  when:
    - "hub_health.status == 'ok'"
instruction:
  kind: shell
  cmd: >-
    cd /home/worsch/the-custodian/state-hub &&
    .venv/bin/python scripts/consistency_check.py --remote --all --max-seconds 300
  on_failure: log_and_continue   # warn-only sweeps must not page on transient failures
```

Notes:
- Replaces the `custodian-sync.service` + `custodian-sync.timer` pair.
- Lock semantics (`/tmp/custodian-consistency-remote-all.lock`) stay in
  the script — activity-core just sets the cadence.
- Once active, `infra/README.md` is updated to instruct users to delete
  the systemd timer.
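
For readers unfamiliar with script-level locking, the sketch below shows one common way a script can guard against overlapping runs on Unix. It is an illustration only: this document does not say how `consistency_check.py` implements its lock, and the `single_instance` helper is a hypothetical name.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path: str):
    """Yield True if this run holds the lock, False if another run already does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    acquired = False
    try:
        try:
            # Non-blocking exclusive lock: fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            pass  # another process (or open file description) holds the lock
        yield acquired
    finally:
        if acquired:
            fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

A sweep would wrap its main body in `with single_instance(path) as ok:` and return early when `ok` is false, which makes an overlapping trigger from cron and activity-core harmless during the parallel-run period.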

### B. `state-hub-stale-task-cleanup`

```yaml
# activity-definitions/the-custodian/state-hub-stale-task-cleanup.yaml
id: the-custodian.state-hub-stale-task-cleanup
description: |
  Daily sweep that cancels tasks still 'todo|in_progress|blocked' inside
  completed or archived workstreams. Each cancellation also emits
  org.statehub.task.stale on NATS for downstream reaction.
trigger:
  trigger_type: cron
  cron_expression: "0 3 * * *"
  timezone: UTC
instruction:
  kind: shell
  cmd: >-
    cd /home/worsch/the-custodian/state-hub &&
    .venv/bin/python scripts/cleanup_stale_tasks.py
```

Notes:
- Replaces the documented (`Cron example: 0 3 * * * …`) daily run.
- The script already emits NATS events (see CUST-WP-0040 T03), so
  downstream ActivityDefinitions can react per-task without a second pass.
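
The staleness rule stated in the description above can be expressed as a small pure function. This is a sketch under assumptions: the task dictionary keys (`id`, `title`, `status`, `workstream_id`) are hypothetical, and only the emitted attribute names are taken from `docs/nats-event-subjects.md`.

```python
OPEN_TASK_STATUSES = {"todo", "in_progress", "blocked"}
CLOSED_WORKSTREAM_STATUSES = {"completed", "archived"}


def find_stale_tasks(tasks, workstream_status_by_id):
    """Return tasks that are still open inside a completed or archived workstream."""
    return [
        task
        for task in tasks
        if task["status"] in OPEN_TASK_STATUSES
        and workstream_status_by_id.get(task["workstream_id"]) in CLOSED_WORKSTREAM_STATUSES
    ]


def stale_event_attributes(task, workstream_status):
    """Build the attributes for one org.statehub.task.stale event."""
    return {
        "task_id": task["id"],
        "workstream_id": task["workstream_id"],
        "workstream_status": workstream_status,
        "task_title": task["title"],
        "task_status_before": task["status"],
    }
```

Because a cancelled task no longer has an open status, a second run over the same data selects nothing, which is the idempotency the cutover plan relies on.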

### C. Per-commit consistency sync (currently a git hook)

The git `post-commit` hook installed by `state-hub/scripts/install_hooks.sh`
is **event-driven, not cron-based**. Migrating it to activity-core would
require a `repo.commit.pushed` event channel that doesn't exist yet.

Recommendation: **keep the git hook as-is for now**. Revisit once an
event source (e.g. Gitea webhook fed into NATS) is available, at which
point an event-triggered ActivityDefinition can replace it cleanly:

```yaml
trigger:
  trigger_type: event
  event_type: org.repo.commit.pushed
  filters:
    repo_slug: "*"
```

---

## 3. Required context queries

Both A and B want to confirm the state hub is reachable before running.
A reusable context source should be added to activity-core for this:

- `state-hub.health` — `GET /state/health` → `{status, db, ...}`
- (optional) `state-hub.repos` — `GET /repos/?status=active` for the
  sweep's per-repo branching, if we later split A into one
  ActivityDefinition per repo.

These belong to the state-hub adapter referenced in the workplan's
out-of-scope note ("/sbom/status context query endpoint" etc.).

---

## 4. Blockers / sequencing

| Blocker                                                                   | Owner          | Where it lands             |
| ------------------------------------------------------------------------- | -------------- | -------------------------- |
| activity-core ActivityDefinition file ingestion + cron executor (WP-0003) | activity-core  | activity-core/`src/...`    |
| activity-core shell instruction kind with on_failure semantics            | activity-core  | activity-core/`src/...`    |
| state-hub adapter exposing `state-hub.health` as a context source         | activity-core  | activity-core/adapters/    |

Until these land, the state hub continues to schedule jobs via systemd
timer + cron entries.

---

## 5. Cutover plan (when ready)

1. Land ActivityDefinitions A + B in activity-core.
2. Enable them in staging; verify they fire on schedule and produce the
   same DB / NATS effects as the current cron entries.
3. Run both in parallel for one week (cron + ActivityDefinition). The
   scripts are idempotent — duplicate runs are no-ops on a clean state.
4. Disable the systemd timer:
   `systemctl --user disable --now custodian-sync.timer`
5. Remove the cleanup-stale cron entry from `crontab -e`.
6. Update `infra/README.md` to point at the ActivityDefinitions and
   archive the systemd unit files.
7. Per-commit hook stays until a `repo.commit.pushed` event exists.

---

## 6. Open questions

- **Locking**: should activity-core wrap shell instructions with a
  process lock (today the script self-locks via `/tmp/...`)? If yes, the
  state-hub script's lock can be removed.
- **Failure surfacing**: today systemd journals capture stderr. Where
  does an ActivityDefinition's shell stderr go (logs? activity
  history?)? This needs activity-core docs before cutover.
- **Per-repo split**: do we split A into one ActivityDefinition per
  registered repo (so failures don't poison the sweep), or keep the
  monolithic `--all` mode? The latter is simpler and matches today's
  behaviour; the former gives better observability.
98
state-hub/docs/nats-event-subjects.md
Normal file
@@ -0,0 +1,98 @@
# NATS Event Subjects — State Hub

> Part of CUST-WP-0040. Cross-reference: activity-core's
> `event-types/` registry and ADR-001 (event bridge architecture).

The state hub publishes lifecycle events to NATS JetStream so that
activity-core can drive maintenance and reaction automation declaratively,
via `ActivityDefinition` rules — rather than the state hub creating tasks
itself.

This document is the authoritative subject naming convention for state hub
events. When adding a new event, add a row to the table below first and
keep the activity-core `event-types/` registry in sync.

---

## Naming convention

```
org.{producer}.{noun}.{verb}[.{qualifier}]
```

- **`org`** — top-level namespace shared with activity-core (`org.>`)
- **`{producer}`** — the publisher subsystem; the state hub uses `statehub`
- **`{noun}`** — entity the event is about (`repo`, `workstream`, `task`, …)
- **`{verb}`** — past-tense state transition (`registered`, `completed`, `resolved`, …)
- **`{qualifier}`** — optional refinement (e.g. `goal.activated`)

All segments are lowercase ASCII. No camelCase, no dashes inside segments.
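
A quick way to check a candidate subject against this convention is a regular expression. The sketch below makes two assumptions that the convention does not spell out: a segment may contain lowercase letters and digits only, and any number of trailing qualifier segments is allowed (so that `org.statehub.domain.goal.activated` passes). The helper name `is_valid_subject` is hypothetical.

```python
import re

# Assumption: a segment is one or more lowercase ASCII letters or digits.
_SEGMENT = r"[a-z0-9]+"

# "org", then producer, noun, and verb, then zero or more qualifier segments.
_SUBJECT_RE = re.compile(rf"org(\.{_SEGMENT}){{3,}}")


def is_valid_subject(subject: str) -> bool:
    """Return True if the subject follows org.{producer}.{noun}.{verb}[.{qualifier}]."""
    return _SUBJECT_RE.fullmatch(subject) is not None
```

For example, `is_valid_subject("org.statehub.work-stream.completed")` is false because of the dash, and `is_valid_subject("org.statehub.repo")` is false because the verb segment is missing.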

### Why a `statehub` namespace?

Activity-core listens to `activity.>` for its internal task lifecycle and
`org.>` for org-wide lifecycle events. Multiple publishers will eventually
share `org.>` (e.g. railiance, kaizen). The `{producer}` segment keeps
those publishers from colliding on the same `{noun}.{verb}` shape.

---

## Published subjects (v1.0)

| Subject                              | When                                                         | Required attributes                                                                         |
| ------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------------------------------------- |
| `org.statehub.repo.registered`       | A new repo is registered via `POST /repos/`                  | `repo_id`, `repo_slug`, `domain_slug`, `remote_url?`, `local_path?`                         |
| `org.statehub.workstream.completed`  | A workstream transitions to status `completed`               | `workstream_id`, `slug`, `title`, `topic_id`, `repo_id?`, `repo_goal_id?`                   |
| `org.statehub.decision.resolved`     | A decision is resolved via `POST /decisions/{id}/resolve`    | `decision_id`, `title`, `topic_id?`, `workstream_id?`, `decided_by`, `rationale_snippet`    |
| `org.statehub.domain.goal.activated` | A domain goal transitions to `active`                        | `goal_id`, `domain_id`, `domain_slug`, `title`, `superseded_goal_ids[]`                     |
| `org.statehub.task.stale`            | `scripts/cleanup_stale_tasks.py` cancels an out-of-date task | `task_id`, `workstream_id`, `workstream_status`, `task_title`, `task_status_before`         |

### Envelope shape

Each message body conforms to the `EventEnvelope` schema in
`api/events/envelope.py`, mirrored from
`activity-core/src/activity_core/models.py`:

```json
{
  "id": "uuid v4 — stable, used for at-least-once dedup",
  "type": "org.statehub.repo.registered",
  "version": "1.0",
  "timestamp": "2026-05-17T14:00:00Z",
  "publisher": "the-custodian/state-hub",
  "attributes": { "...": "event-specific" }
}
```

`type` matches the subject. `publisher` is always
`the-custodian/state-hub` for events emitted from this repo.
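
To make the envelope fields concrete, here is a minimal sketch of a class that produces this shape. It is not the contents of `api/events/envelope.py`: the use of a standard-library dataclass and the `to_json` helper are assumptions, and only the field names and the `EventEnvelope.new(subject, attributes)` call shape come from this document.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class EventEnvelope:
    id: str
    type: str
    version: str
    timestamp: str
    publisher: str
    attributes: dict = field(default_factory=dict)

    @classmethod
    def new(cls, subject: str, attributes: dict) -> "EventEnvelope":
        """Build an envelope whose type equals the subject it is published on."""
        return cls(
            id=str(uuid.uuid4()),  # fresh UUID v4, later usable for dedup
            type=subject,
            version="1.0",
            timestamp=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            publisher="the-custodian/state-hub",
            attributes=dict(attributes),  # copy so later caller edits do not leak in
        )

    def to_json(self) -> bytes:
        """Serialize to the UTF-8 JSON body sent over NATS (hypothetical helper)."""
        return json.dumps(asdict(self)).encode("utf-8")
```

Calling `EventEnvelope.new("org.statehub.repo.registered", {"repo_slug": "demo"}).to_json()` yields a body with the six top-level keys shown in the JSON example.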

---

## Stream

State hub events are published into the **`ACTIVITY_EVENTS`** JetStream
stream (subject filter `org.>`). The stream is owned by activity-core; the
state hub will auto-create it on first publish if it does not exist, so the
publisher works in dev environments without bootstrapping activity-core
first. In production both services point at the same NATS cluster and
activity-core's `EventRouter` consumes the stream durably.

---

## Adding a new event

1. Pick a subject following the convention above.
2. Add a row to the table in this file (subject, trigger, attributes).
3. Add a matching `event-types/` entry in activity-core.
4. Wire `publish_event(subject, EventEnvelope.new(subject, attributes))`
   at the site of the state transition, but only after
   `await session.commit()` succeeds. Never publish optimistically from
   inside the open transaction.
5. Verify locally: run `nats sub 'org.statehub.>'` while triggering the
   transition.

## Versioning

`version` is bumped only when an attribute is removed or its semantics
change. Adding optional attributes does **not** require a version bump.
Activity-core consumers must tolerate unknown attribute keys.
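
On the consuming side, tolerating unknown keys and deduplicating on `id` might look like the sketch below. It is illustrative only: activity-core's actual `EventRouter` is not shown here, the in-memory `_seen_ids` set stands in for whatever durable dedup store a real consumer would use, and `handle_message` is a hypothetical name.

```python
import json
from typing import Optional

# Required attributes per subject, copied from the table of published subjects.
REQUIRED_ATTRIBUTES = {
    "org.statehub.task.stale": (
        "task_id",
        "workstream_id",
        "workstream_status",
        "task_title",
        "task_status_before",
    ),
}

_seen_ids = set()  # illustration only; not durable across restarts


def handle_message(raw: bytes) -> Optional[dict]:
    """Return the required attributes of a new event, or None if it is skipped."""
    envelope = json.loads(raw)
    if envelope["id"] in _seen_ids:
        return None  # duplicate delivery under at-least-once semantics
    required = REQUIRED_ATTRIBUTES.get(envelope["type"])
    if required is None:
        return None  # subject this consumer does not handle
    attributes = envelope.get("attributes", {})
    missing = [key for key in required if key not in attributes]
    if missing:
        raise ValueError(f"{envelope['type']} is missing attributes: {missing}")
    _seen_ids.add(envelope["id"])
    # Pick only the keys we know; extra keys added by newer publishers are ignored.
    return {key: attributes[key] for key in required}
```

Selecting known keys, rather than rejecting unexpected ones, is what lets a publisher add optional attributes without a version bump.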