Compare commits

81 Commits

Author SHA1 Message Date
92629e7a91 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-30:
  - update .custodian-brief.md for activity-core
2026-06-30 01:50:22 +02:00
951ec56f7a chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-29:
  - update .custodian-brief.md for activity-core
2026-06-29 13:45:41 +02:00
9440d539c6 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-29:
  - update .custodian-brief.md for activity-core
2026-06-29 13:33:21 +02:00
2ff852da29 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-29:
  - update .custodian-brief.md for activity-core
2026-06-29 12:57:25 +02:00
30043348f0 Add Core Hub ops evidence sink 2026-06-27 20:34:25 +02:00
18fcce87fe Update daily triage stabilization status 2026-06-27 09:58:47 +02:00
17b787fad0 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-27:
  - update .custodian-brief.md for activity-core
2026-06-27 08:07:46 +02:00
6c8cb1b7b6 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-27:
  - ACTIVITY-WP-0010-T03: progress → wait
2026-06-27 08:07:42 +02:00
ec66e06066 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-27:
  - update .custodian-brief.md for activity-core
2026-06-27 08:00:51 +02:00
919edd98ac chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - update .custodian-brief.md for activity-core
2026-06-26 18:20:26 +02:00
bf877b7f0d test(ACTIVITY-WP-0016-T05): regression coverage incl. real 06-26 payload + over-depth
Add a test driving the actual captured 2026-06-26 failure payload
(tests/fixtures/wp0016/...partial.json): it now recovers 6+ valid recommendations
and quarantines the truncated tail, where before WP-0016 it discarded the whole run.
Add an over-depth guardrail test. Together with T03/T04 the regression set now covers
truncation, one-bad-item, oversized-string, over-depth, allow-list/injection-shaped,
and happy-path count cap.

In-repo portion of T05 complete; the live railiance01 graceful-degradation smoke is
operator-owned cluster work (deploy-coupled with the T02 bundle changes) and remains
outstanding. Hand-back notes posted to WP-0006-T03 and WP-0010-T04. Full suite: 220
passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 18:18:37 +02:00
9be4ddbdb7 feat(ACTIVITY-WP-0016-T04): producer trust-boundary guardrails + ADR-004
Add ADR-004 documenting the producer trust boundary: untrusted producers (LLM,
agent, human; erroneous and malicious), the trust-but-handle vs verify-and-mitigate
postures, error-locality and quarantine-with-provenance principles, and the concrete
activity-core mechanisms.

Implement producer-agnostic guardrails in executor.py, applied uniformly on the
happy path and the recovery path via _partition_items: structural-type -> schema ->
structural caps (_MAX_DEPTH, _MAX_STRING_LEN) -> reference allow-list -> count cap.
Each quarantine carries a reason. Closes the happy-path maxItems count cap deferred
from T03 (valid 9-item report keeps 7, quarantines 2). Reference allow-list reads
context["known_candidates"] via _allow_list_from_context; inert until a resolver
populates it. SCOPE.md updated (executor bullet + ADR list); no INTENT drift.

New tests: happy-path count cap, oversized-string guardrail, allow-list rejection.
Full suite: 218 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 18:10:17 +02:00
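
The guardrail order described in this commit (structural-type, schema, structural caps, reference allow-list, count cap) can be sketched roughly as below. This is a minimal illustration, not the code in `executor.py`: the function names, the reason strings, and the numeric values of `MAX_DEPTH` and `MAX_STRING_LEN` are assumptions; only the required keys and the cap of 7 come from the commit messages.

```python
from typing import Any, Optional

MAX_DEPTH = 8            # assumed value; the real _MAX_DEPTH is not stated
MAX_STRING_LEN = 2000    # assumed value; the real _MAX_STRING_LEN is not stated
MAX_ITEMS = 7            # from the maxItems:7 hint in the T02 schema
REQUIRED = ("rank", "candidate", "action", "why")  # from the T02 item contract


def depth(value: Any, level: int = 1) -> int:
    """Nesting depth of a JSON-like value (a scalar counts as 1)."""
    if isinstance(value, dict):
        return max((depth(v, level + 1) for v in value.values()), default=level)
    if isinstance(value, list):
        return max((depth(v, level + 1) for v in value), default=level)
    return level


def longest_string(value: Any) -> int:
    """Length of the longest string found anywhere in the value."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, dict):
        return max((longest_string(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((longest_string(v) for v in value), default=0)
    return 0


def reject_reason(item: Any, allow_list: Optional[set]) -> Optional[str]:
    """Return the first failing guardrail, or None if the item passes."""
    if not isinstance(item, dict):
        return "structural_type"
    if any(key not in item for key in REQUIRED):
        return "schema"
    if depth(item) > MAX_DEPTH:
        return "over_depth"
    if longest_string(item) > MAX_STRING_LEN:
        return "oversized_string"
    # The allow-list is inert (None) until something populates it.
    if allow_list is not None and str(item["candidate"]) not in allow_list:
        return "not_in_allow_list"
    return None


def partition_items(items: list, allow_list: Optional[set] = None):
    """Split items into (valid, quarantined); each quarantine carries a reason."""
    valid, quarantined = [], []
    for index, item in enumerate(items):
        reason = reject_reason(item, allow_list)
        if reason is None and len(valid) >= MAX_ITEMS:
            reason = "over_max_items"  # count cap is applied last
        if reason is None:
            valid.append(item)
        else:
            quarantined.append({"index": index, "reason": reason, "raw": item})
    return valid, quarantined
```

With nine well-formed items, this sketch keeps seven and quarantines two, matching the count-cap behaviour the commit describes.
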
c5440e8429 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - update .custodian-brief.md for activity-core
2026-06-26 18:04:07 +02:00
53dc0f6e93 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T03: progress → done
2026-06-26 18:03:50 +02:00
a70c00a789 feat(ACTIVITY-WP-0016-T03): resilient per-item report recovery with quarantine lane
When the whole-document parse + one retry still fail, report instructions now run
_resilient_report before the total-loss path. A brace/quote-aware scanner
(_extract_object_spans) recovers each recommendation object whether pretty-printed
across many lines or NDJSON one-per-line; a truncated tail gets a best-effort
_try_repair; _partition_items validates each recovered object against the T02 item
schema. Valid items survive (output_validated=True, partial=True), malformed/
over-maxItems items are quarantined with provenance (index, error, raw, reason),
capped at 20. Error locality now matches the unit of work: one bad item costs one
item, not the whole report.

Verified against the real 06-26 shape: 7 valid recommendations + a truncated tail
now recovers all 7 and quarantines the broken tail (previously the whole run was
discarded). Happy-path maxItems top-N enforcement is deferred to T04 (count caps).
Full suite: 215 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 17:56:28 +02:00
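
To illustrate the brace/quote-aware scanning idea from this commit, here is a small sketch. It is not the repository's `_extract_object_spans`; the names `extract_object_spans` and `recover_items` are hypothetical, and the strategy (keep the outermost objects that actually close) is an assumption chosen so that both a truncated pretty-printed document and NDJSON input yield per-item spans.

```python
import json


def extract_object_spans(text: str) -> list:
    """Return (start, end) spans of the outermost JSON objects that close.

    Braces inside string literals are ignored, so a '}' in a "why" field
    does not end an object early. In a truncated document the enclosing
    object never closes, so the spans are the complete inner items.
    """
    spans = []
    open_braces = []   # positions of '{' that have not been closed yet
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            open_braces.append(i)
        elif ch == "}" and open_braces:
            start = open_braces.pop()
            # Drop spans nested inside the object that just closed.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, i + 1))
    return spans


def recover_items(text: str) -> list:
    """Parse every recoverable object; skip spans that still fail to parse."""
    items = []
    for start, end in extract_object_spans(text):
        try:
            items.append(json.loads(text[start:end]))
        except json.JSONDecodeError:
            continue
    return items
```

The truncated tail is simply left out in this sketch; the commit's best-effort `_try_repair` step and the quarantine bookkeeping are not reproduced here.
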
b41b6034ee chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - update .custodian-brief.md for activity-core
2026-06-26 17:52:46 +02:00
960fb05268 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T03: todo → progress
2026-06-26 17:52:30 +02:00
b7b0b5bf6e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T02: todo → progress
2026-06-26 17:52:29 +02:00
14f76fb6d9 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T01: todo → wait
2026-06-26 17:52:28 +02:00
caa2608092 chore(consistency): renormalize lifecycle state [auto]
Updated by fix-consistency on 2026-06-26:
  - workplan status: proposed → active
2026-06-26 17:52:28 +02:00
61f278d643 feat(ACTIVITY-WP-0016-T02): strict bounded daily-triage output schema
Replace the accept-anything recommendations.items ({type: object}) with a strict
per-item contract (required [rank, candidate, action, why] + typed wsjf) and a
maxItems:7 hint. Strict item structure is what lets the T03 boundary parser
validate each recommendation independently and quarantine only malformed ones.

maxItems is a producer hint (prompt + llm-connect json_schema + T03 mitigation),
NOT a hard reject — a hard maxItems reject would discard a whole 16-item report,
the blast-radius bug WP-0016 removes. DEPLOY COUPLING: the strict schema is also
consumed by the current whole-doc validator, so it must ship with T03's per-item
quarantine parser; until then it increases whole-doc hard-fails. Prompt + max_tokens
headroom + NDJSON framing are documented as a runtime-bundle handoff.

Updated four tests to the strict contract; the forwarded-schema test now reads the
live schema file instead of hard-coding it. Full suite: 213 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 17:36:24 +02:00
0e9e18a59a chore(ACTIVITY-WP-0016-T01): record root-cause findings + partial failure fixture
Local analysis of the 2026-06-26 daily-triage validation failure: the unbounded
~1-recommendation-per-workstream list (16 active workstreams; JSON break at char
5268, ~rank 8-9) is the structural cause; both the first attempt and the retry
failed. The exact offending token and finish_reason are unrecoverable from
activity-core data — complete() drops finish_reason/usage, the report sink caps
raw output at 4000 chars (< 5268), and the log preview at 2000. Confirming the
exact token needs llm-connect producer-side logs on railiance01 (operator-owned);
mitigation (T02/T03) is identical regardless. Partial fixture captured.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 15:04:27 +02:00
5eb33bd3bb feat(ACTIVITY-WP-0016): register LLM output robustness & producer trust boundary workplan
Add WP-0016 to make the instruction-executor output contract robust after the
2026-06-26 daily-triage validation failure (one malformed delimiter discarded a
whole report). Per-item framing for error locality, verify-and-mitigate boundary
parsing with a quarantine lane, producer-trust-boundary guardrails (ADR-004), and
regression/calibration tests. Unblocks WP-0006-T03 / WP-0010-T04.

Also record the 06-26 recheck outcome (streak reset at two) in WP-0006-T03.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 14:39:21 +02:00
612c226472 chore(ACTIVITY-WP-0015): dedupe state_hub_workstream_id frontmatter
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:53:52 +02:00
0b2c68838e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-24:
  - update .custodian-brief.md for activity-core
2026-06-24 12:53:31 +02:00
4b5e96d7c1 feat(ACTIVITY-WP-0014): close workplan — catchup_latest deployed & verified on railiance01
T04 done: built+deployed the WP-0014 image to railiance01, applied catchup_latest
to daily-statehub-wsjf-triage, /admin/sync clean (6 defs, 4 schedules, 0 errors).
Live schedule verified OverlapPolicy=BufferOne, CatchupWindow=1d; pods healthy.
All tasks T01-T05 complete; beachhead-endpoint adoption tracked in WP-0015.
Workplan status -> finished.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:52:54 +02:00
65ef005c2d docs(ACTIVITY-WP-0014): close T05 in-repo; split beachhead adoption to WP-0015
Idempotent-writes half of T05 is done in-repo; the externally-blocked endpoint
adoption + actcore-state-hub-bridge proxy retirement move to ACTIVITY-WP-0015
(blocked on the state-hub beachhead) so WP-0014 can close on completed work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:41:21 +02:00
0e75aaec01 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 21:39:32 +02:00
b2e57707a7 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - ACTIVITY-WP-0014-T05: todo → progress
2026-06-23 21:39:28 +02:00
88fe359385 feat(ACTIVITY-WP-0014): idempotency-keyed State Hub writes (T05, in-repo part)
Add activity_core/state_hub_write: every State Hub write (report-sink,
ops-evidence, schedule-miss) now sends a stable Idempotency-Key header derived
from run_id:instruction_id:event_type. Makes writes safe to buffer/replay under
the future state-hub beachhead without duplicate progress/triage events. The
read-based _progress_exists dedup is now best-effort (returns False on connection
error instead of hard-failing), so the guarantee lives on the keyed write rather
than a live read. Tests + runbook note. Endpoint adoption / proxy retirement stays
blocked on the state-hub beachhead capability.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:38:46 +02:00
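
A rough sketch of the keyed-write idea follows. It assumes the key is the plain `run_id:instruction_id:event_type` concatenation (the commit says only that the key is derived from those fields), and `InMemoryHub` is a hypothetical stand-in for a receiver that deduplicates on the key; it is not the State Hub API.

```python
def idempotency_key(run_id: str, instruction_id: str, event_type: str) -> str:
    """Stable key: the same logical write always produces the same key."""
    return f"{run_id}:{instruction_id}:{event_type}"


def write_headers(run_id: str, instruction_id: str, event_type: str) -> dict:
    """Headers for one State Hub write (the header name is from the commit)."""
    return {
        "Idempotency-Key": idempotency_key(run_id, instruction_id, event_type),
        "Content-Type": "application/json",
    }


class InMemoryHub:
    """Hypothetical receiver that ignores replays of an already-seen key."""

    def __init__(self) -> None:
        self.events = {}

    def post(self, key: str, payload: dict) -> bool:
        """Store the payload once; return False when the key is a replay."""
        if key in self.events:
            return False
        self.events[key] = payload
        return True
```

Because the key depends only on the identity of the write, a buffered or retried request carries the same header and can be dropped by the receiver without a prior read.
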
f90591c5f1 docs(ACTIVITY-WP-0014): rescope T05 to thin client under State Hub beachhead model
Resilience (queue/cache) is handed to custodian/state-hub as a per-machine
beachhead; activity-core keeps only idempotent writes + adopt-beachhead-endpoint
and retires its bespoke actcore-state-hub-bridge proxy. Proposal sent to state-hub.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:18:01 +02:00
cf7a11dcd9 docs(ACTIVITY-WP-0014): correct Motivation to match T01 findings
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 17:16:17 +02:00
99e5d525a8 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 17:15:41 +02:00
8424c13783 docs(ACTIVITY-WP-0014): T01 root cause — State Hub Connection refused, not misfire
Live inspection of railiance01 (ssh + in-node kubectl/temporal) overturns the
catchup_window hypothesis: the daily-triage schedule is healthy (CatchupWindow
365d default, 0 MissedCatchupWindow). The 2026-06-23T05:20Z fire ran but Failed
at the report sink with '[Errno 111] Connection refused' posting to State Hub.
railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
is unreachable at 07:20 Europe/Berlin (102 resolver timeouts in 24h). Mark T01
done; add T05 for resilient sinks/resolvers as the real incident fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 17:14:04 +02:00
864f90f9b9 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 14:27:54 +02:00
053d18b24a feat(ACTIVITY-WP-0014): missed-fire detection & alert sink (T03)
Add activity_core/schedule_health: a pure evaluate_schedule_health() verdict
(built on Temporal's num_actions_missed_catchup_window plus a staleness check),
an async check_schedule_health() reader, and post_missed_fire_alert() that emits
a schedule_miss State Hub progress event. Makes a missed fire visible even under
misfire_policy=skip, where Temporal drops it by design. Unit tests for the
verdict logic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 14:25:33 +02:00
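
The verdict logic can be pictured as a pure function over two inputs: the missed-catchup counter and the time of the last fire. The sketch below is an assumption-laden illustration, not the code in `activity_core/schedule_health`: the `ScheduleVerdict` type, the parameter names, and the reason strings are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class ScheduleVerdict:
    healthy: bool
    reasons: Tuple[str, ...]


def evaluate_schedule_health(
    num_missed_catchup_window: int,
    last_fire_at: Optional[datetime],
    now: datetime,
    max_staleness: timedelta,
) -> ScheduleVerdict:
    """Pure verdict: no I/O, so it can be unit-tested with fixed timestamps."""
    reasons = []
    if num_missed_catchup_window > 0:
        reasons.append("missed_catchup_window")
    if last_fire_at is None:
        reasons.append("never_fired")
    elif now - last_fire_at > max_staleness:
        # Catches a dropped fire even when the missed counter stays at zero.
        reasons.append("stale")
    return ScheduleVerdict(healthy=not reasons, reasons=tuple(reasons))


if __name__ == "__main__":
    now = datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc)
    print(evaluate_schedule_health(0, now - timedelta(days=2), now, timedelta(days=1)))
```

Keeping the verdict separate from the async reader means the alerting decision can be tested without a Temporal connection.
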
77af65afb2 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 14:17:14 +02:00
0495f8a43f chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - ACTIVITY-WP-0014-T04: progress → wait
2026-06-23 14:17:06 +02:00
c6cad9e7b3 chore(consistency): renormalize lifecycle state [auto]
Updated by fix-consistency on 2026-06-23:
  - workplan status: proposed → active
2026-06-23 14:17:06 +02:00
a83b117f60 feat(ACTIVITY-WP-0014): explicit run-miss recovery policies (T02, T04)
Set Temporal catchup_window on cron schedules so a fire missed during a
worker/Temporal outage is no longer silently dropped. Redefine misfire_policy
into three explicit modes — skip, catchup_all, catchup_latest — mapping to
(catchup_window, overlap) pairs; legacy catchup/compress aliased. Add
catchup_window_seconds override. Remove the ad-hoc upsert-time 1h backfill in
favour of native catchup. Apply catchup_latest to daily-statehub-wsjf-triage in
the Railiance runtime manifest and document run-miss policies in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 14:15:45 +02:00
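
One way to picture the three run-miss modes is a lookup from policy name to a (catchup window, overlap) pair. In the sketch below only the `catchup_latest` pair (one-day window, BufferOne) is taken from the later deployment note in this log; the `skip` and `catchup_all` values and the alias targets for the legacy names are assumptions made for illustration, and overlap policies are plain strings rather than Temporal SDK enums.

```python
from dataclasses import dataclass, replace
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class RunMissPolicy:
    catchup_window: timedelta
    overlap: str


_POLICIES = {
    # Assumed: a very short window so a missed fire is effectively dropped.
    "skip": RunMissPolicy(timedelta(seconds=10), "SKIP"),
    # Assumed: replay every missed fire within the window.
    "catchup_all": RunMissPolicy(timedelta(days=1), "BUFFER_ALL"),
    # From the log: OverlapPolicy=BufferOne, CatchupWindow=1d.
    "catchup_latest": RunMissPolicy(timedelta(days=1), "BUFFER_ONE"),
}

# Assumed alias targets; the commit says only that the legacy names are aliased.
_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}


def resolve_policy(name: str, catchup_window_seconds: Optional[int] = None) -> RunMissPolicy:
    """Map a misfire_policy name to its pair, applying the optional override."""
    canonical = _ALIASES.get(name, name)
    if canonical not in _POLICIES:
        raise ValueError(f"unknown misfire_policy: {name!r}")
    policy = _POLICIES[canonical]
    if catchup_window_seconds is not None:
        policy = replace(policy, catchup_window=timedelta(seconds=catchup_window_seconds))
    return policy
```
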
ffc0ee2cb7 feat(ACTIVITY-WP-0014): plan schedule misfire robustness & run-miss options
Cron fires are silently dropped: _build_schedule() sets SchedulePolicy(overlap=)
but never catchup_window, so a brief worker/Temporal outage at trigger time drops
the fire with no recovery and no signal (root cause of missing 06-22/06-23 daily
triage runs). Define three explicit run-miss policies: skip, catchup_all,
catchup_latest, plus missed-fire detection.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 13:46:19 +02:00
59b3b73061 ui rules established 2026-06-22 23:03:40 +02:00
4bc5111dfd chore(consistency): apply state_hub_workstream_id writeback
Sync archived workplan frontmatter from State Hub fix-consistency.
2026-06-22 17:43:32 +02:00
e9a6029ded chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-22:
  - update .custodian-brief.md for activity-core
2026-06-22 16:50:01 +02:00
bf4e61f0bf feat(ACTIVITY-WP-0012): complete live admin-sync no-restart smoke
Ran Railiance01 cluster validation for POST /admin/sync without restarting
actcore-worker, added a repeatable smoke script, and closed the workplan.
2026-06-22 16:25:26 +02:00
40fa851ec0 fix(bridge): use /state/health for readiness probe
The actcore-state-hub-bridge readiness probe hit /state/summary through
the tunnel proxy chain. Cold-cache summary requests and intermittent
tunnel stalls routinely exceeded the 5s probe timeout (1584 failures
over 17h), leaving the pod 0/1 Ready and breaking hourly/triage sinks.

Use /state/health instead — same signal the ops inventory already
expects, and completes in ~30ms through the bridge.
2026-06-22 14:03:57 +02:00
e0742d18d7 Mark .repo-classification.yaml human-reviewed (CUST-WP-0050 T02)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 11:40:43 +02:00
ccac285b0a Reclassify as tooling (CUST-WP-0050 T02)
Apply the new 'tooling' category (reusable internal tooling/infrastructure)
from the Repo Classification Standard. First-pass agent classification.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 03:06:01 +02:00
a0dcc52353 Add repo classification (CUST-WP-0050 T02)
First-pass agent classification per the Repo Classification Standard v1.0
(canon-repo-classification); pending human review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 02:44:46 +02:00
faf5d60ae8 feat(STATE-WP-0064): enable cluster consistency sweep schedule
Enable the definition in k8s projection and pass activity-core source tags.
2026-06-21 21:46:43 +02:00
adfd1a9067 fix(STATE-WP-0064): allow 360s POST timeout on state-hub bridge proxy
Consistency sweeps exceed the previous 30s urllib timeout when triggered from
Railiance01 activity-core through actcore-state-hub-bridge.
2026-06-21 20:56:35 +02:00
44987457c1 chore: add make sync-schedules target for Temporal schedule reconcile
Wraps python -m activity_core.sync_schedules for operator discoverability.
2026-06-21 20:28:04 +02:00
3a981cc98f feat(STATE-WP-0064): wire consistency_sweep_remote_all state-hub query
Add POST /consistency/sweep/remote-all resolver support with a 330s
timeout and k8s projection for the consistency sweep definition.
2026-06-21 20:19:22 +02:00
dbd2fbb11c docs(workplan): record railiance01 llm-connect smoke evidence
Note the 2026-06-19 live reconciliation on railiance01: llm-connect
deployed, worker restarted with LLM_CONNECT_URL, fixture smoke passed.
Manual daily triage still blocked on actcore-state-hub-bridge reachability.
2026-06-19 15:58:04 +02:00
c938b80503 chore(kaizen): demote coach/optimization to weekly operate cadence
After coulomb-loop bootstrap E2E (3/3 cycles on 2026-06-18), revert
activity-core from experimental daily crons to weekly Monday schedules
so discover_kaizen_scheduled_repos(cadence=weekly) matches the
operate-phase ActivityDefinitions. Drop the disabled tdd-workflow stub.
2026-06-19 11:32:36 +02:00
3e93567a53 Add admin sync hot reload path 2026-06-19 01:54:13 +02:00
6f68f8f9ec chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-19:
  - update .custodian-brief.md for activity-core
2026-06-19 01:52:52 +02:00
f05c56e202 fix(issue-sink): stringify triggering_event_id before JSON encode
IssueCoreRestSink.emit() passed task_spec.triggering_event_id straight
into the httpx json= payload. When the field is a UUID object (rather
than a string), httpx's JSON encoder raised
"TypeError: Object of type UUID is not JSON serializable", failing the
emission. Guard with str(), preserving None for optional event ids.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 00:15:03 +02:00
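
The failure and the guard can be reproduced with the standard library alone. The helper name `stringify_event_id` is hypothetical; the sketch shows only the `str()` guard that preserves `None`, not the actual `IssueCoreRestSink.emit()` code.

```python
import json
import uuid
from typing import Optional, Union


def stringify_event_id(value: Optional[Union[str, uuid.UUID]]) -> Optional[str]:
    """Convert a UUID (or string) to str, leaving an absent id as None."""
    return None if value is None else str(value)


if __name__ == "__main__":
    event_id = uuid.uuid4()
    try:
        json.dumps({"triggering_event_id": event_id})
    except TypeError as exc:
        # The standard json encoder cannot serialize UUID objects.
        print("unguarded:", exc)
    print("guarded:", json.dumps({"triggering_event_id": stringify_event_id(event_id)}))
```
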
200ec0c97a Add credential routing instructions for all agent runtimes
Propagate shared credential-routing section (Codex, Claude, Grok, llm-connect)
from state-hub template via scripts/propagate_credential_routing.py.
2026-06-18 22:48:37 +02:00
42e5ef725c Document issue-core emission contract in AGENTS.md
Add ISSUE_CORE_URL, ISSUE_CORE_API_KEY, and ISSUE_SINK_TYPE guidance so
agents pair keys locally or via OpenBao instead of requesting them from
ops-warden.
2026-06-18 22:34:59 +02:00
a08bd1684f Add ISSUE_CORE_API_KEY auth to IssueCoreRestSink
Issue-core requires a shared ingestion key on POST /issues/. The REST sink
now sends Authorization: Bearer using ISSUE_CORE_API_KEY and fails fast
when the key is missing under ISSUE_SINK_TYPE=rest.

Updates .env.example, emission boundary docs, and unit tests for the
header contract and missing-key error.
2026-06-18 22:30:13 +02:00
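
A minimal sketch of the header contract and the fail-fast check, assuming the key is read from the environment at call time. The function name `issue_core_headers` and the behaviour for non-`rest` sink types (no auth header) are assumptions; only the variable names and the Bearer scheme come from the commit.

```python
import os
from typing import Mapping, Optional


def issue_core_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build auth headers for POST /issues/, failing fast on a missing key."""
    env = os.environ if env is None else env
    if env.get("ISSUE_SINK_TYPE") != "rest":
        # Assumed: other sink types do not call issue-core over HTTP.
        return {}
    key = env.get("ISSUE_CORE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ISSUE_CORE_API_KEY is required when ISSUE_SINK_TYPE=rest")
    return {"Authorization": f"Bearer {key}"}
```

Raising at header-construction time surfaces a misconfigured deployment before the first emission rather than as a 401 from issue-core.
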
2078915854 Add reuse-surface report gaps resolver 2026-06-18 17:58:00 +02:00
23f4956b68 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-18:
  - update .custodian-brief.md for activity-core
2026-06-18 17:52:38 +02:00
764339e490 chore(consistency): renormalize lifecycle state [auto]
Updated by fix-consistency on 2026-06-18:
  - workplan status: ready → active
2026-06-18 17:52:33 +02:00
17e2e39165 Track definition schedule hot reload 2026-06-18 15:21:59 +02:00
6518ecefce chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-18:
  - update .custodian-brief.md for activity-core
2026-06-18 15:20:03 +02:00
727868a245 Finish event payload resolver workplan 2026-06-18 15:15:07 +02:00
a279d59f73 Add kaizen agent project assets 2026-06-18 15:14:20 +02:00
23e2316dff Harden coding retro resolver selection 2026-06-18 15:13:08 +02:00
206bb336d2 Wire llm-connect runtime for daily triage 2026-06-18 15:12:31 +02:00
977a3bd97f Align activity-core scope boundaries 2026-06-18 15:11:48 +02:00
78eed5f942 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-18:
  - update .custodian-brief.md for activity-core
2026-06-18 15:09:20 +02:00
717535b62d Close event-payload live smoke handoff 2026-06-18 14:26:27 +02:00
b2816d9776 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-18:
  - update .custodian-brief.md for activity-core
2026-06-18 14:05:59 +02:00
0554014083 Add event-payload context resolver 2026-06-18 14:01:11 +02:00
b84e474ac5 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-18:
  - update .custodian-brief.md for activity-core
2026-06-18 13:16:24 +02:00
498d90b965 chore: promote coulomb-loop pilot schedule to daily stabilize phase 2026-06-18 12:09:25 +02:00
a2a6a30d8b chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-18:
  - update .custodian-brief.md for activity-core
2026-06-18 12:07:56 +02:00
9a72c9f210 fix: unwrap single-key kaizen resolver payloads in resolve_context
When discover_kaizen_projects returns {"projects": [...]} bound to
context.projects, for_each can iterate the list directly. Multi-key
summaries (e.g. repo SBOM bulk) remain unchanged.
2026-06-18 08:11:09 +02:00
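
The unwrapping rule can be sketched as follows. This is an illustration under assumptions, not the `resolve_context` implementation: the helper name is hypothetical, and the extra condition that the single value must be a list is a guess at how "iterate the list directly" is decided.

```python
from typing import Any


def unwrap_single_key(payload: Any) -> Any:
    """Return the inner list of a {"key": [...]} payload; otherwise pass through."""
    if isinstance(payload, dict) and len(payload) == 1:
        (value,) = payload.values()
        if isinstance(value, list):
            return value
    # Multi-key summaries and non-list values stay unchanged.
    return payload
```
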
517bf9c133 Add kaizen context resolver for scheduled agent fleet discovery.
Implement discover_kaizen_scheduled_repos and discover_kaizen_projects per
kaizen-agentic ADR-005 contract: State Hub roster, roster.yaml filter, schedule
validation, and prepare_command emission. Register kaizen/resolver/shell source
types with unit tests and runbook dry-run instructions.
2026-06-18 07:46:46 +02:00
29bf87a44c Opt in to coulomb-loop kaizen bootstrap scheduling.
Add .kaizen/schedule.yml for coach and optimization agent runs during the
hourly bootstrap phase of the coulomb-loop engagement.
2026-06-18 04:53:51 +02:00
82 changed files with 7108 additions and 227 deletions

@@ -0,0 +1,50 @@
# Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=activity-core` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes**: `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`

@@ -1,11 +1,11 @@
 ## First Session Protocol
-Triggered when `get_domain_summary("custodian")` shows **no workstreams**.
+Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
 The project is registered but work has not yet been structured.
 **Step 1 — Read, don't write**
-- `~/the-custodian/canon/projects/custodian/project_charter_v0.1.md` — purpose, scope
-- `~/the-custodian/canon/projects/custodian/roadmap_v0.1.md` — planned phases
+- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
+- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
 - Scan repo root: README, directory structure, existing code or docs
 **Step 2 — Survey in-progress work**
@@ -17,7 +17,7 @@ roadmap phase. **Wait for approval before creating.**
 **Step 4 — Create workplan file first, then DB record (ADR-001)**
 ```
-workplans/activity-core-WP-NNNN-<slug>.md ← write this first
+workplans/ACTIVITY-WP-NNNN-<slug>.md ← write this first
 ```
 Then register in the hub:
 ```
@@ -28,7 +28,7 @@ create_task(workstream_id="<id>", title="...", priority="high|medium|low")
 **Step 5 — Record the setup**
 ```
 add_progress_event(
-summary="First session: structured custodian into N workstreams, M tasks",
+summary="First session: structured infotech into N workstreams, M tasks",
 event_type="milestone",
 topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
 detail={"workstreams": [...], "tasks_created": M}

@@ -1,5 +1,5 @@
 **Purpose:** Durable task factory built on Temporal. Manages ActivityDefinitions, schedules recurring workflows via Temporal Schedules, routes events via NATS JetStream, and exposes a FastAPI CRUD surface for the custodian domain.
-**Domain:** custodian
+**Domain:** infotech
 **Repo slug:** activity-core
 **Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a

@@ -1,6 +1,7 @@
 ## Session Protocol
-State Hub: http://127.0.0.1:8000
+Dev Hub (State Hub API): http://127.0.0.1:8000
+MCP server name in `~/.claude.json`: `dev-hub`
 **Step 1 — Orient**
@@ -10,7 +11,7 @@ cat .custodian-brief.md
 ```
 Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
 ```
-get_domain_summary("custodian")
+get_domain_summary("infotech")
 ```
 If MCP tools are unavailable in the current agent session, use the REST API:
 ```bash
@@ -39,11 +40,11 @@ curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
 ls workplans/
 ```
 For each file with `status: ready`, `active`, or `blocked`, note pending
-`todo`/`in_progress` tasks.
+`wait`/`todo`/`progress` tasks.
 **Step 4 — Present brief**
-1. **Active workstreams** for `custodian` — title, task counts, blocking decisions
+1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
 2. **Pending tasks** from `workplans/` + any `[repo:activity-core]` hub tasks
 3. **Goal guidance** — if `goal_guidance` in summary:
 - `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*

@@ -1,7 +1,7 @@
 ## Workplan Convention (ADR-001)
-File location: `workplans/activity-core-WP-NNNN-<slug>.md`
-ID prefix: `ACTIVITY-WP`
+File location: `workplans/ACTIVITY-WP-NNNN-<slug>.md`
+ID prefix: `ACTIVITY-WP-`
 Work items originate as files in this repo **before** being registered in the hub.
@@ -12,7 +12,7 @@ repo state, and `finished` when implementation is complete. `stalled` and
 `needs_review` are derived health labels, not stored statuses.
 Closed workplans may be moved to `workplans/archived/` with a completion-date
-prefix: `YYMMDD-activity-core-WP-NNNN-<slug>.md`. The frontmatter id remains
+prefix: `YYMMDD-ACTIVITY-WP-NNNN-<slug>.md`. The frontmatter id remains
 unchanged; the prefix is only for quick visual reference.
 Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
@@ -25,4 +25,16 @@ Ecosystem todos from other agents arrive as `[repo:activity-core]` hub tasks —
 visible at session start. Pick one up by creating the workplan file, then registering
 the workstream.
+Task blocks use this shape:
+```task
+id: ACTIVITY-WP-NNNN-T01
+status: wait | todo | progress | done | cancel
+priority: high | medium | low
+state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
+```
+Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
+blocked work and `cancel` for stopped work.
 <!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->

@@ -1,18 +1,56 @@
 <!-- custodian-brief: generated by fix-consistency — do not edit manually -->
 # Custodian Brief — activity-core
-**Domain:** custodian
-**Last synced:** 2026-06-17 21:59 UTC
+**Domain:** infotech
+**Last synced:** 2026-06-29 23:50 UTC
 **State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
 ## Active Workstreams
+### Automation schedule inventory Make targets
+Progress: 0/5 done | workstream_id: `21c73763-9adc-42f6-8fd2-1b8b33c2c770`
+**Open tasks:**
+- · Task: Define the automation inventory contract `8de24590`
+- · Task: Implement a non-mutating inventory CLI `538cb9a5`
+- · Task: Add Make targets `f2001721`
+- · Task: Document the inventory workflow `f687743b`
+- · Task: Verify against current repo and live/degraded sources `5317b532`
+### LLM Output Robustness & The Producer Trust Boundary
+Progress: 3/10 done | workstream_id: `4ef0d53b-1777-41ae-80c6-1b69fdb34726`
+**Open tasks:**
+- ! Reproduce & Root-Cause The Failure `74fd16a5`
+*(wait: Local analysis complete: mechanism is the unbounded ~1-per-workstream recommendation list (16 active workstreams; break at char 5268 ~rank 8-9); both first attempt and retry failed. Exact token + finish_reason are unrecoverable from activity-core (complete() drops finish_reason; report cap 4000 < 5268; log cap 2000). Remaining: pull llm-connect producer-side logs on railiance01 (cluster/operator-owned). Does NOT block T02/T03 — mitigation is identical regardless.)*
+- ► Tests + calibration re-entry `b7b9e07a`
+- ► Schema + Prompt Redesign For Error Locality `ae67ca8c`
+- ► Tests + Calibration Re-Entry `c881500b`
+- · Reproduce & root-cause the 06-26 validation failure `2d3bba00`
+- · Schema + prompt redesign for error locality `5da6962c`
+- · Boundary parser — verify & mitigate with quarantine lane `4c408114`
 ### Post-triage operational hardening
-Progress: 5/6 done | workstream_id: `5646e13a-13af-4724-bca6-3c0d86f96733`
+Progress: 7/8 done | workstream_id: `5646e13a-13af-4724-bca6-3c0d86f96733`
**Open tasks:** **Open tasks:**
- ! Three-Run Calibration Feedback `7cbf0a35` - ! Three-Run Calibration Feedback `7cbf0a35`
### Adopt State Hub Beachhead Endpoint
Progress: 0/2 done | workstream_id: `bbc07f9e-9323-4b2b-b556-c33b37d0b228`
**Open tasks:**
- ! Point STATE_HUB_URL at the beachhead `76b6132d`
- ! Retire the bespoke actcore-state-hub-bridge proxy `526c2129`
### Daily Triage LLM Reconciliation And Evidence
Progress: 2/5 done | workstream_id: `f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9`
**Open tasks:**
- ! Run Daily Triage Fixture Smoke `10e0df77`
- ! Collect Three Clean Scheduled Runs `dc6b9482`
- ! Close Handoff State `ecc57e21`
### Intent gap closure ### Intent gap closure
Progress: 4/6 done | workstream_id: `d64cfbba-6da7-4737-afb9-866afa0e9cda` Progress: 4/6 done | workstream_id: `d64cfbba-6da7-4737-afb9-866afa0e9cda`
@@ -30,6 +68,6 @@ Progress: 2/3 done | workstream_id: `7387fc50-1f2c-471a-9d85-bb085cbd0b63`
## MCP Orientation (when available) ## MCP Orientation (when available)
If the state-hub MCP server is reachable, call: If the state-hub MCP server is reachable, call:
`get_domain_summary("custodian")` `get_domain_summary("infotech")`
This provides richer cross-domain context. This provides richer cross-domain context.
If the MCP call fails, use this file as your orientation source. If the MCP call fails, use this file as your orientation source.


@@ -18,14 +18,17 @@ STATE_HUB_URL=http://127.0.0.1:8000
# Repo scoping — used by the repo-scoping context adapter. Binds {} on failure. # Repo scoping — used by the repo-scoping context adapter. Binds {} on failure.
REPO_SCOPING_URL=http://127.0.0.1:8020 REPO_SCOPING_URL=http://127.0.0.1:8020
# Issue Core — task emission backend. # Issue Core — task emission backend.
ISSUE_CORE_URL=http://127.0.0.1:8010 ISSUE_CORE_URL=http://127.0.0.1:8765
# Shared ingestion key — must match issue-core's ISSUE_CORE_API_KEY.
ISSUE_CORE_API_KEY=
# Sink type: 'rest' (POST to issue-core) or 'null' (discard, for dry-run). # Sink type: 'rest' (POST to issue-core) or 'null' (discard, for dry-run).
ISSUE_SINK_TYPE=rest ISSUE_SINK_TYPE=rest
# ── Activity definitions ─────────────────────────────────────────────────────── # ── Activity definitions ───────────────────────────────────────────────────────
# Colon-separated paths to additional activity-definitions/ directories. # Colon-separated paths to additional activity-definitions/ directories.
# The local activity-definitions/ directory is always scanned. # The local activity-definitions/ directory is always scanned.
ACTIVITY_DEFINITION_DIRS= # Coulomb-loop kaizen engagement definitions (colon-separated for more roots).
ACTIVITY_DEFINITION_DIRS=/home/worsch/coulomb-loop
# ── Observability ───────────────────────────────────────────────────────────── # ── Observability ─────────────────────────────────────────────────────────────
# Prometheus metrics bind address (Temporal SDK metrics). # Prometheus metrics bind address (Temporal SDK metrics).


@@ -0,0 +1,24 @@
---
agent: coach
project: activity-core
last_updated: 2026-06-18
session_count: 0
---
## Project Context
<!-- What this agent knows about the project it works in -->
## Accumulated Findings
<!-- Patterns, recurring issues, key decisions encountered -->
## What Worked
<!-- Approaches that produced good results in this project -->
## Watch Points
<!-- Recurring risks, traps, or areas requiring extra care -->
## Open Threads
<!-- Things noticed but not yet acted on -->
## Session Log
<!-- One-line entry per session: date · summary · outcome -->


@@ -0,0 +1,24 @@
---
agent: optimization
project: activity-core
last_updated: 2026-06-18
session_count: 0
---
## Project Context
<!-- What this agent knows about the project it works in -->
## Accumulated Findings
<!-- Patterns, recurring issues, key decisions encountered -->
## What Worked
<!-- Approaches that produced good results in this project -->
## Watch Points
<!-- Recurring risks, traps, or areas requiring extra care -->
## Open Threads
<!-- Things noticed but not yet acted on -->
## Session Log
<!-- One-line entry per session: date · summary · outcome -->


@@ -0,0 +1,2 @@
{"agent": "coach", "execution_time_s": 120.0, "quality_score": 0.85, "success": true, "timestamp": "2026-06-18T06:10:35Z"}
{"agent": "coach", "execution_time_s": 118.0, "quality_score": 0.86, "success": true, "timestamp": "2026-06-18T10:06:38Z"}


@@ -0,0 +1,12 @@
{
"agent": "coach",
"avg_execution_time_s": 119.0,
"avg_quality_score": 0.855,
"execution_count": 2,
"last_execution": "2026-06-18T10:06:38Z",
"success_rate": 1.0,
"trend": {
"quality_score": "stable",
"success_rate": "stable"
}
}


@@ -0,0 +1,2 @@
{"agent": "optimization", "execution_time_s": 90.0, "quality_score": 0.8, "success": true, "timestamp": "2026-06-18T06:10:35Z"}
{"agent": "optimization", "execution_time_s": 88.0, "quality_score": 0.81, "success": true, "timestamp": "2026-06-18T10:06:38Z"}


@@ -0,0 +1,12 @@
{
"agent": "optimization",
"avg_execution_time_s": 89.0,
"avg_quality_score": 0.805,
"execution_count": 2,
"last_execution": "2026-06-18T10:06:38Z",
"success_rate": 1.0,
"trend": {
"quality_score": "stable",
"success_rate": "stable"
}
}


@@ -0,0 +1,59 @@
{
"agents": [
{
"agent_name": "coach",
"meets_sample_threshold": false,
"metrics_count": 2,
"optimization_cycles": 0,
"performance_analysis": {
"analysis_timestamp": "2026-06-18T12:06:39.212809",
"avg_execution_time": 119.0,
"avg_quality_score": 0.855,
"avg_success_rate": 1.0,
"execution_time_trend": -0.01680672268907563,
"quality_score_trend": 0.01169590643274855,
"success_rate_trend": 0.0,
"window_size": 2
},
"recommendations": [
{
"details": "Average execution time: 119.00s",
"message": "Consider optimizing execution time",
"priority": "high",
"type": "performance"
}
],
"report_timestamp": "2026-06-18T12:06:39.213012",
"sample_threshold": 10
},
{
"agent_name": "optimization",
"meets_sample_threshold": false,
"metrics_count": 2,
"optimization_cycles": 0,
"performance_analysis": {
"analysis_timestamp": "2026-06-18T12:06:39.220252",
"avg_execution_time": 89.0,
"avg_quality_score": 0.805,
"avg_success_rate": 1.0,
"execution_time_trend": -0.02247191011235955,
"quality_score_trend": 0.012422360248447215,
"success_rate_trend": 0.0,
"window_size": 2
},
"recommendations": [
{
"details": "Average execution time: 89.00s",
"message": "Consider optimizing execution time",
"priority": "high",
"type": "performance"
}
],
"report_timestamp": "2026-06-18T12:06:39.220417",
"sample_threshold": 10
}
],
"min_samples": 10,
"optimized_at": "2026-06-18",
"project": "activity-core"
}
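The report values above can be reproduced from the two JSONL records per agent. The sketch below assumes each trend is the relative change `(last - first) / mean`; that formula is inferred from the numbers in this diff (for example `-2 / 119 ≈ -0.0168` for the coach agent), not taken from the kaizen tooling itself. The function name `summarize` is hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def _relative_trend(values):
    """Change from first to last sample, relative to the mean (assumed formula)."""
    avg = _mean(values)
    return (values[-1] - values[0]) / avg if avg else 0.0


def summarize(records, sample_threshold=10):
    """Aggregate per-run metric records into a report-like dict."""
    times = [r["execution_time_s"] for r in records]
    quality = [r["quality_score"] for r in records]
    success = [1.0 if r["success"] else 0.0 for r in records]
    return {
        "metrics_count": len(records),
        "meets_sample_threshold": len(records) >= sample_threshold,
        "avg_execution_time": _mean(times),
        "avg_quality_score": _mean(quality),
        "avg_success_rate": _mean(success),
        "execution_time_trend": _relative_trend(times),
        "quality_score_trend": _relative_trend(quality),
        "success_rate_trend": _relative_trend(success),
    }


coach_runs = [
    {"agent": "coach", "execution_time_s": 120.0, "quality_score": 0.85, "success": True},
    {"agent": "coach", "execution_time_s": 118.0, "quality_score": 0.86, "success": True},
]
report = summarize(coach_runs)
```

With only two samples against a threshold of 10, `meets_sample_threshold` is false, so the "high" priority recommendation in the report rests on very little data.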

.kaizen/schedule.yml Normal file

@@ -0,0 +1,15 @@
# Kaizen scheduled agent execution manifest (ADR-005)
# Engagement: coulomb-loop bootstrap — weekly cadence
# Regulator promotes cadence per customer engagement policy (ADR-003).
# Validate with: kaizen-agentic schedule validate
version: '1'
timezone: Europe/Berlin
agents:
coach:
cadence: weekly
cron: 0 9 * * 1
enabled: true
optimization:
cadence: weekly
cron: 0 10 * * 1
enabled: true

.repo-classification.yaml Normal file

@@ -0,0 +1,28 @@
# Repo classification (Repo Classification Standard v1.0).
repo_classification:
standard: Repo Classification Standard
version: '1.0'
classified_at: '2026-06-22'
classified_by: human
category: tooling
domain: infotech
secondary_domains:
- agents
capability_tags:
- workflow
- orchestration
- automation
- coordination
- observability
business_stake:
- technology
- operations
- automation
- execution
business_mechanics:
- coordination
- operation
- adaptation
notes: Org-wide event bridge / task factory (Temporal-based). Active bounded implementation
-> project.


@@ -4,7 +4,7 @@
**Purpose:** Durable task factory built on Temporal. Manages ActivityDefinitions, schedules recurring workflows via Temporal Schedules, routes events via NATS JetStream, and exposes a FastAPI CRUD surface for the custodian domain. **Purpose:** Durable task factory built on Temporal. Manages ActivityDefinitions, schedules recurring workflows via Temporal Schedules, routes events via NATS JetStream, and exposes a FastAPI CRUD surface for the custodian domain.
**Domain:** custodian **Domain:** infotech
**Repo slug:** activity-core **Repo slug:** activity-core
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a` **Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
**Workplan prefix:** `ACTIVITY-WP-` **Workplan prefix:** `ACTIVITY-WP-`
@@ -83,7 +83,7 @@ curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe) 1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
2. Check inbox: `GET /messages/?to_agent=activity-core&unread_only=true`; mark read 2. Check inbox: `GET /messages/?to_agent=activity-core&unread_only=true`; mark read
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks 3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
4. Check blocked tasks: `GET /tasks/?needs_human=true` 4. Check human-needed tasks: `GET /tasks/?needs_human=true`
**During work:** **During work:**
- Update task statuses in workplan files as tasks progress - Update task statuses in workplan files as tasks progress
@@ -101,6 +101,63 @@ curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
--- ---
## Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=activity-core` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
<!-- REPO-AGENTS-EXTENSIONS -->
<!-- Append repo-specific agent instructions below this marker.
The state-hub template sync preserves content after this line. -->
---
## Workplan Convention (ADR-001) ## Workplan Convention (ADR-001)
Work items originate as files in this repo — not in the hub. The hub is a Work items originate as files in this repo — not in the hub. The hub is a
@@ -124,7 +181,7 @@ anything needing analysis, design, approval, dependencies, or multiple phases.
id: ACTIVITY-WP-NNNN id: ACTIVITY-WP-NNNN
type: workplan type: workplan
title: "..." title: "..."
domain: custodian domain: infotech
repo: activity-core repo: activity-core
status: proposed | ready | active | blocked | backlog | finished | archived status: proposed | ready | active | blocked | backlog | finished | archived
owner: codex owner: codex
@@ -154,10 +211,7 @@ state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
Task description text. Task description text.
``` ```
Status progression: `todo` → `progress` → `done`; use `wait` for a task Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.
blocked on external input and `cancel` for intentionally abandoned work.
Workstream/workplan lifecycle status is separate; frontmatter `blocked` remains
valid there.
To create a new workplan: To create a new workplan:
1. Write the file following the format above 1. Write the file following the format above


@@ -8,4 +8,5 @@
@.claude/rules/stack-and-commands.md @.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md @.claude/rules/architecture.md
@.claude/rules/repo-boundary.md @.claude/rules/repo-boundary.md
@.claude/rules/credential-routing.md
@.claude/rules/agents.md @.claude/rules/agents.md


@@ -1,13 +1,16 @@
-include .env -include .env
export export
.PHONY: sync-event-types sync-activity-definitions test migrate sync-all \ .PHONY: sync-event-types sync-activity-definitions sync-schedules test migrate sync-all \
dev-up dev-down railiance-up railiance-down \ dev-up dev-down railiance-up railiance-down \
start-worker start-api start-event-router help start-worker start-api start-event-router help
sync-activity-definitions: ## Sync ActivityDefinition files into DB sync-activity-definitions: ## Sync ActivityDefinition files into DB
uv run python -m activity_core.sync_activity_definitions uv run python -m activity_core.sync_activity_definitions
sync-schedules: ## Reconcile Temporal schedules from activity_definitions DB
uv run python -m activity_core.sync_schedules
sync-event-types: ## Sync event type YAML files into DB sync-event-types: ## Sync event type YAML files into DB
uv run python scripts/sync_event_types.py uv run python scripts/sync_event_types.py
@@ -52,3 +55,17 @@ help: ## Show this help message
@grep -Eh '^[a-zA-Z_-]+:.*?##' $(MAKEFILE_LIST) | \ @grep -Eh '^[a-zA-Z_-]+:.*?##' $(MAKEFILE_LIST) | \
awk 'BEGIN {FS = ":.*?## "}; {printf " \033[36m%-24s\033[0m %s\n", $$1, $$2}' | \ awk 'BEGIN {FS = ":.*?## "}; {printf " \033[36m%-24s\033[0m %s\n", $$1, $$2}' | \
sort sort
# Agent Management Targets
agents-list:
@echo "Installed agents:"
@ls agents/ 2>/dev/null | grep agent- | sed 's/agent-//g' | sed 's/.md//g' \
|| echo "No agents installed"
agents-update:
@echo "Updating agents..."
@kaizen-agentic update
agents-validate:
@echo "Validating agents..."
@kaizen-agentic validate agents/

SCOPE.md

@@ -1,7 +1,7 @@
--- ---
domain: capabilities domain: capabilities
repo: activity-core repo: activity-core
updated: "2026-06-03" updated: "2026-06-16"
--- ---
# SCOPE # SCOPE
@@ -16,7 +16,8 @@ updated: "2026-06-03"
activity-core is the org-wide Event Bridge for the Coulomb organization — a activity-core is the org-wide Event Bridge for the Coulomb organization — a
rule-governed event loop that receives time-based and domain events, evaluates rule-governed event loop that receives time-based and domain events, evaluates
declarative rules and LLM instructions against current org context, and emits declarative rules and LLM instructions against current org context, and emits
structured task sets to issue-core. structured task, report, and evidence outputs without owning downstream task
lifecycle.
--- ---
@@ -27,8 +28,11 @@ An `ActivityDefinition` (a markdown file checked into a repo) declares a trigger
resolve before evaluation, and a set of rules and instructions that determine resolve before evaluation, and a set of rules and instructions that determine
what tasks to create. When triggered, a durable Temporal workflow loads the what tasks to create. When triggered, a durable Temporal workflow loads the
definition, resolves context, evaluates the rule/instruction set, and emits task definition, resolves context, evaluates the rule/instruction set, and emits task
creation requests to issue-core. Everything is auditable: the spawn log records creation requests to issue-core or configured dry-run/audit sinks. Instructions
the triggering event, matched rule, and resulting task references. may also emit validated reports, and selected context resolvers may emit compact
non-secret evidence. Everything is auditable: the spawn log records the
triggering event, matched rule/instruction metadata, model/prompt hash where
applicable, and resulting task references.
The two evaluation modes: The two evaluation modes:
- **Rule** — deterministic condition (sandboxed Python-like DSL) → fixed task - **Rule** — deterministic condition (sandboxed Python-like DSL) → fixed task
@@ -48,21 +52,35 @@ The two evaluation modes:
attribute schemas, example payloads, and intent documentation. attribute schemas, example payloads, and intent documentation.
Curator-gating configurable per runtime environment. Curator-gating configurable per runtime environment.
- **Trigger types**: 5-field cron with timezone and misfire policy; one-off - **Trigger types**: 5-field cron with timezone and misfire policy; one-off
scheduled datetime; event-type subscription via NATS. scheduled datetime; event-type subscription via NATS; manual one-shot API
trigger; one-shot schedule smoke tests for recurring definitions.
- **Context resolution adapters**: repo-scoping (repository capability queries), - **Context resolution adapters**: repo-scoping (repository capability queries),
state hub (domain and workstream state), extensible for other sources. State Hub (domain/workstream state, SBOM status, daily triage digest, coding
retro read model), and ops inventory (bounded HTTP/HTTPS probes of a
non-secret service inventory). The adapter registry is extensible for other
sources.
- **Rule evaluator**: sandboxed AST walker for Python-like boolean expressions - **Rule evaluator**: sandboxed AST walker for Python-like boolean expressions
over event attributes and resolved context. Rule actions support safe over event attributes and resolved context. Rule actions support safe
`context.*` / `event.*` interpolation and explicit `for_each` per-item `context.*` / `event.*` interpolation and explicit `for_each` per-item
binding. No `exec()`. binding. No `exec()`.
- **Instruction executor**: trusted-field prompt rendering, LLM call via - **Instruction executor**: trusted-field prompt rendering, LLM call via
llm-connect, structured output validation, optional curator review queue, llm-connect, structured output validation, item-granular recovery with a
and deterministic report sinks. quarantine lane and producer guardrails (count/length/depth caps, reference
allow-list) at the producer trust boundary, bounded validation-failure
artifacts for report instructions, review-required audit metadata, and
deterministic report sinks. A real downstream review queue is not implemented
in this repo.
- **Task emission adapter**: abstraction over issue-core; current transport is - **Task emission adapter**: abstraction over issue-core; current transport is
REST; designed to migrate to NATS subscription without code changes. REST, with `ISSUE_SINK_TYPE=null` for dry-run/audit mode. It is designed to
migrate to a durable issue-core-owned NATS command boundary when issue-core
provides that contract.
- **Report sinks**: instruction report outputs can be persisted to bounded - **Report sinks**: instruction report outputs can be persisted to bounded
local working memory and posted as State Hub progress events. These are local working memory and posted as State Hub progress events. These are
reporting outputs, not task lifecycle ownership. reporting outputs, not task lifecycle ownership.
- **Ops evidence sinks**: `ops-inventory` context sources can post compact
non-secret `ops_inventory_probe` summaries to State Hub. Inter-Hub submission
is present only as a gated/deferred sink result until operator-owned
`OPS_HUB_KEY` custody and widget mapping are ready.
- **Spawn audit log**: every task emission recorded with rule/instruction id, - **Spawn audit log**: every task emission recorded with rule/instruction id,
triggering event id, model and prompt hash (instructions), issue-core task ref. triggering event id, model and prompt hash (instructions), issue-core task ref.
- **Webhook receiver**: HTTP endpoint normalising inbound Gitea/GitHub webhook - **Webhook receiver**: HTTP endpoint normalising inbound Gitea/GitHub webhook
@@ -84,6 +102,14 @@ The two evaluation modes:
coordinated changes belong to project-core (future). coordinated changes belong to project-core (future).
- **Execution of automatable tasks** — Temporal Activities that do real work - **Execution of automatable tasks** — Temporal Activities that do real work
(run a scan, apply a patch, call an API) live in per-repo workers, not here. (run a scan, apply a patch, call an API) live in per-repo workers, not here.
- **General ops execution** — Kubernetes, SSH, tunnel, authenticated service
checks, secret custody, OpenBao writes, and Inter-Hub widget/API-key
provisioning belong to the owning operational repos and operator workflows.
activity-core may record non-secret probe evidence; it must not become the ops
control plane.
- **Service inventory authority** — the Custodian inventory remains owned by
the custodian/state-hub surface. activity-core may read a projected
non-secret snapshot.
- **Event broker hosting** — NATS JetStream is org infrastructure; activity-core - **Event broker hosting** — NATS JetStream is org infrastructure; activity-core
consumes it but does not own its lifecycle. consumes it but does not own its lifecycle.
- **Temporal server hosting** — activity-core uses the Temporal SDK; the server - **Temporal server hosting** — activity-core uses the Temporal SDK; the server
@@ -101,6 +127,9 @@ The two evaluation modes:
structured tasks in the right repos." structured tasks in the right repos."
- You need one-off future task scheduling without a separate reminder system. - You need one-off future task scheduling without a separate reminder system.
- You want an auditable record of what triggered what and why. - You want an auditable record of what triggered what and why.
- You need a scheduled, non-secret evidence note proving that declared service
endpoints or access paths were observed, without executing privileged ops
commands.
- You are replacing scattered bespoke cron jobs and manual coordination with - You are replacing scattered bespoke cron jobs and manual coordination with
a governed, observable automation layer. a governed, observable automation layer.
@@ -117,29 +146,45 @@ The two evaluation modes:
## Current State ## Current State
- **Status**: active production-backed service. Foundation, triggers/ops, - **Status**: active production-backed service with two visible open gates:
event bridge, Railiance deployment, and the production service workplans are `ACTIVITY-WP-0006` still waits on three clean consecutive scheduled daily
complete. The stale March WP-0002 handoff note has been reconciled and triage runs and calibration feedback, and `ACTIVITY-WP-0008` is blocked until
archived. Helix Forge publishes the upstream `coding_retro` read model needed to enable
the Saturday schedule. `ACTIVITY-WP-0007` is finished: the bounded
ops-inventory probe/evidence slice has live Railiance evidence.
- **Implementation**: core is functional. `RunActivityWorkflow`, - **Implementation**: core is functional. `RunActivityWorkflow`,
`TaskExecutorWorkflow` (stub), PostgreSQL schema, Temporal Schedules, NATS `TaskExecutorWorkflow` (stub), PostgreSQL schema, Temporal Schedules and smoke
Event Router, FastAPI admin API, Prometheus metrics, event type registry, schedules, NATS Event Router, FastAPI admin API, Prometheus metrics, event
markdown ActivityDefinition parser/sync, rule evaluator, instruction type registry, markdown ActivityDefinition parser/sync, rule evaluator,
executor, context resolvers, issue sink, report sinks, Kubernetes deployment, instruction executor, context resolvers, issue sink, report sinks, ops
and operational runbook are all implemented. evidence sink, Kubernetes deployment, and operational runbook are all
- **Operational proof**: the daily State Hub WSJF triage cutover has completed implemented.
far enough that activity-core is now the trusted scheduled substrate for the - **Current definitions**: `weekly-sbom-staleness` is enabled and demonstrates
routine report. Recent hardening fixed the State Hub SBOM resolver contract, the deterministic rule/fan-out path. `weekly-coding-retro` is present and
made slow LLM activity timeouts configurable, and added safe rule action tested but intentionally disabled until live `coding_retro` evidence exists.
interpolation plus explicit `for_each` binding for per-repo SBOM staleness Railiance projects the daily State Hub WSJF triage definition and the disabled
tasks. ops-service-inventory probe definition from the runtime bundle.
- **Stability**: construction risk has shifted to operational hardening risk. - **Operational proof**: the State Hub daily WSJF triage path has produced
The full test suite passed on 2026-06-03 (`125 passed, 1 skipped`). The validated reports and working-memory notes, but the calibration gate is not
remaining work is mostly observability, status-canon adaptation, contract closed. A 2026-06-16 recheck found State Hub `daily_triage` progress and
documentation, and broader production adoption rather than first working-memory `daily-triage-*` notes only through 2026-06-06, so there is not
implementation. yet evidence for three clean consecutive scheduled runs after the June 7
- **Next**: `ACTIVITY-WP-0006` — post-triage operational hardening and scope runtime projection failure. The ops inventory probe path has live fallback
alignment. evidence in State Hub; Inter-Hub per-entity submission remains deferred.
- **Task emission posture**: the issue-core REST sink is implemented, but the
Railiance runtime currently uses `ISSUE_SINK_TYPE=null` dry-run/audit mode.
Switching to live issue-core task creation requires a verified endpoint,
credentials, and duplicate-handling check in the target environment.
- **Stability**: construction risk has shifted to operational hardening and
adoption risk. The last recorded full-suite pass in the workplans was
2026-06-04 (`128 passed, 1 skipped`), with later targeted coverage added for
ops inventory, ops evidence sinks, Railiance projection wiring, and weekly
coding retro parsing/rule behavior.
- **Next**: close `ACTIVITY-WP-0006-T03` with real scheduled-run calibration
evidence; close `ACTIVITY-WP-0008-T03` once upstream `coding_retro` publication
exists and the dry-run/duplicate check passes; decide when to move selected
task/report/evidence sinks from dry-run or fallback mode to their intended
live backends.
--- ---
@@ -159,9 +204,9 @@ database, the project planner, or a general execution worker. The local
workplan explicitly rehomes execution responsibility. workplan explicitly rehomes execution responsibility.
One boundary nuance is now explicit: activity-core may post State Hub progress One boundary nuance is now explicit: activity-core may post State Hub progress
events as a configured report sink. That is acceptable because it records the events as a configured report or evidence sink. That is acceptable because it
result of an activity-core activation; it is not ownership of State Hub state, records the result of an activity-core activation; it is not ownership of State
task lifecycle, or workstream planning. Hub state, task lifecycle, or workstream planning.
The main drift risk is convenience creep: adding direct task tracking, The main drift risk is convenience creep: adding direct task tracking,
project-phase state, or bespoke operational scripts because the Temporal project-phase state, or bespoke operational scripts because the Temporal
@@ -169,27 +214,58 @@ substrate is already nearby. Future work should prefer declarative
ActivityDefinitions, bounded context resolvers, and outbound adapters over ActivityDefinitions, bounded context resolvers, and outbound adapters over
new one-off control paths. new one-off control paths.
## Known Gaps Against Intent
- **Scheduled-run trust gap**: INTENT promises recurring coordination work that
runs without Bernd as the manual coordination layer. The daily triage path is
implemented, but its current calibration task still lacks three clean
consecutive scheduled runs after the June 7 runtime failure. Until that closes,
daily triage remains a production-backed capability with an evidence gap, not
a fully proven standing substrate.
- **Task creation gap**: INTENT says activations emit task creation requests to
issue-core. The REST sink exists, but Railiance is still in `ISSUE_SINK_TYPE=null`
mode. That preserves auditability and avoids accidental duplicate/live tasks,
but it means production schedules are not yet consistently creating real
issue-core tasks.
- **Review queue gap**: `review_required` is explicitly metadata only in the
current contract. No issue-core review queue integration exists here, so any
future queue routing needs a downstream issue-core contract before high-impact
instruction outputs rely on it.
- **Evidence backend posture**: the State Hub fallback evidence path is the
accepted current backend for `ops_inventory_probe`. Inter-Hub/ops-hub
submission is deliberately deferred behind `OPS_HUB_KEY`, widget mapping, and
operator approval, so per-entity ops evidence publication is future work.
- **Execution-boundary residue**: `TaskExecutorWorkflow` is still registered as
a stub that writes a done `task_instances` row. It should remain inert or be
removed/re-homed before it attracts real execution work, because execution is
explicitly outside activity-core's intent.
- **API exposure posture**: the FastAPI surface stays ClusterIP-only for now.
External ingress remains future work until an authenticated access policy is
designed.
--- ---
## How It Fits ## How It Fits
``` ```
[NATS JetStream] ← publishers: state hub, Gitea webhooks, Temporal signals, cron [NATS JetStream] ← publishers: State Hub, Gitea webhooks, Temporal signals, cron
[activity-core] ← event type registry, rule evaluator, instruction executor [activity-core] ← event type registry, rule evaluator, instruction executor
[activity-core] → [issue-core] → [repos/services] [activity-core] → [issue-core] → [repos/services]
[activity-core] → [report sinks] [activity-core] → [report/evidence sinks] → [State Hub / working memory / future Inter-Hub]
``` ```
- **Upstream**: NATS (event bus), Temporal (durable workflow engine), PostgreSQL - **Upstream**: NATS (event bus), Temporal (durable workflow engine), PostgreSQL
(definitions and audit log), repo-scoping (context adapter), state hub (context (definitions and audit log), repo-scoping (context adapter), State Hub (context
adapter and event publisher). adapter and event publisher).
- **Downstream**: issue-core (task management) and configured report sinks. - **Downstream**: issue-core (task management) and configured report/evidence sinks.
Agents and humans pick up tasks from issue-core and do the actual work. Agents and humans pick up tasks from issue-core and do the actual work.
Railiance may use the null sink for dry-run/audit mode until live issue-core
emission is approved.
- **Coordinates with**: the state hub delegates maintenance automations to - **Coordinates with**: the state hub delegates maintenance automations to
activity-core by publishing lifecycle events or by being resolved as context. activity-core by publishing lifecycle events or by being resolved as context.
activity-core may post progress events as report outputs, but it does not own activity-core may post progress events as report/evidence outputs, but it
State Hub task/workstream state. does not own State Hub task/workstream state.
--- ---
@@ -203,6 +279,11 @@ new one-off control paths.
by a sandboxed AST walker. by a sandboxed AST walker.
- **Instruction** — LLM-evaluated task generation with trusted-field prompt - **Instruction** — LLM-evaluated task generation with trusted-field prompt
interpolation and structured output schema enforcement. interpolation and structured output schema enforcement.
- **Report sink** — configured persistence for instruction reports, currently
working-memory markdown notes and State Hub progress events.
- **Evidence sink** — configured persistence for compact non-secret resolver
evidence, currently State Hub progress for ops inventory probes; Inter-Hub is
a deferred gated target.
- **Event type** — a registered, schema-documented category of event (e.g. - **Event type** — a registered, schema-documented category of event (e.g.
`org.repo.registered`). Publisher-declared; curator-gated per environment. `org.repo.registered`). Publisher-declared; curator-gated per environment.
- **Spawn audit trail** — activity-core's local record of what tasks were emitted, - **Spawn audit trail** — activity-core's local record of what tasks were emitted,
@@ -219,8 +300,12 @@ new one-off control paths.
- `issue-core` (formerly issue-facade) — downstream task management; receives - `issue-core` (formerly issue-facade) — downstream task management; receives
all task emission from activity-core. all task emission from activity-core.
- `repo-scoping` — context adapter for repository capability queries. - `repo-scoping` — context adapter for repository capability queries.
- `the-custodian` / state hub — context adapter for domain state; delegates - `the-custodian` / State Hub — context adapter for domain state; delegates
maintenance automation to activity-core via NATS events. maintenance automation to activity-core via NATS events.
- `llm-connect` — instruction execution backend for judgement-oriented reports
such as daily State Hub WSJF triage.
- `inter-hub` / `ops-hub` — future richer ops evidence intake target; currently
operator-gated and not required for the State Hub fallback evidence path.
- `rules-core` (future extraction) — the rule evaluator and instruction executor - `rules-core` (future extraction) — the rule evaluator and instruction executor
module, currently in `src/activity_core/rules/`. module, currently in `src/activity_core/rules/`.
- `project-core` (future) — project and initiative management; will use - `project-core` (future) — project and initiative management; will use
@@ -237,6 +322,9 @@ new one-off control paths.
governance model, event type schema, ActivityDefinition structure. governance model, event type schema, ActivityDefinition structure.
- `docs/adr/adr-003-rule-instruction-model.md` — Rule DSL, Instruction safety - `docs/adr/adr-003-rule-instruction-model.md` — Rule DSL, Instruction safety
model, evaluation semantics, audit trail, testing strategy. model, evaluation semantics, audit trail, testing strategy.
- `docs/adr/adr-004-producer-trust-boundary.md` — untrusted-producer premise,
trust-but-handle vs verify-and-mitigate postures, error-locality and
quarantine-with-provenance, producer guardrails for LLM/agent/human output.
--- ---
@@ -248,7 +336,10 @@ new one-off control paths.
`src/activity_core/activities.py` (Temporal activities), `src/activity_core/activities.py` (Temporal activities),
`src/activity_core/event_router.py` (NATS → Temporal), `src/activity_core/event_router.py` (NATS → Temporal),
`src/activity_core/schedule_manager.py` (Temporal Schedules), `src/activity_core/schedule_manager.py` (Temporal Schedules),
`src/activity_core/api.py` (FastAPI admin). `src/activity_core/api.py` (FastAPI admin),
`src/activity_core/report_sinks.py` (instruction reports),
`src/activity_core/ops_evidence_sinks.py` (ops evidence),
and `src/activity_core/context_resolvers/` (external context adapters).
- Definition files: `event-types/`, `activity-definitions/`, and `tasks/`. - Definition files: `event-types/`, `activity-definitions/`, and `tasks/`.
- Dev environment: `docker-compose.dev.yml` (Temporal + PostgreSQL + NATS). - Dev environment: `docker-compose.dev.yml` (Temporal + PostgreSQL + NATS).
- Entry points: `uv run python -m activity_core.worker` (Temporal worker), - Entry points: `uv run python -m activity_core.worker` (Temporal worker),
@@ -264,6 +355,7 @@ title: Durable event-triggered task factory
description: > description: >
Org-wide Event Bridge that receives time-based and domain events, evaluates Org-wide Event Bridge that receives time-based and domain events, evaluates
declarative rules and LLM instructions against current org context, and emits declarative rules and LLM instructions against current org context, and emits
structured task sets to issue-core with a full spawn audit trail. structured task, report, and evidence outputs with a full spawn/report audit
keywords: [temporal, workflow, event-bridge, task, cron, event, rule, instruction, org-automation] trail while leaving task lifecycle ownership downstream.
keywords: [temporal, workflow, event-bridge, task, report, evidence, cron, event, rule, instruction, org-automation]
``` ```

agents/agent-coach.md Normal file
@@ -0,0 +1,184 @@
---
name: coach
description: Coaching meta-agent that reads all agent memories in a project and synthesises cross-agent briefs and new-agent orientations
category: meta
memory: enabled
---
# Coach Agent
## Role
You are the **kaizen-agentic Coach** — a meta-agent that observes, synthesises,
and advises. You do not perform domain work (coding, testing, infrastructure).
Your sole purpose is to read across the accumulated memories of all agents in a
project and produce useful, targeted briefs.
You are invoked via:
```
kaizen-agentic memory brief <agent-name>
```
Or directly by the operator: *"Coach, brief the sys-medic agent on this project"*
or *"Coach, what patterns have you observed across all agents?"*
---
## What You Do
### 1. Cross-Agent Synthesis
Read all `.kaizen/agents/*/memory.md` files in the current project. Identify:
- **Shared patterns**: themes that appear across multiple agents
(e.g. "three agents flagged missing test coverage as a risk")
- **Cross-domain risks**: signals in one agent's memory that should inform
another (e.g. infrastructure instability flagged by sys-medic → tdd-workflow
should account for flaky environments)
- **Resource or architectural signals**: recurring mentions of specific files,
modules, services, or systems across agents
- **Contradictions or gaps**: where agents hold conflicting assumptions or where
no agent has coverage
### 2. New-Agent Orientation
When asked to brief a specific agent about to be deployed for the first time:
1. Read all existing agent memories in the project
2. Filter for what is relevant to the incoming agent's domain
3. Produce a targeted orientation brief covering:
- **Project context**: what kind of project this is, key constraints
- **What to know first**: the most important facts for this agent
- **Watch points**: risks or pitfalls flagged by other agents that are relevant
- **What has worked**: successful approaches in adjacent domains
- **Open threads**: unresolved items from other agents that may interact with
this agent's work
### 3. Fleet Health Overview
When asked for a fleet overview:
- Summarise the health of the agent fleet: which agents are active, stale, or
missing from the project
- Flag agents with high `session_count` and still-open `## Open Threads`
- Identify agents whose memories suggest overlapping concerns
- Recommend whether any memory files should be reviewed or reset
---
## How to Read Agent Memory Files
Memory files live at `.kaizen/agents/<name>/memory.md` relative to the project
root. Each follows ADR-002 structure:
```
## Project Context ← agent's understanding of the project
## Accumulated Findings ← patterns and recurring issues
## What Worked ← validated approaches
## Watch Points ← risks and traps
## Open Threads ← unresolved items
## Session Log ← chronological session summaries
```
When synthesising, weight `## Watch Points` and `## Open Threads` most heavily —
these are the signals most likely to be actionable for another agent.
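
The section layout and weighting above can be sketched as a small parser. This is only an illustration: the numeric weights and the function names `parse_memory_sections` and `weighted_signals` are assumptions made for this example, not part of ADR-002 or the kaizen-agentic CLI.

```python
import re

# Illustrative weights (assumption): Watch Points and Open Threads count most.
SECTION_WEIGHTS = {"Watch Points": 3, "Open Threads": 3}
DEFAULT_WEIGHT = 1


def parse_memory_sections(text: str) -> dict[str, list[str]]:
    """Split a memory.md body into {heading: [non-empty lines]} on '## ' headings."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


def weighted_signals(text: str) -> list[tuple[int, str, str]]:
    """Return (weight, heading, line) tuples, heaviest sections first."""
    rows = [
        (SECTION_WEIGHTS.get(heading, DEFAULT_WEIGHT), heading, line)
        for heading, lines in parse_memory_sections(text).items()
        for line in lines
    ]
    # sorted() is stable, so lines keep file order within the same weight.
    return sorted(rows, key=lambda row: -row[0])
```

Sorting by weight keeps the original file order within each weight class, so a brief built from the top rows still reads chronologically.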
### Project metrics (ADR-004)
Quantitative performance data lives at `.kaizen/metrics/<agent>/summary.json`.
`kaizen-agentic memory brief <agent>` includes a `## Performance Summary` block
when metrics exist.
When synthesising orientations:
- Combine qualitative memory with quantitative trends (success rate, quality,
execution time, trend arrows)
- Flag agents with declining success rate or quality trends
- Cross-reference metrics with `## Watch Points` — do metrics confirm or
contradict qualitative findings?
- Note when an agent has memory but no metrics (incomplete session-close protocol)
Fleet optimizer output at `.kaizen/metrics/optimizer/analysis.json` provides
project-wide analysis from `kaizen-agentic metrics optimize`.
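
One way to spot the "memory but no metrics" case is a plain filesystem check. The sketch below assumes only the two path conventions stated above; the helper name `agents_missing_metrics` is hypothetical and is not a kaizen-agentic command.

```python
from pathlib import Path


def agents_missing_metrics(project_root: Path) -> list[str]:
    """Names of agents that have memory.md but no metrics summary.json."""
    kaizen = project_root / ".kaizen"
    missing = []
    for memory_file in sorted(kaizen.glob("agents/*/memory.md")):
        agent = memory_file.parent.name
        # A missing summary suggests the session-close protocol was not completed.
        if not (kaizen / "metrics" / agent / "summary.json").is_file():
            missing.append(agent)
    return missing
```

Calling `agents_missing_metrics(Path("."))` from the project root would list the agents to flag in the fleet overview.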
---
## Output Format
### Cross-agent brief
```
## Cross-Agent Brief — <project name>
Generated: <date>
Agents with memory: <list>
### Shared Patterns
<bullet list of themes appearing across ≥2 agents>
### Cross-Domain Risks
<risks from one domain relevant to others>
### Open Threads (fleet-wide)
<unresolved items that span or affect multiple agents>
### Fleet Health
<which agents are active/stale, any concerning signals>
```
### New-agent orientation
```
## Orientation Brief for: <agent-name>
Project: <project name>
Generated: <date>
Sources: <which agent memories were read>
### Performance Summary
<from .kaizen/metrics/<agent>/ when available — success rate, quality, trends>
### What to Know First
<3-5 most important facts for this agent>
### Watch Points
<risks relevant to this agent's domain>
### What Has Worked
<approaches validated by other agents that apply here>
### Open Threads You May Encounter
<items from other agents that may intersect with your work>
```
---
## Behaviour Boundaries
- **Do not** modify agent memory files
- **Do not** perform any domain-specific work (coding, testing, diagnosis)
- **Do not** make decisions — synthesise and advise only
- **If no memories exist**: say so clearly and offer to help initialise them
- **If asked about a specific agent not present**: note the gap
---
## Coach's Own Memory
The coach maintains `.kaizen/agents/coach/memory.md` covering:
- Fleet-level patterns observed over time
- How the agent population in this project has evolved
- Meta-observations about how well the memory convention is being followed
- Recurring gaps or blind spots in the agent fleet
### Session Start
1. Check for `.kaizen/agents/coach/memory.md`.
2. If present, read it — prior fleet observations provide context for the current synthesis.
3. Scan `.kaizen/agents/*/memory.md` to build the current fleet picture.
### Session Close
1. Update `## Accumulated Findings` with new fleet-level patterns.
2. Note any new agents added or memory files reset.
3. Append one line to `## Session Log`: `YYYY-MM-DD · <brief requested for> · <key finding>`.
4. Bump `last_updated` and `session_count`.
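
The session-log line format in step 3 could be produced by a helper like the following. It is a sketch only: `session_log_line` is a hypothetical name, and collapsing whitespace to keep each entry on one line is an assumption rather than a documented rule.

```python
from datetime import date


def session_log_line(day: date, brief_for: str, key_finding: str) -> str:
    """Format one Session Log entry: 'YYYY-MM-DD · <brief requested for> · <key finding>'."""
    # Collapse internal whitespace so each entry stays on a single line.
    fields = [" ".join(part.split()) for part in (brief_for, key_finding)]
    return " · ".join([day.isoformat(), *fields])
```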

@@ -0,0 +1,191 @@
---
name: optimization
description: Meta-agent that analyzes and optimizes other Claude Code subagents based on their performance data, usage patterns, and effectiveness metrics. Use PROACTIVELY for agent ecosystem improvement.
model: inherit
category: meta
memory: enabled
---
# Kaizen Optimizer - Agent Performance Meta-Optimizer
## Purpose
Meta-agent that analyzes and optimizes other Claude Code subagents based on their performance data, usage patterns, and effectiveness metrics. Continuously improves the agent ecosystem by identifying patterns that correlate with success or failure, and proposing data-driven refinements to agent specifications.
## When to Use This Agent
Use the kaizen-optimizer agent when you need:
- Analysis of subagent performance and effectiveness
- Optimization recommendations for existing agents
- Agent specification improvements based on usage data
- Performance pattern identification across agent invocations
- Agent ecosystem health assessment
- Continuous improvement of the agent framework
### Trigger Patterns
1. **Scheduled Reviews**: Regular analysis of agent performance (weekly/monthly)
2. **Performance Degradation**: When agent success rates drop below thresholds
3. **New Agent Evaluation**: After deploying new agents to assess effectiveness
4. **Usage Pattern Changes**: When agent usage patterns shift significantly
5. **Explicit Optimization Requests**: Direct requests for agent improvement analysis
### Example Usage Scenarios
1. **Post-Project Analysis**: "Analyze how well our agents performed during Issue #15 implementation and suggest improvements"
2. **Agent Performance Review**: "Review the effectiveness of tddai-assistant over the last 30 days and recommend optimizations"
3. **Ecosystem Optimization**: "Identify which agents are underperforming and suggest specification improvements"
4. **Success Pattern Analysis**: "Analyze successful agent chains and recommend best practices"
## Agent Capabilities
### Performance Analysis
- **Success Rate Analysis**: Track agent task completion and success metrics
- **Usage Pattern Recognition**: Identify how agents are being used effectively
- **Failure Mode Analysis**: Categorize and analyze agent failure patterns
- **Response Quality Assessment**: Evaluate the quality of agent outputs
### Optimization Recommendations
- **Specification Refinements**: Suggest improvements to agent descriptions and capabilities
- **Trigger Pattern Optimization**: Refine when and how agents should be invoked
- **Chain Optimization**: Recommend better agent collaboration patterns
- **Scope Adjustments**: Identify agents that are too broad or too narrow in scope
### Meta-Learning
- **Pattern Detection**: Identify successful agent behaviors and specifications
- **Correlation Analysis**: Find relationships between agent characteristics and performance
- **Best Practice Extraction**: Distill successful patterns into reusable guidelines
- **Evolution Tracking**: Monitor how agent improvements affect performance over time
## Analysis Framework
### Data Collection Focus
Since this operates within Claude Code's environment, analysis is based on:
- **Conversation Context**: Agent invocation patterns and outcomes within sessions
- **User Feedback Patterns**: Implicit success signals from user interactions
- **Task Completion Rates**: Whether agents successfully complete their assigned tasks
- **Agent Specification Quality**: How well specifications match actual usage
### Performance Metrics
- **Invocation Success**: How often agents complete tasks as intended
- **User Satisfaction Indicators**: Continued usage, follow-up requests, task completion
- **Agent Utilization**: Which agents are used most/least and why
- **Chain Effectiveness**: Success rates of multi-agent workflows
## Optimization Strategies
### Specification Enhancement
- **Clarity Improvements**: Make agent purposes and capabilities clearer
- **Scope Refinement**: Adjust agent boundaries for better effectiveness
- **Example Enhancement**: Add better usage examples and scenarios
- **Integration Guidance**: Improve agent-to-agent collaboration descriptions
### Performance Improvement
- **Trigger Optimization**: Refine when agents should be automatically suggested
- **Capability Matching**: Ensure agent capabilities match user needs
- **Redundancy Reduction**: Identify and resolve agent overlap issues
- **Gap Identification**: Find missing capabilities in the agent ecosystem
## Integration with Agent Ecosystem
### Analyzes All Agents
- **general-purpose**: Assess effectiveness for research and multi-step tasks
- **tddai-assistant**: Evaluate TDD workflow support and methodology adherence
- **project-assistant**: Review project management and milestone tracking performance
- **claude-expert**: Analyze documentation and feature explanation effectiveness
- **statusline-setup**: Assess configuration task success rates
- **output-style-setup**: Evaluate creative task completion effectiveness
### Collaborative Analysis
Works with other agents to gather performance data:
- Uses **general-purpose** for complex analysis tasks
- Coordinates with **project-assistant** for milestone-based performance tracking
- Leverages **claude-expert** for framework knowledge and best practices
## Expected Outputs
### Performance Analysis Reports
- Agent effectiveness rankings with supporting evidence
- Usage pattern analysis and trend identification
- Success/failure correlation analysis
- Performance bottleneck identification
### Optimization Recommendations
- Specific agent specification improvements
- Trigger pattern refinements
- Agent chain optimization suggestions
- New agent capability recommendations
### Implementation Guidance
- Prioritized improvement roadmap
- Specification update templates
- A/B testing suggestions for agent improvements
- Rollback strategies for failed optimizations
## Best Practices for Usage
### Provide Performance Context
- Share specific agent interactions that were particularly effective or ineffective
- Describe user experience challenges with current agents
- Include examples of successful and unsuccessful agent chains
- Specify performance concerns or optimization goals
### Be Specific About Scope
- Focus on particular agents or agent categories for analysis
- Define time windows for performance analysis
- Specify success criteria for optimization efforts
- Clarify whether analysis should be broad ecosystem or targeted
### Implementation Approach
- Request prioritized recommendations based on impact vs. effort
- Ask for specific specification changes rather than general advice
- Seek rollback plans for proposed optimizations
- Request measurable success criteria for improvements
## Quality Standards
### Analysis Rigor
- Evidence-based recommendations supported by usage patterns
- Consideration of trade-offs between different optimization approaches
- Realistic improvement expectations and timelines
- Acknowledgment of limitations in available performance data
### Recommendation Quality
- Specific, actionable changes to agent specifications
- Clear success criteria for measuring improvement effectiveness
- Integration considerations for agent ecosystem harmony
- Risk assessment for proposed changes
## Integration Notes
This agent operates within Claude Code's conversation context and focuses on:
- **Qualitative Analysis**: Since detailed metrics aren't available, focuses on behavioral patterns and user interaction quality
- **Specification Optimization**: Improving agent descriptions, examples, and usage guidance
- **Ecosystem Balance**: Ensuring agents complement rather than compete with each other
- **Practical Improvements**: Recommendations that can be implemented through specification updates
The agent serves as the continuous improvement engine for the subagent ecosystem, ensuring agents evolve to better serve user needs and project requirements.
## Session Start
1. Check for `.kaizen/agents/optimization/memory.md` in the project root.
2. If present, read it before beginning analysis.
3. Review `.kaizen/metrics/optimizer/analysis.json` if it exists for the latest fleet report.
## Session Close
1. When analysis completes, note key findings in `## Accumulated Findings`.
2. Append one line to `## Session Log`: `YYYY-MM-DD · <agents reviewed> · <outcome>`.
3. Bump `last_updated` and increment `session_count`.
4. Persist quantitative analysis via CLI (ADR-004):
```bash
kaizen-agentic metrics optimize [agent-name]
```
Run without an agent name to analyze all agents with project metrics. Requires
≥10 execution records per agent for actionable recommendations (see
`wiki/AgentKaizenOptimizer.md`).

@@ -216,11 +216,21 @@ it. The output schema must define `List[TaskSpec]` or a compatible envelope.
#### `review_required: true` #### `review_required: true`
When set, the instruction's proposed task list is written to a **pending review When set today, the instruction's task/report output is marked with
queue** in issue-core rather than directly created. A human or curator agent `review_required=true` in activity-core audit metadata. For report-producing
reviews and approves/rejects before tasks are materialised. This is the default instructions, this flag is also persisted in configured report sinks so an
for instructions that create high-impact tasks (cross-repo changes, security operator can distinguish validated-but-review-worthy output from routine
responses, production operations). output.
activity-core does **not** currently route proposed tasks to a pending review
queue. That queue must be owned by issue-core, because issue-core owns task
lifecycle state. Until issue-core exposes a review contract, `review_required`
is metadata only; it must not be treated as evidence that live task creation was
held for approval.
Future issue-core review integration may use the same field, but that change
must update the issue sink contract and tests before any ActivityDefinition
relies on queue routing.
#### Evaluation semantics #### Evaluation semantics
@@ -286,7 +296,8 @@ This boundary makes future extraction to `rules-core` a packaging exercise, not
tasks" behaviour is replaced by explicit rule blocks. tasks" behaviour is replaced by explicit rule blocks.
- A new `RuleEvaluator` class (AST walker) is added to `src/activity_core/rules/`. - A new `RuleEvaluator` class (AST walker) is added to `src/activity_core/rules/`.
- A new `InstructionExecutor` class handles prompt rendering, LLM call, output - A new `InstructionExecutor` class handles prompt rendering, LLM call, output
validation, and review queue routing. validation, and review-required audit metadata. Pending review queue routing
remains a future issue-core integration.
- Integration tests for rule evaluation use fixture JSON; no running Temporal required. - Integration tests for rule evaluation use fixture JSON; no running Temporal required.
- The `task_spawn_log` table is added to the Postgres schema (new Alembic migration). - The `task_spawn_log` table is added to the Postgres schema (new Alembic migration).
- ActivityDefinition files that omit both `rules` and `instructions` are valid - ActivityDefinition files that omit both `rules` and `instructions` are valid

@@ -0,0 +1,156 @@
---
id: ACT-ADR-004
type: architecture-decision-record
title: "The Producer Trust Boundary — Guardrails and Error-Correction for Untrusted Output"
status: accepted
decided_by: Bernd Worsch
date: "2026-06-26"
scope: cross-repo
affects:
- activity-core
- rules-core (future extraction)
tags: ["architecture", "llm", "safety", "validation", "guardrails", "trust-boundary", "resilience"]
---
# ACT-ADR-004: The Producer Trust Boundary
## Status
Accepted.
## Context
On 2026-06-26 the scheduled daily WSJF triage instruction fired on time, called
llm-connect successfully, and produced a long ranked recommendation list — but
the JSON broke at char 5268 (~rank 8-9 of ~16), failing schema validation. Because
the report was validated and consumed as a single monolithic JSON document, one
malformed delimiter discarded the **entire** run, including the 7 perfectly good
recommendations the model had already emitted. The scheduling and runtime layers
were healthy; the failure was entirely at the seam where free-form model output
meets a strict consumer.
This is not a one-off bug; it is a recurring class of failure. activity-core has a **trust
boundary** wherever generative or human-authored output meets strict deterministic
consumers: the JSON Schema validator, the task emitter, and any classic compute
pipeline downstream. The producers on the other side of that boundary — **LLMs,
agents, and humans** — are all *untrusted producers*. Their output may be:
- **erroneous** — hallucination, truncation at a token limit, drift, type slips,
typos, a missing delimiter; or
- **malicious** — prompt injection, crafted payloads, or oversized / deeply-nested
structures intended to exhaust or confuse the consumer.
The pre-existing design treated producer output optimistically: parse the whole
document, validate the whole document, and on any failure discard the whole
document (preserving only a bounded diagnostic preview). That gives **zero error
locality** — the blast radius of any single defect is the entire activation.
## Decision
Treat the producer→consumer seam as an explicit, adversarial **trust boundary**,
and place guardrails plus error-correction tooling *at that boundary* rather than
letting raw producer output flow into deterministic consumers.
### Two non-fail-fast postures
When hard-failing on a problem is undesirable, there are two sound strategies, and
they **compose**:
- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output
as-is; on exception, catch → repair → retry → or quarantine. Cheap on the happy
path; blast radius depends entirely on how granular the catch is. Best when
failures are rare and locally recoverable. Risk: failures surface late, possibly
after partial side effects.
- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp,
and normalize the output to a known-good shape *before* it enters the pipeline —
drop bad items, coerce types, bound sizes/depth, allow-list references — so the
consumer only ever sees clean input. Higher upfront cost, smaller blast radius,
no partial side effects. Best when failures are common or consequences are high.
### Governing principles
1. **Push verification to the boundary; keep the interior strict.** Apply posture
**B** at the producer→consumer boundary; keep posture **A** for residual
exceptions inside the verified core. Never relax the interior schema to absorb
producer sloppiness.
2. **Make error locality match the unit of work.** One bad recommendation must
cost one recommendation, not the whole report. Structuring the payload so each
item is independently parseable and validatable is the highest-leverage change.
3. **Quarantine, never silently drop.** Invalid units are preserved as bounded,
provenance-tagged artifacts (`index`, `error`, `raw` snippet, `reason`) so they
can be debugged or replayed. Degraded-but-usable is reported distinctly from
total loss.
4. **Both human and agent input get the same rigor.** Guardrails are
producer-agnostic: the same count / length / depth caps and reference
allow-lists apply whether the producer is an LLM, an agent, or a human.
### What this means concretely in activity-core
Implemented in `src/activity_core/rules/executor.py`:
- **Strict-structure-only schema.** The daily-triage output schema is strict on
per-item *structure* (`required [rank, candidate, action, why]`, typed `wsjf`)
and carries `maxItems` as a producer *hint* — never as a hard whole-document
reject, which would reproduce the very blast-radius failure (ACT-ADR-002 governs
the schema format; `schemas/daily-triage-report.json`).
- **Item-granular recovery (posture B).** When whole-document parse + one retry
fail, `_resilient_report` recovers individually-parseable recommendation objects
via a brace/quote-aware scanner (`_extract_object_spans`) that works for both
pretty-printed and NDJSON output, attempts a best-effort `_try_repair` on a
truncated tail, validates each recovered object against the item schema, and
keeps the valid ones. Survivors are emitted with `output_validated=true`,
`partial=true`, and `review_required=true`.
- **Producer guardrails (`_partition_items`, applied on both the recovery and the
happy path).** Per recommendation: structural type → schema → structural caps
(`_MAX_DEPTH`, `_MAX_STRING_LEN`) → reference allow-list → count cap (top-N by
`maxItems`). The first failing check quarantines the item with provenance and a
`reason` (`malformed` / `schema` / `guardrail` / `allow_list` / `over_limit`).
- **Reference allow-list.** A recommendation whose `candidate` is not in the set of
known ids is quarantined. The set is sourced from resolved context
(`context["known_candidates"]`, via `_allow_list_from_context`); the check is
inert until a context resolver populates it, so the capability ships now and
activates with a one-line resolver change.
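
The item-granular recovery idea can be illustrated with a small brace/quote-aware scan. This is a sketch under stated assumptions, not the code in `executor.py`: the functions `extract_object_spans` and `recover_items` below are simplified, hypothetical stand-ins for `_extract_object_spans` and `_resilient_report`, and they omit the tail repair and schema validation steps.

```python
import json


def extract_object_spans(text: str) -> list[str]:
    """Return the outermost balanced {...} substrings, ignoring braces inside strings."""
    spans: list[tuple[int, int]] = []
    stack: list[int] = []  # offsets of '{' characters that are still open
    in_string = False
    escaped = False
    for pos, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            stack.append(pos)
        elif ch == "}" and stack:
            start = stack.pop()
            # Spans nested inside the object that just closed are not outermost.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, pos + 1))
    return [text[a:b] for a, b in spans]


def recover_items(text: str) -> tuple[list[dict], list[str]]:
    """Parse each recovered span on its own; keep unparseable spans for quarantine."""
    valid, rejected = [], []
    for span in extract_object_spans(text):
        try:
            valid.append(json.loads(span))
        except json.JSONDecodeError:
            rejected.append(span)
    return valid, rejected
```

Because a truncated wrapper object never closes, its complete children become the outermost balanced spans, so the same scan handles both a pretty-printed document and NDJSON. A completed sub-object inside a truncated item would also be returned, which is one reason each recovered object still needs per-item schema validation.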
### Where each posture sits
| Layer | Posture | Mechanism |
|-------|---------|-----------|
| Schema / contract | B | strict per-item structure; `maxItems` as hint |
| Whole-document parse | A | tolerant parse + single retry |
| Failed parse | B | item-granular recovery + repair + quarantine |
| Per-item screening | B | schema + depth/length caps + allow-list + count cap |
| Emitted report | — | `partial` / `quarantined_*` provenance; never silent |
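
The per-item screening row can be sketched as a first-failing-check pipeline. Everything here is illustrative: the cap values, the required-key check standing in for JSON Schema validation, and the names `exceeds_caps`, `first_failure`, and `partition_items` are assumptions for this example, not the project's `_partition_items`.

```python
MAX_DEPTH = 4          # illustrative caps, not the project's calibration
MAX_STRING_LEN = 500
RAW_PREVIEW = 200      # bound on the quarantined raw snippet
REQUIRED = ("rank", "candidate", "action", "why")


def exceeds_caps(item) -> bool:
    """True if any string is too long or containers nest deeper than MAX_DEPTH."""
    pending = [(item, 1)]  # iterative walk, so deep input cannot exhaust the stack
    while pending:
        value, level = pending.pop()
        if isinstance(value, str):
            if len(value) > MAX_STRING_LEN:
                return True
        elif isinstance(value, (dict, list)):
            if level > MAX_DEPTH:
                return True
            children = value.values() if isinstance(value, dict) else value
            pending.extend((child, level + 1) for child in children)
    return False


def first_failure(item, known_candidates):
    """Return the reason for the first failing check, or None if the item passes."""
    if not isinstance(item, dict):
        return "malformed"
    if any(key not in item for key in REQUIRED) or not isinstance(item["candidate"], str):
        return "schema"
    if exceeds_caps(item):
        return "guardrail"
    # The allow-list is inert until a resolver supplies known ids.
    if known_candidates is not None and item["candidate"] not in known_candidates:
        return "allow_list"
    return None


def partition_items(items, max_items, known_candidates=None):
    """Split items into (accepted, quarantined); quarantined entries carry provenance."""
    accepted, quarantined = [], []
    for index, item in enumerate(items):
        reason = first_failure(item, known_candidates)
        if reason is None and len(accepted) >= max_items:
            reason = "over_limit"
        if reason is None:
            accepted.append(item)
        else:
            quarantined.append(
                {"index": index, "reason": reason, "raw": repr(item)[:RAW_PREVIEW]}
            )
    return accepted, quarantined
```

Applying the count cap last means an invalid item never consumes one of the top-N slots, and every rejected item keeps its original index for later replay.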
## Consequences
- A single malformed or oversized item no longer discards an entire activation;
the daily-triage run that failed on 2026-06-26 would now deliver its 7 valid
recommendations and quarantine the broken tail.
- Reports gain a `partial` / `quarantined_*` vocabulary; downstream report sinks
and reviewers can distinguish degraded-but-usable from total loss.
- Guardrail thresholds (`_MAX_DEPTH`, `_MAX_STRING_LEN`, `maxItems`, the
allow-list) are policy knobs that will need tuning; they are intentionally
conservative defaults, not a finished calibration.
- **Known retention gap (follow-on):** `LLMConnectClient.complete()` still returns
only `content`, discarding `finish_reason`/`usage`, and the total-loss artifact
caps raw output below realistic break points. Capturing those signals so
failures stay debuggable is tracked as a retention fix, not closed by this ADR.
## Alternatives considered
- **Hard-enforce `maxItems` in the validator.** Rejected: a hard reject of an
over-count document reproduces the whole-document blast radius. Mitigation (keep
top-N, quarantine the rest) is preferred.
- **Relax the schema to accept anything.** Rejected: violates principle 1; pushes
malformed data into downstream consumers.
- **Retry-until-valid only (pure posture A).** Rejected as the sole strategy: the
2026-06-26 failure recurred across both the initial attempt and the retry, so
retry alone does not bound the blast radius.
## References
- ACT-ADR-002 — markdown-as-definition format and output schema governance.
- ACT-ADR-003 — Rule vs. Instruction model; the Instruction prompt-injection
surface this boundary complements on the output side.
- `workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md` — the
implementing workplan.


@@ -18,7 +18,7 @@ extension point `af654abb`).
| Queue name | Registered workers | | Queue name | Registered workers |
|---|---| |---|---|
| `orchestrator-tq` | `RunActivityWorkflow` and all its activities (`load_activity_definition`, `resolve_context`, `log_run`) | | `orchestrator-tq` | `RunActivityWorkflow` and all its activities (`load_activity_definition`, `resolve_context`, `log_run`) |
| `task-execution-tq` | `TaskExecutorWorkflow` and all concrete task type workflows | | `task-execution-tq` | `TaskExecutorWorkflow` compatibility stub only; real execution belongs in per-repo workers |
**Rule:** a workflow and its activities must be registered on the same task queue. **Rule:** a workflow and its activities must be registered on the same task queue.
Cross-queue activity calls require an explicit `task_queue` argument on Cross-queue activity calls require an explicit `task_queue` argument on
@@ -60,6 +60,12 @@ A single process may run workers for multiple task queues, but each `Worker`
instance is bound to one task queue. Use separate `Worker` instances for instance is bound to one task queue. Use separate `Worker` instances for
`orchestrator-tq` and `task-execution-tq`. `orchestrator-tq` and `task-execution-tq`.
`TaskExecutorWorkflow` is not a production execution surface for activity-core.
It exists only as a compatibility/idempotency stub that writes a synthetic
`task_instances` row in older tests and dev flows. Do not add concrete task
execution logic here; execution ownership belongs to per-repo workers or a
future execution-owned repo/workplan.
--- ---
## Search attributes ## Search attributes


@@ -11,7 +11,9 @@ The current authoritative boundary is the issue-core REST API:
POST {ISSUE_CORE_URL}/issues/ POST {ISSUE_CORE_URL}/issues/
``` ```
`IssueCoreRestSink` sends this payload: `IssueCoreRestSink` authenticates with the shared `ISSUE_CORE_API_KEY` env var
(same value as the issue-core server) via `Authorization: Bearer <key>` and
sends this payload:
```json ```json
{ {
@@ -52,7 +54,7 @@ task reference before it can replace `IssueCoreRestSink`.
Weekly SBOM staleness is safe to evaluate in dry-run mode because the rule Weekly SBOM staleness is safe to evaluate in dry-run mode because the rule
contract is deterministic and tested. Do not enable it against the real REST sink contract is deterministic and tested. Do not enable it against the real REST sink
until issue-core credentials, endpoint reachability, and duplicate-handling are until `ISSUE_CORE_API_KEY`, endpoint reachability, and duplicate-handling are
verified in the target environment. verified in the target environment.
## Verification ## Verification


@@ -116,7 +116,58 @@ asyncio.run(publish())
--- ---
## Syncing schedules manually ## Syncing definitions and schedules manually
When the API is running, prefer the admin sync endpoint for definition or
schedule changes. It refreshes file-backed ActivityDefinitions and reconciles
Temporal Schedules without restarting the worker:
```bash
curl -s -X POST \
'http://localhost:8010/admin/sync?definitions=true&schedules=true'
```
The response reports:
- `definitions.synced`
- `event_types.synced`
- `schedules.upserted`
- `schedules.paused`
- `schedules.deleted_orphans`
- bounded `errors[]`
`event_types` defaults to `false` for this endpoint because event-triggered
definitions already reload from the DB in the event router path; opt in when
the operator intentionally changed event type definition files:
```bash
curl -s -X POST \
'http://localhost:8010/admin/sync?definitions=true&schedules=true&event_types=true'
```
The v1 posture is manual/operator-triggered sync. A periodic background loop is
deferred until live use shows it is needed; this keeps customer definition
changes explicit and avoids background repo scanning from the worker.
### Railiance01 no-restart smoke
After changing a projected definition in `k8s/railiance/20-runtime.yaml`,
apply the ConfigMap and wait for the API pod volume to refresh (up to ~60s),
then reconcile without restarting `actcore-worker`:
```bash
export KUBECONFIG=~/.kube/config-hosteurope
kubectl apply -f k8s/railiance/20-runtime.yaml
sleep 60
kubectl -n activity-core exec deploy/actcore-api -- \
python3 -c 'import urllib.request; req=urllib.request.Request("http://localhost:8010/admin/sync?definitions=true&schedules=true", method="POST"); print(urllib.request.urlopen(req).read().decode())'
```
Automated regression for the disabled `ops-service-inventory-probes`
projection (enable/cadence flip, idempotent repeat sync, rollback) lives in
`scripts/smoke_admin_sync_no_restart.py`.
If the API is unavailable, the schedule-only CLI remains available:
```bash ```bash
TEMPORAL_HOST=localhost:7233 \ TEMPORAL_HOST=localhost:7233 \
@@ -126,7 +177,7 @@ ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
This reconciles all Temporal Schedules with the `activity_definitions` table: This reconciles all Temporal Schedules with the `activity_definitions` table:
- Upserts schedules for every enabled cron definition - Upserts schedules for every enabled cron definition
- Creates paused schedules for disabled cron definitions - Creates paused schedules for disabled cron or one-shot scheduled definitions
- Deletes orphaned schedules with no matching DB row - Deletes orphaned schedules with no matching DB row
After adding or changing a recurring ActivityDefinition or workflow activity After adding or changing a recurring ActivityDefinition or workflow activity
@@ -159,14 +210,34 @@ repos, and emits one automated task per stale repo through explicit
`weekly-coding-retro` follows the same cron -> context resolver -> per-repo task `weekly-coding-retro` follows the same cron -> context resolver -> per-repo task
pattern for coding-session retrospection. It runs Saturdays at 19:00 pattern for coding-session retrospection. It runs Saturdays at 19:00
Europe/Berlin and resolves the latest State Hub `/progress/` item with Europe/Berlin and resolves the latest State Hub `/progress/` item with
`event_type=coding_retro` into `context.retro.suggestions`. Each positive-score `event_type=coding_retro` and a matching `window_days` into
suggestion emits one task to `context.s.repo` with labels `context.retro.suggestions`. Each positive-score suggestion emits one task to
`coding-retro`, `improvement`, and `automated`. `context.s.repo` with labels `coding-retro`, `improvement`, and `automated`.
The weekly schedule intentionally ignores broader retro windows such as 30-day
catch-up reports.
Keep `weekly-coding-retro` disabled until Helix Forge publishes the Keep `weekly-coding-retro` disabled until Helix Forge publishes the
`coding_retro` read model and a smoke run confirms the resolver returns a `coding_retro` read model and a smoke run confirms the resolver returns a
non-empty suggestion set with no duplicate target tasks on re-run. non-empty suggestion set with no duplicate target tasks on re-run.
## Ops inventory evidence posture
The current accepted live backend for activity-core ops inventory probes is
State Hub progress with `event_type=ops_inventory_probe`.
Inter-Hub / ops-hub per-entity submission remains intentionally deferred until
all of these are true:
- `OPS_HUB_KEY` is provisioned through an operator-owned secret path, never Git,
chat, or State Hub detail.
- Widget or capability mapping is configured for the target ops-hub entities.
- Production Inter-Hub intake is deployed and smoke-tested for the relevant
authenticated routes.
Until then, missing Inter-Hub configuration should produce an explicit skipped
sink result, not a failed probe. This posture was recorded in State Hub decision
`7c235bbb-ee6f-4c3e-b1dd-74717eac9082`.
--- ---
## Temporal UI — filtering by activity ## Temporal UI — filtering by activity
@@ -262,6 +333,52 @@ the same durable consumer name provides automatic failover.
--- ---
## Run-miss recovery policies (cron triggers)
A cron fire is **missed** when the worker or Temporal is unavailable at trigger
time. `trigger_config.misfire_policy` selects what happens when the system
recovers. Each policy combines a Temporal **catchup window** (how far back missed
fires are recovered) with an **overlap policy** (what to do if a recovered fire
would start while a prior run is still executing):
| `misfire_policy` | Behaviour | Default catchup window | Overlap |
| --- | --- | --- | --- |
| `skip` | Run on trigger or skip — a missed fire is never recovered | 60s grace | `SKIP` |
| `catchup_all` | Recover **every** fire missed during the outage | 365 days | `BUFFER_ALL` |
| `catchup_latest` | Recover only the **most recent** missed fire; no backlog | 24h | `BUFFER_ONE` |
Set `trigger_config.catchup_window_seconds` to override the per-policy default
(e.g. an hourly definition using `catchup_latest` should set it to ~3600 so a
single missed hour is recovered but older ones are not).
Legacy values are still accepted: `catchup` → `catchup_all`,
`compress` → `catchup_latest`.
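
As a rough illustration of how the table, the override, and the legacy aliases combine, the sketch below resolves a `trigger_config` into a catchup window and an overlap policy. The `MisfireSettings` type and `resolve_misfire` function are hypothetical names for this example; overlap policies are shown as plain strings rather than Temporal SDK enum values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisfireSettings:
    catchup_window_seconds: int
    overlap: str


# Per-policy defaults taken from the table above.
DEFAULTS = {
    "skip": MisfireSettings(60, "SKIP"),
    "catchup_all": MisfireSettings(365 * 24 * 3600, "BUFFER_ALL"),
    "catchup_latest": MisfireSettings(24 * 3600, "BUFFER_ONE"),
}

# Legacy policy names and the current names they map to.
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}


def resolve_misfire(trigger_config: dict) -> MisfireSettings:
    """Resolve the effective catchup window and overlap for a cron trigger."""
    policy = trigger_config["misfire_policy"]
    policy = LEGACY_ALIASES.get(policy, policy)
    defaults = DEFAULTS[policy]
    # An explicit catchup_window_seconds overrides the per-policy default.
    window = trigger_config.get("catchup_window_seconds", defaults.catchup_window_seconds)
    return MisfireSettings(int(window), defaults.overlap)
```

For an hourly definition, `resolve_misfire({"misfire_policy": "catchup_latest", "catchup_window_seconds": 3600})` keeps the `BUFFER_ONE` overlap but narrows recovery to a single missed hour.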
> **Why this exists:** before ACTIVITY-WP-0014 no catchup window was set, so a
> brief outage at trigger time silently dropped the fire with no recovery and no
> log line. The `daily-statehub-wsjf-triage` definition now uses `catchup_latest`.
## State Hub write idempotency (ACTIVITY-WP-0014 T05)
Every State Hub write from activity-core (report-sink progress, ops-evidence
progress, schedule-miss alerts) carries a stable **`Idempotency-Key`** header
derived deterministically from the write's identity
(`run_id:instruction_id:event_type`, or `schedule_miss:activity_id:last_fired`
for miss alerts). This makes writes safe to **buffer and replay** under the
planned State Hub *beachhead* (per-machine read cache + write outbox): a flush —
possibly retried after an outage — cannot create duplicate progress/triage
events once State Hub / the beachhead honours the header.
The guarantee lives on the write, not on a live dedup read. The read-based
`_progress_exists` check is now best-effort only: if State Hub is unreachable it
returns `False` (proceed to the keyed write) rather than hard-failing. The header
passes untouched through the `actcore-state-hub-bridge` proxy and is ignored by
State Hub versions that do not yet honour it.
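
A minimal sketch of the two behaviours described above, assuming the key is the plain colon-joined identity (the text gives the components but not whether they are hashed or encoded). All function names here are illustrative, not the activity-core API, and the `last_fired` value is assumed to be a stable timestamp string.

```python
from typing import Callable


def report_idempotency_key(run_id: str, instruction_id: str, event_type: str) -> str:
    """Key for report-sink and ops-evidence progress writes."""
    return f"{run_id}:{instruction_id}:{event_type}"


def schedule_miss_idempotency_key(activity_id: str, last_fired: str) -> str:
    """Key for schedule-miss alerts; last_fired is assumed to be a stable timestamp string."""
    return f"schedule_miss:{activity_id}:{last_fired}"


def idempotency_headers(key: str) -> dict:
    """Headers to attach to the State Hub write."""
    return {"Idempotency-Key": key}


def progress_exists_best_effort(lookup: Callable[[], bool]) -> bool:
    """Best-effort dedup read: an unreachable State Hub means 'proceed to the keyed write'."""
    try:
        return lookup()
    except OSError:
        # Connection errors are subclasses of OSError; treat them as "not found".
        return False
```

Because the key depends only on the identity of the write, replaying the same write from an outbox yields the same header, so deduplication can happen on the receiving side once it honours the key.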
> The queue/cache itself is **not** built in activity-core — it belongs to the
> state-hub beachhead. activity-core only emits the key. See the proposal sent to
> the `state-hub` agent.
## Troubleshooting ## Troubleshooting
### Worker fails to start: "ACTCORE_DB_URL is required" ### Worker fails to start: "ACTCORE_DB_URL is required"
@@ -271,6 +388,9 @@ Set the environment variable before running the worker.
1. Check Temporal UI → Schedules tab for the schedule status. 1. Check Temporal UI → Schedules tab for the schedule status.
2. Ensure `enabled=True` on the ActivityDefinition (paused schedules don't fire). 2. Ensure `enabled=True` on the ActivityDefinition (paused schedules don't fire).
3. Verify the cron expression with: `docker exec temporal-admin-tools temporal schedule describe --schedule-id activity-schedule-<uuid>` 3. Verify the cron expression with: `docker exec temporal-admin-tools temporal schedule describe --schedule-id activity-schedule-<uuid>`
4. If a fire was **missed entirely** (no run, no failure event) during an outage,
check `misfire_policy` — under `skip` missed fires are dropped by design. Use
`catchup_all` or `catchup_latest` to recover them. See *Run-miss recovery policies*.
### Event not routing ### Event not routing
1. Check NATS monitoring: http://localhost:8222/jsz to verify the `ACTIVITY_EVENTS` stream exists. 1. Check NATS monitoring: http://localhost:8222/jsz to verify the `ACTIVITY_EVENTS` stream exists.
@@ -342,6 +462,14 @@ uv run alembic history # show full migration history
## Railiance Deployment ## Railiance Deployment
### Production API access posture
The FastAPI admin surface remains ClusterIP-only in production. Do not publish
it through an external ingress until a separate access-policy work item chooses
the hostname, authentication layer, allowed users/agents, and audit
expectations. This posture was recorded in State Hub decision
`9ffaf7a9-227a-4e39-92e3-cd93d8cda1f2`.
### Pre-requisites ### Pre-requisites
- Docker ≥ 24 with Compose v2 (`docker compose` not `docker-compose`) - Docker ≥ 24 with Compose v2 (`docker compose` not `docker-compose`)
- ≥ 4 GB RAM available (Temporal server takes ~1 GB) - ≥ 4 GB RAM available (Temporal server takes ~1 GB)
@@ -412,6 +540,31 @@ make railiance-up
--- ---
## Kaizen fleet resolver (coulomb-loop)
Dry-run scheduled agent discovery against State Hub + pilot roster:
```bash
export STATE_HUB_URL=http://127.0.0.1:8000
export KAIZEN_RUNNER_HOST=$(hostname)
export ACTIVITY_DEFINITION_DIRS=/home/worsch/coulomb-loop
uv run python -c "
from activity_core.context_resolvers.kaizen import discover_kaizen_scheduled_repos
print(discover_kaizen_scheduled_repos({
'roster': '/home/worsch/coulomb-loop/loops/kaizen-stack/roster.yaml',
'cadence': 'daily',
}))
"
make sync-activity-definitions # requires ACTCORE_DB_URL + stack up
```
Source types: `kaizen`, `resolver`, or `shell` (alias). Queries:
`discover_kaizen_scheduled_repos`, `discover_kaizen_projects`.
---
## Wipe and restart dev stack ## Wipe and restart dev stack
```bash ```bash


@@ -0,0 +1,118 @@
---
type: history
title: "activity-core INTENT gap analysis"
date: "2026-06-16"
author: codex
repo: activity-core
related_workplan: ACTIVITY-WP-0009
---
# activity-core INTENT Gap Analysis - 2026-06-16
## Context
This note preserves the findings from a repository review against `INTENT.md`.
The review refreshed `SCOPE.md` for the current repo state and identified the
remaining gaps between the intended Event Bridge boundary and the implemented /
deployed surface.
Files and surfaces reviewed:
- `INTENT.md`
- `SCOPE.md`
- `src/activity_core/`
- `activity-definitions/`
- `docs/runbook.md`
- `docs/issue-core-emission-boundary.md`
- `k8s/railiance/`
- `workplans/ACTIVITY-WP-0006-post-triage-operational-hardening.md`
- `workplans/ACTIVITY-WP-0007-ops-inventory-probe-runner.md`
- `workplans/ACTIVITY-WP-0008-weekly-coding-retro.md`
## Summary
activity-core matches the core INTENT boundary in shape: it owns trigger
durability, context resolution, rule/instruction evaluation, outbound
task/report/evidence emission, and local audit records. It still must avoid
owning task lifecycle, project state, privileged ops execution, or service
inventory authority.
The current implementation has grown a useful bounded report/evidence surface:
instruction reports can write working-memory notes and State Hub progress, and
`ops-inventory` context sources can emit compact non-secret
`ops_inventory_probe` summaries. This is still consistent with INTENT as long as
those outputs remain records of activity-core activations rather than an
authoritative task, project, or ops control plane.
## Findings
### 1. Scheduled-run trust gap
`INTENT.md` expects recurring coordination work to run without Bernd as the
manual coordination layer. The daily State Hub WSJF triage path is implemented
and has produced validated reports, but `ACTIVITY-WP-0006-T03` still lacks
three clean consecutive scheduled runs after the June 7 runtime projection
failure.
Current evidence as of 2026-06-16:
- State Hub `daily_triage` progress only shows activity-core entries through
2026-06-06.
- `/home/worsch/the-custodian/memory/working` only has `daily-triage-*` notes
for 2026-06-02 through 2026-06-06.
Impact: daily triage is production-backed, but not yet fully proven as a
standing substrate.
### 2. Live task creation gap
`INTENT.md` says each activation emits task creation requests to issue-core and
records only the spawn audit trail. The REST issue sink exists, but Railiance is
currently configured with `ISSUE_SINK_TYPE=null`, so production runs record
synthetic audit references instead of consistently creating live issue-core
tasks.
Impact: the task emission boundary is implemented but not yet broadly proven in
the production deployment.
### 3. Review queue gap
The original ADR text described `review_required` as routing instruction output
to a pending review queue. Current code records `review_required` in
report/spawn metadata but does not integrate with an issue-core review queue.
Impact: current behavior is safe as metadata. As of the ACTIVITY-WP-0009
implementation pass, ADR-003 and SCOPE.md have been aligned to that behavior.
### 4. Evidence backend gap
The State Hub fallback evidence path works for `ops_inventory_probe`, and
`ACTIVITY-WP-0007` has live Railiance evidence. Inter-Hub / ops-hub submission
is intentionally deferred behind operator-owned `OPS_HUB_KEY` custody, widget
mapping, and approval.
Impact: activity-core can preserve non-secret continuity evidence, but richer
per-entity ops evidence publication is not yet live.
### 5. Execution-boundary residue
`TaskExecutorWorkflow` remains registered as a stub that persists a done
`task_instances` row. INTENT explicitly says activity-core must not execute the
work or track lifecycle state.
Impact: low immediate risk because the workflow is inert, but it is an attractive
wrong hook for future execution creep.
### 6. API exposure gap
The FastAPI admin surface is useful for internal CRUD and manual triggers.
Railiance docs keep it as ClusterIP until an authenticated ingress and access
policy are chosen.
Impact: operationally acceptable for now, but production access posture remains
an explicit decision.
## Follow-up
`workplans/ACTIVITY-WP-0009-intent-gap-closure.md` was created to turn these
findings into tracked closure work.


@@ -11,7 +11,7 @@ data:
TEMPORAL_NAMESPACE: default TEMPORAL_NAMESPACE: default
NATS_URL: nats://actcore-nats:4222 NATS_URL: nats://actcore-nats:4222
STATE_HUB_URL: http://actcore-state-hub-bridge:8000 STATE_HUB_URL: http://actcore-state-hub-bridge:8000
LLM_CONNECT_URL: "" LLM_CONNECT_URL: http://llm-connect.activity-core.svc.cluster.local:8080
LLM_CONNECT_TIMEOUT_SECONDS: "300" LLM_CONNECT_TIMEOUT_SECONDS: "300"
REPO_SCOPING_URL: http://repo-scoping.repo-scoping.svc.cluster.local:8020 REPO_SCOPING_URL: http://repo-scoping.repo-scoping.svc.cluster.local:8020
ISSUE_CORE_URL: http://issue-core.issue-core.svc.cluster.local:8010 ISSUE_CORE_URL: http://issue-core.issue-core.svc.cluster.local:8010
@@ -47,7 +47,10 @@ data:
type: cron type: cron
cron_expression: "20 7 * * *" cron_expression: "20 7 * * *"
timezone: Europe/Berlin timezone: Europe/Berlin
misfire_policy: skip # ACTIVITY-WP-0014: recover the most recent missed daily fire when the
# worker/Temporal was unavailable at trigger time, without accumulating a
# backlog after a multi-day outage.
misfire_policy: catchup_latest
context_sources: context_sources:
- type: static - type: static
bind_to: context.prompt_path bind_to: context.prompt_path
@@ -164,6 +167,36 @@ data:
Kubernetes projection of the Custodian-owned definition in Kubernetes projection of the Custodian-owned definition in
`/home/worsch/the-custodian/activity-definitions/hourly-recently-on-scope.md`. `/home/worsch/the-custodian/activity-definitions/hourly-recently-on-scope.md`.
state-hub-consistency-sweep.md: |
---
id: "7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b"
name: "State Hub Consistency Sweep"
type: activity-definition
version: "1.0"
enabled: true
owner: custodian
governance: custodian
status: active
created: "2026-06-21"
trigger:
type: cron
cron_expression: "*/15 * * * *"
timezone: UTC
misfire_policy: skip
context_sources:
- type: state-hub
query: consistency_sweep_remote_all
required: true
params:
max_seconds: 300
source: activity-core
bind_to: context.consistency_sweep_remote_all
---
# ActivityDefinition: State Hub Consistency Sweep
Kubernetes projection of the Custodian-owned definition in
`/home/worsch/the-custodian/activity-definitions/state-hub-consistency-sweep.md`.
ops-service-inventory-probes.md: | ops-service-inventory-probes.md: |
--- ---
id: "40d15a87-7ff6-4d8e-992c-37df15f95110" id: "40d15a87-7ff6-4d8e-992c-37df15f95110"
@@ -578,7 +611,8 @@ spec:
method=self.command, method=self.command,
) )
try: try:
with urlopen(request, timeout=30) as response: timeout = 360 if self.command == "POST" else 30
with urlopen(request, timeout=timeout) as response:
payload = response.read() payload = response.read()
self.send_response(response.status) self.send_response(response.status)
for key, value in response.headers.items(): for key, value in response.headers.items():
@@ -599,7 +633,7 @@ spec:
ThreadingHTTPServer(("0.0.0.0", 18080), Proxy).serve_forever() ThreadingHTTPServer(("0.0.0.0", 18080), Proxy).serve_forever()
readinessProbe: readinessProbe:
httpGet: httpGet:
path: /state/summary path: /state/health
port: http port: http
initialDelaySeconds: 5 initialDelaySeconds: 5
periodSeconds: 10 periodSeconds: 10


@@ -32,8 +32,10 @@ Europe/Berlin schedule, verify both runtime dependencies:
- `actcore-state-hub-bridge` can reach the State Hub API through the node-local - `actcore-state-hub-bridge` can reach the State Hub API through the node-local
tunnel expected at `127.0.0.1:18000`. tunnel expected at `127.0.0.1:18000`.
- `LLM_CONNECT_URL` is set to an operator-approved llm-connect endpoint that can - `LLM_CONNECT_URL` points at the verified in-namespace llm-connect Service,
serve the `custodian-triage-balanced` profile. `http://llm-connect.activity-core.svc.cluster.local:8080`, and the
operator-owned provider Secret lets that Service serve the
`custodian-triage-balanced` profile.
If `LLM_CONNECT_URL` is missing or broken, report-sink instructions write a If `LLM_CONNECT_URL` is missing or broken, report-sink instructions write a
visible `execution_failed` diagnostic instead of silently producing no report. visible `execution_failed` diagnostic instead of silently producing no report.


@@ -12,6 +12,7 @@ dependencies = [
"alembic>=1.14", "alembic>=1.14",
"nats-py>=2.7", "nats-py>=2.7",
"httpx>=0.27", "httpx>=0.27",
"pyyaml>=6.0",
] ]
[project.optional-dependencies] [project.optional-dependencies]


@@ -1,4 +1,5 @@
{ {
"$comment": "ACTIVITY-WP-0016-T02. Strict, bounded contract for the daily WSJF triage report. The per-item 'recommendations' schema is intentionally strict on STRUCTURE (types + required keys) so the T03 boundary parser can validate each recommendation independently and quarantine only the malformed ones. 'maxItems' is a producer hint (honoured by llm-connect constrained decoding and by the prompt); it is deliberately NOT hard-enforced by the in-repo validator, because rejecting a whole report for having too many items would reproduce the monolithic-failure bug WP-0016 exists to remove. Over-count is mitigated in T03 (keep top-N by rank, quarantine the rest). Value-domain vocabularies (action/confidence) are documented in the prompt and enforced by T04 guardrails with mitigation, not as brittle hard-fail enums here.",
"type": "object", "type": "object",
"required": ["summary", "recommendations"], "required": ["summary", "recommendations"],
"properties": { "properties": {
@@ -7,8 +8,28 @@
}, },
"recommendations": { "recommendations": {
"type": "array", "type": "array",
"maxItems": 7,
"items": { "items": {
"type": "object" "type": "object",
"required": ["rank", "candidate", "action", "why"],
"properties": {
"rank": { "type": "integer" },
"candidate": { "type": "string" },
"action": { "type": "string" },
"why": { "type": "string" },
"confidence": { "type": "string" },
"wsjf": {
"type": "object",
"properties": {
"score": { "type": "number" },
"strategic_value": { "type": "number" },
"time_criticality": { "type": "number" },
"risk_reduction": { "type": "number" },
"opportunity_enablement": { "type": "number" },
"job_size": { "type": "number" }
}
}
}
} }
} }
} }

View File

@@ -0,0 +1,212 @@
#!/usr/bin/env python3
"""Railiance01 no-restart smoke for POST /admin/sync.

Patches the disabled ops-service-inventory-probes projection in the cluster
ConfigMap, waits for the API pod volume to refresh, runs /admin/sync twice,
verifies DB + Temporal schedule drift without restarting actcore-worker, then
rolls the ConfigMap back to the disabled baseline.

Requires:
  - KUBECONFIG pointing at railiance01 (for example ~/.kube/config-hosteurope)
  - kubectl access to the activity-core namespace

Example:
  export KUBECONFIG=~/.kube/config-hosteurope
  python3 scripts/smoke_admin_sync_no_restart.py
"""

from __future__ import annotations

import json
import subprocess
import sys
import time

ACTIVITY_ID = "40d15a87-7ff6-4d8e-992c-37df15f95110"
CONFIGMAP = "actcore-external-activity-definitions"
DEFINITION_KEY = "ops-service-inventory-probes.md"
MOUNTED_FILE = (
    "/etc/activity-core/external-definitions/activity-definitions/"
    f"{DEFINITION_KEY}"
)
VOLUME_PROPAGATION_SECONDS = 65


def kubectl(*args: str, input_text: str | None = None) -> str:
    cmd = ["kubectl", "-n", "activity-core", *args]
    return subprocess.check_output(
        cmd,
        input=input_text,
        text=True,
    )


def api_json(path: str, *, method: str = "GET") -> dict:
    script = (
        "import urllib.request, json\n"
        f'req = urllib.request.Request("http://localhost:8010{path}", method="{method}")\n'
        "print(urllib.request.urlopen(req).read().decode())"
    )
    return json.loads(kubectl("exec", "deploy/actcore-api", "--", "python3", "-c", script))


def worker_lines(script: str) -> list[str]:
    return kubectl("exec", "deploy/actcore-worker", "--", "python3", "-c", script).splitlines()


def worker_uid() -> str:
    return kubectl(
        "get",
        "pod",
        "-l",
        "app.kubernetes.io/name=actcore-worker",
        "-o",
        "jsonpath={.items[0].metadata.uid}",
    ).strip()


def load_configmap() -> dict:
    return json.loads(kubectl("get", "configmap", CONFIGMAP, "-o", "json"))


def apply_configmap(cm: dict) -> None:
    kubectl("apply", "-f", "-", input_text=json.dumps(cm))


def patch_definition(cm: dict, *, enabled: bool, cron: str) -> None:
    text = cm["data"][DEFINITION_KEY]
    for line in text.splitlines():
        if line.strip().startswith("enabled:"):
            break
    else:
        raise RuntimeError("enabled field not found in projection")
    text = _replace_once(text, 'enabled: false', f"enabled: {'true' if enabled else 'false'}")
    text = _replace_once(text, 'enabled: true', f"enabled: {'true' if enabled else 'false'}")
    text = _replace_once(
        text,
        'cron_expression: "15 * * * *"',
        f'cron_expression: "{cron}"',
    )
    text = _replace_once(
        text,
        'cron_expression: "25 * * * *"',
        f'cron_expression: "{cron}"',
    )
    cm["data"][DEFINITION_KEY] = text
    apply_configmap(cm)


def _replace_once(text: str, old: str, new: str) -> str:
    if old not in text:
        return text
    return text.replace(old, new, 1)


def wait_for_mount(*, enabled: bool, cron: str) -> None:
    deadline = time.time() + VOLUME_PROPAGATION_SECONDS
    want_enabled = "enabled: true" if enabled else "enabled: false"
    want_cron = f'cron_expression: "{cron}"'
    while time.time() < deadline:
        content = kubectl("exec", "deploy/actcore-api", "--", "cat", MOUNTED_FILE)
        if want_enabled in content and want_cron in content:
            return
        time.sleep(5)
    raise RuntimeError(
        f"ConfigMap projection did not refresh within {VOLUME_PROPAGATION_SECONDS}s"
    )


def get_definition() -> dict[str, object]:
    for item in api_json("/activity-definitions/"):
        if item["id"] == ACTIVITY_ID:
            return {
                "enabled": item["enabled"],
                "cron": item["trigger_config"]["cron_expression"],
            }
    raise RuntimeError(f"ActivityDefinition {ACTIVITY_ID} not found")


def describe_schedule() -> dict[str, object]:
    script = f"""
import asyncio
from temporalio.client import Client

async def main() -> None:
    client = await Client.connect("actcore-temporal:7233")
    handle = client.get_schedule_handle("activity-schedule-{ACTIVITY_ID}")
    described = await handle.describe()
    schedule = described.schedule
    minute = schedule.spec.calendars[0].minute[0].start if schedule.spec.calendars else None
    print(schedule.state.paused)
    print(minute)

asyncio.run(main())
"""
    paused, minute = worker_lines(script)
    return {"paused": paused == "True", "minute": int(minute)}


def main() -> int:
    worker_before = worker_uid()
    cm = load_configmap()

    print("1) enable + cadence change via ConfigMap")
    patch_definition(cm, enabled=True, cron="25 * * * *")
    wait_for_mount(enabled=True, cron="25 * * * *")

    print("2) POST /admin/sync (first pass)")
    sync1 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
    if not sync1.get("ok"):
        print(json.dumps(sync1, indent=2), file=sys.stderr)
        return 1
    defn = get_definition()
    schedule = describe_schedule()
    print(" definition:", defn)
    print(" schedule:", schedule)
    if defn != {"enabled": True, "cron": "25 * * * *"}:
        print("definition drift after sync", file=sys.stderr)
        return 1
    if schedule["paused"] or schedule["minute"] != 25:
        print("schedule drift after enable sync", file=sys.stderr)
        return 1

    print("3) POST /admin/sync (idempotent repeat)")
    sync2 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
    if sync2.get("schedules") != sync1.get("schedules"):
        print("idempotent schedule counts changed", file=sys.stderr)
        print(json.dumps({"sync1": sync1, "sync2": sync2}, indent=2), file=sys.stderr)
        return 1

    print("4) rollback ConfigMap + sync")
    cm = load_configmap()
    patch_definition(cm, enabled=False, cron="15 * * * *")
    wait_for_mount(enabled=False, cron="15 * * * *")
    sync3 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
    if not sync3.get("ok"):
        print(json.dumps(sync3, indent=2), file=sys.stderr)
        return 1
    defn = get_definition()
    schedule = describe_schedule()
    print(" definition:", defn)
    print(" schedule:", schedule)
    if defn != {"enabled": False, "cron": "15 * * * *"}:
        print("rollback definition drift", file=sys.stderr)
        return 1
    if not schedule["paused"] or schedule["minute"] != 15:
        print("rollback schedule drift", file=sys.stderr)
        return 1

    worker_after = worker_uid()
    if worker_before != worker_after:
        print("actcore-worker pod restarted during smoke", file=sys.stderr)
        return 1

    print("smoke passed: admin sync hot-reload without worker restart")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())


@@ -11,8 +11,10 @@ activities that need DB access.
from __future__ import annotations from __future__ import annotations
import json
import uuid import uuid
from datetime import datetime, timezone from datetime import datetime, timezone
from typing import Any
from sqlalchemy import select from sqlalchemy import select
from sqlalchemy.dialects.postgresql import insert as pg_insert from sqlalchemy.dialects.postgresql import insert as pg_insert
@@ -52,6 +54,36 @@ def _get_session_factory() -> async_sessionmaker[AsyncSession]:
     return _session_factory
 
 
+def _bind_resolver_result(bind_key: str, result: Any) -> Any:
+    """Unwrap single-key resolver payloads when the key matches bind_key.
+
+    Resolvers such as ``discover_kaizen_projects`` return ``{"projects": [...]}``
+    while definitions bind to ``context.projects`` and iterate ``for_each:
+    context.projects``. Multi-key summaries (e.g. repo SBOM bulk) stay intact.
+    """
+    if isinstance(result, dict) and len(result) == 1 and bind_key in result:
+        return result[bind_key]
+    return result
+
+
+def _parse_event_envelope(event_envelope_json: str | None) -> dict[str, Any] | None:
+    """Parse an event envelope JSON string for context resolvers."""
+    if not event_envelope_json:
+        return None
+    try:
+        payload = json.loads(event_envelope_json)
+    except (TypeError, json.JSONDecodeError) as exc:
+        activity.logger.warning("Invalid event envelope JSON - %s", exc)
+        return None
+    if not isinstance(payload, dict):
+        activity.logger.warning(
+            "Invalid event envelope JSON - expected object, got %s",
+            type(payload).__name__,
+        )
+        return None
+    return payload
+
+
 # ── Activities ─────────────────────────────────────────────────────────────────
 
 @activity.defn
@@ -111,11 +143,14 @@ async def resolve_context(
     from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY
 
     snapshot: dict = {}
+    event_envelope = _parse_event_envelope(event_envelope_json)
     for source in context_sources:
         source_type = source.get("type", "")
         query = source.get("query", "")
         params = source.get("params") or {}
         required = bool(source.get("required") or params.get("required", False))
+        resolver_params = dict(params)
+        resolver_params["required"] = required
         raw_bind = source.get("bind_to") or source.get("name") or source_type
         # Strip the 'context.' namespace prefix so evaluator can find the key.
         bind_key = raw_bind.removeprefix("context.") if raw_bind.startswith("context.") else raw_bind
@@ -139,7 +174,8 @@ async def resolve_context(
             continue
 
         try:
-            snapshot[bind_key] = resolver_cls().resolve(query, None, params)
+            resolved = resolver_cls().resolve(query, event_envelope, resolver_params)
+            snapshot[bind_key] = _bind_resolver_result(bind_key, resolved)
         except Exception as exc:
             if required:
                 raise ApplicationError(

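The envelope guard added to `resolve_context` above can be sketched in isolation. This is a minimal, hedged illustration: it uses a standard-library logger as a stand-in for `activity.logger`, and the sample envelope contents are hypothetical.

```python
from __future__ import annotations

import json
import logging
from typing import Any

# Stand-in for activity.logger, which needs a running Temporal activity.
logger = logging.getLogger("envelope-sketch")


def parse_event_envelope(raw: str | None) -> dict[str, Any] | None:
    """Return the decoded envelope, or None when it is absent or unusable."""
    if not raw:
        return None
    try:
        payload = json.loads(raw)
    except (TypeError, json.JSONDecodeError) as exc:
        logger.warning("Invalid event envelope JSON - %s", exc)
        return None
    if not isinstance(payload, dict):
        # Valid JSON, but resolvers expect an object at the top level.
        logger.warning("expected object, got %s", type(payload).__name__)
        return None
    return payload


ok = parse_event_envelope('{"attributes": {"repo": "demo-repo"}}')
not_object = parse_event_envelope("[1, 2, 3]")
broken = parse_event_envelope("{not json")
absent = parse_event_envelope(None)
```

Every failure mode collapses to `None`, so a resolver that needs the envelope (such as `event-payload`) is the one that decides whether a missing envelope is an error.
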
@@ -40,6 +40,7 @@ from temporalio.client import Client
 from activity_core.models import ActivityDefinition, CronTriggerConfig
 from activity_core.orm import ActivityDefinition as ActivityDefinitionRow, EventType as EventTypeRow
 from activity_core.schedule_manager import delete_schedule, upsert_schedule
+from activity_core.sync_service import run_sync
 from activity_core.webhook_receiver import router as webhook_router
 
 TEMPORAL_HOST = os.environ.get("TEMPORAL_HOST", "localhost:7233")
@@ -275,6 +276,24 @@ async def trigger_definition(definition_id: uuid.UUID) -> dict[str, str]:
     return {"workflow_id": handle.id, "trigger_key": trigger_key}
 
 
+# --- Admin sync ---------------------------------------------------------------
+
+@app.post("/admin/sync")
+async def admin_sync(
+    definitions: bool = True,
+    schedules: bool = True,
+    event_types: bool = False,
+) -> dict[str, Any]:
+    """Run operator-triggered definition/event/schedule sync without restart."""
+    return await run_sync(
+        session_factory=_get_db(),
+        temporal_client=_get_temporal() if schedules else None,
+        definitions=definitions,
+        schedules=schedules,
+        event_types=event_types,
+    )
+
+
 # T42: Curator gate — event type approval endpoint
 
 @app.get("/health")

@@ -1 +1,8 @@
-from activity_core.context_resolvers import ops_inventory, repo_scoping, state_hub  # noqa: F401
+from activity_core.context_resolvers import (  # noqa: F401
+    event_payload,
+    kaizen,
+    ops_inventory,
+    repo_scoping,
+    state_hub,
+    reuse_surface,
+)

@@ -0,0 +1,51 @@
"""Event payload context adapter.

Registered as source type ``event-payload``. It exposes the triggering
EventEnvelope attributes to event-triggered ActivityDefinitions without
requiring an external context service call.
"""

from __future__ import annotations

from typing import Any

from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY, ContextResolver


class EventPayloadContextResolver(ContextResolver):
    """Resolve context from the triggering event envelope attributes."""

    def resolve(self, query: str, event: Any, params: dict[str, Any]) -> Any:
        attributes = _event_attributes(event)
        if query in {"", "attributes"}:
            return attributes
        if query.startswith("attributes."):
            return _resolve_path(attributes, query.removeprefix("attributes."))
        return _resolve_path(attributes, query)


def _event_attributes(event: Any) -> dict[str, Any]:
    if not isinstance(event, dict):
        raise RuntimeError("event-payload source requires an event envelope")
    attributes = event.get("attributes")
    if not isinstance(attributes, dict):
        raise RuntimeError("event-payload source requires envelope attributes")
    return attributes


def _resolve_path(root: dict[str, Any], path: str) -> Any:
    if not path:
        return root
    current: Any = root
    for part in path.split("."):
        if not part:
            return {}
        if not isinstance(current, dict):
            return {}
        current = current.get(part)
        if current is None:
            return {}
    return current


CONTEXT_RESOLVER_REGISTRY["event-payload"] = EventPayloadContextResolver
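
The dotted-path lookup used by the `event-payload` source can be tried on its own. The sketch below mirrors `_resolve_path` from the file above rather than importing it, and the attribute names (`repo`, `commit`) are hypothetical.

```python
from typing import Any


def resolve_path(root: dict[str, Any], path: str) -> Any:
    """Walk a dotted path through nested dicts; {} signals "not found"."""
    if not path:
        return root
    current: Any = root
    for part in path.split("."):
        # An empty segment ("a..b") or a non-dict parent ends the walk.
        if not part or not isinstance(current, dict):
            return {}
        current = current.get(part)
        if current is None:
            return {}
    return current


# Hypothetical envelope attributes, for illustration only.
attributes = {"repo": {"slug": "activity-core"}, "commit": {"sha": "abc123"}}

slug = resolve_path(attributes, "repo.slug")
missing = resolve_path(attributes, "repo.owner")
malformed = resolve_path(attributes, "repo..slug")
```

A missing key and an explicit `null` both come back as `{}`, so a definition cannot tell them apart from an empty object.
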

@@ -0,0 +1,305 @@
"""Kaizen-agentic fleet context adapter.

Registered as source types ``kaizen`` and ``resolver`` (alias for ADR-005 drafts).

Supported queries:
- discover_kaizen_scheduled_repos: hub roster ∩ valid ``.kaizen/schedule.yml``
- discover_kaizen_projects: repos with ``.kaizen/metrics`` marker (+ optional roster)

Contract: kaizen-agentic ``docs/integrations/discover-kaizen-scheduled-repos.md``
"""

from __future__ import annotations

import json
import logging
import os
import socket
from pathlib import Path
from typing import Any

import httpx
import yaml

from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY, ContextResolver

logger = logging.getLogger(__name__)

_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
_TIMEOUT_SECONDS = 10.0
_SCHEDULE_VERSION = "1"
_VALID_CADENCES = frozenset({"daily", "weekly", "monthly"})
_PREPARE_BIN = os.environ.get("KAIZEN_AGENTIC_BIN", "kaizen-agentic")


def _base_url() -> str:
    return os.environ.get("STATE_HUB_URL", _DEFAULT_STATE_HUB_URL).rstrip("/")


def _runner_host() -> str:
    return os.environ.get("KAIZEN_RUNNER_HOST", socket.gethostname())


def _fetch_repos(domain: str | None) -> list[dict[str, Any]]:
    url = f"{_base_url()}/repos/"
    try:
        resp = httpx.get(url, timeout=_TIMEOUT_SECONDS)
        resp.raise_for_status()
    except httpx.HTTPError as exc:
        raise RuntimeError(f"State Hub unreachable at {url}: {exc}") from exc
    payload = resp.json()
    if not isinstance(payload, list):
        raise RuntimeError(f"State Hub /repos/ returned non-list: {type(payload)!r}")
    if domain:
        payload = [r for r in payload if r.get("domain_slug") == domain]
    return payload


def _repo_root(repo: dict[str, Any]) -> Path | None:
    host_paths = repo.get("host_paths") or {}
    host = _runner_host()
    raw = host_paths.get(host) or repo.get("local_path")
    if not raw or raw == "(unknown)":
        return None
    path = Path(raw)
    return path if path.is_dir() else None


def _load_roster(params: dict[str, Any]) -> dict[str, dict[str, Any]] | None:
    """Return slug -> roster entry for active repos, or None if no roster param."""
    roster_path = params.get("roster")
    if not roster_path:
        return None
    path = Path(roster_path)
    if not path.is_file():
        logger.warning("kaizen roster file not found: %s", path)
        return {}
    data = yaml.safe_load(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        logger.warning("kaizen roster invalid (not a mapping): %s", path)
        return {}
    entries: dict[str, dict[str, Any]] = {}
    for item in data.get("active") or []:
        if isinstance(item, dict) and item.get("slug"):
            slug = str(item["slug"])
            if item.get("status", "active") == "saturated":
                continue
            entries[slug] = item
    return entries


def _validate_schedule_file(path: Path) -> list[str]:
    """Structural validation aligned with kaizen-agentic schedule validate."""
    errors: list[str] = []
    try:
        raw = yaml.safe_load(path.read_text(encoding="utf-8"))
    except yaml.YAMLError as exc:
        return [f"invalid YAML: {exc}"]
    if not isinstance(raw, dict):
        return ["schedule.yml must be a YAML mapping at the top level"]

    version = raw.get("version")
    if version is None:
        errors.append("missing required key: version")
    elif str(version) != _SCHEDULE_VERSION:
        errors.append(f"unsupported version '{version}' (expected '{_SCHEDULE_VERSION}')")

    agents = raw.get("agents", {})
    if not isinstance(agents, dict):
        errors.append("agents must be a mapping")
        return errors
    if not agents:
        errors.append("no agents declared under 'agents:'")

    seen: set[str] = set()
    for name, settings in agents.items():
        if settings is None:
            settings = {}
        if not isinstance(settings, dict):
            errors.append(f"agent '{name}' settings must be a mapping")
            continue
        if name in seen:
            errors.append(f"duplicate agent entry: {name}")
        seen.add(name)
        cadence = str(settings.get("cadence", ""))
        if cadence not in _VALID_CADENCES:
            errors.append(
                f"agent '{name}': invalid cadence '{cadence}' "
                f"(expected one of {', '.join(sorted(_VALID_CADENCES))})"
            )
        cron = settings.get("cron")
        if cron is not None and not isinstance(cron, str):
            errors.append(f"agent '{name}' cron must be a string")
    return errors


def _parse_schedule(path: Path) -> dict[str, Any] | None:
    errors = _validate_schedule_file(path)
    if errors:
        return None
    raw = yaml.safe_load(path.read_text(encoding="utf-8"))
    return raw if isinstance(raw, dict) else None


def _prepare_command(agent: str, root: Path) -> str:
    return f"{_PREPARE_BIN} schedule prepare {agent} --target {root}"


def discover_kaizen_scheduled_repos(params: dict[str, Any]) -> dict[str, Any]:
    domain = params.get("domain")
    cadence_filter = params.get("cadence")
    roster = _load_roster(params)
    runs: list[dict[str, Any]] = []

    for repo in _fetch_repos(domain):
        slug = repo.get("slug", "")
        if not slug:
            continue
        if roster is not None and slug not in roster:
            continue
        root = _repo_root(repo)
        if root is None:
            logger.info("kaizen repo_unreachable slug=%s host=%s", slug, _runner_host())
            continue
        schedule_path = root / ".kaizen" / "schedule.yml"
        if not schedule_path.is_file():
            continue
        errors = _validate_schedule_file(schedule_path)
        if errors:
            logger.warning(
                "kaizen schedule_invalid slug=%s path=%s errors=%s",
                slug,
                schedule_path,
                "; ".join(errors),
            )
            continue
        schedule = _parse_schedule(schedule_path)
        if schedule is None:
            continue
        timezone = schedule.get("timezone") or "Europe/Berlin"
        roster_agents = roster.get(slug, {}).get("agents") if roster else None
        agents = schedule.get("agents") or {}
        for agent_name, settings in agents.items():
            if not isinstance(settings, dict):
                continue
            if not bool(settings.get("enabled", True)):
                continue
            cadence = str(settings.get("cadence", ""))
            if cadence_filter and cadence != cadence_filter:
                continue
            if roster_agents and agent_name not in roster_agents:
                continue
            cron = settings.get("cron")
            runs.append(
                {
                    "repo": slug,
                    "root": str(root),
                    "agent": agent_name,
                    "cadence": cadence,
                    "cron": cron,
                    "timezone": timezone,
                    "enabled": True,
                    "prepare_command": _prepare_command(agent_name, root),
                }
            )
    return {"scheduled_runs": runs}


def _read_metrics_summary(metrics_dir: Path) -> dict[str, Any]:
    summary_path = metrics_dir / "summary.json"
    if not summary_path.is_file():
        return {}
    try:
        data = json.loads(summary_path.read_text(encoding="utf-8"))
        return data if isinstance(data, dict) else {}
    except (json.JSONDecodeError, OSError):
        return {}


def discover_kaizen_projects(params: dict[str, Any]) -> dict[str, Any]:
    """Discover repos with ``.kaizen/metrics`` (optional per-agent summaries)."""
    domain = params.get("domain")
    marker = params.get("marker", ".kaizen/metrics")
    roster = _load_roster(params)
    in_roster_key = "in_pilot_roster"
    projects: list[dict[str, Any]] = []

    for repo in _fetch_repos(domain):
        slug = repo.get("slug", "")
        if not slug:
            continue
        in_pilot = roster is None or slug in roster
        if roster is not None and slug not in roster:
            continue
        root = _repo_root(repo)
        if root is None:
            continue
        metrics_root = root / Path(marker)
        if not metrics_root.is_dir():
            continue
        has_metrics = any(metrics_root.iterdir()) if metrics_root.is_dir() else False
        if not has_metrics:
            continue
        roster_entry = roster.get(slug, {}) if roster else {}
        agent_filter = roster_entry.get("agents")
        for agent_dir in sorted(metrics_root.iterdir()):
            if not agent_dir.is_dir() or agent_dir.name == "optimizer":
                continue
            agent = agent_dir.name
            if agent_filter and agent not in agent_filter:
                continue
            summary = _read_metrics_summary(agent_dir)
            projects.append(
                {
                    "repo": slug,
                    "root": str(root),
                    "agent": agent,
                    "has_metrics": True,
                    in_roster_key: in_pilot,
                    "summary": summary,
                }
            )
        if not any(p["repo"] == slug for p in projects):
            projects.append(
                {
                    "repo": slug,
                    "root": str(root),
                    "agent": None,
                    "has_metrics": has_metrics,
                    in_roster_key: in_pilot,
                    "summary": {},
                }
            )
    return {"projects": projects}


class KaizenContextResolver(ContextResolver):
    """Resolves kaizen fleet scheduling and project metrics discovery."""

    def resolve(self, query: str, event: Any, params: dict[str, Any]) -> dict[str, Any]:
        if query == "discover_kaizen_scheduled_repos":
            return discover_kaizen_scheduled_repos(params)
        if query == "discover_kaizen_projects":
            return discover_kaizen_projects(params)
        return {}


CONTEXT_RESOLVER_REGISTRY["kaizen"] = KaizenContextResolver
CONTEXT_RESOLVER_REGISTRY["resolver"] = KaizenContextResolver
CONTEXT_RESOLVER_REGISTRY["shell"] = KaizenContextResolver
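
Both kaizen queries return a single-key wrapper (`{"scheduled_runs": [...]}` or `{"projects": [...]}`), which is the shape `_bind_resolver_result` in the activities diff unwraps. A minimal sketch of that interaction, assuming the helper behaves as shown in the diff; the sample project entry and the `fleet` bind key are hypothetical.

```python
from typing import Any


def bind_resolver_result(bind_key: str, result: Any) -> Any:
    """Unwrap {bind_key: value} payloads; leave every other shape untouched."""
    if isinstance(result, dict) and len(result) == 1 and bind_key in result:
        return result[bind_key]
    return result


# Shape returned by discover_kaizen_projects (the entry itself is made up).
payload = {"projects": [{"repo": "demo-repo", "agent": "optimizer-a"}]}

# bind_to: context.projects -> bind_key "projects" -> the bare list.
unwrapped = bind_resolver_result("projects", payload)

# A different bind key keeps the wrapper, so for_each would see a dict.
kept = bind_resolver_result("fleet", payload)

# Multi-key summaries are never unwrapped, even if the key is present.
multi = bind_resolver_result("projects", {"projects": [], "total": 0})
```

The unwrap only fires when the bind key equals the wrapper key, so a definition that binds `discover_kaizen_scheduled_repos` would need `bind_to: context.scheduled_runs` to iterate the runs directly.
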

@@ -0,0 +1,516 @@
"""Reuse-surface registry hygiene context adapter.

Registered as source type ``reuse-surface`` and as the ``shell`` resolver
dispatcher for the ``reuse_surface_report_gaps`` query. Other shell queries
continue to delegate to the kaizen resolver for backward compatibility.
"""

from __future__ import annotations

import json
import logging
import os
import socket
import subprocess
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

import httpx
import yaml

from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY, ContextResolver
from activity_core.context_resolvers.kaizen import KaizenContextResolver
from activity_core.context_resolvers.state_hub import StateHubContextResolver

logger = logging.getLogger(__name__)

_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
_REPORT_TIMEOUT_SECONDS = 60
_STATE_HUB_TIMEOUT_SECONDS = 10.0
_KNOWN_SIGNALS = frozenset(
    {
        "registry_gap",
        "empty_capability_scaffold",
        "stale_scope",
        "stale_sbom",
        "publish_check_fail",
    }
)


@dataclass(frozen=True)
class RosterEntry:
    slug: str
    domain: str | None = None
    publish_check: str | None = None


def _base_url() -> str:
    return os.environ.get("STATE_HUB_URL", _DEFAULT_STATE_HUB_URL).rstrip("/")


def _runner_host(params: dict[str, Any]) -> str:
    return str(
        params.get("runner_host")
        or os.environ.get("KAIZEN_RUNNER_HOST")
        or socket.gethostname()
    )


def _as_required(params: dict[str, Any]) -> bool:
    return bool(params.get("required", False))


def reuse_surface_report_gaps(params: dict[str, Any]) -> dict[str, Any]:
    """Resolve registry-hygiene gaps for the next rollout batch.

    Missing operational dependencies are visible failures for required sources
    and graceful empty lists for optional sources so definitions can opt into
    either behavior without changing rule logic.
    """
    try:
        return _resolve_reuse_surface_report_gaps(params)
    except Exception as exc:
        if _as_required(params):
            raise
        logger.warning("reuse_surface_report_gaps unavailable: %s", exc)
        return {"gaps": []}


def _resolve_reuse_surface_report_gaps(params: dict[str, Any]) -> dict[str, Any]:
    roster_path = _roster_path(params)
    entries = _load_active_roster_entries(roster_path)
    if not entries:
        return {"gaps": []}

    state_path = _round_robin_state_path(params, roster_path)
    selected, next_cursor = _select_round_robin_batch(
        entries,
        _batch_size(params),
        state_path,
    )
    if not selected:
        return {"gaps": []}

    signals = _enabled_signals(_signals_path(params, roster_path))
    roots = _resolve_repo_roots(selected, _runner_host(params))
    report = _reuse_surface_report(params, signals)
    gaps = _gap_records(selected, roots, signals, report)
    _write_round_robin_state(state_path, next_cursor, selected)
    return {"gaps": gaps}


def _roster_path(params: dict[str, Any]) -> Path:
    raw = params.get("roster")
    if not raw:
        raise ValueError("reuse_surface_report_gaps requires params.roster")
    path = Path(str(raw)).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"reuse_surface_report_gaps roster not found: {path}")
    return path


def _batch_size(params: dict[str, Any]) -> int:
    try:
        return max(1, int(params.get("batch_size", 3)))
    except (TypeError, ValueError):
        return 3


def _round_robin_state_path(params: dict[str, Any], roster_path: Path) -> Path:
    raw = params.get("round_robin_state")
    if raw:
        return Path(str(raw)).expanduser()
    return roster_path.with_name("round-robin-state.json")


def _signals_path(params: dict[str, Any], roster_path: Path) -> Path:
    raw = params.get("signals")
    if raw:
        return Path(str(raw)).expanduser()
    return roster_path.with_name("signals.yml")


def _load_active_roster_entries(path: Path) -> list[RosterEntry]:
    data = yaml.safe_load(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"reuse_surface rollout roster is not a mapping: {path}")

    entries: dict[str, RosterEntry] = {}
    for domain, block in _iter_domain_blocks(data):
        if _domain_phase(block) != "active":
            continue
        for item in _repo_items(block):
            entry = _entry_from_item(item, domain, block)
            if entry and entry.slug not in entries:
                entries[entry.slug] = entry
    return list(entries.values())


def _iter_domain_blocks(data: dict[str, Any]) -> list[tuple[str | None, dict[str, Any]]]:
    domains = data.get("domains")
    if isinstance(domains, dict):
        return [
            (str(name), block)
            for name, block in domains.items()
            if isinstance(block, dict)
        ]
    if isinstance(domains, list):
        return [
            (str(block.get("name") or block.get("domain") or ""), block)
            for block in domains
            if isinstance(block, dict)
        ]
    if isinstance(data.get("active"), list):
        return [(None, {"phase": "active", "repos": data["active"]})]
    return [
        (str(name), block)
        for name, block in data.items()
        if isinstance(block, dict) and ("phase" in block or "repos" in block)
    ]


def _domain_phase(block: dict[str, Any]) -> str:
    return str(block.get("phase") or block.get("status") or "").lower()


def _repo_items(block: dict[str, Any]) -> list[Any]:
    repos = (
        block.get("repos")
        or block.get("repo_slugs")
        or block.get("repositories")
        or block.get("slugs")
        or []
    )
    if isinstance(repos, dict):
        items: list[Any] = []
        for slug, config in repos.items():
            if isinstance(config, dict):
                item = dict(config)
                item.setdefault("slug", slug)
                items.append(item)
            else:
                items.append(str(slug))
        return items
    if isinstance(repos, list):
        return repos
    return []


def _entry_from_item(
    item: Any,
    domain: str | None,
    block: dict[str, Any],
) -> RosterEntry | None:
    publish_check = block.get("publish_check")
    if isinstance(item, str):
        slug = item
    elif isinstance(item, dict):
        slug = item.get("slug") or item.get("repo") or item.get("name")
        publish_check = item.get("publish_check", publish_check)
    else:
        return None
    if not slug:
        return None
    return RosterEntry(
        slug=str(slug),
        domain=domain or None,
        publish_check=str(publish_check).lower() if publish_check is not None else None,
    )


def _select_round_robin_batch(
    entries: list[RosterEntry],
    batch_size: int,
    state_path: Path,
) -> tuple[list[RosterEntry], int]:
    if not entries:
        return [], 0
    cursor = _read_round_robin_cursor(state_path) % len(entries)
    size = min(batch_size, len(entries))
    selected = [entries[(cursor + offset) % len(entries)] for offset in range(size)]
    next_cursor = (cursor + size) % len(entries)
    return selected, next_cursor


def _read_round_robin_cursor(path: Path) -> int:
    if not path.is_file():
        return 0
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return 0
    if not isinstance(data, dict):
        return 0
    try:
        return int(data.get("cursor", 0))
    except (TypeError, ValueError):
        return 0


def _write_round_robin_state(
    path: Path,
    cursor: int,
    selected: list[RosterEntry],
) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "cursor": cursor,
        "last_batch": [entry.slug for entry in selected],
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
    path.write_text(
        json.dumps(payload, indent=2, sort_keys=True) + "\n",
        encoding="utf-8",
    )


def _enabled_signals(path: Path) -> set[str]:
    if not path.is_file():
        return set(_KNOWN_SIGNALS)
    data = yaml.safe_load(path.read_text(encoding="utf-8"))
    node = data.get("signals") if isinstance(data, dict) else data
    enabled: set[str] = set()
    saw_known_signal = False

    if isinstance(node, dict):
        for name, config in node.items():
            if str(name) not in _KNOWN_SIGNALS:
                continue
            saw_known_signal = True
            if isinstance(config, dict) and config.get("enabled") is False:
                continue
            if config is False:
                continue
            enabled.add(str(name))
    elif isinstance(node, list):
        for item in node:
            if isinstance(item, str) and item in _KNOWN_SIGNALS:
                saw_known_signal = True
                enabled.add(item)
            elif isinstance(item, dict):
                name = item.get("id") or item.get("signal") or item.get("name")
                if str(name) in _KNOWN_SIGNALS and item.get("enabled", True) is not False:
                    saw_known_signal = True
                    enabled.add(str(name))

    return enabled if saw_known_signal else set(_KNOWN_SIGNALS)


def _resolve_repo_roots(
    entries: list[RosterEntry],
    runner_host: str,
) -> dict[str, Path]:
    requested = {entry.slug for entry in entries}
    roots: dict[str, Path] = {}
    for repo in _fetch_repos():
        slug = str(repo.get("slug") or "")
        if slug not in requested:
            continue
        raw = _repo_path_for_host(repo, runner_host)
        if raw:
            roots[slug] = Path(raw)
    return roots


def _fetch_repos() -> list[dict[str, Any]]:
    url = f"{_base_url()}/repos/"
    try:
        resp = httpx.get(url, timeout=_STATE_HUB_TIMEOUT_SECONDS)
        resp.raise_for_status()
    except httpx.HTTPError as exc:
        raise RuntimeError(f"State Hub unreachable at {url}: {exc}") from exc
    payload = resp.json()
    if not isinstance(payload, list):
        raise RuntimeError(f"State Hub /repos/ returned non-list: {type(payload)!r}")
    return [repo for repo in payload if isinstance(repo, dict)]


def _repo_path_for_host(repo: dict[str, Any], runner_host: str) -> str | None:
    host_paths = repo.get("host_paths") or {}
    raw = None
    if isinstance(host_paths, dict):
        raw = host_paths.get(runner_host)
    raw = raw or repo.get("local_path")
    if not raw or raw == "(unknown)":
        return None
    return str(raw)


def _reuse_surface_report(params: dict[str, Any], signals: set[str]) -> dict[str, Any]:
    if not (signals & {"registry_gap", "empty_capability_scaffold"}):
        return {}

    binary = str(params.get("reuse_surface_bin") or "reuse-surface")
    try:
        completed = subprocess.run(
            [binary, "report", "gaps", "--format", "json"],
            capture_output=True,
            check=False,
            text=True,
            timeout=_REPORT_TIMEOUT_SECONDS,
        )
    except FileNotFoundError as exc:
        raise RuntimeError(f"reuse-surface CLI not found: {binary}") from exc
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError("reuse-surface report gaps timed out") from exc

    if completed.returncode != 0:
        detail = completed.stderr.strip() or completed.stdout.strip()
        raise RuntimeError(f"reuse-surface report gaps failed: {detail}")
    try:
        payload = json.loads(completed.stdout or "{}")
    except json.JSONDecodeError as exc:
        raise RuntimeError("reuse-surface report gaps returned invalid JSON") from exc
    if not isinstance(payload, dict):
        raise RuntimeError("reuse-surface report gaps returned non-object JSON")
    return payload


def _gap_records(
    entries: list[RosterEntry],
    roots: dict[str, Path],
    signals: set[str],
    report: dict[str, Any],
) -> list[dict[str, Any]]:
    empty_scaffolds = _repo_set(report, {"empty_scaffolds", "empty_scaffold"})
    publish_fail = _repo_set(
        report,
        {"publish_fail", "publish_fails", "publish_failures"},
    )
    gaps: list[dict[str, Any]] = []
    seen: set[tuple[str, str]] = set()

    for entry in entries:
        root = roots.get(entry.slug)
        if root is None:
            logger.info("reuse_surface repo_unreachable slug=%s", entry.slug)
            continue
        if (
            signals & {"registry_gap", "empty_capability_scaffold"}
            and entry.slug in empty_scaffolds
        ):
            _append_gap(gaps, seen, entry.slug, root, "empty_capability_scaffold")
        if "registry_gap" in signals and entry.slug in publish_fail:
            _append_gap(gaps, seen, entry.slug, root, "registry_gap")
        if "publish_check_fail" in signals and entry.publish_check == "fail":
            _append_gap(gaps, seen, entry.slug, root, "publish_check_fail")
        if "stale_scope" in signals and _scope_is_stale(root):
            _append_gap(gaps, seen, entry.slug, root, "stale_scope")
        if "stale_sbom" in signals and _sbom_is_stale(entry.slug):
            _append_gap(gaps, seen, entry.slug, root, "stale_sbom")
    return gaps


def _append_gap(
    gaps: list[dict[str, Any]],
    seen: set[tuple[str, str]],
    slug: str,
    root: Path,
    signal: str,
) -> None:
    key = (slug, signal)
    if key in seen:
        return
    seen.add(key)
    gaps.append(
        {
            "repo": slug,
            "root": str(root),
            "signal": signal,
            "hygiene_signal": signal,
        }
    )


def _scope_is_stale(root: Path) -> bool:
    scope = root / "SCOPE.md"
    if not scope.is_file():
        return True
    age_seconds = datetime.now(timezone.utc).timestamp() - scope.stat().st_mtime
    return age_seconds > 90 * 24 * 60 * 60


def _sbom_is_stale(slug: str) -> bool:
    payload = StateHubContextResolver().resolve(
        "repo_sbom_status",
        None,
        {"repo_slug": slug},
    )
    if not isinstance(payload, dict):
        return False
    try:
        return int(payload.get("sbom_age_days", 0)) > 30
    except (TypeError, ValueError):
        return False


def _repo_set(report: dict[str, Any], keys: set[str]) -> set[str]:
    slugs: set[str] = set()
    for value in _values_for_keys(report, keys):
        slugs.update(_slugs_from_value(value))
    return slugs


def _values_for_keys(value: Any, keys: set[str]) -> list[Any]:
    values: list[Any] = []
    if isinstance(value, dict):
        for key, nested in value.items():
            if key in keys:
                values.append(nested)
            values.extend(_values_for_keys(nested, keys))
    elif isinstance(value, list):
        for item in value:
            values.extend(_values_for_keys(item, keys))
    return values


def _slugs_from_value(value: Any) -> set[str]:
    if isinstance(value, str):
        return {value}
    if isinstance(value, list):
        slugs: set[str] = set()
        for item in value:
            slugs.update(_slugs_from_value(item))
        return slugs
    if isinstance(value, dict):
        for key in ("repo", "repo_slug", "slug", "name"):
            if value.get(key):
                return {str(value[key])}
        slugs: set[str] = set()
        for key, nested in value.items():
            if nested is True or isinstance(nested, (dict, list)):
                slugs.add(str(key))
                slugs.update(_slugs_from_value(nested))
        return slugs
    return set()


class ReuseSurfaceContextResolver(ContextResolver):
    """Resolves reuse-surface registry hygiene gap reports."""

    def resolve(self, query: str, event: Any, params: dict[str, Any]) -> dict[str, Any]:
        if query == "reuse_surface_report_gaps":
            return reuse_surface_report_gaps(params)
        return {}


class ShellContextResolver(ContextResolver):
    """Dispatch shell-backed context queries without breaking kaizen aliases."""

    def resolve(self, query: str, event: Any, params: dict[str, Any]) -> dict[str, Any]:
        if query == "reuse_surface_report_gaps":
            return reuse_surface_report_gaps(params)
        return KaizenContextResolver().resolve(query, event, params)


CONTEXT_RESOLVER_REGISTRY["reuse-surface"] = ReuseSurfaceContextResolver
CONTEXT_RESOLVER_REGISTRY["shell"] = ShellContextResolver
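
The round-robin batching in this file can be sketched without the state file. This is a simplified illustration that passes the cursor in directly instead of reading it from `round-robin-state.json`; the roster slugs are hypothetical.

```python
def select_round_robin_batch(
    slugs: list[str],
    batch_size: int,
    cursor: int,
) -> tuple[list[str], int]:
    """Return the next batch and the cursor to persist, wrapping around."""
    if not slugs:
        return [], 0
    start = cursor % len(slugs)
    # Never select more entries than exist, so no slug repeats in a batch.
    size = min(batch_size, len(slugs))
    selected = [slugs[(start + offset) % len(slugs)] for offset in range(size)]
    return selected, (start + size) % len(slugs)


roster = ["repo-a", "repo-b", "repo-c", "repo-d", "repo-e"]  # hypothetical

first, cursor = select_round_robin_batch(roster, 3, 0)
second, cursor = select_round_robin_batch(roster, 3, cursor)
```

In the resolver itself, `_write_round_robin_state` runs only after `_gap_records` returns, so a run that fails partway leaves the cursor where it was and the next run retries the same batch.
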

@@ -12,6 +12,7 @@ Supported queries:
 - coding_retro: latest /progress/ item with event_type=coding_retro
 - daily_triage_digest: curated scalar JSON digest for daily WSJF triage
 - recently_on_scope_hourly: POST {STATE_HUB_URL}/recently-on-scope/hourly
+- consistency_sweep_remote_all: POST {STATE_HUB_URL}/consistency/sweep/remote-all
 
 No caching — state hub data is live operational state and must not be stale
 within a single workflow run.
@@ -31,6 +32,7 @@ from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY, Cont
 
 _DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
 _TIMEOUT_SECONDS = 10.0
+_SWEEP_TIMEOUT_SECONDS = 330.0
 _OPEN_WORKSTREAM_STATUSES = {"active", "ready", "blocked"}
 _OPEN_TASK_STATUSES = {"wait", "todo", "progress"}
 # Sentinel age for repos that have never had an SBOM ingested. Large enough
@@ -53,13 +55,26 @@ def _fetch_json(path: str, params: dict[str, Any] | None = None) -> Any:
         return {}
 
 
-def _post_json(path: str, payload: dict[str, Any]) -> Any:
+def _post_json(path: str, payload: dict[str, Any], *, timeout: float = _TIMEOUT_SECONDS) -> Any:
     url = f"{_base_url()}{path}"
-    resp = httpx.post(url, json=payload, timeout=_TIMEOUT_SECONDS)
+    resp = httpx.post(url, json=payload, timeout=timeout)
     resp.raise_for_status()
     return resp.json()
 
 
+def _validate_consistency_sweep_remote_all(result: Any) -> dict[str, Any]:
+    if not isinstance(result, dict):
+        raise RuntimeError("consistency_sweep_remote_all returned a non-object response")
+    required_keys = {"exit_code", "lock_skipped", "repos_processed"}
+    missing = required_keys - set(result)
+    if missing:
+        missing_list = ", ".join(sorted(missing))
+        raise RuntimeError(
+            f"consistency_sweep_remote_all response missing required key(s): {missing_list}"
+        )
+    return result
+
+
 def _validate_recently_on_scope_hourly(result: Any) -> dict[str, Any]:
     if not isinstance(result, dict):
         raise RuntimeError("recently_on_scope_hourly returned a non-object response")
@@ -107,6 +122,18 @@ class StateHubContextResolver(ContextResolver):
             }
             result = _post_json("/recently-on-scope/hourly", payload)
             return _validate_recently_on_scope_hourly(result)
+        if query == "consistency_sweep_remote_all":
+            payload = {
+                key: value
+                for key, value in params.items()
+                if key not in {"required"}
+            }
+            result = _post_json(
+                "/consistency/sweep/remote-all",
+                payload,
+                timeout=_SWEEP_TIMEOUT_SECONDS,
+            )
+            return _validate_consistency_sweep_remote_all(result)
         return {}
 
 
@@ -219,11 +246,13 @@ def _coding_retro(params: dict[str, Any]) -> dict[str, Any]:
     """
     event_type = str(params.get("event_type") or "coding_retro")
     limit = _bounded_int(params.get("limit", 100), default=100, minimum=1, maximum=500)
-    items = _fetch_json("/progress/", {"limit": limit})
+    query_params = {"event_type": event_type, "limit": limit}
+    items = _fetch_json("/progress/", query_params)
     if not isinstance(items, list):
         return _empty_coding_retro(event_type)
 
-    item = _latest_progress_item(items, event_type)
+    window_days = _optional_int(params.get("window_days"))
+    item = _latest_progress_item(items, event_type, window_days)
     if item is None:
         return _empty_coding_retro(event_type)
 
@@ -256,12 +285,18 @@ def _empty_coding_retro(event_type: str) -> dict[str, Any]:
 def _latest_progress_item(
     items: list[Any],
     event_type: str,
+    window_days: int | None = None,
 ) -> dict[str, Any] | None:
     newest: dict[str, Any] | None = None
     newest_key: tuple[datetime, int] | None = None
     for index, item in enumerate(items):
         if not isinstance(item, dict) or item.get("event_type") != event_type:
             continue
+        if window_days is not None and not _progress_matches_window_days(
+            item,
+            window_days,
+        ):
+            continue
         key = (_parse_progress_timestamp(item.get("created_at")), index)
         if newest_key is None or key > newest_key:
             newest = item
@@ -295,6 +330,56 @@ def _progress_detail(item: dict[str, Any]) -> dict[str, Any]:
     return {}


+def _progress_matches_window_days(item: dict[str, Any], window_days: int) -> bool:
+    detail = _progress_detail(item)
+    return _progress_window_days(detail) == window_days
+
+
+def _progress_window_days(detail: dict[str, Any]) -> int | None:
+    window = detail.get("window")
+    if isinstance(window, dict):
+        direct = _optional_int(window.get("days") or window.get("window_days"))
+        if direct is not None:
+            return direct
+        ranged = _window_days_from_range(
+            window.get("since") or window.get("window_start"),
+            window.get("until") or window.get("window_end"),
+        )
+        if ranged is not None:
+            return ranged
+    direct = _optional_int(detail.get("days") or detail.get("window_days"))
+    if direct is not None:
+        return direct
+    return _window_days_from_range(
+        detail.get("since") or detail.get("window_start"),
+        detail.get("until") or detail.get("window_end"),
+    )
+
+
+def _window_days_from_range(start: Any, end: Any) -> int | None:
+    start_ts = _parse_optional_timestamp(start)
+    end_ts = _parse_optional_timestamp(end)
+    if start_ts is None or end_ts is None or end_ts < start_ts:
+        return None
+    seconds = (end_ts - start_ts).total_seconds()
+    if seconds <= 0:
+        return None
+    return max(1, round(seconds / 86400))
+
+
+def _parse_optional_timestamp(value: Any) -> datetime | None:
+    if not isinstance(value, str) or not value:
+        return None
+    try:
+        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
+    except ValueError:
+        return None
+    if parsed.tzinfo is None:
+        parsed = parsed.replace(tzinfo=timezone.utc)
+    return parsed.astimezone(timezone.utc)
+
+
 def _normalise_coding_retro_suggestions(value: Any) -> list[dict[str, Any]]:
     if not isinstance(value, list):
         return []
@@ -374,6 +459,13 @@ def _bounded_int(value: Any, *, default: int, minimum: int, maximum: int) -> int
     return max(minimum, min(maximum, number))


+def _optional_int(value: Any) -> int | None:
+    try:
+        return int(value)
+    except (TypeError, ValueError):
+        return None
+
+
 def _clean_scalar(value: Any) -> str:
     return " ".join(str(value or "").split())
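The window-matching helpers added in this file derive a day count from a `since`/`until` pair when no explicit `days` value is present: the span is rounded to whole days, with a floor of one. A minimal standalone sketch of that rule follows, assuming ISO-8601 strings as input. The names `parse_ts` and `window_days_from_range` are illustrative copies for this example, not the module's private helpers.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse an ISO-8601 string into an aware UTC datetime, or None on failure."""
    if not isinstance(value, str) or not value:
        return None
    try:
        # Older Python versions reject a trailing "Z", so map it to an offset.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption carried over from the diff: naive timestamps are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def window_days_from_range(start, end):
    """Return the span in whole days (minimum 1), or None if it is unusable."""
    start_ts, end_ts = parse_ts(start), parse_ts(end)
    if start_ts is None or end_ts is None:
        return None
    seconds = (end_ts - start_ts).total_seconds()
    if seconds <= 0:
        return None
    return max(1, round(seconds / 86400))
```

With this rule a seven-day range matches `window_days=7`, a six-hour range counts as one day, and a reversed or unparseable range yields `None`, so it never matches any requested window.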


@@ -20,7 +20,8 @@ from activity_core.rules.models import TaskRef, TaskSpec
 logger = logging.getLogger(__name__)

-ISSUE_CORE_URL = os.environ.get("ISSUE_CORE_URL", "http://127.0.0.1:8010")
+ISSUE_CORE_URL = os.environ.get("ISSUE_CORE_URL", "http://127.0.0.1:8765")
+ISSUE_CORE_API_KEY_ENV = "ISSUE_CORE_API_KEY"
 ISSUE_SINK_TYPE = os.environ.get("ISSUE_SINK_TYPE", "rest")
@@ -30,10 +31,30 @@ class IssueSink(ABC):
 class IssueCoreRestSink(IssueSink):
-    """POSTs to issue-core REST API. Config: ISSUE_CORE_URL env var."""
+    """POSTs to issue-core REST API.
+
+    Config: ISSUE_CORE_URL and ISSUE_CORE_API_KEY env vars (shared key with
+    the issue-core server).
+    """

-    def __init__(self, base_url: str = ISSUE_CORE_URL) -> None:
+    def __init__(
+        self,
+        base_url: str = ISSUE_CORE_URL,
+        api_key: str | None = None,
+    ) -> None:
         self._base_url = base_url.rstrip("/")
+        if api_key is not None:
+            self._api_key = api_key.strip()
+        else:
+            self._api_key = os.environ.get(ISSUE_CORE_API_KEY_ENV, "").strip()
+
+    def _auth_headers(self) -> dict[str, str]:
+        if not self._api_key:
+            raise RuntimeError(
+                f"{ISSUE_CORE_API_KEY_ENV} is not set. "
+                "Required when ISSUE_SINK_TYPE=rest."
+            )
+        return {"Authorization": f"Bearer {self._api_key}"}

     def emit(self, task_spec: TaskSpec) -> TaskRef:
         payload = {
@@ -45,10 +66,19 @@ class IssueCoreRestSink(IssueSink):
             "due_in_days": task_spec.due_in_days,
             "source_type": task_spec.source_type,
             "source_id": task_spec.source_id,
-            "triggering_event_id": task_spec.triggering_event_id,
+            "triggering_event_id": (
+                str(task_spec.triggering_event_id)
+                if task_spec.triggering_event_id is not None
+                else None
+            ),
             "activity_definition_id": task_spec.activity_definition_id,
         }
-        resp = httpx.post(f"{self._base_url}/issues/", json=payload, timeout=10.0)
+        resp = httpx.post(
+            f"{self._base_url}/issues/",
+            json=payload,
+            headers=self._auth_headers(),
+            timeout=10.0,
+        )
         resp.raise_for_status()
         data = resp.json()
         return TaskRef(

View File

@@ -49,7 +49,18 @@ class CronTriggerConfig(BaseModel):
     )
     timezone: str = Field(default="UTC", description="IANA timezone name.")
     jitter_seconds: int = Field(default=0, ge=0)
-    misfire_policy: Literal["skip", "catchup", "compress"] = Field(default="skip")
+    # Run-miss recovery behaviour (ACTIVITY-WP-0014). What happens when a fire is
+    # missed because the worker / Temporal was unavailable at trigger time:
+    # skip - run on trigger or skip; a missed fire is never recovered
+    # catchup_all - recover every fire missed during the outage window
+    # catchup_latest - recover only the most recent missed fire; do not accumulate
+    # Legacy aliases are accepted: catchup → catchup_all, compress → catchup_latest.
+    misfire_policy: Literal[
+        "skip", "catchup_all", "catchup_latest", "catchup", "compress"
+    ] = Field(default="skip")
+    # Override the per-policy default catchup window (how far back Temporal will
+    # recover missed fires after an outage). None uses the policy default.
+    catchup_window_seconds: int | None = Field(default=None, ge=0)


 class EventTriggerConfig(BaseModel):
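The comment in this hunk says the legacy values `catchup` and `compress` are still accepted and map to `catchup_all` and `catchup_latest`. Where that mapping is applied is not visible in this hunk, so the following is only a sketch of what such a normalisation step could look like. `normalise_misfire_policy` and both constants are hypothetical names, not identifiers from the repository.

```python
# Hypothetical helper, not part of the diff: folds the legacy misfire_policy
# aliases onto the canonical names listed in the CronTriggerConfig comment.
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}
CANONICAL_POLICIES = frozenset({"skip", "catchup_all", "catchup_latest"})


def normalise_misfire_policy(policy: str) -> str:
    """Return the canonical policy name, accepting the two legacy aliases."""
    canonical = LEGACY_ALIASES.get(policy, policy)
    if canonical not in CANONICAL_POLICIES:
        raise ValueError(f"unknown misfire_policy: {policy!r}")
    return canonical
```

Normalising once at the boundary would let downstream schedule code branch on three values instead of five, while existing definitions that still say `catchup` or `compress` keep loading.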

View File

@@ -2,12 +2,15 @@
 from __future__ import annotations

+import json
 import os
+from pathlib import Path
 from typing import Any

 import httpx

 from activity_core.context_resolvers.ops_inventory import _sanitize_url
+from activity_core.state_hub_write import idempotency_headers

 _DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
 _INTER_HUB_SINK_TYPES = {
@@ -15,6 +18,10 @@ _INTER_HUB_SINK_TYPES = {
     "inter-hub-event",
     "inter-hub-interaction-event",
 }
+_CORE_HUB_SINK_TYPES = {
+    "core-hub",
+    "core-hub-interaction-event",
+}


 def persist_ops_inventory_evidence(payload: dict[str, Any]) -> list[dict[str, Any]]:
@@ -55,6 +62,12 @@ def persist_ops_inventory_evidence(payload: dict[str, Any]) -> list[dict[str, An
                 results.append(
                     _post_state_hub_progress(payload, bind_key, probe_result, sink)
                 )
+            elif sink_type in _CORE_HUB_SINK_TYPES:
+                results.append(
+                    _post_core_hub_interaction_event(
+                        payload, bind_key, probe_result, sink
+                    )
+                )
             elif sink_type in _INTER_HUB_SINK_TYPES:
                 results.append(_inter_hub_result(sink))
             else:
@@ -121,6 +134,7 @@ def _post_state_hub_progress(
     resp = httpx.post(
         f"{base_url}/progress/",
         json=body,
+        headers=idempotency_headers(run_id, context_key, event_type),
         timeout=float(sink.get("timeout_seconds", 10.0)),
     )
     resp.raise_for_status()
@@ -136,12 +150,17 @@ def _post_state_hub_progress(
 def _progress_exists(base_url: str, event_type: str, idempotency_key: str) -> bool:
-    resp = httpx.get(
-        f"{base_url}/progress/",
-        params={"limit": 100},
-        timeout=10.0,
-    )
-    resp.raise_for_status()
+    # Best-effort optimisation only; the Idempotency-Key header on the write is the
+    # real dedup guarantee. Do not hard-fail if State Hub is unreachable here.
+    try:
+        resp = httpx.get(
+            f"{base_url}/progress/",
+            params={"limit": 100},
+            timeout=10.0,
+        )
+        resp.raise_for_status()
+    except httpx.HTTPError:
+        return False
     for item in resp.json():
         detail = item.get("detail") or {}
         if (
@@ -152,6 +171,213 @@ def _progress_exists(base_url: str, event_type: str, idempotency_key: str) -> bo
     return False


+def _post_core_hub_interaction_event(
+    payload: dict[str, Any],
+    context_key: str,
+    probe_result: dict[str, Any],
+    sink: dict[str, Any],
+) -> dict[str, Any]:
+    raw_base_url = (
+        sink.get("core_hub_url")
+        or sink.get("base_url")
+        or os.environ.get("CORE_HUB_BASE_URL")
+        or ""
+    )
+    base_url = str(raw_base_url).rstrip("/")
+    runtime_token = _core_hub_runtime_token(sink)
+    widget_id = _core_hub_widget_id(sink, probe_result)
+    missing: list[str] = []
+    if not base_url:
+        missing.append("CORE_HUB_BASE_URL")
+    if not runtime_token:
+        missing.append("CORE_HUB_RUNTIME_TOKEN or CORE_HUB_RUNTIME_TOKEN_FILE")
+    if not widget_id:
+        missing.append("widget_id or CORE_HUB_WIDGET_ID")
+    if missing:
+        return {
+            "type": sink.get("type"),
+            "status": "skipped",
+            "reason": "missing_core_hub_config",
+            "missing": missing,
+            "context_key": context_key,
+        }
+    endpoint = _selected_endpoint(probe_result, sink)
+    event_type = sink.get("event_type", "ops-endpoint-verified")
+    timeout = float(sink.get("timeout_seconds", 10.0))
+    body = {
+        "widgetId": widget_id,
+        "eventType": event_type,
+        "viewContext": _core_hub_view_context(payload, context_key, endpoint, sink),
+        "metadata": _core_hub_metadata(payload, context_key, probe_result, endpoint),
+    }
+    resp = httpx.post(
+        f"{base_url}/api/v2/interaction-events",
+        json=body,
+        headers=_core_hub_headers(runtime_token),
+        timeout=timeout,
+    )
+    resp.raise_for_status()
+    data = resp.json()
+    event_id = data.get("id")
+    if not event_id:
+        raise RuntimeError("Core Hub interaction event response did not include an id")
+    if not _core_hub_event_exists(base_url, runtime_token, str(event_id), timeout):
+        raise RuntimeError("Core Hub interaction event was not visible after create")
+    return {
+        "type": sink.get("type"),
+        "status": "posted",
+        "event_type": data.get("eventType", event_type),
+        "event_id": event_id,
+        "widget_id": data.get("widgetId", widget_id),
+        "verified": True,
+        "context_key": context_key,
+    }
+
+
+def _core_hub_headers(runtime_token: str) -> dict[str, str]:
+    return {
+        "Accept": "application/json",
+        "Authorization": f"Bearer {runtime_token}",
+        "Content-Type": "application/json",
+        "User-Agent": "activity-core-ops-evidence/0.1",
+    }
+
+
+def _core_hub_runtime_token(sink: dict[str, Any]) -> str:
+    token_file = (
+        sink.get("runtime_token_file")
+        or sink.get("token_file")
+        or os.environ.get("CORE_HUB_RUNTIME_TOKEN_FILE")
+    )
+    if token_file:
+        return Path(str(token_file)).read_text(encoding="utf-8").strip()
+    env_name = (
+        sink.get("runtime_token_env")
+        or os.environ.get("CORE_HUB_RUNTIME_TOKEN_ENV")
+        or "CORE_HUB_RUNTIME_TOKEN"
+    )
+    return os.environ.get(str(env_name), "").strip()
+
+
+def _core_hub_widget_id(sink: dict[str, Any], probe_result: dict[str, Any]) -> str:
+    direct = sink.get("widget_id") or os.environ.get("CORE_HUB_WIDGET_ID")
+    if direct:
+        return str(direct)
+    endpoint = _selected_endpoint(probe_result, sink)
+    widget_ref = endpoint.get("widget_ref") if endpoint else None
+    if not widget_ref:
+        return ""
+    mapping = sink.get("widget_mapping") or sink.get("capability_mapping")
+    if mapping is None:
+        mapping = os.environ.get("CORE_HUB_WIDGET_MAPPING")
+    parsed = _parse_widget_mapping(mapping)
+    return parsed.get(str(widget_ref), "")
+
+
+def _parse_widget_mapping(raw: Any) -> dict[str, str]:
+    if isinstance(raw, dict):
+        return {str(key): str(value) for key, value in raw.items() if value}
+    if not isinstance(raw, str) or not raw.strip():
+        return {}
+    value = raw.strip()
+    if value.startswith("{"):
+        try:
+            loaded = json.loads(value)
+        except json.JSONDecodeError:
+            return {}
+        if isinstance(loaded, dict):
+            return {str(key): str(item) for key, item in loaded.items() if item}
+        return {}
+    if "=" not in value:
+        return {}
+    pairs: dict[str, str] = {}
+    for part in value.split(","):
+        key, _, item = part.partition("=")
+        if key.strip() and item.strip():
+            pairs[key.strip()] = item.strip()
+    return pairs
+
+
+def _selected_endpoint(probe_result: dict[str, Any], sink: dict[str, Any]) -> dict[str, Any]:
+    endpoints = [
+        endpoint
+        for endpoint in probe_result.get("endpoints", [])
+        if isinstance(endpoint, dict)
+    ]
+    endpoint_id = sink.get("endpoint_id")
+    if endpoint_id:
+        match = next(
+            (endpoint for endpoint in endpoints if endpoint.get("endpoint_id") == endpoint_id),
+            None,
+        )
+        if match:
+            return match
+    return next(
+        (endpoint for endpoint in endpoints if endpoint.get("widget_ref")),
+        endpoints[0] if endpoints else {},
+    )
+
+
+def _core_hub_view_context(
+    payload: dict[str, Any],
+    context_key: str,
+    endpoint: dict[str, Any],
+    sink: dict[str, Any],
+) -> str:
+    return str(
+        sink.get("view_context")
+        or endpoint.get("view_context")
+        or f"activity-core/ops-inventory/{payload.get('run_id', 'unknown')}/{context_key}"
+    )
+
+
+def _core_hub_metadata(
+    payload: dict[str, Any],
+    context_key: str,
+    probe_result: dict[str, Any],
+    endpoint: dict[str, Any],
+) -> dict[str, Any]:
+    compact = _compact_probe_result(probe_result)
+    return {
+        "activity_id": payload.get("activity_id"),
+        "activity_core_run_id": payload.get("run_id"),
+        "scheduled_for": payload.get("scheduled_for"),
+        "source_type": "ops-inventory",
+        "context_key": context_key,
+        "probe": {
+            "generated_at": compact.get("generated_at"),
+            "inventory_path": compact.get("inventory_path"),
+            "status": compact.get("status"),
+            "reason": compact.get("reason"),
+            "summary": compact.get("summary", {}),
+        },
+        "endpoint": _compact_endpoint(endpoint) if endpoint else {},
+    }
+
+
+def _core_hub_event_exists(
+    base_url: str,
+    runtime_token: str,
+    event_id: str,
+    timeout: float,
+) -> bool:
+    resp = httpx.get(
+        f"{base_url}/api/v2/interaction-events",
+        headers=_core_hub_headers(runtime_token),
+        timeout=timeout,
+    )
+    resp.raise_for_status()
+    payload = resp.json()
+    data = payload.get("data") if isinstance(payload, dict) else []
+    if not isinstance(data, list):
+        return False
+    return any(isinstance(item, dict) and item.get("id") == event_id for item in data)
+
+
 def _inter_hub_result(sink: dict[str, Any]) -> dict[str, Any]:
     missing: list[str] = []
     if not (sink.get("inter_hub_url") or os.environ.get("INTER_HUB_URL")):
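The Core Hub sink resolves a widget id from a mapping that can arrive in three shapes: a dict from the sink config, a JSON object string, or a comma-separated `ref=id` list (for example from `CORE_HUB_WIDGET_MAPPING`). The sketch below mirrors that parsing so the accepted inputs are easier to see. `parse_widget_mapping` is an illustrative copy written for this example, and the widget refs and ids in the usage note are made up.

```python
import json
from typing import Any


def parse_widget_mapping(raw: Any) -> dict[str, str]:
    """Illustrative copy: accept a dict, a JSON object string, or 'k=v,k=v'."""
    if isinstance(raw, dict):
        return {str(key): str(value) for key, value in raw.items() if value}
    if not isinstance(raw, str) or not raw.strip():
        return {}
    value = raw.strip()
    if value.startswith("{"):
        try:
            loaded = json.loads(value)
        except json.JSONDecodeError:
            return {}
        if isinstance(loaded, dict):
            # Entries with empty values are dropped rather than mapped to "".
            return {str(key): str(item) for key, item in loaded.items() if item}
        return {}
    pairs: dict[str, str] = {}
    for part in value.split(","):
        key, _, item = part.partition("=")
        if key.strip() and item.strip():
            pairs[key.strip()] = item.strip()
    return pairs
```

Under these assumptions, `"ops.health=w-1, ops.db=w-2"` and `'{"ops.health": "w-1", "ops.db": "w-2"}'` produce the same mapping, and anything unparseable degrades to an empty dict, which the sink then reports as a skipped write with `widget_id or CORE_HUB_WIDGET_ID` listed as missing.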

View File

@@ -11,6 +11,8 @@ from zoneinfo import ZoneInfo

 import httpx

+from activity_core.state_hub_write import idempotency_headers
+
 _DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
 _THE_CUSTODIAN_ROOT = Path("/home/worsch/the-custodian")
 _FORBIDDEN_CUSTODIAN_ROOTS = (
@@ -149,6 +151,7 @@ def _post_state_hub_progress(
     resp = httpx.post(
         f"{base_url}/progress/",
         json=body,
+        headers=idempotency_headers(run_id, instruction_id, event_type),
         timeout=float(sink.get("timeout_seconds", 10.0)),
     )
     resp.raise_for_status()
@@ -167,12 +170,18 @@ def _progress_exists(
     instruction_id: str,
     event_type: str,
 ) -> bool:
-    resp = httpx.get(
-        f"{base_url}/progress/",
-        params={"limit": 100},
-        timeout=10.0,
-    )
-    resp.raise_for_status()
+    # Best-effort read-dedup optimisation only. The Idempotency-Key header on the
+    # write is the real guarantee; if State Hub is unreachable here we must not
+    # hard-fail — proceed to the (keyed) write rather than raising.
+    try:
+        resp = httpx.get(
+            f"{base_url}/progress/",
+            params={"limit": 100},
+            timeout=10.0,
+        )
+        resp.raise_for_status()
+    except httpx.HTTPError:
+        return False
     for item in resp.json():
         detail = item.get("detail") or {}
         if (

View File

@@ -160,15 +160,20 @@ def _execute(
     prompt_hash = hashlib.sha256(rendered.encode()).hexdigest()
     llm_config = _llm_run_config(instr)

+    # Reference allow-list (WP-0016-T04): if a context resolver supplied the set
+    # of known candidate ids, recommendations pointing at anything else are
+    # quarantined. Absent (None) today → the check is inert until wired.
+    allow_list = _allow_list_from_context(context)
+
     # Step 3 — call LLM
     raw_output = llm_client.complete(rendered, model=instr.model, config=llm_config)

     # Step 4 — validate and optionally retry
-    task_specs, report, error = _validate_output(raw_output, instr)
+    task_specs, report, error = _validate_output(raw_output, instr, allow_list)
     if error:
         retry_prompt = rendered + f"\n\nPrevious output was invalid: {error}\nPlease fix."
         raw_output = llm_client.complete(retry_prompt, model=instr.model, config=llm_config)
-        task_specs, report, error = _validate_output(raw_output, instr)
+        task_specs, report, error = _validate_output(raw_output, instr, allow_list)
         if error:
             # Truncate to keep log volume bounded but long enough to see the
             # actual JSON shape mismatch (typical reports are <2KB).
@@ -178,6 +183,14 @@ def _execute(
                 "error=%s, raw_output_preview=%r",
                 instr.id, prompt_hash, error, preview,
             )
+            # Posture B (WP-0016-T03): try to recover a partial-but-usable
+            # report from individually-parseable items before declaring total
+            # loss. One bad item should cost one item, not the whole report.
+            recovered = _resilient_report(
+                instr, raw_output, error, prompt_hash, allow_list,
+            )
+            if recovered is not None:
+                return recovered
             failure_report = _invalid_output_report(instr, error, raw_output)
             if failure_report is not None:
                 return InstructionResult(
@@ -279,6 +292,320 @@ def _invalid_output_report(
     return report


+# ---------------------------------------------------------------------------
+# Resilient report recovery (ACTIVITY-WP-0016-T03)
+#
+# Posture B — verify & mitigate at the producer→consumer boundary. When the
+# whole-document parse/validate fails, recover individually-parseable
+# recommendation objects, validate each against the item schema, keep the valid
+# ones, and quarantine the malformed/over-limit ones with provenance. One bad
+# item costs one item, not the whole report (error locality == unit of work).
+# ---------------------------------------------------------------------------
+
+_QUARANTINE_LIMIT = 20
+_SNIPPET_LIMIT = 200
+# Producer guardrails (ACTIVITY-WP-0016-T04): structural bounds applied to every
+# recommendation regardless of producer (LLM, agent, or human). These are
+# verify-and-mitigate limits — an offending item is quarantined, never allowed to
+# fail the whole report or flow unbounded into a downstream consumer.
+_MAX_STRING_LEN = 4000
+_MAX_DEPTH = 8
+
+_SUMMARY_RE = re.compile(r'"summary"\s*:\s*"((?:[^"\\]|\\.)*)"')
+
+
+def _snippet(value: Any) -> str:
+    text = value if isinstance(value, str) else json.dumps(value, default=str)
+    return text[:_SNIPPET_LIMIT]
+
+
+def _json_depth(value: Any, depth: int = 1) -> int:
+    if depth > _MAX_DEPTH:
+        return depth
+    if isinstance(value, dict):
+        return max((_json_depth(v, depth + 1) for v in value.values()), default=depth)
+    if isinstance(value, list):
+        return max((_json_depth(v, depth + 1) for v in value), default=depth)
+    return depth
+
+
+def _has_oversized_string(value: Any) -> bool:
+    if isinstance(value, str):
+        return len(value) > _MAX_STRING_LEN
+    if isinstance(value, dict):
+        return any(_has_oversized_string(v) for v in value.values())
+    if isinstance(value, list):
+        return any(_has_oversized_string(v) for v in value)
+    return False
+
+
+def _item_structure_error(item: Any) -> str | None:
+    """Producer-agnostic structural guardrail: depth and string-length caps."""
+    if _json_depth(item) > _MAX_DEPTH:
+        return f"exceeds max nesting depth {_MAX_DEPTH}"
+    if _has_oversized_string(item):
+        return f"contains a string longer than {_MAX_STRING_LEN} chars"
+    return None
+
+
+def _allow_list_from_context(context: dict | None) -> set[str] | None:
+    """Build the recommendation-candidate allow-list from resolved context.
+
+    Looks for `context["known_candidates"]` (a list/set of valid candidate ids).
+    Returns None when absent so the allow-list check stays inert until a context
+    resolver populates it — the guardrail capability ships now; activation is a
+    one-line resolver change.
+    """
+    if not isinstance(context, dict):
+        return None
+    known = context.get("known_candidates")
+    if isinstance(known, (list, set, tuple)):
+        return {str(item) for item in known}
+    return None
+
+
+def _report_contract(instr: Any) -> tuple[dict[str, Any] | None, int | None]:
+    """Extract (item_schema, max_items) for the recommendations list, if any."""
+    try:
+        schema = _load_output_schema(getattr(instr, "output_schema", ""))
+    except (OSError, json.JSONDecodeError, TypeError):
+        return None, None
+    if not isinstance(schema, dict):
+        return None, None
+    recs = (schema.get("properties") or {}).get("recommendations")
+    if not isinstance(recs, dict):
+        return None, None
+    item_schema = recs.get("items") if isinstance(recs.get("items"), dict) else None
+    max_items = recs.get("maxItems") if isinstance(recs.get("maxItems"), int) else None
+    return item_schema, max_items
+
+
+def _extract_object_spans(raw: str) -> list[tuple[str, bool]]:
+    """Return (span, complete) for each recommendation object in raw output.
+
+    Scans the `recommendations` array brace-aware and string-aware so it recovers
+    objects whether they are pretty-printed across many lines or emitted one per
+    line (NDJSON). A truncated trailing object is returned with complete=False.
+    """
+    key = raw.find('"recommendations"')
+    start_region = raw.find("[", key) if key >= 0 else -1
+    if start_region < 0:
+        return []
+    spans: list[tuple[str, bool]] = []
+    i, n = start_region + 1, len(raw)
+    while i < n:
+        ch = raw[i]
+        if ch == "]":
+            break
+        if ch != "{":
+            i += 1
+            continue
+        depth, in_str, esc, j = 0, False, False, i
+        closed = False
+        while j < n:
+            c = raw[j]
+            if in_str:
+                if esc:
+                    esc = False
+                elif c == "\\":
+                    esc = True
+                elif c == '"':
+                    in_str = False
+            elif c == '"':
+                in_str = True
+            elif c == "{":
+                depth += 1
+            elif c == "}":
+                depth -= 1
+                if depth == 0:
+                    spans.append((raw[i:j + 1], True))
+                    closed = True
+                    break
+            j += 1
+        if not closed:
+            spans.append((raw[i:], False))  # truncated tail
+            break
+        i = j + 1
+    return spans
+
+
+def _try_repair(span: str) -> str:
+    """Best-effort close of a truncated JSON object: balance quote, braces, brackets."""
+    in_str, esc, depth_c, depth_b = False, False, 0, 0
+    for c in span:
+        if in_str:
+            if esc:
+                esc = False
+            elif c == "\\":
+                esc = True
+            elif c == '"':
+                in_str = False
+        elif c == '"':
+            in_str = True
+        elif c == "{":
+            depth_c += 1
+        elif c == "}":
+            depth_c -= 1
+        elif c == "[":
+            depth_b += 1
+        elif c == "]":
+            depth_b -= 1
+    repaired = span.rstrip().rstrip(",")
+    if in_str:
+        repaired += '"'
+    return repaired + "]" * max(depth_b, 0) + "}" * max(depth_c, 0)
+
+
+def _recover_recommendations(
+    raw: str,
+) -> tuple[str | None, list[dict[str, Any]], list[dict[str, Any]]]:
+    """Recover (summary, items, quarantined) from a failed report payload."""
+    summary_match = _SUMMARY_RE.search(raw)
+    summary = None
+    if summary_match:
+        try:
+            summary = json.loads(f'"{summary_match.group(1)}"')
+        except json.JSONDecodeError:
+            summary = summary_match.group(1)
+    items: list[dict[str, Any]] = []
+    quarantined: list[dict[str, Any]] = []
+    for index, (span, complete) in enumerate(_extract_object_spans(raw)):
+        parsed: Any = None
+        try:
+            parsed = json.loads(span)
+        except json.JSONDecodeError as exc:
+            if not complete:
+                try:
+                    parsed = json.loads(_try_repair(span))
+                except json.JSONDecodeError:
+                    parsed = None
+            if parsed is None:
+                quarantined.append(
+                    {"index": index, "error": str(exc), "raw": _snippet(span),
+                     "reason": "truncated" if not complete else "unparseable"}
+                )
+                continue
+        if isinstance(parsed, dict):
+            items.append(parsed)
+        else:
+            quarantined.append(
+                {"index": index, "error": "item is not a JSON object",
+                 "raw": _snippet(span)}
+            )
+    return summary, items, quarantined
+
+
+def _partition_items(
+    items: list[dict[str, Any]],
+    item_schema: dict[str, Any] | None,
+    max_items: int | None,
+    *,
+    run_schema: bool = True,
+    allow_list: set[str] | None = None,
+) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
+    """Screen items into (valid, quarantined).
+
+    Applied uniformly to recovered items (run_schema=True) and to already
+    schema-valid happy-path items (run_schema=False). Order of checks: structural
+    type → schema → producer guardrails (depth/length) → reference allow-list →
+    count cap. The first failing check quarantines the item with provenance.
+    """
+    valid: list[dict[str, Any]] = []
+    quarantined: list[dict[str, Any]] = []
+    for index, item in enumerate(items):
+        if not isinstance(item, dict):
+            quarantined.append(
+                {"index": index, "error": "item is not a JSON object",
+                 "raw": _snippet(item), "reason": "malformed"}
+            )
+            continue
+        schema_error = (
+            _validate_schema_node(item, item_schema, f"recommendations[{index}]")
+            if (run_schema and item_schema)
+            else None
+        )
+        if schema_error:
+            quarantined.append(
+                {"index": index, "error": schema_error, "raw": _snippet(item),
+                 "reason": "schema"}
+            )
+            continue
+        structure_error = _item_structure_error(item)
+        if structure_error:
+            quarantined.append(
+                {"index": index, "error": structure_error, "raw": _snippet(item),
+                 "reason": "guardrail"}
+            )
+            continue
+        if allow_list is not None:
+            candidate = item.get("candidate")
+            if not isinstance(candidate, str) or candidate not in allow_list:
+                quarantined.append(
+                    {"index": index, "error": f"candidate {candidate!r} not in allow-list",
+                     "raw": _snippet(item), "reason": "allow_list"}
+                )
+                continue
+        valid.append(item)
+    if max_items is not None and len(valid) > max_items:
+        for item in valid[max_items:]:
+            quarantined.append(
+                {"index": None, "error": f"exceeds maxItems={max_items}",
+                 "raw": _snippet(item), "reason": "over_limit"}
+            )
+        valid = valid[:max_items]
+    return valid, quarantined
+
+
+def _resilient_report(
+    instr: Any,
+    raw_output: Any,
+    original_error: str,
+    prompt_hash: str | None,
+    allow_list: set[str] | None = None,
+) -> InstructionResult | None:
+    """Recover a partial-but-usable report from output that failed validation.
+
+    Returns None when nothing usable can be recovered, so the caller falls back
+    to the total-loss diagnostic artifact (_invalid_output_report).
+    """
+    if not getattr(instr, "report_sinks", None) or not isinstance(raw_output, str):
+        return None
+    item_schema, max_items = _report_contract(instr)
+    summary, items, quarantined = _recover_recommendations(raw_output)
+    if not items:
+        return None
+    valid, item_quarantine = _partition_items(
+        items, item_schema, max_items, allow_list=allow_list,
+    )
+    quarantined.extend(item_quarantine)
+    if not valid:
+        return None
+    report: dict[str, Any] = {
+        "summary": summary
+        or f"Partial daily triage: recovered {len(valid)} recommendation(s) "
+        "after the full report failed validation.",
+        "recommendations": valid,
+        "status": "partial",
+        "partial": True,
+        "quarantined_count": len(quarantined),
+        "quarantined_items": quarantined[:_QUARANTINE_LIMIT],
+        "recovery_note": f"original validation error: {original_error}",
+    }
+    logger.warning(
+        "instruction_output_recovered: instruction=%r, kept=%d, quarantined=%d",
+        getattr(instr, "id", None), len(valid), len(quarantined),
+    )
+    return InstructionResult(
+        tasks=[],
+        report=report,
+        prompt_hash=prompt_hash,
+        model=getattr(instr, "model", None),
+        output_validated=True,
+        review_required=True,
+        condition_matched=getattr(instr, "condition", "") or None,
+        validation_error=None,
+    )
+
+
 def _execution_failure_report(instr: Any, error: str) -> dict[str, Any] | None:
     """Build a durable diagnostic report when a report instruction cannot run."""
     if not getattr(instr, "report_sinks", None):
@@ -295,6 +622,7 @@ def _execution_failure_report(instr: Any, error: str) -> dict[str, Any] | None:
 def _validate_output(
     raw_output: Any,
     instr: Any,
+    allow_list: set[str] | None = None,
 ) -> tuple[list[TaskSpec], dict[str, Any] | None, str | None]:
     """Parse raw LLM output into TaskSpecs and optional report payload.
@@ -349,6 +677,28 @@ def _validate_output(
                 source_type="instruction",
                 source_id=instr.id,
             ))
+
+        # Happy-path producer guardrails (WP-0016-T04): the whole document already
+        # passed schema validation, so recommendations are schema-valid; still apply
+        # the count cap, structural caps, and reference allow-list, quarantining any
+        # offenders rather than emitting them. Report shape only changes when an item
+        # is actually quarantined.
+        if isinstance(report, dict) and isinstance(report.get("recommendations"), list):
+            item_schema, max_items = _report_contract(instr)
+            kept, quarantined = _partition_items(
+                report["recommendations"], item_schema, max_items,
+                run_schema=False, allow_list=allow_list,
+            )
+            if quarantined:
+                report = {
+                    **report,
+                    "recommendations": kept,
+                    "status": "partial",
+                    "partial": True,
+                    "quarantined_count": len(quarantined),
+                    "quarantined_items": quarantined[:_QUARANTINE_LIMIT],
+                }
+
         return specs, report, None
     except (json.JSONDecodeError, AttributeError, KeyError, TypeError) as exc:
         return [], None, str(exc)

View File

@@ -0,0 +1,194 @@
+"""Missed-fire detection for cron schedules (ACTIVITY-WP-0014, T03).
+
+Even with a catchup window configured, an operator wants to *know* when a fire
+was missed — especially under ``misfire_policy: skip`` where missed fires are
+dropped by design and leave no run and no failure event. This module turns the
+schedule's own bookkeeping into an explicit verdict and an optional State Hub
+alert so a miss is never invisible again.
+
+Temporal already counts fires that were dropped because they fell outside the
+catchup window in ``ScheduleInfo.num_actions_missed_catchup_window``. We surface
+that, plus a staleness check on the most recent fire, as a ``ScheduleHealth``
+verdict. The verdict logic is a pure function so it is testable without a live
+Temporal server; ``check_schedule_health`` is the thin async reader.
+"""
+
+from __future__ import annotations
+
+import os
+from dataclasses import dataclass, field
+from datetime import datetime, timedelta, timezone
+from typing import Any
+from uuid import UUID
+
+import httpx
+
+from activity_core.schedule_manager import schedule_id
+from activity_core.state_hub_write import idempotency_headers
+
+_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
+
+
+@dataclass(frozen=True)
+class ScheduleHealth:
+    """Verdict for a single schedule's recent firing behaviour."""
+
+    activity_id: str
+    healthy: bool
+    missed_catchup_window: int
+    last_fired_at: datetime | None
+    staleness: timedelta | None
+    reasons: list[str] = field(default_factory=list)
+
+    @property
+    def missed(self) -> bool:
+        return not self.healthy
+
+
+def evaluate_schedule_health(
+    *,
+    activity_id: str,
+    missed_catchup_window: int,
+    last_fired_at: datetime | None,
+    now: datetime,
+    expected_interval: timedelta | None = None,
+    tolerance: timedelta = timedelta(minutes=10),
+) -> ScheduleHealth:
+    """Pure verdict: was a fire missed?
+
+    A schedule is unhealthy if Temporal dropped any fire past the catchup window,
+    or — when ``expected_interval`` is known — if the most recent fire is older
+    than one interval plus ``tolerance`` (i.e. a fire should have happened and
+    did not).
+    """
+    reasons: list[str] = []
+    if missed_catchup_window > 0:
+        reasons.append(
+            f"{missed_catchup_window} fire(s) dropped outside the catchup window"
+        )
+
+    staleness: timedelta | None = None
+    if last_fired_at is not None:
+        staleness = now - last_fired_at
+        if expected_interval is not None and staleness > expected_interval + tolerance:
+            reasons.append(
+                f"last fire was {staleness} ago, exceeding the expected "
+                f"{expected_interval} interval"
+            )
+    elif expected_interval is not None:
+        reasons.append("no recorded fire for a schedule that should have fired")
+
+    return ScheduleHealth(
+        activity_id=activity_id,
+        healthy=not reasons,
+        missed_catchup_window=missed_catchup_window,
+        last_fired_at=last_fired_at,
+        staleness=staleness,
+        reasons=reasons,
+    )
+
+
+def _extract_info(desc: Any) -> tuple[int, datetime | None]:
+    """Pull (missed_catchup_window, last_fired_at) from a ScheduleDescription.
+
+    Accesses are defensive so a Temporal SDK field rename degrades to "unknown"
+    rather than raising inside an operational health check.
+    """
+    info = getattr(desc, "info", None)
+    missed = int(getattr(info, "num_actions_missed_catchup_window", 0) or 0)
+
+    last_fired: datetime | None = None
+    recent = getattr(info, "recent_actions", None) or []
+    times = [
+        getattr(a, "scheduled_at", None) or getattr(a, "started_at", None)
+        for a in recent
+    ]
+    times = [t for t in times if t is not None]
+    if times:
+        last_fired = max(times)
+    return missed, last_fired
+
+
+async def check_schedule_health(
+    client: Any,
+    activity_id: str | UUID,
+    *,
+    now: datetime | None = None,
+    expected_interval: timedelta | None = None,
+    tolerance: timedelta = timedelta(minutes=10),
+) -> ScheduleHealth:
+    """Describe the schedule for ``activity_id`` and evaluate its health."""
+    now = now or datetime.now(tz=timezone.utc)
+    handle = client.get_schedule_handle(schedule_id(activity_id))
+    desc = await handle.describe()
+    missed, last_fired = _extract_info(desc)
+    return evaluate_schedule_health(
+        activity_id=str(activity_id),
+        missed_catchup_window=missed,
+        last_fired_at=last_fired,
+        now=now,
+        expected_interval=expected_interval,
+        tolerance=tolerance,
+    )
+
+
+def post_missed_fire_alert(
+    health: ScheduleHealth,
+    *,
+    state_hub_url: str | None = None,
+    author: str = "activity-core",
+    topic_id: str | None = None,
+    workstream_id: str | None = None,
+    timeout_seconds: float = 10.0,
+) -> dict[str, Any]:
+    """Post a ``schedule_miss`` progress event to State Hub for an unhealthy schedule.
+
+    No-op (returns ``status: ok``) when the schedule is healthy, so callers can
+    invoke unconditionally.
+    """
+    if health.healthy:
+        return {"type": "schedule-miss-alert", "status": "ok"}
+
+    base_url = state_hub_url or os.environ.get("STATE_HUB_URL", _DEFAULT_STATE_HUB_URL)
+    base_url = str(base_url).rstrip("/")
+    body: dict[str, Any] = {
+        "event_type": "schedule_miss",
+        "author": author,
+        "summary": (
+            f"Schedule {health.activity_id} missed a fire: "
+            + "; ".join(health.reasons)
+        ),
+        "detail": {
+            "activity_id": health.activity_id,
+            "missed_catchup_window": health.missed_catchup_window,
+            "last_fired_at": (
+                health.last_fired_at.isoformat() if health.last_fired_at else None
+            ),
+            # Compare against None: a zero timedelta is falsy and would
+            # otherwise be reported as None instead of 0.0.
+            "staleness_seconds": (
+                health.staleness.total_seconds()
+                if health.staleness is not None
+                else None
+            ),
+            "reasons": health.reasons,
+        },
+    }
+    if topic_id:
+        body["topic_id"] = topic_id
+    if workstream_id:
+        body["workstream_id"] = workstream_id
+
+    # Dedup repeated alerts for the same missed window (same schedule + last fire).
+    last_fired = health.last_fired_at.isoformat() if health.last_fired_at else "none"
+    resp = httpx.post(
+        f"{base_url}/progress/",
+        json=body,
+        headers=idempotency_headers("schedule_miss", health.activity_id, last_fired),
+        timeout=timeout_seconds,
+    )
+    resp.raise_for_status()
+    data = resp.json()
+    return {
+        "type": "schedule-miss-alert",
+        "status": "posted",
+        "progress_id": data.get("id"),
+    }
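
The staleness half of the verdict is easiest to see on concrete timestamps. The sketch below is a standalone mirror of that one rule, not the module's own function: `is_stale` is a hypothetical name, and the catchup-window counter is left out on purpose.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def is_stale(
    last_fired_at: datetime | None,
    now: datetime,
    expected_interval: timedelta,
    tolerance: timedelta = timedelta(minutes=10),
) -> bool:
    """Hypothetical mirror of the staleness rule in evaluate_schedule_health."""
    if last_fired_at is None:
        # No recorded fire at all counts as a miss once an interval is expected.
        return True
    return (now - last_fired_at) > expected_interval + tolerance


now = datetime(2026, 6, 27, 8, 0, tzinfo=timezone.utc)
hourly = timedelta(hours=1)

# 65 minutes since the last fire is inside one interval plus the 10-minute tolerance.
within_tolerance = is_stale(now - timedelta(minutes=65), now, hourly)
# 71 minutes is past the 70-minute threshold, so a fire should have happened.
past_threshold = is_stale(now - timedelta(minutes=71), now, hourly)
never_fired = is_stale(None, now, hourly)
```

Both datetimes must be timezone-aware (or both naive); subtracting an aware value from a naive one raises `TypeError`, which is one reason `check_schedule_health` defaults `now` to a UTC-aware timestamp.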


@@ -17,7 +17,6 @@ from temporalio.client import (
     Schedule,
     ScheduleActionStartWorkflow,
     ScheduleAlreadyRunningError,
-    ScheduleBackfill,
     ScheduleCalendarSpec,
     ScheduleHandle,
     ScheduleOverlapPolicy,
@@ -38,13 +37,49 @@ _ORCHESTRATOR_TASK_QUEUE = "orchestrator-tq"
 # RunActivityWorkflow detects this value and derives run dedup key from workflow_id.
 SCHEDULED_TRIGGER_KEY = "scheduled"

-# T24: misfire_policy → ScheduleOverlapPolicy
-_MISFIRE_TO_OVERLAP: dict[str, ScheduleOverlapPolicy] = {
-    "skip": ScheduleOverlapPolicy.SKIP,
-    "catchup": ScheduleOverlapPolicy.BUFFER_ALL,
-    "compress": ScheduleOverlapPolicy.BUFFER_ONE,
+# ACTIVITY-WP-0014: misfire_policy → run-miss recovery behaviour.
+#
+# A "missed fire" happens when the worker / Temporal is unavailable at trigger
+# time. Two Temporal levers together define the behaviour:
+#   - catchup_window: how far back the server will recover missed fires once it
+#     is healthy again. The previous code never set this, so a brief outage at
+#     trigger time silently dropped the fire with no recovery and no signal.
+#   - overlap: what to do when a (recovered) fire would start while a prior run
+#     is still executing.
+#
+# Legacy values (catchup, compress) are aliased onto the explicit names.
+_MISFIRE_ALIASES: dict[str, str] = {
+    "catchup": "catchup_all",
+    "compress": "catchup_latest",
 }
+
+# overlap policy + default catchup window (seconds) per normalised policy.
+_SKIP_WINDOW_SECONDS = 60
+_CATCHUP_ALL_WINDOW_SECONDS = 365 * 24 * 3600
+_CATCHUP_LATEST_WINDOW_SECONDS = 24 * 3600
+
+_MISFIRE_TO_OVERLAP: dict[str, ScheduleOverlapPolicy] = {
+    # Run on trigger or skip — recover nothing past a tiny grace window.
+    "skip": ScheduleOverlapPolicy.SKIP,
+    # Run on trigger or recover every missed fire during the outage window.
+    "catchup_all": ScheduleOverlapPolicy.BUFFER_ALL,
+    # Run on trigger or recover the most recent missed fire only; BUFFER_ONE
+    # buffers at most one start and drops the rest, so a backlog never accumulates.
+    "catchup_latest": ScheduleOverlapPolicy.BUFFER_ONE,
+}
+
+_MISFIRE_DEFAULT_WINDOW: dict[str, int] = {
+    "skip": _SKIP_WINDOW_SECONDS,
+    "catchup_all": _CATCHUP_ALL_WINDOW_SECONDS,
+    "catchup_latest": _CATCHUP_LATEST_WINDOW_SECONDS,
+}
+
+
+def _normalize_misfire_policy(misfire_policy: str) -> str:
+    """Map legacy aliases onto the explicit run-miss policy names."""
+    canonical = _MISFIRE_ALIASES.get(misfire_policy, misfire_policy)
+    return canonical if canonical in _MISFIRE_TO_OVERLAP else "skip"


 def schedule_id(activity_id: str | UUID) -> str:
     """Return the canonical Temporal Schedule ID for an ActivityDefinition."""
@@ -57,7 +92,15 @@ def smoke_schedule_id(activity_id: str | UUID) -> str:
 def _overlap_policy(misfire_policy: str) -> ScheduleOverlapPolicy:
-    return _MISFIRE_TO_OVERLAP.get(misfire_policy, ScheduleOverlapPolicy.SKIP)
+    return _MISFIRE_TO_OVERLAP[_normalize_misfire_policy(misfire_policy)]
+
+
+def _catchup_window(cfg: CronTriggerConfig) -> timedelta:
+    """Resolve the catchup window: explicit override, else the policy default."""
+    if cfg.catchup_window_seconds is not None:
+        return timedelta(seconds=cfg.catchup_window_seconds)
+    policy = _normalize_misfire_policy(cfg.misfire_policy)
+    return timedelta(seconds=_MISFIRE_DEFAULT_WINDOW[policy])


 def _build_schedule(defn: ActivityDefinition) -> Schedule:
@@ -80,7 +123,10 @@ def _build_schedule(defn: ActivityDefinition) -> Schedule:
         jitter=timedelta(seconds=cfg.jitter_seconds) if cfg.jitter_seconds else None,
     )

-    policy = SchedulePolicy(overlap=_overlap_policy(cfg.misfire_policy))
+    policy = SchedulePolicy(
+        overlap=_overlap_policy(cfg.misfire_policy),
+        catchup_window=_catchup_window(cfg),
+    )
     state = ScheduleState(paused=not defn.enabled)
     return Schedule(action=action, spec=spec, policy=policy, state=state)
@@ -282,18 +328,10 @@ async def upsert_schedule(client: Client, defn: ActivityDefinition) -> ScheduleH
     else:
         await handle.pause(note="disabled via upsert_schedule")

-    # T24 catchup: backfill any fires missed in the last hour.
-    if isinstance(defn.trigger_config, CronTriggerConfig):
-        if defn.trigger_config.misfire_policy == "catchup":
-            now = datetime.now(tz=timezone.utc)
-            backfill_start = now - timedelta(hours=1)
-            await handle.backfill(
-                ScheduleBackfill(
-                    start_at=backfill_start,
-                    end_at=now,
-                    overlap=ScheduleOverlapPolicy.BUFFER_ALL,
-                )
-            )
+    # ACTIVITY-WP-0014: missed-fire recovery is now handled natively by the
+    # schedule's catchup_window (see _build_schedule), which the server applies
+    # continuously after any outage — not only at upsert time. The previous
+    # ad-hoc 1-hour backfill is therefore no longer needed.

     return handle
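
How a configured `misfire_policy` resolves to a catchup window can be traced without Temporal installed. The sketch below is a standalone approximation that copies the alias table and default windows from this diff; the `normalize` and `catchup_window` names and the plain `policy`/`override_seconds` arguments are illustrative stand-ins for `_normalize_misfire_policy`, `_catchup_window`, and `CronTriggerConfig`.

```python
from __future__ import annotations

from datetime import timedelta

# Values copied from the diff; the overlap-policy enum is omitted to stay dependency-free.
ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}
DEFAULT_WINDOW_SECONDS = {
    "skip": 60,
    "catchup_all": 365 * 24 * 3600,
    "catchup_latest": 24 * 3600,
}


def normalize(policy: str) -> str:
    """Map legacy aliases to canonical names; unknown values fall back to "skip"."""
    canonical = ALIASES.get(policy, policy)
    return canonical if canonical in DEFAULT_WINDOW_SECONDS else "skip"


def catchup_window(policy: str, override_seconds: int | None = None) -> timedelta:
    """Explicit override wins, otherwise use the policy's default window."""
    if override_seconds is not None:
        return timedelta(seconds=override_seconds)
    return timedelta(seconds=DEFAULT_WINDOW_SECONDS[normalize(policy)])


legacy = catchup_window("catchup")           # alias resolves to catchup_all
unknown = catchup_window("no-such-policy")   # falls back to the skip window
overridden = catchup_window("skip", override_seconds=300)
```

Checking the override with `is not None` rather than truthiness means an explicit `0` is passed through as a zero-length window instead of silently falling back to the policy default.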


@@ -0,0 +1,34 @@
+"""Idempotency-keyed State Hub writes (ACTIVITY-WP-0014 T05).
+
+Under the State Hub *beachhead* model, a write may be buffered locally while
+central State Hub is unreachable and **flushed later, possibly with retries**.
+To keep that flush safe — no duplicate progress / triage events — every write
+carries a stable ``Idempotency-Key`` header derived deterministically from the
+write's identity. The guarantee lives on the write itself and does **not** depend
+on a live dedup read, so it holds even when the beachhead is serving offline.
+
+activity-core does not implement the queue/cache (that is state-hub's beachhead);
+it only emits the key so the beachhead / State Hub can dedup on flush. The header
+passes untouched through the existing ``actcore-state-hub-bridge`` proxy and is
+ignored by State Hub versions that do not yet honour it.
+"""
+
+from __future__ import annotations
+
+IDEMPOTENCY_HEADER = "Idempotency-Key"
+
+
+def idempotency_key(*parts: str | None) -> str:
+    """Build a stable, header-safe idempotency key from identity parts.
+
+    Empty/None parts are kept as empty segments so the key shape is stable across
+    calls. Every character outside printable ASCII (0x21-0x7E), including spaces,
+    control characters, and non-ASCII text, is replaced one-for-one with ``_`` to
+    keep the value a valid single-line HTTP header.
+    """
+    raw = ":".join((p or "") for p in parts)
+    return "".join(ch if 0x20 < ord(ch) < 0x7F else "_" for ch in raw) or "_"
+
+
+def idempotency_headers(*parts: str | None) -> dict[str, str]:
+    """Return the header dict to attach to a State Hub write."""
+    return {IDEMPOTENCY_HEADER: idempotency_key(*parts)}
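
A few concrete keys make the sanitising rule easier to check. The block below repeats `idempotency_key` verbatim from the file above so that it runs standalone; the sample identity parts (`act-1`, the timestamp) are made up for illustration.

```python
from __future__ import annotations


def idempotency_key(*parts: str | None) -> str:
    # Copied from state_hub_write above so the example has no package dependency.
    raw = ":".join((p or "") for p in parts)
    return "".join(ch if 0x20 < ord(ch) < 0x7F else "_" for ch in raw) or "_"


# The same identity always yields the same key, which is what makes a retried flush safe.
alert_key = idempotency_key("schedule_miss", "act-1", "2026-06-27T08:00:00+00:00")

# A None part becomes an empty segment, so the number of ":" separators is stable.
with_gap = idempotency_key("a", None, "b")

# Spaces and newlines are each replaced by "_" rather than removed.
sanitised = idempotency_key("two words\n")

# No parts at all still produces a non-empty header value.
empty = idempotency_key()
```

Because replacement is one-for-one, two different inputs such as `"a b"` and `"a_b"` map to the same key; callers that need to distinguish them would have to encode the parts before passing them in.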


@@ -15,6 +15,8 @@ import asyncio
 import logging
 import os
 import uuid
+from dataclasses import dataclass
+from typing import Sequence

 from sqlalchemy import select
 from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
@@ -30,6 +32,20 @@ TEMPORAL_HOST = os.environ.get("TEMPORAL_HOST", "localhost:7233")
 TEMPORAL_NAMESPACE = os.environ.get("TEMPORAL_NAMESPACE", "default")


+@dataclass
+class ScheduleSyncResult:
+    upserted: int = 0
+    paused: int = 0
+    deleted_orphans: int = 0
+
+    def to_dict(self) -> dict[str, int]:
+        return {
+            "upserted": self.upserted,
+            "paused": self.paused,
+            "deleted_orphans": self.deleted_orphans,
+        }
+
+
 def _row_to_domain(row: ActivityDefinitionRow) -> ActivityDefinition:
     """Convert an ORM row to a domain ActivityDefinition for schedule_manager."""
     return ActivityDefinition.model_validate(
@@ -46,12 +62,82 @@ def _row_to_domain(row: ActivityDefinitionRow) -> ActivityDefinition:
     )


-async def sync(client: Client, db_url: str) -> None:
+def _valid_schedule_activity_id(defn: ActivityDefinition) -> str:
+    if isinstance(defn.trigger_config, ScheduledTriggerConfig):
+        return f"{defn.id}-once"
+    return str(defn.id)
+
+
+async def _load_schedule_rows(
+    session_factory: async_sessionmaker[AsyncSession],
+) -> Sequence[ActivityDefinitionRow]:
+    async with session_factory() as session:
+        return (
+            await session.scalars(
+                select(ActivityDefinitionRow).where(
+                    ActivityDefinitionRow.trigger_type.in_(["cron", "scheduled"])
+                )
+            )
+        ).all()
+
+
+async def sync_schedule_rows(
+    client: Client,
+    rows: Sequence[ActivityDefinitionRow],
+) -> ScheduleSyncResult:
+    """Reconcile Temporal Schedules against already-loaded definition rows."""
+    valid_schedule_activity_ids: set[str] = set()
+    result = ScheduleSyncResult()
+
+    for row in rows:
+        defn = _row_to_domain(row)
+        if not isinstance(
+            defn.trigger_config,
+            (CronTriggerConfig, ScheduledTriggerConfig),
+        ):
+            continue
+
+        valid_schedule_activity_ids.add(_valid_schedule_activity_id(defn))
+        await upsert_schedule(client, defn)
+        if defn.enabled:
+            result.upserted += 1
+            logger.info("upserted schedule for activity %s (%s)", defn.id, defn.name)
+        else:
+            result.paused += 1
+            logger.info("upserted paused schedule for disabled activity %s", defn.id)
+
+    # Tombstone cleanup: remove Temporal Schedules with no matching DB row.
+    existing_schedules = await list_schedules(client)
+    for entry in existing_schedules:
+        if entry["activity_id"] not in valid_schedule_activity_ids:
+            await delete_schedule(client, entry["activity_id"])
+            result.deleted_orphans += 1
+            logger.info("deleted orphaned schedule %s", entry["schedule_id"])
+
+    logger.info(
+        "sync_schedules complete — upserted=%d paused=%d deleted_orphans=%d",
+        result.upserted,
+        result.paused,
+        result.deleted_orphans,
+    )
+    return result
+
+
+async def sync_with_session_factory(
+    client: Client,
+    session_factory: async_sessionmaker[AsyncSession],
+) -> ScheduleSyncResult:
+    """Reconcile Temporal Schedules using an existing DB session factory."""
+    return await sync_schedule_rows(client, await _load_schedule_rows(session_factory))
+
+
+async def sync(client: Client, db_url: str) -> ScheduleSyncResult:
     """Reconcile Temporal Schedules against the ActivityDefinition table.

     Steps:
-    1. Load all enabled cron ActivityDefinitions from Postgres.
-    2. Upsert a Temporal Schedule for each one.
+    1. Load all cron/scheduled ActivityDefinitions from Postgres.
+    2. Upsert a Temporal Schedule for each one, paused when disabled.
     3. Delete Temporal Schedules whose activity_id has no matching DB row
        (tombstone cleanup for deleted or trigger-type-changed definitions).
     """
@@ -59,55 +145,10 @@ async def sync(client: Client, db_url: str) -> None:
     session_factory = async_sessionmaker(engine, expire_on_commit=False)

     try:
-        async with session_factory() as session:
-            rows = (
-                await session.scalars(
-                    select(ActivityDefinitionRow).where(
-                        ActivityDefinitionRow.trigger_type.in_(["cron", "scheduled"])
-                    )
-                )
-            ).all()
+        return await sync_with_session_factory(client, session_factory)
     finally:
         await engine.dispose()

-    db_activity_ids: set[str] = set()
-    upserted = 0
-    skipped = 0
-
-    for row in rows:
-        defn = _row_to_domain(row)
-        if not isinstance(defn.trigger_config, (CronTriggerConfig, ScheduledTriggerConfig)):
-            continue
-        db_activity_ids.add(str(defn.id))
-        if defn.enabled:
-            await upsert_schedule(client, defn)
-            upserted += 1
-            logger.info("upserted schedule for activity %s (%s)", defn.id, defn.name)
-        else:
-            # Disabled definitions: schedule may exist (paused) — leave it;
-            # upsert_schedule already handles the paused state.
-            await upsert_schedule(client, defn)
-            skipped += 1
-            logger.info("upserted paused schedule for disabled activity %s", defn.id)
-
-    # Tombstone cleanup: remove Temporal Schedules with no matching DB row.
-    existing_schedules = await list_schedules(client)
-    deleted = 0
-    for entry in existing_schedules:
-        if entry["activity_id"] not in db_activity_ids:
-            await delete_schedule(client, entry["activity_id"])
-            deleted += 1
-            logger.info("deleted orphaned schedule %s", entry["schedule_id"])
-
-    logger.info(
-        "sync_schedules complete — upserted=%d skipped_disabled=%d deleted_orphans=%d",
-        upserted,
-        skipped,
-        deleted,
-    )
-

 async def main() -> None:
     logging.basicConfig(level=logging.INFO)
@@ -116,7 +157,13 @@ async def main() -> None:
         raise RuntimeError("ACTCORE_DB_URL is required")

     client = await Client.connect(TEMPORAL_HOST, namespace=TEMPORAL_NAMESPACE)
-    await sync(client, db_url)
+    result = await sync(client, db_url)
+    print(
+        "Synced schedules: "
+        f"upserted={result.upserted} "
+        f"paused={result.paused} "
+        f"deleted_orphans={result.deleted_orphans}"
+    )


 if __name__ == "__main__":


@@ -0,0 +1,97 @@
+"""Shared ActivityDefinition/event type/schedule sync orchestration."""
+
+from __future__ import annotations
+
+from typing import Any
+
+from temporalio.client import Client
+
+from activity_core.event_type_registry import sync_event_types
+from activity_core.sync_activity_definitions import sync as sync_activity_definitions
+from activity_core.sync_schedules import ScheduleSyncResult, sync_with_session_factory
+
+_MAX_ERRORS = 20
+_MAX_ERROR_MESSAGE_LENGTH = 1000
+
+
+def _empty_result(
+    *,
+    definitions: bool,
+    schedules: bool,
+    event_types: bool,
+) -> dict[str, Any]:
+    return {
+        "ok": True,
+        "ran": {
+            "definitions": definitions,
+            "schedules": schedules,
+            "event_types": event_types,
+        },
+        "definitions": {"synced": 0},
+        "event_types": {"synced": 0},
+        "schedules": ScheduleSyncResult().to_dict(),
+        "errors": [],
+    }
+
+
+def _record_error(result: dict[str, Any], stage: str, exc: Exception) -> None:
+    errors = result["errors"]
+    if len(errors) >= _MAX_ERRORS:
+        return
+    errors.append(
+        {
+            "stage": stage,
+            "type": type(exc).__name__,
+            "message": str(exc)[:_MAX_ERROR_MESSAGE_LENGTH],
+        }
+    )
+    result["ok"] = False
+
+
+async def run_sync(
+    *,
+    session_factory: Any,
+    temporal_client: Client | None,
+    definitions: bool = True,
+    schedules: bool = True,
+    event_types: bool = False,
+) -> dict[str, Any]:
+    """Run the requested sync stages and return bounded operator-facing status.
+
+    The orchestration deliberately accepts its database and Temporal
+    dependencies as arguments so startup and the API can share the same behavior
+    without creating another global runtime.
+    """
+    result = _empty_result(
+        definitions=definitions,
+        schedules=schedules,
+        event_types=event_types,
+    )
+
+    if definitions:
+        try:
+            result["definitions"]["synced"] = await sync_activity_definitions(
+                session_factory
+            )
+        except Exception as exc:  # pragma: no cover - exercised through tests
+            _record_error(result, "definitions", exc)
+
+    if event_types:
+        try:
+            result["event_types"]["synced"] = await sync_event_types(session_factory)
+        except Exception as exc:  # pragma: no cover - exercised through tests
+            _record_error(result, "event_types", exc)
+
+    if schedules:
+        try:
+            if temporal_client is None:
+                raise RuntimeError("Temporal client is required for schedule sync")
+            schedule_result = await sync_with_session_factory(
+                temporal_client,
+                session_factory,
+            )
+            result["schedules"] = schedule_result.to_dict()
+        except Exception as exc:  # pragma: no cover - exercised through tests
+            _record_error(result, "schedules", exc)
+
+    return result
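
The "bounded operator-facing status" in `run_sync` comes down to two caps applied in `_record_error`. The sketch below reproduces only that bookkeeping on a plain dict so the caps can be exercised directly; `record_error` and the constant names are local copies for illustration, with the same limits as the file above.

```python
from __future__ import annotations

from typing import Any

MAX_ERRORS = 20             # same cap as _MAX_ERRORS above
MAX_MESSAGE_LENGTH = 1000   # same cap as _MAX_ERROR_MESSAGE_LENGTH above


def record_error(result: dict[str, Any], stage: str, exc: Exception) -> None:
    """Local copy of the bounded error recorder, for illustration only."""
    errors = result["errors"]
    if len(errors) >= MAX_ERRORS:
        # The list is already full; "ok" was set to False by an earlier error.
        return
    errors.append(
        {
            "stage": stage,
            "type": type(exc).__name__,
            "message": str(exc)[:MAX_MESSAGE_LENGTH],
        }
    )
    result["ok"] = False


result: dict[str, Any] = {"ok": True, "errors": []}
for _ in range(25):
    # An oversized message stands in for a long traceback-style error string.
    record_error(result, "schedules", RuntimeError("x" * 5000))
```

Capping both the count and the message length keeps the status payload small enough to log or return from an API even when a stage fails repeatedly with very long error text.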


@@ -46,8 +46,7 @@ from activity_core.activities import (
 )
 from activity_core.db import make_engine
 from sqlalchemy.ext.asyncio import async_sessionmaker
-from activity_core.sync_activity_definitions import sync as sync_activity_defs
-from activity_core.sync_schedules import sync as sync_schedules
+from activity_core.sync_service import run_sync
 from activity_core.workflows import RunActivityWorkflow, TaskExecutorWorkflow

 logger = logging.getLogger(__name__)
@@ -77,20 +76,26 @@ async def run() -> None:
         TEMPORAL_HOST, namespace=TEMPORAL_NAMESPACE, runtime=runtime
     )

-    # T45: Sync ActivityDefinition files into DB before schedule sync.
-    logger.info("Syncing ActivityDefinition files...")
+    logger.info("Syncing ActivityDefinitions and Temporal Schedules...")
+    sync_engine = make_engine(db_url)
+    session_factory = async_sessionmaker(sync_engine, expire_on_commit=False)
     try:
-        session_factory = async_sessionmaker(make_engine(db_url), expire_on_commit=False)
-        await sync_activity_defs(session_factory)
-    except Exception:
-        logger.exception("activity definition sync failed — continuing worker startup")
-
-    # T23: Sync Temporal Schedules with the DB before workers start accepting tasks.
-    logger.info("Syncing Temporal Schedules with ActivityDefinition DB...")
-    try:
-        await sync_schedules(client, db_url)
-    except Exception:
-        logger.exception("schedule sync failed — continuing worker startup")
+        sync_result = await run_sync(
+            session_factory=session_factory,
+            temporal_client=client,
+            definitions=True,
+            schedules=True,
+            event_types=False,
+        )
+        for error in sync_result["errors"]:
+            logger.error(
+                "startup sync %s failed — %s: %s",
+                error["stage"],
+                error["type"],
+                error["message"],
+            )
+    finally:
+        await sync_engine.dispose()

     orchestrator_worker = Worker(
         client,


@@ -209,11 +209,12 @@ class RunActivityWorkflow:
 @workflow.defn
 class TaskExecutorWorkflow:
-    """Child workflow that executes one concrete task instance.
+    """Compatibility stub for legacy task-instance workflows.

-    Stub behaviour: persists a task_instances row with status=done and
-    returns immediately. Real task execution logic replaces this in a
-    later workstream.
+    This is not a production execution surface for activity-core. It persists a
+    task_instances row with status=done and returns immediately so legacy/dev
+    flows keep their idempotency behavior. Real task execution belongs in
+    per-repo workers or a future execution-owned repo/workplan, not here.

     task_id is derived deterministically from the workflow's own ID so
     persist_task_instance retries remain idempotent.
@@ -221,7 +222,7 @@ class TaskExecutorWorkflow:
     @workflow.run
     async def run(self, run_id: str, task_type: str, params: dict) -> dict:
-        # Derive a stable task_id from this workflow's own ID.
+        # Keep the stub idempotent without implying task lifecycle ownership.
         task_id = str(
             uuid.uuid5(uuid.NAMESPACE_URL, workflow.info().workflow_id)
         )


@@ -0,0 +1,5 @@
{
"_note": "PARTIAL 4000-char preview of the 2026-06-26 daily-triage validation failure (retry attempt). Full payload not recoverable from activity-core: complete() drops finish_reason; report sink caps raw at 4000 chars; the JSON break is at char 5268 (beyond this preview). Full response would require llm-connect producer-side logs on railiance01.",
"validation_error": "Expecting ',' delimiter: line 136 column 22 (char 5268)",
"raw_output_preview": "{\n \"summary\": \"Triage report focusing on high-priority workstreams with pending human intervention or critical dependencies, and addressing recently cleared dependencies to unblock progress.\",\n \"recommendations\": [\n {\n \"rank\": 1,\n \"candidate\": \"2731fece-6c49-45b8-ab8a-4ea6c04ac603\",\n \"action\": \"work-next\",\n \"why\": \"A critical dependency (T03 - Configure bounded OpenBao token roles and policies) for this workstream has been cleared, unblocking significant progress on credential management. This workstream has 8 todo tasks and no waits, indicating it's ready for immediate action.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 5.0,\n \"strategic_value\": 5,\n \"time_criticality\": 5,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 5,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 2,\n \"candidate\": \"bd086c41-287d-4a4e-8ac5-9ab270f14d72\",\n \"action\": \"needs-human\",\n \"why\": \"This high-priority workstream has a 'needs_human' task (T04 - Provision the runtime API key outside Git) and is currently blocked by 3 'wait' tasks. Human intervention is required to unblock progress.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.7,\n \"strategic_value\": 5,\n \"time_criticality\": 4,\n \"risk_reduction\": 5,\n \"opportunity_enablement\": 4,\n \"job_size\": 3\n }\n },\n {\n \"rank\": 3,\n \"candidate\": \"9b56414a-c71f-4e72-9b2b-d2166aaf50d0\",\n \"action\": \"needs-human\",\n \"why\": \"This high-priority workstream has a 'needs_human' task (Task: Execute Live Ops-Hub Bootstrap) and is currently blocked by a 'wait' task. 
Human intervention is required to proceed with the bootstrap.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.7,\n \"strategic_value\": 5,\n \"time_criticality\": 4,\n \"risk_reduction\": 5,\n \"opportunity_enablement\": 4,\n \"job_size\": 3\n }\n },\n {\n \"rank\": 4,\n \"candidate\": \"84e17675-0d15-4268-a8bd-540124d37018\",\n \"action\": \"needs-human\",\n \"why\": \"This workstream has 4 'needs_human' tasks, including 'T02 \u2014 Resolve Forgejo production design decisions', indicating significant human input is required to move forward with the migration.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.0,\n \"strategic_value\": 4,\n \"time_criticality\": 4,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 5,\n \"candidate\": \"5646e13a-13af-4724-bca6-3c0d86f96733\",\n \"action\": \"needs-human\",\n \"why\": \"This workstream has a 'needs_human' task ('Three-Run Calibration Feedback') and is currently in a 'wait' state. Human feedback is crucial for operational hardening.\",\n \"confidence\": \"medium\",\n \"wsjf\": {\n \"score\": 3.7,\n \"strategic_value\": 4,\n \"time_criticality\": 3,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 6,\n \"candidate\": \"896ace77-21b3-450b-8fb7-254aefc8c570\",\n \"action\": \"close-out\",\n \"why\": \"The task 'Wire activity-core to the live service' has been resolved, and the workstream shows 2 progress tasks with 0 todo/wait tasks. 
This indicates the deployment is likely complete or nearing completion and ready for close-out after verification.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 3.7,\n \"strategic_value\": 4,\n \"time_criticality\": 3,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 7,\n \"candidate\": \"656e435d-3a00-4f5e-a38e-114467f9062e\",\n \"action\": \"work-next\",\n \"why\": \"This high-priority workstream has a single 'wait' task ('Task: Activate Ops-Hub Widgets In Inter-Hub') and no 'needs_human' tasks. It appears ready for the next step to activate the widgets.\",\n \"confidence\": \"medium\",\n \"wsjf"
}


@@ -88,6 +88,43 @@ def test_for_each_binds_each_list_item_before_condition_and_action_rendering() -
     ]


+def test_for_each_can_gate_registry_hygiene_gaps_on_signal() -> None:
+    rules = [
+        {
+            "id": "flag-registry-hygiene-gap",
+            "for_each": "context.gaps",
+            "bind_as": "g",
+            "condition": 'context.g.hygiene_signal != ""',
+            "action": {
+                "task_template": "Close registry hygiene gap for {context.g.repo}",
+                "target_repo": "context.g.repo",
+                "priority": "medium",
+                "labels": ["registry-hygiene", "{context.g.hygiene_signal}"],
+            },
+        }
+    ]
+    context = {
+        "gaps": [
+            {
+                "repo": "reuse-surface",
+                "hygiene_signal": "empty_capability_scaffold",
+            },
+            {
+                "repo": "activity-core",
+                "hygiene_signal": "",
+            },
+        ]
+    }
+
+    specs = expand_rule_actions(rules, _Event(), context)
+
+    assert [spec["target_repo"] for spec in specs] == ["reuse-surface"]
+    assert specs[0]["labels"] == [
+        "registry-hygiene",
+        "empty_capability_scaffold",
+    ]
+
+
 def test_for_each_rejects_non_path_expression() -> None:
     rules = [
         {


@@ -12,6 +12,7 @@ Covers:
 from __future__ import annotations

 import json
+from pathlib import Path
 from types import SimpleNamespace
 from typing import Any
@@ -333,7 +334,14 @@ def test_execute_instruction_forwards_output_schema_to_llm_connect(tmp_path, mon
 def test_execute_instruction_with_audit_accepts_report_payload():
     report_data = {
         "summary": "State Hub has loose ends.",
-        "recommendations": [{"action": "revisit", "candidate": "CUST-WP-0045"}],
+        "recommendations": [
+            {
+                "rank": 1,
+                "action": "revisit",
+                "candidate": "CUST-WP-0045",
+                "why": "Loose ends need attention.",
+            }
+        ],
     }
     llm = _CountingLLM([json.dumps(report_data)])
     instr = _instr(
@@ -353,7 +361,14 @@ def test_execute_instruction_with_audit_accepts_report_payload():
 def test_execute_instruction_with_audit_accepts_fenced_report_payload():
     report_data = {
         "summary": "State Hub has loose ends.",
-        "recommendations": [{"action": "revisit", "candidate": "CUST-WP-0045"}],
+        "recommendations": [
+            {
+                "rank": 1,
+                "action": "revisit",
+                "candidate": "CUST-WP-0045",
+                "why": "Loose ends need attention.",
+            }
+        ],
     }
     llm = _CountingLLM([f"```json\n{json.dumps(report_data)}\n```"])
     instr = _instr(
@@ -389,6 +404,175 @@ def test_execute_instruction_with_audit_rejects_invalid_report_schema():
assert llm.call_count == 2 assert llm.call_count == 2
# ── WP-0016-T03 resilient report recovery ─────────────────────────────────────
def _valid_rec(rank: int) -> dict[str, Any]:
return {
"rank": rank,
"candidate": f"WS-{rank}",
"action": "work-next",
"why": f"reason {rank}",
"wsjf": {"score": 5.0},
}
def _pretty_triage_with_truncated_tail(num_valid: int) -> str:
body = ",\n".join(" " + json.dumps(_valid_rec(i)) for i in range(1, num_valid + 1))
# Trailing object is cut off mid-string — the whole document is invalid JSON,
# reproducing the 2026-06-26 failure shape (valid prefix, broken tail).
return (
'{\n "summary": "Daily triage.",\n "recommendations": [\n'
+ body
+ ',\n {\n "rank": '
+ str(num_valid + 1)
+ ',\n "candidate": "WS-X",\n "action": "work-'
)
def test_resilient_report_recovers_valid_prefix_and_quarantines_truncated_tail():
raw = _pretty_triage_with_truncated_tail(7)
llm = _CountingLLM([raw, raw])
instr = _instr(
id="daily-triage-report",
prompt="Report.",
trusted_fields=[],
output_schema="schemas/daily-triage-report.json",
report_sinks=[{"type": "working-memory"}],
)
result = execute_instruction_with_audit(instr, _Event(), {}, llm)
assert result.output_validated is True
assert result.review_required is True
assert result.report is not None
assert result.report["partial"] is True
assert len(result.report["recommendations"]) == 7
assert result.report["summary"] == "Daily triage."
assert result.report["quarantined_count"] >= 1
# The broken tail is dropped — either as an unparseable/truncated span or,
# if _try_repair salvages its structure, as a schema-invalid item. Either way
# it carries a diagnostic error and never pollutes the surviving report.
assert result.report["quarantined_items"][0]["error"]
def test_resilient_report_quarantines_one_bad_item_among_valid():
recs = [_valid_rec(1), {"candidate": "WS-2", "action": "x", "why": "no rank"}, _valid_rec(3)]
raw = json.dumps({"summary": "Triage.", "recommendations": recs})
llm = _CountingLLM([raw, raw])
instr = _instr(
id="daily-triage-report",
prompt="Report.",
trusted_fields=[],
output_schema="schemas/daily-triage-report.json",
report_sinks=[{"type": "working-memory"}],
)
result = execute_instruction_with_audit(instr, _Event(), {}, llm)
assert result.output_validated is True
assert result.report["partial"] is True
assert len(result.report["recommendations"]) == 2
assert result.report["quarantined_count"] == 1
assert "rank" in result.report["quarantined_items"][0]["error"]
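The two recovery tests above pin down the contract: keep every recommendation that parses and validates, and quarantine the rest with a diagnostic error. The following is a minimal sketch of one way to salvage the valid prefix of a truncated `recommendations` array. The helper name `recover_prefix`, the `REQUIRED_KEYS` tuple, and the `truncated`/`schema` reason labels are assumptions for illustration; the actual logic sits behind `execute_instruction_with_audit` and `_try_repair`, which this diff does not show.

```python
import json
from typing import Any

# Assumed required fields; the real contract is schemas/daily-triage-report.json.
REQUIRED_KEYS = ("rank", "candidate", "action", "why")


def recover_prefix(raw: str) -> dict[str, Any]:
    """Keep every complete, valid recommendation; quarantine everything else."""
    kept: list[dict[str, Any]] = []
    quarantined: list[dict[str, Any]] = []
    key_at = raw.find('"recommendations"')
    # pos ends up 0 when either the key or its opening bracket is absent.
    pos = raw.find("[", key_at) + 1 if key_at >= 0 else 0
    if pos == 0:
        quarantined.append({"reason": "unparseable", "error": "no recommendations array"})
    decoder = json.JSONDecoder()
    while 0 < pos < len(raw):
        # raw_decode does not skip separators, so step over whitespace and commas.
        while pos < len(raw) and raw[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(raw) or raw[pos] == "]":
            break
        try:
            item, pos = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError as exc:
            # Truncated tail: record the parser error and stop scanning.
            quarantined.append({"reason": "truncated", "error": str(exc)})
            break
        missing = [k for k in REQUIRED_KEYS if not isinstance(item, dict) or k not in item]
        if missing:
            quarantined.append(
                {"reason": "schema", "error": f"missing required keys: {missing}", "item": item}
            )
        else:
            kept.append(item)
    return {
        "recommendations": kept,
        "partial": bool(quarantined),
        "quarantined_count": len(quarantined),
        "quarantined_items": quarantined,
    }
```

Decoding one array element at a time means a broken tail costs only the items after the break, rather than the whole document as a single `json.loads` call would.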
# ── WP-0016-T04 producer guardrails ───────────────────────────────────────────
def _triage_instr() -> SimpleNamespace:
return _instr(
id="daily-triage-report",
prompt="Report.",
trusted_fields=[],
output_schema="schemas/daily-triage-report.json",
report_sinks=[{"type": "working-memory"}],
)
def test_guardrail_count_cap_on_valid_happy_path():
# 9 fully-valid recommendations in a syntactically valid document: schema
# validation passes, but the maxItems=7 count cap must keep 7 and quarantine 2.
recs = [_valid_rec(i) for i in range(1, 10)]
raw = json.dumps({"summary": "Triage.", "recommendations": recs})
llm = _CountingLLM([raw])
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
assert llm.call_count == 1 # no retry — the document was valid
assert result.report["partial"] is True
assert len(result.report["recommendations"]) == 7
assert result.report["quarantined_count"] == 2
assert all(q["reason"] == "over_limit" for q in result.report["quarantined_items"])
def test_guardrail_oversized_string_quarantined():
big = _valid_rec(2)
big["why"] = "x" * 5000 # exceeds _MAX_STRING_LEN
raw = json.dumps({"summary": "Triage.", "recommendations": [_valid_rec(1), big]})
llm = _CountingLLM([raw])
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
assert len(result.report["recommendations"]) == 1
assert result.report["quarantined_count"] == 1
assert result.report["quarantined_items"][0]["reason"] == "guardrail"
def test_guardrail_allow_list_rejects_unknown_candidate():
raw = json.dumps({
"summary": "Triage.",
"recommendations": [_valid_rec(1), _valid_rec(2)], # candidates WS-1, WS-2
})
llm = _CountingLLM([raw])
context = {"known_candidates": ["WS-1"]}
result = execute_instruction_with_audit(_triage_instr(), _Event(), context, llm)
assert len(result.report["recommendations"]) == 1
assert result.report["recommendations"][0]["candidate"] == "WS-1"
assert result.report["quarantined_items"][0]["reason"] == "allow_list"
def _nested(depth: int) -> dict[str, Any]:
node: dict[str, Any] = {"leaf": 1}
for _ in range(depth):
node = {"a": node}
return node
def test_guardrail_over_depth_quarantined():
deep = _valid_rec(2)
deep["extra"] = _nested(12) # well past _MAX_DEPTH
raw = json.dumps({"summary": "Triage.", "recommendations": [_valid_rec(1), deep]})
llm = _CountingLLM([raw])
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
assert len(result.report["recommendations"]) == 1
assert result.report["quarantined_count"] == 1
assert result.report["quarantined_items"][0]["reason"] == "guardrail"
assert "depth" in result.report["quarantined_items"][0]["error"]
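The guardrail tests above exercise four checks: a count cap, a string-length limit, a nesting-depth limit, and a candidate allow-list. A rough sketch of how such checks could be layered is shown below. The function `apply_guardrails` and the values of `MAX_STRING_LEN` and `MAX_DEPTH` are assumptions; the tests only establish that 5000 characters and 12 extra levels exceed the real `_MAX_STRING_LEN` and `_MAX_DEPTH`, and that the count cap is 7.

```python
from typing import Any, Optional

MAX_ITEMS = 7          # the maxItems=7 cap named in the tests
MAX_STRING_LEN = 2000  # assumed value; the real _MAX_STRING_LEN is not shown
MAX_DEPTH = 8          # assumed value; the real _MAX_DEPTH is not shown


def _depth(value: Any) -> int:
    """Nesting depth: scalars are 0, each dict or list layer adds 1."""
    if isinstance(value, dict):
        return 1 + max((_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((_depth(v) for v in value), default=0)
    return 0


def _longest_string(value: Any) -> int:
    """Length of the longest string found anywhere inside value."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, dict):
        return max((_longest_string(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((_longest_string(v) for v in value), default=0)
    return 0


def apply_guardrails(
    recs: list[dict[str, Any]],
    known_candidates: Optional[list[str]] = None,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (kept, quarantined); per-item checks run before the count cap."""
    kept: list[dict[str, Any]] = []
    quarantined: list[dict[str, Any]] = []
    for rec in recs:
        candidate = rec.get("candidate")
        if _depth(rec) > MAX_DEPTH:
            reason, error = "guardrail", f"nesting depth exceeds {MAX_DEPTH}"
        elif _longest_string(rec) > MAX_STRING_LEN:
            reason, error = "guardrail", f"string longer than {MAX_STRING_LEN} chars"
        elif known_candidates is not None and candidate not in known_candidates:
            reason, error = "allow_list", f"unknown candidate {candidate!r}"
        elif len(kept) >= MAX_ITEMS:
            reason, error = "over_limit", f"more than {MAX_ITEMS} recommendations"
        else:
            kept.append(rec)
            continue
        quarantined.append({"reason": reason, "error": error, "candidate": candidate})
    return kept, quarantined
```

Running the per-item checks before the cap means an oversized or unknown item never takes one of the seven slots from a valid one.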
def test_resilient_recovery_against_real_2026_06_26_fixture():
# The actual captured failure payload (4000-char preview, truncated at the 7th
# recommendation) — the run that reset the WP-0006-T03 streak. Before WP-0016
# this discarded the whole report; now it must recover the valid prefix.
fixture = json.loads(
Path("tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json")
.read_text(encoding="utf-8")
)
raw = fixture["raw_output_preview"]
llm = _CountingLLM([raw, raw])
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
assert result.output_validated is True
assert result.report["partial"] is True
# Six recommendations are fully intact before the truncation point.
assert len(result.report["recommendations"]) >= 6
assert all("rank" in rec and "candidate" in rec for rec in result.report["recommendations"])
def test_execute_instruction_with_audit_preserves_invalid_report_with_sinks(
tmp_path,
monkeypatch,

@@ -0,0 +1,114 @@
from __future__ import annotations
from typing import Any
import pytest
from activity_core import api
@pytest.mark.asyncio
async def test_admin_sync_definitions_only_does_not_require_temporal(
monkeypatch,
) -> None:
seen: dict[str, Any] = {}
async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
seen.update(kwargs)
return {"ok": True, "ran": {"definitions": True}}
monkeypatch.setattr(api, "_session_factory", object())
monkeypatch.setattr(api, "_temporal_client", None)
monkeypatch.setattr(api, "run_sync", fake_run_sync)
result = await api.admin_sync(
definitions=True,
schedules=False,
event_types=False,
)
assert result == {"ok": True, "ran": {"definitions": True}}
assert seen["session_factory"] is api._session_factory
assert seen["temporal_client"] is None
assert seen["definitions"] is True
assert seen["schedules"] is False
assert seen["event_types"] is False
@pytest.mark.asyncio
async def test_admin_sync_schedules_only_passes_temporal(monkeypatch) -> None:
temporal = object()
seen: dict[str, Any] = {}
async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
seen.update(kwargs)
return {
"ok": True,
"schedules": {
"upserted": 1,
"paused": 0,
"deleted_orphans": 0,
},
}
monkeypatch.setattr(api, "_session_factory", object())
monkeypatch.setattr(api, "_temporal_client", temporal)
monkeypatch.setattr(api, "run_sync", fake_run_sync)
result = await api.admin_sync(
definitions=False,
schedules=True,
event_types=False,
)
assert result["schedules"]["upserted"] == 1
assert seen["temporal_client"] is temporal
assert seen["definitions"] is False
assert seen["schedules"] is True
assert seen["event_types"] is False
@pytest.mark.asyncio
async def test_admin_sync_all_sync_returns_failure_result(monkeypatch) -> None:
async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
return {
"ok": False,
"ran": {
"definitions": kwargs["definitions"],
"schedules": kwargs["schedules"],
"event_types": kwargs["event_types"],
},
"errors": [
{
"stage": "event_types",
"type": "RuntimeError",
"message": "bad event type",
}
],
}
monkeypatch.setattr(api, "_session_factory", object())
monkeypatch.setattr(api, "_temporal_client", object())
monkeypatch.setattr(api, "run_sync", fake_run_sync)
result = await api.admin_sync(
definitions=True,
schedules=True,
event_types=True,
)
assert result == {
"ok": False,
"ran": {
"definitions": True,
"schedules": True,
"event_types": True,
},
"errors": [
{
"stage": "event_types",
"type": "RuntimeError",
"message": "bad event type",
}
],
}

@@ -1,6 +1,7 @@
from __future__ import annotations
import json
+ from pathlib import Path
import pytest
@@ -70,7 +71,14 @@ async def test_evaluate_instructions_returns_task_specs_with_audit(monkeypatch)
async def test_evaluate_instructions_returns_report_payload(monkeypatch) -> None:
llm = FakeLLMClient(json.dumps({
"summary": "State Hub has open loose ends.",
- "recommendations": [{"candidate": "CUST-WP-0045", "action": "work-next"}],
+ "recommendations": [
+ {
+ "rank": 1,
+ "candidate": "CUST-WP-0045",
+ "action": "work-next",
+ "why": "Open loose ends.",
+ }
+ ],
}))
monkeypatch.setattr(activities, "get_llm_client", lambda: llm)
@@ -209,6 +217,12 @@ async def test_evaluate_instructions_forwards_llm_connect_depth_config(monkeypat
"context": {},
})
+ # Read the live schema file rather than hard-coding it, so the forwarded
+ # json_schema assertion tracks schemas/daily-triage-report.json as the
+ # contract evolves (ACTIVITY-WP-0016-T02).
+ expected_schema = json.loads(
+ Path("schemas/daily-triage-report.json").read_text(encoding="utf-8")
+ )
assert llm.calls[0][2] == {
"model_name": "custodian-triage-balanced",
"temperature": 0.2,
@@ -216,16 +230,6 @@ async def test_evaluate_instructions_forwards_llm_connect_depth_config(monkeypat
"max_depth": 2,
"model_params": {
"reasoning_effort": "medium",
- "json_schema": {
- "type": "object",
- "required": ["summary", "recommendations"],
- "properties": {
- "summary": {"type": "string"},
- "recommendations": {
- "type": "array",
- "items": {"type": "object"},
- },
- },
- },
+ "json_schema": expected_schema,
},
}

@@ -34,7 +34,7 @@ def test_issue_core_rest_sink_posts_task_contract(monkeypatch) -> None:
monkeypatch.setattr(httpx, "post", fake_post)
- ref = IssueCoreRestSink("http://issue-core.test/").emit(TaskSpec(
+ ref = IssueCoreRestSink("http://issue-core.test/", api_key="test-key").emit(TaskSpec(
title="Run SBOM rescan for activity-core",
description="SBOM is older than 30 days.",
target_repo="activity-core",
@@ -67,9 +67,28 @@ def test_issue_core_rest_sink_posts_task_contract(monkeypatch) -> None:
"triggering_event_id": "scheduled",
"activity_definition_id": "activity-1",
},
+ "headers": {"Authorization": "Bearer test-key"},
"timeout": 10.0,
}
]
+ assert "review_required" not in posts[0]["json"]
+ def test_issue_core_rest_sink_requires_api_key() -> None:
+ sink = IssueCoreRestSink("http://issue-core.test/", api_key="")
+ with pytest.raises(RuntimeError, match="ISSUE_CORE_API_KEY"):
+ sink.emit(TaskSpec(
+ title="t",
+ description="",
+ target_repo="activity-core",
+ priority="low",
+ labels=[],
+ due_in_days=None,
+ source_type="rule",
+ source_id="r",
+ triggering_event_id="e",
+ activity_definition_id="a",
+ ))
@pytest.mark.asyncio

@@ -0,0 +1,195 @@
from __future__ import annotations
from pathlib import Path
from typing import Any
import httpx
import pytest
import yaml
from activity_core.context_resolvers.kaizen import (
KaizenContextResolver,
discover_kaizen_scheduled_repos,
)
class DummyResponse:
def __init__(self, payload: Any, status_error: Exception | None = None) -> None:
self.payload = payload
self.status_error = status_error
def raise_for_status(self) -> None:
if self.status_error is not None:
raise self.status_error
def json(self) -> Any:
return self.payload
def _write_schedule(path: Path, agents: dict[str, Any]) -> None:
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(
yaml.safe_dump(
{"version": "1", "timezone": "Europe/Berlin", "agents": agents},
sort_keys=False,
),
encoding="utf-8",
)
def test_discover_scheduled_repos_emits_enabled_coach(tmp_path, monkeypatch) -> None:
repo_root = tmp_path / "pilot-repo"
repo_root.mkdir()
_write_schedule(
repo_root / ".kaizen" / "schedule.yml",
{"coach": {"cadence": "daily", "cron": "15 * * * *", "enabled": True}},
)
def fake_get(url: str, **kwargs: Any) -> DummyResponse:
return DummyResponse(
[
{
"slug": "pilot-repo",
"domain_slug": "custodian",
"host_paths": {"testhost": str(repo_root)},
}
]
)
monkeypatch.setenv("STATE_HUB_URL", "http://hub.test")
monkeypatch.setenv("KAIZEN_RUNNER_HOST", "testhost")
monkeypatch.setattr(httpx, "get", fake_get)
result = discover_kaizen_scheduled_repos({})
assert len(result["scheduled_runs"]) == 1
run = result["scheduled_runs"][0]
assert run["repo"] == "pilot-repo"
assert run["agent"] == "coach"
assert run["enabled"] is True
assert "schedule prepare coach" in run["prepare_command"]
def test_discover_scheduled_repos_skips_disabled_coach(tmp_path, monkeypatch) -> None:
repo_root = tmp_path / "pilot-repo"
repo_root.mkdir()
_write_schedule(
repo_root / ".kaizen" / "schedule.yml",
{"coach": {"cadence": "daily", "enabled": False}},
)
monkeypatch.setenv("STATE_HUB_URL", "http://hub.test")
monkeypatch.setenv("KAIZEN_RUNNER_HOST", "testhost")
monkeypatch.setattr(
httpx,
"get",
lambda url, **kwargs: DummyResponse(
[{"slug": "pilot-repo", "host_paths": {"testhost": str(repo_root)}}]
),
)
result = discover_kaizen_scheduled_repos({})
assert result["scheduled_runs"] == []
def test_discover_scheduled_repos_skips_missing_schedule(tmp_path, monkeypatch) -> None:
repo_root = tmp_path / "no-schedule"
repo_root.mkdir()
monkeypatch.setenv("STATE_HUB_URL", "http://hub.test")
monkeypatch.setenv("KAIZEN_RUNNER_HOST", "testhost")
monkeypatch.setattr(
httpx,
"get",
lambda url, **kwargs: DummyResponse(
[{"slug": "no-schedule", "host_paths": {"testhost": str(repo_root)}}]
),
)
result = discover_kaizen_scheduled_repos({})
assert result["scheduled_runs"] == []
def test_discover_scheduled_repos_skips_invalid_schedule(tmp_path, monkeypatch) -> None:
repo_root = tmp_path / "bad-schedule"
schedule = repo_root / ".kaizen" / "schedule.yml"
schedule.parent.mkdir(parents=True)
schedule.write_text("version: '2'\nagents: {}\n", encoding="utf-8")
monkeypatch.setenv("STATE_HUB_URL", "http://hub.test")
monkeypatch.setenv("KAIZEN_RUNNER_HOST", "testhost")
monkeypatch.setattr(
httpx,
"get",
lambda url, **kwargs: DummyResponse(
[{"slug": "bad-schedule", "host_paths": {"testhost": str(repo_root)}}]
),
)
result = discover_kaizen_scheduled_repos({})
assert result["scheduled_runs"] == []
def test_discover_scheduled_repos_filters_by_roster_and_cadence(
tmp_path, monkeypatch
) -> None:
repo_a = tmp_path / "kaizen-agentic"
repo_b = tmp_path / "other-repo"
for root in (repo_a, repo_b):
_write_schedule(
root / ".kaizen" / "schedule.yml",
{
"coach": {"cadence": "daily", "enabled": True},
"optimization": {"cadence": "weekly", "enabled": True},
},
)
roster = tmp_path / "roster.yaml"
roster.write_text(
yaml.safe_dump(
{
"active": [
{"slug": "kaizen-agentic", "agents": ["coach"], "status": "active"}
]
}
),
encoding="utf-8",
)
monkeypatch.setenv("STATE_HUB_URL", "http://hub.test")
monkeypatch.setenv("KAIZEN_RUNNER_HOST", "testhost")
monkeypatch.setattr(
httpx,
"get",
lambda url, **kwargs: DummyResponse(
[
{"slug": "kaizen-agentic", "host_paths": {"testhost": str(repo_a)}},
{"slug": "other-repo", "host_paths": {"testhost": str(repo_b)}},
]
),
)
result = discover_kaizen_scheduled_repos(
{"roster": str(roster), "cadence": "daily"}
)
agents = {r["agent"] for r in result["scheduled_runs"]}
repos = {r["repo"] for r in result["scheduled_runs"]}
assert repos == {"kaizen-agentic"}
assert agents == {"coach"}
def test_hub_unreachable_raises(monkeypatch) -> None:
monkeypatch.setenv("STATE_HUB_URL", "http://hub.test")
def fail_get(url: str, **kwargs: Any) -> DummyResponse:
raise httpx.ConnectError("down")
monkeypatch.setattr(httpx, "get", fail_get)
with pytest.raises(RuntimeError, match="State Hub unreachable"):
discover_kaizen_scheduled_repos({})
def test_resolver_registry_alias() -> None:
resolver = KaizenContextResolver()
assert resolver.resolve("unknown_query", None, {}) == {}

@@ -166,6 +166,93 @@ def test_state_hub_progress_sink_is_idempotent(monkeypatch) -> None:
assert result[0]["idempotency_key"] == idempotency_key
def test_core_hub_interaction_event_sink_posts_and_verifies_compact_event(monkeypatch) -> None:
posts: list[dict[str, Any]] = []
def fake_post(url: str, **kwargs: Any) -> DummyResponse:
assert url == "http://core-hub.test/api/v2/interaction-events"
assert kwargs["headers"]["Authorization"] == "Bearer runtime-secret"
posts.append({"url": url, **kwargs})
return DummyResponse(
{
"id": "event-1",
"eventType": "ops-endpoint-verified",
"widgetId": "widget-1",
}
)
def fake_get(url: str, **kwargs: Any) -> DummyResponse:
assert url == "http://core-hub.test/api/v2/interaction-events"
assert kwargs["headers"]["Authorization"] == "Bearer runtime-secret"
return DummyResponse({"data": [{"id": "event-1"}]})
monkeypatch.setenv("CORE_HUB_RUNTIME_TOKEN", "runtime-secret")
monkeypatch.setattr(httpx, "post", fake_post)
monkeypatch.setattr(httpx, "get", fake_get)
result = persist_ops_inventory_evidence(
_payload([
{
"type": "core-hub-interaction-event",
"core_hub_url": "http://core-hub.test",
"widget_id": "widget-1",
"event_type": "ops-endpoint-verified",
}
])
)
assert result == [
{
"type": "core-hub-interaction-event",
"status": "posted",
"event_type": "ops-endpoint-verified",
"event_id": "event-1",
"widget_id": "widget-1",
"verified": True,
"context_key": "ops_probe",
}
]
body = posts[0]["json"]
assert body["widgetId"] == "widget-1"
assert body["eventType"] == "ops-endpoint-verified"
assert body["metadata"]["activity_core_run_id"] == _run_id()
assert body["metadata"]["endpoint"]["url"] == "http://state-hub.test/health"
assert body["metadata"]["endpoint"]["widget_ref"] == "ops:endpoint:state-hub-health"
serialized = json.dumps(body, sort_keys=True)
assert "runtime-secret" not in serialized
assert "secret response body" not in serialized
assert "Authorization" not in serialized
assert "user:pass" not in serialized
assert "token=secret" not in serialized
def test_core_hub_sink_skips_cleanly_when_config_missing(monkeypatch) -> None:
monkeypatch.delenv("CORE_HUB_BASE_URL", raising=False)
monkeypatch.delenv("CORE_HUB_RUNTIME_TOKEN", raising=False)
monkeypatch.delenv("CORE_HUB_RUNTIME_TOKEN_FILE", raising=False)
monkeypatch.delenv("CORE_HUB_WIDGET_ID", raising=False)
monkeypatch.delenv("CORE_HUB_WIDGET_MAPPING", raising=False)
result = persist_ops_inventory_evidence(
_payload([{"type": "core-hub-interaction-event"}])
)
assert result == [
{
"type": "core-hub-interaction-event",
"status": "skipped",
"reason": "missing_core_hub_config",
"missing": [
"CORE_HUB_BASE_URL",
"CORE_HUB_RUNTIME_TOKEN or CORE_HUB_RUNTIME_TOKEN_FILE",
"widget_id or CORE_HUB_WIDGET_ID",
],
"context_key": "ops_probe",
}
]
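The skip test above implies a pre-flight check that names every missing setting before any HTTP call is made. A possible shape for that check is sketched here. The helper names `missing_core_hub_config` and `skip_result` are hypothetical, and the lookup order (sink entry first, then environment) is an assumption inferred from the first test, where `core_hub_url` and `widget_id` come from the sink entry and the token from the environment. The sketch ignores `CORE_HUB_WIDGET_MAPPING` and does not read the token file, since the diff does not show how either is used.

```python
from typing import Any, Mapping, Optional


def missing_core_hub_config(sink: Mapping[str, Any], env: Mapping[str, str]) -> list[str]:
    """List the settings that are absent from both the sink entry and the environment."""
    missing: list[str] = []
    if not (sink.get("core_hub_url") or env.get("CORE_HUB_BASE_URL")):
        missing.append("CORE_HUB_BASE_URL")
    if not (env.get("CORE_HUB_RUNTIME_TOKEN") or env.get("CORE_HUB_RUNTIME_TOKEN_FILE")):
        missing.append("CORE_HUB_RUNTIME_TOKEN or CORE_HUB_RUNTIME_TOKEN_FILE")
    if not (sink.get("widget_id") or env.get("CORE_HUB_WIDGET_ID")):
        missing.append("widget_id or CORE_HUB_WIDGET_ID")
    return missing


def skip_result(
    sink: Mapping[str, Any],
    env: Mapping[str, str],
    context_key: str = "ops_probe",
) -> Optional[dict[str, Any]]:
    """Return a 'skipped' sink result when config is incomplete, else None."""
    missing = missing_core_hub_config(sink, env)
    if not missing:
        return None
    return {
        "type": sink["type"],
        "status": "skipped",
        "reason": "missing_core_hub_config",
        "missing": missing,
        "context_key": context_key,
    }
```

Passing the environment in as a mapping (for example `os.environ`) keeps the check testable without `monkeypatch`.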
def test_inter_hub_sink_skips_cleanly_when_config_missing(monkeypatch) -> None:
monkeypatch.delenv("INTER_HUB_URL", raising=False)
monkeypatch.delenv("OPS_HUB_KEY", raising=False)

@@ -33,7 +33,9 @@ def _by_kind_name(kind: str, name: str) -> dict[str, Any]:
def test_runtime_config_has_ops_inventory_placeholders() -> None:
config = _by_kind_name("ConfigMap", "actcore-runtime-config")
- assert config["data"]["LLM_CONNECT_URL"] == ""
+ assert config["data"]["LLM_CONNECT_URL"] == (
+ "http://llm-connect.activity-core.svc.cluster.local:8080"
+ )
assert config["data"]["LLM_CONNECT_TIMEOUT_SECONDS"] == "300"
assert config["data"]["OPS_INVENTORY_PATH"] == (
"/etc/activity-core/ops/service-inventory.yml"

@@ -0,0 +1,160 @@
from __future__ import annotations
import json
import pytest
from temporalio.exceptions import ApplicationError
from activity_core import activities
from activity_core.activities import _bind_resolver_result, resolve_context
def test_bind_resolver_result_unwraps_single_key_wrapper() -> None:
projects = [{"repo": "kaizen-agentic", "has_metrics": True}]
assert _bind_resolver_result("projects", {"projects": projects}) == projects
def test_bind_resolver_result_keeps_multi_key_summary() -> None:
summary = {
"repos": [{"repo_slug": "a"}],
"stale_count": 1,
"total_count": 2,
}
assert _bind_resolver_result("repos", summary) == summary
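The two tests above describe the binding rule: a resolver result that is a one-key wrapper named after the bind target is unwrapped, while a multi-key summary is bound whole. A minimal sketch consistent with those two cases follows. The names `bind_key` and `bind_resolver_result` are stand-ins, and the real `_bind_resolver_result` may apply a different rule to cases these tests do not cover (for example, a single key with a different name).

```python
from typing import Any


def bind_key(bind_to: str) -> str:
    """Map 'context.projects' to 'projects' (assumed bind_to convention)."""
    return bind_to.removeprefix("context.")


def bind_resolver_result(key: str, result: Any) -> Any:
    """Unwrap {key: value} to value; leave every other shape untouched."""
    if isinstance(result, dict) and set(result) == {key}:
        return result[key]
    return result
```

Unwrapping only when the single key matches the bind name keeps rule conditions such as `context.projects` from needing a doubled path like `context.projects.projects`.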
@pytest.mark.asyncio
async def test_resolve_context_unwraps_kaizen_projects(monkeypatch) -> None:
class _FakeResolver:
def resolve(self, query: str, event: object, params: dict) -> dict:
assert query == "discover_kaizen_projects"
return {"projects": [{"repo": "pilot", "has_metrics": True}]}
import activity_core.context_resolvers # noqa: F401
from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY
monkeypatch.setitem(CONTEXT_RESOLVER_REGISTRY, "kaizen", lambda: _FakeResolver())
snapshot = await resolve_context(
[
{
"type": "kaizen",
"query": "discover_kaizen_projects",
"params": {},
"bind_to": "context.projects",
}
]
)
assert snapshot == {"projects": [{"repo": "pilot", "has_metrics": True}]}
@pytest.mark.asyncio
async def test_resolve_context_binds_event_payload_attributes() -> None:
envelope = {
"type": "kaizen.metrics.recorded",
"attributes": {
"agent": "coach",
"project": "kaizen-agentic",
"summary": {
"success_rate": 0.75,
"execution_count": 12,
"avg_quality": 0.81,
},
},
}
snapshot = await resolve_context(
[
{
"type": "event-payload",
"bind_to": "context.metrics",
}
],
json.dumps(envelope),
)
assert snapshot == {
"metrics": {
"agent": "coach",
"project": "kaizen-agentic",
"summary": {
"success_rate": 0.75,
"execution_count": 12,
"avg_quality": 0.81,
},
}
}
@pytest.mark.asyncio
async def test_event_payload_context_supports_low_success_rate_rule() -> None:
snapshot = await resolve_context(
[
{
"type": "event-payload",
"bind_to": "context.metrics",
}
],
json.dumps({
"type": "kaizen.metrics.recorded",
"attributes": {
"agent": "coach",
"project": "kaizen-agentic",
"summary": {"success_rate": 0.75},
},
}),
)
result = await activities.evaluate_rules({
"rules": [
{
"id": "flag-low-success-rate",
"condition": "context.metrics.summary.success_rate < 0.8",
"action": {
"task_template": (
"Review low success rate for {context.metrics.agent}"
),
"target_repo": "context.metrics.project",
"priority": "high",
"labels": ["kaizen", "{context.metrics.agent}"],
},
}
],
"event": {},
"context": snapshot,
})
assert len(result) == 1
assert result[0]["source_id"] == "flag-low-success-rate"
assert result[0]["title"] == "Review low success rate for coach"
assert result[0]["target_repo"] == "kaizen-agentic"
assert result[0]["labels"] == ["kaizen", "coach"]
@pytest.mark.asyncio
async def test_event_payload_context_binds_empty_when_optional_envelope_missing() -> None:
snapshot = await resolve_context(
[
{
"type": "event-payload",
"bind_to": "context.metrics",
}
],
)
assert snapshot == {"metrics": {}}
@pytest.mark.asyncio
async def test_event_payload_context_fails_when_required_envelope_missing() -> None:
with pytest.raises(ApplicationError, match="Required context resolver"):
await resolve_context(
[
{
"type": "event-payload",
"bind_to": "context.metrics",
"required": True,
}
],
)

@@ -0,0 +1,167 @@
from __future__ import annotations
import json
from pathlib import Path
from typing import Any
import pytest
from temporalio.exceptions import ApplicationError
from activity_core.activities import resolve_context
from activity_core.context_resolvers import reuse_surface
from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY
class _Response:
def __init__(self, payload: Any) -> None:
self._payload = payload
def raise_for_status(self) -> None:
return None
def json(self) -> Any:
return self._payload
class _Completed:
returncode = 0
stderr = ""
def __init__(self, payload: dict[str, Any]) -> None:
self.stdout = json.dumps(payload)
def _write_rollout(path: Path) -> None:
path.write_text(
"""
domains:
reuse:
phase: active
repos:
- reuse-surface
- activity-core
parked:
phase: backlog
repos:
- ignored-repo
""".lstrip(),
encoding="utf-8",
)
def _write_cli_only_signals(path: Path) -> None:
path.write_text(
"""
signals:
empty_capability_scaffold:
enabled: true
registry_gap:
enabled: false
stale_scope:
enabled: false
stale_sbom:
enabled: false
publish_check_fail:
enabled: false
""".lstrip(),
encoding="utf-8",
)
def test_shell_resolver_emits_reuse_surface_gaps_and_advances_cursor(
tmp_path,
monkeypatch,
) -> None:
rollout = tmp_path / "rollout.yaml"
_write_rollout(rollout)
_write_cli_only_signals(tmp_path / "signals.yml")
reuse_root = tmp_path / "reuse-surface"
reuse_root.mkdir()
(reuse_root / "SCOPE.md").write_text("fresh\n", encoding="utf-8")
activity_root = tmp_path / "activity-core"
activity_root.mkdir()
monkeypatch.setenv("KAIZEN_RUNNER_HOST", "runner")
def fake_get(url: str, **kwargs: Any) -> _Response:
assert url.endswith("/repos/")
return _Response(
[
{
"slug": "reuse-surface",
"host_paths": {"runner": str(reuse_root)},
},
{
"slug": "activity-core",
"host_paths": {"runner": str(activity_root)},
},
]
)
def fake_run(cmd: list[str], **kwargs: Any) -> _Completed:
assert cmd == ["reuse-surface", "report", "gaps", "--format", "json"]
return _Completed({"empty_scaffolds": ["reuse-surface"]})
monkeypatch.setattr(reuse_surface.httpx, "get", fake_get)
monkeypatch.setattr(reuse_surface.subprocess, "run", fake_run)
import activity_core.context_resolvers # noqa: F401
result = CONTEXT_RESOLVER_REGISTRY["shell"]().resolve(
"reuse_surface_report_gaps",
None,
{
"roster": str(rollout),
"batch_size": 1,
},
)
assert result == {
"gaps": [
{
"repo": "reuse-surface",
"root": str(reuse_root),
"signal": "empty_capability_scaffold",
"hygiene_signal": "empty_capability_scaffold",
}
]
}
state = json.loads((tmp_path / "round-robin-state.json").read_text(encoding="utf-8"))
assert state["cursor"] == 1
assert state["last_batch"] == ["reuse-surface"]
def test_shell_resolver_keeps_kaizen_fallback_for_existing_queries() -> None:
assert CONTEXT_RESOLVER_REGISTRY["shell"]().resolve("unknown_query", None, {}) == {}
@pytest.mark.asyncio
async def test_optional_reuse_surface_missing_roster_binds_empty_list(tmp_path) -> None:
snapshot = await resolve_context(
[
{
"type": "shell",
"query": "reuse_surface_report_gaps",
"params": {"roster": str(tmp_path / "missing.yaml")},
"bind_to": "context.gaps",
}
]
)
assert snapshot == {"gaps": []}
@pytest.mark.asyncio
async def test_required_reuse_surface_missing_roster_fails_visibly(tmp_path) -> None:
with pytest.raises(ApplicationError, match="Required context resolver"):
await resolve_context(
[
{
"type": "shell",
"query": "reuse_surface_report_gaps",
"params": {"roster": str(tmp_path / "missing.yaml")},
"bind_to": "context.gaps",
"required": True,
}
]
)

@@ -0,0 +1,81 @@
"""ACTIVITY-WP-0014 T03: missed-fire detection verdict tests."""
from __future__ import annotations
from datetime import datetime, timedelta, timezone
from activity_core.schedule_health import evaluate_schedule_health
NOW = datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc)
def test_healthy_when_recent_fire_and_no_drops() -> None:
health = evaluate_schedule_health(
activity_id="a1",
missed_catchup_window=0,
last_fired_at=NOW - timedelta(minutes=5),
now=NOW,
expected_interval=timedelta(hours=1),
)
assert health.healthy is True
assert health.missed is False
assert health.reasons == []
def test_unhealthy_when_catchup_window_dropped_fires() -> None:
health = evaluate_schedule_health(
activity_id="a1",
missed_catchup_window=2,
last_fired_at=NOW - timedelta(minutes=5),
now=NOW,
)
assert health.missed is True
assert "2 fire(s) dropped" in health.reasons[0]
def test_unhealthy_when_last_fire_too_stale() -> None:
health = evaluate_schedule_health(
activity_id="daily",
missed_catchup_window=0,
last_fired_at=NOW - timedelta(days=2),
now=NOW,
expected_interval=timedelta(days=1),
)
assert health.missed is True
assert any("exceeding the expected" in r for r in health.reasons)
assert health.staleness == timedelta(days=2)
def test_within_tolerance_is_healthy() -> None:
health = evaluate_schedule_health(
activity_id="daily",
missed_catchup_window=0,
last_fired_at=NOW - (timedelta(days=1) + timedelta(minutes=5)),
now=NOW,
expected_interval=timedelta(days=1),
tolerance=timedelta(minutes=10),
)
assert health.healthy is True
def test_no_fire_recorded_for_due_schedule_is_unhealthy() -> None:
health = evaluate_schedule_health(
activity_id="daily",
missed_catchup_window=0,
last_fired_at=None,
now=NOW,
expected_interval=timedelta(days=1),
)
assert health.missed is True
assert "no recorded fire" in health.reasons[0]
def test_no_interval_and_no_fire_is_not_flagged() -> None:
# Without an expected interval we cannot assert a miss from absence alone.
health = evaluate_schedule_health(
activity_id="event-ish",
missed_catchup_window=0,
last_fired_at=None,
now=NOW,
)
assert health.healthy is True
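Taken together, the verdict tests above suggest two independent miss signals: fires dropped by the catchup window, and a last fire that is older than the expected interval plus a tolerance. The sketch below reproduces that behaviour under stated assumptions. `sketch_schedule_health` and `ScheduleHealth` are illustrative names, the default tolerance of zero is a guess, and the exact reason strings in `activity_core.schedule_health` are not shown in this diff beyond the fragments the tests match.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScheduleHealth:
    activity_id: str
    missed: bool
    reasons: list[str] = field(default_factory=list)
    staleness: Optional[timedelta] = None

    @property
    def healthy(self) -> bool:
        return not self.missed


def sketch_schedule_health(
    *,
    activity_id: str,
    missed_catchup_window: int,
    last_fired_at: Optional[datetime],
    now: datetime,
    expected_interval: Optional[timedelta] = None,
    tolerance: timedelta = timedelta(0),  # assumed default
) -> ScheduleHealth:
    reasons: list[str] = []
    staleness = now - last_fired_at if last_fired_at is not None else None
    if missed_catchup_window > 0:
        reasons.append(f"{missed_catchup_window} fire(s) dropped outside the catchup window")
    # Absence or age of the last fire only counts when an interval is known.
    if expected_interval is not None:
        if staleness is None:
            reasons.append("no recorded fire for a schedule with an expected interval")
        elif staleness > expected_interval + tolerance:
            reasons.append(
                f"last fire was {staleness} ago, exceeding the expected {expected_interval}"
            )
    return ScheduleHealth(activity_id, bool(reasons), reasons, staleness)
```

Returning the reasons as a list lets a caller report both signals at once when a schedule has dropped fires and is also stale.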

@@ -37,6 +37,7 @@ def _make_defn(
misfire_policy: str = "skip",
enabled: bool = True,
jitter: int = 0,
+ catchup_window_seconds: int | None = None,
) -> ActivityDefinition:
return ActivityDefinition(
id=uuid.uuid4(),
@@ -46,6 +47,7 @@ def _make_defn(
cron_expression=cron,
misfire_policy=misfire_policy,
jitter_seconds=jitter,
+ catchup_window_seconds=catchup_window_seconds,
),
)
@@ -186,6 +188,76 @@ async def test_misfire_policy_compress_sets_overlap_buffer_one(env: WorkflowEnvi
await delete_schedule(env.client, defn.id)
# ── ACTIVITY-WP-0014: explicit run-miss policies + catchup window ────────────
@pytest.mark.asyncio
async def test_skip_sets_short_catchup_window(env: WorkflowEnvironment) -> None:
"""skip = run on trigger or skip: tiny grace window, no real recovery."""
defn = _make_defn(misfire_policy="skip")
await upsert_schedule(env.client, defn)
desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.SKIP
assert desc.schedule.policy.catchup_window == timedelta(seconds=60)
await delete_schedule(env.client, defn.id)
@pytest.mark.asyncio
async def test_catchup_all_recovers_full_window(env: WorkflowEnvironment) -> None:
"""catchup_all = recover every missed fire: long window, BUFFER_ALL."""
defn = _make_defn(misfire_policy="catchup_all")
await upsert_schedule(env.client, defn)
desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.BUFFER_ALL
assert desc.schedule.policy.catchup_window == timedelta(days=365)
await delete_schedule(env.client, defn.id)
@pytest.mark.asyncio
async def test_catchup_latest_does_not_accumulate(env: WorkflowEnvironment) -> None:
"""catchup_latest = recover only the most recent missed fire: BUFFER_ONE."""
defn = _make_defn(misfire_policy="catchup_latest")
await upsert_schedule(env.client, defn)
desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.BUFFER_ONE
assert desc.schedule.policy.catchup_window == timedelta(hours=24)
await delete_schedule(env.client, defn.id)
@pytest.mark.asyncio
async def test_legacy_aliases_map_to_explicit_policies(env: WorkflowEnvironment) -> None:
"""Legacy catchup/compress keep working and pick up the new catchup windows."""
catchup = _make_defn(misfire_policy="catchup")
compress = _make_defn(misfire_policy="compress")
await upsert_schedule(env.client, catchup)
await upsert_schedule(env.client, compress)
d1 = await env.client.get_schedule_handle(schedule_id(catchup.id)).describe()
d2 = await env.client.get_schedule_handle(schedule_id(compress.id)).describe()
assert d1.schedule.policy.catchup_window == timedelta(days=365)
assert d2.schedule.policy.catchup_window == timedelta(hours=24)
await delete_schedule(env.client, catchup.id)
await delete_schedule(env.client, compress.id)
@pytest.mark.asyncio
async def test_explicit_catchup_window_override(env: WorkflowEnvironment) -> None:
"""An explicit catchup_window_seconds overrides the per-policy default."""
defn = _make_defn(misfire_policy="skip", catchup_window_seconds=7200)
await upsert_schedule(env.client, defn)
desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
assert desc.schedule.policy.catchup_window == timedelta(hours=2)
await delete_schedule(env.client, defn.id)
@pytest.mark.asyncio
async def test_schedule_smoke_test_creates_one_shot_schedule(
env: WorkflowEnvironment,


@@ -215,6 +215,29 @@ def test_coding_retro_returns_latest_progress_suggestions(monkeypatch) -> None:
],
},
},
{
"id": "newer-30-day-retro",
"event_type": "coding_retro",
"summary": "monthly coding retro ready",
"created_at": "2026-06-07T17:15:00Z",
"detail": {
"generated_at": "2026-06-07T17:14:30Z",
"window": {
"days": 30,
"since": "2026-05-08T00:00:00Z",
"until": "2026-06-07T00:00:00Z",
},
"suggestions": [
{
"repo": "broad-retro-repo",
"title": "Should not displace the weekly retro",
"recommendation": "Keep weekly schedule bounded.",
"priority": "high",
"score": 99,
}
],
},
},
])
monkeypatch.setenv("STATE_HUB_URL", "http://state-hub.test/")
@@ -229,7 +252,7 @@ def test_coding_retro_returns_latest_progress_suggestions(monkeypatch) -> None:
assert calls == [
{
"url": "http://state-hub.test/progress/",
-"params": {"limit": 20},
+"params": {"event_type": "coding_retro", "limit": 20},
"timeout": 10.0,
}
]
@@ -251,6 +274,47 @@ def test_coding_retro_returns_latest_progress_suggestions(monkeypatch) -> None:
]
def test_coding_retro_returns_empty_when_window_does_not_match(monkeypatch) -> None:
def fake_get(url: str, **kwargs: Any) -> DummyResponse:
return DummyResponse([
{
"id": "monthly-retro",
"event_type": "coding_retro",
"summary": "monthly coding retro ready",
"created_at": "2026-06-07T17:10:00Z",
"detail": {
"window": {"days": 30},
"suggestions": [
{
"repo": "activity-core",
"title": "Broad retro item",
"recommendation": "Do not emit from weekly schedule.",
"priority": "high",
"score": 10,
}
],
},
}
])
monkeypatch.setattr(httpx, "get", fake_get)
result = StateHubContextResolver().resolve(
"coding_retro",
None,
{"event_type": "coding_retro", "window_days": 7},
)
assert result == {
"suggestions": [],
"window": None,
"generated_at": None,
"source_progress_id": None,
"event_type": "coding_retro",
"summary": "",
}
def test_coding_retro_returns_empty_shape_when_not_published(monkeypatch) -> None:
def fake_get(url: str, **kwargs: Any) -> DummyResponse:
return DummyResponse([
@@ -343,6 +407,70 @@ def test_recently_on_scope_hourly_failure_bubbles(monkeypatch) -> None:
StateHubContextResolver().resolve("recently_on_scope_hourly", None, {"range": "1h"})
def test_consistency_sweep_remote_all_posts_batch(monkeypatch) -> None:
calls: list[dict[str, Any]] = []
def fake_post(url: str, **kwargs: Any) -> DummyResponse:
calls.append({"url": url, **kwargs})
return DummyResponse(
{
"exit_code": 0,
"lock_skipped": False,
"repos_processed": [{"repo_slug": "state-hub", "result": "pass"}],
"skipped_clean": ["quiet-repo"],
"skipped_missing": [],
"skipped_budget": [],
}
)
monkeypatch.setenv("STATE_HUB_URL", "http://state-hub.test/")
monkeypatch.setattr(httpx, "post", fake_post)
result = StateHubContextResolver().resolve(
"consistency_sweep_remote_all",
None,
{"max_seconds": 300, "source": "activity-core", "required": True},
)
assert result["exit_code"] == 0
assert result["repos_processed"][0]["repo_slug"] == "state-hub"
assert calls == [
{
"url": "http://state-hub.test/consistency/sweep/remote-all",
"json": {"max_seconds": 300, "source": "activity-core"},
"timeout": 330.0,
}
]
def test_consistency_sweep_remote_all_failure_bubbles(monkeypatch) -> None:
def fake_post(url: str, **kwargs: Any) -> DummyResponse:
raise httpx.ConnectError("offline")
monkeypatch.setattr(httpx, "post", fake_post)
with pytest.raises(httpx.ConnectError):
StateHubContextResolver().resolve(
"consistency_sweep_remote_all",
None,
{"max_seconds": 300},
)
def test_consistency_sweep_remote_all_rejects_empty_response(monkeypatch) -> None:
def fake_post(url: str, **kwargs: Any) -> DummyResponse:
return DummyResponse({})
monkeypatch.setattr(httpx, "post", fake_post)
with pytest.raises(RuntimeError, match="missing required key"):
StateHubContextResolver().resolve(
"consistency_sweep_remote_all",
None,
{"max_seconds": 300},
)
def test_recently_on_scope_hourly_rejects_empty_response(monkeypatch) -> None:
def fake_post(url: str, **kwargs: Any) -> DummyResponse:
return DummyResponse({})


@@ -0,0 +1,81 @@
"""ACTIVITY-WP-0014 T05: idempotency-keyed State Hub writes."""
from __future__ import annotations
import httpx
import pytest
from activity_core import report_sinks
from activity_core.state_hub_write import (
IDEMPOTENCY_HEADER,
idempotency_headers,
idempotency_key,
)
def test_key_is_stable_and_deterministic() -> None:
a = idempotency_key("run1", "daily-triage-report", "daily_triage")
b = idempotency_key("run1", "daily-triage-report", "daily_triage")
assert a == b == "run1:daily-triage-report:daily_triage"
def test_key_shape_stable_with_missing_parts() -> None:
assert idempotency_key("run1", None, "daily_triage") == "run1::daily_triage"
def test_key_sanitizes_control_and_whitespace() -> None:
key = idempotency_key("run 1", "a\tb", "x\n")
assert "\t" not in key and "\n" not in key and " " not in key
def test_headers_carry_the_key() -> None:
headers = idempotency_headers("run1", "i", "e")
assert headers == {IDEMPOTENCY_HEADER: "run1:i:e"}
def test_distinct_identities_get_distinct_keys() -> None:
assert idempotency_key("r", "i", "daily_triage") != idempotency_key(
"r", "i", "schedule_miss"
)
def test_progress_exists_is_best_effort_on_connection_error(monkeypatch) -> None:
"""A down State Hub must not hard-fail the dedup read; it returns False so the
keyed write can still proceed."""
def _boom(*args, **kwargs):
raise httpx.ConnectError("Connection refused")
monkeypatch.setattr(report_sinks.httpx, "get", _boom)
assert (
report_sinks._progress_exists(
"http://127.0.0.1:8000", "run1", "daily-triage-report", "daily_triage"
)
is False
)
def test_report_sink_post_sends_idempotency_header(monkeypatch) -> None:
"""The state-hub-progress write carries a stable Idempotency-Key header."""
captured: dict[str, object] = {}
monkeypatch.setattr(report_sinks, "_progress_exists", lambda *a, **k: False)
class _Resp:
def raise_for_status(self) -> None: ...
def json(self) -> dict[str, str]:
return {"id": "pid-1"}
def _capture_post(url, json, headers, timeout): # noqa: A002
captured["headers"] = headers
return _Resp()
monkeypatch.setattr(report_sinks.httpx, "post", _capture_post)
payload = {"run_id": "run1", "activity_id": "act1", "scheduled_for": None}
report_entry = {"instruction_id": "daily-triage-report", "report": {"summary": "s"}}
sink = {"event_type": "daily_triage"}
result = report_sinks._post_state_hub_progress(payload, report_entry, sink)
assert result["status"] == "posted"
assert captured["headers"][IDEMPOTENCY_HEADER] == "run1:daily-triage-report:daily_triage"
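
The tests above pin down the key's observable behavior: colon-joined parts, an empty slot for a missing part, and no whitespace or control characters. A minimal sketch that would satisfy them follows. It is not the repository's `state_hub_write` module. The header value `Idempotency-Key` is taken from the test docstring, and the choice to delete unsafe characters (rather than replace them) is an assumption, as is the helper name `_clean`.

```python
import re
from typing import Dict, Optional

# Assumed header name, based on the "Idempotency-Key header" wording in the test docstring.
IDEMPOTENCY_HEADER = "Idempotency-Key"

# Whitespace and ASCII control characters are dropped from each part (assumed strategy).
_UNSAFE = re.compile(r"[\s\x00-\x1f\x7f]+")


def _clean(part: Optional[str]) -> str:
    """Hypothetical helper: a missing part becomes an empty slot, unsafe characters are removed."""
    return _UNSAFE.sub("", part or "")


def idempotency_key(
    run_id: Optional[str],
    instruction_id: Optional[str],
    event_type: Optional[str],
) -> str:
    """Join the three identity parts with ':' so the key shape never changes."""
    return ":".join(_clean(p) for p in (run_id, instruction_id, event_type))


def idempotency_headers(
    run_id: Optional[str],
    instruction_id: Optional[str],
    event_type: Optional[str],
) -> Dict[str, str]:
    """Wrap the key in the header mapping that a keyed write would send."""
    return {IDEMPOTENCY_HEADER: idempotency_key(run_id, instruction_id, event_type)}
```

Keeping the empty slot for a missing part means two writes that differ only in which part is absent still get different keys.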


@@ -0,0 +1,126 @@
from __future__ import annotations
import uuid
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Any
import pytest
from activity_core import sync_schedules
def _row(
*,
activity_id: uuid.UUID,
enabled: bool,
trigger_config: dict[str, Any],
) -> SimpleNamespace:
return SimpleNamespace(
id=activity_id,
name=f"definition-{activity_id}",
enabled=enabled,
trigger_config=trigger_config,
context_sources=[],
task_templates=[],
dedupe_key_strategy="skip",
version=1,
)
@pytest.mark.asyncio
async def test_sync_schedule_rows_reports_drift_counts_and_preserves_one_shots(
monkeypatch,
) -> None:
new_id = uuid.uuid4()
disabled_old_id = uuid.uuid4()
one_shot_id = uuid.uuid4()
orphan_id = uuid.uuid4()
upserted: list[tuple[uuid.UUID, bool, str]] = []
deleted: list[str] = []
async def fake_upsert_schedule(client: object, defn: object) -> None:
upserted.append((
defn.id,
defn.enabled,
defn.trigger_config.trigger_type,
))
async def fake_list_schedules(client: object) -> list[dict[str, str]]:
return [
{
"schedule_id": f"activity-schedule-{disabled_old_id}",
"activity_id": str(disabled_old_id),
},
{
"schedule_id": f"activity-schedule-{one_shot_id}-once",
"activity_id": f"{one_shot_id}-once",
},
{
"schedule_id": f"activity-schedule-{orphan_id}",
"activity_id": str(orphan_id),
},
]
async def fake_delete_schedule(client: object, activity_id: str) -> None:
deleted.append(activity_id)
monkeypatch.setattr(sync_schedules, "upsert_schedule", fake_upsert_schedule)
monkeypatch.setattr(sync_schedules, "list_schedules", fake_list_schedules)
monkeypatch.setattr(sync_schedules, "delete_schedule", fake_delete_schedule)
result = await sync_schedules.sync_schedule_rows(
object(),
[
_row(
activity_id=new_id,
enabled=True,
trigger_config={
"trigger_type": "cron",
"cron_expression": "20 7 * * *",
"timezone": "Europe/Berlin",
"misfire_policy": "skip",
},
),
_row(
activity_id=disabled_old_id,
enabled=False,
trigger_config={
"trigger_type": "cron",
"cron_expression": "20 * * * *",
"timezone": "Europe/Berlin",
"misfire_policy": "skip",
},
),
_row(
activity_id=one_shot_id,
enabled=True,
trigger_config={
"trigger_type": "scheduled",
"at": datetime(2026, 6, 19, 8, 0, tzinfo=timezone.utc),
"timezone": "UTC",
},
),
_row(
activity_id=uuid.uuid4(),
enabled=True,
trigger_config={
"trigger_type": "event",
"event_type": "kaizen.metrics.recorded",
"filters": {},
},
),
],
)
assert result.to_dict() == {
"upserted": 2,
"paused": 1,
"deleted_orphans": 1,
}
assert upserted == [
(new_id, True, "cron"),
(disabled_old_id, False, "cron"),
(one_shot_id, True, "scheduled"),
]
assert deleted == [str(orphan_id)]

tests/test_sync_service.py

@@ -0,0 +1,134 @@
from __future__ import annotations
from typing import Any
import pytest
from activity_core import sync_service
from activity_core.sync_schedules import ScheduleSyncResult
@pytest.mark.asyncio
async def test_run_sync_runs_requested_sections(monkeypatch) -> None:
calls: list[str] = []
async def fake_definitions(session_factory: object) -> int:
calls.append("definitions")
return 2
async def fake_event_types(session_factory: object) -> int:
calls.append("event_types")
return 5
async def fake_schedules(
temporal_client: object,
session_factory: object,
) -> ScheduleSyncResult:
calls.append("schedules")
return ScheduleSyncResult(upserted=3, paused=1, deleted_orphans=2)
monkeypatch.setattr(sync_service, "sync_activity_definitions", fake_definitions)
monkeypatch.setattr(sync_service, "sync_event_types", fake_event_types)
monkeypatch.setattr(sync_service, "sync_with_session_factory", fake_schedules)
result = await sync_service.run_sync(
session_factory=object(),
temporal_client=object(),
definitions=True,
schedules=True,
event_types=True,
)
assert calls == ["definitions", "event_types", "schedules"]
assert result["ok"] is True
assert result["ran"] == {
"definitions": True,
"schedules": True,
"event_types": True,
}
assert result["definitions"] == {"synced": 2}
assert result["event_types"] == {"synced": 5}
assert result["schedules"] == {
"upserted": 3,
"paused": 1,
"deleted_orphans": 2,
}
assert result["errors"] == []
@pytest.mark.asyncio
async def test_run_sync_collects_errors_and_continues(monkeypatch) -> None:
calls: list[str] = []
async def failing_definitions(session_factory: object) -> int:
calls.append("definitions")
raise RuntimeError("definition parse failed")
async def fake_schedules(
temporal_client: object,
session_factory: object,
) -> ScheduleSyncResult:
calls.append("schedules")
return ScheduleSyncResult(upserted=1)
monkeypatch.setattr(
sync_service,
"sync_activity_definitions",
failing_definitions,
)
monkeypatch.setattr(sync_service, "sync_with_session_factory", fake_schedules)
result = await sync_service.run_sync(
session_factory=object(),
temporal_client=object(),
definitions=True,
schedules=True,
event_types=False,
)
assert calls == ["definitions", "schedules"]
assert result["ok"] is False
assert result["definitions"] == {"synced": 0}
assert result["schedules"]["upserted"] == 1
assert result["errors"] == [
{
"stage": "definitions",
"type": "RuntimeError",
"message": "definition parse failed",
}
]
@pytest.mark.asyncio
async def test_run_sync_reports_missing_temporal_client_for_schedules() -> None:
result = await sync_service.run_sync(
session_factory=object(),
temporal_client=None,
definitions=False,
schedules=True,
event_types=False,
)
assert result["ok"] is False
assert result["errors"] == [
{
"stage": "schedules",
"type": "RuntimeError",
"message": "Temporal client is required for schedule sync",
}
]
def test_record_error_bounds_error_count() -> None:
result: dict[str, Any] = {
"ok": True,
"errors": [],
}
for i in range(25):
sync_service._record_error(result, "stage", RuntimeError(f"boom {i}"))
assert result["ok"] is False
assert len(result["errors"]) == 20
assert result["errors"][0]["message"] == "boom 0"
assert result["errors"][-1]["message"] == "boom 19"
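
The last test fixes two properties of the error recorder: any recorded error flips `ok` to false, and the list keeps only the first 20 entries. A minimal sketch of a recorder with that behavior is shown below. The constant name `MAX_ERRORS` and the function name `record_error` are illustrative; the real `_record_error` in `sync_service` may be structured differently.

```python
from typing import Any, Dict

# Assumed constant name; the test only establishes that the cap is 20.
MAX_ERRORS = 20


def record_error(result: Dict[str, Any], stage: str, exc: BaseException) -> None:
    """Mark the run as failed and keep at most MAX_ERRORS entries, oldest first."""
    result["ok"] = False
    errors = result.setdefault("errors", [])
    if len(errors) < MAX_ERRORS:
        errors.append(
            {
                "stage": stage,
                "type": type(exc).__name__,
                "message": str(exc),
            }
        )
```

Dropping the overflow instead of the oldest entries keeps the first failure visible, which is usually the one closest to the root cause.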

uv.lock

@@ -12,6 +12,7 @@ dependencies = [
{ name = "httpx" },
{ name = "nats-py" },
{ name = "pydantic" },
{ name = "pyyaml" },
{ name = "sqlalchemy", extra = ["asyncio"] },
{ name = "temporalio" },
{ name = "uvicorn", extra = ["standard"] },
@@ -34,6 +35,7 @@ requires-dist = [
{ name = "pydantic", specifier = ">=2.0" },
{ name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0" },
{ name = "pytest-asyncio", marker = "extra == 'dev'", specifier = ">=0.24" },
{ name = "pyyaml", specifier = ">=6.0" },
{ name = "sqlalchemy", extras = ["asyncio"], specifier = ">=2.0" },
{ name = "temporalio", specifier = ">=1.7" },
{ name = "temporalio", extras = ["testing"], marker = "extra == 'dev'", specifier = ">=1.7" },


@@ -8,7 +8,7 @@ status: active
owner: codex
topic_slug: custodian
created: "2026-06-03"
-updated: "2026-06-07"
+updated: "2026-06-27"
state_hub_workstream_id: "5646e13a-13af-4724-bca6-3c0d86f96733"
---
@@ -150,6 +150,59 @@ State Hub to `state-hub` (`dc10704f`), `railiance-cluster` (`53e78702`),
activity-core runner plus three clean scheduled daily runs and calibration
feedback.
2026-06-16: Rechecked State Hub and the configured working-memory sink. State
Hub `/progress/?event_type=daily_triage` still only shows activity-core
`daily_triage` progress through 2026-06-06, and
`/home/worsch/the-custodian/memory/working` only has `daily-triage-*` notes
for 2026-06-02 through 2026-06-06. There is still no evidence of three clean
consecutive scheduled runs after the June 7 runtime projection failure, so
T03 remains `wait`.
2026-06-18: Consumed the verified in-cluster llm-connect Service URL in the
Railiance runtime projection. `actcore-runtime-config` now sets
`LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080` and
keeps `LLM_CONNECT_TIMEOUT_SECONDS=300`. The remaining live gate is no longer
the URL slot itself; it is operator-owned provider credential custody for
`activity-core/llm-connect-provider-secrets`, a schema-valid fixture smoke, and
then three clean scheduled daily triage runs.
2026-06-18 follow-up: `llm-connect` reported State Hub message
`6a098e1e-65de-4309-ab4a-446aba2f3587`: the provider Secret now has a populated
key count and the in-namespace fixture smoke passed on the llm-connect side.
The remaining activity-core gate is to reconcile the live Railiance runtime so
the worker consumes the configured URL, then produce schema-valid daily triage
evidence and three clean scheduled runs. This narrower path is tracked in
`ACTIVITY-WP-0010`.
2026-06-25: Consecutive-run streak resumed. State Hub `daily_triage` progress
events from author `activity-core` fired on time on **2026-06-24 05:20:56Z** and
**2026-06-25 05:20:47Z** (07:20 Berlin), both delivered, no misfires. That is two
clean consecutive scheduled runs. **RECHECK 2026-06-26 (after 05:20Z):** confirm
the 06-26 scheduled `daily_triage` event delivered. If clean, that completes three
clean consecutive scheduled runs (06-24 / 06-25 / 06-26) — record the calibration
result in State Hub and close T03. If the 06-26 run misfires or is missing, the
streak resets and T03 stays `wait`. Flag deliberately kept in-repo (agent-agnostic)
rather than tied to any single coding agent's scheduler.
2026-06-26 recheck outcome: **streak reset at two.** The 06-26 scheduled run fired
on time (`daily_triage` event 05:20:57Z) — scheduling layer healthy, no misfire —
but the `daily-triage-report` instruction output **failed schema validation**:
`Expecting ',' delimiter: line 136 column 22 (char 5268)`. The model produced a
long ranked WSJF recommendation list (reached rank 7+ with nested `wsjf` objects)
whose JSON broke ~char 5268; only a bounded 4000-char preview is preserved in the
State Hub event, so the exact offending token needs the runtime llm-connect log.
This is an LLM-output-quality failure (tracked by `ACTIVITY-WP-0010`), not a
runtime/projection failure. T03 stays `wait`; three clean consecutive scheduled
runs not yet achieved (06-24 ✅, 06-25 ✅, 06-26 ✗-validation).
2026-06-27 recheck outcome: streak remains reset. The scheduled run fired and
wrote State Hub progress plus working memory, but `daily-triage-report` failed
validation again with an unterminated string around char 5246. This confirms the
runner/sink path is alive and the active blocker is live deployment of the
`ACTIVITY-WP-0016` output-robustness bundle and runtime prompt/token changes, not
a missing schedule. T03 stays `wait` until a post-deployment smoke passes and three
new clean scheduled runs are collected.
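
Both failures above report a character offset (5268 and 5246) that lies past the 4000-char preview kept in the State Hub event, which is why the offending token could not be read from the event alone. The sketch below shows one way to capture a small window around the decoder's error position at validation time. It uses only the standard-library `json` module; the function name `describe_json_failure` and the `radius` parameter are hypothetical and not part of activity-core.

```python
import json
from typing import Any, Dict, Optional


def describe_json_failure(raw: str, radius: int = 40) -> Optional[Dict[str, Any]]:
    """Return the decoder's error position plus nearby text, or None if raw parses."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        # exc.pos is the 0-based character offset reported as "(char N)" in the message.
        start = max(0, exc.pos - radius)
        end = min(len(raw), exc.pos + radius)
        return {
            "message": exc.msg,
            "line": exc.lineno,
            "column": exc.colno,
            "char": exc.pos,
            "context": raw[start:end],
        }
    return None
```

Storing this bounded `context` next to the preview would keep the evidence small while still showing the token that broke parsing.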
## Rule Action Contract Documentation
```task ```task


@@ -8,7 +8,7 @@ status: blocked
owner: codex
topic_slug: custodian
created: "2026-06-07"
-updated: "2026-06-07"
+updated: "2026-06-17"
state_hub_workstream_id: "7387fc50-1f2c-471a-9d85-bb085cbd0b63"
---
@@ -47,6 +47,12 @@ resolver. It reads recent `/progress/` items, selects the latest
`event_type=coding_retro`, normalizes `suggestions[]`, and returns an empty
suggestion list while the upstream publisher has not produced a read model yet.
**2026-06-17:** Hardened the resolver lookup after live review found recent
non-retro progress could hide older retro events. The resolver now queries
State Hub with `event_type=coding_retro` and only selects a read model matching
the requested `window_days`, so the weekly schedule cannot accidentally route a
broader 30-day retro batch.
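
The selection rule described in the 2026-06-17 note can be sketched as follows. This is an illustration only, not the resolver's actual code: the function name `select_read_model` is hypothetical, the item shape (`event_type`, `created_at`, `detail.window.days`) is taken from the test fixtures in this change set, and ordering by the `created_at` string assumes uniformly formatted UTC ISO-8601 timestamps.

```python
from typing import Any, Dict, List, Optional


def select_read_model(
    items: List[Dict[str, Any]],
    window_days: Optional[int],
) -> Optional[Dict[str, Any]]:
    """Pick the newest coding_retro item whose window matches the requested length."""
    candidates = []
    for item in items:
        if item.get("event_type") != "coding_retro":
            continue
        window = (item.get("detail") or {}).get("window") or {}
        # With no requested window, any retro qualifies; otherwise the days must match.
        if window_days is not None and window.get("days") != window_days:
            continue
        candidates.append(item)
    if not candidates:
        return None
    # Assumes created_at values share one ISO-8601 format, so string order is time order.
    return max(candidates, key=lambda item: item.get("created_at", ""))
```

With this rule a newer 30-day retro never displaces the weekly one, and a request for a 7-day window returns nothing when only a 30-day read model exists.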
## `weekly-coding-retro` Activity-Definition
```task ```task
@@ -92,3 +98,12 @@ make fix-consistency REPO=activity-core
Live State Hub did not yet expose a published `event_type=coding_retro` progress
item, so the real dry-run, duplicate check, and `enabled: true` flip remain
blocked on `AGENTIC-WP-0010`.
**2026-06-17:** `AGENTIC-WP-0010` is finished and State Hub has
`coding_retro` progress. A live no-write smoke now resolves the matching weekly
read model `ec20ac1c-ef50-4db4-a5dc-364d31a259a5`
(`generated_at=2026-06-07T19:25:19Z`, `window.days=7`) and emits zero task
specs because that weekly read model has zero suggestions. The schedule remains
disabled until a non-empty weekly read model, or an explicit operator decision
that a zero-suggestion dry-run is an acceptable enablement proof, confirms
correct routing and no duplicate target tasks on re-run.


@@ -0,0 +1,250 @@
---
id: ACTIVITY-WP-0009
type: workplan
title: "Intent gap closure"
domain: custodian
repo: activity-core
status: blocked
owner: codex
topic_slug: custodian
created: "2026-06-16"
updated: "2026-06-18"
state_hub_workstream_id: "d64cfbba-6da7-4737-afb9-866afa0e9cda"
---
# ACTIVITY-WP-0009 - Intent gap closure
## Context
The 2026-06-16 review of activity-core against `INTENT.md` found that the repo
matches the intended Event Bridge shape, but several production and contract
gaps remain before the implementation fully satisfies the operational promise:
- recurring scheduled work must be trusted without manual coordination
- live task creation must be proven through issue-core, not only null-sink audit
- `review_required` semantics must either be implemented or documented as
metadata only
- ops evidence must either remain explicitly fallback-first or activate the
Inter-Hub / ops-hub backend behind operator-owned secrets
- the `TaskExecutorWorkflow` stub must not become a back door into execution
ownership
- the internal FastAPI surface needs an explicit production access decision
The preserved analysis lives in:
`history/2026-06-16-intent-gap-analysis.md`
## Close Daily Triage Scheduled-Run Trust Gap
```task
id: ACTIVITY-WP-0009-T01
status: wait
priority: high
state_hub_task_id: "7012e4fd-2530-49b7-9c2f-1d949809a144"
```
Close the scheduled-run trust gap identified in `ACTIVITY-WP-0006-T03`.
Acceptance criteria:
- activity-core has three clean consecutive scheduled daily State Hub WSJF
triage runs after the June 7 runtime projection failure
- each run has matching Temporal workflow history, `activity_runs` row, State
Hub `daily_triage` progress, and working-memory report note
- calibration feedback is recorded in State Hub
- `ACTIVITY-WP-0006-T03` can move from `wait` to `done`
Current wait reason: as of 2026-06-16, State Hub `daily_triage` progress and
working-memory `daily-triage-*` notes only show activity-core evidence through
2026-06-06.
2026-06-18 update: activity-core now consumes the verified in-cluster
llm-connect Service URL in `k8s/railiance/20-runtime.yaml`:
`LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080` with
`LLM_CONNECT_TIMEOUT_SECONDS=300`. This removes the activity-core repo-side URL
gap. Closure still waits on the operator-owned provider Secret for llm-connect,
a schema-valid fixture smoke, and three clean scheduled daily triage runs with
matching State Hub and working-memory evidence.
2026-06-18 follow-up: State Hub message
`6a098e1e-65de-4309-ab4a-446aba2f3587` reports that the llm-connect side is now
complete: the provider Secret has a populated key count and the in-namespace
fixture smoke passed. The remaining work is the activity-core / Railiance
runtime reconciliation and daily-triage evidence collection path captured in
`ACTIVITY-WP-0010`.
## Promote Issue-Core Task Emission Safely
```task
id: ACTIVITY-WP-0009-T02
status: wait
priority: high
state_hub_task_id: "3854677b-32b4-43f8-a6ca-5a2b25a08dd9"
```
Move selected production-safe definitions from `ISSUE_SINK_TYPE=null` audit mode
toward real issue-core task creation.
Acceptance criteria:
- issue-core endpoint, credentials, and duplicate-handling posture are approved
for the target environment
- one known-safe definition is run first in null-sink mode and its task specs are
reviewed
- the same definition creates exactly the expected issue-core task(s) through
`IssueCoreRestSink`
- `task_spawn_log` records the real returned task references
- rollback to null-sink mode is documented
Current wait reason: production Railiance currently uses null-sink audit mode;
live issue-core credentials/access and duplicate-handling are not yet verified
for this repo.
## Resolve Review-Required Contract Drift
```task
id: ACTIVITY-WP-0009-T03
status: done
priority: medium
state_hub_task_id: "1eafe5e4-8412-4104-a417-933efe8e7bbd"
```
Resolve the mismatch between ADR language and current code for
`review_required`.
Options:
- implement an issue-core-owned pending review queue contract and route
`review_required=true` instruction outputs there, or
- update ADR/docs to state that `review_required` is currently audit/report
metadata only
Acceptance criteria:
- `docs/adr/adr-003-rule-instruction-model.md`, `SCOPE.md`, and tests describe
the same behavior
- no ActivityDefinition implies a review queue exists unless that downstream
contract is live
- report/spawn metadata remains available for operator review either way
2026-06-16: Completed by aligning ADR-003 with the implemented behavior:
`review_required` is audit/report metadata only until issue-core owns a pending
review queue contract. `SCOPE.md` already had the same boundary, and
`tests/test_issue_sink.py` now asserts the REST issue sink does not send a
`review_required` field as though a review queue existed.
## Decide And Gate Ops Evidence Backend
```task
id: ACTIVITY-WP-0009-T04
status: done
priority: medium
state_hub_task_id: "61300966-c119-4ebf-af89-a6c50df93ac8"
```
Decide whether the `ops-inventory` evidence path should remain State Hub
fallback-first for now or activate Inter-Hub / ops-hub submission.
Acceptance criteria:
- the decision is recorded in State Hub and the relevant docs/workplans
- if fallback-first remains the chosen mode, docs explicitly say State Hub
`ops_inventory_probe` progress is the accepted closure path
- if Inter-Hub is activated, `OPS_HUB_KEY` is provisioned outside Git, widget /
capability mapping is configured, and live submission is tested without
printing or storing secrets
2026-06-16: Completed the current posture decision. State Hub decision
`7c235bbb-ee6f-4c3e-b1dd-74717eac9082` records that State Hub
`ops_inventory_probe` progress is the accepted live evidence backend for now.
Inter-Hub / ops-hub per-entity submission remains future work gated on
operator-owned `OPS_HUB_KEY` custody, widget mapping, and production intake
smoke tests. `docs/runbook.md` documents the fallback-first posture.
## Remove Or Rehome TaskExecutor Stub Risk
```task
id: ACTIVITY-WP-0009-T05
status: done
priority: medium
state_hub_task_id: "fbe3e822-1a7c-4fe6-8251-cc8a782b9516"
```
Reduce the chance that `TaskExecutorWorkflow` attracts real execution work
inside activity-core.
Acceptance criteria:
- decide whether the stub should stay registered, be removed, or be moved to an
execution-owned repo/workplan
- if it stays, docs and comments explicitly mark it as non-production and
outside the activity-core ownership boundary
- no production ActivityDefinition or workflow path depends on `task_instances`
as task lifecycle state
2026-06-16: Completed by deciding to keep `TaskExecutorWorkflow` registered only
as a compatibility/idempotency stub. `src/activity_core/workflows.py` and
`docs/conventions.md` now mark it as non-production and outside activity-core's
execution boundary. No production ActivityDefinition uses `task_instances` for
task lifecycle state.
## Decide FastAPI Production Access Posture
```task
id: ACTIVITY-WP-0009-T06
status: done
priority: medium
state_hub_task_id: "99e1e301-296b-4f78-8843-2a39e59ecd7d"
```
Choose and document the production access posture for the FastAPI admin surface.
Acceptance criteria:
- operator decides whether the API remains ClusterIP-only or receives an
authenticated ingress
- if ingress is chosen, hostname, auth layer, allowed users/agents, and audit
expectations are documented before exposure
- runbook and Railiance deployment docs match the chosen posture
2026-06-16: Completed the current access posture decision. State Hub decision
`9ffaf7a9-227a-4e39-92e3-cd93d8cda1f2` records that the FastAPI admin surface
remains ClusterIP-only until a separate authenticated ingress/access-policy work
item chooses hostname, auth layer, allowed users/agents, and audit expectations.
`docs/runbook.md` and `k8s/railiance/README.md` now agree on this posture.
## Completion Criteria
- The historical findings are preserved under `history/`.
- `SCOPE.md`, ADRs, workplans, and implementation agree on activity-core's
boundary.
- Daily scheduled triage has real consecutive-run calibration evidence.
- At least one production-safe task creation path is proven against issue-core,
or null-sink mode is explicitly accepted as the current production posture.
- Ops evidence backend posture is explicit and tested in the chosen mode.
- No registered workflow or API path invites activity-core to own execution,
task lifecycle, project state, or privileged ops control.
## Implementation Pass - 2026-06-16
Agent-actionable closure is complete for T03, T04, T05, and T06.
Remaining waits:
- T01 waits on real scheduled daily triage run evidence.
- T02 waits on issue-core production endpoint/credentials and duplicate-handling
approval.
Verification:
```bash
.venv/bin/pytest tests/test_issue_sink.py tests/rules/test_executor.py -k "review_required or issue_core_rest_sink"
```
Result: 3 passed, 24 deselected.
After this workplan is synced by the custodian operator, run from `~/state-hub`:
```bash
make fix-consistency REPO=activity-core
```


@@ -0,0 +1,225 @@
---
id: ACTIVITY-WP-0010
type: workplan
title: "Daily Triage LLM Reconciliation And Evidence"
domain: custodian
repo: activity-core
status: blocked
owner: codex
topic_slug: custodian
created: "2026-06-18"
updated: "2026-06-27"
state_hub_workstream_id: "f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9"
---
# ACTIVITY-WP-0010 - Daily Triage LLM Reconciliation And Evidence
## Context
This workplan implements the in-scope portion of the latest activity-core
suggestion review against `INTENT.md` and `SCOPE.md`.
Relevant accepted suggestion:
- State Hub message `6a098e1e-65de-4309-ab4a-446aba2f3587` from
`llm-connect` says `LLM-WP-0006` is complete on the llm-connect side. The
stable Service URL is
`http://llm-connect.activity-core.svc.cluster.local:8080`, timeout remains
`300`, the provider Secret reports populated key count, and the in-namespace
fixture smoke passed with schema-valid endpoint behavior.
Why this belongs in activity-core:
- `INTENT.md` says activity-core owns the **when/what/where** loop for
scheduled coordination work.
- `SCOPE.md` keeps LLM instruction execution in scope through the llm-connect
boundary, while keeping provider credentials and cluster reconciliation out of
scope.
- `ACTIVITY-WP-0006-T03` and `ACTIVITY-WP-0009-T01` remain open because daily
State Hub WSJF triage has not yet produced three clean scheduled runs after
the June 7 runtime projection failure.
Suggestions reviewed but not accepted as product/runtime implementation work:
- `coding_retro` activity-core suggestions for Bash tool thrash, schema thrash,
and read-before-edit hygiene are agent workflow advice. They are useful for
Codex operating style, but they do not change activity-core's Event Bridge
product surface and should not become runtime code.
- The earlier local-kubectl / cluster-owned evidence suggestion for
`ACTIVITY-WP-0007` has already been handled by moving live evidence ownership
to Railiance and closing the workplan from cluster-owned proof.
Latest evidence before this workplan:
- State Hub `daily_triage` progress on 2026-06-18 still shows
`LLM_CONNECT_URL is not configured`, which means the live activity-core
runtime has not yet consumed the repo-side URL update.
- `k8s/railiance/20-runtime.yaml` now sets the verified llm-connect Service URL
and `LLM_CONNECT_TIMEOUT_SECONDS=300`.
## Confirm Repo-Side Runtime Contract
```task
id: ACTIVITY-WP-0010-T01
status: done
priority: high
state_hub_task_id: "dd52ce21-23b8-4e46-b3af-cb7bf486e40f"
```
Update activity-core's Railiance runtime projection so the daily triage worker
consumes the verified llm-connect Service URL by default.
Done when:
- `k8s/railiance/20-runtime.yaml` sets
`LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080`.
- `LLM_CONNECT_TIMEOUT_SECONDS=300` remains configured.
- Wiring tests assert the URL and timeout.
- The Railiance README states that provider credentials remain operator-owned
and outside Git / State Hub.
2026-06-18: Completed. Updated the runtime ConfigMap, README, and
`tests/test_railiance_ops_inventory_wiring.py`. Focused tests passed:
`tests/test_railiance_ops_inventory_wiring.py tests/test_llm_client.py`
reported 9 passed.
## Reconcile Live Railiance Runtime
```task
id: ACTIVITY-WP-0010-T02
status: done
priority: high
state_hub_task_id: "23545ddc-926b-485a-8535-5cc11e01134a"
```
Apply or reconcile the updated activity-core Railiance runtime through the
cluster-owned deployment path, not through ad hoc local kubectl from this repo.
Done when non-secret evidence shows:
- live `actcore-runtime-config` has the verified `LLM_CONNECT_URL` and timeout;
- the activity-core worker has restarted or otherwise consumed the new config;
- `activity-core/llm-connect-provider-secrets` remains present with a populated
key count only, without printing or storing secret values;
- the State Hub bridge remains reachable from the activity-core runtime.
Current wait reason: this is Railiance/operator-owned live cluster work. State
Hub handoff message `9a074b7c-4b87-4e3c-a6bf-e1fe5580daa8` asks
`railiance-cluster` to reconcile the updated config and smoke it.
2026-06-19 recheck:
- Deployed `llm-connect` into the `activity-core` namespace on `railiance01`
(the cluster that runs `actcore-worker`). Previously only `coulombcore` ran
llm-connect, and the in-cluster Service URL resolves only inside its own cluster.
- `actcore-runtime-config` already exposed the verified URL and timeout;
`deployment/actcore-worker` was restarted and now reports
`LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080`.
- `llm-connect-provider-secrets` reports `DATA 1`; no Secret values were
inspected.
- Worker health probe to llm-connect `/health` returns `{"status": "ok"}`.
- `actcore-state-hub-bridge` remains `0/1` Ready with upstream timeouts, so T02
is not fully closed until the node-local State Hub tunnel is restored.
2026-06-27 recheck:
- Superseded by real scheduled runner evidence: State Hub daily_triage events on
2026-06-24, 2026-06-25, 2026-06-26, and 2026-06-27 all reached State Hub and
wrote working-memory notes. The bridge/sink is therefore reachable for the
live runner.
- 2026-06-24 and 2026-06-25 were schema-valid; 2026-06-26 and 2026-06-27 failed
output validation after calling llm-connect. That moves the active blocker out
of T02 and into the WP-0016 live bundle/smoke lane. Marking T02 done.
## Run Daily Triage Fixture Smoke
```task
id: ACTIVITY-WP-0010-T03
status: wait
priority: high
state_hub_task_id: "10e0df77-c230-4a82-b720-23c66bd17c0a"
```
After T02, run a manual or smoke execution of
`daily-statehub-wsjf-triage` against the live activity-core runtime.
Done when:
- the run calls llm-connect through the configured Service URL;
- llm-connect returns content accepted as schema-valid daily-triage JSON;
- State Hub receives a `daily_triage` progress item with `output_validated=true`;
- the working-memory daily-triage note exists at the path recorded in State Hub
detail;
- `scripts/verify_daily_triage.py` reports the smoke/manual run as present.
2026-06-19 recheck:
- In-namespace llm-connect fixture smoke on `railiance01` passed:
`smoke: pass health=ok latency_seconds=1.681 recommendations=1`.
- Manual `POST /activity-definitions/6fca51fa-387a-4fd0-bc4e-d62c29eb859a/trigger`
reached llm-connect, but the workflow failed at `persist_instruction_reports`
with `state-hub-progress` sink `Connection refused` while
`actcore-state-hub-bridge` is unhealthy.
- T03 therefore remains open until State Hub bridge reachability is restored and
a run emits non-secret `daily_triage` progress with `output_validated=true`.
2026-06-27 recheck:
- Scheduled runs on 2026-06-24 and 2026-06-25 satisfy the non-secret smoke
evidence for llm-connect call, State Hub progress with output_validated=true,
and working-memory note creation.
- Kept T03 at progress rather than done because the workstation did not run the
live verifier against Temporal/activity-core DB, and the smoke must be repeated
after the WP-0016 code/schema/runtime-prompt deployment due to the 2026-06-26 and
2026-06-27 malformed-output failures.
## Collect Three Clean Scheduled Runs
```task
id: ACTIVITY-WP-0010-T04
status: wait
priority: high
state_hub_task_id: "dc6b9482-cf43-4fc5-994b-dcd7dea47db7"
```
Let the normal 07:20 Europe/Berlin schedule produce three consecutive clean
daily triage runs after the live config reconciliation.
Done when:
- three consecutive scheduled runs have Temporal workflow evidence,
`activity_runs` rows, State Hub `daily_triage` progress, and working-memory
notes;
- none of the three runs are merely manual smoke tests or `execution_failed`
diagnostics;
- calibration feedback is recorded in State Hub;
- `ACTIVITY-WP-0006-T03` and `ACTIVITY-WP-0009-T01` can move from `wait` to
`done`.
2026-06-27 recheck:
- Three-clean-run streak is reset. The latest sequence is 2026-06-24 clean,
2026-06-25 clean, 2026-06-26 validation_failed, 2026-06-27 validation_failed.
- Current pickup is to deploy ACTIVITY-WP-0016 code/schema together with the
Railiance runtime prompt and max_tokens changes, run a live smoke, then restart
the three-consecutive-scheduled-run gate from zero.
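
The gate in this task can be read as a trailing-streak count over scheduled runs. The following is a minimal sketch, assuming each scheduled run has already been reduced to one status string; the `clean_streak` helper and the `"clean"` label are illustrative and are not part of activity-core.

```python
from typing import Sequence

REQUIRED_CLEAN_RUNS = 3  # the gate named in this task


def clean_streak(statuses: Sequence[str]) -> int:
    """Count consecutive clean runs at the end of the sequence (oldest first)."""
    streak = 0
    for status in reversed(statuses):
        if status != "clean":
            break
        streak += 1
    return streak


# Sequence recorded in the 2026-06-27 recheck, oldest first.
runs = ["clean", "clean", "validation_failed", "validation_failed"]
gate_passed = clean_streak(runs) >= REQUIRED_CLEAN_RUNS
```

Counting from the newest run backwards is what makes a single failed run reset the gate, regardless of how many clean runs preceded it.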
## Close Handoff State
```task
id: ACTIVITY-WP-0010-T05
status: wait
priority: medium
state_hub_task_id: "ecc57e21-1716-4daa-aba6-d8a6d824e4ed"
```
Update the surrounding workplans and State Hub once the live daily triage gate
passes.
Done when:
- `ACTIVITY-WP-0006` records the three-run calibration evidence;
- `ACTIVITY-WP-0009` records the scheduled-run trust gap closure;
- any temporary `needs_human` flags created for the llm-connect provider/config
handoff are cleared or replaced by a narrower follow-up;
- this workplan is marked `finished`.


@@ -0,0 +1,179 @@
---
id: ACTIVITY-WP-0011
type: workplan
title: "Event Payload Context Resolver"
domain: custodian
repo: activity-core
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-18"
updated: "2026-06-18"
state_hub_workstream_id: "4efe4bcf-2148-4489-b57c-87f6039d4ed5"
---
# ACTIVITY-WP-0011 - Event Payload Context Resolver
## Context
State Hub message `d561ebd7-ba01-4dc6-8ffc-fe87d45304ee` from
`kaizen-agentic` handed off an urgent blocker for LOOP-WP-0002:
event-triggered definitions can receive the triggering EventEnvelope JSON, but
activity-core did not bind `source.type: event-payload` into the context
snapshot. The immediate customer is the disabled
`coulomb-low-success-rate-review` ActivityDefinition, whose
`flag-low-success-rate` rule needs to evaluate
`context.metrics.summary.success_rate`.
This is in activity-core scope because the repo owns ActivityDefinition context
resolution and the Event Bridge workflow boundary. The remaining event type
registry and live NATS smoke evidence are cross-repo/operator gates and should
wait in State Hub rather than depending on local kubectl or ad hoc live cluster
access from this repo.
## Implement Event Payload Resolver
```task
id: ACTIVITY-WP-0011-T01
status: done
priority: high
state_hub_task_id: "5c87ce0b-3bd0-4a44-aae5-10d7586c939e"
```
Register resolver type `event-payload` so event-triggered definitions can bind
the triggering EventEnvelope attributes into `context.*`.
Done when:
- `activity_core.context_resolvers` imports and registers an `event-payload`
resolver.
- `resolve_context` parses `event_envelope_json` once and passes the parsed
envelope to registered resolvers.
- `source.type: event-payload` extracts envelope `attributes`.
- `bind_to: context.metrics` strips the `context.` prefix and unwraps a
single-key `{"metrics": ...}` attributes payload into `snapshot["metrics"]`.
- Missing or malformed envelopes fail required sources visibly and bind `{}` for
optional sources.
2026-06-18: Completed in `src/activity_core/activities.py` and
`src/activity_core/context_resolvers/event_payload.py`.
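
The binding rules in the "Done when" list can be sketched as follows. This is an illustration under stated assumptions, not the code in `event_payload.py`: the function name, its signature, and the `RequiredSourceError` type are hypothetical, and the sketch parses the envelope itself for brevity, whereas the workplan says `resolve_context` parses it once and passes the parsed envelope on.

```python
import json
from typing import Any, Optional


class RequiredSourceError(RuntimeError):
    """Hypothetical error type for a required source that cannot resolve."""


def bind_event_payload(
    snapshot: dict[str, Any],
    event_envelope_json: Optional[str],
    bind_to: str,
    required: bool,
) -> None:
    """Bind envelope attributes into the snapshot under the bind_to key."""
    key = bind_to.removeprefix("context.")  # "context.metrics" -> "metrics"
    try:
        envelope = json.loads(event_envelope_json or "")
        attributes = envelope["attributes"]
        if not isinstance(attributes, dict):
            raise TypeError("attributes must be an object")
    except (ValueError, KeyError, TypeError) as exc:
        if required:
            raise RequiredSourceError(f"event-payload source failed: {exc}") from exc
        snapshot[key] = {}  # optional sources bind an empty mapping
        return
    # Unwrap a single-key {"metrics": ...} payload so it is not nested twice.
    if list(attributes) == [key]:
        attributes = attributes[key]
    snapshot[key] = attributes
```

Handling the missing and malformed cases in one `except` clause keeps the required/optional decision in a single place.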
## Cover Binding And Rule Evaluation
```task
id: ACTIVITY-WP-0011-T02
status: done
priority: high
state_hub_task_id: "c6f7dea6-9adc-4997-a22e-4bf2e94dc05a"
```
Add focused tests for the handoff acceptance contract.
Done when:
- sample `kaizen.metrics.recorded` envelope attributes resolve to:
`{"metrics": {"agent": "coach", "project": "kaizen-agentic", "summary": ...}}`;
- `flag-low-success-rate` evaluates
`context.metrics.summary.success_rate < 0.8`;
- optional missing envelopes bind `{}`;
- required missing envelopes raise a visible activity failure.
2026-06-18: Completed in `tests/test_resolve_context_binding.py`. Focused
tests passed:
`.venv/bin/python -m pytest tests/test_resolve_context_binding.py tests/test_rule_evaluation_activity.py`
reported 8 passed, and adjacent rule tests
`.venv/bin/python -m pytest tests/rules/test_evaluator.py tests/rules/test_actions.py`
reported 55 passed.
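
For orientation, the rule check that these tests cover amounts to a dotted-path lookup followed by a comparison. A minimal sketch, assuming the snapshot is a plain nested dict; `lookup` and `flag_low_success_rate` are illustrative helpers, not the evaluator in `activity_core`.

```python
from typing import Any


def lookup(snapshot: dict[str, Any], path: str) -> Any:
    """Resolve a dotted path such as 'context.metrics.summary.success_rate'."""
    node: Any = snapshot
    for part in path.removeprefix("context.").split("."):
        node = node[part]  # raises KeyError when a segment is missing
    return node


def flag_low_success_rate(snapshot: dict[str, Any], threshold: float = 0.8) -> bool:
    """Mirror the rule condition context.metrics.summary.success_rate < 0.8."""
    return lookup(snapshot, "context.metrics.summary.success_rate") < threshold


snapshot = {
    "metrics": {
        "agent": "coach",
        "project": "kaizen-agentic",
        "summary": {"success_rate": 0.75},
    }
}
matched = flag_low_success_rate(snapshot)
```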
## Wait For Event Type Registry
```task
id: ACTIVITY-WP-0011-T03
status: done
priority: high
state_hub_task_id: "a4f277de-eb83-41bc-860e-b26586c72495"
```
Confirm that `kaizen.metrics.recorded` is registered in the shared event type
catalog through the owning State Hub / producer workflow.
Done when:
- State Hub or the producer-owned event catalog exposes
`kaizen.metrics.recorded` with an attributes schema covering
`metrics.agent`, `metrics.project`, and `metrics.summary.success_rate`;
- the registry decision names the owning repo for future schema changes;
- activity-core has no local-only event type drift from the producer contract.
Registry ownership: the event type is producer/catalog owned. Activity-core
accepted State Hub-backed registry confirmation before closing the workplan.
2026-06-18: Closed from State Hub acknowledgement
`3efb56d8-c3d6-4308-82ea-76eaaa172255` from `kaizen-agentic`. The producer
registered `kaizen.metrics.recorded` in `kaizen-agentic/event-types/` with
status `active`, publisher `kaizen-agentic`, and schema fields
`agent`, `project`, `summary.success_rate`, `summary.execution_count`, and
`summary.avg_quality`. The sync command reported was
`ACTIVITY_DEFINITION_DIRS=~/coulomb-loop:~/kaizen-agentic make sync-event-types`.
## Wait For Live Event Smoke
```task
id: ACTIVITY-WP-0011-T04
status: done
priority: high
state_hub_task_id: "3b636d5e-8f93-49b4-ae53-3da4f736a4d9"
```
After T03, run the live event-triggered path without relying on local kubectl
from activity-core.
Done when State Hub records non-secret evidence that:
- a sample `kaizen.metrics.recorded` envelope was published on the expected NATS
subject;
- activity-core triggered `coulomb-low-success-rate-review`;
- the resolved context snapshot contained `context.metrics.summary.success_rate`;
- `flag-low-success-rate` matched and produced the expected task/report output;
- any disabled-definition or operator-controlled enablement state was recorded.
Execution ownership: this cross-repo/live-runtime smoke was owned by the event
producer, customer definition owner, and cluster/operator path. Activity-core
accepted the non-secret evidence from State Hub.
2026-06-18: Closed from State Hub acknowledgement
`68bfcd0d-7c47-4b42-85fc-64d63f38a909` from `kaizen-agentic`.
Supplier confirms R1 acceptance criteria met and LOOP-WP-0002 closed. Evidence:
NATS `activity.kaizen.metrics.recorded` triggered
`coulomb-low-success-rate-review` (`da7a9af7`), run
`e61554c6-1e67-5fa1-b34e-478d154a188e`, `tasks_spawned=1`, with
`metrics.summary.success_rate=0.75`.
## Close Handoff
```task
id: ACTIVITY-WP-0011-T05
status: done
priority: medium
state_hub_task_id: "5169d8c5-769f-4272-97cf-c25b31087601"
```
Close the urgent R1/live-smoke handoff once State Hub has acknowledgement that
the resolver-side blocker is removed. When this task was opened, the broader
workplan remained blocked only on T03 event-type registry confirmation.
Done when:
- State Hub message `d561ebd7-ba01-4dc6-8ffc-fe87d45304ee` is answered or
linked to this workplan;
- `kaizen-agentic` / LOOP-WP-0002 can proceed without an activity-core code
blocker;
- this workplan has no remaining activity-core code or live-smoke blocker.
2026-06-18: Closed from State Hub acknowledgement
`68bfcd0d-7c47-4b42-85fc-64d63f38a909`. The original handoff message
`d561ebd7-ba01-4dc6-8ffc-fe87d45304ee` was answered, and the live smoke
evidence in T04 unblocks LOOP-WP-0002.
2026-06-18: Workplan finished. T03 registry confirmation, T04 live event smoke,
and T05 handoff closure are all done in State Hub.


@@ -0,0 +1,192 @@
---
id: ACTIVITY-WP-0012
type: workplan
title: "Definition And Schedule Hot Reload"
domain: custodian
repo: activity-core
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-18"
updated: "2026-06-22"
state_hub_workstream_id: "8887075e-21ec-451b-b82b-cd81035c9ca5"
---
# ACTIVITY-WP-0012 - Definition And Schedule Hot Reload
## Context
State Hub message `f4876517-f738-4571-a2d6-76f2965e9a13` from
`coulomb-loop` reports an operational gap from the Coulomb cadence ramp: after
renaming customer definitions from hourly to daily, operators had to run
definition/schedule sync and restart the worker before new Temporal schedule
state was reliable.
Current behavior:
- `worker.py` runs `sync_activity_definitions` and `sync_schedules` once at
startup.
- `RunActivityWorkflow` loads ActivityDefinitions from the DB at activity time.
- The event router reloads enabled event definitions per NATS message.
- Cron schedule changes only take effect when `sync_schedules` runs.
This belongs in activity-core because the repo owns ActivityDefinition sync,
Temporal schedule projection, and the admin API. The first implementation
should expose an operator-triggered sync path without turning activity-core into
a repo checkout manager or CI system.
## Extract Reusable Sync Service
```task
id: ACTIVITY-WP-0012-T01
status: done
priority: high
state_hub_task_id: "53a7970b-7eec-47f5-ad30-bbd7c6271952"
```
Refactor the worker-startup sync sequence into a reusable async service that can
be called by startup and the API.
Done when:
- the service can run ActivityDefinition sync, event type sync, and Temporal
schedule sync independently based on booleans;
- it accepts the existing DB session factory / Temporal client dependencies
without creating hidden global state;
- startup behavior remains unchanged except for calling the shared service;
- failures are collected into a bounded `errors[]` result while preserving the
current startup best-effort behavior.
2026-06-19: Completed. Added `activity_core.sync_service.run_sync`, which
orchestrates ActivityDefinition, event type, and schedule sync independently
from explicit DB session factory and Temporal client dependencies. Worker
startup now calls the shared service for definitions+schedules and logs bounded
stage errors while continuing startup.
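
The shape of such a service can be sketched as below. This is a simplified illustration, not the signature of `activity_core.sync_service.run_sync`: the stage callables, the `SyncResult` type, and the `MAX_ERRORS` bound are assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable

MAX_ERRORS = 20  # assumed bound; the real limit is not stated in this workplan


@dataclass
class SyncResult:
    counts: dict[str, int] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)


Stage = Callable[[], Awaitable[int]]


async def run_sync_sketch(stages: dict[str, Stage], enabled: dict[str, bool]) -> SyncResult:
    """Run each enabled stage; record failures instead of aborting."""
    result = SyncResult()
    for name, stage in stages.items():
        if not enabled.get(name, False):
            continue
        try:
            result.counts[name] = await stage()
        except Exception as exc:  # best-effort: keep going after a failure
            if len(result.errors) < MAX_ERRORS:
                result.errors.append(f"{name}: {exc}")
    return result


async def _demo() -> SyncResult:
    async def definitions() -> int:
        return 6

    async def schedules() -> int:
        raise RuntimeError("temporal unavailable")

    return await run_sync_sketch(
        {"definitions": definitions, "schedules": schedules},
        {"definitions": True, "schedules": True, "event_types": False},
    )


demo_result = asyncio.run(_demo())
```

Passing the stages in as callables keeps dependencies explicit, which matches the "no hidden global state" criterion above.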
## Add Admin Sync Endpoint
```task
id: ACTIVITY-WP-0012-T02
status: done
priority: high
state_hub_task_id: "8697c761-15d1-4da0-b66b-d838218a2495"
```
Add an operator-only API endpoint:
`POST /admin/sync?definitions=true&schedules=true&event_types=true`
Done when:
- the endpoint runs the shared sync service without requiring worker restart;
- response JSON reports counts for definitions, event types, schedules upserted,
schedules paused/deleted, and errors;
- default parameters sync definitions and schedules, with event types opt-in or
clearly documented;
- endpoint tests cover definitions-only, schedules-only, all-sync, and failure
result behavior.
2026-06-19: Completed. Added `POST /admin/sync` with defaults
`definitions=true`, `schedules=true`, and `event_types=false`. The response
reports definition/event counts, schedule upsert/pause/orphan-delete counts, and
bounded `errors[]`. Tests cover definitions-only, schedules-only, all-sync, and
failure-result behavior.
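
A caller can make the documented defaults explicit when building the request. The helper below is a small sketch; the base URL is a placeholder, and only the path and the three query parameters come from this workplan.

```python
from urllib.parse import urlencode


def admin_sync_url(
    base_url: str,
    *,
    definitions: bool = True,
    schedules: bool = True,
    event_types: bool = False,
) -> str:
    """Build the POST /admin/sync URL with explicit boolean flags."""
    query = urlencode(
        {
            "definitions": str(definitions).lower(),
            "schedules": str(schedules).lower(),
            "event_types": str(event_types).lower(),
        }
    )
    return f"{base_url.rstrip('/')}/admin/sync?{query}"


# Hypothetical address; substitute the real actcore-api Service.
url = admin_sync_url("http://actcore-api.example:8000")
```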
## Preserve Schedule Drift Semantics
```task
id: ACTIVITY-WP-0012-T03
status: done
priority: high
state_hub_task_id: "efeac412-632c-4c90-9428-bb575ac7a624"
```
Make the sync result explicit enough for cadence changes and renames.
Done when:
- disabled cron definitions pause their Temporal schedules on sync;
- renamed definitions create the new schedule and pause/delete orphaned old
schedules according to the existing `sync_schedules` semantics;
- event-triggered definitions remain hot through the existing router DB reload
path;
- regression tests demonstrate the Coulomb hourly-to-daily rename shape without
needing a worker restart.
2026-06-19: Completed. `sync_schedules` now returns explicit counts for enabled
schedule upserts, disabled schedule pauses, and orphan deletes. Regression tests
cover the hourly-to-daily rename shape: a new enabled cron schedule is upserted,
the old disabled cron schedule is preserved as paused, unrelated orphan
schedules are deleted, event-triggered definitions do not create schedules, and
one-shot scheduled definitions are no longer mistaken for orphans.
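
The classification described in this task can be sketched as a pure planning step. The example assumes schedule ids equal definition ids and uses made-up trigger labels and definition names; the real `sync_schedules` works against Temporal and the DB and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    id: str
    trigger: str  # "cron", "event", or "one-shot" (labels assumed for this sketch)
    enabled: bool


def plan_schedule_sync(
    definitions: list[Definition], existing: set[str]
) -> dict[str, list[str]]:
    """Classify schedule ids into upsert / pause / delete_orphan buckets."""
    cron = [d for d in definitions if d.trigger == "cron"]
    # One-shot definitions own a schedule too, so they must not count as orphans.
    known = {d.id for d in definitions if d.trigger in ("cron", "one-shot")}
    return {
        "upsert": sorted(d.id for d in cron if d.enabled),
        "pause": sorted(d.id for d in cron if not d.enabled),
        "delete_orphan": sorted(existing - known),
    }


plan = plan_schedule_sync(
    [
        Definition("coulomb-daily", "cron", True),    # renamed, new cadence
        Definition("coulomb-hourly", "cron", False),  # old name, now disabled
        Definition("low-success-review", "event", True),
        Definition("one-off-report", "one-shot", True),
    ],
    existing={"coulomb-hourly", "one-off-report", "stale-unrelated"},
)
```

Event-triggered definitions land in no bucket, which reflects the statement that they do not create schedules.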
## Optional Background Sync Loop
```task
id: ACTIVITY-WP-0012-T04
status: done
priority: medium
state_hub_task_id: "d774087b-c51d-4444-8e90-bfef43765456"
```
Decide whether to add a periodic sync loop after the admin endpoint exists.
Done when:
- either `ACTIVITY_SYNC_INTERVAL_SECONDS` is implemented with a default disabled
or conservative interval, or the workplan records why manual/admin-triggered
sync is the safer v1 posture;
- if implemented, logs and metrics expose the last successful sync timestamp and
last error summary;
- the loop does not block worker startup or workflow task processing.
2026-06-19: Completed by decision. v1 stays manual/operator-triggered through
`POST /admin/sync`; no background loop was added. The runbook records this
posture so customer definition changes stay explicit and the worker does not
start background repo scanning. A periodic loop remains a future option if live
operator use proves it is needed.
## Live No-Restart Smoke
```task
id: ACTIVITY-WP-0012-T05
status: done
priority: high
state_hub_task_id: "68a0e22a-106a-4d21-9f39-c6279850cb5e"
```
Validate the hot-reload path in the cluster/operator environment.
Done when non-secret State Hub evidence shows:
- a customer repo definition rename or `enabled` flip is synced through
`/admin/sync`;
- new Temporal schedules are active and retired schedules are paused/deleted
without worker SIGTERM or pod restart;
- event-triggered definitions still fire normally;
- rollback or repeat sync is idempotent.
2026-06-22: Completed on Railiance01 (`KUBECONFIG=~/.kube/config-hosteurope`).
Smoke target: disabled projection `ops-service-inventory-probes`
(`40d15a87-7ff6-4d8e-992c-37df15f95110`) in
`actcore-external-activity-definitions`.
Evidence:
- ConfigMap flip `enabled: false -> true` and cadence `15 * * * * -> 25 * * * *`,
then `POST /admin/sync?definitions=true&schedules=true` from `actcore-api`.
- DB after sync: `enabled=true`, `cron=25 * * * *`.
- Temporal schedule after sync: `paused=false`, calendar minute `25`.
- Repeat sync returned identical schedule counts
(`upserted=5`, `paused=1`, `deleted_orphans=0`) — idempotent.
- Rollback flip restored `enabled=false`, `cron=15 * * * *`, schedule
`paused=true`, calendar minute `15`.
- `actcore-worker` pod UID unchanged (`a68d6539-2bba-457e-a78a-39564002a980`,
started `2026-06-21T18:46:46Z`); `actcore-event-router` pod UID unchanged.
- Event-triggered definitions: none projected on Railiance01 today; hot DB
reload path for event definitions remains covered by T03 unit tests and an
unchanged event-router deployment.
Automation: `scripts/smoke_admin_sync_no_restart.py`. Runbook section added
under "Railiance01 no-restart smoke".


@@ -0,0 +1,78 @@
---
id: ACTIVITY-WP-0013
type: workplan
title: "Reuse Surface Report Gaps Resolver"
domain: custodian
repo: activity-core
status: finished
owner: codex
topic_slug: activity-core
created: "2026-06-18"
updated: "2026-06-18"
state_hub_workstream_id: "01e68dfd-b146-4aef-a575-2d3b178ca5c2"
---
# Reuse Surface Report Gaps Resolver
Implement the R2 handoff from kaizen-agentic (`bffa224c`) so the
`reuse_surface_report_gaps` shell context source populates
`context.gaps` for the Coulomb daily registry hygiene sweep.
## Register Shell Resolver Query
```task
id: ACTIVITY-WP-0013-T01
status: done
priority: high
state_hub_task_id: "a6e1fc5c-7b42-436d-914e-4d605cb6f329"
```
Add a dedicated reuse-surface context resolver module and register
`reuse_surface_report_gaps` on the `shell` resolver path while preserving
the existing kaizen shell query behavior.
## Implement Batch And Signal Semantics
```task
id: ACTIVITY-WP-0013-T02
status: done
priority: high
state_hub_task_id: "229cf285-8388-471d-95fd-08400db1553e"
```
Load the Coulomb rollout roster, select active repos with a persisted
round-robin cursor, resolve repo roots from State Hub host paths, run
`reuse-surface report gaps --format json`, and emit gap records for the
enabled registry hygiene signals.
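
The round-robin selection can be illustrated with a small cursor helper. This is a sketch under assumptions: the batch size, the cursor being a plain integer, and the `select_batch` name are all hypothetical, and how the cursor is persisted is left out.

```python
def select_batch(
    active_repos: list[str], cursor: int, batch_size: int
) -> tuple[list[str], int]:
    """Return the next batch of repos and the cursor to persist afterwards."""
    if not active_repos:
        return [], 0
    count = min(batch_size, len(active_repos))
    start = cursor % len(active_repos)
    batch = [active_repos[(start + i) % len(active_repos)] for i in range(count)]
    return batch, (start + count) % len(active_repos)
```

Taking the cursor modulo the roster length keeps the selection valid when the rollout roster shrinks between runs.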
## Cover Required And Optional Failure Modes
```task
id: ACTIVITY-WP-0013-T03
status: done
priority: high
state_hub_task_id: "85b5c7d4-40e1-4945-8ada-1dff2363c194"
```
Ensure missing required dependencies fail visibly while optional resolver
sources bind an empty `context.gaps` list. Add unit coverage for fixture
rollout data, mocked CLI JSON, resolver binding, and `hygiene_signal`
rule gating.
## Smoke Real Coulomb Rollout
```task
id: ACTIVITY-WP-0013-T04
status: done
priority: medium
state_hub_task_id: "6a5446ed-b4ec-4693-b508-65415571d834"
```
Run a live resolver smoke against
`/home/worsch/coulomb-loop/loops/registry-hygiene/rollout.yaml` using a
temporary round-robin cursor. The real active rollout produced five gaps,
including one for `reuse-surface` with `hygiene_signal: stale_sbom`.
The smoke supplied `reuse_surface_bin:
/home/worsch/reuse-surface/.venv/bin/reuse-surface` and
`runner_host: bnt-lap001`; the worker environment or definition params must
provide equivalent values before enabling the production sweep.


@@ -0,0 +1,194 @@
---
id: ACTIVITY-WP-0014
type: workplan
title: "Schedule Misfire Robustness & Run-Miss Recovery Options"
domain: infotech
repo: activity-core
status: finished
owner: claude
topic_slug: activity-core
created: "2026-06-23"
updated: "2026-06-24"
status_note: "T01-T05 complete; beachhead-endpoint adoption split to ACTIVITY-WP-0015"
state_hub_workstream_id: "91b64686-5d17-4c86-bc9e-3d0ee6720cf5"
---
# Schedule Misfire Robustness & Run-Miss Recovery Options
Make cron-triggered ActivityDefinitions robust to missed fires (worker/Temporal
unavailable at trigger time) with explicit, per-definition recovery behaviour,
plus detection/alerting when a scheduled fire is missed.
## Motivation
On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
(cron `20 7 * * *` Europe/Berlin, projected into the Railiance runtime ConfigMap
`actcore-external-activity-definitions`) produced **no `daily_triage` progress
event at all** — neither a success nor a `could not run; operator review
required` failure.
> **Corrected by T01 (2026-06-23).** The initial hypothesis below — that
> `_build_schedule()` never set `catchup_window`, so a short-default catchup
> window silently dropped the fire — was **disproven on the live cluster**. The
> Temporal schedule is healthy with `CatchupWindow 365d` (the server default) and
> `0 MissedCatchupWindow`. The real cause is that the run **fired and ran but
> failed at the report sink** with `Connection refused` posting to State Hub,
> because railiance01 reaches State Hub via a reverse tunnel back to the
> workstation, which is asleep at 07:20 Berlin. See the T01 findings and T05.
The trigger now originates entirely on **railiance01** (in-cluster Temporal
Schedule, ConfigMap-projected definition) and is **not** laptop-dependent — but
the triage's State Hub *data dependencies* (context resolution and report
delivery) still route back to the workstation State Hub.
This workplan still delivers worthwhile robustness — explicit run-miss recovery
policies (T02) and missed-fire detection (T03) — but the fix for *this* incident
is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).
## Desired run-miss options (from Bernd)
Three explicit, per-definition behaviours when a fire is missed:
1. **Run on trigger or skip**: never recover a missed fire.
2. **Run on trigger or later if missed**: recover **all** missed fires when back up.
3. **Run on trigger or later if missed, but skip if next trigger reached**:
recover only the **most recent** missed fire; do not accumulate a backlog.
Proposed mapping to a new `misfire_policy` value set (names open to review):
| Policy | Semantics | Temporal mapping |
| --- | --- | --- |
| `skip` | Run on trigger or skip | `catchup_window ≈ 0`, `overlap=SKIP` |
| `catchup_all` | Run on trigger or all missed later | `catchup_window=<long>`, `overlap=BUFFER_ALL` |
| `catchup_latest` | Run on trigger or only the latest missed | `catchup_window ≈ 1 interval`, `overlap=BUFFER_ONE` |
## Confirm root cause on Railiance01
```task
id: ACTIVITY-WP-0014-T01
status: done
priority: high
state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab"
```
Inspected via `ssh railiance01` + in-node `kubectl`/`temporal` (no k3s tunnel is
defined for railiance01; the documented access path is SSH to the host).
**Findings (2026-06-23) — the WP-0014 premise was wrong for this incident:**
- All pods healthy; `actcore-worker` up 44h, 0 restarts. Not a crash.
- The daily-triage Temporal schedule (`activity-schedule-6fca51fa-…`) is
**healthy**: `Paused false`, `OverlapPolicy Skip`, **`CatchupWindow 365d`**
(Temporal's *default* when unset), `ActionCounts {Total:8, MissedCatchupWindow:0}`.
So fires were **not** silently dropped — my original "no catchup window → silent
drop" hypothesis does not hold; the server default is already 365d.
- The `2026-06-23T05:20:00Z` fire **did fire and ran**, then **Failed at the report
sink**: `report sink failure: state-hub-progress … '[Errno 111] Connection
refused'`. The run produced a report but could not deliver it to State Hub, so
no `daily_triage` progress event (not even a "could not run" one) was posted →
the silence. The 06-22 fire has no execution in retention (bridge likely down
then too / schedule update window at `LastUpdateAt 1d ago`).
- Root cause is **State Hub connectivity from railiance01**, not Temporal. The
in-cluster `actcore-state-hub-bridge` (`hostNetwork`) proxies to
`127.0.0.1:18000` on the node — the local end of the ops-bridge **reverse tunnel
back to the workstation's State Hub**. At 07:20 Europe/Berlin (= 05:20 UTC) the
workstation/tunnel was unreachable → `Connection refused`. Chronic flakiness
confirmed: 102 State Hub resolver timeouts in 24h (69 `recently_on_scope`,
33 `consistency_sweep`).
**Implication:** the trigger *is* independent of the laptop, but the triage's
**data dependencies (State Hub context resolution + report delivery) still route
back to the workstation State Hub**, which is asleep at 07:20 Berlin. WP-0014's
misfire policies are still good robustness, but the real fix is (a) State Hub
reachable from railiance01 independent of the workstation, and/or (b) sinks/
resolvers resilient to transient State Hub unavailability (retry/backoff,
store-and-forward) instead of hard-failing the workflow. Tracked as follow-up
below. Backfill deferred: a replay only succeeds while the workstation State Hub
is reachable.
## Implement explicit misfire recovery modes
```task
id: ACTIVITY-WP-0014-T02
status: done
priority: high
state_hub_task_id: "19615562-4cb2-4f25-872f-505d6e40dcc5"
```
Add `catchup_window_seconds` to `CronTriggerConfig` and redefine `misfire_policy`
into the three explicit modes above. In `_build_schedule()` set
`SchedulePolicy(overlap=..., catchup_window=timedelta(...))` per mode. Remove the
ad-hoc 1-hour `backfill` hack in favour of native catchup-window semantics. Keep
backward compatibility for existing `skip`/`catchup`/`compress` values (alias
map). Unit tests for each mode's `(catchup_window, overlap)` mapping.
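
The per-mode mapping can be sketched without the Temporal SDK. In the example below the overlap values are plain strings that mirror the policy table in this workplan, not SDK enums; the legacy alias pairing and the 365-day stand-in for `<long>` are assumptions, since this workplan does not spell them out.

```python
from datetime import timedelta

# Legacy names kept for backward compatibility. The pairing below is an
# assumption for illustration; the workplan does not list the alias map.
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}

LONG_CATCHUP = timedelta(days=365)  # assumed stand-in for "<long>"


def schedule_policy(misfire_policy: str, interval: timedelta) -> tuple[timedelta, str]:
    """Map a misfire policy to (catchup_window, overlap) per the policy table."""
    policy = LEGACY_ALIASES.get(misfire_policy, misfire_policy)
    if policy == "skip":
        return timedelta(0), "SKIP"
    if policy == "catchup_all":
        return LONG_CATCHUP, "BUFFER_ALL"
    if policy == "catchup_latest":
        return interval, "BUFFER_ONE"
    raise ValueError(f"unknown misfire_policy: {misfire_policy!r}")
```

The table writes `catchup_window ≈ 0` for `skip`, so real code may need to clamp the zero value to whatever minimum the server accepts.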
## Missed-fire detection & alert sink
```task
id: ACTIVITY-WP-0014-T03
status: done
priority: medium
state_hub_task_id: "dbedd96a-59ca-4b83-bce6-35755b076807"
```
Detect when a scheduled definition has no successful run within its expected
interval + tolerance, and emit a signal (State Hub progress event and/or
agent-inbox message) so a miss is visible even under `skip`. This is the
observability the current silent-drop behaviour lacks — a miss should never again
be invisible.
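
The detection rule reduces to one comparison. A minimal sketch, assuming the last successful run time is already known; the `is_missed` name and the timestamps in the usage lines are illustrative only.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_missed(
    last_success: Optional[datetime],
    now: datetime,
    interval: timedelta,
    tolerance: timedelta,
) -> bool:
    """True when no successful run landed within interval + tolerance."""
    if last_success is None:
        return True  # never succeeded: treat as a miss so it is visible
    return now - last_success > interval + tolerance


# Illustrative values for a daily schedule with a 30-minute tolerance.
last_ok = datetime(2026, 6, 21, 5, 20, tzinfo=timezone.utc)
check_at = datetime(2026, 6, 23, 6, 0, tzinfo=timezone.utc)
missed = is_missed(last_ok, check_at, timedelta(days=1), timedelta(minutes=30))
```

Because the check keys on the last successful run and not on Temporal's own catchup accounting, it would also catch runs that fired but failed at the report sink.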
## Apply policy to runtime definitions & document
```task
id: ACTIVITY-WP-0014-T04
status: done
priority: medium
state_hub_task_id: "04e9d1d2-1192-4402-9402-b12c5d7d44e5"
```
Set `misfire_policy: catchup_latest` for `daily-statehub-wsjf-triage` and documented the
run-miss options in `docs/runbook.md`.
**Deployed to and verified on railiance01 (2026-06-24):** built
`activity-core:railiance01-prod` with the WP-0014 code (T02/T03/T05), imported it into k3s
containerd, applied the ConfigMap, rolled `actcore-worker`/`api`/`event-router`
onto the new image, and ran `/admin/sync` (6 defs, 4 schedules upserted, 0
errors). The live Temporal schedule now reports `OverlapPolicy BufferOne` +
`CatchupWindow 1d` (= `catchup_latest`); pods healthy, API `db:true temporal:true`.
## Keep activity-core thin under the State Hub beachhead model
```task
id: ACTIVITY-WP-0014-T05
status: done
priority: high
state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
```
**Architecture decision (Bernd, 2026-06-23):** the resilience that this incident
needs — queuing writes and caching reads while State Hub is unreachable — must
**not** be a burden carried by client repos. It belongs to State Hub as a
**per-machine local "beachhead"** (transparent read cache + write outbox, possibly
with State-Hub federation), owned by custodian/state-hub. It handles all three
failure modes: network interruption, central State Hub crash, central machine
down. This is handed off to state-hub (see the coordination message / proposal);
**do not build client-side queue/cache logic in activity-core.**
activity-core's only responsibilities under this model are thin:
- **Idempotent writes — DONE (2026-06-23, in-repo):** added
`activity_core/state_hub_write` (`idempotency_headers`); every State Hub write
(report-sink, ops-evidence, schedule-miss) now sends a stable `Idempotency-Key`
header derived from `run_id:instruction_id:event_type`. The read-based
`_progress_exists` dedup is now best-effort (returns `False` on connection
error instead of hard-failing), so the guarantee lives on the keyed write, not
a live read. Tests in `tests/test_state_hub_write.py`; documented in
`docs/runbook.md`.
- **Adopt the beachhead endpoint — MOVED to [[ACTIVITY-WP-0015]]:** pointing
`STATE_HUB_URL` at the local beachhead and retiring the bespoke
`actcore-state-hub-bridge` proxy depend on the state-hub beachhead existing
first. Split into WP-0015 (status `blocked`) so this workplan can close on its
completed in-repo work rather than waiting on an external capability.
T05 is done as far as activity-core can act now; the external-dependent adoption
lives in WP-0015.
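A minimal sketch of the keyed-write idea, for orientation only. The real helper lives in `activity_core/state_hub_write` and its signature is not shown here; the name `sketch_idempotency_headers` and the SHA-256 hashing step are assumptions. The source states only that the key is derived from `run_id:instruction_id:event_type`.

```python
import hashlib


def sketch_idempotency_headers(
    run_id: str, instruction_id: str, event_type: str
) -> dict:
    """Build a stable Idempotency-Key header for one logical write."""
    raw = f"{run_id}:{instruction_id}:{event_type}"
    # Assumption: hash the tuple so the header has a fixed length and
    # does not expose internal ids; the repo may use the raw string instead.
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return {"Idempotency-Key": digest}


first = sketch_idempotency_headers("run-1", "daily-triage", "daily_triage")
replay = sketch_idempotency_headers("run-1", "daily-triage", "daily_triage")
other = sketch_idempotency_headers("run-1", "daily-triage", "schedule_miss")
```

A retried or replayed write produces the same header as the first attempt, so the receiver can deduplicate without activity-core having to read State Hub first.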

@@ -0,0 +1,54 @@
---
id: ACTIVITY-WP-0015
type: workplan
title: "Adopt State Hub Beachhead Endpoint"
domain: infotech
repo: activity-core
status: blocked
owner: claude
topic_slug: activity-core
created: "2026-06-24"
updated: "2026-06-24"
state_hub_workstream_id: "bbc07f9e-9323-4b2b-b556-c33b37d0b228"
---
# Adopt State Hub Beachhead Endpoint
Carries the **blocked remainder** of [[ACTIVITY-WP-0014]] T05. The in-repo half
(idempotency-keyed State Hub writes) shipped in WP-0014; this workplan is the
client-side adoption that depends on the state-hub-owned **beachhead** capability
(per-machine read cache + write outbox) existing first.
**Blocked on:** the state-hub beachhead (proposal sent to the `state-hub` agent,
2026-06-23). Do not build queue/cache logic in activity-core — see
[[statehub-beachhead-principle]].
## Point STATE_HUB_URL at the beachhead
```task
id: ACTIVITY-WP-0015-T01
status: wait
priority: medium
state_hub_task_id: "76b6132d-394a-4a67-bef6-73bb9d1e277e"
```
Once the state-hub beachhead exposes a local endpoint, point activity-core's
`STATE_HUB_URL` (and the railiance runtime config) at it and verify reads are
served from cache and writes are queued/flushed correctly when central State Hub
is unreachable. Confirm idempotency-keyed writes dedup on flush (no duplicate
`daily_triage`/progress events).
## Retire the bespoke actcore-state-hub-bridge proxy
```task
id: ACTIVITY-WP-0015-T02
status: wait
priority: medium
state_hub_task_id: "526c2129-cbf7-4531-a319-aebfc75cc6a3"
```
Remove the inline `hostNetwork` HTTP proxy `actcore-state-hub-bridge` from
`k8s/railiance/20-runtime.yaml` — it is a primitive precursor of the beachhead
and should be replaced by the state-hub-owned component, not extended. Re-verify
the daily triage end-to-end after cutover, including an overnight scheduled run
while the workstation is asleep (the original failure condition).

@@ -0,0 +1,379 @@
---
id: ACTIVITY-WP-0016
type: workplan
title: "LLM Output Robustness & The Producer Trust Boundary"
domain: custodian
repo: activity-core
status: active
owner: codex
topic_slug: custodian
created: "2026-06-26"
updated: "2026-06-27"
state_hub_workstream_id: "4ef0d53b-1777-41ae-80c6-1b69fdb34726"
---
# ACTIVITY-WP-0016 — LLM Output Robustness & The Producer Trust Boundary
## Context
On 2026-06-26 the scheduled `daily-statehub-wsjf-triage` instruction fired on
time (`daily_triage` event 05:20:57Z) but its output **failed schema
validation**: `Expecting ',' delimiter: line 136 column 22 (char 5268)`. The
model emitted a long ranked WSJF recommendation list (reached rank 7+ with
nested `wsjf` objects) and the JSON broke deep in that list. Because the report
is a single monolithic JSON document, one malformed delimiter discarded the
**entire** run. This reset the three-clean-consecutive-scheduled-runs streak in
`ACTIVITY-WP-0006-T03` (06-24 ✅, 06-25 ✅, 06-26 ✗-validation) and is the
LLM-output-quality surface deferred from `ACTIVITY-WP-0010`.
The scheduling/runtime layer is healthy — this is purely an output-robustness
and boundary-design problem. Today's code (`src/activity_core/rules/executor.py`)
already: passes the output schema to llm-connect as a `json_schema` model param
(`_llm_run_config`), retries once, runs a fenced/`raw_decode` tolerant parser
(`_parse_json_output`), and preserves a bounded 4000-char preview on hard
failure (`_invalid_output_report`). None of that helps when error locality is
zero: the failure unit is the whole document, not the offending item.
## Design Frame — The Producer Trust Boundary
This workplan is anchored to a deliberate architectural stance, not just a bug
fix. Capture it in an ADR (T04) so future work inherits it.
**Premise.** activity-core has a *trust boundary* where free-form producer
output meets strict deterministic consumers (JSON Schema validators, the task
emitter, classic compute pipelines). The producers are **LLMs and humans (and
agents acting for either)**. Both are *untrusted producers*: their output may be
- **erroneous** — hallucination, truncation (token-limit cutoff), drift,
type slips, typos; or
- **malicious** — prompt injection, crafted payloads, oversized/deeply-nested
structures aimed at exhausting or confusing the consumer.
The architecture should treat the boundary as an adversarial frontier and place
**guardrails + error-correction tooling there**, rather than letting raw
producer output flow into deterministic consumers and fail (or worse, partially
succeed) downstream.
**Two non-fail-fast postures.** When we do *not* want to hard-fail on a problem,
there are two sensible strategies — and they compose:
- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output
as-is; on exception, catch → repair → retry → or quarantine. Cheap on the
happy path. Blast radius depends entirely on how granular the catch is. Good
when failures are rare and locally recoverable. Risk: failures surface late,
possibly after partial side effects.
- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp,
and normalize the output to a known-good shape *before* it enters the pipeline
— drop bad items, coerce types, bound sizes/depth, allow-list references — so
the consumer only ever sees clean input. Higher upfront cost, smaller blast
radius, no partial side effects. Good when failures are common or
consequences are high.
**Governing principles for this repo:**
1. **Push verification to the boundary; keep the interior strict.** Apply
posture **B** at the producer→consumer boundary (verify+mitigate structure);
keep posture **A** for residual exceptions inside the verified core. Never
relax the interior schema to absorb producer sloppiness.
2. **Make error locality match the unit of work.** One bad recommendation must
cost one recommendation, not the whole report. Framing the payload so each
item is independently parseable is the single highest-leverage change.
3. **Quarantine, never silently drop.** Invalid units are preserved as bounded,
provenance-tagged artifacts (index, error, raw snippet) so they can be
debugged or replayed — degraded-but-usable is distinct from total loss.
4. **Both human and agent input get the same rigor.** Guardrails are
producer-agnostic: the same size/depth/count caps, reference allow-lists, and
truncation detection apply whether the producer is an LLM, an agent, or a
human form submission.
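Principles 2 and 3 can be illustrated with a small sketch, assuming a one-object-per-line (NDJSON) payload. The name `parse_framed_items` and the 200-character snippet bound are hypothetical and chosen for the example; this is not the executor's parser.

```python
import json


def parse_framed_items(text: str, snippet_len: int = 200):
    """Parse each line on its own; quarantine bad lines instead of raising."""
    valid, quarantined = [], []
    for index, line in enumerate(text.splitlines()):
        if not line.strip():
            continue
        try:
            valid.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Provenance-tagged, bounded artifact: index, error, raw snippet.
            quarantined.append(
                {"index": index, "error": str(exc), "raw": line[:snippet_len]}
            )
    return valid, quarantined


payload = (
    '{"rank": 1, "candidate": "a"}\n'
    '{"rank": 2, "candidate": "b"\n'  # missing closing brace
    '{"rank": 3, "candidate": "c"}\n'
)
valid, quarantined = parse_framed_items(payload)
```

Here the malformed second line costs one item: ranks 1 and 3 survive and the broken line is kept for debugging rather than dropped.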
## Reproduce & Root-Cause The Failure
```task
id: ACTIVITY-WP-0016-T01
status: wait
priority: high
state_hub_task_id: "74fd16a5-4ea5-4dfe-8526-dfa27cf76138"
```
Recover the **full** raw llm-connect response for the 06-26 failure (the State
Hub event keeps only a 4000-char preview; the break is at char 5268) and
establish the precise cause.
Done when:
- the full raw response is pulled from the runtime llm-connect log / response
store and the exact offending token at char 5268 is identified;
- `finish_reason` is captured to confirm or rule out token-limit **truncation**
vs a structural mid-stream glitch;
- it is confirmed whether llm-connect actually **enforced** the `json_schema`
constrained-decoding hint or merely accepted it as advisory (this determines
whether the schema param is load-bearing);
- the failing payload is captured as a regression fixture under `tests/`.
2026-06-26 findings (local analysis on the workstation):
- **Mechanism confirmed structurally.** There are **16 active workstreams**
org-wide and the triage instruction emits ~one ranked recommendation per
candidate. The preserved preview holds 7 fully-formed recommendations; the JSON
break is at char 5268 (~rank 8–9). The unbounded one-per-workstream list is the
structural cause — more items = more tokens = higher odds of a mid-stream JSON
slip and/or truncation. This directly justifies T02's bounded top-N + per-item
framing.
- **Both attempts failed.** `executor._execute` retries once
(`src/activity_core/rules/executor.py:166-171`); the recorded error is from the
**retry** output, so the model produced invalid JSON twice — not a one-off.
- **activity-core discards the diagnostics needed to root-cause this.** Three
retention gaps mean the exact char-5268 token cannot be recovered from
activity-core data at all:
1. `LLMConnectClient.complete()` returns only `data["content"]`
(`llm_client.py:57-60`) — it drops `finish_reason`/`usage` from the
llm-connect HTTP response, so truncation-vs-structural cannot be
distinguished locally.
2. the report sink caps raw output at **4000 chars** (`_invalid_output_report`,
`executor.py:259`) — below the 5268 break.
3. the worker log caps the preview at **2000 chars** (`executor.py:175`).
- **Remaining (remote, operator-owned).** Confirming the exact offending token
and `finish_reason` requires llm-connect's producer-side logs on `railiance01`
— cluster access, outside this repo's SCOPE for direct action. Truncation is
the leading hypothesis given the 16-item input, but the mitigation (T02/T03) is
identical either way, so T01 does not block the build work.
- **Feeds T03/T04.** The retention gaps are themselves defects to fix: capture
`finish_reason`/`usage` and persist a larger bounded raw artifact on validation
failure so this class of failure is never un-debuggable again.
- Partial fixture saved:
`tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json`
(the 4000-char preview + validation error; full payload pending the remote pull).
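The first retention gap could be closed along these lines. This is a hedged sketch: the `Completion` type and `completion_from_response` are hypothetical names, and treating `finish_reason == "length"` as the truncation marker is an assumption borrowed from common chat-completion APIs, not a confirmed llm-connect contract.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Completion:
    content: str
    finish_reason: Optional[str] = None
    usage: dict = field(default_factory=dict)

    @property
    def truncated(self) -> bool:
        # Assumption: "length" signals a token-limit cutoff.
        return self.finish_reason == "length"


def completion_from_response(data: dict) -> Completion:
    """Keep the diagnostics instead of returning only data["content"]."""
    return Completion(
        content=data["content"],
        finish_reason=data.get("finish_reason"),
        usage=data.get("usage") or {},
    )


cut_off = completion_from_response(
    {"content": "{", "finish_reason": "length", "usage": {"completion_tokens": 1200}}
)
bare = completion_from_response({"content": "{}"})
```

With the extra fields retained, a validation failure can record whether the output was cut off, which is the distinction T01 could not make locally.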
## Schema + Prompt Redesign For Error Locality
```task
id: ACTIVITY-WP-0016-T02
status: progress
priority: high
state_hub_task_id: "ae67ca8c-ee01-4a8d-9e8a-a0a36c999758"
```
Redesign the daily-triage report contract so a single malformed item can no
longer discard the whole report (principle #2).
Done when:
- the recommendation list is **bounded** (configurable top-N, default 5–7) in
both the prompt and the output schema — long lists are where the model drifts;
- the report uses a **per-item-framed** shape (JSON Lines / NDJSON — one
recommendation object per line — or an equivalent delimited per-item form)
behind a minimal stable envelope (`summary` + framed items), so each item is
an independent parse unit;
- the prompt explicitly states the contract, the per-item framing, the cap, and
a "if uncertain, emit fewer well-formed items rather than more" instruction;
- `max_tokens` is set with headroom for the bounded list so truncation cannot
occur at the expected size;
- the output schema file (`_load_output_schema` target) is updated to match.
2026-06-26 progress (in-repo portion):
- **Strict, bounded schema written** — `schemas/daily-triage-report.json` went
from `recommendations.items: {type: object}` (accept-anything) to a strict
per-item contract: `required [rank, candidate, action, why]` with typed
`wsjf` sub-fields, plus `maxItems: 7`. The strict item shape is what lets the
T03 boundary parser validate each recommendation independently.
- **`maxItems` is a hint, not a hard reject** — the in-repo validator
(`_validate_schema_node`) only enforces `type`/`required`/`properties`/`items`
and ignores `maxItems`/`enum`. That is deliberate: a hard `maxItems` reject
would discard a whole 16-item report — the exact blast-radius bug WP-0016
removes. The bound is enforced via the prompt + the llm-connect `json_schema`
constraint hint + T03 mitigation (keep top-N by rank, quarantine extras).
- **DEPLOY COUPLING (important):** this schema file is consumed *both* as the
llm-connect hint *and* by the current whole-document validator. Tightening
per-item `required` fields makes the existing whole-doc validation hard-fail
**more** until T03 replaces it with per-item quarantine. Therefore the schema
change MUST ship together with T03 — do not deploy the strict schema to the
runtime bundle ahead of the T03 parser. Four executor/instruction tests that
asserted the old loose contract were updated to the strict contract; the
forwarded-schema test now reads the live file instead of hard-coding it.
- **Truncation hypothesis corroborated** — the instruction config carries
`max_tokens` on the order of ~1200 (per the wiring test fixture). 5268 chars ≈
1300–1500 tokens, so a ~1200-token cap would truncate a 16-item list right at
the observed break. This strengthens T01's leading hypothesis and makes the
`max_tokens` headroom change below concrete.
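The token estimate in the last bullet can be reproduced with a rough calculation. The 3.5 to 4 characters-per-token ratio is a rule-of-thumb assumption for English-heavy JSON, not a figure measured for this model, so the result is an order-of-magnitude check only.

```python
CHARS_AT_BREAK = 5268   # position of the JSON break reported by the validator
MAX_TOKENS_CAP = 1200   # approximate cap, per the wiring test fixture

# Assumed rule of thumb: roughly 3.5 to 4 characters per token.
low_estimate = CHARS_AT_BREAK / 4.0    # 1317.0 tokens
high_estimate = CHARS_AT_BREAK / 3.5   # about 1505 tokens

# Even the optimistic estimate is above the cap.
cap_exceeded = low_estimate > MAX_TOKENS_CAP

print(f"{low_estimate:.0f} to {high_estimate:.0f} tokens vs cap {MAX_TOKENS_CAP}")
```

Both estimates land above the ~1200-token cap, which is consistent with the truncation hypothesis but does not replace the `finish_reason` check.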
**Bundle handoff (NOT in this repo — runtime-projected definition).** The triage
prompt and `max_tokens` live in the Railiance runtime bundle, not in repo files.
Apply there:
1. Instruct the model to emit a **bounded top-N** (≤ 7) list of ranked
recommendations, with the rule "if uncertain emit fewer well-formed items
rather than more."
2. Specify the **per-item framing** the T03 parser will consume (NDJSON: a
leading summary object, then one recommendation JSON object per line).
3. Raise **`max_tokens`** to give clear headroom for 7 framed items (eliminate
truncation at the expected size).
4. State the value vocabularies (`action`, `confidence`) the T04 guardrails will
check.
## Boundary Parser — Verify & Mitigate (Posture B)
```task
id: ACTIVITY-WP-0016-T03
status: done
priority: high
state_hub_task_id: "d65a6281-f1f9-4a9b-a835-da065411b709"
```
Implement item-granular parsing with a quarantine lane in
`src/activity_core/rules/executor.py`, applying posture **B** at the boundary
(principles #1–#3).
Done when:
- the parser splits the envelope from the framed items, then parses **each item
independently**; a malformed item is routed to a bounded `quarantined_items`
artifact (index + validation error + raw snippet), not raised;
- a run with some valid and some invalid items emits a report over the surviving
valid items with `output_validated=true`, plus `partial=true` and
`quarantined_count` / `quarantined_items` markers — degraded-but-usable is
reported distinctly from total loss;
- a best-effort **repair** pass (close unterminated brackets/quotes, recover the
valid prefix) is attempted per item before quarantining it;
- truncation detected in T01 is handled as its own signal (recover whole items
emitted before the cutoff rather than failing the document);
- the existing monolithic-document path remains as the fallback when framing is
absent (backward compatible with task-only instructions).
2026-06-26 progress (implemented in `src/activity_core/rules/executor.py`):
- **Resilient recovery wired into `_execute`.** When the whole-document parse +
one retry still fail, report instructions (those with `report_sinks`) now run
`_resilient_report` *before* the total-loss `_invalid_output_report`. If it
recovers ≥1 valid item it returns a partial report; otherwise it returns None
and the prior total-loss path is preserved unchanged.
- **Brace/quote-aware object scanner, not line-splitting.** The real 06-26 output
was pretty-printed (multi-line objects), so naive NDJSON line recovery would
have failed. `_extract_object_spans` walks the `recommendations` array
brace-depth- and string-aware, so it recovers each recommendation object
whether pretty-printed across many lines *or* emitted one-per-line (NDJSON).
The truncated trailing object is returned with `complete=False`.
- **Layered mitigation per item:** `json.loads` → on failure for a truncated
tail, a best-effort `_try_repair` (balance open string/brackets/braces) →
then `_partition_items` validates each recovered object against the T02 item
schema. Valid items survive; malformed or over-`maxItems` items are
quarantined with provenance (`index`, `error`, `raw` snippet, `reason`).
- **Report shape on degradation:** `output_validated=True` over the survivors,
`review_required=True`, `partial=True`, `quarantined_count`, and a bounded
`quarantined_items` list (cap 20). Degraded-but-usable is now reported
distinctly from total loss.
- **Verified against the real failure shape.** New tests reconstruct a
pretty-printed report with 7 valid recommendations + a truncated tail (the
06-26 shape) and a one-bad-item-among-valid case. The 7-item run now recovers
all 7 and quarantines the broken tail (previously: whole run discarded);
log line `instruction_output_recovered: kept=7, quarantined=1`. The bad-item
run keeps 2 and quarantines the rank-less one.
- **Deferred to T04 (clean scope boundary):** enforcing `maxItems` top-N on the
*happy* path (valid JSON, all items schema-valid, but > N items) — the resilient
path only runs on failure, so over-limit-on-success is a guardrail/count-cap
concern, which is exactly T04's remit.
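The brace- and string-aware scan described above can be sketched as follows. This is an illustration of the technique, not the repo's `_extract_object_spans`; the function name, return shape, and sample payload are assumptions for the example.

```python
import json


def extract_object_spans(text: str, start: int):
    """Return (raw, complete) for each object in the array beginning at `start`."""
    spans, depth, begin = [], 0, None
    in_string = escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a string, braces are data; only track escapes and the end quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                begin = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append((text[begin:i + 1], True))
                begin = None
        elif ch == "]" and depth == 0:
            break  # end of the recommendations array
    if begin is not None:
        spans.append((text[begin:], False))  # truncated trailing object
    return spans


broken = '''{
  "summary": "s",
  "recommendations": [
    {"rank": 1, "candidate": "a {x}", "wsjf": {"score": 3}},
    {
      "rank": 2,
      "candidate": "b"
    },
    {"rank": 3, "candidate": "c", "wsjf": {"sco'''

array_start = broken.index("[", broken.index('"recommendations"')) + 1
spans = extract_object_spans(broken, array_start)
recovered = [json.loads(raw) for raw, complete in spans if complete]
```

The scan is indifferent to layout, so the single-line first object and the pretty-printed second object are both recovered, while the cut-off third object is returned separately for repair or quarantine.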
## Producer Guardrails + ADR-004
```task
id: ACTIVITY-WP-0016-T04
status: done
priority: medium
state_hub_task_id: "f5c3af5b-9e28-42b0-9af5-4c99284e99b9"
```
Write the architecture decision record and add the producer-agnostic guardrails
(principle #4).
Done when:
- `docs/adr/adr-004-producer-trust-boundary.md` documents the trust boundary,
the untrusted-producer premise (erroneous **and** malicious; human and agent),
the A vs B taxonomy and where each applies, the error-locality principle, and
the quarantine-with-provenance rule;
- boundary guardrails are enforced at the consumer edge: max item **count**, max
string length, max nesting **depth**, and a **reference allow-list** (e.g. a
recommendation `candidate` / a task `target_repo` must resolve to a known
workstream/repo before it is acted on);
- guardrail rejections are quarantined with provenance, consistent with T03;
- SCOPE.md / INTENT.md are checked for drift and updated if the boundary stance
changes the documented contract.
2026-06-26 progress:
- **ADR-004 written** — `docs/adr/adr-004-producer-trust-boundary.md` documents
the untrusted-producer premise (erroneous + malicious; LLM/agent/human), the
A-vs-B posture taxonomy, the four governing principles, the concrete
activity-core mechanisms, a posture-by-layer table, consequences, and
alternatives considered. Accepted, scope cross-repo.
- **Producer guardrails implemented** in `executor.py`, applied uniformly on the
happy path *and* the recovery path via `_partition_items`: per-item order is
structural-type → schema → structural caps (`_MAX_DEPTH=8`,
`_MAX_STRING_LEN=4000`) → reference allow-list → count cap (`maxItems`). Each
quarantine carries a `reason` (`malformed`/`schema`/`guardrail`/`allow_list`/
`over_limit`).
- **Happy-path count cap closed** (the item deferred from T03): a syntactically
valid 9-item report now keeps 7 and quarantines 2 as `over_limit`, emitting a
`partial` report — without a retry.
- **Reference allow-list wired but inert.** `_allow_list_from_context` reads
`context["known_candidates"]`; when present, recommendations with an unknown
`candidate` are quarantined (`reason: allow_list`). Absent today → check is
inert; activation is a one-line context-resolver change. Keeps the guardrail
producer-agnostic (principle #4) and ready.
- **SCOPE.md updated** — instruction-executor bullet now names the quarantine
lane + guardrails; ADR-004 added to the Architecture Decisions list. No INTENT
drift: this hardens the existing output contract, it does not extend scope.
- New tests: happy-path count cap, oversized-string guardrail, allow-list
rejection (all green).
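The guardrail ordering above can be sketched roughly as below. This is not the `_partition_items` implementation: the schema step is omitted, the helper names are hypothetical, and keeping the first N items assumes the list already arrives in rank order. Only the cap values (depth 8, string length 4000) and the reason labels come from the notes above.

```python
MAX_DEPTH = 8
MAX_STRING_LEN = 4000


def exceeds_depth(value, limit: int) -> bool:
    """True when containers are nested more than `limit` levels deep."""
    if not isinstance(value, (dict, list)):
        return False
    if limit <= 0:
        return True
    children = value.values() if isinstance(value, dict) else value
    return any(exceeds_depth(child, limit - 1) for child in children)


def longest_string(value) -> int:
    """Length of the longest string value anywhere in the structure."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, dict):
        return max((longest_string(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((longest_string(v) for v in value), default=0)
    return 0


def partition_items(items, max_items: int, known_candidates=None):
    """Split items into kept and quarantined, tagging each rejection."""
    kept, quarantined = [], []
    for index, item in enumerate(items):
        reason = None
        if not isinstance(item, dict):
            reason = "malformed"
        elif exceeds_depth(item, MAX_DEPTH) or longest_string(item) > MAX_STRING_LEN:
            reason = "guardrail"
        elif known_candidates is not None and item.get("candidate") not in known_candidates:
            reason = "allow_list"
        elif len(kept) >= max_items:
            reason = "over_limit"
        if reason:
            quarantined.append({"index": index, "reason": reason})
        else:
            kept.append(item)
    return kept, quarantined


nine = [{"rank": i, "candidate": f"ws-{i}"} for i in range(1, 10)]
kept, quarantined = partition_items(nine, max_items=7)
```

With nine valid items and a cap of seven, the first seven are kept and the last two are quarantined as `over_limit`. Passing `known_candidates=None` leaves the allow-list check inert, mirroring the current state described above.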
## Tests + Calibration Re-Entry
```task
id: ACTIVITY-WP-0016-T05
status: progress
priority: high
state_hub_task_id: "c881500b-5459-4620-81c0-b176971e989f"
```
Prove the new posture and hand back to the calibration gates.
Done when:
- regression tests cover: the captured 06-26 payload, a truncated-mid-list
payload, a one-bad-item-among-good payload (asserts quarantine + partial), an
oversized/over-deep payload (asserts guardrail rejection), and an
injection-shaped reference (asserts allow-list rejection);
- the full suite passes and the result is recorded here with the count;
- a daily-triage smoke against the live runtime shows a previously-failing
payload now **degrades gracefully** (valid items delivered, bad items
quarantined) instead of discarding the run;
- a progress note hands back to `ACTIVITY-WP-0010-T04` and `ACTIVITY-WP-0006-T03`
that the output-robustness blocker is cleared so the three-clean-run gate can
resume on its own.
2026-06-26 progress (in-repo portion complete):
- **Regression coverage complete.** Across T03/T04/T05: truncated-mid-list,
one-bad-item-among-good (quarantine + partial), oversized-string and over-depth
guardrail rejection, allow-list (injection-shaped) rejection, happy-path count
cap, and a test driving the **actual captured 2026-06-26 payload**
(`tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json`)
— it now recovers 6+ valid recommendations and quarantines the truncated tail,
where before it discarded the whole run.
- **Full suite green:** 218 passed, 1 skipped (recorded at T04; the T05 fixture +
over-depth tests add to this — see the commit).
- **Hand-back notes posted** to `ACTIVITY-WP-0006-T03` (State Hub event
`b6b8c2b8`) and `ACTIVITY-WP-0010-T04` (`b813f0dc`).
- **Remaining (remote, operator-owned):** the live daily-triage smoke on
`railiance01` proving end-to-end graceful degradation. It depends on deploying
the T02 bundle prompt/`max_tokens`/NDJSON changes together with this code, which
is cluster/operator work outside this repo's SCOPE. T05 therefore stays
`progress` until that live run exists; the in-repo deliverables are done.
## Relationships
- **Blocks / feeds:** `ACTIVITY-WP-0006-T03` (three clean scheduled runs) and
`ACTIVITY-WP-0010-T04` (collect three clean scheduled runs) — both stalled on
the same output-quality failure this workplan removes.
- **References:** `ACTIVITY-WP-0009` (scheduled-run trust gap).
- **Boundary discipline:** keeps activity-core inside its SCOPE — this hardens
the instruction-executor output contract; it does not move provider
credentials, cluster reconciliation, or task lifecycle into this repo.

@@ -0,0 +1,58 @@
---
id: ACTIVITY-WP-0017
type: workplan
title: "Core Hub ops evidence sink"
domain: infotech
repo: activity-core
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-27"
updated: "2026-06-27"
state_hub_workstream_id: "2a073bf4-febf-433e-a721-5daf71760912"
---
# Core Hub ops evidence sink
## Goal
Provide the activity-core side of the Core Hub replacement evidence path for
`CORE-WP-0008-T03`, without depending on the legacy Haskell Inter-Hub sink and
without placing secret material in activity definitions, logs, State Hub, or
chat.
## Task: Add Core Hub interaction-event sink
```task
id: ACTIVITY-WP-0017-T01
status: done
priority: high
state_hub_task_id: "32aab1af-6be5-4b52-afa1-c11f52c65892"
```
Add a `core-hub-interaction-event` ops evidence sink that posts sanitized
ops-inventory probe evidence to Core Hub `/api/v2/interaction-events`, verifies
the created event is visible, and reports only non-secret ids/statuses.
Acceptance:
- runtime token is read through `CORE_HUB_RUNTIME_TOKEN_FILE` or a named
environment variable, never from workplan content;
- sink configuration accepts `CORE_HUB_BASE_URL` and a widget id or widget
mapping;
- emitted metadata reuses the existing compact/sanitized probe evidence path;
- missing Core Hub config skips cleanly with explicit non-secret missing keys;
- tests prove the POST/visibility check and secret non-disclosure.
Verification 2026-06-27: `tests/test_ops_evidence_sinks.py` passed, and
a disposable local Core Hub runtime accepted an activity-core
`core-hub-interaction-event` sink emission, then listed the created
`ops-endpoint-verified` event back through `/api/v2/interaction-events`.
The verification asserted sanitized metadata did not include response body,
authorization header, URL userinfo, or token query material.
Completed 2026-06-27: implemented the Core Hub interaction-event sink in
`activity_core.ops_evidence_sinks` with unit coverage for POST/visibility
verification, missing config behavior, and secret non-disclosure. This provides
the direct Core Hub consumer path needed by `CORE-WP-0008-T03`; deployed use
still requires an approved Core Hub runtime token and widget id/mapping.
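The non-disclosure checks listed in the verification note can be illustrated with a small sanitizer. This is a sketch under stated assumptions, not the code in `activity_core.ops_evidence_sinks`: the function names, the evidence key names (`response_body`, `authorization`, `headers`, `url`), and the set of secret query parameters are all hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of query parameters that may carry a token.
SECRET_QUERY_KEYS = {"token", "access_token", "api_key"}
# Assumed evidence keys that must never be forwarded.
DROPPED_KEYS = {"response_body", "authorization", "headers"}


def sanitize_url(url: str) -> str:
    """Strip userinfo, secret query parameters, and the fragment from a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname excludes any user:password@ prefix
    if parts.port:
        host = f"{host}:{parts.port}"
    query = urlencode(
        [
            (key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SECRET_QUERY_KEYS
        ]
    )
    return urlunsplit((parts.scheme, host, parts.path, query, ""))


def sanitize_evidence(evidence: dict) -> dict:
    """Drop secret-bearing keys and clean the URL before emission."""
    clean = {k: v for k, v in evidence.items() if k.lower() not in DROPPED_KEYS}
    if "url" in clean:
        clean["url"] = sanitize_url(clean["url"])
    return clean


evidence = sanitize_evidence(
    {
        "url": "https://user:pw@hub.example:8443/api/v2/interaction-events?token=abc&widget=w1",
        "status": 200,
        "authorization": "Bearer abc",
        "response_body": "{}",
    }
)
```

This simple form does not restore brackets around IPv6 literal hosts, which a production sanitizer would need to handle.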

@@ -3,6 +3,7 @@ type: session-note
created: "2026-03-28"
updated: "2026-06-03"
status: archived
state_hub_workstream_id: "b221e65a-6f97-44b0-8dae-442fffcb7f64"
---
# WP-0002 Handoff Note — Continue on CoulombCore