Compare commits

75 Commits

Author SHA1 Message Date
a1e2a426b9 ISSUE-WP-0003-T06: issue-core REST sink via actcore-issue-core-bridge (node-local tunnel 18765)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:20:12 +02:00
9113206974 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-07-02:
  - update .custodian-brief.md for activity-core
2026-07-02 11:55:47 +02:00
79fd3406a3 ACTIVITY-WP-0016: finished
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 11:55:07 +02:00
ef9a1a76c2 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-07-02:
  - update .custodian-brief.md for activity-core
2026-07-02 11:54:43 +02:00
0da655979d ACTIVITY-WP-0016: T05 done via live railiance01 proof, T01 cancelled (evidence unrecoverable)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 11:54:04 +02:00
7612112e7e RAIL-BS-WP-0008-T02: bounded top-7 + NDJSON per-item framing in daily-triage Instruction
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 10:44:00 +02:00
6a5321525e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-07-02:
  - update .custodian-brief.md for activity-core
2026-07-02 02:19:57 +02:00
2f55167215 Add automation inventory surface 2026-07-02 02:15:39 +02:00
ffe10f098e Add automation status surface 2026-07-01 20:12:04 +02:00
3f85274916 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-30:
  - update .custodian-brief.md for activity-core
2026-06-30 12:38:50 +02:00
bb14d08212 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-30:
  - update .custodian-brief.md for activity-core
2026-06-30 12:36:50 +02:00
92629e7a91 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-30:
  - update .custodian-brief.md for activity-core
2026-06-30 01:50:22 +02:00
951ec56f7a chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-29:
  - update .custodian-brief.md for activity-core
2026-06-29 13:45:41 +02:00
9440d539c6 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-29:
  - update .custodian-brief.md for activity-core
2026-06-29 13:33:21 +02:00
2ff852da29 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-29:
  - update .custodian-brief.md for activity-core
2026-06-29 12:57:25 +02:00
30043348f0 Add Core Hub ops evidence sink 2026-06-27 20:34:25 +02:00
18fcce87fe Update daily triage stabilization status 2026-06-27 09:58:47 +02:00
17b787fad0 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-27:
  - update .custodian-brief.md for activity-core
2026-06-27 08:07:46 +02:00
6c8cb1b7b6 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-27:
  - ACTIVITY-WP-0010-T03: progress → wait
2026-06-27 08:07:42 +02:00
ec66e06066 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-27:
  - update .custodian-brief.md for activity-core
2026-06-27 08:00:51 +02:00
919edd98ac chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - update .custodian-brief.md for activity-core
2026-06-26 18:20:26 +02:00
bf877b7f0d test(ACTIVITY-WP-0016-T05): regression coverage incl. real 06-26 payload + over-depth
Add a test driving the actual captured 2026-06-26 failure payload
(tests/fixtures/wp0016/...partial.json): it now recovers 6+ valid recommendations
and quarantines the truncated tail, where before WP-0016 it discarded the whole run.
Add an over-depth guardrail test. Together with T03/T04 the regression set now covers
truncation, one-bad-item, oversized-string, over-depth, allow-list/injection-shaped,
and happy-path count cap.

In-repo portion of T05 complete; the live railiance01 graceful-degradation smoke is
operator-owned cluster work (deploy-coupled with the T02 bundle changes) and remains
outstanding. Hand-back notes posted to WP-0006-T03 and WP-0010-T04. Full suite: 220
passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 18:18:37 +02:00
9be4ddbdb7 feat(ACTIVITY-WP-0016-T04): producer trust-boundary guardrails + ADR-004
Add ADR-004 documenting the producer trust boundary: untrusted producers (LLM,
agent, human; erroneous and malicious), the trust-but-handle vs verify-and-mitigate
postures, error-locality and quarantine-with-provenance principles, and the concrete
activity-core mechanisms.

Implement producer-agnostic guardrails in executor.py, applied uniformly on the
happy path and the recovery path via _partition_items: structural-type -> schema ->
structural caps (_MAX_DEPTH, _MAX_STRING_LEN) -> reference allow-list -> count cap.
Each quarantine carries a reason. Closes the happy-path maxItems count cap deferred
from T03 (valid 9-item report keeps 7, quarantines 2). Reference allow-list reads
context["known_candidates"] via _allow_list_from_context; inert until a resolver
populates it. SCOPE.md updated (executor bullet + ADR list); no INTENT drift.

New tests: happy-path count cap, oversized-string guardrail, allow-list rejection.
Full suite: 218 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 18:10:17 +02:00
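The guardrail order named in 9be4ddbdb7 (structural-type -> schema -> structural caps -> reference allow-list -> count cap) can be sketched roughly as below. This is an illustrative reading of the commit message, not the code in executor.py: the function names, the quarantine record shape, and the values of `MAX_DEPTH` and `MAX_STRING_LEN` are assumptions. Only the required keys and the cap of 7 come from the commit messages.

```python
from typing import Any, Optional

REQUIRED_KEYS = ("rank", "candidate", "action", "why")  # from the T02 item contract
MAX_ITEMS = 7            # count cap named in the commit messages
MAX_DEPTH = 6            # assumed value; the real _MAX_DEPTH is not shown
MAX_STRING_LEN = 2000    # assumed value; the real _MAX_STRING_LEN is not shown


def depth(value: Any) -> int:
    """Nesting depth: scalars are 0, each dict or list level adds 1."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0


def longest_string(value: Any) -> int:
    """Length of the longest string found anywhere inside value."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, dict):
        return max((longest_string(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((longest_string(v) for v in value), default=0)
    return 0


def check_item(item: Any, allow_list: Optional[set]) -> Optional[str]:
    """Return a quarantine reason, or None when the item passes every check."""
    if not isinstance(item, dict):
        return "structural-type"
    if any(key not in item for key in REQUIRED_KEYS):
        return "schema"
    if depth(item) > MAX_DEPTH:
        return "over-depth"
    if longest_string(item) > MAX_STRING_LEN:
        return "oversized-string"
    candidate = item["candidate"]
    if allow_list is not None and (
        not isinstance(candidate, str) or candidate not in allow_list
    ):
        return "unknown-candidate"
    return None


def partition_items(items: list, allow_list: Optional[set] = None):
    """Split items into (valid, quarantined); the count cap is applied last."""
    valid, quarantined = [], []
    for index, item in enumerate(items):
        reason = check_item(item, allow_list)
        if reason is None and len(valid) >= MAX_ITEMS:
            reason = "over-maxItems"
        if reason is None:
            valid.append(item)
        else:
            quarantined.append({"index": index, "reason": reason, "raw": item})
    return valid, quarantined
```

With `allow_list=None` the reference check is skipped, which mirrors the "inert until a resolver populates it" behaviour described in the commit.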
c5440e8429 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - update .custodian-brief.md for activity-core
2026-06-26 18:04:07 +02:00
53dc0f6e93 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T03: progress → done
2026-06-26 18:03:50 +02:00
a70c00a789 feat(ACTIVITY-WP-0016-T03): resilient per-item report recovery with quarantine lane
When the whole-document parse + one retry still fail, report instructions now run
_resilient_report before the total-loss path. A brace/quote-aware scanner
(_extract_object_spans) recovers each recommendation object whether pretty-printed
across many lines or NDJSON one-per-line; a truncated tail gets a best-effort
_try_repair; _partition_items validates each recovered object against the T02 item
schema. Valid items survive (output_validated=True, partial=True), malformed/
over-maxItems items are quarantined with provenance (index, error, raw, reason),
capped at 20. Error locality now matches the unit of work: one bad item costs one
item, not the whole report.

Verified against the real 06-26 shape: 7 valid recommendations + a truncated tail
now recovers all 7 and quarantines the broken tail (previously the whole run was
discarded). Happy-path maxItems top-N enforcement is deferred to T04 (count caps).
Full suite: 215 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 17:56:28 +02:00
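A brace/quote-aware scanner of the kind a70c00a789 describes could look like the following minimal sketch. It is not the repository's `_extract_object_spans`; the name `extract_object_spans`, the return shape, and the choice to hand back the unterminated tail as a plain string are assumptions made for illustration. It handles both NDJSON and pretty-printed objects because it tracks nesting depth instead of line breaks.

```python
import json


def extract_object_spans(text: str) -> tuple[list[str], str]:
    """Return (complete top-level {...} spans, unterminated tail or "")."""
    spans: list[str] = []
    depth = 0
    start = None          # index of the '{' that opened the current object
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Inside a JSON string braces are data, not structure.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start : i + 1])
                start = None
    tail = text[start:] if start is not None else ""
    return spans, tail


if __name__ == "__main__":
    raw = '{"rank": 1, "why": "a } brace"}\n{"rank": 2, "why": "ok"}\n{"rank": 3, "why": "trunc'
    complete, tail = extract_object_spans(raw)
    print([json.loads(span)["rank"] for span in complete], repr(tail))
```

Each recovered span can then be parsed and validated on its own, so a truncated tail costs one item instead of the whole report.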
b41b6034ee chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - update .custodian-brief.md for activity-core
2026-06-26 17:52:46 +02:00
960fb05268 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T03: todo → progress
2026-06-26 17:52:30 +02:00
b7b0b5bf6e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T02: todo → progress
2026-06-26 17:52:29 +02:00
14f76fb6d9 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T01: todo → wait
2026-06-26 17:52:28 +02:00
caa2608092 chore(consistency): renormalize lifecycle state [auto]
Updated by fix-consistency on 2026-06-26:
  - workplan status: proposed → active
2026-06-26 17:52:28 +02:00
61f278d643 feat(ACTIVITY-WP-0016-T02): strict bounded daily-triage output schema
Replace the accept-anything recommendations.items ({type: object}) with a strict
per-item contract (required [rank, candidate, action, why] + typed wsjf) and a
maxItems:7 hint. Strict item structure is what lets the T03 boundary parser
validate each recommendation independently and quarantine only malformed ones.

maxItems is a producer hint (prompt + llm-connect json_schema + T03 mitigation),
NOT a hard reject — a hard maxItems reject would discard a whole 16-item report,
the blast-radius bug WP-0016 removes. DEPLOY COUPLING: the strict schema is also
consumed by the current whole-doc validator, so it must ship with T03's per-item
quarantine parser; until then it increases whole-doc hard-fails. Prompt + max_tokens
headroom + NDJSON framing are documented as a runtime-bundle handoff.

Updated four tests to the strict contract; the forwarded-schema test now reads the
live schema file instead of hard-coding it. Full suite: 213 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 17:36:24 +02:00
0e9e18a59a chore(ACTIVITY-WP-0016-T01): record root-cause findings + partial failure fixture
Local analysis of the 2026-06-26 daily-triage validation failure: the unbounded
~1-recommendation-per-workstream list (16 active workstreams; JSON break at char
5268, ~rank 8-9) is the structural cause; both the first attempt and the retry
failed. The exact offending token and finish_reason are unrecoverable from
activity-core data — complete() drops finish_reason/usage, the report sink caps
raw output at 4000 chars (< 5268), and the log preview at 2000. Confirming the
exact token needs llm-connect producer-side logs on railiance01 (operator-owned);
mitigation (T02/T03) is identical regardless. Partial fixture captured.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 15:04:27 +02:00
5eb33bd3bb feat(ACTIVITY-WP-0016): register LLM output robustness & producer trust boundary workplan
Add WP-0016 to make the instruction-executor output contract robust after the
2026-06-26 daily-triage validation failure (one malformed delimiter discarded a
whole report). Per-item framing for error locality, verify-and-mitigate boundary
parsing with a quarantine lane, producer-trust-boundary guardrails (ADR-004), and
regression/calibration tests. Unblocks WP-0006-T03 / WP-0010-T04.

Also record the 06-26 recheck outcome (streak reset at two) in WP-0006-T03.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 14:39:21 +02:00
612c226472 chore(ACTIVITY-WP-0015): dedupe state_hub_workstream_id frontmatter
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:53:52 +02:00
0b2c68838e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-24:
  - update .custodian-brief.md for activity-core
2026-06-24 12:53:31 +02:00
4b5e96d7c1 feat(ACTIVITY-WP-0014): close workplan — catchup_latest deployed & verified on railiance01
T04 done: built+deployed the WP-0014 image to railiance01, applied catchup_latest
to daily-statehub-wsjf-triage, /admin/sync clean (6 defs, 4 schedules, 0 errors).
Live schedule verified OverlapPolicy=BufferOne, CatchupWindow=1d; pods healthy.
All tasks T01-T05 complete; beachhead-endpoint adoption tracked in WP-0015.
Workplan status -> finished.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:52:54 +02:00
65ef005c2d docs(ACTIVITY-WP-0014): close T05 in-repo; split beachhead adoption to WP-0015
Idempotent-writes half of T05 is done in-repo; the externally-blocked endpoint
adoption + actcore-state-hub-bridge proxy retirement move to ACTIVITY-WP-0015
(blocked on the state-hub beachhead) so WP-0014 can close on completed work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:41:21 +02:00
0e75aaec01 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 21:39:32 +02:00
b2e57707a7 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - ACTIVITY-WP-0014-T05: todo → progress
2026-06-23 21:39:28 +02:00
88fe359385 feat(ACTIVITY-WP-0014): idempotency-keyed State Hub writes (T05, in-repo part)
Add activity_core/state_hub_write: every State Hub write (report-sink,
ops-evidence, schedule-miss) now sends a stable Idempotency-Key header derived
from run_id:instruction_id:event_type. Makes writes safe to buffer/replay under
the future state-hub beachhead without duplicate progress/triage events. The
read-based _progress_exists dedup is now best-effort (returns False on connection
error instead of hard-failing), so the guarantee lives on the keyed write rather
than a live read. Tests + runbook note. Endpoint adoption / proxy retirement stays
blocked on the state-hub beachhead capability.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:38:46 +02:00
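The keyed write in 88fe359385 can be illustrated with a small sketch. The commit only says the key is "derived from run_id:instruction_id:event_type"; joining the three values with colons, as done here, is an assumption, and `FakeHub` is a hypothetical in-memory receiver used only to show why a replayed write becomes harmless.

```python
def idempotency_key(run_id: str, instruction_id: str, event_type: str) -> str:
    """Stable key: the same logical write always yields the same string."""
    return f"{run_id}:{instruction_id}:{event_type}"


def write_headers(run_id: str, instruction_id: str, event_type: str) -> dict[str, str]:
    """Headers for one State Hub write (header name taken from the commit)."""
    return {
        "Content-Type": "application/json",
        "Idempotency-Key": idempotency_key(run_id, instruction_id, event_type),
    }


class FakeHub:
    """Hypothetical receiver that stores at most one event per key."""

    def __init__(self) -> None:
        self.events: dict[str, dict] = {}

    def post(self, headers: dict[str, str], payload: dict) -> bool:
        """Return True when the event is new, False when it is a replay."""
        key = headers["Idempotency-Key"]
        if key in self.events:
            return False
        self.events[key] = payload
        return True


if __name__ == "__main__":
    hub = FakeHub()
    headers = write_headers("run-42", "daily-triage", "report_sink")
    print(hub.post(headers, {"summary": "ok"}))   # first delivery
    print(hub.post(headers, {"summary": "ok"}))   # buffered replay
```

Because the key depends only on identifiers known before the write, a buffered or retried request needs no live read to stay duplicate-free.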
f90591c5f1 docs(ACTIVITY-WP-0014): rescope T05 to thin client under State Hub beachhead model
Resilience (queue/cache) is handed to custodian/state-hub as a per-machine
beachhead; activity-core keeps only idempotent writes + adopt-beachhead-endpoint
and retires its bespoke actcore-state-hub-bridge proxy. Proposal sent to state-hub.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:18:01 +02:00
cf7a11dcd9 docs(ACTIVITY-WP-0014): correct Motivation to match T01 findings
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 17:16:17 +02:00
99e5d525a8 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 17:15:41 +02:00
8424c13783 docs(ACTIVITY-WP-0014): T01 root cause — State Hub Connection refused, not misfire
Live inspection of railiance01 (ssh + in-node kubectl/temporal) overturns the
catchup_window hypothesis: the daily-triage schedule is healthy (CatchupWindow
365d default, 0 MissedCatchupWindow). The 2026-06-23T05:20Z fire ran but Failed
at the report sink with '[Errno 111] Connection refused' posting to State Hub.
railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
is unreachable at 07:20 Europe/Berlin (102 resolver timeouts in 24h). Mark T01
done; add T05 for resilient sinks/resolvers as the real incident fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 17:14:04 +02:00
864f90f9b9 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 14:27:54 +02:00
053d18b24a feat(ACTIVITY-WP-0014): missed-fire detection & alert sink (T03)
Add activity_core/schedule_health: a pure evaluate_schedule_health() verdict
(built on Temporal's num_actions_missed_catchup_window plus a staleness check),
an async check_schedule_health() reader, and post_missed_fire_alert() that emits
a schedule_miss State Hub progress event. Makes a missed fire visible even under
misfire_policy=skip, where Temporal drops it by design. Unit tests for the
verdict logic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 14:25:33 +02:00
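The pure verdict described in 053d18b24a combines Temporal's missed-catchup counter with a staleness check. A minimal sketch under stated assumptions: the name `evaluate_schedule_health` comes from the commit, but the signature, the `ScheduleHealth` dataclass, and the `max_staleness` parameter are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class ScheduleHealth:
    healthy: bool
    reasons: tuple[str, ...]


def evaluate_schedule_health(
    missed_catchup_window: int,
    last_fire: Optional[datetime],
    now: datetime,
    max_staleness: timedelta,
) -> ScheduleHealth:
    """Pure verdict: no I/O, so it can be unit-tested without Temporal."""
    reasons = []
    if missed_catchup_window > 0:
        reasons.append(f"{missed_catchup_window} action(s) missed the catchup window")
    if last_fire is None:
        reasons.append("schedule has never fired")
    elif now - last_fire > max_staleness:
        reasons.append("last fire is older than max_staleness")
    return ScheduleHealth(healthy=not reasons, reasons=tuple(reasons))


if __name__ == "__main__":
    now = datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc)
    stale = evaluate_schedule_health(0, now - timedelta(days=2), now, timedelta(hours=26))
    print(stale)
```

The staleness branch is what surfaces a miss under `misfire_policy=skip`, where the missed-catchup counter alone may stay at zero.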
77af65afb2 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 14:17:14 +02:00
0495f8a43f chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - ACTIVITY-WP-0014-T04: progress → wait
2026-06-23 14:17:06 +02:00
c6cad9e7b3 chore(consistency): renormalize lifecycle state [auto]
Updated by fix-consistency on 2026-06-23:
  - workplan status: proposed → active
2026-06-23 14:17:06 +02:00
a83b117f60 feat(ACTIVITY-WP-0014): explicit run-miss recovery policies (T02, T04)
Set Temporal catchup_window on cron schedules so a fire missed during a
worker/Temporal outage is no longer silently dropped. Redefine misfire_policy
into three explicit modes — skip, catchup_all, catchup_latest — mapping to
(catchup_window, overlap) pairs; legacy catchup/compress aliased. Add
catchup_window_seconds override. Remove the ad-hoc upsert-time 1h backfill in
favour of native catchup. Apply catchup_latest to daily-statehub-wsjf-triage in
the Railiance runtime manifest and document run-miss policies in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 14:15:45 +02:00
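The mapping from the three `misfire_policy` modes in a83b117f60 to (catchup_window, overlap) pairs might be sketched as below. Only the `catchup_latest` pair (BufferOne with a one-day window) is confirmed by the later deployment note in 4b5e96d7c1; the values for `skip` and `catchup_all`, the `RunMissPolicy` type, and the `resolve_policy` helper are assumptions. Overlap policies are given as plain strings so the sketch runs without the Temporal SDK, and the legacy aliases are left out because the commit does not say which mode each one maps to.

```python
from dataclasses import dataclass
from typing import Optional

ONE_DAY = 24 * 60 * 60


@dataclass(frozen=True)
class RunMissPolicy:
    catchup_window_seconds: int
    overlap: str  # name of a Temporal schedule overlap policy


_POLICIES = {
    # Assumed: a short window so missed fires are dropped rather than replayed.
    "skip": RunMissPolicy(60, "SKIP"),
    # Assumed: replay every missed fire, queued one after another.
    "catchup_all": RunMissPolicy(ONE_DAY, "BUFFER_ALL"),
    # Confirmed on railiance01: OverlapPolicy=BufferOne, CatchupWindow=1d.
    "catchup_latest": RunMissPolicy(ONE_DAY, "BUFFER_ONE"),
}


def resolve_policy(
    misfire_policy: str, catchup_window_seconds: Optional[int] = None
) -> RunMissPolicy:
    """Look up a mode and apply the optional per-definition window override."""
    try:
        base = _POLICIES[misfire_policy]
    except KeyError:
        raise ValueError(f"unknown misfire_policy: {misfire_policy!r}") from None
    if catchup_window_seconds is not None:
        return RunMissPolicy(catchup_window_seconds, base.overlap)
    return base


if __name__ == "__main__":
    print(resolve_policy("catchup_latest"))
    print(resolve_policy("catchup_all", catchup_window_seconds=3600))
```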
ffc0ee2cb7 feat(ACTIVITY-WP-0014): plan schedule misfire robustness & run-miss options
Cron fires are silently dropped: _build_schedule() sets SchedulePolicy(overlap=)
but never catchup_window, so a brief worker/Temporal outage at trigger time drops
the fire with no recovery and no signal (root cause of missing 06-22/06-23 daily
triage runs). Define three explicit run-miss policies: skip, catchup_all,
catchup_latest, plus missed-fire detection.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 13:46:19 +02:00
59b3b73061 ui rules established 2026-06-22 23:03:40 +02:00
4bc5111dfd chore(consistency): apply state_hub_workstream_id writeback
Sync archived workplan frontmatter from State Hub fix-consistency.
2026-06-22 17:43:32 +02:00
e9a6029ded chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-22:
  - update .custodian-brief.md for activity-core
2026-06-22 16:50:01 +02:00
bf4e61f0bf feat(ACTIVITY-WP-0012): complete live admin-sync no-restart smoke
Ran Railiance01 cluster validation for POST /admin/sync without restarting
actcore-worker, added a repeatable smoke script, and closed the workplan.
2026-06-22 16:25:26 +02:00
40fa851ec0 fix(bridge): use /state/health for readiness probe
The actcore-state-hub-bridge readiness probe hit /state/summary through
the tunnel proxy chain. Cold-cache summary requests and intermittent
tunnel stalls routinely exceeded the 5s probe timeout (1584 failures
over 17h), leaving the pod 0/1 Ready and breaking hourly/triage sinks.

Use /state/health instead — same signal the ops inventory already
expects, and completes in ~30ms through the bridge.
2026-06-22 14:03:57 +02:00
e0742d18d7 Mark .repo-classification.yaml human-reviewed (CUST-WP-0050 T02)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 11:40:43 +02:00
ccac285b0a Reclassify as tooling (CUST-WP-0050 T02)
Apply the new 'tooling' category (reusable internal tooling/infrastructure)
from the Repo Classification Standard. First-pass agent classification.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 03:06:01 +02:00
a0dcc52353 Add repo classification (CUST-WP-0050 T02)
First-pass agent classification per the Repo Classification Standard v1.0
(canon-repo-classification); pending human review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 02:44:46 +02:00
faf5d60ae8 feat(STATE-WP-0064): enable cluster consistency sweep schedule
Enable the definition in k8s projection and pass activity-core source tags.
2026-06-21 21:46:43 +02:00
adfd1a9067 fix(STATE-WP-0064): allow 360s POST timeout on state-hub bridge proxy
Consistency sweeps exceed the previous 30s urllib timeout when triggered from
Railiance01 activity-core through actcore-state-hub-bridge.
2026-06-21 20:56:35 +02:00
44987457c1 chore: add make sync-schedules target for Temporal schedule reconcile
Wraps python -m activity_core.sync_schedules for operator discoverability.
2026-06-21 20:28:04 +02:00
3a981cc98f feat(STATE-WP-0064): wire consistency_sweep_remote_all state-hub query
Add POST /consistency/sweep/remote-all resolver support with a 330s
timeout and k8s projection for the consistency sweep definition.
2026-06-21 20:19:22 +02:00
dbd2fbb11c docs(workplan): record railiance01 llm-connect smoke evidence
Note the 2026-06-19 live reconciliation on railiance01: llm-connect
deployed, worker restarted with LLM_CONNECT_URL, fixture smoke passed.
Manual daily triage still blocked on actcore-state-hub-bridge reachability.
2026-06-19 15:58:04 +02:00
c938b80503 chore(kaizen): demote coach/optimization to weekly operate cadence
After coulomb-loop bootstrap E2E (3/3 cycles on 2026-06-18), revert
activity-core from experimental daily crons to weekly Monday schedules
so discover_kaizen_scheduled_repos(cadence=weekly) matches the
operate-phase ActivityDefinitions. Drop the disabled tdd-workflow stub.
2026-06-19 11:32:36 +02:00
3e93567a53 Add admin sync hot reload path 2026-06-19 01:54:13 +02:00
6f68f8f9ec chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-19:
  - update .custodian-brief.md for activity-core
2026-06-19 01:52:52 +02:00
f05c56e202 fix(issue-sink): stringify triggering_event_id before JSON encode
IssueCoreRestSink.emit() passed task_spec.triggering_event_id straight
into the httpx json= payload. When the field is a UUID object (rather
than a string), httpx's JSON encoder raised
"TypeError: Object of type UUID is not JSON serializable", failing the
emission. Guard with str(), preserving None for optional event ids.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 00:15:03 +02:00
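The failure and the guard from f05c56e202 can be reproduced with the standard library alone. The helper name `event_id_for_json` is hypothetical; the point is only that a `uuid.UUID` must be converted with `str()` before JSON encoding, while `None` has to stay `None` so optional ids still serialize as `null`.

```python
import json
import uuid
from typing import Optional, Union


def event_id_for_json(value: Optional[Union[uuid.UUID, str]]) -> Optional[str]:
    """Stringify UUIDs for JSON payloads, preserving None for optional ids."""
    return None if value is None else str(value)


if __name__ == "__main__":
    event_id = uuid.uuid4()
    try:
        json.dumps({"triggering_event_id": event_id})
    except TypeError as exc:
        print(f"unguarded payload fails: {exc}")
    print(json.dumps({"triggering_event_id": event_id_for_json(event_id)}))
    print(json.dumps({"triggering_event_id": event_id_for_json(None)}))
```

A bare `str(value)` without the `None` check would send the literal string "None", which is why the guard is conditional.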
200ec0c97a Add credential routing instructions for all agent runtimes
Propagate shared credential-routing section (Codex, Claude, Grok, llm-connect)
from state-hub template via scripts/propagate_credential_routing.py.
2026-06-18 22:48:37 +02:00
42e5ef725c Document issue-core emission contract in AGENTS.md
Add ISSUE_CORE_URL, ISSUE_CORE_API_KEY, and ISSUE_SINK_TYPE guidance so
agents pair keys locally or via OpenBao instead of requesting them from
ops-warden.
2026-06-18 22:34:59 +02:00
a08bd1684f Add ISSUE_CORE_API_KEY auth to IssueCoreRestSink
Issue-core requires a shared ingestion key on POST /issues/. The REST sink
now sends Authorization: Bearer using ISSUE_CORE_API_KEY and fails fast
when the key is missing under ISSUE_SINK_TYPE=rest.

Updates .env.example, emission boundary docs, and unit tests for the
header contract and missing-key error.
2026-06-18 22:30:13 +02:00
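The header contract from a08bd1684f (send `Authorization: Bearer` from `ISSUE_CORE_API_KEY`, fail fast when the key is missing under `ISSUE_SINK_TYPE=rest`) could be sketched as follows. The function name `issue_core_headers`, the `RuntimeError` type, and treating an unset `ISSUE_SINK_TYPE` as `rest` are assumptions, not the behaviour of the actual `IssueCoreRestSink`.

```python
import os
from typing import Mapping


def issue_core_headers(env: Mapping[str, str]) -> dict[str, str]:
    """Build auth headers for POST /issues/, or fail before any request is sent."""
    if env.get("ISSUE_SINK_TYPE", "rest") != "rest":
        # The 'null' sink discards emissions, so no key is needed.
        return {}
    key = env.get("ISSUE_CORE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ISSUE_CORE_API_KEY must be set when ISSUE_SINK_TYPE=rest"
        )
    return {"Authorization": f"Bearer {key}"}


if __name__ == "__main__":
    try:
        headers = issue_core_headers(os.environ)
        print(sorted(headers))  # header names only, never the key itself
    except RuntimeError as exc:
        print(f"configuration error: {exc}")
```

Raising at construction time keeps a missing key from surfacing later as an opaque 401 in the middle of a scheduled run.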
2078915854 Add reuse-surface report gaps resolver 2026-06-18 17:58:00 +02:00
23f4956b68 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-18:
  - update .custodian-brief.md for activity-core
2026-06-18 17:52:38 +02:00
764339e490 chore(consistency): renormalize lifecycle state [auto]
Updated by fix-consistency on 2026-06-18:
  - workplan status: ready → active
2026-06-18 17:52:33 +02:00
68 changed files with 6890 additions and 207 deletions


@@ -0,0 +1,50 @@
# Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=activity-core` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** (`warden sign`) |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`


@@ -1,11 +1,11 @@
## First Session Protocol
Triggered when `get_domain_summary("custodian")` shows **no workstreams**.
Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
The project is registered but work has not yet been structured.
**Step 1 — Read, don't write**
- `~/the-custodian/canon/projects/custodian/project_charter_v0.1.md` — purpose, scope
- `~/the-custodian/canon/projects/custodian/roadmap_v0.1.md` — planned phases
- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
- Scan repo root: README, directory structure, existing code or docs
**Step 2 — Survey in-progress work**
@@ -17,7 +17,7 @@ roadmap phase. **Wait for approval before creating.**
**Step 4 — Create workplan file first, then DB record (ADR-001)**
```
workplans/activity-core-WP-NNNN-<slug>.md ← write this first
workplans/ACTIVITY-WP-NNNN-<slug>.md ← write this first
```
Then register in the hub:
```
@@ -28,7 +28,7 @@ create_task(workstream_id="<id>", title="...", priority="high|medium|low")
**Step 5 — Record the setup**
```
add_progress_event(
summary="First session: structured custodian into N workstreams, M tasks",
summary="First session: structured infotech into N workstreams, M tasks",
event_type="milestone",
topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
detail={"workstreams": [...], "tasks_created": M}


@@ -1,5 +1,5 @@
**Purpose:** Durable task factory built on Temporal. Manages ActivityDefinitions, schedules recurring workflows via Temporal Schedules, routes events via NATS JetStream, and exposes a FastAPI CRUD surface for the custodian domain.
**Domain:** custodian
**Domain:** infotech
**Repo slug:** activity-core
**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a


@@ -1,6 +1,7 @@
## Session Protocol
State Hub: http://127.0.0.1:8000
Dev Hub (State Hub API): http://127.0.0.1:8000
MCP server name in `~/.claude.json`: `dev-hub`
**Step 1 — Orient**
@@ -10,7 +11,7 @@ cat .custodian-brief.md
```
Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
```
get_domain_summary("custodian")
get_domain_summary("infotech")
```
If MCP tools are unavailable in the current agent session, use the REST API:
```bash
@@ -39,11 +40,11 @@ curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
ls workplans/
```
For each file with `status: ready`, `active`, or `blocked`, note pending
`todo`/`in_progress` tasks.
`wait`/`todo`/`progress` tasks.
**Step 4 — Present brief**
1. **Active workstreams** for `custodian` — title, task counts, blocking decisions
1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
2. **Pending tasks** from `workplans/` + any `[repo:activity-core]` hub tasks
3. **Goal guidance** — if `goal_guidance` in summary:
- `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*


@@ -1,7 +1,7 @@
## Workplan Convention (ADR-001)
File location: `workplans/activity-core-WP-NNNN-<slug>.md`
ID prefix: `ACTIVITY-WP`
File location: `workplans/ACTIVITY-WP-NNNN-<slug>.md`
ID prefix: `ACTIVITY-WP-`
Work items originate as files in this repo **before** being registered in the hub.
@@ -12,7 +12,7 @@ repo state, and `finished` when implementation is complete. `stalled` and
`needs_review` are derived health labels, not stored statuses.
Closed workplans may be moved to `workplans/archived/` with a completion-date
prefix: `YYMMDD-activity-core-WP-NNNN-<slug>.md`. The frontmatter id remains
prefix: `YYMMDD-ACTIVITY-WP-NNNN-<slug>.md`. The frontmatter id remains
unchanged; the prefix is only for quick visual reference.
Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
@@ -25,4 +25,16 @@ Ecosystem todos from other agents arrive as `[repo:activity-core]` hub tasks —
visible at session start. Pick one up by creating the workplan file, then registering
the workstream.
Task blocks use this shape:
```task
id: ACTIVITY-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
```
Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
blocked work and `cancel` for stopped work.
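A task block in the shape above can be checked mechanically. The sketch below is a hypothetical helper, not part of fix-consistency: it parses the `key: value` lines, drops trailing `#` comments, and rejects any status outside the five documented values.

```python
ALLOWED_STATUSES = {"wait", "todo", "progress", "done", "cancel"}


def parse_task_block(block: str) -> dict[str, str]:
    """Parse a ```task block body into a dict and validate its status."""
    fields: dict[str, str] = {}
    for line in block.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        fields[key.strip()] = value.strip().strip('"')
    status = fields.get("status")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"invalid status: {status!r}")
    return fields


if __name__ == "__main__":
    example = """
    id: ACTIVITY-WP-0016-T03
    status: progress
    priority: high
    state_hub_task_id: "<uuid>"  # written by fix-consistency
    """
    print(parse_task_block(example))
```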
<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->


@@ -1,33 +1,23 @@
<!-- custodian-brief: generated by fix-consistency — do not edit manually -->
# Custodian Brief — activity-core
**Domain:** custodian
**Last synced:** 2026-06-18 13:20 UTC
**Domain:** infotech
**Last synced:** 2026-07-02 09:55 UTC
**State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
## Active Workstreams
### Definition And Schedule Hot Reload
Progress: 0/5 done | workstream_id: `8887075e-21ec-451b-b82b-cd81035c9ca5`
### Adopt State Hub Beachhead Endpoint
Progress: 0/2 done | workstream_id: `bbc07f9e-9323-4b2b-b556-c33b37d0b228`
**Open tasks:**
- ! Live No-Restart Smoke `68a0e22a`
- · Extract Reusable Sync Service `53a7970b`
- · Add Admin Sync Endpoint `8697c761`
- · Preserve Schedule Drift Semantics `efeac412`
- · Optional Background Sync Loop `d774087b`
### Post-triage operational hardening
Progress: 6/7 done | workstream_id: `5646e13a-13af-4724-bca6-3c0d86f96733`
**Open tasks:**
- ! Three-Run Calibration Feedback `7cbf0a35`
- ! Point STATE_HUB_URL at the beachhead `76b6132d`
- ! Retire the bespoke actcore-state-hub-bridge proxy `526c2129`
### Daily Triage LLM Reconciliation And Evidence
Progress: 1/5 done | workstream_id: `f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9`
Progress: 2/5 done | workstream_id: `f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9`
**Open tasks:**
- ! Reconcile Live Railiance Runtime `23545ddc`
- ! Run Daily Triage Fixture Smoke `10e0df77`
- ! Collect Three Clean Scheduled Runs `dc6b9482`
- ! Close Handoff State `ecc57e21`
@@ -49,6 +39,6 @@ Progress: 2/3 done | workstream_id: `7387fc50-1f2c-471a-9d85-bb085cbd0b63`
## MCP Orientation (when available)
If the state-hub MCP server is reachable, call:
`get_domain_summary("custodian")`
`get_domain_summary("infotech")`
This provides richer cross-domain context.
If the MCP call fails, use this file as your orientation source.


@@ -18,7 +18,9 @@ STATE_HUB_URL=http://127.0.0.1:8000
# Repo scoping — used by the repo-scoping context adapter. Binds {} on failure.
REPO_SCOPING_URL=http://127.0.0.1:8020
# Issue Core — task emission backend.
ISSUE_CORE_URL=http://127.0.0.1:8010
ISSUE_CORE_URL=http://127.0.0.1:8765
# Shared ingestion key — must match issue-core's ISSUE_CORE_API_KEY.
ISSUE_CORE_API_KEY=
# Sink type: 'rest' (POST to issue-core) or 'null' (discard, for dry-run).
ISSUE_SINK_TYPE=rest


@@ -1,17 +1,15 @@
# Kaizen scheduled agent execution (ADR-005)
# Engagement: coulomb-loop — stabilize phase (daily crons per ADR-003)
# Promoted 2026-06-18 after 3/3 bootstrap E2E cycles
# Kaizen scheduled agent execution manifest (ADR-005)
# Engagement: coulomb-loop bootstrap — weekly cadence
# Regulator promotes cadence per customer engagement policy (ADR-003).
# Validate with: kaizen-agentic schedule validate
version: '1'
timezone: Europe/Berlin
agents:
coach:
cadence: daily
cron: "0 9 * * *"
cadence: weekly
cron: 0 9 * * 1
enabled: true
optimization:
cadence: daily
cron: "0 10 * * *"
cadence: weekly
cron: 0 10 * * 1
enabled: true
tdd-workflow:
cadence: monthly
enabled: false

.repo-classification.yaml Normal file
@@ -0,0 +1,28 @@
# Repo classification (Repo Classification Standard v1.0).
repo_classification:
standard: Repo Classification Standard
version: '1.0'
classified_at: '2026-06-22'
classified_by: human
category: tooling
domain: infotech
secondary_domains:
- agents
capability_tags:
- workflow
- orchestration
- automation
- coordination
- observability
business_stake:
- technology
- operations
- automation
- execution
business_mechanics:
- coordination
- operation
- adaptation
notes: Org-wide event bridge / task factory (Temporal-based). Active bounded implementation
-> project.

@@ -4,7 +4,7 @@
**Purpose:** Durable task factory built on Temporal. Manages ActivityDefinitions, schedules recurring workflows via Temporal Schedules, routes events via NATS JetStream, and exposes a FastAPI CRUD surface for the custodian domain.
**Domain:** custodian
**Domain:** infotech
**Repo slug:** activity-core
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
**Workplan prefix:** `ACTIVITY-WP-`
@@ -83,7 +83,7 @@ curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
2. Check inbox: `GET /messages/?to_agent=activity-core&unread_only=true`; mark read
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
4. Check blocked tasks: `GET /tasks/?needs_human=true`
4. Check human-needed tasks: `GET /tasks/?needs_human=true`
**During work:**
- Update task statuses in workplan files as tasks progress
@@ -101,6 +101,78 @@ curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
---
## Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=activity-core` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
<!-- REPO-AGENTS-EXTENSIONS -->
<!-- Append repo-specific agent instructions below this marker.
The state-hub template sync preserves content after this line. -->
---
## Automation Scheduling Preference
Durable activity-core automations must use this repo's own infrastructure:
Temporal Schedules, NATS JetStream, activity-core run records, State Hub
progress, and configured report/evidence sinks. Do not use coding
assistant-provided automation, reminder, or heartbeat tooling as the execution
or evidence source for production or operational recurrence.
Coding assistants may run repo-native inspection commands and summarize their
outputs, but the baseline answer to questions like "How did our automations go
since Friday?" must come from deterministic local tooling such as the
ACTIVITY-WP-0018 automation status surface.
---
## Workplan Convention (ADR-001)
Work items originate as files in this repo — not in the hub. The hub is a
@@ -124,7 +196,7 @@ anything needing analysis, design, approval, dependencies, or multiple phases.
id: ACTIVITY-WP-NNNN
type: workplan
title: "..."
domain: custodian
domain: infotech
repo: activity-core
status: proposed | ready | active | blocked | backlog | finished | archived
owner: codex
@@ -154,10 +226,7 @@ state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
Task description text.
```
Status progression: `todo` → `progress` → `done`; use `wait` for a task
blocked on external input and `cancel` for intentionally abandoned work.
Workstream/workplan lifecycle status is separate; frontmatter `blocked` remains
valid there.
Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.
To create a new workplan:
1. Write the file following the format above

@@ -8,4 +8,5 @@
@.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md
@.claude/rules/repo-boundary.md
@.claude/rules/credential-routing.md
@.claude/rules/agents.md

@@ -1,13 +1,17 @@
-include .env
export
.PHONY: sync-event-types sync-activity-definitions test migrate sync-all \
.PHONY: sync-event-types sync-activity-definitions sync-schedules test migrate sync-all \
automation-status automation-status-json automation-list automation-list-json \
dev-up dev-down railiance-up railiance-down \
start-worker start-api start-event-router help
sync-activity-definitions: ## Sync ActivityDefinition files into DB
uv run python -m activity_core.sync_activity_definitions
sync-schedules: ## Reconcile Temporal schedules from activity_definitions DB
uv run python -m activity_core.sync_schedules
sync-event-types: ## Sync event type YAML files into DB
uv run python scripts/sync_event_types.py
@@ -21,6 +25,27 @@ migrate: ## Apply all pending Alembic migrations
sync-all: sync-event-types sync-activity-definitions ## Sync event types and activity definitions
# -- Automation status ---------------------------------------------------------
SINCE ?= today
FORMAT ?= human
ENABLED ?= all
TRIGGER ?=
ACTIVITY_ID ?=
ACTIVITY_NAME ?=
automation-status: ## Report recent automation status from repo-owned evidence
uv run python scripts/automation_status.py --since "$(SINCE)" $(if $(UNTIL),--until "$(UNTIL)",) --format "$(FORMAT)"
automation-status-json: ## Report recent automation status as JSON
$(MAKE) automation-status FORMAT=json
automation-list: ## List configured scheduled automations from repo-owned definitions
@uv run python scripts/automation_inventory.py --format "$(FORMAT)" --enabled "$(ENABLED)" $(if $(TRIGGER),--trigger-type "$(TRIGGER)",) $(if $(ACTIVITY_ID),--activity-id "$(ACTIVITY_ID)",) $(if $(ACTIVITY_NAME),--activity-name "$(ACTIVITY_NAME)",)
automation-list-json: ## List configured scheduled automations as JSON
@$(MAKE) --no-print-directory automation-list FORMAT=json
# ── Infrastructure ─────────────────────────────────────────────────────────────
dev-up: ## Start full dev stack (Temporal + PG + ES + NATS)

@@ -64,7 +64,9 @@ The two evaluation modes:
`context.*` / `event.*` interpolation and explicit `for_each` per-item
binding. No `exec()`.
- **Instruction executor**: trusted-field prompt rendering, LLM call via
llm-connect, structured output validation, bounded validation-failure
llm-connect, structured output validation, item-granular recovery with a
quarantine lane and producer guardrails (count/length/depth caps, reference
allow-list) at the producer trust boundary, bounded validation-failure
artifacts for report instructions, review-required audit metadata, and
deterministic report sinks. A real downstream review queue is not implemented
in this repo.
@@ -88,6 +90,9 @@ The two evaluation modes:
- **REST admin API** (FastAPI): CRUD for ActivityDefinitions, manual trigger,
event type registry queries.
- **Prometheus metrics**: Temporal SDK metrics exposed for scraping.
- **Automation status surface**: deterministic, non-LLM status reporting via
`make automation-status` / `scripts/automation_status.py`, using repo-owned
evidence sources rather than coding assistant scheduler state.
- **Operational runbook**: `docs/runbook.md`.
---
@@ -114,6 +119,10 @@ The two evaluation modes:
runs on Railiance infrastructure (or Docker Compose for dev).
- **End-user task UI** — tasks land in issue-core; presentation is separate.
- **Synchronous request-response patterns** — Temporal is async-first.
- **Coding assistant automation infrastructure** — assistant-provided reminders,
heartbeats, or scheduled jobs are not the execution or evidence authority for
activity-core automations. Assistants may run and summarize repo-native
commands only.
---
@@ -130,6 +139,8 @@ The two evaluation modes:
commands.
- You are replacing scattered bespoke cron jobs and manual coordination with
a governed, observable automation layer.
- You need to answer "how did our automations go since Friday?" from
deterministic repo-native evidence before any optional LLM summary.
---
@@ -320,6 +331,9 @@ new one-off control paths.
governance model, event type schema, ActivityDefinition structure.
- `docs/adr/adr-003-rule-instruction-model.md` — Rule DSL, Instruction safety
model, evaluation semantics, audit trail, testing strategy.
- `docs/adr/adr-004-producer-trust-boundary.md` — untrusted-producer premise,
trust-but-handle vs verify-and-mitigate postures, error-locality and
quarantine-with-provenance, producer guardrails for LLM/agent/human output.
---

@@ -0,0 +1,156 @@
---
id: ACT-ADR-004
type: architecture-decision-record
title: "The Producer Trust Boundary — Guardrails and Error-Correction for Untrusted Output"
status: accepted
decided_by: Bernd Worsch
date: "2026-06-26"
scope: cross-repo
affects:
- activity-core
- rules-core (future extraction)
tags: ["architecture", "llm", "safety", "validation", "guardrails", "trust-boundary", "resilience"]
---
# ACT-ADR-004: The Producer Trust Boundary
## Status
Accepted.
## Context
On 2026-06-26 the scheduled daily WSJF triage instruction fired on time, called
llm-connect successfully, and produced a long ranked recommendation list — but
the JSON broke at char 5268 (~rank 8–9 of ~16), failing schema validation. Because
the report was validated and consumed as a single monolithic JSON document, one
malformed delimiter discarded the **entire** run, including the 7 perfectly good
recommendations the model had already emitted. The scheduling and runtime layers
were healthy; the failure was entirely at the seam where free-form model output
meets a strict consumer.
This is not a one-off bug but a recurring class of failure. activity-core has a **trust
boundary** wherever generative or human-authored output meets strict deterministic
consumers: the JSON Schema validator, the task emitter, and any classic compute
pipeline downstream. The producers on the other side of that boundary — **LLMs,
agents, and humans** — are all *untrusted producers*. Their output may be:
- **erroneous** — hallucination, truncation at a token limit, drift, type slips,
typos, a missing delimiter; or
- **malicious** — prompt injection, crafted payloads, or oversized / deeply-nested
structures intended to exhaust or confuse the consumer.
The pre-existing design treated producer output optimistically: parse the whole
document, validate the whole document, and on any failure discard the whole
document (preserving only a bounded diagnostic preview). That gives **zero error
locality** — the blast radius of any single defect is the entire activation.
## Decision
Treat the producer→consumer seam as an explicit, adversarial **trust boundary**,
and place guardrails plus error-correction tooling *at that boundary* rather than
letting raw producer output flow into deterministic consumers.
### Two non-fail-fast postures
When hard-failing on a problem is undesirable, there are two sound strategies, and
they **compose**:
- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output
as-is; on exception, catch → repair → retry → or quarantine. Cheap on the happy
path; blast radius depends entirely on how granular the catch is. Best when
failures are rare and locally recoverable. Risk: failures surface late, possibly
after partial side effects.
- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp,
and normalize the output to a known-good shape *before* it enters the pipeline —
drop bad items, coerce types, bound sizes/depth, allow-list references — so the
consumer only ever sees clean input. Higher upfront cost, smaller blast radius,
no partial side effects. Best when failures are common or consequences are high.
### Governing principles
1. **Push verification to the boundary; keep the interior strict.** Apply posture
**B** at the producer→consumer boundary; keep posture **A** for residual
exceptions inside the verified core. Never relax the interior schema to absorb
producer sloppiness.
2. **Make error locality match the unit of work.** One bad recommendation must
cost one recommendation, not the whole report. Structuring the payload so each
item is independently parseable and validatable is the highest-leverage change.
3. **Quarantine, never silently drop.** Invalid units are preserved as bounded,
provenance-tagged artifacts (`index`, `error`, `raw` snippet, `reason`) so they
can be debugged or replayed. Degraded-but-usable is reported distinctly from
total loss.
4. **Both human and agent input get the same rigor.** Guardrails are
producer-agnostic: the same count / length / depth caps and reference
allow-lists apply whether the producer is an LLM, an agent, or a human.
### What this means concretely in activity-core
Implemented in `src/activity_core/rules/executor.py`:
- **Strict-structure-only schema.** The daily-triage output schema is strict on
per-item *structure* (`required [rank, candidate, action, why]`, typed `wsjf`)
and carries `maxItems` as a producer *hint* — never as a hard whole-document
reject, which would reproduce the very blast-radius failure (ACT-ADR-002 governs
the schema format; `schemas/daily-triage-report.json`).
- **Item-granular recovery (posture B).** When whole-document parse + one retry
fail, `_resilient_report` recovers individually-parseable recommendation objects
via a brace/quote-aware scanner (`_extract_object_spans`) that works for both
pretty-printed and NDJSON output, attempts a best-effort `_try_repair` on a
truncated tail, validates each recovered object against the item schema, and
keeps the valid ones. Survivors are emitted with `output_validated=true`,
`partial=true`, and `review_required=true`.
- **Producer guardrails (`_partition_items`, applied on both the recovery and the
happy path).** Per recommendation: structural type → schema → structural caps
(`_MAX_DEPTH`, `_MAX_STRING_LEN`) → reference allow-list → count cap (top-N by
`maxItems`). The first failing check quarantines the item with provenance and a
`reason` (`malformed` / `schema` / `guardrail` / `allow_list` / `over_limit`).
- **Reference allow-list.** A recommendation whose `candidate` is not in the set of
known ids is quarantined. The set is sourced from resolved context
(`context["known_candidates"]`, via `_allow_list_from_context`); the check is
inert until a context resolver populates it, so the capability ships now and
activates with a one-line resolver change.
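
The item-granular recovery described above can be sketched as a small scanner.
The ADR names `_extract_object_spans` but does not show its body, so the
functions below are hypothetical stand-ins: they assume the recommendations
follow the first `[` after a `"recommendations"` key, and they omit the
`_try_repair` step for a truncated tail.

```python
import json


def extract_object_spans(text: str, start: int = 0) -> list[tuple[int, int]]:
    """Return (begin, end) slices of complete {...} objects found from `start`.

    Braces inside JSON strings are ignored, and escaped quotes do not end a
    string, so the scan works for pretty-printed and one-object-per-line output.
    """
    spans = []
    depth = 0
    in_string = False
    escaped = False
    begin = -1
    for pos in range(start, len(text)):
        ch = text[pos]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                begin = pos
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append((begin, pos + 1))
    return spans


def recover_items(raw: str) -> tuple[list[dict], list[str]]:
    """Recover parseable items from a broken report; return (items, rejected)."""
    key = raw.find('"recommendations"')
    bracket = raw.find("[", key) if key != -1 else -1
    if bracket == -1:
        return [], []
    items, rejected = [], []
    for lo, hi in extract_object_spans(raw, bracket + 1):
        try:
            items.append(json.loads(raw[lo:hi]))
        except json.JSONDecodeError:
            rejected.append(raw[lo:hi][:200])  # keep a bounded snippet for quarantine
    return items, rejected
```

A truncated final object never reaches brace depth zero, so it is simply not
reported as a span; the complete objects before it survive.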
### Where each posture sits
| Layer | Posture | Mechanism |
|-------|---------|-----------|
| Schema / contract | B | strict per-item structure; `maxItems` as hint |
| Whole-document parse | A | tolerant parse + single retry |
| Failed parse | B | item-granular recovery + repair + quarantine |
| Per-item screening | B | schema + depth/length caps + allow-list + count cap |
| Emitted report | — | `partial` / `quarantined_*` provenance; never silent |
## Consequences
- A single malformed or oversized item no longer discards an entire activation;
the daily-triage run that failed on 2026-06-26 would now deliver its 7 valid
recommendations and quarantine the broken tail.
- Reports gain a `partial` / `quarantined_*` vocabulary; downstream report sinks
and reviewers can distinguish degraded-but-usable from total loss.
- Guardrail thresholds (`_MAX_DEPTH`, `_MAX_STRING_LEN`, `maxItems`, the
allow-list) are policy knobs that will need tuning; they are intentionally
conservative defaults, not a finished calibration.
- **Known retention gap (follow-on):** `LLMConnectClient.complete()` still returns
only `content`, discarding `finish_reason`/`usage`, and the total-loss artifact
caps raw output below realistic break points. Capturing those signals so
failures stay debuggable is tracked as a retention fix, not closed by this ADR.
## Alternatives considered
- **Hard-enforce `maxItems` in the validator.** Rejected: a hard reject of an
over-count document reproduces the whole-document blast radius. Mitigation (keep
top-N, quarantine the rest) is preferred.
- **Relax the schema to accept anything.** Rejected: violates principle 1; pushes
malformed data into downstream consumers.
- **Retry-until-valid only (pure posture A).** Rejected as the sole strategy: the
2026-06-26 failure recurred across both the initial attempt and the retry, so
retry alone does not bound the blast radius.
## References
- ACT-ADR-002 — markdown-as-definition format and output schema governance.
- ACT-ADR-003 — Rule vs. Instruction model; the Instruction prompt-injection
surface this boundary complements on the output side.
- `workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md` — the
implementing workplan.

@@ -11,7 +11,9 @@ The current authoritative boundary is the issue-core REST API:
POST {ISSUE_CORE_URL}/issues/
```
`IssueCoreRestSink` sends this payload:
`IssueCoreRestSink` authenticates with the shared `ISSUE_CORE_API_KEY` env var
(same value as the issue-core server) via `Authorization: Bearer <key>` and
sends this payload:
```json
{
@@ -52,7 +54,7 @@ task reference before it can replace `IssueCoreRestSink`.
Weekly SBOM staleness is safe to evaluate in dry-run mode because the rule
contract is deterministic and tested. Do not enable it against the real REST sink
until issue-core credentials, endpoint reachability, and duplicate-handling are
until `ISSUE_CORE_API_KEY`, endpoint reachability, and duplicate-handling are
verified in the target environment.
## Verification

@@ -116,7 +116,129 @@ asyncio.run(publish())
---
## Syncing schedules manually
## Syncing definitions and schedules manually
When the API is running, prefer the admin sync endpoint for definition or
schedule changes. It refreshes file-backed ActivityDefinitions and reconciles
Temporal Schedules without restarting the worker:
```bash
curl -s -X POST \
'http://localhost:8010/admin/sync?definitions=true&schedules=true'
```
The response reports:
- `definitions.synced`
- `event_types.synced`
- `schedules.upserted`
- `schedules.paused`
- `schedules.deleted_orphans`
- bounded `errors[]`
## Automation inventory
Use the repo-native inventory command to answer "what automations are scheduled
at all?" before checking whether a recent window succeeded. The command is
read-only: it loads ActivityDefinition rows or files and, when `TEMPORAL_HOST`
is configured, describes Temporal schedules for visibility. It does not sync,
upsert, pause, delete, or enqueue schedules.
```bash
# Human-readable configured automation inventory.
make automation-list
# JSON for scripts or assistant summarization.
make automation-list-json
# Common filters.
make automation-list ENABLED=true TRIGGER=cron
make automation-list ACTIVITY_ID=6fca51fa-387a-4fd0-bc4e-d62c29eb859a
```
Inventory answers what is configured; `make automation-status` answers what
happened in a time window. Missing optional live sources are warnings, not
silent omissions, so a degraded local run still lists repo definition files.
Compact human output looks like:
```text
- Daily State Hub WSJF Triage [enabled cron] schedule=activity-schedule-... trigger=20 7 * * * tz=Europe/Berlin source=files temporal=not_checked
```
## Automation status
Use the repo-native status command to answer operator questions such as "how did
our automations go since Friday?". This is the baseline evidence surface; LLMs
or coding assistants may summarize the output, but they are not the scheduler or
source of truth.
```bash
# Human-readable status. `friday` resolves in Europe/Berlin by default.
make automation-status SINCE=friday
# JSON for scripts or assistant summarization.
make automation-status-json SINCE=2026-06-26
```
The command reads activity-core owned evidence only: ActivityDefinition files or
DB rows, `activity_runs`, State Hub progress, working-memory report notes, and
Temporal visibility when `TEMPORAL_HOST` is configured. Missing live sources are
reported as warnings rather than hidden. It exits non-zero for real automation
failures such as `missed`, `validation_failed`, or `sink_failed`.
Useful knobs:
```bash
AUTOMATION_STATUS_TIMEOUT_SECONDS=10 make automation-status SINCE=friday
make automation-status SINCE=2026-06-26 FORMAT=json
make automation-status SINCE=2026-06-26 UNTIL=2026-06-27 ACTCORE_DB_URL=
```
Example distinction from the June 2026 daily triage evidence:
```text
- Activity 6fca51fa-387a-4fd0-bc4e-d62c29eb859a [validation_failed] expected=0 runs=0 evidence=2
evidence state_hub_progress event_type=daily_triage run=ebec6e41... output_validated=false validation_error=Unterminated string...
evidence state_hub_progress event_type=daily_triage run=c7370f9c... output_validated=false validation_error=Expecting ',' delimiter...
```
That means the schedule/report path left evidence, but the report was not a
clean validated output. Disabled schedules, such as the gated weekly coding
retro, are reported as `disabled` and are not counted as missed runs.
`event_types` defaults to `false` for this endpoint because event-triggered
definitions already reload from the DB in the event router path; opt in when
the operator intentionally changed event type definition files:
```bash
curl -s -X POST \
'http://localhost:8010/admin/sync?definitions=true&schedules=true&event_types=true'
```
The v1 posture is manual/operator-triggered sync. A periodic background loop is
deferred until live use shows it is needed; this keeps customer definition
changes explicit and avoids background repo scanning from the worker.
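
For scripted use, the same call can be built with the Python standard library.
This is a minimal sketch: `build_sync_request` is a hypothetical helper, not part
of activity-core, and it assumes the endpoint accepts lowercase `true`/`false`
query values as in the curl examples.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sync_request(
    base_url: str = "http://localhost:8010",
    *,
    definitions: bool = True,
    schedules: bool = True,
    event_types: bool = False,
) -> Request:
    """Build (but do not send) the POST request for the admin sync endpoint."""
    query = {
        "definitions": str(definitions).lower(),
        "schedules": str(schedules).lower(),
    }
    # event_types defaults to false on the endpoint, so only send it on opt-in.
    if event_types:
        query["event_types"] = "true"
    return Request(f"{base_url}/admin/sync?{urlencode(query)}", method="POST")
```

Pass the result to `urllib.request.urlopen` and decode the JSON body to read the
`definitions.synced`, `schedules.upserted`, and `errors[]` fields listed earlier.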
### Railiance01 no-restart smoke
After changing a projected definition in `k8s/railiance/20-runtime.yaml`,
apply the ConfigMap and wait for the API pod volume to refresh (up to ~60s),
then reconcile without restarting `actcore-worker`:
```bash
export KUBECONFIG=~/.kube/config-hosteurope
kubectl apply -f k8s/railiance/20-runtime.yaml
sleep 60
kubectl -n activity-core exec deploy/actcore-api -- \
python3 -c 'import urllib.request; req=urllib.request.Request("http://localhost:8010/admin/sync?definitions=true&schedules=true", method="POST"); print(urllib.request.urlopen(req).read().decode())'
```
Automated regression for the disabled `ops-service-inventory-probes`
projection (enable/cadence flip, idempotent repeat sync, rollback) lives in
`scripts/smoke_admin_sync_no_restart.py`.
If the API is unavailable, the schedule-only CLI remains available:
```bash
TEMPORAL_HOST=localhost:7233 \
@@ -126,7 +248,7 @@ ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
This reconciles all Temporal Schedules with the `activity_definitions` table:
- Upserts schedules for every enabled cron definition
- Creates paused schedules for disabled cron definitions
- Creates paused schedules for disabled cron or one-shot scheduled definitions
- Deletes orphaned schedules with no matching DB row
After adding or changing a recurring ActivityDefinition or workflow activity
@@ -282,6 +404,52 @@ the same durable consumer name provides automatic failover.
---
## Run-miss recovery policies (cron triggers)
A cron fire is **missed** when the worker or Temporal is unavailable at trigger
time. `trigger_config.misfire_policy` selects what happens when the system
recovers. Each policy combines a Temporal **catchup window** (how far back missed
fires are recovered) with an **overlap policy** (what to do if a recovered fire
would start while a prior run is still executing):
| `misfire_policy` | Behaviour | Default catchup window | Overlap |
| --- | --- | --- | --- |
| `skip` | Run on trigger or skip — a missed fire is never recovered | 60s grace | `SKIP` |
| `catchup_all` | Recover **every** fire missed during the outage | 365 days | `BUFFER_ALL` |
| `catchup_latest` | Recover only the **most recent** missed fire; no backlog | 24h | `BUFFER_ONE` |
Set `trigger_config.catchup_window_seconds` to override the per-policy default
(e.g. an hourly definition using `catchup_latest` should set it to ~3600 so a
single missed hour is recovered but older ones are not).
Legacy values are still accepted: `catchup` → `catchup_all`,
`compress` → `catchup_latest`.
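
The table and the override rule can be read as the following resolution logic.
This is a sketch under stated assumptions, not the activity-core implementation:
`MisfireSettings` and `resolve_misfire` are hypothetical names, the legacy values
are treated as aliases, and falling back to `skip` when no policy is set is an
assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisfireSettings:
    catchup_window_seconds: int
    overlap: str


# Per-policy defaults taken from the table above.
DEFAULTS = {
    "skip": MisfireSettings(60, "SKIP"),
    "catchup_all": MisfireSettings(365 * 24 * 3600, "BUFFER_ALL"),
    "catchup_latest": MisfireSettings(24 * 3600, "BUFFER_ONE"),
}

# Legacy spellings mapped onto the current policy names.
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}


def resolve_misfire(trigger_config: dict) -> MisfireSettings:
    """Resolve the catchup window and overlap policy for one cron trigger."""
    policy = trigger_config.get("misfire_policy", "skip")  # assumed default
    policy = LEGACY_ALIASES.get(policy, policy)
    if policy not in DEFAULTS:
        raise ValueError(f"unknown misfire_policy: {policy!r}")
    base = DEFAULTS[policy]
    override = trigger_config.get("catchup_window_seconds")
    if override is None:
        return base
    # An explicit window replaces the default; the overlap policy is unchanged.
    return MisfireSettings(int(override), base.overlap)
```

For the hourly example, `{"misfire_policy": "catchup_latest",
"catchup_window_seconds": 3600}` keeps `BUFFER_ONE` and narrows the window to one
hour.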
> **Why this exists:** before ACTIVITY-WP-0014 no catchup window was set, so a
> brief outage at trigger time silently dropped the fire with no recovery and no
> log line. The `daily-statehub-wsjf-triage` definition now uses `catchup_latest`.
## State Hub write idempotency (ACTIVITY-WP-0014 T05)
Every State Hub write from activity-core (report-sink progress, ops-evidence
progress, schedule-miss alerts) carries a stable **`Idempotency-Key`** header
derived deterministically from the write's identity
(`run_id:instruction_id:event_type`, or `schedule_miss:activity_id:last_fired`
for miss alerts). This makes writes safe to **buffer and replay** under the
planned State Hub *beachhead* (per-machine read cache + write outbox): a flush —
possibly retried after an outage — cannot create duplicate progress/triage
events once State Hub / the beachhead honours the header.
The guarantee lives on the write, not on a live dedup read. The read-based
`_progress_exists` check is now best-effort only: if State Hub is unreachable it
returns `False` (proceed to the keyed write) rather than hard-failing. The header
passes untouched through the `actcore-state-hub-bridge` proxy and is ignored by
State Hub versions that do not yet honour it.
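
A minimal sketch of the key derivation, assuming the key is the plain
colon-joined identity shown above. Whether activity-core hashes or otherwise
encodes the key is not stated here, and the function names are hypothetical.

```python
def report_write_key(run_id: str, instruction_id: str, event_type: str) -> str:
    """Key for report-sink and ops-evidence progress writes."""
    return f"{run_id}:{instruction_id}:{event_type}"


def schedule_miss_key(activity_id: str, last_fired: str) -> str:
    """Key for schedule-miss alerts."""
    return f"schedule_miss:{activity_id}:{last_fired}"


def keyed_headers(key: str) -> dict[str, str]:
    """Headers for a State Hub write; the same key is sent on every replay."""
    return {"Idempotency-Key": key, "Content-Type": "application/json"}
```

The key depends only on the identity of the write, never on wall-clock time or a
retry counter, so a flush replayed after an outage carries the same header as the
first attempt.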
> The queue/cache itself is **not** built in activity-core — it belongs to the
> state-hub beachhead. activity-core only emits the key. See the proposal sent to
> the `state-hub` agent.
## Troubleshooting
### Worker fails to start: "ACTCORE_DB_URL is required"
@@ -291,6 +459,9 @@ Set the environment variable before running the worker.
1. Check Temporal UI → Schedules tab for the schedule status.
2. Ensure `enabled=True` on the ActivityDefinition (paused schedules don't fire).
3. Verify the cron expression with: `docker exec temporal-admin-tools temporal schedule describe --schedule-id activity-schedule-<uuid>`
4. If a fire was **missed entirely** (no run, no failure event) during an outage,
check `misfire_policy` — under `skip` missed fires are dropped by design. Use
`catchup_all` or `catchup_latest` to recover them. See *Run-miss recovery policies*.
### Event not routing
1. Check NATS monitoring: http://localhost:8222/jsz to verify the `ACTIVITY_EVENTS` stream exists.

@@ -14,8 +14,8 @@ data:
LLM_CONNECT_URL: http://llm-connect.activity-core.svc.cluster.local:8080
LLM_CONNECT_TIMEOUT_SECONDS: "300"
REPO_SCOPING_URL: http://repo-scoping.repo-scoping.svc.cluster.local:8020
ISSUE_CORE_URL: http://issue-core.issue-core.svc.cluster.local:8010
ISSUE_SINK_TYPE: "null"
ISSUE_CORE_URL: http://actcore-issue-core-bridge.activity-core.svc.cluster.local:8765
ISSUE_SINK_TYPE: "rest"
ACTIVITY_DEFINITION_DIRS: /etc/activity-core/external-definitions
OPS_INVENTORY_PATH: /etc/activity-core/ops/service-inventory.yml
INTER_HUB_URL: ""
@@ -47,7 +47,10 @@ data:
type: cron
cron_expression: "20 7 * * *"
timezone: Europe/Berlin
misfire_policy: skip
# ACTIVITY-WP-0014: recover the most recent missed daily fire when the
# worker/Temporal was unavailable at trigger time, without accumulating a
# backlog after a multi-day outage.
misfire_policy: catchup_latest
context_sources:
- type: static
bind_to: context.prompt_path
@@ -91,15 +94,19 @@ data:
Score each recommendation with the WSJF rubric from the prompt:
(strategic_value + time_criticality + risk_reduction +
opportunity_enablement) / job_size. Use integer factor values from 1 to 5,
round score to one decimal place, sort recommendations by rank, and return at
most 10 recommendations.
round score to one decimal place, sort recommendations by rank, and return
only the bounded top-7 (at most 7) ranked recommendations. If uncertain,
emit fewer well-formed recommendations rather than more.
Curated digest:
{context.daily_triage_digest}
Return only JSON matching
`/etc/activity-core/schemas/daily-triage-report.json`. Do not wrap the JSON
in Markdown fences or add prose before or after it:
`/etc/activity-core/schemas/daily-triage-report.json`. Emit the "summary"
field first, then inside the "recommendations" array write one complete
recommendation JSON object per line (NDJSON-style per-item framing) so
each item can be recovered independently if the output is truncated. Do
not wrap the JSON in Markdown fences or add prose before or after it:
{
"summary": "short operator-facing summary",
"recommendations": [
@@ -164,6 +171,36 @@ data:
Kubernetes projection of the Custodian-owned definition in
`/home/worsch/the-custodian/activity-definitions/hourly-recently-on-scope.md`.
state-hub-consistency-sweep.md: |
---
id: "7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b"
name: "State Hub Consistency Sweep"
type: activity-definition
version: "1.0"
enabled: true
owner: custodian
governance: custodian
status: active
created: "2026-06-21"
trigger:
type: cron
cron_expression: "*/15 * * * *"
timezone: UTC
misfire_policy: skip
context_sources:
- type: state-hub
query: consistency_sweep_remote_all
required: true
params:
max_seconds: 300
source: activity-core
bind_to: context.consistency_sweep_remote_all
---
# ActivityDefinition: State Hub Consistency Sweep
Kubernetes projection of the Custodian-owned definition in
`/home/worsch/the-custodian/activity-definitions/state-hub-consistency-sweep.md`.
ops-service-inventory-probes.md: |
---
id: "40d15a87-7ff6-4d8e-992c-37df15f95110"
@@ -399,7 +436,7 @@ data:
"recommendations": {
"type": "array",
"minItems": 1,
"maxItems": 10,
"maxItems": 7,
"items": {
"type": "object",
"required": ["rank", "candidate", "action", "why", "confidence", "wsjf"],
@@ -408,7 +445,7 @@ data:
"rank": {
"type": "integer",
"minimum": 1,
"maximum": 10
"maximum": 7
},
"candidate": {
"type": "string"
@@ -578,7 +615,8 @@ spec:
method=self.command,
)
try:
with urlopen(request, timeout=30) as response:
timeout = 360 if self.command == "POST" else 30
with urlopen(request, timeout=timeout) as response:
payload = response.read()
self.send_response(response.status)
for key, value in response.headers.items():
@@ -599,12 +637,123 @@ spec:
ThreadingHTTPServer(("0.0.0.0", 18080), Proxy).serve_forever()
readinessProbe:
httpGet:
path: /state/summary
path: /state/health
port: http
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 6
---
apiVersion: v1
kind: Service
metadata:
name: actcore-issue-core-bridge
namespace: activity-core
labels:
app.kubernetes.io/name: actcore-issue-core-bridge
app.kubernetes.io/part-of: activity-core
spec:
selector:
app.kubernetes.io/name: actcore-issue-core-bridge
ports:
- name: http
port: 8765
targetPort: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: actcore-issue-core-bridge
namespace: activity-core
labels:
app.kubernetes.io/name: actcore-issue-core-bridge
app.kubernetes.io/part-of: activity-core
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: actcore-issue-core-bridge
template:
metadata:
labels:
app.kubernetes.io/name: actcore-issue-core-bridge
app.kubernetes.io/part-of: activity-core
spec:
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
containers:
- name: proxy
image: activity-core:railiance01-prod
imagePullPolicy: Never
ports:
- name: http
containerPort: 18081
command:
- python
- -c
- |
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen
TARGET = "http://127.0.0.1:18765"
HOP_HEADERS = {"connection", "host", "keep-alive", "proxy-authenticate",
"proxy-authorization", "te", "trailers",
"transfer-encoding", "upgrade"}
class Proxy(BaseHTTPRequestHandler):
def do_GET(self):
self._proxy()
def do_POST(self):
self._proxy()
def do_PATCH(self):
self._proxy()
def _proxy(self):
length = int(self.headers.get("content-length", "0") or "0")
body = self.rfile.read(length) if length else None
headers = {
key: value
for key, value in self.headers.items()
if key.lower() not in HOP_HEADERS
}
request = Request(
TARGET + self.path,
data=body,
headers=headers,
method=self.command,
)
try:
timeout = 360 if self.command == "POST" else 30
with urlopen(request, timeout=timeout) as response:
payload = response.read()
self.send_response(response.status)
for key, value in response.headers.items():
if key.lower() not in HOP_HEADERS:
self.send_header(key, value)
self.end_headers()
self.wfile.write(payload)
except HTTPError as exc:
payload = exc.read()
self.send_response(exc.code)
self.end_headers()
self.wfile.write(payload)
except URLError as exc:
self.send_response(502)
self.end_headers()
self.wfile.write(str(exc).encode())
ThreadingHTTPServer(("0.0.0.0", 18081), Proxy).serve_forever()
readinessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 6
---
apiVersion: batch/v1
kind: Job
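
The daily-triage Instruction in this manifest asks the model to write one complete recommendation object per line so that items survive a truncated response. A minimal sketch of how a consumer might recover them, assuming the framing is honoured; `recover_recommendations` is a hypothetical helper, not the parser shipped in this repository:

```python
import json


def recover_recommendations(raw: str) -> list[dict]:
    """Collect every complete one-line JSON object from possibly truncated output."""
    recovered = []
    for line in raw.splitlines():
        # Array items end with a comma; strip it so the line parses on its own.
        candidate = line.strip().rstrip(",")
        if not (candidate.startswith("{") and candidate.endswith("}")):
            continue  # envelope lines, the summary line, or a cut-off tail
        try:
            item = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(item, dict):
            recovered.append(item)
    return recovered


# Hypothetical output that was cut off in the middle of the third item.
truncated = (
    '{\n'
    '"summary": "two items survive",\n'
    '"recommendations": [\n'
    '{"rank": 1, "candidate": "a", "action": "do", "why": "x"},\n'
    '{"rank": 2, "candidate": "b", "action": "do", "why": "y"},\n'
    '{"rank": 3, "candidate": "c", "act'
)
print([item["rank"] for item in recover_recommendations(truncated)])  # [1, 2]
```

The sketch only helps when each item really sits on its own line; a response emitted as a single line would need a different recovery strategy.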

@@ -1,4 +1,5 @@
{
"$comment": "ACTIVITY-WP-0016-T02. Strict, bounded contract for the daily WSJF triage report. The per-item 'recommendations' schema is intentionally strict on STRUCTURE (types + required keys) so the T03 boundary parser can validate each recommendation independently and quarantine only the malformed ones. 'maxItems' is a producer hint (honoured by llm-connect constrained decoding and by the prompt); it is deliberately NOT hard-enforced by the in-repo validator, because rejecting a whole report for having too many items would reproduce the monolithic-failure bug WP-0016 exists to remove. Over-count is mitigated in T03 (keep top-N by rank, quarantine the rest). Value-domain vocabularies (action/confidence) are documented in the prompt and enforced by T04 guardrails with mitigation, not as brittle hard-fail enums here.",
"type": "object",
"required": ["summary", "recommendations"],
"properties": {
@@ -7,8 +8,28 @@
},
"recommendations": {
"type": "array",
"maxItems": 7,
"items": {
"type": "object"
"type": "object",
"required": ["rank", "candidate", "action", "why"],
"properties": {
"rank": { "type": "integer" },
"candidate": { "type": "string" },
"action": { "type": "string" },
"why": { "type": "string" },
"confidence": { "type": "string" },
"wsjf": {
"type": "object",
"properties": {
"score": { "type": "number" },
"strategic_value": { "type": "number" },
"time_criticality": { "type": "number" },
"risk_reduction": { "type": "number" },
"opportunity_enablement": { "type": "number" },
"job_size": { "type": "number" }
}
}
}
}
}
}
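
The `$comment` above describes validating each recommendation independently, quarantining malformed ones, and keeping only the top-N by rank instead of rejecting the whole report. A rough sketch of that idea follows; `split_recommendations`, `REQUIRED`, and `MAX_ITEMS` are illustrative names, and the actual T03 parser may differ:

```python
MAX_ITEMS = 7  # mirrors "maxItems" in the schema above
REQUIRED = {"rank": int, "candidate": str, "action": str, "why": str}


def split_recommendations(items: list) -> tuple[list[dict], list]:
    """Return (kept, quarantined) without failing the whole report."""
    valid, quarantined = [], []
    for item in items:
        ok = isinstance(item, dict) and all(
            # bool is a subclass of int in Python, so exclude it explicitly.
            isinstance(item.get(key), kind) and not isinstance(item.get(key), bool)
            for key, kind in REQUIRED.items()
        )
        (valid if ok else quarantined).append(item)
    valid.sort(key=lambda item: item["rank"])
    # Over-count is mitigated, not rejected: keep the best-ranked MAX_ITEMS.
    return valid[:MAX_ITEMS], quarantined + valid[MAX_ITEMS:]


items = [
    {"rank": n, "candidate": f"c{n}", "action": "do", "why": "w"}
    for n in range(9, 0, -1)
]
items.append({"rank": "1", "candidate": "bad"})  # wrong type, missing keys
kept, quarantined = split_recommendations(items)
print([item["rank"] for item in kept])  # [1, 2, 3, 4, 5, 6, 7]
print(len(quarantined))  # 3
```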

@@ -0,0 +1,8 @@
#!/usr/bin/env python3
"""CLI wrapper for the repo-native automation inventory report."""
from activity_core.automation_status import inventory_main
if __name__ == "__main__":
raise SystemExit(inventory_main())

@@ -0,0 +1,8 @@
#!/usr/bin/env python3
"""CLI wrapper for the repo-native automation status report."""
from activity_core.automation_status import main
if __name__ == "__main__":
raise SystemExit(main())

@@ -0,0 +1,212 @@
#!/usr/bin/env python3
"""Railiance01 no-restart smoke for POST /admin/sync.
Patches the disabled ops-service-inventory-probes projection in the cluster
ConfigMap, waits for the API pod volume to refresh, runs /admin/sync twice,
verifies DB + Temporal schedule drift without restarting actcore-worker, then
rolls the ConfigMap back to the disabled baseline.
Requires:
- KUBECONFIG pointing at railiance01 (for example ~/.kube/config-hosteurope)
- kubectl access to the activity-core namespace
Example:
export KUBECONFIG=~/.kube/config-hosteurope
python3 scripts/smoke_admin_sync_no_restart.py
"""
from __future__ import annotations
import json
import subprocess
import sys
import time
ACTIVITY_ID = "40d15a87-7ff6-4d8e-992c-37df15f95110"
CONFIGMAP = "actcore-external-activity-definitions"
DEFINITION_KEY = "ops-service-inventory-probes.md"
MOUNTED_FILE = (
"/etc/activity-core/external-definitions/activity-definitions/"
f"{DEFINITION_KEY}"
)
VOLUME_PROPAGATION_SECONDS = 65
def kubectl(*args: str, input_text: str | None = None) -> str:
cmd = ["kubectl", "-n", "activity-core", *args]
return subprocess.check_output(
cmd,
input=input_text,
text=True,
)
def api_json(path: str, *, method: str = "GET") -> dict:
script = (
"import urllib.request, json\n"
f'req = urllib.request.Request("http://localhost:8010{path}", method="{method}")\n'
"print(urllib.request.urlopen(req).read().decode())"
)
return json.loads(kubectl("exec", "deploy/actcore-api", "--", "python3", "-c", script))
def worker_lines(script: str) -> list[str]:
return kubectl("exec", "deploy/actcore-worker", "--", "python3", "-c", script).splitlines()
def worker_uid() -> str:
return kubectl(
"get",
"pod",
"-l",
"app.kubernetes.io/name=actcore-worker",
"-o",
"jsonpath={.items[0].metadata.uid}",
).strip()
def load_configmap() -> dict:
return json.loads(kubectl("get", "configmap", CONFIGMAP, "-o", "json"))
def apply_configmap(cm: dict) -> None:
kubectl("apply", "-f", "-", input_text=json.dumps(cm))
def patch_definition(cm: dict, *, enabled: bool, cron: str) -> None:
text = cm["data"][DEFINITION_KEY]
for line in text.splitlines():
if line.strip().startswith("enabled:"):
break
else:
raise RuntimeError("enabled field not found in projection")
text = _replace_once(text, 'enabled: false', f"enabled: {'true' if enabled else 'false'}")
text = _replace_once(text, 'enabled: true', f"enabled: {'true' if enabled else 'false'}")
text = _replace_once(
text,
'cron_expression: "15 * * * *"',
f'cron_expression: "{cron}"',
)
text = _replace_once(
text,
'cron_expression: "25 * * * *"',
f'cron_expression: "{cron}"',
)
cm["data"][DEFINITION_KEY] = text
apply_configmap(cm)
def _replace_once(text: str, old: str, new: str) -> str:
if old not in text:
return text
return text.replace(old, new, 1)
def wait_for_mount(*, enabled: bool, cron: str) -> None:
deadline = time.time() + VOLUME_PROPAGATION_SECONDS
want_enabled = "enabled: true" if enabled else "enabled: false"
want_cron = f'cron_expression: "{cron}"'
while time.time() < deadline:
content = kubectl("exec", "deploy/actcore-api", "--", "cat", MOUNTED_FILE)
if want_enabled in content and want_cron in content:
return
time.sleep(5)
raise RuntimeError(
f"ConfigMap projection did not refresh within {VOLUME_PROPAGATION_SECONDS}s"
)
def get_definition() -> dict[str, object]:
for item in api_json("/activity-definitions/"):
if item["id"] == ACTIVITY_ID:
return {
"enabled": item["enabled"],
"cron": item["trigger_config"]["cron_expression"],
}
raise RuntimeError(f"ActivityDefinition {ACTIVITY_ID} not found")
def describe_schedule() -> dict[str, object]:
script = f"""
import asyncio
from temporalio.client import Client
async def main() -> None:
client = await Client.connect("actcore-temporal:7233")
handle = client.get_schedule_handle("activity-schedule-{ACTIVITY_ID}")
described = await handle.describe()
schedule = described.schedule
minute = schedule.spec.calendars[0].minute[0].start if schedule.spec.calendars else None
print(schedule.state.paused)
print(minute)
asyncio.run(main())
"""
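paused, minute = worker_lines(script)
# The worker script prints "None" when the schedule has no calendar spec;
# surface that as drift downstream instead of crashing in int().
return {"paused": paused == "True", "minute": None if minute == "None" else int(minute)}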
def main() -> int:
worker_before = worker_uid()
cm = load_configmap()
print("1) enable + cadence change via ConfigMap")
patch_definition(cm, enabled=True, cron="25 * * * *")
wait_for_mount(enabled=True, cron="25 * * * *")
print("2) POST /admin/sync (first pass)")
sync1 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
if not sync1.get("ok"):
print(json.dumps(sync1, indent=2), file=sys.stderr)
return 1
defn = get_definition()
schedule = describe_schedule()
print(" definition:", defn)
print(" schedule:", schedule)
if defn != {"enabled": True, "cron": "25 * * * *"}:
print("definition drift after sync", file=sys.stderr)
return 1
if schedule["paused"] or schedule["minute"] != 25:
print("schedule drift after enable sync", file=sys.stderr)
return 1
print("3) POST /admin/sync (idempotent repeat)")
sync2 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
if sync2.get("schedules") != sync1.get("schedules"):
print("idempotent schedule counts changed", file=sys.stderr)
print(json.dumps({"sync1": sync1, "sync2": sync2}, indent=2), file=sys.stderr)
return 1
print("4) rollback ConfigMap + sync")
cm = load_configmap()
patch_definition(cm, enabled=False, cron="15 * * * *")
wait_for_mount(enabled=False, cron="15 * * * *")
sync3 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
if not sync3.get("ok"):
print(json.dumps(sync3, indent=2), file=sys.stderr)
return 1
defn = get_definition()
schedule = describe_schedule()
print(" definition:", defn)
print(" schedule:", schedule)
if defn != {"enabled": False, "cron": "15 * * * *"}:
print("rollback definition drift", file=sys.stderr)
return 1
if not schedule["paused"] or schedule["minute"] != 15:
print("rollback schedule drift", file=sys.stderr)
return 1
worker_after = worker_uid()
if worker_before != worker_after:
print("actcore-worker pod restarted during smoke", file=sys.stderr)
return 1
print("smoke passed: admin sync hot-reload without worker restart")
return 0
if __name__ == "__main__":
raise SystemExit(main())

@@ -149,6 +149,8 @@ async def resolve_context(
query = source.get("query", "")
params = source.get("params") or {}
required = bool(source.get("required") or params.get("required", False))
resolver_params = dict(params)
resolver_params["required"] = required
raw_bind = source.get("bind_to") or source.get("name") or source_type
# Strip the 'context.' namespace prefix so evaluator can find the key.
bind_key = raw_bind.removeprefix("context.") if raw_bind.startswith("context.") else raw_bind
@@ -172,7 +174,7 @@ async def resolve_context(
continue
try:
resolved = resolver_cls().resolve(query, event_envelope, params)
resolved = resolver_cls().resolve(query, event_envelope, resolver_params)
snapshot[bind_key] = _bind_resolver_result(bind_key, resolved)
except Exception as exc:
if required:
@@ -364,6 +366,7 @@ async def evaluate_instructions(payload: dict) -> dict:
"output_validated": result.output_validated,
"review_required": result.review_required,
"validation_error": result.validation_error,
"llm_response_metadata": result.llm_response_metadata,
})
for spec in result.tasks:
task_specs.append({

@@ -40,6 +40,7 @@ from temporalio.client import Client
from activity_core.models import ActivityDefinition, CronTriggerConfig
from activity_core.orm import ActivityDefinition as ActivityDefinitionRow, EventType as EventTypeRow
from activity_core.schedule_manager import delete_schedule, upsert_schedule
from activity_core.sync_service import run_sync
from activity_core.webhook_receiver import router as webhook_router
TEMPORAL_HOST = os.environ.get("TEMPORAL_HOST", "localhost:7233")
@@ -275,6 +276,24 @@ async def trigger_definition(definition_id: uuid.UUID) -> dict[str, str]:
return {"workflow_id": handle.id, "trigger_key": trigger_key}
# --- Admin sync ---------------------------------------------------------------
@app.post("/admin/sync")
async def admin_sync(
definitions: bool = True,
schedules: bool = True,
event_types: bool = False,
) -> dict[str, Any]:
"""Run operator-triggered definition/event/schedule sync without restart."""
return await run_sync(
session_factory=_get_db(),
temporal_client=_get_temporal() if schedules else None,
definitions=definitions,
schedules=schedules,
event_types=event_types,
)
# T42: Curator gate — event type approval endpoint
@app.get("/health")

File diff suppressed because it is too large

@@ -4,4 +4,5 @@ from activity_core.context_resolvers import ( # noqa: F401
ops_inventory,
repo_scoping,
state_hub,
reuse_surface,
)

@@ -0,0 +1,516 @@
"""Reuse-surface registry hygiene context adapter.
Registered as source type ``reuse-surface`` and as the ``shell`` resolver
dispatcher for the ``reuse_surface_report_gaps`` query. Other shell queries
continue to delegate to the kaizen resolver for backward compatibility.
"""
from __future__ import annotations
import json
import logging
import os
import socket
import subprocess
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import httpx
import yaml
from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY, ContextResolver
from activity_core.context_resolvers.kaizen import KaizenContextResolver
from activity_core.context_resolvers.state_hub import StateHubContextResolver
logger = logging.getLogger(__name__)
_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
_REPORT_TIMEOUT_SECONDS = 60
_STATE_HUB_TIMEOUT_SECONDS = 10.0
_KNOWN_SIGNALS = frozenset(
{
"registry_gap",
"empty_capability_scaffold",
"stale_scope",
"stale_sbom",
"publish_check_fail",
}
)
@dataclass(frozen=True)
class RosterEntry:
slug: str
domain: str | None = None
publish_check: str | None = None
def _base_url() -> str:
return os.environ.get("STATE_HUB_URL", _DEFAULT_STATE_HUB_URL).rstrip("/")
def _runner_host(params: dict[str, Any]) -> str:
return str(
params.get("runner_host")
or os.environ.get("KAIZEN_RUNNER_HOST")
or socket.gethostname()
)
def _as_required(params: dict[str, Any]) -> bool:
return bool(params.get("required", False))
def reuse_surface_report_gaps(params: dict[str, Any]) -> dict[str, Any]:
"""Resolve registry-hygiene gaps for the next rollout batch.
Missing operational dependencies are visible failures for required sources
and graceful empty lists for optional sources so definitions can opt into
either behavior without changing rule logic.
"""
try:
return _resolve_reuse_surface_report_gaps(params)
except Exception as exc:
if _as_required(params):
raise
logger.warning("reuse_surface_report_gaps unavailable: %s", exc)
return {"gaps": []}
def _resolve_reuse_surface_report_gaps(params: dict[str, Any]) -> dict[str, Any]:
roster_path = _roster_path(params)
entries = _load_active_roster_entries(roster_path)
if not entries:
return {"gaps": []}
state_path = _round_robin_state_path(params, roster_path)
selected, next_cursor = _select_round_robin_batch(
entries,
_batch_size(params),
state_path,
)
if not selected:
return {"gaps": []}
signals = _enabled_signals(_signals_path(params, roster_path))
roots = _resolve_repo_roots(selected, _runner_host(params))
report = _reuse_surface_report(params, signals)
gaps = _gap_records(selected, roots, signals, report)
_write_round_robin_state(state_path, next_cursor, selected)
return {"gaps": gaps}
def _roster_path(params: dict[str, Any]) -> Path:
raw = params.get("roster")
if not raw:
raise ValueError("reuse_surface_report_gaps requires params.roster")
path = Path(str(raw)).expanduser()
if not path.is_file():
raise FileNotFoundError(f"reuse_surface_report_gaps roster not found: {path}")
return path
def _batch_size(params: dict[str, Any]) -> int:
try:
return max(1, int(params.get("batch_size", 3)))
except (TypeError, ValueError):
return 3
def _round_robin_state_path(params: dict[str, Any], roster_path: Path) -> Path:
raw = params.get("round_robin_state")
if raw:
return Path(str(raw)).expanduser()
return roster_path.with_name("round-robin-state.json")
def _signals_path(params: dict[str, Any], roster_path: Path) -> Path:
raw = params.get("signals")
if raw:
return Path(str(raw)).expanduser()
return roster_path.with_name("signals.yml")
def _load_active_roster_entries(path: Path) -> list[RosterEntry]:
data = yaml.safe_load(path.read_text(encoding="utf-8"))
if not isinstance(data, dict):
raise ValueError(f"reuse_surface rollout roster is not a mapping: {path}")
entries: dict[str, RosterEntry] = {}
for domain, block in _iter_domain_blocks(data):
if _domain_phase(block) != "active":
continue
for item in _repo_items(block):
entry = _entry_from_item(item, domain, block)
if entry and entry.slug not in entries:
entries[entry.slug] = entry
return list(entries.values())
def _iter_domain_blocks(data: dict[str, Any]) -> list[tuple[str | None, dict[str, Any]]]:
domains = data.get("domains")
if isinstance(domains, dict):
return [
(str(name), block)
for name, block in domains.items()
if isinstance(block, dict)
]
if isinstance(domains, list):
return [
(str(block.get("name") or block.get("domain") or ""), block)
for block in domains
if isinstance(block, dict)
]
if isinstance(data.get("active"), list):
return [(None, {"phase": "active", "repos": data["active"]})]
return [
(str(name), block)
for name, block in data.items()
if isinstance(block, dict) and ("phase" in block or "repos" in block)
]
def _domain_phase(block: dict[str, Any]) -> str:
return str(block.get("phase") or block.get("status") or "").lower()
def _repo_items(block: dict[str, Any]) -> list[Any]:
repos = (
block.get("repos")
or block.get("repo_slugs")
or block.get("repositories")
or block.get("slugs")
or []
)
if isinstance(repos, dict):
items: list[Any] = []
for slug, config in repos.items():
if isinstance(config, dict):
item = dict(config)
item.setdefault("slug", slug)
items.append(item)
else:
items.append(str(slug))
return items
if isinstance(repos, list):
return repos
return []
def _entry_from_item(
item: Any,
domain: str | None,
block: dict[str, Any],
) -> RosterEntry | None:
publish_check = block.get("publish_check")
if isinstance(item, str):
slug = item
elif isinstance(item, dict):
slug = item.get("slug") or item.get("repo") or item.get("name")
publish_check = item.get("publish_check", publish_check)
else:
return None
if not slug:
return None
return RosterEntry(
slug=str(slug),
domain=domain or None,
publish_check=str(publish_check).lower() if publish_check is not None else None,
)
def _select_round_robin_batch(
entries: list[RosterEntry],
batch_size: int,
state_path: Path,
) -> tuple[list[RosterEntry], int]:
if not entries:
return [], 0
cursor = _read_round_robin_cursor(state_path) % len(entries)
size = min(batch_size, len(entries))
selected = [entries[(cursor + offset) % len(entries)] for offset in range(size)]
next_cursor = (cursor + size) % len(entries)
return selected, next_cursor
def _read_round_robin_cursor(path: Path) -> int:
if not path.is_file():
return 0
try:
data = json.loads(path.read_text(encoding="utf-8"))
except (OSError, json.JSONDecodeError):
return 0
if not isinstance(data, dict):
return 0
try:
return int(data.get("cursor", 0))
except (TypeError, ValueError):
return 0
def _write_round_robin_state(
path: Path,
cursor: int,
selected: list[RosterEntry],
) -> None:
path.parent.mkdir(parents=True, exist_ok=True)
payload = {
"cursor": cursor,
"last_batch": [entry.slug for entry in selected],
"updated_at": datetime.now(timezone.utc).isoformat(),
}
path.write_text(
json.dumps(payload, indent=2, sort_keys=True) + "\n",
encoding="utf-8",
)
def _enabled_signals(path: Path) -> set[str]:
if not path.is_file():
return set(_KNOWN_SIGNALS)
data = yaml.safe_load(path.read_text(encoding="utf-8"))
node = data.get("signals") if isinstance(data, dict) else data
enabled: set[str] = set()
saw_known_signal = False
if isinstance(node, dict):
for name, config in node.items():
if str(name) not in _KNOWN_SIGNALS:
continue
saw_known_signal = True
if isinstance(config, dict) and config.get("enabled") is False:
continue
if config is False:
continue
enabled.add(str(name))
elif isinstance(node, list):
for item in node:
if isinstance(item, str) and item in _KNOWN_SIGNALS:
saw_known_signal = True
enabled.add(item)
elif isinstance(item, dict):
name = item.get("id") or item.get("signal") or item.get("name")
if str(name) in _KNOWN_SIGNALS and item.get("enabled", True) is not False:
saw_known_signal = True
enabled.add(str(name))
return enabled if saw_known_signal else set(_KNOWN_SIGNALS)
def _resolve_repo_roots(
entries: list[RosterEntry],
runner_host: str,
) -> dict[str, Path]:
requested = {entry.slug for entry in entries}
roots: dict[str, Path] = {}
for repo in _fetch_repos():
slug = str(repo.get("slug") or "")
if slug not in requested:
continue
raw = _repo_path_for_host(repo, runner_host)
if raw:
roots[slug] = Path(raw)
return roots
def _fetch_repos() -> list[dict[str, Any]]:
url = f"{_base_url()}/repos/"
try:
resp = httpx.get(url, timeout=_STATE_HUB_TIMEOUT_SECONDS)
resp.raise_for_status()
except httpx.HTTPError as exc:
raise RuntimeError(f"State Hub unreachable at {url}: {exc}") from exc
payload = resp.json()
if not isinstance(payload, list):
raise RuntimeError(f"State Hub /repos/ returned non-list: {type(payload)!r}")
return [repo for repo in payload if isinstance(repo, dict)]
def _repo_path_for_host(repo: dict[str, Any], runner_host: str) -> str | None:
host_paths = repo.get("host_paths") or {}
raw = None
if isinstance(host_paths, dict):
raw = host_paths.get(runner_host)
raw = raw or repo.get("local_path")
if not raw or raw == "(unknown)":
return None
return str(raw)
def _reuse_surface_report(params: dict[str, Any], signals: set[str]) -> dict[str, Any]:
if not (signals & {"registry_gap", "empty_capability_scaffold"}):
return {}
binary = str(params.get("reuse_surface_bin") or "reuse-surface")
try:
completed = subprocess.run(
[binary, "report", "gaps", "--format", "json"],
capture_output=True,
check=False,
text=True,
timeout=_REPORT_TIMEOUT_SECONDS,
)
except FileNotFoundError as exc:
raise RuntimeError(f"reuse-surface CLI not found: {binary}") from exc
except subprocess.TimeoutExpired as exc:
raise RuntimeError("reuse-surface report gaps timed out") from exc
if completed.returncode != 0:
detail = completed.stderr.strip() or completed.stdout.strip()
raise RuntimeError(f"reuse-surface report gaps failed: {detail}")
try:
payload = json.loads(completed.stdout or "{}")
except json.JSONDecodeError as exc:
raise RuntimeError("reuse-surface report gaps returned invalid JSON") from exc
if not isinstance(payload, dict):
raise RuntimeError("reuse-surface report gaps returned non-object JSON")
return payload
def _gap_records(
entries: list[RosterEntry],
roots: dict[str, Path],
signals: set[str],
report: dict[str, Any],
) -> list[dict[str, Any]]:
empty_scaffolds = _repo_set(report, {"empty_scaffolds", "empty_scaffold"})
publish_fail = _repo_set(
report,
{"publish_fail", "publish_fails", "publish_failures"},
)
gaps: list[dict[str, Any]] = []
seen: set[tuple[str, str]] = set()
for entry in entries:
root = roots.get(entry.slug)
if root is None:
logger.info("reuse_surface repo_unreachable slug=%s", entry.slug)
continue
if (
signals & {"registry_gap", "empty_capability_scaffold"}
and entry.slug in empty_scaffolds
):
_append_gap(gaps, seen, entry.slug, root, "empty_capability_scaffold")
if "registry_gap" in signals and entry.slug in publish_fail:
_append_gap(gaps, seen, entry.slug, root, "registry_gap")
if "publish_check_fail" in signals and entry.publish_check == "fail":
_append_gap(gaps, seen, entry.slug, root, "publish_check_fail")
if "stale_scope" in signals and _scope_is_stale(root):
_append_gap(gaps, seen, entry.slug, root, "stale_scope")
if "stale_sbom" in signals and _sbom_is_stale(entry.slug):
_append_gap(gaps, seen, entry.slug, root, "stale_sbom")
return gaps
def _append_gap(
gaps: list[dict[str, Any]],
seen: set[tuple[str, str]],
slug: str,
root: Path,
signal: str,
) -> None:
key = (slug, signal)
if key in seen:
return
seen.add(key)
gaps.append(
{
"repo": slug,
"root": str(root),
"signal": signal,
"hygiene_signal": signal,
}
)
def _scope_is_stale(root: Path) -> bool:
scope = root / "SCOPE.md"
if not scope.is_file():
return True
age_seconds = datetime.now(timezone.utc).timestamp() - scope.stat().st_mtime
return age_seconds > 90 * 24 * 60 * 60
def _sbom_is_stale(slug: str) -> bool:
payload = StateHubContextResolver().resolve(
"repo_sbom_status",
None,
{"repo_slug": slug},
)
if not isinstance(payload, dict):
return False
try:
return int(payload.get("sbom_age_days", 0)) > 30
except (TypeError, ValueError):
return False
def _repo_set(report: dict[str, Any], keys: set[str]) -> set[str]:
slugs: set[str] = set()
for value in _values_for_keys(report, keys):
slugs.update(_slugs_from_value(value))
return slugs
def _values_for_keys(value: Any, keys: set[str]) -> list[Any]:
values: list[Any] = []
if isinstance(value, dict):
for key, nested in value.items():
if key in keys:
values.append(nested)
values.extend(_values_for_keys(nested, keys))
elif isinstance(value, list):
for item in value:
values.extend(_values_for_keys(item, keys))
return values
def _slugs_from_value(value: Any) -> set[str]:
if isinstance(value, str):
return {value}
if isinstance(value, list):
slugs: set[str] = set()
for item in value:
slugs.update(_slugs_from_value(item))
return slugs
if isinstance(value, dict):
for key in ("repo", "repo_slug", "slug", "name"):
if value.get(key):
return {str(value[key])}
slugs = set()
for key, nested in value.items():
if nested is True or isinstance(nested, (dict, list)):
slugs.add(str(key))
slugs.update(_slugs_from_value(nested))
return slugs
return set()
class ReuseSurfaceContextResolver(ContextResolver):
"""Resolves reuse-surface registry hygiene gap reports."""
def resolve(self, query: str, event: Any, params: dict[str, Any]) -> dict[str, Any]:
if query == "reuse_surface_report_gaps":
return reuse_surface_report_gaps(params)
return {}
class ShellContextResolver(ContextResolver):
"""Dispatch shell-backed context queries without breaking kaizen aliases."""
def resolve(self, query: str, event: Any, params: dict[str, Any]) -> dict[str, Any]:
if query == "reuse_surface_report_gaps":
return reuse_surface_report_gaps(params)
return KaizenContextResolver().resolve(query, event, params)
CONTEXT_RESOLVER_REGISTRY["reuse-surface"] = ReuseSurfaceContextResolver
CONTEXT_RESOLVER_REGISTRY["shell"] = ShellContextResolver
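
The cursor arithmetic in `_select_round_robin_batch` is easier to follow in isolation. The sketch below restates it over plain slug strings with the cursor passed in directly; `select_batch` and the sample roster are illustrative only, and state-file persistence is left out:

```python
def select_batch(
    slugs: list[str], batch_size: int, cursor: int
) -> tuple[list[str], int]:
    """Pick the next batch, wrapping around the roster, and return the new cursor."""
    if not slugs:
        return [], 0
    start = cursor % len(slugs)  # tolerate a stale cursor after the roster shrinks
    size = min(batch_size, len(slugs))  # never select the same slug twice per batch
    batch = [slugs[(start + offset) % len(slugs)] for offset in range(size)]
    return batch, (start + size) % len(slugs)


roster = ["a", "b", "c", "d", "e"]
cursor = 0
for _ in range(3):
    batch, cursor = select_batch(roster, 3, cursor)
    print(batch, cursor)
# ['a', 'b', 'c'] 3
# ['d', 'e', 'a'] 1
# ['b', 'c', 'd'] 4
```

Because the resolver writes the new cursor only after the gap records are built, a run that raises part-way leaves the cursor untouched and the same batch is retried on the next run.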

@@ -12,6 +12,7 @@ Supported queries:
- coding_retro: latest /progress/ item with event_type=coding_retro
- daily_triage_digest: curated scalar JSON digest for daily WSJF triage
- recently_on_scope_hourly: POST {STATE_HUB_URL}/recently-on-scope/hourly
- consistency_sweep_remote_all: POST {STATE_HUB_URL}/consistency/sweep/remote-all
No caching — state hub data is live operational state and must not be stale
within a single workflow run.
@@ -31,6 +32,7 @@ from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY, Cont
_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
_TIMEOUT_SECONDS = 10.0
_SWEEP_TIMEOUT_SECONDS = 330.0
_OPEN_WORKSTREAM_STATUSES = {"active", "ready", "blocked"}
_OPEN_TASK_STATUSES = {"wait", "todo", "progress"}
# Sentinel age for repos that have never had an SBOM ingested. Large enough
@@ -53,13 +55,26 @@ def _fetch_json(path: str, params: dict[str, Any] | None = None) -> Any:
return {}
def _post_json(path: str, payload: dict[str, Any]) -> Any:
def _post_json(path: str, payload: dict[str, Any], *, timeout: float = _TIMEOUT_SECONDS) -> Any:
url = f"{_base_url()}{path}"
resp = httpx.post(url, json=payload, timeout=_TIMEOUT_SECONDS)
resp = httpx.post(url, json=payload, timeout=timeout)
resp.raise_for_status()
return resp.json()
def _validate_consistency_sweep_remote_all(result: Any) -> dict[str, Any]:
if not isinstance(result, dict):
raise RuntimeError("consistency_sweep_remote_all returned a non-object response")
required_keys = {"exit_code", "lock_skipped", "repos_processed"}
missing = required_keys - set(result)
if missing:
missing_list = ", ".join(sorted(missing))
raise RuntimeError(
f"consistency_sweep_remote_all response missing required key(s): {missing_list}"
)
return result
def _validate_recently_on_scope_hourly(result: Any) -> dict[str, Any]:
if not isinstance(result, dict):
raise RuntimeError("recently_on_scope_hourly returned a non-object response")
@@ -107,6 +122,18 @@ class StateHubContextResolver(ContextResolver):
}
result = _post_json("/recently-on-scope/hourly", payload)
return _validate_recently_on_scope_hourly(result)
if query == "consistency_sweep_remote_all":
payload = {
key: value
for key, value in params.items()
if key not in {"required"}
}
result = _post_json(
"/consistency/sweep/remote-all",
payload,
timeout=_SWEEP_TIMEOUT_SECONDS,
)
return _validate_consistency_sweep_remote_all(result)
return {}

@@ -20,7 +20,8 @@ from activity_core.rules.models import TaskRef, TaskSpec
logger = logging.getLogger(__name__)
ISSUE_CORE_URL = os.environ.get("ISSUE_CORE_URL", "http://127.0.0.1:8010")
ISSUE_CORE_URL = os.environ.get("ISSUE_CORE_URL", "http://127.0.0.1:8765")
ISSUE_CORE_API_KEY_ENV = "ISSUE_CORE_API_KEY"
ISSUE_SINK_TYPE = os.environ.get("ISSUE_SINK_TYPE", "rest")
@@ -30,10 +31,30 @@ class IssueSink(ABC):
class IssueCoreRestSink(IssueSink):
"""POSTs to issue-core REST API. Config: ISSUE_CORE_URL env var."""
"""POSTs to issue-core REST API.
def __init__(self, base_url: str = ISSUE_CORE_URL) -> None:
Config: ISSUE_CORE_URL and ISSUE_CORE_API_KEY env vars (shared key with
the issue-core server).
"""
def __init__(
self,
base_url: str = ISSUE_CORE_URL,
api_key: str | None = None,
) -> None:
self._base_url = base_url.rstrip("/")
if api_key is not None:
self._api_key = api_key.strip()
else:
self._api_key = os.environ.get(ISSUE_CORE_API_KEY_ENV, "").strip()
def _auth_headers(self) -> dict[str, str]:
if not self._api_key:
raise RuntimeError(
f"{ISSUE_CORE_API_KEY_ENV} is not set. "
"Required when ISSUE_SINK_TYPE=rest."
)
return {"Authorization": f"Bearer {self._api_key}"}
def emit(self, task_spec: TaskSpec) -> TaskRef:
payload = {
@@ -45,10 +66,19 @@ class IssueCoreRestSink(IssueSink):
"due_in_days": task_spec.due_in_days,
"source_type": task_spec.source_type,
"source_id": task_spec.source_id,
"triggering_event_id": task_spec.triggering_event_id,
"triggering_event_id": (
str(task_spec.triggering_event_id)
if task_spec.triggering_event_id is not None
else None
),
"activity_definition_id": task_spec.activity_definition_id,
}
resp = httpx.post(f"{self._base_url}/issues/", json=payload, timeout=10.0)
resp = httpx.post(
f"{self._base_url}/issues/",
json=payload,
headers=self._auth_headers(),
timeout=10.0,
)
resp.raise_for_status()
data = resp.json()
return TaskRef(

@@ -17,6 +17,8 @@ import httpx
class DisabledLLMClient:
"""LLM client used when no llm-connect endpoint is configured."""
last_response_metadata: dict[str, Any] | None = None
def complete(
self,
prompt: str,
@@ -32,6 +34,7 @@ class LLMConnectClient:
def __init__(self, base_url: str, timeout_seconds: float = 300.0) -> None:
self.base_url = base_url.rstrip("/")
self.timeout_seconds = timeout_seconds
self.last_response_metadata: dict[str, Any] | None = None
def complete(
self,
@@ -54,12 +57,48 @@ class LLMConnectClient:
)
resp.raise_for_status()
data = resp.json()
self.last_response_metadata = _extract_response_metadata(data)
content = data.get("content")
if not isinstance(content, str):
raise ValueError("llm-connect response missing string content")
return content
_SAFE_RESPONSE_METADATA_KEYS = {
"finish_reason",
"usage",
"model",
"model_name",
"provider",
"request_id",
"response_id",
"trace_id",
"latency_ms",
"duration_ms",
"elapsed_ms",
"created",
"created_at",
}
def _extract_response_metadata(data: dict[str, Any]) -> dict[str, Any]:
"""Keep non-secret llm-connect diagnostics alongside the returned content."""
return {
key: value for key, value in data.items()
if key in _SAFE_RESPONSE_METADATA_KEYS and _json_safe(value)
}
def _json_safe(value: Any) -> bool:
try:
import json
json.dumps(value)
except (TypeError, ValueError):
return False
return True
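
To illustrate what the metadata allow-list above keeps and drops, here is a small standalone sketch. `SAFE_KEYS` is a shortened, illustrative subset of `_SAFE_RESPONSE_METADATA_KEYS`, and the sample response body is invented for the example.

```python
import json
from typing import Any

# Illustrative subset of the allow-listed keys from the diff.
SAFE_KEYS = {"finish_reason", "usage", "model", "request_id"}


def extract_metadata(data: dict[str, Any]) -> dict[str, Any]:
    """Keep only allow-listed keys whose values can be serialised to JSON."""
    kept: dict[str, Any] = {}
    for key, value in data.items():
        if key not in SAFE_KEYS:
            continue  # anything not explicitly allow-listed is dropped
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            continue  # non-serialisable values are dropped too
        kept[key] = value
    return kept


sample_response = {
    "content": "...",
    "finish_reason": "length",
    "usage": {"total_tokens": 42},
    "api_key": "secret",   # not on the allow-list
    "model": object(),     # allow-listed key, but not JSON-serialisable
}
metadata = extract_metadata(sample_response)
```

An allow-list (rather than a deny-list of known secrets) means a new field added by the upstream service stays out of reports until someone opts it in.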
def get_llm_client() -> DisabledLLMClient | LLMConnectClient:
base_url = os.environ.get("LLM_CONNECT_URL", "").strip()
if not base_url:


@@ -49,7 +49,18 @@ class CronTriggerConfig(BaseModel):
)
timezone: str = Field(default="UTC", description="IANA timezone name.")
jitter_seconds: int = Field(default=0, ge=0)
misfire_policy: Literal["skip", "catchup", "compress"] = Field(default="skip")
# Run-miss recovery behaviour (ACTIVITY-WP-0014). What happens when a fire is
# missed because the worker / Temporal was unavailable at trigger time:
# skip - run on trigger or skip; a missed fire is never recovered
# catchup_all - recover every fire missed during the outage window
# catchup_latest - recover only the most recent missed fire; do not accumulate
# Legacy aliases are accepted: catchup → catchup_all, compress → catchup_latest.
misfire_policy: Literal[
"skip", "catchup_all", "catchup_latest", "catchup", "compress"
] = Field(default="skip")
# Override the per-policy default catchup window (how far back Temporal will
# recover missed fires after an outage). None uses the policy default.
catchup_window_seconds: int | None = Field(default=None, ge=0)
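
The legacy-alias rule in the comment above (catchup → catchup_all, compress → catchup_latest) could be applied by a small normaliser. This is only a sketch: `normalize_misfire_policy` is a hypothetical helper, and the diff does not show where the real mapping happens.

```python
# Canonical policy names and their legacy aliases, as listed in the comment above.
CANONICAL_POLICIES = {"skip", "catchup_all", "catchup_latest"}
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}


def normalize_misfire_policy(value: str) -> str:
    """Map a legacy alias to its canonical name and reject unknown values."""
    canonical = LEGACY_ALIASES.get(value, value)
    if canonical not in CANONICAL_POLICIES:
        raise ValueError(f"unknown misfire_policy: {value!r}")
    return canonical
```

Normalising once at the boundary lets downstream code branch on three names instead of five.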
class EventTriggerConfig(BaseModel):


@@ -2,12 +2,15 @@
from __future__ import annotations
import json
import os
from pathlib import Path
from typing import Any
import httpx
from activity_core.context_resolvers.ops_inventory import _sanitize_url
from activity_core.state_hub_write import idempotency_headers
_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
_INTER_HUB_SINK_TYPES = {
@@ -15,6 +18,10 @@ _INTER_HUB_SINK_TYPES = {
"inter-hub-event",
"inter-hub-interaction-event",
}
_CORE_HUB_SINK_TYPES = {
"core-hub",
"core-hub-interaction-event",
}
def persist_ops_inventory_evidence(payload: dict[str, Any]) -> list[dict[str, Any]]:
@@ -55,6 +62,12 @@ def persist_ops_inventory_evidence(payload: dict[str, Any]) -> list[dict[str, An
results.append(
_post_state_hub_progress(payload, bind_key, probe_result, sink)
)
elif sink_type in _CORE_HUB_SINK_TYPES:
results.append(
_post_core_hub_interaction_event(
payload, bind_key, probe_result, sink
)
)
elif sink_type in _INTER_HUB_SINK_TYPES:
results.append(_inter_hub_result(sink))
else:
@@ -121,6 +134,7 @@ def _post_state_hub_progress(
resp = httpx.post(
f"{base_url}/progress/",
json=body,
headers=idempotency_headers(run_id, context_key, event_type),
timeout=float(sink.get("timeout_seconds", 10.0)),
)
resp.raise_for_status()
@@ -136,12 +150,17 @@ def _post_state_hub_progress(
def _progress_exists(base_url: str, event_type: str, idempotency_key: str) -> bool:
resp = httpx.get(
f"{base_url}/progress/",
params={"limit": 100},
timeout=10.0,
)
resp.raise_for_status()
# Best-effort optimisation only; the Idempotency-Key header on the write is the
# real dedup guarantee. Do not hard-fail if State Hub is unreachable here.
try:
resp = httpx.get(
f"{base_url}/progress/",
params={"limit": 100},
timeout=10.0,
)
resp.raise_for_status()
except httpx.HTTPError:
return False
for item in resp.json():
detail = item.get("detail") or {}
if (
@@ -152,6 +171,213 @@ def _progress_exists(base_url: str, event_type: str, idempotency_key: str) -> bo
return False
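
The comments above treat the `Idempotency-Key` header on the write as the real dedup guarantee. The body of `idempotency_headers` is not part of this diff, so the following is an assumption about one way such a helper might derive a stable key from its arguments, not the actual implementation.

```python
import hashlib


def idempotency_headers_sketch(*parts: str) -> dict[str, str]:
    """Hypothetical stand-in: hash the identifying parts into a stable key.

    A separator that cannot appear in normal ids keeps ("ab", "c") and
    ("a", "bc") from colliding.
    """
    joined = "\x1f".join(parts)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return {"Idempotency-Key": digest}
```

With a deterministic key, a retried POST for the same run, context, and event type carries the same header, so the server can reject the duplicate even when the best-effort read check was skipped.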
def _post_core_hub_interaction_event(
payload: dict[str, Any],
context_key: str,
probe_result: dict[str, Any],
sink: dict[str, Any],
) -> dict[str, Any]:
raw_base_url = (
sink.get("core_hub_url")
or sink.get("base_url")
or os.environ.get("CORE_HUB_BASE_URL")
or ""
)
base_url = str(raw_base_url).rstrip("/")
runtime_token = _core_hub_runtime_token(sink)
widget_id = _core_hub_widget_id(sink, probe_result)
missing: list[str] = []
if not base_url:
missing.append("CORE_HUB_BASE_URL")
if not runtime_token:
missing.append("CORE_HUB_RUNTIME_TOKEN or CORE_HUB_RUNTIME_TOKEN_FILE")
if not widget_id:
missing.append("widget_id or CORE_HUB_WIDGET_ID")
if missing:
return {
"type": sink.get("type"),
"status": "skipped",
"reason": "missing_core_hub_config",
"missing": missing,
"context_key": context_key,
}
endpoint = _selected_endpoint(probe_result, sink)
event_type = sink.get("event_type", "ops-endpoint-verified")
timeout = float(sink.get("timeout_seconds", 10.0))
body = {
"widgetId": widget_id,
"eventType": event_type,
"viewContext": _core_hub_view_context(payload, context_key, endpoint, sink),
"metadata": _core_hub_metadata(payload, context_key, probe_result, endpoint),
}
resp = httpx.post(
f"{base_url}/api/v2/interaction-events",
json=body,
headers=_core_hub_headers(runtime_token),
timeout=timeout,
)
resp.raise_for_status()
data = resp.json()
event_id = data.get("id")
if not event_id:
raise RuntimeError("Core Hub interaction event response did not include an id")
if not _core_hub_event_exists(base_url, runtime_token, str(event_id), timeout):
raise RuntimeError("Core Hub interaction event was not visible after create")
return {
"type": sink.get("type"),
"status": "posted",
"event_type": data.get("eventType", event_type),
"event_id": event_id,
"widget_id": data.get("widgetId", widget_id),
"verified": True,
"context_key": context_key,
}
def _core_hub_headers(runtime_token: str) -> dict[str, str]:
return {
"Accept": "application/json",
"Authorization": f"Bearer {runtime_token}",
"Content-Type": "application/json",
"User-Agent": "activity-core-ops-evidence/0.1",
}
def _core_hub_runtime_token(sink: dict[str, Any]) -> str:
token_file = (
sink.get("runtime_token_file")
or sink.get("token_file")
or os.environ.get("CORE_HUB_RUNTIME_TOKEN_FILE")
)
if token_file:
return Path(str(token_file)).read_text(encoding="utf-8").strip()
env_name = (
sink.get("runtime_token_env")
or os.environ.get("CORE_HUB_RUNTIME_TOKEN_ENV")
or "CORE_HUB_RUNTIME_TOKEN"
)
return os.environ.get(str(env_name), "").strip()
def _core_hub_widget_id(sink: dict[str, Any], probe_result: dict[str, Any]) -> str:
direct = sink.get("widget_id") or os.environ.get("CORE_HUB_WIDGET_ID")
if direct:
return str(direct)
endpoint = _selected_endpoint(probe_result, sink)
widget_ref = endpoint.get("widget_ref") if endpoint else None
if not widget_ref:
return ""
mapping = sink.get("widget_mapping") or sink.get("capability_mapping")
if mapping is None:
mapping = os.environ.get("CORE_HUB_WIDGET_MAPPING")
parsed = _parse_widget_mapping(mapping)
return parsed.get(str(widget_ref), "")
def _parse_widget_mapping(raw: Any) -> dict[str, str]:
if isinstance(raw, dict):
return {str(key): str(value) for key, value in raw.items() if value}
if not isinstance(raw, str) or not raw.strip():
return {}
value = raw.strip()
if value.startswith("{"):
try:
loaded = json.loads(value)
except json.JSONDecodeError:
return {}
if isinstance(loaded, dict):
return {str(key): str(item) for key, item in loaded.items() if item}
return {}
if "=" not in value:
return {}
pairs: dict[str, str] = {}
for part in value.split(","):
key, _, item = part.partition("=")
if key.strip() and item.strip():
pairs[key.strip()] = item.strip()
return pairs
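
`_parse_widget_mapping` accepts three input shapes: a dict, a JSON object string, and a comma-separated `key=value` string. The condensed re-statement below is a sketch for illustration, and the widget ids in the usage lines are invented.

```python
import json
from typing import Any


def parse_widget_mapping(raw: Any) -> dict[str, str]:
    """Condensed sketch of the three accepted mapping formats."""
    if isinstance(raw, dict):
        return {str(k): str(v) for k, v in raw.items() if v}
    if not isinstance(raw, str) or not raw.strip():
        return {}
    text = raw.strip()
    if text.startswith("{"):
        try:
            loaded = json.loads(text)
        except json.JSONDecodeError:
            return {}
        return parse_widget_mapping(loaded) if isinstance(loaded, dict) else {}
    pairs: dict[str, str] = {}
    for part in text.split(","):
        key, _, item = part.partition("=")
        if key.strip() and item.strip():  # entries without "=" are skipped
            pairs[key.strip()] = item.strip()
    return pairs


from_json = parse_widget_mapping('{"ops": "w-1"}')
from_pairs = parse_widget_mapping("ops=w-1, db = w-2")
from_garbage = parse_widget_mapping("{broken")
```

Returning an empty dict on malformed input means a bad `CORE_HUB_WIDGET_MAPPING` value ends up as a "skipped" sink result rather than an exception.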
def _selected_endpoint(probe_result: dict[str, Any], sink: dict[str, Any]) -> dict[str, Any]:
endpoints = [
endpoint
for endpoint in probe_result.get("endpoints", [])
if isinstance(endpoint, dict)
]
endpoint_id = sink.get("endpoint_id")
if endpoint_id:
match = next(
(endpoint for endpoint in endpoints if endpoint.get("endpoint_id") == endpoint_id),
None,
)
if match:
return match
return next(
(endpoint for endpoint in endpoints if endpoint.get("widget_ref")),
endpoints[0] if endpoints else {},
)
def _core_hub_view_context(
payload: dict[str, Any],
context_key: str,
endpoint: dict[str, Any],
sink: dict[str, Any],
) -> str:
return str(
sink.get("view_context")
or endpoint.get("view_context")
or f"activity-core/ops-inventory/{payload.get('run_id', 'unknown')}/{context_key}"
)
def _core_hub_metadata(
payload: dict[str, Any],
context_key: str,
probe_result: dict[str, Any],
endpoint: dict[str, Any],
) -> dict[str, Any]:
compact = _compact_probe_result(probe_result)
return {
"activity_id": payload.get("activity_id"),
"activity_core_run_id": payload.get("run_id"),
"scheduled_for": payload.get("scheduled_for"),
"source_type": "ops-inventory",
"context_key": context_key,
"probe": {
"generated_at": compact.get("generated_at"),
"inventory_path": compact.get("inventory_path"),
"status": compact.get("status"),
"reason": compact.get("reason"),
"summary": compact.get("summary", {}),
},
"endpoint": _compact_endpoint(endpoint) if endpoint else {},
}
def _core_hub_event_exists(
base_url: str,
runtime_token: str,
event_id: str,
timeout: float,
) -> bool:
resp = httpx.get(
f"{base_url}/api/v2/interaction-events",
headers=_core_hub_headers(runtime_token),
timeout=timeout,
)
resp.raise_for_status()
payload = resp.json()
data = payload.get("data") if isinstance(payload, dict) else []
if not isinstance(data, list):
return False
return any(isinstance(item, dict) and item.get("id") == event_id for item in data)
def _inter_hub_result(sink: dict[str, Any]) -> dict[str, Any]:
missing: list[str] = []
if not (sink.get("inter_hub_url") or os.environ.get("INTER_HUB_URL")):


@@ -11,6 +11,8 @@ from zoneinfo import ZoneInfo
import httpx
from activity_core.state_hub_write import idempotency_headers
_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
_THE_CUSTODIAN_ROOT = Path("/home/worsch/the-custodian")
_FORBIDDEN_CUSTODIAN_ROOTS = (
@@ -134,6 +136,7 @@ def _post_state_hub_progress(
"output_validated": report_entry.get("output_validated"),
"review_required": report_entry.get("review_required"),
"validation_error": report_entry.get("validation_error"),
"llm_response_metadata": report_entry.get("llm_response_metadata"),
"report": report,
},
}
@@ -149,6 +152,7 @@ def _post_state_hub_progress(
resp = httpx.post(
f"{base_url}/progress/",
json=body,
headers=idempotency_headers(run_id, instruction_id, event_type),
timeout=float(sink.get("timeout_seconds", 10.0)),
)
resp.raise_for_status()
@@ -167,12 +171,18 @@ def _progress_exists(
instruction_id: str,
event_type: str,
) -> bool:
resp = httpx.get(
f"{base_url}/progress/",
params={"limit": 100},
timeout=10.0,
)
resp.raise_for_status()
# Best-effort read-dedup optimisation only. The Idempotency-Key header on the
# write is the real guarantee; if State Hub is unreachable here we must not
# hard-fail — proceed to the (keyed) write rather than raising.
try:
resp = httpx.get(
f"{base_url}/progress/",
params={"limit": 100},
timeout=10.0,
)
resp.raise_for_status()
except httpx.HTTPError:
return False
for item in resp.json():
detail = item.get("detail") or {}
if (
@@ -215,6 +225,16 @@ def _render_markdown(
lines.extend([summary, ""])
if validation_error:
lines.extend(["Validation error:", "", f"`{validation_error}`", ""])
metadata = report_entry.get("llm_response_metadata")
if metadata:
lines.extend([
"LLM response metadata:",
"",
"```json",
json.dumps(metadata, indent=2, sort_keys=True),
"```",
"",
])
lines.extend([
"```json",
json.dumps(report, indent=2, sort_keys=True),


@@ -41,6 +41,7 @@ class InstructionResult:
review_required: bool = False
condition_matched: str | None = None
validation_error: str | None = None
llm_response_metadata: dict[str, Any] | None = None
def _resolve_path(obj: Any, path: str) -> Any:
@@ -160,15 +161,22 @@ def _execute(
prompt_hash = hashlib.sha256(rendered.encode()).hexdigest()
llm_config = _llm_run_config(instr)
# Reference allow-list (WP-0016-T04): if a context resolver supplied the set
# of known candidate ids, recommendations pointing at anything else are
# quarantined. Absent (None) today → the check is inert until wired.
allow_list = _allow_list_from_context(context)
# Step 3 — call LLM
raw_output = llm_client.complete(rendered, model=instr.model, config=llm_config)
response_metadata = _llm_response_metadata(llm_client)
# Step 4 — validate and optionally retry
task_specs, report, error = _validate_output(raw_output, instr)
task_specs, report, error = _validate_output(raw_output, instr, allow_list)
if error:
retry_prompt = rendered + f"\n\nPrevious output was invalid: {error}\nPlease fix."
raw_output = llm_client.complete(retry_prompt, model=instr.model, config=llm_config)
task_specs, report, error = _validate_output(raw_output, instr)
response_metadata = _llm_response_metadata(llm_client)
task_specs, report, error = _validate_output(raw_output, instr, allow_list)
if error:
# Truncate to keep log volume bounded but long enough to see the
# actual JSON shape mismatch (typical reports are <2KB).
@@ -178,7 +186,18 @@ def _execute(
"error=%s, raw_output_preview=%r",
instr.id, prompt_hash, error, preview,
)
failure_report = _invalid_output_report(instr, error, raw_output)
# Posture B (WP-0016-T03): try to recover a partial-but-usable
# report from individually-parseable items before declaring total
# loss. One bad item should cost one item, not the whole report.
recovered = _resilient_report(
instr, raw_output, error, prompt_hash, allow_list,
response_metadata=response_metadata,
)
if recovered is not None:
return recovered
failure_report = _invalid_output_report(
instr, error, raw_output, response_metadata=response_metadata,
)
if failure_report is not None:
return InstructionResult(
tasks=[],
@@ -189,6 +208,7 @@ def _execute(
review_required=True,
condition_matched=instr.condition or None,
validation_error=error,
llm_response_metadata=response_metadata,
)
return _empty_result(instr, prompt_hash=prompt_hash, validation_error=error)
@@ -200,6 +220,7 @@ def _execute(
output_validated=True,
review_required=bool(getattr(instr, "review_required", False)),
condition_matched=instr.condition or None,
llm_response_metadata=response_metadata,
)
@@ -239,6 +260,7 @@ def _invalid_output_report(
instr: Any,
validation_error: str,
raw_output: Any,
response_metadata: dict[str, Any] | None = None,
) -> dict[str, Any] | None:
"""Build a durable diagnostic report for invalid report-sink output.
@@ -256,7 +278,7 @@ def _invalid_output_report(
partial_output = _parse_json_output(raw_output)
except json.JSONDecodeError:
partial_output = None
raw_preview = raw_output[:4000]
raw_preview = raw_output[:_RAW_OUTPUT_PREVIEW_LIMIT]
else:
partial_output = raw_output
@@ -268,6 +290,8 @@ def _invalid_output_report(
"status": "validation_failed",
"validation_error": validation_error,
}
if response_metadata:
report["llm_response_metadata"] = response_metadata
if isinstance(partial_output, dict):
if isinstance(partial_output.get("summary"), str):
report["partial_summary"] = partial_output["summary"]
@@ -279,6 +303,358 @@ def _invalid_output_report(
return report
# ---------------------------------------------------------------------------
# Resilient report recovery (ACTIVITY-WP-0016-T03)
#
# Posture B — verify & mitigate at the producer→consumer boundary. When the
# whole-document parse/validate fails, recover individually-parseable
# recommendation objects, validate each against the item schema, keep the valid
# ones, and quarantine the malformed/over-limit ones with provenance. One bad
# item costs one item, not the whole report (error locality == unit of work).
# ---------------------------------------------------------------------------
_QUARANTINE_LIMIT = 20
_SNIPPET_LIMIT = 200
# Producer guardrails (ACTIVITY-WP-0016-T04): structural bounds applied to every
# recommendation regardless of producer (LLM, agent, or human). These are
# verify-and-mitigate limits — an offending item is quarantined, never allowed to
# fail the whole report or flow unbounded into a downstream consumer.
_MAX_STRING_LEN = 4000
_MAX_DEPTH = 8
_RAW_OUTPUT_PREVIEW_LIMIT = 12000
_SUMMARY_RE = re.compile(r'"summary"\s*:\s*"((?:[^"\\]|\\.)*)"')
_SAFE_RESPONSE_METADATA_KEYS = {
"finish_reason",
"usage",
"model",
"model_name",
"provider",
"request_id",
"response_id",
"trace_id",
"latency_ms",
"duration_ms",
"elapsed_ms",
"created",
"created_at",
}
def _llm_response_metadata(llm_client: Any) -> dict[str, Any] | None:
metadata = getattr(llm_client, "last_response_metadata", None)
if not isinstance(metadata, dict) or not metadata:
return None
safe: dict[str, Any] = {}
for key, value in metadata.items():
if key not in _SAFE_RESPONSE_METADATA_KEYS:
continue
try:
json.dumps(value)
except (TypeError, ValueError):
continue
safe[str(key)] = value
return safe or None
def _snippet(value: Any) -> str:
text = value if isinstance(value, str) else json.dumps(value, default=str)
return text[:_SNIPPET_LIMIT]
def _json_depth(value: Any, depth: int = 1) -> int:
if depth > _MAX_DEPTH:
return depth
if isinstance(value, dict):
return max((_json_depth(v, depth + 1) for v in value.values()), default=depth)
if isinstance(value, list):
return max((_json_depth(v, depth + 1) for v in value), default=depth)
return depth
def _has_oversized_string(value: Any) -> bool:
if isinstance(value, str):
return len(value) > _MAX_STRING_LEN
if isinstance(value, dict):
return any(_has_oversized_string(v) for v in value.values())
if isinstance(value, list):
return any(_has_oversized_string(v) for v in value)
return False
def _item_structure_error(item: Any) -> str | None:
"""Producer-agnostic structural guardrail: depth and string-length caps."""
if _json_depth(item) > _MAX_DEPTH:
return f"exceeds max nesting depth {_MAX_DEPTH}"
if _has_oversized_string(item):
return f"contains a string longer than {_MAX_STRING_LEN} chars"
return None
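
How the depth cap counts levels is easy to misread, so here is a standalone sketch of the same recursion. `nest` is a hypothetical helper that wraps a scalar in a given number of dicts. The top-level item counts as depth 1 and each child adds one, so a scalar wrapped in seven dicts sits exactly at the limit.

```python
from typing import Any

MAX_DEPTH = 8  # same value as _MAX_DEPTH in the diff


def json_depth(value: Any, depth: int = 1) -> int:
    """Depth of the deepest node; stops descending once the cap is exceeded."""
    if depth > MAX_DEPTH:
        return depth
    if isinstance(value, dict):
        return max((json_depth(v, depth + 1) for v in value.values()), default=depth)
    if isinstance(value, list):
        return max((json_depth(v, depth + 1) for v in value), default=depth)
    return depth


def nest(levels: int) -> Any:
    """Hypothetical helper: wrap a scalar in `levels` single-key dicts."""
    value: Any = "leaf"
    for _ in range(levels):
        value = {"k": value}
    return value


at_limit = json_depth(nest(7)) > MAX_DEPTH    # scalar lands on depth 8
over_limit = json_depth(nest(8)) > MAX_DEPTH  # scalar lands on depth 9
```

The early return bounds the work done on hostile input: recursion never goes more than one level past the cap.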
def _allow_list_from_context(context: dict | None) -> set[str] | None:
"""Build the recommendation-candidate allow-list from resolved context.
Looks for `context["known_candidates"]` (a list/set of valid candidate ids).
Returns None when absent so the allow-list check stays inert until a context
resolver populates it — the guardrail capability ships now; activation is a
one-line resolver change.
"""
if not isinstance(context, dict):
return None
known = context.get("known_candidates")
if isinstance(known, (list, set, tuple)):
return {str(item) for item in known}
return None
def _report_contract(instr: Any) -> tuple[dict[str, Any] | None, int | None]:
"""Extract (item_schema, max_items) for the recommendations list, if any."""
try:
schema = _load_output_schema(getattr(instr, "output_schema", ""))
except (OSError, json.JSONDecodeError, TypeError):
return None, None
if not isinstance(schema, dict):
return None, None
recs = (schema.get("properties") or {}).get("recommendations")
if not isinstance(recs, dict):
return None, None
item_schema = recs.get("items") if isinstance(recs.get("items"), dict) else None
max_items = recs.get("maxItems") if isinstance(recs.get("maxItems"), int) else None
return item_schema, max_items
def _extract_object_spans(raw: str) -> list[tuple[str, bool]]:
"""Return (span, complete) for each recommendation object in raw output.
Scans the `recommendations` array brace-aware and string-aware so it recovers
objects whether they are pretty-printed across many lines or emitted one per
line (NDJSON). A truncated trailing object is returned with complete=False.
"""
key = raw.find('"recommendations"')
start_region = raw.find("[", key) if key >= 0 else -1
if start_region < 0:
return []
spans: list[tuple[str, bool]] = []
i, n = start_region + 1, len(raw)
while i < n:
ch = raw[i]
if ch == "]":
break
if ch != "{":
i += 1
continue
depth, in_str, esc, j = 0, False, False, i
closed = False
while j < n:
c = raw[j]
if in_str:
if esc:
esc = False
elif c == "\\":
esc = True
elif c == '"':
in_str = False
elif c == '"':
in_str = True
elif c == "{":
depth += 1
elif c == "}":
depth -= 1
if depth == 0:
spans.append((raw[i:j + 1], True))
closed = True
break
j += 1
if not closed:
spans.append((raw[i:], False)) # truncated tail
break
i = j + 1
return spans
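
The scanner above has to be string-aware because a `}` inside a string value must not close the object. The sketch below restates the same scan in standalone form and runs it on an invented payload whose first item contains a brace inside a string and whose second item is cut off mid-string.

```python
def extract_object_spans(raw: str) -> list[tuple[str, bool]]:
    """Return (span, complete) for each object in the "recommendations" array."""
    key = raw.find('"recommendations"')
    start = raw.find("[", key) if key >= 0 else -1
    if start < 0:
        return []
    spans: list[tuple[str, bool]] = []
    i, n = start + 1, len(raw)
    while i < n:
        if raw[i] == "]":
            break  # end of the array
        if raw[i] != "{":
            i += 1  # skip commas and whitespace between objects
            continue
        depth, in_str, esc, j, closed = 0, False, False, i, False
        while j < n:
            c = raw[j]
            if in_str:
                if esc:
                    esc = False
                elif c == "\\":
                    esc = True
                elif c == '"':
                    in_str = False
            elif c == '"':
                in_str = True
            elif c == "{":
                depth += 1
            elif c == "}":
                depth -= 1
                if depth == 0:
                    spans.append((raw[i:j + 1], True))
                    closed = True
                    break
            j += 1
        if not closed:
            spans.append((raw[i:], False))  # truncated tail
            break
        i = j + 1
    return spans


sample = (
    '{"summary": "s", "recommendations": ['
    '{"candidate": "a}b"}, {"candidate": "c", "why": "cut of'
)
spans = extract_object_spans(sample)
```

Here `spans` holds one complete object and one incomplete tail, so a later repair step can target only the second.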
def _try_repair(span: str) -> str:
"""Best-effort close of a truncated JSON object: balance quote, braces, brackets."""
in_str, esc, depth_c, depth_b = False, False, 0, 0
for c in span:
if in_str:
if esc:
esc = False
elif c == "\\":
esc = True
elif c == '"':
in_str = False
elif c == '"':
in_str = True
elif c == "{":
depth_c += 1
elif c == "}":
depth_c -= 1
elif c == "[":
depth_b += 1
elif c == "]":
depth_b -= 1
repaired = span.rstrip().rstrip(",")
if in_str:
repaired += '"'
return repaired + "]" * max(depth_b, 0) + "}" * max(depth_c, 0)
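
As written, `_try_repair` appends all closing brackets before all closing braces. A fragment cut inside an object nested in an array, such as `{"a": [{"b": 1`, would therefore become `{"a": [{"b": 1]}}`, which does not parse, and the item would be quarantined as truncated. The stack-based variant below is a sketch of an alternative, not what the diff does. It closes delimiters in the reverse order they were opened. `close_truncated` is a hypothetical name.

```python
import json


def close_truncated(span: str) -> str:
    """Close an unterminated JSON fragment using a stack of open delimiters."""
    stack: list[str] = []  # closing characters still owed, innermost last
    in_str = esc = False
    for c in span:
        if in_str:
            if esc:
                esc = False
            elif c == "\\":
                esc = True
            elif c == '"':
                in_str = False
        elif c == '"':
            in_str = True
        elif c in "{[":
            stack.append("}" if c == "{" else "]")
        elif c in "}]" and stack:
            stack.pop()
    repaired = span.rstrip().rstrip(",")
    if in_str:
        repaired += '"'  # terminate the dangling string first
    return repaired + "".join(reversed(stack))


nested = json.loads(close_truncated('{"a": [{"b": 1'))
dangling = json.loads(close_truncated('{"a": "b'))
```

Like the original, this stays best-effort: a fragment cut between a key and its value still fails to parse and should be quarantined.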
def _recover_recommendations(
raw: str,
) -> tuple[str | None, list[dict[str, Any]], list[dict[str, Any]]]:
"""Recover (summary, items, quarantined) from a failed report payload."""
summary_match = _SUMMARY_RE.search(raw)
summary = None
if summary_match:
try:
summary = json.loads(f'"{summary_match.group(1)}"')
except json.JSONDecodeError:
summary = summary_match.group(1)
items: list[dict[str, Any]] = []
quarantined: list[dict[str, Any]] = []
for index, (span, complete) in enumerate(_extract_object_spans(raw)):
parsed: Any = None
try:
parsed = json.loads(span)
except json.JSONDecodeError as exc:
if not complete:
try:
parsed = json.loads(_try_repair(span))
except json.JSONDecodeError:
parsed = None
if parsed is None:
quarantined.append(
{"index": index, "error": str(exc), "raw": _snippet(span),
"reason": "truncated" if not complete else "unparseable"}
)
continue
if isinstance(parsed, dict):
items.append(parsed)
else:
quarantined.append(
{"index": index, "error": "item is not a JSON object",
"raw": _snippet(span)}
)
return summary, items, quarantined
def _partition_items(
items: list[dict[str, Any]],
item_schema: dict[str, Any] | None,
max_items: int | None,
*,
run_schema: bool = True,
allow_list: set[str] | None = None,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
"""Screen items into (valid, quarantined).
Applied uniformly to recovered items (run_schema=True) and to already
schema-valid happy-path items (run_schema=False). Order of checks: structural
type → schema → producer guardrails (depth/length) → reference allow-list →
count cap. The first failing check quarantines the item with provenance.
"""
valid: list[dict[str, Any]] = []
quarantined: list[dict[str, Any]] = []
for index, item in enumerate(items):
if not isinstance(item, dict):
quarantined.append(
{"index": index, "error": "item is not a JSON object",
"raw": _snippet(item), "reason": "malformed"}
)
continue
schema_error = (
_validate_schema_node(item, item_schema, f"recommendations[{index}]")
if (run_schema and item_schema)
else None
)
if schema_error:
quarantined.append(
{"index": index, "error": schema_error, "raw": _snippet(item),
"reason": "schema"}
)
continue
structure_error = _item_structure_error(item)
if structure_error:
quarantined.append(
{"index": index, "error": structure_error, "raw": _snippet(item),
"reason": "guardrail"}
)
continue
if allow_list is not None:
candidate = item.get("candidate")
if not isinstance(candidate, str) or candidate not in allow_list:
quarantined.append(
{"index": index, "error": f"candidate {candidate!r} not in allow-list",
"raw": _snippet(item), "reason": "allow_list"}
)
continue
valid.append(item)
if max_items is not None and len(valid) > max_items:
for item in valid[max_items:]:
quarantined.append(
{"index": None, "error": f"exceeds maxItems={max_items}",
"raw": _snippet(item), "reason": "over_limit"}
)
valid = valid[:max_items]
return valid, quarantined
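
The final step of `_partition_items`, the count cap, can be shown in isolation. This is a simplified sketch: `apply_count_cap` is a hypothetical helper, and the quarantine records omit the `raw` snippet that the real code attaches.

```python
from __future__ import annotations

from typing import Any


def apply_count_cap(
    valid: list[dict[str, Any]], max_items: int | None
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Keep the first max_items entries and quarantine the overflow."""
    if max_items is None or len(valid) <= max_items:
        return valid, []
    overflow = [
        {"index": None, "error": f"exceeds maxItems={max_items}", "reason": "over_limit"}
        for _ in valid[max_items:]
    ]
    return valid[:max_items], overflow


items = [{"candidate": f"c{i}"} for i in range(9)]
kept, quarantined = apply_count_cap(items, 7)
```

Because the cap runs after the other checks, it counts only items that already passed them, and input order decides which items survive.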
def _resilient_report(
instr: Any,
raw_output: Any,
original_error: str,
prompt_hash: str | None,
allow_list: set[str] | None = None,
response_metadata: dict[str, Any] | None = None,
) -> InstructionResult | None:
"""Recover a partial-but-usable report from output that failed validation.
Returns None when nothing usable can be recovered, so the caller falls back
to the total-loss diagnostic artifact (_invalid_output_report).
"""
if not getattr(instr, "report_sinks", None) or not isinstance(raw_output, str):
return None
item_schema, max_items = _report_contract(instr)
summary, items, quarantined = _recover_recommendations(raw_output)
if not items:
return None
valid, item_quarantine = _partition_items(
items, item_schema, max_items, allow_list=allow_list,
)
quarantined.extend(item_quarantine)
if not valid:
return None
report: dict[str, Any] = {
"summary": summary
or f"Partial daily triage: recovered {len(valid)} recommendation(s) "
"after the full report failed validation.",
"recommendations": valid,
"status": "partial",
"partial": True,
"quarantined_count": len(quarantined),
"quarantined_items": quarantined[:_QUARANTINE_LIMIT],
"recovery_note": f"original validation error: {original_error}",
}
if response_metadata:
report["llm_response_metadata"] = response_metadata
logger.warning(
"instruction_output_recovered: instruction=%r, kept=%d, quarantined=%d",
getattr(instr, "id", None), len(valid), len(quarantined),
)
return InstructionResult(
tasks=[],
report=report,
prompt_hash=prompt_hash,
model=getattr(instr, "model", None),
output_validated=True,
review_required=True,
condition_matched=getattr(instr, "condition", "") or None,
validation_error=None,
llm_response_metadata=response_metadata,
)
def _execution_failure_report(instr: Any, error: str) -> dict[str, Any] | None:
"""Build a durable diagnostic report when a report instruction cannot run."""
if not getattr(instr, "report_sinks", None):
@@ -295,6 +671,7 @@ def _execution_failure_report(instr: Any, error: str) -> dict[str, Any] | None:
def _validate_output(
raw_output: Any,
instr: Any,
allow_list: set[str] | None = None,
) -> tuple[list[TaskSpec], dict[str, Any] | None, str | None]:
"""Parse raw LLM output into TaskSpecs and optional report payload.
@@ -349,6 +726,28 @@ def _validate_output(
source_type="instruction",
source_id=instr.id,
))
# Happy-path producer guardrails (WP-0016-T04): the whole document already
# passed schema validation, so recommendations are schema-valid; still apply
# the count cap, structural caps, and reference allow-list, quarantining any
# offenders rather than emitting them. Report shape only changes when an item
# is actually quarantined.
if isinstance(report, dict) and isinstance(report.get("recommendations"), list):
item_schema, max_items = _report_contract(instr)
kept, quarantined = _partition_items(
report["recommendations"], item_schema, max_items,
run_schema=False, allow_list=allow_list,
)
if quarantined:
report = {
**report,
"recommendations": kept,
"status": "partial",
"partial": True,
"quarantined_count": len(quarantined),
"quarantined_items": quarantined[:_QUARANTINE_LIMIT],
}
return specs, report, None
except (json.JSONDecodeError, AttributeError, KeyError, TypeError) as exc:
return [], None, str(exc)


@@ -0,0 +1,194 @@
"""Missed-fire detection for cron schedules (ACTIVITY-WP-0014, T03).
Even with a catchup window configured, an operator wants to *know* when a fire
was missed — especially under ``misfire_policy: skip`` where missed fires are
dropped by design and leave no run and no failure event. This module turns the
schedule's own bookkeeping into an explicit verdict and an optional State Hub
alert so a miss is never invisible again.
Temporal already counts fires that were dropped because they fell outside the
catchup window in ``ScheduleInfo.num_actions_missed_catchup_window``. We surface
that, plus a staleness check on the most recent fire, as a ``ScheduleHealth``
verdict. The verdict logic is a pure function so it is testable without a live
Temporal server; ``check_schedule_health`` is the thin async reader.
"""
from __future__ import annotations
import os
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any
from uuid import UUID
import httpx
from activity_core.schedule_manager import schedule_id
from activity_core.state_hub_write import idempotency_headers
_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
@dataclass(frozen=True)
class ScheduleHealth:
"""Verdict for a single schedule's recent firing behaviour."""
activity_id: str
healthy: bool
missed_catchup_window: int
last_fired_at: datetime | None
staleness: timedelta | None
reasons: list[str] = field(default_factory=list)
@property
def missed(self) -> bool:
return not self.healthy
def evaluate_schedule_health(
*,
activity_id: str,
missed_catchup_window: int,
last_fired_at: datetime | None,
now: datetime,
expected_interval: timedelta | None = None,
tolerance: timedelta = timedelta(minutes=10),
) -> ScheduleHealth:
"""Pure verdict: was a fire missed?
A schedule is unhealthy if Temporal dropped any fire past the catchup window,
or — when ``expected_interval`` is known — if the most recent fire is older
than one interval plus ``tolerance`` (i.e. a fire should have happened and
did not).
"""
reasons: list[str] = []
if missed_catchup_window > 0:
reasons.append(
f"{missed_catchup_window} fire(s) dropped outside the catchup window"
)
staleness: timedelta | None = None
if last_fired_at is not None:
staleness = now - last_fired_at
if expected_interval is not None and staleness > expected_interval + tolerance:
reasons.append(
f"last fire was {staleness} ago, exceeding the expected "
f"{expected_interval} interval"
)
elif expected_interval is not None:
reasons.append("no recorded fire for a schedule that should have fired")
return ScheduleHealth(
activity_id=activity_id,
healthy=not reasons,
missed_catchup_window=missed_catchup_window,
last_fired_at=last_fired_at,
staleness=staleness,
reasons=reasons,
)
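
The staleness rule in `evaluate_schedule_health` reduces to one comparison: a schedule is stale when the time since the last fire exceeds one expected interval plus the tolerance. The sketch below works that through for a daily schedule; `is_stale` is a hypothetical helper and the timestamps are invented.

```python
from datetime import datetime, timedelta, timezone


def is_stale(
    last_fired_at: datetime,
    now: datetime,
    expected_interval: timedelta,
    tolerance: timedelta = timedelta(minutes=10),
) -> bool:
    """True when the last fire is older than one interval plus tolerance."""
    return (now - last_fired_at) > expected_interval + tolerance


now = datetime(2026, 7, 2, 12, 0, tzinfo=timezone.utc)
daily = timedelta(days=1)

# 24h05m since the last fire: inside the 10-minute tolerance.
slightly_late = is_stale(now - timedelta(hours=24, minutes=5), now, daily)
# 24h15m since the last fire: a fire should have happened and did not.
missed = is_stale(now - timedelta(hours=24, minutes=15), now, daily)
```

Both datetimes must be timezone-aware (or both naive), otherwise the subtraction raises `TypeError`.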
def _extract_info(desc: Any) -> tuple[int, datetime | None]:
"""Pull (missed_catchup_window, last_fired_at) from a ScheduleDescription.
Accesses are defensive so a Temporal SDK field rename degrades to "unknown"
rather than raising inside an operational health check.
"""
info = getattr(desc, "info", None)
missed = int(getattr(info, "num_actions_missed_catchup_window", 0) or 0)
last_fired: datetime | None = None
recent = getattr(info, "recent_actions", None) or []
times = [
getattr(a, "scheduled_at", None) or getattr(a, "started_at", None)
for a in recent
]
times = [t for t in times if t is not None]
if times:
last_fired = max(times)
return missed, last_fired
async def check_schedule_health(
client: Any,
activity_id: str | UUID,
*,
now: datetime | None = None,
expected_interval: timedelta | None = None,
tolerance: timedelta = timedelta(minutes=10),
) -> ScheduleHealth:
"""Describe the schedule for ``activity_id`` and evaluate its health."""
now = now or datetime.now(tz=timezone.utc)
handle = client.get_schedule_handle(schedule_id(activity_id))
desc = await handle.describe()
missed, last_fired = _extract_info(desc)
return evaluate_schedule_health(
activity_id=str(activity_id),
missed_catchup_window=missed,
last_fired_at=last_fired,
now=now,
expected_interval=expected_interval,
tolerance=tolerance,
)
def post_missed_fire_alert(
health: ScheduleHealth,
*,
state_hub_url: str | None = None,
author: str = "activity-core",
topic_id: str | None = None,
workstream_id: str | None = None,
timeout_seconds: float = 10.0,
) -> dict[str, Any]:
"""Post a ``schedule_miss`` progress event to State Hub for an unhealthy schedule.
No-op (returns ``status: ok``) when the schedule is healthy, so callers can
invoke unconditionally.
"""
if health.healthy:
return {"type": "schedule-miss-alert", "status": "ok"}
base_url = state_hub_url or os.environ.get("STATE_HUB_URL", _DEFAULT_STATE_HUB_URL)
base_url = str(base_url).rstrip("/")
body: dict[str, Any] = {
"event_type": "schedule_miss",
"author": author,
"summary": (
f"Schedule {health.activity_id} missed a fire: "
+ "; ".join(health.reasons)
),
"detail": {
"activity_id": health.activity_id,
"missed_catchup_window": health.missed_catchup_window,
"last_fired_at": (
health.last_fired_at.isoformat() if health.last_fired_at else None
),
"staleness_seconds": (
health.staleness.total_seconds() if health.staleness else None
),
"reasons": health.reasons,
},
}
if topic_id:
body["topic_id"] = topic_id
if workstream_id:
body["workstream_id"] = workstream_id
# Dedup repeated alerts for the same missed window (same schedule + last fire).
last_fired = health.last_fired_at.isoformat() if health.last_fired_at else "none"
resp = httpx.post(
f"{base_url}/progress/",
json=body,
headers=idempotency_headers("schedule_miss", health.activity_id, last_fired),
timeout=timeout_seconds,
)
resp.raise_for_status()
data = resp.json()
return {
"type": "schedule-miss-alert",
"status": "posted",
"progress_id": data.get("id"),
}
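A minimal sketch of how the defensive accessor above behaves, restated so it runs on its own. The `SimpleNamespace` objects are hypothetical stand-ins for a Temporal `ScheduleDescription`; only the attribute names are taken from the code above.

```python
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def extract_info(desc: Any) -> Tuple[int, Optional[datetime]]:
    """Restatement of the accessor above: missing attributes degrade to (0, None)."""
    info = getattr(desc, "info", None)
    missed = int(getattr(info, "num_actions_missed_catchup_window", 0) or 0)
    recent = getattr(info, "recent_actions", None) or []
    times = [
        getattr(a, "scheduled_at", None) or getattr(a, "started_at", None)
        for a in recent
    ]
    times = [t for t in times if t is not None]
    return missed, (max(times) if times else None)


# Hypothetical description object carrying the expected fields.
fired = datetime(2026, 7, 1, 9, 0, tzinfo=timezone.utc)
described = SimpleNamespace(
    info=SimpleNamespace(
        num_actions_missed_catchup_window=2,
        recent_actions=[SimpleNamespace(scheduled_at=fired, started_at=None)],
    )
)
# Simulates an SDK field rename: there is no `.info` attribute at all.
renamed = SimpleNamespace(details=None)

known = extract_info(described)
unknown = extract_info(renamed)
```

With the renamed object the call returns "no misses, last fire unknown" instead of raising, which is the degradation the docstring describes.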


@@ -17,7 +17,6 @@ from temporalio.client import (
Schedule,
ScheduleActionStartWorkflow,
ScheduleAlreadyRunningError,
ScheduleBackfill,
ScheduleCalendarSpec,
ScheduleHandle,
ScheduleOverlapPolicy,
@@ -38,13 +37,49 @@ _ORCHESTRATOR_TASK_QUEUE = "orchestrator-tq"
# RunActivityWorkflow detects this value and derives run dedup key from workflow_id.
SCHEDULED_TRIGGER_KEY = "scheduled"
# T24: misfire_policy → ScheduleOverlapPolicy
_MISFIRE_TO_OVERLAP: dict[str, ScheduleOverlapPolicy] = {
"skip": ScheduleOverlapPolicy.SKIP,
"catchup": ScheduleOverlapPolicy.BUFFER_ALL,
"compress": ScheduleOverlapPolicy.BUFFER_ONE,
# ACTIVITY-WP-0014: misfire_policy → run-miss recovery behaviour.
#
# A "missed fire" happens when the worker / Temporal is unavailable at trigger
# time. Two Temporal levers together define the behaviour:
# - catchup_window: how far back the server will recover missed fires once it
# is healthy again. The previous code never set this, so a brief outage at
# trigger time silently dropped the fire with no recovery and no signal.
# - overlap: what to do when a (recovered) fire would start while a prior run
# is still executing.
#
# Legacy values (catchup, compress) are aliased onto the explicit names.
_MISFIRE_ALIASES: dict[str, str] = {
"catchup": "catchup_all",
"compress": "catchup_latest",
}
# overlap policy + default catchup window (seconds) per normalised policy.
_SKIP_WINDOW_SECONDS = 60
_CATCHUP_ALL_WINDOW_SECONDS = 365 * 24 * 3600
_CATCHUP_LATEST_WINDOW_SECONDS = 24 * 3600
_MISFIRE_TO_OVERLAP: dict[str, ScheduleOverlapPolicy] = {
# Run on trigger or skip — recover nothing past a tiny grace window.
"skip": ScheduleOverlapPolicy.SKIP,
# Run on trigger or recover every missed fire during the outage window.
"catchup_all": ScheduleOverlapPolicy.BUFFER_ALL,
# Run on trigger or recover the most recent missed fire only; BUFFER_ONE
# buffers at most one start and drops the rest, so a backlog never accumulates.
"catchup_latest": ScheduleOverlapPolicy.BUFFER_ONE,
}
_MISFIRE_DEFAULT_WINDOW: dict[str, int] = {
"skip": _SKIP_WINDOW_SECONDS,
"catchup_all": _CATCHUP_ALL_WINDOW_SECONDS,
"catchup_latest": _CATCHUP_LATEST_WINDOW_SECONDS,
}
def _normalize_misfire_policy(misfire_policy: str) -> str:
"""Map legacy aliases onto the explicit run-miss policy names."""
canonical = _MISFIRE_ALIASES.get(misfire_policy, misfire_policy)
return canonical if canonical in _MISFIRE_TO_OVERLAP else "skip"
def schedule_id(activity_id: str | UUID) -> str:
"""Return the canonical Temporal Schedule ID for an ActivityDefinition."""
@@ -57,7 +92,15 @@ def smoke_schedule_id(activity_id: str | UUID) -> str:
def _overlap_policy(misfire_policy: str) -> ScheduleOverlapPolicy:
return _MISFIRE_TO_OVERLAP.get(misfire_policy, ScheduleOverlapPolicy.SKIP)
return _MISFIRE_TO_OVERLAP[_normalize_misfire_policy(misfire_policy)]
def _catchup_window(cfg: CronTriggerConfig) -> timedelta:
"""Resolve the catchup window: explicit override, else the policy default."""
if cfg.catchup_window_seconds is not None:
return timedelta(seconds=cfg.catchup_window_seconds)
policy = _normalize_misfire_policy(cfg.misfire_policy)
return timedelta(seconds=_MISFIRE_DEFAULT_WINDOW[policy])
def _build_schedule(defn: ActivityDefinition) -> Schedule:
@@ -80,7 +123,10 @@ def _build_schedule(defn: ActivityDefinition) -> Schedule:
jitter=timedelta(seconds=cfg.jitter_seconds) if cfg.jitter_seconds else None,
)
policy = SchedulePolicy(overlap=_overlap_policy(cfg.misfire_policy))
policy = SchedulePolicy(
overlap=_overlap_policy(cfg.misfire_policy),
catchup_window=_catchup_window(cfg),
)
state = ScheduleState(paused=not defn.enabled)
return Schedule(action=action, spec=spec, policy=policy, state=state)
@@ -282,18 +328,10 @@ async def upsert_schedule(client: Client, defn: ActivityDefinition) -> ScheduleH
else:
await handle.pause(note="disabled via upsert_schedule")
# T24 catchup: backfill any fires missed in the last hour.
if isinstance(defn.trigger_config, CronTriggerConfig):
if defn.trigger_config.misfire_policy == "catchup":
now = datetime.now(tz=timezone.utc)
backfill_start = now - timedelta(hours=1)
await handle.backfill(
ScheduleBackfill(
start_at=backfill_start,
end_at=now,
overlap=ScheduleOverlapPolicy.BUFFER_ALL,
)
)
# ACTIVITY-WP-0014: missed-fire recovery is now handled natively by the
# schedule's catchup_window (see _build_schedule), which the server applies
# continuously after any outage — not only at upsert time. The previous
# ad-hoc 1-hour backfill is therefore no longer needed.
return handle
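The policy resolution introduced in this file can be sketched without the Temporal SDK. This is an illustration under assumptions: overlap policies are plain strings here rather than `ScheduleOverlapPolicy` members, and `resolve` is a hypothetical helper that combines what `_normalize_misfire_policy`, `_overlap_policy` and `_catchup_window` do separately.

```python
from datetime import timedelta
from typing import Optional, Tuple

# Legacy names mapped onto the explicit ones, as in _MISFIRE_ALIASES.
ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}

# canonical policy -> (overlap policy name, default catchup window in seconds)
POLICIES = {
    "skip": ("SKIP", 60),
    "catchup_all": ("BUFFER_ALL", 365 * 24 * 3600),
    "catchup_latest": ("BUFFER_ONE", 24 * 3600),
}


def resolve(
    misfire_policy: str, catchup_window_seconds: Optional[int] = None
) -> Tuple[str, str, timedelta]:
    """Return (canonical policy, overlap name, catchup window)."""
    canonical = ALIASES.get(misfire_policy, misfire_policy)
    if canonical not in POLICIES:
        canonical = "skip"  # unknown values fall back to the safest policy
    overlap, default_seconds = POLICIES[canonical]
    seconds = (
        default_seconds if catchup_window_seconds is None else catchup_window_seconds
    )
    return canonical, overlap, timedelta(seconds=seconds)


legacy = resolve("catchup")
overridden = resolve("compress", 300)
fallback = resolve("bogus")
```

An explicit `catchup_window_seconds` always wins over the policy default, so a definition can keep `catchup_latest` semantics while narrowing the recovery window.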


@@ -0,0 +1,34 @@
"""Idempotency-keyed State Hub writes (ACTIVITY-WP-0014 T05).
Under the State Hub *beachhead* model, a write may be buffered locally while
central State Hub is unreachable and **flushed later, possibly with retries**.
To keep that flush safe — no duplicate progress / triage events — every write
carries a stable ``Idempotency-Key`` header derived deterministically from the
write's identity. The guarantee lives on the write itself and does **not** depend
on a live dedup read, so it holds even when the beachhead is serving offline.
activity-core does not implement the queue/cache (that is state-hub's beachhead);
it only emits the key so the beachhead / State Hub can dedup on flush. The header
passes untouched through the existing ``actcore-state-hub-bridge`` proxy and is
ignored by State Hub versions that do not yet honour it.
"""
from __future__ import annotations
IDEMPOTENCY_HEADER = "Idempotency-Key"
def idempotency_key(*parts: str | None) -> str:
"""Build a stable, header-safe idempotency key from identity parts.
Empty/None parts are kept as empty segments so the key shape is stable across
calls. Whitespace, control, and non-ASCII characters are each replaced with
``_`` to keep the value a valid single-line HTTP header.
"""
raw = ":".join((p or "") for p in parts)
return "".join(ch if 0x20 < ord(ch) < 0x7F else "_" for ch in raw) or "_"
def idempotency_headers(*parts: str | None) -> dict[str, str]:
"""Return the header dict to attach to a State Hub write."""
return {IDEMPOTENCY_HEADER: idempotency_key(*parts)}
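A short usage sketch of the key builder, with the function restated so the example runs alone. The identity parts (`act-123`, the timestamp) are made-up illustration values, not real schedule IDs.

```python
from typing import Optional


def idempotency_key(*parts: Optional[str]) -> str:
    """Restatement of the helper above."""
    raw = ":".join((p or "") for p in parts)
    # Anything outside printable, non-space ASCII becomes "_".
    return "".join(ch if 0x20 < ord(ch) < 0x7F else "_" for ch in raw) or "_"


# Same identity parts always give the same key, so a retried flush can be deduplicated.
first = idempotency_key("schedule_miss", "act-123", "2026-07-01T09:00:00+00:00")
retry = idempotency_key("schedule_miss", "act-123", "2026-07-01T09:00:00+00:00")

# A missing part keeps its (empty) segment, so the key shape stays stable.
no_last_fire = idempotency_key("schedule_miss", "act-123", None)

# Spaces and newlines are replaced one-for-one rather than removed.
sanitised = idempotency_key("a b\n")

# No parts at all still yields a non-empty header value.
empty = idempotency_key()
```

Because each replaced character maps to a single `_`, two different raw inputs can in principle produce the same key (for example `"a b"` and `"a_b"`); callers that need strict uniqueness should pass parts that are already header-safe.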


@@ -15,6 +15,8 @@ import asyncio
import logging
import os
import uuid
from dataclasses import dataclass
from typing import Sequence
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
@@ -30,6 +32,20 @@ TEMPORAL_HOST = os.environ.get("TEMPORAL_HOST", "localhost:7233")
TEMPORAL_NAMESPACE = os.environ.get("TEMPORAL_NAMESPACE", "default")
@dataclass
class ScheduleSyncResult:
upserted: int = 0
paused: int = 0
deleted_orphans: int = 0
def to_dict(self) -> dict[str, int]:
return {
"upserted": self.upserted,
"paused": self.paused,
"deleted_orphans": self.deleted_orphans,
}
def _row_to_domain(row: ActivityDefinitionRow) -> ActivityDefinition:
"""Convert an ORM row to a domain ActivityDefinition for schedule_manager."""
return ActivityDefinition.model_validate(
@@ -46,12 +62,82 @@ def _row_to_domain(row: ActivityDefinitionRow) -> ActivityDefinition:
)
async def sync(client: Client, db_url: str) -> None:
def _valid_schedule_activity_id(defn: ActivityDefinition) -> str:
if isinstance(defn.trigger_config, ScheduledTriggerConfig):
return f"{defn.id}-once"
return str(defn.id)
async def _load_schedule_rows(
session_factory: async_sessionmaker[AsyncSession],
) -> Sequence[ActivityDefinitionRow]:
async with session_factory() as session:
return (
await session.scalars(
select(ActivityDefinitionRow).where(
ActivityDefinitionRow.trigger_type.in_(["cron", "scheduled"])
)
)
).all()
async def sync_schedule_rows(
client: Client,
rows: Sequence[ActivityDefinitionRow],
) -> ScheduleSyncResult:
"""Reconcile Temporal Schedules against already-loaded definition rows."""
valid_schedule_activity_ids: set[str] = set()
result = ScheduleSyncResult()
for row in rows:
defn = _row_to_domain(row)
if not isinstance(
defn.trigger_config,
(CronTriggerConfig, ScheduledTriggerConfig),
):
continue
valid_schedule_activity_ids.add(_valid_schedule_activity_id(defn))
await upsert_schedule(client, defn)
if defn.enabled:
result.upserted += 1
logger.info("upserted schedule for activity %s (%s)", defn.id, defn.name)
else:
result.paused += 1
logger.info("upserted paused schedule for disabled activity %s", defn.id)
# Tombstone cleanup: remove Temporal Schedules with no matching DB row.
existing_schedules = await list_schedules(client)
for entry in existing_schedules:
if entry["activity_id"] not in valid_schedule_activity_ids:
await delete_schedule(client, entry["activity_id"])
result.deleted_orphans += 1
logger.info("deleted orphaned schedule %s", entry["schedule_id"])
logger.info(
"sync_schedules complete — upserted=%d paused=%d deleted_orphans=%d",
result.upserted,
result.paused,
result.deleted_orphans,
)
return result
async def sync_with_session_factory(
client: Client,
session_factory: async_sessionmaker[AsyncSession],
) -> ScheduleSyncResult:
"""Reconcile Temporal Schedules using an existing DB session factory."""
return await sync_schedule_rows(client, await _load_schedule_rows(session_factory))
async def sync(client: Client, db_url: str) -> ScheduleSyncResult:
"""Reconcile Temporal Schedules against the ActivityDefinition table.
Steps:
1. Load all enabled cron ActivityDefinitions from Postgres.
2. Upsert a Temporal Schedule for each one.
1. Load all cron/scheduled ActivityDefinitions from Postgres.
2. Upsert a Temporal Schedule for each one, paused when disabled.
3. Delete Temporal Schedules whose activity_id has no matching DB row
(tombstone cleanup for deleted or trigger-type-changed definitions).
"""
@@ -59,55 +145,10 @@ async def sync(client: Client, db_url: str) -> None:
session_factory = async_sessionmaker(engine, expire_on_commit=False)
try:
async with session_factory() as session:
rows = (
await session.scalars(
select(ActivityDefinitionRow).where(
ActivityDefinitionRow.trigger_type.in_(["cron", "scheduled"])
)
)
).all()
return await sync_with_session_factory(client, session_factory)
finally:
await engine.dispose()
db_activity_ids: set[str] = set()
upserted = 0
skipped = 0
for row in rows:
defn = _row_to_domain(row)
if not isinstance(defn.trigger_config, (CronTriggerConfig, ScheduledTriggerConfig)):
continue
db_activity_ids.add(str(defn.id))
if defn.enabled:
await upsert_schedule(client, defn)
upserted += 1
logger.info("upserted schedule for activity %s (%s)", defn.id, defn.name)
else:
# Disabled definitions: schedule may exist (paused) — leave it;
# upsert_schedule already handles the paused state.
await upsert_schedule(client, defn)
skipped += 1
logger.info("upserted paused schedule for disabled activity %s", defn.id)
# Tombstone cleanup: remove Temporal Schedules with no matching DB row.
existing_schedules = await list_schedules(client)
deleted = 0
for entry in existing_schedules:
if entry["activity_id"] not in db_activity_ids:
await delete_schedule(client, entry["activity_id"])
deleted += 1
logger.info("deleted orphaned schedule %s", entry["schedule_id"])
logger.info(
"sync_schedules complete — upserted=%d skipped_disabled=%d deleted_orphans=%d",
upserted,
skipped,
deleted,
)
async def main() -> None:
logging.basicConfig(level=logging.INFO)
@@ -116,7 +157,13 @@ async def main() -> None:
raise RuntimeError("ACTCORE_DB_URL is required")
client = await Client.connect(TEMPORAL_HOST, namespace=TEMPORAL_NAMESPACE)
await sync(client, db_url)
result = await sync(client, db_url)
print(
"Synced schedules: "
f"upserted={result.upserted} "
f"paused={result.paused} "
f"deleted_orphans={result.deleted_orphans}"
)
if __name__ == "__main__":
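The reconciliation in `sync_schedule_rows` reduces to a set difference between the IDs the database considers valid and the IDs Temporal currently holds. A minimal sketch, assuming hypothetical stand-ins: `Definition` replaces `ActivityDefinition`, `one_shot=True` models a `ScheduledTriggerConfig`, and `existing_activity_ids` is assumed to be what `list_schedules` reports as `activity_id`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Definition:
    """Hypothetical stand-in for an ActivityDefinition row."""

    id: str
    enabled: bool
    one_shot: bool = False  # True models a ScheduledTriggerConfig


def plan_sync(
    definitions: Iterable[Definition], existing_activity_ids: Iterable[str]
) -> Tuple[Dict[str, int], List[str]]:
    """Return (counters, orphan ids) without touching Temporal."""
    valid = set()
    upserted = paused = 0
    for d in definitions:
        # One-shot schedules are tracked under "<id>-once", as in
        # _valid_schedule_activity_id above.
        valid.add(f"{d.id}-once" if d.one_shot else d.id)
        if d.enabled:
            upserted += 1
        else:
            paused += 1
    orphans = sorted(set(existing_activity_ids) - valid)
    counters = {
        "upserted": upserted,
        "paused": paused,
        "deleted_orphans": len(orphans),
    }
    return counters, orphans


definitions = [
    Definition("a", enabled=True),
    Definition("b", enabled=False),
    Definition("c", enabled=True, one_shot=True),
]
counters, orphans = plan_sync(definitions, ["a", "b", "c", "c-once", "gone"])
```

In this sketch the plain `c` entry counts as an orphan because the one-shot definition is only valid under `c-once`, which appears to be the case the new `_valid_schedule_activity_id` helper addresses.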


@@ -0,0 +1,97 @@
"""Shared ActivityDefinition/event type/schedule sync orchestration."""
from __future__ import annotations
from typing import Any
from temporalio.client import Client
from activity_core.event_type_registry import sync_event_types
from activity_core.sync_activity_definitions import sync as sync_activity_definitions
from activity_core.sync_schedules import ScheduleSyncResult, sync_with_session_factory
_MAX_ERRORS = 20
_MAX_ERROR_MESSAGE_LENGTH = 1000
def _empty_result(
*,
definitions: bool,
schedules: bool,
event_types: bool,
) -> dict[str, Any]:
return {
"ok": True,
"ran": {
"definitions": definitions,
"schedules": schedules,
"event_types": event_types,
},
"definitions": {"synced": 0},
"event_types": {"synced": 0},
"schedules": ScheduleSyncResult().to_dict(),
"errors": [],
}
def _record_error(result: dict[str, Any], stage: str, exc: Exception) -> None:
errors = result["errors"]
if len(errors) >= _MAX_ERRORS:
return
errors.append(
{
"stage": stage,
"type": type(exc).__name__,
"message": str(exc)[:_MAX_ERROR_MESSAGE_LENGTH],
}
)
result["ok"] = False
async def run_sync(
*,
session_factory: Any,
temporal_client: Client | None,
definitions: bool = True,
schedules: bool = True,
event_types: bool = False,
) -> dict[str, Any]:
"""Run the requested sync stages and return bounded operator-facing status.
The orchestration deliberately accepts its database and Temporal
dependencies as arguments so startup and the API can share the same behavior
without creating another global runtime.
"""
result = _empty_result(
definitions=definitions,
schedules=schedules,
event_types=event_types,
)
if definitions:
try:
result["definitions"]["synced"] = await sync_activity_definitions(
session_factory
)
except Exception as exc: # pragma: no cover - exercised through tests
_record_error(result, "definitions", exc)
if event_types:
try:
result["event_types"]["synced"] = await sync_event_types(session_factory)
except Exception as exc: # pragma: no cover - exercised through tests
_record_error(result, "event_types", exc)
if schedules:
try:
if temporal_client is None:
raise RuntimeError("Temporal client is required for schedule sync")
schedule_result = await sync_with_session_factory(
temporal_client,
session_factory,
)
result["schedules"] = schedule_result.to_dict()
except Exception as exc: # pragma: no cover - exercised through tests
_record_error(result, "schedules", exc)
return result
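The stage-by-stage error handling in `run_sync` can be sketched with stub stages. This is a simplified illustration, not the module itself: `run_stages`, `ok_stage` and `failing_stage` are hypothetical names, and the real function takes a session factory and a Temporal client instead of a dict of callables.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

MAX_ERRORS = 20
MAX_ERROR_MESSAGE_LENGTH = 1000


def record_error(result: Dict[str, Any], stage: str, exc: Exception) -> None:
    """Append a bounded, operator-facing error entry and mark the run as failed."""
    errors = result["errors"]
    if len(errors) >= MAX_ERRORS:
        return
    errors.append(
        {
            "stage": stage,
            "type": type(exc).__name__,
            "message": str(exc)[:MAX_ERROR_MESSAGE_LENGTH],
        }
    )
    result["ok"] = False


async def run_stages(
    stages: Dict[str, Callable[[], Awaitable[int]]]
) -> Dict[str, Any]:
    """Run every stage; a failure is recorded and later stages still run."""
    result: Dict[str, Any] = {"ok": True, "synced": {}, "errors": []}
    for name, stage in stages.items():
        try:
            result["synced"][name] = await stage()
        except Exception as exc:
            record_error(result, name, exc)
    return result


async def ok_stage() -> int:
    return 3


async def failing_stage() -> int:
    raise RuntimeError("x" * 5000)  # oversized message, truncated on record


outcome = asyncio.run(
    run_stages({"definitions": ok_stage, "schedules": failing_stage})
)
```

Returning a status dict rather than raising lets worker startup log each failed stage and continue, while an API caller can inspect `ok` and `errors` directly.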


@@ -46,8 +46,7 @@ from activity_core.activities import (
)
from activity_core.db import make_engine
from sqlalchemy.ext.asyncio import async_sessionmaker
from activity_core.sync_activity_definitions import sync as sync_activity_defs
from activity_core.sync_schedules import sync as sync_schedules
from activity_core.sync_service import run_sync
from activity_core.workflows import RunActivityWorkflow, TaskExecutorWorkflow
logger = logging.getLogger(__name__)
@@ -77,20 +76,26 @@ async def run() -> None:
TEMPORAL_HOST, namespace=TEMPORAL_NAMESPACE, runtime=runtime
)
# T45: Sync ActivityDefinition files into DB before schedule sync.
logger.info("Syncing ActivityDefinition files...")
logger.info("Syncing ActivityDefinitions and Temporal Schedules...")
sync_engine = make_engine(db_url)
session_factory = async_sessionmaker(sync_engine, expire_on_commit=False)
try:
session_factory = async_sessionmaker(make_engine(db_url), expire_on_commit=False)
await sync_activity_defs(session_factory)
except Exception:
logger.exception("activity definition sync failed — continuing worker startup")
# T23: Sync Temporal Schedules with the DB before workers start accepting tasks.
logger.info("Syncing Temporal Schedules with ActivityDefinition DB...")
try:
await sync_schedules(client, db_url)
except Exception:
logger.exception("schedule sync failed — continuing worker startup")
sync_result = await run_sync(
session_factory=session_factory,
temporal_client=client,
definitions=True,
schedules=True,
event_types=False,
)
for error in sync_result["errors"]:
logger.error(
"startup sync %s failed — %s: %s",
error["stage"],
error["type"],
error["message"],
)
finally:
await sync_engine.dispose()
orchestrator_worker = Worker(
client,


@@ -0,0 +1,5 @@
{
"_note": "PARTIAL 4000-char preview of the 2026-06-26 daily-triage validation failure (retry attempt). Full payload not recoverable from activity-core: complete() drops finish_reason; report sink caps raw at 4000 chars; the JSON break is at char 5268 (beyond this preview). Full response would require llm-connect producer-side logs on railiance01.",
"validation_error": "Expecting ',' delimiter: line 136 column 22 (char 5268)",
"raw_output_preview": "{\n \"summary\": \"Triage report focusing on high-priority workstreams with pending human intervention or critical dependencies, and addressing recently cleared dependencies to unblock progress.\",\n \"recommendations\": [\n {\n \"rank\": 1,\n \"candidate\": \"2731fece-6c49-45b8-ab8a-4ea6c04ac603\",\n \"action\": \"work-next\",\n \"why\": \"A critical dependency (T03 - Configure bounded OpenBao token roles and policies) for this workstream has been cleared, unblocking significant progress on credential management. This workstream has 8 todo tasks and no waits, indicating it's ready for immediate action.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 5.0,\n \"strategic_value\": 5,\n \"time_criticality\": 5,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 5,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 2,\n \"candidate\": \"bd086c41-287d-4a4e-8ac5-9ab270f14d72\",\n \"action\": \"needs-human\",\n \"why\": \"This high-priority workstream has a 'needs_human' task (T04 - Provision the runtime API key outside Git) and is currently blocked by 3 'wait' tasks. Human intervention is required to unblock progress.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.7,\n \"strategic_value\": 5,\n \"time_criticality\": 4,\n \"risk_reduction\": 5,\n \"opportunity_enablement\": 4,\n \"job_size\": 3\n }\n },\n {\n \"rank\": 3,\n \"candidate\": \"9b56414a-c71f-4e72-9b2b-d2166aaf50d0\",\n \"action\": \"needs-human\",\n \"why\": \"This high-priority workstream has a 'needs_human' task (Task: Execute Live Ops-Hub Bootstrap) and is currently blocked by a 'wait' task. 
Human intervention is required to proceed with the bootstrap.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.7,\n \"strategic_value\": 5,\n \"time_criticality\": 4,\n \"risk_reduction\": 5,\n \"opportunity_enablement\": 4,\n \"job_size\": 3\n }\n },\n {\n \"rank\": 4,\n \"candidate\": \"84e17675-0d15-4268-a8bd-540124d37018\",\n \"action\": \"needs-human\",\n \"why\": \"This workstream has 4 'needs_human' tasks, including 'T02 \u2014 Resolve Forgejo production design decisions', indicating significant human input is required to move forward with the migration.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.0,\n \"strategic_value\": 4,\n \"time_criticality\": 4,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 5,\n \"candidate\": \"5646e13a-13af-4724-bca6-3c0d86f96733\",\n \"action\": \"needs-human\",\n \"why\": \"This workstream has a 'needs_human' task ('Three-Run Calibration Feedback') and is currently in a 'wait' state. Human feedback is crucial for operational hardening.\",\n \"confidence\": \"medium\",\n \"wsjf\": {\n \"score\": 3.7,\n \"strategic_value\": 4,\n \"time_criticality\": 3,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 6,\n \"candidate\": \"896ace77-21b3-450b-8fb7-254aefc8c570\",\n \"action\": \"close-out\",\n \"why\": \"The task 'Wire activity-core to the live service' has been resolved, and the workstream shows 2 progress tasks with 0 todo/wait tasks. 
This indicates the deployment is likely complete or nearing completion and ready for close-out after verification.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 3.7,\n \"strategic_value\": 4,\n \"time_criticality\": 3,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 7,\n \"candidate\": \"656e435d-3a00-4f5e-a38e-114467f9062e\",\n \"action\": \"work-next\",\n \"why\": \"This high-priority workstream has a single 'wait' task ('Task: Activate Ops-Hub Widgets In Inter-Hub') and no 'needs_human' tasks. It appears ready for the next step to activate the widgets.\",\n \"confidence\": \"medium\",\n \"wsjf"
}
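The fixture above has a well-formed prefix followed by a truncated tail. One way to salvage the complete items from such a payload is to decode the array element by element with `json.JSONDecoder.raw_decode`. The sketch below is an assumption about a workable technique, not the project's recovery code (which this diff does not show); `recover_items` is a hypothetical name.

```python
import json
from typing import Any, Dict, List, Tuple


def recover_items(
    raw: str, key: str = "recommendations"
) -> Tuple[List[Dict[str, Any]], bool]:
    """Return (complete items, truncated?) from a possibly cut-off JSON document.

    Limitation: the key is located by plain text search, so a string value
    that itself contains the quoted key would confuse this sketch.
    """
    decoder = json.JSONDecoder()
    start = raw.find(f'"{key}"')
    if start == -1:
        return [], False
    pos = raw.find("[", start)
    if pos == -1:
        return [], False
    pos += 1
    items: List[Dict[str, Any]] = []
    while True:
        # Skip separators between array elements.
        while pos < len(raw) and raw[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(raw):
            return items, True  # ran off the end: the array never closed
        if raw[pos] == "]":
            return items, False  # array closed normally
        try:
            item, pos = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            return items, True  # broken element: keep what decoded so far
        items.append(item)


truncated = (
    '{"summary": "s", "recommendations": '
    '[{"rank": 1}, {"rank": 2}, {"rank": 3, "why": "cut'
)
items, partial = recover_items(truncated)

complete_items, complete_partial = recover_items('{"recommendations": [{"rank": 1}]}')
```

Each recovered item would still need schema validation before use; this only separates elements that parse from the tail that does not.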


@@ -88,6 +88,43 @@ def test_for_each_binds_each_list_item_before_condition_and_action_rendering() -
]
def test_for_each_can_gate_registry_hygiene_gaps_on_signal() -> None:
rules = [
{
"id": "flag-registry-hygiene-gap",
"for_each": "context.gaps",
"bind_as": "g",
"condition": 'context.g.hygiene_signal != ""',
"action": {
"task_template": "Close registry hygiene gap for {context.g.repo}",
"target_repo": "context.g.repo",
"priority": "medium",
"labels": ["registry-hygiene", "{context.g.hygiene_signal}"],
},
}
]
context = {
"gaps": [
{
"repo": "reuse-surface",
"hygiene_signal": "empty_capability_scaffold",
},
{
"repo": "activity-core",
"hygiene_signal": "",
},
]
}
specs = expand_rule_actions(rules, _Event(), context)
assert [spec["target_repo"] for spec in specs] == ["reuse-surface"]
assert specs[0]["labels"] == [
"registry-hygiene",
"empty_capability_scaffold",
]
def test_for_each_rejects_non_path_expression() -> None:
rules = [
{


@@ -12,6 +12,7 @@ Covers:
from __future__ import annotations
import json
from pathlib import Path
from types import SimpleNamespace
from typing import Any
@@ -333,7 +334,14 @@ def test_execute_instruction_forwards_output_schema_to_llm_connect(tmp_path, mon
def test_execute_instruction_with_audit_accepts_report_payload():
report_data = {
"summary": "State Hub has loose ends.",
"recommendations": [{"action": "revisit", "candidate": "CUST-WP-0045"}],
"recommendations": [
{
"rank": 1,
"action": "revisit",
"candidate": "CUST-WP-0045",
"why": "Loose ends need attention.",
}
],
}
llm = _CountingLLM([json.dumps(report_data)])
instr = _instr(
@@ -353,7 +361,14 @@ def test_execute_instruction_with_audit_accepts_report_payload():
def test_execute_instruction_with_audit_accepts_fenced_report_payload():
report_data = {
"summary": "State Hub has loose ends.",
"recommendations": [{"action": "revisit", "candidate": "CUST-WP-0045"}],
"recommendations": [
{
"rank": 1,
"action": "revisit",
"candidate": "CUST-WP-0045",
"why": "Loose ends need attention.",
}
],
}
llm = _CountingLLM([f"```json\n{json.dumps(report_data)}\n```"])
instr = _instr(
@@ -389,6 +404,216 @@ def test_execute_instruction_with_audit_rejects_invalid_report_schema():
assert llm.call_count == 2
# ── WP-0016-T03 resilient report recovery ─────────────────────────────────────
def _valid_rec(rank: int) -> dict[str, Any]:
return {
"rank": rank,
"candidate": f"WS-{rank}",
"action": "work-next",
"why": f"reason {rank}",
"wsjf": {"score": 5.0},
}
def _pretty_triage_with_truncated_tail(num_valid: int) -> str:
body = ",\n".join(" " + json.dumps(_valid_rec(i)) for i in range(1, num_valid + 1))
# Trailing object is cut off mid-string — the whole document is invalid JSON,
# reproducing the 2026-06-26 failure shape (valid prefix, broken tail).
return (
'{\n "summary": "Daily triage.",\n "recommendations": [\n'
+ body
+ ',\n {\n "rank": '
+ str(num_valid + 1)
+ ',\n "candidate": "WS-X",\n "action": "work-'
)
def test_resilient_report_recovers_valid_prefix_and_quarantines_truncated_tail():
raw = _pretty_triage_with_truncated_tail(7)
llm = _CountingLLM([raw, raw])
instr = _instr(
id="daily-triage-report",
prompt="Report.",
trusted_fields=[],
output_schema="schemas/daily-triage-report.json",
report_sinks=[{"type": "working-memory"}],
)
result = execute_instruction_with_audit(instr, _Event(), {}, llm)
assert result.output_validated is True
assert result.review_required is True
assert result.report is not None
assert result.report["partial"] is True
assert len(result.report["recommendations"]) == 7
assert result.report["summary"] == "Daily triage."
assert result.report["quarantined_count"] >= 1
# The broken tail is dropped — either as an unparseable/truncated span or,
# if _try_repair salvages its structure, as a schema-invalid item. Either way
# it carries a diagnostic error and never pollutes the surviving report.
assert result.report["quarantined_items"][0]["error"]
def test_resilient_report_quarantines_one_bad_item_among_valid():
recs = [_valid_rec(1), {"candidate": "WS-2", "action": "x", "why": "no rank"}, _valid_rec(3)]
raw = json.dumps({"summary": "Triage.", "recommendations": recs})
llm = _CountingLLM([raw, raw])
instr = _instr(
id="daily-triage-report",
prompt="Report.",
trusted_fields=[],
output_schema="schemas/daily-triage-report.json",
report_sinks=[{"type": "working-memory"}],
)
result = execute_instruction_with_audit(instr, _Event(), {}, llm)
assert result.output_validated is True
assert result.report["partial"] is True
assert len(result.report["recommendations"]) == 2
assert result.report["quarantined_count"] == 1
assert "rank" in result.report["quarantined_items"][0]["error"]
# ── WP-0016-T04 producer guardrails ───────────────────────────────────────────
def _triage_instr() -> SimpleNamespace:
return _instr(
id="daily-triage-report",
prompt="Report.",
trusted_fields=[],
output_schema="schemas/daily-triage-report.json",
report_sinks=[{"type": "working-memory"}],
)
def test_guardrail_count_cap_on_valid_happy_path():
# 9 fully-valid recommendations in a syntactically valid document: schema
# validation passes, but the maxItems=7 count cap must keep 7 and quarantine 2.
recs = [_valid_rec(i) for i in range(1, 10)]
raw = json.dumps({"summary": "Triage.", "recommendations": recs})
llm = _CountingLLM([raw])
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
assert llm.call_count == 1 # no retry — the document was valid
assert result.report["partial"] is True
assert len(result.report["recommendations"]) == 7
assert result.report["quarantined_count"] == 2
assert all(q["reason"] == "over_limit" for q in result.report["quarantined_items"])
def test_guardrail_oversized_string_quarantined():
big = _valid_rec(2)
big["why"] = "x" * 5000 # exceeds _MAX_STRING_LEN
raw = json.dumps({"summary": "Triage.", "recommendations": [_valid_rec(1), big]})
llm = _CountingLLM([raw])
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
assert len(result.report["recommendations"]) == 1
assert result.report["quarantined_count"] == 1
assert result.report["quarantined_items"][0]["reason"] == "guardrail"
def test_guardrail_allow_list_rejects_unknown_candidate():
raw = json.dumps({
"summary": "Triage.",
"recommendations": [_valid_rec(1), _valid_rec(2)], # candidates WS-1, WS-2
})
llm = _CountingLLM([raw])
context = {"known_candidates": ["WS-1"]}
result = execute_instruction_with_audit(_triage_instr(), _Event(), context, llm)
assert len(result.report["recommendations"]) == 1
assert result.report["recommendations"][0]["candidate"] == "WS-1"
assert result.report["quarantined_items"][0]["reason"] == "allow_list"
def _nested(depth: int) -> dict[str, Any]:
node: dict[str, Any] = {"leaf": 1}
for _ in range(depth):
node = {"a": node}
return node
def test_guardrail_over_depth_quarantined():
deep = _valid_rec(2)
deep["extra"] = _nested(12) # well past _MAX_DEPTH
raw = json.dumps({"summary": "Triage.", "recommendations": [_valid_rec(1), deep]})
llm = _CountingLLM([raw])
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
assert len(result.report["recommendations"]) == 1
assert result.report["quarantined_count"] == 1
assert result.report["quarantined_items"][0]["reason"] == "guardrail"
assert "depth" in result.report["quarantined_items"][0]["error"]
def test_resilient_recovery_against_real_2026_06_26_fixture():
# The actual captured failure payload (4000-char preview, truncated at the 7th
# recommendation) — the run that reset the WP-0006-T03 streak. Before WP-0016
# this discarded the whole report; now it must recover the valid prefix.
fixture = json.loads(
Path("tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json")
.read_text(encoding="utf-8")
)
raw = fixture["raw_output_preview"]
llm = _CountingLLM([raw, raw])
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
assert result.output_validated is True
assert result.report["partial"] is True
# Six recommendations are fully intact before the truncation point.
assert len(result.report["recommendations"]) >= 6
assert all("rank" in rec and "candidate" in rec for rec in result.report["recommendations"])
class _MetadataBadLLM:
def __init__(self) -> None:
self.call_count = 0
self.last_response_metadata: dict[str, Any] | None = None
def complete(
self,
prompt: str,
model: str = "",
config: dict | None = None,
) -> str:
self.call_count += 1
self.last_response_metadata = {
"finish_reason": "length",
"usage": {"input_tokens": 1100, "output_tokens": 1200},
}
return ("x" * 9000) + "{"
def test_invalid_report_preserves_response_metadata_and_long_preview():
llm = _MetadataBadLLM()
instr = _instr(
id="daily-triage-report",
prompt="Report.",
trusted_fields=[],
report_sinks=[{"type": "working-memory", "path": "/tmp"}],
)
result = execute_instruction_with_audit(instr, _Event(), {}, llm)
assert llm.call_count == 2
assert result.output_validated is False
assert result.llm_response_metadata == {
"finish_reason": "length",
"usage": {"input_tokens": 1100, "output_tokens": 1200},
}
assert result.report["llm_response_metadata"] == result.llm_response_metadata
assert len(result.report["raw_output_preview"]) > 4000
def test_execute_instruction_with_audit_preserves_invalid_report_with_sinks(
tmp_path,
monkeypatch,


@@ -0,0 +1,114 @@
from __future__ import annotations

from typing import Any

import pytest

from activity_core import api


@pytest.mark.asyncio
async def test_admin_sync_definitions_only_does_not_require_temporal(
    monkeypatch,
) -> None:
    seen: dict[str, Any] = {}

    async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
        seen.update(kwargs)
        return {"ok": True, "ran": {"definitions": True}}

    monkeypatch.setattr(api, "_session_factory", object())
    monkeypatch.setattr(api, "_temporal_client", None)
    monkeypatch.setattr(api, "run_sync", fake_run_sync)

    result = await api.admin_sync(
        definitions=True,
        schedules=False,
        event_types=False,
    )

    assert result == {"ok": True, "ran": {"definitions": True}}
    assert seen["session_factory"] is api._session_factory
    assert seen["temporal_client"] is None
    assert seen["definitions"] is True
    assert seen["schedules"] is False
    assert seen["event_types"] is False


@pytest.mark.asyncio
async def test_admin_sync_schedules_only_passes_temporal(monkeypatch) -> None:
    temporal = object()
    seen: dict[str, Any] = {}

    async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
        seen.update(kwargs)
        return {
            "ok": True,
            "schedules": {
                "upserted": 1,
                "paused": 0,
                "deleted_orphans": 0,
            },
        }

    monkeypatch.setattr(api, "_session_factory", object())
    monkeypatch.setattr(api, "_temporal_client", temporal)
    monkeypatch.setattr(api, "run_sync", fake_run_sync)

    result = await api.admin_sync(
        definitions=False,
        schedules=True,
        event_types=False,
    )

    assert result["schedules"]["upserted"] == 1
    assert seen["temporal_client"] is temporal
    assert seen["definitions"] is False
    assert seen["schedules"] is True
    assert seen["event_types"] is False


@pytest.mark.asyncio
async def test_admin_sync_all_sync_returns_failure_result(monkeypatch) -> None:
    async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
        return {
            "ok": False,
            "ran": {
                "definitions": kwargs["definitions"],
                "schedules": kwargs["schedules"],
                "event_types": kwargs["event_types"],
            },
            "errors": [
                {
                    "stage": "event_types",
                    "type": "RuntimeError",
                    "message": "bad event type",
                }
            ],
        }

    monkeypatch.setattr(api, "_session_factory", object())
    monkeypatch.setattr(api, "_temporal_client", object())
    monkeypatch.setattr(api, "run_sync", fake_run_sync)

    result = await api.admin_sync(
        definitions=True,
        schedules=True,
        event_types=True,
    )

    assert result == {
        "ok": False,
        "ran": {
            "definitions": True,
            "schedules": True,
            "event_types": True,
        },
        "errors": [
            {
                "stage": "event_types",
                "type": "RuntimeError",
                "message": "bad event type",
            }
        ],
    }

@@ -0,0 +1,289 @@
from __future__ import annotations
import asyncio
import json
from datetime import datetime
from pathlib import Path
from zoneinfo import ZoneInfo
from activity_core import automation_status as status
ACTIVITY_ID = "00000000-0000-0000-0000-000000000123"
def _window():
return status.resolve_window(
"2026-06-26",
"2026-06-29",
"Europe/Berlin",
)
def _definition(enabled: bool = True):
return {
"id": ACTIVITY_ID,
"name": "Daily Check",
"enabled": enabled,
"trigger_type": "cron",
"trigger_config": {
"trigger_type": "cron",
"cron_expression": "0 9 * * *",
"timezone": "Europe/Berlin",
"misfire_policy": "skip",
},
"source": "test",
}
def test_friday_shortcut_resolves_to_previous_friday_start() -> None:
now = datetime(2026, 6, 29, 12, 0, tzinfo=ZoneInfo("Europe/Berlin"))
window = status.resolve_window("friday", None, "Europe/Berlin", now=now)
assert window["since"].isoformat() == "2026-06-26T00:00:00+02:00"
assert window["until"].isoformat() == "2026-06-29T12:00:00+02:00"
def test_expected_fires_for_simple_cron_window() -> None:
fires = status.expected_fires(_definition(), _window())
assert fires == [
"2026-06-26T09:00:00+02:00",
"2026-06-27T09:00:00+02:00",
"2026-06-28T09:00:00+02:00",
"2026-06-29T09:00:00+02:00",
]
def test_completed_when_expected_run_exists() -> None:
run = {
"run_id": "run-1",
"activity_id": ACTIVITY_ID,
"scheduled_for": "2026-06-26T07:00:00+00:00",
"fired_at": "2026-06-26T07:00:10+00:00",
"tasks_spawned": 1,
}
report = status.classify_activity(
_definition(),
_window(),
[run],
[{"source": "state_hub_progress", "run_id": "run-1", "output_validated": True}],
None,
["2026-06-26T09:00:00+02:00"],
runs_available=True,
)
assert report["status"] == "completed"
def test_validation_failure_wins_over_completed_run() -> None:
run = {"run_id": "run-1", "activity_id": ACTIVITY_ID, "scheduled_for": None, "fired_at": "2026-06-26T07:00:10+00:00"}
report = status.classify_activity(
_definition(),
_window(),
[run],
[{"source": "working_memory", "run_id": "run-1", "output_validated": False}],
None,
["2026-06-26T09:00:00+02:00"],
runs_available=True,
)
assert report["status"] == "validation_failed"
def test_missed_when_expected_fire_has_no_run_and_runs_available() -> None:
report = status.classify_activity(
_definition(),
_window(),
[],
[],
None,
["2026-06-26T09:00:00+02:00"],
runs_available=True,
)
assert report["status"] == "missed"
def test_disabled_schedule_is_not_counted_as_missed() -> None:
report = status.classify_activity(
_definition(enabled=False),
_window(),
[],
[],
None,
["2026-06-26T09:00:00+02:00"],
runs_available=True,
)
assert report["status"] == "disabled"
def test_scheduled_definition_reports_one_shot_schedule_id() -> None:
definition = {
"id": ACTIVITY_ID,
"name": "One Shot",
"enabled": True,
"trigger_type": "scheduled",
"trigger_config": {
"trigger_type": "scheduled",
"at": "2026-06-26T09:00:00+02:00",
"timezone": "Europe/Berlin",
},
"source": "test",
}
report = status.classify_activity(
definition,
_window(),
[],
[],
None,
["2026-06-26T09:00:00+02:00"],
runs_available=False,
)
assert status.automation_schedule_id(_definition()) == f"activity-schedule-{ACTIVITY_ID}"
assert report["schedule_id"] == f"activity-schedule-{ACTIVITY_ID}-once"
def test_partial_source_availability_is_unknown_not_missed() -> None:
report = status.classify_activity(
_definition(),
_window(),
[],
[],
None,
["2026-06-26T09:00:00+02:00"],
runs_available=False,
)
assert report["status"] == "unknown"
assert "missed-run verdict is unknown" in report["warnings"][0]
def test_working_memory_frontmatter_evidence(tmp_path: Path) -> None:
note = tmp_path / "daily-triage-2026-06-26-run.md"
note.write_text(
"---\n"
"source: activity-core\n"
f"activity_id: {ACTIVITY_ID}\n"
"activity_core_run_id: run-1\n"
"scheduled_for: 2026-06-26T07:00:00+00:00\n"
"output_validated: false\n"
"created: 2026-06-26T07:01:00+00:00\n"
"---\n"
"body\n",
encoding="utf-8",
)
evidence, source = status.load_working_memory_evidence(str(tmp_path), _window())
assert source["status"] == "ok"
assert evidence[0]["run_id"] == "run-1"
assert evidence[0]["output_validated"] is False
def _scheduled_definition(enabled: bool = False):
return {
"id": "00000000-0000-0000-0000-000000000456",
"name": "One Shot",
"enabled": enabled,
"trigger_type": "scheduled",
"trigger_config": {
"trigger_type": "scheduled",
"at": "2026-06-26T09:00:00+02:00",
"timezone": "Europe/Berlin",
},
"source": "db",
}
def test_inventory_report_uses_db_definition_rows(monkeypatch) -> None:
async def fake_load_definitions(args, warnings):
return [dict(_definition(), source="db"), _scheduled_definition()], {"status": "ok", "source": "db"}
async def fake_temporal(host, namespace, definitions, *, timeout_seconds):
return {
ACTIVITY_ID: {
"schedule_id": f"activity-schedule-{ACTIVITY_ID}",
"available": True,
"paused": False,
"missed_catchup_window": 0,
"last_fired_at": None,
},
}, {"status": "ok", "count": 1}
monkeypatch.setattr(status, "load_definitions", fake_load_definitions)
monkeypatch.setattr(status, "load_temporal_visibility", fake_temporal)
args = status.parse_inventory_args(["--format", "json"])
report, exit_code = asyncio.run(status.build_inventory_report(args))
assert exit_code == 0
assert report["sources"]["definitions"] == {"status": "ok", "source": "db"}
assert report["summary"]["automation_count"] == 2
assert report["automations"][0]["definition_source"] == "db"
assert report["automations"][0]["temporal"]["status"] == "active"
assert report["automations"][1]["schedule_id"].endswith("-once")
def test_inventory_file_fallback_when_db_url_missing(monkeypatch) -> None:
monkeypatch.setattr(status, "file_definitions", lambda: [dict(_definition(), source="files")])
args = status.parse_inventory_args(["--db-url", "", "--temporal-host", ""])
report, exit_code = asyncio.run(status.build_inventory_report(args))
assert exit_code == 0
assert report["sources"]["definitions"]["status"] == "degraded"
assert report["automations"][0]["definition_source"] == "files"
assert "ACTCORE_DB_URL is not set" in report["warnings"][0]
def test_inventory_filters_disabled_definitions() -> None:
definitions = [_definition(enabled=True), _scheduled_definition(enabled=False)]
filtered = status.filter_inventory_definitions(
definitions,
ids=[],
names=[],
enabled=False,
trigger_types=set(),
)
assert [item["name"] for item in filtered] == ["One Shot"]
def test_inventory_temporal_unavailable_is_warning_not_failure(monkeypatch) -> None:
async def fake_load_definitions(args, warnings):
return [_definition()], {"status": "ok", "source": "db"}
async def fake_temporal(host, namespace, definitions, *, timeout_seconds):
return {}, {"status": "unavailable", "warning": "Temporal unavailable: nope"}
monkeypatch.setattr(status, "load_definitions", fake_load_definitions)
monkeypatch.setattr(status, "load_temporal_visibility", fake_temporal)
args = status.parse_inventory_args([])
report, exit_code = asyncio.run(status.build_inventory_report(args))
assert exit_code == 0
assert report["automations"][0]["temporal"]["status"] == "not_checked"
assert report["warnings"] == ["Temporal unavailable: nope"]
def test_inventory_cli_emits_json(monkeypatch, capsys) -> None:
monkeypatch.setattr(status, "file_definitions", lambda: [dict(_definition(), source="files")])
exit_code = asyncio.run(status.async_inventory_main([
"--db-url", "",
"--temporal-host", "",
"--format", "json",
]))
payload = json.loads(capsys.readouterr().out)
assert exit_code == 0
assert payload["mode"] == "automation-inventory"
assert payload["automations"][0]["name"] == "Daily Check"
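
The `friday` shortcut exercised by `test_friday_shortcut_resolves_to_previous_friday_start` can be sketched as follows. This is a minimal illustration, not the `automation_status.resolve_window` implementation: the function name `resolve_friday_window` is hypothetical, and the behaviour when `now` itself falls on a Friday (here: that same day's midnight) is an assumption the tests above do not cover.

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo


def resolve_friday_window(tz_name: str, now: datetime) -> dict:
    """Hypothetical helper: window from the most recent Friday 00:00 until now."""
    tz = ZoneInfo(tz_name)
    local_now = now.astimezone(tz)
    # weekday(): Monday=0 ... Friday=4; the modulo yields days since the last Friday.
    # Assumption: on a Friday this is 0, so the window starts at today's midnight.
    days_back = (local_now.weekday() - 4) % 7
    friday = local_now.date() - timedelta(days=days_back)
    since = datetime.combine(friday, time.min, tzinfo=tz)
    return {"since": since, "until": local_now}
```

With `now` set to Monday 2026-06-29 12:00 in `Europe/Berlin`, the sketch yields the same `since`/`until` pair the test asserts.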

@@ -1,6 +1,7 @@
from __future__ import annotations
import json
from pathlib import Path
import pytest
@@ -70,7 +71,14 @@ async def test_evaluate_instructions_returns_task_specs_with_audit(monkeypatch)
 async def test_evaluate_instructions_returns_report_payload(monkeypatch) -> None:
     llm = FakeLLMClient(json.dumps({
         "summary": "State Hub has open loose ends.",
-        "recommendations": [{"candidate": "CUST-WP-0045", "action": "work-next"}],
+        "recommendations": [
+            {
+                "rank": 1,
+                "candidate": "CUST-WP-0045",
+                "action": "work-next",
+                "why": "Open loose ends.",
+            }
+        ],
     }))
     monkeypatch.setattr(activities, "get_llm_client", lambda: llm)
@@ -209,6 +217,12 @@ async def test_evaluate_instructions_forwards_llm_connect_depth_config(monkeypat
"context": {},
})
# Read the live schema file rather than hard-coding it, so the forwarded
# json_schema assertion tracks schemas/daily-triage-report.json as the
# contract evolves (ACTIVITY-WP-0016-T02).
expected_schema = json.loads(
Path("schemas/daily-triage-report.json").read_text(encoding="utf-8")
)
assert llm.calls[0][2] == {
"model_name": "custodian-triage-balanced",
"temperature": 0.2,
@@ -216,16 +230,6 @@ async def test_evaluate_instructions_forwards_llm_connect_depth_config(monkeypat
         "max_depth": 2,
         "model_params": {
             "reasoning_effort": "medium",
-            "json_schema": {
-                "type": "object",
-                "required": ["summary", "recommendations"],
-                "properties": {
-                    "summary": {"type": "string"},
-                    "recommendations": {
-                        "type": "array",
-                        "items": {"type": "object"},
-                    },
-                },
-            },
+            "json_schema": expected_schema,
         },
     }

@@ -34,7 +34,7 @@ def test_issue_core_rest_sink_posts_task_contract(monkeypatch) -> None:
     monkeypatch.setattr(httpx, "post", fake_post)
-    ref = IssueCoreRestSink("http://issue-core.test/").emit(TaskSpec(
+    ref = IssueCoreRestSink("http://issue-core.test/", api_key="test-key").emit(TaskSpec(
         title="Run SBOM rescan for activity-core",
         description="SBOM is older than 30 days.",
         target_repo="activity-core",
@@ -67,12 +67,30 @@ def test_issue_core_rest_sink_posts_task_contract(monkeypatch) -> None:
"triggering_event_id": "scheduled",
"activity_definition_id": "activity-1",
},
"headers": {"Authorization": "Bearer test-key"},
"timeout": 10.0,
}
]
assert "review_required" not in posts[0]["json"]
def test_issue_core_rest_sink_requires_api_key() -> None:
sink = IssueCoreRestSink("http://issue-core.test/", api_key="")
with pytest.raises(RuntimeError, match="ISSUE_CORE_API_KEY"):
sink.emit(TaskSpec(
title="t",
description="",
target_repo="activity-core",
priority="low",
labels=[],
due_in_days=None,
source_type="rule",
source_id="r",
triggering_event_id="e",
activity_definition_id="a",
))
@pytest.mark.asyncio
async def test_emit_tasks_raises_when_sink_fails(monkeypatch) -> None:
class FailingSink:

@@ -13,7 +13,12 @@ def test_llm_connect_client_forwards_run_config(monkeypatch) -> None:
             pass
         def json(self) -> dict:
-            return {"content": '{"summary":"ok","recommendations":[]}'}
+            return {
+                "content": '{"summary":"ok","recommendations":[]}',
+                "finish_reason": "stop",
+                "usage": {"input_tokens": 10, "output_tokens": 20},
+                "raw_response": {"provider_blob": "not persisted"},
+            }
     def fake_post(url: str, json: dict, timeout: float) -> Response:
         captured["url"] = url
@@ -50,3 +55,7 @@ def test_llm_connect_client_forwards_run_config(monkeypatch) -> None:
"timeout_seconds": 42,
},
}
assert client.last_response_metadata == {
"finish_reason": "stop",
"usage": {"input_tokens": 10, "output_tokens": 20},
}

@@ -166,6 +166,93 @@ def test_state_hub_progress_sink_is_idempotent(monkeypatch) -> None:
assert result[0]["idempotency_key"] == idempotency_key
def test_core_hub_interaction_event_sink_posts_and_verifies_compact_event(monkeypatch) -> None:
posts: list[dict[str, Any]] = []
def fake_post(url: str, **kwargs: Any) -> DummyResponse:
assert url == "http://core-hub.test/api/v2/interaction-events"
assert kwargs["headers"]["Authorization"] == "Bearer runtime-secret"
posts.append({"url": url, **kwargs})
return DummyResponse(
{
"id": "event-1",
"eventType": "ops-endpoint-verified",
"widgetId": "widget-1",
}
)
def fake_get(url: str, **kwargs: Any) -> DummyResponse:
assert url == "http://core-hub.test/api/v2/interaction-events"
assert kwargs["headers"]["Authorization"] == "Bearer runtime-secret"
return DummyResponse({"data": [{"id": "event-1"}]})
monkeypatch.setenv("CORE_HUB_RUNTIME_TOKEN", "runtime-secret")
monkeypatch.setattr(httpx, "post", fake_post)
monkeypatch.setattr(httpx, "get", fake_get)
result = persist_ops_inventory_evidence(
_payload([
{
"type": "core-hub-interaction-event",
"core_hub_url": "http://core-hub.test",
"widget_id": "widget-1",
"event_type": "ops-endpoint-verified",
}
])
)
assert result == [
{
"type": "core-hub-interaction-event",
"status": "posted",
"event_type": "ops-endpoint-verified",
"event_id": "event-1",
"widget_id": "widget-1",
"verified": True,
"context_key": "ops_probe",
}
]
body = posts[0]["json"]
assert body["widgetId"] == "widget-1"
assert body["eventType"] == "ops-endpoint-verified"
assert body["metadata"]["activity_core_run_id"] == _run_id()
assert body["metadata"]["endpoint"]["url"] == "http://state-hub.test/health"
assert body["metadata"]["endpoint"]["widget_ref"] == "ops:endpoint:state-hub-health"
serialized = json.dumps(body, sort_keys=True)
assert "runtime-secret" not in serialized
assert "secret response body" not in serialized
assert "Authorization" not in serialized
assert "user:pass" not in serialized
assert "token=secret" not in serialized
def test_core_hub_sink_skips_cleanly_when_config_missing(monkeypatch) -> None:
monkeypatch.delenv("CORE_HUB_BASE_URL", raising=False)
monkeypatch.delenv("CORE_HUB_RUNTIME_TOKEN", raising=False)
monkeypatch.delenv("CORE_HUB_RUNTIME_TOKEN_FILE", raising=False)
monkeypatch.delenv("CORE_HUB_WIDGET_ID", raising=False)
monkeypatch.delenv("CORE_HUB_WIDGET_MAPPING", raising=False)
result = persist_ops_inventory_evidence(
_payload([{"type": "core-hub-interaction-event"}])
)
assert result == [
{
"type": "core-hub-interaction-event",
"status": "skipped",
"reason": "missing_core_hub_config",
"missing": [
"CORE_HUB_BASE_URL",
"CORE_HUB_RUNTIME_TOKEN or CORE_HUB_RUNTIME_TOKEN_FILE",
"widget_id or CORE_HUB_WIDGET_ID",
],
"context_key": "ops_probe",
}
]
def test_inter_hub_sink_skips_cleanly_when_config_missing(monkeypatch) -> None:
monkeypatch.delenv("INTER_HUB_URL", raising=False)
monkeypatch.delenv("OPS_HUB_KEY", raising=False)

@@ -93,12 +93,21 @@ def test_external_configmap_projects_enabled_daily_wsjf_definition(tmp_path) ->
assert definition.trigger_config["cron_expression"] == "20 7 * * *"
assert definition.trigger_config["timezone"] == "Europe/Berlin"
assert instruction["id"] == "daily-triage-report"
assert instruction["max_tokens"] == 1800
assert "most 7 recommendations" in instruction["prompt"]
assert "fewer well-formed" in instruction["prompt"]
assert instruction["output_schema"] == (
"/etc/activity-core/schemas/daily-triage-report.json"
)
assert instruction["report_sinks"][0]["type"] == "working-memory"
assert instruction["report_sinks"][1]["event_type"] == "daily_triage"
schema = _by_kind_name("ConfigMap", "actcore-report-schemas")
daily_schema = yaml.safe_load(schema["data"]["daily-triage-report.json"])
recommendations = daily_schema["properties"]["recommendations"]
assert recommendations["maxItems"] == 7
assert recommendations["items"]["properties"]["rank"]["maximum"] == 7
def test_ops_inventory_configmap_contains_probeable_inventory() -> None:
config = _by_kind_name("ConfigMap", "actcore-ops-service-inventory")

@@ -37,6 +37,10 @@ def _payload(sinks: list[dict[str, Any]]) -> dict[str, Any]:
"output_validated": True,
"review_required": False,
"validation_error": None,
"llm_response_metadata": {
"finish_reason": "stop",
"usage": {"output_tokens": 50},
},
}
],
}
@@ -62,6 +66,8 @@ def test_working_memory_sink_writes_idempotently(tmp_path) -> None:
assert "output_validated: true" in text
assert "review_required: false" in text
assert "model: test-model" in text
assert "LLM response metadata:" in text
assert '"finish_reason": "stop"' in text
assert "State Hub has loose ends." in text
@@ -113,6 +119,10 @@ def test_state_hub_progress_sink_posts(monkeypatch) -> None:
assert posts[0]["json"]["detail"]["activity_core_run_id"] == payload_run_id()
assert posts[0]["json"]["detail"]["output_validated"] is True
assert posts[0]["json"]["detail"]["review_required"] is False
assert posts[0]["json"]["detail"]["llm_response_metadata"] == {
"finish_reason": "stop",
"usage": {"output_tokens": 50},
}
def test_state_hub_progress_includes_prior_working_memory_path(

@@ -0,0 +1,167 @@
from __future__ import annotations
import json
from pathlib import Path
from typing import Any
import pytest
from temporalio.exceptions import ApplicationError
from activity_core.activities import resolve_context
from activity_core.context_resolvers import reuse_surface
from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY
class _Response:
def __init__(self, payload: Any) -> None:
self._payload = payload
def raise_for_status(self) -> None:
return None
def json(self) -> Any:
return self._payload
class _Completed:
returncode = 0
stderr = ""
def __init__(self, payload: dict[str, Any]) -> None:
self.stdout = json.dumps(payload)
def _write_rollout(path: Path) -> None:
path.write_text(
"""
domains:
reuse:
phase: active
repos:
- reuse-surface
- activity-core
parked:
phase: backlog
repos:
- ignored-repo
""".lstrip(),
encoding="utf-8",
)
def _write_cli_only_signals(path: Path) -> None:
path.write_text(
"""
signals:
empty_capability_scaffold:
enabled: true
registry_gap:
enabled: false
stale_scope:
enabled: false
stale_sbom:
enabled: false
publish_check_fail:
enabled: false
""".lstrip(),
encoding="utf-8",
)
def test_shell_resolver_emits_reuse_surface_gaps_and_advances_cursor(
tmp_path,
monkeypatch,
) -> None:
rollout = tmp_path / "rollout.yaml"
_write_rollout(rollout)
_write_cli_only_signals(tmp_path / "signals.yml")
reuse_root = tmp_path / "reuse-surface"
reuse_root.mkdir()
(reuse_root / "SCOPE.md").write_text("fresh\n", encoding="utf-8")
activity_root = tmp_path / "activity-core"
activity_root.mkdir()
monkeypatch.setenv("KAIZEN_RUNNER_HOST", "runner")
def fake_get(url: str, **kwargs: Any) -> _Response:
assert url.endswith("/repos/")
return _Response(
[
{
"slug": "reuse-surface",
"host_paths": {"runner": str(reuse_root)},
},
{
"slug": "activity-core",
"host_paths": {"runner": str(activity_root)},
},
]
)
def fake_run(cmd: list[str], **kwargs: Any) -> _Completed:
assert cmd == ["reuse-surface", "report", "gaps", "--format", "json"]
return _Completed({"empty_scaffolds": ["reuse-surface"]})
monkeypatch.setattr(reuse_surface.httpx, "get", fake_get)
monkeypatch.setattr(reuse_surface.subprocess, "run", fake_run)
import activity_core.context_resolvers # noqa: F401
result = CONTEXT_RESOLVER_REGISTRY["shell"]().resolve(
"reuse_surface_report_gaps",
None,
{
"roster": str(rollout),
"batch_size": 1,
},
)
assert result == {
"gaps": [
{
"repo": "reuse-surface",
"root": str(reuse_root),
"signal": "empty_capability_scaffold",
"hygiene_signal": "empty_capability_scaffold",
}
]
}
state = json.loads((tmp_path / "round-robin-state.json").read_text(encoding="utf-8"))
assert state["cursor"] == 1
assert state["last_batch"] == ["reuse-surface"]
def test_shell_resolver_keeps_kaizen_fallback_for_existing_queries() -> None:
assert CONTEXT_RESOLVER_REGISTRY["shell"]().resolve("unknown_query", None, {}) == {}
@pytest.mark.asyncio
async def test_optional_reuse_surface_missing_roster_binds_empty_list(tmp_path) -> None:
snapshot = await resolve_context(
[
{
"type": "shell",
"query": "reuse_surface_report_gaps",
"params": {"roster": str(tmp_path / "missing.yaml")},
"bind_to": "context.gaps",
}
]
)
assert snapshot == {"gaps": []}
@pytest.mark.asyncio
async def test_required_reuse_surface_missing_roster_fails_visibly(tmp_path) -> None:
with pytest.raises(ApplicationError, match="Required context resolver"):
await resolve_context(
[
{
"type": "shell",
"query": "reuse_surface_report_gaps",
"params": {"roster": str(tmp_path / "missing.yaml")},
"bind_to": "context.gaps",
"required": True,
}
]
)

@@ -0,0 +1,81 @@
"""ACTIVITY-WP-0014 T03: missed-fire detection verdict tests."""

from __future__ import annotations

from datetime import datetime, timedelta, timezone

from activity_core.schedule_health import evaluate_schedule_health

NOW = datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc)


def test_healthy_when_recent_fire_and_no_drops() -> None:
    health = evaluate_schedule_health(
        activity_id="a1",
        missed_catchup_window=0,
        last_fired_at=NOW - timedelta(minutes=5),
        now=NOW,
        expected_interval=timedelta(hours=1),
    )
    assert health.healthy is True
    assert health.missed is False
    assert health.reasons == []


def test_unhealthy_when_catchup_window_dropped_fires() -> None:
    health = evaluate_schedule_health(
        activity_id="a1",
        missed_catchup_window=2,
        last_fired_at=NOW - timedelta(minutes=5),
        now=NOW,
    )
    assert health.missed is True
    assert "2 fire(s) dropped" in health.reasons[0]


def test_unhealthy_when_last_fire_too_stale() -> None:
    health = evaluate_schedule_health(
        activity_id="daily",
        missed_catchup_window=0,
        last_fired_at=NOW - timedelta(days=2),
        now=NOW,
        expected_interval=timedelta(days=1),
    )
    assert health.missed is True
    assert any("exceeding the expected" in r for r in health.reasons)
    assert health.staleness == timedelta(days=2)


def test_within_tolerance_is_healthy() -> None:
    health = evaluate_schedule_health(
        activity_id="daily",
        missed_catchup_window=0,
        last_fired_at=NOW - (timedelta(days=1) + timedelta(minutes=5)),
        now=NOW,
        expected_interval=timedelta(days=1),
        tolerance=timedelta(minutes=10),
    )
    assert health.healthy is True


def test_no_fire_recorded_for_due_schedule_is_unhealthy() -> None:
    health = evaluate_schedule_health(
        activity_id="daily",
        missed_catchup_window=0,
        last_fired_at=None,
        now=NOW,
        expected_interval=timedelta(days=1),
    )
    assert health.missed is True
    assert "no recorded fire" in health.reasons[0]


def test_no_interval_and_no_fire_is_not_flagged() -> None:
    # Without an expected interval we cannot assert a miss from absence alone.
    health = evaluate_schedule_health(
        activity_id="event-ish",
        missed_catchup_window=0,
        last_fired_at=None,
        now=NOW,
    )
    assert health.healthy is True
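
Taken together, these tests describe a verdict built from two independent signals: fires dropped outside the catchup window, and staleness of the last fire relative to an expected interval. The sketch below is one way to satisfy the assertions; it is not the code in `activity_core.schedule_health`. The `ScheduleHealth` dataclass, the default `tolerance` of zero, and the exact wording of the reason strings beyond the fragments the tests match on are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ScheduleHealth:
    """Hypothetical result type; only the attributes the tests read are modelled."""
    activity_id: str
    missed: bool
    reasons: list = field(default_factory=list)
    staleness: Optional[timedelta] = None

    @property
    def healthy(self) -> bool:
        return not self.missed


def evaluate_schedule_health(
    *,
    activity_id: str,
    missed_catchup_window: int,
    last_fired_at: Optional[datetime],
    now: datetime,
    expected_interval: Optional[timedelta] = None,
    tolerance: timedelta = timedelta(0),  # assumed default, not confirmed by the tests
) -> ScheduleHealth:
    reasons = []
    staleness = None if last_fired_at is None else now - last_fired_at

    # Signal 1: the scheduler reports fires that fell outside the catchup window.
    if missed_catchup_window > 0:
        reasons.append(
            f"{missed_catchup_window} fire(s) dropped outside the catchup window"
        )

    # Signal 2: staleness is only judged when an expected interval is known.
    if expected_interval is not None:
        if staleness is None:
            reasons.append("no recorded fire for a schedule with an expected interval")
        elif staleness > expected_interval + tolerance:
            reasons.append(
                f"last fire was {staleness} ago, exceeding the expected "
                f"interval of {expected_interval}"
            )

    return ScheduleHealth(activity_id, bool(reasons), reasons, staleness)
```

Keeping the two signals separate means a schedule with no known interval can still be flagged for dropped fires, while absence of a fire alone is never treated as a miss.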

@@ -37,6 +37,7 @@ def _make_defn(
misfire_policy: str = "skip",
enabled: bool = True,
jitter: int = 0,
catchup_window_seconds: int | None = None,
) -> ActivityDefinition:
return ActivityDefinition(
id=uuid.uuid4(),
@@ -46,6 +47,7 @@ def _make_defn(
cron_expression=cron,
misfire_policy=misfire_policy,
jitter_seconds=jitter,
catchup_window_seconds=catchup_window_seconds,
),
)
@@ -186,6 +188,76 @@ async def test_misfire_policy_compress_sets_overlap_buffer_one(env: WorkflowEnvi
await delete_schedule(env.client, defn.id)

# ── ACTIVITY-WP-0014: explicit run-miss policies + catchup window ────────────


@pytest.mark.asyncio
async def test_skip_sets_short_catchup_window(env: WorkflowEnvironment) -> None:
    """skip = run on trigger or skip: tiny grace window, no real recovery."""
    defn = _make_defn(misfire_policy="skip")
    await upsert_schedule(env.client, defn)

    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
    assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.SKIP
    assert desc.schedule.policy.catchup_window == timedelta(seconds=60)

    await delete_schedule(env.client, defn.id)


@pytest.mark.asyncio
async def test_catchup_all_recovers_full_window(env: WorkflowEnvironment) -> None:
    """catchup_all = recover every missed fire: long window, BUFFER_ALL."""
    defn = _make_defn(misfire_policy="catchup_all")
    await upsert_schedule(env.client, defn)

    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
    assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.BUFFER_ALL
    assert desc.schedule.policy.catchup_window == timedelta(days=365)

    await delete_schedule(env.client, defn.id)


@pytest.mark.asyncio
async def test_catchup_latest_does_not_accumulate(env: WorkflowEnvironment) -> None:
    """catchup_latest = recover only the most recent missed fire: BUFFER_ONE."""
    defn = _make_defn(misfire_policy="catchup_latest")
    await upsert_schedule(env.client, defn)

    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
    assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.BUFFER_ONE
    assert desc.schedule.policy.catchup_window == timedelta(hours=24)

    await delete_schedule(env.client, defn.id)


@pytest.mark.asyncio
async def test_legacy_aliases_map_to_explicit_policies(env: WorkflowEnvironment) -> None:
    """Legacy catchup/compress keep working and pick up the new catchup windows."""
    catchup = _make_defn(misfire_policy="catchup")
    compress = _make_defn(misfire_policy="compress")
    await upsert_schedule(env.client, catchup)
    await upsert_schedule(env.client, compress)

    d1 = await env.client.get_schedule_handle(schedule_id(catchup.id)).describe()
    d2 = await env.client.get_schedule_handle(schedule_id(compress.id)).describe()
    assert d1.schedule.policy.catchup_window == timedelta(days=365)
    assert d2.schedule.policy.catchup_window == timedelta(hours=24)

    await delete_schedule(env.client, catchup.id)
    await delete_schedule(env.client, compress.id)


@pytest.mark.asyncio
async def test_explicit_catchup_window_override(env: WorkflowEnvironment) -> None:
    """An explicit catchup_window_seconds overrides the per-policy default."""
    defn = _make_defn(misfire_policy="skip", catchup_window_seconds=7200)
    await upsert_schedule(env.client, defn)

    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
    assert desc.schedule.policy.catchup_window == timedelta(hours=2)

    await delete_schedule(env.client, defn.id)


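
The policy table these tests assert can be summarised in a small lookup. The sketch below is an illustration under assumptions, not the `upsert_schedule` implementation: `resolve_misfire_policy` is a hypothetical name, the local `Overlap` enum stands in for `temporalio.client.ScheduleOverlapPolicy` so the snippet runs without Temporal installed, and mapping the legacy `catchup` alias to `BUFFER_ALL` is inferred from its 365-day window rather than asserted by the tests.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional


class Overlap(Enum):
    """Stand-in for temporalio.client.ScheduleOverlapPolicy (members used here only)."""
    SKIP = "skip"
    BUFFER_ONE = "buffer_one"
    BUFFER_ALL = "buffer_all"


# Legacy names resolve to the explicit policies (the overlap for "catchup" is assumed).
_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}

# Per-policy defaults, taken from the values asserted in the tests above.
_POLICIES = {
    "skip": (Overlap.SKIP, timedelta(seconds=60)),
    "catchup_all": (Overlap.BUFFER_ALL, timedelta(days=365)),
    "catchup_latest": (Overlap.BUFFER_ONE, timedelta(hours=24)),
}


def resolve_misfire_policy(
    misfire_policy: str,
    catchup_window_seconds: Optional[int] = None,
):
    """Return (overlap, catchup_window); an explicit window beats the default."""
    overlap, window = _POLICIES[_ALIASES.get(misfire_policy, misfire_policy)]
    if catchup_window_seconds is not None:
        window = timedelta(seconds=catchup_window_seconds)
    return overlap, window
```

For example, `resolve_misfire_policy("skip", catchup_window_seconds=7200)` keeps the `SKIP` overlap but widens the window to two hours, mirroring `test_explicit_catchup_window_override`.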
@pytest.mark.asyncio
async def test_schedule_smoke_test_creates_one_shot_schedule(
env: WorkflowEnvironment,

@@ -407,6 +407,70 @@ def test_recently_on_scope_hourly_failure_bubbles(monkeypatch) -> None:
StateHubContextResolver().resolve("recently_on_scope_hourly", None, {"range": "1h"})
def test_consistency_sweep_remote_all_posts_batch(monkeypatch) -> None:
calls: list[dict[str, Any]] = []
def fake_post(url: str, **kwargs: Any) -> DummyResponse:
calls.append({"url": url, **kwargs})
return DummyResponse(
{
"exit_code": 0,
"lock_skipped": False,
"repos_processed": [{"repo_slug": "state-hub", "result": "pass"}],
"skipped_clean": ["quiet-repo"],
"skipped_missing": [],
"skipped_budget": [],
}
)
monkeypatch.setenv("STATE_HUB_URL", "http://state-hub.test/")
monkeypatch.setattr(httpx, "post", fake_post)
result = StateHubContextResolver().resolve(
"consistency_sweep_remote_all",
None,
{"max_seconds": 300, "source": "activity-core", "required": True},
)
assert result["exit_code"] == 0
assert result["repos_processed"][0]["repo_slug"] == "state-hub"
assert calls == [
{
"url": "http://state-hub.test/consistency/sweep/remote-all",
"json": {"max_seconds": 300, "source": "activity-core"},
"timeout": 330.0,
}
]
def test_consistency_sweep_remote_all_failure_bubbles(monkeypatch) -> None:
def fake_post(url: str, **kwargs: Any) -> DummyResponse:
raise httpx.ConnectError("offline")
monkeypatch.setattr(httpx, "post", fake_post)
with pytest.raises(httpx.ConnectError):
StateHubContextResolver().resolve(
"consistency_sweep_remote_all",
None,
{"max_seconds": 300},
)
def test_consistency_sweep_remote_all_rejects_empty_response(monkeypatch) -> None:
def fake_post(url: str, **kwargs: Any) -> DummyResponse:
return DummyResponse({})
monkeypatch.setattr(httpx, "post", fake_post)
with pytest.raises(RuntimeError, match="missing required key"):
StateHubContextResolver().resolve(
"consistency_sweep_remote_all",
None,
{"max_seconds": 300},
)
def test_recently_on_scope_hourly_rejects_empty_response(monkeypatch) -> None:
def fake_post(url: str, **kwargs: Any) -> DummyResponse:
return DummyResponse({})

@@ -0,0 +1,81 @@
"""ACTIVITY-WP-0014 T05: idempotency-keyed State Hub writes."""

from __future__ import annotations

import httpx
import pytest

from activity_core import report_sinks
from activity_core.state_hub_write import (
    IDEMPOTENCY_HEADER,
    idempotency_headers,
    idempotency_key,
)


def test_key_is_stable_and_deterministic() -> None:
    a = idempotency_key("run1", "daily-triage-report", "daily_triage")
    b = idempotency_key("run1", "daily-triage-report", "daily_triage")
    assert a == b == "run1:daily-triage-report:daily_triage"


def test_key_shape_stable_with_missing_parts() -> None:
    assert idempotency_key("run1", None, "daily_triage") == "run1::daily_triage"


def test_key_sanitizes_control_and_whitespace() -> None:
    key = idempotency_key("run 1", "a\tb", "x\n")
    assert "\t" not in key and "\n" not in key and " " not in key


def test_headers_carry_the_key() -> None:
    headers = idempotency_headers("run1", "i", "e")
    assert headers == {IDEMPOTENCY_HEADER: "run1:i:e"}


def test_distinct_identities_get_distinct_keys() -> None:
    assert idempotency_key("r", "i", "daily_triage") != idempotency_key(
        "r", "i", "schedule_miss"
    )


def test_progress_exists_is_best_effort_on_connection_error(monkeypatch) -> None:
    """A down State Hub must not hard-fail the dedup read; it returns False so the
    keyed write can still proceed."""

    def _boom(*args, **kwargs):
        raise httpx.ConnectError("Connection refused")

    monkeypatch.setattr(report_sinks.httpx, "get", _boom)
    assert (
        report_sinks._progress_exists(
            "http://127.0.0.1:8000", "run1", "daily-triage-report", "daily_triage"
        )
        is False
    )


def test_report_sink_post_sends_idempotency_header(monkeypatch) -> None:
    """The state-hub-progress write carries a stable Idempotency-Key header."""
    captured: dict[str, object] = {}

    monkeypatch.setattr(report_sinks, "_progress_exists", lambda *a, **k: False)

    class _Resp:
        def raise_for_status(self) -> None: ...

        def json(self) -> dict[str, str]:
            return {"id": "pid-1"}

    def _capture_post(url, json, headers, timeout):  # noqa: A002
        captured["headers"] = headers
        return _Resp()

    monkeypatch.setattr(report_sinks.httpx, "post", _capture_post)

    payload = {"run_id": "run1", "activity_id": "act1", "scheduled_for": None}
    report_entry = {"instruction_id": "daily-triage-report", "report": {"summary": "s"}}
    sink = {"event_type": "daily_triage"}
    result = report_sinks._post_state_hub_progress(payload, report_entry, sink)

    assert result["status"] == "posted"
    assert captured["headers"][IDEMPOTENCY_HEADER] == "run1:daily-triage-report:daily_triage"
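
The key contract pinned down above (three parts joined by `:`, a missing part rendered as an empty segment, no whitespace or control characters) can be met by a few lines of code. The following is a sketch under assumptions, not the contents of `activity_core.state_hub_write`: the header value `Idempotency-Key` is taken from the test docstring only, and stripping unsafe characters (rather than replacing them) is a hypothetical choice the tests do not distinguish.

```python
import re

# Assumption: header name as suggested by the test docstring, not the module source.
IDEMPOTENCY_HEADER = "Idempotency-Key"

# Whitespace and ASCII control characters are dropped; the real sanitizer may differ.
_UNSAFE = re.compile(r"[\s\x00-\x1f\x7f]+")


def idempotency_key(run_id, instruction_id, event_type):
    """Join the identity parts with ':'; None becomes an empty segment."""
    parts = (run_id, instruction_id, event_type)
    return ":".join(_UNSAFE.sub("", "" if p is None else str(p)) for p in parts)


def idempotency_headers(run_id, instruction_id, event_type):
    """Wrap the key in the single header the keyed write sends."""
    return {IDEMPOTENCY_HEADER: idempotency_key(run_id, instruction_id, event_type)}
```

Because the key is a pure function of run, instruction, and event type, a retried write produces the same header and can be deduplicated on the receiving side.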

@@ -0,0 +1,126 @@
from __future__ import annotations

import uuid
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Any

import pytest

from activity_core import sync_schedules


def _row(
    *,
    activity_id: uuid.UUID,
    enabled: bool,
    trigger_config: dict[str, Any],
) -> SimpleNamespace:
    return SimpleNamespace(
        id=activity_id,
        name=f"definition-{activity_id}",
        enabled=enabled,
        trigger_config=trigger_config,
        context_sources=[],
        task_templates=[],
        dedupe_key_strategy="skip",
        version=1,
    )


@pytest.mark.asyncio
async def test_sync_schedule_rows_reports_drift_counts_and_preserves_one_shots(
    monkeypatch,
) -> None:
    new_id = uuid.uuid4()
    disabled_old_id = uuid.uuid4()
    one_shot_id = uuid.uuid4()
    orphan_id = uuid.uuid4()
    upserted: list[tuple[uuid.UUID, bool, str]] = []
    deleted: list[str] = []

    async def fake_upsert_schedule(client: object, defn: object) -> None:
        upserted.append((
            defn.id,
            defn.enabled,
            defn.trigger_config.trigger_type,
        ))

    async def fake_list_schedules(client: object) -> list[dict[str, str]]:
        return [
            {
                "schedule_id": f"activity-schedule-{disabled_old_id}",
                "activity_id": str(disabled_old_id),
            },
            {
                "schedule_id": f"activity-schedule-{one_shot_id}-once",
                "activity_id": f"{one_shot_id}-once",
            },
            {
                "schedule_id": f"activity-schedule-{orphan_id}",
                "activity_id": str(orphan_id),
            },
        ]

    async def fake_delete_schedule(client: object, activity_id: str) -> None:
        deleted.append(activity_id)

    monkeypatch.setattr(sync_schedules, "upsert_schedule", fake_upsert_schedule)
    monkeypatch.setattr(sync_schedules, "list_schedules", fake_list_schedules)
    monkeypatch.setattr(sync_schedules, "delete_schedule", fake_delete_schedule)

    result = await sync_schedules.sync_schedule_rows(
        object(),
        [
            _row(
                activity_id=new_id,
                enabled=True,
                trigger_config={
                    "trigger_type": "cron",
                    "cron_expression": "20 7 * * *",
                    "timezone": "Europe/Berlin",
                    "misfire_policy": "skip",
                },
            ),
            _row(
                activity_id=disabled_old_id,
                enabled=False,
                trigger_config={
                    "trigger_type": "cron",
                    "cron_expression": "20 * * * *",
                    "timezone": "Europe/Berlin",
                    "misfire_policy": "skip",
                },
            ),
            _row(
                activity_id=one_shot_id,
                enabled=True,
                trigger_config={
                    "trigger_type": "scheduled",
                    "at": datetime(2026, 6, 19, 8, 0, tzinfo=timezone.utc),
                    "timezone": "UTC",
                },
            ),
            _row(
                activity_id=uuid.uuid4(),
                enabled=True,
                trigger_config={
                    "trigger_type": "event",
                    "event_type": "kaizen.metrics.recorded",
                    "filters": {},
                },
            ),
        ],
    )

    assert result.to_dict() == {
        "upserted": 2,
        "paused": 1,
        "deleted_orphans": 1,
    }
    assert upserted == [
        (new_id, True, "cron"),
        (disabled_old_id, False, "cron"),
        (one_shot_id, True, "scheduled"),
    ]
    assert deleted == [str(orphan_id)]

tests/test_sync_service.py

@@ -0,0 +1,134 @@
from __future__ import annotations

from typing import Any

import pytest

from activity_core import sync_service
from activity_core.sync_schedules import ScheduleSyncResult


@pytest.mark.asyncio
async def test_run_sync_runs_requested_sections(monkeypatch) -> None:
    calls: list[str] = []

    async def fake_definitions(session_factory: object) -> int:
        calls.append("definitions")
        return 2

    async def fake_event_types(session_factory: object) -> int:
        calls.append("event_types")
        return 5

    async def fake_schedules(
        temporal_client: object,
        session_factory: object,
    ) -> ScheduleSyncResult:
        calls.append("schedules")
        return ScheduleSyncResult(upserted=3, paused=1, deleted_orphans=2)

    monkeypatch.setattr(sync_service, "sync_activity_definitions", fake_definitions)
    monkeypatch.setattr(sync_service, "sync_event_types", fake_event_types)
    monkeypatch.setattr(sync_service, "sync_with_session_factory", fake_schedules)

    result = await sync_service.run_sync(
        session_factory=object(),
        temporal_client=object(),
        definitions=True,
        schedules=True,
        event_types=True,
    )

    assert calls == ["definitions", "event_types", "schedules"]
    assert result["ok"] is True
    assert result["ran"] == {
        "definitions": True,
        "schedules": True,
        "event_types": True,
    }
    assert result["definitions"] == {"synced": 2}
    assert result["event_types"] == {"synced": 5}
    assert result["schedules"] == {
        "upserted": 3,
        "paused": 1,
        "deleted_orphans": 2,
    }
    assert result["errors"] == []


@pytest.mark.asyncio
async def test_run_sync_collects_errors_and_continues(monkeypatch) -> None:
    calls: list[str] = []

    async def failing_definitions(session_factory: object) -> int:
        calls.append("definitions")
        raise RuntimeError("definition parse failed")

    async def fake_schedules(
        temporal_client: object,
        session_factory: object,
    ) -> ScheduleSyncResult:
        calls.append("schedules")
        return ScheduleSyncResult(upserted=1)

    monkeypatch.setattr(
        sync_service,
        "sync_activity_definitions",
        failing_definitions,
    )
    monkeypatch.setattr(sync_service, "sync_with_session_factory", fake_schedules)

    result = await sync_service.run_sync(
        session_factory=object(),
        temporal_client=object(),
        definitions=True,
        schedules=True,
        event_types=False,
    )

    assert calls == ["definitions", "schedules"]
    assert result["ok"] is False
    assert result["definitions"] == {"synced": 0}
    assert result["schedules"]["upserted"] == 1
    assert result["errors"] == [
        {
            "stage": "definitions",
            "type": "RuntimeError",
            "message": "definition parse failed",
        }
    ]


@pytest.mark.asyncio
async def test_run_sync_reports_missing_temporal_client_for_schedules() -> None:
    result = await sync_service.run_sync(
        session_factory=object(),
        temporal_client=None,
        definitions=False,
        schedules=True,
        event_types=False,
    )

    assert result["ok"] is False
    assert result["errors"] == [
        {
            "stage": "schedules",
            "type": "RuntimeError",
            "message": "Temporal client is required for schedule sync",
        }
    ]


def test_record_error_bounds_error_count() -> None:
    result: dict[str, Any] = {
        "ok": True,
        "errors": [],
    }

    for i in range(25):
        sync_service._record_error(result, "stage", RuntimeError(f"boom {i}"))

    assert result["ok"] is False
    assert len(result["errors"]) == 20
    assert result["errors"][0]["message"] == "boom 0"
    assert result["errors"][-1]["message"] == "boom 19"


@@ -4,11 +4,11 @@ type: workplan
title: "Post-triage operational hardening"
domain: custodian
repo: activity-core
status: active
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-03"
updated: "2026-06-16"
updated: "2026-06-30"
state_hub_workstream_id: "5646e13a-13af-4724-bca6-3c0d86f96733"
---
@@ -104,7 +104,7 @@ and emitted a validated `daily_triage` report plus working-memory note.
```task
id: ACTIVITY-WP-0006-T03
status: wait
status: done
priority: medium
state_hub_task_id: "7cbf0a35-71a1-47ac-afc2-f51ad2180fd0"
```
@@ -174,6 +174,56 @@ the worker consumes the configured URL, then produce schema-valid daily triage
evidence and three clean scheduled runs. This narrower path is tracked in
`ACTIVITY-WP-0010`.
2026-06-25: Consecutive-run streak resumed. State Hub `daily_triage` progress
events from author `activity-core` fired on time on **2026-06-24 05:20:56Z** and
**2026-06-25 05:20:47Z** (07:20 Berlin), both delivered, no misfires. That is two
clean consecutive scheduled runs. **RECHECK 2026-06-26 (after 05:20Z):** confirm
the 06-26 scheduled `daily_triage` event delivered. If clean, that completes three
clean consecutive scheduled runs (06-24 / 06-25 / 06-26) — record the calibration
result in State Hub and close T03. If the 06-26 run misfires or is missing, the
streak resets and T03 stays `wait`. Flag deliberately kept in-repo (agent-agnostic)
rather than tied to any single coding agent's scheduler.
2026-06-26 recheck outcome: **streak reset at two.** The 06-26 scheduled run fired
on time (`daily_triage` event 05:20:57Z) — scheduling layer healthy, no misfire —
but the `daily-triage-report` instruction output **failed schema validation**:
`Expecting ',' delimiter: line 136 column 22 (char 5268)`. The model produced a
long ranked WSJF recommendation list (reached rank 7+ with nested `wsjf` objects)
whose JSON broke ~char 5268; only a bounded 4000-char preview is preserved in the
State Hub event, so the exact offending token needs the runtime llm-connect log.
This is an LLM-output-quality failure (tracked by `ACTIVITY-WP-0010`), not a
runtime/projection failure. T03 stays `wait`; three clean consecutive scheduled
runs not yet achieved (06-24 ✅, 06-25 ✅, 06-26 ✗-validation).
2026-06-27 recheck outcome: streak remains reset. The scheduled run fired and
wrote State Hub progress plus working memory, but daily-triage-report failed
validation again with an unterminated string around char 5246. This confirms the
runner/sink path is alive and the active blocker is live deployment of the
ACTIVITY-WP-0016 output-robustness bundle and runtime prompt/token changes, not
a missing schedule. T03 stays wait until a post-deployment smoke passes and three
new clean scheduled runs are collected.
2026-06-30 early checkpoint: two new clean scheduled runs exist after the
validation failures. State Hub daily_triage progress shows 2026-06-28
05:20:51Z run `6a44d6dd-3f02-53f2-a5d8-d42b76b0ef98` and 2026-06-29
05:20:49Z run `1dfb47c9-07bf-551b-b778-1d21a40bd95c`, both with
`output_validated=true` and working-memory notes written. The current local time
was 2026-06-30 01:37 Europe/Berlin, before the expected 07:20 Berlin scheduled
fire, so the three-clean-run gate cannot close yet. Recheck after 2026-06-30
05:20Z; if that scheduled run validates, the clean streak is 06-28 / 06-29 /
06-30 and T03 can close with calibration feedback.
2026-06-30 closeout: the 07:20 Berlin scheduled run fired at 05:20:50Z as run
`ac3d71a0-2f8f-50df-b3ce-7c60c2abb5c5` with `output_validated=true` and a
working-memory note written. The post-failure clean streak is now complete:
2026-06-28 (`6a44d6dd`), 2026-06-29 (`1dfb47c9`), and 2026-06-30 (`ac3d71a0`).
Calibration feedback: the scheduler, worker, llm-connect route, State Hub sink,
and working-memory sink are stable again; the recommendations were operationally
useful but too dense at 10 items, repeatedly emphasizing human-dependency and
infrastructure-unblock work. ACTIVITY-WP-0016 now owns the density/contract fix:
Railiance runtime projection was aligned to a top-7 contract so the next live
run can prove the bounded output posture. T03 is done.
## Rule Action Contract Documentation
```task


@@ -8,7 +8,7 @@ status: blocked
owner: codex
topic_slug: custodian
created: "2026-06-18"
updated: "2026-06-18"
updated: "2026-06-27"
state_hub_workstream_id: "f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9"
---
@@ -87,7 +87,7 @@ reported 9 passed.
```task
id: ACTIVITY-WP-0010-T02
status: wait
status: done
priority: high
state_hub_task_id: "23545ddc-926b-485a-8535-5cc11e01134a"
```
@@ -107,6 +107,30 @@ Current wait reason: this is Railiance/operator-owned live cluster work. State
Hub handoff message `9a074b7c-4b87-4e3c-a6bf-e1fe5580daa8` asks
`railiance-cluster` to reconcile the updated config and smoke it.
2026-06-19 recheck:
- Deployed `llm-connect` into the `activity-core` namespace on `railiance01`
(the cluster that runs `actcore-worker`). Until then only `coulombcore` had
llm-connect, and its in-cluster Service URL is cluster-local.
- `actcore-runtime-config` already exposed the verified URL and timeout;
`deployment/actcore-worker` was restarted and now reports
`LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080`.
- `llm-connect-provider-secrets` reports `DATA 1`; no Secret values were
inspected.
- Worker health probe to llm-connect `/health` returns `{"status": "ok"}`.
- `actcore-state-hub-bridge` remains `0/1` Ready with upstream timeouts, so T02
is not fully closed until the node-local State Hub tunnel is restored.
2026-06-27 recheck:
- Superseded by real scheduled runner evidence: State Hub daily_triage events on
2026-06-24, 2026-06-25, 2026-06-26, and 2026-06-27 all reached State Hub and
wrote working-memory notes. The bridge/sink is therefore reachable for the
live runner.
- 2026-06-24 and 2026-06-25 were schema-valid; 2026-06-26 and 2026-06-27 failed
output validation after calling llm-connect. That moves the active blocker out
of T02 and into the WP-0016 live bundle/smoke lane. Marking T02 done.
## Run Daily Triage Fixture Smoke
```task
@@ -128,6 +152,27 @@ Done when:
detail;
- `scripts/verify_daily_triage.py` reports the smoke/manual run as present.
2026-06-19 recheck:
- In-namespace llm-connect fixture smoke on `railiance01` passed:
`smoke: pass health=ok latency_seconds=1.681 recommendations=1`.
- Manual `POST /activity-definitions/6fca51fa-387a-4fd0-bc4e-d62c29eb859a/trigger`
reached llm-connect, but the workflow failed at `persist_instruction_reports`
with `state-hub-progress` sink `Connection refused` while
`actcore-state-hub-bridge` is unhealthy.
- T03 therefore remains open until State Hub bridge reachability is restored and
a run emits non-secret `daily_triage` progress with `output_validated=true`.
2026-06-27 recheck:
- Scheduled runs on 2026-06-24 and 2026-06-25 satisfy the non-secret smoke
evidence for llm-connect call, State Hub progress with output_validated=true,
and working-memory note creation.
- Kept T03 at progress rather than done because the workstation did not run the
live verifier against Temporal/activity-core DB, and the smoke must be repeated
after the WP-0016 code/schema/runtime-prompt deployment due to the 2026-06-26 and
2026-06-27 malformed-output failures.
## Collect Three Clean Scheduled Runs
```task
@@ -151,6 +196,14 @@ Done when:
- `ACTIVITY-WP-0006-T03` and `ACTIVITY-WP-0009-T01` can move from `wait` to
`done`.
2026-06-27 recheck:
- Three-clean-run streak is reset. The latest sequence is 2026-06-24 clean,
2026-06-25 clean, 2026-06-26 validation_failed, 2026-06-27 validation_failed.
- Current pickup is to deploy ACTIVITY-WP-0016 code/schema together with the
Railiance runtime prompt and max_tokens changes, run a live smoke, then restart
the three-consecutive-scheduled-run gate from zero.
## Close Handoff State
```task


@@ -4,11 +4,11 @@ type: workplan
title: "Definition And Schedule Hot Reload"
domain: custodian
repo: activity-core
status: ready
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-18"
updated: "2026-06-18"
updated: "2026-06-22"
state_hub_workstream_id: "8887075e-21ec-451b-b82b-cd81035c9ca5"
---
@@ -39,7 +39,7 @@ a repo checkout manager or CI system.
```task
id: ACTIVITY-WP-0012-T01
status: todo
status: done
priority: high
state_hub_task_id: "53a7970b-7eec-47f5-ad30-bbd7c6271952"
```
@@ -57,11 +57,17 @@ Done when:
- failures are collected into a bounded `errors[]` result while preserving the
current startup best-effort behavior.
2026-06-19: Completed. Added `activity_core.sync_service.run_sync`, which
orchestrates ActivityDefinition, event type, and schedule sync independently
from explicit DB session factory and Temporal client dependencies. Worker
startup now calls the shared service for definitions+schedules and logs bounded
stage errors while continuing startup.
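The orchestration shape described above can be sketched roughly as follows. This is a minimal illustration, not the code in `activity_core.sync_service`: `run_stages` and `record_error` are hypothetical names, and the bound of 20 errors is taken from the unit test for `_record_error` rather than from the service source.

```python
import asyncio
from typing import Any, Awaitable, Callable

MAX_ERRORS = 20  # bound observed in the unit tests; treated as an assumption here


def record_error(result: dict[str, Any], stage: str, exc: Exception) -> None:
    """Mark the run as failed and keep at most MAX_ERRORS error entries."""
    result["ok"] = False
    if len(result["errors"]) < MAX_ERRORS:
        result["errors"].append(
            {"stage": stage, "type": type(exc).__name__, "message": str(exc)}
        )


async def run_stages(
    stages: dict[str, Callable[[], Awaitable[int]]],
) -> dict[str, Any]:
    """Run each stage in order; a failing stage does not stop the next one."""
    result: dict[str, Any] = {"ok": True, "errors": []}
    for name, stage in stages.items():
        try:
            result[name] = {"synced": await stage()}
        except Exception as exc:  # collect the failure instead of aborting
            result[name] = {"synced": 0}
            record_error(result, name, exc)
    return result
```

Each stage is passed in as a zero-argument coroutine function, so the caller decides which dependencies (session factory, Temporal client) a stage closes over.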
## Add Admin Sync Endpoint
```task
id: ACTIVITY-WP-0012-T02
status: todo
status: done
priority: high
state_hub_task_id: "8697c761-15d1-4da0-b66b-d838218a2495"
```
@@ -80,11 +86,17 @@ Done when:
- endpoint tests cover definitions-only, schedules-only, all-sync, and failure
result behavior.
2026-06-19: Completed. Added `POST /admin/sync` with defaults
`definitions=true`, `schedules=true`, and `event_types=false`. The response
reports definition/event counts, schedule upsert/pause/orphan-delete counts, and
bounded `errors[]`. Tests cover definitions-only, schedules-only, all-sync, and
failure-result behavior.
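For operators scripting the call, the request URL with the documented defaults could be built along these lines. This is a sketch only: `admin_sync_url` is a hypothetical helper, the base URL in the usage note is a placeholder, and it assumes the three flags are accepted as lowercase query parameters.

```python
from urllib.parse import urlencode


def admin_sync_url(
    base_url: str,
    *,
    definitions: bool = True,
    schedules: bool = True,
    event_types: bool = False,
) -> str:
    """Build the POST target for /admin/sync with explicit boolean flags."""
    flags = {
        "definitions": definitions,
        "schedules": schedules,
        "event_types": event_types,
    }
    # Render booleans as "true"/"false" (assumed query-parameter format).
    query = urlencode({name: str(value).lower() for name, value in flags.items()})
    return f"{base_url.rstrip('/')}/admin/sync?{query}"
```

For example, `admin_sync_url("http://actcore-api.example:8000")` yields a URL ending in `/admin/sync?definitions=true&schedules=true&event_types=false`, which can then be sent with any HTTP client as a `POST`.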
## Preserve Schedule Drift Semantics
```task
id: ACTIVITY-WP-0012-T03
status: todo
status: done
priority: high
state_hub_task_id: "efeac412-632c-4c90-9428-bb575ac7a624"
```
@@ -101,11 +113,18 @@ Done when:
- regression tests demonstrate the Coulomb hourly-to-daily rename shape without
needing a worker restart.
2026-06-19: Completed. `sync_schedules` now returns explicit counts for enabled
schedule upserts, disabled schedule pauses, and orphan deletes. Regression tests
cover the hourly-to-daily rename shape: a new enabled cron schedule is upserted,
the old disabled cron schedule is preserved as paused, unrelated orphan
schedules are deleted, event-triggered definitions do not create schedules, and
one-shot scheduled definitions are no longer mistaken for orphans.
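The orphan rule described above can be sketched as a small pure function. It is an illustration under assumptions, not the `sync_schedules` implementation: `find_orphans` is a hypothetical name, definitions are simplified to dicts, and the `-once` suffix for one-shot schedules is inferred from the regression test fixtures.

```python
def find_orphans(
    existing_activity_ids: list[str],
    definitions: list[dict[str, str]],
) -> list[str]:
    """Return schedule activity ids that no current definition accounts for."""
    expected: set[str] = set()
    for defn in definitions:
        if defn["trigger_type"] == "cron":
            # Enabled and disabled cron definitions both keep their schedule;
            # a disabled one is paused, not deleted.
            expected.add(defn["id"])
        elif defn["trigger_type"] == "scheduled":
            # One-shot schedules carry a "-once" suffix (assumed from the tests).
            expected.add(f"{defn['id']}-once")
        # Event-triggered definitions never own a schedule.
    return [aid for aid in existing_activity_ids if aid not in expected]
```

With this shape, a one-shot schedule is recognised through its suffixed id and therefore no longer shows up in the orphan list.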
## Optional Background Sync Loop
```task
id: ACTIVITY-WP-0012-T04
status: todo
status: done
priority: medium
state_hub_task_id: "d774087b-c51d-4444-8e90-bfef43765456"
```
@@ -121,11 +140,17 @@ Done when:
last error summary;
- the loop does not block worker startup or workflow task processing.
2026-06-19: Completed by decision. v1 stays manual/operator-triggered through
`POST /admin/sync`; no background loop was added. The runbook records this
posture so customer definition changes stay explicit and the worker does not
start background repo scanning. A periodic loop remains a future option if live
operator use proves it is needed.
## Live No-Restart Smoke
```task
id: ACTIVITY-WP-0012-T05
status: wait
status: done
priority: high
state_hub_task_id: "68a0e22a-106a-4d21-9f39-c6279850cb5e"
```
@@ -141,5 +166,27 @@ Done when non-secret State Hub evidence shows:
- event-triggered definitions still fire normally;
- rollback or repeat sync is idempotent.
Current wait reason: this gate depends on the implementation tasks and a
cluster-owned smoke path.
2026-06-22: Completed on Railiance01 (`KUBECONFIG=~/.kube/config-hosteurope`).
Smoke target: disabled projection `ops-service-inventory-probes`
(`40d15a87-7ff6-4d8e-992c-37df15f95110`) in
`actcore-external-activity-definitions`.
Evidence:
- ConfigMap flip `enabled: false -> true` and cadence `15 * * * * -> 25 * * * *`,
then `POST /admin/sync?definitions=true&schedules=true` from `actcore-api`.
- DB after sync: `enabled=true`, `cron=25 * * * *`.
- Temporal schedule after sync: `paused=false`, calendar minute `25`.
- Repeat sync returned identical schedule counts
(`upserted=5`, `paused=1`, `deleted_orphans=0`) — idempotent.
- Rollback flip restored `enabled=false`, `cron=15 * * * *`, schedule
`paused=true`, calendar minute `15`.
- `actcore-worker` pod UID unchanged (`a68d6539-2bba-457e-a78a-39564002a980`,
started `2026-06-21T18:46:46Z`); `actcore-event-router` pod UID unchanged.
- Event-triggered definitions: none projected on Railiance01 today; hot DB
reload path for event definitions remains covered by T03 unit tests and an
unchanged event-router deployment.
Automation: `scripts/smoke_admin_sync_no_restart.py`. Runbook section added
under "Railiance01 no-restart smoke".


@@ -0,0 +1,78 @@
---
id: ACTIVITY-WP-0013
type: workplan
title: "Reuse Surface Report Gaps Resolver"
domain: custodian
repo: activity-core
status: finished
owner: codex
topic_slug: activity-core
created: "2026-06-18"
updated: "2026-06-18"
state_hub_workstream_id: "01e68dfd-b146-4aef-a575-2d3b178ca5c2"
---
# Reuse Surface Report Gaps Resolver
Implement the R2 handoff from kaizen-agentic (`bffa224c`) so the
`reuse_surface_report_gaps` shell context source populates
`context.gaps` for the Coulomb daily registry hygiene sweep.
## Register Shell Resolver Query
```task
id: ACTIVITY-WP-0013-T01
status: done
priority: high
state_hub_task_id: "a6e1fc5c-7b42-436d-914e-4d605cb6f329"
```
Add a dedicated reuse-surface context resolver module and register
`reuse_surface_report_gaps` on the `shell` resolver path while preserving
the existing kaizen shell query behavior.
## Implement Batch And Signal Semantics
```task
id: ACTIVITY-WP-0013-T02
status: done
priority: high
state_hub_task_id: "229cf285-8388-471d-95fd-08400db1553e"
```
Load the Coulomb rollout roster, select active repos with a persisted
round-robin cursor, resolve repo roots from State Hub host paths, run
`reuse-surface report gaps --format json`, and emit gap records for the
enabled registry hygiene signals.
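The round-robin selection step might look like the sketch below. `next_batch` and its parameters are hypothetical; how the cursor is persisted and how large a batch is are not stated here, so both are left to the caller.

```python
def next_batch(
    repos: list[str],
    cursor: int,
    batch_size: int,
) -> tuple[list[str], int]:
    """Pick up to batch_size repos starting at cursor, wrapping around.

    Returns the batch and the cursor value to persist for the next sweep.
    """
    if not repos:
        return [], 0
    start = cursor % len(repos)
    size = max(0, min(batch_size, len(repos)))  # never visit a repo twice per batch
    batch = [repos[(start + offset) % len(repos)] for offset in range(size)]
    return batch, (start + size) % len(repos)
```

Persisting the returned cursor between sweeps is what lets every active repo get its turn even when the batch is smaller than the roster.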
## Cover Required And Optional Failure Modes
```task
id: ACTIVITY-WP-0013-T03
status: done
priority: high
state_hub_task_id: "85b5c7d4-40e1-4945-8ada-1dff2363c194"
```
Ensure missing required dependencies fail visibly while optional resolver
sources bind an empty `context.gaps` list. Add unit coverage for fixture
rollout data, mocked CLI JSON, resolver binding, and `hygiene_signal`
rule gating.
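The required-versus-optional behaviour can be illustrated with a short sketch. The names `bind_gaps` and `ResolverError` are hypothetical and the context is simplified to a plain dict; the real resolver binding in activity-core may be structured differently.

```python
from typing import Any, Callable


class ResolverError(RuntimeError):
    """Raised when a required context source cannot be resolved."""


def bind_gaps(
    context: dict[str, Any],
    resolve: Callable[[], list[dict[str, Any]]],
    *,
    required: bool,
) -> dict[str, Any]:
    """Bind resolver output to context["gaps"], honouring the required flag."""
    try:
        gaps = resolve()
    except Exception as exc:
        if required:
            # Required sources fail visibly so the run cannot silently proceed.
            raise ResolverError(f"reuse_surface_report_gaps failed: {exc}") from exc
        gaps = []  # optional sources degrade to an empty list
    context["gaps"] = gaps
    return context
```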
## Smoke Real Coulomb Rollout
```task
id: ACTIVITY-WP-0013-T04
status: done
priority: medium
state_hub_task_id: "6a5446ed-b4ec-4693-b508-65415571d834"
```
Run a live resolver smoke against
`/home/worsch/coulomb-loop/loops/registry-hygiene/rollout.yaml` using a
temporary round-robin cursor. The real active rollout produced five gaps,
including one for `reuse-surface` with `hygiene_signal: stale_sbom`.
The smoke supplied `reuse_surface_bin:
/home/worsch/reuse-surface/.venv/bin/reuse-surface` and
`runner_host: bnt-lap001`; the worker environment or definition params must
provide equivalent values before enabling the production sweep.


@@ -0,0 +1,194 @@
---
id: ACTIVITY-WP-0014
type: workplan
title: "Schedule Misfire Robustness & Run-Miss Recovery Options"
domain: infotech
repo: activity-core
status: finished
owner: claude
topic_slug: activity-core
created: "2026-06-23"
updated: "2026-06-24"
status_note: "T01-T05 complete; beachhead-endpoint adoption split to ACTIVITY-WP-0015"
state_hub_workstream_id: "91b64686-5d17-4c86-bc9e-3d0ee6720cf5"
---
# Schedule Misfire Robustness & Run-Miss Recovery Options
Make cron-triggered ActivityDefinitions robust to missed fires (worker/Temporal
unavailable at trigger time) with explicit, per-definition recovery behaviour,
plus detection/alerting when a scheduled fire is missed.
## Motivation
On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
(cron `20 7 * * *` Europe/Berlin, projected into the Railiance runtime ConfigMap
`actcore-external-activity-definitions`) produced **no `daily_triage` progress
event at all** — neither a success nor a `could not run; operator review
required` failure.
> **Corrected by T01 (2026-06-23).** The initial hypothesis below — that
> `_build_schedule()` never set `catchup_window`, so a short-default catchup
> window silently dropped the fire — was **disproven on the live cluster**. The
> Temporal schedule is healthy with `CatchupWindow 365d` (the server default) and
> `0 MissedCatchupWindow`. The real cause is that the run **fired and ran but
> failed at the report sink** with `Connection refused` posting to State Hub,
> because railiance01 reaches State Hub via a reverse tunnel back to the
> workstation, which is asleep at 07:20 Berlin. See the T01 findings and T05.
The trigger now originates entirely on **railiance01** (in-cluster Temporal
Schedule, ConfigMap-projected definition) and is **not** laptop-dependent — but
the triage's State Hub *data dependencies* (context resolution and report
delivery) still route back to the workstation State Hub.
This workplan still delivers worthwhile robustness — explicit run-miss recovery
policies (T02) and missed-fire detection (T03) — but the fix for *this* incident
is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).
## Desired run-miss options (from Bernd)
Three explicit, per-definition behaviours when a fire is missed:
1. **Run on trigger or skip:** never recover a missed fire.
2. **Run on trigger or later if missed:** recover **all** missed fires when back up.
3. **Run on trigger or later if missed, but skip if next trigger reached:**
   recover only the **most recent** missed fire; do not accumulate a backlog.
Proposed mapping to a new `misfire_policy` value set (names open to review):
| Policy | Semantics | Temporal mapping |
| --- | --- | --- |
| `skip` | Run on trigger or skip | `catchup_window ≈ 0`, `overlap=SKIP` |
| `catchup_all` | Run on trigger or all missed later | `catchup_window=<long>`, `overlap=BUFFER_ALL` |
| `catchup_latest` | Run on trigger or only the latest missed | `catchup_window ≈ 1 interval`, `overlap=BUFFER_ONE` |
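The table can be read as a lookup from policy name to a `(catchup_window, overlap)` pair. The sketch below is illustrative only: `schedule_policy` is a hypothetical helper, overlap values are plain strings rather than Temporal SDK enums, the `skip` window uses zero as a stand-in for "≈ 0", and the long window is a caller-supplied parameter because the proposal leaves it open.

```python
from datetime import timedelta


def schedule_policy(
    misfire_policy: str,
    *,
    interval: timedelta,
    long_window: timedelta,
) -> tuple[timedelta, str]:
    """Map a misfire policy name to (catchup_window, overlap)."""
    mapping = {
        # Never recover a missed fire.
        "skip": (timedelta(0), "SKIP"),
        # Recover every missed fire within the long window.
        "catchup_all": (long_window, "BUFFER_ALL"),
        # Recover only the most recent missed fire.
        "catchup_latest": (interval, "BUFFER_ONE"),
    }
    try:
        return mapping[misfire_policy]
    except KeyError:
        raise ValueError(f"unknown misfire_policy: {misfire_policy!r}") from None
```

For a daily cron, `catchup_latest` resolves to a one-day window with a buffer-one overlap under this mapping.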
## Confirm root cause on Railiance01
```task
id: ACTIVITY-WP-0014-T01
status: done
priority: high
state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab"
```
Inspected via `ssh railiance01` + in-node `kubectl`/`temporal` (no k3s tunnel is
defined for railiance01; the documented access path is SSH to the host).
**Findings (2026-06-23) — the WP-0014 premise was wrong for this incident:**
- All pods healthy; `actcore-worker` up 44h, 0 restarts. Not a crash.
- The daily-triage Temporal schedule (`activity-schedule-6fca51fa-…`) is
**healthy**: `Paused false`, `OverlapPolicy Skip`, **`CatchupWindow 365d`**
(Temporal's *default* when unset), `ActionCounts {Total:8, MissedCatchupWindow:0}`.
So fires were **not** silently dropped — my original "no catchup window → silent
drop" hypothesis does not hold; the server default is already 365d.
- The `2026-06-23T05:20:00Z` fire **did fire and ran**, then **Failed at the report
sink**: `report sink failure: state-hub-progress … '[Errno 111] Connection
refused'`. The run produced a report but could not deliver it to State Hub, so
no `daily_triage` progress event (not even a "could not run" one) was posted →
the silence. The 06-22 fire has no execution in retention (the bridge was likely
down then too, and/or the fire fell in the schedule update window at
`LastUpdateAt 1d ago`).
- Root cause is **State Hub connectivity from railiance01**, not Temporal. The
in-cluster `actcore-state-hub-bridge` (`hostNetwork`) proxies to
`127.0.0.1:18000` on the node — the local end of the ops-bridge **reverse tunnel
back to the workstation's State Hub**. At 07:20 Europe/Berlin (= 05:20 UTC) the
workstation/tunnel was unreachable → `Connection refused`. Chronic flakiness
confirmed: 102 State Hub resolver timeouts in 24h (69 `recently_on_scope`,
33 `consistency_sweep`).
**Implication:** the trigger *is* independent of the laptop, but the triage's
**data dependencies (State Hub context resolution + report delivery) still route
back to the workstation State Hub**, which is asleep at 07:20 Berlin. WP-0014's
misfire policies are still good robustness, but the real fix is (a) State Hub
reachable from railiance01 independent of the workstation, and/or (b) sinks/
resolvers resilient to transient State Hub unavailability (retry/backoff,
store-and-forward) instead of hard-failing the workflow. Tracked as follow-up
below. Backfill deferred: a replay only succeeds while the workstation State Hub
is reachable.
## Implement explicit misfire recovery modes
```task
id: ACTIVITY-WP-0014-T02
status: done
priority: high
state_hub_task_id: "19615562-4cb2-4f25-872f-505d6e40dcc5"
```
Add `catchup_window_seconds` to `CronTriggerConfig` and redefine `misfire_policy`
into the three explicit modes above. In `_build_schedule()` set
`SchedulePolicy(overlap=..., catchup_window=timedelta(...))` per mode. Remove the
ad-hoc 1-hour `backfill` hack in favour of native catchup-window semantics. Keep
backward compatibility for existing `skip`/`catchup`/`compress` values (alias
map). Unit tests for each mode's `(catchup_window, overlap)` mapping.
## Missed-fire detection & alert sink
```task
id: ACTIVITY-WP-0014-T03
status: done
priority: medium
state_hub_task_id: "dbedd96a-59ca-4b83-bce6-35755b076807"
```
Detect when a scheduled definition has no successful run within its expected
interval + tolerance, and emit a signal (State Hub progress event and/or
agent-inbox message) so a miss is visible even under `skip`. This is the
observability the current silent-drop behaviour lacks — a miss should never again
be invisible.
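The detection rule reduces to a single time comparison. A minimal sketch, assuming the detector knows the last successful run time; `is_missed` is a hypothetical name, and treating "never ran" as a miss is an assumption rather than documented behaviour.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_missed(
    last_success: Optional[datetime],
    now: datetime,
    interval: timedelta,
    tolerance: timedelta,
) -> bool:
    """True when no successful run falls within interval + tolerance of now."""
    if last_success is None:
        return True  # assumption: a definition that never succeeded counts as missed
    return now - last_success > interval + tolerance
```

The tolerance keeps a run that finishes a little late from raising a false alarm, while a genuinely skipped fire still crosses the threshold.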
## Apply policy to runtime definitions & document
```task
id: ACTIVITY-WP-0014-T04
status: done
priority: medium
state_hub_task_id: "04e9d1d2-1192-4402-9402-b12c5d7d44e5"
```
Set `misfire_policy: catchup_latest` for `daily-statehub-wsjf-triage` and
documented the run-miss options in `docs/runbook.md`.
**Deployed to and verified on railiance01 (2026-06-24):** built `activity-core:
railiance01-prod` with the WP-0014 code (T02/T03/T05), imported into k3s
containerd, applied the ConfigMap, rolled `actcore-worker`/`api`/`event-router`
onto the new image, and ran `/admin/sync` (6 defs, 4 schedules upserted, 0
errors). The live Temporal schedule now reports `OverlapPolicy BufferOne` +
`CatchupWindow 1d` (= `catchup_latest`); pods healthy, API `db:true temporal:true`.
## Keep activity-core thin under the State Hub beachhead model
```task
id: ACTIVITY-WP-0014-T05
status: done
priority: high
state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
```
**Architecture decision (Bernd, 2026-06-23):** the resilience that this incident
needs — queuing writes and caching reads while State Hub is unreachable — must
**not** be a burden carried by client repos. It belongs to State Hub as a
**per-machine local "beachhead"** (transparent read cache + write outbox, possibly
with State-Hub federation), owned by custodian/state-hub. It handles all three
failure modes: network interruption, central State Hub crash, central machine
down. This is handed off to state-hub (see the coordination message / proposal);
**do not build client-side queue/cache logic in activity-core.**
activity-core's only responsibilities under this model are thin:
- **Idempotent writes — DONE (2026-06-23, in-repo):** added
`activity_core/state_hub_write` (`idempotency_headers`); every State Hub write
(report-sink, ops-evidence, schedule-miss) now sends a stable `Idempotency-Key`
header derived from `run_id:instruction_id:event_type`. The read-based
`_progress_exists` dedup is now best-effort (returns `False` on connection
error instead of hard-failing), so the guarantee lives on the keyed write, not
a live read. Tests in `tests/test_state_hub_write.py`; documented in
`docs/runbook.md`.
- **Adopt the beachhead endpoint — MOVED to [[ACTIVITY-WP-0015]]:** pointing
`STATE_HUB_URL` at the local beachhead and retiring the bespoke
`actcore-state-hub-bridge` proxy depend on the state-hub beachhead existing
first. Split into WP-0015 (status `blocked`) so this workplan can close on its
completed in-repo work rather than waiting on an external capability.
T05 is done as far as activity-core can act now; the external-dependent adoption
lives in WP-0015.
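The key derivation for idempotent writes can be sketched as below. The function name and signature are illustrative and may differ from `activity_core/state_hub_write`; only the header name and the `run_id:instruction_id:event_type` format come from the description above.

```python
IDEMPOTENCY_HEADER = "Idempotency-Key"


def idempotency_headers_sketch(
    run_id: str,
    instruction_id: str,
    event_type: str,
) -> dict[str, str]:
    """Derive a stable key so a replayed write can be deduplicated server-side."""
    return {IDEMPOTENCY_HEADER: f"{run_id}:{instruction_id}:{event_type}"}
```

Because the key depends only on identifiers that are fixed for a given run, a retried or replayed write carries the same header as the original attempt.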


@@ -0,0 +1,54 @@
---
id: ACTIVITY-WP-0015
type: workplan
title: "Adopt State Hub Beachhead Endpoint"
domain: infotech
repo: activity-core
status: blocked
owner: claude
topic_slug: activity-core
created: "2026-06-24"
updated: "2026-06-24"
state_hub_workstream_id: "bbc07f9e-9323-4b2b-b556-c33b37d0b228"
---
# Adopt State Hub Beachhead Endpoint
Carries the **blocked remainder** of [[ACTIVITY-WP-0014]] T05. The in-repo half
(idempotency-keyed State Hub writes) shipped in WP-0014; this workplan is the
client-side adoption that depends on the state-hub-owned **beachhead** capability
(per-machine read cache + write outbox) existing first.
**Blocked on:** the state-hub beachhead (proposal sent to the `state-hub` agent,
2026-06-23). Do not build queue/cache logic in activity-core — see
[[statehub-beachhead-principle]].
## Point STATE_HUB_URL at the beachhead
```task
id: ACTIVITY-WP-0015-T01
status: wait
priority: medium
state_hub_task_id: "76b6132d-394a-4a67-bef6-73bb9d1e277e"
```
Once the state-hub beachhead exposes a local endpoint, point activity-core's
`STATE_HUB_URL` (and the railiance runtime config) at it and verify reads are
served from cache and writes are queued/flushed correctly when central State Hub
is unreachable. Confirm idempotency-keyed writes dedup on flush (no duplicate
`daily_triage`/progress events).
## Retire the bespoke actcore-state-hub-bridge proxy
```task
id: ACTIVITY-WP-0015-T02
status: wait
priority: medium
state_hub_task_id: "526c2129-cbf7-4531-a319-aebfc75cc6a3"
```
Remove the inline `hostNetwork` HTTP proxy `actcore-state-hub-bridge` from
`k8s/railiance/20-runtime.yaml` — it is a primitive precursor of the beachhead
and should be replaced by the state-hub-owned component, not extended. Re-verify
the daily triage end-to-end after cutover, including an overnight scheduled run
while the workstation is asleep (the original failure condition).


@@ -0,0 +1,434 @@
---
id: ACTIVITY-WP-0016
type: workplan
title: "LLM Output Robustness & The Producer Trust Boundary"
domain: custodian
repo: activity-core
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-26"
updated: "2026-06-30"
state_hub_workstream_id: "4ef0d53b-1777-41ae-80c6-1b69fdb34726"
---
# ACTIVITY-WP-0016 — LLM Output Robustness & The Producer Trust Boundary
## Context
On 2026-06-26 the scheduled `daily-statehub-wsjf-triage` instruction fired on
time (`daily_triage` event 05:20:57Z) but its output **failed schema
validation**: `Expecting ',' delimiter: line 136 column 22 (char 5268)`. The
model emitted a long ranked WSJF recommendation list (reached rank 7+ with
nested `wsjf` objects) and the JSON broke deep in that list. Because the report
is a single monolithic JSON document, one malformed delimiter discarded the
**entire** run. This reset the three-clean-consecutive-scheduled-runs streak in
`ACTIVITY-WP-0006-T03` (06-24 ✅, 06-25 ✅, 06-26 ✗-validation) and is the
LLM-output-quality surface deferred from `ACTIVITY-WP-0010`.
The scheduling/runtime layer is healthy — this is purely an output-robustness
and boundary-design problem. Today's code (`src/activity_core/rules/executor.py`)
already: passes the output schema to llm-connect as a `json_schema` model param
(`_llm_run_config`), retries once, runs a fenced/`raw_decode` tolerant parser
(`_parse_json_output`), and preserves a bounded 4000-char preview on hard
failure (`_invalid_output_report`). None of that helps when error locality is
zero: the failure unit is the whole document, not the offending item.
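The error-locality point can be made concrete with a small sketch that parses one item per line, so a malformed item is rejected on its own. This is only one possible framing, assumed here for illustration; `parse_per_item` is a hypothetical helper and not the executor's `_parse_json_output`.

```python
import json
from typing import Any


def parse_per_item(
    ndjson: str,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Parse newline-delimited JSON; return (valid items, rejected lines)."""
    items: list[dict[str, Any]] = []
    rejects: list[dict[str, Any]] = []
    for lineno, line in enumerate(ndjson.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines between items
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            # Only this line is lost; earlier and later items survive.
            rejects.append({"line": lineno, "error": exc.msg})
            continue
        if isinstance(obj, dict):
            items.append(obj)
        else:
            rejects.append({"line": lineno, "error": "item is not a JSON object"})
    return items, rejects
```

With a single monolithic document, the same missing delimiter would make `json.loads` raise for the whole report; here it costs exactly one item.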
## Design Frame — The Producer Trust Boundary
This workplan is anchored to a deliberate architectural stance, not just a bug
fix. Capture it in an ADR (T04) so future work inherits it.
**Premise.** activity-core has a *trust boundary* where free-form producer
output meets strict deterministic consumers (JSON Schema validators, the task
emitter, classic compute pipelines). The producers are **LLMs and humans (and
agents acting for either)**. Both are *untrusted producers*: their output may be
- **erroneous** — hallucination, truncation (token-limit cutoff), drift,
type slips, typos; or
- **malicious** — prompt injection, crafted payloads, oversized/deeply-nested
structures aimed at exhausting or confusing the consumer.
The architecture should treat the boundary as an adversarial frontier and place
**guardrails + error-correction tooling there**, rather than letting raw
producer output flow into deterministic consumers and fail (or worse, partially
succeed) downstream.
**Two non-fail-fast postures.** When we do *not* want to hard-fail on a problem,
there are two sensible strategies — and they compose:
- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output
as-is; on exception, catch → repair → retry, or else quarantine. Cheap on the
happy path. Blast radius depends entirely on how granular the catch is. Good
when failures are rare and locally recoverable. Risk: failures surface late,
possibly after partial side effects.
- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp,
and normalize the output to a known-good shape *before* it enters the pipeline
— drop bad items, coerce types, bound sizes/depth, allow-list references — so
the consumer only ever sees clean input. Higher upfront cost, smaller blast
radius, no partial side effects. Good when failures are common or
consequences are high.
**Governing principles for this repo:**
1. **Push verification to the boundary; keep the interior strict.** Apply
posture **B** at the producer→consumer boundary (verify+mitigate structure);
keep posture **A** for residual exceptions inside the verified core. Never
relax the interior schema to absorb producer sloppiness.
2. **Make error locality match the unit of work.** One bad recommendation must
cost one recommendation, not the whole report. Framing the payload so each
item is independently parseable is the single highest-leverage change.
3. **Quarantine, never silently drop.** Invalid units are preserved as bounded,
provenance-tagged artifacts (index, error, raw snippet) so they can be
debugged or replayed — degraded-but-usable is distinct from total loss.
4. **Both human and agent input get the same rigor.** Guardrails are
producer-agnostic: the same size/depth/count caps, reference allow-lists, and
truncation detection apply whether the producer is an LLM, an agent, or a
human form submission.
## Reproduce & Root-Cause The Failure
```task
id: ACTIVITY-WP-0016-T01
status: cancel
priority: high
state_hub_task_id: "74fd16a5-4ea5-4dfe-8526-dfa27cf76138"
```
Recover the **full** raw llm-connect response for the 06-26 failure (the State
Hub event keeps only a 4000-char preview; the break is at char 5268) and
establish the precise cause.
Done when:
- the full raw response is pulled from the runtime llm-connect log / response
store and the exact offending token at char 5268 is identified;
- `finish_reason` is captured to confirm or rule out token-limit **truncation**
vs a structural mid-stream glitch;
- it is confirmed whether llm-connect actually **enforced** the `json_schema`
constrained-decoding hint or merely accepted it as advisory (this determines
whether the schema param is load-bearing);
- the failing payload is captured as a regression fixture under `tests/`.
2026-06-26 findings (local analysis on the workstation):
- **Mechanism confirmed structurally.** There are **16 active workstreams**
org-wide and the triage instruction emits ~one ranked recommendation per
candidate. The preserved preview holds 7 fully-formed recommendations; the JSON
break is at char 5268 (~rank 8-9). The unbounded one-per-workstream list is the
structural cause — more items = more tokens = higher odds of a mid-stream JSON
slip and/or truncation. This directly justifies T02's bounded top-N + per-item
framing.
- **Both attempts failed.** `executor._execute` retries once
(`src/activity_core/rules/executor.py:166-171`); the recorded error is from the
**retry** output, so the model produced invalid JSON twice — not a one-off.
- **activity-core discards the diagnostics needed to root-cause this.** Three
retention gaps mean the exact char-5268 token cannot be recovered from
activity-core data at all:
1. `LLMConnectClient.complete()` returns only `data["content"]`
(`llm_client.py:57-60`) — it drops `finish_reason`/`usage` from the
llm-connect HTTP response, so truncation-vs-structural cannot be
distinguished locally.
2. the report sink caps raw output at **4000 chars** (`_invalid_output_report`,
`executor.py:259`) — below the 5268 break.
3. the worker log caps the preview at **2000 chars** (`executor.py:175`).
- **Remaining (remote, operator-owned).** Confirming the exact offending token
and `finish_reason` requires llm-connect's producer-side logs on `railiance01`
— cluster access, outside this repo's SCOPE for direct action. Truncation is
the leading hypothesis given the 16-item input, but the mitigation (T02/T03) is
identical either way, so T01 does not block the build work.
- **Feeds T03/T04.** The retention gaps are themselves defects to fix: capture
`finish_reason`/`usage` and persist a larger bounded raw artifact on validation
failure so this class of failure is never un-debuggable again.
- Partial fixture saved:
`tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json`
(the 4000-char preview + validation error; full payload pending the remote pull).
2026-06-30 local retention hardening: activity-core now preserves future
llm-connect diagnostic metadata instead of dropping it at the client boundary.
`LLMConnectClient.complete()` still returns the content string for compatibility,
but records safe non-secret response fields such as `finish_reason` and `usage`
on `last_response_metadata`; the executor copies that into report artifacts,
State Hub progress detail, and working-memory notes. Invalid report raw previews
were raised from 4000 to 12000 chars. This does not recover the historical
06-26 full payload or producer-side `finish_reason`, so T01 remains waiting on the
remote llm-connect log pull, but the retention gap is closed for future failures.
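The retention change described above can be pictured with a minimal sketch. It is not the repo's `LLMConnectClient`: the class name, the `complete_from_response` method, and the `SAFE_METADATA_KEYS` allow-list are hypothetical, and treating `finish_reason == "length"` as the truncation marker is an assumption borrowed from common OpenAI-compatible APIs rather than confirmed llm-connect behavior.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical allow-list of non-secret response fields worth retaining.
SAFE_METADATA_KEYS = ("finish_reason", "usage")


@dataclass
class MetadataKeepingClient:
    """Illustrative stand-in, not the real LLMConnectClient."""

    last_response_metadata: dict[str, Any] = field(default_factory=dict)

    def complete_from_response(self, data: dict[str, Any]) -> str:
        # Copy only allow-listed fields so tokens or headers are never retained.
        self.last_response_metadata = {
            key: data[key] for key in SAFE_METADATA_KEYS if key in data
        }
        # Still return the bare content string, for backward compatibility.
        return data["content"]


def looks_truncated(metadata: dict[str, Any]) -> bool:
    # Assumption: a token-limit cutoff is reported as finish_reason == "length".
    return metadata.get("finish_reason") == "length"
```

With metadata retained this way, a later validation failure can be labelled truncation or structural glitch from activity-core's own evidence, without a remote log pull.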
## Schema + Prompt Redesign For Error Locality
```task
id: ACTIVITY-WP-0016-T02
status: done
priority: high
state_hub_task_id: "ae67ca8c-ee01-4a8d-9e8a-a0a36c999758"
```
Redesign the daily-triage report contract so a single malformed item can no
longer discard the whole report (principle #2).
Done when:
- the recommendation list is **bounded** (configurable top-N, default 5-7) in
both the prompt and the output schema — long lists are where the model drifts;
- the report uses a **per-item-framed** shape (JSON Lines / NDJSON — one
recommendation object per line — or an equivalent delimited per-item form)
behind a minimal stable envelope (`summary` + framed items), so each item is
an independent parse unit;
- the prompt explicitly states the contract, the per-item framing, the cap, and
a "if uncertain, emit fewer well-formed items rather than more" instruction;
- `max_tokens` is set with headroom for the bounded list so truncation cannot
occur at the expected size;
- the output schema file (`_load_output_schema` target) is updated to match.
2026-06-26 progress (in-repo portion):
- **Strict, bounded schema written** — `schemas/daily-triage-report.json` went
from `recommendations.items: {type: object}` (accept-anything) to a strict
per-item contract: `required [rank, candidate, action, why]` with typed
`wsjf` sub-fields, plus `maxItems: 7`. The strict item shape is what lets the
T03 boundary parser validate each recommendation independently.
- **`maxItems` is a hint, not a hard reject** — the in-repo validator
(`_validate_schema_node`) only enforces `type`/`required`/`properties`/`items`
and ignores `maxItems`/`enum`. That is deliberate: a hard `maxItems` reject
would discard a whole 16-item report — the exact blast-radius bug WP-0016
removes. The bound is enforced via the prompt + the llm-connect `json_schema`
constraint hint + T03 mitigation (keep top-N by rank, quarantine extras).
- **DEPLOY COUPLING (important):** this schema file is consumed *both* as the
llm-connect hint *and* by the current whole-document validator. Tightening
per-item `required` fields makes the existing whole-doc validation hard-fail
**more** until T03 replaces it with per-item quarantine. Therefore the schema
change MUST ship together with T03 — do not deploy the strict schema to the
runtime bundle ahead of the T03 parser. Four executor/instruction tests that
asserted the old loose contract were updated to the strict contract; the
forwarded-schema test now reads the live file instead of hard-coding it.
- **Truncation hypothesis corroborated** — the instruction config carries
`max_tokens` on the order of ~1200 (per the wiring test fixture). 5268 chars ≈
~1300-1500 tokens, so a ~1200-token cap would truncate a 16-item list right at
the observed break. This strengthens T01's leading hypothesis and makes the
`max_tokens` headroom change below concrete.
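The character-to-token estimate in the last bullet can be reproduced with a few lines of arithmetic. The 3.5 to 4 characters-per-token ratio is a rough rule of thumb assumed here, not a measured property of the model behind llm-connect, and `estimate_tokens` is a hypothetical helper.

```python
BREAK_CHAR = 5268      # position of the JSON break reported by the validator
OLD_MAX_TOKENS = 1200  # approximate cap, per the wiring test fixture


def estimate_tokens(chars: int, chars_per_token: float) -> int:
    """Rough token estimate from a character count (heuristic ratio)."""
    return round(chars / chars_per_token)


low = estimate_tokens(BREAK_CHAR, 4.0)   # optimistic: 4 chars per token
high = estimate_tokens(BREAK_CHAR, 3.5)  # pessimistic: 3.5 chars per token
print(low, high, low > OLD_MAX_TOKENS)   # prints: 1317 1505 True
```

Even the optimistic estimate exceeds the old cap, which is why the truncation hypothesis is plausible without the producer-side `finish_reason`.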
**Bundle handoff (NOT in this repo — runtime-projected definition).** The triage
prompt and `max_tokens` live in the Railiance runtime bundle, not in repo files.
Apply there:
1. Instruct a **bounded top-N** (≤ 7) ranked recommendations, "if uncertain emit
fewer well-formed items rather than more."
2. Specify the **per-item framing** the T03 parser will consume (NDJSON: a
leading summary object, then one recommendation JSON object per line).
3. Raise **`max_tokens`** to give clear headroom for 7 framed items (eliminate
truncation at the expected size).
4. State the value vocabularies (`action`, `confidence`) the T04 guardrails will
check.
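To make the NDJSON framing in item 2 concrete, here is a minimal sketch of a consumer for that shape: a leading summary object, then one recommendation object per line. It is an illustration under stated assumptions, not the executor's parser. `parse_framed_report` is a hypothetical name, the 200-character snippet bound is arbitrary, and the required keys are taken from the strict item contract described earlier.

```python
import json
from typing import Any

# Required keys follow the strict item contract (rank, candidate, action, why).
REQUIRED_ITEM_KEYS = ("rank", "candidate", "action", "why")
MAX_ITEMS = 7  # the bounded top-N


def parse_framed_report(text: str) -> dict[str, Any]:
    """Parse a summary line followed by one recommendation object per line."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty report")
    # The envelope is the only unit whose failure still fails the whole report.
    summary = json.loads(lines[0])
    kept: list[dict[str, Any]] = []
    quarantined: list[dict[str, Any]] = []
    for index, line in enumerate(lines[1:]):
        try:
            item = json.loads(line)
            if not isinstance(item, dict):
                raise ValueError("item is not a JSON object")
            missing = [key for key in REQUIRED_ITEM_KEYS if key not in item]
            if missing:
                raise ValueError(f"missing required fields: {missing}")
            if len(kept) >= MAX_ITEMS:
                raise ValueError("over the top-N cap")
            kept.append(item)
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            # Quarantine with provenance instead of failing the document.
            quarantined.append(
                {"index": index, "error": str(exc), "raw": line[:200]}
            )
    return {
        "summary": summary,
        "recommendations": kept,
        "partial": bool(quarantined),
        "quarantined_count": len(quarantined),
        "quarantined_items": quarantined,
    }
```

Because each line is its own parse unit, a truncated final line costs one recommendation rather than the run.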
2026-06-30 live evidence check: the 2026-06-28 and 2026-06-29 scheduled
`daily_triage` events validated successfully, which shows the runtime is no
longer failing every day. However, the preserved State Hub reports still contain
10 recommendations, not the requested bounded top-N of 7 / framed item contract.
Treat that as evidence that the runtime-projected prompt/schema/max-token bundle
has not fully absorbed the T02 handoff yet.
2026-06-30 source projection closeout: patched `k8s/railiance/20-runtime.yaml`
so the projected `daily-statehub-wsjf-triage.md` prompt now says at most 7
recommendations and instructs the model to emit fewer well-formed items rather
than more. The projected `daily-triage-report.json` now has `maxItems: 7` and
`rank.maximum: 7`, aligned with the repo schema. `max_tokens: 1800` remains as
headroom for the bounded report. T02 is done in source; live deployment and an
observed <=7 recommendation run remain under T05.
## Boundary Parser — Verify & Mitigate (Posture B)
```task
id: ACTIVITY-WP-0016-T03
status: done
priority: high
state_hub_task_id: "d65a6281-f1f9-4a9b-a835-da065411b709"
```
Implement item-granular parsing with a quarantine lane in
`src/activity_core/rules/executor.py`, applying posture **B** at the boundary
(principles #1-#3).
Done when:
- the parser splits the envelope from the framed items, then parses **each item
independently**; a malformed item is routed to a bounded `quarantined_items`
artifact (index + validation error + raw snippet), not raised;
- a run with some valid and some invalid items emits a report over the surviving
valid items with `output_validated=true`, plus `partial=true` and
`quarantined_count` / `quarantined_items` markers — degraded-but-usable is
reported distinctly from total loss;
- a best-effort **repair** pass (close unterminated brackets/quotes, recover the
valid prefix) is attempted per item before quarantining it;
- truncation detected in T01 is handled as its own signal (recover whole items
emitted before the cutoff rather than failing the document);
- the existing monolithic-document path remains as the fallback when framing is
absent (backward compatible with task-only instructions).
2026-06-26 progress (implemented in `src/activity_core/rules/executor.py`):
- **Resilient recovery wired into `_execute`.** When the whole-document parse +
one retry still fail, report instructions (those with `report_sinks`) now run
`_resilient_report` *before* the total-loss `_invalid_output_report`. If it
recovers ≥1 valid item it returns a partial report; otherwise it returns None
and the prior total-loss path is preserved unchanged.
- **Brace/quote-aware object scanner, not line-splitting.** The real 06-26 output
was pretty-printed (multi-line objects), so naive NDJSON line recovery would
have failed. `_extract_object_spans` walks the `recommendations` array
brace-depth- and string-aware, so it recovers each recommendation object
whether pretty-printed across many lines *or* emitted one-per-line (NDJSON).
The truncated trailing object is returned with `complete=False`.
- **Layered mitigation per item:** `json.loads` → on failure for a truncated
tail, a best-effort `_try_repair` (balance open string/brackets/braces) →
then `_partition_items` validates each recovered object against the T02 item
schema. Valid items survive; malformed or over-`maxItems` items are
quarantined with provenance (`index`, `error`, `raw` snippet, `reason`).
- **Report shape on degradation:** `output_validated=True` over the survivors,
`review_required=True`, `partial=True`, `quarantined_count`, and a bounded
`quarantined_items` list (cap 20). Degraded-but-usable is now reported
distinctly from total loss.
- **Verified against the real failure shape.** New tests reconstruct a
pretty-printed report with 7 valid recommendations + a truncated tail (the
06-26 shape) and a one-bad-item-among-valid case. The 7-item run now recovers
all 7 and quarantines the broken tail (previously: whole run discarded);
log line `instruction_output_recovered: kept=7, quarantined=1`. The bad-item
run keeps 2 and quarantines the rank-less one.
- **Deferred to T04 (clean scope boundary):** enforcing `maxItems` top-N on the
*happy* path (valid JSON, all items schema-valid, but > N items) — the resilient
path only runs on failure, so over-limit-on-success is a guardrail/count-cap
concern, which is exactly T04's remit.
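The brace- and string-aware scan described above can be sketched as follows. This is a simplified illustration, not the repo's `_extract_object_spans`: the function name, the signature, and the convention that `start` points at the opening `[` of the `recommendations` array are assumptions made for the example, and no repair step is included.

```python
import json


def extract_object_spans(text: str, start: int = 0) -> list[tuple[str, bool]]:
    """Return (raw_object_text, complete) for each top-level object after start.

    Braces inside JSON strings are ignored, so pretty-printed and
    one-per-line objects are recovered the same way.
    """
    spans: list[tuple[str, bool]] = []
    depth = 0
    in_string = False
    escaped = False
    obj_start = None
    for pos in range(start, len(text)):
        ch = text[pos]
        if in_string:
            if escaped:
                escaped = False          # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                obj_start = pos          # a new top-level item begins here
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append((text[obj_start:pos + 1], True))
                obj_start = None
        elif ch == "]" and depth == 0:
            break                        # end of the items array
    if obj_start is not None:
        # Input ended inside an object: report the truncated tail.
        spans.append((text[obj_start:], False))
    return spans


doc = '{"summary": "s", "recommendations": [\n {"rank": 1, "why": "a } b", "wsjf": {"score": 2}},\n {"rank": 2, "why": "trunc'
spans = extract_object_spans(doc, doc.index("["))
```

In this usage the first span is complete and parses with `json.loads`, while the second is flagged `complete=False` and would go to repair or quarantine.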
## Producer Guardrails + ADR-004
```task
id: ACTIVITY-WP-0016-T04
status: done
priority: medium
state_hub_task_id: "f5c3af5b-9e28-42b0-9af5-4c99284e99b9"
```
Write the architecture decision record and add the producer-agnostic guardrails
(principle #4).
Done when:
- `docs/adr/adr-004-producer-trust-boundary.md` documents the trust boundary,
the untrusted-producer premise (erroneous **and** malicious; human and agent),
the A vs B taxonomy and where each applies, the error-locality principle, and
the quarantine-with-provenance rule;
- boundary guardrails are enforced at the consumer edge: max item **count**, max
string length, max nesting **depth**, and a **reference allow-list** (e.g. a
recommendation `candidate` / a task `target_repo` must resolve to a known
workstream/repo before it is acted on);
- guardrail rejections are quarantined with provenance, consistent with T03;
- SCOPE.md / INTENT.md are checked for drift and updated if the boundary stance
changes the documented contract.
2026-06-26 progress:
- **ADR-004 written** — `docs/adr/adr-004-producer-trust-boundary.md` documents
the untrusted-producer premise (erroneous + malicious; LLM/agent/human), the
A-vs-B posture taxonomy, the four governing principles, the concrete
activity-core mechanisms, a posture-by-layer table, consequences, and
alternatives considered. Accepted, scope cross-repo.
- **Producer guardrails implemented** in `executor.py`, applied uniformly on the
happy path *and* the recovery path via `_partition_items`: per-item order is
structural-type → schema → structural caps (`_MAX_DEPTH=8`,
`_MAX_STRING_LEN=4000`) → reference allow-list → count cap (`maxItems`). Each
quarantine carries a `reason` (`malformed`/`schema`/`guardrail`/`allow_list`/
`over_limit`).
- **Happy-path count cap closed** (the item deferred from T03): a syntactically
valid 9-item report now keeps 7 and quarantines 2 as `over_limit`, emitting a
`partial` report — without a retry.
- **Reference allow-list wired but inert.** `_allow_list_from_context` reads
`context["known_candidates"]`; when present, recommendations with an unknown
`candidate` are quarantined (`reason: allow_list`). Absent today → check is
inert; activation is a one-line context-resolver change. Keeps the guardrail
producer-agnostic (principle #4) and ready.
- **SCOPE.md updated** — instruction-executor bullet now names the quarantine
lane + guardrails; ADR-004 added to the Architecture Decisions list. No INTENT
drift: this hardens the existing output contract, it does not extend scope.
- New tests: happy-path count cap, oversized-string guardrail, allow-list
rejection (all green).
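A minimal sketch of the guardrail ordering above, assuming items have already passed the structural-type and schema checks. The cap values mirror the `_MAX_DEPTH=8` and `_MAX_STRING_LEN=4000` figures quoted in this section, but `structural_violation` and `partition_items` as written here are hypothetical helpers for illustration, not the code in `executor.py`.

```python
from typing import Any, Optional

MAX_DEPTH = 8          # mirrors the _MAX_DEPTH figure above
MAX_STRING_LEN = 4000  # mirrors the _MAX_STRING_LEN figure above


def structural_violation(value: Any, depth: int = 1) -> Optional[str]:
    """Return a description of the first cap violation, or None if clean."""
    if isinstance(value, str):
        return "string too long" if len(value) > MAX_STRING_LEN else None
    if isinstance(value, (dict, list)):
        if depth > MAX_DEPTH:
            return "nesting too deep"
        children = list(value.values()) if isinstance(value, dict) else value
        for child in children:
            problem = structural_violation(child, depth + 1)
            if problem:
                return problem
    return None


def partition_items(items, known_candidates=None, max_items=7):
    """Split items into (kept, quarantined); each quarantine carries a reason."""
    kept, quarantined = [], []
    for index, item in enumerate(items):
        problem = structural_violation(item)
        if problem:
            reason = "guardrail"
        elif known_candidates is not None and (
            not isinstance(item, dict)
            or item.get("candidate") not in known_candidates
        ):
            reason, problem = "allow_list", "unknown candidate"
        elif len(kept) >= max_items:
            reason, problem = "over_limit", "exceeds item cap"
        else:
            kept.append(item)
            continue
        quarantined.append(
            {"index": index, "reason": reason, "error": problem,
             "raw": repr(item)[:200]}
        )
    return kept, quarantined
```

Passing `known_candidates=None` leaves the allow-list check inert, matching the "wired but inert" state described above.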
## Tests + Calibration Re-Entry
```task
id: ACTIVITY-WP-0016-T05
status: done
priority: high
state_hub_task_id: "c881500b-5459-4620-81c0-b176971e989f"
```
Prove the new posture and hand back to the calibration gates.
Done when:
- regression tests cover: the captured 06-26 payload, a truncated-mid-list
payload, a one-bad-item-among-good payload (asserts quarantine + partial), an
oversized/over-deep payload (asserts guardrail rejection), and an
injection-shaped reference (asserts allow-list rejection);
- the full suite passes and the result is recorded here with the count;
- a daily-triage smoke against the live runtime shows a previously-failing
payload now **degrades gracefully** (valid items delivered, bad items
quarantined) instead of discarding the run;
- a progress note hands back to `ACTIVITY-WP-0010-T04` and `ACTIVITY-WP-0006-T03`
that the output-robustness blocker is cleared so the three-clean-run gate can
resume on its own.
2026-06-26 progress (in-repo portion complete):
- **Regression coverage complete.** Across T03/T04/T05: truncated-mid-list,
one-bad-item-among-good (quarantine + partial), oversized-string and over-depth
guardrail rejection, allow-list (injection-shaped) rejection, happy-path count
cap, and a test driving the **actual captured 2026-06-26 payload**
(`tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json`)
— it now recovers 6+ valid recommendations and quarantines the truncated tail,
where before it discarded the whole run.
- **Full suite green:** 218 passed, 1 skipped (recorded at T04; the T05 fixture +
over-depth tests add to this — see the commit).
- **Hand-back notes posted** to `ACTIVITY-WP-0006-T03` (State Hub event
`b6b8c2b8`) and `ACTIVITY-WP-0010-T04` (`b813f0dc`).
- **Remaining (remote, operator-owned):** the live daily-triage smoke on
`railiance01` proving end-to-end graceful degradation. It depends on deploying
the T02 bundle prompt/`max_tokens`/NDJSON changes together with this code, which
is cluster/operator work outside this repo's SCOPE. T05 therefore stays
`progress` until that live run exists; the in-repo deliverables are done.
2026-06-30 follow-up: added forward-looking diagnostics so future validation
failures carry llm-connect response metadata and a larger bounded raw-output
preview in activity-core-owned evidence. Focused verification passed:
`uv run pytest tests/test_llm_client.py tests/rules/test_executor.py tests/test_report_sinks.py -q`
=> 39 passed. This improves future root-cause ability but does not replace the
required live smoke proving graceful degradation on railiance01.
2026-06-30 projection follow-up: local source projection now enforces the top-7
prompt/schema contract. Remaining T05 proof is operational: deploy or sync the
updated `k8s/railiance/20-runtime.yaml`, run `actcore-sync`/schedule smoke or wait
for the next 07:20 Berlin fire, then confirm State Hub `daily_triage` evidence is
`output_validated=true` with no more than 7 recommendations.
## Relationships
- **Blocks / feeds:** `ACTIVITY-WP-0006-T03` (three clean scheduled runs) and
`ACTIVITY-WP-0010-T04` (collect three clean scheduled runs) — both stalled on
the same output-quality failure this workplan removes.
- **References:** `ACTIVITY-WP-0009` (scheduled-run trust gap).
- **Boundary discipline:** keeps activity-core inside its SCOPE — this hardens
the instruction-executor output contract; it does not move provider
credentials, cluster reconciliation, or task lifecycle into this repo.
## Closure 2026-07-02 (RAIL-BS-WP-0008 live deploy)
- T05 done: the robustness bundle (strict per-item schema + T03 quarantine
parser + bounded top-7/NDJSON runtime prompt, activity-core `7612112`) was
deployed to railiance01 and live-proven. A manually triggered daily-triage
run produced a clean schema-valid report with exactly 7 ranked
recommendations: State Hub event `24d2d321-c761-47f7-bf9e-7950a6253c21`,
`output_validated=true`, working memory written. Calibration re-entry: the
three-clean-run streak (WP-0006-T03 / WP-0010-T04) restarts from this run.
- T01 cancelled: the raw 2026-06-26 llm-connect response is unrecoverable
(stateless pod, no response store, log stream holds only 2 startup lines
since 2026-06-19). Root cause stands on the retained 4000-char preview and
break-at-char-5268 evidence: output exceeded the old ~1200-token budget and
truncated mid-JSON. The deployed mitigation (1800-token headroom, bounded
top-7, per-item recovery) addresses exactly that failure mode.


@@ -0,0 +1,58 @@
---
id: ACTIVITY-WP-0017
type: workplan
title: "Core Hub ops evidence sink"
domain: infotech
repo: activity-core
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-27"
updated: "2026-06-27"
state_hub_workstream_id: "2a073bf4-febf-433e-a721-5daf71760912"
---
# Core Hub ops evidence sink
## Goal
Provide the activity-core side of the Core Hub replacement evidence path for
`CORE-WP-0008-T03`, without depending on the legacy Haskell Inter-Hub sink and
without placing secret material in activity definitions, logs, State Hub, or
chat.
## Task: Add Core Hub interaction-event sink
```task
id: ACTIVITY-WP-0017-T01
status: done
priority: high
state_hub_task_id: "32aab1af-6be5-4b52-afa1-c11f52c65892"
```
Add a `core-hub-interaction-event` ops evidence sink that posts sanitized
ops-inventory probe evidence to Core Hub `/api/v2/interaction-events`, verifies
the created event is visible, and reports only non-secret ids/statuses.
Acceptance:
- runtime token is read through `CORE_HUB_RUNTIME_TOKEN_FILE` or a named
environment variable, never from workplan content;
- sink configuration accepts `CORE_HUB_BASE_URL` and a widget id or widget
mapping;
- emitted metadata reuses the existing compact/sanitized probe evidence path;
- missing Core Hub config skips cleanly with explicit non-secret missing keys;
- tests prove the POST/visibility check and secret non-disclosure.
Verification 2026-06-27: `tests/test_ops_evidence_sinks.py` passed, and
a disposable local Core Hub runtime accepted an activity-core
`core-hub-interaction-event` sink emission, then listed the created
`ops-endpoint-verified` event back through `/api/v2/interaction-events`.
The verification asserted sanitized metadata did not include response body,
authorization header, URL userinfo, or token query material.
Completed 2026-06-27: implemented the Core Hub interaction-event sink in
`activity_core.ops_evidence_sinks` with unit coverage for POST/visibility
verification, missing config behavior, and secret non-disclosure. This provides
the direct Core Hub consumer path needed by `CORE-WP-0008-T03`; deployed use
still requires an approved Core Hub runtime token and widget id/mapping.


@@ -0,0 +1,248 @@
---
id: ACTIVITY-WP-0018
type: workplan
title: "Own-infrastructure automation status surface"
domain: infotech
repo: activity-core
status: finished
owner: codex
topic_slug: automation-observability
created: "2026-06-29"
updated: "2026-06-29"
state_hub_workstream_id: "0220b38b-7c73-4601-9601-5f2c1a5b29e8"
---
# Own-infrastructure automation status surface
## Goal
Make activity-core's own scheduling and evidence infrastructure the explicit
operating preference for durable automations, independent of any coding
assistant-provided scheduler or reminder system.
An operator should be able to answer a question like "How did our automations go
since Friday?" with a repo-native command that does not require an LLM. Coding
assistants may inspect or summarize that command's output, but they must not be
the source of truth for scheduled execution, run history, or operational
evidence.
## Review notes
The repo already owns the correct infrastructure direction:
- `SCOPE.md` defines activity-core as the org-wide event bridge for cron,
one-off scheduled datetime, and event-triggered automation.
- `Makefile` exposes sync and service targets, but no operator status target for
recent automation outcomes.
- `docs/runbook.md` documents daily-triage verification through
`scripts/verify_daily_triage.py`, but that helper is activity-specific and
still reads like a checklist rather than the baseline answer surface for all
automations.
- Existing workplan evidence shows the status question is operationally common:
2026-06-24 and 2026-06-25 daily triage runs were clean, while 2026-06-26 and
2026-06-27 fired on schedule but failed output validation. That distinction is
exactly what the baseline command must make obvious.
## Task: Codify the own-infra scheduling preference
```task
id: ACTIVITY-WP-0018-T01
status: done
priority: high
state_hub_task_id: "00127678-5ce4-4cb3-b81c-f42e04407c73"
```
Record the repository preference that durable automation scheduling, execution
history, and run evidence belong to activity-core's own infrastructure: Temporal
Schedules, NATS JetStream, activity-core run records, State Hub progress, and
working-memory/report sinks.
Acceptance:
- `AGENTS.md` repo-specific instructions say not to use coding
assistant-provided automation tooling as the execution or evidence source for
activity-core automations.
- `SCOPE.md` and `docs/runbook.md` describe coding assistants as callers or
summarizers of repo-native automation commands, not as schedulers.
- The preference distinguishes durable automation from harmless local session
reminders: production/operational recurrence belongs to activity-core.
- The text names the authoritative evidence sources and avoids tying the policy
to any one assistant product.
2026-06-29 progress: Added the immediate repo-agent instruction in AGENTS.md
that durable activity-core automations must use repo-owned infrastructure, not
coding assistant automation/reminder/heartbeat tooling, as the execution or
evidence source. Remaining T01 work is to carry the same preference into
SCOPE.md and docs/runbook.md.
## Task: Define the automation status evidence contract
```task
id: ACTIVITY-WP-0018-T02
status: done
priority: high
state_hub_task_id: "17e6bb87-d4bf-4ef3-b91c-4bdfe2fe3492"
```
Define a small, deterministic report contract for answering recent automation
status questions across all ActivityDefinitions.
Acceptance:
- The contract covers schedule state, expected fires in the requested window,
observed workflow runs, `activity_runs` rows, State Hub progress events,
working-memory/report sink evidence, and known validation or sink failures.
- It defines normalized statuses such as `completed`, `running`, `retrying`,
`validation_failed`, `sink_failed`, `missed`, `disabled`, and `unknown`.
- Partial data is explicit: if Temporal, Postgres, State Hub, or a sink path is
unavailable, the report includes warnings rather than silently passing or
failing the whole check.
- The contract is safe for operator logs: no secrets, prompts, raw model output,
or credential-bearing URLs.
- The contract can be emitted as JSON for scripts and rendered as concise text
for humans.
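One way to picture the normalized statuses is a small classifier over the evidence gathered for a single expected fire. The status names come from the acceptance list above. The `classify_fire` function, its parameters, and especially the precedence order between statuses are assumptions made for this sketch, not the delivered contract.

```python
from enum import Enum
from typing import Optional


class RunStatus(str, Enum):
    COMPLETED = "completed"
    RUNNING = "running"
    RETRYING = "retrying"
    VALIDATION_FAILED = "validation_failed"
    SINK_FAILED = "sink_failed"
    MISSED = "missed"
    DISABLED = "disabled"
    UNKNOWN = "unknown"


def classify_fire(
    schedule_enabled: bool,
    run_source_available: bool,
    workflow_state: Optional[str],
    output_validated: Optional[bool],
    sink_ok: Optional[bool],
) -> RunStatus:
    """Classify one expected fire; None means 'no evidence for this field'."""
    if not schedule_enabled:
        return RunStatus.DISABLED          # never counted as missed
    if not run_source_available:
        return RunStatus.UNKNOWN           # partial data stays explicit
    if workflow_state is None:
        return RunStatus.MISSED            # source reachable, no run observed
    if workflow_state in ("running", "retrying"):
        return RunStatus(workflow_state)
    if output_validated is False:
        return RunStatus.VALIDATION_FAILED
    if sink_ok is False:
        return RunStatus.SINK_FAILED
    if output_validated is None or sink_ok is None:
        return RunStatus.UNKNOWN
    return RunStatus.COMPLETED
```

Because `RunStatus` subclasses `str`, its values serialize directly into the JSON form of the report.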
## Task: Implement the non-LLM automation status CLI
```task
id: ACTIVITY-WP-0018-T03
status: done
priority: high
state_hub_task_id: "7831f2fc-8b76-48fe-aa34-9dcc11ee84db"
```
Add a deterministic CLI, likely under `scripts/automation_status.py` or an
`activity_core` module, that answers recent automation status questions without
calling an LLM.
Acceptance:
- Supports `--since`, `--until`, activity name/id filters, JSON output, and a
concise human summary.
- Accepts simple operator dates, including absolute dates and a documented
`friday`/`last-friday` style shortcut, resolving them to concrete dates in the
configured timezone.
- Inspects all enabled scheduled ActivityDefinitions by default, not just daily
triage.
- Uses live sources when configured: Postgres `activity_definitions` /
`activity_runs`, Temporal schedule and workflow visibility, State Hub
progress, and configured local report sink paths.
- Degrades usefully when a source is unavailable and exits non-zero only for
real status failures or invalid input, not for optional evidence gaps that are
clearly reported.
- Includes focused unit tests with fixture data for clean runs, validation
failures, missed runs, disabled schedules, and partial-source availability.
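The date-shortcut requirement can be illustrated with a small resolver. The semantics chosen here are assumptions for the sketch, not the delivered CLI's behavior: `friday` means the most recent Friday on or before today, `last-friday` means the most recent Friday strictly before today, and the default timezone name is only an example.

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo

FRIDAY = 4  # date.weekday(): Monday is 0, so Friday is 4


def today_in(tz_name: str = "Europe/Berlin") -> date:
    """Current calendar date in the configured timezone (example default)."""
    return datetime.now(ZoneInfo(tz_name)).date()


def resolve_since(value: str, today: date) -> date:
    """Resolve an absolute ISO date or a Friday shortcut to a concrete date."""
    key = value.strip().lower()
    if key in ("friday", "last-friday"):
        days_back = (today.weekday() - FRIDAY) % 7
        if key == "last-friday" and days_back == 0:
            days_back = 7  # strictly before today when today is a Friday
        return today - timedelta(days=days_back)
    # Anything else must be an absolute date such as 2026-06-26;
    # date.fromisoformat raises ValueError on invalid input.
    return date.fromisoformat(key)
```

Taking `today` as a parameter keeps the resolver deterministic in tests; a caller would pass `today_in(...)` at runtime. For example, `resolve_since("friday", date(2026, 6, 29))` yields 2026-06-26.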
## Task: Add the Make target baseline
```task
id: ACTIVITY-WP-0018-T04
status: done
priority: high
state_hub_task_id: "451bdf62-b619-4ace-9262-46d20b912781"
```
Expose the CLI through a Make target that is easy for an operator or any coding
assistant to run before attempting a prose summary.
Acceptance:
- `make automation-status SINCE=2026-06-26` prints the human-readable baseline.
- `make automation-status SINCE=friday` is supported or documented with the
exact accepted shortcut.
- A JSON form is available, either through `FORMAT=json` or a separate target
such as `make automation-status-json`.
- The target does not require LLM credentials, coding assistant automation
tooling, or interactive prompts.
- `make help` lists the target with a clear one-line description.
## Task: Update operator docs and examples
```task
id: ACTIVITY-WP-0018-T05
status: done
priority: medium
state_hub_task_id: "233659aa-e14a-4b3d-b156-d04f0fa16db6"
```
Update the runbook so "How did automations go since Friday?" has an obvious
operator recipe.
Acceptance:
- `docs/runbook.md` has a short "Automation status" section near the scheduling
operations.
- The docs include example output or a compact sample for the known daily
triage distinction: fired on time versus completed successfully versus output
validation failure.
- The docs clarify that LLM summaries are optional convenience only; the Make
target output is the baseline evidence.
- The daily-triage-specific helper is either kept as a lower-level diagnostic or
folded into the generalized status command.
## Task: Verify against recent scheduled-run evidence
```task
id: ACTIVITY-WP-0018-T06
status: done
priority: medium
state_hub_task_id: "24efbe9f-dfff-482f-9edc-456379c9a2aa"
```
Prove the new surface against the recent evidence that motivated this workplan.
Acceptance:
- Running the status command over the window starting Friday, 2026-06-26 shows
that the daily triage schedule fired on 2026-06-26 and 2026-06-27 but did not
produce clean validated reports.
- The command distinguishes scheduling health from output/schema validation
failure.
- Disabled or waiting schedules, such as the weekly coding retro gate when its
upstream read model is not available, are reported without being counted as
missed runs.
- Verification results are recorded in this workplan and as a State Hub progress
note once the implementation lands.
## Implementation Result
Completed 2026-06-29: implemented the own-infrastructure automation status
surface and codified the scheduling preference.
Delivered:
- `AGENTS.md` now states that durable activity-core automations use repo-owned
infrastructure, not coding assistant automation/reminder/heartbeat tooling, as
execution or evidence authority.
- `SCOPE.md` and `docs/runbook.md` describe the deterministic status surface and
assistant boundary.
- `src/activity_core/automation_status.py` and `scripts/automation_status.py`
provide the non-LLM CLI.
- `make automation-status SINCE=...` and `make automation-status-json` expose the
baseline operator commands.
- `tests/test_automation_status.py` covers date shortcuts, cron fire estimation,
completed runs, validation failures, missed runs, disabled schedules, partial
source availability, and working-memory evidence parsing.
Verification:
```bash
python3 -m py_compile src/activity_core/automation_status.py scripts/automation_status.py tests/test_automation_status.py
/home/worsch/.local/bin/uv run pytest tests/test_automation_status.py tests/test_daily_triage_verifier.py -q
/home/worsch/.local/bin/uv run python scripts/automation_status.py \
--since 2026-06-26 --until 2026-06-27 --db-url '' \
--progress-event-type daily_triage --timeout-seconds 10 \
--working-memory-dir /tmp --format json
```
Results:
- focused tests: `11 passed`;
- `make help` lists `automation-status` and `automation-status-json`;
- the 2026-06-26 through 2026-06-27 status run exited `1` as expected because
State Hub evidence classified daily triage activity
`6fca51fa-387a-4fd0-bc4e-d62c29eb859a` as `validation_failed` with two
non-secret evidence records: 2026-06-26 `Expecting ',' delimiter` and
2026-06-27 `Unterminated string`;
- the same report classified the gated weekly coding retro as `disabled`, not
`missed`.

@@ -0,0 +1,204 @@
---
id: ACTIVITY-WP-0019
type: workplan
title: "Automation schedule inventory Make targets"
domain: infotech
repo: activity-core
status: finished
owner: codex
topic_slug: automation-inventory
created: "2026-06-29"
updated: "2026-07-01"
state_hub_workstream_id: "21c73763-9adc-42f6-8fd2-1b8b33c2c770"
---
# Automation schedule inventory Make targets
## Goal
Provide a repo-native, non-LLM way to list every scheduled automation that
activity-core knows about.
`ACTIVITY-WP-0018` added the status surface for questions like "How did our
automations go since Friday?". The next operator question is the inventory
baseline: "What automations are scheduled at all?" That should be answerable
through Make targets backed by activity-core's own ActivityDefinitions,
database, and Temporal schedule metadata when available, independent of any
coding assistant automation infrastructure.
## Review notes
- `Makefile` currently exposes `automation-status` and
`automation-status-json`, but no dedicated inventory/list target.
- `scripts/automation_status.py` and `src/activity_core/automation_status.py`
already load scheduled ActivityDefinitions and compute their Temporal schedule
ids. The inventory target should reuse that parsing/loading posture where it
fits rather than creating a second discovery path.
- `make sync-schedules` reconciles Temporal schedules from the
`activity_definitions` database, but it is an action target, not a read-only
operator inventory command.
- The inventory command should remain useful in degraded local mode: file-backed
definitions are enough to list configured scheduled automations, while live
DB and Temporal visibility can enrich the output.
## Task: Define the automation inventory contract
```task
id: ACTIVITY-WP-0019-T01
status: done
priority: high
state_hub_task_id: "8de24590-f9ee-4d0e-8692-b7ada9f232ed"
```
Define the fields and source precedence for a deterministic scheduled
automation inventory report.
Acceptance:
- The report includes every ActivityDefinition with `trigger_type` of `cron` or
`scheduled`, including disabled definitions.
- Each row includes id, name, enabled/disabled state, trigger type, schedule
expression or one-shot datetime, timezone, overlap/catchup policy when known,
and the derived Temporal schedule id.
- The report identifies its source for each row: database, repo definition file,
Temporal visibility, or a combination.
- If Temporal is reachable, the report adds paused/missing/drift hints without
mutating schedules.
- Missing optional sources produce warnings, not silent omissions.
- The JSON shape is stable enough for scripts and tests.
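One way to picture the contract above is a row type plus a deterministic JSON renderer. This is a sketch under assumptions: the `InventoryRow` class, its field names, and `render_report` are hypothetical and only mirror the fields listed in the acceptance points; the real report shape may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class InventoryRow:
    """Hypothetical row shape mirroring the contract fields."""
    id: str
    name: str
    enabled: bool
    trigger_type: str                      # "cron" or "scheduled"
    schedule: str                          # cron expression or one-shot datetime
    timezone: str
    temporal_schedule_id: str              # derived elsewhere; opaque here
    overlap_policy: Optional[str] = None   # None when not known
    catchup_policy: Optional[str] = None   # None when not known
    sources: List[str] = field(default_factory=list)  # e.g. "database"
    hints: List[str] = field(default_factory=list)    # paused/missing/drift


def render_report(rows: List[InventoryRow], warnings: List[str]) -> str:
    """Render rows and warnings as JSON with a stable key and row order."""
    payload = {
        "rows": [asdict(r) for r in sorted(rows, key=lambda r: (r.name, r.id))],
        "warnings": list(warnings),
    }
    return json.dumps(payload, indent=2, sort_keys=True)
```

Sorting both the rows and the keys is one simple way to keep the output diffable across runs, which is what scripts and tests tend to rely on.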
## Task: Implement a non-mutating inventory CLI
```task
id: ACTIVITY-WP-0019-T02
status: done
priority: high
state_hub_task_id: "538cb9a5-48f3-470c-8518-29ee66c96678"
```
Add a deterministic CLI path for listing scheduled automations without requiring
LLM credentials or coding assistant tooling.
Acceptance:
- A script or module command, likely sharing code with
`activity_core.automation_status`, supports human and JSON output.
- The command is read-only: it does not call `sync-schedules`, upsert schedules,
delete schedules, enqueue workflows, or write State Hub evidence.
- It supports filters by activity id, activity name, enabled state, and trigger
type.
- It loads from the database when configured and falls back to repo definition
files when the database is unavailable or explicitly disabled.
- It optionally enriches rows from Temporal when `TEMPORAL_HOST` is configured,
with bounded timeouts so an unreachable service does not hang the command.
- Unit tests cover DB rows, file fallback, disabled definitions, Temporal
enrichment unavailable, and JSON output.
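The source fallback and the bounded Temporal enrichment described above could look roughly like this. It is a sketch only: `load_definitions`, `enrich_with_timeout`, and the injected loader callables are hypothetical names, and no real Temporal client call is shown; the only library behavior relied on is `asyncio.wait_for`.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

Rows = List[dict]


def load_definitions(
    db_url: str,
    load_from_db: Callable[[str], Rows],
    load_from_files: Callable[[], Rows],
) -> Tuple[Rows, str, List[str]]:
    """Prefer the database; fall back to repo files and say so in a warning."""
    warnings: List[str] = []
    if db_url:
        try:
            return load_from_db(db_url), "database", warnings
        except Exception as exc:  # any DB failure degrades to file mode
            warnings.append(f"database unavailable, using repo definition files: {exc}")
    else:
        warnings.append("database disabled, using repo definition files")
    return load_from_files(), "repo definition file", warnings


async def enrich_with_timeout(
    fetch: Callable[[], Awaitable[dict]],
    timeout_seconds: float,
) -> Tuple[Optional[dict], Optional[str]]:
    """Run a read-only lookup with a hard time bound; return (result, warning)."""
    try:
        return await asyncio.wait_for(fetch(), timeout=timeout_seconds), None
    except asyncio.TimeoutError:
        return None, f"Temporal enrichment timed out after {timeout_seconds}s"
    except OSError as exc:  # connection refused, DNS failure, and similar
        return None, f"Temporal unreachable: {exc}"
```

Returning a warning string instead of raising keeps the command's exit path simple: the inventory still prints, and the caller appends the warning to the report rather than silently omitting the source.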
## Task: Add Make targets
```task
id: ACTIVITY-WP-0019-T03
status: done
priority: high
state_hub_task_id: "f2001721-07f3-42f5-a15e-0c7d1b0ed801"
```
Expose the inventory command through Make targets that are easy for humans,
scripts, and coding assistants to run before asking for a prose summary.
Acceptance:
- `make automation-list` prints a concise human-readable inventory.
- `make automation-list-json` emits the same inventory as JSON.
- Optional Make variables pass through cleanly, for example `ENABLED=true`,
`TRIGGER=cron`, `ACTIVITY_ID=<uuid>`, or `FORMAT=json`.
- `make help` lists both targets with clear one-line descriptions.
- The targets do not require LLM access, Codex automation tooling, or
interactive prompts.
## Task: Document the inventory workflow
```task
id: ACTIVITY-WP-0019-T04
status: done
priority: medium
state_hub_task_id: "f687743b-3936-413e-ae50-d35484ae9a81"
```
Update operator documentation so the scheduled automation inventory path is
discoverable next to the status path.
Acceptance:
- `docs/runbook.md` documents `make automation-list` and
`make automation-list-json`.
- The docs distinguish inventory from status: inventory answers what is
configured; status answers what happened in a time window.
- The docs state that the command is read-only and uses activity-core-owned
scheduling evidence.
- The docs include a compact example of the expected human output.
## Task: Verify against current repo and live/degraded sources
```task
id: ACTIVITY-WP-0019-T05
status: done
priority: medium
state_hub_task_id: "5317b532-5cef-4eff-b6d8-3e85bbca8e8a"
```
Prove the target against the current scheduled automation definitions and
degraded local conditions.
Acceptance:
- `make automation-list` shows the current scheduled automations, including
daily triage and weekly scheduled definitions when present in the selected
source.
- JSON output is valid and includes the same rows.
- A DB-unavailable run falls back to repo definition files or reports a clear
warning if no definitions are discoverable.
- A Temporal-unavailable run exits successfully with Temporal warnings rather
than hanging.
- Focused tests pass and the result is recorded in this workplan before the
workplan is moved to `finished`.
## Implementation Result
Completed 2026-07-01: implemented the read-only scheduled automation inventory
surface.
Delivered:
- `scripts/automation_inventory.py` exposes the inventory CLI backed by
`activity_core.automation_status` shared definition and Temporal helpers.
- `make automation-list` and `make automation-list-json` list configured
scheduled ActivityDefinitions with filters for `ENABLED`, `TRIGGER`,
`ACTIVITY_ID`, and `ACTIVITY_NAME`.
- JSON output is script-safe; the Make JSON target suppresses command echo and
recursive make directory chatter.
- `docs/runbook.md` now distinguishes inventory (what is configured) from status
(what happened in a time window).
- Tests cover DB-backed rows, file fallback, disabled filtering, Temporal
unavailable warnings, and JSON CLI output.
Verification:
```bash
/home/worsch/.local/bin/uv run pytest tests/test_automation_status.py tests/test_daily_triage_verifier.py -q
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make automation-list ACTCORE_DB_URL= TEMPORAL_HOST='
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make automation-list-json ACTCORE_DB_URL= TEMPORAL_HOST= > /tmp/activity-core-inventory.json && python3 -m json.tool /tmp/activity-core-inventory.json >/tmp/activity-core-inventory.pretty'
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make automation-list ACTCORE_DB_URL= TEMPORAL_HOST= ENABLED=true TRIGGER=cron'
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make help'
```
Results:
- focused tests: `16 passed`;
- degraded Make inventory run listed 9 file-backed scheduled automations, with
5 enabled and 4 disabled;
- filtered Make run with `ENABLED=true TRIGGER=cron` listed 5 enabled cron
automations;
- `automation-list-json` emitted parseable JSON directly;
- `make help` lists `automation-list` and `automation-list-json`.

@@ -3,6 +3,7 @@ type: session-note
created: "2026-03-28"
updated: "2026-06-03"
status: archived
state_hub_workstream_id: "b221e65a-6f97-44b0-8dae-442fffcb7f64"
---
# WP-0002 Handoff Note — Continue on CoulombCore