generated from coulomb/repo-seed
Compare commits
69 Commits
17e2e39165
...
main
| SHA1 | Author | Date | |
|---|---|---|---|
| 6a5321525e | |||
| 2f55167215 | |||
| ffe10f098e | |||
| 3f85274916 | |||
| bb14d08212 | |||
| 92629e7a91 | |||
| 951ec56f7a | |||
| 9440d539c6 | |||
| 2ff852da29 | |||
| 30043348f0 | |||
| 18fcce87fe | |||
| 17b787fad0 | |||
| 6c8cb1b7b6 | |||
| ec66e06066 | |||
| 919edd98ac | |||
| bf877b7f0d | |||
| 9be4ddbdb7 | |||
| c5440e8429 | |||
| 53dc0f6e93 | |||
| a70c00a789 | |||
| b41b6034ee | |||
| 960fb05268 | |||
| b7b0b5bf6e | |||
| 14f76fb6d9 | |||
| caa2608092 | |||
| 61f278d643 | |||
| 0e9e18a59a | |||
| 5eb33bd3bb | |||
| 612c226472 | |||
| 0b2c68838e | |||
| 4b5e96d7c1 | |||
| 65ef005c2d | |||
| 0e75aaec01 | |||
| b2e57707a7 | |||
| 88fe359385 | |||
| f90591c5f1 | |||
| cf7a11dcd9 | |||
| 99e5d525a8 | |||
| 8424c13783 | |||
| 864f90f9b9 | |||
| 053d18b24a | |||
| 77af65afb2 | |||
| 0495f8a43f | |||
| c6cad9e7b3 | |||
| a83b117f60 | |||
| ffc0ee2cb7 | |||
| 59b3b73061 | |||
| 4bc5111dfd | |||
| e9a6029ded | |||
| bf4e61f0bf | |||
| 40fa851ec0 | |||
| e0742d18d7 | |||
| ccac285b0a | |||
| a0dcc52353 | |||
| faf5d60ae8 | |||
| adfd1a9067 | |||
| 44987457c1 | |||
| 3a981cc98f | |||
| dbd2fbb11c | |||
| c938b80503 | |||
| 3e93567a53 | |||
| 6f68f8f9ec | |||
| f05c56e202 | |||
| 200ec0c97a | |||
| 42e5ef725c | |||
| a08bd1684f | |||
| 2078915854 | |||
| 23f4956b68 | |||
| 764339e490 |
50
.claude/rules/credential-routing.md
Normal file
@@ -0,0 +1,50 @@
# Credential and access routing

**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.

ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.

### Lookup (do this first)

```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```

Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).

| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=activity-core` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |

### Quick routing table

| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
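
The quick routing table can be read as a plain lookup from need to owner. The following is a minimal sketch under stated assumptions, not part of ops-warden: the `ROUTES` dict, its keys, and `owner_for` are hypothetical names made up for illustration, and the canonical answer still comes from `warden route find`.

```python
# Hypothetical mirror of the quick routing table. The keys are invented for
# this sketch; the canonical source is `warden route find` and the catalog.
ROUTES = {
    "ssh-cert": ("ops-warden", True),
    "api-key": ("OpenBao (railiance-platform)", False),
    "db-password": ("OpenBao (railiance-platform)", False),
    "provider-token": ("OpenBao (railiance-platform)", False),
    "login": ("key-cape / Keycloak", False),
    "authorization": ("flex-auth", False),
    "issue-emission": ("activity-core + issue-core", False),
    "ssh-tunnel": ("ops-bridge", False),
}


def owner_for(need: str) -> tuple[str, bool]:
    """Return (owner, ops_warden_executes) for a need key."""
    try:
        return ROUTES[need]
    except KeyError:
        # Unknown needs are never guessed; fall back to the real lookup.
        raise LookupError(
            f"unknown need {need!r}; run `warden route find` instead"
        ) from None
```

Only the SSH certificate row maps to `True`; every other row is a pointer to another owner, which matches the "route only" column of the table.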

### Anti-patterns (do not do these)

- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat

### Other capabilities (reuse-surface)

Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.

**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
@@ -1,11 +1,11 @@
## First Session Protocol

Triggered when `get_domain_summary("custodian")` shows **no workstreams**.
Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
The project is registered but work has not yet been structured.

**Step 1 — Read, don't write**
- `~/the-custodian/canon/projects/custodian/project_charter_v0.1.md` — purpose, scope
- `~/the-custodian/canon/projects/custodian/roadmap_v0.1.md` — planned phases
- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
- Scan repo root: README, directory structure, existing code or docs

**Step 2 — Survey in-progress work**
@@ -17,7 +17,7 @@ roadmap phase. **Wait for approval before creating.**

**Step 4 — Create workplan file first, then DB record (ADR-001)**
```
workplans/activity-core-WP-NNNN-<slug>.md ← write this first
workplans/ACTIVITY-WP-NNNN-<slug>.md ← write this first
```
Then register in the hub:
```
@@ -28,7 +28,7 @@ create_task(workstream_id="<id>", title="...", priority="high|medium|low")
**Step 5 — Record the setup**
```
add_progress_event(
    summary="First session: structured custodian into N workstreams, M tasks",
    summary="First session: structured infotech into N workstreams, M tasks",
    event_type="milestone",
    topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
    detail={"workstreams": [...], "tasks_created": M}
@@ -1,5 +1,5 @@
**Purpose:** Durable task factory built on Temporal. Manages ActivityDefinitions, schedules recurring workflows via Temporal Schedules, routes events via NATS JetStream, and exposes a FastAPI CRUD surface for the custodian domain.

**Domain:** custodian
**Domain:** infotech
**Repo slug:** activity-core
**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a
@@ -1,6 +1,7 @@
## Session Protocol

State Hub: http://127.0.0.1:8000
Dev Hub (State Hub API): http://127.0.0.1:8000
MCP server name in `~/.claude.json`: `dev-hub`

**Step 1 — Orient**
@@ -10,7 +11,7 @@ cat .custodian-brief.md
```
Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
```
get_domain_summary("custodian")
get_domain_summary("infotech")
```
If MCP tools are unavailable in the current agent session, use the REST API:
```bash
@@ -39,11 +40,11 @@ curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
ls workplans/
```
For each file with `status: ready`, `active`, or `blocked`, note pending
`todo`/`in_progress` tasks.
`wait`/`todo`/`progress` tasks.

**Step 4 — Present brief**

1. **Active workstreams** for `custodian` — title, task counts, blocking decisions
1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
2. **Pending tasks** from `workplans/` + any `[repo:activity-core]` hub tasks
3. **Goal guidance** — if `goal_guidance` in summary:
   - `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
@@ -1,7 +1,7 @@
## Workplan Convention (ADR-001)

File location: `workplans/activity-core-WP-NNNN-<slug>.md`
ID prefix: `ACTIVITY-WP`
File location: `workplans/ACTIVITY-WP-NNNN-<slug>.md`
ID prefix: `ACTIVITY-WP-`

Work items originate as files in this repo **before** being registered in the hub.
@@ -12,7 +12,7 @@ repo state, and `finished` when implementation is complete. `stalled` and
`needs_review` are derived health labels, not stored statuses.

Closed workplans may be moved to `workplans/archived/` with a completion-date
prefix: `YYMMDD-activity-core-WP-NNNN-<slug>.md`. The frontmatter id remains
prefix: `YYMMDD-ACTIVITY-WP-NNNN-<slug>.md`. The frontmatter id remains
unchanged; the prefix is only for quick visual reference.

Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
@@ -25,4 +25,16 @@ Ecosystem todos from other agents arrive as `[repo:activity-core]` hub tasks —
visible at session start. Pick one up by creating the workplan file, then registering
the workstream.

Task blocks use this shape:

```task
id: ACTIVITY-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
```

Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
blocked work and `cancel` for stopped work.
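
The status progression above can be expressed as a small helper. This is a minimal sketch, not code from this repo: `TASK_STATUSES`, `PROGRESSION`, and `next_status` are hypothetical names, and returning `None` for `wait` and `cancel` is an assumption of the sketch, since the convention only defines the `todo` → `progress` → `done` order.

```python
from typing import Optional

# The five statuses allowed in a task block.
TASK_STATUSES = ("wait", "todo", "progress", "done", "cancel")
# The main progression; `wait` and `cancel` sit outside it.
PROGRESSION = ("todo", "progress", "done")


def next_status(current: str) -> Optional[str]:
    """Return the next status in the main progression, or None if there is none."""
    if current not in TASK_STATUSES:
        raise ValueError(f"unknown task status: {current!r}")
    if current in PROGRESSION[:-1]:
        return PROGRESSION[PROGRESSION.index(current) + 1]
    # `done`, `wait`, and `cancel` have no automatic successor in this sketch.
    return None
```

Rejecting unknown values up front catches legacy statuses such as `in_progress` before they reach a workplan file.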

<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->
@@ -1,33 +1,31 @@
<!-- custodian-brief: generated by fix-consistency — do not edit manually -->
# Custodian Brief — activity-core

**Domain:** custodian
**Last synced:** 2026-06-18 13:20 UTC
**Domain:** infotech
**Last synced:** 2026-07-02 00:19 UTC
**State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*

## Active Workstreams

### Definition And Schedule Hot Reload
Progress: 0/5 done | workstream_id: `8887075e-21ec-451b-b82b-cd81035c9ca5`
### LLM Output Robustness & The Producer Trust Boundary
Progress: 8/10 done | workstream_id: `4ef0d53b-1777-41ae-80c6-1b69fdb34726`

**Open tasks:**
- ! Live No-Restart Smoke `68a0e22a`
- · Extract Reusable Sync Service `53a7970b`
- · Add Admin Sync Endpoint `8697c761`
- · Preserve Schedule Drift Semantics `efeac412`
- · Optional Background Sync Loop `d774087b`
- ! Reproduce & Root-Cause The Failure `74fd16a5`
*(wait: Local analysis complete: mechanism is the unbounded ~1-per-workstream recommendation list (16 active workstreams; break at char 5268 ~rank 8-9); both first attempt and retry failed. Exact token + finish_reason are unrecoverable from activity-core (complete() drops finish_reason; report cap 4000 < 5268; log cap 2000). Remaining: pull llm-connect producer-side logs on railiance01 (cluster/operator-owned). Does NOT block T02/T03 — mitigation is identical regardless.)*
- ► Tests + Calibration Re-Entry `c881500b`

### Post-triage operational hardening
Progress: 6/7 done | workstream_id: `5646e13a-13af-4724-bca6-3c0d86f96733`
### Adopt State Hub Beachhead Endpoint
Progress: 0/2 done | workstream_id: `bbc07f9e-9323-4b2b-b556-c33b37d0b228`

**Open tasks:**
- ! Three-Run Calibration Feedback `7cbf0a35`
- ! Point STATE_HUB_URL at the beachhead `76b6132d`
- ! Retire the bespoke actcore-state-hub-bridge proxy `526c2129`

### Daily Triage LLM Reconciliation And Evidence
Progress: 1/5 done | workstream_id: `f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9`
Progress: 2/5 done | workstream_id: `f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9`

**Open tasks:**
- ! Reconcile Live Railiance Runtime `23545ddc`
- ! Run Daily Triage Fixture Smoke `10e0df77`
- ! Collect Three Clean Scheduled Runs `dc6b9482`
- ! Close Handoff State `ecc57e21`
@@ -49,6 +47,6 @@ Progress: 2/3 done | workstream_id: `7387fc50-1f2c-471a-9d85-bb085cbd0b63`
## MCP Orientation (when available)

If the state-hub MCP server is reachable, call:
`get_domain_summary("custodian")`
`get_domain_summary("infotech")`
This provides richer cross-domain context.
If the MCP call fails, use this file as your orientation source.
@@ -18,7 +18,9 @@ STATE_HUB_URL=http://127.0.0.1:8000
# Repo scoping — used by the repo-scoping context adapter. Binds {} on failure.
REPO_SCOPING_URL=http://127.0.0.1:8020
# Issue Core — task emission backend.
ISSUE_CORE_URL=http://127.0.0.1:8010
ISSUE_CORE_URL=http://127.0.0.1:8765
# Shared ingestion key — must match issue-core's ISSUE_CORE_API_KEY.
ISSUE_CORE_API_KEY=
# Sink type: 'rest' (POST to issue-core) or 'null' (discard, for dry-run).
ISSUE_SINK_TYPE=rest
@@ -1,17 +1,15 @@
# Kaizen scheduled agent execution (ADR-005)
# Engagement: coulomb-loop — stabilize phase (daily crons per ADR-003)
# Promoted 2026-06-18 after 3/3 bootstrap E2E cycles
# Kaizen scheduled agent execution manifest (ADR-005)
# Engagement: coulomb-loop bootstrap — weekly cadence
# Regulator promotes cadence per customer engagement policy (ADR-003).
# Validate with: kaizen-agentic schedule validate
version: '1'
timezone: Europe/Berlin
agents:
  coach:
    cadence: daily
    cron: "0 9 * * *"
    cadence: weekly
    cron: 0 9 * * 1
    enabled: true
  optimization:
    cadence: daily
    cron: "0 10 * * *"
    cadence: weekly
    cron: 0 10 * * 1
    enabled: true
  tdd-workflow:
    cadence: monthly
    enabled: false
28
.repo-classification.yaml
Normal file
@@ -0,0 +1,28 @@
# Repo classification (Repo Classification Standard v1.0).

repo_classification:
  standard: Repo Classification Standard
  version: '1.0'
  classified_at: '2026-06-22'
  classified_by: human
  category: tooling
  domain: infotech
  secondary_domains:
  - agents
  capability_tags:
  - workflow
  - orchestration
  - automation
  - coordination
  - observability
  business_stake:
  - technology
  - operations
  - automation
  - execution
  business_mechanics:
  - coordination
  - operation
  - adaptation
  notes: Org-wide event bridge / task factory (Temporal-based). Active bounded implementation
    -> project.
83
AGENTS.md
@@ -4,7 +4,7 @@

**Purpose:** Durable task factory built on Temporal. Manages ActivityDefinitions, schedules recurring workflows via Temporal Schedules, routes events via NATS JetStream, and exposes a FastAPI CRUD surface for the custodian domain.

**Domain:** custodian
**Domain:** infotech
**Repo slug:** activity-core
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
**Workplan prefix:** `ACTIVITY-WP-`
@@ -83,7 +83,7 @@ curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
2. Check inbox: `GET /messages/?to_agent=activity-core&unread_only=true`; mark read
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
4. Check blocked tasks: `GET /tasks/?needs_human=true`
4. Check human-needed tasks: `GET /tasks/?needs_human=true`

**During work:**
- Update task statuses in workplan files as tasks progress
@@ -101,6 +101,78 @@ curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \

---

## Credential and access routing

**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.

ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.

### Lookup (do this first)

```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```

Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).

| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=activity-core` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |

### Quick routing table

| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |

### Anti-patterns (do not do these)

- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat

### Other capabilities (reuse-surface)

Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.

**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`

<!-- REPO-AGENTS-EXTENSIONS -->
<!-- Append repo-specific agent instructions below this marker.
The state-hub template sync preserves content after this line. -->

---

## Automation Scheduling Preference

Durable activity-core automations must use this repo's own infrastructure:
Temporal Schedules, NATS JetStream, activity-core run records, State Hub
progress, and configured report/evidence sinks. Do not use coding
assistant-provided automation, reminder, or heartbeat tooling as the execution
or evidence source for production or operational recurrence.

Coding assistants may run repo-native inspection commands and summarize their
outputs, but the baseline answer to questions like "How did our automations go
since Friday?" must come from deterministic local tooling such as the
ACTIVITY-WP-0018 automation status surface.

---

## Workplan Convention (ADR-001)

Work items originate as files in this repo — not in the hub. The hub is a
@@ -124,7 +196,7 @@ anything needing analysis, design, approval, dependencies, or multiple phases.
id: ACTIVITY-WP-NNNN
type: workplan
title: "..."
domain: custodian
domain: infotech
repo: activity-core
status: proposed | ready | active | blocked | backlog | finished | archived
owner: codex
@@ -154,10 +226,7 @@ state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
Task description text.
```

Status progression: `todo` → `progress` → `done`; use `wait` for a task
blocked on external input and `cancel` for intentionally abandoned work.
Workstream/workplan lifecycle status is separate; frontmatter `blocked` remains
valid there.
Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.

To create a new workplan:
1. Write the file following the format above
@@ -8,4 +8,5 @@
@.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md
@.claude/rules/repo-boundary.md
@.claude/rules/credential-routing.md
@.claude/rules/agents.md
27
Makefile
@@ -1,13 +1,17 @@
-include .env
export

.PHONY: sync-event-types sync-activity-definitions test migrate sync-all \
.PHONY: sync-event-types sync-activity-definitions sync-schedules test migrate sync-all \
	automation-status automation-status-json automation-list automation-list-json \
	dev-up dev-down railiance-up railiance-down \
	start-worker start-api start-event-router help

sync-activity-definitions: ## Sync ActivityDefinition files into DB
	uv run python -m activity_core.sync_activity_definitions

sync-schedules: ## Reconcile Temporal schedules from activity_definitions DB
	uv run python -m activity_core.sync_schedules

sync-event-types: ## Sync event type YAML files into DB
	uv run python scripts/sync_event_types.py
@@ -21,6 +25,27 @@ migrate: ## Apply all pending Alembic migrations

sync-all: sync-event-types sync-activity-definitions ## Sync event types and activity definitions

# -- Automation status ---------------------------------------------------------

SINCE ?= today
FORMAT ?= human
ENABLED ?= all
TRIGGER ?=
ACTIVITY_ID ?=
ACTIVITY_NAME ?=

automation-status: ## Report recent automation status from repo-owned evidence
	uv run python scripts/automation_status.py --since "$(SINCE)" $(if $(UNTIL),--until "$(UNTIL)",) --format "$(FORMAT)"

automation-status-json: ## Report recent automation status as JSON
	$(MAKE) automation-status FORMAT=json

automation-list: ## List configured scheduled automations from repo-owned definitions
	@uv run python scripts/automation_inventory.py --format "$(FORMAT)" --enabled "$(ENABLED)" $(if $(TRIGGER),--trigger-type "$(TRIGGER)",) $(if $(ACTIVITY_ID),--activity-id "$(ACTIVITY_ID)",) $(if $(ACTIVITY_NAME),--activity-name "$(ACTIVITY_NAME)",)

automation-list-json: ## List configured scheduled automations as JSON
	@$(MAKE) --no-print-directory automation-list FORMAT=json

# ── Infrastructure ─────────────────────────────────────────────────────────────

dev-up: ## Start full dev stack (Temporal + PG + ES + NATS)
16
SCOPE.md
@@ -64,7 +64,9 @@ The two evaluation modes:
`context.*` / `event.*` interpolation and explicit `for_each` per-item
binding. No `exec()`.
- **Instruction executor**: trusted-field prompt rendering, LLM call via
llm-connect, structured output validation, bounded validation-failure
llm-connect, structured output validation, item-granular recovery with a
quarantine lane and producer guardrails (count/length/depth caps, reference
allow-list) at the producer trust boundary, bounded validation-failure
artifacts for report instructions, review-required audit metadata, and
deterministic report sinks. A real downstream review queue is not implemented
in this repo.
@@ -88,6 +90,9 @@ The two evaluation modes:
- **REST admin API** (FastAPI): CRUD for ActivityDefinitions, manual trigger,
event type registry queries.
- **Prometheus metrics**: Temporal SDK metrics exposed for scraping.
- **Automation status surface**: deterministic, non-LLM status reporting via
`make automation-status` / `scripts/automation_status.py`, using repo-owned
evidence sources rather than coding assistant scheduler state.
- **Operational runbook**: `docs/runbook.md`.

---
@@ -114,6 +119,10 @@ The two evaluation modes:
runs on Railiance infrastructure (or Docker Compose for dev).
- **End-user task UI** — tasks land in issue-core; presentation is separate.
- **Synchronous request-response patterns** — Temporal is async-first.
- **Coding assistant automation infrastructure** — assistant-provided reminders,
heartbeats, or scheduled jobs are not the execution or evidence authority for
activity-core automations. Assistants may run and summarize repo-native
commands only.

---
@@ -130,6 +139,8 @@ The two evaluation modes:
commands.
- You are replacing scattered bespoke cron jobs and manual coordination with
a governed, observable automation layer.
- You need to answer "how did our automations go since Friday?" from
deterministic repo-native evidence before any optional LLM summary.

---
@@ -320,6 +331,9 @@ new one-off control paths.
governance model, event type schema, ActivityDefinition structure.
- `docs/adr/adr-003-rule-instruction-model.md` — Rule DSL, Instruction safety
model, evaluation semantics, audit trail, testing strategy.
- `docs/adr/adr-004-producer-trust-boundary.md` — untrusted-producer premise,
trust-but-handle vs verify-and-mitigate postures, error-locality and
quarantine-with-provenance, producer guardrails for LLM/agent/human output.

---
156
docs/adr/adr-004-producer-trust-boundary.md
Normal file
@@ -0,0 +1,156 @@
---
id: ACT-ADR-004
type: architecture-decision-record
title: "The Producer Trust Boundary — Guardrails and Error-Correction for Untrusted Output"
status: accepted
decided_by: Bernd Worsch
date: "2026-06-26"
scope: cross-repo
affects:
- activity-core
- rules-core (future extraction)
tags: ["architecture", "llm", "safety", "validation", "guardrails", "trust-boundary", "resilience"]
---

# ACT-ADR-004: The Producer Trust Boundary

## Status

Accepted.

## Context

On 2026-06-26 the scheduled daily WSJF triage instruction fired on time, called
llm-connect successfully, and produced a long ranked recommendation list — but
the JSON broke at char 5268 (~rank 8–9 of ~16), failing schema validation. Because
the report was validated and consumed as a single monolithic JSON document, one
malformed delimiter discarded the **entire** run, including the 7 perfectly good
recommendations the model had already emitted. The scheduling and runtime layers
were healthy; the failure was entirely at the seam where free-form model output
meets a strict consumer.

This is not a one-off bug, it is a recurring class. activity-core has a **trust
boundary** wherever generative or human-authored output meets strict deterministic
consumers: the JSON Schema validator, the task emitter, and any classic compute
pipeline downstream. The producers on the other side of that boundary — **LLMs,
agents, and humans** — are all *untrusted producers*. Their output may be:

- **erroneous** — hallucination, truncation at a token limit, drift, type slips,
typos, a missing delimiter; or
- **malicious** — prompt injection, crafted payloads, or oversized / deeply-nested
structures intended to exhaust or confuse the consumer.

The pre-existing design treated producer output optimistically: parse the whole
document, validate the whole document, and on any failure discard the whole
document (preserving only a bounded diagnostic preview). That gives **zero error
locality** — the blast radius of any single defect is the entire activation.

## Decision

Treat the producer→consumer seam as an explicit, adversarial **trust boundary**,
and place guardrails plus error-correction tooling *at that boundary* rather than
letting raw producer output flow into deterministic consumers.

### Two non-fail-fast postures

When hard-failing on a problem is undesirable, there are two sound strategies, and
they **compose**:

- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output
as-is; on exception, catch → repair → retry → or quarantine. Cheap on the happy
path; blast radius depends entirely on how granular the catch is. Best when
failures are rare and locally recoverable. Risk: failures surface late, possibly
after partial side effects.
- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp,
and normalize the output to a known-good shape *before* it enters the pipeline —
drop bad items, coerce types, bound sizes/depth, allow-list references — so the
consumer only ever sees clean input. Higher upfront cost, smaller blast radius,
no partial side effects. Best when failures are common or consequences are high.

### Governing principles

1. **Push verification to the boundary; keep the interior strict.** Apply posture
**B** at the producer→consumer boundary; keep posture **A** for residual
exceptions inside the verified core. Never relax the interior schema to absorb
producer sloppiness.
2. **Make error locality match the unit of work.** One bad recommendation must
cost one recommendation, not the whole report. Structuring the payload so each
item is independently parseable and validatable is the highest-leverage change.
3. **Quarantine, never silently drop.** Invalid units are preserved as bounded,
provenance-tagged artifacts (`index`, `error`, `raw` snippet, `reason`) so they
can be debugged or replayed. Degraded-but-usable is reported distinctly from
total loss.
4. **Both human and agent input get the same rigor.** Guardrails are
producer-agnostic: the same count / length / depth caps and reference
allow-lists apply whether the producer is an LLM, an agent, or a human.
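
Principles 2 to 4 can be illustrated with a per-item screen that quarantines rejects with provenance instead of failing the whole report. This is a minimal sketch, not the code in `src/activity_core/rules/executor.py`: `depth`, `longest_string`, `screen`, and `partition` are hypothetical names, the thresholds are made-up values, and keeping the first `max_items` survivors assumes the list is already ordered by rank. The reason strings follow the vocabulary this ADR lists for quarantined items.

```python
from typing import Any, Optional

# Hypothetical thresholds for illustration; not the values used in activity-core.
MAX_DEPTH = 4
MAX_STRING_LEN = 200
REQUIRED_FIELDS = ("rank", "candidate", "action", "why")


def depth(value: Any) -> int:
    """Nesting depth of dicts and lists; scalars have depth 0."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0


def longest_string(value: Any) -> int:
    """Length of the longest string found anywhere inside the value."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, dict):
        return max((longest_string(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((longest_string(v) for v in value), default=0)
    return 0


def screen(item: Any, allow_list: Optional[set]) -> Optional[str]:
    """Return a quarantine reason for one item, or None if it passes every check."""
    if not isinstance(item, dict):
        return "malformed"
    if any(f not in item for f in REQUIRED_FIELDS) or not isinstance(item["candidate"], str):
        return "schema"
    if depth(item) > MAX_DEPTH or longest_string(item) > MAX_STRING_LEN:
        return "guardrail"
    if allow_list is not None and item["candidate"] not in allow_list:
        return "allow_list"
    return None


def partition(items, allow_list=None, max_items=None):
    """Split items into (kept, quarantined); one bad item costs only itself."""
    kept, quarantined = [], []
    for index, item in enumerate(items):
        reason = screen(item, allow_list)
        if reason is None and max_items is not None and len(kept) >= max_items:
            reason = "over_limit"
        if reason is None:
            kept.append(item)
        else:
            quarantined.append({
                "index": index,
                "reason": reason,
                "error": f"item {index} rejected: {reason}",
                "raw": repr(item)[:120],  # bounded snippet, never the full payload
            })
    return kept, quarantined
```

The same function applies whether the list came from an LLM, an agent, or a human, and a non-empty `quarantined` list is what lets a report be flagged as partial rather than lost.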
|
||||
|
||||
### What this means concretely in activity-core

Implemented in `src/activity_core/rules/executor.py`:

- **Strict-structure-only schema.** The daily-triage output schema
  (`schemas/daily-triage-report.json`; its format is governed by ACT-ADR-002) is
  strict on per-item *structure* (`required [rank, candidate, action, why]`,
  typed `wsjf`) and carries `maxItems` as a producer *hint*. It is never a hard
  whole-document reject, because that would reproduce the blast-radius failure
  this work exists to remove.
- **Item-granular recovery (posture B).** When the whole-document parse and its
  one retry both fail, `_resilient_report` recovers individually parseable
  recommendation objects via a brace/quote-aware scanner
  (`_extract_object_spans`) that works for both pretty-printed and NDJSON output,
  attempts a best-effort `_try_repair` on a truncated tail, validates each
  recovered object against the item schema, and keeps the valid ones. Survivors
  are emitted with `output_validated=true`, `partial=true`, and
  `review_required=true`.
- **Producer guardrails (`_partition_items`, applied on both the recovery and the
  happy path).** Per recommendation: structural type → schema → structural caps
  (`_MAX_DEPTH`, `_MAX_STRING_LEN`) → reference allow-list → count cap (top-N by
  `maxItems`). The first failing check quarantines the item with provenance and a
  `reason` (`malformed` / `schema` / `guardrail` / `allow_list` / `over_limit`).
- **Reference allow-list.** A recommendation whose `candidate` is not in the set of
  known ids is quarantined. The set is sourced from resolved context
  (`context["known_candidates"]`, via `_allow_list_from_context`); the check is
  inert until a context resolver populates it, so the capability ships now and
  activates with a one-line resolver change.
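
The brace/quote-aware scan described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the repository's `_extract_object_spans`: the function names, the choice to emit balanced spans at every nesting depth, and the filter on required keys are all hypothetical.

```python
from __future__ import annotations

import json
from typing import Any

REQUIRED_KEYS = ("rank", "candidate", "action", "why")


def extract_object_spans(text: str) -> list[str]:
    """Return every balanced {...} span, ignoring braces inside JSON strings."""
    spans: list[str] = []
    starts: list[int] = []  # stack of indexes of still-open '{'
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            starts.append(i)
        elif ch == "}" and starts:
            spans.append(text[starts.pop() : i + 1])
    return spans


def recover_items(text: str) -> list[dict[str, Any]]:
    """Parse each balanced span and keep objects that look like recommendations."""
    items: list[dict[str, Any]] = []
    for span in extract_object_spans(text):
        try:
            obj = json.loads(span)
        except json.JSONDecodeError:
            continue  # unparseable span: skipped here, quarantined in a fuller version
        if all(key in obj for key in REQUIRED_KEYS):
            items.append(obj)
    return items
```

Because an unterminated outer object or trailing item never closes, it never produces a span, so a truncated document still yields its complete items. Nested objects such as `wsjf` do produce spans, which is why the sketch filters on the required keys.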
|
||||
|
||||
### Where each posture sits

| Layer | Posture | Mechanism |
|-------|---------|-----------|
| Schema / contract | B | strict per-item structure; `maxItems` as hint |
| Whole-document parse | A | tolerant parse + single retry |
| Failed parse | B | item-granular recovery + repair + quarantine |
| Per-item screening | B | schema + depth/length caps + allow-list + count cap |
| Emitted report | n/a | `partial` / `quarantined_*` provenance; never silent |
|
||||
|
||||
## Consequences

- A single malformed or oversized item no longer discards an entire activation;
  the daily-triage run that failed on 2026-06-26 would now deliver its 7 valid
  recommendations and quarantine the broken tail.
- Reports gain a `partial` / `quarantined_*` vocabulary; downstream report sinks
  and reviewers can distinguish degraded-but-usable from total loss.
- Guardrail thresholds (`_MAX_DEPTH`, `_MAX_STRING_LEN`, `maxItems`, the
  allow-list) are policy knobs that will need tuning; they are intentionally
  conservative defaults, not a finished calibration.
- **Known retention gap (follow-on):** `LLMConnectClient.complete()` still returns
  only `content`, discarding `finish_reason`/`usage`, and the total-loss artifact
  caps raw output below realistic break points. Capturing those signals so
  failures stay debuggable is tracked as a retention fix, not closed by this ADR.
|
||||
|
||||
## Alternatives considered

- **Hard-enforce `maxItems` in the validator.** Rejected: a hard reject of an
  over-count document reproduces the whole-document blast radius. Mitigation (keep
  top-N, quarantine the rest) is preferred.
- **Relax the schema to accept anything.** Rejected: violates principle 1; pushes
  malformed data into downstream consumers.
- **Retry-until-valid only (pure posture A).** Rejected as the sole strategy: the
  2026-06-26 failure recurred across both the initial attempt and the retry, so
  retry alone does not bound the blast radius.
|
||||
|
||||
## References

- ACT-ADR-002: markdown-as-definition format and output schema governance.
- ACT-ADR-003: Rule vs. Instruction model; the Instruction prompt-injection
  surface this boundary complements on the output side.
- `workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md`: the
  implementing workplan.
|
||||
@@ -11,7 +11,9 @@ The current authoritative boundary is the issue-core REST API:
|
||||
POST {ISSUE_CORE_URL}/issues/
|
||||
```
|
||||
|
||||
`IssueCoreRestSink` sends this payload:
|
||||
`IssueCoreRestSink` authenticates with the shared `ISSUE_CORE_API_KEY` env var
|
||||
(same value as the issue-core server) via `Authorization: Bearer <key>` and
|
||||
sends this payload:
|
||||
|
||||
```json
|
||||
{
|
||||
@@ -52,7 +54,7 @@ task reference before it can replace `IssueCoreRestSink`.
|
||||
|
||||
Weekly SBOM staleness is safe to evaluate in dry-run mode because the rule
|
||||
contract is deterministic and tested. Do not enable it against the real REST sink
|
||||
until issue-core credentials, endpoint reachability, and duplicate-handling are
|
||||
until `ISSUE_CORE_API_KEY`, endpoint reachability, and duplicate-handling are
|
||||
verified in the target environment.
|
||||
|
||||
## Verification
|
||||
|
||||
175
docs/runbook.md
@@ -116,7 +116,129 @@ asyncio.run(publish())
|
||||
|
||||
---
|
||||
|
||||
## Syncing schedules manually
|
||||
## Syncing definitions and schedules manually
|
||||
|
||||
When the API is running, prefer the admin sync endpoint for definition or
schedule changes. It refreshes file-backed ActivityDefinitions and reconciles
Temporal Schedules without restarting the worker:

```bash
curl -s -X POST \
  'http://localhost:8010/admin/sync?definitions=true&schedules=true'
```

The response reports:

- `definitions.synced`
- `event_types.synced`
- `schedules.upserted`
- `schedules.paused`
- `schedules.deleted_orphans`
- bounded `errors[]`
|
||||
|
||||
## Automation inventory

Use the repo-native inventory command to answer "what automations are scheduled
at all?" before checking whether a recent window succeeded. The command is
read-only: it loads ActivityDefinition rows or files and, when `TEMPORAL_HOST`
is configured, describes Temporal schedules for visibility. It does not sync,
upsert, pause, delete, or enqueue schedules.

```bash
# Human-readable configured automation inventory.
make automation-list

# JSON for scripts or assistant summarization.
make automation-list-json

# Common filters.
make automation-list ENABLED=true TRIGGER=cron
make automation-list ACTIVITY_ID=6fca51fa-387a-4fd0-bc4e-d62c29eb859a
```

Inventory answers what is configured; `make automation-status` answers what
happened in a time window. Missing optional live sources are warnings, not
silent omissions, so a degraded local run still lists repo definition files.

Compact human output looks like:

```text
- Daily State Hub WSJF Triage [enabled cron] schedule=activity-schedule-... trigger=20 7 * * * tz=Europe/Berlin source=files temporal=not_checked
```
|
||||
|
||||
## Automation status

Use the repo-native status command to answer operator questions such as "how did
our automations go since Friday?". This is the baseline evidence surface; LLMs
or coding assistants may summarize the output, but they are not the scheduler or
source of truth.

```bash
# Human-readable status. `friday` resolves in Europe/Berlin by default.
make automation-status SINCE=friday

# JSON for scripts or assistant summarization.
make automation-status-json SINCE=2026-06-26
```

The command reads activity-core owned evidence only: ActivityDefinition files or
DB rows, `activity_runs`, State Hub progress, working-memory report notes, and
Temporal visibility when `TEMPORAL_HOST` is configured. Missing live sources are
reported as warnings rather than hidden. It exits non-zero for real automation
failures such as `missed`, `validation_failed`, or `sink_failed`.

Useful knobs:

```bash
AUTOMATION_STATUS_TIMEOUT_SECONDS=10 make automation-status SINCE=friday
make automation-status SINCE=2026-06-26 FORMAT=json
make automation-status SINCE=2026-06-26 UNTIL=2026-06-27 ACTCORE_DB_URL=
```

Example distinction from the June 2026 daily triage evidence:

```text
- Activity 6fca51fa-387a-4fd0-bc4e-d62c29eb859a [validation_failed] expected=0 runs=0 evidence=2
  evidence state_hub_progress event_type=daily_triage run=ebec6e41... output_validated=false validation_error=Unterminated string...
  evidence state_hub_progress event_type=daily_triage run=c7370f9c... output_validated=false validation_error=Expecting ',' delimiter...
```

That means the schedule/report path left evidence, but the report was not a
clean validated output. Disabled schedules, such as the gated weekly coding
retro, are reported as `disabled` and are not counted as missed runs.
|
||||
|
||||
`event_types` defaults to `false` for the `/admin/sync` endpoint because
event-triggered definitions already reload from the DB in the event router path;
opt in when the operator has intentionally changed event type definition files:
|
||||
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
'http://localhost:8010/admin/sync?definitions=true&schedules=true&event_types=true'
|
||||
```
|
||||
|
||||
The v1 posture is manual/operator-triggered sync. A periodic background loop is
deferred until live use shows it is needed; this keeps customer definition
changes explicit and avoids background repo scanning from the worker.

### Railiance01 no-restart smoke

After changing a projected definition in `k8s/railiance/20-runtime.yaml`,
apply the ConfigMap and wait for the API pod volume to refresh (up to ~60s),
then reconcile without restarting `actcore-worker`:

```bash
export KUBECONFIG=~/.kube/config-hosteurope
kubectl apply -f k8s/railiance/20-runtime.yaml
sleep 60
kubectl -n activity-core exec deploy/actcore-api -- \
  python3 -c 'import urllib.request; req=urllib.request.Request("http://localhost:8010/admin/sync?definitions=true&schedules=true", method="POST"); print(urllib.request.urlopen(req).read().decode())'
```

Automated regression for the disabled `ops-service-inventory-probes`
projection (enable/cadence flip, idempotent repeat sync, rollback) lives in
`scripts/smoke_admin_sync_no_restart.py`.
|
||||
|
||||
If the API is unavailable, the schedule-only CLI remains available:
|
||||
|
||||
```bash
|
||||
TEMPORAL_HOST=localhost:7233 \
|
||||
@@ -126,7 +248,7 @@ ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
|
||||
|
||||
This reconciles all Temporal Schedules with the `activity_definitions` table:
|
||||
- Upserts schedules for every enabled cron definition
|
||||
- Creates paused schedules for disabled cron definitions
|
||||
- Creates paused schedules for disabled cron or one-shot scheduled definitions
|
||||
- Deletes orphaned schedules with no matching DB row
|
||||
|
||||
After adding or changing a recurring ActivityDefinition or workflow activity
|
||||
@@ -282,6 +404,52 @@ the same durable consumer name provides automatic failover.
|
||||
|
||||
---
|
||||
|
||||
## Run-miss recovery policies (cron triggers)

A cron fire is **missed** when the worker or Temporal is unavailable at trigger
time. `trigger_config.misfire_policy` selects what happens when the system
recovers. Each policy combines a Temporal **catchup window** (how far back missed
fires are recovered) with an **overlap policy** (what to do if a recovered fire
would start while a prior run is still executing):

| `misfire_policy` | Behaviour | Default catchup window | Overlap |
| --- | --- | --- | --- |
| `skip` | Run on trigger or skip; a missed fire is never recovered | 60s grace | `SKIP` |
| `catchup_all` | Recover **every** fire missed during the outage | 365 days | `BUFFER_ALL` |
| `catchup_latest` | Recover only the **most recent** missed fire; no backlog | 24h | `BUFFER_ONE` |

Set `trigger_config.catchup_window_seconds` to override the per-policy default
(e.g. an hourly definition using `catchup_latest` should set it to ~3600 so a
single missed hour is recovered but older ones are not).

Legacy values are still accepted: `catchup` → `catchup_all`,
`compress` → `catchup_latest`.
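
The table and override rule above can be sketched as a small resolver. This is a hedged illustration, not the code in `schedule_manager`: `resolve_misfire`, `ResolvedMisfire`, and the fallback to `skip` when no policy is set are assumptions, and the overlap values are plain strings rather than Temporal SDK enum members.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

# Legacy spellings mapped to their current names.
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}

# policy -> (default catchup window in seconds, overlap policy name)
POLICY_DEFAULTS = {
    "skip": (60, "SKIP"),
    "catchup_all": (365 * 24 * 3600, "BUFFER_ALL"),
    "catchup_latest": (24 * 3600, "BUFFER_ONE"),
}


@dataclass(frozen=True)
class ResolvedMisfire:
    policy: str
    catchup_window_seconds: int
    overlap: str


def resolve_misfire(trigger_config: dict[str, Any]) -> ResolvedMisfire:
    """Normalise the policy name and apply the optional window override."""
    raw = trigger_config.get("misfire_policy", "skip")  # assumed default
    policy = LEGACY_ALIASES.get(raw, raw)
    if policy not in POLICY_DEFAULTS:
        raise ValueError(f"unknown misfire_policy: {raw!r}")
    default_window, overlap = POLICY_DEFAULTS[policy]
    window = trigger_config.get("catchup_window_seconds", default_window)
    return ResolvedMisfire(policy, int(window), overlap)
```

For an hourly definition, `resolve_misfire({"misfire_policy": "catchup_latest", "catchup_window_seconds": 3600})` keeps the `BUFFER_ONE` overlap but narrows recovery to the last hour.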
|
||||
|
||||
> **Why this exists:** before ACTIVITY-WP-0014 no catchup window was set, so a
> brief outage at trigger time silently dropped the fire with no recovery and no
> log line. The `daily-statehub-wsjf-triage` definition now uses `catchup_latest`.
|
||||
|
||||
## State Hub write idempotency (ACTIVITY-WP-0014 T05)

Every State Hub write from activity-core (report-sink progress, ops-evidence
progress, schedule-miss alerts) carries a stable **`Idempotency-Key`** header
derived deterministically from the write's identity
(`run_id:instruction_id:event_type`, or `schedule_miss:activity_id:last_fired`
for miss alerts). This makes writes safe to **buffer and replay** under the
planned State Hub *beachhead* (per-machine read cache + write outbox): a flush,
possibly retried after an outage, cannot create duplicate progress/triage
events once State Hub / the beachhead honours the header.

The guarantee lives on the write, not on a live dedup read. The read-based
`_progress_exists` check is now best-effort only: if State Hub is unreachable it
returns `False` (proceed to the keyed write) rather than hard-failing. The header
passes untouched through the `actcore-state-hub-bridge` proxy and is ignored by
State Hub versions that do not yet honour it.
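
A deterministic key of this kind could be derived as in the sketch below. The identity strings follow the formats quoted above; hashing them with SHA-256 and the function names are assumptions made for illustration, since this runbook does not show whether activity-core sends the raw identity string or a digest.

```python
from __future__ import annotations

import hashlib


def _digest(identity: str) -> str:
    # Assumption: a SHA-256 hex digest keeps the header short and opaque.
    return hashlib.sha256(identity.encode("utf-8")).hexdigest()


def progress_idempotency_key(run_id: str, instruction_id: str, event_type: str) -> str:
    """Key for report-sink and ops-evidence progress writes."""
    return _digest(f"{run_id}:{instruction_id}:{event_type}")


def schedule_miss_idempotency_key(activity_id: str, last_fired: str) -> str:
    """Key for schedule-miss alerts."""
    return _digest(f"schedule_miss:{activity_id}:{last_fired}")


def keyed_headers(key: str) -> dict[str, str]:
    """Headers to attach to the State Hub write."""
    return {"Idempotency-Key": key}
```

Because the key depends only on the write's identity, a replayed flush produces the same header value as the original attempt, which is what lets the receiving side deduplicate.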
|
||||
|
||||
> The queue/cache itself is **not** built in activity-core; it belongs to the
> state-hub beachhead. activity-core only emits the key. See the proposal sent to
> the `state-hub` agent.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Worker fails to start: "ACTCORE_DB_URL is required"
|
||||
@@ -291,6 +459,9 @@ Set the environment variable before running the worker.
|
||||
1. Check Temporal UI → Schedules tab for the schedule status.
2. Ensure `enabled=True` on the ActivityDefinition (paused schedules don't fire).
3. Verify the cron expression with: `docker exec temporal-admin-tools temporal schedule describe --schedule-id activity-schedule-<uuid>`
4. If a fire was **missed entirely** (no run, no failure event) during an outage,
   check `misfire_policy`. Under `skip`, missed fires are dropped by design. Use
   `catchup_all` or `catchup_latest` to recover them. See *Run-miss recovery policies*.
|
||||
|
||||
### Event not routing
|
||||
1. Check NATS monitoring: http://localhost:8222/jsz to verify the `ACTIVITY_EVENTS` stream exists.
|
||||
|
||||
@@ -47,7 +47,10 @@ data:
|
||||
type: cron
|
||||
cron_expression: "20 7 * * *"
|
||||
timezone: Europe/Berlin
|
||||
misfire_policy: skip
|
||||
# ACTIVITY-WP-0014: recover the most recent missed daily fire when the
|
||||
# worker/Temporal was unavailable at trigger time, without accumulating a
|
||||
# backlog after a multi-day outage.
|
||||
misfire_policy: catchup_latest
|
||||
context_sources:
|
||||
- type: static
|
||||
bind_to: context.prompt_path
|
||||
@@ -92,7 +95,8 @@ data:
|
||||
(strategic_value + time_criticality + risk_reduction +
|
||||
opportunity_enablement) / job_size. Use integer factor values from 1 to 5,
|
||||
round score to one decimal place, sort recommendations by rank, and return at
|
||||
most 10 recommendations.
|
||||
most 7 recommendations. If uncertain, emit fewer well-formed
|
||||
recommendations rather than more.
|
||||
|
||||
Curated digest:
|
||||
{context.daily_triage_digest}
|
||||
@@ -164,6 +168,36 @@ data:
|
||||
|
||||
Kubernetes projection of the Custodian-owned definition in
|
||||
`/home/worsch/the-custodian/activity-definitions/hourly-recently-on-scope.md`.
|
||||
state-hub-consistency-sweep.md: |
|
||||
---
|
||||
id: "7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b"
|
||||
name: "State Hub Consistency Sweep"
|
||||
type: activity-definition
|
||||
version: "1.0"
|
||||
enabled: true
|
||||
owner: custodian
|
||||
governance: custodian
|
||||
status: active
|
||||
created: "2026-06-21"
|
||||
trigger:
|
||||
type: cron
|
||||
cron_expression: "*/15 * * * *"
|
||||
timezone: UTC
|
||||
misfire_policy: skip
|
||||
context_sources:
|
||||
- type: state-hub
|
||||
query: consistency_sweep_remote_all
|
||||
required: true
|
||||
params:
|
||||
max_seconds: 300
|
||||
source: activity-core
|
||||
bind_to: context.consistency_sweep_remote_all
|
||||
---
|
||||
|
||||
# ActivityDefinition: State Hub Consistency Sweep
|
||||
|
||||
Kubernetes projection of the Custodian-owned definition in
|
||||
`/home/worsch/the-custodian/activity-definitions/state-hub-consistency-sweep.md`.
|
||||
ops-service-inventory-probes.md: |
|
||||
---
|
||||
id: "40d15a87-7ff6-4d8e-992c-37df15f95110"
|
||||
@@ -399,7 +433,7 @@ data:
|
||||
"recommendations": {
|
||||
"type": "array",
|
||||
"minItems": 1,
|
||||
"maxItems": 10,
|
||||
"maxItems": 7,
|
||||
"items": {
|
||||
"type": "object",
|
||||
"required": ["rank", "candidate", "action", "why", "confidence", "wsjf"],
|
||||
@@ -408,7 +442,7 @@ data:
|
||||
"rank": {
|
||||
"type": "integer",
|
||||
"minimum": 1,
|
||||
"maximum": 10
|
||||
"maximum": 7
|
||||
},
|
||||
"candidate": {
|
||||
"type": "string"
|
||||
@@ -578,7 +612,8 @@ spec:
|
||||
method=self.command,
|
||||
)
|
||||
try:
|
||||
with urlopen(request, timeout=30) as response:
|
||||
timeout = 360 if self.command == "POST" else 30
|
||||
with urlopen(request, timeout=timeout) as response:
|
||||
payload = response.read()
|
||||
self.send_response(response.status)
|
||||
for key, value in response.headers.items():
|
||||
@@ -599,7 +634,7 @@ spec:
|
||||
ThreadingHTTPServer(("0.0.0.0", 18080), Proxy).serve_forever()
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /state/summary
|
||||
path: /state/health
|
||||
port: http
|
||||
initialDelaySeconds: 5
|
||||
periodSeconds: 10
|
||||
|
||||
@@ -1,4 +1,5 @@
|
||||
{
|
||||
"$comment": "ACTIVITY-WP-0016-T02. Strict, bounded contract for the daily WSJF triage report. The per-item 'recommendations' schema is intentionally strict on STRUCTURE (types + required keys) so the T03 boundary parser can validate each recommendation independently and quarantine only the malformed ones. 'maxItems' is a producer hint (honoured by llm-connect constrained decoding and by the prompt); it is deliberately NOT hard-enforced by the in-repo validator, because rejecting a whole report for having too many items would reproduce the monolithic-failure bug WP-0016 exists to remove. Over-count is mitigated in T03 (keep top-N by rank, quarantine the rest). Value-domain vocabularies (action/confidence) are documented in the prompt and enforced by T04 guardrails with mitigation, not as brittle hard-fail enums here.",
|
||||
"type": "object",
|
||||
"required": ["summary", "recommendations"],
|
||||
"properties": {
|
||||
@@ -7,8 +8,28 @@
|
||||
},
|
||||
"recommendations": {
|
||||
"type": "array",
|
||||
"maxItems": 7,
|
||||
"items": {
|
||||
"type": "object"
|
||||
"type": "object",
|
||||
"required": ["rank", "candidate", "action", "why"],
|
||||
"properties": {
|
||||
"rank": { "type": "integer" },
|
||||
"candidate": { "type": "string" },
|
||||
"action": { "type": "string" },
|
||||
"why": { "type": "string" },
|
||||
"confidence": { "type": "string" },
|
||||
"wsjf": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"score": { "type": "number" },
|
||||
"strategic_value": { "type": "number" },
|
||||
"time_criticality": { "type": "number" },
|
||||
"risk_reduction": { "type": "number" },
|
||||
"opportunity_enablement": { "type": "number" },
|
||||
"job_size": { "type": "number" }
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
8
scripts/automation_inventory.py
Normal file
@@ -0,0 +1,8 @@
|
||||
#!/usr/bin/env python3
"""CLI wrapper for the repo-native automation inventory report."""

from activity_core.automation_status import inventory_main


if __name__ == "__main__":
    raise SystemExit(inventory_main())
|
||||
8
scripts/automation_status.py
Normal file
@@ -0,0 +1,8 @@
|
||||
#!/usr/bin/env python3
"""CLI wrapper for the repo-native automation status report."""

from activity_core.automation_status import main


if __name__ == "__main__":
    raise SystemExit(main())
|
||||
212
scripts/smoke_admin_sync_no_restart.py
Executable file
@@ -0,0 +1,212 @@
|
||||
#!/usr/bin/env python3
"""Railiance01 no-restart smoke for POST /admin/sync.

Patches the disabled ops-service-inventory-probes projection in the cluster
ConfigMap, waits for the API pod volume to refresh, runs /admin/sync twice,
verifies DB + Temporal schedule drift without restarting actcore-worker, then
rolls the ConfigMap back to the disabled baseline.

Requires:
  - KUBECONFIG pointing at railiance01 (for example ~/.kube/config-hosteurope)
  - kubectl access to the activity-core namespace

Example:
  export KUBECONFIG=~/.kube/config-hosteurope
  python3 scripts/smoke_admin_sync_no_restart.py
"""

from __future__ import annotations

import json
import subprocess
import sys
import time

ACTIVITY_ID = "40d15a87-7ff6-4d8e-992c-37df15f95110"
CONFIGMAP = "actcore-external-activity-definitions"
DEFINITION_KEY = "ops-service-inventory-probes.md"
MOUNTED_FILE = (
    "/etc/activity-core/external-definitions/activity-definitions/"
    f"{DEFINITION_KEY}"
)
VOLUME_PROPAGATION_SECONDS = 65
|
||||
|
||||
|
||||
def kubectl(*args: str, input_text: str | None = None) -> str:
    """Run kubectl in the activity-core namespace and return its stdout."""
    cmd = ["kubectl", "-n", "activity-core", *args]
    return subprocess.check_output(
        cmd,
        input=input_text,
        text=True,
    )


def api_json(path: str, *, method: str = "GET") -> dict:
    """Call the API from inside the actcore-api pod and decode the JSON body."""
    script = (
        "import urllib.request, json\n"
        f'req = urllib.request.Request("http://localhost:8010{path}", method="{method}")\n'
        "print(urllib.request.urlopen(req).read().decode())"
    )
    return json.loads(kubectl("exec", "deploy/actcore-api", "--", "python3", "-c", script))


def worker_lines(script: str) -> list[str]:
    """Run a Python snippet inside the worker pod and return its stdout lines."""
    return kubectl("exec", "deploy/actcore-worker", "--", "python3", "-c", script).splitlines()


def worker_uid() -> str:
    """Return the worker pod UID; a changed UID means the pod was replaced."""
    return kubectl(
        "get",
        "pod",
        "-l",
        "app.kubernetes.io/name=actcore-worker",
        "-o",
        "jsonpath={.items[0].metadata.uid}",
    ).strip()
|
||||
|
||||
|
||||
def load_configmap() -> dict:
    return json.loads(kubectl("get", "configmap", CONFIGMAP, "-o", "json"))


def apply_configmap(cm: dict) -> None:
    kubectl("apply", "-f", "-", input_text=json.dumps(cm))


def patch_definition(cm: dict, *, enabled: bool, cron: str) -> None:
    """Set `enabled` and the cron expression in the projection, then apply it."""
    text = cm["data"][DEFINITION_KEY]
    # Fail early if the projection has no `enabled:` field to patch.
    for line in text.splitlines():
        if line.strip().startswith("enabled:"):
            break
    else:
        raise RuntimeError("enabled field not found in projection")

    # Only one of each pair is present; the other replacement is a no-op.
    text = _replace_once(text, 'enabled: false', f"enabled: {'true' if enabled else 'false'}")
    text = _replace_once(text, 'enabled: true', f"enabled: {'true' if enabled else 'false'}")
    text = _replace_once(
        text,
        'cron_expression: "15 * * * *"',
        f'cron_expression: "{cron}"',
    )
    text = _replace_once(
        text,
        'cron_expression: "25 * * * *"',
        f'cron_expression: "{cron}"',
    )
    cm["data"][DEFINITION_KEY] = text
    apply_configmap(cm)


def _replace_once(text: str, old: str, new: str) -> str:
    """Replace the first occurrence of `old`; return `text` unchanged if absent."""
    if old not in text:
        return text
    return text.replace(old, new, 1)
|
||||
|
||||
|
||||
def wait_for_mount(*, enabled: bool, cron: str) -> None:
    """Poll the mounted file in the API pod until it shows the patched values."""
    deadline = time.time() + VOLUME_PROPAGATION_SECONDS
    want_enabled = "enabled: true" if enabled else "enabled: false"
    want_cron = f'cron_expression: "{cron}"'
    while time.time() < deadline:
        content = kubectl("exec", "deploy/actcore-api", "--", "cat", MOUNTED_FILE)
        if want_enabled in content and want_cron in content:
            return
        time.sleep(5)
    raise RuntimeError(
        f"ConfigMap projection did not refresh within {VOLUME_PROPAGATION_SECONDS}s"
    )


def get_definition() -> dict[str, object]:
    """Return the enabled flag and cron expression as stored by the API."""
    for item in api_json("/activity-definitions/"):
        if item["id"] == ACTIVITY_ID:
            return {
                "enabled": item["enabled"],
                "cron": item["trigger_config"]["cron_expression"],
            }
    raise RuntimeError(f"ActivityDefinition {ACTIVITY_ID} not found")


def describe_schedule() -> dict[str, object]:
    # Executed inside the worker pod; prints the paused flag and the cron
    # minute, one value per line.
    script = f"""
import asyncio
from temporalio.client import Client

async def main() -> None:
    client = await Client.connect("actcore-temporal:7233")
    handle = client.get_schedule_handle("activity-schedule-{ACTIVITY_ID}")
    described = await handle.describe()
    schedule = described.schedule
    minute = schedule.spec.calendars[0].minute[0].start if schedule.spec.calendars else None
    print(schedule.state.paused)
    print(minute)

asyncio.run(main())
"""
    paused, minute = worker_lines(script)
    return {"paused": paused == "True", "minute": int(minute)}
|
||||
|
||||
|
||||
def main() -> int:
    worker_before = worker_uid()
    cm = load_configmap()

    print("1) enable + cadence change via ConfigMap")
    patch_definition(cm, enabled=True, cron="25 * * * *")
    wait_for_mount(enabled=True, cron="25 * * * *")

    print("2) POST /admin/sync (first pass)")
    sync1 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
    if not sync1.get("ok"):
        print(json.dumps(sync1, indent=2), file=sys.stderr)
        return 1

    defn = get_definition()
    schedule = describe_schedule()
    print("   definition:", defn)
    print("   schedule:", schedule)
    if defn != {"enabled": True, "cron": "25 * * * *"}:
        print("definition drift after sync", file=sys.stderr)
        return 1
    if schedule["paused"] or schedule["minute"] != 25:
        print("schedule drift after enable sync", file=sys.stderr)
        return 1

    print("3) POST /admin/sync (idempotent repeat)")
    sync2 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
    if sync2.get("schedules") != sync1.get("schedules"):
        print("idempotent schedule counts changed", file=sys.stderr)
        print(json.dumps({"sync1": sync1, "sync2": sync2}, indent=2), file=sys.stderr)
        return 1

    print("4) rollback ConfigMap + sync")
    cm = load_configmap()
    patch_definition(cm, enabled=False, cron="15 * * * *")
    wait_for_mount(enabled=False, cron="15 * * * *")
    sync3 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
    if not sync3.get("ok"):
        print(json.dumps(sync3, indent=2), file=sys.stderr)
        return 1

    defn = get_definition()
    schedule = describe_schedule()
    print("   definition:", defn)
    print("   schedule:", schedule)
    if defn != {"enabled": False, "cron": "15 * * * *"}:
        print("rollback definition drift", file=sys.stderr)
        return 1
    if not schedule["paused"] or schedule["minute"] != 15:
        print("rollback schedule drift", file=sys.stderr)
        return 1

    worker_after = worker_uid()
    if worker_before != worker_after:
        print("actcore-worker pod restarted during smoke", file=sys.stderr)
        return 1

    print("smoke passed: admin sync hot-reload without worker restart")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
|
||||
@@ -149,6 +149,8 @@ async def resolve_context(
|
||||
query = source.get("query", "")
|
||||
params = source.get("params") or {}
|
||||
required = bool(source.get("required") or params.get("required", False))
|
||||
resolver_params = dict(params)
|
||||
resolver_params["required"] = required
|
||||
raw_bind = source.get("bind_to") or source.get("name") or source_type
|
||||
# Strip the 'context.' namespace prefix so evaluator can find the key.
|
||||
bind_key = raw_bind.removeprefix("context.") if raw_bind.startswith("context.") else raw_bind
|
||||
@@ -172,7 +174,7 @@ async def resolve_context(
|
||||
continue
|
||||
|
||||
try:
|
||||
resolved = resolver_cls().resolve(query, event_envelope, params)
|
||||
resolved = resolver_cls().resolve(query, event_envelope, resolver_params)
|
||||
snapshot[bind_key] = _bind_resolver_result(bind_key, resolved)
|
||||
except Exception as exc:
|
||||
if required:
|
||||
@@ -364,6 +366,7 @@ async def evaluate_instructions(payload: dict) -> dict:
|
||||
"output_validated": result.output_validated,
|
||||
"review_required": result.review_required,
|
||||
"validation_error": result.validation_error,
|
||||
"llm_response_metadata": result.llm_response_metadata,
|
||||
})
|
||||
for spec in result.tasks:
|
||||
task_specs.append({
|
||||
|
||||
@@ -40,6 +40,7 @@ from temporalio.client import Client
|
||||
from activity_core.models import ActivityDefinition, CronTriggerConfig
|
||||
from activity_core.orm import ActivityDefinition as ActivityDefinitionRow, EventType as EventTypeRow
|
||||
from activity_core.schedule_manager import delete_schedule, upsert_schedule
|
||||
from activity_core.sync_service import run_sync
|
||||
from activity_core.webhook_receiver import router as webhook_router
|
||||
|
||||
TEMPORAL_HOST = os.environ.get("TEMPORAL_HOST", "localhost:7233")
|
||||
@@ -275,6 +276,24 @@ async def trigger_definition(definition_id: uuid.UUID) -> dict[str, str]:
|
||||
return {"workflow_id": handle.id, "trigger_key": trigger_key}
|
||||
|
||||
|
||||
# --- Admin sync ---------------------------------------------------------------

@app.post("/admin/sync")
async def admin_sync(
    definitions: bool = True,
    schedules: bool = True,
    event_types: bool = False,
) -> dict[str, Any]:
    """Run operator-triggered definition/event/schedule sync without restart."""
    return await run_sync(
        session_factory=_get_db(),
        temporal_client=_get_temporal() if schedules else None,
        definitions=definitions,
        schedules=schedules,
        event_types=event_types,
    )
|
||||
|
||||
|
||||
# T42: Curator gate — event type approval endpoint
|
||||
|
||||
@app.get("/health")
|
||||
|
||||
1107
src/activity_core/automation_status.py
Normal file
File diff suppressed because it is too large
Load Diff
@@ -4,4 +4,5 @@ from activity_core.context_resolvers import ( # noqa: F401
|
||||
ops_inventory,
|
||||
repo_scoping,
|
||||
state_hub,
|
||||
reuse_surface,
|
||||
)
|
||||
|
||||
516
src/activity_core/context_resolvers/reuse_surface.py
Normal file
@@ -0,0 +1,516 @@
+"""Reuse-surface registry hygiene context adapter.
+
+Registered as source type ``reuse-surface`` and as the ``shell`` resolver
+dispatcher for the ``reuse_surface_report_gaps`` query. Other shell queries
+continue to delegate to the kaizen resolver for backward compatibility.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import os
+import socket
+import subprocess
+from dataclasses import dataclass
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+import httpx
+import yaml
+
+from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY, ContextResolver
+from activity_core.context_resolvers.kaizen import KaizenContextResolver
+from activity_core.context_resolvers.state_hub import StateHubContextResolver
+
+logger = logging.getLogger(__name__)
+
+_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
+_REPORT_TIMEOUT_SECONDS = 60
+_STATE_HUB_TIMEOUT_SECONDS = 10.0
+_KNOWN_SIGNALS = frozenset(
+    {
+        "registry_gap",
+        "empty_capability_scaffold",
+        "stale_scope",
+        "stale_sbom",
+        "publish_check_fail",
+    }
+)
+
+
+@dataclass(frozen=True)
+class RosterEntry:
+    slug: str
+    domain: str | None = None
+    publish_check: str | None = None
+
+
+def _base_url() -> str:
+    return os.environ.get("STATE_HUB_URL", _DEFAULT_STATE_HUB_URL).rstrip("/")
+
+
+def _runner_host(params: dict[str, Any]) -> str:
+    return str(
+        params.get("runner_host")
+        or os.environ.get("KAIZEN_RUNNER_HOST")
+        or socket.gethostname()
+    )
+
+
+def _as_required(params: dict[str, Any]) -> bool:
+    return bool(params.get("required", False))
+
+
+def reuse_surface_report_gaps(params: dict[str, Any]) -> dict[str, Any]:
+    """Resolve registry-hygiene gaps for the next rollout batch.
+
+    Missing operational dependencies are visible failures for required sources
+    and graceful empty lists for optional sources so definitions can opt into
+    either behavior without changing rule logic.
+    """
+    try:
+        return _resolve_reuse_surface_report_gaps(params)
+    except Exception as exc:
+        if _as_required(params):
+            raise
+        logger.warning("reuse_surface_report_gaps unavailable: %s", exc)
+        return {"gaps": []}
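The required/optional contract described in the docstring above can be illustrated in isolation. The sketch below is not part of this diff: `resolve_or_empty` and `failing_resolver` are hypothetical names, and the failing resolver stands in for any missing operational dependency, such as an unreachable State Hub.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def resolve_or_empty(
    resolver: Callable[[dict[str, Any]], dict[str, Any]],
    params: dict[str, Any],
) -> dict[str, Any]:
    """Re-raise for required sources; degrade to an empty gap list otherwise."""
    try:
        return resolver(params)
    except Exception as exc:
        if bool(params.get("required", False)):
            raise
        logger.warning("resolver unavailable: %s", exc)
        return {"gaps": []}


def failing_resolver(params: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical stand-in for a dependency that is down.
    raise RuntimeError("State Hub unreachable")


# Optional source: the failure is logged and an empty result is returned.
optional_result = resolve_or_empty(failing_resolver, {"required": False})
```

With `{"required": True}` the same call would propagate the `RuntimeError`, which is what lets a definition choose between the two behaviors purely through its params.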
+
+
+def _resolve_reuse_surface_report_gaps(params: dict[str, Any]) -> dict[str, Any]:
+    roster_path = _roster_path(params)
+    entries = _load_active_roster_entries(roster_path)
+    if not entries:
+        return {"gaps": []}
+
+    state_path = _round_robin_state_path(params, roster_path)
+    selected, next_cursor = _select_round_robin_batch(
+        entries,
+        _batch_size(params),
+        state_path,
+    )
+    if not selected:
+        return {"gaps": []}
+
+    signals = _enabled_signals(_signals_path(params, roster_path))
+    roots = _resolve_repo_roots(selected, _runner_host(params))
+    report = _reuse_surface_report(params, signals)
+    gaps = _gap_records(selected, roots, signals, report)
+
+    _write_round_robin_state(state_path, next_cursor, selected)
+    return {"gaps": gaps}
+
+
+def _roster_path(params: dict[str, Any]) -> Path:
+    raw = params.get("roster")
+    if not raw:
+        raise ValueError("reuse_surface_report_gaps requires params.roster")
+    path = Path(str(raw)).expanduser()
+    if not path.is_file():
+        raise FileNotFoundError(f"reuse_surface_report_gaps roster not found: {path}")
+    return path
+
+
+def _batch_size(params: dict[str, Any]) -> int:
+    try:
+        return max(1, int(params.get("batch_size", 3)))
+    except (TypeError, ValueError):
+        return 3
+
+
+def _round_robin_state_path(params: dict[str, Any], roster_path: Path) -> Path:
+    raw = params.get("round_robin_state")
+    if raw:
+        return Path(str(raw)).expanduser()
+    return roster_path.with_name("round-robin-state.json")
+
+
+def _signals_path(params: dict[str, Any], roster_path: Path) -> Path:
+    raw = params.get("signals")
+    if raw:
+        return Path(str(raw)).expanduser()
+    return roster_path.with_name("signals.yml")
+
+
+def _load_active_roster_entries(path: Path) -> list[RosterEntry]:
+    data = yaml.safe_load(path.read_text(encoding="utf-8"))
+    if not isinstance(data, dict):
+        raise ValueError(f"reuse_surface rollout roster is not a mapping: {path}")
+
+    entries: dict[str, RosterEntry] = {}
+    for domain, block in _iter_domain_blocks(data):
+        if _domain_phase(block) != "active":
+            continue
+        for item in _repo_items(block):
+            entry = _entry_from_item(item, domain, block)
+            if entry and entry.slug not in entries:
+                entries[entry.slug] = entry
+    return list(entries.values())
+
+
+def _iter_domain_blocks(data: dict[str, Any]) -> list[tuple[str | None, dict[str, Any]]]:
+    domains = data.get("domains")
+    if isinstance(domains, dict):
+        return [
+            (str(name), block)
+            for name, block in domains.items()
+            if isinstance(block, dict)
+        ]
+    if isinstance(domains, list):
+        return [
+            (str(block.get("name") or block.get("domain") or ""), block)
+            for block in domains
+            if isinstance(block, dict)
+        ]
+    if isinstance(data.get("active"), list):
+        return [(None, {"phase": "active", "repos": data["active"]})]
+    return [
+        (str(name), block)
+        for name, block in data.items()
+        if isinstance(block, dict) and ("phase" in block or "repos" in block)
+    ]
+
+
+def _domain_phase(block: dict[str, Any]) -> str:
+    return str(block.get("phase") or block.get("status") or "").lower()
+
+
+def _repo_items(block: dict[str, Any]) -> list[Any]:
+    repos = (
+        block.get("repos")
+        or block.get("repo_slugs")
+        or block.get("repositories")
+        or block.get("slugs")
+        or []
+    )
+    if isinstance(repos, dict):
+        items: list[Any] = []
+        for slug, config in repos.items():
+            if isinstance(config, dict):
+                item = dict(config)
+                item.setdefault("slug", slug)
+                items.append(item)
+            else:
+                items.append(str(slug))
+        return items
+    if isinstance(repos, list):
+        return repos
+    return []
+
+
+def _entry_from_item(
+    item: Any,
+    domain: str | None,
+    block: dict[str, Any],
+) -> RosterEntry | None:
+    publish_check = block.get("publish_check")
+    if isinstance(item, str):
+        slug = item
+    elif isinstance(item, dict):
+        slug = item.get("slug") or item.get("repo") or item.get("name")
+        publish_check = item.get("publish_check", publish_check)
+    else:
+        return None
+    if not slug:
+        return None
+    return RosterEntry(
+        slug=str(slug),
+        domain=domain or None,
+        publish_check=str(publish_check).lower() if publish_check is not None else None,
+    )
+
+
+def _select_round_robin_batch(
+    entries: list[RosterEntry],
+    batch_size: int,
+    state_path: Path,
+) -> tuple[list[RosterEntry], int]:
+    if not entries:
+        return [], 0
+    cursor = _read_round_robin_cursor(state_path) % len(entries)
+    size = min(batch_size, len(entries))
+    selected = [entries[(cursor + offset) % len(entries)] for offset in range(size)]
+    next_cursor = (cursor + size) % len(entries)
+    return selected, next_cursor
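The cursor arithmetic in `_select_round_robin_batch` can be tried without a state file. The following is a standalone sketch rather than the module's code: `select_batch` is a hypothetical name, the cursor is passed in directly instead of being read from `round-robin-state.json`, and the slugs are made up.

```python
def select_batch(
    entries: list[str],
    batch_size: int,
    cursor: int,
) -> tuple[list[str], int]:
    """Return the next batch and the cursor to persist for the following run."""
    if not entries:
        return [], 0
    cursor %= len(entries)  # tolerate a stale cursor after the roster shrinks
    size = min(batch_size, len(entries))
    selected = [entries[(cursor + offset) % len(entries)] for offset in range(size)]
    return selected, (cursor + size) % len(entries)


slugs = ["repo-a", "repo-b", "repo-c", "repo-d"]
first, cursor = select_batch(slugs, 3, 0)
second, cursor = select_batch(slugs, 3, cursor)
```

Under these assumptions the first run covers `repo-a` through `repo-c`, and the second run wraps around to start at `repo-d`, so every entry is visited before any is repeated a second time.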
+
+
+def _read_round_robin_cursor(path: Path) -> int:
+    if not path.is_file():
+        return 0
+    try:
+        data = json.loads(path.read_text(encoding="utf-8"))
+    except (OSError, json.JSONDecodeError):
+        return 0
+    if not isinstance(data, dict):
+        return 0
+    try:
+        return int(data.get("cursor", 0))
+    except (TypeError, ValueError):
+        return 0
+
+
+def _write_round_robin_state(
+    path: Path,
+    cursor: int,
+    selected: list[RosterEntry],
+) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    payload = {
+        "cursor": cursor,
+        "last_batch": [entry.slug for entry in selected],
+        "updated_at": datetime.now(timezone.utc).isoformat(),
+    }
+    path.write_text(
+        json.dumps(payload, indent=2, sort_keys=True) + "\n",
+        encoding="utf-8",
+    )
+
+
+def _enabled_signals(path: Path) -> set[str]:
+    if not path.is_file():
+        return set(_KNOWN_SIGNALS)
+    data = yaml.safe_load(path.read_text(encoding="utf-8"))
+    node = data.get("signals") if isinstance(data, dict) else data
+    enabled: set[str] = set()
+    saw_known_signal = False
+
+    if isinstance(node, dict):
+        for name, config in node.items():
+            if str(name) not in _KNOWN_SIGNALS:
+                continue
+            saw_known_signal = True
+            if isinstance(config, dict) and config.get("enabled") is False:
+                continue
+            if config is False:
+                continue
+            enabled.add(str(name))
+    elif isinstance(node, list):
+        for item in node:
+            if isinstance(item, str) and item in _KNOWN_SIGNALS:
+                saw_known_signal = True
+                enabled.add(item)
+            elif isinstance(item, dict):
+                name = item.get("id") or item.get("signal") or item.get("name")
+                if str(name) in _KNOWN_SIGNALS and item.get("enabled", True) is not False:
+                    saw_known_signal = True
+                    enabled.add(str(name))
+
+    return enabled if saw_known_signal else set(_KNOWN_SIGNALS)
+
+
+def _resolve_repo_roots(
+    entries: list[RosterEntry],
+    runner_host: str,
+) -> dict[str, Path]:
+    requested = {entry.slug for entry in entries}
+    roots: dict[str, Path] = {}
+    for repo in _fetch_repos():
+        slug = str(repo.get("slug") or "")
+        if slug not in requested:
+            continue
+        raw = _repo_path_for_host(repo, runner_host)
+        if raw:
+            roots[slug] = Path(raw)
+    return roots
+
+
+def _fetch_repos() -> list[dict[str, Any]]:
+    url = f"{_base_url()}/repos/"
+    try:
+        resp = httpx.get(url, timeout=_STATE_HUB_TIMEOUT_SECONDS)
+        resp.raise_for_status()
+    except httpx.HTTPError as exc:
+        raise RuntimeError(f"State Hub unreachable at {url}: {exc}") from exc
+    payload = resp.json()
+    if not isinstance(payload, list):
+        raise RuntimeError(f"State Hub /repos/ returned non-list: {type(payload)!r}")
+    return [repo for repo in payload if isinstance(repo, dict)]
+
+
+def _repo_path_for_host(repo: dict[str, Any], runner_host: str) -> str | None:
+    host_paths = repo.get("host_paths") or {}
+    raw = None
+    if isinstance(host_paths, dict):
+        raw = host_paths.get(runner_host)
+    raw = raw or repo.get("local_path")
+    if not raw or raw == "(unknown)":
+        return None
+    return str(raw)
+
+
+def _reuse_surface_report(params: dict[str, Any], signals: set[str]) -> dict[str, Any]:
+    if not (signals & {"registry_gap", "empty_capability_scaffold"}):
+        return {}
+    binary = str(params.get("reuse_surface_bin") or "reuse-surface")
+    try:
+        completed = subprocess.run(
+            [binary, "report", "gaps", "--format", "json"],
+            capture_output=True,
+            check=False,
+            text=True,
+            timeout=_REPORT_TIMEOUT_SECONDS,
+        )
+    except FileNotFoundError as exc:
+        raise RuntimeError(f"reuse-surface CLI not found: {binary}") from exc
+    except subprocess.TimeoutExpired as exc:
+        raise RuntimeError("reuse-surface report gaps timed out") from exc
+
+    if completed.returncode != 0:
+        detail = completed.stderr.strip() or completed.stdout.strip()
+        raise RuntimeError(f"reuse-surface report gaps failed: {detail}")
+    try:
+        payload = json.loads(completed.stdout or "{}")
+    except json.JSONDecodeError as exc:
+        raise RuntimeError("reuse-surface report gaps returned invalid JSON") from exc
+    if not isinstance(payload, dict):
+        raise RuntimeError("reuse-surface report gaps returned non-object JSON")
+    return payload
+
+
+def _gap_records(
+    entries: list[RosterEntry],
+    roots: dict[str, Path],
+    signals: set[str],
+    report: dict[str, Any],
+) -> list[dict[str, Any]]:
+    empty_scaffolds = _repo_set(report, {"empty_scaffolds", "empty_scaffold"})
+    publish_fail = _repo_set(
+        report,
+        {"publish_fail", "publish_fails", "publish_failures"},
+    )
+    gaps: list[dict[str, Any]] = []
+    seen: set[tuple[str, str]] = set()
+
+    for entry in entries:
+        root = roots.get(entry.slug)
+        if root is None:
+            logger.info("reuse_surface repo_unreachable slug=%s", entry.slug)
+            continue
+
+        if (
+            signals & {"registry_gap", "empty_capability_scaffold"}
+            and entry.slug in empty_scaffolds
+        ):
+            _append_gap(gaps, seen, entry.slug, root, "empty_capability_scaffold")
+
+        if "registry_gap" in signals and entry.slug in publish_fail:
+            _append_gap(gaps, seen, entry.slug, root, "registry_gap")
+
+        if "publish_check_fail" in signals and entry.publish_check == "fail":
+            _append_gap(gaps, seen, entry.slug, root, "publish_check_fail")
+
+        if "stale_scope" in signals and _scope_is_stale(root):
+            _append_gap(gaps, seen, entry.slug, root, "stale_scope")
+
+        if "stale_sbom" in signals and _sbom_is_stale(entry.slug):
+            _append_gap(gaps, seen, entry.slug, root, "stale_sbom")
+
+    return gaps
+
+
+def _append_gap(
+    gaps: list[dict[str, Any]],
+    seen: set[tuple[str, str]],
+    slug: str,
+    root: Path,
+    signal: str,
+) -> None:
+    key = (slug, signal)
+    if key in seen:
+        return
+    seen.add(key)
+    gaps.append(
+        {
+            "repo": slug,
+            "root": str(root),
+            "signal": signal,
+            "hygiene_signal": signal,
+        }
+    )
+
+
+def _scope_is_stale(root: Path) -> bool:
+    scope = root / "SCOPE.md"
+    if not scope.is_file():
+        return True
+    age_seconds = datetime.now(timezone.utc).timestamp() - scope.stat().st_mtime
+    return age_seconds > 90 * 24 * 60 * 60
+
+
+def _sbom_is_stale(slug: str) -> bool:
+    payload = StateHubContextResolver().resolve(
+        "repo_sbom_status",
+        None,
+        {"repo_slug": slug},
+    )
+    if not isinstance(payload, dict):
+        return False
+    try:
+        return int(payload.get("sbom_age_days", 0)) > 30
+    except (TypeError, ValueError):
+        return False
+
+
+def _repo_set(report: dict[str, Any], keys: set[str]) -> set[str]:
+    slugs: set[str] = set()
+    for value in _values_for_keys(report, keys):
+        slugs.update(_slugs_from_value(value))
+    return slugs
+
+
+def _values_for_keys(value: Any, keys: set[str]) -> list[Any]:
+    values: list[Any] = []
+    if isinstance(value, dict):
+        for key, nested in value.items():
+            if key in keys:
+                values.append(nested)
+            values.extend(_values_for_keys(nested, keys))
+    elif isinstance(value, list):
+        for item in value:
+            values.extend(_values_for_keys(item, keys))
+    return values
+
+
+def _slugs_from_value(value: Any) -> set[str]:
+    if isinstance(value, str):
+        return {value}
+    if isinstance(value, list):
+        slugs: set[str] = set()
+        for item in value:
+            slugs.update(_slugs_from_value(item))
+        return slugs
+    if isinstance(value, dict):
+        for key in ("repo", "repo_slug", "slug", "name"):
+            if value.get(key):
+                return {str(value[key])}
+        slugs: set[str] = set()
+        for key, nested in value.items():
+            if nested is True or isinstance(nested, (dict, list)):
+                slugs.add(str(key))
+                slugs.update(_slugs_from_value(nested))
+        return slugs
+    return set()
+
+
+class ReuseSurfaceContextResolver(ContextResolver):
+    """Resolves reuse-surface registry hygiene gap reports."""
+
+    def resolve(self, query: str, event: Any, params: dict[str, Any]) -> dict[str, Any]:
+        if query == "reuse_surface_report_gaps":
+            return reuse_surface_report_gaps(params)
+        return {}
+
+
+class ShellContextResolver(ContextResolver):
+    """Dispatch shell-backed context queries without breaking kaizen aliases."""
+
+    def resolve(self, query: str, event: Any, params: dict[str, Any]) -> dict[str, Any]:
+        if query == "reuse_surface_report_gaps":
+            return reuse_surface_report_gaps(params)
+        return KaizenContextResolver().resolve(query, event, params)
+
+
+CONTEXT_RESOLVER_REGISTRY["reuse-surface"] = ReuseSurfaceContextResolver
+CONTEXT_RESOLVER_REGISTRY["shell"] = ShellContextResolver
@@ -12,6 +12,7 @@ Supported queries:
   - coding_retro: latest /progress/ item with event_type=coding_retro
   - daily_triage_digest: curated scalar JSON digest for daily WSJF triage
   - recently_on_scope_hourly: POST {STATE_HUB_URL}/recently-on-scope/hourly
+  - consistency_sweep_remote_all: POST {STATE_HUB_URL}/consistency/sweep/remote-all

 No caching — state hub data is live operational state and must not be stale
 within a single workflow run.
@@ -31,6 +32,7 @@ from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY, Cont

 _DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
 _TIMEOUT_SECONDS = 10.0
+_SWEEP_TIMEOUT_SECONDS = 330.0
 _OPEN_WORKSTREAM_STATUSES = {"active", "ready", "blocked"}
 _OPEN_TASK_STATUSES = {"wait", "todo", "progress"}
 # Sentinel age for repos that have never had an SBOM ingested. Large enough
@@ -53,13 +55,26 @@ def _fetch_json(path: str, params: dict[str, Any] | None = None) -> Any:
         return {}


-def _post_json(path: str, payload: dict[str, Any]) -> Any:
+def _post_json(path: str, payload: dict[str, Any], *, timeout: float = _TIMEOUT_SECONDS) -> Any:
     url = f"{_base_url()}{path}"
-    resp = httpx.post(url, json=payload, timeout=_TIMEOUT_SECONDS)
+    resp = httpx.post(url, json=payload, timeout=timeout)
     resp.raise_for_status()
     return resp.json()


+def _validate_consistency_sweep_remote_all(result: Any) -> dict[str, Any]:
+    if not isinstance(result, dict):
+        raise RuntimeError("consistency_sweep_remote_all returned a non-object response")
+    required_keys = {"exit_code", "lock_skipped", "repos_processed"}
+    missing = required_keys - set(result)
+    if missing:
+        missing_list = ", ".join(sorted(missing))
+        raise RuntimeError(
+            f"consistency_sweep_remote_all response missing required key(s): {missing_list}"
+        )
+    return result
+
+
 def _validate_recently_on_scope_hourly(result: Any) -> dict[str, Any]:
     if not isinstance(result, dict):
         raise RuntimeError("recently_on_scope_hourly returned a non-object response")
@@ -107,6 +122,18 @@ class StateHubContextResolver(ContextResolver):
             }
             result = _post_json("/recently-on-scope/hourly", payload)
             return _validate_recently_on_scope_hourly(result)
+        if query == "consistency_sweep_remote_all":
+            payload = {
+                key: value
+                for key, value in params.items()
+                if key not in {"required"}
+            }
+            result = _post_json(
+                "/consistency/sweep/remote-all",
+                payload,
+                timeout=_SWEEP_TIMEOUT_SECONDS,
+            )
+            return _validate_consistency_sweep_remote_all(result)
         return {}


@@ -20,7 +20,8 @@ from activity_core.rules.models import TaskRef, TaskSpec

 logger = logging.getLogger(__name__)

-ISSUE_CORE_URL = os.environ.get("ISSUE_CORE_URL", "http://127.0.0.1:8010")
+ISSUE_CORE_URL = os.environ.get("ISSUE_CORE_URL", "http://127.0.0.1:8765")
+ISSUE_CORE_API_KEY_ENV = "ISSUE_CORE_API_KEY"
 ISSUE_SINK_TYPE = os.environ.get("ISSUE_SINK_TYPE", "rest")


@@ -30,10 +31,30 @@ class IssueSink(ABC):


 class IssueCoreRestSink(IssueSink):
-    """POSTs to issue-core REST API. Config: ISSUE_CORE_URL env var."""
+    """POSTs to issue-core REST API.

-    def __init__(self, base_url: str = ISSUE_CORE_URL) -> None:
+    Config: ISSUE_CORE_URL and ISSUE_CORE_API_KEY env vars (shared key with
+    the issue-core server).
+    """
+
+    def __init__(
+        self,
+        base_url: str = ISSUE_CORE_URL,
+        api_key: str | None = None,
+    ) -> None:
         self._base_url = base_url.rstrip("/")
+        if api_key is not None:
+            self._api_key = api_key.strip()
+        else:
+            self._api_key = os.environ.get(ISSUE_CORE_API_KEY_ENV, "").strip()
+
+    def _auth_headers(self) -> dict[str, str]:
+        if not self._api_key:
+            raise RuntimeError(
+                f"{ISSUE_CORE_API_KEY_ENV} is not set. "
+                "Required when ISSUE_SINK_TYPE=rest."
+            )
+        return {"Authorization": f"Bearer {self._api_key}"}

     def emit(self, task_spec: TaskSpec) -> TaskRef:
         payload = {
@@ -45,10 +66,19 @@ class IssueCoreRestSink(IssueSink):
             "due_in_days": task_spec.due_in_days,
             "source_type": task_spec.source_type,
             "source_id": task_spec.source_id,
-            "triggering_event_id": task_spec.triggering_event_id,
+            "triggering_event_id": (
+                str(task_spec.triggering_event_id)
+                if task_spec.triggering_event_id is not None
+                else None
+            ),
             "activity_definition_id": task_spec.activity_definition_id,
         }
-        resp = httpx.post(f"{self._base_url}/issues/", json=payload, timeout=10.0)
+        resp = httpx.post(
+            f"{self._base_url}/issues/",
+            json=payload,
+            headers=self._auth_headers(),
+            timeout=10.0,
+        )
         resp.raise_for_status()
         data = resp.json()
         return TaskRef(
@@ -17,6 +17,8 @@ import httpx
 class DisabledLLMClient:
     """LLM client used when no llm-connect endpoint is configured."""

+    last_response_metadata: dict[str, Any] | None = None
+
     def complete(
         self,
         prompt: str,
@@ -32,6 +34,7 @@ class LLMConnectClient:
     def __init__(self, base_url: str, timeout_seconds: float = 300.0) -> None:
         self.base_url = base_url.rstrip("/")
         self.timeout_seconds = timeout_seconds
+        self.last_response_metadata: dict[str, Any] | None = None

     def complete(
         self,
@@ -54,12 +57,48 @@ class LLMConnectClient:
         )
         resp.raise_for_status()
         data = resp.json()
+        self.last_response_metadata = _extract_response_metadata(data)
         content = data.get("content")
         if not isinstance(content, str):
             raise ValueError("llm-connect response missing string content")
         return content


+_SAFE_RESPONSE_METADATA_KEYS = {
+    "finish_reason",
+    "usage",
+    "model",
+    "model_name",
+    "provider",
+    "request_id",
+    "response_id",
+    "trace_id",
+    "latency_ms",
+    "duration_ms",
+    "elapsed_ms",
+    "created",
+    "created_at",
+}
+
+
+def _extract_response_metadata(data: dict[str, Any]) -> dict[str, Any]:
+    """Keep non-secret llm-connect diagnostics alongside the returned content."""
+    return {
+        key: value for key, value in data.items()
+        if key in _SAFE_RESPONSE_METADATA_KEYS and _json_safe(value)
+    }
+
+
+def _json_safe(value: Any) -> bool:
+    try:
+        import json
+
+        json.dumps(value)
+    except (TypeError, ValueError):
+        return False
+    return True
+
+
 def get_llm_client() -> DisabledLLMClient | LLMConnectClient:
     base_url = os.environ.get("LLM_CONNECT_URL", "").strip()
     if not base_url:
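The allowlist filter added in the hunk above keeps only known diagnostic keys whose values serialize to JSON. The sketch below restates that idea standalone, assuming a shortened key set; `SAFE_KEYS`, `extract_metadata`, `is_json_safe`, and the sample response payload are illustrative names and values, not taken from llm-connect.

```python
import json
from typing import Any

# Shortened, illustrative allowlist (the diff above lists the full set).
SAFE_KEYS = {"finish_reason", "usage", "model", "request_id"}


def is_json_safe(value: Any) -> bool:
    """True when the value can be serialized with the standard json module."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


def extract_metadata(data: dict[str, Any]) -> dict[str, Any]:
    """Keep allowlisted, JSON-serializable keys; drop everything else."""
    return {
        key: value
        for key, value in data.items()
        if key in SAFE_KEYS and is_json_safe(value)
    }


# Hypothetical response body: "content" and "api_key" are not allowlisted,
# and "request_id" holds a value that json.dumps cannot serialize.
response = {
    "content": "hello",
    "model": "example-model",
    "usage": {"total_tokens": 12},
    "api_key": "secret",
    "request_id": object(),
}
metadata = extract_metadata(response)
```

An allowlist is the conservative choice here: an unknown key added upstream is dropped by default rather than leaking into stored diagnostics.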
@@ -49,7 +49,18 @@ class CronTriggerConfig(BaseModel):
     )
     timezone: str = Field(default="UTC", description="IANA timezone name.")
     jitter_seconds: int = Field(default=0, ge=0)
-    misfire_policy: Literal["skip", "catchup", "compress"] = Field(default="skip")
+    # Run-miss recovery behaviour (ACTIVITY-WP-0014). What happens when a fire is
+    # missed because the worker / Temporal was unavailable at trigger time:
+    # skip - run on trigger or skip; a missed fire is never recovered
+    # catchup_all - recover every fire missed during the outage window
+    # catchup_latest - recover only the most recent missed fire; do not accumulate
+    # Legacy aliases are accepted: catchup → catchup_all, compress → catchup_latest.
+    misfire_policy: Literal[
+        "skip", "catchup_all", "catchup_latest", "catchup", "compress"
+    ] = Field(default="skip")
+    # Override the per-policy default catchup window (how far back Temporal will
+    # recover missed fires after an outage). None uses the policy default.
+    catchup_window_seconds: int | None = Field(default=None, ge=0)


 class EventTriggerConfig(BaseModel):
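The comment on `misfire_policy` in the hunk above says the legacy values `catchup` and `compress` are accepted as aliases. The normalization code itself is not shown in this diff, so the following is only a sketch of one way it might be done with a lookup table; `normalize_misfire_policy` and both module-level constants are hypothetical names.

```python
# Alias table taken from the comment in the diff: catchup -> catchup_all,
# compress -> catchup_latest.
_LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}
_CANONICAL_POLICIES = {"skip", "catchup_all", "catchup_latest"}


def normalize_misfire_policy(value: str) -> str:
    """Map a legacy alias to its canonical policy name; reject unknown values."""
    canonical = _LEGACY_ALIASES.get(value, value)
    if canonical not in _CANONICAL_POLICIES:
        raise ValueError(f"unknown misfire_policy: {value!r}")
    return canonical


policies = [normalize_misfire_policy(p) for p in ("skip", "catchup", "compress")]
```

Normalizing once at the boundary would let downstream scheduling code branch on three canonical values instead of five.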
@@ -2,12 +2,15 @@

 from __future__ import annotations

+import json
 import os
+from pathlib import Path
 from typing import Any

 import httpx

 from activity_core.context_resolvers.ops_inventory import _sanitize_url
+from activity_core.state_hub_write import idempotency_headers

 _DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
 _INTER_HUB_SINK_TYPES = {
@@ -15,6 +18,10 @@ _INTER_HUB_SINK_TYPES = {
     "inter-hub-event",
     "inter-hub-interaction-event",
 }
+_CORE_HUB_SINK_TYPES = {
+    "core-hub",
+    "core-hub-interaction-event",
+}


 def persist_ops_inventory_evidence(payload: dict[str, Any]) -> list[dict[str, Any]]:
@@ -55,6 +62,12 @@ def persist_ops_inventory_evidence(payload: dict[str, Any]) -> list[dict[str, An
                 results.append(
                     _post_state_hub_progress(payload, bind_key, probe_result, sink)
                 )
+            elif sink_type in _CORE_HUB_SINK_TYPES:
+                results.append(
+                    _post_core_hub_interaction_event(
+                        payload, bind_key, probe_result, sink
+                    )
+                )
             elif sink_type in _INTER_HUB_SINK_TYPES:
                 results.append(_inter_hub_result(sink))
             else:
@@ -121,6 +134,7 @@ def _post_state_hub_progress(
     resp = httpx.post(
         f"{base_url}/progress/",
         json=body,
+        headers=idempotency_headers(run_id, context_key, event_type),
         timeout=float(sink.get("timeout_seconds", 10.0)),
     )
     resp.raise_for_status()
@@ -136,12 +150,17 @@ def _post_state_hub_progress(


 def _progress_exists(base_url: str, event_type: str, idempotency_key: str) -> bool:
-    resp = httpx.get(
-        f"{base_url}/progress/",
-        params={"limit": 100},
-        timeout=10.0,
-    )
-    resp.raise_for_status()
+    # Best-effort optimisation only; the Idempotency-Key header on the write is the
+    # real dedup guarantee. Do not hard-fail if State Hub is unreachable here.
+    try:
+        resp = httpx.get(
+            f"{base_url}/progress/",
+            params={"limit": 100},
+            timeout=10.0,
+        )
+        resp.raise_for_status()
+    except httpx.HTTPError:
+        return False
     for item in resp.json():
         detail = item.get("detail") or {}
         if (
@@ -152,6 +171,213 @@ def _progress_exists(base_url: str, event_type: str, idempotency_key: str) -> bo
     return False


+def _post_core_hub_interaction_event(
+    payload: dict[str, Any],
+    context_key: str,
+    probe_result: dict[str, Any],
+    sink: dict[str, Any],
+) -> dict[str, Any]:
+    raw_base_url = (
+        sink.get("core_hub_url")
+        or sink.get("base_url")
+        or os.environ.get("CORE_HUB_BASE_URL")
+        or ""
+    )
+    base_url = str(raw_base_url).rstrip("/")
+    runtime_token = _core_hub_runtime_token(sink)
+    widget_id = _core_hub_widget_id(sink, probe_result)
+
+    missing: list[str] = []
+    if not base_url:
+        missing.append("CORE_HUB_BASE_URL")
+    if not runtime_token:
+        missing.append("CORE_HUB_RUNTIME_TOKEN or CORE_HUB_RUNTIME_TOKEN_FILE")
+    if not widget_id:
+        missing.append("widget_id or CORE_HUB_WIDGET_ID")
+    if missing:
+        return {
+            "type": sink.get("type"),
+            "status": "skipped",
+            "reason": "missing_core_hub_config",
+            "missing": missing,
+            "context_key": context_key,
+        }
+
+    endpoint = _selected_endpoint(probe_result, sink)
+    event_type = sink.get("event_type", "ops-endpoint-verified")
+    timeout = float(sink.get("timeout_seconds", 10.0))
+    body = {
+        "widgetId": widget_id,
+        "eventType": event_type,
+        "viewContext": _core_hub_view_context(payload, context_key, endpoint, sink),
+        "metadata": _core_hub_metadata(payload, context_key, probe_result, endpoint),
+    }
+    resp = httpx.post(
+        f"{base_url}/api/v2/interaction-events",
+        json=body,
+        headers=_core_hub_headers(runtime_token),
+        timeout=timeout,
+    )
+    resp.raise_for_status()
+    data = resp.json()
+    event_id = data.get("id")
+    if not event_id:
+        raise RuntimeError("Core Hub interaction event response did not include an id")
+    if not _core_hub_event_exists(base_url, runtime_token, str(event_id), timeout):
+        raise RuntimeError("Core Hub interaction event was not visible after create")
+
+    return {
+        "type": sink.get("type"),
+        "status": "posted",
+        "event_type": data.get("eventType", event_type),
+        "event_id": event_id,
+        "widget_id": data.get("widgetId", widget_id),
+        "verified": True,
+        "context_key": context_key,
+    }
|
||||
|
||||
def _core_hub_headers(runtime_token: str) -> dict[str, str]:
|
||||
return {
|
||||
"Accept": "application/json",
|
||||
"Authorization": f"Bearer {runtime_token}",
|
||||
"Content-Type": "application/json",
|
||||
"User-Agent": "activity-core-ops-evidence/0.1",
|
||||
}
|
||||
|
||||
|
||||
def _core_hub_runtime_token(sink: dict[str, Any]) -> str:
|
||||
token_file = (
|
||||
sink.get("runtime_token_file")
|
||||
or sink.get("token_file")
|
||||
or os.environ.get("CORE_HUB_RUNTIME_TOKEN_FILE")
|
||||
)
|
||||
if token_file:
|
||||
return Path(str(token_file)).read_text(encoding="utf-8").strip()
|
||||
env_name = (
|
||||
sink.get("runtime_token_env")
|
||||
or os.environ.get("CORE_HUB_RUNTIME_TOKEN_ENV")
|
||||
or "CORE_HUB_RUNTIME_TOKEN"
|
||||
)
|
||||
return os.environ.get(str(env_name), "").strip()
|
||||
|
||||
|
||||
def _core_hub_widget_id(sink: dict[str, Any], probe_result: dict[str, Any]) -> str:
|
||||
direct = sink.get("widget_id") or os.environ.get("CORE_HUB_WIDGET_ID")
|
||||
if direct:
|
||||
return str(direct)
|
||||
|
||||
endpoint = _selected_endpoint(probe_result, sink)
|
||||
widget_ref = endpoint.get("widget_ref") if endpoint else None
|
||||
if not widget_ref:
|
||||
return ""
|
||||
|
||||
mapping = sink.get("widget_mapping") or sink.get("capability_mapping")
|
||||
if mapping is None:
|
||||
mapping = os.environ.get("CORE_HUB_WIDGET_MAPPING")
|
||||
parsed = _parse_widget_mapping(mapping)
|
||||
return parsed.get(str(widget_ref), "")
|
||||
|
||||
|
||||
def _parse_widget_mapping(raw: Any) -> dict[str, str]:
|
||||
if isinstance(raw, dict):
|
||||
return {str(key): str(value) for key, value in raw.items() if value}
|
||||
if not isinstance(raw, str) or not raw.strip():
|
||||
return {}
|
||||
value = raw.strip()
|
||||
if value.startswith("{"):
|
||||
try:
|
||||
loaded = json.loads(value)
|
||||
except json.JSONDecodeError:
|
||||
return {}
|
||||
if isinstance(loaded, dict):
|
||||
return {str(key): str(item) for key, item in loaded.items() if item}
|
||||
return {}
|
||||
if "=" not in value:
|
||||
return {}
|
||||
pairs: dict[str, str] = {}
|
||||
for part in value.split(","):
|
||||
key, _, item = part.partition("=")
|
||||
if key.strip() and item.strip():
|
||||
pairs[key.strip()] = item.strip()
|
||||
return pairs
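
`_parse_widget_mapping` accepts three input shapes: a dict, a JSON object string, and a comma-separated `key=value` string. The sketch below is a condensed standalone copy under the hypothetical name `parse_mapping`, written only to show what each shape yields; the sample keys and widget ids are invented.

```python
from __future__ import annotations

import json
from typing import Any


def parse_mapping(raw: Any) -> dict[str, str]:
    """Hypothetical condensed copy of the mapping parser, for illustration."""
    if isinstance(raw, dict):
        return {str(k): str(v) for k, v in raw.items() if v}
    if not isinstance(raw, str) or not raw.strip():
        return {}
    value = raw.strip()
    if value.startswith("{"):
        try:
            loaded = json.loads(value)
        except json.JSONDecodeError:
            return {}
        if not isinstance(loaded, dict):
            return {}
        return {str(k): str(v) for k, v in loaded.items() if v}
    pairs: dict[str, str] = {}
    for part in value.split(","):
        key, _, item = part.partition("=")
        # Parts without "=" or with an empty side are skipped silently.
        if key.strip() and item.strip():
            pairs[key.strip()] = item.strip()
    return pairs


from_dict = parse_mapping({"ops": "w-1", "empty": ""})
from_json = parse_mapping('{"ops": "w-1"}')
from_pairs = parse_mapping("ops = w-1, broken, net=w-2")
from_bad_json = parse_mapping("{not json")
```

Malformed input degrades to an empty or partial mapping instead of raising, so a typo in `CORE_HUB_WIDGET_MAPPING` shows up as a missing widget id rather than a crash.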
|
||||
|
||||
|
||||
def _selected_endpoint(probe_result: dict[str, Any], sink: dict[str, Any]) -> dict[str, Any]:
|
||||
endpoints = [
|
||||
endpoint
|
||||
for endpoint in probe_result.get("endpoints", [])
|
||||
if isinstance(endpoint, dict)
|
||||
]
|
||||
endpoint_id = sink.get("endpoint_id")
|
||||
if endpoint_id:
|
||||
match = next(
|
||||
(endpoint for endpoint in endpoints if endpoint.get("endpoint_id") == endpoint_id),
|
||||
None,
|
||||
)
|
||||
if match:
|
||||
return match
|
||||
return next(
|
||||
(endpoint for endpoint in endpoints if endpoint.get("widget_ref")),
|
||||
endpoints[0] if endpoints else {},
|
||||
)
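
The fallback chain in `_selected_endpoint` is: an exact `endpoint_id` match, then the first endpoint carrying a `widget_ref`, then the first endpoint of any kind, then an empty dict. A minimal sketch of that chain, with `select_endpoint` as a hypothetical standalone name and invented sample data:

```python
from __future__ import annotations

from typing import Any


def select_endpoint(probe_result: dict[str, Any], sink: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical standalone mirror of the endpoint selection order."""
    endpoints = [e for e in probe_result.get("endpoints", []) if isinstance(e, dict)]
    endpoint_id = sink.get("endpoint_id")
    if endpoint_id:
        match = next((e for e in endpoints if e.get("endpoint_id") == endpoint_id), None)
        if match:
            return match
    # No explicit match: prefer an endpoint that can be mapped to a widget.
    return next(
        (e for e in endpoints if e.get("widget_ref")),
        endpoints[0] if endpoints else {},
    )


probe = {"endpoints": [{"endpoint_id": "a"}, {"endpoint_id": "b", "widget_ref": "w"}, "junk"]}
by_id = select_endpoint(probe, {"endpoint_id": "a"})
by_widget = select_endpoint(probe, {})
unknown_id = select_endpoint(probe, {"endpoint_id": "zzz"})
first_only = select_endpoint({"endpoints": [{"endpoint_id": "a"}]}, {})
nothing = select_endpoint({}, {})
```

Note that an `endpoint_id` that matches nothing does not fail; it falls through to the widget-bearing endpoint.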
|
||||
|
||||
|
||||
def _core_hub_view_context(
|
||||
payload: dict[str, Any],
|
||||
context_key: str,
|
||||
endpoint: dict[str, Any],
|
||||
sink: dict[str, Any],
|
||||
) -> str:
|
||||
return str(
|
||||
sink.get("view_context")
|
||||
or endpoint.get("view_context")
|
||||
or f"activity-core/ops-inventory/{payload.get('run_id', 'unknown')}/{context_key}"
|
||||
)
|
||||
|
||||
|
||||
def _core_hub_metadata(
|
||||
payload: dict[str, Any],
|
||||
context_key: str,
|
||||
probe_result: dict[str, Any],
|
||||
endpoint: dict[str, Any],
|
||||
) -> dict[str, Any]:
|
||||
compact = _compact_probe_result(probe_result)
|
||||
return {
|
||||
"activity_id": payload.get("activity_id"),
|
||||
"activity_core_run_id": payload.get("run_id"),
|
||||
"scheduled_for": payload.get("scheduled_for"),
|
||||
"source_type": "ops-inventory",
|
||||
"context_key": context_key,
|
||||
"probe": {
|
||||
"generated_at": compact.get("generated_at"),
|
||||
"inventory_path": compact.get("inventory_path"),
|
||||
"status": compact.get("status"),
|
||||
"reason": compact.get("reason"),
|
||||
"summary": compact.get("summary", {}),
|
||||
},
|
||||
"endpoint": _compact_endpoint(endpoint) if endpoint else {},
|
||||
}
|
||||
|
||||
|
||||
def _core_hub_event_exists(
|
||||
base_url: str,
|
||||
runtime_token: str,
|
||||
event_id: str,
|
||||
timeout: float,
|
||||
) -> bool:
|
||||
resp = httpx.get(
|
||||
f"{base_url}/api/v2/interaction-events",
|
||||
headers=_core_hub_headers(runtime_token),
|
||||
timeout=timeout,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
payload = resp.json()
|
||||
data = payload.get("data") if isinstance(payload, dict) else []
|
||||
if not isinstance(data, list):
|
||||
return False
|
||||
return any(isinstance(item, dict) and item.get("id") == event_id for item in data)
|
||||
|
||||
def _inter_hub_result(sink: dict[str, Any]) -> dict[str, Any]:
|
||||
missing: list[str] = []
|
||||
if not (sink.get("inter_hub_url") or os.environ.get("INTER_HUB_URL")):
|
||||
|
||||
@@ -11,6 +11,8 @@ from zoneinfo import ZoneInfo
|
||||
|
||||
import httpx
|
||||
|
||||
from activity_core.state_hub_write import idempotency_headers
|
||||
|
||||
_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
|
||||
_THE_CUSTODIAN_ROOT = Path("/home/worsch/the-custodian")
|
||||
_FORBIDDEN_CUSTODIAN_ROOTS = (
|
||||
@@ -134,6 +136,7 @@ def _post_state_hub_progress(
|
||||
"output_validated": report_entry.get("output_validated"),
|
||||
"review_required": report_entry.get("review_required"),
|
||||
"validation_error": report_entry.get("validation_error"),
|
||||
"llm_response_metadata": report_entry.get("llm_response_metadata"),
|
||||
"report": report,
|
||||
},
|
||||
}
|
||||
@@ -149,6 +152,7 @@ def _post_state_hub_progress(
|
||||
resp = httpx.post(
|
||||
f"{base_url}/progress/",
|
||||
json=body,
|
||||
headers=idempotency_headers(run_id, instruction_id, event_type),
|
||||
timeout=float(sink.get("timeout_seconds", 10.0)),
|
||||
)
|
||||
resp.raise_for_status()
|
||||
@@ -167,12 +171,18 @@ def _progress_exists(
|
||||
instruction_id: str,
|
||||
event_type: str,
|
||||
) -> bool:
|
||||
resp = httpx.get(
|
||||
f"{base_url}/progress/",
|
||||
params={"limit": 100},
|
||||
timeout=10.0,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
# Best-effort read-dedup optimisation only. The Idempotency-Key header on the
|
||||
# write is the real guarantee; if State Hub is unreachable here we must not
|
||||
# hard-fail — proceed to the (keyed) write rather than raising.
|
||||
try:
|
||||
resp = httpx.get(
|
||||
f"{base_url}/progress/",
|
||||
params={"limit": 100},
|
||||
timeout=10.0,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
except httpx.HTTPError:
|
||||
return False
|
||||
for item in resp.json():
|
||||
detail = item.get("detail") or {}
|
||||
if (
|
||||
@@ -215,6 +225,16 @@ def _render_markdown(
|
||||
lines.extend([summary, ""])
|
||||
if validation_error:
|
||||
lines.extend(["Validation error:", "", f"`{validation_error}`", ""])
|
||||
metadata = report_entry.get("llm_response_metadata")
|
||||
if metadata:
|
||||
lines.extend([
|
||||
"LLM response metadata:",
|
||||
"",
|
||||
"```json",
|
||||
json.dumps(metadata, indent=2, sort_keys=True),
|
||||
"```",
|
||||
"",
|
||||
])
|
||||
lines.extend([
|
||||
"```json",
|
||||
json.dumps(report, indent=2, sort_keys=True),
|
||||
|
||||
@@ -41,6 +41,7 @@ class InstructionResult:
|
||||
review_required: bool = False
|
||||
condition_matched: str | None = None
|
||||
validation_error: str | None = None
|
||||
llm_response_metadata: dict[str, Any] | None = None
|
||||
|
||||
|
||||
def _resolve_path(obj: Any, path: str) -> Any:
|
||||
@@ -160,15 +161,22 @@ def _execute(
|
||||
prompt_hash = hashlib.sha256(rendered.encode()).hexdigest()
|
||||
llm_config = _llm_run_config(instr)
|
||||
|
||||
# Reference allow-list (WP-0016-T04): if a context resolver supplied the set
|
||||
# of known candidate ids, recommendations pointing at anything else are
|
||||
# quarantined. Absent (None) today → the check is inert until wired.
|
||||
allow_list = _allow_list_from_context(context)
|
||||
|
||||
# Step 3 — call LLM
|
||||
raw_output = llm_client.complete(rendered, model=instr.model, config=llm_config)
|
||||
response_metadata = _llm_response_metadata(llm_client)
|
||||
|
||||
# Step 4 — validate and optionally retry
|
||||
task_specs, report, error = _validate_output(raw_output, instr)
|
||||
task_specs, report, error = _validate_output(raw_output, instr, allow_list)
|
||||
if error:
|
||||
retry_prompt = rendered + f"\n\nPrevious output was invalid: {error}\nPlease fix."
|
||||
raw_output = llm_client.complete(retry_prompt, model=instr.model, config=llm_config)
|
||||
task_specs, report, error = _validate_output(raw_output, instr)
|
||||
response_metadata = _llm_response_metadata(llm_client)
|
||||
task_specs, report, error = _validate_output(raw_output, instr, allow_list)
|
||||
if error:
|
||||
# Truncate to keep log volume bounded but long enough to see the
|
||||
# actual JSON shape mismatch (typical reports are <2KB).
|
||||
@@ -178,7 +186,18 @@ def _execute(
|
||||
"error=%s, raw_output_preview=%r",
|
||||
instr.id, prompt_hash, error, preview,
|
||||
)
|
||||
failure_report = _invalid_output_report(instr, error, raw_output)
|
||||
# Posture B (WP-0016-T03): try to recover a partial-but-usable
|
||||
# report from individually-parseable items before declaring total
|
||||
# loss. One bad item should cost one item, not the whole report.
|
||||
recovered = _resilient_report(
|
||||
instr, raw_output, error, prompt_hash, allow_list,
|
||||
response_metadata=response_metadata,
|
||||
)
|
||||
if recovered is not None:
|
||||
return recovered
|
||||
failure_report = _invalid_output_report(
|
||||
instr, error, raw_output, response_metadata=response_metadata,
|
||||
)
|
||||
if failure_report is not None:
|
||||
return InstructionResult(
|
||||
tasks=[],
|
||||
@@ -189,6 +208,7 @@ def _execute(
|
||||
review_required=True,
|
||||
condition_matched=instr.condition or None,
|
||||
validation_error=error,
|
||||
llm_response_metadata=response_metadata,
|
||||
)
|
||||
return _empty_result(instr, prompt_hash=prompt_hash, validation_error=error)
|
||||
|
||||
@@ -200,6 +220,7 @@ def _execute(
|
||||
output_validated=True,
|
||||
review_required=bool(getattr(instr, "review_required", False)),
|
||||
condition_matched=instr.condition or None,
|
||||
llm_response_metadata=response_metadata,
|
||||
)
|
||||
|
||||
|
||||
@@ -239,6 +260,7 @@ def _invalid_output_report(
|
||||
instr: Any,
|
||||
validation_error: str,
|
||||
raw_output: Any,
|
||||
response_metadata: dict[str, Any] | None = None,
|
||||
) -> dict[str, Any] | None:
|
||||
"""Build a durable diagnostic report for invalid report-sink output.
|
||||
|
||||
@@ -256,7 +278,7 @@ def _invalid_output_report(
|
||||
partial_output = _parse_json_output(raw_output)
|
||||
except json.JSONDecodeError:
|
||||
partial_output = None
|
||||
raw_preview = raw_output[:4000]
|
||||
raw_preview = raw_output[:_RAW_OUTPUT_PREVIEW_LIMIT]
|
||||
else:
|
||||
partial_output = raw_output
|
||||
|
||||
@@ -268,6 +290,8 @@ def _invalid_output_report(
|
||||
"status": "validation_failed",
|
||||
"validation_error": validation_error,
|
||||
}
|
||||
if response_metadata:
|
||||
report["llm_response_metadata"] = response_metadata
|
||||
if isinstance(partial_output, dict):
|
||||
if isinstance(partial_output.get("summary"), str):
|
||||
report["partial_summary"] = partial_output["summary"]
|
||||
@@ -279,6 +303,358 @@ def _invalid_output_report(
|
||||
return report
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
# Resilient report recovery (ACTIVITY-WP-0016-T03)
#
# Posture B: verify and mitigate at the producer→consumer boundary. When the
# whole-document parse/validate fails, recover individually-parseable
# recommendation objects, validate each against the item schema, keep the valid
# ones, and quarantine the malformed/over-limit ones with provenance. One bad
# item costs one item, not the whole report (error locality == unit of work).
# ---------------------------------------------------------------------------
|
||||
|
||||
_QUARANTINE_LIMIT = 20
|
||||
_SNIPPET_LIMIT = 200
|
||||
# Producer guardrails (ACTIVITY-WP-0016-T04): structural bounds applied to every
|
||||
# recommendation regardless of producer (LLM, agent, or human). These are
|
||||
# verify-and-mitigate limits — an offending item is quarantined, never allowed to
|
||||
# fail the whole report or flow unbounded into a downstream consumer.
|
||||
_MAX_STRING_LEN = 4000
|
||||
_MAX_DEPTH = 8
|
||||
_RAW_OUTPUT_PREVIEW_LIMIT = 12000
|
||||
_SUMMARY_RE = re.compile(r'"summary"\s*:\s*"((?:[^"\\]|\\.)*)"')
|
||||
|
||||
|
||||
_SAFE_RESPONSE_METADATA_KEYS = {
|
||||
"finish_reason",
|
||||
"usage",
|
||||
"model",
|
||||
"model_name",
|
||||
"provider",
|
||||
"request_id",
|
||||
"response_id",
|
||||
"trace_id",
|
||||
"latency_ms",
|
||||
"duration_ms",
|
||||
"elapsed_ms",
|
||||
"created",
|
||||
"created_at",
|
||||
}
|
||||
|
||||
|
||||
def _llm_response_metadata(llm_client: Any) -> dict[str, Any] | None:
|
||||
metadata = getattr(llm_client, "last_response_metadata", None)
|
||||
if not isinstance(metadata, dict) or not metadata:
|
||||
return None
|
||||
safe: dict[str, Any] = {}
|
||||
for key, value in metadata.items():
|
||||
if key not in _SAFE_RESPONSE_METADATA_KEYS:
|
||||
continue
|
||||
try:
|
||||
json.dumps(value)
|
||||
except (TypeError, ValueError):
|
||||
continue
|
||||
safe[str(key)] = value
|
||||
return safe or None
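
`_llm_response_metadata` copies only allow-listed, JSON-serialisable keys from the client's `last_response_metadata`. The sketch below illustrates that filter with a fake client; `safe_metadata`, `FakeClient`, and the reduced `SAFE_KEYS` set are hypothetical stand-ins, not part of the module.

```python
from __future__ import annotations

import json
from typing import Any

# Reduced subset of the real allow-list, for illustration only.
SAFE_KEYS = {"finish_reason", "usage", "model", "request_id"}


def safe_metadata(client: Any) -> dict[str, Any] | None:
    """Hypothetical mirror: keep allow-listed keys whose values serialise to JSON."""
    metadata = getattr(client, "last_response_metadata", None)
    if not isinstance(metadata, dict) or not metadata:
        return None
    safe: dict[str, Any] = {}
    for key, value in metadata.items():
        if key not in SAFE_KEYS:
            continue  # unknown keys (possibly secrets) never reach the report
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            continue  # non-serialisable values are dropped, not stringified
        safe[str(key)] = value
    return safe or None


class FakeClient:
    last_response_metadata = {
        "finish_reason": "length",
        "usage": {"prompt_tokens": 10},
        "api_key": "secret",   # not allow-listed
        "model": object(),     # allow-listed but not JSON-serialisable
    }


kept = safe_metadata(FakeClient())
absent = safe_metadata(object())
```

A `finish_reason` of `"length"` surviving into the report is the kind of signal that helps explain a truncated output after the fact.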
|
||||
|
||||
|
||||
def _snippet(value: Any) -> str:
|
||||
text = value if isinstance(value, str) else json.dumps(value, default=str)
|
||||
return text[:_SNIPPET_LIMIT]
|
||||
|
||||
|
||||
def _json_depth(value: Any, depth: int = 1) -> int:
|
||||
if depth > _MAX_DEPTH:
|
||||
return depth
|
||||
if isinstance(value, dict):
|
||||
return max((_json_depth(v, depth + 1) for v in value.values()), default=depth)
|
||||
if isinstance(value, list):
|
||||
return max((_json_depth(v, depth + 1) for v in value), default=depth)
|
||||
return depth
|
||||
|
||||
|
||||
def _has_oversized_string(value: Any) -> bool:
|
||||
if isinstance(value, str):
|
||||
return len(value) > _MAX_STRING_LEN
|
||||
if isinstance(value, dict):
|
||||
return any(_has_oversized_string(v) for v in value.values())
|
||||
if isinstance(value, list):
|
||||
return any(_has_oversized_string(v) for v in value)
|
||||
return False
|
||||
|
||||
|
||||
def _item_structure_error(item: Any) -> str | None:
|
||||
"""Producer-agnostic structural guardrail: depth and string-length caps."""
|
||||
if _json_depth(item) > _MAX_DEPTH:
|
||||
return f"exceeds max nesting depth {_MAX_DEPTH}"
|
||||
if _has_oversized_string(item):
|
||||
return f"contains a string longer than {_MAX_STRING_LEN} chars"
|
||||
return None
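
The depth counter treats the item itself as level 1 and stops descending once it is past the cap, so a pathologically nested item costs a bounded amount of work. A minimal sketch under the same constants; `json_depth`, `has_oversized_string`, and the `nest` helper are hypothetical standalone names.

```python
from __future__ import annotations

from typing import Any

MAX_DEPTH = 8          # mirrors _MAX_DEPTH
MAX_STRING_LEN = 4000  # mirrors _MAX_STRING_LEN


def json_depth(value: Any, depth: int = 1) -> int:
    if depth > MAX_DEPTH:
        return depth  # already over the cap; no need to descend further
    if isinstance(value, dict):
        return max((json_depth(v, depth + 1) for v in value.values()), default=depth)
    if isinstance(value, list):
        return max((json_depth(v, depth + 1) for v in value), default=depth)
    return depth


def has_oversized_string(value: Any) -> bool:
    if isinstance(value, str):
        return len(value) > MAX_STRING_LEN
    if isinstance(value, dict):
        return any(has_oversized_string(v) for v in value.values())
    if isinstance(value, list):
        return any(has_oversized_string(v) for v in value)
    return False


def nest(levels: int) -> Any:
    """Build {"k": {"k": ... "leaf"}} with `levels` dict layers (test helper)."""
    value: Any = "leaf"
    for _ in range(levels):
        value = {"k": value}
    return value


depth_ok = json_depth(nest(7))       # leaf sits at level 8, exactly at the cap
depth_bad = json_depth(nest(8))      # leaf sits at level 9, over the cap
depth_capped = json_depth(nest(50))  # early exit reports 9, not 51
too_long = has_oversized_string({"a": ["x" * 4001]})
at_limit = has_oversized_string({"a": "x" * 4000})
```

Only dict values are inspected, not keys, so an oversized key would pass the string check in this sketch just as it appears to in the original.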
|
||||
|
||||
|
||||
def _allow_list_from_context(context: dict | None) -> set[str] | None:
|
||||
"""Build the recommendation-candidate allow-list from resolved context.
|
||||
|
||||
Looks for `context["known_candidates"]` (a list/set of valid candidate ids).
|
||||
Returns None when absent so the allow-list check stays inert until a context
|
||||
resolver populates it — the guardrail capability ships now; activation is a
|
||||
one-line resolver change.
|
||||
"""
|
||||
if not isinstance(context, dict):
|
||||
return None
|
||||
known = context.get("known_candidates")
|
||||
if isinstance(known, (list, set, tuple)):
|
||||
return {str(item) for item in known}
|
||||
return None
|
||||
|
||||
|
||||
def _report_contract(instr: Any) -> tuple[dict[str, Any] | None, int | None]:
|
||||
"""Extract (item_schema, max_items) for the recommendations list, if any."""
|
||||
try:
|
||||
schema = _load_output_schema(getattr(instr, "output_schema", ""))
|
||||
except (OSError, json.JSONDecodeError, TypeError):
|
||||
return None, None
|
||||
if not isinstance(schema, dict):
|
||||
return None, None
|
||||
recs = (schema.get("properties") or {}).get("recommendations")
|
||||
if not isinstance(recs, dict):
|
||||
return None, None
|
||||
item_schema = recs.get("items") if isinstance(recs.get("items"), dict) else None
|
||||
max_items = recs.get("maxItems") if isinstance(recs.get("maxItems"), int) else None
|
||||
return item_schema, max_items
|
||||
|
||||
|
||||
def _extract_object_spans(raw: str) -> list[tuple[str, bool]]:
|
||||
"""Return (span, complete) for each recommendation object in raw output.
|
||||
|
||||
Scans the `recommendations` array brace-aware and string-aware so it recovers
|
||||
objects whether they are pretty-printed across many lines or emitted one per
|
||||
line (NDJSON). A truncated trailing object is returned with complete=False.
|
||||
"""
|
||||
key = raw.find('"recommendations"')
|
||||
start_region = raw.find("[", key) if key >= 0 else -1
|
||||
if start_region < 0:
|
||||
return []
|
||||
spans: list[tuple[str, bool]] = []
|
||||
i, n = start_region + 1, len(raw)
|
||||
while i < n:
|
||||
ch = raw[i]
|
||||
if ch == "]":
|
||||
break
|
||||
if ch != "{":
|
||||
i += 1
|
||||
continue
|
||||
depth, in_str, esc, j = 0, False, False, i
|
||||
closed = False
|
||||
while j < n:
|
||||
c = raw[j]
|
||||
if in_str:
|
||||
if esc:
|
||||
esc = False
|
||||
elif c == "\\":
|
||||
esc = True
|
||||
elif c == '"':
|
||||
in_str = False
|
||||
elif c == '"':
|
||||
in_str = True
|
||||
elif c == "{":
|
||||
depth += 1
|
||||
elif c == "}":
|
||||
depth -= 1
|
||||
if depth == 0:
|
||||
spans.append((raw[i:j + 1], True))
|
||||
closed = True
|
||||
break
|
||||
j += 1
|
||||
if not closed:
|
||||
spans.append((raw[i:], False)) # truncated tail
|
||||
break
|
||||
i = j + 1
|
||||
return spans
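
The scanner's key property is that braces inside string values do not affect nesting depth, and that a cut-off final object is still returned (flagged incomplete) instead of being lost. The sketch below is a condensed standalone version under the hypothetical name `object_spans`, run on an invented truncated payload.

```python
from __future__ import annotations


def object_spans(raw: str) -> list[tuple[str, bool]]:
    """Hypothetical condensed scanner: (span, complete) per object in the array."""
    key = raw.find('"recommendations"')
    start = raw.find("[", key) if key >= 0 else -1
    if start < 0:
        return []
    spans: list[tuple[str, bool]] = []
    i, n = start + 1, len(raw)
    while i < n:
        if raw[i] == "]":
            break  # end of the recommendations array
        if raw[i] != "{":
            i += 1
            continue
        depth, in_str, esc, j = 0, False, False, i
        while j < n:
            c = raw[j]
            if in_str:
                if esc:
                    esc = False
                elif c == "\\":
                    esc = True
                elif c == '"':
                    in_str = False
            elif c == '"':
                in_str = True
            elif c == "{":
                depth += 1
            elif c == "}":
                depth -= 1
                if depth == 0:
                    break
            j += 1
        if j >= n:
            spans.append((raw[i:], False))  # ran off the end: truncated tail
            break
        spans.append((raw[i:j + 1], True))
        i = j + 1
    return spans


raw = (
    '{"summary": "s", "recommendations": ['
    '{"candidate": "a", "note": "has } brace"}, '
    '{"candidate": "b", "no'
)
spans = object_spans(raw)
no_array = object_spans("no array here")
```

One caveat that follows from the `find` call: the first occurrence of `"recommendations"` is used even if it happens to appear inside an earlier string value.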
|
||||
|
||||
|
||||
def _try_repair(span: str) -> str:
|
||||
"""Best-effort close of a truncated JSON object: balance quote, braces, brackets."""
|
||||
in_str, esc, depth_c, depth_b = False, False, 0, 0
|
||||
for c in span:
|
||||
if in_str:
|
||||
if esc:
|
||||
esc = False
|
||||
elif c == "\\":
|
||||
esc = True
|
||||
elif c == '"':
|
||||
in_str = False
|
||||
elif c == '"':
|
||||
in_str = True
|
||||
elif c == "{":
|
||||
depth_c += 1
|
||||
elif c == "}":
|
||||
depth_c -= 1
|
||||
elif c == "[":
|
||||
depth_b += 1
|
||||
elif c == "]":
|
||||
depth_b -= 1
|
||||
repaired = span.rstrip().rstrip(",")
|
||||
if in_str:
|
||||
repaired += '"'
|
||||
return repaired + "]" * max(depth_b, 0) + "}" * max(depth_c, 0)
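
The repair is best-effort: it closes an open string, then appends all missing `]` followed by all missing `}`. That ordering recovers flat objects and objects with a trailing list, but not an object nested inside an open array. A sketch with the hypothetical standalone name `try_repair` shows both outcomes on invented inputs.

```python
from __future__ import annotations

import json


def try_repair(span: str) -> str:
    """Hypothetical standalone copy of the truncation repair."""
    in_str, esc, braces, brackets = False, False, 0, 0
    for c in span:
        if in_str:
            if esc:
                esc = False
            elif c == "\\":
                esc = True
            elif c == '"':
                in_str = False
        elif c == '"':
            in_str = True
        elif c == "{":
            braces += 1
        elif c == "}":
            braces -= 1
        elif c == "[":
            brackets += 1
        elif c == "]":
            brackets -= 1
    repaired = span.rstrip().rstrip(",")
    if in_str:
        repaired += '"'
    # All brackets are closed before all braces, regardless of nesting order.
    return repaired + "]" * max(brackets, 0) + "}" * max(braces, 0)


def parses(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


fixed_string = try_repair('{"candidate": "a", "reason": "cut of')
fixed_list = try_repair('{"candidate": "a", "tags": ["x", "y",')
nested = try_repair('{"items": [{"id": 1')
```

The third case yields text that still fails to parse, which is acceptable here because the caller quarantines anything that does not load after repair.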
|
||||
|
||||
|
||||
def _recover_recommendations(
|
||||
raw: str,
|
||||
) -> tuple[str | None, list[dict[str, Any]], list[dict[str, Any]]]:
|
||||
"""Recover (summary, items, quarantined) from a failed report payload."""
|
||||
summary_match = _SUMMARY_RE.search(raw)
|
||||
summary = None
|
||||
if summary_match:
|
||||
try:
|
||||
summary = json.loads(f'"{summary_match.group(1)}"')
|
||||
except json.JSONDecodeError:
|
||||
summary = summary_match.group(1)
|
||||
items: list[dict[str, Any]] = []
|
||||
quarantined: list[dict[str, Any]] = []
|
||||
for index, (span, complete) in enumerate(_extract_object_spans(raw)):
|
||||
parsed: Any = None
|
||||
try:
|
||||
parsed = json.loads(span)
|
||||
except json.JSONDecodeError as exc:
|
||||
if not complete:
|
||||
try:
|
||||
parsed = json.loads(_try_repair(span))
|
||||
except json.JSONDecodeError:
|
||||
parsed = None
|
||||
if parsed is None:
|
||||
quarantined.append(
|
||||
{"index": index, "error": str(exc), "raw": _snippet(span),
|
||||
"reason": "truncated" if not complete else "unparseable"}
|
||||
)
|
||||
continue
|
||||
if isinstance(parsed, dict):
|
||||
items.append(parsed)
|
||||
else:
|
||||
quarantined.append(
|
||||
{"index": index, "error": "item is not a JSON object",
|
||||
"raw": _snippet(span)}
|
||||
)
|
||||
return summary, items, quarantined
|
||||
|
||||
|
||||
def _partition_items(
|
||||
items: list[dict[str, Any]],
|
||||
item_schema: dict[str, Any] | None,
|
||||
max_items: int | None,
|
||||
*,
|
||||
run_schema: bool = True,
|
||||
allow_list: set[str] | None = None,
|
||||
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
|
||||
"""Screen items into (valid, quarantined).
|
||||
|
||||
Applied uniformly to recovered items (run_schema=True) and to already
|
||||
schema-valid happy-path items (run_schema=False). Order of checks: structural
|
||||
type → schema → producer guardrails (depth/length) → reference allow-list →
|
||||
count cap. The first failing check quarantines the item with provenance.
|
||||
"""
|
||||
valid: list[dict[str, Any]] = []
|
||||
quarantined: list[dict[str, Any]] = []
|
||||
for index, item in enumerate(items):
|
||||
if not isinstance(item, dict):
|
||||
quarantined.append(
|
||||
{"index": index, "error": "item is not a JSON object",
|
||||
"raw": _snippet(item), "reason": "malformed"}
|
||||
)
|
||||
continue
|
||||
schema_error = (
|
||||
_validate_schema_node(item, item_schema, f"recommendations[{index}]")
|
||||
if (run_schema and item_schema)
|
||||
else None
|
||||
)
|
||||
if schema_error:
|
||||
quarantined.append(
|
||||
{"index": index, "error": schema_error, "raw": _snippet(item),
|
||||
"reason": "schema"}
|
||||
)
|
||||
continue
|
||||
structure_error = _item_structure_error(item)
|
||||
if structure_error:
|
||||
quarantined.append(
|
||||
{"index": index, "error": structure_error, "raw": _snippet(item),
|
||||
"reason": "guardrail"}
|
||||
)
|
||||
continue
|
||||
if allow_list is not None:
|
||||
candidate = item.get("candidate")
|
||||
if not isinstance(candidate, str) or candidate not in allow_list:
|
||||
quarantined.append(
|
||||
{"index": index, "error": f"candidate {candidate!r} not in allow-list",
|
||||
"raw": _snippet(item), "reason": "allow_list"}
|
||||
)
|
||||
continue
|
||||
valid.append(item)
|
||||
if max_items is not None and len(valid) > max_items:
|
||||
for item in valid[max_items:]:
|
||||
quarantined.append(
|
||||
{"index": None, "error": f"exceeds maxItems={max_items}",
|
||||
"raw": _snippet(item), "reason": "over_limit"}
|
||||
)
|
||||
valid = valid[:max_items]
|
||||
return valid, quarantined
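
The count cap is applied last, after per-item screening, so `maxItems` counts only items that survived the earlier checks, and the overflow is quarantined rather than silently dropped. The sketch below is a reduced version that omits the schema and depth/length checks; `partition` and the sample candidates are hypothetical.

```python
from __future__ import annotations

from typing import Any


def partition(
    items: list[Any],
    max_items: int | None = None,
    allow_list: set[str] | None = None,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Hypothetical reduced screen: type check, allow-list, then count cap."""
    valid: list[dict[str, Any]] = []
    quarantined: list[dict[str, Any]] = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            quarantined.append({"index": index, "reason": "malformed"})
            continue
        if allow_list is not None:
            candidate = item.get("candidate")
            if not isinstance(candidate, str) or candidate not in allow_list:
                quarantined.append({"index": index, "reason": "allow_list"})
                continue
        valid.append(item)
    if max_items is not None and len(valid) > max_items:
        # Overflow items lose their original index, as in the source.
        quarantined.extend({"index": None, "reason": "over_limit"} for _ in valid[max_items:])
        valid = valid[:max_items]
    return valid, quarantined


items = [{"candidate": "a"}, "oops", {"candidate": "z"}, {"candidate": "b"}, {"candidate": "c"}]
valid, quarantined = partition(items, max_items=2, allow_list={"a", "b", "c"})
reasons = [q["reason"] for q in quarantined]
```

With `allow_list=None` the reference check is skipped entirely, which matches the "inert until wired" behaviour described for the context resolver.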
|
||||
|
||||
|
||||
def _resilient_report(
|
||||
instr: Any,
|
||||
raw_output: Any,
|
||||
original_error: str,
|
||||
prompt_hash: str | None,
|
||||
allow_list: set[str] | None = None,
|
||||
response_metadata: dict[str, Any] | None = None,
|
||||
) -> InstructionResult | None:
|
||||
"""Recover a partial-but-usable report from output that failed validation.
|
||||
|
||||
Returns None when nothing usable can be recovered, so the caller falls back
|
||||
to the total-loss diagnostic artifact (_invalid_output_report).
|
||||
"""
|
||||
if not getattr(instr, "report_sinks", None) or not isinstance(raw_output, str):
|
||||
return None
|
||||
item_schema, max_items = _report_contract(instr)
|
||||
summary, items, quarantined = _recover_recommendations(raw_output)
|
||||
if not items:
|
||||
return None
|
||||
valid, item_quarantine = _partition_items(
|
||||
items, item_schema, max_items, allow_list=allow_list,
|
||||
)
|
||||
quarantined.extend(item_quarantine)
|
||||
if not valid:
|
||||
return None
|
||||
report: dict[str, Any] = {
|
||||
"summary": summary
|
||||
or f"Partial daily triage: recovered {len(valid)} recommendation(s) "
|
||||
"after the full report failed validation.",
|
||||
"recommendations": valid,
|
||||
"status": "partial",
|
||||
"partial": True,
|
||||
"quarantined_count": len(quarantined),
|
||||
"quarantined_items": quarantined[:_QUARANTINE_LIMIT],
|
||||
"recovery_note": f"original validation error: {original_error}",
|
||||
}
|
||||
if response_metadata:
|
||||
report["llm_response_metadata"] = response_metadata
|
||||
logger.warning(
|
||||
"instruction_output_recovered: instruction=%r, kept=%d, quarantined=%d",
|
||||
getattr(instr, "id", None), len(valid), len(quarantined),
|
||||
)
|
||||
return InstructionResult(
|
||||
tasks=[],
|
||||
report=report,
|
||||
prompt_hash=prompt_hash,
|
||||
model=getattr(instr, "model", None),
|
||||
output_validated=True,
|
||||
review_required=True,
|
||||
condition_matched=getattr(instr, "condition", "") or None,
|
||||
validation_error=None,
|
||||
llm_response_metadata=response_metadata,
|
||||
)
|
||||
|
||||
|
||||
def _execution_failure_report(instr: Any, error: str) -> dict[str, Any] | None:
|
||||
"""Build a durable diagnostic report when a report instruction cannot run."""
|
||||
if not getattr(instr, "report_sinks", None):
|
||||
@@ -295,6 +671,7 @@ def _execution_failure_report(instr: Any, error: str) -> dict[str, Any] | None:
|
||||
def _validate_output(
|
||||
raw_output: Any,
|
||||
instr: Any,
|
||||
allow_list: set[str] | None = None,
|
||||
) -> tuple[list[TaskSpec], dict[str, Any] | None, str | None]:
|
||||
"""Parse raw LLM output into TaskSpecs and optional report payload.
|
||||
|
||||
@@ -349,6 +726,28 @@ def _validate_output(
|
||||
source_type="instruction",
|
||||
source_id=instr.id,
|
||||
))
|
||||
|
||||
# Happy-path producer guardrails (WP-0016-T04): the whole document already
|
||||
# passed schema validation, so recommendations are schema-valid; still apply
|
||||
# the count cap, structural caps, and reference allow-list, quarantining any
|
||||
# offenders rather than emitting them. Report shape only changes when an item
|
||||
# is actually quarantined.
|
||||
if isinstance(report, dict) and isinstance(report.get("recommendations"), list):
|
||||
item_schema, max_items = _report_contract(instr)
|
||||
kept, quarantined = _partition_items(
|
||||
report["recommendations"], item_schema, max_items,
|
||||
run_schema=False, allow_list=allow_list,
|
||||
)
|
||||
if quarantined:
|
||||
report = {
|
||||
**report,
|
||||
"recommendations": kept,
|
||||
"status": "partial",
|
||||
"partial": True,
|
||||
"quarantined_count": len(quarantined),
|
||||
"quarantined_items": quarantined[:_QUARANTINE_LIMIT],
|
||||
}
|
||||
|
||||
return specs, report, None
|
||||
except (json.JSONDecodeError, AttributeError, KeyError, TypeError) as exc:
|
||||
return [], None, str(exc)
|
||||
|
||||
194
src/activity_core/schedule_health.py
Normal file
@@ -0,0 +1,194 @@
"""Missed-fire detection for cron schedules (ACTIVITY-WP-0014, T03).

Even with a catchup window configured, an operator wants to *know* when a fire
was missed, especially under ``misfire_policy: skip``, where missed fires are
dropped by design and leave no run and no failure event. This module turns the
schedule's own bookkeeping into an explicit verdict and an optional State Hub
alert so a miss is never invisible again.

Temporal already counts fires that were dropped because they fell outside the
catchup window in ``ScheduleInfo.num_actions_missed_catchup_window``. We surface
that, plus a staleness check on the most recent fire, as a ``ScheduleHealth``
verdict. The verdict logic is a pure function so it is testable without a live
Temporal server; ``check_schedule_health`` is the thin async reader.
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime, timedelta, timezone
|
||||
from typing import Any
|
||||
from uuid import UUID
|
||||
|
||||
import httpx
|
||||
|
||||
from activity_core.schedule_manager import schedule_id
|
||||
from activity_core.state_hub_write import idempotency_headers
|
||||
|
||||
_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class ScheduleHealth:
|
||||
"""Verdict for a single schedule's recent firing behaviour."""
|
||||
|
||||
activity_id: str
|
||||
healthy: bool
|
||||
missed_catchup_window: int
|
||||
last_fired_at: datetime | None
|
||||
staleness: timedelta | None
|
||||
reasons: list[str] = field(default_factory=list)
|
||||
|
||||
@property
|
||||
def missed(self) -> bool:
|
||||
return not self.healthy
|
||||
|
||||
|
||||
def evaluate_schedule_health(
|
||||
*,
|
||||
activity_id: str,
|
||||
missed_catchup_window: int,
|
||||
last_fired_at: datetime | None,
|
||||
now: datetime,
|
||||
expected_interval: timedelta | None = None,
|
||||
tolerance: timedelta = timedelta(minutes=10),
|
||||
) -> ScheduleHealth:
|
||||
"""Pure verdict: was a fire missed?
|
||||
|
||||
A schedule is unhealthy if Temporal dropped any fire past the catchup window,
|
||||
or — when ``expected_interval`` is known — if the most recent fire is older
|
||||
than one interval plus ``tolerance`` (i.e. a fire should have happened and
|
||||
did not).
|
||||
"""
|
||||
reasons: list[str] = []
|
||||
|
||||
if missed_catchup_window > 0:
|
||||
reasons.append(
|
||||
f"{missed_catchup_window} fire(s) dropped outside the catchup window"
|
||||
)
|
||||
|
||||
staleness: timedelta | None = None
|
||||
if last_fired_at is not None:
|
||||
staleness = now - last_fired_at
|
||||
if expected_interval is not None and staleness > expected_interval + tolerance:
|
||||
reasons.append(
|
||||
f"last fire was {staleness} ago, exceeding the expected "
|
||||
f"{expected_interval} interval"
|
||||
)
|
||||
elif expected_interval is not None:
|
||||
reasons.append("no recorded fire for a schedule that should have fired")
|
||||
|
||||
return ScheduleHealth(
|
||||
activity_id=activity_id,
|
||||
healthy=not reasons,
|
||||
missed_catchup_window=missed_catchup_window,
|
||||
last_fired_at=last_fired_at,
|
||||
staleness=staleness,
|
||||
reasons=reasons,
|
||||
)
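
The staleness rule is plain arithmetic: a schedule is stale when `now - last_fired_at` exceeds `expected_interval + tolerance`. A worked sketch for a daily schedule with the default 10-minute tolerance follows; `is_stale` is a hypothetical helper that isolates that one comparison, and the timestamps are invented.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def is_stale(
    last_fired_at: datetime,
    now: datetime,
    expected_interval: timedelta,
    tolerance: timedelta = timedelta(minutes=10),
) -> bool:
    """Hypothetical helper: the staleness comparison on its own."""
    return (now - last_fired_at) > expected_interval + tolerance


now = datetime(2025, 1, 2, 6, 5, tzinfo=timezone.utc)
daily = timedelta(days=1)

# Fired 24h05m ago: inside the 24h10m threshold, so still healthy.
on_time = is_stale(datetime(2025, 1, 1, 6, 0, tzinfo=timezone.utc), now, daily)

# Fired 24h15m ago: past the threshold, so a fire was missed.
late = is_stale(datetime(2025, 1, 1, 5, 50, tzinfo=timezone.utc), now, daily)
```

Both datetimes need to be timezone-aware (or both naive); Python raises `TypeError` when subtracting one kind from the other, and `check_schedule_health` defaults `now` to an aware UTC value.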
|
||||
|
||||
|
||||
def _extract_info(desc: Any) -> tuple[int, datetime | None]:
|
||||
"""Pull (missed_catchup_window, last_fired_at) from a ScheduleDescription.
|
||||
|
||||
Accesses are defensive so a Temporal SDK field rename degrades to "unknown"
|
||||
rather than raising inside an operational health check.
|
||||
"""
|
||||
info = getattr(desc, "info", None)
|
||||
missed = int(getattr(info, "num_actions_missed_catchup_window", 0) or 0)
|
||||
|
||||
last_fired: datetime | None = None
|
||||
recent = getattr(info, "recent_actions", None) or []
|
||||
times = [
|
||||
getattr(a, "scheduled_at", None) or getattr(a, "started_at", None)
|
||||
for a in recent
|
||||
]
|
||||
times = [t for t in times if t is not None]
|
||||
if times:
|
||||
last_fired = max(times)
|
||||
return missed, last_fired
|
||||
|
||||
|
||||
async def check_schedule_health(
|
||||
client: Any,
|
||||
activity_id: str | UUID,
|
||||
*,
|
||||
now: datetime | None = None,
|
||||
expected_interval: timedelta | None = None,
|
||||
tolerance: timedelta = timedelta(minutes=10),
|
||||
) -> ScheduleHealth:
|
||||
"""Describe the schedule for ``activity_id`` and evaluate its health."""
|
||||
now = now or datetime.now(tz=timezone.utc)
|
||||
handle = client.get_schedule_handle(schedule_id(activity_id))
|
||||
desc = await handle.describe()
|
||||
missed, last_fired = _extract_info(desc)
|
||||
return evaluate_schedule_health(
|
||||
activity_id=str(activity_id),
|
||||
missed_catchup_window=missed,
|
||||
last_fired_at=last_fired,
|
||||
now=now,
|
||||
expected_interval=expected_interval,
|
||||
tolerance=tolerance,
|
||||
)
|
||||
|
||||
|
||||
def post_missed_fire_alert(
|
||||
health: ScheduleHealth,
|
||||
*,
|
||||
state_hub_url: str | None = None,
|
||||
author: str = "activity-core",
|
||||
topic_id: str | None = None,
|
||||
workstream_id: str | None = None,
|
||||
timeout_seconds: float = 10.0,
|
||||
) -> dict[str, Any]:
|
||||
"""Post a ``schedule_miss`` progress event to State Hub for an unhealthy schedule.
|
||||
|
||||
No-op (returns ``status: ok``) when the schedule is healthy, so callers can
|
||||
invoke unconditionally.
|
||||
"""
|
||||
if health.healthy:
|
||||
return {"type": "schedule-miss-alert", "status": "ok"}
|
||||
|
||||
base_url = state_hub_url or os.environ.get("STATE_HUB_URL", _DEFAULT_STATE_HUB_URL)
|
||||
base_url = str(base_url).rstrip("/")
|
||||
|
||||
body: dict[str, Any] = {
|
||||
"event_type": "schedule_miss",
|
||||
"author": author,
|
||||
"summary": (
|
||||
f"Schedule {health.activity_id} missed a fire: "
|
||||
+ "; ".join(health.reasons)
|
||||
),
|
||||
"detail": {
|
||||
"activity_id": health.activity_id,
|
||||
"missed_catchup_window": health.missed_catchup_window,
|
||||
"last_fired_at": (
|
||||
health.last_fired_at.isoformat() if health.last_fired_at else None
|
||||
),
|
||||
"staleness_seconds": (
|
||||
health.staleness.total_seconds() if health.staleness else None
|
||||
),
|
||||
"reasons": health.reasons,
|
||||
},
|
||||
}
|
||||
if topic_id:
|
||||
body["topic_id"] = topic_id
|
||||
if workstream_id:
|
||||
body["workstream_id"] = workstream_id
|
||||
|
||||
# Dedup repeated alerts for the same missed window (same schedule + last fire).
|
||||
last_fired = health.last_fired_at.isoformat() if health.last_fired_at else "none"
|
||||
resp = httpx.post(
|
||||
f"{base_url}/progress/",
|
||||
json=body,
|
||||
headers=idempotency_headers("schedule_miss", health.activity_id, last_fired),
|
||||
timeout=timeout_seconds,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
return {
|
||||
"type": "schedule-miss-alert",
|
||||
"status": "posted",
|
||||
"progress_id": data.get("id"),
|
||||
}
|
||||
@@ -17,7 +17,6 @@ from temporalio.client import (
|
||||
Schedule,
|
||||
ScheduleActionStartWorkflow,
|
||||
ScheduleAlreadyRunningError,
|
||||
ScheduleBackfill,
|
||||
ScheduleCalendarSpec,
|
||||
ScheduleHandle,
|
||||
ScheduleOverlapPolicy,
|
||||
@@ -38,13 +37,49 @@ _ORCHESTRATOR_TASK_QUEUE = "orchestrator-tq"
|
||||
# RunActivityWorkflow detects this value and derives run dedup key from workflow_id.
|
||||
SCHEDULED_TRIGGER_KEY = "scheduled"
|
||||
|
||||
# T24: misfire_policy → ScheduleOverlapPolicy
|
||||
_MISFIRE_TO_OVERLAP: dict[str, ScheduleOverlapPolicy] = {
|
||||
"skip": ScheduleOverlapPolicy.SKIP,
|
||||
"catchup": ScheduleOverlapPolicy.BUFFER_ALL,
|
||||
"compress": ScheduleOverlapPolicy.BUFFER_ONE,
|
||||
# ACTIVITY-WP-0014: misfire_policy → run-miss recovery behaviour.
|
||||
#
|
||||
# A "missed fire" happens when the worker / Temporal is unavailable at trigger
|
||||
# time. Two Temporal levers together define the behaviour:
|
||||
# - catchup_window: how far back the server will recover missed fires once it
|
||||
# is healthy again. The previous code never set this, so a brief outage at
|
||||
# trigger time silently dropped the fire with no recovery and no signal.
|
||||
# - overlap: what to do when a (recovered) fire would start while a prior run
|
||||
# is still executing.
|
||||
#
|
||||
# Legacy values (catchup, compress) are aliased onto the explicit names.
|
||||
_MISFIRE_ALIASES: dict[str, str] = {
|
||||
"catchup": "catchup_all",
|
||||
"compress": "catchup_latest",
|
||||
}
|
||||
|
||||
# overlap policy + default catchup window (seconds) per normalised policy.
|
||||
_SKIP_WINDOW_SECONDS = 60
|
||||
_CATCHUP_ALL_WINDOW_SECONDS = 365 * 24 * 3600
|
||||
_CATCHUP_LATEST_WINDOW_SECONDS = 24 * 3600
|
||||
|
||||
_MISFIRE_TO_OVERLAP: dict[str, ScheduleOverlapPolicy] = {
|
||||
# Run on trigger or skip — recover nothing past a tiny grace window.
|
||||
"skip": ScheduleOverlapPolicy.SKIP,
|
||||
# Run on trigger or recover every missed fire during the outage window.
|
||||
"catchup_all": ScheduleOverlapPolicy.BUFFER_ALL,
|
||||
# Run on trigger or recover the most recent missed fire only; BUFFER_ONE
|
||||
# buffers at most one start and drops the rest, so a backlog never accumulates.
|
||||
"catchup_latest": ScheduleOverlapPolicy.BUFFER_ONE,
|
||||
}
|
||||
|
||||
_MISFIRE_DEFAULT_WINDOW: dict[str, int] = {
|
||||
"skip": _SKIP_WINDOW_SECONDS,
|
||||
"catchup_all": _CATCHUP_ALL_WINDOW_SECONDS,
|
||||
"catchup_latest": _CATCHUP_LATEST_WINDOW_SECONDS,
|
||||
}
|
||||
|
||||
|
||||
def _normalize_misfire_policy(misfire_policy: str) -> str:
|
||||
"""Map legacy aliases onto the explicit run-miss policy names."""
|
||||
canonical = _MISFIRE_ALIASES.get(misfire_policy, misfire_policy)
|
||||
return canonical if canonical in _MISFIRE_TO_OVERLAP else "skip"
|
||||
|
||||
|
||||
def schedule_id(activity_id: str | UUID) -> str:
|
||||
"""Return the canonical Temporal Schedule ID for an ActivityDefinition."""
|
||||
@@ -57,7 +92,15 @@ def smoke_schedule_id(activity_id: str | UUID) -> str:
|
||||
|
||||
|
||||
def _overlap_policy(misfire_policy: str) -> ScheduleOverlapPolicy:
|
||||
return _MISFIRE_TO_OVERLAP.get(misfire_policy, ScheduleOverlapPolicy.SKIP)
|
||||
return _MISFIRE_TO_OVERLAP[_normalize_misfire_policy(misfire_policy)]
|
||||
|
||||
|
||||
def _catchup_window(cfg: CronTriggerConfig) -> timedelta:
|
||||
"""Resolve the catchup window: explicit override, else the policy default."""
|
||||
if cfg.catchup_window_seconds is not None:
|
||||
return timedelta(seconds=cfg.catchup_window_seconds)
|
||||
policy = _normalize_misfire_policy(cfg.misfire_policy)
|
||||
return timedelta(seconds=_MISFIRE_DEFAULT_WINDOW[policy])
|
||||
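The policy resolution above can be sketched without Temporal installed. This is a minimal illustration, assuming the alias table and default windows shown in this hunk; `normalize_policy` and `resolve_window` are hypothetical stand-ins for `_normalize_misfire_policy` and `_catchup_window`, and the override argument stands in for `cfg.catchup_window_seconds`.

```python
from datetime import timedelta
from typing import Optional

# Legacy names mapped onto the explicit policy names, as in the diff.
ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}

# Default catchup window per normalised policy, in seconds (values from the diff).
DEFAULT_WINDOW_SECONDS = {
    "skip": 60,
    "catchup_all": 365 * 24 * 3600,
    "catchup_latest": 24 * 3600,
}


def normalize_policy(policy: str) -> str:
    """Resolve aliases; anything unrecognised falls back to "skip"."""
    canonical = ALIASES.get(policy, policy)
    return canonical if canonical in DEFAULT_WINDOW_SECONDS else "skip"


def resolve_window(policy: str, override_seconds: Optional[int] = None) -> timedelta:
    """An explicit override wins (including 0); otherwise use the policy default."""
    if override_seconds is not None:
        return timedelta(seconds=override_seconds)
    return timedelta(seconds=DEFAULT_WINDOW_SECONDS[normalize_policy(policy)])
```

Note that the override is compared against `None` rather than tested for truthiness, so an explicit window of zero seconds is honoured instead of being replaced by the default.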
|
||||
|
||||
def _build_schedule(defn: ActivityDefinition) -> Schedule:
|
||||
@@ -80,7 +123,10 @@ def _build_schedule(defn: ActivityDefinition) -> Schedule:
|
||||
jitter=timedelta(seconds=cfg.jitter_seconds) if cfg.jitter_seconds else None,
|
||||
)
|
||||
|
||||
policy = SchedulePolicy(overlap=_overlap_policy(cfg.misfire_policy))
|
||||
policy = SchedulePolicy(
|
||||
overlap=_overlap_policy(cfg.misfire_policy),
|
||||
catchup_window=_catchup_window(cfg),
|
||||
)
|
||||
state = ScheduleState(paused=not defn.enabled)
|
||||
|
||||
return Schedule(action=action, spec=spec, policy=policy, state=state)
|
||||
@@ -282,18 +328,10 @@ async def upsert_schedule(client: Client, defn: ActivityDefinition) -> ScheduleH
|
||||
else:
|
||||
await handle.pause(note="disabled via upsert_schedule")
|
||||
|
||||
# T24 catchup: backfill any fires missed in the last hour.
|
||||
if isinstance(defn.trigger_config, CronTriggerConfig):
|
||||
if defn.trigger_config.misfire_policy == "catchup":
|
||||
now = datetime.now(tz=timezone.utc)
|
||||
backfill_start = now - timedelta(hours=1)
|
||||
await handle.backfill(
|
||||
ScheduleBackfill(
|
||||
start_at=backfill_start,
|
||||
end_at=now,
|
||||
overlap=ScheduleOverlapPolicy.BUFFER_ALL,
|
||||
)
|
||||
)
|
||||
# ACTIVITY-WP-0014: missed-fire recovery is now handled natively by the
|
||||
# schedule's catchup_window (see _build_schedule), which the server applies
|
||||
# continuously after any outage — not only at upsert time. The previous
|
||||
# ad-hoc 1-hour backfill is therefore no longer needed.
|
||||
|
||||
return handle
|
||||
|
||||
|
||||
34
src/activity_core/state_hub_write.py
Normal file
@@ -0,0 +1,34 @@
"""Idempotency-keyed State Hub writes (ACTIVITY-WP-0014 T05).

Under the State Hub *beachhead* model, a write may be buffered locally while
central State Hub is unreachable and **flushed later, possibly with retries**.
To keep that flush safe — no duplicate progress / triage events — every write
carries a stable ``Idempotency-Key`` header derived deterministically from the
write's identity. The guarantee lives on the write itself and does **not** depend
on a live dedup read, so it holds even when the beachhead is serving offline.

activity-core does not implement the queue/cache (that is state-hub's beachhead);
it only emits the key so the beachhead / State Hub can dedup on flush. The header
passes untouched through the existing ``actcore-state-hub-bridge`` proxy and is
ignored by State Hub versions that do not yet honour it.
"""

from __future__ import annotations

IDEMPOTENCY_HEADER = "Idempotency-Key"


def idempotency_key(*parts: str | None) -> str:
    """Build a stable, header-safe idempotency key from identity parts.

    Empty/None parts are kept as empty segments so the key shape is stable across
    calls. Whitespace and control characters are collapsed to keep the value a
    valid single-line HTTP header.
    """
    raw = ":".join((p or "") for p in parts)
    return "".join(ch if 0x20 < ord(ch) < 0x7F else "_" for ch in raw) or "_"


def idempotency_headers(*parts: str | None) -> dict[str, str]:
    """Return the header dict to attach to a State Hub write."""
    return {IDEMPOTENCY_HEADER: idempotency_key(*parts)}
```
|
||||
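A short usage sketch of the key builder, to show what the dedup key for a schedule-miss alert might look like. The function body is copied from the new file (with `Optional` in place of `str | None` so it runs on older interpreters); the activity ID and timestamp are made-up illustrative values.

```python
from typing import Optional


def idempotency_key(*parts: Optional[str]) -> str:
    # Join identity parts with ":"; None becomes an empty segment.
    raw = ":".join((p or "") for p in parts)
    # Every character outside printable, non-space ASCII becomes "_".
    return "".join(ch if 0x20 < ord(ch) < 0x7F else "_" for ch in raw) or "_"


# Same identity parts give the same key, so a retried flush can be deduplicated.
key_first = idempotency_key("schedule_miss", "activity-123", "2026-06-26T08:00:00+00:00")
key_retry = idempotency_key("schedule_miss", "activity-123", "2026-06-26T08:00:00+00:00")

# A missing part keeps its slot, so the key shape stays stable.
key_missing = idempotency_key("schedule_miss", "activity-123", None)

# Spaces, newlines and non-ASCII characters are each replaced one-for-one.
key_sanitised = idempotency_key("a b\n")
```

As written, the replacement is character-for-character rather than a collapse of runs, and a plain space is also replaced because the comparison is strictly greater than `0x20`.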
@@ -15,6 +15,8 @@ import asyncio
|
||||
import logging
|
||||
import os
|
||||
import uuid
|
||||
from dataclasses import dataclass
|
||||
from typing import Sequence
|
||||
|
||||
from sqlalchemy import select
|
||||
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
|
||||
@@ -30,6 +32,20 @@ TEMPORAL_HOST = os.environ.get("TEMPORAL_HOST", "localhost:7233")
|
||||
TEMPORAL_NAMESPACE = os.environ.get("TEMPORAL_NAMESPACE", "default")
|
||||
|
||||
|
||||
@dataclass
|
||||
class ScheduleSyncResult:
|
||||
upserted: int = 0
|
||||
paused: int = 0
|
||||
deleted_orphans: int = 0
|
||||
|
||||
def to_dict(self) -> dict[str, int]:
|
||||
return {
|
||||
"upserted": self.upserted,
|
||||
"paused": self.paused,
|
||||
"deleted_orphans": self.deleted_orphans,
|
||||
}
|
||||
|
||||
|
||||
def _row_to_domain(row: ActivityDefinitionRow) -> ActivityDefinition:
|
||||
"""Convert an ORM row to a domain ActivityDefinition for schedule_manager."""
|
||||
return ActivityDefinition.model_validate(
|
||||
@@ -46,12 +62,82 @@ def _row_to_domain(row: ActivityDefinitionRow) -> ActivityDefinition:
|
||||
)
|
||||
|
||||
|
||||
async def sync(client: Client, db_url: str) -> None:
|
||||
def _valid_schedule_activity_id(defn: ActivityDefinition) -> str:
|
||||
if isinstance(defn.trigger_config, ScheduledTriggerConfig):
|
||||
return f"{defn.id}-once"
|
||||
return str(defn.id)
|
||||
|
||||
|
||||
async def _load_schedule_rows(
|
||||
session_factory: async_sessionmaker[AsyncSession],
|
||||
) -> Sequence[ActivityDefinitionRow]:
|
||||
async with session_factory() as session:
|
||||
return (
|
||||
await session.scalars(
|
||||
select(ActivityDefinitionRow).where(
|
||||
ActivityDefinitionRow.trigger_type.in_(["cron", "scheduled"])
|
||||
)
|
||||
)
|
||||
).all()
|
||||
|
||||
|
||||
async def sync_schedule_rows(
|
||||
client: Client,
|
||||
rows: Sequence[ActivityDefinitionRow],
|
||||
) -> ScheduleSyncResult:
|
||||
"""Reconcile Temporal Schedules against already-loaded definition rows."""
|
||||
valid_schedule_activity_ids: set[str] = set()
|
||||
result = ScheduleSyncResult()
|
||||
|
||||
for row in rows:
|
||||
defn = _row_to_domain(row)
|
||||
if not isinstance(
|
||||
defn.trigger_config,
|
||||
(CronTriggerConfig, ScheduledTriggerConfig),
|
||||
):
|
||||
continue
|
||||
|
||||
valid_schedule_activity_ids.add(_valid_schedule_activity_id(defn))
|
||||
|
||||
await upsert_schedule(client, defn)
|
||||
if defn.enabled:
|
||||
result.upserted += 1
|
||||
logger.info("upserted schedule for activity %s (%s)", defn.id, defn.name)
|
||||
else:
|
||||
result.paused += 1
|
||||
logger.info("upserted paused schedule for disabled activity %s", defn.id)
|
||||
|
||||
# Tombstone cleanup: remove Temporal Schedules with no matching DB row.
|
||||
existing_schedules = await list_schedules(client)
|
||||
for entry in existing_schedules:
|
||||
if entry["activity_id"] not in valid_schedule_activity_ids:
|
||||
await delete_schedule(client, entry["activity_id"])
|
||||
result.deleted_orphans += 1
|
||||
logger.info("deleted orphaned schedule %s", entry["schedule_id"])
|
||||
|
||||
logger.info(
|
||||
"sync_schedules complete — upserted=%d paused=%d deleted_orphans=%d",
|
||||
result.upserted,
|
||||
result.paused,
|
||||
result.deleted_orphans,
|
||||
)
|
||||
return result
|
||||
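The tombstone step in `sync_schedule_rows` amounts to a set difference between the IDs derived from DB rows and the IDs Temporal already knows. A minimal sketch follows, assuming that the IDs returned by `list_schedules` use the same form as `_valid_schedule_activity_id` (the diff does not show that function's return shape); `Definition`, `schedule_key`, and `find_orphans` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Definition:
    id: str
    trigger_type: str  # "cron" or "scheduled"


def schedule_key(defn: Definition) -> str:
    """One-shot (scheduled) definitions get a "-once" suffix, as in the diff."""
    return f"{defn.id}-once" if defn.trigger_type == "scheduled" else defn.id


def find_orphans(definitions: Iterable[Definition], existing_ids: Iterable[str]) -> List[str]:
    """Return existing schedule IDs that no current definition accounts for."""
    valid = {schedule_key(d) for d in definitions}
    return sorted(i for i in existing_ids if i not in valid)
```

With definitions `a` (cron) and `b` (scheduled), an existing schedule named `b` would count as an orphan, because only `b-once` is valid for a one-shot definition.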
|
||||
|
||||
async def sync_with_session_factory(
|
||||
client: Client,
|
||||
session_factory: async_sessionmaker[AsyncSession],
|
||||
) -> ScheduleSyncResult:
|
||||
"""Reconcile Temporal Schedules using an existing DB session factory."""
|
||||
return await sync_schedule_rows(client, await _load_schedule_rows(session_factory))
|
||||
|
||||
|
||||
async def sync(client: Client, db_url: str) -> ScheduleSyncResult:
|
||||
"""Reconcile Temporal Schedules against the ActivityDefinition table.
|
||||
|
||||
Steps:
|
||||
1. Load all enabled cron ActivityDefinitions from Postgres.
|
||||
2. Upsert a Temporal Schedule for each one.
|
||||
1. Load all cron/scheduled ActivityDefinitions from Postgres.
|
||||
2. Upsert a Temporal Schedule for each one, paused when disabled.
|
||||
3. Delete Temporal Schedules whose activity_id has no matching DB row
|
||||
(tombstone cleanup for deleted or trigger-type-changed definitions).
|
||||
"""
|
||||
@@ -59,55 +145,10 @@ async def sync(client: Client, db_url: str) -> None:
|
||||
session_factory = async_sessionmaker(engine, expire_on_commit=False)
|
||||
|
||||
try:
|
||||
async with session_factory() as session:
|
||||
rows = (
|
||||
await session.scalars(
|
||||
select(ActivityDefinitionRow).where(
|
||||
ActivityDefinitionRow.trigger_type.in_(["cron", "scheduled"])
|
||||
)
|
||||
)
|
||||
).all()
|
||||
return await sync_with_session_factory(client, session_factory)
|
||||
finally:
|
||||
await engine.dispose()
|
||||
|
||||
db_activity_ids: set[str] = set()
|
||||
upserted = 0
|
||||
skipped = 0
|
||||
|
||||
for row in rows:
|
||||
defn = _row_to_domain(row)
|
||||
if not isinstance(defn.trigger_config, (CronTriggerConfig, ScheduledTriggerConfig)):
|
||||
continue
|
||||
|
||||
db_activity_ids.add(str(defn.id))
|
||||
|
||||
if defn.enabled:
|
||||
await upsert_schedule(client, defn)
|
||||
upserted += 1
|
||||
logger.info("upserted schedule for activity %s (%s)", defn.id, defn.name)
|
||||
else:
|
||||
# Disabled definitions: schedule may exist (paused) — leave it;
|
||||
# upsert_schedule already handles the paused state.
|
||||
await upsert_schedule(client, defn)
|
||||
skipped += 1
|
||||
logger.info("upserted paused schedule for disabled activity %s", defn.id)
|
||||
|
||||
# Tombstone cleanup: remove Temporal Schedules with no matching DB row.
|
||||
existing_schedules = await list_schedules(client)
|
||||
deleted = 0
|
||||
for entry in existing_schedules:
|
||||
if entry["activity_id"] not in db_activity_ids:
|
||||
await delete_schedule(client, entry["activity_id"])
|
||||
deleted += 1
|
||||
logger.info("deleted orphaned schedule %s", entry["schedule_id"])
|
||||
|
||||
logger.info(
|
||||
"sync_schedules complete — upserted=%d skipped_disabled=%d deleted_orphans=%d",
|
||||
upserted,
|
||||
skipped,
|
||||
deleted,
|
||||
)
|
||||
|
||||
|
||||
async def main() -> None:
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
@@ -116,7 +157,13 @@ async def main() -> None:
|
||||
raise RuntimeError("ACTCORE_DB_URL is required")
|
||||
|
||||
client = await Client.connect(TEMPORAL_HOST, namespace=TEMPORAL_NAMESPACE)
|
||||
await sync(client, db_url)
|
||||
result = await sync(client, db_url)
|
||||
print(
|
||||
"Synced schedules: "
|
||||
f"upserted={result.upserted} "
|
||||
f"paused={result.paused} "
|
||||
f"deleted_orphans={result.deleted_orphans}"
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
97
src/activity_core/sync_service.py
Normal file
@@ -0,0 +1,97 @@
"""Shared ActivityDefinition/event type/schedule sync orchestration."""

from __future__ import annotations

from typing import Any

from temporalio.client import Client

from activity_core.event_type_registry import sync_event_types
from activity_core.sync_activity_definitions import sync as sync_activity_definitions
from activity_core.sync_schedules import ScheduleSyncResult, sync_with_session_factory

_MAX_ERRORS = 20
_MAX_ERROR_MESSAGE_LENGTH = 1000


def _empty_result(
    *,
    definitions: bool,
    schedules: bool,
    event_types: bool,
) -> dict[str, Any]:
    return {
        "ok": True,
        "ran": {
            "definitions": definitions,
            "schedules": schedules,
            "event_types": event_types,
        },
        "definitions": {"synced": 0},
        "event_types": {"synced": 0},
        "schedules": ScheduleSyncResult().to_dict(),
        "errors": [],
    }


def _record_error(result: dict[str, Any], stage: str, exc: Exception) -> None:
    errors = result["errors"]
    if len(errors) >= _MAX_ERRORS:
        return
    errors.append(
        {
            "stage": stage,
            "type": type(exc).__name__,
            "message": str(exc)[:_MAX_ERROR_MESSAGE_LENGTH],
        }
    )
    result["ok"] = False


async def run_sync(
    *,
    session_factory: Any,
    temporal_client: Client | None,
    definitions: bool = True,
    schedules: bool = True,
    event_types: bool = False,
) -> dict[str, Any]:
    """Run the requested sync stages and return bounded operator-facing status.

    The orchestration deliberately accepts its database and Temporal
    dependencies as arguments so startup and the API can share the same behavior
    without creating another global runtime.
    """
    result = _empty_result(
        definitions=definitions,
        schedules=schedules,
        event_types=event_types,
    )

    if definitions:
        try:
            result["definitions"]["synced"] = await sync_activity_definitions(
                session_factory
            )
        except Exception as exc:  # pragma: no cover - exercised through tests
            _record_error(result, "definitions", exc)

    if event_types:
        try:
            result["event_types"]["synced"] = await sync_event_types(session_factory)
        except Exception as exc:  # pragma: no cover - exercised through tests
            _record_error(result, "event_types", exc)

    if schedules:
        try:
            if temporal_client is None:
                raise RuntimeError("Temporal client is required for schedule sync")
            schedule_result = await sync_with_session_factory(
                temporal_client,
                session_factory,
            )
            result["schedules"] = schedule_result.to_dict()
        except Exception as exc:  # pragma: no cover - exercised through tests
            _record_error(result, "schedules", exc)

    return result
|
||||
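The "bounded operator-facing status" in `run_sync` comes from two caps: at most 20 recorded errors, each message cut to 1000 characters. A self-contained sketch of that behaviour, with the helper reproduced under a public name and a deliberately noisy failure loop that is purely illustrative:

```python
from typing import Any, Dict

MAX_ERRORS = 20
MAX_ERROR_MESSAGE_LENGTH = 1000


def record_error(result: Dict[str, Any], stage: str, exc: Exception) -> None:
    """Append a truncated error entry unless the cap is already reached."""
    errors = result["errors"]
    if len(errors) >= MAX_ERRORS:
        return
    errors.append(
        {
            "stage": stage,
            "type": type(exc).__name__,
            "message": str(exc)[:MAX_ERROR_MESSAGE_LENGTH],
        }
    )
    result["ok"] = False


# Simulate 25 failures with very long messages.
status: Dict[str, Any] = {"ok": True, "errors": []}
for _ in range(25):
    record_error(status, "schedules", RuntimeError("x" * 5000))
```

After the loop the status holds 20 entries of 1000 characters each, so the payload stays small however many stages fail or how verbose the exceptions are.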
@@ -46,8 +46,7 @@ from activity_core.activities import (
|
||||
)
|
||||
from activity_core.db import make_engine
|
||||
from sqlalchemy.ext.asyncio import async_sessionmaker
|
||||
from activity_core.sync_activity_definitions import sync as sync_activity_defs
|
||||
from activity_core.sync_schedules import sync as sync_schedules
|
||||
from activity_core.sync_service import run_sync
|
||||
from activity_core.workflows import RunActivityWorkflow, TaskExecutorWorkflow
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
@@ -77,20 +76,26 @@ async def run() -> None:
|
||||
TEMPORAL_HOST, namespace=TEMPORAL_NAMESPACE, runtime=runtime
|
||||
)
|
||||
|
||||
# T45: Sync ActivityDefinition files into DB before schedule sync.
|
||||
logger.info("Syncing ActivityDefinition files...")
|
||||
logger.info("Syncing ActivityDefinitions and Temporal Schedules...")
|
||||
sync_engine = make_engine(db_url)
|
||||
session_factory = async_sessionmaker(sync_engine, expire_on_commit=False)
|
||||
try:
|
||||
session_factory = async_sessionmaker(make_engine(db_url), expire_on_commit=False)
|
||||
await sync_activity_defs(session_factory)
|
||||
except Exception:
|
||||
logger.exception("activity definition sync failed — continuing worker startup")
|
||||
|
||||
# T23: Sync Temporal Schedules with the DB before workers start accepting tasks.
|
||||
logger.info("Syncing Temporal Schedules with ActivityDefinition DB...")
|
||||
try:
|
||||
await sync_schedules(client, db_url)
|
||||
except Exception:
|
||||
logger.exception("schedule sync failed — continuing worker startup")
|
||||
sync_result = await run_sync(
|
||||
session_factory=session_factory,
|
||||
temporal_client=client,
|
||||
definitions=True,
|
||||
schedules=True,
|
||||
event_types=False,
|
||||
)
|
||||
for error in sync_result["errors"]:
|
||||
logger.error(
|
||||
"startup sync %s failed — %s: %s",
|
||||
error["stage"],
|
||||
error["type"],
|
||||
error["message"],
|
||||
)
|
||||
finally:
|
||||
await sync_engine.dispose()
|
||||
|
||||
orchestrator_worker = Worker(
|
||||
client,
|
||||
|
||||
5
tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json
vendored
Normal file
@@ -0,0 +1,5 @@
|
||||
{
|
||||
"_note": "PARTIAL 4000-char preview of the 2026-06-26 daily-triage validation failure (retry attempt). Full payload not recoverable from activity-core: complete() drops finish_reason; report sink caps raw at 4000 chars; the JSON break is at char 5268 (beyond this preview). Full response would require llm-connect producer-side logs on railiance01.",
|
||||
"validation_error": "Expecting ',' delimiter: line 136 column 22 (char 5268)",
|
||||
"raw_output_preview": "{\n \"summary\": \"Triage report focusing on high-priority workstreams with pending human intervention or critical dependencies, and addressing recently cleared dependencies to unblock progress.\",\n \"recommendations\": [\n {\n \"rank\": 1,\n \"candidate\": \"2731fece-6c49-45b8-ab8a-4ea6c04ac603\",\n \"action\": \"work-next\",\n \"why\": \"A critical dependency (T03 - Configure bounded OpenBao token roles and policies) for this workstream has been cleared, unblocking significant progress on credential management. This workstream has 8 todo tasks and no waits, indicating it's ready for immediate action.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 5.0,\n \"strategic_value\": 5,\n \"time_criticality\": 5,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 5,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 2,\n \"candidate\": \"bd086c41-287d-4a4e-8ac5-9ab270f14d72\",\n \"action\": \"needs-human\",\n \"why\": \"This high-priority workstream has a 'needs_human' task (T04 - Provision the runtime API key outside Git) and is currently blocked by 3 'wait' tasks. Human intervention is required to unblock progress.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.7,\n \"strategic_value\": 5,\n \"time_criticality\": 4,\n \"risk_reduction\": 5,\n \"opportunity_enablement\": 4,\n \"job_size\": 3\n }\n },\n {\n \"rank\": 3,\n \"candidate\": \"9b56414a-c71f-4e72-9b2b-d2166aaf50d0\",\n \"action\": \"needs-human\",\n \"why\": \"This high-priority workstream has a 'needs_human' task (Task: Execute Live Ops-Hub Bootstrap) and is currently blocked by a 'wait' task. 
Human intervention is required to proceed with the bootstrap.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.7,\n \"strategic_value\": 5,\n \"time_criticality\": 4,\n \"risk_reduction\": 5,\n \"opportunity_enablement\": 4,\n \"job_size\": 3\n }\n },\n {\n \"rank\": 4,\n \"candidate\": \"84e17675-0d15-4268-a8bd-540124d37018\",\n \"action\": \"needs-human\",\n \"why\": \"This workstream has 4 'needs_human' tasks, including 'T02 \u2014 Resolve Forgejo production design decisions', indicating significant human input is required to move forward with the migration.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.0,\n \"strategic_value\": 4,\n \"time_criticality\": 4,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 5,\n \"candidate\": \"5646e13a-13af-4724-bca6-3c0d86f96733\",\n \"action\": \"needs-human\",\n \"why\": \"This workstream has a 'needs_human' task ('Three-Run Calibration Feedback') and is currently in a 'wait' state. Human feedback is crucial for operational hardening.\",\n \"confidence\": \"medium\",\n \"wsjf\": {\n \"score\": 3.7,\n \"strategic_value\": 4,\n \"time_criticality\": 3,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 6,\n \"candidate\": \"896ace77-21b3-450b-8fb7-254aefc8c570\",\n \"action\": \"close-out\",\n \"why\": \"The task 'Wire activity-core to the live service' has been resolved, and the workstream shows 2 progress tasks with 0 todo/wait tasks. 
This indicates the deployment is likely complete or nearing completion and ready for close-out after verification.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 3.7,\n \"strategic_value\": 4,\n \"time_criticality\": 3,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 7,\n \"candidate\": \"656e435d-3a00-4f5e-a38e-114467f9062e\",\n \"action\": \"work-next\",\n \"why\": \"This high-priority workstream has a single 'wait' task ('Task: Activate Ops-Hub Widgets In Inter-Hub') and no 'needs_human' tasks. It appears ready for the next step to activate the widgets.\",\n \"confidence\": \"medium\",\n \"wsjf"
|
||||
}
|
||||
@@ -88,6 +88,43 @@ def test_for_each_binds_each_list_item_before_condition_and_action_rendering() -
|
||||
]
|
||||
|
||||
|
||||
def test_for_each_can_gate_registry_hygiene_gaps_on_signal() -> None:
|
||||
rules = [
|
||||
{
|
||||
"id": "flag-registry-hygiene-gap",
|
||||
"for_each": "context.gaps",
|
||||
"bind_as": "g",
|
||||
"condition": 'context.g.hygiene_signal != ""',
|
||||
"action": {
|
||||
"task_template": "Close registry hygiene gap for {context.g.repo}",
|
||||
"target_repo": "context.g.repo",
|
||||
"priority": "medium",
|
||||
"labels": ["registry-hygiene", "{context.g.hygiene_signal}"],
|
||||
},
|
||||
}
|
||||
]
|
||||
context = {
|
||||
"gaps": [
|
||||
{
|
||||
"repo": "reuse-surface",
|
||||
"hygiene_signal": "empty_capability_scaffold",
|
||||
},
|
||||
{
|
||||
"repo": "activity-core",
|
||||
"hygiene_signal": "",
|
||||
},
|
||||
]
|
||||
}
|
||||
|
||||
specs = expand_rule_actions(rules, _Event(), context)
|
||||
|
||||
assert [spec["target_repo"] for spec in specs] == ["reuse-surface"]
|
||||
assert specs[0]["labels"] == [
|
||||
"registry-hygiene",
|
||||
"empty_capability_scaffold",
|
||||
]
|
||||
|
||||
|
||||
def test_for_each_rejects_non_path_expression() -> None:
|
||||
rules = [
|
||||
{
|
||||
|
||||
@@ -12,6 +12,7 @@ Covers:
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
from types import SimpleNamespace
|
||||
from typing import Any
|
||||
|
||||
@@ -333,7 +334,14 @@ def test_execute_instruction_forwards_output_schema_to_llm_connect(tmp_path, mon
|
||||
def test_execute_instruction_with_audit_accepts_report_payload():
|
||||
report_data = {
|
||||
"summary": "State Hub has loose ends.",
|
||||
"recommendations": [{"action": "revisit", "candidate": "CUST-WP-0045"}],
|
||||
"recommendations": [
|
||||
{
|
||||
"rank": 1,
|
||||
"action": "revisit",
|
||||
"candidate": "CUST-WP-0045",
|
||||
"why": "Loose ends need attention.",
|
||||
}
|
||||
],
|
||||
}
|
||||
llm = _CountingLLM([json.dumps(report_data)])
|
||||
instr = _instr(
|
||||
@@ -353,7 +361,14 @@ def test_execute_instruction_with_audit_accepts_report_payload():
|
||||
def test_execute_instruction_with_audit_accepts_fenced_report_payload():
|
||||
report_data = {
|
||||
"summary": "State Hub has loose ends.",
|
||||
"recommendations": [{"action": "revisit", "candidate": "CUST-WP-0045"}],
|
||||
"recommendations": [
|
||||
{
|
||||
"rank": 1,
|
||||
"action": "revisit",
|
||||
"candidate": "CUST-WP-0045",
|
||||
"why": "Loose ends need attention.",
|
||||
}
|
||||
],
|
||||
}
|
||||
llm = _CountingLLM([f"```json\n{json.dumps(report_data)}\n```"])
|
||||
instr = _instr(
|
||||
@@ -389,6 +404,216 @@ def test_execute_instruction_with_audit_rejects_invalid_report_schema():
|
||||
assert llm.call_count == 2
|
||||
|
||||
|
||||
# ── WP-0016-T03 resilient report recovery ─────────────────────────────────────
|
||||
|
||||
def _valid_rec(rank: int) -> dict[str, Any]:
|
||||
return {
|
||||
"rank": rank,
|
||||
"candidate": f"WS-{rank}",
|
||||
"action": "work-next",
|
||||
"why": f"reason {rank}",
|
||||
"wsjf": {"score": 5.0},
|
||||
}
|
||||
|
||||
|
||||
def _pretty_triage_with_truncated_tail(num_valid: int) -> str:
|
||||
body = ",\n".join(" " + json.dumps(_valid_rec(i)) for i in range(1, num_valid + 1))
|
||||
# Trailing object is cut off mid-string — the whole document is invalid JSON,
|
||||
# reproducing the 2026-06-26 failure shape (valid prefix, broken tail).
|
||||
return (
|
||||
'{\n "summary": "Daily triage.",\n "recommendations": [\n'
|
||||
+ body
|
||||
+ ',\n {\n "rank": '
|
||||
+ str(num_valid + 1)
|
||||
+ ',\n "candidate": "WS-X",\n "action": "work-'
|
||||
)
|
||||
|
||||
|
||||
def test_resilient_report_recovers_valid_prefix_and_quarantines_truncated_tail():
|
||||
raw = _pretty_triage_with_truncated_tail(7)
|
||||
llm = _CountingLLM([raw, raw])
|
||||
instr = _instr(
|
||||
id="daily-triage-report",
|
||||
prompt="Report.",
|
||||
trusted_fields=[],
|
||||
output_schema="schemas/daily-triage-report.json",
|
||||
report_sinks=[{"type": "working-memory"}],
|
||||
)
|
||||
|
||||
result = execute_instruction_with_audit(instr, _Event(), {}, llm)
|
||||
|
||||
assert result.output_validated is True
|
||||
assert result.review_required is True
|
||||
assert result.report is not None
|
||||
assert result.report["partial"] is True
|
||||
assert len(result.report["recommendations"]) == 7
|
||||
assert result.report["summary"] == "Daily triage."
|
||||
assert result.report["quarantined_count"] >= 1
|
||||
# The broken tail is dropped — either as an unparseable/truncated span or,
|
||||
# if _try_repair salvages its structure, as a schema-invalid item. Either way
|
||||
# it carries a diagnostic error and never pollutes the surviving report.
|
||||
assert result.report["quarantined_items"][0]["error"]
|
||||
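The recovery implementation is not part of this diff, so the following is only one possible way to salvage a valid prefix from a truncated document, assuming the array items are JSON objects. It walks the `recommendations` array with `json.JSONDecoder.raw_decode` and stops at the first item that fails to parse; `recover_prefix` is a hypothetical name, not the project's function.

```python
import json
from typing import Any, List, Tuple


def recover_prefix(raw: str, key: str = "recommendations") -> Tuple[List[Any], str]:
    """Return (complete items, unparseable tail) for the array stored under `key`."""
    decoder = json.JSONDecoder()
    key_pos = raw.find(f'"{key}"')
    if key_pos == -1:
        return [], raw
    pos = raw.find("[", key_pos)
    if pos == -1:
        return [], raw[key_pos:]
    pos += 1
    items: List[Any] = []
    while True:
        # Skip separators between array elements.
        while pos < len(raw) and raw[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(raw) or raw[pos] == "]":
            return items, ""
        try:
            item, pos = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            # Everything from here on is the broken tail to quarantine.
            return items, raw[pos:]
        items.append(item)


truncated = '{"summary": "x", "recommendations": [{"rank": 1}, {"rank": 2}, {"rank": 3, "cand'
recovered, tail = recover_prefix(truncated)
```

The recovered items would still need per-item schema validation before being kept, which is what the second test in this group exercises.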
|
||||
|
||||
def test_resilient_report_quarantines_one_bad_item_among_valid():
|
||||
recs = [_valid_rec(1), {"candidate": "WS-2", "action": "x", "why": "no rank"}, _valid_rec(3)]
|
||||
raw = json.dumps({"summary": "Triage.", "recommendations": recs})
|
||||
llm = _CountingLLM([raw, raw])
|
||||
instr = _instr(
|
||||
id="daily-triage-report",
|
||||
prompt="Report.",
|
||||
trusted_fields=[],
|
||||
output_schema="schemas/daily-triage-report.json",
|
||||
report_sinks=[{"type": "working-memory"}],
|
||||
)
|
||||
|
||||
result = execute_instruction_with_audit(instr, _Event(), {}, llm)
|
||||
|
||||
assert result.output_validated is True
|
||||
assert result.report["partial"] is True
|
||||
assert len(result.report["recommendations"]) == 2
|
||||
assert result.report["quarantined_count"] == 1
|
||||
assert "rank" in result.report["quarantined_items"][0]["error"]
|
||||
|
||||
|
||||
# ── WP-0016-T04 producer guardrails ───────────────────────────────────────────
|
||||
|
||||
def _triage_instr() -> SimpleNamespace:
|
||||
return _instr(
|
||||
id="daily-triage-report",
|
||||
prompt="Report.",
|
||||
trusted_fields=[],
|
||||
output_schema="schemas/daily-triage-report.json",
|
||||
report_sinks=[{"type": "working-memory"}],
|
||||
)
|
||||
|
||||
|
||||
def test_guardrail_count_cap_on_valid_happy_path():
|
||||
# 9 fully-valid recommendations in a syntactically valid document: schema
|
||||
# validation passes, but the maxItems=7 count cap must keep 7 and quarantine 2.
|
||||
recs = [_valid_rec(i) for i in range(1, 10)]
|
||||
raw = json.dumps({"summary": "Triage.", "recommendations": recs})
|
||||
llm = _CountingLLM([raw])
|
||||
|
||||
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
|
||||
|
||||
assert llm.call_count == 1 # no retry — the document was valid
|
||||
assert result.report["partial"] is True
|
||||
assert len(result.report["recommendations"]) == 7
|
||||
assert result.report["quarantined_count"] == 2
|
||||
assert all(q["reason"] == "over_limit" for q in result.report["quarantined_items"])
|
||||
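The count cap asserted above can be pictured as a simple slice. This is a hedged sketch of the expected behaviour only: `apply_count_cap` is a hypothetical helper, and the report keys mirror the ones the test reads (`partial`, `quarantined_count`, `quarantined_items`, reason `over_limit`).

```python
from typing import Any, Dict, List


def apply_count_cap(recs: List[Dict[str, Any]], max_items: int = 7) -> Dict[str, Any]:
    """Keep the first `max_items` recommendations and quarantine the rest."""
    kept = recs[:max_items]
    quarantined = [{"reason": "over_limit", "item": rec} for rec in recs[max_items:]]
    return {
        "recommendations": kept,
        "partial": bool(quarantined),
        "quarantined_count": len(quarantined),
        "quarantined_items": quarantined,
    }


# Nine valid recommendations, as in the test above.
capped = apply_count_cap([{"rank": i, "candidate": f"WS-{i}"} for i in range(1, 10)])
```

Keeping the head of the list assumes recommendations arrive in rank order, which the test data does but the diff does not state as a guarantee.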
|
||||
|
||||
def test_guardrail_oversized_string_quarantined():
|
||||
big = _valid_rec(2)
|
||||
big["why"] = "x" * 5000 # exceeds _MAX_STRING_LEN
|
||||
raw = json.dumps({"summary": "Triage.", "recommendations": [_valid_rec(1), big]})
|
||||
llm = _CountingLLM([raw])
|
||||
|
||||
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
|
||||
|
||||
assert len(result.report["recommendations"]) == 1
|
||||
assert result.report["quarantined_count"] == 1
|
||||
assert result.report["quarantined_items"][0]["reason"] == "guardrail"
|
||||
|
||||
|
||||
def test_guardrail_allow_list_rejects_unknown_candidate():
|
||||
raw = json.dumps({
|
||||
"summary": "Triage.",
|
||||
"recommendations": [_valid_rec(1), _valid_rec(2)], # candidates WS-1, WS-2
|
||||
})
|
||||
llm = _CountingLLM([raw])
|
||||
context = {"known_candidates": ["WS-1"]}
|
||||
|
||||
result = execute_instruction_with_audit(_triage_instr(), _Event(), context, llm)
|
||||
|
||||
assert len(result.report["recommendations"]) == 1
|
||||
assert result.report["recommendations"][0]["candidate"] == "WS-1"
|
||||
assert result.report["quarantined_items"][0]["reason"] == "allow_list"
|
||||
|
||||
|
||||
def _nested(depth: int) -> dict[str, Any]:
|
||||
node: dict[str, Any] = {"leaf": 1}
|
||||
for _ in range(depth):
|
||||
node = {"a": node}
|
||||
return node
|
||||
|
||||
|
||||
def test_guardrail_over_depth_quarantined():
|
||||
deep = _valid_rec(2)
|
||||
deep["extra"] = _nested(12) # well past _MAX_DEPTH
|
||||
raw = json.dumps({"summary": "Triage.", "recommendations": [_valid_rec(1), deep]})
|
||||
llm = _CountingLLM([raw])
|
||||
|
||||
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
|
||||
|
||||
assert len(result.report["recommendations"]) == 1
|
||||
assert result.report["quarantined_count"] == 1
|
||||
assert result.report["quarantined_items"][0]["reason"] == "guardrail"
|
||||
assert "depth" in result.report["quarantined_items"][0]["error"]
|
||||
|
||||
|
||||
def test_resilient_recovery_against_real_2026_06_26_fixture():
|
||||
# The actual captured failure payload (4000-char preview, truncated at the 7th
|
||||
# recommendation) — the run that reset the WP-0006-T03 streak. Before WP-0016
|
||||
# this discarded the whole report; now it must recover the valid prefix.
|
||||
fixture = json.loads(
|
||||
Path("tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json")
|
||||
.read_text(encoding="utf-8")
|
||||
)
|
||||
raw = fixture["raw_output_preview"]
|
||||
llm = _CountingLLM([raw, raw])
|
||||
|
||||
result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
|
||||
|
||||
assert result.output_validated is True
|
||||
assert result.report["partial"] is True
|
||||
# Six recommendations are fully intact before the truncation point.
|
||||
assert len(result.report["recommendations"]) >= 6
|
||||
assert all("rank" in rec and "candidate" in rec for rec in result.report["recommendations"])
|
||||
|
||||
|
||||
|
||||
class _MetadataBadLLM:
|
||||
def __init__(self) -> None:
|
||||
self.call_count = 0
|
||||
self.last_response_metadata: dict[str, Any] | None = None
|
||||
|
||||
def complete(
|
||||
self,
|
||||
prompt: str,
|
||||
model: str = "",
|
||||
config: dict | None = None,
|
||||
) -> str:
|
||||
self.call_count += 1
|
||||
self.last_response_metadata = {
|
||||
"finish_reason": "length",
|
||||
"usage": {"input_tokens": 1100, "output_tokens": 1200},
|
||||
}
|
||||
return ("x" * 9000) + "{"
|
||||
|
||||
|
||||
def test_invalid_report_preserves_response_metadata_and_long_preview():
|
||||
llm = _MetadataBadLLM()
|
||||
instr = _instr(
|
||||
id="daily-triage-report",
|
||||
prompt="Report.",
|
||||
trusted_fields=[],
|
||||
report_sinks=[{"type": "working-memory", "path": "/tmp"}],
|
||||
)
|
||||
|
||||
result = execute_instruction_with_audit(instr, _Event(), {}, llm)
|
||||
|
||||
assert llm.call_count == 2
|
||||
assert result.output_validated is False
|
||||
assert result.llm_response_metadata == {
|
||||
"finish_reason": "length",
|
||||
"usage": {"input_tokens": 1100, "output_tokens": 1200},
|
||||
}
|
||||
assert result.report["llm_response_metadata"] == result.llm_response_metadata
|
||||
assert len(result.report["raw_output_preview"]) > 4000
|
||||
|
||||
|
||||
def test_execute_instruction_with_audit_preserves_invalid_report_with_sinks(
|
||||
tmp_path,
|
||||
monkeypatch,
|
||||
|
||||
114
tests/test_admin_sync_api.py
Normal file
@@ -0,0 +1,114 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Any
|
||||
|
||||
import pytest
|
||||
|
||||
from activity_core import api
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_admin_sync_definitions_only_does_not_require_temporal(
|
||||
monkeypatch,
|
||||
) -> None:
|
||||
seen: dict[str, Any] = {}
|
||||
|
||||
async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
|
||||
seen.update(kwargs)
|
||||
return {"ok": True, "ran": {"definitions": True}}
|
||||
|
||||
monkeypatch.setattr(api, "_session_factory", object())
|
||||
monkeypatch.setattr(api, "_temporal_client", None)
|
||||
monkeypatch.setattr(api, "run_sync", fake_run_sync)
|
||||
|
||||
result = await api.admin_sync(
|
||||
definitions=True,
|
||||
schedules=False,
|
||||
event_types=False,
|
||||
)
|
||||
|
||||
assert result == {"ok": True, "ran": {"definitions": True}}
|
||||
assert seen["session_factory"] is api._session_factory
|
||||
assert seen["temporal_client"] is None
|
||||
assert seen["definitions"] is True
|
||||
assert seen["schedules"] is False
|
||||
assert seen["event_types"] is False
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_admin_sync_schedules_only_passes_temporal(monkeypatch) -> None:
|
||||
temporal = object()
|
||||
seen: dict[str, Any] = {}
|
||||
|
||||
async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
|
||||
seen.update(kwargs)
|
||||
return {
|
||||
"ok": True,
|
||||
"schedules": {
|
||||
"upserted": 1,
|
||||
"paused": 0,
|
||||
"deleted_orphans": 0,
|
||||
},
|
||||
}
|
||||
|
||||
monkeypatch.setattr(api, "_session_factory", object())
|
||||
monkeypatch.setattr(api, "_temporal_client", temporal)
|
||||
monkeypatch.setattr(api, "run_sync", fake_run_sync)
|
||||
|
||||
result = await api.admin_sync(
|
||||
definitions=False,
|
||||
schedules=True,
|
||||
event_types=False,
|
||||
)
|
||||
|
||||
assert result["schedules"]["upserted"] == 1
|
||||
assert seen["temporal_client"] is temporal
|
||||
assert seen["definitions"] is False
|
||||
assert seen["schedules"] is True
|
||||
assert seen["event_types"] is False
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_admin_sync_all_sync_returns_failure_result(monkeypatch) -> None:
|
||||
async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
|
||||
return {
|
||||
"ok": False,
|
||||
"ran": {
|
||||
"definitions": kwargs["definitions"],
|
||||
"schedules": kwargs["schedules"],
|
||||
"event_types": kwargs["event_types"],
|
||||
},
|
||||
"errors": [
|
||||
{
|
||||
"stage": "event_types",
|
||||
"type": "RuntimeError",
|
||||
"message": "bad event type",
|
||||
}
|
||||
],
|
||||
}
|
||||
|
||||
monkeypatch.setattr(api, "_session_factory", object())
|
||||
monkeypatch.setattr(api, "_temporal_client", object())
|
||||
monkeypatch.setattr(api, "run_sync", fake_run_sync)
|
||||
|
||||
result = await api.admin_sync(
|
||||
definitions=True,
|
||||
schedules=True,
|
||||
event_types=True,
|
||||
)
|
||||
|
||||
assert result == {
|
||||
"ok": False,
|
||||
"ran": {
|
||||
"definitions": True,
|
||||
"schedules": True,
|
||||
"event_types": True,
|
||||
},
|
||||
"errors": [
|
||||
{
|
||||
"stage": "event_types",
|
||||
"type": "RuntimeError",
|
||||
"message": "bad event type",
|
||||
}
|
||||
],
|
||||
}
|
||||
289
tests/test_automation_status.py
Normal file
@@ -0,0 +1,289 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from zoneinfo import ZoneInfo
|
||||
|
||||
from activity_core import automation_status as status
|
||||
|
||||
ACTIVITY_ID = "00000000-0000-0000-0000-000000000123"
|
||||
|
||||
|
||||
def _window():
|
||||
return status.resolve_window(
|
||||
"2026-06-26",
|
||||
"2026-06-29",
|
||||
"Europe/Berlin",
|
||||
)
|
||||
|
||||
|
||||
def _definition(enabled: bool = True):
|
||||
return {
|
||||
"id": ACTIVITY_ID,
|
||||
"name": "Daily Check",
|
||||
"enabled": enabled,
|
||||
"trigger_type": "cron",
|
||||
"trigger_config": {
|
||||
"trigger_type": "cron",
|
||||
"cron_expression": "0 9 * * *",
|
||||
"timezone": "Europe/Berlin",
|
||||
"misfire_policy": "skip",
|
||||
},
|
||||
"source": "test",
|
||||
}
|
||||
|
||||
|
||||
def test_friday_shortcut_resolves_to_previous_friday_start() -> None:
|
||||
now = datetime(2026, 6, 29, 12, 0, tzinfo=ZoneInfo("Europe/Berlin"))
|
||||
|
||||
window = status.resolve_window("friday", None, "Europe/Berlin", now=now)
|
||||
|
||||
assert window["since"].isoformat() == "2026-06-26T00:00:00+02:00"
|
||||
assert window["until"].isoformat() == "2026-06-29T12:00:00+02:00"
|
||||
|
||||
|
||||
def test_expected_fires_for_simple_cron_window() -> None:
|
||||
fires = status.expected_fires(_definition(), _window())
|
||||
|
||||
assert fires == [
|
||||
"2026-06-26T09:00:00+02:00",
|
||||
"2026-06-27T09:00:00+02:00",
|
||||
"2026-06-28T09:00:00+02:00",
|
||||
"2026-06-29T09:00:00+02:00",
|
||||
]
|
||||
|
||||
|
||||
def test_completed_when_expected_run_exists() -> None:
|
||||
run = {
|
||||
"run_id": "run-1",
|
||||
"activity_id": ACTIVITY_ID,
|
||||
"scheduled_for": "2026-06-26T07:00:00+00:00",
|
||||
"fired_at": "2026-06-26T07:00:10+00:00",
|
||||
"tasks_spawned": 1,
|
||||
}
|
||||
|
||||
report = status.classify_activity(
|
||||
_definition(),
|
||||
_window(),
|
||||
[run],
|
||||
[{"source": "state_hub_progress", "run_id": "run-1", "output_validated": True}],
|
||||
None,
|
||||
["2026-06-26T09:00:00+02:00"],
|
||||
runs_available=True,
|
||||
)
|
||||
|
||||
assert report["status"] == "completed"
|
||||
|
||||
|
||||
def test_validation_failure_wins_over_completed_run() -> None:
|
||||
run = {"run_id": "run-1", "activity_id": ACTIVITY_ID, "scheduled_for": None, "fired_at": "2026-06-26T07:00:10+00:00"}
|
||||
|
||||
report = status.classify_activity(
|
||||
_definition(),
|
||||
_window(),
|
||||
[run],
|
||||
[{"source": "working_memory", "run_id": "run-1", "output_validated": False}],
|
||||
None,
|
||||
["2026-06-26T09:00:00+02:00"],
|
||||
runs_available=True,
|
||||
)
|
||||
|
||||
assert report["status"] == "validation_failed"
|
||||
|
||||
|
||||
def test_missed_when_expected_fire_has_no_run_and_runs_available() -> None:
|
||||
report = status.classify_activity(
|
||||
_definition(),
|
||||
_window(),
|
||||
[],
|
||||
[],
|
||||
None,
|
||||
["2026-06-26T09:00:00+02:00"],
|
||||
runs_available=True,
|
||||
)
|
||||
|
||||
assert report["status"] == "missed"
|
||||
|
||||
|
||||
def test_disabled_schedule_is_not_counted_as_missed() -> None:
|
||||
report = status.classify_activity(
|
||||
_definition(enabled=False),
|
||||
_window(),
|
||||
[],
|
||||
[],
|
||||
None,
|
||||
["2026-06-26T09:00:00+02:00"],
|
||||
runs_available=True,
|
||||
)
|
||||
|
||||
assert report["status"] == "disabled"
|
||||
|
||||
|
||||
def test_scheduled_definition_reports_one_shot_schedule_id() -> None:
|
||||
definition = {
|
||||
"id": ACTIVITY_ID,
|
||||
"name": "One Shot",
|
||||
"enabled": True,
|
||||
"trigger_type": "scheduled",
|
||||
"trigger_config": {
|
||||
"trigger_type": "scheduled",
|
||||
"at": "2026-06-26T09:00:00+02:00",
|
||||
"timezone": "Europe/Berlin",
|
||||
},
|
||||
"source": "test",
|
||||
}
|
||||
|
||||
report = status.classify_activity(
|
||||
definition,
|
||||
_window(),
|
||||
[],
|
||||
[],
|
||||
None,
|
||||
["2026-06-26T09:00:00+02:00"],
|
||||
runs_available=False,
|
||||
)
|
||||
|
||||
assert status.automation_schedule_id(_definition()) == f"activity-schedule-{ACTIVITY_ID}"
|
||||
assert report["schedule_id"] == f"activity-schedule-{ACTIVITY_ID}-once"
|
||||
|
||||
|
||||
def test_partial_source_availability_is_unknown_not_missed() -> None:
|
||||
report = status.classify_activity(
|
||||
_definition(),
|
||||
_window(),
|
||||
[],
|
||||
[],
|
||||
None,
|
||||
["2026-06-26T09:00:00+02:00"],
|
||||
runs_available=False,
|
||||
)
|
||||
|
||||
assert report["status"] == "unknown"
|
||||
assert "missed-run verdict is unknown" in report["warnings"][0]
|
||||
|
||||
|
||||
def test_working_memory_frontmatter_evidence(tmp_path: Path) -> None:
|
||||
note = tmp_path / "daily-triage-2026-06-26-run.md"
|
||||
note.write_text(
|
||||
"---\n"
|
||||
"source: activity-core\n"
|
||||
f"activity_id: {ACTIVITY_ID}\n"
|
||||
"activity_core_run_id: run-1\n"
|
||||
"scheduled_for: 2026-06-26T07:00:00+00:00\n"
|
||||
"output_validated: false\n"
|
||||
"created: 2026-06-26T07:01:00+00:00\n"
|
||||
"---\n"
|
||||
"body\n",
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
evidence, source = status.load_working_memory_evidence(str(tmp_path), _window())
|
||||
|
||||
assert source["status"] == "ok"
|
||||
assert evidence[0]["run_id"] == "run-1"
|
||||
assert evidence[0]["output_validated"] is False
|
||||
|
||||
|
||||
def _scheduled_definition(enabled: bool = False):
|
||||
return {
|
||||
"id": "00000000-0000-0000-0000-000000000456",
|
||||
"name": "One Shot",
|
||||
"enabled": enabled,
|
||||
"trigger_type": "scheduled",
|
||||
"trigger_config": {
|
||||
"trigger_type": "scheduled",
|
||||
"at": "2026-06-26T09:00:00+02:00",
|
||||
"timezone": "Europe/Berlin",
|
||||
},
|
||||
"source": "db",
|
||||
}
|
||||
|
||||
|
||||
def test_inventory_report_uses_db_definition_rows(monkeypatch) -> None:
|
||||
async def fake_load_definitions(args, warnings):
|
||||
return [dict(_definition(), source="db"), _scheduled_definition()], {"status": "ok", "source": "db"}
|
||||
|
||||
async def fake_temporal(host, namespace, definitions, *, timeout_seconds):
|
||||
return {
|
||||
ACTIVITY_ID: {
|
||||
"schedule_id": f"activity-schedule-{ACTIVITY_ID}",
|
||||
"available": True,
|
||||
"paused": False,
|
||||
"missed_catchup_window": 0,
|
||||
"last_fired_at": None,
|
||||
},
|
||||
}, {"status": "ok", "count": 1}
|
||||
|
||||
monkeypatch.setattr(status, "load_definitions", fake_load_definitions)
|
||||
monkeypatch.setattr(status, "load_temporal_visibility", fake_temporal)
|
||||
args = status.parse_inventory_args(["--format", "json"])
|
||||
|
||||
report, exit_code = asyncio.run(status.build_inventory_report(args))
|
||||
|
||||
assert exit_code == 0
|
||||
assert report["sources"]["definitions"] == {"status": "ok", "source": "db"}
|
||||
assert report["summary"]["automation_count"] == 2
|
||||
assert report["automations"][0]["definition_source"] == "db"
|
||||
assert report["automations"][0]["temporal"]["status"] == "active"
|
||||
assert report["automations"][1]["schedule_id"].endswith("-once")
|
||||
|
||||
|
||||
def test_inventory_file_fallback_when_db_url_missing(monkeypatch) -> None:
|
||||
monkeypatch.setattr(status, "file_definitions", lambda: [dict(_definition(), source="files")])
|
||||
args = status.parse_inventory_args(["--db-url", "", "--temporal-host", ""])
|
||||
|
||||
report, exit_code = asyncio.run(status.build_inventory_report(args))
|
||||
|
||||
assert exit_code == 0
|
||||
assert report["sources"]["definitions"]["status"] == "degraded"
|
||||
assert report["automations"][0]["definition_source"] == "files"
|
||||
assert "ACTCORE_DB_URL is not set" in report["warnings"][0]
|
||||
|
||||
|
||||
def test_inventory_filters_disabled_definitions() -> None:
|
||||
definitions = [_definition(enabled=True), _scheduled_definition(enabled=False)]
|
||||
|
||||
filtered = status.filter_inventory_definitions(
|
||||
definitions,
|
||||
ids=[],
|
||||
names=[],
|
||||
enabled=False,
|
||||
trigger_types=set(),
|
||||
)
|
||||
|
||||
assert [item["name"] for item in filtered] == ["One Shot"]
|
||||
|
||||
|
||||
def test_inventory_temporal_unavailable_is_warning_not_failure(monkeypatch) -> None:
|
||||
async def fake_load_definitions(args, warnings):
|
||||
return [_definition()], {"status": "ok", "source": "db"}
|
||||
|
||||
async def fake_temporal(host, namespace, definitions, *, timeout_seconds):
|
||||
return {}, {"status": "unavailable", "warning": "Temporal unavailable: nope"}
|
||||
|
||||
monkeypatch.setattr(status, "load_definitions", fake_load_definitions)
|
||||
monkeypatch.setattr(status, "load_temporal_visibility", fake_temporal)
|
||||
args = status.parse_inventory_args([])
|
||||
|
||||
report, exit_code = asyncio.run(status.build_inventory_report(args))
|
||||
|
||||
assert exit_code == 0
|
||||
assert report["automations"][0]["temporal"]["status"] == "not_checked"
|
||||
assert report["warnings"] == ["Temporal unavailable: nope"]
|
||||
|
||||
|
||||
def test_inventory_cli_emits_json(monkeypatch, capsys) -> None:
|
||||
monkeypatch.setattr(status, "file_definitions", lambda: [dict(_definition(), source="files")])
|
||||
|
||||
exit_code = asyncio.run(status.async_inventory_main([
|
||||
"--db-url", "",
|
||||
"--temporal-host", "",
|
||||
"--format", "json",
|
||||
]))
|
||||
|
||||
payload = json.loads(capsys.readouterr().out)
|
||||
assert exit_code == 0
|
||||
assert payload["mode"] == "automation-inventory"
|
||||
assert payload["automations"][0]["name"] == "Daily Check"
|
||||
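The `friday` shortcut exercised by `test_friday_shortcut_resolves_to_previous_friday_start` can be sketched as follows. This is a minimal illustration, not the `resolve_window` implementation: the helper name `most_recent_friday_start` is hypothetical, and a fixed `+02:00` offset stands in for `Europe/Berlin` summer time so the sketch needs no time zone database. Whether the real shortcut treats a Friday `now` as "today" or as "a week ago" is not visible in the diff; the sketch assumes "today".

```python
from datetime import datetime, timedelta, timezone


def most_recent_friday_start(now: datetime) -> datetime:
    """Return 00:00 on the most recent Friday at or before ``now`` (hypothetical helper)."""
    # datetime.weekday() counts Monday as 0, so Friday is 4.
    days_back = (now.weekday() - 4) % 7
    friday = now - timedelta(days=days_back)
    # Keep the tzinfo of ``now`` and reset the wall-clock time to midnight.
    return friday.replace(hour=0, minute=0, second=0, microsecond=0)


# Assumption: a fixed +02:00 offset in place of Europe/Berlin summer time.
berlin_summer = timezone(timedelta(hours=2))
now = datetime(2026, 6, 29, 12, 0, tzinfo=berlin_summer)  # a Monday

since = most_recent_friday_start(now)
print(since.isoformat())  # 2026-06-26T00:00:00+02:00
```

With `until` set to `now`, this reproduces the window asserted in the test above.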
@@ -1,6 +1,7 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
@@ -70,7 +71,14 @@ async def test_evaluate_instructions_returns_task_specs_with_audit(monkeypatch)
|
||||
async def test_evaluate_instructions_returns_report_payload(monkeypatch) -> None:
|
||||
llm = FakeLLMClient(json.dumps({
|
||||
"summary": "State Hub has open loose ends.",
|
||||
- "recommendations": [{"candidate": "CUST-WP-0045", "action": "work-next"}],
|
||||
+ "recommendations": [
|
||||
+ {
|
||||
+ "rank": 1,
|
||||
+ "candidate": "CUST-WP-0045",
|
||||
+ "action": "work-next",
|
||||
+ "why": "Open loose ends.",
|
||||
+ }
|
||||
+ ],
|
||||
}))
|
||||
monkeypatch.setattr(activities, "get_llm_client", lambda: llm)
|
||||
|
||||
@@ -209,6 +217,12 @@ async def test_evaluate_instructions_forwards_llm_connect_depth_config(monkeypat
|
||||
"context": {},
|
||||
})
|
||||
|
||||
# Read the live schema file rather than hard-coding it, so the forwarded
|
||||
# json_schema assertion tracks schemas/daily-triage-report.json as the
|
||||
# contract evolves (ACTIVITY-WP-0016-T02).
|
||||
expected_schema = json.loads(
|
||||
Path("schemas/daily-triage-report.json").read_text(encoding="utf-8")
|
||||
)
|
||||
assert llm.calls[0][2] == {
|
||||
"model_name": "custodian-triage-balanced",
|
||||
"temperature": 0.2,
|
||||
@@ -216,16 +230,6 @@ async def test_evaluate_instructions_forwards_llm_connect_depth_config(monkeypat
|
||||
"max_depth": 2,
|
||||
"model_params": {
|
||||
"reasoning_effort": "medium",
|
||||
- "json_schema": {
|
||||
- "type": "object",
|
||||
- "required": ["summary", "recommendations"],
|
||||
- "properties": {
|
||||
- "summary": {"type": "string"},
|
||||
- "recommendations": {
|
||||
- "type": "array",
|
||||
- "items": {"type": "object"},
|
||||
- },
|
||||
- },
|
||||
- },
|
||||
+ "json_schema": expected_schema,
|
||||
},
|
||||
}
|
||||
|
||||
@@ -34,7 +34,7 @@ def test_issue_core_rest_sink_posts_task_contract(monkeypatch) -> None:
|
||||
|
||||
monkeypatch.setattr(httpx, "post", fake_post)
|
||||
|
||||
- ref = IssueCoreRestSink("http://issue-core.test/").emit(TaskSpec(
|
||||
+ ref = IssueCoreRestSink("http://issue-core.test/", api_key="test-key").emit(TaskSpec(
|
||||
title="Run SBOM rescan for activity-core",
|
||||
description="SBOM is older than 30 days.",
|
||||
target_repo="activity-core",
|
||||
@@ -67,12 +67,30 @@ def test_issue_core_rest_sink_posts_task_contract(monkeypatch) -> None:
|
||||
"triggering_event_id": "scheduled",
|
||||
"activity_definition_id": "activity-1",
|
||||
},
|
||||
"headers": {"Authorization": "Bearer test-key"},
|
||||
"timeout": 10.0,
|
||||
}
|
||||
]
|
||||
assert "review_required" not in posts[0]["json"]
|
||||
|
||||
|
||||
def test_issue_core_rest_sink_requires_api_key() -> None:
|
||||
sink = IssueCoreRestSink("http://issue-core.test/", api_key="")
|
||||
with pytest.raises(RuntimeError, match="ISSUE_CORE_API_KEY"):
|
||||
sink.emit(TaskSpec(
|
||||
title="t",
|
||||
description="",
|
||||
target_repo="activity-core",
|
||||
priority="low",
|
||||
labels=[],
|
||||
due_in_days=None,
|
||||
source_type="rule",
|
||||
source_id="r",
|
||||
triggering_event_id="e",
|
||||
activity_definition_id="a",
|
||||
))
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_emit_tasks_raises_when_sink_fails(monkeypatch) -> None:
|
||||
class FailingSink:
|
||||
|
||||
@@ -13,7 +13,12 @@ def test_llm_connect_client_forwards_run_config(monkeypatch) -> None:
|
||||
pass
|
||||
|
||||
def json(self) -> dict:
|
||||
- return {"content": '{"summary":"ok","recommendations":[]}'}
|
||||
+ return {
|
||||
+ "content": '{"summary":"ok","recommendations":[]}',
|
||||
+ "finish_reason": "stop",
|
||||
+ "usage": {"input_tokens": 10, "output_tokens": 20},
|
||||
+ "raw_response": {"provider_blob": "not persisted"},
|
||||
+ }
|
||||
|
||||
def fake_post(url: str, json: dict, timeout: float) -> Response:
|
||||
captured["url"] = url
|
||||
@@ -50,3 +55,7 @@ def test_llm_connect_client_forwards_run_config(monkeypatch) -> None:
|
||||
"timeout_seconds": 42,
|
||||
},
|
||||
}
|
||||
assert client.last_response_metadata == {
|
||||
"finish_reason": "stop",
|
||||
"usage": {"input_tokens": 10, "output_tokens": 20},
|
||||
}
|
||||
|
||||
@@ -166,6 +166,93 @@ def test_state_hub_progress_sink_is_idempotent(monkeypatch) -> None:
|
||||
assert result[0]["idempotency_key"] == idempotency_key
|
||||
|
||||
|
||||
def test_core_hub_interaction_event_sink_posts_and_verifies_compact_event(monkeypatch) -> None:
|
||||
posts: list[dict[str, Any]] = []
|
||||
|
||||
def fake_post(url: str, **kwargs: Any) -> DummyResponse:
|
||||
assert url == "http://core-hub.test/api/v2/interaction-events"
|
||||
assert kwargs["headers"]["Authorization"] == "Bearer runtime-secret"
|
||||
posts.append({"url": url, **kwargs})
|
||||
return DummyResponse(
|
||||
{
|
||||
"id": "event-1",
|
||||
"eventType": "ops-endpoint-verified",
|
||||
"widgetId": "widget-1",
|
||||
}
|
||||
)
|
||||
|
||||
def fake_get(url: str, **kwargs: Any) -> DummyResponse:
|
||||
assert url == "http://core-hub.test/api/v2/interaction-events"
|
||||
assert kwargs["headers"]["Authorization"] == "Bearer runtime-secret"
|
||||
return DummyResponse({"data": [{"id": "event-1"}]})
|
||||
|
||||
monkeypatch.setenv("CORE_HUB_RUNTIME_TOKEN", "runtime-secret")
|
||||
monkeypatch.setattr(httpx, "post", fake_post)
|
||||
monkeypatch.setattr(httpx, "get", fake_get)
|
||||
|
||||
result = persist_ops_inventory_evidence(
|
||||
_payload([
|
||||
{
|
||||
"type": "core-hub-interaction-event",
|
||||
"core_hub_url": "http://core-hub.test",
|
||||
"widget_id": "widget-1",
|
||||
"event_type": "ops-endpoint-verified",
|
||||
}
|
||||
])
|
||||
)
|
||||
|
||||
assert result == [
|
||||
{
|
||||
"type": "core-hub-interaction-event",
|
||||
"status": "posted",
|
||||
"event_type": "ops-endpoint-verified",
|
||||
"event_id": "event-1",
|
||||
"widget_id": "widget-1",
|
||||
"verified": True,
|
||||
"context_key": "ops_probe",
|
||||
}
|
||||
]
|
||||
body = posts[0]["json"]
|
||||
assert body["widgetId"] == "widget-1"
|
||||
assert body["eventType"] == "ops-endpoint-verified"
|
||||
assert body["metadata"]["activity_core_run_id"] == _run_id()
|
||||
assert body["metadata"]["endpoint"]["url"] == "http://state-hub.test/health"
|
||||
assert body["metadata"]["endpoint"]["widget_ref"] == "ops:endpoint:state-hub-health"
|
||||
|
||||
serialized = json.dumps(body, sort_keys=True)
|
||||
assert "runtime-secret" not in serialized
|
||||
assert "secret response body" not in serialized
|
||||
assert "Authorization" not in serialized
|
||||
assert "user:pass" not in serialized
|
||||
assert "token=secret" not in serialized
|
||||
|
||||
|
||||
def test_core_hub_sink_skips_cleanly_when_config_missing(monkeypatch) -> None:
|
||||
monkeypatch.delenv("CORE_HUB_BASE_URL", raising=False)
|
||||
monkeypatch.delenv("CORE_HUB_RUNTIME_TOKEN", raising=False)
|
||||
monkeypatch.delenv("CORE_HUB_RUNTIME_TOKEN_FILE", raising=False)
|
||||
monkeypatch.delenv("CORE_HUB_WIDGET_ID", raising=False)
|
||||
monkeypatch.delenv("CORE_HUB_WIDGET_MAPPING", raising=False)
|
||||
|
||||
result = persist_ops_inventory_evidence(
|
||||
_payload([{"type": "core-hub-interaction-event"}])
|
||||
)
|
||||
|
||||
assert result == [
|
||||
{
|
||||
"type": "core-hub-interaction-event",
|
||||
"status": "skipped",
|
||||
"reason": "missing_core_hub_config",
|
||||
"missing": [
|
||||
"CORE_HUB_BASE_URL",
|
||||
"CORE_HUB_RUNTIME_TOKEN or CORE_HUB_RUNTIME_TOKEN_FILE",
|
||||
"widget_id or CORE_HUB_WIDGET_ID",
|
||||
],
|
||||
"context_key": "ops_probe",
|
||||
}
|
||||
]
|
||||
|
||||
|
||||
def test_inter_hub_sink_skips_cleanly_when_config_missing(monkeypatch) -> None:
|
||||
monkeypatch.delenv("INTER_HUB_URL", raising=False)
|
||||
monkeypatch.delenv("OPS_HUB_KEY", raising=False)
|
||||
|
||||
@@ -93,12 +93,21 @@ def test_external_configmap_projects_enabled_daily_wsjf_definition(tmp_path) ->
|
||||
assert definition.trigger_config["cron_expression"] == "20 7 * * *"
|
||||
assert definition.trigger_config["timezone"] == "Europe/Berlin"
|
||||
assert instruction["id"] == "daily-triage-report"
|
||||
assert instruction["max_tokens"] == 1800
|
||||
assert "most 7 recommendations" in instruction["prompt"]
|
||||
assert "fewer well-formed" in instruction["prompt"]
|
||||
assert instruction["output_schema"] == (
|
||||
"/etc/activity-core/schemas/daily-triage-report.json"
|
||||
)
|
||||
assert instruction["report_sinks"][0]["type"] == "working-memory"
|
||||
assert instruction["report_sinks"][1]["event_type"] == "daily_triage"
|
||||
|
||||
schema = _by_kind_name("ConfigMap", "actcore-report-schemas")
|
||||
daily_schema = yaml.safe_load(schema["data"]["daily-triage-report.json"])
|
||||
recommendations = daily_schema["properties"]["recommendations"]
|
||||
assert recommendations["maxItems"] == 7
|
||||
assert recommendations["items"]["properties"]["rank"]["maximum"] == 7
|
||||
|
||||
|
||||
def test_ops_inventory_configmap_contains_probeable_inventory() -> None:
|
||||
config = _by_kind_name("ConfigMap", "actcore-ops-service-inventory")
|
||||
|
||||
@@ -37,6 +37,10 @@ def _payload(sinks: list[dict[str, Any]]) -> dict[str, Any]:
|
||||
"output_validated": True,
|
||||
"review_required": False,
|
||||
"validation_error": None,
|
||||
"llm_response_metadata": {
|
||||
"finish_reason": "stop",
|
||||
"usage": {"output_tokens": 50},
|
||||
},
|
||||
}
|
||||
],
|
||||
}
|
||||
@@ -62,6 +66,8 @@ def test_working_memory_sink_writes_idempotently(tmp_path) -> None:
|
||||
assert "output_validated: true" in text
|
||||
assert "review_required: false" in text
|
||||
assert "model: test-model" in text
|
||||
assert "LLM response metadata:" in text
|
||||
assert '"finish_reason": "stop"' in text
|
||||
assert "State Hub has loose ends." in text
|
||||
|
||||
|
||||
@@ -113,6 +119,10 @@ def test_state_hub_progress_sink_posts(monkeypatch) -> None:
|
||||
assert posts[0]["json"]["detail"]["activity_core_run_id"] == payload_run_id()
|
||||
assert posts[0]["json"]["detail"]["output_validated"] is True
|
||||
assert posts[0]["json"]["detail"]["review_required"] is False
|
||||
assert posts[0]["json"]["detail"]["llm_response_metadata"] == {
|
||||
"finish_reason": "stop",
|
||||
"usage": {"output_tokens": 50},
|
||||
}
|
||||
|
||||
|
||||
def test_state_hub_progress_includes_prior_working_memory_path(
|
||||
|
||||
167
tests/test_reuse_surface_context_resolver.py
Normal file
@@ -0,0 +1,167 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import pytest
|
||||
from temporalio.exceptions import ApplicationError
|
||||
|
||||
from activity_core.activities import resolve_context
|
||||
from activity_core.context_resolvers import reuse_surface
|
||||
from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY
|
||||
|
||||
|
||||
class _Response:
|
||||
def __init__(self, payload: Any) -> None:
|
||||
self._payload = payload
|
||||
|
||||
def raise_for_status(self) -> None:
|
||||
return None
|
||||
|
||||
def json(self) -> Any:
|
||||
return self._payload
|
||||
|
||||
|
||||
class _Completed:
|
||||
returncode = 0
|
||||
stderr = ""
|
||||
|
||||
def __init__(self, payload: dict[str, Any]) -> None:
|
||||
self.stdout = json.dumps(payload)
|
||||
|
||||
|
||||
def _write_rollout(path: Path) -> None:
|
||||
path.write_text(
|
||||
"""
|
||||
domains:
|
||||
reuse:
|
||||
phase: active
|
||||
repos:
|
||||
- reuse-surface
|
||||
- activity-core
|
||||
parked:
|
||||
phase: backlog
|
||||
repos:
|
||||
- ignored-repo
|
||||
""".lstrip(),
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
|
||||
def _write_cli_only_signals(path: Path) -> None:
|
||||
path.write_text(
|
||||
"""
|
||||
signals:
|
||||
empty_capability_scaffold:
|
||||
enabled: true
|
||||
registry_gap:
|
||||
enabled: false
|
||||
stale_scope:
|
||||
enabled: false
|
||||
stale_sbom:
|
||||
enabled: false
|
||||
publish_check_fail:
|
||||
enabled: false
|
||||
""".lstrip(),
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
|
||||
def test_shell_resolver_emits_reuse_surface_gaps_and_advances_cursor(
|
||||
tmp_path,
|
||||
monkeypatch,
|
||||
) -> None:
|
||||
rollout = tmp_path / "rollout.yaml"
|
||||
_write_rollout(rollout)
|
||||
_write_cli_only_signals(tmp_path / "signals.yml")
|
||||
reuse_root = tmp_path / "reuse-surface"
|
||||
reuse_root.mkdir()
|
||||
(reuse_root / "SCOPE.md").write_text("fresh\n", encoding="utf-8")
|
||||
activity_root = tmp_path / "activity-core"
|
||||
activity_root.mkdir()
|
||||
|
||||
monkeypatch.setenv("KAIZEN_RUNNER_HOST", "runner")
|
||||
|
||||
def fake_get(url: str, **kwargs: Any) -> _Response:
|
||||
assert url.endswith("/repos/")
|
||||
return _Response(
|
||||
[
|
||||
{
|
||||
"slug": "reuse-surface",
|
||||
"host_paths": {"runner": str(reuse_root)},
|
||||
},
|
||||
{
|
||||
"slug": "activity-core",
|
||||
"host_paths": {"runner": str(activity_root)},
|
||||
},
|
||||
]
|
||||
)
|
||||
|
||||
def fake_run(cmd: list[str], **kwargs: Any) -> _Completed:
|
||||
assert cmd == ["reuse-surface", "report", "gaps", "--format", "json"]
|
||||
return _Completed({"empty_scaffolds": ["reuse-surface"]})
|
||||
|
||||
monkeypatch.setattr(reuse_surface.httpx, "get", fake_get)
|
||||
monkeypatch.setattr(reuse_surface.subprocess, "run", fake_run)
|
||||
|
||||
import activity_core.context_resolvers # noqa: F401
|
||||
|
||||
result = CONTEXT_RESOLVER_REGISTRY["shell"]().resolve(
|
||||
"reuse_surface_report_gaps",
|
||||
None,
|
||||
{
|
||||
"roster": str(rollout),
|
||||
"batch_size": 1,
|
||||
},
|
||||
)
|
||||
|
||||
assert result == {
|
||||
"gaps": [
|
||||
{
|
||||
"repo": "reuse-surface",
|
||||
"root": str(reuse_root),
|
||||
"signal": "empty_capability_scaffold",
|
||||
"hygiene_signal": "empty_capability_scaffold",
|
||||
}
|
||||
]
|
||||
}
|
||||
state = json.loads((tmp_path / "round-robin-state.json").read_text(encoding="utf-8"))
|
||||
assert state["cursor"] == 1
|
||||
assert state["last_batch"] == ["reuse-surface"]
|
||||
|
||||
|
||||
def test_shell_resolver_keeps_kaizen_fallback_for_existing_queries() -> None:
|
||||
assert CONTEXT_RESOLVER_REGISTRY["shell"]().resolve("unknown_query", None, {}) == {}
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_optional_reuse_surface_missing_roster_binds_empty_list(tmp_path) -> None:
|
||||
snapshot = await resolve_context(
|
||||
[
|
||||
{
|
||||
"type": "shell",
|
||||
"query": "reuse_surface_report_gaps",
|
||||
"params": {"roster": str(tmp_path / "missing.yaml")},
|
||||
"bind_to": "context.gaps",
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
assert snapshot == {"gaps": []}
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_required_reuse_surface_missing_roster_fails_visibly(tmp_path) -> None:
|
||||
with pytest.raises(ApplicationError, match="Required context resolver"):
|
||||
await resolve_context(
|
||||
[
|
||||
{
|
||||
"type": "shell",
|
||||
"query": "reuse_surface_report_gaps",
|
||||
"params": {"roster": str(tmp_path / "missing.yaml")},
|
||||
"bind_to": "context.gaps",
|
||||
"required": True,
|
||||
}
|
||||
]
|
||||
)
|
||||
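The round-robin behaviour asserted in `test_shell_resolver_emits_reuse_surface_gaps_and_advances_cursor` (a `batch_size` of 1 over a two-repo roster leaves `cursor == 1` and `last_batch == ["reuse-surface"]`) can be illustrated with a small sketch. The function name `next_batch` and the wrap-around behaviour are assumptions; the diff shows only the `cursor` and `last_batch` keys of `round-robin-state.json`, not the resolver's selection logic. The roster is assumed to be already filtered to active domains.

```python
from typing import List, Sequence, Tuple


def next_batch(repos: Sequence[str], cursor: int, batch_size: int) -> Tuple[List[str], int]:
    """Pick the next ``batch_size`` repos starting at ``cursor`` (hypothetical helper).

    Returns the batch and the cursor to persist for the next run.
    """
    if not repos or batch_size <= 0:
        return [], cursor
    start = cursor % len(repos)
    # Never visit the same repo twice within one batch.
    size = min(batch_size, len(repos))
    batch = [repos[(start + i) % len(repos)] for i in range(size)]
    # Assumption: the cursor wraps to 0 after the last repo.
    return batch, (start + size) % len(repos)


roster = ["reuse-surface", "activity-core"]
batch, cursor = next_batch(roster, 0, 1)
print(batch, cursor)  # ['reuse-surface'] 1
```

Persisting the returned cursor between runs is what lets a later run pick up `activity-core` instead of probing `reuse-surface` again.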
81
tests/test_schedule_health.py
Normal file
@@ -0,0 +1,81 @@
"""ACTIVITY-WP-0014 T03: missed-fire detection verdict tests."""

from __future__ import annotations

from datetime import datetime, timedelta, timezone

from activity_core.schedule_health import evaluate_schedule_health

NOW = datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc)


def test_healthy_when_recent_fire_and_no_drops() -> None:
    health = evaluate_schedule_health(
        activity_id="a1",
        missed_catchup_window=0,
        last_fired_at=NOW - timedelta(minutes=5),
        now=NOW,
        expected_interval=timedelta(hours=1),
    )
    assert health.healthy is True
    assert health.missed is False
    assert health.reasons == []


def test_unhealthy_when_catchup_window_dropped_fires() -> None:
    health = evaluate_schedule_health(
        activity_id="a1",
        missed_catchup_window=2,
        last_fired_at=NOW - timedelta(minutes=5),
        now=NOW,
    )
    assert health.missed is True
    assert "2 fire(s) dropped" in health.reasons[0]


def test_unhealthy_when_last_fire_too_stale() -> None:
    health = evaluate_schedule_health(
        activity_id="daily",
        missed_catchup_window=0,
        last_fired_at=NOW - timedelta(days=2),
        now=NOW,
        expected_interval=timedelta(days=1),
    )
    assert health.missed is True
    assert any("exceeding the expected" in r for r in health.reasons)
    assert health.staleness == timedelta(days=2)


def test_within_tolerance_is_healthy() -> None:
    health = evaluate_schedule_health(
        activity_id="daily",
        missed_catchup_window=0,
        last_fired_at=NOW - (timedelta(days=1) + timedelta(minutes=5)),
        now=NOW,
        expected_interval=timedelta(days=1),
        tolerance=timedelta(minutes=10),
    )
    assert health.healthy is True


def test_no_fire_recorded_for_due_schedule_is_unhealthy() -> None:
    health = evaluate_schedule_health(
        activity_id="daily",
        missed_catchup_window=0,
        last_fired_at=None,
        now=NOW,
        expected_interval=timedelta(days=1),
    )
    assert health.missed is True
    assert "no recorded fire" in health.reasons[0]


def test_no_interval_and_no_fire_is_not_flagged() -> None:
    # Without an expected interval we cannot assert a miss from absence alone.
    health = evaluate_schedule_health(
        activity_id="event-ish",
        missed_catchup_window=0,
        last_fired_at=None,
        now=NOW,
    )
    assert health.healthy is True
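The `activity_core.schedule_health` module itself is not part of this diff excerpt. The sketch below is one possible implementation that would satisfy the tests above, not the actual code. The `ScheduleHealth` dataclass, the zero default for `tolerance`, and the exact wording of the reason strings beyond the fragments the tests match are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ScheduleHealth:
    """Hypothetical verdict shape; field names mirror what the tests read."""

    activity_id: str
    missed: bool
    reasons: List[str] = field(default_factory=list)
    staleness: Optional[timedelta] = None

    @property
    def healthy(self) -> bool:
        return not self.missed


def evaluate_schedule_health(
    *,
    activity_id: str,
    missed_catchup_window: int,
    last_fired_at: Optional[datetime],
    now: datetime,
    expected_interval: Optional[timedelta] = None,
    tolerance: timedelta = timedelta(0),  # assumed default, not confirmed by the diff
) -> ScheduleHealth:
    reasons: List[str] = []
    staleness = None if last_fired_at is None else now - last_fired_at

    # Fires that fell outside the catchup window are always a miss.
    if missed_catchup_window > 0:
        reasons.append(f"{missed_catchup_window} fire(s) dropped past the catchup window")

    # Absence or staleness only counts when we know how often a fire is due.
    if expected_interval is not None:
        if staleness is None:
            reasons.append("no recorded fire for a schedule that is due")
        elif staleness > expected_interval + tolerance:
            reasons.append(
                f"last fire was {staleness} ago, exceeding the expected "
                f"interval of {expected_interval}"
            )

    return ScheduleHealth(activity_id, bool(reasons), reasons, staleness)
```

Keeping `healthy` as the negation of `missed` means the two flags cannot disagree, which is the relationship the first test relies on.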
@@ -37,6 +37,7 @@ def _make_defn(
    misfire_policy: str = "skip",
    enabled: bool = True,
    jitter: int = 0,
    catchup_window_seconds: int | None = None,
) -> ActivityDefinition:
    return ActivityDefinition(
        id=uuid.uuid4(),
@@ -46,6 +47,7 @@ def _make_defn(
            cron_expression=cron,
            misfire_policy=misfire_policy,
            jitter_seconds=jitter,
            catchup_window_seconds=catchup_window_seconds,
        ),
    )

@@ -186,6 +188,76 @@ async def test_misfire_policy_compress_sets_overlap_buffer_one(env: WorkflowEnvi
    await delete_schedule(env.client, defn.id)


# ── ACTIVITY-WP-0014: explicit run-miss policies + catchup window ────────────

@pytest.mark.asyncio
async def test_skip_sets_short_catchup_window(env: WorkflowEnvironment) -> None:
    """skip = run on trigger or skip: tiny grace window, no real recovery."""
    defn = _make_defn(misfire_policy="skip")
    await upsert_schedule(env.client, defn)

    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
    assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.SKIP
    assert desc.schedule.policy.catchup_window == timedelta(seconds=60)

    await delete_schedule(env.client, defn.id)


@pytest.mark.asyncio
async def test_catchup_all_recovers_full_window(env: WorkflowEnvironment) -> None:
    """catchup_all = recover every missed fire: long window, BUFFER_ALL."""
    defn = _make_defn(misfire_policy="catchup_all")
    await upsert_schedule(env.client, defn)

    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
    assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.BUFFER_ALL
    assert desc.schedule.policy.catchup_window == timedelta(days=365)

    await delete_schedule(env.client, defn.id)


@pytest.mark.asyncio
async def test_catchup_latest_does_not_accumulate(env: WorkflowEnvironment) -> None:
    """catchup_latest = recover only the most recent missed fire: BUFFER_ONE."""
    defn = _make_defn(misfire_policy="catchup_latest")
    await upsert_schedule(env.client, defn)

    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
    assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.BUFFER_ONE
    assert desc.schedule.policy.catchup_window == timedelta(hours=24)

    await delete_schedule(env.client, defn.id)


@pytest.mark.asyncio
async def test_legacy_aliases_map_to_explicit_policies(env: WorkflowEnvironment) -> None:
    """Legacy catchup/compress keep working and pick up the new catchup windows."""
    catchup = _make_defn(misfire_policy="catchup")
    compress = _make_defn(misfire_policy="compress")
    await upsert_schedule(env.client, catchup)
    await upsert_schedule(env.client, compress)

    d1 = await env.client.get_schedule_handle(schedule_id(catchup.id)).describe()
    d2 = await env.client.get_schedule_handle(schedule_id(compress.id)).describe()
    assert d1.schedule.policy.catchup_window == timedelta(days=365)
    assert d2.schedule.policy.catchup_window == timedelta(hours=24)

    await delete_schedule(env.client, catchup.id)
    await delete_schedule(env.client, compress.id)


@pytest.mark.asyncio
async def test_explicit_catchup_window_override(env: WorkflowEnvironment) -> None:
    """An explicit catchup_window_seconds overrides the per-policy default."""
    defn = _make_defn(misfire_policy="skip", catchup_window_seconds=7200)
    await upsert_schedule(env.client, defn)

    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
    assert desc.schedule.policy.catchup_window == timedelta(hours=2)

    await delete_schedule(env.client, defn.id)


@pytest.mark.asyncio
async def test_schedule_smoke_test_creates_one_shot_schedule(
    env: WorkflowEnvironment,
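The policy tests above pin down a small lookup table: each misfire policy selects an overlap policy and a default catchup window, legacy names alias the explicit ones, and `catchup_window_seconds` overrides the default. A minimal sketch of that resolution follows, assuming the mapping lives in a pure function. `Overlap`, `LEGACY_ALIASES`, `POLICY_DEFAULTS`, and `resolve_misfire_policy` are hypothetical names, and `Overlap` is a stand-in enum so the sketch runs without the Temporal SDK. The overlap policy for the legacy `catchup` alias is inferred from its 365-day window rather than asserted by the tests shown here.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional, Tuple


class Overlap(Enum):
    """Stand-in for the SDK's ScheduleOverlapPolicy (assumption, for self-containment)."""

    SKIP = "skip"
    BUFFER_ONE = "buffer_one"
    BUFFER_ALL = "buffer_all"


# Legacy names keep working by resolving to the explicit policies.
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}

# Defaults taken from the assertions in the tests above.
POLICY_DEFAULTS = {
    "skip": (Overlap.SKIP, timedelta(seconds=60)),
    "catchup_latest": (Overlap.BUFFER_ONE, timedelta(hours=24)),
    "catchup_all": (Overlap.BUFFER_ALL, timedelta(days=365)),
}


def resolve_misfire_policy(
    misfire_policy: str,
    catchup_window_seconds: Optional[int] = None,
) -> Tuple[Overlap, timedelta]:
    canonical = LEGACY_ALIASES.get(misfire_policy, misfire_policy)
    overlap, window = POLICY_DEFAULTS[canonical]
    # An explicit window wins over the per-policy default.
    if catchup_window_seconds is not None:
        window = timedelta(seconds=catchup_window_seconds)
    return overlap, window
```

An unknown policy name raises `KeyError` in this sketch; whether the real code rejects or defaults it is not visible in the diff.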
@@ -407,6 +407,70 @@ def test_recently_on_scope_hourly_failure_bubbles(monkeypatch) -> None:
        StateHubContextResolver().resolve("recently_on_scope_hourly", None, {"range": "1h"})


def test_consistency_sweep_remote_all_posts_batch(monkeypatch) -> None:
    calls: list[dict[str, Any]] = []

    def fake_post(url: str, **kwargs: Any) -> DummyResponse:
        calls.append({"url": url, **kwargs})
        return DummyResponse(
            {
                "exit_code": 0,
                "lock_skipped": False,
                "repos_processed": [{"repo_slug": "state-hub", "result": "pass"}],
                "skipped_clean": ["quiet-repo"],
                "skipped_missing": [],
                "skipped_budget": [],
            }
        )

    monkeypatch.setenv("STATE_HUB_URL", "http://state-hub.test/")
    monkeypatch.setattr(httpx, "post", fake_post)

    result = StateHubContextResolver().resolve(
        "consistency_sweep_remote_all",
        None,
        {"max_seconds": 300, "source": "activity-core", "required": True},
    )

    assert result["exit_code"] == 0
    assert result["repos_processed"][0]["repo_slug"] == "state-hub"
    assert calls == [
        {
            "url": "http://state-hub.test/consistency/sweep/remote-all",
            "json": {"max_seconds": 300, "source": "activity-core"},
            "timeout": 330.0,
        }
    ]


def test_consistency_sweep_remote_all_failure_bubbles(monkeypatch) -> None:
    def fake_post(url: str, **kwargs: Any) -> DummyResponse:
        raise httpx.ConnectError("offline")

    monkeypatch.setattr(httpx, "post", fake_post)

    with pytest.raises(httpx.ConnectError):
        StateHubContextResolver().resolve(
            "consistency_sweep_remote_all",
            None,
            {"max_seconds": 300},
        )


def test_consistency_sweep_remote_all_rejects_empty_response(monkeypatch) -> None:
    def fake_post(url: str, **kwargs: Any) -> DummyResponse:
        return DummyResponse({})

    monkeypatch.setattr(httpx, "post", fake_post)

    with pytest.raises(RuntimeError, match="missing required key"):
        StateHubContextResolver().resolve(
            "consistency_sweep_remote_all",
            None,
            {"max_seconds": 300},
        )


def test_recently_on_scope_hourly_rejects_empty_response(monkeypatch) -> None:
    def fake_post(url: str, **kwargs: Any) -> DummyResponse:
        return DummyResponse({})
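The sweep tests above imply three behaviours: the trailing slash on `STATE_HUB_URL` is dropped, the resolver-level `required` flag is not forwarded in the JSON body, and the HTTP timeout sits above `max_seconds`. The sketch below shows one way to build such a request and reject an empty response. The function names are hypothetical, the 30-second margin is inferred only from the single `300 -> 330.0` data point, and `exit_code` as the required key is an assumption.

```python
from typing import Any, Dict, Iterable

SWEEP_PATH = "/consistency/sweep/remote-all"
TIMEOUT_MARGIN_SECONDS = 30.0  # assumed: the test observes 330.0 for max_seconds=300


def build_sweep_request(base_url: str, params: Dict[str, Any]) -> Dict[str, Any]:
    # "required" steers the resolver, so it is left out of the request body.
    body = {k: v for k, v in params.items() if k != "required"}
    return {
        "url": base_url.rstrip("/") + SWEEP_PATH,
        "json": body,
        "timeout": float(params["max_seconds"]) + TIMEOUT_MARGIN_SECONDS,
    }


def validate_sweep_response(
    payload: Dict[str, Any],
    required_keys: Iterable[str] = ("exit_code",),  # assumed key set
) -> Dict[str, Any]:
    for key in required_keys:
        if key not in payload:
            raise RuntimeError(f"sweep response missing required key: {key}")
    return payload
```

Giving the HTTP call a longer timeout than the server-side budget lets the server report its own budget exhaustion instead of the client cutting the connection first.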
81
tests/test_state_hub_write.py
Normal file
@@ -0,0 +1,81 @@
"""ACTIVITY-WP-0014 T05: idempotency-keyed State Hub writes."""

from __future__ import annotations

import httpx
import pytest

from activity_core import report_sinks
from activity_core.state_hub_write import (
    IDEMPOTENCY_HEADER,
    idempotency_headers,
    idempotency_key,
)


def test_key_is_stable_and_deterministic() -> None:
    a = idempotency_key("run1", "daily-triage-report", "daily_triage")
    b = idempotency_key("run1", "daily-triage-report", "daily_triage")
    assert a == b == "run1:daily-triage-report:daily_triage"


def test_key_shape_stable_with_missing_parts() -> None:
    assert idempotency_key("run1", None, "daily_triage") == "run1::daily_triage"


def test_key_sanitizes_control_and_whitespace() -> None:
    key = idempotency_key("run 1", "a\tb", "x\n")
    assert "\t" not in key and "\n" not in key and " " not in key


def test_headers_carry_the_key() -> None:
    headers = idempotency_headers("run1", "i", "e")
    assert headers == {IDEMPOTENCY_HEADER: "run1:i:e"}


def test_distinct_identities_get_distinct_keys() -> None:
    assert idempotency_key("r", "i", "daily_triage") != idempotency_key(
        "r", "i", "schedule_miss"
    )


def test_progress_exists_is_best_effort_on_connection_error(monkeypatch) -> None:
    """A down State Hub must not hard-fail the dedup read; it returns False so the
    keyed write can still proceed."""

    def _boom(*args, **kwargs):
        raise httpx.ConnectError("Connection refused")

    monkeypatch.setattr(report_sinks.httpx, "get", _boom)
    assert (
        report_sinks._progress_exists(
            "http://127.0.0.1:8000", "run1", "daily-triage-report", "daily_triage"
        )
        is False
    )


def test_report_sink_post_sends_idempotency_header(monkeypatch) -> None:
    """The state-hub-progress write carries a stable Idempotency-Key header."""
    captured: dict[str, object] = {}

    monkeypatch.setattr(report_sinks, "_progress_exists", lambda *a, **k: False)

    class _Resp:
        def raise_for_status(self) -> None: ...
        def json(self) -> dict[str, str]:
            return {"id": "pid-1"}

    def _capture_post(url, json, headers, timeout):  # noqa: A002
        captured["headers"] = headers
        return _Resp()

    monkeypatch.setattr(report_sinks.httpx, "post", _capture_post)

    payload = {"run_id": "run1", "activity_id": "act1", "scheduled_for": None}
    report_entry = {"instruction_id": "daily-triage-report", "report": {"summary": "s"}}
    sink = {"event_type": "daily_triage"}

    result = report_sinks._post_state_hub_progress(payload, report_entry, sink)
    assert result["status"] == "posted"
    assert captured["headers"][IDEMPOTENCY_HEADER] == "run1:daily-triage-report:daily_triage"
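The key shape these tests describe is three colon-joined parts, with a missing part rendered as an empty string and whitespace or control characters removed from each part. The sketch below is one way to get that behaviour and is not the actual `activity_core.state_hub_write` module. The parameter names, the `_` replacement character, and the literal header name (taken from the docstring's mention of an `Idempotency-Key` header) are assumptions.

```python
import re
from typing import Dict, Optional

IDEMPOTENCY_HEADER = "Idempotency-Key"  # assumed literal value

# Whitespace plus ASCII control characters; each run collapses to one "_".
_UNSAFE = re.compile(r"[\s\x00-\x1f\x7f]+")


def _clean(part: Optional[str]) -> str:
    # A missing part becomes "" so the key always has three slots.
    return _UNSAFE.sub("_", part or "")


def idempotency_key(
    run_id: Optional[str],
    instruction_id: Optional[str],
    event_type: Optional[str],
) -> str:
    return ":".join(_clean(p) for p in (run_id, instruction_id, event_type))


def idempotency_headers(
    run_id: Optional[str],
    instruction_id: Optional[str],
    event_type: Optional[str],
) -> Dict[str, str]:
    return {IDEMPOTENCY_HEADER: idempotency_key(run_id, instruction_id, event_type)}
```

Because the key is derived only from the run, instruction, and event identity, a retried write for the same run produces the same header and can be deduplicated by the receiver.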
126
tests/test_sync_schedules.py
Normal file
@@ -0,0 +1,126 @@
from __future__ import annotations

import uuid
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Any

import pytest

from activity_core import sync_schedules


def _row(
    *,
    activity_id: uuid.UUID,
    enabled: bool,
    trigger_config: dict[str, Any],
) -> SimpleNamespace:
    return SimpleNamespace(
        id=activity_id,
        name=f"definition-{activity_id}",
        enabled=enabled,
        trigger_config=trigger_config,
        context_sources=[],
        task_templates=[],
        dedupe_key_strategy="skip",
        version=1,
    )


@pytest.mark.asyncio
async def test_sync_schedule_rows_reports_drift_counts_and_preserves_one_shots(
    monkeypatch,
) -> None:
    new_id = uuid.uuid4()
    disabled_old_id = uuid.uuid4()
    one_shot_id = uuid.uuid4()
    orphan_id = uuid.uuid4()
    upserted: list[tuple[uuid.UUID, bool, str]] = []
    deleted: list[str] = []

    async def fake_upsert_schedule(client: object, defn: object) -> None:
        upserted.append((
            defn.id,
            defn.enabled,
            defn.trigger_config.trigger_type,
        ))

    async def fake_list_schedules(client: object) -> list[dict[str, str]]:
        return [
            {
                "schedule_id": f"activity-schedule-{disabled_old_id}",
                "activity_id": str(disabled_old_id),
            },
            {
                "schedule_id": f"activity-schedule-{one_shot_id}-once",
                "activity_id": f"{one_shot_id}-once",
            },
            {
                "schedule_id": f"activity-schedule-{orphan_id}",
                "activity_id": str(orphan_id),
            },
        ]

    async def fake_delete_schedule(client: object, activity_id: str) -> None:
        deleted.append(activity_id)

    monkeypatch.setattr(sync_schedules, "upsert_schedule", fake_upsert_schedule)
    monkeypatch.setattr(sync_schedules, "list_schedules", fake_list_schedules)
    monkeypatch.setattr(sync_schedules, "delete_schedule", fake_delete_schedule)

    result = await sync_schedules.sync_schedule_rows(
        object(),
        [
            _row(
                activity_id=new_id,
                enabled=True,
                trigger_config={
                    "trigger_type": "cron",
                    "cron_expression": "20 7 * * *",
                    "timezone": "Europe/Berlin",
                    "misfire_policy": "skip",
                },
            ),
            _row(
                activity_id=disabled_old_id,
                enabled=False,
                trigger_config={
                    "trigger_type": "cron",
                    "cron_expression": "20 * * * *",
                    "timezone": "Europe/Berlin",
                    "misfire_policy": "skip",
                },
            ),
            _row(
                activity_id=one_shot_id,
                enabled=True,
                trigger_config={
                    "trigger_type": "scheduled",
                    "at": datetime(2026, 6, 19, 8, 0, tzinfo=timezone.utc),
                    "timezone": "UTC",
                },
            ),
            _row(
                activity_id=uuid.uuid4(),
                enabled=True,
                trigger_config={
                    "trigger_type": "event",
                    "event_type": "kaizen.metrics.recorded",
                    "filters": {},
                },
            ),
        ],
    )

    assert result.to_dict() == {
        "upserted": 2,
        "paused": 1,
        "deleted_orphans": 1,
    }
    assert upserted == [
        (new_id, True, "cron"),
        (disabled_old_id, False, "cron"),
        (one_shot_id, True, "scheduled"),
    ]
    assert deleted == [str(orphan_id)]
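The orphan rule this test exercises can be read as: a Temporal schedule is an orphan only if no definition row owns it, and a one-shot schedule whose id carries a `-once` suffix is owned by the definition with the unsuffixed id. A minimal sketch of that classification follows. `find_orphans` and `ONE_SHOT_SUFFIX` are hypothetical names, and the real `sync_schedule_rows` may structure this differently.

```python
from typing import Dict, Iterable, List, Set

ONE_SHOT_SUFFIX = "-once"  # suffix seen on the one-shot schedule in the test above


def find_orphans(
    existing: Iterable[Dict[str, str]],
    known_activity_ids: Set[str],
) -> List[str]:
    """Return activity ids of schedules that no definition row accounts for."""
    orphans: List[str] = []
    for sched in existing:
        activity_id = sched["activity_id"]
        # Map "<uuid>-once" back to "<uuid>" before the membership check,
        # so one-shot schedules are not mistaken for orphans.
        if activity_id.endswith(ONE_SHOT_SUFFIX):
            base_id = activity_id[: -len(ONE_SHOT_SUFFIX)]
        else:
            base_id = activity_id
        if base_id not in known_activity_ids:
            orphans.append(activity_id)
    return orphans
```

Disabled definitions stay in `known_activity_ids` in this sketch, which matches the test: the disabled schedule is paused through an upsert, not deleted.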
134
tests/test_sync_service.py
Normal file
@@ -0,0 +1,134 @@
from __future__ import annotations

from typing import Any

import pytest

from activity_core import sync_service
from activity_core.sync_schedules import ScheduleSyncResult


@pytest.mark.asyncio
async def test_run_sync_runs_requested_sections(monkeypatch) -> None:
    calls: list[str] = []

    async def fake_definitions(session_factory: object) -> int:
        calls.append("definitions")
        return 2

    async def fake_event_types(session_factory: object) -> int:
        calls.append("event_types")
        return 5

    async def fake_schedules(
        temporal_client: object,
        session_factory: object,
    ) -> ScheduleSyncResult:
        calls.append("schedules")
        return ScheduleSyncResult(upserted=3, paused=1, deleted_orphans=2)

    monkeypatch.setattr(sync_service, "sync_activity_definitions", fake_definitions)
    monkeypatch.setattr(sync_service, "sync_event_types", fake_event_types)
    monkeypatch.setattr(sync_service, "sync_with_session_factory", fake_schedules)

    result = await sync_service.run_sync(
        session_factory=object(),
        temporal_client=object(),
        definitions=True,
        schedules=True,
        event_types=True,
    )

    assert calls == ["definitions", "event_types", "schedules"]
    assert result["ok"] is True
    assert result["ran"] == {
        "definitions": True,
        "schedules": True,
        "event_types": True,
    }
    assert result["definitions"] == {"synced": 2}
    assert result["event_types"] == {"synced": 5}
    assert result["schedules"] == {
        "upserted": 3,
        "paused": 1,
        "deleted_orphans": 2,
    }
    assert result["errors"] == []


@pytest.mark.asyncio
async def test_run_sync_collects_errors_and_continues(monkeypatch) -> None:
    calls: list[str] = []

    async def failing_definitions(session_factory: object) -> int:
        calls.append("definitions")
        raise RuntimeError("definition parse failed")

    async def fake_schedules(
        temporal_client: object,
        session_factory: object,
    ) -> ScheduleSyncResult:
        calls.append("schedules")
        return ScheduleSyncResult(upserted=1)

    monkeypatch.setattr(
        sync_service,
        "sync_activity_definitions",
        failing_definitions,
    )
    monkeypatch.setattr(sync_service, "sync_with_session_factory", fake_schedules)

    result = await sync_service.run_sync(
        session_factory=object(),
        temporal_client=object(),
        definitions=True,
        schedules=True,
        event_types=False,
    )

    assert calls == ["definitions", "schedules"]
    assert result["ok"] is False
    assert result["definitions"] == {"synced": 0}
    assert result["schedules"]["upserted"] == 1
    assert result["errors"] == [
        {
            "stage": "definitions",
            "type": "RuntimeError",
            "message": "definition parse failed",
        }
    ]


@pytest.mark.asyncio
async def test_run_sync_reports_missing_temporal_client_for_schedules() -> None:
    result = await sync_service.run_sync(
        session_factory=object(),
        temporal_client=None,
        definitions=False,
        schedules=True,
        event_types=False,
    )

    assert result["ok"] is False
    assert result["errors"] == [
        {
            "stage": "schedules",
            "type": "RuntimeError",
            "message": "Temporal client is required for schedule sync",
        }
    ]


def test_record_error_bounds_error_count() -> None:
    result: dict[str, Any] = {
        "ok": True,
        "errors": [],
    }

    for i in range(25):
        sync_service._record_error(result, "stage", RuntimeError(f"boom {i}"))

    assert result["ok"] is False
    assert len(result["errors"]) == 20
    assert result["errors"][0]["message"] == "boom 0"
    assert result["errors"][-1]["message"] == "boom 19"
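The last test fixes the bounding behaviour of the error list: 25 failures leave exactly the first 20 recorded, and `ok` flips to false on the first one. A sketch of a recorder with that behaviour is below. The name `record_error` and the constant `MAX_ERRORS` are illustrative; only the limit of 20 and the `stage` / `type` / `message` entry shape come from the tests.

```python
from typing import Any, Dict

MAX_ERRORS = 20  # the bound the test above observes


def record_error(result: Dict[str, Any], stage: str, exc: BaseException) -> None:
    """Mark the sync result as failed and keep at most MAX_ERRORS entries."""
    result["ok"] = False
    errors = result.setdefault("errors", [])
    # Keep the earliest errors: later ones are dropped once the list is full.
    if len(errors) < MAX_ERRORS:
        errors.append(
            {
                "stage": stage,
                "type": type(exc).__name__,
                "message": str(exc),
            }
        )
```

Keeping the earliest entries preserves the first failure of a run, which is usually the one that explains the failures after it.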
@@ -4,11 +4,11 @@ type: workplan
 title: "Post-triage operational hardening"
 domain: custodian
 repo: activity-core
-status: active
+status: finished
 owner: codex
 topic_slug: custodian
 created: "2026-06-03"
-updated: "2026-06-16"
+updated: "2026-06-30"
 state_hub_workstream_id: "5646e13a-13af-4724-bca6-3c0d86f96733"
 ---

@@ -104,7 +104,7 @@ and emitted a validated `daily_triage` report plus working-memory note.

 ```task
 id: ACTIVITY-WP-0006-T03
-status: wait
+status: done
 priority: medium
 state_hub_task_id: "7cbf0a35-71a1-47ac-afc2-f51ad2180fd0"
 ```
@@ -174,6 +174,56 @@ the worker consumes the configured URL, then produce schema-valid daily triage
 evidence and three clean scheduled runs. This narrower path is tracked in
 `ACTIVITY-WP-0010`.

+2026-06-25: Consecutive-run streak resumed. State Hub `daily_triage` progress
+events from author `activity-core` fired on time on **2026-06-24 05:20:56Z** and
+**2026-06-25 05:20:47Z** (07:20 Berlin), both delivered, no misfires. That is two
+clean consecutive scheduled runs. **RECHECK 2026-06-26 (after 05:20Z):** confirm
+the 06-26 scheduled `daily_triage` event delivered. If clean, that completes three
+clean consecutive scheduled runs (06-24 / 06-25 / 06-26) — record the calibration
+result in State Hub and close T03. If the 06-26 run misfires or is missing, the
+streak resets and T03 stays `wait`. Flag deliberately kept in-repo (agent-agnostic)
+rather than tied to any single coding agent's scheduler.
+
+2026-06-26 recheck outcome: **streak reset at two.** The 06-26 scheduled run fired
+on time (`daily_triage` event 05:20:57Z) — scheduling layer healthy, no misfire —
+but the `daily-triage-report` instruction output **failed schema validation**:
+`Expecting ',' delimiter: line 136 column 22 (char 5268)`. The model produced a
+long ranked WSJF recommendation list (reached rank 7+ with nested `wsjf` objects)
+whose JSON broke ~char 5268; only a bounded 4000-char preview is preserved in the
+State Hub event, so the exact offending token needs the runtime llm-connect log.
+This is an LLM-output-quality failure (tracked by `ACTIVITY-WP-0010`), not a
+runtime/projection failure. T03 stays `wait`; three clean consecutive scheduled
+runs not yet achieved (06-24 ✅, 06-25 ✅, 06-26 ✗-validation).
+
+2026-06-27 recheck outcome: streak remains reset. The scheduled run fired and
+wrote State Hub progress plus working memory, but daily-triage-report failed
+validation again with an unterminated string around char 5246. This confirms the
+runner/sink path is alive and the active blocker is live deployment of the
+ACTIVITY-WP-0016 output-robustness bundle and runtime prompt/token changes, not
+a missing schedule. T03 stays wait until a post-deployment smoke passes and three
+new clean scheduled runs are collected.
+
+2026-06-30 early checkpoint: two new clean scheduled runs exist after the
+validation failures. State Hub daily_triage progress shows 2026-06-28
+05:20:51Z run `6a44d6dd-3f02-53f2-a5d8-d42b76b0ef98` and 2026-06-29
+05:20:49Z run `1dfb47c9-07bf-551b-b778-1d21a40bd95c`, both with
+`output_validated=true` and working-memory notes written. The current local time
+was 2026-06-30 01:37 Europe/Berlin, before the expected 07:20 Berlin scheduled
+fire, so the three-clean-run gate cannot close yet. Recheck after 2026-06-30
+05:20Z; if that scheduled run validates, the clean streak is 06-28 / 06-29 /
+06-30 and T03 can close with calibration feedback.
+
+2026-06-30 closeout: the 07:20 Berlin scheduled run fired at 05:20:50Z as run
+`ac3d71a0-2f8f-50df-b3ce-7c60c2abb5c5` with `output_validated=true` and a
+working-memory note written. The post-failure clean streak is now complete:
+2026-06-28 (`6a44d6dd`), 2026-06-29 (`1dfb47c9`), and 2026-06-30 (`ac3d71a0`).
+Calibration feedback: the scheduler, worker, llm-connect route, State Hub sink,
+and working-memory sink are stable again; the recommendations were operationally
+useful but too dense at 10 items, repeatedly emphasizing human-dependency and
+infrastructure-unblock work. ACTIVITY-WP-0016 now owns the density/contract fix:
+Railiance runtime projection was aligned to a top-7 contract so the next live
+run can prove the bounded output posture. T03 is done.
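The three-clean-run gate tracked in these notes amounts to counting the trailing run of validated scheduled runs, with any failed validation resetting the count to zero. The sketch below replays the outcomes recorded above for 06-24 through 06-30. The helper name `trailing_clean_streak` and the tuple layout are illustrative and are not part of `activity-core`.

```python
from typing import Iterable, List, Tuple

REQUIRED_STREAK = 3  # the gate asks for three clean consecutive scheduled runs


def trailing_clean_streak(runs: Iterable[Tuple[str, bool]]) -> int:
    """Count consecutive validated runs at the end of a chronological run log."""
    streak = 0
    for _day, output_validated in runs:
        # A failed validation resets the streak; a clean run extends it.
        streak = streak + 1 if output_validated else 0
    return streak


# Outcomes as recorded in the workplan notes (True = output_validated).
RUN_LOG: List[Tuple[str, bool]] = [
    ("2026-06-24", True),
    ("2026-06-25", True),
    ("2026-06-26", False),
    ("2026-06-27", False),
    ("2026-06-28", True),
    ("2026-06-29", True),
    ("2026-06-30", True),
]

gate_closed = trailing_clean_streak(RUN_LOG) >= REQUIRED_STREAK
```

Evaluated on the log up to 06-29 the streak is two, which matches the early-checkpoint note that the gate could not close before the 06-30 run.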
+
 ## Rule Action Contract Documentation

 ```task
@@ -8,7 +8,7 @@ status: blocked
 owner: codex
 topic_slug: custodian
 created: "2026-06-18"
-updated: "2026-06-18"
+updated: "2026-06-27"
 state_hub_workstream_id: "f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9"
 ---

@@ -87,7 +87,7 @@ reported 9 passed.

 ```task
 id: ACTIVITY-WP-0010-T02
-status: wait
+status: done
 priority: high
 state_hub_task_id: "23545ddc-926b-485a-8535-5cc11e01134a"
 ```
@@ -107,6 +107,30 @@ Current wait reason: this is Railiance/operator-owned live cluster work. State
 Hub handoff message `9a074b7c-4b87-4e3c-a6bf-e1fe5580daa8` asks
 `railiance-cluster` to reconcile the updated config and smoke it.

+2026-06-19 recheck:
+
+- Deployed `llm-connect` into the `activity-core` namespace on `railiance01`
+  (the cluster that runs `actcore-worker`). `coulombcore` had llm-connect only;
+  the in-cluster Service URL is cluster-local.
+- `actcore-runtime-config` already exposed the verified URL and timeout;
+  `deployment/actcore-worker` was restarted and now reports
+  `LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080`.
+- `llm-connect-provider-secrets` reports `DATA 1`; no Secret values were
+  inspected.
+- Worker health probe to llm-connect `/health` returns `{"status": "ok"}`.
+- `actcore-state-hub-bridge` remains `0/1` Ready with upstream timeouts, so T02
+  is not fully closed until the node-local State Hub tunnel is restored.
+
+2026-06-27 recheck:
+
+- Superseded by real scheduled runner evidence: State Hub daily_triage events on
+  2026-06-24, 2026-06-25, 2026-06-26, and 2026-06-27 all reached State Hub and
+  wrote working-memory notes. The bridge/sink is therefore reachable for the
+  live runner.
+- 2026-06-24 and 2026-06-25 were schema-valid; 2026-06-26 and 2026-06-27 failed
+  output validation after calling llm-connect. That moves the active blocker out
+  of T02 and into the WP-0016 live bundle/smoke lane. Marking T02 done.
+
 ## Run Daily Triage Fixture Smoke

 ```task
@@ -128,6 +152,27 @@ Done when:
   detail;
 - `scripts/verify_daily_triage.py` reports the smoke/manual run as present.

+2026-06-19 recheck:
+
+- In-namespace llm-connect fixture smoke on `railiance01` passed:
+  `smoke: pass health=ok latency_seconds=1.681 recommendations=1`.
+- Manual `POST /activity-definitions/6fca51fa-387a-4fd0-bc4e-d62c29eb859a/trigger`
+  reached llm-connect, but the workflow failed at `persist_instruction_reports`
+  with `state-hub-progress` sink `Connection refused` while
+  `actcore-state-hub-bridge` is unhealthy.
+- T03 therefore remains open until State Hub bridge reachability is restored and
+  a run emits non-secret `daily_triage` progress with `output_validated=true`.
+
+2026-06-27 recheck:
+
+- Scheduled runs on 2026-06-24 and 2026-06-25 satisfy the non-secret smoke
+  evidence for llm-connect call, State Hub progress with output_validated=true,
+  and working-memory note creation.
+- Kept T03 at progress rather than done because the workstation did not run the
+  live verifier against Temporal/activity-core DB, and the smoke must be repeated
+  after the WP-0016 code/schema/runtime-prompt deployment due to the 2026-06-26 and
+  2026-06-27 malformed-output failures.
+
 ## Collect Three Clean Scheduled Runs

 ```task
@@ -151,6 +196,14 @@ Done when:
 - `ACTIVITY-WP-0006-T03` and `ACTIVITY-WP-0009-T01` can move from `wait` to
   `done`.

+2026-06-27 recheck:
+
+- Three-clean-run streak is reset. The latest sequence is 2026-06-24 clean,
+  2026-06-25 clean, 2026-06-26 validation_failed, 2026-06-27 validation_failed.
+- Current pickup is to deploy ACTIVITY-WP-0016 code/schema together with the
+  Railiance runtime prompt and max_tokens changes, run a live smoke, then restart
+  the three-consecutive-scheduled-run gate from zero.
+
 ## Close Handoff State

 ```task
@@ -4,11 +4,11 @@ type: workplan
 title: "Definition And Schedule Hot Reload"
 domain: custodian
 repo: activity-core
-status: ready
+status: finished
 owner: codex
 topic_slug: custodian
 created: "2026-06-18"
-updated: "2026-06-18"
+updated: "2026-06-22"
 state_hub_workstream_id: "8887075e-21ec-451b-b82b-cd81035c9ca5"
 ---

@@ -39,7 +39,7 @@ a repo checkout manager or CI system.

 ```task
 id: ACTIVITY-WP-0012-T01
-status: todo
+status: done
 priority: high
 state_hub_task_id: "53a7970b-7eec-47f5-ad30-bbd7c6271952"
 ```
@@ -57,11 +57,17 @@ Done when:
 - failures are collected into a bounded `errors[]` result while preserving the
   current startup best-effort behavior.

+2026-06-19: Completed. Added `activity_core.sync_service.run_sync`, which
+orchestrates ActivityDefinition, event type, and schedule sync independently
+from explicit DB session factory and Temporal client dependencies. Worker
+startup now calls the shared service for definitions+schedules and logs bounded
+stage errors while continuing startup.
+
 ## Add Admin Sync Endpoint

 ```task
 id: ACTIVITY-WP-0012-T02
-status: todo
+status: done
 priority: high
 state_hub_task_id: "8697c761-15d1-4da0-b66b-d838218a2495"
 ```
@@ -80,11 +86,17 @@ Done when:
 - endpoint tests cover definitions-only, schedules-only, all-sync, and failure
   result behavior.

+2026-06-19: Completed. Added `POST /admin/sync` with defaults
+`definitions=true`, `schedules=true`, and `event_types=false`. The response
+reports definition/event counts, schedule upsert/pause/orphan-delete counts, and
+bounded `errors[]`. Tests cover definitions-only, schedules-only, all-sync, and
+failure-result behavior.
+
 ## Preserve Schedule Drift Semantics

 ```task
 id: ACTIVITY-WP-0012-T03
-status: todo
+status: done
 priority: high
 state_hub_task_id: "efeac412-632c-4c90-9428-bb575ac7a624"
 ```
@@ -101,11 +113,18 @@ Done when:
 - regression tests demonstrate the Coulomb hourly-to-daily rename shape without
   needing a worker restart.

+2026-06-19: Completed. `sync_schedules` now returns explicit counts for enabled
+schedule upserts, disabled schedule pauses, and orphan deletes. Regression tests
+cover the hourly-to-daily rename shape: a new enabled cron schedule is upserted,
+the old disabled cron schedule is preserved as paused, unrelated orphan
+schedules are deleted, event-triggered definitions do not create schedules, and
+one-shot scheduled definitions are no longer mistaken for orphans.
+
 ## Optional Background Sync Loop

 ```task
 id: ACTIVITY-WP-0012-T04
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "d774087b-c51d-4444-8e90-bfef43765456"
 ```
@@ -121,11 +140,17 @@ Done when:
   last error summary;
 - the loop does not block worker startup or workflow task processing.

+2026-06-19: Completed by decision. v1 stays manual/operator-triggered through
+`POST /admin/sync`; no background loop was added. The runbook records this
+posture so customer definition changes stay explicit and the worker does not
+start background repo scanning. A periodic loop remains a future option if live
+operator use proves it is needed.
+
 ## Live No-Restart Smoke

 ```task
 id: ACTIVITY-WP-0012-T05
-status: wait
+status: done
 priority: high
 state_hub_task_id: "68a0e22a-106a-4d21-9f39-c6279850cb5e"
 ```
@@ -141,5 +166,27 @@ Done when non-secret State Hub evidence shows:
|
||||
- event-triggered definitions still fire normally;
|
||||
- rollback or repeat sync is idempotent.
|
||||
|
||||
Current wait reason: this gate depends on the implementation tasks and a
|
||||
cluster-owned smoke path.
|
||||
2026-06-22: Completed on Railiance01 (`KUBECONFIG=~/.kube/config-hosteurope`).
|
||||
|
||||
Smoke target: disabled projection `ops-service-inventory-probes`
|
||||
(`40d15a87-7ff6-4d8e-992c-37df15f95110`) in
|
||||
`actcore-external-activity-definitions`.
|
||||
|
||||
Evidence:
|
||||
|
||||
- ConfigMap flip `enabled: false -> true` and cadence `15 * * * * -> 25 * * * *`,
|
||||
then `POST /admin/sync?definitions=true&schedules=true` from `actcore-api`.
|
||||
- DB after sync: `enabled=true`, `cron=25 * * * *`.
|
||||
- Temporal schedule after sync: `paused=false`, calendar minute `25`.
|
||||
- Repeat sync returned identical schedule counts
|
||||
(`upserted=5`, `paused=1`, `deleted_orphans=0`) — idempotent.
|
||||
- Rollback flip restored `enabled=false`, `cron=15 * * * *`, schedule
|
||||
`paused=true`, calendar minute `15`.
|
||||
- `actcore-worker` pod UID unchanged (`a68d6539-2bba-457e-a78a-39564002a980`,
|
||||
started `2026-06-21T18:46:46Z`); `actcore-event-router` pod UID unchanged.
|
||||
- Event-triggered definitions: none projected on Railiance01 today; hot DB
|
||||
reload path for event definitions remains covered by T03 unit tests and an
|
||||
unchanged event-router deployment.
|
||||
|
||||
Automation: `scripts/smoke_admin_sync_no_restart.py`. Runbook section added
|
||||
under "Railiance01 no-restart smoke".
|
||||
|
||||
@@ -0,0 +1,78 @@
|
||||
---
|
||||
id: ACTIVITY-WP-0013
|
||||
type: workplan
|
||||
title: "Reuse Surface Report Gaps Resolver"
|
||||
domain: custodian
|
||||
repo: activity-core
|
||||
status: finished
|
||||
owner: codex
|
||||
topic_slug: activity-core
|
||||
created: "2026-06-18"
|
||||
updated: "2026-06-18"
|
||||
state_hub_workstream_id: "01e68dfd-b146-4aef-a575-2d3b178ca5c2"
|
||||
---
|
||||
|
||||
# Reuse Surface Report Gaps Resolver
|
||||
|
||||
Implement the R2 handoff from kaizen-agentic (`bffa224c`) so the
|
||||
`reuse_surface_report_gaps` shell context source populates
|
||||
`context.gaps` for the Coulomb daily registry hygiene sweep.
|
||||
|
||||
## Register Shell Resolver Query
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0013-T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "a6e1fc5c-7b42-436d-914e-4d605cb6f329"
|
||||
```
|
||||
|
||||
Add a dedicated reuse-surface context resolver module and register
|
||||
`reuse_surface_report_gaps` on the `shell` resolver path while preserving
|
||||
the existing kaizen shell query behavior.
|
||||
|
||||
## Implement Batch And Signal Semantics
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0013-T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "229cf285-8388-471d-95fd-08400db1553e"
|
||||
```
|
||||
|
||||
Load the Coulomb rollout roster, select active repos with a persisted
|
||||
round-robin cursor, resolve repo roots from State Hub host paths, run
|
||||
`reuse-surface report gaps --format json`, and emit gap records for the
|
||||
enabled registry hygiene signals.
|
||||
|
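The persisted round-robin selection described above can be sketched as follows. This is a minimal illustration, not the resolver's actual code: the function name `select_batch`, the `next_index` cursor key, and the JSON cursor file are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def select_batch(active_repos, cursor_path, batch_size):
    """Return the next batch of repos and persist the advanced cursor.

    Hypothetical helper: the cursor is stored as {"next_index": int}.
    """
    if not active_repos:
        return []
    try:
        cursor = int(json.loads(cursor_path.read_text())["next_index"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        cursor = 0  # missing or corrupt cursor: start from the top
    size = min(batch_size, len(active_repos))
    start = cursor % len(active_repos)
    # Wrap around the roster so every active repo is visited in turn.
    batch = [active_repos[(start + i) % len(active_repos)] for i in range(size)]
    next_index = (start + size) % len(active_repos)
    cursor_path.write_text(json.dumps({"next_index": next_index}))
    return batch


with tempfile.TemporaryDirectory() as tmp:
    cursor_file = Path(tmp) / "cursor.json"
    repos = ["repo-a", "repo-b", "repo-c"]
    first = select_batch(repos, cursor_file, 2)
    second = select_batch(repos, cursor_file, 2)
```

Using a temporary cursor file, as the T04 smoke does, keeps a test run from advancing the production cursor.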
||||
## Cover Required And Optional Failure Modes
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0013-T03
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "85b5c7d4-40e1-4945-8ada-1dff2363c194"
|
||||
```
|
||||
|
||||
Ensure missing required dependencies fail visibly while optional resolver
|
||||
sources bind an empty `context.gaps` list. Add unit coverage for fixture
|
||||
rollout data, mocked CLI JSON, resolver binding, and `hygiene_signal`
|
||||
rule gating.
|
||||
|
||||
## Smoke Real Coulomb Rollout
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0013-T04
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "6a5446ed-b4ec-4693-b508-65415571d834"
|
||||
```
|
||||
|
||||
Run a live resolver smoke against
|
||||
`/home/worsch/coulomb-loop/loops/registry-hygiene/rollout.yaml` using a
|
||||
temporary round-robin cursor. The real active rollout produced five gaps,
|
||||
including one for `reuse-surface` with `hygiene_signal: stale_sbom`.
|
||||
The smoke supplied `reuse_surface_bin:
|
||||
/home/worsch/reuse-surface/.venv/bin/reuse-surface` and
|
||||
`runner_host: bnt-lap001`; the worker environment or definition params must
|
||||
provide equivalent values before enabling the production sweep.
|
||||
194
workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
Normal file
@@ -0,0 +1,194 @@
|
||||
---
|
||||
id: ACTIVITY-WP-0014
|
||||
type: workplan
|
||||
title: "Schedule Misfire Robustness & Run-Miss Recovery Options"
|
||||
domain: infotech
|
||||
repo: activity-core
|
||||
status: finished
|
||||
owner: claude
|
||||
topic_slug: activity-core
|
||||
created: "2026-06-23"
|
||||
updated: "2026-06-24"
|
||||
status_note: "T01-T05 complete; beachhead-endpoint adoption split to ACTIVITY-WP-0015"
|
||||
state_hub_workstream_id: "91b64686-5d17-4c86-bc9e-3d0ee6720cf5"
|
||||
---
|
||||
|
||||
# Schedule Misfire Robustness & Run-Miss Recovery Options
|
||||
|
||||
Make cron-triggered ActivityDefinitions robust to missed fires (worker/Temporal
|
||||
unavailable at trigger time) with explicit, per-definition recovery behaviour,
|
||||
plus detection/alerting when a scheduled fire is missed.
|
||||
|
||||
## Motivation
|
||||
|
||||
On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
|
||||
(cron `20 7 * * *` Europe/Berlin, projected into the Railiance runtime ConfigMap
|
||||
`actcore-external-activity-definitions`) produced **no `daily_triage` progress
|
||||
event at all** — neither a success nor a `could not run; operator review
|
||||
required` failure.
|
||||
|
||||
> **Corrected by T01 (2026-06-23).** The initial hypothesis below — that
|
||||
> `_build_schedule()` never set `catchup_window`, so a short-default catchup
|
||||
> window silently dropped the fire — was **disproven on the live cluster**. The
|
||||
> Temporal schedule is healthy with `CatchupWindow 365d` (the server default) and
|
||||
> `0 MissedCatchupWindow`. The real cause is that the run **fired and ran but
|
||||
> failed at the report sink** with `Connection refused` posting to State Hub,
|
||||
> because railiance01 reaches State Hub via a reverse tunnel back to the
|
||||
> workstation, which is asleep at 07:20 Berlin. See the T01 findings and T05.
|
||||
|
||||
The trigger now originates entirely on **railiance01** (in-cluster Temporal
|
||||
Schedule, ConfigMap-projected definition) and is **not** laptop-dependent — but
|
||||
the triage's State Hub *data dependencies* (context resolution and report
|
||||
delivery) still route back to the workstation State Hub.
|
||||
|
||||
This workplan still delivers worthwhile robustness — explicit run-miss recovery
|
||||
policies (T02) and missed-fire detection (T03) — but the fix for *this* incident
|
||||
is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).
|
||||
|
||||
## Desired run-miss options (from Bernd)
|
||||
|
||||
Three explicit, per-definition behaviours when a fire is missed:
|
||||
|
||||
1. **Run on trigger or skip** — never recover a missed fire.
|
||||
2. **Run on trigger or later if missed** — recover **all** missed fires when back up.
|
||||
3. **Run on trigger or later if missed, but skip if next trigger reached** —
|
||||
recover only the **most recent** missed fire; do not accumulate a backlog.
|
||||
|
||||
Proposed mapping to a new `misfire_policy` value set (names open to review):
|
||||
|
||||
| Policy | Semantics | Temporal mapping |
|
||||
| --- | --- | --- |
|
||||
| `skip` | Run on trigger or skip | `catchup_window ≈ 0`, `overlap=SKIP` |
|
||||
| `catchup_all` | Run on trigger or all missed later | `catchup_window=<long>`, `overlap=BUFFER_ALL` |
|
||||
| `catchup_latest` | Run on trigger or only the latest missed | `catchup_window ≈ 1 interval`, `overlap=BUFFER_ONE` |
|
||||
|
||||
## Confirm root cause on Railiance01
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0014-T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab"
|
||||
```
|
||||
|
||||
Inspected via `ssh railiance01` + in-node `kubectl`/`temporal` (no k3s tunnel is
|
||||
defined for railiance01; the documented access path is SSH to the host).
|
||||
|
||||
**Findings (2026-06-23) — the WP-0014 premise was wrong for this incident:**
|
||||
|
||||
- All pods healthy; `actcore-worker` up 44h, 0 restarts. Not a crash.
|
||||
- The daily-triage Temporal schedule (`activity-schedule-6fca51fa-…`) is
|
||||
**healthy**: `Paused false`, `OverlapPolicy Skip`, **`CatchupWindow 365d`**
|
||||
(Temporal's *default* when unset), `ActionCounts {Total:8, MissedCatchupWindow:0}`.
|
||||
So fires were **not** silently dropped — my original "no catchup window → silent
|
||||
drop" hypothesis does not hold; the server default is already 365d.
|
||||
- The `2026-06-23T05:20:00Z` fire **did fire and ran**, then **Failed at the report
|
||||
sink**: `report sink failure: state-hub-progress … '[Errno 111] Connection
|
||||
refused'`. The run produced a report but could not deliver it to State Hub, so
|
||||
no `daily_triage` progress event (not even a "could not run" one) was posted →
|
||||
the silence. The 06-22 fire has no execution in retention (bridge likely down
|
||||
then too / schedule update window at `LastUpdateAt 1d ago`).
|
||||
- Root cause is **State Hub connectivity from railiance01**, not Temporal. The
|
||||
in-cluster `actcore-state-hub-bridge` (`hostNetwork`) proxies to
|
||||
`127.0.0.1:18000` on the node — the local end of the ops-bridge **reverse tunnel
|
||||
back to the workstation's State Hub**. At 07:20 Europe/Berlin (= 05:20 UTC) the
|
||||
workstation/tunnel was unreachable → `Connection refused`. Chronic flakiness
|
||||
confirmed: 102 State Hub resolver timeouts in 24h (69 `recently_on_scope`,
|
||||
33 `consistency_sweep`).
|
||||
|
||||
**Implication:** the trigger *is* independent of the laptop, but the triage's
|
||||
**data dependencies (State Hub context resolution + report delivery) still route
|
||||
back to the workstation State Hub**, which is asleep at 07:20 Berlin. WP-0014's
|
||||
misfire policies still add useful robustness, but the real fix is (a) State Hub
|
||||
reachable from railiance01 independent of the workstation, and/or (b) sinks/
|
||||
resolvers resilient to transient State Hub unavailability (retry/backoff,
|
||||
store-and-forward) instead of hard-failing the workflow. Tracked as follow-up
|
||||
below. Backfill deferred: a replay only succeeds while the workstation State Hub
|
||||
is reachable.
|
||||
|
||||
## Implement explicit misfire recovery modes
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0014-T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "19615562-4cb2-4f25-872f-505d6e40dcc5"
|
||||
```
|
||||
|
||||
Add `catchup_window_seconds` to `CronTriggerConfig` and redefine `misfire_policy`
|
||||
into the three explicit modes above. In `_build_schedule()` set
|
||||
`SchedulePolicy(overlap=..., catchup_window=timedelta(...))` per mode. Remove the
|
||||
ad-hoc 1-hour `backfill` hack in favour of native catchup-window semantics. Keep
|
||||
backward compatibility for existing `skip`/`catchup`/`compress` values (alias
|
||||
map). Add unit tests for each mode's `(catchup_window, overlap)` mapping.
|
||||
|
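A minimal sketch of the per-mode mapping, under stated assumptions. It uses a local `Overlap` enum and `MisfireSettings` dataclass instead of the Temporal SDK types so it runs standalone. The legacy alias targets, the 60-second stand-in for "≈ 0", and the 365-day long window are illustrative guesses, not the repo's actual values.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Overlap(Enum):
    """Stand-in for the Temporal overlap policies named in the table."""
    SKIP = "SKIP"
    BUFFER_ONE = "BUFFER_ONE"
    BUFFER_ALL = "BUFFER_ALL"


@dataclass(frozen=True)
class MisfireSettings:
    catchup_window: timedelta
    overlap: Overlap


# Assumption: which new mode each legacy value maps to is not stated in the plan.
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}


def resolve_misfire(policy, interval, long_window=timedelta(days=365)):
    """Map a misfire_policy name to (catchup_window, overlap)."""
    name = LEGACY_ALIASES.get(policy, policy)
    if name == "skip":
        # "≈ 0": a small positive window, chosen here only for illustration.
        return MisfireSettings(timedelta(seconds=60), Overlap.SKIP)
    if name == "catchup_all":
        return MisfireSettings(long_window, Overlap.BUFFER_ALL)
    if name == "catchup_latest":
        # Roughly one interval, so only the most recent miss is recovered.
        return MisfireSettings(interval, Overlap.BUFFER_ONE)
    raise ValueError(f"unknown misfire_policy: {policy!r}")


daily = resolve_misfire("catchup_latest", timedelta(days=1))
```

For a daily cron this yields a one-day window with `BUFFER_ONE`, which matches the `CatchupWindow 1d` and `OverlapPolicy BufferOne` reported after the T04 deployment.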
||||
## Missed-fire detection & alert sink
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0014-T03
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "dbedd96a-59ca-4b83-bce6-35755b076807"
|
||||
```
|
||||
|
||||
Detect when a scheduled definition has no successful run within its expected
|
||||
interval + tolerance, and emit a signal (State Hub progress event and/or
|
||||
agent-inbox message) so a miss is visible even under `skip`. This is the
|
||||
observability the current silent-drop behaviour lacks — a miss should never again
|
||||
be invisible.
|
||||
|
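The detection rule above reduces to one comparison. The sketch below assumes a hypothetical `is_missed` helper and illustrative timestamps; how the last successful run is looked up, and where the signal is sent, are left out.

```python
from datetime import datetime, timedelta, timezone


def is_missed(last_success, interval, tolerance, now):
    """True when no successful run falls inside interval + tolerance.

    last_success is None when the definition has never succeeded.
    """
    if last_success is None:
        return True
    return now - last_success > interval + tolerance


interval = timedelta(days=1)
tolerance = timedelta(minutes=30)
now = datetime(2026, 6, 23, 6, 0, tzinfo=timezone.utc)

# Last success two days ago: the fire in between was missed.
stale = is_missed(datetime(2026, 6, 21, 5, 21, tzinfo=timezone.utc),
                  interval, tolerance, now)
# Last success earlier today: still inside the window.
fresh = is_missed(datetime(2026, 6, 23, 5, 21, tzinfo=timezone.utc),
                  interval, tolerance, now)
```

Because the check keys on the last *successful* run rather than the last fire, it would also flag a run that fired but failed at the report sink, which is the failure shape found in T01.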
||||
## Apply policy to runtime definitions & document
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0014-T04
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "04e9d1d2-1192-4402-9402-b12c5d7d44e5"
|
||||
```
|
||||
|
||||
Set `misfire_policy: catchup_latest` for `daily-statehub-wsjf-triage` and documented
|
||||
run-miss options in `docs/runbook.md`.
|
||||
|
||||
**Deployed to and verified on railiance01 (2026-06-24):** built `activity-core:
|
||||
railiance01-prod` with the WP-0014 code (T02/T03/T05), imported into k3s
|
||||
containerd, applied the ConfigMap, rolled `actcore-worker`/`api`/`event-router`
|
||||
onto the new image, and ran `/admin/sync` (6 defs, 4 schedules upserted, 0
|
||||
errors). The live Temporal schedule now reports `OverlapPolicy BufferOne` +
|
||||
`CatchupWindow 1d` (= `catchup_latest`); pods healthy, API `db:true temporal:true`.
|
||||
|
||||
## Keep activity-core thin under the State Hub beachhead model
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0014-T05
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
|
||||
```
|
||||
|
||||
**Architecture decision (Bernd, 2026-06-23):** the resilience that this incident
|
||||
needs — queuing writes and caching reads while State Hub is unreachable — must
|
||||
**not** be a burden carried by client repos. It belongs to State Hub as a
|
||||
**per-machine local "beachhead"** (transparent read cache + write outbox, possibly
|
||||
with State-Hub federation), owned by custodian/state-hub. It handles all three
|
||||
failure modes: network interruption, central State Hub crash, central machine
|
||||
down. This is handed off to state-hub (see the coordination message / proposal);
|
||||
**do not build client-side queue/cache logic in activity-core.**
|
||||
|
||||
activity-core's only responsibilities under this model are thin:
|
||||
|
||||
- **Idempotent writes — DONE (2026-06-23, in-repo):** added
|
||||
`activity_core/state_hub_write` (`idempotency_headers`); every State Hub write
|
||||
(report-sink, ops-evidence, schedule-miss) now sends a stable `Idempotency-Key`
|
||||
header derived from `run_id:instruction_id:event_type`. The read-based
|
||||
`_progress_exists` dedup is now best-effort (returns `False` on connection
|
||||
error instead of hard-failing), so the guarantee lives on the keyed write, not
|
||||
a live read. Tests in `tests/test_state_hub_write.py`; documented in
|
||||
`docs/runbook.md`.
|
||||
- **Adopt the beachhead endpoint — MOVED to [[ACTIVITY-WP-0015]]:** pointing
|
||||
`STATE_HUB_URL` at the local beachhead and retiring the bespoke
|
||||
`actcore-state-hub-bridge` proxy depend on the state-hub beachhead existing
|
||||
first. Split into WP-0015 (status `blocked`) so this workplan can close on its
|
||||
completed in-repo work rather than waiting on an external capability.
|
||||
|
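A sketch of how a stable key can be derived from `run_id:instruction_id:event_type`. The function name follows the bullet above, but its signature and the SHA-256 hashing step are assumptions; the real helper may send the joined string as-is.

```python
import hashlib


def idempotency_headers(run_id, instruction_id, event_type):
    """Build a deterministic Idempotency-Key header for one State Hub write.

    Hashing is an illustrative choice that keeps the key fixed-length.
    """
    raw = f"{run_id}:{instruction_id}:{event_type}"
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return {"Idempotency-Key": digest}


first_try = idempotency_headers("run-1", "daily-triage", "daily_triage")
retry = idempotency_headers("run-1", "daily-triage", "daily_triage")
other_event = idempotency_headers("run-1", "daily-triage", "schedule_miss")
```

The key depends only on the identity of the write, never on a timestamp or attempt number, so a retried or replayed write carries the same key and the receiver can drop the duplicate.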
||||
T05 is done as far as activity-core can act now; the external-dependent adoption
|
||||
lives in WP-0015.
|
||||
@@ -0,0 +1,54 @@
|
||||
---
|
||||
id: ACTIVITY-WP-0015
|
||||
type: workplan
|
||||
title: "Adopt State Hub Beachhead Endpoint"
|
||||
domain: infotech
|
||||
repo: activity-core
|
||||
status: blocked
|
||||
owner: claude
|
||||
topic_slug: activity-core
|
||||
created: "2026-06-24"
|
||||
updated: "2026-06-24"
|
||||
state_hub_workstream_id: "bbc07f9e-9323-4b2b-b556-c33b37d0b228"
|
||||
---
|
||||
|
||||
# Adopt State Hub Beachhead Endpoint
|
||||
|
||||
Carries the **blocked remainder** of [[ACTIVITY-WP-0014]] T05. The in-repo half
|
||||
(idempotency-keyed State Hub writes) shipped in WP-0014; this workplan is the
|
||||
client-side adoption that depends on the state-hub-owned **beachhead** capability
|
||||
(per-machine read cache + write outbox) existing first.
|
||||
|
||||
**Blocked on:** the state-hub beachhead (proposal sent to the `state-hub` agent,
|
||||
2026-06-23). Do not build queue/cache logic in activity-core — see
|
||||
[[statehub-beachhead-principle]].
|
||||
|
||||
## Point STATE_HUB_URL at the beachhead
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0015-T01
|
||||
status: wait
|
||||
priority: medium
|
||||
state_hub_task_id: "76b6132d-394a-4a67-bef6-73bb9d1e277e"
|
||||
```
|
||||
|
||||
Once the state-hub beachhead exposes a local endpoint, point activity-core's
|
||||
`STATE_HUB_URL` (and the railiance runtime config) at it and verify reads are
|
||||
served from cache and writes are queued/flushed correctly when central State Hub
|
||||
is unreachable. Confirm idempotency-keyed writes dedup on flush (no duplicate
|
||||
`daily_triage`/progress events).
|
||||
|
||||
## Retire the bespoke actcore-state-hub-bridge proxy
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0015-T02
|
||||
status: wait
|
||||
priority: medium
|
||||
state_hub_task_id: "526c2129-cbf7-4531-a319-aebfc75cc6a3"
|
||||
```
|
||||
|
||||
Remove the inline `hostNetwork` HTTP proxy `actcore-state-hub-bridge` from
|
||||
`k8s/railiance/20-runtime.yaml` — it is a primitive precursor of the beachhead
|
||||
and should be replaced by the state-hub-owned component, not extended. Re-verify
|
||||
the daily triage end-to-end after cutover, including an overnight scheduled run
|
||||
while the workstation is asleep (the original failure condition).
|
||||
@@ -0,0 +1,417 @@
|
||||
---
|
||||
id: ACTIVITY-WP-0016
|
||||
type: workplan
|
||||
title: "LLM Output Robustness & The Producer Trust Boundary"
|
||||
domain: custodian
|
||||
repo: activity-core
|
||||
status: active
|
||||
owner: codex
|
||||
topic_slug: custodian
|
||||
created: "2026-06-26"
|
||||
updated: "2026-06-30"
|
||||
state_hub_workstream_id: "4ef0d53b-1777-41ae-80c6-1b69fdb34726"
|
||||
---
|
||||
|
||||
# ACTIVITY-WP-0016 — LLM Output Robustness & The Producer Trust Boundary
|
||||
|
||||
## Context
|
||||
|
||||
On 2026-06-26 the scheduled `daily-statehub-wsjf-triage` instruction fired on
|
||||
time (`daily_triage` event 05:20:57Z) but its output **failed schema
|
||||
validation**: `Expecting ',' delimiter: line 136 column 22 (char 5268)`. The
|
||||
model emitted a long ranked WSJF recommendation list (reached rank 7+ with
|
||||
nested `wsjf` objects) and the JSON broke deep in that list. Because the report
|
||||
is a single monolithic JSON document, one malformed delimiter discarded the
|
||||
**entire** run. This reset the three-clean-consecutive-scheduled-runs streak in
|
||||
`ACTIVITY-WP-0006-T03` (06-24 ✅, 06-25 ✅, 06-26 ✗-validation) and is the
|
||||
LLM-output-quality surface deferred from `ACTIVITY-WP-0010`.
|
||||
|
||||
The scheduling/runtime layer is healthy — this is purely an output-robustness
|
||||
and boundary-design problem. Today's code (`src/activity_core/rules/executor.py`)
|
||||
already: passes the output schema to llm-connect as a `json_schema` model param
|
||||
(`_llm_run_config`), retries once, runs a fenced/`raw_decode` tolerant parser
|
||||
(`_parse_json_output`), and preserves a bounded 4000-char preview on hard
|
||||
failure (`_invalid_output_report`). None of that helps when error locality is
|
||||
zero: the failure unit is the whole document, not the offending item.
|
||||
|
||||
## Design Frame — The Producer Trust Boundary
|
||||
|
||||
This workplan is anchored to a deliberate architectural stance, not just a bug
|
||||
fix. Capture it in an ADR (T04) so future work inherits it.
|
||||
|
||||
**Premise.** activity-core has a *trust boundary* where free-form producer
|
||||
output meets strict deterministic consumers (JSON Schema validators, the task
|
||||
emitter, classic compute pipelines). The producers are **LLMs and humans (and
|
||||
agents acting for either)**. Both are *untrusted producers*: their output may be
|
||||
|
||||
- **erroneous** — hallucination, truncation (token-limit cutoff), drift,
|
||||
type slips, typos; or
|
||||
- **malicious** — prompt injection, crafted payloads, oversized/deeply-nested
|
||||
structures aimed at exhausting or confusing the consumer.
|
||||
|
||||
The architecture should treat the boundary as an adversarial frontier and place
|
||||
**guardrails + error-correction tooling there**, rather than letting raw
|
||||
producer output flow into deterministic consumers and fail (or worse, partially
|
||||
succeed) downstream.
|
||||
|
||||
**Two non-fail-fast postures.** When we do *not* want to hard-fail on a problem,
|
||||
there are two sensible strategies — and they compose:
|
||||
|
||||
- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output
|
||||
as-is; on exception, catch → repair → retry → or quarantine. Cheap on the
|
||||
happy path. Blast radius depends entirely on how granular the catch is. Good
|
||||
when failures are rare and locally recoverable. Risk: failures surface late,
|
||||
possibly after partial side effects.
|
||||
- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp,
|
||||
and normalize the output to a known-good shape *before* it enters the pipeline
|
||||
— drop bad items, coerce types, bound sizes/depth, allow-list references — so
|
||||
the consumer only ever sees clean input. Higher upfront cost, smaller blast
|
||||
radius, no partial side effects. Good when failures are common or
|
||||
consequences are high.
|
||||
|
||||
**Governing principles for this repo:**
|
||||
|
||||
1. **Push verification to the boundary; keep the interior strict.** Apply
|
||||
posture **B** at the producer→consumer boundary (verify+mitigate structure);
|
||||
keep posture **A** for residual exceptions inside the verified core. Never
|
||||
relax the interior schema to absorb producer sloppiness.
|
||||
2. **Make error locality match the unit of work.** One bad recommendation must
|
||||
cost one recommendation, not the whole report. Framing the payload so each
|
||||
item is independently parseable is the single highest-leverage change.
|
||||
3. **Quarantine, never silently drop.** Invalid units are preserved as bounded,
|
||||
provenance-tagged artifacts (index, error, raw snippet) so they can be
|
||||
debugged or replayed — degraded-but-usable is distinct from total loss.
|
||||
4. **Both human and agent input get the same rigor.** Guardrails are
|
||||
producer-agnostic: the same size/depth/count caps, reference allow-lists, and
|
||||
truncation detection apply whether the producer is an LLM, an agent, or a
|
||||
human form submission.
|
||||
|
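The size, depth, and count caps in principle 4 can be sketched as a producer-agnostic check over already-parsed data. The function name `bound_violations` and the default limits are hypothetical, chosen only for this example.

```python
def bound_violations(value, *, max_depth=8, max_items=50, max_string=4000,
                     _depth=0, _path="$"):
    """Return a list of human-readable cap violations for a parsed payload."""
    if _depth > max_depth:
        # Stop descending: deep nesting is itself the violation.
        return [f"{_path}: nesting deeper than {max_depth}"]
    if isinstance(value, str):
        if len(value) > max_string:
            return [f"{_path}: string longer than {max_string}"]
        return []
    if isinstance(value, (list, dict)):
        if len(value) > max_items:
            # Do not walk an oversized collection; that would defeat the cap.
            return [f"{_path}: more than {max_items} entries"]
        children = value.items() if isinstance(value, dict) else enumerate(value)
        violations = []
        for key, child in children:
            violations.extend(bound_violations(
                child, max_depth=max_depth, max_items=max_items,
                max_string=max_string, _depth=_depth + 1,
                _path=f"{_path}.{key}"))
        return violations
    return []  # numbers, booleans, and null carry no size risk here


clean = bound_violations({"summary": "ok", "items": [1, 2, 3]})

nested = "x"
for _ in range(10):
    nested = [nested]
too_deep = bound_violations(nested)
```

Returning violations rather than raising lets the caller decide between quarantine and rejection, in line with principle 3.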
||||
## Reproduce & Root-Cause The Failure
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0016-T01
|
||||
status: wait
|
||||
priority: high
|
||||
state_hub_task_id: "74fd16a5-4ea5-4dfe-8526-dfa27cf76138"
|
||||
```
|
||||
|
||||
Recover the **full** raw llm-connect response for the 06-26 failure (the State
|
||||
Hub event keeps only a 4000-char preview; the break is at char 5268) and
|
||||
establish the precise cause.
|
||||
|
||||
Done when:
|
||||
|
||||
- the full raw response is pulled from the runtime llm-connect log / response
|
||||
store and the exact offending token at char 5268 is identified;
|
||||
- `finish_reason` is captured to confirm or rule out token-limit **truncation**
|
||||
vs a structural mid-stream glitch;
|
||||
- it is confirmed whether llm-connect actually **enforced** the `json_schema`
|
||||
constrained-decoding hint or merely accepted it as advisory (this determines
|
||||
whether the schema param is load-bearing);
|
||||
- the failing payload is captured as a regression fixture under `tests/`.
|
||||
|
||||
2026-06-26 findings (local analysis on the workstation):
|
||||
|
||||
- **Mechanism confirmed structurally.** There are **16 active workstreams**
|
||||
org-wide and the triage instruction emits ~one ranked recommendation per
|
||||
candidate. The preserved preview holds 7 fully-formed recommendations; the JSON
|
||||
break is at char 5268 (~rank 8–9). The unbounded one-per-workstream list is the
|
||||
structural cause — more items = more tokens = higher odds of a mid-stream JSON
|
||||
slip and/or truncation. This directly justifies T02's bounded top-N + per-item
|
||||
framing.
|
||||
- **Both attempts failed.** `executor._execute` retries once
|
||||
(`src/activity_core/rules/executor.py:166-171`); the recorded error is from the
|
||||
**retry** output, so the model produced invalid JSON twice — not a one-off.
|
||||
- **activity-core discards the diagnostics needed to root-cause this.** Three
|
||||
retention gaps mean the exact char-5268 token cannot be recovered from
|
||||
activity-core data at all:
|
||||
1. `LLMConnectClient.complete()` returns only `data["content"]`
|
||||
(`llm_client.py:57-60`) — it drops `finish_reason`/`usage` from the
|
||||
llm-connect HTTP response, so truncation-vs-structural cannot be
|
||||
distinguished locally.
|
||||
2. the report sink caps raw output at **4000 chars** (`_invalid_output_report`,
|
||||
`executor.py:259`) — below the 5268 break.
|
||||
3. the worker log caps the preview at **2000 chars** (`executor.py:175`).
|
||||
- **Remaining (remote, operator-owned).** Confirming the exact offending token
|
||||
and `finish_reason` requires llm-connect's producer-side logs on `railiance01`
|
||||
— cluster access, outside this repo's SCOPE for direct action. Truncation is
|
||||
the leading hypothesis given the 16-item input, but the mitigation (T02/T03) is
|
||||
identical either way, so T01 does not block the build work.
|
||||
- **Feeds T03/T04.** The retention gaps are themselves defects to fix: capture
|
||||
`finish_reason`/`usage` and persist a larger bounded raw artifact on validation
|
||||
failure so this class of failure is never un-debuggable again.
|
||||
- Partial fixture saved:
|
||||
`tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json`
|
||||
(the 4000-char preview + validation error; full payload pending the remote pull).
|
||||
|
||||
2026-06-30 local retention hardening: activity-core now preserves future
|
||||
llm-connect diagnostic metadata instead of dropping it at the client boundary.
|
||||
`LLMConnectClient.complete()` still returns the content string for compatibility,
|
||||
but records safe non-secret response fields such as `finish_reason` and `usage`
|
||||
on `last_response_metadata`; the executor copies that into report artifacts,
|
||||
State Hub progress detail, and working-memory notes. Invalid report raw previews
|
||||
were raised from 4000 to 12000 chars. This does not recover the historical
|
||||
06-26 full payload or producer-side `finish_reason`, so T01 remains in `wait` on the
|
||||
remote llm-connect log pull, but the retention gap is closed for future failures.
|
||||
|
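The retention change described above can be sketched as follows. The class name, the injected `post` callable, and the request payload shape are assumptions for illustration; only `complete()`, `data["content"]`, `finish_reason`, `usage`, and `last_response_metadata` come from the notes.

```python
SAFE_METADATA_FIELDS = ("finish_reason", "usage")


class LLMConnectClientSketch:
    """Hypothetical client that keeps diagnostics while returning a string."""

    def __init__(self, post):
        # post: callable taking a request dict and returning the response dict.
        self._post = post
        self.last_response_metadata = {}

    def complete(self, prompt):
        data = self._post({"prompt": prompt})
        # Allow-list the fields to keep, so secrets never leak into reports.
        self.last_response_metadata = {
            key: data[key] for key in SAFE_METADATA_FIELDS if key in data
        }
        return data["content"]


def fake_post(_request):
    return {
        "content": '{"summary": "ok"}',
        "finish_reason": "length",
        "usage": {"completion_tokens": 1200},
        "api_key": "not-for-logs",
    }


client = LLMConnectClientSketch(fake_post)
content = client.complete("rank the workstreams")
metadata = client.last_response_metadata
```

A `finish_reason` of `length` alongside a parse failure would point to token-limit truncation, which is the distinction T01 could not make from the historical data.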
||||
## Schema + Prompt Redesign For Error Locality
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0016-T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "ae67ca8c-ee01-4a8d-9e8a-a0a36c999758"
|
||||
```
|
||||
|
||||
Redesign the daily-triage report contract so a single malformed item can no
|
||||
longer discard the whole report (principle #2).
|
||||
|
||||
Done when:
|
||||
|
||||
- the recommendation list is **bounded** (configurable top-N, default 5–7) in
|
||||
both the prompt and the output schema — long lists are where the model drifts;
|
||||
- the report uses a **per-item-framed** shape (JSON Lines / NDJSON — one
|
||||
recommendation object per line — or an equivalent delimited per-item form)
|
||||
behind a minimal stable envelope (`summary` + framed items), so each item is
|
||||
an independent parse unit;
|
||||
- the prompt explicitly states the contract, the per-item framing, the cap, and
|
||||
an "if uncertain, emit fewer well-formed items rather than more" instruction;
|
||||
- `max_tokens` is set with headroom for the bounded list so truncation cannot
|
||||
occur at the expected size;
|
||||
- the output schema file (`_load_output_schema` target) is updated to match.
|
||||
|
||||
2026-06-26 progress (in-repo portion):
|
||||
|
||||
- **Strict, bounded schema written** — `schemas/daily-triage-report.json` went
|
||||
from `recommendations.items: {type: object}` (accept-anything) to a strict
|
||||
per-item contract: `required [rank, candidate, action, why]` with typed
|
||||
`wsjf` sub-fields, plus `maxItems: 7`. The strict item shape is what lets the
|
||||
T03 boundary parser validate each recommendation independently.
|
||||
- **`maxItems` is a hint, not a hard reject** — the in-repo validator
|
||||
(`_validate_schema_node`) only enforces `type`/`required`/`properties`/`items`
|
||||
and ignores `maxItems`/`enum`. That is deliberate: a hard `maxItems` reject
|
||||
would discard a whole 16-item report — the exact blast-radius bug WP-0016
|
||||
removes. The bound is enforced via the prompt + the llm-connect `json_schema`
|
||||
constraint hint + T03 mitigation (keep top-N by rank, quarantine extras).
|
||||
- **DEPLOY COUPLING (important):** this schema file is consumed *both* as the
|
||||
llm-connect hint *and* by the current whole-document validator. Tightening
|
||||
per-item `required` fields makes the existing whole-doc validation hard-fail
|
||||
**more** until T03 replaces it with per-item quarantine. Therefore the schema
|
||||
change MUST ship together with T03 — do not deploy the strict schema to the
|
||||
runtime bundle ahead of the T03 parser. Four executor/instruction tests that
|
||||
asserted the old loose contract were updated to the strict contract; the
|
||||
forwarded-schema test now reads the live file instead of hard-coding it.
|
||||
- **Truncation hypothesis corroborated** — the instruction config carries
|
||||
`max_tokens` on the order of ~1200 (per the wiring test fixture). 5268 chars ≈
|
||||
~1300–1500 tokens, so a ~1200-token cap would truncate a 16-item list right at
|
||||
the observed break. This strengthens T01's leading hypothesis and makes the
|
||||
`max_tokens` headroom change below concrete.
|
||||
|
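The token estimate in the truncation bullet can be reproduced with simple arithmetic. The 3.5 to 4 characters-per-token range is a rule-of-thumb assumption; the real ratio depends on the model's tokenizer and on how much of the output is JSON punctuation.

```python
BREAK_CHAR = 5268        # character offset of the JSON break
MAX_TOKENS_CAP = 1200    # approximate cap from the wiring test fixture

# Assumed characters-per-token range; fewer chars per token means more tokens.
low_estimate = round(BREAK_CHAR / 4.0)
high_estimate = round(BREAK_CHAR / 3.5)

cap_exceeded = low_estimate > MAX_TOKENS_CAP
```

Even the low end of the range sits above the cap, which is why a cutoff near the observed break is plausible under this assumption.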
||||
**Bundle handoff (NOT in this repo — runtime-projected definition).** The triage
|
||||
prompt and `max_tokens` live in the Railiance runtime bundle, not in repo files.
|
||||
Apply there:
|
||||
1. Instruct a **bounded top-N** (≤ 7) ranked recommendations, "if uncertain emit
|
||||
fewer well-formed items rather than more."
|
||||
2. Specify the **per-item framing** the T03 parser will consume (NDJSON: a
|
||||
leading summary object, then one recommendation JSON object per line).
|
||||
3. Raise **`max_tokens`** to give clear headroom for 7 framed items (eliminate
|
||||
truncation at the expected size).
|
||||
4. State the value vocabularies (`action`, `confidence`) the T04 guardrails will
|
||||
check.
|
||||
|
||||
2026-06-30 live evidence check: the 2026-06-28 and 2026-06-29 scheduled
|
||||
`daily_triage` events validated successfully, which shows the runtime is no
|
||||
longer failing every day. However, the preserved State Hub reports still contain
|
||||
10 recommendations, not the requested bounded top-N of 7 / framed item contract.
|
||||
Treat that as evidence that the runtime-projected prompt/schema/max-token bundle
|
||||
has not fully absorbed the T02 handoff yet.
|
||||
|
||||
2026-06-30 source projection closeout: patched `k8s/railiance/20-runtime.yaml`
|
||||
so the projected `daily-statehub-wsjf-triage.md` prompt now says at most 7
|
||||
recommendations and instructs the model to emit fewer well-formed items rather
|
||||
than more. The projected `daily-triage-report.json` now has `maxItems: 7` and
|
||||
`rank.maximum: 7`, aligned with the repo schema. `max_tokens: 1800` remains as
|
||||
headroom for the bounded report. T02 is done in source; live deployment and an
|
||||
observed run with ≤ 7 recommendations remain under T05.
|
||||
|
||||
## Boundary Parser — Verify & Mitigate (Posture B)
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0016-T03
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "d65a6281-f1f9-4a9b-a835-da065411b709"
|
||||
```
|
||||
|
||||
Implement item-granular parsing with a quarantine lane in
|
||||
`src/activity_core/rules/executor.py`, applying posture **B** at the boundary
|
||||
(principles #1–#3).
|
||||
|
||||
Done when:
|
||||
|
||||
- the parser splits the envelope from the framed items, then parses **each item
|
||||
independently**; a malformed item is routed to a bounded `quarantined_items`
|
||||
artifact (index + validation error + raw snippet), not raised;
|
||||
- a run with some valid and some invalid items emits a report over the surviving
|
||||
valid items with `output_validated=true`, plus `partial=true` and
|
||||
`quarantined_count` / `quarantined_items` markers — degraded-but-usable is
|
||||
reported distinctly from total loss;
|
||||
- a best-effort **repair** pass (close unterminated brackets/quotes, recover the
|
||||
valid prefix) is attempted per item before quarantining it;
|
||||
- truncation detected in T01 is handled as its own signal (recover whole items
|
||||
emitted before the cutoff rather than failing the document);
|
||||
- the existing monolithic-document path remains as the fallback when framing is
|
||||
absent (backward compatible with task-only instructions).
|
||||
|
||||
2026-06-26 progress (implemented in `src/activity_core/rules/executor.py`):
|
||||
|
||||
- **Resilient recovery wired into `_execute`.** When the whole-document parse +
|
||||
one retry still fail, report instructions (those with `report_sinks`) now run
|
||||
`_resilient_report` *before* the total-loss `_invalid_output_report`. If it
|
||||
recovers ≥1 valid item it returns a partial report; otherwise it returns None
|
||||
and the prior total-loss path is preserved unchanged.
|
||||
- **Brace/quote-aware object scanner, not line-splitting.** The real 06-26 output
|
||||
was pretty-printed (multi-line objects), so naive NDJSON line recovery would
|
||||
have failed. `_extract_object_spans` walks the `recommendations` array
|
||||
brace-depth- and string-aware, so it recovers each recommendation object
|
||||
whether pretty-printed across many lines *or* emitted one-per-line (NDJSON).
|
||||
The truncated trailing object is returned with `complete=False`.
|
||||
- **Layered mitigation per item:** `json.loads` → on failure for a truncated
|
||||
tail, a best-effort `_try_repair` (balance open string/brackets/braces) →
|
||||
then `_partition_items` validates each recovered object against the T02 item
|
||||
schema. Valid items survive; malformed or over-`maxItems` items are
|
||||
quarantined with provenance (`index`, `error`, `raw` snippet, `reason`).
|
||||
- **Report shape on degradation:** `output_validated=True` over the survivors,
|
||||
`review_required=True`, `partial=True`, `quarantined_count`, and a bounded
|
||||
`quarantined_items` list (cap 20). Degraded-but-usable is now reported
|
||||
distinctly from total loss.
|
||||
- **Verified against the real failure shape.** New tests reconstruct a
|
||||
pretty-printed report with 7 valid recommendations + a truncated tail (the
|
||||
06-26 shape) and a one-bad-item-among-valid case. The 7-item run now recovers
|
||||
all 7 and quarantines the broken tail (previously: whole run discarded);
|
||||
log line `instruction_output_recovered: kept=7, quarantined=1`. The bad-item
|
||||
run keeps 2 and quarantines the rank-less one.
|
||||
- **Deferred to T04 (clean scope boundary):** enforcing `maxItems` top-N on the
|
||||
*happy* path (valid JSON, all items schema-valid, but > N items) — the resilient
|
||||
path only runs on failure, so over-limit-on-success is a guardrail/count-cap
|
||||
concern, which is exactly T04's remit.
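
The brace-depth- and string-aware scan described above can be illustrated with a small standalone sketch. This is not the repo's `_extract_object_spans`; `extract_object_spans` below is a hypothetical simplification that only shows why tracking string state and brace depth recovers pretty-printed objects and flags a truncated tail.

```python
import json


def extract_object_spans(text: str, start: int = 0) -> list[tuple[str, bool]]:
    """Return (raw_object, complete) pairs for top-level {...} objects in text[start:]."""
    spans: list[tuple[str, bool]] = []
    depth = 0
    in_string = False
    escaped = False
    obj_start = None
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string: braces are data, only quotes/escapes matter.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                obj_start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append((text[obj_start:i + 1], True))
                obj_start = None
    if obj_start is not None:
        # Input ended mid-object: return the tail, marked incomplete.
        spans.append((text[obj_start:], False))
    return spans


raw = '''{"summary": "x", "recommendations": [
  {"rank": 1, "title": "brace } in string"},
  {"rank": 2,
   "title": "multi-line"},
  {"rank": 3, "title": "cut of'''
array_start = raw.index("[", raw.index('"recommendations"'))
spans = extract_object_spans(raw, array_start + 1)
parsed = [json.loads(obj) for obj, complete in spans if complete]
print([complete for _, complete in spans], [p["rank"] for p in parsed])
```

A line-splitting approach would have broken on the second object, which spans two lines; the scanner treats both layouts the same way.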
|
||||
|
||||
## Producer Guardrails + ADR-004
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0016-T04
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "f5c3af5b-9e28-42b0-9af5-4c99284e99b9"
|
||||
```
|
||||
|
||||
Write the architecture decision record and add the producer-agnostic guardrails
|
||||
(principle #4).
|
||||
|
||||
Done when:
|
||||
|
||||
- `docs/adr/adr-004-producer-trust-boundary.md` documents the trust boundary,
|
||||
the untrusted-producer premise (erroneous **and** malicious; human and agent),
|
||||
the A vs B taxonomy and where each applies, the error-locality principle, and
|
||||
the quarantine-with-provenance rule;
|
||||
- boundary guardrails are enforced at the consumer edge: max item **count**, max
|
||||
string length, max nesting **depth**, and a **reference allow-list** (e.g. a
|
||||
recommendation `candidate` / a task `target_repo` must resolve to a known
|
||||
workstream/repo before it is acted on);
|
||||
- guardrail rejections are quarantined with provenance, consistent with T03;
|
||||
- SCOPE.md / INTENT.md are checked for drift and updated if the boundary stance
|
||||
changes the documented contract.
|
||||
|
||||
2026-06-26 progress:
|
||||
|
||||
- **ADR-004 written** — `docs/adr/adr-004-producer-trust-boundary.md` documents
|
||||
the untrusted-producer premise (erroneous + malicious; LLM/agent/human), the
|
||||
A-vs-B posture taxonomy, the four governing principles, the concrete
|
||||
activity-core mechanisms, a posture-by-layer table, consequences, and
|
||||
alternatives considered. Accepted, scope cross-repo.
|
||||
- **Producer guardrails implemented** in `executor.py`, applied uniformly on the
|
||||
happy path *and* the recovery path via `_partition_items`: per-item order is
|
||||
structural-type → schema → structural caps (`_MAX_DEPTH=8`,
|
||||
`_MAX_STRING_LEN=4000`) → reference allow-list → count cap (`maxItems`). Each
|
||||
quarantine carries a `reason` (`malformed`/`schema`/`guardrail`/`allow_list`/
|
||||
`over_limit`).
|
||||
- **Happy-path count cap closed** (the item deferred from T03): a syntactically
|
||||
valid 9-item report now keeps 7 and quarantines 2 as `over_limit`, emitting a
|
||||
`partial` report — without a retry.
|
||||
- **Reference allow-list wired but inert.** `_allow_list_from_context` reads
|
||||
`context["known_candidates"]`; when present, recommendations with an unknown
|
||||
`candidate` are quarantined (`reason: allow_list`). Absent today → check is
|
||||
inert; activation is a one-line context-resolver change. Keeps the guardrail
|
||||
producer-agnostic (principle #4) and ready.
|
||||
- **SCOPE.md updated** — instruction-executor bullet now names the quarantine
|
||||
lane + guardrails; ADR-004 added to the Architecture Decisions list. No INTENT
|
||||
drift: this hardens the existing output contract, it does not extend scope.
|
||||
- New tests: happy-path count cap, oversized-string guardrail, allow-list
|
||||
rejection (all green).
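
The guardrail ordering above can be sketched in a few lines. The caps (`8`, `4000`) and the `reason` vocabulary come from the progress notes; everything else is an assumption for illustration: `partition_items` here is a hypothetical stand-in for `_partition_items`, and the "schema" step is reduced to checking for an integer `rank`.

```python
MAX_DEPTH = 8          # mirrors _MAX_DEPTH in the notes above
MAX_STRING_LEN = 4000  # mirrors _MAX_STRING_LEN in the notes above


def nesting_depth(value) -> int:
    """Number of nested containers (a flat dict has depth 1)."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


def longest_string(value) -> int:
    """Length of the longest string value anywhere in the structure."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, dict):
        return max((longest_string(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((longest_string(v) for v in value), default=0)
    return 0


def partition_items(items, max_items, known_candidates=None):
    """Split items into (kept, quarantined) using the ordered checks."""
    kept, quarantined = [], []
    for index, item in enumerate(items):
        reason = None
        if not isinstance(item, dict):
            reason = "malformed"
        elif not isinstance(item.get("rank"), int):   # simplified schema check
            reason = "schema"
        elif nesting_depth(item) > MAX_DEPTH or longest_string(item) > MAX_STRING_LEN:
            reason = "guardrail"
        elif known_candidates is not None and item.get("candidate") not in known_candidates:
            reason = "allow_list"   # inert when no allow-list is supplied
        elif len(kept) >= max_items:
            reason = "over_limit"
        if reason is None:
            kept.append(item)
        else:
            quarantined.append({"index": index, "reason": reason})
    return kept, quarantined


nine = [{"rank": n, "candidate": f"ws-{n}"} for n in range(1, 10)]
kept, quarantined = partition_items(nine, max_items=7)
print(len(kept), [q["reason"] for q in quarantined])
```

Putting the count cap last means an item rejected for another reason never consumes one of the top-N slots.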
|
||||
|
||||
## Tests + Calibration Re-Entry
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0016-T05
|
||||
status: progress
|
||||
priority: high
|
||||
state_hub_task_id: "c881500b-5459-4620-81c0-b176971e989f"
|
||||
```
|
||||
|
||||
Prove the new posture and hand back to the calibration gates.
|
||||
|
||||
Done when:
|
||||
|
||||
- regression tests cover: the captured 06-26 payload, a truncated-mid-list
|
||||
payload, a one-bad-item-among-good payload (asserts quarantine + partial), an
|
||||
oversized/over-deep payload (asserts guardrail rejection), and an
|
||||
injection-shaped reference (asserts allow-list rejection);
|
||||
- the full suite passes and the result is recorded here with the count;
|
||||
- a daily-triage smoke against the live runtime shows a previously-failing
|
||||
payload now **degrades gracefully** (valid items delivered, bad items
|
||||
quarantined) instead of discarding the run;
|
||||
- a hand-back progress note informs `ACTIVITY-WP-0010-T04` and `ACTIVITY-WP-0006-T03`
|
||||
that the output-robustness blocker is cleared so the three-clean-run gate can
|
||||
resume on its own.
|
||||
|
||||
2026-06-26 progress (in-repo portion complete):
|
||||
|
||||
- **Regression coverage complete.** Across T03/T04/T05: truncated-mid-list,
|
||||
one-bad-item-among-good (quarantine + partial), oversized-string and over-depth
|
||||
guardrail rejection, allow-list (injection-shaped) rejection, happy-path count
|
||||
cap, and a test driving the **actual captured 2026-06-26 payload**
|
||||
(`tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json`)
|
||||
— it now recovers 6+ valid recommendations and quarantines the truncated tail,
|
||||
where before it discarded the whole run.
|
||||
- **Full suite green:** 218 passed, 1 skipped (recorded at T04; the T05 fixture +
|
||||
over-depth tests add to this — see the commit).
|
||||
- **Hand-back notes posted** to `ACTIVITY-WP-0006-T03` (State Hub event
|
||||
`b6b8c2b8`) and `ACTIVITY-WP-0010-T04` (`b813f0dc`).
|
||||
- **Remaining (remote, operator-owned):** the live daily-triage smoke on
|
||||
`railiance01` proving end-to-end graceful degradation. It depends on deploying
|
||||
the T02 bundle prompt/`max_tokens`/NDJSON changes together with this code, which
|
||||
is cluster/operator work outside this repo's SCOPE. T05 therefore stays
|
||||
`progress` until that live run exists; the in-repo deliverables are done.
|
||||
|
||||
2026-06-30 follow-up: added forward-looking diagnostics so future validation
|
||||
failures carry llm-connect response metadata and a larger bounded raw-output
|
||||
preview in activity-core-owned evidence. Focused verification passed:
|
||||
`uv run pytest tests/test_llm_client.py tests/rules/test_executor.py tests/test_report_sinks.py -q`
|
||||
=> 39 passed. This improves future root-cause ability but does not replace the
|
||||
required live smoke proving graceful degradation on railiance01.
|
||||
|
||||
2026-06-30 projection follow-up: local source projection now enforces the top-7
|
||||
prompt/schema contract. Remaining T05 proof is operational: deploy or sync the
|
||||
updated `k8s/railiance/20-runtime.yaml`, run `actcore-sync`/schedule smoke or wait
|
||||
for the next 07:20 Berlin fire, then confirm State Hub `daily_triage` evidence is
|
||||
`output_validated=true` with no more than 7 recommendations.
|
||||
|
||||
## Relationships
|
||||
|
||||
- **Blocks / feeds:** `ACTIVITY-WP-0006-T03` (three clean scheduled runs) and
|
||||
`ACTIVITY-WP-0010-T04` (collect three clean scheduled runs) — both stalled on
|
||||
the same output-quality failure this workplan removes.
|
||||
- **References:** `ACTIVITY-WP-0009` (scheduled-run trust gap).
|
||||
- **Boundary discipline:** keeps activity-core inside its SCOPE — this hardens
|
||||
the instruction-executor output contract; it does not move provider
|
||||
credentials, cluster reconciliation, or task lifecycle into this repo.
|
||||
58
workplans/ACTIVITY-WP-0017-core-hub-ops-evidence-sink.md
Normal file
@@ -0,0 +1,58 @@
|
||||
---
|
||||
id: ACTIVITY-WP-0017
|
||||
type: workplan
|
||||
title: "Core Hub ops evidence sink"
|
||||
domain: infotech
|
||||
repo: activity-core
|
||||
status: finished
|
||||
owner: codex
|
||||
topic_slug: custodian
|
||||
created: "2026-06-27"
|
||||
updated: "2026-06-27"
|
||||
state_hub_workstream_id: "2a073bf4-febf-433e-a721-5daf71760912"
|
||||
---
|
||||
|
||||
# Core Hub ops evidence sink
|
||||
|
||||
## Goal
|
||||
|
||||
Provide the activity-core side of the Core Hub replacement evidence path for
|
||||
`CORE-WP-0008-T03`, without depending on the legacy Haskell Inter-Hub sink and
|
||||
without placing secret material in activity definitions, logs, State Hub, or
|
||||
chat.
|
||||
|
||||
## Task: Add Core Hub interaction-event sink
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0017-T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "32aab1af-6be5-4b52-afa1-c11f52c65892"
|
||||
```
|
||||
|
||||
Add a `core-hub-interaction-event` ops evidence sink that posts sanitized
|
||||
ops-inventory probe evidence to Core Hub `/api/v2/interaction-events`, verifies
|
||||
the created event is visible, and reports only non-secret ids/statuses.
|
||||
|
||||
Acceptance:
|
||||
|
||||
- runtime token is read through `CORE_HUB_RUNTIME_TOKEN_FILE` or a named
|
||||
environment variable, never from workplan content;
|
||||
- sink configuration accepts `CORE_HUB_BASE_URL` and a widget id or widget
|
||||
mapping;
|
||||
- emitted metadata reuses the existing compact/sanitized probe evidence path;
|
||||
- missing Core Hub config skips cleanly with explicit non-secret missing keys;
|
||||
- tests prove the POST/visibility check and secret non-disclosure.
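
The configuration rules in this acceptance list could be resolved along these lines. This is a hedged sketch, not the code in `activity_core.ops_evidence_sinks`: `CORE_HUB_RUNTIME_TOKEN_FILE` and `CORE_HUB_BASE_URL` are named above, while `CORE_HUB_RUNTIME_TOKEN`, `CORE_HUB_WIDGET_ID`, `resolve_core_hub_config`, and `safe_summary` are hypothetical names chosen for the example.

```python
from pathlib import Path


def resolve_core_hub_config(env: dict) -> dict:
    """Resolve sink config from an env mapping; never reads workplan content."""
    missing = []
    base_url = env.get("CORE_HUB_BASE_URL", "").strip()
    if not base_url:
        missing.append("CORE_HUB_BASE_URL")
    widget_id = env.get("CORE_HUB_WIDGET_ID", "").strip()   # hypothetical key
    if not widget_id:
        missing.append("CORE_HUB_WIDGET_ID")

    token = ""
    token_file = env.get("CORE_HUB_RUNTIME_TOKEN_FILE", "").strip()
    if token_file and Path(token_file).is_file():
        token = Path(token_file).read_text(encoding="utf-8").strip()
    if not token:
        token = env.get("CORE_HUB_RUNTIME_TOKEN", "").strip()  # hypothetical key
    if not token:
        missing.append("CORE_HUB_RUNTIME_TOKEN_FILE")

    if missing:
        # Skip cleanly: report key names only, never values.
        return {"status": "skipped", "missing": missing}
    return {"status": "ready", "base_url": base_url, "widget_id": widget_id, "token": token}


def safe_summary(config: dict) -> dict:
    """Drop secret material before anything is logged or reported."""
    return {k: v for k, v in config.items() if k != "token"}


skipped = resolve_core_hub_config({})
ready = resolve_core_hub_config({
    "CORE_HUB_BASE_URL": "https://hub.example.invalid",
    "CORE_HUB_WIDGET_ID": "widget-1",
    "CORE_HUB_RUNTIME_TOKEN": "s3cret",
})
print(skipped, safe_summary(ready))
```

Reporting only the missing key names keeps the skip path useful to an operator without ever echoing a token value.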
|
||||
|
||||
Verification 2026-06-27: `tests/test_ops_evidence_sinks.py` passed, and
|
||||
a disposable local Core Hub runtime accepted an activity-core
|
||||
`core-hub-interaction-event` sink emission, then listed the created
|
||||
`ops-endpoint-verified` event back through `/api/v2/interaction-events`.
|
||||
The verification asserted sanitized metadata did not include response body,
|
||||
authorization header, URL userinfo, or token query material.
|
||||
|
||||
Completed 2026-06-27: implemented the Core Hub interaction-event sink in
|
||||
`activity_core.ops_evidence_sinks` with unit coverage for POST/visibility
|
||||
verification, missing config behavior, and secret non-disclosure. This provides
|
||||
the direct Core Hub consumer path needed by `CORE-WP-0008-T03`; deployed use
|
||||
still requires an approved Core Hub runtime token and widget id/mapping.
|
||||
248
workplans/ACTIVITY-WP-0018-own-infra-automation-status.md
Normal file
@@ -0,0 +1,248 @@
|
||||
---
|
||||
id: ACTIVITY-WP-0018
|
||||
type: workplan
|
||||
title: "Own-infrastructure automation status surface"
|
||||
domain: infotech
|
||||
repo: activity-core
|
||||
status: finished
|
||||
owner: codex
|
||||
topic_slug: automation-observability
|
||||
created: "2026-06-29"
|
||||
updated: "2026-06-29"
|
||||
state_hub_workstream_id: "0220b38b-7c73-4601-9601-5f2c1a5b29e8"
|
||||
---
|
||||
|
||||
# Own-infrastructure automation status surface
|
||||
|
||||
## Goal
|
||||
|
||||
Make activity-core's own scheduling and evidence infrastructure the explicit
|
||||
operating preference for durable automations, independent of any coding
|
||||
assistant-provided scheduler or reminder system.
|
||||
|
||||
An operator should be able to answer a question like "How did our automations go
|
||||
since Friday?" with a repo-native command that does not require an LLM. Coding
|
||||
assistants may inspect or summarize that command's output, but they must not be
|
||||
the source of truth for scheduled execution, run history, or operational
|
||||
evidence.
|
||||
|
||||
## Review notes
|
||||
|
||||
The repo already owns the correct infrastructure direction:
|
||||
|
||||
- `SCOPE.md` defines activity-core as the org-wide event bridge for cron,
|
||||
one-off scheduled datetime, and event-triggered automation.
|
||||
- `Makefile` exposes sync and service targets, but no operator status target for
|
||||
recent automation outcomes.
|
||||
- `docs/runbook.md` documents daily-triage verification through
|
||||
`scripts/verify_daily_triage.py`, but that helper is activity-specific and
|
||||
still reads like a checklist rather than the baseline answer surface for all
|
||||
automations.
|
||||
- Existing workplan evidence shows the status question is operationally common:
|
||||
2026-06-24 and 2026-06-25 daily triage runs were clean, while 2026-06-26 and
|
||||
2026-06-27 fired on schedule but failed output validation. That distinction is
|
||||
exactly what the baseline command must make obvious.
|
||||
|
||||
## Task: Codify the own-infra scheduling preference
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0018-T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "00127678-5ce4-4cb3-b81c-f42e04407c73"
|
||||
```
|
||||
|
||||
Record the repository preference that durable automation scheduling, execution
|
||||
history, and run evidence belong to activity-core's own infrastructure: Temporal
|
||||
Schedules, NATS JetStream, activity-core run records, State Hub progress, and
|
||||
working-memory/report sinks.
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `AGENTS.md` repo-specific instructions say not to use coding
|
||||
assistant-provided automation tooling as the execution or evidence source for
|
||||
activity-core automations.
|
||||
- `SCOPE.md` and `docs/runbook.md` describe coding assistants as callers or
|
||||
summarizers of repo-native automation commands, not as schedulers.
|
||||
- The preference distinguishes durable automation from harmless local session
|
||||
reminders: production/operational recurrence belongs to activity-core.
|
||||
- The text names the authoritative evidence sources and avoids tying the policy
|
||||
to any one assistant product.
|
||||
|
||||
2026-06-29 progress: Added the immediate repo-agent instruction in AGENTS.md
|
||||
that durable activity-core automations must use repo-owned infrastructure, not
|
||||
coding assistant automation/reminder/heartbeat tooling, as the execution or
|
||||
evidence source. Remaining T01 work is to carry the same preference into
|
||||
SCOPE.md and docs/runbook.md.
|
||||
|
||||
## Task: Define the automation status evidence contract
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0018-T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "17e6bb87-d4bf-4ef3-b91c-4bdfe2fe3492"
|
||||
```
|
||||
|
||||
Define a small, deterministic report contract for answering recent automation
|
||||
status questions across all ActivityDefinitions.
|
||||
|
||||
Acceptance:
|
||||
|
||||
- The contract covers schedule state, expected fires in the requested window,
|
||||
observed workflow runs, `activity_runs` rows, State Hub progress events,
|
||||
working-memory/report sink evidence, and known validation or sink failures.
|
||||
- It defines normalized statuses such as `completed`, `running`, `retrying`,
|
||||
`validation_failed`, `sink_failed`, `missed`, `disabled`, and `unknown`.
|
||||
- Partial data is explicit: if Temporal, Postgres, State Hub, or a sink path is
|
||||
unavailable, the report includes warnings rather than silently passing or
|
||||
failing the whole check.
|
||||
- The contract is safe for operator logs: no secrets, prompts, raw model output,
|
||||
or credential-bearing URLs.
|
||||
- The contract can be emitted as JSON for scripts and rendered as concise text
|
||||
for humans.
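
One way to picture the normalized statuses is a small classifier over already-collected evidence. The status names are the ones listed above; the `classify` function, its parameters, and its precedence order are assumptions made for this sketch, and `running`/`retrying` are defined but not derived here because they depend on live workflow state.

```python
from enum import Enum


class RunStatus(str, Enum):
    COMPLETED = "completed"
    RUNNING = "running"
    RETRYING = "retrying"
    VALIDATION_FAILED = "validation_failed"
    SINK_FAILED = "sink_failed"
    MISSED = "missed"
    DISABLED = "disabled"
    UNKNOWN = "unknown"


def classify(enabled: bool, expected_fires: int, observed_runs: int,
             validation_errors: int = 0, sink_errors: int = 0,
             sources_ok: bool = True) -> RunStatus:
    """Map collected evidence to one normalized status (assumed precedence)."""
    if not enabled:
        return RunStatus.DISABLED            # never counted as missed
    if validation_errors:
        return RunStatus.VALIDATION_FAILED   # fired, but output was rejected
    if sink_errors:
        return RunStatus.SINK_FAILED
    if not sources_ok:
        return RunStatus.UNKNOWN             # partial data: warn, do not guess
    if observed_runs < expected_fires:
        return RunStatus.MISSED
    return RunStatus.COMPLETED


print(classify(True, 2, 2, validation_errors=2).value)
```

Checking `enabled` first is what keeps a gated or disabled schedule from being reported as a missed run.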
|
||||
|
||||
## Task: Implement the non-LLM automation status CLI
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0018-T03
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "7831f2fc-8b76-48fe-aa34-9dcc11ee84db"
|
||||
```
|
||||
|
||||
Add a deterministic CLI, likely under `scripts/automation_status.py` or an
|
||||
`activity_core` module, that answers recent automation status questions without
|
||||
calling an LLM.
|
||||
|
||||
Acceptance:
|
||||
|
||||
- Supports `--since`, `--until`, activity name/id filters, JSON output, and a
|
||||
concise human summary.
|
||||
- Accepts simple operator dates, including absolute dates and a documented
|
||||
`friday`/`last-friday` style shortcut, resolving them to concrete dates in the
|
||||
configured timezone.
|
||||
- Inspects all enabled scheduled ActivityDefinitions by default, not just daily
|
||||
triage.
|
||||
- Uses live sources when configured: Postgres `activity_definitions` /
|
||||
`activity_runs`, Temporal schedule and workflow visibility, State Hub
|
||||
progress, and configured local report sink paths.
|
||||
- Degrades usefully when a source is unavailable and exits non-zero only for
|
||||
real status failures or invalid input, not for optional evidence gaps that are
|
||||
clearly reported.
|
||||
- Includes focused unit tests with fixture data for clean runs, validation
|
||||
failures, missed runs, disabled schedules, and partial-source availability.
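
The date-shortcut requirement could be met with something like the sketch below. The semantics are assumptions, since the acceptance text only asks for a documented shortcut: here `friday` means the most recent Friday on or before today, and `last-friday` means the most recent Friday strictly before today. `resolve_since` and `today_in` are hypothetical names, and `Europe/Berlin` is inferred from the Berlin schedule mentioned elsewhere in these workplans.

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo


def today_in(tz_name: str = "Europe/Berlin") -> date:
    """Current calendar date in the configured timezone."""
    return datetime.now(ZoneInfo(tz_name)).date()


def resolve_since(value: str, today: date) -> date:
    """Resolve an operator date argument to a concrete date."""
    text = value.strip().lower()
    if text in ("friday", "last-friday"):
        # Monday is weekday 0, so Friday is weekday 4.
        days_back = (today.weekday() - 4) % 7
        if text == "last-friday" and days_back == 0:
            days_back = 7   # on a Friday, "last-friday" means a week ago
        return today - timedelta(days=days_back)
    return date.fromisoformat(text)   # absolute dates such as 2026-06-26


monday = date(2026, 6, 29)
print(resolve_since("friday", monday), resolve_since("2026-06-26", monday))
```

Passing `today` in explicitly keeps the resolver deterministic and easy to cover with fixture dates.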
|
||||
|
||||
## Task: Add the Make target baseline
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0018-T04
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "451bdf62-b619-4ace-9262-46d20b912781"
|
||||
```
|
||||
|
||||
Expose the CLI through a Make target that is easy for an operator or any coding
|
||||
assistant to run before attempting a prose summary.
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `make automation-status SINCE=2026-06-26` prints the human-readable baseline.
|
||||
- `make automation-status SINCE=friday` is supported or documented with the
|
||||
exact accepted shortcut.
|
||||
- A JSON form is available, either through `FORMAT=json` or a separate target
|
||||
such as `make automation-status-json`.
|
||||
- The target does not require LLM credentials, coding assistant automation
|
||||
tooling, or interactive prompts.
|
||||
- `make help` lists the target with a clear one-line description.
|
||||
|
||||
## Task: Update operator docs and examples
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0018-T05
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "233659aa-e14a-4b3d-b156-d04f0fa16db6"
|
||||
```
|
||||
|
||||
Update the runbook so "How did automations go since Friday?" has an obvious
|
||||
operator recipe.
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `docs/runbook.md` has a short "Automation status" section near the scheduling
|
||||
operations.
|
||||
- The docs include example output or a compact sample for the known daily
|
||||
triage distinction: fired on time versus completed successfully versus output
|
||||
validation failure.
|
||||
- The docs clarify that LLM summaries are optional convenience only; the Make
|
||||
target output is the baseline evidence.
|
||||
- The daily-triage-specific helper is either kept as a lower-level diagnostic or
|
||||
folded into the generalized status command.
|
||||
|
||||
## Task: Verify against recent scheduled-run evidence
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0018-T06
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "24efbe9f-dfff-482f-9edc-456379c9a2aa"
|
||||
```
|
||||
|
||||
Prove the new surface against the recent evidence that motivated this workplan.
|
||||
|
||||
Acceptance:
|
||||
|
||||
- Running the status command over the window starting Friday, 2026-06-26 shows
|
||||
that the daily triage schedule fired on 2026-06-26 and 2026-06-27 but did not
|
||||
produce clean validated reports.
|
||||
- The command distinguishes scheduling health from output/schema validation
|
||||
failure.
|
||||
- Disabled or waiting schedules, such as the weekly coding retro gate when its
|
||||
upstream read model is not available, are reported without being counted as
|
||||
missed runs.
|
||||
- Verification results are recorded in this workplan and as a State Hub progress
|
||||
note once the implementation lands.
|
||||
|
||||
## Implementation Result
|
||||
|
||||
Completed 2026-06-29: implemented the own-infrastructure automation status
|
||||
surface and codified the scheduling preference.
|
||||
|
||||
Delivered:
|
||||
|
||||
- `AGENTS.md` now states that durable activity-core automations use repo-owned
|
||||
infrastructure, not coding assistant automation/reminder/heartbeat tooling, as
|
||||
execution or evidence authority.
|
||||
- `SCOPE.md` and `docs/runbook.md` describe the deterministic status surface and
|
||||
assistant boundary.
|
||||
- `src/activity_core/automation_status.py` and `scripts/automation_status.py`
|
||||
provide the non-LLM CLI.
|
||||
- `make automation-status SINCE=...` and `make automation-status-json` expose the
|
||||
baseline operator commands.
|
||||
- `tests/test_automation_status.py` covers date shortcuts, cron fire estimation,
|
||||
completed runs, validation failures, missed runs, disabled schedules, partial
|
||||
source availability, and working-memory evidence parsing.
|
||||
|
||||
Verification:
|
||||
|
||||
```bash
|
||||
python3 -m py_compile src/activity_core/automation_status.py scripts/automation_status.py tests/test_automation_status.py
|
||||
/home/worsch/.local/bin/uv run pytest tests/test_automation_status.py tests/test_daily_triage_verifier.py -q
|
||||
/home/worsch/.local/bin/uv run python scripts/automation_status.py \
|
||||
--since 2026-06-26 --until 2026-06-27 --db-url '' \
|
||||
--progress-event-type daily_triage --timeout-seconds 10 \
|
||||
--working-memory-dir /tmp --format json
|
||||
```
|
||||
|
||||
Results:
|
||||
|
||||
- focused tests: `11 passed`;
|
||||
- `make help` lists `automation-status` and `automation-status-json`;
|
||||
- the 2026-06-26 through 2026-06-27 status run exited `1` as expected because
|
||||
State Hub evidence classified daily triage activity
|
||||
`6fca51fa-387a-4fd0-bc4e-d62c29eb859a` as `validation_failed` with two
|
||||
non-secret evidence records: 2026-06-26 `Expecting ',' delimiter` and
|
||||
2026-06-27 `Unterminated string`;
|
||||
- the same report classified the gated weekly coding retro as `disabled`, not
|
||||
`missed`.
|
||||
@@ -0,0 +1,204 @@
|
||||
---
|
||||
id: ACTIVITY-WP-0019
|
||||
type: workplan
|
||||
title: "Automation schedule inventory Make targets"
|
||||
domain: infotech
|
||||
repo: activity-core
|
||||
status: finished
|
||||
owner: codex
|
||||
topic_slug: automation-inventory
|
||||
created: "2026-06-29"
|
||||
updated: "2026-07-01"
|
||||
state_hub_workstream_id: "21c73763-9adc-42f6-8fd2-1b8b33c2c770"
|
||||
---
|
||||
|
||||
# Automation schedule inventory Make targets
|
||||
|
||||
## Goal
|
||||
|
||||
Provide a repo-native, non-LLM way to list every scheduled automation that
|
||||
activity-core knows about.
|
||||
|
||||
`ACTIVITY-WP-0018` added the status surface for questions like "How did our
|
||||
automations go since Friday?". The next operator question is the inventory
|
||||
baseline: "What automations are scheduled at all?" That should be answerable
|
||||
through Make targets backed by activity-core's own ActivityDefinitions,
|
||||
database, and Temporal schedule metadata when available, independent of any
|
||||
coding assistant automation infrastructure.
|
||||
|
||||
## Review notes
|
||||
|
||||
- `Makefile` currently exposes `automation-status` and
|
||||
`automation-status-json`, but no dedicated inventory/list target.
|
||||
- `scripts/automation_status.py` and `src/activity_core/automation_status.py`
|
||||
already load scheduled ActivityDefinitions and compute their Temporal schedule
|
||||
ids. The inventory target should reuse that parsing/loading posture where it
|
||||
fits rather than creating a second discovery path.
|
||||
- `make sync-schedules` reconciles Temporal schedules from the
|
||||
`activity_definitions` database, but it is an action target, not a read-only
|
||||
operator inventory command.
|
||||
- The inventory command should remain useful in degraded local mode: file-backed
|
||||
definitions are enough to list configured scheduled automations, while live
|
||||
DB and Temporal visibility can enrich the output.
|
||||
|
||||
## Task: Define the automation inventory contract
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0019-T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "8de24590-f9ee-4d0e-8692-b7ada9f232ed"
|
||||
```
|
||||
|
||||
Define the fields and source precedence for a deterministic scheduled
|
||||
automation inventory report.
|
||||
|
||||
Acceptance:
|
||||
|
||||
- The report includes every ActivityDefinition with `trigger_type` of `cron` or
|
||||
`scheduled`, including disabled definitions.
|
||||
- Each row includes id, name, enabled/disabled state, trigger type, schedule
|
||||
expression or one-shot datetime, timezone, overlap/catchup policy when known,
|
||||
and the derived Temporal schedule id.
|
||||
- The report identifies its source for each row: database, repo definition file,
|
||||
Temporal visibility, or a combination.
|
||||
- If Temporal is reachable, the report adds paused/missing/drift hints without
|
||||
mutating schedules.
|
||||
- Missing optional sources produce warnings, not silent omissions.
|
||||
- The JSON shape is stable enough for scripts and tests.
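
A stable JSON shape for the inventory could look like the following sketch. The field list follows the acceptance criteria above; the class name `InventoryRow`, the `render_inventory` helper, and every sample value (ids, cron expression, schedule id format) are hypothetical and are not taken from the repo.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class InventoryRow:
    id: str
    name: str
    enabled: bool
    trigger_type: str                 # "cron" or "scheduled"
    schedule: str                     # cron expression or one-shot ISO datetime
    timezone: str
    temporal_schedule_id: str         # derived id; the format here is illustrative
    overlap_policy: Optional[str] = None
    catchup_policy: Optional[str] = None
    sources: list = field(default_factory=list)   # e.g. ["database", "temporal"]
    hints: list = field(default_factory=list)     # e.g. ["paused", "missing"]


def render_inventory(rows, warnings) -> str:
    """Emit rows in a deterministic order with sorted keys for stable diffs."""
    ordered = sorted(rows, key=lambda r: (r.name, r.id))
    payload = {"rows": [asdict(r) for r in ordered], "warnings": list(warnings)}
    return json.dumps(payload, indent=2, sort_keys=True)


rows = [
    InventoryRow(id="00000000-0000-0000-0000-000000000002", name="weekly-example",
                 enabled=False, trigger_type="cron", schedule="0 9 * * 1",
                 timezone="Europe/Berlin", temporal_schedule_id="example-weekly",
                 sources=["repo_definition_file"]),
    InventoryRow(id="00000000-0000-0000-0000-000000000001", name="daily-example",
                 enabled=True, trigger_type="cron", schedule="20 7 * * *",
                 timezone="Europe/Berlin", temporal_schedule_id="example-daily",
                 sources=["repo_definition_file"]),
]
output = render_inventory(rows, ["temporal unavailable: enrichment skipped"])
print(output)
```

Sorting rows and keys makes two runs over the same definitions byte-identical, which is what scripts and tests need from the contract.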
|
||||
|
||||
## Task: Implement a non-mutating inventory CLI
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0019-T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "538cb9a5-48f3-470c-8518-29ee66c96678"
|
||||
```
|
||||
|
||||
Add a deterministic CLI path for listing scheduled automations without requiring
|
||||
LLM credentials or coding assistant tooling.
|
||||
|
||||
Acceptance:
|
||||
|
||||
- A script or module command, likely sharing code with
|
||||
`activity_core.automation_status`, supports human and JSON output.
|
||||
- The command is read-only: it does not call `sync-schedules`, upsert schedules,
|
||||
delete schedules, enqueue workflows, or write State Hub evidence.
|
||||
- It supports filters by activity id, activity name, enabled state, and trigger
|
||||
type.
|
||||
- It loads from the database when configured and falls back to repo definition
|
||||
files when the database is unavailable or explicitly disabled.
|
||||
- It optionally enriches rows from Temporal when `TEMPORAL_HOST` is configured,
|
||||
with bounded timeouts so an unreachable service does not hang the command.
|
||||
- Unit tests cover DB rows, file fallback, disabled definitions, Temporal
|
||||
enrichment unavailable, and JSON output.
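
The database-first, file-fallback behaviour can be sketched with injected loaders. `load_definitions` and the loader callables are hypothetical; the real CLI shares helpers with `activity_core.automation_status`, whose signatures are not shown in this workplan.

```python
from typing import Callable


def load_definitions(db_loader: Callable[[str], list],
                     file_loader: Callable[[], list],
                     db_url: str) -> tuple[list, str, list]:
    """Return (definitions, source, warnings) without mutating anything."""
    warnings: list = []
    if db_url:
        try:
            return db_loader(db_url), "database", warnings
        except Exception as exc:
            # Report only the exception type: messages may embed a DSN.
            warnings.append(f"database unavailable: {type(exc).__name__}")
    else:
        warnings.append("database not configured; using repo definition files")
    return file_loader(), "repo_definition_file", warnings


def failing_db(url: str) -> list:
    raise ConnectionError("could not connect")


def repo_files() -> list:
    return [{"name": "daily-example", "trigger_type": "cron"}]


definitions, source, warnings = load_definitions(failing_db, repo_files, "postgresql://example")
print(source, warnings)
```

Every fallback leaves a warning behind, so a degraded run is still distinguishable from a fully sourced one.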
|
||||
|
||||
## Task: Add Make targets
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0019-T03
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "f2001721-07f3-42f5-a15e-0c7d1b0ed801"
|
||||
```
|
||||
|
||||
Expose the inventory command through Make targets that are easy for humans,
|
||||
scripts, and coding assistants to run before asking for a prose summary.
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `make automation-list` prints a concise human-readable inventory.
|
||||
- `make automation-list-json` emits the same inventory as JSON.
|
||||
- Optional Make variables pass through cleanly, for example `ENABLED=true`,
|
||||
`TRIGGER=cron`, `ACTIVITY_ID=<uuid>`, or `FORMAT=json`.
|
||||
- `make help` lists both targets with clear one-line descriptions.
|
||||
- The targets do not require LLM access, coding assistant automation tooling, or
|
||||
interactive prompts.
|
||||
|
||||
## Task: Document the inventory workflow
|
||||
|
||||
```task
|
||||
id: ACTIVITY-WP-0019-T04
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "f687743b-3936-413e-ae50-d35484ae9a81"
|
||||
```
|
||||
|
||||
Update operator documentation so the scheduled automation inventory path is
|
||||
discoverable next to the status path.
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `docs/runbook.md` documents `make automation-list` and
|
||||
`make automation-list-json`.
|
||||
- The docs distinguish inventory from status: inventory answers what is
|
||||
configured; status answers what happened in a time window.
|
||||
- The docs state that the command is read-only and uses activity-core-owned
|
||||
scheduling evidence.
|
||||
- The docs include a compact example of the expected human output.
|
||||
|
||||
## Task: Verify against current repo and live/degraded sources

```task
id: ACTIVITY-WP-0019-T05
status: done
priority: medium
state_hub_task_id: "5317b532-5cef-4eff-b6d8-3e85bbca8e8a"
```

Prove the target against the current scheduled automation definitions and
degraded local conditions.

Acceptance:

- `make automation-list` shows the current scheduled automations, including
  daily triage and weekly scheduled definitions when present in the selected
  source.
- JSON output is valid and includes the same rows.
- A DB-unavailable run falls back to repo definition files or reports a clear
  warning if no definitions are discoverable.
- A Temporal-unavailable run exits successfully with Temporal warnings rather
  than hanging.
- Focused tests pass and the result is recorded in this workplan before the
  workplan is moved to `finished`.


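The degraded-source behaviour in the acceptance list above can be sketched roughly as follows. This is an illustration only, not the code in `scripts/automation_inventory.py`: the function `load_inventory`, its three callables, and the row shape are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]  # hypothetical row shape, assumed for this sketch


def load_inventory(
    load_from_db: Callable[[], List[Row]],
    load_from_files: Callable[[], List[Row]],
    probe_temporal: Callable[[], None],
) -> Tuple[List[Row], str, List[str]]:
    """Return (rows, source, warnings) without raising on degraded sources."""
    warnings: List[str] = []
    try:
        rows, source = load_from_db(), "db"
    except Exception as exc:
        # DB unavailable: fall back to repo definition files and say so.
        warnings.append(f"database unavailable ({exc}); using repo definition files")
        rows, source = load_from_files(), "files"
        if not rows:
            warnings.append("no scheduled automation definitions are discoverable")
    try:
        # The probe is expected to enforce its own timeout so it cannot hang.
        probe_temporal()
    except Exception as exc:
        warnings.append(f"Temporal unavailable ({exc}); schedule state not checked")
    return rows, source, warnings


def _db_down() -> List[Row]:
    raise ConnectionError("connection refused")


def _temporal_down() -> None:
    raise TimeoutError("probe timed out")


rows, source, warnings = load_inventory(
    _db_down,
    lambda: [{"name": "daily-triage", "enabled": True, "trigger": "cron"}],
    _temporal_down,
)
for message in warnings:
    print(f"warning: {message}")
print(f"{len(rows)} automation(s) from source '{source}'")
```

Returning warnings alongside the rows, rather than raising, is one way to let the command exit successfully while still reporting what was degraded.
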
## Implementation Result

Completed 2026-07-01: implemented the read-only scheduled automation inventory
surface.

Delivered:

- `scripts/automation_inventory.py` exposes the inventory CLI backed by
  `activity_core.automation_status` shared definition and Temporal helpers.
- `make automation-list` and `make automation-list-json` list configured
  scheduled ActivityDefinitions with filters for `ENABLED`, `TRIGGER`,
  `ACTIVITY_ID`, and `ACTIVITY_NAME`.
- JSON output is script-safe; the Make JSON target suppresses command echo and
  recursive make directory chatter.
- `docs/runbook.md` now distinguishes inventory (what is configured) from status
  (what happened in a time window).
- Tests cover DB-backed rows, file fallback, disabled filtering, Temporal
  unavailable warnings, and JSON CLI output.

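As a rough illustration of how the `ENABLED`, `TRIGGER`, `ACTIVITY_ID`, and `ACTIVITY_NAME` filters listed above might narrow the inventory before it is emitted as JSON, consider the sketch below. The row fields, the `automations` envelope key, and the helper names are assumptions made for the example; the real CLI may differ.

```python
import json
from typing import Dict, List, Optional

# Hypothetical inventory rows; the real row shape is not shown in this workplan.
ROWS: List[Dict[str, object]] = [
    {"id": "a1", "name": "daily-triage", "enabled": True, "trigger": "cron"},
    {"id": "b2", "name": "weekly-review", "enabled": False, "trigger": "cron"},
    {"id": "c3", "name": "on-demand-sync", "enabled": True, "trigger": "manual"},
]


def parse_enabled(value: Optional[str]) -> Optional[bool]:
    """Map a Make-style ENABLED=true/false string to a bool; empty means no filter."""
    if not value:
        return None
    return value.strip().lower() in {"1", "true", "yes"}


def filter_rows(
    rows: List[Dict[str, object]],
    enabled: Optional[bool] = None,
    trigger: Optional[str] = None,
    activity_id: Optional[str] = None,
    activity_name: Optional[str] = None,
) -> List[Dict[str, object]]:
    """Keep rows that match every filter that was actually supplied."""
    wanted = {"enabled": enabled, "trigger": trigger, "id": activity_id, "name": activity_name}
    # Unset Make variables arrive as None or "", so both mean "do not filter".
    active = {key: value for key, value in wanted.items() if value not in (None, "")}
    return [row for row in rows if all(row.get(k) == v for k, v in active.items())]


selected = filter_rows(ROWS, enabled=parse_enabled("true"), trigger="cron")
payload = json.dumps({"automations": selected}, indent=2, sort_keys=True)
print(payload)
```

Writing only the JSON document to stdout, with nothing else mixed in, is what keeps output like this safe to pipe into other tools.
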
Verification:

```bash
/home/worsch/.local/bin/uv run pytest tests/test_automation_status.py tests/test_daily_triage_verifier.py -q
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make automation-list ACTCORE_DB_URL= TEMPORAL_HOST='
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make automation-list-json ACTCORE_DB_URL= TEMPORAL_HOST= > /tmp/activity-core-inventory.json && python3 -m json.tool /tmp/activity-core-inventory.json >/tmp/activity-core-inventory.pretty'
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make automation-list ACTCORE_DB_URL= TEMPORAL_HOST= ENABLED=true TRIGGER=cron'
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make help'
```

Results:

- focused tests: `16 passed`;
- degraded Make inventory run listed 9 file-backed scheduled automations, with
  5 enabled and 4 disabled;
- filtered Make run with `ENABLED=true TRIGGER=cron` listed 5 enabled cron
  automations;
- `automation-list-json` emitted parseable JSON directly;
- `make help` lists `automation-list` and `automation-list-json`.
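
Enabled and disabled tallies like those reported above could be re-derived from the JSON inventory with a few lines of Python. The sketch below assumes a top-level `automations` list whose rows carry `enabled` and `trigger` fields. That envelope and the sample rows are hypothetical and are not taken from the real command output.

```python
import json
from collections import Counter
from typing import Dict

# Hypothetical sample standing in for the output of `make automation-list-json`.
SAMPLE = json.dumps(
    {
        "automations": [
            {"name": "daily-triage", "enabled": True, "trigger": "cron"},
            {"name": "weekly-review", "enabled": False, "trigger": "cron"},
            {"name": "nightly-sync", "enabled": True, "trigger": "cron"},
        ]
    }
)


def summarize(inventory_json: str) -> Dict[str, object]:
    """Count total, enabled, and disabled rows, plus rows per trigger type."""
    rows = json.loads(inventory_json)["automations"]
    enabled = sum(1 for row in rows if row.get("enabled"))
    return {
        "total": len(rows),
        "enabled": enabled,
        "disabled": len(rows) - enabled,
        "by_trigger": dict(Counter(row.get("trigger") for row in rows)),
    }


summary = summarize(SAMPLE)
print(summary)
```
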
@@ -3,6 +3,7 @@ type: session-note
created: "2026-03-28"
updated: "2026-06-03"
status: archived
state_hub_workstream_id: "b221e65a-6f97-44b0-8dae-442fffcb7f64"
---

# WP-0002 Handoff Note — Continue on CoulombCore
