generated from coulomb/repo-seed
Compare commits
73 Commits
23f4956b68...main
| Author | SHA1 | Date | |
|---|---|---|---|
| a1e2a426b9 | |||
| 9113206974 | |||
| 79fd3406a3 | |||
| ef9a1a76c2 | |||
| 0da655979d | |||
| 7612112e7e | |||
| 6a5321525e | |||
| 2f55167215 | |||
| ffe10f098e | |||
| 3f85274916 | |||
| bb14d08212 | |||
| 92629e7a91 | |||
| 951ec56f7a | |||
| 9440d539c6 | |||
| 2ff852da29 | |||
| 30043348f0 | |||
| 18fcce87fe | |||
| 17b787fad0 | |||
| 6c8cb1b7b6 | |||
| ec66e06066 | |||
| 919edd98ac | |||
| bf877b7f0d | |||
| 9be4ddbdb7 | |||
| c5440e8429 | |||
| 53dc0f6e93 | |||
| a70c00a789 | |||
| b41b6034ee | |||
| 960fb05268 | |||
| b7b0b5bf6e | |||
| 14f76fb6d9 | |||
| caa2608092 | |||
| 61f278d643 | |||
| 0e9e18a59a | |||
| 5eb33bd3bb | |||
| 612c226472 | |||
| 0b2c68838e | |||
| 4b5e96d7c1 | |||
| 65ef005c2d | |||
| 0e75aaec01 | |||
| b2e57707a7 | |||
| 88fe359385 | |||
| f90591c5f1 | |||
| cf7a11dcd9 | |||
| 99e5d525a8 | |||
| 8424c13783 | |||
| 864f90f9b9 | |||
| 053d18b24a | |||
| 77af65afb2 | |||
| 0495f8a43f | |||
| c6cad9e7b3 | |||
| a83b117f60 | |||
| ffc0ee2cb7 | |||
| 59b3b73061 | |||
| 4bc5111dfd | |||
| e9a6029ded | |||
| bf4e61f0bf | |||
| 40fa851ec0 | |||
| e0742d18d7 | |||
| ccac285b0a | |||
| a0dcc52353 | |||
| faf5d60ae8 | |||
| adfd1a9067 | |||
| 44987457c1 | |||
| 3a981cc98f | |||
| dbd2fbb11c | |||
| c938b80503 | |||
| 3e93567a53 | |||
| 6f68f8f9ec | |||
| f05c56e202 | |||
| 200ec0c97a | |||
| 42e5ef725c | |||
| a08bd1684f | |||
| 2078915854 |
50
.claude/rules/credential-routing.md
Normal file
@@ -0,0 +1,50 @@
# Credential and access routing

**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.

ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.

### Lookup (do this first)

```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```

Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).

| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=activity-core` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |

### Quick routing table

| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |

### Anti-patterns (do not do these)

- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat

### Other capabilities (reuse-surface)

Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.

**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
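Reviewer sketch (not part of the diff): the lookup step in the new rules file could be wrapped roughly as below, assuming the `warden` CLI is on `PATH` and prints JSON to stdout when `--json` is passed. The helper names (`route_find_argv`, `route_show_argv`, `check_argv`, `run_json`) are hypothetical, and the JSON schema is not described in the file, so the parsed value is returned untouched.

```python
import json
import subprocess

# Subcommands the rules file lists as anti-patterns because they do not exist.
NONEXISTENT_SUBCOMMANDS = {"secret", "login", "bao", "tunnel"}


def route_find_argv(need: str) -> list[str]:
    """Build argv for `warden route find "<describe your need>" --json`."""
    if not need.strip():
        raise ValueError("describe the credential need in plain words")
    return ["warden", "route", "find", need, "--json"]


def route_show_argv(catalog_id: str) -> list[str]:
    """Build argv for `warden route show <catalog-id> --json`."""
    return ["warden", "route", "show", catalog_id, "--json"]


def check_argv(argv: list[str]) -> None:
    """Refuse warden subcommands that the rules file says do not exist."""
    if len(argv) > 1 and argv[0] == "warden" and argv[1] in NONEXISTENT_SUBCOMMANDS:
        raise ValueError(f"`warden {argv[1]}` does not exist; use `warden route find`")


def run_json(argv: list[str]):
    """Run the command and parse stdout as JSON (output schema is not assumed)."""
    check_argv(argv)
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

Passing the need as a single argv element avoids shell quoting problems. The result is a pointer to the owning subsystem and never a secret value, which matches the "route only" rows in the routing table.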
@@ -1,11 +1,11 @@
|
|||||||
## First Session Protocol
|
## First Session Protocol
|
||||||
|
|
||||||
Triggered when `get_domain_summary("custodian")` shows **no workstreams**.
|
Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
|
||||||
The project is registered but work has not yet been structured.
|
The project is registered but work has not yet been structured.
|
||||||
|
|
||||||
**Step 1 — Read, don't write**
|
**Step 1 — Read, don't write**
|
||||||
- `~/the-custodian/canon/projects/custodian/project_charter_v0.1.md` — purpose, scope
|
- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
|
||||||
- `~/the-custodian/canon/projects/custodian/roadmap_v0.1.md` — planned phases
|
- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
|
||||||
- Scan repo root: README, directory structure, existing code or docs
|
- Scan repo root: README, directory structure, existing code or docs
|
||||||
|
|
||||||
**Step 2 — Survey in-progress work**
|
**Step 2 — Survey in-progress work**
|
||||||
@@ -17,7 +17,7 @@ roadmap phase. **Wait for approval before creating.**
|
|||||||
|
|
||||||
**Step 4 — Create workplan file first, then DB record (ADR-001)**
|
**Step 4 — Create workplan file first, then DB record (ADR-001)**
|
||||||
```
|
```
|
||||||
workplans/activity-core-WP-NNNN-<slug>.md ← write this first
|
workplans/ACTIVITY-WP-NNNN-<slug>.md ← write this first
|
||||||
```
|
```
|
||||||
Then register in the hub:
|
Then register in the hub:
|
||||||
```
|
```
|
||||||
@@ -28,7 +28,7 @@ create_task(workstream_id="<id>", title="...", priority="high|medium|low")
|
|||||||
**Step 5 — Record the setup**
|
**Step 5 — Record the setup**
|
||||||
```
|
```
|
||||||
add_progress_event(
|
add_progress_event(
|
||||||
summary="First session: structured custodian into N workstreams, M tasks",
|
summary="First session: structured infotech into N workstreams, M tasks",
|
||||||
event_type="milestone",
|
event_type="milestone",
|
||||||
topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
|
topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
|
||||||
detail={"workstreams": [...], "tasks_created": M}
|
detail={"workstreams": [...], "tasks_created": M}
|
||||||
|
|||||||
@@ -1,5 +1,5 @@
|
|||||||
**Purpose:** Durable task factory built on Temporal. Manages ActivityDefinitions, schedules recurring workflows via Temporal Schedules, routes events via NATS JetStream, and exposes a FastAPI CRUD surface for the custodian domain.
|
**Purpose:** Durable task factory built on Temporal. Manages ActivityDefinitions, schedules recurring workflows via Temporal Schedules, routes events via NATS JetStream, and exposes a FastAPI CRUD surface for the custodian domain.
|
||||||
|
|
||||||
**Domain:** custodian
|
**Domain:** infotech
|
||||||
**Repo slug:** activity-core
|
**Repo slug:** activity-core
|
||||||
**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a
|
**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a
|
||||||
|
|||||||
@@ -1,6 +1,7 @@
|
|||||||
## Session Protocol
|
## Session Protocol
|
||||||
|
|
||||||
State Hub: http://127.0.0.1:8000
|
Dev Hub (State Hub API): http://127.0.0.1:8000
|
||||||
|
MCP server name in `~/.claude.json`: `dev-hub`
|
||||||
|
|
||||||
**Step 1 — Orient**
|
**Step 1 — Orient**
|
||||||
|
|
||||||
@@ -10,7 +11,7 @@ cat .custodian-brief.md
|
|||||||
```
|
```
|
||||||
Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
|
Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
|
||||||
```
|
```
|
||||||
get_domain_summary("custodian")
|
get_domain_summary("infotech")
|
||||||
```
|
```
|
||||||
If MCP tools are unavailable in the current agent session, use the REST API:
|
If MCP tools are unavailable in the current agent session, use the REST API:
|
||||||
```bash
|
```bash
|
||||||
@@ -39,11 +40,11 @@ curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
|
|||||||
ls workplans/
|
ls workplans/
|
||||||
```
|
```
|
||||||
For each file with `status: ready`, `active`, or `blocked`, note pending
|
For each file with `status: ready`, `active`, or `blocked`, note pending
|
||||||
`todo`/`in_progress` tasks.
|
`wait`/`todo`/`progress` tasks.
|
||||||
|
|
||||||
**Step 4 — Present brief**
|
**Step 4 — Present brief**
|
||||||
|
|
||||||
1. **Active workstreams** for `custodian` — title, task counts, blocking decisions
|
1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
|
||||||
2. **Pending tasks** from `workplans/` + any `[repo:activity-core]` hub tasks
|
2. **Pending tasks** from `workplans/` + any `[repo:activity-core]` hub tasks
|
||||||
3. **Goal guidance** — if `goal_guidance` in summary:
|
3. **Goal guidance** — if `goal_guidance` in summary:
|
||||||
- `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
|
- `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
|
||||||
|
|||||||
@@ -1,7 +1,7 @@
|
|||||||
## Workplan Convention (ADR-001)
|
## Workplan Convention (ADR-001)
|
||||||
|
|
||||||
File location: `workplans/activity-core-WP-NNNN-<slug>.md`
|
File location: `workplans/ACTIVITY-WP-NNNN-<slug>.md`
|
||||||
ID prefix: `ACTIVITY-WP`
|
ID prefix: `ACTIVITY-WP-`
|
||||||
|
|
||||||
Work items originate as files in this repo **before** being registered in the hub.
|
Work items originate as files in this repo **before** being registered in the hub.
|
||||||
|
|
||||||
@@ -12,7 +12,7 @@ repo state, and `finished` when implementation is complete. `stalled` and
|
|||||||
`needs_review` are derived health labels, not stored statuses.
|
`needs_review` are derived health labels, not stored statuses.
|
||||||
|
|
||||||
Closed workplans may be moved to `workplans/archived/` with a completion-date
|
Closed workplans may be moved to `workplans/archived/` with a completion-date
|
||||||
prefix: `YYMMDD-activity-core-WP-NNNN-<slug>.md`. The frontmatter id remains
|
prefix: `YYMMDD-ACTIVITY-WP-NNNN-<slug>.md`. The frontmatter id remains
|
||||||
unchanged; the prefix is only for quick visual reference.
|
unchanged; the prefix is only for quick visual reference.
|
||||||
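Reviewer sketch (not part of the diff): one way to check the renamed file convention, assuming `NNNN` means exactly four digits (as in `ACTIVITY-WP-0018`, which this diff references in AGENTS.md) and assuming slugs are lowercase alphanumerics separated by hyphens. Both assumptions and the function name `workplan_id` are hypothetical. The sketch also shows why the archive prefix is harmless: the id extracted from an archived file name equals the id of the active one.

```python
import re
from typing import Optional

SLUG = r"[a-z0-9]+(?:-[a-z0-9]+)*"  # assumed slug shape, not stated in the diff

# workplans/ACTIVITY-WP-NNNN-<slug>.md
ACTIVE = re.compile(rf"^ACTIVITY-WP-(\d{{4}})-{SLUG}\.md$")
# workplans/archived/YYMMDD-ACTIVITY-WP-NNNN-<slug>.md
ARCHIVED = re.compile(rf"^\d{{6}}-ACTIVITY-WP-(\d{{4}})-{SLUG}\.md$")


def workplan_id(filename: str) -> Optional[str]:
    """Return the frontmatter-style id for a workplan file name, or None."""
    match = ACTIVE.match(filename) or ARCHIVED.match(filename)
    if match is None:
        return None
    return f"ACTIVITY-WP-{match.group(1)}"
```

File names using the old `activity-core-WP-` prefix return `None` under this check, so a sweep over `workplans/` would surface any files that still need renaming.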
|
|
||||||
Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
|
Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
|
||||||
@@ -25,4 +25,16 @@ Ecosystem todos from other agents arrive as `[repo:activity-core]` hub tasks —
|
|||||||
visible at session start. Pick one up by creating the workplan file, then registering
|
visible at session start. Pick one up by creating the workplan file, then registering
|
||||||
the workstream.
|
the workstream.
|
||||||
|
|
||||||
|
Task blocks use this shape:
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-NNNN-T01
|
||||||
|
status: wait | todo | progress | done | cancel
|
||||||
|
priority: high | medium | low
|
||||||
|
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
|
||||||
|
```
|
||||||
|
|
||||||
|
Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
|
||||||
|
blocked work and `cancel` for stopped work.
|
||||||
|
|
||||||
<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->
|
<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->
|
||||||
|
|||||||
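Reviewer sketch (not part of the diff): the task status vocabulary introduced above can be read as a small state machine. The hunk only states the happy path (`todo` → `progress` → `done`) and the purpose of `wait` and `cancel`. The exits from `wait` and the terminal treatment of `done` and `cancel` below are assumptions, and `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical names.

```python
# Status values as listed in the task block: wait | todo | progress | done | cancel
ALLOWED_TRANSITIONS = {
    "todo": {"progress", "wait", "cancel"},
    "wait": {"todo", "progress", "cancel"},   # assumed: waiting work can resume
    "progress": {"done", "wait", "cancel"},
    "done": set(),                              # assumed terminal
    "cancel": set(),                            # assumed terminal
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving a task from `current` to `new` is allowed."""
    for status in (current, new):
        if status not in ALLOWED_TRANSITIONS:
            raise ValueError(f"unknown task status: {status!r}")
    return new in ALLOWED_TRANSITIONS[current]
```

A check like this rejects the retired `in_progress` spelling outright, which is the practical effect of the rename from `todo`/`in_progress` to `wait`/`todo`/`progress` in the session protocol.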
@@ -1,34 +1,23 @@
|
|||||||
<!-- custodian-brief: generated by fix-consistency — do not edit manually -->
|
<!-- custodian-brief: generated by fix-consistency — do not edit manually -->
|
||||||
# Custodian Brief — activity-core
|
# Custodian Brief — activity-core
|
||||||
|
|
||||||
**Domain:** custodian
|
**Domain:** infotech
|
||||||
**Last synced:** 2026-06-18 15:52 UTC
|
**Last synced:** 2026-07-02 09:55 UTC
|
||||||
**State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
|
**State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
|
||||||
|
|
||||||
## Active Workstreams
|
## Active Workstreams
|
||||||
|
|
||||||
### Definition And Schedule Hot Reload
|
### Adopt State Hub Beachhead Endpoint
|
||||||
Progress: 0/5 done | workstream_id: `8887075e-21ec-451b-b82b-cd81035c9ca5`
|
Progress: 0/2 done | workstream_id: `bbc07f9e-9323-4b2b-b556-c33b37d0b228`
|
||||||
|
|
||||||
**Open tasks:**
|
**Open tasks:**
|
||||||
- ! Live No-Restart Smoke `68a0e22a`
|
- ! Point STATE_HUB_URL at the beachhead `76b6132d`
|
||||||
- · Extract Reusable Sync Service `53a7970b`
|
- ! Retire the bespoke actcore-state-hub-bridge proxy `526c2129`
|
||||||
- · Add Admin Sync Endpoint `8697c761`
|
|
||||||
- · Preserve Schedule Drift Semantics `efeac412`
|
|
||||||
- · Optional Background Sync Loop `d774087b`
|
|
||||||
|
|
||||||
### Post-triage operational hardening
|
|
||||||
Progress: 6/8 done | workstream_id: `5646e13a-13af-4724-bca6-3c0d86f96733`
|
|
||||||
|
|
||||||
**Open tasks:**
|
|
||||||
- ! Three-Run Calibration Feedback `7cbf0a35`
|
|
||||||
- · Implement reuse_surface_report_gaps shell resolver for coulomb registry hygiene `25293d5e`
|
|
||||||
|
|
||||||
### Daily Triage LLM Reconciliation And Evidence
|
### Daily Triage LLM Reconciliation And Evidence
|
||||||
Progress: 1/5 done | workstream_id: `f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9`
|
Progress: 2/5 done | workstream_id: `f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9`
|
||||||
|
|
||||||
**Open tasks:**
|
**Open tasks:**
|
||||||
- ! Reconcile Live Railiance Runtime `23545ddc`
|
|
||||||
- ! Run Daily Triage Fixture Smoke `10e0df77`
|
- ! Run Daily Triage Fixture Smoke `10e0df77`
|
||||||
- ! Collect Three Clean Scheduled Runs `dc6b9482`
|
- ! Collect Three Clean Scheduled Runs `dc6b9482`
|
||||||
- ! Close Handoff State `ecc57e21`
|
- ! Close Handoff State `ecc57e21`
|
||||||
@@ -50,6 +39,6 @@ Progress: 2/3 done | workstream_id: `7387fc50-1f2c-471a-9d85-bb085cbd0b63`
|
|||||||
## MCP Orientation (when available)
|
## MCP Orientation (when available)
|
||||||
|
|
||||||
If the state-hub MCP server is reachable, call:
|
If the state-hub MCP server is reachable, call:
|
||||||
`get_domain_summary("custodian")`
|
`get_domain_summary("infotech")`
|
||||||
This provides richer cross-domain context.
|
This provides richer cross-domain context.
|
||||||
If the MCP call fails, use this file as your orientation source.
|
If the MCP call fails, use this file as your orientation source.
|
||||||
|
|||||||
@@ -18,7 +18,9 @@ STATE_HUB_URL=http://127.0.0.1:8000
|
|||||||
# Repo scoping — used by the repo-scoping context adapter. Binds {} on failure.
|
# Repo scoping — used by the repo-scoping context adapter. Binds {} on failure.
|
||||||
REPO_SCOPING_URL=http://127.0.0.1:8020
|
REPO_SCOPING_URL=http://127.0.0.1:8020
|
||||||
# Issue Core — task emission backend.
|
# Issue Core — task emission backend.
|
||||||
ISSUE_CORE_URL=http://127.0.0.1:8010
|
ISSUE_CORE_URL=http://127.0.0.1:8765
|
||||||
|
# Shared ingestion key — must match issue-core's ISSUE_CORE_API_KEY.
|
||||||
|
ISSUE_CORE_API_KEY=
|
||||||
# Sink type: 'rest' (POST to issue-core) or 'null' (discard, for dry-run).
|
# Sink type: 'rest' (POST to issue-core) or 'null' (discard, for dry-run).
|
||||||
ISSUE_SINK_TYPE=rest
|
ISSUE_SINK_TYPE=rest
|
||||||
|
|
||||||
|
|||||||
@@ -1,17 +1,15 @@
|
|||||||
# Kaizen scheduled agent execution (ADR-005)
|
# Kaizen scheduled agent execution manifest (ADR-005)
|
||||||
# Engagement: coulomb-loop — stabilize phase (daily crons per ADR-003)
|
# Engagement: coulomb-loop bootstrap — weekly cadence
|
||||||
# Promoted 2026-06-18 after 3/3 bootstrap E2E cycles
|
# Regulator promotes cadence per customer engagement policy (ADR-003).
|
||||||
|
# Validate with: kaizen-agentic schedule validate
|
||||||
version: '1'
|
version: '1'
|
||||||
timezone: Europe/Berlin
|
timezone: Europe/Berlin
|
||||||
agents:
|
agents:
|
||||||
coach:
|
coach:
|
||||||
cadence: daily
|
cadence: weekly
|
||||||
cron: "0 9 * * *"
|
cron: 0 9 * * 1
|
||||||
enabled: true
|
enabled: true
|
||||||
optimization:
|
optimization:
|
||||||
cadence: daily
|
cadence: weekly
|
||||||
cron: "0 10 * * *"
|
cron: 0 10 * * 1
|
||||||
enabled: true
|
enabled: true
|
||||||
tdd-workflow:
|
|
||||||
cadence: monthly
|
|
||||||
enabled: false
|
|
||||||
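Reviewer sketch (not part of the diff): the manifest change moves `coach` and `optimization` from `0 9 * * *` / `0 10 * * *` (every day) to `0 9 * * 1` / `0 10 * * 1` (Mondays, since day-of-week `1` is Monday in standard five-field cron). The check below is a hypothetical consistency test between the `cadence` and `cron` keys. It is not the `kaizen-agentic schedule validate` command, whose behavior this diff does not describe, and it only recognizes the simple patterns used here.

```python
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def cron_fields(expr: str) -> dict:
    """Split a five-field cron expression into named fields."""
    parts = expr.split()
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected 5 cron fields, got {len(parts)}: {expr!r}")
    return dict(zip(FIELD_NAMES, parts))


def implied_cadence(expr: str) -> str:
    """Classify simple fixed-time expressions as daily, weekly, monthly, or other."""
    f = cron_fields(expr)
    if not (f["minute"].isdigit() and f["hour"].isdigit()) or f["month"] != "*":
        return "other"
    if f["day_of_month"] == "*" and f["day_of_week"] == "*":
        return "daily"
    if f["day_of_month"] == "*" and f["day_of_week"].isdigit():
        return "weekly"
    if f["day_of_month"].isdigit() and f["day_of_week"] == "*":
        return "monthly"
    return "other"


def check_agent(name: str, agent: dict) -> list:
    """Return mismatch messages for one agent entry; empty list means consistent."""
    cron = agent.get("cron")
    if cron is None:
        return []
    implied = implied_cadence(cron)
    if implied != agent.get("cadence"):
        return [f"{name}: cadence {agent.get('cadence')!r} but cron {cron!r} implies {implied!r}"]
    return []
```

Applied to the new manifest, both agents pass. A leftover `cadence: weekly` paired with the old daily cron would be reported.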
28
.repo-classification.yaml
Normal file
@@ -0,0 +1,28 @@
# Repo classification (Repo Classification Standard v1.0).

repo_classification:
  standard: Repo Classification Standard
  version: '1.0'
  classified_at: '2026-06-22'
  classified_by: human
  category: tooling
  domain: infotech
  secondary_domains:
  - agents
  capability_tags:
  - workflow
  - orchestration
  - automation
  - coordination
  - observability
  business_stake:
  - technology
  - operations
  - automation
  - execution
  business_mechanics:
  - coordination
  - operation
  - adaptation
  notes: Org-wide event bridge / task factory (Temporal-based). Active bounded implementation
    -> project.
83
AGENTS.md
@@ -4,7 +4,7 @@
|
|||||||
|
|
||||||
**Purpose:** Durable task factory built on Temporal. Manages ActivityDefinitions, schedules recurring workflows via Temporal Schedules, routes events via NATS JetStream, and exposes a FastAPI CRUD surface for the custodian domain.
|
**Purpose:** Durable task factory built on Temporal. Manages ActivityDefinitions, schedules recurring workflows via Temporal Schedules, routes events via NATS JetStream, and exposes a FastAPI CRUD surface for the custodian domain.
|
||||||
|
|
||||||
**Domain:** custodian
|
**Domain:** infotech
|
||||||
**Repo slug:** activity-core
|
**Repo slug:** activity-core
|
||||||
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
|
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
|
||||||
**Workplan prefix:** `ACTIVITY-WP-`
|
**Workplan prefix:** `ACTIVITY-WP-`
|
||||||
@@ -83,7 +83,7 @@ curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
|
|||||||
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
|
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
|
||||||
2. Check inbox: `GET /messages/?to_agent=activity-core&unread_only=true`; mark read
|
2. Check inbox: `GET /messages/?to_agent=activity-core&unread_only=true`; mark read
|
||||||
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
|
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
|
||||||
4. Check blocked tasks: `GET /tasks/?needs_human=true`
|
4. Check human-needed tasks: `GET /tasks/?needs_human=true`
|
||||||
|
|
||||||
**During work:**
|
**During work:**
|
||||||
- Update task statuses in workplan files as tasks progress
|
- Update task statuses in workplan files as tasks progress
|
||||||
@@ -101,6 +101,78 @@ curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Credential and access routing
|
||||||
|
|
||||||
|
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
|
||||||
|
for inference. Run this check **before** requesting secrets, API keys, SSH access,
|
||||||
|
login tokens, or database passwords — in any repo, not only `ops-warden`.
|
||||||
|
|
||||||
|
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
|
||||||
|
other credential need belongs to another subsystem. **Do not** message
|
||||||
|
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
|
||||||
|
|
||||||
|
### Lookup (do this first)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
warden route find "<describe your need>" --json
|
||||||
|
warden route show <catalog-id> --json
|
||||||
|
```
|
||||||
|
|
||||||
|
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
|
||||||
|
|
||||||
|
| Agent runtime | How to orient |
|
||||||
|
| --- | --- |
|
||||||
|
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=activity-core` is for coordination, not secret vending |
|
||||||
|
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
|
||||||
|
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
|
||||||
|
|
||||||
|
### Quick routing table
|
||||||
|
|
||||||
|
| I need… | Owner | ops-warden executes? |
|
||||||
|
| --- | --- | --- |
|
||||||
|
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
|
||||||
|
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
|
||||||
|
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
|
||||||
|
| Authorization decision | flex-auth | No — route only |
|
||||||
|
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
|
||||||
|
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
|
||||||
|
|
||||||
|
### Anti-patterns (do not do these)
|
||||||
|
|
||||||
|
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
|
||||||
|
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
|
||||||
|
- Pasting secrets into Git, State Hub, workplans, logs, or chat
|
||||||
|
|
||||||
|
### Other capabilities (reuse-surface)
|
||||||
|
|
||||||
|
Non-credential capabilities are usually discovered through **reuse-surface** federation
|
||||||
|
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
|
||||||
|
every repo's agent instructions because it is high-frequency, high-risk, and easy to
|
||||||
|
get wrong.
|
||||||
|
|
||||||
|
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
|
||||||
|
|
||||||
|
<!-- REPO-AGENTS-EXTENSIONS -->
|
||||||
|
<!-- Append repo-specific agent instructions below this marker.
|
||||||
|
The state-hub template sync preserves content after this line. -->
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Automation Scheduling Preference
|
||||||
|
|
||||||
|
Durable activity-core automations must use this repo's own infrastructure:
|
||||||
|
Temporal Schedules, NATS JetStream, activity-core run records, State Hub
|
||||||
|
progress, and configured report/evidence sinks. Do not use coding
|
||||||
|
assistant-provided automation, reminder, or heartbeat tooling as the execution
|
||||||
|
or evidence source for production or operational recurrence.
|
||||||
|
|
||||||
|
Coding assistants may run repo-native inspection commands and summarize their
|
||||||
|
outputs, but the baseline answer to questions like "How did our automations go
|
||||||
|
since Friday?" must come from deterministic local tooling such as the
|
||||||
|
ACTIVITY-WP-0018 automation status surface.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Workplan Convention (ADR-001)
|
## Workplan Convention (ADR-001)
|
||||||
|
|
||||||
Work items originate as files in this repo — not in the hub. The hub is a
|
Work items originate as files in this repo — not in the hub. The hub is a
|
||||||
@@ -124,7 +196,7 @@ anything needing analysis, design, approval, dependencies, or multiple phases.
|
|||||||
id: ACTIVITY-WP-NNNN
|
id: ACTIVITY-WP-NNNN
|
||||||
type: workplan
|
type: workplan
|
||||||
title: "..."
|
title: "..."
|
||||||
domain: custodian
|
domain: infotech
|
||||||
repo: activity-core
|
repo: activity-core
|
||||||
status: proposed | ready | active | blocked | backlog | finished | archived
|
status: proposed | ready | active | blocked | backlog | finished | archived
|
||||||
owner: codex
|
owner: codex
|
||||||
@@ -154,10 +226,7 @@ state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
|
|||||||
Task description text.
|
Task description text.
|
||||||
```
|
```
|
||||||
|
|
||||||
Status progression: `todo` → `progress` → `done`; use `wait` for a task
|
Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.
|
||||||
blocked on external input and `cancel` for intentionally abandoned work.
|
|
||||||
Workstream/workplan lifecycle status is separate; frontmatter `blocked` remains
|
|
||||||
valid there.
|
|
||||||
|
|
||||||
To create a new workplan:
|
To create a new workplan:
|
||||||
1. Write the file following the format above
|
1. Write the file following the format above
|
||||||
|
|||||||
@@ -8,4 +8,5 @@
|
|||||||
@.claude/rules/stack-and-commands.md
|
@.claude/rules/stack-and-commands.md
|
||||||
@.claude/rules/architecture.md
|
@.claude/rules/architecture.md
|
||||||
@.claude/rules/repo-boundary.md
|
@.claude/rules/repo-boundary.md
|
||||||
|
@.claude/rules/credential-routing.md
|
||||||
@.claude/rules/agents.md
|
@.claude/rules/agents.md
|
||||||
|
|||||||
27
Makefile
@@ -1,13 +1,17 @@
|
|||||||
-include .env
|
-include .env
|
||||||
export
|
export
|
||||||
|
|
||||||
.PHONY: sync-event-types sync-activity-definitions test migrate sync-all \
|
.PHONY: sync-event-types sync-activity-definitions sync-schedules test migrate sync-all \
|
||||||
|
automation-status automation-status-json automation-list automation-list-json \
|
||||||
dev-up dev-down railiance-up railiance-down \
|
dev-up dev-down railiance-up railiance-down \
|
||||||
start-worker start-api start-event-router help
|
start-worker start-api start-event-router help
|
||||||
|
|
||||||
sync-activity-definitions: ## Sync ActivityDefinition files into DB
|
sync-activity-definitions: ## Sync ActivityDefinition files into DB
|
||||||
uv run python -m activity_core.sync_activity_definitions
|
uv run python -m activity_core.sync_activity_definitions
|
||||||
|
|
||||||
|
sync-schedules: ## Reconcile Temporal schedules from activity_definitions DB
|
||||||
|
uv run python -m activity_core.sync_schedules
|
||||||
|
|
||||||
sync-event-types: ## Sync event type YAML files into DB
|
sync-event-types: ## Sync event type YAML files into DB
|
||||||
uv run python scripts/sync_event_types.py
|
uv run python scripts/sync_event_types.py
|
||||||
|
|
||||||
@@ -21,6 +25,27 @@ migrate: ## Apply all pending Alembic migrations
|
|||||||
|
|
||||||
sync-all: sync-event-types sync-activity-definitions ## Sync event types and activity definitions
|
sync-all: sync-event-types sync-activity-definitions ## Sync event types and activity definitions
|
||||||
|
|
||||||
|
# -- Automation status ---------------------------------------------------------
|
||||||
|
|
||||||
|
SINCE ?= today
|
||||||
|
FORMAT ?= human
|
||||||
|
ENABLED ?= all
|
||||||
|
TRIGGER ?=
|
||||||
|
ACTIVITY_ID ?=
|
||||||
|
ACTIVITY_NAME ?=
|
||||||
|
|
||||||
|
automation-status: ## Report recent automation status from repo-owned evidence
|
||||||
|
uv run python scripts/automation_status.py --since "$(SINCE)" $(if $(UNTIL),--until "$(UNTIL)",) --format "$(FORMAT)"
|
||||||
|
|
||||||
|
automation-status-json: ## Report recent automation status as JSON
|
||||||
|
$(MAKE) automation-status FORMAT=json
|
||||||
|
|
||||||
|
automation-list: ## List configured scheduled automations from repo-owned definitions
|
||||||
|
@uv run python scripts/automation_inventory.py --format "$(FORMAT)" --enabled "$(ENABLED)" $(if $(TRIGGER),--trigger-type "$(TRIGGER)",) $(if $(ACTIVITY_ID),--activity-id "$(ACTIVITY_ID)",) $(if $(ACTIVITY_NAME),--activity-name "$(ACTIVITY_NAME)",)
|
||||||
|
|
||||||
|
automation-list-json: ## List configured scheduled automations as JSON
|
||||||
|
@$(MAKE) --no-print-directory automation-list FORMAT=json
|
||||||
|
|
||||||
# ── Infrastructure ─────────────────────────────────────────────────────────────
|
# ── Infrastructure ─────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
dev-up: ## Start full dev stack (Temporal + PG + ES + NATS)
|
dev-up: ## Start full dev stack (Temporal + PG + ES + NATS)
|
||||||
|
|||||||
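Reviewer sketch (not part of the diff): the `automation-status` recipe uses `$(if $(UNTIL),--until "$(UNTIL)",)`, so `--until` is passed only when `UNTIL` is set. The Python below mirrors that expansion to make the resulting command line explicit. The function name is hypothetical, and the accepted value formats for `--since` and `--until` are not shown in this diff, so only the default `today` is taken from the Makefile.

```python
from typing import Optional


def automation_status_argv(
    since: str = "today",          # Makefile default: SINCE ?= today
    until: Optional[str] = None,   # UNTIL has no default in the Makefile
    fmt: str = "human",            # Makefile default: FORMAT ?= human
) -> list:
    """Mirror the argv that `make automation-status` hands to the script."""
    argv = ["uv", "run", "python", "scripts/automation_status.py", "--since", since]
    if until:  # same truthiness test as $(if $(UNTIL),...): empty means omitted
        argv += ["--until", until]
    argv += ["--format", fmt]
    return argv
```

`make automation-status-json` corresponds to calling this with `fmt="json"` and the other defaults, because that target re-invokes `automation-status` with `FORMAT=json`.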
16
SCOPE.md
@@ -64,7 +64,9 @@ The two evaluation modes:
|
|||||||
`context.*` / `event.*` interpolation and explicit `for_each` per-item
|
`context.*` / `event.*` interpolation and explicit `for_each` per-item
|
||||||
binding. No `exec()`.
|
binding. No `exec()`.
|
||||||
- **Instruction executor**: trusted-field prompt rendering, LLM call via
|
- **Instruction executor**: trusted-field prompt rendering, LLM call via
|
||||||
llm-connect, structured output validation, bounded validation-failure
|
llm-connect, structured output validation, item-granular recovery with a
|
||||||
|
quarantine lane and producer guardrails (count/length/depth caps, reference
|
||||||
|
allow-list) at the producer trust boundary, bounded validation-failure
|
||||||
artifacts for report instructions, review-required audit metadata, and
|
artifacts for report instructions, review-required audit metadata, and
|
||||||
deterministic report sinks. A real downstream review queue is not implemented
|
deterministic report sinks. A real downstream review queue is not implemented
|
||||||
in this repo.
|
in this repo.
|
||||||
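Reviewer sketch (not part of the diff): the updated Instruction executor bullet names item-granular recovery, a quarantine lane, and producer guardrails (count, length, and depth caps plus a reference allow-list). The code below is a hypothetical illustration of that idea, not the repo's implementation. The field names `text` and `ref`, the default limits, and every identifier are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    max_items: int = 50
    max_text_len: int = 2000
    max_depth: int = 4
    allowed_refs: frozenset = frozenset()


def depth(value) -> int:
    """Nesting depth: scalars are 0, each dict or list level adds 1."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0


def violations(item: dict, rails: Guardrails) -> list:
    """Return the guardrail names that a single item breaks."""
    problems = []
    if depth(item) > rails.max_depth:
        problems.append("depth")
    if len(str(item.get("text", ""))) > rails.max_text_len:
        problems.append("length")
    ref = item.get("ref")
    if ref is not None and ref not in rails.allowed_refs:
        problems.append("reference")
    return problems


def partition(items: list, rails: Guardrails):
    """Split producer output into accepted items and quarantined records."""
    accepted, quarantined = [], []
    for index, item in enumerate(items):
        if index >= rails.max_items:
            problems = ["count"]
        elif not isinstance(item, dict):
            problems = ["shape"]
        else:
            problems = violations(item, rails)
        if problems:
            # Keep the original item and its position as provenance.
            quarantined.append({"index": index, "item": item, "reasons": problems})
        else:
            accepted.append(item)
    return accepted, quarantined
```

The point of the design is error locality: one bad item is set aside with its reasons while the valid items in the same batch still proceed.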
@@ -88,6 +90,9 @@ The two evaluation modes:
|
|||||||
- **REST admin API** (FastAPI): CRUD for ActivityDefinitions, manual trigger,
|
- **REST admin API** (FastAPI): CRUD for ActivityDefinitions, manual trigger,
|
||||||
event type registry queries.
|
event type registry queries.
|
||||||
- **Prometheus metrics**: Temporal SDK metrics exposed for scraping.
|
- **Prometheus metrics**: Temporal SDK metrics exposed for scraping.
|
||||||
|
- **Automation status surface**: deterministic, non-LLM status reporting via
|
||||||
|
`make automation-status` / `scripts/automation_status.py`, using repo-owned
|
||||||
|
evidence sources rather than coding assistant scheduler state.
|
||||||
- **Operational runbook**: `docs/runbook.md`.
|
- **Operational runbook**: `docs/runbook.md`.
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -114,6 +119,10 @@ The two evaluation modes:
|
|||||||
runs on Railiance infrastructure (or Docker Compose for dev).
|
runs on Railiance infrastructure (or Docker Compose for dev).
|
||||||
- **End-user task UI** — tasks land in issue-core; presentation is separate.
|
- **End-user task UI** — tasks land in issue-core; presentation is separate.
|
||||||
- **Synchronous request-response patterns** — Temporal is async-first.
|
- **Synchronous request-response patterns** — Temporal is async-first.
|
||||||
|
- **Coding assistant automation infrastructure** — assistant-provided reminders,
|
||||||
|
heartbeats, or scheduled jobs are not the execution or evidence authority for
|
||||||
|
activity-core automations. Assistants may run and summarize repo-native
|
||||||
|
commands only.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -130,6 +139,8 @@ The two evaluation modes:
  commands.
- You are replacing scattered bespoke cron jobs and manual coordination with
  a governed, observable automation layer.
- You need to answer "how did our automations go since Friday?" from
  deterministic repo-native evidence before any optional LLM summary.

---

@@ -320,6 +331,9 @@ new one-off control paths.
  governance model, event type schema, ActivityDefinition structure.
- `docs/adr/adr-003-rule-instruction-model.md` — Rule DSL, Instruction safety
  model, evaluation semantics, audit trail, testing strategy.
- `docs/adr/adr-004-producer-trust-boundary.md` — untrusted-producer premise,
  trust-but-handle vs verify-and-mitigate postures, error-locality and
  quarantine-with-provenance, producer guardrails for LLM/agent/human output.

---

156
docs/adr/adr-004-producer-trust-boundary.md
Normal file
@@ -0,0 +1,156 @@
---
id: ACT-ADR-004
type: architecture-decision-record
title: "The Producer Trust Boundary — Guardrails and Error-Correction for Untrusted Output"
status: accepted
decided_by: Bernd Worsch
date: "2026-06-26"
scope: cross-repo
affects:
  - activity-core
  - rules-core (future extraction)
tags: ["architecture", "llm", "safety", "validation", "guardrails", "trust-boundary", "resilience"]
---

# ACT-ADR-004: The Producer Trust Boundary

## Status

Accepted.

## Context

On 2026-06-26 the scheduled daily WSJF triage instruction fired on time, called
llm-connect successfully, and produced a long ranked recommendation list — but
the JSON broke at character 5268 (~rank 8–9 of ~16), failing schema validation. Because
the report was validated and consumed as a single monolithic JSON document, one
malformed delimiter discarded the **entire** run, including the 7 perfectly good
recommendations the model had already emitted. The scheduling and runtime layers
were healthy; the failure was entirely at the seam where free-form model output
meets a strict consumer.

This is not a one-off bug; it is a recurring class. activity-core has a **trust
boundary** wherever generative or human-authored output meets strict deterministic
consumers: the JSON Schema validator, the task emitter, and any classic compute
pipeline downstream. The producers on the other side of that boundary — **LLMs,
agents, and humans** — are all *untrusted producers*. Their output may be:

- **erroneous** — hallucination, truncation at a token limit, drift, type slips,
  typos, a missing delimiter; or
- **malicious** — prompt injection, crafted payloads, or oversized / deeply-nested
  structures intended to exhaust or confuse the consumer.

The pre-existing design treated producer output optimistically: parse the whole
document, validate the whole document, and on any failure discard the whole
document (preserving only a bounded diagnostic preview). That gives **zero error
locality** — the blast radius of any single defect is the entire activation.

## Decision

Treat the producer→consumer seam as an explicit, adversarial **trust boundary**,
and place guardrails plus error-correction tooling *at that boundary* rather than
letting raw producer output flow into deterministic consumers.

### Two non-fail-fast postures

When hard-failing on a problem is undesirable, there are two sound strategies, and
they **compose**:

- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output
  as-is; on exception, catch → repair → retry → or quarantine. Cheap on the happy
  path; blast radius depends entirely on how granular the catch is. Best when
  failures are rare and locally recoverable. Risk: failures surface late, possibly
  after partial side effects.
- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp,
  and normalize the output to a known-good shape *before* it enters the pipeline —
  drop bad items, coerce types, bound sizes/depth, allow-list references — so the
  consumer only ever sees clean input. Higher upfront cost, smaller blast radius,
  no partial side effects. Best when failures are common or consequences are high.

### Governing principles

1. **Push verification to the boundary; keep the interior strict.** Apply posture
   **B** at the producer→consumer boundary; keep posture **A** for residual
   exceptions inside the verified core. Never relax the interior schema to absorb
   producer sloppiness.
2. **Make error locality match the unit of work.** One bad recommendation must
   cost one recommendation, not the whole report. Structuring the payload so each
   item is independently parseable and validatable is the highest-leverage change.
3. **Quarantine, never silently drop.** Invalid units are preserved as bounded,
   provenance-tagged artifacts (`index`, `error`, `raw` snippet, `reason`) so they
   can be debugged or replayed. Degraded-but-usable is reported distinctly from
   total loss.
4. **Both human and agent input get the same rigor.** Guardrails are
   producer-agnostic: the same count / length / depth caps and reference
   allow-lists apply whether the producer is an LLM, an agent, or a human.

### What this means concretely in activity-core

Implemented in `src/activity_core/rules/executor.py`:

- **Strict-structure-only schema.** The daily-triage output schema is strict on
  per-item *structure* (`required [rank, candidate, action, why]`, typed `wsjf`)
  and carries `maxItems` as a producer *hint* — never as a hard whole-document
  reject, which would reproduce the very blast-radius failure (ACT-ADR-002 governs
  the schema format; `schemas/daily-triage-report.json`).
- **Item-granular recovery (posture B).** When whole-document parse + one retry
  fail, `_resilient_report` recovers individually-parseable recommendation objects
  via a brace/quote-aware scanner (`_extract_object_spans`) that works for both
  pretty-printed and NDJSON output, attempts a best-effort `_try_repair` on a
  truncated tail, validates each recovered object against the item schema, and
  keeps the valid ones. Survivors are emitted with `output_validated=true`,
  `partial=true`, and `review_required=true`.
- **Producer guardrails (`_partition_items`, applied on both the recovery and the
  happy path).** Per recommendation: structural type → schema → structural caps
  (`_MAX_DEPTH`, `_MAX_STRING_LEN`) → reference allow-list → count cap (top-N by
  `maxItems`). The first failing check quarantines the item with provenance and a
  `reason` (`malformed` / `schema` / `guardrail` / `allow_list` / `over_limit`).
- **Reference allow-list.** A recommendation whose `candidate` is not in the set of
  known ids is quarantined. The set is sourced from resolved context
  (`context["known_candidates"]`, via `_allow_list_from_context`); the check is
  inert until a context resolver populates it, so the capability ships now and
  activates with a one-line resolver change.
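
The item-granular recovery described above can be pictured with the following sketch. It is not the code in `executor.py`: `extract_array_objects` and `recover_items` are hypothetical names, and the rule "keep every complete object whose parent is an array" is an assumption about how per-item spans could be located. No tail repair is attempted here.

```python
import json


def extract_array_objects(text: str) -> list[str]:
    """Return the text of each complete `{...}` whose direct parent is a `[`.

    Braces and brackets inside JSON strings are ignored, so the scan is
    brace- and quote-aware.
    """
    spans: list[str] = []
    stack: list[str] = []   # currently open brackets, innermost last
    starts: list[int] = []  # start offsets, parallel to `stack`
    in_string = False
    escaped = False
    for pos, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
            starts.append(pos)
        elif ch in "}]" and stack:
            opener = stack.pop()
            start = starts.pop()
            if ch == "}" and opener == "{" and stack and stack[-1] == "[":
                spans.append(text[start:pos + 1])
    return spans


def recover_items(text: str) -> list[dict]:
    """Parse every recoverable array element; skip spans that are not valid JSON."""
    items = []
    for span in extract_array_objects(text):
        try:
            items.append(json.loads(span))
        except json.JSONDecodeError:
            continue
    return items
```

A limitation worth noting: an unterminated string swallows the rest of the input in this sketch, so only items that closed before the break are recovered. Each recovered object would still need per-item validation before use.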

### Where each posture sits

| Layer | Posture | Mechanism |
|-------|---------|-----------|
| Schema / contract | B | strict per-item structure; `maxItems` as hint |
| Whole-document parse | A | tolerant parse + single retry |
| Failed parse | B | item-granular recovery + repair + quarantine |
| Per-item screening | B | schema + depth/length caps + allow-list + count cap |
| Emitted report | — | `partial` / `quarantined_*` provenance; never silent |

## Consequences

- A single malformed or oversized item no longer discards an entire activation;
  the daily-triage run that failed on 2026-06-26 would now deliver its 7 valid
  recommendations and quarantine the broken tail.
- Reports gain a `partial` / `quarantined_*` vocabulary; downstream report sinks
  and reviewers can distinguish degraded-but-usable from total loss.
- Guardrail thresholds (`_MAX_DEPTH`, `_MAX_STRING_LEN`, `maxItems`, the
  allow-list) are policy knobs that will need tuning; they are intentionally
  conservative defaults, not a finished calibration.
- **Known retention gap (follow-on):** `LLMConnectClient.complete()` still returns
  only `content`, discarding `finish_reason`/`usage`, and the total-loss artifact
  caps raw output below realistic break points. Capturing those signals so
  failures stay debuggable is tracked as a retention fix, not closed by this ADR.

## Alternatives considered

- **Hard-enforce `maxItems` in the validator.** Rejected: a hard reject of an
  over-count document reproduces the whole-document blast radius. Mitigation (keep
  top-N, quarantine the rest) is preferred.
- **Relax the schema to accept anything.** Rejected: violates principle 1; pushes
  malformed data into downstream consumers.
- **Retry-until-valid only (pure posture A).** Rejected as the sole strategy: the
  2026-06-26 failure recurred across both the initial attempt and the retry, so
  retry alone does not bound the blast radius.

## References

- ACT-ADR-002 — markdown-as-definition format and output schema governance.
- ACT-ADR-003 — Rule vs. Instruction model; the Instruction prompt-injection
  surface this boundary complements on the output side.
- `workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md` — the
  implementing workplan.
@@ -11,7 +11,9 @@ The current authoritative boundary is the issue-core REST API:
|
|||||||
POST {ISSUE_CORE_URL}/issues/
|
POST {ISSUE_CORE_URL}/issues/
|
||||||
```
|
```
|
||||||
|
|
||||||
`IssueCoreRestSink` sends this payload:
|
`IssueCoreRestSink` authenticates with the shared `ISSUE_CORE_API_KEY` env var
|
||||||
|
(same value as the issue-core server) via `Authorization: Bearer <key>` and
|
||||||
|
sends this payload:
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
@@ -52,7 +54,7 @@ task reference before it can replace `IssueCoreRestSink`.
|
|||||||
|
|
||||||
Weekly SBOM staleness is safe to evaluate in dry-run mode because the rule
|
Weekly SBOM staleness is safe to evaluate in dry-run mode because the rule
|
||||||
contract is deterministic and tested. Do not enable it against the real REST sink
|
contract is deterministic and tested. Do not enable it against the real REST sink
|
||||||
until issue-core credentials, endpoint reachability, and duplicate-handling are
|
until `ISSUE_CORE_API_KEY`, endpoint reachability, and duplicate-handling are
|
||||||
verified in the target environment.
|
verified in the target environment.
|
||||||
|
|
||||||
## Verification
|
## Verification
|
||||||
|
|||||||
175
docs/runbook.md
@@ -116,7 +116,129 @@ asyncio.run(publish())
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Syncing schedules manually
|
## Syncing definitions and schedules manually
|
||||||

When the API is running, prefer the admin sync endpoint for definition or
schedule changes. It refreshes file-backed ActivityDefinitions and reconciles
Temporal Schedules without restarting the worker:

```bash
curl -s -X POST \
  'http://localhost:8010/admin/sync?definitions=true&schedules=true'
```

The response reports:

- `definitions.synced`
- `event_types.synced`
- `schedules.upserted`
- `schedules.paused`
- `schedules.deleted_orphans`
- bounded `errors[]`

## Automation inventory

Use the repo-native inventory command to answer "what automations are scheduled
at all?" before checking whether a recent window succeeded. The command is
read-only: it loads ActivityDefinition rows or files and, when `TEMPORAL_HOST`
is configured, describes Temporal schedules for visibility. It does not sync,
upsert, pause, delete, or enqueue schedules.

```bash
# Human-readable configured automation inventory.
make automation-list

# JSON for scripts or assistant summarization.
make automation-list-json

# Common filters.
make automation-list ENABLED=true TRIGGER=cron
make automation-list ACTIVITY_ID=6fca51fa-387a-4fd0-bc4e-d62c29eb859a
```

Inventory answers what is configured; `make automation-status` answers what
happened in a time window. Missing optional live sources are warnings, not
silent omissions, so a degraded local run still lists repo definition files.

Compact human output looks like:

```text
- Daily State Hub WSJF Triage [enabled cron] schedule=activity-schedule-... trigger=20 7 * * * tz=Europe/Berlin source=files temporal=not_checked
```

## Automation status

Use the repo-native status command to answer operator questions such as "how did
our automations go since Friday?". This is the baseline evidence surface; LLMs
or coding assistants may summarize the output, but they are not the scheduler or
source of truth.

```bash
# Human-readable status. `friday` resolves in Europe/Berlin by default.
make automation-status SINCE=friday

# JSON for scripts or assistant summarization.
make automation-status-json SINCE=2026-06-26
```

The command reads activity-core owned evidence only: ActivityDefinition files or
DB rows, `activity_runs`, State Hub progress, working-memory report notes, and
Temporal visibility when `TEMPORAL_HOST` is configured. Missing live sources are
reported as warnings rather than hidden. It exits non-zero for real automation
failures such as `missed`, `validation_failed`, or `sink_failed`.
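
As a rough illustration of that exit-code contract, the sketch below maps per-automation statuses to a process exit code. The function name, the exit value `1`, and the exact failure set are assumptions: the runbook names `missed`, `validation_failed`, and `sink_failed` only as examples, and the real `scripts/automation_status.py` may classify more states.

```python
# Statuses the runbook names as real automation failures (possibly not exhaustive).
FAILURE_STATUSES = frozenset({"missed", "validation_failed", "sink_failed"})


def automation_exit_code(statuses: list[str]) -> int:
    """Return 1 if any automation is in a failure status, otherwise 0.

    Any other status, including `disabled`, does not fail the run in this sketch.
    """
    return 1 if any(status in FAILURE_STATUSES for status in statuses) else 0
```

A wrapper script or CI step could rely on the same convention by checking the return code of `make automation-status` instead of parsing its text output.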

Useful knobs:

```bash
AUTOMATION_STATUS_TIMEOUT_SECONDS=10 make automation-status SINCE=friday
make automation-status SINCE=2026-06-26 FORMAT=json
make automation-status SINCE=2026-06-26 UNTIL=2026-06-27 ACTCORE_DB_URL=
```

Example distinction from the June 2026 daily triage evidence:

```text
- Activity 6fca51fa-387a-4fd0-bc4e-d62c29eb859a [validation_failed] expected=0 runs=0 evidence=2
  evidence state_hub_progress event_type=daily_triage run=ebec6e41... output_validated=false validation_error=Unterminated string...
  evidence state_hub_progress event_type=daily_triage run=c7370f9c... output_validated=false validation_error=Expecting ',' delimiter...
```

That means the schedule/report path left evidence, but the report was not a
clean validated output. Disabled schedules, such as the gated weekly coding
retro, are reported as `disabled` and are not counted as missed runs.

`event_types` defaults to `false` for the admin sync endpoint because
event-triggered definitions already reload from the DB in the event router path;
opt in when the operator intentionally changed event type definition files:

```bash
curl -s -X POST \
  'http://localhost:8010/admin/sync?definitions=true&schedules=true&event_types=true'
```

The v1 posture is manual/operator-triggered sync. A periodic background loop is
deferred until live use shows it is needed; this keeps customer definition
changes explicit and avoids background repo scanning from the worker.

### Railiance01 no-restart smoke

After changing a projected definition in `k8s/railiance/20-runtime.yaml`,
apply the ConfigMap and wait for the API pod volume to refresh (up to ~60s),
then reconcile without restarting `actcore-worker`:

```bash
export KUBECONFIG=~/.kube/config-hosteurope
kubectl apply -f k8s/railiance/20-runtime.yaml
sleep 60
kubectl -n activity-core exec deploy/actcore-api -- \
  python3 -c 'import urllib.request; req=urllib.request.Request("http://localhost:8010/admin/sync?definitions=true&schedules=true", method="POST"); print(urllib.request.urlopen(req).read().decode())'
```

Automated regression for the disabled `ops-service-inventory-probes`
projection (enable/cadence flip, idempotent repeat sync, rollback) lives in
`scripts/smoke_admin_sync_no_restart.py`.

If the API is unavailable, the schedule-only CLI remains available:

```bash
|
```bash
|
||||||
TEMPORAL_HOST=localhost:7233 \
|
TEMPORAL_HOST=localhost:7233 \
|
||||||
@@ -126,7 +248,7 @@ ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
|
|||||||
|
|
||||||
This reconciles all Temporal Schedules with the `activity_definitions` table:
|
This reconciles all Temporal Schedules with the `activity_definitions` table:
|
||||||
- Upserts schedules for every enabled cron definition
|
- Upserts schedules for every enabled cron definition
|
||||||
- Creates paused schedules for disabled cron definitions
|
- Creates paused schedules for disabled cron or one-shot scheduled definitions
|
||||||
- Deletes orphaned schedules with no matching DB row
|
- Deletes orphaned schedules with no matching DB row
|
||||||
|
|
||||||
After adding or changing a recurring ActivityDefinition or workflow activity
|
After adding or changing a recurring ActivityDefinition or workflow activity
|
||||||
@@ -282,6 +404,52 @@ the same durable consumer name provides automatic failover.
|
|||||||
|
|
||||||
---
|
---
|
||||||

## Run-miss recovery policies (cron triggers)

A cron fire is **missed** when the worker or Temporal is unavailable at trigger
time. `trigger_config.misfire_policy` selects what happens when the system
recovers. Each policy combines a Temporal **catchup window** (how far back missed
fires are recovered) with an **overlap policy** (what to do if a recovered fire
would start while a prior run is still executing):

| `misfire_policy` | Behaviour | Default catchup window | Overlap |
| --- | --- | --- | --- |
| `skip` | Run on trigger or skip — a missed fire is never recovered | 60s grace | `SKIP` |
| `catchup_all` | Recover **every** fire missed during the outage | 365 days | `BUFFER_ALL` |
| `catchup_latest` | Recover only the **most recent** missed fire; no backlog | 24h | `BUFFER_ONE` |

Set `trigger_config.catchup_window_seconds` to override the per-policy default
(e.g. an hourly definition using `catchup_latest` should set it to ~3600 so a
single missed hour is recovered but older ones are not).

Legacy values are still accepted: `catchup` → `catchup_all`,
`compress` → `catchup_latest`.
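
The table, the override, and the legacy aliases can be combined as in the sketch below. It only mirrors the values documented here: `MisfireSettings` and `resolve_misfire` are hypothetical names, the overlap policy is kept as a plain string rather than a Temporal SDK enum, and treating a missing `misfire_policy` as an error is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class MisfireSettings:
    catchup_window: timedelta
    overlap: str  # overlap policy name as listed in the table


# Per-policy defaults copied from the table above.
DEFAULTS = {
    "skip": MisfireSettings(timedelta(seconds=60), "SKIP"),
    "catchup_all": MisfireSettings(timedelta(days=365), "BUFFER_ALL"),
    "catchup_latest": MisfireSettings(timedelta(hours=24), "BUFFER_ONE"),
}
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}


def resolve_misfire(trigger_config: dict) -> MisfireSettings:
    """Resolve the policy (including legacy names) and apply any window override."""
    policy = trigger_config["misfire_policy"]
    policy = LEGACY_ALIASES.get(policy, policy)
    if policy not in DEFAULTS:
        raise ValueError(f"unknown misfire_policy: {policy!r}")
    settings = DEFAULTS[policy]
    override = trigger_config.get("catchup_window_seconds")
    if override is not None:
        settings = MisfireSettings(timedelta(seconds=override), settings.overlap)
    return settings
```

For the hourly example, `resolve_misfire({"misfire_policy": "catchup_latest", "catchup_window_seconds": 3600})` yields a one-hour window with `BUFFER_ONE`.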

> **Why this exists:** before ACTIVITY-WP-0014 no catchup window was set, so a
> brief outage at trigger time silently dropped the fire with no recovery and no
> log line. The `daily-statehub-wsjf-triage` definition now uses `catchup_latest`.

## State Hub write idempotency (ACTIVITY-WP-0014 T05)

Every State Hub write from activity-core (report-sink progress, ops-evidence
progress, schedule-miss alerts) carries a stable **`Idempotency-Key`** header
derived deterministically from the write's identity
(`run_id:instruction_id:event_type`, or `schedule_miss:activity_id:last_fired`
for miss alerts). This makes writes safe to **buffer and replay** under the
planned State Hub *beachhead* (per-machine read cache + write outbox): a flush —
possibly retried after an outage — cannot create duplicate progress/triage
events once State Hub / the beachhead honours the header.
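
A deterministic key of this kind could be derived as sketched below. The runbook states only the identity fields; hashing the joined identity with SHA-256 and the helper names are assumptions made for illustration, and the real key may simply be the raw identity string.

```python
import hashlib


def idempotency_key(*parts: str) -> str:
    """Derive a stable key from the ordered identity parts of a write."""
    identity = ":".join(parts)
    return hashlib.sha256(identity.encode("utf-8")).hexdigest()


def progress_headers(run_id: str, instruction_id: str, event_type: str) -> dict:
    # Identity for progress writes: run_id:instruction_id:event_type
    return {"Idempotency-Key": idempotency_key(run_id, instruction_id, event_type)}


def schedule_miss_headers(activity_id: str, last_fired: str) -> dict:
    # Identity for miss alerts: schedule_miss:activity_id:last_fired
    return {"Idempotency-Key": idempotency_key("schedule_miss", activity_id, last_fired)}
```

Because the key depends only on the write's identity, a replayed flush produces the same header value and can be deduplicated by a receiver that honours it.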

The guarantee lives on the write, not on a live dedup read. The read-based
`_progress_exists` check is now best-effort only: if State Hub is unreachable it
returns `False` (proceed to the keyed write) rather than hard-failing. The header
passes untouched through the `actcore-state-hub-bridge` proxy and is ignored by
State Hub versions that do not yet honour it.

> The queue/cache itself is **not** built in activity-core — it belongs to the
> state-hub beachhead. activity-core only emits the key. See the proposal sent to
> the `state-hub` agent.

## Troubleshooting
|
## Troubleshooting
|
||||||
|
|
||||||
### Worker fails to start: "ACTCORE_DB_URL is required"
|
### Worker fails to start: "ACTCORE_DB_URL is required"
|
||||||
@@ -291,6 +459,9 @@ Set the environment variable before running the worker.
1. Check Temporal UI → Schedules tab for the schedule status.
2. Ensure `enabled=True` on the ActivityDefinition (paused schedules don't fire).
3. Verify the cron expression with: `docker exec temporal-admin-tools temporal schedule describe --schedule-id activity-schedule-<uuid>`
4. If a fire was **missed entirely** (no run, no failure event) during an outage,
   check `misfire_policy` — under `skip` missed fires are dropped by design. Use
   `catchup_all` or `catchup_latest` to recover them. See *Run-miss recovery policies*.

### Event not routing
1. Check NATS monitoring: http://localhost:8222/jsz to verify the `ACTIVITY_EVENTS` stream exists.
@@ -14,8 +14,8 @@ data:
|
|||||||
LLM_CONNECT_URL: http://llm-connect.activity-core.svc.cluster.local:8080
|
LLM_CONNECT_URL: http://llm-connect.activity-core.svc.cluster.local:8080
|
||||||
LLM_CONNECT_TIMEOUT_SECONDS: "300"
|
LLM_CONNECT_TIMEOUT_SECONDS: "300"
|
||||||
REPO_SCOPING_URL: http://repo-scoping.repo-scoping.svc.cluster.local:8020
|
REPO_SCOPING_URL: http://repo-scoping.repo-scoping.svc.cluster.local:8020
|
||||||
ISSUE_CORE_URL: http://issue-core.issue-core.svc.cluster.local:8010
|
ISSUE_CORE_URL: http://actcore-issue-core-bridge.activity-core.svc.cluster.local:8765
|
||||||
ISSUE_SINK_TYPE: "null"
|
ISSUE_SINK_TYPE: "rest"
|
||||||
ACTIVITY_DEFINITION_DIRS: /etc/activity-core/external-definitions
|
ACTIVITY_DEFINITION_DIRS: /etc/activity-core/external-definitions
|
||||||
OPS_INVENTORY_PATH: /etc/activity-core/ops/service-inventory.yml
|
OPS_INVENTORY_PATH: /etc/activity-core/ops/service-inventory.yml
|
||||||
INTER_HUB_URL: ""
|
INTER_HUB_URL: ""
|
||||||
@@ -47,7 +47,10 @@ data:
|
|||||||
type: cron
|
type: cron
|
||||||
cron_expression: "20 7 * * *"
|
cron_expression: "20 7 * * *"
|
||||||
timezone: Europe/Berlin
|
timezone: Europe/Berlin
|
||||||
misfire_policy: skip
|
# ACTIVITY-WP-0014: recover the most recent missed daily fire when the
|
||||||
|
# worker/Temporal was unavailable at trigger time, without accumulating a
|
||||||
|
# backlog after a multi-day outage.
|
||||||
|
misfire_policy: catchup_latest
|
||||||
context_sources:
|
context_sources:
|
||||||
- type: static
|
- type: static
|
||||||
bind_to: context.prompt_path
|
bind_to: context.prompt_path
|
||||||
@@ -91,15 +94,19 @@ data:
|
|||||||
Score each recommendation with the WSJF rubric from the prompt:
|
Score each recommendation with the WSJF rubric from the prompt:
|
||||||
(strategic_value + time_criticality + risk_reduction +
|
(strategic_value + time_criticality + risk_reduction +
|
||||||
opportunity_enablement) / job_size. Use integer factor values from 1 to 5,
|
opportunity_enablement) / job_size. Use integer factor values from 1 to 5,
|
||||||
round score to one decimal place, sort recommendations by rank, and return at
|
round score to one decimal place, sort recommendations by rank, and return
|
||||||
most 10 recommendations.
|
only the bounded top-7 (at most 7) ranked recommendations. If uncertain,
|
||||||
|
emit fewer well-formed recommendations rather than more.
|
||||||
|
|
||||||
Curated digest:
|
Curated digest:
|
||||||
{context.daily_triage_digest}
|
{context.daily_triage_digest}
|
||||||
|
|
||||||
Return only JSON matching
|
Return only JSON matching
|
||||||
`/etc/activity-core/schemas/daily-triage-report.json`. Do not wrap the JSON
|
`/etc/activity-core/schemas/daily-triage-report.json`. Emit the "summary"
|
||||||
in Markdown fences or add prose before or after it:
|
field first, then inside the "recommendations" array write one complete
|
||||||
|
recommendation JSON object per line (NDJSON-style per-item framing) so
|
||||||
|
each item can be recovered independently if the output is truncated. Do
|
||||||
|
not wrap the JSON in Markdown fences or add prose before or after it:
|
||||||
{
|
{
|
||||||
"summary": "short operator-facing summary",
|
"summary": "short operator-facing summary",
|
||||||
"recommendations": [
|
"recommendations": [
|
||||||
@@ -164,6 +171,36 @@ data:
|
|||||||
|
|
||||||
Kubernetes projection of the Custodian-owned definition in
|
Kubernetes projection of the Custodian-owned definition in
|
||||||
`/home/worsch/the-custodian/activity-definitions/hourly-recently-on-scope.md`.
|
`/home/worsch/the-custodian/activity-definitions/hourly-recently-on-scope.md`.
|
||||||
|
state-hub-consistency-sweep.md: |
|
||||||
|
---
|
||||||
|
id: "7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b"
|
||||||
|
name: "State Hub Consistency Sweep"
|
||||||
|
type: activity-definition
|
||||||
|
version: "1.0"
|
||||||
|
enabled: true
|
||||||
|
owner: custodian
|
||||||
|
governance: custodian
|
||||||
|
status: active
|
||||||
|
created: "2026-06-21"
|
||||||
|
trigger:
|
||||||
|
type: cron
|
||||||
|
cron_expression: "*/15 * * * *"
|
||||||
|
timezone: UTC
|
||||||
|
misfire_policy: skip
|
||||||
|
context_sources:
|
||||||
|
- type: state-hub
|
||||||
|
query: consistency_sweep_remote_all
|
||||||
|
required: true
|
||||||
|
params:
|
||||||
|
max_seconds: 300
|
||||||
|
source: activity-core
|
||||||
|
bind_to: context.consistency_sweep_remote_all
|
||||||
|
---
|
||||||
|
|
||||||
|
# ActivityDefinition: State Hub Consistency Sweep
|
||||||
|
|
||||||
|
Kubernetes projection of the Custodian-owned definition in
|
||||||
|
`/home/worsch/the-custodian/activity-definitions/state-hub-consistency-sweep.md`.
|
||||||
ops-service-inventory-probes.md: |
|
ops-service-inventory-probes.md: |
|
||||||
---
|
---
|
||||||
id: "40d15a87-7ff6-4d8e-992c-37df15f95110"
|
id: "40d15a87-7ff6-4d8e-992c-37df15f95110"
|
||||||
@@ -399,7 +436,7 @@ data:
|
|||||||
"recommendations": {
|
"recommendations": {
|
||||||
"type": "array",
|
"type": "array",
|
||||||
"minItems": 1,
|
"minItems": 1,
|
||||||
"maxItems": 10,
|
"maxItems": 7,
|
||||||
"items": {
|
"items": {
|
||||||
"type": "object",
|
"type": "object",
|
||||||
"required": ["rank", "candidate", "action", "why", "confidence", "wsjf"],
|
"required": ["rank", "candidate", "action", "why", "confidence", "wsjf"],
|
||||||
@@ -408,7 +445,7 @@ data:
|
|||||||
"rank": {
|
"rank": {
|
||||||
"type": "integer",
|
"type": "integer",
|
||||||
"minimum": 1,
|
"minimum": 1,
|
||||||
"maximum": 10
|
"maximum": 7
|
||||||
},
|
},
|
||||||
"candidate": {
|
"candidate": {
|
||||||
"type": "string"
|
"type": "string"
|
||||||
@@ -578,7 +615,8 @@ spec:
|
|||||||
method=self.command,
|
method=self.command,
|
||||||
)
|
)
|
||||||
try:
|
try:
|
||||||
with urlopen(request, timeout=30) as response:
|
timeout = 360 if self.command == "POST" else 30
|
||||||
|
with urlopen(request, timeout=timeout) as response:
|
||||||
payload = response.read()
|
payload = response.read()
|
||||||
self.send_response(response.status)
|
self.send_response(response.status)
|
||||||
for key, value in response.headers.items():
|
for key, value in response.headers.items():
|
||||||
@@ -599,12 +637,123 @@ spec:
|
|||||||
ThreadingHTTPServer(("0.0.0.0", 18080), Proxy).serve_forever()
|
ThreadingHTTPServer(("0.0.0.0", 18080), Proxy).serve_forever()
|
||||||
readinessProbe:
|
readinessProbe:
|
||||||
httpGet:
|
httpGet:
|
||||||
path: /state/summary
|
path: /state/health
|
||||||
port: http
|
port: http
|
||||||
initialDelaySeconds: 5
|
initialDelaySeconds: 5
|
||||||
periodSeconds: 10
|
periodSeconds: 10
|
||||||
timeoutSeconds: 5
|
timeoutSeconds: 5
|
||||||
failureThreshold: 6
|
failureThreshold: 6
|
||||||
apiVersion: v1
kind: Service
metadata:
  name: actcore-issue-core-bridge
  namespace: activity-core
  labels:
    app.kubernetes.io/name: actcore-issue-core-bridge
    app.kubernetes.io/part-of: activity-core
spec:
  selector:
    app.kubernetes.io/name: actcore-issue-core-bridge
  ports:
    - name: http
      port: 8765
      targetPort: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: actcore-issue-core-bridge
  namespace: activity-core
  labels:
    app.kubernetes.io/name: actcore-issue-core-bridge
    app.kubernetes.io/part-of: activity-core
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: actcore-issue-core-bridge
  template:
    metadata:
      labels:
        app.kubernetes.io/name: actcore-issue-core-bridge
        app.kubernetes.io/part-of: activity-core
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: proxy
          image: activity-core:railiance01-prod
          imagePullPolicy: Never
          ports:
            - name: http
              containerPort: 18081
          command:
            - python
            - -c
            - |
              from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
              from urllib.error import HTTPError, URLError
              from urllib.request import Request, urlopen

              TARGET = "http://127.0.0.1:18765"
              HOP_HEADERS = {"connection", "host", "keep-alive", "proxy-authenticate",
                             "proxy-authorization", "te", "trailers",
                             "transfer-encoding", "upgrade"}

              class Proxy(BaseHTTPRequestHandler):
                  def do_GET(self):
                      self._proxy()

                  def do_POST(self):
                      self._proxy()

                  def do_PATCH(self):
                      self._proxy()

                  def _proxy(self):
                      length = int(self.headers.get("content-length", "0") or "0")
                      body = self.rfile.read(length) if length else None
                      headers = {
                          key: value
                          for key, value in self.headers.items()
                          if key.lower() not in HOP_HEADERS
                      }
                      request = Request(
                          TARGET + self.path,
                          data=body,
                          headers=headers,
                          method=self.command,
                      )
                      try:
                          timeout = 360 if self.command == "POST" else 30
                          with urlopen(request, timeout=timeout) as response:
                              payload = response.read()
                              self.send_response(response.status)
                              for key, value in response.headers.items():
                                  if key.lower() not in HOP_HEADERS:
                                      self.send_header(key, value)
                              self.end_headers()
                              self.wfile.write(payload)
                      except HTTPError as exc:
                          payload = exc.read()
                          self.send_response(exc.code)
                          self.end_headers()
||||||
|
self.wfile.write(payload)
|
||||||
|
except URLError as exc:
|
||||||
|
self.send_response(502)
|
||||||
|
self.end_headers()
|
||||||
|
self.wfile.write(str(exc).encode())
|
||||||
|
|
||||||
|
ThreadingHTTPServer(("0.0.0.0", 18081), Proxy).serve_forever()
|
||||||
|
readinessProbe:
|
||||||
|
httpGet:
|
||||||
|
path: /healthz
|
||||||
|
port: http
|
||||||
|
initialDelaySeconds: 5
|
||||||
|
periodSeconds: 10
|
||||||
|
timeoutSeconds: 5
|
||||||
|
failureThreshold: 6
|
||||||
|
---
|
||||||
---
|
---
|
||||||
apiVersion: batch/v1
|
apiVersion: batch/v1
|
||||||
kind: Job
|
kind: Job
|
||||||
|
|||||||
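The bridge proxy added in this hunk drops hop-by-hop headers before forwarding a request and gives POST a longer upstream timeout than other methods. A minimal standalone sketch of those two decisions follows. `forwardable_headers` and `upstream_timeout` are hypothetical helper names that do not appear in the manifest; the header set and the 360/30 second values are copied from it.

```python
# Header names copied from the HOP_HEADERS set in the manifest above.
HOP_HEADERS = {
    "connection", "host", "keep-alive", "proxy-authenticate",
    "proxy-authorization", "te", "trailers", "transfer-encoding", "upgrade",
}


def forwardable_headers(headers):
    """Return only the headers that may be passed through (hypothetical helper)."""
    return {
        key: value
        for key, value in headers.items()
        if key.lower() not in HOP_HEADERS  # comparison is case-insensitive
    }


def upstream_timeout(method):
    """POST gets the long timeout from the manifest; every other method gets 30 s."""
    return 360 if method == "POST" else 30
```

Filtering on the lowercased name matters because HTTP header names are case-insensitive, so `Host` and `host` must be treated alike.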
@@ -1,4 +1,5 @@
|
|||||||
{
|
{
|
||||||
|
"$comment": "ACTIVITY-WP-0016-T02. Strict, bounded contract for the daily WSJF triage report. The per-item 'recommendations' schema is intentionally strict on STRUCTURE (types + required keys) so the T03 boundary parser can validate each recommendation independently and quarantine only the malformed ones. 'maxItems' is a producer hint (honoured by llm-connect constrained decoding and by the prompt); it is deliberately NOT hard-enforced by the in-repo validator, because rejecting a whole report for having too many items would reproduce the monolithic-failure bug WP-0016 exists to remove. Over-count is mitigated in T03 (keep top-N by rank, quarantine the rest). Value-domain vocabularies (action/confidence) are documented in the prompt and enforced by T04 guardrails with mitigation, not as brittle hard-fail enums here.",
|
||||||
"type": "object",
|
"type": "object",
|
||||||
"required": ["summary", "recommendations"],
|
"required": ["summary", "recommendations"],
|
||||||
"properties": {
|
"properties": {
|
||||||
@@ -7,8 +8,28 @@
|
|||||||
},
|
},
|
||||||
"recommendations": {
|
"recommendations": {
|
||||||
"type": "array",
|
"type": "array",
|
||||||
|
"maxItems": 7,
|
||||||
"items": {
|
"items": {
|
||||||
"type": "object"
|
"type": "object",
|
||||||
|
"required": ["rank", "candidate", "action", "why"],
|
||||||
|
"properties": {
|
||||||
|
"rank": { "type": "integer" },
|
||||||
|
"candidate": { "type": "string" },
|
||||||
|
"action": { "type": "string" },
|
||||||
|
"why": { "type": "string" },
|
||||||
|
"confidence": { "type": "string" },
|
||||||
|
"wsjf": {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {
|
||||||
|
"score": { "type": "number" },
|
||||||
|
"strategic_value": { "type": "number" },
|
||||||
|
"time_criticality": { "type": "number" },
|
||||||
|
"risk_reduction": { "type": "number" },
|
||||||
|
"opportunity_enablement": { "type": "number" },
|
||||||
|
"job_size": { "type": "number" }
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
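The `$comment` in this schema describes a boundary parser that validates each recommendation on its own, keeps the top N by rank, and quarantines the rest rather than rejecting the whole report. The sketch below illustrates that idea under stated assumptions: `split_recommendations` is a hypothetical name and not the T03 implementation, a lower `rank` is assumed to mean higher priority, and only the required keys and their types are checked.

```python
# Required keys and types taken from the schema's "required" and "properties".
REQUIRED_KEYS = {"rank": int, "candidate": str, "action": str, "why": str}
MAX_ITEMS = 7  # mirrors "maxItems": 7


def _is_valid(item):
    """Check structure only: a dict carrying every required key with the right type."""
    if not isinstance(item, dict):
        return False
    for key, expected in REQUIRED_KEYS.items():
        value = item.get(key)
        # bool is a subclass of int in Python, but JSON Schema "integer" excludes it.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


def split_recommendations(items, max_items=MAX_ITEMS):
    """Return (kept, quarantined) so one bad item never fails the whole report."""
    valid = [item for item in items if _is_valid(item)]
    malformed = [item for item in items if not _is_valid(item)]
    valid.sort(key=lambda rec: rec["rank"])  # assumption: rank 1 is the top item
    return valid[:max_items], malformed + valid[max_items:]
```

Over-count items land in the quarantine list after the malformed ones, so a caller can still inspect what was dropped.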
8
scripts/automation_inventory.py
Normal file
@@ -0,0 +1,8 @@
#!/usr/bin/env python3
"""CLI wrapper for the repo-native automation inventory report."""

from activity_core.automation_status import inventory_main


if __name__ == "__main__":
    raise SystemExit(inventory_main())
8
scripts/automation_status.py
Normal file
@@ -0,0 +1,8 @@
#!/usr/bin/env python3
"""CLI wrapper for the repo-native automation status report."""

from activity_core.automation_status import main


if __name__ == "__main__":
    raise SystemExit(main())
212
scripts/smoke_admin_sync_no_restart.py
Executable file
@@ -0,0 +1,212 @@
#!/usr/bin/env python3
"""Railiance01 no-restart smoke for POST /admin/sync.

Patches the disabled ops-service-inventory-probes projection in the cluster
ConfigMap, waits for the API pod volume to refresh, runs /admin/sync twice,
verifies DB + Temporal schedule drift without restarting actcore-worker, then
rolls the ConfigMap back to the disabled baseline.

Requires:
- KUBECONFIG pointing at railiance01 (for example ~/.kube/config-hosteurope)
- kubectl access to the activity-core namespace

Example:
export KUBECONFIG=~/.kube/config-hosteurope
python3 scripts/smoke_admin_sync_no_restart.py
"""

from __future__ import annotations

import json
import subprocess
import sys
import time

ACTIVITY_ID = "40d15a87-7ff6-4d8e-992c-37df15f95110"
CONFIGMAP = "actcore-external-activity-definitions"
DEFINITION_KEY = "ops-service-inventory-probes.md"
MOUNTED_FILE = (
    "/etc/activity-core/external-definitions/activity-definitions/"
    f"{DEFINITION_KEY}"
)
VOLUME_PROPAGATION_SECONDS = 65


def kubectl(*args: str, input_text: str | None = None) -> str:
    cmd = ["kubectl", "-n", "activity-core", *args]
    return subprocess.check_output(
        cmd,
        input=input_text,
        text=True,
    )


def api_json(path: str, *, method: str = "GET") -> dict:
    script = (
        "import urllib.request, json\n"
        f'req = urllib.request.Request("http://localhost:8010{path}", method="{method}")\n'
        "print(urllib.request.urlopen(req).read().decode())"
    )
    return json.loads(kubectl("exec", "deploy/actcore-api", "--", "python3", "-c", script))


def worker_lines(script: str) -> list[str]:
    return kubectl("exec", "deploy/actcore-worker", "--", "python3", "-c", script).splitlines()


def worker_uid() -> str:
    return kubectl(
        "get",
        "pod",
        "-l",
        "app.kubernetes.io/name=actcore-worker",
        "-o",
        "jsonpath={.items[0].metadata.uid}",
    ).strip()


def load_configmap() -> dict:
    return json.loads(kubectl("get", "configmap", CONFIGMAP, "-o", "json"))


def apply_configmap(cm: dict) -> None:
    kubectl("apply", "-f", "-", input_text=json.dumps(cm))


def patch_definition(cm: dict, *, enabled: bool, cron: str) -> None:
    text = cm["data"][DEFINITION_KEY]
    for line in text.splitlines():
        if line.strip().startswith("enabled:"):
            break
    else:
        raise RuntimeError("enabled field not found in projection")

    text = _replace_once(text, 'enabled: false', f"enabled: {'true' if enabled else 'false'}")
    text = _replace_once(text, 'enabled: true', f"enabled: {'true' if enabled else 'false'}")
    text = _replace_once(
        text,
        'cron_expression: "15 * * * *"',
        f'cron_expression: "{cron}"',
    )
    text = _replace_once(
        text,
        'cron_expression: "25 * * * *"',
        f'cron_expression: "{cron}"',
    )
    cm["data"][DEFINITION_KEY] = text
    apply_configmap(cm)


def _replace_once(text: str, old: str, new: str) -> str:
    if old not in text:
        return text
    return text.replace(old, new, 1)


def wait_for_mount(*, enabled: bool, cron: str) -> None:
    deadline = time.time() + VOLUME_PROPAGATION_SECONDS
    want_enabled = "enabled: true" if enabled else "enabled: false"
    want_cron = f'cron_expression: "{cron}"'
    while time.time() < deadline:
        content = kubectl("exec", "deploy/actcore-api", "--", "cat", MOUNTED_FILE)
        if want_enabled in content and want_cron in content:
            return
        time.sleep(5)
    raise RuntimeError(
        f"ConfigMap projection did not refresh within {VOLUME_PROPAGATION_SECONDS}s"
    )


def get_definition() -> dict[str, object]:
    for item in api_json("/activity-definitions/"):
        if item["id"] == ACTIVITY_ID:
            return {
                "enabled": item["enabled"],
                "cron": item["trigger_config"]["cron_expression"],
            }
    raise RuntimeError(f"ActivityDefinition {ACTIVITY_ID} not found")


def describe_schedule() -> dict[str, object]:
    script = f"""
import asyncio
from temporalio.client import Client

async def main() -> None:
    client = await Client.connect("actcore-temporal:7233")
    handle = client.get_schedule_handle("activity-schedule-{ACTIVITY_ID}")
    described = await handle.describe()
    schedule = described.schedule
    minute = schedule.spec.calendars[0].minute[0].start if schedule.spec.calendars else None
    print(schedule.state.paused)
    print(minute)

asyncio.run(main())
"""
    paused, minute = worker_lines(script)
    return {"paused": paused == "True", "minute": int(minute)}


def main() -> int:
    worker_before = worker_uid()
    cm = load_configmap()

    print("1) enable + cadence change via ConfigMap")
    patch_definition(cm, enabled=True, cron="25 * * * *")
    wait_for_mount(enabled=True, cron="25 * * * *")

    print("2) POST /admin/sync (first pass)")
    sync1 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
    if not sync1.get("ok"):
        print(json.dumps(sync1, indent=2), file=sys.stderr)
        return 1

    defn = get_definition()
    schedule = describe_schedule()
    print(" definition:", defn)
    print(" schedule:", schedule)
    if defn != {"enabled": True, "cron": "25 * * * *"}:
        print("definition drift after sync", file=sys.stderr)
        return 1
    if schedule["paused"] or schedule["minute"] != 25:
        print("schedule drift after enable sync", file=sys.stderr)
        return 1

    print("3) POST /admin/sync (idempotent repeat)")
    sync2 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
    if sync2.get("schedules") != sync1.get("schedules"):
        print("idempotent schedule counts changed", file=sys.stderr)
        print(json.dumps({"sync1": sync1, "sync2": sync2}, indent=2), file=sys.stderr)
        return 1

    print("4) rollback ConfigMap + sync")
    cm = load_configmap()
    patch_definition(cm, enabled=False, cron="15 * * * *")
    wait_for_mount(enabled=False, cron="15 * * * *")
    sync3 = api_json("/admin/sync?definitions=true&schedules=true", method="POST")
    if not sync3.get("ok"):
        print(json.dumps(sync3, indent=2), file=sys.stderr)
        return 1

    defn = get_definition()
    schedule = describe_schedule()
    print(" definition:", defn)
    print(" schedule:", schedule)
    if defn != {"enabled": False, "cron": "15 * * * *"}:
        print("rollback definition drift", file=sys.stderr)
        return 1
    if not schedule["paused"] or schedule["minute"] != 15:
        print("rollback schedule drift", file=sys.stderr)
        return 1

    worker_after = worker_uid()
    if worker_before != worker_after:
        print("actcore-worker pod restarted during smoke", file=sys.stderr)
        return 1

    print("smoke passed: admin sync hot-reload without worker restart")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
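`wait_for_mount` in the smoke script polls the mounted file every 5 seconds until a 65 second deadline passes. The same deadline-polling pattern can be sketched as a generic helper. `wait_until` and its `clock` and `sleep` parameters are hypothetical and exist only so the loop can be exercised without a cluster. Unlike the script, which uses `time.time()`, this sketch defaults to `time.monotonic`, a swap made here because a monotonic clock is unaffected by wall-clock adjustments.

```python
import time


def wait_until(check, timeout_seconds, interval_seconds=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or the deadline passes.

    Returns True on success and False on timeout; the caller decides
    whether a timeout should raise.
    """
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        if check():
            return True
        sleep(interval_seconds)
    return False
```

Injecting `clock` and `sleep` keeps the helper testable with a fake timer, so a unit test does not have to wait for real time to pass.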
@@ -149,6 +149,8 @@ async def resolve_context(
|
|||||||
query = source.get("query", "")
|
query = source.get("query", "")
|
||||||
params = source.get("params") or {}
|
params = source.get("params") or {}
|
||||||
required = bool(source.get("required") or params.get("required", False))
|
required = bool(source.get("required") or params.get("required", False))
|
||||||
|
resolver_params = dict(params)
|
||||||
|
resolver_params["required"] = required
|
||||||
raw_bind = source.get("bind_to") or source.get("name") or source_type
|
raw_bind = source.get("bind_to") or source.get("name") or source_type
|
||||||
# Strip the 'context.' namespace prefix so evaluator can find the key.
|
# Strip the 'context.' namespace prefix so evaluator can find the key.
|
||||||
bind_key = raw_bind.removeprefix("context.") if raw_bind.startswith("context.") else raw_bind
|
bind_key = raw_bind.removeprefix("context.") if raw_bind.startswith("context.") else raw_bind
|
||||||
@@ -172,7 +174,7 @@ async def resolve_context(
|
|||||||
continue
|
continue
|
||||||
|
|
||||||
try:
|
try:
|
||||||
resolved = resolver_cls().resolve(query, event_envelope, params)
|
resolved = resolver_cls().resolve(query, event_envelope, resolver_params)
|
||||||
snapshot[bind_key] = _bind_resolver_result(bind_key, resolved)
|
snapshot[bind_key] = _bind_resolver_result(bind_key, resolved)
|
||||||
except Exception as exc:
|
except Exception as exc:
|
||||||
if required:
|
if required:
|
||||||
@@ -364,6 +366,7 @@ async def evaluate_instructions(payload: dict) -> dict:
|
|||||||
"output_validated": result.output_validated,
|
"output_validated": result.output_validated,
|
||||||
"review_required": result.review_required,
|
"review_required": result.review_required,
|
||||||
"validation_error": result.validation_error,
|
"validation_error": result.validation_error,
|
||||||
|
"llm_response_metadata": result.llm_response_metadata,
|
||||||
})
|
})
|
||||||
for spec in result.tasks:
|
for spec in result.tasks:
|
||||||
task_specs.append({
|
task_specs.append({
|
||||||
|
|||||||
@@ -40,6 +40,7 @@ from temporalio.client import Client
|
|||||||
from activity_core.models import ActivityDefinition, CronTriggerConfig
|
from activity_core.models import ActivityDefinition, CronTriggerConfig
|
||||||
from activity_core.orm import ActivityDefinition as ActivityDefinitionRow, EventType as EventTypeRow
|
from activity_core.orm import ActivityDefinition as ActivityDefinitionRow, EventType as EventTypeRow
|
||||||
from activity_core.schedule_manager import delete_schedule, upsert_schedule
|
from activity_core.schedule_manager import delete_schedule, upsert_schedule
|
||||||
|
from activity_core.sync_service import run_sync
|
||||||
from activity_core.webhook_receiver import router as webhook_router
|
from activity_core.webhook_receiver import router as webhook_router
|
||||||
|
|
||||||
TEMPORAL_HOST = os.environ.get("TEMPORAL_HOST", "localhost:7233")
|
TEMPORAL_HOST = os.environ.get("TEMPORAL_HOST", "localhost:7233")
|
||||||
@@ -275,6 +276,24 @@ async def trigger_definition(definition_id: uuid.UUID) -> dict[str, str]:
|
|||||||
return {"workflow_id": handle.id, "trigger_key": trigger_key}
|
return {"workflow_id": handle.id, "trigger_key": trigger_key}
|
||||||
|
|
||||||
|
|
||||||
|
# --- Admin sync ---------------------------------------------------------------
|
||||||
|
|
||||||
|
@app.post("/admin/sync")
|
||||||
|
async def admin_sync(
|
||||||
|
definitions: bool = True,
|
||||||
|
schedules: bool = True,
|
||||||
|
event_types: bool = False,
|
||||||
|
) -> dict[str, Any]:
|
||||||
|
"""Run operator-triggered definition/event/schedule sync without restart."""
|
||||||
|
return await run_sync(
|
||||||
|
session_factory=_get_db(),
|
||||||
|
temporal_client=_get_temporal() if schedules else None,
|
||||||
|
definitions=definitions,
|
||||||
|
schedules=schedules,
|
||||||
|
event_types=event_types,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
# T42: Curator gate — event type approval endpoint
|
# T42: Curator gate — event type approval endpoint
|
||||||
|
|
||||||
@app.get("/health")
|
@app.get("/health")
|
||||||
|
|||||||
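The new `/admin/sync` endpoint takes three boolean query parameters, with `definitions` and `schedules` defaulting to true and `event_types` to false. A minimal sketch of building the request from a client follows, assuming the API is reachable at the `http://localhost:8010` address that the smoke script uses from inside the pod. `admin_sync_request` is a hypothetical helper and is not part of the repository.

```python
from urllib.parse import urlencode
from urllib.request import Request


def admin_sync_request(base_url, *, definitions=True, schedules=True, event_types=False):
    """Build, but do not send, a POST request for /admin/sync (hypothetical helper)."""
    query = urlencode(
        {
            # Booleans are sent as lowercase "true"/"false" query values.
            "definitions": str(definitions).lower(),
            "schedules": str(schedules).lower(),
            "event_types": str(event_types).lower(),
        }
    )
    return Request(f"{base_url.rstrip('/')}/admin/sync?{query}", method="POST")
```

Passing the result to `urllib.request.urlopen` would send it. The response shape is whatever `run_sync` returns, which this hunk does not show.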
1107
src/activity_core/automation_status.py
Normal file
File diff suppressed because it is too large
@@ -4,4 +4,5 @@ from activity_core.context_resolvers import ( # noqa: F401
|
|||||||
ops_inventory,
|
ops_inventory,
|
||||||
repo_scoping,
|
repo_scoping,
|
||||||
state_hub,
|
state_hub,
|
||||||
|
reuse_surface,
|
||||||
)
|
)
|
||||||
|
|||||||
516
src/activity_core/context_resolvers/reuse_surface.py
Normal file
@@ -0,0 +1,516 @@
|
|||||||
|
"""Reuse-surface registry hygiene context adapter.
|
||||||
|
|
||||||
|
Registered as source type ``reuse-surface`` and as the ``shell`` resolver
|
||||||
|
dispatcher for the ``reuse_surface_report_gaps`` query. Other shell queries
|
||||||
|
continue to delegate to the kaizen resolver for backward compatibility.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import socket
|
||||||
|
import subprocess
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
import yaml
|
||||||
|
|
||||||
|
from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY, ContextResolver
|
||||||
|
from activity_core.context_resolvers.kaizen import KaizenContextResolver
|
||||||
|
from activity_core.context_resolvers.state_hub import StateHubContextResolver
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
|
||||||
|
_REPORT_TIMEOUT_SECONDS = 60
|
||||||
|
_STATE_HUB_TIMEOUT_SECONDS = 10.0
|
||||||
|
_KNOWN_SIGNALS = frozenset(
|
||||||
|
{
|
||||||
|
"registry_gap",
|
||||||
|
"empty_capability_scaffold",
|
||||||
|
"stale_scope",
|
||||||
|
"stale_sbom",
|
||||||
|
"publish_check_fail",
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class RosterEntry:
|
||||||
|
slug: str
|
||||||
|
domain: str | None = None
|
||||||
|
publish_check: str | None = None
|
||||||
|
|
||||||
|
|
||||||
|
def _base_url() -> str:
|
||||||
|
return os.environ.get("STATE_HUB_URL", _DEFAULT_STATE_HUB_URL).rstrip("/")
|
||||||
|
|
||||||
|
|
||||||
|
def _runner_host(params: dict[str, Any]) -> str:
|
||||||
|
return str(
|
||||||
|
params.get("runner_host")
|
||||||
|
or os.environ.get("KAIZEN_RUNNER_HOST")
|
||||||
|
or socket.gethostname()
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _as_required(params: dict[str, Any]) -> bool:
|
||||||
|
return bool(params.get("required", False))
|
||||||
|
|
||||||
|
|
||||||
|
def reuse_surface_report_gaps(params: dict[str, Any]) -> dict[str, Any]:
|
||||||
|
"""Resolve registry-hygiene gaps for the next rollout batch.
|
||||||
|
|
||||||
|
Missing operational dependencies are visible failures for required sources
|
||||||
|
and graceful empty lists for optional sources so definitions can opt into
|
||||||
|
either behavior without changing rule logic.
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
return _resolve_reuse_surface_report_gaps(params)
|
||||||
|
except Exception as exc:
|
||||||
|
if _as_required(params):
|
||||||
|
raise
|
||||||
|
logger.warning("reuse_surface_report_gaps unavailable: %s", exc)
|
||||||
|
return {"gaps": []}
|
||||||
|
|
||||||
|
|
||||||
|
def _resolve_reuse_surface_report_gaps(params: dict[str, Any]) -> dict[str, Any]:
|
||||||
|
roster_path = _roster_path(params)
|
||||||
|
entries = _load_active_roster_entries(roster_path)
|
||||||
|
if not entries:
|
||||||
|
return {"gaps": []}
|
||||||
|
|
||||||
|
state_path = _round_robin_state_path(params, roster_path)
|
||||||
|
selected, next_cursor = _select_round_robin_batch(
|
||||||
|
entries,
|
||||||
|
_batch_size(params),
|
||||||
|
state_path,
|
||||||
|
)
|
||||||
|
if not selected:
|
||||||
|
return {"gaps": []}
|
||||||
|
|
||||||
|
signals = _enabled_signals(_signals_path(params, roster_path))
|
||||||
|
roots = _resolve_repo_roots(selected, _runner_host(params))
|
||||||
|
report = _reuse_surface_report(params, signals)
|
||||||
|
gaps = _gap_records(selected, roots, signals, report)
|
||||||
|
|
||||||
|
_write_round_robin_state(state_path, next_cursor, selected)
|
||||||
|
return {"gaps": gaps}
|
||||||
|
|
||||||
|
|
||||||
|
def _roster_path(params: dict[str, Any]) -> Path:
|
||||||
|
raw = params.get("roster")
|
||||||
|
if not raw:
|
||||||
|
raise ValueError("reuse_surface_report_gaps requires params.roster")
|
||||||
|
path = Path(str(raw)).expanduser()
|
||||||
|
if not path.is_file():
|
||||||
|
raise FileNotFoundError(f"reuse_surface_report_gaps roster not found: {path}")
|
||||||
|
return path
|
||||||
|
|
||||||
|
|
||||||
|
def _batch_size(params: dict[str, Any]) -> int:
|
||||||
|
try:
|
||||||
|
return max(1, int(params.get("batch_size", 3)))
|
||||||
|
except (TypeError, ValueError):
|
||||||
|
return 3
|
||||||
|
|
||||||
|
|
||||||
|
def _round_robin_state_path(params: dict[str, Any], roster_path: Path) -> Path:
|
||||||
|
raw = params.get("round_robin_state")
|
||||||
|
if raw:
|
||||||
|
return Path(str(raw)).expanduser()
|
||||||
|
return roster_path.with_name("round-robin-state.json")
|
||||||
|
|
||||||
|
|
||||||
|
def _signals_path(params: dict[str, Any], roster_path: Path) -> Path:
|
||||||
|
raw = params.get("signals")
|
||||||
|
if raw:
|
||||||
|
return Path(str(raw)).expanduser()
|
||||||
|
return roster_path.with_name("signals.yml")
|
||||||
|
|
||||||
|
|
||||||
|
def _load_active_roster_entries(path: Path) -> list[RosterEntry]:
|
||||||
|
data = yaml.safe_load(path.read_text(encoding="utf-8"))
|
||||||
|
if not isinstance(data, dict):
|
||||||
|
raise ValueError(f"reuse_surface rollout roster is not a mapping: {path}")
|
||||||
|
|
||||||
|
entries: dict[str, RosterEntry] = {}
|
||||||
|
for domain, block in _iter_domain_blocks(data):
|
||||||
|
if _domain_phase(block) != "active":
|
||||||
|
continue
|
||||||
|
for item in _repo_items(block):
|
||||||
|
entry = _entry_from_item(item, domain, block)
|
||||||
|
if entry and entry.slug not in entries:
|
||||||
|
entries[entry.slug] = entry
|
||||||
|
return list(entries.values())
|
||||||
|
|
||||||
|
|
||||||
|
def _iter_domain_blocks(data: dict[str, Any]) -> list[tuple[str | None, dict[str, Any]]]:
|
||||||
|
domains = data.get("domains")
|
||||||
|
if isinstance(domains, dict):
|
||||||
|
return [
|
||||||
|
(str(name), block)
|
||||||
|
for name, block in domains.items()
|
||||||
|
if isinstance(block, dict)
|
||||||
|
]
|
||||||
|
if isinstance(domains, list):
|
||||||
|
return [
|
||||||
|
(str(block.get("name") or block.get("domain") or ""), block)
|
||||||
|
for block in domains
|
||||||
|
if isinstance(block, dict)
|
||||||
|
]
|
||||||
|
if isinstance(data.get("active"), list):
|
||||||
|
return [(None, {"phase": "active", "repos": data["active"]})]
|
||||||
|
return [
|
||||||
|
(str(name), block)
|
||||||
|
for name, block in data.items()
|
||||||
|
if isinstance(block, dict) and ("phase" in block or "repos" in block)
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def _domain_phase(block: dict[str, Any]) -> str:
|
||||||
|
return str(block.get("phase") or block.get("status") or "").lower()
|
||||||
|
|
||||||
|
|
||||||
|
def _repo_items(block: dict[str, Any]) -> list[Any]:
|
||||||
|
repos = (
|
||||||
|
block.get("repos")
|
||||||
|
or block.get("repo_slugs")
|
||||||
|
or block.get("repositories")
|
||||||
|
or block.get("slugs")
|
||||||
|
or []
|
||||||
|
)
|
||||||
|
if isinstance(repos, dict):
|
||||||
|
items: list[Any] = []
|
||||||
|
for slug, config in repos.items():
|
||||||
|
if isinstance(config, dict):
|
||||||
|
item = dict(config)
|
||||||
|
item.setdefault("slug", slug)
|
||||||
|
items.append(item)
|
||||||
|
else:
|
||||||
|
items.append(str(slug))
|
||||||
|
return items
|
||||||
|
if isinstance(repos, list):
|
||||||
|
return repos
|
||||||
|
return []
|
||||||
|
|
||||||
|
|
||||||
|
def _entry_from_item(
|
||||||
|
item: Any,
|
||||||
|
domain: str | None,
|
||||||
|
block: dict[str, Any],
|
||||||
|
) -> RosterEntry | None:
|
||||||
|
publish_check = block.get("publish_check")
|
||||||
|
if isinstance(item, str):
|
||||||
|
slug = item
|
||||||
|
elif isinstance(item, dict):
|
||||||
|
slug = item.get("slug") or item.get("repo") or item.get("name")
|
||||||
|
publish_check = item.get("publish_check", publish_check)
|
||||||
|
else:
|
||||||
|
return None
|
||||||
|
if not slug:
|
||||||
|
return None
|
||||||
|
return RosterEntry(
|
||||||
|
slug=str(slug),
|
||||||
|
domain=domain or None,
|
||||||
|
publish_check=str(publish_check).lower() if publish_check is not None else None,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _select_round_robin_batch(
|
||||||
|
entries: list[RosterEntry],
|
||||||
|
batch_size: int,
|
||||||
|
state_path: Path,
|
||||||
|
) -> tuple[list[RosterEntry], int]:
|
||||||
|
if not entries:
|
||||||
|
return [], 0
|
||||||
|
cursor = _read_round_robin_cursor(state_path) % len(entries)
|
||||||
|
size = min(batch_size, len(entries))
|
||||||
|
selected = [entries[(cursor + offset) % len(entries)] for offset in range(size)]
|
||||||
|
next_cursor = (cursor + size) % len(entries)
|
||||||
|
return selected, next_cursor
|
||||||
|
|
||||||
|
|
||||||
|
def _read_round_robin_cursor(path: Path) -> int:
|
||||||
|
if not path.is_file():
|
||||||
|
return 0
|
||||||
|
try:
|
||||||
|
data = json.loads(path.read_text(encoding="utf-8"))
|
||||||
|
except (OSError, json.JSONDecodeError):
|
||||||
|
return 0
|
||||||
|
if not isinstance(data, dict):
|
||||||
|
return 0
|
||||||
|
try:
|
||||||
|
return int(data.get("cursor", 0))
|
||||||
|
except (TypeError, ValueError):
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
def _write_round_robin_state(
|
||||||
|
path: Path,
|
||||||
|
cursor: int,
|
||||||
|
selected: list[RosterEntry],
|
||||||
|
) -> None:
|
||||||
|
path.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
payload = {
|
||||||
|
"cursor": cursor,
|
||||||
|
"last_batch": [entry.slug for entry in selected],
|
||||||
|
"updated_at": datetime.now(timezone.utc).isoformat(),
|
||||||
|
}
|
||||||
|
path.write_text(
|
||||||
|
json.dumps(payload, indent=2, sort_keys=True) + "\n",
|
||||||
|
encoding="utf-8",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _enabled_signals(path: Path) -> set[str]:
|
||||||
|
if not path.is_file():
|
||||||
|
return set(_KNOWN_SIGNALS)
|
||||||
|
data = yaml.safe_load(path.read_text(encoding="utf-8"))
|
||||||
|
node = data.get("signals") if isinstance(data, dict) else data
|
||||||
|
enabled: set[str] = set()
|
||||||
|
saw_known_signal = False
|
||||||
|
|
||||||
|
if isinstance(node, dict):
|
||||||
|
for name, config in node.items():
|
||||||
|
if str(name) not in _KNOWN_SIGNALS:
|
||||||
|
continue
|
||||||
|
saw_known_signal = True
|
||||||
|
if isinstance(config, dict) and config.get("enabled") is False:
|
||||||
|
continue
|
||||||
|
if config is False:
|
||||||
|
continue
|
||||||
|
enabled.add(str(name))
|
||||||
|
elif isinstance(node, list):
|
||||||
|
for item in node:
|
||||||
|
if isinstance(item, str) and item in _KNOWN_SIGNALS:
|
||||||
|
saw_known_signal = True
|
||||||
|
enabled.add(item)
|
||||||
|
elif isinstance(item, dict):
|
||||||
|
name = item.get("id") or item.get("signal") or item.get("name")
|
||||||
|
if str(name) in _KNOWN_SIGNALS and item.get("enabled", True) is not False:
|
||||||
|
saw_known_signal = True
|
||||||
|
enabled.add(str(name))
|
||||||
|
|
||||||
|
return enabled if saw_known_signal else set(_KNOWN_SIGNALS)
|
||||||
|
|
||||||
|
|
||||||
|
def _resolve_repo_roots(
|
||||||
|
entries: list[RosterEntry],
|
||||||
|
runner_host: str,
|
||||||
|
) -> dict[str, Path]:
|
||||||
|
requested = {entry.slug for entry in entries}
|
||||||
|
roots: dict[str, Path] = {}
|
||||||
|
for repo in _fetch_repos():
|
||||||
|
slug = str(repo.get("slug") or "")
|
||||||
|
if slug not in requested:
|
||||||
|
continue
|
||||||
|
raw = _repo_path_for_host(repo, runner_host)
|
||||||
|
if raw:
|
||||||
|
roots[slug] = Path(raw)
|
||||||
|
return roots
|
||||||
|
|
||||||
|
|
||||||
|
def _fetch_repos() -> list[dict[str, Any]]:
|
||||||
|
url = f"{_base_url()}/repos/"
|
||||||
|
try:
|
||||||
|
resp = httpx.get(url, timeout=_STATE_HUB_TIMEOUT_SECONDS)
|
||||||
|
resp.raise_for_status()
|
||||||
|
except httpx.HTTPError as exc:
|
||||||
|
raise RuntimeError(f"State Hub unreachable at {url}: {exc}") from exc
|
||||||
|
payload = resp.json()
|
||||||
|
if not isinstance(payload, list):
|
||||||
|
raise RuntimeError(f"State Hub /repos/ returned non-list: {type(payload)!r}")
|
||||||
|
return [repo for repo in payload if isinstance(repo, dict)]
|
||||||
|
|
||||||
|
|
||||||
|
def _repo_path_for_host(repo: dict[str, Any], runner_host: str) -> str | None:
|
||||||
|
host_paths = repo.get("host_paths") or {}
|
||||||
|
raw = None
|
||||||
|
if isinstance(host_paths, dict):
|
||||||
|
raw = host_paths.get(runner_host)
|
||||||
|
raw = raw or repo.get("local_path")
|
||||||
|
if not raw or raw == "(unknown)":
|
||||||
|
return None
|
||||||
|
return str(raw)
|
||||||
|
|
||||||
|
|
||||||
|
+def _reuse_surface_report(params: dict[str, Any], signals: set[str]) -> dict[str, Any]:
+    if not (signals & {"registry_gap", "empty_capability_scaffold"}):
+        return {}
+    binary = str(params.get("reuse_surface_bin") or "reuse-surface")
+    try:
+        completed = subprocess.run(
+            [binary, "report", "gaps", "--format", "json"],
+            capture_output=True,
+            check=False,
+            text=True,
+            timeout=_REPORT_TIMEOUT_SECONDS,
+        )
+    except FileNotFoundError as exc:
+        raise RuntimeError(f"reuse-surface CLI not found: {binary}") from exc
+    except subprocess.TimeoutExpired as exc:
+        raise RuntimeError("reuse-surface report gaps timed out") from exc
+
+    if completed.returncode != 0:
+        detail = completed.stderr.strip() or completed.stdout.strip()
+        raise RuntimeError(f"reuse-surface report gaps failed: {detail}")
+    try:
+        payload = json.loads(completed.stdout or "{}")
+    except json.JSONDecodeError as exc:
+        raise RuntimeError("reuse-surface report gaps returned invalid JSON") from exc
+    if not isinstance(payload, dict):
+        raise RuntimeError("reuse-surface report gaps returned non-object JSON")
+    return payload
+
+
+def _gap_records(
+    entries: list[RosterEntry],
+    roots: dict[str, Path],
+    signals: set[str],
+    report: dict[str, Any],
+) -> list[dict[str, Any]]:
+    empty_scaffolds = _repo_set(report, {"empty_scaffolds", "empty_scaffold"})
+    publish_fail = _repo_set(
+        report,
+        {"publish_fail", "publish_fails", "publish_failures"},
+    )
+    gaps: list[dict[str, Any]] = []
+    seen: set[tuple[str, str]] = set()
+
+    for entry in entries:
+        root = roots.get(entry.slug)
+        if root is None:
+            logger.info("reuse_surface repo_unreachable slug=%s", entry.slug)
+            continue
+
+        if (
+            signals & {"registry_gap", "empty_capability_scaffold"}
+            and entry.slug in empty_scaffolds
+        ):
+            _append_gap(gaps, seen, entry.slug, root, "empty_capability_scaffold")
+
+        if "registry_gap" in signals and entry.slug in publish_fail:
+            _append_gap(gaps, seen, entry.slug, root, "registry_gap")
+
+        if "publish_check_fail" in signals and entry.publish_check == "fail":
+            _append_gap(gaps, seen, entry.slug, root, "publish_check_fail")
+
+        if "stale_scope" in signals and _scope_is_stale(root):
+            _append_gap(gaps, seen, entry.slug, root, "stale_scope")
+
+        if "stale_sbom" in signals and _sbom_is_stale(entry.slug):
+            _append_gap(gaps, seen, entry.slug, root, "stale_sbom")
+
+    return gaps
+
+
+def _append_gap(
+    gaps: list[dict[str, Any]],
+    seen: set[tuple[str, str]],
+    slug: str,
+    root: Path,
+    signal: str,
+) -> None:
+    key = (slug, signal)
+    if key in seen:
+        return
+    seen.add(key)
+    gaps.append(
+        {
+            "repo": slug,
+            "root": str(root),
+            "signal": signal,
+            "hygiene_signal": signal,
+        }
+    )
+
+
+def _scope_is_stale(root: Path) -> bool:
+    scope = root / "SCOPE.md"
+    if not scope.is_file():
+        return True
+    age_seconds = datetime.now(timezone.utc).timestamp() - scope.stat().st_mtime
+    return age_seconds > 90 * 24 * 60 * 60
+
+
+def _sbom_is_stale(slug: str) -> bool:
+    payload = StateHubContextResolver().resolve(
+        "repo_sbom_status",
+        None,
+        {"repo_slug": slug},
+    )
+    if not isinstance(payload, dict):
+        return False
+    try:
+        return int(payload.get("sbom_age_days", 0)) > 30
+    except (TypeError, ValueError):
+        return False
+
+
+def _repo_set(report: dict[str, Any], keys: set[str]) -> set[str]:
+    slugs: set[str] = set()
+    for value in _values_for_keys(report, keys):
+        slugs.update(_slugs_from_value(value))
+    return slugs
+
+
+def _values_for_keys(value: Any, keys: set[str]) -> list[Any]:
+    values: list[Any] = []
+    if isinstance(value, dict):
+        for key, nested in value.items():
+            if key in keys:
+                values.append(nested)
+            values.extend(_values_for_keys(nested, keys))
+    elif isinstance(value, list):
+        for item in value:
+            values.extend(_values_for_keys(item, keys))
+    return values
+
+
+def _slugs_from_value(value: Any) -> set[str]:
+    if isinstance(value, str):
+        return {value}
+    if isinstance(value, list):
+        slugs: set[str] = set()
+        for item in value:
+            slugs.update(_slugs_from_value(item))
+        return slugs
+    if isinstance(value, dict):
+        for key in ("repo", "repo_slug", "slug", "name"):
+            if value.get(key):
+                return {str(value[key])}
+        slugs: set[str] = set()
+        for key, nested in value.items():
+            if nested is True or isinstance(nested, (dict, list)):
+                slugs.add(str(key))
+                slugs.update(_slugs_from_value(nested))
+        return slugs
+    return set()
+
+
+class ReuseSurfaceContextResolver(ContextResolver):
+    """Resolves reuse-surface registry hygiene gap reports."""
+
+    def resolve(self, query: str, event: Any, params: dict[str, Any]) -> dict[str, Any]:
+        if query == "reuse_surface_report_gaps":
+            return reuse_surface_report_gaps(params)
+        return {}
+
+
+class ShellContextResolver(ContextResolver):
+    """Dispatch shell-backed context queries without breaking kaizen aliases."""
+
+    def resolve(self, query: str, event: Any, params: dict[str, Any]) -> dict[str, Any]:
+        if query == "reuse_surface_report_gaps":
+            return reuse_surface_report_gaps(params)
+        return KaizenContextResolver().resolve(query, event, params)
+
+
+CONTEXT_RESOLVER_REGISTRY["reuse-surface"] = ReuseSurfaceContextResolver
+CONTEXT_RESOLVER_REGISTRY["shell"] = ShellContextResolver
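Reading aid for `_repo_set`, `_values_for_keys` and `_slugs_from_value`: a minimal sketch of how the three helpers appear to cooperate, restated with public names. The report payload below is hypothetical sample data, and because the indentation is not recoverable from this view, the sketch assumes the recursive call in the dict branch sits inside the `if`.

```python
from __future__ import annotations

from typing import Any


def values_for_keys(value: Any, keys: set[str]) -> list[Any]:
    """Collect every value stored under one of `keys`, at any nesting depth."""
    values: list[Any] = []
    if isinstance(value, dict):
        for key, nested in value.items():
            if key in keys:
                values.append(nested)
            values.extend(values_for_keys(nested, keys))
    elif isinstance(value, list):
        for item in value:
            values.extend(values_for_keys(item, keys))
    return values


def slugs_from_value(value: Any) -> set[str]:
    """Turn a string, list, or dict into a set of repo slugs."""
    if isinstance(value, str):
        return {value}
    if isinstance(value, list):
        found: set[str] = set()
        for item in value:
            found.update(slugs_from_value(item))
        return found
    if isinstance(value, dict):
        # An explicit slug-like field wins over everything else in the dict.
        for key in ("repo", "repo_slug", "slug", "name"):
            if value.get(key):
                return {str(value[key])}
        found = set()
        for key, nested in value.items():
            # Assumption: keys count as slugs only when flagged True or nested.
            if nested is True or isinstance(nested, (dict, list)):
                found.add(str(key))
                found.update(slugs_from_value(nested))
        return found
    return set()


def repo_set(report: dict[str, Any], keys: set[str]) -> set[str]:
    slugs: set[str] = set()
    for value in values_for_keys(report, keys):
        slugs.update(slugs_from_value(value))
    return slugs


# Hypothetical report payload, for illustration only.
report = {
    "summary": {"empty_scaffolds": ["repo-a", {"slug": "repo-b"}]},
    "publish_fail": {"repo-c": True, "repo-d": False},
}
empty = repo_set(report, {"empty_scaffolds", "empty_scaffold"})
failed = repo_set(report, {"publish_fail", "publish_fails", "publish_failures"})
```

With this sample, `empty` holds `repo-a` and `repo-b`, while `failed` holds only `repo-c` because `repo-d` is flagged `False`.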
@@ -12,6 +12,7 @@ Supported queries:
 - coding_retro: latest /progress/ item with event_type=coding_retro
 - daily_triage_digest: curated scalar JSON digest for daily WSJF triage
 - recently_on_scope_hourly: POST {STATE_HUB_URL}/recently-on-scope/hourly
+- consistency_sweep_remote_all: POST {STATE_HUB_URL}/consistency/sweep/remote-all
 
 No caching — state hub data is live operational state and must not be stale
 within a single workflow run.
@@ -31,6 +32,7 @@ from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY, Cont
 
 _DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
 _TIMEOUT_SECONDS = 10.0
+_SWEEP_TIMEOUT_SECONDS = 330.0
 _OPEN_WORKSTREAM_STATUSES = {"active", "ready", "blocked"}
 _OPEN_TASK_STATUSES = {"wait", "todo", "progress"}
 # Sentinel age for repos that have never had an SBOM ingested. Large enough
@@ -53,13 +55,26 @@ def _fetch_json(path: str, params: dict[str, Any] | None = None) -> Any:
     return {}
 
 
-def _post_json(path: str, payload: dict[str, Any]) -> Any:
+def _post_json(path: str, payload: dict[str, Any], *, timeout: float = _TIMEOUT_SECONDS) -> Any:
     url = f"{_base_url()}{path}"
-    resp = httpx.post(url, json=payload, timeout=_TIMEOUT_SECONDS)
+    resp = httpx.post(url, json=payload, timeout=timeout)
     resp.raise_for_status()
     return resp.json()
 
 
+def _validate_consistency_sweep_remote_all(result: Any) -> dict[str, Any]:
+    if not isinstance(result, dict):
+        raise RuntimeError("consistency_sweep_remote_all returned a non-object response")
+    required_keys = {"exit_code", "lock_skipped", "repos_processed"}
+    missing = required_keys - set(result)
+    if missing:
+        missing_list = ", ".join(sorted(missing))
+        raise RuntimeError(
+            f"consistency_sweep_remote_all response missing required key(s): {missing_list}"
+        )
+    return result
+
+
 def _validate_recently_on_scope_hourly(result: Any) -> dict[str, Any]:
     if not isinstance(result, dict):
         raise RuntimeError("recently_on_scope_hourly returned a non-object response")
@@ -107,6 +122,18 @@ class StateHubContextResolver(ContextResolver):
             }
             result = _post_json("/recently-on-scope/hourly", payload)
             return _validate_recently_on_scope_hourly(result)
+        if query == "consistency_sweep_remote_all":
+            payload = {
+                key: value
+                for key, value in params.items()
+                if key not in {"required"}
+            }
+            result = _post_json(
+                "/consistency/sweep/remote-all",
+                payload,
+                timeout=_SWEEP_TIMEOUT_SECONDS,
+            )
+            return _validate_consistency_sweep_remote_all(result)
         return {}
 
 
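Reading aid for the new `consistency_sweep_remote_all` query: a minimal sketch of the response check added in this file, restated under a public name so it runs on its own. The sample payloads are hypothetical.

```python
from __future__ import annotations

from typing import Any


def validate_sweep_response(result: Any) -> dict[str, Any]:
    """Reject non-object responses and responses missing a required key."""
    if not isinstance(result, dict):
        raise RuntimeError("consistency_sweep_remote_all returned a non-object response")
    required_keys = {"exit_code", "lock_skipped", "repos_processed"}
    missing = required_keys - set(result)
    if missing:
        # Sorting keeps the error message stable across runs.
        missing_list = ", ".join(sorted(missing))
        raise RuntimeError(
            f"consistency_sweep_remote_all response missing required key(s): {missing_list}"
        )
    return result


# Hypothetical payloads, for illustration only.
ok = validate_sweep_response({"exit_code": 0, "lock_skipped": False, "repos_processed": 3})

try:
    validate_sweep_response({"exit_code": 0})
    error_text = ""
except RuntimeError as exc:
    error_text = str(exc)
```

The check is purely structural: it does not interpret `exit_code`, so a non-zero exit still passes validation and is left to the caller.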
@@ -20,7 +20,8 @@ from activity_core.rules.models import TaskRef, TaskSpec
 
 logger = logging.getLogger(__name__)
 
-ISSUE_CORE_URL = os.environ.get("ISSUE_CORE_URL", "http://127.0.0.1:8010")
+ISSUE_CORE_URL = os.environ.get("ISSUE_CORE_URL", "http://127.0.0.1:8765")
+ISSUE_CORE_API_KEY_ENV = "ISSUE_CORE_API_KEY"
 ISSUE_SINK_TYPE = os.environ.get("ISSUE_SINK_TYPE", "rest")
 
 
@@ -30,10 +31,30 @@ class IssueSink(ABC):
 
 
 class IssueCoreRestSink(IssueSink):
-    """POSTs to issue-core REST API. Config: ISSUE_CORE_URL env var."""
+    """POSTs to issue-core REST API.
 
-    def __init__(self, base_url: str = ISSUE_CORE_URL) -> None:
+    Config: ISSUE_CORE_URL and ISSUE_CORE_API_KEY env vars (shared key with
+    the issue-core server).
+    """
+
+    def __init__(
+        self,
+        base_url: str = ISSUE_CORE_URL,
+        api_key: str | None = None,
+    ) -> None:
         self._base_url = base_url.rstrip("/")
+        if api_key is not None:
+            self._api_key = api_key.strip()
+        else:
+            self._api_key = os.environ.get(ISSUE_CORE_API_KEY_ENV, "").strip()
+
+    def _auth_headers(self) -> dict[str, str]:
+        if not self._api_key:
+            raise RuntimeError(
+                f"{ISSUE_CORE_API_KEY_ENV} is not set. "
+                "Required when ISSUE_SINK_TYPE=rest."
+            )
+        return {"Authorization": f"Bearer {self._api_key}"}
 
     def emit(self, task_spec: TaskSpec) -> TaskRef:
         payload = {
@@ -45,10 +66,19 @@ class IssueCoreRestSink(IssueSink):
             "due_in_days": task_spec.due_in_days,
             "source_type": task_spec.source_type,
             "source_id": task_spec.source_id,
-            "triggering_event_id": task_spec.triggering_event_id,
+            "triggering_event_id": (
+                str(task_spec.triggering_event_id)
+                if task_spec.triggering_event_id is not None
+                else None
+            ),
             "activity_definition_id": task_spec.activity_definition_id,
         }
-        resp = httpx.post(f"{self._base_url}/issues/", json=payload, timeout=10.0)
+        resp = httpx.post(
+            f"{self._base_url}/issues/",
+            json=payload,
+            headers=self._auth_headers(),
+            timeout=10.0,
+        )
         resp.raise_for_status()
         data = resp.json()
         return TaskRef(
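Reading aid for the `IssueCoreRestSink` auth change: a minimal sketch of the key-resolution order (explicit argument first, then the `ISSUE_CORE_API_KEY` environment variable, blank keys rejected). The `bearer_headers` function and its `environ` parameter are hypothetical, introduced here only so the behaviour can be exercised without touching the real process environment.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

API_KEY_ENV = "ISSUE_CORE_API_KEY"


def bearer_headers(
    api_key: str | None = None,
    environ: Mapping[str, str] | None = None,  # hypothetical test seam
) -> dict[str, str]:
    """Explicit key wins; otherwise read the environment; blank keys are rejected."""
    env = os.environ if environ is None else environ
    key = api_key.strip() if api_key is not None else env.get(API_KEY_ENV, "").strip()
    if not key:
        raise RuntimeError(f"{API_KEY_ENV} is not set. Required when ISSUE_SINK_TYPE=rest.")
    return {"Authorization": f"Bearer {key}"}


explicit = bearer_headers("  s3cret  ")
from_env = bearer_headers(environ={API_KEY_ENV: "env-key"})
try:
    bearer_headers(environ={})
    rejected = False
except RuntimeError:
    rejected = True
```

Note that in the diff the error is raised from `_auth_headers`, that is, on the first `emit` call rather than at construction time, so a sink can still be instantiated without a key.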
@@ -17,6 +17,8 @@ import httpx
 class DisabledLLMClient:
     """LLM client used when no llm-connect endpoint is configured."""
 
+    last_response_metadata: dict[str, Any] | None = None
+
     def complete(
         self,
         prompt: str,
@@ -32,6 +34,7 @@ class LLMConnectClient:
     def __init__(self, base_url: str, timeout_seconds: float = 300.0) -> None:
         self.base_url = base_url.rstrip("/")
         self.timeout_seconds = timeout_seconds
+        self.last_response_metadata: dict[str, Any] | None = None
 
     def complete(
         self,
@@ -54,12 +57,48 @@ class LLMConnectClient:
         )
         resp.raise_for_status()
         data = resp.json()
+        self.last_response_metadata = _extract_response_metadata(data)
         content = data.get("content")
         if not isinstance(content, str):
             raise ValueError("llm-connect response missing string content")
         return content
 
 
+_SAFE_RESPONSE_METADATA_KEYS = {
+    "finish_reason",
+    "usage",
+    "model",
+    "model_name",
+    "provider",
+    "request_id",
+    "response_id",
+    "trace_id",
+    "latency_ms",
+    "duration_ms",
+    "elapsed_ms",
+    "created",
+    "created_at",
+}
+
+
+def _extract_response_metadata(data: dict[str, Any]) -> dict[str, Any]:
+    """Keep non-secret llm-connect diagnostics alongside the returned content."""
+    return {
+        key: value for key, value in data.items()
+        if key in _SAFE_RESPONSE_METADATA_KEYS and _json_safe(value)
+    }
+
+
+def _json_safe(value: Any) -> bool:
+    try:
+        import json
+
+        json.dumps(value)
+    except (TypeError, ValueError):
+        return False
+    return True
+
+
 def get_llm_client() -> DisabledLLMClient | LLMConnectClient:
     base_url = os.environ.get("LLM_CONNECT_URL", "").strip()
     if not base_url:
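Reading aid for `_extract_response_metadata`: a minimal sketch of the allow-list filter, restated with public names and the same key set. The response dict is hypothetical; it is shaped only to show that unlisted keys and non-serialisable values are both dropped.

```python
from __future__ import annotations

import json
from typing import Any

SAFE_RESPONSE_METADATA_KEYS = {
    "finish_reason", "usage", "model", "model_name", "provider",
    "request_id", "response_id", "trace_id", "latency_ms",
    "duration_ms", "elapsed_ms", "created", "created_at",
}


def json_safe(value: Any) -> bool:
    """True when the value can be serialised with json.dumps."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


def extract_response_metadata(data: dict[str, Any]) -> dict[str, Any]:
    """Keep only allow-listed keys whose values are JSON-serialisable."""
    return {
        key: value
        for key, value in data.items()
        if key in SAFE_RESPONSE_METADATA_KEYS and json_safe(value)
    }


# Hypothetical llm-connect response, for illustration only.
response = {
    "content": "hello",
    "model": "demo-model",
    "usage": {"total_tokens": 12},
    "api_key": "should-never-be-kept",
    "trace_id": {1, 2},  # a set is not JSON-serialisable, so it is dropped
}
metadata = extract_response_metadata(response)
```

Only `model` and `usage` survive here. Because the filter looks at top-level keys only, anything nested inside an allow-listed value (such as `usage`) is kept as-is.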
@@ -49,7 +49,18 @@ class CronTriggerConfig(BaseModel):
     )
     timezone: str = Field(default="UTC", description="IANA timezone name.")
     jitter_seconds: int = Field(default=0, ge=0)
-    misfire_policy: Literal["skip", "catchup", "compress"] = Field(default="skip")
+    # Run-miss recovery behaviour (ACTIVITY-WP-0014). What happens when a fire is
+    # missed because the worker / Temporal was unavailable at trigger time:
+    #   skip           - run on trigger or skip; a missed fire is never recovered
+    #   catchup_all    - recover every fire missed during the outage window
+    #   catchup_latest - recover only the most recent missed fire; do not accumulate
+    # Legacy aliases are accepted: catchup → catchup_all, compress → catchup_latest.
+    misfire_policy: Literal[
+        "skip", "catchup_all", "catchup_latest", "catchup", "compress"
+    ] = Field(default="skip")
+    # Override the per-policy default catchup window (how far back Temporal will
+    # recover missed fires after an outage). None uses the policy default.
+    catchup_window_seconds: int | None = Field(default=None, ge=0)
 
 
 class EventTriggerConfig(BaseModel):
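Reading aid for the `misfire_policy` comment: a minimal sketch of how the legacy aliases could be folded into the three canonical values. This diff does not show where the aliases are actually resolved, so `normalize_misfire_policy` and both constants are hypothetical names used only for illustration.

```python
from __future__ import annotations

# Mapping taken from the comment: catchup -> catchup_all, compress -> catchup_latest.
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}
CANONICAL_POLICIES = {"skip", "catchup_all", "catchup_latest"}


def normalize_misfire_policy(value: str) -> str:
    """Return the canonical policy name, accepting the two legacy aliases."""
    canonical = LEGACY_ALIASES.get(value, value)
    if canonical not in CANONICAL_POLICIES:
        raise ValueError(f"unknown misfire_policy: {value!r}")
    return canonical


normalized = [normalize_misfire_policy(p) for p in ("skip", "catchup", "compress")]
```

Normalising once, close to where the config is loaded, would let the scheduling code branch on three values instead of five.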
@@ -2,12 +2,15 @@
 
 from __future__ import annotations
 
+import json
 import os
+from pathlib import Path
 from typing import Any
 
 import httpx
 
 from activity_core.context_resolvers.ops_inventory import _sanitize_url
+from activity_core.state_hub_write import idempotency_headers
 
 _DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
 _INTER_HUB_SINK_TYPES = {
@@ -15,6 +18,10 @@ _INTER_HUB_SINK_TYPES = {
     "inter-hub-event",
     "inter-hub-interaction-event",
 }
+_CORE_HUB_SINK_TYPES = {
+    "core-hub",
+    "core-hub-interaction-event",
+}
 
 
 def persist_ops_inventory_evidence(payload: dict[str, Any]) -> list[dict[str, Any]]:
@@ -55,6 +62,12 @@ def persist_ops_inventory_evidence(payload: dict[str, Any]) -> list[dict[str, An
                 results.append(
                     _post_state_hub_progress(payload, bind_key, probe_result, sink)
                 )
+            elif sink_type in _CORE_HUB_SINK_TYPES:
+                results.append(
+                    _post_core_hub_interaction_event(
+                        payload, bind_key, probe_result, sink
+                    )
+                )
             elif sink_type in _INTER_HUB_SINK_TYPES:
                 results.append(_inter_hub_result(sink))
             else:
@@ -121,6 +134,7 @@ def _post_state_hub_progress(
     resp = httpx.post(
         f"{base_url}/progress/",
         json=body,
+        headers=idempotency_headers(run_id, context_key, event_type),
         timeout=float(sink.get("timeout_seconds", 10.0)),
     )
     resp.raise_for_status()
@@ -136,12 +150,17 @@ def _post_state_hub_progress(
 
 
 def _progress_exists(base_url: str, event_type: str, idempotency_key: str) -> bool:
-    resp = httpx.get(
-        f"{base_url}/progress/",
-        params={"limit": 100},
-        timeout=10.0,
-    )
-    resp.raise_for_status()
+    # Best-effort optimisation only; the Idempotency-Key header on the write is the
+    # real dedup guarantee. Do not hard-fail if State Hub is unreachable here.
+    try:
+        resp = httpx.get(
+            f"{base_url}/progress/",
+            params={"limit": 100},
+            timeout=10.0,
+        )
+        resp.raise_for_status()
+    except httpx.HTTPError:
+        return False
     for item in resp.json():
         detail = item.get("detail") or {}
         if (
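Reading aid for the "Idempotency-Key header is the real dedup guarantee" comment: the implementation of `idempotency_headers` in `activity_core.state_hub_write` is not part of this diff, so the sketch below is only one plausible shape for a deterministic key. The function name, the separator and the use of SHA-256 are all assumptions.

```python
from __future__ import annotations

import hashlib


def sketch_idempotency_headers(run_id: str, context_key: str, event_type: str) -> dict[str, str]:
    """Hypothetical: derive a stable key so a retried write carries the same header."""
    # A unit separator avoids collisions such as ("ab", "c") versus ("a", "bc").
    raw = "\x1f".join([run_id, context_key, event_type])
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return {"Idempotency-Key": digest}


first = sketch_idempotency_headers("run-1", "ops-endpoint", "ops-endpoint-verified")
retry = sketch_idempotency_headers("run-1", "ops-endpoint", "ops-endpoint-verified")
other = sketch_idempotency_headers("run-2", "ops-endpoint", "ops-endpoint-verified")
```

Under this assumption, a retry of the same run produces an identical header, which is what lets the server deduplicate even when the read-side `_progress_exists` check had to give up and return `False`.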
@@ -152,6 +171,213 @@ def _progress_exists(base_url: str, event_type: str, idempotency_key: str) -> bo
     return False
 
 
+def _post_core_hub_interaction_event(
+    payload: dict[str, Any],
+    context_key: str,
+    probe_result: dict[str, Any],
+    sink: dict[str, Any],
+) -> dict[str, Any]:
+    raw_base_url = (
+        sink.get("core_hub_url")
+        or sink.get("base_url")
+        or os.environ.get("CORE_HUB_BASE_URL")
+        or ""
+    )
+    base_url = str(raw_base_url).rstrip("/")
+    runtime_token = _core_hub_runtime_token(sink)
+    widget_id = _core_hub_widget_id(sink, probe_result)
+
+    missing: list[str] = []
+    if not base_url:
+        missing.append("CORE_HUB_BASE_URL")
+    if not runtime_token:
+        missing.append("CORE_HUB_RUNTIME_TOKEN or CORE_HUB_RUNTIME_TOKEN_FILE")
+    if not widget_id:
+        missing.append("widget_id or CORE_HUB_WIDGET_ID")
+    if missing:
+        return {
+            "type": sink.get("type"),
+            "status": "skipped",
+            "reason": "missing_core_hub_config",
+            "missing": missing,
+            "context_key": context_key,
+        }
+
+    endpoint = _selected_endpoint(probe_result, sink)
+    event_type = sink.get("event_type", "ops-endpoint-verified")
+    timeout = float(sink.get("timeout_seconds", 10.0))
+    body = {
+        "widgetId": widget_id,
+        "eventType": event_type,
+        "viewContext": _core_hub_view_context(payload, context_key, endpoint, sink),
+        "metadata": _core_hub_metadata(payload, context_key, probe_result, endpoint),
+    }
+    resp = httpx.post(
+        f"{base_url}/api/v2/interaction-events",
+        json=body,
+        headers=_core_hub_headers(runtime_token),
+        timeout=timeout,
+    )
+    resp.raise_for_status()
+    data = resp.json()
+    event_id = data.get("id")
+    if not event_id:
+        raise RuntimeError("Core Hub interaction event response did not include an id")
+    if not _core_hub_event_exists(base_url, runtime_token, str(event_id), timeout):
+        raise RuntimeError("Core Hub interaction event was not visible after create")
+
+    return {
+        "type": sink.get("type"),
+        "status": "posted",
+        "event_type": data.get("eventType", event_type),
+        "event_id": event_id,
+        "widget_id": data.get("widgetId", widget_id),
+        "verified": True,
+        "context_key": context_key,
+    }
+
+
+def _core_hub_headers(runtime_token: str) -> dict[str, str]:
+    return {
+        "Accept": "application/json",
+        "Authorization": f"Bearer {runtime_token}",
+        "Content-Type": "application/json",
+        "User-Agent": "activity-core-ops-evidence/0.1",
+    }
+
+
+def _core_hub_runtime_token(sink: dict[str, Any]) -> str:
+    token_file = (
+        sink.get("runtime_token_file")
+        or sink.get("token_file")
+        or os.environ.get("CORE_HUB_RUNTIME_TOKEN_FILE")
+    )
+    if token_file:
+        return Path(str(token_file)).read_text(encoding="utf-8").strip()
+    env_name = (
+        sink.get("runtime_token_env")
+        or os.environ.get("CORE_HUB_RUNTIME_TOKEN_ENV")
+        or "CORE_HUB_RUNTIME_TOKEN"
+    )
+    return os.environ.get(str(env_name), "").strip()
+
+
+def _core_hub_widget_id(sink: dict[str, Any], probe_result: dict[str, Any]) -> str:
+    direct = sink.get("widget_id") or os.environ.get("CORE_HUB_WIDGET_ID")
+    if direct:
+        return str(direct)
+
+    endpoint = _selected_endpoint(probe_result, sink)
+    widget_ref = endpoint.get("widget_ref") if endpoint else None
+    if not widget_ref:
+        return ""
+
+    mapping = sink.get("widget_mapping") or sink.get("capability_mapping")
+    if mapping is None:
+        mapping = os.environ.get("CORE_HUB_WIDGET_MAPPING")
+    parsed = _parse_widget_mapping(mapping)
+    return parsed.get(str(widget_ref), "")
+
+
+def _parse_widget_mapping(raw: Any) -> dict[str, str]:
+    if isinstance(raw, dict):
+        return {str(key): str(value) for key, value in raw.items() if value}
+    if not isinstance(raw, str) or not raw.strip():
+        return {}
+    value = raw.strip()
+    if value.startswith("{"):
+        try:
+            loaded = json.loads(value)
+        except json.JSONDecodeError:
+            return {}
+        if isinstance(loaded, dict):
+            return {str(key): str(item) for key, item in loaded.items() if item}
+        return {}
+    if "=" not in value:
+        return {}
+    pairs: dict[str, str] = {}
+    for part in value.split(","):
+        key, _, item = part.partition("=")
+        if key.strip() and item.strip():
+            pairs[key.strip()] = item.strip()
+    return pairs
+
+
+def _selected_endpoint(probe_result: dict[str, Any], sink: dict[str, Any]) -> dict[str, Any]:
+    endpoints = [
+        endpoint
+        for endpoint in probe_result.get("endpoints", [])
+        if isinstance(endpoint, dict)
+    ]
+    endpoint_id = sink.get("endpoint_id")
+    if endpoint_id:
+        match = next(
+            (endpoint for endpoint in endpoints if endpoint.get("endpoint_id") == endpoint_id),
+            None,
+        )
+        if match:
+            return match
+    return next(
+        (endpoint for endpoint in endpoints if endpoint.get("widget_ref")),
+        endpoints[0] if endpoints else {},
+    )
+
+
+def _core_hub_view_context(
+    payload: dict[str, Any],
+    context_key: str,
+    endpoint: dict[str, Any],
+    sink: dict[str, Any],
+) -> str:
+    return str(
+        sink.get("view_context")
+        or endpoint.get("view_context")
+        or f"activity-core/ops-inventory/{payload.get('run_id', 'unknown')}/{context_key}"
+    )
+
+
+def _core_hub_metadata(
+    payload: dict[str, Any],
+    context_key: str,
+    probe_result: dict[str, Any],
+    endpoint: dict[str, Any],
+) -> dict[str, Any]:
+    compact = _compact_probe_result(probe_result)
+    return {
+        "activity_id": payload.get("activity_id"),
+        "activity_core_run_id": payload.get("run_id"),
+        "scheduled_for": payload.get("scheduled_for"),
+        "source_type": "ops-inventory",
+        "context_key": context_key,
+        "probe": {
+            "generated_at": compact.get("generated_at"),
+            "inventory_path": compact.get("inventory_path"),
+            "status": compact.get("status"),
+            "reason": compact.get("reason"),
+            "summary": compact.get("summary", {}),
+        },
+        "endpoint": _compact_endpoint(endpoint) if endpoint else {},
+    }
+
+
+def _core_hub_event_exists(
+    base_url: str,
+    runtime_token: str,
+    event_id: str,
+    timeout: float,
+) -> bool:
+    resp = httpx.get(
+        f"{base_url}/api/v2/interaction-events",
+        headers=_core_hub_headers(runtime_token),
+        timeout=timeout,
+    )
+    resp.raise_for_status()
+    payload = resp.json()
+    data = payload.get("data") if isinstance(payload, dict) else []
+    if not isinstance(data, list):
+        return False
+    return any(isinstance(item, dict) and item.get("id") == event_id for item in data)
+
 def _inter_hub_result(sink: dict[str, Any]) -> dict[str, Any]:
     missing: list[str] = []
     if not (sink.get("inter_hub_url") or os.environ.get("INTER_HUB_URL")):
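Reading aid for `_parse_widget_mapping` in the file above: the helper appears to accept three input shapes (a dict, a JSON object string, or a comma-separated `key=value` string). The sketch restates it under a public name; the widget references and ids in the examples are hypothetical.

```python
from __future__ import annotations

import json
from typing import Any


def parse_widget_mapping(raw: Any) -> dict[str, str]:
    """Accept a dict, a JSON object string, or 'ref=id,ref=id'; anything else yields {}."""
    if isinstance(raw, dict):
        return {str(key): str(value) for key, value in raw.items() if value}
    if not isinstance(raw, str) or not raw.strip():
        return {}
    value = raw.strip()
    if value.startswith("{"):
        try:
            loaded = json.loads(value)
        except json.JSONDecodeError:
            return {}
        if isinstance(loaded, dict):
            return {str(key): str(item) for key, item in loaded.items() if item}
        return {}
    if "=" not in value:
        return {}
    pairs: dict[str, str] = {}
    for part in value.split(","):
        key, _, item = part.partition("=")
        # Parts without '=' or with an empty side are skipped silently.
        if key.strip() and item.strip():
            pairs[key.strip()] = item.strip()
    return pairs


# Hypothetical widget references and ids, for illustration only.
from_dict = parse_widget_mapping({"ops": "w1", "unused": ""})
from_json = parse_widget_mapping('{"ops": "w1"}')
from_pairs = parse_widget_mapping("ops=w1, api = w2,bad")
from_garbage = parse_widget_mapping("{broken")
```

Malformed input never raises: broken JSON and strings without `=` both collapse to an empty mapping, which `_core_hub_widget_id` then reports as a missing widget id.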
@@ -11,6 +11,8 @@ from zoneinfo import ZoneInfo
 
 import httpx
 
+from activity_core.state_hub_write import idempotency_headers
+
 _DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
 _THE_CUSTODIAN_ROOT = Path("/home/worsch/the-custodian")
 _FORBIDDEN_CUSTODIAN_ROOTS = (
@@ -134,6 +136,7 @@ def _post_state_hub_progress(
             "output_validated": report_entry.get("output_validated"),
             "review_required": report_entry.get("review_required"),
             "validation_error": report_entry.get("validation_error"),
+            "llm_response_metadata": report_entry.get("llm_response_metadata"),
             "report": report,
         },
     }
@@ -149,6 +152,7 @@ def _post_state_hub_progress(
     resp = httpx.post(
         f"{base_url}/progress/",
         json=body,
+        headers=idempotency_headers(run_id, instruction_id, event_type),
         timeout=float(sink.get("timeout_seconds", 10.0)),
     )
     resp.raise_for_status()
@@ -167,12 +171,18 @@ def _progress_exists(
     instruction_id: str,
     event_type: str,
 ) -> bool:
-    resp = httpx.get(
-        f"{base_url}/progress/",
-        params={"limit": 100},
-        timeout=10.0,
-    )
-    resp.raise_for_status()
+    # Best-effort read-dedup optimisation only. The Idempotency-Key header on the
+    # write is the real guarantee; if State Hub is unreachable here we must not
+    # hard-fail — proceed to the (keyed) write rather than raising.
+    try:
+        resp = httpx.get(
+            f"{base_url}/progress/",
+            params={"limit": 100},
+            timeout=10.0,
+        )
+        resp.raise_for_status()
+    except httpx.HTTPError:
+        return False
     for item in resp.json():
         detail = item.get("detail") or {}
         if (
@@ -215,6 +225,16 @@ def _render_markdown(
|
|||||||
lines.extend([summary, ""])
|
lines.extend([summary, ""])
|
||||||
if validation_error:
|
if validation_error:
|
||||||
lines.extend(["Validation error:", "", f"`{validation_error}`", ""])
|
lines.extend(["Validation error:", "", f"`{validation_error}`", ""])
|
||||||
|
metadata = report_entry.get("llm_response_metadata")
|
||||||
|
if metadata:
|
||||||
|
lines.extend([
|
||||||
|
"LLM response metadata:",
|
||||||
|
"",
|
||||||
|
"```json",
|
||||||
|
json.dumps(metadata, indent=2, sort_keys=True),
|
||||||
|
"```",
|
||||||
|
"",
|
||||||
|
])
|
||||||
lines.extend([
|
lines.extend([
|
||||||
"```json",
|
"```json",
|
||||||
json.dumps(report, indent=2, sort_keys=True),
|
json.dumps(report, indent=2, sort_keys=True),
|
||||||
|
|||||||
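The hunks above pair a keyed write (`idempotency_headers(...)` on the `POST`) with a read-side existence check that returns `False` on any HTTP error. A minimal sketch of that pattern follows, under stated assumptions: `idempotency_headers_sketch` is a hypothetical stand-in (the real `idempotency_headers` implementation is not shown in this diff), the `detail["run_id"]` match is illustrative because the real comparison is cut off by the hunk boundary, and `fetch` replaces the `httpx.get` call so the example runs offline.

```python
import hashlib
from typing import Callable, Iterable


def idempotency_headers_sketch(*parts: str) -> dict[str, str]:
    """Hypothetical helper: derive a stable key from the identifying parts."""
    # A separator that cannot appear in ordinary ids keeps ("a", "bc") and
    # ("ab", "c") from hashing to the same key.
    digest = hashlib.sha256("\x1f".join(parts).encode()).hexdigest()
    return {"Idempotency-Key": digest}


def progress_exists_best_effort(
    fetch: Callable[[], Iterable[dict]],
    run_id: str,
) -> bool:
    """Read-side dedup that never blocks the write.

    Any failure while fetching is treated as "not found", so the caller
    proceeds to the keyed write, which is the actual dedup guarantee.
    """
    try:
        items = list(fetch())
    except Exception:  # the diff narrows this to httpx.HTTPError
        return False
    # Assumed match rule for illustration only.
    return any((item.get("detail") or {}).get("run_id") == run_id for item in items)
```

The design choice this illustrates is that the read is only an optimisation: a false negative costs one redundant request, which the server can reject or collapse by key, whereas raising here would drop the event entirely.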
@@ -41,6 +41,7 @@ class InstructionResult:
|
|||||||
review_required: bool = False
|
review_required: bool = False
|
||||||
condition_matched: str | None = None
|
condition_matched: str | None = None
|
||||||
validation_error: str | None = None
|
validation_error: str | None = None
|
||||||
|
llm_response_metadata: dict[str, Any] | None = None
|
||||||
|
|
||||||
|
|
||||||
def _resolve_path(obj: Any, path: str) -> Any:
|
def _resolve_path(obj: Any, path: str) -> Any:
|
||||||
@@ -160,15 +161,22 @@ def _execute(
|
|||||||
prompt_hash = hashlib.sha256(rendered.encode()).hexdigest()
|
prompt_hash = hashlib.sha256(rendered.encode()).hexdigest()
|
||||||
llm_config = _llm_run_config(instr)
|
llm_config = _llm_run_config(instr)
|
||||||
|
|
||||||
|
# Reference allow-list (WP-0016-T04): if a context resolver supplied the set
|
||||||
|
# of known candidate ids, recommendations pointing at anything else are
|
||||||
|
# quarantined. Absent (None) today → the check is inert until wired.
|
||||||
|
allow_list = _allow_list_from_context(context)
|
||||||
|
|
||||||
# Step 3 — call LLM
|
# Step 3 — call LLM
|
||||||
raw_output = llm_client.complete(rendered, model=instr.model, config=llm_config)
|
raw_output = llm_client.complete(rendered, model=instr.model, config=llm_config)
|
||||||
|
response_metadata = _llm_response_metadata(llm_client)
|
||||||
|
|
||||||
# Step 4 — validate and optionally retry
|
# Step 4 — validate and optionally retry
|
||||||
task_specs, report, error = _validate_output(raw_output, instr)
|
task_specs, report, error = _validate_output(raw_output, instr, allow_list)
|
||||||
if error:
|
if error:
|
||||||
retry_prompt = rendered + f"\n\nPrevious output was invalid: {error}\nPlease fix."
|
retry_prompt = rendered + f"\n\nPrevious output was invalid: {error}\nPlease fix."
|
||||||
raw_output = llm_client.complete(retry_prompt, model=instr.model, config=llm_config)
|
raw_output = llm_client.complete(retry_prompt, model=instr.model, config=llm_config)
|
||||||
task_specs, report, error = _validate_output(raw_output, instr)
|
response_metadata = _llm_response_metadata(llm_client)
|
||||||
|
task_specs, report, error = _validate_output(raw_output, instr, allow_list)
|
||||||
if error:
|
if error:
|
||||||
# Truncate to keep log volume bounded but long enough to see the
|
# Truncate to keep log volume bounded but long enough to see the
|
||||||
# actual JSON shape mismatch (typical reports are <2KB).
|
# actual JSON shape mismatch (typical reports are <2KB).
|
||||||
@@ -178,7 +186,18 @@ def _execute(
|
|||||||
"error=%s, raw_output_preview=%r",
|
"error=%s, raw_output_preview=%r",
|
||||||
instr.id, prompt_hash, error, preview,
|
instr.id, prompt_hash, error, preview,
|
||||||
)
|
)
|
||||||
failure_report = _invalid_output_report(instr, error, raw_output)
|
# Posture B (WP-0016-T03): try to recover a partial-but-usable
|
||||||
|
# report from individually-parseable items before declaring total
|
||||||
|
# loss. One bad item should cost one item, not the whole report.
|
||||||
|
recovered = _resilient_report(
|
||||||
|
instr, raw_output, error, prompt_hash, allow_list,
|
||||||
|
response_metadata=response_metadata,
|
||||||
|
)
|
||||||
|
if recovered is not None:
|
||||||
|
return recovered
|
||||||
|
failure_report = _invalid_output_report(
|
||||||
|
instr, error, raw_output, response_metadata=response_metadata,
|
||||||
|
)
|
||||||
if failure_report is not None:
|
if failure_report is not None:
|
||||||
return InstructionResult(
|
return InstructionResult(
|
||||||
tasks=[],
|
tasks=[],
|
||||||
@@ -189,6 +208,7 @@ def _execute(
|
|||||||
review_required=True,
|
review_required=True,
|
||||||
condition_matched=instr.condition or None,
|
condition_matched=instr.condition or None,
|
||||||
validation_error=error,
|
validation_error=error,
|
||||||
|
llm_response_metadata=response_metadata,
|
||||||
)
|
)
|
||||||
return _empty_result(instr, prompt_hash=prompt_hash, validation_error=error)
|
return _empty_result(instr, prompt_hash=prompt_hash, validation_error=error)
|
||||||
|
|
||||||
@@ -200,6 +220,7 @@ def _execute(
|
|||||||
output_validated=True,
|
output_validated=True,
|
||||||
review_required=bool(getattr(instr, "review_required", False)),
|
review_required=bool(getattr(instr, "review_required", False)),
|
||||||
condition_matched=instr.condition or None,
|
condition_matched=instr.condition or None,
|
||||||
|
llm_response_metadata=response_metadata,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
@@ -239,6 +260,7 @@ def _invalid_output_report(
|
|||||||
instr: Any,
|
instr: Any,
|
||||||
validation_error: str,
|
validation_error: str,
|
||||||
raw_output: Any,
|
raw_output: Any,
|
||||||
|
response_metadata: dict[str, Any] | None = None,
|
||||||
) -> dict[str, Any] | None:
|
) -> dict[str, Any] | None:
|
||||||
"""Build a durable diagnostic report for invalid report-sink output.
|
"""Build a durable diagnostic report for invalid report-sink output.
|
||||||
|
|
||||||
@@ -256,7 +278,7 @@ def _invalid_output_report(
|
|||||||
partial_output = _parse_json_output(raw_output)
|
partial_output = _parse_json_output(raw_output)
|
||||||
except json.JSONDecodeError:
|
except json.JSONDecodeError:
|
||||||
partial_output = None
|
partial_output = None
|
||||||
raw_preview = raw_output[:4000]
|
raw_preview = raw_output[:_RAW_OUTPUT_PREVIEW_LIMIT]
|
||||||
else:
|
else:
|
||||||
partial_output = raw_output
|
partial_output = raw_output
|
||||||
|
|
||||||
@@ -268,6 +290,8 @@ def _invalid_output_report(
|
|||||||
"status": "validation_failed",
|
"status": "validation_failed",
|
||||||
"validation_error": validation_error,
|
"validation_error": validation_error,
|
||||||
}
|
}
|
||||||
|
if response_metadata:
|
||||||
|
report["llm_response_metadata"] = response_metadata
|
||||||
if isinstance(partial_output, dict):
|
if isinstance(partial_output, dict):
|
||||||
if isinstance(partial_output.get("summary"), str):
|
if isinstance(partial_output.get("summary"), str):
|
||||||
report["partial_summary"] = partial_output["summary"]
|
report["partial_summary"] = partial_output["summary"]
|
||||||
@@ -279,6 +303,358 @@ def _invalid_output_report(
|
|||||||
return report
|
return report
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Resilient report recovery (ACTIVITY-WP-0016-T03)
|
||||||
|
#
|
||||||
|
# Posture B — verify & mitigate at the producer→consumer boundary. When the
|
||||||
|
# whole-document parse/validate fails, recover individually-parseable
|
||||||
|
# recommendation objects, validate each against the item schema, keep the valid
|
||||||
|
# ones, and quarantine the malformed/over-limit ones with provenance. One bad
|
||||||
|
# item costs one item, not the whole report (error locality == unit of work).
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
_QUARANTINE_LIMIT = 20
|
||||||
|
_SNIPPET_LIMIT = 200
|
||||||
|
# Producer guardrails (ACTIVITY-WP-0016-T04): structural bounds applied to every
|
||||||
|
# recommendation regardless of producer (LLM, agent, or human). These are
|
||||||
|
# verify-and-mitigate limits — an offending item is quarantined, never allowed to
|
||||||
|
# fail the whole report or flow unbounded into a downstream consumer.
|
||||||
|
_MAX_STRING_LEN = 4000
|
||||||
|
_MAX_DEPTH = 8
|
||||||
|
_RAW_OUTPUT_PREVIEW_LIMIT = 12000
|
||||||
|
_SUMMARY_RE = re.compile(r'"summary"\s*:\s*"((?:[^"\\]|\\.)*)"')
|
||||||
|
|
||||||
|
|
||||||
|
_SAFE_RESPONSE_METADATA_KEYS = {
|
||||||
|
"finish_reason",
|
||||||
|
"usage",
|
||||||
|
"model",
|
||||||
|
"model_name",
|
||||||
|
"provider",
|
||||||
|
"request_id",
|
||||||
|
"response_id",
|
||||||
|
"trace_id",
|
||||||
|
"latency_ms",
|
||||||
|
"duration_ms",
|
||||||
|
"elapsed_ms",
|
||||||
|
"created",
|
||||||
|
"created_at",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _llm_response_metadata(llm_client: Any) -> dict[str, Any] | None:
|
||||||
|
metadata = getattr(llm_client, "last_response_metadata", None)
|
||||||
|
if not isinstance(metadata, dict) or not metadata:
|
||||||
|
return None
|
||||||
|
safe: dict[str, Any] = {}
|
||||||
|
for key, value in metadata.items():
|
||||||
|
if key not in _SAFE_RESPONSE_METADATA_KEYS:
|
||||||
|
continue
|
||||||
|
try:
|
||||||
|
json.dumps(value)
|
||||||
|
except (TypeError, ValueError):
|
||||||
|
continue
|
||||||
|
safe[str(key)] = value
|
||||||
|
return safe or None
|
||||||
|
|
||||||
|
|
||||||
|
def _snippet(value: Any) -> str:
|
||||||
|
text = value if isinstance(value, str) else json.dumps(value, default=str)
|
||||||
|
return text[:_SNIPPET_LIMIT]
|
||||||
|
|
||||||
|
|
||||||
|
def _json_depth(value: Any, depth: int = 1) -> int:
|
||||||
|
if depth > _MAX_DEPTH:
|
||||||
|
return depth
|
||||||
|
if isinstance(value, dict):
|
||||||
|
return max((_json_depth(v, depth + 1) for v in value.values()), default=depth)
|
||||||
|
if isinstance(value, list):
|
||||||
|
return max((_json_depth(v, depth + 1) for v in value), default=depth)
|
||||||
|
return depth
|
||||||
|
|
||||||
|
|
||||||
|
def _has_oversized_string(value: Any) -> bool:
|
||||||
|
if isinstance(value, str):
|
||||||
|
return len(value) > _MAX_STRING_LEN
|
||||||
|
if isinstance(value, dict):
|
||||||
|
return any(_has_oversized_string(v) for v in value.values())
|
||||||
|
if isinstance(value, list):
|
||||||
|
return any(_has_oversized_string(v) for v in value)
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def _item_structure_error(item: Any) -> str | None:
|
||||||
|
"""Producer-agnostic structural guardrail: depth and string-length caps."""
|
||||||
|
if _json_depth(item) > _MAX_DEPTH:
|
||||||
|
return f"exceeds max nesting depth {_MAX_DEPTH}"
|
||||||
|
if _has_oversized_string(item):
|
||||||
|
return f"contains a string longer than {_MAX_STRING_LEN} chars"
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def _allow_list_from_context(context: dict | None) -> set[str] | None:
|
||||||
|
"""Build the recommendation-candidate allow-list from resolved context.
|
||||||
|
|
||||||
|
Looks for `context["known_candidates"]` (a list/set of valid candidate ids).
|
||||||
|
Returns None when absent so the allow-list check stays inert until a context
|
||||||
|
resolver populates it — the guardrail capability ships now; activation is a
|
||||||
|
one-line resolver change.
|
||||||
|
"""
|
||||||
|
if not isinstance(context, dict):
|
||||||
|
return None
|
||||||
|
known = context.get("known_candidates")
|
||||||
|
if isinstance(known, (list, set, tuple)):
|
||||||
|
return {str(item) for item in known}
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def _report_contract(instr: Any) -> tuple[dict[str, Any] | None, int | None]:
|
||||||
|
"""Extract (item_schema, max_items) for the recommendations list, if any."""
|
||||||
|
try:
|
||||||
|
schema = _load_output_schema(getattr(instr, "output_schema", ""))
|
||||||
|
except (OSError, json.JSONDecodeError, TypeError):
|
||||||
|
return None, None
|
||||||
|
if not isinstance(schema, dict):
|
||||||
|
return None, None
|
||||||
|
recs = (schema.get("properties") or {}).get("recommendations")
|
||||||
|
if not isinstance(recs, dict):
|
||||||
|
return None, None
|
||||||
|
item_schema = recs.get("items") if isinstance(recs.get("items"), dict) else None
|
||||||
|
max_items = recs.get("maxItems") if isinstance(recs.get("maxItems"), int) else None
|
||||||
|
return item_schema, max_items
|
||||||
|
|
||||||
|
|
||||||
|
def _extract_object_spans(raw: str) -> list[tuple[str, bool]]:
|
||||||
|
"""Return (span, complete) for each recommendation object in raw output.
|
||||||
|
|
||||||
|
Scans the `recommendations` array brace-aware and string-aware so it recovers
|
||||||
|
objects whether they are pretty-printed across many lines or emitted one per
|
||||||
|
line (NDJSON). A truncated trailing object is returned with complete=False.
|
||||||
|
"""
|
||||||
|
key = raw.find('"recommendations"')
|
||||||
|
start_region = raw.find("[", key) if key >= 0 else -1
|
||||||
|
if start_region < 0:
|
||||||
|
return []
|
||||||
|
spans: list[tuple[str, bool]] = []
|
||||||
|
i, n = start_region + 1, len(raw)
|
||||||
|
while i < n:
|
||||||
|
ch = raw[i]
|
||||||
|
if ch == "]":
|
||||||
|
break
|
||||||
|
if ch != "{":
|
||||||
|
i += 1
|
||||||
|
continue
|
||||||
|
depth, in_str, esc, j = 0, False, False, i
|
||||||
|
closed = False
|
||||||
|
while j < n:
|
||||||
|
c = raw[j]
|
||||||
|
if in_str:
|
||||||
|
if esc:
|
||||||
|
esc = False
|
||||||
|
elif c == "\\":
|
||||||
|
esc = True
|
||||||
|
elif c == '"':
|
||||||
|
in_str = False
|
||||||
|
elif c == '"':
|
||||||
|
in_str = True
|
||||||
|
elif c == "{":
|
||||||
|
depth += 1
|
||||||
|
elif c == "}":
|
||||||
|
depth -= 1
|
||||||
|
if depth == 0:
|
||||||
|
spans.append((raw[i:j + 1], True))
|
||||||
|
closed = True
|
||||||
|
break
|
||||||
|
j += 1
|
||||||
|
if not closed:
|
||||||
|
spans.append((raw[i:], False)) # truncated tail
|
||||||
|
break
|
||||||
|
i = j + 1
|
||||||
|
return spans
|
||||||
|
|
||||||
|
|
||||||
|
def _try_repair(span: str) -> str:
|
||||||
|
"""Best-effort close of a truncated JSON object: balance quote, braces, brackets."""
|
||||||
|
in_str, esc, depth_c, depth_b = False, False, 0, 0
|
||||||
|
for c in span:
|
||||||
|
if in_str:
|
||||||
|
if esc:
|
||||||
|
esc = False
|
||||||
|
elif c == "\\":
|
||||||
|
esc = True
|
||||||
|
elif c == '"':
|
||||||
|
in_str = False
|
||||||
|
elif c == '"':
|
||||||
|
in_str = True
|
||||||
|
elif c == "{":
|
||||||
|
depth_c += 1
|
||||||
|
elif c == "}":
|
||||||
|
depth_c -= 1
|
||||||
|
elif c == "[":
|
||||||
|
depth_b += 1
|
||||||
|
elif c == "]":
|
||||||
|
depth_b -= 1
|
||||||
|
repaired = span.rstrip().rstrip(",")
|
||||||
|
if in_str:
|
||||||
|
repaired += '"'
|
||||||
|
return repaired + "]" * max(depth_b, 0) + "}" * max(depth_c, 0)
|
||||||
|
|
||||||
|
|
||||||
|
def _recover_recommendations(
|
||||||
|
raw: str,
|
||||||
|
) -> tuple[str | None, list[dict[str, Any]], list[dict[str, Any]]]:
|
||||||
|
"""Recover (summary, items, quarantined) from a failed report payload."""
|
||||||
|
summary_match = _SUMMARY_RE.search(raw)
|
||||||
|
summary = None
|
||||||
|
if summary_match:
|
||||||
|
try:
|
||||||
|
summary = json.loads(f'"{summary_match.group(1)}"')
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
summary = summary_match.group(1)
|
||||||
|
items: list[dict[str, Any]] = []
|
||||||
|
quarantined: list[dict[str, Any]] = []
|
||||||
|
for index, (span, complete) in enumerate(_extract_object_spans(raw)):
|
||||||
|
parsed: Any = None
|
||||||
|
try:
|
||||||
|
parsed = json.loads(span)
|
||||||
|
except json.JSONDecodeError as exc:
|
||||||
|
if not complete:
|
||||||
|
try:
|
||||||
|
parsed = json.loads(_try_repair(span))
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
parsed = None
|
||||||
|
if parsed is None:
|
||||||
|
quarantined.append(
|
||||||
|
{"index": index, "error": str(exc), "raw": _snippet(span),
|
||||||
|
"reason": "truncated" if not complete else "unparseable"}
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
if isinstance(parsed, dict):
|
||||||
|
items.append(parsed)
|
||||||
|
else:
|
||||||
|
quarantined.append(
|
||||||
|
{"index": index, "error": "item is not a JSON object",
|
||||||
|
"raw": _snippet(span)}
|
||||||
|
)
|
||||||
|
return summary, items, quarantined
|
||||||
|
|
||||||
|
|
||||||
|
def _partition_items(
|
||||||
|
items: list[dict[str, Any]],
|
||||||
|
item_schema: dict[str, Any] | None,
|
||||||
|
max_items: int | None,
|
||||||
|
*,
|
||||||
|
run_schema: bool = True,
|
||||||
|
allow_list: set[str] | None = None,
|
||||||
|
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
|
||||||
|
"""Screen items into (valid, quarantined).
|
||||||
|
|
||||||
|
Applied uniformly to recovered items (run_schema=True) and to already
|
||||||
|
schema-valid happy-path items (run_schema=False). Order of checks: structural
|
||||||
|
type → schema → producer guardrails (depth/length) → reference allow-list →
|
||||||
|
count cap. The first failing check quarantines the item with provenance.
|
||||||
|
"""
|
||||||
|
valid: list[dict[str, Any]] = []
|
||||||
|
quarantined: list[dict[str, Any]] = []
|
||||||
|
for index, item in enumerate(items):
|
||||||
|
if not isinstance(item, dict):
|
||||||
|
quarantined.append(
|
||||||
|
{"index": index, "error": "item is not a JSON object",
|
||||||
|
"raw": _snippet(item), "reason": "malformed"}
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
schema_error = (
|
||||||
|
_validate_schema_node(item, item_schema, f"recommendations[{index}]")
|
||||||
|
if (run_schema and item_schema)
|
||||||
|
else None
|
||||||
|
)
|
||||||
|
if schema_error:
|
||||||
|
quarantined.append(
|
||||||
|
{"index": index, "error": schema_error, "raw": _snippet(item),
|
||||||
|
"reason": "schema"}
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
structure_error = _item_structure_error(item)
|
||||||
|
if structure_error:
|
||||||
|
quarantined.append(
|
||||||
|
{"index": index, "error": structure_error, "raw": _snippet(item),
|
||||||
|
"reason": "guardrail"}
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
if allow_list is not None:
|
||||||
|
candidate = item.get("candidate")
|
||||||
|
if not isinstance(candidate, str) or candidate not in allow_list:
|
||||||
|
quarantined.append(
|
||||||
|
{"index": index, "error": f"candidate {candidate!r} not in allow-list",
|
||||||
|
"raw": _snippet(item), "reason": "allow_list"}
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
valid.append(item)
|
||||||
|
if max_items is not None and len(valid) > max_items:
|
||||||
|
for item in valid[max_items:]:
|
||||||
|
quarantined.append(
|
||||||
|
{"index": None, "error": f"exceeds maxItems={max_items}",
|
||||||
|
"raw": _snippet(item), "reason": "over_limit"}
|
||||||
|
)
|
||||||
|
valid = valid[:max_items]
|
||||||
|
return valid, quarantined
|
||||||
|
|
||||||
|
|
||||||
|
def _resilient_report(
|
||||||
|
instr: Any,
|
||||||
|
raw_output: Any,
|
||||||
|
original_error: str,
|
||||||
|
prompt_hash: str | None,
|
||||||
|
allow_list: set[str] | None = None,
|
||||||
|
response_metadata: dict[str, Any] | None = None,
|
||||||
|
) -> InstructionResult | None:
|
||||||
|
"""Recover a partial-but-usable report from output that failed validation.
|
||||||
|
|
||||||
|
Returns None when nothing usable can be recovered, so the caller falls back
|
||||||
|
to the total-loss diagnostic artifact (_invalid_output_report).
|
||||||
|
"""
|
||||||
|
if not getattr(instr, "report_sinks", None) or not isinstance(raw_output, str):
|
||||||
|
return None
|
||||||
|
item_schema, max_items = _report_contract(instr)
|
||||||
|
summary, items, quarantined = _recover_recommendations(raw_output)
|
||||||
|
if not items:
|
||||||
|
return None
|
||||||
|
valid, item_quarantine = _partition_items(
|
||||||
|
items, item_schema, max_items, allow_list=allow_list,
|
||||||
|
)
|
||||||
|
quarantined.extend(item_quarantine)
|
||||||
|
if not valid:
|
||||||
|
return None
|
||||||
|
report: dict[str, Any] = {
|
||||||
|
"summary": summary
|
||||||
|
or f"Partial daily triage: recovered {len(valid)} recommendation(s) "
|
||||||
|
"after the full report failed validation.",
|
||||||
|
"recommendations": valid,
|
||||||
|
"status": "partial",
|
||||||
|
"partial": True,
|
||||||
|
"quarantined_count": len(quarantined),
|
||||||
|
"quarantined_items": quarantined[:_QUARANTINE_LIMIT],
|
||||||
|
"recovery_note": f"original validation error: {original_error}",
|
||||||
|
}
|
||||||
|
if response_metadata:
|
||||||
|
report["llm_response_metadata"] = response_metadata
|
||||||
|
logger.warning(
|
||||||
|
"instruction_output_recovered: instruction=%r, kept=%d, quarantined=%d",
|
||||||
|
getattr(instr, "id", None), len(valid), len(quarantined),
|
||||||
|
)
|
||||||
|
return InstructionResult(
|
||||||
|
tasks=[],
|
||||||
|
report=report,
|
||||||
|
prompt_hash=prompt_hash,
|
||||||
|
model=getattr(instr, "model", None),
|
||||||
|
output_validated=True,
|
||||||
|
review_required=True,
|
||||||
|
condition_matched=getattr(instr, "condition", "") or None,
|
||||||
|
validation_error=None,
|
||||||
|
llm_response_metadata=response_metadata,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
def _execution_failure_report(instr: Any, error: str) -> dict[str, Any] | None:
|
def _execution_failure_report(instr: Any, error: str) -> dict[str, Any] | None:
|
||||||
"""Build a durable diagnostic report when a report instruction cannot run."""
|
"""Build a durable diagnostic report when a report instruction cannot run."""
|
||||||
if not getattr(instr, "report_sinks", None):
|
if not getattr(instr, "report_sinks", None):
|
||||||
@@ -295,6 +671,7 @@ def _execution_failure_report(instr: Any, error: str) -> dict[str, Any] | None:
|
|||||||
def _validate_output(
|
def _validate_output(
|
||||||
raw_output: Any,
|
raw_output: Any,
|
||||||
instr: Any,
|
instr: Any,
|
||||||
|
allow_list: set[str] | None = None,
|
||||||
) -> tuple[list[TaskSpec], dict[str, Any] | None, str | None]:
|
) -> tuple[list[TaskSpec], dict[str, Any] | None, str | None]:
|
||||||
"""Parse raw LLM output into TaskSpecs and optional report payload.
|
"""Parse raw LLM output into TaskSpecs and optional report payload.
|
||||||
|
|
||||||
@@ -349,6 +726,28 @@ def _validate_output(
|
|||||||
source_type="instruction",
|
source_type="instruction",
|
||||||
source_id=instr.id,
|
source_id=instr.id,
|
||||||
))
|
))
|
||||||
|
|
||||||
|
# Happy-path producer guardrails (WP-0016-T04): the whole document already
|
||||||
|
# passed schema validation, so recommendations are schema-valid; still apply
|
||||||
|
# the count cap, structural caps, and reference allow-list, quarantining any
|
||||||
|
# offenders rather than emitting them. Report shape only changes when an item
|
||||||
|
# is actually quarantined.
|
||||||
|
if isinstance(report, dict) and isinstance(report.get("recommendations"), list):
|
||||||
|
item_schema, max_items = _report_contract(instr)
|
||||||
|
kept, quarantined = _partition_items(
|
||||||
|
report["recommendations"], item_schema, max_items,
|
||||||
|
run_schema=False, allow_list=allow_list,
|
||||||
|
)
|
||||||
|
if quarantined:
|
||||||
|
report = {
|
||||||
|
**report,
|
||||||
|
"recommendations": kept,
|
||||||
|
"status": "partial",
|
||||||
|
"partial": True,
|
||||||
|
"quarantined_count": len(quarantined),
|
||||||
|
"quarantined_items": quarantined[:_QUARANTINE_LIMIT],
|
||||||
|
}
|
||||||
|
|
||||||
return specs, report, None
|
return specs, report, None
|
||||||
except (json.JSONDecodeError, AttributeError, KeyError, TypeError) as exc:
|
except (json.JSONDecodeError, AttributeError, KeyError, TypeError) as exc:
|
||||||
return [], None, str(exc)
|
return [], None, str(exc)
|
||||||
|
|||||||
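The recovery path added in this file (`_extract_object_spans` feeding `_recover_recommendations`) can be hard to follow in diff form. Below is a condensed, standalone sketch of the same idea: scan the `recommendations` array brace-aware and string-aware, parse each object on its own, and quarantine what does not parse. The names `split_objects` and `recover_items` are illustrative, not the module's API, and the sketch omits the truncated-tail repair, schema validation, and guardrail steps.

```python
import json
from typing import Any


def split_objects(raw: str) -> list[tuple[str, bool]]:
    """Return (span, complete) for each object in the recommendations array."""
    key = raw.find('"recommendations"')
    start = raw.find("[", key) if key >= 0 else -1
    if start < 0:
        return []
    spans: list[tuple[str, bool]] = []
    i, n = start + 1, len(raw)
    while i < n:
        if raw[i] == "]":
            break
        if raw[i] != "{":
            i += 1
            continue
        depth, in_str, esc, j, closed = 0, False, False, i, False
        while j < n:
            c = raw[j]
            if in_str:
                # Inside a string, braces are data; only track escapes and the close quote.
                if esc:
                    esc = False
                elif c == "\\":
                    esc = True
                elif c == '"':
                    in_str = False
            elif c == '"':
                in_str = True
            elif c == "{":
                depth += 1
            elif c == "}":
                depth -= 1
                if depth == 0:
                    spans.append((raw[i:j + 1], True))
                    closed = True
                    break
            j += 1
        if not closed:
            spans.append((raw[i:], False))  # truncated tail
            break
        i = j + 1
    return spans


def recover_items(raw: str) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Parse each span independently; one bad item costs one item."""
    kept: list[dict[str, Any]] = []
    quarantined: list[dict[str, Any]] = []
    for index, (span, complete) in enumerate(split_objects(raw)):
        try:
            parsed = json.loads(span)
        except json.JSONDecodeError as exc:
            quarantined.append({
                "index": index,
                "error": str(exc),
                "raw": span[:200],
                "reason": "unparseable" if complete else "truncated",
            })
            continue
        if isinstance(parsed, dict):
            kept.append(parsed)
    return kept, quarantined


# A payload cut off mid-object: two items survive, the tail is quarantined.
raw = (
    '{"summary": "s", "recommendations": ['
    '{"candidate": "a"}, '
    '{"candidate": "b", "note": "has } brace"}, '
    '{"candidate": "c", "no'
)
kept, quarantined = recover_items(raw)
```

Here `kept` holds the first two objects (the `}` inside the string literal does not end the second object early), and `quarantined` holds a single entry with `reason` set to `"truncated"`.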
194
src/activity_core/schedule_health.py
Normal file
@@ -0,0 +1,194 @@
|
|||||||
|
"""Missed-fire detection for cron schedules (ACTIVITY-WP-0014, T03).
|
||||||
|
|
||||||
|
Even with a catchup window configured, an operator wants to *know* when a fire
|
||||||
|
was missed — especially under ``misfire_policy: skip`` where missed fires are
|
||||||
|
dropped by design and leave no run and no failure event. This module turns the
|
||||||
|
schedule's own bookkeeping into an explicit verdict and an optional State Hub
|
||||||
|
alert so a miss is never invisible again.
|
||||||
|
|
||||||
|
Temporal already counts fires that were dropped because they fell outside the
|
||||||
|
catchup window in ``ScheduleInfo.num_actions_missed_catchup_window``. We surface
|
||||||
|
that, plus a staleness check on the most recent fire, as a ``ScheduleHealth``
|
||||||
|
verdict. The verdict logic is a pure function so it is testable without a live
|
||||||
|
Temporal server; ``check_schedule_health`` is the thin async reader.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import os
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from datetime import datetime, timedelta, timezone
|
||||||
|
from typing import Any
|
||||||
|
from uuid import UUID
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
|
||||||
|
from activity_core.schedule_manager import schedule_id
|
||||||
|
from activity_core.state_hub_write import idempotency_headers
|
||||||
|
|
||||||
|
_DEFAULT_STATE_HUB_URL = "http://127.0.0.1:8000"
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class ScheduleHealth:
|
||||||
|
"""Verdict for a single schedule's recent firing behaviour."""
|
||||||
|
|
||||||
|
activity_id: str
|
||||||
|
healthy: bool
|
||||||
|
missed_catchup_window: int
|
||||||
|
last_fired_at: datetime | None
|
||||||
|
staleness: timedelta | None
|
||||||
|
reasons: list[str] = field(default_factory=list)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def missed(self) -> bool:
|
||||||
|
return not self.healthy
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate_schedule_health(
|
||||||
|
*,
|
||||||
|
activity_id: str,
|
||||||
|
missed_catchup_window: int,
|
||||||
|
last_fired_at: datetime | None,
|
||||||
|
now: datetime,
|
||||||
|
expected_interval: timedelta | None = None,
|
||||||
|
tolerance: timedelta = timedelta(minutes=10),
|
||||||
|
) -> ScheduleHealth:
|
||||||
|
"""Pure verdict: was a fire missed?
|
||||||
|
|
||||||
|
A schedule is unhealthy if Temporal dropped any fire past the catchup window,
|
||||||
|
or — when ``expected_interval`` is known — if the most recent fire is older
|
||||||
|
than one interval plus ``tolerance`` (i.e. a fire should have happened and
|
||||||
|
did not).
|
||||||
|
"""
|
||||||
|
reasons: list[str] = []
|
||||||
|
|
||||||
|
if missed_catchup_window > 0:
|
||||||
|
reasons.append(
|
||||||
|
f"{missed_catchup_window} fire(s) dropped outside the catchup window"
|
||||||
|
)
|
||||||
|
|
||||||
|
staleness: timedelta | None = None
|
||||||
|
if last_fired_at is not None:
|
||||||
|
staleness = now - last_fired_at
|
||||||
|
if expected_interval is not None and staleness > expected_interval + tolerance:
|
||||||
|
reasons.append(
|
||||||
|
f"last fire was {staleness} ago, exceeding the expected "
|
||||||
|
f"{expected_interval} interval"
|
||||||
|
)
|
||||||
|
elif expected_interval is not None:
|
||||||
|
reasons.append("no recorded fire for a schedule that should have fired")
|
||||||
|
|
||||||
|
return ScheduleHealth(
|
||||||
|
activity_id=activity_id,
|
||||||
|
healthy=not reasons,
|
||||||
|
missed_catchup_window=missed_catchup_window,
|
||||||
|
last_fired_at=last_fired_at,
|
||||||
|
staleness=staleness,
|
||||||
|
reasons=reasons,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _extract_info(desc: Any) -> tuple[int, datetime | None]:
|
||||||
|
"""Pull (missed_catchup_window, last_fired_at) from a ScheduleDescription.
|
||||||
|
|
||||||
|
Accesses are defensive so a Temporal SDK field rename degrades to "unknown"
|
||||||
|
rather than raising inside an operational health check.
|
||||||
|
"""
|
||||||
|
info = getattr(desc, "info", None)
|
||||||
|
missed = int(getattr(info, "num_actions_missed_catchup_window", 0) or 0)
|
||||||
|
|
||||||
|
last_fired: datetime | None = None
|
||||||
|
recent = getattr(info, "recent_actions", None) or []
|
||||||
|
times = [
|
||||||
|
getattr(a, "scheduled_at", None) or getattr(a, "started_at", None)
|
||||||
|
for a in recent
|
||||||
|
]
|
||||||
|
times = [t for t in times if t is not None]
|
||||||
|
if times:
|
||||||
|
last_fired = max(times)
|
||||||
|
return missed, last_fired
|
||||||
|
|
||||||
|
|
||||||
|
async def check_schedule_health(
|
||||||
|
client: Any,
|
||||||
|
activity_id: str | UUID,
|
||||||
|
*,
|
||||||
|
now: datetime | None = None,
|
||||||
|
expected_interval: timedelta | None = None,
|
||||||
|
tolerance: timedelta = timedelta(minutes=10),
|
||||||
|
) -> ScheduleHealth:
|
||||||
|
"""Describe the schedule for ``activity_id`` and evaluate its health."""
|
||||||
|
now = now or datetime.now(tz=timezone.utc)
|
||||||
|
handle = client.get_schedule_handle(schedule_id(activity_id))
|
||||||
|
desc = await handle.describe()
|
||||||
|
missed, last_fired = _extract_info(desc)
|
||||||
|
return evaluate_schedule_health(
|
||||||
|
activity_id=str(activity_id),
|
||||||
|
missed_catchup_window=missed,
|
||||||
|
last_fired_at=last_fired,
|
||||||
|
now=now,
|
||||||
|
expected_interval=expected_interval,
|
||||||
|
tolerance=tolerance,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def post_missed_fire_alert(
|
||||||
|
health: ScheduleHealth,
|
||||||
|
*,
|
||||||
|
state_hub_url: str | None = None,
|
||||||
|
author: str = "activity-core",
|
||||||
|
topic_id: str | None = None,
|
||||||
|
workstream_id: str | None = None,
|
||||||
|
timeout_seconds: float = 10.0,
|
||||||
|
) -> dict[str, Any]:
|
||||||
|
"""Post a ``schedule_miss`` progress event to State Hub for an unhealthy schedule.
|
||||||
|
|
||||||
|
No-op (returns ``status: ok``) when the schedule is healthy, so callers can
|
||||||
|
invoke unconditionally.
|
||||||
|
"""
|
||||||
|
if health.healthy:
|
||||||
|
return {"type": "schedule-miss-alert", "status": "ok"}
|
||||||
|
|
||||||
|
base_url = state_hub_url or os.environ.get("STATE_HUB_URL", _DEFAULT_STATE_HUB_URL)
|
||||||
|
base_url = str(base_url).rstrip("/")
|
||||||
|
|
||||||
|
body: dict[str, Any] = {
|
||||||
|
"event_type": "schedule_miss",
|
||||||
|
"author": author,
|
||||||
|
"summary": (
|
||||||
|
f"Schedule {health.activity_id} missed a fire: "
|
||||||
|
+ "; ".join(health.reasons)
|
||||||
|
),
|
||||||
|
"detail": {
|
||||||
|
"activity_id": health.activity_id,
|
||||||
|
"missed_catchup_window": health.missed_catchup_window,
|
||||||
|
"last_fired_at": (
|
||||||
|
health.last_fired_at.isoformat() if health.last_fired_at else None
|
||||||
|
),
|
||||||
|
"staleness_seconds": (
|
||||||
|
health.staleness.total_seconds() if health.staleness else None
|
||||||
|
),
|
||||||
|
"reasons": health.reasons,
|
||||||
|
},
|
||||||
|
}
|
||||||
|
if topic_id:
|
||||||
|
body["topic_id"] = topic_id
|
||||||
|
if workstream_id:
|
||||||
|
body["workstream_id"] = workstream_id
|
||||||
|
|
||||||
|
# Dedup repeated alerts for the same missed window (same schedule + last fire).
|
||||||
|
last_fired = health.last_fired_at.isoformat() if health.last_fired_at else "none"
|
||||||
|
resp = httpx.post(
|
||||||
|
f"{base_url}/progress/",
|
||||||
|
json=body,
|
||||||
|
headers=idempotency_headers("schedule_miss", health.activity_id, last_fired),
|
||||||
|
timeout=timeout_seconds,
|
||||||
|
)
|
||||||
|
resp.raise_for_status()
|
||||||
|
data = resp.json()
|
||||||
|
return {
|
||||||
|
"type": "schedule-miss-alert",
|
||||||
|
"status": "posted",
|
||||||
|
"progress_id": data.get("id"),
|
||||||
|
}
|
||||||
@@ -17,7 +17,6 @@ from temporalio.client import (
|
|||||||
Schedule,
|
Schedule,
|
||||||
ScheduleActionStartWorkflow,
|
ScheduleActionStartWorkflow,
|
||||||
ScheduleAlreadyRunningError,
|
ScheduleAlreadyRunningError,
|
||||||
ScheduleBackfill,
|
|
||||||
ScheduleCalendarSpec,
|
ScheduleCalendarSpec,
|
||||||
ScheduleHandle,
|
ScheduleHandle,
|
||||||
ScheduleOverlapPolicy,
|
ScheduleOverlapPolicy,
|
||||||
@@ -38,13 +37,49 @@ _ORCHESTRATOR_TASK_QUEUE = "orchestrator-tq"
|
|||||||
# RunActivityWorkflow detects this value and derives run dedup key from workflow_id.
|
# RunActivityWorkflow detects this value and derives run dedup key from workflow_id.
|
||||||
SCHEDULED_TRIGGER_KEY = "scheduled"
|
SCHEDULED_TRIGGER_KEY = "scheduled"
|
||||||
|
|
||||||
# T24: misfire_policy → ScheduleOverlapPolicy
|
# ACTIVITY-WP-0014: misfire_policy → run-miss recovery behaviour.
|
||||||
_MISFIRE_TO_OVERLAP: dict[str, ScheduleOverlapPolicy] = {
|
#
|
||||||
"skip": ScheduleOverlapPolicy.SKIP,
|
# A "missed fire" happens when the worker / Temporal is unavailable at trigger
|
||||||
"catchup": ScheduleOverlapPolicy.BUFFER_ALL,
|
# time. Two Temporal levers together define the behaviour:
|
||||||
"compress": ScheduleOverlapPolicy.BUFFER_ONE,
|
# - catchup_window: how far back the server will recover missed fires once it
|
||||||
|
# is healthy again. The previous code never set this, so a brief outage at
|
||||||
|
# trigger time silently dropped the fire with no recovery and no signal.
|
||||||
|
# - overlap: what to do when a (recovered) fire would start while a prior run
|
||||||
|
# is still executing.
|
||||||
|
#
|
||||||
|
# Legacy values (catchup, compress) are aliased onto the explicit names.
|
||||||
|
_MISFIRE_ALIASES: dict[str, str] = {
|
||||||
|
"catchup": "catchup_all",
|
||||||
|
"compress": "catchup_latest",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# overlap policy + default catchup window (seconds) per normalised policy.
|
||||||
|
_SKIP_WINDOW_SECONDS = 60
|
||||||
|
_CATCHUP_ALL_WINDOW_SECONDS = 365 * 24 * 3600
|
||||||
|
_CATCHUP_LATEST_WINDOW_SECONDS = 24 * 3600
|
||||||
|
|
||||||
|
_MISFIRE_TO_OVERLAP: dict[str, ScheduleOverlapPolicy] = {
|
||||||
|
# Run on trigger or skip — recover nothing past a tiny grace window.
|
||||||
|
"skip": ScheduleOverlapPolicy.SKIP,
|
||||||
|
# Run on trigger or recover every missed fire during the outage window.
|
||||||
|
"catchup_all": ScheduleOverlapPolicy.BUFFER_ALL,
|
||||||
|
# Run on trigger or recover the most recent missed fire only; BUFFER_ONE
|
||||||
|
# buffers at most one start and drops the rest, so a backlog never accumulates.
|
||||||
|
"catchup_latest": ScheduleOverlapPolicy.BUFFER_ONE,
|
||||||
|
}
|
||||||
|
|
||||||
|
_MISFIRE_DEFAULT_WINDOW: dict[str, int] = {
|
||||||
|
"skip": _SKIP_WINDOW_SECONDS,
|
||||||
|
"catchup_all": _CATCHUP_ALL_WINDOW_SECONDS,
|
||||||
|
"catchup_latest": _CATCHUP_LATEST_WINDOW_SECONDS,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _normalize_misfire_policy(misfire_policy: str) -> str:
|
||||||
|
"""Map legacy aliases onto the explicit run-miss policy names."""
|
||||||
|
canonical = _MISFIRE_ALIASES.get(misfire_policy, misfire_policy)
|
||||||
|
return canonical if canonical in _MISFIRE_TO_OVERLAP else "skip"
|
||||||
|
|
||||||
|
|
||||||
def schedule_id(activity_id: str | UUID) -> str:
|
def schedule_id(activity_id: str | UUID) -> str:
|
||||||
"""Return the canonical Temporal Schedule ID for an ActivityDefinition."""
|
"""Return the canonical Temporal Schedule ID for an ActivityDefinition."""
|
||||||
@@ -57,7 +92,15 @@ def smoke_schedule_id(activity_id: str | UUID) -> str:
|
|||||||
|
|
||||||
|
|
||||||
def _overlap_policy(misfire_policy: str) -> ScheduleOverlapPolicy:
|
def _overlap_policy(misfire_policy: str) -> ScheduleOverlapPolicy:
|
||||||
return _MISFIRE_TO_OVERLAP.get(misfire_policy, ScheduleOverlapPolicy.SKIP)
|
return _MISFIRE_TO_OVERLAP[_normalize_misfire_policy(misfire_policy)]
|
||||||
|
|
||||||
|
|
||||||
|
def _catchup_window(cfg: CronTriggerConfig) -> timedelta:
|
||||||
|
"""Resolve the catchup window: explicit override, else the policy default."""
|
||||||
|
if cfg.catchup_window_seconds is not None:
|
||||||
|
return timedelta(seconds=cfg.catchup_window_seconds)
|
||||||
|
policy = _normalize_misfire_policy(cfg.misfire_policy)
|
||||||
|
return timedelta(seconds=_MISFIRE_DEFAULT_WINDOW[policy])
|
||||||
|
|
||||||
|
|
||||||
def _build_schedule(defn: ActivityDefinition) -> Schedule:
|
def _build_schedule(defn: ActivityDefinition) -> Schedule:
|
||||||
@@ -80,7 +123,10 @@ def _build_schedule(defn: ActivityDefinition) -> Schedule:
|
|||||||
jitter=timedelta(seconds=cfg.jitter_seconds) if cfg.jitter_seconds else None,
|
jitter=timedelta(seconds=cfg.jitter_seconds) if cfg.jitter_seconds else None,
|
||||||
)
|
)
|
||||||
|
|
||||||
policy = SchedulePolicy(overlap=_overlap_policy(cfg.misfire_policy))
|
policy = SchedulePolicy(
|
||||||
|
overlap=_overlap_policy(cfg.misfire_policy),
|
||||||
|
catchup_window=_catchup_window(cfg),
|
||||||
|
)
|
||||||
state = ScheduleState(paused=not defn.enabled)
|
state = ScheduleState(paused=not defn.enabled)
|
||||||
|
|
||||||
return Schedule(action=action, spec=spec, policy=policy, state=state)
|
return Schedule(action=action, spec=spec, policy=policy, state=state)
|
||||||
@@ -282,18 +328,10 @@ async def upsert_schedule(client: Client, defn: ActivityDefinition) -> ScheduleH
|
|||||||
else:
|
else:
|
||||||
await handle.pause(note="disabled via upsert_schedule")
|
await handle.pause(note="disabled via upsert_schedule")
|
||||||
|
|
||||||
# T24 catchup: backfill any fires missed in the last hour.
|
# ACTIVITY-WP-0014: missed-fire recovery is now handled natively by the
|
||||||
if isinstance(defn.trigger_config, CronTriggerConfig):
|
# schedule's catchup_window (see _build_schedule), which the server applies
|
||||||
if defn.trigger_config.misfire_policy == "catchup":
|
# continuously after any outage — not only at upsert time. The previous
|
||||||
now = datetime.now(tz=timezone.utc)
|
# ad-hoc 1-hour backfill is therefore no longer needed.
|
||||||
backfill_start = now - timedelta(hours=1)
|
|
||||||
await handle.backfill(
|
|
||||||
ScheduleBackfill(
|
|
||||||
start_at=backfill_start,
|
|
||||||
end_at=now,
|
|
||||||
overlap=ScheduleOverlapPolicy.BUFFER_ALL,
|
|
||||||
)
|
|
||||||
)
|
|
||||||
|
|
||||||
return handle
|
return handle
|
||||||
|
|
||||||
|
|||||||
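Review note: the policy resolution introduced in this file can be sketched in isolation as below. This is a minimal illustration, not the committed code: `OverlapPolicy` and `CronConfig` are hypothetical stand-ins for `ScheduleOverlapPolicy` and `CronTriggerConfig`, and only the alias table and default windows are taken from the diff.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class OverlapPolicy(Enum):
    """Hypothetical stand-in for temporalio's ScheduleOverlapPolicy."""

    SKIP = "skip"
    BUFFER_ONE = "buffer_one"
    BUFFER_ALL = "buffer_all"


# Values copied from the diff: legacy aliases, overlap mapping, default windows.
ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}
OVERLAP = {
    "skip": OverlapPolicy.SKIP,
    "catchup_all": OverlapPolicy.BUFFER_ALL,
    "catchup_latest": OverlapPolicy.BUFFER_ONE,
}
DEFAULT_WINDOW_SECONDS = {
    "skip": 60,
    "catchup_all": 365 * 24 * 3600,
    "catchup_latest": 24 * 3600,
}


@dataclass
class CronConfig:
    """Hypothetical stand-in for CronTriggerConfig (only the fields used here)."""

    misfire_policy: str = "skip"
    catchup_window_seconds: int | None = None


def normalize(policy: str) -> str:
    # Map legacy names first, then fall back to "skip" for anything unknown.
    canonical = ALIASES.get(policy, policy)
    return canonical if canonical in OVERLAP else "skip"


def resolve(cfg: CronConfig) -> tuple[OverlapPolicy, timedelta]:
    # An explicit window override wins; otherwise use the policy default.
    policy = normalize(cfg.misfire_policy)
    if cfg.catchup_window_seconds is not None:
        window = timedelta(seconds=cfg.catchup_window_seconds)
    else:
        window = timedelta(seconds=DEFAULT_WINDOW_SECONDS[policy])
    return OVERLAP[policy], window
```

Under these assumptions an unknown policy string degrades to `skip` with the 60-second grace window, while an explicit `catchup_window_seconds` overrides the default without changing the overlap behaviour.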
34
src/activity_core/state_hub_write.py
Normal file
@@ -0,0 +1,34 @@
|
|||||||
|
"""Idempotency-keyed State Hub writes (ACTIVITY-WP-0014 T05).
|
||||||
|
|
||||||
|
Under the State Hub *beachhead* model, a write may be buffered locally while
|
||||||
|
central State Hub is unreachable and **flushed later, possibly with retries**.
|
||||||
|
To keep that flush safe — no duplicate progress / triage events — every write
|
||||||
|
carries a stable ``Idempotency-Key`` header derived deterministically from the
|
||||||
|
write's identity. The guarantee lives on the write itself and does **not** depend
|
||||||
|
on a live dedup read, so it holds even when the beachhead is serving offline.
|
||||||
|
|
||||||
|
activity-core does not implement the queue/cache (that is state-hub's beachhead);
|
||||||
|
it only emits the key so the beachhead / State Hub can dedup on flush. The header
|
||||||
|
passes untouched through the existing ``actcore-state-hub-bridge`` proxy and is
|
||||||
|
ignored by State Hub versions that do not yet honour it.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
IDEMPOTENCY_HEADER = "Idempotency-Key"
|
||||||
|
|
||||||
|
|
||||||
|
def idempotency_key(*parts: str | None) -> str:
|
||||||
|
"""Build a stable, header-safe idempotency key from identity parts.
|
||||||
|
|
||||||
|
Empty/None parts are kept as empty segments so the key shape is stable across
|
||||||
|
calls. Whitespace and control characters are collapsed to keep the value a
|
||||||
|
valid single-line HTTP header.
|
||||||
|
"""
|
||||||
|
raw = ":".join((p or "") for p in parts)
|
||||||
|
return "".join(ch if 0x20 < ord(ch) < 0x7F else "_" for ch in raw) or "_"
|
||||||
|
|
||||||
|
|
||||||
|
def idempotency_headers(*parts: str | None) -> dict[str, str]:
|
||||||
|
"""Return the header dict to attach to a State Hub write."""
|
||||||
|
return {IDEMPOTENCY_HEADER: idempotency_key(*parts)}
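Review note: a short usage sketch of the key builder above. The function body is copied from this diff so the block runs on its own; the sample identity parts (`act-1`, the timestamp) are made up for illustration.

```python
from __future__ import annotations


def idempotency_key(*parts: str | None) -> str:
    # Same logic as the diff: join with ":", keep empty segments, and replace
    # anything outside printable non-space ASCII with "_".
    raw = ":".join((p or "") for p in parts)
    return "".join(ch if 0x20 < ord(ch) < 0x7F else "_" for ch in raw) or "_"


# The same identity always yields the same key, so a retried flush can be deduplicated.
first = idempotency_key("schedule_miss", "act-1", "2026-06-26T08:00:00+00:00")
retry = idempotency_key("schedule_miss", "act-1", "2026-06-26T08:00:00+00:00")

# A missing part keeps its (empty) segment, so the key shape stays stable.
no_fire = idempotency_key("schedule_miss", "act-1", None)

# Spaces and control characters are replaced, keeping the header single-line.
sanitized = idempotency_key("a b", "c\n")
```

Note that the strict `0x20 <` bound means a plain space is also replaced with `_`, which matches the docstring's statement that whitespace is collapsed.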
|
||||||
@@ -15,6 +15,8 @@ import asyncio
|
|||||||
import logging
|
import logging
|
||||||
import os
|
import os
|
||||||
import uuid
|
import uuid
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from typing import Sequence
|
||||||
|
|
||||||
from sqlalchemy import select
|
from sqlalchemy import select
|
||||||
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
|
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
|
||||||
@@ -30,6 +32,20 @@ TEMPORAL_HOST = os.environ.get("TEMPORAL_HOST", "localhost:7233")
|
|||||||
TEMPORAL_NAMESPACE = os.environ.get("TEMPORAL_NAMESPACE", "default")
|
TEMPORAL_NAMESPACE = os.environ.get("TEMPORAL_NAMESPACE", "default")
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ScheduleSyncResult:
|
||||||
|
upserted: int = 0
|
||||||
|
paused: int = 0
|
||||||
|
deleted_orphans: int = 0
|
||||||
|
|
||||||
|
def to_dict(self) -> dict[str, int]:
|
||||||
|
return {
|
||||||
|
"upserted": self.upserted,
|
||||||
|
"paused": self.paused,
|
||||||
|
"deleted_orphans": self.deleted_orphans,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
def _row_to_domain(row: ActivityDefinitionRow) -> ActivityDefinition:
|
def _row_to_domain(row: ActivityDefinitionRow) -> ActivityDefinition:
|
||||||
"""Convert an ORM row to a domain ActivityDefinition for schedule_manager."""
|
"""Convert an ORM row to a domain ActivityDefinition for schedule_manager."""
|
||||||
return ActivityDefinition.model_validate(
|
return ActivityDefinition.model_validate(
|
||||||
@@ -46,12 +62,82 @@ def _row_to_domain(row: ActivityDefinitionRow) -> ActivityDefinition:
|
|||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
async def sync(client: Client, db_url: str) -> None:
|
def _valid_schedule_activity_id(defn: ActivityDefinition) -> str:
|
||||||
|
if isinstance(defn.trigger_config, ScheduledTriggerConfig):
|
||||||
|
return f"{defn.id}-once"
|
||||||
|
return str(defn.id)
|
||||||
|
|
||||||
|
|
||||||
|
async def _load_schedule_rows(
|
||||||
|
session_factory: async_sessionmaker[AsyncSession],
|
||||||
|
) -> Sequence[ActivityDefinitionRow]:
|
||||||
|
async with session_factory() as session:
|
||||||
|
return (
|
||||||
|
await session.scalars(
|
||||||
|
select(ActivityDefinitionRow).where(
|
||||||
|
ActivityDefinitionRow.trigger_type.in_(["cron", "scheduled"])
|
||||||
|
)
|
||||||
|
)
|
||||||
|
).all()
|
||||||
|
|
||||||
|
|
||||||
|
async def sync_schedule_rows(
|
||||||
|
client: Client,
|
||||||
|
rows: Sequence[ActivityDefinitionRow],
|
||||||
|
) -> ScheduleSyncResult:
|
||||||
|
"""Reconcile Temporal Schedules against already-loaded definition rows."""
|
||||||
|
valid_schedule_activity_ids: set[str] = set()
|
||||||
|
result = ScheduleSyncResult()
|
||||||
|
|
||||||
|
for row in rows:
|
||||||
|
defn = _row_to_domain(row)
|
||||||
|
if not isinstance(
|
||||||
|
defn.trigger_config,
|
||||||
|
(CronTriggerConfig, ScheduledTriggerConfig),
|
||||||
|
):
|
||||||
|
continue
|
||||||
|
|
||||||
|
valid_schedule_activity_ids.add(_valid_schedule_activity_id(defn))
|
||||||
|
|
||||||
|
await upsert_schedule(client, defn)
|
||||||
|
if defn.enabled:
|
||||||
|
result.upserted += 1
|
||||||
|
logger.info("upserted schedule for activity %s (%s)", defn.id, defn.name)
|
||||||
|
else:
|
||||||
|
result.paused += 1
|
||||||
|
logger.info("upserted paused schedule for disabled activity %s", defn.id)
|
||||||
|
|
||||||
|
# Tombstone cleanup: remove Temporal Schedules with no matching DB row.
|
||||||
|
existing_schedules = await list_schedules(client)
|
||||||
|
for entry in existing_schedules:
|
||||||
|
if entry["activity_id"] not in valid_schedule_activity_ids:
|
||||||
|
await delete_schedule(client, entry["activity_id"])
|
||||||
|
result.deleted_orphans += 1
|
||||||
|
logger.info("deleted orphaned schedule %s", entry["schedule_id"])
|
||||||
|
|
||||||
|
logger.info(
|
||||||
|
"sync_schedules complete — upserted=%d paused=%d deleted_orphans=%d",
|
||||||
|
result.upserted,
|
||||||
|
result.paused,
|
||||||
|
result.deleted_orphans,
|
||||||
|
)
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
async def sync_with_session_factory(
|
||||||
|
client: Client,
|
||||||
|
session_factory: async_sessionmaker[AsyncSession],
|
||||||
|
) -> ScheduleSyncResult:
|
||||||
|
"""Reconcile Temporal Schedules using an existing DB session factory."""
|
||||||
|
return await sync_schedule_rows(client, await _load_schedule_rows(session_factory))
|
||||||
|
|
||||||
|
|
||||||
|
async def sync(client: Client, db_url: str) -> ScheduleSyncResult:
|
||||||
"""Reconcile Temporal Schedules against the ActivityDefinition table.
|
"""Reconcile Temporal Schedules against the ActivityDefinition table.
|
||||||
|
|
||||||
Steps:
|
Steps:
|
||||||
1. Load all enabled cron ActivityDefinitions from Postgres.
|
1. Load all cron/scheduled ActivityDefinitions from Postgres.
|
||||||
2. Upsert a Temporal Schedule for each one.
|
2. Upsert a Temporal Schedule for each one, paused when disabled.
|
||||||
3. Delete Temporal Schedules whose activity_id has no matching DB row
|
3. Delete Temporal Schedules whose activity_id has no matching DB row
|
||||||
(tombstone cleanup for deleted or trigger-type-changed definitions).
|
(tombstone cleanup for deleted or trigger-type-changed definitions).
|
||||||
"""
|
"""
|
||||||
@@ -59,55 +145,10 @@ async def sync(client: Client, db_url: str) -> None:
|
|||||||
session_factory = async_sessionmaker(engine, expire_on_commit=False)
|
session_factory = async_sessionmaker(engine, expire_on_commit=False)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
async with session_factory() as session:
|
return await sync_with_session_factory(client, session_factory)
|
||||||
rows = (
|
|
||||||
await session.scalars(
|
|
||||||
select(ActivityDefinitionRow).where(
|
|
||||||
ActivityDefinitionRow.trigger_type.in_(["cron", "scheduled"])
|
|
||||||
)
|
|
||||||
)
|
|
||||||
).all()
|
|
||||||
finally:
|
finally:
|
||||||
await engine.dispose()
|
await engine.dispose()
|
||||||
|
|
||||||
db_activity_ids: set[str] = set()
|
|
||||||
upserted = 0
|
|
||||||
skipped = 0
|
|
||||||
|
|
||||||
for row in rows:
|
|
||||||
defn = _row_to_domain(row)
|
|
||||||
if not isinstance(defn.trigger_config, (CronTriggerConfig, ScheduledTriggerConfig)):
|
|
||||||
continue
|
|
||||||
|
|
||||||
db_activity_ids.add(str(defn.id))
|
|
||||||
|
|
||||||
if defn.enabled:
|
|
||||||
await upsert_schedule(client, defn)
|
|
||||||
upserted += 1
|
|
||||||
logger.info("upserted schedule for activity %s (%s)", defn.id, defn.name)
|
|
||||||
else:
|
|
||||||
# Disabled definitions: schedule may exist (paused) — leave it;
|
|
||||||
# upsert_schedule already handles the paused state.
|
|
||||||
await upsert_schedule(client, defn)
|
|
||||||
skipped += 1
|
|
||||||
logger.info("upserted paused schedule for disabled activity %s", defn.id)
|
|
||||||
|
|
||||||
# Tombstone cleanup: remove Temporal Schedules with no matching DB row.
|
|
||||||
existing_schedules = await list_schedules(client)
|
|
||||||
deleted = 0
|
|
||||||
for entry in existing_schedules:
|
|
||||||
if entry["activity_id"] not in db_activity_ids:
|
|
||||||
await delete_schedule(client, entry["activity_id"])
|
|
||||||
deleted += 1
|
|
||||||
logger.info("deleted orphaned schedule %s", entry["schedule_id"])
|
|
||||||
|
|
||||||
logger.info(
|
|
||||||
"sync_schedules complete — upserted=%d skipped_disabled=%d deleted_orphans=%d",
|
|
||||||
upserted,
|
|
||||||
skipped,
|
|
||||||
deleted,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
async def main() -> None:
|
async def main() -> None:
|
||||||
logging.basicConfig(level=logging.INFO)
|
logging.basicConfig(level=logging.INFO)
|
||||||
@@ -116,7 +157,13 @@ async def main() -> None:
|
|||||||
raise RuntimeError("ACTCORE_DB_URL is required")
|
raise RuntimeError("ACTCORE_DB_URL is required")
|
||||||
|
|
||||||
client = await Client.connect(TEMPORAL_HOST, namespace=TEMPORAL_NAMESPACE)
|
client = await Client.connect(TEMPORAL_HOST, namespace=TEMPORAL_NAMESPACE)
|
||||||
await sync(client, db_url)
|
result = await sync(client, db_url)
|
||||||
|
print(
|
||||||
|
"Synced schedules: "
|
||||||
|
f"upserted={result.upserted} "
|
||||||
|
f"paused={result.paused} "
|
||||||
|
f"deleted_orphans={result.deleted_orphans}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
|
|||||||
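Review note: the reconciliation rule in `sync_schedule_rows` reduces to a set difference, sketched below without Temporal or a database. `Defn` and `plan_sync` are hypothetical names used only for this illustration; the `-once` suffix for scheduled triggers and the three counters follow the diff.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Defn:
    """Hypothetical, simplified view of an ActivityDefinition row."""

    id: str
    trigger_type: str  # "cron" or "scheduled"
    enabled: bool = True


def expected_schedule_activity_id(defn: Defn) -> str:
    # One-shot ("scheduled") definitions carry a "-once" suffix, as in the diff.
    return f"{defn.id}-once" if defn.trigger_type == "scheduled" else defn.id


def plan_sync(defns: list[Defn], existing_ids: list[str]) -> tuple[dict[str, int], list[str]]:
    """Return the counters plus the orphaned ids that would be deleted."""
    valid = {expected_schedule_activity_id(d) for d in defns}
    orphans = sorted(set(existing_ids) - valid)
    counts = {
        "upserted": sum(1 for d in defns if d.enabled),
        "paused": sum(1 for d in defns if not d.enabled),
        "deleted_orphans": len(orphans),
    }
    return counts, orphans


counts, orphans = plan_sync(
    [Defn("a", "cron"), Defn("b", "scheduled", enabled=False)],
    ["a", "b", "b-once", "c"],
)
```

In this example the plain id `b` counts as an orphan because the definition is now a one-shot trigger whose expected id is `b-once`, which is the trigger-type-changed case the docstring mentions.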
97
src/activity_core/sync_service.py
Normal file
@@ -0,0 +1,97 @@
|
|||||||
|
"""Shared ActivityDefinition/event type/schedule sync orchestration."""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
from temporalio.client import Client
|
||||||
|
|
||||||
|
from activity_core.event_type_registry import sync_event_types
|
||||||
|
from activity_core.sync_activity_definitions import sync as sync_activity_definitions
|
||||||
|
from activity_core.sync_schedules import ScheduleSyncResult, sync_with_session_factory
|
||||||
|
|
||||||
|
_MAX_ERRORS = 20
|
||||||
|
_MAX_ERROR_MESSAGE_LENGTH = 1000
|
||||||
|
|
||||||
|
|
||||||
|
def _empty_result(
|
||||||
|
*,
|
||||||
|
definitions: bool,
|
||||||
|
schedules: bool,
|
||||||
|
event_types: bool,
|
||||||
|
) -> dict[str, Any]:
|
||||||
|
return {
|
||||||
|
"ok": True,
|
||||||
|
"ran": {
|
||||||
|
"definitions": definitions,
|
||||||
|
"schedules": schedules,
|
||||||
|
"event_types": event_types,
|
||||||
|
},
|
||||||
|
"definitions": {"synced": 0},
|
||||||
|
"event_types": {"synced": 0},
|
||||||
|
"schedules": ScheduleSyncResult().to_dict(),
|
||||||
|
"errors": [],
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _record_error(result: dict[str, Any], stage: str, exc: Exception) -> None:
|
||||||
|
errors = result["errors"]
|
||||||
|
if len(errors) >= _MAX_ERRORS:
|
||||||
|
return
|
||||||
|
errors.append(
|
||||||
|
{
|
||||||
|
"stage": stage,
|
||||||
|
"type": type(exc).__name__,
|
||||||
|
"message": str(exc)[:_MAX_ERROR_MESSAGE_LENGTH],
|
||||||
|
}
|
||||||
|
)
|
||||||
|
result["ok"] = False
|
||||||
|
|
||||||
|
|
||||||
|
async def run_sync(
|
||||||
|
*,
|
||||||
|
session_factory: Any,
|
||||||
|
temporal_client: Client | None,
|
||||||
|
definitions: bool = True,
|
||||||
|
schedules: bool = True,
|
||||||
|
event_types: bool = False,
|
||||||
|
) -> dict[str, Any]:
|
||||||
|
"""Run the requested sync stages and return bounded operator-facing status.
|
||||||
|
|
||||||
|
The orchestration deliberately accepts its database and Temporal
|
||||||
|
dependencies as arguments so startup and the API can share the same behavior
|
||||||
|
without creating another global runtime.
|
||||||
|
"""
|
||||||
|
result = _empty_result(
|
||||||
|
definitions=definitions,
|
||||||
|
schedules=schedules,
|
||||||
|
event_types=event_types,
|
||||||
|
)
|
||||||
|
|
||||||
|
if definitions:
|
||||||
|
try:
|
||||||
|
result["definitions"]["synced"] = await sync_activity_definitions(
|
||||||
|
session_factory
|
||||||
|
)
|
||||||
|
except Exception as exc: # pragma: no cover - exercised through tests
|
||||||
|
_record_error(result, "definitions", exc)
|
||||||
|
|
||||||
|
if event_types:
|
||||||
|
try:
|
||||||
|
result["event_types"]["synced"] = await sync_event_types(session_factory)
|
||||||
|
except Exception as exc: # pragma: no cover - exercised through tests
|
||||||
|
_record_error(result, "event_types", exc)
|
||||||
|
|
||||||
|
if schedules:
|
||||||
|
try:
|
||||||
|
if temporal_client is None:
|
||||||
|
raise RuntimeError("Temporal client is required for schedule sync")
|
||||||
|
schedule_result = await sync_with_session_factory(
|
||||||
|
temporal_client,
|
||||||
|
session_factory,
|
||||||
|
)
|
||||||
|
result["schedules"] = schedule_result.to_dict()
|
||||||
|
except Exception as exc: # pragma: no cover - exercised through tests
|
||||||
|
_record_error(result, "schedules", exc)
|
||||||
|
|
||||||
|
return result
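Review note: the bounded error reporting used by `run_sync` can be exercised on its own. The sketch below mirrors `_record_error` from this diff with the same caps (20 entries, 1000 characters per message); the failing stage name and exception are invented for the example.

```python
from __future__ import annotations

from typing import Any

MAX_ERRORS = 20
MAX_ERROR_MESSAGE_LENGTH = 1000


def record_error(result: dict[str, Any], stage: str, exc: Exception) -> None:
    errors = result["errors"]
    if len(errors) >= MAX_ERRORS:
        # Cap reached: drop further errors so the status payload stays bounded.
        return
    errors.append(
        {
            "stage": stage,
            "type": type(exc).__name__,
            "message": str(exc)[:MAX_ERROR_MESSAGE_LENGTH],
        }
    )
    result["ok"] = False


status: dict[str, Any] = {"ok": True, "errors": []}
for _ in range(25):
    record_error(status, "schedules", RuntimeError("x" * 5000))
```

After the loop the payload holds exactly 20 entries, each message is cut to 1000 characters, and `ok` is `False`, so an operator-facing response cannot grow without limit even if every stage fails repeatedly.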
|
||||||
@@ -46,8 +46,7 @@ from activity_core.activities import (
|
|||||||
)
|
)
|
||||||
from activity_core.db import make_engine
|
from activity_core.db import make_engine
|
||||||
from sqlalchemy.ext.asyncio import async_sessionmaker
|
from sqlalchemy.ext.asyncio import async_sessionmaker
|
||||||
from activity_core.sync_activity_definitions import sync as sync_activity_defs
|
from activity_core.sync_service import run_sync
|
||||||
from activity_core.sync_schedules import sync as sync_schedules
|
|
||||||
from activity_core.workflows import RunActivityWorkflow, TaskExecutorWorkflow
|
from activity_core.workflows import RunActivityWorkflow, TaskExecutorWorkflow
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
@@ -77,20 +76,26 @@ async def run() -> None:
|
|||||||
TEMPORAL_HOST, namespace=TEMPORAL_NAMESPACE, runtime=runtime
|
TEMPORAL_HOST, namespace=TEMPORAL_NAMESPACE, runtime=runtime
|
||||||
)
|
)
|
||||||
|
|
||||||
# T45: Sync ActivityDefinition files into DB before schedule sync.
|
logger.info("Syncing ActivityDefinitions and Temporal Schedules...")
|
||||||
logger.info("Syncing ActivityDefinition files...")
|
sync_engine = make_engine(db_url)
|
||||||
|
session_factory = async_sessionmaker(sync_engine, expire_on_commit=False)
|
||||||
try:
|
try:
|
||||||
session_factory = async_sessionmaker(make_engine(db_url), expire_on_commit=False)
|
sync_result = await run_sync(
|
||||||
await sync_activity_defs(session_factory)
|
session_factory=session_factory,
|
||||||
except Exception:
|
temporal_client=client,
|
||||||
logger.exception("activity definition sync failed — continuing worker startup")
|
definitions=True,
|
||||||
|
schedules=True,
|
||||||
# T23: Sync Temporal Schedules with the DB before workers start accepting tasks.
|
event_types=False,
|
||||||
logger.info("Syncing Temporal Schedules with ActivityDefinition DB...")
|
)
|
||||||
try:
|
for error in sync_result["errors"]:
|
||||||
await sync_schedules(client, db_url)
|
logger.error(
|
||||||
except Exception:
|
"startup sync %s failed — %s: %s",
|
||||||
logger.exception("schedule sync failed — continuing worker startup")
|
error["stage"],
|
||||||
|
error["type"],
|
||||||
|
error["message"],
|
||||||
|
)
|
||||||
|
finally:
|
||||||
|
await sync_engine.dispose()
|
||||||
|
|
||||||
orchestrator_worker = Worker(
|
orchestrator_worker = Worker(
|
||||||
client,
|
client,
|
||||||
|
|||||||
5
tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json
vendored
Normal file
@@ -0,0 +1,5 @@
|
|||||||
|
{
|
||||||
|
"_note": "PARTIAL 4000-char preview of the 2026-06-26 daily-triage validation failure (retry attempt). Full payload not recoverable from activity-core: complete() drops finish_reason; report sink caps raw at 4000 chars; the JSON break is at char 5268 (beyond this preview). Full response would require llm-connect producer-side logs on railiance01.",
|
||||||
|
"validation_error": "Expecting ',' delimiter: line 136 column 22 (char 5268)",
|
||||||
|
"raw_output_preview": "{\n \"summary\": \"Triage report focusing on high-priority workstreams with pending human intervention or critical dependencies, and addressing recently cleared dependencies to unblock progress.\",\n \"recommendations\": [\n {\n \"rank\": 1,\n \"candidate\": \"2731fece-6c49-45b8-ab8a-4ea6c04ac603\",\n \"action\": \"work-next\",\n \"why\": \"A critical dependency (T03 - Configure bounded OpenBao token roles and policies) for this workstream has been cleared, unblocking significant progress on credential management. This workstream has 8 todo tasks and no waits, indicating it's ready for immediate action.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 5.0,\n \"strategic_value\": 5,\n \"time_criticality\": 5,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 5,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 2,\n \"candidate\": \"bd086c41-287d-4a4e-8ac5-9ab270f14d72\",\n \"action\": \"needs-human\",\n \"why\": \"This high-priority workstream has a 'needs_human' task (T04 - Provision the runtime API key outside Git) and is currently blocked by 3 'wait' tasks. Human intervention is required to unblock progress.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.7,\n \"strategic_value\": 5,\n \"time_criticality\": 4,\n \"risk_reduction\": 5,\n \"opportunity_enablement\": 4,\n \"job_size\": 3\n }\n },\n {\n \"rank\": 3,\n \"candidate\": \"9b56414a-c71f-4e72-9b2b-d2166aaf50d0\",\n \"action\": \"needs-human\",\n \"why\": \"This high-priority workstream has a 'needs_human' task (Task: Execute Live Ops-Hub Bootstrap) and is currently blocked by a 'wait' task. 
Human intervention is required to proceed with the bootstrap.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.7,\n \"strategic_value\": 5,\n \"time_criticality\": 4,\n \"risk_reduction\": 5,\n \"opportunity_enablement\": 4,\n \"job_size\": 3\n }\n },\n {\n \"rank\": 4,\n \"candidate\": \"84e17675-0d15-4268-a8bd-540124d37018\",\n \"action\": \"needs-human\",\n \"why\": \"This workstream has 4 'needs_human' tasks, including 'T02 \u2014 Resolve Forgejo production design decisions', indicating significant human input is required to move forward with the migration.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 4.0,\n \"strategic_value\": 4,\n \"time_criticality\": 4,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 5,\n \"candidate\": \"5646e13a-13af-4724-bca6-3c0d86f96733\",\n \"action\": \"needs-human\",\n \"why\": \"This workstream has a 'needs_human' task ('Three-Run Calibration Feedback') and is currently in a 'wait' state. Human feedback is crucial for operational hardening.\",\n \"confidence\": \"medium\",\n \"wsjf\": {\n \"score\": 3.7,\n \"strategic_value\": 4,\n \"time_criticality\": 3,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 6,\n \"candidate\": \"896ace77-21b3-450b-8fb7-254aefc8c570\",\n \"action\": \"close-out\",\n \"why\": \"The task 'Wire activity-core to the live service' has been resolved, and the workstream shows 2 progress tasks with 0 todo/wait tasks. 
This indicates the deployment is likely complete or nearing completion and ready for close-out after verification.\",\n \"confidence\": \"high\",\n \"wsjf\": {\n \"score\": 3.7,\n \"strategic_value\": 4,\n \"time_criticality\": 3,\n \"risk_reduction\": 4,\n \"opportunity_enablement\": 4,\n \"job_size\": 4\n }\n },\n {\n \"rank\": 7,\n \"candidate\": \"656e435d-3a00-4f5e-a38e-114467f9062e\",\n \"action\": \"work-next\",\n \"why\": \"This high-priority workstream has a single 'wait' task ('Task: Activate Ops-Hub Widgets In Inter-Hub') and no 'needs_human' tasks. It appears ready for the next step to activate the widgets.\",\n \"confidence\": \"medium\",\n \"wsjf"
|
||||||
|
}
|
||||||
@@ -88,6 +88,43 @@ def test_for_each_binds_each_list_item_before_condition_and_action_rendering() -
|
|||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def test_for_each_can_gate_registry_hygiene_gaps_on_signal() -> None:
|
||||||
|
rules = [
|
||||||
|
{
|
||||||
|
"id": "flag-registry-hygiene-gap",
|
||||||
|
"for_each": "context.gaps",
|
||||||
|
"bind_as": "g",
|
||||||
|
"condition": 'context.g.hygiene_signal != ""',
|
||||||
|
"action": {
|
||||||
|
"task_template": "Close registry hygiene gap for {context.g.repo}",
|
||||||
|
"target_repo": "context.g.repo",
|
||||||
|
"priority": "medium",
|
||||||
|
"labels": ["registry-hygiene", "{context.g.hygiene_signal}"],
|
||||||
|
},
|
||||||
|
}
|
||||||
|
]
|
||||||
|
context = {
|
||||||
|
"gaps": [
|
||||||
|
{
|
||||||
|
"repo": "reuse-surface",
|
||||||
|
"hygiene_signal": "empty_capability_scaffold",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"repo": "activity-core",
|
||||||
|
"hygiene_signal": "",
|
||||||
|
},
|
||||||
|
]
|
||||||
|
}
|
||||||
|
|
||||||
|
specs = expand_rule_actions(rules, _Event(), context)
|
||||||
|
|
||||||
|
assert [spec["target_repo"] for spec in specs] == ["reuse-surface"]
|
||||||
|
assert specs[0]["labels"] == [
|
||||||
|
"registry-hygiene",
|
||||||
|
"empty_capability_scaffold",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
def test_for_each_rejects_non_path_expression() -> None:
|
def test_for_each_rejects_non_path_expression() -> None:
|
||||||
rules = [
|
rules = [
|
||||||
{
|
{
|
||||||
|
|||||||
@@ -12,6 +12,7 @@ Covers:
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
import json
|
import json
|
||||||
|
from pathlib import Path
|
||||||
from types import SimpleNamespace
|
from types import SimpleNamespace
|
||||||
from typing import Any
|
from typing import Any
|
||||||
|
|
||||||
@@ -333,7 +334,14 @@ def test_execute_instruction_forwards_output_schema_to_llm_connect(tmp_path, mon
|
|||||||
def test_execute_instruction_with_audit_accepts_report_payload():
|
def test_execute_instruction_with_audit_accepts_report_payload():
|
||||||
report_data = {
|
report_data = {
|
||||||
"summary": "State Hub has loose ends.",
|
"summary": "State Hub has loose ends.",
|
||||||
"recommendations": [{"action": "revisit", "candidate": "CUST-WP-0045"}],
|
"recommendations": [
|
||||||
|
{
|
||||||
|
"rank": 1,
|
||||||
|
"action": "revisit",
|
||||||
|
"candidate": "CUST-WP-0045",
|
||||||
|
"why": "Loose ends need attention.",
|
||||||
|
}
|
||||||
|
],
|
||||||
}
|
}
|
||||||
llm = _CountingLLM([json.dumps(report_data)])
|
llm = _CountingLLM([json.dumps(report_data)])
|
||||||
instr = _instr(
|
instr = _instr(
|
||||||
@@ -353,7 +361,14 @@ def test_execute_instruction_with_audit_accepts_report_payload():
|
|||||||
def test_execute_instruction_with_audit_accepts_fenced_report_payload():
|
def test_execute_instruction_with_audit_accepts_fenced_report_payload():
|
||||||
report_data = {
|
report_data = {
|
||||||
"summary": "State Hub has loose ends.",
|
"summary": "State Hub has loose ends.",
|
||||||
"recommendations": [{"action": "revisit", "candidate": "CUST-WP-0045"}],
|
"recommendations": [
|
||||||
|
{
|
||||||
|
"rank": 1,
|
||||||
|
"action": "revisit",
|
||||||
|
"candidate": "CUST-WP-0045",
|
||||||
|
"why": "Loose ends need attention.",
|
||||||
|
}
|
||||||
|
],
|
||||||
}
|
}
|
||||||
llm = _CountingLLM([f"```json\n{json.dumps(report_data)}\n```"])
|
llm = _CountingLLM([f"```json\n{json.dumps(report_data)}\n```"])
|
||||||
instr = _instr(
|
instr = _instr(
|
||||||
@@ -389,6 +404,216 @@ def test_execute_instruction_with_audit_rejects_invalid_report_schema():
     assert llm.call_count == 2
 
 
+# ── WP-0016-T03 resilient report recovery ─────────────────────────────────────
+
+def _valid_rec(rank: int) -> dict[str, Any]:
+    return {
+        "rank": rank,
+        "candidate": f"WS-{rank}",
+        "action": "work-next",
+        "why": f"reason {rank}",
+        "wsjf": {"score": 5.0},
+    }
+
+
+def _pretty_triage_with_truncated_tail(num_valid: int) -> str:
+    body = ",\n".join(" " + json.dumps(_valid_rec(i)) for i in range(1, num_valid + 1))
+    # Trailing object is cut off mid-string — the whole document is invalid JSON,
+    # reproducing the 2026-06-26 failure shape (valid prefix, broken tail).
+    return (
+        '{\n "summary": "Daily triage.",\n "recommendations": [\n'
+        + body
+        + ',\n {\n "rank": '
+        + str(num_valid + 1)
+        + ',\n "candidate": "WS-X",\n "action": "work-'
+    )
+
+
+def test_resilient_report_recovers_valid_prefix_and_quarantines_truncated_tail():
+    raw = _pretty_triage_with_truncated_tail(7)
+    llm = _CountingLLM([raw, raw])
+    instr = _instr(
+        id="daily-triage-report",
+        prompt="Report.",
+        trusted_fields=[],
+        output_schema="schemas/daily-triage-report.json",
+        report_sinks=[{"type": "working-memory"}],
+    )
+
+    result = execute_instruction_with_audit(instr, _Event(), {}, llm)
+
+    assert result.output_validated is True
+    assert result.review_required is True
+    assert result.report is not None
+    assert result.report["partial"] is True
+    assert len(result.report["recommendations"]) == 7
+    assert result.report["summary"] == "Daily triage."
+    assert result.report["quarantined_count"] >= 1
+    # The broken tail is dropped — either as an unparseable/truncated span or,
+    # if _try_repair salvages its structure, as a schema-invalid item. Either way
+    # it carries a diagnostic error and never pollutes the surviving report.
+    assert result.report["quarantined_items"][0]["error"]
+
+
+def test_resilient_report_quarantines_one_bad_item_among_valid():
+    recs = [_valid_rec(1), {"candidate": "WS-2", "action": "x", "why": "no rank"}, _valid_rec(3)]
+    raw = json.dumps({"summary": "Triage.", "recommendations": recs})
+    llm = _CountingLLM([raw, raw])
+    instr = _instr(
+        id="daily-triage-report",
+        prompt="Report.",
+        trusted_fields=[],
+        output_schema="schemas/daily-triage-report.json",
+        report_sinks=[{"type": "working-memory"}],
+    )
+
+    result = execute_instruction_with_audit(instr, _Event(), {}, llm)
+
+    assert result.output_validated is True
+    assert result.report["partial"] is True
+    assert len(result.report["recommendations"]) == 2
+    assert result.report["quarantined_count"] == 1
+    assert "rank" in result.report["quarantined_items"][0]["error"]
+
+
+# ── WP-0016-T04 producer guardrails ───────────────────────────────────────────
+
+def _triage_instr() -> SimpleNamespace:
+    return _instr(
+        id="daily-triage-report",
+        prompt="Report.",
+        trusted_fields=[],
+        output_schema="schemas/daily-triage-report.json",
+        report_sinks=[{"type": "working-memory"}],
+    )
+
+
+def test_guardrail_count_cap_on_valid_happy_path():
+    # 9 fully-valid recommendations in a syntactically valid document: schema
+    # validation passes, but the maxItems=7 count cap must keep 7 and quarantine 2.
+    recs = [_valid_rec(i) for i in range(1, 10)]
+    raw = json.dumps({"summary": "Triage.", "recommendations": recs})
+    llm = _CountingLLM([raw])
+
+    result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
+
+    assert llm.call_count == 1  # no retry — the document was valid
+    assert result.report["partial"] is True
+    assert len(result.report["recommendations"]) == 7
+    assert result.report["quarantined_count"] == 2
+    assert all(q["reason"] == "over_limit" for q in result.report["quarantined_items"])
+
+
+def test_guardrail_oversized_string_quarantined():
+    big = _valid_rec(2)
+    big["why"] = "x" * 5000  # exceeds _MAX_STRING_LEN
+    raw = json.dumps({"summary": "Triage.", "recommendations": [_valid_rec(1), big]})
+    llm = _CountingLLM([raw])
+
+    result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
+
+    assert len(result.report["recommendations"]) == 1
+    assert result.report["quarantined_count"] == 1
+    assert result.report["quarantined_items"][0]["reason"] == "guardrail"
+
+
+def test_guardrail_allow_list_rejects_unknown_candidate():
+    raw = json.dumps({
+        "summary": "Triage.",
+        "recommendations": [_valid_rec(1), _valid_rec(2)],  # candidates WS-1, WS-2
+    })
+    llm = _CountingLLM([raw])
+    context = {"known_candidates": ["WS-1"]}
+
+    result = execute_instruction_with_audit(_triage_instr(), _Event(), context, llm)
+
+    assert len(result.report["recommendations"]) == 1
+    assert result.report["recommendations"][0]["candidate"] == "WS-1"
+    assert result.report["quarantined_items"][0]["reason"] == "allow_list"
+
+
+def _nested(depth: int) -> dict[str, Any]:
+    node: dict[str, Any] = {"leaf": 1}
+    for _ in range(depth):
+        node = {"a": node}
+    return node
+
+
+def test_guardrail_over_depth_quarantined():
+    deep = _valid_rec(2)
+    deep["extra"] = _nested(12)  # well past _MAX_DEPTH
+    raw = json.dumps({"summary": "Triage.", "recommendations": [_valid_rec(1), deep]})
+    llm = _CountingLLM([raw])
+
+    result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
+
+    assert len(result.report["recommendations"]) == 1
+    assert result.report["quarantined_count"] == 1
+    assert result.report["quarantined_items"][0]["reason"] == "guardrail"
+    assert "depth" in result.report["quarantined_items"][0]["error"]
+
+
+def test_resilient_recovery_against_real_2026_06_26_fixture():
+    # The actual captured failure payload (4000-char preview, truncated at the 7th
+    # recommendation) — the run that reset the WP-0006-T03 streak. Before WP-0016
+    # this discarded the whole report; now it must recover the valid prefix.
+    fixture = json.loads(
+        Path("tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json")
+        .read_text(encoding="utf-8")
+    )
+    raw = fixture["raw_output_preview"]
+    llm = _CountingLLM([raw, raw])
+
+    result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm)
+
+    assert result.output_validated is True
+    assert result.report["partial"] is True
+    # Six recommendations are fully intact before the truncation point.
+    assert len(result.report["recommendations"]) >= 6
+    assert all("rank" in rec and "candidate" in rec for rec in result.report["recommendations"])
+
+
+
+class _MetadataBadLLM:
+    def __init__(self) -> None:
+        self.call_count = 0
+        self.last_response_metadata: dict[str, Any] | None = None
+
+    def complete(
+        self,
+        prompt: str,
+        model: str = "",
+        config: dict | None = None,
+    ) -> str:
+        self.call_count += 1
+        self.last_response_metadata = {
+            "finish_reason": "length",
+            "usage": {"input_tokens": 1100, "output_tokens": 1200},
+        }
+        return ("x" * 9000) + "{"
+
+
+def test_invalid_report_preserves_response_metadata_and_long_preview():
+    llm = _MetadataBadLLM()
+    instr = _instr(
+        id="daily-triage-report",
+        prompt="Report.",
+        trusted_fields=[],
+        report_sinks=[{"type": "working-memory", "path": "/tmp"}],
+    )
+
+    result = execute_instruction_with_audit(instr, _Event(), {}, llm)
+
+    assert llm.call_count == 2
+    assert result.output_validated is False
+    assert result.llm_response_metadata == {
+        "finish_reason": "length",
+        "usage": {"input_tokens": 1100, "output_tokens": 1200},
+    }
+    assert result.report["llm_response_metadata"] == result.llm_response_metadata
+    assert len(result.report["raw_output_preview"]) > 4000
+
+
 def test_execute_instruction_with_audit_preserves_invalid_report_with_sinks(
     tmp_path,
     monkeypatch,
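The WP-0016-T03 tests above expect a report whose JSON is cut off mid-item to keep its valid prefix and quarantine the broken tail. The following is a minimal sketch of that idea only, and it assumes nothing about how `execute_instruction_with_audit` or `_try_repair` are actually implemented. `recover_items` is a hypothetical helper, not part of activity-core. It walks the `recommendations` array with the standard library's `json.JSONDecoder.raw_decode` and stops at the first element that fails to parse.

```python
import json


def recover_items(raw: str, key: str = "recommendations") -> tuple[list, int]:
    """Return (parsed_items, quarantined_count) for a possibly truncated document.

    Hypothetical helper for illustration; the real recovery logic may differ.
    """
    decoder = json.JSONDecoder()
    marker = raw.find(f'"{key}"')
    if marker == -1:
        return [], 0
    pos = raw.find("[", marker)
    if pos == -1:
        return [], 0
    pos += 1  # step past the opening bracket of the array

    items: list = []
    quarantined = 0
    while True:
        # raw_decode does not skip separators, so do it by hand.
        while pos < len(raw) and raw[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(raw) or raw[pos] == "]":
            break
        try:
            item, pos = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            # The remainder is unparseable: count it once and stop.
            quarantined += 1
            break
        items.append(item)
    return items, quarantined
```

A caller would still have to validate each recovered item against the schema. This sketch only separates the parseable prefix from the truncated tail, and it locates the array by a plain string search, which would misfire if the key name also appeared inside an earlier string value.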
114
tests/test_admin_sync_api.py
Normal file
@@ -0,0 +1,114 @@
+from __future__ import annotations
+
+from typing import Any
+
+import pytest
+
+from activity_core import api
+
+
+@pytest.mark.asyncio
+async def test_admin_sync_definitions_only_does_not_require_temporal(
+    monkeypatch,
+) -> None:
+    seen: dict[str, Any] = {}
+
+    async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
+        seen.update(kwargs)
+        return {"ok": True, "ran": {"definitions": True}}
+
+    monkeypatch.setattr(api, "_session_factory", object())
+    monkeypatch.setattr(api, "_temporal_client", None)
+    monkeypatch.setattr(api, "run_sync", fake_run_sync)
+
+    result = await api.admin_sync(
+        definitions=True,
+        schedules=False,
+        event_types=False,
+    )
+
+    assert result == {"ok": True, "ran": {"definitions": True}}
+    assert seen["session_factory"] is api._session_factory
+    assert seen["temporal_client"] is None
+    assert seen["definitions"] is True
+    assert seen["schedules"] is False
+    assert seen["event_types"] is False
+
+
+@pytest.mark.asyncio
+async def test_admin_sync_schedules_only_passes_temporal(monkeypatch) -> None:
+    temporal = object()
+    seen: dict[str, Any] = {}
+
+    async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
+        seen.update(kwargs)
+        return {
+            "ok": True,
+            "schedules": {
+                "upserted": 1,
+                "paused": 0,
+                "deleted_orphans": 0,
+            },
+        }
+
+    monkeypatch.setattr(api, "_session_factory", object())
+    monkeypatch.setattr(api, "_temporal_client", temporal)
+    monkeypatch.setattr(api, "run_sync", fake_run_sync)
+
+    result = await api.admin_sync(
+        definitions=False,
+        schedules=True,
+        event_types=False,
+    )
+
+    assert result["schedules"]["upserted"] == 1
+    assert seen["temporal_client"] is temporal
+    assert seen["definitions"] is False
+    assert seen["schedules"] is True
+    assert seen["event_types"] is False
+
+
+@pytest.mark.asyncio
+async def test_admin_sync_all_sync_returns_failure_result(monkeypatch) -> None:
+    async def fake_run_sync(**kwargs: Any) -> dict[str, Any]:
+        return {
+            "ok": False,
+            "ran": {
+                "definitions": kwargs["definitions"],
+                "schedules": kwargs["schedules"],
+                "event_types": kwargs["event_types"],
+            },
+            "errors": [
+                {
+                    "stage": "event_types",
+                    "type": "RuntimeError",
+                    "message": "bad event type",
+                }
+            ],
+        }
+
+    monkeypatch.setattr(api, "_session_factory", object())
+    monkeypatch.setattr(api, "_temporal_client", object())
+    monkeypatch.setattr(api, "run_sync", fake_run_sync)
+
+    result = await api.admin_sync(
+        definitions=True,
+        schedules=True,
+        event_types=True,
+    )
+
+    assert result == {
+        "ok": False,
+        "ran": {
+            "definitions": True,
+            "schedules": True,
+            "event_types": True,
+        },
+        "errors": [
+            {
+                "stage": "event_types",
+                "type": "RuntimeError",
+                "message": "bad event type",
+            }
+        ],
+    }
289
tests/test_automation_status.py
Normal file
@@ -0,0 +1,289 @@
+from __future__ import annotations
+
+import asyncio
+import json
+from datetime import datetime
+from pathlib import Path
+from zoneinfo import ZoneInfo
+
+from activity_core import automation_status as status
+
+ACTIVITY_ID = "00000000-0000-0000-0000-000000000123"
+
+
+def _window():
+    return status.resolve_window(
+        "2026-06-26",
+        "2026-06-29",
+        "Europe/Berlin",
+    )
+
+
+def _definition(enabled: bool = True):
+    return {
+        "id": ACTIVITY_ID,
+        "name": "Daily Check",
+        "enabled": enabled,
+        "trigger_type": "cron",
+        "trigger_config": {
+            "trigger_type": "cron",
+            "cron_expression": "0 9 * * *",
+            "timezone": "Europe/Berlin",
+            "misfire_policy": "skip",
+        },
+        "source": "test",
+    }
+
+
+def test_friday_shortcut_resolves_to_previous_friday_start() -> None:
+    now = datetime(2026, 6, 29, 12, 0, tzinfo=ZoneInfo("Europe/Berlin"))
+
+    window = status.resolve_window("friday", None, "Europe/Berlin", now=now)
+
+    assert window["since"].isoformat() == "2026-06-26T00:00:00+02:00"
+    assert window["until"].isoformat() == "2026-06-29T12:00:00+02:00"
+
+
+def test_expected_fires_for_simple_cron_window() -> None:
+    fires = status.expected_fires(_definition(), _window())
+
+    assert fires == [
+        "2026-06-26T09:00:00+02:00",
+        "2026-06-27T09:00:00+02:00",
+        "2026-06-28T09:00:00+02:00",
+        "2026-06-29T09:00:00+02:00",
+    ]
+
+
+def test_completed_when_expected_run_exists() -> None:
+    run = {
+        "run_id": "run-1",
+        "activity_id": ACTIVITY_ID,
+        "scheduled_for": "2026-06-26T07:00:00+00:00",
+        "fired_at": "2026-06-26T07:00:10+00:00",
+        "tasks_spawned": 1,
+    }
+
+    report = status.classify_activity(
+        _definition(),
+        _window(),
+        [run],
+        [{"source": "state_hub_progress", "run_id": "run-1", "output_validated": True}],
+        None,
+        ["2026-06-26T09:00:00+02:00"],
+        runs_available=True,
+    )
+
+    assert report["status"] == "completed"
+
+
+def test_validation_failure_wins_over_completed_run() -> None:
+    run = {"run_id": "run-1", "activity_id": ACTIVITY_ID, "scheduled_for": None, "fired_at": "2026-06-26T07:00:10+00:00"}
+
+    report = status.classify_activity(
+        _definition(),
+        _window(),
+        [run],
+        [{"source": "working_memory", "run_id": "run-1", "output_validated": False}],
+        None,
+        ["2026-06-26T09:00:00+02:00"],
+        runs_available=True,
+    )
+
+    assert report["status"] == "validation_failed"
+
+
+def test_missed_when_expected_fire_has_no_run_and_runs_available() -> None:
+    report = status.classify_activity(
+        _definition(),
+        _window(),
+        [],
+        [],
+        None,
+        ["2026-06-26T09:00:00+02:00"],
+        runs_available=True,
+    )
+
+    assert report["status"] == "missed"
+
+
+def test_disabled_schedule_is_not_counted_as_missed() -> None:
+    report = status.classify_activity(
+        _definition(enabled=False),
+        _window(),
+        [],
+        [],
+        None,
+        ["2026-06-26T09:00:00+02:00"],
+        runs_available=True,
+    )
+
+    assert report["status"] == "disabled"
+
+
+def test_scheduled_definition_reports_one_shot_schedule_id() -> None:
+    definition = {
+        "id": ACTIVITY_ID,
+        "name": "One Shot",
+        "enabled": True,
+        "trigger_type": "scheduled",
+        "trigger_config": {
+            "trigger_type": "scheduled",
+            "at": "2026-06-26T09:00:00+02:00",
+            "timezone": "Europe/Berlin",
+        },
+        "source": "test",
+    }
+
+    report = status.classify_activity(
+        definition,
+        _window(),
+        [],
+        [],
+        None,
+        ["2026-06-26T09:00:00+02:00"],
+        runs_available=False,
+    )
+
+    assert status.automation_schedule_id(_definition()) == f"activity-schedule-{ACTIVITY_ID}"
+    assert report["schedule_id"] == f"activity-schedule-{ACTIVITY_ID}-once"
+
+
+def test_partial_source_availability_is_unknown_not_missed() -> None:
+    report = status.classify_activity(
+        _definition(),
+        _window(),
+        [],
+        [],
+        None,
+        ["2026-06-26T09:00:00+02:00"],
+        runs_available=False,
+    )
+
+    assert report["status"] == "unknown"
+    assert "missed-run verdict is unknown" in report["warnings"][0]
+
+
+def test_working_memory_frontmatter_evidence(tmp_path: Path) -> None:
+    note = tmp_path / "daily-triage-2026-06-26-run.md"
+    note.write_text(
+        "---\n"
+        "source: activity-core\n"
+        f"activity_id: {ACTIVITY_ID}\n"
+        "activity_core_run_id: run-1\n"
+        "scheduled_for: 2026-06-26T07:00:00+00:00\n"
+        "output_validated: false\n"
+        "created: 2026-06-26T07:01:00+00:00\n"
+        "---\n"
+        "body\n",
+        encoding="utf-8",
+    )
+
+    evidence, source = status.load_working_memory_evidence(str(tmp_path), _window())
+
+    assert source["status"] == "ok"
+    assert evidence[0]["run_id"] == "run-1"
+    assert evidence[0]["output_validated"] is False
+
+
+def _scheduled_definition(enabled: bool = False):
+    return {
+        "id": "00000000-0000-0000-0000-000000000456",
+        "name": "One Shot",
+        "enabled": enabled,
+        "trigger_type": "scheduled",
+        "trigger_config": {
+            "trigger_type": "scheduled",
+            "at": "2026-06-26T09:00:00+02:00",
+            "timezone": "Europe/Berlin",
+        },
+        "source": "db",
+    }
+
+
+def test_inventory_report_uses_db_definition_rows(monkeypatch) -> None:
+    async def fake_load_definitions(args, warnings):
+        return [dict(_definition(), source="db"), _scheduled_definition()], {"status": "ok", "source": "db"}
+
+    async def fake_temporal(host, namespace, definitions, *, timeout_seconds):
+        return {
+            ACTIVITY_ID: {
+                "schedule_id": f"activity-schedule-{ACTIVITY_ID}",
+                "available": True,
+                "paused": False,
+                "missed_catchup_window": 0,
+                "last_fired_at": None,
+            },
+        }, {"status": "ok", "count": 1}
+
+    monkeypatch.setattr(status, "load_definitions", fake_load_definitions)
+    monkeypatch.setattr(status, "load_temporal_visibility", fake_temporal)
+    args = status.parse_inventory_args(["--format", "json"])
+
+    report, exit_code = asyncio.run(status.build_inventory_report(args))
+
+    assert exit_code == 0
+    assert report["sources"]["definitions"] == {"status": "ok", "source": "db"}
+    assert report["summary"]["automation_count"] == 2
+    assert report["automations"][0]["definition_source"] == "db"
+    assert report["automations"][0]["temporal"]["status"] == "active"
+    assert report["automations"][1]["schedule_id"].endswith("-once")
+
+
+def test_inventory_file_fallback_when_db_url_missing(monkeypatch) -> None:
+    monkeypatch.setattr(status, "file_definitions", lambda: [dict(_definition(), source="files")])
+    args = status.parse_inventory_args(["--db-url", "", "--temporal-host", ""])
+
+    report, exit_code = asyncio.run(status.build_inventory_report(args))
+
+    assert exit_code == 0
+    assert report["sources"]["definitions"]["status"] == "degraded"
+    assert report["automations"][0]["definition_source"] == "files"
+    assert "ACTCORE_DB_URL is not set" in report["warnings"][0]
+
+
+def test_inventory_filters_disabled_definitions() -> None:
+    definitions = [_definition(enabled=True), _scheduled_definition(enabled=False)]
+
+    filtered = status.filter_inventory_definitions(
+        definitions,
+        ids=[],
+        names=[],
+        enabled=False,
+        trigger_types=set(),
+    )
+
+    assert [item["name"] for item in filtered] == ["One Shot"]
+
+
+def test_inventory_temporal_unavailable_is_warning_not_failure(monkeypatch) -> None:
+    async def fake_load_definitions(args, warnings):
+        return [_definition()], {"status": "ok", "source": "db"}
+
+    async def fake_temporal(host, namespace, definitions, *, timeout_seconds):
+        return {}, {"status": "unavailable", "warning": "Temporal unavailable: nope"}
+
+    monkeypatch.setattr(status, "load_definitions", fake_load_definitions)
+    monkeypatch.setattr(status, "load_temporal_visibility", fake_temporal)
+    args = status.parse_inventory_args([])
+
+    report, exit_code = asyncio.run(status.build_inventory_report(args))
+
+    assert exit_code == 0
+    assert report["automations"][0]["temporal"]["status"] == "not_checked"
+    assert report["warnings"] == ["Temporal unavailable: nope"]
+
+
+def test_inventory_cli_emits_json(monkeypatch, capsys) -> None:
+    monkeypatch.setattr(status, "file_definitions", lambda: [dict(_definition(), source="files")])
+
+    exit_code = asyncio.run(status.async_inventory_main([
+        "--db-url", "",
+        "--temporal-host", "",
+        "--format", "json",
+    ]))
+
+    payload = json.loads(capsys.readouterr().out)
+    assert exit_code == 0
+    assert payload["mode"] == "automation-inventory"
+    assert payload["automations"][0]["name"] == "Daily Check"
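`test_expected_fires_for_simple_cron_window` in the file above pins down what a `0 9 * * *` schedule should produce over a four-day window. As a rough illustration of that expectation, here is a sketch that enumerates daily fire times between two timezone-aware datetimes. `daily_fires` is a hypothetical function and not the project's `status.expected_fires`; it handles only a fixed hour and minute, not general cron expressions, and the fixed `+02:00` offset stands in for Europe/Berlin summer time.

```python
from datetime import datetime, timedelta, timezone


def daily_fires(since: datetime, until: datetime, hour: int, minute: int = 0) -> list[str]:
    """Return ISO timestamps of each daily hour:minute fire within [since, until].

    Hypothetical helper for illustration; both bounds must be timezone-aware.
    """
    fires: list[str] = []
    day = since.date()
    while day <= until.date():
        candidate = datetime(day.year, day.month, day.day, hour, minute, tzinfo=since.tzinfo)
        if since <= candidate <= until:
            fires.append(candidate.isoformat())
        day += timedelta(days=1)
    return fires


# Assumption: a fixed +02:00 offset approximates Europe/Berlin in late June.
CEST = timezone(timedelta(hours=2))
window_fires = daily_fires(
    datetime(2026, 6, 26, tzinfo=CEST),
    datetime(2026, 6, 29, 12, 0, tzinfo=CEST),
    hour=9,
)
```

A real implementation would need a proper time zone database and a cron parser to cope with daylight-saving transitions and arbitrary expressions. The sketch only shows the window arithmetic the test relies on.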
@@ -1,6 +1,7 @@
 from __future__ import annotations
 
 import json
+from pathlib import Path
 
 import pytest
 
@@ -70,7 +71,14 @@ async def test_evaluate_instructions_returns_task_specs_with_audit(monkeypatch)
 async def test_evaluate_instructions_returns_report_payload(monkeypatch) -> None:
     llm = FakeLLMClient(json.dumps({
         "summary": "State Hub has open loose ends.",
-        "recommendations": [{"candidate": "CUST-WP-0045", "action": "work-next"}],
+        "recommendations": [
+            {
+                "rank": 1,
+                "candidate": "CUST-WP-0045",
+                "action": "work-next",
+                "why": "Open loose ends.",
+            }
+        ],
     }))
     monkeypatch.setattr(activities, "get_llm_client", lambda: llm)
 
@@ -209,6 +217,12 @@ async def test_evaluate_instructions_forwards_llm_connect_depth_config(monkeypat
         "context": {},
     })
 
+    # Read the live schema file rather than hard-coding it, so the forwarded
+    # json_schema assertion tracks schemas/daily-triage-report.json as the
+    # contract evolves (ACTIVITY-WP-0016-T02).
+    expected_schema = json.loads(
+        Path("schemas/daily-triage-report.json").read_text(encoding="utf-8")
+    )
     assert llm.calls[0][2] == {
         "model_name": "custodian-triage-balanced",
         "temperature": 0.2,
@@ -216,16 +230,6 @@ async def test_evaluate_instructions_forwards_llm_connect_depth_config(monkeypat
         "max_depth": 2,
         "model_params": {
             "reasoning_effort": "medium",
-            "json_schema": {
-                "type": "object",
-                "required": ["summary", "recommendations"],
-                "properties": {
-                    "summary": {"type": "string"},
-                    "recommendations": {
-                        "type": "array",
-                        "items": {"type": "object"},
-                    },
-                },
-            },
+            "json_schema": expected_schema,
         },
     }
@@ -34,7 +34,7 @@ def test_issue_core_rest_sink_posts_task_contract(monkeypatch) -> None:
 
     monkeypatch.setattr(httpx, "post", fake_post)
 
-    ref = IssueCoreRestSink("http://issue-core.test/").emit(TaskSpec(
+    ref = IssueCoreRestSink("http://issue-core.test/", api_key="test-key").emit(TaskSpec(
         title="Run SBOM rescan for activity-core",
         description="SBOM is older than 30 days.",
         target_repo="activity-core",
@@ -67,12 +67,30 @@ def test_issue_core_rest_sink_posts_task_contract(monkeypatch) -> None:
                 "triggering_event_id": "scheduled",
                 "activity_definition_id": "activity-1",
             },
+            "headers": {"Authorization": "Bearer test-key"},
             "timeout": 10.0,
         }
     ]
     assert "review_required" not in posts[0]["json"]
 
 
+def test_issue_core_rest_sink_requires_api_key() -> None:
+    sink = IssueCoreRestSink("http://issue-core.test/", api_key="")
+    with pytest.raises(RuntimeError, match="ISSUE_CORE_API_KEY"):
+        sink.emit(TaskSpec(
+            title="t",
+            description="",
+            target_repo="activity-core",
+            priority="low",
+            labels=[],
+            due_in_days=None,
+            source_type="rule",
+            source_id="r",
+            triggering_event_id="e",
+            activity_definition_id="a",
+        ))
+
+
 @pytest.mark.asyncio
 async def test_emit_tasks_raises_when_sink_fails(monkeypatch) -> None:
     class FailingSink:
@@ -13,7 +13,12 @@ def test_llm_connect_client_forwards_run_config(monkeypatch) -> None:
             pass
 
         def json(self) -> dict:
-            return {"content": '{"summary":"ok","recommendations":[]}'}
+            return {
+                "content": '{"summary":"ok","recommendations":[]}',
+                "finish_reason": "stop",
+                "usage": {"input_tokens": 10, "output_tokens": 20},
+                "raw_response": {"provider_blob": "not persisted"},
+            }
 
     def fake_post(url: str, json: dict, timeout: float) -> Response:
         captured["url"] = url
@@ -50,3 +55,7 @@ def test_llm_connect_client_forwards_run_config(monkeypatch) -> None:
             "timeout_seconds": 42,
         },
     }
+    assert client.last_response_metadata == {
+        "finish_reason": "stop",
+        "usage": {"input_tokens": 10, "output_tokens": 20},
+    }
@@ -166,6 +166,93 @@ def test_state_hub_progress_sink_is_idempotent(monkeypatch) -> None:
     assert result[0]["idempotency_key"] == idempotency_key
 
 
+def test_core_hub_interaction_event_sink_posts_and_verifies_compact_event(monkeypatch) -> None:
+    posts: list[dict[str, Any]] = []
+
+    def fake_post(url: str, **kwargs: Any) -> DummyResponse:
+        assert url == "http://core-hub.test/api/v2/interaction-events"
+        assert kwargs["headers"]["Authorization"] == "Bearer runtime-secret"
+        posts.append({"url": url, **kwargs})
+        return DummyResponse(
+            {
+                "id": "event-1",
+                "eventType": "ops-endpoint-verified",
+                "widgetId": "widget-1",
+            }
+        )
+
+    def fake_get(url: str, **kwargs: Any) -> DummyResponse:
+        assert url == "http://core-hub.test/api/v2/interaction-events"
+        assert kwargs["headers"]["Authorization"] == "Bearer runtime-secret"
+        return DummyResponse({"data": [{"id": "event-1"}]})
+
+    monkeypatch.setenv("CORE_HUB_RUNTIME_TOKEN", "runtime-secret")
+    monkeypatch.setattr(httpx, "post", fake_post)
+    monkeypatch.setattr(httpx, "get", fake_get)
+
+    result = persist_ops_inventory_evidence(
+        _payload([
+            {
+                "type": "core-hub-interaction-event",
+                "core_hub_url": "http://core-hub.test",
+                "widget_id": "widget-1",
+                "event_type": "ops-endpoint-verified",
+            }
+        ])
+    )
+
+    assert result == [
+        {
+            "type": "core-hub-interaction-event",
+            "status": "posted",
+            "event_type": "ops-endpoint-verified",
+            "event_id": "event-1",
+            "widget_id": "widget-1",
+            "verified": True,
+            "context_key": "ops_probe",
+        }
+    ]
+    body = posts[0]["json"]
+    assert body["widgetId"] == "widget-1"
+    assert body["eventType"] == "ops-endpoint-verified"
+    assert body["metadata"]["activity_core_run_id"] == _run_id()
+    assert body["metadata"]["endpoint"]["url"] == "http://state-hub.test/health"
+    assert body["metadata"]["endpoint"]["widget_ref"] == "ops:endpoint:state-hub-health"
+
+    serialized = json.dumps(body, sort_keys=True)
+    assert "runtime-secret" not in serialized
+    assert "secret response body" not in serialized
+    assert "Authorization" not in serialized
+    assert "user:pass" not in serialized
+    assert "token=secret" not in serialized
+
+
+def test_core_hub_sink_skips_cleanly_when_config_missing(monkeypatch) -> None:
+    monkeypatch.delenv("CORE_HUB_BASE_URL", raising=False)
+    monkeypatch.delenv("CORE_HUB_RUNTIME_TOKEN", raising=False)
+    monkeypatch.delenv("CORE_HUB_RUNTIME_TOKEN_FILE", raising=False)
+    monkeypatch.delenv("CORE_HUB_WIDGET_ID", raising=False)
+    monkeypatch.delenv("CORE_HUB_WIDGET_MAPPING", raising=False)
+
+    result = persist_ops_inventory_evidence(
+        _payload([{"type": "core-hub-interaction-event"}])
+    )
+
+    assert result == [
+        {
+            "type": "core-hub-interaction-event",
+            "status": "skipped",
+            "reason": "missing_core_hub_config",
+            "missing": [
+                "CORE_HUB_BASE_URL",
+                "CORE_HUB_RUNTIME_TOKEN or CORE_HUB_RUNTIME_TOKEN_FILE",
+                "widget_id or CORE_HUB_WIDGET_ID",
+            ],
+            "context_key": "ops_probe",
+        }
+    ]
+
+
 def test_inter_hub_sink_skips_cleanly_when_config_missing(monkeypatch) -> None:
     monkeypatch.delenv("INTER_HUB_URL", raising=False)
     monkeypatch.delenv("OPS_HUB_KEY", raising=False)
@@ -93,12 +93,21 @@ def test_external_configmap_projects_enabled_daily_wsjf_definition(tmp_path) ->
     assert definition.trigger_config["cron_expression"] == "20 7 * * *"
     assert definition.trigger_config["timezone"] == "Europe/Berlin"
     assert instruction["id"] == "daily-triage-report"
+    assert instruction["max_tokens"] == 1800
+    assert "most 7 recommendations" in instruction["prompt"]
+    assert "fewer well-formed" in instruction["prompt"]
     assert instruction["output_schema"] == (
         "/etc/activity-core/schemas/daily-triage-report.json"
     )
     assert instruction["report_sinks"][0]["type"] == "working-memory"
     assert instruction["report_sinks"][1]["event_type"] == "daily_triage"
+
+    schema = _by_kind_name("ConfigMap", "actcore-report-schemas")
+    daily_schema = yaml.safe_load(schema["data"]["daily-triage-report.json"])
+    recommendations = daily_schema["properties"]["recommendations"]
+    assert recommendations["maxItems"] == 7
+    assert recommendations["items"]["properties"]["rank"]["maximum"] == 7
 
 
 def test_ops_inventory_configmap_contains_probeable_inventory() -> None:
     config = _by_kind_name("ConfigMap", "actcore-ops-service-inventory")
@@ -37,6 +37,10 @@ def _payload(sinks: list[dict[str, Any]]) -> dict[str, Any]:
                 "output_validated": True,
                 "review_required": False,
                 "validation_error": None,
+                "llm_response_metadata": {
+                    "finish_reason": "stop",
+                    "usage": {"output_tokens": 50},
+                },
             }
         ],
     }
@@ -62,6 +66,8 @@ def test_working_memory_sink_writes_idempotently(tmp_path) -> None:
     assert "output_validated: true" in text
     assert "review_required: false" in text
     assert "model: test-model" in text
+    assert "LLM response metadata:" in text
+    assert '"finish_reason": "stop"' in text
     assert "State Hub has loose ends." in text
 
 
@@ -113,6 +119,10 @@ def test_state_hub_progress_sink_posts(monkeypatch) -> None:
     assert posts[0]["json"]["detail"]["activity_core_run_id"] == payload_run_id()
     assert posts[0]["json"]["detail"]["output_validated"] is True
     assert posts[0]["json"]["detail"]["review_required"] is False
+    assert posts[0]["json"]["detail"]["llm_response_metadata"] == {
+        "finish_reason": "stop",
+        "usage": {"output_tokens": 50},
+    }
 
 
 def test_state_hub_progress_includes_prior_working_memory_path(
167
tests/test_reuse_surface_context_resolver.py
Normal file
@@ -0,0 +1,167 @@
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any
+
+import pytest
+from temporalio.exceptions import ApplicationError
+
+from activity_core.activities import resolve_context
+from activity_core.context_resolvers import reuse_surface
+from activity_core.context_resolvers.base import CONTEXT_RESOLVER_REGISTRY
+
+
+class _Response:
+    def __init__(self, payload: Any) -> None:
+        self._payload = payload
+
+    def raise_for_status(self) -> None:
+        return None
+
+    def json(self) -> Any:
+        return self._payload
+
+
+class _Completed:
+    returncode = 0
+    stderr = ""
+
+    def __init__(self, payload: dict[str, Any]) -> None:
+        self.stdout = json.dumps(payload)
+
+
+def _write_rollout(path: Path) -> None:
+    path.write_text(
+        """
+domains:
+  reuse:
+    phase: active
+    repos:
+      - reuse-surface
+      - activity-core
+  parked:
+    phase: backlog
+    repos:
+      - ignored-repo
+""".lstrip(),
+        encoding="utf-8",
+    )
+
+
+def _write_cli_only_signals(path: Path) -> None:
+    path.write_text(
+        """
+signals:
+  empty_capability_scaffold:
+    enabled: true
+  registry_gap:
+    enabled: false
+  stale_scope:
+    enabled: false
+  stale_sbom:
+    enabled: false
+  publish_check_fail:
+    enabled: false
+""".lstrip(),
+        encoding="utf-8",
+    )
+
+
+def test_shell_resolver_emits_reuse_surface_gaps_and_advances_cursor(
+    tmp_path,
+    monkeypatch,
+) -> None:
+    rollout = tmp_path / "rollout.yaml"
+    _write_rollout(rollout)
+    _write_cli_only_signals(tmp_path / "signals.yml")
+    reuse_root = tmp_path / "reuse-surface"
+    reuse_root.mkdir()
+    (reuse_root / "SCOPE.md").write_text("fresh\n", encoding="utf-8")
+    activity_root = tmp_path / "activity-core"
+    activity_root.mkdir()
+
+    monkeypatch.setenv("KAIZEN_RUNNER_HOST", "runner")
+
+    def fake_get(url: str, **kwargs: Any) -> _Response:
+        assert url.endswith("/repos/")
+        return _Response(
+            [
+                {
+                    "slug": "reuse-surface",
+                    "host_paths": {"runner": str(reuse_root)},
+                },
+                {
+                    "slug": "activity-core",
+                    "host_paths": {"runner": str(activity_root)},
+                },
+            ]
+        )
+
+    def fake_run(cmd: list[str], **kwargs: Any) -> _Completed:
+        assert cmd == ["reuse-surface", "report", "gaps", "--format", "json"]
+        return _Completed({"empty_scaffolds": ["reuse-surface"]})
+
+    monkeypatch.setattr(reuse_surface.httpx, "get", fake_get)
+    monkeypatch.setattr(reuse_surface.subprocess, "run", fake_run)
+
+    import activity_core.context_resolvers  # noqa: F401
+
+    result = CONTEXT_RESOLVER_REGISTRY["shell"]().resolve(
+        "reuse_surface_report_gaps",
+        None,
+        {
+            "roster": str(rollout),
+            "batch_size": 1,
+        },
+    )
+
+    assert result == {
+        "gaps": [
+            {
+                "repo": "reuse-surface",
+                "root": str(reuse_root),
+                "signal": "empty_capability_scaffold",
+                "hygiene_signal": "empty_capability_scaffold",
+            }
+        ]
+    }
+    state = json.loads((tmp_path / "round-robin-state.json").read_text(encoding="utf-8"))
+    assert state["cursor"] == 1
+    assert state["last_batch"] == ["reuse-surface"]
+
+
+def test_shell_resolver_keeps_kaizen_fallback_for_existing_queries() -> None:
+    assert CONTEXT_RESOLVER_REGISTRY["shell"]().resolve("unknown_query", None, {}) == {}
+
+
+@pytest.mark.asyncio
+async def test_optional_reuse_surface_missing_roster_binds_empty_list(tmp_path) -> None:
+    snapshot = await resolve_context(
+        [
+            {
+                "type": "shell",
+                "query": "reuse_surface_report_gaps",
+                "params": {"roster": str(tmp_path / "missing.yaml")},
+                "bind_to": "context.gaps",
+            }
+        ]
+    )
+
+    assert snapshot == {"gaps": []}
+
+
+@pytest.mark.asyncio
+async def test_required_reuse_surface_missing_roster_fails_visibly(tmp_path) -> None:
+    with pytest.raises(ApplicationError, match="Required context resolver"):
+        await resolve_context(
+            [
+                {
+                    "type": "shell",
+                    "query": "reuse_surface_report_gaps",
+                    "params": {"roster": str(tmp_path / "missing.yaml")},
+                    "bind_to": "context.gaps",
+                    "required": True,
+                }
+            ]
+        )
81
tests/test_schedule_health.py
Normal file
@@ -0,0 +1,81 @@
+"""ACTIVITY-WP-0014 T03: missed-fire detection verdict tests."""
+
+from __future__ import annotations
+
+from datetime import datetime, timedelta, timezone
+
+from activity_core.schedule_health import evaluate_schedule_health
+
+NOW = datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc)
+
+
+def test_healthy_when_recent_fire_and_no_drops() -> None:
+    health = evaluate_schedule_health(
+        activity_id="a1",
+        missed_catchup_window=0,
+        last_fired_at=NOW - timedelta(minutes=5),
+        now=NOW,
+        expected_interval=timedelta(hours=1),
+    )
+    assert health.healthy is True
+    assert health.missed is False
+    assert health.reasons == []
+
+
+def test_unhealthy_when_catchup_window_dropped_fires() -> None:
+    health = evaluate_schedule_health(
+        activity_id="a1",
+        missed_catchup_window=2,
+        last_fired_at=NOW - timedelta(minutes=5),
+        now=NOW,
+    )
+    assert health.missed is True
+    assert "2 fire(s) dropped" in health.reasons[0]
+
+
+def test_unhealthy_when_last_fire_too_stale() -> None:
+    health = evaluate_schedule_health(
+        activity_id="daily",
+        missed_catchup_window=0,
+        last_fired_at=NOW - timedelta(days=2),
+        now=NOW,
+        expected_interval=timedelta(days=1),
+    )
+    assert health.missed is True
+    assert any("exceeding the expected" in r for r in health.reasons)
+    assert health.staleness == timedelta(days=2)
+
+
+def test_within_tolerance_is_healthy() -> None:
+    health = evaluate_schedule_health(
+        activity_id="daily",
+        missed_catchup_window=0,
+        last_fired_at=NOW - (timedelta(days=1) + timedelta(minutes=5)),
+        now=NOW,
+        expected_interval=timedelta(days=1),
+        tolerance=timedelta(minutes=10),
+    )
+    assert health.healthy is True
+
+
+def test_no_fire_recorded_for_due_schedule_is_unhealthy() -> None:
+    health = evaluate_schedule_health(
+        activity_id="daily",
+        missed_catchup_window=0,
+        last_fired_at=None,
+        now=NOW,
+        expected_interval=timedelta(days=1),
+    )
+    assert health.missed is True
+    assert "no recorded fire" in health.reasons[0]
+
+
+def test_no_interval_and_no_fire_is_not_flagged() -> None:
+    # Without an expected interval we cannot assert a miss from absence alone.
+    health = evaluate_schedule_health(
+        activity_id="event-ish",
+        missed_catchup_window=0,
+        last_fired_at=None,
+        now=NOW,
+    )
+    assert health.healthy is True
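The tests above pin down the verdict rules for `evaluate_schedule_health` without showing the function itself. A minimal sketch consistent with those assertions might look like the following. The `ScheduleHealth` dataclass, the zero default for `tolerance`, and the exact wording of the reason strings are assumptions made for illustration; the real `activity_core.schedule_health` module is not part of this excerpt.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class ScheduleHealth:
    """Hypothetical verdict object; field names mirror what the tests read."""

    activity_id: str
    missed: bool
    reasons: list[str] = field(default_factory=list)
    staleness: timedelta | None = None

    @property
    def healthy(self) -> bool:
        return not self.missed


def evaluate_schedule_health(
    *,
    activity_id: str,
    missed_catchup_window: int,
    last_fired_at: datetime | None,
    now: datetime,
    expected_interval: timedelta | None = None,
    tolerance: timedelta = timedelta(0),  # assumed default
) -> ScheduleHealth:
    reasons: list[str] = []

    # Rule 1: any fire dropped outside the catchup window counts as a miss.
    if missed_catchup_window > 0:
        reasons.append(
            f"{missed_catchup_window} fire(s) dropped outside the catchup window"
        )

    # Rule 2: a last fire older than interval + tolerance counts as a miss.
    staleness = None
    if last_fired_at is not None:
        staleness = now - last_fired_at
        if expected_interval is not None and staleness > expected_interval + tolerance:
            reasons.append(
                f"last fire was {staleness} ago, exceeding the expected "
                f"interval of {expected_interval}"
            )
    # Rule 3: no fire at all is only a miss when an interval says one was due.
    elif expected_interval is not None:
        reasons.append("no recorded fire for a schedule with an expected interval")

    return ScheduleHealth(activity_id, bool(reasons), reasons, staleness)


# Usage: a daily schedule that last fired two days ago is reported as missed.
now = datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc)
verdict = evaluate_schedule_health(
    activity_id="daily",
    missed_catchup_window=0,
    last_fired_at=now - timedelta(days=2),
    now=now,
    expected_interval=timedelta(days=1),
)
print(verdict.missed, verdict.reasons)
```

Keeping the verdict a pure function of its inputs (with `now` passed in rather than read from the clock) is what lets the tests use a fixed `NOW` constant.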
@@ -37,6 +37,7 @@ def _make_defn(
     misfire_policy: str = "skip",
     enabled: bool = True,
     jitter: int = 0,
+    catchup_window_seconds: int | None = None,
 ) -> ActivityDefinition:
     return ActivityDefinition(
         id=uuid.uuid4(),
@@ -46,6 +47,7 @@ def _make_defn(
             cron_expression=cron,
             misfire_policy=misfire_policy,
             jitter_seconds=jitter,
+            catchup_window_seconds=catchup_window_seconds,
         ),
     )
 
@@ -186,6 +188,76 @@ async def test_misfire_policy_compress_sets_overlap_buffer_one(env: WorkflowEnvi
     await delete_schedule(env.client, defn.id)
 
 
+# ── ACTIVITY-WP-0014: explicit run-miss policies + catchup window ────────────
+
+@pytest.mark.asyncio
+async def test_skip_sets_short_catchup_window(env: WorkflowEnvironment) -> None:
+    """skip = run on trigger or skip: tiny grace window, no real recovery."""
+    defn = _make_defn(misfire_policy="skip")
+    await upsert_schedule(env.client, defn)
+
+    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
+    assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.SKIP
+    assert desc.schedule.policy.catchup_window == timedelta(seconds=60)
+
+    await delete_schedule(env.client, defn.id)
+
+
+@pytest.mark.asyncio
+async def test_catchup_all_recovers_full_window(env: WorkflowEnvironment) -> None:
+    """catchup_all = recover every missed fire: long window, BUFFER_ALL."""
+    defn = _make_defn(misfire_policy="catchup_all")
+    await upsert_schedule(env.client, defn)
+
+    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
+    assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.BUFFER_ALL
+    assert desc.schedule.policy.catchup_window == timedelta(days=365)
+
+    await delete_schedule(env.client, defn.id)
+
+
+@pytest.mark.asyncio
+async def test_catchup_latest_does_not_accumulate(env: WorkflowEnvironment) -> None:
+    """catchup_latest = recover only the most recent missed fire: BUFFER_ONE."""
+    defn = _make_defn(misfire_policy="catchup_latest")
+    await upsert_schedule(env.client, defn)
+
+    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
+    assert desc.schedule.policy.overlap == ScheduleOverlapPolicy.BUFFER_ONE
+    assert desc.schedule.policy.catchup_window == timedelta(hours=24)
+
+    await delete_schedule(env.client, defn.id)
+
+
+@pytest.mark.asyncio
+async def test_legacy_aliases_map_to_explicit_policies(env: WorkflowEnvironment) -> None:
+    """Legacy catchup/compress keep working and pick up the new catchup windows."""
+    catchup = _make_defn(misfire_policy="catchup")
+    compress = _make_defn(misfire_policy="compress")
+    await upsert_schedule(env.client, catchup)
+    await upsert_schedule(env.client, compress)
+
+    d1 = await env.client.get_schedule_handle(schedule_id(catchup.id)).describe()
+    d2 = await env.client.get_schedule_handle(schedule_id(compress.id)).describe()
+    assert d1.schedule.policy.catchup_window == timedelta(days=365)
+    assert d2.schedule.policy.catchup_window == timedelta(hours=24)
+
+    await delete_schedule(env.client, catchup.id)
+    await delete_schedule(env.client, compress.id)
+
+
+@pytest.mark.asyncio
+async def test_explicit_catchup_window_override(env: WorkflowEnvironment) -> None:
+    """An explicit catchup_window_seconds overrides the per-policy default."""
+    defn = _make_defn(misfire_policy="skip", catchup_window_seconds=7200)
+    await upsert_schedule(env.client, defn)
+
+    desc = await env.client.get_schedule_handle(schedule_id(defn.id)).describe()
+    assert desc.schedule.policy.catchup_window == timedelta(hours=2)
+
+    await delete_schedule(env.client, defn.id)
+
+
 @pytest.mark.asyncio
 async def test_schedule_smoke_test_creates_one_shot_schedule(
     env: WorkflowEnvironment,
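The policy tests in this hunk imply a small lookup from `misfire_policy` to an overlap policy plus a catchup window, with legacy aliases and an optional override. The sketch below restates that mapping in isolation. `Overlap` is a local stand-in for Temporal's `ScheduleOverlapPolicy`, and `resolve_policy`, `_POLICY_DEFAULTS`, and `_LEGACY_ALIASES` are hypothetical names; that the legacy `catchup` alias also inherits `BUFFER_ALL` is an assumption, since the tests only assert its window.

```python
from __future__ import annotations

from datetime import timedelta
from enum import Enum


class Overlap(Enum):
    """Local stand-in for temporalio.client.ScheduleOverlapPolicy."""

    SKIP = "skip"
    BUFFER_ONE = "buffer_one"
    BUFFER_ALL = "buffer_all"


# Per-policy defaults as asserted by the tests in this hunk.
_POLICY_DEFAULTS: dict[str, tuple[Overlap, timedelta]] = {
    "skip": (Overlap.SKIP, timedelta(seconds=60)),
    "catchup_latest": (Overlap.BUFFER_ONE, timedelta(hours=24)),
    "catchup_all": (Overlap.BUFFER_ALL, timedelta(days=365)),
}

# Legacy names resolve to the explicit policies (assumed alias targets).
_LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}


def resolve_policy(
    misfire_policy: str,
    catchup_window_seconds: int | None = None,
) -> tuple[Overlap, timedelta]:
    """Return (overlap policy, catchup window) for a misfire policy name."""
    name = _LEGACY_ALIASES.get(misfire_policy, misfire_policy)
    overlap, window = _POLICY_DEFAULTS[name]
    # An explicit window always wins over the per-policy default.
    if catchup_window_seconds is not None:
        window = timedelta(seconds=catchup_window_seconds)
    return overlap, window


# Usage: "skip" with an explicit two-hour window keeps SKIP overlap.
overlap, window = resolve_policy("skip", catchup_window_seconds=7200)
print(overlap, window)
```

Resolving aliases before the lookup keeps a single table of defaults, so a change to a window only has to be made in one place.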
@@ -407,6 +407,70 @@ def test_recently_on_scope_hourly_failure_bubbles(monkeypatch) -> None:
         StateHubContextResolver().resolve("recently_on_scope_hourly", None, {"range": "1h"})
 
 
+def test_consistency_sweep_remote_all_posts_batch(monkeypatch) -> None:
+    calls: list[dict[str, Any]] = []
+
+    def fake_post(url: str, **kwargs: Any) -> DummyResponse:
+        calls.append({"url": url, **kwargs})
+        return DummyResponse(
+            {
+                "exit_code": 0,
+                "lock_skipped": False,
+                "repos_processed": [{"repo_slug": "state-hub", "result": "pass"}],
+                "skipped_clean": ["quiet-repo"],
+                "skipped_missing": [],
+                "skipped_budget": [],
+            }
+        )
+
+    monkeypatch.setenv("STATE_HUB_URL", "http://state-hub.test/")
+    monkeypatch.setattr(httpx, "post", fake_post)
+
+    result = StateHubContextResolver().resolve(
+        "consistency_sweep_remote_all",
+        None,
+        {"max_seconds": 300, "source": "activity-core", "required": True},
+    )
+
+    assert result["exit_code"] == 0
+    assert result["repos_processed"][0]["repo_slug"] == "state-hub"
+    assert calls == [
+        {
+            "url": "http://state-hub.test/consistency/sweep/remote-all",
+            "json": {"max_seconds": 300, "source": "activity-core"},
+            "timeout": 330.0,
+        }
+    ]
+
+
+def test_consistency_sweep_remote_all_failure_bubbles(monkeypatch) -> None:
+    def fake_post(url: str, **kwargs: Any) -> DummyResponse:
+        raise httpx.ConnectError("offline")
+
+    monkeypatch.setattr(httpx, "post", fake_post)
+
+    with pytest.raises(httpx.ConnectError):
+        StateHubContextResolver().resolve(
+            "consistency_sweep_remote_all",
+            None,
+            {"max_seconds": 300},
+        )
+
+
+def test_consistency_sweep_remote_all_rejects_empty_response(monkeypatch) -> None:
+    def fake_post(url: str, **kwargs: Any) -> DummyResponse:
+        return DummyResponse({})
+
+    monkeypatch.setattr(httpx, "post", fake_post)
+
+    with pytest.raises(RuntimeError, match="missing required key"):
+        StateHubContextResolver().resolve(
+            "consistency_sweep_remote_all",
+            None,
+            {"max_seconds": 300},
+        )
+
+
 def test_recently_on_scope_hourly_rejects_empty_response(monkeypatch) -> None:
     def fake_post(url: str, **kwargs: Any) -> DummyResponse:
         return DummyResponse({})
81
tests/test_state_hub_write.py
Normal file
@@ -0,0 +1,81 @@
+"""ACTIVITY-WP-0014 T05: idempotency-keyed State Hub writes."""
+
+from __future__ import annotations
+
+import httpx
+import pytest
+
+from activity_core import report_sinks
+from activity_core.state_hub_write import (
+    IDEMPOTENCY_HEADER,
+    idempotency_headers,
+    idempotency_key,
+)
+
+
+def test_key_is_stable_and_deterministic() -> None:
+    a = idempotency_key("run1", "daily-triage-report", "daily_triage")
+    b = idempotency_key("run1", "daily-triage-report", "daily_triage")
+    assert a == b == "run1:daily-triage-report:daily_triage"
+
+
+def test_key_shape_stable_with_missing_parts() -> None:
+    assert idempotency_key("run1", None, "daily_triage") == "run1::daily_triage"
+
+
+def test_key_sanitizes_control_and_whitespace() -> None:
+    key = idempotency_key("run 1", "a\tb", "x\n")
+    assert "\t" not in key and "\n" not in key and " " not in key
+
+
+def test_headers_carry_the_key() -> None:
+    headers = idempotency_headers("run1", "i", "e")
+    assert headers == {IDEMPOTENCY_HEADER: "run1:i:e"}
+
+
+def test_distinct_identities_get_distinct_keys() -> None:
+    assert idempotency_key("r", "i", "daily_triage") != idempotency_key(
+        "r", "i", "schedule_miss"
+    )
+
+
+def test_progress_exists_is_best_effort_on_connection_error(monkeypatch) -> None:
+    """A down State Hub must not hard-fail the dedup read; it returns False so the
+    keyed write can still proceed."""
+
+    def _boom(*args, **kwargs):
+        raise httpx.ConnectError("Connection refused")
+
+    monkeypatch.setattr(report_sinks.httpx, "get", _boom)
+    assert (
+        report_sinks._progress_exists(
+            "http://127.0.0.1:8000", "run1", "daily-triage-report", "daily_triage"
+        )
+        is False
+    )
+
+
+def test_report_sink_post_sends_idempotency_header(monkeypatch) -> None:
+    """The state-hub-progress write carries a stable Idempotency-Key header."""
+    captured: dict[str, object] = {}
+
+    monkeypatch.setattr(report_sinks, "_progress_exists", lambda *a, **k: False)
+
+    class _Resp:
+        def raise_for_status(self) -> None: ...
+        def json(self) -> dict[str, str]:
+            return {"id": "pid-1"}
+
+    def _capture_post(url, json, headers, timeout):  # noqa: A002
+        captured["headers"] = headers
+        return _Resp()
+
+    monkeypatch.setattr(report_sinks.httpx, "post", _capture_post)
+
+    payload = {"run_id": "run1", "activity_id": "act1", "scheduled_for": None}
+    report_entry = {"instruction_id": "daily-triage-report", "report": {"summary": "s"}}
+    sink = {"event_type": "daily_triage"}
+
+    result = report_sinks._post_state_hub_progress(payload, report_entry, sink)
+    assert result["status"] == "posted"
+    assert captured["headers"][IDEMPOTENCY_HEADER] == "run1:daily-triage-report:daily_triage"
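The key tests above describe a colon-joined identity of run, instruction, and event type that keeps its shape when a part is missing and never contains whitespace or control characters. A minimal sketch that satisfies those assertions follows. The header value `"Idempotency-Key"` is taken from the docstring wording, while the `_clean` helper and the choice of `_` as the replacement character are assumptions; the real `activity_core.state_hub_write` may sanitize differently.

```python
from __future__ import annotations

import re

IDEMPOTENCY_HEADER = "Idempotency-Key"  # assumed constant value

# Whitespace and ASCII control characters are not allowed inside a key part.
_UNSAFE = re.compile(r"[\s\x00-\x1f\x7f]+")


def _clean(part: str | None) -> str:
    """Hypothetical helper: empty string for None, unsafe runs become '_'."""
    if part is None:
        return ""
    return _UNSAFE.sub("_", str(part))


def idempotency_key(
    run_id: str | None,
    instruction_id: str | None,
    event_type: str | None,
) -> str:
    # Always three colon-separated slots, so a missing part leaves an empty slot
    # instead of shifting the others.
    return ":".join(_clean(p) for p in (run_id, instruction_id, event_type))


def idempotency_headers(
    run_id: str | None,
    instruction_id: str | None,
    event_type: str | None,
) -> dict[str, str]:
    return {IDEMPOTENCY_HEADER: idempotency_key(run_id, instruction_id, event_type)}


# Usage: the same identity always yields the same header.
headers = idempotency_headers("run1", "daily-triage-report", "daily_triage")
print(headers)
```

Because the key is derived only from stable identifiers and not from timestamps or payload content, a retried write for the same run, instruction, and event type carries the same key and can be deduplicated by the receiver.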
126
tests/test_sync_schedules.py
Normal file
@@ -0,0 +1,126 @@
+from __future__ import annotations
+
+import uuid
+from datetime import datetime, timezone
+from types import SimpleNamespace
+from typing import Any
+
+import pytest
+
+from activity_core import sync_schedules
+
+
+def _row(
+    *,
+    activity_id: uuid.UUID,
+    enabled: bool,
+    trigger_config: dict[str, Any],
+) -> SimpleNamespace:
+    return SimpleNamespace(
+        id=activity_id,
+        name=f"definition-{activity_id}",
+        enabled=enabled,
+        trigger_config=trigger_config,
+        context_sources=[],
+        task_templates=[],
+        dedupe_key_strategy="skip",
+        version=1,
+    )
+
+
+@pytest.mark.asyncio
+async def test_sync_schedule_rows_reports_drift_counts_and_preserves_one_shots(
+    monkeypatch,
+) -> None:
+    new_id = uuid.uuid4()
+    disabled_old_id = uuid.uuid4()
+    one_shot_id = uuid.uuid4()
+    orphan_id = uuid.uuid4()
+    upserted: list[tuple[uuid.UUID, bool, str]] = []
+    deleted: list[str] = []
+
+    async def fake_upsert_schedule(client: object, defn: object) -> None:
+        upserted.append((
+            defn.id,
+            defn.enabled,
+            defn.trigger_config.trigger_type,
+        ))
+
+    async def fake_list_schedules(client: object) -> list[dict[str, str]]:
+        return [
+            {
+                "schedule_id": f"activity-schedule-{disabled_old_id}",
+                "activity_id": str(disabled_old_id),
+            },
+            {
+                "schedule_id": f"activity-schedule-{one_shot_id}-once",
+                "activity_id": f"{one_shot_id}-once",
+            },
+            {
+                "schedule_id": f"activity-schedule-{orphan_id}",
+                "activity_id": str(orphan_id),
+            },
+        ]
+
+    async def fake_delete_schedule(client: object, activity_id: str) -> None:
+        deleted.append(activity_id)
+
+    monkeypatch.setattr(sync_schedules, "upsert_schedule", fake_upsert_schedule)
+    monkeypatch.setattr(sync_schedules, "list_schedules", fake_list_schedules)
+    monkeypatch.setattr(sync_schedules, "delete_schedule", fake_delete_schedule)
+
+    result = await sync_schedules.sync_schedule_rows(
+        object(),
+        [
+            _row(
+                activity_id=new_id,
+                enabled=True,
+                trigger_config={
+                    "trigger_type": "cron",
+                    "cron_expression": "20 7 * * *",
+                    "timezone": "Europe/Berlin",
+                    "misfire_policy": "skip",
+                },
+            ),
+            _row(
+                activity_id=disabled_old_id,
+                enabled=False,
+                trigger_config={
+                    "trigger_type": "cron",
+                    "cron_expression": "20 * * * *",
+                    "timezone": "Europe/Berlin",
+                    "misfire_policy": "skip",
+                },
+            ),
+            _row(
+                activity_id=one_shot_id,
+                enabled=True,
+                trigger_config={
+                    "trigger_type": "scheduled",
+                    "at": datetime(2026, 6, 19, 8, 0, tzinfo=timezone.utc),
+                    "timezone": "UTC",
+                },
+            ),
+            _row(
+                activity_id=uuid.uuid4(),
+                enabled=True,
+                trigger_config={
+                    "trigger_type": "event",
+                    "event_type": "kaizen.metrics.recorded",
+                    "filters": {},
+                },
+            ),
+        ],
+    )
+
+    assert result.to_dict() == {
+        "upserted": 2,
+        "paused": 1,
+        "deleted_orphans": 1,
+    }
+    assert upserted == [
+        (new_id, True, "cron"),
+        (disabled_old_id, False, "cron"),
+        (one_shot_id, True, "scheduled"),
+    ]
+    assert deleted == [str(orphan_id)]
134
tests/test_sync_service.py
Normal file
@@ -0,0 +1,134 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from activity_core import sync_service
|
||||||
|
from activity_core.sync_schedules import ScheduleSyncResult
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.asyncio
|
||||||
|
async def test_run_sync_runs_requested_sections(monkeypatch) -> None:
|
||||||
|
calls: list[str] = []
|
||||||
|
|
||||||
|
async def fake_definitions(session_factory: object) -> int:
|
||||||
|
calls.append("definitions")
|
||||||
|
return 2
|
||||||
|
|
||||||
|
async def fake_event_types(session_factory: object) -> int:
|
||||||
|
calls.append("event_types")
|
||||||
|
return 5
|
||||||
|
|
||||||
|
async def fake_schedules(
|
||||||
|
temporal_client: object,
|
||||||
|
session_factory: object,
|
||||||
|
) -> ScheduleSyncResult:
|
||||||
|
calls.append("schedules")
|
||||||
|
        return ScheduleSyncResult(upserted=3, paused=1, deleted_orphans=2)

    monkeypatch.setattr(sync_service, "sync_activity_definitions", fake_definitions)
    monkeypatch.setattr(sync_service, "sync_event_types", fake_event_types)
    monkeypatch.setattr(sync_service, "sync_with_session_factory", fake_schedules)

    result = await sync_service.run_sync(
        session_factory=object(),
        temporal_client=object(),
        definitions=True,
        schedules=True,
        event_types=True,
    )

    assert calls == ["definitions", "event_types", "schedules"]
    assert result["ok"] is True
    assert result["ran"] == {
        "definitions": True,
        "schedules": True,
        "event_types": True,
    }
    assert result["definitions"] == {"synced": 2}
    assert result["event_types"] == {"synced": 5}
    assert result["schedules"] == {
        "upserted": 3,
        "paused": 1,
        "deleted_orphans": 2,
    }
    assert result["errors"] == []


@pytest.mark.asyncio
async def test_run_sync_collects_errors_and_continues(monkeypatch) -> None:
    calls: list[str] = []

    async def failing_definitions(session_factory: object) -> int:
        calls.append("definitions")
        raise RuntimeError("definition parse failed")

    async def fake_schedules(
        temporal_client: object,
        session_factory: object,
    ) -> ScheduleSyncResult:
        calls.append("schedules")
        return ScheduleSyncResult(upserted=1)

    monkeypatch.setattr(
        sync_service,
        "sync_activity_definitions",
        failing_definitions,
    )
    monkeypatch.setattr(sync_service, "sync_with_session_factory", fake_schedules)

    result = await sync_service.run_sync(
        session_factory=object(),
        temporal_client=object(),
        definitions=True,
        schedules=True,
        event_types=False,
    )

    assert calls == ["definitions", "schedules"]
    assert result["ok"] is False
    assert result["definitions"] == {"synced": 0}
    assert result["schedules"]["upserted"] == 1
    assert result["errors"] == [
        {
            "stage": "definitions",
            "type": "RuntimeError",
            "message": "definition parse failed",
        }
    ]


@pytest.mark.asyncio
async def test_run_sync_reports_missing_temporal_client_for_schedules() -> None:
    result = await sync_service.run_sync(
        session_factory=object(),
        temporal_client=None,
        definitions=False,
        schedules=True,
        event_types=False,
    )

    assert result["ok"] is False
    assert result["errors"] == [
        {
            "stage": "schedules",
            "type": "RuntimeError",
            "message": "Temporal client is required for schedule sync",
        }
    ]


def test_record_error_bounds_error_count() -> None:
    result: dict[str, Any] = {
        "ok": True,
        "errors": [],
    }

    for i in range(25):
        sync_service._record_error(result, "stage", RuntimeError(f"boom {i}"))

    assert result["ok"] is False
    assert len(result["errors"]) == 20
    assert result["errors"][0]["message"] == "boom 0"
    assert result["errors"][-1]["message"] == "boom 19"
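
The last test above pins down the observable contract of the bounded error list: the first 20 errors are kept, later ones are dropped, and `ok` flips to `False`. A minimal sketch of a helper that would satisfy those assertions follows. The name `record_error_sketch` and the constant `MAX_ERRORS` are hypothetical; the real `sync_service._record_error` is not shown in this diff and may differ.

```python
from typing import Any

# Assumption: the cap of 20 is inferred from the test's expected length,
# not read from the actual module.
MAX_ERRORS = 20


def record_error_sketch(result: dict[str, Any], stage: str, exc: BaseException) -> None:
    """Mark the run as failed and keep at most MAX_ERRORS error entries."""
    result["ok"] = False
    errors = result.setdefault("errors", [])
    if len(errors) >= MAX_ERRORS:
        return  # later errors are dropped so the payload stays bounded
    errors.append(
        {"stage": stage, "type": type(exc).__name__, "message": str(exc)}
    )
```

Keeping the earliest entries rather than the latest matches the `boom 0` / `boom 19` assertions; a ring buffer that kept the most recent errors would fail them.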
|
||||||
@@ -4,11 +4,11 @@ type: workplan
|
|||||||
title: "Post-triage operational hardening"
|
title: "Post-triage operational hardening"
|
||||||
domain: custodian
|
domain: custodian
|
||||||
repo: activity-core
|
repo: activity-core
|
||||||
status: active
|
status: finished
|
||||||
owner: codex
|
owner: codex
|
||||||
topic_slug: custodian
|
topic_slug: custodian
|
||||||
created: "2026-06-03"
|
created: "2026-06-03"
|
||||||
updated: "2026-06-16"
|
updated: "2026-06-30"
|
||||||
state_hub_workstream_id: "5646e13a-13af-4724-bca6-3c0d86f96733"
|
state_hub_workstream_id: "5646e13a-13af-4724-bca6-3c0d86f96733"
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -104,7 +104,7 @@ and emitted a validated `daily_triage` report plus working-memory note.
|
|||||||
|
|
||||||
```task
|
```task
|
||||||
id: ACTIVITY-WP-0006-T03
|
id: ACTIVITY-WP-0006-T03
|
||||||
status: wait
|
status: done
|
||||||
priority: medium
|
priority: medium
|
||||||
state_hub_task_id: "7cbf0a35-71a1-47ac-afc2-f51ad2180fd0"
|
state_hub_task_id: "7cbf0a35-71a1-47ac-afc2-f51ad2180fd0"
|
||||||
```
|
```
|
||||||
@@ -174,6 +174,56 @@ the worker consumes the configured URL, then produce schema-valid daily triage
|
|||||||
evidence and three clean scheduled runs. This narrower path is tracked in
|
evidence and three clean scheduled runs. This narrower path is tracked in
|
||||||
`ACTIVITY-WP-0010`.
|
`ACTIVITY-WP-0010`.
|
||||||
|
|
||||||
|
2026-06-25: Consecutive-run streak resumed. State Hub `daily_triage` progress
|
||||||
|
events from author `activity-core` fired on time on **2026-06-24 05:20:56Z** and
|
||||||
|
**2026-06-25 05:20:47Z** (07:20 Berlin), both delivered, no misfires. That is two
|
||||||
|
clean consecutive scheduled runs. **RECHECK 2026-06-26 (after 05:20Z):** confirm
|
||||||
|
the 06-26 scheduled `daily_triage` event delivered. If clean, that completes three
|
||||||
|
clean consecutive scheduled runs (06-24 / 06-25 / 06-26) — record the calibration
|
||||||
|
result in State Hub and close T03. If the 06-26 run misfires or is missing, the
|
||||||
|
streak resets and T03 stays `wait`. Flag deliberately kept in-repo (agent-agnostic)
|
||||||
|
rather than tied to any single coding agent's scheduler.
|
||||||
|
|
||||||
|
2026-06-26 recheck outcome: **streak reset at two.** The 06-26 scheduled run fired
|
||||||
|
on time (`daily_triage` event 05:20:57Z) — scheduling layer healthy, no misfire —
|
||||||
|
but the `daily-triage-report` instruction output **failed schema validation**:
|
||||||
|
`Expecting ',' delimiter: line 136 column 22 (char 5268)`. The model produced a
|
||||||
|
long ranked WSJF recommendation list (reached rank 7+ with nested `wsjf` objects)
|
||||||
|
whose JSON broke ~char 5268; only a bounded 4000-char preview is preserved in the
|
||||||
|
State Hub event, so the exact offending token needs the runtime llm-connect log.
|
||||||
|
This is an LLM-output-quality failure (tracked by `ACTIVITY-WP-0010`), not a
|
||||||
|
runtime/projection failure. T03 stays `wait`; three clean consecutive scheduled
|
||||||
|
runs not yet achieved (06-24 ✅, 06-25 ✅, 06-26 ✗-validation).
|
||||||
|
|
||||||
|
2026-06-27 recheck outcome: streak remains reset. The scheduled run fired and
|
||||||
|
wrote State Hub progress plus working memory, but daily-triage-report failed
|
||||||
|
validation again with an unterminated string around char 5246. This confirms the
|
||||||
|
runner/sink path is alive and the active blocker is live deployment of the
|
||||||
|
ACTIVITY-WP-0016 output-robustness bundle and runtime prompt/token changes, not
|
||||||
|
a missing schedule. T03 stays wait until a post-deployment smoke passes and three
|
||||||
|
new clean scheduled runs are collected.
|
||||||
|
|
||||||
|
2026-06-30 early checkpoint: two new clean scheduled runs exist after the
|
||||||
|
validation failures. State Hub daily_triage progress shows 2026-06-28
|
||||||
|
05:20:51Z run `6a44d6dd-3f02-53f2-a5d8-d42b76b0ef98` and 2026-06-29
|
||||||
|
05:20:49Z run `1dfb47c9-07bf-551b-b778-1d21a40bd95c`, both with
|
||||||
|
`output_validated=true` and working-memory notes written. The current local time
|
||||||
|
was 2026-06-30 01:37 Europe/Berlin, before the expected 07:20 Berlin scheduled
|
||||||
|
fire, so the three-clean-run gate cannot close yet. Recheck after 2026-06-30
|
||||||
|
05:20Z; if that scheduled run validates, the clean streak is 06-28 / 06-29 /
|
||||||
|
06-30 and T03 can close with calibration feedback.
|
||||||
|
|
||||||
|
2026-06-30 closeout: the 07:20 Berlin scheduled run fired at 05:20:50Z as run
|
||||||
|
`ac3d71a0-2f8f-50df-b3ce-7c60c2abb5c5` with `output_validated=true` and a
|
||||||
|
working-memory note written. The post-failure clean streak is now complete:
|
||||||
|
2026-06-28 (`6a44d6dd`), 2026-06-29 (`1dfb47c9`), and 2026-06-30 (`ac3d71a0`).
|
||||||
|
Calibration feedback: the scheduler, worker, llm-connect route, State Hub sink,
|
||||||
|
and working-memory sink are stable again; the recommendations were operationally
|
||||||
|
useful but too dense at 10 items, repeatedly emphasizing human-dependency and
|
||||||
|
infrastructure-unblock work. ACTIVITY-WP-0016 now owns the density/contract fix:
|
||||||
|
Railiance runtime projection was aligned to a top-7 contract so the next live
|
||||||
|
run can prove the bounded output posture. T03 is done.
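
The three-clean-run gate used throughout these notes can be expressed as a trailing-streak count. The sketch below is illustrative only: `ScheduledRun` and `trailing_clean_streak` are hypothetical names, and it assumes one scheduled run per calendar day with a single `output_validated` flag standing in for "fired on time and passed schema validation".

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ScheduledRun:
    day: date
    output_validated: bool


def trailing_clean_streak(runs: list[ScheduledRun]) -> int:
    """Count consecutive clean daily runs, walking back from the newest run."""
    streak = 0
    expected_day = None
    for run in sorted(runs, key=lambda r: r.day, reverse=True):
        if not run.output_validated:
            break  # a validation failure resets the streak
        if expected_day is not None and run.day != expected_day:
            break  # a missing day also ends the streak
        streak += 1
        expected_day = run.day - timedelta(days=1)
    return streak


# Run history as recorded in the notes above.
history = [
    ScheduledRun(date(2026, 6, 24), True),
    ScheduledRun(date(2026, 6, 25), True),
    ScheduledRun(date(2026, 6, 26), False),
    ScheduledRun(date(2026, 6, 27), False),
    ScheduledRun(date(2026, 6, 28), True),
    ScheduledRun(date(2026, 6, 29), True),
    ScheduledRun(date(2026, 6, 30), True),
]
gate_closed = trailing_clean_streak(history) >= 3
```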
|
||||||
|
|
||||||
## Rule Action Contract Documentation
|
## Rule Action Contract Documentation
|
||||||
|
|
||||||
```task
|
```task
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ status: blocked
|
|||||||
owner: codex
|
owner: codex
|
||||||
topic_slug: custodian
|
topic_slug: custodian
|
||||||
created: "2026-06-18"
|
created: "2026-06-18"
|
||||||
updated: "2026-06-18"
|
updated: "2026-06-27"
|
||||||
state_hub_workstream_id: "f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9"
|
state_hub_workstream_id: "f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9"
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -87,7 +87,7 @@ reported 9 passed.
|
|||||||
|
|
||||||
```task
|
```task
|
||||||
id: ACTIVITY-WP-0010-T02
|
id: ACTIVITY-WP-0010-T02
|
||||||
status: wait
|
status: done
|
||||||
priority: high
|
priority: high
|
||||||
state_hub_task_id: "23545ddc-926b-485a-8535-5cc11e01134a"
|
state_hub_task_id: "23545ddc-926b-485a-8535-5cc11e01134a"
|
||||||
```
|
```
|
||||||
@@ -107,6 +107,30 @@ Current wait reason: this is Railiance/operator-owned live cluster work. State
|
|||||||
Hub handoff message `9a074b7c-4b87-4e3c-a6bf-e1fe5580daa8` asks
|
Hub handoff message `9a074b7c-4b87-4e3c-a6bf-e1fe5580daa8` asks
|
||||||
`railiance-cluster` to reconcile the updated config and smoke it.
|
`railiance-cluster` to reconcile the updated config and smoke it.
|
||||||
|
|
||||||
|
2026-06-19 recheck:
|
||||||
|
|
||||||
|
- Deployed `llm-connect` into the `activity-core` namespace on `railiance01`
|
||||||
|
(the cluster that runs `actcore-worker`). `coulombcore` had llm-connect only;
|
||||||
|
the in-cluster Service URL is cluster-local.
|
||||||
|
- `actcore-runtime-config` already exposed the verified URL and timeout;
|
||||||
|
`deployment/actcore-worker` was restarted and now reports
|
||||||
|
`LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080`.
|
||||||
|
- `llm-connect-provider-secrets` reports `DATA 1`; no Secret values were
|
||||||
|
inspected.
|
||||||
|
- Worker health probe to llm-connect `/health` returns `{"status": "ok"}`.
|
||||||
|
- `actcore-state-hub-bridge` remains `0/1` Ready with upstream timeouts, so T02
|
||||||
|
is not fully closed until the node-local State Hub tunnel is restored.
|
||||||
|
|
||||||
|
2026-06-27 recheck:
|
||||||
|
|
||||||
|
- Superseded by real scheduled runner evidence: State Hub daily_triage events on
|
||||||
|
2026-06-24, 2026-06-25, 2026-06-26, and 2026-06-27 all reached State Hub and
|
||||||
|
wrote working-memory notes. The bridge/sink is therefore reachable for the
|
||||||
|
live runner.
|
||||||
|
- 2026-06-24 and 2026-06-25 were schema-valid; 2026-06-26 and 2026-06-27 failed
|
||||||
|
output validation after calling llm-connect. That moves the active blocker out
|
||||||
|
of T02 and into the WP-0016 live bundle/smoke lane. Marking T02 done.
|
||||||
|
|
||||||
## Run Daily Triage Fixture Smoke
|
## Run Daily Triage Fixture Smoke
|
||||||
|
|
||||||
```task
|
```task
|
||||||
@@ -128,6 +152,27 @@ Done when:
|
|||||||
detail;
|
detail;
|
||||||
- `scripts/verify_daily_triage.py` reports the smoke/manual run as present.
|
- `scripts/verify_daily_triage.py` reports the smoke/manual run as present.
|
||||||
|
|
||||||
|
2026-06-19 recheck:
|
||||||
|
|
||||||
|
- In-namespace llm-connect fixture smoke on `railiance01` passed:
|
||||||
|
`smoke: pass health=ok latency_seconds=1.681 recommendations=1`.
|
||||||
|
- Manual `POST /activity-definitions/6fca51fa-387a-4fd0-bc4e-d62c29eb859a/trigger`
|
||||||
|
reached llm-connect, but the workflow failed at `persist_instruction_reports`
|
||||||
|
with `state-hub-progress` sink `Connection refused` while
|
||||||
|
`actcore-state-hub-bridge` is unhealthy.
|
||||||
|
- T03 therefore remains open until State Hub bridge reachability is restored and
|
||||||
|
a run emits non-secret `daily_triage` progress with `output_validated=true`.
|
||||||
|
|
||||||
|
2026-06-27 recheck:
|
||||||
|
|
||||||
|
- Scheduled runs on 2026-06-24 and 2026-06-25 satisfy the non-secret smoke
|
||||||
|
evidence for llm-connect call, State Hub progress with output_validated=true,
|
||||||
|
and working-memory note creation.
|
||||||
|
- Kept T03 at progress rather than done because the workstation did not run the
|
||||||
|
live verifier against Temporal/activity-core DB, and the smoke must be repeated
|
||||||
|
after the WP-0016 code/schema/runtime-prompt deployment due to the 2026-06-26 and
|
||||||
|
2026-06-27 malformed-output failures.
|
||||||
|
|
||||||
## Collect Three Clean Scheduled Runs
|
## Collect Three Clean Scheduled Runs
|
||||||
|
|
||||||
```task
|
```task
|
||||||
@@ -151,6 +196,14 @@ Done when:
|
|||||||
- `ACTIVITY-WP-0006-T03` and `ACTIVITY-WP-0009-T01` can move from `wait` to
|
- `ACTIVITY-WP-0006-T03` and `ACTIVITY-WP-0009-T01` can move from `wait` to
|
||||||
`done`.
|
`done`.
|
||||||
|
|
||||||
|
2026-06-27 recheck:
|
||||||
|
|
||||||
|
- Three-clean-run streak is reset. The latest sequence is 2026-06-24 clean,
|
||||||
|
2026-06-25 clean, 2026-06-26 validation_failed, 2026-06-27 validation_failed.
|
||||||
|
- Current pickup is to deploy ACTIVITY-WP-0016 code/schema together with the
|
||||||
|
Railiance runtime prompt and max_tokens changes, run a live smoke, then restart
|
||||||
|
the three-consecutive-scheduled-run gate from zero.
|
||||||
|
|
||||||
## Close Handoff State
|
## Close Handoff State
|
||||||
|
|
||||||
```task
|
```task
|
||||||
|
|||||||
@@ -4,11 +4,11 @@ type: workplan
|
|||||||
title: "Definition And Schedule Hot Reload"
|
title: "Definition And Schedule Hot Reload"
|
||||||
domain: custodian
|
domain: custodian
|
||||||
repo: activity-core
|
repo: activity-core
|
||||||
status: active
|
status: finished
|
||||||
owner: codex
|
owner: codex
|
||||||
topic_slug: custodian
|
topic_slug: custodian
|
||||||
created: "2026-06-18"
|
created: "2026-06-18"
|
||||||
updated: "2026-06-18"
|
updated: "2026-06-22"
|
||||||
state_hub_workstream_id: "8887075e-21ec-451b-b82b-cd81035c9ca5"
|
state_hub_workstream_id: "8887075e-21ec-451b-b82b-cd81035c9ca5"
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -39,7 +39,7 @@ a repo checkout manager or CI system.
|
|||||||
|
|
||||||
```task
|
```task
|
||||||
id: ACTIVITY-WP-0012-T01
|
id: ACTIVITY-WP-0012-T01
|
||||||
status: todo
|
status: done
|
||||||
priority: high
|
priority: high
|
||||||
state_hub_task_id: "53a7970b-7eec-47f5-ad30-bbd7c6271952"
|
state_hub_task_id: "53a7970b-7eec-47f5-ad30-bbd7c6271952"
|
||||||
```
|
```
|
||||||
@@ -57,11 +57,17 @@ Done when:
|
|||||||
- failures are collected into a bounded `errors[]` result while preserving the
|
- failures are collected into a bounded `errors[]` result while preserving the
|
||||||
current startup best-effort behavior.
|
current startup best-effort behavior.
|
||||||
|
|
||||||
|
2026-06-19: Completed. Added `activity_core.sync_service.run_sync`, which
|
||||||
|
orchestrates ActivityDefinition, event type, and schedule sync independently
|
||||||
|
from explicit DB session factory and Temporal client dependencies. Worker
|
||||||
|
startup now calls the shared service for definitions+schedules and logs bounded
|
||||||
|
stage errors while continuing startup.
|
||||||
|
|
||||||
## Add Admin Sync Endpoint
|
## Add Admin Sync Endpoint
|
||||||
|
|
||||||
```task
|
```task
|
||||||
id: ACTIVITY-WP-0012-T02
|
id: ACTIVITY-WP-0012-T02
|
||||||
status: todo
|
status: done
|
||||||
priority: high
|
priority: high
|
||||||
state_hub_task_id: "8697c761-15d1-4da0-b66b-d838218a2495"
|
state_hub_task_id: "8697c761-15d1-4da0-b66b-d838218a2495"
|
||||||
```
|
```
|
||||||
@@ -80,11 +86,17 @@ Done when:
|
|||||||
- endpoint tests cover definitions-only, schedules-only, all-sync, and failure
|
- endpoint tests cover definitions-only, schedules-only, all-sync, and failure
|
||||||
result behavior.
|
result behavior.
|
||||||
|
|
||||||
|
2026-06-19: Completed. Added `POST /admin/sync` with defaults
|
||||||
|
`definitions=true`, `schedules=true`, and `event_types=false`. The response
|
||||||
|
reports definition/event counts, schedule upsert/pause/orphan-delete counts, and
|
||||||
|
bounded `errors[]`. Tests cover definitions-only, schedules-only, all-sync, and
|
||||||
|
failure-result behavior.
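
As an illustration of the endpoint defaults described above, the following sketch builds the request URL with the same three flags. The helper name `admin_sync_url` and the example host are hypothetical; only the path `/admin/sync`, the parameter names, and their defaults come from this workplan.

```python
from urllib.parse import urlencode


def admin_sync_url(
    base_url: str,
    *,
    definitions: bool = True,
    schedules: bool = True,
    event_types: bool = False,
) -> str:
    """Return the POST target for a sync with the documented defaults."""
    query = urlencode(
        {
            "definitions": str(definitions).lower(),
            "schedules": str(schedules).lower(),
            "event_types": str(event_types).lower(),
        }
    )
    return f"{base_url.rstrip('/')}/admin/sync?{query}"


# Hypothetical in-cluster host; substitute the real actcore-api address.
default_url = admin_sync_url("http://actcore-api.example:8000")
```

Spelling out all three flags, even when they match the defaults, keeps operator-triggered syncs explicit in logs and shell history.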
|
||||||
|
|
||||||
## Preserve Schedule Drift Semantics
|
## Preserve Schedule Drift Semantics
|
||||||
|
|
||||||
```task
|
```task
|
||||||
id: ACTIVITY-WP-0012-T03
|
id: ACTIVITY-WP-0012-T03
|
||||||
status: todo
|
status: done
|
||||||
priority: high
|
priority: high
|
||||||
state_hub_task_id: "efeac412-632c-4c90-9428-bb575ac7a624"
|
state_hub_task_id: "efeac412-632c-4c90-9428-bb575ac7a624"
|
||||||
```
|
```
|
||||||
@@ -101,11 +113,18 @@ Done when:
|
|||||||
- regression tests demonstrate the Coulomb hourly-to-daily rename shape without
|
- regression tests demonstrate the Coulomb hourly-to-daily rename shape without
|
||||||
needing a worker restart.
|
needing a worker restart.
|
||||||
|
|
||||||
|
2026-06-19: Completed. `sync_schedules` now returns explicit counts for enabled
|
||||||
|
schedule upserts, disabled schedule pauses, and orphan deletes. Regression tests
|
||||||
|
cover the hourly-to-daily rename shape: a new enabled cron schedule is upserted,
|
||||||
|
the old disabled cron schedule is preserved as paused, unrelated orphan
|
||||||
|
schedules are deleted, event-triggered definitions do not create schedules, and
|
||||||
|
one-shot scheduled definitions are no longer mistaken for orphans.
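
The drift semantics above can be sketched as a pure planning step that classifies schedules before any Temporal call is made. Everything here is an assumption for illustration: the `Definition` and `SyncPlan` types, the trigger labels `"cron"`, `"one_shot"`, and `"event"`, and the function name are not taken from `sync_schedules` itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Definition:
    schedule_id: str
    trigger: str  # assumed labels: "cron", "one_shot", or "event"
    enabled: bool


@dataclass
class SyncPlan:
    upsert: list[str] = field(default_factory=list)
    pause: list[str] = field(default_factory=list)
    delete_orphans: list[str] = field(default_factory=list)


def plan_schedule_sync(
    definitions: list[Definition], existing_schedule_ids: set[str]
) -> SyncPlan:
    """Decide which schedules to upsert, pause, or delete as orphans."""
    plan = SyncPlan()
    owned: set[str] = set()
    for definition in definitions:
        if definition.trigger == "event":
            continue  # event-triggered definitions never create schedules
        owned.add(definition.schedule_id)  # one-shot schedules count as owned
        if definition.trigger != "cron":
            continue
        target = plan.upsert if definition.enabled else plan.pause
        target.append(definition.schedule_id)
    plan.delete_orphans = sorted(existing_schedule_ids - owned)
    return plan


# Hourly-to-daily rename shape: old cron disabled, new cron enabled.
plan = plan_schedule_sync(
    [
        Definition("hourly", "cron", enabled=False),
        Definition("daily", "cron", enabled=True),
        Definition("on-event", "event", enabled=True),
        Definition("once", "one_shot", enabled=True),
    ],
    existing_schedule_ids={"hourly", "once", "stray"},
)
```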
|
||||||
|
|
||||||
## Optional Background Sync Loop
|
## Optional Background Sync Loop
|
||||||
|
|
||||||
```task
|
```task
|
||||||
id: ACTIVITY-WP-0012-T04
|
id: ACTIVITY-WP-0012-T04
|
||||||
status: todo
|
status: done
|
||||||
priority: medium
|
priority: medium
|
||||||
state_hub_task_id: "d774087b-c51d-4444-8e90-bfef43765456"
|
state_hub_task_id: "d774087b-c51d-4444-8e90-bfef43765456"
|
||||||
```
|
```
|
||||||
@@ -121,11 +140,17 @@ Done when:
|
|||||||
last error summary;
|
last error summary;
|
||||||
- the loop does not block worker startup or workflow task processing.
|
- the loop does not block worker startup or workflow task processing.
|
||||||
|
|
||||||
|
2026-06-19: Completed by decision. v1 stays manual/operator-triggered through
|
||||||
|
`POST /admin/sync`; no background loop was added. The runbook records this
|
||||||
|
posture so customer definition changes stay explicit and the worker does not
|
||||||
|
start background repo scanning. A periodic loop remains a future option if live
|
||||||
|
operator use proves it is needed.
|
||||||
|
|
||||||
## Live No-Restart Smoke
|
## Live No-Restart Smoke
|
||||||
|
|
||||||
```task
|
```task
|
||||||
id: ACTIVITY-WP-0012-T05
|
id: ACTIVITY-WP-0012-T05
|
||||||
status: wait
|
status: done
|
||||||
priority: high
|
priority: high
|
||||||
state_hub_task_id: "68a0e22a-106a-4d21-9f39-c6279850cb5e"
|
state_hub_task_id: "68a0e22a-106a-4d21-9f39-c6279850cb5e"
|
||||||
```
|
```
|
||||||
@@ -141,5 +166,27 @@ Done when non-secret State Hub evidence shows:
|
|||||||
- event-triggered definitions still fire normally;
|
- event-triggered definitions still fire normally;
|
||||||
- rollback or repeat sync is idempotent.
|
- rollback or repeat sync is idempotent.
|
||||||
|
|
||||||
Current wait reason: this gate depends on the implementation tasks and a
|
2026-06-22: Completed on Railiance01 (`KUBECONFIG=~/.kube/config-hosteurope`).
|
||||||
cluster-owned smoke path.
|
|
||||||
|
Smoke target: disabled projection `ops-service-inventory-probes`
|
||||||
|
(`40d15a87-7ff6-4d8e-992c-37df15f95110`) in
|
||||||
|
`actcore-external-activity-definitions`.
|
||||||
|
|
||||||
|
Evidence:
|
||||||
|
|
||||||
|
- ConfigMap flip `enabled: false -> true` and cadence `15 * * * * -> 25 * * * *`,
|
||||||
|
then `POST /admin/sync?definitions=true&schedules=true` from `actcore-api`.
|
||||||
|
- DB after sync: `enabled=true`, `cron=25 * * * *`.
|
||||||
|
- Temporal schedule after sync: `paused=false`, calendar minute `25`.
|
||||||
|
- Repeat sync returned identical schedule counts
|
||||||
|
(`upserted=5`, `paused=1`, `deleted_orphans=0`) — idempotent.
|
||||||
|
- Rollback flip restored `enabled=false`, `cron=15 * * * *`, schedule
|
||||||
|
`paused=true`, calendar minute `15`.
|
||||||
|
- `actcore-worker` pod UID unchanged (`a68d6539-2bba-457e-a78a-39564002a980`,
|
||||||
|
started `2026-06-21T18:46:46Z`); `actcore-event-router` pod UID unchanged.
|
||||||
|
- Event-triggered definitions: none projected on Railiance01 today; hot DB
|
||||||
|
reload path for event definitions remains covered by T03 unit tests and an
|
||||||
|
unchanged event-router deployment.
|
||||||
|
|
||||||
|
Automation: `scripts/smoke_admin_sync_no_restart.py`. Runbook section added
|
||||||
|
under "Railiance01 no-restart smoke".
|
||||||
|
|||||||
@@ -0,0 +1,78 @@
|
|||||||
|
---
|
||||||
|
id: ACTIVITY-WP-0013
|
||||||
|
type: workplan
|
||||||
|
title: "Reuse Surface Report Gaps Resolver"
|
||||||
|
domain: custodian
|
||||||
|
repo: activity-core
|
||||||
|
status: finished
|
||||||
|
owner: codex
|
||||||
|
topic_slug: activity-core
|
||||||
|
created: "2026-06-18"
|
||||||
|
updated: "2026-06-18"
|
||||||
|
state_hub_workstream_id: "01e68dfd-b146-4aef-a575-2d3b178ca5c2"
|
||||||
|
---
|
||||||
|
|
||||||
|
# Reuse Surface Report Gaps Resolver
|
||||||
|
|
||||||
|
Implement the R2 handoff from kaizen-agentic (`bffa224c`) so the
|
||||||
|
`reuse_surface_report_gaps` shell context source populates
|
||||||
|
`context.gaps` for the Coulomb daily registry hygiene sweep.
|
||||||
|
|
||||||
|
## Register Shell Resolver Query
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0013-T01
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "a6e1fc5c-7b42-436d-914e-4d605cb6f329"
|
||||||
|
```
|
||||||
|
|
||||||
|
Add a dedicated reuse-surface context resolver module and register
|
||||||
|
`reuse_surface_report_gaps` on the `shell` resolver path while preserving
|
||||||
|
the existing kaizen shell query behavior.
|
||||||
|
|
||||||
|
## Implement Batch And Signal Semantics
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0013-T02
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "229cf285-8388-471d-95fd-08400db1553e"
|
||||||
|
```
|
||||||
|
|
||||||
|
Load the Coulomb rollout roster, select active repos with a persisted
|
||||||
|
round-robin cursor, resolve repo roots from State Hub host paths, run
|
||||||
|
`reuse-surface report gaps --format json`, and emit gap records for the
|
||||||
|
enabled registry hygiene signals.
|
||||||
|
|
||||||
|
## Cover Required And Optional Failure Modes
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0013-T03
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "85b5c7d4-40e1-4945-8ada-1dff2363c194"
|
||||||
|
```
|
||||||
|
|
||||||
|
Ensure missing required dependencies fail visibly while optional resolver
|
||||||
|
sources bind an empty `context.gaps` list. Add unit coverage for fixture
|
||||||
|
rollout data, mocked CLI JSON, resolver binding, and `hygiene_signal`
|
||||||
|
rule gating.
|
||||||
|
|
||||||
|
## Smoke Real Coulomb Rollout
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0013-T04
|
||||||
|
status: done
|
||||||
|
priority: medium
|
||||||
|
state_hub_task_id: "6a5446ed-b4ec-4693-b508-65415571d834"
|
||||||
|
```
|
||||||
|
|
||||||
|
Run a live resolver smoke against
|
||||||
|
`/home/worsch/coulomb-loop/loops/registry-hygiene/rollout.yaml` using a
|
||||||
|
temporary round-robin cursor. The real active rollout produced five gaps,
|
||||||
|
including one for `reuse-surface` with `hygiene_signal: stale_sbom`.
|
||||||
|
The smoke supplied `reuse_surface_bin:
|
||||||
|
/home/worsch/reuse-surface/.venv/bin/reuse-surface` and
|
||||||
|
`runner_host: bnt-lap001`; the worker environment or definition params must
|
||||||
|
provide equivalent values before enabling the production sweep.
|
||||||
194
workplans/ACTIVITY-WP-0014-schedule-misfire-robustness.md
Normal file
@@ -0,0 +1,194 @@
|
|||||||
|
---
|
||||||
|
id: ACTIVITY-WP-0014
|
||||||
|
type: workplan
|
||||||
|
title: "Schedule Misfire Robustness & Run-Miss Recovery Options"
|
||||||
|
domain: infotech
|
||||||
|
repo: activity-core
|
||||||
|
status: finished
|
||||||
|
owner: claude
|
||||||
|
topic_slug: activity-core
|
||||||
|
created: "2026-06-23"
|
||||||
|
updated: "2026-06-24"
|
||||||
|
status_note: "T01-T05 complete; beachhead-endpoint adoption split to ACTIVITY-WP-0015"
|
||||||
|
state_hub_workstream_id: "91b64686-5d17-4c86-bc9e-3d0ee6720cf5"
|
||||||
|
---
|
||||||
|
|
||||||
|
# Schedule Misfire Robustness & Run-Miss Recovery Options
|
||||||
|
|
||||||
|
Make cron-triggered ActivityDefinitions robust to missed fires (worker/Temporal
|
||||||
|
unavailable at trigger time) with explicit, per-definition recovery behaviour,
|
||||||
|
plus detection/alerting when a scheduled fire is missed.
|
||||||
|
|
||||||
|
## Motivation
|
||||||
|
|
||||||
|
On 2026-06-22 and 2026-06-23 the `daily-statehub-wsjf-triage` definition
|
||||||
|
(cron `20 7 * * *` Europe/Berlin, projected into the Railiance runtime ConfigMap
|
||||||
|
`actcore-external-activity-definitions`) produced **no `daily_triage` progress
|
||||||
|
event at all** — neither a success nor a `could not run; operator review
|
||||||
|
required` failure.
|
||||||
|
|
||||||
|
> **Corrected by T01 (2026-06-23).** The initial hypothesis below — that
|
||||||
|
> `_build_schedule()` never set `catchup_window`, so a short-default catchup
|
||||||
|
> window silently dropped the fire — was **disproven on the live cluster**. The
|
||||||
|
> Temporal schedule is healthy with `CatchupWindow 365d` (the server default) and
|
||||||
|
> `0 MissedCatchupWindow`. The real cause is that the run **fired and ran but
|
||||||
|
> failed at the report sink** with `Connection refused` posting to State Hub,
|
||||||
|
> because railiance01 reaches State Hub via a reverse tunnel back to the
|
||||||
|
> workstation, which is asleep at 07:20 Berlin. See the T01 findings and T05.
|
||||||
|
|
||||||
|
The trigger now originates entirely on **railiance01** (in-cluster Temporal
|
||||||
|
Schedule, ConfigMap-projected definition) and is **not** laptop-dependent — but
|
||||||
|
the triage's State Hub *data dependencies* (context resolution and report
|
||||||
|
delivery) still route back to the workstation State Hub.
|
||||||
|
|
||||||
|
This workplan still delivers worthwhile robustness — explicit run-miss recovery
|
||||||
|
policies (T02) and missed-fire detection (T03) — but the fix for *this* incident
|
||||||
|
is T05 (resilient sinks/resolvers + a workstation-independent State Hub endpoint).
|
||||||
|
|
||||||
|
## Desired run-miss options (from Bernd)
|
||||||
|
|
||||||
|
Three explicit, per-definition behaviours when a fire is missed:
|
||||||
|
|
||||||
|
1. **Run on trigger or skip** — never recover a missed fire.
|
||||||
|
2. **Run on trigger or later if missed** — recover **all** missed fires when back up.
|
||||||
|
3. **Run on trigger or later if missed, but skip if next trigger reached** —
|
||||||
|
recover only the **most recent** missed fire; do not accumulate a backlog.
|
||||||
|
|
||||||
|
Proposed mapping to a new `misfire_policy` value set (names open to review):
|
||||||
|
|
||||||
|
| Policy | Semantics | Temporal mapping |
| --- | --- | --- |
| `skip` | Run on trigger or skip | `catchup_window ≈ 0`, `overlap=SKIP` |
| `catchup_all` | Run on trigger or all missed later | `catchup_window=<long>`, `overlap=BUFFER_ALL` |
| `catchup_latest` | Run on trigger or only the latest missed | `catchup_window ≈ 1 interval`, `overlap=BUFFER_ONE` |
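
The policy table can be read as a lookup from `misfire_policy` to a `(catchup_window, overlap)` pair. The sketch below uses only the standard library and plain strings for the overlap names, so it does not depend on the Temporal SDK. The concrete durations for `skip` (60 seconds as a stand-in for "≈ 0") and `catchup_all` (365 days as a stand-in for "<long>") are assumptions, as are the names `MisfireMapping` and `misfire_mapping`.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class MisfireMapping:
    catchup_window: timedelta
    overlap: str  # overlap policy name as written in the table


def misfire_mapping(policy: str, interval: timedelta) -> MisfireMapping:
    """Translate a misfire_policy value into catchup window and overlap."""
    if policy == "skip":
        # Assumed small window: effectively never recover a missed fire.
        return MisfireMapping(timedelta(seconds=60), "SKIP")
    if policy == "catchup_all":
        # Assumed long window: recover every missed fire once back up.
        return MisfireMapping(timedelta(days=365), "BUFFER_ALL")
    if policy == "catchup_latest":
        # One interval: only the most recent missed fire is still eligible.
        return MisfireMapping(interval, "BUFFER_ONE")
    raise ValueError(f"unknown misfire_policy: {policy!r}")


daily = misfire_mapping("catchup_latest", timedelta(days=1))
```

For a daily cron, `catchup_latest` yields a one-day window with `BUFFER_ONE`, which lines up with the `CatchupWindow 1d` and `OverlapPolicy BufferOne` values reported after the railiance01 deployment.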
|
||||||
|
|
||||||
|
## Confirm root cause on Railiance01
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0014-T01
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "c90ff214-9214-48c7-96b9-7d699528d5ab"
|
||||||
|
```
|
||||||
|
|
||||||
|
Inspected via `ssh railiance01` + in-node `kubectl`/`temporal` (no k3s tunnel is
|
||||||
|
defined for railiance01; the documented access path is SSH to the host).
|
||||||
|
|
||||||
|
**Findings (2026-06-23) — the WP-0014 premise was wrong for this incident:**
|
||||||
|
|
||||||
|
- All pods healthy; `actcore-worker` up 44h, 0 restarts. Not a crash.
|
||||||
|
- The daily-triage Temporal schedule (`activity-schedule-6fca51fa-…`) is
|
||||||
|
**healthy**: `Paused false`, `OverlapPolicy Skip`, **`CatchupWindow 365d`**
|
||||||
|
(Temporal's *default* when unset), `ActionCounts {Total:8, MissedCatchupWindow:0}`.
|
||||||
|
So fires were **not** silently dropped — my original "no catchup window → silent
|
||||||
|
drop" hypothesis does not hold; the server default is already 365d.
|
||||||
|
- The `2026-06-23T05:20:00Z` fire **did fire and ran**, then **Failed at the report
|
||||||
|
sink**: `report sink failure: state-hub-progress … '[Errno 111] Connection
|
||||||
|
refused'`. The run produced a report but could not deliver it to State Hub, so
|
||||||
|
no `daily_triage` progress event (not even a "could not run" one) was posted →
|
||||||
|
the silence. The 06-22 fire has no execution in retention (bridge likely down
|
||||||
|
then too / schedule update window at `LastUpdateAt 1d ago`).
|
||||||
|
- Root cause is **State Hub connectivity from railiance01**, not Temporal. The
|
||||||
|
in-cluster `actcore-state-hub-bridge` (`hostNetwork`) proxies to
|
||||||
|
`127.0.0.1:18000` on the node — the local end of the ops-bridge **reverse tunnel
|
||||||
|
back to the workstation's State Hub**. At 07:20 Europe/Berlin (= 05:20 UTC) the
|
||||||
|
workstation/tunnel was unreachable → `Connection refused`. Chronic flakiness
|
||||||
|
confirmed: 102 State Hub resolver timeouts in 24h (69 `recently_on_scope`,
|
||||||
|
33 `consistency_sweep`).
|
||||||
|
|
||||||
|
**Implication:** the trigger *is* independent of the laptop, but the triage's
|
||||||
|
**data dependencies (State Hub context resolution + report delivery) still route
|
||||||
|
back to the workstation State Hub**, which is asleep at 07:20 Berlin. WP-0014's
|
||||||
|
misfire policies are still good robustness, but the real fix is (a) State Hub
|
||||||
|
reachable from railiance01 independent of the workstation, and/or (b) sinks/
|
||||||
|
resolvers resilient to transient State Hub unavailability (retry/backoff,
|
||||||
|
store-and-forward) instead of hard-failing the workflow. Tracked as follow-up
|
||||||
|
below. Backfill deferred: a replay only succeeds while the workstation State Hub
|
||||||
|
is reachable.
|
||||||
|
|
||||||
|
## Implement explicit misfire recovery modes
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0014-T02
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "19615562-4cb2-4f25-872f-505d6e40dcc5"
|
||||||
|
```
|
||||||
|
|
||||||
|
Add `catchup_window_seconds` to `CronTriggerConfig` and redefine `misfire_policy`
|
||||||
|
into the three explicit modes above. In `_build_schedule()` set
|
||||||
|
`SchedulePolicy(overlap=..., catchup_window=timedelta(...))` per mode. Remove the
|
||||||
|
ad-hoc 1-hour `backfill` hack in favour of native catchup-window semantics. Keep
|
||||||
|
backward compatibility for existing `skip`/`catchup`/`compress` values (alias
|
||||||
|
map). Unit tests for each mode's `(catchup_window, overlap)` mapping.
|
||||||
|
|
||||||
|
## Missed-fire detection & alert sink
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0014-T03
|
||||||
|
status: done
|
||||||
|
priority: medium
|
||||||
|
state_hub_task_id: "dbedd96a-59ca-4b83-bce6-35755b076807"
|
||||||
|
```
|
||||||
|
|
||||||
|
Detect when a scheduled definition has no successful run within its expected
|
||||||
|
interval + tolerance, and emit a signal (State Hub progress event and/or
|
||||||
|
agent-inbox message) so a miss is visible even under `skip`. This is the
|
||||||
|
observability the current silent-drop behaviour lacks — a miss should never again
|
||||||
|
be invisible.
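
The detection rule described above, no successful run within the expected interval plus a tolerance, reduces to a small time comparison. This is a sketch under stated assumptions: the function name `is_fire_missed`, its parameters, and the example timestamps are hypothetical, and the real detector may also need to consider paused or disabled definitions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fire_missed(
    last_success: Optional[datetime],
    now: datetime,
    interval: timedelta,
    tolerance: timedelta,
) -> bool:
    """Return True when no successful run falls within interval + tolerance."""
    if last_success is None:
        return True  # never succeeded: treat as a miss so it is visible
    return now - last_success > interval + tolerance


# Hypothetical daily definition with a 30-minute tolerance.
last_ok = datetime(2026, 6, 21, 5, 20, tzinfo=timezone.utc)
check_time = datetime(2026, 6, 22, 6, 0, tzinfo=timezone.utc)
missed = is_fire_missed(last_ok, check_time, timedelta(days=1), timedelta(minutes=30))
```

Because the check looks at successful runs rather than schedule fires, it would also flag the incident described in this workplan, where the schedule fired but the run failed at the report sink.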
|
||||||
|
|
||||||
|
## Apply policy to runtime definitions & document
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0014-T04
|
||||||
|
status: done
|
||||||
|
priority: medium
|
||||||
|
state_hub_task_id: "04e9d1d2-1192-4402-9402-b12c5d7d44e5"
|
||||||
|
```
|
||||||
|
|
||||||
|
Set `misfire_policy: catchup_latest` for `daily-statehub-wsjf-triage` and documented
|
||||||
|
run-miss options in `docs/runbook.md`.
|
||||||
|
|
||||||
|
**Deployed to and verified on railiance01 (2026-06-24):** built `activity-core:
|
||||||
|
railiance01-prod` with the WP-0014 code (T02/T03/T05), imported into k3s
|
||||||
|
containerd, applied the ConfigMap, rolled `actcore-worker`/`api`/`event-router`
|
||||||
|
onto the new image, and ran `/admin/sync` (6 defs, 4 schedules upserted, 0
|
||||||
|
errors). The live Temporal schedule now reports `OverlapPolicy BufferOne` +
|
||||||
|
`CatchupWindow 1d` (= `catchup_latest`); pods healthy, API `db:true temporal:true`.
|
||||||
|
|
||||||
|
## Keep activity-core thin under the State Hub beachhead model
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0014-T05
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "b7e5b877-1b09-421c-a04e-78f785dc00a1"
|
||||||
|
```
|
||||||
|
|
||||||
|
**Architecture decision (Bernd, 2026-06-23):** the resilience that this incident
|
||||||
|
needs — queuing writes and caching reads while State Hub is unreachable — must
|
||||||
|
**not** be a burden carried by client repos. It belongs to State Hub as a
|
||||||
|
**per-machine local "beachhead"** (transparent read cache + write outbox, possibly
|
||||||
|
with State-Hub federation), owned by custodian/state-hub. It handles all three
|
||||||
|
failure modes: network interruption, central State Hub crash, central machine
|
||||||
|
down. This is handed off to state-hub (see the coordination message / proposal);
|
||||||
|
**do not build client-side queue/cache logic in activity-core.**
|
||||||
|
|
||||||
|
activity-core's only responsibilities under this model are thin:
|
||||||
|
|
||||||
|
- **Idempotent writes — DONE (2026-06-23, in-repo):** added
|
||||||
|
`activity_core/state_hub_write` (`idempotency_headers`); every State Hub write
|
||||||
|
(report-sink, ops-evidence, schedule-miss) now sends a stable `Idempotency-Key`
|
||||||
|
header derived from `run_id:instruction_id:event_type`. The read-based
|
||||||
|
`_progress_exists` dedup is now best-effort (returns `False` on connection
|
||||||
|
error instead of hard-failing), so the guarantee lives on the keyed write, not
|
||||||
|
a live read. Tests in `tests/test_state_hub_write.py`; documented in
|
||||||
|
`docs/runbook.md`.
|
||||||
|
- **Adopt the beachhead endpoint — MOVED to [[ACTIVITY-WP-0015]]:** pointing
|
||||||
|
`STATE_HUB_URL` at the local beachhead and retiring the bespoke
|
||||||
|
`actcore-state-hub-bridge` proxy depend on the state-hub beachhead existing
|
||||||
|
first. Split into WP-0015 (status `blocked`) so this workplan can close on its
|
||||||
|
completed in-repo work rather than waiting on an external capability.
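
The idempotent-write bullet above can be illustrated with a small sketch. The text only says the key is derived from `run_id:instruction_id:event_type`; hashing that string with SHA-256 is an assumption made here, and `sketch_idempotency_headers` is a hypothetical name, not the real `idempotency_headers` helper.

```python
import hashlib


def sketch_idempotency_headers(
    run_id: str, instruction_id: str, event_type: str
) -> dict[str, str]:
    """Build a stable Idempotency-Key header for one State Hub write."""
    raw = f"{run_id}:{instruction_id}:{event_type}"
    # Assumption: hash the composite so the header has a fixed, opaque shape.
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return {"Idempotency-Key": digest}


if __name__ == "__main__":
    first = sketch_idempotency_headers("run-1", "daily-statehub-wsjf-triage", "daily_triage")
    again = sketch_idempotency_headers("run-1", "daily-statehub-wsjf-triage", "daily_triage")
    # A retried or replayed write carries the same key, so the receiver can dedup it.
    print(first == again)
```

The key depends only on values known before the write, so a flush from an outbox can resend it without a live read against State Hub.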
|
||||||
|
|
||||||
|
T05 is done as far as activity-core can act now; the external-dependent adoption
|
||||||
|
lives in WP-0015.
|
||||||
@@ -0,0 +1,54 @@
|
|||||||
|
---
|
||||||
|
id: ACTIVITY-WP-0015
|
||||||
|
type: workplan
|
||||||
|
title: "Adopt State Hub Beachhead Endpoint"
|
||||||
|
domain: infotech
|
||||||
|
repo: activity-core
|
||||||
|
status: blocked
|
||||||
|
owner: claude
|
||||||
|
topic_slug: activity-core
|
||||||
|
created: "2026-06-24"
|
||||||
|
updated: "2026-06-24"
|
||||||
|
state_hub_workstream_id: "bbc07f9e-9323-4b2b-b556-c33b37d0b228"
|
||||||
|
---
|
||||||
|
|
||||||
|
# Adopt State Hub Beachhead Endpoint
|
||||||
|
|
||||||
|
Carries the **blocked remainder** of [[ACTIVITY-WP-0014]] T05. The in-repo half
|
||||||
|
(idempotency-keyed State Hub writes) shipped in WP-0014; this workplan is the
|
||||||
|
client-side adoption that depends on the state-hub-owned **beachhead** capability
|
||||||
|
(per-machine read cache + write outbox) existing first.
|
||||||
|
|
||||||
|
**Blocked on:** the state-hub beachhead (proposal sent to the `state-hub` agent,
|
||||||
|
2026-06-23). Do not build queue/cache logic in activity-core — see
|
||||||
|
[[statehub-beachhead-principle]].
|
||||||
|
|
||||||
|
## Point STATE_HUB_URL at the beachhead
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0015-T01
|
||||||
|
status: wait
|
||||||
|
priority: medium
|
||||||
|
state_hub_task_id: "76b6132d-394a-4a67-bef6-73bb9d1e277e"
|
||||||
|
```
|
||||||
|
|
||||||
|
Once the state-hub beachhead exposes a local endpoint, point activity-core's
|
||||||
|
`STATE_HUB_URL` (and the railiance runtime config) at it and verify reads are
|
||||||
|
served from cache and writes are queued/flushed correctly when central State Hub
|
||||||
|
is unreachable. Confirm idempotency-keyed writes dedup on flush (no duplicate
|
||||||
|
`daily_triage`/progress events).
|
||||||
|
|
||||||
|
## Retire the bespoke actcore-state-hub-bridge proxy
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0015-T02
|
||||||
|
status: wait
|
||||||
|
priority: medium
|
||||||
|
state_hub_task_id: "526c2129-cbf7-4531-a319-aebfc75cc6a3"
|
||||||
|
```
|
||||||
|
|
||||||
|
Remove the inline `hostNetwork` HTTP proxy `actcore-state-hub-bridge` from
|
||||||
|
`k8s/railiance/20-runtime.yaml` — it is a primitive precursor of the beachhead
|
||||||
|
and should be replaced by the state-hub-owned component, not extended. Re-verify
|
||||||
|
the daily triage end-to-end after cutover, including an overnight scheduled run
|
||||||
|
while the workstation is asleep (the original failure condition).
|
||||||
@@ -0,0 +1,434 @@
|
|||||||
|
---
|
||||||
|
id: ACTIVITY-WP-0016
|
||||||
|
type: workplan
|
||||||
|
title: "LLM Output Robustness & The Producer Trust Boundary"
|
||||||
|
domain: custodian
|
||||||
|
repo: activity-core
|
||||||
|
status: finished
|
||||||
|
owner: codex
|
||||||
|
topic_slug: custodian
|
||||||
|
created: "2026-06-26"
|
||||||
|
updated: "2026-06-30"
|
||||||
|
state_hub_workstream_id: "4ef0d53b-1777-41ae-80c6-1b69fdb34726"
|
||||||
|
---
|
||||||
|
|
||||||
|
# ACTIVITY-WP-0016 — LLM Output Robustness & The Producer Trust Boundary
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
On 2026-06-26 the scheduled `daily-statehub-wsjf-triage` instruction fired on
|
||||||
|
time (`daily_triage` event 05:20:57Z) but its output **failed schema
|
||||||
|
validation**: `Expecting ',' delimiter: line 136 column 22 (char 5268)`. The
|
||||||
|
model emitted a long ranked WSJF recommendation list (reached rank 7+ with
|
||||||
|
nested `wsjf` objects) and the JSON broke deep in that list. Because the report
|
||||||
|
is a single monolithic JSON document, one malformed delimiter discarded the
|
||||||
|
**entire** run. This reset the three-clean-consecutive-scheduled-runs streak in
|
||||||
|
`ACTIVITY-WP-0006-T03` (06-24 ✅, 06-25 ✅, 06-26 ✗-validation) and is the
|
||||||
|
LLM-output-quality surface deferred from `ACTIVITY-WP-0010`.
|
||||||
|
|
||||||
|
The scheduling/runtime layer is healthy — this is purely an output-robustness
|
||||||
|
and boundary-design problem. Today's code (`src/activity_core/rules/executor.py`)
|
||||||
|
already: passes the output schema to llm-connect as a `json_schema` model param
|
||||||
|
(`_llm_run_config`), retries once, runs a fenced/`raw_decode` tolerant parser
|
||||||
|
(`_parse_json_output`), and preserves a bounded 4000-char preview on hard
|
||||||
|
failure (`_invalid_output_report`). None of that helps when error locality is
|
||||||
|
zero: the failure unit is the whole document, not the offending item.
|
||||||
|
|
||||||
|
## Design Frame — The Producer Trust Boundary
|
||||||
|
|
||||||
|
This workplan is anchored to a deliberate architectural stance, not just a bug
|
||||||
|
fix. Capture it in an ADR (T04) so future work inherits it.
|
||||||
|
|
||||||
|
**Premise.** activity-core has a *trust boundary* where free-form producer
|
||||||
|
output meets strict deterministic consumers (JSON Schema validators, the task
|
||||||
|
emitter, classic compute pipelines). The producers are **LLMs and humans (and
|
||||||
|
agents acting for either)**. Both are *untrusted producers*: their output may be
|
||||||
|
|
||||||
|
- **erroneous** — hallucination, truncation (token-limit cutoff), drift,
|
||||||
|
type slips, typos; or
|
||||||
|
- **malicious** — prompt injection, crafted payloads, oversized/deeply-nested
|
||||||
|
structures aimed at exhausting or confusing the consumer.
|
||||||
|
|
||||||
|
The architecture should treat the boundary as an adversarial frontier and place
|
||||||
|
**guardrails + error-correction tooling there**, rather than letting raw
|
||||||
|
producer output flow into deterministic consumers and fail (or worse, partially
|
||||||
|
succeed) downstream.
|
||||||
|
|
||||||
|
**Two non-fail-fast postures.** When we do *not* want to hard-fail on a problem,
|
||||||
|
there are two sensible strategies — and they compose:
|
||||||
|
|
||||||
|
- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output
|
||||||
|
as-is; on exception, catch → repair → retry → or quarantine. Cheap on the
|
||||||
|
happy path. Blast radius depends entirely on how granular the catch is. Good
|
||||||
|
when failures are rare and locally recoverable. Risk: failures surface late,
|
||||||
|
possibly after partial side effects.
|
||||||
|
- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp,
|
||||||
|
and normalize the output to a known-good shape *before* it enters the pipeline
|
||||||
|
— drop bad items, coerce types, bound sizes/depth, allow-list references — so
|
||||||
|
the consumer only ever sees clean input. Higher upfront cost, smaller blast
|
||||||
|
radius, no partial side effects. Good when failures are common or
|
||||||
|
consequences are high.
|
||||||
|
|
||||||
|
**Governing principles for this repo:**
|
||||||
|
|
||||||
|
1. **Push verification to the boundary; keep the interior strict.** Apply
|
||||||
|
posture **B** at the producer→consumer boundary (verify+mitigate structure);
|
||||||
|
keep posture **A** for residual exceptions inside the verified core. Never
|
||||||
|
relax the interior schema to absorb producer sloppiness.
|
||||||
|
2. **Make error locality match the unit of work.** One bad recommendation must
|
||||||
|
cost one recommendation, not the whole report. Framing the payload so each
|
||||||
|
item is independently parseable is the single highest-leverage change.
|
||||||
|
3. **Quarantine, never silently drop.** Invalid units are preserved as bounded,
|
||||||
|
provenance-tagged artifacts (index, error, raw snippet) so they can be
|
||||||
|
debugged or replayed — degraded-but-usable is distinct from total loss.
|
||||||
|
4. **Both human and agent input get the same rigor.** Guardrails are
|
||||||
|
producer-agnostic: the same size/depth/count caps, reference allow-lists, and
|
||||||
|
truncation detection apply whether the producer is an LLM, an agent, or a
|
||||||
|
human form submission.
|
||||||
|
|
||||||
|
## Reproduce & Root-Cause The Failure
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0016-T01
|
||||||
|
status: cancel
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "74fd16a5-4ea5-4dfe-8526-dfa27cf76138"
|
||||||
|
```
|
||||||
|
|
||||||
|
Recover the **full** raw llm-connect response for the 06-26 failure (the State
|
||||||
|
Hub event keeps only a 4000-char preview; the break is at char 5268) and
|
||||||
|
establish the precise cause.
|
||||||
|
|
||||||
|
Done when:
|
||||||
|
|
||||||
|
- the full raw response is pulled from the runtime llm-connect log / response
|
||||||
|
store and the exact offending token at char 5268 is identified;
|
||||||
|
- `finish_reason` is captured to confirm or rule out token-limit **truncation**
|
||||||
|
vs a structural mid-stream glitch;
|
||||||
|
- it is confirmed whether llm-connect actually **enforced** the `json_schema`
|
||||||
|
constrained-decoding hint or merely accepted it as advisory (this determines
|
||||||
|
whether the schema param is load-bearing);
|
||||||
|
- the failing payload is captured as a regression fixture under `tests/`.
|
||||||
|
|
||||||
|
2026-06-26 findings (local analysis on the workstation):
|
||||||
|
|
||||||
|
- **Mechanism confirmed structurally.** There are **16 active workstreams**
|
||||||
|
org-wide and the triage instruction emits ~one ranked recommendation per
|
||||||
|
candidate. The preserved preview holds 7 fully-formed recommendations; the JSON
|
||||||
|
break is at char 5268 (~rank 8–9). The unbounded one-per-workstream list is the
|
||||||
|
structural cause — more items = more tokens = higher odds of a mid-stream JSON
|
||||||
|
slip and/or truncation. This directly justifies T02's bounded top-N + per-item
|
||||||
|
framing.
|
||||||
|
- **Both attempts failed.** `executor._execute` retries once
|
||||||
|
(`src/activity_core/rules/executor.py:166-171`); the recorded error is from the
|
||||||
|
**retry** output, so the model produced invalid JSON twice — not a one-off.
|
||||||
|
- **activity-core discards the diagnostics needed to root-cause this.** Three
|
||||||
|
retention gaps mean the exact char-5268 token cannot be recovered from
|
||||||
|
activity-core data at all:
|
||||||
|
1. `LLMConnectClient.complete()` returns only `data["content"]`
|
||||||
|
(`llm_client.py:57-60`) — it drops `finish_reason`/`usage` from the
|
||||||
|
llm-connect HTTP response, so truncation-vs-structural cannot be
|
||||||
|
distinguished locally.
|
||||||
|
2. the report sink caps raw output at **4000 chars** (`_invalid_output_report`,
|
||||||
|
`executor.py:259`) — below the 5268 break.
|
||||||
|
3. the worker log caps the preview at **2000 chars** (`executor.py:175`).
|
||||||
|
- **Remaining (remote, operator-owned).** Confirming the exact offending token
|
||||||
|
and `finish_reason` requires llm-connect's producer-side logs on `railiance01`
|
||||||
|
— cluster access, outside this repo's SCOPE for direct action. Truncation is
|
||||||
|
the leading hypothesis given the 16-item input, but the mitigation (T02/T03) is
|
||||||
|
identical either way, so T01 does not block the build work.
|
||||||
|
- **Feeds T03/T04.** The retention gaps are themselves defects to fix: capture
|
||||||
|
`finish_reason`/`usage` and persist a larger bounded raw artifact on validation
|
||||||
|
failure so this class of failure is never un-debuggable again.
|
||||||
|
- Partial fixture saved:
|
||||||
|
`tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json`
|
||||||
|
(the 4000-char preview + validation error; full payload pending the remote pull).
|
||||||
|
|
||||||
|
2026-06-30 local retention hardening: activity-core now preserves future
|
||||||
|
llm-connect diagnostic metadata instead of dropping it at the client boundary.
|
||||||
|
`LLMConnectClient.complete()` still returns the content string for compatibility,
|
||||||
|
but records safe non-secret response fields such as `finish_reason` and `usage`
|
||||||
|
on `last_response_metadata`; the executor copies that into report artifacts,
|
||||||
|
State Hub progress detail, and working-memory notes. Invalid report raw previews
|
||||||
|
were raised from 4000 to 12000 chars. This does not recover the historical
|
||||||
|
06-26 full payload or producer-side `finish_reason`, so T01 remains in `wait` pending the
|
||||||
|
remote llm-connect log pull, but the retention gap is closed for future failures.
|
||||||
|
|
||||||
|
## Schema + Prompt Redesign For Error Locality
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0016-T02
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "ae67ca8c-ee01-4a8d-9e8a-a0a36c999758"
|
||||||
|
```
|
||||||
|
|
||||||
|
Redesign the daily-triage report contract so a single malformed item can no
|
||||||
|
longer discard the whole report (principle #2).
|
||||||
|
|
||||||
|
Done when:
|
||||||
|
|
||||||
|
- the recommendation list is **bounded** (configurable top-N, default 5–7) in
|
||||||
|
both the prompt and the output schema — long lists are where the model drifts;
|
||||||
|
- the report uses a **per-item-framed** shape (JSON Lines / NDJSON — one
|
||||||
|
recommendation object per line — or an equivalent delimited per-item form)
|
||||||
|
behind a minimal stable envelope (`summary` + framed items), so each item is
|
||||||
|
an independent parse unit;
|
||||||
|
- the prompt explicitly states the contract, the per-item framing, the cap, and
|
||||||
|
a "if uncertain, emit fewer well-formed items rather than more" instruction;
|
||||||
|
- `max_tokens` is set with headroom for the bounded list so truncation cannot
|
||||||
|
occur at the expected size;
|
||||||
|
- the output schema file (`_load_output_schema` target) is updated to match.
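
As a rough illustration of the per-item framing in the list above, the sketch below parses an NDJSON payload with a leading envelope line followed by one recommendation per line. It assumes the producer really emits one object per line; `parse_framed_report` and the quarantine record fields are hypothetical names chosen for this example.

```python
import json
from typing import Any


def parse_framed_report(
    text: str,
) -> tuple[dict[str, Any], list[dict[str, Any]], list[dict[str, Any]]]:
    """Split an NDJSON report into (envelope, valid items, quarantined items)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty output")
    # The envelope is small and stable; a broken envelope is still a hard failure.
    envelope = json.loads(lines[0])
    items: list[dict[str, Any]] = []
    quarantined: list[dict[str, Any]] = []
    for index, line in enumerate(lines[1:]):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            # One bad line costs one item, not the whole report.
            quarantined.append({"index": index, "error": str(exc), "raw": line[:200]})
            continue
        if not isinstance(obj, dict):
            quarantined.append({"index": index, "error": "not an object", "raw": line[:200]})
            continue
        items.append(obj)
    return envelope, items, quarantined


if __name__ == "__main__":
    payload = "\n".join(
        [
            '{"summary": "three candidates"}',
            '{"rank": 1, "candidate": "a", "action": "do", "why": "x"}',
            '{"rank": 2, "candidate": "b" "action": "do"}',
            '{"rank": 3, "candidate": "c", "action": "do", "why": "z"}',
        ]
    )
    envelope, items, quarantined = parse_framed_report(payload)
    print(len(items), len(quarantined))
```

Line-based splitting only works if the output is not pretty-printed, so a production parser would likely need a fallback for multi-line objects.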
|
||||||
|
|
||||||
|
2026-06-26 progress (in-repo portion):
|
||||||
|
|
||||||
|
- **Strict, bounded schema written** — `schemas/daily-triage-report.json` went
|
||||||
|
from `recommendations.items: {type: object}` (accept-anything) to a strict
|
||||||
|
per-item contract: `required [rank, candidate, action, why]` with typed
|
||||||
|
`wsjf` sub-fields, plus `maxItems: 7`. The strict item shape is what lets the
|
||||||
|
T03 boundary parser validate each recommendation independently.
|
||||||
|
- **`maxItems` is a hint, not a hard reject** — the in-repo validator
|
||||||
|
(`_validate_schema_node`) only enforces `type`/`required`/`properties`/`items`
|
||||||
|
and ignores `maxItems`/`enum`. That is deliberate: a hard `maxItems` reject
|
||||||
|
would discard a whole 16-item report — the exact blast-radius bug WP-0016
|
||||||
|
removes. The bound is enforced via the prompt + the llm-connect `json_schema`
|
||||||
|
constraint hint + T03 mitigation (keep top-N by rank, quarantine extras).
|
||||||
|
- **DEPLOY COUPLING (important):** this schema file is consumed *both* as the
|
||||||
|
llm-connect hint *and* by the current whole-document validator. Tightening
|
||||||
|
per-item `required` fields makes the existing whole-doc validation hard-fail
|
||||||
|
**more** until T03 replaces it with per-item quarantine. Therefore the schema
|
||||||
|
change MUST ship together with T03 — do not deploy the strict schema to the
|
||||||
|
runtime bundle ahead of the T03 parser. Four executor/instruction tests that
|
||||||
|
asserted the old loose contract were updated to the strict contract; the
|
||||||
|
forwarded-schema test now reads the live file instead of hard-coding it.
|
||||||
|
- **Truncation hypothesis corroborated** — the instruction config carries
|
||||||
|
`max_tokens` on the order of ~1200 (per the wiring test fixture). 5268 chars ≈
|
||||||
|
~1300–1500 tokens, so a ~1200-token cap would truncate a 16-item list right at
|
||||||
|
the observed break. This strengthens T01's leading hypothesis and makes the
|
||||||
|
`max_tokens` headroom change below concrete.
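
The character-to-token estimate in the last bullet can be reproduced with a back-of-the-envelope calculation. The 3.5 to 4 characters-per-token ratio used here is a common rule of thumb and an assumption of this sketch, not a measured value for the model in use.

```python
def estimated_tokens(chars: int, chars_per_token: float) -> int:
    """Rough token count for a given character count."""
    return round(chars / chars_per_token)


BREAK_OFFSET = 5268    # character offset of the JSON break
OLD_MAX_TOKENS = 1200  # approximate cap reported by the wiring test fixture

low = estimated_tokens(BREAK_OFFSET, 4.0)   # fewer tokens if tokens are longer
high = estimated_tokens(BREAK_OFFSET, 3.5)  # more tokens if tokens are shorter

if __name__ == "__main__":
    print(low, high, low > OLD_MAX_TOKENS)
```

Under either ratio the estimate lands at or slightly above the old cap, which is consistent with truncation but does not prove it.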
|
||||||
|
|
||||||
|
**Bundle handoff (NOT in this repo — runtime-projected definition).** The triage
|
||||||
|
prompt and `max_tokens` live in the Railiance runtime bundle, not in repo files.
|
||||||
|
Apply there:
|
||||||
|
1. Instruct a **bounded top-N** (≤ 7) ranked recommendations, "if uncertain emit
|
||||||
|
fewer well-formed items rather than more."
|
||||||
|
2. Specify the **per-item framing** the T03 parser will consume (NDJSON: a
|
||||||
|
leading summary object, then one recommendation JSON object per line).
|
||||||
|
3. Raise **`max_tokens`** to give clear headroom for 7 framed items (eliminate
|
||||||
|
truncation at the expected size).
|
||||||
|
4. State the value vocabularies (`action`, `confidence`) the T04 guardrails will
|
||||||
|
check.
|
||||||
|
|
||||||
|
2026-06-30 live evidence check: the 2026-06-28 and 2026-06-29 scheduled
|
||||||
|
`daily_triage` events validated successfully, which shows the runtime is no
|
||||||
|
longer failing every day. However, the preserved State Hub reports still contain
|
||||||
|
10 recommendations, not the requested bounded top-N of 7 / framed item contract.
|
||||||
|
Treat that as evidence that the runtime-projected prompt/schema/max-token bundle
|
||||||
|
has not fully absorbed the T02 handoff yet.
|
||||||
|
|
||||||
|
2026-06-30 source projection closeout: patched `k8s/railiance/20-runtime.yaml`
|
||||||
|
so the projected `daily-statehub-wsjf-triage.md` prompt now says at most 7
|
||||||
|
recommendations and instructs the model to emit fewer well-formed items rather
|
||||||
|
than more. The projected `daily-triage-report.json` now has `maxItems: 7` and
|
||||||
|
`rank.maximum: 7`, aligned with the repo schema. `max_tokens: 1800` remains as
|
||||||
|
headroom for the bounded report. T02 is done in source; live deployment and an
|
||||||
|
observed run with ≤7 recommendations remain under T05.
|
||||||
|
|
||||||
|
## Boundary Parser — Verify & Mitigate (Posture B)
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0016-T03
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "d65a6281-f1f9-4a9b-a835-da065411b709"
|
||||||
|
```
|
||||||
|
|
||||||
|
Implement item-granular parsing with a quarantine lane in
|
||||||
|
`src/activity_core/rules/executor.py`, applying posture **B** at the boundary
|
||||||
|
(principles #1–#3).
|
||||||
|
|
||||||
|
Done when:
|
||||||
|
|
||||||
|
- the parser splits the envelope from the framed items, then parses **each item
|
||||||
|
independently**; a malformed item is routed to a bounded `quarantined_items`
|
||||||
|
artifact (index + validation error + raw snippet), not raised;
|
||||||
|
- a run with some valid and some invalid items emits a report over the surviving
|
||||||
|
valid items with `output_validated=true`, plus `partial=true` and
|
||||||
|
`quarantined_count` / `quarantined_items` markers — degraded-but-usable is
|
||||||
|
reported distinctly from total loss;
|
||||||
|
- a best-effort **repair** pass (close unterminated brackets/quotes, recover the
|
||||||
|
valid prefix) is attempted per item before quarantining it;
|
||||||
|
- truncation detected in T01 is handled as its own signal (recover whole items
|
||||||
|
emitted before the cutoff rather than failing the document);
|
||||||
|
- the existing monolithic-document path remains as the fallback when framing is
|
||||||
|
absent (backward compatible with task-only instructions).
|
||||||
|
|
||||||
|
2026-06-26 progress (implemented in `src/activity_core/rules/executor.py`):
|
||||||
|
|
||||||
|
- **Resilient recovery wired into `_execute`.** When the whole-document parse +
|
||||||
|
one retry still fail, report instructions (those with `report_sinks`) now run
|
||||||
|
`_resilient_report` *before* the total-loss `_invalid_output_report`. If it
|
||||||
|
recovers ≥1 valid item it returns a partial report; otherwise it returns None
|
||||||
|
and the prior total-loss path is preserved unchanged.
|
||||||
|
- **Brace/quote-aware object scanner, not line-splitting.** The real 06-26 output
|
||||||
|
was pretty-printed (multi-line objects), so naive NDJSON line recovery would
|
||||||
|
have failed. `_extract_object_spans` walks the `recommendations` array
|
||||||
|
brace-depth- and string-aware, so it recovers each recommendation object
|
||||||
|
whether pretty-printed across many lines *or* emitted one-per-line (NDJSON).
|
||||||
|
The truncated trailing object is returned with `complete=False`.
|
||||||
|
- **Layered mitigation per item:** `json.loads` → on failure for a truncated
|
||||||
|
tail, a best-effort `_try_repair` (balance open string/brackets/braces) →
|
||||||
|
then `_partition_items` validates each recovered object against the T02 item
|
||||||
|
schema. Valid items survive; malformed or over-`maxItems` items are
|
||||||
|
quarantined with provenance (`index`, `error`, `raw` snippet, `reason`).
|
||||||
|
- **Report shape on degradation:** `output_validated=True` over the survivors,
|
||||||
|
`review_required=True`, `partial=True`, `quarantined_count`, and a bounded
|
||||||
|
`quarantined_items` list (cap 20). Degraded-but-usable is now reported
|
||||||
|
distinctly from total loss.
|
||||||
|
- **Verified against the real failure shape.** New tests reconstruct a
|
||||||
|
pretty-printed report with 7 valid recommendations + a truncated tail (the
|
||||||
|
06-26 shape) and a one-bad-item-among-valid case. The 7-item run now recovers
|
||||||
|
all 7 and quarantines the broken tail (previously: whole run discarded);
|
||||||
|
log line `instruction_output_recovered: kept=7, quarantined=1`. The bad-item
|
||||||
|
run keeps 2 and quarantines the rank-less one.
|
||||||
|
- **Deferred to T04 (clean scope boundary):** enforcing `maxItems` top-N on the
|
||||||
|
*happy* path (valid JSON, all items schema-valid, but > N items) — the resilient
|
||||||
|
path only runs on failure, so over-limit-on-success is a guardrail/count-cap
|
||||||
|
concern, which is exactly T04's remit.
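
To make the brace- and string-aware scanning idea concrete, here is a simplified sketch. It is not the repo's `_extract_object_spans`; the name `extract_object_spans`, the `(begin, end, complete)` tuple shape, and the `recover_items` helper are assumptions for illustration. The caller is assumed to pass the offset just after the opening `[` of the `recommendations` array.

```python
import json


def extract_object_spans(text: str, start: int = 0) -> list[tuple[int, int, bool]]:
    """Return (begin, end, complete) for each top-level {...} found from `start`."""
    spans: list[tuple[int, int, bool]] = []
    depth = 0
    in_string = False
    escaped = False
    begin = None
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a string, braces are data; only track escapes and the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                begin = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0 and begin is not None:
                spans.append((begin, i + 1, True))
                begin = None
    if begin is not None:
        # Input ended inside an object: report the truncated tail as incomplete.
        spans.append((begin, len(text), False))
    return spans


def recover_items(text: str, start: int) -> tuple[list[dict], list[str]]:
    """Parse complete spans; return (items, raw snippets that did not parse)."""
    items, rejected = [], []
    for begin, end, complete in extract_object_spans(text, start):
        raw = text[begin:end]
        try:
            if not complete:
                raise ValueError("truncated")
            items.append(json.loads(raw))
        except ValueError:
            rejected.append(raw[:200])
    return items, rejected
```

Scanning by depth rather than by line means the same code handles pretty-printed and one-per-line output; a repair pass for the incomplete tail is left out of this sketch.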
|
||||||
|
|
||||||
|
## Producer Guardrails + ADR-004
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0016-T04
|
||||||
|
status: done
|
||||||
|
priority: medium
|
||||||
|
state_hub_task_id: "f5c3af5b-9e28-42b0-9af5-4c99284e99b9"
|
||||||
|
```
|
||||||
|
|
||||||
|
Write the architecture decision record and add the producer-agnostic guardrails
|
||||||
|
(principle #4).
|
||||||
|
|
||||||
|
Done when:
|
||||||
|
|
||||||
|
- `docs/adr/adr-004-producer-trust-boundary.md` documents the trust boundary,
|
||||||
|
the untrusted-producer premise (erroneous **and** malicious; human and agent),
|
||||||
|
the A vs B taxonomy and where each applies, the error-locality principle, and
|
||||||
|
the quarantine-with-provenance rule;
|
||||||
|
- boundary guardrails are enforced at the consumer edge: max item **count**, max
|
||||||
|
string length, max nesting **depth**, and a **reference allow-list** (e.g. a
|
||||||
|
recommendation `candidate` / a task `target_repo` must resolve to a known
|
||||||
|
workstream/repo before it is acted on);
|
||||||
|
- guardrail rejections are quarantined with provenance, consistent with T03;
|
||||||
|
- SCOPE.md / INTENT.md are checked for drift and updated if the boundary stance
|
||||||
|
changes the documented contract.
|
||||||
|
|
||||||
|
2026-06-26 progress:
|
||||||
|
|
||||||
|
- **ADR-004 written** — `docs/adr/adr-004-producer-trust-boundary.md` documents
|
||||||
|
the untrusted-producer premise (erroneous + malicious; LLM/agent/human), the
|
||||||
|
A-vs-B posture taxonomy, the four governing principles, the concrete
|
||||||
|
activity-core mechanisms, a posture-by-layer table, consequences, and
|
||||||
|
  alternatives considered. Status: accepted; scope: cross-repo.
|
||||||
|
- **Producer guardrails implemented** in `executor.py`, applied uniformly on the
|
||||||
|
happy path *and* the recovery path via `_partition_items`: per-item order is
|
||||||
|
structural-type → schema → structural caps (`_MAX_DEPTH=8`,
|
||||||
|
`_MAX_STRING_LEN=4000`) → reference allow-list → count cap (`maxItems`). Each
|
||||||
|
quarantine carries a `reason` (`malformed`/`schema`/`guardrail`/`allow_list`/
|
||||||
|
`over_limit`).
|
||||||
|
- **Happy-path count cap closed** (the item deferred from T03): a syntactically
|
||||||
|
valid 9-item report now keeps 7 and quarantines 2 as `over_limit`, emitting a
|
||||||
|
`partial` report — without a retry.
|
||||||
|
- **Reference allow-list wired but inert.** `_allow_list_from_context` reads
|
||||||
|
`context["known_candidates"]`; when present, recommendations with an unknown
|
||||||
|
`candidate` are quarantined (`reason: allow_list`). Absent today → check is
|
||||||
|
inert; activation is a one-line context-resolver change. Keeps the guardrail
|
||||||
|
producer-agnostic (principle #4) and ready.
|
||||||
|
- **SCOPE.md updated** — instruction-executor bullet now names the quarantine
|
||||||
|
lane + guardrails; ADR-004 added to the Architecture Decisions list. No INTENT
|
||||||
|
drift: this hardens the existing output contract, it does not extend scope.
|
||||||
|
- New tests: happy-path count cap, oversized-string guardrail, allow-list
|
||||||
|
rejection (all green).
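
The structural caps and the count cap described above might look roughly like the sketch below. The limits reuse the values quoted in the text (depth 8, string length 4000, seven items); the function names, the depth-counting convention (the item itself is depth 1, only containers count), and sorting by `rank` are assumptions of this example rather than the behaviour of `_partition_items`.

```python
from typing import Any, Optional

MAX_DEPTH = 8
MAX_STRING_LEN = 4000


def guardrail_violation(value: Any, depth: int = 1) -> Optional[str]:
    """Return a reason string if `value` breaks a structural cap, else None."""
    if isinstance(value, str):
        if len(value) > MAX_STRING_LEN:
            return f"string longer than {MAX_STRING_LEN} chars"
        return None
    if isinstance(value, (dict, list)):
        if depth > MAX_DEPTH:
            return f"nesting deeper than {MAX_DEPTH}"
        children = value.values() if isinstance(value, dict) else value
        for child in children:
            reason = guardrail_violation(child, depth + 1)
            if reason:
                return reason
    return None


def apply_count_cap(
    items: list[dict[str, Any]], max_items: int = 7
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Keep the best-ranked `max_items`; return (kept, over_limit)."""
    ranked = sorted(items, key=lambda item: item["rank"])
    return ranked[:max_items], ranked[max_items:]


if __name__ == "__main__":
    report = [{"rank": r, "candidate": f"ws-{r}"} for r in range(1, 10)]
    kept, extra = apply_count_cap(report)
    print(len(kept), len(extra))
```

Items returned in `over_limit` would be quarantined with a reason rather than dropped, in line with the quarantine principle.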
|
||||||
|
|
||||||
|
## Tests + Calibration Re-Entry
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0016-T05
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "c881500b-5459-4620-81c0-b176971e989f"
|
||||||
|
```
|
||||||
|
|
||||||
|
Prove the new posture and hand back to the calibration gates.
|
||||||
|
|
||||||
|
Done when:
|
||||||
|
|
||||||
|
- regression tests cover: the captured 06-26 payload, a truncated-mid-list
|
||||||
|
payload, a one-bad-item-among-good payload (asserts quarantine + partial), an
|
||||||
|
oversized/over-deep payload (asserts guardrail rejection), and an
|
||||||
|
injection-shaped reference (asserts allow-list rejection);
|
||||||
|
- the full suite passes and the result is recorded here with the count;
|
||||||
|
- a daily-triage smoke against the live runtime shows a previously-failing
|
||||||
|
payload now **degrades gracefully** (valid items delivered, bad items
|
||||||
|
quarantined) instead of discarding the run;
|
||||||
|
- a progress note hands back to `ACTIVITY-WP-0010-T04` and `ACTIVITY-WP-0006-T03`
|
||||||
|
that the output-robustness blocker is cleared so the three-clean-run gate can
|
||||||
|
resume on its own.
|
||||||
|
|
||||||
|
2026-06-26 progress (in-repo portion complete):
|
||||||
|
|
||||||
|
- **Regression coverage complete.** Across T03/T04/T05: truncated-mid-list,
|
||||||
|
one-bad-item-among-good (quarantine + partial), oversized-string and over-depth
|
||||||
|
guardrail rejection, allow-list (injection-shaped) rejection, happy-path count
|
||||||
|
cap, and a test driving the **actual captured 2026-06-26 payload**
|
||||||
|
(`tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json`)
|
||||||
|
— it now recovers 6+ valid recommendations and quarantines the truncated tail,
|
||||||
|
where before it discarded the whole run.
|
||||||
|
- **Full suite green:** 218 passed, 1 skipped (recorded at T04; the T05 fixture +
|
||||||
|
over-depth tests add to this — see the commit).
|
||||||
|
- **Hand-back notes posted** to `ACTIVITY-WP-0006-T03` (State Hub event
|
||||||
|
`b6b8c2b8`) and `ACTIVITY-WP-0010-T04` (`b813f0dc`).
|
||||||
|
- **Remaining (remote, operator-owned):** the live daily-triage smoke on
|
||||||
|
`railiance01` proving end-to-end graceful degradation. It depends on deploying
|
||||||
|
the T02 bundle prompt/`max_tokens`/NDJSON changes together with this code, which
|
||||||
|
is cluster/operator work outside this repo's SCOPE. T05 therefore stays
|
||||||
|
`progress` until that live run exists; the in-repo deliverables are done.
|
||||||
|
|
||||||
|
2026-06-30 follow-up: added forward-looking diagnostics so future validation
|
||||||
|
failures carry llm-connect response metadata and a larger bounded raw-output
|
||||||
|
preview in activity-core-owned evidence. Focused verification passed:
|
||||||
|
`uv run pytest tests/test_llm_client.py tests/rules/test_executor.py tests/test_report_sinks.py -q`
|
||||||
|
=> 39 passed. This improves future root-cause ability but does not replace the
|
||||||
|
required live smoke proving graceful degradation on railiance01.
|
||||||
|
|
||||||
|
2026-06-30 projection follow-up: local source projection now enforces the top-7
|
||||||
|
prompt/schema contract. Remaining T05 proof is operational: deploy or sync the
|
||||||
|
updated `k8s/railiance/20-runtime.yaml`, run `actcore-sync`/schedule smoke or wait
|
||||||
|
for the next 07:20 Berlin fire, then confirm State Hub `daily_triage` evidence is
|
||||||
|
`output_validated=true` with no more than 7 recommendations.
|
||||||
|
|
||||||
|
## Relationships
|
||||||
|
|
||||||
|
- **Blocks / feeds:** `ACTIVITY-WP-0006-T03` (three clean scheduled runs) and
|
||||||
|
`ACTIVITY-WP-0010-T04` (collect three clean scheduled runs) — both stalled on
|
||||||
|
the same output-quality failure this workplan removes.
|
||||||
|
- **References:** `ACTIVITY-WP-0009` (scheduled-run trust gap).
|
||||||
|
- **Boundary discipline:** keeps activity-core inside its SCOPE — this hardens
|
||||||
|
the instruction-executor output contract; it does not move provider
|
||||||
|
credentials, cluster reconciliation, or task lifecycle into this repo.
|
||||||
|
|
||||||
|
|
||||||
|
## Closure 2026-07-02 (RAIL-BS-WP-0008 live deploy)
|
||||||
|
|
||||||
|
- T05 done: the robustness bundle (strict per-item schema + T03 quarantine
|
||||||
|
parser + bounded top-7/NDJSON runtime prompt, activity-core `7612112`) was
|
||||||
|
deployed to railiance01 and live-proven. A manually triggered daily-triage
|
||||||
|
run produced a clean schema-valid report with exactly 7 ranked
|
||||||
|
recommendations: State Hub event `24d2d321-c761-47f7-bf9e-7950a6253c21`,
|
||||||
|
`output_validated=true`, working memory written. Calibration re-entry: the
|
||||||
|
three-clean-run streak (WP-0006-T03 / WP-0010-T04) restarts from this run.
|
||||||
|
- T01 cancelled: the raw 2026-06-26 llm-connect response is unrecoverable
|
||||||
|
(stateless pod, no response store, log stream holds only 2 startup lines
|
||||||
|
since 2026-06-19). Root cause stands on the retained 4000-char preview and
|
||||||
|
break-at-char-5268 evidence: output exceeded the old ~1200-token budget and
|
||||||
|
truncated mid-JSON. The deployed mitigation (1800-token headroom, bounded
|
||||||
|
top-7, per-item recovery) addresses exactly that failure mode.
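
The failure mode recorded above (one JSON document truncated mid-output) is what per-item recovery is meant to survive. The following is a minimal sketch, assuming the model emits one recommendation per NDJSON line. The function name `parse_ndjson_recommendations`, the `rank`/`title` required keys, and the quarantine tuple shape are hypothetical illustrations, not the activity-core parser.

```python
import json

MAX_RECOMMENDATIONS = 7  # the "top-7" bound named in the closure note
REQUIRED_KEYS = {"rank", "title"}  # hypothetical per-item schema


def parse_ndjson_recommendations(raw: str):
    """Parse NDJSON line by line, quarantining lines that fail validation.

    Returns (valid_items, quarantined). Quarantine entries hold only
    (line_number, reason), never the raw model output.
    """
    valid, quarantined = [], []
    for line_number, line in enumerate(raw.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            item = json.loads(line)
        except json.JSONDecodeError as exc:
            # A truncated final line lands here; earlier lines are kept.
            quarantined.append((line_number, exc.msg))
            continue
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            quarantined.append((line_number, "missing required keys"))
            continue
        valid.append(item)
    return valid[:MAX_RECOMMENDATIONS], quarantined
```

With this shape, a response cut off in the middle of its last line still yields every complete item before the cut, which a single-document `json.loads` call cannot do.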
|
||||||
58
workplans/ACTIVITY-WP-0017-core-hub-ops-evidence-sink.md
Normal file
@@ -0,0 +1,58 @@
|
|||||||
|
---
|
||||||
|
id: ACTIVITY-WP-0017
|
||||||
|
type: workplan
|
||||||
|
title: "Core Hub ops evidence sink"
|
||||||
|
domain: infotech
|
||||||
|
repo: activity-core
|
||||||
|
status: finished
|
||||||
|
owner: codex
|
||||||
|
topic_slug: custodian
|
||||||
|
created: "2026-06-27"
|
||||||
|
updated: "2026-06-27"
|
||||||
|
state_hub_workstream_id: "2a073bf4-febf-433e-a721-5daf71760912"
|
||||||
|
---
|
||||||
|
|
||||||
|
# Core Hub ops evidence sink
|
||||||
|
|
||||||
|
## Goal
|
||||||
|
|
||||||
|
Provide the activity-core side of the Core Hub replacement evidence path for
|
||||||
|
`CORE-WP-0008-T03`, without depending on the legacy Haskell Inter-Hub sink and
|
||||||
|
without placing secret material in activity definitions, logs, State Hub, or
|
||||||
|
chat.
|
||||||
|
|
||||||
|
## Task: Add Core Hub interaction-event sink
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0017-T01
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "32aab1af-6be5-4b52-afa1-c11f52c65892"
|
||||||
|
```
|
||||||
|
|
||||||
|
Add a `core-hub-interaction-event` ops evidence sink that posts sanitized
|
||||||
|
ops-inventory probe evidence to Core Hub `/api/v2/interaction-events`, verifies
|
||||||
|
the created event is visible, and reports only non-secret ids/statuses.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- runtime token is read through `CORE_HUB_RUNTIME_TOKEN_FILE` or a named
|
||||||
|
environment variable, never from workplan content;
|
||||||
|
- sink configuration accepts `CORE_HUB_BASE_URL` and a widget id or widget
|
||||||
|
mapping;
|
||||||
|
- emitted metadata reuses the existing compact/sanitized probe evidence path;
|
||||||
|
- missing Core Hub config skips cleanly with explicit non-secret missing keys;
|
||||||
|
- tests prove the POST/visibility check and secret non-disclosure.
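
The configuration rules in the acceptance list can be sketched as follows. This is an illustration under stated assumptions, not the `activity_core.ops_evidence_sinks` implementation: `CORE_HUB_BASE_URL` and `CORE_HUB_RUNTIME_TOKEN_FILE` come from the text above, while `CORE_HUB_WIDGET_ID` and the function name `load_core_hub_config` are hypothetical. The alternative of reading the token from a named environment variable is left out for brevity.

```python
from pathlib import Path


def load_core_hub_config(env):
    """Return (config, missing_keys); config is None when anything is missing.

    Only the names of missing keys are reported, never a token value.
    """
    missing = []
    base_url = env.get("CORE_HUB_BASE_URL", "").strip()
    if not base_url:
        missing.append("CORE_HUB_BASE_URL")
    widget_id = env.get("CORE_HUB_WIDGET_ID", "").strip()  # hypothetical name
    if not widget_id:
        missing.append("CORE_HUB_WIDGET_ID")
    token = ""
    token_file = env.get("CORE_HUB_RUNTIME_TOKEN_FILE", "").strip()
    if token_file and Path(token_file).is_file():
        token = Path(token_file).read_text(encoding="utf-8").strip()
    if not token:
        missing.append("CORE_HUB_RUNTIME_TOKEN_FILE")
    if missing:
        # The caller can skip the sink and log the key names safely.
        return None, missing
    config = {
        "base_url": base_url.rstrip("/"),
        "widget_id": widget_id,
        "token": token,
    }
    return config, []
```

A caller would pass `os.environ` in production and a plain dict in tests, which keeps the skip path easy to exercise without touching real credentials.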
|
||||||
|
|
||||||
|
Verification 2026-06-27: `tests/test_ops_evidence_sinks.py` passed, and
|
||||||
|
a disposable local Core Hub runtime accepted an activity-core
|
||||||
|
`core-hub-interaction-event` sink emission, then listed the created
|
||||||
|
`ops-endpoint-verified` event back through `/api/v2/interaction-events`.
|
||||||
|
The verification asserted sanitized metadata did not include response body,
|
||||||
|
authorization header, URL userinfo, or token query material.
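
One way to strip URL userinfo and token query material before a URL enters evidence metadata is sketched below. The helper name `sanitize_url` and the `SENSITIVE_QUERY_KEYS` list are assumptions for illustration; the real sanitizer in activity-core may differ, and this sketch does not preserve brackets around IPv6 hosts.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query keys treated as secret-bearing.
SENSITIVE_QUERY_KEYS = {"token", "access_token", "api_key"}


def sanitize_url(url: str) -> str:
    """Drop userinfo, sensitive query parameters, and the fragment."""
    parts = urlsplit(url)
    # Rebuild the network location from hostname and port only,
    # which discards any "user:password@" prefix.
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_QUERY_KEYS
    ]
    return urlunsplit((parts.scheme, host, parts.path, urlencode(kept), ""))
```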
|
||||||
|
|
||||||
|
Completed 2026-06-27: implemented the Core Hub interaction-event sink in
|
||||||
|
`activity_core.ops_evidence_sinks` with unit coverage for POST/visibility
|
||||||
|
verification, missing config behavior, and secret non-disclosure. This provides
|
||||||
|
the direct Core Hub consumer path needed by `CORE-WP-0008-T03`; deployed use
|
||||||
|
still requires an approved Core Hub runtime token and widget id/mapping.
|
||||||
248
workplans/ACTIVITY-WP-0018-own-infra-automation-status.md
Normal file
@@ -0,0 +1,248 @@
|
|||||||
|
---
|
||||||
|
id: ACTIVITY-WP-0018
|
||||||
|
type: workplan
|
||||||
|
title: "Own-infrastructure automation status surface"
|
||||||
|
domain: infotech
|
||||||
|
repo: activity-core
|
||||||
|
status: finished
|
||||||
|
owner: codex
|
||||||
|
topic_slug: automation-observability
|
||||||
|
created: "2026-06-29"
|
||||||
|
updated: "2026-06-29"
|
||||||
|
state_hub_workstream_id: "0220b38b-7c73-4601-9601-5f2c1a5b29e8"
|
||||||
|
---
|
||||||
|
|
||||||
|
# Own-infrastructure automation status surface
|
||||||
|
|
||||||
|
## Goal
|
||||||
|
|
||||||
|
Make activity-core's own scheduling and evidence infrastructure the explicit
|
||||||
|
operating preference for durable automations, independent of any coding
|
||||||
|
assistant-provided scheduler or reminder system.
|
||||||
|
|
||||||
|
An operator should be able to answer a question like "How did our automations go
|
||||||
|
since Friday?" with a repo-native command that does not require an LLM. Coding
|
||||||
|
assistants may inspect or summarize that command's output, but they must not be
|
||||||
|
the source of truth for scheduled execution, run history, or operational
|
||||||
|
evidence.
|
||||||
|
|
||||||
|
## Review notes
|
||||||
|
|
||||||
|
The repo already owns the correct infrastructure direction:
|
||||||
|
|
||||||
|
- `SCOPE.md` defines activity-core as the org-wide event bridge for cron,
|
||||||
|
one-off scheduled datetime, and event-triggered automation.
|
||||||
|
- `Makefile` exposes sync and service targets, but no operator status target for
|
||||||
|
recent automation outcomes.
|
||||||
|
- `docs/runbook.md` documents daily-triage verification through
|
||||||
|
`scripts/verify_daily_triage.py`, but that helper is activity-specific and
|
||||||
|
still reads like a checklist rather than the baseline answer surface for all
|
||||||
|
automations.
|
||||||
|
- Existing workplan evidence shows the status question is operationally common:
|
||||||
|
2026-06-24 and 2026-06-25 daily triage runs were clean, while 2026-06-26 and
|
||||||
|
2026-06-27 fired on schedule but failed output validation. That distinction is
|
||||||
|
exactly what the baseline command must make obvious.
|
||||||
|
|
||||||
|
## Task: Codify the own-infra scheduling preference
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0018-T01
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "00127678-5ce4-4cb3-b81c-f42e04407c73"
|
||||||
|
```
|
||||||
|
|
||||||
|
Record the repository preference that durable automation scheduling, execution
|
||||||
|
history, and run evidence belong to activity-core's own infrastructure: Temporal
|
||||||
|
Schedules, NATS JetStream, activity-core run records, State Hub progress, and
|
||||||
|
working-memory/report sinks.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- `AGENTS.md` repo-specific instructions say not to use coding
|
||||||
|
assistant-provided automation tooling as the execution or evidence source for
|
||||||
|
activity-core automations.
|
||||||
|
- `SCOPE.md` and `docs/runbook.md` describe coding assistants as callers or
|
||||||
|
summarizers of repo-native automation commands, not as schedulers.
|
||||||
|
- The preference distinguishes durable automation from harmless local session
|
||||||
|
reminders: production/operational recurrence belongs to activity-core.
|
||||||
|
- The text names the authoritative evidence sources and avoids tying the policy
|
||||||
|
to any one assistant product.
|
||||||
|
|
||||||
|
2026-06-29 progress: Added the immediate repo-agent instruction in AGENTS.md
|
||||||
|
that durable activity-core automations must use repo-owned infrastructure, not
|
||||||
|
coding assistant automation/reminder/heartbeat tooling, as the execution or
|
||||||
|
evidence source. Remaining T01 work is to carry the same preference into
|
||||||
|
SCOPE.md and docs/runbook.md.
|
||||||
|
|
||||||
|
## Task: Define the automation status evidence contract
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0018-T02
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "17e6bb87-d4bf-4ef3-b91c-4bdfe2fe3492"
|
||||||
|
```
|
||||||
|
|
||||||
|
Define a small, deterministic report contract for answering recent automation
|
||||||
|
status questions across all ActivityDefinitions.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- The contract covers schedule state, expected fires in the requested window,
|
||||||
|
observed workflow runs, `activity_runs` rows, State Hub progress events,
|
||||||
|
working-memory/report sink evidence, and known validation or sink failures.
|
||||||
|
- It defines normalized statuses such as `completed`, `running`, `retrying`,
|
||||||
|
`validation_failed`, `sink_failed`, `missed`, `disabled`, and `unknown`.
|
||||||
|
- Partial data is explicit: if Temporal, Postgres, State Hub, or a sink path is
|
||||||
|
unavailable, the report includes warnings rather than silently passing or
|
||||||
|
failing the whole check.
|
||||||
|
- The contract is safe for operator logs: no secrets, prompts, raw model output,
|
||||||
|
or credential-bearing URLs.
|
||||||
|
- The contract can be emitted as JSON for scripts and rendered as concise text
|
||||||
|
for humans.
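
The contract described in the acceptance list could take a shape like the sketch below. The status values are the ones named above; the class and field names (`ActivityStatus`, `StatusReport`, `expected_fires`, `observed_runs`) and the choice of which statuses count as failures are assumptions for illustration, not the shipped `activity_core.automation_status` types.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class RunStatus(str, Enum):
    COMPLETED = "completed"
    RUNNING = "running"
    RETRYING = "retrying"
    VALIDATION_FAILED = "validation_failed"
    SINK_FAILED = "sink_failed"
    MISSED = "missed"
    DISABLED = "disabled"
    UNKNOWN = "unknown"


# Assumed: only these statuses make the overall check fail.
FAILURE_STATUSES = {
    RunStatus.VALIDATION_FAILED,
    RunStatus.SINK_FAILED,
    RunStatus.MISSED,
}


@dataclass
class ActivityStatus:
    activity_id: str
    name: str
    status: RunStatus
    expected_fires: int = 0
    observed_runs: int = 0


@dataclass
class StatusReport:
    since: str
    until: str
    activities: list = field(default_factory=list)
    warnings: list = field(default_factory=list)  # unavailable sources go here

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def to_text(self) -> str:
        lines = [f"Automation status {self.since} .. {self.until}"]
        for a in self.activities:
            lines.append(
                f"  {a.name}: {a.status.value} "
                f"({a.observed_runs}/{a.expected_fires} fires)"
            )
        lines.extend(f"  warning: {w}" for w in self.warnings)
        return "\n".join(lines)

    def exit_code(self) -> int:
        failed = any(a.status in FAILURE_STATUSES for a in self.activities)
        return 1 if failed else 0
```

Keeping warnings separate from statuses is what lets an unavailable source show up in the report without flipping the exit code on its own.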
|
||||||
|
|
||||||
|
## Task: Implement the non-LLM automation status CLI
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0018-T03
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "7831f2fc-8b76-48fe-aa34-9dcc11ee84db"
|
||||||
|
```
|
||||||
|
|
||||||
|
Add a deterministic CLI, likely under `scripts/automation_status.py` or an
|
||||||
|
`activity_core` module, that answers recent automation status questions without
|
||||||
|
calling an LLM.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- Supports `--since`, `--until`, activity name/id filters, JSON output, and a
|
||||||
|
concise human summary.
|
||||||
|
- Accepts simple operator dates, including absolute dates and a documented
|
||||||
|
`friday`/`last-friday` style shortcut, resolving them to concrete dates in the
|
||||||
|
configured timezone.
|
||||||
|
- Inspects all enabled scheduled ActivityDefinitions by default, not just daily
|
||||||
|
triage.
|
||||||
|
- Uses live sources when configured: Postgres `activity_definitions` /
|
||||||
|
`activity_runs`, Temporal schedule and workflow visibility, State Hub
|
||||||
|
progress, and configured local report sink paths.
|
||||||
|
- Degrades usefully when a source is unavailable and exits non-zero only for
|
||||||
|
real status failures or invalid input, not for optional evidence gaps that are
|
||||||
|
clearly reported.
|
||||||
|
- Includes focused unit tests with fixture data for clean runs, validation
|
||||||
|
failures, missed runs, disabled schedules, and partial-source availability.
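
The date-shortcut requirement can be illustrated with a small resolver. This is a sketch under assumptions: the function name `resolve_since`, the `Europe/Berlin` default, and the exact semantics (`friday` is the most recent Friday including today, `last-friday` never means today) are hypothetical choices, not the documented behavior of `scripts/automation_status.py`.

```python
from datetime import date, datetime, timedelta
from typing import Optional
from zoneinfo import ZoneInfo


def resolve_since(
    value: str,
    tz_name: str = "Europe/Berlin",
    today: Optional[date] = None,
) -> date:
    """Resolve an ISO date or a friday/last-friday shortcut to a date."""
    if today is None:
        # "Today" is taken in the configured timezone, not the host's.
        today = datetime.now(ZoneInfo(tz_name)).date()
    key = value.strip().lower()
    if key in {"friday", "last-friday"}:
        # Monday is 0, so Friday is 4; days_back is 0 on a Friday.
        days_back = (today.weekday() - 4) % 7
        if key == "last-friday" and days_back == 0:
            days_back = 7  # assumed: "last-friday" skips today
        return today - timedelta(days=days_back)
    # Anything else must be an absolute ISO date; ValueError otherwise.
    return date.fromisoformat(key)
```

Passing `today` explicitly keeps the resolver deterministic in tests, so a fixture can pin the Monday after 2026-06-26 and check that `friday` lands on that date.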
|
||||||
|
|
||||||
|
## Task: Add the Make target baseline
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0018-T04
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "451bdf62-b619-4ace-9262-46d20b912781"
|
||||||
|
```
|
||||||
|
|
||||||
|
Expose the CLI through a Make target that is easy for an operator or any coding
|
||||||
|
assistant to run before attempting a prose summary.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- `make automation-status SINCE=2026-06-26` prints the human-readable baseline.
|
||||||
|
- `make automation-status SINCE=friday` is supported or documented with the
|
||||||
|
exact accepted shortcut.
|
||||||
|
- A JSON form is available, either through `FORMAT=json` or a separate target
|
||||||
|
such as `make automation-status-json`.
|
||||||
|
- The target does not require LLM credentials, coding assistant automation
|
||||||
|
tooling, or interactive prompts.
|
||||||
|
- `make help` lists the target with a clear one-line description.
|
||||||
|
|
||||||
|
## Task: Update operator docs and examples
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0018-T05
|
||||||
|
status: done
|
||||||
|
priority: medium
|
||||||
|
state_hub_task_id: "233659aa-e14a-4b3d-b156-d04f0fa16db6"
|
||||||
|
```
|
||||||
|
|
||||||
|
Update the runbook so "How did automations go since Friday?" has an obvious
|
||||||
|
operator recipe.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- `docs/runbook.md` has a short "Automation status" section near the scheduling
|
||||||
|
operations.
|
||||||
|
- The docs include example output or a compact sample for the known daily
|
||||||
|
triage distinction: fired on time versus completed successfully versus output
|
||||||
|
validation failure.
|
||||||
|
- The docs clarify that LLM summaries are optional convenience only; the Make
|
||||||
|
target output is the baseline evidence.
|
||||||
|
- The daily-triage-specific helper is either kept as a lower-level diagnostic or
|
||||||
|
folded into the generalized status command.
|
||||||
|
|
||||||
|
## Task: Verify against recent scheduled-run evidence
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0018-T06
|
||||||
|
status: done
|
||||||
|
priority: medium
|
||||||
|
state_hub_task_id: "24efbe9f-dfff-482f-9edc-456379c9a2aa"
|
||||||
|
```
|
||||||
|
|
||||||
|
Prove the new surface against the recent evidence that motivated this workplan.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- Running the status command over the window starting Friday, 2026-06-26 shows
|
||||||
|
that the daily triage schedule fired on 2026-06-26 and 2026-06-27 but did not
|
||||||
|
produce clean validated reports.
|
||||||
|
- The command distinguishes scheduling health from output/schema validation
|
||||||
|
failure.
|
||||||
|
- Disabled or waiting schedules, such as the weekly coding retro gate when its
|
||||||
|
upstream read model is not available, are reported without being counted as
|
||||||
|
missed runs.
|
||||||
|
- Verification results are recorded in this workplan and as a State Hub progress
|
||||||
|
note once the implementation lands.
|
||||||
|
|
||||||
|
## Implementation Result
|
||||||
|
|
||||||
|
Completed 2026-06-29: implemented the own-infrastructure automation status
|
||||||
|
surface and codified the scheduling preference.
|
||||||
|
|
||||||
|
Delivered:
|
||||||
|
|
||||||
|
- `AGENTS.md` now states that durable activity-core automations use repo-owned
|
||||||
|
infrastructure, not coding assistant automation/reminder/heartbeat tooling, as
|
||||||
|
execution or evidence authority.
|
||||||
|
- `SCOPE.md` and `docs/runbook.md` describe the deterministic status surface and
|
||||||
|
assistant boundary.
|
||||||
|
- `src/activity_core/automation_status.py` and `scripts/automation_status.py`
|
||||||
|
provide the non-LLM CLI.
|
||||||
|
- `make automation-status SINCE=...` and `make automation-status-json` expose the
|
||||||
|
baseline operator commands.
|
||||||
|
- `tests/test_automation_status.py` covers date shortcuts, cron fire estimation,
|
||||||
|
completed runs, validation failures, missed runs, disabled schedules, partial
|
||||||
|
source availability, and working-memory evidence parsing.
|
||||||
|
|
||||||
|
Verification:
|
||||||
|
|
||||||
|
```bash
python3 -m py_compile src/activity_core/automation_status.py scripts/automation_status.py tests/test_automation_status.py
/home/worsch/.local/bin/uv run pytest tests/test_automation_status.py tests/test_daily_triage_verifier.py -q
/home/worsch/.local/bin/uv run python scripts/automation_status.py \
  --since 2026-06-26 --until 2026-06-27 --db-url '' \
  --progress-event-type daily_triage --timeout-seconds 10 \
  --working-memory-dir /tmp --format json
```
|
||||||
|
|
||||||
|
Results:
|
||||||
|
|
||||||
|
- focused tests: `11 passed`;
|
||||||
|
- `make help` lists `automation-status` and `automation-status-json`;
|
||||||
|
- the 2026-06-26 through 2026-06-27 status run exited `1` as expected because
|
||||||
|
State Hub evidence classified daily triage activity
|
||||||
|
`6fca51fa-387a-4fd0-bc4e-d62c29eb859a` as `validation_failed` with two
|
||||||
|
non-secret evidence records: 2026-06-26 `Expecting ',' delimiter` and
|
||||||
|
2026-06-27 `Unterminated string`;
|
||||||
|
- the same report classified the gated weekly coding retro as `disabled`, not
|
||||||
|
`missed`.
|
||||||
@@ -0,0 +1,204 @@
|
|||||||
|
---
|
||||||
|
id: ACTIVITY-WP-0019
|
||||||
|
type: workplan
|
||||||
|
title: "Automation schedule inventory Make targets"
|
||||||
|
domain: infotech
|
||||||
|
repo: activity-core
|
||||||
|
status: finished
|
||||||
|
owner: codex
|
||||||
|
topic_slug: automation-inventory
|
||||||
|
created: "2026-06-29"
|
||||||
|
updated: "2026-07-01"
|
||||||
|
state_hub_workstream_id: "21c73763-9adc-42f6-8fd2-1b8b33c2c770"
|
||||||
|
---
|
||||||
|
|
||||||
|
# Automation schedule inventory Make targets
|
||||||
|
|
||||||
|
## Goal
|
||||||
|
|
||||||
|
Provide a repo-native, non-LLM way to list every scheduled automation that
|
||||||
|
activity-core knows about.
|
||||||
|
|
||||||
|
`ACTIVITY-WP-0018` added the status surface for questions like "How did our
|
||||||
|
automations go since Friday?". The next operator question is the inventory
|
||||||
|
baseline: "What automations are scheduled at all?" That should be answerable
|
||||||
|
through Make targets backed by activity-core's own ActivityDefinitions,
|
||||||
|
database, and Temporal schedule metadata when available, independent of any
|
||||||
|
coding assistant automation infrastructure.
|
||||||
|
|
||||||
|
## Review notes
|
||||||
|
|
||||||
|
- `Makefile` currently exposes `automation-status` and
|
||||||
|
`automation-status-json`, but no dedicated inventory/list target.
|
||||||
|
- `scripts/automation_status.py` and `src/activity_core/automation_status.py`
|
||||||
|
already load scheduled ActivityDefinitions and compute their Temporal schedule
|
||||||
|
ids. The inventory target should reuse that parsing/loading posture where it
|
||||||
|
fits rather than creating a second discovery path.
|
||||||
|
- `make sync-schedules` reconciles Temporal schedules from the
|
||||||
|
`activity_definitions` database, but it is an action target, not a read-only
|
||||||
|
operator inventory command.
|
||||||
|
- The inventory command should remain useful in degraded local mode: file-backed
|
||||||
|
definitions are enough to list configured scheduled automations, while live
|
||||||
|
DB and Temporal visibility can enrich the output.
|
||||||
|
|
||||||
|
## Task: Define the automation inventory contract
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0019-T01
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "8de24590-f9ee-4d0e-8692-b7ada9f232ed"
|
||||||
|
```
|
||||||
|
|
||||||
|
Define the fields and source precedence for a deterministic scheduled
|
||||||
|
automation inventory report.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- The report includes every ActivityDefinition with `trigger_type` of `cron` or
|
||||||
|
`scheduled`, including disabled definitions.
|
||||||
|
- Each row includes id, name, enabled/disabled state, trigger type, schedule
|
||||||
|
expression or one-shot datetime, timezone, overlap/catchup policy when known,
|
||||||
|
and the derived Temporal schedule id.
|
||||||
|
- The report identifies its source for each row: database, repo definition file,
|
||||||
|
Temporal visibility, or a combination.
|
||||||
|
- If Temporal is reachable, the report adds paused/missing/drift hints without
|
||||||
|
mutating schedules.
|
||||||
|
- Missing optional sources produce warnings, not silent omissions.
|
||||||
|
- The JSON shape is stable enough for scripts and tests.
|
||||||
|
|
||||||
|
## Task: Implement a non-mutating inventory CLI
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0019-T02
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "538cb9a5-48f3-470c-8518-29ee66c96678"
|
||||||
|
```
|
||||||
|
|
||||||
|
Add a deterministic CLI path for listing scheduled automations without requiring
|
||||||
|
LLM credentials or coding assistant tooling.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- A script or module command, likely sharing code with
|
||||||
|
`activity_core.automation_status`, supports human and JSON output.
|
||||||
|
- The command is read-only: it does not call `sync-schedules`, upsert schedules,
|
||||||
|
delete schedules, enqueue workflows, or write State Hub evidence.
|
||||||
|
- It supports filters by activity id, activity name, enabled state, and trigger
|
||||||
|
type.
|
||||||
|
- It loads from the database when configured and falls back to repo definition
|
||||||
|
files when the database is unavailable or explicitly disabled.
|
||||||
|
- It optionally enriches rows from Temporal when `TEMPORAL_HOST` is configured,
|
||||||
|
with bounded timeouts so an unreachable service does not hang the command.
|
||||||
|
- Unit tests cover DB rows, file fallback, disabled definitions, Temporal
|
||||||
|
enrichment unavailable, and JSON output.
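
The database-first, file-fallback loading rule with filters can be sketched as below. The function name `load_inventory`, the loader callables, and the row keys other than `trigger_type` are hypothetical; the sketch only illustrates the precedence and warning behavior in the acceptance list and performs no writes.

```python
def load_inventory(db_loader, file_loader, enabled=None, trigger=None):
    """Return (rows, warnings), preferring the database over repo files.

    `db_loader` may be None (database explicitly disabled) or may raise
    when the database is unreachable; `file_loader` reads definition files.
    """
    warnings = []
    rows = None
    if db_loader is not None:
        try:
            rows = [dict(r, source="database") for r in db_loader()]
        except Exception as exc:  # sketch only; real code should catch narrower errors
            warnings.append(f"database unavailable: {type(exc).__name__}")
    if rows is None:
        rows = [dict(r, source="repo definition file") for r in file_loader()]
    # Only scheduled trigger types belong in the inventory.
    rows = [r for r in rows if r.get("trigger_type") in {"cron", "scheduled"}]
    if enabled is not None:
        rows = [r for r in rows if r.get("enabled") is enabled]
    if trigger is not None:
        rows = [r for r in rows if r.get("trigger_type") == trigger]
    return rows, warnings
```

Disabled definitions stay in the result unless the caller filters on `enabled`, which matches the requirement that the inventory lists them rather than omitting them silently.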
|
||||||
|
|
||||||
|
## Task: Add Make targets
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0019-T03
|
||||||
|
status: done
|
||||||
|
priority: high
|
||||||
|
state_hub_task_id: "f2001721-07f3-42f5-a15e-0c7d1b0ed801"
|
||||||
|
```
|
||||||
|
|
||||||
|
Expose the inventory command through Make targets that are easy for humans,
|
||||||
|
scripts, and coding assistants to run before asking for a prose summary.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- `make automation-list` prints a concise human-readable inventory.
|
||||||
|
- `make automation-list-json` emits the same inventory as JSON.
|
||||||
|
- Optional Make variables pass through cleanly, for example `ENABLED=true`,
|
||||||
|
`TRIGGER=cron`, `ACTIVITY_ID=<uuid>`, or `FORMAT=json`.
|
||||||
|
- `make help` lists both targets with clear one-line descriptions.
|
||||||
|
- The targets do not require LLM access, coding assistant automation tooling, or
|
||||||
|
interactive prompts.
|
||||||
|
|
||||||
|
## Task: Document the inventory workflow
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0019-T04
|
||||||
|
status: done
|
||||||
|
priority: medium
|
||||||
|
state_hub_task_id: "f687743b-3936-413e-ae50-d35484ae9a81"
|
||||||
|
```
|
||||||
|
|
||||||
|
Update operator documentation so the scheduled automation inventory path is
|
||||||
|
discoverable next to the status path.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- `docs/runbook.md` documents `make automation-list` and
|
||||||
|
`make automation-list-json`.
|
||||||
|
- The docs distinguish inventory from status: inventory answers what is
|
||||||
|
configured; status answers what happened in a time window.
|
||||||
|
- The docs state that the command is read-only and uses activity-core-owned
|
||||||
|
scheduling evidence.
|
||||||
|
- The docs include a compact example of the expected human output.
|
||||||
|
|
||||||
|
## Task: Verify against current repo and live/degraded sources
|
||||||
|
|
||||||
|
```task
|
||||||
|
id: ACTIVITY-WP-0019-T05
|
||||||
|
status: done
|
||||||
|
priority: medium
|
||||||
|
state_hub_task_id: "5317b532-5cef-4eff-b6d8-3e85bbca8e8a"
|
||||||
|
```
|
||||||
|
|
||||||
|
Prove the target against the current scheduled automation definitions and
|
||||||
|
degraded local conditions.
|
||||||
|
|
||||||
|
Acceptance:
|
||||||
|
|
||||||
|
- `make automation-list` shows the current scheduled automations, including
|
||||||
|
daily triage and weekly scheduled definitions when present in the selected
|
||||||
|
source.
|
||||||
|
- JSON output is valid and includes the same rows.
|
||||||
|
- A DB-unavailable run falls back to repo definition files or reports a clear
|
||||||
|
warning if no definitions are discoverable.
|
||||||
|
- A Temporal-unavailable run exits successfully with Temporal warnings rather
|
||||||
|
than hanging.
|
||||||
|
- Focused tests pass and the result is recorded in this workplan before the
|
||||||
|
workplan is moved to `finished`.
|
||||||
|
|
||||||
|
|
||||||
|
## Implementation Result
|
||||||
|
|
||||||
|
Completed 2026-07-01: implemented the read-only scheduled automation inventory
|
||||||
|
surface.
|
||||||
|
|
||||||
|
Delivered:
|
||||||
|
|
||||||
|
- `scripts/automation_inventory.py` exposes the inventory CLI backed by
|
||||||
|
`activity_core.automation_status` shared definition and Temporal helpers.
|
||||||
|
- `make automation-list` and `make automation-list-json` list configured
|
||||||
|
scheduled ActivityDefinitions with filters for `ENABLED`, `TRIGGER`,
|
||||||
|
`ACTIVITY_ID`, and `ACTIVITY_NAME`.
|
||||||
|
- JSON output is script-safe; the Make JSON target suppresses command echo and
|
||||||
|
recursive make directory chatter.
|
||||||
|
- `docs/runbook.md` now distinguishes inventory (what is configured) from status
|
||||||
|
(what happened in a time window).
|
||||||
|
- Tests cover DB-backed rows, file fallback, disabled filtering, Temporal
|
||||||
|
unavailable warnings, and JSON CLI output.
|
||||||
|
|
||||||
|
Verification:
|
||||||
|
|
||||||
|
```bash
/home/worsch/.local/bin/uv run pytest tests/test_automation_status.py tests/test_daily_triage_verifier.py -q
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make automation-list ACTCORE_DB_URL= TEMPORAL_HOST='
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make automation-list-json ACTCORE_DB_URL= TEMPORAL_HOST= > /tmp/activity-core-inventory.json && python3 -m json.tool /tmp/activity-core-inventory.json >/tmp/activity-core-inventory.pretty'
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make automation-list ACTCORE_DB_URL= TEMPORAL_HOST= ENABLED=true TRIGGER=cron'
bash -lc 'export PATH="/home/worsch/.local/bin:$PATH"; make help'
```
|
||||||
|
|
||||||
|
Results:
|
||||||
|
|
||||||
|
- focused tests: `16 passed`;
|
||||||
|
- degraded Make inventory run listed 9 file-backed scheduled automations, with
|
||||||
|
5 enabled and 4 disabled;
|
||||||
|
- filtered Make run with `ENABLED=true TRIGGER=cron` listed 5 enabled cron
|
||||||
|
automations;
|
||||||
|
- `automation-list-json` emitted parseable JSON directly;
|
||||||
|
- `make help` lists `automation-list` and `automation-list-json`.
|
||||||
@@ -3,6 +3,7 @@ type: session-note
|
|||||||
 created: "2026-03-28"
 updated: "2026-06-03"
 status: archived
+state_hub_workstream_id: "b221e65a-6f97-44b0-8dae-442fffcb7f64"
 ---
 
 # WP-0002 Handoff Note — Continue on CoulombCore
|
|||||||