Files
activity-core/workplans/ADHOC-2026-06-01.md
tegwick 30598fd1ad Expand rule actions for per-repo tasks
Add safe action interpolation and for_each binding for rule fan-out, update the weekly SBOM definition, cover the new evaluation path, and reconcile activity-core scope/workplans for the State Hub sync.
2026-06-03 11:58:24 +02:00

226 lines
9.7 KiB
Markdown

---
id: ADHOC-2026-06-01
type: workplan
title: "Ad hoc — activity-core opportunistic fixes 2026-06-01"
domain: custodian
repo: activity-core
status: finished
owner: custodian
topic_slug: custodian
created: "2026-06-01"
updated: "2026-06-03"
state_hub_workstream_id: "36162ff0-9b47-47c4-8602-56767f9b7a1c"
---
# ADHOC-2026-06-01 — activity-core opportunistic fixes
Captured during the CUST-WP-0045 T06 cutover prep session. The dev worker was
brought up and surfaced an unrelated, pre-existing bug in the state-hub
context resolver that is independent of the daily triage canary.
## Tasks
### T01 - Fix repo_sbom_status resolver route and params
```task
id: ADHOC-2026-06-01-T01
status: done
priority: low
state_hub_task_id: "87b56da9-e692-4350-9aff-47080414ec06"
```
`src/activity_core/context_resolvers/state_hub.py` resolves
`query: repo_sbom_status` by calling `GET /sbom/status?repo={repo_slug}`, but
State Hub does not expose `/sbom/status` at all. Actual SBOM routes are
`/sbom/`, `/sbom/{repo_slug}`, `/sbom/snapshots/`, `/sbom/snapshots/{id}`,
`/sbom/ingest/`, `/sbom/report/licences/`.
Compounding bug: the only ActivityDefinition using this query is
`activity-definitions/weekly-sbom-staleness.md`, which passes
`params: { repos: all }`. The resolver reads `params.get("repo_slug", "")`,
so the lookup URL collapses to `/sbom/status?repo=` regardless of the
ActivityDefinition value.
Symptom: every Monday at 09:00 Europe/Berlin (and on worker startup after a
missed Monday tick), the `weekly-sbom-staleness` workflow runs and the
resolver logs `HTTP/1.1 404 Not Found` for `GET /sbom/status?repo=`. The
`_fetch_json` helper swallows the error and returns `{}`, so the workflow
continues but the downstream rule evaluates
`context.repos.sbom_age_days > 30` against an empty dict and never spawns
the intended SBOM rescan tasks. The weekly SBOM staleness check has been
silently no-op for as long as this route mismatch has existed.
Fix scope:
1. Decide the contract — single-repo lookup (current parameter shape suggests
this) versus multi-repo bulk lookup (`repos: all` suggests this).
2. Update the resolver to call the actual State Hub route(s):
- single repo: `GET /sbom/{repo_slug}` (or `/sbom/{repo_slug}/status` if a
status-shaped projection is preferred and exists).
- bulk: iterate the State Hub `/repos/` list and call `/sbom/{repo_slug}`
per repo, returning a list bound to `context.repos`.
3. Update `activity-definitions/weekly-sbom-staleness.md` to match: either pass
a real `repo_slug` per definition (multiple definitions, one per repo) or
keep `repos: all` and let the resolver fan out.
4. Update the rule expression to traverse the resulting shape — currently
`context.repos.sbom_age_days` assumes a single object; if the resolver
returns a list, the rule needs `any(repo.sbom_age_days > 30 for repo in
context.repos)` or an equivalent per-repo evaluation.
5. Add a resolver unit test that asserts the resolver hits a route State Hub
actually serves, and an integration test against a fixture State Hub
response so this regression cannot repeat.
Out of scope for this adhoc:
- Decoupling SBOM staleness rules from the state hub resolver.
- Rewriting the SBOM ingestion pipeline or `sbom_source` policy.
- Promoting this to a full workplan unless the multi-repo decision turns out
to need design discussion.
Done when `weekly-sbom-staleness` runs cleanly against a live State Hub on
Monday and either spawns SBOM rescan tasks for stale repos or leaves a clear
"all SBOMs fresh" audit row — not a 404 log line and a silent no-op.
**Completion — 2026-06-01:**
Resolver now supports two modes selected by params:
- single-repo: `params: {repo_slug: foo}``GET /sbom/{foo}`
- bulk: `params: {repos: all}``GET /repos/`, computes per-repo age,
returns the worst-repo fields hoisted to the top of the result alongside
`stale_count`, `total_count`, `worst_*` fields, and the full per-repo list
Never-scanned repos use a `99999` sentinel age so threshold rules treat them
as very stale without forcing the rule expression to special-case `None`.
`activity-definitions/weekly-sbom-staleness.md` kept its existing rule
expression `context.repos.sbom_age_days > 30` (the resolver hoists the worst
repo's age to that path). The definition now documents that the rule fires
at most once per workflow run, not once per stale repo, and that the
aspirational per-stale-repo fan-out exercised by the integration tests is
not delivered by the current workflow.
Live validation against the running State Hub on 2026-06-01:
- single: `activity-core` → 36 days since SBOM ingest at 2026-04-26
- bulk: 48 repos total, 46 stale (>30d); worst is `info-tech-canon`
(`last_sbom_at: null` → 99999d sentinel); rule expression evaluates True
Tests: `uv run pytest -q` → 120 passed, 1 skipped (previously 116 passed +
4 broken integration tests; broken-on-my-change reverted by hoisting the
worst-repo fields to the top of `context.repos`).
### T02 - Rule action context interpolation and per-iteration binding
```task
id: ADHOC-2026-06-01-T02
status: done
priority: low
state_hub_task_id: "6b3a185e-cbea-454c-82fb-8b4c16cefef0"
```
Discovered while completing T01: `RunActivityWorkflow` builds each
`TaskSpec` by lifting raw YAML fields out of the rule action without ever
interpolating `context.*` references:
```python
# src/activity_core/workflows.py
task_spec_dicts.append({
"title": action.get("task_template", rule.get("id", "")),
"target_repo": action.get("target_repo"),
...
})
```
So `target_repo: context.repos.repo_slug` in an ActivityDefinition rule is
emitted to the spawn log as the literal string `"context.repos.repo_slug"`,
not the actual stale repo slug. The aspirational per-stale-repo fan-out
exercised by `test_pipeline_emits_one_task_for_stale_repo_only` and friends
in `tests/test_integration_event_bridge.py` is *not* delivered by the
workflow — those tests simulate a per-repo iteration the real workflow
does not perform.
Two pieces of work, likely related:
1. **Action field interpolation.** Define and implement a safe template
grammar for `action.target_repo`, `action.task_template`,
`action.priority`, `action.labels`, etc. Reuse the rule-condition AST
walker (no `exec`, no comprehensions) or a constrained string
`{context.foo.bar}` substitution. Decide on grammar — instruction
prompt rendering uses `{...}` placeholders today
(`rules/executor.py::_render_prompt`); consistent with that is probably
right.
2. **Per-iteration context binding.** Decide whether the workflow should
evaluate a rule once per element of a list-valued context field (the
integration-test contract), or whether the spawn-once semantics is
actually desired and the tests should be relaxed. If iteration is the
answer, the resolver shape from T01 already gives a clean `repos` list
to iterate over; the workflow would need an explicit `for_each:`
directive on the rule, or implicit iteration when `condition` references
a list element.
This is borderline workplan-grade work (design decision + security review of
the interpolation grammar + workflow change + test updates). Promote to a
full workplan if anyone decides to actually do it; the adhoc T02 is just to
make sure the gap doesn't get forgotten.
Done when either: (a) rule action fields interpolate `context.*`
expressions and a stale-repo workflow run emits a TaskSpec with the actual
repo slug, or (b) a recorded decision explicitly defers/declines the change
with reasoning.
**Completion — 2026-06-03:**
Implemented explicit rule action expansion in `activity_core.rules.actions`.
`evaluate_rules` now returns concrete TaskSpec dictionaries directly, and
`RunActivityWorkflow` no longer lifts raw YAML action fields itself.
Action fields support two safe interpolation forms:
- whole-field paths such as `target_repo: context.repo.repo_slug`
- scalar placeholders such as `task_template: Run SBOM rescan for {context.repo.repo_slug}`
Rules may opt into per-item binding with:
```yaml
for_each: context.repos.repos
bind_as: repo
condition: 'context.repo.sbom_age_days > 30'
```
`activity-definitions/weekly-sbom-staleness.md` now uses that explicit
contract, so bulk SBOM staleness evaluation emits one task per stale repo
instead of one task for the hoisted worst repo. Tests cover direct action
interpolation, `for_each` binding, activity-level rule evaluation, and the
weekly SBOM integration path.
Tests: `PYTHONPATH=src .venv/bin/python -m pytest -q` -> 125 passed, 1 skipped.
### T03 - Make activity-core's Temporal activity timeout env-configurable
```task
id: ADHOC-2026-06-01-T03
status: done
priority: low
state_hub_task_id: "bc9c9edb-e20b-4ff9-a15d-6e3e81f9b5e1"
```
Discovered during the CUST-WP-0045 T06 canary on 2026-06-01. The daily
triage instruction call hit `BrokenPipeError` on the llm-connect side
because two 5-minute timeouts were racing:
- `_ACTIVITY_TIMEOUT = timedelta(minutes=5)` in `workflows.py`
- `LLM_CONNECT_TIMEOUT_SECONDS` default `300` in `llm_client.py`
The 10KB curated digest + `max_depth: 2` + JSON schema enforcement pushed
Claude past 5 minutes. Whichever timer fired first killed the httpx call,
and the model's late response arrived to a closed socket.
Fix: read `_ACTIVITY_TIMEOUT` from env `ACTIVITY_TIMEOUT_SECONDS` (default
`900` — 15 minutes), so the Temporal activity outlives a normal slow LLM
run. Operators are expected to also widen httpx via
`LLM_CONNECT_TIMEOUT_SECONDS=840` (or similar) so httpx still times out
slightly *before* Temporal, preserving the clean-error contract.
The activity timeout default is now larger by design — Temporal will still
heartbeat and Temporal-side cancellation still works; this only widens the
upper bound for long judgment-call activities like the daily triage.