generated from coulomb/repo-seed
Add safe action interpolation and for_each binding for rule fan-out, update the weekly SBOM definition, cover the new evaluation path, and reconcile activity-core scope/workplans for the State Hub sync.
226 lines
9.7 KiB
Markdown
226 lines
9.7 KiB
Markdown
---
|
|
id: ADHOC-2026-06-01
|
|
type: workplan
|
|
title: "Ad hoc — activity-core opportunistic fixes 2026-06-01"
|
|
domain: custodian
|
|
repo: activity-core
|
|
status: finished
|
|
owner: custodian
|
|
topic_slug: custodian
|
|
created: "2026-06-01"
|
|
updated: "2026-06-03"
|
|
state_hub_workstream_id: "36162ff0-9b47-47c4-8602-56767f9b7a1c"
|
|
---
|
|
|
|
# ADHOC-2026-06-01 — activity-core opportunistic fixes
|
|
|
|
Captured during the CUST-WP-0045 T06 cutover prep session. The dev worker was
|
|
brought up and surfaced an unrelated, pre-existing bug in the state-hub
|
|
context resolver that is independent of the daily triage canary.
|
|
|
|
## Tasks
|
|
|
|
### T01 - Fix repo_sbom_status resolver route and params
|
|
|
|
```task
|
|
id: ADHOC-2026-06-01-T01
|
|
status: done
|
|
priority: low
|
|
state_hub_task_id: "87b56da9-e692-4350-9aff-47080414ec06"
|
|
```
|
|
|
|
`src/activity_core/context_resolvers/state_hub.py` resolves
|
|
`query: repo_sbom_status` by calling `GET /sbom/status?repo={repo_slug}`, but
|
|
State Hub does not expose `/sbom/status` at all. Actual SBOM routes are
|
|
`/sbom/`, `/sbom/{repo_slug}`, `/sbom/snapshots/`, `/sbom/snapshots/{id}`,
|
|
`/sbom/ingest/`, `/sbom/report/licences/`.
|
|
|
|
Compounding bug: the only ActivityDefinition using this query is
|
|
`activity-definitions/weekly-sbom-staleness.md`, which passes
|
|
`params: { repos: all }`. The resolver reads `params.get("repo_slug", "")`,
|
|
so the lookup URL collapses to `/sbom/status?repo=` regardless of the
|
|
ActivityDefinition value.
|
|
|
|
Symptom: every Monday at 09:00 Europe/Berlin (and on worker startup after a
|
|
missed Monday tick), the `weekly-sbom-staleness` workflow runs and the
|
|
resolver logs `HTTP/1.1 404 Not Found` for `GET /sbom/status?repo=`. The
|
|
`_fetch_json` helper swallows the error and returns `{}`, so the workflow
|
|
continues but the downstream rule evaluates
|
|
`context.repos.sbom_age_days > 30` against an empty dict and never spawns
|
|
the intended SBOM rescan tasks. The weekly SBOM staleness check has been
|
|
silently no-op for as long as this route mismatch has existed.
|
|
|
|
Fix scope:
|
|
|
|
1. Decide the contract — single-repo lookup (current parameter shape suggests
|
|
this) versus multi-repo bulk lookup (`repos: all` suggests this).
|
|
2. Update the resolver to call the actual State Hub route(s):
|
|
- single repo: `GET /sbom/{repo_slug}` (or `/sbom/{repo_slug}/status` if a
|
|
status-shaped projection is preferred and exists).
|
|
- bulk: iterate the State Hub `/repos/` list and call `/sbom/{repo_slug}`
|
|
per repo, returning a list bound to `context.repos`.
|
|
3. Update `activity-definitions/weekly-sbom-staleness.md` to match: either pass
|
|
a real `repo_slug` per definition (multiple definitions, one per repo) or
|
|
keep `repos: all` and let the resolver fan out.
|
|
4. Update the rule expression to traverse the resulting shape — currently
|
|
`context.repos.sbom_age_days` assumes a single object; if the resolver
|
|
returns a list, the rule needs `any(repo.sbom_age_days > 30 for repo in
|
|
context.repos)` or an equivalent per-repo evaluation.
|
|
5. Add a resolver unit test that asserts the resolver hits a route State Hub
|
|
actually serves, and an integration test against a fixture State Hub
|
|
response so this regression cannot repeat.
|
|
|
|
Out of scope for this adhoc:
|
|
|
|
- Decoupling SBOM staleness rules from the state hub resolver.
|
|
- Rewriting the SBOM ingestion pipeline or `sbom_source` policy.
|
|
- Promoting this to a full workplan unless the multi-repo decision turns out
|
|
to need design discussion.
|
|
|
|
Done when `weekly-sbom-staleness` runs cleanly against a live State Hub on
|
|
Monday and either spawns SBOM rescan tasks for stale repos or leaves a clear
|
|
"all SBOMs fresh" audit row — not a 404 log line and a silent no-op.
|
|
|
|
**Completion — 2026-06-01:**
|
|
|
|
Resolver now supports two modes selected by params:
|
|
- single-repo: `params: {repo_slug: foo}` → `GET /sbom/{foo}`
|
|
- bulk: `params: {repos: all}` → `GET /repos/`, computes per-repo age,
|
|
returns the worst-repo fields hoisted to the top of the result alongside
|
|
`stale_count`, `total_count`, `worst_*` fields, and the full per-repo list
|
|
|
|
Never-scanned repos use a `99999` sentinel age so threshold rules treat them
|
|
as very stale without forcing the rule expression to special-case `None`.
|
|
|
|
`activity-definitions/weekly-sbom-staleness.md` kept its existing rule
|
|
expression `context.repos.sbom_age_days > 30` (the resolver hoists the worst
|
|
repo's age to that path). The definition now documents that the rule fires
|
|
at most once per workflow run, not once per stale repo, and that the
|
|
aspirational per-stale-repo fan-out exercised by the integration tests is
|
|
not delivered by the current workflow.
|
|
|
|
Live validation against the running State Hub on 2026-06-01:
|
|
- single: `activity-core` → 36 days since SBOM ingest at 2026-04-26
|
|
- bulk: 48 repos total, 46 stale (>30d); worst is `info-tech-canon`
|
|
(`last_sbom_at: null` → 99999d sentinel); rule expression evaluates True
|
|
|
|
Tests: `uv run pytest -q` → 120 passed, 1 skipped (previously 116 passed +
|
|
4 broken integration tests; broken-on-my-change reverted by hoisting the
|
|
worst-repo fields to the top of `context.repos`).
|
|
|
|
### T02 - Rule action context interpolation and per-iteration binding
|
|
|
|
```task
|
|
id: ADHOC-2026-06-01-T02
|
|
status: done
|
|
priority: low
|
|
state_hub_task_id: "6b3a185e-cbea-454c-82fb-8b4c16cefef0"
|
|
```
|
|
|
|
Discovered while completing T01: `RunActivityWorkflow` builds each
|
|
`TaskSpec` by lifting raw YAML fields out of the rule action without ever
|
|
interpolating `context.*` references:
|
|
|
|
```python
|
|
# src/activity_core/workflows.py
|
|
task_spec_dicts.append({
|
|
"title": action.get("task_template", rule.get("id", "")),
|
|
"target_repo": action.get("target_repo"),
|
|
...
|
|
})
|
|
```
|
|
|
|
So `target_repo: context.repos.repo_slug` in an ActivityDefinition rule is
|
|
emitted to the spawn log as the literal string `"context.repos.repo_slug"`,
|
|
not the actual stale repo slug. The aspirational per-stale-repo fan-out
|
|
exercised by `test_pipeline_emits_one_task_for_stale_repo_only` and friends
|
|
in `tests/test_integration_event_bridge.py` is *not* delivered by the
|
|
workflow — those tests simulate a per-repo iteration the real workflow
|
|
does not perform.
|
|
|
|
Two pieces of work, likely related:
|
|
|
|
1. **Action field interpolation.** Define and implement a safe template
|
|
grammar for `action.target_repo`, `action.task_template`,
|
|
`action.priority`, `action.labels`, etc. Reuse the rule-condition AST
|
|
walker (no `exec`, no comprehensions) or a constrained string
|
|
`{context.foo.bar}` substitution. Decide on grammar — instruction
|
|
prompt rendering uses `{...}` placeholders today
|
|
(`rules/executor.py::_render_prompt`); consistent with that is probably
|
|
right.
|
|
|
|
2. **Per-iteration context binding.** Decide whether the workflow should
|
|
evaluate a rule once per element of a list-valued context field (the
|
|
integration-test contract), or whether the spawn-once semantics is
|
|
actually desired and the tests should be relaxed. If iteration is the
|
|
answer, the resolver shape from T01 already gives a clean `repos` list
|
|
to iterate over; the workflow would need an explicit `for_each:`
|
|
directive on the rule, or implicit iteration when `condition` references
|
|
a list element.
|
|
|
|
This is borderline workplan-grade work (design decision + security review of
|
|
the interpolation grammar + workflow change + test updates). Promote to a
|
|
full workplan if anyone decides to actually do it; the adhoc T02 is just to
|
|
make sure the gap doesn't get forgotten.
|
|
|
|
Done when either: (a) rule action fields interpolate `context.*`
|
|
expressions and a stale-repo workflow run emits a TaskSpec with the actual
|
|
repo slug, or (b) a recorded decision explicitly defers/declines the change
|
|
with reasoning.
|
|
|
|
**Completion — 2026-06-03:**
|
|
|
|
Implemented explicit rule action expansion in `activity_core.rules.actions`.
|
|
`evaluate_rules` now returns concrete TaskSpec dictionaries directly, and
|
|
`RunActivityWorkflow` no longer lifts raw YAML action fields itself.
|
|
|
|
Action fields support two safe interpolation forms:
|
|
- whole-field paths such as `target_repo: context.repo.repo_slug`
|
|
- scalar placeholders such as `task_template: Run SBOM rescan for {context.repo.repo_slug}`
|
|
|
|
Rules may opt into per-item binding with:
|
|
|
|
```yaml
|
|
for_each: context.repos.repos
|
|
bind_as: repo
|
|
condition: 'context.repo.sbom_age_days > 30'
|
|
```
|
|
|
|
`activity-definitions/weekly-sbom-staleness.md` now uses that explicit
|
|
contract, so bulk SBOM staleness evaluation emits one task per stale repo
|
|
instead of one task for the hoisted worst repo. Tests cover direct action
|
|
interpolation, `for_each` binding, activity-level rule evaluation, and the
|
|
weekly SBOM integration path.
|
|
|
|
Tests: `PYTHONPATH=src .venv/bin/python -m pytest -q` -> 125 passed, 1 skipped.
|
|
|
|
### T03 - Make activity-core's Temporal activity timeout env-configurable
|
|
|
|
```task
|
|
id: ADHOC-2026-06-01-T03
|
|
status: done
|
|
priority: low
|
|
state_hub_task_id: "bc9c9edb-e20b-4ff9-a15d-6e3e81f9b5e1"
|
|
```
|
|
|
|
Discovered during the CUST-WP-0045 T06 canary on 2026-06-01. The daily
|
|
triage instruction call hit `BrokenPipeError` on the llm-connect side
|
|
because two 5-minute timeouts were racing:
|
|
|
|
- `_ACTIVITY_TIMEOUT = timedelta(minutes=5)` in `workflows.py`
|
|
- `LLM_CONNECT_TIMEOUT_SECONDS` default `300` in `llm_client.py`
|
|
|
|
The 10KB curated digest + `max_depth: 2` + JSON schema enforcement pushed
|
|
Claude past 5 minutes. Whichever timer fired first killed the httpx call,
|
|
and the model's late response arrived to a closed socket.
|
|
|
|
Fix: read `_ACTIVITY_TIMEOUT` from env `ACTIVITY_TIMEOUT_SECONDS` (default
|
|
`900` — 15 minutes), so the Temporal activity outlives a normal slow LLM
|
|
run. Operators are expected to also widen httpx via
|
|
`LLM_CONNECT_TIMEOUT_SECONDS=840` (or similar) so httpx still times out
|
|
slightly *before* Temporal, preserving the clean-error contract.
|
|
|
|
The activity timeout default is now larger by design — Temporal will still
|
|
heartbeat and Temporal-side cancellation still works; this only widens the
|
|
upper bound for long judgment-call activities like the daily triage.
|