
tegwick 30598fd1ad Expand rule actions for per-repo tasks

Add safe action interpolation and for_each binding for rule fan-out, update the weekly SBOM definition, cover the new evaluation path, and reconcile activity-core scope/workplans for the State Hub sync.

2026-06-03 11:58:24 +02:00

9.7 KiB



id	type	title	domain	repo	status	owner	topic_slug	created	updated	state_hub_workstream_id
ADHOC-2026-06-01	workplan	Ad hoc — activity-core opportunistic fixes 2026-06-01	custodian	activity-core	finished	custodian	custodian	2026-06-01	2026-06-03	36162ff0-9b47-47c4-8602-56767f9b7a1c

ADHOC-2026-06-01 — activity-core opportunistic fixes

Captured during the CUST-WP-0045 T06 cutover prep session. The dev worker was brought up and surfaced an unrelated, pre-existing bug in the state-hub context resolver that is independent of the daily triage canary.

Tasks

T01 - Fix repo_sbom_status resolver route and params

id: ADHOC-2026-06-01-T01
status: done
priority: low
state_hub_task_id: "87b56da9-e692-4350-9aff-47080414ec06"

src/activity_core/context_resolvers/state_hub.py resolves query: repo_sbom_status by calling GET /sbom/status?repo={repo_slug}, but State Hub does not expose /sbom/status at all. Actual SBOM routes are /sbom/, /sbom/{repo_slug}, /sbom/snapshots/, /sbom/snapshots/{id}, /sbom/ingest/, /sbom/report/licences/.

Compounding bug: the only ActivityDefinition using this query is activity-definitions/weekly-sbom-staleness.md, which passes params: { repos: all }. The resolver reads params.get("repo_slug", ""), so the lookup URL collapses to /sbom/status?repo= regardless of the ActivityDefinition value.

Symptom: every Monday at 09:00 Europe/Berlin (and on worker startup after a missed Monday tick), the weekly-sbom-staleness workflow runs and the resolver logs HTTP/1.1 404 Not Found for GET /sbom/status?repo=. The _fetch_json helper swallows the error and returns {}, so the workflow continues, but the downstream rule evaluates context.repos.sbom_age_days > 30 against an empty dict and never spawns the intended SBOM rescan tasks. The weekly SBOM staleness check has been a silent no-op for as long as this route mismatch has existed.
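The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual resolver or rule engine: `fetch_json_swallowing`, `lookup`, and `is_stale` are hypothetical stand-ins, and the assumption that a missing path evaluates as "not stale" is inferred from the symptom, not taken from the code.

```python
from typing import Any, Callable


def fetch_json_swallowing(fetch: Callable[[], dict]) -> dict:
    """Hypothetical stand-in for a helper that hides transport errors."""
    try:
        return fetch()
    except Exception:
        # The 404 is lost here; callers only ever see an empty dict.
        return {}


def lookup(context: dict, dotted: str) -> Any:
    """Walk a dotted path such as 'context.repos.sbom_age_days'."""
    node: Any = context
    for part in dotted.split(".")[1:]:  # skip the leading 'context'
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def is_stale(context: dict, threshold: int = 30) -> bool:
    """Assumed rule semantics: a missing age never satisfies the threshold."""
    age = lookup(context, "context.repos.sbom_age_days")
    return age is not None and age > threshold


def failing_fetch() -> dict:
    raise RuntimeError("HTTP/1.1 404 Not Found")


broken_context = {"repos": fetch_json_swallowing(failing_fetch)}
print(is_stale(broken_context))  # False: no task is spawned, no error surfaces
```

Under these assumptions the empty dict is indistinguishable from "every repo is fresh", which is why the check could stay broken without anyone noticing.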

Fix scope:

Decide the contract — single-repo lookup (current parameter shape suggests this) versus multi-repo bulk lookup (repos: all suggests this).
Update the resolver to call the actual State Hub route(s):
- single repo: GET /sbom/{repo_slug} (or /sbom/{repo_slug}/status if a status-shaped projection is preferred and exists).
- bulk: iterate the State Hub /repos/ list and call /sbom/{repo_slug} per repo, returning a list bound to context.repos.
Update activity-definitions/weekly-sbom-staleness.md to match: either pass a real repo_slug per definition (multiple definitions, one per repo) or keep repos: all and let the resolver fan out.
Update the rule expression to traverse the resulting shape — currently context.repos.sbom_age_days assumes a single object; if the resolver returns a list, the rule needs any(repo.sbom_age_days > 30 for repo in context.repos) or an equivalent per-repo evaluation.
Add a resolver unit test that asserts the resolver hits a route State Hub actually serves, and an integration test against a fixture State Hub response so this regression cannot repeat.

Out of scope for this adhoc:

Decoupling SBOM staleness rules from the state hub resolver.
Rewriting the SBOM ingestion pipeline or sbom_source policy.
Promoting this to a full workplan unless the multi-repo decision turns out to need design discussion.

Done when weekly-sbom-staleness runs cleanly against a live State Hub on Monday and either spawns SBOM rescan tasks for stale repos or leaves a clear "all SBOMs fresh" audit row — not a 404 log line and a silent no-op.

Completion — 2026-06-01:

Resolver now supports two modes selected by params:

single-repo: params: {repo_slug: foo} → GET /sbom/{foo}
bulk: params: {repos: all} → GET /repos/, computes per-repo age, returns the worst-repo fields hoisted to the top of the result alongside stale_count, total_count, worst_* fields, and the full per-repo list

Never-scanned repos use a 99999 sentinel age so threshold rules treat them as very stale without forcing the rule expression to special-case None.
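As an illustration of the bulk mode and the sentinel, here is a rough sketch of the age computation and the worst-repo hoisting. It is not the code in state_hub.py: the function names are hypothetical, and it assumes last_sbom_at is either null or an ISO date string (the real field may carry a full timestamp). The worst_* fields are omitted because their exact names are not given above.

```python
from datetime import date
from typing import Optional

NEVER_SCANNED_AGE_DAYS = 99999  # sentinel age for repos with no SBOM ingest


def sbom_age_days(last_sbom_at: Optional[str], today: date) -> int:
    """Days since the last SBOM ingest, or the sentinel if never scanned."""
    if last_sbom_at is None:
        return NEVER_SCANNED_AGE_DAYS
    return (today - date.fromisoformat(last_sbom_at)).days


def summarize_bulk(repos: list, today: date, threshold: int = 30) -> dict:
    """Build a bulk result with the worst repo's fields hoisted to the top."""
    per_repo = [
        {
            "repo_slug": r["repo_slug"],
            "sbom_age_days": sbom_age_days(r.get("last_sbom_at"), today),
        }
        for r in repos
    ]
    result = {
        "stale_count": sum(1 for r in per_repo if r["sbom_age_days"] > threshold),
        "total_count": len(per_repo),
        "repos": per_repo,
    }
    if per_repo:
        # Hoist the worst repo so a rule such as
        # context.repos.sbom_age_days > 30 still has a scalar to compare.
        result.update(max(per_repo, key=lambda r: r["sbom_age_days"]))
    return result


summary = summarize_bulk(
    [
        {"repo_slug": "activity-core", "last_sbom_at": "2026-04-26"},
        {"repo_slug": "info-tech-canon", "last_sbom_at": None},
    ],
    today=date(2026, 6, 1),
)
print(summary["repo_slug"], summary["sbom_age_days"])  # info-tech-canon 99999
```

Because the sentinel is an ordinary integer, the threshold comparison needs no None handling, and a never-scanned repo always sorts as the worst one.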

activity-definitions/weekly-sbom-staleness.md kept its existing rule expression context.repos.sbom_age_days > 30 (the resolver hoists the worst repo's age to that path). The definition now documents that the rule fires at most once per workflow run, not once per stale repo, and that the aspirational per-stale-repo fan-out exercised by the integration tests is not delivered by the current workflow.

Live validation against the running State Hub on 2026-06-01:

single: activity-core → 36 days since SBOM ingest at 2026-04-26
bulk: 48 repos total, 46 stale (>30d); worst is info-tech-canon (last_sbom_at: null → 99999d sentinel); rule expression evaluates True

Tests: uv run pytest -q → 120 passed, 1 skipped (previously 116 passed + 4 broken integration tests; the 4 failures were introduced by this change and fixed by hoisting the worst-repo fields to the top of context.repos).

T02 - Rule action context interpolation and per-iteration binding

id: ADHOC-2026-06-01-T02
status: done
priority: low
state_hub_task_id: "6b3a185e-cbea-454c-82fb-8b4c16cefef0"

Discovered while completing T01: RunActivityWorkflow builds each TaskSpec by lifting raw YAML fields out of the rule action without ever interpolating context.* references:

# src/activity_core/workflows.py
task_spec_dicts.append({
    "title": action.get("task_template", rule.get("id", "")),
    "target_repo": action.get("target_repo"),
    ...
})

So target_repo: context.repos.repo_slug in an ActivityDefinition rule is emitted to the spawn log as the literal string "context.repos.repo_slug", not the actual stale repo slug. The aspirational per-stale-repo fan-out exercised by test_pipeline_emits_one_task_for_stale_repo_only and friends in tests/test_integration_event_bridge.py is not delivered by the workflow — those tests simulate a per-repo iteration the real workflow does not perform.

Two pieces of work, likely related:

Action field interpolation. Define and implement a safe template grammar for action.target_repo, action.task_template, action.priority, action.labels, etc. Reuse the rule-condition AST walker (no exec, no comprehensions) or a constrained string {context.foo.bar} substitution. Decide on the grammar: instruction prompt rendering uses {...} placeholders today (rules/executor.py::_render_prompt), so staying consistent with that is probably right.
Per-iteration context binding. Decide whether the workflow should evaluate a rule once per element of a list-valued context field (the integration-test contract), or whether spawn-once semantics are actually desired and the tests should be relaxed. If iteration is the answer, the resolver shape from T01 already gives a clean repos list to iterate over; the workflow would need an explicit for_each: directive on the rule, or implicit iteration when condition references a list element.

This is borderline workplan-grade work (design decision + security review of the interpolation grammar + workflow change + test updates). Promote to a full workplan if anyone decides to actually do it; the adhoc T02 is just to make sure the gap doesn't get forgotten.

Done when either: (a) rule action fields interpolate context.* expressions and a stale-repo workflow run emits a TaskSpec with the actual repo slug, or (b) a recorded decision explicitly defers/declines the change with reasoning.

Completion — 2026-06-03:

Implemented explicit rule action expansion in activity_core.rules.actions. evaluate_rules now returns concrete TaskSpec dictionaries directly, and RunActivityWorkflow no longer lifts raw YAML action fields itself.

Action fields support two safe interpolation forms:

whole-field paths such as target_repo: context.repo.repo_slug
scalar placeholders such as task_template: Run SBOM rescan for {context.repo.repo_slug}
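The two forms above could be implemented roughly as follows. This is a hedged sketch of one possible approach, not the code in activity_core.rules.actions: `resolve_path` and `interpolate` are hypothetical names, and the choices to reject non-scalar placeholder values and to raise on unknown paths are assumptions.

```python
import re
from typing import Any

# A dotted path rooted at 'context', e.g. context.repo.repo_slug
_PATH = r"context(?:\.[A-Za-z_][A-Za-z0-9_]*)+"
_WHOLE_FIELD = re.compile(_PATH)
_PLACEHOLDER = re.compile(r"\{(" + _PATH + r")\}")


def resolve_path(path: str, context: dict) -> Any:
    """Walk dict keys only; no attribute access, no evaluation of code."""
    node: Any = context
    for part in path.split(".")[1:]:  # skip the leading 'context'
        if not isinstance(node, dict) or part not in node:
            raise KeyError("unknown context path: " + path)
        node = node[part]
    return node


def interpolate(value: Any, context: dict) -> Any:
    """Apply whole-field or placeholder interpolation to one action field."""
    if not isinstance(value, str):
        return value
    if _WHOLE_FIELD.fullmatch(value):
        # Whole-field form: the resolved value keeps its original type.
        return resolve_path(value, context)

    def substitute(match: "re.Match[str]") -> str:
        resolved = resolve_path(match.group(1), context)
        if isinstance(resolved, (dict, list)):
            raise TypeError("placeholder must resolve to a scalar")
        return str(resolved)

    # Placeholder form: scalars are rendered into the surrounding text.
    return _PLACEHOLDER.sub(substitute, value)


ctx = {"repo": {"repo_slug": "activity-core", "sbom_age_days": 36}}
print(interpolate("context.repo.repo_slug", ctx))
print(interpolate("Run SBOM rescan for {context.repo.repo_slug}", ctx))
```

Restricting the grammar to dotted key lookups keeps the security review small: there is no expression evaluation at all, so a malicious or mistyped action field can at worst raise an error.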

Rules may opt into per-item binding with:

for_each: context.repos.repos
bind_as: repo
condition: 'context.repo.sbom_age_days > 30'

activity-definitions/weekly-sbom-staleness.md now uses that explicit contract, so bulk SBOM staleness evaluation emits one task per stale repo instead of one task for the hoisted worst repo. Tests cover direct action interpolation, for_each binding, activity-level rule evaluation, and the weekly SBOM integration path.
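To make the per-item contract concrete, the sketch below expands one rule into one task per matching list element. It is an illustration under stated assumptions, not the real evaluate_rules: the `action` key, the `expand_rule` and `evaluate` helpers, the tiny comparison-only expression evaluator, and the `fresh-repo` sample entry are all hypothetical.

```python
import ast
import operator
from typing import Any

_COMPARATORS = {
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def _eval(node: ast.AST, context: dict) -> Any:
    """Evaluate a whitelisted subset: constants, context paths, one comparison."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name) and node.id == "context":
        return context
    if isinstance(node, ast.Attribute):
        return _eval(node.value, context)[node.attr]  # dict key lookup only
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in _COMPARATORS):
        left = _eval(node.left, context)
        right = _eval(node.comparators[0], context)
        return _COMPARATORS[type(node.ops[0])](left, right)
    raise ValueError("unsupported expression: " + ast.dump(node))


def evaluate(expr: str, context: dict) -> Any:
    return _eval(ast.parse(expr, mode="eval").body, context)


def expand_rule(rule: dict, context: dict) -> list:
    """Emit one task dict per for_each item whose condition holds."""
    items = evaluate(rule["for_each"], context) if "for_each" in rule else [None]
    tasks = []
    for item in items:
        scoped = dict(context)  # shallow copy so bindings do not leak
        if item is not None:
            scoped[rule["bind_as"]] = item
        if evaluate(rule["condition"], scoped):
            tasks.append(
                {"target_repo": evaluate(rule["action"]["target_repo"], scoped)}
            )
    return tasks


rule = {
    "for_each": "context.repos.repos",
    "bind_as": "repo",
    "condition": "context.repo.sbom_age_days > 30",
    "action": {"target_repo": "context.repo.repo_slug"},
}
context = {"repos": {"repos": [
    {"repo_slug": "activity-core", "sbom_age_days": 36},
    {"repo_slug": "fresh-repo", "sbom_age_days": 3},  # hypothetical sample
    {"repo_slug": "info-tech-canon", "sbom_age_days": 99999},
]}}
print(expand_rule(rule, context))
```

With the sample context, the two stale repos each produce a task and the fresh one is skipped, which is the one-task-per-stale-repo behaviour the completion note describes.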

Tests: PYTHONPATH=src .venv/bin/python -m pytest -q -> 125 passed, 1 skipped.

T03 - Make activity-core's Temporal activity timeout env-configurable

id: ADHOC-2026-06-01-T03
status: done
priority: low
state_hub_task_id: "bc9c9edb-e20b-4ff9-a15d-6e3e81f9b5e1"

Discovered during the CUST-WP-0045 T06 canary on 2026-06-01. The daily triage instruction call hit BrokenPipeError on the llm-connect side because two 5-minute timeouts were racing:

_ACTIVITY_TIMEOUT = timedelta(minutes=5) in workflows.py
LLM_CONNECT_TIMEOUT_SECONDS default 300 in llm_client.py

The 10KB curated digest + max_depth: 2 + JSON schema enforcement pushed Claude past 5 minutes. Whichever timer fired first killed the httpx call, and the model's late response arrived to a closed socket.

Fix: read _ACTIVITY_TIMEOUT from env ACTIVITY_TIMEOUT_SECONDS (default 900 — 15 minutes), so the Temporal activity outlives a normal slow LLM run. Operators are expected to also widen httpx via LLM_CONNECT_TIMEOUT_SECONDS=840 (or similar) so httpx still times out slightly before Temporal, preserving the clean-error contract.

The activity timeout default is now larger by design — Temporal will still heartbeat and Temporal-side cancellation still works; this only widens the upper bound for long judgment-call activities like the daily triage.
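A minimal sketch of the env-driven timeout and the intended ordering between the two timers follows. The environment variable names and defaults (900 and 300) come from the description above; the helper functions and the `env` parameter are hypothetical and exist only to make the sketch testable without touching the real process environment.

```python
import os
from datetime import timedelta
from typing import Mapping


def activity_timeout(env: Mapping[str, str] = os.environ) -> timedelta:
    """Temporal activity timeout, read from ACTIVITY_TIMEOUT_SECONDS."""
    return timedelta(seconds=int(env.get("ACTIVITY_TIMEOUT_SECONDS", "900")))


def llm_connect_timeout_seconds(env: Mapping[str, str] = os.environ) -> float:
    """HTTP client timeout, read from LLM_CONNECT_TIMEOUT_SECONDS."""
    return float(env.get("LLM_CONNECT_TIMEOUT_SECONDS", "300"))


def http_times_out_first(env: Mapping[str, str] = os.environ) -> bool:
    """True when the HTTP call gives up before Temporal kills the activity."""
    return llm_connect_timeout_seconds(env) < activity_timeout(env).total_seconds()


# Operator configuration suggested above: widen httpx to just under 15 minutes.
print(http_times_out_first({"LLM_CONNECT_TIMEOUT_SECONDS": "840"}))
```

A check like `http_times_out_first` could run at worker startup to warn when the two values are equal or inverted, which is the racing configuration that produced the BrokenPipeError.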
