coulomb/activity-core

generated from coulomb/repo-seed

tegwick 20d4f26166 Implement post-triage operational hardening

2026-06-04 12:15:07 +02:00

ACT-ADR-003: Rule vs. Instruction Model and Expression DSL

Status

Accepted.

Context

ActivityDefinitions need two distinct evaluation modes to cover the full range of automation scenarios in the Coulomb org:

Deterministic cases: "if this repo has tag python-service AND has no SBOM in the last 30 days, create a scan task." The condition is fully expressible as a boolean predicate over known attributes. The output is fixed by the template. No ambiguity, no LLM required, fully testable.

Judgement cases: "a new repository has been registered — based on its domain and profile, determine what domain-specific onboarding tasks are appropriate." The right answer depends on context that is expensive to encode as explicit rules. An LLM is a better evaluator than a rule tree, but introduces non-determinism, cost, and a new attack surface (prompt injection via event payload).

Conflating these two modes into one mechanism produces a system that is either too rigid (rules only) or too unpredictable (LLM everywhere). The two modes need different evaluation pipelines, testing strategies, and audit trails.

Decision

Two named, distinct evaluation modes: Rule and Instruction.

Terminology is deliberate. A Rule is deterministic and mechanical — it applies or it does not. An Instruction is contextual and interpretive — it guides an LLM agent to make a judgement call. Both are expressed as fenced blocks in ActivityDefinition markdown files (see ACT-ADR-002).

Rules

A Rule has two parts: a condition (boolean predicate) and one or more actions (task template references).

Condition expression language

The condition is a single-line string expression evaluated by a sandboxed AST walker — never exec() or eval(). The evaluator walks the parsed AST and whitelist-checks every node type before executing. Unknown node types raise an UnsafeExpression error at parse time, not at evaluation time.

Available operations:

| Category | Syntax | Example |
| --- | --- | --- |
| Equality | `==`, `!=` | `event.type == "org.repo.registered"` |
| Comparison | `>`, `<`, `>=`, `<=` | `event.attributes.sbom_age_days > 30` |
| Membership | `in`, `not in` | `"python-service" in event.attributes.tags` |
| Boolean | `and`, `or`, `not` | `a and (b or not c)` |
| Grouping | `( )` | `(a or b) and c` |
| Length | `len(x)` | `len(event.attributes.affected_repos) > 0` |
| Existence | `x is None`, `x is not None` | `event.attributes.domain is not None` |

Attribute access follows dot notation on the event object and the context object (populated by context sources declared in the ActivityDefinition):

event.id — UUID string
event.type — event type identifier
event.version — event type version
event.timestamp — ISO 8601 datetime string
event.publisher — publisher identifier
event.attributes.{name} — typed attribute per event type schema
context.{source}.{field} — resolved context data

Explicitly forbidden (evaluator rejects at parse time):

Function calls other than len() (None tests use the is / is not operators and remain allowed)
Attribute access on arbitrary Python objects
String interpolation or formatting
Any control flow (if, for, while, lambda)
Import statements
Assignments

Design rationale: the expression language is intentionally small. Anything complex enough to need more than this belongs in an Instruction, not a Rule. When a rule condition becomes difficult to express, that is a signal that the case requires LLM judgement, not a signal that the DSL needs more features.
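The allowlist approach described above can be sketched with Python's standard `ast` module. This is a minimal illustration, not the actual RuleEvaluator: the helper names `parse_condition` and `_eval` are hypothetical, `event` and `context` are assumed to be plain nested dicts rather than an EventEnvelope, and resolving a missing attribute to `None` is an assumption the ADR does not state.

```python
import ast
import operator
from typing import Any

class UnsafeExpression(ValueError):
    """Raised at parse time for any construct outside the allowlist."""

# Node types the walker accepts; everything else is rejected before evaluation.
_ALLOWED = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.Compare, ast.Eq, ast.NotEq, ast.Gt, ast.Lt, ast.GtE, ast.LtE,
    ast.In, ast.NotIn, ast.Is, ast.IsNot,
    ast.Name, ast.Load, ast.Attribute, ast.Constant, ast.Call,
)
_COMPARE = {
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
    ast.Gt: operator.gt, ast.Lt: operator.lt,
    ast.GtE: operator.ge, ast.LtE: operator.le,
    ast.In: lambda a, b: a in b, ast.NotIn: lambda a, b: a not in b,
    ast.Is: operator.is_, ast.IsNot: operator.is_not,
}

def parse_condition(expr: str) -> ast.Expression:
    """Parse and allowlist-check an expression; raise UnsafeExpression if unsafe."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError as exc:
        raise UnsafeExpression(f"invalid expression: {exc.msg}") from exc
    # Remember which Name nodes sit in call position, so `len` is only valid there.
    call_funcs = {id(n.func) for n in ast.walk(tree) if isinstance(n, ast.Call)}
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED):
            raise UnsafeExpression(f"forbidden construct: {type(node).__name__}")
        if isinstance(node, ast.Call) and (
            not isinstance(node.func, ast.Name) or len(node.args) != 1 or node.keywords
        ):
            raise UnsafeExpression("only len(x) calls are allowed")
        if isinstance(node, ast.Name):
            allowed = {"len"} if id(node) in call_funcs else {"event", "context"}
            if node.id not in allowed:
                raise UnsafeExpression(f"unknown name: {node.id}")
    return tree

def _eval(node: ast.AST, scope: dict) -> Any:
    if isinstance(node, ast.Expression):
        return _eval(node.body, scope)
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return scope[node.id]
    if isinstance(node, ast.Attribute):
        base = _eval(node.value, scope)
        # Dot access is a dict lookup only, never getattr on a Python object.
        # Assumption: a missing key resolves to None.
        return base.get(node.attr) if isinstance(base, dict) else None
    if isinstance(node, ast.Call):  # parse_condition guarantees this is len(x)
        return len(_eval(node.args[0], scope))
    if isinstance(node, ast.UnaryOp):  # only `not` passes the allowlist
        return not _eval(node.operand, scope)
    if isinstance(node, ast.BoolOp):
        results = (_eval(v, scope) for v in node.values)
        return all(results) if isinstance(node.op, ast.And) else any(results)
    if isinstance(node, ast.Compare):
        left = _eval(node.left, scope)
        for op, comparator in zip(node.ops, node.comparators):
            right = _eval(comparator, scope)
            if not _COMPARE[type(op)](left, right):
                return False
            left = right
        return True
    raise UnsafeExpression(f"unhandled node: {type(node).__name__}")

def evaluate_condition(expr: str, event: dict, context: dict) -> bool:
    return bool(_eval(parse_condition(expr), {"event": event, "context": context}))
```

Because every node is checked before `_eval` runs, constructs such as `lambda`, indexing, arithmetic, f-strings, and method calls fail in `parse_condition` and never reach evaluation.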

Actions

A Rule's action block specifies:

action:
  task_template: "Run SBOM rescan for {context.repo.repo_slug}"
  target_repo: context.repo.repo_slug
  priority: medium
  labels: ["sbom", "security", "{context.repo.repo_slug}"]
  due_in_days: 7

action.task_template is the template for the emitted task title. It is not a path to a repo-local file. Older design notes and the legacy tasks/*.md directory use "task template" to mean materialized task-body templates; that is a separate legacy surface. To avoid surprise, new rule actions should treat task_template as having title_template semantics until the field can be renamed in a schema-breaking revision.

Action fields accept two deterministic rendering forms:

Whole-field paths: if the whole string is a path like context.repo.repo_slug or event.attributes.repo_slug, the rendered value keeps the original scalar/list/object shape from that path. This is the correct form for target_repo and other fields that should not become prose.
Scalar placeholders: strings may include {context.foo} or {event.foo} placeholders. Each placeholder must resolve to a scalar. Lists and objects are rejected rather than stringified, which prevents accidental JSON blobs or untrusted text from being embedded into task titles.

Unsafe action cases are rejected:

Any action path outside context.* or event.*.
Any path containing calls, indexing, arithmetic, filters, or boolean logic.
Placeholder values that resolve to lists or objects.
for_each values that are not a whole-field context.* or event.* path to a list.
bind_as names that are not simple identifiers.
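The two rendering forms and the rejection cases above can be sketched as follows. This is an illustrative sketch only: `UnsafeAction`, `resolve_path`, and `render_field` are hypothetical names, the scope is assumed to be a nested dict with `context` and `event` keys, and `billing-api` in the usage lines is a made-up slug.

```python
import re
from typing import Any

class UnsafeAction(ValueError):
    """Raised when an action field uses a path or value the renderer rejects."""

# A plain dotted path rooted at context or event: no calls, indexing, or operators.
_PATH = re.compile(r"(?:context|event)(?:\.[A-Za-z_][A-Za-z0-9_]*)+")
_PLACEHOLDER = re.compile(r"\{([^{}]*)\}")

def resolve_path(path: str, scope: dict) -> Any:
    if not _PATH.fullmatch(path):
        raise UnsafeAction(f"not a plain context.*/event.* path: {path!r}")
    value: Any = scope
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            raise UnsafeAction(f"unresolved path: {path!r}")
        value = value[part]
    return value

def render_field(raw: Any, scope: dict) -> Any:
    if isinstance(raw, list):
        return [render_field(item, scope) for item in raw]
    if not isinstance(raw, str):
        return raw  # e.g. due_in_days: 7 passes through unchanged
    if _PATH.fullmatch(raw):
        # Whole-field path: keep the original scalar/list/object shape.
        return resolve_path(raw, scope)

    def substitute(match: re.Match) -> str:
        value = resolve_path(match.group(1), scope)
        if isinstance(value, (list, dict)):
            raise UnsafeAction(f"placeholder {match.group(0)} is not a scalar")
        return str(value)

    # Scalar placeholders: each {path} must resolve to a scalar.
    return _PLACEHOLDER.sub(substitute, raw)

scope = {"context": {"repo": {"repo_slug": "billing-api", "tags": ["a", "b"]}}, "event": {}}
print(render_field("Run SBOM rescan for {context.repo.repo_slug}", scope))
print(render_field("context.repo.tags", scope))
```

Treating the whole-field form separately is what lets target_repo stay a typed value while titles and labels stay plain strings.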

Per-item rule expansion is explicit:

for_each: context.repos.repos
bind_as: repo
condition: 'context.repo.sbom_age_days > 30'
action:
  task_template: Run SBOM rescan for {context.repo.repo_slug}
  target_repo: context.repo.repo_slug
  priority: medium
  labels: ["sbom", "security", "automated"]

The weekly SBOM staleness definition is the canonical pattern. The State Hub bulk resolver exposes all repository entries at context.repos.repos, the rule binds each item as context.repo, and the strict staleness definition is context.repo.sbom_age_days > 30. Thirty days exactly is not stale; thirty-one days is stale.
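A minimal sketch of the per-item expansion and the strict boundary, assuming plain dict contexts. `expand_for_each` and `is_stale` are hypothetical helpers, `is_stale` stands in for the real condition evaluator, and the repo slugs are invented for illustration.

```python
from typing import Any

def expand_for_each(rule: dict, event: dict, context: dict) -> list[dict]:
    """Return one context per list item, with the item bound at context.<bind_as>."""
    bind_as = rule["bind_as"]
    if not bind_as.isidentifier():
        raise ValueError(f"bind_as must be a simple identifier: {bind_as!r}")
    parts = rule["for_each"].split(".")
    if parts[0] not in ("context", "event") or len(parts) < 2:
        raise ValueError("for_each must be a context.* or event.* path")
    items: Any = {"context": context, "event": event}
    for part in parts:
        items = items[part]
    if not isinstance(items, list):
        raise ValueError("for_each must resolve to a list")
    # Each expansion sees the shared context plus the bound item.
    return [{**context, bind_as: item} for item in items]

def is_stale(item_context: dict) -> bool:
    # Stand-in for the rule condition 'context.repo.sbom_age_days > 30'.
    return item_context["repo"]["sbom_age_days"] > 30

rule = {"for_each": "context.repos.repos", "bind_as": "repo"}
context = {"repos": {"repos": [
    {"repo_slug": "on-the-boundary", "sbom_age_days": 30},
    {"repo_slug": "one-day-over", "sbom_age_days": 31},
]}}
stale = [c["repo"]["repo_slug"] for c in expand_for_each(rule, {}, context) if is_stale(c)]
print(stale)  # ['one-day-over']
```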

Evaluation semantics

All rules in an ActivityDefinition are evaluated; all matching rules fire (not first-match-only). There is no implicit ordering beyond the file order, which is documented in the ActivityDefinition for human clarity.
A rule whose condition raises an error during evaluation is skipped and logged as rule_error; other rules still fire. This prevents a single malformed rule from silencing an entire ActivityDefinition.
An empty condition (omitted condition field) evaluates to true — the rule always fires when the trigger fires.
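These three semantics can be sketched in a single loop. The function name `fire_rules`, the logger name, and the shape of a rule as a dict with `id` and optional `condition` keys are assumptions for illustration; `evaluate` stands in for whatever evaluates a condition string against the current event.

```python
import logging
from typing import Callable

log = logging.getLogger("activity_core.rules")  # hypothetical logger name

def fire_rules(rules: list[dict], evaluate: Callable[[str], bool]) -> list[str]:
    """Return the ids of all rules that fire, in file order."""
    fired = []
    for rule in rules:  # every rule is evaluated; there is no first-match exit
        condition = rule.get("condition")
        if not condition:
            # Omitted condition: the rule always fires when the trigger fires.
            fired.append(rule["id"])
            continue
        try:
            matched = evaluate(condition)
        except Exception as exc:
            # A failing rule is skipped and logged; the remaining rules still run.
            log.warning("rule_error rule_id=%s error=%r", rule["id"], exc)
            continue
        if matched:
            fired.append(rule["id"])
    return fired
```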

Instructions

An Instruction defers the task-creation decision to an LLM. It specifies what context to provide, how to frame the prompt, and what output schema to enforce.

Structure

# in an instruction fenced block:
id: {slug}
condition: '{expression}'          # optional pre-filter (Rule DSL); runs before LLM
trusted_fields:                    # REQUIRED — explicit allowlist of payload fields
  - event.attributes.repo_slug     # safe to interpolate into prompt
  - event.attributes.domain
  - event.attributes.tags
model: claude-sonnet-4-6
review_required: false             # true | false — curator gate for output
prompt: |
  {prompt template — only trusted_fields may be interpolated}
output_schema: {path to JSON schema file}

Trusted fields and prompt injection protection

The trusted_fields list is required and enforced at parse time. Any field not listed is unavailable to the prompt template. The template engine raises UntrustedFieldError if the prompt references a field not in trusted_fields.

The rationale: event payloads may contain free-text from untrusted sources — commit messages, issue titles, CVE descriptions, repo descriptions. Interpolating these directly into a prompt creates a prompt injection surface. Trusted fields are those whose values are validated by the event type schema (typed attributes like slugs, domain names, tag lists) and cannot carry arbitrary instruction text by construction.

Fields of type object (freeform JSON) are never eligible for trusted_fields. Listing one is rejected by the evaluator at parse time.
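The allowlist check can be sketched as below. This is a simplified illustration, not the real template engine: the `{event.…}` placeholder syntax, the `field_types` mapping from field path to schema type, reusing `UntrustedFieldError` for object-typed fields, and the helper names are all assumptions.

```python
import re

class UntrustedFieldError(ValueError):
    """Raised when a prompt or allowlist refers to a field that may not be used."""

# Assumed placeholder syntax: {event.<dotted.path>}
_PLACEHOLDER = re.compile(r"\{(event(?:\.\w+)+)\}")

def validate_trusted_fields(trusted_fields: list[str], field_types: dict[str, str]) -> None:
    """Parse-time check: freeform JSON objects can never be trusted."""
    for path in trusted_fields:
        if field_types.get(path) == "object":
            raise UntrustedFieldError(f"{path} is freeform JSON and cannot be trusted")

def check_prompt(prompt: str, trusted_fields: list[str]) -> None:
    """Parse-time check: every placeholder must be on the allowlist."""
    for path in _PLACEHOLDER.findall(prompt):
        if path not in trusted_fields:
            raise UntrustedFieldError(f"{path} is not listed in trusted_fields")

def render_prompt(prompt: str, trusted_fields: list[str], event: dict) -> str:
    check_prompt(prompt, trusted_fields)

    def lookup(match: re.Match) -> str:
        value = event
        for part in match.group(1).split(".")[1:]:  # skip the leading "event"
            value = value[part]
        return ", ".join(map(str, value)) if isinstance(value, list) else str(value)

    # Single pass: substituted values are not scanned again for placeholders.
    return _PLACEHOLDER.sub(lookup, prompt)
```

With this shape, an untrusted free-text attribute can only reach the prompt if someone adds it to trusted_fields, which makes the exposure visible in review.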

Output schema enforcement

The LLM response is validated against output_schema using JSON Schema validation. If validation fails, the instruction retries once with the schema error appended to the prompt. If the second attempt also fails, the instruction records an instruction_output_error audit event and emits no tasks. Tasks are never created from unvalidated output.

Structured output mode (tool_use / JSON mode) is used where the model supports it. The output schema must define List[TaskSpec] or a compatible envelope.
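The retry-once behaviour can be sketched with the LLM call and the schema check injected as callables. All names here are hypothetical: `call_llm` returns already-parsed output, `validate` returns `None` on success or an error string (in practice it would wrap a JSON Schema validator), and `audit` records an audit event by name.

```python
from typing import Any, Callable, Optional

def run_instruction(
    prompt: str,
    call_llm: Callable[[str], Any],
    validate: Callable[[Any], Optional[str]],
    audit: Callable[[str], None],
) -> list:
    """Return validated task specs, or an empty list after two failed attempts."""
    attempt_prompt = prompt
    for _ in range(2):  # the first attempt plus exactly one retry
        output = call_llm(attempt_prompt)
        error = validate(output)
        if error is None:
            return output
        # The retry sees the original prompt with the schema error appended.
        attempt_prompt = f"{prompt}\n\nPrevious output failed schema validation: {error}"
    audit("instruction_output_error")
    return []  # tasks are never created from unvalidated output
```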

`review_required: true`

When set, the instruction's proposed task list is written to a pending review queue in issue-core rather than directly created. A human or curator agent reviews and approves/rejects before tasks are materialised. This is the default for instructions that create high-impact tasks (cross-repo changes, security responses, production operations).

Evaluation semantics

Instructions are evaluated after all rules in the ActivityDefinition.
The optional condition field on an instruction uses the same Rule DSL as a first-pass filter — if the condition is false, the LLM is not called. This avoids LLM cost for events that clearly do not need instruction judgement.
Instructions are not first-match-only; all instructions whose conditions pass fire. An ActivityDefinition may have zero instructions.

Audit trail

Every task emission records:

| Field | Rule | Instruction |
| --- | --- | --- |
| `source_type` | `"rule"` | `"instruction"` |
| `source_id` | rule `id` from definition | instruction `id` from definition |
| `source_version` | ActivityDefinition version | ActivityDefinition version |
| `triggering_event_id` | event UUID | event UUID |
| `condition_matched` | expression string | expression string (pre-filter) |
| `prompt_hash` | n/a | SHA-256 of rendered prompt |
| `model` | n/a | model ID used |
| `output_validated` | n/a | `true` / `false` |
| `review_required` | n/a | `true` / `false` |

The audit trail is written to the task_spawn_log table in activity-core's database and referenced from the task record in issue-core.
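One way to model a row of this log is a small dataclass whose instruction-only fields default to `None`. This is a sketch under assumptions: the class name `SpawnLogEntry`, the helper `instruction_entry`, and the UTF-8 encoding of the prompt before hashing are not specified by the ADR.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SpawnLogEntry:
    source_type: str                        # "rule" or "instruction"
    source_id: str
    source_version: str
    triggering_event_id: str
    condition_matched: Optional[str]
    # The remaining fields are populated for instructions only.
    prompt_hash: Optional[str] = None
    model: Optional[str] = None
    output_validated: Optional[bool] = None
    review_required: Optional[bool] = None

def instruction_entry(source_id: str, source_version: str, event_id: str,
                      condition: Optional[str], rendered_prompt: str, model: str,
                      output_validated: bool, review_required: bool) -> SpawnLogEntry:
    # Hash the rendered prompt so the log can identify it without storing its text.
    digest = hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()
    return SpawnLogEntry("instruction", source_id, source_version, event_id,
                         condition, digest, model, output_validated, review_required)
```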

Testing strategy

Rules: every rule can and should be unit-tested with fixture event payloads. A test helper evaluate_rule(condition_str, event_fixture) returns bool and raises on syntax errors. Tests live alongside ActivityDefinition files: activity-definitions/{slug}.test.json — a list of {event, expected_rules_fired} fixtures.
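A fixture file of this shape could be checked with a small runner like the one below. It is a sketch only: `check_fixtures` is a hypothetical name, rules are assumed to be dicts with `id` and optional `condition` keys, and `evaluate_rule` is passed in with the signature the paragraph above describes.

```python
import json
from typing import Callable

def check_fixtures(fixtures_json: str, rules: list[dict],
                   evaluate_rule: Callable[[str, dict], bool]) -> list[str]:
    """Compare fired rule ids against expected_rules_fired; return failure messages."""
    failures = []
    for index, fixture in enumerate(json.loads(fixtures_json)):
        fired = sorted(
            rule["id"] for rule in rules
            # An omitted condition always fires, matching the evaluation semantics.
            if not rule.get("condition") or evaluate_rule(rule["condition"], fixture["event"])
        )
        expected = sorted(fixture["expected_rules_fired"])
        if fired != expected:
            failures.append(f"fixture {index}: expected {expected}, got {fired}")
    return failures
```

An empty return value means every fixture matched, so a test can simply assert that the list is empty.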

Instructions: instructions cannot be deterministically unit-tested. Instead:

Sample evaluations are collected: given a fixture event, record the LLM response.
Samples are committed to activity-definitions/{slug}.samples/ for human review.
Output schema validation is unit-tested independently of the LLM call.
Prompt injection resistance is tested by including injection strings in fixture event payloads and asserting they do not appear in the rendered prompt.

rules-core module boundary

The rule evaluator and instruction executor live in src/activity_core/rules/. Within this module:

No imports from temporalio, sqlalchemy, fastapi, or any activity-core application code.
Public surface: evaluate_condition(expr: str, event: EventEnvelope, context: dict) -> bool and execute_instruction(instr: InstructionDef, event: EventEnvelope, context: dict) -> List[TaskSpec].
The module is independently importable and testable without starting the Temporal worker or Postgres.

This boundary makes future extraction to rules-core a packaging exercise, not a refactor.
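The import restriction lends itself to an automated check. The sketch below parses a module's source with the standard `ast` module and reports any forbidden top-level package; the function name is hypothetical, and the check does not cover the "no activity-core application code" part of the boundary, which would need a project-specific list.

```python
import ast

FORBIDDEN = {"temporalio", "sqlalchemy", "fastapi"}

def forbidden_imports(source: str) -> set[str]:
    """Return the forbidden top-level packages imported by the given source text."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]  # absolute `from x import y`; relative imports are skipped
        else:
            continue
        found.update(name.split(".")[0] for name in names)
    return found & FORBIDDEN

print(forbidden_imports("import json\nfrom sqlalchemy.orm import Session\n"))  # {'sqlalchemy'}
```

A test could run this over every `.py` file under src/activity_core/rules/ and assert that the result is empty for each.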

Consequences

The ActivityDefinition Pydantic model gains rules: List[RuleDef] and instructions: List[InstructionDef] fields. The current implicit "always create tasks" behaviour is replaced by explicit rule blocks.
A new RuleEvaluator class (AST walker) is added to src/activity_core/rules/.
A new InstructionExecutor class handles prompt rendering, LLM call, output validation, and review queue routing.
Integration tests for rule evaluation use fixture JSON; no running Temporal required.
The task_spawn_log table is added to the Postgres schema (new Alembic migration).
ActivityDefinition files that omit both rules and instructions are valid (they fire with no output) — this supports future placeholder definitions.

Alternatives Considered

OPA / Rego for rule conditions: powerful, well-established policy language, supports complex logic. Rejected — Rego's learning curve is high for non-specialists; agents rarely produce correct Rego without fine-tuning; it adds a runtime dependency. The simple AST-walker DSL covers the realistic condition complexity for this org.

Rules as Python lambdas: maximum expressiveness. Rejected — arbitrary code execution in a rule condition is a serious security surface, especially in an org-wide event loop. Code deployment required for any rule change; agents cannot write rules without code write access.

LLM for all conditions (no Rule/Instruction split): simpler model, more flexible. Rejected — non-deterministic for cases that are deterministic; expensive for high-frequency events like cron ticks; impossible to unit-test; audit trail for deterministic rules becomes murky.

Instructions only, no Rules: allows arbitrary LLM judgement for everything. Rejected — LLM cost for every event, latency, and non-determinism are unacceptable for high-frequency maintenance automations. Many cases (SBOM staleness check, tag-based routing) are fully deterministic and should stay that way.

Related

ACT-ADR-001 — Event Bridge Architecture
ACT-ADR-002 — Definition format (where rule/instruction blocks live)
CUST-TFE-SCOPE-2026-000001 — task-flow-engine extraction (analogue pattern)
src/activity_core/rules/ — implementation home
