Implement post-triage operational hardening

2026-06-04 12:15:07 +02:00
parent 8a33ec44b6
commit 20d4f26166
11 changed files with 775 additions and 31 deletions
--- a/docs/adr/adr-003-rule-instruction-model.md
+++ b/docs/adr/adr-003-rule-instruction-model.md
@@ -101,17 +101,58 @@ A Rule's action block specifies:

 ```yaml
 action:
-  task_template: tasks/{template-slug}.md   # required
-  target_repo: event.attributes.repo_slug    # expression — attribute access only
-  priority: high                             # high | medium | low | literal
-  labels: ["onboarding", "security"]        # literal list
-  due_in_days: 7                             # optional, integer literal
+  task_template: "Run SBOM rescan for {context.repo.repo_slug}"
+  target_repo: context.repo.repo_slug
+  priority: medium
+  labels: ["sbom", "security", "{context.repo.repo_slug}"]
+  due_in_days: 7
 ```

-`target_repo` and similar fields accept simple attribute access expressions
-(no boolean logic — just path traversal). This allows dynamic routing to the
-correct issue-core instance without arbitrary expression evaluation in action
-fields.
+`action.task_template` is the emitted task title template. It is not a path to a
+repo-local file. Older design notes and the legacy `tasks/*.md` directory use
+"task template" for materialized task-body templates; that is a separate legacy
+surface. To avoid surprise, new rule actions should treat `task_template` as
+having `title_template` semantics until the field can be renamed in a
+schema-breaking revision.
+
+Action fields accept two deterministic rendering forms:
+
+- Whole-field paths: if the whole string is a path like
+  `context.repo.repo_slug` or `event.attributes.repo_slug`, the rendered value
+  keeps the original scalar/list/object shape from that path. This is the
+  correct form for `target_repo` and other fields that should not become prose.
+- Scalar placeholders: strings may include `{context.foo}` or `{event.foo}`
+  placeholders. Each placeholder must resolve to a scalar. Lists and objects are
+  rejected rather than stringified, which prevents accidental JSON blobs or
+  untrusted text from being embedded into task titles.
+
+Unsafe action cases are rejected:
+
+- Any action path outside `context.*` or `event.*`.
+- Any path containing calls, indexing, arithmetic, filters, or boolean logic.
+- Placeholder values that resolve to lists or objects.
+- `for_each` values that are not a whole-field `context.*` or `event.*` path to
+  a list.
+- `bind_as` names that are not simple identifiers.
+
+Per-item rule expansion is explicit:
+
+```yaml
+for_each: context.repos.repos
+bind_as: repo
+condition: 'context.repo.sbom_age_days > 30'
+action:
+  task_template: "Run SBOM rescan for {context.repo.repo_slug}"
+  target_repo: context.repo.repo_slug
+  priority: medium
+  labels: ["sbom", "security", "automated"]
+```
+
+The weekly SBOM staleness definition is the canonical pattern. The State Hub
+bulk resolver exposes all repository entries at `context.repos.repos`, the rule
+binds each item as `context.repo`, and the strict staleness definition is
+`context.repo.sbom_age_days > 30`. Thirty days exactly is not stale; thirty-one
+days is stale.

 #### Evaluation semantics

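The two rendering forms and the rejection rules in the ADR hunk above can be illustrated with a minimal sketch. The names `resolve` and `render_field` and the nested-dict scope are assumptions made for illustration; the actual activity-core renderer is not shown in this diff and may differ.

```python
from __future__ import annotations

import re
from typing import Any

# Hypothetical patterns: a path is `context` or `event` followed by one or
# more dotted identifiers. Calls, indexing, arithmetic, and filters never match.
_PATH = re.compile(r"(?:context|event)(?:\.[A-Za-z_][A-Za-z0-9_]*)+")
_PLACEHOLDER = re.compile(r"\{([^{}]*)\}")


def resolve(path: str, scope: dict[str, Any]) -> Any:
    """Walk a dotted context.* / event.* path through nested dicts."""
    if not _PATH.fullmatch(path):
        raise ValueError(f"unsafe action path: {path!r}")
    value: Any = scope
    for part in path.split("."):
        value = value[part]
    return value


def render_field(raw: str, scope: dict[str, Any]) -> Any:
    """Render one action field as a whole-field path or a placeholder string."""
    if _PATH.fullmatch(raw):
        # Whole-field path: keep the original scalar/list/object shape.
        return resolve(raw, scope)

    def substitute(match: re.Match[str]) -> str:
        value = resolve(match.group(1), scope)
        if isinstance(value, (list, dict)):
            # Lists and objects are rejected rather than stringified.
            raise ValueError(f"placeholder {match.group(1)!r} is not a scalar")
        return str(value)

    return _PLACEHOLDER.sub(substitute, raw)
```

Under these assumptions, `render_field("context.repo.repo_slug", scope)` returns the slug with its original type, while a title such as `"Run SBOM rescan for {context.repo.repo_slug}"` is rendered by scalar substitution. Literal values like `medium` pass through unchanged.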
--- a/docs/issue-core-emission-boundary.md
+++ b/docs/issue-core-emission-boundary.md
@@ -0,0 +1,70 @@
+# Issue-Core Emission Boundary
+
+activity-core owns the decision to spawn a task and the audit trail that says
+why it spawned. It does not own downstream task lifecycle state after emission.
+
+## Current authoritative endpoint
+
+The current authoritative boundary is the issue-core REST API:
+
+```text
+POST {ISSUE_CORE_URL}/issues/
+```
+
+`IssueCoreRestSink` sends this payload:
+
+```json
+{
+  "title": "Run SBOM rescan for activity-core",
+  "description": "",
+  "target_repo": "activity-core",
+  "priority": "medium",
+  "labels": ["sbom", "security", "automated"],
+  "due_in_days": null,
+  "source_type": "rule",
+  "source_id": "flag-stale-sbom",
+  "triggering_event_id": "event-or-schedule-key",
+  "activity_definition_id": "activity-definition-uuid"
+}
+```
+
+The expected response contains `issue_id` and may include `issue_url` and
+`backend`. activity-core stores only the returned task reference in
+`task_spawn_log`; issue-core remains authoritative for task status, assignment,
+comments, closure, and cancellation.
+
+## REST versus NATS
+
+Keep REST as the active emission contract until issue-core publishes and owns a
+durable NATS consumer for task-creation commands. NATS is still appropriate for
+event intake into activity-core, but task creation needs an acknowledged,
+idempotent command boundary. A future NATS sink must return or later correlate a
+task reference before it can replace `IssueCoreRestSink`.
+
+## Safe operating modes
+
+- `ISSUE_SINK_TYPE=null`: dry-run/audit mode. Task specs are rendered and the
+  workflow records synthetic `null-*` references. This is the current Railiance
+  production setting.
+- `ISSUE_SINK_TYPE=rest`: live task creation. Sink failures raise out of
+  `emit_tasks`, so Temporal retries and the workflow history make failures
+  visible.
+
+Weekly SBOM staleness is safe to evaluate in dry-run mode because the rule
+contract is deterministic and tested. Do not enable it against the real REST sink
+until issue-core credentials, endpoint reachability, and duplicate-handling are
+verified in the target environment.
+
+## Verification
+
+Local contract tests cover the rendered weekly SBOM task path and the REST
+payload shape:
+
+```bash
+uv run pytest tests/test_integration_event_bridge.py tests/test_issue_sink.py
+```
+
+For a live environment, run with `ISSUE_SINK_TYPE=null` first and confirm
+`task_spawn_log` contains the expected source id, condition, triggering event id,
+and synthetic task reference. Then switch to `ISSUE_SINK_TYPE=rest` only after a
+single known-safe rule match creates one issue-core task with the same fields.
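The payload shape documented in `docs/issue-core-emission-boundary.md` above can be sketched as a small builder. This is an illustration only: `build_issue_payload`, `null_sink_reference`, the keys of the `rendered` dict, and the exact suffix of the synthetic `null-*` reference are assumptions, not the actual `IssueCoreRestSink` code.

```python
from __future__ import annotations

import uuid
from typing import Any


def build_issue_payload(
    rendered: dict[str, Any],
    *,
    source_id: str,
    triggering_event_id: str,
    activity_definition_id: str,
) -> dict[str, Any]:
    """Map a rendered rule action onto the documented POST /issues/ body."""
    return {
        # `task_template` carries title semantics, per ADR-003.
        "title": rendered["task_template"],
        "description": rendered.get("description", ""),
        "target_repo": rendered["target_repo"],
        "priority": rendered["priority"],
        "labels": list(rendered.get("labels", [])),
        "due_in_days": rendered.get("due_in_days"),  # None serializes to null
        "source_type": "rule",
        "source_id": source_id,
        "triggering_event_id": triggering_event_id,
        "activity_definition_id": activity_definition_id,
    }


def null_sink_reference() -> str:
    """Return a synthetic dry-run reference (suffix format is assumed)."""
    return f"null-{uuid.uuid4().hex}"
```

In dry-run mode a reference from something like `null_sink_reference()` would be stored in place of a real `issue_id`, so the audit trail can be checked without creating tasks in issue-core.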
--- a/docs/runbook.md
+++ b/docs/runbook.md
@@ -147,6 +147,55 @@ docker exec temporal-admin-tools temporal workflow list \

 ---

+## Daily State Hub WSJF triage verification
+
+Use this when answering: "did today's daily triage run happen?"
+
+Set the ActivityDefinition id when known. If it is not known, pass the
+definition name used in the environment and let the live helper resolve it from
+Postgres.
+
+```bash
+export DAILY_TRIAGE_ACTIVITY_ID=<daily-triage-activity-definition-uuid>
+
+# Dry-run checklist; safe from any shell because it only prints checks.
+uv run python scripts/verify_daily_triage.py \
+  --activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
+  --date "$(date -u +%F)"
+
+# Live check from a shell with Temporal, DB, State Hub, and working-memory access.
+ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
+TEMPORAL_HOST=localhost:7233 \
+STATE_HUB_URL=http://127.0.0.1:8000 \
+  uv run python scripts/verify_daily_triage.py \
+    --activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
+    --working-memory-dir /home/worsch/the-custodian/working-memory \
+    --live
+```
+
+The verification is complete when all of these agree:
+
+- Temporal schedule `activity-schedule-$DAILY_TRIAGE_ACTIVITY_ID` exists, is not
+  paused, and uses the `skip` overlap policy.
+- The latest workflow found with `ActivityId="$DAILY_TRIAGE_ACTIVITY_ID"` either
+  completed or is visibly retrying a failed activity in history.
+- `activity_runs` has a row for the daily triage ActivityDefinition with today's
+  `scheduled_for` or `fired_at` date.
+- State Hub `/progress/` contains a `daily_triage` event whose detail includes
+  the same `activity_core_run_id`.
+- The working-memory sink wrote `daily-triage-YYYY-MM-DD-<run>.md` and its
+  frontmatter contains the same `activity_core_run_id`.
+- The ActivityDefinition's instruction model, token budget, and sink timeouts fit
+  under `ACTIVITY_TIMEOUT_SECONDS` (default 900 seconds). Temporal retries each
+  activity for up to 10 attempts, so a slow LLM or sink failure should show as
+  workflow retry history rather than a silently missing report.
+
+Expected missed-run behavior: the daily triage definition should use
+`misfire_policy: skip`. Runs missed during planned downtime are not caught up;
+the next scheduled fire is the next authoritative run.
+
+---
+
 ## Scale-out

 ### Multiple worker replicas
@@ -204,6 +253,44 @@ Set the environment variable before running the worker.
 2. `curl http://localhost:9090/metrics` should return Temporal SDK metrics.
 3. If port 9090 conflicts with Prometheus server, set `PROMETHEUS_BIND_ADDR=0.0.0.0:9091`.

+### Production alerting and failure modes
+
+Kubernetes health expectations:
+
+```bash
+kubectl -n activity-core get deploy actcore-worker actcore-api actcore-event-router
+kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core
+kubectl -n activity-core port-forward svc/actcore-worker-metrics 9090:9090
+curl -sf http://127.0.0.1:9090/metrics
+```
+
+Page an operator when:
+
+- `actcore-worker` has no ready pod, cannot connect to Temporal, or cannot reach
+  Postgres.
+- The daily triage schedule is missing or paused outside an approved maintenance
+  window.
+- The expected daily triage run is absent from Temporal and `activity_runs`
+  after the retry window.
+- Both the State Hub progress event and the working-memory report are missing
+  for a completed run.
+- Report sink or task emission failures repeat across Temporal retries.
+
+Leave a State Hub progress note, but do not page, when:
+
+- A planned outage caused one skipped run and the schedule is healthy again.
+- A sink idempotency check reports `exists` for the expected run id.
+- The report completed but calibration feedback says the recommendations were
+  noisy, too long, or under-sensitive.
+
+Handle in the next operator session:
+
+- Prompt/schema tuning, loose-end sensitivity, and stale-but-parked work
+  calibration.
+- Non-urgent schedule jitter or timeout adjustments.
+- Moving a task sink from `ISSUE_SINK_TYPE=null` to the real issue-core endpoint
+  after a dry-run contract check has passed.
+
 ### DB migration drift
 ```bash
 uv run alembic current    # show current revision