Implement post-triage operational hardening

2026-06-04 12:15:07 +02:00
parent 8a33ec44b6
commit 20d4f26166
11 changed files with 775 additions and 31 deletions


@@ -101,17 +101,58 @@ A Rule's action block specifies:
```yaml
action:
  # Before this change (lines removed by this hunk):
  #   task_template: tasks/{template-slug}.md      # required
  #   target_repo: event.attributes.repo_slug      # expression; attribute access only
  #   priority: high                               # high | medium | low | literal
  #   labels: ["onboarding", "security"]           # literal list
  #   due_in_days: 7                               # optional, integer literal
  # After this change:
  task_template: "Run SBOM rescan for {context.repo.repo_slug}"
  target_repo: context.repo.repo_slug
  priority: medium
  labels: ["sbom", "security", "{context.repo.repo_slug}"]
  due_in_days: 7
```
`target_repo` and similar fields accept simple attribute access expressions
(no boolean logic — just path traversal). This allows dynamic routing to the
correct issue-core instance without arbitrary expression evaluation in action
fields.
`action.task_template` is the emitted task title template. It is not a path to a
repo-local file. Older design notes and the legacy `tasks/*.md` directory use
"task template" for materialized task-body templates; that is a separate legacy
surface. To avoid surprise, new rule actions should treat `task_template` as
having `title_template` semantics until the field can be renamed in a
schema-breaking revision.
Action fields accept two deterministic rendering forms:
- Whole-field paths: if the whole string is a path like
`context.repo.repo_slug` or `event.attributes.repo_slug`, the rendered value
keeps the original scalar/list/object shape from that path. This is the
correct form for `target_repo` and other fields that should not become prose.
- Scalar placeholders: strings may include `{context.foo}` or `{event.foo}`
placeholders. Each placeholder must resolve to a scalar. Lists and objects are
rejected rather than stringified, which prevents accidental JSON blobs or
untrusted text from being embedded into task titles.
Unsafe action cases are rejected:
- Any action path outside `context.*` or `event.*`.
- Any path containing calls, indexing, arithmetic, filters, or boolean logic.
- Placeholder values that resolve to lists or objects.
- `for_each` values that are not a whole-field `context.*` or `event.*` path to
a list.
- `bind_as` names that are not simple identifiers.
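The two rendering forms and the rejection rules above can be sketched as follows. This is a minimal illustration, not the activity-core implementation: `render_field`, `resolve_path`, and `UnsafeActionError` are hypothetical names, and the sketch assumes that any whole-field string made of dotted identifiers is meant as a path.

```python
import re
from typing import Any

ALLOWED_ROOTS = ("context", "event")
# Dotted identifiers only: no calls, indexing, arithmetic, filters, or boolean logic.
PATH_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)+$")
PLACEHOLDER_RE = re.compile(r"\{([^{}]*)\}")


class UnsafeActionError(ValueError):
    """Hypothetical error for action fields outside the allowed forms."""


def resolve_path(path: str, scope: dict[str, Any]) -> Any:
    """Walk a context.* or event.* path through nested dicts."""
    if not PATH_RE.match(path) or path.split(".")[0] not in ALLOWED_ROOTS:
        raise UnsafeActionError(f"unsafe action path: {path!r}")
    value: Any = scope
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            raise UnsafeActionError(f"unresolved action path: {path!r}")
        value = value[part]
    return value


def render_field(field: Any, scope: dict[str, Any]) -> Any:
    """Render one action field using the whole-field or placeholder form."""
    if isinstance(field, list):
        return [render_field(item, scope) for item in field]
    if not isinstance(field, str):
        return field  # literals such as due_in_days pass through unchanged
    # Whole-field path: the resolved value keeps its scalar/list/object shape.
    # Assumption of this sketch: a dotted-identifier string is always a path.
    if PATH_RE.match(field) or field.startswith(("context.", "event.")):
        return resolve_path(field, scope)

    def substitute(match: re.Match) -> str:
        value = resolve_path(match.group(1), scope)
        if isinstance(value, (list, dict)):
            raise UnsafeActionError(f"placeholder is not a scalar: {match.group(0)}")
        return str(value)

    # Scalar placeholders: each {path} must resolve to a scalar.
    return PLACEHOLDER_RE.sub(substitute, field)


scope = {
    "context": {"repo": {"repo_slug": "activity-core", "tags": ["sbom"]}},
    "event": {"attributes": {"repo_slug": "activity-core"}},
}
title = render_field("Run SBOM rescan for {context.repo.repo_slug}", scope)
target = render_field("context.repo.repo_slug", scope)
```

Here `title` is rendered through the placeholder form and `target` through the whole-field form. A field such as `{context.repo.tags}` or `context.repo.tags[0]` raises instead of being stringified.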
Per-item rule expansion is explicit:
```yaml
for_each: context.repos.repos
bind_as: repo
condition: 'context.repo.sbom_age_days > 30'
action:
  task_template: Run SBOM rescan for {context.repo.repo_slug}
  target_repo: context.repo.repo_slug
  priority: medium
  labels: ["sbom", "security", "automated"]
```
The weekly SBOM staleness definition is the canonical pattern. The State Hub
bulk resolver exposes all repository entries at `context.repos.repos`, the rule
binds each item as `context.repo`, and the strict staleness definition is
`context.repo.sbom_age_days > 30`. Thirty days exactly is not stale; thirty-one
days is stale.
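The expansion and the strict threshold can be illustrated with a short sketch. The function names and the sample repository entries are hypothetical; only the `context.repos.repos` source, the `repo` binding, and the `> 30` comparison come from the rule above.

```python
from typing import Any, Iterator

STALE_AFTER_DAYS = 30  # strict threshold: only ages greater than this are stale


def expand_for_each(
    context: dict[str, Any], items: list[dict[str, Any]], bind_as: str
) -> Iterator[dict[str, Any]]:
    """Yield one per-item context with the current item bound under bind_as."""
    if not bind_as.isidentifier():
        raise ValueError(f"bind_as must be a simple identifier: {bind_as!r}")
    for item in items:
        yield {**context, bind_as: item}


def is_stale(item_context: dict[str, Any]) -> bool:
    """Mirror the condition 'context.repo.sbom_age_days > 30'."""
    return item_context["repo"]["sbom_age_days"] > STALE_AFTER_DAYS


# Made-up entries standing in for the State Hub bulk resolver output.
context = {
    "repos": {
        "repos": [
            {"repo_slug": "thirty", "sbom_age_days": 30},
            {"repo_slug": "thirty-one", "sbom_age_days": 31},
        ]
    }
}
stale_slugs = [
    item_context["repo"]["repo_slug"]
    for item_context in expand_for_each(context, context["repos"]["repos"], "repo")
    if is_stale(item_context)
]
```

Only the thirty-one-day entry ends up in `stale_slugs`, which matches the boundary described above.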
#### Evaluation semantics


@@ -0,0 +1,70 @@
# Issue-Core Emission Boundary
activity-core owns the decision to spawn a task and the audit trail that says
why it spawned. It does not own downstream task lifecycle state after emission.
## Current authoritative endpoint
The current authoritative boundary is the issue-core REST API:
```text
POST {ISSUE_CORE_URL}/issues/
```
`IssueCoreRestSink` sends this payload:
```json
{
  "title": "Run SBOM rescan for activity-core",
  "description": "",
  "target_repo": "activity-core",
  "priority": "medium",
  "labels": ["sbom", "security", "automated"],
  "due_in_days": null,
  "source_type": "rule",
  "source_id": "flag-stale-sbom",
  "triggering_event_id": "event-or-schedule-key",
  "activity_definition_id": "activity-definition-uuid"
}
```
The expected response contains `issue_id` and may include `issue_url` and
`backend`. activity-core stores only the returned task reference in
`task_spawn_log`; issue-core remains authoritative for task status, assignment,
comments, closure, and cancellation.
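As a rough sketch of this contract, the payload can be built from a rendered task spec and the task reference read back from the response body. `TaskSpec`, `build_payload`, and `extract_task_reference` are hypothetical names for illustration and are not taken from `IssueCoreRestSink`; the field names follow the payload shown above, and no HTTP call is made here.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TaskSpec:
    """Hypothetical container for a rendered rule action plus audit fields."""
    title: str
    target_repo: str
    priority: str
    source_id: str
    triggering_event_id: str
    activity_definition_id: str
    labels: list[str] = field(default_factory=list)
    description: str = ""
    due_in_days: Optional[int] = None
    source_type: str = "rule"


def build_payload(spec: TaskSpec) -> dict[str, Any]:
    """Produce the JSON body for POST {ISSUE_CORE_URL}/issues/."""
    return {
        "title": spec.title,
        "description": spec.description,
        "target_repo": spec.target_repo,
        "priority": spec.priority,
        "labels": list(spec.labels),
        "due_in_days": spec.due_in_days,
        "source_type": spec.source_type,
        "source_id": spec.source_id,
        "triggering_event_id": spec.triggering_event_id,
        "activity_definition_id": spec.activity_definition_id,
    }


def extract_task_reference(response_body: str) -> Any:
    """Return issue_id, the only value activity-core needs to store."""
    data = json.loads(response_body)
    if "issue_id" not in data:
        raise ValueError("issue-core response is missing issue_id")
    return data["issue_id"]


spec = TaskSpec(
    title="Run SBOM rescan for activity-core",
    target_repo="activity-core",
    priority="medium",
    labels=["sbom", "security", "automated"],
    source_id="flag-stale-sbom",
    triggering_event_id="event-or-schedule-key",
    activity_definition_id="activity-definition-uuid",
)
payload = build_payload(spec)
```

Optional response fields such as `issue_url` and `backend` are ignored on purpose, since issue-core stays authoritative for everything beyond the reference.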
## REST versus NATS
Keep REST as the active emission contract until issue-core publishes and owns a
durable NATS consumer for task-creation commands. NATS is still appropriate for
event intake into activity-core, but task creation needs an acknowledged,
idempotent command boundary. A future NATS sink must return or later correlate a
task reference before it can replace `IssueCoreRestSink`.
## Safe operating modes
- `ISSUE_SINK_TYPE=null`: dry-run/audit mode. Task specs are rendered and the
workflow records synthetic `null-*` references. This is the current Railiance
production setting.
- `ISSUE_SINK_TYPE=rest`: live task creation. Sink failures raise out of
`emit_tasks`, so Temporal retries and the workflow history make failures
visible.
Weekly SBOM staleness is safe to evaluate in dry-run mode because the rule
contract is deterministic and tested. Do not enable it against the real REST sink
until issue-core credentials, endpoint reachability, and duplicate-handling are
verified in the target environment.
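The difference between the two modes can be sketched as a small sink factory. The class and function names are hypothetical, the `post` callable stands in for whatever HTTP client the real sink uses, and the sketch assumes an unset or unknown `ISSUE_SINK_TYPE` should fail loudly rather than pick a default.

```python
import uuid
from typing import Any, Callable, Mapping, Optional

PostFn = Callable[[dict[str, Any]], dict[str, Any]]


class NullSink:
    """Dry-run sink: creates nothing and returns a synthetic null-* reference."""

    def emit(self, payload: dict[str, Any]) -> str:
        return f"null-{uuid.uuid4().hex[:12]}"


class RestSink:
    """Live sink: delegates to an injected POST callable (an assumption here)."""

    def __init__(self, post: PostFn) -> None:
        self._post = post

    def emit(self, payload: dict[str, Any]) -> Any:
        # No try/except: a failure propagates so the caller's retry policy sees it.
        response = self._post(payload)
        return response["issue_id"]


def make_sink(env: Mapping[str, str], post: Optional[PostFn] = None):
    """Select a sink from ISSUE_SINK_TYPE, rejecting unset or unknown values."""
    sink_type = env.get("ISSUE_SINK_TYPE")
    if sink_type == "null":
        return NullSink()
    if sink_type == "rest":
        if post is None:
            raise ValueError("ISSUE_SINK_TYPE=rest requires a POST callable")
        return RestSink(post)
    raise ValueError(f"unsupported ISSUE_SINK_TYPE: {sink_type!r}")


dry_run_reference = make_sink({"ISSUE_SINK_TYPE": "null"}).emit({"title": "demo"})
```

Letting the exception escape `emit` is what makes a sink failure show up as a retried activity instead of a silently dropped task.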
## Verification
Local contract tests cover the rendered weekly SBOM task path and the REST
payload shape:
```bash
uv run pytest tests/test_integration_event_bridge.py tests/test_issue_sink.py
```
For a live environment, run with `ISSUE_SINK_TYPE=null` first and confirm
`task_spawn_log` contains the expected source id, condition, triggering event id,
and synthetic task reference. Then switch to `ISSUE_SINK_TYPE=rest` and treat the
switch as verified only after a single known-safe rule match creates exactly one
issue-core task with the same fields.


@@ -147,6 +147,55 @@ docker exec temporal-admin-tools temporal workflow list \
---
## Daily State Hub WSJF triage verification
Use this when answering: "did today's daily triage run happen?"
Set the ActivityDefinition id when known. If it is not known, pass the
definition name used in the environment and let the live helper resolve it from
Postgres.
```bash
export DAILY_TRIAGE_ACTIVITY_ID=<daily-triage-activity-definition-uuid>
# Dry-run checklist; safe from any shell because it only prints checks.
uv run python scripts/verify_daily_triage.py \
--activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
--date "$(date -u +%F)"
# Live check from a shell with Temporal, DB, State Hub, and working-memory access.
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
TEMPORAL_HOST=localhost:7233 \
STATE_HUB_URL=http://127.0.0.1:8000 \
uv run python scripts/verify_daily_triage.py \
--activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
--working-memory-dir /home/worsch/the-custodian/working-memory \
--live
```
The verification is complete when all of these agree:
- Temporal schedule `activity-schedule-$DAILY_TRIAGE_ACTIVITY_ID` exists, is not
paused, and uses the `skip` overlap policy.
- The latest workflow found with `ActivityId="$DAILY_TRIAGE_ACTIVITY_ID"` either
completed or is visibly retrying a failed activity in history.
- `activity_runs` has a row for the daily triage ActivityDefinition with today's
`scheduled_for` or `fired_at` date.
- State Hub `/progress/` contains a `daily_triage` event whose detail includes
the same `activity_core_run_id`.
- The working-memory sink wrote `daily-triage-YYYY-MM-DD-<run>.md` and its
frontmatter contains the same `activity_core_run_id`.
- The ActivityDefinition's instruction model, token budget, and sink timeouts fit
under `ACTIVITY_TIMEOUT_SECONDS` (default 900 seconds). Temporal retries each
activity up to 10 attempts, so a slow LLM or sink failure should show as
workflow retry history rather than a silent missing report.
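The agreement check in the list above can be expressed as a small function over evidence that has already been collected. This is an illustrative sketch only and does not describe what `scripts/verify_daily_triage.py` does internally; `TriageEvidence`, its fields, and the status strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriageEvidence:
    """Hypothetical snapshot of what each system reports for today's run."""
    schedule_exists: bool
    schedule_paused: bool
    overlap_policy: str
    workflow_status: str  # assumed values: "completed", "retrying", "failed"
    activity_run_id: Optional[str]        # from the activity_runs row
    state_hub_run_id: Optional[str]       # from the daily_triage progress event
    working_memory_run_id: Optional[str]  # from the report frontmatter


def find_problems(evidence: TriageEvidence) -> list[str]:
    """Return a list of disagreements; an empty list means the checks agree."""
    problems: list[str] = []
    if not evidence.schedule_exists:
        problems.append("schedule is missing")
    elif evidence.schedule_paused:
        problems.append("schedule is paused")
    if evidence.overlap_policy != "skip":
        problems.append(f"overlap policy is {evidence.overlap_policy!r}, expected 'skip'")
    if evidence.workflow_status not in ("completed", "retrying"):
        problems.append(f"latest workflow status is {evidence.workflow_status!r}")
    if evidence.activity_run_id is None:
        problems.append("no activity_runs row for today")
    for source, run_id in (
        ("State Hub progress event", evidence.state_hub_run_id),
        ("working-memory report", evidence.working_memory_run_id),
    ):
        if run_id != evidence.activity_run_id:
            problems.append(f"{source} run id does not match activity_runs")
    return problems


healthy = TriageEvidence(True, False, "skip", "completed", "run-1", "run-1", "run-1")
problems = find_problems(healthy)
```

Comparing both sink run ids against the `activity_runs` row keeps a stale report from an earlier day from being mistaken for today's output.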
Expected missed-run behavior: the daily triage definition should use
`misfire_policy: skip`. Planned downtime does not catch up missed daily reports;
the next scheduled fire is the next authoritative run.
---
## Scale-out
### Multiple worker replicas
@@ -204,6 +253,44 @@ Set the environment variable before running the worker.
2. `curl http://localhost:9090/metrics` should return Temporal SDK metrics.
3. If port 9090 conflicts with Prometheus server, set `PROMETHEUS_BIND_ADDR=0.0.0.0:9091`.
### Production alerting and failure modes
Kubernetes health expectations:
```bash
kubectl -n activity-core get deploy actcore-worker actcore-api actcore-event-router
kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core
kubectl -n activity-core port-forward svc/actcore-worker-metrics 9090:9090
curl -sf http://127.0.0.1:9090/metrics
```
Page an operator when:
- `actcore-worker` has no ready pod, cannot connect to Temporal, or cannot reach
Postgres.
- The daily triage schedule is missing or paused outside an approved maintenance
window.
- The expected daily triage run is absent from Temporal and `activity_runs`
after the retry window.
- Both the State Hub progress event and the working-memory report are missing
for a completed run.
- Report sink or task emission failures repeat across Temporal retries.
Leave a State Hub progress note, but do not page, when:
- A planned outage caused one skipped run and the schedule is healthy again.
- A sink idempotency check reports `exists` for the expected run id.
- The report completed but calibration feedback says the recommendations were
noisy, too long, or under-sensitive.
Handle in the next operator session:
- Prompt/schema tuning, loose-end sensitivity, and stale-but-parked work
calibration.
- Non-urgent schedule jitter or timeout adjustments.
- Moving a task sink from `ISSUE_SINK_TYPE=null` to the real issue-core endpoint
after a dry-run contract check has passed.
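One way to keep these three tiers consistent across tooling is a single routing table. The sketch below is hypothetical: the condition keys are invented labels for the cases listed above, and nothing in activity-core is known to use them.

```python
from enum import IntEnum


class Response(IntEnum):
    """Operator response tiers, ordered so that a higher value is more urgent."""
    NEXT_SESSION = 1
    PROGRESS_NOTE = 2
    PAGE = 3


# Hypothetical condition keys mapped to the tiers described above.
ROUTING = {
    "worker_unavailable": Response.PAGE,
    "schedule_missing_or_paused": Response.PAGE,
    "daily_run_absent_after_retries": Response.PAGE,
    "both_report_sinks_missing": Response.PAGE,
    "sink_failures_repeat_across_retries": Response.PAGE,
    "planned_outage_skipped_one_run": Response.PROGRESS_NOTE,
    "sink_idempotency_exists": Response.PROGRESS_NOTE,
    "report_quality_feedback": Response.PROGRESS_NOTE,
    "prompt_or_schema_tuning": Response.NEXT_SESSION,
    "schedule_jitter_or_timeout_adjustment": Response.NEXT_SESSION,
    "move_sink_from_null_to_rest": Response.NEXT_SESSION,
}


def most_urgent(conditions: list[str]) -> Response:
    """Return the most urgent tier among observed conditions.

    Unknown keys raise KeyError so that an unclassified condition is never
    silently downgraded.
    """
    if not conditions:
        raise ValueError("no conditions observed")
    return max(ROUTING[condition] for condition in conditions)


decision = most_urgent(["sink_idempotency_exists", "schedule_missing_or_paused"])
```

With one paging condition in the mix, `decision` resolves to `Response.PAGE` regardless of how many lower-tier conditions accompany it.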
### DB migration drift
```bash
uv run alembic current # show current revision