Implement post-triage operational hardening
@@ -101,17 +101,58 @@ A Rule's action block specifies:

```yaml
action:
  # Earlier form of these fields, superseded by the values below
  # (shown as comments so the block has no duplicate keys):
  # task_template: tasks/{template-slug}.md    # required
  # target_repo: event.attributes.repo_slug    # expression: attribute access only
  # priority: high                             # high | medium | low | literal
  # labels: ["onboarding", "security"]         # literal list
  # due_in_days: 7                             # optional, integer literal
  task_template: "Run SBOM rescan for {context.repo.repo_slug}"
  target_repo: context.repo.repo_slug
  priority: medium
  labels: ["sbom", "security", "{context.repo.repo_slug}"]
  due_in_days: 7
```

`target_repo` and similar fields accept simple attribute access expressions
(no boolean logic, only path traversal). This allows dynamic routing to the
correct issue-core instance without arbitrary expression evaluation in action
fields.

`action.task_template` is the emitted task title template. It is not a path to a
repo-local file. Older design notes and the legacy `tasks/*.md` directory use
"task template" for materialized task-body templates; that is a separate legacy
surface. To avoid surprise, new rule actions should treat `task_template` as
having `title_template` semantics until the field can be renamed in a
schema-breaking revision.

Action fields accept two deterministic rendering forms:

- Whole-field paths: if the whole string is a path like
  `context.repo.repo_slug` or `event.attributes.repo_slug`, the rendered value
  keeps the original scalar/list/object shape from that path. This is the
  correct form for `target_repo` and other fields that should not become prose.
- Scalar placeholders: strings may include `{context.foo}` or `{event.foo}`
  placeholders. Each placeholder must resolve to a scalar. Lists and objects are
  rejected rather than stringified, which prevents accidental JSON blobs or
  untrusted text from being embedded into task titles.
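
The two rendering forms can be sketched in a few lines of Python. This is an illustration only, not the project's renderer: the helper names `resolve_path` and `render_field`, the regular expressions, and the nested-dict `scope` shape are all assumptions.

```python
import re
from typing import Any

# Hypothetical patterns: a path is `context` or `event` followed by dotted identifiers.
PATH_RE = re.compile(r"(?:context|event)(?:\.[A-Za-z_][A-Za-z0-9_]*)+")
PLACEHOLDER_RE = re.compile(r"\{([^{}]*)\}")


def resolve_path(path: str, scope: dict) -> Any:
    """Walk a dotted path such as 'context.repo.repo_slug' through nested dicts."""
    if not PATH_RE.fullmatch(path):
        raise ValueError(f"unsafe action path: {path!r}")
    value: Any = scope
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            raise KeyError(f"unresolved action path: {path!r}")
        value = value[part]
    return value


def render_field(raw: Any, scope: dict) -> Any:
    """Render one action field using the whole-field or placeholder form."""
    if isinstance(raw, list):
        return [render_field(item, scope) for item in raw]
    if not isinstance(raw, str):
        return raw  # literals such as `due_in_days: 7` pass through unchanged
    if PATH_RE.fullmatch(raw):
        return resolve_path(raw, scope)  # whole-field path keeps its shape

    def substitute(match):
        value = resolve_path(match.group(1), scope)
        if isinstance(value, (list, dict)):
            raise ValueError(f"placeholder {match.group(0)} is not a scalar")
        return str(value)

    return PLACEHOLDER_RE.sub(substitute, raw)
```

With `scope = {"context": {"repo": {"repo_slug": "activity-core"}}}`, the title `"Run SBOM rescan for {context.repo.repo_slug}"` renders to a plain string, while a whole-field `context.repo.repo_slug` returns the value at that path without going through string formatting.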

Unsafe action cases are rejected:

- Any action path outside `context.*` or `event.*`.
- Any path containing calls, indexing, arithmetic, filters, or boolean logic.
- Placeholder values that resolve to lists or objects.
- `for_each` values that are not a whole-field `context.*` or `event.*` path to
  a list.
- `bind_as` names that are not simple identifiers.
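
A static check for the `for_each` and `bind_as` rules might look like the sketch below. The function name `validate_expansion` and both patterns are hypothetical; whether the path actually resolves to a list can only be checked once the context is available, so that part is left out here.

```python
import re

# Assumed patterns; the real validator may be stricter or looser.
SAFE_PATH = re.compile(r"(?:context|event)(?:\.[A-Za-z_][A-Za-z0-9_]*)+")
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_expansion(for_each: str, bind_as: str) -> list:
    """Return human-readable problems; an empty list means the pair is accepted."""
    problems = []
    # fullmatch rejects calls, indexing, arithmetic, filters, and boolean logic,
    # because none of those characters are allowed in a dotted identifier path.
    if not SAFE_PATH.fullmatch(for_each):
        problems.append(f"for_each is not a plain context.*/event.* path: {for_each!r}")
    if not IDENTIFIER.fullmatch(bind_as):
        problems.append(f"bind_as is not a simple identifier: {bind_as!r}")
    return problems
```

For example, `validate_expansion("context.repos.repos", "repo")` returns no problems, while `context.repos[0]` or a `bind_as` of `repo-name` is reported.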

Per-item rule expansion is explicit:

```yaml
for_each: context.repos.repos
bind_as: repo
condition: 'context.repo.sbom_age_days > 30'
action:
  task_template: Run SBOM rescan for {context.repo.repo_slug}
  target_repo: context.repo.repo_slug
  priority: medium
  labels: ["sbom", "security", "automated"]
```

The weekly SBOM staleness definition is the canonical pattern. The State Hub
bulk resolver exposes all repository entries at `context.repos.repos`, the rule
binds each item as `context.repo`, and the strict staleness definition is
`context.repo.sbom_age_days > 30`. Thirty days exactly is not stale; thirty-one
days is stale.
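
The expansion and the strict boundary can be illustrated with a small Python sketch. The function `expand_stale_repos` and the dict layout of `context` are assumptions made for the example; only the path names and the `> 30` comparison come from the rule above.

```python
STALE_AFTER_DAYS = 30  # threshold taken from the rule condition


def expand_stale_repos(context: dict) -> list:
    """Bind each entry of context.repos.repos as context.repo and apply the strict test."""
    titles = []
    for repo in context["repos"]["repos"]:
        bound = {**context, "repo": repo}  # what the condition and action would see
        # Strictly greater than: exactly 30 days does not match.
        if bound["repo"]["sbom_age_days"] > STALE_AFTER_DAYS:
            titles.append(f"Run SBOM rescan for {bound['repo']['repo_slug']}")
    return titles
```

Given repositories aged 30, 31, and 5 days, only the 31-day entry yields a task title.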

#### Evaluation semantics

70
docs/issue-core-emission-boundary.md
Normal file
@@ -0,0 +1,70 @@
# Issue-Core Emission Boundary

activity-core owns the decision to spawn a task and the audit trail that says
why it spawned. It does not own downstream task lifecycle state after emission.

## Current authoritative endpoint

The current authoritative boundary is the issue-core REST API:

```text
POST {ISSUE_CORE_URL}/issues/
```

`IssueCoreRestSink` sends this payload:

```json
{
  "title": "Run SBOM rescan for activity-core",
  "description": "",
  "target_repo": "activity-core",
  "priority": "medium",
  "labels": ["sbom", "security", "automated"],
  "due_in_days": null,
  "source_type": "rule",
  "source_id": "flag-stale-sbom",
  "triggering_event_id": "event-or-schedule-key",
  "activity_definition_id": "activity-definition-uuid"
}
```

The expected response contains `issue_id` and may include `issue_url` and
`backend`. activity-core stores only the returned task reference in
`task_spawn_log`; issue-core remains authoritative for task status, assignment,
comments, closure, and cancellation.
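
As a rough illustration of this boundary, the sketch below assembles the payload and trims a response down to the reference fields. It is not the `IssueCoreRestSink` implementation: `build_issue_payload` and `task_reference` are hypothetical names, and the default of `"rule"` for `source_type` is assumed from the example payload.

```python
from typing import Any, Optional


def build_issue_payload(
    *,
    title: str,
    target_repo: str,
    priority: str,
    labels: list,
    source_id: str,
    triggering_event_id: str,
    activity_definition_id: str,
    due_in_days: Optional[int] = None,
    description: str = "",
    source_type: str = "rule",  # assumed default, taken from the example above
) -> dict:
    """Assemble the JSON body in the field order shown in the example payload."""
    return {
        "title": title,
        "description": description,
        "target_repo": target_repo,
        "priority": priority,
        "labels": list(labels),
        "due_in_days": due_in_days,
        "source_type": source_type,
        "source_id": source_id,
        "triggering_event_id": triggering_event_id,
        "activity_definition_id": activity_definition_id,
    }


def task_reference(response: dict) -> dict:
    """Keep only the reference fields; lifecycle state stays in issue-core."""
    if "issue_id" not in response:
        raise ValueError("issue-core response is missing issue_id")
    return {k: response[k] for k in ("issue_id", "issue_url", "backend") if k in response}
```

Dropping everything except the reference fields before writing to `task_spawn_log` keeps activity-core from accumulating a stale copy of status or assignment data.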

## REST versus NATS

Keep REST as the active emission contract until issue-core publishes and owns a
durable NATS consumer for task-creation commands. NATS is still appropriate for
event intake into activity-core, but task creation needs an acknowledged,
idempotent command boundary. A future NATS sink must return or later correlate a
task reference before it can replace `IssueCoreRestSink`.

## Safe operating modes

- `ISSUE_SINK_TYPE=null`: dry-run/audit mode. Task specs are rendered and the
  workflow records synthetic `null-*` references. This is the current Railiance
  production setting.
- `ISSUE_SINK_TYPE=rest`: live task creation. Sink failures raise out of
  `emit_tasks`, so Temporal retries and the workflow history make failures
  visible.
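
The dry-run mode can be pictured with the following sketch. The class `NullSink`, the function `select_sink_type`, and the choice to fall back to `null` when the variable is unset are assumptions for illustration; the document only specifies the two `ISSUE_SINK_TYPE` values and the `null-*` reference prefix.

```python
import os
import uuid
from typing import Mapping, Optional

VALID_SINK_TYPES = ("null", "rest")


def select_sink_type(env: Optional[Mapping[str, str]] = None) -> str:
    """Read ISSUE_SINK_TYPE, rejecting anything other than the two documented modes."""
    env = os.environ if env is None else env
    # Falling back to the dry-run mode when unset is an assumption of this sketch.
    sink_type = env.get("ISSUE_SINK_TYPE", "null")
    if sink_type not in VALID_SINK_TYPES:
        raise ValueError(f"unsupported ISSUE_SINK_TYPE: {sink_type!r}")
    return sink_type


class NullSink:
    """Dry-run sink: records rendered task specs and returns synthetic references."""

    def __init__(self) -> None:
        self.rendered = []

    def emit(self, payload: dict) -> str:
        self.rendered.append(payload)
        return f"null-{uuid.uuid4().hex}"  # synthetic reference; no task is created
```

Failing loudly on an unknown value, rather than silently picking a mode, avoids a typo in the environment turning a dry run into live task creation or the reverse.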

Weekly SBOM staleness is safe to evaluate in dry-run mode because the rule
contract is deterministic and tested. Do not enable it against the real REST sink
until issue-core credentials, endpoint reachability, and duplicate handling are
verified in the target environment.

## Verification

Local contract tests cover the rendered weekly SBOM task path and the REST
payload shape:

```bash
uv run pytest tests/test_integration_event_bridge.py tests/test_issue_sink.py
```

For a live environment, run with `ISSUE_SINK_TYPE=null` first and confirm
`task_spawn_log` contains the expected source id, condition, triggering event id,
and synthetic task reference. Then switch to `ISSUE_SINK_TYPE=rest` only after a
single known-safe rule match creates one issue-core task with the same fields.
@@ -147,6 +147,55 @@ docker exec temporal-admin-tools temporal workflow list \

---

## Daily State Hub WSJF triage verification

Use this when answering: "did today's daily triage run happen?"

Set the ActivityDefinition id when known. If it is not known, pass the
definition name used in the environment and let the live helper resolve it from
Postgres.

```bash
export DAILY_TRIAGE_ACTIVITY_ID=<daily-triage-activity-definition-uuid>

# Dry-run checklist; safe from any shell because it only prints checks.
uv run python scripts/verify_daily_triage.py \
  --activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
  --date "$(date -u +%F)"

# Live check from a shell with Temporal, DB, State Hub, and working-memory access.
ACTCORE_DB_URL=postgresql+asyncpg://actcore:actcore@localhost:5433/actcore \
TEMPORAL_HOST=localhost:7233 \
STATE_HUB_URL=http://127.0.0.1:8000 \
uv run python scripts/verify_daily_triage.py \
  --activity-id "$DAILY_TRIAGE_ACTIVITY_ID" \
  --working-memory-dir /home/worsch/the-custodian/working-memory \
  --live
```

The verification is complete when all of these agree:

- Temporal schedule `activity-schedule-$DAILY_TRIAGE_ACTIVITY_ID` exists, is not
  paused, and uses the `skip` overlap policy.
- The latest workflow found with `ActivityId="$DAILY_TRIAGE_ACTIVITY_ID"` either
  completed or is visibly retrying a failed activity in history.
- `activity_runs` has a row for the daily triage ActivityDefinition with today's
  `scheduled_for` or `fired_at` date.
- State Hub `/progress/` contains a `daily_triage` event whose detail includes
  the same `activity_core_run_id`.
- The working-memory sink wrote `daily-triage-YYYY-MM-DD-<run>.md` and its
  frontmatter contains the same `activity_core_run_id`.
- The ActivityDefinition's instruction model, token budget, and sink timeouts fit
  under `ACTIVITY_TIMEOUT_SECONDS` (default 900 seconds). Temporal retries each
  activity up to 10 attempts, so a slow LLM or sink failure should show as
  workflow retry history rather than a silent missing report.
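
The run-id cross-check in the list above could be scripted roughly as follows. This is a sketch under stated assumptions, not what `scripts/verify_daily_triage.py` does: the helper names are hypothetical, the progress detail is assumed to be a dict, and the frontmatter is assumed to be simple `key: value` lines between `---` markers.

```python
from typing import Optional


def frontmatter_value(markdown: str, key: str) -> Optional[str]:
    """Return the value of `key` from a leading '---' frontmatter block, if present."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; the report body is not searched
        name, sep, value = line.partition(":")
        if sep and name.strip() == key:
            return value.strip().strip("\"'")
    return None


def run_ids_agree(run_id: str, progress_detail: dict, report_markdown: str) -> bool:
    """True when the State Hub event and the report both carry the same run id."""
    return (
        progress_detail.get("activity_core_run_id") == run_id
        and frontmatter_value(report_markdown, "activity_core_run_id") == run_id
    )
```

A mismatch between the two sinks usually means one of them belongs to a different run, which is worth distinguishing from a sink that wrote nothing at all.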

Expected missed-run behavior: the daily triage definition should use
`misfire_policy: skip`. Daily reports missed during planned downtime are not
caught up; the next scheduled fire is the next authoritative run.

---

## Scale-out

### Multiple worker replicas

@@ -204,6 +253,44 @@ Set the environment variable before running the worker.
2. `curl http://localhost:9090/metrics` should return Temporal SDK metrics.
3. If port 9090 conflicts with a Prometheus server, set `PROMETHEUS_BIND_ADDR=0.0.0.0:9091`.

### Production alerting and failure modes

Kubernetes health expectations:

```bash
kubectl -n activity-core get deploy actcore-worker actcore-api actcore-event-router
kubectl -n activity-core get pods -l app.kubernetes.io/part-of=activity-core
kubectl -n activity-core port-forward svc/actcore-worker-metrics 9090:9090
curl -sf http://127.0.0.1:9090/metrics
```

Page an operator when:

- `actcore-worker` has no ready pod, cannot connect to Temporal, or cannot reach
  Postgres.
- The daily triage schedule is missing or paused outside an approved maintenance
  window.
- The expected daily triage run is absent from Temporal and `activity_runs`
  after the retry window.
- Both State Hub progress and working-memory report sinks are missing for a
  completed run.
- Report sink or task emission failures repeat across Temporal retries.

Leave a State Hub progress note, but do not page, when:

- A planned outage caused one skipped run and the schedule is healthy again.
- A sink idempotency check reports `exists` for the expected run id.
- The report completed but calibration feedback says the recommendations were
  noisy, too long, or under-sensitive.

Handle in the next operator session:

- Prompt/schema tuning, loose-end sensitivity, and stale-but-parked work
  calibration.
- Non-urgent schedule jitter or timeout adjustments.
- Moving a task sink from `ISSUE_SINK_TYPE=null` to the real issue-core endpoint
  after a dry-run contract check has passed.

### DB migration drift
```bash
uv run alembic current # show current revision