Files
activity-core/workplans/ACTIVITY-WP-0006-post-triage-operational-hardening.md
tegwick 30598fd1ad Expand rule actions for per-repo tasks
Add safe action interpolation and for_each binding for rule fan-out, update the weekly SBOM definition, cover the new evaluation path, and reconcile activity-core scope/workplans for the State Hub sync.
2026-06-03 11:58:24 +02:00

5.5 KiB

id, type, title, domain, repo, status, owner, topic_slug, created, updated
id type title domain repo status owner topic_slug created updated
ACTIVITY-WP-0006 workplan Post-triage operational hardening custodian activity-core ready codex custodian 2026-06-03 2026-06-03

ACTIVITY-WP-0006 — Post-triage operational hardening

Context

activity-core has crossed the main construction threshold: Temporal-backed schedules, context resolution, deterministic rules, LLM instructions, report sinks, and the Railiance production service are implemented. The daily State Hub WSJF triage cutover is now trusted enough that activity-core can be treated as the standing scheduled substrate rather than an experiment.

The next work should keep that substrate dependable and aligned with INTENT.md: activity-core owns when coordination work runs, what task/report outputs are produced, and where they are emitted. It must not grow into the task lifecycle database, a project planner, or an execution worker.

Task Status Canon Adaptation

id: ACTIVITY-WP-0006-T01
status: todo
priority: high

Adapt activity-core to State Hub's task status canon: wait, todo, progress, done, cancel.

Scope:

  • update AGENTS.md task-status examples and progression text
  • update State Hub context resolver task-status filters and digest counters
  • keep workstream/workplan lifecycle status separate; blocked remains valid for workstreams/workplans where State Hub still uses it
  • update tests that fixture or assert in_progress / task-level blocked
  • resolve the State Hub interface-change notice only after the repo is adapted

Done when the full test suite passes and activity-core no longer depends on legacy task-status aliases for State Hub API clients or tests.

Daily Triage Observability Runbook

id: ACTIVITY-WP-0006-T02
status: todo
priority: high

Document and, where cheap, automate how to answer "did today's daily triage run happen?"

The operator should be able to check:

  • Temporal schedule state and latest workflow history
  • activity_runs row for the daily triage ActivityDefinition
  • State Hub daily_triage progress event
  • working-memory report note
  • expected missed-run behavior (skip, not catch-up)
  • the configured LLM and Temporal timeout relationship

Done when docs/runbook.md has a concise daily-triage verification section and any helper command/script is covered by tests or a dry-run path.

Three-Run Calibration Feedback

id: ACTIVITY-WP-0006-T03
status: todo
priority: medium

Collect three consecutive scheduled activity-core daily triage runs and feed the result back into the Custodian WSJF calibration loop.

Assess:

  • whether the top recommendations matched actual useful follow-up work
  • report length and density
  • loose-end detection sensitivity
  • stale-but-intentionally-parked work handling
  • whether model settings or prompt/schema constraints need adjustment

Done when the calibration result is recorded in State Hub and the related CUST-WP-0044 / CUST-WP-0045 tasks can close based on activity-core runs, not Codex app fallback runs.

Rule Action Contract Documentation

id: ACTIVITY-WP-0006-T04
status: todo
priority: medium

Document the rule action contract introduced by the ADHOC-2026-06-01 work: whole-field context.* / event.* paths, scalar {context.foo} placeholders, and explicit for_each / bind_as per-item expansion.

Also decide and document the naming/semantics mismatch around action.task_template: today it is the emitted task title field, while tasks/*.md contains template files with their own title templates.

Done when ADR-003 or a focused follow-up doc contains examples, unsafe cases, and the weekly SBOM staleness definition is cited as the canonical pattern.

Production Alerting And Failure Modes

id: ACTIVITY-WP-0006-T05
status: todo
priority: medium

Turn the current confidence in the daily triage schedule into routine operational visibility.

Cover:

  • Kubernetes/Temporal worker health expectations
  • schedule paused/missing detection
  • report sink failure behavior
  • LLM timeout and retry behavior
  • what should page, what should only leave a progress note, and what should be handled in the next operator session

Done when the runbook and metrics/health surface make ordinary failures visible without inspecting a Codex Desktop session.

Issue-Core Emission Boundary Verification

id: ACTIVITY-WP-0006-T06
status: todo
priority: medium

Verify the downstream task emission boundary now that rule fan-out is real.

Questions to close:

  • which issue-core endpoint is authoritative for task creation in the current environment
  • whether IssueCoreRestSink should keep using REST or move to the intended NATS subscription path
  • whether emitted rule tasks carry enough title, description, labels, source id, condition, and target repo data for issue-core and operators
  • whether weekly SBOM staleness can be safely enabled against the real sink

Done when there is a tested or dry-run-verified path from a rule match to a downstream task reference, and activity-core still owns only the spawn audit trail, not task lifecycle state.

Completion Criteria

  • State Hub task-status canon adaptation is complete.
  • Daily triage has an operator-grade verification path and three-run calibration evidence.
  • Rule action semantics are documented and no longer surprising.
  • Production failure modes are observable enough for routine operation.
  • Downstream task emission has been verified without expanding activity-core's ownership boundary.