---
id: ACTIVITY-WP-0006
type: workplan
title: "Post-triage operational hardening"
domain: custodian
repo: activity-core
status: active
owner: codex
topic_slug: custodian
created: "2026-06-03"
updated: "2026-06-06"
state_hub_workstream_id: "5646e13a-13af-4724-bca6-3c0d86f96733"
---

# ACTIVITY-WP-0006 — Post-triage operational hardening

## Context

activity-core has crossed the main construction threshold: Temporal-backed
schedules, context resolution, deterministic rules, LLM instructions, report
sinks, and the Railiance production service are implemented. The daily State
Hub WSJF triage cutover is now trusted enough that activity-core can be treated
as the standing scheduled substrate rather than an experiment.

The next work should keep that substrate dependable and aligned with
`INTENT.md`: activity-core owns when coordination work runs, what task/report
outputs are produced, and where they are emitted. It must not grow into the
task lifecycle database, a project planner, or an execution worker.

## Task Status Canon Adaptation

```task
id: ACTIVITY-WP-0006-T01
status: done
priority: high
state_hub_task_id: "5d79e3da-d26d-4cad-9cdf-5e5264bb7019"
```

Adapt activity-core to State Hub's task status canon: `wait`, `todo`,
`progress`, `done`, `cancel`.

Scope:

- update `AGENTS.md` task-status examples and progression text
- update State Hub context resolver task-status filters and digest counters
- keep workstream/workplan lifecycle status separate; `blocked` remains valid
  for workstreams/workplans where State Hub still uses it
- update tests that fixture or assert `in_progress` / task-level `blocked`
- resolve the State Hub interface-change notice only after the repo is adapted

Done when the full test suite passes and activity-core no longer depends on
legacy task-status aliases for State Hub API clients or tests.

2026-06-04: Completed.
`AGENTS.md` now uses State Hub task statuses `wait`, `todo`, `progress`,
`done`, and `cancel`; workplan/workstream lifecycle `blocked` remains separate.
The State Hub daily triage digest now counts `wait/todo/progress` open tasks
and no longer fixtures task-level `in_progress` or `blocked`. Full suite
passed: 128 passed, 1 skipped.

## Daily Triage Observability Runbook

```task
id: ACTIVITY-WP-0006-T02
status: done
priority: high
state_hub_task_id: "02c34443-0e8d-4f1a-93d9-6c39f07faad7"
```

Document and, where cheap, automate how to answer "did today's daily triage run
happen?" The operator should be able to check:

- Temporal schedule state and latest workflow history
- `activity_runs` row for the daily triage ActivityDefinition
- State Hub `daily_triage` progress event
- working-memory report note
- expected missed-run behavior (`skip`, not catch-up)
- the configured LLM and Temporal timeout relationship

Done when `docs/runbook.md` has a concise daily-triage verification section and
any helper command/script is covered by tests or a dry-run path.

2026-06-04: Completed. Added `scripts/verify_daily_triage.py` with dry-run and
live modes, plus `tests/test_daily_triage_verifier.py`. `docs/runbook.md` now
covers Temporal schedule/workflow checks, `activity_runs`, State Hub progress,
working-memory notes, missed-run `skip` behavior, and LLM timeout budget.

2026-06-05: Follow-up hardening after the scheduled WSJF triage ran but emitted
no report because the live schema required `wsjf` fields and the stale DB
prompt did not request them. The verifier default and runbook now point at the
live working-memory sink path, `/home/worsch/the-custodian/memory/working`.

2026-06-06: Added a schedule smoke-test routine for new or changed recurring
ActivityDefinitions.
Operators can recreate the recurring Temporal Schedule, schedule a one-shot
smoke run one minute in the future, wait for completion, and get a non-zero
warning if workflow imports, activity registration, or runtime wiring are
broken.

2026-06-06: Exercised the routine against the daily triage definition. The
daily recurring Temporal Schedule was deleted and recreated, then a one-shot
smoke workflow completed with run id `c2db32e5-3874-522f-ae1f-9b2cdf307fd2` and
emitted a validated `daily_triage` report plus working-memory note.

## Three-Run Calibration Feedback

```task
id: ACTIVITY-WP-0006-T03
status: wait
priority: medium
state_hub_task_id: "7cbf0a35-71a1-47ac-afc2-f51ad2180fd0"
```

Collect three consecutive scheduled activity-core daily triage runs and feed
the result back into the Custodian WSJF calibration loop.

Assess:

- whether the top recommendations matched actual useful follow-up work
- report length and density
- loose-end detection sensitivity
- stale-but-intentionally-parked work handling
- whether model settings or prompt/schema constraints need adjustment

Done when the calibration result is recorded in State Hub and the related
`CUST-WP-0044` / `CUST-WP-0045` tasks can close based on activity-core runs,
not Codex app fallback runs.

2026-06-04: Waiting on real evidence. The repo now has a verification path for
scheduled daily triage runs, but this task still requires three consecutive
actual activity-core scheduled runs and State Hub calibration feedback. Local
tests cannot substitute for that operational evidence.

2026-06-06: The scheduled run fired at 07:20 Europe/Berlin but initially got
stuck on a stale worker import error after ops-evidence wiring landed.
Restarting the worker let Temporal complete the run, and the hardened report
path emitted a validation-failure note instead of losing the evidence. This run
is useful calibration input, but it is not a clean consecutive scheduled
success.
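
The "three consecutive scheduled runs" gate above could be checked mechanically
once run evidence is collected. The following is a minimal sketch, not the
repo's verifier: `TriageRun`, its fields, and `has_consecutive_clean_runs` are
hypothetical names, and "clean" is assumed to mean that the run completed
without operator intervention and emitted a validated report.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class TriageRun:
    """One observed daily triage run (hypothetical record shape)."""

    run_date: date
    scheduled: bool  # fired by the recurring schedule, not a smoke or manual run
    clean: bool      # assumed: no operator intervention and a validated report


def has_consecutive_clean_runs(runs: Iterable[TriageRun], needed: int = 3) -> bool:
    """Return True if `needed` clean scheduled runs fall on consecutive days."""
    # Smoke and manual runs are ignored: they neither extend nor break a streak.
    scheduled = sorted((r for r in runs if r.scheduled), key=lambda r: r.run_date)
    streak = 0
    previous: Optional[date] = None
    for run in scheduled:
        if not run.clean:
            # A scheduled run that needed help resets the streak.
            streak, previous = 0, None
            continue
        if previous is not None and run.run_date - previous == timedelta(days=1):
            streak += 1
        else:
            # First clean run, or a gap of more than one day.
            streak = 1
        previous = run.run_date
        if streak >= needed:
            return True
    return False
```

Under these assumptions, a scheduled run that only completed after a worker
restart would be recorded with `clean=False` and would restart the count, which
matches the 2026-06-06 note above.
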
## Rule Action Contract Documentation

```task
id: ACTIVITY-WP-0006-T04
status: done
priority: medium
state_hub_task_id: "c9066d2e-0429-4e14-a68a-8418061ffd8d"
```

Document the rule action contract introduced by the ADHOC-2026-06-01 work:
whole-field `context.*` / `event.*` paths, scalar `{context.foo}` placeholders,
and explicit `for_each` / `bind_as` per-item expansion.

Also decide and document the naming/semantics mismatch around
`action.task_template`: today it is the emitted task title field, while
`tasks/*.md` contains template files with their own title templates.

Done when ADR-003 or a focused follow-up doc contains examples, unsafe cases,
and the weekly SBOM staleness definition is cited as the canonical pattern.

2026-06-04: Completed. Updated ADR-003 with whole-field path rendering, scalar
placeholder rendering, unsafe action cases, explicit `for_each` / `bind_as`
expansion, the `task_template` naming mismatch, and weekly SBOM staleness as
the canonical per-item pattern.

## Production Alerting And Failure Modes

```task
id: ACTIVITY-WP-0006-T05
status: done
priority: medium
state_hub_task_id: "420ea629-0c20-4d09-9cc1-6b2f32665161"
```

Turn the current confidence in the daily triage schedule into routine
operational visibility.

Cover:

- Kubernetes/Temporal worker health expectations
- schedule paused/missing detection
- report sink failure behavior
- LLM timeout and retry behavior
- what should page, what should only leave a progress note, and what should be
  handled in the next operator session

Done when the runbook and metrics/health surface make ordinary failures visible
without inspecting a Codex Desktop session.

2026-06-04: Completed. `docs/runbook.md` now documents Kubernetes
worker/API/router health checks, Temporal schedule paused/missing checks,
report sink failure behavior, LLM timeout/retry behavior, and
page/note/next-session classification.
Task emission sink failures now raise from `emit_tasks`, making them visible to
Temporal retries instead of warning-only logs.

2026-06-05: Added instruction-output robustness for report-sink instructions:
after retry exhaustion, schema-invalid model output now produces a durable
validation-failure report containing bounded partial output instead of a silent
empty result. Report sinks include validation metadata in working-memory
frontmatter and State Hub progress detail.

2026-06-06: Hardened instruction output parsing to accept a single Markdown
JSON fence when the fenced content is valid JSON, while preserving the
validation-failure artifact path for genuinely invalid output.

## Issue-Core Emission Boundary Verification

```task
id: ACTIVITY-WP-0006-T06
status: done
priority: medium
state_hub_task_id: "78089aef-aba1-42d7-a203-ef80ba6791d9"
```

Verify the downstream task emission boundary now that rule fan-out is real.

Questions to close:

- which issue-core endpoint is authoritative for task creation in the current
  environment
- whether `IssueCoreRestSink` should keep using REST or move to the intended
  NATS subscription path
- whether emitted rule tasks carry enough title, description, labels, source
  id, condition, and target repo data for issue-core and operators
- whether weekly SBOM staleness can be safely enabled against the real sink

Done when there is a tested or dry-run-verified path from a rule match to a
downstream task reference, and activity-core still owns only the spawn audit
trail, not task lifecycle state.

2026-06-04: Completed. Added `docs/issue-core-emission-boundary.md` documenting
REST `/issues/` as the current authoritative endpoint, NATS as future work,
Railiance `ISSUE_SINK_TYPE=null` dry-run mode, and the fields sent to
issue-core versus retained in `task_spawn_log`. Added REST payload and sink
failure tests in `tests/test_issue_sink.py`; the existing weekly SBOM
integration test remains the dry-run rule-match-to-task-reference proof.
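
To make the ownership boundary above concrete, here is a minimal sketch of a
dry-run emission path. It is an illustration under stated assumptions, not the
code in this repo: `EmittedTask`, `NullIssueSink`, `select_sink`, `spawn`, the
`dry-run:` reference format, and the exact spawn-log keys are all hypothetical.
Only the `ISSUE_SINK_TYPE=null` setting and the field list (title, description,
labels, source id, condition, target repo) come from this workplan.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class EmittedTask:
    """Fields a rule match would send downstream (hypothetical shape)."""

    title: str
    description: str
    labels: List[str]
    source_id: str
    condition: str
    target_repo: str


class IssueSink(Protocol):
    def emit(self, task: EmittedTask) -> str:
        """Send one task downstream and return a task reference."""


@dataclass
class NullIssueSink:
    """Dry-run sink: records what would be sent and makes no network call."""

    seen: List[EmittedTask] = field(default_factory=list)

    def emit(self, task: EmittedTask) -> str:
        self.seen.append(task)
        # Hypothetical reference format, used only for this sketch.
        return f"dry-run:{task.source_id}"


def select_sink(env: Dict[str, str]) -> IssueSink:
    """Pick a sink from ISSUE_SINK_TYPE; only the null mode is sketched here."""
    sink_type = env.get("ISSUE_SINK_TYPE")
    if sink_type == "null":
        return NullIssueSink()
    raise NotImplementedError(f"sink type {sink_type!r} is outside this sketch")


def spawn(task: EmittedTask, sink: IssueSink, spawn_log: List[Dict[str, str]]) -> str:
    """Emit a task and record only the audit trail, never lifecycle state."""
    # Sink errors are deliberately not caught, so a caller (for example a
    # Temporal activity) sees the failure and can retry.
    reference = sink.emit(task)
    spawn_log.append(
        {
            "source_id": task.source_id,
            "condition": task.condition,
            "task_ref": reference,
        }
    )
    return reference
```

The design point is that the spawn log holds a reference to the downstream task
and the reason it was spawned, but no status field: task lifecycle stays with
issue-core.
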
## Completion Criteria

- State Hub task-status canon adaptation is complete.
- Daily triage has an operator-grade verification path and three-run
  calibration evidence.
- Rule action semantics are documented and no longer surprising.
- Production failure modes are observable enough for routine operation.
- Downstream task emission has been verified without expanding activity-core's
  ownership boundary.