---
id: ACTIVITY-WP-0006
type: workplan
title: "Post-triage operational hardening"
domain: custodian
repo: activity-core
status: active
owner: codex
topic_slug: custodian
created: "2026-06-03"
updated: "2026-06-06"
state_hub_workstream_id: "5646e13a-13af-4724-bca6-3c0d86f96733"
---

# ACTIVITY-WP-0006: Post-triage operational hardening

## Context

activity-core has crossed the main construction threshold: Temporal-backed
schedules, context resolution, deterministic rules, LLM instructions, report
sinks, and the Railiance production service are implemented. The daily State
Hub WSJF triage cutover is now trusted enough that activity-core can be treated
as the standing scheduled substrate rather than an experiment.

The next work should keep that substrate dependable and aligned with
`INTENT.md`: activity-core owns when coordination work runs, what task/report
outputs are produced, and where they are emitted. It must not grow into the
task lifecycle database, a project planner, or an execution worker.

## Task Status Canon Adaptation

```task
id: ACTIVITY-WP-0006-T01
status: done
priority: high
state_hub_task_id: "5d79e3da-d26d-4cad-9cdf-5e5264bb7019"
```

Adapt activity-core to State Hub's task status canon:
`wait`, `todo`, `progress`, `done`, `cancel`.

Scope:
- update `AGENTS.md` task-status examples and progression text
- update State Hub context resolver task-status filters and digest counters
- keep workstream/workplan lifecycle status separate; `blocked` remains valid
  for workstreams/workplans where State Hub still uses it
- update tests that fixture or assert `in_progress` / task-level `blocked`
- resolve the State Hub interface-change notice only after the repo is adapted

Done when the full test suite passes and activity-core no longer depends on
legacy task-status aliases for State Hub API clients or tests.

2026-06-04: Completed. `AGENTS.md` now uses State Hub task statuses
`wait`, `todo`, `progress`, `done`, and `cancel`; workplan/workstream lifecycle
`blocked` remains separate. The State Hub daily triage digest now counts
`wait/todo/progress` open tasks and no longer fixtures task-level
`in_progress` or `blocked`. Full suite passed: 128 passed, 1 skipped.

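The status canon above can be illustrated with a small counter. This is a minimal sketch, not the repository's resolver: the function name `count_open_tasks`, the `{"status": ...}` task shape, and the choice to reject (rather than map) legacy aliases are all assumptions made for illustration.

```python
from collections import Counter

# Canonical State Hub task statuses, as listed in this task.
TASK_STATUSES = ("wait", "todo", "progress", "done", "cancel")
# Statuses the digest counts as open, per the completion note.
OPEN_STATUSES = frozenset({"wait", "todo", "progress"})


def count_open_tasks(tasks):
    """Count open tasks per status (hypothetical helper).

    Anything outside the canon, such as the legacy "in_progress" or a
    task-level "blocked", raises instead of being silently mapped.
    """
    counts = Counter()
    for task in tasks:
        status = task["status"]
        if status not in TASK_STATUSES:
            raise ValueError(f"non-canonical task status: {status!r}")
        if status in OPEN_STATUSES:
            counts[status] += 1
    return dict(counts)
```

Failing loudly on a legacy alias is one way to keep fixtures from drifting back to the old vocabulary; a tolerant mapping layer would be a different design choice.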
## Daily Triage Observability Runbook

```task
id: ACTIVITY-WP-0006-T02
status: done
priority: high
state_hub_task_id: "02c34443-0e8d-4f1a-93d9-6c39f07faad7"
```

Document and, where cheap, automate how to answer "did today's daily triage
run happen?"

The operator should be able to check:
- Temporal schedule state and latest workflow history
- `activity_runs` row for the daily triage ActivityDefinition
- State Hub `daily_triage` progress event
- working-memory report note
- expected missed-run behavior (`skip`, not catch-up)
- the configured LLM and Temporal timeout relationship

Done when `docs/runbook.md` has a concise daily-triage verification section
and any helper command/script is covered by tests or a dry-run path.

2026-06-04: Completed. Added `scripts/verify_daily_triage.py` with dry-run and
live modes, plus `tests/test_daily_triage_verifier.py`. `docs/runbook.md` now
covers Temporal schedule/workflow checks, `activity_runs`, State Hub progress,
working-memory notes, missed-run `skip` behavior, and LLM timeout budget.

2026-06-05: Follow-up hardening after the scheduled WSJF triage ran but emitted
no report because the live schema required `wsjf` fields and the stale DB prompt
did not request them. The verifier default and runbook now point at the live
working-memory sink path, `/home/worsch/the-custodian/memory/working`.

2026-06-06: Added a schedule smoke-test routine for new or changed recurring
ActivityDefinitions. Operators can recreate the recurring Temporal Schedule,
schedule a one-shot smoke run one minute in the future, wait for completion,
and get a non-zero warning if workflow imports, activity registration, or
runtime wiring are broken.

2026-06-06: Exercised the routine against the daily triage definition. The
daily recurring Temporal Schedule was deleted and recreated, then a one-shot
smoke workflow completed with run id `c2db32e5-3874-522f-ae1f-9b2cdf307fd2`
and emitted a validated `daily_triage` report plus working-memory note.

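The dry-run versus live split described in this task might be structured roughly as follows. This sketch does not reproduce `scripts/verify_daily_triage.py`; the check names, the `CheckResult` type, and the probe signature are hypothetical and only mirror the operator checklist above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str


# Hypothetical check names, one per evidence source in the checklist.
CHECK_NAMES = [
    "temporal_schedule",
    "activity_runs_row",
    "state_hub_progress_event",
    "working_memory_note",
]


def run_checks(probes: dict[str, Callable[[], tuple[bool, str]]], dry_run: bool) -> list[CheckResult]:
    """Run each probe, or only list what would run when dry_run is set."""
    results = []
    for name in CHECK_NAMES:
        if dry_run:
            results.append(CheckResult(name, True, "dry-run: not executed"))
            continue
        try:
            ok, detail = probes[name]()
        except Exception as exc:  # a crashing probe counts as a failed check
            ok, detail = False, f"probe raised {exc!r}"
        results.append(CheckResult(name, ok, detail))
    return results


def exit_code(results: list[CheckResult]) -> int:
    """Return 0 only when every check passed, so shells and CI can react."""
    return 0 if all(result.ok for result in results) else 1
```

Keeping the probes injectable is what lets a test suite exercise the aggregation logic without a live Temporal or State Hub connection.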
## Three-Run Calibration Feedback

```task
id: ACTIVITY-WP-0006-T03
status: wait
priority: medium
state_hub_task_id: "7cbf0a35-71a1-47ac-afc2-f51ad2180fd0"
```

Collect three consecutive scheduled activity-core daily triage runs and feed
the result back into the Custodian WSJF calibration loop.

Assess:
- whether the top recommendations matched actual useful follow-up work
- report length and density
- loose-end detection sensitivity
- stale-but-intentionally-parked work handling
- whether model settings or prompt/schema constraints need adjustment

Done when the calibration result is recorded in State Hub and the related
`CUST-WP-0044` / `CUST-WP-0045` tasks can close based on activity-core runs,
not Codex app fallback runs.

2026-06-04: Waiting on real evidence. The repo now has a verification path for
scheduled daily triage runs, but this task still requires three consecutive
actual activity-core scheduled runs and State Hub calibration feedback. Local
tests cannot substitute for that operational evidence.

2026-06-06: The scheduled run fired at 07:20 Europe/Berlin but initially got
stuck on a stale worker import error after ops-evidence wiring landed.
Restarting the worker let Temporal complete the run, and the hardened report
path emitted a validation-failure note instead of losing the evidence. This run
is useful calibration input, but it is not a clean consecutive scheduled
success.

## Rule Action Contract Documentation

```task
id: ACTIVITY-WP-0006-T04
status: done
priority: medium
state_hub_task_id: "c9066d2e-0429-4e14-a68a-8418061ffd8d"
```

Document the rule action contract introduced by the ADHOC-2026-06-01 work:
whole-field `context.*` / `event.*` paths, scalar `{context.foo}` placeholders,
and explicit `for_each` / `bind_as` per-item expansion.

Also decide and document the naming/semantics mismatch around
`action.task_template`: today it is the emitted task title field, while
`tasks/*.md` contains template files with their own title templates.

Done when ADR-003 or a focused follow-up doc contains examples, unsafe cases,
and the weekly SBOM staleness definition is cited as the canonical pattern.

2026-06-04: Completed. Updated ADR-003 with whole-field path rendering,
scalar placeholder rendering, unsafe action cases, explicit `for_each` /
`bind_as` expansion, the `task_template` naming mismatch, and weekly SBOM
staleness as the canonical per-item pattern.

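The three rendering modes named in this task can be sketched as below. ADR-003 remains the authoritative description; this is only an illustration under assumptions: the helper names, the regexes, binding the per-item value under `context.<bind_as>`, and rejecting non-scalar placeholders with `TypeError` are all hypothetical.

```python
import re

# A field that is nothing but a path is rendered whole, keeping its type.
WHOLE_FIELD = re.compile(r"(?:context|event)\.[A-Za-z0-9_.]+")
# A "{context.foo}" placeholder embedded in a longer string.
PLACEHOLDER = re.compile(r"\{((?:context|event)\.[A-Za-z0-9_.]+)\}")


def lookup(path, scope):
    """Walk a dotted path such as "context.repo.name" through nested dicts."""
    value = scope
    for part in path.split("."):
        value = value[part]
    return value


def render_field(value, scope):
    if not isinstance(value, str):
        return value
    if WHOLE_FIELD.fullmatch(value):
        return lookup(value, scope)  # whole-field path: no string coercion

    def substitute(match):
        resolved = lookup(match.group(1), scope)
        if isinstance(resolved, (dict, list)):
            # Treated as an unsafe case here: placeholders are scalar-only.
            raise TypeError(f"{match.group(1)} is not a scalar")
        return str(resolved)

    return PLACEHOLDER.sub(substitute, value)


def expand_action(action, scope):
    """Render one action, or one action per item when for_each is present."""
    fields = {k: v for k, v in action.items() if k not in ("for_each", "bind_as")}
    if "for_each" not in action:
        return [{k: render_field(v, scope) for k, v in fields.items()}]
    rendered = []
    for item in lookup(action["for_each"], scope):
        # Assumption: the bound item is exposed as context.<bind_as>.
        context = {**scope["context"], action["bind_as"]: item}
        item_scope = {**scope, "context": context}
        rendered.append({k: render_field(v, item_scope) for k, v in fields.items()})
    return rendered
```

With a scope holding a list under `context.repos`, an action with `for_each: context.repos` and `bind_as: repo` yields one rendered action per repository, which is the per-item shape the weekly SBOM staleness definition is cited for.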
## Production Alerting And Failure Modes

```task
id: ACTIVITY-WP-0006-T05
status: done
priority: medium
state_hub_task_id: "420ea629-0c20-4d09-9cc1-6b2f32665161"
```

Turn the current confidence in the daily triage schedule into routine
operational visibility.

Cover:
- Kubernetes/Temporal worker health expectations
- schedule paused/missing detection
- report sink failure behavior
- LLM timeout and retry behavior
- what should page, what should only leave a progress note, and what should be
  handled in the next operator session

Done when the runbook and metrics/health surface make ordinary failures visible
without inspecting a Codex Desktop session.

2026-06-04: Completed. `docs/runbook.md` now documents Kubernetes worker/API/
router health checks, Temporal schedule paused/missing checks, report sink
failure behavior, LLM timeout/retry behavior, and page/note/next-session
classification. Task emission sink failures now raise from `emit_tasks`, making
them visible to Temporal retries instead of warning-only logs.

2026-06-05: Added instruction-output robustness for report-sink instructions:
after retry exhaustion, schema-invalid model output now produces a durable
validation-failure report containing bounded partial output instead of a silent
empty result. Report sinks include validation metadata in working-memory
frontmatter and State Hub progress detail.

2026-06-06: Hardened instruction output parsing to accept a single Markdown
JSON fence when the fenced content is valid JSON, while preserving the
validation-failure artifact path for genuinely invalid output.

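The fence-tolerant parsing from the 2026-06-06 note could look something like this. It is a sketch under assumptions, not the repository's parser: `parse_model_output`, the shape of the failure artifact, and the `MAX_PARTIAL` bound are hypothetical.

```python
import json
import re

TICKS = "`" * 3   # built indirectly so this example nests cleanly in Markdown
MAX_PARTIAL = 2000  # hypothetical bound on retained partial output

# Matches output that consists of exactly one fence, optionally tagged json.
FENCE = re.compile(
    r"\A\s*" + TICKS + r"(?:json)?[ \t]*\n(.*?)\n" + TICKS + r"\s*\Z",
    re.DOTALL,
)


def parse_model_output(raw):
    """Return (data, None) on success or (None, failure_artifact) otherwise."""
    match = FENCE.match(raw)
    text = match.group(1) if match else raw
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        # Keep bounded evidence instead of returning a silent empty result.
        artifact = {
            "validation": "failed",
            "error": str(exc),
            "partial_output": raw[:MAX_PARTIAL],
        }
        return None, artifact
```

Output containing two fences, or prose around a fence, does not unwrap to valid JSON here and therefore falls through to the failure artifact rather than being partially accepted.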
## Issue-Core Emission Boundary Verification

```task
id: ACTIVITY-WP-0006-T06
status: done
priority: medium
state_hub_task_id: "78089aef-aba1-42d7-a203-ef80ba6791d9"
```

Verify the downstream task emission boundary now that rule fan-out is real.

Questions to close:
- which issue-core endpoint is authoritative for task creation in the current
  environment
- whether `IssueCoreRestSink` should keep using REST or move to the intended
  NATS subscription path
- whether emitted rule tasks carry enough title, description, labels,
  source id, condition, and target repo data for issue-core and operators
- whether weekly SBOM staleness can be safely enabled against the real sink

Done when there is a tested or dry-run-verified path from a rule match to a
downstream task reference, and activity-core still owns only the spawn audit
trail, not task lifecycle state.

2026-06-04: Completed. Added `docs/issue-core-emission-boundary.md` documenting
REST `/issues/` as the current authoritative endpoint, NATS as future work,
Railiance `ISSUE_SINK_TYPE=null` dry-run mode, and the fields sent to
issue-core versus retained in `task_spawn_log`. Added REST payload and sink
failure tests in `tests/test_issue_sink.py`; the existing weekly SBOM integration
test remains the dry-run rule-match-to-task-reference proof.

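A dry-run sink selected by `ISSUE_SINK_TYPE=null` might be modelled along these lines. Only the environment variable name and value come from this workplan; the class names, the required-field list (taken from the question above and assumed mandatory), the default of `null`, and the reference format are hypothetical.

```python
import os

# Fields the boundary question asks about; treating them as required is an assumption.
REQUIRED_FIELDS = ("title", "description", "labels", "source_id", "condition", "target_repo")


class SinkError(RuntimeError):
    """Raised instead of logged, so a retrying caller can see the failure."""


class NullSink:
    """Dry-run sink: records what would be sent and returns a synthetic reference."""

    def __init__(self):
        self.emitted = []

    def emit(self, task):
        self.emitted.append(task)
        return f"dry-run-{len(self.emitted)}"


def make_sink(env=os.environ):
    # Defaulting to the dry-run sink is an assumption of this sketch.
    sink_type = env.get("ISSUE_SINK_TYPE", "null")
    if sink_type == "null":
        return NullSink()
    raise SinkError(f"sink type {sink_type!r} is not covered by this sketch")


def emit_all(sink, tasks):
    """Emit each task and collect downstream references, failing loudly."""
    refs = []
    for task in tasks:
        missing = [name for name in REQUIRED_FIELDS if not task.get(name)]
        if missing:
            raise SinkError(f"task is missing fields: {missing}")
        refs.append(sink.emit(task))
    return refs
```

The returned references stand in for downstream task ids, which keeps the sketch on the spawn-audit side of the boundary: nothing here tracks what happens to a task after emission.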
## Completion Criteria

- State Hub task-status canon adaptation is complete.
- Daily triage has an operator-grade verification path and three-run
  calibration evidence.
- Rule action semantics are documented and no longer surprising.
- Production failure modes are observable enough for routine operation.
- Downstream task emission has been verified without expanding activity-core's
  ownership boundary.