| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|
| ACTIVITY-WP-0006 | workplan | Post-triage operational hardening | custodian | activity-core | active | codex | custodian | 2026-06-03 | 2026-06-07 | 5646e13a-13af-4724-bca6-3c0d86f96733 |
ACTIVITY-WP-0006 — Post-triage operational hardening
Context
activity-core has crossed the main construction threshold: Temporal-backed schedules, context resolution, deterministic rules, LLM instructions, report sinks, and the Railiance production service are implemented. The daily State Hub WSJF triage cutover is now trusted enough that activity-core can be treated as the standing scheduled substrate rather than an experiment.
The next work should keep that substrate dependable and aligned with
INTENT.md: activity-core owns when coordination work runs, what task/report
outputs are produced, and where they are emitted. It must not grow into the
task lifecycle database, a project planner, or an execution worker.
Task Status Canon Adaptation
id: ACTIVITY-WP-0006-T01
status: done
priority: high
state_hub_task_id: "5d79e3da-d26d-4cad-9cdf-5e5264bb7019"
Adapt activity-core to State Hub's task status canon:
wait, todo, progress, done, cancel.
Scope:
- update AGENTS.md task-status examples and progression text
- update State Hub context resolver task-status filters and digest counters
- keep workstream/workplan lifecycle status separate; blocked remains valid for workstreams/workplans where State Hub still uses it
- update tests that fixture or assert in_progress / task-level blocked
- resolve the State Hub interface-change notice only after the repo is adapted
Done when the full test suite passes and activity-core no longer depends on legacy task-status aliases for State Hub API clients or tests.
2026-06-04: Completed. AGENTS.md now uses State Hub task statuses
wait, todo, progress, done, and cancel; workplan/workstream lifecycle
blocked remains separate. The State Hub daily triage digest now counts
wait/todo/progress open tasks and no longer fixtures task-level
in_progress or blocked. Full suite passed: 128 passed, 1 skipped.
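A minimal sketch of the canon and the open-task count described above. The five status values and the wait/todo/progress open set come from this workplan; the names TaskStatus, OPEN_STATUSES, and count_open_tasks are hypothetical and are not taken from the activity-core code.

```python
from collections import Counter
from enum import Enum


class TaskStatus(str, Enum):
    """Task-level statuses in the State Hub canon named in this workplan."""

    WAIT = "wait"
    TODO = "todo"
    PROGRESS = "progress"
    DONE = "done"
    CANCEL = "cancel"


# The daily triage digest counts these three statuses as open.
OPEN_STATUSES = frozenset({TaskStatus.WAIT, TaskStatus.TODO, TaskStatus.PROGRESS})


def count_open_tasks(tasks):
    """Return per-status counts of open tasks.

    Each task is assumed to be a dict with a "status" key (hypothetical shape).
    Legacy aliases such as "in_progress" raise ValueError instead of being
    mapped silently.
    """
    counts = Counter()
    for task in tasks:
        status = TaskStatus(task["status"])
        if status in OPEN_STATUSES:
            counts[status.value] += 1
    return dict(counts)
```

Rejecting unknown values outright, rather than translating them, is one way to keep legacy task-level aliases from creeping back into fixtures.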
Daily Triage Observability Runbook
id: ACTIVITY-WP-0006-T02
status: done
priority: high
state_hub_task_id: "02c34443-0e8d-4f1a-93d9-6c39f07faad7"
Document and, where cheap, automate how to answer "did today's daily triage run happen?"
The operator should be able to check:
- Temporal schedule state and latest workflow history
- activity_runs row for the daily triage ActivityDefinition
- State Hub daily_triage progress event
- working-memory report note
- expected missed-run behavior (skip, not catch-up)
- the configured LLM and Temporal timeout relationship
Done when docs/runbook.md has a concise daily-triage verification section
and any helper command/script is covered by tests or a dry-run path.
2026-06-04: Completed. Added scripts/verify_daily_triage.py with dry-run and
live modes, plus tests/test_daily_triage_verifier.py. docs/runbook.md now
covers Temporal schedule/workflow checks, activity_runs, State Hub progress,
working-memory notes, missed-run skip behavior, and LLM timeout budget.
2026-06-05: Follow-up hardening after the scheduled WSJF triage ran but emitted
no report because the live schema required wsjf fields and the stale DB prompt
did not request them. The verifier default and runbook now point at the live
working-memory sink path, /home/worsch/the-custodian/memory/working.
2026-06-06: Added a schedule smoke-test routine for new or changed recurring ActivityDefinitions. Operators can recreate the recurring Temporal Schedule, schedule a one-shot smoke run one minute in the future, wait for completion, and get a non-zero warning if workflow imports, activity registration, or runtime wiring are broken.
2026-06-06: Exercised the routine against the daily triage definition. The
daily recurring Temporal Schedule was deleted and recreated, then a one-shot
smoke workflow completed with run id c2db32e5-3874-522f-ae1f-9b2cdf307fd2
and emitted a validated daily_triage report plus working-memory note.
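One plausible reading of the LLM and Temporal timeout relationship checked in this task is that the worst-case time spent on LLM attempts must fit inside the Temporal activity timeout. The sketch below illustrates that reading only. The field names, the fixed-backoff assumption, and the numbers in the usage note are hypothetical and do not come from the runbook or from scripts/verify_daily_triage.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutBudget:
    """Hypothetical timeout settings, all in seconds."""

    llm_timeout_s: float       # per-attempt LLM request timeout
    llm_max_attempts: int      # LLM attempts inside one activity execution
    retry_backoff_s: float     # assumed fixed pause between attempts
    activity_timeout_s: float  # Temporal timeout covering the whole activity

    def worst_case_s(self) -> float:
        """Time used if every attempt runs to its timeout."""
        pauses = max(self.llm_max_attempts - 1, 0) * self.retry_backoff_s
        return self.llm_max_attempts * self.llm_timeout_s + pauses

    def fits(self) -> bool:
        """True when the worst case finishes before the activity timeout."""
        return self.worst_case_s() < self.activity_timeout_s
```

For example, three attempts of 120 seconds with 5-second pauses need 370 seconds in the worst case, so a 400-second activity timeout fits and a 300-second one does not.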
Three-Run Calibration Feedback
id: ACTIVITY-WP-0006-T03
status: wait
priority: medium
state_hub_task_id: "7cbf0a35-71a1-47ac-afc2-f51ad2180fd0"
Collect three consecutive scheduled activity-core daily triage runs and feed the result back into the Custodian WSJF calibration loop.
Assess:
- whether the top recommendations matched actual useful follow-up work
- report length and density
- loose-end detection sensitivity
- stale-but-intentionally-parked work handling
- whether model settings or prompt/schema constraints need adjustment
Done when the calibration result is recorded in State Hub and the related
CUST-WP-0044 / CUST-WP-0045 tasks can close based on activity-core runs,
not Codex app fallback runs.
2026-06-04: Waiting on real evidence. The repo now has a verification path for scheduled daily triage runs, but this task still requires three consecutive actual activity-core scheduled runs and State Hub calibration feedback. Local tests cannot substitute for that operational evidence.
2026-06-06: The scheduled run fired at 07:20 Europe/Berlin but initially got stuck on a stale worker import error after the ops-evidence wiring landed. Restarting the worker let Temporal complete the run, and the hardened report path emitted a validation-failure note instead of losing the evidence. This run is useful calibration input, but it does not count as a clean consecutive scheduled success.
2026-06-07: Investigated the missing June 7 WSJF result. State Hub had no
daily_triage event for the date, no local activity-core DB/Temporal/API ports
were reachable, and the current Railiance Kubernetes context had no
activity-core namespace. The Railiance runtime projection also lacked
daily-statehub-wsjf-triage.md, and the node-local State Hub bridge target
127.0.0.1:18000 returned connection reset. Patched activity-core to project
the daily definition, mount the schema and working-memory storage, expose
LLM_CONNECT_URL, include working_memory_path in State Hub progress detail,
and emit a visible execution_failed report for report-sink instructions when
llm-connect is missing or broken. Cross-repo closure tasks were posted via
State Hub to state-hub (dc10704f), railiance-cluster (53e78702),
llm-connect (cf758ed8), the-custodian (7a5d4e62), and
activity-core (28d11021). This task remains waiting on a deployed, healthy
activity-core runner plus three clean scheduled daily runs and calibration
feedback.
Rule Action Contract Documentation
id: ACTIVITY-WP-0006-T04
status: done
priority: medium
state_hub_task_id: "c9066d2e-0429-4e14-a68a-8418061ffd8d"
Document the rule action contract introduced by the ADHOC-2026-06-01 work:
whole-field context.* / event.* paths, scalar {context.foo} placeholders,
and explicit for_each / bind_as per-item expansion.
Also decide and document the naming/semantics mismatch around
action.task_template: today it is the emitted task title field, while
tasks/*.md contains template files with their own title templates.
Done when ADR-003 or a focused follow-up doc contains examples and unsafe cases, and cites the weekly SBOM staleness definition as the canonical pattern.
2026-06-04: Completed. Updated ADR-003 with whole-field path rendering,
scalar placeholder rendering, unsafe action cases, explicit for_each /
bind_as expansion, the task_template naming mismatch, and weekly SBOM
staleness as the canonical per-item pattern.
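A rough sketch of how the three rendering modes named in this task could fit together. It is not the ADR-003 implementation. The semantics below are assumptions: a field whose entire value is a dotted path keeps the referenced value's type, {…} placeholders accept scalars only, and the bind_as name is added to the scope per item. The helper names and the example field names other than for_each, bind_as, and task_template are hypothetical.

```python
import re

PATH = re.compile(r"[A-Za-z_]\w*(?:\.\w+)+")
PLACEHOLDER = re.compile(r"\{([A-Za-z_][\w.]*)\}")
SCALARS = (str, int, float, bool)


def lookup(path, scope):
    """Resolve a dotted path such as 'context.repo' against nested dicts."""
    value = scope
    for part in path.split("."):
        value = value[part]  # unknown paths raise KeyError instead of rendering ''
    return value


def render_field(value, scope):
    """Render one action field under the assumed contract."""
    if not isinstance(value, str):
        return value
    # Whole-field path: the entire value is a path, so the original type is kept.
    if PATH.fullmatch(value) and value.split(".", 1)[0] in scope:
        return lookup(value, scope)

    def substitute(match):
        resolved = lookup(match.group(1), scope)
        if not isinstance(resolved, SCALARS):
            # Unsafe case: interpolating a list or dict into a string.
            raise TypeError(f"placeholder {{{match.group(1)}}} is not a scalar")
        return str(resolved)

    return PLACEHOLDER.sub(substitute, value)


def expand_action(action, context, event):
    """Return one rendered action, or one per item when for_each is set."""
    scope = {"context": context, "event": event}
    fields = {k: v for k, v in action.items() if k not in ("for_each", "bind_as")}
    if "for_each" not in action:
        return [{k: render_field(v, scope) for k, v in fields.items()}]
    rendered = []
    for item in lookup(action["for_each"], scope):
        item_scope = {**scope, action["bind_as"]: item}
        rendered.append({k: render_field(v, item_scope) for k, v in fields.items()})
    return rendered
```

In this sketch, an action with for_each set to context.stale and bind_as set to item yields one rendered action per stale entry, and a task_template of "Refresh SBOM for {item.repo}" becomes that item's task title.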
Production Alerting And Failure Modes
id: ACTIVITY-WP-0006-T05
status: done
priority: medium
state_hub_task_id: "420ea629-0c20-4d09-9cc1-6b2f32665161"
Turn the current confidence in the daily triage schedule into routine operational visibility.
Cover:
- Kubernetes/Temporal worker health expectations
- schedule paused/missing detection
- report sink failure behavior
- LLM timeout and retry behavior
- what should page, what should only leave a progress note, and what should be handled in the next operator session
Done when the runbook and metrics/health surface make ordinary failures visible without inspecting a Codex Desktop session.
2026-06-04: Completed. docs/runbook.md now documents Kubernetes worker/API/
router health checks, Temporal schedule paused/missing checks, report sink
failure behavior, LLM timeout/retry behavior, and page/note/next-session
classification. Task emission sink failures now raise from emit_tasks, making
them visible to Temporal retries instead of warning-only logs.
2026-06-05: Added instruction-output robustness for report-sink instructions: after retry exhaustion, schema-invalid model output now produces a durable validation-failure report containing bounded partial output instead of a silent empty result. Report sinks include validation metadata in working-memory frontmatter and State Hub progress detail.
2026-06-06: Hardened instruction output parsing to accept a single Markdown JSON fence when the fenced content is valid JSON, while preserving the validation-failure artifact path for genuinely invalid output.
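The fence-tolerant parsing and the validation-failure fallback noted above can be sketched as follows. This is an illustration under assumptions, not the activity-core parser: the function names, the result dict shape, and the 2000-character bound are hypothetical.

```python
import json
import re

TICKS = "`" * 3  # built indirectly so the fence marker is not written literally
FENCE = re.compile(
    r"\s*" + TICKS + r"(?:json)?[ \t]*\n(.*?)\n?[ \t]*" + TICKS + r"\s*",
    re.DOTALL | re.IGNORECASE,
)


def parse_instruction_output(raw):
    """Parse model output as JSON, tolerating one surrounding Markdown fence.

    The whole output must be the fence. Text around it, or several fences,
    leaves non-JSON content behind and fails in json.loads.
    """
    match = FENCE.fullmatch(raw)
    candidate = match.group(1) if match else raw
    return json.loads(candidate)  # raises json.JSONDecodeError, a ValueError


def parse_or_failure(raw, max_partial_chars=2000):
    """Return parsed output, or a failure record with bounded partial output."""
    try:
        return {"status": "ok", "output": parse_instruction_output(raw)}
    except ValueError as exc:
        return {
            "status": "validation_failed",
            "error": str(exc),
            "partial_output": raw[:max_partial_chars],
        }
```

Keeping the failure record as a normal return value, rather than an exception, is what lets a report sink write it out as a durable note in this sketch.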
Issue-Core Emission Boundary Verification
id: ACTIVITY-WP-0006-T06
status: done
priority: medium
state_hub_task_id: "78089aef-aba1-42d7-a203-ef80ba6791d9"
Verify the downstream task emission boundary now that rule fan-out is real.
Questions to close:
- which issue-core endpoint is authoritative for task creation in the current environment
- whether IssueCoreRestSink should keep using REST or move to the intended NATS subscription path
- whether emitted rule tasks carry enough title, description, labels, source id, condition, and target repo data for issue-core and operators
- whether weekly SBOM staleness can be safely enabled against the real sink
Done when there is a tested or dry-run-verified path from a rule match to a downstream task reference, and activity-core still owns only the spawn audit trail, not task lifecycle state.
2026-06-04: Completed. Added docs/issue-core-emission-boundary.md documenting
REST /issues/ as the current authoritative endpoint, NATS as future work,
Railiance ISSUE_SINK_TYPE=null dry-run mode, and the fields sent to
issue-core versus retained in task_spawn_log. Added REST payload and sink
failure tests in tests/test_issue_sink.py; the existing weekly SBOM integration
test remains the dry-run rule-match-to-task-reference proof.
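A sketch of the ownership boundary described in this task, assuming a sink interface with a single emit method. A dry-run sink returns no downstream reference, sink exceptions propagate to the caller, and the only state kept locally is a spawn audit entry. EmittedTask, NullSink, emit_and_log, and the payload field names are hypothetical. The actual payload and task_spawn_log columns are documented in docs/issue-core-emission-boundary.md.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class EmittedTask:
    """Hypothetical shape of a task produced by a rule match."""

    title: str
    description: str
    labels: tuple
    source_id: str
    target_repo: str


class IssueSink(Protocol):
    def emit(self, payload: dict) -> Optional[str]:
        """Return a downstream task reference, or None in dry-run mode."""
        ...


class NullSink:
    """Dry-run sink: records payloads and never contacts issue-core."""

    def __init__(self):
        self.dry_run_payloads = []

    def emit(self, payload):
        self.dry_run_payloads.append(payload)
        return None


def emit_and_log(tasks, sink, spawn_log):
    """Send each task to the sink and append one audit entry per success."""
    for task in tasks:
        payload = asdict(task)
        payload["labels"] = list(task.labels)
        ref = sink.emit(payload)  # sink errors propagate so a retry can see them
        # Only the spawn reference is kept here; lifecycle state stays downstream.
        spawn_log.append({"source_id": task.source_id, "downstream_ref": ref})
    return spawn_log
```

Because a failing sink raises before the audit entry is appended, a retried emission in this sketch does not leave a spawn record without a matching attempt.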
Completion Criteria
- State Hub task-status canon adaptation is complete.
- Daily triage has an operator-grade verification path and three-run calibration evidence.
- Rule action semantics are documented and no longer surprising.
- Production failure modes are observable enough for routine operation.
- Downstream task emission has been verified without expanding activity-core's ownership boundary.