
tegwick 4e8ccbb344 Set up daily WSJF closure gates

2026-06-07 11:00:03 +02:00



id	type	title	domain	repo	status	owner	topic_slug	created	updated	state_hub_workstream_id
ACTIVITY-WP-0006	workplan	Post-triage operational hardening	custodian	activity-core	active	codex	custodian	2026-06-03	2026-06-07	5646e13a-13af-4724-bca6-3c0d86f96733

ACTIVITY-WP-0006 — Post-triage operational hardening

Context

activity-core has crossed the main construction threshold: Temporal-backed schedules, context resolution, deterministic rules, LLM instructions, report sinks, and the Railiance production service are implemented. The daily State Hub WSJF triage cutover is now trusted enough that activity-core can be treated as the standing scheduled substrate rather than an experiment.

The next work should keep that substrate dependable and aligned with INTENT.md: activity-core owns when coordination work runs, what task/report outputs are produced, and where they are emitted. It must not grow into the task lifecycle database, a project planner, or an execution worker.

Task Status Canon Adaptation

id: ACTIVITY-WP-0006-T01
status: done
priority: high
state_hub_task_id: "5d79e3da-d26d-4cad-9cdf-5e5264bb7019"

Adapt activity-core to State Hub's task status canon: wait, todo, progress, done, cancel.

Scope:

update AGENTS.md task-status examples and progression text
update State Hub context resolver task-status filters and digest counters
keep workstream/workplan lifecycle status separate; blocked remains valid for workstreams/workplans where State Hub still uses it
update tests that fixture or assert in_progress / task-level blocked
resolve the State Hub interface-change notice only after the repo is adapted

Done when the full test suite passes and activity-core no longer depends on legacy task-status aliases for State Hub API clients or tests.

2026-06-04: Completed. AGENTS.md now uses State Hub task statuses wait, todo, progress, done, and cancel; workplan/workstream lifecycle blocked remains separate. The State Hub daily triage digest now counts wait/todo/progress open tasks and no longer fixtures task-level in_progress or blocked. Full suite passed: 128 passed, 1 skipped.
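
The open-task count described in this task can be sketched roughly as follows. This is a minimal illustration, assuming tasks arrive as dictionaries with a `status` key; `count_open_tasks` and its error handling are hypothetical names and choices, not taken from the activity-core context resolver.

```python
from collections import Counter

# State Hub task status canon, as listed in this task.
TASK_STATUSES = ("wait", "todo", "progress", "done", "cancel")
# Statuses the daily triage digest counts as open.
OPEN_STATUSES = ("wait", "todo", "progress")


def count_open_tasks(tasks):
    """Return per-status counts of open tasks; reject non-canonical statuses."""
    counts = Counter({status: 0 for status in OPEN_STATUSES})
    for task in tasks:
        status = task["status"]
        if status not in TASK_STATUSES:
            # Legacy aliases such as in_progress or task-level blocked land here.
            raise ValueError(f"non-canonical task status: {status!r}")
        if status in OPEN_STATUSES:
            counts[status] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [{"status": "todo"}, {"status": "progress"}, {"status": "done"}]
    print(count_open_tasks(sample))
```

Rejecting unknown statuses outright, rather than silently mapping them, is one way to make a lingering dependency on legacy aliases fail loudly in tests.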

Daily Triage Observability Runbook

id: ACTIVITY-WP-0006-T02
status: done
priority: high
state_hub_task_id: "02c34443-0e8d-4f1a-93d9-6c39f07faad7"

Document and, where cheap, automate how to answer "did today's daily triage run happen?"

The operator should be able to check:

Temporal schedule state and latest workflow history
activity_runs row for the daily triage ActivityDefinition
State Hub daily_triage progress event
working-memory report note
expected missed-run behavior (skip, not catch-up)
the configured LLM and Temporal timeout relationship

Done when docs/runbook.md has a concise daily-triage verification section and any helper command/script is covered by tests or a dry-run path.

2026-06-04: Completed. Added scripts/verify_daily_triage.py with dry-run and live modes, plus tests/test_daily_triage_verifier.py. docs/runbook.md now covers Temporal schedule/workflow checks, activity_runs, State Hub progress, working-memory notes, missed-run skip behavior, and LLM timeout budget.
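
As an illustration of the dry-run and live modes, a verification checklist could be structured along these lines. This is a sketch under stated assumptions and does not reproduce scripts/verify_daily_triage.py: the `Check` type, the check names, and the stub probes are all hypothetical, and real probes would query Temporal, the activity_runs table, State Hub, and the working-memory directory.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    name: str
    probe: Callable[[], Tuple[bool, str]]  # returns (ok, detail)


def run_checks(checks: List[Check], dry_run: bool = False):
    """Run each check; in dry-run mode only list what would be probed."""
    results = []
    for check in checks:
        if dry_run:
            results.append((check.name, True, "dry-run: not probed"))
            continue
        try:
            ok, detail = check.probe()
        except Exception as exc:  # a broken probe counts as a failed check
            ok, detail = False, f"probe error: {exc}"
        results.append((check.name, ok, detail))
    return results


def exit_code(results) -> int:
    """Non-zero when any check failed, so shell callers can detect it."""
    return 0 if all(ok for _, ok, _ in results) else 1


if __name__ == "__main__":
    # Stub probes standing in for the real lookups (hypothetical names).
    checks = [
        Check("temporal_schedule", lambda: (True, "stub")),
        Check("activity_runs_row", lambda: (True, "stub")),
        Check("state_hub_progress_event", lambda: (False, "no event today")),
        Check("working_memory_note", lambda: (True, "stub")),
    ]
    results = run_checks(checks)
    for name, ok, detail in results:
        print(f"{'OK  ' if ok else 'FAIL'} {name}: {detail}")
    raise SystemExit(exit_code(results))
```

Treating a probe exception as a failed check keeps one unreachable dependency from hiding the state of the others.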

2026-06-05: Follow-up hardening after the scheduled WSJF triage ran but emitted no report because the live schema required wsjf fields and the stale DB prompt did not request them. The verifier default and runbook now point at the live working-memory sink path, /home/worsch/the-custodian/memory/working.

2026-06-06: Added a schedule smoke-test routine for new or changed recurring ActivityDefinitions. Operators can recreate the recurring Temporal Schedule, schedule a one-shot smoke run one minute in the future, wait for completion, and get a non-zero warning if workflow imports, activity registration, or runtime wiring are broken.

2026-06-06: Exercised the routine against the daily triage definition. The daily recurring Temporal Schedule was deleted and recreated, then a one-shot smoke workflow completed with run id c2db32e5-3874-522f-ae1f-9b2cdf307fd2 and emitted a validated daily_triage report plus working-memory note.

Three-Run Calibration Feedback

id: ACTIVITY-WP-0006-T03
status: wait
priority: medium
state_hub_task_id: "7cbf0a35-71a1-47ac-afc2-f51ad2180fd0"

Collect three consecutive scheduled activity-core daily triage runs and feed the result back into the Custodian WSJF calibration loop.

Assess:

whether the top recommendations matched actual useful follow-up work
report length and density
loose-end detection sensitivity
stale-but-intentionally-parked work handling
whether model settings or prompt/schema constraints need adjustment

Done when the calibration result is recorded in State Hub and the related CUST-WP-0044 / CUST-WP-0045 tasks can close based on activity-core runs, not Codex app fallback runs.

2026-06-04: Waiting on real evidence. The repo now has a verification path for scheduled daily triage runs, but this task still requires three consecutive actual activity-core scheduled runs and State Hub calibration feedback. Local tests cannot substitute for that operational evidence.

2026-06-06: The scheduled run fired at 07:20 Europe/Berlin but initially got stuck on a stale worker import error after ops-evidence wiring landed. Restarting the worker let Temporal complete the run, and the hardened report path emitted a validation-failure note instead of losing the evidence. This run is useful calibration input, but it is not a clean consecutive scheduled success.

2026-06-07: Investigated the missing June 7 WSJF result. State Hub had no daily_triage event for the date, no local activity-core DB/Temporal/API ports were reachable, and the current Railiance Kubernetes context had no activity-core namespace. The Railiance runtime projection also lacked daily-statehub-wsjf-triage.md, and the node-local State Hub bridge target 127.0.0.1:18000 returned a connection reset. Patched activity-core to project the daily definition, mount the schema and working-memory storage, expose LLM_CONNECT_URL, include working_memory_path in State Hub progress detail, and emit a visible execution_failed report for report-sink instructions when llm-connect is missing or broken. Cross-repo closure tasks were posted via State Hub to state-hub (dc10704f), railiance-cluster (53e78702), llm-connect (cf758ed8), the-custodian (7a5d4e62), and activity-core (28d11021). This task remains waiting on a deployed, healthy activity-core runner plus three clean scheduled daily runs and calibration feedback.

Rule Action Contract Documentation

id: ACTIVITY-WP-0006-T04
status: done
priority: medium
state_hub_task_id: "c9066d2e-0429-4e14-a68a-8418061ffd8d"

Document the rule action contract introduced by the ADHOC-2026-06-01 work: whole-field context.* / event.* paths, scalar {context.foo} placeholders, and explicit for_each / bind_as per-item expansion.

Also decide and document the naming/semantics mismatch around action.task_template: today it is the emitted task title field, while tasks/*.md contains template files with their own title templates.

Done when ADR-003 or a focused follow-up doc contains examples, unsafe cases, and the weekly SBOM staleness definition is cited as the canonical pattern.

2026-06-04: Completed. Updated ADR-003 with whole-field path rendering, scalar placeholder rendering, unsafe action cases, explicit for_each / bind_as expansion, the task_template naming mismatch, and weekly SBOM staleness as the canonical per-item pattern.
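
The three rendering modes named in this task can be illustrated with a small sketch. It is not the ADR-003 implementation: the function names, the choice to expose the bound item as a top-level name equal to `bind_as`, the rule that a non-scalar placeholder is an error, and the sample data are all assumptions made for illustration.

```python
import re

PLACEHOLDER = re.compile(r"\{([A-Za-z_][\w.]*)\}")
WHOLE_PATH = re.compile(r"^(?:context|event)\.[\w.]+$")


def resolve_path(path, scope):
    """Follow a dotted path through nested dictionaries."""
    value = scope
    for part in path.split("."):
        value = value[part]  # KeyError on a missing key is deliberate
    return value


def render_field(value, scope):
    """Render one action field against the given scope."""
    if not isinstance(value, str):
        return value
    if WHOLE_PATH.match(value):
        # Whole-field path: return the referenced value with its type intact.
        return resolve_path(value, scope)

    def substitute(match):
        resolved = resolve_path(match.group(1), scope)
        if isinstance(resolved, (dict, list)):
            # Assumed unsafe case: interpolating a collection into a string.
            raise TypeError(f"placeholder {match.group(0)} is not scalar")
        return str(resolved)

    return PLACEHOLDER.sub(substitute, value)


def expand_action(action, scope):
    """Render an action once, or once per item when for_each is present."""
    fields = {k: v for k, v in action.items() if k not in ("for_each", "bind_as")}
    if "for_each" not in action:
        return [{k: render_field(v, scope) for k, v in fields.items()}]
    rendered = []
    for item in resolve_path(action["for_each"], scope):
        item_scope = {**scope, action["bind_as"]: item}
        rendered.append({k: render_field(v, item_scope) for k, v in fields.items()})
    return rendered


if __name__ == "__main__":
    scope = {"context": {"owner": "codex", "repos": [{"name": "a"}, {"name": "b"}]}, "event": {}}
    action = {
        "for_each": "context.repos",
        "bind_as": "repo",
        "task_template": "Refresh SBOM for {repo.name}",  # emitted task title
        "owner": "context.owner",
    }
    for task in expand_action(action, scope):
        print(task)
```

Keeping fan-out explicit through `for_each` / `bind_as` means a rule never expands into multiple tasks merely because a placeholder happened to resolve to a list.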

Production Alerting And Failure Modes

id: ACTIVITY-WP-0006-T05
status: done
priority: medium
state_hub_task_id: "420ea629-0c20-4d09-9cc1-6b2f32665161"

Turn the current confidence in the daily triage schedule into routine operational visibility.

Cover:

Kubernetes/Temporal worker health expectations
schedule paused/missing detection
report sink failure behavior
LLM timeout and retry behavior
what should page, what should only leave a progress note, and what should be handled in the next operator session

Done when the runbook and metrics/health surface make ordinary failures visible without inspecting a Codex Desktop session.

2026-06-04: Completed. docs/runbook.md now documents Kubernetes worker/API/router health checks, Temporal schedule paused/missing checks, report sink failure behavior, LLM timeout/retry behavior, and page/note/next-session classification. Task emission sink failures now raise from emit_tasks, making them visible to Temporal retries instead of warning-only logs.

2026-06-05: Added instruction-output robustness for report-sink instructions: after retry exhaustion, schema-invalid model output now produces a durable validation-failure report containing bounded partial output instead of a silent empty result. Report sinks include validation metadata in working-memory frontmatter and State Hub progress detail.
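
The retry-then-report behavior can be sketched as below. This is a minimal illustration, not the activity-core code: `run_with_validation`, the result dictionary shape, the `MAX_PARTIAL_CHARS` bound, and the convention that `validate` raises `ValueError` are all assumptions.

```python
import json

MAX_PARTIAL_CHARS = 2000  # hypothetical bound on retained partial output


def run_with_validation(call_model, validate, max_attempts=3):
    """Retry until the output validates; otherwise return a failure report."""
    last_raw, last_error = "", "no attempts made"
    for _ in range(max_attempts):
        last_raw = call_model()
        try:
            payload = json.loads(last_raw)  # JSONDecodeError is a ValueError
            validate(payload)               # assumed to raise ValueError
        except ValueError as exc:
            last_error = str(exc)
            continue
        return {"status": "ok", "report": payload}
    # Retries exhausted: keep bounded evidence instead of an empty result.
    return {
        "status": "validation_failed",
        "error": last_error,
        "attempts": max_attempts,
        "partial_output": last_raw[:MAX_PARTIAL_CHARS],
    }


def require_wsjf(payload):
    """Toy schema check standing in for the real report schema."""
    if "wsjf" not in payload:
        raise ValueError("missing required field: wsjf")


if __name__ == "__main__":
    result = run_with_validation(lambda: '{"summary": "no scores"}', require_wsjf)
    print(result["status"], result["error"])
```

Returning a structured failure record gives the report sink something durable to write, so an operator sees why the run produced no usable report.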

2026-06-06: Hardened instruction output parsing to accept a single Markdown JSON fence when the fenced content is valid JSON, while preserving the validation-failure artifact path for genuinely invalid output.
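
Fence-tolerant parsing of this kind might look like the following. It is a sketch only, assuming the model output is a string; `parse_model_json` is a hypothetical name, and the real parser may differ in which fence variants it accepts.

```python
import json
import re

FENCE_MARK = "`" * 3  # built programmatically to avoid a literal fence here
FENCE = re.compile(
    r"^" + FENCE_MARK + r"(?:json)?[ \t]*\n(.*?)\n" + FENCE_MARK + r"$",
    re.DOTALL,
)


def parse_model_json(raw):
    """Parse JSON that is either bare or wrapped in one Markdown fence.

    Raises json.JSONDecodeError for anything else, so the caller can fall
    back to its validation-failure path.
    """
    text = raw.strip()
    match = FENCE.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)


if __name__ == "__main__":
    fenced = FENCE_MARK + 'json\n{"wsjf": 5}\n' + FENCE_MARK
    print(parse_model_json(fenced))
```

Because the pattern is anchored to the whole stripped string, prose around a fence or multiple fences still fails JSON parsing rather than being guessed at.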

Issue-Core Emission Boundary Verification

id: ACTIVITY-WP-0006-T06
status: done
priority: medium
state_hub_task_id: "78089aef-aba1-42d7-a203-ef80ba6791d9"

Verify the downstream task emission boundary now that rule fan-out is real.

Questions to close:

which issue-core endpoint is authoritative for task creation in the current environment
whether IssueCoreRestSink should keep using REST or move to the intended NATS subscription path
whether emitted rule tasks carry enough title, description, labels, source id, condition, and target repo data for issue-core and operators
whether weekly SBOM staleness can be safely enabled against the real sink

Done when there is a tested or dry-run-verified path from a rule match to a downstream task reference, and activity-core still owns only the spawn audit trail, not task lifecycle state.

2026-06-04: Completed. Added docs/issue-core-emission-boundary.md documenting REST /issues/ as the current authoritative endpoint, NATS as future work, Railiance ISSUE_SINK_TYPE=null dry-run mode, and the fields sent to issue-core versus retained in task_spawn_log. Added REST payload and sink failure tests in tests/test_issue_sink.py; the existing weekly SBOM integration test remains the dry-run rule-match-to-task-reference proof.
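
To illustrate the dry-run mode, a null sink could be selected and exercised roughly like this. This is a hypothetical sketch: `RuleTask`, `NullIssueSink`, `select_sink`, the `dry-run:` reference format, and the default of `null` when the variable is unset are assumptions. Only the `ISSUE_SINK_TYPE=null` setting and the field names come from this workplan, and the actual payload split is defined in docs/issue-core-emission-boundary.md.

```python
import os
import uuid
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RuleTask:
    title: str
    description: str
    labels: List[str]
    source_id: str
    condition: str
    target_repo: str


class NullIssueSink:
    """Dry-run sink: records payloads locally and never calls issue-core."""

    def __init__(self):
        self.recorded = []

    def emit(self, task: RuleTask) -> str:
        self.recorded.append(asdict(task))
        # Synthetic reference so the spawn audit trail still has something to log.
        return f"dry-run:{uuid.uuid4()}"


def select_sink(env=None):
    """Pick a sink from ISSUE_SINK_TYPE; only the null sink is sketched here."""
    env = os.environ if env is None else env
    sink_type = env.get("ISSUE_SINK_TYPE", "null")  # default is an assumption
    if sink_type == "null":
        return NullIssueSink()
    raise NotImplementedError(f"sink type {sink_type!r} is not part of this sketch")


if __name__ == "__main__":
    sink = select_sink({"ISSUE_SINK_TYPE": "null"})
    ref = sink.emit(RuleTask(
        title="Refresh SBOM for example-repo",
        description="SBOM is older than the staleness threshold.",
        labels=["sbom"],
        source_id="weekly-sbom-staleness",
        condition="sbom_stale",
        target_repo="example-repo",
    ))
    print(ref, len(sink.recorded))
```

The sink returns only a reference, which matches the ownership boundary: the caller logs the spawn, while task lifecycle state stays downstream.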

Completion Criteria

State Hub task-status canon adaptation is complete.
Daily triage has an operator-grade verification path and three-run calibration evidence.
Rule action semantics are documented and no longer surprising.
Production failure modes are observable enough for routine operation.
Downstream task emission has been verified without expanding activity-core's ownership boundary.
