tegwick 2e50588837 Record daily triage activity-core canary blocker

2026-05-19 20:14:19 +02:00


id: CUST-WP-0045
type: workplan
title: Activity-Core Daily Triage Runner Cutover
domain: custodian
repo: the-custodian
status: blocked
owner: custodian
topic_slug: custodian
planning_priority: high
planning_order: 45
created: 2026-05-19
updated: 2026-05-19
state_hub_workstream_id: d9d9a3ec-f736-4041-beac-bb92c7ad314e

CUST-WP-0045 - Activity-Core Daily Triage Runner Cutover

Goal

Move the Daily State Hub WSJF Triage runner from the Codex app automation substrate to owned activity-core infrastructure.

The outcome should be a reliable daily run at 07:20 Europe/Berlin that produces the same review artifact promised by CUST-WP-0044: a dated working-memory note, a State Hub daily_triage progress event, and an auditable activity-core run record.

Context

On 2026-05-19 the Codex app automation fired at the scheduled time, but did not complete a useful run:

two Daily State Hub WSJF Triage sessions were created at 07:20 Europe/Berlin
both session files contained only session metadata
no prompt execution, report, tool call, working-memory note, or final answer was recorded
State Hub had no daily_triage progress event for that date
the recorded session cwd values used Windows-style C:\home\worsch\... paths rather than the intended WSL paths

This shows the schedule is present but the launch substrate is not trustworthy enough for an unattended Custodian operating habit.

activity-core already provides the pieces that should own this class of work:

Temporal cron schedules with timezone and misfire-policy handling
ActivityDefinition markdown ingestion via ACTIVITY_DEFINITION_DIRS
state-hub context resolver hooks
ActivityRun logging and Temporal workflow history
rule/instruction model design in ACT-ADR-003
deployment/runbook paths for the Railiance environment

The missing work is to connect those existing capabilities to this judgement report use case without building a second scheduler or a parallel priority database.

Scope

In scope:

Extend activity-core so the existing daily triage ActivityDefinition can run as the primary runner.
Reuse the existing prompt at runtime/prompts/daily_statehub_wsjf_triage.md.
Reuse the existing ActivityDefinition at activity-definitions/daily-statehub-wsjf-triage.md.
Extend activity-core's State Hub context resolver for the queries this report already needs.
Add or finish the instruction/report execution path described by activity-core ADR-003.
Write the report to Custodian working memory and log event_type: daily_triage in State Hub.
Disable the Codex app automation after activity-core is validated, so there is only one daily runner.

Out of scope:

Rewriting the WSJF rubric or report template; that belongs to CUST-WP-0044.
Creating a new scheduler, cron daemon, or separate automation database.
Automatically changing workplan status, priority, canon, secrets, deployment, or external commitments from the daily report.
Retiring the workstation fallback or deploying HA activity-core before the relevant Railiance deployment work is approved.

Runner Decision

Primary target runner: activity-core Temporal schedule.

Temporary fallback runner: Codex app automation, only until activity-core has completed a manual run and at least one scheduled canary run.

Cutover rule: do not enable both runners at the same time. The handoff is:

Activity-core definition remains disabled while the Codex automation is the only runner.
Activity-core is validated with a manual trigger using the same definition.
Codex automation is paused.
Activity-core definition is enabled and schedules are synced.
The next scheduled run is checked for a working-memory note, State Hub progress event, and ActivityRun row.

Tasks

T01 - Capture Failure Evidence And Runner Boundary

id: CUST-WP-0045-T01
status: done
priority: high
state_hub_task_id: "01f57ed4-0473-42bf-b61c-0491f7ac7e2c"

Record the 2026-05-19 failed automation evidence in the implementation notes for this workplan and, if useful, in the CUST-WP-0044 calibration notes.

Confirm the desired runner boundary:

activity-core owns schedule, retries, run log, and context resolution
State Hub remains the read model and progress sink
the-custodian owns the prompt, report template, and governance guardrails
Codex app automation is a temporary fallback only

Done when the failure mode and cutover target are explicit enough that future agents do not try to fix this by adding another local cron path.

T02 - Extend Activity-Core State Hub Context Resolver

id: CUST-WP-0045-T02
status: done
priority: high
depends_on: [CUST-WP-0045-T01]
state_hub_task_id: "c4303b24-6f6b-445e-8e2e-94441589a7f2"

Extend activity-core's existing state-hub context resolver instead of adding bespoke HTTP fetch logic to the Custodian repo.

Required queries:

state_summary -> GET /state/summary
next_steps -> GET /state/next_steps
workplan_index -> GET /workstreams/workplan-index
hub_inbox -> GET /messages/?to_agent=hub&unread_only=true

The resolver should keep the existing STATE_HUB_URL configuration pattern, use bounded timeouts, and return {} on resolver failure so the workflow can still fall back to the offline brief/prompt contract.

Done when activity-core tests cover all four new query names and the existing domain_summary and repo_sbom_status behavior remains intact.
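
The resolver contract above (named queries, bounded timeouts, {} on failure) can be sketched as follows. This is a minimal illustration, not activity-core's actual resolver: the function names, the 5-second default timeout, and the injectable fetch parameter are assumptions made for the example. Only the four query names, their paths, and the STATE_HUB_URL variable come from this workplan.

```python
import json
import os
import urllib.request
from typing import Any, Callable, Optional

# The four query names and paths listed in T02.
QUERY_PATHS = {
    "state_summary": "/state/summary",
    "next_steps": "/state/next_steps",
    "workplan_index": "/workstreams/workplan-index",
    "hub_inbox": "/messages/?to_agent=hub&unread_only=true",
}


def http_get_json(url: str, timeout: float) -> Any:
    """Fetch a URL and decode its JSON body; raises on transport or parse errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def resolve_query(
    name: str,
    base_url: Optional[str] = None,
    timeout: float = 5.0,  # hypothetical bound; the real value is a deployment choice
    fetch: Callable[[str, float], Any] = http_get_json,  # injectable for tests (assumption)
) -> Any:
    """Return the decoded response for a named query, or {} on any failure."""
    try:
        base = base_url or os.environ["STATE_HUB_URL"]
        return fetch(base.rstrip("/") + QUERY_PATHS[name], timeout)
    except Exception:
        # A resolver failure must not fail the run; the workflow falls back
        # to the offline brief/prompt contract instead.
        return {}
```

Injecting the fetch function keeps the failure path testable without a running State Hub. An unknown query name also degrades to {} rather than raising.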

T03 - Implement Instruction Report Execution

id: CUST-WP-0045-T03
status: done
priority: high
depends_on: [CUST-WP-0045-T02]
state_hub_task_id: "e766ff2e-1887-49e6-9c66-598bb395e76c"

Finish the activity-core instruction/report execution path needed for judgement runs like daily triage.

Reuse the existing rule/instruction model from ACT-ADR-003:

parse a fenced instruction block from the ActivityDefinition
apply any instruction condition before running the report
render the canonical prompt with explicit trusted context fields
call the approved model/agent adapter through the existing org LLM path where available
validate the output against a small daily-triage report schema
record model, prompt hash, validation result, and source instruction id in the activity-core audit trail

This task should not introduce another scheduler or a one-off daily-triage script. The deliverable is a reusable instruction execution capability that this report can use and future judgement activities can share.

Done when activity-core can run a synthetic instruction ActivityDefinition and produce a validated report payload under test.
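
As a rough sketch of the last two steps in the list above (validate the output, then record model, prompt hash, validation result, and source instruction id), the following shows one possible shape for the audit record. The field names in REQUIRED_FIELDS, the AuditRecord class, and the audit function are hypothetical. The real daily-triage report schema and audit trail live in activity-core and may differ.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical required fields; the real report schema is defined in activity-core.
REQUIRED_FIELDS = {"date": str, "summary": str, "recommendations": list}


@dataclass(frozen=True)
class AuditRecord:
    instruction_id: str
    model: str
    prompt_hash: str
    valid: bool
    errors: tuple


def validate_report(payload: dict) -> list:
    """Return a list of human-readable validation errors (empty when valid)."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return errors


def audit(instruction_id: str, model: str, rendered_prompt: str, payload: dict) -> AuditRecord:
    """Hash the rendered prompt and capture the validation outcome for the audit trail."""
    prompt_hash = hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()
    errors = validate_report(payload)
    return AuditRecord(instruction_id, model, prompt_hash, not errors, tuple(errors))
```

Hashing the rendered prompt rather than the template means two runs with different context produce different hashes, which is one reasonable reading of "prompt hash". Hashing the template instead would be an equally valid design.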

T04 - Add Working-Memory And State Hub Progress Sinks

id: CUST-WP-0045-T04
status: done
priority: high
depends_on: [CUST-WP-0045-T03]
state_hub_task_id: "04e56428-d3a8-4aa7-a6e1-172c974ece3a"

Add deterministic output sinks for report instructions.

For this activity, the sink must:

write one dated note under /home/worsch/the-custodian/memory/working/
post one State Hub progress event with event_type: daily_triage
include the activity id, run id, scheduled time, and report summary
be idempotent by activity-core run id and local date
refuse to edit canon/, workplans/, or other canonical files

Done when a manual activity-core trigger creates exactly one working-memory note and one State Hub progress event, and a retry does not duplicate either.
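
The idempotency and path-refusal requirements above could look roughly like this for the working-memory sink. It is a sketch under stated assumptions: the function names are hypothetical, and the file name pattern (date plus the first eight characters of the run id) is inferred from the canary note name recorded in the implementation notes, not from activity-core's sink code.

```python
from pathlib import Path

# Top-level directories the sink must never write into.
FORBIDDEN_DIRS = {"canon", "workplans"}


def safe_target(repo_root: Path, relative: str) -> Path:
    """Resolve a repo-relative path, refusing escapes and canonical directories."""
    root = repo_root.resolve()
    target = (root / relative).resolve()
    parts = target.relative_to(root).parts  # ValueError if the path escapes the repo
    if not parts or parts[0] in FORBIDDEN_DIRS:
        raise PermissionError(f"refusing to write under canonical path: {relative}")
    return target


def write_note_once(repo_root: Path, local_date: str, run_id: str, body: str,
                    note_dir: str = "memory/working") -> bool:
    """Write the dated note once; return False if a retry finds it already present."""
    name = f"daily-triage-{local_date}-{run_id[:8]}.md"  # assumed naming pattern
    target = safe_target(repo_root, f"{note_dir}/{name}")
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" creates the file exclusively, so a retry cannot overwrite it.
        with target.open("x", encoding="utf-8") as handle:
            handle.write(body)
    except FileExistsError:
        return False
    return True
```

Exclusive-create mode makes the file system itself the idempotency check, so a workflow retry with the same run id and date is a no-op rather than a duplicate.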

T05 - Update And Validate The Daily Triage ActivityDefinition

id: CUST-WP-0045-T05
status: done
priority: medium
depends_on: [CUST-WP-0045-T02, CUST-WP-0045-T03, CUST-WP-0045-T04]
state_hub_task_id: "0c6d54ec-7ed1-4e80-9cfa-ccb914e65fbf"

Update activity-definitions/daily-statehub-wsjf-triage.md so it is executable by activity-core.

Expected changes:

keep the trigger at 20 7 * * *, timezone Europe/Berlin
keep misfire_policy: skip
add the report instruction block that references the canonical prompt
keep enabled: false until manual validation passes
document the single-runner cutover rule in the file

Validate using activity-core's existing parser and sync commands with ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian.

Done when the definition parses, syncs into activity-core, and appears as a paused Temporal schedule while disabled.
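
To illustrate why the trigger pins the timezone instead of using a fixed UTC time, the sketch below computes the next 07:20 Europe/Berlin fire time for a given instant. It is only an illustration of the cron expression's meaning. Temporal performs the actual scheduling, and the next_run helper is hypothetical.

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")
RUN_AT = time(7, 20)  # wall-clock equivalent of the cron expression "20 7 * * *"


def next_run(now: datetime) -> datetime:
    """Return the next 07:20 Europe/Berlin fire time strictly after `now` (timezone-aware)."""
    local_now = now.astimezone(BERLIN)
    candidate = datetime.combine(local_now.date(), RUN_AT, tzinfo=BERLIN)
    if candidate <= local_now:
        # Today's slot has passed; the next fire time is tomorrow at the same wall-clock time.
        candidate = datetime.combine(local_now.date() + timedelta(days=1), RUN_AT, tzinfo=BERLIN)
    return candidate
```

In UTC terms the run lands at 05:20 during summer time and 06:20 during winter time, which is why a UTC-only cron would drift by an hour twice a year.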

T06 - Canary Cutover And Disable Codex Automation

id: CUST-WP-0045-T06
status: blocked
priority: high
depends_on: [CUST-WP-0045-T05]
state_hub_task_id: "545162d7-0198-4519-a30b-06e88c6db915"
blocking_reason: "Needs an approved non-external LLM path for private State Hub digest data, or explicit operator approval for the external llm-connect backend."
needs_human: true
intervention_note: "Real cutover needs an approved non-external LLM path for private State Hub digest data, or explicit human approval for the external llm-connect backend after review."

Run the cutover safely.

Sequence:

Manually trigger the activity-core definition and verify output.
Pause or delete the Codex app automation daily-state-hub-wsjf-triage.
Set the activity-core definition to enabled: true.
Sync activity definitions and Temporal schedules.
Confirm the Temporal schedule is unpaused and points at RunActivityWorkflow.
Check the next 07:20 run for a working-memory note, State Hub progress event, ActivityRun row, and Temporal workflow history.

Done when activity-core is the only enabled runner and the first scheduled run has completed successfully.

T07 - Observability And Missed-Run Handling

id: CUST-WP-0045-T07
status: todo
priority: medium
depends_on: [CUST-WP-0045-T06]
state_hub_task_id: "b977c721-cadc-461f-8ffb-715d438e4c31"

Document and, where cheap, automate how to tell whether the daily run happened.

The runbook should include:

Temporal schedule and workflow checks
activity-core ActivityRun query
State Hub daily_triage progress-event query
working-memory note path check
expected behavior when the activity-core host is offline at 07:20
the chosen missed-run behavior: skip, not catch-up

Done when the operator can answer "did it run today?" from owned telemetry without inspecting Codex Desktop session internals.
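
One way to frame the "did it run today?" check is to combine the three evidence surfaces into a single verdict. The sketch below is hypothetical: RunEvidence and note_exists are illustrative names, the note name pattern is inferred from the canary note recorded in the implementation notes, and the State Hub and ActivityRun lookups are left as booleans because their query interfaces are not specified here.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RunEvidence:
    note_found: bool            # working-memory note exists for the date
    progress_event_found: bool  # State Hub daily_triage progress event exists
    activity_run_found: bool    # activity-core ActivityRun row exists

    def verdict(self) -> str:
        found = [self.note_found, self.progress_event_found, self.activity_run_found]
        if all(found):
            return "ran"
        if not any(found):
            # Expected when the host was offline at 07:20: the run is skipped, not caught up.
            return "missed"
        return "partial"


def note_exists(working_dir: Path, local_date: str) -> bool:
    """Check for a dated daily-triage note (assumed pattern: daily-triage-<date>-<run>.md)."""
    return any(working_dir.glob(f"daily-triage-{local_date}-*.md"))
```

A "partial" verdict is the case worth alerting on, since it suggests the run started but one sink failed.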

T08 - Three Daily Runs And CUST-WP-0044 Calibration

id: CUST-WP-0045-T08
status: todo
priority: medium
depends_on: [CUST-WP-0045-T06, CUST-WP-0045-T07]
state_hub_task_id: "f4a985fd-8cce-4175-983e-cf3b437e19a5"

Run three consecutive daily canaries from activity-core and compare the recommendations with actual follow-up work.

Feed the result back into CUST-WP-0044-T06:

calibrate WSJF scoring weights
tune report length
adjust loose-end detection thresholds
confirm stale-but-intentionally-parked work is treated correctly
decide whether daily notes are useful enough as a standing habit

Done when CUST-WP-0044 can close its calibration task using activity-core runs, not Codex app automation runs.

Implementation Notes - 2026-05-19

T01 is complete. The 2026-05-19 failed Codex automation run is captured in this workplan's context, and the runner boundary is explicit: activity-core owns the schedule, retries, context resolution, run log, and audit trail; State Hub stays the read model and progress sink; the-custodian owns the prompt and guardrails.

T02 is complete in activity-core. The existing state-hub context resolver now supports the daily triage queries state_summary, next_steps, workplan_index, and hub_inbox while preserving domain_summary and repo_sbom_status. Resolver failures return {} so the workflow can degrade to offline context instead of failing the whole run.

T03 is complete in activity-core. RunActivityWorkflow now evaluates instruction blocks after rules, using the existing instruction executor and a small llm-connect HTTP client boundary. Instruction results carry task specs, optional report payloads, prompt hash, model, validation status, review flag, and condition metadata. A lightweight daily triage report schema is available at schemas/daily-triage-report.json so report payloads can be validated under test before T04 wires the deterministic working-memory and State Hub sinks.

T04 is complete in activity-core. Instruction definitions can now declare report_sinks; report payloads are persisted through deterministic sink code instead of model-authored file operations. The first two sink types are working-memory and state-hub-progress. Working-memory writes refuse canonical Custodian canon/ and workplans/ paths and use run-id/date-based idempotency. State Hub progress posting deduplicates by activity run id and instruction id before posting.

T05 is complete. The daily triage ActivityDefinition now uses a single trusted scalar context.daily_triage_digest instead of raw State Hub JSON. The digest is built in activity-core from safe identifiers, counts, statuses, priority fields, health labels, and shortened titles, while excluding task descriptions, message bodies, and other free-text command surfaces. The digest also carries a deterministic_scoring extension marker so a later high-criticality path can move especially high-gain/high-effort candidate scoring into code without changing the ActivityDefinition contract.
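
The allowlist idea behind the digest can be sketched as below. The field names, the title length limit, and the function names are all hypothetical. The actual digest builder in activity-core decides which identifiers, counts, and labels are included.

```python
SAFE_FIELDS = ("id", "status", "priority", "health")  # hypothetical allowlist
MAX_TITLE = 60  # hypothetical title length limit


def shorten(title: str, limit: int = MAX_TITLE) -> str:
    """Collapse whitespace and truncate so titles cannot carry long free text."""
    title = " ".join(title.split())
    return title if len(title) <= limit else title[: limit - 3] + "..."


def digest_item(item: dict) -> dict:
    """Copy only allowlisted fields; descriptions and message bodies are never copied."""
    out = {key: item[key] for key in SAFE_FIELDS if key in item}
    if "title" in item:
        out["title"] = shorten(str(item["title"]))
    return out


def build_digest(items: list) -> dict:
    """Build a compact digest from raw State Hub records."""
    return {"count": len(items), "items": [digest_item(i) for i in items]}
```

Using an allowlist rather than a blocklist means a new free-text field added to State Hub later is excluded by default.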

T06 is partially validated but blocked before cutover. A local activity-core dev stack was started, the Custodian ActivityDefinition directory synced into activity-core, and the paused Temporal schedule for the disabled daily triage definition was created. The first sync exposed reusable activity-core gaps that were fixed there instead of bypassed here:

file-authored ActivityDefinition slug ids now map to stable UUIDv5 DB ids
schedule sync no longer uses raw NOT IN :ids SQL that asyncpg rejects
ADR-style context sources without an explicit name validate against the domain model
the worker now registers the existing instruction/report activities
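
For the first fix in the list above, a stable slug-to-UUIDv5 mapping can be sketched with the standard library. The namespace value and the definition_db_id name are assumptions for illustration. The namespace activity-core actually uses is not recorded in this workplan.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
ACTIVITY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/activity-definitions")


def definition_db_id(slug: str) -> uuid.UUID:
    """Derive a deterministic DB id from a file-authored ActivityDefinition slug."""
    return uuid.uuid5(ACTIVITY_NAMESPACE, slug)
```

Because UUIDv5 is a hash of namespace plus name, re-syncing the same definition file always yields the same database id, so sync can upsert instead of creating duplicates.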

Manual trigger canary evidence, using a local-only llm-connect mock response so no State Hub digest data left the workstation:

workflow id: activity-6fca51fa-387a-4fd0-bc4e-d62c29eb859a:manual-6a6e5950-2338-45c4-9054-573dda9c87cc
Temporal status: COMPLETED
activity-core run id: 2164cb88-8415-5c96-9e31-e47a41cf4e67
working-memory note: memory/working/daily-triage-2026-05-19-2164cb88.md
State Hub progress event: e42c0ada-8111-4d88-9791-821252cd04a2

The real Claude-backed llm-connect trigger was not run. The execution wrapper blocked it because private State Hub workstream/task digest data would be sent to an external LLM provider. Therefore the Codex app automation remains the only enabled runner, the ActivityDefinition remains enabled: false, and T06 is blocked until there is either an approved local/private LLM backend or an explicit operator decision to allow that external data flow.

Verification:

uv run pytest tests/test_state_hub_context_resolver.py -q: 6 passed
activity-core parser validation with ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian: parsed the daily triage definition, cron trigger, trusted instruction, and report sinks
uv run pytest -q in activity-core: 107 passed, 1 skipped
activity-core focused T06 validation: uv run pytest tests/test_sync_activity_definitions.py tests/test_instruction_evaluation.py tests/test_report_sinks.py -q: 10 passed
activity-core full suite after T06 fixes: uv run pytest -q: 110 passed, 1 skipped

Acceptance Criteria

The daily State Hub WSJF triage runs from activity-core, not from the Codex app automation.
The Codex app automation is disabled or removed before the activity-core schedule is enabled.
The daily run leaves all three evidence surfaces: working-memory note, State Hub daily_triage progress event, and activity-core ActivityRun/Temporal history.
"Did it run today?" can be answered from State Hub and activity-core telemetry.
A powered-off workstation no longer matters once activity-core is running on the chosen always-on host.
If the chosen activity-core host is offline at 07:20, the missed run is skipped by policy and the absence is visible in the runbook checks.
CUST-WP-0044's three-run calibration is completed using the new runner.

Notes

The immediate Codex app automation failure could be patched by chasing the Windows/WSL launch path issue. That is not the preferred durable fix. The preferred fix is to make the existing activity-core ActivityDefinition the primary runner and keep all scheduling, audit, context resolution, and failure visibility in owned infrastructure.
