---
id: CUST-WP-0045
type: workplan
title: "Activity-Core Daily Triage Runner Cutover"
domain: custodian
repo: the-custodian
status: active
owner: custodian
topic_slug: custodian
planning_priority: high
planning_order: 45
created: "2026-05-19"
updated: "2026-05-21"
state_hub_workstream_id: "d9d9a3ec-f736-4041-beac-bb92c7ad314e"
---
# CUST-WP-0045 - Activity-Core Daily Triage Runner Cutover
## Goal
Move the Daily State Hub WSJF Triage runner from the Codex app automation
substrate to owned activity-core infrastructure.
The outcome should be a reliable daily run at 07:20 Europe/Berlin that produces
the same review artifact promised by `CUST-WP-0044`: a dated working-memory
note, a State Hub `daily_triage` progress event, and an auditable activity-core
run record.
## Context
On 2026-05-19 the Codex app automation fired at the scheduled time, but did not
complete a useful run:
- two `Daily State Hub WSJF Triage` sessions were created at 07:20 Europe/Berlin
- both session files contained only session metadata
- no prompt execution, report, tool call, working-memory note, or final answer
was recorded
- State Hub had no `daily_triage` progress event for that date
- the recorded session cwd values used Windows-style `C:\home\worsch\...`
paths rather than the intended WSL paths
This shows the schedule is present but the launch substrate is not trustworthy
enough for an unattended Custodian operating habit.
activity-core already provides the pieces that should own this class of work:
- Temporal cron schedules with timezone and misfire-policy handling
- `ActivityDefinition` markdown ingestion via `ACTIVITY_DEFINITION_DIRS`
- `state-hub` context resolver hooks
- ActivityRun logging and Temporal workflow history
- rule/instruction model design in `ACT-ADR-003`
- deployment/runbook paths for the Railiance environment
The missing work is to connect those existing capabilities to this judgement
report use case without building a second scheduler or a parallel priority
database.
## Scope
In scope:
- Extend activity-core so the existing daily triage ActivityDefinition can run
as the primary scheduler.
- Reuse the existing prompt at
`runtime/prompts/daily_statehub_wsjf_triage.md`.
- Reuse the existing ActivityDefinition at
`activity-definitions/daily-statehub-wsjf-triage.md`.
- Extend activity-core's State Hub context resolver for the queries this
report already needs.
- Add or finish the instruction/report execution path described by activity-core
ADR-003.
- Write the report to Custodian working memory and log
`event_type: daily_triage` in State Hub.
- Disable the Codex app automation after activity-core is validated, so there
is only one daily runner.
Out of scope:
- Rewriting the WSJF rubric or report template; that belongs to `CUST-WP-0044`.
- Creating a new scheduler, cron daemon, or separate automation database.
- Automatically changing workplan status, priority, canon, secrets, deployment,
or external commitments from the daily report.
- Retiring the workstation fallback or deploying high-availability (HA)
activity-core before the relevant Railiance deployment work is approved.
## Runner Decision
Primary target runner: activity-core Temporal schedule.
Temporary fallback runner: Codex app automation, only until activity-core has
completed a manual run and at least one scheduled canary run.
Cutover rule: do not enable both runners at the same time. The handoff is:
1. Activity-core definition remains disabled while the Codex automation is the
only runner.
2. Activity-core is validated with a manual trigger using the same definition.
3. Codex automation is paused.
4. Activity-core definition is enabled and schedules are synced.
5. The next scheduled run is checked for a working-memory note, State Hub
progress event, and ActivityRun row.
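
The handoff steps above can be read as two small checks on runner state. The following is a minimal sketch only; `RunnerState` and both function names are hypothetical and do not exist in activity-core or the Codex app.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunnerState:
    """Hypothetical snapshot of both runners (not an activity-core type)."""
    codex_automation_active: bool
    activity_core_enabled: bool
    manual_run_validated: bool


def may_enable_activity_core(state: RunnerState) -> bool:
    """Steps 2-4: enable only after a validated manual run and a paused Codex automation."""
    return state.manual_run_validated and not state.codex_automation_active


def violates_single_runner_rule(state: RunnerState) -> bool:
    """Cutover rule: both runners must never be enabled at the same time."""
    return state.codex_automation_active and state.activity_core_enabled
```

A pre-cutover check could refuse to set `enabled: true` whenever `may_enable_activity_core` returns `False`.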
## Tasks
### T01 - Capture Failure Evidence And Runner Boundary
```task
id: CUST-WP-0045-T01
status: done
priority: high
state_hub_task_id: "01f57ed4-0473-42bf-b61c-0491f7ac7e2c"
```
Record the 2026-05-19 failed automation evidence in the implementation notes
for this workplan and, if useful, in the CUST-WP-0044 calibration notes.
Confirm the desired runner boundary:
- activity-core owns schedule, retries, run log, and context resolution
- State Hub remains the read model and progress sink
- the-custodian owns the prompt, report template, and governance guardrails
- Codex app automation is a temporary fallback only
Done when the failure mode and cutover target are explicit enough that future
agents do not try to fix this by adding another local cron path.
### T02 - Extend Activity-Core State Hub Context Resolver
```task
id: CUST-WP-0045-T02
status: done
priority: high
depends_on: [CUST-WP-0045-T01]
state_hub_task_id: "c4303b24-6f6b-445e-8e2e-94441589a7f2"
```
Extend activity-core's existing `state-hub` context resolver instead of adding
bespoke HTTP fetch logic to the Custodian repo.
Required queries:
- `state_summary` -> `GET /state/summary`
- `next_steps` -> `GET /state/next_steps`
- `workplan_index` -> `GET /workstreams/workplan-index`
- `hub_inbox` -> `GET /messages/?to_agent=hub&unread_only=true`
The resolver should keep the existing `STATE_HUB_URL` configuration pattern,
use bounded timeouts, and return `{}` on resolver failure so the workflow can
still fall back to the offline brief/prompt contract.
Done when activity-core tests cover all four new query names and the existing
`domain_summary` and `repo_sbom_status` behavior remains intact.
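
As an illustration of the resolver contract described above, the sketch below maps the four query names to their endpoints and returns `{}` on any failure. It is not the activity-core implementation: the function names, the injectable `fetch` parameter, and the 5 second default timeout are assumptions made for this example.

```python
import http.client
import json
import urllib.request
from typing import Any, Callable

# Query names and endpoints are taken from the "Required queries" list.
QUERY_PATHS = {
    "state_summary": "/state/summary",
    "next_steps": "/state/next_steps",
    "workplan_index": "/workstreams/workplan-index",
    "hub_inbox": "/messages/?to_agent=hub&unread_only=true",
}


def http_get_json(url: str, timeout: float) -> Any:
    """Fetch a URL with a bounded timeout and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def resolve_query(
    base_url: str,
    query: str,
    fetch: Callable[[str, float], Any] = http_get_json,
    timeout: float = 5.0,  # assumed default, not from the workplan
) -> Any:
    """Resolve one named query; return {} for unknown names or any failure."""
    path = QUERY_PATHS.get(query)
    if path is None:
        return {}
    try:
        return fetch(base_url.rstrip("/") + path, timeout)
    except (OSError, ValueError, http.client.HTTPException):
        # Network errors, timeouts, and bad JSON all degrade to empty context.
        return {}
```

In a worker, `base_url` would come from the existing `STATE_HUB_URL` setting. Injecting `fetch` keeps the failure path testable without a running State Hub.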
### T03 - Implement Instruction Report Execution
```task
id: CUST-WP-0045-T03
status: done
priority: high
depends_on: [CUST-WP-0045-T02]
state_hub_task_id: "e766ff2e-1887-49e6-9c66-598bb395e76c"
```
Finish the activity-core instruction/report execution path needed for judgement
runs like daily triage.
Reuse the existing rule/instruction model from `ACT-ADR-003`:
- parse a fenced `instruction` block from the ActivityDefinition
- apply any instruction condition before running the report
- render the canonical prompt with explicit trusted context fields
- call the approved model/agent adapter through the existing org LLM path where
available
- validate the output against a small daily-triage report schema
- record model, prompt hash, validation result, and source instruction id in
the activity-core audit trail
This task should not introduce another scheduler or a one-off daily-triage
script. The deliverable is a reusable instruction execution capability that
this report can use and future judgement activities can share.
Done when activity-core can run a synthetic instruction ActivityDefinition and
produce a validated report payload under test.
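
To make the audit-trail bullet concrete, here is a minimal sketch of hashing a rendered prompt and validating a model response before recording it. The required keys `summary` and `recommendations` are hypothetical placeholders; the real daily-triage report schema may differ, and none of these function names come from activity-core.

```python
import hashlib
import json

# Hypothetical required fields, standing in for the real report schema.
REQUIRED_FIELDS = {"summary": str, "recommendations": list}


def prompt_hash(rendered_prompt: str) -> str:
    """SHA-256 hex digest of the exact prompt text sent to the model."""
    return hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()


def validate_report(raw_output: str) -> tuple[bool, str]:
    """Check that the model output is a JSON object with the required fields."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(payload, dict):
        return False, "report must be a JSON object"
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(key), expected_type):
            return False, f"missing or mistyped field: {key}"
    return True, "ok"


def audit_record(instruction_id: str, model: str, rendered_prompt: str, raw_output: str) -> dict:
    """Collect the fields the audit trail should carry for one instruction run."""
    valid, detail = validate_report(raw_output)
    return {
        "instruction_id": instruction_id,
        "model": model,
        "prompt_hash": prompt_hash(rendered_prompt),
        "validation": "valid" if valid else "invalid",
        "detail": detail,
    }
```

A synthetic ActivityDefinition test could feed a canned response through `audit_record` and assert on the `validation` field.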
### T04 - Add Working-Memory And State Hub Progress Sinks
```task
id: CUST-WP-0045-T04
status: done
priority: high
depends_on: [CUST-WP-0045-T03]
state_hub_task_id: "04e56428-d3a8-4aa7-a6e1-172c974ece3a"
```
Add deterministic output sinks for report instructions.
For this activity, the sink must:
- write one dated note under
`/home/worsch/the-custodian/memory/working/`
- post one State Hub progress event with `event_type: daily_triage`
- include the activity id, run id, scheduled time, and report summary
- be idempotent by activity-core run id and local date
- refuse to edit `canon/`, `workplans/`, or other canonical files
Done when a manual activity-core trigger creates exactly one working-memory
note and one State Hub progress event, and a retry does not duplicate either.
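
The working-memory half of this sink can be sketched as an exclusive-create write keyed by local date and run id. This is an assumption-laden illustration, not the activity-core sink: the function names are hypothetical, and the file name pattern simply mirrors the note name recorded in the canary evidence later in this workplan.

```python
from pathlib import Path

# Directories the sink must never write into.
FORBIDDEN_PARTS = {"canon", "workplans"}


def note_relpath(local_date: str, run_id: str) -> Path:
    """Build a note path keyed by local date and the first 8 chars of the run id."""
    return Path("memory") / "working" / f"daily-triage-{local_date}-{run_id[:8]}.md"


def ensure_allowed(relative: Path) -> None:
    """Reject absolute paths, parent traversal, and canonical directories."""
    if relative.is_absolute() or ".." in relative.parts:
        raise PermissionError(f"path must stay inside the repo root: {relative}")
    if FORBIDDEN_PARTS & set(relative.parts):
        raise PermissionError(f"refusing to write canonical path: {relative}")


def write_once(root: Path, relative: Path, body: str) -> bool:
    """Write the note if absent; return False when a retry finds it already there."""
    ensure_allowed(relative)
    target = root / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" fails if the file exists, which makes retries no-ops.
        with target.open("x", encoding="utf-8") as handle:
            handle.write(body)
    except FileExistsError:
        return False
    return True
```

Exclusive-create mode avoids a separate existence check, so a retried run cannot overwrite or duplicate the first note.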
### T05 - Update And Validate The Daily Triage ActivityDefinition
```task
id: CUST-WP-0045-T05
status: done
priority: medium
depends_on: [CUST-WP-0045-T02, CUST-WP-0045-T03, CUST-WP-0045-T04]
state_hub_task_id: "0c6d54ec-7ed1-4e80-9cfa-ccb914e65fbf"
```
Update `activity-definitions/daily-statehub-wsjf-triage.md` so it is executable
by activity-core.
Expected changes:
- keep the trigger at `20 7 * * *`, timezone `Europe/Berlin`
- keep `misfire_policy: skip`
- add the report instruction block that references the canonical prompt
- keep `enabled: false` until manual validation passes
- document the single-runner cutover rule in the file
Validate using activity-core's existing parser and sync commands with
`ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian`.
Done when the definition parses, syncs into activity-core, and appears as a
paused Temporal schedule while disabled.
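
A pre-sync check for the expected changes above might look like the sketch below. It assumes the parsed definition is available as a plain dict; the keys `cron` and `timezone` are hypothetical names chosen for this example, while `misfire_policy` and `enabled` are the names used in this workplan.

```python
from typing import Any

# Expected trigger settings from the "Expected changes" list.
# "cron" and "timezone" are assumed key names, not the real definition schema.
EXPECTED = {
    "cron": "20 7 * * *",
    "timezone": "Europe/Berlin",
    "misfire_policy": "skip",
}


def definition_problems(definition: dict[str, Any], manual_validation_passed: bool) -> list[str]:
    """Return human-readable problems; an empty list means the definition looks right."""
    problems = []
    for key, want in EXPECTED.items():
        got = definition.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    if definition.get("enabled", False) and not manual_validation_passed:
        problems.append("enabled must stay false until manual validation passes")
    return problems
```

Running a check like this before sync would catch an accidental `enabled: true` or a changed misfire policy before a schedule is created.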
### T06 - Canary Cutover And Disable Codex Automation
```task
id: CUST-WP-0045-T06
status: in_progress
priority: high
depends_on: [CUST-WP-0045-T05]
state_hub_task_id: "545162d7-0198-4519-a30b-06e88c6db915"
```
Run the cutover safely.
Sequence:
1. Manually trigger the activity-core definition and verify output.
2. Pause or delete the Codex app automation
`daily-state-hub-wsjf-triage`.
3. Set the activity-core definition to `enabled: true`.
4. Sync activity definitions and Temporal schedules.
5. Confirm the Temporal schedule is unpaused and points at
`RunActivityWorkflow`.
6. Check the next 07:20 run for a working-memory note, State Hub progress event,
ActivityRun row, and Temporal workflow history.
Done when activity-core is the only enabled runner and the first scheduled run
has completed successfully.
### T07 - Observability And Missed-Run Handling
```task
id: CUST-WP-0045-T07
status: todo
priority: medium
depends_on: [CUST-WP-0045-T06]
state_hub_task_id: "b977c721-cadc-461f-8ffb-715d438e4c31"
```
Document and, where cheap, automate how to tell whether the daily run happened.
The runbook should include:
- Temporal schedule and workflow checks
- activity-core ActivityRun query
- State Hub `daily_triage` progress-event query
- working-memory note path check
- expected behavior when the activity-core host is offline at 07:20
- the chosen missed-run behavior: `skip`, not catch-up
Done when the operator can answer "did it run today?" from owned telemetry
without inspecting Codex Desktop session internals.
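
One way to frame the "did it run today?" answer is to combine the evidence surfaces into a single verdict. The sketch below assumes each check has already been reduced to a boolean by the runbook queries; `RunEvidence` and `run_verdict` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvidence:
    """Results of the runbook checks for one local date (hypothetical type)."""
    working_memory_note: bool
    state_hub_progress_event: bool
    activity_run_row: bool


def run_verdict(evidence: RunEvidence) -> str:
    """Classify a day as 'ran', 'missed', or 'partial'."""
    found = [
        evidence.working_memory_note,
        evidence.state_hub_progress_event,
        evidence.activity_run_row,
    ]
    if all(found):
        return "ran"
    if not any(found):
        # Consistent with misfire_policy skip: no catch-up run is expected.
        return "missed"
    # Some surfaces exist but not all: the run started and a sink or log failed.
    return "partial"
```

A `partial` verdict is the case worth alerting on, since a `missed` day is expected behavior when the host was offline at 07:20.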
### T08 - Three Daily Runs And CUST-WP-0044 Calibration
```task
id: CUST-WP-0045-T08
status: todo
priority: medium
depends_on: [CUST-WP-0045-T06, CUST-WP-0045-T07]
state_hub_task_id: "f4a985fd-8cce-4175-983e-cf3b437e19a5"
```
Run three consecutive daily canaries from activity-core and compare the
recommendations with actual follow-up work.
Feed the result back into `CUST-WP-0044-T06`:
- calibrate WSJF scoring weights
- tune report length
- adjust loose-end detection thresholds
- confirm stale-but-intentionally-parked work is treated correctly
- decide whether daily notes are useful enough as a standing habit
Done when CUST-WP-0044 can close its calibration task using activity-core runs,
not Codex app automation runs.
## Implementation Notes - 2026-05-19
T01 is complete. The 2026-05-19 failed Codex automation run is captured in this
workplan's context, and the runner boundary is explicit: activity-core owns the
schedule, retries, context resolution, run log, and audit trail; State Hub stays
the read model and progress sink; the-custodian owns the prompt and guardrails.
T02 is complete in activity-core. The existing `state-hub` context resolver now
supports the daily triage queries `state_summary`, `next_steps`,
`workplan_index`, and `hub_inbox` while preserving `domain_summary` and
`repo_sbom_status`. Resolver failures return `{}` so the workflow can degrade
to offline context instead of failing the whole run.
T03 is complete in activity-core. `RunActivityWorkflow` now evaluates
instruction blocks after rules, using the existing instruction executor and a
small llm-connect HTTP client boundary. Instruction results carry task specs,
optional report payloads, prompt hash, model, validation status, review flag,
and condition metadata. A lightweight daily triage report schema is available
at `schemas/daily-triage-report.json` so report payloads can be validated under
test before T04 wires the deterministic working-memory and State Hub sinks.
T04 is complete in activity-core. Instruction definitions can now declare
`report_sinks`; report payloads are persisted through deterministic sink code
instead of model-authored file operations. The first two sink types are
`working-memory` and `state-hub-progress`. Working-memory writes refuse
canonical Custodian `canon/` and `workplans/` paths and use run-id/date-based
idempotency. State Hub progress posting deduplicates by activity run id and
instruction id before posting.
T05 is complete. The daily triage ActivityDefinition now uses a single trusted
scalar `context.daily_triage_digest` instead of raw State Hub JSON. The digest
is built in activity-core from safe identifiers, counts, statuses, priority
fields, health labels, and shortened titles, while excluding task descriptions,
message bodies, and other free-text command surfaces. The digest also carries a
`deterministic_scoring` extension marker so a later high-criticality path can
move especially high-gain/high-effort candidate scoring into code without
changing the ActivityDefinition contract.
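
The digest idea described above, keeping safe structured fields and dropping free text, can be sketched as follows. The field names, the 60 character title limit, and the output shape are assumptions for illustration; the actual digest format in activity-core is not reproduced here.

```python
from typing import Any

# Assumed allowlist of structured fields; free-text fields are never copied.
SAFE_FIELDS = ("id", "status", "priority")
MAX_TITLE_CHARS = 60  # hypothetical limit


def shorten(title: str, limit: int = MAX_TITLE_CHARS) -> str:
    """Collapse whitespace and truncate long titles with an ellipsis."""
    title = " ".join(title.split())
    if len(title) <= limit:
        return title
    return title[: limit - 3] + "..."


def digest_item(task: dict[str, Any]) -> dict[str, Any]:
    """Copy only allowlisted fields plus a shortened title."""
    item = {key: task[key] for key in SAFE_FIELDS if key in task}
    if "title" in task:
        item["title"] = shorten(str(task["title"]))
    return item


def build_digest(tasks: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a digest with a count and the sanitized items."""
    items = [digest_item(task) for task in tasks]
    return {"task_count": len(items), "tasks": items}
```

An allowlist is used instead of a blocklist so that any new free-text field added upstream is excluded by default.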
T06 is partially validated but blocked before cutover. A local activity-core
dev stack was started, the Custodian ActivityDefinition directory synced into
activity-core, and the paused Temporal schedule for the disabled daily triage
definition was created. The first sync exposed reusable activity-core gaps that
were fixed there instead of bypassed here:
- file-authored ActivityDefinition slug ids now map to stable UUIDv5 DB ids
- schedule sync no longer uses raw `NOT IN :ids` SQL that asyncpg rejects
- ADR-style context sources without an explicit `name` validate against the
domain model
- the worker now registers the existing instruction/report activities
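
The first fix in the list above relies on UUIDv5 being deterministic: the same namespace and name always produce the same id. A minimal sketch with Python's standard `uuid` module follows; the namespace string and the slug normalization are assumptions, not the values activity-core uses.

```python
import uuid

# Hypothetical namespace; activity-core's real namespace is not shown in this workplan.
ACTIVITY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "activity-core/activity-definitions")


def definition_db_id(slug: str) -> uuid.UUID:
    """Map a file-authored slug to a stable UUIDv5 so repeated syncs hit the same row."""
    return uuid.uuid5(ACTIVITY_NAMESPACE, slug.strip().lower())
```

Because the id is derived rather than generated, re-syncing the same definition file updates one row instead of creating a new one each time.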
Manual trigger canary evidence, using a local-only llm-connect mock response so
no State Hub digest data left the workstation:
- workflow id:
`activity-6fca51fa-387a-4fd0-bc4e-d62c29eb859a:manual-6a6e5950-2338-45c4-9054-573dda9c87cc`
- Temporal status: `COMPLETED`
- activity-core run id: `2164cb88-8415-5c96-9e31-e47a41cf4e67`
- working-memory note:
`memory/working/daily-triage-2026-05-19-2164cb88.md`
- State Hub progress event: `e42c0ada-8111-4d88-9791-821252cd04a2`
The real Claude-backed llm-connect trigger was not run in that pass. The
execution wrapper blocked it because private State Hub workstream/task digest
data would be sent to an external LLM provider. The operator then clarified that
`llm-connect` is the intended backend boundary for LLM providers and depth
tuning. Follow-up implementation keeps that boundary explicit: activity-core
passes model/depth configuration through llm-connect, provider routing remains
inside llm-connect, and the ActivityDefinition declares a balanced
`custodian-triage-balanced` profile for calibration.
The llm-connect depth path is now reusable instead of daily-triage-specific:
- activity-core `InstructionDef` accepts `temperature`, `max_tokens`,
`max_depth`, and `model_params`
- activity-core sends those values to llm-connect as `RunConfig`
- llm-connect server mode now preserves the full `RunConfig` via
`RunConfig.from_dict`
- the daily triage ActivityDefinition starts with
`model: custodian-triage-balanced`, `max_depth: 2`, and
`model_params.reasoning_effort: medium`
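
The pass-through described above depends on the server side keeping every configured field when it rebuilds the config from a request payload. The sketch below uses a stand-in class, `RunConfigSketch`, to show the idea; it is not llm-connect's `RunConfig`, and its field set is assumed from the parameters listed above.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class RunConfigSketch:
    """Stand-in for a run config; not the real llm-connect class."""
    model: str
    temperature: Optional[float] = None
    max_tokens: Optional[int] = None
    max_depth: Optional[int] = None
    model_params: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "RunConfigSketch":
        """Keep every known field from the payload and ignore unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{key: value for key, value in data.items() if key in known})


# Values from the daily triage ActivityDefinition described above.
payload = {
    "model": "custodian-triage-balanced",
    "max_depth": 2,
    "model_params": {"reasoning_effort": "medium"},
}
config = RunConfigSketch.from_dict(payload)
```

Building the object from the whole payload, instead of copying a few fields by hand, is what keeps `max_depth` and `model_params` from being dropped silently.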
Remaining T06 work is now operational cutover: run the real llm-connect backend
selected by the operator, verify real report quality, pause Codex automation,
set the ActivityDefinition to `enabled: true`, sync schedules, and check the
next 07:20 run.
Verification:
- `uv run pytest tests/test_state_hub_context_resolver.py -q`:
6 passed
- activity-core parser validation with
`ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian`:
parsed the daily triage definition, cron trigger, trusted instruction, and
report sinks
- `uv run pytest -q` in activity-core:
107 passed, 1 skipped
- activity-core focused T06 validation:
`uv run pytest tests/test_sync_activity_definitions.py
tests/test_instruction_evaluation.py tests/test_report_sinks.py -q`:
10 passed
- activity-core full suite after T06 fixes:
`uv run pytest -q`:
110 passed, 1 skipped
- activity-core llm-connect depth pass-through full suite:
`uv run pytest -q`:
114 passed, 1 skipped
- llm-connect focused server validation:
`uv run pytest tests/test_server.py -q`:
10 passed
- llm-connect full suite:
`PYTHONPATH=. uv run pytest -q`:
173 passed
## Implementation Notes - 2026-05-21
T06 remains in progress; no cutover was performed and the Codex automation must
remain the fallback runner. The daily triage ActivityDefinition is still
`enabled: false`.
Real llm-connect canary attempt 1 reached the activity-core workflow but failed
before report persistence:
- workflow id:
`activity-6fca51fa-387a-4fd0-bc4e-d62c29eb859a:manual-d0317873-5e09-4849-a57a-6edff7fada2c`
- Temporal status: `COMPLETED`
- activity-core run id: `9b8486b5-0495-5d3f-8b7b-dc078a7c097b`
- worker evidence: llm-connect returned HTTP 200 twice, but activity-core
rejected the instruction output as invalid JSON
- persistence evidence: no working-memory note and no State Hub
`daily_triage` progress event were written
Diagnosis showed that server-mode llm-connect was resolving the older
`/usr/bin/claude` CLI instead of the working user install at
`/home/worsch/.local/bin/claude`. A direct llm-connect probe through the older
CLI returned the literal content `Execution error`, while the user install could
return raw JSON. Restarting llm-connect with the user CLI path made a small
probe return `{"ok": true}` through the HTTP boundary.
Real llm-connect canary attempt 2 used the working Claude CLI path but still did
not produce a persisted report:
- workflow id:
`activity-6fca51fa-387a-4fd0-bc4e-d62c29eb859a:manual-2de56ad6-0f82-48f0-8184-f357bd22f658`
- Temporal status: `COMPLETED`
- activity-core run id: `953a1f46-e57b-58e1-b4a2-2e41e804a972`
- worker evidence: first llm-connect call returned HTTP 200, then activity-core
retried because the output was not schema-valid JSON; the retry returned
HTTP 500
- persistence evidence: no working-memory note and no State Hub
`daily_triage` progress event were written
The follow-up fix keeps the existing activity-core/llm-connect boundary:
- activity-core now loads an instruction's existing `output_schema` and forwards
that schema to llm-connect as `model_params.json_schema`
- llm-connect's Claude Code adapter now prefers
`LLM_CONNECT_CLAUDE_CLI_PATH`, `CLAUDE_CLI_PATH`, or the user-local
`/home/worsch/.local/bin/claude` before falling back to `claude`
- llm-connect's Claude Code adapter maps `model_params.json_schema` to the
native Claude CLI `--json-schema` option
- the Custodian ActivityDefinition now points at the domain-owned absolute
schema path `/home/worsch/the-custodian/schemas/daily-triage-report.json`
and asks for JSON only as a fallback
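
The CLI preference order in the list above can be sketched as a small resolver. This is an illustration under assumptions, not the llm-connect adapter: the function names are hypothetical, and the choice to trust an environment variable without checking that the file exists is made only for this example.

```python
import os
from typing import Callable, Mapping

USER_LOCAL_CLI = "/home/worsch/.local/bin/claude"
ENV_VARS = ("LLM_CONNECT_CLAUDE_CLI_PATH", "CLAUDE_CLI_PATH")


def is_executable(path: str) -> bool:
    """True when the path is a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def resolve_claude_cli(
    env: Mapping[str, str] = os.environ,
    exists: Callable[[str], bool] = is_executable,
) -> str:
    """Prefer explicit env overrides, then the user install, then PATH lookup."""
    for name in ENV_VARS:
        candidate = env.get(name)
        if candidate:
            return candidate  # assumption: explicit overrides are trusted as-is
    if exists(USER_LOCAL_CLI):
        return USER_LOCAL_CLI
    return "claude"  # bare name, resolved through PATH by the caller
```

Logging the resolved path at server start would make a repeat of the `/usr/bin/claude` mix-up visible without a probe request.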
The patched schema probe could not be completed because the local Claude Code
session limit was reached; the CLI reported:
`You've hit your session limit · resets 3:40am (Europe/Berlin)`.
Next T06 step after the limit resets, or after llm-connect routes this profile
to another approved provider, is to rerun the manual trigger with the patched
schema path and verify all three evidence surfaces before pausing Codex or
enabling the activity-core schedule.
Verification:
- activity-core focused executor tests:
`uv run pytest tests/rules/test_executor.py -q`:
22 passed
- llm-connect focused Claude Code/factory tests:
`PYTHONPATH=. uv run pytest tests/test_claude_code.py tests/test_factory.py -q`:
18 passed
- activity-core full suite:
`uv run pytest -q`:
115 passed, 1 skipped
- llm-connect full suite:
`PYTHONPATH=. uv run pytest -q`:
175 passed
## Acceptance Criteria
- The daily State Hub WSJF triage runs from activity-core, not the Codex app
automation.
- The Codex app automation is disabled or removed before the activity-core
schedule is enabled.
- The daily run leaves all three evidence surfaces: working-memory note, State
Hub `daily_triage` progress event, and activity-core ActivityRun/Temporal
history.
- "Did it run today?" can be answered from State Hub and activity-core
telemetry.
- Once activity-core runs on the chosen always-on host, the daily run no longer
depends on the workstation being powered on.
- If the chosen activity-core host is offline at 07:20, the missed run is
skipped by policy and the absence is visible in the runbook checks.
- CUST-WP-0044's three-run calibration is completed using the new runner.
## Notes
The immediate Codex app automation failure could be patched by chasing the
Windows/WSL launch path issue. That is not the preferred durable fix. The
preferred fix is to make the existing activity-core ActivityDefinition the
primary runner and keep all scheduling, audit, context resolution, and failure
visibility in owned infrastructure.