Files

tegwick 9d1a35e61b Note 2026-06-01 cutover-runbook session on CUST-WP-0045

T06 remains in_progress — no canary was rerun. Capture the runbook deliverable
(workplans/CUST-WP-0045-cutover-runbook.md @ 8ef5399), the still-unchanged
upstream fixes that should let the patched canary succeed, and the two
operational gotchas the runbook now documents (host-mode env overrides vs.
Docker-network .env; Claude CLI quota collision when triggering from inside
an active Claude Code session). Bump updated: to 2026-06-01.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-02 02:16:50 +02:00

23 KiB

Raw Blame History

id, type, title, domain, repo, status, owner, topic_slug, planning_priority, planning_order, created, updated, state_hub_workstream_id

id	type	title	domain	repo	status	owner	topic_slug	planning_priority	planning_order	created	updated	state_hub_workstream_id
CUST-WP-0045	workplan	Activity-Core Daily Triage Runner Cutover	custodian	the-custodian	active	custodian	custodian	high	45	2026-05-19	2026-06-01	d9d9a3ec-f736-4041-beac-bb92c7ad314e

CUST-WP-0045 - Activity-Core Daily Triage Runner Cutover

Goal

Move the Daily State Hub WSJF Triage runner from the Codex app automation substrate to owned activity-core infrastructure.

The outcome should be a reliable daily run at 07:20 Europe/Berlin that produces the same review artifact promised by CUST-WP-0044: a dated working-memory note, a State Hub daily_triage progress event, and an auditable activity-core run record.

Context

On 2026-05-19 the Codex app automation fired at the scheduled time, but did not complete a useful run:

two Daily State Hub WSJF Triage sessions were created at 07:20 Europe/Berlin
both session files contained only session metadata
no prompt execution, report, tool call, working-memory note, or final answer was recorded
State Hub had no daily_triage progress event for that date
the recorded session cwd values used Windows-style C:\home\worsch\... paths rather than the intended WSL paths

This shows the schedule is present but the launch substrate is not trustworthy enough for an unattended Custodian operating habit.

activity-core already provides the pieces that should own this class of work:

Temporal cron schedules with timezone and misfire-policy handling
ActivityDefinition markdown ingestion via ACTIVITY_DEFINITION_DIRS
state-hub context resolver hooks
ActivityRun logging and Temporal workflow history
rule/instruction model design in ACT-ADR-003
deployment/runbook paths for the Railiance environment

The missing work is to connect those existing capabilities to this judgement report use case without building a second scheduler or a parallel priority database.

Scope

In scope:

Extend activity-core so the existing daily triage ActivityDefinition can run as the primary scheduler.
Reuse the existing prompt at runtime/prompts/daily_statehub_wsgi_triage.md.
Reuse the existing ActivityDefinition at activity-definitions/daily-statehub-wsjf-triage.md.
Extend activity-core's State Hub context resolver for the queries this report already needs.
Add or finish the instruction/report execution path described by activity-core ADR-003.
Write the report to Custodian working memory and log event_type: daily_triage in State Hub.
Disable the Codex app automation after activity-core is validated, so there is only one daily runner.

Out of scope:

Rewriting the WSJF rubric or report template; that belongs to CUST-WP-0044.
Creating a new scheduler, cron daemon, or separate automation database.
Automatically changing workplan status, priority, canon, secrets, deployment, or external commitments from the daily report.
Retiring the workstation fallback or deploying HA activity-core before the relevant Railiance deployment work is approved.

Runner Decision

Primary target runner: activity-core Temporal schedule.

Temporary fallback runner: Codex app automation, only until activity-core has completed a manual run and at least one scheduled canary run.

Cutover rule: do not enable both runners at the same time. The handoff is:

Activity-core definition remains disabled while the Codex automation is the only runner.
Activity-core is validated with a manual trigger using the same definition.
Codex automation is paused.
Activity-core definition is enabled and schedules are synced.
The next scheduled run is checked for a working-memory note, State Hub progress event, and ActivityRun row.

Tasks

T01 - Capture Failure Evidence And Runner Boundary

id: CUST-WP-0045-T01
status: done
priority: high
state_hub_task_id: "01f57ed4-0473-42bf-b61c-0491f7ac7e2c"

Record the 2026-05-19 failed automation evidence in the implementation notes for this workplan and, if useful, in the CUST-WP-0044 calibration notes.

Confirm the desired runner boundary:

activity-core owns schedule, retries, run log, and context resolution
State Hub remains the read model and progress sink
the-custodian owns the prompt, report template, and governance guardrails
Codex app automation is a temporary fallback only

Done when the failure mode and cutover target are explicit enough that future agents do not try to fix this by adding another local cron path.

T02 - Extend Activity-Core State Hub Context Resolver

id: CUST-WP-0045-T02
status: done
priority: high
depends_on: [CUST-WP-0045-T01]
state_hub_task_id: "c4303b24-6f6b-445e-8e2e-94441589a7f2"

Extend activity-core's existing state-hub context resolver instead of adding bespoke HTTP fetch logic to the Custodian repo.

Required queries:

state_summary -> GET /state/summary
next_steps -> GET /state/next_steps
workplan_index -> GET /workstreams/workplan-index
hub_inbox -> GET /messages/?to_agent=hub&unread_only=true

The resolver should keep the existing STATE_HUB_URL configuration pattern, use bounded timeouts, and return {} on resolver failure so the workflow can still fall back to the offline brief/prompt contract.

Done when activity-core tests cover all four new query names and the existing domain_summary and repo_sbom_status behavior remains intact.

T03 - Implement Instruction Report Execution

id: CUST-WP-0045-T03
status: done
priority: high
depends_on: [CUST-WP-0045-T02]
state_hub_task_id: "e766ff2e-1887-49e6-9c66-598bb395e76c"

Finish the activity-core instruction/report execution path needed for judgement runs like daily triage.

Reuse the existing rule/instruction model from ACT-ADR-003:

parse a fenced instruction block from the ActivityDefinition
apply any instruction condition before running the report
render the canonical prompt with explicit trusted context fields
call the approved model/agent adapter through the existing org LLM path where available
validate the output against a small daily-triage report schema
record model, prompt hash, validation result, and source instruction id in the activity-core audit trail

This task should not introduce another scheduler or a one-off daily-triage script. The deliverable is a reusable instruction execution capability that this report can use and future judgement activities can share.

Done when activity-core can run a synthetic instruction ActivityDefinition and produce a validated report payload under test.

T04 - Add Working-Memory And State Hub Progress Sinks

id: CUST-WP-0045-T04
status: done
priority: high
depends_on: [CUST-WP-0045-T03]
state_hub_task_id: "04e56428-d3a8-4aa7-a6e1-172c974ece3a"

Add deterministic output sinks for report instructions.

For this activity, the sink must:

write one dated note under /home/worsch/the-custodian/memory/working/
post one State Hub progress event with event_type: daily_triage
include the activity id, run id, scheduled time, and report summary
be idempotent by activity-core run id and local date
refuse to edit canon/, workplans/, or other canonical files

Done when a manual activity-core trigger creates exactly one working-memory note and one State Hub progress event, and a retry does not duplicate either.

T05 - Update And Validate The Daily Triage ActivityDefinition

id: CUST-WP-0045-T05
status: done
priority: medium
depends_on: [CUST-WP-0045-T02, CUST-WP-0045-T03, CUST-WP-0045-T04]
state_hub_task_id: "0c6d54ec-7ed1-4e80-9cfa-ccb914e65fbf"

Update activity-definitions/daily-statehub-wsjf-triage.md so it is executable by activity-core.

Expected changes:

keep the trigger at 20 7 * * *, timezone Europe/Berlin
keep misfire_policy: skip
add the report instruction block that references the canonical prompt
keep enabled: false until manual validation passes
document the single-runner cutover rule in the file

Validate using activity-core's existing parser and sync commands with ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian.

Done when the definition parses, syncs into activity-core, and appears as a paused Temporal schedule while disabled.

T06 - Canary Cutover And Disable Codex Automation

id: CUST-WP-0045-T06
status: in_progress
priority: high
depends_on: [CUST-WP-0045-T05]
state_hub_task_id: "545162d7-0198-4519-a30b-06e88c6db915"

Run the cutover safely.

Sequence:

Manually trigger the activity-core definition and verify output.
Pause or delete the Codex app automation daily-state-hub-wsjf-triage.
Set the activity-core definition to enabled: true.
Sync activity definitions and Temporal schedules.
Confirm the Temporal schedule is unpaused and points at RunActivityWorkflow.
Check the next 07:20 run for a working-memory note, State Hub progress event, ActivityRun row, and Temporal workflow history.

Done when activity-core is the only enabled runner and the first scheduled run has completed successfully.

T07 - Observability And Missed-Run Handling

id: CUST-WP-0045-T07
status: todo
priority: medium
depends_on: [CUST-WP-0045-T06]
state_hub_task_id: "b977c721-cadc-461f-8ffb-715d438e4c31"

Document and, where cheap, automate how to tell whether the daily run happened.

The runbook should include:

Temporal schedule and workflow checks
activity-core ActivityRun query
State Hub daily_triage progress-event query
working-memory note path check
expected behavior when the activity-core host is offline at 07:20
the chosen missed-run behavior: skip, not catch-up

Done when the operator can answer "did it run today?" from owned telemetry without inspecting Codex Desktop session internals.

T08 - Three Daily Runs And CUST-WP-0044 Calibration

id: CUST-WP-0045-T08
status: todo
priority: medium
depends_on: [CUST-WP-0045-T06, CUST-WP-0045-T07]
state_hub_task_id: "f4a985fd-8cce-4175-983e-cf3b437e19a5"

Run three consecutive daily canaries from activity-core and compare the recommendations with actual follow-up work.

Feed the result back into CUST-WP-0044-T06:

calibrate WSJF scoring weights
tune report length
adjust loose-end detection thresholds
confirm stale-but-intentionally-parked work is treated correctly
decide whether daily notes are useful enough as a standing habit

Done when CUST-WP-0044 can close its calibration task using activity-core runs, not Codex app automation runs.

Implementation Notes - 2026-05-19

T01 is complete. The 2026-05-19 failed Codex automation run is captured in this workplan's context, and the runner boundary is explicit: activity-core owns the schedule, retries, context resolution, run log, and audit trail; State Hub stays the read model and progress sink; the-custodian owns the prompt and guardrails.

T02 is complete in activity-core. The existing state-hub context resolver now supports the daily triage queries state_summary, next_steps, workplan_index, and hub_inbox while preserving domain_summary and repo_sbom_status. Resolver failures return {} so the workflow can degrade to offline context instead of failing the whole run.

T03 is complete in activity-core. RunActivityWorkflow now evaluates instruction blocks after rules, using the existing instruction executor and a small llm-connect HTTP client boundary. Instruction results carry task specs, optional report payloads, prompt hash, model, validation status, review flag, and condition metadata. A lightweight daily triage report schema is available at schemas/daily-triage-report.json so report payloads can be validated under test before T04 wires the deterministic working-memory and State Hub sinks.

T04 is complete in activity-core. Instruction definitions can now declare report_sinks; report payloads are persisted through deterministic sink code instead of model-authored file operations. The first two sink types are working-memory and state-hub-progress. Working-memory writes refuse canonical Custodian canon/ and workplans/ paths, use run-id/date based idempotency, and State Hub progress posting deduplicates by activity run id and instruction id before posting.

T05 is complete. The daily triage ActivityDefinition now uses a single trusted scalar context.daily_triage_digest instead of raw State Hub JSON. The digest is built in activity-core from safe identifiers, counts, statuses, priority fields, health labels, and shortened titles, while excluding task descriptions, message bodies, and other free-text command surfaces. The digest also carries a deterministic_scoring extension marker so a later high-criticality path can move especially high-gain/high-effort candidate scoring into code without changing the ActivityDefinition contract.

T06 is partially validated but blocked before cutover. A local activity-core dev stack was started, the Custodian ActivityDefinition directory synced into activity-core, and the paused Temporal schedule for the disabled daily triage definition was created. The first sync exposed reusable activity-core gaps that were fixed there instead of bypassed here:

file-authored ActivityDefinition slug ids now map to stable UUIDv5 DB ids
schedule sync no longer uses raw NOT IN :ids SQL that asyncpg rejects
ADR-style context sources without an explicit name validate against the domain model
the worker now registers the existing instruction/report activities

Manual trigger canary evidence, using a local-only llm-connect mock response so no State Hub digest data left the workstation:

workflow id: activity-6fca51fa-387a-4fd0-bc4e-d62c29eb859a:manual-6a6e5950-2338-45c4-9054-573dda9c87cc
Temporal status: COMPLETED
activity-core run id: 2164cb88-8415-5c96-9e31-e47a41cf4e67
working-memory note: memory/working/daily-triage-2026-05-19-2164cb88.md
State Hub progress event: e42c0ada-8111-4d88-9791-821252cd04a2

The real Claude-backed llm-connect trigger was not run in that pass. The execution wrapper blocked it because private State Hub workstream/task digest data would be sent to an external LLM provider. The operator then clarified that llm-connect is the intended backend boundary for LLM providers and depth tuning. Follow-up implementation keeps that boundary explicit: activity-core passes model/depth configuration through llm-connect, provider routing remains inside llm-connect, and the ActivityDefinition declares a balanced custodian-triage-balanced profile for calibration.

The llm-connect depth path is now reusable instead of daily-triage-specific:

activity-core InstructionDef accepts temperature, max_tokens, max_depth, and model_params
activity-core sends those values to llm-connect as RunConfig
llm-connect server mode now preserves the full RunConfig via RunConfig.from_dict
the daily triage ActivityDefinition starts with model: custodian-triage-balanced, max_depth: 2, and model_params.reasoning_effort: medium

Remaining T06 work is now operational cutover: run the real llm-connect backend selected by the operator, verify real report quality, pause Codex automation, set the ActivityDefinition to enabled: true, sync schedules, and check the next 07:20 run.

Verification:

uv run pytest tests/test_state_hub_context_resolver.py -q: 6 passed
activity-core parser validation with ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian: parsed the daily triage definition, cron trigger, trusted instruction, and report sinks
uv run pytest -q in activity-core: 107 passed, 1 skipped
activity-core focused T06 validation: uv run pytest tests/test_sync_activity_definitions.py tests/test_instruction_evaluation.py tests/test_report_sinks.py -q: 10 passed
activity-core full suite after T06 fixes: uv run pytest -q: 110 passed, 1 skipped
activity-core llm-connect depth pass-through full suite: uv run pytest -q: 114 passed, 1 skipped
llm-connect focused server validation: uv run pytest tests/test_server.py -q: 10 passed
llm-connect full suite: PYTHONPATH=. uv run pytest -q: 173 passed

Implementation Notes - 2026-05-21

T06 remains in progress; no cutover was performed and the Codex automation must remain the fallback runner. The daily triage ActivityDefinition is still enabled: false.

Real llm-connect canary attempt 1 reached the activity-core workflow but failed before report persistence:

workflow id: activity-6fca51fa-387a-4fd0-bc4e-d62c29eb859a:manual-d0317873-5e09-4849-a57a-6edff7fada2c
Temporal status: COMPLETED
activity-core run id: 9b8486b5-0495-5d3f-8b7b-dc078a7c097b
worker evidence: llm-connect returned HTTP 200 twice, but activity-core rejected the instruction output as invalid JSON
persistence evidence: no working-memory note and no State Hub daily_triage progress event were written

Diagnosis showed that server-mode llm-connect was resolving the older /usr/bin/claude CLI instead of the working user install at /home/worsch/.local/bin/claude. A direct llm-connect probe through the older CLI returned the literal content Execution error, while the user install could return raw JSON. Restarting llm-connect with the user CLI path made a small probe return {"ok": true} through the HTTP boundary.

Real llm-connect canary attempt 2 used the working Claude CLI path but still did not produce a persisted report:

workflow id: activity-6fca51fa-387a-4fd0-bc4e-d62c29eb859a:manual-2de56ad6-0f82-48f0-8184-f357bd22f658
Temporal status: COMPLETED
activity-core run id: 953a1f46-e57b-58e1-b4a2-2e41e804a972
worker evidence: first llm-connect call returned HTTP 200, then activity-core retried because the output was not schema-valid JSON; the retry returned HTTP 500
persistence evidence: no working-memory note and no State Hub daily_triage progress event were written

The follow-up fix keeps the existing activity-core/llm-connect boundary:

activity-core now loads an instruction's existing output_schema and forwards that schema to llm-connect as model_params.json_schema
llm-connect's Claude Code adapter now prefers LLM_CONNECT_CLAUDE_CLI_PATH, CLAUDE_CLI_PATH, or the user-local /home/worsch/.local/bin/claude before falling back to claude
llm-connect's Claude Code adapter maps model_params.json_schema to the native Claude CLI --json-schema option
the Custodian ActivityDefinition now points at the domain-owned absolute schema path /home/worsch/the-custodian/schemas/daily-triage-report.json and asks for JSON only as a fallback

The patched schema probe could not be completed because the local Claude Code session limit was reached; the CLI reported: You've hit your session limit · resets 3:40am (Europe/Berlin).

Next T06 step after the limit resets, or after llm-connect routes this profile to another approved provider, is to rerun the manual trigger with the patched schema path and verify all three evidence surfaces before pausing Codex or enabling the activity-core schedule.

Verification:

activity-core focused executor tests: uv run pytest tests/rules/test_executor.py -q: 22 passed
llm-connect focused Claude Code/factory tests: PYTHONPATH=. uv run pytest tests/test_claude_code.py tests/test_factory.py -q: 18 passed
activity-core full suite: uv run pytest -q: 115 passed, 1 skipped
llm-connect full suite: PYTHONPATH=. uv run pytest -q: 175 passed

Implementation Notes - 2026-05-23

CUST-WP-0046 verified the separate hourly RecentlyOnScope activity-core schedule, but did not retire the Codex app automation daily-state-hub-wsjf-triage. That automation is still the fallback for this daily WSJF triage cutover while this workplan's ActivityDefinition remains enabled: false.

The State Hub decision recorded under CUST-WP-0046 is to keep the Codex automation active until this workplan completes its own daily WSJF canary and explicit pause/delete cutover step.

Implementation Notes - 2026-06-01

T06 remains in_progress. No canary was rerun in this session. The blocker is the same as on 2026-05-21: the patched real-LLM probe was never executed after the Claude CLI session limit cleared. Eleven days passed without a retry; the activity-core dev stack is currently down (only infra-postgres-1 from another project is running on the workstation).

The fixes that should make the canary succeed are merged and unchanged:

activity-core cf92f0d — forward instruction output_schema to llm-connect as model_params.json_schema
activity-core 5c4f96e — pass temperature, max_tokens, max_depth, model_params through to llm-connect RunConfig
llm-connect b12d1af — Claude Code adapter maps json_schema to native --json-schema CLI option
llm-connect 82e3c07 — server mode preserves the full RunConfig

Session deliverable: a separate cutover runbook capturing the exact host-mode command sequence to bring the dev stack up, sync the ActivityDefinition from ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian, run the smoke probe, trigger the canary, verify all three evidence surfaces, and only then flip enabled: true and sync schedules:

file: workplans/CUST-WP-0045-cutover-runbook.md
commit: 8ef5399 Add CUST-WP-0045 T06 cutover runbook

The runbook also documents two operational gotchas the 2026-05-21 attempts hit:

The repo .env uses Docker network hostnames (temporal:7233, app-db:5432, nats:4222); host-mode worker/API processes must override these on the command line because make auto-loads the file.
Triggering the canary from inside an active Claude Code session shares the CLI quota with llm-connect's Claude Code adapter and can reproduce the Execution error / HTTP 500 failure. Run from a fresh terminal account.

The current daily_triage_digest (built with the live State Hub and the ActivityDefinition's params) was inspected: 10,175 bytes, totals across 13 active topics / 13 active workstreams / 149 todo tasks, 12 open workstreams in the digest with hf-wp-0001, cust-wp-0044, cust-wp-0045, cust-wp-0046 at the priority head. This is substantive context — a working real-LLM canary should produce non-trivial recommendations, not the "summary":"ok" stub from the 2026-05-19 mocked run.

T07 and T08 remain todo. The cutover runbook overlaps with T07's evidence queries but does not cover T07's steady-state "did it run today?" framing or the missed-run policy documentation, so T07 is not satisfied by this session.

Acceptance Criteria

The daily State Hub WSJF triage runs from activity-core, not Codex app cron.
The Codex app automation is disabled or removed before the activity-core schedule is enabled.
The daily run leaves all three evidence surfaces: working-memory note, State Hub daily_triage progress event, and activity-core ActivityRun/Temporal history.
"Did it run today?" can be answered from State Hub and activity-core telemetry.
A powered-off workstation no longer matters once activity-core is running on the chosen always-on host.
If the chosen activity-core host is offline at 07:20, the missed run is skipped by policy and the absence is visible in the runbook checks.
CUST-WP-0044's three-run calibration is completed using the new runner.

Notes

The immediate Codex app automation failure could be patched by chasing the Windows/WSL launch path issue. That is not the preferred durable fix. The preferred fix is to make the existing activity-core ActivityDefinition the primary runner and keep all scheduling, audit, context resolution, and failure visibility in owned infrastructure.

23 KiB Raw Blame History