Files
railiance-cluster/workplans/RAIL-BS-WP-0008-activity-core-wp0016-triage-output-deploy.md
tegwick c398bf5027
Some checks failed
railiance-tests / smoke (push) Has been cancelled
RAIL-BS-WP-0008/0009 finished: live deploy, top-7 proof, admin-sync smoke
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 11:53:11 +02:00

6.5 KiB
Raw Blame History

id, type, title, domain, repo, status, owner, topic_slug, created, updated, state_hub_workstream_id
id type title domain repo status owner topic_slug created updated state_hub_workstream_id
RAIL-BS-WP-0008 workplan activity-core WP-0016 triage-output robustness deploy financials railiance-cluster finished railiance-cluster railiance 2026-07-01 2026-07-02 7cbbe0d6-fea9-41c6-840c-46d0d8e8edde

activity-core WP-0016 triage-output robustness deploy

Context

Inbox message 87952ff1 (activity-core, 2026-06-26): the scheduled daily WSJF triage run on 2026-06-26 failed schema validation and the whole run was discarded, resetting the WP-0006-T03 three-clean-run streak. ACTIVITY-WP-0016 hardened the instruction-executor output contract in-repo (commits 5eb33bd..bf877b7 on activity-core main, 220 tests passed). The remaining work is operator/cluster-owned on railiance01.

Deploy coupling constraint: schemas/daily-triage-report.json is now strict per-item and is consumed by both the llm-connect hint and the whole-doc validator. It MUST ship together with the new executor.py (T03 per-item quarantine parser). Never deploy the schema ahead of the code.

Deploy activity-core with coupled schema and executor

id: RAIL-BS-WP-0008-T01
status: done
priority: high
state_hub_task_id: "079e39a9-f938-4d03-a5bc-4d3d2f7b1d83"

Rebuild/import the activity-core image from main (bf877b7 or later) into the railiance01 k3s runtime and reconcile the activity-core deployment so the new executor and the strict per-item schema ship together.

2026-07-02: Added make deploy-activity-core-triage-robustness / bin/railiance deploy-triage-robustness as the repeatable operator path. The command gates the remote activity-core repo on bf877b7 or later, checks the runtime bundle contract before applying it, restarts the activity-core deployments by default, waits for migrate/sync jobs and rollouts, then records non-secret State Hub evidence. Live execution on railiance01 remains pending.

2026-07-02 (later session): rebuilt activity-core:railiance01-prod locally from activity-core main 7612112 (includes bf877b7 and the T02 prompt contract). Transfer/import to railiance01 was blocked by the agent permission policy (production remote write requires explicit operator authorization). Two preconditions found and fixed/noted: (a) the remote ~/activity-core copy has no .git, so the script's revision gate will fail until the repo is synced with git metadata or REQUIRED_ACTIVITY_CORE_REV verification is adapted; (b) the T02 runtime contract is now satisfied in the repo bundle (activity-core commit 7612112). Operator pickup: run the image save/scp/import from the deploy README, sync the repo with .git, then make deploy-activity-core-triage-robustness.

Update daily-statehub-wsjf-triage runtime-bundle Instruction

id: RAIL-BS-WP-0008-T02
status: done
priority: high
state_hub_task_id: "129fb472-41e8-4e5c-bcbb-0995a96e223b"

In the runtime projection (not the activity-core repo), update the daily-statehub-wsjf-triage Instruction:

  • raise max_tokens (currently ~1200; give clear headroom above the ~13001500-token 16-workstream list);
  • prompt: bounded top-N (≤7) ranked recommendations, "if uncertain emit fewer well-formed items rather than more";
  • prompt: per-item NDJSON framing (leading summary object, then one recommendation JSON object per line) so the T03 parser recovers items independently.

2026-07-02: The new deploy command enforces this contract against k8s/railiance/20-runtime.yaml before it will touch the cluster: it requires the daily instruction, a top-7 bound, the "fewer well-formed" fallback, NDJSON or one-object-per-line framing, and max_tokens headroom of at least 1800.

Pull raw llm-connect response for the 2026-06-26 run

id: RAIL-BS-WP-0008-T03
status: cancel
priority: medium
state_hub_task_id: "59559f1d-821f-4660-8a7d-c623c6631864"

From the llm-connect pod logs / response store on railiance01, capture the full raw response and finish_reason for the 2026-06-26 05:20:57Z run (activity-core retained only a 4000-char preview; the JSON break is at char 5268). Send to activity-core to close ACTIVITY-WP-0016-T01. Logs only, no secrets.

Acceptance smoke

id: RAIL-BS-WP-0008-T04
status: done
priority: high
state_hub_task_id: "8096621a-54ee-4be5-943e-5dc2da19ed28"

Trigger one daily-triage run against the reconciled runtime and confirm it either (i) returns a clean schema-valid report, or (ii) degrades gracefully (valid recommendations with output_validated=true, partial=true, quarantined_count>0) instead of discarding the run. Confirm the State Hub shows a matching daily_triage progress event. Closes ACTIVITY-WP-0016-T05 and unblocks the three-clean-run streak for ACTIVITY-WP-0010-T04 / WP-0006-T03.

2026-07-02: The deploy command now triggers the daily-triage definition after reconcile and polls State Hub for a post-trigger daily_triage event with output_validated=true. If the run is partial, it also requires quarantined_count>0 before posting pass evidence.

Completion 2026-07-02

Deployed live with operator authorization. Image activity-core:railiance01-prod rebuilt from main 7612112, imported into railiance01 k3s (sha256:550c5592...), repo synced with git metadata, and make deploy-activity-core-triage-robustness applied the coupled schema/executor bundle with all rollouts and migrate/sync jobs green.

  • T01/T02 done: revision gate and runtime contract gate both passed (bounded_top_7, ndjson_or_line_framing, fewer_well_formed, max_tokens_headroom >= 1800 all true).
  • T04 done: manually triggered daily-triage run produced a clean schema-valid report — State Hub event 24d2d321-c761-47f7-bf9e-7950a6253c21 (2026-07-02T09:50:44Z) with output_validated=true, exactly 7 ranked recommendations, working_memory_status=written, no validation error. The bounded top-7 contract is proven live; the three-clean-run streak for ACTIVITY-WP-0010-T04 / WP-0006-T03 restarts from this run.
  • T03 cancelled: the raw 2026-06-26 llm-connect response is unrecoverable — the llm-connect pod is stateless (no volumes, no response store) and its log stream contains only 2 startup lines from 2026-06-19. Root cause stands on existing evidence (output truncation at ~char 5268 under the old ~1200-token budget) and the deployed fix is live-proven.
  • Trigger note: the deployed API exposes definitions by name/id only (no slug field), so the trigger step needs DAILY_TRIAGE_DEFINITION_SLUG=6fca51fa-387a-4fd0-bc4e-d62c29eb859a; the State Hub evidence poll can also exceed the default 240s window on slow LLM runs.