activity-core/workplans/ACTIVITY-WP-0010-daily-triage-llm-reconciliation.md
tegwick dbd2fbb11c docs(workplan): record railiance01 llm-connect smoke evidence
Note the 2026-06-19 live reconciliation on railiance01: llm-connect
deployed, worker restarted with LLM_CONNECT_URL, fixture smoke passed.
Manual daily triage still blocked on actcore-state-hub-bridge reachability.
2026-06-19 15:58:04 +02:00

id: ACTIVITY-WP-0010
type: workplan
title: Daily Triage LLM Reconciliation And Evidence
domain: custodian
repo: activity-core
status: blocked
owner: codex
topic_slug: custodian
created: 2026-06-18
updated: 2026-06-19
state_hub_workstream_id: f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9

ACTIVITY-WP-0010 - Daily Triage LLM Reconciliation And Evidence

Context

This workplan implements the in-scope portion of the latest activity-core suggestion review against INTENT.md and SCOPE.md.

Relevant accepted suggestion:

  • State Hub message 6a098e1e-65de-4309-ab4a-446aba2f3587 from llm-connect says LLM-WP-0006 is complete on the llm-connect side. The stable Service URL is http://llm-connect.activity-core.svc.cluster.local:8080, the timeout remains 300 seconds, the provider Secret reports a populated key count, and the in-namespace fixture smoke passed with schema-valid endpoint behavior.

Why this belongs in activity-core:

  • INTENT.md says activity-core owns the when/what/where loop for scheduled coordination work.
  • SCOPE.md keeps LLM instruction execution in scope through the llm-connect boundary, while keeping provider credentials and cluster reconciliation out of scope.
  • ACTIVITY-WP-0006-T03 and ACTIVITY-WP-0009-T01 remain open because daily State Hub WSJF triage has not yet produced three clean scheduled runs after the June 7 runtime projection failure.

Suggestions reviewed but not accepted as product/runtime implementation work:

  • coding_retro activity-core suggestions for Bash tool thrash, schema thrash, and read-before-edit hygiene are agent workflow advice. They are useful for Codex operating style, but they do not change activity-core's Event Bridge product surface and should not become runtime code.
  • The earlier local-kubectl / cluster-owned evidence suggestion for ACTIVITY-WP-0007 has already been handled by moving live evidence ownership to Railiance and closing the workplan from cluster-owned proof.

Latest evidence before this workplan:

  • State Hub daily_triage progress on 2026-06-18 still shows LLM_CONNECT_URL is not configured, which means the live activity-core runtime has not yet consumed the repo-side URL update.
  • k8s/railiance/20-runtime.yaml now sets the verified llm-connect Service URL and LLM_CONNECT_TIMEOUT_SECONDS=300.

Confirm Repo-Side Runtime Contract

id: ACTIVITY-WP-0010-T01
status: done
priority: high
state_hub_task_id: "dd52ce21-23b8-4e46-b3af-cb7bf486e40f"

Update activity-core's Railiance runtime projection so the daily triage worker consumes the verified llm-connect Service URL by default.

Done when:

  • k8s/railiance/20-runtime.yaml sets LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080.
  • LLM_CONNECT_TIMEOUT_SECONDS=300 remains configured.
  • Wiring tests assert the URL and timeout.
  • The Railiance README states that provider credentials remain operator-owned and outside Git / State Hub.

2026-06-18: Completed. Updated the runtime ConfigMap, README, and tests/test_railiance_ops_inventory_wiring.py. Focused tests passed: tests/test_railiance_ops_inventory_wiring.py tests/test_llm_client.py reported 9 passed.
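
The wiring assertion can be sketched as follows. This is a minimal illustration, not the contents of tests/test_railiance_ops_inventory_wiring.py: it assumes the ConfigMap's `data` mapping has already been parsed into a plain dict, and the helper name `find_wiring_mismatches` is hypothetical. Only the two expected values come from this workplan.

```python
from __future__ import annotations

# Expected values are taken from this workplan; everything else is illustrative.
EXPECTED = {
    "LLM_CONNECT_URL": "http://llm-connect.activity-core.svc.cluster.local:8080",
    "LLM_CONNECT_TIMEOUT_SECONDS": "300",
}


def find_wiring_mismatches(
    config_data: dict[str, str],
) -> dict[str, tuple[str | None, str]]:
    """Return {key: (actual, expected)} for every key that is missing or wrong."""
    return {
        key: (config_data.get(key), expected)
        for key, expected in EXPECTED.items()
        if config_data.get(key) != expected
    }


# Hypothetical stand-in for the parsed `data` section of the runtime ConfigMap.
sample_data = {
    "LLM_CONNECT_URL": "http://llm-connect.activity-core.svc.cluster.local:8080",
    "LLM_CONNECT_TIMEOUT_SECONDS": "300",
}
assert find_wiring_mismatches(sample_data) == {}
```

Returning the mismatches rather than a bare boolean keeps a failing test message specific about which key drifted.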

Reconcile Live Railiance Runtime

id: ACTIVITY-WP-0010-T02
status: wait
priority: high
state_hub_task_id: "23545ddc-926b-485a-8535-5cc11e01134a"

Apply or reconcile the updated activity-core Railiance runtime through the cluster-owned deployment path, not through ad hoc local kubectl from this repo.

Done when non-secret evidence shows:

  • live actcore-runtime-config has the verified LLM_CONNECT_URL and timeout;
  • the activity-core worker has restarted or otherwise consumed the new config;
  • activity-core/llm-connect-provider-secrets remains present with a populated key count only, without printing or storing secret values;
  • the State Hub bridge remains reachable from the activity-core runtime.
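
One way to keep the Secret evidence non-secret is to reduce the object to its name and a populated key count before anything is logged or stored. The sketch below assumes a Secret-shaped dict with `metadata.name` and `data` fields. The helper `summarize_secret` and the placeholder key are hypothetical, and this is not the operator's actual procedure.

```python
from __future__ import annotations


def summarize_secret(secret: dict) -> dict:
    """Reduce a Secret-like mapping to non-secret evidence: name and key count."""
    data = secret.get("data") or {}
    # Count keys that carry a non-empty value; never copy the values themselves.
    populated = sum(1 for value in data.values() if value)
    return {
        "name": secret.get("metadata", {}).get("name"),
        "populated_key_count": populated,
    }


# Hypothetical object: the key name and value are placeholders, not credentials.
example = {
    "metadata": {"name": "llm-connect-provider-secrets"},
    "data": {"PLACEHOLDER_KEY": "cGxhY2Vob2xkZXI="},
}
evidence = summarize_secret(example)
print(evidence)
```

Because the summary holds only a name and an integer, it can be attached to a State Hub message without leaking key names or values.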

Current wait reason: this is Railiance/operator-owned live cluster work. State Hub handoff message 9a074b7c-4b87-4e3c-a6bf-e1fe5580daa8 asks railiance-cluster to reconcile the updated config and smoke it.

2026-06-19 recheck:

  • Deployed llm-connect into the activity-core namespace on railiance01 (the cluster that runs actcore-worker). Previously only coulombcore had llm-connect, and the in-cluster Service URL is cluster-local.
  • actcore-runtime-config already exposed the verified URL and timeout; deployment/actcore-worker was restarted and now reports LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080.
  • llm-connect-provider-secrets reports DATA 1; no Secret values were inspected.
  • Worker health probe to llm-connect /health returns {"status": "ok"}.
  • actcore-state-hub-bridge remains 0/1 Ready with upstream timeouts, so T02 is not fully closed until the node-local State Hub tunnel is restored.

Run Daily Triage Fixture Smoke

id: ACTIVITY-WP-0010-T03
status: wait
priority: high
state_hub_task_id: "10e0df77-c230-4a82-b720-23c66bd17c0a"

After T02, run a manual or smoke execution of daily-statehub-wsjf-triage against the live activity-core runtime.

Done when:

  • the run calls llm-connect through the configured Service URL;
  • llm-connect returns content accepted as schema-valid daily-triage JSON;
  • State Hub receives a daily_triage progress item with output_validated=true;
  • the working-memory daily-triage note exists at the path recorded in State Hub detail;
  • scripts/verify_daily_triage.py reports the smoke/manual run as present.
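
The State Hub side of this gate can be expressed as a small check over one progress item. The sketch below is illustrative only and does not describe what scripts/verify_daily_triage.py does: the field names `kind` and `detail.note_path` are assumptions about the progress item shape, and only `output_validated` is named in this workplan.

```python
from __future__ import annotations

from typing import Callable


def smoke_gate_failures(item: dict, note_exists: Callable[[str], bool]) -> list[str]:
    """Return the reasons a progress item fails the smoke gate (empty if it passes)."""
    failures: list[str] = []
    if item.get("kind") != "daily_triage":  # assumed field name
        failures.append("progress item is not daily_triage")
    if item.get("output_validated") is not True:
        failures.append("output_validated is not true")
    note_path = (item.get("detail") or {}).get("note_path")  # assumed field name
    if not note_path:
        failures.append("no working-memory note path recorded")
    elif not note_exists(note_path):
        failures.append(f"working-memory note missing: {note_path}")
    return failures


# Hypothetical progress item; the note path is a placeholder.
sample_item = {
    "kind": "daily_triage",
    "output_validated": True,
    "detail": {"note_path": "notes/daily-triage.md"},
}
assert smoke_gate_failures(sample_item, note_exists=lambda path: True) == []
```

Passing `note_exists` as a callable lets a real run use `os.path.exists` while tests substitute a stub.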

2026-06-19 recheck:

  • In-namespace llm-connect fixture smoke on railiance01 passed: smoke: pass health=ok latency_seconds=1.681 recommendations=1.
  • Manual POST /activity-definitions/6fca51fa-387a-4fd0-bc4e-d62c29eb859a/trigger reached llm-connect, but the workflow failed at persist_instruction_reports: the state-hub-progress sink returned Connection refused because actcore-state-hub-bridge is unhealthy.
  • T03 therefore remains open until State Hub bridge reachability is restored and a run emits non-secret daily_triage progress with output_validated=true.

Collect Three Clean Scheduled Runs

id: ACTIVITY-WP-0010-T04
status: wait
priority: high
state_hub_task_id: "dc6b9482-cf43-4fc5-994b-dcd7dea47db7"

Let the normal 07:20 Europe/Berlin schedule produce three consecutive clean daily triage runs after the live config reconciliation.

Done when:

  • three consecutive scheduled runs have Temporal workflow evidence, activity_runs rows, State Hub daily_triage progress, and working-memory notes;
  • none of the three runs are merely manual smoke tests or execution_failed diagnostics;
  • calibration feedback is recorded in State Hub;
  • ACTIVITY-WP-0006-T03 and ACTIVITY-WP-0009-T01 can move from wait to done.
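
The three-run gate above amounts to a streak check over recorded runs. The following sketch assumes each run has been summarized into a small record. The `RunEvidence` type, its field names, and the `"scheduled"` / `"ok"` values are hypothetical. It treats runs as consecutive when they are adjacent in date order and does not detect a skipped day.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvidence:
    run_date: str           # ISO date, e.g. "2026-06-20"
    trigger: str            # "scheduled" or "manual" (assumed values)
    status: str             # "ok" or e.g. "execution_failed" (assumed values)
    has_workflow: bool      # Temporal workflow evidence found
    has_activity_run: bool  # activity_runs row found
    has_progress: bool      # State Hub daily_triage progress found
    has_note: bool          # working-memory note found


def is_clean(run: RunEvidence) -> bool:
    """A run counts only if it was scheduled, succeeded, and has all four artifacts."""
    return (
        run.trigger == "scheduled"
        and run.status == "ok"
        and run.has_workflow
        and run.has_activity_run
        and run.has_progress
        and run.has_note
    )


def has_clean_streak(runs: list[RunEvidence], required: int = 3) -> bool:
    """True if `required` clean runs appear back to back in date order."""
    streak = 0
    for run in sorted(runs, key=lambda r: r.run_date):
        streak = streak + 1 if is_clean(run) else 0
        if streak >= required:
            return True
    return False
```

A manual smoke or an execution_failed diagnostic between two scheduled runs resets the streak, which matches the rule that such runs do not count toward the three.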

Close Handoff State

id: ACTIVITY-WP-0010-T05
status: wait
priority: medium
state_hub_task_id: "ecc57e21-1716-4daa-aba6-d8a6d824e4ed"

Update the surrounding workplans and State Hub once the live daily triage gate passes.

Done when:

  • ACTIVITY-WP-0006 records the three-run calibration evidence;
  • ACTIVITY-WP-0009 records the scheduled-run trust gap closure;
  • any temporary needs_human flags created for the llm-connect provider/config handoff are cleared or replaced by a narrower follow-up;
  • this workplan is marked finished.