activity-core/workplans/ACTIVITY-WP-0010-daily-triage-llm-reconciliation.md
tegwick dbd2fbb11c docs(workplan): record railiance01 llm-connect smoke evidence
Note the 2026-06-19 live reconciliation on railiance01: llm-connect
deployed, worker restarted with LLM_CONNECT_URL, fixture smoke passed.
Manual daily triage still blocked on actcore-state-hub-bridge reachability.
2026-06-19 15:58:04 +02:00

---
id: ACTIVITY-WP-0010
type: workplan
title: "Daily Triage LLM Reconciliation And Evidence"
domain: custodian
repo: activity-core
status: blocked
owner: codex
topic_slug: custodian
created: "2026-06-18"
updated: "2026-06-19"
state_hub_workstream_id: "f2c73ac6-13f0-4005-82cc-76c7c9f9c8b9"
---
# ACTIVITY-WP-0010 - Daily Triage LLM Reconciliation And Evidence
## Context
This workplan implements the in-scope portion of the latest activity-core
suggestion review against `INTENT.md` and `SCOPE.md`.
Relevant accepted suggestion:
- State Hub message `6a098e1e-65de-4309-ab4a-446aba2f3587` from
`llm-connect` says `LLM-WP-0006` is complete on the llm-connect side. The
stable Service URL is
`http://llm-connect.activity-core.svc.cluster.local:8080`, timeout remains
`300`, the provider Secret reports a populated key count, and the in-namespace
fixture smoke passed with schema-valid endpoint behavior.
Why this belongs in activity-core:
- `INTENT.md` says activity-core owns the **when/what/where** loop for
scheduled coordination work.
- `SCOPE.md` keeps LLM instruction execution in scope through the llm-connect
boundary, while keeping provider credentials and cluster reconciliation out of
scope.
- `ACTIVITY-WP-0006-T03` and `ACTIVITY-WP-0009-T01` remain open because daily
State Hub WSJF triage has not yet produced three clean scheduled runs after
the June 7 runtime projection failure.
Suggestions reviewed but not accepted as product/runtime implementation work:
- `coding_retro` activity-core suggestions for Bash tool thrash, schema thrash,
and read-before-edit hygiene are agent workflow advice. They are useful for
Codex operating style, but they do not change activity-core's Event Bridge
product surface and should not become runtime code.
- The earlier local-kubectl / cluster-owned evidence suggestion for
`ACTIVITY-WP-0007` has already been handled by moving live evidence ownership
to Railiance and closing the workplan from cluster-owned proof.
Latest evidence before this workplan:
- State Hub `daily_triage` progress on 2026-06-18 still shows
`LLM_CONNECT_URL is not configured`, which means the live activity-core
runtime has not yet consumed the repo-side URL update.
- `k8s/railiance/20-runtime.yaml` now sets the verified llm-connect Service URL
and `LLM_CONNECT_TIMEOUT_SECONDS=300`.
## Confirm Repo-Side Runtime Contract
```task
id: ACTIVITY-WP-0010-T01
status: done
priority: high
state_hub_task_id: "dd52ce21-23b8-4e46-b3af-cb7bf486e40f"
```
Update activity-core's Railiance runtime projection so the daily triage worker
consumes the verified llm-connect Service URL by default.
Done when:
- `k8s/railiance/20-runtime.yaml` sets
`LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080`.
- `LLM_CONNECT_TIMEOUT_SECONDS=300` remains configured.
- Wiring tests assert the URL and timeout.
- The Railiance README states that provider credentials remain operator-owned
and outside Git / State Hub.
2026-06-18: Completed. Updated the runtime ConfigMap, README, and
`tests/test_railiance_ops_inventory_wiring.py`. Focused tests passed:
`tests/test_railiance_ops_inventory_wiring.py tests/test_llm_client.py`
reported 9 passed.
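
The URL and timeout assertions from the "Done when" list can be sketched as below. This is a minimal illustration, not the contents of `tests/test_railiance_ops_inventory_wiring.py`: the helper name `check_runtime_contract` is hypothetical, and it assumes the ConfigMap `data` section has already been loaded into a plain dict of strings.

```python
# Hypothetical helper: the real wiring test may be structured differently.
EXPECTED_URL = "http://llm-connect.activity-core.svc.cluster.local:8080"
EXPECTED_TIMEOUT = "300"  # ConfigMap values are always strings


def check_runtime_contract(data: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the contract holds."""
    problems = []
    if data.get("LLM_CONNECT_URL") != EXPECTED_URL:
        problems.append("LLM_CONNECT_URL does not match the verified Service URL")
    if data.get("LLM_CONNECT_TIMEOUT_SECONDS") != EXPECTED_TIMEOUT:
        problems.append("LLM_CONNECT_TIMEOUT_SECONDS is not 300")
    return problems
```

Returning a list of problems rather than raising on the first mismatch lets one test run report both settings at once.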
## Reconcile Live Railiance Runtime
```task
id: ACTIVITY-WP-0010-T02
status: wait
priority: high
state_hub_task_id: "23545ddc-926b-485a-8535-5cc11e01134a"
```
Apply or reconcile the updated activity-core Railiance runtime through the
cluster-owned deployment path, not through ad hoc local kubectl from this repo.
Done when non-secret evidence shows:
- live `actcore-runtime-config` has the verified `LLM_CONNECT_URL` and timeout;
- the activity-core worker has restarted or otherwise consumed the new config;
- `activity-core/llm-connect-provider-secrets` remains present with a populated
key count only, without printing or storing secret values;
- the State Hub bridge remains reachable from the activity-core runtime.
Current wait reason: this is Railiance/operator-owned live cluster work. State
Hub handoff message `9a074b7c-4b87-4e3c-a6bf-e1fe5580daa8` asks
`railiance-cluster` to reconcile the updated config and smoke it.
2026-06-19 recheck:
- Deployed `llm-connect` into the `activity-core` namespace on `railiance01`
(the cluster that runs `actcore-worker`). Before this, llm-connect was deployed
on `coulombcore` but not on `railiance01`, and the in-cluster Service URL only
resolves within its own cluster.
- `actcore-runtime-config` already exposed the verified URL and timeout;
`deployment/actcore-worker` was restarted and now reports
`LLM_CONNECT_URL=http://llm-connect.activity-core.svc.cluster.local:8080`.
- `llm-connect-provider-secrets` reports `DATA 1`; no Secret values were
inspected.
- Worker health probe to llm-connect `/health` returns `{"status": "ok"}`.
- `actcore-state-hub-bridge` remains `0/1` Ready with upstream timeouts, so T02
is not fully closed until the node-local State Hub tunnel is restored.
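
The `/health` check recorded above could be reproduced with a probe along these lines. It is a sketch under assumptions: the function names are hypothetical, and the only behavior taken from the evidence is that a healthy llm-connect answers `/health` with `{"status": "ok"}`. It reads no Secret values.

```python
import json
import urllib.request


def is_healthy(body: bytes) -> bool:
    """True only when the body is a JSON object whose "status" is "ok"."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def probe_health(base_url: str, timeout: float = 5.0) -> bool:
    """GET <base_url>/health; any network or HTTP error counts as unhealthy."""
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.read())
    except OSError:  # URLError, HTTPError, and timeouts all derive from OSError
        return False
```

Separating the body check from the network call keeps the acceptance rule testable without cluster access.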
## Run Daily Triage Fixture Smoke
```task
id: ACTIVITY-WP-0010-T03
status: wait
priority: high
state_hub_task_id: "10e0df77-c230-4a82-b720-23c66bd17c0a"
```
After T02, run a manual or smoke execution of
`daily-statehub-wsjf-triage` against the live activity-core runtime.
Done when:
- the run calls llm-connect through the configured Service URL;
- llm-connect returns content accepted as schema-valid daily-triage JSON;
- State Hub receives a `daily_triage` progress item with `output_validated=true`;
- the working-memory daily-triage note exists at the path recorded in State Hub
detail;
- `scripts/verify_daily_triage.py` reports the smoke/manual run as present.
2026-06-19 recheck:
- In-namespace llm-connect fixture smoke on `railiance01` passed:
`smoke: pass health=ok latency_seconds=1.681 recommendations=1`.
- Manual `POST /activity-definitions/6fca51fa-387a-4fd0-bc4e-d62c29eb859a/trigger`
reached llm-connect, but the workflow failed at `persist_instruction_reports`
with `state-hub-progress` sink `Connection refused` while
`actcore-state-hub-bridge` is unhealthy.
- T03 therefore remains open until State Hub bridge reachability is restored and
a run emits non-secret `daily_triage` progress with `output_validated=true`.
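
For evidence collection, the smoke output line quoted above can be turned into structured fields. The parser below is a hypothetical sketch that assumes the line always has the shape `smoke: <verdict> key=value ...` seen in the 2026-06-19 recheck; the actual smoke tool's output format is not otherwise specified here.

```python
def parse_smoke_line(line: str) -> dict[str, str]:
    """Split 'smoke: <verdict> k=v k=v ...' into a flat dict of strings."""
    prefix, _, rest = line.partition(":")
    if prefix.strip() != "smoke":
        raise ValueError(f"not a smoke line: {line!r}")
    tokens = rest.split()
    if not tokens:
        raise ValueError("smoke line has no verdict")
    fields = {"verdict": tokens[0]}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


def smoke_passed(line: str) -> bool:
    """A pass requires both the 'pass' verdict and health=ok."""
    fields = parse_smoke_line(line)
    return fields["verdict"] == "pass" and fields.get("health") == "ok"
```

A passing fixture smoke alone does not close T03: the State Hub `daily_triage` progress item with `output_validated=true` is still required.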
## Collect Three Clean Scheduled Runs
```task
id: ACTIVITY-WP-0010-T04
status: wait
priority: high
state_hub_task_id: "dc6b9482-cf43-4fc5-994b-dcd7dea47db7"
```
Let the normal 07:20 Europe/Berlin schedule produce three consecutive clean
daily triage runs after the live config reconciliation.
Done when:
- three consecutive scheduled runs have Temporal workflow evidence,
`activity_runs` rows, State Hub `daily_triage` progress, and working-memory
notes;
- none of the three runs is merely a manual smoke test or an `execution_failed`
diagnostic;
- calibration feedback is recorded in State Hub;
- `ACTIVITY-WP-0006-T03` and `ACTIVITY-WP-0009-T01` can move from `wait` to
`done`.
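
The three-run gate can be expressed as a small streak check. This is an illustrative sketch, not `scripts/verify_daily_triage.py`: the `RunEvidence` fields and the `"scheduled"` / `"completed"` values are hypothetical, and it assumes manual runs are ignored (they neither count toward nor break the streak) while any unclean scheduled run resets it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvidence:
    trigger: str            # hypothetical values: "scheduled" or "manual"
    status: str             # e.g. "completed" or "execution_failed"
    output_validated: bool  # State Hub progress reported output_validated=true
    has_note: bool          # working-memory note exists at the recorded path


def is_clean(run: RunEvidence) -> bool:
    """A scheduled run is clean when it completed with validated output and a note."""
    return run.status == "completed" and run.output_validated and run.has_note


def has_clean_streak(runs: list[RunEvidence], required: int = 3) -> bool:
    """Runs are ordered oldest first; True once `required` clean scheduled runs occur in a row."""
    streak = 0
    for run in runs:
        if run.trigger != "scheduled":
            continue  # manual smoke runs are skipped, not counted
        streak = streak + 1 if is_clean(run) else 0
        if streak >= required:
            return True
    return False
```

If manual runs should instead break the streak, the `continue` branch would reset `streak` to zero.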
## Close Handoff State
```task
id: ACTIVITY-WP-0010-T05
status: wait
priority: medium
state_hub_task_id: "ecc57e21-1716-4daa-aba6-d8a6d824e4ed"
```
Update the surrounding workplans and State Hub once the live daily triage gate
passes.
Done when:
- `ACTIVITY-WP-0006` records the three-run calibration evidence;
- `ACTIVITY-WP-0009` records the scheduled-run trust gap closure;
- any temporary `needs_human` flags created for the llm-connect provider/config
handoff are cleared or replaced by a narrower follow-up;
- this workplan is marked `finished`.