activity-core

Author	SHA1	Message	Date
tegwick	61f278d643	feat(ACTIVITY-WP-0016-T02): strict bounded daily-triage output schema Replace the accept-anything recommendations.items ({type: object}) with a strict per-item contract (required [rank, candidate, action, why] + typed wsjf) and a maxItems:7 hint. Strict item structure is what lets the T03 boundary parser validate each recommendation independently and quarantine only malformed ones. maxItems is a producer hint (prompt + llm-connect json_schema + T03 mitigation), NOT a hard reject — a hard maxItems reject would discard a whole 16-item report, the blast-radius bug WP-0016 removes. DEPLOY COUPLING: the strict schema is also consumed by the current whole-doc validator, so it must ship with T03's per-item quarantine parser; until then it increases whole-doc hard-fails. Prompt + max_tokens headroom + NDJSON framing are documented as a runtime-bundle handoff. Updated four tests to the strict contract; the forwarded-schema test now reads the live schema file instead of hard-coding it. Full suite: 213 passed, 1 skipped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 17:36:24 +02:00
tegwick	0e9e18a59a	chore(ACTIVITY-WP-0016-T01): record root-cause findings + partial failure fixture Local analysis of the 2026-06-26 daily-triage validation failure: the unbounded ~1-recommendation-per-workstream list (16 active workstreams; JSON break at char 5268, ~rank 8-9) is the structural cause; both the first attempt and the retry failed. The exact offending token and finish_reason are unrecoverable from activity-core data — complete() drops finish_reason/usage, the report sink caps raw output at 4000 chars (< 5268), and the log preview at 2000. Confirming the exact token needs llm-connect producer-side logs on railiance01 (operator-owned); mitigation (T02/T03) is identical regardless. Partial fixture captured. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 15:04:27 +02:00
tegwick	5eb33bd3bb	feat(ACTIVITY-WP-0016): register LLM output robustness & producer trust boundary workplan Add WP-0016 to make the instruction-executor output contract robust after the 2026-06-26 daily-triage validation failure (one malformed delimiter discarded a whole report). Per-item framing for error locality, verify-and-mitigate boundary parsing with a quarantine lane, producer-trust-boundary guardrails (ADR-004), and regression/calibration tests. Unblocks WP-0006-T03 / WP-0010-T04. Also record the 06-26 recheck outcome (streak reset at two) in WP-0006-T03. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 14:39:21 +02:00
tegwick	612c226472	chore(ACTIVITY-WP-0015): dedupe state_hub_workstream_id frontmatter Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 12:53:52 +02:00
tegwick	4b5e96d7c1	feat(ACTIVITY-WP-0014): close workplan — catchup_latest deployed & verified on railiance01 T04 done: built+deployed the WP-0014 image to railiance01, applied catchup_latest to daily-statehub-wsjf-triage, /admin/sync clean (6 defs, 4 schedules, 0 errors). Live schedule verified OverlapPolicy=BufferOne, CatchupWindow=1d; pods healthy. All tasks T01-T05 complete; beachhead-endpoint adoption tracked in WP-0015. Workplan status -> finished. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 12:52:54 +02:00
tegwick	65ef005c2d	docs(ACTIVITY-WP-0014): close T05 in-repo; split beachhead adoption to WP-0015 Idempotent-writes half of T05 is done in-repo; the externally-blocked endpoint adoption + actcore-state-hub-bridge proxy retirement move to ACTIVITY-WP-0015 (blocked on the state-hub beachhead) so WP-0014 can close on completed work. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-24 12:41:21 +02:00
tegwick	b2e57707a7	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-06-23: - ACTIVITY-WP-0014-T05: todo → progress	2026-06-23 21:39:28 +02:00
tegwick	88fe359385	feat(ACTIVITY-WP-0014): idempotency-keyed State Hub writes (T05, in-repo part) Add activity_core/state_hub_write: every State Hub write (report-sink, ops-evidence, schedule-miss) now sends a stable Idempotency-Key header derived from run_id:instruction_id:event_type. Makes writes safe to buffer/replay under the future state-hub beachhead without duplicate progress/triage events. The read-based _progress_exists dedup is now best-effort (returns False on connection error instead of hard-failing), so the guarantee lives on the keyed write rather than a live read. Tests + runbook note. Endpoint adoption / proxy retirement stays blocked on the state-hub beachhead capability. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 21:38:46 +02:00
tegwick	f90591c5f1	docs(ACTIVITY-WP-0014): rescope T05 to thin client under State Hub beachhead model Resilience (queue/cache) is handed to custodian/state-hub as a per-machine beachhead; activity-core keeps only idempotent writes + adopt-beachhead-endpoint and retires its bespoke actcore-state-hub-bridge proxy. Proposal sent to state-hub. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 21:18:01 +02:00
tegwick	cf7a11dcd9	docs(ACTIVITY-WP-0014): correct Motivation to match T01 findings Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 17:16:17 +02:00
tegwick	8424c13783	docs(ACTIVITY-WP-0014): T01 root cause — State Hub Connection refused, not misfire Live inspection of railiance01 (ssh + in-node kubectl/temporal) overturns the catchup_window hypothesis: the daily-triage schedule is healthy (CatchupWindow 365d default, 0 MissedCatchupWindow). The 2026-06-23T05:20Z fire ran but Failed at the report sink with '[Errno 111] Connection refused' posting to State Hub. railiance01 reaches State Hub via a reverse tunnel back to the workstation, which is unreachable at 07:20 Europe/Berlin (102 resolver timeouts in 24h). Mark T01 done; add T05 for resilient sinks/resolvers as the real incident fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 17:14:04 +02:00
tegwick	053d18b24a	feat(ACTIVITY-WP-0014): missed-fire detection & alert sink (T03) Add activity_core/schedule_health: a pure evaluate_schedule_health() verdict (built on Temporal's num_actions_missed_catchup_window plus a staleness check), an async check_schedule_health() reader, and post_missed_fire_alert() that emits a schedule_miss State Hub progress event. Makes a missed fire visible even under misfire_policy=skip, where Temporal drops it by design. Unit tests for the verdict logic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 14:25:33 +02:00
tegwick	0495f8a43f	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-06-23: - ACTIVITY-WP-0014-T04: progress → wait	2026-06-23 14:17:06 +02:00
tegwick	c6cad9e7b3	chore(consistency): renormalize lifecycle state [auto] Updated by fix-consistency on 2026-06-23: - workplan status: proposed → active	2026-06-23 14:17:06 +02:00
tegwick	a83b117f60	feat(ACTIVITY-WP-0014): explicit run-miss recovery policies (T02, T04) Set Temporal catchup_window on cron schedules so a fire missed during a worker/Temporal outage is no longer silently dropped. Redefine misfire_policy into three explicit modes — skip, catchup_all, catchup_latest — mapping to (catchup_window, overlap) pairs; legacy catchup/compress aliased. Add catchup_window_seconds override. Remove the ad-hoc upsert-time 1h backfill in favour of native catchup. Apply catchup_latest to daily-statehub-wsjf-triage in the Railiance runtime manifest and document run-miss policies in the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 14:15:45 +02:00
tegwick	ffc0ee2cb7	feat(ACTIVITY-WP-0014): plan schedule misfire robustness & run-miss options Cron fires are silently dropped: _build_schedule() sets SchedulePolicy(overlap=) but never catchup_window, so a brief worker/Temporal outage at trigger time drops the fire with no recovery and no signal (root cause of missing 06-22/06-23 daily triage runs). Define three explicit run-miss policies: skip, catchup_all, catchup_latest, plus missed-fire detection. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 13:46:19 +02:00
tegwick	4bc5111dfd	chore(consistency): apply state_hub_workstream_id writeback Sync archived workplan frontmatter from State Hub fix-consistency.	2026-06-22 17:43:32 +02:00
tegwick	bf4e61f0bf	feat(ACTIVITY-WP-0012): complete live admin-sync no-restart smoke Ran Railiance01 cluster validation for POST /admin/sync without restarting actcore-worker, added a repeatable smoke script, and closed the workplan.	2026-06-22 16:25:26 +02:00
tegwick	dbd2fbb11c	docs(workplan): record railiance01 llm-connect smoke evidence Note the 2026-06-19 live reconciliation on railiance01: llm-connect deployed, worker restarted with LLM_CONNECT_URL, fixture smoke passed. Manual daily triage still blocked on actcore-state-hub-bridge reachability.	2026-06-19 15:58:04 +02:00
tegwick	3e93567a53	Add admin sync hot reload path	2026-06-19 01:54:13 +02:00
tegwick	2078915854	Add reuse-surface report gaps resolver	2026-06-18 17:58:00 +02:00
tegwick	764339e490	chore(consistency): renormalize lifecycle state [auto] Updated by fix-consistency on 2026-06-18: - workplan status: ready → active	2026-06-18 17:52:33 +02:00
tegwick	17e2e39165	Track definition schedule hot reload	2026-06-18 15:21:59 +02:00
tegwick	727868a245	Finish event payload resolver workplan	2026-06-18 15:15:07 +02:00
tegwick	23e2316dff	Harden coding retro resolver selection	2026-06-18 15:13:08 +02:00
tegwick	206bb336d2	Wire llm-connect runtime for daily triage	2026-06-18 15:12:31 +02:00
tegwick	977a3bd97f	Align activity-core scope boundaries	2026-06-18 15:11:48 +02:00
tegwick	717535b62d	Close event-payload live smoke handoff	2026-06-18 14:26:27 +02:00
tegwick	0554014083	Add event-payload context resolver	2026-06-18 14:01:11 +02:00
tegwick	bcddc88320	Close ops inventory probe handoff	2026-06-16 03:51:02 +02:00
tegwick	14b2d40eb7	Implement weekly coding retro schedule	2026-06-07 20:58:34 +02:00
tegwick	4e8ccbb344	Set up daily WSJF closure gates	2026-06-07 11:00:03 +02:00
tegwick	418eb4ffda	Add schedule smoke test routine	2026-06-06 15:32:57 +02:00
tegwick	4b1b3e1b5f	Wire ops inventory probes for Railiance	2026-06-05 23:40:25 +02:00
tegwick	ebcaacc0b5	chore(consistency): renormalize lifecycle state [auto] Updated by fix-consistency on 2026-06-05: - workplan status: ready → active	2026-06-05 23:17:48 +02:00
tegwick	41d3e75a88	Implement ops inventory probe evidence slice	2026-06-05 23:16:40 +02:00
tegwick	ee1f805c0b	Sync ACTIVITY-WP-0007 with State Hub	2026-06-05 22:49:20 +02:00
tegwick	3b8bac26da	Add ops inventory probe runner workplan	2026-06-05 22:46:11 +02:00
tegwick	42e373aba1	Harden WSJF triage report recovery	2026-06-05 19:27:03 +02:00
tegwick	20d4f26166	Implement post-triage operational hardening	2026-06-04 12:15:07 +02:00
tegwick	b2d56624b2	Normalize legacy WP-0003 status	2026-06-03 15:28:59 +02:00
tegwick	87d3979c20	Record State Hub IDs for WP-0006	2026-06-03 12:09:28 +02:00
tegwick	30598fd1ad	Expand rule actions for per-repo tasks Add safe action interpolation and for_each binding for rule fan-out, update the weekly SBOM definition, cover the new evaluation path, and reconcile activity-core scope/workplans for the State Hub sync.	2026-06-03 11:58:24 +02:00
tegwick	c79d0980a9	Make Temporal activity timeout env-configurable (ADHOC-2026-06-01-T03) The CUST-WP-0045 daily triage canary on 2026-06-01 hit a BrokenPipeError on the llm-connect side. Two 5-minute timeouts were racing: - _ACTIVITY_TIMEOUT = timedelta(minutes=5) in workflows.py - LLM_CONNECT_TIMEOUT_SECONDS default 300 in llm_client.py The 10KB curated digest + max_depth:2 + JSON schema enforcement pushed Claude past 5 minutes. Whichever timer fired first killed the httpx call; the model's late response arrived to a closed socket. Read _ACTIVITY_TIMEOUT from ACTIVITY_TIMEOUT_SECONDS env (default 900 — 15 minutes) so judgement-call activities have headroom for slow LLM runs. Operators should also widen httpx via LLM_CONNECT_TIMEOUT_SECONDS=840 so httpx still times out slightly before Temporal, preserving the clean-error contract. Tests: 120 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 08:10:24 +02:00
tegwick	a8d3cc2782	Fix repo_sbom_status resolver — close ADHOC-2026-06-01-T01 The state-hub resolver was calling GET /sbom/status?repo={slug}, which State Hub does not expose. Real SBOM routes are /sbom/, /sbom/{slug}, /sbom/snapshots/, /sbom/snapshots/{id}, /sbom/ingest/, /sbom/report/licences/. The weekly-sbom-staleness ActivityDefinition was passing params {repos: all} and the resolver was reading params.get("repo_slug", ""), so the URL collapsed to /sbom/status?repo= and 404'd. _fetch_json swallowed the error, the rule context.repos.sbom_age_days > 30 evaluated against {} and never matched, and the weekly SBOM check has been a silent no-op for as long as the route mismatch has existed. Resolver now supports two modes selected by params: - single-repo: {repo_slug: foo} → GET /sbom/{foo}, returns {repo_slug, last_sbom_at, sbom_age_days, has_sbom} - bulk: {repos: all} → GET /repos/, computes per-repo age, returns the worst repo's fields hoisted to the top of the result alongside stale_count, total_count, worst_* fields, and the full per-repo list Never-scanned repos get a 99999 sentinel age so threshold rules treat them as very stale without forcing the rule to special-case None. Hoisting the worst entry to the top preserves the existing rule expression context.repos.sbom_age_days > 30 (and target_repo: context.repos.repo_slug, though that field is a separate interpolation gap tracked as ADHOC-2026-06-01-T02). The integration tests' aspirational per-repo iteration model is left intact. Live validation against State Hub on 2026-06-01: - single: activity-core → 36 days since 2026-04-26 ingest - bulk: 48 repos total, 46 stale (>30d), worst is info-tech-canon (never scanned), rule expression evaluates True Tests: 120 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 03:31:56 +02:00
tegwick	5d3fb33c6b	Capture sbom_status resolver bug as ADHOC-2026-06-01 Surfaced while bringing up the dev worker for the CUST-WP-0045 T06 cutover. weekly-sbom-staleness fires its state-hub resolver with query repo_sbom_status, which hits GET /sbom/status?repo=. State Hub does not expose that route, so _fetch_json returns {} and the rule context.repos.sbom_age_days > 30 silently no-ops. The weekly SBOM check has been a no-op for as long as the route mismatch has existed. Logged as a low-priority adhoc rather than promoting to a workplan because the resolver and definition both need a one-line decision (single-repo vs fan-out), not multi-phase design. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 03:16:12 +02:00
tegwick	f4c38e2d5f	Record state hub IDs for railiance deployment	2026-05-22 13:51:51 +02:00
tegwick	e2aac3ad8c	Deploy activity-core on railiance01	2026-05-22 13:49:46 +02:00
tegwick	a9d2c12212	chore(WP-0004): mark workplan done Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 00:05:01 +02:00
tegwick	2a8e6cfe7f	feat(WP-0004): railiance deployment & service ops - Dockerfile (multi-stage, uv-based, slim runtime) - .dockerignore - docker-compose.railiance.yml (Temporal + NATS + PG, no Elasticsearch) - GET /health endpoint (db + temporal probes, 200/503) - .env.example (complete env var reference) - Makefile: migrate, sync-all, dev-up/down, railiance-up/down, start-worker, start-api, start-event-router, help targets; extracted sync-event-types Python to scripts/sync_event_types.py - SIGTERM graceful shutdown in worker.py and event_router.py - docs/runbook.md: Railiance deployment section Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 00:04:39 +02:00
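
The WP-0016 commits above describe per-item framing with a quarantine lane: each recommendation is validated independently, so one malformed item no longer discards the whole report, and `maxItems: 7` is a hint rather than a hard reject. A minimal sketch of that idea follows. It is not the T03 parser itself, which this log does not show. The function name `parse_recommendations`, the return shape, and the assumption that `wsjf` is an optional number are all hypothetical; only the required keys and the limit of 7 come from the commit message.

```python
import json

REQUIRED_KEYS = ("rank", "candidate", "action", "why")  # from the T02 commit message
SOFT_MAX_ITEMS = 7  # maxItems is a producer hint: flag when exceeded, never reject


def parse_recommendations(ndjson_text):
    """Validate each NDJSON line on its own and quarantine only the bad ones.

    Hypothetical helper: returns (accepted, quarantined, over_hint).
    """
    accepted, quarantined = [], []
    for line_no, line in enumerate(ndjson_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between items
        try:
            item = json.loads(line)
        except json.JSONDecodeError as exc:
            # A broken delimiter costs this one item, not the whole report.
            quarantined.append({"line": line_no, "reason": f"invalid JSON: {exc.msg}"})
            continue
        if not isinstance(item, dict):
            quarantined.append({"line": line_no, "reason": "item is not an object"})
            continue
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            quarantined.append({"line": line_no, "reason": f"missing keys: {missing}"})
            continue
        # Assumption: wsjf, when present, is a plain number (bool is excluded).
        wsjf = item.get("wsjf")
        if wsjf is not None and (isinstance(wsjf, bool) or not isinstance(wsjf, (int, float))):
            quarantined.append({"line": line_no, "reason": "wsjf is not a number"})
            continue
        accepted.append(item)
    over_hint = len(accepted) > SOFT_MAX_ITEMS
    return accepted, quarantined, over_hint
```

Under these assumptions, a 16-item report with one truncated line would yield 15 accepted items, one quarantined entry, and `over_hint` set to `True`, instead of a whole-document validation failure.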


71 Commits