activity-core

Author	SHA1	Message	Date
tegwick	30043348f0	Add Core Hub ops evidence sink	2026-06-27 20:34:25 +02:00
tegwick	bf877b7f0d	test(ACTIVITY-WP-0016-T05): regression coverage incl. real 06-26 payload + over-depth Add a test driving the actual captured 2026-06-26 failure payload (tests/fixtures/wp0016/...partial.json): it now recovers 6+ valid recommendations and quarantines the truncated tail, where before WP-0016 it discarded the whole run. Add an over-depth guardrail test. Together with T03/T04 the regression set now covers truncation, one-bad-item, oversized-string, over-depth, allow-list/injection-shaped, and happy-path count cap. In-repo portion of T05 complete; the live railiance01 graceful-degradation smoke is operator-owned cluster work (deploy-coupled with the T02 bundle changes) and remains outstanding. Hand-back notes posted to WP-0006-T03 and WP-0010-T04. Full suite: 220 passed, 1 skipped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 18:18:37 +02:00
tegwick	9be4ddbdb7	feat(ACTIVITY-WP-0016-T04): producer trust-boundary guardrails + ADR-004 Add ADR-004 documenting the producer trust boundary: untrusted producers (LLM, agent, human; erroneous and malicious), the trust-but-handle vs verify-and-mitigate postures, error-locality and quarantine-with-provenance principles, and the concrete activity-core mechanisms. Implement producer-agnostic guardrails in executor.py, applied uniformly on the happy path and the recovery path via _partition_items: structural-type -> schema -> structural caps (_MAX_DEPTH, _MAX_STRING_LEN) -> reference allow-list -> count cap. Each quarantine carries a reason. Closes the happy-path maxItems count cap deferred from T03 (valid 9-item report keeps 7, quarantines 2). Reference allow-list reads context["known_candidates"] via _allow_list_from_context; inert until a resolver populates it. SCOPE.md updated (executor bullet + ADR list); no INTENT drift. New tests: happy-path count cap, oversized-string guardrail, allow-list rejection. Full suite: 218 passed, 1 skipped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 18:10:17 +02:00
tegwick	a70c00a789	feat(ACTIVITY-WP-0016-T03): resilient per-item report recovery with quarantine lane When the whole-document parse + one retry still fail, report instructions now run _resilient_report before the total-loss path. A brace/quote-aware scanner (_extract_object_spans) recovers each recommendation object whether pretty-printed across many lines or NDJSON one-per-line; a truncated tail gets a best-effort _try_repair; _partition_items validates each recovered object against the T02 item schema. Valid items survive (output_validated=True, partial=True), malformed/over-maxItems items are quarantined with provenance (index, error, raw, reason), capped at 20. Error locality now matches the unit of work: one bad item costs one item, not the whole report. Verified against the real 06-26 shape: 7 valid recommendations + a truncated tail now recovers all 7 and quarantines the broken tail (previously the whole run was discarded). Happy-path maxItems top-N enforcement is deferred to T04 (count caps). Full suite: 215 passed, 1 skipped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 17:56:28 +02:00
tegwick	61f278d643	feat(ACTIVITY-WP-0016-T02): strict bounded daily-triage output schema Replace the accept-anything recommendations.items ({type: object}) with a strict per-item contract (required [rank, candidate, action, why] + typed wsjf) and a maxItems:7 hint. Strict item structure is what lets the T03 boundary parser validate each recommendation independently and quarantine only malformed ones. maxItems is a producer hint (prompt + llm-connect json_schema + T03 mitigation), NOT a hard reject — a hard maxItems reject would discard a whole 16-item report, the blast-radius bug WP-0016 removes. DEPLOY COUPLING: the strict schema is also consumed by the current whole-doc validator, so it must ship with T03's per-item quarantine parser; until then it increases whole-doc hard-fails. Prompt + max_tokens headroom + NDJSON framing are documented as a runtime-bundle handoff. Updated four tests to the strict contract; the forwarded-schema test now reads the live schema file instead of hard-coding it. Full suite: 213 passed, 1 skipped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 17:36:24 +02:00
tegwick	0e9e18a59a	chore(ACTIVITY-WP-0016-T01): record root-cause findings + partial failure fixture Local analysis of the 2026-06-26 daily-triage validation failure: the unbounded ~1-recommendation-per-workstream list (16 active workstreams; JSON break at char 5268, ~rank 8-9) is the structural cause; both the first attempt and the retry failed. The exact offending token and finish_reason are unrecoverable from activity-core data — complete() drops finish_reason/usage, the report sink caps raw output at 4000 chars (< 5268), and the log preview at 2000. Confirming the exact token needs llm-connect producer-side logs on railiance01 (operator-owned); mitigation (T02/T03) is identical regardless. Partial fixture captured. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-26 15:04:27 +02:00
tegwick	88fe359385	feat(ACTIVITY-WP-0014): idempotency-keyed State Hub writes (T05, in-repo part) Add activity_core/state_hub_write: every State Hub write (report-sink, ops-evidence, schedule-miss) now sends a stable Idempotency-Key header derived from run_id:instruction_id:event_type. Makes writes safe to buffer/replay under the future state-hub beachhead without duplicate progress/triage events. The read-based _progress_exists dedup is now best-effort (returns False on connection error instead of hard-failing), so the guarantee lives on the keyed write rather than a live read. Tests + runbook note. Endpoint adoption / proxy retirement stays blocked on the state-hub beachhead capability. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 21:38:46 +02:00
tegwick	053d18b24a	feat(ACTIVITY-WP-0014): missed-fire detection & alert sink (T03) Add activity_core/schedule_health: a pure evaluate_schedule_health() verdict (built on Temporal's num_actions_missed_catchup_window plus a staleness check), an async check_schedule_health() reader, and post_missed_fire_alert() that emits a schedule_miss State Hub progress event. Makes a missed fire visible even under misfire_policy=skip, where Temporal drops it by design. Unit tests for the verdict logic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 14:25:33 +02:00
tegwick	a83b117f60	feat(ACTIVITY-WP-0014): explicit run-miss recovery policies (T02, T04) Set Temporal catchup_window on cron schedules so a fire missed during a worker/Temporal outage is no longer silently dropped. Redefine misfire_policy into three explicit modes — skip, catchup_all, catchup_latest — mapping to (catchup_window, overlap) pairs; legacy catchup/compress aliased. Add catchup_window_seconds override. Remove the ad-hoc upsert-time 1h backfill in favour of native catchup. Apply catchup_latest to daily-statehub-wsjf-triage in the Railiance runtime manifest and document run-miss policies in the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-23 14:15:45 +02:00
tegwick	faf5d60ae8	feat(STATE-WP-0064): enable cluster consistency sweep schedule Enable the definition in k8s projection and pass activity-core source tags.	2026-06-21 21:46:43 +02:00
tegwick	3a981cc98f	feat(STATE-WP-0064): wire consistency_sweep_remote_all state-hub query Add POST /consistency/sweep/remote-all resolver support with a 330s timeout and k8s projection for the consistency sweep definition.	2026-06-21 20:19:22 +02:00
tegwick	3e93567a53	Add admin sync hot reload path	2026-06-19 01:54:13 +02:00
tegwick	a08bd1684f	Add ISSUE_CORE_API_KEY auth to IssueCoreRestSink Issue-core requires a shared ingestion key on POST /issues/. The REST sink now sends Authorization: Bearer using ISSUE_CORE_API_KEY and fails fast when the key is missing under ISSUE_SINK_TYPE=rest. Updates .env.example, emission boundary docs, and unit tests for the header contract and missing-key error.	2026-06-18 22:30:13 +02:00
tegwick	2078915854	Add reuse-surface report gaps resolver	2026-06-18 17:58:00 +02:00
tegwick	23e2316dff	Harden coding retro resolver selection	2026-06-18 15:13:08 +02:00
tegwick	206bb336d2	Wire llm-connect runtime for daily triage	2026-06-18 15:12:31 +02:00
tegwick	977a3bd97f	Align activity-core scope boundaries	2026-06-18 15:11:48 +02:00
tegwick	0554014083	Add event-payload context resolver	2026-06-18 14:01:11 +02:00
tegwick	9a72c9f210	fix: unwrap single-key kaizen resolver payloads in resolve_context When discover_kaizen_projects returns {"projects": [...]} bound to context.projects, for_each can iterate the list directly. Multi-key summaries (e.g. repo SBOM bulk) remain unchanged.	2026-06-18 08:11:09 +02:00
tegwick	517bf9c133	Add kaizen context resolver for scheduled agent fleet discovery. Implement discover_kaizen_scheduled_repos and discover_kaizen_projects per kaizen-agentic ADR-005 contract: State Hub roster, roster.yaml filter, schedule validation, and prepare_command emission. Register kaizen/resolver/shell source types with unit tests and runbook dry-run instructions.	2026-06-18 07:46:46 +02:00
tegwick	14b2d40eb7	Implement weekly coding retro schedule	2026-06-07 20:58:34 +02:00
tegwick	4e8ccbb344	Set up daily WSJF closure gates	2026-06-07 11:00:03 +02:00
tegwick	418eb4ffda	Add schedule smoke test routine	2026-06-06 15:32:57 +02:00
tegwick	4b1b3e1b5f	Wire ops inventory probes for Railiance	2026-06-05 23:40:25 +02:00
tegwick	41d3e75a88	Implement ops inventory probe evidence slice	2026-06-05 23:16:40 +02:00
tegwick	42e373aba1	Harden WSJF triage report recovery	2026-06-05 19:27:03 +02:00
tegwick	20d4f26166	Implement post-triage operational hardening	2026-06-04 12:15:07 +02:00
tegwick	30598fd1ad	Expand rule actions for per-repo tasks Add safe action interpolation and for_each binding for rule fan-out, update the weekly SBOM definition, cover the new evaluation path, and reconcile activity-core scope/workplans for the State Hub sync.	2026-06-03 11:58:24 +02:00
tegwick	a8d3cc2782	Fix repo_sbom_status resolver — close ADHOC-2026-06-01-T01 The state-hub resolver was calling GET /sbom/status?repo={slug}, which State Hub does not expose. Real SBOM routes are /sbom/, /sbom/{slug}, /sbom/snapshots/, /sbom/snapshots/{id}, /sbom/ingest/, /sbom/report/licences/. The weekly-sbom-staleness ActivityDefinition was passing params {repos: all} and the resolver was reading params.get("repo_slug", ""), so the URL collapsed to /sbom/status?repo= and 404'd. _fetch_json swallowed the error, the rule context.repos.sbom_age_days > 30 evaluated against {} and never matched, and the weekly SBOM check has been a silent no-op for as long as the route mismatch has existed. Resolver now supports two modes selected by params: - single-repo: {repo_slug: foo} → GET /sbom/{foo}, returns {repo_slug, last_sbom_at, sbom_age_days, has_sbom} - bulk: {repos: all} → GET /repos/, computes per-repo age, returns the worst repo's fields hoisted to the top of the result alongside stale_count, total_count, worst_* fields, and the full per-repo list Never-scanned repos get a 99999 sentinel age so threshold rules treat them as very stale without forcing the rule to special-case None. Hoisting the worst entry to the top preserves the existing rule expression context.repos.sbom_age_days > 30 (and target_repo: context.repos.repo_slug, though that field is a separate interpolation gap tracked as ADHOC-2026-06-01-T02). The integration tests' aspirational per-repo iteration model is left intact. Live validation against State Hub on 2026-06-01: - single: activity-core → 36 days since 2026-04-26 ingest - bulk: 48 repos total, 46 stale (>30d), worst is info-tech-canon (never scanned), rule expression evaluates True Tests: 120 passed, 1 skipped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 03:31:56 +02:00
tegwick	ca6d80ec07	Enable hourly RecentlyOnScope rollout	2026-05-23 02:51:54 +02:00
tegwick	5055f3eaca	Add State Hub RecentlyOnScope invocation	2026-05-22 16:14:10 +02:00
tegwick	cf92f0d686	Forward instruction schemas to llm-connect	2026-05-21 03:19:27 +02:00
tegwick	5c4f96e7aa	Pass instruction depth config to llm-connect	2026-05-19 20:55:35 +02:00
tegwick	1ff8b14d1b	Fix ActivityDefinition sync for daily triage canary	2026-05-19 20:13:23 +02:00
tegwick	6cb0718e90	Add curated daily triage digest	2026-05-19 19:09:21 +02:00
tegwick	3110399b11	Add instruction report sinks	2026-05-19 18:36:58 +02:00
tegwick	0dc342eb1b	Wire instruction report execution	2026-05-19 18:28:23 +02:00
tegwick	0e7084207e	Extend State Hub context resolver for daily triage	2026-05-19 15:59:12 +02:00
tegwick	827ef9c1a0	feat(WP-0003c): context adapters, first ActivityDefinition, full test suite T51: ContextResolver ABC + CONTEXT_RESOLVER_REGISTRY; resolve_context activity updated to dispatch via registry (warns + binds {} on failure, never aborts run). T52: RepoScopingContextResolver with 5-min in-process cache. T53: StateHubContextResolver (no cache) for domain_summary and repo_sbom_status. T54: activity-definitions/weekly-sbom-staleness.md (Monday 09:00 Berlin, cron trigger, flag-stale-sbom rule at >30 days) + tasks/sbom-rescan.md template. T55: 51 parametrized evaluator tests — all whitelisted operators, unsafe expression rejection, empty condition, missing attribute, nested context access. T56: 15 executor safety tests — UntrustedFieldError, object-type rejection, injection fixture, LLM retry on bad JSON, review_required field. T57: 6 integration tests — parses real definition, evaluates rule per-repo (stale/fresh boundary), emits via NullSink, verifies spawn log entries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 23:24:48 +02:00
tegwick	176867cbe3	feat(WP-0003b): parser, workflow wiring, triggers, webhooks T44: ActivityDefinition markdown file parser (definition_parser.py) - Scans activity-definitions/.md and ACTIVITY_DEFINITION_DIRS paths - Parses YAML frontmatter + fenced rule/instruction blocks - Raises ParseError on any malformed file — never silently skips T45: ActivityDefinition sync command - Migration 0006: adds rules_json/instructions_json JSONB columns - sync_activity_definitions.py + make sync-activity-definitions - Called at worker startup before schedule sync T46: Rule/instruction pipeline wired into RunActivityWorkflow - New evaluate_rules and emit_tasks Temporal activities - Workflow passes event_envelope_json to enable rule evaluation - EventRouter now passes full envelope JSON as 4th workflow arg - IssueSink.emit() writes task_spawn_log rows per task T47: ScheduledTriggerConfig model (one-off future datetime trigger) T48: One-off Temporal Schedule support - Fixed timezone_name → time_zone_name (was causing all schedule tests to fail) - Added ScheduleCalendarSpec-based one-off schedule with remaining_actions=1 - cancel_scheduled() for admin cancellation - Fixed backfill() call to use args unpacking (not list wrapper) - Fixed ScheduleAlreadyRunningError catch in upsert_schedule - sync_schedules now handles ScheduledTriggerConfig definitions T49: Webhook receiver - POST /webhooks/gitea — HMAC-SHA256 via X-Gitea-Signature-256 - POST /webhooks/github — HMAC-SHA256 via X-Hub-Signature-256 - Normalisers: repo.created, push, issue.closed → EventEnvelope - Publishes to NATS activity.{type} subject after registry validation - Mounted in api.py at /webhooks prefix T50: Gitea event type definitions - gitea.repo.created.md, gitea.push.md, gitea.issue.closed.md - Each includes normaliser field mapping in Consumer Notes Tests: 18 passed, 1 skipped (integration). Fixed embedded Temporal server visibility latency in test_upsert_schedule_creates_schedule. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 23:02:33 +02:00
tegwick	c3a256509b	feat(event-bridge): WP-0003a — domain model, rules module, event type registry Implements phases 7–8 of the Event Bridge architecture (custodian-WP-0003a). Domain model (T34, T40): - Added RuleDef, InstructionDef, ActionDef to models.py - Updated ActivityDefinition with rules/instructions fields (task_templates deprecated) - Formalized EventEnvelope: id, type, version, timestamp, publisher, attributes - Added from_nats_message() and from_webhook_payload() classmethods Rules module (T35, T36, T37): - src/activity_core/rules/ skeleton with boundary enforcement - evaluate_condition() — sandboxed AST walker, whitelisted nodes only, never exec() - execute_instruction() — LLM task generation with trusted_fields injection guard - tests/rules/test_boundary.py verifies no cross-boundary imports Infrastructure (T38, T39): - Alembic migrations 0004 (task_spawn_log) and 0005 (event_types) - IssueSink ABC + IssueCoreRestSink (REST) + NullSink (testing) - TaskSpawnLog and EventType ORM models Event type registry (T41, T42, T43): - event_type_registry.py: file scanner, parser, DB sync, in-process lookup - ACTIVITY_CURATOR_GATE env var (disabled|required) + approve endpoint - Three org event type definitions: org.repo.registered, org.workstream.completed, org.activity.run.completed All 10 tests pass. Boundary test confirms rules/ isolation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 22:01:15 +02:00
tegwick	ea5fbe0bf3	feat(WP-0002): complete Triggers & Ops workstream Delivers all 12 tasks (T22–T33): Temporal Schedule manager + startup sync, NATS JetStream event router, FastAPI CRUD + manual trigger, Prometheus metrics wiring, custom search-attribute tagging, and operational runbook. Marks workplan status as done. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 01:04:43 +01:00
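
Commit 88fe359385 describes State Hub writes that carry a stable Idempotency-Key header derived from run_id:instruction_id:event_type, so that a buffered or replayed write does not create a duplicate event. The sketch below illustrates that idea only. The function names, the in-memory receiver, and the choice to join the three parts verbatim (rather than hash them) are assumptions for illustration, not code taken from activity_core/state_hub_write.

```python
def idempotency_key(run_id: str, instruction_id: str, event_type: str) -> str:
    """Build a stable key: the same logical write always yields the same key.

    Assumption: the parts are joined verbatim with ':'; the commit message
    does not say whether the real module hashes or normalises them.
    """
    return f"{run_id}:{instruction_id}:{event_type}"


def write_headers(run_id: str, instruction_id: str, event_type: str) -> dict[str, str]:
    """Headers a keyed write would send (hypothetical helper)."""
    return {"Idempotency-Key": idempotency_key(run_id, instruction_id, event_type)}


class InMemoryReceiver:
    """Hypothetical stand-in for a receiver that honours Idempotency-Key."""

    def __init__(self) -> None:
        self.events: dict[str, dict] = {}

    def post(self, headers: dict[str, str], body: dict) -> bool:
        """Store the event once per key; return False when a replay is ignored."""
        key = headers["Idempotency-Key"]
        if key in self.events:
            return False
        self.events[key] = body
        return True


if __name__ == "__main__":
    receiver = InMemoryReceiver()
    headers = write_headers("run-1", "daily-triage", "schedule_miss")
    receiver.post(headers, {"summary": "missed fire"})
    receiver.post(headers, {"summary": "missed fire"})  # replay of the same write
    print(len(receiver.events))
```

With the guarantee carried by the key, the sender does not need a live read to decide whether a write already happened, which matches the commit's note that the read-based dedup became best-effort.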

42 Commits
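
Commit a70c00a789 describes a brace/quote-aware scanner that recovers each recommendation object from a truncated report, followed by per-item validation with a quarantine lane. The sketch below shows one way such a recovery could work. It is not the repository's _extract_object_spans or _partition_items: the function names, the reason strings, and the required-keys check standing in for the full item schema are assumptions. The required fields and the cap of 7 are taken from the T02 commit message.

```python
import json

# Required item fields and the item cap, as named in the T02 commit message.
REQUIRED = ("rank", "candidate", "action", "why")
MAX_ITEMS = 7


def extract_object_spans(text: str) -> list[str]:
    """Return every complete, outermost-complete {...} span in text.

    Braces inside JSON strings are ignored. Objects that never close
    (a truncated tail or a truncated enclosing document) are skipped.
    """
    spans: list[tuple[int, int]] = []
    stack: list[int] = []  # positions of '{' not yet closed
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            stack.append(i)
        elif ch == "}" and stack:
            start = stack.pop()
            # Drop spans nested inside the object that just closed.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, i + 1))
    return [text[s:e] for s, e in spans]


def partition_items(raw_objects: list[str]) -> tuple[list[dict], list[dict]]:
    """Split recovered objects into valid items and quarantined records."""
    valid: list[dict] = []
    quarantined: list[dict] = []

    def quarantine(index: int, raw: str, error: str, reason: str) -> None:
        quarantined.append({"index": index, "error": error, "raw": raw, "reason": reason})

    for index, raw in enumerate(raw_objects):
        try:
            item = json.loads(raw)
        except json.JSONDecodeError as exc:
            quarantine(index, raw, str(exc), "malformed_json")
            continue
        missing = [key for key in REQUIRED if key not in item]
        if missing:
            quarantine(index, raw, f"missing fields: {missing}", "schema")
        elif len(valid) >= MAX_ITEMS:
            quarantine(index, raw, "item count cap reached", "over_max_items")
        else:
            valid.append(item)
    return valid, quarantined
```

Because the scanner keeps only the outermost complete spans, a fully intact document would come back as a single span; the sketch therefore assumes it runs only after the whole-document parse has already failed, as the commit describes. The same scanner handles NDJSON input, where each line is its own outermost object.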