Commit Graph

1747 Commits

Author SHA1 Message Date
14b2d40eb7 Implement weekly coding retro schedule 2026-06-07 20:58:34 +02:00
992fe94034 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-07:
  - update .custodian-brief.md for activity-core
2026-06-07 20:56:40 +02:00
4e8ccbb344 Set up daily WSJF closure gates 2026-06-07 11:00:03 +02:00
418eb4ffda Add schedule smoke test routine 2026-06-06 15:32:57 +02:00
e926636617 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-05:
  - update .custodian-brief.md for activity-core
2026-06-05 23:41:24 +02:00
4b1b3e1b5f Wire ops inventory probes for Railiance 2026-06-05 23:40:25 +02:00
5838077327 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-05:
  - update .custodian-brief.md for activity-core
2026-06-05 23:17:52 +02:00
ebcaacc0b5 chore(consistency): renormalize lifecycle state [auto]
Updated by fix-consistency on 2026-06-05:
  - workplan status: ready → active
2026-06-05 23:17:48 +02:00
41d3e75a88 Implement ops inventory probe evidence slice 2026-06-05 23:16:40 +02:00
ee1f805c0b Sync ACTIVITY-WP-0007 with State Hub 2026-06-05 22:49:20 +02:00
15f495361e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-05:
  - update .custodian-brief.md for activity-core
2026-06-05 22:47:56 +02:00
3b8bac26da Add ops inventory probe runner workplan 2026-06-05 22:46:11 +02:00
42e373aba1 Harden WSJF triage report recovery 2026-06-05 19:27:03 +02:00
20d4f26166 Implement post-triage operational hardening 2026-06-04 12:15:07 +02:00
8a33ec44b6 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-04:
  - update .custodian-brief.md for activity-core
2026-06-04 09:58:37 +02:00
b2d56624b2 Normalize legacy WP-0003 status 2026-06-03 15:28:59 +02:00
87d3979c20 Record State Hub IDs for WP-0006 2026-06-03 12:09:28 +02:00
33cc19ad7c chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-03:
  - update .custodian-brief.md for activity-core
2026-06-03 12:07:05 +02:00
30598fd1ad Expand rule actions for per-repo tasks
Add safe action interpolation and for_each binding for rule fan-out, update the weekly SBOM definition, cover the new evaluation path, and reconcile activity-core scope/workplans for the State Hub sync.
2026-06-03 11:58:24 +02:00
4b4e162c44 Log raw LLM output preview on instruction validation failure
The CUST-WP-0045 canary failed validation twice without leaving any
record of what the model actually returned. The warning logged only the
error message ($: missing required property 'summary'), not the JSON
shape that triggered it — so diagnosing required modifying code and
re-running. Log a 2KB preview of the offending raw output alongside the
error so the next failure of this shape is one grep away from diagnosis.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 13:06:43 +02:00
c79d0980a9 Make Temporal activity timeout env-configurable (ADHOC-2026-06-01-T03)
The CUST-WP-0045 daily triage canary on 2026-06-01 hit a BrokenPipeError
on the llm-connect side. Two 5-minute timeouts were racing:

- _ACTIVITY_TIMEOUT = timedelta(minutes=5) in workflows.py
- LLM_CONNECT_TIMEOUT_SECONDS default 300 in llm_client.py

The 10KB curated digest + max_depth:2 + JSON schema enforcement pushed
Claude past 5 minutes. Whichever timer fired first killed the httpx call;
the model's late response arrived to a closed socket.

Read _ACTIVITY_TIMEOUT from ACTIVITY_TIMEOUT_SECONDS env (default 900 —
15 minutes) so judgement-call activities have headroom for slow LLM runs.
Operators should also widen httpx via LLM_CONNECT_TIMEOUT_SECONDS=840 so
httpx still times out slightly before Temporal, preserving the
clean-error contract.

Tests: 120 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 08:10:24 +02:00
a8d3cc2782 Fix repo_sbom_status resolver — close ADHOC-2026-06-01-T01
The state-hub resolver was calling GET /sbom/status?repo={slug}, which State
Hub does not expose. Real SBOM routes are /sbom/, /sbom/{slug},
/sbom/snapshots/, /sbom/snapshots/{id}, /sbom/ingest/, /sbom/report/licences/.
The weekly-sbom-staleness ActivityDefinition was passing params {repos: all}
and the resolver was reading params.get("repo_slug", ""), so the URL
collapsed to /sbom/status?repo= and 404'd. _fetch_json swallowed the error,
the rule context.repos.sbom_age_days > 30 evaluated against {} and never
matched, and the weekly SBOM check has been a silent no-op for as long as
the route mismatch has existed.

Resolver now supports two modes selected by params:
- single-repo: {repo_slug: foo} → GET /sbom/{foo}, returns
  {repo_slug, last_sbom_at, sbom_age_days, has_sbom}
- bulk: {repos: all} → GET /repos/, computes per-repo age, returns the
  worst repo's fields hoisted to the top of the result alongside
  stale_count, total_count, worst_* fields, and the full per-repo list

Never-scanned repos get a 99999 sentinel age so threshold rules treat
them as very stale without forcing the rule to special-case None.

Hoisting the worst entry to the top preserves the existing rule
expression context.repos.sbom_age_days > 30 (and target_repo:
context.repos.repo_slug, though that field is a separate interpolation
gap tracked as ADHOC-2026-06-01-T02). The integration tests'
aspirational per-repo iteration model is left intact.

Live validation against State Hub on 2026-06-01:
- single: activity-core → 36 days since 2026-04-26 ingest
- bulk: 48 repos total, 46 stale (>30d), worst is info-tech-canon (never
  scanned), rule expression evaluates True

Tests: 120 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 03:31:56 +02:00
5d3fb33c6b Capture sbom_status resolver bug as ADHOC-2026-06-01
Surfaced while bringing up the dev worker for the CUST-WP-0045 T06 cutover.
weekly-sbom-staleness fires its state-hub resolver with query
repo_sbom_status, which hits GET /sbom/status?repo=. State Hub does not
expose that route, so _fetch_json returns {} and the rule
context.repos.sbom_age_days > 30 silently no-ops. The weekly SBOM check has
been a no-op for as long as the route mismatch has existed. Logged as a
low-priority adhoc rather than promoting to a workplan because the resolver
and definition both need a one-line decision (single-repo vs fan-out), not
multi-phase design.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-02 03:16:12 +02:00
ca6d80ec07 Enable hourly RecentlyOnScope rollout 2026-05-23 02:51:54 +02:00
5055f3eaca Add State Hub RecentlyOnScope invocation 2026-05-22 16:14:10 +02:00
f4c38e2d5f Record state hub IDs for railiance deployment 2026-05-22 13:51:51 +02:00
e2aac3ad8c Deploy activity-core on railiance01 2026-05-22 13:49:46 +02:00
cf92f0d686 Forward instruction schemas to llm-connect 2026-05-21 03:19:27 +02:00
5c4f96e7aa Pass instruction depth config to llm-connect 2026-05-19 20:55:35 +02:00
1ff8b14d1b Fix ActivityDefinition sync for daily triage canary 2026-05-19 20:13:23 +02:00
6cb0718e90 Add curated daily triage digest 2026-05-19 19:09:21 +02:00
3110399b11 Add instruction report sinks 2026-05-19 18:36:58 +02:00
0dc342eb1b Wire instruction report execution 2026-05-19 18:28:23 +02:00
0e7084207e Extend State Hub context resolver for daily triage 2026-05-19 15:59:12 +02:00
5bb61fdef5 Refresh agent instruction files 2026-05-18 16:55:39 +02:00
00e688bd8e fix(WP-0004): live deployment fixes from integration test
- Dockerfile: copy alembic.ini + migrations/ so actcore-migrate works
- docker-compose.railiance.yml:
    - Temporal: add dynamicconfig volume mount + correct DYNAMIC_CONFIG_FILE_PATH
    - Temporal: healthcheck uses 'temporal operator cluster health' (not tctl)
    - NATS: add monitoring port -m 8222 for wget-based healthcheck
    - actcore-api healthcheck: use Python urllib (curl absent from slim image)
- api.py: fix /health Temporal probe — Client has no describe_namespace;
    use workflow_service.get_system_info(GetSystemInfoRequest()) instead
- Makefile: grep -Eh to suppress filename prefix when MAKEFILE_LIST has
    multiple files (.env included via -include)

All 8 services start cleanly; /health returns {"status":"ok",...} HTTP 200;
SIGTERM drains worker cleanly within grace period; make help correct.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 09:23:14 +02:00
94bd34231c chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-15:
  - update .custodian-brief.md for activity-core
2026-05-15 00:06:45 +02:00
a9d2c12212 chore(WP-0004): mark workplan done
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 00:05:01 +02:00
2a8e6cfe7f feat(WP-0004): railiance deployment & service ops
- Dockerfile (multi-stage, uv-based, slim runtime)
- .dockerignore
- docker-compose.railiance.yml (Temporal + NATS + PG, no Elasticsearch)
- GET /health endpoint (db + temporal probes, 200/503)
- .env.example (complete env var reference)
- Makefile: migrate, sync-all, dev-up/down, railiance-up/down,
  start-worker, start-api, start-event-router, help targets;
  extracted sync-event-types Python to scripts/sync_event_types.py
- SIGTERM graceful shutdown in worker.py and event_router.py
- docs/runbook.md: Railiance deployment section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 00:04:39 +02:00
987cf5a75c chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - update .custodian-brief.md for activity-core
2026-05-14 23:51:06 +02:00
9f8cc43129 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - update .custodian-brief.md for activity-core
2026-05-14 23:36:03 +02:00
827ef9c1a0 feat(WP-0003c): context adapters, first ActivityDefinition, full test suite
T51: ContextResolver ABC + CONTEXT_RESOLVER_REGISTRY; resolve_context activity
updated to dispatch via registry (warns + binds {} on failure, never aborts run).
T52: RepoScopingContextResolver with 5-min in-process cache.
T53: StateHubContextResolver (no cache) for domain_summary and repo_sbom_status.
T54: activity-definitions/weekly-sbom-staleness.md (Monday 09:00 Berlin, cron
trigger, flag-stale-sbom rule at >30 days) + tasks/sbom-rescan.md template.
T55: 51 parametrized evaluator tests — all whitelisted operators, unsafe
expression rejection, empty condition, missing attribute, nested context access.
T56: 15 executor safety tests — UntrustedFieldError, object-type rejection,
injection fixture, LLM retry on bad JSON, review_required field.
T57: 6 integration tests — parses real definition, evaluates rule per-repo
(stale/fresh boundary), emits via NullSink, verifies spawn log entries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 23:24:48 +02:00
fd8d0827d7 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - update .custodian-brief.md for activity-core
2026-05-14 23:21:10 +02:00
df73d28f55 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - update .custodian-brief.md for activity-core
2026-05-14 23:06:22 +02:00
9abb69d179 chore(consistency): sync task status from DB [auto] 2026-05-14 23:03:29 +02:00
176867cbe3 feat(WP-0003b): parser, workflow wiring, triggers, webhooks
T44: ActivityDefinition markdown file parser (definition_parser.py)
  - Scans activity-definitions/*.md and ACTIVITY_DEFINITION_DIRS paths
  - Parses YAML frontmatter + fenced rule/instruction blocks
  - Raises ParseError on any malformed file — never silently skips

T45: ActivityDefinition sync command
  - Migration 0006: adds rules_json/instructions_json JSONB columns
  - sync_activity_definitions.py + make sync-activity-definitions
  - Called at worker startup before schedule sync

T46: Rule/instruction pipeline wired into RunActivityWorkflow
  - New evaluate_rules and emit_tasks Temporal activities
  - Workflow passes event_envelope_json to enable rule evaluation
  - EventRouter now passes full envelope JSON as 4th workflow arg
  - IssueSink.emit() writes task_spawn_log rows per task

T47: ScheduledTriggerConfig model (one-off future datetime trigger)

T48: One-off Temporal Schedule support
  - Fixed timezone_name → time_zone_name (was causing all schedule tests to fail)
  - Added ScheduleCalendarSpec-based one-off schedule with remaining_actions=1
  - cancel_scheduled() for admin cancellation
  - Fixed backfill() call to use *args unpacking (not list wrapper)
  - Fixed ScheduleAlreadyRunningError catch in upsert_schedule
  - sync_schedules now handles ScheduledTriggerConfig definitions

T49: Webhook receiver
  - POST /webhooks/gitea  — HMAC-SHA256 via X-Gitea-Signature-256
  - POST /webhooks/github — HMAC-SHA256 via X-Hub-Signature-256
  - Normalisers: repo.created, push, issue.closed → EventEnvelope
  - Publishes to NATS activity.{type} subject after registry validation
  - Mounted in api.py at /webhooks prefix

T50: Gitea event type definitions
  - gitea.repo.created.md, gitea.push.md, gitea.issue.closed.md
  - Each includes normaliser field mapping in Consumer Notes

Tests: 18 passed, 1 skipped (integration). Fixed embedded Temporal
server visibility latency in test_upsert_schedule_creates_schedule.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 23:02:33 +02:00
dc20c44a44 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - update .custodian-brief.md for activity-core
2026-05-14 22:52:19 +02:00
c25c711e3c chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - update .custodian-brief.md for activity-core
2026-05-14 22:36:00 +02:00
7c1c4441d1 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - update .custodian-brief.md for activity-core
2026-05-14 22:22:06 +02:00
0cb98af18c chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - update .custodian-brief.md for activity-core
2026-05-14 22:06:11 +02:00