Commit Graph

1824 Commits

Author SHA1 Message Date
919edd98ac chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - update .custodian-brief.md for activity-core
2026-06-26 18:20:26 +02:00
bf877b7f0d test(ACTIVITY-WP-0016-T05): regression coverage incl. real 06-26 payload + over-depth
Add a test driving the actual captured 2026-06-26 failure payload
(tests/fixtures/wp0016/...partial.json): it now recovers 6+ valid recommendations
and quarantines the truncated tail, where before WP-0016 it discarded the whole run.
Add an over-depth guardrail test. Together with T03/T04 the regression set now covers
truncation, one-bad-item, oversized-string, over-depth, allow-list/injection-shaped,
and happy-path count cap.

In-repo portion of T05 complete; the live railiance01 graceful-degradation smoke is
operator-owned cluster work (deploy-coupled with the T02 bundle changes) and remains
outstanding. Hand-back notes posted to WP-0006-T03 and WP-0010-T04. Full suite: 220
passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 18:18:37 +02:00
9be4ddbdb7 feat(ACTIVITY-WP-0016-T04): producer trust-boundary guardrails + ADR-004
Add ADR-004 documenting the producer trust boundary: untrusted producers (LLM,
agent, human; erroneous and malicious), the trust-but-handle vs verify-and-mitigate
postures, error-locality and quarantine-with-provenance principles, and the concrete
activity-core mechanisms.

Implement producer-agnostic guardrails in executor.py, applied uniformly on the
happy path and the recovery path via _partition_items: structural-type -> schema ->
structural caps (_MAX_DEPTH, _MAX_STRING_LEN) -> reference allow-list -> count cap.
Each quarantine carries a reason. Closes the happy-path maxItems count cap deferred
from T03 (valid 9-item report keeps 7, quarantines 2). Reference allow-list reads
context["known_candidates"] via _allow_list_from_context; inert until a resolver
populates it. SCOPE.md updated (executor bullet + ADR list); no INTENT drift.

New tests: happy-path count cap, oversized-string guardrail, allow-list rejection.
Full suite: 218 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 18:10:17 +02:00
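The guardrail order named in 9be4ddbdb7 (structural-type -> schema -> structural caps -> reference allow-list -> count cap) can be sketched roughly as below. This is a minimal illustration, not the code in executor.py: the cap values, the reason strings, and the helper names are assumptions; only the required keys, the maxItems value of 7, and the "each quarantine carries a reason" rule come from the commit messages.

```python
MAX_DEPTH = 4            # assumed value; the real _MAX_DEPTH is not stated
MAX_STRING_LEN = 2000    # assumed value; the real _MAX_STRING_LEN is not stated
MAX_ITEMS = 7            # maxItems:7 from the T02 schema
REQUIRED = ("rank", "candidate", "action", "why")


def _depth(value, level=0):
    """Nesting depth of dicts/lists; a flat dict has depth 1."""
    if isinstance(value, dict):
        return max((_depth(v, level + 1) for v in value.values()), default=level + 1)
    if isinstance(value, list):
        return max((_depth(v, level + 1) for v in value), default=level + 1)
    return level


def _max_string_len(value):
    """Length of the longest string anywhere inside value."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, dict):
        return max((_max_string_len(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((_max_string_len(v) for v in value), default=0)
    return 0


def _reject_reason(item, known_candidates):
    """Return a quarantine reason, or None if the item passes every check."""
    if not isinstance(item, dict):
        return "not_an_object"
    for key in REQUIRED:
        if key not in item:
            return f"schema: missing {key}"
    if _depth(item) > MAX_DEPTH:
        return "over_depth"
    if _max_string_len(item) > MAX_STRING_LEN:
        return "oversized_string"
    # The allow-list is inert until something populates it.
    if known_candidates and item["candidate"] not in known_candidates:
        return "unknown_candidate"
    return None


def partition_items(items, known_candidates=None):
    """Split items into (valid, quarantined); the count cap is applied last."""
    valid, quarantined = [], []
    for index, item in enumerate(items):
        reason = _reject_reason(item, known_candidates)
        if reason is None and len(valid) >= MAX_ITEMS:
            reason = "over_max_items"
        if reason is None:
            valid.append(item)
        else:
            quarantined.append({"index": index, "reason": reason, "raw": item})
    return valid, quarantined
```

With nine well-formed items this keeps the first seven and quarantines the remaining two, which is the behaviour the commit describes for the happy path.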
c5440e8429 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - update .custodian-brief.md for activity-core
2026-06-26 18:04:07 +02:00
53dc0f6e93 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T03: progress → done
2026-06-26 18:03:50 +02:00
a70c00a789 feat(ACTIVITY-WP-0016-T03): resilient per-item report recovery with quarantine lane
When the whole-document parse + one retry still fail, report instructions now run
_resilient_report before the total-loss path. A brace/quote-aware scanner
(_extract_object_spans) recovers each recommendation object whether pretty-printed
across many lines or NDJSON one-per-line; a truncated tail gets a best-effort
_try_repair; _partition_items validates each recovered object against the T02 item
schema. Valid items survive (output_validated=True, partial=True), malformed/
over-maxItems items are quarantined with provenance (index, error, raw, reason),
capped at 20. Error locality now matches the unit of work: one bad item costs one
item, not the whole report.

Verified against the real 06-26 shape: 7 valid recommendations + a truncated tail
now recovers all 7 and quarantines the broken tail (previously the whole run was
discarded). Happy-path maxItems top-N enforcement is deferred to T04 (count caps).
Full suite: 215 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 17:56:28 +02:00
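A brace/quote-aware scan of the kind a70c00a789 describes might look like the sketch below. It is an illustration under stated assumptions rather than the real _extract_object_spans: the function name, the return shape, and the choice to hand it the text of the recommendations array (or NDJSON lines) instead of the enclosing document are all hypothetical.

```python
import json


def extract_object_spans(text):
    """Return (complete_objects, unterminated_tail).

    Walks the text once, tracking string state and brace depth, so braces
    inside string values do not confuse the scan. Works for a pretty-printed
    array body and for NDJSON, since both put item objects at depth 0.
    """
    spans = []
    depth = 0
    start = None
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False          # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i                # a new item object begins here
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 1])
                start = None
    # Anything still open at end of input is the truncated tail.
    tail = text[start:] if start is not None else ""
    return spans, tail


truncated = (
    '[\n  {"rank": 1, "why": "has } brace"},\n'
    '  {"rank": 2, "wsjf": {"score": 3}},\n'
    '  {"rank": 3, "why": "trunc'
)
objects, tail = extract_object_spans(truncated)
recovered = [json.loads(span) for span in objects]
```

Here the two complete objects parse independently and the cut-off third object is returned as the tail, which a repair step or a quarantine entry can then handle.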
b41b6034ee chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - update .custodian-brief.md for activity-core
2026-06-26 17:52:46 +02:00
960fb05268 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T03: todo → progress
2026-06-26 17:52:30 +02:00
b7b0b5bf6e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T02: todo → progress
2026-06-26 17:52:29 +02:00
14f76fb6d9 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-26:
  - ACTIVITY-WP-0016-T01: todo → wait
2026-06-26 17:52:28 +02:00
caa2608092 chore(consistency): renormalize lifecycle state [auto]
Updated by fix-consistency on 2026-06-26:
  - workplan status: proposed → active
2026-06-26 17:52:28 +02:00
61f278d643 feat(ACTIVITY-WP-0016-T02): strict bounded daily-triage output schema
Replace the accept-anything recommendations.items ({type: object}) with a strict
per-item contract (required [rank, candidate, action, why] + typed wsjf) and a
maxItems:7 hint. Strict item structure is what lets the T03 boundary parser
validate each recommendation independently and quarantine only malformed ones.

maxItems is a producer hint (prompt + llm-connect json_schema + T03 mitigation),
NOT a hard reject — a hard maxItems reject would discard a whole 16-item report,
the blast-radius bug WP-0016 removes. DEPLOY COUPLING: the strict schema is also
consumed by the current whole-doc validator, so it must ship with T03's per-item
quarantine parser; until then it increases whole-doc hard-fails. Prompt + max_tokens
headroom + NDJSON framing are documented as a runtime-bundle handoff.

Updated four tests to the strict contract; the forwarded-schema test now reads the
live schema file instead of hard-coding it. Full suite: 213 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 17:36:24 +02:00
0e9e18a59a chore(ACTIVITY-WP-0016-T01): record root-cause findings + partial failure fixture
Local analysis of the 2026-06-26 daily-triage validation failure: the unbounded
~1-recommendation-per-workstream list (16 active workstreams; JSON break at char
5268, ~rank 8-9) is the structural cause; both the first attempt and the retry
failed. The exact offending token and finish_reason are unrecoverable from
activity-core data — complete() drops finish_reason/usage, the report sink caps
raw output at 4000 chars (< 5268), and the log preview at 2000. Confirming the
exact token needs llm-connect producer-side logs on railiance01 (operator-owned);
mitigation (T02/T03) is identical regardless. Partial fixture captured.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 15:04:27 +02:00
5eb33bd3bb feat(ACTIVITY-WP-0016): register LLM output robustness & producer trust boundary workplan
Add WP-0016 to make the instruction-executor output contract robust after the
2026-06-26 daily-triage validation failure (one malformed delimiter discarded a
whole report). Per-item framing for error locality, verify-and-mitigate boundary
parsing with a quarantine lane, producer-trust-boundary guardrails (ADR-004), and
regression/calibration tests. Unblocks WP-0006-T03 / WP-0010-T04.

Also record the 06-26 recheck outcome (streak reset at two) in WP-0006-T03.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 14:39:21 +02:00
612c226472 chore(ACTIVITY-WP-0015): dedupe state_hub_workstream_id frontmatter
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:53:52 +02:00
0b2c68838e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-24:
  - update .custodian-brief.md for activity-core
2026-06-24 12:53:31 +02:00
4b5e96d7c1 feat(ACTIVITY-WP-0014): close workplan — catchup_latest deployed & verified on railiance01
T04 done: built+deployed the WP-0014 image to railiance01, applied catchup_latest
to daily-statehub-wsjf-triage, /admin/sync clean (6 defs, 4 schedules, 0 errors).
Live schedule verified OverlapPolicy=BufferOne, CatchupWindow=1d; pods healthy.
All tasks T01-T05 complete; beachhead-endpoint adoption tracked in WP-0015.
Workplan status -> finished.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:52:54 +02:00
65ef005c2d docs(ACTIVITY-WP-0014): close T05 in-repo; split beachhead adoption to WP-0015
Idempotent-writes half of T05 is done in-repo; the externally-blocked endpoint
adoption + actcore-state-hub-bridge proxy retirement move to ACTIVITY-WP-0015
(blocked on the state-hub beachhead) so WP-0014 can close on completed work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 12:41:21 +02:00
0e75aaec01 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 21:39:32 +02:00
b2e57707a7 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - ACTIVITY-WP-0014-T05: todo → progress
2026-06-23 21:39:28 +02:00
88fe359385 feat(ACTIVITY-WP-0014): idempotency-keyed State Hub writes (T05, in-repo part)
Add activity_core/state_hub_write: every State Hub write (report-sink,
ops-evidence, schedule-miss) now sends a stable Idempotency-Key header derived
from run_id:instruction_id:event_type. Makes writes safe to buffer/replay under
the future state-hub beachhead without duplicate progress/triage events. The
read-based _progress_exists dedup is now best-effort (returns False on connection
error instead of hard-failing), so the guarantee lives on the keyed write rather
than a live read. Tests + runbook note. Endpoint adoption / proxy retirement stays
blocked on the state-hub beachhead capability.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:38:46 +02:00
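One way to derive a stable key from run_id:instruction_id:event_type, as 88fe359385 describes, is sketched below. The commit does not say whether the joined string is sent as-is or hashed; the SHA-256 step, the function names, and the Content-Type header are assumptions made for illustration.

```python
import hashlib


def idempotency_key(run_id, instruction_id, event_type):
    """Same inputs always yield the same key, so a replayed write is recognisable."""
    material = f"{run_id}:{instruction_id}:{event_type}"
    # Hashing is an assumption; it only keeps the header short and opaque.
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def write_headers(run_id, instruction_id, event_type):
    """Headers for a hypothetical State Hub write request."""
    return {
        "Idempotency-Key": idempotency_key(run_id, instruction_id, event_type),
        "Content-Type": "application/json",
    }


first = write_headers("run-1", "daily-triage", "report")
replay = write_headers("run-1", "daily-triage", "report")
other = write_headers("run-1", "daily-triage", "schedule_miss")
```

Because the key depends only on the run, instruction, and event type, a buffered or retried write carries the same header as the original, while a different event type for the same run gets its own key.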
f90591c5f1 docs(ACTIVITY-WP-0014): rescope T05 to thin client under State Hub beachhead model
Resilience (queue/cache) is handed to custodian/state-hub as a per-machine
beachhead; activity-core keeps only idempotent writes + adopt-beachhead-endpoint
and retires its bespoke actcore-state-hub-bridge proxy. Proposal sent to state-hub.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 21:18:01 +02:00
cf7a11dcd9 docs(ACTIVITY-WP-0014): correct Motivation to match T01 findings
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 17:16:17 +02:00
99e5d525a8 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 17:15:41 +02:00
8424c13783 docs(ACTIVITY-WP-0014): T01 root cause — State Hub Connection refused, not misfire
Live inspection of railiance01 (ssh + in-node kubectl/temporal) overturns the
catchup_window hypothesis: the daily-triage schedule is healthy (CatchupWindow
365d default, 0 MissedCatchupWindow). The 2026-06-23T05:20Z fire ran but Failed
at the report sink with '[Errno 111] Connection refused' posting to State Hub.
railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
is unreachable at 07:20 Europe/Berlin (102 resolver timeouts in 24h). Mark T01
done; add T05 for resilient sinks/resolvers as the real incident fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 17:14:04 +02:00
864f90f9b9 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 14:27:54 +02:00
053d18b24a feat(ACTIVITY-WP-0014): missed-fire detection & alert sink (T03)
Add activity_core/schedule_health: a pure evaluate_schedule_health() verdict
(built on Temporal's num_actions_missed_catchup_window plus a staleness check),
an async check_schedule_health() reader, and post_missed_fire_alert() that emits
a schedule_miss State Hub progress event. Makes a missed fire visible even under
misfire_policy=skip, where Temporal drops it by design. Unit tests for the
verdict logic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 14:25:33 +02:00
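The pure verdict from 053d18b24a combines a missed-catchup counter with a staleness check. A minimal sketch of that idea follows; the signature, the result type, and the reason strings are assumptions, and the caller is expected to read the counter (Temporal's num_actions_missed_catchup_window) and the last action time from the schedule description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class ScheduleHealth:
    healthy: bool
    reasons: Tuple[str, ...]


def evaluate_schedule_health(
    missed_catchup_window: int,
    last_action_at: Optional[datetime],
    now: datetime,
    max_staleness: timedelta,
) -> ScheduleHealth:
    """Pure function: no I/O, so it can be unit-tested without a Temporal server."""
    reasons = []
    if missed_catchup_window > 0:
        reasons.append(f"missed_catchup_window={missed_catchup_window}")
    # Staleness catches fires dropped without incrementing the counter.
    if last_action_at is None:
        reasons.append("never_ran")
    elif now - last_action_at > max_staleness:
        reasons.append("stale")
    return ScheduleHealth(healthy=not reasons, reasons=tuple(reasons))


now = datetime(2026, 6, 23, 12, 0, tzinfo=timezone.utc)
fresh = evaluate_schedule_health(0, now - timedelta(hours=6), now, timedelta(hours=26))
stale = evaluate_schedule_health(0, now - timedelta(days=2), now, timedelta(hours=26))
missed = evaluate_schedule_health(1, now - timedelta(hours=6), now, timedelta(hours=26))
```

Keeping the verdict separate from the async reader means the alerting decision can be tested with plain values, as the commit's "unit tests for the verdict logic" suggests.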
77af65afb2 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - update .custodian-brief.md for activity-core
2026-06-23 14:17:14 +02:00
0495f8a43f chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-23:
  - ACTIVITY-WP-0014-T04: progress → wait
2026-06-23 14:17:06 +02:00
c6cad9e7b3 chore(consistency): renormalize lifecycle state [auto]
Updated by fix-consistency on 2026-06-23:
  - workplan status: proposed → active
2026-06-23 14:17:06 +02:00
a83b117f60 feat(ACTIVITY-WP-0014): explicit run-miss recovery policies (T02, T04)
Set Temporal catchup_window on cron schedules so a fire missed during a
worker/Temporal outage is no longer silently dropped. Redefine misfire_policy
into three explicit modes — skip, catchup_all, catchup_latest — mapping to
(catchup_window, overlap) pairs; legacy catchup/compress aliased. Add
catchup_window_seconds override. Remove the ad-hoc upsert-time 1h backfill in
favour of native catchup. Apply catchup_latest to daily-statehub-wsjf-triage in
the Railiance runtime manifest and document run-miss policies in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 14:15:45 +02:00
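The mapping from misfire_policy to a (catchup_window, overlap) pair in a83b117f60 could be expressed as a small lookup table. In this sketch only the catchup_latest row is grounded in the log (4b5e96d7c1 reports OverlapPolicy=BufferOne, CatchupWindow=1d on the live schedule); the skip and catchup_all values, the direction of the legacy aliases, and the function name are assumptions. Overlap values are plain strings named after Temporal's overlap policies so the example runs without the temporalio package.

```python
from datetime import timedelta

POLICIES = {
    "skip": (timedelta(minutes=1), "SKIP"),               # assumed values
    "catchup_all": (timedelta(days=1), "BUFFER_ALL"),     # assumed values
    "catchup_latest": (timedelta(days=1), "BUFFER_ONE"),  # as verified live in 4b5e96d7c1
}

# Which legacy name maps to which mode is an assumption.
LEGACY_ALIASES = {"catchup": "catchup_all", "compress": "catchup_latest"}


def resolve_misfire_policy(name, catchup_window_seconds=None):
    """Return (catchup_window, overlap) for a policy name or legacy alias."""
    canonical = LEGACY_ALIASES.get(name, name)
    if canonical not in POLICIES:
        raise ValueError(f"unknown misfire_policy: {name!r}")
    window, overlap = POLICIES[canonical]
    if catchup_window_seconds is not None:
        # An explicit override replaces the mode's default window.
        window = timedelta(seconds=catchup_window_seconds)
    return window, overlap


latest = resolve_misfire_policy("catchup_latest")
overridden = resolve_misfire_policy("catchup_latest", catchup_window_seconds=3600)
```

The returned pair would then be passed to the schedule policy when the schedule is built, replacing the ad-hoc upsert-time backfill the commit removes.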
ffc0ee2cb7 feat(ACTIVITY-WP-0014): plan schedule misfire robustness & run-miss options
Cron fires are silently dropped: _build_schedule() sets SchedulePolicy(overlap=)
but never catchup_window, so a brief worker/Temporal outage at trigger time drops
the fire with no recovery and no signal (root cause of missing 06-22/06-23 daily
triage runs). Define three explicit run-miss policies: skip, catchup_all,
catchup_latest, plus missed-fire detection.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 13:46:19 +02:00
59b3b73061 ui rules established
2026-06-22 23:03:40 +02:00
4bc5111dfd chore(consistency): apply state_hub_workstream_id writeback
Sync archived workplan frontmatter from State Hub fix-consistency.
2026-06-22 17:43:32 +02:00
e9a6029ded chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-22:
  - update .custodian-brief.md for activity-core
2026-06-22 16:50:01 +02:00
bf4e61f0bf feat(ACTIVITY-WP-0012): complete live admin-sync no-restart smoke
Ran Railiance01 cluster validation for POST /admin/sync without restarting
actcore-worker, added a repeatable smoke script, and closed the workplan.
2026-06-22 16:25:26 +02:00
40fa851ec0 fix(bridge): use /state/health for readiness probe
The actcore-state-hub-bridge readiness probe hit /state/summary through
the tunnel proxy chain. Cold-cache summary requests and intermittent
tunnel stalls routinely exceeded the 5s probe timeout (1584 failures
over 17h), leaving the pod 0/1 Ready and breaking hourly/triage sinks.

Use /state/health instead — same signal the ops inventory already
expects, and completes in ~30ms through the bridge.
2026-06-22 14:03:57 +02:00
e0742d18d7 Mark .repo-classification.yaml human-reviewed (CUST-WP-0050 T02)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 11:40:43 +02:00
ccac285b0a Reclassify as tooling (CUST-WP-0050 T02)
Apply the new 'tooling' category (reusable internal tooling/infrastructure)
from the Repo Classification Standard. First-pass agent classification.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 03:06:01 +02:00
a0dcc52353 Add repo classification (CUST-WP-0050 T02)
First-pass agent classification per the Repo Classification Standard v1.0
(canon-repo-classification); pending human review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 02:44:46 +02:00
faf5d60ae8 feat(STATE-WP-0064): enable cluster consistency sweep schedule
Enable the definition in k8s projection and pass activity-core source tags.
2026-06-21 21:46:43 +02:00
adfd1a9067 fix(STATE-WP-0064): allow 360s POST timeout on state-hub bridge proxy
Consistency sweeps exceed the previous 30s urllib timeout when triggered from
Railiance01 activity-core through actcore-state-hub-bridge.
2026-06-21 20:56:35 +02:00
44987457c1 chore: add make sync-schedules target for Temporal schedule reconcile
Wraps python -m activity_core.sync_schedules for operator discoverability.
2026-06-21 20:28:04 +02:00
3a981cc98f feat(STATE-WP-0064): wire consistency_sweep_remote_all state-hub query
Add POST /consistency/sweep/remote-all resolver support with a 330s
timeout and k8s projection for the consistency sweep definition.
2026-06-21 20:19:22 +02:00
dbd2fbb11c docs(workplan): record railiance01 llm-connect smoke evidence
Note the 2026-06-19 live reconciliation on railiance01: llm-connect
deployed, worker restarted with LLM_CONNECT_URL, fixture smoke passed.
Manual daily triage still blocked on actcore-state-hub-bridge reachability.
2026-06-19 15:58:04 +02:00
c938b80503 chore(kaizen): demote coach/optimization to weekly operate cadence
After coulomb-loop bootstrap E2E (3/3 cycles on 2026-06-18), revert
activity-core from experimental daily crons to weekly Monday schedules
so discover_kaizen_scheduled_repos(cadence=weekly) matches the
operate-phase ActivityDefinitions. Drop the disabled tdd-workflow stub.
2026-06-19 11:32:36 +02:00
3e93567a53 Add admin sync hot reload path
2026-06-19 01:54:13 +02:00
6f68f8f9ec chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-19:
  - update .custodian-brief.md for activity-core
2026-06-19 01:52:52 +02:00
f05c56e202 fix(issue-sink): stringify triggering_event_id before JSON encode
IssueCoreRestSink.emit() passed task_spec.triggering_event_id straight
into the httpx json= payload. When the field is a UUID object (rather
than a string), httpx's JSON encoder raised
"TypeError: Object of type UUID is not JSON serializable", failing the
emission. Guard with str(), preserving None for optional event ids.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 00:15:03 +02:00
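The failure fixed in f05c56e202 is easy to reproduce with the standard json module, which has no default encoding for UUID objects. The sketch below shows the error and a str() guard that preserves None; the helper name and payload shape are hypothetical and not taken from IssueCoreRestSink.

```python
import json
import uuid


def event_id_for_payload(triggering_event_id):
    """Stringify a UUID (or any id) for JSON, keeping None for optional ids."""
    return None if triggering_event_id is None else str(triggering_event_id)


event_id = uuid.UUID("12345678-1234-5678-1234-567812345678")

try:
    json.dumps({"triggering_event_id": event_id})
    unguarded_failed = False
except TypeError:
    # "Object of type UUID is not JSON serializable"
    unguarded_failed = True

guarded = json.dumps({"triggering_event_id": event_id_for_payload(event_id)})
optional = json.dumps({"triggering_event_id": event_id_for_payload(None)})
```

Guarding at the point where the payload is assembled keeps the sink tolerant of callers that pass either a UUID or a string.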
200ec0c97a Add credential routing instructions for all agent runtimes
Propagate shared credential-routing section (Codex, Claude, Grok, llm-connect)
from state-hub template via scripts/propagate_credential_routing.py.
2026-06-18 22:48:37 +02:00