Idempotent-writes half of T05 is done in-repo; the externally-blocked endpoint
adoption + actcore-state-hub-bridge proxy retirement move to ACTIVITY-WP-0015
(blocked on the state-hub beachhead) so WP-0014 can close on completed work.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add activity_core/state_hub_write: every State Hub write (report-sink,
ops-evidence, schedule-miss) now sends a stable Idempotency-Key header derived
from run_id:instruction_id:event_type. Makes writes safe to buffer/replay under
the future state-hub beachhead without duplicate progress/triage events. The
read-based _progress_exists dedup is now best-effort (returns False on connection
error instead of hard-failing), so the guarantee lives on the keyed write rather
than a live read. Tests + runbook note. Endpoint adoption / proxy retirement stays
blocked on the state-hub beachhead capability.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resilience (queue/cache) is handed to custodian/state-hub as a per-machine
beachhead; activity-core keeps only idempotent writes + adopt-beachhead-endpoint
and retires its bespoke actcore-state-hub-bridge proxy. Proposal sent to state-hub.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Live inspection of railiance01 (ssh + in-node kubectl/temporal) overturns the
catchup_window hypothesis: the daily-triage schedule is healthy (CatchupWindow
365d default, 0 MissedCatchupWindow). The 2026-06-23T05:20Z fire ran but Failed
at the report sink with '[Errno 111] Connection refused' posting to State Hub.
railiance01 reaches State Hub via a reverse tunnel back to the workstation, which
is unreachable at 07:20 Europe/Berlin (102 resolver timeouts in 24h). Mark T01
done; add T05 for resilient sinks/resolvers as the real incident fix.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add activity_core/schedule_health: a pure evaluate_schedule_health() verdict
(built on Temporal's num_actions_missed_catchup_window plus a staleness check),
an async check_schedule_health() reader, and post_missed_fire_alert() that emits
a schedule_miss State Hub progress event. Makes a missed fire visible even under
misfire_policy=skip, where Temporal drops it by design. Unit tests for the
verdict logic.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Set Temporal catchup_window on cron schedules so a fire missed during a
worker/Temporal outage is no longer silently dropped. Redefine misfire_policy
into three explicit modes — skip, catchup_all, catchup_latest — mapping to
(catchup_window, overlap) pairs; legacy catchup/compress aliased. Add
catchup_window_seconds override. Remove the ad-hoc upsert-time 1h backfill in
favour of native catchup. Apply catchup_latest to daily-statehub-wsjf-triage in
the Railiance runtime manifest and document run-miss policies in the runbook.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cron fires are silently dropped: _build_schedule() sets SchedulePolicy(overlap=)
but never catchup_window, so a brief worker/Temporal outage at trigger time drops
the fire with no recovery and no signal (root cause of missing 06-22/06-23 daily
triage runs). Define three explicit run-miss policies: skip, catchup_all,
catchup_latest, plus missed-fire detection.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The actcore-state-hub-bridge readiness probe hit /state/summary through
the tunnel proxy chain. Cold-cache summary requests and intermittent
tunnel stalls routinely exceeded the 5s probe timeout (1584 failures
over 17h), leaving the pod 0/1 Ready and breaking hourly/triage sinks.
Use /state/health instead — same signal the ops inventory already
expects, and completes in ~30ms through the bridge.
Apply the new 'tooling' category (reusable internal tooling/infrastructure)
from the Repo Classification Standard. First-pass agent classification.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First-pass agent classification per the Repo Classification Standard v1.0
(canon-repo-classification); pending human review.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Note the 2026-06-19 live reconciliation on railiance01: llm-connect
deployed, worker restarted with LLM_CONNECT_URL, fixture smoke passed.
Manual daily triage still blocked on actcore-state-hub-bridge reachability.
After coulomb-loop bootstrap E2E (3/3 cycles on 2026-06-18), revert
activity-core from experimental daily crons to weekly Monday schedules
so discover_kaizen_scheduled_repos(cadence=weekly) matches the
operate-phase ActivityDefinitions. Drop the disabled tdd-workflow stub.
IssueCoreRestSink.emit() passed task_spec.triggering_event_id straight
into the httpx json= payload. When the field is a UUID object (rather
than a string), httpx's JSON encoder raised
"TypeError: Object of type UUID is not JSON serializable", failing the
emission. Guard with str(), preserving None for optional event ids.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add ISSUE_CORE_URL, ISSUE_CORE_API_KEY, and ISSUE_SINK_TYPE guidance so
agents pair keys locally or via OpenBao instead of requesting them from
ops-warden.
Issue-core requires a shared ingestion key on POST /issues/. The REST sink
now sends Authorization: Bearer using ISSUE_CORE_API_KEY and fails fast
when the key is missing under ISSUE_SINK_TYPE=rest.
Updates .env.example, emission boundary docs, and unit tests for the
header contract and missing-key error.