---
id: ACTIVITY-WP-0012
type: workplan
title: "Definition And Schedule Hot Reload"
domain: custodian
repo: activity-core
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-18"
updated: "2026-06-22"
state_hub_workstream_id: "8887075e-21ec-451b-b82b-cd81035c9ca5"
---

# ACTIVITY-WP-0012 - Definition And Schedule Hot Reload

## Context

State Hub message `f4876517-f738-4571-a2d6-76f2965e9a13` from `coulomb-loop` reports an operational gap from the Coulomb cadence ramp: after renaming customer definitions from hourly to daily, operators had to run definition/schedule sync and restart the worker before new Temporal schedule state was reliable.

Current behavior:

- `worker.py` runs `sync_activity_definitions` and `sync_schedules` once at startup.
- `RunActivityWorkflow` loads ActivityDefinitions from the DB at activity time.
- The event router reloads enabled event definitions per NATS message.
- Cron schedule changes only take effect when `sync_schedules` runs.

This belongs in activity-core because the repo owns ActivityDefinition sync, Temporal schedule projection, and the admin API. The first implementation should expose an operator-triggered sync path without turning activity-core into a repo checkout manager or CI system.

## Extract Reusable Sync Service

```task
id: ACTIVITY-WP-0012-T01
status: done
priority: high
state_hub_task_id: "53a7970b-7eec-47f5-ad30-bbd7c6271952"
```

Refactor the worker-startup sync sequence into a reusable async service that can be called by startup and the API.

Done when:

- the service can run ActivityDefinition sync, event type sync, and Temporal schedule sync independently based on booleans;
- it accepts the existing DB session factory / Temporal client dependencies without creating hidden global state;
- startup behavior remains unchanged except for calling the shared service;
- failures are collected into a bounded `errors[]` result while preserving the current startup best-effort behavior.

2026-06-19: Completed.
Added `activity_core.sync_service.run_sync`, which orchestrates ActivityDefinition, event type, and schedule sync independently from explicit DB session factory and Temporal client dependencies. Worker startup now calls the shared service for definitions+schedules and logs bounded stage errors while continuing startup.

## Add Admin Sync Endpoint

```task
id: ACTIVITY-WP-0012-T02
status: done
priority: high
state_hub_task_id: "8697c761-15d1-4da0-b66b-d838218a2495"
```

Add an operator-only API endpoint:

`POST /admin/sync?definitions=true&schedules=true&event_types=true`

Done when:

- the endpoint runs the shared sync service without requiring worker restart;
- response JSON reports counts for definitions, event types, schedules upserted, schedules paused/deleted, and errors;
- default parameters sync definitions and schedules, with event types opt-in or clearly documented;
- endpoint tests cover definitions-only, schedules-only, all-sync, and failure result behavior.

2026-06-19: Completed. Added `POST /admin/sync` with defaults `definitions=true`, `schedules=true`, and `event_types=false`. The response reports definition/event counts, schedule upsert/pause/orphan-delete counts, and bounded `errors[]`. Tests cover definitions-only, schedules-only, all-sync, and failure-result behavior.

## Preserve Schedule Drift Semantics

```task
id: ACTIVITY-WP-0012-T03
status: done
priority: high
state_hub_task_id: "efeac412-632c-4c90-9428-bb575ac7a624"
```

Make the sync result explicit enough for cadence changes and renames.

Done when:

- disabled cron definitions pause their Temporal schedules on sync;
- renamed definitions create the new schedule and pause/delete orphaned old schedules according to the existing `sync_schedules` semantics;
- event-triggered definitions remain hot through the existing router DB reload path;
- regression tests demonstrate the Coulomb hourly-to-daily rename shape without needing a worker restart.

2026-06-19: Completed.
`sync_schedules` now returns explicit counts for enabled schedule upserts, disabled schedule pauses, and orphan deletes. Regression tests cover the hourly-to-daily rename shape: a new enabled cron schedule is upserted, the old disabled cron schedule is preserved as paused, unrelated orphan schedules are deleted, event-triggered definitions do not create schedules, and one-shot scheduled definitions are no longer mistaken for orphans.

## Optional Background Sync Loop

```task
id: ACTIVITY-WP-0012-T04
status: done
priority: medium
state_hub_task_id: "d774087b-c51d-4444-8e90-bfef43765456"
```

Decide whether to add a periodic sync loop after the admin endpoint exists.

Done when:

- either `ACTIVITY_SYNC_INTERVAL_SECONDS` is implemented with a default disabled or conservative interval, or the workplan records why manual/admin-triggered sync is the safer v1 posture;
- if implemented, logs and metrics expose the last successful sync timestamp and last error summary;
- the loop does not block worker startup or workflow task processing.

2026-06-19: Completed by decision. v1 stays manual/operator-triggered through `POST /admin/sync`; no background loop was added. The runbook records this posture so customer definition changes stay explicit and the worker does not start background repo scanning. A periodic loop remains a future option if live operator use proves it is needed.

## Live No-Restart Smoke

```task
id: ACTIVITY-WP-0012-T05
status: done
priority: high
state_hub_task_id: "68a0e22a-106a-4d21-9f39-c6279850cb5e"
```

Validate the hot-reload path in the cluster/operator environment.

Done when non-secret State Hub evidence shows:

- a customer repo definition rename or `enabled` flip is synced through `/admin/sync`;
- new Temporal schedules are active and retired schedules are paused/deleted without worker SIGTERM or pod restart;
- event-triggered definitions still fire normally;
- rollback or repeat sync is idempotent.

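
The idempotency criterion above can be checked mechanically by comparing the schedule counts of two consecutive sync responses. The following is a minimal sketch, not the project's smoke script: the key names `upserted`, `paused`, `deleted_orphans`, and `errors` are assumptions borrowed from the counts quoted in this workplan, and the real response schema may nest or name them differently.

```python
from typing import Any, Mapping

# Assumed key names (hypothetical); adjust to the actual /admin/sync response schema.
SCHEDULE_COUNT_KEYS = ("upserted", "paused", "deleted_orphans")


def schedule_counts(response: Mapping[str, Any]) -> dict[str, int]:
    """Extract only the schedule counters from a sync response body."""
    return {key: int(response.get(key, 0)) for key in SCHEDULE_COUNT_KEYS}


def is_idempotent(first: Mapping[str, Any], second: Mapping[str, Any]) -> bool:
    """True when a repeat sync reports the same schedule counts and no errors."""
    no_errors = not first.get("errors") and not second.get("errors")
    return no_errors and schedule_counts(first) == schedule_counts(second)


if __name__ == "__main__":
    first = {"upserted": 5, "paused": 1, "deleted_orphans": 0, "errors": []}
    repeat = dict(first)  # a repeat sync that changed nothing
    print(is_idempotent(first, repeat))
```

Comparing counts alone is a coarse signal: two syncs could report equal counts while touching different schedules, so a stricter check would also compare schedule identifiers if the response exposes them.
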

2026-06-22: Completed on Railiance01 (`KUBECONFIG=~/.kube/config-hosteurope`). Smoke target: disabled projection `ops-service-inventory-probes` (`40d15a87-7ff6-4d8e-992c-37df15f95110`) in `actcore-external-activity-definitions`.

Evidence:

- ConfigMap flip `enabled: false -> true` and cadence `15 * * * * -> 25 * * * *`, then `POST /admin/sync?definitions=true&schedules=true` from `actcore-api`.
- DB after sync: `enabled=true`, `cron=25 * * * *`.
- Temporal schedule after sync: `paused=false`, calendar minute `25`.
- Repeat sync returned identical schedule counts (`upserted=5`, `paused=1`, `deleted_orphans=0`), confirming idempotency.
- Rollback flip restored `enabled=false`, `cron=15 * * * *`, schedule `paused=true`, calendar minute `15`.
- `actcore-worker` pod UID unchanged (`a68d6539-2bba-457e-a78a-39564002a980`, started `2026-06-21T18:46:46Z`); `actcore-event-router` pod UID unchanged.
- Event-triggered definitions: none projected on Railiance01 today; hot DB reload path for event definitions remains covered by T03 unit tests and an unchanged event-router deployment.

Automation: `scripts/smoke_admin_sync_no_restart.py`. Runbook section added under "Railiance01 no-restart smoke".
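
For operators repeating the smoke by hand, the sync request itself can be sketched with the standard library alone. This is an illustrative sketch, not the contents of `scripts/smoke_admin_sync_no_restart.py`: the base URL is a hypothetical placeholder, the helper names are invented here, and operator authentication is omitted because this workplan does not describe it. The flag defaults mirror the documented endpoint defaults.

```python
import json
import urllib.parse
import urllib.request


def build_sync_url(
    base_url: str,
    definitions: bool = True,
    schedules: bool = True,
    event_types: bool = False,
) -> str:
    """Build the /admin/sync URL with lowercase boolean query flags."""
    query = urllib.parse.urlencode(
        {
            "definitions": str(definitions).lower(),
            "schedules": str(schedules).lower(),
            "event_types": str(event_types).lower(),
        }
    )
    return f"{base_url.rstrip('/')}/admin/sync?{query}"


def trigger_sync(base_url: str, **flags: bool) -> dict:
    """POST to the sync endpoint and return the decoded JSON body.

    Authentication headers are intentionally left out (assumption: added by the caller).
    """
    request = urllib.request.Request(build_sync_url(base_url, **flags), method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical in-cluster address; replace with the real actcore-api URL.
    print(build_sync_url("http://actcore-api.example.internal:8000"))
```

Keeping URL construction separate from the network call lets the flag handling be tested without a live API.
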