activity-core/workplans/ACTIVITY-WP-0012-definition-schedule-hot-reload.md
tegwick bf4e61f0bf feat(ACTIVITY-WP-0012): complete live admin-sync no-restart smoke
Ran Railiance01 cluster validation for POST /admin/sync without restarting
actcore-worker, added a repeatable smoke script, and closed the workplan.
2026-06-22 16:25:26 +02:00

id: ACTIVITY-WP-0012
type: workplan
title: Definition And Schedule Hot Reload
domain: custodian
repo: activity-core
status: finished
owner: codex
topic_slug: custodian
created: 2026-06-18
updated: 2026-06-22
state_hub_workstream_id: "8887075e-21ec-451b-b82b-cd81035c9ca5"

ACTIVITY-WP-0012 - Definition And Schedule Hot Reload

Context

State Hub message f4876517-f738-4571-a2d6-76f2965e9a13 from coulomb-loop reports an operational gap from the Coulomb cadence ramp: after renaming customer definitions from hourly to daily, operators had to run definition/schedule sync and restart the worker before new Temporal schedule state was reliable.

Current behavior:

  • worker.py runs sync_activity_definitions and sync_schedules once at startup.
  • RunActivityWorkflow loads ActivityDefinitions from the DB at activity time.
  • The event router reloads enabled event definitions per NATS message.
  • Cron schedule changes only take effect when sync_schedules runs.

This belongs in activity-core because the repo owns ActivityDefinition sync, Temporal schedule projection, and the admin API. The first implementation should expose an operator-triggered sync path without turning activity-core into a repo checkout manager or CI system.

Extract Reusable Sync Service

id: ACTIVITY-WP-0012-T01
status: done
priority: high
state_hub_task_id: "53a7970b-7eec-47f5-ad30-bbd7c6271952"

Refactor the worker-startup sync sequence into a reusable async service that can be called by startup and the API.

Done when:

  • the service can run ActivityDefinition sync, event type sync, and Temporal schedule sync independently based on booleans;
  • it accepts the existing DB session factory / Temporal client dependencies without creating hidden global state;
  • startup behavior remains unchanged except for calling the shared service;
  • failures are collected into a bounded errors[] result while preserving the current startup best-effort behavior.

2026-06-19: Completed. Added activity_core.sync_service.run_sync, which orchestrates ActivityDefinition, event type, and schedule sync independently from explicit DB session factory and Temporal client dependencies. Worker startup now calls the shared service for definitions+schedules and logs bounded stage errors while continuing startup.
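
The shape of such a service can be sketched as below. This is a minimal illustration, not the actual activity_core.sync_service.run_sync: the function name run_sync_sketch, the SyncResult fields, the stage callables, and the MAX_ERRORS bound are all assumptions made for the example. It shows stages selected by booleans, dependencies passed in explicitly, and failures collected into a bounded errors list instead of aborting.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable

MAX_ERRORS = 20  # assumed bound; the workplan only says errors[] is bounded

Stage = Callable[[], Awaitable[None]]


@dataclass
class SyncResult:
    """Hypothetical result shape: which stages succeeded, which failed."""
    stages_run: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


async def run_sync_sketch(
    stages: dict[str, Stage],
    *,
    definitions: bool = True,
    schedules: bool = True,
    event_types: bool = False,
) -> SyncResult:
    """Run each selected stage in order; record failures instead of raising."""
    selected = {
        "definitions": definitions,
        "event_types": event_types,
        "schedules": schedules,
    }
    result = SyncResult()
    for name, enabled in selected.items():
        if not enabled:
            continue
        try:
            # Each stage is injected by the caller, so no global state is needed.
            await stages[name]()
            result.stages_run.append(name)
        except Exception as exc:  # best-effort: keep going after a failed stage
            if len(result.errors) < MAX_ERRORS:
                result.errors.append(f"{name}: {exc}")
    return result


if __name__ == "__main__":
    async def ok() -> None:
        return None

    async def failing() -> None:
        raise RuntimeError("temporal unavailable")

    demo = {"definitions": ok, "event_types": ok, "schedules": failing}
    print(asyncio.run(run_sync_sketch(demo)))
```

Because the stages are passed in as callables, worker startup and the API can share the same function while supplying their own DB session factory and Temporal client wiring.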

Add Admin Sync Endpoint

id: ACTIVITY-WP-0012-T02
status: done
priority: high
state_hub_task_id: "8697c761-15d1-4da0-b66b-d838218a2495"

Add an operator-only API endpoint:

POST /admin/sync?definitions=true&schedules=true&event_types=true

Done when:

  • the endpoint runs the shared sync service without requiring worker restart;
  • response JSON reports counts for definitions, event types, schedules upserted, schedules paused/deleted, and errors;
  • default parameters sync definitions and schedules, with event types opt-in or clearly documented;
  • endpoint tests cover definitions-only, schedules-only, all-sync, and failure result behavior.

2026-06-19: Completed. Added POST /admin/sync with defaults definitions=true, schedules=true, and event_types=false. The response reports definition/event counts, schedule upsert/pause/orphan-delete counts, and bounded errors[]. Tests cover definitions-only, schedules-only, all-sync, and failure-result behavior.
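
An operator-side call could look roughly like the sketch below. The query parameters and their defaults come from this workplan; the helper names, the base URL, and the timeout are assumptions, and authentication is omitted because the workplan does not describe how the operator-only restriction is enforced.

```python
import json
import urllib.request
from urllib.parse import urlencode


def admin_sync_url(
    base_url: str,
    *,
    definitions: bool = True,
    schedules: bool = True,
    event_types: bool = False,
) -> str:
    """Build the POST /admin/sync URL, mirroring the documented defaults."""
    query = urlencode(
        {
            "definitions": str(definitions).lower(),
            "schedules": str(schedules).lower(),
            "event_types": str(event_types).lower(),
        }
    )
    return f"{base_url.rstrip('/')}/admin/sync?{query}"


def post_admin_sync(base_url: str, timeout: float = 30.0, **flags: bool) -> dict:
    """Send the sync request and return the decoded JSON body.

    Hypothetical helper: no auth headers are set here.
    """
    request = urllib.request.Request(admin_sync_url(base_url, **flags), method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, admin_sync_url("http://actcore-api:8000", event_types=True) opts in to event type sync while leaving the other two stages at their defaults; the host and port are placeholders.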

Preserve Schedule Drift Semantics

id: ACTIVITY-WP-0012-T03
status: done
priority: high
state_hub_task_id: "efeac412-632c-4c90-9428-bb575ac7a624"

Make the sync result explicit enough for cadence changes and renames.

Done when:

  • disabled cron definitions pause their Temporal schedules on sync;
  • renamed definitions create the new schedule and pause/delete orphaned old schedules according to the existing sync_schedules semantics;
  • event-triggered definitions remain hot through the existing router DB reload path;
  • regression tests demonstrate the Coulomb hourly-to-daily rename shape without needing a worker restart.

2026-06-19: Completed. sync_schedules now returns explicit counts for enabled schedule upserts, disabled schedule pauses, and orphan deletes. Regression tests cover the hourly-to-daily rename shape: a new enabled cron schedule is upserted, the old disabled cron schedule is preserved as paused, unrelated orphan schedules are deleted, event-triggered definitions do not create schedules, and one-shot scheduled definitions are no longer mistaken for orphans.
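
The drift rules described above can be illustrated with a pure planning function. This is a sketch under assumptions, not the real sync_schedules: the Definition fields, the trigger labels ("cron", "event", "one_shot"), and the schedule IDs are hypothetical, and the actual implementation talks to Temporal rather than returning lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    schedule_id: str
    trigger: str  # assumed labels: "cron", "event", or "one_shot"
    enabled: bool


def plan_schedule_sync(
    definitions: list[Definition],
    existing_schedule_ids: list[str],
) -> dict[str, list[str]]:
    """Decide which schedules to upsert, pause, or delete as orphans."""
    upserted: list[str] = []
    paused: list[str] = []
    owned: set[str] = set()
    for definition in definitions:
        if definition.trigger == "event":
            continue  # event-triggered definitions never create schedules
        owned.add(definition.schedule_id)
        if definition.trigger != "cron":
            continue  # one-shot schedules are owned, so they are not orphans
        if definition.enabled:
            upserted.append(definition.schedule_id)
        else:
            paused.append(definition.schedule_id)
    # Anything in Temporal that no definition owns is treated as an orphan.
    deleted_orphans = sorted(set(existing_schedule_ids) - owned)
    return {"upserted": upserted, "paused": paused, "deleted_orphans": deleted_orphans}


# Hourly-to-daily rename shape: the old definition stays but is disabled.
plan = plan_schedule_sync(
    [
        Definition("coulomb-daily", "cron", enabled=True),
        Definition("coulomb-hourly", "cron", enabled=False),
        Definition("on-ticket-created", "event", enabled=True),
        Definition("one-off-backfill", "one_shot", enabled=True),
    ],
    existing_schedule_ids=["coulomb-hourly", "one-off-backfill", "stale-unrelated"],
)
```

In this example the new daily schedule is upserted, the old hourly schedule is paused rather than deleted, and only the unrelated stale schedule is removed as an orphan.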

Optional Background Sync Loop

id: ACTIVITY-WP-0012-T04
status: done
priority: medium
state_hub_task_id: "d774087b-c51d-4444-8e90-bfef43765456"

Decide whether to add a periodic sync loop after the admin endpoint exists.

Done when:

  • either ACTIVITY_SYNC_INTERVAL_SECONDS is implemented with a default disabled or conservative interval, or the workplan records why manual/admin-triggered sync is the safer v1 posture;
  • if implemented, logs and metrics expose the last successful sync timestamp and last error summary;
  • the loop does not block worker startup or workflow task processing.

2026-06-19: Completed by decision. v1 stays manual/operator-triggered through POST /admin/sync; no background loop was added. The runbook records this posture so customer definition changes stay explicit and the worker does not start background repo scanning. A periodic loop remains a future option if live operator use proves it is needed.

Live No-Restart Smoke

id: ACTIVITY-WP-0012-T05
status: done
priority: high
state_hub_task_id: "68a0e22a-106a-4d21-9f39-c6279850cb5e"

Validate the hot-reload path in the cluster/operator environment.

Done when non-secret State Hub evidence shows:

  • a customer repo definition rename or enabled flip is synced through /admin/sync;
  • new Temporal schedules are active and retired schedules are paused/deleted without worker SIGTERM or pod restart;
  • event-triggered definitions still fire normally;
  • rollback or repeat sync is idempotent.

2026-06-22: Completed on Railiance01 (KUBECONFIG=~/.kube/config-hosteurope).

Smoke target: disabled projection ops-service-inventory-probes (40d15a87-7ff6-4d8e-992c-37df15f95110) in actcore-external-activity-definitions.

Evidence:

  • ConfigMap flip enabled: false -> true and cadence 15 * * * * -> 25 * * * *, then POST /admin/sync?definitions=true&schedules=true from actcore-api.
  • DB after sync: enabled=true, cron=25 * * * *.
  • Temporal schedule after sync: paused=false, calendar minute 25.
  • Repeat sync returned identical schedule counts (upserted=5, paused=1, deleted_orphans=0), confirming idempotency.
  • Rollback flip restored enabled=false, cron=15 * * * *, schedule paused=true, calendar minute 15.
  • actcore-worker pod UID unchanged (a68d6539-2bba-457e-a78a-39564002a980, started 2026-06-21T18:46:46Z); actcore-event-router pod UID unchanged.
  • Event-triggered definitions: none projected on Railiance01 today; hot DB reload path for event definitions remains covered by T03 unit tests and an unchanged event-router deployment.
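
The idempotency and no-restart checks in the evidence above can be expressed as a small comparison. The sketch below is illustrative only and is not the contents of scripts/smoke_admin_sync_no_restart.py; the function name and the flat response shape (count keys plus an errors list) are assumptions.

```python
COUNT_KEYS = ("upserted", "paused", "deleted_orphans")


def check_no_restart_smoke(
    first_sync: dict,
    repeat_sync: dict,
    pod_uid_before: str,
    pod_uid_after: str,
) -> list[str]:
    """Return a list of failed checks; an empty list means the smoke passed."""
    failures: list[str] = []
    # Idempotency: a repeat sync must report the same schedule counts.
    if any(first_sync.get(key) != repeat_sync.get(key) for key in COUNT_KEYS):
        failures.append("repeat sync changed schedule counts")
    # No restart: the worker pod UID must be the same before and after.
    if pod_uid_before != pod_uid_after:
        failures.append("worker pod UID changed")
    # Either sync reporting errors fails the smoke.
    if first_sync.get("errors") or repeat_sync.get("errors"):
        failures.append("sync reported errors")
    return failures


# Values taken from the recorded evidence.
counts = {"upserted": 5, "paused": 1, "deleted_orphans": 0, "errors": []}
worker_uid = "a68d6539-2bba-457e-a78a-39564002a980"
failures = check_no_restart_smoke(counts, dict(counts), worker_uid, worker_uid)
```

Returning a list of failures rather than raising on the first one lets a single run report every broken expectation.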

Automation: scripts/smoke_admin_sync_no_restart.py. Runbook section added under "Railiance01 no-restart smoke".