generated from coulomb/repo-seed
Ran Railiance01 cluster validation for POST /admin/sync without restarting actcore-worker, added a repeatable smoke script, and closed the workplan.
---
id: ACTIVITY-WP-0012
type: workplan
title: "Definition And Schedule Hot Reload"
domain: custodian
repo: activity-core
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-18"
updated: "2026-06-22"
state_hub_workstream_id: "8887075e-21ec-451b-b82b-cd81035c9ca5"
---

# ACTIVITY-WP-0012 - Definition And Schedule Hot Reload

## Context

State Hub message `f4876517-f738-4571-a2d6-76f2965e9a13` from
`coulomb-loop` reports an operational gap from the Coulomb cadence ramp: after
renaming customer definitions from hourly to daily, operators had to run
definition/schedule sync and restart the worker before new Temporal schedule
state was reliable.

Current behavior:

- `worker.py` runs `sync_activity_definitions` and `sync_schedules` once at
  startup.
- `RunActivityWorkflow` loads ActivityDefinitions from the DB at activity time.
- The event router reloads enabled event definitions per NATS message.
- Cron schedule changes only take effect when `sync_schedules` runs.

This belongs in activity-core because the repo owns ActivityDefinition sync,
Temporal schedule projection, and the admin API. The first implementation
should expose an operator-triggered sync path without turning activity-core into
a repo checkout manager or CI system.

## Extract Reusable Sync Service

```task
id: ACTIVITY-WP-0012-T01
status: done
priority: high
state_hub_task_id: "53a7970b-7eec-47f5-ad30-bbd7c6271952"
```

Refactor the worker-startup sync sequence into a reusable async service that can
be called by startup and the API.

Done when:

- the service can run ActivityDefinition sync, event type sync, and Temporal
  schedule sync independently based on booleans;
- it accepts the existing DB session factory / Temporal client dependencies
  without creating hidden global state;
- startup behavior remains unchanged except for calling the shared service;
- failures are collected into a bounded `errors[]` result while preserving the
  current startup best-effort behavior.

2026-06-19: Completed. Added `activity_core.sync_service.run_sync`, which
orchestrates ActivityDefinition, event type, and schedule sync independently
from explicit DB session factory and Temporal client dependencies. Worker
startup now calls the shared service for definitions+schedules and logs bounded
stage errors while continuing startup.

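
The shape of such a service can be sketched as follows. This is an illustrative sketch only, not the code in `activity_core.sync_service`: the name `run_sync_sketch`, the `SyncResult` fields, the `stages` mapping, the stage order, and the `MAX_ERRORS` bound are all assumptions made for the example. Dependencies are passed in explicitly (here as pre-bound coroutine functions) rather than read from global state, and a failing stage is recorded instead of aborting the remaining stages.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable

MAX_ERRORS = 20  # assumed bound; the workplan only says errors[] is bounded

# A stage is any zero-argument coroutine function, already bound to its
# DB session factory / Temporal client by the caller (hypothetical shape).
Stage = Callable[[], Awaitable[None]]


@dataclass
class SyncResult:
    stages_run: list = field(default_factory=list)
    errors: list = field(default_factory=list)


async def run_sync_sketch(
    stages: dict,
    *,
    definitions: bool = True,
    schedules: bool = True,
    event_types: bool = False,
) -> SyncResult:
    """Run each requested stage; collect failures instead of raising."""
    # Assumed order: definitions, then event types, then schedules.
    wanted = {
        "definitions": definitions,
        "event_types": event_types,
        "schedules": schedules,
    }
    result = SyncResult()
    for name, enabled in wanted.items():
        if not enabled or name not in stages:
            continue
        try:
            await stages[name]()
            result.stages_run.append(name)
        except Exception as exc:  # best-effort: keep going after a failure
            if len(result.errors) < MAX_ERRORS:
                result.errors.append(f"{name}: {exc}")
    return result
```

Under these assumptions, worker startup and an API handler would both call the same function with different flags, and either caller can log or return `errors` without the sync itself raising.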
## Add Admin Sync Endpoint

```task
id: ACTIVITY-WP-0012-T02
status: done
priority: high
state_hub_task_id: "8697c761-15d1-4da0-b66b-d838218a2495"
```

Add an operator-only API endpoint:

`POST /admin/sync?definitions=true&schedules=true&event_types=true`

Done when:

- the endpoint runs the shared sync service without requiring worker restart;
- response JSON reports counts for definitions, event types, schedules upserted,
  schedules paused/deleted, and errors;
- default parameters sync definitions and schedules, with event types opt-in or
  clearly documented;
- endpoint tests cover definitions-only, schedules-only, all-sync, and failure
  result behavior.

2026-06-19: Completed. Added `POST /admin/sync` with defaults
`definitions=true`, `schedules=true`, and `event_types=false`. The response
reports definition/event counts, schedule upsert/pause/orphan-delete counts, and
bounded `errors[]`. Tests cover definitions-only, schedules-only, all-sync, and
failure-result behavior.

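
The documented defaults can be illustrated with a small, framework-independent sketch of how the query string maps to sync flags. The helper name `parse_sync_flags` and the set of accepted truthy spellings are assumptions for the example; the actual endpoint may rely on its web framework's own boolean parsing.

```python
from urllib.parse import parse_qs, urlsplit

# Defaults as recorded in the completion note above.
DEFAULTS = {"definitions": True, "schedules": True, "event_types": False}

# Assumed truthy spellings; anything else is treated as false.
TRUE_VALUES = {"1", "true", "yes", "on"}


def parse_sync_flags(url: str) -> dict:
    """Return the sync flags for a request URL, applying the defaults."""
    query = parse_qs(urlsplit(url).query)
    flags = dict(DEFAULTS)
    for name in DEFAULTS:
        if name in query:
            # If a parameter is repeated, the last value wins (assumption).
            flags[name] = query[name][-1].strip().lower() in TRUE_VALUES
    return flags
```

With this sketch, a bare `POST /admin/sync` syncs definitions and schedules only, and event types run only when `event_types=true` is passed explicitly.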
## Preserve Schedule Drift Semantics

```task
id: ACTIVITY-WP-0012-T03
status: done
priority: high
state_hub_task_id: "efeac412-632c-4c90-9428-bb575ac7a624"
```

Make the sync result explicit enough for cadence changes and renames.

Done when:

- disabled cron definitions pause their Temporal schedules on sync;
- renamed definitions create the new schedule and pause/delete orphaned old
  schedules according to the existing `sync_schedules` semantics;
- event-triggered definitions remain hot through the existing router DB reload
  path;
- regression tests demonstrate the Coulomb hourly-to-daily rename shape without
  needing a worker restart.

2026-06-19: Completed. `sync_schedules` now returns explicit counts for enabled
schedule upserts, disabled schedule pauses, and orphan deletes. Regression tests
cover the hourly-to-daily rename shape: a new enabled cron schedule is upserted,
the old disabled cron schedule is preserved as paused, unrelated orphan
schedules are deleted, event-triggered definitions do not create schedules, and
one-shot scheduled definitions are no longer mistaken for orphans.

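
The reconciliation rules described above can be sketched as a pure planning function. This is a simplified model, not the real `sync_schedules`: the `Definition` fields, the trigger labels (`"cron"`, `"one_shot"`, `"event"`), and the choice to pause only schedules that already exist are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    schedule_id: str  # hypothetical: the Temporal schedule ID for this definition
    trigger: str      # assumed labels: "cron", "one_shot", or "event"
    enabled: bool


def plan_schedule_sync(definitions, existing_schedule_ids):
    """Decide which schedules to upsert, pause, or delete as orphans."""
    existing = set(existing_schedule_ids)
    upserted, paused, known = [], [], set()
    for d in definitions:
        if d.trigger == "event":
            continue  # event-triggered definitions never own a schedule
        known.add(d.schedule_id)  # cron and one-shot IDs are not orphans
        if d.trigger != "cron":
            continue
        if d.enabled:
            upserted.append(d.schedule_id)
        elif d.schedule_id in existing:
            paused.append(d.schedule_id)  # keep the schedule, but paused
    # Anything in Temporal that no definition accounts for is an orphan.
    deleted_orphans = sorted(existing - known)
    return {"upserted": upserted, "paused": paused, "deleted_orphans": deleted_orphans}
```

In the hourly-to-daily rename shape, the new daily definition lands in `upserted`, the disabled hourly one in `paused`, and only schedules with no matching definition are deleted. Running the plan again against the resulting state yields the same upserts and pauses with no further deletes.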
## Optional Background Sync Loop

```task
id: ACTIVITY-WP-0012-T04
status: done
priority: medium
state_hub_task_id: "d774087b-c51d-4444-8e90-bfef43765456"
```

Decide whether to add a periodic sync loop after the admin endpoint exists.

Done when:

- either `ACTIVITY_SYNC_INTERVAL_SECONDS` is implemented with a default disabled
  or conservative interval, or the workplan records why manual/admin-triggered
  sync is the safer v1 posture;
- if implemented, logs and metrics expose the last successful sync timestamp and
  last error summary;
- the loop does not block worker startup or workflow task processing.

2026-06-19: Completed by decision. v1 stays manual/operator-triggered through
`POST /admin/sync`; no background loop was added. The runbook records this
posture so customer definition changes stay explicit and the worker does not
start background repo scanning. A periodic loop remains a future option if live
operator use proves it is needed.

## Live No-Restart Smoke

```task
id: ACTIVITY-WP-0012-T05
status: done
priority: high
state_hub_task_id: "68a0e22a-106a-4d21-9f39-c6279850cb5e"
```

Validate the hot-reload path in the cluster/operator environment.

Done when non-secret State Hub evidence shows:

- a customer repo definition rename or `enabled` flip is synced through
  `/admin/sync`;
- new Temporal schedules are active and retired schedules are paused/deleted
  without worker SIGTERM or pod restart;
- event-triggered definitions still fire normally;
- rollback or repeat sync is idempotent.

2026-06-22: Completed on Railiance01 (`KUBECONFIG=~/.kube/config-hosteurope`).

Smoke target: disabled projection `ops-service-inventory-probes`
(`40d15a87-7ff6-4d8e-992c-37df15f95110`) in
`actcore-external-activity-definitions`.

Evidence:

- ConfigMap flip `enabled: false -> true` and cadence `15 * * * * -> 25 * * * *`,
  then `POST /admin/sync?definitions=true&schedules=true` from `actcore-api`.
- DB after sync: `enabled=true`, `cron=25 * * * *`.
- Temporal schedule after sync: `paused=false`, calendar minute `25`.
- Repeat sync returned identical schedule counts
  (`upserted=5`, `paused=1`, `deleted_orphans=0`), so the sync is idempotent.
- Rollback flip restored `enabled=false`, `cron=15 * * * *`, schedule
  `paused=true`, calendar minute `15`.
- `actcore-worker` pod UID unchanged (`a68d6539-2bba-457e-a78a-39564002a980`,
  started `2026-06-21T18:46:46Z`); `actcore-event-router` pod UID unchanged.
- Event-triggered definitions: none projected on Railiance01 today; hot DB
  reload path for event definitions remains covered by T03 unit tests and an
  unchanged event-router deployment.

Automation: `scripts/smoke_admin_sync_no_restart.py`. Runbook section added
under "Railiance01 no-restart smoke".
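
The two pass/fail conditions behind this evidence (no worker restart, idempotent repeat sync) can be expressed as a small check. This is a hedged sketch of the comparison logic only, not the contents of `scripts/smoke_admin_sync_no_restart.py`; the function name `check_no_restart_smoke` and its argument shapes are hypothetical, and collecting the pod UID and sync responses from the cluster is left out.

```python
def check_no_restart_smoke(uid_before, uid_after, first_counts, repeat_counts):
    """Return a list of problems; an empty list means the smoke passed."""
    problems = []
    # A restarted or rescheduled pod gets a new UID, so an unchanged UID
    # is taken as evidence that the worker kept running across the sync.
    if uid_before != uid_after:
        problems.append("worker pod UID changed: restart detected")
    # A repeat sync with no definition changes should report the same counts.
    if first_counts != repeat_counts:
        problems.append("repeat sync changed schedule counts: not idempotent")
    return problems


# Values taken from the evidence recorded above.
WORKER_UID = "a68d6539-2bba-457e-a78a-39564002a980"
COUNTS = {"upserted": 5, "paused": 1, "deleted_orphans": 0}

smoke_problems = check_no_restart_smoke(WORKER_UID, WORKER_UID, COUNTS, dict(COUNTS))
```

Returning a list of problems rather than raising on the first mismatch keeps both findings visible in one run, which suits evidence that is recorded as a non-secret summary.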