# State Hub Consistency Sweep Runbook

## Purpose

This runbook answers whether the 15-minute State Hub consistency sync ran
without relying on the local `custodian-sync.timer`.

The intended steady state after `STATE-WP-0064` cutover is:

- activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and
  ActivityRun audit trail.
- State Hub on the workstation owns `scripts/consistency_check.py`, lock
  semantics, reconciliation, and the `consistency_sweep_remote_all`
  progress event.
- The local systemd timer is disabled after the parallel week passes.

## API Surface

Manual or cluster-triggered invocation:

```bash
curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
  -H "Content-Type: application/json" \
  -d '{"max_seconds": 300}' | python3 -m json.tool
```

When calling from Railiance01 through the bridge tunnel, replace
`http://127.0.0.1:8000` with the `STATE_HUB_URL` configured for
activity-core (for example, the `actcore-state-hub-bridge` service target).
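
For scripted invocation, the same POST can be sketched in Python with only the standard library. This is a minimal sketch, not part of State Hub: the function names, the `margin` parameter, and the choice to wait a little longer than `max_seconds` are assumptions made for illustration. Only the endpoint path, header, and payload come from the curl example above.

```python
import json
import urllib.request

# Endpoint path taken from the curl example in this runbook.
SWEEP_PATH = "/consistency/sweep/remote-all"


def build_sweep_request(base_url: str, max_seconds: int = 300) -> urllib.request.Request:
    """Build the POST request for a remote-all sweep without sending it."""
    body = json.dumps({"max_seconds": max_seconds}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + SWEEP_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sweep(base_url: str, max_seconds: int = 300, margin: int = 30) -> dict:
    """Send the sweep request and return the decoded JSON response.

    Assumption: the client waits `margin` seconds beyond the sweep budget
    before giving up; tune this to the bridge proxy limit in your setup.
    """
    request = build_sweep_request(base_url, max_seconds)
    with urllib.request.urlopen(request, timeout=max_seconds + margin) as response:
        return json.load(response)
```

Calling `run_sweep(os.environ["STATE_HUB_URL"])` from the cluster side, or `run_sweep("http://127.0.0.1:8000")` on the workstation, would mirror the curl command. Separating request construction from sending keeps the request shape checkable without network access.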

## Schedule Check

From the activity-core host, confirm the definition is synced and the
Temporal schedule exists:

Run on **Railiance01** (the laptop `.env` points at docker-compose hostnames
like `app-db` and will time out from WSL):

```bash
export KUBECONFIG=~/.kube/config-hosteurope

# 1. Apply runtime manifest when definitions change
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml

# 2. Sync definitions into Postgres
kubectl -n activity-core delete job actcore-sync --ignore-not-found
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
kubectl -n activity-core wait --for=condition=complete job/actcore-sync --timeout=180s

# 3. Reconcile Temporal schedules
kubectl -n activity-core exec deploy/actcore-worker -- python -m activity_core.sync_schedules
```

After changing application code, rebuild and import `activity-core:railiance01-prod`
per `activity-core/k8s/railiance/README.md`, then restart
`actcore-worker`, `actcore-api`, and `actcore-event-router`.

Ensure the `state-hub-railiance01` ops-bridge tunnel is `connected` before
running cluster-triggered sweeps; the in-cluster bridge proxy allows up to
360s for POST requests.

Expected definition:

- name: `State Hub Consistency Sweep`
- trigger: `*/15 * * * *`
- timezone: `UTC`
- misfire policy: `skip`
- enabled: `false` until manual canary passes, then `true` after cutover
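
The expected definition can also be checked mechanically. The sketch below assumes the synced definition has been loaded into a flat dict with the hypothetical keys `name`, `trigger`, `timezone`, `misfire_policy`, and `enabled`; the real activity-core record may use different field names, so adjust the keys to match.

```python
# Expected values from this runbook; the key names are hypothetical.
EXPECTED_DEFINITION = {
    "name": "State Hub Consistency Sweep",
    "trigger": "*/15 * * * *",
    "timezone": "UTC",
    "misfire_policy": "skip",
}


def definition_drift(definition: dict, after_cutover: bool) -> dict:
    """Return {field: (expected, actual)} for every field that differs.

    `enabled` is expected to be False before cutover and True afterwards.
    """
    expected = dict(EXPECTED_DEFINITION, enabled=after_cutover)
    return {
        field: (want, definition.get(field))
        for field, want in expected.items()
        if definition.get(field) != want
    }
```

An empty result means the definition matches; any entry shows the expected value next to the value actually found.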

## Progress Event Check

Query State Hub for the latest sweep event:

```bash
curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
  | python3 -m json.tool
```

Healthy evidence includes:

- `lock_skipped: false` on normal runs
- `repos_processed` entries only for repos that needed action
- `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when
  applicable
- `exit_code: 0` for warn-only remote-all sweeps

A `lock_skipped: true` response is normal when the local timer and the
cluster schedule overlap during the parallel week.
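
The health criteria above can be expressed as a small classifier. This is a sketch under one loud assumption: `lock_skipped` and `exit_code` are read from the top level of the event dict. In the real progress payload they may be nested (for example under a metadata object), in which case pass that nested object instead. The function name and the three labels are illustrative.

```python
def classify_sweep_event(event: dict) -> str:
    """Classify one consistency_sweep_remote_all event.

    Returns "lock_skipped", "healthy", or "unhealthy".
    Assumption: the fields sit at the top level of `event`.
    """
    if event.get("lock_skipped") is True:
        # Expected while the local timer and cluster schedule overlap.
        return "lock_skipped"
    if event.get("lock_skipped") is False and event.get("exit_code") == 0:
        return "healthy"
    # Missing fields or a non-zero exit code need a closer look.
    return "unhealthy"
```

Treating a missing field as `unhealthy` is a deliberate choice here: an event that does not report its lock state or exit code should be inspected by hand instead of being counted as a pass.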

## ActivityRun Check

Query the activity-core database for the most recent run of the sweep
definition:

```sql
select
  run_id,
  fired_at,
  scheduled_for,
  context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;
```
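
To turn the query result into a yes/no answer about missed runs, the `scheduled_for` values can be compared against the `*/15` grid. The sketch below assumes `scheduled_for` holds the timezone-aware cron slot each run was scheduled for; the helper names and the window arguments are hypothetical. Because the misfire policy is `skip`, a missed slot is not backfilled and should show up as a gap.

```python
from datetime import datetime, timedelta, timezone

SLOT = timedelta(minutes=15)


def floor_to_slot(moment: datetime) -> datetime:
    """Round a timezone-aware datetime down to the last */15 boundary in UTC."""
    moment = moment.astimezone(timezone.utc)
    return moment.replace(
        minute=moment.minute - moment.minute % 15, second=0, microsecond=0
    )


def missing_slots(
    scheduled_for: list[datetime], window_start: datetime, window_end: datetime
) -> list[datetime]:
    """List every */15 slot between the floored window bounds with no run."""
    seen = {floor_to_slot(ts) for ts in scheduled_for}
    slot = floor_to_slot(window_start)
    last = floor_to_slot(window_end)
    missing = []
    while slot <= last:
        if slot not in seen:
            missing.append(slot)
        slot += SLOT
    return missing
```

An empty list for the window under review suggests every slot produced an ActivityRun. A run that was lock-skipped still counts as present here, so pair this check with the progress event check.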

## Manual Canary

Before enabling the cluster schedule:

1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
2. Trigger one manual ActivityRun or POST the API through the bridge URL.
3. Verify the progress event and ActivityRun context snapshot.
4. Confirm idempotence when the local timer also fires (lock skip is OK).

## Cutover

After one parallel week (`STATE-WP-0064-T03`):

```bash
systemctl --user disable --now custodian-sync.timer
```

Then enable the activity-core definition and treat the cluster schedule
as the sole primary runner.