state-hub/docs/consistency-sweep-runbook.md

# State Hub Consistency Sweep Runbook

## Purpose

This runbook shows how to confirm that the 15-minute State Hub consistency
sweep ran, without relying on the local `custodian-sync.timer`.

The intended steady state after `STATE-WP-0064` cutover is:

- activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and
  ActivityRun audit trail.
- State Hub on the workstation owns `scripts/consistency_check.py`, lock
  semantics, reconciliation, and the `consistency_sweep_remote_all`
  progress event.
- The local systemd timer is disabled after the parallel week passes.

## API Surface

Manual or cluster-triggered invocation:

```bash
curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
  -H "Content-Type: application/json" \
  -d '{"max_seconds": 300}' | python3 -m json.tool
```

From Railiance01, go through the bridge tunnel and use the `STATE_HUB_URL`
configured for activity-core (for example, the `actcore-state-hub-bridge`
service target) in place of `http://127.0.0.1:8000`.
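The same call can be scripted so that the base URL is the only thing that changes between the workstation and the bridge. The following is a minimal sketch, not the State Hub client: the helper names `build_sweep_request` and `trigger_sweep` are hypothetical, and the client timeout of `max_seconds + 30` is an assumption rather than a documented value. Only the path, the header, and the `max_seconds` body come from the curl command above.

```python
import json
import urllib.request

# Path taken from the curl example in this runbook.
SWEEP_PATH = "/consistency/sweep/remote-all"


def build_sweep_request(base_url, max_seconds=300):
    """Build the POST request for the remote-all sweep without sending it."""
    url = base_url.rstrip("/") + SWEEP_PATH
    body = json.dumps({"max_seconds": max_seconds}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_sweep(base_url, max_seconds=300):
    """Send the sweep request and return the decoded JSON response."""
    request = build_sweep_request(base_url, max_seconds)
    # Assumption: give the HTTP call slightly longer than the sweep budget.
    with urllib.request.urlopen(request, timeout=max_seconds + 30) as response:
        return json.load(response)
```

On the workstation, `trigger_sweep("http://127.0.0.1:8000")` mirrors the curl call. On Railiance01, pass the value of `STATE_HUB_URL` instead, for example `trigger_sweep(os.environ["STATE_HUB_URL"])`.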

## Schedule Check

From the activity-core host, sync the definition, then confirm that the
Temporal schedule exists:

```bash
cd ~/activity-core
ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian make sync-activity-definitions
```

Reconcile Temporal schedules (pick one):

```bash
# Preferred when activity-core API is up (no worker restart)
curl -s -X POST 'http://localhost:8010/admin/sync?definitions=true&schedules=true'

# CLI fallback
ACTCORE_DB_URL=... TEMPORAL_HOST=... uv run python -m activity_core.sync_schedules
```

On Railiance01, use the in-cluster activity-core API URL and env from the
deployment instead of `localhost:8010`.

Expected definition:

- name: `State Hub Consistency Sweep`
- trigger: `*/15 * * * *`
- timezone: `UTC`
- misfire policy: `skip`
- enabled: `false` until the manual canary passes, then `true` after cutover
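The expected values above can be compared against a synced definition mechanically. The sketch below assumes the definition is available as a plain dict with the keys `name`, `trigger`, `timezone`, `misfire_policy`, and `enabled`. Those key names are hypothetical and may not match the activity-core schema; adjust them to whatever the API or database returns.

```python
# Expected values copied from this runbook; the key names are assumptions.
EXPECTED_DEFINITION = {
    "name": "State Hub Consistency Sweep",
    "trigger": "*/15 * * * *",
    "timezone": "UTC",
    "misfire_policy": "skip",
}


def definition_drift(definition, expect_enabled):
    """Return a list of human-readable mismatches (empty when all fields match)."""
    expected = dict(EXPECTED_DEFINITION, enabled=expect_enabled)
    drift = []
    for key, want in expected.items():
        got = definition.get(key)
        if got != want:
            drift.append(f"{key}: expected {want!r}, got {got!r}")
    return drift
```

Pass `expect_enabled=False` before the manual canary passes and `expect_enabled=True` after cutover. An empty list means the definition matches this runbook.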

## Progress Event Check

Query State Hub for the latest sweep event:

```bash
curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
  | python3 -m json.tool
```

Healthy evidence includes:

- `lock_skipped: false` on normal runs
- `repos_processed` entries only for repos that needed action
- `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when
  applicable
- `exit_code: 0` for warn-only remote-all sweeps

A `lock_skipped: true` response is normal when the local timer and the
cluster schedule overlap during the parallel week.
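The evidence rules above can be folded into a small triage helper. This is a sketch under assumptions: it treats each progress event as a dict with `lock_skipped` and `exit_code` at the top level, which may not match the actual response shape (the fields could be nested under a metadata key). The function name `assess_sweep_event` and the status labels are hypothetical.

```python
def assess_sweep_event(event, parallel_week=False):
    """Classify one sweep progress event as 'ok', 'ok-lock-skip', or 'investigate'.

    Assumes `lock_skipped` and `exit_code` are top-level keys of the event.
    """
    if event.get("lock_skipped"):
        # A lock skip is expected only while the local timer and the
        # cluster schedule both fire during the parallel week.
        if parallel_week:
            return "ok-lock-skip", "lock held by the other runner during the parallel week"
        return "investigate", "lock skipped although only one runner should be active"
    exit_code = event.get("exit_code")
    if exit_code != 0:
        return "investigate", f"unexpected exit_code {exit_code!r}"
    return "ok", "sweep ran and exited 0"
```

For example, `assess_sweep_event({"lock_skipped": True, "exit_code": 0}, parallel_week=True)` returns the `ok-lock-skip` status, while the same event after cutover returns `investigate`.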

## ActivityRun Check

Query the activity-core database for the most recent run of the sweep
definition:

```sql
select
  run_id,
  fired_at,
  scheduled_for,
  context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;
```
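To judge whether the newest row is recent enough, the `scheduled_for` value can be compared with the most recent `*/15` boundary in UTC. The sketch below assumes `scheduled_for` is a timezone-aware timestamp aligned to the cron slot. The helper names `latest_slot` and `missed_slots` are hypothetical. Because the misfire policy is `skip`, missed slots are presumably not backfilled, so a non-zero count would point to runs that never fired.

```python
from datetime import datetime, timedelta, timezone

# Interval of the */15 * * * * schedule from this runbook.
INTERVAL = timedelta(minutes=15)


def latest_slot(now):
    """Return the most recent quarter-hour boundary at or before `now`, in UTC."""
    now = now.astimezone(timezone.utc)
    return now.replace(minute=now.minute - now.minute % 15, second=0, microsecond=0)


def missed_slots(latest_scheduled_for, now):
    """Count schedule slots between the newest run and the latest expected slot.

    Both arguments must be timezone-aware datetimes.
    """
    gap = latest_slot(now) - latest_scheduled_for.astimezone(timezone.utc)
    # Dividing two timedeltas with // yields a whole number of intervals.
    return max(0, gap // INTERVAL)
```

Calling `missed_slots(row_scheduled_for, datetime.now(timezone.utc))` with the top row of the query gives `0` when the latest slot has a run. Allow a short grace period just after each boundary, since the run for the current slot may not be recorded yet.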

## Manual Canary

Before enabling the cluster schedule:

1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
2. Trigger one manual ActivityRun or POST the API through the bridge URL.
3. Verify the progress event and ActivityRun context snapshot.
4. Confirm idempotence when the local timer also fires (lock skip is OK).

## Cutover

After one parallel week (`STATE-WP-0064-T03`):

```bash
systemctl --user disable --now custodian-sync.timer
```

Then enable the activity-core definition and treat the cluster schedule
as the sole runner.