From 3d5e354ff8d8d6fb890f8a7cb4398943e015d556 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Sun, 21 Jun 2026 17:32:44 +0200
Subject: [PATCH] docs(state-hub): weekend automation assessment and repair workplans
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Persist the Fri-evening→Sun-afternoon automation gap assessment in
history/, and add STATE-WP-0063 (repair broken paths and cluster
reachability) plus STATE-WP-0064 (move State Hub consistency sync to
Railiance01 via activity-core). Workplans registered in State Hub via
fix-consistency.
---
 .../20260621-weekend-automation-assessment.md | 103 ++++++++++++
 ...STATE-WP-0063-weekend-automation-repair.md | 143 +++++++++++++++++
 ...4-statehub-consistency-sync-railiance01.md | 150 ++++++++++++++++++
 3 files changed, 396 insertions(+)
 create mode 100644 history/20260621-weekend-automation-assessment.md
 create mode 100644 workplans/STATE-WP-0063-weekend-automation-repair.md
 create mode 100644 workplans/STATE-WP-0064-statehub-consistency-sync-railiance01.md

diff --git a/history/20260621-weekend-automation-assessment.md b/history/20260621-weekend-automation-assessment.md
new file mode 100644
index 0000000..b52e0fa
--- /dev/null
+++ b/history/20260621-weekend-automation-assessment.md
@@ -0,0 +1,103 @@
+---
+id: STATE-HIST-20260621-WEEKEND-AUTOMATION
+type: assessment
+title: "Weekend automation gap assessment (Fri evening → Sun afternoon)"
+domain: custodian
+repo: state-hub
+created: "2026-06-21"
+assessor: grok
+session_boundary: "2026-06-19T19:22Z (last interactive milestone)"
+---
+
+# Weekend automation gap assessment
+
+Assessment window: **Friday 2026-06-19 ~21:22 CEST** (session close) through
+**Sunday 2026-06-21 ~16:00 CEST** (resumption). Sources: State Hub
+`/progress/` API, activity-definition files, local `journalctl` for
+`custodian-sync.service`, and crontab.
+
+## Scheduled automation landscape
+
+activity-core runs on **Railiance01 (K3s + Temporal)**, not on the WSL
+workstation. Custodian-owned ActivityDefinitions in
+`the-custodian/activity-definitions/`:
+
+| Activity | Schedule | Enabled | Effect |
+|----------|----------|---------|--------|
+| Hourly RecentlyOnScope | `0 * * * *` Europe/Berlin | yes | `POST /recently-on-scope/hourly` |
+| Daily State Hub WSJF Triage | `20 7 * * *` Europe/Berlin | yes | LLM triage → `daily_triage` progress event |
+| Ops Service Inventory Probes | `15 * * * *` Europe/Berlin | no | HTTP service probes |
+
+activity-core repo definitions: weekly SBOM staleness (Mon 09:00, enabled);
+weekly coding retro (Sat 19:00, disabled).
+
+**Not yet on activity-core:** the 15-minute workplan↔DB consistency sweep
+still uses the local **`custodian-sync.timer`** systemd user unit.
+
+## What ran automatically
+
+### Friday evening (before/at session close)
+
+- **20:00 CEST** (`18:00 UTC`): Last successful hourly RecentlyOnScope run.
+  Generated one `helix_forge` digest; 13 domains skipped.
+- **After 20:00 CEST**: No further hourly runs recorded (19:00 and 20:00 UTC
+  slots absent from progress events).
+
+### Saturday 2026-06-20
+
+- **Zero** State Hub progress events for the entire day.
+- **07:20 CEST daily WSJF triage**: did not run (no `daily_triage` event; no new
+  file in `the-custodian/memory/working/` since 2026-06-18).
+- Workstation journal shows **no `custodian-sync` activity** between
+  Fri ~21:18 and Sun ~15:50 CEST — consistent with sleep/hibernate or WSL not
+  running.
+
+### Sunday 2026-06-21 (before interactive session)
+
+- **16:00 CEST** (`14:00 UTC`): Hourly RecentlyOnScope resumed. Result:
+  **0 generated, 14 skipped, 0 failed** (all domains quiet).
+- **07:20 CEST daily WSJF triage**: did not run.
+- **After 16:09 CEST**: Interactive grok/custodian session work (not automation).
+
+### Local maintenance (broken, not activity-core)
+
+`custodian-sync.service` fired every ~15 minutes when the machine was awake but
+**failed continuously** with:
+
+```
+.venv/bin/python: No such file or directory
+```
+
+Root cause: unit `WorkingDirectory` still points at the pre-extraction path
+`/home/worsch/the-custodian/state-hub`; the standalone repo lives at
+`/home/worsch/state-hub`.
+
+Other local crons also failed when they fired:
+
+- **02:00** `railiance backup` — `/home/worsch/railiance-bootstrap/bin/railiance: not found`
+- **03:00** `bridge maintenance cleanup` — no log evidence
+
+## Gap summary
+
+| Automation | Expected over weekend | Observed |
+|------------|----------------------|----------|
+| Hourly RecentlyOnScope | ~44 hourly runs | **1** (Sun 16:00 only) after **~44 h gap** |
+| Daily WSJF triage | 2 runs (Sat + Sun 07:20) | **0** |
+| State Hub consistency sweep | ~180 runs (15 min) | **All failed** when machine awake (bad path) |
+| Saturday hub activity | n/a | **None recorded** |
+
+## Naming note
+
+The local systemd unit is called **custodian-sync**, but the job reconciles
+**workplan files ↔ State Hub DB for all registered repos**. The cron-migration
+design stub already uses **`state-hub-consistency-sweep`** as the
+ActivityDefinition id. Prefer **State Hub consistency sync** for operator-facing
+names; retain `custodian-sync-hook` in git hooks only until a deliberate rename
+pass (hook marker is widespread).
+
+## Follow-up workplans
+
+- `STATE-WP-0063` — repair broken weekend automation (local paths, cluster
+  reachability, missed triage).
+- `STATE-WP-0064` — move State Hub consistency sync to Railiance01 via
+  activity-core; retire local `custodian-sync.timer` after cutover.
\ No newline at end of file
diff --git a/workplans/STATE-WP-0063-weekend-automation-repair.md b/workplans/STATE-WP-0063-weekend-automation-repair.md
new file mode 100644
index 0000000..4848b26
--- /dev/null
+++ b/workplans/STATE-WP-0063-weekend-automation-repair.md
@@ -0,0 +1,143 @@
+---
+id: STATE-WP-0063
+type: workplan
+title: "Repair broken weekend automation (paths, cluster reachability, triage)"
+domain: custodian
+repo: state-hub
+status: ready
+owner: codex
+topic_slug: custodian
+created: "2026-06-21"
+updated: "2026-06-21"
+state_hub_workstream_id: "41194e3b-7d45-4cf1-9715-56a843230ad7"
+---
+
+# STATE-WP-0063 — Repair broken weekend automation
+
+**Origin:** `history/20260621-weekend-automation-assessment.md` (2026-06-21).
+
+Over the Fri-evening → Sun-afternoon window, three automation substrates were
+offline or broken: local `custodian-sync` (wrong repo path), activity-core
+hourly/daily jobs on Railiance01 (~44 h gap), and local `railiance backup`
+cron (missing binary). This workplan restores reliable operation before the
+Railiance01 migration in `STATE-WP-0064`.
+
+## Scope
+
+In scope:
+
+- Fix the local `custodian-sync` systemd unit so `consistency_check.py
+  --remote --all` runs successfully from `/home/worsch/state-hub`.
+- Diagnose and fix the Railiance01 activity-core gap affecting hourly
+  RecentlyOnScope and daily WSJF triage (Temporal schedule, tunnel, bridge).
+- Repair or document the broken `railiance backup` crontab entry.
+- Confirm `bridge maintenance cleanup` cron runs or fix its path/logging.
+- Post a State Hub progress milestone when repairs are verified.
+
+Out of scope:
+
+- Moving the consistency sweep to Railiance01 (see `STATE-WP-0064`).
+- Renaming `custodian-sync` units cluster-wide (deferred to WP-0064).
+
+## Evidence baseline
+
+See `history/20260621-weekend-automation-assessment.md` for the full timeline.
+Key failures:
+
+- `custodian-sync.service`: `WorkingDirectory=/home/worsch/the-custodian/state-hub`
+  (pre-extraction path).
+- Hourly RecentlyOnScope: last run Fri 18:00 UTC; next Sun 14:00 UTC.
+- Daily triage: no `daily_triage` events on 2026-06-20 or 2026-06-21 morning.
+
+---
+
+## T1 — Fix local State Hub consistency sync unit
+
+```task
+id: STATE-WP-0063-T01
+status: todo
+priority: high
+state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2"
+```
+
+Update `~/.config/systemd/user/custodian-sync.service` (and any checked-in
+template under `infra/`) to use `/home/worsch/state-hub` and invoke consistency
+via `uv run python scripts/consistency_check.py --remote --all` (or equivalent
+`$(UV)` path) instead of a stale `.venv/bin/python`.
+
+Acceptance:
+
+- `systemctl --user start custodian-sync.service` exits 0.
+- `journalctl --user -u custodian-sync.service -n 5` shows a completed sweep,
+  not `status=127`.
+- Document the interim fix in `infra/README.md` until WP-0064 cutover.
+
+## T2 — Diagnose activity-core weekend gap on Railiance01
+
+```task
+id: STATE-WP-0063-T02
+status: todo
+priority: high
+state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61"
+```
+
+Investigate why Temporal schedules for `hourly-recently-on-scope` and
+`daily-statehub-wsjf-triage` produced no State Hub progress between Fri
+20:00 CEST and Sun 16:00 CEST. Check:
+
+- activity-core worker/API pod health on Railiance01
+- `actcore-state-hub-bridge` reachability to the workstation State Hub
+- `state-hub-railiance01` ops-bridge tunnel state when the laptop is awake
+- Temporal schedule pause/misfire history
+
+Record findings in a short note (progress event or workplan update). Fix any
+configuration regression found (do not re-enable duplicate Codex app fallbacks).
+
+## T3 — Restore daily WSJF triage evidence
+
+```task
+id: STATE-WP-0063-T03
+status: todo
+priority: medium
+state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625"
+```
+
+After T2, trigger one manual canary of `daily-statehub-wsjf-triage` on
+Railiance01 and confirm:
+
+- `event_type: daily_triage` progress event lands in State Hub
+- working-memory report file appears under `the-custodian/memory/working/`
+
+If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin
+fire is armed.
+
+## T4 — Repair ancillary local crons
+
+```task
+id: STATE-WP-0063-T04
+status: todo
+priority: low
+state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
+```
+
+- Fix `railiance backup` crontab to use the current `railiance-bootstrap` install
+  path, or document the correct operator install step.
+- Verify `bridge maintenance cleanup` at 03:00 writes to
+  `~/.local/state/bridge/cleanup.log` on the next run.
+
+## T5 — Verification sweep and close-out
+
+```task
+id: STATE-WP-0063-T05
+status: todo
+priority: medium
+state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1"
+```
+
+Run a 24-hour observation window:
+
+- at least one successful `custodian-sync` sweep per hour while machine is awake
+- at least two hourly RecentlyOnScope progress events (generated or skipped)
+- no duplicate Codex + activity-core primary runners for the same job
+
+Mark workplan `finished` and log a milestone progress event summarising repairs.
\ No newline at end of file
diff --git a/workplans/STATE-WP-0064-statehub-consistency-sync-railiance01.md b/workplans/STATE-WP-0064-statehub-consistency-sync-railiance01.md
new file mode 100644
index 0000000..a8d01f3
--- /dev/null
+++ b/workplans/STATE-WP-0064-statehub-consistency-sync-railiance01.md
@@ -0,0 +1,150 @@
+---
+id: STATE-WP-0064
+type: workplan
+title: "Move State Hub consistency sync to Railiance01 (activity-core)"
+domain: custodian
+repo: state-hub
+status: ready
+owner: codex
+topic_slug: custodian
+created: "2026-06-21"
+updated: "2026-06-21"
+state_hub_workstream_id: "669d810a-53f4-448b-a0c1-a6543daa7c44"
+---
+
+# STATE-WP-0064 — Move State Hub consistency sync to Railiance01
+
+**Origin:** `history/20260621-weekend-automation-assessment.md` and
+`docs/cron-migration.md` design stub (CUST-WP-0040 T04).
+
+The 15-minute workplan↔DB reconciliation is a **State Hub read-model
+maintenance** job across all registered repos. The legacy name **custodian-sync**
+reflects the owning domain, not the job's scope. Operator-facing names should
+use **State Hub consistency sync**; the ActivityDefinition id
+`the-custodian.state-hub-consistency-sweep` already matches this in
+`docs/cron-migration.md`.
+
+This workplan moves **scheduling** to activity-core on Railiance01 while
+`scripts/consistency_check.py` remains in the `state-hub` repo.
+
+Depends on `STATE-WP-0063` repairing the current broken local path so there is
+a known-good baseline before cutover.
+
+## Scope
+
+In scope:
+
+- Land `state-hub-consistency-sweep` ActivityDefinition in
+  `the-custodian/activity-definitions/`.
+- Run the sweep from Railiance01 against the workstation State Hub via the
+  existing bridge/tunnel pattern (`actcore-state-hub-bridge` or equivalent).
+- Parallel-run with local `custodian-sync.timer` for one week, then disable the
+  local timer.
+- Update `infra/README.md`, `docs/cron-migration.md`, and operator runbooks.
+
+Out of scope:
+
+- Changing consistency_check.py reconciliation rules (ADR-001 logic stays).
+- Renaming `# custodian-sync-hook` in every registered repo's git hook (separate
+  hygiene pass; hooks may keep the marker until all repos are updated).
+- Per-commit hook migration to event-driven activity-core (see cron-migration §C).
+
+## Naming decision (decided)
+
+| Layer | Current | Target |
+|-------|---------|--------|
+| Operator docs | custodian sync / custodian-sync | **State Hub consistency sync** |
+| ActivityDefinition id | (not landed) | `the-custodian.state-hub-consistency-sweep` |
+| systemd unit (interim) | `custodian-sync.{service,timer}` | disable after cutover; optional rename to `statehub-consistency-sync.*` during WP-0063 if low cost |
+| git hook marker | `# custodian-sync-hook` | unchanged in this workplan |
+
+---
+
+## T1 — ActivityDefinition and cluster wiring
+
+```task
+id: STATE-WP-0064-T01
+status: todo
+priority: high
+state_hub_task_id: "ecc0f846-e00f-4063-8ec1-f6ad630e9265"
+```
+
+Create `the-custodian/activity-definitions/state-hub-consistency-sweep.md`
+from the draft in `docs/cron-migration.md` §2A, adjusting:
+
+- shell command to reach the workstation repo path or a cluster-side checkout
+- `STATE_HUB_URL` via bridge service (not hard-coded `127.0.0.1` on cluster)
+- `misfire_policy: skip` and `--max-seconds 300` budget
+- `on_failure: log_and_continue` for warn-only sweeps
+
+Sync definition to Railiance01 activity-core (projection manifest per
+`hourly-recently-on-scope` precedent). Enable after manual canary.
+
+## T2 — Manual canary on Railiance01
+
+```task
+id: STATE-WP-0064-T02
+status: todo
+priority: high
+state_hub_task_id: "2e9b5b66-a7b1-46a5-8e1f-22e6b5caeff6"
+```
+
+Trigger one manual ActivityRun. Confirm:
+
+- `consistency_check.py --remote --all` completes within budget
+- C-15 writeback and C-16 pull gate behave as today
+- progress or activity-core run history shows success
+- no duplicate side-effects when local timer also fires (idempotent)
+
+## T3 — Parallel run and observability
+
+```task
+id: STATE-WP-0064-T03
+status: todo
+priority: medium
+state_hub_task_id: "8abb31ad-2f03-4aa7-889e-e60c3c39f1f8"
+```
+
+Run cluster schedule (`*/15 * * * *` UTC per design stub) alongside local
+`custodian-sync.timer` for **one week**. Compare:
+
+- sweep completion rate
+- repos skipped due to lock or budget
+- hard failures vs warn-only exits
+
+Document comparison in a progress event or short runbook addendum.
+
+## T4 — Retire local timer
+
+```task
+id: STATE-WP-0064-T04
+status: todo
+priority: medium
+state_hub_task_id: "c8275471-5ec0-4dfb-8fec-2b3ec3894036"
+```
+
+After parallel week passes:
+
+```bash
+systemctl --user disable --now custodian-sync.timer
+```
+
+Archive or update unit files under `infra/`. Mark cron-migration stub §5 step 4
+complete. Update `docs/activity-core-delegation.md` cross-reference.
+
+## T5 — Docs and operator handoff
+
+```task
+id: STATE-WP-0064-T05
+status: todo
+priority: low
+state_hub_task_id: "270ed7dd-aa79-469d-a817-e3fa1e71be41"
+```
+
+- `infra/README.md`: primary schedule is activity-core on Railiance01; local
+  timer is retired.
+- `docs/cron-migration.md`: promote §2A from design stub to implemented;
+  note blockers cleared.
+- Dashboard or AGENTS snippet: "State Hub consistency sync" terminology.
+
+Mark workplan `finished` when cluster schedule is the sole primary runner.
\ No newline at end of file