docs(state-hub): weekend automation assessment and repair workplans
Persist the Fri-evening→Sun-afternoon automation gap assessment in history/, and add STATE-WP-0063 (repair broken paths and cluster reachability) plus STATE-WP-0064 (move State Hub consistency sync to Railiance01 via activity-core). Workplans registered in State Hub via fix-consistency.
history/20260621-weekend-automation-assessment.md
@@ -0,0 +1,103 @@
---
id: STATE-HIST-20260621-WEEKEND-AUTOMATION
type: assessment
title: "Weekend automation gap assessment (Fri evening → Sun afternoon)"
domain: custodian
repo: state-hub
created: "2026-06-21"
assessor: grok
session_boundary: "2026-06-19T19:22Z (last interactive milestone)"
---

# Weekend automation gap assessment

Assessment window: **Friday 2026-06-19 ~21:22 CEST** (session close) through
**Sunday 2026-06-21 ~16:00 CEST** (resumption). Sources: State Hub
`/progress/` API, activity-definition files, local `journalctl` for
`custodian-sync.service`, and crontab.

## Scheduled automation landscape

activity-core runs on **Railiance01 (K3s + Temporal)**, not on the WSL
workstation. Custodian-owned ActivityDefinitions in
`the-custodian/activity-definitions/`:

| Activity | Schedule | Enabled | Effect |
|----------|----------|---------|--------|
| Hourly RecentlyOnScope | `0 * * * *` Europe/Berlin | yes | `POST /recently-on-scope/hourly` |
| Daily State Hub WSJF Triage | `20 7 * * *` Europe/Berlin | yes | LLM triage → `daily_triage` progress event |
| Ops Service Inventory Probes | `15 * * * *` Europe/Berlin | no | HTTP service probes |

activity-core repo definitions: weekly SBOM staleness (Mon 09:00, enabled);
weekly coding retro (Sat 19:00, disabled).

**Not yet on activity-core:** the 15-minute workplan↔DB consistency sweep
still uses the local **`custodian-sync.timer`** systemd user unit.

## What ran automatically

### Friday evening (before/at session close)

- **20:00 CEST** (`18:00 UTC`): Last successful hourly RecentlyOnScope run.
  Generated one `helix_forge` digest; 13 domains skipped.
- **After 20:00 CEST**: No further hourly runs recorded (19:00 and 20:00 UTC
  slots absent from progress events).

### Saturday 2026-06-20

- **Zero** State Hub progress events for the entire day.
- **07:20 CEST daily WSJF triage**: did not run (no `daily_triage` event; no new
  file in `the-custodian/memory/working/` since 2026-06-18).
- Workstation journal shows **no `custodian-sync` activity** between
  Fri ~21:18 and Sun ~15:50 CEST — consistent with sleep/hibernate or WSL not
  running.

### Sunday 2026-06-21 (before interactive session)

- **16:00 CEST** (`14:00 UTC`): Hourly RecentlyOnScope resumed. Result:
  **0 generated, 14 skipped, 0 failed** (all domains quiet).
- **07:20 CEST daily WSJF triage**: did not run.
- **After 16:09 CEST**: Interactive grok/custodian session work (not automation).

### Local maintenance (broken, not activity-core)

`custodian-sync.service` fired every ~15 minutes when the machine was awake but
**failed continuously** with:

```
.venv/bin/python: No such file or directory
```

Root cause: unit `WorkingDirectory` still points at the pre-extraction path
`/home/worsch/the-custodian/state-hub`; the standalone repo lives at
`/home/worsch/state-hub`.

Other local cron jobs also failed, or left no evidence of running:

- **02:00** `railiance backup` — `/home/worsch/railiance-bootstrap/bin/railiance: not found`
- **03:00** `bridge maintenance cleanup` — no log evidence that it ran

## Gap summary

| Automation | Expected over weekend | Observed |
|------------|----------------------|----------|
| Hourly RecentlyOnScope | ~44 hourly runs | **1** (Sun 16:00 only) after **~44 h gap** |
| Daily WSJF triage | 2 runs (Sat + Sun 07:20) | **0** |
| State Hub consistency sweep | ~180 runs (15 min) | **All failed** when machine awake (bad path) |
| Saturday hub activity | n/a | **None recorded** |
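
The "expected" figures in the table can be sanity-checked by counting schedule slots across the window. The following is a minimal sketch, not part of the repo: `count_fires` is a hypothetical helper, the window endpoints are the Fri 20:00 and Sun 16:00 CEST runs reported above, and the 15-minute sweep is assumed to fire on quarter-hour marks (the systemd timer may not be wall-clock aligned).

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")


def count_fires(start, end, every_minutes):
    """Count schedule ticks t with start < t <= end.

    Ticks fall on minute marks divisible by every_minutes, so
    every_minutes must divide 60 (15 -> :00/:15/:30/:45, 60 -> :00).
    """
    # Move to the first aligned tick strictly after `start`.
    t = start.replace(second=0, microsecond=0)
    t += timedelta(minutes=every_minutes - t.minute % every_minutes)
    fires = 0
    while t <= end:
        fires += 1
        t += timedelta(minutes=every_minutes)
    return fires


# Window taken from the assessment: last good hourly run to resumption.
last_good = datetime(2026, 6, 19, 20, 0, tzinfo=BERLIN)
resumed = datetime(2026, 6, 21, 16, 0, tzinfo=BERLIN)

hourly_slots = count_fires(last_good, resumed, 60)
sweep_slots = count_fires(last_good, resumed, 15)
print(hourly_slots, sweep_slots)  # 44 176
```

Under these assumptions the window holds 44 hourly slots (one of which, Sun 16:00, was observed) and 176 quarter-hour slots, which is in line with the rounded figures in the table.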

## Naming note

The local systemd unit is called **custodian-sync**, but the job reconciles
**workplan files ↔ State Hub DB for all registered repos**. The cron-migration
design stub already uses **`state-hub-consistency-sweep`** as the
ActivityDefinition id. Prefer **State Hub consistency sync** for operator-facing
names; retain `custodian-sync-hook` in git hooks only until a deliberate rename
pass (hook marker is widespread).

## Follow-up workplans

- `STATE-WP-0063` — repair broken weekend automation (local paths, cluster
  reachability, missed triage).
- `STATE-WP-0064` — move State Hub consistency sync to Railiance01 via
  activity-core; retire local `custodian-sync.timer` after cutover.
workplans/STATE-WP-0063-weekend-automation-repair.md
@@ -0,0 +1,143 @@
---
id: STATE-WP-0063
type: workplan
title: "Repair broken weekend automation (paths, cluster reachability, triage)"
domain: custodian
repo: state-hub
status: ready
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "41194e3b-7d45-4cf1-9715-56a843230ad7"
---

# STATE-WP-0063 — Repair broken weekend automation

**Origin:** `history/20260621-weekend-automation-assessment.md` (2026-06-21).

Over the Fri-evening → Sun-afternoon window, three automation substrates were
offline or broken: local `custodian-sync` (wrong repo path), activity-core
hourly/daily jobs on Railiance01 (~44 h gap), and local `railiance backup`
cron (missing binary). This workplan restores reliable operation before the
Railiance01 migration in `STATE-WP-0064`.

## Scope

In scope:

- Fix the local `custodian-sync` systemd unit so
  `consistency_check.py --remote --all` runs successfully from
  `/home/worsch/state-hub`.
- Diagnose and fix the Railiance01 activity-core gap affecting hourly
  RecentlyOnScope and daily WSJF triage (Temporal schedule, tunnel, bridge).
- Repair or document the broken `railiance backup` crontab entry.
- Confirm `bridge maintenance cleanup` cron runs or fix its path/logging.
- Post a State Hub progress milestone when repairs are verified.

Out of scope:

- Moving the consistency sweep to Railiance01 (see `STATE-WP-0064`).
- Renaming `custodian-sync` units cluster-wide (deferred to WP-0064).

## Evidence baseline

See `history/20260621-weekend-automation-assessment.md` for the full timeline.
Key failures:

- `custodian-sync.service`: `WorkingDirectory=/home/worsch/the-custodian/state-hub`
  (pre-extraction path).
- Hourly RecentlyOnScope: last run Fri 18:00 UTC; next Sun 14:00 UTC.
- Daily triage: no `daily_triage` events on 2026-06-20 or 2026-06-21 morning.

---

## T1 — Fix local State Hub consistency sync unit

```task
id: STATE-WP-0063-T01
status: todo
priority: high
state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2"
```

Update `~/.config/systemd/user/custodian-sync.service` (and any checked-in
template under `infra/`) to use `/home/worsch/state-hub` and invoke consistency
via `uv run python scripts/consistency_check.py --remote --all` (or equivalent
`$(UV)` path) instead of a stale `.venv/bin/python`.

Acceptance:

- `systemctl --user start custodian-sync.service` exits 0.
- `journalctl --user -u custodian-sync.service -n 5` shows a completed sweep,
  not `status=127`.
- Document the interim fix in `infra/README.md` until WP-0064 cutover.
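
Before restarting the unit, the two regressions named above (stale `WorkingDirectory`, stale `.venv/bin/python`) can be caught with a small pre-flight check. This is a sketch under assumptions: `check_unit` is a hypothetical helper, and both sample unit bodies are illustrative, since the real unit file is not reproduced in this workplan. The `uv` invocation in the fixed sample is likewise only one possible form.

```python
import configparser


def check_unit(unit_text, expected_dir):
    """Return a list of problems found in a unit's [Service] section."""
    parser = configparser.ConfigParser(
        delimiters=("=",), interpolation=None, strict=False
    )
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)

    service = parser["Service"] if parser.has_section("Service") else {}
    problems = []
    working_dir = service.get("WorkingDirectory")
    if working_dir != expected_dir:
        problems.append(
            f"WorkingDirectory is {working_dir!r}, expected {expected_dir!r}"
        )
    if ".venv/bin/python" in service.get("ExecStart", ""):
        problems.append("ExecStart still calls .venv/bin/python")
    return problems


# Illustrative unit bodies (assumptions, not copies of the real unit).
STALE_UNIT = """\
[Service]
Type=oneshot
WorkingDirectory=/home/worsch/the-custodian/state-hub
ExecStart=/bin/sh -c '.venv/bin/python scripts/consistency_check.py --remote --all'
"""

FIXED_UNIT = """\
[Service]
Type=oneshot
WorkingDirectory=/home/worsch/state-hub
ExecStart=/usr/bin/env uv run python scripts/consistency_check.py --remote --all
"""

for problem in check_unit(STALE_UNIT, "/home/worsch/state-hub"):
    print("stale unit:", problem)
print("fixed unit problems:", check_unit(FIXED_UNIT, "/home/worsch/state-hub"))
```

A check like this only inspects the file text; the acceptance criteria above (`systemctl --user start` exiting 0 and a completed sweep in the journal) remain the actual proof of repair.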

## T2 — Diagnose activity-core weekend gap on Railiance01

```task
id: STATE-WP-0063-T02
status: todo
priority: high
state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61"
```

Investigate why Temporal schedules for `hourly-recently-on-scope` and
`daily-statehub-wsjf-triage` produced no State Hub progress between Fri
20:00 CEST and Sun 16:00 CEST. Check:

- activity-core worker/API pod health on Railiance01
- `actcore-state-hub-bridge` reachability to the workstation State Hub
- `state-hub-railiance01` ops-bridge tunnel state when the laptop is awake
- Temporal schedule pause/misfire history

Record findings in a short note (progress event or workplan update). Fix any
configuration regression found (do not re-enable duplicate Codex app fallbacks).

## T3 — Restore daily WSJF triage evidence

```task
id: STATE-WP-0063-T03
status: todo
priority: medium
state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625"
```

After T2, trigger one manual canary of `daily-statehub-wsjf-triage` on
Railiance01 and confirm:

- `event_type: daily_triage` progress event lands in State Hub
- working-memory report file appears under `the-custodian/memory/working/`

If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin
fire is armed.
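
Once progress events have been exported from State Hub, missing triage days can be listed mechanically. The sketch below assumes each event is a dict with an `event_type` field (named in this workplan) and a `created_at` ISO-8601 timestamp with an explicit UTC offset; the `created_at` name, the record shape, and the helper `missing_triage_days` are assumptions rather than the documented `/progress/` schema.

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo

BERLIN = ZoneInfo("Europe/Berlin")


def missing_triage_days(events, first_day, last_day, tz=BERLIN):
    """Return the dates in [first_day, last_day] with no daily_triage event."""
    seen = set()
    for event in events:
        if event.get("event_type") != "daily_triage":
            continue
        stamp = datetime.fromisoformat(event["created_at"])
        # Bucket by local calendar day, since the schedule is Europe/Berlin.
        seen.add(stamp.astimezone(tz).date())

    missing = []
    day = first_day
    while day <= last_day:
        if day not in seen:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Made-up sample records, for illustration only.
sample_events = [
    {"event_type": "daily_triage", "created_at": "2026-06-18T05:20:00+00:00"},
    {"event_type": "milestone", "created_at": "2026-06-19T19:22:00+00:00"},
]

gaps = missing_triage_days(sample_events, date(2026, 6, 20), date(2026, 6, 21))
print([d.isoformat() for d in gaps])
```

After a successful canary, rerunning the same check over the canary date should return an empty list.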

## T4 — Repair ancillary local crons

```task
id: STATE-WP-0063-T04
status: todo
priority: low
state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
```

- Fix `railiance backup` crontab to use the current `railiance-bootstrap` install
  path, or document the correct operator install step.
- Verify `bridge maintenance cleanup` at 03:00 writes to
  `~/.local/state/bridge/cleanup.log` on the next run.

## T5 — Verification sweep and close-out

```task
id: STATE-WP-0063-T05
status: todo
priority: medium
state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1"
```

Run a 24-hour observation window:

- at least one successful `custodian-sync` sweep per hour while machine is awake
- at least two hourly RecentlyOnScope progress events (generated or skipped)
- no duplicate Codex + activity-core primary runners for the same job

Mark workplan `finished` and log a milestone progress event summarising repairs.
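
During the observation window, silent periods can be detected from run timestamps alone. The sketch below is a hypothetical helper, `find_gaps`, applied to the three hourly-run times reported in the assessment (Fri 18:00 UTC, then Sun 14:00 UTC, plus an assumed follow-up run at Sun 15:00 UTC); the 90-minute tolerance is an arbitrary choice for an hourly job.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(timestamps, max_gap):
    """Return (previous, next) pairs whose spacing exceeds max_gap."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_gap
    ]


runs = [
    datetime(2026, 6, 19, 18, 0, tzinfo=timezone.utc),  # last run before the gap
    datetime(2026, 6, 21, 14, 0, tzinfo=timezone.utc),  # first run after the gap
    datetime(2026, 6, 21, 15, 0, tzinfo=timezone.utc),  # assumed next hourly run
]

for earlier, later in find_gaps(runs, max_gap=timedelta(minutes=90)):
    print(f"gap of {later - earlier} between {earlier} and {later}")
```

With these inputs the only pair reported is the 44-hour weekend gap; a clean 24-hour window should produce no output for the hourly job.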
workplans/STATE-WP-0064-statehub-consistency-sync-railiance01.md
@@ -0,0 +1,150 @@
---
id: STATE-WP-0064
type: workplan
title: "Move State Hub consistency sync to Railiance01 (activity-core)"
domain: custodian
repo: state-hub
status: ready
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "669d810a-53f4-448b-a0c1-a6543daa7c44"
---

# STATE-WP-0064 — Move State Hub consistency sync to Railiance01

**Origin:** `history/20260621-weekend-automation-assessment.md` and
`docs/cron-migration.md` design stub (CUST-WP-0040 T04).

The 15-minute workplan↔DB reconciliation is a **State Hub read-model
maintenance** job across all registered repos. The legacy name **custodian-sync**
reflects the owning domain, not the job's scope. Operator-facing names should
use **State Hub consistency sync**; the ActivityDefinition id
`the-custodian.state-hub-consistency-sweep` already matches this in
`docs/cron-migration.md`.

This workplan moves **scheduling** to activity-core on Railiance01 while
`scripts/consistency_check.py` remains in the `state-hub` repo.

Depends on `STATE-WP-0063` repairing the current broken local path so there is
a known-good baseline before cutover.

## Scope

In scope:

- Land `state-hub-consistency-sweep` ActivityDefinition in
  `the-custodian/activity-definitions/`.
- Run the sweep from Railiance01 against the workstation State Hub via the
  existing bridge/tunnel pattern (`actcore-state-hub-bridge` or equivalent).
- Parallel-run with local `custodian-sync.timer` for one week, then disable the
  local timer.
- Update `infra/README.md`, `docs/cron-migration.md`, and operator runbooks.

Out of scope:

- Changing `consistency_check.py` reconciliation rules (ADR-001 logic stays).
- Renaming `# custodian-sync-hook` in every registered repo's git hook (separate
  hygiene pass; hooks may keep the marker until all repos are updated).
- Per-commit hook migration to event-driven activity-core (see cron-migration §C).

## Naming decision (decided)

| Layer | Current | Target |
|-------|---------|--------|
| Operator docs | custodian sync / custodian-sync | **State Hub consistency sync** |
| ActivityDefinition id | (not landed) | `the-custodian.state-hub-consistency-sweep` |
| systemd unit (interim) | `custodian-sync.{service,timer}` | disable after cutover; optional rename to `statehub-consistency-sync.*` during WP-0063 if low cost |
| git hook marker | `# custodian-sync-hook` | unchanged in this workplan |

---

## T1 — ActivityDefinition and cluster wiring

```task
id: STATE-WP-0064-T01
status: todo
priority: high
state_hub_task_id: "ecc0f846-e00f-4063-8ec1-f6ad630e9265"
```

Create `the-custodian/activity-definitions/state-hub-consistency-sweep.md`
from the draft in `docs/cron-migration.md` §2A, adjusting:

- shell command to reach the workstation repo path or a cluster-side checkout
- `STATE_HUB_URL` via bridge service (not hard-coded `127.0.0.1` on cluster)
- `misfire_policy: skip` and `--max-seconds 300` budget
- `on_failure: log_and_continue` for warn-only sweeps

Sync definition to Railiance01 activity-core (projection manifest per
`hourly-recently-on-scope` precedent). Enable after manual canary.

## T2 — Manual canary on Railiance01

```task
id: STATE-WP-0064-T02
status: todo
priority: high
state_hub_task_id: "2e9b5b66-a7b1-46a5-8e1f-22e6b5caeff6"
```

Trigger one manual ActivityRun. Confirm:

- `consistency_check.py --remote --all` completes within budget
- C-15 writeback and C-16 pull gate behave as today
- progress or activity-core run history shows success
- no duplicate side-effects when local timer also fires (idempotent)

## T3 — Parallel run and observability

```task
id: STATE-WP-0064-T03
status: todo
priority: medium
state_hub_task_id: "8abb31ad-2f03-4aa7-889e-e60c3c39f1f8"
```

Run cluster schedule (`*/15 * * * *` UTC per design stub) alongside local
`custodian-sync.timer` for **one week**. Compare:

- sweep completion rate
- repos skipped due to lock or budget
- hard failures vs warn-only exits

Document comparison in a progress event or short runbook addendum.
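
The three comparison metrics can be computed the same way for both runners once run records are collected. The sketch below assumes a hypothetical record shape, one dict per sweep with an `outcome` of `"ok"`, `"warn"`, or `"fail"` and an optional `skipped_repos` count; neither the field names nor the sample values come from activity-core or `consistency_check.py`.

```python
from collections import Counter


def summarize(runs):
    """Summarize sweep records into the three metrics compared in T3."""
    outcomes = Counter(run["outcome"] for run in runs)
    total = sum(outcomes.values())
    return {
        "total": total,
        "completion_rate": outcomes["ok"] / total if total else 0.0,
        "warn_only": outcomes["warn"],
        "hard_failures": outcomes["fail"],
        "repos_skipped": sum(run.get("skipped_repos", 0) for run in runs),
    }


# Made-up records for illustration; real data would come from run history.
cluster_runs = [
    {"outcome": "ok"},
    {"outcome": "ok", "skipped_repos": 1},
    {"outcome": "warn"},
    {"outcome": "ok"},
]
local_runs = [
    {"outcome": "ok"},
    {"outcome": "fail"},
]

for name, runs in (("cluster", cluster_runs), ("local", local_runs)):
    print(name, summarize(runs))
```

Reporting absolute counts next to the rate matters here, because the local timer only fires while the workstation is awake and will have far fewer samples than the cluster schedule.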

## T4 — Retire local timer

```task
id: STATE-WP-0064-T04
status: todo
priority: medium
state_hub_task_id: "c8275471-5ec0-4dfb-8fec-2b3ec3894036"
```

After parallel week passes:

```bash
systemctl --user disable --now custodian-sync.timer
```

Archive or update unit files under `infra/`. Mark cron-migration stub §5 step 4
complete. Update `docs/activity-core-delegation.md` cross-reference.

## T5 — Docs and operator handoff

```task
id: STATE-WP-0064-T05
status: todo
priority: low
state_hub_task_id: "270ed7dd-aa79-469d-a817-e3fa1e71be41"
```

- `infra/README.md`: primary schedule is activity-core on Railiance01; local
  timer is retired.
- `docs/cron-migration.md`: promote §2A from design stub to implemented;
  note blockers cleared.
- Dashboard or AGENTS snippet: "State Hub consistency sync" terminology.

Mark workplan `finished` when cluster schedule is the sole primary runner.