state-hub/scripts/push-seal.md
2026-05-06 04:04:53 +02:00


# Push Seal (T04)
**Module:** `repo_sync.py`
**Check IDs:** C-16 (pull gate), C-17 (backlog guard), T04 (push seal)
**Tests:** `tests/test_repo_sync.py`
---
## The problem it solves
The consistency engine auto-commits two kinds of changes to managed repos:
- **Writebacks (T03):** task status patched into workplan files when the DB is ahead of the file
- **Brief updates:** `.custodian-brief.md` regenerated when workstream progress changes
Before T04, these commits were created but never pushed. The `custodian-sync.service` timer fires every 15 minutes. Each firing that detected any change would create a new commit. Over time — especially when a repo had no active workstreams and the brief kept oscillating — hundreds of auto-commits piled up locally while the remote received none. When real development commits landed on the remote, the repo became diverged and unrecoverable without manual intervention.
**Root causes:**
1. Commits were made without a corresponding push — an open loop.
2. C-16 (pull gate) only fires when remote is *ahead* of local, not when local is ahead. Once the first auto-commit ran, local was already ahead and C-16 never fired again.
3. The `.custodian-brief.md` domain-slug lookup produced `(unknown)` for fully-completed repos (no active workstreams → no topic lookup → empty string), causing the brief to be considered "changed" on every run.
---
## The invariant
> Every `fix_repo()` run that creates commits must push them to remote before returning, so the next run starts with `local ≡ remote`.
When this holds, the timer loop is **idempotent**: a clean repo with no unpushed commits is detected as clean and skipped entirely — no git activity, no divergence.
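A toy model makes the effect of the invariant concrete. It is only an illustration under simplifying assumptions: every timer firing creates exactly one auto-commit, and a push always succeeds. `simulate` is a hypothetical name, not part of `repo_sync.py`.

```python
def simulate(ticks: int, seal: bool) -> int:
    """Return the number of unpushed local commits after `ticks` timer firings.

    Toy model (assumption): each firing creates one auto-commit, and a push,
    when attempted, always succeeds.
    """
    ahead = 0
    for _ in range(ticks):
        ahead += 1      # auto-commit created by this firing
        if seal:
            ahead = 0   # push at the end of the run: local and remote match again
    return ahead


# 96 firings is one day of a 15-minute timer.
unsealed_backlog = simulate(96, seal=False)
sealed_backlog = simulate(96, seal=True)
```

Without the seal the backlog grows by one commit per firing; with it, every run ends with nothing left to push.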
---
## Three interlocking checks
### C-16 — pull gate (T02, pre-existing)
Fires when `count_remote_ahead(repo_path) > 0`.
The remote has commits that local lacks. Committing on top of the stale local branch would make the two histories diverge, so all write operations are skipped. In the service, the `--remote` flag pulls before fixing.
### C-17 — backlog guard (new)
Fires when `count_local_ahead(repo_path) > 0` AND `push_ff()` fails.
Local has unpushed commits from a prior run where the push failed. Before making more commits this run, the guard attempts to push the backlog. If push fails (e.g. the repo has diverged), all write operations for this run are skipped. This prevents the commit count from growing further during an already-broken state.
If push succeeds, the guard clears itself (`"C-17 cleared: pushed N backlogged commit(s)"`) and the run proceeds normally.
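The guard's decision can be sketched as follows. This is a minimal illustration, not the module's actual code: `backlog_guard` is a hypothetical helper, and the counting and push functions are passed in as parameters so the logic can be exercised without a real repository. Only the "C-17 cleared" message is taken from the text above; the other messages are made up for the example.

```python
from typing import Callable, Tuple


def backlog_guard(
    repo_path: str,
    count_local_ahead: Callable[[str], int],
    push_ff: Callable[[str], Tuple[bool, str]],
) -> Tuple[bool, str]:
    """Return (writes_allowed, log_message) for one fix run.

    Hypothetical helper: the real guard is described as part of fix_repo().
    """
    ahead = count_local_ahead(repo_path)
    if ahead == 0:
        # No backlog from a prior run, nothing to guard against.
        return True, "no backlog"

    ok, detail = push_ff(repo_path)
    if ok:
        return True, f"C-17 cleared: pushed {ahead} backlogged commit(s)"

    # Push still fails: skip writes so the backlog does not grow.
    return False, f"C-17: {ahead} unpushed commit(s), push failed: {detail}"
```

Passing the two functions in keeps the sketch testable with simple lambdas in place of real git calls.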
### T04 — push seal (new)
At the end of every `fix_repo()` run, after all writebacks and the brief update, `push_ff()` is called unconditionally.
```
fix_repo()
├── C-16 check → skip all writes if behind remote
├── C-17 guard → retry backlog push; skip all writes if push still fails
├── apply fixable issues (C-04, C-05, C-09, C-10, C-11, C-12, C-13, C-15)
├── _write_custodian_brief() → commit if content changed
└── push_ff() ← T04: seal the loop
```
A push on a repo with no new commits is a no-op that returns success, so T04 has nothing to do on runs that created no commits.
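The ordering in the diagram can be sketched as a small function. This is an assumption-laden outline rather than the real `fix_repo()`: `fix_repo_sketch` and its parameters are hypothetical, the gates are reduced to two precomputed inputs, and the sketch reads "unconditionally" to mean the final push runs even when the gates skipped all writes.

```python
from typing import Callable, List, Tuple

PushResult = Tuple[bool, str]


def fix_repo_sketch(
    behind: int,
    backlog_ok: bool,
    write_steps: List[Callable[[], None]],
    push_ff: Callable[[], PushResult],
) -> Tuple[bool, PushResult]:
    """Run the write steps only when both gates pass, then always push.

    Returns (writes_ran, push_result).
    """
    writes_ran = False
    if behind == 0 and backlog_ok:   # C-16 and C-17 both clear
        for step in write_steps:     # fixable issues, then the brief update
            step()
        writes_ran = True
    # T04: seal the loop regardless of what happened above.
    return writes_ran, push_ff()
```

Because the push is the last statement on every path, a run can skip its writes but cannot skip the seal.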
---
## Stable state
Once T04 is in effect, the timer converges to this stable state:
```
repo clean AND local ≡ remote
→ _report_needs_action(report, behind=0, ahead=0) → False
→ repo skipped (logged as CLEAN)
→ no git activity
→ state unchanged on next fire
```
The only way to leave stable state is:
- A real fix is needed (task drift, brief content change, etc.) → run completes with a push → returns to stable state
- Remote receives new commits → C-16 pulls → returns to stable state
- Push fails (network error, divergence) → C-17 fires next run → retries push → returns to stable state or escalates to a visible WARN
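The skip decision behind the stable state reduces to a three-way check. The sketch below is a simplified stand-in: the real `_report_needs_action` takes a report object whose shape is not shown here, so the hypothetical `needs_action` takes a plain issue count instead.

```python
def needs_action(issue_count: int, behind: int, ahead: int) -> bool:
    """True if the repo has anything to fix, pull, or push.

    Simplified stand-in (assumption) for _report_needs_action(report, behind, ahead).
    """
    return issue_count > 0 or behind > 0 or ahead > 0


# Stable state: clean report, local matches remote, so the repo is skipped.
stable = needs_action(issue_count=0, behind=0, ahead=0)

# Each way of leaving the stable state flips exactly one input.
exits = [
    needs_action(issue_count=1, behind=0, ahead=0),  # a real fix is needed
    needs_action(issue_count=0, behind=2, ahead=0),  # remote received commits
    needs_action(issue_count=0, behind=0, ahead=1),  # an earlier push failed
]
```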
---
## Domain-slug fallback (related fix)
The brief regeneration was also fixed to avoid oscillating between `inter_hub` and `(unknown)` for repos with no active workstreams.
**Before:** domain slug was resolved only via active workstreams. A repo with all workplans completed had no active workstreams → topic lookup returned nothing → `domain_slug = ""` → brief rendered `(unknown)` → brief was considered "changed" on every run → spurious commit every 15 minutes.
**After:** if no active workstreams exist, the lookup falls back to any workstream (completed or archived) on the same repo. Domain context is preserved permanently.
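A minimal sketch of the fallback, assuming workstreams can be represented as simple records. The `Workstream` class, its field names, the status strings, and `resolve_domain_slug` are all hypothetical; the actual lookup and schema are not shown in this document.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Workstream:
    repo: str
    status: str        # assumed values: "active", "completed", "archived"
    domain_slug: str


def resolve_domain_slug(repo: str, workstreams: Iterable[Workstream]) -> str:
    """Prefer an active workstream; fall back to any workstream on the repo."""
    on_repo = [w for w in workstreams if w.repo == repo]
    active = [w for w in on_repo if w.status == "active"]
    # `active or on_repo` picks the fallback list only when nothing is active.
    for candidate in active or on_repo:
        if candidate.domain_slug:
            return candidate.domain_slug
    return ""  # the caller would render this as "(unknown)"
```

With the fallback in place, a repo whose workplans are all completed keeps resolving to the same slug, so the brief content stops changing between runs.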
---
## Service configuration
```ini
# /home/worsch/.config/systemd/user/custodian-sync.service
ExecStart=… consistency_check.py --remote --all
```
`--remote --all` uses `fix_all_remote()`, which:
1. Runs `check_repo()` + `count_remote_ahead()` + `count_local_ahead()` for each repo
2. Skips repos that are already clean (no issues, not behind, not ahead)
3. For repos needing action: `git pull --ff-only` first, then `fix_repo()` (which ends with T04 push)
It also holds `/tmp/custodian-consistency-remote-all.lock` for the duration of
the sweep and defaults to a 300-second wall-clock budget. These guards keep a
slow or stalled sweep from overlapping with the next 15-minute timer activation.
Previously the service ran `--all --fix`, which performed neither the pull step nor the clean-repo skip.
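One way such a lock and time budget could be combined is sketched below. The mechanism is an assumption: the document only says that a lock file is held and that a 300-second budget applies, so the use of `fcntl.flock` and the names `sweep_lock` and `sweep` are illustrative, not a description of `fix_all_remote()`.

```python
import fcntl
import os
import time
from contextlib import contextmanager
from typing import Callable, Iterable, Iterator, List


@contextmanager
def sweep_lock(path: str) -> Iterator[bool]:
    """Yield True if this caller obtained the exclusive lock, else False."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False   # another sweep is still running
        yield acquired
    finally:
        os.close(fd)           # closing the descriptor releases the lock


def sweep(
    repos: Iterable[str],
    handle: Callable[[str], None],
    budget_s: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Handle repos in order until the wall-clock budget is used up."""
    deadline = clock() + budget_s
    done: List[str] = []
    for repo in repos:
        if clock() >= deadline:
            break              # leave the rest for the next timer activation
        handle(repo)
        done.append(repo)
    return done
```

A caller would wrap the sweep in `with sweep_lock(path) as acquired:` and return immediately when `acquired` is false, so a slow sweep and the next timer activation never run side by side.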
---
## Public API (`repo_sync.py`)
| Function | Returns | Purpose |
|---|---|---|
| `pull_ff(repo_path)` | `(bool, str)` | `git pull --ff-only`; T02 pull gate |
| `push_ff(repo_path)` | `(bool, str)` | `git push`; T04 push seal |
| `count_remote_ahead(repo_path)` | `int` | commits remote has that local lacks; C-16 input |
| `count_local_ahead(repo_path)` | `int` | commits local has that remote lacks; C-17 / T04 input |
All functions are **best-effort**: errors return `0` / `(False, "…")` rather than raising. The consistency engine must never be blocked by transient network issues.
Note: `push_ff` does not pass `--ff-only` to git — that flag applies to `git pull`, not `git push`. `git push` already rejects non-fast-forward updates by default. The function name signals the semantic guarantee (no force-push), not a CLI flag.
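The best-effort contract can be illustrated with a small sketch. The function names match the table, but the bodies are assumptions: the actual git invocations in `repo_sync.py` are not shown here, and the `_git` helper, the 30-second timeout, and the `@{upstream}` ranges are choices made for this example. A real remote-ahead count would also need fresh remote refs (for instance from a prior fetch), which the sketch omits.

```python
import subprocess
from typing import Tuple


def _git(repo_path: str, *args: str, timeout: float = 30.0) -> Tuple[bool, str]:
    """Run git in repo_path; never raise, return (ok, output)."""
    try:
        proc = subprocess.run(
            ["git", *args],
            cwd=repo_path,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        # Missing directory, missing git binary, or timeout.
        return False, str(exc)
    return proc.returncode == 0, (proc.stdout or proc.stderr).strip()


def count_local_ahead(repo_path: str) -> int:
    """Commits on HEAD that the upstream branch lacks; 0 on any error."""
    ok, out = _git(repo_path, "rev-list", "--count", "@{upstream}..HEAD")
    return int(out) if ok and out.isdigit() else 0


def count_remote_ahead(repo_path: str) -> int:
    """Commits on the upstream branch that HEAD lacks; 0 on any error."""
    ok, out = _git(repo_path, "rev-list", "--count", "HEAD..@{upstream}")
    return int(out) if ok and out.isdigit() else 0


def push_ff(repo_path: str) -> Tuple[bool, str]:
    """Plain `git push` without --force, so non-fast-forward updates are rejected."""
    return _git(repo_path, "push")
```

Every failure path collapses to `0` or `(False, message)`, so a caller can treat a missing path, a missing upstream, and a network timeout the same way.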
---
## Compatibility aliases in `consistency_check.py`
The original inline git functions have been replaced by imports from `repo_sync`, with thin wrappers for backward compatibility:
```python
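from repo_sync import count_local_ahead, count_remote_ahead, pull_ff, push_ff

# Old private names kept as aliases so existing imports keep working.
_detect_ahead_of_remote = count_local_ahead
_git_pull = pull_ff
_git_push = push_ff


def _detect_behind_remote(repo_path: str) -> bool:
    # Thin wrapper: converts the commit count into the bool this name returns.
    return count_remote_ahead(repo_path) > 0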
```
Existing code and tests that import `_detect_behind_remote` or `_git_pull` from `consistency_check` continue to work unchanged.
---
## Test coverage (`tests/test_repo_sync.py`)
All tests use real git repos via `tmp_path` — no mocks, no subprocess patches. A `git_pair` fixture provides a bare remote + one clone with an initial pushed commit. A `_make_second_clone` helper creates a second worker clone for divergence scenarios.
| Class | What it covers |
|---|---|
| `TestErrorResilience` | All four functions return safe defaults for non-git paths, non-existent paths, and repos with no upstream |
| `TestCountLocalAhead` | 0 when in sync; N after N local commits; 0 after push |
| `TestCountRemoteAhead` | 0 when in sync; 0 when local is ahead (not behind); N when remote has N new commits; nonzero when diverged |
| `TestPushFf` | Success when ahead; no-op success when in sync; push-seal invariant (ahead=0 after push); idempotent; fails when diverged; C-17 scenario (diverged push rejection does not grow the backlog) |
| `TestPullFf` | Up-to-date no-op; pulls new remote commits; resolves behind state to 0; fails when diverged; fails with no remote |
| `TestPushSealLoop` | Full fix-run cycle leaves repo clean; multiple writebacks sealed in one push; no-commit run needs no push; runaway prevention on failed push |