the-custodian/workplans/CUST-WP-0026-distributed-consistency.md
2026-03-26 12:16:33 +01:00

---
id: CUST-WP-0026
type: workplan
title: "Distributed Consistency — Multi-Machine State Sync"
domain: custodian
repo: the-custodian
status: done
owner: custodian
topic_slug: custodian
created: "2026-03-21"
updated: "2026-03-21"
state_hub_workstream_id: "32de6210-ce1e-4cba-ad1f-fdeba462030d"
---
# Distributed Consistency — Multi-Machine State Sync
## Problem
The consistency checker assumes local workplan files are always the authoritative
source of truth. This breaks in the primary development workflow:
1. Implementation runs on **CoulombCore** (remote)
2. Task status is written to the **state-hub DB** via ops-bridge tunnel
3. The **workstation's local repo** is not updated (no `git pull`)
4. Session close triggers `fix-consistency` on the workstation
5. Checker reads stale local files (tasks still `todo`) and **regresses** DB
status — overwriting `done`/`in_progress` back to `todo`
6. The dashboard shows progress, then silently reverts
The root cause is a design assumption in ADR-001: the DB is rebuilt from files.
That assumption holds only when local files are always up to date, which
multi-machine workflows do not guarantee.
## Goal
Eliminate false regressions and make `fix-consistency` safe to run regardless
of local repo staleness. Three layers of defence:
- **T01** (no-regress rule): Never allow fix-consistency to move a task
*backwards* in status. DB-ahead wins.
- **T02** (pull gate): Detect and warn when local repo is behind its remote
before applying fixes.
- **T03** (DB→file writeback): Write DB status back into workplan files and
commit, so files stay truthful and the multi-machine workflow naturally
converges.
## Implementation Notes
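The no-regress rule compares statuses by rank, not as a strict five-step chain.
`in_progress` and `blocked` share a rank, as do `done` and `cancelled` (see
`STATUS_ORDER` in T01):
`todo → in_progress / blocked → done / cancelled`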
For the pull gate, `git fetch` is the only network call needed. No push, no
merge — just detection. The fix mode should refuse or warn; check mode should
always be allowed to report.
For writeback (T03), `fix-consistency --fix` needs to:
1. Detect tasks where DB status > file status
2. Edit the workplan file (update the `status:` field in the task block)
3. Stage and commit the change with a standard commit message
Writeback must be idempotent and must not alter anything other than `status:`
fields in task blocks.
## Tasks
### T01 — No-regress rule in consistency_check.py
```task
id: CUST-WP-0026-T01
status: done
priority: high
state_hub_task_id: "34a76f4c-ad3f-4780-ad62-1e788ceca224"
```
Modify `state-hub/scripts/consistency_check.py` so that `--fix` mode never
regresses task status in the DB.
**Status ordering:**
```python
STATUS_ORDER = {"todo": 0, "in_progress": 1, "blocked": 1,
"done": 2, "cancelled": 2}
```
In the C-11 fix path (file task found, DB task found, statuses differ):
- If `STATUS_ORDER[db_status] >= STATUS_ORDER[file_status]` (DB is ahead of the
file, or at the same rank, e.g. `blocked` vs `in_progress`): skip the DB
update and emit a new check code **C-13** WARN:
`"DB task '{title}' is ahead of file (db={db_status}, file={file_status}) — skipped to prevent regression"`
- If `STATUS_ORDER[db_status] < STATUS_ORDER[file_status]`: apply the update
as today (file is ahead, sync forward)
New check code **C-13**: "DB task ahead of workplan file — regression
prevented". Severity: WARN (not FAIL — this is expected in multi-machine
workflows).
Gate: `make test` must pass after this change.
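The C-11 decision above can be sketched as a small pure function. This is a minimal illustration, not the code in `consistency_check.py`: the helper name `decide_c11_fix` and its return shape are hypothetical, and only `STATUS_ORDER` and the C-13 code come from this workplan.

```python
# Rank table as given in this task.
STATUS_ORDER = {"todo": 0, "in_progress": 1, "blocked": 1,
                "done": 2, "cancelled": 2}


def decide_c11_fix(db_status: str, file_status: str):
    """Decide what --fix does for one task whose DB and file statuses differ.

    Hypothetical helper. Returns (action, check_code):
      ("skip", "C-13")     DB is ahead or level, leave the DB untouched
      ("update_db", None)  file is ahead, sync the DB forward
    """
    if STATUS_ORDER[db_status] >= STATUS_ORDER[file_status]:
        return "skip", "C-13"
    return "update_db", None
```

Keeping the decision free of I/O would let a unit test cover every status pair without a database or a repo checkout.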
---
### T02 — Git pull gate before --fix
```task
id: CUST-WP-0026-T02
status: done
priority: high
state_hub_task_id: "f9dbad4e-ba66-4e20-83ef-93b78c9e1590"
```
Add a remote-staleness check to `consistency_check.py` that runs at the start
of `--fix` mode for each repo being checked.
**Detection logic:**
```bash
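# REPO_PATH: local checkout being checked (the <repo_path> used elsewhere)
git -C "$REPO_PATH" fetch --quiet origin 2>/dev/null
LOCAL=$(git -C "$REPO_PATH" rev-parse HEAD)
REMOTE=$(git -C "$REPO_PATH" rev-parse '@{u}' 2>/dev/null)
# Behind = REMOTE resolved and not yet contained in LOCAL's history.
# LOCAL != REMOTE alone would also flag a repo that is only ahead.
[ -n "$REMOTE" ] && ! git -C "$REPO_PATH" merge-base --is-ancestor "$REMOTE" "$LOCAL" \
  && echo "behind"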
```
If the repo is behind its remote tracking branch:
- In `--fix` mode: emit **C-14** WARN and skip all write operations for that
repo. Print: `"Repo '{slug}' is behind remote — pull before fixing to avoid
clobbering remote progress"`.
- In check-only mode: emit C-14 INFO (no-op, just informational).
The `git fetch` must be best-effort — if the remote is unreachable (offline,
ops-bridge down), skip the check silently rather than failing.
New check code **C-14**: "Repo behind remote tracking branch". Severity: WARN
in fix mode, INFO in check mode.
Gate: `make test` must pass. Add a test that simulates a behind-remote repo
(mock `rev-parse` output).
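One way to make the pull gate testable without a real remote is to inject the git runner. The sketch below assumes that approach. `run_git`, `repo_is_behind`, and `c14_severity` are hypothetical names, and the `merge-base --is-ancestor` step is an addition that keeps a repo which is only ahead of its remote from being reported as behind.

```python
import subprocess


def run_git(repo_path, *args):
    """Run git in repo_path and return (returncode, stripped stdout)."""
    proc = subprocess.run(["git", "-C", repo_path, *args],
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


def repo_is_behind(repo_path, git=run_git):
    """Best-effort check: True only if the upstream has commits HEAD lacks."""
    rc_fetch, _ = git(repo_path, "fetch", "--quiet", "origin")
    if rc_fetch != 0:
        return False  # remote unreachable: skip the check silently
    rc_local, local = git(repo_path, "rev-parse", "HEAD")
    rc_remote, remote = git(repo_path, "rev-parse", "@{u}")
    if rc_local != 0 or rc_remote != 0 or not remote:
        return False  # no upstream configured or unreadable repo
    if local == remote:
        return False
    # Exit code 0: remote is already part of local history (local is ahead).
    # Exit code 1: remote has commits that local lacks (behind or diverged).
    rc, _ = git(repo_path, "merge-base", "--is-ancestor", remote, local)
    return rc == 1


def c14_severity(fix_mode: bool) -> str:
    """C-14 is a WARN in fix mode and INFO in check-only mode."""
    return "WARN" if fix_mode else "INFO"
```

A test can then pass a fake `git` callable that returns canned `rev-parse` output, which matches the mocking approach the gate asks for.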
---
### T03 — DB→file status writeback
```task
id: CUST-WP-0026-T03
status: done
priority: medium
state_hub_task_id: "749130f9-b397-46fd-8eb3-43c0fc127dac"
```
Extend `consistency_check.py --fix` to write DB status back into workplan
files when DB is ahead of the file (the C-13 case from T01).
**Writeback logic:**
1. Locate the task block in the workplan file by matching `id: <task_id>`
2. Replace the `status: <old>` line within that block with `status: <new>`
3. Stage the file: `git -C <repo_path> add <workplan_file>`
4. Commit with message:
```
chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on <ISO-date>:
- <task_id>: <old_status> → <new_status>
```
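The commit message could be assembled by a helper along these lines. This is a sketch under assumptions: `build_commit_message` and its `changes` argument are hypothetical, and the blank line after the subject follows the usual git convention rather than anything this workplan specifies.

```python
from datetime import date


def build_commit_message(changes, today=None):
    """Build the writeback commit message.

    changes: list of (task_id, old_status, new_status) tuples.
    today:   optional date override so tests get a stable message.
    """
    day = (today or date.today()).isoformat()
    lines = [
        "chore(consistency): sync task status from DB [auto]",
        "",  # blank line between subject and body (git convention, assumed)
        f"Updated by fix-consistency on {day}:",
    ]
    lines += [f"- {task_id}: {old} → {new}" for task_id, old, new in changes]
    return "\n".join(lines)
```

Batching every status change for one workplan file into a single message keeps the history to one commit per writeback run.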
**Guard rails:**
- Only modify lines inside a ` ```task ... ``` ` block
- Only change the `status:` field — never touch `id:`, `priority:`,
`state_hub_task_id:`, or any other field
- If the workplan file has uncommitted local changes, skip writeback for that
file and emit C-14 WARN ("workplan has uncommitted changes — skipping
writeback")
- If git commit fails for any reason, log the error but do not abort the rest
of the consistency run
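The first two guard rails can be illustrated with a pure text transform. The following is a minimal sketch, assuming task blocks look exactly like the ones in this file. `write_status` is a hypothetical name, and the real script may parse blocks differently.

```python
import re

FENCE = "`" * 3  # built at runtime so this example does not contain a raw fence
TASK_BLOCK = re.compile(
    rf"(^{FENCE}task\n)(.*?)(^{FENCE}[ \t]*$)", re.DOTALL | re.MULTILINE
)


def write_status(text: str, task_id: str, new_status: str) -> str:
    """Return text with the status of task_id set to new_status.

    Only the first status line inside the matching task block is rewritten.
    Other fields, other blocks, and prose outside blocks are left as they are.
    """
    id_line = re.compile(rf"^id:\s*{re.escape(task_id)}\s*$", re.MULTILINE)

    def patch(match):
        opener, body, closer = match.groups()
        if not id_line.search(body):
            return match.group(0)  # a different task: leave untouched
        new_body = re.sub(r"^status:.*$", lambda _m: f"status: {new_status}",
                          body, count=1, flags=re.MULTILINE)
        return opener + new_body + closer

    return TASK_BLOCK.sub(patch, text)
```

Running it a second time with the same arguments returns the text unchanged, which is the idempotency the Implementation Notes require.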
**New flag:** `--no-writeback` — disables T03 behaviour while keeping T01/T02
active. Default: writeback enabled when `--fix` is set.
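The flag semantics can be expressed with `argparse` roughly as follows. This is a hedged sketch: `build_parser` and `writeback_enabled` are hypothetical names, and the real script's other arguments are omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="consistency_check.py")
    parser.add_argument("--fix", action="store_true",
                        help="apply fixes instead of only reporting")
    parser.add_argument("--no-writeback", action="store_true",
                        help="keep T01/T02 behaviour but skip DB→file writeback")
    return parser


def writeback_enabled(args: argparse.Namespace) -> bool:
    # Writeback is on by default, but only ever in --fix mode.
    return args.fix and not args.no_writeback
```

For example, `--fix` alone enables writeback, `--fix --no-writeback` disables it, and check-only mode never writes.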
Gate: `make test` must pass. The existing workplan parsing tests should cover
the task block regex; add a writeback-specific test.
---
### T04 — Session protocol update
```task
id: CUST-WP-0026-T04
status: done
priority: medium
state_hub_task_id: "59a5d09a-1e67-4749-9d84-039982edc3ef"
```
Update `the-custodian/CLAUDE.md` session close protocol (step 5) to reflect
the new behaviour and add the recommended pre-fix step:
**Current step 5:**
> If any workplan files were written or modified this session, run:
> `make fix-consistency REPO=the-custodian`
**Updated step 5:**
> Before running fix-consistency on any repo that has a remote, ensure the
> local copy is up to date:
> ```bash
> git -C <repo_path> pull --ff-only
> cd state-hub && make fix-consistency REPO=<slug>
> ```
> The consistency checker will now warn (C-14) if the repo is still behind
> and refuse to regress status (C-13). A C-13 warning is normal for repos
> where work has progressed on a remote machine — it means writeback is
> keeping the files in sync.
Also update the `state-hub/scripts/project_rules/session-protocol.template`
so newly registered repos get the updated guidance.
---
### T05 — Makefile: fix-consistency-remote target
```task
id: CUST-WP-0026-T05
status: done
priority: low
state_hub_task_id: "b8375cbc-9c44-48f6-a78c-b7333d409525"
```
Add a convenience target to `state-hub/Makefile` that pulls before fixing:
```makefile
## Pull repo then sync consistency: make fix-consistency-remote REPO=net-kingdom
fix-consistency-remote:
	@test -n "$(REPO)" || (echo "ERROR: REPO is required."; exit 1)
	$(eval REPO_PATH := $(shell \
		curl -s "$(API_BASE)/repos/?slug=$(REPO)" | \
		python3 -c "import json,sys; \
		repos=json.load(sys.stdin); \
		print(next((r['local_path'] for r in repos if r['slug']=='$(REPO)'), ''))" \
	))
	@test -n "$(REPO_PATH)" || (echo "ERROR: repo '$(REPO)' not found in state-hub"; exit 1)
	git -C "$(REPO_PATH)" pull --ff-only || \
		(echo "WARN: pull failed (conflicts or no remote) — running fix-consistency anyway"; true)
	$(MAKE) fix-consistency REPO=$(REPO) REPO_PATH="$(REPO_PATH)"
```
This makes the safe path the convenient path:
`make fix-consistency-remote REPO=net-kingdom`
## Done Criteria
- [ ] `make fix-consistency REPO=net-kingdom` never regresses a `done` task
back to `todo` when local file is stale
- [ ] C-13 warning is emitted (not error) when DB is ahead of file
- [ ] C-14 warning is emitted in fix mode when repo is behind remote;
fix operations are skipped for that repo
- [ ] DB→file writeback commits corrected status to the workplan file
- [ ] `--no-writeback` flag disables writeback cleanly
- [ ] `make fix-consistency-remote REPO=<slug>` pulls then fixes in one step
- [ ] `make test` passes after all changes
- [ ] Session protocol updated in CLAUDE.md and session-protocol.template