the-custodian/workplans/CUST-WP-0026-distributed-consistency.md
2026-03-26 12:16:33 +01:00

id: CUST-WP-0026
type: workplan
title: Distributed Consistency — Multi-Machine State Sync
domain: custodian
repo: the-custodian
status: done
owner: custodian
topic_slug: custodian
created: 2026-03-21
updated: 2026-03-21
state_hub_workstream_id: 32de6210-ce1e-4cba-ad1f-fdeba462030d

Distributed Consistency — Multi-Machine State Sync

Problem

The consistency checker assumes local workplan files are always the authoritative source of truth. This breaks in the primary development workflow:

  1. Implementation runs on CoulombCore (remote)
  2. Task status is written to the state-hub DB via ops-bridge tunnel
  3. The workstation's local repo is not updated (no git pull)
  4. Session close triggers fix-consistency on the workstation
  5. Checker reads stale local files (tasks still todo) and regresses DB status — overwriting done/in_progress back to todo
  6. The dashboard shows progress, then silently reverts

This is a design assumption in ADR-001 that breaks under multi-machine workflows. ADR-001 states that the DB is rebuilt from files, which only holds when local files are always up to date.

Goal

Eliminate false regressions and make fix-consistency safe to run regardless of local repo staleness. Three layers of defence:

  • T01 (no-regress rule): Never allow fix-consistency to move a task backwards in status. DB-ahead wins.
  • T02 (pull gate): Detect and warn when local repo is behind its remote before applying fixes.
  • T03 (DB→file writeback): Write DB status back into workplan files and commit, so files stay truthful and the multi-machine workflow naturally converges.

Implementation Notes

The status progression order for the no-regress rule is todo → in_progress / blocked → done / cancelled. in_progress and blocked share a rank, as do done and cancelled, matching STATUS_ORDER in T01.

For the pull gate, git fetch is the only network call needed: no push and no merge, just detection. Fix mode should warn and skip writes for the affected repo; check mode should always be allowed to report.

For writeback (T03), fix-consistency --fix needs to:

  1. Detect tasks where DB status > file status
  2. Edit the workplan file (update the status: field in the task block)
  3. Stage and commit the change with a standard commit message

Writeback must be idempotent and must not alter anything other than status: fields in task blocks.
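
Step 1 above can be sketched as follows. This is a minimal illustration, not the checker's actual code: `writeback_candidates` and the two plain dicts of task id to status are hypothetical, and the ranks are taken from the STATUS_ORDER table in T01.

```python
# Ranks as given in T01; ties mean neither status is "ahead" of the other.
STATUS_ORDER = {"todo": 0, "in_progress": 1, "blocked": 1,
                "done": 2, "cancelled": 2}


def writeback_candidates(db_tasks, file_tasks):
    """Return (task_id, file_status, db_status) where the DB ranks strictly higher.

    Both arguments are hypothetical {task_id: status} mappings.
    """
    return [(tid, file_tasks[tid], db_status)
            for tid, db_status in sorted(db_tasks.items())
            if tid in file_tasks
            and STATUS_ORDER[db_status] > STATUS_ORDER[file_tasks[tid]]]
```

Because the comparison is strict, a second run after a successful writeback finds no candidates, which is what makes the operation idempotent.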

Tasks

T01 — No-regress rule in consistency_check.py

id: CUST-WP-0026-T01
status: done
priority: high
state_hub_task_id: "34a76f4c-ad3f-4780-ad62-1e788ceca224"

Modify state-hub/scripts/consistency_check.py so that --fix mode never regresses task status in the DB.

Status ordering:

STATUS_ORDER = {"todo": 0, "in_progress": 1, "blocked": 1,
                "done": 2, "cancelled": 2}

In the C-11 fix path (file task found, DB task found, statuses differ):

  • If STATUS_ORDER[db_status] >= STATUS_ORDER[file_status] (DB is ahead, or both statuses share a rank): skip the DB update and emit a new check code, C-13 WARN: "DB task '{title}' is ahead of file (db={db_status}, file={file_status}) — skipped to prevent regression"
  • If STATUS_ORDER[db_status] < STATUS_ORDER[file_status]: apply the update as today (file is ahead, sync forward)

New check code C-13: "DB task ahead of workplan file — regression prevented". Severity: WARN (not FAIL — this is expected in multi-machine workflows).
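
The C-11 decision described above could look roughly like this. The function name `resolve_c11` and its return shape are assumptions made for illustration; only STATUS_ORDER and the C-13 message text come from this workplan.

```python
STATUS_ORDER = {"todo": 0, "in_progress": 1, "blocked": 1,
                "done": 2, "cancelled": 2}


def resolve_c11(title, db_status, file_status):
    """Decide what --fix does for a task whose DB and file statuses differ.

    Returns (action, message); action is "skip" or "update_db" (hypothetical names).
    """
    if STATUS_ORDER[db_status] >= STATUS_ORDER[file_status]:
        # DB is ahead or at the same rank: leave the DB untouched and warn.
        message = (f"C-13 WARN: DB task '{title}' is ahead of file "
                   f"(db={db_status}, file={file_status}) "
                   "— skipped to prevent regression")
        return "skip", message
    # File is ahead: sync the DB forward, as before this change.
    return "update_db", None
```

With this shape, a stale file reporting todo can never overwrite a done row, while a file that has genuinely moved forward still updates the DB.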

Gate: make test must pass after this change.


T02 — Git pull gate before --fix

id: CUST-WP-0026-T02
status: done
priority: high
state_hub_task_id: "f9dbad4e-ba66-4e20-83ef-93b78c9e1590"

Add a remote-staleness check to consistency_check.py that runs at the start of --fix mode for each repo being checked.

Detection logic:

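git -C <repo_path> fetch --quiet origin 2>/dev/null
LOCAL=$(git -C <repo_path> rev-parse HEAD)
REMOTE=$(git -C <repo_path> rev-parse '@{u}' 2>/dev/null)
# Count commits on the upstream that HEAD lacks; LOCAL != REMOTE alone is also true when the repo is ahead
BEHIND=$(git -C <repo_path> rev-list --count 'HEAD..@{u}' 2>/dev/null)
# If REMOTE resolved and BEHIND > 0 → repo is behind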

If the repo is behind its remote tracking branch:

  • In --fix mode: emit C-14 WARN and skip all write operations for that repo. Print: "Repo '{slug}' is behind remote — pull before fixing to avoid clobbering remote progress".
  • In check-only mode: emit C-14 INFO (no-op, just informational).

The git fetch must be best-effort — if the remote is unreachable (offline, ops-bridge down), skip the check silently rather than failing.

New check code C-14: "Repo behind remote tracking branch". Severity: WARN in fix mode, INFO in check mode.
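
A Python sketch of the staleness check, under stated assumptions: `run_git`, `is_behind_remote` and `c14_severity` are hypothetical names, the remote is assumed to be called origin, and the check counts upstream-only commits with `git rev-list --count` rather than comparing the two hashes, so a repo that is merely ahead of its remote is not flagged. The git runner is injectable so a test can mock its output.

```python
import subprocess


def run_git(repo_path, *args):
    """Run git in repo_path; return stripped stdout, or None on any failure."""
    try:
        proc = subprocess.run(["git", "-C", repo_path, *args],
                              capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return proc.stdout.strip() if proc.returncode == 0 else None


def is_behind_remote(repo_path, git=run_git):
    """True if the upstream has commits HEAD lacks; False if not, or if unknown."""
    if git(repo_path, "fetch", "--quiet", "origin") is None:
        return False  # remote unreachable: best effort, skip silently
    behind = git(repo_path, "rev-list", "--count", "HEAD..@{u}")
    if behind is None:
        return False  # no upstream tracking branch configured
    return int(behind) > 0


def c14_severity(fix_mode):
    """C-14 is WARN in fix mode and INFO in check-only mode."""
    return "WARN" if fix_mode else "INFO"
```

In a test, passing `git=lambda repo, *a: {"fetch": "", "rev-list": "2"}.get(a[0])` simulates a repo two commits behind without touching the network.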

Gate: make test must pass. Add a test that simulates a behind-remote repo (mock rev-parse output).


T03 — DB→file status writeback

id: CUST-WP-0026-T03
status: done
priority: medium
state_hub_task_id: "749130f9-b397-46fd-8eb3-43c0fc127dac"

Extend consistency_check.py --fix to write DB status back into workplan files when DB is ahead of the file (the C-13 case from T01).

Writeback logic:

  1. Locate the task block in the workplan file by matching id: <task_id>
  2. Replace the status: <old> line within that block with status: <new>
  3. Stage the file: git -C <repo_path> add <workplan_file>
  4. Commit with message:
    chore(consistency): sync task status from DB [auto]
    
    Updated by fix-consistency on <ISO-date>:
    - <task_id>: <old_status> → <new_status>
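
The commit message in step 4 could be assembled as below. `build_commit_message` and its `changes` argument are hypothetical; the subject line and body layout follow the template above, with the ISO date taken from `datetime.date`.

```python
from datetime import date


def build_commit_message(changes, on=None):
    """Build the writeback commit message.

    changes: iterable of (task_id, old_status, new_status) tuples.
    on: optional date, defaulting to today (useful for deterministic tests).
    """
    day = (on or date.today()).isoformat()
    lines = ["chore(consistency): sync task status from DB [auto]",
             "",
             f"Updated by fix-consistency on {day}:"]
    lines += [f"- {task_id}: {old} → {new}" for task_id, old, new in changes]
    return "\n".join(lines) + "\n"
```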
    

Guard rails:

  • Only modify lines inside a ```task ... ``` block
  • Only change the status: field — never touch id:, priority:, state_hub_task_id:, or any other field
  • If the workplan file has uncommitted local changes, skip writeback for that file and emit C-14 WARN ("workplan has uncommitted changes — skipping writeback")
  • If git commit fails for any reason, log the error but do not abort the rest of the consistency run
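
A minimal sketch of the file edit that respects the first two guard rails. `rewrite_status` and `_patch` are hypothetical helpers, and the block layout (a fence line reading task, then `id:` and `status:` lines, then a closing fence) is assumed from the task blocks in this workplan. It scans line by line instead of using the existing task block regex, so every line outside the matching block is passed through byte for byte.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def rewrite_status(text, task_id, new_status):
    """Return text with status: replaced only in the task block whose id matches."""
    out, block, in_block = [], [], False
    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        if not in_block:
            out.append(line)
            in_block = stripped == FENCE + "task"
        elif stripped == FENCE:
            out.extend(_patch(block, task_id, new_status))
            out.append(line)
            block, in_block = [], False
        else:
            block.append(line)
    out.extend(block)  # unterminated block: emit unchanged
    return "".join(out)


def _patch(block, task_id, new_status):
    """Rewrite the status: line of one block if its id: line matches task_id."""
    if not any(row.strip() == f"id: {task_id}" for row in block):
        return block
    patched = []
    for row in block:
        if row.lstrip().startswith("status:"):
            indent = row[:len(row) - len(row.lstrip())]
            ending = row[len(row.rstrip("\r\n")):]  # preserve the line ending
            row = f"{indent}status: {new_status}{ending}"
        patched.append(row)
    return patched
```

Running it twice with the same arguments returns the same text, so a repeated consistency run produces no further diff and therefore nothing new to commit.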

New flag: --no-writeback — disables T03 behaviour while keeping T01/T02 active. Default: writeback enabled when --fix is set.
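
One way the flag interaction could be wired with argparse. `parse_args` and the derived `writeback` attribute are assumptions for illustration; only the `--fix` and `--no-writeback` flag names come from this workplan.

```python
import argparse


def parse_args(argv):
    """Parse CLI flags and derive whether DB-to-file writeback should run."""
    parser = argparse.ArgumentParser(prog="consistency_check.py")
    parser.add_argument("--fix", action="store_true",
                        help="apply fixes instead of only reporting")
    parser.add_argument("--no-writeback", action="store_true",
                        help="disable DB-to-file writeback (T03)")
    args = parser.parse_args(argv)
    # Writeback is on by default, but only ever in fix mode.
    args.writeback = args.fix and not args.no_writeback
    return args
```

Deriving a single `writeback` boolean up front keeps the T01 and T02 code paths unaware of the flag, which is what lets them stay active when writeback is disabled.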

Gate: make test must pass. The existing workplan parsing tests should cover the task block regex; add a writeback-specific test.


T04 — Session protocol update

id: CUST-WP-0026-T04
status: done
priority: medium
state_hub_task_id: "59a5d09a-1e67-4749-9d84-039982edc3ef"

Update the-custodian/CLAUDE.md session close protocol (step 5) to reflect the new behaviour and add the recommended pre-fix step:

Current step 5:

If any workplan files were written or modified this session, run: make fix-consistency REPO=the-custodian

Updated step 5:

Before running fix-consistency on any repo that has a remote, ensure the local copy is up to date:

git -C <repo_path> pull --ff-only
cd state-hub && make fix-consistency REPO=<slug>

The consistency checker will now warn (C-14) if the repo is still behind and refuse to regress status (C-13). A C-13 warning is normal for repos where work has progressed on a remote machine — it means writeback is keeping the files in sync.

Also update the state-hub/scripts/project_rules/session-protocol.template so newly registered repos get the updated guidance.


T05 — Makefile: fix-consistency-remote target

id: CUST-WP-0026-T05
status: done
priority: low
state_hub_task_id: "b8375cbc-9c44-48f6-a78c-b7333d409525"

Add a convenience target to state-hub/Makefile that pulls before fixing:

## Pull repo then sync consistency: make fix-consistency-remote REPO=net-kingdom
fix-consistency-remote:
	@test -n "$(REPO)" || (echo "ERROR: REPO is required."; exit 1)
	$(eval REPO_PATH := $(shell \
	  curl -s "$(API_BASE)/repos/?slug=$(REPO)" | \
	  python3 -c "import json,sys; \
	    repos=json.load(sys.stdin); \
	    print(next((r['local_path'] for r in repos if r['slug']=='$(REPO)'), ''))" \
	))
	@test -n "$(REPO_PATH)" || (echo "ERROR: repo '$(REPO)' not found in state-hub"; exit 1)
	git -C "$(REPO_PATH)" pull --ff-only || \
	  (echo "WARN: pull failed (conflicts or no remote) — running fix-consistency anyway"; true)
	$(MAKE) fix-consistency REPO=$(REPO) REPO_PATH="$(REPO_PATH)"

This makes the safe path the convenient path: make fix-consistency-remote REPO=net-kingdom
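
For readability, the inline Python in the target's `$(shell ...)` call is equivalent to the function below. `find_local_path` is a hypothetical name; the `slug` and `local_path` keys are the ones the one-liner already reads from the state-hub response.

```python
import json


def find_local_path(repos_json, slug):
    """Return local_path of the repo with the given slug, or '' if absent."""
    repos = json.loads(repos_json)
    return next((r["local_path"] for r in repos if r["slug"] == slug), "")
```

Returning an empty string for an unknown slug is what lets the following `test -n "$(REPO_PATH)"` line fail with a clear error.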

Done Criteria

  • make fix-consistency REPO=net-kingdom never regresses a done task back to todo when the local file is stale
  • C-13 warning is emitted (not error) when DB is ahead of file
  • C-14 warning is emitted in fix mode when repo is behind remote; fix operations are skipped for that repo
  • DB→file writeback commits corrected status to the workplan file
  • --no-writeback flag disables writeback cleanly
  • make fix-consistency-remote REPO=<slug> pulls then fixes in one step
  • make test passes after all changes
  • Session protocol updated in CLAUDE.md and session-protocol.template