Files
state-hub/workplans/STATE-WP-0059-mcp-write-layer-reliability.md

169 lines
6.6 KiB
Markdown

---
id: STATE-WP-0059
type: workplan
title: "State Hub MCP write-layer reliability"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-07"
updated: "2026-06-07"
state_hub_workstream_id: "0c9f9ed3-235b-4291-bfa3-cc08699b02b4"
---
# State Hub MCP Write-Layer Reliability
**Origin:** infrastructure-friction analysis from the `helix_forge` domain
(`agentic-resources/docs/ASSESSMENT-infra-friction.md`).
## Critical Review
This workplan conforms with `INTENT.md`: State Hub owns the local-first
coordination service and FastMCP tools that expose topics, workstreams, tasks,
decisions, and progress. Fixing a flaky MCP write wrapper keeps the hub usable
as an agent coordination surface without moving canon out of workplan files or
turning State Hub into a task factory.
Implementation should stay narrow:
- Keep MCP as a stateless HTTP client over the FastAPI API; do not add direct DB
writes or duplicate API business logic in `mcp_server/server.py`.
- Preserve the REST API as the source contract and make MCP responses/error
handling match it.
- Do not fold the bulk-write work from `STATE-WP-0058` into this workplan; this
plan is about correctness and reliable error surfacing for existing writes.
Repo review on 2026-06-07 found the likely fault surfaces:
- `mcp_server/server.py` implements the affected write tools directly:
`create_workstream`, `create_task`, `update_task_status`, `record_decision`,
and `add_progress_event`.
- `_get` / `_post` / `_patch` return `{"error": ...}` on HTTP/client failure.
Several write tools then assume successful response fields exist; for example
`update_task_status` formats `task['title']`, which can explain the observed
`'title'` KeyError when the underlying PATCH returned an error object.
- Existing `tests/test_mcp_smoke.py` verifies REST endpoint shapes the MCP tools
depend on, but it explicitly does not exercise the MCP protocol/tool wrapper
itself. This gap is where the `-32602 Invalid request parameters` family can
hide.
The State Hub MCP **write** tools fail while the **REST API stays healthy**
(`/state/health` → 200). Observed continuously during a helix_forge work session:
- `update_task_status``'title'` **KeyError**
- `add_progress_event` / `create_workstream` / `create_task` → `-32602 Invalid
request parameters`
Every hub write that session had to fall back to REST `PATCH`/`POST`. The
helix_forge error-mining corroborates it independently: this error family recurs
in **4 captured sessions across 3 repos**.
This is a **reliability defect in the MCP wrapper layer** (the API and DB are
fine), and it is dangerous: an agent with no REST fallback would silently fail to
record status/progress, leaving file↔hub drift unreconciled.
## Reproduce + Diagnose the MCP Write Failures
```task
id: STATE-WP-0059-T01
status: done
priority: high
state_hub_task_id: "7fc04b00-2493-4310-ae76-101660621da6"
```
Reproduce the `update_task_status` `'title'` KeyError and the `-32602` on
`add_progress_event` / `create_workstream` / `create_task` against the working REST
contract. Capture the exact failing request payloads and server-side
stack/response. Pin down response-serialization (KeyError `'title'`) vs
request-param-validation (`-32602`) and which tools each affects.
Done when a focused failing test or diagnostic script exercises the real MCP
tool wrapper, not only the FastAPI endpoint, and records:
- the MCP tool name and argument payload,
- the corresponding REST request/response,
- whether failure happens before the HTTP call, during HTTP error handling, or
during MCP response serialization.
## Fix the MCP Serialization / Param-Validation Layer
```task
id: STATE-WP-0059-T02
status: done
priority: high
state_hub_task_id: "2636e1d5-142e-463f-875e-4dc1edf12853"
```
Fix the MCP wrapper so the write tools succeed and return the same shape as their
REST equivalents. Add an end-to-end regression test exercising each write tool
(`create_workstream` / `create_task` / `update_task_status` / `add_progress_event`
/ `record_decision`) so the MCP and REST paths stay in parity.
Implementation notes:
- Add a small shared MCP response helper rather than open-coding success/error
assumptions in each write tool.
- If the API helper returns an error object, return a clear MCP-visible error
payload and do not emit follow-up progress events that pretend the write
succeeded.
- Ensure optional arguments and structured values such as `detail` are declared
and normalized in a way FastMCP can validate consistently.
- Keep existing automatic progress-event side effects for successful writes.
- Update `tests/test_mcp_smoke.py` or add a sibling test that invokes the MCP
tool layer directly enough to catch `-32602` and response-field KeyErrors.
## Verify + Harden Error Surfacing
```task
id: STATE-WP-0059-T03
status: done
priority: medium
state_hub_task_id: "af137d6b-07ca-4110-abb0-45a96b84b7a3"
```
Verify the fixed tools against the REST contract across a representative session
flow. Ensure failures surface clearly (no silent success) so agents can fall back
deterministically. Document the MCP↔REST parity guarantee in
`mcp_server/TOOLS.md`, including the expected fallback behavior when the API is
unreachable.
Done when:
- the MCP write regression tests pass,
- a manual local flow can create a workstream, create a task, update task
status, record a decision, and add progress through MCP while REST
verification confirms matching records,
- an induced API failure returns a clear error instead of a KeyError or false
success,
- the session closes with `make fix-consistency REPO=state-hub`.
After workplan updates, run from `~/state-hub`:
```bash
make fix-consistency REPO=state-hub
```
## Verification Notes
Completed 2026-06-07:
- Added shared MCP response/error helpers in `mcp_server/server.py` so API error
payloads and malformed REST responses return clear MCP-visible errors instead
of triggering response-field KeyErrors or false success.
- Kept successful MCP write responses in the same JSON shape as their REST
equivalents, while preserving automatic progress events only after successful
primary writes.
- Added `tests/test_mcp_write_tools.py`, which invokes the registered FastMCP
tools in-process and covers success/error behavior for `create_workstream`,
`create_task`, `update_task_status`, `add_progress_event`, and
`record_decision`.
- Documented MCP/REST write parity and fallback expectations in
`mcp_server/TOOLS.md`.
Verification:
- `.venv/bin/python -m pytest tests/test_mcp_write_tools.py -q` -> 8 passed
- `.venv/bin/python -m pytest tests/test_mcp_write_tools.py tests/test_mcp_smoke.py -q` -> 21 passed
- `git diff --check` -> clean