Fixed and improved token tracking

2026-05-23 13:59:05 +02:00
parent dd3279ea1a
commit c12091c2eb
29 changed files with 3549 additions and 278 deletions
--- a/workplans/CUST-WP-0012-multi-user-onboarding.md
+++ b/workplans/CUST-WP-0012-multi-user-onboarding.md
@@ -4,12 +4,12 @@ type: workplan
 title: "Multi-User Onboarding and Environment Bootstrap"
 domain: custodian
 repo: state-hub
-status: active
+status: finished
 owner: custodian
 topic_slug: custodian
 state_hub_workstream_id: "a28d9e29-4119-4b73-9469-f921920253ef"
 created: "2026-03-11"
-updated: "2026-05-17"
+updated: "2026-05-23"
 ---

 # Multi-User Onboarding and Environment Bootstrap
@@ -51,7 +51,7 @@ Two personas:
 ```task
 id: CUST-WP-0012-T01
 state_hub_task_id: 71628269-9a75-4dae-a347-e64a86040322
-status: todo
+status: done
 priority: medium
 ```

@@ -79,6 +79,12 @@ git config --global credential.helper 'cache --timeout=3600'
 **Done when:** included in bootstrap script; push to Gitea works without
 re-entering credentials on second attempt.

+**Implemented 2026-05-23:** `scripts/bootstrap-env.sh` configures a global
+credential helper when one is not already present. It prefers `libsecret`, uses
+`cache --timeout=3600` as the safe automatic fallback, and supports explicit
+headless plaintext storage via `--git-helper store --allow-plaintext-store`.
+`docs/onboarding.md` documents the tradeoffs.
+
 ---

 ### T02 — SSH key generation and authorization automation
@@ -86,7 +92,7 @@ re-entering credentials on second attempt.
 ```task
 id: CUST-WP-0012-T02
 state_hub_task_id: fea965e9-8a8f-439c-9096-8f7756eb71ed
-status: todo
+status: done
 priority: medium
 ```

@@ -110,6 +116,11 @@ ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.130.254

 **Done when:** included in bootstrap script; documented in onboarding guide.

+**Implemented 2026-05-23:** `scripts/bootstrap-env.sh` generates
+`~/.ssh/id_ed25519` if missing, prints the public key, and can run
+`ssh-copy-id` for Railiance01 and CoulombCore with `--authorize-ssh`.
+`docs/onboarding.md` documents the operator and collaborator path.
+
 ---

 ### T03 — Claude Code MCP registration automation
@@ -117,7 +128,7 @@ ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.130.254
 ```task
 id: CUST-WP-0012-T03
 state_hub_task_id: 60318e9a-972e-45c8-afde-82ed0625f594
-status: todo
+status: done
 priority: medium
 ```

@@ -132,10 +143,10 @@ make register-mcp   # idempotent; safe to re-run

 The script should:
 1. Detect whether `state-hub` is already in `~/.claude.json`
-2. Extract the server config from `.mcp.json`
+2. Use the current SSE MCP config (`http://127.0.0.1:8001/sse` locally or
+   `http://127.0.0.1:18001/sse` through ops-bridge)
 3. Run `claude mcp add-json -s user state-hub <config>`
-4. Run `patch_mcp_cwd.py` to restore the cwd field
-5. Print instructions to restart Claude Code
+4. Print instructions to restart Claude Code

 Should also detect whether the state hub is reachable directly
 (`http://127.0.0.1:8000`) or needs a tunnel (via ops-bridge), and emit
@@ -144,6 +155,12 @@ a warning if neither is available.
 **Done when:** `make register-mcp` works on a clean machine; documented
 in onboarding guide.

+**Implemented 2026-05-23:** `scripts/register-mcp.sh` and the
+`make register-mcp` target register the current SSE MCP transport
+idempotently. The script detects local/tunnel reachability, supports
+`MCP_URL`, `API_BASE`, and `DRY_RUN=1`, and documents the old `.mcp.json` cwd
+patch path as legacy.
+
 ---

 ### T04 — Environment bootstrap script
@@ -151,7 +168,7 @@ in onboarding guide.
 ```task
 id: CUST-WP-0012-T04
 state_hub_task_id: 84a94761-e424-4470-a9a2-64d9cabadb7f
-status: todo
+status: done
 priority: high
 ```

@@ -176,6 +193,11 @@ Design constraints:
 **Done when:** running the script on a clean Ubuntu 24.04 machine
 produces a working Custodian environment with no additional manual steps.

+**Implemented 2026-05-23:** `scripts/bootstrap-env.sh` and
+`make bootstrap-env` provide an idempotent entrypoint that supports dry-run,
+non-interactive mode, optional apt package installation, SSH authorization,
+Gitea token prompting, MCP registration, and State Hub health checks.
+
 ---

 ### T05 — Onboarding guide and user journey documentation
@@ -183,7 +205,7 @@ produces a working Custodian environment with no additional manual steps.
 ```task
 id: CUST-WP-0012-T05
 state_hub_task_id: b0839802-659a-475b-8b84-ab7341ea3d15
-status: todo
+status: done
 priority: medium
 ```

@@ -208,6 +230,10 @@ for both personas:
 **Done when:** a new collaborator can follow the guide without
 clarification from the primary operator.

+**Implemented 2026-05-23:** `docs/onboarding.md` covers primary operator and
+domain collaborator journeys, including SSH, Gitea token file, credential
+helper choices, MCP registration, tunnel setup, and verification checks.
+
 ---

 ### T06 — State Hub multi-user model (deferred)
@@ -215,7 +241,7 @@ clarification from the primary operator.
 ```task
 id: CUST-WP-0012-T06
 state_hub_task_id: d5df3302-67b9-4765-a8d8-ea2df53dff6e
-status: todo
+status: done
 priority: low
 ```

@@ -235,6 +261,11 @@ domain) or rely on Gitea repo permissions as the authoritative boundary
 Implement T01–T05 first; multi-user access control is only needed when
 there is more than one user.

+**Implemented 2026-05-23:** `docs/multi-user-access-model.md` records the
+current decision: repo permissions, SSH access, tunnels, and OpenBao remain the
+authoritative boundaries for this phase; State Hub API auth is deferred until a
+real second-user or exposed-deployment trigger exists.
+
 ---

 ## References
--- a/workplans/STATE-WP-0045-token-measurement-accuracy.md
+++ b/workplans/STATE-WP-0045-token-measurement-accuracy.md
@@ -0,0 +1,310 @@
+---
+id: STATE-WP-0045
+type: workplan
+title: "Token Measurement Accuracy and Resilience"
+domain: custodian
+repo: state-hub
+status: finished
+owner: codex
+topic_slug: custodian
+created: "2026-05-23"
+updated: "2026-05-23"
+state_hub_workstream_id: "0aefe379-c182-4471-84dd-c136d5e1206b"
+---
+
+# Token Measurement Accuracy and Resilience
+
+## Summary
+
+Make State Hub token tracking accurate enough to trust for daily operations and
+robust enough to survive agent/tool changes.
+
+The May 19 flatline showed the current weak spots: token events mixed measured
+usage, task-completion fallbacks, and file-sync side effects in the same table;
+Claude measurement depended on one hook path; Codex usage lived in local session
+logs until a manual backfill; and the dashboard treated every token event as the
+same quality of evidence. The immediate fix restored Codex session totals and
+suppressed sync-generated fallback events, but the system still needs a durable
+measurement model, idempotent source adapters, reconciliation checks, and a
+dashboard that exposes provenance and confidence.
+
+## Current Findings
+
+- `token_events` stores counts, associations, free-text notes, and timestamps,
+  but not structured provenance such as source system, source event id, parser
+  version, raw token categories, confidence, or whether the row is measured,
+  allocated, estimated, or superseded.
+- `PATCH /tasks/{id}` can still create heuristic token events on a transition to
+  `done`. That fallback is useful as a temporary operational signal, but it is
+  not a measurement and should not be blended into measured totals.
+- `fix-consistency` now suppresses token events while syncing file-backed task
+  status, but this is a narrow guard. Other bulk sync, import, and migration
+  paths need the same invariant.
+- Codex Desktop session logs contain structured `token_count` events with
+  `last_token_usage`, `total_token_usage`, cached-input counts, and reasoning
+  output counts. The new backfill script can restore these, but it is not yet a
+  scheduled or monitored ingestion path.
+- Claude Code measurement currently depends on `scripts/task_token_hook.py`
+  firing after one MCP tool name. It uses per-session state in `/tmp`, so missed
+  hooks, restarts, renamed tools, and non-MCP REST paths can silently degrade to
+  fallback events.
+- Repository attribution for Codex backfill is path-based. This is good enough
+  for the emergency restore, but long-term attribution should prefer registered
+  repo fingerprints/remotes and then fall back to paths.
+- The Token Cost dashboard currently aggregates all events returned by
+  `/token-events/?limit=1000`; it does not show measurement quality, source,
+  superseded rows, ingestion freshness, or possible gaps.
+
+## Out of Scope
+
+- Exact billing reconciliation against vendor invoices.
+- Capturing private transcript content in State Hub.
+- Replacing existing task/workstream/repo relationships.
+- Implementing every provider-specific parser in one pass. The first pass should
+  cover Codex Desktop and Claude Code, with a documented adapter contract for
+  others.
+
+## T01 - Define Token Evidence Model
+
+```task
+id: STATE-WP-0045-T01
+status: done
+priority: high
+state_hub_task_id: "29aed6d9-40aa-40fc-9e9a-3eb3e6f985bc"
+```
+
+Define a structured model that separates measured usage from allocated,
+estimated, and superseded rows.
+
+Implementation notes:
+
+- Add a short design note or ADR section covering token event semantics.
+- Define measurement classes such as `measured`, `allocated`, `estimated`, and
+  `superseded`.
+- Define source classes such as `codex_session`, `claude_transcript`,
+  `llm_connect`, `manual`, and `task_fallback`.
+- Define structured provenance fields: source system, source id, source path or
+  URI, source timestamp, parser version, ingestion timestamp, and confidence.
+- Decide how to represent raw token categories: input, cached input, output,
+  reasoning output, and provider total.
+- Decide whether cached input should be included in default totals or shown as a
+  separate metric. Preserve enough fields to support both views.
+- Retire the free-text note taxonomy as the primary quality signal. Notes
+  can remain for human context, but dashboards and APIs should rely on
+  structured fields.
+
+Done when the repo has a reviewed token evidence contract and the follow-on
+schema/API tasks can implement it without ambiguity.
+
+## T02 - Add Provenance Schema and Idempotent Upsert API
+
+```task
+id: STATE-WP-0045-T02
+status: done
+priority: high
+state_hub_task_id: "ade2bd40-343c-4829-ba4f-44bc8b7cbef9"
+```
+
+Extend token storage so source-derived events can be written repeatedly without
+duplicates and without losing provenance.
+
+Implementation notes:
+
+- Add migration fields for the evidence model from T01. Candidate fields:
+  `measurement_kind`, `source_provider`, `source_id`, `source_path`,
+  `source_created_at`, `ingested_at`, `parser_version`, `confidence`,
+  `cached_input_tokens`, `reasoning_output_tokens`, `raw_total_tokens`,
+  `cost_estimated_usd`, and `raw_metadata`.
+- Add a unique constraint or partial unique index that prevents duplicate
+  measured source rows. For example: source provider plus source id, scoped by
+  measurement kind.
+- Provide an upsert endpoint or make `POST /token-events/` support an explicit
+  idempotency key. The behavior should update a growing live session rather than
+  creating a second row.
+- Keep backward compatibility for existing clients that only post
+  `tokens_in`/`tokens_out`, but classify those rows explicitly.
+- Update schemas, router tests, and migration tests.
+
+Done when source-backed token events can be inserted or updated idempotently and
+legacy callers continue to work.
+
+## T03 - Build Reusable Token Source Adapters
+
+```task
+id: STATE-WP-0045-T03
+status: done
+priority: high
+state_hub_task_id: "3844fb70-4ceb-4f90-9894-d4845970f0a6"
+```
+
+Move source-specific parsing out of one-off scripts and hooks into reusable,
+tested adapter modules.
+
+Implementation notes:
+
+- Add an `api/services/token_sources/` package or equivalent service layer.
+- Implement a Codex Desktop adapter for `.codex/sessions/**` and
+  `.codex/archived_sessions/**`.
+- Implement a Claude Code adapter for `.claude/projects/**/*.jsonl` that reads
+  usage metadata without storing transcript text.
+- Provide a common adapter result type with source id, timestamps, token
+  categories, model, agent, cwd/path context, and raw parser metadata.
+- Make parsing safe by default: no conversation text in logs, progress events,
+  token notes, or API payloads.
+- Add fixtures with synthetic Codex and Claude session records that cover live
+  sessions, archived sessions, duplicate files, malformed JSONL, resets, and
+  missing usage records.
+- Keep `scripts/backfill_codex_token_events.py` as a thin CLI over the reusable
+  service or replace it with a new unified CLI.
+
+Done when Codex and Claude token sources have deterministic parser tests and a
+shared ingestion interface.
+
+## T04 - Improve Repo, Workstream, and Task Attribution
+
+```task
+id: STATE-WP-0045-T04
+status: done
+priority: high
+state_hub_task_id: "d78b36ea-2a1a-40d6-bd83-03d48ff2ad9b"
+```
+
+Make attribution accurate without relying solely on local path string matching.
+
+Implementation notes:
+
+- Resolve repo attribution by git root fingerprint and remote URL when possible,
+  then fall back to registered host paths.
+- Handle duplicate local paths or alias repos explicitly, especially where one
+  checkout is registered under multiple slugs.
+- Attribute session-level usage to repo first, then optionally to workstreams or
+  tasks when there is strong evidence.
+- Define task allocation rules that do not change measured session totals. For
+  example, produce `allocated` child rows from measured session rows using task
+  completion timestamps, tool-call metadata, or explicit operator input.
+- Record the allocation method and confidence for every task-level allocation.
+- Avoid minting task-level heuristic rows automatically for bulk import, status
+  sync, migration, and consistency tooling.
+
+Done when measured session totals are stable and task/workstream attribution is
+explicitly either measured, allocated, or estimated.
+
+## T05 - Add Reconciliation, Gap Detection, and Backfill Operations
+
+```task
+id: STATE-WP-0045-T05
+status: done
+priority: high
+state_hub_task_id: "efaa2629-4f9a-439c-b0a3-85d77b03580f"
+```
+
+Add an operator-safe reconciliation command that detects flatlines, duplicate
+rows, stale ingestion, and fallback leakage.
+
+Implementation notes:
+
+- Add a command such as `make token-reconcile` or
+  `python scripts/token_reconcile.py --since <date>`.
+- Report sessions found, sessions ingested, sessions stale, duplicate source
+  ids, fallback events, superseded rows, unattributed sessions, and rows missing
+  structured provenance.
+- Default to `--dry-run` behavior and require `--apply` for writes.
+- Include an explicit `--zero-superseded-fallbacks` or equivalent flag rather
+  than silently editing historical rows.
+- Store reconciliation summaries as progress events or report files without
+  including transcript content.
+- Add a canary threshold: alert or fail when measured token volume is zero while
+  task/progress activity exists for the same window.
+
+Done when an operator can run one command to verify token tracking health and
+perform safe, idempotent backfills.
+
+## T06 - Harden Hooks and Runtime Integration
+
+```task
+id: STATE-WP-0045-T06
+status: done
+priority: medium
+state_hub_task_id: "5fd99241-e6dd-4ca6-8c58-a0048f08f0ca"
+```
+
+Make token collection survive hook misses, tool renames, restarts, and multiple
+agent runtimes.
+
+Implementation notes:
+
+- Update Claude hook handling so it can match supported task completion paths,
+  not just one exact MCP tool name.
+- Persist hook high-water marks in a durable State Hub or repo-local location
+  instead of only `/tmp`.
+- Add hook health logging that records when a hook ran, what source id it
+  processed, and whether it patched or skipped a token event.
+- Add a Codex ingestion path that can run on demand and from a schedule without
+  requiring manual script execution.
+- Document required environment variables and path discovery for Windows, WSL,
+  and remote Linux hosts.
+- Ensure failures degrade to visible `estimated` events or health warnings, not
+  silent flatlines.
+
+Done when missing or stale token ingestion becomes visible within one reporting
+window and can be recovered without ad hoc inspection.
+
+## T07 - Upgrade Token APIs and Dashboard Quality Signals
+
+```task
+id: STATE-WP-0045-T07
+status: done
+priority: medium
+state_hub_task_id: "ecaf6ff8-59aa-4c56-8163-125dc96b2068"
+```
+
+Expose token quality, source, and freshness in APIs and dashboard views.
+
+Implementation notes:
+
+- Add API filters for measurement kind, source provider, repo, time range,
+  superseded rows, and unattributed rows.
+- Replace the dashboard's hard dependence on `/token-events/?limit=1000` with
+  paginated or pre-aggregated endpoints that support time windows.
+- Add dashboard controls for measured-only, include allocated, include
+  estimates, and show superseded rows.
+- Show ingestion freshness: last Codex session ingested, last Claude transcript
+  ingested, and last reconciliation run.
+- Add a data-quality section listing fallback events, unattributed measured
+  sessions, duplicate source ids, and days with progress/task activity but zero
+  measured tokens.
+- Update the Token Cost page and docs so operators know which numbers are
+  measured versus inferred.
+
+Done when the dashboard no longer presents fallback, allocated, and measured
+usage as indistinguishable totals.
+
+## T08 - Verification and Migration Playbook
+
+```task
+id: STATE-WP-0045-T08
+status: done
+priority: medium
+state_hub_task_id: "61baff79-832e-45f8-80f3-106abe262096"
+```
+
+Cover the new measurement system with tests and a safe rollout plan.
+
+Implementation notes:
+
+- Add unit tests for the evidence model, source adapters, source-id
+  deduplication, repo attribution, and task allocation.
+- Add router tests for idempotent upsert, source filters, measurement-kind
+  filters, created-at preservation, and backwards-compatible legacy posts.
+- Add reconciliation tests with synthetic pre-May-19 and post-May-19 flatline
+  scenarios.
+- Add dashboard/data-loader tests or fixture checks for quality filters and
+  aggregate counts.
+- Write a migration playbook covering old heuristic rows, existing
+  `backfill:codex-session` rows, and any rows without structured provenance.
+- Verify the full suite and run a dry-run reconciliation before marking this
+  workplan finished.
+
+Done when the improved token measurement path has automated coverage, an
+operator playbook, and a dry-run reconciliation report showing no hidden
+fallback leakage.
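
Note on STATE-WP-0045-T02: the idempotent upsert described there (unique source key, a growing live session updates its row instead of creating a second one) could look roughly like the sketch below. It is a minimal illustration, not the State Hub implementation. The column names come from the candidate fields listed in T02, while the in-memory SQLite store, the index name `uq_token_source`, and the helpers `open_store` and `upsert_token_event` are assumptions made for the example.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table shaped like a reduced token_events (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE token_events (
            id INTEGER PRIMARY KEY,
            measurement_kind TEXT NOT NULL,
            source_provider TEXT,
            source_id TEXT,
            tokens_in INTEGER NOT NULL,
            tokens_out INTEGER NOT NULL
        )
        """
    )
    # One row per (provider, source id, measurement kind). SQLite treats NULLs
    # as distinct in unique indexes, so legacy rows without a source id never
    # collide with each other.
    conn.execute(
        """
        CREATE UNIQUE INDEX uq_token_source
        ON token_events (source_provider, source_id, measurement_kind)
        """
    )
    return conn


def upsert_token_event(conn, provider, source_id, tokens_in, tokens_out,
                       kind="measured"):
    """Insert a source-backed row, or update the existing row for the same source."""
    conn.execute(
        """
        INSERT INTO token_events
            (measurement_kind, source_provider, source_id, tokens_in, tokens_out)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (source_provider, source_id, measurement_kind)
        DO UPDATE SET tokens_in = excluded.tokens_in,
                      tokens_out = excluded.tokens_out
        """,
        (kind, provider, source_id, tokens_in, tokens_out),
    )
    conn.commit()


if __name__ == "__main__":
    conn = open_store()
    # The same live session reported twice: the second call updates the row.
    upsert_token_event(conn, "codex_session", "session-abc", 1000, 200)
    upsert_token_event(conn, "codex_session", "session-abc", 1500, 350)
    print(conn.execute(
        "SELECT COUNT(*), MAX(tokens_in), MAX(tokens_out) FROM token_events"
    ).fetchone())
```

A plain unique index is used here rather than a partial one so that the `ON CONFLICT` target stays simple. A production schema might instead scope the constraint to measured rows only, as T02 suggests.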