state-hub/workplans/STATE-WP-0045-token-measurement-accuracy.md

---
id: STATE-WP-0045
type: workplan
title: "Token Measurement Accuracy and Resilience"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-05-23"
updated: "2026-05-23"
state_hub_workstream_id: "0aefe379-c182-4471-84dd-c136d5e1206b"
---

# Token Measurement Accuracy and Resilience

## Summary

Make State Hub token tracking accurate enough to trust for daily operations and
robust enough to survive agent/tool changes.

The May 19 flatline showed the current weak spots: token events mixed measured
usage, task-completion fallbacks, and file-sync side effects in the same table;
Claude measurement depended on one hook path; Codex usage lived in local session
logs until a manual backfill; and the dashboard treated every token event as the
same quality of evidence. The immediate fix restored Codex session totals and
suppressed sync-generated fallback events, but the system still needs a durable
measurement model, idempotent source adapters, reconciliation checks, and a
dashboard that exposes provenance and confidence.

## Current Findings

- `token_events` stores counts, associations, free-text notes, and timestamps,
  but not structured provenance such as source system, source event id, parser
  version, raw token categories, confidence, or whether the row is measured,
  allocated, estimated, or superseded.
- `PATCH /tasks/{id}` can still create heuristic token events on a transition to
  `done`. That fallback is useful as a temporary operational signal, but it is
  not a measurement and should not be blended into measured totals.
- `fix-consistency` now suppresses token events while syncing file-backed task
  status, but this is a narrow guard. Other bulk sync, import, and migration
  paths need the same invariant.
- Codex Desktop session logs contain structured `token_count` events with
  `last_token_usage`, `total_token_usage`, cached-input counts, and reasoning
  output counts. The new backfill script can restore these, but it is not yet a
  scheduled or monitored ingestion path.
- Claude Code measurement currently depends on `scripts/task_token_hook.py`
  firing after one MCP tool name. It uses per-session state in `/tmp`, so missed
  hooks, restarts, renamed tools, and non-MCP REST paths can silently degrade to
  fallback events.
- Repository attribution for Codex backfill is path-based. This is good enough
  for the emergency restore, but long-term attribution should prefer registered
  repo fingerprints/remotes and then fall back to paths.
- The Token Cost dashboard currently aggregates all events returned by
  `/token-events/?limit=1000`; it does not show measurement quality, source,
  superseded rows, ingestion freshness, or possible gaps.

## Out of Scope

- Exact billing reconciliation against vendor invoices.
- Capturing private transcript content in State Hub.
- Replacing existing task/workstream/repo relationships.
- Implementing every provider-specific parser in one pass. The first pass should
  cover Codex Desktop and Claude Code, with a documented adapter contract for
  others.

## T01 - Define Token Evidence Model

```task
id: STATE-WP-0045-T01
status: done
priority: high
state_hub_task_id: "29aed6d9-40aa-40fc-9e9a-3eb3e6f985bc"
```

Define a structured model that separates measured usage from allocated,
estimated, and superseded rows.

Implementation notes:

- Add a short design note or ADR section covering token event semantics.
- Define measurement classes such as `measured`, `allocated`, `estimated`, and
  `superseded`.
- Define source classes such as `codex_session`, `claude_transcript`,
  `llm_connect`, `manual`, and `task_fallback`.
- Define structured provenance fields: source system, source id, source path or
  URI, source timestamp, parser version, ingestion timestamp, and confidence.
- Decide how to represent raw token categories: input, cached input, output,
  reasoning output, and provider total.
- Decide whether cached input should be included in default totals or shown as a
  separate metric. Preserve enough fields to support both views.
- Replace free-text note taxonomy as the primary quality signal. Notes can
  remain for human context, but dashboards and APIs should rely on structured
  fields.

Done when the repo has a reviewed token evidence contract and the follow-on
schema/API tasks can implement it without ambiguity.

## T02 - Add Provenance Schema and Idempotent Upsert API

```task
id: STATE-WP-0045-T02
status: done
priority: high
state_hub_task_id: "ade2bd40-343c-4829-ba4f-44bc8b7cbef9"
```

Extend token storage so source-derived events can be written repeatedly without
duplicates and without losing provenance.

Implementation notes:

- Add migration fields for the evidence model from T01. Candidate fields:
  `measurement_kind`, `source_provider`, `source_id`, `source_path`,
  `source_created_at`, `ingested_at`, `parser_version`, `confidence`,
  `cached_input_tokens`, `reasoning_output_tokens`, `raw_total_tokens`,
  `cost_estimated_usd`, and `raw_metadata`.
- Add a unique constraint or partial unique index that prevents duplicate
  measured source rows. For example: source provider plus source id, scoped by
  measurement kind.
- Provide an upsert endpoint or make `POST /token-events/` support an explicit
  idempotency key. The behavior should update a growing live session rather than
  creating a second row.
- Keep backward compatibility for existing clients that only post
  `tokens_in`/`tokens_out`, but classify those rows explicitly.
- Update schemas, router tests, and migration tests.

Done when source-backed token events can be inserted or updated idempotently and
legacy callers continue to work.

## T03 - Build Reusable Token Source Adapters

```task
id: STATE-WP-0045-T03
status: done
priority: high
state_hub_task_id: "3844fb70-4ceb-4f90-9894-d4845970f0a6"
```

Move source-specific parsing out of one-off scripts and hooks into reusable,
tested adapter modules.

Implementation notes:

- Add an `api/services/token_sources/` package or equivalent service layer.
- Implement a Codex Desktop adapter for `.codex/sessions/**` and
  `.codex/archived_sessions/**`.
- Implement a Claude Code adapter for `.claude/projects/**/*.jsonl` that reads
  usage metadata without storing transcript text.
- Provide a common adapter result type with source id, timestamps, token
  categories, model, agent, cwd/path context, and raw parser metadata.
- Make parsing safe by default: no conversation text in logs, progress events,
  token notes, or API payloads.
- Add fixtures with synthetic Codex and Claude session records that cover live
  sessions, archived sessions, duplicate files, malformed JSONL, resets, and
  missing usage records.
- Keep `scripts/backfill_codex_token_events.py` as a thin CLI over the reusable
  service or replace it with a new unified CLI.

Done when Codex and Claude token sources have deterministic parser tests and a
shared ingestion interface.

## T04 - Improve Repo, Workstream, and Task Attribution

```task
id: STATE-WP-0045-T04
status: done
priority: high
state_hub_task_id: "d78b36ea-2a1a-40d6-bd83-03d48ff2ad9b"
```

Make attribution accurate without relying solely on local path string matching.

Implementation notes:

- Resolve repo attribution by git root fingerprint and remote URL when possible,
  then fall back to registered host paths.
- Handle duplicate local paths or alias repos explicitly, especially where one
  checkout is registered under multiple slugs.
- Attribute session-level usage to repo first, then optionally to workstreams or
  tasks when there is strong evidence.
- Define task allocation rules that do not change measured session totals. For
  example, produce `allocated` child rows from measured session rows using task
  completion timestamps, tool-call metadata, or explicit operator input.
- Record the allocation method and confidence for every task-level allocation.
- Avoid minting task-level heuristic rows automatically for bulk import, status
  sync, migration, and consistency tooling.

Done when measured session totals are stable and task/workstream attribution is
explicitly either measured, allocated, or estimated.

## T05 - Add Reconciliation, Gap Detection, and Backfill Operations

```task
id: STATE-WP-0045-T05
status: done
priority: high
state_hub_task_id: "efaa2629-4f9a-439c-b0a3-85d77b03580f"
```

Add an operator-safe reconciliation command that detects flatlines, duplicate
rows, stale ingestion, and fallback leakage.

Implementation notes:

- Add a command such as `make token-reconcile` or
  `python scripts/token_reconcile.py --since <date>`.
- Report sessions found, sessions ingested, sessions stale, duplicate source
  ids, fallback events, superseded rows, unattributed sessions, and rows missing
  structured provenance.
- Support `--dry-run` by default and `--apply` for writes.
- Include an explicit `--zero-superseded-fallbacks` or equivalent flag rather
  than silently editing historical rows.
- Store reconciliation summaries as progress events or report files without
  including transcript content.
- Add a canary threshold: alert or fail when measured token volume is zero while
  task/progress activity exists for the same window.

Done when an operator can run one command to verify token tracking health and
perform safe, idempotent backfills.

## T06 - Harden Hooks and Runtime Integration

```task
id: STATE-WP-0045-T06
status: done
priority: medium
state_hub_task_id: "5fd99241-e6dd-4ca6-8c58-a0048f08f0ca"
```

Make token collection survive hook misses, tool renames, restarts, and multiple
agent runtimes.

Implementation notes:

- Update Claude hook handling so it can match supported task completion paths,
  not just one exact MCP tool name.
- Persist hook high-water marks in a durable State Hub or repo-local location
  instead of only `/tmp`.
- Add hook health logging that records when a hook ran, what source id it
  processed, and whether it patched or skipped a token event.
- Add a Codex ingestion path that can run on demand and from a schedule without
  requiring manual script execution.
- Document required environment variables and path discovery for Windows, WSL,
  and remote Linux hosts.
- Ensure failures degrade to visible `estimated` events or health warnings, not
  silent flatlines.

Done when missing or stale token ingestion becomes visible within one reporting
window and can be recovered without ad hoc inspection.

## T07 - Upgrade Token APIs and Dashboard Quality Signals

```task
id: STATE-WP-0045-T07
status: done
priority: medium
state_hub_task_id: "ecaf6ff8-59aa-4c56-8163-125dc96b2068"
```

Expose token quality, source, and freshness in APIs and dashboard views.

Implementation notes:

- Add API filters for measurement kind, source provider, repo, time range,
  superseded rows, and unattributed rows.
- Replace the hard dashboard dependence on `/token-events/?limit=1000` with
  paginated or pre-aggregated endpoints that support time windows.
- Add dashboard controls for measured-only, include allocated, include
  estimates, and show superseded rows.
- Show ingestion freshness: last Codex session ingested, last Claude transcript
  ingested, and last reconciliation run.
- Add a data-quality section listing fallback events, unattributed measured
  sessions, duplicate source ids, and days with progress/task activity but zero
  measured tokens.
- Update the Token Cost page and docs so operators know which numbers are
  measured versus inferred.

Done when the dashboard no longer presents fallback, allocated, and measured
usage as indistinguishable totals.

## T08 - Verification and Migration Playbook

```task
id: STATE-WP-0045-T08
status: done
priority: medium
state_hub_task_id: "61baff79-832e-45f8-80f3-106abe262096"
```

Cover the new measurement system with tests and a safe rollout plan.

Implementation notes:

- Add unit tests for the evidence model, source adapters, source-id
  deduplication, repo attribution, and task allocation.
- Add router tests for idempotent upsert, source filters, measurement-kind
  filters, created-at preservation, and backwards-compatible legacy posts.
- Add reconciliation tests with synthetic pre-May-19 and post-May-19 flatline
  scenarios.
- Add dashboard/data-loader tests or fixture checks for quality filters and
  aggregate counts.
- Write a migration playbook covering old heuristic rows, existing
  `backfill:codex-session` rows, and any rows without structured provenance.
- Verify the full suite and run a dry-run reconciliation before marking this
  workplan finished.

Done when the improved token measurement path has automated coverage, an
operator playbook, and a dry-run reconciliation report showing no hidden
fallback leakage.