---
id: STATE-WP-0045
type: workplan
title: "Token Measurement Accuracy and Resilience"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-05-23"
updated: "2026-05-23"
state_hub_workstream_id: "0aefe379-c182-4471-84dd-c136d5e1206b"
---

# Token Measurement Accuracy and Resilience

## Summary

Make State Hub token tracking accurate enough to trust for daily operations and
robust enough to survive agent/tool changes.

The May 19 flatline showed the current weak spots: token events mixed measured
usage, task-completion fallbacks, and file-sync side effects in the same table;
Claude measurement depended on one hook path; Codex usage lived in local session
logs until a manual backfill; and the dashboard treated every token event as the
same quality of evidence. The immediate fix restored Codex session totals and
suppressed sync-generated fallback events, but the system still needs a durable
measurement model, idempotent source adapters, reconciliation checks, and a
dashboard that exposes provenance and confidence.

## Current Findings

- `token_events` stores counts, associations, free-text notes, and timestamps,
  but not structured provenance such as source system, source event id, parser
  version, raw token categories, confidence, or whether the row is measured,
  allocated, estimated, or superseded.
- `PATCH /tasks/{id}` can still create heuristic token events on a transition to
  `done`. That fallback is useful as a temporary operational signal, but it is
  not a measurement and should not be blended into measured totals.
- `fix-consistency` now suppresses token events while syncing file-backed task
  status, but this is a narrow guard. Other bulk sync, import, and migration
  paths need the same invariant.
- Codex Desktop session logs contain structured `token_count` events with
  `last_token_usage`, `total_token_usage`, cached-input counts, and reasoning
  output counts. The new backfill script can restore these, but it is not yet a
  scheduled or monitored ingestion path.
- Claude Code measurement currently depends on `scripts/task_token_hook.py`
  firing after one MCP tool name. It uses per-session state in `/tmp`, so missed
  hooks, restarts, renamed tools, and non-MCP REST paths can silently degrade to
  fallback events.
- Repository attribution for Codex backfill is path-based. This is good enough
  for the emergency restore, but long-term attribution should prefer registered
  repo fingerprints/remotes and then fall back to paths.
- The Token Cost dashboard currently aggregates all events returned by
  `/token-events/?limit=1000`; it does not show measurement quality, source,
  superseded rows, ingestion freshness, or possible gaps.

## Out of Scope

- Exact billing reconciliation against vendor invoices.
- Capturing private transcript content in State Hub.
- Replacing existing task/workstream/repo relationships.
- Implementing every provider-specific parser in one pass. The first pass should
  cover Codex Desktop and Claude Code, with a documented adapter contract for
  others.

## T01 - Define Token Evidence Model

```task
id: STATE-WP-0045-T01
status: done
priority: high
state_hub_task_id: "29aed6d9-40aa-40fc-9e9a-3eb3e6f985bc"
```

Define a structured model that separates measured usage from allocated,
estimated, and superseded rows.

Implementation notes:

- Add a short design note or ADR section covering token event semantics.
- Define measurement classes such as `measured`, `allocated`, `estimated`, and
  `superseded`.
- Define source classes such as `codex_session`, `claude_transcript`,
  `llm_connect`, `manual`, and `task_fallback`.
- Define structured provenance fields: source system, source id, source path or
  URI, source timestamp, parser version, ingestion timestamp, and confidence.
- Decide how to represent raw token categories: input, cached input, output,
  reasoning output, and provider total.
- Decide whether cached input should be included in default totals or shown as a
  separate metric. Preserve enough fields to support both views.
- Replace free-text note taxonomy as the primary quality signal. Notes can
  remain for human context, but dashboards and APIs should rely on structured
  fields.

Done when the repo has a reviewed token evidence contract and the follow-on
schema/API tasks can implement it without ambiguity.

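A minimal sketch of how the T01 evidence contract might look as Python types. The class names (`MeasurementKind`, `SourceProvider`, `TokenEvidence`), the 0.0 to 1.0 confidence range, and the choice to count `input_tokens` exclusive of cached input are assumptions for illustration, not the reviewed contract.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class MeasurementKind(str, Enum):
    MEASURED = "measured"
    ALLOCATED = "allocated"
    ESTIMATED = "estimated"
    SUPERSEDED = "superseded"


class SourceProvider(str, Enum):
    CODEX_SESSION = "codex_session"
    CLAUDE_TRANSCRIPT = "claude_transcript"
    LLM_CONNECT = "llm_connect"
    MANUAL = "manual"
    TASK_FALLBACK = "task_fallback"


@dataclass(frozen=True)
class TokenEvidence:
    """Hypothetical evidence record: structured provenance plus raw categories."""
    measurement_kind: MeasurementKind
    source_provider: SourceProvider
    source_id: str
    parser_version: str
    source_created_at: datetime
    ingested_at: datetime
    confidence: float  # assumed range: 0.0 (guess) to 1.0 (provider-reported)
    input_tokens: int = 0  # assumed to exclude cached input
    cached_input_tokens: int = 0
    output_tokens: int = 0
    reasoning_output_tokens: int = 0
    raw_total_tokens: Optional[int] = None  # provider total, kept as reported
    source_path: Optional[str] = None

    def total(self, include_cached: bool = True) -> int:
        """Sum the raw categories, with or without cached input."""
        cached = self.cached_input_tokens if include_cached else 0
        return (self.input_tokens + cached
                + self.output_tokens + self.reasoning_output_tokens)
```

Keeping each raw category as its own field is what lets a dashboard offer both the cached-inclusive and cached-exclusive view without re-ingesting sources.
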
## T02 - Add Provenance Schema and Idempotent Upsert API

```task
id: STATE-WP-0045-T02
status: done
priority: high
state_hub_task_id: "ade2bd40-343c-4829-ba4f-44bc8b7cbef9"
```

Extend token storage so source-derived events can be written repeatedly without
duplicates and without losing provenance.

Implementation notes:

- Add migration fields for the evidence model from T01. Candidate fields:
  `measurement_kind`, `source_provider`, `source_id`, `source_path`,
  `source_created_at`, `ingested_at`, `parser_version`, `confidence`,
  `cached_input_tokens`, `reasoning_output_tokens`, `raw_total_tokens`,
  `cost_estimated_usd`, and `raw_metadata`.
- Add a unique constraint or partial unique index that prevents duplicate
  measured source rows. For example: source provider plus source id, scoped by
  measurement kind.
- Provide an upsert endpoint or make `POST /token-events/` support an explicit
  idempotency key. The behavior should update a growing live session rather than
  creating a second row.
- Keep backward compatibility for existing clients that only post
  `tokens_in`/`tokens_out`, but classify those rows explicitly.
- Update schemas, router tests, and migration tests.

Done when source-backed token events can be inserted or updated idempotently and
legacy callers continue to work.

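The upsert behavior described in T02 can be illustrated with the standard-library `sqlite3` module. This is a sketch under stated assumptions: the real State Hub database engine, the reduced column set, the index name `uq_token_events_source`, and the choice to classify legacy rows as `estimated` are all hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE token_events (
    id INTEGER PRIMARY KEY,
    measurement_kind TEXT NOT NULL,
    source_provider TEXT,
    source_id TEXT,
    tokens_in INTEGER NOT NULL DEFAULT 0,
    tokens_out INTEGER NOT NULL DEFAULT 0
);
-- Source provider plus source id, scoped by measurement kind.
-- Rows with a NULL source_id never collide, so legacy rows stay insertable.
CREATE UNIQUE INDEX uq_token_events_source
    ON token_events (measurement_kind, source_provider, source_id);
"""

UPSERT = """
INSERT INTO token_events
    (measurement_kind, source_provider, source_id, tokens_in, tokens_out)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (measurement_kind, source_provider, source_id)
DO UPDATE SET tokens_in = excluded.tokens_in,
              tokens_out = excluded.tokens_out
"""


def upsert_measured(conn, provider, source_id, tokens_in, tokens_out):
    """Insert a measured row, or update it if the source was seen before."""
    conn.execute(UPSERT, ("measured", provider, source_id, tokens_in, tokens_out))


def insert_legacy(conn, tokens_in, tokens_out):
    """Legacy callers send only counts; 'estimated' is an assumed classification."""
    conn.execute(
        "INSERT INTO token_events (measurement_kind, tokens_in, tokens_out) "
        "VALUES ('estimated', ?, ?)",
        (tokens_in, tokens_out),
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # A live session reports twice as it grows; only one row should exist.
    upsert_measured(conn, "codex_session", "sess-1", 100, 20)
    upsert_measured(conn, "codex_session", "sess-1", 250, 60)
    return conn.execute(
        "SELECT source_id, tokens_in, tokens_out FROM token_events"
    ).fetchall()
```

Calling `demo()` leaves a single row for `sess-1` carrying the latest counts, which is the "update a growing live session" behavior rather than a second row.
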
## T03 - Build Reusable Token Source Adapters

```task
id: STATE-WP-0045-T03
status: done
priority: high
state_hub_task_id: "3844fb70-4ceb-4f90-9894-d4845970f0a6"
```

Move source-specific parsing out of one-off scripts and hooks into reusable,
tested adapter modules.

Implementation notes:

- Add an `api/services/token_sources/` package or equivalent service layer.
- Implement a Codex Desktop adapter for `.codex/sessions/**` and
  `.codex/archived_sessions/**`.
- Implement a Claude Code adapter for `.claude/projects/**/*.jsonl` that reads
  usage metadata without storing transcript text.
- Provide a common adapter result type with source id, timestamps, token
  categories, model, agent, cwd/path context, and raw parser metadata.
- Make parsing safe by default: no conversation text in logs, progress events,
  token notes, or API payloads.
- Add fixtures with synthetic Codex and Claude session records that cover live
  sessions, archived sessions, duplicate files, malformed JSONL, resets, and
  missing usage records.
- Keep `scripts/backfill_codex_token_events.py` as a thin CLI over the reusable
  service or replace it with a new unified CLI.

Done when Codex and Claude token sources have deterministic parser tests and a
shared ingestion interface.

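As a rough sketch of the adapter shape from T03, the parser below reads JSONL lines and keeps only cumulative usage counters. The record layout is an assumption: it supposes each usage line looks like `{"type": "token_count", "total_token_usage": {...}}` with the four counter names shown. The real Codex session format may nest or name these differently, and `SessionUsage` and `parse_session` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Iterable

# Assumed counter names inside total_token_usage.
FIELDS = ("input_tokens", "cached_input_tokens",
          "output_tokens", "reasoning_output_tokens")


@dataclass
class SessionUsage:
    source_id: str
    totals: Dict[str, int] = field(default_factory=dict)
    malformed_lines: int = 0
    resets: int = 0


def parse_session(source_id: str, lines: Iterable[str]) -> SessionUsage:
    """Fold cumulative token_count snapshots into one session total."""
    carried = dict.fromkeys(FIELDS, 0)  # usage from segments before a reset
    current = dict.fromkeys(FIELDS, 0)  # latest cumulative snapshot
    result = SessionUsage(source_id)
    for line in lines:
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            result.malformed_lines += 1  # count it, never log the text
            continue
        if not isinstance(record, dict) or record.get("type") != "token_count":
            continue  # skip everything else, including conversation content
        usage = record.get("total_token_usage")
        if not isinstance(usage, dict):
            continue  # missing usage record
        snapshot = {name: int(usage.get(name) or 0) for name in FIELDS}
        if any(snapshot[name] < current[name] for name in FIELDS):
            # A cumulative counter went backwards: treat it as a reset and
            # bank what the previous segment had accumulated.
            for name in FIELDS:
                carried[name] += current[name]
            result.resets += 1
        current = snapshot
    result.totals = {name: carried[name] + current[name] for name in FIELDS}
    return result
```

Because only numeric counters are copied into the result, transcript text cannot leak into token notes or API payloads through this path.
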
## T04 - Improve Repo, Workstream, and Task Attribution

```task
id: STATE-WP-0045-T04
status: done
priority: high
state_hub_task_id: "d78b36ea-2a1a-40d6-bd83-03d48ff2ad9b"
```

Make attribution accurate without relying solely on local path string matching.

Implementation notes:

- Resolve repo attribution by git root fingerprint and remote URL when possible,
  then fall back to registered host paths.
- Handle duplicate local paths or alias repos explicitly, especially where one
  checkout is registered under multiple slugs.
- Attribute session-level usage to repo first, then optionally to workstreams or
  tasks when there is strong evidence.
- Define task allocation rules that do not change measured session totals. For
  example, produce `allocated` child rows from measured session rows using task
  completion timestamps, tool-call metadata, or explicit operator input.
- Record the allocation method and confidence for every task-level allocation.
- Avoid minting task-level heuristic rows automatically for bulk import, status
  sync, migration, and consistency tooling.

Done when measured session totals are stable and task/workstream attribution is
explicitly either measured, allocated, or estimated.

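One way to satisfy the T04 rule that allocation must not change measured session totals is a largest-remainder split: the child rows always sum to the parent exactly. The sketch below assumes integer weights per task (for example, seconds of task activity inside the session). `AllocatedRow`, `allocate`, the `completion_window` method label, and the default confidence are illustrative names and values, not the adopted rules.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AllocatedRow:
    parent_source_id: str
    task_id: str
    tokens: int
    measurement_kind: str
    allocation_method: str
    confidence: float


def allocate(parent_source_id: str, session_tokens: int,
             task_weights: Dict[str, int],
             method: str = "completion_window",
             confidence: float = 0.5) -> List[AllocatedRow]:
    """Split a measured session total across tasks without changing it."""
    total_weight = sum(task_weights.values())
    if session_tokens < 0 or total_weight <= 0:
        return []  # nothing defensible to allocate
    # Integer floor share per task, plus the remainder used for tie-breaking.
    shares = {t: session_tokens * w // total_weight
              for t, w in task_weights.items()}
    remainders = {t: session_tokens * w % total_weight
                  for t, w in task_weights.items()}
    leftover = session_tokens - sum(shares.values())
    # Hand the leftover tokens to the tasks with the largest remainders.
    for task_id in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        shares[task_id] += 1
    return [
        AllocatedRow(parent_source_id, task_id, tokens,
                     "allocated", method, confidence)
        for task_id, tokens in shares.items()
    ]
```

The measured parent row is never edited. Each child row records its own method and confidence, so a later, better allocation can supersede the children without touching the session total.
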
## T05 - Add Reconciliation, Gap Detection, and Backfill Operations

```task
id: STATE-WP-0045-T05
status: done
priority: high
state_hub_task_id: "efaa2629-4f9a-439c-b0a3-85d77b03580f"
```

Add an operator-safe reconciliation command that detects flatlines, duplicate
rows, stale ingestion, and fallback leakage.

Implementation notes:

- Add a command such as `make token-reconcile` or
  `python scripts/token_reconcile.py --since <date>`.
- Report sessions found, sessions ingested, sessions stale, duplicate source
  ids, fallback events, superseded rows, unattributed sessions, and rows missing
  structured provenance.
- Support `--dry-run` by default and `--apply` for writes.
- Include an explicit `--zero-superseded-fallbacks` or equivalent flag rather
  than silently editing historical rows.
- Store reconciliation summaries as progress events or report files without
  including transcript content.
- Add a canary threshold: alert or fail when measured token volume is zero while
  task/progress activity exists for the same window.

Done when an operator can run one command to verify token tracking health and
perform safe, idempotent backfills.

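The canary check and the dry-run-by-default flag handling from T05 might be sketched as follows. The row shape (dicts with `day`, `measurement_kind`, and `tokens` keys) and the function names are assumptions for illustration. Only the `--since`, `--dry-run`, and `--apply` flags come from the notes above.

```python
import argparse
from collections import defaultdict
from typing import Iterable, List, Mapping


def find_flatline_days(token_rows: Iterable[Mapping],
                       activity_days: Iterable[str]) -> List[str]:
    """Return days that had task/progress activity but zero measured tokens."""
    measured = defaultdict(int)
    for row in token_rows:
        # Fallback, allocated, and estimated rows must not mask a flatline.
        if row["measurement_kind"] == "measured":
            measured[row["day"]] += row["tokens"]
    return sorted(day for day in set(activity_days) if measured[day] == 0)


def build_parser() -> argparse.ArgumentParser:
    """CLI skeleton: writes happen only when --apply is passed explicitly."""
    parser = argparse.ArgumentParser(prog="token_reconcile")
    parser.add_argument("--since", required=True, help="start date, e.g. 2026-05-19")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--dry-run", dest="apply", action="store_false",
                      help="report only (default)")
    mode.add_argument("--apply", dest="apply", action="store_true",
                      help="perform writes")
    parser.set_defaults(apply=False)
    return parser
```

A scheduled run could exit non-zero whenever `find_flatline_days` returns anything, which turns a silent flatline into a visible failure.
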
## T06 - Harden Hooks and Runtime Integration

```task
id: STATE-WP-0045-T06
status: done
priority: medium
state_hub_task_id: "5fd99241-e6dd-4ca6-8c58-a0048f08f0ca"
```

Make token collection survive hook misses, tool renames, restarts, and multiple
agent runtimes.

Implementation notes:

- Update Claude hook handling so it can match supported task completion paths,
  not just one exact MCP tool name.
- Persist hook high-water marks in a durable State Hub or repo-local location
  instead of only `/tmp`.
- Add hook health logging that records when a hook ran, what source id it
  processed, and whether it patched or skipped a token event.
- Add a Codex ingestion path that can run on demand and from a schedule without
  requiring manual script execution.
- Document required environment variables and path discovery for Windows, WSL,
  and remote Linux hosts.
- Ensure failures degrade to visible `estimated` events or health warnings, not
  silent flatlines.

Done when missing or stale token ingestion becomes visible within one reporting
window and can be recovered without ad hoc inspection.

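For the durable high-water marks mentioned in T06, a small JSON file written atomically is one possible approach. This sketch assumes a repo-local file chosen by the caller and integer offsets keyed by source id. The file layout and the `load_marks` and `save_mark` helpers are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict


def load_marks(path: Path) -> Dict[str, int]:
    """Read stored marks; a missing or corrupt file degrades to an empty map."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def save_mark(path: Path, source_id: str, offset: int) -> Dict[str, int]:
    """Advance one source's mark and persist the whole map atomically."""
    marks = load_marks(path)
    # Marks only move forward, so a replayed hook cannot rewind progress.
    marks[source_id] = max(offset, marks.get(source_id, 0))
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then swap it into place,
    # so a crash mid-write never leaves a half-written marks file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(marks, handle)
    os.replace(tmp_name, path)
    return marks
```

Unlike per-session state in `/tmp`, a file like this survives restarts, and because marks never decrease, re-running a hook after a missed event is safe.
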
## T07 - Upgrade Token APIs and Dashboard Quality Signals

```task
id: STATE-WP-0045-T07
status: done
priority: medium
state_hub_task_id: "ecaf6ff8-59aa-4c56-8163-125dc96b2068"
```

Expose token quality, source, and freshness in APIs and dashboard views.

Implementation notes:

- Add API filters for measurement kind, source provider, repo, time range,
  superseded rows, and unattributed rows.
- Replace the hard dashboard dependence on `/token-events/?limit=1000` with
  paginated or pre-aggregated endpoints that support time windows.
- Add dashboard controls for measured-only, include allocated, include
  estimates, and show superseded rows.
- Show ingestion freshness: last Codex session ingested, last Claude transcript
  ingested, and last reconciliation run.
- Add a data-quality section listing fallback events, unattributed measured
  sessions, duplicate source ids, and days with progress/task activity but zero
  measured tokens.
- Update the Token Cost page and docs so operators know which numbers are
  measured versus inferred.

Done when the dashboard no longer presents fallback, allocated, and measured
usage as indistinguishable totals.

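The dashboard controls in T07 amount to choosing which measurement kinds contribute to a view. A minimal sketch, assuming rows are dicts with `measurement_kind` and `tokens` keys and that the loader reports each kind separately rather than one blended number (`totals_by_kind` is a hypothetical helper):

```python
from typing import Dict, Iterable, Mapping, Sequence


def totals_by_kind(rows: Iterable[Mapping],
                   kinds: Sequence[str] = ("measured",)) -> Dict[str, int]:
    """Aggregate tokens per selected measurement kind; default is measured-only."""
    totals = {kind: 0 for kind in kinds}
    for row in rows:
        kind = row["measurement_kind"]
        if kind in totals:  # unselected kinds (e.g. superseded) are ignored
            totals[kind] += row["tokens"]
    return totals
```

Keeping the kinds as separate keys avoids double counting: `allocated` rows are derived from `measured` session rows, so adding the two together would overstate usage.
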
## T08 - Verification and Migration Playbook

```task
id: STATE-WP-0045-T08
status: done
priority: medium
state_hub_task_id: "61baff79-832e-45f8-80f3-106abe262096"
```

Cover the new measurement system with tests and a safe rollout plan.

Implementation notes:

- Add unit tests for the evidence model, source adapters, source-id
  deduplication, repo attribution, and task allocation.
- Add router tests for idempotent upsert, source filters, measurement-kind
  filters, created-at preservation, and backwards-compatible legacy posts.
- Add reconciliation tests with synthetic pre-May-19 and post-May-19 flatline
  scenarios.
- Add dashboard/data-loader tests or fixture checks for quality filters and
  aggregate counts.
- Write a migration playbook covering old heuristic rows, existing
  `backfill:codex-session` rows, and any rows without structured provenance.
- Verify the full suite and run a dry-run reconciliation before marking this
  workplan finished.

Done when the improved token measurement path has automated coverage, an
operator playbook, and a dry-run reconciliation report showing no hidden
fallback leakage.