tegwick c12091c2eb Fixed and improved token tracking

2026-05-23 13:59:05 +02:00

id: STATE-WP-0045
type: workplan
title: Token Measurement Accuracy and Resilience
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: 2026-05-23
updated: 2026-05-23
state_hub_workstream_id: 0aefe379-c182-4471-84dd-c136d5e1206b

Token Measurement Accuracy and Resilience

Summary

Make State Hub token tracking accurate enough to trust for daily operations and robust enough to survive agent/tool changes.

The May 19 flatline showed the current weak spots: token events mixed measured usage, task-completion fallbacks, and file-sync side effects in the same table; Claude measurement depended on one hook path; Codex usage lived in local session logs until a manual backfill; and the dashboard treated every token event as the same quality of evidence. The immediate fix restored Codex session totals and suppressed sync-generated fallback events, but the system still needs a durable measurement model, idempotent source adapters, reconciliation checks, and a dashboard that exposes provenance and confidence.

Current Findings

token_events stores counts, associations, free-text notes, and timestamps, but not structured provenance such as source system, source event id, parser version, raw token categories, confidence, or whether the row is measured, allocated, estimated, or superseded.
PATCH /tasks/{id} can still create heuristic token events on a transition to done. That fallback is useful as a temporary operational signal, but it is not a measurement and should not be blended into measured totals.
fix-consistency now suppresses token events while syncing file-backed task status, but this is a narrow guard. Other bulk sync, import, and migration paths need the same invariant.
Codex Desktop session logs contain structured token_count events with last_token_usage, total_token_usage, cached-input counts, and reasoning output counts. The new backfill script can restore these, but it is not yet a scheduled or monitored ingestion path.
Claude Code measurement currently depends on scripts/task_token_hook.py firing after a call to one specific MCP tool name. The hook keeps per-session state in /tmp, so missed hooks, restarts, renamed tools, and non-MCP REST paths can silently degrade to fallback events.
Repository attribution for Codex backfill is path-based. This is good enough for the emergency restore, but long-term attribution should prefer registered repo fingerprints/remotes and then fall back to paths.
The Token Cost dashboard currently aggregates all events returned by /token-events/?limit=1000; it does not show measurement quality, source, superseded rows, ingestion freshness, or possible gaps.

Out of Scope

Exact billing reconciliation against vendor invoices.
Capturing private transcript content in State Hub.
Replacing existing task/workstream/repo relationships.
Implementing every provider-specific parser in one pass. The first pass should cover Codex Desktop and Claude Code, with a documented adapter contract for others.

T01 - Define Token Evidence Model

id: STATE-WP-0045-T01
status: done
priority: high
state_hub_task_id: "29aed6d9-40aa-40fc-9e9a-3eb3e6f985bc"

Define a structured model that separates measured usage from allocated, estimated, and superseded rows.

Implementation notes:

Add a short design note or ADR section covering token event semantics.
Define measurement classes such as measured, allocated, estimated, and superseded.
Define source classes such as codex_session, claude_transcript, llm_connect, manual, and task_fallback.
Define structured provenance fields: source system, source id, source path or URI, source timestamp, parser version, ingestion timestamp, and confidence.
Decide how to represent raw token categories: input, cached input, output, reasoning output, and provider total.
Decide whether cached input should be included in default totals or shown as a separate metric. Preserve enough fields to support both views.
Stop using the free-text note taxonomy as the primary quality signal. Notes can remain for human context, but dashboards and APIs should rely on structured fields.

Done when the repo has a reviewed token evidence contract and the follow-on schema/API tasks can implement it without ambiguity.
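
One possible shape for the evidence contract is sketched below. The class and field names mirror the measurement classes, source classes, and provenance fields listed above, but the type itself (`TokenEvidence`) and its `total()` helper are hypothetical. The sketch also assumes the four token categories are stored as disjoint counts, so an adapter would have to normalise providers that report cached input as a subset of input.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class MeasurementKind(str, Enum):
    MEASURED = "measured"
    ALLOCATED = "allocated"
    ESTIMATED = "estimated"
    SUPERSEDED = "superseded"


class SourceProvider(str, Enum):
    CODEX_SESSION = "codex_session"
    CLAUDE_TRANSCRIPT = "claude_transcript"
    LLM_CONNECT = "llm_connect"
    MANUAL = "manual"
    TASK_FALLBACK = "task_fallback"


@dataclass(frozen=True)
class TokenEvidence:
    """Hypothetical evidence row; not the actual State Hub schema."""

    measurement_kind: MeasurementKind
    source_provider: SourceProvider
    source_id: str
    # Assumption: the four categories below are disjoint counts.
    input_tokens: int = 0
    cached_input_tokens: int = 0
    output_tokens: int = 0
    reasoning_output_tokens: int = 0
    provider_total_tokens: Optional[int] = None  # as reported, unmodified
    source_path: Optional[str] = None
    source_created_at: Optional[datetime] = None
    parser_version: str = "0"
    confidence: float = 1.0
    ingested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def total(self, include_cached: bool = True) -> int:
        """Sum the categories, optionally leaving cached input out."""
        base = (
            self.input_tokens
            + self.output_tokens
            + self.reasoning_output_tokens
        )
        return base + (self.cached_input_tokens if include_cached else 0)
```

Keeping cached input as its own field lets the same row serve both views: `total()` for the inclusive figure and `total(include_cached=False)` for the exclusive one.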

T02 - Add Provenance Schema and Idempotent Upsert API

id: STATE-WP-0045-T02
status: done
priority: high
state_hub_task_id: "ade2bd40-343c-4829-ba4f-44bc8b7cbef9"

Extend token storage so source-derived events can be written repeatedly without duplicates and without losing provenance.

Implementation notes:

Add migration fields for the evidence model from T01. Candidate fields: measurement_kind, source_provider, source_id, source_path, source_created_at, ingested_at, parser_version, confidence, cached_input_tokens, reasoning_output_tokens, raw_total_tokens, cost_estimated_usd, and raw_metadata.
Add a unique constraint or partial unique index that prevents duplicate measured source rows. For example: source provider plus source id, scoped by measurement kind.
Provide an upsert endpoint or make POST /token-events/ support an explicit idempotency key. The behavior should update a growing live session rather than creating a second row.
Keep backward compatibility for existing clients that only post tokens_in/tokens_out, but classify those rows explicitly.
Update schemas, router tests, and migration tests.

Done when source-backed token events can be inserted or updated idempotently and legacy callers continue to work.
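
The idempotency requirement can be illustrated with a unique index and an upsert. This is a minimal sketch using the standard-library `sqlite3` module and a cut-down table; the real migration, column set, and database engine may differ. The helper names (`upsert_event`, `legacy_post`) and the choice to classify legacy posts as `estimated`/`manual` are assumptions made for illustration.

```python
import sqlite3

# Cut-down, hypothetical schema: only the columns needed for the demo.
SCHEMA = """
CREATE TABLE token_events (
    id INTEGER PRIMARY KEY,
    measurement_kind TEXT NOT NULL,
    source_provider TEXT NOT NULL,
    source_id TEXT,
    tokens_in INTEGER NOT NULL,
    tokens_out INTEGER NOT NULL
);
CREATE UNIQUE INDEX uq_token_source
    ON token_events (source_provider, source_id, measurement_kind);
"""

# On a repeated (provider, source id, kind) the existing row is updated,
# so a growing live session never produces a second row.
UPSERT = """
INSERT INTO token_events
    (measurement_kind, source_provider, source_id, tokens_in, tokens_out)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (source_provider, source_id, measurement_kind)
DO UPDATE SET tokens_in = excluded.tokens_in,
              tokens_out = excluded.tokens_out
"""


def upsert_event(conn, kind, provider, source_id, tokens_in, tokens_out):
    """Insert or update one source-backed token event."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, (kind, provider, source_id, tokens_in, tokens_out))


def legacy_post(conn, tokens_in, tokens_out):
    """Legacy callers send only counts; classify the row explicitly.

    Assumption: such rows are stored as estimated/manual with no source id.
    SQLite treats NULLs as distinct in a unique index, so these rows are
    never merged with each other.
    """
    upsert_event(conn, "estimated", "manual", None, tokens_in, tokens_out)
```

Posting the same session twice with a larger running total leaves one row holding the latest counts, while two legacy posts remain two separate rows.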

T03 - Build Reusable Token Source Adapters

id: STATE-WP-0045-T03
status: done
priority: high
state_hub_task_id: "3844fb70-4ceb-4f90-9894-d4845970f0a6"

Move source-specific parsing out of one-off scripts and hooks into reusable, tested adapter modules.

Implementation notes:

Add an api/services/token_sources/ package or equivalent service layer.
Implement a Codex Desktop adapter for .codex/sessions/** and .codex/archived_sessions/**.
Implement a Claude Code adapter for .claude/projects/**/*.jsonl that reads usage metadata without storing transcript text.
Provide a common adapter result type with source id, timestamps, token categories, model, agent, cwd/path context, and raw parser metadata.
Make parsing safe by default: no conversation text in logs, progress events, token notes, or API payloads.
Add fixtures with synthetic Codex and Claude session records that cover live sessions, archived sessions, duplicate files, malformed JSONL, resets, and missing usage records.
Keep scripts/backfill_codex_token_events.py as a thin CLI over the reusable service or replace it with a new unified CLI.

Done when Codex and Claude token sources have deterministic parser tests and a shared ingestion interface.
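
A parser along these lines might look like the sketch below. The record shape is an assumption: it treats each JSONL line as an object with a `type` of `token_count` and a cumulative `total_token_usage` mapping, and the key names inside that mapping are hypothetical. Real session logs should be checked against fixtures before relying on any of these names. Reset handling is left out; the sketch simply keeps the last cumulative total it sees.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class AdapterResult:
    """Hypothetical common result type; holds counts only, never text."""

    source_id: str
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int
    reasoning_output_tokens: int
    malformed_lines: int


def parse_session_lines(
    source_id: str, lines: Iterable[str]
) -> Optional[AdapterResult]:
    """Return the last cumulative usage found, or None if there is none."""
    latest = None
    malformed = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            malformed += 1  # count it, but never log the line content
            continue
        if not isinstance(record, dict) or record.get("type") != "token_count":
            continue  # skip everything that is not a usage record
        usage = record.get("total_token_usage")
        if isinstance(usage, dict):
            latest = usage
    if latest is None:
        return None

    def count(key: str) -> int:
        value = latest.get(key, 0)
        return value if isinstance(value, int) and value >= 0 else 0

    return AdapterResult(
        source_id=source_id,
        input_tokens=count("input_tokens"),
        cached_input_tokens=count("cached_input_tokens"),
        output_tokens=count("output_tokens"),
        reasoning_output_tokens=count("reasoning_output_tokens"),
        malformed_lines=malformed,
    )
```

Because the function takes an iterable of lines rather than a path, the same code can be driven by synthetic fixtures in tests and by real files in the CLI.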

T04 - Improve Repo, Workstream, and Task Attribution

id: STATE-WP-0045-T04
status: done
priority: high
state_hub_task_id: "d78b36ea-2a1a-40d6-bd83-03d48ff2ad9b"

Make attribution accurate without relying solely on local path string matching.

Implementation notes:

Resolve repo attribution by git root fingerprint and remote URL when possible, then fall back to registered host paths.
Handle duplicate local paths or alias repos explicitly, especially where one checkout is registered under multiple slugs.
Attribute session-level usage to repo first, then optionally to workstreams or tasks when there is strong evidence.
Define task allocation rules that do not change measured session totals. For example, produce allocated child rows from measured session rows using task completion timestamps, tool-call metadata, or explicit operator input.
Record the allocation method and confidence for every task-level allocation.
Avoid minting task-level heuristic rows automatically for bulk import, status sync, migration, and consistency tooling.

Done when measured session totals are stable and task/workstream attribution is explicitly either measured, allocated, or estimated.
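
To show how allocated child rows can be derived without disturbing the measured total, here is a small sketch using largest-remainder apportionment. The function name and the idea of integer weights (for example, seconds of task activity inside the session) are assumptions; the workplan does not prescribe a specific allocation method.

```python
from typing import Dict


def allocate_tokens(session_total: int, weights: Dict[str, int]) -> Dict[str, int]:
    """Split a measured session total across tasks by integer weight.

    The shares always sum to session_total, so the measured row stays the
    single source of truth and the children are purely derived.
    """
    if session_total < 0:
        raise ValueError("session_total must be non-negative")
    positive = {task: w for task, w in weights.items() if w > 0}
    if not positive:
        return {}  # no evidence: leave the session unallocated
    weight_sum = sum(positive.values())
    # Floor shares first, using integer arithmetic to avoid rounding drift.
    shares = {
        task: session_total * w // weight_sum for task, w in positive.items()
    }
    leftover = session_total - sum(shares.values())
    # Hand leftover tokens to the largest remainders; ties break by task id
    # so the result is deterministic.
    order = sorted(
        positive,
        key=lambda task: (-((session_total * positive[task]) % weight_sum), task),
    )
    for task in order[:leftover]:
        shares[task] += 1
    return shares
```

An empty result signals that there was no usable evidence, in which case the session would stay attributed at repo level only.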

T05 - Add Reconciliation, Gap Detection, and Backfill Operations

id: STATE-WP-0045-T05
status: done
priority: high
state_hub_task_id: "efaa2629-4f9a-439c-b0a3-85d77b03580f"

Add an operator-safe reconciliation command that detects flatlines, duplicate rows, stale ingestion, and fallback leakage.

Implementation notes:

Add a command such as make token-reconcile or python scripts/token_reconcile.py --since <date>.
Report sessions found, sessions ingested, sessions stale, duplicate source ids, fallback events, superseded rows, unattributed sessions, and rows missing structured provenance.
Default to dry-run behavior (with an explicit --dry-run flag accepted) and require --apply for any writes.
Include an explicit --zero-superseded-fallbacks or equivalent flag rather than silently editing historical rows.
Store reconciliation summaries as progress events or report files without including transcript content.
Add a canary threshold: alert or fail when measured token volume is zero while task/progress activity exists for the same window.

Done when an operator can run one command to verify token tracking health and perform safe, idempotent backfills.
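
The dry-run default and the canary check could be wired up roughly as follows. This is a sketch only: `build_parser` and `find_flatline_days` are hypothetical names, and the per-day dictionaries stand in for whatever queries the real command would run.

```python
import argparse
from datetime import date
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """CLI skeleton: read-only unless --apply is given."""
    parser = argparse.ArgumentParser(prog="token_reconcile")
    parser.add_argument(
        "--since", required=True, type=date.fromisoformat,
        help="start of the reconciliation window (YYYY-MM-DD)",
    )
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument(
        "--dry-run", dest="apply", action="store_false", default=False,
        help="report only (default)",
    )
    mode.add_argument(
        "--apply", dest="apply", action="store_true", default=False,
        help="perform writes",
    )
    return parser


def find_flatline_days(
    measured_tokens_by_day: Dict[str, int],
    activity_count_by_day: Dict[str, int],
) -> List[str]:
    """Days with task/progress activity but zero measured tokens."""
    return sorted(
        day
        for day, activity in activity_count_by_day.items()
        if activity > 0 and measured_tokens_by_day.get(day, 0) == 0
    )


def exit_code(flatline_days: List[str]) -> int:
    """Non-zero exit lets a scheduler or CI job treat a flatline as a failure."""
    return 1 if flatline_days else 0
```

Making the two flags mutually exclusive keeps an operator from passing both by accident, and a non-zero exit code gives the canary something a scheduler can act on.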

T06 - Harden Hooks and Runtime Integration

id: STATE-WP-0045-T06
status: done
priority: medium
state_hub_task_id: "5fd99241-e6dd-4ca6-8c58-a0048f08f0ca"

Make token collection survive hook misses, tool renames, restarts, and multiple agent runtimes.

Implementation notes:

Update Claude hook handling so it can match supported task completion paths, not just one exact MCP tool name.
Persist hook high-water marks in a durable State Hub or repo-local location instead of only /tmp.
Add hook health logging that records when a hook ran, what source id it processed, and whether it patched or skipped a token event.
Add a Codex ingestion path that can run on demand and from a schedule without requiring manual script execution.
Document required environment variables and path discovery for Windows, WSL, and remote Linux hosts.
Ensure failures degrade to visible estimated events or health warnings, not silent flatlines.

Done when missing or stale token ingestion becomes visible within one reporting window and can be recovered without ad hoc inspection.
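
A durable high-water mark can be as simple as a JSON file that is replaced atomically. The sketch below assumes a per-source cumulative token count is the mark, and the file location is left to the caller; both are illustrative choices rather than the hook's actual design.

```python
import json
import os
import tempfile
from pathlib import Path


def load_marks(path: Path) -> dict:
    """Read the marks file; a missing or corrupt file yields no marks."""
    try:
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def advance_mark(path: Path, source_id: str, total_tokens: int) -> bool:
    """Record a new high-water mark; return False if nothing advanced."""
    marks = load_marks(path)
    if total_tokens <= marks.get(source_id, 0):
        return False  # already processed: the hook can skip this event
    marks[source_id] = total_tokens
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the
    # target so a crash never leaves a half-written marks file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(marks, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, path)
    return True
```

The boolean return value maps directly onto the "patched or skipped" outcome that hook health logging is meant to record.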

T07 - Upgrade Token APIs and Dashboard Quality Signals

id: STATE-WP-0045-T07
status: done
priority: medium
state_hub_task_id: "ecaf6ff8-59aa-4c56-8163-125dc96b2068"

Expose token quality, source, and freshness in APIs and dashboard views.

Implementation notes:

Add API filters for measurement kind, source provider, repo, time range, superseded rows, and unattributed rows.
Replace the dashboard's hard dependence on /token-events/?limit=1000 with paginated or pre-aggregated endpoints that support time windows.
Add dashboard controls for measured-only, include allocated, include estimates, and show superseded rows.
Show ingestion freshness: last Codex session ingested, last Claude transcript ingested, and last reconciliation run.
Add a data-quality section listing fallback events, unattributed measured sessions, duplicate source ids, and days with progress/task activity but zero measured tokens.
Update the Token Cost page and docs so operators know which numbers are measured versus inferred.

Done when the dashboard no longer presents fallback, allocated, and measured usage as indistinguishable totals.
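
As an illustration of keeping the classes apart in a data loader, the sketch below totals events per measurement kind and only for the kinds the operator selected. The function name and the `unclassified` bucket for rows without structured provenance are assumptions. It deliberately does not add kinds together, since whether allocated rows are additive to measured rows depends on the allocation design.

```python
from typing import Dict, Iterable, Mapping


def breakdown(
    events: Iterable[Mapping[str, object]],
    kinds: Iterable[str] = ("measured",),
) -> Dict[str, int]:
    """Total tokens per selected measurement kind.

    Rows with no measurement_kind land in a hypothetical "unclassified"
    bucket instead of being silently merged into measured usage.
    """
    totals = {kind: 0 for kind in kinds}
    for event in events:
        kind = event.get("measurement_kind") or "unclassified"
        if kind not in totals:
            continue  # not selected, e.g. superseded rows by default
        tokens_in = event.get("tokens_in") or 0
        tokens_out = event.get("tokens_out") or 0
        totals[kind] += int(tokens_in) + int(tokens_out)
    return totals
```

With the default arguments the result covers measured usage only, which matches a measured-only dashboard toggle; widening `kinds` corresponds to switching on the other controls.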

T08 - Verification and Migration Playbook

id: STATE-WP-0045-T08
status: done
priority: medium
state_hub_task_id: "61baff79-832e-45f8-80f3-106abe262096"

Cover the new measurement system with tests and a safe rollout plan.

Implementation notes:

Add unit tests for the evidence model, source adapters, source-id deduplication, repo attribution, and task allocation.
Add router tests for idempotent upsert, source filters, measurement-kind filters, created-at preservation, and backwards-compatible legacy posts.
Add reconciliation tests with synthetic pre-May-19 and post-May-19 flatline scenarios.
Add dashboard/data-loader tests or fixture checks for quality filters and aggregate counts.
Write a migration playbook covering old heuristic rows, existing backfill:codex-session rows, and any rows without structured provenance.
Verify the full suite and run a dry-run reconciliation before marking this workplan finished.

Done when the improved token measurement path has automated coverage, an operator playbook, and a dry-run reconciliation report showing no hidden fallback leakage.
