Fixed and improved token tracking

2026-05-23 13:59:05 +02:00
parent dd3279ea1a
commit c12091c2eb
29 changed files with 3549 additions and 278 deletions

@@ -4,12 +4,12 @@ type: workplan
title: "Multi-User Onboarding and Environment Bootstrap"
domain: custodian
repo: state-hub
status: active
status: finished
owner: custodian
topic_slug: custodian
state_hub_workstream_id: "a28d9e29-4119-4b73-9469-f921920253ef"
created: "2026-03-11"
updated: "2026-05-17"
updated: "2026-05-23"
---
# Multi-User Onboarding and Environment Bootstrap
@@ -51,7 +51,7 @@ Two personas:
```task
id: CUST-WP-0012-T01
state_hub_task_id: 71628269-9a75-4dae-a347-e64a86040322
status: todo
status: done
priority: medium
```
@@ -79,6 +79,12 @@ git config --global credential.helper 'cache --timeout=3600'
**Done when:** included in bootstrap script; push to Gitea works without
re-entering credentials on second attempt.
**Implemented 2026-05-23:** `scripts/bootstrap-env.sh` configures a global
credential helper when one is not already present. It prefers `libsecret`, uses
`cache --timeout=3600` as the safe automatic fallback, and supports explicit
headless plaintext storage via `--git-helper store --allow-plaintext-store`.
`docs/onboarding.md` documents the tradeoffs.
---
### T02 — SSH key generation and authorization automation
@@ -86,7 +92,7 @@ re-entering credentials on second attempt.
```task
id: CUST-WP-0012-T02
state_hub_task_id: fea965e9-8a8f-439c-9096-8f7756eb71ed
status: todo
status: done
priority: medium
```
@@ -110,6 +116,11 @@ ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.130.254
**Done when:** included in bootstrap script; documented in onboarding guide.
**Implemented 2026-05-23:** `scripts/bootstrap-env.sh` generates
`~/.ssh/id_ed25519` if missing, prints the public key, and can run
`ssh-copy-id` for Railiance01 and CoulombCore with `--authorize-ssh`.
`docs/onboarding.md` documents the operator and collaborator path.
---
### T03 — Claude Code MCP registration automation
@@ -117,7 +128,7 @@ ssh-copy-id -i ~/.ssh/id_ed25519.pub tegwick@92.205.130.254
```task
id: CUST-WP-0012-T03
state_hub_task_id: 60318e9a-972e-45c8-afde-82ed0625f594
status: todo
status: done
priority: medium
```
@@ -132,10 +143,10 @@ make register-mcp # idempotent; safe to re-run
The script should:
1. Detect whether `state-hub` is already in `~/.claude.json`
2. Extract the server config from `.mcp.json`
2. Use the current SSE MCP config (`http://127.0.0.1:8001/sse` locally or
`http://127.0.0.1:18001/sse` through ops-bridge)
3. Run `claude mcp add-json -s user state-hub <config>`
4. Run `patch_mcp_cwd.py` to restore the cwd field
5. Print instructions to restart Claude Code
4. Print instructions to restart Claude Code
Should also detect whether the state hub is reachable directly
(`http://127.0.0.1:8000`) or needs a tunnel (via ops-bridge), and emit
@@ -144,6 +155,12 @@ a warning if neither is available.
**Done when:** `make register-mcp` works on a clean machine; documented
in onboarding guide.
**Implemented 2026-05-23:** `scripts/register-mcp.sh` and the
`make register-mcp` target register the current SSE MCP transport
idempotently. The script detects local/tunnel reachability, supports
`MCP_URL`, `API_BASE`, and `DRY_RUN=1`, and documents the old `.mcp.json` cwd
patch path as legacy.
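
The local-versus-tunnel detection described above can be sketched as follows. This is a minimal illustration in Python, not the contents of `scripts/register-mcp.sh`; the function names `port_open` and `pick_mcp_url` are hypothetical, and only the two SSE URLs come from the workplan text.

```python
import socket
from typing import Callable, Optional

# URLs taken from the workplan; everything else here is illustrative.
LOCAL_URL = "http://127.0.0.1:8001/sse"
TUNNEL_URL = "http://127.0.0.1:18001/sse"


def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_mcp_url(probe: Callable[[str, int], bool] = port_open) -> Optional[str]:
    """Prefer the local SSE endpoint, then the tunnel, else None."""
    if probe("127.0.0.1", 8001):
        return LOCAL_URL
    if probe("127.0.0.1", 18001):
        return TUNNEL_URL
    return None  # caller should emit the "neither is available" warning
```

Passing the probe as a parameter keeps the selection logic testable without a running State Hub.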
---
### T04 — Environment bootstrap script
@@ -151,7 +168,7 @@ in onboarding guide.
```task
id: CUST-WP-0012-T04
state_hub_task_id: 84a94761-e424-4470-a9a2-64d9cabadb7f
status: todo
status: done
priority: high
```
@@ -176,6 +193,11 @@ Design constraints:
**Done when:** running the script on a clean Ubuntu 24.04 machine
produces a working Custodian environment with no additional manual steps.
**Implemented 2026-05-23:** `scripts/bootstrap-env.sh` and
`make bootstrap-env` provide the idempotent entrypoint. It supports dry-run,
non-interactive mode, optional apt package installation, SSH authorization,
Gitea token prompting, MCP registration, and State Hub health checks.
---
### T05 — Onboarding guide and user journey documentation
@@ -183,7 +205,7 @@ produces a working Custodian environment with no additional manual steps.
```task
id: CUST-WP-0012-T05
state_hub_task_id: b0839802-659a-475b-8b84-ab7341ea3d15
status: todo
status: done
priority: medium
```
@@ -208,6 +230,10 @@ for both personas:
**Done when:** a new collaborator can follow the guide without
clarification from the primary operator.
**Implemented 2026-05-23:** `docs/onboarding.md` covers primary operator and
domain collaborator journeys, including SSH, Gitea token file, credential
helper choices, MCP registration, tunnel setup, and verification checks.
---
### T06 — State Hub multi-user model (deferred)
@@ -215,7 +241,7 @@ clarification from the primary operator.
```task
id: CUST-WP-0012-T06
state_hub_task_id: d5df3302-67b9-4765-a8d8-ea2df53dff6e
status: todo
status: done
priority: low
```
@@ -235,6 +261,11 @@ domain) or rely on Gitea repo permissions as the authoritative boundary
Implement T01 through T05 first; multi-user access control is only needed when
there is more than one user.
**Implemented 2026-05-23:** `docs/multi-user-access-model.md` records the
current decision: repo permissions, SSH access, tunnels, and OpenBao remain the
authoritative boundaries for this phase; State Hub API auth is deferred until a
real second-user or exposed-deployment trigger exists.
---
## References

@@ -0,0 +1,310 @@
---
id: STATE-WP-0045
type: workplan
title: "Token Measurement Accuracy and Resilience"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-05-23"
updated: "2026-05-23"
state_hub_workstream_id: "0aefe379-c182-4471-84dd-c136d5e1206b"
---
# Token Measurement Accuracy and Resilience
## Summary
Make State Hub token tracking accurate enough to trust for daily operations and
robust enough to survive agent/tool changes.
The May 19 flatline showed the current weak spots: token events mixed measured
usage, task-completion fallbacks, and file-sync side effects in the same table;
Claude measurement depended on one hook path; Codex usage lived in local session
logs until a manual backfill; and the dashboard treated every token event as the
same quality of evidence. The immediate fix restored Codex session totals and
suppressed sync-generated fallback events, but the system still needs a durable
measurement model, idempotent source adapters, reconciliation checks, and a
dashboard that exposes provenance and confidence.
## Current Findings
- `token_events` stores counts, associations, free-text notes, and timestamps,
but not structured provenance such as source system, source event id, parser
version, raw token categories, confidence, or whether the row is measured,
allocated, estimated, or superseded.
- `PATCH /tasks/{id}` can still create heuristic token events on a transition to
`done`. That fallback is useful as a temporary operational signal, but it is
not a measurement and should not be blended into measured totals.
- `fix-consistency` now suppresses token events while syncing file-backed task
status, but this is a narrow guard. Other bulk sync, import, and migration
paths need the same invariant.
- Codex Desktop session logs contain structured `token_count` events with
`last_token_usage`, `total_token_usage`, cached-input counts, and reasoning
output counts. The new backfill script can restore these, but it is not yet a
scheduled or monitored ingestion path.
- Claude Code measurement currently depends on `scripts/task_token_hook.py`
firing after one MCP tool name. It uses per-session state in `/tmp`, so missed
hooks, restarts, renamed tools, and non-MCP REST paths can silently degrade to
fallback events.
- Repository attribution for Codex backfill is path-based. This is good enough
for the emergency restore, but long-term attribution should prefer registered
repo fingerprints/remotes and then fall back to paths.
- The Token Cost dashboard currently aggregates all events returned by
`/token-events/?limit=1000`; it does not show measurement quality, source,
superseded rows, ingestion freshness, or possible gaps.
## Out of Scope
- Exact billing reconciliation against vendor invoices.
- Capturing private transcript content in State Hub.
- Replacing existing task/workstream/repo relationships.
- Implementing every provider-specific parser in one pass. The first pass should
cover Codex Desktop and Claude Code, with a documented adapter contract for
others.
## T01 - Define Token Evidence Model
```task
id: STATE-WP-0045-T01
status: done
priority: high
state_hub_task_id: "29aed6d9-40aa-40fc-9e9a-3eb3e6f985bc"
```
Define a structured model that separates measured usage from allocated,
estimated, and superseded rows.
Implementation notes:
- Add a short design note or ADR section covering token event semantics.
- Define measurement classes such as `measured`, `allocated`, `estimated`, and
`superseded`.
- Define source classes such as `codex_session`, `claude_transcript`,
`llm_connect`, `manual`, and `task_fallback`.
- Define structured provenance fields: source system, source id, source path or
URI, source timestamp, parser version, ingestion timestamp, and confidence.
- Decide how to represent raw token categories: input, cached input, output,
reasoning output, and provider total.
- Decide whether cached input should be included in default totals or shown as a
separate metric. Preserve enough fields to support both views.
- Replace free-text note taxonomy as the primary quality signal. Notes can
remain for human context, but dashboards and APIs should rely on structured
fields.
Done when the repo has a reviewed token evidence contract and the follow-on
schema/API tasks can implement it without ambiguity.
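
One possible shape for the evidence contract is sketched below. The class and field names are illustrative, not the reviewed contract itself. The sketch assumes the token categories are stored as disjoint counts, which may not match how a given provider reports them, so a real adapter would need to normalize first.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MeasurementKind(str, Enum):
    MEASURED = "measured"
    ALLOCATED = "allocated"
    ESTIMATED = "estimated"
    SUPERSEDED = "superseded"


@dataclass(frozen=True)
class TokenEvidence:
    """Hypothetical record combining provenance and raw token categories."""

    measurement_kind: MeasurementKind
    source_provider: str  # e.g. "codex_session" or "task_fallback"
    source_id: Optional[str]
    input_tokens: int = 0
    cached_input_tokens: int = 0  # assumed disjoint from input_tokens
    output_tokens: int = 0
    reasoning_output_tokens: int = 0  # assumed disjoint from output_tokens
    confidence: float = 1.0

    def total(self, include_cached: bool = True) -> int:
        """Sum the categories, optionally leaving cached input out."""
        cached = self.cached_input_tokens if include_cached else 0
        return (
            self.input_tokens
            + cached
            + self.output_tokens
            + self.reasoning_output_tokens
        )
```

Keeping cached input as its own field lets a dashboard offer both views without a schema change.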
## T02 - Add Provenance Schema and Idempotent Upsert API
```task
id: STATE-WP-0045-T02
status: done
priority: high
state_hub_task_id: "ade2bd40-343c-4829-ba4f-44bc8b7cbef9"
```
Extend token storage so source-derived events can be written repeatedly without
duplicates and without losing provenance.
Implementation notes:
- Add migration fields for the evidence model from T01. Candidate fields:
`measurement_kind`, `source_provider`, `source_id`, `source_path`,
`source_created_at`, `ingested_at`, `parser_version`, `confidence`,
`cached_input_tokens`, `reasoning_output_tokens`, `raw_total_tokens`,
`cost_estimated_usd`, and `raw_metadata`.
- Add a unique constraint or partial unique index that prevents duplicate
measured source rows. For example: source provider plus source id, scoped by
measurement kind.
- Provide an upsert endpoint or make `POST /token-events/` support an explicit
idempotency key. The behavior should update a growing live session rather than
creating a second row.
- Keep backward compatibility for existing clients that only post
`tokens_in`/`tokens_out`, but classify those rows explicitly.
- Update schemas, router tests, and migration tests.
Done when source-backed token events can be inserted or updated idempotently and
legacy callers continue to work.
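
The idempotent write can be illustrated with an in-memory SQLite table standing in for the real store. This is a sketch under assumptions: the reduced column set, the index name, and the `upsert` helper are hypothetical, and the production database and migration may differ. It relies on unique indexes treating NULL values as distinct, so legacy rows without a source id never collide.

```python
import sqlite3

SCHEMA = """
CREATE TABLE token_events (
    id INTEGER PRIMARY KEY,
    measurement_kind TEXT NOT NULL,
    source_provider TEXT,
    source_id TEXT,
    tokens_in INTEGER NOT NULL,
    tokens_out INTEGER NOT NULL
);
-- NULL source ids are distinct in a unique index, so legacy rows still insert.
CREATE UNIQUE INDEX uq_token_source
    ON token_events (source_provider, source_id, measurement_kind);
"""

UPSERT = """
INSERT INTO token_events
    (measurement_kind, source_provider, source_id, tokens_in, tokens_out)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (source_provider, source_id, measurement_kind)
DO UPDATE SET tokens_in = excluded.tokens_in,
              tokens_out = excluded.tokens_out
"""


def upsert(conn, kind, provider, source_id, tokens_in, tokens_out):
    """Insert a row, or update the existing row for the same source."""
    conn.execute(UPSERT, (kind, provider, source_id, tokens_in, tokens_out))
    conn.commit()
```

Re-posting a growing live session with the same source id updates the single existing row instead of adding a second one.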
## T03 - Build Reusable Token Source Adapters
```task
id: STATE-WP-0045-T03
status: done
priority: high
state_hub_task_id: "3844fb70-4ceb-4f90-9894-d4845970f0a6"
```
Move source-specific parsing out of one-off scripts and hooks into reusable,
tested adapter modules.
Implementation notes:
- Add an `api/services/token_sources/` package or equivalent service layer.
- Implement a Codex Desktop adapter for `.codex/sessions/**` and
`.codex/archived_sessions/**`.
- Implement a Claude Code adapter for `.claude/projects/**/*.jsonl` that reads
usage metadata without storing transcript text.
- Provide a common adapter result type with source id, timestamps, token
categories, model, agent, cwd/path context, and raw parser metadata.
- Make parsing safe by default: no conversation text in logs, progress events,
token notes, or API payloads.
- Add fixtures with synthetic Codex and Claude session records that cover live
sessions, archived sessions, duplicate files, malformed JSONL, resets, and
missing usage records.
- Keep `scripts/backfill_codex_token_events.py` as a thin CLI over the reusable
service or replace it with a new unified CLI.
Done when Codex and Claude token sources have deterministic parser tests and a
shared ingestion interface.
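
A parser in this style might look like the sketch below. The record layout is an assumption: a top-level `type` field and the key names inside `total_token_usage` are hypothetical stand-ins, and real session logs may nest these differently. The sketch keeps the last usage record it sees and does not handle counter resets.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SessionUsage:
    source_id: str
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int
    reasoning_output_tokens: int
    malformed_lines: int


def parse_session(source_id: str, lines: Iterable[str]) -> Optional[SessionUsage]:
    """Extract the latest cumulative usage; never touch conversation text."""
    latest = None
    malformed = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            malformed += 1  # count it, but do not log the raw line
            continue
        if not isinstance(record, dict) or record.get("type") != "token_count":
            continue  # other record types are skipped unread
        usage = record.get("total_token_usage")
        if isinstance(usage, dict):
            latest = usage
    if latest is None:
        return None  # session has no usage records
    return SessionUsage(
        source_id=source_id,
        input_tokens=int(latest.get("input_tokens") or 0),
        cached_input_tokens=int(latest.get("cached_input_tokens") or 0),
        output_tokens=int(latest.get("output_tokens") or 0),
        reasoning_output_tokens=int(latest.get("reasoning_output_tokens") or 0),
        malformed_lines=malformed,
    )
```

Returning only counts and a malformed-line tally keeps transcript content out of the result type by construction.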
## T04 - Improve Repo, Workstream, and Task Attribution
```task
id: STATE-WP-0045-T04
status: done
priority: high
state_hub_task_id: "d78b36ea-2a1a-40d6-bd83-03d48ff2ad9b"
```
Make attribution accurate without relying solely on local path string matching.
Implementation notes:
- Resolve repo attribution by git root fingerprint and remote URL when possible,
then fall back to registered host paths.
- Handle duplicate local paths or alias repos explicitly, especially where one
checkout is registered under multiple slugs.
- Attribute session-level usage to repo first, then optionally to workstreams or
tasks when there is strong evidence.
- Define task allocation rules that do not change measured session totals. For
example, produce `allocated` child rows from measured session rows using task
completion timestamps, tool-call metadata, or explicit operator input.
- Record the allocation method and confidence for every task-level allocation.
- Avoid minting task-level heuristic rows automatically for bulk import, status
sync, migration, and consistency tooling.
Done when measured session totals are stable and task/workstream attribution is
explicitly either measured, allocated, or estimated.
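
The rule that allocation must not change measured session totals can be made concrete with a largest-remainder split. This is a sketch of one possible allocation method, not the rule the workplan settled on; the `allocate` function and the idea of per-task weights (for example, derived from completion timestamps) are assumptions.

```python
from typing import Dict


def allocate(total: int, weights: Dict[str, float]) -> Dict[str, int]:
    """Split a measured total across tasks so the shares sum to exactly total.

    Weights are assumed to be non-negative.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    weight_sum = sum(weights.values())
    if not weights or weight_sum <= 0:
        return {}  # no evidence: leave the session unallocated
    exact = {key: total * w / weight_sum for key, w in weights.items()}
    shares = {key: int(value) for key, value in exact.items()}  # floor
    remainder = total - sum(shares.values())
    # Hand leftover tokens to the largest fractional parts first.
    by_fraction = sorted(exact, key=lambda k: exact[k] - shares[k], reverse=True)
    for key in by_fraction[:remainder]:
        shares[key] += 1
    return shares
```

Each resulting share would be written as an `allocated` child row alongside the allocation method and confidence, while the measured session row stays untouched.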
## T05 - Add Reconciliation, Gap Detection, and Backfill Operations
```task
id: STATE-WP-0045-T05
status: done
priority: high
state_hub_task_id: "efaa2629-4f9a-439c-b0a3-85d77b03580f"
```
Add an operator-safe reconciliation command that detects flatlines, duplicate
rows, stale ingestion, and fallback leakage.
Implementation notes:
- Add a command such as `make token-reconcile` or
`python scripts/token_reconcile.py --since <date>`.
- Report sessions found, sessions ingested, sessions stale, duplicate source
ids, fallback events, superseded rows, unattributed sessions, and rows missing
structured provenance.
- Run in dry-run mode by default and require an explicit `--apply` flag for
writes.
- Include an explicit `--zero-superseded-fallbacks` or equivalent flag rather
than silently editing historical rows.
- Store reconciliation summaries as progress events or report files without
including transcript content.
- Add a canary threshold: alert or fail when measured token volume is zero while
task/progress activity exists for the same window.
Done when an operator can run one command to verify token tracking health and
perform safe, idempotent backfills.
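
The canary check reduces to comparing two per-day series. The sketch below assumes the reconciliation command has already aggregated measured tokens and task/progress activity by day; the `flatline_days` function and its input shapes are hypothetical.

```python
from datetime import date
from typing import Dict, List


def flatline_days(
    measured_tokens: Dict[date, int], activity_counts: Dict[date, int]
) -> List[date]:
    """Return days with task/progress activity but zero measured tokens."""
    return sorted(
        day
        for day, count in activity_counts.items()
        if count > 0 and measured_tokens.get(day, 0) == 0
    )
```

A non-empty result is the signal to alert on or to fail the run.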
## T06 - Harden Hooks and Runtime Integration
```task
id: STATE-WP-0045-T06
status: done
priority: medium
state_hub_task_id: "5fd99241-e6dd-4ca6-8c58-a0048f08f0ca"
```
Make token collection survive hook misses, tool renames, restarts, and multiple
agent runtimes.
Implementation notes:
- Update Claude hook handling so it can match supported task completion paths,
not just one exact MCP tool name.
- Persist hook high-water marks in a durable State Hub or repo-local location
instead of only `/tmp`.
- Add hook health logging that records when a hook ran, what source id it
processed, and whether it patched or skipped a token event.
- Add a Codex ingestion path that can run on demand and from a schedule without
requiring manual script execution.
- Document required environment variables and path discovery for Windows, WSL,
and remote Linux hosts.
- Ensure failures degrade to visible `estimated` events or health warnings, not
silent flatlines.
Done when missing or stale token ingestion becomes visible within one reporting
window and can be recovered without ad hoc inspection.
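
A durable high-water mark can be as simple as a JSON file replaced atomically. The sketch below is one way to do it, assuming a single writer at a time; the file layout and the `load_marks`, `save_marks`, and `advance` helpers are hypothetical, and the path would be whatever durable location the hook is configured with.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict


def load_marks(path: Path) -> Dict[str, int]:
    """Read stored marks; a missing or corrupt file yields an empty map."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_marks(path: Path, marks: Dict[str, int]) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(marks, handle)
    os.replace(tmp_name, path)


def advance(path: Path, source_id: str, total: int) -> int:
    """Record a new cumulative total and return the tokens added since last time."""
    marks = load_marks(path)
    previous = marks.get(source_id, 0)
    # A total below the stored mark is treated as a counter reset.
    delta = total - previous if total >= previous else total
    marks[source_id] = total
    save_marks(path, marks)
    return delta
```

Because the mark lives outside `/tmp`, a restart between hook runs does not cause the next run to recount or drop a session.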
## T07 - Upgrade Token APIs and Dashboard Quality Signals
```task
id: STATE-WP-0045-T07
status: done
priority: medium
state_hub_task_id: "ecaf6ff8-59aa-4c56-8163-125dc96b2068"
```
Expose token quality, source, and freshness in APIs and dashboard views.
Implementation notes:
- Add API filters for measurement kind, source provider, repo, time range,
superseded rows, and unattributed rows.
- Replace the hard dashboard dependence on `/token-events/?limit=1000` with
paginated or pre-aggregated endpoints that support time windows.
- Add dashboard controls for measured-only, include allocated, include
estimates, and show superseded rows.
- Show ingestion freshness: last Codex session ingested, last Claude transcript
ingested, and last reconciliation run.
- Add a data-quality section listing fallback events, unattributed measured
sessions, duplicate source ids, and days with progress/task activity but zero
measured tokens.
- Update the Token Cost page and docs so operators know which numbers are
measured versus inferred.
Done when the dashboard no longer presents fallback, allocated, and measured
usage as indistinguishable totals.
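
The measured-only default and the opt-in toggles can be sketched as a small aggregation step. The event dictionaries, the `summarize` function, and the `unclassified` bucket for rows without a measurement kind are assumptions for illustration, not the dashboard's actual data loader.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable


def summarize(
    events: Iterable[dict], include: FrozenSet[str] = frozenset({"measured"})
) -> Dict[str, object]:
    """Total tokens per measurement kind plus a headline over selected kinds."""
    by_kind = defaultdict(int)
    for event in events:
        kind = event.get("measurement_kind") or "unclassified"
        by_kind[kind] += event.get("tokens_in", 0) + event.get("tokens_out", 0)
    headline = sum(total for kind, total in by_kind.items() if kind in include)
    return {"by_kind": dict(by_kind), "headline_total": headline}
```

Reporting the per-kind breakdown next to the headline number keeps fallback and allocated rows visible without letting them inflate the measured total.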
## T08 - Verification and Migration Playbook
```task
id: STATE-WP-0045-T08
status: done
priority: medium
state_hub_task_id: "61baff79-832e-45f8-80f3-106abe262096"
```
Cover the new measurement system with tests and a safe rollout plan.
Implementation notes:
- Add unit tests for the evidence model, source adapters, source-id
deduplication, repo attribution, and task allocation.
- Add router tests for idempotent upsert, source filters, measurement-kind
filters, created-at preservation, and backwards-compatible legacy posts.
- Add reconciliation tests with synthetic pre-May-19 and post-May-19 flatline
scenarios.
- Add dashboard/data-loader tests or fixture checks for quality filters and
aggregate counts.
- Write a migration playbook covering old heuristic rows, existing
`backfill:codex-session` rows, and any rows without structured provenance.
- Verify the full suite and run a dry-run reconciliation before marking this
workplan finished.
Done when the improved token measurement path has automated coverage, an
operator playbook, and a dry-run reconciliation report showing no hidden
fallback leakage.