Files

tegwick a66d502b95 docs: add kaizen-agentic project metrics correlation (WP-0005 T16)

Link Helix Forge fleet session memory to kaizen-agentic ADR-004 project
metrics via helix_session_uid. Reciprocal reference to the cross-repo
correlation contract.

2026-06-16 07:13:07 +02:00

23 KiB

Raw Permalink Blame History

Design Document — Coding Session Memory

Domain: helix_forge Repo: agentic-resources Status: Draft v0.1 Author: Claude (drafted with Bernd Worsch) Created: 2026-06-06 Updated: 2026-06-06 Related: PRD-helix-forge.md (this is the Capture + storage layer, FR-C* / §8)

1. Purpose

Helix Forge's loop (Capture → Detect → Curate → Distribute → Measure) needs a durable, bounded memory of coding sessions. This document specifies that memory: how we access each coding agent's session protocol, how we normalize those protocols into one schema, where we store the result, and how we age it out — preferring a storage-budget-based eviction that drops old raw content once it has been analyzed or no longer fits, rather than a naive fixed time window.

The guiding asymmetry: raw transcripts are bulky and re-derivable; the distilled analysis is small and precious. So we keep a bounded cache of raw sessions and a durable, compact layer of extracted digests/signals. Eviction targets the former, never the latter.

2. Research — How to Access Each Agent's Session Protocol

All three families persist sessions to the local filesystem as JSONL (plus, for Grok, a per-session directory). All findings below were verified against the live installs on this workstation (~/.claude, ~/.grok) and public docs (Codex; not installed here).

2.1 Claude Code ✅ verified on disk

Aspect	Finding
Session transcripts	`~/.claude/projects/<url-encoded-cwd>/<session-uuid>.jsonl` — one JSONL per session
Subagent sidechains	same dir, `agent-<id>.jsonl`; records carry `isSidechain: true`
Global prompt history	`~/.claude/history.jsonl`
Record format	one JSON object per line; `type` discriminates: `user`, `assistant`, `attachment`, `queue-operation`, `ai-title`, `last-prompt`, `summary`, plus tool-result records
Key fields	`type`, `timestamp`, `sessionId`, `uuid`, `parentUuid` (turn DAG), `message` (`role` + content blocks: `text`/`thinking`/`tool_use`/`tool_result`), `cwd`, `gitBranch`, `version`, `requestId`, `toolUseResult`, `userType`
Token usage	inside assistant `message.usage` (input/output/cache tokens)
Model	`message.model` (e.g. `claude-opus-4-8`)
Side data	`~/.claude/todos/`, `~/.claude/tasks/`, `~/.claude/file-history/`, `~/.claude/shell-snapshots/`
Live capture hook	Claude Code SessionEnd / Stop / SessionStart hooks can fire our ingest on session close (push), in addition to batch scanning (pull)

The turn DAG (uuid/parentUuid) lets us reconstruct branching, retries, and sidechains exactly.

2.2 OpenAI Codex CLI ✅ schema confirmed from source (not installed locally)

Schema confirmed from the openai/codex source (codex-rs/protocol/src/protocol.rs via DeepWiki) and a reverse-engineering writeup with real example lines — the two cross-agree.

Aspect	Finding
Session ("rollout") files	`$CODEX_HOME/sessions/YYYY/MM/DD/rollout-*.jsonl` (default `$CODEX_HOME = ~/.codex`)
Line wrapper (`RolloutLine`)	every line: `{timestamp, type, payload}` (UTC ts + a `RolloutItem`)
`type` discriminator	`session_meta` · `response_item` · `event_msg` · `turn_context` · `compacted`
`session_meta`	`{id, source, cwd, model_provider, cli_version}` (+ model) — restores env
`turn_context`	`{model, approval_policy, sandbox_policy}` — per-turn settings snapshot
`response_item`	raw model output / tool calls; `payload.type` ∈ `message` · `function_call` · `function_call_output` · `reasoning`
→ `message`	`{role: developer\|user\|assistant, content:[{type:"output_text"\|…, text}]}`
→ `function_call`	`{name, arguments (JSON string), call_id}`
→ `function_call_output`	`{call_id, output}`
`event_msg`	protocol events; `payload.type` ∈ `task_started` · `task_complete` · `user_message` · `agent_message` · `token_count` · lifecycle
Token usage	`event_msg` with `payload.type = token_count`, interspersed (no fixed cadence)
Turn linkage	flat — tool calls/outputs linked by `call_id`, no parent-ref DAG; causality inferred from temporal order (unlike Claude's `uuid`/`parentUuid`)
Schema versions	older installs differ ("new ≥0.44 / mid / oldest 2025/08"); adapter version-detects on `session_meta.cli_version`
Naming / resume	filenames + `session_id` auto-generated; `codex resume --last`; `codex exec` for headless (trajectory-JSON is gh issue #2288)
Override location	`CODEX_HOME` env var

Adapter notes: map event_msg/task_started|task_complete → lifecycle events and outcome; response_item/message → user_msg/assistant_msg; function_call+function_call_output → tool_call/tool_result joined on call_id; response_item/reasoning → thinking; event_msg/token_count → cost block. Because there is no parent-ref DAG, the adapter assigns seq/parent_seq from temporal order rather than native links.

2.3 Grok CLI (xAI) ✅ verified on disk

Grok stores a directory per session, which is the richest source of the three.

Aspect	Finding
Session dir	`~/.grok/sessions/<url-encoded-cwd>/<session-uuid>/`
`chat_history.jsonl`	full conversation; `type` = `system`/`user`/`assistant` + content
`events.jsonl`	structured lifecycle events — `{ts, type, session_id, turn_number, model_id, yolo_mode, conversation_message_count, session_relationship, schema_version}`; types like `turn_started`, `loop_started`
`updates.jsonl`	streaming incremental updates
`summary.json`	`{id, cwd, session_summary, created_at, updated_at}`
`prompt_context.json`	injected context, incl. which AGENTS.md/CLAUDE.md files were loaded
`system_prompt.txt`	exact system prompt for the session
`rewind_points.jsonl`, `plan_mode.json`	rewind/plan-mode state
Per-cwd prompt history	`~/.grok/sessions/<cwd>/prompt_history.jsonl` — `{timestamp, session_id, prompt, is_bash}`
Global structured log	`~/.grok/logs/unified.jsonl` — `{ts, src, pid, lvl, msg, ctx, sid, ver}`
Search index	`~/.grok/sessions/session_search.sqlite` — `session_docs(session_id, cwd, updated_at, title)` + FTS5 (`session_docs_fts`) we can query directly
Integration surfaces	Grok exposes ACP (Agent Client Protocol), headless mode (`grok -p`), and hooks (`~/.grok/docs/user-guide/10-hooks.md`) — push-capture options

2.4 Cross-family summary

	Claude Code	Codex CLI	Grok CLI
Root	`~/.claude/projects/`	`~/.codex/sessions/`	`~/.grok/sessions/`
Unit	one `.jsonl`/session	one `rollout-*.jsonl`/session	one dir/session
Layout	flat per-cwd dir	date-partitioned `YYYY/MM/DD`	per-cwd, per-session dir
Discriminator	`type`	`type` (version-dependent)	`type` (in `chat_history`/`events`)
Lifecycle events	inferred from records	inferred from records	explicit `events.jsonl`
Token usage	`message.usage`	per-line usage	from events/updates
Push capture	Stop/SessionEnd hooks	`codex exec` wrappers	hooks / ACP
Pull capture	scan dir by mtime	scan date partitions	scan dirs / query FTS sqlite

Implication: the common denominator is "JSONL records discriminated by a type field, with a session id, timestamps, turn linkage, tool calls, and token usage." That maps cleanly onto one normalized schema (§4). Per-family quirks (Grok's explicit events.jsonl, Codex's schema versions, Claude's sidechains) are handled inside each adapter.

3. Tiered Storage Model

 Tier 0  SOURCE (agents' own logs)        read-only, never mutated
         ~/.claude/projects  ~/.codex/sessions  ~/.grok/sessions
                 │  collector adapters (per family) + ingest cursor
                 ▼
 Tier 1  RAW CACHE (bounded, EVICTABLE)   normalized Session + Event records
                 │  signal extractors / digesters
                 ▼
 Tier 2  DISTILLED MEMORY (durable, small)  session digests + signals + pattern evidence

Tier 0 — Source. The agents' own logs. We treat them as read-only. We keep a small ingest cursor per source so re-scans are incremental (see §6).
Tier 1 — Raw cache. Normalized copies of sessions/events. This is the bulky tier and the only tier subject to budget eviction.
Tier 2 — Distilled memory. Per-session digest (outcome, costs, tool histogram, error/retry/intervention markers, key snippets) plus extracted signals and pattern evidence pointers. Compact and durable. A session can be fully evicted from Tier 1 once its Tier 2 digest exists.

This is what makes "drop old content once it has been analyzed" safe: analysis promotes the valuable bits into Tier 2 before the raw bytes are dropped.

3.1 Per-session lifecycle / watermarks

Each session row carries timestamps that drive eviction:

discovered_at → ingested_at → analyzed_at → [evictable] → evicted_at

ingested_at set when normalized into Tier 1.
analyzed_at set when the Tier 2 digest is written. A session is evictable iff analyzed_at is set.
evicted_at set when raw bytes are dropped from Tier 1 (Tier 2 digest remains).

4. Normalized Schema (Tier 1)

Two record kinds. Field names are stable across all adapters.

4.1 `Session`

{
  "session_uid": "claude:17092961-…",      // "<flavor>:<native id>", globally unique
  "flavor": "claude" | "codex" | "grok",
  "native_session_id": "17092961-…",
  "repo": "agentic-resources",             // resolved from cwd
  "domain": "helix_forge",                 // resolved from repo→domain map
  "cwd": "/home/worsch/agentic-resources",
  "git_branch": "main",
  "model": "claude-opus-4-8",
  "started_at": "2026-06-05T21:59:30Z",
  "ended_at": "2026-06-05T22:14:00Z",
  "outcome": "success|fail|abandoned|unknown",
  "cost": { "input_tokens": 0, "output_tokens": 0, "cache_tokens": 0,
            "wall_clock_s": 0, "turns": 0, "retries": 0 },
  "task_ref": "AGENTIC-WP-0002-T01",       // if derivable; else null
  "source_path": "~/.claude/projects/…/….jsonl",
  "source_bytes": 0,
  "schema_version": 1,
  "ingested_at": "…", "analyzed_at": null, "evicted_at": null
}

4.2 `SessionEvent`

{
  "session_uid": "claude:17092961-…",
  "seq": 12,                               // monotonic within session
  "parent_seq": 11,                        // turn DAG (Claude uuid/parentUuid)
  "ts": "2026-06-05T22:01:13Z",
  "kind": "user_msg | assistant_msg | thinking | tool_call | tool_result"
        + "| error | test_run | edit | retry | human_intervention | decision"
        + "| lifecycle | completion",
  "role": "user|assistant|system|tool",
  "tool": "Bash|Edit|Read|…",              // when kind=tool_call/result
  "summary": "ran pytest -q",              // short, human-readable
  "payload_ref": "blob://…",               // pointer to full content in Tier 1 blob store
  "tokens": 0,
  "is_sidechain": false
}

Adapters map native records onto kind. Grok's events.jsonl populates lifecycle/turn events directly; Claude/Codex lifecycle is inferred from the record stream. Bulky bodies live behind payload_ref so Tier 1 rows stay light and blobs can be evicted independently.

4.3 Native → `kind` mapping (all three families)

Each cell is the native record/discriminator an adapter reads to emit that SessionEvent.kind. — = not natively present; the adapter synthesizes or omits.

`kind`	Claude Code (`type` / `message`)	Codex CLI (`type` → `payload.type`)	Grok CLI (file → `type`)
`user_msg`	`user`, `message.role=user`	`response_item` → `message` `role=user`/`developer`	`chat_history` → `user`
`assistant_msg`	`assistant`, `message.role=assistant`, content `text`	`response_item` → `message` `role=assistant` (`output_text`)	`chat_history` → `assistant`
`thinking`	`assistant` content block `type=thinking`	`response_item` → `reasoning`	`chat_history`/`updates` reasoning block
`tool_call`	`assistant` content block `type=tool_use` (`name`,`input`)	`response_item` → `function_call` (`name`,`arguments`,`call_id`)	`chat_history`/`updates` tool-call entry
`tool_result`	`user`/tool record `type=tool_result` + `toolUseResult`	`response_item` → `function_call_output` (join on `call_id`)	`updates` tool-result entry
`test_run`	derived from `tool_call` (Bash running tests)	derived from `function_call` (`exec_command`)	derived from tool-call entry
`edit`	`tool_use` where `name` ∈ Edit/Write/NotebookEdit	`function_call` apply-patch/file-write tool	tool-call entry (edit/write)
`error`	`toolUseResult` error / non-zero result	`function_call_output` error / `event_msg` error	`events.jsonl` error / failed update
`retry`	repeated `tool_use` after error (inferred via DAG)	repeated `function_call` after error (inferred, temporal)	`events.jsonl` loop/retry event
`human_intervention`	`user` record mid-turn (interrupt), `userType`	`event_msg` → `user_message` mid-task	`prompt_history` mid-session / `events.jsonl`
`decision`	recorded out-of-band (State Hub `/decisions`)	recorded out-of-band (State Hub)	recorded out-of-band (State Hub)
`lifecycle`	inferred: first/last record, `summary`, `queue-operation`	`event_msg` → `task_started` / `task_complete`	`events.jsonl` → `turn_started`/`loop_started`/… (explicit)
`completion`	inferred: last `assistant` + `Stop`/`SessionEnd` hook	`event_msg` → `task_complete`	`events.jsonl` turn end + `summary.json`

Linkage note (drives seq/parent_seq): Claude has a true turn DAG (uuid/parentUuid) — preserve it directly. Codex is flat, joined only by call_id; assign seq by temporal order. Grok carries explicit turn_number in events.jsonl; key seq off that plus record order.

Cost block sources: Claude message.usage; Codex event_msg/token_count; Grok events.jsonl / updates.jsonl token fields.

5. Retention & Eviction

The user's stated preference: storage-budget-based, dropping old content once it has been analyzed or once it no longer fits — better than a fixed daily/weekly window. We implement budget-based as primary, with a time backstop and a scheduled cadence as the trigger.

5.1 Configurable knobs

[session_memory.retention]
raw_soft_cap_bytes   = "4GiB"   # begin evicting analyzed sessions above this
raw_hard_cap_bytes   = "6GiB"   # absolute ceiling for Tier 1
raw_max_age_days     = 45       # backstop: analyzed raw older than this is evictable regardless of space
distilled_cap_bytes  = "1GiB"   # Tier 2 ceiling (should grow slowly; alert, don't auto-drop)
cadence              = "daily"  # ingest+analyze+evict sweep: daily | weekly | on-hook

5.2 Eviction algorithm (runs after each ingest+analyze sweep)

Compute current Tier 1 usage.
Backstop pass: evict any session where analyzed_at is set AND age > raw_max_age_days.
Budget pass: while usage > raw_soft_cap_bytes:
- pick the oldest analyzed_at session that is not yet evicted;
- drop its Tier 1 raw rows + blobs (Tier 2 digest is kept), set evicted_at;
- if no analyzed-but-unevicted session remains, stop the budget pass (we will not destroy un-analyzed data to free space) and go to step 4.
Back-pressure / overflow: if usage > raw_hard_cap_bytes and the only remaining bulk is un-analyzed:
- first try to analyze now (run extraction) to make those sessions evictable, then re-run the budget pass;
- if still over hard cap (analysis can't keep up or fails), evict the oldest un-analyzed sessions as a last resort and emit a session_memory.data_loss warning event + a State Hub progress note. This is the only path that loses un-analyzed data, and it is always reported.
Tier 2 guard: if distilled usage > distilled_cap_bytes, do not auto-drop; flag for human/curation review (digests are the product).

Invariant: no session's raw bytes are dropped before its Tier 2 digest exists, except the explicitly-reported hard-cap overflow path.

5.3 Why budget-based beats fixed-window

A fixed daily/weekly drop either deletes data we never analyzed (lossy) or hoards data we already distilled (wasteful). Budget + analyzed_at watermark ties deletion to two real conditions the user named — "once it has been analyzed" (promoted to Tier 2) and "doesn't fit any longer" (over budget) — and only falls back to time as a backstop.

6. Ingest Cursors (incremental, idempotent)

Per source, persist a small cursor so sweeps are cheap and re-runnable:

Claude / Grok (per-cwd dirs): track (file_path, size, mtime) and last parsed line offset; re-ingest only grown/changed files. session_uid dedupes.
Codex (date partitions): track last-seen YYYY/MM/DD + per-file offset.
Ingest is idempotent keyed on (session_uid, seq) — safe to re-run after a crash or partial sweep.

7. Capture Modes

Pull (default, portable): scheduled sweep scans Tier 0 by mtime/partition. Works for all three families with zero coupling to the agent. Triggered on the configured cadence via the repo's scheduler (/schedule, cron, or /loop).
Push (optional, low-latency): wire the agent's own hooks to ping the ingester on session close — Claude Stop/SessionEnd hooks, Grok hooks/ACP, Codex exec wrappers. Push just enqueues; the same idempotent pull pipeline does the work.

Capture must be non-blocking (PRD FR-C5): we read copies of logs out-of-band; we never sit in the agent's critical path.

8. Component Layout (proposed, in-repo)

session-memory/
  adapters/
    claude.py      # Tier0→Tier1 normalizer (verified schema)
    codex.py       # version-detecting normalizer (confirm against real rollout)
    grok.py        # reads session dir incl. events.jsonl
  core/
    schema.py      # Session / SessionEvent dataclasses + versioning
    store.py       # Tier1 (rows+blobs) and Tier2 (digests) — SQLite to start
    cursor.py      # per-source ingest cursors
    retention.py   # §5 eviction algorithm
    digest.py      # Tier1→Tier2 session digest + signal stubs
  ingest.py        # one sweep: discover → normalize → analyze → evict
  config.toml      # §5.1 knobs + repo→domain map + source paths

Storage starts as SQLite + a blob dir (rows in SQLite, bulky payloads as files under payload_ref); graduate to Postgres alongside the State Hub only if volume demands. Digests/decisions are also surfaced to the hub per ADR-001 (files-first; hub indexes).

9. Privacy / Safety

Tier 0 logs can contain secrets (the Grok auth.json and Claude .credentials live in the same trees). The ingester reads only session transcripts, never credential files, and redacts obvious secret patterns into payload_ref blobs.
All data is local; nothing leaves the workstation. Eviction of Tier 1 is a real delete (not just an index drop) so the bounded cache is also a privacy bound.

10. Open Questions

OQ1 Confirm Codex rollout-*.jsonl per-line schema. Resolved (§2.2): {timestamp,type,payload} lines, type ∈ session_meta/response_item/event_msg/turn_context/compacted, tool calls flat-linked by call_id, tokens via event_msg/token_count. Remaining sub-item: verify the token_count payload field names against a real install when Codex is present (older-version variance only).
OQ2 Outcome inference: how do we reliably label success/fail/abandoned across flavors (exit signals differ)? Start heuristic (last-turn + test results + human-intervention markers), refine in Detect phase.
OQ3 task_ref resolution — can we always map a session to a workplan task (via cwd + branch + state-hub), or only sometimes?
OQ4 Right default for raw_soft_cap_bytes. Measured (Phase 0, 85 real local Claude files / 63 distinct sessions): source bytes per session min 396 · median ~49 KB · max 48 MB (one outlier) · ~103 MB total. Claude defaults (4 GiB soft / 6 GiB hard) leave ample headroom; revisit once Grok dirs (heavier, multi-file) are ingested in Phase 1.
OQ6 (new, found in Phase 0) Multi-file sessions: ~84 transcript files mapped to ~63 session_uids — some sessions span multiple files (resume/sidechain sharing a sessionId). Current behavior upserts (last file wins per (session_uid, seq)); a future refinement is to merge events across files of one session rather than overwrite. Acceptable for Phase 0.
OQ5 Should push-hooks be opt-in per machine to avoid surprising the agents?

11. Project metrics correlation (kaizen-agentic)

Helix Forge owns fleet-level session capture and digests (this repo). The kaizen-agentic framework owns project-scoped agent execution metrics (ADR-004: .kaizen/metrics/<agent>/executions.jsonl). The two layers correlate by optional helix_session_uid on project records — link-by-reference, no duplicate ingestion in either repo.

Layer	Owner	Storage
Fleet	agentic-resources (Helix Forge)	digest store (`digests` table)
Project	kaizen-agentic	`.kaizen/metrics/<agent>/executions.jsonl`

Cross-repo contract: Helix Forge Correlation Contract (kaizen-agentic). Field mapping from Session.session_uid → helix_session_uid, digest.cost → tokens, tool_histogram MCP share → infra_overhead_share.

Read path: kaizen-agentic metrics correlate <uid> looks up a digest via HELIX_STORE_DB (this repo's session store). No write path from kaizen-agentic into Helix Forge.

Related kaizen-agentic docs: ADR-004 project metrics convention, wiki/EcosystemIntegration.md.

Next step: [AGENTIC-WP-0002] implements Phase 0 — the schema, the Claude collector, the Tier1/Tier2 store, and the budget-based eviction sweep.

Sources

Claude Code session format — verified on disk: ~/.claude/projects/*/*.jsonl, ~/.claude/history.jsonl.
Grok CLI session format — verified on disk: ~/.grok/sessions/, ~/.grok/logs/unified.jsonl, ~/.grok/sessions/session_search.sqlite; ~/.grok/README.md (ACP/headless/hooks).
Codex CLI session format — ccusage Codex guide, Codex advanced config, codex-trace, codex-logs, Session/Rollout Files discussion #3827, trajectory-JSON issue #2288.

23 KiB Raw Permalink Blame History