Design Document — Coding Session Memory
Domain: helix_forge
Repo: agentic-resources
Status: Draft v0.1
Author: Claude (drafted with Bernd Worsch)
Created: 2026-06-06
Updated: 2026-06-06
Related: PRD-helix-forge.md (this is the Capture + storage layer, FR-C* / §8)
1. Purpose
Helix Forge's loop (Capture → Detect → Curate → Distribute → Measure) needs a durable, bounded memory of coding sessions. This document specifies that memory: how we access each coding agent's session protocol, how we normalize those protocols into one schema, where we store the result, and how we age it out — preferring a storage-budget-based eviction that drops old raw content once it has been analyzed or no longer fits, rather than a naive fixed time window.
The guiding asymmetry: raw transcripts are bulky and re-derivable; the distilled analysis is small and precious. So we keep a bounded cache of raw sessions and a durable, compact layer of extracted digests/signals. Eviction targets the former, never the latter.
2. Research — How to Access Each Agent's Session Protocol
All three agent families (Claude Code, OpenAI Codex CLI, Grok CLI) persist
sessions to the local filesystem as JSONL (Grok additionally uses a per-session
directory). The Claude and Grok findings below were verified against the live
installs on this workstation (~/.claude, ~/.grok); the Codex findings come from
public docs and source, because Codex is not installed here.
2.1 Claude Code ✅ verified on disk
| Aspect | Finding |
|---|---|
| Session transcripts | ~/.claude/projects/<url-encoded-cwd>/<session-uuid>.jsonl — one JSONL per session |
| Subagent sidechains | same dir, agent-<id>.jsonl; records carry isSidechain: true |
| Global prompt history | ~/.claude/history.jsonl |
| Record format | one JSON object per line; type discriminates: user, assistant, attachment, queue-operation, ai-title, last-prompt, summary, plus tool-result records |
| Key fields | type, timestamp, sessionId, uuid, parentUuid (turn DAG), message (role + content blocks: text/thinking/tool_use/tool_result), cwd, gitBranch, version, requestId, toolUseResult, userType |
| Token usage | inside assistant message.usage (input/output/cache tokens) |
| Model | message.model (e.g. claude-opus-4-8) |
| Side data | ~/.claude/todos/, ~/.claude/tasks/, ~/.claude/file-history/, ~/.claude/shell-snapshots/ |
| Live capture hook | Claude Code SessionEnd / Stop / SessionStart hooks can fire our ingest on session close (push), in addition to batch scanning (pull) |
The turn DAG (uuid/parentUuid) lets us reconstruct branching, retries, and
sidechains exactly.
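A minimal sketch of that reconstruction, assuming every linkable record carries `uuid` and `parentUuid` as listed in the table above. The function names `build_turn_dag` and `branch_points` are hypothetical, and records without a `uuid` are simply skipped here, which is an assumption rather than verified behavior.

```python
import json
from collections import defaultdict


def build_turn_dag(jsonl_lines):
    """Return (roots, children) where both are keyed by record uuid."""
    records = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if "uuid" in rec:  # assumption: some record types carry no uuid
            records.append(rec)

    known = {rec["uuid"] for rec in records}
    roots, children = [], defaultdict(list)
    for rec in records:
        parent = rec.get("parentUuid")
        if parent is None or parent not in known:
            roots.append(rec["uuid"])  # start of a chain (or orphaned parent)
        else:
            children[parent].append(rec["uuid"])
    return roots, dict(children)


def branch_points(children):
    """A uuid with more than one child marks a retry or branch."""
    return sorted(u for u, kids in children.items() if len(kids) > 1)
```

Sidechain records could be separated first by filtering on `isSidechain` before building the graph.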
2.2 OpenAI Codex CLI ✅ schema confirmed from source (not installed locally)
Schema confirmed from the openai/codex source (codex-rs/protocol/src/protocol.rs
via DeepWiki) and a reverse-engineering writeup with real example lines — the two
cross-agree.
| Aspect | Finding |
|---|---|
| Session ("rollout") files | $CODEX_HOME/sessions/YYYY/MM/DD/rollout-*.jsonl (default $CODEX_HOME = ~/.codex) |
| Line wrapper (RolloutLine) | every line: {timestamp, type, payload} (UTC ts + a RolloutItem) |
| type discriminator | session_meta · response_item · event_msg · turn_context · compacted |
| session_meta | {id, source, cwd, model_provider, cli_version} (+ model); restores env |
| turn_context | {model, approval_policy, sandbox_policy}; per-turn settings snapshot |
| response_item | raw model output / tool calls; payload.type ∈ message · function_call · function_call_output · reasoning |
| → message | {role: developer\|user\|assistant, content:[{type:"output_text"\|…, text}]} |
| → function_call | {name, arguments (JSON string), call_id} |
| → function_call_output | {call_id, output} |
| event_msg | protocol events; payload.type ∈ task_started · task_complete · user_message · agent_message · token_count · lifecycle |
| Token usage | event_msg with payload.type = token_count, interspersed (no fixed cadence) |
| Turn linkage | flat — tool calls/outputs linked by call_id, no parent-ref DAG; causality inferred from temporal order (unlike Claude's uuid/parentUuid) |
| Schema versions | older installs differ ("new ≥0.44 / mid / oldest 2025/08"); adapter version-detects on session_meta.cli_version |
| Naming / resume | filenames + session_id auto-generated; codex resume --last; codex exec for headless (trajectory-JSON is gh issue #2288) |
| Override location | CODEX_HOME env var |
Adapter notes: map event_msg/task_started|task_complete → lifecycle
events and outcome; response_item/message → user_msg/assistant_msg;
function_call+function_call_output → tool_call/tool_result joined on
call_id; response_item/reasoning → thinking; event_msg/token_count → cost
block. Because there is no parent-ref DAG, the adapter assigns seq/parent_seq
from temporal order rather than native links.
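The adapter notes above could be sketched roughly as follows. This is not the real adapter: `normalize_codex` is a hypothetical name, only `response_item` lines are handled, and pointing a `tool_result` at its matching `tool_call` (rather than at the previous event) is our assumption about how `parent_seq` should use the `call_id` join.

```python
import json


def normalize_codex(lines):
    """Map RolloutLine JSONL to minimal events, ordered by file position."""
    events = []
    call_seq = {}    # call_id -> seq of the tool_call that opened it
    prev_seq = None  # temporal predecessor, used when no native link exists
    for line in lines:
        rec = json.loads(line)
        if rec.get("type") != "response_item":
            continue
        payload = rec["payload"]
        ptype = payload.get("type")
        seq = len(events)
        parent = prev_seq
        if ptype == "function_call":
            kind = "tool_call"
            call_seq[payload["call_id"]] = seq
        elif ptype == "function_call_output":
            kind = "tool_result"
            parent = call_seq.get(payload["call_id"], prev_seq)
        elif ptype == "message":
            role = payload.get("role")
            kind = "assistant_msg" if role == "assistant" else "user_msg"
        elif ptype == "reasoning":
            kind = "thinking"
        else:
            continue
        events.append({"seq": seq, "parent_seq": parent, "kind": kind,
                       "ts": rec.get("timestamp")})
        prev_seq = seq
    return events
```

Interleaved tool calls still resolve correctly because each output is matched by `call_id`, not by adjacency.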
2.3 Grok CLI (xAI) ✅ verified on disk
Grok stores a directory per session, which is the richest source of the three.
| Aspect | Finding |
|---|---|
| Session dir | ~/.grok/sessions/<url-encoded-cwd>/<session-uuid>/ |
| chat_history.jsonl | full conversation; type = system/user/assistant + content |
| events.jsonl | structured lifecycle events: {ts, type, session_id, turn_number, model_id, yolo_mode, conversation_message_count, session_relationship, schema_version}; types like turn_started, loop_started |
| updates.jsonl | streaming incremental updates |
| summary.json | {id, cwd, session_summary, created_at, updated_at} |
| prompt_context.json | injected context, incl. which AGENTS.md/CLAUDE.md files were loaded |
| system_prompt.txt | exact system prompt for the session |
| rewind_points.jsonl, plan_mode.json | rewind/plan-mode state |
| Per-cwd prompt history | ~/.grok/sessions/<cwd>/prompt_history.jsonl — {timestamp, session_id, prompt, is_bash} |
| Global structured log | ~/.grok/logs/unified.jsonl — {ts, src, pid, lvl, msg, ctx, sid, ver} |
| Search index | ~/.grok/sessions/session_search.sqlite — session_docs(session_id, cwd, updated_at, title) + FTS5 (session_docs_fts) we can query directly |
| Integration surfaces | Grok exposes ACP (Agent Client Protocol), headless mode (grok -p), and hooks (~/.grok/docs/user-guide/10-hooks.md) — push-capture options |
2.4 Cross-family summary
| | Claude Code | Codex CLI | Grok CLI |
|---|---|---|---|
| Root | ~/.claude/projects/ | ~/.codex/sessions/ | ~/.grok/sessions/ |
| Unit | one .jsonl/session | one rollout-*.jsonl/session | one dir/session |
| Layout | flat per-cwd dir | date-partitioned YYYY/MM/DD | per-cwd, per-session dir |
| Discriminator | type | type (version-dependent) | type (in chat_history/events) |
| Lifecycle events | inferred from records | inferred from records | explicit events.jsonl |
| Token usage | message.usage | event_msg / token_count lines | from events/updates |
| Push capture | Stop/SessionEnd hooks | codex exec wrappers | hooks / ACP |
| Pull capture | scan dir by mtime | scan date partitions | scan dirs / query FTS sqlite |
Implication: the common denominator is "JSONL records discriminated by a
type field, with a session id, timestamps, turn linkage, tool calls, and token
usage." That maps cleanly onto one normalized schema (§4). Per-family quirks
(Grok's explicit events.jsonl, Codex's schema versions, Claude's sidechains) are
handled inside each adapter.
3. Tiered Storage Model
Tier 0 SOURCE (agents' own logs) read-only, never mutated
~/.claude/projects ~/.codex/sessions ~/.grok/sessions
│ collector adapters (per family) + ingest cursor
▼
Tier 1 RAW CACHE (bounded, EVICTABLE) normalized Session + Event records
│ signal extractors / digesters
▼
Tier 2 DISTILLED MEMORY (durable, small) session digests + signals + pattern evidence
- Tier 0 — Source. The agents' own logs. We treat them as read-only. We keep a small ingest cursor per source so re-scans are incremental (see §6).
- Tier 1 — Raw cache. Normalized copies of sessions/events. This is the bulky tier and the only tier subject to budget eviction.
- Tier 2 — Distilled memory. Per-session digest (outcome, costs, tool histogram, error/retry/intervention markers, key snippets) plus extracted signals and pattern evidence pointers. Compact and durable. A session can be fully evicted from Tier 1 once its Tier 2 digest exists.
This is what makes "drop old content once it has been analyzed" safe: analysis promotes the valuable bits into Tier 2 before the raw bytes are dropped.
3.1 Per-session lifecycle / watermarks
Each session row carries timestamps that drive eviction:
discovered_at → ingested_at → analyzed_at → [evictable] → evicted_at
- ingested_at: set when normalized into Tier 1.
- analyzed_at: set when the Tier 2 digest is written. A session is evictable iff analyzed_at is set.
- evicted_at: set when raw bytes are dropped from Tier 1 (Tier 2 digest remains).
4. Normalized Schema (Tier 1)
Two record kinds. Field names are stable across all adapters.
4.1 Session
{
"session_uid": "claude:17092961-…", // "<flavor>:<native id>", globally unique
"flavor": "claude" | "codex" | "grok",
"native_session_id": "17092961-…",
"repo": "agentic-resources", // resolved from cwd
"domain": "helix_forge", // resolved from repo→domain map
"cwd": "/home/worsch/agentic-resources",
"git_branch": "main",
"model": "claude-opus-4-8",
"started_at": "2026-06-05T21:59:30Z",
"ended_at": "2026-06-05T22:14:00Z",
"outcome": "success|fail|abandoned|unknown",
"cost": { "input_tokens": 0, "output_tokens": 0, "cache_tokens": 0,
"wall_clock_s": 0, "turns": 0, "retries": 0 },
"task_ref": "AGENTIC-WP-0002-T01", // if derivable; else null
"source_path": "~/.claude/projects/…/….jsonl",
"source_bytes": 0,
"schema_version": 1,
"ingested_at": "…", "analyzed_at": null, "evicted_at": null
}
4.2 SessionEvent
{
"session_uid": "claude:17092961-…",
"seq": 12, // monotonic within session
"parent_seq": 11, // turn DAG (Claude uuid/parentUuid)
"ts": "2026-06-05T22:01:13Z",
"kind": "user_msg | assistant_msg | thinking | tool_call | tool_result"
+ "| error | test_run | edit | retry | human_intervention | decision"
+ "| lifecycle | completion",
"role": "user|assistant|system|tool",
"tool": "Bash|Edit|Read|…", // when kind=tool_call/result
"summary": "ran pytest -q", // short, human-readable
"payload_ref": "blob://…", // pointer to full content in Tier 1 blob store
"tokens": 0,
"is_sidechain": false
}
Adapters map native records onto kind. Grok's events.jsonl populates
lifecycle/turn events directly; Claude/Codex lifecycle is inferred from the
record stream. Bulky bodies live behind payload_ref so Tier 1 rows stay light
and blobs can be evicted independently.
4.3 Native → kind mapping (all three families)
Each cell is the native record/discriminator an adapter reads to emit that
SessionEvent.kind. — = not natively present; the adapter synthesizes or omits.
| kind | Claude Code (type / message) | Codex CLI (type → payload.type) | Grok CLI (file → type) |
|---|---|---|---|
| user_msg | user, message.role=user | response_item → message role=user/developer | chat_history → user |
| assistant_msg | assistant, message.role=assistant, content text | response_item → message role=assistant (output_text) | chat_history → assistant |
| thinking | assistant content block type=thinking | response_item → reasoning | chat_history/updates reasoning block |
| tool_call | assistant content block type=tool_use (name,input) | response_item → function_call (name,arguments,call_id) | chat_history/updates tool-call entry |
| tool_result | user/tool record type=tool_result + toolUseResult | response_item → function_call_output (join on call_id) | updates tool-result entry |
| test_run | derived from tool_call (Bash running tests) | derived from function_call (exec_command) | derived from tool-call entry |
| edit | tool_use where name ∈ Edit/Write/NotebookEdit | function_call apply-patch/file-write tool | tool-call entry (edit/write) |
| error | toolUseResult error / non-zero result | function_call_output error / event_msg error | events.jsonl error / failed update |
| retry | repeated tool_use after error (inferred via DAG) | repeated function_call after error (inferred, temporal) | events.jsonl loop/retry event |
| human_intervention | user record mid-turn (interrupt), userType | event_msg → user_message mid-task | prompt_history mid-session / events.jsonl |
| decision | recorded out-of-band (State Hub /decisions) | recorded out-of-band (State Hub) | recorded out-of-band (State Hub) |
| lifecycle | inferred: first/last record, summary, queue-operation | event_msg → task_started / task_complete | events.jsonl → turn_started/loop_started/… (explicit) |
| completion | inferred: last assistant + Stop/SessionEnd hook | event_msg → task_complete | events.jsonl turn end + summary.json |
Linkage note (drives seq/parent_seq): Claude has a true turn DAG
(uuid/parentUuid) — preserve it directly. Codex is flat, joined only by
call_id; assign seq by temporal order. Grok carries explicit turn_number in
events.jsonl; key seq off that plus record order.
Cost block sources: Claude message.usage; Codex event_msg/token_count;
Grok events.jsonl / updates.jsonl token fields.
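To make the Claude column concrete, here is a small sketch of how one native record might fan out into `kind` values. `claude_kinds` is a hypothetical helper, it covers only the content-block rows of the table, and treating a plain-string `content` as a single text block is an assumption about the on-disk format.

```python
EDIT_TOOLS = {"Edit", "Write", "NotebookEdit"}


def claude_kinds(record):
    """Return the SessionEvent kinds one Claude Code record would emit."""
    rtype = record.get("type")
    if rtype not in ("user", "assistant"):
        return []  # lifecycle-style records are inferred elsewhere
    content = (record.get("message") or {}).get("content")
    if isinstance(content, str):  # assumption: content may be a bare string
        content = [{"type": "text", "text": content}]

    kinds = []
    for block in content or []:
        btype = block.get("type")
        if btype == "thinking":
            kinds.append("thinking")
        elif btype == "tool_use":
            is_edit = block.get("name") in EDIT_TOOLS
            kinds.append("edit" if is_edit else "tool_call")
        elif btype == "tool_result":
            kinds.append("tool_result")
        elif btype == "text":
            kinds.append("assistant_msg" if rtype == "assistant" else "user_msg")
    return kinds
```

Derived kinds such as `test_run`, `retry`, and `error` need context beyond a single record, so they would be a second pass over the emitted events.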
5. Retention & Eviction
The user's stated preference: storage-budget-based, dropping old content once it has been analyzed or once it no longer fits — better than a fixed daily/weekly window. We implement budget-based as primary, with a time backstop and a scheduled cadence as the trigger.
5.1 Configurable knobs
[session_memory.retention]
raw_soft_cap_bytes = "4GiB" # begin evicting analyzed sessions above this
raw_hard_cap_bytes = "6GiB" # absolute ceiling for Tier 1
raw_max_age_days = 45 # backstop: analyzed raw older than this is evictable regardless of space
distilled_cap_bytes = "1GiB" # Tier 2 ceiling (should grow slowly; alert, don't auto-drop)
cadence = "daily" # ingest+analyze+evict sweep: daily | weekly | on-hook
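The byte caps above are written as human-readable strings, so the loader has to convert them before comparing against usage. A minimal sketch, assuming binary (1024-based) units only; `parse_size` and `check_caps` are hypothetical names, not part of any existing config library.

```python
import re

_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
_SIZE_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(B|KiB|MiB|GiB|TiB)\s*")


def parse_size(text):
    """Convert a string such as '4GiB' into a byte count."""
    match = _SIZE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def check_caps(retention):
    """Return (soft, hard) in bytes; reject a soft cap above the hard cap."""
    soft = parse_size(retention["raw_soft_cap_bytes"])
    hard = parse_size(retention["raw_hard_cap_bytes"])
    if soft > hard:
        raise ValueError("raw_soft_cap_bytes must not exceed raw_hard_cap_bytes")
    return soft, hard
```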
5.2 Eviction algorithm (runs after each ingest+analyze sweep)
1. Compute current Tier 1 usage.
2. Backstop pass: evict any session where analyzed_at is set AND age > raw_max_age_days.
3. Budget pass: while usage > raw_soft_cap_bytes:
   - pick the oldest analyzed_at session that is not yet evicted;
   - drop its Tier 1 raw rows + blobs (Tier 2 digest is kept), set evicted_at;
   - if no analyzed-but-unevicted session remains, stop the budget pass (we will not destroy un-analyzed data to free space) and go to step 4.
4. Back-pressure / overflow: if usage > raw_hard_cap_bytes and the only remaining bulk is un-analyzed:
   - first try to analyze now (run extraction) to make those sessions evictable, then re-run the budget pass;
   - if still over hard cap (analysis can't keep up or fails), evict the oldest un-analyzed sessions as a last resort and emit a session_memory.data_loss warning event + a State Hub progress note. This is the only path that loses un-analyzed data, and it is always reported.
5. Tier 2 guard: if distilled usage > distilled_cap_bytes, do not auto-drop; flag for human/curation review (digests are the product).
Invariant: no session's raw bytes are dropped before its Tier 2 digest exists, except the explicitly-reported hard-cap overflow path.
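The sweep could look roughly like the in-memory sketch below. It is an illustration under stated assumptions, not the `retention.py` implementation: `RawSession`, `sweep`, and the `analyze` callback are hypothetical, "oldest" is read as earliest `analyzed_at`, age is passed in pre-computed, and the Tier 2 guard and warning emission are left out. The returned list is what the caller would report as `session_memory.data_loss`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawSession:
    session_uid: str
    raw_bytes: int
    age_days: float
    analyzed_at: Optional[float] = None
    evicted_at: Optional[float] = None


def sweep(rows, soft_cap, hard_cap, max_age_days, now, analyze=None):
    """Mutate rows in place; return uids evicted without a digest."""
    def usage():
        return sum(r.raw_bytes for r in rows if r.evicted_at is None)

    def analyzed_live():
        live = [r for r in rows
                if r.analyzed_at is not None and r.evicted_at is None]
        return sorted(live, key=lambda r: r.analyzed_at)

    def budget_pass():
        for r in analyzed_live():
            if usage() <= soft_cap:
                break
            r.evicted_at = now

    for r in analyzed_live():          # backstop pass
        if r.age_days > max_age_days:
            r.evicted_at = now
    budget_pass()                      # budget pass

    lost = []
    if usage() > hard_cap:             # back-pressure / overflow
        pending = [r for r in rows
                   if r.analyzed_at is None and r.evicted_at is None]
        if analyze is not None:
            for r in pending:
                if analyze(r):         # True once a digest was written
                    r.analyzed_at = now
            budget_pass()
        for r in sorted(pending, key=lambda r: r.age_days, reverse=True):
            if usage() <= hard_cap:
                break
            if r.analyzed_at is None:  # last resort: reported data loss
                r.evicted_at = now
                lost.append(r.session_uid)
    return lost
```

Note that the overflow branch is the only place an un-analyzed row is touched, which mirrors the invariant above.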
5.3 Why budget-based beats fixed-window
A fixed daily/weekly drop either deletes data we never analyzed (lossy) or hoards
data we already distilled (wasteful). Budget + analyzed_at watermark ties
deletion to two real conditions the user named — "once it has been analyzed"
(promoted to Tier 2) and "doesn't fit any longer" (over budget) — and only falls
back to time as a backstop.
6. Ingest Cursors (incremental, idempotent)
Per source, persist a small cursor so sweeps are cheap and re-runnable:
- Claude / Grok (per-cwd dirs): track (file_path, size, mtime) and last parsed line offset; re-ingest only grown/changed files. session_uid dedupes.
- Codex (date partitions): track last-seen YYYY/MM/DD + per-file offset.
- Ingest is idempotent keyed on (session_uid, seq), so it is safe to re-run after a crash or partial sweep.
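A simplified sketch of a per-file cursor with idempotent writes, using SQLite as proposed for the store. The table and column names (`cursor`, `event`, `next_seq`) are illustrative assumptions, the cursor tracks only a byte offset (not size and mtime), and one line is treated as one event.

```python
import sqlite3


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS cursor ("
               "file_path TEXT PRIMARY KEY, offset INTEGER, next_seq INTEGER)")
    db.execute("CREATE TABLE IF NOT EXISTS event ("
               "session_uid TEXT, seq INTEGER, body TEXT, "
               "PRIMARY KEY (session_uid, seq))")
    return db


def ingest_file(db, file_path, session_uid):
    """Ingest complete lines appended since the stored offset."""
    row = db.execute("SELECT offset, next_seq FROM cursor WHERE file_path = ?",
                     (file_path,)).fetchone()
    offset, next_seq = row if row else (0, 0)

    with open(file_path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    end = data.rfind(b"\n") + 1  # leave a trailing partial line for next sweep
    lines = data[:end].splitlines()

    with db:  # events and cursor commit together, or not at all
        for i, line in enumerate(lines):
            db.execute("INSERT OR REPLACE INTO event VALUES (?, ?, ?)",
                       (session_uid, next_seq + i, line.decode("utf-8")))
        db.execute("INSERT OR REPLACE INTO cursor VALUES (?, ?, ?)",
                   (file_path, offset + end, next_seq + len(lines)))
    return len(lines)
```

Because the cursor advances in the same transaction as the event rows, a crash mid-sweep rolls both back and the next run re-reads the same bytes onto the same `(session_uid, seq)` keys.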
7. Capture Modes
- Pull (default, portable): scheduled sweep scans Tier 0 by mtime/partition. Works for all three families with zero coupling to the agent. Triggered on the configured cadence via the repo's scheduler (/schedule, cron, or /loop).
- Push (optional, low-latency): wire the agent's own hooks to ping the ingester on session close (Claude Stop/SessionEnd hooks, Grok hooks/ACP, Codex exec wrappers). Push just enqueues; the same idempotent pull pipeline does the work.
Capture must be non-blocking (PRD FR-C5): we read copies of logs out-of-band; we never sit in the agent's critical path.
8. Component Layout (proposed, in-repo)
session-memory/
adapters/
claude.py # Tier0→Tier1 normalizer (verified schema)
codex.py # version-detecting normalizer (confirm against real rollout)
grok.py # reads session dir incl. events.jsonl
core/
schema.py # Session / SessionEvent dataclasses + versioning
store.py # Tier1 (rows+blobs) and Tier2 (digests) — SQLite to start
cursor.py # per-source ingest cursors
retention.py # §5 eviction algorithm
digest.py # Tier1→Tier2 session digest + signal stubs
ingest.py # one sweep: discover → normalize → analyze → evict
config.toml # §5.1 knobs + repo→domain map + source paths
Storage starts as SQLite + a blob dir (rows in SQLite, bulky payloads as files
under payload_ref); graduate to Postgres alongside the State Hub only if volume
demands. Digests/decisions are also surfaced to the hub per ADR-001 (files-first;
hub indexes).
9. Privacy / Safety
- Tier 0 logs can contain secrets (the Grok auth.json and Claude .credentials live in the same trees). The ingester reads only session transcripts, never credential files, and redacts obvious secret patterns before content is written into payload_ref blobs.
- All data is local; nothing leaves the workstation. Eviction of Tier 1 is a real delete (not just an index drop) so the bounded cache is also a privacy bound.
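As an illustration of "obvious secret patterns", a redaction pass might look like the sketch below. The two patterns and the `redact` name are assumptions chosen for the example; a real rule set would be broader and should be reviewed against actual transcripts.

```python
import re

# key=value or key: value where the key name suggests a credential
_KEY_VALUE = re.compile(
    r"(?i)\b(api[_-]?key|token|secret|password)\b(\s*[:=]\s*)(\S+)")
# HTTP bearer tokens
_BEARER = re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+")


def redact(text):
    """Mask credential-looking values while keeping the surrounding text."""
    text = _KEY_VALUE.sub(
        lambda m: m.group(1) + m.group(2) + "[REDACTED]", text)
    text = _BEARER.sub("Bearer [REDACTED]", text)
    return text
```

Keeping the key name and dropping only the value preserves enough context for later analysis of what the session was doing.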
10. Open Questions
- OQ1 (resolved, §2.2) Confirm Codex rollout-*.jsonl per-line schema. Resolved: {timestamp,type,payload} lines, type ∈ session_meta/response_item/event_msg/turn_context/compacted, tool calls flat-linked by call_id, tokens via event_msg/token_count. Remaining sub-item: verify the token_count payload field names against a real install when Codex is present (older-version variance only).
- OQ2 Outcome inference: how do we reliably label success/fail/abandoned across flavors (exit signals differ)? Start heuristic (last-turn + test results + human-intervention markers), refine in Detect phase.
- OQ3 task_ref resolution: can we always map a session to a workplan task (via cwd + branch + state-hub), or only sometimes?
- OQ4 (measured) Right default for raw_soft_cap_bytes. Measured (Phase 0, 85 real local Claude files / 63 distinct sessions): source bytes per session min 396 B · median ~49 KB · max 48 MB (one outlier) · ~103 MB total. Claude defaults (4 GiB soft / 6 GiB hard) leave ample headroom; revisit once Grok dirs (heavier, multi-file) are ingested in Phase 1.
- OQ6 (new, found in Phase 0) Multi-file sessions: ~84 transcript files mapped to ~63 session_uids; some sessions span multiple files (resume/sidechain sharing a sessionId). Current behavior upserts (last file wins per (session_uid, seq)); a future refinement is to merge events across files of one session rather than overwrite. Acceptable for Phase 0.
- OQ5 Should push-hooks be opt-in per machine to avoid surprising the agents?
11. Project metrics correlation (kaizen-agentic)
Helix Forge owns fleet-level session capture and digests (this repo). The
kaizen-agentic framework owns project-scoped agent execution metrics
(ADR-004: .kaizen/metrics/<agent>/executions.jsonl). The two layers correlate
by optional helix_session_uid on project records — link-by-reference, no
duplicate ingestion in either repo.
| Layer | Owner | Storage |
|---|---|---|
| Fleet | agentic-resources (Helix Forge) | digest store (digests table) |
| Project | kaizen-agentic | .kaizen/metrics/<agent>/executions.jsonl |
Cross-repo contract: Helix Forge Correlation Contract
(kaizen-agentic). Field mapping from Session.session_uid → helix_session_uid,
digest.cost → tokens, tool_histogram MCP share → infra_overhead_share.
Read path: kaizen-agentic metrics correlate <uid> looks up a digest via
HELIX_STORE_DB (this repo's session store). No write path from kaizen-agentic
into Helix Forge.
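The read path might be sketched as below. Only the `HELIX_STORE_DB` variable and the `digests` table name come from this document; the column names (`session_uid`, `digest_json`) and the `lookup_digest` function are hypothetical. Opening the database read-only is one way to make the "no write path" rule hard to violate by accident.

```python
import json
import os
import sqlite3
from pathlib import Path


def lookup_digest(helix_session_uid, db_path=None):
    """Return the digest dict for a session uid, or None if absent."""
    db_path = db_path or os.environ["HELIX_STORE_DB"]
    # mode=ro: the project side can read digests but never modify the store
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    db = sqlite3.connect(uri, uri=True)
    try:
        row = db.execute(
            "SELECT digest_json FROM digests WHERE session_uid = ?",
            (helix_session_uid,),
        ).fetchone()
    finally:
        db.close()
    return json.loads(row[0]) if row else None
```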
Related kaizen-agentic docs: ADR-004 project metrics convention, wiki/EcosystemIntegration.md.
Next step: [AGENTIC-WP-0002] implements Phase 0 — the schema, the Claude collector, the Tier1/Tier2 store, and the budget-based eviction sweep.
Sources
- Claude Code session format: verified on disk (~/.claude/projects/*/*.jsonl, ~/.claude/history.jsonl).
- Grok CLI session format: verified on disk (~/.grok/sessions/, ~/.grok/logs/unified.jsonl, ~/.grok/sessions/session_search.sqlite); ~/.grok/README.md (ACP/headless/hooks).
- Codex CLI session format: ccusage Codex guide, Codex advanced config, codex-trace, codex-logs, Session/Rollout Files discussion #3827, trajectory-JSON issue #2288.