
Design Document — Coding Session Memory

Domain: helix_forge
Repo: agentic-resources
Status: Draft v0.1
Author: Claude (drafted with Bernd Worsch)
Created: 2026-06-06
Updated: 2026-06-06
Related: PRD-helix-forge.md (this is the Capture + storage layer, FR-C* / §8)


1. Purpose

Helix Forge's loop (Capture → Detect → Curate → Distribute → Measure) needs a durable, bounded memory of coding sessions. This document specifies that memory: how we access each coding agent's session protocol, how we normalize those protocols into one schema, where we store the result, and how we age it out — preferring a storage-budget-based eviction that drops old raw content once it has been analyzed or no longer fits, rather than a naive fixed time window.

The guiding asymmetry: raw transcripts are bulky and re-derivable; the distilled analysis is small and precious. So we keep a bounded cache of raw sessions and a durable, compact layer of extracted digests/signals. Eviction targets the former, never the latter.

2. Research — How to Access Each Agent's Session Protocol

All three families persist sessions to the local filesystem as JSONL (plus, for Grok, a per-session directory). All findings below were verified against the live installs on this workstation (~/.claude, ~/.grok) and public docs (Codex; not installed here).

2.1 Claude Code (verified on disk)

| Aspect | Finding |
|---|---|
| Session transcripts | ~/.claude/projects/<url-encoded-cwd>/<session-uuid>.jsonl (one JSONL per session) |
| Subagent sidechains | same dir, agent-<id>.jsonl; records carry isSidechain: true |
| Global prompt history | ~/.claude/history.jsonl |
| Record format | one JSON object per line; type discriminates: user, assistant, attachment, queue-operation, ai-title, last-prompt, summary, plus tool-result records |
| Key fields | type, timestamp, sessionId, uuid, parentUuid (turn DAG), message (role + content blocks: text/thinking/tool_use/tool_result), cwd, gitBranch, version, requestId, toolUseResult, userType |
| Token usage | inside assistant message.usage (input/output/cache tokens) |
| Model | message.model (e.g. claude-opus-4-8) |
| Side data | ~/.claude/todos/, ~/.claude/tasks/, ~/.claude/file-history/, ~/.claude/shell-snapshots/ |
| Live capture hook | Claude Code SessionEnd / Stop / SessionStart hooks can fire our ingest on session close (push), in addition to batch scanning (pull) |

The turn DAG (uuid/parentUuid) lets us reconstruct branching, retries, and sidechains exactly.
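A minimal sketch of that reconstruction, assuming every linked record carries uuid and parentUuid as listed in the table above. The function names build_turn_dag and branch_points are illustrative, not part of any existing adapter.

```python
import json
from collections import defaultdict


def build_turn_dag(lines):
    """Index transcript lines by uuid and group children under parentUuid."""
    records = {}
    children = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        uid = rec.get("uuid")
        if uid is None:
            # Assumption: records without a uuid are not part of the DAG.
            continue
        records[uid] = rec
        children[rec.get("parentUuid")].append(uid)
    # A root has no parent, or a parent that is not in this file.
    roots = [u for u, r in records.items() if r.get("parentUuid") not in records]
    return records, dict(children), roots


def branch_points(children):
    """Parents with more than one child mark a retry or a branch."""
    return [p for p, kids in children.items() if p is not None and len(kids) > 1]
```

Records whose parent lives in another file (for example a resumed session) show up as extra roots, which is one way to detect the multi-file case.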

2.2 OpenAI Codex CLI (schema confirmed from source; not installed locally)

Schema confirmed from the openai/codex source (codex-rs/protocol/src/protocol.rs, via DeepWiki) and from a reverse-engineering writeup with real example lines; the two sources agree.

| Aspect | Finding |
|---|---|
| Session ("rollout") files | $CODEX_HOME/sessions/YYYY/MM/DD/rollout-*.jsonl (default $CODEX_HOME = ~/.codex) |
| Line wrapper (RolloutLine) | every line: {timestamp, type, payload} (UTC ts + a RolloutItem) |
| type discriminator | session_meta · response_item · event_msg · turn_context · compacted |
| session_meta | {id, source, cwd, model_provider, cli_version} (+ model); restores env |
| turn_context | {model, approval_policy, sandbox_policy}; per-turn settings snapshot |
| response_item | raw model output / tool calls; payload.type → message · function_call · function_call_output · reasoning |
| message | {role: developer / user / assistant, content: [{type: "output_text" / …, text}]} |
| function_call | {name, arguments (JSON string), call_id} |
| function_call_output | {call_id, output} |
| event_msg | protocol events; payload.type → task_started · task_complete · user_message · agent_message · token_count · lifecycle |
| Token usage | event_msg with payload.type = token_count, interspersed (no fixed cadence) |
| Turn linkage | flat: tool calls/outputs linked by call_id, no parent-ref DAG; causality inferred from temporal order (unlike Claude's uuid/parentUuid) |
| Schema versions | older installs differ ("new ≥0.44 / mid / oldest 2025/08"); adapter version-detects on session_meta.cli_version |
| Naming / resume | filenames + session_id auto-generated; codex resume --last; codex exec for headless (trajectory-JSON is gh issue #2288) |
| Override location | CODEX_HOME env var |

Adapter notes: map event_msg/task_started|task_complete → lifecycle events and outcome; response_item/message → user_msg/assistant_msg; function_call + function_call_output → tool_call/tool_result joined on call_id; response_item/reasoning → thinking; event_msg/token_count → cost block. Because there is no parent-ref DAG, the adapter assigns seq/parent_seq from temporal order rather than native links.
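The call_id join and the temporal seq assignment can be sketched as follows. This is an illustration under assumptions, not the adapter itself: it handles only response_item lines, assumes timestamps are uniformly formatted UTC strings that sort lexicographically, and pointing a tool_result's parent_seq at its matching tool_call is a choice made for this sketch. The name normalize_rollout is hypothetical.

```python
import json


def normalize_rollout(lines):
    """Turn Codex rollout lines into flat events with seq/parent_seq."""
    items = [json.loads(l) for l in lines if l.strip()]
    # Sort by timestamp; the sort is stable, so ties keep file order.
    items.sort(key=lambda it: it["timestamp"])
    events = []
    call_seq = {}  # call_id -> seq of the tool_call that opened it
    for it in items:
        if it["type"] != "response_item":
            continue
        p = it["payload"]
        seq = len(events)
        ev = {"seq": seq, "parent_seq": seq - 1 if seq else None,
              "ts": it["timestamp"]}
        if p["type"] == "function_call":
            ev.update(kind="tool_call", tool=p["name"], call_id=p["call_id"])
            call_seq[p["call_id"]] = seq
        elif p["type"] == "function_call_output":
            ev.update(kind="tool_result", call_id=p["call_id"])
            # Link the result to its call when the call was seen.
            ev["parent_seq"] = call_seq.get(p["call_id"], ev["parent_seq"])
        elif p["type"] == "reasoning":
            ev["kind"] = "thinking"
        elif p["type"] == "message":
            ev["kind"] = ("assistant_msg" if p.get("role") == "assistant"
                          else "user_msg")
        else:
            continue  # unknown payload types are skipped in this sketch
        events.append(ev)
    return events
```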

2.3 Grok CLI (xAI) (verified on disk)

Grok stores a directory per session, which is the richest source of the three.

| Aspect | Finding |
|---|---|
| Session dir | ~/.grok/sessions/<url-encoded-cwd>/<session-uuid>/ |
| chat_history.jsonl | full conversation; type = system/user/assistant + content |
| events.jsonl | structured lifecycle events: {ts, type, session_id, turn_number, model_id, yolo_mode, conversation_message_count, session_relationship, schema_version}; types like turn_started, loop_started |
| updates.jsonl | streaming incremental updates |
| summary.json | {id, cwd, session_summary, created_at, updated_at} |
| prompt_context.json | injected context, incl. which AGENTS.md/CLAUDE.md files were loaded |
| system_prompt.txt | exact system prompt for the session |
| rewind_points.jsonl, plan_mode.json | rewind/plan-mode state |
| Per-cwd prompt history | ~/.grok/sessions/<cwd>/prompt_history.jsonl: {timestamp, session_id, prompt, is_bash} |
| Global structured log | ~/.grok/logs/unified.jsonl: {ts, src, pid, lvl, msg, ctx, sid, ver} |
| Search index | ~/.grok/sessions/session_search.sqlite: session_docs(session_id, cwd, updated_at, title) + FTS5 (session_docs_fts), which we can query directly |
| Integration surfaces | Grok exposes ACP (Agent Client Protocol), headless mode (grok -p), and hooks (~/.grok/docs/user-guide/10-hooks.md) as push-capture options |
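For pull capture, the search index offers a cheap way to find recently touched sessions without walking every directory. The sketch below assumes only the session_docs columns named above; the storage type of updated_at has not been checked, so the comparison value must match whatever Grok writes there. The helper names are hypothetical.

```python
import sqlite3

COLUMNS = ("session_id", "cwd", "updated_at", "title")


def open_readonly(path):
    """Open the index read-only so Tier 0 is never mutated."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def sessions_updated_since(conn, since):
    """Return session_docs rows whose updated_at is newer than `since`."""
    rows = conn.execute(
        "SELECT session_id, cwd, updated_at, title FROM session_docs "
        "WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    )
    return [dict(zip(COLUMNS, row)) for row in rows]
```

The returned session_id and cwd are enough to locate the per-session directory for the adapter to read.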

2.4 Cross-family summary

| Aspect | Claude Code | Codex CLI | Grok CLI |
|---|---|---|---|
| Root | ~/.claude/projects/ | ~/.codex/sessions/ | ~/.grok/sessions/ |
| Unit | one .jsonl/session | one rollout-*.jsonl/session | one dir/session |
| Layout | flat per-cwd dir | date-partitioned YYYY/MM/DD | per-cwd, per-session dir |
| Discriminator | type | type (version-dependent) | type (in chat_history/events) |
| Lifecycle events | inferred from records | inferred from records | explicit events.jsonl |
| Token usage | message.usage | event_msg token_count lines | from events/updates |
| Push capture | Stop/SessionEnd hooks | codex exec wrappers | hooks / ACP |
| Pull capture | scan dir by mtime | scan date partitions | scan dirs / query FTS sqlite |

Implication: the common denominator is "JSONL records discriminated by a type field, with a session id, timestamps, turn linkage, tool calls, and token usage." That maps cleanly onto one normalized schema (§4). Per-family quirks (Grok's explicit events.jsonl, Codex's schema versions, Claude's sidechains) are handled inside each adapter.

3. Tiered Storage Model

 Tier 0  SOURCE (agents' own logs)        read-only, never mutated
         ~/.claude/projects  ~/.codex/sessions  ~/.grok/sessions
                 │  collector adapters (per family) + ingest cursor
                 ▼
 Tier 1  RAW CACHE (bounded, EVICTABLE)   normalized Session + Event records
                 │  signal extractors / digesters
                 ▼
 Tier 2  DISTILLED MEMORY (durable, small)  session digests + signals + pattern evidence
  • Tier 0 — Source. The agents' own logs. We treat them as read-only. We keep a small ingest cursor per source so re-scans are incremental (see §6).
  • Tier 1 — Raw cache. Normalized copies of sessions/events. This is the bulky tier and the only tier subject to budget eviction.
  • Tier 2 — Distilled memory. Per-session digest (outcome, costs, tool histogram, error/retry/intervention markers, key snippets) plus extracted signals and pattern evidence pointers. Compact and durable. A session can be fully evicted from Tier 1 once its Tier 2 digest exists.

This is what makes "drop old content once it has been analyzed" safe: analysis promotes the valuable bits into Tier 2 before the raw bytes are dropped.

3.1 Per-session lifecycle / watermarks

Each session row carries timestamps that drive eviction:

discovered_at → ingested_at → analyzed_at → [evictable] → evicted_at
  • ingested_at set when normalized into Tier 1.
  • analyzed_at set when the Tier 2 digest is written. A session is evictable iff analyzed_at is set.
  • evicted_at set when raw bytes are dropped from Tier 1 (Tier 2 digest remains).
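The watermark rules reduce to a small pure function. A minimal sketch, assuming timestamps are stored as nullable ISO strings; the Watermarks class and both helpers are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Watermarks:
    discovered_at: Optional[str] = None
    ingested_at: Optional[str] = None
    analyzed_at: Optional[str] = None
    evicted_at: Optional[str] = None


def lifecycle_state(w: Watermarks) -> str:
    """Map the watermarks onto the lifecycle stage the session has reached."""
    if w.evicted_at is not None:
        return "evicted"
    if w.analyzed_at is not None:
        return "evictable"
    if w.ingested_at is not None:
        return "ingested"
    return "discovered"


def is_evictable(w: Watermarks) -> bool:
    """Evictable iff analyzed and raw bytes are still present in Tier 1."""
    return w.analyzed_at is not None and w.evicted_at is None
```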

4. Normalized Schema (Tier 1)

Two record kinds. Field names are stable across all adapters.

4.1 Session

{
  "session_uid": "claude:17092961-…",      // "<flavor>:<native id>", globally unique
  "flavor": "claude|codex|grok",
  "native_session_id": "17092961-…",
  "repo": "agentic-resources",             // resolved from cwd
  "domain": "helix_forge",                 // resolved from repo→domain map
  "cwd": "/home/worsch/agentic-resources",
  "git_branch": "main",
  "model": "claude-opus-4-8",
  "started_at": "2026-06-05T21:59:30Z",
  "ended_at": "2026-06-05T22:14:00Z",
  "outcome": "success|fail|abandoned|unknown",
  "cost": { "input_tokens": 0, "output_tokens": 0, "cache_tokens": 0,
            "wall_clock_s": 0, "turns": 0, "retries": 0 },
  "task_ref": "AGENTIC-WP-0002-T01",       // if derivable; else null
  "source_path": "~/.claude/projects/…/….jsonl",
  "source_bytes": 0,
  "schema_version": 1,
  "ingested_at": "…", "analyzed_at": null, "evicted_at": null
}

4.2 SessionEvent

{
  "session_uid": "claude:17092961-…",
  "seq": 12,                               // monotonic within session
  "parent_seq": 11,                        // turn DAG (Claude uuid/parentUuid)
  "ts": "2026-06-05T22:01:13Z",
  "kind": "user_msg|assistant_msg|thinking|tool_call|tool_result|error|test_run|edit|retry|human_intervention|decision|lifecycle|completion",
  "role": "user|assistant|system|tool",
  "tool": "Bash|Edit|Read|…",              // when kind=tool_call/result
  "summary": "ran pytest -q",              // short, human-readable
  "payload_ref": "blob://…",               // pointer to full content in Tier 1 blob store
  "tokens": 0,
  "is_sidechain": false
}

Adapters map native records onto kind. Grok's events.jsonl populates lifecycle/turn events directly; Claude/Codex lifecycle is inferred from the record stream. Bulky bodies live behind payload_ref so Tier 1 rows stay light and blobs can be evicted independently.

4.3 Native → kind mapping (all three families)

Each cell is the native record/discriminator an adapter reads to emit that SessionEvent.kind. Where a kind is not natively present, the adapter synthesizes or omits it.

| kind | Claude Code (type / message) | Codex CLI (type → payload.type) | Grok CLI (file → type) |
|---|---|---|---|
| user_msg | user, message.role=user | response_item → message role=user/developer | chat_history → user |
| assistant_msg | assistant, message.role=assistant, content text | response_item → message role=assistant (output_text) | chat_history → assistant |
| thinking | assistant content block type=thinking | response_item → reasoning | chat_history/updates reasoning block |
| tool_call | assistant content block type=tool_use (name, input) | response_item → function_call (name, arguments, call_id) | chat_history/updates tool-call entry |
| tool_result | user/tool record type=tool_result + toolUseResult | response_item → function_call_output (join on call_id) | updates tool-result entry |
| test_run | derived from tool_call (Bash running tests) | derived from function_call (exec_command) | derived from tool-call entry |
| edit | tool_use where name ∈ Edit/Write/NotebookEdit | function_call apply-patch/file-write tool | tool-call entry (edit/write) |
| error | toolUseResult error / non-zero result | function_call_output error / event_msg error | events.jsonl error / failed update |
| retry | repeated tool_use after error (inferred via DAG) | repeated function_call after error (inferred, temporal) | events.jsonl loop/retry event |
| human_intervention | user record mid-turn (interrupt), userType | event_msg → user_message mid-task | prompt_history mid-session / events.jsonl |
| decision | recorded out-of-band (State Hub /decisions) | recorded out-of-band (State Hub) | recorded out-of-band (State Hub) |
| lifecycle | inferred: first/last record, summary, queue-operation | event_msg → task_started / task_complete | events.jsonl → turn_started/loop_started/… (explicit) |
| completion | inferred: last assistant + Stop/SessionEnd hook | event_msg → task_complete | events.jsonl turn end + summary.json |

Linkage note (drives seq/parent_seq): Claude has a true turn DAG (uuid/parentUuid) — preserve it directly. Codex is flat, joined only by call_id; assign seq by temporal order. Grok carries explicit turn_number in events.jsonl; key seq off that plus record order.

Cost block sources: Claude message.usage; Codex event_msg/token_count; Grok events.jsonl / updates.jsonl token fields.
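As an illustration of the Claude column, the sketch below classifies the content blocks of one record into kinds. It covers only the directly mapped rows (user_msg, assistant_msg, thinking, tool_call, tool_result, edit); derived kinds such as test_run, error, and retry need more context and are left out. It assumes message.content is either a plain string or a list of typed blocks, and claude_kinds is a hypothetical name.

```python
EDIT_TOOLS = {"Edit", "Write", "NotebookEdit"}


def claude_kinds(record):
    """Return a list of (kind, tool) pairs for one Claude transcript record."""
    rtype = record.get("type")
    content = (record.get("message") or {}).get("content")
    if isinstance(content, str):
        # Assumption: a bare string is equivalent to a single text block.
        content = [{"type": "text", "text": content}]
    out = []
    for block in content or []:
        btype = block.get("type")
        if btype == "text":
            kind = "assistant_msg" if rtype == "assistant" else "user_msg"
            out.append((kind, None))
        elif btype == "thinking":
            out.append(("thinking", None))
        elif btype == "tool_use":
            name = block.get("name")
            out.append(("edit" if name in EDIT_TOOLS else "tool_call", name))
        elif btype == "tool_result":
            out.append(("tool_result", None))
    return out
```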

5. Retention & Eviction

The user's stated preference: storage-budget-based, dropping old content once it has been analyzed or once it no longer fits — better than a fixed daily/weekly window. We implement budget-based as primary, with a time backstop and a scheduled cadence as the trigger.

5.1 Configurable knobs

[session_memory.retention]
raw_soft_cap_bytes   = "4GiB"   # begin evicting analyzed sessions above this
raw_hard_cap_bytes   = "6GiB"   # absolute ceiling for Tier 1
raw_max_age_days     = 45       # backstop: analyzed raw older than this is evictable regardless of space
distilled_cap_bytes  = "1GiB"   # Tier 2 ceiling (should grow slowly; alert, don't auto-drop)
cadence              = "daily"  # ingest+analyze+evict sweep: daily | weekly | on-hook
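The size knobs are written as strings such as "4GiB", so the loader has to turn them into byte counts before the eviction pass can compare them. A minimal sketch, assuming binary units only and an already-parsed config dict; parse_size and resolve_retention are hypothetical helpers.

```python
import re

_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(B|KiB|MiB|GiB|TiB)")
SIZE_KEYS = ("raw_soft_cap_bytes", "raw_hard_cap_bytes", "distilled_cap_bytes")


def parse_size(text):
    """Convert a string like '4GiB' or '512 MiB' into a byte count."""
    m = _SIZE_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(m.group(1)) * _UNITS[m.group(2)])


def resolve_retention(cfg):
    """Return a copy of the retention table with size knobs as integers."""
    out = dict(cfg)
    for key in SIZE_KEYS:
        out[key] = parse_size(cfg[key])
    if out["raw_soft_cap_bytes"] > out["raw_hard_cap_bytes"]:
        raise ValueError("raw_soft_cap_bytes must not exceed raw_hard_cap_bytes")
    return out
```

Rejecting a soft cap above the hard cap at load time keeps the eviction pass from having to handle that case.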

5.2 Eviction algorithm (runs after each ingest+analyze sweep)

  1. Compute current Tier 1 usage.
  2. Backstop pass: evict any session where analyzed_at is set AND age > raw_max_age_days.
  3. Budget pass: while usage > raw_soft_cap_bytes:
    • pick the oldest analyzed_at session that is not yet evicted;
    • drop its Tier 1 raw rows + blobs (Tier 2 digest is kept), set evicted_at;
    • if no analyzed-but-unevicted session remains, stop the budget pass (we will not destroy un-analyzed data to free space) and go to step 4.
  4. Back-pressure / overflow: if usage > raw_hard_cap_bytes and the only remaining bulk is un-analyzed:
    • first try to analyze now (run extraction) to make those sessions evictable, then re-run the budget pass;
    • if still over hard cap (analysis can't keep up or fails), evict the oldest un-analyzed sessions as a last resort and emit a session_memory.data_loss warning event + a State Hub progress note. This is the only path that loses un-analyzed data, and it is always reported.
  5. Tier 2 guard: if distilled usage > distilled_cap_bytes, do not auto-drop; flag for human/curation review (digests are the product).

Invariant: no session's raw bytes are dropped before its Tier 2 digest exists, except the explicitly-reported hard-cap overflow path.
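Steps 1 to 4 can be sketched over in-memory rows as below. This is an illustration, not the contents of retention.py: the Row fields, the sweep signature, and the optional analyze callback are assumptions, age is passed in as a precomputed number of days, and the Tier 2 guard (step 5) and event emission are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Row:
    uid: str
    raw_bytes: int
    age_days: float
    analyzed_at: Optional[float] = None  # None means not yet analyzed
    evicted: bool = False


def sweep(rows, soft_cap, hard_cap, max_age_days,
          analyze: Optional[Callable[[Row], None]] = None):
    """Run one eviction sweep; return (evicted_uids, data_loss_uids)."""
    evicted, lost = [], []

    def live():
        return [r for r in rows if not r.evicted]

    def usage():
        return sum(r.raw_bytes for r in live())

    def drop(row, bucket):
        row.evicted = True
        bucket.append(row.uid)

    def budget_pass():
        while usage() > soft_cap:
            cands = [r for r in live() if r.analyzed_at is not None]
            if not cands:
                break  # never destroy un-analyzed data in this pass
            drop(min(cands, key=lambda r: r.analyzed_at), evicted)

    # Step 2: time backstop for analyzed sessions.
    for r in live():
        if r.analyzed_at is not None and r.age_days > max_age_days:
            drop(r, evicted)
    # Step 3: budget pass down to the soft cap.
    budget_pass()
    # Step 4: hard-cap overflow.
    if usage() > hard_cap:
        if analyze is not None:
            for r in live():
                if r.analyzed_at is None:
                    analyze(r)  # may leave analyzed_at unset on failure
            budget_pass()
        while usage() > hard_cap:
            pending = [r for r in live() if r.analyzed_at is None]
            if not pending:
                break
            drop(max(pending, key=lambda r: r.age_days), lost)  # reported loss
    return evicted, lost
```

Returning the data-loss list separately makes it straightforward for the caller to emit the session_memory.data_loss warning for exactly those sessions.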

5.3 Why budget-based beats fixed-window

A fixed daily/weekly drop either deletes data we never analyzed (lossy) or hoards data we already distilled (wasteful). Budget + analyzed_at watermark ties deletion to two real conditions the user named — "once it has been analyzed" (promoted to Tier 2) and "doesn't fit any longer" (over budget) — and only falls back to time as a backstop.

6. Ingest Cursors (incremental, idempotent)

Per source, persist a small cursor so sweeps are cheap and re-runnable:

  • Claude / Grok (per-cwd dirs): track (file_path, size, mtime) and last parsed line offset; re-ingest only grown/changed files. session_uid dedupes.
  • Codex (date partitions): track last-seen YYYY/MM/DD + per-file offset.
  • Ingest is idempotent keyed on (session_uid, seq) — safe to re-run after a crash or partial sweep.
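The cursor check and the idempotent write can be sketched with SQLite, which is the planned starting store. Table and column names here (ingest_cursor, event, line_offset) are illustrative assumptions, and the event row is cut down to three columns.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the cursor and event tables if they do not exist yet."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS ingest_cursor("
        "file_path TEXT PRIMARY KEY, size INTEGER, mtime REAL, line_offset INTEGER)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS event("
        "session_uid TEXT, seq INTEGER, summary TEXT, "
        "PRIMARY KEY (session_uid, seq))"
    )
    return db


def needs_ingest(db, file_path, size, mtime):
    """True when the file is new or its size/mtime changed since last sweep."""
    row = db.execute(
        "SELECT size, mtime FROM ingest_cursor WHERE file_path = ?", (file_path,)
    ).fetchone()
    return row is None or row != (size, mtime)


def upsert_events(db, events):
    """Write (session_uid, seq, summary) rows; re-running replaces in place."""
    db.executemany("INSERT OR REPLACE INTO event VALUES (?, ?, ?)", events)


def save_cursor(db, file_path, size, mtime, line_offset):
    """Record how far the file has been parsed."""
    db.execute(
        "INSERT OR REPLACE INTO ingest_cursor VALUES (?, ?, ?, ?)",
        (file_path, size, mtime, line_offset),
    )
```

Saving the cursor only after the events are written means a crash between the two leads to a harmless re-ingest rather than a gap.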

7. Capture Modes

  • Pull (default, portable): scheduled sweep scans Tier 0 by mtime/partition. Works for all three families with zero coupling to the agent. Triggered on the configured cadence via the repo's scheduler (/schedule, cron, or /loop).
  • Push (optional, low-latency): wire the agent's own hooks to ping the ingester on session close — Claude Stop/SessionEnd hooks, Grok hooks/ACP, Codex exec wrappers. Push just enqueues; the same idempotent pull pipeline does the work.

Capture must be non-blocking (PRD FR-C5): we read copies of logs out-of-band; we never sit in the agent's critical path.

8. Component Layout (proposed, in-repo)

session-memory/
  adapters/
    claude.py      # Tier0→Tier1 normalizer (verified schema)
    codex.py       # version-detecting normalizer (confirm against real rollout)
    grok.py        # reads session dir incl. events.jsonl
  core/
    schema.py      # Session / SessionEvent dataclasses + versioning
    store.py       # Tier1 (rows+blobs) and Tier2 (digests) — SQLite to start
    cursor.py      # per-source ingest cursors
    retention.py   # §5 eviction algorithm
    digest.py      # Tier1→Tier2 session digest + signal stubs
  ingest.py        # one sweep: discover → normalize → analyze → evict
  config.toml      # §5.1 knobs + repo→domain map + source paths

Storage starts as SQLite + a blob dir (rows in SQLite, bulky payloads as files under payload_ref); graduate to Postgres alongside the State Hub only if volume demands. Digests/decisions are also surfaced to the hub per ADR-001 (files-first; hub indexes).

9. Privacy / Safety

  • Tier 0 logs can contain secrets (the Grok auth.json and Claude .credentials live in the same trees). The ingester reads only session transcripts, never credential files, and redacts obvious secret patterns before content is written to payload_ref blobs.
  • All data is local; nothing leaves the workstation. Eviction of Tier 1 is a real delete (not just an index drop) so the bounded cache is also a privacy bound.
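A redaction pass of the kind described above might look like the sketch below. The three patterns are illustrative examples only (an AWS-style access key id, a bearer token, and key=value style assignments); they are not the project's actual rule set, and pattern matching of this sort catches only obvious cases.

```python
import re

# Each rule is (pattern, replacement); \1 keeps the non-secret prefix.
_RULES = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED]"),
    (re.compile(r"(?i)(bearer\s+)[A-Za-z0-9._-]{20,}"), r"\1[REDACTED]"),
    (re.compile(r"(?i)\b((?:api[_-]?key|token|password)\s*[:=]\s*)\S+"),
     r"\1[REDACTED]"),
]


def redact(text):
    """Replace obvious secret patterns before the text is stored as a blob."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text
```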

10. Open Questions

  • OQ1 Confirm Codex rollout-*.jsonl per-line schema. Resolved (§2.2): {timestamp, type, payload} lines; type is one of session_meta/response_item/event_msg/turn_context/compacted; tool calls flat-linked by call_id; tokens via event_msg/token_count. Remaining sub-item: verify the token_count payload field names against a real install when Codex is present (older-version variance only).
  • OQ2 Outcome inference: how do we reliably label success/fail/abandoned across flavors (exit signals differ)? Start heuristic (last-turn + test results + human-intervention markers), refine in Detect phase.
  • OQ3 task_ref resolution — can we always map a session to a workplan task (via cwd + branch + state-hub), or only sometimes?
  • OQ4 Right default for raw_soft_cap_bytes. Measured (Phase 0, 85 real local Claude files / 63 distinct sessions): source bytes per session min 396 B · median ~49 KB · max 48 MB (one outlier) · ~103 MB total. For Claude transcripts, the defaults (4 GiB soft / 6 GiB hard) leave ample headroom; revisit once Grok dirs (heavier, multi-file) are ingested in Phase 1.
  • OQ5 Should push-hooks be opt-in per machine to avoid surprising the agents?
  • OQ6 (new, found in Phase 0) Multi-file sessions: 85 transcript files mapped to 63 session_uids, so some sessions span multiple files (resume/sidechain sharing a sessionId). Current behavior upserts (last file wins per (session_uid, seq)); a future refinement is to merge events across files of one session rather than overwrite. Acceptable for Phase 0.

Next step: [AGENTIC-WP-0002] implements Phase 0 — the schema, the Claude collector, the Tier1/Tier2 store, and the budget-based eviction sweep.

Sources