From ffe191d44e10c1198d46e626950ce1d4399bf1ef Mon Sep 17 00:00:00 2001 From: tegwick Date: Sat, 6 Jun 2026 19:00:30 +0200 Subject: [PATCH] Add Helix Forge PRD, session-memory design, and Phase 0 workplan MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - docs/PRD-helix-forge.md: Capture→Detect→Curate→Distribute→Measure loop - docs/DESIGN-session-memory.md: tiered store + budget-based eviction; verified session-log schemas for Claude/Codex/Grok - workplans/AGENTIC-WP-0002: Phase 0 (registered with State Hub) Co-Authored-By: Claude Opus 4.8 --- docs/DESIGN-session-memory.md | 372 ++++++++++++++++++ docs/PRD-helix-forge.md | 279 +++++++++++++ .../AGENTIC-WP-0002-session-memory-phase0.md | 135 +++++++ 3 files changed, 786 insertions(+) create mode 100644 docs/DESIGN-session-memory.md create mode 100644 docs/PRD-helix-forge.md create mode 100644 workplans/AGENTIC-WP-0002-session-memory-phase0.md diff --git a/docs/DESIGN-session-memory.md b/docs/DESIGN-session-memory.md new file mode 100644 index 0000000..7aa00e1 --- /dev/null +++ b/docs/DESIGN-session-memory.md @@ -0,0 +1,372 @@ +# Design Document — Coding Session Memory + +**Domain:** helix_forge +**Repo:** agentic-resources +**Status:** Draft v0.1 +**Author:** Claude (drafted with Bernd Worsch) +**Created:** 2026-06-06 +**Updated:** 2026-06-06 +**Related:** [PRD-helix-forge.md](./PRD-helix-forge.md) (this is the Capture + storage layer, FR-C* / §8) + +--- + +## 1. Purpose + +Helix Forge's loop (Capture → Detect → Curate → Distribute → Measure) needs a +durable, bounded **memory of coding sessions**. This document specifies that +memory: how we **access** each coding agent's session protocol, how we +**normalize** those protocols into one schema, where we **store** the result, and +how we **age it out** — preferring a *storage-budget-based* eviction that drops +old raw content once it has been analyzed or no longer fits, rather than a naive +fixed time window. 
+ +The guiding asymmetry: **raw transcripts are bulky and re-derivable; the distilled +analysis is small and precious.** So we keep a *bounded cache* of raw sessions and +a *durable, compact* layer of extracted digests/signals. Eviction targets the +former, never the latter. + +## 2. Research — How to Access Each Agent's Session Protocol + +All three families persist sessions to the local filesystem as JSONL (plus, for +Grok, a per-session directory). All findings below were verified against the live +installs on this workstation (`~/.claude`, `~/.grok`) and public docs (Codex; not +installed here). + +### 2.1 Claude Code ✅ verified on disk + +| Aspect | Finding | +|--------|---------| +| Session transcripts | `~/.claude/projects//.jsonl` — one JSONL per session | +| Subagent sidechains | same dir, `agent-.jsonl`; records carry `isSidechain: true` | +| Global prompt history | `~/.claude/history.jsonl` | +| Record format | one JSON object per line; **`type`** discriminates: `user`, `assistant`, `attachment`, `queue-operation`, `ai-title`, `last-prompt`, `summary`, plus tool-result records | +| Key fields | `type`, `timestamp`, `sessionId`, `uuid`, `parentUuid` (turn DAG), `message` (`role` + content blocks: `text`/`thinking`/`tool_use`/`tool_result`), `cwd`, `gitBranch`, `version`, `requestId`, `toolUseResult`, `userType` | +| Token usage | inside assistant `message.usage` (input/output/cache tokens) | +| Model | `message.model` (e.g. `claude-opus-4-8`) | +| Side data | `~/.claude/todos/`, `~/.claude/tasks/`, `~/.claude/file-history/`, `~/.claude/shell-snapshots/` | +| Live capture hook | Claude Code **SessionEnd / Stop / SessionStart hooks** can fire our ingest on session close (push), in addition to batch scanning (pull) | + +The turn DAG (`uuid`/`parentUuid`) lets us reconstruct branching, retries, and +sidechains exactly. 
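The reconstruction described in §2.1 can be sketched in a few lines. This is a minimal illustration, not the adapter itself: it assumes root records have a null or absent `parentUuid`, that records without a `uuid` can be skipped, and that a parent with more than one child indicates a branch or retry. The function names `build_turn_dag` and `branch_points` are hypothetical.

```python
import json
from collections import defaultdict


def build_turn_dag(jsonl_lines):
    """Index records by uuid and group children under their parentUuid."""
    records = {}
    children = defaultdict(list)
    roots = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        uid = rec.get("uuid")
        if uid is None:
            # Assumption: a record without a uuid has no position in the DAG.
            continue
        records[uid] = rec
        parent = rec.get("parentUuid")
        if parent is None:
            # Assumption: a null or missing parentUuid marks a root record.
            roots.append(uid)
        else:
            children[parent].append(uid)
    return records, children, roots


def branch_points(children):
    """Return parents with more than one child (candidate branches or retries)."""
    return [uid for uid, kids in children.items() if len(kids) > 1]
```

Because the function only needs an iterable of lines, it can be fed an open transcript file directly, for example `build_turn_dag(open(path, encoding="utf-8"))`. Sidechain records could then be separated by checking `isSidechain` on each entry in `records`.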
+ +### 2.2 OpenAI Codex CLI ✅ schema confirmed from source (not installed locally) + +Schema confirmed from the openai/codex source (`codex-rs/protocol/src/protocol.rs` +via DeepWiki) and a reverse-engineering writeup with real example lines — the two +cross-agree. + +| Aspect | Finding | +|--------|---------| +| Session ("rollout") files | `$CODEX_HOME/sessions/YYYY/MM/DD/rollout-*.jsonl` (default `$CODEX_HOME = ~/.codex`) | +| Line wrapper (`RolloutLine`) | every line: **`{timestamp, type, payload}`** (UTC ts + a `RolloutItem`) | +| `type` discriminator | `session_meta` · `response_item` · `event_msg` · `turn_context` · `compacted` | +| `session_meta` | `{id, source, cwd, model_provider, cli_version}` (+ model) — restores env | +| `turn_context` | `{model, approval_policy, sandbox_policy}` — per-turn settings snapshot | +| `response_item` | raw model output / tool calls; `payload.type` ∈ `message` · `function_call` · `function_call_output` · `reasoning` | +| → `message` | `{role: developer\|user\|assistant, content:[{type:"output_text"\|…, text}]}` | +| → `function_call` | `{name, arguments (JSON string), call_id}` | +| → `function_call_output` | `{call_id, output}` | +| `event_msg` | protocol events; `payload.type` ∈ `task_started` · `task_complete` · `user_message` · `agent_message` · `token_count` · lifecycle | +| Token usage | `event_msg` with `payload.type = token_count`, interspersed (no fixed cadence) | +| Turn linkage | **flat — tool calls/outputs linked by `call_id`, no parent-ref DAG**; causality inferred from temporal order (unlike Claude's `uuid`/`parentUuid`) | +| Schema versions | older installs differ ("new ≥0.44 / mid / oldest 2025/08"); adapter version-detects on `session_meta.cli_version` | +| Naming / resume | filenames + `session_id` auto-generated; `codex resume --last`; `codex exec` for headless (trajectory-JSON is gh issue #2288) | +| Override location | `CODEX_HOME` env var | + +**Adapter notes:** map `event_msg/task_started|task_complete` 
→ `lifecycle` +events and outcome; `response_item/message` → `user_msg`/`assistant_msg`; +`function_call`+`function_call_output` → `tool_call`/`tool_result` joined on +`call_id`; `response_item/reasoning` → `thinking`; `event_msg/token_count` → cost +block. Because there is no parent-ref DAG, the adapter assigns `seq`/`parent_seq` +from temporal order rather than native links. + +### 2.3 Grok CLI (xAI) ✅ verified on disk + +Grok stores **a directory per session**, which is the richest source of the three. + +| Aspect | Finding | +|--------|---------| +| Session dir | `~/.grok/sessions///` | +| `chat_history.jsonl` | full conversation; `type` = `system`/`user`/`assistant` + content | +| `events.jsonl` | **structured lifecycle events** — `{ts, type, session_id, turn_number, model_id, yolo_mode, conversation_message_count, session_relationship, schema_version}`; types like `turn_started`, `loop_started` | +| `updates.jsonl` | streaming incremental updates | +| `summary.json` | `{id, cwd, session_summary, created_at, updated_at}` | +| `prompt_context.json` | injected context, incl. 
which AGENTS.md/CLAUDE.md files were loaded | +| `system_prompt.txt` | exact system prompt for the session | +| `rewind_points.jsonl`, `plan_mode.json` | rewind/plan-mode state | +| Per-cwd prompt history | `~/.grok/sessions//prompt_history.jsonl`: `{timestamp, session_id, prompt, is_bash}` | +| Global structured log | `~/.grok/logs/unified.jsonl`: `{ts, src, pid, lvl, msg, ctx, sid, ver}` | +| Search index | `~/.grok/sessions/session_search.sqlite`: `session_docs(session_id, cwd, updated_at, title)` + FTS5 (`session_docs_fts`) we can query directly | +| Integration surfaces | Grok exposes **ACP (Agent Client Protocol)**, **headless mode** (`grok -p`), and **hooks** (`~/.grok/docs/user-guide/10-hooks.md`); these are push-capture options | + +### 2.4 Cross-family summary + +| | Claude Code | Codex CLI | Grok CLI | +|--|--|--|--| +| Root | `~/.claude/projects/` | `~/.codex/sessions/` | `~/.grok/sessions/` | +| Unit | one `.jsonl`/session | one `rollout-*.jsonl`/session | one **dir**/session | +| Layout | flat per-cwd dir | date-partitioned `YYYY/MM/DD` | per-cwd, per-session dir | +| Discriminator | `type` | `type` (version-dependent) | `type` (in `chat_history`/`events`) | +| Lifecycle events | inferred from records | `event_msg` (`task_started`/`task_complete`) | **explicit** `events.jsonl` | +| Token usage | `message.usage` | `event_msg` → `token_count` | from events/updates | +| Push capture | Stop/SessionEnd hooks | `codex exec` wrappers | hooks / ACP | +| Pull capture | scan dir by mtime | scan date partitions | scan dirs / query FTS sqlite | + +**Implication:** the common denominator is *"JSONL records discriminated by a +`type` field, with a session id, timestamps, turn linkage, tool calls, and token +usage."* That maps cleanly onto one normalized schema (§4). Per-family quirks +(Grok's explicit `events.jsonl`, Codex's schema versions, Claude's sidechains) are +handled inside each adapter. + +## 3. 
Tiered Storage Model + +``` + Tier 0 SOURCE (agents' own logs) read-only, never mutated + ~/.claude/projects ~/.codex/sessions ~/.grok/sessions + │ collector adapters (per family) + ingest cursor + ▼ + Tier 1 RAW CACHE (bounded, EVICTABLE) normalized Session + Event records + │ signal extractors / digesters + ▼ + Tier 2 DISTILLED MEMORY (durable, small) session digests + signals + pattern evidence +``` + +- **Tier 0 — Source.** The agents' own logs. We treat them as read-only. We keep a + small **ingest cursor** per source so re-scans are incremental (see §6). +- **Tier 1 — Raw cache.** Normalized copies of sessions/events. This is the bulky + tier and the *only* tier subject to budget eviction. +- **Tier 2 — Distilled memory.** Per-session **digest** (outcome, costs, tool + histogram, error/retry/intervention markers, key snippets) plus extracted + **signals** and **pattern evidence pointers**. Compact and durable. A session can + be fully evicted from Tier 1 once its Tier 2 digest exists. + +This is what makes "drop old content once it has been analyzed" safe: analysis +*promotes* the valuable bits into Tier 2 before the raw bytes are dropped. + +### 3.1 Per-session lifecycle / watermarks + +Each session row carries timestamps that drive eviction: + +``` +discovered_at → ingested_at → analyzed_at → [evictable] → evicted_at +``` + +- `ingested_at` set when normalized into Tier 1. +- `analyzed_at` set when the Tier 2 digest is written. **A session is evictable iff + `analyzed_at` is set.** +- `evicted_at` set when raw bytes are dropped from Tier 1 (Tier 2 digest remains). + +## 4. Normalized Schema (Tier 1) + +Two record kinds. Field names are stable across all adapters. 
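Besides its descriptive fields, the session record carries the §3.1 watermarks that drive eviction. The following is a minimal sketch of that rule, assuming Python dataclasses (in line with the `core/schema.py` idea in §8). The class name `SessionWatermarks` and the `is_evictable` property are hypothetical, and excluding already-evicted rows is an assumption taken from the "not yet evicted" wording in §5.2.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class SessionWatermarks:
    """Hypothetical holder for the per-session lifecycle timestamps of §3.1."""

    ingested_at: Optional[datetime] = None  # set when normalized into Tier 1
    analyzed_at: Optional[datetime] = None  # set when the Tier 2 digest is written
    evicted_at: Optional[datetime] = None   # set when raw bytes leave Tier 1

    @property
    def is_evictable(self) -> bool:
        # §3.1: a session is evictable iff analyzed_at is set.
        # Assumption: rows whose raw bytes are already gone are not candidates.
        return self.analyzed_at is not None and self.evicted_at is None


marks = SessionWatermarks(ingested_at=datetime(2026, 6, 5, tzinfo=timezone.utc))
print(marks.is_evictable)  # False until a digest has been written
```

Keeping the rule in one property means the retention sweep and any reporting code would share a single definition of "evictable".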
+ +### 4.1 `Session` + +```jsonc +{ + "session_uid": "claude:17092961-…", // ":", globally unique + "flavor": "claude" | "codex" | "grok", + "native_session_id": "17092961-…", + "repo": "agentic-resources", // resolved from cwd + "domain": "helix_forge", // resolved from repo→domain map + "cwd": "/home/worsch/agentic-resources", + "git_branch": "main", + "model": "claude-opus-4-8", + "started_at": "2026-06-05T21:59:30Z", + "ended_at": "2026-06-05T22:14:00Z", + "outcome": "success|fail|abandoned|unknown", + "cost": { "input_tokens": 0, "output_tokens": 0, "cache_tokens": 0, + "wall_clock_s": 0, "turns": 0, "retries": 0 }, + "task_ref": "AGENTIC-WP-0002-T01", // if derivable; else null + "source_path": "~/.claude/projects/…/….jsonl", + "source_bytes": 0, + "schema_version": 1, + "ingested_at": "…", "analyzed_at": null, "evicted_at": null +} +``` + +### 4.2 `SessionEvent` + +```jsonc +{ + "session_uid": "claude:17092961-…", + "seq": 12, // monotonic within session + "parent_seq": 11, // turn DAG (Claude uuid/parentUuid) + "ts": "2026-06-05T22:01:13Z", + "kind": "user_msg | assistant_msg | thinking | tool_call | tool_result" + + "| error | test_run | edit | retry | human_intervention | decision" + + "| lifecycle | completion", + "role": "user|assistant|system|tool", + "tool": "Bash|Edit|Read|…", // when kind=tool_call/result + "summary": "ran pytest -q", // short, human-readable + "payload_ref": "blob://…", // pointer to full content in Tier 1 blob store + "tokens": 0, + "is_sidechain": false +} +``` + +Adapters map native records onto `kind`. Grok's `events.jsonl` populates +`lifecycle`/`turn` events directly; Claude/Codex lifecycle is inferred from the +record stream. Bulky bodies live behind `payload_ref` so Tier 1 rows stay light +and blobs can be evicted independently. + +### 4.3 Native → `kind` mapping (all three families) + +Each cell is the native record/discriminator an adapter reads to emit that +`SessionEvent.kind`. 
`—` = not natively present; the adapter synthesizes or omits. + +| `kind` | Claude Code (`type` / `message`) | Codex CLI (`type` → `payload.type`) | Grok CLI (file → `type`) | +|--------|----------------------------------|--------------------------------------|---------------------------| +| `user_msg` | `user`, `message.role=user` | `response_item` → `message` `role=user`/`developer` | `chat_history` → `user` | +| `assistant_msg` | `assistant`, `message.role=assistant`, content `text` | `response_item` → `message` `role=assistant` (`output_text`) | `chat_history` → `assistant` | +| `thinking` | `assistant` content block `type=thinking` | `response_item` → `reasoning` | `chat_history`/`updates` reasoning block | +| `tool_call` | `assistant` content block `type=tool_use` (`name`,`input`) | `response_item` → `function_call` (`name`,`arguments`,`call_id`) | `chat_history`/`updates` tool-call entry | +| `tool_result` | `user`/tool record `type=tool_result` + `toolUseResult` | `response_item` → `function_call_output` (join on `call_id`) | `updates` tool-result entry | +| `test_run` | derived from `tool_call` (Bash running tests) | derived from `function_call` (`exec_command`) | derived from tool-call entry | +| `edit` | `tool_use` where `name` ∈ Edit/Write/NotebookEdit | `function_call` apply-patch/file-write tool | tool-call entry (edit/write) | +| `error` | `toolUseResult` error / non-zero result | `function_call_output` error / `event_msg` error | `events.jsonl` error / failed update | +| `retry` | repeated `tool_use` after error (inferred via DAG) | repeated `function_call` after error (inferred, temporal) | `events.jsonl` loop/retry event | +| `human_intervention` | `user` record mid-turn (interrupt), `userType` | `event_msg` → `user_message` mid-task | `prompt_history` mid-session / `events.jsonl` | +| `decision` | recorded out-of-band (State Hub `/decisions`) | recorded out-of-band (State Hub) | recorded out-of-band (State Hub) | +| `lifecycle` | inferred: 
first/last record, `summary`, `queue-operation` | `event_msg` → `task_started` / `task_complete` | **`events.jsonl`** → `turn_started`/`loop_started`/… (explicit) | +| `completion` | inferred: last `assistant` + `Stop`/`SessionEnd` hook | `event_msg` → `task_complete` | `events.jsonl` turn end + `summary.json` | + +**Linkage note (drives `seq`/`parent_seq`):** Claude has a true turn DAG +(`uuid`/`parentUuid`) — preserve it directly. Codex is **flat**, joined only by +`call_id`; assign `seq` by temporal order. Grok carries explicit `turn_number` in +`events.jsonl`; key `seq` off that plus record order. + +**Cost block sources:** Claude `message.usage`; Codex `event_msg/token_count`; +Grok `events.jsonl` / `updates.jsonl` token fields. + +## 5. Retention & Eviction + +The user's stated preference: **storage-budget-based**, dropping old content once +it has been analyzed or once it no longer fits — *better than* a fixed daily/weekly +window. We implement budget-based as primary, with a time backstop and a scheduled +cadence as the trigger. + +### 5.1 Configurable knobs + +```toml +[session_memory.retention] +raw_soft_cap_bytes = "4GiB" # begin evicting analyzed sessions above this +raw_hard_cap_bytes = "6GiB" # absolute ceiling for Tier 1 +raw_max_age_days = 45 # backstop: analyzed raw older than this is evictable regardless of space +distilled_cap_bytes = "1GiB" # Tier 2 ceiling (should grow slowly; alert, don't auto-drop) +cadence = "daily" # ingest+analyze+evict sweep: daily | weekly | on-hook +``` + +### 5.2 Eviction algorithm (runs after each ingest+analyze sweep) + +1. **Compute** current Tier 1 usage. +2. **Backstop pass:** evict any session where `analyzed_at` is set AND + `age > raw_max_age_days`. +3. 
**Budget pass:** while `usage > raw_soft_cap_bytes`: + - pick the **oldest `analyzed_at`** session that is not yet evicted; + - drop its Tier 1 raw rows + blobs (Tier 2 digest is kept), set `evicted_at`; + - if **no analyzed-but-unevicted session remains**, stop the budget pass + (we will not destroy un-analyzed data to free space) and go to step 4. +4. **Back-pressure / overflow:** if `usage > raw_hard_cap_bytes` and the only + remaining bulk is **un-analyzed**: + - first try to **analyze now** (run extraction) to make those sessions + evictable, then re-run the budget pass; + - if still over hard cap (analysis can't keep up or fails), evict the **oldest + un-analyzed** sessions as a last resort and emit a + `session_memory.data_loss` warning event + a State Hub progress note. This is + the only path that loses un-analyzed data, and it is always reported. +5. **Tier 2 guard:** if distilled usage > `distilled_cap_bytes`, **do not + auto-drop**; flag for human/curation review (digests are the product). + +**Invariant:** *no session's raw bytes are dropped before its Tier 2 digest +exists, except the explicitly-reported hard-cap overflow path.* + +### 5.3 Why budget-based beats fixed-window + +A fixed daily/weekly drop either deletes data we never analyzed (lossy) or hoards +data we already distilled (wasteful). Budget + `analyzed_at` watermark ties +deletion to **two** real conditions the user named — *"once it has been analyzed"* +(promoted to Tier 2) and *"doesn't fit any longer"* (over budget) — and only falls +back to time as a backstop. + +## 6. Ingest Cursors (incremental, idempotent) + +Per source, persist a small cursor so sweeps are cheap and re-runnable: + +- **Claude / Grok (per-cwd dirs):** track `(file_path, size, mtime)` and last + parsed line offset; re-ingest only grown/changed files. `session_uid` dedupes. +- **Codex (date partitions):** track last-seen `YYYY/MM/DD` + per-file offset. 
+- Ingest is **idempotent** keyed on `(session_uid, seq)` — safe to re-run after a + crash or partial sweep. + +## 7. Capture Modes + +- **Pull (default, portable):** scheduled sweep scans Tier 0 by mtime/partition. + Works for all three families with zero coupling to the agent. Triggered on the + configured `cadence` via the repo's scheduler (`/schedule`, cron, or `/loop`). +- **Push (optional, low-latency):** wire the agent's own hooks to ping the ingester + on session close — Claude `Stop`/`SessionEnd` hooks, Grok hooks/ACP, Codex + `exec` wrappers. Push just enqueues; the same idempotent pull pipeline does the + work. + +Capture must be **non-blocking** (PRD FR-C5): we read copies of logs out-of-band; +we never sit in the agent's critical path. + +## 8. Component Layout (proposed, in-repo) + +``` +session-memory/ + adapters/ + claude.py # Tier0→Tier1 normalizer (verified schema) + codex.py # version-detecting normalizer (confirm against real rollout) + grok.py # reads session dir incl. events.jsonl + core/ + schema.py # Session / SessionEvent dataclasses + versioning + store.py # Tier1 (rows+blobs) and Tier2 (digests) — SQLite to start + cursor.py # per-source ingest cursors + retention.py # §5 eviction algorithm + digest.py # Tier1→Tier2 session digest + signal stubs + ingest.py # one sweep: discover → normalize → analyze → evict + config.toml # §5.1 knobs + repo→domain map + source paths +``` + +Storage starts as **SQLite + a blob dir** (rows in SQLite, bulky payloads as files +under `payload_ref`); graduate to Postgres alongside the State Hub only if volume +demands. Digests/decisions are also surfaced to the hub per ADR-001 (files-first; +hub indexes). + +## 9. Privacy / Safety + +- Tier 0 logs can contain secrets (the Grok `auth.json` and Claude `.credentials` + live in the same trees). The ingester reads **only** session transcripts, never + credential files, and **redacts** obvious secret patterns into `payload_ref` + blobs. 
+- All data is local; nothing leaves the workstation. Eviction of Tier 1 is a real + delete (not just an index drop) so the bounded cache is also a privacy bound. + +## 10. Open Questions + +- ~~**OQ1** Confirm Codex `rollout-*.jsonl` per-line schema.~~ **Resolved** (§2.2): + `{timestamp,type,payload}` lines, `type` ∈ `session_meta`/`response_item`/`event_msg`/`turn_context`/`compacted`, + tool calls flat-linked by `call_id`, tokens via `event_msg/token_count`. Remaining + sub-item: verify the `token_count` payload field names against a real install when + Codex is present (older-version variance only). +- **OQ2** Outcome inference: how do we reliably label `success/fail/abandoned` + across flavors (exit signals differ)? Start heuristic (last-turn + test results + + human-intervention markers), refine in Detect phase. +- **OQ3** `task_ref` resolution — can we always map a session to a workplan task + (via cwd + branch + state-hub), or only sometimes? +- **OQ4** Right default for `raw_soft_cap_bytes` — measure real per-session sizes + first (Grok dirs are heavier than Claude single-files). +- **OQ5** Should push-hooks be opt-in per machine to avoid surprising the agents? + +--- + +*Next step: [AGENTIC-WP-0002] implements Phase 0 — the schema, the Claude +collector, the Tier1/Tier2 store, and the budget-based eviction sweep.* + +## Sources + +- Claude Code session format — verified on disk: `~/.claude/projects/*/*.jsonl`, `~/.claude/history.jsonl`. +- Grok CLI session format — verified on disk: `~/.grok/sessions/`, `~/.grok/logs/unified.jsonl`, `~/.grok/sessions/session_search.sqlite`; `~/.grok/README.md` (ACP/headless/hooks). 
+- Codex CLI session format — [ccusage Codex guide](https://ccusage.com/guide/codex/), [Codex advanced config](https://developers.openai.com/codex/config-advanced), [codex-trace](https://github.com/PixelPaw-Labs/codex-trace), [codex-logs](https://github.com/wondercoms/codex-logs), [Session/Rollout Files discussion #3827](https://github.com/openai/codex/discussions/3827), [trajectory-JSON issue #2288](https://github.com/openai/codex/issues/2288). diff --git a/docs/PRD-helix-forge.md b/docs/PRD-helix-forge.md new file mode 100644 index 0000000..55c6bb0 --- /dev/null +++ b/docs/PRD-helix-forge.md @@ -0,0 +1,279 @@ +# Product Requirements Document — Helix Forge + +**Domain:** helix_forge +**Repo:** agentic-resources +**Status:** Draft v0.1 +**Author:** Claude (drafted with Bernd Worsch) +**Created:** 2026-06-06 +**Updated:** 2026-06-06 + +--- + +## 1. Summary + +Helix Forge is a system for **handling a collection of repositories and evolving +the utility of what those repositories provide**, by treating the coding sessions +run against them as a first-class data source. + +Concretely: across a fleet of repos worked on by multiple coding agents (Claude, +Codex, GrokBuild), Helix Forge **inspects the sessions**, **collects data about the +problems agents hit and the moves that resolved them**, and turns that data into +**reusable solution patterns** that can be discussed, implemented, and re-applied — +across every agent flavor, not just the one that discovered the pattern. + +The name is the metaphor: a *helix* of repeated turns (session → pattern → improved +session) feeding a *forge* where the tooling, environments, and instructions for our +agents are hammered into better shape over time. This is the operational engine +behind the INTENT.md goal of an *antifragile, continuously-optimizing agentic +ecosystem*. + +## 2. Problem Statement + +We run many coding sessions, across many repos, with several different agents. 
Today +the value of each session is **trapped in that session**: + +- When an agent solves a tricky problem, the solution is not captured in a form + another agent (or the same agent next week) can reuse. +- When an agent fails, struggles, or burns excess budget on a problem, that failure + signal is lost — we re-encounter the same friction repeatedly. +- Each agent flavor (Claude, Codex, GrokBuild) has its own environment, instruction + format, and extension mechanism, so a fix discovered for one is **not portable** to + the others without manual translation. +- We have no systematic, evidence-based answer to "what is actually slowing our + agents down, and what consistently makes them faster?" — decisions about tooling, + prompts, and environments are made on anecdote. + +**The cost:** repeated mistakes, non-transferable wins, slow and uneven improvement +of agent performance, and no feedback loop from real session data back into the +tools/environments/instructions that shape future sessions. + +## 3. Goals & Non-Goals + +### 3.1 Goals + +| # | Goal | +|---|------| +| G1 | **Capture** coding sessions from Claude, Codex, and GrokBuild in a normalized, comparable form. | +| G2 | **Detect** recurring *problem patterns* (failure, friction, wasted budget) and *success patterns* (efficient resolutions) from that data. | +| G3 | **Curate** detected patterns into a reviewed catalog of *solution patterns* that humans and agents can discuss and approve. | +| G4 | **Distribute** approved patterns back into agent environments — as instructions, tools, or extensions — in a per-flavor-appropriate form. | +| G5 | **Measure** whether distributed patterns actually improved subsequent sessions (close the loop). | +| G6 | Keep the whole loop **agent-flavor-agnostic at the core**, with thin per-flavor adapters at the edges. 
| + +### 3.2 Non-Goals (initial release) + +- Not a replacement for the coding agents themselves; Helix Forge observes and + improves them, it does not execute coding tasks. +- Not a general APM/observability product; scope is coding-session improvement, not + arbitrary infrastructure monitoring. +- Not an autonomous self-modifying system — pattern promotion into live agent + environments requires human approval (HITL) for the first release. +- Not building new model training/fine-tuning pipelines; we optimize *context, + tooling, and environment*, not model weights. +- Not replacing the Custodian State Hub; Helix Forge is a producer/consumer of hub + state, not a competing system of record. (See §9.) + +## 4. Users & Personas + +| Persona | Description | What they need from Helix Forge | +|---------|-------------|----------------------------------| +| **Operator (Bernd)** | Owns the agentic ecosystem; decides which patterns become standards. | A reviewable catalog of patterns with evidence; control over what ships to agents. | +| **Coding agent (Claude / Codex / GrokBuild)** | Runs tasks in a repo; both the *source* of session data and the *consumer* of patterns. | To emit session data cheaply; to receive applicable patterns in its native format at session start. | +| **Repo maintainer agent** | The per-repo agent persona (e.g. `agentic-resources`) following AGENTS.md conventions. | Patterns scoped to its repo/domain; integration via existing workplan + state-hub flow. | +| **Reviewer (human or kaizen agent)** | Evaluates candidate patterns before they become standards. | Clear pattern proposals, supporting evidence, and a discuss/approve/reject workflow. | + +## 5. Core Concepts (Domain Model) + +- **Session** — one bounded run of a coding agent against a repo. Has an agent flavor, + repo, task reference, timeline of events, outcome, and cost (tokens/time). 
+- **Session Event** — a normalized atomic record within a session: tool call, edit, + test run, error, retry, human intervention, decision, completion. +- **Signal** — a derived indicator extracted from sessions: e.g. *repeated test + failure on same file*, *budget overrun*, *fast clean resolution*, *retry storm*, + *human escalation*. +- **Problem Pattern** — a recurring negative signal cluster ("agents repeatedly fail + X because Y"). +- **Success Pattern** — a recurring positive resolution ("doing Z reliably resolves X + cheaply"). +- **Solution Pattern** — a curated, reviewed artifact pairing a problem with one or + more recommended resolutions, written agent-flavor-agnostically, with per-flavor + rendering hints. +- **Pattern Application** — the act of distributing a solution pattern into a specific + agent environment (an instruction snippet, a tool, an extension), plus the record of + its effect on later sessions. + +## 6. Functional Requirements + +### 6.1 Capture (G1) + +- **FR-C1** Ingest session transcripts/logs from each supported agent flavor via a + per-flavor **collector adapter**. +- **FR-C2** Normalize raw logs into the common `Session` + `Session Event` schema, + regardless of source flavor. +- **FR-C3** Tag every session with: agent flavor, repo, domain, task/workplan id (if + any), outcome (success/fail/abandoned), and cost metrics (tokens, wall-clock, + retries). +- **FR-C4** Support both **batch import** (historical logs) and **incremental ingest** + (new sessions as they close). +- **FR-C5** Collection must be low-friction and non-blocking — an agent emitting + session data must never slow or break the actual coding task. + +### 6.2 Detect (G2) + +- **FR-D1** Run signal extractors over normalized sessions to surface problem and + success signals. +- **FR-D2** Cluster recurring signals across sessions/repos/flavors into candidate + Problem Patterns and Success Patterns. 
+- **FR-D3** For each candidate pattern, attach **evidence**: the supporting sessions, + frequency, affected repos, affected flavors, and estimated cost impact. +- **FR-D4** Flag **cross-flavor** patterns explicitly (a problem seen in Claude that + Codex also hits) — these are the highest-value reuse targets. + +### 6.3 Curate (G3) + +- **FR-U1** Present candidate patterns for review with their evidence in a + discuss/approve/reject workflow. +- **FR-U2** Allow a reviewer (human or kaizen agent) to promote a candidate into a + **Solution Pattern**: a named, versioned artifact with problem description, + recommended resolution(s), applicability scope, and per-flavor rendering hints. +- **FR-U3** Maintain a **Pattern Catalog** as the source of truth for approved + solution patterns, versioned and stored as files in-repo (consistent with ADR-001: + files originate work, the hub indexes them). +- **FR-U4** Record pattern decisions through the State Hub decision mechanism so + rationale is auditable. + +### 6.4 Distribute (G4) + +- **FR-X1** Render each approved solution pattern into per-flavor artifacts via + **distributor adapters**: + - Claude → `CLAUDE.md` snippets, skills, or settings/hooks. + - Codex → `AGENTS.md` snippets / repo conventions. + - GrokBuild → its native instruction/extension format. +- **FR-X2** Scope distribution by repo and domain, so a pattern only lands where it + applies. +- **FR-X3** Distribution is **proposed, not auto-applied** in v1 — output is a + reviewable change (e.g. a workplan or PR), gated by human approval. +- **FR-X4** Track which patterns are currently active in which environments. + +### 6.5 Measure (G5) + +- **FR-M1** After a pattern is applied, compare subsequent sessions touching the same + signal against the pre-application baseline (cost, retry rate, success rate, + human-intervention rate). +- **FR-M2** Surface per-pattern **effectiveness** so ineffective patterns can be + revised or retired. 
+- **FR-M3** Provide a fleet-level view: are sessions across the collection getting + cheaper / more reliable over time? (the helix turning.) + +### 6.6 Multi-Agent Support (G6) + +- **FR-A1** The core schema, detection, catalog, and measurement are **flavor-agnostic**. +- **FR-A2** All flavor-specific knowledge lives in **collector adapters** (input) and + **distributor adapters** (output). Adding a fourth agent = adding one collector + + one distributor, no core changes. +- **FR-A3** A successful pattern discovered via one flavor MUST be expressible for all + other supported flavors. + +## 7. Architecture Overview + +``` + ┌──────────── per-flavor edges ────────────┐ ┌──── flavor-agnostic core ────┐ + │ │ │ │ + Claude ─┐ │ │ │ + Codex ─┼─► Collector Adapters ──► Normalizer ─┼────────►│ Session + Event Store │ + Grok ─┘ │ │ │ │ + │ │ ▼ │ + │ │ Signal Extractors │ + │ │ │ │ + │ │ ▼ │ + │ │ Pattern Detector / Clusterer│ + │ │ │ │ + │ │ ▼ │ + │ │ Curation + Pattern Catalog │ ◄─ reviewer (human/kaizen) + │ │ │ │ + Claude ◄┐ │ │ ▼ │ + Codex ◄┼── Distributor Adapters ◄────────────┼─────────│ Effectiveness Measurement │ + Grok ◄┘ │ │ │ + └───────────────────────────────────────────┘ └──────────────────────────────┘ + ▲ feeds back into ▲ tools / environments / instructions +``` + +**Design principle:** *agnostic core, thin adapters at the edges.* The expensive, +reusable intelligence (normalized sessions, detection, catalog, measurement) is built +once; each agent flavor only needs an input adapter and an output adapter. + +## 8. Data & Storage + +- **Pattern Catalog** and **workplans**: files in `agentic-resources` (per ADR-001 in + AGENTS.md — files are the source of truth, the hub indexes them). +- **Session/event data**: a local store (start simple: structured files / SQLite; + graduate to Postgres alongside the State Hub if volume warrants). 
+- **Decisions & progress**: recorded through the Custodian State Hub so the broader + ecosystem stays aware of Helix Forge's activity. + +## 9. Integration with the Custodian State Hub + +Helix Forge runs inside the `helix_forge` domain and is **not** a competing system of +record: + +- Work originates as **workplans** in this repo (`AGENTIC-WP-NNNN`), synced via + `make fix-consistency REPO=agentic-resources`. +- Pattern-promotion and distribution decisions are logged via the hub's decision API. +- Each Helix Forge run logs at least one `add_progress_event()` / `POST /progress/`. +- The hub remains a **read model**; Helix Forge writes its durable artifacts as files + and lets the hub index them. + +## 10. Success Metrics + +| Metric | Meaning | Target (directional, v1) | +|--------|---------|--------------------------| +| Sessions captured | Coverage of real work | ≥ 90% of sessions across the 3 flavors normalized | +| Patterns cataloged | Knowledge made reusable | A growing, non-trivial catalog of reviewed solution patterns | +| Cross-flavor patterns | Reuse leverage | ≥ 1 pattern proven to transfer across flavors | +| Pattern effectiveness | Loop is closing | Applied patterns show measurable cost/reliability improvement vs. baseline | +| Fleet trend | The helix turns | Median session cost ↓ and success rate ↑ over time | +| Repeated-failure rate | Friction eliminated | Known problem patterns recur less after distribution | + +## 11. Phasing / Roadmap + +- **Phase 0 — Foundations.** Define the Session/Event schema and Pattern Catalog + format. One collector adapter (Claude) + batch import. Manual inspection only. +- **Phase 1 — Detect.** Signal extractors + pattern clustering over captured sessions; + candidate patterns surfaced with evidence. Add Codex + GrokBuild collectors. +- **Phase 2 — Curate.** Review workflow + versioned Pattern Catalog, wired to hub + decisions. 
+- **Phase 3 — Distribute.** Distributor adapters for all three flavors; patterns ship + as reviewable workplans/PRs (HITL). +- **Phase 4 — Measure.** Baseline-vs-after effectiveness and fleet-level trend + reporting; retire ineffective patterns. Loop is closed. + +## 12. Open Questions + +- **OQ1** What is the canonical raw log format available from each of Claude, Codex, + and GrokBuild today, and how lossy is normalization from each? +- **OQ2** How are sessions reliably bounded and attributed to a repo/task across the + three flavors? +- **OQ3** Where does detection logic run — local batch jobs, hub-side, or a dedicated + service? What volume do we actually expect? +- **OQ4** Pattern format: how do we keep one agnostic representation while giving each + distributor enough to render high-quality native artifacts? +- **OQ5** What's the minimum trustworthy evidence bar before a pattern is allowed to be + distributed to live agent environments? +- **OQ6** How do we prevent pattern bloat — too many low-value instructions degrading + agent context budgets (cf. the token-budget policy in global instructions)? + +## 13. Risks + +| Risk | Mitigation | +|------|------------| +| Capture overhead slows real coding sessions | Async, non-blocking collection (FR-C5); never in the agent's critical path. | +| Patterns become noise / context bloat | Effectiveness gating (FR-M2) + retirement; measure before broad distribution. | +| Over-fitting to one flavor | Agnostic core + explicit cross-flavor flagging (FR-D4, FR-A3). | +| Bad pattern degrades agents | HITL approval before distribution (FR-X3); baseline measurement to catch regressions. | +| Drift from State Hub conventions | Files-first per ADR-001; log via hub; no competing source of record. | + +--- + +*This PRD is a draft for discussion. 
Next step: a `proposed` workplan +(`AGENTIC-WP-0002`) scoping Phase 0 — the Session/Event schema and the first +(Claude) collector adapter.* diff --git a/workplans/AGENTIC-WP-0002-session-memory-phase0.md b/workplans/AGENTIC-WP-0002-session-memory-phase0.md new file mode 100644 index 0000000..10ed7dc --- /dev/null +++ b/workplans/AGENTIC-WP-0002-session-memory-phase0.md @@ -0,0 +1,135 @@ +--- +id: AGENTIC-WP-0002 +type: workplan +title: "Coding Session Memory — Phase 0 (Capture + budget-based retention)" +domain: helix_forge +repo: agentic-resources +status: active +owner: codex +topic_slug: helix-forge +created: "2026-06-06" +updated: "2026-06-06" +state_hub_workstream_id: "06e6726d-057d-47d8-84f4-0974858f6288" +--- + +# Coding Session Memory — Phase 0 + +Implements Phase 0 of [PRD-helix-forge](../docs/PRD-helix-forge.md) per the +[session-memory design](../docs/DESIGN-session-memory.md): a normalized session +schema, the first (Claude) collector, a two-tier store, and a budget-based +eviction sweep that drops analyzed/over-budget raw content while preserving +compact digests. + +Scope is deliberately one agent flavor (Claude, schema verified on disk) end to +end, so the agnostic-core / thin-adapter boundary is proven before Codex and Grok +adapters land in Phase 1. + +## Define Normalized Session Schema + +```task +id: AGENTIC-WP-0002-T01 +status: todo +priority: high +state_hub_task_id: "61297a16-257c-4579-bd1f-3db035781258" +``` + +Implement `core/schema.py` with the `Session` and `SessionEvent` dataclasses from +design §4, including `schema_version`, the `flavor`-prefixed `session_uid`, the +cost block, and the `discovered/ingested/analyzed/evicted` watermarks. Add +round-trip (de)serialization tests. This is the contract every adapter targets. 
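As a rough illustration of the T01 contract, the sketch below shows one way the dataclasses and the round-trip test could fit together. Only `schema_version`, the `flavor`-prefixed `session_uid`, the cost block, and the four watermarks come from the task text. The other field names (`native_id`, `seq`, `kind`, `payload_ref`, the token counters) and the `:` separator are assumptions made for the example; design §4 remains the authority.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

SCHEMA_VERSION = 1  # assumed starting version


@dataclass
class Cost:
    # Hypothetical cost block; the real fields are defined in design §4.
    input_tokens: int = 0
    output_tokens: int = 0
    cache_tokens: int = 0


@dataclass
class SessionEvent:
    seq: int                           # position within the session
    kind: str                          # illustrative values: "user", "assistant", "tool_use"
    timestamp: str                     # ISO 8601 string
    payload_ref: Optional[str] = None  # pointer to a bulky body stored elsewhere


@dataclass
class Session:
    flavor: str                        # "claude", "codex", "grok"
    native_id: str                     # the agent's own session id (assumed name)
    schema_version: int = SCHEMA_VERSION
    cost: Cost = field(default_factory=Cost)
    events: List[SessionEvent] = field(default_factory=list)
    # Lifecycle watermarks; None means "has not happened yet".
    discovered_at: Optional[str] = None
    ingested_at: Optional[str] = None
    analyzed_at: Optional[str] = None
    evicted_at: Optional[str] = None

    @property
    def session_uid(self) -> str:
        # Flavor-prefixed so ids from different agents cannot collide.
        return f"{self.flavor}:{self.native_id}"

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Session":
        data = json.loads(raw)
        # Rebuild the nested dataclasses that asdict() flattened to dicts.
        data["cost"] = Cost(**data["cost"])
        data["events"] = [SessionEvent(**e) for e in data["events"]]
        return cls(**data)
```

Deriving `session_uid` as a property rather than storing it keeps the serialized form free of redundant data; whether the real schema stores it explicitly is a design §4 decision.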
+ +## Claude Collector Adapter + +```task +id: AGENTIC-WP-0002-T02 +status: todo +priority: high +state_hub_task_id: "3b4e6b35-b4f3-40dc-a845-7ac78aa20d62" +``` + +Implement `adapters/claude.py`: read `~/.claude/projects//.jsonl`, +discriminate on `type`, reconstruct the turn DAG via `uuid`/`parentUuid`, map +records onto `SessionEvent.kind`, capture `message.usage` into the cost block, +handle `agent-*.jsonl` sidechains (`is_sidechain`), and resolve `repo`/`domain` +from `cwd`. Verify against real local sessions in this repo's project dir. No +Codex/Grok work in Phase 0 (designed for, not built). + +## Tier 1 / Tier 2 Store + +```task +id: AGENTIC-WP-0002-T03 +status: todo +priority: high +state_hub_task_id: "2387258e-ba6d-4a41-919e-f2f4e0822110" +``` + +Implement `core/store.py`: SQLite for `Session`/`SessionEvent` rows plus a blob +dir for `payload_ref` bodies (Tier 1), and a compact `digest` table (Tier 2). +Writes are idempotent on `(session_uid, seq)`. Provide usage-bytes accounting for +Tier 1 (rows + blobs) and Tier 2, used by retention. + +## Session Digest (Tier 1 → Tier 2) + +```task +id: AGENTIC-WP-0002-T04 +status: todo +priority: medium +state_hub_task_id: "017d8e90-633a-49f2-b342-8690938798cd" +``` + +Implement `core/digest.py`: produce a per-session digest (outcome heuristic, cost +totals, tool histogram, error/retry/human-intervention markers, key snippets) and +set `analyzed_at`. This is the promotion step that makes a session evictable. +Signal extraction beyond the digest stays stubbed for the Detect phase. 
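The digest step in T04 could look roughly like the sketch below. It assumes events arrive as plain dicts with hypothetical keys (`kind`, `tool`, `is_error`, `input_tokens`, `output_tokens`), and the outcome rule is a placeholder rather than the heuristic the design intends. Key snippets and retry markers are left out for brevity.

```python
from collections import Counter
from datetime import datetime, timezone


def build_digest(session_uid, events, now=None):
    """Summarize normalized events into a compact, Tier 2 style digest."""
    now = now or datetime.now(timezone.utc)

    # Count tool invocations by name; "unknown" covers events missing the key.
    tools = Counter(
        e.get("tool", "unknown") for e in events if e.get("kind") == "tool_use"
    )
    errors = sum(1 for e in events if e.get("is_error"))
    human_turns = sum(1 for e in events if e.get("kind") == "user")

    return {
        "session_uid": session_uid,
        "event_count": len(events),
        "input_tokens": sum(e.get("input_tokens", 0) for e in events),
        "output_tokens": sum(e.get("output_tokens", 0) for e in events),
        "tool_histogram": dict(tools),
        "error_count": errors,
        "human_turns": human_turns,
        # Placeholder heuristic: a session that ends on an error counts as failed.
        "outcome": "failed" if events and events[-1].get("is_error") else "ok",
        # Setting this watermark is what makes the raw session evictable.
        "analyzed_at": now.isoformat(),
    }
```

Passing `now` explicitly keeps the function deterministic in tests, which matters because `analyzed_at` later drives eviction order.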
+ +## Budget-Based Retention Sweep + +```task +id: AGENTIC-WP-0002-T05 +status: todo +priority: high +state_hub_task_id: "89177c79-528e-4023-a7eb-67f8e0276ba9" +``` + +Implement `core/retention.py` per design §5: backstop pass (`raw_max_age_days`), +budget pass (evict oldest `analyzed_at` first while over `raw_soft_cap_bytes`, +never touching un-analyzed sessions), and the hard-cap overflow path (analyze-now, +then last-resort evict oldest un-analyzed with a reported `data_loss` event). +Enforce the invariant: raw bytes are never dropped before the Tier 2 digest +exists (except the reported overflow path). Cover each branch with tests using +synthetic sessions and tiny caps. + +## Ingest Cursor + Sweep Entrypoint + +```task +id: AGENTIC-WP-0002-T06 +status: todo +priority: medium +state_hub_task_id: "a4b35c76-154d-4e99-b6d0-61cb6e47ecc0" +``` + +Implement `core/cursor.py` (per-source `(path,size,mtime,offset)` cursors, +idempotent re-runs) and `ingest.py` wiring one sweep: discover → normalize +(Claude) → store → digest → evict. Add `config.toml` with the §5.1 retention +knobs, source paths, and repo→domain map. Document running a sweep and the +intended `cadence` trigger (`/schedule` daily/weekly) in the repo docs. + +## Verify End-to-End on Real Sessions + +```task +id: AGENTIC-WP-0002-T07 +status: todo +priority: medium +state_hub_task_id: "98d5cc7c-c285-4556-91a3-a85e0a2bb6df" +``` + +Run the full sweep against this workstation's real Claude sessions; confirm +normalized rows, digests, idempotent re-run, and an eviction cycle under a small +test cap (analyzed dropped, un-analyzed preserved, overflow reported). Record +results and update the design doc's open questions (esp. OQ4 real per-session +sizes). After workplan file updates, notify the custodian operator to run from +`~/state-hub`: + +```bash +make fix-consistency REPO=agentic-resources +```
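The backstop and budget passes from the Budget-Based Retention Sweep task (T05) can be sketched as below. This is a simplified in-memory model under stated assumptions: the `RawSession` type and its fields are hypothetical, age is measured from ingest time, and the hard-cap overflow path (analyze-now, then reported `data_loss` eviction) is deliberately omitted. Only `raw_soft_cap_bytes`, `raw_max_age_days`, and the "never evict un-analyzed" rule come from the workplan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawSession:
    session_uid: str
    raw_bytes: int
    ingested_at: float             # epoch seconds (assumed reference for age)
    analyzed_at: Optional[float]   # None until the Tier 2 digest exists
    evicted: bool = False


def sweep(sessions, now, raw_soft_cap_bytes, raw_max_age_days):
    """Run the backstop pass, then the budget pass; return evicted uids in order."""
    evicted = []
    max_age = raw_max_age_days * 86400

    def live_analyzed():
        # Only sessions with a digest are ever candidates for eviction.
        return [s for s in sessions if not s.evicted and s.analyzed_at is not None]

    def evict(s):
        s.evicted = True
        evicted.append(s.session_uid)

    # Backstop pass: analyzed raw content older than the age limit is dropped.
    for s in live_analyzed():
        if now - s.ingested_at > max_age:
            evict(s)

    # Budget pass: evict oldest analyzed_at first while over the soft cap.
    used = sum(s.raw_bytes for s in sessions if not s.evicted)
    for s in sorted(live_analyzed(), key=lambda s: s.analyzed_at):
        if used <= raw_soft_cap_bytes:
            break
        evict(s)
        used -= s.raw_bytes

    return evicted
```

With tiny caps and synthetic sessions, as the task suggests, each branch can be exercised directly: an old analyzed session falls to the backstop, a recent analyzed one falls to the budget pass, and an un-analyzed session survives both regardless of age.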