
tegwick ffe191d44e Add Helix Forge PRD, session-memory design, and Phase 0 workplan

- docs/PRD-helix-forge.md: Capture→Detect→Curate→Distribute→Measure loop
- docs/DESIGN-session-memory.md: tiered store + budget-based eviction;
  verified session-log schemas for Claude/Codex/Grok
- workplans/AGENTIC-WP-0002: Phase 0 (registered with State Hub)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-06 19:00:30 +02:00

16 KiB


Product Requirements Document — Helix Forge

Domain: helix_forge
Repo: agentic-resources
Status: Draft v0.1
Author: Claude (drafted with Bernd Worsch)
Created: 2026-06-06
Updated: 2026-06-06

1. Summary

Helix Forge is a system for managing a collection of repositories and increasing the value those repositories provide, by treating the coding sessions run against them as a first-class data source.

Concretely: across a fleet of repos worked on by multiple coding agents (Claude, Codex, GrokBuild), Helix Forge inspects the sessions, collects data about the problems agents hit and the moves that resolved them, and turns that data into reusable solution patterns that can be discussed, implemented, and re-applied — across every agent flavor, not just the one that discovered the pattern.

The name is the metaphor: a helix of repeated turns (session → pattern → improved session) feeding a forge where the tooling, environments, and instructions for our agents are hammered into better shape over time. This is the operational engine behind the INTENT.md goal of an antifragile, continuously-optimizing agentic ecosystem.

2. Problem Statement

We run many coding sessions, across many repos, with several different agents. Today the value of each session is trapped in that session:

When an agent solves a tricky problem, the solution is not captured in a form another agent (or the same agent next week) can reuse.
When an agent fails, struggles, or burns excess budget on a problem, that failure signal is lost — we re-encounter the same friction repeatedly.
Each agent flavor (Claude, Codex, GrokBuild) has its own environment, instruction format, and extension mechanism, so a fix discovered for one is not portable to the others without manual translation.
We have no systematic, evidence-based answer to "what is actually slowing our agents down, and what consistently makes them faster?" — decisions about tooling, prompts, and environments are made on anecdote.

The cost: repeated mistakes, non-transferable wins, slow and uneven improvement of agent performance, and no feedback loop from real session data back into the tools/environments/instructions that shape future sessions.

3. Goals & Non-Goals

3.1 Goals

#	Goal
G1	Capture coding sessions from Claude, Codex, and GrokBuild in a normalized, comparable form.
G2	Detect recurring problem patterns (failure, friction, wasted budget) and success patterns (efficient resolutions) from that data.
G3	Curate detected patterns into a reviewed catalog of solution patterns that humans and agents can discuss and approve.
G4	Distribute approved patterns back into agent environments — as instructions, tools, or extensions — in a per-flavor-appropriate form.
G5	Measure whether distributed patterns actually improved subsequent sessions (close the loop).
G6	Keep the whole loop agent-flavor-agnostic at the core, with thin per-flavor adapters at the edges.

3.2 Non-Goals (initial release)

Not a replacement for the coding agents themselves; Helix Forge observes and improves them, it does not execute coding tasks.
Not a general APM/observability product; scope is coding-session improvement, not arbitrary infrastructure monitoring.
Not an autonomous self-modifying system — pattern promotion into live agent environments requires human approval (HITL) for the first release.
Not building new model training/fine-tuning pipelines; we optimize context, tooling, and environment, not model weights.
Not replacing the Custodian State Hub; Helix Forge is a producer/consumer of hub state, not a competing system of record. (See §9.)

4. Users & Personas

Persona	Description	What they need from Helix Forge
Operator (Bernd)	Owns the agentic ecosystem; decides which patterns become standards.	A reviewable catalog of patterns with evidence; control over what ships to agents.
Coding agent (Claude / Codex / GrokBuild)	Runs tasks in a repo; both the source of session data and the consumer of patterns.	To emit session data cheaply; to receive applicable patterns in its native format at session start.
Repo maintainer agent	The per-repo agent persona (e.g. `agentic-resources`) following AGENTS.md conventions.	Patterns scoped to its repo/domain; integration via existing workplan + state-hub flow.
Reviewer (human or kaizen agent)	Evaluates candidate patterns before they become standards.	Clear pattern proposals, supporting evidence, and a discuss/approve/reject workflow.

5. Core Concepts (Domain Model)

Session — one bounded run of a coding agent against a repo. Has an agent flavor, repo, task reference, timeline of events, outcome, and cost (tokens/time).
Session Event — a normalized atomic record within a session: tool call, edit, test run, error, retry, human intervention, decision, completion.
Signal — a derived indicator extracted from sessions: e.g. repeated test failure on same file, budget overrun, fast clean resolution, retry storm, human escalation.
Problem Pattern — a recurring negative signal cluster ("agents repeatedly fail X because Y").
Success Pattern — a recurring positive resolution ("doing Z reliably resolves X cheaply").
Solution Pattern — a curated, reviewed artifact pairing a problem with one or more recommended resolutions, written agent-flavor-agnostically, with per-flavor rendering hints.
Pattern Application — the act of distributing a solution pattern into a specific agent environment (an instruction snippet, a tool, an extension), plus the record of its effect on later sessions.
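The Session and Session Event concepts above could be modeled roughly as follows. This is a minimal sketch, not the Phase 0 schema: the field names (`target`, `ok`, `wall_clock_s`) and the choice of seconds as the time unit are assumptions made for illustration; only the event kinds and the session attributes come from the definitions above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class EventKind(Enum):
    # Event kinds named in the Session Event definition.
    TOOL_CALL = "tool_call"
    EDIT = "edit"
    TEST_RUN = "test_run"
    ERROR = "error"
    RETRY = "retry"
    HUMAN_INTERVENTION = "human_intervention"
    DECISION = "decision"
    COMPLETION = "completion"


@dataclass(frozen=True)
class SessionEvent:
    kind: EventKind
    timestamp: float      # assumed: seconds since session start
    target: str = ""      # hypothetical: file or tool the event touched
    ok: bool = True       # hypothetical: whether the step succeeded


@dataclass
class Session:
    session_id: str
    flavor: str               # "claude", "codex", or "grokbuild"
    repo: str
    task_ref: Optional[str]   # workplan id, if any
    outcome: str              # "success", "fail", or "abandoned"
    tokens: int
    wall_clock_s: float
    events: List[SessionEvent] = field(default_factory=list)

    def retries(self) -> int:
        """Count retry events in the session timeline."""
        return sum(1 for e in self.events if e.kind is EventKind.RETRY)
```

Keeping events immutable (`frozen=True`) is one way to make the timeline safe to share between extractors; whether the real schema needs that is left open.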

6. Functional Requirements

6.1 Capture (G1)

FR-C1 Ingest session transcripts/logs from each supported agent flavor via a per-flavor collector adapter.
FR-C2 Normalize raw logs into the common Session + Session Event schema, regardless of source flavor.
FR-C3 Tag every session with: agent flavor, repo, domain, task/workplan id (if any), outcome (success/fail/abandoned), and cost metrics (tokens, wall-clock, retries).
FR-C4 Support both batch import (historical logs) and incremental ingest (new sessions as they close).
FR-C5 Collection must be low-friction and non-blocking — an agent emitting session data must never slow or break the actual coding task.
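One way to satisfy FR-C5 is a bounded in-memory queue that drops events rather than blocking the agent when it is full. The sketch below illustrates that idea only; `SessionEmitter`, its `maxsize` default, and the drop-on-full policy are assumptions, not a decided design.

```python
import queue


class SessionEmitter:
    """Hypothetical emitter: enqueue session events without ever blocking."""

    def __init__(self, maxsize: int = 1000):
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # events discarded because the queue was full

    def emit(self, event: dict) -> bool:
        """Enqueue an event; return False (and count a drop) if the queue is full."""
        try:
            self._q.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self) -> list:
        """Remove and return everything queued so far (called by the collector)."""
        out = []
        while True:
            try:
                out.append(self._q.get_nowait())
            except queue.Empty:
                return out
```

Dropping under pressure trades completeness for safety: the coding task is never slowed, and the `dropped` counter makes the resulting data loss visible rather than silent.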

6.2 Detect (G2)

FR-D1 Run signal extractors over normalized sessions to surface problem and success signals.
FR-D2 Cluster recurring signals across sessions/repos/flavors into candidate Problem Patterns and Success Patterns.
FR-D3 For each candidate pattern, attach evidence: the supporting sessions, frequency, affected repos, affected flavors, and estimated cost impact.
FR-D4 Flag cross-flavor patterns explicitly (a problem seen in Claude that Codex also hits) — these are the highest-value reuse targets.
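The detection steps above can be sketched as one signal extractor plus a grouping step that attaches evidence and flags cross-flavor clusters. The tuple layouts, the function names, and the failure threshold of 3 are all assumptions for illustration; real clustering would likely need fuzzier matching than an exact signal key.

```python
from collections import Counter, defaultdict


def repeated_test_failures(events, threshold=3):
    """Return files whose tests failed at least `threshold` times.

    `events` is an iterable of (kind, target, ok) tuples (assumed shape).
    """
    fails = Counter(
        target for kind, target, ok in events if kind == "test_run" and not ok
    )
    return sorted(t for t, n in fails.items() if n >= threshold)


def cluster_signals(observations):
    """Group signal observations by key and attach evidence (FR-D2 to FR-D4).

    `observations` is an iterable of (signal_key, session_id, flavor, repo).
    """
    groups = defaultdict(lambda: {"sessions": set(), "flavors": set(), "repos": set()})
    for key, session_id, flavor, repo in observations:
        g = groups[key]
        g["sessions"].add(session_id)
        g["flavors"].add(flavor)
        g["repos"].add(repo)
    return {
        key: {
            "frequency": len(g["sessions"]),
            "flavors": sorted(g["flavors"]),
            "repos": sorted(g["repos"]),
            # Seen in more than one flavor: a high-value reuse target.
            "cross_flavor": len(g["flavors"]) > 1,
        }
        for key, g in groups.items()
    }
```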

6.3 Curate (G3)

FR-U1 Present candidate patterns for review with their evidence in a discuss/approve/reject workflow.
FR-U2 Allow a reviewer (human or kaizen agent) to promote a candidate into a Solution Pattern: a named, versioned artifact with problem description, recommended resolution(s), applicability scope, and per-flavor rendering hints.
FR-U3 Maintain a Pattern Catalog as the source of truth for approved solution patterns, versioned and stored as files in-repo (consistent with ADR-001: files originate work, the hub indexes them).
FR-U4 Record pattern decisions through the State Hub decision mechanism so rationale is auditable.
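The review workflow above can be pictured as a small state machine over catalog entries. This is a sketch under assumptions: the status names, the `retired` state (borrowed from FR-M2), bumping the version on every transition, and the shape of the decision record are all hypothetical. The sketch only builds the decision record; it does not call the State Hub.

```python
from dataclasses import dataclass, replace

# Assumed lifecycle: candidates are approved or rejected; approved patterns may be retired.
ALLOWED = {
    "candidate": {"approved", "rejected"},
    "approved": {"retired"},
}


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    version: int
    status: str
    problem: str
    resolutions: tuple


def transition(entry: CatalogEntry, new_status: str, reviewer: str):
    """Return (updated entry, decision record) or raise on an illegal move."""
    if new_status not in ALLOWED.get(entry.status, set()):
        raise ValueError(f"cannot move {entry.name} from {entry.status} to {new_status}")
    updated = replace(entry, status=new_status, version=entry.version + 1)
    decision = {
        "pattern": entry.name,
        "from": entry.status,
        "to": new_status,
        "reviewer": reviewer,
    }
    return updated, decision
```

The returned decision record is the piece that FR-U4 would hand to the hub's decision mechanism, so the rationale stays auditable alongside the versioned file.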

6.4 Distribute (G4)

FR-X1 Render each approved solution pattern into per-flavor artifacts via distributor adapters:
- Claude → CLAUDE.md snippets, skills, or settings/hooks.
- Codex → AGENTS.md snippets / repo conventions.
- GrokBuild → its native instruction/extension format.
FR-X2 Scope distribution by repo and domain, so a pattern only lands where it applies.
FR-X3 Distribution is proposed, not auto-applied in v1 — output is a reviewable change (e.g. a workplan or PR), gated by human approval.
FR-X4 Track which patterns are currently active in which environments.
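The distribution rules above (per-flavor rendering, repo scoping, proposals rather than applied changes) might look like this in outline. The pattern dictionary layout and the function name are assumptions. GrokBuild is deliberately left unmapped because its native format is not specified here, so the sketch skips it rather than guessing.

```python
# Target files named in FR-X1; GrokBuild's native format is unspecified, so it is omitted.
TARGETS = {"claude": "CLAUDE.md", "codex": "AGENTS.md"}


def propose_distribution(pattern: dict, environments: list) -> list:
    """Build reviewable proposals for each (flavor, repo) the pattern applies to.

    Nothing is written to disk: every result carries status "proposed" (FR-X3).
    """
    proposals = []
    for flavor, repo in environments:
        if repo not in pattern["repos"]:
            continue  # FR-X2: only land where the pattern applies
        target = TARGETS.get(flavor)
        if target is None:
            continue  # no distributor adapter for this flavor yet
        hint = pattern["hints"].get(flavor, "")
        snippet = f"{pattern['name']} (v{pattern['version']})\n{pattern['resolution']}\n{hint}"
        proposals.append(
            {
                "flavor": flavor,
                "repo": repo,
                "file": target,
                "snippet": snippet.rstrip(),
                "status": "proposed",
            }
        )
    return proposals
```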

6.5 Measure (G5)

FR-M1 After a pattern is applied, compare subsequent sessions touching the same signal against the pre-application baseline (cost, retry rate, success rate, human-intervention rate).
FR-M2 Surface per-pattern effectiveness so ineffective patterns can be revised or retired.
FR-M3 Provide a fleet-level view: are sessions across the collection getting cheaper and more reliable over time? (This is the helix turning.)
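The baseline comparison in FR-M1 reduces to summarizing two groups of sessions and taking the difference. The sketch below assumes each session is a dictionary with `tokens`, `outcome`, and `retries` keys and uses the median for cost. Those choices are illustrative, and a real comparison would also need to account for sample size and task mix.

```python
from statistics import median


def summarize(sessions: list) -> dict:
    """Aggregate cost and reliability over a non-empty list of session dicts."""
    n = len(sessions)
    return {
        "median_tokens": median(s["tokens"] for s in sessions),
        "success_rate": sum(s["outcome"] == "success" for s in sessions) / n,
        "mean_retries": sum(s["retries"] for s in sessions) / n,
    }


def effectiveness(before: list, after: list) -> dict:
    """Return after-minus-before deltas; negative cost and retries are improvements."""
    b, a = summarize(before), summarize(after)
    return {metric: a[metric] - b[metric] for metric in b}
```

For example, comparing the sessions that hit a given signal before a pattern was applied with those that hit it afterwards:

```python
before = [
    {"tokens": 1000, "outcome": "success", "retries": 4},
    {"tokens": 3000, "outcome": "fail", "retries": 2},
]
after = [
    {"tokens": 800, "outcome": "success", "retries": 1},
    {"tokens": 1200, "outcome": "success", "retries": 1},
]
delta = effectiveness(before, after)
# delta == {"median_tokens": -1000.0, "success_rate": 0.5, "mean_retries": -2.0}
```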

6.6 Multi-Agent Support (G6)

FR-A1 The core schema, detection, catalog, and measurement are flavor-agnostic.
FR-A2 All flavor-specific knowledge lives in collector adapters (input) and distributor adapters (output). Adding a fourth agent means adding one collector and one distributor, with no core changes.
FR-A3 A successful pattern discovered via one flavor MUST be expressible for all other supported flavors.
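The adapter boundary described in FR-A2 can be illustrated with a small registry: the core calls `ingest` and never sees flavor-specific field names. The raw field names below (`type`/`ts`, `event`/`time`) are invented for the example and are not the agents' real log formats, which OQ1 leaves open.

```python
from typing import Dict, Iterable, List, Protocol


class Collector(Protocol):
    """Interface every per-flavor collector adapter is assumed to satisfy."""

    flavor: str

    def normalize(self, raw: dict) -> dict: ...


class ClaudeCollector:
    flavor = "claude"

    def normalize(self, raw: dict) -> dict:
        # Hypothetical raw fields: "type" and "ts".
        return {"flavor": self.flavor, "kind": raw["type"], "timestamp": raw["ts"]}


class CodexCollector:
    flavor = "codex"

    def normalize(self, raw: dict) -> dict:
        # Hypothetical raw fields: "event" and "time".
        return {"flavor": self.flavor, "kind": raw["event"], "timestamp": raw["time"]}


COLLECTORS: Dict[str, Collector] = {}


def register(collector: Collector) -> None:
    """Adding a flavor means registering one more adapter; the core is untouched."""
    COLLECTORS[collector.flavor] = collector


def ingest(flavor: str, raw_records: Iterable[dict]) -> List[dict]:
    """Flavor-agnostic entry point: look up the adapter and normalize each record."""
    collector = COLLECTORS[flavor]
    return [collector.normalize(r) for r in raw_records]


register(ClaudeCollector())
register(CodexCollector())
```

Distributor adapters would mirror this shape on the output side, with a `render` method in place of `normalize`.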

7. Architecture Overview

   ┌──────────── per-flavor edges ────────────┐         ┌──── flavor-agnostic core ────┐
   │                                           │         │                              │
 Claude ─┐                                     │         │                              │
 Codex  ─┼─► Collector Adapters ──► Normalizer ─┼────────►│  Session + Event Store       │
 Grok   ─┘                                     │         │           │                  │
                                               │         │           ▼                  │
                                               │         │  Signal Extractors           │
                                               │         │           │                  │
                                               │         │           ▼                  │
                                               │         │  Pattern Detector / Clusterer│
                                               │         │           │                  │
                                               │         │           ▼                  │
                                               │         │  Curation + Pattern Catalog  │  ◄─ reviewer (human/kaizen)
                                               │         │           │                  │
 Claude ◄┐                                     │         │           ▼                  │
 Codex  ◄┼── Distributor Adapters ◄────────────┼─────────│  Effectiveness Measurement   │
 Grok   ◄┘                                     │         │                              │
   └───────────────────────────────────────────┘         └──────────────────────────────┘
                                  ▲ feeds back into ▲  tools / environments / instructions

Design principle: agnostic core, thin adapters at the edges. The expensive, reusable intelligence (normalized sessions, detection, catalog, measurement) is built once; each agent flavor only needs an input adapter and an output adapter.

8. Data & Storage

Pattern Catalog and workplans: files in agentic-resources (per ADR-001 in AGENTS.md — files are the source of truth, the hub indexes them).
Session/event data: a local store (start simple: structured files / SQLite; graduate to Postgres alongside the State Hub if volume warrants).
Decisions & progress: recorded through the Custodian State Hub so the broader ecosystem stays aware of Helix Forge's activity.
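For the "start simple" option, a SQLite store for session and event data could begin with two tables. The table and column names below are assumptions that mirror the tags in FR-C3; they are not a committed schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    session_id   TEXT PRIMARY KEY,
    flavor       TEXT NOT NULL,
    repo         TEXT NOT NULL,
    domain       TEXT,
    task_ref     TEXT,
    outcome      TEXT NOT NULL CHECK (outcome IN ('success', 'fail', 'abandoned')),
    tokens       INTEGER NOT NULL,
    wall_clock_s REAL NOT NULL,
    retries      INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS events (
    event_id   INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL REFERENCES sessions(session_id),
    seq        INTEGER NOT NULL,
    kind       TEXT NOT NULL,
    payload    TEXT NOT NULL DEFAULT '{}'
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_session(conn, session_id, flavor, repo, outcome, tokens, wall_clock_s):
    """Insert one session row; the CHECK constraint rejects unknown outcomes."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sessions (session_id, flavor, repo, outcome, tokens, wall_clock_s)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            (session_id, flavor, repo, outcome, tokens, wall_clock_s),
        )
```

Sticking to plain SQL types and a text `payload` column keeps a later move to Postgres straightforward, should volume warrant it.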

9. Integration with the Custodian State Hub

Helix Forge runs inside the helix_forge domain and is not a competing system of record:

Work originates as workplans in this repo (AGENTIC-WP-NNNN), synced via make fix-consistency REPO=agentic-resources.
Pattern-promotion and distribution decisions are logged via the hub's decision API.
Each Helix Forge run logs at least one add_progress_event() / POST /progress/.
The hub remains a read model; Helix Forge writes its durable artifacts as files and lets the hub index them.

10. Success Metrics

Metric	Meaning	Target (directional, v1)
Sessions captured	Coverage of real work	≥ 90% of sessions across the 3 flavors normalized
Patterns cataloged	Knowledge made reusable	A growing, non-trivial catalog of reviewed solution patterns
Cross-flavor patterns	Reuse leverage	≥ 1 pattern proven to transfer across flavors
Pattern effectiveness	Loop is closing	Applied patterns show measurable cost/reliability improvement vs. baseline
Fleet trend	The helix turns	Median session cost ↓ and success rate ↑ over time
Repeated-failure rate	Friction eliminated	Known problem patterns recur less after distribution

11. Phasing / Roadmap

Phase 0 — Foundations. Define the Session/Event schema and Pattern Catalog format. One collector adapter (Claude) + batch import. Manual inspection only.
Phase 1 — Detect. Signal extractors + pattern clustering over captured sessions; candidate patterns surfaced with evidence. Add Codex + GrokBuild collectors.
Phase 2 — Curate. Review workflow + versioned Pattern Catalog, wired to hub decisions.
Phase 3 — Distribute. Distributor adapters for all three flavors; patterns ship as reviewable workplans/PRs (HITL).
Phase 4 — Measure. Baseline-vs-after effectiveness and fleet-level trend reporting; retire ineffective patterns. Loop is closed.

12. Open Questions

OQ1 What is the canonical raw log format available from each of Claude, Codex, and GrokBuild today, and how lossy is normalization from each?
OQ2 How are sessions reliably bounded and attributed to a repo/task across the three flavors?
OQ3 Where does detection logic run — local batch jobs, hub-side, or a dedicated service? What volume do we actually expect?
OQ4 Pattern format: how do we keep one agnostic representation while giving each distributor enough to render high-quality native artifacts?
OQ5 What's the minimum trustworthy evidence bar before a pattern is allowed to be distributed to live agent environments?
OQ6 How do we prevent pattern bloat — too many low-value instructions degrading agent context budgets (cf. the token-budget policy in global instructions)?

13. Risks

Risk	Mitigation
Capture overhead slows real coding sessions	Async, non-blocking collection (FR-C5); never in the agent's critical path.
Patterns become noise / context bloat	Effectiveness gating (FR-M2) + retirement; measure before broad distribution.
Over-fitting to one flavor	Agnostic core + explicit cross-flavor flagging (FR-D4, FR-A3).
Bad pattern degrades agents	HITL approval before distribution (FR-X3); baseline measurement to catch regressions.
Drift from State Hub conventions	Files-first per ADR-001; log via hub; no competing system of record.

This PRD is a draft for discussion. Next step: a proposed workplan (AGENTIC-WP-0002) scoping Phase 0 — the Session/Event schema and the first (Claude) collector adapter.
