# Product Requirements Document — Helix Forge

**Domain:** helix_forge
**Repo:** agentic-resources
**Status:** Draft v0.1
**Author:** Claude (drafted with Bernd Worsch)
**Created:** 2026-06-06
**Updated:** 2026-06-06

---

## 1. Summary

Helix Forge is a system for **handling a collection of repositories and evolving the utility of what those repositories provide**, by treating the coding sessions run against them as a first-class data source.

Concretely: across a fleet of repos worked on by multiple coding agents (Claude, Codex, GrokBuild), Helix Forge **inspects the sessions**, **collects data about the problems agents hit and the moves that resolved them**, and turns that data into **reusable solution patterns** that can be discussed, implemented, and re-applied — across every agent flavor, not just the one that discovered the pattern.

The name is the metaphor: a *helix* of repeated turns (session → pattern → improved session) feeding a *forge* where the tooling, environments, and instructions for our agents are hammered into better shape over time.

This is the operational engine behind the INTENT.md goal of an *antifragile, continuously-optimizing agentic ecosystem*.

## 2. Problem Statement

We run many coding sessions, across many repos, with several different agents. Today the value of each session is **trapped in that session**:

- When an agent solves a tricky problem, the solution is not captured in a form another agent (or the same agent next week) can reuse.
- When an agent fails, struggles, or burns excess budget on a problem, that failure signal is lost — we re-encounter the same friction repeatedly.
- Each agent flavor (Claude, Codex, GrokBuild) has its own environment, instruction format, and extension mechanism, so a fix discovered for one is **not portable** to the others without manual translation.
- We have no systematic, evidence-based answer to "what is actually slowing our agents down, and what consistently makes them faster?" — decisions about tooling, prompts, and environments are made on anecdote.

**The cost:** repeated mistakes, non-transferable wins, slow and uneven improvement of agent performance, and no feedback loop from real session data back into the tools/environments/instructions that shape future sessions.

## 3. Goals & Non-Goals

### 3.1 Goals

| # | Goal |
|---|------|
| G1 | **Capture** coding sessions from Claude, Codex, and GrokBuild in a normalized, comparable form. |
| G2 | **Detect** recurring *problem patterns* (failure, friction, wasted budget) and *success patterns* (efficient resolutions) from that data. |
| G3 | **Curate** detected patterns into a reviewed catalog of *solution patterns* that humans and agents can discuss and approve. |
| G4 | **Distribute** approved patterns back into agent environments — as instructions, tools, or extensions — in a per-flavor-appropriate form. |
| G5 | **Measure** whether distributed patterns actually improved subsequent sessions (close the loop). |
| G6 | Keep the whole loop **agent-flavor-agnostic at the core**, with thin per-flavor adapters at the edges. |

### 3.2 Non-Goals (initial release)

- Not a replacement for the coding agents themselves; Helix Forge observes and improves them, it does not execute coding tasks.
- Not a general APM/observability product; scope is coding-session improvement, not arbitrary infrastructure monitoring.
- Not an autonomous self-modifying system — pattern promotion into live agent environments requires human approval (HITL) for the first release.
- Not building new model training/fine-tuning pipelines; we optimize *context, tooling, and environment*, not model weights.
- Not replacing the Custodian State Hub; Helix Forge is a producer/consumer of hub state, not a competing system of record. (See §9.)

## 4. Users & Personas

| Persona | Description | What they need from Helix Forge |
|---------|-------------|----------------------------------|
| **Operator (Bernd)** | Owns the agentic ecosystem; decides which patterns become standards. | A reviewable catalog of patterns with evidence; control over what ships to agents. |
| **Coding agent (Claude / Codex / GrokBuild)** | Runs tasks in a repo; both the *source* of session data and the *consumer* of patterns. | To emit session data cheaply; to receive applicable patterns in its native format at session start. |
| **Repo maintainer agent** | The per-repo agent persona (e.g. `agentic-resources`) following AGENTS.md conventions. | Patterns scoped to its repo/domain; integration via existing workplan + state-hub flow. |
| **Reviewer (human or kaizen agent)** | Evaluates candidate patterns before they become standards. | Clear pattern proposals, supporting evidence, and a discuss/approve/reject workflow. |

## 5. Core Concepts (Domain Model)

- **Session** — one bounded run of a coding agent against a repo. Has an agent flavor, repo, task reference, timeline of events, outcome, and cost (tokens/time).
- **Session Event** — a normalized atomic record within a session: tool call, edit, test run, error, retry, human intervention, decision, completion.
- **Signal** — a derived indicator extracted from sessions: e.g. *repeated test failure on same file*, *budget overrun*, *fast clean resolution*, *retry storm*, *human escalation*.
- **Problem Pattern** — a recurring negative signal cluster ("agents repeatedly fail X because Y").
- **Success Pattern** — a recurring positive resolution ("doing Z reliably resolves X cheaply").
- **Solution Pattern** — a curated, reviewed artifact pairing a problem with one or more recommended resolutions, written agent-flavor-agnostically, with per-flavor rendering hints.
- **Pattern Application** — the act of distributing a solution pattern into a specific agent environment (an instruction snippet, a tool, an extension), plus the record of its effect on later sessions.

## 6. Functional Requirements

### 6.1 Capture (G1)

- **FR-C1** Ingest session transcripts/logs from each supported agent flavor via a per-flavor **collector adapter**.
- **FR-C2** Normalize raw logs into the common `Session` + `Session Event` schema, regardless of source flavor.
- **FR-C3** Tag every session with: agent flavor, repo, domain, task/workplan id (if any), outcome (success/fail/abandoned), and cost metrics (tokens, wall-clock, retries).
- **FR-C4** Support both **batch import** (historical logs) and **incremental ingest** (new sessions as they close).
- **FR-C5** Collection must be low-friction and non-blocking — an agent emitting session data must never slow or break the actual coding task.

### 6.2 Detect (G2)

- **FR-D1** Run signal extractors over normalized sessions to surface problem and success signals.
- **FR-D2** Cluster recurring signals across sessions/repos/flavors into candidate Problem Patterns and Success Patterns.
- **FR-D3** For each candidate pattern, attach **evidence**: the supporting sessions, frequency, affected repos, affected flavors, and estimated cost impact.
- **FR-D4** Flag **cross-flavor** patterns explicitly (a problem seen in Claude that Codex also hits) — these are the highest-value reuse targets.

### 6.3 Curate (G3)

- **FR-U1** Present candidate patterns for review with their evidence in a discuss/approve/reject workflow.
- **FR-U2** Allow a reviewer (human or kaizen agent) to promote a candidate into a **Solution Pattern**: a named, versioned artifact with problem description, recommended resolution(s), applicability scope, and per-flavor rendering hints.
- **FR-U3** Maintain a **Pattern Catalog** as the source of truth for approved solution patterns, versioned and stored as files in-repo (consistent with ADR-001: files originate work, the hub indexes them).
- **FR-U4** Record pattern decisions through the State Hub decision mechanism so rationale is auditable.

### 6.4 Distribute (G4)

- **FR-X1** Render each approved solution pattern into per-flavor artifacts via **distributor adapters**:
  - Claude → `CLAUDE.md` snippets, skills, or settings/hooks.
  - Codex → `AGENTS.md` snippets / repo conventions.
  - GrokBuild → its native instruction/extension format.
- **FR-X2** Scope distribution by repo and domain, so a pattern only lands where it applies.
- **FR-X3** Distribution is **proposed, not auto-applied** in v1 — output is a reviewable change (e.g. a workplan or PR), gated by human approval.
- **FR-X4** Track which patterns are currently active in which environments.

### 6.5 Measure (G5)

- **FR-M1** After a pattern is applied, compare subsequent sessions touching the same signal against the pre-application baseline (cost, retry rate, success rate, human-intervention rate).
- **FR-M2** Surface per-pattern **effectiveness** so ineffective patterns can be revised or retired.
- **FR-M3** Provide a fleet-level view: are sessions across the collection getting cheaper / more reliable over time? (the helix turning.)

### 6.6 Multi-Agent Support (G6)

- **FR-A1** The core schema, detection, catalog, and measurement are **flavor-agnostic**.
- **FR-A2** All flavor-specific knowledge lives in **collector adapters** (input) and **distributor adapters** (output). Adding a fourth agent = adding one collector + one distributor, no core changes.
- **FR-A3** A successful pattern discovered via one flavor MUST be expressible for all other supported flavors.

## 7. Architecture Overview

```
        ┌──────────── per-flavor edges ────────────┐         ┌──── flavor-agnostic core ────┐
        │                                          │         │                              │
Claude ─┐                                          │         │                              │
Codex  ─┼─► Collector Adapters ──► Normalizer ─────┼────────►│ Session + Event Store        │
Grok   ─┘                                          │         │           │                  │
        │                                          │         │           ▼                  │
        │                                          │         │ Signal Extractors            │
        │                                          │         │           │                  │
        │                                          │         │           ▼                  │
        │                                          │         │ Pattern Detector / Clusterer │
        │                                          │         │           │                  │
        │                                          │         │           ▼                  │
        │                                          │         │ Curation + Pattern Catalog   │ ◄─ reviewer (human/kaizen)
        │                                          │         │           │                  │
Claude ◄┐                                          │         │           ▼                  │
Codex  ◄┼── Distributor Adapters ◄─────────────────┼─────────│ Effectiveness Measurement    │
Grok   ◄┘                                          │         │                              │
        └──────────────────────────────────────────┘         └──────────────────────────────┘
            ▲ feeds back into
            ▲ tools / environments / instructions
```

**Design principle:** *agnostic core, thin adapters at the edges.* The expensive, reusable intelligence (normalized sessions, detection, catalog, measurement) is built once; each agent flavor only needs an input adapter and an output adapter.

## 8. Data & Storage

- **Pattern Catalog** and **workplans**: files in `agentic-resources` (per ADR-001 in AGENTS.md — files are the source of truth, the hub indexes them).
- **Session/event data**: a local store (start simple: structured files / SQLite; graduate to Postgres alongside the State Hub if volume warrants).
- **Decisions & progress**: recorded through the Custodian State Hub so the broader ecosystem stays aware of Helix Forge's activity.

## 9. Integration with the Custodian State Hub

Helix Forge runs inside the `helix_forge` domain and is **not** a competing system of record:

- Work originates as **workplans** in this repo (`AGENTIC-WP-NNNN`), synced via `make fix-consistency REPO=agentic-resources`.
- Pattern-promotion and distribution decisions are logged via the hub's decision API.
- Each Helix Forge run logs at least one `add_progress_event()` / `POST /progress/`.
- The hub remains a **read model**; Helix Forge writes its durable artifacts as files and lets the hub index them.

## 10. Success Metrics

| Metric | Meaning | Target (directional, v1) |
|--------|---------|--------------------------|
| Sessions captured | Coverage of real work | ≥ 90% of sessions across the 3 flavors normalized |
| Patterns cataloged | Knowledge made reusable | A growing, non-trivial catalog of reviewed solution patterns |
| Cross-flavor patterns | Reuse leverage | ≥ 1 pattern proven to transfer across flavors |
| Pattern effectiveness | Loop is closing | Applied patterns show measurable cost/reliability improvement vs. baseline |
| Fleet trend | The helix turns | Median session cost ↓ and success rate ↑ over time |
| Repeated-failure rate | Friction eliminated | Known problem patterns recur less after distribution |

## 11. Phasing / Roadmap

- **Phase 0 — Foundations.** Define the Session/Event schema and Pattern Catalog format. One collector adapter (Claude) + batch import. Manual inspection only.
- **Phase 1 — Detect.** Signal extractors + pattern clustering over captured sessions; candidate patterns surfaced with evidence. Add Codex + GrokBuild collectors.
- **Phase 2 — Curate.** Review workflow + versioned Pattern Catalog, wired to hub decisions.
- **Phase 3 — Distribute.** Distributor adapters for all three flavors; patterns ship as reviewable workplans/PRs (HITL).
- **Phase 4 — Measure.** Baseline-vs-after effectiveness and fleet-level trend reporting; retire ineffective patterns. Loop is closed.

## 12. Open Questions

- **OQ1** What is the canonical raw log format available from each of Claude, Codex, and GrokBuild today, and how lossy is normalization from each?
- **OQ2** How are sessions reliably bounded and attributed to a repo/task across the three flavors?
- **OQ3** Where does detection logic run — local batch jobs, hub-side, or a dedicated service? What volume do we actually expect?
- ~~**OQ4** Pattern format: how do we keep one agnostic representation while giving each distributor enough to render high-quality native artifacts?~~ **Resolved (Phase 2, AGENTIC-WP-0004):** the `SolutionPattern` core is flavor-agnostic (problem, resolutions, scope, provenance) and carries per-flavor knowledge only in a separate `rendering_hints` sub-structure keyed by flavor — distributors read the hints, the core stays neutral. Catalogued as versioned files-first artifacts (FR-U3).
- ~~**OQ5** What's the minimum trustworthy evidence bar before a pattern is allowed to be distributed to live agent environments?~~ **Resolved (Phase 2):** a two-tier evidence bar (`[curate.gate]`). A *promote* floor (frequency / distinct sessions / cost-impact) admits a candidate as `provisional`; a stricter *distribution* floor (higher frequency, optional cross-flavor requirement, cost-impact) is required to mark a pattern `approved` + `distribution_ready`. Defaults are conservative and config-tunable.
- ~~**OQ6** How do we prevent pattern bloat — too many low-value instructions degrading agent context budgets (cf. the token-budget policy in global instructions)?~~ **Resolved (Phase 2):** a bloat guard flags duplicate (same id) and near-duplicate (same signal-type+locus) candidates at review time, and the catalog dedups structurally on the source-candidate key so re-promotion never multiplies entries. Thin candidates stay `provisional` (not distributed) rather than padding live context.

## 13. Risks

| Risk | Mitigation |
|------|------------|
| Capture overhead slows real coding sessions | Async, non-blocking collection (FR-C5); never in the agent's critical path. |
| Patterns become noise / context bloat | Effectiveness gating (FR-M2) + retirement; measure before broad distribution. |
| Over-fitting to one flavor | Agnostic core + explicit cross-flavor flagging (FR-D4, FR-A3). |
| Bad pattern degrades agents | HITL approval before distribution (FR-X3); baseline measurement to catch regressions. |
| Drift from State Hub conventions | Files-first per ADR-001; log via hub; no competing source of record. |

---

*This PRD is a draft for discussion. Next step: a `proposed` workplan (`AGENTIC-WP-0002`) scoping Phase 0 — the Session/Event schema and the first (Claude) collector adapter.*
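
To make the Phase 0 deliverable more concrete, the following is a minimal Python sketch of what the `Session` and `Session Event` schema from §5 and FR-C3 might look like, with one toy signal extractor for the *repeated test failure on same file* signal. Everything named here is an assumption for illustration: the class and field names (`Flavor`, `EventKind`, `offset_s`, `target`, `ok`), the `repeated_test_failures` helper, its threshold of 3, and the example file paths are hypothetical and not taken from any existing Helix Forge code.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Flavor(str, Enum):
    CLAUDE = "claude"
    CODEX = "codex"
    GROKBUILD = "grokbuild"


class Outcome(str, Enum):
    SUCCESS = "success"
    FAIL = "fail"
    ABANDONED = "abandoned"


class EventKind(str, Enum):
    # The event types listed under "Session Event" in section 5.
    TOOL_CALL = "tool_call"
    EDIT = "edit"
    TEST_RUN = "test_run"
    ERROR = "error"
    RETRY = "retry"
    HUMAN_INTERVENTION = "human_intervention"
    DECISION = "decision"
    COMPLETION = "completion"


@dataclass(frozen=True)
class SessionEvent:
    kind: EventKind
    offset_s: float                # assumed time base: seconds since session start
    target: Optional[str] = None   # e.g. the file the event touched, if any
    ok: bool = True                # False for a failed test run or tool call


@dataclass
class Session:
    # The tags required by FR-C3, plus the event timeline.
    session_id: str
    flavor: Flavor
    repo: str
    domain: str
    outcome: Outcome
    tokens: int
    wall_clock_s: float
    retries: int = 0
    workplan_id: Optional[str] = None
    events: List[SessionEvent] = field(default_factory=list)


def repeated_test_failures(session: Session, threshold: int = 3) -> List[str]:
    """Return targets whose test runs failed at least `threshold` times."""
    failures = Counter(
        e.target
        for e in session.events
        if e.kind is EventKind.TEST_RUN and not e.ok and e.target is not None
    )
    return sorted(t for t, n in failures.items() if n >= threshold)


# Hypothetical example session: three failing runs on one file, one on another.
example = Session(
    session_id="s-001",
    flavor=Flavor.CLAUDE,
    repo="agentic-resources",
    domain="helix_forge",
    outcome=Outcome.FAIL,
    tokens=42_000,
    wall_clock_s=900.0,
    retries=3,
    events=[
        SessionEvent(EventKind.TEST_RUN, 10.0, "tests/test_sync.py", ok=False),
        SessionEvent(EventKind.TEST_RUN, 20.0, "tests/test_sync.py", ok=False),
        SessionEvent(EventKind.TEST_RUN, 30.0, "tests/test_sync.py", ok=False),
        SessionEvent(EventKind.TEST_RUN, 40.0, "tests/test_other.py", ok=False),
        SessionEvent(EventKind.TEST_RUN, 50.0, "tests/test_other.py", ok=True),
    ],
)

print(repeated_test_failures(example))  # ['tests/test_sync.py']
```

Keeping the flavor as a plain tag on `Session`, rather than as a subclass per agent, is one way to honor FR-A1: the extractor above never inspects `flavor`, so the same function would apply unchanged to Codex or GrokBuild sessions once their collectors exist.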