agentic-resources/docs/PRD-helix-forge.md

# Product Requirements Document — Helix Forge

**Domain:** helix_forge
**Repo:** agentic-resources
**Status:** Draft v0.1
**Author:** Claude (drafted with Bernd Worsch)
**Created:** 2026-06-06
**Updated:** 2026-06-06

---

## 1. Summary

Helix Forge is a system for **handling a collection of repositories and evolving
the utility of what those repositories provide**, by treating the coding sessions
run against them as a first-class data source.

Concretely: across a fleet of repos worked on by multiple coding agents (Claude,
Codex, GrokBuild), Helix Forge **inspects the sessions**, **collects data about the
problems agents hit and the moves that resolved them**, and turns that data into
**reusable solution patterns** that can be discussed, implemented, and re-applied —
across every agent flavor, not just the one that discovered the pattern.

The name is the metaphor: a *helix* of repeated turns (session → pattern → improved
session) feeding a *forge* where the tooling, environments, and instructions for our
agents are hammered into better shape over time. This is the operational engine
behind the INTENT.md goal of an *antifragile, continuously-optimizing agentic
ecosystem*.

## 2. Problem Statement

We run many coding sessions, across many repos, with several different agents. Today
the value of each session is **trapped in that session**:

- When an agent solves a tricky problem, the solution is not captured in a form
  another agent (or the same agent next week) can reuse.
- When an agent fails, struggles, or burns excess budget on a problem, that failure
  signal is lost — we re-encounter the same friction repeatedly.
- Each agent flavor (Claude, Codex, GrokBuild) has its own environment, instruction
  format, and extension mechanism, so a fix discovered for one is **not portable** to
  the others without manual translation.
- We have no systematic, evidence-based answer to "what is actually slowing our
  agents down, and what consistently makes them faster?" — decisions about tooling,
  prompts, and environments are made on anecdote.

**The cost:** repeated mistakes, non-transferable wins, slow and uneven improvement
of agent performance, and no feedback loop from real session data back into the
tools/environments/instructions that shape future sessions.

## 3. Goals & Non-Goals

### 3.1 Goals

| # | Goal |
|---|------|
| G1 | **Capture** coding sessions from Claude, Codex, and GrokBuild in a normalized, comparable form. |
| G2 | **Detect** recurring *problem patterns* (failure, friction, wasted budget) and *success patterns* (efficient resolutions) from that data. |
| G3 | **Curate** detected patterns into a reviewed catalog of *solution patterns* that humans and agents can discuss and approve. |
| G4 | **Distribute** approved patterns back into agent environments — as instructions, tools, or extensions — in a per-flavor-appropriate form. |
| G5 | **Measure** whether distributed patterns actually improved subsequent sessions (close the loop). |
| G6 | Keep the whole loop **agent-flavor-agnostic at the core**, with thin per-flavor adapters at the edges. |

### 3.2 Non-Goals (initial release)

- Not a replacement for the coding agents themselves; Helix Forge observes and
  improves them, it does not execute coding tasks.
- Not a general APM/observability product; scope is coding-session improvement, not
  arbitrary infrastructure monitoring.
- Not an autonomous self-modifying system — pattern promotion into live agent
  environments requires human approval (HITL) for the first release.
- Not building new model training/fine-tuning pipelines; we optimize *context,
  tooling, and environment*, not model weights.
- Not replacing the Custodian State Hub; Helix Forge is a producer/consumer of hub
  state, not a competing system of record. (See §9.)

## 4. Users & Personas

| Persona | Description | What they need from Helix Forge |
|---------|-------------|----------------------------------|
| **Operator (Bernd)** | Owns the agentic ecosystem; decides which patterns become standards. | A reviewable catalog of patterns with evidence; control over what ships to agents. |
| **Coding agent (Claude / Codex / GrokBuild)** | Runs tasks in a repo; both the *source* of session data and the *consumer* of patterns. | To emit session data cheaply; to receive applicable patterns in its native format at session start. |
| **Repo maintainer agent** | The per-repo agent persona (e.g. `agentic-resources`) following AGENTS.md conventions. | Patterns scoped to its repo/domain; integration via existing workplan + state-hub flow. |
| **Reviewer (human or kaizen agent)** | Evaluates candidate patterns before they become standards. | Clear pattern proposals, supporting evidence, and a discuss/approve/reject workflow. |

## 5. Core Concepts (Domain Model)

- **Session** — one bounded run of a coding agent against a repo. Has an agent flavor,
  repo, task reference, timeline of events, outcome, and cost (tokens/time).
- **Session Event** — a normalized atomic record within a session: tool call, edit,
  test run, error, retry, human intervention, decision, completion.
- **Signal** — a derived indicator extracted from sessions: e.g. *repeated test
  failure on same file*, *budget overrun*, *fast clean resolution*, *retry storm*,
  *human escalation*.
- **Problem Pattern** — a recurring negative signal cluster ("agents repeatedly fail
  X because Y").
- **Success Pattern** — a recurring positive resolution ("doing Z reliably resolves X
  cheaply").
- **Solution Pattern** — a curated, reviewed artifact pairing a problem with one or
  more recommended resolutions, written agent-flavor-agnostically, with per-flavor
  rendering hints.
- **Pattern Application** — the act of distributing a solution pattern into a specific
  agent environment (an instruction snippet, a tool, an extension), plus the record of
  its effect on later sessions.
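
The first two concepts above could be sketched as plain data types. This is a minimal illustration only, not the Phase 0 schema: every field name here (`offset_s`, `task_ref`, `wall_clock_s`, and so on) is an assumption chosen to mirror the prose definitions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    SUCCESS = "success"
    FAIL = "fail"
    ABANDONED = "abandoned"


@dataclass(frozen=True)
class SessionEvent:
    """One normalized atomic record within a session (hypothetical fields)."""
    kind: str          # e.g. "tool_call", "edit", "test_run", "error", "retry"
    offset_s: float    # seconds since session start (assumed unit)
    detail: str = ""


@dataclass
class Session:
    """One bounded run of a coding agent against a repo (hypothetical fields)."""
    session_id: str
    flavor: str                      # "claude", "codex", or "grokbuild"
    repo: str
    outcome: Outcome
    tokens: int
    wall_clock_s: float
    task_ref: Optional[str] = None   # workplan id, if any
    events: List[SessionEvent] = field(default_factory=list)

    def count(self, kind: str) -> int:
        """Number of events of the given kind in this session's timeline."""
        return sum(1 for event in self.events if event.kind == kind)


# Illustrative data only; not taken from a real session.
example = Session(
    session_id="s-001",
    flavor="claude",
    repo="agentic-resources",
    outcome=Outcome.FAIL,
    tokens=48_000,
    wall_clock_s=900.0,
    events=[
        SessionEvent("test_run", 12.0, "tests/test_sync.py"),
        SessionEvent("retry", 40.0),
        SessionEvent("retry", 75.0),
    ],
)
```

Signals and patterns would then be derived from these records rather than stored on them, which keeps the session types free of any detection logic.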

## 6. Functional Requirements

### 6.1 Capture (G1)

- **FR-C1** Ingest session transcripts/logs from each supported agent flavor via a
  per-flavor **collector adapter**.
- **FR-C2** Normalize raw logs into the common `Session` + `Session Event` schema,
  regardless of source flavor.
- **FR-C3** Tag every session with: agent flavor, repo, domain, task/workplan id (if
  any), outcome (success/fail/abandoned), and cost metrics (tokens, wall-clock,
  retries).
- **FR-C4** Support both **batch import** (historical logs) and **incremental ingest**
  (new sessions as they close).
- **FR-C5** Collection must be low-friction and non-blocking — an agent emitting
  session data must never slow or break the actual coding task.
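
A collector adapter for FR-C1, FR-C2, and FR-C5 might look like the sketch below. The raw log formats are still open (OQ1), so the JSON-lines input, the `kind_field` / `time_field` parameters, and the class names are all assumptions made for illustration.

```python
import json
from typing import Iterable, List, Protocol


class CollectorAdapter(Protocol):
    """Hypothetical interface every per-flavor collector would satisfy."""
    flavor: str

    def to_events(self, raw_lines: Iterable[str]) -> List[dict]:
        ...


class JsonLinesCollector:
    """Assumes a flavor that logs one JSON object per line (not confirmed)."""

    def __init__(self, flavor: str, kind_field: str, time_field: str) -> None:
        self.flavor = flavor
        self.kind_field = kind_field
        self.time_field = time_field

    def to_events(self, raw_lines: Iterable[str]) -> List[dict]:
        events = []
        for line in raw_lines:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # In the spirit of FR-C5: a malformed line is skipped, never raised.
                continue
            if not isinstance(record, dict):
                continue
            # Map flavor-specific field names onto the common event shape.
            events.append({
                "flavor": self.flavor,
                "kind": record.get(self.kind_field, "unknown"),
                "timestamp": record.get(self.time_field),
            })
        return events


# Illustrative input; the field names "type" and "ts" are invented.
collector = JsonLinesCollector("claude", kind_field="type", time_field="ts")
events = collector.to_events([
    '{"type": "tool_call", "ts": 1.5}',
    "not json",
    "",
    '{"ts": 2}',
])
```

Skipping unparseable lines trades completeness for robustness; a real collector would probably also count what it dropped, so that the lossiness asked about in OQ1 can be measured.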

### 6.2 Detect (G2)

- **FR-D1** Run signal extractors over normalized sessions to surface problem and
  success signals.
- **FR-D2** Cluster recurring signals across sessions/repos/flavors into candidate
  Problem Patterns and Success Patterns.
- **FR-D3** For each candidate pattern, attach **evidence**: the supporting sessions,
  frequency, affected repos, affected flavors, and estimated cost impact.
- **FR-D4** Flag **cross-flavor** patterns explicitly (a problem seen in Claude that
  Codex also hits) — these are the highest-value reuse targets.
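
The detection steps above can be sketched as one signal extractor plus a grouping pass. This is an illustration under assumptions: the *retry storm* threshold, the dict shapes, and the choice to cluster on signal type and file path are not specified by this PRD.

```python
from collections import defaultdict
from typing import Dict, List


def extract_retry_storms(session: dict, threshold: int = 3) -> List[dict]:
    """Emit one 'retry_storm' signal per file with at least `threshold` retries."""
    retries: Dict[str, int] = defaultdict(int)
    for event in session["events"]:
        if event["kind"] == "retry" and event.get("file"):
            retries[event["file"]] += 1
    return [
        {"type": "retry_storm", "locus": path, "session_id": session["id"],
         "flavor": session["flavor"], "repo": session["repo"]}
        for path, count in sorted(retries.items())
        if count >= threshold
    ]


def cluster_signals(signals: List[dict], min_sessions: int = 2) -> List[dict]:
    """Group signals by (type, locus) and attach FR-D3 style evidence."""
    groups = defaultdict(list)
    for signal in signals:
        groups[(signal["type"], signal["locus"])].append(signal)
    candidates = []
    for (sig_type, locus), members in sorted(groups.items()):
        sessions = sorted({m["session_id"] for m in members})
        if len(sessions) < min_sessions:
            continue  # a one-off is not yet a recurring pattern
        flavors = sorted({m["flavor"] for m in members})
        candidates.append({
            "type": sig_type,
            "locus": locus,
            "frequency": len(members),
            "sessions": sessions,
            "repos": sorted({m["repo"] for m in members}),
            "flavors": flavors,
            "cross_flavor": len(flavors) > 1,  # FR-D4 flag
        })
    return candidates


def _demo_session(session_id: str, flavor: str, retry_count: int) -> dict:
    # Illustrative data only.
    return {"id": session_id, "flavor": flavor, "repo": "agentic-resources",
            "events": [{"kind": "retry", "file": "tests/test_x.py"}] * retry_count}


signals = []
for demo in (_demo_session("a", "claude", 3), _demo_session("b", "codex", 4),
             _demo_session("c", "claude", 1)):
    signals.extend(extract_retry_storms(demo))
candidates = cluster_signals(signals)
```

Here sessions `a` and `b` each cross the threshold on the same file, so they form a single candidate flagged as cross-flavor, while session `c` produces no signal at all.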

### 6.3 Curate (G3)

- **FR-U1** Present candidate patterns for review with their evidence in a
  discuss/approve/reject workflow.
- **FR-U2** Allow a reviewer (human or kaizen agent) to promote a candidate into a
  **Solution Pattern**: a named, versioned artifact with problem description,
  recommended resolution(s), applicability scope, and per-flavor rendering hints.
- **FR-U3** Maintain a **Pattern Catalog** as the source of truth for approved
  solution patterns, versioned and stored as files in-repo (consistent with ADR-001:
  files originate work, the hub indexes them).
- **FR-U4** Record pattern decisions through the State Hub decision mechanism so
  rationale is auditable.
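
A versioned, file-backed Solution Pattern (FR-U2, FR-U3) could be sketched as follows. JSON as the on-disk format, the field names, and the `bumped` helper are assumptions for illustration; the PRD only fixes that the catalog lives as versioned files in-repo.

```python
import json
from dataclasses import asdict, dataclass, field, replace
from typing import Dict, List


@dataclass
class SolutionPattern:
    """Flavor-agnostic core plus a separate per-flavor hints mapping."""
    pattern_id: str
    version: int
    problem: str
    resolutions: List[str]
    scope: Dict[str, List[str]]        # e.g. {"repos": [...], "domains": [...]}
    provenance: List[str]              # ids of the supporting sessions
    rendering_hints: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        # Sorted keys keep diffs stable when the file is versioned in git.
        return json.dumps(asdict(self), indent=2, sort_keys=True) + "\n"

    @classmethod
    def from_json(cls, text: str) -> "SolutionPattern":
        return cls(**json.loads(text))

    def bumped(self, **changes) -> "SolutionPattern":
        """Return a revised copy with the version incremented by one."""
        return replace(self, version=self.version + 1, **changes)


# Illustrative content only; not an actual catalog entry.
pattern = SolutionPattern(
    pattern_id="focused-test-first",
    version=1,
    problem="Agents rerun the full test suite after every edit.",
    resolutions=["Run the focused test first."],
    scope={"repos": ["agentic-resources"], "domains": ["helix_forge"]},
    provenance=["s-001", "s-007"],
    rendering_hints={"claude": "Render as a CLAUDE.md snippet."},
)
restored = SolutionPattern.from_json(pattern.to_json())
revised = pattern.bumped(problem="Agents rerun the full suite after each edit.")
```

Keeping `rendering_hints` as a separate mapping means the core fields never need to mention a flavor, so a reviewer can approve the problem and resolution text independently of how it is rendered.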

### 6.4 Distribute (G4)

- **FR-X1** Render each approved solution pattern into per-flavor artifacts via
  **distributor adapters**:
  - Claude → `CLAUDE.md` snippets, skills, or settings/hooks.
  - Codex → `AGENTS.md` snippets / repo conventions.
  - GrokBuild → its native instruction/extension format.
- **FR-X2** Scope distribution by repo and domain, so a pattern only lands where it
  applies.
- **FR-X3** Distribution is **proposed, not auto-applied** in v1 — output is a
  reviewable change (e.g. a workplan or PR), gated by human approval.
- **FR-X4** Track which patterns are currently active in which environments.
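
A distributor adapter in the sense of FR-X1 and FR-X3 might be sketched like this. The snippet layout, the dict keys, and the function name are assumptions; GrokBuild is deliberately left without an adapter because its native format is not specified here.

```python
from typing import Dict

# Target files named in FR-X1. GrokBuild is omitted: its format is unknown here.
TARGET_FILES: Dict[str, str] = {"claude": "CLAUDE.md", "codex": "AGENTS.md"}


def render_proposal(pattern: dict, flavor: str) -> dict:
    """Render a pattern into a reviewable change; nothing is written to disk."""
    if flavor not in TARGET_FILES:
        raise ValueError(f"no distributor adapter for flavor {flavor!r}")
    hint = pattern.get("rendering_hints", {}).get(flavor, "")
    lines = [f"### {pattern['name']} (v{pattern['version']})", "", pattern["problem"], ""]
    lines += [f"- {resolution}" for resolution in pattern["resolutions"]]
    if hint:
        lines += ["", hint]
    return {
        "target_file": TARGET_FILES[flavor],
        "snippet": "\n".join(lines) + "\n",
        "status": "proposed",  # FR-X3: a human approves before anything lands
    }


# Illustrative pattern only; not an actual catalog entry.
example_pattern = {
    "name": "focused-test-first",
    "version": 1,
    "problem": "Agents rerun the full suite after every edit.",
    "resolutions": ["Run the focused test first",
                    "Run the full suite once before completion"],
    "rendering_hints": {"codex": "Place under the testing conventions section."},
}
proposal = render_proposal(example_pattern, "codex")
```

Returning a proposal object instead of editing `AGENTS.md` directly keeps the adapter side-effect free, which fits the requirement that distribution is proposed rather than auto-applied in v1.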

### 6.5 Measure (G5)

- **FR-M1** After a pattern is applied, compare subsequent sessions touching the same
  signal against the pre-application baseline (cost, retry rate, success rate,
  human-intervention rate).
- **FR-M2** Surface per-pattern **effectiveness** so ineffective patterns can be
  revised or retired.
- **FR-M3** Provide a fleet-level view: are sessions across the collection getting
  cheaper and more reliable over time? (This is the helix turning.)
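
The baseline comparison in FR-M1 can be sketched as a difference of summary statistics. The session dict keys and the choice of median tokens as the cost measure are assumptions, and the human-intervention rate is left out for brevity.

```python
from statistics import median
from typing import Dict, List


def summarize(sessions: List[dict]) -> Dict[str, float]:
    """Summary statistics for one group of sessions touching the same signal."""
    count = len(sessions)
    return {
        "median_tokens": median(s["tokens"] for s in sessions),
        "success_rate": sum(s["outcome"] == "success" for s in sessions) / count,
        "retry_rate": sum(s["retries"] for s in sessions) / count,
    }


def effectiveness(before: List[dict], after: List[dict]) -> Dict[str, float]:
    """Return after-minus-before deltas; negative cost deltas are improvements."""
    if not before or not after:
        raise ValueError("need at least one session on each side of the application")
    baseline, current = summarize(before), summarize(after)
    return {key: current[key] - baseline[key] for key in baseline}


# Made-up numbers for illustration; not measured results.
before = [
    {"tokens": 1000, "outcome": "fail", "retries": 4},
    {"tokens": 2000, "outcome": "success", "retries": 2},
    {"tokens": 3000, "outcome": "fail", "retries": 0},
]
after = [
    {"tokens": 800, "outcome": "success", "retries": 1},
    {"tokens": 1200, "outcome": "success", "retries": 0},
]
delta = effectiveness(before, after)
```

A plain before/after delta like this does not control for task difficulty or sample size, so it is better read as a screening signal for FR-M2 than as proof that a pattern caused the change.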

### 6.6 Multi-Agent Support (G6)

- **FR-A1** The core schema, detection, catalog, and measurement are **flavor-agnostic**.
- **FR-A2** All flavor-specific knowledge lives in **collector adapters** (input) and
  **distributor adapters** (output). Adding a fourth agent means adding one collector
  and one distributor, with no core changes.
- **FR-A3** A successful pattern discovered via one flavor MUST be expressible for all
  other supported flavors.
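
One way to make FR-A2 and FR-A3 concrete is a small registry in which a flavor exists only as a collector/distributor pair. The class, the callable signatures, and the `fourth-agent` name are hypothetical.

```python
from typing import Callable, Dict, List

Collector = Callable[[str], List[dict]]   # raw log text -> normalized events
Distributor = Callable[[dict], str]       # solution pattern -> native snippet


class FlavorRegistry:
    """Holds the only flavor-specific code; the core never branches on flavor."""

    def __init__(self) -> None:
        self._collectors: Dict[str, Collector] = {}
        self._distributors: Dict[str, Distributor] = {}

    def register(self, flavor: str, collector: Collector,
                 distributor: Distributor) -> None:
        if flavor in self._collectors:
            raise ValueError(f"flavor {flavor!r} is already registered")
        # Both halves are required, so every flavor can receive every pattern.
        self._collectors[flavor] = collector
        self._distributors[flavor] = distributor

    def flavors(self) -> List[str]:
        return sorted(self._collectors)

    def render_everywhere(self, pattern: dict) -> Dict[str, str]:
        """FR-A3: express one pattern for all registered flavors."""
        return {name: self._distributors[name](pattern) for name in self.flavors()}


# Placeholder adapters for illustration; real ones would parse and render properly.
registry = FlavorRegistry()
registry.register("claude", lambda raw: [], lambda p: f"CLAUDE.md: {p['problem']}")
registry.register("codex", lambda raw: [], lambda p: f"AGENTS.md: {p['problem']}")
registry.register("fourth-agent", lambda raw: [], lambda p: f"native: {p['problem']}")
rendered = registry.render_everywhere({"problem": "flaky sync test"})
```

Requiring the collector and distributor together at registration time is one possible way to enforce FR-A3 structurally, since a flavor that can be observed but not served cannot be added.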

## 7. Architecture Overview

```
   ┌──────────── per-flavor edges ────────────┐         ┌──── flavor-agnostic core ────┐
   │                                           │         │                              │
 Claude ─┐                                     │         │                              │
 Codex  ─┼─► Collector Adapters ──► Normalizer ─┼────────►│  Session + Event Store       │
 Grok   ─┘                                     │         │           │                  │
                                               │         │           ▼                  │
                                               │         │  Signal Extractors           │
                                               │         │           │                  │
                                               │         │           ▼                  │
                                               │         │  Pattern Detector / Clusterer│
                                               │         │           │                  │
                                               │         │           ▼                  │
                                               │         │  Curation + Pattern Catalog  │  ◄─ reviewer (human/kaizen)
                                               │         │           │                  │
 Claude ◄┐                                     │         │           ▼                  │
 Codex  ◄┼── Distributor Adapters ◄────────────┼─────────│  Effectiveness Measurement   │
 Grok   ◄┘                                     │         │                              │
   └───────────────────────────────────────────┘         └──────────────────────────────┘
                                  ▲ feeds back into ▲  tools / environments / instructions
```

**Design principle:** *agnostic core, thin adapters at the edges.* The expensive,
reusable intelligence (normalized sessions, detection, catalog, measurement) is built
once; each agent flavor only needs an input adapter and an output adapter.

## 8. Data & Storage

- **Pattern Catalog** and **workplans**: files in `agentic-resources` (per ADR-001 in
  AGENTS.md — files are the source of truth, the hub indexes them).
- **Session/event data**: a local store (start simple: structured files / SQLite;
  graduate to Postgres alongside the State Hub if volume warrants).
- **Decisions & progress**: recorded through the Custodian State Hub so the broader
  ecosystem stays aware of Helix Forge's activity.
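
For the "start simple" option, a SQLite layout for session and event data might look like the sketch below. Table and column names are assumptions that mirror the tags in FR-C3; nothing here is a decided schema.

```python
import sqlite3

# Hypothetical schema; column names follow the FR-C3 tags.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    session_id   TEXT PRIMARY KEY,
    flavor       TEXT NOT NULL,
    repo         TEXT NOT NULL,
    domain       TEXT,
    task_ref     TEXT,
    outcome      TEXT NOT NULL CHECK (outcome IN ('success', 'fail', 'abandoned')),
    tokens       INTEGER NOT NULL,
    wall_clock_s REAL NOT NULL,
    retries      INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS session_events (
    session_id TEXT NOT NULL REFERENCES sessions(session_id),
    seq        INTEGER NOT NULL,
    kind       TEXT NOT NULL,
    detail     TEXT,
    PRIMARY KEY (session_id, seq)
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


# Illustrative rows only.
conn = open_store()
conn.execute(
    "INSERT INTO sessions (session_id, flavor, repo, outcome, tokens, wall_clock_s) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("s-001", "codex", "agentic-resources", "success", 12000, 300.0),
)
conn.execute(
    "INSERT INTO session_events (session_id, seq, kind, detail) VALUES (?, ?, ?, ?)",
    ("s-001", 0, "test_run", "pytest -q"),
)
rows = conn.execute(
    "SELECT flavor, COUNT(*) FROM sessions GROUP BY flavor"
).fetchall()
```

The schema uses only standard SQL types and constraints, which should keep a later move to Postgres mostly mechanical if volume warrants it.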

## 9. Integration with the Custodian State Hub

Helix Forge runs inside the `helix_forge` domain and is **not** a competing system of
record:

- Work originates as **workplans** in this repo (`AGENTIC-WP-NNNN`), synced via
  `make fix-consistency REPO=agentic-resources`.
- Pattern-promotion and distribution decisions are logged via the hub's decision API.
- Each Helix Forge run logs at least one progress event (`add_progress_event()` / `POST /progress/`).
- The hub remains a **read model**; Helix Forge writes its durable artifacts as files
  and lets the hub index them.

## 10. Success Metrics

| Metric | Meaning | Target (directional, v1) |
|--------|---------|--------------------------|
| Sessions captured | Coverage of real work | ≥ 90% of sessions across the 3 flavors normalized |
| Patterns cataloged | Knowledge made reusable | A growing, non-trivial catalog of reviewed solution patterns |
| Cross-flavor patterns | Reuse leverage | ≥ 1 pattern proven to transfer across flavors |
| Pattern effectiveness | Loop is closing | Applied patterns show measurable cost/reliability improvement vs. baseline |
| Fleet trend | The helix turns | Median session cost ↓ and success rate ↑ over time |
| Repeated-failure rate | Friction eliminated | Known problem patterns recur less after distribution |

## 11. Phasing / Roadmap

- **Phase 0 — Foundations.** Define the Session/Event schema and Pattern Catalog
  format. One collector adapter (Claude) + batch import. Manual inspection only.
- **Phase 1 — Detect.** Signal extractors + pattern clustering over captured sessions;
  candidate patterns surfaced with evidence. Add Codex + GrokBuild collectors.
- **Phase 2 — Curate.** Review workflow + versioned Pattern Catalog, wired to hub
  decisions.
- **Phase 3 — Distribute.** Distributor adapters for all three flavors; patterns ship
  as reviewable workplans/PRs (HITL).
- **Phase 4 — Measure.** Baseline-vs-after effectiveness and fleet-level trend
  reporting; retire ineffective patterns. Loop is closed.

## 12. Open Questions

- **OQ1** What is the canonical raw log format available from each of Claude, Codex,
  and GrokBuild today, and how lossy is normalization from each?
- **OQ2** How are sessions reliably bounded and attributed to a repo/task across the
  three flavors?
- **OQ3** Where does detection logic run — local batch jobs, hub-side, or a dedicated
  service? What volume do we actually expect?
- ~~**OQ4** Pattern format: how do we keep one agnostic representation while giving each
  distributor enough to render high-quality native artifacts?~~ **Resolved (Phase 2,
  AGENTIC-WP-0004):** the `SolutionPattern` core is flavor-agnostic (problem,
  resolutions, scope, provenance) and carries per-flavor knowledge only in a separate
  `rendering_hints` sub-structure keyed by flavor — distributors read the hints, the
  core stays neutral. Cataloged as versioned, files-first artifacts (FR-U3).
- ~~**OQ5** What's the minimum trustworthy evidence bar before a pattern is allowed to be
  distributed to live agent environments?~~ **Resolved (Phase 2):** a two-tier
  evidence bar (`[curate.gate]`). A *promote* floor (frequency / distinct sessions /
  cost-impact) admits a candidate as `provisional`; a stricter *distribution* floor
  (higher frequency, optional cross-flavor requirement, cost-impact) is required to
  mark a pattern `approved` + `distribution_ready`. Defaults are conservative and
  config-tunable.
- ~~**OQ6** How do we prevent pattern bloat — too many low-value instructions degrading
  agent context budgets (cf. the token-budget policy in global instructions)?~~
  **Resolved (Phase 2):** a bloat guard flags duplicate (same id) and near-duplicate
  (same signal type and locus) candidates at review time, and the catalog deduplicates
  structurally on the source-candidate key so re-promotion never multiplies entries.
  Thin candidates stay `provisional` (not distributed) rather than padding live
  context.
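
The two-tier evidence bar described under OQ5 could be sketched as two threshold sets checked in order. The threshold values, field names, and tier labels below are invented for illustration; the real defaults live in `[curate.gate]` and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Floor:
    """One evidence floor. All numbers used below are hypothetical."""
    min_frequency: int
    min_sessions: int
    min_cost_impact: float
    require_cross_flavor: bool = False


PROMOTE = Floor(min_frequency=3, min_sessions=2, min_cost_impact=0.0)
DISTRIBUTE = Floor(min_frequency=10, min_sessions=5, min_cost_impact=0.1,
                   require_cross_flavor=True)


def meets(candidate: dict, floor: Floor) -> bool:
    return (
        candidate["frequency"] >= floor.min_frequency
        and candidate["distinct_sessions"] >= floor.min_sessions
        and candidate["cost_impact"] >= floor.min_cost_impact
        and (candidate["cross_flavor"] or not floor.require_cross_flavor)
    )


def evidence_tier(candidate: dict, promote: Floor = PROMOTE,
                  distribute: Floor = DISTRIBUTE) -> str:
    """Highest tier the evidence permits; a reviewer still makes the decision."""
    if not meets(candidate, promote):
        return "below_promote_floor"
    if meets(candidate, distribute):
        return "distribution_eligible"
    return "provisional"


# Illustrative candidates only.
thin = {"frequency": 1, "distinct_sessions": 1, "cost_impact": 0.5, "cross_flavor": False}
middling = {"frequency": 4, "distinct_sessions": 3, "cost_impact": 0.05, "cross_flavor": False}
strong = {"frequency": 12, "distinct_sessions": 6, "cost_impact": 0.2, "cross_flavor": True}
tiers = [evidence_tier(c) for c in (thin, middling, strong)]
```

Checking the promote floor first means a candidate can never skip straight to distribution on the strength of a single impressive number, which matches the conservative intent of the gate.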

## 13. Risks

| Risk | Mitigation |
|------|------------|
| Capture overhead slows real coding sessions | Async, non-blocking collection (FR-C5); never in the agent's critical path. |
| Patterns become noise / context bloat | Review-time bloat guard (OQ6), effectiveness gating (FR-M2), and retirement; measure before broad distribution. |
| Over-fitting to one flavor | Agnostic core + explicit cross-flavor flagging (FR-D4, FR-A3). |
| Bad pattern degrades agents | HITL approval before distribution (FR-X3); baseline measurement to catch regressions. |
| Drift from State Hub conventions | Files-first per ADR-001; log via hub; no competing system of record. |

---

*This PRD is a draft for discussion. Next step: a `proposed` workplan
(`AGENTIC-WP-0002`) scoping Phase 0 — the Session/Event schema and the first
(Claude) collector adapter.*