Files
agentic-resources/docs/PRD-helix-forge.md
tegwick d06791f070 session-memory Phase 2: verify + catalog artifacts (T07)
End-to-end verification over real local sessions: ingest 94->93 -> 72 digests;
detect 3 candidates (2 cross-flavor); curate --auto-approve cataloged 3
SolutionPatterns (2 cross-flavor approved/distribution_ready, 1 Claude-only),
re-run fully idempotent, 3 hub decisions queued (API offline). Commits the 3
catalog artifacts as the source of truth. PRD §12 OQ4/OQ5/OQ6 marked resolved;
README + design refreshed. Workplan finished; suite 72/72.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 10:08:52 +02:00

294 lines
17 KiB
Markdown

# Product Requirements Document — Helix Forge
**Domain:** helix_forge
**Repo:** agentic-resources
**Status:** Draft v0.1
**Author:** Claude (drafted with Bernd Worsch)
**Created:** 2026-06-06
**Updated:** 2026-06-06
---
## 1. Summary
Helix Forge is a system for **handling a collection of repositories and evolving
the utility of what those repositories provide**, by treating the coding sessions
run against them as a first-class data source.
Concretely: across a fleet of repos worked on by multiple coding agents (Claude,
Codex, GrokBuild), Helix Forge **inspects the sessions**, **collects data about the
problems agents hit and the moves that resolved them**, and turns that data into
**reusable solution patterns** that can be discussed, implemented, and re-applied —
across every agent flavor, not just the one that discovered the pattern.
The name is the metaphor: a *helix* of repeated turns (session → pattern → improved
session) feeding a *forge* where the tooling, environments, and instructions for our
agents are hammered into better shape over time. This is the operational engine
behind the INTENT.md goal of an *antifragile, continuously-optimizing agentic
ecosystem*.
## 2. Problem Statement
We run many coding sessions, across many repos, with several different agents. Today
the value of each session is **trapped in that session**:
- When an agent solves a tricky problem, the solution is not captured in a form
another agent (or the same agent next week) can reuse.
- When an agent fails, struggles, or burns excess budget on a problem, that failure
signal is lost — we re-encounter the same friction repeatedly.
- Each agent flavor (Claude, Codex, GrokBuild) has its own environment, instruction
format, and extension mechanism, so a fix discovered for one is **not portable** to
the others without manual translation.
- We have no systematic, evidence-based answer to "what is actually slowing our
agents down, and what consistently makes them faster?" — decisions about tooling,
prompts, and environments are made on anecdote.
**The cost:** repeated mistakes, non-transferable wins, slow and uneven improvement
of agent performance, and no feedback loop from real session data back into the
tools/environments/instructions that shape future sessions.
## 3. Goals & Non-Goals
### 3.1 Goals
| # | Goal |
|---|------|
| G1 | **Capture** coding sessions from Claude, Codex, and GrokBuild in a normalized, comparable form. |
| G2 | **Detect** recurring *problem patterns* (failure, friction, wasted budget) and *success patterns* (efficient resolutions) from that data. |
| G3 | **Curate** detected patterns into a reviewed catalog of *solution patterns* that humans and agents can discuss and approve. |
| G4 | **Distribute** approved patterns back into agent environments — as instructions, tools, or extensions — in a per-flavor-appropriate form. |
| G5 | **Measure** whether distributed patterns actually improved subsequent sessions (close the loop). |
| G6 | Keep the whole loop **agent-flavor-agnostic at the core**, with thin per-flavor adapters at the edges. |
### 3.2 Non-Goals (initial release)
- Not a replacement for the coding agents themselves; Helix Forge observes and
improves them, it does not execute coding tasks.
- Not a general APM/observability product; scope is coding-session improvement, not
arbitrary infrastructure monitoring.
- Not an autonomous self-modifying system — pattern promotion into live agent
environments requires human approval (HITL) for the first release.
- Not building new model training/fine-tuning pipelines; we optimize *context,
tooling, and environment*, not model weights.
- Not replacing the Custodian State Hub; Helix Forge is a producer/consumer of hub
state, not a competing system of record. (See §9.)
## 4. Users & Personas
| Persona | Description | What they need from Helix Forge |
|---------|-------------|----------------------------------|
| **Operator (Bernd)** | Owns the agentic ecosystem; decides which patterns become standards. | A reviewable catalog of patterns with evidence; control over what ships to agents. |
| **Coding agent (Claude / Codex / GrokBuild)** | Runs tasks in a repo; both the *source* of session data and the *consumer* of patterns. | To emit session data cheaply; to receive applicable patterns in its native format at session start. |
| **Repo maintainer agent** | The per-repo agent persona (e.g. `agentic-resources`) following AGENTS.md conventions. | Patterns scoped to its repo/domain; integration via existing workplan + state-hub flow. |
| **Reviewer (human or kaizen agent)** | Evaluates candidate patterns before they become standards. | Clear pattern proposals, supporting evidence, and a discuss/approve/reject workflow. |
## 5. Core Concepts (Domain Model)
- **Session** — one bounded run of a coding agent against a repo. Has an agent flavor,
repo, task reference, timeline of events, outcome, and cost (tokens/time).
- **Session Event** — a normalized atomic record within a session: tool call, edit,
test run, error, retry, human intervention, decision, completion.
- **Signal** — a derived indicator extracted from sessions: e.g. *repeated test
failure on same file*, *budget overrun*, *fast clean resolution*, *retry storm*,
*human escalation*.
- **Problem Pattern** — a recurring negative signal cluster ("agents repeatedly fail
X because Y").
- **Success Pattern** — a recurring positive resolution ("doing Z reliably resolves X
cheaply").
- **Solution Pattern** — a curated, reviewed artifact pairing a problem with one or
more recommended resolutions, written agent-flavor-agnostically, with per-flavor
rendering hints.
- **Pattern Application** — the act of distributing a solution pattern into a specific
agent environment (an instruction snippet, a tool, an extension), plus the record of
its effect on later sessions.
## 6. Functional Requirements
### 6.1 Capture (G1)
- **FR-C1** Ingest session transcripts/logs from each supported agent flavor via a
per-flavor **collector adapter**.
- **FR-C2** Normalize raw logs into the common `Session` + `Session Event` schema,
regardless of source flavor.
- **FR-C3** Tag every session with: agent flavor, repo, domain, task/workplan id (if
any), outcome (success/fail/abandoned), and cost metrics (tokens, wall-clock,
retries).
- **FR-C4** Support both **batch import** (historical logs) and **incremental ingest**
(new sessions as they close).
- **FR-C5** Collection must be low-friction and non-blocking — an agent emitting
session data must never slow or break the actual coding task.
### 6.2 Detect (G2)
- **FR-D1** Run signal extractors over normalized sessions to surface problem and
success signals.
- **FR-D2** Cluster recurring signals across sessions/repos/flavors into candidate
Problem Patterns and Success Patterns.
- **FR-D3** For each candidate pattern, attach **evidence**: the supporting sessions,
frequency, affected repos, affected flavors, and estimated cost impact.
- **FR-D4** Flag **cross-flavor** patterns explicitly (a problem seen in Claude that
Codex also hits) — these are the highest-value reuse targets.
### 6.3 Curate (G3)
- **FR-U1** Present candidate patterns for review with their evidence in a
discuss/approve/reject workflow.
- **FR-U2** Allow a reviewer (human or kaizen agent) to promote a candidate into a
**Solution Pattern**: a named, versioned artifact with problem description,
recommended resolution(s), applicability scope, and per-flavor rendering hints.
- **FR-U3** Maintain a **Pattern Catalog** as the source of truth for approved
solution patterns, versioned and stored as files in-repo (consistent with ADR-001:
files originate work, the hub indexes them).
- **FR-U4** Record pattern decisions through the State Hub decision mechanism so
rationale is auditable.
### 6.4 Distribute (G4)
- **FR-X1** Render each approved solution pattern into per-flavor artifacts via
**distributor adapters**:
- Claude → `CLAUDE.md` snippets, skills, or settings/hooks.
- Codex → `AGENTS.md` snippets / repo conventions.
- GrokBuild → its native instruction/extension format.
- **FR-X2** Scope distribution by repo and domain, so a pattern only lands where it
applies.
- **FR-X3** Distribution is **proposed, not auto-applied** in v1 — output is a
reviewable change (e.g. a workplan or PR), gated by human approval.
- **FR-X4** Track which patterns are currently active in which environments.
### 6.5 Measure (G5)
- **FR-M1** After a pattern is applied, compare subsequent sessions touching the same
signal against the pre-application baseline (cost, retry rate, success rate,
human-intervention rate).
- **FR-M2** Surface per-pattern **effectiveness** so ineffective patterns can be
revised or retired.
- **FR-M3** Provide a fleet-level view: are sessions across the collection getting
cheaper / more reliable over time? (the helix turning.)
### 6.6 Multi-Agent Support (G6)
- **FR-A1** The core schema, detection, catalog, and measurement are **flavor-agnostic**.
- **FR-A2** All flavor-specific knowledge lives in **collector adapters** (input) and
**distributor adapters** (output). Adding a fourth agent = adding one collector +
one distributor, no core changes.
- **FR-A3** A successful pattern discovered via one flavor MUST be expressible for all
other supported flavors.
## 7. Architecture Overview
```
┌──────────── per-flavor edges ────────────┐ ┌──── flavor-agnostic core ────┐
│ │ │ │
Claude ─┐ │ │ │
Codex ─┼─► Collector Adapters ──► Normalizer ─┼────────►│ Session + Event Store │
Grok ─┘ │ │ │ │
│ │ ▼ │
│ │ Signal Extractors │
│ │ │ │
│ │ ▼ │
│ │ Pattern Detector / Clusterer│
│ │ │ │
│ │ ▼ │
│ │ Curation + Pattern Catalog │ ◄─ reviewer (human/kaizen)
│ │ │ │
Claude ◄┐ │ │ ▼ │
Codex ◄┼── Distributor Adapters ◄────────────┼─────────│ Effectiveness Measurement │
Grok ◄┘ │ │ │
└───────────────────────────────────────────┘ └──────────────────────────────┘
▲ feeds back into ▲ tools / environments / instructions
```
**Design principle:** *agnostic core, thin adapters at the edges.* The expensive,
reusable intelligence (normalized sessions, detection, catalog, measurement) is built
once; each agent flavor only needs an input adapter and an output adapter.
## 8. Data & Storage
- **Pattern Catalog** and **workplans**: files in `agentic-resources` (per ADR-001 in
AGENTS.md — files are the source of truth, the hub indexes them).
- **Session/event data**: a local store (start simple: structured files / SQLite;
graduate to Postgres alongside the State Hub if volume warrants).
- **Decisions & progress**: recorded through the Custodian State Hub so the broader
ecosystem stays aware of Helix Forge's activity.
## 9. Integration with the Custodian State Hub
Helix Forge runs inside the `helix_forge` domain and is **not** a competing system of
record:
- Work originates as **workplans** in this repo (`AGENTIC-WP-NNNN`), synced via
`make fix-consistency REPO=agentic-resources`.
- Pattern-promotion and distribution decisions are logged via the hub's decision API.
- Each Helix Forge run logs at least one `add_progress_event()` / `POST /progress/`.
- The hub remains a **read model**; Helix Forge writes its durable artifacts as files
and lets the hub index them.
## 10. Success Metrics
| Metric | Meaning | Target (directional, v1) |
|--------|---------|--------------------------|
| Sessions captured | Coverage of real work | ≥ 90% of sessions across the 3 flavors normalized |
| Patterns cataloged | Knowledge made reusable | A growing, non-trivial catalog of reviewed solution patterns |
| Cross-flavor patterns | Reuse leverage | ≥ 1 pattern proven to transfer across flavors |
| Pattern effectiveness | Loop is closing | Applied patterns show measurable cost/reliability improvement vs. baseline |
| Fleet trend | The helix turns | Median session cost ↓ and success rate ↑ over time |
| Repeated-failure rate | Friction eliminated | Known problem patterns recur less after distribution |
## 11. Phasing / Roadmap
- **Phase 0 — Foundations.** Define the Session/Event schema and Pattern Catalog
format. One collector adapter (Claude) + batch import. Manual inspection only.
- **Phase 1 — Detect.** Signal extractors + pattern clustering over captured sessions;
candidate patterns surfaced with evidence. Add Codex + GrokBuild collectors.
- **Phase 2 — Curate.** Review workflow + versioned Pattern Catalog, wired to hub
decisions.
- **Phase 3 — Distribute.** Distributor adapters for all three flavors; patterns ship
as reviewable workplans/PRs (HITL).
- **Phase 4 — Measure.** Baseline-vs-after effectiveness and fleet-level trend
reporting; retire ineffective patterns. Loop is closed.
## 12. Open Questions
- **OQ1** What is the canonical raw log format available from each of Claude, Codex,
and GrokBuild today, and how lossy is normalization from each?
- **OQ2** How are sessions reliably bounded and attributed to a repo/task across the
three flavors?
- **OQ3** Where does detection logic run — local batch jobs, hub-side, or a dedicated
service? What volume do we actually expect?
- ~~**OQ4** Pattern format: how do we keep one agnostic representation while giving each
distributor enough to render high-quality native artifacts?~~ **Resolved (Phase 2,
AGENTIC-WP-0004):** the `SolutionPattern` core is flavor-agnostic (problem,
resolutions, scope, provenance) and carries per-flavor knowledge only in a separate
`rendering_hints` sub-structure keyed by flavor — distributors read the hints, the
core stays neutral. Catalogued as versioned files-first artifacts (FR-U3).
- ~~**OQ5** What's the minimum trustworthy evidence bar before a pattern is allowed to be
distributed to live agent environments?~~ **Resolved (Phase 2):** a two-tier
evidence bar (`[curate.gate]`). A *promote* floor (frequency / distinct sessions /
cost-impact) admits a candidate as `provisional`; a stricter *distribution* floor
(higher frequency, optional cross-flavor requirement, cost-impact) is required to
mark a pattern `approved` + `distribution_ready`. Defaults are conservative and
config-tunable.
- ~~**OQ6** How do we prevent pattern bloat — too many low-value instructions degrading
agent context budgets (cf. the token-budget policy in global instructions)?~~
**Resolved (Phase 2):** a bloat guard flags duplicate (same id) and near-duplicate
(same signal-type+locus) candidates at review time, and the catalog dedups
structurally on the source-candidate key so re-promotion never multiplies entries.
Thin candidates stay `provisional` (not distributed) rather than padding live
context.
## 13. Risks
| Risk | Mitigation |
|------|------------|
| Capture overhead slows real coding sessions | Async, non-blocking collection (FR-C5); never in the agent's critical path. |
| Patterns become noise / context bloat | Effectiveness gating (FR-M2) + retirement; measure before broad distribution. |
| Over-fitting to one flavor | Agnostic core + explicit cross-flavor flagging (FR-D4, FR-A3). |
| Bad pattern degrades agents | HITL approval before distribution (FR-X3); baseline measurement to catch regressions. |
| Drift from State Hub conventions | Files-first per ADR-001; log via hub; no competing source of record. |
---
*This PRD is a draft for discussion. Next step: a `proposed` workplan
(`AGENTIC-WP-0002`) scoping Phase 0 — the Session/Event schema and the first
(Claude) collector adapter.*