# Product Requirements Document — Helix Forge

**Domain:** helix_forge
**Repo:** agentic-resources
**Status:** Draft v0.1
**Author:** Claude (drafted with Bernd Worsch)
**Created:** 2026-06-06
**Updated:** 2026-06-06

---

## 1. Summary

Helix Forge is a system for **handling a collection of repositories and evolving
the utility of what those repositories provide**, by treating the coding sessions
run against them as a first-class data source.

Concretely: across a fleet of repos worked on by multiple coding agents (Claude,
Codex, GrokBuild), Helix Forge **inspects the sessions**, **collects data about the
problems agents hit and the moves that resolved them**, and turns that data into
**reusable solution patterns** that can be discussed, implemented, and re-applied —
across every agent flavor, not just the one that discovered the pattern.

The name is the metaphor: a *helix* of repeated turns (session → pattern → improved
session) feeding a *forge* where the tooling, environments, and instructions for our
agents are hammered into better shape over time. This is the operational engine
behind the INTENT.md goal of an *antifragile, continuously-optimizing agentic
ecosystem*.

## 2. Problem Statement

We run many coding sessions, across many repos, with several different agents. Today
the value of each session is **trapped in that session**:

- When an agent solves a tricky problem, the solution is not captured in a form
  another agent (or the same agent next week) can reuse.
- When an agent fails, struggles, or burns excess budget on a problem, that failure
  signal is lost — we re-encounter the same friction repeatedly.
- Each agent flavor (Claude, Codex, GrokBuild) has its own environment, instruction
  format, and extension mechanism, so a fix discovered for one is **not portable** to
  the others without manual translation.
- We have no systematic, evidence-based answer to "what is actually slowing our
  agents down, and what consistently makes them faster?" — decisions about tooling,
  prompts, and environments are made on anecdote.

**The cost:** repeated mistakes, non-transferable wins, slow and uneven improvement
of agent performance, and no feedback loop from real session data back into the
tools/environments/instructions that shape future sessions.

## 3. Goals & Non-Goals

### 3.1 Goals

| # | Goal |
|---|------|
| G1 | **Capture** coding sessions from Claude, Codex, and GrokBuild in a normalized, comparable form. |
| G2 | **Detect** recurring *problem patterns* (failure, friction, wasted budget) and *success patterns* (efficient resolutions) from that data. |
| G3 | **Curate** detected patterns into a reviewed catalog of *solution patterns* that humans and agents can discuss and approve. |
| G4 | **Distribute** approved patterns back into agent environments — as instructions, tools, or extensions — in a per-flavor-appropriate form. |
| G5 | **Measure** whether distributed patterns actually improved subsequent sessions (close the loop). |
| G6 | Keep the whole loop **agent-flavor-agnostic at the core**, with thin per-flavor adapters at the edges. |

### 3.2 Non-Goals (initial release)

- Not a replacement for the coding agents themselves; Helix Forge observes and
  improves them, it does not execute coding tasks.
- Not a general APM/observability product; scope is coding-session improvement, not
  arbitrary infrastructure monitoring.
- Not an autonomous self-modifying system — pattern promotion into live agent
  environments requires human approval (HITL) for the first release.
- Not building new model training/fine-tuning pipelines; we optimize *context,
  tooling, and environment*, not model weights.
- Not replacing the Custodian State Hub; Helix Forge is a producer/consumer of hub
  state, not a competing system of record. (See §9.)

## 4. Users & Personas

| Persona | Description | What they need from Helix Forge |
|---------|-------------|----------------------------------|
| **Operator (Bernd)** | Owns the agentic ecosystem; decides which patterns become standards. | A reviewable catalog of patterns with evidence; control over what ships to agents. |
| **Coding agent (Claude / Codex / GrokBuild)** | Runs tasks in a repo; both the *source* of session data and the *consumer* of patterns. | To emit session data cheaply; to receive applicable patterns in its native format at session start. |
| **Repo maintainer agent** | The per-repo agent persona (e.g. `agentic-resources`) following AGENTS.md conventions. | Patterns scoped to its repo/domain; integration via existing workplan + state-hub flow. |
| **Reviewer (human or kaizen agent)** | Evaluates candidate patterns before they become standards. | Clear pattern proposals, supporting evidence, and a discuss/approve/reject workflow. |

## 5. Core Concepts (Domain Model)

- **Session** — one bounded run of a coding agent against a repo. Has an agent flavor,
  repo, task reference, timeline of events, outcome, and cost (tokens/time).
- **Session Event** — a normalized atomic record within a session: tool call, edit,
  test run, error, retry, human intervention, decision, completion.
- **Signal** — a derived indicator extracted from sessions: e.g. *repeated test
  failure on same file*, *budget overrun*, *fast clean resolution*, *retry storm*,
  *human escalation*.
- **Problem Pattern** — a recurring negative signal cluster ("agents repeatedly fail
  X because Y").
- **Success Pattern** — a recurring positive resolution ("doing Z reliably resolves X
  cheaply").
- **Solution Pattern** — a curated, reviewed artifact pairing a problem with one or
  more recommended resolutions, written agent-flavor-agnostically, with per-flavor
  rendering hints.
- **Pattern Application** — the act of distributing a solution pattern into a specific
  agent environment (an instruction snippet, a tool, an extension), plus the record of
  its effect on later sessions.

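To make the first two concepts concrete, here is a minimal sketch of how `Session` and `Session Event` might map onto Python dataclasses. Every field, enum, and method name below is an illustrative assumption; the actual schema is defined in Phase 0 and may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Flavor(str, Enum):
    CLAUDE = "claude"
    CODEX = "codex"
    GROKBUILD = "grokbuild"


class Outcome(str, Enum):
    SUCCESS = "success"
    FAIL = "fail"
    ABANDONED = "abandoned"


@dataclass(frozen=True)
class SessionEvent:
    """One normalized atomic record within a session."""
    kind: str          # assumed vocabulary: "tool_call", "edit", "test_run", "error", "retry", ...
    timestamp: float   # assumed unit: seconds since session start
    detail: str = ""


@dataclass
class Session:
    """One bounded run of a coding agent against a repo."""
    session_id: str
    flavor: Flavor
    repo: str
    outcome: Outcome
    tokens: int = 0
    wall_clock_s: float = 0.0
    task_ref: Optional[str] = None
    events: list[SessionEvent] = field(default_factory=list)

    def count(self, kind: str) -> int:
        """Number of events of a given kind, e.g. retries."""
        return sum(1 for e in self.events if e.kind == kind)
```

Keeping `kind` a plain string rather than an enum is a deliberate choice in this sketch: a new flavor can then emit event kinds the core has not seen yet without a schema change.
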
## 6. Functional Requirements

### 6.1 Capture (G1)

- **FR-C1** Ingest session transcripts/logs from each supported agent flavor via a
  per-flavor **collector adapter**.
- **FR-C2** Normalize raw logs into the common `Session` + `Session Event` schema,
  regardless of source flavor.
- **FR-C3** Tag every session with: agent flavor, repo, domain, task/workplan id (if
  any), outcome (success/fail/abandoned), and cost metrics (tokens, wall-clock,
  retries).
- **FR-C4** Support both **batch import** (historical logs) and **incremental ingest**
  (new sessions as they close).
- **FR-C5** Collection must be low-friction and non-blocking — an agent emitting
  session data must never slow or break the actual coding task.

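The collector-plus-normalizer split in FR-C1 and FR-C2 can be sketched as below. The raw record shapes (`type`/`ts` for Claude, `event`/`time_ms` for Codex) are invented purely for illustration, since the real log formats are still an open question (OQ1). Dropping malformed records instead of raising is likewise an assumption about how a collector might stay out of the way.

```python
from typing import Callable


# Hypothetical raw shapes; the real per-flavor log formats are not yet known (OQ1).
def collect_claude(raw: dict) -> dict:
    return {"kind": raw["type"], "timestamp": raw["ts"], "detail": raw.get("text", "")}


def collect_codex(raw: dict) -> dict:
    return {"kind": raw["event"], "timestamp": raw["time_ms"] / 1000.0, "detail": raw.get("msg", "")}


# One collector per flavor; the normalizer below never looks inside a raw record.
COLLECTORS: dict[str, Callable[[dict], dict]] = {
    "claude": collect_claude,
    "codex": collect_codex,
}


def normalize(flavor: str, raw_events: list[dict]) -> list[dict]:
    """Map flavor-specific records onto the common event shape, ordered by time."""
    collector = COLLECTORS[flavor]
    events = []
    for raw in raw_events:
        try:
            events.append(collector(raw))
        except (KeyError, TypeError):
            continue  # malformed record: skip it rather than fail the import
    return sorted(events, key=lambda e: e["timestamp"])
```
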
### 6.2 Detect (G2)

- **FR-D1** Run signal extractors over normalized sessions to surface problem and
  success signals.
- **FR-D2** Cluster recurring signals across sessions/repos/flavors into candidate
  Problem Patterns and Success Patterns.
- **FR-D3** For each candidate pattern, attach **evidence**: the supporting sessions,
  frequency, affected repos, affected flavors, and estimated cost impact.
- **FR-D4** Flag **cross-flavor** patterns explicitly (a problem seen in Claude that
  Codex also hits) — these are the highest-value reuse targets.

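As a rough illustration of FR-D2 through FR-D4, the sketch below groups signals by an exact `(signal_type, locus)` key and attaches the evidence fields listed above. Exact-key grouping is a stand-in for whatever clustering the detector really uses, and the signal dictionary keys are assumptions.

```python
from collections import defaultdict


def cluster_signals(signals: list[dict], min_frequency: int = 2) -> list[dict]:
    """Group signals by (signal_type, locus) and attach evidence to each group."""
    groups = defaultdict(list)
    for s in signals:
        groups[(s["signal_type"], s["locus"])].append(s)

    candidates = []
    for (signal_type, locus), members in sorted(groups.items()):
        if len(members) < min_frequency:
            continue  # a one-off is not yet a recurring pattern
        flavors = sorted({m["flavor"] for m in members})
        candidates.append({
            "signal_type": signal_type,
            "locus": locus,
            "frequency": len(members),
            "sessions": sorted({m["session_id"] for m in members}),
            "repos": sorted({m["repo"] for m in members}),
            "flavors": flavors,
            "cross_flavor": len(flavors) > 1,  # FR-D4
            "cost_impact_tokens": sum(m.get("tokens", 0) for m in members),
        })
    return candidates
```

Summing tokens is only one possible reading of "estimated cost impact"; wall-clock time or retry counts would slot into the same place.
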
### 6.3 Curate (G3)

- **FR-U1** Present candidate patterns for review with their evidence in a
  discuss/approve/reject workflow.
- **FR-U2** Allow a reviewer (human or kaizen agent) to promote a candidate into a
  **Solution Pattern**: a named, versioned artifact with problem description,
  recommended resolution(s), applicability scope, and per-flavor rendering hints.
- **FR-U3** Maintain a **Pattern Catalog** as the source of truth for approved
  solution patterns, versioned and stored as files in-repo (consistent with ADR-001:
  files originate work, the hub indexes them).
- **FR-U4** Record pattern decisions through the State Hub decision mechanism so
  rationale is auditable.

### 6.4 Distribute (G4)

- **FR-X1** Render each approved solution pattern into per-flavor artifacts via
  **distributor adapters**:
  - Claude → `CLAUDE.md` snippets, skills, or settings/hooks.
  - Codex → `AGENTS.md` snippets / repo conventions.
  - GrokBuild → its native instruction/extension format.
- **FR-X2** Scope distribution by repo and domain, so a pattern only lands where it
  applies.
- **FR-X3** Distribution is **proposed, not auto-applied** in v1 — output is a
  reviewable change (e.g. a workplan or PR), gated by human approval.
- **FR-X4** Track which patterns are currently active in which environments.

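One way to picture FR-X1 through FR-X3 is a distributor that only produces proposals and never writes files. The sketch below is hypothetical throughout: the pattern dictionary layout, the snippet format, and the `propose_distribution` name are assumptions. GrokBuild is left out because its native format is not specified here.

```python
def _snippet(pattern: dict, flavor: str) -> str:
    """Render the flavor-agnostic core plus any hint for this flavor."""
    hint = pattern.get("rendering_hints", {}).get(flavor, "")
    lines = [f"### {pattern['name']} (v{pattern['version']})", pattern["resolution"]]
    if hint:
        lines.append(hint)
    return "\n".join(lines) + "\n"


# Target files follow FR-X1; everything else about these adapters is assumed.
DISTRIBUTORS = {
    "claude": lambda p: ("CLAUDE.md", _snippet(p, "claude")),
    "codex": lambda p: ("AGENTS.md", _snippet(p, "codex")),
}


def propose_distribution(pattern: dict, repo: str) -> list[dict]:
    """Return reviewable proposals for one repo; nothing is written to disk (FR-X3)."""
    if repo not in pattern["scope"]["repos"]:
        return []  # FR-X2: out of scope, so the pattern does not land here
    return [
        {"repo": repo, "flavor": flavor, "target": target, "snippet": text, "status": "proposed"}
        for flavor, render in sorted(DISTRIBUTORS.items())
        for target, text in [render(pattern)]
    ]
```

The returned records double as a starting point for FR-X4: once a reviewer approves a proposal, flipping its `status` would give a per-repo, per-flavor view of what is active.
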
### 6.5 Measure (G5)

- **FR-M1** After a pattern is applied, compare subsequent sessions touching the same
  signal against the pre-application baseline (cost, retry rate, success rate,
  human-intervention rate).
- **FR-M2** Surface per-pattern **effectiveness** so ineffective patterns can be
  revised or retired.
- **FR-M3** Provide a fleet-level view: are sessions across the collection getting
  cheaper / more reliable over time? (the helix turning.)

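The baseline comparison in FR-M1 boils down to a difference of summary statistics. A minimal sketch follows, assuming each session is a dictionary with `tokens`, `retries`, and `success` keys (hypothetical names) and that both groups are non-empty. It reports raw deltas only and makes no claim about statistical significance.

```python
from statistics import mean


def summarize(sessions: list[dict]) -> dict:
    """Per-group averages for the metrics named in FR-M1."""
    return {
        "mean_tokens": mean(s["tokens"] for s in sessions),
        "mean_retries": mean(s["retries"] for s in sessions),
        "success_rate": mean(1.0 if s["success"] else 0.0 for s in sessions),
    }


def effectiveness(baseline: list[dict], after: list[dict]) -> dict:
    """After-minus-baseline deltas: negative cost deltas and a positive success delta are good."""
    b, a = summarize(baseline), summarize(after)
    return {metric: a[metric] - b[metric] for metric in b}
```

With small samples these deltas are noisy, so a real gate for FR-M2 would likely want a minimum session count on both sides before retiring a pattern.
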
### 6.6 Multi-Agent Support (G6)

- **FR-A1** The core schema, detection, catalog, and measurement are **flavor-agnostic**.
- **FR-A2** All flavor-specific knowledge lives in **collector adapters** (input) and
  **distributor adapters** (output). Adding a fourth agent = adding one collector +
  one distributor, no core changes.
- **FR-A3** A successful pattern discovered via one flavor MUST be expressible for all
  other supported flavors.

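FR-A2 amounts to a registry keyed by flavor name. The sketch below shows one possible shape, with `FlavorAdapters`, `register_flavor`, and `render_everywhere` all being hypothetical names. The point is that the core function iterates over whatever is registered and never mentions a flavor by name.

```python
from typing import Callable, NamedTuple


class FlavorAdapters(NamedTuple):
    collect: Callable[[dict], dict]     # raw record -> normalized event
    distribute: Callable[[dict], str]   # solution pattern -> native artifact text


REGISTRY: dict[str, FlavorAdapters] = {}


def register_flavor(name: str, collect: Callable[[dict], dict],
                    distribute: Callable[[dict], str]) -> None:
    """Adding an agent flavor means one call here and no change to core code."""
    REGISTRY[name] = FlavorAdapters(collect, distribute)


def render_everywhere(pattern: dict) -> dict[str, str]:
    """Core code: render a pattern for every registered flavor (FR-A3)."""
    return {name: adapters.distribute(pattern) for name, adapters in REGISTRY.items()}
```
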
## 7. Architecture Overview

```
        ┌──────────── per-flavor edges ────────────┐         ┌──── flavor-agnostic core ────┐
        │                                          │         │                              │
Claude ─┐                                          │         │                              │
Codex  ─┼─► Collector Adapters ──► Normalizer ─────┼────────►│  Session + Event Store       │
Grok   ─┘                                          │         │            │                 │
                                                   │         │            ▼                 │
                                                   │         │  Signal Extractors           │
                                                   │         │            │                 │
                                                   │         │            ▼                 │
                                                   │         │  Pattern Detector / Clusterer│
                                                   │         │            │                 │
                                                   │         │            ▼                 │
                                                   │         │  Curation + Pattern Catalog  │ ◄─ reviewer (human/kaizen)
                                                   │         │            │                 │
Claude ◄┐                                          │         │            ▼                 │
Codex  ◄┼── Distributor Adapters ◄─────────────────┼─────────│  Effectiveness Measurement   │
Grok   ◄┘                                          │         │                              │
        └──────────────────────────────────────────┘         └──────────────────────────────┘
        ▲ feeds back into ▲ tools / environments / instructions
```

**Design principle:** *agnostic core, thin adapters at the edges.* The expensive,
reusable intelligence (normalized sessions, detection, catalog, measurement) is built
once; each agent flavor only needs an input adapter and an output adapter.

## 8. Data & Storage

- **Pattern Catalog** and **workplans**: files in `agentic-resources` (per ADR-001 in
  AGENTS.md — files are the source of truth, the hub indexes them).
- **Session/event data**: a local store (start simple: structured files / SQLite;
  graduate to Postgres alongside the State Hub if volume warrants).
- **Decisions & progress**: recorded through the Custodian State Hub so the broader
  ecosystem stays aware of Helix Forge's activity.

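For the "start simple" SQLite option, a minimal sketch using Python's standard `sqlite3` module might look like this. Table and column names are assumptions, not the project's schema. The ingest is written to be idempotent so that batch re-imports and incremental ingest (FR-C4) can share one code path.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS session (
    session_id TEXT PRIMARY KEY,
    flavor     TEXT NOT NULL,
    repo       TEXT NOT NULL,
    outcome    TEXT NOT NULL,
    tokens     INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS session_event (
    session_id TEXT NOT NULL REFERENCES session(session_id),
    seq        INTEGER NOT NULL,
    kind       TEXT NOT NULL,
    detail     TEXT NOT NULL DEFAULT '',
    PRIMARY KEY (session_id, seq)
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def ingest(conn: sqlite3.Connection, session: dict) -> None:
    """Store one session with its events; re-ingesting a known session is a no-op."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO session VALUES (?, ?, ?, ?, ?)",
            (session["session_id"], session["flavor"], session["repo"],
             session["outcome"], session["tokens"]),
        )
        if cur.rowcount == 0:
            return  # session already present: skip its events too
        conn.executemany(
            "INSERT INTO session_event VALUES (?, ?, ?, ?)",
            [(session["session_id"], i, e["kind"], e.get("detail", ""))
             for i, e in enumerate(session["events"])],
        )
```
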
## 9. Integration with the Custodian State Hub

Helix Forge runs inside the `helix_forge` domain and is **not** a competing system of
record:

- Work originates as **workplans** in this repo (`AGENTIC-WP-NNNN`), synced via
  `make fix-consistency REPO=agentic-resources`.
- Pattern-promotion and distribution decisions are logged via the hub's decision API.
- Each Helix Forge run logs at least one `add_progress_event()` / `POST /progress/`.
- The hub remains a **read model**; Helix Forge writes its durable artifacts as files
  and lets the hub index them.

## 10. Success Metrics

| Metric | Meaning | Target (directional, v1) |
|--------|---------|--------------------------|
| Sessions captured | Coverage of real work | ≥ 90% of sessions across the 3 flavors normalized |
| Patterns cataloged | Knowledge made reusable | A growing, non-trivial catalog of reviewed solution patterns |
| Cross-flavor patterns | Reuse leverage | ≥ 1 pattern proven to transfer across flavors |
| Pattern effectiveness | Loop is closing | Applied patterns show measurable cost/reliability improvement vs. baseline |
| Fleet trend | The helix turns | Median session cost ↓ and success rate ↑ over time |
| Repeated-failure rate | Friction eliminated | Known problem patterns recur less after distribution |

## 11. Phasing / Roadmap

- **Phase 0 — Foundations.** Define the Session/Event schema and Pattern Catalog
  format. One collector adapter (Claude) + batch import. Manual inspection only.
- **Phase 1 — Detect.** Signal extractors + pattern clustering over captured sessions;
  candidate patterns surfaced with evidence. Add Codex + GrokBuild collectors.
- **Phase 2 — Curate.** Review workflow + versioned Pattern Catalog, wired to hub
  decisions.
- **Phase 3 — Distribute.** Distributor adapters for all three flavors; patterns ship
  as reviewable workplans/PRs (HITL).
- **Phase 4 — Measure.** Baseline-vs-after effectiveness and fleet-level trend
  reporting; retire ineffective patterns. Loop is closed.

## 12. Open Questions

- **OQ1** What is the canonical raw log format available from each of Claude, Codex,
  and GrokBuild today, and how lossy is normalization from each?
- **OQ2** How are sessions reliably bounded and attributed to a repo/task across the
  three flavors?
- **OQ3** Where does detection logic run — local batch jobs, hub-side, or a dedicated
  service? What volume do we actually expect?
- ~~**OQ4** Pattern format: how do we keep one agnostic representation while giving each
  distributor enough to render high-quality native artifacts?~~ **Resolved (Phase 2,
  AGENTIC-WP-0004):** the `SolutionPattern` core is flavor-agnostic (problem,
  resolutions, scope, provenance) and carries per-flavor knowledge only in a separate
  `rendering_hints` sub-structure keyed by flavor — distributors read the hints, the
  core stays neutral. Catalogued as versioned files-first artifacts (FR-U3).
- ~~**OQ5** What's the minimum trustworthy evidence bar before a pattern is allowed to be
  distributed to live agent environments?~~ **Resolved (Phase 2):** a two-tier
  evidence bar (`[curate.gate]`). A *promote* floor (frequency / distinct sessions /
  cost-impact) admits a candidate as `provisional`; a stricter *distribution* floor
  (higher frequency, optional cross-flavor requirement, cost-impact) is required to
  mark a pattern `approved` + `distribution_ready`. Defaults are conservative and
  config-tunable.
- ~~**OQ6** How do we prevent pattern bloat — too many low-value instructions degrading
  agent context budgets (cf. the token-budget policy in global instructions)?~~
  **Resolved (Phase 2):** a bloat guard flags duplicate (same id) and near-duplicate
  (same signal-type+locus) candidates at review time, and the catalog dedups
  structurally on the source-candidate key so re-promotion never multiplies entries.
  Thin candidates stay `provisional` (not distributed) rather than padding live
  context.

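The OQ5 and OQ6 resolutions can be illustrated together. The sketch below is a hedged reading of the two-tier gate and of structural dedup: the threshold values, the `Floor` type, the `below_floor` status, and the choice of `(signal_type, locus)` as the source-candidate key are all assumptions. The real defaults live in the `[curate.gate]` config and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Floor:
    min_frequency: int
    min_sessions: int
    min_cost_impact: int
    require_cross_flavor: bool = False


# Hypothetical thresholds, for illustration only.
PROMOTE = Floor(min_frequency=2, min_sessions=2, min_cost_impact=0)
DISTRIBUTE = Floor(min_frequency=5, min_sessions=3, min_cost_impact=1000,
                   require_cross_flavor=True)


def passes(candidate: dict, floor: Floor) -> bool:
    return (
        candidate["frequency"] >= floor.min_frequency
        and len(candidate["sessions"]) >= floor.min_sessions
        and candidate["cost_impact"] >= floor.min_cost_impact
        and (candidate["cross_flavor"] or not floor.require_cross_flavor)
    )


def gate(candidate: dict) -> dict:
    """Two-tier evidence bar: the stricter distribution floor is checked first."""
    if passes(candidate, DISTRIBUTE):
        return {"status": "approved", "distribution_ready": True}
    if passes(candidate, PROMOTE):
        return {"status": "provisional", "distribution_ready": False}
    return {"status": "below_floor", "distribution_ready": False}  # assumed label


def promote(catalog: dict, candidate: dict) -> dict:
    """Dedup on the source-candidate key so re-promotion never adds a second entry."""
    key = (candidate["signal_type"], candidate["locus"])  # assumed key shape
    if key not in catalog:
        catalog[key] = gate(candidate)
    return catalog[key]
```

Because `promote` is keyed rather than append-only, running curation twice over the same candidates leaves the catalog unchanged.
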
## 13. Risks

| Risk | Mitigation |
|------|------------|
| Capture overhead slows real coding sessions | Async, non-blocking collection (FR-C5); never in the agent's critical path. |
| Patterns become noise / context bloat | Effectiveness gating (FR-M2) + retirement; measure before broad distribution. |
| Over-fitting to one flavor | Agnostic core + explicit cross-flavor flagging (FR-D4, FR-A3). |
| Bad pattern degrades agents | HITL approval before distribution (FR-X3); baseline measurement to catch regressions. |
| Drift from State Hub conventions | Files-first per ADR-001; log via hub; no competing source of record. |

---

*This PRD is a draft for discussion. Next step: a `proposed` workplan
(`AGENTIC-WP-0002`) scoping Phase 0 — the Session/Event schema and the first
(Claude) collector adapter.*