tegwick 0d05dfcc5d session-memory: weekly retro entrypoint + hub publish (AGENTIC-WP-0010)

The analysis half of the weekly coding retrospection. retro/build.py: windowed
detect+measure -> top-3 improvement suggestions per repo (cross-flavor first,
recommendations pulled from the Pattern Catalog) + fleet snapshot. retro/publish.py:
publishes the report to the hub as the coding_retro read model (event_type=
coding_retro progress event) + local JSON/md, graceful degrade. retro entrypoint
with --window-days/--publish/--json. Live verify over real sessions surfaced
per-repo suggestions with catalog recommendations. 13 new tests; suite 152/152.

Consumed by activity-core ACTIVITY-WP-0008 (Weekly Coding Retrospection, Sat 19:00).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-07 19:17:24 +02:00

11 KiB


session_memory

Capture + retention layer for Helix Forge — the Capture stage of the loop in ../docs/PRD-helix-forge.md, built to the ../docs/DESIGN-session-memory.md spec.

It scans coding-agent session logs, normalizes them into one schema, distills a compact per-session digest, and evicts raw session data under a storage budget rather than a fixed time window: a session's raw data is dropped once it has been analyzed and space is needed.

Layout

session_memory/
  adapters/common.py   # shared Normalized bundle + helpers
  adapters/claude.py   # Tier0 -> Tier1 normalizers, one per flavor
  adapters/codex.py    #   (rollout {timestamp,type,payload}, flat call_id join)
  adapters/grok.py     #   (per-session dir: chat_history + events + updates)
  core/schema.py       # Session / SessionEvent / Cost
  core/store.py        # SQLite rows + blob-dir bodies (Tier1) + digests/patterns (Tier2)
  core/cursor.py       # incremental ingest cursors
  core/digest.py       # Tier1 -> Tier2 promotion + outcome heuristic
  core/retention.py    # budget-based eviction sweep
  ingest.py            # one sweep: discover -> normalize -> store -> digest -> evict
  detect/signals.py    # signal extractors over digests
  detect/cluster.py    # cluster signals -> candidate patterns + cross-flavor flag
  detect/__main__.py   # python -m session_memory.detect (ranked report)
  curate/schema.py     # SolutionPattern artifact + per-flavor rendering hints
  curate/catalog.py    # versioned, files-first Pattern Catalog (dedup on id)
  curate/gating.py     # promotion evidence bar + bloat guard
  curate/review.py     # discuss/approve/reject -> promote workflow
  curate/decisions.py  # hub decision audit trail (graceful local-queue fallback)
  curate/__main__.py   # python -m session_memory.curate (interactive / --auto-approve)
  catalog/             # the committed Pattern Catalog (source of truth)
  distribute/base.py   # Artifact + Distributor protocol + idempotent snippet markers
  distribute/claude.py # CLAUDE.md (or skill) renderer    } per-flavor edges
  distribute/codex.py  # AGENTS.md renderer                } (agnostic body,
  distribute/grok.py   # native instruction renderer       }  different targets)
  distribute/proposals.py  # scoping + proposed-not-applied output + active registry
  distribute/__main__.py   # python -m session_memory.distribute
  measure/metrics.py   # fleet metrics + persisted baseline snapshots
  measure/effect.py    # before/after per-pattern effectiveness
  measure/__main__.py  # python -m session_memory.measure
  retro/build.py       # windowed top-3-per-repo suggestions
  retro/publish.py     # hub coding_retro read model + local report
  retro/__main__.py    # python -m session_memory.retro
  config.toml          # store paths, retention caps, sources, repo->domain map, curate gate

The local store lives under session_memory/.store/ (gitignored).

Run a sweep

# from the repo root
python -m session_memory.ingest                 # ingest + analyze + evict
python -m session_memory.ingest --dry-run       # discover + parse only, writes nothing
python -m session_memory.ingest --config path/to/config.toml

Output reports discovered / ingested / skipped_unchanged / analyzed and a retention line (freed, final_usage, and per-pass eviction counts). Sweeps are idempotent — re-running skips unchanged files via the cursor.
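The cursor-based skip described above could look roughly like the sketch below. It is not the code in core/cursor.py: the Cursor class, the (mtime, size) fingerprint, and the sweep function are all hypothetical names chosen for illustration, and the real cursor may track something else (for example a byte offset or content hash).

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Cursor:
    """Hypothetical in-memory cursor: path -> (mtime_ns, size) at last ingest."""
    seen: dict[str, tuple[int, int]] = field(default_factory=dict)

    def is_unchanged(self, path: Path) -> bool:
        st = path.stat()
        return self.seen.get(str(path)) == (st.st_mtime_ns, st.st_size)

    def mark(self, path: Path) -> None:
        st = path.stat()
        self.seen[str(path)] = (st.st_mtime_ns, st.st_size)


def sweep(paths: list[Path], cursor: Cursor) -> dict[str, int]:
    """Count what one sweep would do; unchanged files are skipped."""
    counts = {"discovered": 0, "ingested": 0, "skipped_unchanged": 0}
    for path in paths:
        counts["discovered"] += 1
        if cursor.is_unchanged(path):
            counts["skipped_unchanged"] += 1
            continue
        # normalize -> store -> digest would happen here (omitted in this sketch)
        cursor.mark(path)
        counts["ingested"] += 1
    return counts
```

Running sweep twice over the same untouched files ingests them once and reports them as skipped_unchanged the second time, which is the idempotence property the section describes.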

Scheduling (cadence)

Retention is budget-based; the cadence in config.toml only decides how often the sweep runs. Trigger it with the repo scheduler, e.g. daily:

# Claude Code: schedule a daily routine that runs the sweep
/schedule "daily session-memory sweep" -- python -m session_memory.ingest

Alternatively, use a cron entry or /loop on a timer. Push-capture (agent Stop/SessionEnd hooks) can also enqueue a sweep; see design §7.

Detect candidate patterns

After ingesting, mine the digests for recurring problem/success patterns:

python -m session_memory.detect                 # ranked report, cross-flavor first
python -m session_memory.detect --json          # machine-readable candidates
python -m session_memory.detect --min-frequency 3

Candidates are persisted to a Tier 2 patterns table and are the input to the Curate phase (Phase 2). Patterns whose evidence spans more than one agent flavor are flagged [CROSS-FLAVOR] — the highest-value reuse targets.
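As a rough illustration of the cross-flavor flag and the --min-frequency floor, the sketch below groups signals by a key and flags groups whose evidence spans more than one flavor. The Signal fields, the grouping key, and the sort order are assumptions made for this example; detect/cluster.py may cluster on different features.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    key: str         # hypothetical pattern key, e.g. an error fingerprint
    flavor: str      # "claude", "codex" or "grok"
    session_id: str


def cluster(signals, min_frequency=2):
    """Group signals by key and return candidates, cross-flavor first."""
    groups = defaultdict(list)
    for s in signals:
        groups[s.key].append(s)

    candidates = []
    for key, members in groups.items():
        if len(members) < min_frequency:
            continue  # below the frequency floor, not a candidate
        flavors = sorted({m.flavor for m in members})
        candidates.append({
            "key": key,
            "frequency": len(members),
            "flavors": flavors,
            "cross_flavor": len(flavors) > 1,
        })

    # Cross-flavor candidates first, then by descending frequency.
    candidates.sort(key=lambda c: (not c["cross_flavor"], -c["frequency"], c["key"]))
    return candidates
```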

Curate candidates into the Pattern Catalog

Review detect candidates into versioned Solution Patterns held in the files-first catalog (session_memory/catalog/). The flow is detect → curate → (Phase 3) distribute; curate refreshes candidates by running detect first.

python -m session_memory.curate                 # interactive review (a/r/d per candidate)
python -m session_memory.curate --auto-approve  # batch: promote all that clear the evidence bar
python -m session_memory.curate --json          # machine-readable result

Promotion writes a SolutionPattern file (id = source candidate key, so re-promoting the same candidate dedups; content changes bump the semver and archive the prior version to <id>.history.jsonl).
The evidence bar ([curate.gate]) sets two floors: a promote floor and a stricter distribution floor. A thin-but-real candidate lands provisional; one clearing the distribution floor lands approved + distribution_ready.
A bloat guard flags duplicate / near-duplicate candidates so the catalog stays lean.
Re-review is idempotent — a remembered decision is skipped unless the candidate's evidence changed; a prior reject is not re-surfaced.
Each final promote/reject is recorded as a hub decision; if the hub is offline the decision is queued to [curate].decision_queue for later sync (the same after-the-fact pattern used in Phase 1).

Curate knobs (`[curate]` / `[curate.gate]` in config.toml)

Key	Meaning
`catalog_dir`	committed Pattern Catalog dir (source of truth)
`review_log` / `decision_queue`	remembered decisions + pending hub decisions (gitignored)
`min_frequency` / `min_sessions` / `min_cost_impact`	floor to promote at all
`dist_require_cross_flavor`	require cross-flavor evidence to be distribution-eligible
`dist_min_frequency` / `dist_min_cost_impact`	stricter floor for `distribution_ready`
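The two-floor evidence bar can be sketched as follows. The field names mirror the keys in the table above, but the default values, the Candidate shape, and the evaluate function are illustrative assumptions and not the logic in curate/gating.py.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    # Names mirror the [curate.gate] keys; the values are illustrative only.
    min_frequency: int = 2
    min_sessions: int = 2
    min_cost_impact: float = 0.0
    dist_require_cross_flavor: bool = True
    dist_min_frequency: int = 5
    dist_min_cost_impact: float = 1.0


@dataclass
class Candidate:
    frequency: int
    sessions: int
    cost_impact: float
    cross_flavor: bool


def evaluate(c: Candidate, g: Gate):
    """Return (status, distribution_ready); status is None below the promote floor."""
    promotable = (
        c.frequency >= g.min_frequency
        and c.sessions >= g.min_sessions
        and c.cost_impact >= g.min_cost_impact
    )
    if not promotable:
        return (None, False)

    distributable = (
        c.frequency >= g.dist_min_frequency
        and c.cost_impact >= g.dist_min_cost_impact
        and (c.cross_flavor or not g.dist_require_cross_flavor)
    )
    return ("approved", True) if distributable else ("provisional", False)
```

Under these assumed defaults, a candidate seen three times in two sessions lands provisional, while one seen six times across flavors with enough cost impact lands approved and distribution_ready.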

Distribute patterns as per-flavor proposals

Render approved catalog patterns into per-flavor artifacts — proposed, never auto-applied (HITL). Completes the loop: detect → curate → distribute.

python -m session_memory.distribute                 # proposals for all repos/flavors
python -m session_memory.distribute --repo state-hub --flavor claude
python -m session_memory.distribute --json

Only approved + distribution_ready patterns are rendered; each pattern's Scope (repos/domains/flavors) decides where it lands (FR-X2).
Each flavor renders the same agnostic body to its own target (Claude → CLAUDE.md/skill, Codex → AGENTS.md, Grok → native) via rendering_hints (FR-A3); blocks carry stable BEGIN/END markers so re-running updates in place.
Output goes to session_memory/proposals/<repo>/<target> (gitignored, regenerated) — a reviewable diff a human applies (FR-X3). The committed distribute/active_patterns.json records which pattern+version is proposed in which (repo, flavor) (FR-X4).
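To show why stable BEGIN/END markers make re-runs update in place, here is a minimal sketch. The HTML-comment marker format and the upsert_block helper are assumptions for illustration; the actual marker syntax used by distribute/base.py is not shown in this README.

```python
import re


def upsert_block(text: str, pattern_id: str, body: str) -> str:
    """Insert or replace the marked block for pattern_id inside text."""
    begin = f"<!-- BEGIN pattern:{pattern_id} -->"   # hypothetical marker format
    end = f"<!-- END pattern:{pattern_id} -->"
    block = f"{begin}\n{body.strip()}\n{end}"

    existing = re.compile(re.escape(begin) + r".*?" + re.escape(end), re.DOTALL)
    if existing.search(text):
        # Replace in place; a function replacement avoids backslash escapes in body.
        return existing.sub(lambda _m: block, text, count=1)

    # No block yet: append it, separated from prior content by a blank line.
    if not text or text.endswith("\n\n"):
        sep = ""
    elif text.endswith("\n"):
        sep = "\n"
    else:
        sep = "\n\n"
    return f"{text}{sep}{block}\n"
```

Calling upsert_block twice with the same body returns identical text, and calling it with a changed body rewrites only the marked region, leaving hand-written content around it untouched.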

Measure effectiveness (closing the loop)

Track whether the fleet is getting cheaper / more reliable, and whether a distributed pattern actually helped.

python -m session_memory.measure --label "baseline"      # snapshot + trend
python -m session_memory.measure --since 2026-06-07      # before/after a change
python -m session_memory.measure --no-save --json

A snapshot (infra-overhead share, error rate, schema-thrash, token percentiles, success rate) is appended to measure/baselines.jsonl to build a trend (FR-M3).
--since DATE splits sessions into those before and after a change and diffs the metrics, giving an improved verdict per metric (FR-M1/FR-M2), so ineffective patterns can be retired. Recorded pre-fix baseline (2026-06-07): 27 sessions, infra-overhead median 11.7%, error rate 0.96, schema-thrash 8 sessions.
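The before/after comparison can be sketched as below. The SessionDigest fields, the two metrics chosen, and the definition of error rate as the fraction of sessions with at least one error are assumptions for this example; measure/effect.py may define its metrics differently.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass
class SessionDigest:
    day: date
    infra_overhead: float   # assumed: share of effort spent on infrastructure, 0..1
    had_error: bool


def snapshot(sessions):
    """Summarize a group of sessions; None when the group is empty."""
    if not sessions:
        return None
    return {
        "infra_overhead_median": median(s.infra_overhead for s in sessions),
        "error_rate": sum(s.had_error for s in sessions) / len(sessions),
    }


def compare(sessions, since: date):
    """Split at `since` and diff each metric; lower is treated as better."""
    before = snapshot([s for s in sessions if s.day < since])
    after = snapshot([s for s in sessions if s.day >= since])
    if before is None or after is None:
        return None  # nothing to compare on one side of the split
    return {
        name: {
            "before": before[name],
            "after": after[name],
            "improved": after[name] < before[name],
        }
        for name in before
    }
```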

Weekly retro (the input to the scheduled retrospection)

A windowed roll-up: detect + measure over the last N days → the top-3 improvement suggestions per repo (cross-flavor first; recommendations pulled from the Pattern Catalog) → published to the hub as the coding_retro read model.

python -m session_memory.retro                      # last 7 days, local report
python -m session_memory.retro --window-days 30 --json
python -m session_memory.retro --publish            # also post coding_retro to the hub

Writes retro/last_retro.{json,md} and (with --publish) posts an event_type=coding_retro progress event. This is consumed by activity-core's Weekly Coding Retrospection schedule (ACTIVITY-WP-0008, Saturday 19:00 Berlin), which emits one improvement task per relevant repo. Hub publish degrades gracefully when the hub is unreachable.
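The "top-3 per repo, cross-flavor first" selection might be sketched like this. The candidate dict shape, the catalog lookup by key, and the fallback text are hypothetical; retro/build.py may rank on additional measures.

```python
from collections import defaultdict


def top_suggestions(candidates, catalog, per_repo=3):
    """Pick the top candidates per repo and attach catalog recommendations."""
    by_repo = defaultdict(list)
    for c in candidates:
        by_repo[c["repo"]].append(c)

    report = {}
    for repo, items in by_repo.items():
        # Cross-flavor candidates first, then by descending frequency.
        ranked = sorted(items, key=lambda c: (not c["cross_flavor"], -c["frequency"]))
        report[repo] = [
            {
                "key": c["key"],
                "recommendation": catalog.get(c["key"], "no catalog entry yet"),
            }
            for c in ranked[:per_repo]
        ]
    return report
```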

Retention knobs (`[retention]` in config.toml)

Key	Meaning
`raw_soft_cap_bytes`	begin evicting analyzed sessions above this (oldest first)
`raw_hard_cap_bytes`	absolute Tier 1 ceiling; overflow path may, as a last resort, evict un-analyzed sessions and report `data_loss`
`raw_max_age_days`	backstop: analyzed raw older than this is evictable regardless of space
`distilled_cap_bytes`	Tier 2 ceiling — alert only, never auto-dropped

Invariant: a session's raw bytes are never dropped before its Tier 2 digest exists, except on the explicitly reported hard-cap overflow path.
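The knobs and the invariant above suggest a three-pass sweep, sketched here under stated assumptions. The RawSession shape, the pass order, and the plan_eviction function are illustrative and not taken from core/retention.py; only the knob meanings come from the table.

```python
from dataclasses import dataclass


@dataclass
class RawSession:
    session_id: str
    size: int        # raw bytes held in Tier 1
    age_days: int
    analyzed: bool   # True once a Tier 2 digest exists


def plan_eviction(sessions, soft_cap, hard_cap, max_age_days):
    """Decide which raw sessions to drop; un-analyzed drops are reported as data_loss."""
    usage = sum(s.size for s in sessions)
    oldest_first = sorted(sessions, key=lambda s: s.age_days, reverse=True)
    evicted: list[str] = []
    data_loss: list[str] = []

    def drop(s, bucket):
        nonlocal usage
        bucket.append(s.session_id)
        usage -= s.size

    # Pass 1: age backstop, analyzed sessions only.
    for s in oldest_first:
        if s.analyzed and s.age_days > max_age_days:
            drop(s, evicted)

    # Pass 2: soft cap, analyzed sessions only, oldest first.
    for s in oldest_first:
        if usage <= soft_cap:
            break
        if s.analyzed and s.session_id not in evicted:
            drop(s, evicted)

    # Pass 3: hard-cap overflow, last resort, may drop un-analyzed sessions.
    for s in oldest_first:
        if usage <= hard_cap:
            break
        if not s.analyzed:
            drop(s, data_loss)

    return {"evicted": evicted, "data_loss": data_loss, "final_usage": usage}
```

In this sketch an un-analyzed session can only leave through the third pass, and every such drop is listed under data_loss, which mirrors the invariant.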

Tests

python -m pytest          # schema, adapters, store, digest, retention, ingest, detect, curate

Status

Phase 0 (AGENTIC-WP-0002): schema, store, digest, budget retention, Claude adapter, ingest sweep.
Phase 1 (AGENTIC-WP-0003): Codex + Grok adapters, multi-file session merge, and the Detect pipeline (signals → clustering → cross-flavor candidate patterns).
Phase 2 (AGENTIC-WP-0004): Curate — Solution Pattern schema, versioned files-first Pattern Catalog, discuss/approve/reject review with an evidence bar + bloat guard, and hub-decision audit trail.
Detect hardening (AGENTIC-WP-0005): session-quality filter + tool-mix / infra-overhead signals. Error mining (AGENTIC-WP-0006): recurring error fingerprints → root-cause patterns.
Phase 3 (AGENTIC-WP-0007): Distribute — per-flavor distributor adapters render approved patterns into proposed (HITL) artifacts, scoped by repo/domain, with an active-pattern registry.
Phase 4 (AGENTIC-WP-0009): Measure — fleet baseline/trend + before/after per-pattern effectiveness. The Capture → Detect → Curate → Distribute → Measure loop is closed.
Weekly retro (AGENTIC-WP-0010): windowed detect + measure roll-up producing the top-3 improvement suggestions per repo, published to the hub as the coding_retro read model and consumed by activity-core ACTIVITY-WP-0008.
