Normalize agent instructions and workplan frontmatter (STATE-WP-0067)

- Align agent files with on-disk workplan prefixes (infer from workplan ids)
- Set workplan domain to registered domain_slug; add topic_slug where applicable
- Repair frontmatter delimiter formatting; migrate legacy task status literals
- Regenerate AGENTS.md, CLAUDE.md, and .claude/rules from State Hub templates
chore(consistency): sync task status from DB [auto]
2026-06-22 23:16:24 +02:00 · 2026-06-22 20:28:07 +02:00 · 2026-06-22 17:47:36 +02:00 · 2026-06-22 01:30:22 +02:00 · 2026-06-22 01:15:09 +02:00 · 2026-06-16 01:52:33 +02:00
20 changed files with 1220 additions and 2 deletions
--- a/.claude/rules/agents.md
+++ b/.claude/rules/agents.md
@@ -0,0 +1,20 @@
+## Kaizen Agents
+
+Specialized agent personas available on demand via the state-hub MCP.
+
+**Discover:** `list_kaizen_agents()` — returns all agents with name, description, category
+**Load:** `get_kaizen_agent("tdd-workflow")` — returns full instructions; read and follow them
+
+Common agents:
+
+| Agent | Category | When to use |
+|-------|----------|-------------|
+| `tdd-workflow` | testing | Step-by-step TDD8 workflow for any feature |
+| `code-refactoring` | quality | Code quality analysis and safe refactoring |
+| `test-maintenance` | testing | Diagnose and fix failing tests |
+| `requirements-engineering` | process | Prevent interface/mock mismatches upfront |
+| `keepaTodofile` | process | Maintain TODO.md during work |
+| `project-management` | process | Track status, determine next steps |
+| `datamodel-optimization` | quality | Optimize dataclasses and data structures |
+
+All 17 agents: call `list_kaizen_agents()` for the full list.
--- a/.claude/rules/architecture.md
+++ b/.claude/rules/architecture.md
@@ -0,0 +1,8 @@
+## Architecture
+
+<!-- TODO: Describe the key design decisions and component structure.
+     Key modules, data flows, external integrations, state machines, etc. -->
+
+## Quick Reference
+
+`~/state-hub/mcp_server/TOOLS.md` — MCP tool reference
--- a/.claude/rules/credential-routing.md
+++ b/.claude/rules/credential-routing.md
@@ -0,0 +1,50 @@
+# Credential and access routing
+
+**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
+for inference. Run this check **before** requesting secrets, API keys, SSH access,
+login tokens, or database passwords — in any repo, not only `ops-warden`.
+
+ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
+other credential need belongs to another subsystem. **Do not** message
+`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
+
+### Lookup (do this first)
+
+```bash
+warden route find "<describe your need>" --json
+warden route show <catalog-id> --json
+```
+
+Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
+
+| Agent runtime | How to orient |
+| --- | --- |
+| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=evidence-source` is for coordination, not secret vending |
+| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
+| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
+
+### Quick routing table
+
+| I need… | Owner | ops-warden executes? |
+| --- | --- | --- |
+| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
+| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
+| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
+| Authorization decision | flex-auth | No — route only |
+| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
+| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
+
+### Anti-patterns (do not do these)
+
+- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
+- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
+- Pasting secrets into Git, State Hub, workplans, logs, or chat
+
+### Other capabilities (reuse-surface)
+
+Non-credential capabilities are usually discovered through **reuse-surface** federation
+(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
+every repo's agent instructions because it is high-frequency, high-risk, and easy to
+get wrong.
+
+**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
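Review note: the lookup step above can also be scripted. The following is a minimal sketch, assuming only the two `warden route` commands and the `--json` flag shown in the file. The helper names (`route_find_argv`, `route_show_argv`, `run_json`) are hypothetical, and because the JSON output shape is not documented here, the decoded value is returned unchanged.

```python
import json
import subprocess
from typing import Any, Callable, List, Optional, Sequence


def route_find_argv(need: str) -> List[str]:
    """Build the argv for: warden route find "<need>" --json"""
    return ["warden", "route", "find", need, "--json"]


def route_show_argv(catalog_id: str) -> List[str]:
    """Build the argv for: warden route show <catalog-id> --json"""
    return ["warden", "route", "show", catalog_id, "--json"]


def run_json(
    argv: Sequence[str],
    runner: Optional[Callable[[Sequence[str]], str]] = None,
) -> Any:
    """Run a command and decode its stdout as JSON.

    `runner` is injectable so the function can be exercised without the
    real CLI; by default the command is executed with subprocess.
    """
    if runner is None:
        def runner(cmd: Sequence[str]) -> str:
            result = subprocess.run(
                list(cmd), check=True, capture_output=True, text=True
            )
            return result.stdout
    # The JSON structure is an unknown here, so no fields are assumed.
    return json.loads(runner(argv))
```

Passing the need as a single argv element avoids shell quoting problems when the description contains spaces or quotes.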
--- a/.claude/rules/first-session.md
+++ b/.claude/rules/first-session.md
@@ -0,0 +1,38 @@
+## First Session Protocol
+
+Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
+The project is registered but work has not yet been structured.
+
+**Step 1 — Read, don't write**
+- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
+- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
+- Scan repo root: README, directory structure, existing code or docs
+
+**Step 2 — Survey in-progress work**
+Look for TODOs, open branches, half-finished files. Note done vs. started but incomplete.
+
+**Step 3 — Propose workstreams to Bernd**
+Propose 1–3 workstreams — each a coherent strand, weeks to months, anchored to a
+roadmap phase. **Wait for approval before creating.**
+
+**Step 4 — Create workplan file first, then DB record (ADR-001)**
+```
+workplans/ESRC-WP-NNNN-<slug>.md   ← write this first
+```
+Then register in the hub:
+```
+create_workstream(topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", title="...", owner="...", description="...")
+create_task(workstream_id="<id>", title="...", priority="high|medium|low")
+```
+
+**Step 5 — Record the setup**
+```
+add_progress_event(
+    summary="First session: structured infotech into N workstreams, M tasks",
+    event_type="milestone",
+    topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
+    detail={"workstreams": [...], "tasks_created": M}
+)
+```
+
+<!-- Delete or archive this file once past first session -->
--- a/.claude/rules/repo-boundary.md
+++ b/.claude/rules/repo-boundary.md
@@ -0,0 +1,8 @@
+## Repo boundary
+
+This repo owns **evidence-source** only. It does not own:
+
+<!-- TODO: List what belongs in adjacent repos, e.g.:
+- SSH key management → railiance-infra/
+- State hub code     → state-hub/
+-->
--- a/.claude/rules/repo-identity.md
+++ b/.claude/rules/repo-identity.md
@@ -0,0 +1,5 @@
+**Purpose:** Document ingestion, extraction, fingerprinting, citation recovery. Depends only on citation-engine. INTENT-only during umbrella-first MVP.
+
+**Domain:** infotech
+**Repo slug:** evidence-source
+**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a
--- a/.claude/rules/session-protocol.md
+++ b/.claude/rules/session-protocol.md
@@ -0,0 +1,85 @@
+## Session Protocol
+
+Dev Hub (State Hub API): http://127.0.0.1:8000
+MCP server name in `~/.claude.json`: `dev-hub`
+
+**Step 1 — Orient**
+
+Read the offline-safe brief first — it works without a live hub connection:
+```bash
+cat .custodian-brief.md
+```
+Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
+```
+get_domain_summary("infotech")
+```
+If MCP tools are unavailable in the current agent session, use the REST API:
+```bash
+curl -s "http://127.0.0.1:8000/state/summary" | python3 -m json.tool
+```
+If the hub is offline: `cd ~/state-hub && make api`
+
+**Step 2 — Check inbox**
+With MCP tools:
+```
+get_messages(to_agent="evidence-source", unread_only=True)
+```
+Mark read with `mark_message_read(message_id)`. Reply or act on coordination
+requests before proceeding.
+
+Without MCP tools:
+```bash
+curl -s "http://127.0.0.1:8000/messages/?to_agent=evidence-source&unread_only=true" \
+  | python3 -m json.tool
+curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
+  -H "Content-Type: application/json" -d '{}'
+```
+
+**Step 3 — Scan workplans**
+```bash
+ls workplans/
+```
+For each file with `status: ready`, `active`, or `blocked`, note pending
+`wait`/`todo`/`progress` tasks.
+
+**Step 4 — Present brief**
+
+1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
+2. **Pending tasks** from `workplans/` + any `[repo:evidence-source]` hub tasks
+3. **Goal guidance** — if `goal_guidance` in summary:
+   - `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
+   - `alignment_warnings`: flag if active work is not aligned with current goal
+4. **Suggested next action** — highest-priority open item
+5. **SBOM status** — flag if `last_sbom_at` is unset for this repo
+
+If no workstreams: follow First Session Protocol (`first-session.md`).
+
+**During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`
+
+> State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
+> are First Session Protocol only. Work structure belongs in repo files (ADR-001).
+
+**Session close:**
+With MCP tools:
+```
+add_progress_event(summary="...", topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", workstream_id="<uuid>")
+```
+Without MCP tools:
+```bash
+curl -s -X POST http://127.0.0.1:8000/progress/ \
+  -H "Content-Type: application/json" \
+  -d '{"topic_id":"cee7bedf-2b48-46ef-8601-006474f2ad7a","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
+```
+If workplan files were modified, ensure the local copy is up to date first:
+```bash
+git -C <repo_path> pull --ff-only
+cd ~/state-hub && make fix-consistency REPO=evidence-source
+```
+For repos where implementation runs on a remote machine (e.g. CoulombCore),
+use the combined target which pulls before fixing:
+```bash
+cd ~/state-hub && make fix-consistency-remote REPO=evidence-source
+```
+**C-15** (DB task ahead of file) is normal in multi-machine workflows — writeback
+will sync the file to match DB.  **C-16** (repo behind remote) blocks all writes
+until you pull — intentional to prevent clobbering remote progress.
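Review note: the REST fallback in this file (inbox check and session-close progress event) can be expressed with the Python standard library. This is a sketch under the assumption that the endpoints and JSON fields are exactly those shown in the `curl` examples above. The function names are hypothetical and nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

HUB = "http://127.0.0.1:8000"  # State Hub API base URL from the protocol


def inbox_url(agent: str, unread_only: bool = True, base: str = HUB) -> str:
    """URL for the inbox query used in Step 2."""
    query = urlencode(
        {"to_agent": agent, "unread_only": str(unread_only).lower()}
    )
    return f"{base}/messages/?{query}"


def progress_request(
    topic_id: str,
    workstream_id: str,
    summary: str,
    author: str = "codex",
    event_type: str = "note",
    base: str = HUB,
) -> Request:
    """Build (but do not send) the session-close POST to /progress/."""
    body = {
        "topic_id": topic_id,
        "workstream_id": workstream_id,
        "event_type": event_type,
        "summary": summary,
        "author": author,
    }
    return Request(
        f"{base}/progress/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Building the request separately from sending it keeps the payload inspectable, which helps when the hub is offline.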
--- a/.claude/rules/stack-and-commands.md
+++ b/.claude/rules/stack-and-commands.md
@@ -0,0 +1,19 @@
+## Stack
+
+<!-- TODO: Fill in language, frameworks, and key dependencies -->
+- **Language:**
+- **Key deps:**
+
+## Dev Commands
+
+```bash
+# TODO: Fill in the standard commands for this repo
+
+# Install dependencies
+
+# Run tests
+
+# Lint / type check
+
+# Build / package (if applicable)
+```
--- a/.claude/rules/workplan-convention.md
+++ b/.claude/rules/workplan-convention.md
@@ -0,0 +1,40 @@
+## Workplan Convention (ADR-001)
+
+File location: `workplans/ESRC-WP-NNNN-<slug>.md`
+ID prefix: `ESRC-WP-`
+
+Work items originate as files in this repo **before** being registered in the hub.
+
+Canonical workplan/workstream frontmatter statuses are:
+`proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
+Use `proposed` for a newly drafted plan, `ready` after review against current
+repo state, and `finished` when implementation is complete. `stalled` and
+`needs_review` are derived health labels, not stored statuses.
+
+Closed workplans may be moved to `workplans/archived/` with a completion-date
+prefix: `YYMMDD-ESRC-WP-NNNN-<slug>.md`. The frontmatter id remains
+unchanged; the prefix is only for quick visual reference.
+
+Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
+`workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
+`ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
+directly. Promote anything requiring analysis, design, approval, dependencies, or
+multiple planned phases into a normal workplan.
+
+Ecosystem todos from other agents arrive as `[repo:evidence-source]` hub tasks —
+visible at session start. Pick one up by creating the workplan file, then registering
+the workstream.
+
+Task blocks use this shape:
+
+```task
+id: ESRC-WP-NNNN-T01
+status: wait | todo | progress | done | cancel
+priority: high | medium | low
+state_hub_task_id: "<uuid>"         # written by fix-consistency — do not edit
+```
+
+Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
+blocked work and `cancel` for stopped work.
+
+<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->
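Review note: the task block shape above is simple enough to check mechanically. The sketch below is not the `fix-consistency` implementation; it is a hypothetical parser that assumes each block is fenced with the `task` info string and holds flat `key: value` lines, as in the example in this file.

```python
import re
from typing import Dict, List

# Task statuses listed in the convention above.
TASK_STATUSES = {"wait", "todo", "progress", "done", "cancel"}

# A fenced block whose info string is exactly "task".
TASK_BLOCK = re.compile(r"^`{3}task\n(.*?)^`{3}", re.DOTALL | re.MULTILINE)


def parse_task_blocks(markdown: str) -> List[Dict[str, str]]:
    """Extract the key/value fields of every task block in a workplan."""
    tasks = []
    for match in TASK_BLOCK.finditer(markdown):
        fields: Dict[str, str] = {}
        for line in match.group(1).splitlines():
            line = line.split("#", 1)[0].strip()  # drop trailing comments
            if ":" not in line:
                continue
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip().strip('"')
        tasks.append(fields)
    return tasks


def invalid_statuses(tasks: List[Dict[str, str]]) -> List[str]:
    """Ids of tasks whose status is not one of the five allowed literals."""
    return [t.get("id", "?") for t in tasks if t.get("status") not in TASK_STATUSES]
```

A check like this would surface legacy status literals before a sync, though the authoritative validation remains whatever State Hub applies.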
--- a/.custodian-brief.md
+++ b/.custodian-brief.md
@@ -0,0 +1,18 @@
+<!-- custodian-brief: generated by fix-consistency — do not edit manually -->
+# Custodian Brief — evidence-source
+
+**Domain:** infotech  
+**Last synced:** 2026-06-22 18:28 UTC  
+**State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
+
+## Active Workstreams
+
+*(none — repo may need first-session setup)*
+
+---
+## MCP Orientation (when available)
+
+If the state-hub MCP server is reachable, call:
+`get_domain_summary("infotech")`
+This provides richer cross-domain context.
+If the MCP call fails, use this file as your orientation source.
--- a/.repo-classification.yaml
+++ b/.repo-classification.yaml
@@ -0,0 +1,19 @@
+repo_classification:
+  standard: Repo Classification Standard
+  version: '1.0'
+  classified_at: '2026-06-22'
+  classified_by: agent
+  category: project
+  domain: infotech
+  secondary_domains: []
+  capability_tags:
+  - evidence
+  - traceability
+  - source-management
+  business_stake:
+  - technology
+  - product
+  - operations
+  business_mechanics:
+  - coordination
+  - operation
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,219 @@
+# evidence-source — Agent Instructions
+
+## Repo Identity
+
+**Purpose:** Document ingestion, extraction, fingerprinting, citation recovery. Depends only on citation-engine. INTENT-only during umbrella-first MVP.
+
+**Domain:** infotech
+**Repo slug:** evidence-source
+**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
+**Workplan prefix:** `ESRC-WP-`
+
+---
+
+## State Hub Integration
+
+The Custodian State Hub tracks work across all domains. Interact via HTTP REST —
+there is no MCP server for Codex agents.
+
+| Context | URL |
+|---------|-----|
+| Local workstation | `http://127.0.0.1:8000` |
+| Remote via tunnel | `http://127.0.0.1:18000` |
+
+### Orient at session start
+
+```bash
+# Offline brief — works without hub connection
+cat .custodian-brief.md
+
+# Active workstreams for this domain
+curl -s "http://127.0.0.1:8000/workstreams/?topic_id=cee7bedf-2b48-46ef-8601-006474f2ad7a&status=active" \
+  | python3 -m json.tool
+
+# Check inbox
+curl -s "http://127.0.0.1:8000/messages/?to_agent=evidence-source&unread_only=true" \
+  | python3 -m json.tool
+```
+
+Mark a message read:
+```bash
+curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
+  -H "Content-Type: application/json" -d '{}'
+```
+
+### Log progress (required at session close)
+
+```bash
+curl -s -X POST http://127.0.0.1:8000/progress/ \
+  -H "Content-Type: application/json" \
+  -d '{
+    "summary": "what was done",
+    "event_type": "note",
+    "author": "codex",
+    "workstream_id": "<uuid>",
+    "task_id": "<uuid>"
+  }'
+```
+
+Omit `workstream_id` / `task_id` when not applicable.
+
+### Update task status
+
+```bash
+curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
+  -H "Content-Type: application/json" \
+  -d '{"status": "progress"}'
+# values: wait | todo | progress | done | cancel
+```
+
+### Flag a task for human review
+
+```bash
+curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
+  -H "Content-Type: application/json" \
+  -d '{"needs_human": true, "intervention_note": "reason"}'
+```
+
+---
+
+## Session Protocol
+
+**Start:**
+1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
+2. Check inbox: `GET /messages/?to_agent=evidence-source&unread_only=true`; mark read
+3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
+4. Check human-needed tasks: `GET /tasks/?needs_human=true`
+
+**During work:**
+- Update task statuses in workplan files as tasks progress
+- Record significant decisions via `POST /decisions/`
+
+**Close:**
+1. Update workplan file task statuses to reflect progress
+2. Log: `POST /progress/` with a summary of what changed
+3. Note for the custodian operator: after workplan file changes, run from
+   `~/state-hub`:
+   ```bash
+   make fix-consistency REPO=evidence-source
+   ```
+   This syncs task status from files into the hub DB.
+
+---
+
+## Credential and access routing
+
+**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
+for inference. Run this check **before** requesting secrets, API keys, SSH access,
+login tokens, or database passwords — in any repo, not only `ops-warden`.
+
+ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
+other credential need belongs to another subsystem. **Do not** message
+`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
+
+### Lookup (do this first)
+
+```bash
+warden route find "<describe your need>" --json
+warden route show <catalog-id> --json
+```
+
+Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
+
+| Agent runtime | How to orient |
+| --- | --- |
+| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=evidence-source` is for coordination, not secret vending |
+| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
+| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
+
+### Quick routing table
+
+| I need… | Owner | ops-warden executes? |
+| --- | --- | --- |
+| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
+| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
+| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
+| Authorization decision | flex-auth | No — route only |
+| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
+| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
+
+### Anti-patterns (do not do these)
+
+- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
+- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
+- Pasting secrets into Git, State Hub, workplans, logs, or chat
+
+### Other capabilities (reuse-surface)
+
+Non-credential capabilities are usually discovered through **reuse-surface** federation
+(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
+every repo's agent instructions because it is high-frequency, high-risk, and easy to
+get wrong.
+
+**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
+
+<!-- REPO-AGENTS-EXTENSIONS -->
+<!-- Append repo-specific agent instructions below this marker.
+     The state-hub template sync preserves content after this line. -->
+
+---
+
+## Workplan Convention (ADR-001)
+
+Work items originate as files in this repo — not in the hub. The hub is a
+read/cache/index layer that rebuilds from files.
+
+**File location:** `workplans/EVIDENCE-WP-NNNN-<slug>.md`
+
+**Archived location:** finished workplans may move to
+`workplans/archived/YYMMDD-EVIDENCE-WP-NNNN-<slug>.md`. The `YYMMDD` prefix is
+the completion/archive date; the frontmatter `id` does not change.
+
+**Ad Hoc Tasks:** small opportunistic fixes discovered during a session use
+`workplans/ADHOC-YYYY-MM-DD.md` with task ids `ADHOC-YYYY-MM-DD-T01`, etc. Use
+this only for low-risk work completed directly; create a normal workplan for
+anything needing analysis, design, approval, dependencies, or multiple phases.
+
+**Frontmatter:**
+
+```yaml
+---
+id: EVIDENCE-WP-NNNN
+type: workplan
+title: "..."
+domain: infotech
+repo: evidence-source
+status: proposed | ready | active | blocked | backlog | finished | archived
+owner: codex
+topic_slug: ...
+created: "YYYY-MM-DD"
+updated: "YYYY-MM-DD"
+state_hub_workstream_id: "<uuid>"   # written by fix-consistency — do not edit
+---
+```
+
+Use `proposed` for a new draft, `ready` after review against current repo
+state, and `finished` after implementation. `stalled` and `needs_review` are
+derived health labels, not frontmatter statuses.
+
+**Task block format** (one per `##` section):
+
+```
+## Task Title
+
+` ` `task
+id: EVIDENCE-WP-NNNN-T01
+status: wait | todo | progress | done | cancel
+priority: high | medium | low
+state_hub_task_id: "<uuid>"         # written by fix-consistency — do not edit
+` ` `
+
+Task description text.
+```
+
+Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.
+
+To create a new workplan:
+1. Write the file following the format above
+2. Notify the custodian operator to run `make fix-consistency REPO=evidence-source`
+   (or send a message to the hub agent via `POST /messages/`)
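Review note: the archive and Ad Hoc naming rules in this section are pure string formatting, so a small sketch may help avoid off-by-one date formats. The function names are hypothetical. The patterns (`YYMMDD-` prefix under `workplans/archived/`, `ADHOC-YYYY-MM-DD-TNN` task ids) are taken from the text above.

```python
from datetime import date


def archived_path(filename: str, completed: date) -> str:
    """Path of a finished workplan after archiving.

    Only the file name gains the YYMMDD prefix; the frontmatter id
    inside the file is left unchanged.
    """
    return f"workplans/archived/{completed:%y%m%d}-{filename}"


def adhoc_workplan_path(day: date) -> str:
    """Path of the Ad Hoc Tasks file for a given day."""
    return f"workplans/ADHOC-{day:%Y-%m-%d}.md"


def adhoc_task_id(day: date, number: int) -> str:
    """Task id for the n-th ad hoc task of a day (T01, T02, ...)."""
    return f"ADHOC-{day:%Y-%m-%d}-T{number:02d}"
```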
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,12 @@
+# evidence-source — Claude Code Instructions
+
+@SCOPE.md
+@.claude/rules/repo-identity.md
+@.claude/rules/session-protocol.md
+@.claude/rules/first-session.md
+@.claude/rules/workplan-convention.md
+@.claude/rules/stack-and-commands.md
+@.claude/rules/architecture.md
+@.claude/rules/repo-boundary.md
+@.claude/rules/credential-routing.md
+@.claude/rules/agents.md
--- a/INTENT.md
+++ b/INTENT.md
@@ -0,0 +1,492 @@
+# INTENT
+
+## Purpose
+
+This repository exists to provide the document source, ingestion, extraction, metadata, and citation recovery layer for the **citation-evidence** ecosystem.
+
+**evidence-source** turns raw documents and source clues into usable, searchable, addressable document representations that can support annotations, evidence items, citation recovery, and source-backed workflows.
+
+It is responsible for answering the source-side questions:
+
+> What is this document?  
+> How can we extract usable text and structure from it?  
+> How can we find or recover a cited source passage?
+
+---
+
+## Primary Utility
+
+The repository provides the source pipeline for citation-evidence.
+
+It should make it possible to:
+
+- import documents into a collection or workspace,
+- identify document type and media type,
+- compute stable document fingerprints,
+- extract document metadata,
+- extract canonical text,
+- create document representations for PDFs, Markdown, HTML, and later other formats,
+- build maps between text, pages, sections, and rendered views,
+- support local full-text search,
+- support source lookup and citation recovery,
+- provide the document representations needed by **evidence-anchor** and **citation-work**.
+
+This repository turns documents into evidence-ready sources.
+
+---
+
+## Intended Users
+
+Primary users of this repository are developers and agents implementing source handling for citation-evidence.
+
+They include:
+
+- developers building document import workflows,
+- developers building review collections,
+- developers implementing PDF, Markdown, and HTML source handling,
+- developers implementing citation recovery,
+- developers integrating local or external source libraries,
+- coding agents that need structured access to document text and metadata.
+
+End users should experience this repository indirectly whenever they add a document, search source text, or recover a citation.
+
+---
+
+## Strategic Role
+
+The strategic role of **evidence-source** is to make source documents usable as reliable evidence substrates.
+
+Without this repository, the system would depend on whatever a viewer happens to show at runtime. That would make citation capture, re-opening, search, and recovery fragile.
+
+**evidence-source** creates the normalized source representations that allow the rest of the system to operate consistently across document formats.
+
+It enables the flow:
+
+```text
+Raw Source
+  → Document Identity
+  → Metadata
+  → Canonical Text
+  → Document Representation
+  → Searchable Source
+  → Anchorable Evidence Context
+```
+
+---
+
+## Core Concept
+
+The core concept of this repository is the **document representation**.
+
+A document representation is a normalized, searchable, addressable view of a source document.
+
+For a PDF, a representation may include:
+
+```text
+document fingerprint
+metadata
+page count
+page text
+global canonical text
+page-local offset map
+text item map
+page dimensions
+source-to-rendering hints
+```
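Review note: the relationship between "global canonical text" and the "page-local offset map" listed above can be illustrated with a short sketch. It assumes pages are joined with a single newline and that the map stores the global start offset of each page. Neither choice is specified by INTENT.md, and the names `PdfTextMap`, `build_text_map`, and `locate` are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PdfTextMap:
    text: str               # global canonical text, pages joined by a separator
    page_starts: List[int]  # global offset at which each page begins


def build_text_map(page_texts: List[str], sep: str = "\n") -> PdfTextMap:
    """Join page texts and record where each page starts in the result."""
    starts = []
    cursor = 0
    for page in page_texts:
        starts.append(cursor)
        cursor += len(page) + len(sep)
    return PdfTextMap(sep.join(page_texts), starts)


def locate(text_map: PdfTextMap, offset: int) -> Tuple[int, int]:
    """Map a global offset to (1-based page number, page-local offset).

    An offset that falls on a separator is attributed to the page
    that precedes it.
    """
    if not 0 <= offset < len(text_map.text):
        raise IndexError(offset)
    index = bisect_right(text_map.page_starts, offset) - 1
    return index + 1, offset - text_map.page_starts[index]
```

Keeping page starts sorted lets a selector resolve a global offset to a page with a binary search rather than a scan.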
+
+For Markdown or HTML, a representation may include:
+
+```text
+canonical text
+rendered HTML
+sanitized content
+heading map
+section map
+DOM or AST structure
+offset-to-node map
+source line map where available
+```
+
+These representations allow **evidence-anchor** to create and resolve selectors and allow **citation-work** to display and search documents efficiently.
+
+---
+
+## Scope
+
+This repository should own:
+
+* document import workflows,
+* document source identification,
+* media type detection,
+* document fingerprinting,
+* source URI handling,
+* metadata extraction,
+* canonical text extraction,
+* PDF text extraction,
+* Markdown normalization,
+* HTML normalization and sanitization,
+* document representation generation,
+* representation caching,
+* local source search support,
+* quote search support,
+* citation clue parsing,
+* local citation recovery,
+* external source discovery hooks,
+* recovery state tracking,
+* privacy boundaries for source lookup.
+
+It should provide the source-side capabilities consumed by:
+
+* **citation-engine** for creating `Document` and `DocumentRepresentation` records,
+* **evidence-anchor** for selector creation and resolution,
+* **citation-work** for document review workflows,
+* **evidence-binder** when evidence needs source context,
+* **citation-evidence** for the integrated product experience.
+
+---
+
+## Out of Scope
+
+This repository should not own the broader evidence domain or user workflows.
+
+Specifically, it should not own:
+
+* the canonical evidence domain model,
+* persistence policy beyond source and representation storage contracts,
+* low-level anchor resolution algorithms,
+* visual highlight rendering,
+* review workspace UI,
+* form-field binding semantics,
+* visual guide overlay behavior,
+* citation card rendering,
+* application shell and deployment,
+* final human validation of evidence quality.
+
+Those responsibilities belong to the appropriate citation-evidence subsystem repositories.
+
+---
+
+## Architectural Position
+
+```text
+citation-evidence
+  integrated product shell
+
+citation-engine
+  core domain model, services, persistence contracts
+
+evidence-source
+  document ingestion, extraction, metadata, representations, citation recovery
+
+evidence-anchor
+  selectors, anchor resolution, re-anchoring, highlighting contracts
+
+citation-work
+  review workspace and annotation UX
+
+evidence-binder
+  evidence-to-target binding and active evidence state
+```
+
+**evidence-source** should provide document representations, not define what evidence means.
+
+It should feed reliable source material into the rest of the system.
+
+---
+
+## Primary Workflows
+
+### 1. Import Document
+
+A user or system adds a source document.
+
+```text
+Add Source
+  → Identify Media Type
+  → Compute Fingerprint
+  → Extract Metadata
+  → Extract Text
+  → Build Representation
+  → Register Document
+```
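Review note: the first steps of the import workflow above (identify media type, compute fingerprint) could look roughly like the sketch below. SHA-256 over the raw bytes and extension-based type guessing are assumptions made for illustration. INTENT.md does not name a hash algorithm or a detection method, and `SourceIdentity` and `identify` are hypothetical names.

```python
import hashlib
import mimetypes
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceIdentity:
    fingerprint: str  # content hash, prefixed with the algorithm name
    media_type: str   # guessed from the file name only
    size: int         # number of bytes in the source


def identify(data: bytes, filename: str) -> SourceIdentity:
    """Derive a content fingerprint and a best-effort media type."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return SourceIdentity(
        fingerprint="sha256:" + hashlib.sha256(data).hexdigest(),
        media_type=media_type or "application/octet-stream",
        size=len(data),
    )
```

Because the fingerprint depends only on the bytes, the same document imported under two file names yields the same identity, which is what makes it usable as a cache key.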
+
+### 2. Generate PDF Representation
+
+A PDF is converted into a representation suitable for review and anchoring.
+
+```text
+PDF Source
+  → Load PDF
+  → Extract Page Text
+  → Normalize Text
+  → Build Page Map
+  → Build Offset Map
+  → Store Representation
+```
+
+### 3. Generate Markdown / HTML Representation
+
+A Markdown or HTML source is converted into a normalized rendered and searchable representation.
+
+```text
+Markdown / HTML Source
+  → Parse / Sanitize
+  → Render if needed
+  → Extract Canonical Text
+  → Build Heading / Section Map
+  → Build Offset Map
+  → Store Representation
+```
+
+### 4. Search Local Sources
+
+A user or subsystem searches available source material.
+
+```text
+Search Query / Quote
+  → Search Metadata
+  → Search Full Text
+  → Return Candidate Documents / Passages
+```
+
+### 5. Recover Citation
+
+A user provides a citation, quote, or source clue.
+
+```text
+Citation Clue
+  → Parse Source Metadata
+  → Search Local Library
+  → Optionally Search Configured External Sources
+  → Load Candidate Source
+  → Search Exact Quote
+  → Search Fuzzy Quote
+  → Present Candidate Passages
+  → User Confirms
+  → Create Source Context for Annotation
+```
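Review note: the "Search Exact Quote" then "Search Fuzzy Quote" steps above can be sketched with the standard library. This is an illustration, not the planned algorithm: the sliding-window comparison with `difflib.SequenceMatcher`, the `min_ratio` threshold, and the name `find_quote` are all assumptions, and the quadratic cost would be too high for large documents.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Candidate = Tuple[int, int, float]  # (start, end, similarity score)


def find_quote(text: str, quote: str, min_ratio: float = 0.8) -> List[Candidate]:
    """Return candidate passages: exact matches first, else fuzzy ones."""
    if not quote:
        return []

    # Pass 1: exact matches, each scored 1.0.
    exact: List[Candidate] = []
    start = text.find(quote)
    while start != -1:
        exact.append((start, start + len(quote), 1.0))
        start = text.find(quote, start + 1)
    if exact:
        return exact

    # Pass 2: compare the quote with every window of the same length.
    window = len(quote)
    fuzzy: List[Candidate] = []
    for begin in range(0, max(len(text) - window, 0) + 1):
        chunk = text[begin:begin + window]
        score = SequenceMatcher(None, quote, chunk).ratio()
        if score >= min_ratio:
            fuzzy.append((begin, begin + len(chunk), score))
    # Best-scoring candidates first; they still need user confirmation.
    return sorted(fuzzy, key=lambda candidate: -candidate[2])
```

Returning scored candidates rather than a single answer matches the workflow's "Present Candidate Passages" and "User Confirms" steps.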
+
+---
+
+## Initial Source Types
+
+The first version should support or prepare for:
+
+```text
+PDF
+Markdown
+HTML
+plain text
+remote URL references
+```
+
+Later versions may support:
+
+```text
+DOCX
+EPUB
+scanned image documents
+OCR-derived text
+IIIF resources
+TEI XML
+structured datasets with source passages
+```
+
+---
+
+## Citation Recovery States
+
+Citation recovery should be modeled explicitly.
+
+Initial recovery states may include:
+
+```text
+created
+source-found-fulltext
+source-found-preview-only
+source-found-metadata-only
+source-not-found
+quote-found
+quote-not-found
+candidate-passages-found
+manual-confirmation-needed
+confirmed
+annotation-created
+failed
+```
+
+The system should distinguish between finding a source and finding the exact cited passage.
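Review note: one way to model the states explicitly is an enumeration whose values are the literals listed above. The grouping into source-level and passage-level outcomes is an assumption added here to illustrate the distinction the text asks for. INTENT.md lists the states but does not group them.

```python
from enum import Enum


class RecoveryState(str, Enum):
    CREATED = "created"
    SOURCE_FOUND_FULLTEXT = "source-found-fulltext"
    SOURCE_FOUND_PREVIEW_ONLY = "source-found-preview-only"
    SOURCE_FOUND_METADATA_ONLY = "source-found-metadata-only"
    SOURCE_NOT_FOUND = "source-not-found"
    QUOTE_FOUND = "quote-found"
    QUOTE_NOT_FOUND = "quote-not-found"
    CANDIDATE_PASSAGES_FOUND = "candidate-passages-found"
    MANUAL_CONFIRMATION_NEEDED = "manual-confirmation-needed"
    CONFIRMED = "confirmed"
    ANNOTATION_CREATED = "annotation-created"
    FAILED = "failed"


# Hypothetical grouping: outcomes of locating the source document...
SOURCE_OUTCOMES = frozenset({
    RecoveryState.SOURCE_FOUND_FULLTEXT,
    RecoveryState.SOURCE_FOUND_PREVIEW_ONLY,
    RecoveryState.SOURCE_FOUND_METADATA_ONLY,
    RecoveryState.SOURCE_NOT_FOUND,
})

# ...versus outcomes of locating the cited passage inside it.
PASSAGE_OUTCOMES = frozenset({
    RecoveryState.QUOTE_FOUND,
    RecoveryState.QUOTE_NOT_FOUND,
    RecoveryState.CANDIDATE_PASSAGES_FOUND,
})
```

Subclassing `str` keeps the members serializable as the same hyphenated literals, so stored records and enum values stay aligned.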
+
+---
+
+## Privacy and Source Lookup Principles
+
+Source lookup can create privacy risks.
+
+The repository should follow these principles:
+
+* search local sources first,
+* make external lookup explicit and configurable,
+* avoid sending private document text to external services by default,
+* record which external services were queried,
+* distinguish public metadata lookup from full-text upload,
+* allow deployments to disable external lookup completely,
+* prefer deterministic local processing where possible.
+
+External source discovery should be an extension point, not an unavoidable default behavior.
+
+---
+
+## Design Principles
+
+### Source Identity First
+
+Every imported document should receive a stable identity based on available metadata, source URI, and fingerprint.
+
+### Canonical Text Matters
+
+Anchoring and search depend on canonical text. The repository should make text normalization explicit and repeatable.
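Review note: "explicit and repeatable" normalization might be sketched as a single pure function. The specific rules below (NFC, unified newlines, collapsed horizontal whitespace, at most one blank line) are assumptions chosen for illustration. INTENT.md does not define the canonical form, and `canonicalize` is a hypothetical name.

```python
import re
import unicodedata


def canonicalize(text: str) -> str:
    """Normalize text so that equal content yields equal strings."""
    text = unicodedata.normalize("NFC", text)              # compose accents
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify newlines
    text = re.sub(r"[ \t\f\v]+", " ", text)                # collapse spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)                   # trim around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)                 # at most one blank line
    return text.strip()
```

The function is idempotent for these rules, so applying it to already canonical text changes nothing. That property is what keeps stored offsets stable across repeated processing.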
+
+### Representation Is Not Source
+
+The original source and generated representation are different things.
+
+The system should preserve this distinction.
+
+### Local Before External
+
+Citation recovery should search local documents before looking elsewhere.
+
+### Human Confirmation
+
+Recovered citations should not silently become confirmed evidence. Candidate matches should be presented for confirmation when uncertainty exists.
+
+### Format-Aware, Model-Neutral
+
+The repository should understand document formats but should not own the broader evidence model.
+
+### Cache Expensive Work
+
+Text extraction, fingerprinting, and representation generation should be cacheable by source fingerprint and version.
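Review note: caching "by source fingerprint and version" suggests a composite key. The in-memory sketch below assumes the version refers to the extractor or representation format, which the text leaves open. `RepresentationCache` and `get_or_build` are hypothetical names, and a real implementation would likely persist entries.

```python
from typing import Callable, Dict, Generic, Tuple, TypeVar

T = TypeVar("T")


class RepresentationCache(Generic[T]):
    """Memoize expensive builds keyed by (fingerprint, version)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], T] = {}

    def get_or_build(
        self, fingerprint: str, version: str, build: Callable[[], T]
    ) -> T:
        key = (fingerprint, version)
        if key not in self._entries:
            # Only run the expensive step on a cache miss.
            self._entries[key] = build()
        return self._entries[key]
```

Including the version in the key means a changed extraction pipeline naturally misses the cache instead of serving stale representations.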
+
+### Agent-Friendly Output
+
+Extracted metadata, representations, and recovery candidates should be structured enough for agents to inspect, rank, and explain.
+
+---
+
+## Expected Dependencies
+
+This repository is expected to depend on shared types and service contracts from:
+
+```text
+citation-engine
+  Document, DocumentRepresentation, CitationRecoveryAttempt, source-related contracts
+```
+
+It may be consumed by:
+
+```text
+citation-work
+  to load reviewable documents and document representations
+
+evidence-anchor
+  to resolve selectors against extracted representations
+
+evidence-binder
+  to retrieve source context for linked evidence
+
+citation-evidence
+  to provide integrated import and recovery workflows
+```
+
+It should avoid depending on review UI or form-binding implementation details.
+
+---
+
+## First Useful Version
+
+A first useful version of **evidence-source** should provide:
+
+* source import interface,
+* media type detection,
+* document fingerprinting,
+* basic metadata extraction,
+* PDF text extraction,
+* Markdown text extraction,
+* HTML sanitization and text extraction,
+* canonical text normalization,
+* document representation generation,
+* simple local quote search,
+* recovery attempt model or contract,
+* examples showing how a document becomes a representation usable by **evidence-anchor**.
+
+The first version does not need full external source discovery or OCR, but it should establish the ingestion and representation pattern.
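
For the "simple local quote search" item, a first version could be as small as the sketch below: it looks for the quote in canonical text while tolerating differences in whitespace, and returns character offsets that an anchoring layer could consume. `find_quote` is a hypothetical name, and exact (case-sensitive) word matching is an assumption made to keep the example short.

```python
import re
from typing import List, Tuple


def find_quote(canonical_text: str, quote: str) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of every occurrence of `quote`.

    Words must match exactly, but any run of whitespace in the text
    (including line breaks) matches any run of whitespace in the quote.
    """
    words = quote.split()
    if not words:
        return []
    pattern = r"\s+".join(re.escape(word) for word in words)
    return [match.span() for match in re.finditer(pattern, canonical_text)]
```

Fuzzy matching for quotes with OCR errors or paraphrase would sit on top of this and should report its uncertainty rather than return a single answer.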
+
+---
+
+## Success Criteria
+
+The repository is successful when another subsystem can use it to:
+
+1. import a source document,
+2. identify and fingerprint it,
+3. extract useful metadata,
+4. generate canonical text,
+5. generate a document representation,
+6. search the source text,
+7. provide representation data to **evidence-anchor**,
+8. support a local citation recovery attempt from a quote or citation clue.
+
+A developer or coding agent should be able to understand from this repository how raw documents become evidence-ready sources.
+
+---
+
+## Repository Character
+
+This repository should be:
+
+* source-focused,
+* ingestion-oriented,
+* privacy-conscious,
+* format-aware,
+* representation-centered,
+* cache-friendly,
+* suitable for local-first and server-side use,
+* explicit about uncertainty in citation recovery,
+* careful not to absorb review or binding responsibilities.
+
+---
+
+## MVP Coordination — Code Lives Upstream
+
+During the umbrella-first MVP phase (decided 2026-05-24), **the source code
+for this subsystem does not live in this repository yet**. It lives in the
+umbrella repo at `citation-evidence/src/source/`.
+
+This INTENT.md documents the *intended* responsibilities and boundaries.
+Once the ingestion and representation interfaces have stabilized through
+actual MVP use, the corresponding code will be extracted into this repository.
+
+**Shared contracts** (Document and DocumentRepresentation shapes,
+CitationRecoveryAttempt state enum, canonical text normalization, allowed
+dependency edges) are maintained in the umbrella repo:
+
+* `citation-evidence/wiki/SharedContracts.md`
+* `citation-evidence/wiki/DependencyMap.md`
+* `citation-evidence/docs/decisions/` (ADRs)
+
+This subsystem's eventual code must not contradict those documents. Changes
+to shared contracts happen in the umbrella, not here.
+
+Under the dependency map, **`evidence-source` may depend only on
+`citation-engine`** — not on `evidence-anchor`. When ingestion needs to know
+"could a selector resolve here?", the answer travels through events, not
+direct calls.
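
The event-based indirection can be illustrated with a tiny in-process publish/subscribe bus. Everything here is hypothetical: the `EventBus` class, the topic names, and the payload fields are assumptions for illustration, and the actual event mechanism is whatever the umbrella's shared contracts define.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Payload = Dict[str, object]
Handler = Callable[[Payload], None]


class EventBus:
    """Minimal in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, payload: Payload) -> None:
        # Iterate over a copy so handlers may subscribe while being called.
        for handler in list(self._handlers[topic]):
            handler(payload)


bus = EventBus()
resolution_reports: List[Payload] = []

# Ingestion side: listens for reports without importing any anchoring code.
bus.subscribe("selector.resolution_reported", resolution_reports.append)


# Anchoring side (would live in a different package): reacts to the event.
def on_representation_generated(event: Payload) -> None:
    # Placeholder answer; a real handler would try to resolve selectors.
    bus.publish("selector.resolution_reported",
                {"document_id": event["document_id"], "resolvable": True})


bus.subscribe("representation.generated", on_representation_generated)

# Ingestion announces a new representation; the answer comes back as an event.
bus.publish("representation.generated", {"document_id": "doc-123"})
```

Neither side imports the other; both depend only on the bus and the agreed topic names, which is what keeps the dependency edge pointing at the shared contracts alone.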
+
+---
+
+## Guiding Statement
+
+**evidence-source exists to turn documents and citation clues into reliable, searchable, anchorable source context.**
+
--- a/README.md
+++ b/README.md
@@ -1,3 +1,16 @@
-# repo-seed
+# evidence-source

-A git repository template to bootstrap coulomb projects from.
+Document source, ingestion, extraction, metadata, and citation recovery —
+PDF/HTML/MD ingest, fingerprinting, page-/offset-map construction,
+canonical-text extraction, and the recovery behavior for stale selectors.
+
+## MVP status: INTENT only
+
+During the citation-evidence MVP, code lives upstream in
+[`citation-evidence`](../citation-evidence/) under `src/source/`. This repo
+currently holds `INTENT.md` describing what will move here. Contract
+changes belong in
+[`citation-evidence/wiki/SharedContracts.md`](../citation-evidence/wiki/SharedContracts.md),
+not here.
+
+Per the dependency map, source depends on `shared/` and `engine/` only.
--- a/SCOPE.md
+++ b/SCOPE.md
@@ -0,0 +1,137 @@
+# SCOPE
+
+> This file helps you quickly understand what this repository is about,
+> when it is relevant, and when it is not.
+> It is intentionally lightweight and may be incomplete.
+
+---
+
+## One-liner
+
+<!-- Describe the purpose of this repository in one precise sentence. -->
+<!-- Example: "Provides a lightweight event router for Kubernetes-native systems." -->
+
+---
+
+## Core Idea
+
+<!-- What is the main capability or idea behind this repository? -->
+<!-- What problem does it try to solve? -->
+
+---
+
+## In Scope
+
+<!-- What this repository is responsible for. -->
+<!-- Be explicit and concrete. -->
+
+-
+-
+-
+
+---
+
+## Out of Scope
+
+<!-- What this repository deliberately does NOT do. -->
+<!-- This is often more important than "In Scope". -->
+
+-
+-
+-
+
+---
+
+## Relevant When
+
+<!-- When should someone consider using or exploring this repository? -->
+
+-
+-
+-
+
+---
+
+## Not Relevant When
+
+<!-- When should someone ignore this repository? -->
+
+-
+-
+-
+
+---
+
+## Current State
+
+<!-- Rough indication of maturity. No strict format required. -->
+
+- Status: <!-- e.g. concept / experimental / active / stable / deprecated -->
+- Implementation: <!-- e.g. idea / partial / substantial / complete -->
+- Stability: <!-- e.g. unstable / evolving / stable -->
+- Usage: <!-- e.g. none / personal / internal / production -->
+
+<!-- Add any notes that help set expectations. -->
+
+---
+
+## How It Fits
+
+<!-- Where does this repository sit in the bigger picture? -->
+
+- Upstream dependencies:
+- Downstream consumers:
+- Often used with:
+
+---
+
+## Terminology
+
+<!-- Terms that are important to understand this repo. -->
+<!-- Especially useful if naming differs from other repos. -->
+
+- Preferred terms:
+- Also known as:
+- Potentially confusing terms:
+
+---
+
+## Related / Overlapping Repositories
+
+<!-- List repositories that have similar or adjacent responsibilities. -->
+<!-- Helps detect duplication and navigate the ecosystem. -->
+
+- <repo-name> — <!-- how it relates -->
+
+---
+
+## Getting Oriented
+
+<!-- If someone decides to look deeper, where should they start? -->
+
+- Start with:
+- Key files / directories:
+- Entry points:
+
+---
+
+## Provided Capabilities
+
+<!-- What can this repo's domain provide to other domains on request? -->
+<!-- Each capability block is parsed by the state-hub capability catalog ingest. -->
+<!-- Remove the examples and add your own, or leave empty if none. -->
+
+<!--
+```capability
+type: infrastructure
+title: Example capability title
+description: What this capability provides, in one or two sentences.
+keywords: [keyword1, keyword2, keyword3]
+```
+-->
+
+---
+
+## Notes
+
+<!-- Anything else worth knowing. Keep it short. -->
--- a/registry/README.md
+++ b/registry/README.md
@@ -0,0 +1,12 @@
+# Capability Registry
+
+Markdown-first capability index for federation and reuse planning.
+
+## Authoring
+
+1. Copy a capability entry template (see reuse-surface `templates/capability-entry.template.md`).
+2. Add the row to `indexes/capabilities.yaml`.
+3. Run `reuse-surface validate` from a checkout with the CLI installed.
+4. Merge to `main` and verify publish with `reuse-surface establish --publish-check`.
+
+Federation contract: reuse-surface `docs/RegistryFederation.md`.
--- a/registry/capabilities/.gitkeep
+++ b/registry/capabilities/.gitkeep
--- a/registry/indexes/capabilities.yaml
+++ b/registry/indexes/capabilities.yaml
@@ -0,0 +1,4 @@
+version: 1
+updated: '2026-06-16'
+domain: helix_forge
+capabilities: []
--- a/workplans/ESRC-WP-0001-intent-placeholder.md
+++ b/workplans/ESRC-WP-0001-intent-placeholder.md
@@ -0,0 +1,19 @@
+---
+id: ESRC-WP-0001
+type: workplan
+title: "INTENT placeholder — await extraction from citation-evidence"
+domain: infotech
+repo: evidence-source
+status: backlog
+owner: codex
+topic_slug: citation_evidence_mvp
+created: "2026-06-21"
+updated: "2026-06-21"
+state_hub_workstream_id: "64771b5d-4b83-4848-a562-4b00aad017b2"
+---
+
+# ESRC-WP-0001 — INTENT Placeholder
+
+Umbrella-first MVP: source/ingestion code will be extracted from `citation-evidence`
+once the subsystem boundary stabilizes. This file satisfies the ADR-001 workplan
+structure requirement until then. See `INTENT.md`.
Author	SHA1	Message	Date
tegwick	0a7176bf2d	Normalize agent instructions and workplan frontmatter (STATE-WP-0067) - Align agent files with on-disk workplan prefixes (infer from workplan ids) - Set workplan domain to registered domain_slug; add topic_slug where applicable - Repair frontmatter delimiter formatting; migrate legacy task status literals - Regenerate AGENTS.md, CLAUDE.md, and .claude/rules from State Hub templates	2026-06-22 23:16:24 +02:00
tegwick	f6003bc4a1	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-06-22: - update .custodian-brief.md for evidence-source	2026-06-22 20:28:07 +02:00
tegwick	7cf52213bf	Add .repo-classification.yaml (CUST-WP-0050 T11 agent first-pass)	2026-06-22 17:47:36 +02:00
tegwick	25fb6946bc	chore(ADR-001): add INTENT placeholder workplan for umbrella-first MVP Add ESRC-WP-0001-intent-placeholder.md so consistency sweeps pass C-01 while this repo remains INTENT-only. Implementation stays in citation-evidence until extraction. Hub workstream id is written in frontmatter from fix-consistency.	2026-06-22 01:30:22 +02:00
tegwick	720e46eef5	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-06-22: - update .custodian-brief.md for evidence-source	2026-06-22 01:15:09 +02:00
tegwick	40b2f12797	Add capability registry scaffold (REUSE-WP-0014-T04 B02) Empty helix_forge registry layout for federation publishing.	2026-06-16 01:52:33 +02:00
tegwick	d37f22ac18	Point README at citation-evidence umbrella during MVP phase Code lives upstream in citation-evidence/src/source/ during the MVP. README documents that and points at SharedContracts.md for ingest, fingerprint, and recovery contract changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 00:13:12 +02:00
tegwick	d8a08d6032	Add MVP Coordination section: code lives in citation-evidence umbrella during MVP Documents the umbrella-first MVP decision (2026-05-24). This repo remains INTENT-only until the ingestion and representation interfaces stabilize through real product use. Reaffirms: source depends only on engine, not on anchor — coordination between them flows through events. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 16:51:06 +02:00