Normalize agent instructions and workplan frontmatter (STATE-WP-0067)

- Align agent files with on-disk workplan prefixes (infer from workplan ids)
- Set workplan domain to registered domain_slug; add topic_slug where applicable
- Repair frontmatter delimiter formatting; migrate legacy task status literals
- Regenerate AGENTS.md, CLAUDE.md, and .claude/rules from State Hub templates
chore(consistency): sync task status from DB [auto]
2026-06-22 23:16:24 +02:00 · 2026-06-22 20:28:07 +02:00 · 2026-06-22 17:47:36 +02:00 · 2026-06-22 01:30:22 +02:00 · 2026-06-22 01:15:09 +02:00 · 2026-06-16 01:52:33 +02:00
20 changed files with 1220 additions and 2 deletions
--- a/.claude/rules/agents.md
+++ b/.claude/rules/agents.md
@@ -0,0 +1,20 @@
 ## Kaizen Agents
 Specialized agent personas available on demand via the state-hub MCP.
 **Discover:** `list_kaizen_agents()` — returns all agents with name, description, category
 **Load:** `get_kaizen_agent("tdd-workflow")` — returns full instructions; read and follow them
 Common agents:
 | Agent | Category | When to use |
 |-------|----------|-------------|
 | `tdd-workflow` | testing | Step-by-step TDD8 workflow for any feature |
 | `code-refactoring` | quality | Code quality analysis and safe refactoring |
 | `test-maintenance` | testing | Diagnose and fix failing tests |
 | `requirements-engineering` | process | Prevent interface/mock mismatches upfront |
 | `keepaTodofile` | process | Maintain TODO.md during work |
 | `project-management` | process | Track status, determine next steps |
 | `datamodel-optimization` | quality | Optimize dataclasses and data structures |
 All 17 agents: call `list_kaizen_agents()` for the full list.
--- a/.claude/rules/architecture.md
+++ b/.claude/rules/architecture.md
@@ -0,0 +1,8 @@
 ## Architecture
 <!-- TODO: Describe the key design decisions and component structure.
     Key modules, data flows, external integrations, state machines, etc. -->
 ## Quick Reference
 `~/state-hub/mcp_server/TOOLS.md` — MCP tool reference
--- a/.claude/rules/credential-routing.md
+++ b/.claude/rules/credential-routing.md
@@ -0,0 +1,50 @@
 # Credential and access routing
 **Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
 for inference. Run this check **before** requesting secrets, API keys, SSH access,
 login tokens, or database passwords — in any repo, not only `ops-warden`.
 ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
 other credential need belongs to another subsystem. **Do not** message
 `ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
 ### Lookup (do this first)
 ```bash
 warden route find "<describe your need>" --json
 warden route show <catalog-id> --json
 ```
 Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
 | Agent runtime | How to orient |
 | --- | --- |
 | **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=evidence-source` is for coordination, not secret vending |
 | **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
 | **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
 ### Quick routing table
 | I need… | Owner | ops-warden executes? |
 | --- | --- | --- |
 | SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
 | API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
 | Login / OIDC / MFA | key-cape / Keycloak | No — route only |
 | Authorization decision | flex-auth | No — route only |
 | activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
 | SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
 ### Anti-patterns (do not do these)
 - `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
 - Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
 - Pasting secrets into Git, State Hub, workplans, logs, or chat
 ### Other capabilities (reuse-surface)
 Non-credential capabilities are usually discovered through **reuse-surface** federation
 (`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
 every repo's agent instructions because it is high-frequency, high-risk, and easy to
 get wrong.
 **Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
--- a/.claude/rules/first-session.md
+++ b/.claude/rules/first-session.md
@@ -0,0 +1,38 @@
 ## First Session Protocol
 Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
 The project is registered but work has not yet been structured.
 **Step 1 — Read, don't write**
 - `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
 - `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
 - Scan repo root: README, directory structure, existing code or docs
 **Step 2 — Survey in-progress work**
 Look for TODOs, open branches, half-finished files. Note done vs. started but incomplete.
 **Step 3 — Propose workstreams to Bernd**
 Propose 1–3 workstreams — each a coherent strand, weeks to months, anchored to a
 roadmap phase. **Wait for approval before creating.**
 **Step 4 — Create workplan file first, then DB record (ADR-001)**
 ```
 workplans/ESRC-WP-NNNN-<slug>.md   ← write this first
 ```
 Then register in the hub:
 ```
 create_workstream(topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", title="...", owner="...", description="...")
 create_task(workstream_id="<id>", title="...", priority="high|medium|low")
 ```
 **Step 5 — Record the setup**
 ```
 add_progress_event(
    summary="First session: structured infotech into N workstreams, M tasks",
    event_type="milestone",
    topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
    detail={"workstreams": [...], "tasks_created": M}
 )
 ```
 <!-- Delete or archive this file once past first session -->
--- a/.claude/rules/repo-boundary.md
+++ b/.claude/rules/repo-boundary.md
@@ -0,0 +1,8 @@
 ## Repo boundary
 This repo owns **evidence-source** only. It does not own:
 <!-- TODO: List what belongs in adjacent repos, e.g.:
 - SSH key management → railiance-infra/
 - State hub code     → state-hub/
 -->
--- a/.claude/rules/repo-identity.md
+++ b/.claude/rules/repo-identity.md
@@ -0,0 +1,5 @@
 **Purpose:** Document ingestion, extraction, fingerprinting, citation recovery. Depends only on citation-engine. INTENT-only during umbrella-first MVP.
 **Domain:** infotech
 **Repo slug:** evidence-source
 **Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a
--- a/.claude/rules/session-protocol.md
+++ b/.claude/rules/session-protocol.md
@@ -0,0 +1,85 @@
 ## Session Protocol
 Dev Hub (State Hub API): http://127.0.0.1:8000
 MCP server name in `~/.claude.json`: `dev-hub`
 **Step 1 — Orient**
 Read the offline-safe brief first — it works without a live hub connection:
 ```bash
 cat .custodian-brief.md
 ```
 Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
 ```
 get_domain_summary("infotech")
 ```
 If MCP tools are unavailable in the current agent session, use the REST API:
 ```bash
 curl -s "http://127.0.0.1:8000/state/summary" | python3 -m json.tool
 ```
 If the hub is offline: `cd ~/state-hub && make api`
 **Step 2 — Check inbox**
 With MCP tools:
 ```
 get_messages(to_agent="evidence-source", unread_only=True)
 ```
 Mark read with `mark_message_read(message_id)`. Reply or act on coordination
 requests before proceeding.
 Without MCP tools:
 ```bash
 curl -s "http://127.0.0.1:8000/messages/?to_agent=evidence-source&unread_only=true" \
  | python3 -m json.tool
 curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
  -H "Content-Type: application/json" -d '{}'
 ```
 **Step 3 — Scan workplans**
 ```bash
 ls workplans/
 ```
 For each file with `status: ready`, `active`, or `blocked`, note pending
 `wait`/`todo`/`progress` tasks.
 **Step 4 — Present brief**
 1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
 2. **Pending tasks** from `workplans/` + any `[repo:evidence-source]` hub tasks
 3. **Goal guidance** — if `goal_guidance` in summary:
   - `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
   - `alignment_warnings`: flag if active work is not aligned with current goal
 4. **Suggested next action** — highest-priority open item
 5. **SBOM status** — flag if `last_sbom_at` is unset for this repo
 If no workstreams: follow First Session Protocol (`first-session.md`).
 **During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`
 > State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
 > are First Session Protocol only. Work structure belongs in repo files (ADR-001).
 **Session close:**
 With MCP tools:
 ```
 add_progress_event(summary="...", topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", workstream_id="<uuid>")
 ```
 Without MCP tools:
 ```bash
 curl -s -X POST http://127.0.0.1:8000/progress/ \
  -H "Content-Type: application/json" \
  -d '{"topic_id":"cee7bedf-2b48-46ef-8601-006474f2ad7a","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
 ```
 If workplan files were modified, ensure the local copy is up to date first:
 ```bash
 git -C <repo_path> pull --ff-only
 cd ~/state-hub && make fix-consistency REPO=evidence-source
 ```
 For repos where implementation runs on a remote machine (e.g. CoulombCore),
 use the combined target, which pulls before fixing:
 ```bash
 cd ~/state-hub && make fix-consistency-remote REPO=evidence-source
 ```
 **C-15** (DB task ahead of file) is normal in multi-machine workflows; writeback
 will sync the file to match the DB. **C-16** (repo behind remote) blocks all writes
 until you pull, which is intentional and prevents clobbering remote progress.
--- a/.claude/rules/stack-and-commands.md
+++ b/.claude/rules/stack-and-commands.md
@@ -0,0 +1,19 @@
 ## Stack
 <!-- TODO: Fill in language, frameworks, and key dependencies -->
 - **Language:**
 - **Key deps:**
 ## Dev Commands
 ```bash
 # TODO: Fill in the standard commands for this repo
 # Install dependencies
 # Run tests
 # Lint / type check
 # Build / package (if applicable)
 ```
--- a/.claude/rules/workplan-convention.md
+++ b/.claude/rules/workplan-convention.md
@@ -0,0 +1,40 @@
 ## Workplan Convention (ADR-001)
 File location: `workplans/ESRC-WP-NNNN-<slug>.md`
 ID prefix: `ESRC-WP-`
 Work items originate as files in this repo **before** being registered in the hub.
 Canonical workplan/workstream frontmatter statuses are:
 `proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
 Use `proposed` for a newly drafted plan, `ready` after review against current
 repo state, and `finished` when implementation is complete. `stalled` and
 `needs_review` are derived health labels, not stored statuses.
 Closed workplans may be moved to `workplans/archived/` with a completion-date
 prefix: `YYMMDD-ESRC-WP-NNNN-<slug>.md`. The frontmatter id remains
 unchanged; the prefix is only for quick visual reference.
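The archive naming rule above can be sketched as follows. This is a minimal illustration, assuming the completion date is known; `archived_name` is a hypothetical helper and not part of State Hub or `fix-consistency`.

```python
from datetime import date
from pathlib import PurePosixPath


def archived_name(workplan_path: str, completed: date) -> str:
    """Return the archived path for a closed workplan file.

    Hypothetical helper: prepends a YYMMDD completion-date prefix and places
    the file under workplans/archived/. The frontmatter id is not touched.
    """
    name = PurePosixPath(workplan_path).name
    return f"workplans/archived/{completed:%y%m%d}-{name}"


print(archived_name("workplans/ESRC-WP-0001-example.md", date(2026, 6, 22)))
```

Only the filename changes; anything that looks a workplan up by its frontmatter `id` keeps working after the move.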
 Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
 `workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
 `ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
 directly. Promote anything requiring analysis, design, approval, dependencies, or
 multiple planned phases into a normal workplan.
 Ecosystem todos from other agents arrive as `[repo:evidence-source]` hub tasks —
 visible at session start. Pick one up by creating the workplan file, then registering
 the workstream.
 Task blocks use this shape:
 ```task
 id: ESRC-WP-NNNN-T01
 status: wait | todo | progress | done | cancel
 priority: high | medium | low
 state_hub_task_id: "<uuid>"         # written by fix-consistency — do not edit
 ```
 Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
 blocked work and `cancel` for stopped work.
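A minimal sketch of how a script might read and validate the task blocks described above. The parser, its function name, and its error messages are assumptions for illustration; the real `fix-consistency` implementation may work differently.

```python
import re

ALLOWED_STATUSES = {"wait", "todo", "progress", "done", "cancel"}
ALLOWED_PRIORITIES = {"high", "medium", "low"}
FENCE = "`" * 3  # built indirectly so this example can sit inside Markdown

TASK_BLOCK = re.compile(FENCE + r"task\n(.*?)\n" + FENCE, re.DOTALL)


def parse_task_blocks(markdown: str) -> list[dict[str, str]]:
    """Extract key/value pairs from each task block and validate them."""
    tasks = []
    for body in TASK_BLOCK.findall(markdown):
        fields = {}
        for line in body.splitlines():
            line = line.split("#", 1)[0].strip()  # drop trailing comments
            if not line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip('"')
        if fields.get("status") not in ALLOWED_STATUSES:
            raise ValueError(f"{fields.get('id')}: bad status {fields.get('status')!r}")
        if fields.get("priority") not in ALLOWED_PRIORITIES:
            raise ValueError(f"{fields.get('id')}: bad priority {fields.get('priority')!r}")
        tasks.append(fields)
    return tasks


sample = f"""## Example task
{FENCE}task
id: ESRC-WP-0001-T01
status: todo
priority: high
{FENCE}
Task description text.
"""
print(parse_task_blocks(sample))
```

Rejecting unknown status literals early is one way to catch legacy values before they reach the hub.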
 <!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->
--- a/.custodian-brief.md
+++ b/.custodian-brief.md
@@ -0,0 +1,18 @@
 <!-- custodian-brief: generated by fix-consistency — do not edit manually -->
 # Custodian Brief — evidence-source
 **Domain:** infotech  
 **Last synced:** 2026-06-22 18:28 UTC  
 **State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
 ## Active Workstreams
 *(none — repo may need first-session setup)*
 ---
 ## MCP Orientation (when available)
 If the state-hub MCP server is reachable, call:
 `get_domain_summary("infotech")`
 This provides richer cross-domain context.
 If the MCP call fails, use this file as your orientation source.
--- a/.repo-classification.yaml
+++ b/.repo-classification.yaml
@@ -0,0 +1,19 @@
 repo_classification:
  standard: Repo Classification Standard
  version: '1.0'
  classified_at: '2026-06-22'
  classified_by: agent
  category: project
  domain: infotech
  secondary_domains: []
  capability_tags:
  - evidence
  - traceability
  - source-management
  business_stake:
  - technology
  - product
  - operations
  business_mechanics:
  - coordination
  - operation
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,219 @@
 # evidence-source — Agent Instructions
 ## Repo Identity
 **Purpose:** Document ingestion, extraction, fingerprinting, citation recovery. Depends only on citation-engine. INTENT-only during umbrella-first MVP.
 **Domain:** infotech
 **Repo slug:** evidence-source
 **Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
 **Workplan prefix:** `ESRC-WP-`
 ---
 ## State Hub Integration
 The Custodian State Hub tracks work across all domains. Interact via HTTP REST —
 there is no MCP server for Codex agents.
 | Context | URL |
 |---------|-----|
 | Local workstation | `http://127.0.0.1:8000` |
 | Remote via tunnel | `http://127.0.0.1:18000` |
 ### Orient at session start
 ```bash
 # Offline brief — works without hub connection
 cat .custodian-brief.md
 # Active workstreams for this domain
 curl -s "http://127.0.0.1:8000/workstreams/?topic_id=cee7bedf-2b48-46ef-8601-006474f2ad7a&status=active" \
  | python3 -m json.tool
 # Check inbox
 curl -s "http://127.0.0.1:8000/messages/?to_agent=evidence-source&unread_only=true" \
  | python3 -m json.tool
 ```
 Mark a message read:
 ```bash
 curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
  -H "Content-Type: application/json" -d '{}'
 ```
 ### Log progress (required at session close)
 ```bash
 curl -s -X POST http://127.0.0.1:8000/progress/ \
  -H "Content-Type: application/json" \
  -d '{
    "summary": "what was done",
    "event_type": "note",
    "author": "codex",
    "workstream_id": "<uuid>",
    "task_id": "<uuid>"
  }'
 ```
 Omit `workstream_id` / `task_id` when not applicable.
 ### Update task status
 ```bash
 curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"status": "progress"}'
 # values: wait | todo | progress | done | cancel
 ```
 ### Flag a task for human review
 ```bash
 curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"needs_human": true, "intervention_note": "reason"}'
 ```
 ---
 ## Session Protocol
 **Start:**
 1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
 2. Check inbox: `GET /messages/?to_agent=evidence-source&unread_only=true`; mark read
 3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
 4. Check human-needed tasks: `GET /tasks/?needs_human=true`
 **During work:**
 - Update task statuses in workplan files as tasks progress
 - Record significant decisions via `POST /decisions/`
 **Close:**
 1. Update workplan file task statuses to reflect progress
 2. Log: `POST /progress/` with a summary of what changed
 3. Note for the custodian operator: after workplan file changes, run from
   `~/state-hub`:
   ```bash
   make fix-consistency REPO=evidence-source
   ```
   This syncs task status from files into the hub DB.
 ---
 ## Credential and access routing
 **Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
 for inference. Run this check **before** requesting secrets, API keys, SSH access,
 login tokens, or database passwords — in any repo, not only `ops-warden`.
 ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
 other credential need belongs to another subsystem. **Do not** message
 `ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
 ### Lookup (do this first)
 ```bash
 warden route find "<describe your need>" --json
 warden route show <catalog-id> --json
 ```
 Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
 | Agent runtime | How to orient |
 | --- | --- |
 | **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=evidence-source` is for coordination, not secret vending |
 | **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
 | **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
 ### Quick routing table
 | I need… | Owner | ops-warden executes? |
 | --- | --- | --- |
 | SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
 | API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
 | Login / OIDC / MFA | key-cape / Keycloak | No — route only |
 | Authorization decision | flex-auth | No — route only |
 | activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
 | SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
 ### Anti-patterns (do not do these)
 - `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
 - Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
 - Pasting secrets into Git, State Hub, workplans, logs, or chat
 ### Other capabilities (reuse-surface)
 Non-credential capabilities are usually discovered through **reuse-surface** federation
 (`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
 every repo's agent instructions because it is high-frequency, high-risk, and easy to
 get wrong.
 **Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
 <!-- REPO-AGENTS-EXTENSIONS -->
 <!-- Append repo-specific agent instructions below this marker.
     The state-hub template sync preserves content after this line. -->
 ---
 ## Workplan Convention (ADR-001)
 Work items originate as files in this repo — not in the hub. The hub is a
 read/cache/index layer that rebuilds from files.
 **File location:** `workplans/ESRC-WP-NNNN-<slug>.md`
 **Archived location:** finished workplans may move to
 `workplans/archived/YYMMDD-ESRC-WP-NNNN-<slug>.md`. The `YYMMDD` prefix is
 the completion/archive date; the frontmatter `id` does not change.
 **Ad Hoc Tasks:** small opportunistic fixes discovered during a session use
 `workplans/ADHOC-YYYY-MM-DD.md` with task ids `ADHOC-YYYY-MM-DD-T01`, etc. Use
 this only for low-risk work completed directly; create a normal workplan for
 anything needing analysis, design, approval, dependencies, or multiple phases.
 **Frontmatter:**
 ```yaml
 ---
 id: ESRC-WP-NNNN
 type: workplan
 title: "..."
 domain: infotech
 repo: evidence-source
 status: proposed | ready | active | blocked | backlog | finished | archived
 owner: codex
 topic_slug: ...
 created: "YYYY-MM-DD"
 updated: "YYYY-MM-DD"
 state_hub_workstream_id: "<uuid>"   # written by fix-consistency — do not edit
 ---
 ```
 Use `proposed` for a new draft, `ready` after review against current repo
 state, and `finished` after implementation. `stalled` and `needs_review` are
 derived health labels, not frontmatter statuses.
 **Task block format** (one per `##` section):
 ```
 ## Task Title
 ` ` `task
 id: ESRC-WP-NNNN-T01
 status: wait | todo | progress | done | cancel
 priority: high | medium | low
 state_hub_task_id: "<uuid>"         # written by fix-consistency — do not edit
 ` ` `
 Task description text.
 ```
 Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.
 To create a new workplan:
 1. Write the file following the format above
 2. Notify the custodian operator to run `make fix-consistency REPO=evidence-source`
   (or send a message to the hub agent via `POST /messages/`)
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,12 @@
 # evidence-source — Claude Code Instructions
@SCOPE.md
@.claude/rules/repo-identity.md
@.claude/rules/session-protocol.md
@.claude/rules/first-session.md
@.claude/rules/workplan-convention.md
@.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md
@.claude/rules/repo-boundary.md
@.claude/rules/credential-routing.md
@.claude/rules/agents.md
--- a/INTENT.md
+++ b/INTENT.md
@@ -0,0 +1,492 @@
 # INTENT
 ## Purpose
 This repository exists to provide the document source, ingestion, extraction, metadata, and citation recovery layer for the **citation-evidence** ecosystem.
 **evidence-source** turns raw documents and source clues into usable, searchable, addressable document representations that can support annotations, evidence items, citation recovery, and source-backed workflows.
 It is responsible for answering the source-side questions:
 > What is this document?  
 > How can we extract usable text and structure from it?  
 > How can we find or recover a cited source passage?
 ---
 ## Primary Utility
 The repository provides the source pipeline for citation-evidence.
 It should make it possible to:
 - import documents into a collection or workspace,
 - identify document type and media type,
 - compute stable document fingerprints,
 - extract document metadata,
 - extract canonical text,
 - create document representations for PDFs, Markdown, HTML, and later other formats,
 - build maps between text, pages, sections, and rendered views,
 - support local full-text search,
 - support source lookup and citation recovery,
 - provide the document representations needed by **evidence-anchor** and **citation-work**.
 This repository turns documents into evidence-ready sources.
 ---
 ## Intended Users
 Primary users of this repository are developers and agents implementing source handling for citation-evidence.
 They include:
 - developers building document import workflows,
 - developers building review collections,
 - developers implementing PDF, Markdown, and HTML source handling,
 - developers implementing citation recovery,
 - developers integrating local or external source libraries,
 - coding agents that need structured access to document text and metadata.
 End users should experience this repository indirectly whenever they add a document, search source text, or recover a citation.
 ---
 ## Strategic Role
 The strategic role of **evidence-source** is to make source documents usable as reliable evidence substrates.
 Without this repository, the system would depend on whatever a viewer happens to show at runtime. That would make citation capture, re-opening, search, and recovery fragile.
 **evidence-source** creates the normalized source representations that allow the rest of the system to operate consistently across document formats.
 It enables the flow:
 ```text
 Raw Source
  → Document Identity
  → Metadata
  → Canonical Text
  → Document Representation
  → Searchable Source
  → Anchorable Evidence Context
 ```
 ---
 ## Core Concept
 The core concept of this repository is the **document representation**.
 A document representation is a normalized, searchable, addressable view of a source document.
 For a PDF, a representation may include:
 ```text
 document fingerprint
 metadata
 page count
 page text
 global canonical text
 page-local offset map
 text item map
 page dimensions
 source-to-rendering hints
 ```
 For Markdown or HTML, a representation may include:
 ```text
 canonical text
 rendered HTML
 sanitized content
 heading map
 section map
 DOM or AST structure
 offset-to-node map
 source line map where available
 ```
 These representations allow **evidence-anchor** to create and resolve selectors and allow **citation-work** to display and search documents efficiently.
 ---
 ## Scope
 This repository should own:
 * document import workflows,
 * document source identification,
 * media type detection,
 * document fingerprinting,
 * source URI handling,
 * metadata extraction,
 * canonical text extraction,
 * PDF text extraction,
 * Markdown normalization,
 * HTML normalization and sanitization,
 * document representation generation,
 * representation caching,
 * local source search support,
 * quote search support,
 * citation clue parsing,
 * local citation recovery,
 * external source discovery hooks,
 * recovery state tracking,
 * privacy boundaries for source lookup.
 It should provide the source-side capabilities consumed by:
 * **citation-engine** for creating `Document` and `DocumentRepresentation` records,
 * **evidence-anchor** for selector creation and resolution,
 * **citation-work** for document review workflows,
 * **evidence-binder** when evidence needs source context,
 * **citation-evidence** for the integrated product experience.
 ---
 ## Out of Scope
 This repository should not own the broader evidence domain or user workflows.
 Specifically, it should not own:
 * the canonical evidence domain model,
 * persistence policy beyond source and representation storage contracts,
 * low-level anchor resolution algorithms,
 * visual highlight rendering,
 * review workspace UI,
 * form-field binding semantics,
 * visual guide overlay behavior,
 * citation card rendering,
 * application shell and deployment,
 * final human validation of evidence quality.
 Those responsibilities belong to the appropriate citation-evidence subsystem repositories.
 ---
 ## Architectural Position
 ```text
 citation-evidence
  integrated product shell
 citation-engine
  core domain model, services, persistence contracts
 evidence-source
  document ingestion, extraction, metadata, representations, citation recovery
 evidence-anchor
  selectors, anchor resolution, re-anchoring, highlighting contracts
 citation-work
  review workspace and annotation UX
 evidence-binder
  evidence-to-target binding and active evidence state
 ```
 **evidence-source** should provide document representations, not define what evidence means.
 It should feed reliable source material into the rest of the system.
 ---
 ## Primary Workflows
 ### 1. Import Document
 A user or system adds a source document.
 ```text
 Add Source
  → Identify Media Type
  → Compute Fingerprint
  → Extract Metadata
  → Extract Text
  → Build Representation
  → Register Document
 ```
 ### 2. Generate PDF Representation
 A PDF is converted into a representation suitable for review and anchoring.
 ```text
 PDF Source
  → Load PDF
  → Extract Page Text
  → Normalize Text
  → Build Page Map
  → Build Offset Map
  → Store Representation
 ```
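The page map and offset map steps above could look roughly like this. It is a sketch under stated assumptions: page text has already been extracted by some PDF library, and `PdfRepresentation`, `build_representation`, and the whitespace-collapsing normalization are hypothetical names and choices, not a contract from citation-engine.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class PdfRepresentation:
    canonical_text: str
    page_starts: list[int]  # global offset where each page begins

    def page_for_offset(self, offset: int) -> int:
        """Return the 1-based page number containing a global text offset."""
        if not 0 <= offset < len(self.canonical_text):
            raise IndexError(offset)
        return bisect_right(self.page_starts, offset)


def build_representation(page_texts: list[str]) -> PdfRepresentation:
    """Join normalized page texts and record where each page starts."""
    parts, starts, cursor = [], [], 0
    for text in page_texts:
        normalized = " ".join(text.split())  # collapse whitespace runs
        starts.append(cursor)
        parts.append(normalized)
        cursor += len(normalized) + 1  # +1 for the newline separator
    return PdfRepresentation("\n".join(parts), starts)


rep = build_representation(["First  page\ntext.", "Second page."])
offset = rep.canonical_text.index("Second")
print(rep.page_for_offset(offset))
```

Keeping one global text plus a sorted list of page starts lets a selector expressed as a global offset be mapped back to a page with a binary search.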
 ### 3. Generate Markdown / HTML Representation
 A Markdown or HTML source is converted into a normalized rendered and searchable representation.
 ```text
 Markdown / HTML Source
  → Parse / Sanitize
  → Render if needed
  → Extract Canonical Text
  → Build Heading / Section Map
  → Build Offset Map
  → Store Representation
 ```
 ### 4. Search Local Sources
 A user or subsystem searches available source material.
 ```text
 Search Query / Quote
  → Search Metadata
  → Search Full Text
  → Return Candidate Documents / Passages
 ```
 ### 5. Recover Citation
 A user provides a citation, quote, or source clue.
 ```text
 Citation Clue
  → Parse Source Metadata
  → Search Local Library
  → Optionally Search Configured External Sources
  → Load Candidate Source
  → Search Exact Quote
  → Search Fuzzy Quote
  → Present Candidate Passages
  → User Confirms
  → Create Source Context for Annotation
 ```
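The "Search Exact Quote" then "Search Fuzzy Quote" steps can be sketched with the standard library. This is only an illustration: `find_quote`, the 0.8 threshold, and the word-aligned sliding window are assumptions, and a real implementation would likely use an index instead of a linear scan.

```python
from difflib import SequenceMatcher


def find_quote(text: str, quote: str, threshold: float = 0.8) -> list[tuple[int, float]]:
    """Return (offset, score) candidates: exact hits if any, else fuzzy windows."""
    exact = []
    start = text.find(quote)
    while start != -1:
        exact.append((start, 1.0))
        start = text.find(quote, start + 1)
    if exact:
        return exact

    # Fuzzy fallback: compare the quote against quote-sized windows that
    # begin at word boundaries, keeping those above the threshold.
    window = len(quote)
    offsets = [0] + [i + 1 for i, ch in enumerate(text) if ch == " "]
    candidates = []
    for offset in offsets:
        score = SequenceMatcher(None, quote, text[offset:offset + window]).ratio()
        if score >= threshold:
            candidates.append((offset, score))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


text = "The quick brown fox jumps over the lazy dog."
print(find_quote(text, "brown fox"))
print(find_quote(text, "brown fax jumps"))
```

The function returns ranked candidates and never marks one as confirmed, which leaves the "User Confirms" step to the caller.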
 ---
 ## Initial Source Types
 The first version should support or prepare for:
 ```text
 PDF
 Markdown
 HTML
 plain text
 remote URL references
 ```
 Later versions may support:
 ```text
 DOCX
 EPUB
 scanned image documents
 OCR-derived text
 IIIF resources
 TEI XML
 structured datasets with source passages
 ```
 ---
 ## Citation Recovery States
 Citation recovery should be modeled explicitly.
 Initial recovery states may include:
 ```text
 created
 source-found-fulltext
 source-found-preview-only
 source-found-metadata-only
 source-not-found
 quote-found
 quote-not-found
 candidate-passages-found
 manual-confirmation-needed
 confirmed
 annotation-created
 failed
 ```
 The system should distinguish between finding a source and finding the exact cited passage.
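One way to model these states explicitly is an enum whose values match the strings listed above. The grouping into source-level and passage-level outcomes is an assumption made here to illustrate the distinction; the authoritative enum is expected to live in the shared contracts.

```python
from enum import Enum


class RecoveryState(str, Enum):
    CREATED = "created"
    SOURCE_FOUND_FULLTEXT = "source-found-fulltext"
    SOURCE_FOUND_PREVIEW_ONLY = "source-found-preview-only"
    SOURCE_FOUND_METADATA_ONLY = "source-found-metadata-only"
    SOURCE_NOT_FOUND = "source-not-found"
    QUOTE_FOUND = "quote-found"
    QUOTE_NOT_FOUND = "quote-not-found"
    CANDIDATE_PASSAGES_FOUND = "candidate-passages-found"
    MANUAL_CONFIRMATION_NEEDED = "manual-confirmation-needed"
    CONFIRMED = "confirmed"
    ANNOTATION_CREATED = "annotation-created"
    FAILED = "failed"


# Hypothetical grouping: which states describe the source, which the passage.
SOURCE_OUTCOMES = frozenset({
    RecoveryState.SOURCE_FOUND_FULLTEXT,
    RecoveryState.SOURCE_FOUND_PREVIEW_ONLY,
    RecoveryState.SOURCE_FOUND_METADATA_ONLY,
    RecoveryState.SOURCE_NOT_FOUND,
})
PASSAGE_OUTCOMES = frozenset({
    RecoveryState.QUOTE_FOUND,
    RecoveryState.QUOTE_NOT_FOUND,
    RecoveryState.CANDIDATE_PASSAGES_FOUND,
})


def outcome_level(state: RecoveryState) -> str:
    """Classify a state as a 'source', 'passage', or 'lifecycle' outcome."""
    if state in SOURCE_OUTCOMES:
        return "source"
    if state in PASSAGE_OUTCOMES:
        return "passage"
    return "lifecycle"


print(outcome_level(RecoveryState("source-found-metadata-only")))
print(outcome_level(RecoveryState("quote-found")))
```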
 ---
 ## Privacy and Source Lookup Principles
 Source lookup can create privacy risks.
 The repository should follow these principles:
 * search local sources first,
 * make external lookup explicit and configurable,
 * avoid sending private document text to external services by default,
 * record which external services were queried,
 * distinguish public metadata lookup from full-text upload,
 * allow deployments to disable external lookup completely,
 * prefer deterministic local processing where possible.
 External source discovery should be an extension point, not an unavoidable default behavior.
 ---
 ## Design Principles
 ### Source Identity First
 Every imported document should receive a stable identity based on available metadata, source URI, and fingerprint.
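A content fingerprint is one ingredient of such an identity. The sketch below assumes SHA-256 over the raw bytes and a `sha256:` prefix; both are illustrative choices, since the text above does not fix an algorithm or format.

```python
import hashlib
import io
from typing import BinaryIO


def fingerprint(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Return a SHA-256 content fingerprint, reading the stream in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


a = fingerprint(io.BytesIO(b"same bytes"))
b = fingerprint(io.BytesIO(b"same bytes"))
print(a == b)
```

Hashing the bytes instead of the filename or URI means the same document imported from two locations yields the same fingerprint.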
 ### Canonical Text Matters
 Anchoring and search depend on canonical text. The repository should make text normalization explicit and repeatable.
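As an illustration of explicit and repeatable normalization, the sketch below applies a fixed sequence of steps. The specific rules (NFC, newline unification, whitespace collapsing) and the `NORMALIZATION_VERSION` tag are assumptions; the actual rules are defined in the shared contracts, not here.

```python
import re
import unicodedata

NORMALIZATION_VERSION = "sketch-1"  # hypothetical tag, bumped when rules change


def canonicalize(text: str) -> str:
    """Apply a fixed, ordered set of normalization steps to source text."""
    text = unicodedata.normalize("NFC", text)              # compose characters
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify newlines
    text = re.sub(r"[ \t\f\v\u00a0]+", " ", text)          # collapse spaces
    text = re.sub(r" ?\n ?", "\n", text)                   # trim around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)                 # at most one blank line
    return text.strip()


raw = "Cafe\u0301  au\u00a0lait \r\n\r\n\r\nnext"
print(repr(canonicalize(raw)))
```

Recording the version next to stored offsets makes it possible to detect selectors that were computed against an older normalization.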
 ### Representation Is Not Source
 The original source and generated representation are different things.
 The system should preserve this distinction.
 ### Local Before External
 Citation recovery should search local documents before looking elsewhere.
 ### Human Confirmation
 Recovered citations should not silently become confirmed evidence. Candidate matches should be presented for confirmation when uncertainty exists.
 ### Format-Aware, Model-Neutral
 The repository should understand document formats but should not own the broader evidence model.
 ### Cache Expensive Work
 Text extraction, fingerprinting, and representation generation should be cacheable by source fingerprint and version.
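A minimal in-memory sketch of caching keyed by fingerprint and version. `RepresentationCache` and its method names are hypothetical, and a real cache would probably persist to disk or a database.

```python
from typing import Callable


class RepresentationCache:
    """Cache representations under a (fingerprint, version) key."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], dict] = {}
        self.misses = 0

    def get_or_build(self, fingerprint: str, version: str,
                     build: Callable[[], dict]) -> dict:
        """Return the cached representation, building it only on a miss."""
        key = (fingerprint, version)
        if key not in self._store:
            self.misses += 1
            self._store[key] = build()
        return self._store[key]


cache = RepresentationCache()
build = lambda: {"canonical_text": "example"}
cache.get_or_build("sha256:abc", "v1", build)
cache.get_or_build("sha256:abc", "v1", build)  # served from cache
cache.get_or_build("sha256:abc", "v2", build)  # new version, rebuilt
print(cache.misses)
```

Including the version in the key means a change to extraction or normalization rules invalidates old entries without deleting them.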
 ### Agent-Friendly Output
 Extracted metadata, representations, and recovery candidates should be structured enough for agents to inspect, rank, and explain.
 ---
 ## Expected Dependencies
 This repository is expected to depend on shared types and service contracts from:
 ```text
 citation-engine
  Document, DocumentRepresentation, CitationRecoveryAttempt, source-related contracts
 ```
 It may be consumed by:
 ```text
 citation-work
  to load reviewable documents and document representations
 evidence-anchor
  to resolve selectors against extracted representations
 evidence-binder
  to retrieve source context for linked evidence
 citation-evidence
  to provide integrated import and recovery workflows
 ```
 It should avoid depending on review UI or form-binding implementation details.
 ---
 ## First Useful Version
 A first useful version of **evidence-source** should provide:
 * source import interface,
 * media type detection,
 * document fingerprinting,
 * basic metadata extraction,
 * PDF text extraction,
 * Markdown text extraction,
 * HTML sanitization and text extraction,
 * canonical text normalization,
 * document representation generation,
 * simple local quote search,
 * recovery attempt model or contract,
 * examples showing how a document becomes a representation usable by **evidence-anchor**.
 The first version does not need full external source discovery or OCR, but it should establish the ingestion and representation pattern.
 ---
 ## Success Criteria
 The repository is successful when another subsystem can use it to:
 1. import a source document,
 2. identify and fingerprint it,
 3. extract useful metadata,
 4. generate canonical text,
 5. generate a document representation,
 6. search the source text,
 7. provide representation data to **evidence-anchor**,
 8. support a local citation recovery attempt from a quote or citation clue.
 A developer or coding agent should be able to understand from this repository how raw documents become evidence-ready sources.
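The criteria above can be read as a small pipeline. The following sketch compresses import, fingerprinting, canonical text, representation, and quote search into two functions. `Representation`, `import_source`, and `find_quote` are hypothetical names, the input is assumed to be UTF-8 text, and the whitespace-collapsing canonical form is a simplification for the example.

```python
import hashlib
import unicodedata
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Representation:
    fingerprint: str
    media_type: str
    canonical_text: str


def _canonical(text: str) -> str:
    # Simplified canonical form: NFC plus single-space whitespace.
    return " ".join(unicodedata.normalize("NFC", text).split())


def import_source(data: bytes, media_type: str = "text/markdown") -> Representation:
    """Fingerprint the raw bytes and derive a searchable representation."""
    return Representation(
        fingerprint=hashlib.sha256(data).hexdigest(),
        media_type=media_type,
        canonical_text=_canonical(data.decode("utf-8")),
    )


def find_quote(rep: Representation, quote: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets into the canonical text, or None if absent."""
    needle = _canonical(quote)
    start = rep.canonical_text.find(needle)
    return None if start < 0 else (start, start + len(needle))
```

The quote is normalized with the same function as the document, so a quote that spans a line break in the source still resolves to offsets in the canonical text.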
 ---
 ## Repository Character
 This repository should be:
 * source-focused,
 * ingestion-oriented,
 * privacy-conscious,
 * format-aware,
 * representation-centered,
 * cache-friendly,
 * suitable for local-first and server-side use,
 * explicit about uncertainty in citation recovery,
 * careful not to absorb review or binding responsibilities.
 ---
 ## MVP Coordination — Code Lives Upstream
 During the umbrella-first MVP phase (decided 2026-05-24), **the source code
 for this subsystem does not live in this repository yet**. It lives in the
 umbrella repo at `citation-evidence/src/source/`.
 This INTENT.md documents the *intended* responsibilities and boundaries.
 When the ingestion and representation interfaces have stabilized through
 actual MVP use, the corresponding code extracts into this repository.
 **Shared contracts** (Document and DocumentRepresentation shapes,
 CitationRecoveryAttempt state enum, canonical text normalization, allowed
 dependency edges) are maintained in the umbrella repo:
 * `citation-evidence/wiki/SharedContracts.md`
 * `citation-evidence/wiki/DependencyMap.md`
 * `citation-evidence/docs/decisions/` (ADRs)
 This subsystem's eventual code must not contradict those documents. Changes
 to shared contracts happen in the umbrella, not here.
 Under the dependency map, **`evidence-source` may depend only on
 `citation-engine`** — not on `evidence-anchor`. When ingestion needs to know
 "could a selector resolve here?", the answer travels through events, not
 direct calls.
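A minimal sketch of that event-based coordination, assuming an in-process publish/subscribe bus. The `EventBus` class, the `representation.ready` topic, and the payload fields are hypothetical; the actual event mechanism is defined in the umbrella repository, not here.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Minimal in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, payload: Dict) -> int:
        # Copy the handler list so subscribers added during delivery are not called.
        handlers = list(self._handlers.get(topic, []))
        for handler in handlers:
            handler(payload)
        return len(handlers)


def announce_representation(bus: EventBus, document_id: str, fingerprint: str) -> int:
    # Source side: publishes a fact and imports nothing from evidence-anchor.
    return bus.publish(
        "representation.ready",
        {"document_id": document_id, "fingerprint": fingerprint},
    )
```

In this arrangement the anchor side subscribes to the topic and decides for itself whether a selector resolves, so the dependency edge from source to anchor never appears in code.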
 ---
 ## Guiding Statement
 **evidence-source exists to turn documents and citation clues into reliable, searchable, anchorable source context.**
--- a/README.md
+++ b/README.md
@@ -1,3 +1,16 @@
-# repo-seed
+# evidence-source
-A git repository template to bootstrap coulomb projects from.
+Document source, ingestion, extraction, metadata, and citation recovery —
 PDF/HTML/MD ingest, fingerprinting, page-/offset-map construction,
 canonical-text extraction, and the recovery behavior for stale selectors.
 ## MVP status: INTENT only
 During the citation-evidence MVP, code lives upstream in
 [`citation-evidence`](../citation-evidence/) under `src/source/`. This repo
 currently holds `INTENT.md` describing what will move here. Contract
 changes belong in
 [`citation-evidence/wiki/SharedContracts.md`](../citation-evidence/wiki/SharedContracts.md),
 not here.
 Per the dependency map, source depends on `shared/` and `engine/` only.
--- a/SCOPE.md
+++ b/SCOPE.md
@@ -0,0 +1,137 @@
 # SCOPE
 > This file helps you quickly understand what this repository is about,
 > when it is relevant, and when it is not.
 > It is intentionally lightweight and may be incomplete.
 ---
 ## One-liner
 <!-- Describe the purpose of this repository in one precise sentence. -->
 <!-- Example: "Provides a lightweight event router for Kubernetes-native systems." -->
 ---
 ## Core Idea
 <!-- What is the main capability or idea behind this repository? -->
 <!-- What problem does it try to solve? -->
 ---
 ## In Scope
 <!-- What this repository is responsible for. -->
 <!-- Be explicit and concrete. -->
 -
 -
 -
 ---
 ## Out of Scope
 <!-- What this repository deliberately does NOT do. -->
 <!-- This is often more important than "In Scope". -->
 -
 -
 -
 ---
 ## Relevant When
 <!-- When should someone consider using or exploring this repository? -->
 -
 -
 -
 ---
 ## Not Relevant When
 <!-- When should someone ignore this repository? -->
 -
 -
 -
 ---
 ## Current State
 <!-- Rough indication of maturity. No strict format required. -->
 - Status: <!-- e.g. concept / experimental / active / stable / deprecated -->
 - Implementation: <!-- e.g. idea / partial / substantial / complete -->
 - Stability: <!-- e.g. unstable / evolving / stable -->
 - Usage: <!-- e.g. none / personal / internal / production -->
 <!-- Add any notes that help set expectations. -->
 ---
 ## How It Fits
 <!-- Where does this repository sit in the bigger picture? -->
 - Upstream dependencies:
 - Downstream consumers:
 - Often used with:
 ---
 ## Terminology
 <!-- Terms that are important to understand this repo. -->
 <!-- Especially useful if naming differs from other repos. -->
 - Preferred terms:
 - Also known as:
 - Potentially confusing terms:
 ---
 ## Related / Overlapping Repositories
 <!-- List repositories that have similar or adjacent responsibilities. -->
 <!-- Helps detect duplication and navigate the ecosystem. -->
 - <repo-name> — <!-- how it relates -->
 ---
 ## Getting Oriented
 <!-- If someone decides to look deeper, where should they start? -->
 - Start with:
 - Key files / directories:
 - Entry points:
 ---
 ## Provided Capabilities
 <!-- What can this repo's domain provide to other domains on request? -->
 <!-- Each capability block is parsed by the state-hub capability catalog ingest. -->
 <!-- Remove the examples and add your own, or leave empty if none. -->
 <!--
 ```capability
 type: infrastructure
 title: Example capability title
 description: What this capability provides, in one or two sentences.
 keywords: [keyword1, keyword2, keyword3]
 ```
 -->
 ---
 ## Notes
 <!-- Anything else worth knowing. Keep it short. -->
--- a/registry/README.md
+++ b/registry/README.md
@@ -0,0 +1,12 @@
 # Capability Registry
 Markdown-first capability index for federation and reuse planning.
 ## Authoring
 1. Copy a capability entry template (see reuse-surface `templates/capability-entry.template.md`).
 2. Add the row to `indexes/capabilities.yaml`.
 3. Run `reuse-surface validate` from a checkout with the CLI installed.
 4. Merge to `main` and verify publish with `reuse-surface establish --publish-check`.
 Federation contract: reuse-surface `docs/RegistryFederation.md`.
--- a/registry/capabilities/.gitkeep
+++ b/registry/capabilities/.gitkeep
--- a/registry/indexes/capabilities.yaml
+++ b/registry/indexes/capabilities.yaml
@@ -0,0 +1,4 @@
 version: 1
 updated: '2026-06-16'
 domain: helix_forge
 capabilities: []
--- a/workplans/ESRC-WP-0001-intent-placeholder.md
+++ b/workplans/ESRC-WP-0001-intent-placeholder.md
@@ -0,0 +1,19 @@
 ---
 id: ESRC-WP-0001
 type: workplan
 title: "INTENT placeholder — await extraction from citation-evidence"
 domain: infotech
 repo: evidence-source
 status: backlog
 owner: codex
 topic_slug: citation_evidence_mvp
 created: "2026-06-21"
 updated: "2026-06-21"
 state_hub_workstream_id: "64771b5d-4b83-4848-a562-4b00aad017b2"
 ---
 # ESRC-WP-0001 — INTENT Placeholder
 Umbrella-first MVP: source/ingestion code will extract from `citation-evidence`
 when the subsystem boundary stabilizes. This file satisfies ADR-001 workplan
 structure until then. See `INTENT.md`.
Author	SHA1	Message	Date
tegwick	0a7176bf2d	Normalize agent instructions and workplan frontmatter (STATE-WP-0067) - Align agent files with on-disk workplan prefixes (infer from workplan ids) - Set workplan domain to registered domain_slug; add topic_slug where applicable - Repair frontmatter delimiter formatting; migrate legacy task status literals - Regenerate AGENTS.md, CLAUDE.md, and .claude/rules from State Hub templates	2026-06-22 23:16:24 +02:00
tegwick	f6003bc4a1	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-06-22: - update .custodian-brief.md for evidence-source	2026-06-22 20:28:07 +02:00
tegwick	7cf52213bf	Add .repo-classification.yaml (CUST-WP-0050 T11 agent first-pass)	2026-06-22 17:47:36 +02:00
tegwick	25fb6946bc	chore(ADR-001): add INTENT placeholder workplan for umbrella-first MVP Add ESRC-WP-0001-intent-placeholder.md so consistency sweeps pass C-01 while this repo remains INTENT-only. Implementation stays in citation-evidence until extraction. Hub workstream id is written in frontmatter from fix-consistency.	2026-06-22 01:30:22 +02:00
tegwick	720e46eef5	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-06-22: - update .custodian-brief.md for evidence-source	2026-06-22 01:15:09 +02:00
tegwick	40b2f12797	Add capability registry scaffold (REUSE-WP-0014-T04 B02) Empty helix_forge registry layout for federation publishing.	2026-06-16 01:52:33 +02:00
tegwick	d37f22ac18	Point README at citation-evidence umbrella during MVP phase Code lives upstream in citation-evidence/src/source/ during the MVP. README documents that and points at SharedContracts.md for ingest, fingerprint, and recovery contract changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 00:13:12 +02:00
tegwick	d8a08d6032	Add MVP Coordination section: code lives in citation-evidence umbrella during MVP Documents the umbrella-first MVP decision (2026-05-24). This repo remains INTENT-only until the ingestion and representation interfaces stabilize through real product use. Reaffirms: source depends only on engine, not on anchor — coordination between them flows through events. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 16:51:06 +02:00