Compare commits

8 Commits

Author SHA1 Message Date
0a7176bf2d Normalize agent instructions and workplan frontmatter (STATE-WP-0067)
- Align agent files with on-disk workplan prefixes (infer from workplan ids)
- Set workplan domain to registered domain_slug; add topic_slug where applicable
- Repair frontmatter delimiter formatting; migrate legacy task status literals
- Regenerate AGENTS.md, CLAUDE.md, and .claude/rules from State Hub templates
2026-06-22 23:16:24 +02:00
f6003bc4a1 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-22:
  - update .custodian-brief.md for evidence-source
2026-06-22 20:28:07 +02:00
7cf52213bf Add .repo-classification.yaml (CUST-WP-0050 T11 agent first-pass)
2026-06-22 17:47:36 +02:00
25fb6946bc chore(ADR-001): add INTENT placeholder workplan for umbrella-first MVP
Add ESRC-WP-0001-intent-placeholder.md so consistency sweeps pass C-01 while this repo remains
INTENT-only. Implementation stays in citation-evidence until extraction.
Hub workstream id is written in frontmatter from fix-consistency.
2026-06-22 01:30:22 +02:00
720e46eef5 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-22:
  - update .custodian-brief.md for evidence-source
2026-06-22 01:15:09 +02:00
40b2f12797 Add capability registry scaffold (REUSE-WP-0014-T04 B02)
Empty helix_forge registry layout for federation publishing.
2026-06-16 01:52:33 +02:00
d37f22ac18 Point README at citation-evidence umbrella during MVP phase
Code lives upstream in citation-evidence/src/source/ during the MVP.
README documents that and points at SharedContracts.md for ingest,
fingerprint, and recovery contract changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 00:13:12 +02:00
d8a08d6032 Add MVP Coordination section: code lives in citation-evidence umbrella during MVP
Documents the umbrella-first MVP decision (2026-05-24). This repo remains
INTENT-only until the ingestion and representation interfaces stabilize
through real product use. Reaffirms: source depends only on engine, not on
anchor — coordination between them flows through events.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 16:51:06 +02:00
20 changed files with 1220 additions and 2 deletions

.claude/rules/agents.md Normal file

@@ -0,0 +1,20 @@
## Kaizen Agents
Specialized agent personas available on demand via the state-hub MCP.
**Discover:** `list_kaizen_agents()` — returns all agents with name, description, category
**Load:** `get_kaizen_agent("tdd-workflow")` — returns full instructions; read and follow them
Common agents:
| Agent | Category | When to use |
|-------|----------|-------------|
| `tdd-workflow` | testing | Step-by-step TDD8 workflow for any feature |
| `code-refactoring` | quality | Code quality analysis and safe refactoring |
| `test-maintenance` | testing | Diagnose and fix failing tests |
| `requirements-engineering` | process | Prevent interface/mock mismatches upfront |
| `keepaTodofile` | process | Maintain TODO.md during work |
| `project-management` | process | Track status, determine next steps |
| `datamodel-optimization` | quality | Optimize dataclasses and data structures |
All 17 agents: call `list_kaizen_agents()` for the full list.

@@ -0,0 +1,8 @@
## Architecture
<!-- TODO: Describe the key design decisions and component structure.
Key modules, data flows, external integrations, state machines, etc. -->
## Quick Reference
`~/state-hub/mcp_server/TOOLS.md` — MCP tool reference

@@ -0,0 +1,50 @@
# Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=evidence-source` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`

@@ -0,0 +1,38 @@
## First Session Protocol
Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
The project is registered but work has not yet been structured.
**Step 1 — Read, don't write**
- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
- Scan repo root: README, directory structure, existing code or docs
**Step 2 — Survey in-progress work**
Look for TODOs, open branches, half-finished files. Note what is done versus what is started but incomplete.
**Step 3 — Propose workstreams to Bernd**
Propose 1–3 workstreams — each a coherent strand, weeks to months, anchored to a
roadmap phase. **Wait for approval before creating.**
**Step 4 — Create workplan file first, then DB record (ADR-001)**
```
workplans/ESRC-WP-NNNN-<slug>.md ← write this first
```
Then register in the hub:
```
create_workstream(topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", title="...", owner="...", description="...")
create_task(workstream_id="<id>", title="...", priority="high|medium|low")
```
**Step 5 — Record the setup**
```
add_progress_event(
summary="First session: structured infotech into N workstreams, M tasks",
event_type="milestone",
topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
detail={"workstreams": [...], "tasks_created": M}
)
```
<!-- Delete or archive this file once past first session -->

@@ -0,0 +1,8 @@
## Repo boundary
This repo owns **evidence-source** only. It does not own:
<!-- TODO: List what belongs in adjacent repos, e.g.:
- SSH key management → railiance-infra/
- State hub code → state-hub/
-->

@@ -0,0 +1,5 @@
**Purpose:** Document ingestion, extraction, fingerprinting, citation recovery. Depends only on citation-engine. INTENT-only during umbrella-first MVP.
**Domain:** infotech
**Repo slug:** evidence-source
**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a

@@ -0,0 +1,85 @@
## Session Protocol
Dev Hub (State Hub API): http://127.0.0.1:8000
MCP server name in `~/.claude.json`: `dev-hub`
**Step 1 — Orient**
Read the offline-safe brief first — it works without a live hub connection:
```bash
cat .custodian-brief.md
```
Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
```
get_domain_summary("infotech")
```
If MCP tools are unavailable in the current agent session, use the REST API:
```bash
curl -s "http://127.0.0.1:8000/state/summary" | python3 -m json.tool
```
If the hub is offline: `cd ~/state-hub && make api`
**Step 2 — Check inbox**
With MCP tools:
```
get_messages(to_agent="evidence-source", unread_only=True)
```
Mark read with `mark_message_read(message_id)`. Reply or act on coordination
requests before proceeding.
Without MCP tools:
```bash
curl -s "http://127.0.0.1:8000/messages/?to_agent=evidence-source&unread_only=true" \
| python3 -m json.tool
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
-H "Content-Type: application/json" -d '{}'
```
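The two `curl` calls above can also be issued from Python. This is a minimal sketch using only the standard library. The helper names `inbox_request` and `mark_read_request` are hypothetical, and the shape of the hub's JSON response is not specified here, so the sketch only builds the requests and leaves sending to the caller.

```python
import urllib.parse
import urllib.request

HUB = "http://127.0.0.1:8000"  # State Hub base URL from the Session Protocol


def inbox_request(agent: str) -> urllib.request.Request:
    """Build the GET request for unread messages addressed to `agent`."""
    query = urllib.parse.urlencode({"to_agent": agent, "unread_only": "true"})
    return urllib.request.Request(f"{HUB}/messages/?{query}", method="GET")


def mark_read_request(message_id: str) -> urllib.request.Request:
    """Build the PATCH request that marks one message as read."""
    return urllib.request.Request(
        f"{HUB}/messages/{message_id}/read",
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


if __name__ == "__main__":
    req = inbox_request("evidence-source")
    print(req.get_method(), req.full_url)
    # To send it, pass the request to urllib.request.urlopen(req, timeout=5)
    # while the hub is running.
```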
**Step 3 — Scan workplans**
```bash
ls workplans/
```
For each file with `status: ready`, `active`, or `blocked`, note pending
`wait`/`todo`/`progress` tasks.
**Step 4 — Present brief**
1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
2. **Pending tasks** from `workplans/` + any `[repo:evidence-source]` hub tasks
3. **Goal guidance** — if `goal_guidance` in summary:
- `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
- `alignment_warnings`: flag if active work is not aligned with current goal
4. **Suggested next action** — highest-priority open item
5. **SBOM status** — flag if `last_sbom_at` is unset for this repo
If no workstreams: follow First Session Protocol (`first-session.md`).
**During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`
> State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
> are First Session Protocol only. Work structure belongs in repo files (ADR-001).
**Session close:**
With MCP tools:
```
add_progress_event(summary="...", topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", workstream_id="<uuid>")
```
Without MCP tools:
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
-H "Content-Type: application/json" \
-d '{"topic_id":"cee7bedf-2b48-46ef-8601-006474f2ad7a","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
```
If workplan files were modified, ensure the local copy is up to date first:
```bash
git -C <repo_path> pull --ff-only
cd ~/state-hub && make fix-consistency REPO=evidence-source
```
For repos where implementation runs on a remote machine (e.g. CoulombCore),
use the combined target which pulls before fixing:
```bash
cd ~/state-hub && make fix-consistency-remote REPO=evidence-source
```
**C-15** (DB task ahead of file) is normal in multi-machine workflows — writeback
will sync the file to match the DB. **C-16** (repo behind remote) blocks all writes
until you pull — intentional to prevent clobbering remote progress.

@@ -0,0 +1,19 @@
## Stack
<!-- TODO: Fill in language, frameworks, and key dependencies -->
- **Language:**
- **Key deps:**
## Dev Commands
```bash
# TODO: Fill in the standard commands for this repo
# Install dependencies
# Run tests
# Lint / type check
# Build / package (if applicable)
```

@@ -0,0 +1,40 @@
## Workplan Convention (ADR-001)
File location: `workplans/ESRC-WP-NNNN-<slug>.md`
ID prefix: `ESRC-WP-`
Work items originate as files in this repo **before** being registered in the hub.
Canonical workplan/workstream frontmatter statuses are:
`proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
Use `proposed` for a newly drafted plan, `ready` after review against current
repo state, and `finished` when implementation is complete. `stalled` and
`needs_review` are derived health labels, not stored statuses.
Closed workplans may be moved to `workplans/archived/` with a completion-date
prefix: `YYMMDD-ESRC-WP-NNNN-<slug>.md`. The frontmatter id remains
unchanged; the prefix is only for quick visual reference.
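The archive naming rule can be sketched as a one-line helper. The function name `archived_name` is hypothetical, and the completion date in the example is illustrative only.

```python
from datetime import date


def archived_name(filename: str, completed: date) -> str:
    """Prefix a workplan filename with its completion date as YYMMDD."""
    return f"{completed.strftime('%y%m%d')}-{filename}"


# Illustrative date, not a real completion date for this workplan.
print(archived_name("ESRC-WP-0001-intent-placeholder.md", date(2026, 6, 22)))
# prints: 260622-ESRC-WP-0001-intent-placeholder.md
```

Only the filename changes; the frontmatter `id` inside the file stays as it was.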
Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
`workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
`ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
directly. Promote anything requiring analysis, design, approval, dependencies, or
multiple planned phases into a normal workplan.
Ecosystem todos from other agents arrive as `[repo:evidence-source]` hub tasks —
visible at session start. Pick one up by creating the workplan file, then registering
the workstream.
Task blocks use this shape:
```task
id: ESRC-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
```
Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
blocked work and `cancel` for stopped work.
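A small guard for task status changes might look like the sketch below. The five status values come from the task block shape. The transition table itself is an assumption, since the convention only states the main progression and the meaning of `wait` and `cancel`.

```python
# Statuses are from the task block shape; which moves are allowed is assumed.
ALLOWED = {
    "todo": {"progress", "wait", "cancel"},
    "wait": {"todo", "progress", "cancel"},
    "progress": {"done", "wait", "cancel"},
    "done": set(),
    "cancel": set(),
}


def can_move(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is in the table."""
    if current not in ALLOWED or target not in ALLOWED:
        raise ValueError(f"unknown status: {current!r} or {target!r}")
    return target in ALLOWED[current]
```

Treating `done` and `cancel` as terminal is a design choice of this sketch, not something the convention requires.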
<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->

.custodian-brief.md Normal file

@@ -0,0 +1,18 @@
<!-- custodian-brief: generated by fix-consistency — do not edit manually -->
# Custodian Brief — evidence-source
**Domain:** infotech
**Last synced:** 2026-06-22 18:28 UTC
**State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
## Active Workstreams
*(none — repo may need first-session setup)*
---
## MCP Orientation (when available)
If the state-hub MCP server is reachable, call:
`get_domain_summary("infotech")`
This provides richer cross-domain context.
If the MCP call fails, use this file as your orientation source.

.repo-classification.yaml Normal file

@@ -0,0 +1,19 @@
repo_classification:
standard: Repo Classification Standard
version: '1.0'
classified_at: '2026-06-22'
classified_by: agent
category: project
domain: infotech
secondary_domains: []
capability_tags:
- evidence
- traceability
- source-management
business_stake:
- technology
- product
- operations
business_mechanics:
- coordination
- operation

AGENTS.md Normal file

@@ -0,0 +1,219 @@
# evidence-source — Agent Instructions
## Repo Identity
**Purpose:** Document ingestion, extraction, fingerprinting, citation recovery. Depends only on citation-engine. INTENT-only during umbrella-first MVP.
**Domain:** infotech
**Repo slug:** evidence-source
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
**Workplan prefix:** `ESRC-WP-`
---
## State Hub Integration
The Custodian State Hub tracks work across all domains. Interact via HTTP REST —
there is no MCP server for Codex agents.
| Context | URL |
|---------|-----|
| Local workstation | `http://127.0.0.1:8000` |
| Remote via tunnel | `http://127.0.0.1:18000` |
### Orient at session start
```bash
# Offline brief — works without hub connection
cat .custodian-brief.md
# Active workstreams for this domain
curl -s "http://127.0.0.1:8000/workstreams/?topic_id=cee7bedf-2b48-46ef-8601-006474f2ad7a&status=active" \
| python3 -m json.tool
# Check inbox
curl -s "http://127.0.0.1:8000/messages/?to_agent=evidence-source&unread_only=true" \
| python3 -m json.tool
```
Mark a message read:
```bash
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
-H "Content-Type: application/json" -d '{}'
```
### Log progress (required at session close)
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
-H "Content-Type: application/json" \
-d '{
"summary": "what was done",
"event_type": "note",
"author": "codex",
"workstream_id": "<uuid>",
"task_id": "<uuid>"
}'
```
Omit `workstream_id` / `task_id` when not applicable.
### Update task status
```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
-H "Content-Type: application/json" \
-d '{"status": "progress"}'
# values: wait | todo | progress | done | cancel
```
### Flag a task for human review
```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
-H "Content-Type: application/json" \
-d '{"needs_human": true, "intervention_note": "reason"}'
```
---
## Session Protocol
**Start:**
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
2. Check inbox: `GET /messages/?to_agent=evidence-source&unread_only=true`; mark read
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
4. Check human-needed tasks: `GET /tasks/?needs_human=true`
**During work:**
- Update task statuses in workplan files as tasks progress
- Record significant decisions via `POST /decisions/`
**Close:**
1. Update workplan file task statuses to reflect progress
2. Log: `POST /progress/` with a summary of what changed
3. Note for the custodian operator: after workplan file changes, run from
`~/state-hub`:
```bash
make fix-consistency REPO=evidence-source
```
This syncs task status from files into the hub DB.
---
## Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=evidence-source` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
<!-- REPO-AGENTS-EXTENSIONS -->
<!-- Append repo-specific agent instructions below this marker.
The state-hub template sync preserves content after this line. -->
---
## Workplan Convention (ADR-001)
Work items originate as files in this repo — not in the hub. The hub is a
read/cache/index layer that rebuilds from files.
**File location:** `workplans/ESRC-WP-NNNN-<slug>.md`
**Archived location:** finished workplans may move to
`workplans/archived/YYMMDD-ESRC-WP-NNNN-<slug>.md`. The `YYMMDD` prefix is
the completion/archive date; the frontmatter `id` does not change.
**Ad Hoc Tasks:** small opportunistic fixes discovered during a session use
`workplans/ADHOC-YYYY-MM-DD.md` with task ids `ADHOC-YYYY-MM-DD-T01`, etc. Use
this only for low-risk work completed directly; create a normal workplan for
anything needing analysis, design, approval, dependencies, or multiple phases.
**Frontmatter:**
```yaml
---
id: ESRC-WP-NNNN
type: workplan
title: "..."
domain: infotech
repo: evidence-source
status: proposed | ready | active | blocked | backlog | finished | archived
owner: codex
topic_slug: ...
created: "YYYY-MM-DD"
updated: "YYYY-MM-DD"
state_hub_workstream_id: "<uuid>" # written by fix-consistency — do not edit
---
```
Use `proposed` for a new draft, `ready` after review against current repo
state, and `finished` after implementation. `stalled` and `needs_review` are
derived health labels, not frontmatter statuses.
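A lightweight check of the frontmatter status could be sketched as follows. The helper names are hypothetical, and the parser handles only flat `key: value` lines. A real YAML parser would be the safer choice for anything nested.

```python
STATUSES = {"proposed", "ready", "active", "blocked", "backlog", "finished", "archived"}


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` frontmatter between the first two `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening frontmatter delimiter")
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            # Drop a trailing `# comment` and surrounding double quotes.
            value = value.split(" #", 1)[0].strip().strip('"')
            fields[key.strip()] = value
    raise ValueError("missing closing frontmatter delimiter")


def check_status(fields: dict[str, str]) -> None:
    """Raise if the stored status is not one of the canonical values."""
    if fields.get("status") not in STATUSES:
        raise ValueError(f"status must be one of {sorted(STATUSES)}")
```

Derived labels such as `stalled` fail this check on purpose, because they are not stored statuses.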
**Task block format** (one per `##` section):
````
## Task Title
```task
id: ESRC-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
```
Task description text.
````
Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.
To create a new workplan:
1. Write the file following the format above
2. Notify the custodian operator to run `make fix-consistency REPO=evidence-source`
(or send a message to the hub agent via `POST /messages/`)

CLAUDE.md Normal file

@@ -0,0 +1,12 @@
# evidence-source — Claude Code Instructions
@SCOPE.md
@.claude/rules/repo-identity.md
@.claude/rules/session-protocol.md
@.claude/rules/first-session.md
@.claude/rules/workplan-convention.md
@.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md
@.claude/rules/repo-boundary.md
@.claude/rules/credential-routing.md
@.claude/rules/agents.md

INTENT.md Normal file

@@ -0,0 +1,492 @@
# INTENT
## Purpose
This repository exists to provide the document source, ingestion, extraction, metadata, and citation recovery layer for the **citation-evidence** ecosystem.
**evidence-source** turns raw documents and source clues into usable, searchable, addressable document representations that can support annotations, evidence items, citation recovery, and source-backed workflows.
It is responsible for answering the source-side questions:
> What is this document?
> How can we extract usable text and structure from it?
> How can we find or recover a cited source passage?
---
## Primary Utility
The repository provides the source pipeline for citation-evidence.
It should make it possible to:
- import documents into a collection or workspace,
- identify document type and media type,
- compute stable document fingerprints,
- extract document metadata,
- extract canonical text,
- create document representations for PDFs, Markdown, HTML, and later other formats,
- build maps between text, pages, sections, and rendered views,
- support local full-text search,
- support source lookup and citation recovery,
- provide the document representations needed by **evidence-anchor** and **citation-work**.
This repository turns documents into evidence-ready sources.
---
## Intended Users
Primary users of this repository are developers and agents implementing source handling for citation-evidence.
They include:
- developers building document import workflows,
- developers building review collections,
- developers implementing PDF, Markdown, and HTML source handling,
- developers implementing citation recovery,
- developers integrating local or external source libraries,
- coding agents that need structured access to document text and metadata.
End users should experience this repository indirectly whenever they add a document, search source text, or recover a citation.
---
## Strategic Role
The strategic role of **evidence-source** is to make source documents usable as reliable evidence substrates.
Without this repository, the system would depend on whatever a viewer happens to show at runtime. That would make citation capture, re-opening, search, and recovery fragile.
**evidence-source** creates the normalized source representations that allow the rest of the system to operate consistently across document formats.
It enables the flow:
```text
Raw Source
→ Document Identity
→ Metadata
→ Canonical Text
→ Document Representation
→ Searchable Source
→ Anchorable Evidence Context
```
---
## Core Concept
The core concept of this repository is the **document representation**.
A document representation is a normalized, searchable, addressable view of a source document.
For a PDF, a representation may include:
```text
document fingerprint
metadata
page count
page text
global canonical text
page-local offset map
text item map
page dimensions
source-to-rendering hints
```
For Markdown or HTML, a representation may include:
```text
canonical text
rendered HTML
sanitized content
heading map
section map
DOM or AST structure
offset-to-node map
source line map where available
```
These representations allow **evidence-anchor** to create and resolve selectors and allow **citation-work** to display and search documents efficiently.
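To make the idea of a page-local offset map concrete, here is a minimal sketch. The class name, field names, and the choice to join pages with a single newline are all assumptions. The actual `DocumentRepresentation` shape is a shared contract and is not defined in this document.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class PdfRepresentation:
    """Hypothetical shape: global canonical text plus a page offset map."""

    fingerprint: str
    page_texts: list[str]
    page_starts: list[int] = field(init=False)
    canonical_text: str = field(init=False)

    def __post_init__(self) -> None:
        # Join pages with a newline and record where each page starts.
        self.page_starts = []
        offset = 0
        for text in self.page_texts:
            self.page_starts.append(offset)
            offset += len(text) + 1  # +1 for the joining newline
        self.canonical_text = "\n".join(self.page_texts)

    def locate(self, global_offset: int) -> tuple[int, int]:
        """Map a global offset to (1-based page number, page-local offset)."""
        if not 0 <= global_offset < len(self.canonical_text):
            raise IndexError("offset outside canonical text")
        page_index = bisect_right(self.page_starts, global_offset) - 1
        return page_index + 1, global_offset - self.page_starts[page_index]
```

With pages `["abc", "de"]`, global offset 4 resolves to page 2, local offset 0. A joining newline is attributed to the page it follows.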
---
## Scope
This repository should own:
* document import workflows,
* document source identification,
* media type detection,
* document fingerprinting,
* source URI handling,
* metadata extraction,
* canonical text extraction,
* PDF text extraction,
* Markdown normalization,
* HTML normalization and sanitization,
* document representation generation,
* representation caching,
* local source search support,
* quote search support,
* citation clue parsing,
* local citation recovery,
* external source discovery hooks,
* recovery state tracking,
* privacy boundaries for source lookup.
It should provide the source-side capabilities consumed by:
* **citation-engine** for creating `Document` and `DocumentRepresentation` records,
* **evidence-anchor** for selector creation and resolution,
* **citation-work** for document review workflows,
* **evidence-binder** when evidence needs source context,
* **citation-evidence** for the integrated product experience.
---
## Out of Scope
This repository should not own the broader evidence domain or user workflows.
Specifically, it should not own:
* the canonical evidence domain model,
* persistence policy beyond source and representation storage contracts,
* low-level anchor resolution algorithms,
* visual highlight rendering,
* review workspace UI,
* form-field binding semantics,
* visual guide overlay behavior,
* citation card rendering,
* application shell and deployment,
* final human validation of evidence quality.
Those responsibilities belong to the appropriate citation-evidence subsystem repositories.
---
## Architectural Position
```text
citation-evidence
integrated product shell
citation-engine
core domain model, services, persistence contracts
evidence-source
document ingestion, extraction, metadata, representations, citation recovery
evidence-anchor
selectors, anchor resolution, re-anchoring, highlighting contracts
citation-work
review workspace and annotation UX
evidence-binder
evidence-to-target binding and active evidence state
```
**evidence-source** should provide document representations, not define what evidence means.
It should feed reliable source material into the rest of the system.
---
## Primary Workflows
### 1. Import Document
A user or system adds a source document.
```text
Add Source
→ Identify Media Type
→ Compute Fingerprint
→ Extract Metadata
→ Extract Text
→ Build Representation
→ Register Document
```
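The first two steps of this workflow could be sketched as below. The function name `identify` and the `sha256:` prefix are assumptions. This document does not fix a hash algorithm or a fingerprint format, and guessing the media type from the file name is only a first pass compared with content sniffing.

```python
import hashlib
import mimetypes


def identify(name: str, data: bytes) -> dict[str, str]:
    """Return a guessed media type and a content fingerprint for one source."""
    media_type, _encoding = mimetypes.guess_type(name)
    digest = hashlib.sha256(data).hexdigest()
    return {
        # Fall back to a generic binary type when the extension is unknown.
        "media_type": media_type or "application/octet-stream",
        # Content-based, so renaming the file does not change the fingerprint.
        "fingerprint": f"sha256:{digest}",
    }
```

Hashing the raw bytes keeps the fingerprint stable across renames and moves, which is what lets later steps reuse it as an identity and cache key.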
### 2. Generate PDF Representation
A PDF is converted into a representation suitable for review and anchoring.
```text
PDF Source
→ Load PDF
→ Extract Page Text
→ Normalize Text
→ Build Page Map
→ Build Offset Map
→ Store Representation
```
### 3. Generate Markdown / HTML Representation
A Markdown or HTML source is converted into a normalized rendered and searchable representation.
```text
Markdown / HTML Source
→ Parse / Sanitize
→ Render if needed
→ Extract Canonical Text
→ Build Heading / Section Map
→ Build Offset Map
→ Store Representation
```
### 4. Search Local Sources
A user or subsystem searches available source material.
```text
Search Query / Quote
→ Search Metadata
→ Search Full Text
→ Return Candidate Documents / Passages
```
### 5. Recover Citation
A user provides a citation, quote, or source clue.
```text
Citation Clue
→ Parse Source Metadata
→ Search Local Library
→ Optionally Search Configured External Sources
→ Load Candidate Source
→ Search Exact Quote
→ Search Fuzzy Quote
→ Present Candidate Passages
→ User Confirms
→ Create Source Context for Annotation
```
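The exact-then-fuzzy quote search in this workflow can be illustrated with the standard library. This is a sketch under stated assumptions: `find_quote` is a hypothetical name, the 0.8 threshold is arbitrary, and the sliding-window scan is quadratic, so it suits small documents only.

```python
from difflib import SequenceMatcher


def find_quote(text: str, quote: str, threshold: float = 0.8) -> list[tuple[int, float]]:
    """Return (offset, score) candidates: exact hits if any, else fuzzy windows."""
    if not quote:
        raise ValueError("quote must not be empty")
    exact = []
    start = text.find(quote)
    while start != -1:
        exact.append((start, 1.0))
        start = text.find(quote, start + 1)
    if exact:
        return exact
    # Fuzzy pass: slide a quote-sized window and score each position.
    candidates = []
    for offset in range(max(1, len(text) - len(quote) + 1)):
        window = text[offset:offset + len(quote)]
        score = SequenceMatcher(None, window, quote).ratio()
        if score >= threshold:
            candidates.append((offset, score))
    # Best-scoring candidates first, ready to present for confirmation.
    return sorted(candidates, key=lambda c: c[1], reverse=True)
```

The function returns candidates and never picks a winner, which matches the workflow's step where the user confirms the passage.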
---
## Initial Source Types
The first version should support or prepare for:
```text
PDF
Markdown
HTML
plain text
remote URL references
```
Later versions may support:
```text
DOCX
EPUB
scanned image documents
OCR-derived text
IIIF resources
TEI XML
structured datasets with source passages
```
---
## Citation Recovery States
Citation recovery should be modeled explicitly.
Initial recovery states may include:
```text
created
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
candidate-passages-found
manual-confirmation-needed
confirmed
annotation-created
failed
```
The system should distinguish between finding a source and finding the exact cited passage.
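One way to model these states explicitly is an enum whose values match the list above. The class name `RecoveryState` and the helper `source_located` are hypothetical. The authoritative state enum is a shared contract maintained elsewhere.

```python
from enum import Enum


class RecoveryState(str, Enum):
    CREATED = "created"
    SOURCE_FOUND_FULLTEXT = "source-found-fulltext"
    SOURCE_FOUND_PREVIEW_ONLY = "source-found-preview-only"
    SOURCE_FOUND_METADATA_ONLY = "source-found-metadata-only"
    SOURCE_NOT_FOUND = "source-not-found"
    QUOTE_FOUND = "quote-found"
    QUOTE_NOT_FOUND = "quote-not-found"
    CANDIDATE_PASSAGES_FOUND = "candidate-passages-found"
    MANUAL_CONFIRMATION_NEEDED = "manual-confirmation-needed"
    CONFIRMED = "confirmed"
    ANNOTATION_CREATED = "annotation-created"
    FAILED = "failed"


def source_located(state: RecoveryState) -> bool:
    """True only for states that say a source was found, not a passage."""
    return state.value.startswith("source-found-")
```

Keeping `source_located` separate from any passage-level check mirrors the distinction between finding a source and finding the cited passage.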
---
## Privacy and Source Lookup Principles
Source lookup can create privacy risks.
The repository should follow these principles:
* search local sources first,
* make external lookup explicit and configurable,
* avoid sending private document text to external services by default,
* record which external services were queried,
* distinguish public metadata lookup from full-text upload,
* allow deployments to disable external lookup completely,
* prefer deterministic local processing where possible.
External source discovery should be an extension point, not an unavoidable default behavior.
---
## Design Principles
### Source Identity First
Every imported document should receive a stable identity based on available metadata, source URI, and fingerprint.
### Canonical Text Matters
Anchoring and search depend on canonical text. The repository should make text normalization explicit and repeatable.
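A repeatable normalization step might look like the sketch below. The specific rules (NFKC, unified line endings, collapsed horizontal whitespace) are assumptions chosen for illustration. The canonical rules are part of the shared contracts and are not spelled out here.

```python
import re
import unicodedata


def canonicalize(text: str) -> str:
    """Normalize text so the same input always yields the same output."""
    text = unicodedata.normalize("NFKC", text)  # fold ligatures and width variants
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = re.sub(r"[ \t\f\v]+", " ", text)  # collapse runs of horizontal whitespace
    text = re.sub(r" ?\n ?", "\n", text)  # trim spaces around line breaks
    return text.strip()
```

The function is idempotent for the inputs sketched here: applying it to already canonical text returns the text unchanged, which is what lets offsets computed against it stay valid.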
### Representation Is Not Source
The original source and generated representation are different things.
The system should preserve this distinction.
### Local Before External
Citation recovery should search local documents before looking elsewhere.
### Human Confirmation
Recovered citations should not silently become confirmed evidence. Candidate matches should be presented for confirmation when uncertainty exists.
### Format-Aware, Model-Neutral
The repository should understand document formats but should not own the broader evidence model.
### Cache Expensive Work
Text extraction, fingerprinting, and representation generation should be cacheable by source fingerprint and version.
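A minimal in-memory version of this caching rule is sketched below. Reading "version" as the extractor version is an assumption, and `cached_extract` is a hypothetical helper. A real cache would persist results and bound its size.

```python
from typing import Callable

# Maps (source fingerprint, extractor version) to extracted text.
_cache: dict[tuple[str, str], str] = {}


def cached_extract(fingerprint: str, version: str, extract: Callable[[], str]) -> str:
    """Run `extract` once per (fingerprint, version) pair and reuse the result."""
    key = (fingerprint, version)
    if key not in _cache:
        _cache[key] = extract()
    return _cache[key]
```

Including the version in the key means a changed extractor recomputes its output instead of serving stale text.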
### Agent-Friendly Output
Extracted metadata, representations, and recovery candidates should be structured enough for agents to inspect, rank, and explain.
---
## Expected Dependencies
This repository is expected to depend on shared types and service contracts from:
```text
citation-engine
Document, DocumentRepresentation, CitationRecoveryAttempt, source-related contracts
```
It may be consumed by:
```text
citation-work
to load reviewable documents and document representations
evidence-anchor
to resolve selectors against extracted representations
evidence-binder
to retrieve source context for linked evidence
citation-evidence
to provide integrated import and recovery workflows
```
It should avoid depending on review UI or form-binding implementation details.
---
## First Useful Version
A first useful version of **evidence-source** should provide:
* source import interface,
* media type detection,
* document fingerprinting,
* basic metadata extraction,
* PDF text extraction,
* Markdown text extraction,
* HTML sanitization and text extraction,
* canonical text normalization,
* document representation generation,
* simple local quote search,
* recovery attempt model or contract,
* examples showing how a document becomes a representation usable by **evidence-anchor**.
The first version does not need full external source discovery or OCR, but it should establish the ingestion and representation pattern.
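For the "simple local quote search" item, a first pass might look like the sketch below: it finds a quote in canonical text while tolerating differences in whitespace and letter case, and returns character offsets. `find_quote` is a hypothetical name and the tolerance rules are assumptions; fuzzy matching for damaged quotes is deliberately left out.

```python
import re
from typing import List, Tuple


def find_quote(canonical_text: str, quote: str) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of every occurrence of `quote`.

    Words must appear in order; any run of whitespace between them
    matches, and letter case is ignored.
    """
    words = quote.split()
    if not words:
        return []
    # Escape each word so punctuation in the quote is matched literally.
    pattern = r"\s+".join(re.escape(word) for word in words)
    return [m.span() for m in re.finditer(pattern, canonical_text, re.IGNORECASE)]
```

Returning offsets rather than matched strings keeps the result usable as input for anchoring, since a caller can slice the canonical text or map the offsets onto a page.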
---
## Success Criteria
The repository is successful when another subsystem can use it to:
1. import a source document,
2. identify and fingerprint it,
3. extract useful metadata,
4. generate canonical text,
5. generate a document representation,
6. search the source text,
7. provide representation data to **evidence-anchor**,
8. support a local citation recovery attempt from a quote or citation clue.
A developer or coding agent should be able to understand from this repository how raw documents become evidence-ready sources.
---
## Repository Character
This repository should be:
* source-focused,
* ingestion-oriented,
* privacy-conscious,
* format-aware,
* representation-centered,
* cache-friendly,
* suitable for local-first and server-side use,
* explicit about uncertainty in citation recovery,
* careful not to absorb review or binding responsibilities.
---
## MVP Coordination — Code Lives Upstream
During the umbrella-first MVP phase (decided 2026-05-24), **the source code
for this subsystem does not live in this repository yet**. It lives in the
umbrella repo at `citation-evidence/src/source/`.
This INTENT.md documents the *intended* responsibilities and boundaries.
When the ingestion and representation interfaces have stabilized through
actual MVP use, the corresponding code will be extracted into this repository.
**Shared contracts** (Document and DocumentRepresentation shapes,
CitationRecoveryAttempt state enum, canonical text normalization, allowed
dependency edges) are maintained in the umbrella repo:
* `citation-evidence/wiki/SharedContracts.md`
* `citation-evidence/wiki/DependencyMap.md`
* `citation-evidence/docs/decisions/` (ADRs)
This subsystem's eventual code must not contradict those documents. Changes
to shared contracts happen in the umbrella, not here.
Under the dependency map, **`evidence-source` may depend only on
`citation-engine`** — not on `evidence-anchor`. When ingestion needs to know
"could a selector resolve here?", the answer travels through events, not
direct calls.
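The "events, not direct calls" rule can be pictured with a tiny in-process publish/subscribe sketch. Everything here is hypothetical: the `EventBus` class, the topic name `representation.generated`, and the payload fields are invented for illustration, and the sketch assumes the bus itself would live in a shared layer that both sides are allowed to depend on.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Payload = Dict[str, str]
Handler = Callable[[Payload], None]


class EventBus:
    """Minimal synchronous publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, payload: Payload) -> None:
        # Copy the list so handlers may subscribe during delivery.
        for handler in list(self._handlers[topic]):
            handler(payload)


# Source side: announces a new representation, knows nothing about anchors.
def announce_representation(bus: EventBus, document_id: str) -> None:
    bus.publish("representation.generated", {"document_id": document_id})


# Anchor side: reacts to the event; the source code never imports this.
def make_anchor_listener(seen: List[str]) -> Handler:
    def on_representation(payload: Payload) -> None:
        seen.append(payload["document_id"])
    return on_representation
```

With this shape the source side only ever imports the bus, so the dependency edge from `evidence-source` to `evidence-anchor` never appears in code.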
---
## Guiding Statement
**evidence-source exists to turn documents and citation clues into reliable, searchable, anchorable source context.**

@@ -1,3 +1,16 @@
# repo-seed
# evidence-source
A git repository template to bootstrap coulomb projects from.
Document source, ingestion, extraction, metadata, and citation recovery —
PDF/HTML/MD ingest, fingerprinting, page-/offset-map construction,
canonical-text extraction, and the recovery behavior for stale selectors.
## MVP status: INTENT only
During the citation-evidence MVP, code lives upstream in
[`citation-evidence`](../citation-evidence/) under `src/source/`. This repo
currently holds `INTENT.md` describing what will move here. Contract
changes belong in
[`citation-evidence/wiki/SharedContracts.md`](../citation-evidence/wiki/SharedContracts.md),
not here.
Per the dependency map, source depends on `shared/` and `engine/` only.

SCOPE.md

@@ -0,0 +1,137 @@
# SCOPE
> This file helps you quickly understand what this repository is about,
> when it is relevant, and when it is not.
> It is intentionally lightweight and may be incomplete.
---
## One-liner
<!-- Describe the purpose of this repository in one precise sentence. -->
<!-- Example: "Provides a lightweight event router for Kubernetes-native systems." -->
---
## Core Idea
<!-- What is the main capability or idea behind this repository? -->
<!-- What problem does it try to solve? -->
---
## In Scope
<!-- What this repository is responsible for. -->
<!-- Be explicit and concrete. -->
-
-
-
---
## Out of Scope
<!-- What this repository deliberately does NOT do. -->
<!-- This is often more important than "In Scope". -->
-
-
-
---
## Relevant When
<!-- When should someone consider using or exploring this repository? -->
-
-
-
---
## Not Relevant When
<!-- When should someone ignore this repository? -->
-
-
-
---
## Current State
<!-- Rough indication of maturity. No strict format required. -->
- Status: <!-- e.g. concept / experimental / active / stable / deprecated -->
- Implementation: <!-- e.g. idea / partial / substantial / complete -->
- Stability: <!-- e.g. unstable / evolving / stable -->
- Usage: <!-- e.g. none / personal / internal / production -->
<!-- Add any notes that help set expectations. -->
---
## How It Fits
<!-- Where does this repository sit in the bigger picture? -->
- Upstream dependencies:
- Downstream consumers:
- Often used with:
---
## Terminology
<!-- Terms that are important to understand this repo. -->
<!-- Especially useful if naming differs from other repos. -->
- Preferred terms:
- Also known as:
- Potentially confusing terms:
---
## Related / Overlapping Repositories
<!-- List repositories that have similar or adjacent responsibilities. -->
<!-- Helps detect duplication and navigate the ecosystem. -->
- <repo-name> — <!-- how it relates -->
---
## Getting Oriented
<!-- If someone decides to look deeper, where should they start? -->
- Start with:
- Key files / directories:
- Entry points:
---
## Provided Capabilities
<!-- What can this repo's domain provide to other domains on request? -->
<!-- Each capability block is parsed by the state-hub capability catalog ingest. -->
<!-- Remove the examples and add your own, or leave empty if none. -->
<!--
```capability
type: infrastructure
title: Example capability title
description: What this capability provides, in one or two sentences.
keywords: [keyword1, keyword2, keyword3]
```
-->
---
## Notes
<!-- Anything else worth knowing. Keep it short. -->

registry/README.md

@@ -0,0 +1,12 @@
# Capability Registry
Markdown-first capability index for federation and reuse planning.
## Authoring
1. Copy a capability entry template (see reuse-surface `templates/capability-entry.template.md`).
2. Add the row to `indexes/capabilities.yaml`.
3. Run `reuse-surface validate` from a checkout with the CLI installed.
4. Merge to `main` and verify publish with `reuse-surface establish --publish-check`.
Federation contract: reuse-surface `docs/RegistryFederation.md`.


@@ -0,0 +1,4 @@
version: 1
updated: '2026-06-16'
domain: helix_forge
capabilities: []

@@ -0,0 +1,19 @@
---
id: ESRC-WP-0001
type: workplan
title: "INTENT placeholder — await extraction from citation-evidence"
domain: infotech
repo: evidence-source
status: backlog
owner: codex
topic_slug: citation_evidence_mvp
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "64771b5d-4b83-4848-a562-4b00aad017b2"
---
# ESRC-WP-0001 — INTENT Placeholder
Umbrella-first MVP: source/ingestion code will be extracted from `citation-evidence`
when the subsystem boundary stabilizes. This file satisfies the ADR-001 workplan
structure until then. See `INTENT.md`.