feat(source): add pdf read adapter

2026-05-14 23:33:31 +02:00
parent 24ee499b50
commit 0c9a418e85
8 changed files with 1176 additions and 13 deletions
--- a/docs/pdf-adapter.md
+++ b/docs/pdf-adapter.md
@@ -0,0 +1,45 @@
+# PDF Adapter
+
+`source.pdf` is a read-only Markitect source adapter for local,
+digitally readable PDF files.
+
+## Dependency Policy
+
+The first implementation is stdlib-only. The `pdf` optional dependency extra is
+present so a richer pure-Python backend can be added later without changing the
+adapter boundary or making PDF support mandatory for EPUB3 users.
+
+The adapter does not use network access, external processes, OCR engines,
+native system services, or renderer-specific tooling.
+
+## Supported Inputs
+
+- Local files with media type `application/pdf` or extension `.pdf`.
+- PDFs with extractable text in page content streams.
+- Unfiltered and FlateDecode content streams for the first deterministic slice.
+
+## Deferred Inputs
+
+- Scanned or image-only PDFs that require OCR.
+- Encrypted or permission-restricted PDFs.
+- Pixel-perfect layout reconstruction.
+- Table, figure, annotation, form, signature, and attachment extraction.
+- PDF writing/export.
+
+## Options
+
+- `page_range`: optional 1-based page selection such as `1-3,5` (pages 1 to 3 and 5).
+- `include_page_breaks`: when true, prefixes each page segment with a Markdown
+  page marker comment.
+- `normalize_whitespace`: when true, collapses repeated horizontal whitespace
+  while preserving extracted line breaks.
+
+## Provenance And Quality
+
+The adapter emits one segment per extracted page. Each segment carries
+page-level `SourceProvenance` with the source path, source digest, page number,
+and originating PDF page object id.
+
+Quality metadata records the extraction backend, document page count, selected
+pages, extracted page count, page coverage, skipped pages, warning count,
+lossiness, and confidence.
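
Note on the options and quality fields documented above: the sketch below shows one way `page_range`, `normalize_whitespace`, and the page coverage figure could behave. It is not the adapter's code. The function names, the choice to clamp ranges to the document's page count, and the definition of coverage as extracted pages divided by selected pages are all assumptions made for illustration.

```python
import re


def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a 1-based spec such as "1-3,5" into sorted page numbers.

    Hypothetical helper; the real adapter may validate differently.
    """
    selected: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, sep, last = part.partition("-")
        start = int(first)  # raises ValueError on malformed input
        end = int(last) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: pages past the end of the document are dropped silently.
        selected.update(range(start, min(end, page_count) + 1))
    return sorted(selected)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs; line breaks are left untouched."""
    return re.sub(r"[ \t]+", " ", text)


def summarize_quality(selected: list[int], extracted: list[int]) -> dict:
    """Derive page-level quality fields from selected and extracted pages.

    Assumption: coverage is extracted pages divided by selected pages.
    """
    selected_set = set(selected)
    extracted_set = selected_set & set(extracted)
    coverage = len(extracted_set) / len(selected_set) if selected_set else 0.0
    return {
        "selected_pages": sorted(selected_set),
        "extracted_page_count": len(extracted_set),
        "skipped_pages": sorted(selected_set - extracted_set),
        "page_coverage": coverage,
    }
```

For example, `parse_page_range("1-3,5", 10)` selects pages 1, 2, 3, and 5. If page 3 then yields no text, `summarize_quality` reports it under `skipped_pages` and the coverage drops to 0.75. Whether the adapter rejects out-of-range pages instead of clamping them is not stated in the doc and is worth confirming against the implementation.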