feat(source): add pdf read adapter

2026-05-14 23:33:31 +02:00
parent 24ee499b50
commit 0c9a418e85
8 changed files with 1176 additions and 13 deletions

docs/pdf-adapter.md Normal file

@@ -0,0 +1,45 @@
# PDF Adapter
`source.pdf` is a read-only Markitect source adapter for local,
digitally readable PDF files.
## Dependency Policy
The first implementation is stdlib-only. The `pdf` optional dependency extra is
present so a richer pure-Python backend can be added later without changing the
adapter boundary or making PDF support mandatory for EPUB3 users.
The adapter does not use network access, external processes, OCR engines,
native system services, or renderer-specific tooling.
## Supported Inputs
- Local files with media type `application/pdf` or extension `.pdf`.
- PDFs with extractable text in page content streams.
- Unfiltered (plain) and FlateDecode-compressed content streams, which are the
only stream encodings handled in the first deterministic slice.
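The input rules above can be sketched with the standard library alone. This is a minimal illustration, not the adapter's code: the helper names `looks_like_pdf` and `decode_stream` are hypothetical, and the sketch assumes the stream filter name has already been read from the stream dictionary.

```python
import zlib
from pathlib import Path
from typing import Optional


def looks_like_pdf(path: str, media_type: Optional[str] = None) -> bool:
    """Accept a file by declared media type or by its `.pdf` extension."""
    if media_type == "application/pdf":
        return True
    return Path(path).suffix.lower() == ".pdf"


def decode_stream(raw: bytes, filter_name: Optional[str]) -> bytes:
    """Return the stream bytes for the two supported cases.

    Hypothetical helper: `filter_name` is None for an unfiltered stream
    and "FlateDecode" for a zlib-compressed one.
    """
    if filter_name is None:
        return raw
    if filter_name == "FlateDecode":
        return zlib.decompress(raw)
    # Anything else is outside the first slice in this sketch.
    raise ValueError(f"unsupported stream filter: {filter_name}")
```

Raising on an unknown filter is one possible design choice; an adapter could equally record a warning and skip the page.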
## Deferred Inputs
- Scanned or image-only PDFs that require OCR.
- Encrypted or permission-restricted PDFs.
- Pixel-perfect layout reconstruction.
- Table, figure, annotation, form, signature, and attachment extraction.
- PDF writing/export.
## Options
- `page_range`: optional 1-based page range such as `1-3,5`.
- `include_page_breaks`: when true, prefixes each page segment with a Markdown
page marker comment.
- `normalize_whitespace`: when true, collapses repeated horizontal whitespace
while preserving extracted line breaks.
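As a rough sketch of how the three options might behave, the following functions expand a page range, collapse horizontal whitespace, and prefix a page marker. The function names, the clamping of out-of-range pages, and the exact marker text `<!-- page N -->` are all assumptions made for illustration; the adapter's real behavior may differ.

```python
import re


def parse_page_range(spec: str, page_count: int) -> list:
    """Expand a 1-based spec such as "1-3,5" into sorted page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Assumption: pages beyond the document are dropped, not an error.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs; line breaks are left untouched."""
    return "\n".join(re.sub(r"[ \t]+", " ", line) for line in text.split("\n"))


def render_page(page_number: int, text: str, include_page_breaks: bool) -> str:
    """Optionally prefix a page segment with a Markdown comment marker."""
    # Assumption: the marker format here is illustrative only.
    marker = f"<!-- page {page_number} -->\n" if include_page_breaks else ""
    return marker + text
```

For example, `parse_page_range("1-3,5", 10)` selects pages 1, 2, 3, and 5.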
## Provenance And Quality
The adapter emits one segment per extracted page. Each segment carries
page-level `SourceProvenance` with the source path, source digest, page number,
and originating PDF page object id.
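The page-level record described above could look roughly like this. `PageProvenance` is a hypothetical stand-in for the page fields of `SourceProvenance`, whose real definition is not shown here, and the use of SHA-256 is an assumption because the digest algorithm is not named.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PageProvenance:
    """Hypothetical stand-in for the page-level fields of SourceProvenance."""
    source_path: str
    source_digest: str
    page_number: int    # 1-based page number
    pdf_object_id: int  # object number of the originating page object


def digest_bytes(data: bytes) -> str:
    # Assumption: SHA-256 hex digest with an algorithm prefix.
    return "sha256:" + hashlib.sha256(data).hexdigest()


def provenance_for_pages(source_path: str, data: bytes, pages) -> list:
    """Build one record per (page_number, object_id) pair.

    The digest is computed once over the whole file, so every page of
    the same source shares it.
    """
    digest = digest_bytes(data)
    return [
        PageProvenance(source_path, digest, page_number, object_id)
        for page_number, object_id in pages
    ]
```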
Quality metadata records the extraction backend, document page count, selected
pages, extracted page count, page coverage, skipped pages, warning count,
lossiness, and confidence.
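One way to picture the quality fields is the sketch below. The class and function names are hypothetical, and the formulas are assumptions chosen for illustration: coverage is taken as extracted pages over selected pages, extraction is flagged lossy when any page is skipped or any warning is raised, and confidence is simply set equal to coverage.

```python
from dataclasses import dataclass


@dataclass
class ExtractionQuality:
    """Hypothetical container mirroring the quality fields listed above."""
    backend: str
    document_page_count: int
    selected_pages: list
    extracted_page_count: int
    page_coverage: float
    skipped_pages: list
    warning_count: int
    lossy: bool
    confidence: float


def summarize_quality(backend, document_page_count, selected_pages,
                      extracted_pages, warnings) -> ExtractionQuality:
    extracted = set(extracted_pages)
    skipped = [p for p in selected_pages if p not in extracted]
    # Assumption: coverage = extracted pages / selected pages.
    coverage = len(extracted) / len(selected_pages) if selected_pages else 0.0
    return ExtractionQuality(
        backend=backend,
        document_page_count=document_page_count,
        selected_pages=list(selected_pages),
        extracted_page_count=len(extracted),
        page_coverage=coverage,
        skipped_pages=skipped,
        warning_count=len(warnings),
        # Assumption: any skipped page or warning marks the result lossy.
        lossy=bool(skipped or warnings),
        # Assumption: confidence tracks coverage in this sketch.
        confidence=coverage,
    )
```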