generated from coulomb/repo-seed
feat(source): add pdf read adapter
45
docs/pdf-adapter.md
Normal file
@@ -0,0 +1,45 @@
# PDF Adapter

`source.pdf` is a read-only Markitect source adapter for local,
digitally-readable PDF files.

## Dependency Policy

The first implementation is stdlib-only. The `pdf` optional dependency extra is
present so a richer pure-Python backend can be added later without changing the
adapter boundary or making PDF support mandatory for EPUB3 users.

The adapter does not use network access, external processes, OCR engines,
native system services, or renderer-specific tooling.

## Supported Inputs

- Local files with media type `application/pdf` or extension `.pdf`.
- PDFs with extractable text in page content streams.
- Plain and FlateDecode content streams for the first deterministic slice.
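
The stream support listed above can be sketched with the standard library alone, since FlateDecode data is zlib-compressed. This is a minimal illustration, not the adapter's actual code: `decode_content_stream` and `SUPPORTED_FILTERS` are hypothetical names, and the sketch assumes the stream's filter names have already been read from the PDF stream dictionary.

```python
from __future__ import annotations

import zlib

# Hypothetical constant: the only filter the first slice is assumed to decode.
SUPPORTED_FILTERS = {"FlateDecode"}


def decode_content_stream(raw: bytes, filters: list[str]) -> bytes:
    """Return the decoded bytes of a page content stream.

    An empty filter list means the stream is plain and is returned unchanged.
    Any filter other than FlateDecode is rejected rather than guessed at.
    """
    data = raw
    for name in filters:
        if name not in SUPPORTED_FILTERS:
            raise ValueError(f"unsupported stream filter: {name}")
        # FlateDecode payloads use the zlib container format.
        data = zlib.decompress(data)
    return data
```

Rejecting unknown filters explicitly keeps the result deterministic: a page either decodes fully or is reported as unsupported.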

## Deferred Inputs

- Scanned or image-only PDFs that require OCR.
- Encrypted or permission-restricted PDFs.
- Pixel-perfect layout reconstruction.
- Table, figure, annotation, form, signature, and attachment extraction.
- PDF writing/export.

## Options

- `page_range`: optional 1-based page range such as `1-3,5`.
- `include_page_breaks`: when true, prefixes each page segment with a Markdown
  page marker comment.
- `normalize_whitespace`: when true, collapses repeated horizontal whitespace
  while preserving extracted line breaks.
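
To make the `page_range` and `normalize_whitespace` semantics concrete, here is a small sketch of how they might behave. The function names, the error handling, and the choice to treat only spaces and tabs as horizontal whitespace are assumptions for illustration; the adapter's real parsing rules may differ.

```python
from __future__ import annotations

import re

# Assumption: "horizontal whitespace" means runs of spaces and tabs.
_HORIZONTAL_RUN = re.compile(r"[ \t]+")


def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a 1-based spec such as "1-3,5" into sorted page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        first, sep, last = part.strip().partition("-")
        start = int(first)  # raises ValueError on empty or non-numeric input
        end = int(last) if sep else start
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"page range out of bounds: {part.strip()!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


def normalize_whitespace(text: str) -> str:
    """Collapse each run of spaces or tabs to one space, keeping newlines."""
    return "\n".join(_HORIZONTAL_RUN.sub(" ", line) for line in text.split("\n"))
```

Under these assumptions, `parse_page_range("1-3,5", page_count=6)` yields pages 1, 2, 3, and 5, and line breaks in the extracted text pass through `normalize_whitespace` untouched.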

## Provenance And Quality

The adapter emits one segment per extracted page. Each segment carries
page-level `SourceProvenance` with the source path, source digest, page number,
and originating PDF page object id.

Quality metadata records the extraction backend, document page count, selected
pages, extracted page count, page coverage, skipped pages, warning count,
lossiness, and confidence.
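
The records described above could be modelled roughly as follows. This is a hedged sketch: the class and field names are hypothetical stand-ins (the real `SourceProvenance` type may be shaped differently), SHA-256 is an assumed digest algorithm, and page coverage is assumed to mean extracted pages divided by selected pages.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PageProvenance:
    """Hypothetical stand-in for page-level SourceProvenance."""

    source_path: str
    source_digest: str
    page_number: int      # 1-based page number
    page_object_id: int   # id of the originating PDF page object


@dataclass(frozen=True)
class ExtractionQuality:
    """Hypothetical container for the quality metadata fields."""

    backend: str
    document_page_count: int
    selected_pages: tuple[int, ...]
    extracted_page_count: int
    skipped_pages: tuple[int, ...]
    warning_count: int
    lossy: bool
    confidence: float

    @property
    def page_coverage(self) -> float:
        # Assumption: coverage is measured against the selected pages.
        if not self.selected_pages:
            return 0.0
        return self.extracted_page_count / len(self.selected_pages)


def digest_bytes(data: bytes) -> str:
    """Return a digest string for the source file bytes (SHA-256 assumed)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()
```

Deriving coverage from the other fields, rather than storing it, avoids the two values drifting apart when pages are skipped.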