---
id: MKTF-WP-0002
type: workplan
title: "PDF Read Adapter"
domain: markitect
status: done
owner: markitect-filter
topic_slug: markitect
planning_priority: complete
planning_order: 20
depends_on_workplans:
  - MKTF-WP-0001
related_workplans:
  - MKTT-WP-0018
created: "2026-05-14"
updated: "2026-05-14"
state_hub_workstream_id: "7445fe6b-f1a9-4383-8053-4337337dc095"
---

# MKTF-WP-0002: PDF Read Adapter

## Purpose

Implement the second concrete `markitect-filter` source adapter: `source.pdf`, a read-only PDF adapter that satisfies the `markitect-tool` source adapter contract. The contract dependency is cross-repo and is tracked as related work rather than a same-repo State Hub dependency edge: `markitect-tool` `MKTT-WP-0018`.

The first PDF slice should target deterministic text extraction from digitally-readable PDFs. It should preserve page-level provenance and make extraction uncertainty visible through diagnostics and quality signals.

## Implemented Scope

- Optional PDF dependency profile isolated behind a `pdf` extra.
- Entry point group registration: `markitect_tool.source_adapters`.
- Lightweight `pdf_adapter_descriptor`.
- Adapter id `source.pdf` with media type `application/pdf` and extension `.pdf`.
- Inspection for basic PDF metadata, page count, encryption status, and extractability signals.
- Read-only page text extraction into ordered Markdown segments.
- Page-aware source provenance with source paths, page numbers, page labels where available, and stable segment ids.
- Configurable first-slice options such as page range, page break markers, and whitespace normalization policy.
- Structured diagnostics for malformed PDFs, encrypted PDFs, image-only or scanned pages, empty extraction, partial page failures, unsupported embedded media, and lossy layout/table handling.
- Quality metadata for confidence, lossiness, skipped pages, warning counts, extraction backend, and page coverage.
- Tests for descriptor shape, matching, inspection, normalization, malformed inputs, encrypted or non-extractable inputs where fixtures allow, Markitect API registry use, and entry point shape.

## Non-Goals

- OCR or scanned-document recognition.
- Pixel-perfect layout preservation.
- Table reconstruction beyond plain text and diagnostics.
- Image, figure, annotation, form, signature, or attachment extraction beyond future metadata/diagnostic hooks.
- PDF writing/export.
- Network fetching.
- External processes or native system services in the first slice.
- Making PDF dependencies mandatory for EPUB3 or other adapters.

## P2.1 - Pin PDF v1 dependency and extraction policy

```task
id: MKTF-WP-0002-T001
status: done
priority: high
state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"
```

Choose the first PDF extraction backend and dependency profile. The decision should document:

- pure-Python preference for the first slice
- optional dependency placement under the `pdf` extra
- supported inputs: local, digitally-readable PDFs
- unsupported inputs: scanned/image-only PDFs without OCR
- encrypted/permission-restricted PDF behavior
- how page range, page breaks, and whitespace normalization should behave
- fallback or future status for heavier layout/OCR backends

Output: dependency decision, option contract, and implementation notes.

Implemented: `docs/pdf-adapter.md`, `pyproject.toml`, and the descriptor metadata document a stdlib first slice, a reserved `pdf` extra, local digitally-readable PDF support, page range/page marker/whitespace options, and deferred OCR/layout-heavy backends.

## P2.2 - Add descriptor and entry point registration

```task
id: MKTF-WP-0002-T002
status: done
priority: high
state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"
```

Add a `pdf_adapter_descriptor` matching the existing EPUB3 descriptor pattern.
The descriptor should define:

- adapter id `source.pdf`
- version `1`
- media type `application/pdf`
- extension `.pdf`
- read operation only
- safety metadata with local reads only
- option schema for page range, page breaks, and whitespace normalization
- quality profile and dependency metadata
- lazy factory import for the PDF adapter implementation

Output: descriptor, entry point registration, and descriptor tests.

Implemented: `pdf_adapter_descriptor` is registered through `markitect_tool.source_adapters`, exported from the package, and covered by descriptor and discovery tests.

## P2.3 - Implement PDF inspection

```task
id: MKTF-WP-0002-T003
status: done
priority: high
state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"
```

Implement `inspect` for PDF assets. Inspection should report:

- title, creators/authors, subject, keywords, producer, creation/modification dates where available
- page count
- encryption or permission status
- basic extractability signals
- diagnostics for malformed, unreadable, encrypted, or unsupported PDFs

Output: inspection implementation and tests with small fixtures.

Implemented: `PdfReadAdapter.inspect` reports metadata, page count, extractability signals, encryption status, quality metadata, and malformed or encrypted diagnostics using deterministic generated fixtures.

## P2.4 - Normalize page text into Markitect Markdown

```task
id: MKTF-WP-0002-T004
status: done
priority: high
state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"
```

Implement `read` for digitally-readable PDFs. Normalization should:

- iterate pages in deterministic order
- apply page range filtering
- convert extracted text into Markdown-safe segment text
- create one or more ordered segments with stable segment ids
- preserve page-level provenance on every segment
- optionally insert page break markers
- produce a stable document id and cache key through the Markitect source contract helpers

Output: read implementation and normalization tests.

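The page range filtering, page markers, and stable segment ids listed for `read` can be illustrated with a minimal sketch. All names here (`parse_page_range`, `segment_id`, `build_segments`), the `"1-3,5"` range syntax, the id format, and the HTML-comment page marker are assumptions made for illustration; they are not taken from `PdfReadAdapter` or the Markitect source contract helpers.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a 1-based range such as "1-3,5" into sorted page numbers.

    The range syntax is a hypothetical choice for this sketch.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"invalid page range {part!r} for {page_count} pages")
        pages.update(range(start, end + 1))
    return sorted(pages)


def segment_id(document_id: str, page_number: int) -> str:
    """Derive a segment id that depends only on document id and page number."""
    return f"{document_id}:p{page_number:04d}"


def build_segments(document_id, page_texts, page_range=None, page_markers=False):
    """Turn per-page text into ordered segments with page provenance."""
    if page_range:
        selected = parse_page_range(page_range, len(page_texts))
    else:
        selected = list(range(1, len(page_texts) + 1))
    segments = []
    for number in selected:
        text = page_texts[number - 1].strip()
        if page_markers:
            # Hypothetical marker format; the real adapter may use another.
            text = f"<!-- page {number} -->\n\n{text}"
        segments.append(
            {"id": segment_id(document_id, number), "page": number, "text": text}
        )
    return segments
```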
Implemented: `PdfReadAdapter.read` extracts ordered page text into stable page segments, applies page ranges, supports optional page markers, preserves page provenance, and uses the Markitect cache-key helpers.

## P2.5 - Add diagnostics and quality semantics

```task
id: MKTF-WP-0002-T005
status: done
priority: high
state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"
```

Define PDF-specific diagnostics and quality metadata. The adapter should distinguish:

- malformed PDF
- encrypted or permission-restricted PDF
- no extractable text
- partially failed pages
- scanned/image-only pages
- dropped layout, tables, figures, annotations, or forms
- unsupported embedded resources

Quality should include extraction backend, page coverage, warning count, skipped pages, lossiness, and confidence.

Output: diagnostic helpers, quality rules, and tests.

Implemented: PDF diagnostics cover malformed files, unreadable files, encrypted PDFs, invalid page ranges, missing/empty streams, image-only pages, empty extraction, and stream decompression failures. Quality metadata records backend, page count, selected pages, extracted pages, coverage, warnings, and skipped pages.

## P2.6 - Add fixtures, docs, and validation

```task
id: MKTF-WP-0002-T006
status: done
priority: medium
state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"
```

Add small deterministic PDF fixtures and documentation. Validation should cover:

- descriptor shape
- media type and extension matching
- metadata inspection
- page text normalization
- malformed or empty extraction behavior
- registry and entry point shape
- `markitect-tool` API use through `inspect_source` and `normalize_source`

Output: tests, README update, and validation command.

Implemented: generated PDF fixtures and tests cover descriptor shape, matching, metadata inspection, normalization, page range markers, malformed PDFs, encrypted PDFs, registry use, entry point discovery, README documentation, and the validation command below.

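The quality fields recorded under P2.5 (backend, page count, selected pages, extracted pages, coverage, warnings, skipped pages) could be derived from two page lists, as in the minimal sketch below. The `PdfQuality` dataclass, its field names, and the definition of coverage as extracted pages over selected pages are assumptions for illustration, not the adapter's actual quality rules.

```python
from dataclasses import dataclass, field


@dataclass
class PdfQuality:
    """Hypothetical container for the quality signals named in P2.5."""

    backend: str
    page_count: int
    selected_pages: list[int]
    extracted_pages: list[int]
    warnings: list[str] = field(default_factory=list)

    @property
    def skipped_pages(self) -> list[int]:
        # Selected pages that produced no extracted text, in selection order.
        extracted = set(self.extracted_pages)
        return [page for page in self.selected_pages if page not in extracted]

    @property
    def coverage(self) -> float:
        # Assumed definition: fraction of selected pages that were extracted.
        if not self.selected_pages:
            return 0.0
        hit = set(self.extracted_pages) & set(self.selected_pages)
        return len(hit) / len(self.selected_pages)

    def to_metadata(self) -> dict:
        return {
            "backend": self.backend,
            "page_count": self.page_count,
            "selected_pages": list(self.selected_pages),
            "extracted_pages": list(self.extracted_pages),
            "skipped_pages": self.skipped_pages,
            "coverage": self.coverage,
            "warning_count": len(self.warnings),
        }
```

Under these assumptions, a four-page selection where one image-only page yields no text reports that page as skipped and a coverage of 0.75, which keeps the lossiness visible to callers instead of hiding it behind a successful read.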
## Validation

Run from `markitect-filter`:

```bash
PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
```