---
id: MKTF-WP-0002
type: workplan
title: "PDF Read Adapter"
domain: markitect
status: todo
owner: markitect-filter
topic_slug: markitect
planning_priority: P1
planning_order: 20
depends_on_workplans:
  - MKTF-WP-0001
related_workplans:
  - MKTT-WP-0018
created: "2026-05-14"
updated: "2026-05-14"
state_hub_workstream_id: "7445fe6b-f1a9-4383-8053-4337337dc095"
---

# MKTF-WP-0002: PDF Read Adapter

## Purpose

Implement the second concrete `markitect-filter` source adapter: `source.pdf`, a
read-only PDF adapter that satisfies the `markitect-tool` source adapter
contract.

The contract dependency is cross-repo and is tracked as related work rather
than a same-repo State Hub dependency edge: `markitect-tool` `MKTT-WP-0018`.

The first PDF slice should target deterministic text extraction from
digitally-readable PDFs. It should preserve page-level provenance and make
extraction uncertainty visible through diagnostics and quality signals.

## Planned Scope

- Optional PDF dependency profile isolated behind a `pdf` extra.
- Entry point group registration: `markitect_tool.source_adapters`.
- Lightweight `pdf_adapter_descriptor`.
- Adapter id `source.pdf` with media type `application/pdf` and extension
  `.pdf`.
- Inspection for basic PDF metadata, page count, encryption status, and
  extractability signals.
- Read-only page text extraction into ordered Markdown segments.
- Page-aware source provenance with source paths, page numbers, page labels
  where available, and stable segment ids.
- Configurable first-slice options such as page range, page break markers, and
  whitespace normalization policy.
- Structured diagnostics for malformed PDFs, encrypted PDFs, image-only or
  scanned pages, empty extraction, partial page failures, unsupported embedded
  media, and lossy layout/table handling.
- Quality metadata for confidence, lossiness, skipped pages, warning counts,
  extraction backend, and page coverage.
- Tests for descriptor shape, matching, inspection, normalization, malformed
  inputs, encrypted or non-extractable inputs where fixtures allow, Markitect
  API registry use, and entry point shape.

## Non-Goals

- OCR or scanned-document recognition.
- Pixel-perfect layout preservation.
- Table reconstruction beyond plain text and diagnostics.
- Image, figure, annotation, form, signature, or attachment extraction beyond
  future metadata/diagnostic hooks.
- PDF writing/export.
- Network fetching.
- External processes or native system services in the first slice.
- Making PDF dependencies mandatory for EPUB3 or other adapters.

## P2.1 - Pin PDF v1 dependency and extraction policy

```task
id: MKTF-WP-0002-T001
status: todo
priority: high
state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"
```

Choose the first PDF extraction backend and dependency profile.

The decision should document:

- pure-Python preference for the first slice
- optional dependency placement under the `pdf` extra
- supported inputs: local, digitally-readable PDFs
- unsupported inputs: scanned/image-only PDFs without OCR
- encrypted/permission-restricted PDF behavior
- how page range, page breaks, and whitespace normalization should behave
- fallback or future status for heavier layout/OCR backends

Output: dependency decision, option contract, and implementation notes.

## P2.2 - Add descriptor and entry point registration

```task
id: MKTF-WP-0002-T002
status: todo
priority: high
state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"
```

Add a `pdf_adapter_descriptor` matching the existing EPUB3 descriptor pattern.

The descriptor should define:

- adapter id `source.pdf`
- version `1`
- media type `application/pdf`
- extension `.pdf`
- read operation only
- safety metadata with local reads only
- option schema for page range, page breaks, and whitespace normalization
- quality profile and dependency metadata
- lazy factory import for the PDF adapter implementation

Output: descriptor, entry point registration, and descriptor tests.

## P2.3 - Implement PDF inspection

```task
id: MKTF-WP-0002-T003
status: todo
priority: high
state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"
```

Implement `inspect` for PDF assets.

Inspection should report:

- title, creators/authors, subject, keywords, producer, creation/modification
  dates where available
- page count
- encryption or permission status
- basic extractability signals
- diagnostics for malformed, unreadable, encrypted, or unsupported PDFs

Output: inspection implementation and tests with small fixtures.

## P2.4 - Normalize page text into Markitect Markdown

```task
id: MKTF-WP-0002-T004
status: todo
priority: high
state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"
```

Implement `read` for digitally-readable PDFs.

Normalization should:

- iterate pages in deterministic order
- apply page range filtering
- convert extracted text into Markdown-safe segment text
- create one or more ordered segments with stable segment ids
- preserve page-level provenance on every segment
- optionally insert page break markers
- produce a stable document id and cache key through the Markitect source
  contract helpers

Output: read implementation and normalization tests.

## P2.5 - Add diagnostics and quality semantics

```task
id: MKTF-WP-0002-T005
status: todo
priority: high
state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"
```

Define PDF-specific diagnostics and quality metadata.

The adapter should distinguish:

- malformed PDF
- encrypted or permission-restricted PDF
- no extractable text
- partially failed pages
- scanned/image-only pages
- dropped layout, tables, figures, annotations, or forms
- unsupported embedded resources

Quality should include extraction backend, page coverage, warning count,
skipped pages, lossiness, and confidence.

Output: diagnostic helpers, quality rules, and tests.

## P2.6 - Add fixtures, docs, and validation

```task
id: MKTF-WP-0002-T006
status: todo
priority: medium
state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"
```

Add small deterministic PDF fixtures and documentation.

Validation should cover:

- descriptor shape
- media type and extension matching
- metadata inspection
- page text normalization
- malformed or empty extraction behavior
- registry and entry point shape
- `markitect-tool` API use through `inspect_source` and `normalize_source`

Output: tests, README update, and validation command.

## Validation

Run from `markitect-filter`:

```bash
PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
```
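
When writing the normalization tests for P2.4 and P2.6, it may help to have the intended page range, whitespace, and segment id behavior in executable form. The following is only a sketch under assumptions: the names `PageSegment`, `parse_page_range`, `normalize_whitespace`, `segment_id`, and `build_segments`, the `"1-3,5"` range syntax, and the id format are all hypothetical and are not part of the `markitect-tool` contract. The actual option contract is the output of P2.1, and page text would come from whichever backend is chosen there.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PageSegment:
    """Hypothetical segment record; the real type comes from the source contract."""
    segment_id: str
    page_number: int  # 1-based page number, kept as provenance
    text: str


def parse_page_range(spec: Optional[str], page_count: int) -> List[int]:
    """Expand an assumed spec such as "1-3,5" into sorted 1-based page numbers."""
    if not spec:
        return list(range(1, page_count + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document length instead of failing on overshoot.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs per line and limit blank-line runs to one."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def segment_id(source_path: str, page_number: int) -> str:
    """Derive an id from path and page only, so re-runs yield the same id."""
    key = f"{source_path}#page={page_number}".encode("utf-8")
    return f"pdf-p{page_number:04d}-{hashlib.sha256(key).hexdigest()[:12]}"


def build_segments(
    source_path: str, page_texts: List[str], page_range: Optional[str] = None
) -> List[PageSegment]:
    """Turn already-extracted page texts into ordered, page-aware segments."""
    segments = []
    for page_number in parse_page_range(page_range, len(page_texts)):
        text = normalize_whitespace(page_texts[page_number - 1])
        if text:  # empty pages are skipped here; a real adapter would emit a diagnostic
            segments.append(
                PageSegment(segment_id(source_path, page_number), page_number, text)
            )
    return segments
```

Deriving the id from the source path and page number alone keeps it independent of extraction output, so a whitespace policy change does not invalidate existing ids. Whether ids should instead track content is a decision for P2.1 and P2.4.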