docs(workplans): add pdf read adapter plan

2026-05-14 23:17:45 +02:00
parent 63518863aa
commit 3deb728375
1 changed file with 220 additions and 0 deletions
--- a/workplans/MKTF-WP-0002-pdf-read-adapter.md
+++ b/workplans/MKTF-WP-0002-pdf-read-adapter.md
@@ -0,0 +1,220 @@
+---
+id: MKTF-WP-0002
+type: workplan
+title: "PDF Read Adapter"
+domain: markitect
+status: todo
+owner: markitect-filter
+topic_slug: markitect
+planning_priority: P1
+planning_order: 20
+depends_on_workplans:
+  - MKTF-WP-0001
+related_workplans:
+  - MKTT-WP-0018
+created: "2026-05-14"
+updated: "2026-05-14"
+state_hub_workstream_id: "7445fe6b-f1a9-4383-8053-4337337dc095"
+---
+
+# MKTF-WP-0002: PDF Read Adapter
+
+## Purpose
+
+Implement the second concrete `markitect-filter` source adapter:
+`source.pdf`, a read-only PDF adapter that satisfies the `markitect-tool`
+source adapter contract.
+
+The contract dependency is cross-repo, so it is tracked as related work
+(`markitect-tool` `MKTT-WP-0018`), not as a same-repo State Hub dependency edge.
+
+The first PDF slice should target deterministic text extraction from
+digitally-readable PDFs. It should preserve page-level provenance and make
+extraction uncertainty visible through diagnostics and quality signals.
+
+## Planned Scope
+
+- Optional PDF dependency profile isolated behind a `pdf` extra.
+- Entry point group registration:
+  `markitect_tool.source_adapters`.
+- Lightweight `pdf_adapter_descriptor`.
+- Adapter id `source.pdf` with media type `application/pdf` and extension
+  `.pdf`.
+- Inspection for basic PDF metadata, page count, encryption status, and
+  extractability signals.
+- Read-only page text extraction into ordered Markdown segments.
+- Page-aware source provenance with source paths, page numbers, page labels
+  where available, and stable segment ids.
+- Configurable first-slice options such as page range, page break markers, and
+  whitespace normalization policy.
+- Structured diagnostics for malformed PDFs, encrypted PDFs, image-only or
+  scanned pages, empty extraction, partial page failures, unsupported embedded
+  media, and lossy layout/table handling.
+- Quality metadata for confidence, lossiness, skipped pages, warning counts,
+  extraction backend, and page coverage.
+- Tests for descriptor shape, matching, inspection, normalization, malformed
+  inputs, encrypted or non-extractable inputs where fixtures allow, Markitect
+  API registry use, and entry point shape.
+
+## Non-Goals
+
+- OCR or scanned-document recognition.
+- Pixel-perfect layout preservation.
+- Table reconstruction beyond plain text and diagnostics.
+- Image, figure, annotation, form, signature, or attachment extraction beyond
+  future metadata/diagnostic hooks.
+- PDF writing/export.
+- Network fetching.
+- External processes or native system services in the first slice.
+- Making PDF dependencies mandatory for EPUB3 or other adapters.
+
+## P2.1 - Pin PDF v1 dependency and extraction policy
+
+```task
+id: MKTF-WP-0002-T001
+status: todo
+priority: high
+state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"
+```
+
+Choose the first PDF extraction backend and dependency profile.
+
+The decision should document:
+
+- pure-Python preference for the first slice
+- optional dependency placement under the `pdf` extra
+- supported inputs: local, digitally-readable PDFs
+- unsupported inputs: scanned/image-only PDFs without OCR
+- encrypted/permission-restricted PDF behavior
+- how page range, page breaks, and whitespace normalization should behave
+- fallback or future status for heavier layout/OCR backends
+
+Output: dependency decision, option contract, and implementation notes.
+
+## P2.2 - Add descriptor and entry point registration
+
+```task
+id: MKTF-WP-0002-T002
+status: todo
+priority: high
+state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"
+```
+
+Add a `pdf_adapter_descriptor` matching the existing EPUB3 descriptor pattern.
+
+The descriptor should define:
+
+- adapter id `source.pdf`
+- version `1`
+- media type `application/pdf`
+- extension `.pdf`
+- read operation only
+- safety metadata with local reads only
+- option schema for page range, page breaks, and whitespace normalization
+- quality profile and dependency metadata
+- lazy factory import for the PDF adapter implementation
+
+Output: descriptor, entry point registration, and descriptor tests.
+
+## P2.3 - Implement PDF inspection
+
+```task
+id: MKTF-WP-0002-T003
+status: todo
+priority: high
+state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"
+```
+
+Implement `inspect` for PDF assets.
+
+Inspection should report:
+
+- title, creators/authors, subject, keywords, producer, creation/modification
+  dates where available
+- page count
+- encryption or permission status
+- basic extractability signals
+- diagnostics for malformed, unreadable, encrypted, or unsupported PDFs
+
+Output: inspection implementation and tests with small fixtures.
+
+## P2.4 - Normalize page text into Markitect Markdown
+
+```task
+id: MKTF-WP-0002-T004
+status: todo
+priority: high
+state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"
+```
+
+Implement `read` for digitally-readable PDFs.
+
+Normalization should:
+
+- iterate pages in deterministic order
+- apply page range filtering
+- convert extracted text into Markdown-safe segment text
+- create one or more ordered segments with stable segment ids
+- preserve page-level provenance on every segment
+- optionally insert page break markers
+- produce a stable document id and cache key through the Markitect source
+  contract helpers
+
+Output: read implementation and normalization tests.
+
+## P2.5 - Add diagnostics and quality semantics
+
+```task
+id: MKTF-WP-0002-T005
+status: todo
+priority: high
+state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"
+```
+
+Define PDF-specific diagnostics and quality metadata.
+
+The adapter should distinguish:
+
+- malformed PDF
+- encrypted or permission-restricted PDF
+- no extractable text
+- partially failed pages
+- scanned/image-only pages
+- dropped layout, tables, figures, annotations, or forms
+- unsupported embedded resources
+
+Quality should include extraction backend, page coverage, warning count,
+skipped pages, lossiness, and confidence.
+
+Output: diagnostic helpers, quality rules, and tests.
+
+## P2.6 - Add fixtures, docs, and validation
+
+```task
+id: MKTF-WP-0002-T006
+status: todo
+priority: medium
+state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"
+```
+
+Add small deterministic PDF fixtures and documentation.
+
+Validation should cover:
+
+- descriptor shape
+- media type and extension matching
+- metadata inspection
+- page text normalization
+- malformed-input and empty-extraction behavior
+- registry and entry point shape
+- `markitect-tool` API use through `inspect_source` and `normalize_source`
+
+Output: tests, README update, and validation command.
+
+## Validation
+
+Run from the `markitect-filter` root (adjust the `markitect-tool` path to your checkout):
+
+```bash
+PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
+```
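
Review note on P2.4 and P2.5: a minimal sketch of how page range filtering, stable segment ids, and page coverage might fit together. Every name here (`Segment`, `parse_page_range`, `segment_id`, `build_segments`), the `"1-3,5"` range syntax, and the id format are hypothetical and not part of the `markitect-tool` contract. Page text is passed in as plain strings so that no PDF backend is assumed, and Markdown escaping is left out.

```python
"""Sketch only: hypothetical helpers, not the markitect-tool source contract."""
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    segment_id: str   # identical across runs for the same source key and page
    page_number: int  # 1-based page number, kept as provenance
    text: str


def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "1-3,5" or "2-" into sorted 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, sep, last = part.partition("-")
        start = int(first)
        # "2-" means "page 2 to the end"; "5" means the single page 5.
        end = int(last) if last else (page_count if sep else start)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document: pages past the end are dropped, not fatal.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


def segment_id(source_key: str, page_number: int) -> str:
    """Derive an id from the inputs only, so reruns give the same value."""
    digest = hashlib.sha256(f"{source_key}:{page_number}".encode("utf-8")).hexdigest()
    return f"pdf-p{page_number:04d}-{digest[:12]}"


def build_segments(source_key, page_texts, page_range=None):
    """Build ordered segments from per-page text.

    page_texts[0] is page 1; None or "" marks a page with no extracted text.
    """
    page_count = len(page_texts)
    if page_range:
        wanted = parse_page_range(page_range, page_count)
    else:
        wanted = list(range(1, page_count + 1))

    segments, skipped = [], []
    for number in wanted:
        raw = page_texts[number - 1]
        # Whitespace policy assumed here: collapse every run to one space.
        text = " ".join(raw.split()) if raw else ""
        if not text:
            skipped.append(number)
            continue
        segments.append(Segment(segment_id(source_key, number), number, text))

    coverage = len(segments) / len(wanted) if wanted else 0.0
    quality = {"page_coverage": coverage, "skipped_pages": skipped}
    return segments, quality
```

Hashing only the source key and page number keeps ids independent of extraction order and of the selected range. Whether an empty page should be skipped silently or should also raise a diagnostic is a decision left to P2.5.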