From 3deb7283750b77b248dd1724f4e3b16e794020b9 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Thu, 14 May 2026 23:17:45 +0200
Subject: [PATCH] docs(workplans): add pdf read adapter plan

---
 workplans/MKTF-WP-0002-pdf-read-adapter.md | 220 +++++++++++++++++++++
 1 file changed, 220 insertions(+)
 create mode 100644 workplans/MKTF-WP-0002-pdf-read-adapter.md

diff --git a/workplans/MKTF-WP-0002-pdf-read-adapter.md b/workplans/MKTF-WP-0002-pdf-read-adapter.md
new file mode 100644
index 0000000..c3c6ba3
--- /dev/null
+++ b/workplans/MKTF-WP-0002-pdf-read-adapter.md
@@ -0,0 +1,220 @@
+---
+id: MKTF-WP-0002
+type: workplan
+title: "PDF Read Adapter"
+domain: markitect
+status: todo
+owner: markitect-filter
+topic_slug: markitect
+planning_priority: P1
+planning_order: 20
+depends_on_workplans:
+  - MKTF-WP-0001
+related_workplans:
+  - MKTT-WP-0018
+created: "2026-05-14"
+updated: "2026-05-14"
+state_hub_workstream_id: "7445fe6b-f1a9-4383-8053-4337337dc095"
+---
+
+# MKTF-WP-0002: PDF Read Adapter
+
+## Purpose
+
+Implement the second concrete `markitect-filter` source adapter:
+`source.pdf`, a read-only PDF adapter that satisfies the `markitect-tool`
+source adapter contract.
+
+The contract dependency is cross-repo and is tracked as related work rather
+than a same-repo State Hub dependency edge: `markitect-tool` `MKTT-WP-0018`.
+
+The first PDF slice should target deterministic text extraction from
+digitally-readable PDFs. It should preserve page-level provenance and make
+extraction uncertainty visible through diagnostics and quality signals.
+
+## Planned Scope
+
+- Optional PDF dependency profile isolated behind a `pdf` extra.
+- Entry point group registration:
+  `markitect_tool.source_adapters`.
+- Lightweight `pdf_adapter_descriptor`.
+- Adapter id `source.pdf` with media type `application/pdf` and extension
+  `.pdf`.
+- Inspection for basic PDF metadata, page count, encryption status, and
+  extractability signals.
+- Read-only page text extraction into ordered Markdown segments.
+- Page-aware source provenance with source paths, page numbers, page labels
+  where available, and stable segment ids.
+- Configurable first-slice options such as page range, page break markers, and
+  whitespace normalization policy.
+- Structured diagnostics for malformed PDFs, encrypted PDFs, image-only or
+  scanned pages, empty extraction, partial page failures, unsupported embedded
+  media, and lossy layout/table handling.
+- Quality metadata for confidence, lossiness, skipped pages, warning counts,
+  extraction backend, and page coverage.
+- Tests for descriptor shape, matching, inspection, normalization, malformed
+  inputs, encrypted or non-extractable inputs where fixtures allow, Markitect
+  API registry use, and entry point shape.
+
+## Non-Goals
+
+- OCR or scanned-document recognition.
+- Pixel-perfect layout preservation.
+- Table reconstruction beyond plain text and diagnostics.
+- Image, figure, annotation, form, signature, or attachment extraction beyond
+  future metadata/diagnostic hooks.
+- PDF writing/export.
+- Network fetching.
+- External processes or native system services in the first slice.
+- Making PDF dependencies mandatory for EPUB3 or other adapters.
+
+## P2.1 - Pin PDF v1 dependency and extraction policy
+
+```task
+id: MKTF-WP-0002-T001
+status: todo
+priority: high
+state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"
+```
+
+Choose the first PDF extraction backend and dependency profile.
+
+The decision should document:
+
+- pure-Python preference for the first slice
+- optional dependency placement under the `pdf` extra
+- supported inputs: local, digitally-readable PDFs
+- unsupported inputs: scanned/image-only PDFs without OCR
+- encrypted/permission-restricted PDF behavior
+- how page range, page breaks, and whitespace normalization should behave
+- fallback or future status for heavier layout/OCR backends
+
+Output: dependency decision, option contract, and implementation notes.
+
+## P2.2 - Add descriptor and entry point registration
+
+```task
+id: MKTF-WP-0002-T002
+status: todo
+priority: high
+state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"
+```
+
+Add a `pdf_adapter_descriptor` matching the existing EPUB3 descriptor pattern.
+
+The descriptor should define:
+
+- adapter id `source.pdf`
+- version `1`
+- media type `application/pdf`
+- extension `.pdf`
+- read operation only
+- safety metadata with local reads only
+- option schema for page range, page breaks, and whitespace normalization
+- quality profile and dependency metadata
+- lazy factory import for the PDF adapter implementation
+
+Output: descriptor, entry point registration, and descriptor tests.
+
+## P2.3 - Implement PDF inspection
+
+```task
+id: MKTF-WP-0002-T003
+status: todo
+priority: high
+state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"
+```
+
+Implement `inspect` for PDF assets.
+
+Inspection should report:
+
+- title, creators/authors, subject, keywords, producer, creation/modification
+  dates where available
+- page count
+- encryption or permission status
+- basic extractability signals
+- diagnostics for malformed, unreadable, encrypted, or unsupported PDFs
+
+Output: inspection implementation and tests with small fixtures.
+
+## P2.4 - Normalize page text into Markitect Markdown
+
+```task
+id: MKTF-WP-0002-T004
+status: todo
+priority: high
+state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"
+```
+
+Implement `read` for digitally-readable PDFs.
+
+Normalization should:
+
+- iterate pages in deterministic order
+- apply page range filtering
+- convert extracted text into Markdown-safe segment text
+- create one or more ordered segments with stable segment ids
+- preserve page-level provenance on every segment
+- optionally insert page break markers
+- produce a stable document id and cache key through the Markitect source
+  contract helpers
+
+Output: read implementation and normalization tests.
+
+## P2.5 - Add diagnostics and quality semantics
+
+```task
+id: MKTF-WP-0002-T005
+status: todo
+priority: high
+state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"
+```
+
+Define PDF-specific diagnostics and quality metadata.
+
+The adapter should distinguish:
+
+- malformed PDF
+- encrypted or permission-restricted PDF
+- no extractable text
+- partially failed pages
+- scanned/image-only pages
+- dropped layout, tables, figures, annotations, or forms
+- unsupported embedded resources
+
+Quality should include extraction backend, page coverage, warning count,
+skipped pages, lossiness, and confidence.
+
+Output: diagnostic helpers, quality rules, and tests.
+
+## P2.6 - Add fixtures, docs, and validation
+
+```task
+id: MKTF-WP-0002-T006
+status: todo
+priority: medium
+state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"
+```
+
+Add small deterministic PDF fixtures and documentation.
+
+Validation should cover:
+
+- descriptor shape
+- media type and extension matching
+- metadata inspection
+- page text normalization
+- malformed or empty extraction behavior
+- registry and entry point shape
+- `markitect-tool` API use through `inspect_source` and `normalize_source`
+
+Output: tests, README update, and validation command.
+
+## Validation
+
+Run from `markitect-filter`:
+
+```bash
+PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
+```
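
A minimal sketch, outside the patch itself, of how the page range, whitespace normalization, page break, and stable segment id behavior named in P2.1 and P2.4 might fit together. Everything in it is an assumption made for illustration: `Segment`, `parse_page_range`, `normalize_whitespace`, `build_segments`, the `"1-3,5"` range syntax, the `---` page break marker, and the id scheme are hypothetical and are not taken from the `markitect-tool` contract. Page text is passed in as plain strings, so no particular PDF backend is implied.

```python
from __future__ import annotations

import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    segment_id: str   # hypothetical id scheme, not the Markitect contract
    page_number: int  # 1-based page number, kept as provenance
    text: str


def parse_page_range(spec: str | None, page_count: int) -> list[int]:
    """Expand a spec such as "1-3,5" into sorted 1-based page numbers."""
    if not spec:
        return list(range(1, page_count + 1))
    pages: set[int] = set()
    for part in spec.split(","):
        start, _, end = part.strip().partition("-")
        first = int(start)
        last = int(end) if end else first
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document so pages past the end are skipped, not fatal.
        pages.update(range(first, min(last, page_count) + 1))
    return sorted(pages)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and limit blank lines to one."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def build_segments(
    source_path: str,
    page_texts: list[str],
    page_range: str | None = None,
    page_breaks: bool = False,
) -> list[Segment]:
    """Turn already-extracted page text into ordered, page-aware segments."""
    segments: list[Segment] = []
    for page_number in parse_page_range(page_range, len(page_texts)):
        text = normalize_whitespace(page_texts[page_number - 1])
        if not text:
            # A real adapter would record a "no extractable text" diagnostic.
            continue
        if page_breaks and segments:
            text = "---\n\n" + text  # assumed page break marker
        # The id depends only on path and page number, so reruns are stable.
        key = f"{source_path}#page={page_number}".encode("utf-8")
        digest = hashlib.sha256(key).hexdigest()[:12]
        segments.append(Segment(f"pdf-p{page_number:04d}-{digest}", page_number, text))
    return segments
```

Deriving the id from the source path and page number alone keeps it stable across reruns, at the cost of not changing when the page content changes; whether the real contract helpers want content-sensitive ids is left open here.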