feat(source): add pdf read adapter

2026-05-14 23:33:31 +02:00
parent 24ee499b50
commit 0c9a418e85
8 changed files with 1176 additions and 13 deletions
--- a/workplans/MKTF-WP-0002-pdf-read-adapter.md
+++ b/workplans/MKTF-WP-0002-pdf-read-adapter.md
@@ -3,10 +3,10 @@ id: MKTF-WP-0002
 type: workplan
 title: "PDF Read Adapter"
 domain: markitect
-status: todo
+status: done
 owner: markitect-filter
 topic_slug: markitect
-planning_priority: P1
+planning_priority: complete
 planning_order: 20
 depends_on_workplans:
  - MKTF-WP-0001
@@ -32,7 +32,7 @@ The first PDF slice should target deterministic text extraction from
 digitally-readable PDFs. It should preserve page-level provenance and make
 extraction uncertainty visible through diagnostics and quality signals.

-## Planned Scope
+## Implemented Scope

 - Optional PDF dependency profile isolated behind a `pdf` extra.
 - Entry point group registration:
@@ -72,7 +72,7 @@ extraction uncertainty visible through diagnostics and quality signals.

 ```task
 id: MKTF-WP-0002-T001
-status: todo
+status: done
 priority: high
 state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"
 ```
@@ -91,11 +91,16 @@ The decision should document:

 Output: dependency decision, option contract, and implementation notes.

+Implemented: `docs/pdf-adapter.md`, `pyproject.toml`, and the descriptor
+metadata document a stdlib first slice, a reserved `pdf` extra, local
+digitally-readable PDF support, page-range, page-marker, and whitespace
+options, and deferred OCR and layout-heavy backends.
+
 ## P2.2 - Add descriptor and entry point registration

 ```task
 id: MKTF-WP-0002-T002
-status: todo
+status: done
 priority: high
 state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"
 ```
@@ -116,11 +121,15 @@ The descriptor should define:

 Output: descriptor, entry point registration, and descriptor tests.

+Implemented: `pdf_adapter_descriptor` is registered through
+`markitect_tool.source_adapters`, exported from the package, and covered by
+descriptor and discovery tests.
+
 ## P2.3 - Implement PDF inspection

 ```task
 id: MKTF-WP-0002-T003
-status: todo
+status: done
 priority: high
 state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"
 ```
@@ -138,11 +147,15 @@ Inspection should report:

 Output: inspection implementation and tests with small fixtures.

+Implemented: `PdfReadAdapter.inspect` reports metadata, page count,
+extractability signals, encryption status, quality metadata, and malformed or
+encrypted diagnostics using deterministic generated fixtures.
+
 ## P2.4 - Normalize page text into Markitect Markdown

 ```task
 id: MKTF-WP-0002-T004
-status: todo
+status: done
 priority: high
 state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"
 ```
@@ -162,11 +175,15 @@ Normalization should:

 Output: read implementation and normalization tests.

+Implemented: `PdfReadAdapter.read` extracts ordered page text into stable
+page segments, applies page ranges, supports optional page markers, preserves
+page provenance, and uses the Markitect cache-key helpers.
+
 ## P2.5 - Add diagnostics and quality semantics

 ```task
 id: MKTF-WP-0002-T005
-status: todo
+status: done
 priority: high
 state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"
 ```
@@ -188,11 +205,17 @@ skipped pages, lossiness, and confidence.

 Output: diagnostic helpers, quality rules, and tests.

+Implemented: PDF diagnostics cover malformed files, unreadable files,
+encrypted PDFs, invalid page ranges, missing or empty streams, image-only pages,
+empty extraction, and stream decompression failures. Quality metadata records
+backend, page count, selected pages, extracted pages, coverage, warnings, and
+skipped pages.
+
 ## P2.6 - Add fixtures, docs, and validation

 ```task
 id: MKTF-WP-0002-T006
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"
 ```
@@ -211,6 +234,11 @@ Validation should cover:

 Output: tests, README update, and validation command.

+Implemented: generated PDF fixtures and tests cover descriptor shape, matching,
+metadata inspection, normalization, page range markers, malformed PDFs,
+encrypted PDFs, registry use, entry point discovery, README documentation, and
+the validation command below.
+
 ## Validation

 Run from `markitect-filter`: