feat(source): add pdf read adapter

2026-05-14 23:33:31 +02:00
parent 24ee499b50
commit 0c9a418e85
8 changed files with 1176 additions and 13 deletions

@@ -3,10 +3,10 @@ id: MKTF-WP-0002
type: workplan
title: "PDF Read Adapter"
domain: markitect
-status: todo
+status: done
owner: markitect-filter
topic_slug: markitect
-planning_priority: P1
+planning_priority: complete
planning_order: 20
depends_on_workplans:
- MKTF-WP-0001
@@ -32,7 +32,7 @@ The first PDF slice should target deterministic text extraction from
digitally-readable PDFs. It should preserve page-level provenance and make
extraction uncertainty visible through diagnostics and quality signals.
-## Planned Scope
+## Implemented Scope
- Optional PDF dependency profile isolated behind a `pdf` extra.
- Entry point group registration:
@@ -72,7 +72,7 @@ extraction uncertainty visible through diagnostics and quality signals.
```task
id: MKTF-WP-0002-T001
-status: todo
+status: done
priority: high
state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"
```
@@ -91,11 +91,16 @@ The decision should document:
Output: dependency decision, option contract, and implementation notes.
+Implemented: `docs/pdf-adapter.md`, `pyproject.toml`, and the descriptor
+metadata document a stdlib first slice, a reserved `pdf` extra, local
+digitally-readable PDF support, page range/page marker/whitespace options, and
+deferred OCR/layout-heavy backends.
## P2.2 - Add descriptor and entry point registration
```task
id: MKTF-WP-0002-T002
-status: todo
+status: done
priority: high
state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"
```
@@ -116,11 +121,15 @@ The descriptor should define:
Output: descriptor, entry point registration, and descriptor tests.
+Implemented: `pdf_adapter_descriptor` is registered through
+`markitect_tool.source_adapters`, exported from the package, and covered by
+descriptor and discovery tests.
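A minimal sketch of how entry point discovery for this group could be checked, assuming the package is installed in the current environment. The group name `markitect_tool.source_adapters` comes from the workplan text; the helper `discover_source_adapters` is hypothetical and is not part of Markitect.

```python
from importlib.metadata import entry_points

# Group name taken from the workplan text above.
GROUP = "markitect_tool.source_adapters"


def discover_source_adapters(group: str = GROUP) -> dict:
    """Map entry point names to their 'module:attr' targets without importing them.

    Hypothetical helper for illustration only.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are selectable by group.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    for name, target in sorted(discover_source_adapters().items()):
        print(f"{name} -> {target}")
```

If the adapter is registered, its descriptor target should appear in the output; an empty result means nothing is installed under that group.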
## P2.3 - Implement PDF inspection
```task
id: MKTF-WP-0002-T003
-status: todo
+status: done
priority: high
state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"
```
@@ -138,11 +147,15 @@ Inspection should report:
Output: inspection implementation and tests with small fixtures.
+Implemented: `PdfReadAdapter.inspect` reports metadata, page count,
+extractability signals, encryption status, quality metadata, and malformed or
+encrypted diagnostics using deterministic generated fixtures.
## P2.4 - Normalize page text into Markitect Markdown
```task
id: MKTF-WP-0002-T004
-status: todo
+status: done
priority: high
state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"
```
@@ -162,11 +175,15 @@ Normalization should:
Output: read implementation and normalization tests.
+Implemented: `PdfReadAdapter.read` extracts ordered page text into stable
+page segments, applies page ranges, supports optional page markers, preserves
+page provenance, and uses the Markitect cache-key helpers.
## P2.5 - Add diagnostics and quality semantics
```task
id: MKTF-WP-0002-T005
-status: todo
+status: done
priority: high
state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"
```
@@ -188,11 +205,17 @@ skipped pages, lossiness, and confidence.
Output: diagnostic helpers, quality rules, and tests.
+Implemented: PDF diagnostics cover malformed files, unreadable files,
+encrypted PDFs, invalid page ranges, missing/empty streams, image-only pages,
+empty extraction, and stream decompression failures. Quality metadata records
+backend, page count, selected pages, extracted pages, coverage, warnings, and
+skipped pages.
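The quality fields listed above can be pictured as a small record. This is a sketch under stated assumptions, not the adapter's actual data model: the class name `PdfQuality`, the field names, and the definition of coverage as extracted pages divided by selected pages are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PdfQuality:
    """Hypothetical quality record mirroring the fields named in the workplan."""

    backend: str
    page_count: int
    selected_pages: list[int]
    extracted_pages: list[int]
    warnings: list[str] = field(default_factory=list)

    @property
    def skipped_pages(self) -> list[int]:
        # Selected pages that produced no extracted text, in selection order.
        extracted = set(self.extracted_pages)
        return [page for page in self.selected_pages if page not in extracted]

    @property
    def coverage(self) -> float:
        # Assumed definition: share of selected pages that were extracted.
        if not self.selected_pages:
            return 0.0
        hit = set(self.extracted_pages) & set(self.selected_pages)
        return len(hit) / len(self.selected_pages)


quality = PdfQuality(
    backend="stdlib",
    page_count=4,
    selected_pages=[1, 2, 3, 4],
    extracted_pages=[1, 2, 4],
    warnings=["page 3: image-only page"],
)
print(quality.coverage, quality.skipped_pages)
```

Deriving `coverage` and `skipped_pages` from the two page lists keeps them from drifting out of sync with the selection.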
## P2.6 - Add fixtures, docs, and validation
```task
id: MKTF-WP-0002-T006
-status: todo
+status: done
priority: medium
state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"
```
@@ -211,6 +234,11 @@ Validation should cover:
Output: tests, README update, and validation command.
+Implemented: generated PDF fixtures and tests cover descriptor shape, matching,
+metadata inspection, normalization, page range markers, malformed PDFs,
+encrypted PDFs, registry use, entry point discovery, README documentation, and
+the validation command below.
## Validation
Run from `markitect-filter`: