generated from coulomb/repo-seed
feat(source): add pdf read adapter
@@ -3,10 +3,10 @@ id: MKTF-WP-0002
 type: workplan
 title: "PDF Read Adapter"
 domain: markitect
-status: todo
+status: done
 owner: markitect-filter
 topic_slug: markitect
-planning_priority: P1
+planning_priority: complete
 planning_order: 20
 depends_on_workplans:
 - MKTF-WP-0001
@@ -32,7 +32,7 @@ The first PDF slice should target deterministic text extraction from
 digitally-readable PDFs. It should preserve page-level provenance and make
 extraction uncertainty visible through diagnostics and quality signals.

-## Planned Scope
+## Implemented Scope

 - Optional PDF dependency profile isolated behind a `pdf` extra.
 - Entry point group registration:
@@ -72,7 +72,7 @@ extraction uncertainty visible through diagnostics and quality signals.

 ```task
 id: MKTF-WP-0002-T001
-status: todo
+status: done
 priority: high
 state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"
 ```
@@ -91,11 +91,16 @@ The decision should document:

 Output: dependency decision, option contract, and implementation notes.

+Implemented: `docs/pdf-adapter.md`, `pyproject.toml`, and the descriptor
+metadata document a stdlib first slice, a reserved `pdf` extra, local
+digitally-readable PDF support, page range/page marker/whitespace options, and
+deferred OCR/layout-heavy backends.
+
 ## P2.2 - Add descriptor and entry point registration

 ```task
 id: MKTF-WP-0002-T002
-status: todo
+status: done
 priority: high
 state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"
 ```
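The page range, page marker, and whitespace options named in the P2.1 note could be modelled roughly as below. This is only a sketch: `PdfReadOptions`, its field names, `select_pages`, and the `"1-3,5"` range syntax are assumptions for illustration, since the commit view does not show the actual option contract.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PdfReadOptions:
    """Hypothetical option bundle; field names are not taken from the commit."""

    page_range: str | None = None     # e.g. "1-3,5"; None selects every page
    page_markers: bool = False        # emit a marker before each page segment
    collapse_whitespace: bool = True  # squeeze runs of spaces inside a line


def select_pages(page_range: str | None, page_count: int) -> list[int]:
    """Expand a 1-based range expression into sorted, de-duplicated page numbers.

    Raises ValueError for malformed or out-of-bounds segments so a caller can
    convert the failure into an "invalid page range" diagnostic.
    """
    if page_range is None:
        return list(range(1, page_count + 1))
    selected: set[int] = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty segment in page range {page_range!r}")
        first, sep, last = part.partition("-")
        try:
            start = int(first)
            end = int(last) if sep else start
        except ValueError:
            raise ValueError(f"non-numeric page range segment {part!r}") from None
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"segment {part!r} is outside 1..{page_count}")
        selected.update(range(start, end + 1))
    return sorted(selected)
```

Rejecting a bad range with an exception, rather than silently clamping it, keeps the selection deterministic and leaves the wording of the diagnostic to the adapter.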
@@ -116,11 +121,15 @@ The descriptor should define:

 Output: descriptor, entry point registration, and descriptor tests.

+Implemented: `pdf_adapter_descriptor` is registered through
+`markitect_tool.source_adapters`, exported from the package, and covered by
+descriptor and discovery tests.
+
 ## P2.3 - Implement PDF inspection

 ```task
 id: MKTF-WP-0002-T003
-status: todo
+status: done
 priority: high
 state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"
 ```
@@ -138,11 +147,15 @@ Inspection should report:

 Output: inspection implementation and tests with small fixtures.

+Implemented: `PdfReadAdapter.inspect` reports metadata, page count,
+extractability signals, encryption status, quality metadata, and malformed or
+encrypted diagnostics using deterministic generated fixtures.
+
 ## P2.4 - Normalize page text into Markitect Markdown

 ```task
 id: MKTF-WP-0002-T004
-status: todo
+status: done
 priority: high
 state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"
 ```
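To make the inspection signals from the P2.3 note concrete, here is a stdlib-only heuristic over raw PDF bytes. It is not `PdfReadAdapter.inspect`; the function name, the report keys, and the diagnostic strings are invented for illustration. The byte-scan also undercounts pages when page objects live inside compressed object streams, so treat it as a rough signal only.

```python
import re

# Matches "/Type /Page" but not "/Type /Pages" (the page tree node).
PAGE_OBJECT = re.compile(rb"/Type\s*/Page(?![A-Za-z])")


def inspect_pdf_bytes(data: bytes) -> dict:
    """Return a small inspection report for an in-memory PDF (heuristic)."""
    diagnostics: list[str] = []
    if not data.startswith(b"%PDF-"):
        # Without the header nothing else can be trusted, so stop early.
        diagnostics.append("malformed: missing %PDF- header")
        return {"page_count": 0, "encrypted": False, "diagnostics": diagnostics}
    encrypted = b"/Encrypt" in data
    if encrypted:
        diagnostics.append("encrypted: /Encrypt entry present")
    if b"%%EOF" not in data:
        diagnostics.append("malformed: missing %%EOF marker")
    return {
        "page_count": len(PAGE_OBJECT.findall(data)),
        "encrypted": encrypted,
        "diagnostics": diagnostics,
    }
```

Because the function takes bytes rather than a path, tests can feed it small generated documents directly, which fits the deterministic-fixture approach the note describes.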
@@ -162,11 +175,15 @@ Normalization should:

 Output: read implementation and normalization tests.

+Implemented: `PdfReadAdapter.read` extracts ordered page text into stable
+page segments, applies page ranges, supports optional page markers, preserves
+page provenance, and uses the Markitect cache-key helpers.
+
 ## P2.5 - Add diagnostics and quality semantics

 ```task
 id: MKTF-WP-0002-T005
-status: todo
+status: done
 priority: high
 state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"
 ```
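The "stable page segments" and "optional page markers" from the P2.4 note can be illustrated with a small joiner. The function name, the HTML-comment marker format, and the whitespace rules are assumptions made for this sketch; the commit does not show the format Markitect actually emits.

```python
import re


def normalize_pages(pages: dict[int, str], page_markers: bool = False) -> str:
    """Join per-page text into one Markdown string in ascending page order.

    `pages` maps 1-based page numbers to raw extracted text. Pages whose text
    is empty after cleanup produce no segment.
    """
    segments: list[str] = []
    for number in sorted(pages):  # sorting keeps output independent of dict order
        lines = [
            re.sub(r"[ \t]+", " ", line).strip()
            for line in pages[number].splitlines()
        ]
        text = "\n".join(lines).strip()
        if not text:
            continue
        if page_markers:
            # Hypothetical marker; it keeps page provenance visible in the output.
            text = f"<!-- page: {number} -->\n{text}"
        segments.append(text)
    return "\n\n".join(segments) + ("\n" if segments else "")
```

Sorting by page number and stripping per line means the same input always yields byte-identical output, which is what makes the result safe to use behind a cache key.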
@@ -188,11 +205,17 @@ skipped pages, lossiness, and confidence.

 Output: diagnostic helpers, quality rules, and tests.

+Implemented: PDF diagnostics cover malformed files, unreadable files,
+encrypted PDFs, invalid page ranges, missing/empty streams, image-only pages,
+empty extraction, and stream decompression failures. Quality metadata records
+backend, page count, selected pages, extracted pages, coverage, warnings, and
+skipped pages.
+
 ## P2.6 - Add fixtures, docs, and validation

 ```task
 id: MKTF-WP-0002-T006
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"
 ```
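The quality fields listed in the P2.5 note (backend, page count, selected pages, extracted pages, coverage, warnings, skipped pages) fit naturally into a small record. The class below is a sketch, not the project's type: the name `PdfQuality` is hypothetical, and defining coverage as extracted pages divided by selected pages is an assumption the commit does not confirm.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PdfQuality:
    """Hypothetical quality record; derived fields are computed, not stored."""

    backend: str
    page_count: int
    selected_pages: list[int]
    extracted_pages: list[int]
    warnings: list[str] = field(default_factory=list)

    @property
    def skipped_pages(self) -> list[int]:
        """Selected pages that yielded no extracted text, in selection order."""
        extracted = set(self.extracted_pages)
        return [page for page in self.selected_pages if page not in extracted]

    @property
    def coverage(self) -> float:
        """Fraction of selected pages that were extracted (assumed definition)."""
        if not self.selected_pages:
            return 0.0
        hit = set(self.extracted_pages) & set(self.selected_pages)
        return len(hit) / len(self.selected_pages)
```

Deriving `skipped_pages` and `coverage` from the two page lists avoids the three values drifting out of sync when a diagnostic path forgets to update one of them.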
@@ -211,6 +234,11 @@ Validation should cover:

 Output: tests, README update, and validation command.

+Implemented: generated PDF fixtures and tests cover descriptor shape, matching,
+metadata inspection, normalization, page range markers, malformed PDFs,
+encrypted PDFs, registry use, entry point discovery, README documentation, and
+the validation command below.
+
 ## Validation

 Run from `markitect-filter`: