generated from coulomb/repo-seed
docs(workplans): add pdf read adapter plan
220
workplans/MKTF-WP-0002-pdf-read-adapter.md
Normal file
@@ -0,0 +1,220 @@
---
id: MKTF-WP-0002
type: workplan
title: "PDF Read Adapter"
domain: markitect
status: todo
owner: markitect-filter
topic_slug: markitect
planning_priority: P1
planning_order: 20
depends_on_workplans:
- MKTF-WP-0001
related_workplans:
- MKTT-WP-0018
created: "2026-05-14"
updated: "2026-05-14"
state_hub_workstream_id: "7445fe6b-f1a9-4383-8053-4337337dc095"
---

# MKTF-WP-0002: PDF Read Adapter

## Purpose

Implement the second concrete `markitect-filter` source adapter:
`source.pdf`, a read-only PDF adapter that satisfies the `markitect-tool`
source adapter contract.

The contract dependency is cross-repo and is tracked as related work rather
than a same-repo State Hub dependency edge: `markitect-tool` `MKTT-WP-0018`.

The first PDF slice should target deterministic text extraction from
digitally-readable PDFs. It should preserve page-level provenance and make
extraction uncertainty visible through diagnostics and quality signals.

## Planned Scope

- Optional PDF dependency profile isolated behind a `pdf` extra.
- Entry point group registration:
  `markitect_tool.source_adapters`.
- Lightweight `pdf_adapter_descriptor`.
- Adapter id `source.pdf` with media type `application/pdf` and extension
  `.pdf`.
- Inspection for basic PDF metadata, page count, encryption status, and
  extractability signals.
- Read-only page text extraction into ordered Markdown segments.
- Page-aware source provenance with source paths, page numbers, page labels
  where available, and stable segment ids.
- Configurable first-slice options such as page range, page break markers, and
  whitespace normalization policy.
- Structured diagnostics for malformed PDFs, encrypted PDFs, image-only or
  scanned pages, empty extraction, partial page failures, unsupported embedded
  media, and lossy layout/table handling.
- Quality metadata for confidence, lossiness, skipped pages, warning counts,
  extraction backend, and page coverage.
- Tests for descriptor shape, matching, inspection, normalization, malformed
  inputs, encrypted or non-extractable inputs where fixtures allow, Markitect
  API registry use, and entry point shape.

## Non-Goals

- OCR or scanned-document recognition.
- Pixel-perfect layout preservation.
- Table reconstruction beyond plain text and diagnostics.
- Image, figure, annotation, form, signature, or attachment extraction beyond
  future metadata/diagnostic hooks.
- PDF writing/export.
- Network fetching.
- External processes or native system services in the first slice.
- Making PDF dependencies mandatory for EPUB3 or other adapters.

## P2.1 - Pin PDF v1 dependency and extraction policy

```task
id: MKTF-WP-0002-T001
status: todo
priority: high
state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"
```

Choose the first PDF extraction backend and dependency profile.

The decision should document:

- pure-Python preference for the first slice
- optional dependency placement under the `pdf` extra
- supported inputs: local, digitally-readable PDFs
- unsupported inputs: scanned/image-only PDFs without OCR
- encrypted/permission-restricted PDF behavior
- how page range, page breaks, and whitespace normalization should behave
- fallback or future status for heavier layout/OCR backends

Output: dependency decision, option contract, and implementation notes.
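
As a starting point for the option contract, the three first-slice options could be sketched as below. This is only an illustration: `PdfReadOptions`, `parse_page_range`, the field names, the defaults, and the `"2-5"` range syntax are all assumptions, not part of the `markitect-tool` contract.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PdfReadOptions:
    """Hypothetical first-slice options; names and defaults are assumptions."""

    page_range: Optional[Tuple[int, int]] = None  # 1-based, inclusive; None = all pages
    page_breaks: bool = False                     # emit a marker between pages
    whitespace: str = "collapse"                  # e.g. "preserve" or "collapse"


def parse_page_range(spec: str, page_count: int) -> Tuple[int, int]:
    """Parse "3", "2-5", "2-" or "-5" into a 1-based inclusive range.

    The end of the range is clamped to page_count; a start outside the
    document is rejected instead of silently producing an empty range.
    """
    spec = spec.strip()
    if "-" in spec:
        start_text, end_text = spec.split("-", 1)
        start = int(start_text) if start_text.strip() else 1
        end = int(end_text) if end_text.strip() else page_count
    else:
        start = end = int(spec)
    if start < 1 or end < start or start > page_count:
        raise ValueError(f"invalid page range: {spec!r}")
    return start, min(end, page_count)
```

Whether an out-of-range request should raise, as here, or degrade to a diagnostic is exactly the kind of behavior this task is meant to pin down.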

## P2.2 - Add descriptor and entry point registration

```task
id: MKTF-WP-0002-T002
status: todo
priority: high
state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"
```

Add a `pdf_adapter_descriptor` matching the existing EPUB3 descriptor pattern.

The descriptor should define:

- adapter id `source.pdf`
- version `1`
- media type `application/pdf`
- extension `.pdf`
- read operation only
- safety metadata with local reads only
- option schema for page range, page breaks, and whitespace normalization
- quality profile and dependency metadata
- lazy factory import for the PDF adapter implementation

Output: descriptor, entry point registration, and descriptor tests.
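
The descriptor fields listed above, together with the lazy factory import, might look roughly like this. `AdapterDescriptor` is a stand-in written for this sketch; the real descriptor type comes from `markitect-tool` and the EPUB3 pattern, and the factory path `markitect_filter.pdf.adapter:PdfAdapter` is a hypothetical module location.

```python
import importlib
from dataclasses import dataclass
from pathlib import PurePath
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class AdapterDescriptor:
    """Illustrative stand-in; the real descriptor type lives in markitect-tool."""

    adapter_id: str
    version: str
    media_types: Tuple[str, ...]
    extensions: Tuple[str, ...]
    operations: Tuple[str, ...]
    factory_path: str  # "module:attribute", resolved only on first use

    def matches(self, path: str, media_type: Optional[str] = None) -> bool:
        # A declared media type wins; otherwise fall back to the file extension.
        if media_type is not None and media_type.lower() in self.media_types:
            return True
        return PurePath(path).suffix.lower() in self.extensions

    def load_factory(self) -> Callable[..., Any]:
        # Lazy import: the PDF backend is only imported when an adapter is built,
        # so listing descriptors never requires the optional `pdf` extra.
        module_name, _, attribute = self.factory_path.partition(":")
        return getattr(importlib.import_module(module_name), attribute)


pdf_adapter_descriptor = AdapterDescriptor(
    adapter_id="source.pdf",
    version="1",
    media_types=("application/pdf",),
    extensions=(".pdf",),
    operations=("read",),
    factory_path="markitect_filter.pdf.adapter:PdfAdapter",  # hypothetical path
)
```

Keeping the factory as a string rather than an imported callable is what lets the descriptor stay lightweight when the PDF dependency is not installed.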

## P2.3 - Implement PDF inspection

```task
id: MKTF-WP-0002-T003
status: todo
priority: high
state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"
```

Implement `inspect` for PDF assets.

Inspection should report:

- title, creators/authors, subject, keywords, producer, creation/modification
  dates where available
- page count
- encryption or permission status
- basic extractability signals
- diagnostics for malformed, unreadable, encrypted, or unsupported PDFs

Output: inspection implementation and tests with small fixtures.
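
Before any backend is chosen, some of the malformed and encrypted signals can be approximated with a cheap byte-level check. The sketch below uses only the standard library and is a heuristic, not an implementation of `inspect`: it looks for the `%PDF-` header, the `%%EOF` marker, and an `/Encrypt` key. `PreflightReport`, `preflight_pdf`, and the diagnostic codes are hypothetical names, and the `/Encrypt` scan can produce false positives, so a real backend would still need to confirm every signal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PreflightReport:
    """Hypothetical result shape for a cheap check that runs before the backend."""

    looks_like_pdf: bool
    pdf_version: str = ""
    maybe_encrypted: bool = False
    diagnostics: List[str] = field(default_factory=list)


def preflight_pdf(data: bytes) -> PreflightReport:
    """Byte-level heuristics only; a real backend must confirm every signal."""
    # Many readers tolerate leading junk, so search the first 1024 bytes.
    header_at = data[:1024].find(b"%PDF-")
    if header_at < 0:
        return PreflightReport(False, diagnostics=["pdf.malformed.missing_header"])
    # The three bytes after "%PDF-" carry the version, e.g. "1.4".
    version = data[header_at + 5 : header_at + 8].decode("ascii", "replace")
    report = PreflightReport(True, pdf_version=version)
    if b"%%EOF" not in data[-1024:]:
        report.diagnostics.append("pdf.malformed.missing_eof")
    if b"/Encrypt" in data:
        report.maybe_encrypted = True
        report.diagnostics.append("pdf.encrypted.suspected")
    return report
```

Metadata fields and page count are deliberately left out here, since they depend on the backend decision from P2.1.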

## P2.4 - Normalize page text into Markitect Markdown

```task
id: MKTF-WP-0002-T004
status: todo
priority: high
state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"
```

Implement `read` for digitally-readable PDFs.

Normalization should:

- iterate pages in deterministic order
- apply page range filtering
- convert extracted text into Markdown-safe segment text
- create one or more ordered segments with stable segment ids
- preserve page-level provenance on every segment
- optionally insert page break markers
- produce a stable document id and cache key through the Markitect source
  contract helpers

Output: read implementation and normalization tests.
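
The normalization steps above can be sketched independently of the extraction backend by starting from page texts that have already been extracted. Everything named here is an assumption for illustration: the `Segment` shape, the `pdf-` id prefix, hashing the source path plus page number for stability, and the one-segment-per-page granularity. Markdown escaping and the document id and cache key helpers from the Markitect source contract are not shown.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment shape carrying page-level provenance."""

    segment_id: str
    source_path: str
    page_number: int  # 1-based
    text: str


def normalize_pages(
    source_path: str,
    page_texts: Sequence[str],
    page_range: Optional[Tuple[int, int]] = None,
    page_break_marker: Optional[str] = None,
) -> List[Segment]:
    """Turn already-extracted page texts into ordered, stably identified segments."""
    first, last = page_range or (1, len(page_texts))
    segments: List[Segment] = []
    for page_number, raw in enumerate(page_texts, start=1):
        if not first <= page_number <= last:
            continue
        # Collapse runs of spaces/tabs and excess blank lines; keep paragraph breaks.
        text = re.sub(r"[ \t]+", " ", raw)
        text = re.sub(r"\n{3,}", "\n\n", text).strip()
        if not text:
            continue  # empty pages would be reported through diagnostics instead
        if page_break_marker and segments:
            text = f"{page_break_marker}\n\n{text}"
        # The id depends only on path and page, so it is stable across runs.
        digest = hashlib.sha256(f"{source_path}#page={page_number}".encode()).hexdigest()
        segments.append(Segment(f"pdf-{digest[:12]}", source_path, page_number, text))
    return segments
```

Deriving the id from the path and page number rather than from the text keeps ids stable when whitespace policy changes, at the cost of changing them when a file is moved; the plan may prefer a different trade-off.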

## P2.5 - Add diagnostics and quality semantics

```task
id: MKTF-WP-0002-T005
status: todo
priority: high
state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"
```

Define PDF-specific diagnostics and quality metadata.

The adapter should distinguish:

- malformed PDF
- encrypted or permission-restricted PDF
- no extractable text
- partially failed pages
- scanned/image-only pages
- dropped layout, tables, figures, annotations, or forms
- unsupported embedded resources

Quality should include extraction backend, page coverage, warning count,
skipped pages, lossiness, and confidence.

Output: diagnostic helpers, quality rules, and tests.
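
One possible way to derive the quality fields from per-page results is sketched below. The `PageOutcome` shape, the diagnostic code strings, the confidence thresholds, and the decision to always mark plain-text extraction as lossy are assumptions made for the example, not agreed rules.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PageOutcome:
    """Hypothetical per-page extraction result."""

    page_number: int
    char_count: int
    failed: bool = False
    has_images: bool = False


def summarize_quality(outcomes: List[PageOutcome], backend: str) -> Dict[str, object]:
    """Fold per-page outcomes into diagnostics and coarse quality metadata."""
    diagnostics: List[str] = []
    skipped: List[int] = []
    for page in outcomes:
        if page.failed:
            diagnostics.append(f"pdf.page.failed:{page.page_number}")
        elif page.char_count == 0 and page.has_images:
            diagnostics.append(f"pdf.page.image_only:{page.page_number}")
        elif page.char_count == 0:
            diagnostics.append(f"pdf.page.empty:{page.page_number}")
        else:
            continue  # page produced text; nothing to report
        skipped.append(page.page_number)
    total = len(outcomes)
    coverage = (total - len(skipped)) / total if total else 0.0
    if total and len(skipped) == total:
        diagnostics.append("pdf.no_extractable_text")
    return {
        "extraction_backend": backend,
        "page_coverage": round(coverage, 3),
        "skipped_pages": skipped,
        "warning_count": len(diagnostics),
        "lossy": True,  # assumption: plain-text extraction always drops layout
        "confidence": "high" if coverage == 1.0 else "low" if coverage < 0.5 else "medium",
    }
```

For example, a four-page document with one image-only page and one failed page would report a page coverage of 0.5, two skipped pages, and medium confidence under these thresholds.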

## P2.6 - Add fixtures, docs, and validation

```task
id: MKTF-WP-0002-T006
status: todo
priority: medium
state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"
```

Add small deterministic PDF fixtures and documentation.

Validation should cover:

- descriptor shape
- media type and extension matching
- metadata inspection
- page text normalization
- malformed or empty extraction behavior
- registry and entry point shape
- `markitect-tool` API use through `inspect_source` and `normalize_source`

Output: tests, README update, and validation command.
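
One way to keep fixtures small and deterministic is to generate them in the test suite rather than commit binary files. The sketch below hand-writes a one-page PDF with an uncompressed content stream and a classic cross-reference table, so the bytes contain no timestamps or random ids. `build_fixture_pdf` is a hypothetical helper; it assumes ASCII text and the standard Helvetica font, and whether a given backend extracts the text from it should be confirmed once the backend is chosen.

```python
def build_fixture_pdf(text: str = "Hello, Markitect") -> bytes:
    """Build a one-page PDF deterministically (ASCII text only, no metadata)."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", r"\\").replace("(", r"\(").replace(")", r"\)")
    stream = f"BT /F1 12 Tf 72 720 Td ({escaped}) Tj ET".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_at = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)
```

A malformed fixture can then be derived by truncating the returned bytes, which keeps the malformed-input tests just as reproducible.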

## Validation

Run from `markitect-filter`:

```bash
PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
```