| id | type | title | domain | status | owner | topic_slug | planning_priority | planning_order | depends_on_workplans | related_workplans | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MKTF-WP-0002 | workplan | PDF Read Adapter | markitect | todo | markitect-filter | markitect | P1 | 20 | | | 2026-05-14 | 2026-05-14 | 7445fe6b-f1a9-4383-8053-4337337dc095 |
MKTF-WP-0002: PDF Read Adapter
Purpose
Implement the second concrete markitect-filter source adapter:
source.pdf, a read-only PDF adapter that satisfies the markitect-tool
source adapter contract.
The contract dependency is cross-repo and is tracked as related work rather
than a same-repo State Hub dependency edge: markitect-tool MKTT-WP-0018.
The first PDF slice should target deterministic text extraction from digitally-readable PDFs. It should preserve page-level provenance and make extraction uncertainty visible through diagnostics and quality signals.
Planned Scope
- Optional PDF dependency profile isolated behind a pdf extra.
- Entry point group registration: markitect_tool.source_adapters.
- Lightweight pdf_adapter_descriptor.
- Adapter id source.pdf with media type application/pdf and extension .pdf.
- Inspection for basic PDF metadata, page count, encryption status, and extractability signals.
- Read-only page text extraction into ordered Markdown segments.
- Page-aware source provenance with source paths, page numbers, page labels where available, and stable segment ids.
- Configurable first-slice options such as page range, page break markers, and whitespace normalization policy.
- Structured diagnostics for malformed PDFs, encrypted PDFs, image-only or scanned pages, empty extraction, partial page failures, unsupported embedded media, and lossy layout/table handling.
- Quality metadata for confidence, lossiness, skipped pages, warning counts, extraction backend, and page coverage.
- Tests for descriptor shape, matching, inspection, normalization, malformed inputs, encrypted or non-extractable inputs where fixtures allow, Markitect API registry use, and entry point shape.
Non-Goals
- OCR or scanned-document recognition.
- Pixel-perfect layout preservation.
- Table reconstruction beyond plain text and diagnostics.
- Image, figure, annotation, form, signature, or attachment extraction beyond future metadata/diagnostic hooks.
- PDF writing/export.
- Network fetching.
- External processes or native system services in the first slice.
- Making PDF dependencies mandatory for EPUB3 or other adapters.
P2.1 - Pin PDF v1 dependency and extraction policy
id: MKTF-WP-0002-T001
status: todo
priority: high
state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"
Choose the first PDF extraction backend and dependency profile.
The decision should document:
- pure-Python preference for the first slice
- optional dependency placement under the pdf extra
- supported inputs: local, digitally-readable PDFs
- unsupported inputs: scanned/image-only PDFs without OCR
- encrypted/permission-restricted PDF behavior
- how page range, page breaks, and whitespace normalization should behave
- fallback or future status for heavier layout/OCR backends
Output: dependency decision, option contract, and implementation notes.
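One possible shape for the option contract is sketched below. The class name, field names, range syntax ("1-3,5"), and policy names ("preserve", "collapse") are all assumptions made for illustration; the workplan leaves these decisions to this task.

```python
from dataclasses import dataclass
import re
from typing import Optional


@dataclass(frozen=True)
class PdfReadOptions:
    """Hypothetical first-slice options; every field name is an assumption."""
    page_range: Optional[str] = None  # e.g. "1-3,5"; None means all pages
    page_breaks: bool = False         # insert a marker between pages
    whitespace: str = "collapse"      # "preserve" or "collapse"


def resolve_page_range(spec: Optional[str], page_count: int) -> list[int]:
    """Expand a 1-based range spec into sorted, de-duplicated page numbers."""
    if spec is None or not spec.strip():
        return list(range(1, page_count + 1))
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        match = re.fullmatch(r"(\d+)(?:-(\d+))?", part)
        if match is None:
            raise ValueError(f"invalid page range part: {part!r}")
        start = int(match.group(1))
        end = int(match.group(2) or start)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range part: {part!r}")
        # Pages past the end of the document are clipped silently here;
        # a real adapter might emit a diagnostic instead.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


def normalize_whitespace(text: str, policy: str) -> str:
    """Apply the whitespace policy to the text of one page."""
    if policy == "preserve":
        return text
    if policy == "collapse":
        # Squeeze runs of spaces/tabs, trim each line, cap blank-line runs at one.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
        return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
    raise ValueError(f"unknown whitespace policy: {policy!r}")
```

Resolving the range to an explicit sorted page list up front keeps page iteration deterministic regardless of how the caller wrote the spec.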
P2.2 - Add descriptor and entry point registration
id: MKTF-WP-0002-T002
status: todo
priority: high
state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"
Add a pdf_adapter_descriptor matching the existing EPUB3 descriptor pattern.
The descriptor should define:
- adapter id source.pdf
- version 1
- media type application/pdf
- extension .pdf
- read operation only
- safety metadata with local reads only
- option schema for page range, page breaks, and whitespace normalization
- quality profile and dependency metadata
- lazy factory import for the PDF adapter implementation
Output: descriptor, entry point registration, and descriptor tests.
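The descriptor fields listed above could look roughly like the following. AdapterDescriptor here is an illustrative stand-in, not the markitect-tool type, and the factory path, option schema layout, and safety keys are hypothetical; only the adapter id, version, media type, extension, and the pdf_adapter_descriptor name come from this workplan.

```python
from dataclasses import dataclass, field
from importlib import import_module
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class AdapterDescriptor:
    """Illustrative stand-in; the real descriptor type lives in markitect-tool."""
    adapter_id: str
    version: str
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]
    operations: tuple[str, ...]
    safety: dict[str, Any] = field(default_factory=dict)
    option_schema: dict[str, Any] = field(default_factory=dict)
    factory_path: str = ""  # "module:attribute", resolved only on demand

    def matches(self, media_type: Optional[str] = None,
                extension: Optional[str] = None) -> bool:
        """Match on media type first, then on file extension."""
        if media_type is not None and media_type.lower() in self.media_types:
            return True
        return extension is not None and extension.lower() in self.extensions

    def load_factory(self) -> Callable[..., Any]:
        """Import the adapter implementation lazily, at first use."""
        module_name, _, attr = self.factory_path.partition(":")
        return getattr(import_module(module_name), attr)


pdf_adapter_descriptor = AdapterDescriptor(
    adapter_id="source.pdf",
    version="1",
    media_types=("application/pdf",),
    extensions=(".pdf",),
    operations=("read",),
    safety={"local_reads_only": True},  # assumed key name
    option_schema={  # assumed schema layout
        "page_range": {"type": "string"},
        "page_breaks": {"type": "boolean", "default": False},
        "whitespace": {"enum": ["preserve", "collapse"], "default": "collapse"},
    },
    factory_path="markitect_filter.pdf.adapter:PdfAdapter",  # hypothetical path
)
```

Because the factory is stored as a string, importing the descriptor module does not import the PDF backend, which keeps the optional dependency out of the way for installs without the pdf extra. Registration would then point an entry in the markitect_tool.source_adapters group at pdf_adapter_descriptor; the entry name and module path are left open here.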
P2.3 - Implement PDF inspection
id: MKTF-WP-0002-T003
status: todo
priority: high
state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"
Implement inspect for PDF assets.
Inspection should report:
- title, creators/authors, subject, keywords, producer, creation/modification dates where available
- page count
- encryption or permission status
- basic extractability signals
- diagnostics for malformed, unreadable, encrypted, or unsupported PDFs
Output: inspection implementation and tests with small fixtures.
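Before a PDF backend is chosen, part of the malformed and encrypted diagnostics can be sketched with a byte-level pre-check that needs no library. This is a heuristic only and not a substitute for real parsing; the Diagnostic type and the diagnostic codes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnostic:
    code: str      # hypothetical diagnostic code
    severity: str  # "error" or "warning"
    message: str


def precheck_pdf_bytes(data: bytes) -> list[Diagnostic]:
    """Cheap structural checks that run before any PDF library is involved."""
    if not data:
        return [Diagnostic("pdf.empty_file", "error", "file is empty")]
    diagnostics: list[Diagnostic] = []
    # The header is searched for near the start rather than at offset 0,
    # since some producers prepend a few stray bytes.
    if b"%PDF-" not in data[:1024]:
        diagnostics.append(
            Diagnostic("pdf.malformed", "error", "missing %PDF- header"))
    # The end-of-file marker is expected near the end of the file.
    if b"%%EOF" not in data[-1024:]:
        diagnostics.append(
            Diagnostic("pdf.malformed", "warning", "missing %%EOF marker"))
    # Heuristic: an /Encrypt key usually signals an encrypted document, but it
    # can be missed inside compressed streams or match unrelated bytes.
    if b"/Encrypt" in data:
        diagnostics.append(
            Diagnostic("pdf.encrypted", "warning", "document declares /Encrypt"))
    return diagnostics
```

A pre-check like this lets inspect return a structured diagnostic for obviously broken input without relying on backend-specific exceptions; metadata, page count, and extractability signals would still come from the backend selected in P2.1.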
P2.4 - Normalize page text into Markitect Markdown
id: MKTF-WP-0002-T004
status: todo
priority: high
state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"
Implement read for digitally-readable PDFs.
Normalization should:
- iterate pages in deterministic order
- apply page range filtering
- convert extracted text into Markdown-safe segment text
- create one or more ordered segments with stable segment ids
- preserve page-level provenance on every segment
- optionally insert page break markers
- produce a stable document id and cache key through the Markitect source contract helpers
Output: read implementation and normalization tests.
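The normalization steps above can be sketched independently of the extraction backend, assuming page text has already been extracted into (page_number, text) pairs. The Segment type, the id scheme, and the page break marker are assumptions; document_id is taken as an input because the workplan derives it from the Markitect source contract helpers, and Markdown escaping is left out of the sketch.

```python
from dataclasses import dataclass
import hashlib
from typing import Iterable, Optional


@dataclass(frozen=True)
class Segment:
    segment_id: str
    order: int
    text: str
    source_path: str
    page_number: int           # 1-based physical page number
    page_label: Optional[str]  # e.g. "iv", when the PDF defines labels


def stable_segment_id(document_id: str, page_number: int) -> str:
    """Derive an id from stable inputs only, so reruns yield identical ids."""
    key = f"{document_id}:p{page_number}".encode("utf-8")
    return "seg-" + hashlib.sha256(key).hexdigest()[:16]


def build_segments(
    document_id: str,
    source_path: str,
    pages: Iterable[tuple[int, str]],
    page_labels: Optional[dict[int, str]] = None,
    page_break_marker: Optional[str] = None,
) -> list[Segment]:
    """Turn (page_number, text) pairs into ordered, page-aware segments."""
    labels = page_labels or {}
    segments: list[Segment] = []
    # Sort by page number so output order never depends on input order.
    for page_number, text in sorted(pages, key=lambda item: item[0]):
        body = text.strip()
        if not body:
            continue  # empty pages are a matter for diagnostics, not segments
        if page_break_marker and segments:
            body = f"{page_break_marker}\n\n{body}"
        segments.append(Segment(
            segment_id=stable_segment_id(document_id, page_number),
            order=len(segments),
            text=body,
            source_path=source_path,
            page_number=page_number,
            page_label=labels.get(page_number),
        ))
    return segments
```

This sketch emits one segment per page; splitting a page into several segments would need an extra ordinal in the id key.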
P2.5 - Add diagnostics and quality semantics
id: MKTF-WP-0002-T005
status: todo
priority: high
state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"
Define PDF-specific diagnostics and quality metadata.
The adapter should distinguish:
- malformed PDF
- encrypted or permission-restricted PDF
- no extractable text
- partially failed pages
- scanned/image-only pages
- dropped layout, tables, figures, annotations, or forms
- unsupported embedded resources
Quality should include extraction backend, page coverage, warning count, skipped pages, lossiness, and confidence.
Output: diagnostic helpers, quality rules, and tests.
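A minimal sketch of how the quality fields might be derived from per-page outcomes follows. The outcome categories, confidence thresholds, and the rule that plain-text extraction is always marked lossy are assumptions chosen for illustration, not decided policy.

```python
from dataclasses import dataclass
from enum import Enum


class PageOutcome(Enum):
    EXTRACTED = "extracted"
    EMPTY = "empty"    # no extractable text, e.g. scanned or image-only page
    FAILED = "failed"  # backend raised while reading the page


@dataclass(frozen=True)
class Quality:
    backend: str
    page_coverage: float  # extracted pages / requested pages
    skipped_pages: tuple[int, ...]
    warning_count: int
    lossy: bool
    confidence: str       # "high", "medium", or "low"


def summarize_quality(backend: str, outcomes: dict[int, PageOutcome],
                      warning_count: int) -> Quality:
    """Roll per-page outcomes up into document-level quality metadata."""
    total = len(outcomes)
    extracted = sum(1 for o in outcomes.values() if o is PageOutcome.EXTRACTED)
    skipped = tuple(sorted(
        page for page, o in outcomes.items() if o is not PageOutcome.EXTRACTED))
    coverage = extracted / total if total else 0.0
    # Thresholds below are placeholders, not tuned values.
    if coverage == 1.0 and warning_count == 0:
        confidence = "high"
    elif coverage >= 0.5:
        confidence = "medium"
    else:
        confidence = "low"
    # Assumption: text-only extraction drops layout, so it is always lossy.
    return Quality(backend, coverage, skipped, warning_count, True, confidence)
```

Keeping EMPTY and FAILED as separate outcomes lets the adapter report "no extractable text" and "partially failed pages" as distinct diagnostics while both still count against page coverage.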
P2.6 - Add fixtures, docs, and validation
id: MKTF-WP-0002-T006
status: todo
priority: medium
state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"
Add small deterministic PDF fixtures and documentation.
Validation should cover:
- descriptor shape
- media type and extension matching
- metadata inspection
- page text normalization
- malformed or empty extraction behavior
- registry and entry point shape
- markitect-tool API use through inspect_source and normalize_source
Output: tests, README update, and validation command.
Validation
Run from markitect-filter:
PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest