| id | type | title | domain | status | owner | topic_slug | planning_priority | planning_order | depends_on_workplans | related_workplans | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MKTF-WP-0002 | workplan | PDF Read Adapter | markitect | done | markitect-filter | markitect | complete | 20 | | | 2026-05-14 | 2026-05-14 | 7445fe6b-f1a9-4383-8053-4337337dc095 |
MKTF-WP-0002: PDF Read Adapter
Purpose
Implement the second concrete markitect-filter source adapter:
source.pdf, a read-only PDF adapter that satisfies the markitect-tool
source adapter contract.
The contract dependency is cross-repo and is tracked as related work rather
than a same-repo State Hub dependency edge: markitect-tool MKTT-WP-0018.
The first PDF slice should target deterministic text extraction from digitally-readable PDFs. It should preserve page-level provenance and make extraction uncertainty visible through diagnostics and quality signals.
Implemented Scope
- Optional PDF dependency profile isolated behind a pdf extra.
- Entry point group registration: markitect_tool.source_adapters.
- Lightweight pdf_adapter_descriptor.
- Adapter id source.pdf with media type application/pdf and extension .pdf.
- Inspection for basic PDF metadata, page count, encryption status, and extractability signals.
- Read-only page text extraction into ordered Markdown segments.
- Page-aware source provenance with source paths, page numbers, page labels where available, and stable segment ids.
- Configurable first-slice options such as page range, page break markers, and whitespace normalization policy.
- Structured diagnostics for malformed PDFs, encrypted PDFs, image-only or scanned pages, empty extraction, partial page failures, unsupported embedded media, and lossy layout/table handling.
- Quality metadata for confidence, lossiness, skipped pages, warning counts, extraction backend, and page coverage.
- Tests for descriptor shape, matching, inspection, normalization, malformed inputs, encrypted or non-extractable inputs where fixtures allow, Markitect API registry use, and entry point shape.
Non-Goals
- OCR or scanned-document recognition.
- Pixel-perfect layout preservation.
- Table reconstruction beyond plain text and diagnostics.
- Image, figure, annotation, form, signature, or attachment extraction beyond future metadata/diagnostic hooks.
- PDF writing/export.
- Network fetching.
- External processes or native system services in the first slice.
- Making PDF dependencies mandatory for EPUB3 or other adapters.
P2.1 - Pin PDF v1 dependency and extraction policy
id: MKTF-WP-0002-T001
status: done
priority: high
state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"
Choose the first PDF extraction backend and dependency profile.
The decision should document:
- pure-Python preference for the first slice
- optional dependency placement under the pdf extra
- supported inputs: local, digitally-readable PDFs
- unsupported inputs: scanned/image-only PDFs without OCR
- encrypted/permission-restricted PDF behavior
- how page range, page breaks, and whitespace normalization should behave
- fallback or future status for heavier layout/OCR backends
Output: dependency decision, option contract, and implementation notes.
Implemented: docs/pdf-adapter.md, pyproject.toml, and the descriptor
metadata document a stdlib first slice, a reserved pdf extra, local
digitally-readable PDF support, page range/page marker/whitespace options, and
deferred OCR/layout-heavy backends.
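The option contract for page range and whitespace normalization can be illustrated with a minimal sketch. The names PdfReadOptions, parse_page_range, and normalize_whitespace, the range syntax "2-4,7", and the policy values "preserve" and "collapse" are assumptions made for illustration; docs/pdf-adapter.md remains the authority on the real option names and behavior.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PdfReadOptions:
    """Hypothetical first-slice options; field names are illustrative only."""
    page_range: Optional[str] = None   # e.g. "2-4,7"; None selects every page
    page_break_markers: bool = False
    whitespace: str = "collapse"       # assumed policies: "preserve" or "collapse"


def parse_page_range(spec: Optional[str], page_count: int) -> List[int]:
    """Expand a 1-based range spec into sorted, de-duplicated page numbers."""
    if spec is None or not spec.strip():
        return list(range(1, page_count + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        match = re.fullmatch(r"(\d+)(?:-(\d+))?", part)
        if match is None:
            raise ValueError(f"invalid page range part: {part!r}")
        start = int(match.group(1))
        end = int(match.group(2) or start)
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"page range out of bounds: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


def normalize_whitespace(text: str, policy: str) -> str:
    """Apply the assumed whitespace policy to extracted page text."""
    if policy == "preserve":
        return text
    if policy == "collapse":
        # Collapse runs of spaces and tabs per line, keep line breaks.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
        return "\n".join(lines).strip()
    raise ValueError(f"unknown whitespace policy: {policy!r}")
```

In this sketch an out-of-bounds range raises ValueError; an adapter could equally turn that condition into an invalid-page-range diagnostic.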
P2.2 - Add descriptor and entry point registration
id: MKTF-WP-0002-T002
status: done
priority: high
state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"
Add a pdf_adapter_descriptor matching the existing EPUB3 descriptor pattern.
The descriptor should define:
- adapter id source.pdf
- version 1
- media type application/pdf
- extension .pdf
- read operation only
- safety metadata with local reads only
- option schema for page range, page breaks, and whitespace normalization
- quality profile and dependency metadata
- lazy factory import for the PDF adapter implementation
Output: descriptor, entry point registration, and descriptor tests.
Implemented: pdf_adapter_descriptor is registered through
markitect_tool.source_adapters, exported from the package, and covered by
descriptor and discovery tests.
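A rough sketch of the descriptor shape and entry point discovery follows. Only the adapter id, version, media type, extension, read-only operation, and the group name markitect_tool.source_adapters come from this workplan. The class AdapterDescriptorSketch, its field names, and the factory path markitect_filter.pdf:PdfReadAdapter are hypothetical stand-ins, not the markitect-tool contract types.

```python
import importlib
from dataclasses import dataclass
from importlib.metadata import entry_points
from typing import Optional, Tuple

ENTRY_POINT_GROUP = "markitect_tool.source_adapters"  # group name from this workplan


@dataclass(frozen=True)
class AdapterDescriptorSketch:
    """Illustrative descriptor; the real contract type lives in markitect-tool."""
    adapter_id: str
    version: str
    media_types: Tuple[str, ...]
    extensions: Tuple[str, ...]
    operations: Tuple[str, ...]
    factory: str  # "module:attribute", resolved only when the adapter is needed

    def matches(self, filename: str, media_type: Optional[str] = None) -> bool:
        """Match on declared media type first, then on file extension."""
        if media_type is not None and media_type in self.media_types:
            return True
        return filename.lower().endswith(self.extensions)


pdf_descriptor_sketch = AdapterDescriptorSketch(
    adapter_id="source.pdf",
    version="1",
    media_types=("application/pdf",),
    extensions=(".pdf",),
    operations=("read",),
    factory="markitect_filter.pdf:PdfReadAdapter",  # hypothetical module path
)


def load_factory(descriptor: AdapterDescriptorSketch):
    """Import the adapter implementation lazily from its 'module:attribute' path."""
    module_name, _, attribute = descriptor.factory.partition(":")
    return getattr(importlib.import_module(module_name), attribute)


def discover_adapter_names() -> list:
    """List entry point names registered under the source adapter group."""
    found = entry_points()
    if hasattr(found, "select"):          # Python 3.10+ selection API
        selected = found.select(group=ENTRY_POINT_GROUP)
    else:                                 # older dict-style return value
        selected = found.get(ENTRY_POINT_GROUP, [])
    return sorted(ep.name for ep in selected)
```

Keeping the factory as a string means importing the descriptor does not import the PDF backend, which is what lets the descriptor stay lightweight.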
P2.3 - Implement PDF inspection
id: MKTF-WP-0002-T003
status: done
priority: high
state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"
Implement inspect for PDF assets.
Inspection should report:
- title, creators/authors, subject, keywords, producer, creation/modification dates where available
- page count
- encryption or permission status
- basic extractability signals
- diagnostics for malformed, unreadable, encrypted, or unsupported PDFs
Output: inspection implementation and tests with small fixtures.
Implemented: PdfReadAdapter.inspect reports metadata, page count,
extractability signals, encryption status, quality metadata, and malformed or
encrypted diagnostics using deterministic generated fixtures.
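To make the inspection signals concrete, here is a deliberately naive stdlib sketch. It scans raw bytes with regular expressions rather than parsing the PDF object graph, so it would miss pages stored in compressed object streams. The function name inspect_pdf_bytes, the result keys, and the diagnostic codes are assumptions, not the actual PdfReadAdapter.inspect output.

```python
import re


def inspect_pdf_bytes(data: bytes) -> dict:
    """Heuristic inspection of raw PDF bytes (illustrative, not a real parser)."""
    diagnostics = []
    if not data.startswith(b"%PDF-"):
        diagnostics.append(
            {"code": "pdf.malformed", "severity": "error",
             "message": "missing %PDF- header"}
        )
        return {"page_count": 0, "encrypted": False,
                "extractable": False, "diagnostics": diagnostics}

    # An /Encrypt key signals an encrypted document.
    encrypted = re.search(rb"/Encrypt(?![A-Za-z])", data) is not None
    if encrypted:
        diagnostics.append(
            {"code": "pdf.encrypted", "severity": "error",
             "message": "document declares an /Encrypt dictionary"}
        )

    # Count page objects; the lookahead keeps "/Type /Pages" from matching.
    page_count = len(re.findall(rb"/Type\s*/Page(?![A-Za-z])", data))
    if page_count == 0:
        diagnostics.append(
            {"code": "pdf.no_pages", "severity": "warning",
             "message": "no uncompressed page objects found"}
        )

    return {
        "page_count": page_count,
        "encrypted": encrypted,
        "extractable": page_count > 0 and not encrypted,
        "diagnostics": diagnostics,
    }
```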
P2.4 - Normalize page text into Markitect Markdown
id: MKTF-WP-0002-T004
status: done
priority: high
state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"
Implement read for digitally-readable PDFs.
Normalization should:
- iterate pages in deterministic order
- apply page range filtering
- convert extracted text into Markdown-safe segment text
- create one or more ordered segments with stable segment ids
- preserve page-level provenance on every segment
- optionally insert page break markers
- produce a stable document id and cache key through the Markitect source contract helpers
Output: read implementation and normalization tests.
Implemented: PdfReadAdapter.read extracts ordered page text into stable
page segments, applies page ranges, supports optional page markers, preserves
page provenance, and uses the Markitect cache-key helpers.
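The segment-building step can be sketched as below, assuming page text has already been extracted. The function build_segments, the segment dictionary keys, the id format, and the HTML-comment page marker are all hypothetical; the real adapter derives document ids and cache keys from the Markitect source contract helpers, which this sketch does not reproduce.

```python
import hashlib
from typing import Dict, Iterable, List


def build_segments(
    source_path: str,
    page_texts: Dict[int, str],
    selected_pages: Iterable[int],
    page_markers: bool = False,
) -> List[dict]:
    """Turn extracted page text into ordered segments with page provenance.

    page_texts maps 1-based page numbers to already-extracted text.
    """
    segments = []
    for order, page_number in enumerate(sorted(set(selected_pages))):
        text = page_texts.get(page_number, "")
        if page_markers:
            # Assumed marker format; an HTML comment stays invisible in Markdown.
            text = f"<!-- page {page_number} -->\n\n{text}"
        # The id depends only on source path and page number, so repeated
        # reads of the same input yield the same ids.
        digest = hashlib.sha256(
            f"{source_path}#page={page_number}".encode("utf-8")
        ).hexdigest()[:12]
        segments.append(
            {
                "id": f"pdf-page-{page_number:04d}-{digest}",
                "order": order,
                "text": text,
                "provenance": {
                    "source_path": source_path,
                    "page_number": page_number,
                },
            }
        )
    return segments
```

Deriving the id from the source path and page number, rather than from the text, keeps ids stable when whitespace options change; whether that trade-off matches the real adapter is not stated in this workplan.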
P2.5 - Add diagnostics and quality semantics
id: MKTF-WP-0002-T005
status: done
priority: high
state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"
Define PDF-specific diagnostics and quality metadata.
The adapter should distinguish:
- malformed PDF
- encrypted or permission-restricted PDF
- no extractable text
- partially failed pages
- scanned/image-only pages
- dropped layout, tables, figures, annotations, or forms
- unsupported embedded resources
Quality should include extraction backend, page coverage, warning count, skipped pages, lossiness, and confidence.
Output: diagnostic helpers, quality rules, and tests.
Implemented: PDF diagnostics cover malformed files, unreadable files, encrypted PDFs, invalid page ranges, missing/empty streams, image-only pages, empty extraction, and stream decompression failures. Quality metadata records backend, page count, selected pages, extracted pages, coverage, warnings, and skipped pages.
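As an illustration of how the recorded quality fields relate to one another, the following sketch derives coverage, skipped pages, and warning count from the selected and extracted page lists. The function summarize_quality, the key names, and the "severity" field on diagnostics are assumptions for this example.

```python
from typing import Iterable, List


def summarize_quality(
    backend: str,
    page_count: int,
    selected_pages: Iterable[int],
    extracted_pages: Iterable[int],
    diagnostics: List[dict],
) -> dict:
    """Derive page coverage and skip information for a single read."""
    selected = sorted(set(selected_pages))
    # Only pages that were both selected and extracted count toward coverage.
    extracted = sorted(set(extracted_pages) & set(selected))
    skipped = [page for page in selected if page not in extracted]
    coverage = len(extracted) / len(selected) if selected else 0.0
    warnings = sum(1 for d in diagnostics if d.get("severity") == "warning")
    return {
        "backend": backend,
        "page_count": page_count,
        "selected_pages": selected,
        "extracted_pages": extracted,
        "skipped_pages": skipped,
        "coverage": coverage,
        "warnings": warnings,
    }
```

For example, selecting four pages and extracting three of them gives a coverage of 0.75 and one skipped page.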
P2.6 - Add fixtures, docs, and validation
id: MKTF-WP-0002-T006
status: done
priority: medium
state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"
Add small deterministic PDF fixtures and documentation.
Validation should cover:
- descriptor shape
- media type and extension matching
- metadata inspection
- page text normalization
- malformed or empty extraction behavior
- registry and entry point shape
- markitect-tool API use through inspect_source and normalize_source
Output: tests, README update, and validation command.
Implemented: generated PDF fixtures and tests cover descriptor shape, matching, metadata inspection, normalization, page range markers, malformed PDFs, encrypted PDFs, registry use, entry point discovery, README documentation, and the validation command below.
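Generated fixtures of this kind can be produced without any PDF library. The sketch below writes a small uncompressed PDF with one line of text per page and a correct cross-reference table. The function make_pdf is a hypothetical helper, not the repository's fixture generator, and it assumes Latin-1 text with no line breaks.

```python
from typing import List


def make_pdf(page_texts: List[str]) -> bytes:
    """Build a tiny deterministic PDF with one text line per page."""
    count = len(page_texts)
    # Object layout: 1 catalog, 2 page tree, 3 font,
    # then for page i: page object 4 + 2i and content stream 5 + 2i.
    kids = " ".join(f"{4 + 2 * i} 0 R" for i in range(count))
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        f"<< /Type /Pages /Kids [{kids}] /Count {count} >>".encode("ascii"),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    for i, text in enumerate(page_texts):
        # Escape the characters that are special inside PDF literal strings.
        escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        stream = f"BT /F1 12 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")
        objects.append(
            (
                "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
                "/Resources << /Font << /F1 3 0 R >> >> "
                f"/Contents {5 + 2 * i} 0 R >>"
            ).encode("ascii")
        )
        objects.append(
            b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream"
        )

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"           # each xref entry is exactly 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1,
        xref_offset,
    )
    return bytes(out)
```

Because the output contains no timestamps or random ids, the same input always yields identical bytes, which suits tests that compare ids or cache keys across runs.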
Validation
Run from markitect-filter:
PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest