tegwick 0c9a418e85 feat(source): add pdf read adapter

2026-05-14 23:33:31 +02:00

id, type, title, domain, status, owner, topic_slug, planning_priority, planning_order, depends_on_workplans, related_workplans, created, updated, state_hub_workstream_id

MKTF-WP-0002

workplan

PDF Read Adapter

markitect

done

markitect-filter

markitect

complete

MKTF-WP-0001

MKTT-WP-0018

2026-05-14

7445fe6b-f1a9-4383-8053-4337337dc095

MKTF-WP-0002: PDF Read Adapter

Purpose

Implement the second concrete markitect-filter source adapter: source.pdf, a read-only PDF adapter that satisfies the markitect-tool source adapter contract.

The contract dependency, markitect-tool MKTT-WP-0018, lives in another repository, so it is tracked as related work rather than as a same-repo State Hub dependency edge.

The first PDF slice should target deterministic text extraction from digitally-readable PDFs. It should preserve page-level provenance and make extraction uncertainty visible through diagnostics and quality signals.

Implemented Scope

Optional PDF dependency profile isolated behind a pdf extra.
Entry point group registration: markitect_tool.source_adapters.
Lightweight pdf_adapter_descriptor.
Adapter id source.pdf with media type application/pdf and extension .pdf.
Inspection for basic PDF metadata, page count, encryption status, and extractability signals.
Read-only page text extraction into ordered Markdown segments.
Page-aware source provenance with source paths, page numbers, page labels where available, and stable segment ids.
Configurable first-slice options such as page range, page break markers, and whitespace normalization policy.
Structured diagnostics for malformed PDFs, encrypted PDFs, image-only or scanned pages, empty extraction, partial page failures, unsupported embedded media, and lossy layout/table handling.
Quality metadata for confidence, lossiness, skipped pages, warning counts, extraction backend, and page coverage.
Tests for descriptor shape, matching, inspection, normalization, malformed inputs, encrypted or non-extractable inputs where fixtures allow, Markitect API registry use, and entry point shape.

Non-Goals

OCR or scanned-document recognition.
Pixel-perfect layout preservation.
Table reconstruction beyond plain text and diagnostics.
Image, figure, annotation, form, signature, or attachment extraction beyond future metadata/diagnostic hooks.
PDF writing/export.
Network fetching.
External processes or native system services in the first slice.
Making PDF dependencies mandatory for EPUB3 or other adapters.

P2.1 - Pin PDF v1 dependency and extraction policy

id: MKTF-WP-0002-T001
status: done
priority: high
state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"

Choose the first PDF extraction backend and dependency profile.

The decision should document:

pure-Python preference for the first slice
optional dependency placement under the pdf extra
supported inputs: local, digitally-readable PDFs
unsupported inputs: scanned/image-only PDFs without OCR
encrypted/permission-restricted PDF behavior
how page range, page breaks, and whitespace normalization should behave
fallback or future status for heavier layout/OCR backends

Output: dependency decision, option contract, and implementation notes.

Implemented: docs/pdf-adapter.md, pyproject.toml, and the descriptor metadata document a stdlib first slice, a reserved pdf extra, local digitally-readable PDF support, page range/page marker/whitespace options, and deferred OCR/layout-heavy backends.
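
The option contract above could be sketched as follows. This is a minimal illustration, not the shipped implementation: the names `PdfReadOptions`, `parse_page_range`, and `normalize_whitespace`, the `"1-3,5"` range syntax (1-based, inclusive), and the two whitespace policies are all assumptions. The authoritative contract is the one in docs/pdf-adapter.md.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PdfReadOptions:
    """Hypothetical first-slice options; field names are illustrative only."""
    page_range: Optional[str] = None        # e.g. "1-3,5"; None selects all pages
    page_break_markers: bool = False        # emit a marker between pages
    whitespace_policy: str = "collapse"     # assumed policies: "collapse", "preserve"


def parse_page_range(spec: Optional[str], page_count: int) -> list[int]:
    """Expand a 1-based, inclusive range spec into sorted unique page numbers.

    Raises ValueError for malformed or out-of-bounds input so that a caller
    can convert it into an invalid-page-range diagnostic.
    """
    if spec is None or not spec.strip():
        return list(range(1, page_count + 1))
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        first, sep, last = part.partition("-")
        start = int(first)                  # int() raises ValueError on junk
        end = int(last) if sep else start
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"page range {part!r} is outside 1..{page_count}")
        pages.update(range(start, end + 1))
    return sorted(pages)


def normalize_whitespace(text: str, policy: str) -> str:
    """Apply the assumed whitespace policy to one page of extracted text."""
    if policy == "preserve":
        return text
    # "collapse": squeeze whitespace runs inside each line, trim outer blanks.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(lines).strip()
```

Raising `ValueError` rather than silently clamping keeps the behavior deterministic and leaves the choice of diagnostic to the adapter.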

P2.2 - Add descriptor and entry point registration

id: MKTF-WP-0002-T002
status: done
priority: high
state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"

Add a pdf_adapter_descriptor matching the existing EPUB3 descriptor pattern.

The descriptor should define:

adapter id source.pdf
version 1
media type application/pdf
extension .pdf
read operation only
safety metadata with local reads only
option schema for page range, page breaks, and whitespace normalization
quality profile and dependency metadata
lazy factory import for the PDF adapter implementation

Output: descriptor, entry point registration, and descriptor tests.

Implemented: pdf_adapter_descriptor is registered through markitect_tool.source_adapters, exported from the package, and covered by descriptor and discovery tests.
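
A rough sketch of a descriptor with a lazy factory is shown below. The `AdapterDescriptor` class, its field names, and the factory path `markitect_filter.pdf:PdfReadAdapter` are assumptions made for illustration; the real descriptor type comes from the markitect-tool source adapter contract and may look different. Only the adapter id, version, media type, extension, and read-only operation are taken from this workplan.

```python
from dataclasses import dataclass, field
from importlib import import_module
from typing import Any, Optional


@dataclass(frozen=True)
class AdapterDescriptor:
    """Illustrative stand-in for the markitect-tool descriptor type."""
    adapter_id: str
    version: int
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]
    operations: tuple[str, ...]
    safety: dict[str, Any] = field(default_factory=dict)
    factory_path: str = ""   # "module:attribute", resolved only on demand

    def matches(self, media_type: Optional[str] = None,
                extension: Optional[str] = None) -> bool:
        """Match on media type first, then on a case-insensitive extension."""
        if media_type is not None and media_type.lower() in self.media_types:
            return True
        return extension is not None and extension.lower() in self.extensions

    def load_factory(self) -> Any:
        """Import the adapter module lazily, the first time it is needed."""
        module_name, _, attribute = self.factory_path.partition(":")
        return getattr(import_module(module_name), attribute)


pdf_adapter_descriptor = AdapterDescriptor(
    adapter_id="source.pdf",
    version=1,
    media_types=("application/pdf",),
    extensions=(".pdf",),
    operations=("read",),
    safety={"local_reads_only": True},                  # assumed key name
    factory_path="markitect_filter.pdf:PdfReadAdapter",  # hypothetical module path
)
```

Keeping the factory as a string means that importing the descriptor module does not import any PDF code, which is what keeps discovery lightweight.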

P2.3 - Implement PDF inspection

id: MKTF-WP-0002-T003
status: done
priority: high
state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"

Implement inspect for PDF assets.

Inspection should report:

title, creators/authors, subject, keywords, producer, creation/modification dates where available
page count
encryption or permission status
basic extractability signals
diagnostics for malformed, unreadable, encrypted, or unsupported PDFs

Output: inspection implementation and tests with small fixtures.

Implemented: PdfReadAdapter.inspect reports metadata, page count, extractability signals, encryption status, quality metadata, and malformed or encrypted diagnostics using deterministic generated fixtures.
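
To make the inspection signals concrete, here is a deliberately naive stdlib sketch that scans raw bytes. It is not the logic of `PdfReadAdapter.inspect`: the function name, the result fields, and the diagnostic codes are hypothetical, and byte scanning of this kind only sees uncompressed objects and streams.

```python
import re
from dataclasses import dataclass, field


@dataclass
class PdfInspection:
    """Hypothetical inspection result; field names are illustrative."""
    is_pdf: bool
    page_count: int = 0
    encrypted: bool = False
    has_text_operators: bool = False
    diagnostics: list[str] = field(default_factory=list)


# Matches "/Type /Page" but not "/Type /Pages" (the page tree node).
_PAGE_RE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")
# A BT ... ET pair marks a text object in an uncompressed content stream.
_TEXT_RE = re.compile(rb"\bBT\b.*?\bET\b", re.DOTALL)


def inspect_pdf_bytes(data: bytes) -> PdfInspection:
    """Collect coarse signals from raw PDF bytes without parsing objects."""
    if not data.lstrip().startswith(b"%PDF-"):
        return PdfInspection(is_pdf=False,
                             diagnostics=["pdf.malformed: missing %PDF- header"])
    result = PdfInspection(is_pdf=True)
    if b"%%EOF" not in data[-1024:]:
        result.diagnostics.append("pdf.malformed: missing %%EOF marker")
    result.encrypted = b"/Encrypt" in data
    if result.encrypted:
        result.diagnostics.append("pdf.encrypted: text extraction is not attempted")
    result.page_count = len(_PAGE_RE.findall(data))
    result.has_text_operators = bool(_TEXT_RE.search(data))
    if result.page_count and not result.has_text_operators and not result.encrypted:
        result.diagnostics.append("pdf.no_extractable_text: possibly image-only")
    return result
```

A real implementation has to walk the cross-reference table and decode stream filters before drawing these conclusions; the sketch only shows how each signal maps onto a diagnostic.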

P2.4 - Normalize page text into Markitect Markdown

id: MKTF-WP-0002-T004
status: done
priority: high
state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"

Implement read for digitally-readable PDFs.

Normalization should:

iterate pages in deterministic order
apply page range filtering
convert extracted text into Markdown-safe segment text
create one or more ordered segments with stable segment ids
preserve page-level provenance on every segment
optionally insert page break markers
produce a stable document id and cache key through the Markitect source contract helpers

Output: read implementation and normalization tests.

Implemented: PdfReadAdapter.read extracts ordered page text into stable page segments, applies page ranges, supports optional page markers, preserves page provenance, and uses the Markitect cache-key helpers.
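
The normalization steps above can be sketched as below, assuming page text has already been extracted into a `{page_number: text}` mapping. The `Segment` shape, the id format, and the HTML-comment page marker are assumptions for illustration; the real adapter derives document ids and cache keys from the Markitect source contract helpers, which this sketch does not model.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment shape carrying page-level provenance."""
    segment_id: str
    order: int
    text: str
    source_path: str
    page_number: int


def stable_segment_id(source_path: str, page_number: int, text: str) -> str:
    """Derive an id from content and position, never from time or memory."""
    payload = f"{source_path}\x00{page_number}\x00{text}".encode("utf-8")
    return f"pdf-p{page_number:04d}-{hashlib.sha256(payload).hexdigest()[:12]}"


def pages_to_segments(source_path: str, pages: dict[int, str],
                      selected: list[int],
                      page_markers: bool = False) -> list[Segment]:
    """Turn selected pages into ordered segments, skipping empty pages."""
    segments: list[Segment] = []
    for page_number in sorted(set(selected)):      # deterministic page order
        text = pages.get(page_number, "").strip()
        if not text:
            continue                               # caller records a diagnostic
        # The id uses the unmarked text so toggling markers keeps ids stable.
        segment_id = stable_segment_id(source_path, page_number, text)
        if page_markers:
            text = f"<!-- page {page_number} -->\n\n{text}"
        segments.append(Segment(segment_id, len(segments), text,
                                source_path, page_number))
    return segments
```

Hashing the source path, page number, and text together means that two runs over the same file yield identical ids, while an edit to one page changes only that page's segment id.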

P2.5 - Add diagnostics and quality semantics

id: MKTF-WP-0002-T005
status: done
priority: high
state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"

Define PDF-specific diagnostics and quality metadata.

The adapter should distinguish:

malformed PDF
encrypted or permission-restricted PDF
no extractable text
partially failed pages
scanned/image-only pages
dropped layout, tables, figures, annotations, or forms
unsupported embedded resources

Quality should include extraction backend, page coverage, warning count, skipped pages, lossiness, and confidence.

Output: diagnostic helpers, quality rules, and tests.

Implemented: PDF diagnostics cover malformed files, unreadable files, encrypted PDFs, invalid page ranges, missing/empty streams, image-only pages, empty extraction, and stream decompression failures. Quality metadata records backend, page count, selected pages, extracted pages, coverage, warnings, and skipped pages.
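
One way the quality fields listed above could be derived from a read is sketched here. The `PdfQuality` shape, the `summarize_quality` helper, and the confidence thresholds are assumptions, not the adapter's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfQuality:
    """Hypothetical quality record; field names mirror the prose above."""
    backend: str
    page_count: int
    selected_pages: int
    extracted_pages: int
    skipped_pages: tuple[int, ...]
    warning_count: int
    coverage: float
    lossy: bool
    confidence: str


def summarize_quality(page_count: int, selected: list[int],
                      extracted: list[int], warning_count: int,
                      backend: str = "stdlib") -> PdfQuality:
    """Compute page coverage and a coarse confidence label for one read."""
    selected_set = set(selected)
    extracted_set = set(extracted) & selected_set
    skipped = tuple(sorted(selected_set - extracted_set))
    coverage = len(extracted_set) / len(selected_set) if selected_set else 0.0
    # Assumed thresholds: full coverage without warnings is "high".
    if coverage == 1.0 and warning_count == 0:
        confidence = "high"
    elif coverage >= 0.5:
        confidence = "medium"
    else:
        confidence = "low"
    return PdfQuality(
        backend=backend,
        page_count=page_count,
        selected_pages=len(selected_set),
        extracted_pages=len(extracted_set),
        skipped_pages=skipped,
        warning_count=warning_count,
        coverage=coverage,
        lossy=True,   # plain-text extraction drops layout, tables, and figures
        confidence=confidence,
    )
```

Coverage is measured against the selected pages rather than the whole document, so a narrow page range that extracts cleanly is not penalized.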

P2.6 - Add fixtures, docs, and validation

id: MKTF-WP-0002-T006
status: done
priority: medium
state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"

Add small deterministic PDF fixtures and documentation.

Validation should cover:

descriptor shape
media type and extension matching
metadata inspection
page text normalization
malformed or empty extraction behavior
registry and entry point shape
markitect-tool API use through inspect_source and normalize_source

Output: tests, README update, and validation command.

Implemented: generated PDF fixtures and tests cover descriptor shape, matching, metadata inspection, normalization, page ranges and page markers, malformed PDFs, encrypted PDFs, registry use, entry point discovery, README documentation, and the validation command below.
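
As an illustration of what a deterministic generated fixture can look like, the sketch below builds a tiny uncompressed PDF in memory with one text line per page. The helper name `build_minimal_pdf` and the object layout are assumptions; the project's own fixture generator may work differently.

```python
def build_minimal_pdf(page_texts: list[str]) -> bytes:
    """Build a small uncompressed PDF with one text line per page.

    Object numbering: 1 = catalog, 2 = page tree, 3 = font, then for page i
    (0-based) the page object is 4 + 2*i and its content stream is 5 + 2*i.
    Text must be encodable as Latin-1.
    """
    count = len(page_texts)
    kids = " ".join(f"{4 + 2 * i} 0 R" for i in range(count))
    objects: list[bytes] = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        f"<< /Type /Pages /Kids [{kids}] /Count {count} >>".encode("ascii"),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    for i, text in enumerate(page_texts):
        # Escape the characters that are special inside a PDF literal string.
        escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        stream = f"BT /F1 12 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")
        objects.append(
            (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
             f"/Resources << /Font << /F1 3 0 R >> >> "
             f"/Contents {5 + 2 * i} 0 R >>").encode("ascii")
        )
        objects.append(b"<< /Length %d >>\nstream\n" % len(stream)
                       + stream + b"\nendstream")

    out = bytearray(b"%PDF-1.4\n")
    offsets: list[int] = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))            # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"         # each xref entry is 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_offset))
    return bytes(out)
```

Because the output depends only on the input strings, with no timestamps or random ids, tests can generate fixtures on the fly instead of committing binary files. Truncating the result or dropping its header gives a cheap malformed-input case.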

Validation

Run from markitect-filter:

PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
