markitect-filter/workplans/MKTF-WP-0002-pdf-read-adapter.md

id: MKTF-WP-0002
type: workplan
title: PDF Read Adapter
domain: markitect
status: done
owner: markitect-filter
topic_slug: markitect
planning_priority: complete
planning_order: 20
depends_on_workplans: MKTF-WP-0001
related_workplans: MKTT-WP-0018
created: 2026-05-14
updated: 2026-05-14
state_hub_workstream_id: 7445fe6b-f1a9-4383-8053-4337337dc095

MKTF-WP-0002: PDF Read Adapter

Purpose

Implement the second concrete markitect-filter source adapter: source.pdf, a read-only PDF adapter that satisfies the markitect-tool source adapter contract.

The contract dependency on markitect-tool MKTT-WP-0018 is cross-repo, so it is tracked as related work rather than as a same-repo State Hub dependency edge.

The first PDF slice should target deterministic text extraction from digitally-readable PDFs. It should preserve page-level provenance and make extraction uncertainty visible through diagnostics and quality signals.

Implemented Scope

  • Optional PDF dependency profile isolated behind a pdf extra.
  • Entry point group registration: markitect_tool.source_adapters.
  • Lightweight pdf_adapter_descriptor.
  • Adapter id source.pdf with media type application/pdf and extension .pdf.
  • Inspection for basic PDF metadata, page count, encryption status, and extractability signals.
  • Read-only page text extraction into ordered Markdown segments.
  • Page-aware source provenance with source paths, page numbers, page labels where available, and stable segment ids.
  • Configurable first-slice options such as page range, page break markers, and whitespace normalization policy.
  • Structured diagnostics for malformed PDFs, encrypted PDFs, image-only or scanned pages, empty extraction, partial page failures, unsupported embedded media, and lossy layout/table handling.
  • Quality metadata for confidence, lossiness, skipped pages, warning counts, extraction backend, and page coverage.
  • Tests for descriptor shape, matching, inspection, normalization, malformed inputs, encrypted or non-extractable inputs where fixtures allow, Markitect API registry use, and entry point shape.

Non-Goals

  • OCR or scanned-document recognition.
  • Pixel-perfect layout preservation.
  • Table reconstruction beyond plain text and diagnostics.
  • Image, figure, annotation, form, signature, or attachment extraction beyond future metadata/diagnostic hooks.
  • PDF writing/export.
  • Network fetching.
  • External processes or native system services in the first slice.
  • Making PDF dependencies mandatory for EPUB3 or other adapters.

P2.1 - Pin PDF v1 dependency and extraction policy

id: MKTF-WP-0002-T001
status: done
priority: high
state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"

Choose the first PDF extraction backend and dependency profile.

The decision should document:

  • pure-Python preference for the first slice
  • optional dependency placement under the pdf extra
  • supported inputs: local, digitally-readable PDFs
  • unsupported inputs: scanned/image-only PDFs without OCR
  • encrypted/permission-restricted PDF behavior
  • how page range, page breaks, and whitespace normalization should behave
  • fallback or future status for heavier layout/OCR backends

Output: dependency decision, option contract, and implementation notes.

Implemented: docs/pdf-adapter.md, pyproject.toml, and the descriptor metadata document a stdlib-only first slice, a reserved pdf extra, local digitally-readable PDF support, page range, page marker, and whitespace options, and deferred OCR and layout-heavy backends.
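
To make the option contract concrete, here is a minimal sketch of how page range selection and whitespace normalization could behave. The names PdfReadOptions, select_pages, and normalize_whitespace, the "start-end" range syntax, and the "preserve"/"collapse" policy values are assumptions for illustration; the actual contract is the one recorded in docs/pdf-adapter.md.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PdfReadOptions:
    """Hypothetical option container; field names are illustrative only."""
    page_range: Optional[str] = None   # e.g. "2-4", 1-based and inclusive
    page_break_markers: bool = False
    whitespace: str = "collapse"       # assumed policies: "preserve", "collapse"


def select_pages(page_range: Optional[str], page_count: int) -> list[int]:
    """Return the 1-based page numbers selected by "n" or "start-end"."""
    if page_range is None:
        return list(range(1, page_count + 1))
    match = re.fullmatch(r"\s*(\d+)\s*(?:-\s*(\d+)\s*)?", page_range)
    if match is None:
        raise ValueError(f"invalid page range: {page_range!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    if start < 1 or end < start or end > page_count:
        raise ValueError(f"page range {page_range!r} is outside 1-{page_count}")
    return list(range(start, end + 1))


def normalize_whitespace(text: str, policy: str) -> str:
    """Apply the assumed whitespace policy to one page of extracted text."""
    if policy == "preserve":
        return text
    if policy == "collapse":
        # Collapse runs of spaces and tabs, trim each line, drop blank lines.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
        return "\n".join(line for line in lines if line)
    raise ValueError(f"unknown whitespace policy: {policy!r}")
```

Rejecting an out-of-bounds range with an error, rather than silently clamping it, keeps the result deterministic and gives the adapter a clear place to emit an invalid page range diagnostic.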

P2.2 - Add descriptor and entry point registration

id: MKTF-WP-0002-T002
status: done
priority: high
state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"

Add a pdf_adapter_descriptor matching the existing EPUB3 descriptor pattern.

The descriptor should define:

  • adapter id source.pdf
  • version 1
  • media type application/pdf
  • extension .pdf
  • read operation only
  • safety metadata with local reads only
  • option schema for page range, page breaks, and whitespace normalization
  • quality profile and dependency metadata
  • lazy factory import for the PDF adapter implementation

Output: descriptor, entry point registration, and descriptor tests.

Implemented: pdf_adapter_descriptor is registered through markitect_tool.source_adapters, exported from the package, and covered by descriptor and discovery tests.
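
The descriptor fields listed above can be sketched as a small frozen dataclass with a lazily resolved factory. This is an illustration only: the AdapterDescriptor class, its field names, and the module path markitect_filter.pdf_adapter are assumptions, and the real descriptor type is defined by the markitect-tool source adapter contract.

```python
import importlib
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class AdapterDescriptor:
    """Hypothetical descriptor shape, not the markitect-tool type."""
    adapter_id: str
    version: int
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]
    operations: tuple[str, ...]
    factory_path: str  # "module:attribute", imported only when needed
    safety: dict[str, Any] = field(default_factory=dict)

    def matches(self, filename: str, media_type: Optional[str] = None) -> bool:
        """Match on a declared media type first, then on the file extension."""
        if media_type is not None and media_type in self.media_types:
            return True
        return filename.lower().endswith(self.extensions)

    def load_factory(self) -> Callable[..., Any]:
        """Import the adapter implementation on first use, not at discovery."""
        module_name, _, attribute = self.factory_path.partition(":")
        return getattr(importlib.import_module(module_name), attribute)


pdf_adapter_descriptor = AdapterDescriptor(
    adapter_id="source.pdf",
    version=1,
    media_types=("application/pdf",),
    extensions=(".pdf",),
    operations=("read",),
    # Assumed module path, shown only to illustrate the lazy import.
    factory_path="markitect_filter.pdf_adapter:PdfReadAdapter",
    safety={"local_reads_only": True},
)
```

Keeping the factory as a string means that listing or matching adapters never imports the PDF implementation, which is what lets the descriptor stay lightweight.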

P2.3 - Implement PDF inspection

id: MKTF-WP-0002-T003
status: done
priority: high
state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"

Implement inspect for PDF assets.

Inspection should report:

  • title, creators/authors, subject, keywords, producer, creation/modification dates where available
  • page count
  • encryption or permission status
  • basic extractability signals
  • diagnostics for malformed, unreadable, encrypted, or unsupported PDFs

Output: inspection implementation and tests with small fixtures.

Implemented: PdfReadAdapter.inspect reports metadata, page count, extractability signals, encryption status, quality metadata, and malformed or encrypted diagnostics using deterministic generated fixtures.
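
As a rough illustration of stdlib-only inspection, the sketch below scans raw PDF bytes for a header, page objects, an encryption entry, and font resources. It is not the PdfReadAdapter.inspect implementation: the function name, result keys, and diagnostic codes are assumptions, and a byte scan like this misses pages stored inside compressed object streams.

```python
import re
from typing import Any


def inspect_pdf_bytes(data: bytes) -> dict[str, Any]:
    """Crude byte-level inspection; names and codes are illustrative."""
    diagnostics: list[dict[str, str]] = []
    if not data.startswith(b"%PDF-"):
        diagnostics.append({"code": "pdf.malformed", "severity": "error",
                            "message": "missing %PDF- header"})
        return {"page_count": 0, "encrypted": False,
                "has_fonts": False, "diagnostics": diagnostics}

    # An /Encrypt key normally appears in the trailer of an encrypted file.
    encrypted = re.search(rb"/Encrypt\b", data) is not None
    # Count /Type /Page objects while skipping /Type /Pages tree nodes.
    page_count = len(re.findall(rb"/Type\s*/Page(?![A-Za-z])", data))
    # A file with no font resources is often an image-only scan.
    has_fonts = b"/Font" in data

    if encrypted:
        diagnostics.append({"code": "pdf.encrypted", "severity": "error",
                            "message": "encrypted PDFs are not extracted"})
    if page_count == 0:
        diagnostics.append({"code": "pdf.no_pages", "severity": "warning",
                            "message": "no page objects found"})
    elif not has_fonts:
        diagnostics.append({"code": "pdf.image_only", "severity": "warning",
                            "message": "no fonts found; pages may be scanned"})
    return {"page_count": page_count, "encrypted": encrypted,
            "has_fonts": has_fonts, "diagnostics": diagnostics}
```

Returning diagnostics alongside the signals, instead of raising, lets a caller inspect a malformed or encrypted file and still receive a structured report.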

P2.4 - Normalize page text into Markitect Markdown

id: MKTF-WP-0002-T004
status: done
priority: high
state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"

Implement read for digitally-readable PDFs.

Normalization should:

  • iterate pages in deterministic order
  • apply page range filtering
  • convert extracted text into Markdown-safe segment text
  • create one or more ordered segments with stable segment ids
  • preserve page-level provenance on every segment
  • optionally insert page break markers
  • produce a stable document id and cache key through the Markitect source contract helpers

Output: read implementation and normalization tests.

Implemented: PdfReadAdapter.read extracts ordered page text into stable page segments, applies page ranges, supports optional page markers, preserves page provenance, and uses the Markitect cache-key helpers.
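
The normalization steps can be sketched as follows, assuming page text has already been extracted into a mapping of page number to text. The Segment class, the id format, and the HTML-comment page marker are hypothetical, and a plain SHA-256 digest stands in for the Markitect cache-key helpers, whose API is not shown in this workplan.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment record carrying page-level provenance."""
    segment_id: str
    page_number: int
    source_path: str
    text: str


def build_segments(source_path: str, pages: dict[int, str],
                   selected: list[int], page_markers: bool = False) -> list[Segment]:
    """Turn selected pages into ordered segments with stable ids."""
    segments = []
    for page_number in sorted(selected):          # deterministic page order
        text = pages.get(page_number, "").strip()
        if not text:
            continue                              # empty pages yield no segment
        if page_markers:
            # Assumed marker format; the real marker may differ.
            text = f"<!-- page {page_number} -->\n\n{text}"
        # Derive the id from path and page only, so reruns produce the same id.
        key = f"{source_path}#page={page_number}".encode("utf-8")
        digest = hashlib.sha256(key).hexdigest()[:12]
        segments.append(Segment(f"pdf-p{page_number:04d}-{digest}",
                                page_number, source_path, text))
    return segments
```

Because the id depends only on the source path and page number, re-reading the same file yields the same segment ids even when the whitespace or marker options change.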

P2.5 - Add diagnostics and quality semantics

id: MKTF-WP-0002-T005
status: done
priority: high
state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"

Define PDF-specific diagnostics and quality metadata.

The adapter should distinguish:

  • malformed PDF
  • encrypted or permission-restricted PDF
  • no extractable text
  • partially failed pages
  • scanned/image-only pages
  • dropped layout, tables, figures, annotations, or forms
  • unsupported embedded resources

Quality should include extraction backend, page coverage, warning count, skipped pages, lossiness, and confidence.

Output: diagnostic helpers, quality rules, and tests.

Implemented: PDF diagnostics cover malformed files, unreadable files, encrypted PDFs, invalid page ranges, missing/empty streams, image-only pages, empty extraction, and stream decompression failures. Quality metadata records backend, page count, selected pages, extracted pages, coverage, warnings, and skipped pages.
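
One way to derive the quality fields from a read result is sketched below. The function name, dictionary keys, and the "severity" field on diagnostics are assumptions; only the list of recorded quantities comes from this workplan.

```python
from typing import Any


def summarize_quality(backend: str, page_count: int,
                      selected_pages: list[int], extracted_pages: list[int],
                      diagnostics: list[dict[str, Any]]) -> dict[str, Any]:
    """Compute page coverage, skipped pages, and warning count for one read."""
    selected = sorted(set(selected_pages))
    extracted = sorted(set(extracted_pages) & set(selected))
    skipped = [page for page in selected if page not in extracted]
    # Coverage is the share of selected pages that produced text.
    coverage = len(extracted) / len(selected) if selected else 0.0
    warnings = sum(1 for d in diagnostics if d.get("severity") == "warning")
    return {
        "backend": backend,
        "page_count": page_count,
        "selected_pages": selected,
        "extracted_pages": extracted,
        "skipped_pages": skipped,
        "coverage": coverage,
        "warnings": warnings,
    }
```

For example, selecting pages 1 to 4 and extracting text from pages 1, 2, and 4 gives a coverage of 0.75 with page 3 listed as skipped.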

P2.6 - Add fixtures, docs, and validation

id: MKTF-WP-0002-T006
status: done
priority: medium
state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"

Add small deterministic PDF fixtures and documentation.

Validation should cover:

  • descriptor shape
  • media type and extension matching
  • metadata inspection
  • page text normalization
  • malformed or empty extraction behavior
  • registry and entry point shape
  • markitect-tool API use through inspect_source and normalize_source

Output: tests, README update, and validation command.

Implemented: generated PDF fixtures and tests cover descriptor shape, matching, metadata inspection, normalization, page ranges, page markers, malformed PDFs, encrypted PDFs, registry use, entry point discovery, README documentation, and the validation command below.
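
Entry point discovery can be checked with the standard library alone. The sketch below lists the names registered under the markitect_tool.source_adapters group; the helper name discover_adapter_names is an assumption, and the entry point name that markitect-filter actually registers is not stated in this workplan.

```python
from importlib.metadata import entry_points


def discover_adapter_names(group: str = "markitect_tool.source_adapters") -> list[str]:
    """Return the sorted entry point names registered under a group."""
    found = entry_points()
    if hasattr(found, "select"):
        # Python 3.10 and later expose a selectable collection.
        selected = found.select(group=group)
    else:
        # Python 3.9 returns a plain dict keyed by group name.
        selected = found.get(group, [])
    return sorted(ep.name for ep in selected)
```

In an environment where markitect-filter is installed, the returned list should include the PDF adapter's registration; without it, the function returns an empty list rather than failing.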

Validation

Run from markitect-filter:

PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest