markitect-filter/workplans/MKTF-WP-0002-pdf-read-adapter.md

id: MKTF-WP-0002
type: workplan
title: PDF Read Adapter
domain: markitect
status: todo
owner: markitect-filter
topic_slug: markitect
planning_priority: P1
planning_order: 20
depends_on_workplans: MKTF-WP-0001
related_workplans: MKTT-WP-0018
created: 2026-05-14
updated: 2026-05-14
state_hub_workstream_id: 7445fe6b-f1a9-4383-8053-4337337dc095

MKTF-WP-0002: PDF Read Adapter

Purpose

Implement the second concrete markitect-filter source adapter: source.pdf, a read-only PDF adapter that satisfies the markitect-tool source adapter contract.

The contract dependency on markitect-tool MKTT-WP-0018 is cross-repo, so it is tracked as related work rather than as a same-repo State Hub dependency edge.

The first PDF slice should target deterministic text extraction from digitally-readable PDFs. It should preserve page-level provenance and make extraction uncertainty visible through diagnostics and quality signals.

Planned Scope

  • Optional PDF dependency profile isolated behind a pdf extra.
  • Entry point group registration: markitect_tool.source_adapters.
  • Lightweight pdf_adapter_descriptor.
  • Adapter id source.pdf with media type application/pdf and extension .pdf.
  • Inspection for basic PDF metadata, page count, encryption status, and extractability signals.
  • Read-only page text extraction into ordered Markdown segments.
  • Page-aware source provenance with source paths, page numbers, page labels where available, and stable segment ids.
  • Configurable first-slice options such as page range, page break markers, and whitespace normalization policy.
  • Structured diagnostics for malformed PDFs, encrypted PDFs, image-only or scanned pages, empty extraction, partial page failures, unsupported embedded media, and lossy layout/table handling.
  • Quality metadata for confidence, lossiness, skipped pages, warning counts, extraction backend, and page coverage.
  • Tests for descriptor shape, matching, inspection, normalization, malformed inputs, encrypted or non-extractable inputs where fixtures allow, Markitect API registry use, and entry point shape.

Non-Goals

  • OCR or scanned-document recognition.
  • Pixel-perfect layout preservation.
  • Table reconstruction beyond plain text and diagnostics.
  • Image, figure, annotation, form, signature, or attachment extraction beyond future metadata/diagnostic hooks.
  • PDF writing/export.
  • Network fetching.
  • External processes or native system services in the first slice.
  • Making PDF dependencies mandatory for EPUB3 or other adapters.

P2.1 - Pin PDF v1 dependency and extraction policy

id: MKTF-WP-0002-T001
status: todo
priority: high
state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"

Choose the first PDF extraction backend and dependency profile.

The decision should document:

  • pure-Python preference for the first slice
  • optional dependency placement under the pdf extra
  • supported inputs: local, digitally-readable PDFs
  • unsupported inputs: scanned/image-only PDFs without OCR
  • encrypted/permission-restricted PDF behavior
  • how page range, page breaks, and whitespace normalization should behave
  • fallback or future status for heavier layout/OCR backends

Output: dependency decision, option contract, and implementation notes.
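
As a starting point for the option contract, the sketch below shows one way page range and whitespace normalization could behave. The names PdfReadOptions, select_pages, and normalize_whitespace, the "1-3,5" range syntax, and the "preserve"/"collapse" policy values are all assumptions made for illustration; the actual contract is the output of this task.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PdfReadOptions:
    """Hypothetical option contract for source.pdf; names are illustrative."""

    page_range: Optional[str] = None  # e.g. "1-3,5"; None selects all pages
    page_breaks: bool = False         # insert a marker between pages
    whitespace: str = "collapse"      # assumed policies: "preserve", "collapse"


def select_pages(page_range: Optional[str], page_count: int) -> list[int]:
    """Resolve a 1-based range string to sorted, unique 0-based page indices."""
    if page_range is None:
        return list(range(page_count))
    selected: set[int] = set()
    for part in page_range.split(","):
        part = part.strip()
        match = re.fullmatch(r"(\d+)(?:-(\d+))?", part)
        if match is None:
            raise ValueError(f"invalid page range part: {part!r}")
        first = int(match.group(1))
        last = int(match.group(2) or first)
        if first < 1 or last < first:
            raise ValueError(f"invalid page range part: {part!r}")
        # Assumed policy: pages past the end of the document are ignored.
        selected.update(range(first - 1, min(last, page_count)))
    return sorted(selected)


def normalize_whitespace(text: str, policy: str) -> str:
    """Apply the assumed whitespace policy to extracted page text."""
    if policy == "preserve":
        return text
    if policy == "collapse":
        # Collapse runs of spaces and tabs, trim each line, and reduce
        # three or more consecutive newlines to a single blank line.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
        return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
    raise ValueError(f"unknown whitespace policy: {policy!r}")
```

Whether out-of-range pages should be silently ignored, as here, or reported as a diagnostic is one of the behaviors this decision needs to pin down.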

P2.2 - Add descriptor and entry point registration

id: MKTF-WP-0002-T002
status: todo
priority: high
state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"

Add a pdf_adapter_descriptor matching the existing EPUB3 descriptor pattern.

The descriptor should define:

  • adapter id source.pdf
  • version 1
  • media type application/pdf
  • extension .pdf
  • read operation only
  • safety metadata with local reads only
  • option schema for page range, page breaks, and whitespace normalization
  • quality profile and dependency metadata
  • lazy factory import for the PDF adapter implementation

Output: descriptor, entry point registration, and descriptor tests.
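
The descriptor fields listed above could be sketched as a small frozen dataclass with a lazily resolved factory. This is an illustration only: the AdapterDescriptor class, its field names, the safety and option schema keys, and the factory path markitect_filter.pdf.adapter:PdfAdapter are hypothetical, and the real shape should follow the existing EPUB3 descriptor and the markitect-tool contract.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from importlib import import_module
from typing import Any, Callable


@dataclass(frozen=True)
class AdapterDescriptor:
    """Hypothetical descriptor shape; the real contract lives in markitect-tool."""

    adapter_id: str
    version: int
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]
    operations: tuple[str, ...]
    factory_path: str  # "module:attribute", resolved only when needed
    safety: dict[str, Any] = field(default_factory=dict)
    option_schema: dict[str, Any] = field(default_factory=dict)

    def matches(self, filename: str, media_type: str | None = None) -> bool:
        """Match on media type when given, otherwise on file extension."""
        if media_type is not None and media_type.lower() in self.media_types:
            return True
        return filename.lower().endswith(self.extensions)

    def load_factory(self) -> Callable[..., Any]:
        """Import the adapter factory lazily so PDF dependencies stay optional."""
        module_name, _, attribute = self.factory_path.partition(":")
        return getattr(import_module(module_name), attribute)


pdf_adapter_descriptor = AdapterDescriptor(
    adapter_id="source.pdf",
    version=1,
    media_types=("application/pdf",),
    extensions=(".pdf",),
    operations=("read",),
    # Placeholder module path; the real location is not decided in this plan.
    factory_path="markitect_filter.pdf.adapter:PdfAdapter",
    safety={"local_reads_only": True},
    option_schema={
        "page_range": {"type": "string"},
        "page_breaks": {"type": "boolean"},
        "whitespace": {"enum": ["preserve", "collapse"]},
    },
)
```

Keeping the factory as a string means that importing the descriptor module never imports the PDF backend, which is what keeps the descriptor lightweight when the pdf extra is not installed.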

P2.3 - Implement PDF inspection

id: MKTF-WP-0002-T003
status: todo
priority: high
state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"

Implement inspect for PDF assets.

Inspection should report:

  • title, creators/authors, subject, keywords, producer, creation/modification dates where available
  • page count
  • encryption or permission status
  • basic extractability signals
  • diagnostics for malformed, unreadable, encrypted, or unsupported PDFs

Output: inspection implementation and tests with small fixtures.
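
Before the chosen backend parses a file, a cheap byte-level pre-check can already feed some of these diagnostics. The sketch below uses only the standard library and relies on two facts about the PDF file format: files begin with a %PDF-x.y header and end with an %%EOF marker, and encrypted files carry an /Encrypt entry in the trailer. The PdfSniffReport type and the diagnostic strings are hypothetical, and the raw byte search is a heuristic that can produce false positives, so it would complement the backend rather than replace it.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class PdfSniffReport:
    """Hypothetical pre-check result; not the markitect-tool inspection type."""

    looks_like_pdf: bool
    pdf_version: str | None = None
    maybe_encrypted: bool = False
    diagnostics: list[str] = field(default_factory=list)


def sniff_pdf(data: bytes) -> PdfSniffReport:
    """Run heuristic structural checks on raw PDF bytes."""
    # Look for the "%PDF-x.y" header near the start of the file.
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    if match is None:
        return PdfSniffReport(False, diagnostics=["pdf.malformed: missing %PDF- header"])

    report = PdfSniffReport(True, pdf_version=match.group(1).decode("ascii"))

    # An /Encrypt key anywhere in the bytes is only a hint of encryption.
    if b"/Encrypt" in data:
        report.maybe_encrypted = True
        report.diagnostics.append("pdf.encrypted: /Encrypt entry found")

    # A well-formed file ends with an %%EOF marker.
    if b"%%EOF" not in data[-1024:]:
        report.diagnostics.append("pdf.malformed: missing %%EOF marker")
    return report
```

Metadata fields, page count, and permission flags still have to come from the extraction backend selected in P2.1.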

P2.4 - Normalize page text into Markitect Markdown

id: MKTF-WP-0002-T004
status: todo
priority: high
state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"

Implement read for digitally-readable PDFs.

Normalization should:

  • iterate pages in deterministic order
  • apply page range filtering
  • convert extracted text into Markdown-safe segment text
  • create one or more ordered segments with stable segment ids
  • preserve page-level provenance on every segment
  • optionally insert page break markers
  • produce a stable document id and cache key through the Markitect source contract helpers

Output: read implementation and normalization tests.
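
The normalization steps above can be sketched without committing to a backend by treating extracted page text as input. In this illustration the Segment type, the page break marker, the escaping rule, and the id scheme are all assumptions; in the real adapter, document ids and cache keys are meant to come from the Markitect source contract helpers, which are not shown here.

```python
from __future__ import annotations

import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment shape; field names are illustrative."""

    segment_id: str
    text: str
    source_path: str
    page_number: int  # 1-based page number for provenance


PAGE_BREAK = "<!-- page-break -->"  # assumed marker format


def escape_markdown(text: str) -> str:
    """Backslash-escape characters that Markdown would treat as syntax."""
    return re.sub(r"([\\`*_\[\]<>#])", r"\\\1", text)


def make_segment_id(source_path: str, page_number: int, index: int) -> str:
    """Derive a deterministic id from path, page, and position on the page."""
    key = f"{source_path}\x00{page_number}\x00{index}".encode("utf-8")
    return "pdf-" + hashlib.sha256(key).hexdigest()[:16]


def build_segments(
    source_path: str,
    page_texts: dict[int, str],
    page_breaks: bool = False,
) -> list[Segment]:
    """Turn {page_number: extracted_text} into ordered, page-aware segments."""
    segments: list[Segment] = []
    # Sorting the page numbers makes the order independent of dict order.
    for position, page_number in enumerate(sorted(page_texts)):
        text = escape_markdown(page_texts[page_number]).strip()
        if page_breaks and position > 0:
            # The marker is added after escaping so it stays intact.
            text = f"{PAGE_BREAK}\n\n{text}"
        segments.append(
            Segment(
                segment_id=make_segment_id(source_path, page_number, 0),
                text=text,
                source_path=source_path,
                page_number=page_number,
            )
        )
    return segments
```

Hashing the source path keeps ids stable across runs but changes them when a file moves; whether ids should instead derive from content or from the contract's document id is a choice for the implementation.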

P2.5 - Add diagnostics and quality semantics

id: MKTF-WP-0002-T005
status: todo
priority: high
state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"

Define PDF-specific diagnostics and quality metadata.

The adapter should distinguish:

  • malformed PDF
  • encrypted or permission-restricted PDF
  • no extractable text
  • partially failed pages
  • scanned/image-only pages
  • dropped layout, tables, figures, annotations, or forms
  • unsupported embedded resources

Quality should include extraction backend, page coverage, warning count, skipped pages, lossiness, and confidence.

Output: diagnostic helpers, quality rules, and tests.
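
One possible shape for the diagnostic codes and quality rules is sketched below. The code strings, the Quality fields, the choice of which diagnostics count as a skipped page, and the confidence formula are assumptions made for illustration, not decided semantics.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class PdfDiagnostic(str, Enum):
    """Hypothetical diagnostic codes mirroring the cases listed above."""

    MALFORMED = "pdf.malformed"
    ENCRYPTED = "pdf.encrypted"
    NO_TEXT = "pdf.no_extractable_text"
    PAGE_FAILED = "pdf.page_failed"
    IMAGE_ONLY_PAGE = "pdf.image_only_page"
    LAYOUT_DROPPED = "pdf.layout_dropped"
    UNSUPPORTED_RESOURCE = "pdf.unsupported_resource"


# Assumed rule: these page-level codes mean the page contributed no text.
SKIP_CODES = {PdfDiagnostic.PAGE_FAILED, PdfDiagnostic.IMAGE_ONLY_PAGE}


@dataclass(frozen=True)
class Quality:
    backend: str
    page_coverage: float
    warning_count: int
    skipped_pages: tuple[int, ...]
    lossy: bool
    confidence: float


def summarize_quality(
    backend: str,
    selected_pages: list[int],
    page_diagnostics: dict[int, list[PdfDiagnostic]],
) -> Quality:
    """Fold per-page diagnostics into document-level quality metadata."""
    skipped = tuple(
        page for page in selected_pages
        if SKIP_CODES & set(page_diagnostics.get(page, []))
    )
    warnings = sum(len(page_diagnostics.get(page, [])) for page in selected_pages)
    total = len(selected_pages)
    coverage = (total - len(skipped)) / total if total else 0.0
    lossy = warnings > 0
    # Assumed formula: start from coverage and apply a 10% penalty if lossy.
    confidence = coverage * (0.9 if lossy else 1.0)
    return Quality(backend, coverage, warnings, skipped, lossy, confidence)
```

For example, four selected pages with one failed page and one layout warning give a page coverage of 0.75, a warning count of 2, and skipped_pages of (2,) when page 2 is the one that failed.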

P2.6 - Add fixtures, docs, and validation

id: MKTF-WP-0002-T006
status: todo
priority: medium
state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"

Add small deterministic PDF fixtures and documentation.

Validation should cover:

  • descriptor shape
  • media type and extension matching
  • metadata inspection
  • page text normalization
  • malformed or empty extraction behavior
  • registry and entry point shape
  • markitect-tool API use through inspect_source and normalize_source

Output: tests, README update, and validation command.
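
For the entry point shape check, a test could list what is registered under the markitect_tool.source_adapters group using the standard library. The helper name discover_adapter_names is hypothetical, and the name under which the PDF adapter will be registered is not fixed by this plan, so the sketch only shows the lookup.

```python
from __future__ import annotations

from importlib.metadata import entry_points

# Group name taken from the planned scope; the helper itself is illustrative.
ADAPTER_GROUP = "markitect_tool.source_adapters"


def discover_adapter_names(group: str = ADAPTER_GROUP) -> list[str]:
    """Return the sorted names of entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selection API.
        selected = eps.select(group=group)
    else:
        # Older versions return a plain mapping of group name to entries.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


if __name__ == "__main__":
    print(discover_adapter_names())
```

With the package installed in the test environment, a validation test could then assert that the expected adapter name appears in the returned list.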

Validation

Run from markitect-filter:

PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest