---
id: MKTF-WP-0002
type: workplan
title: "PDF Read Adapter"
domain: markitect
status: done
owner: markitect-filter
topic_slug: markitect
planning_priority: complete
planning_order: 20
depends_on_workplans:
  - MKTF-WP-0001
related_workplans:
  - MKTT-WP-0018
created: "2026-05-14"
updated: "2026-05-14"
state_hub_workstream_id: "7445fe6b-f1a9-4383-8053-4337337dc095"
---

# MKTF-WP-0002: PDF Read Adapter

## Purpose

Implement the second concrete `markitect-filter` source adapter:
`source.pdf`, a read-only PDF adapter that satisfies the `markitect-tool`
source adapter contract.

The contract lives in a different repository, so it is tracked as related work
(`markitect-tool` `MKTT-WP-0018`) rather than as a same-repo State Hub
dependency edge.

The first PDF slice should target deterministic text extraction from
digitally-readable PDFs. It should preserve page-level provenance and make
extraction uncertainty visible through diagnostics and quality signals.

## Implemented Scope

- Optional PDF dependency profile isolated behind a reserved `pdf` extra; the
  first slice itself relies only on the standard library.
- Entry point group registration:
  `markitect_tool.source_adapters`.
- Lightweight `pdf_adapter_descriptor`.
- Adapter id `source.pdf` with media type `application/pdf` and extension
  `.pdf`.
- Inspection for basic PDF metadata, page count, encryption status, and
  extractability signals.
- Read-only page text extraction into ordered Markdown segments.
- Page-aware source provenance with source paths, page numbers, page labels
  where available, and stable segment ids.
- Configurable first-slice options such as page range, page break markers, and
  whitespace normalization policy.
- Structured diagnostics for malformed PDFs, encrypted PDFs, image-only or
  scanned pages, empty extraction, partial page failures, unsupported embedded
  media, and lossy layout/table handling.
- Quality metadata for confidence, lossiness, skipped pages, warning counts,
  extraction backend, and page coverage.
- Tests for descriptor shape, matching, inspection, normalization, malformed
  inputs, encrypted or non-extractable inputs where fixtures allow, Markitect
  API registry use, and entry point shape.

## Non-Goals

- OCR or scanned-document recognition.
- Pixel-perfect layout preservation.
- Table reconstruction beyond plain text and diagnostics.
- Image, figure, annotation, form, signature, or attachment extraction beyond
  future metadata/diagnostic hooks.
- PDF writing/export.
- Network fetching.
- External processes or native system services in the first slice.
- Making PDF dependencies mandatory for EPUB3 or other adapters.

## P2.1 - Pin PDF v1 dependency and extraction policy

```task
id: MKTF-WP-0002-T001
status: done
priority: high
state_hub_task_id: "2ce51bb9-9182-4927-90d1-4c08433b5ddb"
```

Choose the first PDF extraction backend and dependency profile.

The decision should document:

- pure-Python preference for the first slice
- optional dependency placement under the `pdf` extra
- supported inputs: local, digitally-readable PDFs
- unsupported inputs: scanned/image-only PDFs without OCR
- encrypted/permission-restricted PDF behavior
- how page range, page breaks, and whitespace normalization should behave
- fallback or future status for heavier layout/OCR backends

Output: dependency decision, option contract, and implementation notes.

Implemented: `docs/pdf-adapter.md`, `pyproject.toml`, and the descriptor
metadata document a stdlib first slice, a reserved `pdf` extra, local
digitally-readable PDF support, page range/page marker/whitespace options, and
deferred OCR/layout-heavy backends.
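
As an illustration of how the page range and whitespace options could behave, here is a minimal sketch. The range syntax (`"1-3,5"`), the policy names (`"preserve"`, `"collapse"`), and both function names are assumptions made for this example; the actual option contract is the one documented in `docs/pdf-adapter.md`.

```python
from __future__ import annotations

import re


def parse_page_range(spec: str | None, page_count: int) -> list[int]:
    """Return sorted, de-duplicated 1-based page numbers selected by `spec`.

    Assumed syntax: comma-separated pages or inclusive ranges, e.g. "1-3,5".
    None or an empty string selects every page.
    """
    if not spec:
        return list(range(1, page_count + 1))
    selected: set[int] = set()
    for part in spec.split(","):
        match = re.fullmatch(r"\s*(\d+)\s*(?:-\s*(\d+)\s*)?", part)
        if match is None:
            raise ValueError(f"invalid page range part: {part!r}")
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"page range {part.strip()!r} outside 1..{page_count}")
        selected.update(range(start, end + 1))
    return sorted(selected)


def normalize_whitespace(text: str, policy: str = "collapse") -> str:
    """Apply an assumed whitespace policy: "preserve" or "collapse"."""
    if policy == "preserve":
        return text
    if policy == "collapse":
        # Squeeze runs of spaces/tabs per line, then cap blank-line runs at one.
        lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
        return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
    raise ValueError(f"unknown whitespace policy: {policy!r}")
```

Rejecting out-of-range pages with an error, rather than silently clamping them, keeps invalid ranges visible so they can be reported as diagnostics.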

## P2.2 - Add descriptor and entry point registration

```task
id: MKTF-WP-0002-T002
status: done
priority: high
state_hub_task_id: "27d754a9-59ae-4419-946b-f1f847bd3b10"
```

Add a `pdf_adapter_descriptor` matching the existing EPUB3 descriptor pattern.

The descriptor should define:

- adapter id `source.pdf`
- version `1`
- media type `application/pdf`
- extension `.pdf`
- read operation only
- safety metadata with local reads only
- option schema for page range, page breaks, and whitespace normalization
- quality profile and dependency metadata
- lazy factory import for the PDF adapter implementation

Output: descriptor, entry point registration, and descriptor tests.

Implemented: `pdf_adapter_descriptor` is registered through
`markitect_tool.source_adapters`, exported from the package, and covered by
descriptor and discovery tests.
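
The descriptor fields listed above can be pictured with the following sketch. `SourceAdapterDescriptor` here is a hypothetical stand-in, not the real `markitect-tool` type, and the factory path, option names, and field names are assumptions. The point is the lazy factory: the implementation module is imported only when the adapter is actually requested.

```python
from __future__ import annotations

import importlib
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class SourceAdapterDescriptor:
    """Hypothetical stand-in for the markitect-tool descriptor type."""

    adapter_id: str
    version: str
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]
    operations: tuple[str, ...]
    factory_path: str  # "module:attribute", resolved only on first use
    option_schema: dict[str, Any] = field(default_factory=dict)

    def matches(self, *, media_type: str | None = None, path: str | None = None) -> bool:
        """Match on media type first, then on a case-insensitive extension."""
        if media_type is not None and media_type.lower() in self.media_types:
            return True
        return path is not None and path.lower().endswith(self.extensions)

    def load_factory(self) -> Callable[..., Any]:
        """Import the implementation lazily so descriptor discovery stays cheap."""
        module_name, _, attr = self.factory_path.partition(":")
        return getattr(importlib.import_module(module_name), attr)


# Illustrative values; the module path and option names are assumptions.
pdf_descriptor_sketch = SourceAdapterDescriptor(
    adapter_id="source.pdf",
    version="1",
    media_types=("application/pdf",),
    extensions=(".pdf",),
    operations=("read",),
    factory_path="markitect_filter.pdf:PdfReadAdapter",
    option_schema={
        "page_range": {"type": "string"},
        "page_breaks": {"type": "boolean", "default": False},
        "whitespace": {"enum": ["preserve", "collapse"], "default": "collapse"},
    },
)
```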

## P2.3 - Implement PDF inspection

```task
id: MKTF-WP-0002-T003
status: done
priority: high
state_hub_task_id: "33b594e6-d12a-46d5-bc50-6ec1aebaaf65"
```

Implement `inspect` for PDF assets.

Inspection should report:

- title, creators/authors, subject, keywords, producer, creation/modification
  dates where available
- page count
- encryption or permission status
- basic extractability signals
- diagnostics for malformed, unreadable, encrypted, or unsupported PDFs

Output: inspection implementation and tests with small fixtures.

Implemented: `PdfReadAdapter.inspect` reports metadata, page count,
extractability signals, encryption status, quality metadata, and malformed or
encrypted diagnostics using deterministic generated fixtures.
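
To make the inspection signals concrete, the sketch below derives a few of them from raw bytes with the standard library alone. It is not the `PdfReadAdapter.inspect` implementation: the function name, the result type, and the diagnostic codes are hypothetical, and the byte scan assumes page objects are stored uncompressed (it would miss pages held inside object streams).

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class InspectionSketch:
    is_pdf: bool
    page_count: int = 0
    encrypted: bool = False
    diagnostics: list[str] = field(default_factory=list)


def inspect_pdf_bytes(data: bytes) -> InspectionSketch:
    """Collect coarse inspection signals from the raw bytes of a PDF file."""
    # A PDF file starts with a "%PDF-" header; anything else is treated as malformed.
    if not data.startswith(b"%PDF-"):
        return InspectionSketch(is_pdf=False, diagnostics=["pdf.malformed"])

    result = InspectionSketch(is_pdf=True)
    # Page objects carry "/Type /Page"; the lookahead skips "/Pages" tree nodes.
    result.page_count = len(re.findall(rb"/Type\s*/Page(?![A-Za-z])", data))
    # An /Encrypt entry in the trailer marks an encrypted document.
    result.encrypted = re.search(rb"/Encrypt\b", data) is not None

    if result.encrypted:
        result.diagnostics.append("pdf.encrypted")
    if result.page_count == 0:
        result.diagnostics.append("pdf.no_pages")
    return result
```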

## P2.4 - Normalize page text into Markitect Markdown

```task
id: MKTF-WP-0002-T004
status: done
priority: high
state_hub_task_id: "30c0c777-a4e4-43d1-ac24-6a0f84c7b761"
```

Implement `read` for digitally-readable PDFs.

Normalization should:

- iterate pages in deterministic order
- apply page range filtering
- convert extracted text into Markdown-safe segment text
- create one or more ordered segments with stable segment ids
- preserve page-level provenance on every segment
- optionally insert page break markers
- produce a stable document id and cache key through the Markitect source
  contract helpers

Output: read implementation and normalization tests.

Implemented: `PdfReadAdapter.read` extracts ordered page text into stable
page segments, applies page ranges, supports optional page markers, preserves
page provenance, and uses the Markitect cache-key helpers.
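
The normalization steps above can be sketched as follows, assuming page text has already been extracted into a mapping of page number to text. The `PageSegment` type, the segment id scheme, and the HTML-comment page marker are assumptions for illustration; the real adapter derives document ids and cache keys through the Markitect source contract helpers, which are not reproduced here.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PageSegment:
    segment_id: str
    page_number: int
    source_path: str
    text: str


def build_page_segments(
    source_path: str,
    pages: dict[int, str],
    *,
    page_markers: bool = False,
) -> list[PageSegment]:
    """Turn extracted page text into ordered segments with page provenance."""
    # Assumed id scheme: a short hash of the source path plus the page number,
    # so repeated reads of the same path yield identical segment ids.
    doc_key = hashlib.sha256(source_path.encode("utf-8")).hexdigest()[:12]
    segments: list[PageSegment] = []
    for page_number in sorted(pages):  # deterministic page order
        text = pages[page_number].strip()
        if not text:
            continue  # empty pages would be reported through diagnostics instead
        if page_markers:
            text = f"<!-- page {page_number} -->\n\n{text}"
        segments.append(
            PageSegment(
                segment_id=f"{doc_key}:p{page_number:04d}",
                page_number=page_number,
                source_path=source_path,
                text=text,
            )
        )
    return segments
```

Emitting one segment per page keeps provenance trivial: every segment maps to exactly one page number and one source path.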

## P2.5 - Add diagnostics and quality semantics

```task
id: MKTF-WP-0002-T005
status: done
priority: high
state_hub_task_id: "8b6a190a-350b-4c61-ac4f-1900673a8cd2"
```

Define PDF-specific diagnostics and quality metadata.

The adapter should distinguish:

- malformed PDF
- encrypted or permission-restricted PDF
- no extractable text
- partially failed pages
- scanned/image-only pages
- dropped layout, tables, figures, annotations, or forms
- unsupported embedded resources

Quality should include extraction backend, page coverage, warning count,
skipped pages, lossiness, and confidence.

Output: diagnostic helpers, quality rules, and tests.

Implemented: PDF diagnostics cover malformed files, unreadable files,
encrypted PDFs, invalid page ranges, missing/empty streams, image-only pages,
empty extraction, and stream decompression failures. Quality metadata records
backend, page count, selected pages, extracted pages, coverage, warnings, and
skipped pages.
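
As a rough illustration of how the recorded quality fields relate to each other, the sketch below computes coverage and skipped pages from the selected and extracted page lists. The function name, the dictionary keys, and the diagnostic codes in the usage example are assumptions, not the adapter's actual schema.

```python
from __future__ import annotations


def summarize_quality(
    page_count: int,
    selected_pages: list[int],
    extracted_pages: list[int],
    warnings: list[str],
    backend: str = "stdlib",
) -> dict[str, object]:
    """Summarize extraction quality for one read operation."""
    selected = sorted(set(selected_pages))
    # Only pages that were both selected and extracted count toward coverage.
    extracted = sorted(set(extracted_pages) & set(selected))
    skipped = [page for page in selected if page not in extracted]
    coverage = len(extracted) / len(selected) if selected else 0.0
    return {
        "backend": backend,
        "page_count": page_count,
        "selected_pages": selected,
        "extracted_pages": extracted,
        "skipped_pages": skipped,
        "coverage": coverage,
        "warning_count": len(warnings),
    }


# Two of four selected pages produced text, so coverage is 0.5.
quality = summarize_quality(
    page_count=4,
    selected_pages=[1, 2, 3, 4],
    extracted_pages=[1, 3],
    warnings=["pdf.image_only_page", "pdf.empty_extraction"],
)
```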

## P2.6 - Add fixtures, docs, and validation

```task
id: MKTF-WP-0002-T006
status: done
priority: medium
state_hub_task_id: "af597160-e189-42be-8479-c6e0f467d238"
```

Add small deterministic PDF fixtures and documentation.

Validation should cover:

- descriptor shape
- media type and extension matching
- metadata inspection
- page text normalization
- malformed or empty extraction behavior
- registry and entry point shape
- `markitect-tool` API use through `inspect_source` and `normalize_source`

Output: tests, README update, and validation command.

Implemented: generated PDF fixtures and tests cover descriptor shape, matching,
metadata inspection, normalization, page range markers, malformed PDFs,
encrypted PDFs, registry use, entry point discovery, README documentation, and
the validation command below.
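
For a quick manual check of the entry point shape, installed adapters in the `markitect_tool.source_adapters` group can be listed with `importlib.metadata`. This is a minimal sketch rather than the project's discovery test; the helper name is hypothetical, and the result is simply empty when no package registering that group is installed.

```python
from __future__ import annotations

from importlib.metadata import entry_points

GROUP = "markitect_tool.source_adapters"


def discover_adapter_entry_points(group: str = GROUP) -> dict[str, str]:
    """Map entry point names to their "module:attribute" targets for a group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selection API on the returned object.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    for name, target in sorted(discover_adapter_entry_points().items()):
        print(f"{name} -> {target}")
```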

## Validation

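Run from the `markitect-filter` repository root. `PYTHONPATH` must include a
local `markitect-tool` checkout; the path below is machine-specific and may
need adjusting: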

```bash
PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
```