epub3 inbound filter

2026-05-14 22:46:51 +02:00
parent 8d62b2d241
commit 925b36521d
7 changed files with 971 additions and 2 deletions

@@ -0,0 +1,56 @@
---
id: MKTF-WP-0001
type: workplan
title: "EPUB3 Read Adapter"
domain: markitect
status: done
owner: markitect-filter
topic_slug: markitect
planning_priority: complete
planning_order: 10
depends_on_workplans:
- MKTT-WP-0018
created: "2026-05-14"
updated: "2026-05-14"
---
# MKTF-WP-0001: EPUB3 Read Adapter
## Purpose
Implement the first concrete `markitect-filter` source adapter:
`source.epub3`, a read-only EPUB3 adapter that satisfies the
`markitect-tool` source adapter contract.
## Implemented Scope
- Python package scaffold with `pyproject.toml`.
- Entry point group registration:
`markitect_tool.source_adapters`.
- Lightweight `epub3_adapter_descriptor`.
- Stdlib-only EPUB3 package reading with `zipfile` and `ElementTree`.
- `META-INF/container.xml` rootfile discovery.
- OPF metadata, manifest, and spine extraction.
- EPUB nav label extraction.
- XHTML body extraction into ordered Markdown segments.
- Source provenance with package paths, hrefs, anchors, and section labels.
- Structured diagnostics for malformed EPUBs, skipped boilerplate, missing
spine items, unsupported media, and malformed XML.
- Tests for descriptor shape, matching, inspection, normalization, malformed
packages, Markitect API registry use, and entry point shape.
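The rootfile discovery and spine extraction steps listed above can be sketched with the same stdlib modules. This is a minimal illustration, not the adapter's actual code: the function names `find_rootfile`, `read_spine`, and `build_sample_epub`, and the sample package contents, are hypothetical. Only the `META-INF/container.xml` path and the container and OPF namespaces come from the EPUB specifications.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_PATH = "META-INF/container.xml"
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical sample data, used only to exercise the sketch.
SAMPLE_CONTAINER = (
    '<container version="1.0" '
    'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
    '<rootfiles><rootfile full-path="OEBPS/content.opf" '
    'media-type="application/oebps-package+xml"/></rootfiles>'
    "</container>"
)
SAMPLE_OPF = (
    '<package version="3.0" xmlns="http://www.idpf.org/2007/opf">'
    "<manifest>"
    '<item id="c1" href="ch1.xhtml" media-type="application/xhtml+xml"/>'
    '<item id="c2" href="ch2.xhtml" media-type="application/xhtml+xml"/>'
    "</manifest>"
    '<spine><itemref idref="c1"/><itemref idref="gone"/>'
    '<itemref idref="c2"/></spine>'
    "</package>"
)


def find_rootfile(archive: zipfile.ZipFile) -> str:
    """Return the package path of the OPF file named in container.xml."""
    root = ET.fromstring(archive.read(CONTAINER_PATH))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None or not rootfile.get("full-path"):
        raise ValueError("container.xml declares no rootfile")
    return rootfile.get("full-path")


def read_spine(archive: zipfile.ZipFile) -> list[str]:
    """Return manifest hrefs in spine order, skipping unknown idrefs."""
    package = ET.fromstring(archive.read(find_rootfile(archive)))
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    hrefs = []
    for itemref in package.findall("opf:spine/opf:itemref", OPF_NS):
        href = manifest.get(itemref.get("idref"))
        if href is not None:  # a real adapter would emit a diagnostic here
            hrefs.append(href)
    return hrefs


def build_sample_epub() -> bytes:
    """Build a tiny in-memory package holding only the two XML files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(CONTAINER_PATH, SAMPLE_CONTAINER)
        archive.writestr("OEBPS/content.opf", SAMPLE_OPF)
    return buffer.getvalue()
```

The returned hrefs are relative to the directory of the OPF file, so a caller would still need to resolve them against the rootfile path before reading the XHTML documents.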
## Non-Goals
- PDF, DOCX, ODT, OCR, or browser extraction.
- Write/export adapters.
- Network fetching.
- Styling-preserving conversion.
- Image extraction, other than possible future metadata/attachment handling.
## Validation
Run from `markitect-filter`:
```bash
PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
```