Add source attachment metadata compatibility

2026-05-15 14:36:24 +02:00
parent afa51f8764
commit ad137b214f
13 changed files with 724 additions and 28 deletions

@@ -23,7 +23,7 @@ native system services, or renderer-specific tooling.
- Scanned or image-only PDFs that require OCR.
- Encrypted or permission-restricted PDFs.
- Pixel-perfect layout reconstruction.
- Table, figure, annotation, form, signature, and attachment extraction.
- Table, figure, annotation, form, signature, and rich attachment extraction.
- PDF writing/export.
## Options
@@ -43,3 +43,9 @@ and originating PDF page object id.
Quality metadata records the extraction backend, document page count, selected
pages, extracted page count, page coverage, skipped pages, warning count,
lossiness, and confidence.
`NormalizedMarkdownDocument.attachments` may include read-side metadata for
embedded file streams and image-resource signals when the stdlib parser can
detect them. Embedded files include byte size and digest. Image resources are
signal-only descriptors with page/object provenance; the adapter does not
extract image bytes or perform OCR.

@@ -0,0 +1,81 @@
# Source Attachment Metadata
`markitect-filter` exposes read-side attachment metadata through
`NormalizedMarkdownDocument.attachments`. These entries are
`markitect_tool.source.SourceAsset` objects, so `markitect-tool` can consume
them when building passive render asset manifests.
The metadata schema marker is:
```text
markitect-filter.source-attachment.v1
```
## Common Fields
Attachment entries should preserve:
- `uri`: stable source package or document member URI
- `path`: package member path or signal path
- `name`: member filename or signal label
- `media_type` and `extension` when known
- `size` and `digest` when bytes are available
- `metadata.source_adapter`: adapter id such as `source.epub3` or `source.pdf`
- `metadata.source_role`: logical read-side role
- `metadata.package_path`, `metadata.page`, `metadata.pdf_object`, or related
provenance coordinates when known
- `metadata.render_manifest_compatible: true` when the entry can feed
`RenderAsset.from_source_asset`
These entries describe source-side resources only. They do not imply output
paths, copy execution, final artifact locations, or publication state.
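As a rough illustration of the field list above, the sketch below checks a plain dictionary for the fields an entry is expected to preserve. The helper name `missing_common_fields`, the dictionary shape, and the example URI are assumptions made for this example; real entries are `SourceAsset` objects, whose constructor is not shown in this document.

```python
# Hypothetical helper: not part of markitect-filter or markitect-tool.
REQUIRED_TOP_LEVEL = ("uri", "path", "name")
REQUIRED_METADATA = ("source_adapter", "source_role")


def missing_common_fields(entry: dict) -> list[str]:
    """Return dotted names of common fields that are absent or empty."""
    missing = [key for key in REQUIRED_TOP_LEVEL if not entry.get(key)]
    metadata = entry.get("metadata") or {}
    missing += [
        f"metadata.{key}" for key in REQUIRED_METADATA if not metadata.get(key)
    ]
    # size and digest are recorded together, only when bytes are available.
    if (entry.get("size") is None) != (entry.get("digest") is None):
        missing.append("size/digest pair")
    return missing


# Illustrative entry; the URI form and digest text are placeholders.
example = {
    "uri": "book.epub#OEBPS/images/cover.png",
    "path": "OEBPS/images/cover.png",
    "name": "cover.png",
    "media_type": "image/png",
    "extension": ".png",
    "size": 14,
    "digest": "sha256:<hex digest>",
    "metadata": {
        "source_adapter": "source.epub3",
        "source_role": "image",  # assumed role label
        "package_path": "OEBPS/images/cover.png",
        "render_manifest_compatible": True,
    },
}

print(missing_common_fields(example))
```

A check like this is only a reading aid for the list above; it does not replace whatever validation `markitect-tool` performs on `SourceAsset` objects.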
## EPUB3
The EPUB3 adapter records manifest resources for images, stylesheets, fonts,
audio, and video when the package entry exists and can be read cheaply from the
ZIP archive. It stores byte size and sha256 digest for each collected resource.
Unsupported non-XHTML package resources produce
`source.epub3.skipped_resource` warnings. Declared but missing resources produce
`source.epub3.missing_resource` warnings.
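The size-and-digest step described above can be sketched with the standard library alone. This is a minimal illustration, not the adapter's code: the function name `describe_member`, the returned dictionary shape, and the `sha256:` digest prefix are assumptions.

```python
import hashlib
import io
import zipfile


def describe_member(archive: zipfile.ZipFile, member_path: str) -> dict:
    """Return size and sha256 digest for a ZIP member, or a warning code."""
    try:
        data = archive.read(member_path)
    except KeyError:
        # zipfile raises KeyError when the named member is not in the archive,
        # which corresponds to a declared-but-missing resource.
        return {"path": member_path, "warning": "source.epub3.missing_resource"}
    return {
        "path": member_path,
        "size": len(data),
        "digest": "sha256:" + hashlib.sha256(data).hexdigest(),
    }


# Build a tiny in-memory archive so the example runs without a real EPUB.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("OEBPS/images/cover.png", b"not-a-real-png")

with zipfile.ZipFile(buffer) as archive:
    found = describe_member(archive, "OEBPS/images/cover.png")
    absent = describe_member(archive, "OEBPS/fonts/missing.otf")

print(found["size"], absent["warning"])
```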
## PDF
The PDF adapter records embedded file streams when a stdlib scan can identify
`Filespec` and `EmbeddedFile` objects. For each one it stores the member bytes,
a media type inferred from the filename, size, digest, object id, and the source
role `embedded-file`.
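To make "stdlib scan" concrete, here is one way such a scan could look, using only `re` over the raw bytes. It is a simplified sketch under stated assumptions, not the adapter's implementation: the function name is hypothetical, and the approach only sees objects stored in plain form, so it would miss anything inside compressed object streams.

```python
import re

# Matches "<num> <gen> obj", then that object's body up to "endobj".
OBJECT_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# Matches a /Type entry naming one of the two attachment-related types.
TYPE_RE = re.compile(rb"/Type\s*/(Filespec|EmbeddedFile)\b")


def find_attachment_objects(pdf_bytes: bytes) -> list[tuple[str, str]]:
    """Return (object id, type name) pairs for Filespec/EmbeddedFile objects."""
    found = []
    for match in OBJECT_RE.finditer(pdf_bytes):
        number, generation, body = match.groups()
        type_match = TYPE_RE.search(body)
        if type_match:
            object_id = f"{number.decode()} {generation.decode()}"
            found.append((object_id, type_match.group(1).decode()))
    return found


# Hand-written fragment standing in for a real PDF.
sample = (
    b"%PDF-1.7\n"
    b"4 0 obj\n<< /Type /Filespec /F (notes.txt) /EF << /F 5 0 R >> >>\nendobj\n"
    b"5 0 obj\n<< /Type /EmbeddedFile /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
    b"6 0 obj\n<< /Type /Page >>\nendobj\n"
)

print(find_attachment_objects(sample))
```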
For image resources, the stdlib slice records signal-only entries with source
role `image-signal`. These entries preserve page/object provenance and a stable
digest of the detected page/resource signal, but they do not extract image
bytes. Image signals emit `source.pdf.image_resource_signal` warnings so callers
know the adapter detected media that it did not extract.
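The text does not spell out how the signal digest is composed. The sketch below shows one plausible construction, assuming the digest is a sha256 over the page number, object id, and resource name; the function name, the field order, and the `sha256:` prefix are all hypothetical.

```python
import hashlib


def image_signal_digest(page: int, pdf_object: str, resource_name: str) -> str:
    """Hash provenance coordinates only; no image bytes are read."""
    signal = f"page={page};object={pdf_object};resource={resource_name}"
    return "sha256:" + hashlib.sha256(signal.encode("utf-8")).hexdigest()


# Illustrative signal-only entry: provenance is present, byte size is not.
signal_entry = {
    "name": "Im1",
    "size": None,
    "digest": image_signal_digest(3, "12 0", "Im1"),
    "metadata": {
        "source_adapter": "source.pdf",
        "source_role": "image-signal",
        "page": 3,
        "pdf_object": "12 0",
    },
}

print(signal_entry["digest"])
```

Because the input is provenance rather than content, the same page/object pair yields the same digest across runs, while two different images never collide merely by sharing pixel data.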
## Render Manifest Handoff
`markitect-tool` can convert attachment entries to passive render assets:
```python
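from markitect_tool.render import RenderAsset

# Convert each source-side attachment into a passive render asset,
# carrying over the read-side role recorded by the adapter.
render_assets = [
    RenderAsset.from_source_asset(asset, role=asset.metadata["source_role"])
    for asset in document.attachments
]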
```
The resulting render assets remain passive descriptors. Asset copying,
renderer output references, link rewriting, and final artifact validation stay
outside `markitect-filter`.
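The conversion above indexes `metadata["source_role"]` directly, so a caller may want to filter entries first. The following sketch does that with stand-in objects so it runs without `markitect-tool` installed; the helper name `render_ready` is hypothetical, and `SimpleNamespace` only mimics the `metadata` attribute of a `SourceAsset`.

```python
from types import SimpleNamespace


def render_ready(attachments):
    """Keep entries flagged as compatible that also carry a source role."""
    return [
        asset
        for asset in attachments
        if asset.metadata.get("render_manifest_compatible") is True
        and "source_role" in asset.metadata
    ]


# Stand-ins for SourceAsset objects; only the metadata attribute is modelled.
attachments = [
    SimpleNamespace(
        name="cover.png",
        metadata={"source_role": "image", "render_manifest_compatible": True},
    ),
    SimpleNamespace(
        name="Im1",
        metadata={"source_role": "image-signal"},
    ),
]

print([asset.name for asset in render_ready(attachments)])
```

The filtered list can then be passed through `RenderAsset.from_source_asset` as shown earlier in this section.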
Example normalized attachment envelopes live in:
- `examples/source-attachments/epub3-attachments.normalized.yaml`
- `examples/source-attachments/pdf-attachments.normalized.yaml`
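Cross-repo validation can be run from this checkout with the `markitect-tool`
sources on `PYTHONPATH`. The absolute path below is specific to one machine, so
substitute the location of your own `markitect-tool` clone: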
```bash
PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
```