generated from coulomb/repo-seed
Add source attachment metadata compatibility
@@ -23,7 +23,7 @@ native system services, or renderer-specific tooling.
 - Scanned or image-only PDFs that require OCR.
 - Encrypted or permission-restricted PDFs.
 - Pixel-perfect layout reconstruction.
-- Table, figure, annotation, form, signature, and attachment extraction.
+- Table, figure, annotation, form, signature, and rich attachment extraction.
 - PDF writing/export.

 ## Options
@@ -43,3 +43,9 @@ and originating PDF page object id.
 Quality metadata records the extraction backend, document page count, selected
 pages, extracted page count, page coverage, skipped pages, warning count,
 lossiness, and confidence.
+
+`NormalizedMarkdownDocument.attachments` may include read-side metadata for
+embedded file streams and image-resource signals when the stdlib parser can
+detect them. Embedded files include byte size and digest. Image resources are
+signal-only descriptors with page/object provenance; the adapter does not
+extract image bytes or perform OCR.

81
docs/source-attachment-metadata.md
Normal file
@@ -0,0 +1,81 @@
# Source Attachment Metadata

`markitect-filter` exposes read-side attachment metadata through
`NormalizedMarkdownDocument.attachments`. These entries are
`markitect_tool.source.SourceAsset` objects, so `markitect-tool` can consume
them when building passive render asset manifests.

The metadata schema marker is:

```text
markitect-filter.source-attachment.v1
```

## Common Fields

Attachment entries should preserve:

- `uri`: stable source package or document member URI
- `path`: package member path or signal path
- `name`: member filename or signal label
- `media_type` and `extension` when known
- `size` and `digest` when bytes are available
- `metadata.source_adapter`: adapter id such as `source.epub3` or `source.pdf`
- `metadata.source_role`: logical read-side role
- `metadata.package_path`, `metadata.page`, `metadata.pdf_object`, or related
  provenance coordinates when known
- `metadata.render_manifest_compatible: true` when the entry can feed
  `RenderAsset.from_source_asset`

These entries describe source-side resources only. They do not imply output
paths, copy execution, final artifact locations, or publication state.

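The field list above can be sketched as a small Python structure. This is a minimal illustration, not the real `markitect_tool.source.SourceAsset`: the `AttachmentSketch` class, the `describe_member` helper, the `uri` shape, the `sha256:` digest prefix, the `metadata["schema"]` key, and the `image` role value are all assumptions made for the example.

```python
import hashlib
import mimetypes
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Any, Optional

SCHEMA_MARKER = "markitect-filter.source-attachment.v1"


@dataclass(frozen=True)
class AttachmentSketch:
    """Hypothetical stand-in for SourceAsset; the real class may differ."""

    uri: str
    path: str
    name: str
    media_type: Optional[str] = None
    extension: Optional[str] = None
    size: Optional[int] = None
    digest: Optional[str] = None
    metadata: dict[str, Any] = field(default_factory=dict)


def describe_member(package_uri: str, member_path: str, data: bytes,
                    *, adapter: str, role: str) -> AttachmentSketch:
    """Build one source-side entry from a package member's bytes."""
    member = PurePosixPath(member_path)
    media_type, _encoding = mimetypes.guess_type(member.name)
    return AttachmentSketch(
        uri=f"{package_uri}#{member_path}",  # assumed URI shape
        path=member_path,
        name=member.name,
        media_type=media_type,
        extension=member.suffix or None,
        size=len(data),
        digest="sha256:" + hashlib.sha256(data).hexdigest(),  # assumed format
        metadata={
            "schema": SCHEMA_MARKER,  # assumed key name
            "source_adapter": adapter,
            "source_role": role,
            "package_path": member_path,
            "render_manifest_compatible": True,
        },
    )


asset = describe_member(
    "book.epub", "OEBPS/images/cover.png", b"\x89PNG",
    adapter="source.epub3", role="image",
)
print(asset.name, asset.media_type, asset.size)
```

Note that nothing in the sketch names an output path or copies bytes anywhere, which matches the source-side-only scope described above.
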
## EPUB3

The EPUB3 adapter records manifest resources for images, stylesheets, fonts,
audio, and video when the package entry exists and can be read cheaply from the
ZIP archive. It stores byte size and sha256 digest for each collected resource.

Unsupported non-XHTML package resources produce
`source.epub3.skipped_resource` warnings. Declared but missing resources produce
`source.epub3.missing_resource` warnings.

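The collection and warning rules above could look roughly like the following stdlib-only sketch. It is not the adapter's actual code: the `collect_manifest_resources` function, the shape of the returned dictionaries, and the media-type classification are assumptions, and the manifest is passed in as a plain mapping rather than parsed from the package document.

```python
import hashlib
import io
import zipfile

XHTML = "application/xhtml+xml"
# Assumed classification; a real adapter may accept more media types.
SUPPORTED_PREFIXES = ("image/", "font/", "audio/", "video/")
SUPPORTED_TYPES = {"text/css"}


def collect_manifest_resources(archive, manifest):
    """Return (entries, warnings) for a mapping of member path -> media type."""
    members = set(archive.namelist())
    entries, warnings = [], []
    for path, media_type in manifest.items():
        if media_type == XHTML:
            continue  # document content, not an attachment
        supported = (media_type.startswith(SUPPORTED_PREFIXES)
                     or media_type in SUPPORTED_TYPES)
        if not supported:
            warnings.append({"code": "source.epub3.skipped_resource", "path": path})
        elif path not in members:
            warnings.append({"code": "source.epub3.missing_resource", "path": path})
        else:
            data = archive.read(path)
            entries.append({
                "path": path,
                "media_type": media_type,
                "size": len(data),
                "digest": hashlib.sha256(data).hexdigest(),
            })
    return entries, warnings


# Build a tiny in-memory archive to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("OEBPS/style.css", "body { margin: 0 }")
    zf.writestr("OEBPS/ch1.xhtml", "<html/>")

manifest = {
    "OEBPS/ch1.xhtml": XHTML,
    "OEBPS/style.css": "text/css",
    "OEBPS/cover.png": "image/png",       # declared but not in the archive
    "OEBPS/notes.js": "text/javascript",  # not a collected resource type
}
with zipfile.ZipFile(buffer) as zf:
    entries, warnings = collect_manifest_resources(zf, manifest)
```

In this run the stylesheet is the only collected entry, while the missing cover image and the script each produce one warning.
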
## PDF

The PDF adapter records embedded file streams when a stdlib scan can identify
`Filespec` and `EmbeddedFile` objects. It stores member bytes, media type
inferred from the filename, size, digest, object id, and source role
`embedded-file`.

For image resources, the stdlib slice records signal-only entries with source
role `image-signal`. These entries preserve page/object provenance and a stable
digest of the detected page/resource signal, but they do not extract image
bytes. Image signals emit `source.pdf.image_resource_signal` warnings so callers
know the adapter detected media that it did not extract.

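One way to picture a signal-only entry is the sketch below. The `image_signal_entry` helper, the entry and warning shapes, the signal path format, and the choice to hash a canonical JSON form of the provenance are all assumptions; the text above only states that the digest is stable and derived from the detected page/resource signal.

```python
import hashlib
import json


def image_signal_entry(document_uri, page, pdf_object, resource_name):
    """Describe a detected image resource without reading its bytes."""
    signal = {"page": page, "pdf_object": pdf_object, "resource": resource_name}
    # Canonical JSON keeps the digest stable across runs and key orderings.
    canonical = json.dumps(signal, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    signal_path = f"page-{page}/{resource_name}"  # assumed signal path shape
    entry = {
        "uri": f"{document_uri}#{signal_path}",
        "path": signal_path,
        "name": resource_name,
        "digest": digest,  # digest of the signal, not of image bytes
        "metadata": {
            "source_adapter": "source.pdf",
            "source_role": "image-signal",
            "page": page,
            "pdf_object": pdf_object,
        },
    }
    warning = {
        "code": "source.pdf.image_resource_signal",
        "message": f"image {resource_name} on page {page} detected, not extracted",
    }
    return entry, warning


entry, warning = image_signal_entry(
    "report.pdf", page=3, pdf_object="12 0", resource_name="Im1")
again, _ = image_signal_entry(
    "report.pdf", page=3, pdf_object="12 0", resource_name="Im1")
```

The entry deliberately carries no `size`, since no image bytes are read, and repeating the call with the same provenance yields the same digest.
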
## Render Manifest Handoff

`markitect-tool` can convert attachment entries to passive render assets:

```python
from markitect_tool.render import RenderAsset

render_assets = [
    RenderAsset.from_source_asset(asset, role=asset.metadata["source_role"])
    for asset in document.attachments
]
```

The resulting render assets remain passive descriptors. Asset copying,
renderer output references, link rewriting, and final artifact validation stay
outside `markitect-filter`.

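Callers that want to hand over only flagged entries could filter on `metadata.render_manifest_compatible` first. The sketch below is a hypothetical helper, not part of either package; it uses `SimpleNamespace` stand-ins instead of real `SourceAsset` objects, and it assumes that an entry without the flag should be left out.

```python
from types import SimpleNamespace


def render_compatible(attachments):
    """Keep only entries explicitly flagged for the render manifest."""
    return [
        asset for asset in attachments
        if asset.metadata.get("render_manifest_compatible") is True
    ]


# Stand-in attachments; real entries would be SourceAsset objects.
attachments = [
    SimpleNamespace(
        name="cover.png",
        metadata={"source_role": "image", "render_manifest_compatible": True},
    ),
    SimpleNamespace(name="Im1", metadata={"source_role": "image-signal"}),
]
selected = render_compatible(attachments)
print([asset.name for asset in selected])
```

The filtered list can then be passed through `RenderAsset.from_source_asset` in the same way as the unfiltered one.
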
Example normalized attachment envelopes live in:

- `examples/source-attachments/epub3-attachments.normalized.yaml`
- `examples/source-attachments/pdf-attachments.normalized.yaml`

Cross-repo validation can be run from this checkout with:

```bash
PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
```