Source Attachment Metadata

markitect-filter exposes read-side attachment metadata through NormalizedMarkdownDocument.attachments. These entries are markitect_tool.source.SourceAsset objects, so markitect-tool can consume them when building passive render asset manifests.

The metadata schema marker is:

markitect-filter.source-attachment.v1

Common Fields

Attachment entries should preserve:

  • uri: stable source package or document member URI
  • path: package member path or signal path
  • name: member filename or signal label
  • media_type and extension when known
  • size and digest when bytes are available
  • metadata.source_adapter: adapter id such as source.epub3 or source.pdf
  • metadata.source_role: logical read-side role
  • metadata.package_path, metadata.page, metadata.pdf_object, or related provenance coordinates when known
  • metadata.render_manifest_compatible: true when the entry can feed RenderAsset.from_source_asset

These entries describe source-side resources only. They do not imply output paths, copy execution, final artifact locations, or publication state.
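As a rough illustration of the field list above, the following sketch builds one entry by hand. `AttachmentSketch` is a hypothetical stand-in, not the real `markitect_tool.source.SourceAsset` class, and the URI form, the `stylesheet` role value, and the hex digest format are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AttachmentSketch:
    """Hypothetical stand-in for an attachment entry; the real SourceAsset may differ."""

    uri: str
    path: str
    name: str
    media_type: Optional[str] = None
    extension: Optional[str] = None
    size: Optional[int] = None
    digest: Optional[str] = None
    metadata: dict = field(default_factory=dict)


payload = b"body { margin: 0; }"

entry = AttachmentSketch(
    uri="book.epub#OEBPS/styles/main.css",  # assumed URI form, for illustration only
    path="OEBPS/styles/main.css",
    name="main.css",
    media_type="text/css",
    extension=".css",
    size=len(payload),                           # only set when bytes are available
    digest=hashlib.sha256(payload).hexdigest(),  # assumed plain hex encoding
    metadata={
        "source_adapter": "source.epub3",
        "source_role": "stylesheet",             # hypothetical role value
        "package_path": "OEBPS/styles/main.css",
        "render_manifest_compatible": True,
    },
)
```

Nothing in the entry names an output location, which matches the source-side-only scope described above.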

EPUB3

The EPUB3 adapter records manifest resources for images, stylesheets, fonts, audio, and video when the package entry exists and can be read cheaply from the ZIP archive. It stores byte size and sha256 digest for each collected resource.

Unsupported non-XHTML package resources produce source.epub3.skipped_resource warnings. Declared but missing resources produce source.epub3.missing_resource warnings.
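The collection and missing-resource behavior can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the adapter's actual code: `collect_resources` is a hypothetical helper, and the dictionary shapes for entries and warnings are invented for the example. Only the warning code `source.epub3.missing_resource` comes from the description above.

```python
import hashlib
import io
import zipfile


def collect_resources(archive, declared_paths):
    """Return (entries, warnings) for manifest member paths declared by the package."""
    entries, warnings = [], []
    members = set(archive.namelist())
    for path in declared_paths:
        if path not in members:
            # Declared in the manifest but absent from the ZIP archive.
            warnings.append({"code": "source.epub3.missing_resource", "path": path})
            continue
        data = archive.read(path)
        entries.append(
            {
                "path": path,
                "name": path.rsplit("/", 1)[-1],
                "size": len(data),
                "digest": hashlib.sha256(data).hexdigest(),
            }
        )
    return entries, warnings


# Build a tiny in-memory archive so the sketch runs on its own.
cover_bytes = b"\x89PNG fake bytes"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("OEBPS/images/cover.png", cover_bytes)

with zipfile.ZipFile(buffer) as zf:
    entries, warnings = collect_resources(
        zf, ["OEBPS/images/cover.png", "OEBPS/fonts/missing.otf"]
    )
```

Here the cover image yields one entry with its size and digest, while the font that is declared but not present yields one warning.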

PDF

The PDF adapter records embedded file streams when a stdlib scan can identify Filespec and EmbeddedFile objects. For each one it stores the member bytes, a media type inferred from the filename, the size, the digest, the object id, and the source role embedded-file.
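To make the idea of a stdlib scan concrete, here is a deliberately naive sketch that looks for Filespec dictionaries in raw PDF bytes and infers a media type from the filename with `mimetypes`. It is an assumption about how such a scan could work, not the adapter's implementation: `find_filespecs` and the entry shape are hypothetical, and the sketch ignores object streams, encryption, escaped parentheses, and decoding of the EmbeddedFile stream itself (so it does not compute size or digest).

```python
import mimetypes
import re

# Naive match for "N G obj ... endobj" bodies in an uncompressed PDF.
OBJECT_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj(.*?)endobj", re.DOTALL)
# Literal-string filename in a Filespec dictionary, e.g. "/F (notes.txt)".
FILENAME_RE = re.compile(rb"/F\s*\(([^)]*)\)")


def find_filespecs(pdf_bytes):
    """Return one hypothetical entry dict per Filespec object found in the bytes."""
    found = []
    for match in OBJECT_RE.finditer(pdf_bytes):
        body = match.group(3)
        if b"/Filespec" not in body:
            continue
        name_match = FILENAME_RE.search(body)
        if name_match is None:
            continue
        name = name_match.group(1).decode("latin-1")
        media_type, _encoding = mimetypes.guess_type(name)
        found.append(
            {
                "name": name,
                "media_type": media_type or "application/octet-stream",
                "pdf_object": int(match.group(1)),  # assumed: object number only
                "source_role": "embedded-file",
            }
        )
    return found


sample = b"7 0 obj\n<< /Type /Filespec /F (notes.txt) /EF << /F 8 0 R >> >>\nendobj\n"
filespecs = find_filespecs(sample)
```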

For image resources, the stdlib slice records signal-only entries with source role image-signal. These entries preserve page/object provenance and a stable digest of the detected page/resource signal, but they do not extract image bytes. Image signals emit source.pdf.image_resource_signal warnings so callers know the adapter detected media that it did not extract.
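One way a signal-only entry could get a stable digest without touching image bytes is to hash a canonical serialization of its provenance coordinates. The sketch below assumes the signal consists of a page number, an object id, and a resource name. The real adapter may hash different inputs, and `image_signal_digest` is a hypothetical helper.

```python
import hashlib
import json


def image_signal_digest(page, pdf_object, resource_name):
    """Hash a canonical form of the detected signal, not the image bytes."""
    signal = {"page": page, "pdf_object": pdf_object, "resource": resource_name}
    # Sorted keys and fixed separators keep the serialization deterministic.
    canonical = json.dumps(signal, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = image_signal_digest(3, 12, "Im0")
repeat = image_signal_digest(3, 12, "Im0")
other_page = image_signal_digest(4, 12, "Im0")
```

The same coordinates always produce the same digest, and a change in any coordinate produces a different one, which is what makes the value usable as a stable identifier for the signal.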

Render Manifest Handoff

markitect-tool can convert attachment entries to passive render assets:

from markitect_tool.render import RenderAsset

render_assets = [
    RenderAsset.from_source_asset(asset, role=asset.metadata["source_role"])
    for asset in document.attachments
]

The resulting render assets remain passive descriptors. Asset copying, renderer output references, link rewriting, and final artifact validation stay outside markitect-filter.
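A caller may want to skip entries that are not marked `render_manifest_compatible` before the conversion. The following sketch shows such a filter using stand-in objects, so it runs without markitect-tool installed. `handoff_candidates` is a hypothetical helper, and whether this filtering is needed before calling `RenderAsset.from_source_asset` is an assumption.

```python
from types import SimpleNamespace


def handoff_candidates(attachments):
    """Yield (asset, role) pairs for entries marked render-manifest compatible."""
    for asset in attachments:
        metadata = getattr(asset, "metadata", None) or {}
        if metadata.get("render_manifest_compatible") is True:
            yield asset, metadata.get("source_role")


# Stand-ins for document.attachments; real entries would be SourceAsset objects.
attachments = [
    SimpleNamespace(
        name="cover.png",
        metadata={"source_role": "image", "render_manifest_compatible": True},
    ),
    SimpleNamespace(
        name="page-3-Im0",
        metadata={"source_role": "image-signal", "render_manifest_compatible": False},
    ),
]

selected = [(asset.name, role) for asset, role in handoff_candidates(attachments)]
```

Each selected pair could then be passed to the conversion shown above, with the role taken from the entry's own metadata.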

Example normalized attachment envelopes live in:

  • examples/source-attachments/epub3-attachments.normalized.yaml
  • examples/source-attachments/pdf-attachments.normalized.yaml

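Cross-repo validation can be run from this checkout by putting the markitect-tool sources on PYTHONPATH (adjust the second path to your local markitect-tool checkout):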

PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest