Add source attachment metadata compatibility

2026-05-15 14:36:24 +02:00
parent afa51f8764
commit ad137b214f
13 changed files with 724 additions and 28 deletions

@@ -28,3 +28,9 @@ pdf = "markitect_filter.adapters:pdf_adapter_descriptor"
The first PDF slice is stdlib-only and targets deterministic text extraction
from local, digitally-readable PDFs. OCR, scanned-document recognition, and
layout-perfect reconstruction are intentionally deferred.

Read-side attachment metadata is exposed through
`NormalizedMarkdownDocument.attachments` for EPUB3 package resources, PDF
embedded files, and PDF image-resource signals. See
`docs/source-attachment-metadata.md` for the handoff contract to passive render
asset manifests.

@@ -23,7 +23,7 @@ native system services, or renderer-specific tooling.
- Scanned or image-only PDFs that require OCR.
- Encrypted or permission-restricted PDFs.
- Pixel-perfect layout reconstruction.
- Table, figure, annotation, form, signature, and attachment extraction.
- Table, figure, annotation, form, signature, and rich attachment extraction.
- PDF writing/export.
## Options
@@ -43,3 +43,9 @@ and originating PDF page object id.
Quality metadata records the extraction backend, document page count, selected
pages, extracted page count, page coverage, skipped pages, warning count,
lossiness, and confidence.

`NormalizedMarkdownDocument.attachments` may include read-side metadata for
embedded file streams and image-resource signals when the stdlib parser can
detect them. Embedded files include byte size and digest. Image resources are
signal-only descriptors with page/object provenance; the adapter does not
extract image bytes or perform OCR.
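
As a rough illustration of how the per-role counts in the quality metadata relate to the attachment list, the sketch below tallies `source_role` values. The plain dictionaries stand in for `SourceAsset` objects, and `role_counts` is a hypothetical helper, not part of the adapter.

```python
from collections import Counter


def role_counts(attachments: list[dict]) -> Counter:
    """Tally attachment entries by their metadata.source_role value."""
    return Counter(entry["metadata"].get("source_role") for entry in attachments)


# Hypothetical entries shaped like the PDF fixture: one embedded file, one image signal.
attachments = [
    {"name": "attachment.txt", "metadata": {"source_role": "embedded-file"}},
    {"name": "image-signal", "metadata": {"source_role": "image-signal", "signal_only": True}},
]
counts = role_counts(attachments)
```

Under these assumptions, `counts["embedded-file"]` and `counts["image-signal"]` correspond to the `embedded_file_count` and `image_signal_count` quality fields.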

@@ -0,0 +1,81 @@
# Source Attachment Metadata

`markitect-filter` exposes read-side attachment metadata through
`NormalizedMarkdownDocument.attachments`. These entries are
`markitect_tool.source.SourceAsset` objects, so `markitect-tool` can consume
them when building passive render asset manifests.

The metadata schema marker is:

```text
markitect-filter.source-attachment.v1
```

## Common Fields

Attachment entries should preserve:

- `uri`: stable source package or document member URI
- `path`: package member path or signal path
- `name`: member filename or signal label
- `media_type` and `extension` when known
- `size` and `digest` when bytes are available
- `metadata.source_adapter`: adapter id such as `source.epub3` or `source.pdf`
- `metadata.source_role`: logical read-side role
- `metadata.package_path`, `metadata.page`, `metadata.pdf_object`, or related
  provenance coordinates when known
- `metadata.render_manifest_compatible: true` when the entry can feed
  `RenderAsset.from_source_asset`

These entries describe source-side resources only. They do not imply output
paths, copy execution, final artifact locations, or publication state.
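
The common fields can be sketched for a single package member as follows. `AttachmentSketch` is a hypothetical stand-in for `SourceAsset` (whose real definition lives in `markitect-tool` and is not reproduced here), and `describe_member` is an illustrative name. The `!/` member separator and the `sha256:` digest prefix follow the helpers added in this commit.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Any


@dataclass(frozen=True)
class AttachmentSketch:
    """Hypothetical stand-in for SourceAsset; not the real class."""

    uri: str
    path: str
    name: str
    media_type: str
    size: int
    digest: str
    metadata: dict[str, Any] = field(default_factory=dict)


def describe_member(
    container_uri: str,
    member_path: str,
    data: bytes,
    media_type: str,
    source_adapter: str,
    source_role: str,
) -> AttachmentSketch:
    """Describe a package member by size and digest without keeping its bytes."""
    return AttachmentSketch(
        uri=f"{container_uri}!/{member_path}",
        path=member_path,
        name=PurePosixPath(member_path).name,
        media_type=media_type,
        size=len(data),
        digest="sha256:" + hashlib.sha256(data).hexdigest(),
        metadata={
            "schema_version": "markitect-filter.source-attachment.v1",
            "source_adapter": source_adapter,
            "source_role": source_role,
            "package_path": member_path,
        },
    )


entry = describe_member(
    "fixture.epub",
    "EPUB/images/chart.png",
    b"\x89PNG\r\n\x1a\nfixture",
    "image/png",
    source_adapter="source.epub3",
    source_role="image",
)
```

The entry carries only descriptive fields; the member bytes are used for `size` and `digest` and then discarded.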

## EPUB3

The EPUB3 adapter records manifest resources for images, stylesheets, fonts,
audio, and video when the package entry exists and can be read cheaply from the
ZIP archive. It stores byte size and sha256 digest for each collected resource.

Unsupported non-XHTML package resources produce
`source.epub3.skipped_resource` warnings. Declared but missing resources produce
`source.epub3.missing_resource` warnings.
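
The decision order described above can be sketched as a small classifier. This is a simplified illustration condensed from the media-type sets in this commit; `classify_manifest_item` and its return labels are hypothetical, and the real adapter emits diagnostics rather than returning strings.

```python
XHTML_TYPES = {"application/xhtml+xml", "text/html"}
ATTACHMENT_PREFIXES = ("audio/", "font/", "image/", "video/")
ATTACHMENT_TYPES = {
    "application/font-sfnt",
    "application/vnd.ms-opentype",
    "application/x-font-ttf",
    "text/css",
}


def classify_manifest_item(media_type: str, present_in_archive: bool) -> str:
    """Return what happens to one manifest item in this simplified model."""
    if media_type in XHTML_TYPES:
        # Spine content is handled by text extraction, not as an attachment.
        return "content"
    normalized = media_type.lower()
    supported = normalized in ATTACHMENT_TYPES or normalized.startswith(ATTACHMENT_PREFIXES)
    if not supported:
        return "source.epub3.skipped_resource"
    if not present_in_archive:
        return "source.epub3.missing_resource"
    return "attachment"


outcomes = [
    classify_manifest_item("image/png", True),
    classify_manifest_item("application/x-custom-binary", True),
    classify_manifest_item("text/css", False),
]
```

Note that the unsupported-type check runs before the archive lookup, so an unsupported resource is reported as skipped whether or not its bytes exist.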

## PDF

The PDF adapter records embedded file streams when a stdlib scan can identify
`Filespec` and `EmbeddedFile` objects. It reads the stream bytes to compute size
and digest but does not store them, guesses the media type from the filename,
and records the PDF object id and source role `embedded-file`.

For image resources, the stdlib slice records signal-only entries with source
role `image-signal`. These entries preserve page/object provenance and a stable
digest of the detected page/resource signal, but they do not extract image
bytes. Image signals emit `source.pdf.image_resource_signal` warnings so callers
know the adapter detected media that it did not extract.
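
A minimal sketch of the signal digest and URI, assuming the approach in this commit's helpers: the digest covers the page object body joined by newlines with each referenced image object body, and the URI uses a fragment with a zero-padded page number. The function names and sample object bodies are hypothetical.

```python
import hashlib


def signal_digest(page_body: bytes, image_bodies: list[bytes]) -> str:
    """Digest the page object body joined with each referenced image object body."""
    joined = b"\n".join([page_body, *image_bodies])
    return "sha256:" + hashlib.sha256(joined).hexdigest()


def signal_uri(container_uri: str, page_number: int) -> str:
    """Build the fragment-style URI for a page-level image signal."""
    return f"{container_uri}#page-{page_number:04d}/image-signal"


# Hypothetical raw object bodies for one page and one image XObject.
page = b"<< /Type /Page /Resources << /XObject << /Im1 5 0 R >> >> >>"
image = b"<< /Type /XObject /Subtype /Image /Width 1 /Height 1 >>"

first = signal_digest(page, [image])
second = signal_digest(page, [image])
uri = signal_uri("fixture.pdf", 1)
```

Because the digest depends only on the raw object bodies, repeated reads of the same file yield the same value, while any change to the page or image object changes it.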

## Render Manifest Handoff

`markitect-tool` can convert attachment entries to passive render assets:

```python
from markitect_tool.render import RenderAsset

render_assets = [
    RenderAsset.from_source_asset(asset, role=asset.metadata["source_role"])
    for asset in document.attachments
]
```

The resulting render assets remain passive descriptors. Asset copying,
renderer output references, link rewriting, and final artifact validation stay
outside `markitect-filter`.
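
Before handoff, a consumer may want to separate byte-backed entries from signal-only ones. The sketch below is one possible approach, not something `markitect-tool` is known to do: it uses plain dictionaries in place of `SourceAsset` objects, and `split_attachments` is a hypothetical helper that relies only on the `render_manifest_compatible` and `signal_only` metadata keys.

```python
def split_attachments(attachments: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split compatible entries into byte-backed assets and signal-only descriptors."""
    byte_backed: list[dict] = []
    signals: list[dict] = []
    for entry in attachments:
        metadata = entry.get("metadata", {})
        if not metadata.get("render_manifest_compatible"):
            # Entries without the compatibility flag are left out of the handoff.
            continue
        if metadata.get("signal_only"):
            signals.append(entry)
        else:
            byte_backed.append(entry)
    return byte_backed, signals


# Hypothetical entries modelled on the PDF example envelope.
attachments = [
    {
        "name": "attachment.txt",
        "size": 13,
        "metadata": {"source_role": "embedded-file", "render_manifest_compatible": True},
    },
    {
        "name": "image-signal",
        "metadata": {
            "source_role": "image-signal",
            "signal_only": True,
            "render_manifest_compatible": True,
        },
    },
]
byte_backed, signals = split_attachments(attachments)
```

Signal-only entries have no `size`, so a downstream copy step would have nothing to copy for them.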

Example normalized attachment envelopes live in:

- `examples/source-attachments/epub3-attachments.normalized.yaml`
- `examples/source-attachments/pdf-attachments.normalized.yaml`

Cross-repo validation can be run from this checkout with `markitect-tool` on
`PYTHONPATH`; adjust the path below to the local `markitect-tool` checkout:

```bash
PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
```

@@ -0,0 +1,40 @@
schema_version: markitect.source.v1
document_id: source.epub3:fixture
adapter:
  id: source.epub3
  version: "1"
attachments:
  - uri: fixture.epub!/EPUB/images/chart.png
    path: EPUB/images/chart.png
    name: chart.png
    media_type: image/png
    extension: .png
    size: 15
    digest: sha256:example-chart
    metadata:
      schema_version: markitect-filter.source-attachment.v1
      source_adapter: source.epub3
      source_role: image
      package_path: EPUB/images/chart.png
      href: images/chart.png
      manifest_id: chart
      render_manifest_compatible: true
  - uri: fixture.epub!/EPUB/styles/book.css
    path: EPUB/styles/book.css
    name: book.css
    media_type: text/css
    extension: .css
    size: 22
    digest: sha256:example-css
    metadata:
      schema_version: markitect-filter.source-attachment.v1
      source_adapter: source.epub3
      source_role: stylesheet
      package_path: EPUB/styles/book.css
      href: styles/book.css
      manifest_id: style
      render_manifest_compatible: true
render_asset_manifest_handoff:
  compatible_schema: markitect.render.reference.v1
  core_asset_copying: false
  renderer_required: false

@@ -0,0 +1,40 @@
schema_version: markitect.source.v1
document_id: source.pdf:fixture
adapter:
  id: source.pdf
  version: "1"
attachments:
  - uri: fixture.pdf!/embedded/attachment.txt
    path: embedded/attachment.txt
    name: attachment.txt
    media_type: text/plain
    extension: .txt
    size: 13
    digest: sha256:example-embedded-file
    metadata:
      schema_version: markitect-filter.source-attachment.v1
      source_adapter: source.pdf
      source_role: embedded-file
      package_path: embedded/attachment.txt
      pdf_object: 8
      embedded_file_name: attachment.txt
      render_manifest_compatible: true
  - uri: fixture.pdf#page-0001/image-signal
    path: page-0001/image-signal
    name: image-signal
    media_type: application/x.markitect-pdf-image-signal
    digest: sha256:example-image-signal
    metadata:
      schema_version: markitect-filter.source-attachment.v1
      source_adapter: source.pdf
      source_role: image-signal
      signal_only: true
      page: 1
      pdf_object: 3
      image_objects:
        - 5
      render_manifest_compatible: true
render_asset_manifest_handoff:
  compatible_schema: markitect.render.reference.v1
  core_asset_copying: false
  renderer_required: false

@@ -1,5 +1,10 @@
"""Concrete source-format adapters for Markitect."""
from markitect_filter.adapters import epub3_adapter_descriptor, pdf_adapter_descriptor
from markitect_filter.assets import SOURCE_ATTACHMENT_METADATA_VERSION
__all__ = ["epub3_adapter_descriptor", "pdf_adapter_descriptor"]
__all__ = [
"SOURCE_ATTACHMENT_METADATA_VERSION",
"epub3_adapter_descriptor",
"pdf_adapter_descriptor",
]

@@ -41,12 +41,15 @@ def epub3_adapter_descriptor() -> SourceAdapterDescriptor:
},
quality_profile={
"text_extraction": "stdlib-xhtml",
"images": "metadata-only",
"styles": "ignored",
"images": "metadata-with-digest",
"styles": "metadata-with-digest",
"fonts": "metadata-with-digest",
"attachments": "read-side-source-assets",
},
metadata={
"format": "EPUB3",
"dependency_profile": "stdlib",
"render_asset_manifest_compatible": True,
},
)
@@ -96,12 +99,14 @@ def pdf_adapter_descriptor() -> SourceAdapterDescriptor:
},
quality_profile={
"text_extraction": "stdlib-pdf-text",
"images": "diagnostic-only",
"attachments": "metadata-with-digest",
"images": "signal-only",
"styles": "ignored",
"tables": "plain-text-only",
},
metadata={
"format": "PDF",
"dependency_profile": "stdlib",
"render_asset_manifest_compatible": True,
},
)

@@ -0,0 +1,88 @@
"""Read-side source asset metadata helpers."""

from __future__ import annotations

import hashlib
import mimetypes
import posixpath
from pathlib import PurePosixPath
from typing import Any

from markitect_tool.source import SourceAsset


SOURCE_ATTACHMENT_METADATA_VERSION = "markitect-filter.source-attachment.v1"


def bytes_digest(data: bytes) -> str:
    """Return a Markitect-compatible sha256 digest for source bytes."""

    return "sha256:" + hashlib.sha256(data).hexdigest()


def source_member_uri(container_uri: str, member_path: str) -> str:
    """Return a stable URI for a source package member."""

    return f"{container_uri}!/{member_path}"


def source_asset_from_member(
    *,
    container_uri: str,
    member_path: str,
    data: bytes,
    media_type: str | None,
    source_adapter: str,
    source_role: str,
    metadata: dict[str, Any] | None = None,
) -> SourceAsset:
    """Build read-side metadata for a package member without extracting it."""

    name = PurePosixPath(member_path).name or member_path
    extension = PurePosixPath(name).suffix.lower() or None
    resolved_media_type = media_type or mimetypes.guess_type(name)[0] or "application/octet-stream"
    return SourceAsset(
        uri=source_member_uri(container_uri, member_path),
        path=member_path,
        name=name,
        media_type=resolved_media_type,
        extension=extension,
        size=len(data),
        digest=bytes_digest(data),
        metadata={
            "schema_version": SOURCE_ATTACHMENT_METADATA_VERSION,
            "source_adapter": source_adapter,
            "source_role": source_role,
            "package_path": posixpath.normpath(member_path),
            **(metadata or {}),
        },
    )


def source_asset_signal(
    *,
    container_uri: str,
    signal_id: str,
    media_type: str,
    source_adapter: str,
    source_role: str,
    digest_parts: list[bytes],
    metadata: dict[str, Any] | None = None,
) -> SourceAsset:
    """Build metadata for a detected source resource signal without bytes."""

    digest = bytes_digest(b"\n".join(digest_parts))
    return SourceAsset(
        uri=f"{container_uri}#{signal_id}",
        path=signal_id,
        name=signal_id.rsplit("/", 1)[-1],
        media_type=media_type,
        digest=digest,
        metadata={
            "schema_version": SOURCE_ATTACHMENT_METADATA_VERSION,
            "source_adapter": source_adapter,
            "source_role": source_role,
            "signal_only": True,
            **(metadata or {}),
        },
    )

@@ -28,12 +28,29 @@ from markitect_tool.source import (
)
from markitect_filter.adapters import epub3_adapter_descriptor
from markitect_filter.assets import source_asset_from_member
XHTML_MEDIA_TYPES = {
"application/xhtml+xml",
"text/html",
}
EPUB_ATTACHMENT_MEDIA_PREFIXES = (
"audio/",
"font/",
"image/",
"video/",
)
EPUB_ATTACHMENT_MEDIA_TYPES = {
"application/font-sfnt",
"application/vnd.ms-opentype",
"application/x-font-ttf",
"font/otf",
"font/ttf",
"font/woff",
"font/woff2",
"text/css",
}
BOILERPLATE_HINTS = {
"cover",
"nav",
@@ -101,11 +118,15 @@ class Epub3ReadAdapter:
asset=request.asset,
adapter=_adapter_info(request.options),
metadata=metadata,
capabilities=["read"],
capabilities=["read", "attachments"],
quality=NormalizationQuality(
lossiness="unknown" if has_error(diagnostics) else "low",
confidence=0.9 if not has_error(diagnostics) else 0.0,
warnings=_warning_count(diagnostics),
metadata={
"manifest_items": len(package.manifest) if package else 0,
"attachment_candidates": _attachment_candidate_count(package.manifest) if package else 0,
},
),
diagnostics=diagnostics,
valid=not has_error(diagnostics),
@@ -122,6 +143,7 @@ class Epub3ReadAdapter:
skip_boilerplate = bool(request.options.get("skip_boilerplate", True))
try:
with zipfile.ZipFile(Path(request.asset.path or request.asset.uri)) as archive:
attachments = _extract_attachments(archive, request.asset, package, diagnostics)
for order, item_id in enumerate(package.spine):
item = package.manifest.get(item_id)
if item is None:
@@ -187,7 +209,10 @@ class Epub3ReadAdapter:
confidence=0.9 if not has_error(diagnostics) else 0.0,
skipped_items=sum(1 for diagnostic in diagnostics if diagnostic.code == "source.epub3.skipped_boilerplate"),
warnings=_warning_count(diagnostics),
metadata={"extraction": "epub3-stdlib-xhtml"},
metadata={
"extraction": "epub3-stdlib-xhtml",
"attachment_count": len(attachments),
},
)
adapter = _adapter_info(request.options)
document = NormalizedMarkdownDocument(
@@ -203,9 +228,10 @@ class Epub3ReadAdapter:
source_uri=request.asset.uri,
source_path=request.asset.path,
digest=request.asset.digest,
metadata={"rootfile": package.rootfile_path},
metadata={"rootfile": package.rootfile_path, "attachment_count": len(attachments)},
)
],
attachments=attachments,
adapter=adapter,
cache_key=normalization_cache_key(
asset=request.asset,
@@ -393,6 +419,97 @@ def _extract_nav_labels(
return labels


def _extract_attachments(
    archive: zipfile.ZipFile,
    asset: SourceAsset,
    package: EpubPackage,
    diagnostics: list[Diagnostic],
) -> list[SourceAsset]:
    attachments: list[SourceAsset] = []
    for item in sorted(package.manifest.values(), key=lambda value: value.get("href", "")):
        media_type = item.get("media_type", "")
        href = item.get("href", "")
        if not href or media_type in XHTML_MEDIA_TYPES:
            continue
        package_path = _resolve_package_path(package.rootfile_path, href)
        if not _is_attachment_media(media_type):
            diagnostics.append(
                _warning(
                    asset,
                    "source.epub3.skipped_resource",
                    f"Skipped unsupported EPUB resource media type `{media_type}`.",
                    details={
                        "href": href,
                        "package_path": package_path,
                        "media_type": media_type,
                        "manifest_id": item.get("id"),
                    },
                )
            )
            continue
        try:
            data = archive.read(package_path)
        except KeyError:
            diagnostics.append(
                _warning(
                    asset,
                    "source.epub3.missing_resource",
                    f"EPUB resource `{package_path}` is declared but missing.",
                    details={
                        "href": href,
                        "package_path": package_path,
                        "media_type": media_type,
                        "manifest_id": item.get("id"),
                    },
                )
            )
            continue
        attachments.append(
            source_asset_from_member(
                container_uri=asset.uri,
                member_path=package_path,
                data=data,
                media_type=media_type,
                source_adapter="source.epub3",
                source_role=_attachment_role(media_type),
                metadata={
                    "href": href,
                    "manifest_id": item.get("id"),
                    "properties": item.get("properties", ""),
                    "container_path": asset.path,
                    "render_manifest_compatible": True,
                },
            )
        )
    return attachments


def _attachment_candidate_count(manifest: dict[str, dict[str, str]]) -> int:
    return sum(1 for item in manifest.values() if _is_attachment_media(item.get("media_type", "")))


def _is_attachment_media(media_type: str) -> bool:
    normalized = media_type.lower()
    return normalized in EPUB_ATTACHMENT_MEDIA_TYPES or any(
        normalized.startswith(prefix) for prefix in EPUB_ATTACHMENT_MEDIA_PREFIXES
    )


def _attachment_role(media_type: str) -> str:
    normalized = media_type.lower()
    if normalized.startswith("image/"):
        return "image"
    if normalized == "text/css":
        return "stylesheet"
    if normalized.startswith("font/") or "font" in normalized or "opentype" in normalized:
        return "font"
    if normalized.startswith("audio/"):
        return "audio"
    if normalized.startswith("video/"):
        return "video"
    return "package-resource"


def _extract_segment(
archive: zipfile.ZipFile,
asset: SourceAsset,

@@ -26,6 +26,7 @@ from markitect_tool.source import (
)
from markitect_filter.adapters import pdf_adapter_descriptor
from markitect_filter.assets import bytes_digest, source_asset_from_member, source_asset_signal
PDF_HEADER_RE = re.compile(rb"%PDF-\d\.\d")
@@ -36,6 +37,9 @@ PAGES_TYPE_RE = re.compile(rb"/Type\s*/Pages\b")
REF_RE = re.compile(rb"(\d+)\s+\d+\s+R")
INFO_REF_RE = re.compile(rb"/Info\s+(\d+)\s+\d+\s+R")
COUNT_RE = re.compile(rb"/Count\s+(\d+)")
EMBEDDED_FILE_RE = re.compile(rb"/Type\s*/EmbeddedFile\b")
FILESPEC_RE = re.compile(rb"/Type\s*/Filespec\b")
EMBEDDED_FILE_REF_RE = re.compile(rb"/EF\s*<<.*?/F\s+(\d+)\s+\d+\s+R", re.DOTALL)
@dataclass(frozen=True)
@@ -53,6 +57,7 @@ class PdfPackage:
encrypted: bool
pages: list[PdfPage]
diagnostics: list[Diagnostic]
attachments: list[SourceAsset]
class PdfReadAdapter:
@@ -92,7 +97,7 @@ class PdfReadAdapter:
asset=request.asset,
adapter=_adapter_info(request.options),
metadata=package.metadata,
capabilities=["read"],
capabilities=["read", "attachments"],
quality=NormalizationQuality(
lossiness="unknown" if has_error(diagnostics) else "medium",
confidence=_confidence(package, diagnostics),
@@ -102,6 +107,9 @@ class PdfReadAdapter:
"page_count": package.page_count,
"pages_with_text": extracted_pages,
"encrypted": package.encrypted,
"attachment_count": len(package.attachments),
"image_signal_count": _attachment_role_count(package.attachments, "image-signal"),
"embedded_file_count": _attachment_role_count(package.attachments, "embedded-file"),
},
),
diagnostics=diagnostics,
@@ -188,6 +196,9 @@ class PdfReadAdapter:
"selected_pages": [page.number for page in selected_pages],
"pages_extracted": len(segments),
"page_coverage": page_coverage,
"attachment_count": len(package.attachments),
"image_signal_count": _attachment_role_count(package.attachments, "image-signal"),
"embedded_file_count": _attachment_role_count(package.attachments, "embedded-file"),
},
)
document = NormalizedMarkdownDocument(
@@ -203,9 +214,10 @@ class PdfReadAdapter:
source_uri=request.asset.uri,
source_path=request.asset.path,
digest=request.asset.digest,
metadata={"page_count": package.page_count},
metadata={"page_count": package.page_count, "attachment_count": len(package.attachments)},
)
],
attachments=package.attachments,
adapter=_adapter_info(request.options),
cache_key=normalization_cache_key(
asset=request.asset,
@@ -227,6 +239,7 @@ def _load_pdf(asset: SourceAsset) -> PdfPackage:
page_count=0,
encrypted=False,
pages=[],
attachments=[],
diagnostics=[
_pdf_error(
asset,
@@ -243,6 +256,7 @@ def _load_pdf(asset: SourceAsset) -> PdfPackage:
page_count=0,
encrypted=False,
pages=[],
attachments=[],
diagnostics=[_malformed(asset, "PDF does not start with a PDF header.")],
)
@@ -255,6 +269,7 @@ def _load_pdf(asset: SourceAsset) -> PdfPackage:
page_count=_page_count(objects),
encrypted=True,
pages=[],
attachments=[],
diagnostics=[
_pdf_error(
asset,
@@ -267,10 +282,30 @@ def _load_pdf(asset: SourceAsset) -> PdfPackage:
page_ids = _page_object_ids(objects)
page_count = _page_count(objects) or len(page_ids)
pages: list[PdfPage] = []
attachments = _embedded_file_assets(objects, asset, diagnostics)
for page_number, object_id in enumerate(page_ids, start=1):
page_body = objects[object_id]
page_diagnostics: list[Diagnostic] = []
content_ids = _content_refs(page_body)
image_object_ids = _image_object_ids(page_body, objects, content_ids)
if image_object_ids:
attachments.append(
_image_signal_asset(
asset,
page_number=page_number,
page_object_id=object_id,
image_object_ids=image_object_ids,
digest_parts=[page_body, *[objects.get(image_id, b"") for image_id in image_object_ids]],
)
)
page_diagnostics.append(
_warning(
asset,
"source.pdf.image_resource_signal",
f"PDF page {page_number} references image resources; binary extraction is not performed.",
details={"page": page_number, "image_objects": image_object_ids},
)
)
text_parts: list[str] = []
if not content_ids and STREAM_RE.search(page_body):
stream = _stream_data(page_body, asset, page_diagnostics)
@@ -312,6 +347,126 @@ def _load_pdf(asset: SourceAsset) -> PdfPackage:
encrypted=False,
pages=pages,
diagnostics=diagnostics,
attachments=attachments,
)


def _embedded_file_assets(
    objects: dict[int, bytes],
    asset: SourceAsset,
    diagnostics: list[Diagnostic],
) -> list[SourceAsset]:
    file_names = _embedded_file_names(objects)
    attachments: list[SourceAsset] = []
    for object_id, body in sorted(objects.items()):
        if not EMBEDDED_FILE_RE.search(body):
            continue
        attachment_diagnostics: list[Diagnostic] = []
        stream = _stream_data(body, asset, attachment_diagnostics)
        diagnostics.extend(attachment_diagnostics)
        if not stream:
            diagnostics.append(
                _warning(
                    asset,
                    "source.pdf.embedded_file_unreadable",
                    f"PDF embedded file object {object_id} does not expose readable bytes.",
                    details={"object_id": object_id},
                )
            )
            continue
        name = file_names.get(object_id) or f"embedded-{object_id}.bin"
        attachments.append(
            source_asset_from_member(
                container_uri=asset.uri,
                member_path=f"embedded/{name}",
                data=stream,
                media_type=None,
                source_adapter="source.pdf",
                source_role="embedded-file",
                metadata={
                    "container_path": asset.path,
                    "pdf_object": object_id,
                    "embedded_file_name": name,
                    "render_manifest_compatible": True,
                },
            )
        )
    return attachments


def _embedded_file_names(objects: dict[int, bytes]) -> dict[int, str]:
    names: dict[int, str] = {}
    for body in objects.values():
        if not FILESPEC_RE.search(body):
            continue
        ref_match = EMBEDDED_FILE_REF_RE.search(body)
        if ref_match is None:
            continue
        object_id = int(ref_match.group(1))
        names[object_id] = _file_spec_name(body) or f"embedded-{object_id}.bin"
    return names


def _file_spec_name(body: bytes) -> str | None:
    for key in ("UF", "F"):
        literal_match = re.search(rb"/" + key.encode("ascii") + rb"\s*(\()", body)
        if literal_match:
            value, _ = _read_literal_string(body, literal_match.start(1))
        else:
            value = _metadata_value(body, key)
        if value:
            return re.sub(r"[\\/:]+", "-", value).strip() or None
    return None


def _image_object_ids(
    page_body: bytes,
    objects: dict[int, bytes],
    content_ids: list[int],
) -> list[int]:
    page_refs = {
        int(match.group(1))
        for match in REF_RE.finditer(page_body)
    }
    image_refs = sorted(
        object_id
        for object_id in page_refs
        if re.search(rb"/Subtype\s*/Image\b", objects.get(object_id, b""))
    )
    if image_refs:
        return image_refs
    haystack = page_body + b"\n" + b"\n".join(objects.get(ref, b"") for ref in content_ids)
    if not re.search(rb"/Subtype\s*/Image\b|\bDo\b", haystack):
        return []
    return sorted(
        object_id
        for object_id, body in objects.items()
        if re.search(rb"/Subtype\s*/Image\b", body)
    )


def _image_signal_asset(
    asset: SourceAsset,
    *,
    page_number: int,
    page_object_id: int,
    image_object_ids: list[int],
    digest_parts: list[bytes],
) -> SourceAsset:
    return source_asset_signal(
        container_uri=asset.uri,
        signal_id=f"page-{page_number:04d}/image-signal",
        media_type="application/x.markitect-pdf-image-signal",
        source_adapter="source.pdf",
        source_role="image-signal",
        digest_parts=digest_parts or [bytes_digest(",".join(map(str, image_object_ids)).encode("ascii")).encode("ascii")],
        metadata={
            "container_path": asset.path,
            "page": page_number,
            "pdf_object": page_object_id,
            "image_objects": image_object_ids,
            "render_manifest_compatible": True,
        },
    )


@@ -733,6 +888,10 @@ def _confidence(package: PdfPackage, diagnostics: list[Diagnostic]) -> float:
return max(0.1, 0.75 * coverage)
def _attachment_role_count(attachments: list[SourceAsset], role: str) -> int:
return sum(1 for attachment in attachments if attachment.metadata.get("source_role") == role)
def _warning(
asset: SourceAsset,
code: str,

@@ -31,6 +31,8 @@ def test_epub3_descriptor_matches_contract():
assert descriptor.extensions == [".epub"]
assert descriptor.safety["network"] is False
assert descriptor.option_schema["properties"]["skip_boilerplate"]["default"] is True
assert descriptor.quality_profile["attachments"] == "read-side-source-assets"
assert descriptor.metadata["render_asset_manifest_compatible"] is True
def test_epub3_adapter_matches_epub_assets(tmp_path: Path):
@@ -57,6 +59,7 @@ def test_epub3_adapter_inspects_metadata(tmp_path: Path):
assert result.metadata.language == "en"
assert result.metadata.identifiers["bookid"] == "urn:test-book"
assert result.quality.lossiness == "low"
assert result.quality.metadata["attachment_candidates"] == 2
def test_epub3_adapter_normalizes_spine_to_markdown(tmp_path: Path):
@@ -83,6 +86,16 @@ def test_epub3_adapter_normalizes_spine_to_markdown(tmp_path: Path):
"continuation",
]
assert result.document.segments[0].provenance[0].package_path == "EPUB/chapter1.xhtml"
assert [attachment.metadata["source_role"] for attachment in result.document.attachments] == [
"image",
"stylesheet",
]
assert result.document.attachments[0].media_type == "image/png"
assert result.document.attachments[1].media_type == "text/css"
assert result.document.attachments[0].digest.startswith("sha256:")
assert result.document.attachments[0].metadata["package_path"] == "EPUB/images/chart.png"
assert result.document.attachments[0].metadata["render_manifest_compatible"] is True
assert result.document.quality.metadata["attachment_count"] == 2
assert result.document.quality.lossiness == "none"
@@ -114,13 +127,28 @@ def test_epub3_adapter_reports_malformed_missing_container(tmp_path: Path):
assert "container.xml" in result.diagnostics[0].message
def test_epub3_adapter_reports_unsupported_package_resources(tmp_path: Path):
epub_path = _write_epub(tmp_path, include_unsupported_resource=True)
asset = SourceAsset.from_path(epub_path, media_type="application/epub+zip")
adapter = epub3_adapter_descriptor().instantiate()
result = adapter.read(SourceReadRequest(asset=asset))
assert result.is_valid
assert result.document is not None
assert any(
diagnostic.code == "source.epub3.skipped_resource"
for diagnostic in result.document.diagnostics
)
def test_epub3_entry_point_discovery_shape():
registry = discover_source_adapters([FakeEntryPoint()])
assert registry.get("source.epub3").name == "EPUB3"
def _write_epub(tmp_path: Path) -> Path:
def _write_epub(tmp_path: Path, *, include_unsupported_resource: bool = False) -> Path:
epub_path = tmp_path / "test-book.epub"
with zipfile.ZipFile(epub_path, "w") as archive:
archive.writestr("mimetype", "application/epub+zip")
@@ -150,13 +178,22 @@ def _write_epub(tmp_path: Path) -> Path:
<item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
<item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter2" href="chapter2.xhtml" media-type="application/xhtml+xml"/>
<item id="style" href="styles/book.css" media-type="text/css"/>
<item id="chart" href="images/chart.png" media-type="image/png"/>
{unsupported}
</manifest>
<spine>
<itemref idref="chapter1"/>
<itemref idref="chapter2"/>
</spine>
</package>
""",
""".format(
unsupported=(
'<item id="payload" href="data/payload.bin" media-type="application/x-custom-binary"/>'
if include_unsupported_resource
else ""
)
),
)
archive.writestr(
"EPUB/nav.xhtml",
@@ -203,4 +240,8 @@ def _write_epub(tmp_path: Path) -> Path:
</html>
""",
)
archive.writestr("EPUB/styles/book.css", "body { color: #111; }\n")
archive.writestr("EPUB/images/chart.png", b"\x89PNG\r\n\x1a\nfixture")
if include_unsupported_resource:
archive.writestr("EPUB/data/payload.bin", b"custom")
return epub_path

@@ -32,6 +32,9 @@ def test_pdf_descriptor_matches_contract():
assert descriptor.safety["external_process"] is False
assert descriptor.option_schema["properties"]["include_page_breaks"]["default"] is False
assert descriptor.metadata["dependency_profile"] == "stdlib"
assert descriptor.metadata["render_asset_manifest_compatible"] is True
assert descriptor.quality_profile["attachments"] == "metadata-with-digest"
assert descriptor.quality_profile["images"] == "signal-only"
def test_pdf_adapter_matches_pdf_assets(tmp_path: Path):
@@ -60,6 +63,7 @@ def test_pdf_adapter_inspects_metadata(tmp_path: Path):
assert result.quality.lossiness == "medium"
assert result.quality.metadata["page_count"] == 2
assert result.quality.metadata["pages_with_text"] == 2
assert result.quality.metadata["attachment_count"] == 0
def test_pdf_adapter_normalizes_pages_to_markdown(tmp_path: Path):
@@ -80,6 +84,7 @@ def test_pdf_adapter_normalizes_pages_to_markdown(tmp_path: Path):
assert result.document.segments[0].provenance[0].page == "1"
assert result.document.quality.lossiness == "low"
assert result.document.quality.metadata["page_coverage"] == 1.0
assert result.document.attachments == []
def test_pdf_adapter_applies_page_range_and_page_markers(tmp_path: Path):
@@ -137,17 +142,57 @@ def test_pdf_adapter_reports_encrypted_pdf(tmp_path: Path):
assert result.diagnostics[0].code == "source.pdf.encrypted"
def test_pdf_adapter_reports_embedded_files_and_image_signals(tmp_path: Path):
pdf_path = _write_pdf(tmp_path, embedded_file=True, image_signal=True)
asset = SourceAsset.from_path(pdf_path, media_type="application/pdf")
adapter = pdf_adapter_descriptor().instantiate()
result = adapter.read(SourceReadRequest(asset=asset))
assert result.is_valid
assert result.document is not None
assert [attachment.metadata["source_role"] for attachment in result.document.attachments] == [
"embedded-file",
"image-signal",
]
embedded = result.document.attachments[0]
signal = result.document.attachments[1]
assert embedded.name == "attachment.txt"
assert embedded.media_type == "text/plain"
assert embedded.digest.startswith("sha256:")
assert embedded.metadata["render_manifest_compatible"] is True
assert signal.media_type == "application/x.markitect-pdf-image-signal"
assert signal.metadata["page"] == 1
assert signal.metadata["signal_only"] is True
assert result.document.quality.metadata["attachment_count"] == 2
assert result.document.quality.metadata["embedded_file_count"] == 1
assert result.document.quality.metadata["image_signal_count"] == 1
assert any(
diagnostic.code == "source.pdf.image_resource_signal"
for diagnostic in result.document.diagnostics
)
def test_pdf_entry_point_discovery_shape():
registry = discover_source_adapters([FakeEntryPoint()])
assert registry.get("source.pdf").name == "PDF"
def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
def _write_pdf(
tmp_path: Path,
*,
encrypted: bool = False,
embedded_file: bool = False,
image_signal: bool = False,
) -> Path:
pdf_path = tmp_path / ("encrypted.pdf" if encrypted else "fixture.pdf")
objects: list[tuple[int, bytes]] = []
page_refs = []
next_id = 3
font_id = 100
info_id = 101
encrypt_id = 102
for page_number, lines in enumerate(
[
["Hello PDF", "Second line"],
@@ -159,13 +204,20 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
content_id = next_id + 1
next_id += 2
page_refs.append(f"{page_id} 0 R")
stream = _page_stream(lines)
include_image = image_signal and page_number == 1
image_id = None
if include_image:
image_id = next_id
next_id += 1
stream = _page_stream(lines, draw_image=include_image)
xobject = f" /XObject << /Im1 {image_id} 0 R >>" if image_id else ""
objects.append(
(
page_id,
(
f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
f"/Resources << /Font << /F1 7 0 R >> >> /Contents {content_id} 0 R >>"
f"/Resources << /Font << /F1 {font_id} 0 R >>{xobject} >> "
f"/Contents {content_id} 0 R >>"
).encode("ascii"),
)
)
@@ -179,6 +231,44 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
+ b"\nendstream",
)
)
if image_id:
image_stream = b"\x00\x00\x00"
objects.append(
(
image_id,
b"<< /Type /XObject /Subtype /Image /Width 1 /Height 1 "
b"/ColorSpace /DeviceGray /BitsPerComponent 8 /Length "
+ str(len(image_stream)).encode("ascii")
+ b" >>\nstream\n"
+ image_stream
+ b"\nendstream",
)
)
if embedded_file:
embedded_id = next_id
filespec_id = next_id + 1
next_id += 2
embedded_stream = b"attached text"
objects.append(
(
embedded_id,
b"<< /Type /EmbeddedFile /Length "
+ str(len(embedded_stream)).encode("ascii")
+ b" >>\nstream\n"
+ embedded_stream
+ b"\nendstream",
)
)
objects.append(
(
filespec_id,
(
f"<< /Type /Filespec /F (attachment.txt) "
f"/EF << /F {embedded_id} 0 R >> >>"
).encode("ascii"),
)
)
objects.extend(
[
@@ -190,9 +280,9 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
f"/Count {len(page_refs)} >>"
).encode("ascii"),
),
(7, b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"),
(font_id, b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"),
(
8,
info_id,
b"<< /Title (PDF Fixture) /Author (Ada Lovelace) "
b"/Subject (Source Adapter Test) /Keywords (markitect pdf) "
b"/Producer (markitect-filter tests) /CreationDate (D:20260514093000Z) >>",
@@ -200,7 +290,7 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
]
)
if encrypted:
objects.append((9, b"<< /Filter /Standard /V 1 /R 2 >>"))
objects.append((encrypt_id, b"<< /Filter /Standard /V 1 /R 2 >>"))
objects.sort(key=lambda item: item[0])
header = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"
@@ -218,9 +308,9 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
content.extend(b"0000000000 65535 f \n")
for object_id in range(1, max_id + 1):
content.extend(f"{offsets.get(object_id, 0):010d} 00000 n \n".encode("ascii"))
trailer = f"trailer\n<< /Size {max_id + 1} /Root 1 0 R /Info 8 0 R".encode("ascii")
trailer = f"trailer\n<< /Size {max_id + 1} /Root 1 0 R /Info {info_id} 0 R".encode("ascii")
if encrypted:
trailer += b" /Encrypt 9 0 R"
trailer += f" /Encrypt {encrypt_id} 0 R".encode("ascii")
trailer += b" >>\n"
content.extend(trailer)
content.extend(f"startxref\n{xref_offset}\n%%EOF\n".encode("ascii"))
@@ -228,13 +318,15 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
return pdf_path
def _page_stream(lines: list[str]) -> bytes:
def _page_stream(lines: list[str], *, draw_image: bool = False) -> bytes:
parts = ["BT", "/F1 12 Tf", "72 720 Td"]
for index, line in enumerate(lines):
if index:
parts.append("T*")
parts.append(f"({_pdf_literal(line)}) Tj")
parts.append("ET")
if draw_image:
parts.extend(["q", "10 0 0 10 72 640 cm", "/Im1 Do", "Q"])
return "\n".join(parts).encode("ascii")

@@ -3,10 +3,10 @@ id: MKTF-WP-0003
type: workplan
title: "Source Attachment Manifest Compatibility"
domain: markitect
status: todo
status: done
owner: markitect-filter
topic_slug: markitect
planning_priority: P2
planning_priority: complete
planning_order: 30
depends_on_workplans:
- MKTF-WP-0001
@@ -56,11 +56,27 @@ render asset manifest.
Those responsibilities belong to `markitect-tool` contracts,
`markitect-quarkdown` render integration, or later runtime/publication systems.
## Implementation Summary
Completed as a read-side attachment metadata compatibility slice:
- Added shared source attachment metadata helpers and exported
`markitect-filter.source-attachment.v1`.
- EPUB3 read results now populate `NormalizedMarkdownDocument.attachments` for
package images, stylesheets, fonts, audio, and video with byte size, digest,
package path, manifest id, href, and render-manifest compatibility metadata.
- PDF read results now populate attachments for embedded file streams and
signal-only image resources where the stdlib parser can detect them.
- Unsupported EPUB resources, missing EPUB resources, PDF image signals, and
unreadable embedded files produce structured diagnostics.
- Docs, handoff fixtures, adapter descriptors, README notes, and tests were
added without introducing renderer/export behavior.
## P3.1 - Align attachment metadata with Markitect source contracts
```task
id: MKTF-WP-0003-T001
status: todo
status: done
priority: high
state_hub_task_id: "d119daca-8141-4662-8ad7-ce43ccd79044"
```
@@ -76,7 +92,7 @@ run with `markitect-tool` on `PYTHONPATH`.
```task
id: MKTF-WP-0003-T002
status: todo
status: done
priority: medium
state_hub_task_id: "ebcbf480-210d-46e7-a4e4-fbe7e9baa39a"
```
@@ -91,7 +107,7 @@ Output: EPUB3 attachment metadata, provenance, diagnostics, and fixtures.
```task
id: MKTF-WP-0003-T003
status: todo
status: done
priority: medium
state_hub_task_id: "d8b7b820-387f-4d45-bf22-296b227f917a"
```
@@ -108,7 +124,7 @@ Output: PDF metadata conventions, diagnostics, and tests.
```task
id: MKTF-WP-0003-T004
status: todo
status: done
priority: high
state_hub_task_id: "ca539c01-c272-4635-8f60-86f870bbef0c"
```
@@ -128,7 +144,7 @@ Output: deterministic digest/provenance tests.
```task
id: MKTF-WP-0003-T005
status: todo
status: done
priority: medium
state_hub_task_id: "f2213a20-ce6f-4e16-9b9b-557b99f8b4d1"
```