generated from coulomb/repo-seed
Add source attachment metadata compatibility
@@ -28,3 +28,9 @@ pdf = "markitect_filter.adapters:pdf_adapter_descriptor"
 The first PDF slice is stdlib-only and targets deterministic text extraction
 from local, digitally-readable PDFs. OCR, scanned-document recognition, and
 layout-perfect reconstruction are intentionally deferred.
+
+Read-side attachment metadata is exposed through
+`NormalizedMarkdownDocument.attachments` for EPUB3 package resources, PDF
+embedded files, and PDF image-resource signals. See
+`docs/source-attachment-metadata.md` for the handoff contract to passive render
+asset manifests.
@@ -23,7 +23,7 @@ native system services, or renderer-specific tooling.
 - Scanned or image-only PDFs that require OCR.
 - Encrypted or permission-restricted PDFs.
 - Pixel-perfect layout reconstruction.
-- Table, figure, annotation, form, signature, and attachment extraction.
+- Table, figure, annotation, form, signature, and rich attachment extraction.
 - PDF writing/export.

 ## Options
@@ -43,3 +43,9 @@ and originating PDF page object id.
 Quality metadata records the extraction backend, document page count, selected
 pages, extracted page count, page coverage, skipped pages, warning count,
 lossiness, and confidence.
+
+`NormalizedMarkdownDocument.attachments` may include read-side metadata for
+embedded file streams and image-resource signals when the stdlib parser can
+detect them. Embedded files include byte size and digest. Image resources are
+signal-only descriptors with page/object provenance; the adapter does not
+extract image bytes or perform OCR.
81
docs/source-attachment-metadata.md
Normal file
@@ -0,0 +1,81 @@
# Source Attachment Metadata

`markitect-filter` exposes read-side attachment metadata through
`NormalizedMarkdownDocument.attachments`. These entries are
`markitect_tool.source.SourceAsset` objects, so `markitect-tool` can consume
them when building passive render asset manifests.

The metadata schema marker is:

```text
markitect-filter.source-attachment.v1
```

## Common Fields

Attachment entries should preserve:

- `uri`: stable source package or document member URI
- `path`: package member path or signal path
- `name`: member filename or signal label
- `media_type` and `extension` when known
- `size` when bytes are available, and `digest` of those bytes (or of the
  detected signal for signal-only entries)
- `metadata.source_adapter`: adapter id such as `source.epub3` or `source.pdf`
- `metadata.source_role`: logical read-side role
- `metadata.package_path`, `metadata.page`, `metadata.pdf_object`, or related
  provenance coordinates when known
- `metadata.render_manifest_compatible: true` when the entry can feed
  `RenderAsset.from_source_asset`

These entries describe source-side resources only. They do not imply output
paths, copy execution, final artifact locations, or publication state.

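As a rough illustration of the field list above, the following sketch checks a plain-dict rendering of an attachment entry for the common fields. `check_common_fields`, `REQUIRED_TOP_LEVEL`, and `REQUIRED_METADATA` are hypothetical names introduced here; they are not part of `markitect-filter` or `markitect-tool`, and the "size needs a digest" rule is an assumption drawn from the list rather than a documented constraint.

```python
from __future__ import annotations

from typing import Any

SCHEMA_MARKER = "markitect-filter.source-attachment.v1"

# Hypothetical field groups, derived from the list above.
REQUIRED_TOP_LEVEL = ("uri", "path", "name")
REQUIRED_METADATA = ("source_adapter", "source_role")


def check_common_fields(entry: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in one attachment entry."""
    problems: list[str] = []
    for key in REQUIRED_TOP_LEVEL:
        if not entry.get(key):
            problems.append(f"missing {key}")
    metadata = entry.get("metadata") or {}
    if metadata.get("schema_version") != SCHEMA_MARKER:
        problems.append("unexpected schema marker")
    for key in REQUIRED_METADATA:
        if not metadata.get(key):
            problems.append(f"missing metadata.{key}")
    # Assumption: a byte size is only meaningful next to a digest of those bytes.
    if entry.get("size") is not None and not entry.get("digest"):
        problems.append("size without digest")
    return problems


entry = {
    "uri": "fixture.epub!/EPUB/images/chart.png",
    "path": "EPUB/images/chart.png",
    "name": "chart.png",
    "media_type": "image/png",
    "size": 15,
    "digest": "sha256:example-chart",
    "metadata": {
        "schema_version": SCHEMA_MARKER,
        "source_adapter": "source.epub3",
        "source_role": "image",
        "render_manifest_compatible": True,
    },
}
print(check_common_fields(entry))
```

A check like this would sit on the consumer side; it inspects descriptors only and never touches the source package.
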
## EPUB3

The EPUB3 adapter records manifest resources for images, stylesheets, fonts,
audio, and video when the package entry exists and can be read cheaply from the
ZIP archive. It stores byte size and SHA-256 digest for each collected resource.

Unsupported non-XHTML package resources produce
`source.epub3.skipped_resource` warnings. Declared but missing resources produce
`source.epub3.missing_resource` warnings.

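The collection rule described above can be approximated with the standard library alone. This is a minimal sketch, assuming a manifest given as `(archive_path, media_type)` pairs that are already resolved against the package root; `classify` and `collect` are hypothetical names, and the adapter's own role mapping and diagnostics are richer than shown here.

```python
from __future__ import annotations

import hashlib
import io
import zipfile

XHTML_TYPES = {"application/xhtml+xml", "text/html"}
PREFIX_ROLES = {"image/": "image", "font/": "font", "audio/": "audio", "video/": "video"}


def classify(media_type: str) -> str | None:
    """Map a media type to a read-side role, or None when unsupported."""
    normalized = media_type.lower()
    if normalized == "text/css":
        return "stylesheet"
    for prefix, role in PREFIX_ROLES.items():
        if normalized.startswith(prefix):
            return role
    return None


def collect(archive: zipfile.ZipFile, manifest: list[tuple[str, str]]):
    """Return (attachments, warnings) for non-XHTML manifest entries."""
    attachments, warnings = [], []
    for path, media_type in sorted(manifest):
        if media_type in XHTML_TYPES:
            continue  # reading-order content, not an attachment
        role = classify(media_type)
        if role is None:
            warnings.append(("source.epub3.skipped_resource", path))
            continue
        try:
            data = archive.read(path)
        except KeyError:  # declared in the manifest but absent from the ZIP
            warnings.append(("source.epub3.missing_resource", path))
            continue
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        attachments.append({"path": path, "role": role, "size": len(data), "digest": digest})
    return attachments, warnings


# Build a tiny in-memory archive that contains only the stylesheet.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("EPUB/styles/book.css", "body { color: #111; }\n")

manifest = [
    ("EPUB/styles/book.css", "text/css"),
    ("EPUB/images/chart.png", "image/png"),  # declared but never written
    ("EPUB/data/payload.bin", "application/x-custom-binary"),
]
with zipfile.ZipFile(buffer) as archive:
    attachments, warnings = collect(archive, manifest)
print(attachments, warnings)
```

Sorting the manifest before reading keeps the attachment order deterministic for a given package, which is useful when envelopes are compared across runs.
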
## PDF

The PDF adapter records embedded file streams when a stdlib scan can identify
`Filespec` and `EmbeddedFile` objects. It stores the member path, a media type
guessed from the filename, size, digest, object id, and source role
`embedded-file`. The embedded bytes themselves are not kept on the entry.

For image resources, the stdlib slice records signal-only entries with source
role `image-signal`. These entries preserve page/object provenance and a stable
digest of the detected page/resource signal, but they do not extract image
bytes. Image signals emit `source.pdf.image_resource_signal` warnings so callers
know the adapter detected media that it did not extract.

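How a stdlib scan might pair a `Filespec` dictionary with its `EmbeddedFile` stream can be illustrated on hand-written object bodies. The sketch below is only an approximation: the `objects` mapping is made-up sample data, `embedded_refs` and `NAME_RE` are hypothetical, and real PDFs (object streams, name trees, hex or escaped strings) need considerably more care.

```python
from __future__ import annotations

import re

FILESPEC_RE = re.compile(rb"/Type\s*/Filespec\b")
EMBEDDED_FILE_RE = re.compile(rb"/Type\s*/EmbeddedFile\b")
# Follows `/EF << ... /F <obj> <gen> R` to the stream object id.
EF_REF_RE = re.compile(rb"/EF\s*<<.*?/F\s+(\d+)\s+\d+\s+R", re.DOTALL)
# Simplified: only matches an unescaped literal string after /F.
NAME_RE = re.compile(rb"/F\s*\(([^)]*)\)")


def embedded_refs(objects: dict[int, bytes]) -> dict[int, str]:
    """Map EmbeddedFile object ids to the file name declared by a Filespec."""
    names: dict[int, str] = {}
    for body in objects.values():
        if not FILESPEC_RE.search(body):
            continue
        ref = EF_REF_RE.search(body)
        if ref is None:
            continue
        target = int(ref.group(1))
        if not EMBEDDED_FILE_RE.search(objects.get(target, b"")):
            continue  # reference does not point at an EmbeddedFile stream
        name = NAME_RE.search(body)
        names[target] = name.group(1).decode("latin-1") if name else f"embedded-{target}.bin"
    return names


objects = {
    7: b"<< /Type /Filespec /F (attachment.txt) /EF << /F 8 0 R >> >>",
    8: b"<< /Type /EmbeddedFile /Length 13 >>\nstream\nhello, world!\nendstream",
}
print(embedded_refs(objects))
```

Checking that the referenced object really is an `EmbeddedFile` is a choice made for this sketch; it trades a little recall for fewer false positives on malformed files.
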
## Render Manifest Handoff

`markitect-tool` can convert attachment entries to passive render assets:

```python
from markitect_tool.render import RenderAsset

render_assets = [
    RenderAsset.from_source_asset(asset, role=asset.metadata["source_role"])
    for asset in document.attachments
]
```

The resulting render assets remain passive descriptors. Asset copying,
renderer output references, link rewriting, and final artifact validation stay
outside `markitect-filter`.

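A consumer may want to separate entries that carry byte-level facts from signal-only entries before planning any copy step. The sketch below works on plain dictionaries rather than real `SourceAsset` objects; `partition_attachments` is a hypothetical helper, and treating `signal_only` entries as non-copyable is an assumption about downstream policy, not something `markitect-filter` enforces.

```python
from __future__ import annotations

from typing import Any


def partition_attachments(attachments: list[dict[str, Any]]):
    """Split handoff-compatible attachments into (byte_backed, signal_only)."""
    byte_backed, signal_only = [], []
    for entry in attachments:
        metadata = entry.get("metadata", {})
        if not metadata.get("render_manifest_compatible"):
            continue  # not offered for handoff at all
        if metadata.get("signal_only"):
            signal_only.append(entry)
        else:
            byte_backed.append(entry)
    return byte_backed, signal_only


attachments = [
    {
        "name": "attachment.txt",
        "size": 13,
        "metadata": {"source_role": "embedded-file", "render_manifest_compatible": True},
    },
    {
        "name": "image-signal",
        "metadata": {
            "source_role": "image-signal",
            "signal_only": True,
            "render_manifest_compatible": True,
        },
    },
]
byte_backed, signal_only = partition_attachments(attachments)
print([e["name"] for e in byte_backed], [e["name"] for e in signal_only])
```
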
Example normalized attachment envelopes live in:

- `examples/source-attachments/epub3-attachments.normalized.yaml`
- `examples/source-attachments/pdf-attachments.normalized.yaml`

Cross-repo validation can be run from this checkout with:

```bash
PYTHONPATH=src:/home/worsch/markitect-tool/src python3 -m pytest
```
@@ -0,0 +1,40 @@
schema_version: markitect.source.v1
document_id: source.epub3:fixture
adapter:
  id: source.epub3
  version: "1"
attachments:
  - uri: fixture.epub!/EPUB/images/chart.png
    path: EPUB/images/chart.png
    name: chart.png
    media_type: image/png
    extension: .png
    size: 15
    digest: sha256:example-chart
    metadata:
      schema_version: markitect-filter.source-attachment.v1
      source_adapter: source.epub3
      source_role: image
      package_path: EPUB/images/chart.png
      href: images/chart.png
      manifest_id: chart
      render_manifest_compatible: true
  - uri: fixture.epub!/EPUB/styles/book.css
    path: EPUB/styles/book.css
    name: book.css
    media_type: text/css
    extension: .css
    size: 22
    digest: sha256:example-css
    metadata:
      schema_version: markitect-filter.source-attachment.v1
      source_adapter: source.epub3
      source_role: stylesheet
      package_path: EPUB/styles/book.css
      href: styles/book.css
      manifest_id: style
      render_manifest_compatible: true
render_asset_manifest_handoff:
  compatible_schema: markitect.render.reference.v1
  core_asset_copying: false
  renderer_required: false
40
examples/source-attachments/pdf-attachments.normalized.yaml
Normal file
@@ -0,0 +1,40 @@
schema_version: markitect.source.v1
document_id: source.pdf:fixture
adapter:
  id: source.pdf
  version: "1"
attachments:
  - uri: fixture.pdf!/embedded/attachment.txt
    path: embedded/attachment.txt
    name: attachment.txt
    media_type: text/plain
    extension: .txt
    size: 13
    digest: sha256:example-embedded-file
    metadata:
      schema_version: markitect-filter.source-attachment.v1
      source_adapter: source.pdf
      source_role: embedded-file
      package_path: embedded/attachment.txt
      pdf_object: 8
      embedded_file_name: attachment.txt
      render_manifest_compatible: true
  - uri: fixture.pdf#page-0001/image-signal
    path: page-0001/image-signal
    name: image-signal
    media_type: application/x.markitect-pdf-image-signal
    digest: sha256:example-image-signal
    metadata:
      schema_version: markitect-filter.source-attachment.v1
      source_adapter: source.pdf
      source_role: image-signal
      signal_only: true
      page: 1
      pdf_object: 3
      image_objects:
        - 5
      render_manifest_compatible: true
render_asset_manifest_handoff:
  compatible_schema: markitect.render.reference.v1
  core_asset_copying: false
  renderer_required: false
@@ -1,5 +1,10 @@
 """Concrete source-format adapters for Markitect."""

 from markitect_filter.adapters import epub3_adapter_descriptor, pdf_adapter_descriptor
+from markitect_filter.assets import SOURCE_ATTACHMENT_METADATA_VERSION

-__all__ = ["epub3_adapter_descriptor", "pdf_adapter_descriptor"]
+__all__ = [
+    "SOURCE_ATTACHMENT_METADATA_VERSION",
+    "epub3_adapter_descriptor",
+    "pdf_adapter_descriptor",
+]
@@ -41,12 +41,15 @@ def epub3_adapter_descriptor() -> SourceAdapterDescriptor:
         },
         quality_profile={
             "text_extraction": "stdlib-xhtml",
-            "images": "metadata-only",
-            "styles": "ignored",
+            "images": "metadata-with-digest",
+            "styles": "metadata-with-digest",
+            "fonts": "metadata-with-digest",
+            "attachments": "read-side-source-assets",
         },
         metadata={
             "format": "EPUB3",
             "dependency_profile": "stdlib",
+            "render_asset_manifest_compatible": True,
         },
     )

@@ -96,12 +99,14 @@ def pdf_adapter_descriptor() -> SourceAdapterDescriptor:
         },
         quality_profile={
             "text_extraction": "stdlib-pdf-text",
-            "images": "diagnostic-only",
+            "attachments": "metadata-with-digest",
+            "images": "signal-only",
             "styles": "ignored",
             "tables": "plain-text-only",
         },
         metadata={
             "format": "PDF",
             "dependency_profile": "stdlib",
+            "render_asset_manifest_compatible": True,
         },
     )
88
src/markitect_filter/assets.py
Normal file
@@ -0,0 +1,88 @@
"""Read-side source asset metadata helpers."""

from __future__ import annotations

import hashlib
import mimetypes
import posixpath
from pathlib import PurePosixPath
from typing import Any

from markitect_tool.source import SourceAsset


SOURCE_ATTACHMENT_METADATA_VERSION = "markitect-filter.source-attachment.v1"


def bytes_digest(data: bytes) -> str:
    """Return a Markitect-compatible sha256 digest for source bytes."""

    return "sha256:" + hashlib.sha256(data).hexdigest()


def source_member_uri(container_uri: str, member_path: str) -> str:
    """Return a stable URI for a source package member."""

    return f"{container_uri}!/{member_path}"


def source_asset_from_member(
    *,
    container_uri: str,
    member_path: str,
    data: bytes,
    media_type: str | None,
    source_adapter: str,
    source_role: str,
    metadata: dict[str, Any] | None = None,
) -> SourceAsset:
    """Build read-side metadata for a package member without extracting it."""

    name = PurePosixPath(member_path).name or member_path
    extension = PurePosixPath(name).suffix.lower() or None
    resolved_media_type = media_type or mimetypes.guess_type(name)[0] or "application/octet-stream"
    return SourceAsset(
        uri=source_member_uri(container_uri, member_path),
        path=member_path,
        name=name,
        media_type=resolved_media_type,
        extension=extension,
        size=len(data),
        digest=bytes_digest(data),
        metadata={
            "schema_version": SOURCE_ATTACHMENT_METADATA_VERSION,
            "source_adapter": source_adapter,
            "source_role": source_role,
            "package_path": posixpath.normpath(member_path),
            **(metadata or {}),
        },
    )


def source_asset_signal(
    *,
    container_uri: str,
    signal_id: str,
    media_type: str,
    source_adapter: str,
    source_role: str,
    digest_parts: list[bytes],
    metadata: dict[str, Any] | None = None,
) -> SourceAsset:
    """Build metadata for a detected source resource signal without bytes."""

    digest = bytes_digest(b"\n".join(digest_parts))
    return SourceAsset(
        uri=f"{container_uri}#{signal_id}",
        path=signal_id,
        name=signal_id.rsplit("/", 1)[-1],
        media_type=media_type,
        digest=digest,
        metadata={
            "schema_version": SOURCE_ATTACHMENT_METADATA_VERSION,
            "source_adapter": source_adapter,
            "source_role": source_role,
            "signal_only": True,
            **(metadata or {}),
        },
    )
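The two URI shapes and the digest convention used by these helpers can be tried without `markitect_tool` installed. The sketch below re-implements only the string and hashing logic; `member_uri`, `signal_uri`, and `signal_digest` are illustrative stand-ins, not names exported by the module.

```python
import hashlib


def bytes_digest(data: bytes) -> str:
    """Prefix a hex SHA-256 digest with its algorithm name."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def member_uri(container_uri: str, member_path: str) -> str:
    # Package members use the `container!/member` form.
    return f"{container_uri}!/{member_path}"


def signal_uri(container_uri: str, signal_id: str) -> str:
    # Signals use a fragment, since there is no extractable member.
    return f"{container_uri}#{signal_id}"


def signal_digest(parts: list[bytes]) -> str:
    # Parts are joined with newlines before hashing, so their order matters.
    return bytes_digest(b"\n".join(parts))


print(member_uri("fixture.epub", "EPUB/images/chart.png"))
print(signal_uri("fixture.pdf", "page-0001/image-signal"))
print(signal_digest([b"page", b"image"]) == signal_digest([b"image", b"page"]))
```

One property worth keeping in mind: because parts are joined with a newline, `[b"a\nb"]` and `[b"a", b"b"]` hash to the same value. Whether that matters depends on how callers choose their parts.
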
@@ -28,12 +28,29 @@ from markitect_tool.source import (
 )

 from markitect_filter.adapters import epub3_adapter_descriptor
+from markitect_filter.assets import source_asset_from_member


 XHTML_MEDIA_TYPES = {
     "application/xhtml+xml",
     "text/html",
 }
+EPUB_ATTACHMENT_MEDIA_PREFIXES = (
+    "audio/",
+    "font/",
+    "image/",
+    "video/",
+)
+EPUB_ATTACHMENT_MEDIA_TYPES = {
+    "application/font-sfnt",
+    "application/vnd.ms-opentype",
+    "application/x-font-ttf",
+    "font/otf",
+    "font/ttf",
+    "font/woff",
+    "font/woff2",
+    "text/css",
+}
 BOILERPLATE_HINTS = {
     "cover",
     "nav",
@@ -101,11 +118,15 @@ class Epub3ReadAdapter:
             asset=request.asset,
             adapter=_adapter_info(request.options),
             metadata=metadata,
-            capabilities=["read"],
+            capabilities=["read", "attachments"],
             quality=NormalizationQuality(
                 lossiness="unknown" if has_error(diagnostics) else "low",
                 confidence=0.9 if not has_error(diagnostics) else 0.0,
                 warnings=_warning_count(diagnostics),
+                metadata={
+                    "manifest_items": len(package.manifest) if package else 0,
+                    "attachment_candidates": _attachment_candidate_count(package.manifest) if package else 0,
+                },
             ),
             diagnostics=diagnostics,
             valid=not has_error(diagnostics),
@@ -122,6 +143,7 @@ class Epub3ReadAdapter:
         skip_boilerplate = bool(request.options.get("skip_boilerplate", True))
         try:
             with zipfile.ZipFile(Path(request.asset.path or request.asset.uri)) as archive:
+                attachments = _extract_attachments(archive, request.asset, package, diagnostics)
                 for order, item_id in enumerate(package.spine):
                     item = package.manifest.get(item_id)
                     if item is None:
@@ -187,7 +209,10 @@ class Epub3ReadAdapter:
             confidence=0.9 if not has_error(diagnostics) else 0.0,
             skipped_items=sum(1 for diagnostic in diagnostics if diagnostic.code == "source.epub3.skipped_boilerplate"),
             warnings=_warning_count(diagnostics),
-            metadata={"extraction": "epub3-stdlib-xhtml"},
+            metadata={
+                "extraction": "epub3-stdlib-xhtml",
+                "attachment_count": len(attachments),
+            },
         )
         adapter = _adapter_info(request.options)
         document = NormalizedMarkdownDocument(
@@ -203,9 +228,10 @@ class Epub3ReadAdapter:
                     source_uri=request.asset.uri,
                     source_path=request.asset.path,
                     digest=request.asset.digest,
-                    metadata={"rootfile": package.rootfile_path},
+                    metadata={"rootfile": package.rootfile_path, "attachment_count": len(attachments)},
                 )
             ],
+            attachments=attachments,
             adapter=adapter,
             cache_key=normalization_cache_key(
                 asset=request.asset,
@@ -393,6 +419,97 @@ def _extract_nav_labels(
     return labels


+def _extract_attachments(
+    archive: zipfile.ZipFile,
+    asset: SourceAsset,
+    package: EpubPackage,
+    diagnostics: list[Diagnostic],
+) -> list[SourceAsset]:
+    attachments: list[SourceAsset] = []
+    for item in sorted(package.manifest.values(), key=lambda value: value.get("href", "")):
+        media_type = item.get("media_type", "")
+        href = item.get("href", "")
+        if not href or media_type in XHTML_MEDIA_TYPES:
+            continue
+        package_path = _resolve_package_path(package.rootfile_path, href)
+        if not _is_attachment_media(media_type):
+            diagnostics.append(
+                _warning(
+                    asset,
+                    "source.epub3.skipped_resource",
+                    f"Skipped unsupported EPUB resource media type `{media_type}`.",
+                    details={
+                        "href": href,
+                        "package_path": package_path,
+                        "media_type": media_type,
+                        "manifest_id": item.get("id"),
+                    },
+                )
+            )
+            continue
+        try:
+            data = archive.read(package_path)
+        except KeyError:
+            diagnostics.append(
+                _warning(
+                    asset,
+                    "source.epub3.missing_resource",
+                    f"EPUB resource `{package_path}` is declared but missing.",
+                    details={
+                        "href": href,
+                        "package_path": package_path,
+                        "media_type": media_type,
+                        "manifest_id": item.get("id"),
+                    },
+                )
+            )
+            continue
+        attachments.append(
+            source_asset_from_member(
+                container_uri=asset.uri,
+                member_path=package_path,
+                data=data,
+                media_type=media_type,
+                source_adapter="source.epub3",
+                source_role=_attachment_role(media_type),
+                metadata={
+                    "href": href,
+                    "manifest_id": item.get("id"),
+                    "properties": item.get("properties", ""),
+                    "container_path": asset.path,
+                    "render_manifest_compatible": True,
+                },
+            )
+        )
+    return attachments
+
+
+def _attachment_candidate_count(manifest: dict[str, dict[str, str]]) -> int:
+    return sum(1 for item in manifest.values() if _is_attachment_media(item.get("media_type", "")))
+
+
+def _is_attachment_media(media_type: str) -> bool:
+    normalized = media_type.lower()
+    return normalized in EPUB_ATTACHMENT_MEDIA_TYPES or any(
+        normalized.startswith(prefix) for prefix in EPUB_ATTACHMENT_MEDIA_PREFIXES
+    )
+
+
+def _attachment_role(media_type: str) -> str:
+    normalized = media_type.lower()
+    if normalized.startswith("image/"):
+        return "image"
+    if normalized == "text/css":
+        return "stylesheet"
+    if normalized.startswith("font/") or "font" in normalized or "opentype" in normalized:
+        return "font"
+    if normalized.startswith("audio/"):
+        return "audio"
+    if normalized.startswith("video/"):
+        return "video"
+    return "package-resource"
+
+
 def _extract_segment(
     archive: zipfile.ZipFile,
     asset: SourceAsset,
@@ -26,6 +26,7 @@ from markitect_tool.source import (
 )

 from markitect_filter.adapters import pdf_adapter_descriptor
+from markitect_filter.assets import bytes_digest, source_asset_signal, source_asset_from_member


 PDF_HEADER_RE = re.compile(rb"%PDF-\d\.\d")
@@ -36,6 +37,9 @@ PAGES_TYPE_RE = re.compile(rb"/Type\s*/Pages\b")
 REF_RE = re.compile(rb"(\d+)\s+\d+\s+R")
 INFO_REF_RE = re.compile(rb"/Info\s+(\d+)\s+\d+\s+R")
 COUNT_RE = re.compile(rb"/Count\s+(\d+)")
+EMBEDDED_FILE_RE = re.compile(rb"/Type\s*/EmbeddedFile\b")
+FILESPEC_RE = re.compile(rb"/Type\s*/Filespec\b")
+EMBEDDED_FILE_REF_RE = re.compile(rb"/EF\s*<<.*?/F\s+(\d+)\s+\d+\s+R", re.DOTALL)


 @dataclass(frozen=True)
@@ -53,6 +57,7 @@ class PdfPackage:
     encrypted: bool
     pages: list[PdfPage]
     diagnostics: list[Diagnostic]
+    attachments: list[SourceAsset]


 class PdfReadAdapter:
@@ -92,7 +97,7 @@ class PdfReadAdapter:
             asset=request.asset,
             adapter=_adapter_info(request.options),
             metadata=package.metadata,
-            capabilities=["read"],
+            capabilities=["read", "attachments"],
             quality=NormalizationQuality(
                 lossiness="unknown" if has_error(diagnostics) else "medium",
                 confidence=_confidence(package, diagnostics),
@@ -102,6 +107,9 @@ class PdfReadAdapter:
                     "page_count": package.page_count,
                     "pages_with_text": extracted_pages,
                     "encrypted": package.encrypted,
+                    "attachment_count": len(package.attachments),
+                    "image_signal_count": _attachment_role_count(package.attachments, "image-signal"),
+                    "embedded_file_count": _attachment_role_count(package.attachments, "embedded-file"),
                 },
             ),
             diagnostics=diagnostics,
@@ -188,6 +196,9 @@ class PdfReadAdapter:
                 "selected_pages": [page.number for page in selected_pages],
                 "pages_extracted": len(segments),
                 "page_coverage": page_coverage,
+                "attachment_count": len(package.attachments),
+                "image_signal_count": _attachment_role_count(package.attachments, "image-signal"),
+                "embedded_file_count": _attachment_role_count(package.attachments, "embedded-file"),
             },
         )
         document = NormalizedMarkdownDocument(
@@ -203,9 +214,10 @@ class PdfReadAdapter:
                     source_uri=request.asset.uri,
                     source_path=request.asset.path,
                     digest=request.asset.digest,
-                    metadata={"page_count": package.page_count},
+                    metadata={"page_count": package.page_count, "attachment_count": len(package.attachments)},
                 )
             ],
+            attachments=package.attachments,
             adapter=_adapter_info(request.options),
             cache_key=normalization_cache_key(
                 asset=request.asset,
@@ -227,6 +239,7 @@ def _load_pdf(asset: SourceAsset) -> PdfPackage:
             page_count=0,
             encrypted=False,
             pages=[],
+            attachments=[],
             diagnostics=[
                 _pdf_error(
                     asset,
@@ -243,6 +256,7 @@ def _load_pdf(asset: SourceAsset) -> PdfPackage:
             page_count=0,
             encrypted=False,
             pages=[],
+            attachments=[],
             diagnostics=[_malformed(asset, "PDF does not start with a PDF header.")],
         )

@@ -255,6 +269,7 @@ def _load_pdf(asset: SourceAsset) -> PdfPackage:
             page_count=_page_count(objects),
             encrypted=True,
             pages=[],
+            attachments=[],
             diagnostics=[
                 _pdf_error(
                     asset,
@@ -267,10 +282,30 @@ def _load_pdf(asset: SourceAsset) -> PdfPackage:
     page_ids = _page_object_ids(objects)
     page_count = _page_count(objects) or len(page_ids)
     pages: list[PdfPage] = []
+    attachments = _embedded_file_assets(objects, asset, diagnostics)
     for page_number, object_id in enumerate(page_ids, start=1):
         page_body = objects[object_id]
         page_diagnostics: list[Diagnostic] = []
         content_ids = _content_refs(page_body)
+        image_object_ids = _image_object_ids(page_body, objects, content_ids)
+        if image_object_ids:
+            attachments.append(
+                _image_signal_asset(
+                    asset,
+                    page_number=page_number,
+                    page_object_id=object_id,
+                    image_object_ids=image_object_ids,
+                    digest_parts=[page_body, *[objects.get(image_id, b"") for image_id in image_object_ids]],
+                )
+            )
+            page_diagnostics.append(
+                _warning(
+                    asset,
+                    "source.pdf.image_resource_signal",
+                    f"PDF page {page_number} references image resources; binary extraction is not performed.",
+                    details={"page": page_number, "image_objects": image_object_ids},
+                )
+            )
         text_parts: list[str] = []
         if not content_ids and STREAM_RE.search(page_body):
             stream = _stream_data(page_body, asset, page_diagnostics)
@@ -312,6 +347,126 @@ def _load_pdf(asset: SourceAsset) -> PdfPackage:
         encrypted=False,
         pages=pages,
         diagnostics=diagnostics,
+        attachments=attachments,
     )


+def _embedded_file_assets(
+    objects: dict[int, bytes],
+    asset: SourceAsset,
+    diagnostics: list[Diagnostic],
+) -> list[SourceAsset]:
+    file_names = _embedded_file_names(objects)
+    attachments: list[SourceAsset] = []
+    for object_id, body in sorted(objects.items()):
+        if not EMBEDDED_FILE_RE.search(body):
+            continue
+        attachment_diagnostics: list[Diagnostic] = []
+        stream = _stream_data(body, asset, attachment_diagnostics)
+        diagnostics.extend(attachment_diagnostics)
+        if not stream:
+            diagnostics.append(
+                _warning(
+                    asset,
+                    "source.pdf.embedded_file_unreadable",
+                    f"PDF embedded file object {object_id} does not expose readable bytes.",
+                    details={"object_id": object_id},
+                )
+            )
+            continue
+        name = file_names.get(object_id) or f"embedded-{object_id}.bin"
+        attachments.append(
+            source_asset_from_member(
+                container_uri=asset.uri,
+                member_path=f"embedded/{name}",
+                data=stream,
+                media_type=None,
+                source_adapter="source.pdf",
+                source_role="embedded-file",
+                metadata={
+                    "container_path": asset.path,
+                    "pdf_object": object_id,
+                    "embedded_file_name": name,
+                    "render_manifest_compatible": True,
+                },
+            )
+        )
+    return attachments
+
+
+def _embedded_file_names(objects: dict[int, bytes]) -> dict[int, str]:
+    names: dict[int, str] = {}
+    for body in objects.values():
+        if not FILESPEC_RE.search(body):
+            continue
+        ref_match = EMBEDDED_FILE_REF_RE.search(body)
+        if ref_match is None:
+            continue
+        object_id = int(ref_match.group(1))
+        names[object_id] = _file_spec_name(body) or f"embedded-{object_id}.bin"
+    return names
+
+
+def _file_spec_name(body: bytes) -> str | None:
+    for key in ("UF", "F"):
+        literal_match = re.search(rb"/" + key.encode("ascii") + rb"\s*(\()", body)
+        if literal_match:
+            value, _ = _read_literal_string(body, literal_match.start(1))
+        else:
+            value = _metadata_value(body, key)
+        if value:
+            return re.sub(r"[\\/:]+", "-", value).strip() or None
+    return None
+
+
+def _image_object_ids(
+    page_body: bytes,
+    objects: dict[int, bytes],
+    content_ids: list[int],
+) -> list[int]:
+    page_refs = {
+        int(match.group(1))
+        for match in REF_RE.finditer(page_body)
+    }
+    image_refs = sorted(
+        object_id
+        for object_id in page_refs
+        if re.search(rb"/Subtype\s*/Image\b", objects.get(object_id, b""))
+    )
+    if image_refs:
+        return image_refs
+    haystack = page_body + b"\n" + b"\n".join(objects.get(ref, b"") for ref in content_ids)
+    if not re.search(rb"/Subtype\s*/Image\b|\bDo\b", haystack):
+        return []
+    return sorted(
+        object_id
+        for object_id, body in objects.items()
+        if re.search(rb"/Subtype\s*/Image\b", body)
+    )
+
+
+def _image_signal_asset(
+    asset: SourceAsset,
+    *,
+    page_number: int,
+    page_object_id: int,
+    image_object_ids: list[int],
+    digest_parts: list[bytes],
+) -> SourceAsset:
+    return source_asset_signal(
+        container_uri=asset.uri,
+        signal_id=f"page-{page_number:04d}/image-signal",
+        media_type="application/x.markitect-pdf-image-signal",
+        source_adapter="source.pdf",
+        source_role="image-signal",
+        digest_parts=digest_parts or [bytes_digest(",".join(map(str, image_object_ids)).encode("ascii")).encode("ascii")],
+        metadata={
+            "container_path": asset.path,
+            "page": page_number,
+            "pdf_object": page_object_id,
+            "image_objects": image_object_ids,
+            "render_manifest_compatible": True,
+        },
+    )
+
+
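The name clean-up at the end of `_file_spec_name` can be tried in isolation. This sketch copies only the substitution; `safe_member_name` is a hypothetical wrapper that does not exist in the adapter, and the sample names are invented.

```python
from __future__ import annotations

import re


def safe_member_name(value: str) -> str | None:
    """Collapse runs of path separators and colons into a single hyphen."""
    return re.sub(r"[\\/:]+", "-", value).strip() or None


for raw in ("report.pdf", "C:\\docs\\report.pdf", "../../etc/passwd", "   "):
    print(repr(raw), "->", repr(safe_member_name(raw)))
```

Because every separator run becomes a hyphen, the cleaned name can be placed under `embedded/` without introducing extra path segments.
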
@@ -733,6 +888,10 @@ def _confidence(package: PdfPackage, diagnostics: list[Diagnostic]) -> float:
     return max(0.1, 0.75 * coverage)


+def _attachment_role_count(attachments: list[SourceAsset], role: str) -> int:
+    return sum(1 for attachment in attachments if attachment.metadata.get("source_role") == role)
+
+
 def _warning(
     asset: SourceAsset,
     code: str,
@@ -31,6 +31,8 @@ def test_epub3_descriptor_matches_contract():
     assert descriptor.extensions == [".epub"]
     assert descriptor.safety["network"] is False
     assert descriptor.option_schema["properties"]["skip_boilerplate"]["default"] is True
+    assert descriptor.quality_profile["attachments"] == "read-side-source-assets"
+    assert descriptor.metadata["render_asset_manifest_compatible"] is True


 def test_epub3_adapter_matches_epub_assets(tmp_path: Path):
@@ -57,6 +59,7 @@ def test_epub3_adapter_inspects_metadata(tmp_path: Path):
     assert result.metadata.language == "en"
     assert result.metadata.identifiers["bookid"] == "urn:test-book"
     assert result.quality.lossiness == "low"
+    assert result.quality.metadata["attachment_candidates"] == 2


 def test_epub3_adapter_normalizes_spine_to_markdown(tmp_path: Path):
@@ -83,6 +86,16 @@ def test_epub3_adapter_normalizes_spine_to_markdown(tmp_path: Path):
         "continuation",
     ]
     assert result.document.segments[0].provenance[0].package_path == "EPUB/chapter1.xhtml"
+    assert [attachment.metadata["source_role"] for attachment in result.document.attachments] == [
+        "image",
+        "stylesheet",
+    ]
+    assert result.document.attachments[0].media_type == "image/png"
+    assert result.document.attachments[1].media_type == "text/css"
+    assert result.document.attachments[0].digest.startswith("sha256:")
+    assert result.document.attachments[0].metadata["package_path"] == "EPUB/images/chart.png"
+    assert result.document.attachments[0].metadata["render_manifest_compatible"] is True
+    assert result.document.quality.metadata["attachment_count"] == 2
     assert result.document.quality.lossiness == "none"


@@ -114,13 +127,28 @@ def test_epub3_adapter_reports_malformed_missing_container(tmp_path: Path):
     assert "container.xml" in result.diagnostics[0].message


+def test_epub3_adapter_reports_unsupported_package_resources(tmp_path: Path):
+    epub_path = _write_epub(tmp_path, include_unsupported_resource=True)
+    asset = SourceAsset.from_path(epub_path, media_type="application/epub+zip")
+    adapter = epub3_adapter_descriptor().instantiate()
+
+    result = adapter.read(SourceReadRequest(asset=asset))
+
+    assert result.is_valid
+    assert result.document is not None
+    assert any(
+        diagnostic.code == "source.epub3.skipped_resource"
+        for diagnostic in result.document.diagnostics
+    )
+
+
 def test_epub3_entry_point_discovery_shape():
     registry = discover_source_adapters([FakeEntryPoint()])

     assert registry.get("source.epub3").name == "EPUB3"


-def _write_epub(tmp_path: Path) -> Path:
+def _write_epub(tmp_path: Path, *, include_unsupported_resource: bool = False) -> Path:
     epub_path = tmp_path / "test-book.epub"
     with zipfile.ZipFile(epub_path, "w") as archive:
         archive.writestr("mimetype", "application/epub+zip")
@@ -150,13 +178,22 @@ def _write_epub(tmp_path: Path) -> Path:
     <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
     <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
     <item id="chapter2" href="chapter2.xhtml" media-type="application/xhtml+xml"/>
+    <item id="style" href="styles/book.css" media-type="text/css"/>
+    <item id="chart" href="images/chart.png" media-type="image/png"/>
+    {unsupported}
   </manifest>
   <spine>
     <itemref idref="chapter1"/>
     <itemref idref="chapter2"/>
   </spine>
 </package>
-""",
+""".format(
+                unsupported=(
+                    '<item id="payload" href="data/payload.bin" media-type="application/x-custom-binary"/>'
+                    if include_unsupported_resource
+                    else ""
+                )
+            ),
         )
         archive.writestr(
             "EPUB/nav.xhtml",
@@ -203,4 +240,8 @@ def _write_epub(tmp_path: Path) -> Path:
|
||||
</html>
|
||||
""",
|
||||
)
|
||||
archive.writestr("EPUB/styles/book.css", "body { color: #111; }\n")
|
||||
archive.writestr("EPUB/images/chart.png", b"\x89PNG\r\n\x1a\nfixture")
|
||||
if include_unsupported_resource:
|
||||
archive.writestr("EPUB/data/payload.bin", b"custom")
|
||||
return epub_path
|
||||
|
||||
@@ -32,6 +32,9 @@ def test_pdf_descriptor_matches_contract():
|
||||
assert descriptor.safety["external_process"] is False
|
||||
assert descriptor.option_schema["properties"]["include_page_breaks"]["default"] is False
|
||||
assert descriptor.metadata["dependency_profile"] == "stdlib"
|
||||
assert descriptor.metadata["render_asset_manifest_compatible"] is True
|
||||
assert descriptor.quality_profile["attachments"] == "metadata-with-digest"
|
||||
assert descriptor.quality_profile["images"] == "signal-only"
|
||||
|
||||
|
||||
def test_pdf_adapter_matches_pdf_assets(tmp_path: Path):
|
||||
@@ -60,6 +63,7 @@ def test_pdf_adapter_inspects_metadata(tmp_path: Path):
|
||||
assert result.quality.lossiness == "medium"
|
||||
assert result.quality.metadata["page_count"] == 2
|
||||
assert result.quality.metadata["pages_with_text"] == 2
|
||||
assert result.quality.metadata["attachment_count"] == 0
|
||||
|
||||
|
||||
def test_pdf_adapter_normalizes_pages_to_markdown(tmp_path: Path):
|
||||
@@ -80,6 +84,7 @@ def test_pdf_adapter_normalizes_pages_to_markdown(tmp_path: Path):
|
||||
assert result.document.segments[0].provenance[0].page == "1"
|
||||
assert result.document.quality.lossiness == "low"
|
||||
assert result.document.quality.metadata["page_coverage"] == 1.0
|
||||
assert result.document.attachments == []
|
||||
|
||||
|
||||
def test_pdf_adapter_applies_page_range_and_page_markers(tmp_path: Path):
|
||||
@@ -137,17 +142,57 @@ def test_pdf_adapter_reports_encrypted_pdf(tmp_path: Path):
|
||||
assert result.diagnostics[0].code == "source.pdf.encrypted"
|
||||
|
||||
|
||||
def test_pdf_adapter_reports_embedded_files_and_image_signals(tmp_path: Path):
|
||||
pdf_path = _write_pdf(tmp_path, embedded_file=True, image_signal=True)
|
||||
asset = SourceAsset.from_path(pdf_path, media_type="application/pdf")
|
||||
adapter = pdf_adapter_descriptor().instantiate()
|
||||
|
||||
result = adapter.read(SourceReadRequest(asset=asset))
|
||||
|
||||
assert result.is_valid
|
||||
assert result.document is not None
|
||||
assert [attachment.metadata["source_role"] for attachment in result.document.attachments] == [
|
||||
"embedded-file",
|
||||
"image-signal",
|
||||
]
|
||||
embedded = result.document.attachments[0]
|
||||
signal = result.document.attachments[1]
|
||||
assert embedded.name == "attachment.txt"
|
||||
assert embedded.media_type == "text/plain"
|
||||
assert embedded.digest.startswith("sha256:")
|
||||
assert embedded.metadata["render_manifest_compatible"] is True
|
||||
assert signal.media_type == "application/x.markitect-pdf-image-signal"
|
||||
assert signal.metadata["page"] == 1
|
||||
assert signal.metadata["signal_only"] is True
|
||||
assert result.document.quality.metadata["attachment_count"] == 2
|
||||
assert result.document.quality.metadata["embedded_file_count"] == 1
|
||||
assert result.document.quality.metadata["image_signal_count"] == 1
|
||||
assert any(
|
||||
diagnostic.code == "source.pdf.image_resource_signal"
|
||||
for diagnostic in result.document.diagnostics
|
||||
)
|
||||
|
||||
|
||||
def test_pdf_entry_point_discovery_shape():
|
||||
registry = discover_source_adapters([FakeEntryPoint()])
|
||||
|
||||
assert registry.get("source.pdf").name == "PDF"
|
||||
|
||||
|
||||
def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
|
||||
def _write_pdf(
|
||||
tmp_path: Path,
|
||||
*,
|
||||
encrypted: bool = False,
|
||||
embedded_file: bool = False,
|
||||
image_signal: bool = False,
|
||||
) -> Path:
|
||||
pdf_path = tmp_path / ("encrypted.pdf" if encrypted else "fixture.pdf")
|
||||
objects: list[tuple[int, bytes]] = []
|
||||
page_refs = []
|
||||
next_id = 3
|
||||
font_id = 100
|
||||
info_id = 101
|
||||
encrypt_id = 102
|
||||
for page_number, lines in enumerate(
|
||||
[
|
||||
["Hello PDF", "Second line"],
|
||||
@@ -159,13 +204,20 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
|
||||
content_id = next_id + 1
|
||||
next_id += 2
|
||||
page_refs.append(f"{page_id} 0 R")
|
||||
stream = _page_stream(lines)
|
||||
include_image = image_signal and page_number == 1
|
||||
image_id = None
|
||||
if include_image:
|
||||
image_id = next_id
|
||||
next_id += 1
|
||||
stream = _page_stream(lines, draw_image=include_image)
|
||||
xobject = f" /XObject << /Im1 {image_id} 0 R >>" if image_id else ""
|
||||
objects.append(
|
||||
(
|
||||
page_id,
|
||||
(
|
||||
f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
|
||||
f"/Resources << /Font << /F1 7 0 R >> >> /Contents {content_id} 0 R >>"
|
||||
f"/Resources << /Font << /F1 {font_id} 0 R >>{xobject} >> "
|
||||
f"/Contents {content_id} 0 R >>"
|
||||
).encode("ascii"),
|
||||
)
|
||||
)
|
||||
@@ -179,6 +231,44 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
|
||||
+ b"\nendstream",
|
||||
)
|
||||
)
|
||||
if image_id:
|
||||
image_stream = b"\x00\x00\x00"
|
||||
objects.append(
|
||||
(
|
||||
image_id,
|
||||
b"<< /Type /XObject /Subtype /Image /Width 1 /Height 1 "
|
||||
b"/ColorSpace /DeviceGray /BitsPerComponent 8 /Length "
|
||||
+ str(len(image_stream)).encode("ascii")
|
||||
+ b" >>\nstream\n"
|
||||
+ image_stream
|
||||
+ b"\nendstream",
|
||||
)
|
||||
)
|
||||
|
||||
if embedded_file:
|
||||
embedded_id = next_id
|
||||
filespec_id = next_id + 1
|
||||
next_id += 2
|
||||
embedded_stream = b"attached text"
|
||||
objects.append(
|
||||
(
|
||||
embedded_id,
|
||||
b"<< /Type /EmbeddedFile /Length "
|
||||
+ str(len(embedded_stream)).encode("ascii")
|
||||
+ b" >>\nstream\n"
|
||||
+ embedded_stream
|
||||
+ b"\nendstream",
|
||||
)
|
||||
)
|
||||
objects.append(
|
||||
(
|
||||
filespec_id,
|
||||
(
|
||||
f"<< /Type /Filespec /F (attachment.txt) "
|
||||
f"/EF << /F {embedded_id} 0 R >> >>"
|
||||
).encode("ascii"),
|
||||
)
|
||||
)
|
||||
|
||||
objects.extend(
|
||||
[
|
||||
@@ -190,9 +280,9 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
|
||||
f"/Count {len(page_refs)} >>"
|
||||
).encode("ascii"),
|
||||
),
|
||||
(7, b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"),
|
||||
(font_id, b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"),
|
||||
(
|
||||
8,
|
||||
info_id,
|
||||
b"<< /Title (PDF Fixture) /Author (Ada Lovelace) "
|
||||
b"/Subject (Source Adapter Test) /Keywords (markitect pdf) "
|
||||
b"/Producer (markitect-filter tests) /CreationDate (D:20260514093000Z) >>",
|
||||
@@ -200,7 +290,7 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
|
||||
]
|
||||
)
|
||||
if encrypted:
|
||||
objects.append((9, b"<< /Filter /Standard /V 1 /R 2 >>"))
|
||||
objects.append((encrypt_id, b"<< /Filter /Standard /V 1 /R 2 >>"))
|
||||
objects.sort(key=lambda item: item[0])
|
||||
|
||||
header = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"
|
||||
@@ -218,9 +308,9 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
|
||||
content.extend(b"0000000000 65535 f \n")
|
||||
for object_id in range(1, max_id + 1):
|
||||
content.extend(f"{offsets.get(object_id, 0):010d} 00000 n \n".encode("ascii"))
|
||||
trailer = f"trailer\n<< /Size {max_id + 1} /Root 1 0 R /Info 8 0 R".encode("ascii")
|
||||
trailer = f"trailer\n<< /Size {max_id + 1} /Root 1 0 R /Info {info_id} 0 R".encode("ascii")
|
||||
if encrypted:
|
||||
trailer += b" /Encrypt 9 0 R"
|
||||
trailer += f" /Encrypt {encrypt_id} 0 R".encode("ascii")
|
||||
trailer += b" >>\n"
|
||||
content.extend(trailer)
|
||||
content.extend(f"startxref\n{xref_offset}\n%%EOF\n".encode("ascii"))
|
||||
@@ -228,13 +318,15 @@ def _write_pdf(tmp_path: Path, *, encrypted: bool = False) -> Path:
|
||||
return pdf_path
|
||||
|
||||
|
||||
def _page_stream(lines: list[str]) -> bytes:
|
||||
def _page_stream(lines: list[str], *, draw_image: bool = False) -> bytes:
|
||||
parts = ["BT", "/F1 12 Tf", "72 720 Td"]
|
||||
for index, line in enumerate(lines):
|
||||
if index:
|
||||
parts.append("T*")
|
||||
parts.append(f"({_pdf_literal(line)}) Tj")
|
||||
parts.append("ET")
|
||||
if draw_image:
|
||||
parts.extend(["q", "10 0 0 10 72 640 cm", "/Im1 Do", "Q"])
|
||||
return "\n".join(parts).encode("ascii")
|
||||
|
||||
|
||||
|
||||
@@ -3,10 +3,10 @@ id: MKTF-WP-0003
|
||||
type: workplan
|
||||
title: "Source Attachment Manifest Compatibility"
|
||||
domain: markitect
|
||||
status: todo
|
||||
status: done
|
||||
owner: markitect-filter
|
||||
topic_slug: markitect
|
||||
planning_priority: P2
|
||||
planning_priority: complete
|
||||
planning_order: 30
|
||||
depends_on_workplans:
|
||||
- MKTF-WP-0001
|
||||
@@ -56,11 +56,27 @@ render asset manifest.
|
||||
Those responsibilities belong to `markitect-tool` contracts,
|
||||
`markitect-quarkdown` render integration, or later runtime/publication systems.
|
||||
|
||||
## Implementation Summary
|
||||
|
||||
Completed as a read-side attachment metadata compatibility slice:
|
||||
|
||||
- Added shared source attachment metadata helpers and exported
|
||||
`markitect-filter.source-attachment.v1`.
|
||||
- EPUB3 read results now populate `NormalizedMarkdownDocument.attachments` for
|
||||
package images, stylesheets, fonts, audio, and video with byte size, digest,
|
||||
package path, manifest id, href, and render-manifest compatibility metadata.
|
||||
- PDF read results now populate attachments for embedded file streams and
|
||||
signal-only image resources where the stdlib parser can detect them.
|
||||
- Unsupported EPUB resources, missing EPUB resources, PDF image signals, and
|
||||
unreadable embedded files produce structured diagnostics.
|
||||
- Docs, handoff fixtures, adapter descriptors, README notes, and tests were
|
||||
added without introducing renderer/export behavior.
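
As an illustration of the embedded-file bullets above, here is a minimal sketch of how a read-side descriptor carrying byte size and a `sha256:`-prefixed digest could be assembled. `AttachmentSketch` and `describe_embedded_file` are hypothetical names chosen for this example, not the package's actual helpers; only the field values (`source_role`, `render_manifest_compatible`, the digest prefix) mirror what the tests in this commit assert.

```python
import hashlib
import mimetypes
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class AttachmentSketch:
    """Hypothetical stand-in for a source attachment record."""

    name: str
    media_type: str
    byte_size: int
    digest: str
    metadata: dict[str, Any] = field(default_factory=dict)


def describe_embedded_file(name: str, payload: bytes) -> AttachmentSketch:
    """Build metadata for an embedded file without writing it anywhere."""
    # Guess the media type from the file name; fall back to a generic type.
    media_type, _ = mimetypes.guess_type(name)
    return AttachmentSketch(
        name=name,
        media_type=media_type or "application/octet-stream",
        byte_size=len(payload),
        # The digest is deterministic for identical bytes.
        digest="sha256:" + hashlib.sha256(payload).hexdigest(),
        metadata={
            "source_role": "embedded-file",
            "render_manifest_compatible": True,
        },
    )


attachment = describe_embedded_file("attachment.txt", b"attached text")
```

Because the descriptor is derived only from the name and the bytes, repeated reads of the same source should yield the same digest, which is what makes it usable as a handoff key for a downstream asset manifest.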
|
||||
|
||||
## P3.1 - Align attachment metadata with Markitect source contracts
|
||||
|
||||
```task
|
||||
id: MKTF-WP-0003-T001
|
||||
status: todo
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "d119daca-8141-4662-8ad7-ce43ccd79044"
|
||||
```
|
||||
@@ -76,7 +92,7 @@ run with `markitect-tool` on `PYTHONPATH`.
|
||||
|
||||
```task
|
||||
id: MKTF-WP-0003-T002
|
||||
status: todo
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "ebcbf480-210d-46e7-a4e4-fbe7e9baa39a"
|
||||
```
|
||||
@@ -91,7 +107,7 @@ Output: EPUB3 attachment metadata, provenance, diagnostics, and fixtures.
|
||||
|
||||
```task
|
||||
id: MKTF-WP-0003-T003
|
||||
status: todo
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "d8b7b820-387f-4d45-bf22-296b227f917a"
|
||||
```
|
||||
@@ -108,7 +124,7 @@ Output: PDF metadata conventions, diagnostics, and tests.
|
||||
|
||||
```task
|
||||
id: MKTF-WP-0003-T004
|
||||
status: todo
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "ca539c01-c272-4635-8f60-86f870bbef0c"
|
||||
```
|
||||
@@ -128,7 +144,7 @@ Output: deterministic digest/provenance tests.
|
||||
|
||||
```task
|
||||
id: MKTF-WP-0003-T005
|
||||
status: todo
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "f2213a20-ce6f-4e16-9b9b-557b99f8b4d1"
|
||||
```
|
||||
|
||||