diff --git a/README.md b/README.md index 6e23038..036b944 100644 --- a/README.md +++ b/README.md @@ -19,6 +19,7 @@ requirements documents in `wiki/`. - `docs/command-cheatsheet.md` - command-oriented workflow cheat sheet - `docs/examples-index.md` - map from examples to usecases and commands - `docs/source-adapter-contract.md` - v1 source adapter contract for external format adapters +- `docs/source-adapter-migration.md` - sibling-repo handoff for source adapter adoption - `docs/performance-notes.md` - local performance posture and smoke coverage - `docs/cli-reference.md` - generated `mkt` command reference - `docs/api-reference.md` - generated public API reference diff --git a/docs/api-reference.md b/docs/api-reference.md index 94e903d..1cc61c1 100644 --- a/docs/api-reference.md +++ b/docs/api-reference.md @@ -10,6 +10,8 @@ Generated from `markitect_tool.__all__`. - `EMPTY_PARSE_OPTIONS_HASH` - object. str(object='') -> str - `EXPLODE_MANIFEST_NAME` - object. str(object='') -> str - `LOCAL_INDEX_SCHEMA_VERSION` - object. str(object='') -> str +- `NORMALIZED_SOURCE_SCHEMA_VERSION` - object. str(object='') -> str +- `SOURCE_ADAPTER_ENTRY_POINT_GROUP` - object. str(object='') -> str ## `markitect_tool.backend.engine` @@ -339,6 +341,31 @@ Generated from `markitect_tool.__all__`. - `validate_markdown_file(markdown_path: 'str | Path', schema_path: 'str | Path') -> 'SchemaValidationResult'` - function. Parse and validate a Markdown file against a Markdown schema file. - `validate_schema(schema: 'dict[str, Any]') -> 'SchemaValidationResult'` - function. Validate that a JSON Schema itself is well formed. +## `markitect_tool.source.engine` + +- `NormalizationQuality(lossiness: 'str', confidence: 'float | None' = None, skipped_items: 'int | None' = None, warnings: 'int | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Summary of extraction lossiness and confidence. 
+- `NormalizedMarkdownDocument(document_id: 'str', asset: 'SourceAsset', metadata: 'SourceMetadata', markdown: 'str', segments: 'list[NormalizedMarkdownSegment]', quality: 'NormalizationQuality', adapter: 'dict[str, Any]', cache_key: 'str', schema_version: 'str' = 'markitect.source.v1', diagnostics: 'list[Diagnostic]' = <factory>, provenance: 'list[SourceProvenance]' = <factory>, attachments: 'list[SourceAsset]' = <factory>) -> None` - class. Canonical source-normalized Markdown document. +- `NormalizedMarkdownSegment(segment_id: 'str', order: 'int', markdown: 'str', heading: 'str | None' = None, heading_level: 'int | None' = None, anchors: 'list[str]' = <factory>, provenance: 'list[SourceProvenance]' = <factory>, metadata: 'dict[str, Any]' = <factory>) -> None` - class. One ordered normalized Markdown segment. +- `SourceAdapterDescriptor(id: 'str', version: 'str', name: 'str', operations: 'list[str]', media_types: 'list[str]', extensions: 'list[str]', factory: 'SourceAdapterFactory', summary: 'str | None' = None, option_schema: 'dict[str, Any]' = <factory>, optional_dependencies: 'list[OptionalDependency]' = <factory>, safety: 'dict[str, Any]' = <factory>, quality_profile: 'dict[str, Any]' = <factory>, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Inspectable descriptor for one source read adapter. +- `SourceAdapterError` - class. Raised when source adapter descriptors or registries are invalid. +- `SourceAdapterMatch(adapter_id: 'str', matched: 'bool', confidence: 'int' = 0, reason: 'str | None' = None, diagnostics: 'list[Diagnostic]' = <factory>) -> None` - class. Result of an adapter match attempt. +- `SourceAdapterMatchRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Cheap adapter matching request. +- `SourceAdapterRegistry(descriptors: 'Iterable[SourceAdapterDescriptor] | None' = None) -> 'None'` - class. Registry of source adapter descriptors. 
+- `SourceAsset(uri: 'str', path: 'str | None' = None, name: 'str | None' = None, media_type: 'str | None' = None, extension: 'str | None' = None, size: 'int | None' = None, mtime_ns: 'int | None' = None, digest: 'str | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Identity and filesystem metadata for one source asset. +- `SourceInspectRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Metadata-only source inspection request. +- `SourceInspectResult(asset: 'SourceAsset', adapter: 'dict[str, Any]', metadata: 'SourceMetadata', quality: 'NormalizationQuality', capabilities: 'list[str]' = <factory>, diagnostics: 'list[Diagnostic]' = <factory>, valid: 'bool | None' = None) -> None` - class. Metadata-only source inspection result. +- `SourceMetadata(title: 'str | None' = None, creators: 'list[str]' = <factory>, language: 'str | None' = None, rights: 'str | None' = None, source_url: 'str | None' = None, publication_date: 'str | None' = None, publisher: 'str | None' = None, identifiers: 'dict[str, str]' = <factory>, raw: 'dict[str, Any]' = <factory>) -> None` - class. Format-provided descriptive metadata. +- `SourceProvenance(source_uri: 'str', source_path: 'str | None' = None, source_href: 'str | None' = None, package_path: 'str | None' = None, anchor: 'str | None' = None, page: 'str | None' = None, section: 'str | None' = None, start_offset: 'int | None' = None, end_offset: 'int | None' = None, digest: 'str | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Trace from normalized Markdown back to source locations. +- `SourceReadAdapter(*args, **kwargs)` - class. Read-only source adapter protocol. +- `SourceReadRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Full source normalization request. +- `SourceReadResult(document: 'NormalizedMarkdownDocument | None' = None, diagnostics: 'list[Diagnostic]' = <factory>, valid: 'bool | None' = None) -> None` - class. Full source normalization result. 
+- `default_source_adapter_registry() -> 'SourceAdapterRegistry'` - function. Return the discovered source adapter registry. +- `discover_source_adapters(entry_points: 'Iterable[Any] | None' = None) -> 'SourceAdapterRegistry'` - function. Discover source adapters from package entry points. +- `inspect_source(path_or_uri: 'str | Path', *, registry: 'SourceAdapterRegistry | None' = None, adapter_id: 'str | None' = None, options: 'dict[str, Any] | None' = None) -> 'SourceInspectResult'` - function. Inspect a local source through the selected adapter. +- `normalization_cache_key(*, asset: 'SourceAsset', adapter_id: 'str', adapter_version: 'str', options: 'dict[str, Any] | None' = None) -> 'str'` - function. Compute a stable normalization cache key. +- `normalize_source(path_or_uri: 'str | Path', *, registry: 'SourceAdapterRegistry | None' = None, adapter_id: 'str | None' = None, options: 'dict[str, Any] | None' = None) -> 'SourceReadResult'` - function. Normalize a local source through the selected adapter. +- `source_adapter_registry_descriptor() -> 'ExtensionDescriptor'` - function. Return the built-in descriptor for source adapter discovery. + ## `markitect_tool.template.engine` - `MissingTemplateVariable` - class. Raised when strict rendering cannot resolve a variable. diff --git a/docs/cli-reference.md b/docs/cli-reference.md index deb8b2c..b924fcd 100644 --- a/docs/cli-reference.md +++ b/docs/cli-reference.md @@ -862,6 +862,57 @@ Parameters: - `--policy-mode` - Override policy mode for this search. - `--format` - +## `mkt source` + +Inspect source-format adapters and normalize sources. + +```text +source [OPTIONS] COMMAND [ARGS]... +``` + +## `mkt source adapters` + +List discovered read-only source adapters. + +```text +adapters [OPTIONS] +``` + +Parameters: + +- `--format` - + +## `mkt source inspect` + +Inspect a local source without full Markdown conversion. + +```text +inspect [OPTIONS] SOURCE_PATH +``` + +Parameters: + +- `SOURCE_PATH` - Required. 
+- `--adapter` - Explicit source adapter id. +- `--option` - Adapter-specific option. May be repeated. +- `--format` - + +## `mkt source normalize` + +Normalize a local source into canonical Markdown. + +```text +normalize [OPTIONS] SOURCE_PATH +``` + +Parameters: + +- `SOURCE_PATH` - Required. +- `--adapter` - Explicit source adapter id. +- `--option` - Adapter-specific option. May be repeated. +- `--output` - Write normalized output to this file. +- `--format` - + ## `mkt tangle` Tangle named Markdown code chunks into target files. diff --git a/docs/command-cheatsheet.md b/docs/command-cheatsheet.md index 0aab229..7bbe9a5 100644 --- a/docs/command-cheatsheet.md +++ b/docs/command-cheatsheet.md @@ -41,6 +41,14 @@ mkt ref resolve context.md 'std:clauses.md#payment-terms' --root examples/refere mkt process file.md --root . ``` +## Normalize External Sources + +```bash +mkt source adapters +mkt source inspect book.epub --adapter source.epub3 --format json +mkt source normalize book.epub --adapter source.epub3 --format markdown --output book.md +``` + ## Split And Literate Workflows ```bash @@ -102,6 +110,7 @@ mkt context list mkt completion bash --instructions mkt extension list mkt extension inspect backend.local-sqlite +mkt extension inspect source.adapter-registry mkt extension commands mkt docs cli --output docs/cli-reference.md mkt docs api --output docs/api-reference.md diff --git a/docs/extension-authoring.md b/docs/extension-authoring.md index 5281293..42980b0 100644 --- a/docs/extension-authoring.md +++ b/docs/extension-authoring.md @@ -9,6 +9,10 @@ Use this for internal query engines, processors, backend/index stores, reference providers, validators, template/generation adapters, CLI command groups, render/export adapters, and future document functions. +Source-format adapters are external package extensions. Use +`docs/source-adapter-contract.md` for the source adapter protocol, entry point +group, descriptor shape, and contract-test expectations. 
+ ## Recommended Shape Each extension should have: diff --git a/docs/internal-extension-framework.md b/docs/internal-extension-framework.md index 477550c..ca6a2ff 100644 --- a/docs/internal-extension-framework.md +++ b/docs/internal-extension-framework.md @@ -170,8 +170,10 @@ markitect_tool/extensions/ Each module exposes one or more descriptors plus a registration function. The root registry can be assembled explicitly at import time or by a small internal -discovery list. Package entry points can be added later if external extension -packages become a real requirement. +discovery list. Source adapters are the first external package-discovery slice +and use the `markitect_tool.source_adapters` entry point group defined in +`docs/source-adapter-contract.md`; other extension kinds can adopt package +entry points later if they become a real requirement. See `docs/extension-authoring.md` for the extension authoring checklist and descriptor template. diff --git a/docs/source-adapter-migration.md b/docs/source-adapter-migration.md new file mode 100644 index 0000000..a19abb6 --- /dev/null +++ b/docs/source-adapter-migration.md @@ -0,0 +1,119 @@ +# Source Adapter Migration Notes + +## Purpose + +These notes describe how sibling repositories should consume the +`markitect-tool` source adapter contract implemented by `MKTT-WP-0018`. 
+ +The source adapter layer is deliberately split: + +```text +external source files + -> markitect-filter concrete adapters + -> markitect-tool source adapter protocol and normalized Markdown model + -> infospace-bench workflows + -> optional kontextual-engine ingestion +``` + +## Markitect-tool + +`markitect-tool` owns the stable contract: + +- normalized source data model +- read-only source adapter protocol +- adapter registry and Python entry point discovery +- `mkt source` CLI commands +- public API helpers such as `inspect_source` and `normalize_source` +- fake adapter contract tests + +It does not own EPUB3, PDF, DOCX, ODT, OCR, or browser extraction. + +## Markitect-filter + +`markitect-filter` should implement concrete adapters behind the entry point +group: + +```toml +[project.entry-points."markitect_tool.source_adapters"] +epub3 = "markitect_filter.adapters:epub3_adapter_descriptor" +``` + +The first adapter should be: + +```text +id: source.epub3 +operations: read +media_types: application/epub+zip +extensions: .epub +``` + +The EPUB3 adapter should satisfy the contract tests described in +`docs/source-adapter-contract.md` and add EPUB-specific fixtures for container, +OPF, spine, nav, body XHTML, malformed package structure, skipped assets, and +lossy extraction diagnostics. + +## Infospace-bench + +`infospace-bench` should replace its local EPUB intake spike with the public +source adapter API: + +```python +from markitect_tool import normalize_source + +result = normalize_source("source.epub") +if not result.is_valid: + raise RuntimeError(result.to_dict()["diagnostics"]) +markdown = result.document.markdown +segments = result.document.segments +``` + +Application workflows should consume normalized Markdown and segment metadata. +They should not depend on EPUB package internals, spine parsing, XHTML +extraction, or boilerplate classification directly. 
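As an illustration of consuming segment metadata rather than EPUB internals, an application could derive a reading-order outline from segments alone. The sketch below is an assumption-laden stand-in: `Segment` is a hypothetical dataclass that mirrors only a few `NormalizedMarkdownSegment` fields (`segment_id`, `order`, `markdown`, `heading`, `heading_level`), and `outline` is not part of the `markitect-tool` API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in mirroring a few NormalizedMarkdownSegment fields."""

    segment_id: str
    order: int
    markdown: str
    heading: str | None = None
    heading_level: int | None = None


def outline(segments: list[Segment]) -> list[str]:
    """Build an indented outline from segment headings, in reading order."""
    lines: list[str] = []
    for segment in sorted(segments, key=lambda item: item.order):
        if segment.heading is None:
            continue  # untitled segments carry text but no outline entry
        # Treat a missing heading level as top level; indent two spaces per level.
        indent = "  " * max((segment.heading_level or 1) - 1, 0)
        lines.append(f"{indent}- {segment.heading} ({segment.segment_id})")
    return lines
```

In a real workflow the same function shape would presumably accept `result.document.segments` directly, since it reads only fields that the normalized segment model exposes.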
+ +## Kontextual-engine + +`kontextual-engine` can treat normalized source outputs as ingestible derived +knowledge assets when it needs durable ingestion. The durable layer should +persist policy, indexing, retrieval, permissions, audit, and lifecycle state. +It should not require source-format dependencies inside `markitect-tool`. + +Recommended ingestion boundary: + +- keep `NormalizedMarkdownDocument.to_dict()` as the portable derivative +- preserve source asset digest and normalization cache key +- record adapter ID, adapter version, and read options +- preserve document and segment provenance +- store diagnostics and quality signals for human review + +## Follow-up Workplan Seeds + +Recommended `markitect-filter` workplan: + +```text +MKTF-WP-0001: EPUB3 Read Adapter + +Implement source.epub3 against docs/source-adapter-contract.md: +- package scaffold and pyproject entry point +- optional EPUB dependencies +- META-INF/container.xml parsing +- OPF metadata and spine reading order +- nav/chapter label extraction +- body XHTML to normalized Markdown segments +- explicit boilerplate skip policy +- malformed/unsupported/lossy diagnostics +- contract tests using markitect-tool fake adapter expectations +``` + +Recommended `infospace-bench` workplan: + +```text +ISB-WP-source-adapter-intake + +Replace local EPUB source intake with markitect-tool normalize_source: +- install markitect-filter[epub3] in the relevant environment +- call normalize_source for source documents +- consume NormalizedMarkdownDocument markdown and segments +- remove app-local EPUB package parsing +- preserve source diagnostics in benchmark review artifacts +``` diff --git a/docs/workplan-planning-map.md b/docs/workplan-planning-map.md index b31aa58..0edc3a0 100644 --- a/docs/workplan-planning-map.md +++ b/docs/workplan-planning-map.md @@ -43,7 +43,7 @@ and descriptions mirror the operational view. 
| `MKTT-WP-0008` | complete | done | `MKTT-WP-0006`, `MKTT-WP-0007`, `MKTT-WP-0009` | Agent working-memory context cache is complete: context package schema, local registry, package creation from queries/search/manifests, deterministic summaries, namespaces, activation/deactivation/refresh/explain lifecycle, policy re-checks, CLI, docs, and examples. | | `MKTT-WP-0017` | complete | done | `MKTT-WP-0003`, `MKTT-WP-0013` | CLI/API polish and practical adoption track is complete: shell completion, extension discovery, generated CLI/API docs, usecase relevance matrix, E2E fixture matrix, large-corpus smoke coverage, first-use docs, examples index, and command cheat sheet. | | `MKTT-WP-0019` | complete | done | `MKTT-WP-0013`, `MKTT-WP-0017` | Source adapter contract refinement is complete: v1 read-only scope, normalized model fields, package entry point discovery, CLI/API envelopes, fake adapter fixtures, and `markitect-filter` EPUB3 handoff are pinned in `docs/source-adapter-contract.md`. | -| `MKTT-WP-0018` | P0 | active | `MKTT-WP-0013`, `MKTT-WP-0017`, `MKTT-WP-0019` | Source adapter framework implementation is the current track: implement `docs/source-adapter-contract.md`, keeping format extraction in `markitect-filter` and the base install free of heavyweight conversion dependencies. | +| `MKTT-WP-0018` | complete | done | `MKTT-WP-0013`, `MKTT-WP-0017`, `MKTT-WP-0019` | Source adapter framework implementation is complete: read-only models, protocol, registry, entry point discovery, extension descriptors, CLI/API, fake adapter fixtures, migration notes, and tests are in place. | | `MKTT-WP-0015` | P2 | todo | `MKTT-WP-0010`, `MKTT-WP-0011`, `MKTT-WP-0012` | Future render and document-function extensions: typed values, richer syntax, document-local reusable functions, Quarkdown/export adapters, render-aware references, assets, and permission sandboxing. Defer unless publishing/export pressure becomes current. 
| | `MKTT-WP-0016` | P2 | todo | `MKTT-WP-0008`, `MKTT-WP-0007`, `MKTT-WP-0009`, `MKTT-WP-0013` | Follow-on agentic memory architecture: reasoning decision graphs, conversational paths, long-term knowledge graphs, memory service blueprints/profiles, graph-to-context-package compilation, and adapter boundaries. | @@ -132,9 +132,11 @@ protocol behavior, CLI/API envelopes, fake adapter fixtures, and v1 contract is read-only; writer/export adapters belong in later format-specific work once preservation semantics are explicit. -`MKTT-WP-0018` is now the current source-adapter implementation track. It -should implement the pinned contract directly rather than reopening v1 model, -entry point, protocol, or CLI/API decisions. +`MKTT-WP-0018` completed the source-adapter implementation track. The v1 +read-only contract now has public models, protocol types, registry/discovery, +extension descriptors, `mkt source` commands, API exports, fake adapter +fixtures, and sibling-repo migration notes. Concrete EPUB3 extraction remains +`markitect-filter` scope. 
## State Hub Mirror diff --git a/src/markitect_tool/__init__.py b/src/markitect_tool/__init__.py index d1c0f14..45f20b2 100644 --- a/src/markitect_tool/__init__.py +++ b/src/markitect_tool/__init__.py @@ -262,6 +262,32 @@ from markitect_tool.schema import ( validate_markdown_file, validate_schema, ) +from markitect_tool.source import ( + NORMALIZED_SOURCE_SCHEMA_VERSION, + SOURCE_ADAPTER_ENTRY_POINT_GROUP, + NormalizationQuality, + NormalizedMarkdownDocument, + NormalizedMarkdownSegment, + SourceAdapterDescriptor, + SourceAdapterError, + SourceAdapterMatch, + SourceAdapterMatchRequest, + SourceAdapterRegistry, + SourceAsset, + SourceInspectRequest, + SourceInspectResult, + SourceMetadata, + SourceProvenance, + SourceReadAdapter, + SourceReadRequest, + SourceReadResult, + default_source_adapter_registry, + discover_source_adapters, + inspect_source, + normalization_cache_key, + normalize_source, + source_adapter_registry_descriptor, +) from markitect_tool.template import ( MissingTemplateVariable, TemplateAnalysis, @@ -295,6 +321,30 @@ __all__ = [ "validate_document", "validate_markdown_file", "validate_schema", + "NORMALIZED_SOURCE_SCHEMA_VERSION", + "SOURCE_ADAPTER_ENTRY_POINT_GROUP", + "NormalizationQuality", + "NormalizedMarkdownDocument", + "NormalizedMarkdownSegment", + "SourceAdapterDescriptor", + "SourceAdapterError", + "SourceAdapterMatch", + "SourceAdapterMatchRequest", + "SourceAdapterRegistry", + "SourceAsset", + "SourceInspectRequest", + "SourceInspectResult", + "SourceMetadata", + "SourceProvenance", + "SourceReadAdapter", + "SourceReadRequest", + "SourceReadResult", + "default_source_adapter_registry", + "discover_source_adapters", + "inspect_source", + "normalization_cache_key", + "normalize_source", + "source_adapter_registry_descriptor", "ContractCheckResult", "ContractValidationResult", "DocumentContract", diff --git a/src/markitect_tool/cli/main.py b/src/markitect_tool/cli/main.py index a7abbca..66de551 100644 --- a/src/markitect_tool/cli/main.py 
+++ b/src/markitect_tool/cli/main.py @@ -99,6 +99,11 @@ from markitect_tool.reference import ( ) from markitect_tool.runtime import evaluate_form_state, load_runtime_context_file from markitect_tool.schema import load_schema_file, validate_markdown_file, validate_schema +from markitect_tool.source import ( + default_source_adapter_registry, + inspect_source, + normalize_source, +) from markitect_tool.template import ( MissingTemplateVariable, TemplateError, @@ -197,6 +202,123 @@ def extension_commands(output_format: str) -> None: _emit_extension_catalog({"count": len(specs), "commands": specs}, output_format) +@main.group("source") +def source_group() -> None: + """Inspect source-format adapters and normalize sources.""" + + +@source_group.command("adapters") +@click.option( + "--format", + "output_format", + type=click.Choice(["json", "yaml", "text"], case_sensitive=False), + default="text", + show_default=True, +) +def source_adapters(output_format: str) -> None: + """List discovered read-only source adapters.""" + + _emit_source_adapters(default_source_adapter_registry().to_dict(), output_format) + + +@source_group.command("inspect") +@click.argument("source_path", type=click.Path(exists=True, dir_okay=False, path_type=Path)) +@click.option("--adapter", "adapter_id", help="Explicit source adapter id.") +@click.option( + "--option", + "option_values", + multiple=True, + metavar="KEY=VALUE", + help="Adapter-specific option. 
May be repeated.", +) +@click.option( + "--format", + "output_format", + type=click.Choice(["json", "yaml", "text"], case_sensitive=False), + default="json", + show_default=True, +) +def source_inspect( + source_path: Path, + adapter_id: str | None, + option_values: tuple[str, ...], + output_format: str, +) -> None: + """Inspect a local source without full Markdown conversion.""" + + try: + result = inspect_source( + source_path, + registry=default_source_adapter_registry(), + adapter_id=adapter_id, + options=_parse_key_value_options(option_values), + ) + except ValueError as exc: + raise click.ClickException(str(exc)) from exc + _emit_source_inspect(result.to_dict(), output_format) + raise click.exceptions.Exit(0 if result.is_valid else 1) + + +@source_group.command("normalize") +@click.argument("source_path", type=click.Path(exists=True, dir_okay=False, path_type=Path)) +@click.option("--adapter", "adapter_id", help="Explicit source adapter id.") +@click.option( + "--option", + "option_values", + multiple=True, + metavar="KEY=VALUE", + help="Adapter-specific option. 
May be repeated.", +) +@click.option( + "--output", + type=click.Path(dir_okay=False, path_type=Path), + help="Write normalized output to this file.", +) +@click.option( + "--format", + "output_format", + type=click.Choice(["markdown", "json", "yaml"], case_sensitive=False), + default="markdown", + show_default=True, +) +def source_normalize( + source_path: Path, + adapter_id: str | None, + option_values: tuple[str, ...], + output: Path | None, + output_format: str, +) -> None: + """Normalize a local source into canonical Markdown.""" + + try: + result = normalize_source( + source_path, + registry=default_source_adapter_registry(), + adapter_id=adapter_id, + options=_parse_key_value_options(option_values), + ) + except ValueError as exc: + raise click.ClickException(str(exc)) from exc + data = result.to_dict() + if output_format == "markdown": + if not result.is_valid or result.document is None: + for diagnostic in data.get("diagnostics", []): + click.echo( + f"[{diagnostic['severity']}] {diagnostic['code']}: " + f"{diagnostic['message']}", + err=True, + ) + raise click.exceptions.Exit(1) + markdown = result.document.markdown + if output: + output.write_text(markdown, encoding="utf-8") + else: + click.echo(markdown, nl=False) + else: + _emit_jsonish(data, output_format) + raise click.exceptions.Exit(0 if result.is_valid else 1) + + @main.group("docs") def docs_group() -> None: """Generate CLI and API reference documentation.""" @@ -2892,6 +3014,46 @@ def _emit_extension_catalog(data: dict, output_format: str) -> None: click.echo(f"- {extension['id']} ({extension['kind']})") +def _emit_source_adapters(data: dict, output_format: str) -> None: + if output_format == "json": + click.echo(json.dumps(data, indent=2, ensure_ascii=False)) + elif output_format == "yaml": + click.echo(yaml.safe_dump(data, sort_keys=False)) + else: + click.echo(f"adapters: {data.get('count', 0)}") + for adapter in data.get("adapters", []): + operations = ", ".join(adapter.get("operations", 
[])) + extensions = ", ".join(adapter.get("extensions", [])) + click.echo(f"- {adapter['id']} [{operations}] {extensions}") + check = adapter.get("dependency_check", {}) + if check.get("missing"): + click.echo(" missing: " + ", ".join(check["missing"])) + + +def _emit_source_inspect(data: dict, output_format: str) -> None: + if output_format == "json": + click.echo(json.dumps(data, indent=2, ensure_ascii=False)) + elif output_format == "yaml": + click.echo(yaml.safe_dump(data, sort_keys=False)) + else: + click.echo("valid" if data.get("valid") else "invalid") + asset = data.get("asset", {}) + adapter = data.get("adapter", {}) + metadata = data.get("metadata", {}) + click.echo(f"source: {asset.get('path') or asset.get('uri', '')}") + if adapter.get("id"): + click.echo(f"adapter: {adapter['id']}") + if metadata.get("title"): + click.echo(f"title: {metadata['title']}") + if metadata.get("creators"): + click.echo("creators: " + ", ".join(metadata["creators"])) + quality = data.get("quality", {}) + if quality.get("lossiness"): + click.echo(f"lossiness: {quality['lossiness']}") + for diagnostic in data.get("diagnostics", []): + click.echo(f"! 
[{diagnostic['severity']}] {diagnostic['code']}: {diagnostic['message']}") + + def _emit_jsonish(data: dict, output_format: str) -> None: if output_format == "yaml": click.echo(yaml.safe_dump(data, sort_keys=False)) diff --git a/src/markitect_tool/extension/builtins.py b/src/markitect_tool/extension/builtins.py index e71bc3d..2c6dd31 100644 --- a/src/markitect_tool/extension/builtins.py +++ b/src/markitect_tool/extension/builtins.py @@ -5,6 +5,10 @@ from __future__ import annotations from markitect_tool.extension.registry import ExtensionDescriptor, ExtensionRegistry from markitect_tool.extension.processing import ProcessingCapability from markitect_tool.query import default_query_engine_registry +from markitect_tool.source import ( + default_source_adapter_registry, + source_adapter_registry_descriptor, +) def builtin_extension_registry() -> ExtensionRegistry: @@ -22,8 +26,11 @@ def builtin_extension_registry() -> ExtensionRegistry: _local_label_policy_descriptor(), _document_function_descriptor(), _agent_memory_descriptor(), + source_adapter_registry_descriptor(), ]: registry.register(descriptor) + for descriptor in default_source_adapter_registry().extension_descriptors(): + registry.register(descriptor) return registry diff --git a/src/markitect_tool/source/__init__.py b/src/markitect_tool/source/__init__.py new file mode 100644 index 0000000..b7bbbcd --- /dev/null +++ b/src/markitect_tool/source/__init__.py @@ -0,0 +1,55 @@ +"""Source adapter contracts and normalization helpers.""" + +from markitect_tool.source.engine import ( + SOURCE_ADAPTER_ENTRY_POINT_GROUP, + NORMALIZED_SOURCE_SCHEMA_VERSION, + NormalizationQuality, + NormalizedMarkdownDocument, + NormalizedMarkdownSegment, + SourceAdapterDescriptor, + SourceAdapterError, + SourceAdapterMatch, + SourceAdapterMatchRequest, + SourceAdapterRegistry, + SourceAsset, + SourceInspectRequest, + SourceInspectResult, + SourceMetadata, + SourceProvenance, + SourceReadAdapter, + SourceReadRequest, + SourceReadResult, 
+ default_source_adapter_registry, + discover_source_adapters, + inspect_source, + normalization_cache_key, + normalize_source, + source_adapter_registry_descriptor, +) + +__all__ = [ + "SOURCE_ADAPTER_ENTRY_POINT_GROUP", + "NORMALIZED_SOURCE_SCHEMA_VERSION", + "NormalizationQuality", + "NormalizedMarkdownDocument", + "NormalizedMarkdownSegment", + "SourceAdapterDescriptor", + "SourceAdapterError", + "SourceAdapterMatch", + "SourceAdapterMatchRequest", + "SourceAdapterRegistry", + "SourceAsset", + "SourceInspectRequest", + "SourceInspectResult", + "SourceMetadata", + "SourceProvenance", + "SourceReadAdapter", + "SourceReadRequest", + "SourceReadResult", + "default_source_adapter_registry", + "discover_source_adapters", + "inspect_source", + "normalization_cache_key", + "normalize_source", + "source_adapter_registry_descriptor", +] diff --git a/src/markitect_tool/source/engine.py b/src/markitect_tool/source/engine.py new file mode 100644 index 0000000..58ceba9 --- /dev/null +++ b/src/markitect_tool/source/engine.py @@ -0,0 +1,1000 @@ +"""Read-only source adapter framework for normalized Markdown.""" + +from __future__ import annotations + +import hashlib +import importlib.metadata +import importlib.util +import json +import mimetypes +from dataclasses import asdict, dataclass, field +from pathlib import Path +from typing import Any, Iterable, Protocol, runtime_checkable + +from markitect_tool.diagnostics import Diagnostic, SourceLocation, has_error +from markitect_tool.extension import ( + ExtensionDependencyCheck, + ExtensionDescriptor, + OptionalDependency, + ProcessingCapability, +) + + +SOURCE_ADAPTER_ENTRY_POINT_GROUP = "markitect_tool.source_adapters" +NORMALIZED_SOURCE_SCHEMA_VERSION = "markitect.source.v1" +SOURCE_ADAPTER_KIND = "source-adapter" + + +class SourceAdapterError(ValueError): + """Raised when source adapter descriptors or registries are invalid.""" + + +@dataclass(frozen=True) +class SourceAsset: + """Identity and filesystem metadata for one 
source asset."""
+
+    uri: str
+    path: str | None = None
+    name: str | None = None
+    media_type: str | None = None
+    extension: str | None = None
+    size: int | None = None
+    mtime_ns: int | None = None
+    digest: str | None = None
+    metadata: dict[str, Any] = field(default_factory=dict)
+
+    @classmethod
+    def from_path(
+        cls,
+        path: str | Path,
+        *,
+        media_type: str | None = None,
+        compute_digest: bool = True,
+    ) -> "SourceAsset":
+        source_path = Path(path)
+        stat = source_path.stat()
+        detected_media_type = media_type or mimetypes.guess_type(source_path.name)[0]
+        digest = _file_digest(source_path) if compute_digest else None
+        return cls(
+            uri=str(source_path),
+            path=str(source_path),
+            name=source_path.name,
+            media_type=detected_media_type,
+            extension=source_path.suffix.lower() or None,
+            size=stat.st_size,
+            mtime_ns=stat.st_mtime_ns,
+            digest=digest,
+        )
+
+    def to_dict(self) -> dict[str, Any]:
+        return _drop_empty(asdict(self))
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "SourceAsset":
+        return cls(
+            uri=str(data["uri"]),
+            path=str(data["path"]) if data.get("path") is not None else None,
+            name=str(data["name"]) if data.get("name") is not None else None,
+            media_type=str(data["media_type"]) if data.get("media_type") is not None else None,
+            extension=str(data["extension"]) if data.get("extension") is not None else None,
+            size=int(data["size"]) if data.get("size") is not None else None,
+            mtime_ns=int(data["mtime_ns"]) if data.get("mtime_ns") is not None else None,
+            digest=str(data["digest"]) if data.get("digest") is not None else None,
+            metadata=dict(data.get("metadata", {})),
+        )
+
+
+@dataclass(frozen=True)
+class SourceMetadata:
+    """Format-provided descriptive metadata."""
+
+    title: str | None = None
+    creators: list[str] = field(default_factory=list)
+    language: str | None = None
+    rights: str | None = None
+    source_url: str | None = None
+    publication_date: str | None = None
+    publisher: str | None = None
+    identifiers: dict[str, str] = field(default_factory=dict)
+    raw: dict[str, Any] = field(default_factory=dict)
+
+    def to_dict(self) -> dict[str, Any]:
+        return _drop_empty(asdict(self))
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "SourceMetadata":
+        return cls(
+            title=str(data["title"]) if data.get("title") is not None else None,
+            creators=[str(value) for value in data.get("creators", [])],
+            language=str(data["language"]) if data.get("language") is not None else None,
+            rights=str(data["rights"]) if data.get("rights") is not None else None,
+            source_url=str(data["source_url"]) if data.get("source_url") is not None else None,
+            publication_date=str(data["publication_date"])
+            if data.get("publication_date") is not None
+            else None,
+            publisher=str(data["publisher"]) if data.get("publisher") is not None else None,
+            identifiers={str(key): str(value) for key, value in data.get("identifiers", {}).items()},
+            raw=dict(data.get("raw", {})),
+        )
+
+
+@dataclass(frozen=True)
+class SourceProvenance:
+    """Trace from normalized Markdown back to source locations."""
+
+    source_uri: str
+    source_path: str | None = None
+    source_href: str | None = None
+    package_path: str | None = None
+    anchor: str | None = None
+    page: str | None = None
+    section: str | None = None
+    start_offset: int | None = None
+    end_offset: int | None = None
+    digest: str | None = None
+    metadata: dict[str, Any] = field(default_factory=dict)
+
+    def to_dict(self) -> dict[str, Any]:
+        return _drop_empty(asdict(self))
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "SourceProvenance":
+        return cls(
+            source_uri=str(data["source_uri"]),
+            source_path=str(data["source_path"]) if data.get("source_path") is not None else None,
+            source_href=str(data["source_href"]) if data.get("source_href") is not None else None,
+            package_path=str(data["package_path"]) if data.get("package_path") is not None else None,
+            anchor=str(data["anchor"]) if data.get("anchor") is not None else None,
+            page=str(data["page"]) if data.get("page") is not None else None,
+            section=str(data["section"]) if data.get("section") is not None else None,
+            start_offset=int(data["start_offset"]) if data.get("start_offset") is not None else None,
+            end_offset=int(data["end_offset"]) if data.get("end_offset") is not None else None,
+            digest=str(data["digest"]) if data.get("digest") is not None else None,
+            metadata=dict(data.get("metadata", {})),
+        )
+
+
+@dataclass(frozen=True)
+class NormalizedMarkdownSegment:
+    """One ordered normalized Markdown segment."""
+
+    segment_id: str
+    order: int
+    markdown: str
+    heading: str | None = None
+    heading_level: int | None = None
+    anchors: list[str] = field(default_factory=list)
+    provenance: list[SourceProvenance] = field(default_factory=list)
+    metadata: dict[str, Any] = field(default_factory=dict)
+
+    def to_dict(self) -> dict[str, Any]:
+        data = {
+            "segment_id": self.segment_id,
+            "order": self.order,
+            "markdown": self.markdown,
+            "heading": self.heading,
+            "heading_level": self.heading_level,
+            "anchors": self.anchors,
+            "provenance": [event.to_dict() for event in self.provenance],
+            "metadata": self.metadata,
+        }
+        return _drop_empty(data)
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "NormalizedMarkdownSegment":
+        return cls(
+            segment_id=str(data["segment_id"]),
+            order=int(data["order"]),
+            markdown=str(data.get("markdown", "")),
+            heading=str(data["heading"]) if data.get("heading") is not None else None,
+            heading_level=int(data["heading_level"]) if data.get("heading_level") is not None else None,
+            anchors=[str(value) for value in data.get("anchors", [])],
+            provenance=[
+                SourceProvenance.from_dict(event)
+                for event in data.get("provenance", [])
+            ],
+            metadata=dict(data.get("metadata", {})),
+        )
+
+
+@dataclass(frozen=True)
+class NormalizationQuality:
+    """Summary of extraction lossiness and confidence."""
+
+    lossiness: str
+    confidence: float | None = None
+    skipped_items: int | None = None
+    warnings: int | None = None
+    metadata: dict[str, Any] = field(default_factory=dict)
+
+    def to_dict(self) -> dict[str, Any]:
+        return _drop_empty(asdict(self))
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "NormalizationQuality":
+        return cls(
+            lossiness=str(data["lossiness"]),
+            confidence=float(data["confidence"]) if data.get("confidence") is not None else None,
+            skipped_items=int(data["skipped_items"]) if data.get("skipped_items") is not None else None,
+            warnings=int(data["warnings"]) if data.get("warnings") is not None else None,
+            metadata=dict(data.get("metadata", {})),
+        )
+
+
+@dataclass(frozen=True)
+class NormalizedMarkdownDocument:
+    """Canonical source-normalized Markdown document."""
+
+    document_id: str
+    asset: SourceAsset
+    metadata: SourceMetadata
+    markdown: str
+    segments: list[NormalizedMarkdownSegment]
+    quality: NormalizationQuality
+    adapter: dict[str, Any]
+    cache_key: str
+    schema_version: str = NORMALIZED_SOURCE_SCHEMA_VERSION
+    diagnostics: list[Diagnostic] = field(default_factory=list)
+    provenance: list[SourceProvenance] = field(default_factory=list)
+    attachments: list[SourceAsset] = field(default_factory=list)
+
+    def to_dict(self) -> dict[str, Any]:
+        data = {
+            "schema_version": self.schema_version,
+            "document_id": self.document_id,
+            "asset": self.asset.to_dict(),
+            "metadata": self.metadata.to_dict(),
+            "markdown": self.markdown,
+            "segments": [segment.to_dict() for segment in self.segments],
+            "quality": self.quality.to_dict(),
+            "diagnostics": [diagnostic.to_dict() for diagnostic in self.diagnostics],
+            "provenance": [event.to_dict() for event in self.provenance],
+            "attachments": [attachment.to_dict() for attachment in self.attachments],
+            "adapter": self.adapter,
+            "cache_key": self.cache_key,
+        }
+        return _drop_empty(data)
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "NormalizedMarkdownDocument":
+        return cls(
+            schema_version=str(data.get("schema_version", NORMALIZED_SOURCE_SCHEMA_VERSION)),
+            document_id=str(data["document_id"]),
+            asset=SourceAsset.from_dict(data["asset"]),
+            metadata=SourceMetadata.from_dict(data.get("metadata", {})),
+            markdown=str(data.get("markdown", "")),
+            segments=[
+                NormalizedMarkdownSegment.from_dict(segment)
+                for segment in data.get("segments", [])
+            ],
+            quality=NormalizationQuality.from_dict(data.get("quality", {"lossiness": "unknown"})),
+            diagnostics=[
+                _diagnostic_from_dict(diagnostic)
+                for diagnostic in data.get("diagnostics", [])
+            ],
+            provenance=[
+                SourceProvenance.from_dict(event)
+                for event in data.get("provenance", [])
+            ],
+            attachments=[
+                SourceAsset.from_dict(asset)
+                for asset in data.get("attachments", [])
+            ],
+            adapter=dict(data.get("adapter", {})),
+            cache_key=str(data["cache_key"]),
+        )
+
+
+@dataclass(frozen=True)
+class SourceAdapterMatchRequest:
+    """Cheap adapter matching request."""
+
+    asset: SourceAsset
+    options: dict[str, Any] = field(default_factory=dict)
+
+    def to_dict(self) -> dict[str, Any]:
+        return _drop_empty({"asset": self.asset.to_dict(), "options": self.options})
+
+
+@dataclass(frozen=True)
+class SourceAdapterMatch:
+    """Result of an adapter match attempt."""
+
+    adapter_id: str
+    matched: bool
+    confidence: int = 0
+    reason: str | None = None
+    diagnostics: list[Diagnostic] = field(default_factory=list)
+
+    def to_dict(self) -> dict[str, Any]:
+        return _drop_empty(
+            {
+                "adapter_id": self.adapter_id,
+                "matched": self.matched,
+                "confidence": self.confidence,
+                "reason": self.reason,
+                "diagnostics": [diagnostic.to_dict() for diagnostic in self.diagnostics],
+            }
+        )
+
+
+@dataclass(frozen=True)
+class SourceInspectRequest:
+    """Metadata-only source inspection request."""
+
+    asset: SourceAsset
+    options: dict[str, Any] = field(default_factory=dict)
+
+    def to_dict(self) -> dict[str, Any]:
+        return _drop_empty({"asset": self.asset.to_dict(), "options": self.options})
+
+
+@dataclass(frozen=True)
+class SourceInspectResult:
+    """Metadata-only source inspection result."""
+
+    asset: SourceAsset
+    adapter: dict[str, Any]
+    metadata: SourceMetadata
+    quality: NormalizationQuality
+    capabilities: list[str] = field(default_factory=list)
+    diagnostics: list[Diagnostic] = field(default_factory=list)
+    valid: bool | None = None
+
+    @property
+    def is_valid(self) -> bool:
+        return (not has_error(self.diagnostics)) if self.valid is None else self.valid
+
+    def to_dict(self) -> dict[str, Any]:
+        return _drop_empty(
+            {
+                "valid": self.is_valid,
+                "asset": self.asset.to_dict(),
+                "adapter": self.adapter,
+                "metadata": self.metadata.to_dict(),
+                "capabilities": self.capabilities,
+                "quality": self.quality.to_dict(),
+                "diagnostics": [diagnostic.to_dict() for diagnostic in self.diagnostics],
+            }
+        )
+
+
+@dataclass(frozen=True)
+class SourceReadRequest:
+    """Full source normalization request."""
+
+    asset: SourceAsset
+    options: dict[str, Any] = field(default_factory=dict)
+
+    def to_dict(self) -> dict[str, Any]:
+        return _drop_empty({"asset": self.asset.to_dict(), "options": self.options})
+
+
+@dataclass(frozen=True)
+class SourceReadResult:
+    """Full source normalization result."""
+
+    document: NormalizedMarkdownDocument | None = None
+    diagnostics: list[Diagnostic] = field(default_factory=list)
+    valid: bool | None = None
+
+    @property
+    def is_valid(self) -> bool:
+        diagnostics = list(self.diagnostics)
+        if self.document is not None:
+            diagnostics.extend(self.document.diagnostics)
+        return (self.document is not None and not has_error(diagnostics)) if self.valid is None else self.valid
+
+    def to_dict(self) -> dict[str, Any]:
+        return _drop_empty(
+            {
+                "valid": self.is_valid,
+                "document": self.document.to_dict() if self.document else None,
+                "diagnostics": [diagnostic.to_dict() for diagnostic in self.diagnostics],
+            }
+        )
+
+
+@runtime_checkable
+class SourceReadAdapter(Protocol):
+    """Read-only source adapter protocol."""
+
+    descriptor: "SourceAdapterDescriptor"
+
+    def can_read(self, request: SourceAdapterMatchRequest) -> SourceAdapterMatch:
+        """Return whether this adapter can read the source asset."""
+
+    def inspect(self, request: SourceInspectRequest) -> SourceInspectResult:
+        """Inspect a source without full Markdown conversion."""
+
+    def read(self, request: SourceReadRequest) -> SourceReadResult:
+        """Normalize a source into canonical Markdown."""
+
+
+SourceAdapterFactory = Any
+
+
+@dataclass(frozen=True)
+class SourceAdapterDescriptor:
+    """Inspectable descriptor for one source read adapter."""
+
+    id: str
+    version: str
+    name: str
+    operations: list[str]
+    media_types: list[str]
+    extensions: list[str]
+    factory: SourceAdapterFactory = field(compare=False, repr=False)
+    summary: str | None = None
+    option_schema: dict[str, Any] = field(default_factory=dict)
+    optional_dependencies: list[OptionalDependency] = field(default_factory=list)
+    safety: dict[str, Any] = field(default_factory=dict)
+    quality_profile: dict[str, Any] = field(default_factory=dict)
+    metadata: dict[str, Any] = field(default_factory=dict)
+
+    def __post_init__(self) -> None:
+        if not self.id.strip():
+            raise SourceAdapterError("Source adapter id cannot be empty")
+        if not self.name.strip():
+            raise SourceAdapterError("Source adapter name cannot be empty")
+        unsupported_operations = sorted(set(self.operations) - {"read"})
+        if unsupported_operations:
+            raise SourceAdapterError(
+                "Source adapter v1 only supports read operations: "
+                + ", ".join(unsupported_operations)
+            )
+        if "read" not in self.operations:
+            raise SourceAdapterError("Source adapter operations must include read")
+        object.__setattr__(self, "media_types", [value.lower() for value in self.media_types])
+        object.__setattr__(self, "extensions", [value.lower() for value in self.extensions])
+
+    def instantiate(self) -> SourceReadAdapter:
+        adapter = self.factory()
+        if not isinstance(adapter, SourceReadAdapter):
+            missing = [
+                name
+                for name in ["can_read", "inspect", "read"]
+                if not callable(getattr(adapter, name, None))
+            ]
+            if missing:
+                raise SourceAdapterError(
+                    f"Source adapter `{self.id}` is missing method(s): {', '.join(missing)}"
+                )
+        return adapter
+
+    def to_dict(self) -> dict[str, Any]:
+        return _drop_empty(
+            {
+                "id": self.id,
+                "kind": SOURCE_ADAPTER_KIND,
+                "version": self.version,
+                "name": self.name,
+                "summary": self.summary,
+                "operations": self.operations,
+                "media_types": self.media_types,
+                "extensions": self.extensions,
+                "option_schema": self.option_schema,
+                "optional_dependencies": [
+                    dependency.to_dict()
+                    for dependency in self.optional_dependencies
+                ],
+                "safety": self.safety,
+                "quality_profile": self.quality_profile,
+                "metadata": self.metadata,
+                "capabilities": [
+                    capability.to_dict()
+                    for capability in _source_adapter_capabilities()
+                ],
+                "docs": ["docs/source-adapter-contract.md"],
+            }
+        )
+
+    def to_extension_descriptor(self) -> ExtensionDescriptor:
+        return ExtensionDescriptor(
+            id=self.id,
+            kind=SOURCE_ADAPTER_KIND,
+            version=self.version,
+            summary=self.summary or self.name,
+            capabilities=_source_adapter_capabilities(),
+            optional_dependencies=self.optional_dependencies,
+            safety=self.safety,
+            input_contract="SourceInspectRequest | SourceReadRequest",
+            output_contract="SourceInspectResult | SourceReadResult",
+            diagnostics_namespace="source",
+            provenance_prefix=self.id,
+            cli={"commands": ["mkt source adapters", "mkt source inspect", "mkt source normalize"]},
+            docs=["docs/source-adapter-contract.md"],
+            metadata={
+                "source_adapter": {
+                    "name": self.name,
+                    "operations": self.operations,
+                    "media_types": self.media_types,
+                    "extensions": self.extensions,
+                    "option_schema": self.option_schema,
+                    "quality_profile": self.quality_profile,
+                    "metadata": self.metadata,
+                }
+            },
+        )
+
+
+class SourceAdapterRegistry:
+    """Registry of source adapter descriptors."""
+
+    def __init__(self, descriptors: Iterable[SourceAdapterDescriptor] | None = None) -> None:
+        self._descriptors: dict[str, SourceAdapterDescriptor] = {}
+        for descriptor in descriptors or []:
+            self.register(descriptor)
+
+    def register(self, descriptor: SourceAdapterDescriptor) -> None:
+        if descriptor.id in self._descriptors:
+            raise SourceAdapterError(f"Duplicate source adapter id `{descriptor.id}`")
+        self._descriptors[descriptor.id] = descriptor
+
+    def get(self, adapter_id: str) -> SourceAdapterDescriptor:
+        try:
+            return self._descriptors[adapter_id]
+        except KeyError as exc:
+            raise SourceAdapterError(f"Unknown source adapter `{adapter_id}`") from exc
+
+    def list(self) -> list[SourceAdapterDescriptor]:
+        return [self._descriptors[key] for key in sorted(self._descriptors)]
+
+    def check_dependencies(
+        self,
+        adapter_id: str,
+        *,
+        available_modules: set[str] | None = None,
+    ) -> ExtensionDependencyCheck:
+        descriptor = self.get(adapter_id)
+        available = (
+            available_modules
+            if available_modules is not None
+            else _available_modules(
+                dependency.name for dependency in descriptor.optional_dependencies
+            )
+        )
+        missing: list[str] = []
+        optional_missing: list[str] = []
+        for dependency in descriptor.optional_dependencies:
+            if dependency.name in available:
+                continue
+            if dependency.required:
+                missing.append(dependency.name)
+            else:
+                optional_missing.append(dependency.name)
+        return ExtensionDependencyCheck(
+            extension_id=adapter_id,
+            missing=missing,
+            optional_missing=optional_missing,
+        )
+
+    def to_dict(self) -> dict[str, Any]:
+        adapters = []
+        for descriptor in self.list():
+            data = descriptor.to_dict()
+            data["dependency_check"] = self.check_dependencies(descriptor.id).to_dict()
+            adapters.append(data)
+        return {"count": len(adapters), "adapters": adapters}
+
+    def extension_descriptors(self) -> list[ExtensionDescriptor]:
+        return [descriptor.to_extension_descriptor() for descriptor in self.list()]
+
+    def select(
+        self,
+        asset: SourceAsset,
+        *,
+        adapter_id: str | None = None,
+        options: dict[str, Any] | None = None,
+    ) -> tuple[SourceAdapterDescriptor | None, SourceReadAdapter | None, list[Diagnostic]]:
+        options = options or {}
+        if adapter_id:
+            try:
+                descriptors = [self.get(adapter_id)]
+            except SourceAdapterError:
+                return None, None, [_source_error("source.adapter_unknown", f"Unknown source adapter `{adapter_id}`.")]
+        else:
+            descriptors = self.list()
+        candidates: list[tuple[int, int, str, SourceAdapterDescriptor, SourceReadAdapter]] = []
+        diagnostics: list[Diagnostic] = []
+        for descriptor in descriptors:
+            dependency_check = self.check_dependencies(descriptor.id)
+            if not dependency_check.compatible:
+                diagnostics.append(
+                    _source_error(
+                        "source.missing_dependency",
+                        f"Source adapter `{descriptor.id}` is missing required dependencies.",
+                        details=dependency_check.to_dict(),
+                    )
+                )
+                continue
+            try:
+                adapter = descriptor.instantiate()
+                match = adapter.can_read(SourceAdapterMatchRequest(asset=asset, options=options))
+            except Exception as exc:  # pragma: no cover - defensive boundary
+                diagnostics.append(
+                    _source_error(
+                        "source.adapter_failed",
+                        f"Source adapter `{descriptor.id}` failed while matching.",
+                        details={"error": str(exc)},
+                    )
+                )
+                continue
+            diagnostics.extend(match.diagnostics)
+            if not match.matched:
+                continue
+            media_score = 1 if asset.media_type and asset.media_type.lower() in descriptor.media_types else 0
+            candidates.append((media_score, int(match.confidence), descriptor.id, descriptor, adapter))
+        if not candidates:
+            if not diagnostics:
+                diagnostics.append(
+                    _source_error(
+                        "source.unsupported_format",
+                        f"No source adapter can read `{asset.uri}`.",
+                        source_path=asset.path,
+                    )
+                )
+            return None, None, diagnostics
+        candidates.sort(key=lambda item: (-item[0], -item[1], item[2]))
+        best = candidates[0]
+        if len(candidates) > 1 and candidates[1][0:2] == best[0:2]:
+            diagnostics.append(
+                Diagnostic(
+                    severity="warning",
+                    code="source.adapter_ambiguous",
+                    message=(
+                        "Multiple source adapters matched; selected "
+                        f"`{best[2]}` deterministically."
+                    ),
+                    source=SourceLocation(path=asset.path) if asset.path else None,
+                    details={"candidates": [candidate[2] for candidate in candidates]},
+                )
+            )
+        return best[3], best[4], diagnostics
+
+
+def source_adapter_registry_descriptor() -> ExtensionDescriptor:
+    """Return the built-in descriptor for source adapter discovery."""
+
+    return ExtensionDescriptor(
+        id="source.adapter-registry",
+        kind="source-adapter-registry",
+        summary="Read-only source adapter discovery and Markdown normalization framework.",
+        capabilities=[
+            ProcessingCapability(id="source_adapters", kind="inspect"),
+            ProcessingCapability(id="source", kind="read"),
+            ProcessingCapability(id="markdown", kind="normalize"),
+            ProcessingCapability(id="entry_points", kind="discover"),
+        ],
+        safety={"reads_files": True, "writes_files": False, "network": False},
+        input_contract="SourceAsset",
+        output_contract="NormalizedMarkdownDocument",
+        diagnostics_namespace="source",
+        provenance_prefix="source",
+        cli={"commands": ["mkt source adapters", "mkt source inspect", "mkt source normalize"]},
+        docs=["docs/source-adapter-contract.md"],
+        examples=["examples/source-adapters/"],
+        metadata={
+            "entry_point_group": SOURCE_ADAPTER_ENTRY_POINT_GROUP,
+            "v1_scope": "read-only",
+        },
+    )
+
+
+def default_source_adapter_registry() -> SourceAdapterRegistry:
+    """Return the discovered source adapter registry."""
+
+    return discover_source_adapters()
+
+
+def discover_source_adapters(
+    entry_points: Iterable[Any] | None = None,
+) -> SourceAdapterRegistry:
+    """Discover source adapters from package entry points."""
+
+    registry = SourceAdapterRegistry()
+    raw_entry_points = entry_points if entry_points is not None else _source_adapter_entry_points()
+    for entry_point in raw_entry_points:
+        try:
+            loaded = entry_point.load()
+            for descriptor in _normalize_entry_point_result(loaded):
+                registry.register(descriptor)
+        except Exception as exc:
+            name = getattr(entry_point, "name", "")
+            descriptor = _failed_discovery_descriptor(name, exc)
+            registry.register(descriptor)
+    return registry
+
+
+def inspect_source(
+    path_or_uri: str | Path,
+    *,
+    registry: SourceAdapterRegistry | None = None,
+    adapter_id: str | None = None,
+    options: dict[str, Any] | None = None,
+) -> SourceInspectResult:
+    """Inspect a local source through the selected adapter."""
+
+    options = options or {}
+    try:
+        asset = _asset_from_input(path_or_uri)
+    except FileNotFoundError:
+        asset = SourceAsset(uri=str(path_or_uri), path=str(path_or_uri))
+        diagnostic = _source_error(
+            "source.file_not_found",
+            f"Source file `{path_or_uri}` does not exist.",
+            source_path=str(path_or_uri),
+        )
+        return SourceInspectResult(
+            asset=asset,
+            adapter={},
+            metadata=SourceMetadata(),
+            capabilities=[],
+            quality=NormalizationQuality(lossiness="unknown"),
+            diagnostics=[diagnostic],
+            valid=False,
+        )
+    selected_registry = registry or default_source_adapter_registry()
+    descriptor, adapter, diagnostics = selected_registry.select(
+        asset,
+        adapter_id=adapter_id,
+        options=options,
+    )
+    if descriptor is None or adapter is None:
+        return SourceInspectResult(
+            asset=asset,
+            adapter={},
+            metadata=SourceMetadata(),
+            capabilities=[],
+            quality=NormalizationQuality(lossiness="unknown"),
+            diagnostics=diagnostics,
+            valid=False,
+        )
+    result = adapter.inspect(SourceInspectRequest(asset=asset, options=options))
+    if diagnostics:
+        result = SourceInspectResult(
+            asset=result.asset,
+            adapter=result.adapter,
+            metadata=result.metadata,
+            capabilities=result.capabilities,
+            quality=result.quality,
+            diagnostics=[*diagnostics, *result.diagnostics],
+            valid=result.is_valid and not has_error(diagnostics),
+        )
+    return result
+
+
+def normalize_source(
+    path_or_uri: str | Path,
+    *,
+    registry: SourceAdapterRegistry | None = None,
+    adapter_id: str | None = None,
+    options: dict[str, Any] | None = None,
+) -> SourceReadResult:
+    """Normalize a local source through the selected adapter."""
+
+    options = options or {}
+    try:
+        asset = _asset_from_input(path_or_uri)
+    except FileNotFoundError:
+        return SourceReadResult(
+            document=None,
+            diagnostics=[
+                _source_error(
+                    "source.file_not_found",
+                    f"Source file `{path_or_uri}` does not exist.",
+                    source_path=str(path_or_uri),
+                )
+            ],
+            valid=False,
+        )
+    selected_registry = registry or default_source_adapter_registry()
+    descriptor, adapter, diagnostics = selected_registry.select(
+        asset,
+        adapter_id=adapter_id,
+        options=options,
+    )
+    if descriptor is None or adapter is None:
+        return SourceReadResult(document=None, diagnostics=diagnostics, valid=False)
+    result = adapter.read(SourceReadRequest(asset=asset, options=options))
+    if diagnostics:
+        result = SourceReadResult(
+            document=result.document,
+            diagnostics=[*diagnostics, *result.diagnostics],
+            valid=result.is_valid and not has_error(diagnostics),
+        )
+    return result
+
+
+def normalization_cache_key(
+    *,
+    asset: SourceAsset,
+    adapter_id: str,
+    adapter_version: str,
+    options: dict[str, Any] | None = None,
+) -> str:
+    """Compute a stable normalization cache key."""
+
+    payload = {
+        "schema_version": NORMALIZED_SOURCE_SCHEMA_VERSION,
+        "asset_uri": asset.uri,
+        "asset_path": asset.path,
+        "asset_digest": asset.digest,
+        "adapter_id": adapter_id,
+        "adapter_version": adapter_version,
+        "options": options or {},
+    }
+    digest = hashlib.sha256(_canonical_json(payload).encode("utf-8")).hexdigest()
+    return f"source-normalize:sha256:{digest}"
+
+
+def _asset_from_input(path_or_uri: str | Path) -> SourceAsset:
+    path = Path(path_or_uri)
+    if not path.exists():
+        raise FileNotFoundError(str(path_or_uri))
+    return SourceAsset.from_path(path)
+
+
+def _source_adapter_entry_points() -> list[Any]:
+    entry_points = importlib.metadata.entry_points()
+    if hasattr(entry_points, "select"):
+        return list(entry_points.select(group=SOURCE_ADAPTER_ENTRY_POINT_GROUP))
+    return list(entry_points.get(SOURCE_ADAPTER_ENTRY_POINT_GROUP, []))
+
+
+def _normalize_entry_point_result(loaded: Any) -> list[SourceAdapterDescriptor]:
+    value = loaded() if callable(loaded) and not isinstance(loaded, SourceAdapterDescriptor) else loaded
+    if isinstance(value, SourceAdapterDescriptor):
+        return [value]
+    if isinstance(value, Iterable) and not isinstance(value, (str, bytes, dict)):
+        descriptors = list(value)
+        if all(isinstance(descriptor, SourceAdapterDescriptor) for descriptor in descriptors):
+            return descriptors
+    raise SourceAdapterError("Source adapter entry point must return SourceAdapterDescriptor objects")
+
+
+def _failed_discovery_descriptor(name: str, exc: Exception) -> SourceAdapterDescriptor:
+    def factory() -> SourceReadAdapter:
+        return _FailedSourceReadAdapter(name, exc)
+
+    return SourceAdapterDescriptor(
+        id=f"source.discovery-failed.{_safe_identifier(name)}",
+        version="1",
+        name=f"Failed source adapter entry point: {name}",
+        operations=["read"],
+        media_types=[],
+        extensions=[],
+        factory=factory,
+        summary="Source adapter entry point failed to load.",
+        safety={"reads_files": False, "writes_files": False, "network": False},
+        metadata={"error": str(exc), "entry_point": name},
+    )
+
+
+class _FailedSourceReadAdapter:
+    def __init__(self, name: str, exc: Exception) -> None:
+        self.descriptor = _failed_discovery_descriptor(name, exc)
+        self._name = name
+        self._exc = exc
+
+    def can_read(self, request: SourceAdapterMatchRequest) -> SourceAdapterMatch:
+        return SourceAdapterMatch(
+            adapter_id=self.descriptor.id,
+            matched=False,
+            diagnostics=[
+                _source_error(
+                    "source.discovery_failed",
+                    f"Source adapter entry point `{self._name}` failed to load.",
+                    details={"error": str(self._exc)},
+                )
+            ],
+        )
+
+    def inspect(self, request: SourceInspectRequest) -> SourceInspectResult:
+        return SourceInspectResult(
+            asset=request.asset,
+            adapter={"id": self.descriptor.id, "version": self.descriptor.version},
+            metadata=SourceMetadata(),
+            capabilities=[],
+            quality=NormalizationQuality(lossiness="unknown"),
+            diagnostics=[
+                _source_error(
+                    "source.discovery_failed",
+                    f"Source adapter entry point `{self._name}` failed to load.",
+                    details={"error": str(self._exc)},
+                )
+            ],
+            valid=False,
+        )
+
+    def read(self, request: SourceReadRequest) -> SourceReadResult:
+        return SourceReadResult(
+            diagnostics=[
+                _source_error(
+                    "source.discovery_failed",
+                    f"Source adapter entry point `{self._name}` failed to load.",
+                    details={"error": str(self._exc)},
+                )
+            ],
+            valid=False,
+        )
+
+
+def _file_digest(path: Path) -> str:
+    hasher = hashlib.sha256()
+    with path.open("rb") as handle:
+        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
+            hasher.update(chunk)
+    return "sha256:" + hasher.hexdigest()
+
+
+def _source_adapter_capabilities() -> list[ProcessingCapability]:
+    return [
+        ProcessingCapability(id="source", kind="read"),
+        ProcessingCapability(id="markdown", kind="normalize"),
+        ProcessingCapability(id="diagnostics", kind="emit"),
+        ProcessingCapability(id="provenance", kind="emit"),
+        ProcessingCapability(id="filesystem", kind="read"),
+    ]
+
+
+def _source_error(
+    code: str,
+    message: str,
+    *,
+    source_path: str | None = None,
+    details: dict[str, Any] | None = None,
+) -> Diagnostic:
+    return Diagnostic(
+        severity="error",
+        code=code,
+        message=message,
+        source=SourceLocation(path=source_path) if source_path else None,
+        details=details or {},
+    )
+
+
+def _diagnostic_from_dict(data: dict[str, Any]) -> Diagnostic:
+    return Diagnostic(
+        severity=str(data["severity"]),
+        code=str(data["code"]),
+        message=str(data["message"]),
+        source=_source_location_from_dict(data["source"]) if data.get("source") else None,
+        contract=_source_location_from_dict(data["contract"]) if data.get("contract") else None,
+        rule_id=str(data["rule_id"]) if data.get("rule_id") is not None else None,
+        guidance=str(data["guidance"]) if data.get("guidance") is not None else None,
+        details=dict(data.get("details", {})),
+    )
+
+
+def _source_location_from_dict(data: dict[str, Any]) -> SourceLocation:
+    return SourceLocation(
+        path=str(data["path"]) if data.get("path") is not None else None,
+        line=int(data["line"]) if data.get("line") is not None else None,
+        column=int(data["column"]) if data.get("column") is not None else None,
+    )
+
+
+def _available_modules(module_names: Iterable[str]) -> set[str]:
+    return {
+        module_name
+        for module_name in module_names
+        if importlib.util.find_spec(module_name) is not None
+    }
+
+
+def _canonical_json(data: dict[str, Any]) -> str:
+    return json.dumps(data, sort_keys=True, ensure_ascii=False, separators=(",", ":"), default=str)
+
+
+def _drop_empty(data: dict[str, Any]) -> dict[str, Any]:
+    return {
+        key: value
+        for key, value in data.items()
+        if value not in (None, [], {}, "")
+    }
+
+
+def _safe_identifier(name: str) -> str:
+    safe = "".join(character if character.isalnum() else "-" for character in name.lower())
+    return safe.strip("-") or "unknown"
diff --git a/tests/test_builtin_extension_catalog.py b/tests/test_builtin_extension_catalog.py
index 788d15a..c274fd3 100644
--- a/tests/test_builtin_extension_catalog.py
+++ b/tests/test_builtin_extension_catalog.py
@@ -22,6 +22,7 @@ def test_builtin_extension_registry_lists_query_processors_and_backend():
     assert "policy.local-label" in ids
     assert "document.function" in ids
     assert "memory.context-package" in ids
+    assert "source.adapter-registry" in ids
 
 
 def test_builtin_meta_descriptors_expose_extension_and_docs_accesspoints():
diff --git a/tests/test_cli_api_polish.py b/tests/test_cli_api_polish.py
index d6ee27f..0e9e1af 100644
--- a/tests/test_cli_api_polish.py
+++ b/tests/test_cli_api_polish.py
@@ -43,6 +43,7 @@ def test_mkt_docs_cli_generates_command_reference():
     assert result.exit_code == 0, result.output
     assert "# Markitect CLI Reference" in result.output
     assert "## `mkt extension commands`" in result.output
+    assert "## `mkt source normalize`" in result.output
     assert "## `mkt docs api`" in result.output
 
 
@@ -54,6 +55,7 @@ def test_mkt_docs_api_generates_public_api_reference():
     assert "query_document_jsonpath" in result.output
     assert "ExtensionDescriptor" in result.output
     assert "LocalSnapshotStore" in result.output
+    assert "SourceAdapterRegistry" in result.output
 
 
 def test_top_level_api_exports_newer_architecture_surfaces():
@@ -63,3 +65,5 @@ def test_top_level_api_exports_newer_architecture_surfaces():
     assert api.ExtensionDescriptor
     assert api.builtin_extension_registry
     assert api.validate_schema
+    assert api.SourceAdapterRegistry
+    assert api.normalize_source
diff --git a/tests/test_source_adapter_contract.py b/tests/test_source_adapter_contract.py
new file mode 100644
index 0000000..a9996b5
--- /dev/null
+++ b/tests/test_source_adapter_contract.py
@@ -0,0 +1,380 @@
+import importlib
+import json
+from pathlib import Path
+
+from click.testing import CliRunner
+
+import markitect_tool as api
+from markitect_tool.diagnostics import Diagnostic
+from markitect_tool.extension import OptionalDependency, builtin_extension_registry
+from markitect_tool.source import (
+    NORMALIZED_SOURCE_SCHEMA_VERSION,
+    NormalizationQuality,
+    NormalizedMarkdownDocument,
+    NormalizedMarkdownSegment,
+    SourceAdapterDescriptor,
+    SourceAdapterMatch,
+    SourceAdapterMatchRequest,
+    SourceAdapterRegistry,
+    SourceAsset,
+    SourceInspectRequest,
+    SourceInspectResult,
+    SourceMetadata,
+    SourceProvenance,
+    SourceReadRequest,
+    SourceReadResult,
+    discover_source_adapters,
+    inspect_source,
+    normalization_cache_key,
+    normalize_source,
+)
+
+
+SAMPLE_SOURCE = Path("examples/source-adapters/sample.fake")
+NORMALIZED_MARKDOWN = (
+    "# Fake Source\n\n"
+    "A small normalized segment.\n\n"
+    "## Second Segment\n\n"
+    "Another deterministic segment."
+)
+
+
+class FakeSourceAdapter:
+    def __init__(self, descriptor: SourceAdapterDescriptor, *, confidence: int = 80) -> None:
+        self.descriptor = descriptor
+        self.confidence = confidence
+
+    def can_read(self, request: SourceAdapterMatchRequest) -> SourceAdapterMatch:
+        return SourceAdapterMatch(
+            adapter_id=self.descriptor.id,
+            matched=request.asset.extension == ".fake",
+            confidence=self.confidence,
+            reason="extension",
+        )
+
+    def inspect(self, request: SourceInspectRequest) -> SourceInspectResult:
+        return SourceInspectResult(
+            asset=request.asset,
+            adapter={"id": self.descriptor.id, "version": self.descriptor.version, "options": request.options},
+            metadata=_source_metadata(),
+            capabilities=["read"],
+            quality=NormalizationQuality(lossiness="none", confidence=1.0),
+        )
+
+    def read(self, request: SourceReadRequest) -> SourceReadResult:
+        asset = request.asset
+        provenance = [
+            SourceProvenance(
+                source_uri=asset.uri,
+                source_path=asset.path,
+                digest=asset.digest,
+            )
+        ]
+        segments = [
+            NormalizedMarkdownSegment(
+                segment_id="seg-0001",
+                order=0,
+                heading="Fake Source",
+                heading_level=1,
+                markdown="# Fake Source\n\nA small normalized segment.",
+                anchors=["fake-source"],
+                provenance=[
+                    SourceProvenance(
+                        source_uri=asset.uri,
+                        source_path=asset.path,
+                        anchor="fake-source",
+                        section="Fake Source",
+                    )
+                ],
+            ),
+            NormalizedMarkdownSegment(
+                segment_id="seg-0002",
+                order=1,
+                heading="Second Segment",
+                heading_level=2,
+                markdown="## Second Segment\n\nAnother deterministic segment.",
+                anchors=["second-segment"],
+                provenance=[
+                    SourceProvenance(
+                        source_uri=asset.uri,
+                        source_path=asset.path,
+                        anchor="second-segment",
+                        section="Second Segment",
+                    )
+                ],
+            ),
+        ]
+        cache_key = normalization_cache_key(
+            asset=asset,
+            adapter_id=self.descriptor.id,
+            adapter_version=self.descriptor.version,
+            options=request.options,
+        )
+        document = NormalizedMarkdownDocument(
+            document_id=f"{self.descriptor.id}:fake-source-001",
+            asset=asset,
+            metadata=_source_metadata(),
+            markdown=NORMALIZED_MARKDOWN,
+            segments=segments,
+            quality=NormalizationQuality(lossiness="none", confidence=1.0, skipped_items=0, warnings=0),
+            provenance=provenance,
+            adapter={"id": self.descriptor.id, "version": self.descriptor.version, "options": request.options},
+            cache_key=cache_key,
+        )
+        return SourceReadResult(document=document)
+
+
+def _source_metadata() -> SourceMetadata:
+    return SourceMetadata(
+        title="Fake Source",
+        creators=["Markitect Fixture"],
+        language="en",
+        identifiers={"fixture": "fake-source-001"},
+    )
+
+
+def _fake_descriptor(adapter_id: str = "source.fake", *, confidence: int = 80) -> SourceAdapterDescriptor:
+    descriptor = None
+
+    def factory() -> FakeSourceAdapter:
+        assert descriptor is not None
+        return FakeSourceAdapter(descriptor, confidence=confidence)
+
+    descriptor = SourceAdapterDescriptor(
+        id=adapter_id,
+        version="1",
+        name="Fake Source Adapter",
+        summary="Contract-test adapter for plain fixture sources.",
+        operations=["read"],
+        media_types=["text/x.markitect-fake"],
+        extensions=[".fake"],
+        factory=factory,
+        safety={
+            "reads_files": True,
+            "writes_files": False,
+            "network": False,
+            "external_process": False,
+        },
+    )
+    return descriptor
+
+
+def test_normalized_document_serialization_round_trips():
+    registry = SourceAdapterRegistry([_fake_descriptor()])
+    result = normalize_source(SAMPLE_SOURCE, registry=registry)
+
+    assert result.is_valid
+    assert result.document is not None
+    data = result.document.to_dict()
+    round_trip = NormalizedMarkdownDocument.from_dict(data).to_dict()
+
+    assert round_trip == data
+    assert data["schema_version"] == NORMALIZED_SOURCE_SCHEMA_VERSION
+    assert data["markdown"] == NORMALIZED_MARKDOWN
+    assert data["segments"][0]["segment_id"] == "seg-0001"
+
+
+def test_normalization_cache_key_is_deterministic():
+    asset = SourceAsset(uri="sample.fake", path="sample.fake", digest="sha256:abc")
+
+    first = normalization_cache_key(
+        asset=asset,
+        adapter_id="source.fake",
+        adapter_version="1",
+        options={"skip_boilerplate": True},
+    )
+    second = normalization_cache_key(
+        asset=asset,
+        adapter_id="source.fake",
+        adapter_version="1",
+        options={"skip_boilerplate": True},
+    )
+
+    assert first == second
+    assert first.startswith("source-normalize:sha256:")
+
+
+def test_source_registry_selects_fake_adapter_and_reports_unsupported():
+    registry = SourceAdapterRegistry([_fake_descriptor()])
+    asset = SourceAsset.from_path(SAMPLE_SOURCE)
+    descriptor, adapter, diagnostics = registry.select(asset)
+
+    assert descriptor is not None
+    assert descriptor.id == "source.fake"
+    assert adapter is not None
+    assert diagnostics == []
+
+    unsupported = SourceAsset(uri="example.bin", extension=".bin")
+    descriptor, adapter, diagnostics = registry.select(unsupported)
+
+    assert descriptor is None
+    assert adapter is None
+    assert diagnostics[0].code == "source.unsupported_format"
+
+
+def test_source_registry_reports_missing_required_dependency():
+    descriptor = SourceAdapterDescriptor(
+        id="source.needs-missing",
+        version="1",
+        name="Missing Dependency Adapter",
+        operations=["read"],
+        media_types=[],
+        extensions=[".fake"],
+        factory=lambda: FakeSourceAdapter(_fake_descriptor("source.needs-missing")),
+        optional_dependencies=[
+            OptionalDependency(
+                name="definitely_missing_markitect_source_adapter_dependency",
+                package="missing-package",
+                required=True,
+            )
+        ],
+    )
+    registry = SourceAdapterRegistry([descriptor])
+
+    _, _, diagnostics = registry.select(SourceAsset.from_path(SAMPLE_SOURCE))
+
+    assert diagnostics[0].code == "source.missing_dependency"
+    assert "definitely_missing_markitect_source_adapter_dependency" in diagnostics[0].details["missing"]
+
+
+def test_source_registry_breaks_ambiguous_matches_by_adapter_id():
+    registry = SourceAdapterRegistry(
+        [
+            _fake_descriptor("source.b", confidence=80),
+            _fake_descriptor("source.a", confidence=80),
+        ]
+    )
+
+    descriptor, _, diagnostics = registry.select(SourceAsset.from_path(SAMPLE_SOURCE))
+
+    assert descriptor is not None
+    assert descriptor.id == "source.a"
+    assert [diagnostic.code for diagnostic in diagnostics] == ["source.adapter_ambiguous"]
+
+
+class FakeEntryPoint:
+    name = "fake"
+
+    def load(self):
+        return _fake_descriptor()
+
+
+def test_discover_source_adapters_accepts_entry_point_descriptors():
+    registry = discover_source_adapters([FakeEntryPoint()])
+
+    assert registry.get("source.fake").name == "Fake Source Adapter"
+
+
+def test_source_descriptor_maps_to_extension_descriptor():
+    extension = _fake_descriptor().to_extension_descriptor()
+
+    assert extension.kind == "source-adapter"
+    assert extension.input_contract == "SourceInspectRequest | SourceReadRequest"
+    assert "mkt source normalize" in extension.cli["commands"]
+    assert {capability.id for capability in extension.capabilities} >= {
+        "source",
+        "markdown",
+        "diagnostics",
+        "provenance",
+    }
+
+
+def test_builtin_registry_exposes_source_adapter_framework():
+    registry = builtin_extension_registry()
+
+    descriptor = registry.get("source.adapter-registry")
+
+    assert descriptor.kind == "source-adapter-registry"
+    assert descriptor.metadata["entry_point_group"] == "markitect_tool.source_adapters"
+    assert "mkt source adapters" in descriptor.cli["commands"]
+
+
+def test_inspect_and_normalize_source_api_use_injected_registry():
+    registry = SourceAdapterRegistry([_fake_descriptor()])
+
+    inspected = inspect_source(SAMPLE_SOURCE, registry=registry)
+    normalized = normalize_source(SAMPLE_SOURCE, registry=registry)
+
+    assert inspected.is_valid
+    assert inspected.metadata.title == "Fake Source"
+    assert normalized.is_valid
+    assert normalized.document is not None
+    assert normalized.document.markdown == NORMALIZED_MARKDOWN
+
+
+def test_source_cli_uses_registry_and_emits_json(monkeypatch):
+    cli_module = importlib.import_module("markitect_tool.cli.main")
+    monkeypatch.setattr(
+        cli_module,
+        "default_source_adapter_registry",
+ lambda: SourceAdapterRegistry([_fake_descriptor()]), + ) + + result = CliRunner().invoke(cli_module.main, ["source", "adapters", "--format", "json"]) + + assert result.exit_code == 0, result.output + data = json.loads(result.output) + assert data["count"] == 1 + assert data["adapters"][0]["id"] == "source.fake" + + +def test_source_cli_inspect_and_normalize(monkeypatch): + cli_module = importlib.import_module("markitect_tool.cli.main") + monkeypatch.setattr( + cli_module, + "default_source_adapter_registry", + lambda: SourceAdapterRegistry([_fake_descriptor()]), + ) + runner = CliRunner() + + inspected = runner.invoke( + cli_module.main, + ["source", "inspect", str(SAMPLE_SOURCE), "--format", "json"], + ) + normalized = runner.invoke( + cli_module.main, + ["source", "normalize", str(SAMPLE_SOURCE), "--format", "markdown"], + ) + + assert inspected.exit_code == 0, inspected.output + assert json.loads(inspected.output)["metadata"]["title"] == "Fake Source" + assert normalized.exit_code == 0, normalized.output + assert normalized.output == NORMALIZED_MARKDOWN + + +def test_source_cli_markdown_output_suppresses_invalid_partial(monkeypatch): + cli_module = importlib.import_module("markitect_tool.cli.main") + monkeypatch.setattr( + cli_module, + "default_source_adapter_registry", + lambda: SourceAdapterRegistry(), + ) + + result = CliRunner(mix_stderr=False).invoke( + cli_module.main, + ["source", "normalize", str(SAMPLE_SOURCE), "--format", "markdown"], + ) + + assert result.exit_code == 1 + assert result.output == "" + assert "source.unsupported_format" in result.stderr + + +def test_source_examples_are_valid_json_fixtures(): + for path in [ + "examples/source-adapters/adapter-list.json", + "examples/source-adapters/inspect-result.json", + "examples/source-adapters/normalized-document.json", + ]: + with open(path, encoding="utf-8") as handle: + data = json.load(handle) + assert data + + +def test_top_level_api_exports_source_contract(): + assert api.SourceAsset + 
assert api.SourceAdapterDescriptor + assert api.SourceAdapterRegistry + assert api.default_source_adapter_registry + assert api.normalize_source + assert api.SOURCE_ADAPTER_ENTRY_POINT_GROUP == "markitect_tool.source_adapters" diff --git a/workplans/MKTT-WP-0018-source-adapter-contract.md b/workplans/MKTT-WP-0018-source-adapter-contract.md index 587155c..f938e0d 100644 --- a/workplans/MKTT-WP-0018-source-adapter-contract.md +++ b/workplans/MKTT-WP-0018-source-adapter-contract.md @@ -3,10 +3,10 @@ id: MKTT-WP-0018 type: workplan title: "Source Adapter Interface And Markdown Normalization Contract" domain: markitect -status: active +status: done owner: markitect-tool topic_slug: markitect -planning_priority: P0 +planning_priority: complete planning_order: 145 depends_on_workplans: - MKTT-WP-0013 @@ -108,7 +108,7 @@ preservation contract exists. ```task id: MKTT-WP-0018-T001 -status: todo +status: done priority: high state_hub_task_id: "a5d05b2a-b9d8-43c6-9e52-5a77094b49d1" ``` @@ -124,11 +124,16 @@ Output: architecture note covering responsibilities, extension package shape, the `docs/source-adapter-contract.md` entry point contract, dependency policy, and migration path from the current `infospace-bench` EPUB spike. +Implemented: `docs/source-adapter-contract.md` defines the contract boundary, +external package shape, dependency policy, entry point group, and +`markitect-filter` EPUB3 handoff. `docs/source-adapter-migration.md` documents +the sibling-repo migration path. + ## P18.2 - Canonical source-to-markdown data model ```task id: MKTT-WP-0018-T002 -status: todo +status: done priority: high state_hub_task_id: "f8164264-a9c1-4c82-8617-76bbb84a51bb" ``` @@ -160,11 +165,17 @@ Output: public data model, serialization tests using `examples/source-adapters/normalized-document.json`, and normalization contract documentation matching the field-level v1 specification. 
+Implemented: `markitect_tool.source` exposes `SourceAsset`, `SourceMetadata`, +`NormalizedMarkdownDocument`, `NormalizedMarkdownSegment`, +`SourceProvenance`, and `NormalizationQuality`, with stable dictionary +serialization, round-trip tests, digest/cache-key support, diagnostics, and +fixture coverage. + ## P18.3 - Source adapter protocol and capability descriptors ```task id: MKTT-WP-0018-T003 -status: todo +status: done priority: high state_hub_task_id: "5036ff34-49f4-4900-9e90-95c4555b4ce9" ``` @@ -189,11 +200,16 @@ Concrete EPUB3 implementation belongs in `markitect-filter`. Output: protocol module, descriptor integration, tests for matching, inspection, reading, diagnostics, and unsupported-format behavior. +Implemented: `SourceReadAdapter`, request/result types, +`SourceAdapterDescriptor`, deterministic selection, dependency diagnostics, +unsupported-format diagnostics, and read-only capability descriptors live in +`markitect_tool.source`. + ## P18.4 - Adapter registry and discovery hooks ```task id: MKTT-WP-0018-T004 -status: todo +status: done priority: high state_hub_task_id: "391fb723-8990-4086-ac6c-656a3d637ba3" ``` @@ -210,11 +226,16 @@ Wire source adapters into the existing internal extension framework: Output: registry implementation, package discovery tests, and compatibility notes for `markitect-filter`. +Implemented: `SourceAdapterRegistry`, `discover_source_adapters`, and +`default_source_adapter_registry` discover descriptors through +`markitect_tool.source_adapters`, expose source adapter descriptors through the +extension catalog, and report missing optional dependencies deterministically. + ## P18.5 - Normalization CLI and public API surface ```task id: MKTT-WP-0018-T005 -status: todo +status: done priority: medium state_hub_task_id: "c6233bd1-0403-498b-a6ed-c1874b172aa3" ``` @@ -231,11 +252,16 @@ Expose a small CLI/API surface: Output: CLI commands, API exports, generated command/API docs updates, and tests. 
+Implemented: `mkt source adapters`, `mkt source inspect`, and +`mkt source normalize` expose JSON/YAML/text/Markdown behavior. Public API +exports were added to `markitect_tool.__all__`, and generated CLI/API docs were +refreshed. + ## P18.6 - Contract tests and fake adapter fixture ```task id: MKTT-WP-0018-T006 -status: todo +status: done priority: high state_hub_task_id: "263d0351-2942-4c2a-b333-b3aa96f2b8e3" ``` @@ -252,11 +278,16 @@ Add deterministic contract tests proving that an external read adapter can: Output: fake adapter fixture, reusable contract-test helpers, and documentation for `markitect-filter` adapter implementers. +Implemented: `tests/test_source_adapter_contract.py` provides fake adapter +coverage for model serialization, cache keys, registry selection, entry point +discovery, dependency diagnostics, CLI JSON/Markdown envelopes, and public API +exports. `examples/source-adapters/` contains expected-output fixtures. + ## P18.7 - Cross-repo migration notes for infospace-bench and markitect-filter ```task id: MKTT-WP-0018-T007 -status: todo +status: done priority: medium state_hub_task_id: "dfc81c61-f7ca-4266-8908-56b221101fd4" ``` @@ -272,6 +303,10 @@ Document how the new contract affects sibling repos: Output: migration note and follow-up workplan seeds for `markitect-filter` and `infospace-bench`. +Implemented: `docs/source-adapter-migration.md` documents the migration path +for `markitect-filter`, `infospace-bench`, and `kontextual-engine`, including +follow-up workplan seeds. + ## Acceptance - `markitect-tool` exposes a stable source adapter protocol and canonical