source adapter framework

2026-05-14 22:05:34 +02:00
parent f8f20c7c32
commit eb34c0d4fb
17 changed files with 1924 additions and 15 deletions


@@ -19,6 +19,7 @@ requirements documents in `wiki/`.
- `docs/command-cheatsheet.md` - command-oriented workflow cheat sheet
- `docs/examples-index.md` - map from examples to usecases and commands
- `docs/source-adapter-contract.md` - v1 source adapter contract for external format adapters
- `docs/source-adapter-migration.md` - sibling-repo handoff for source adapter adoption
- `docs/performance-notes.md` - local performance posture and smoke coverage
- `docs/cli-reference.md` - generated `mkt` command reference
- `docs/api-reference.md` - generated public API reference


@@ -10,6 +10,8 @@ Generated from `markitect_tool.__all__`.
- `EMPTY_PARSE_OPTIONS_HASH` - object. str(object='') -> str
- `EXPLODE_MANIFEST_NAME` - object. str(object='') -> str
- `LOCAL_INDEX_SCHEMA_VERSION` - object. str(object='') -> str
- `NORMALIZED_SOURCE_SCHEMA_VERSION` - object. str(object='') -> str
- `SOURCE_ADAPTER_ENTRY_POINT_GROUP` - object. str(object='') -> str
## `markitect_tool.backend.engine`
@@ -339,6 +341,31 @@ Generated from `markitect_tool.__all__`.
- `validate_markdown_file(markdown_path: 'str | Path', schema_path: 'str | Path') -> 'SchemaValidationResult'` - function. Parse and validate a Markdown file against a Markdown schema file.
- `validate_schema(schema: 'dict[str, Any]') -> 'SchemaValidationResult'` - function. Validate that a JSON Schema itself is well formed.
## `markitect_tool.source.engine`
- `NormalizationQuality(lossiness: 'str', confidence: 'float | None' = None, skipped_items: 'int | None' = None, warnings: 'int | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Summary of extraction lossiness and confidence.
- `NormalizedMarkdownDocument(document_id: 'str', asset: 'SourceAsset', metadata: 'SourceMetadata', markdown: 'str', segments: 'list[NormalizedMarkdownSegment]', quality: 'NormalizationQuality', adapter: 'dict[str, Any]', cache_key: 'str', schema_version: 'str' = 'markitect.source.v1', diagnostics: 'list[Diagnostic]' = <factory>, provenance: 'list[SourceProvenance]' = <factory>, attachments: 'list[SourceAsset]' = <factory>) -> None` - class. Canonical source-normalized Markdown document.
- `NormalizedMarkdownSegment(segment_id: 'str', order: 'int', markdown: 'str', heading: 'str | None' = None, heading_level: 'int | None' = None, anchors: 'list[str]' = <factory>, provenance: 'list[SourceProvenance]' = <factory>, metadata: 'dict[str, Any]' = <factory>) -> None` - class. One ordered normalized Markdown segment.
- `SourceAdapterDescriptor(id: 'str', version: 'str', name: 'str', operations: 'list[str]', media_types: 'list[str]', extensions: 'list[str]', factory: 'SourceAdapterFactory', summary: 'str | None' = None, option_schema: 'dict[str, Any]' = <factory>, optional_dependencies: 'list[OptionalDependency]' = <factory>, safety: 'dict[str, Any]' = <factory>, quality_profile: 'dict[str, Any]' = <factory>, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Inspectable descriptor for one source read adapter.
- `SourceAdapterError` - class. Raised when source adapter descriptors or registries are invalid.
- `SourceAdapterMatch(adapter_id: 'str', matched: 'bool', confidence: 'int' = 0, reason: 'str | None' = None, diagnostics: 'list[Diagnostic]' = <factory>) -> None` - class. Result of an adapter match attempt.
- `SourceAdapterMatchRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Cheap adapter matching request.
- `SourceAdapterRegistry(descriptors: 'Iterable[SourceAdapterDescriptor] | None' = None) -> 'None'` - class. Registry of source adapter descriptors.
- `SourceAsset(uri: 'str', path: 'str | None' = None, name: 'str | None' = None, media_type: 'str | None' = None, extension: 'str | None' = None, size: 'int | None' = None, mtime_ns: 'int | None' = None, digest: 'str | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Identity and filesystem metadata for one source asset.
- `SourceInspectRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Metadata-only source inspection request.
- `SourceInspectResult(asset: 'SourceAsset', adapter: 'dict[str, Any]', metadata: 'SourceMetadata', quality: 'NormalizationQuality', capabilities: 'list[str]' = <factory>, diagnostics: 'list[Diagnostic]' = <factory>, valid: 'bool | None' = None) -> None` - class. Metadata-only source inspection result.
- `SourceMetadata(title: 'str | None' = None, creators: 'list[str]' = <factory>, language: 'str | None' = None, rights: 'str | None' = None, source_url: 'str | None' = None, publication_date: 'str | None' = None, publisher: 'str | None' = None, identifiers: 'dict[str, str]' = <factory>, raw: 'dict[str, Any]' = <factory>) -> None` - class. Format-provided descriptive metadata.
- `SourceProvenance(source_uri: 'str', source_path: 'str | None' = None, source_href: 'str | None' = None, package_path: 'str | None' = None, anchor: 'str | None' = None, page: 'str | None' = None, section: 'str | None' = None, start_offset: 'int | None' = None, end_offset: 'int | None' = None, digest: 'str | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Trace from normalized Markdown back to source locations.
- `SourceReadAdapter(*args, **kwargs)` - class. Read-only source adapter protocol.
- `SourceReadRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Full source normalization request.
- `SourceReadResult(document: 'NormalizedMarkdownDocument | None' = None, diagnostics: 'list[Diagnostic]' = <factory>, valid: 'bool | None' = None) -> None` - class. Full source normalization result.
- `default_source_adapter_registry() -> 'SourceAdapterRegistry'` - function. Return the discovered source adapter registry.
- `discover_source_adapters(entry_points: 'Iterable[Any] | None' = None) -> 'SourceAdapterRegistry'` - function. Discover source adapters from package entry points.
- `inspect_source(path_or_uri: 'str | Path', *, registry: 'SourceAdapterRegistry | None' = None, adapter_id: 'str | None' = None, options: 'dict[str, Any] | None' = None) -> 'SourceInspectResult'` - function. Inspect a local source through the selected adapter.
- `normalization_cache_key(*, asset: 'SourceAsset', adapter_id: 'str', adapter_version: 'str', options: 'dict[str, Any] | None' = None) -> 'str'` - function. Compute a stable normalization cache key.
- `normalize_source(path_or_uri: 'str | Path', *, registry: 'SourceAdapterRegistry | None' = None, adapter_id: 'str | None' = None, options: 'dict[str, Any] | None' = None) -> 'SourceReadResult'` - function. Normalize a local source through the selected adapter.
- `source_adapter_registry_descriptor() -> 'ExtensionDescriptor'` - function. Return the built-in descriptor for source adapter discovery.
## `markitect_tool.template.engine`
- `MissingTemplateVariable` - class. Raised when strict rendering cannot resolve a variable.


@@ -862,6 +862,57 @@ Parameters:
- `--policy-mode` - Override policy mode for this search.
- `--format` -
## `mkt source`
Inspect source-format adapters and normalize sources.
```text
source [OPTIONS] COMMAND [ARGS]...
```
## `mkt source adapters`
List discovered read-only source adapters.
```text
adapters [OPTIONS]
```
Parameters:
- `--format` -
## `mkt source inspect`
Inspect a local source without full Markdown conversion.
```text
inspect [OPTIONS] SOURCE_PATH
```
Parameters:
- `SOURCE_PATH` - Required.
- `--adapter` - Explicit source adapter id.
- `--option` - Adapter-specific option. May be repeated.
- `--format` -
## `mkt source normalize`
Normalize a local source into canonical Markdown.
```text
normalize [OPTIONS] SOURCE_PATH
```
Parameters:
- `SOURCE_PATH` - Required.
- `--adapter` - Explicit source adapter id.
- `--option` - Adapter-specific option. May be repeated.
- `--output` - Write normalized output to this file.
- `--format` -
## `mkt tangle`
Tangle named Markdown code chunks into target files.


@@ -41,6 +41,14 @@ mkt ref resolve context.md 'std:clauses.md#payment-terms' --root examples/refere
mkt process file.md --root .
```
## Normalize External Sources
```bash
mkt source adapters
mkt source inspect book.epub --adapter source.epub3 --format json
mkt source normalize book.epub --adapter source.epub3 --format markdown --output book.md
```
## Split And Literate Workflows
```bash
@@ -102,6 +110,7 @@ mkt context list
mkt completion bash --instructions
mkt extension list
mkt extension inspect backend.local-sqlite
mkt extension inspect source.adapter-registry
mkt extension commands
mkt docs cli --output docs/cli-reference.md
mkt docs api --output docs/api-reference.md


@@ -9,6 +9,10 @@ Use this for internal query engines, processors, backend/index stores,
reference providers, validators, template/generation adapters, CLI command
groups, render/export adapters, and future document functions.
Source-format adapters are external package extensions. Use
`docs/source-adapter-contract.md` for the source adapter protocol, entry point
group, descriptor shape, and contract-test expectations.
## Recommended Shape
Each extension should have:


@@ -170,8 +170,10 @@ markitect_tool/extensions/
Each module exposes one or more descriptors plus a registration function. The
root registry can be assembled explicitly at import time or by a small internal
-discovery list. Package entry points can be added later if external extension
-packages become a real requirement.
+discovery list. Source adapters are the first external package-discovery slice
+and use the `markitect_tool.source_adapters` entry point group defined in
+`docs/source-adapter-contract.md`; other extension kinds can adopt package
+entry points later if they become a real requirement.
See `docs/extension-authoring.md` for the extension authoring checklist and
descriptor template.


@@ -0,0 +1,119 @@
# Source Adapter Migration Notes
## Purpose
These notes describe how sibling repositories should consume the
`markitect-tool` source adapter contract implemented by `MKTT-WP-0018`.
The source adapter layer is deliberately split:
```text
external source files
-> markitect-filter concrete adapters
-> markitect-tool source adapter protocol and normalized Markdown model
-> infospace-bench workflows
-> optional kontextual-engine ingestion
```
## Markitect-tool
`markitect-tool` owns the stable contract:
- normalized source data model
- read-only source adapter protocol
- adapter registry and Python entry point discovery
- `mkt source` CLI commands
- public API helpers such as `inspect_source` and `normalize_source`
- fake adapter contract tests
It does not own EPUB3, PDF, DOCX, ODT, OCR, or browser extraction.
## Markitect-filter
`markitect-filter` should implement concrete adapters behind the entry point
group:
```toml
[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"
```
The first adapter should be:
```text
id: source.epub3
operations: read
media_types: application/epub+zip
extensions: .epub
```
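To make the declared fields concrete, here is a minimal sketch of how such a declaration could drive a cheap pre-match on file extension or media type. `DescriptorSketch` and its `claims` method are hypothetical stand-ins written for this note; the real `SourceAdapterDescriptor` carries more fields (version, name, factory, safety, and so on) and its matching behaviour is defined by the contract, not by this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class DescriptorSketch:
    """Hypothetical stand-in, not the real SourceAdapterDescriptor."""

    id: str
    operations: tuple[str, ...]
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]

    def claims(self, filename: str, media_type: str | None = None) -> bool:
        """Cheap pre-match using only the declared extension or media type."""
        suffix = PurePath(filename).suffix.lower()
        if suffix in self.extensions:
            return True
        return media_type is not None and media_type in self.media_types


# Values copied from the adapter declaration above.
EPUB3 = DescriptorSketch(
    id="source.epub3",
    operations=("read",),
    media_types=("application/epub+zip",),
    extensions=(".epub",),
)
```

A check like this only narrows candidates; whether a file is a well-formed EPUB3 package is something the adapter itself has to establish when it inspects or reads the source.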
The EPUB3 adapter should satisfy the contract tests described in
`docs/source-adapter-contract.md` and add EPUB-specific fixtures for container,
OPF, spine, nav, body XHTML, malformed package structure, skipped assets, and
lossy extraction diagnostics.
## Infospace-bench
`infospace-bench` should replace its local EPUB intake spike with the public
source adapter API:
```python
from markitect_tool import normalize_source

result = normalize_source("source.epub")
if not result.is_valid:
    raise RuntimeError(result.to_dict()["diagnostics"])

markdown = result.document.markdown
segments = result.document.segments
```
Application workflows should consume normalized Markdown and segment metadata.
They should not depend on EPUB package internals, spine parsing, XHTML
extraction, or boilerplate classification directly.
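A small sketch of what consuming segment metadata can look like, working on plain dictionaries rather than adapter internals. It assumes the serialized segments use keys that mirror the `NormalizedMarkdownSegment` field names (`segment_id`, `order`, `heading`, `heading_level`); the `outline` helper is hypothetical and not part of the public API.

```python
from __future__ import annotations

from typing import Any


def outline(segments: list[dict[str, Any]]) -> list[str]:
    """Render an indented outline from segment dicts, in reading order.

    Hypothetical helper; assumes keys mirror NormalizedMarkdownSegment fields.
    """
    lines: list[str] = []
    for segment in sorted(segments, key=lambda item: item["order"]):
        heading = segment.get("heading")
        if not heading:
            continue  # segments without a heading contribute no outline entry
        level = max(segment.get("heading_level") or 1, 1)
        indent = "  " * (level - 1)
        lines.append(f"{indent}- {heading} ({segment['segment_id']})")
    return lines


# Example data shaped like two serialized segments, deliberately out of order.
SEGMENTS = [
    {"segment_id": "seg-0002", "order": 1, "heading": "Second Segment", "heading_level": 2},
    {"segment_id": "seg-0001", "order": 0, "heading": "Fake Source", "heading_level": 1},
]

if __name__ == "__main__":
    print("\n".join(outline(SEGMENTS)))
```

Nothing in this helper knows about EPUB spines or XHTML, which is the point of the boundary: the application sees ordered Markdown segments and their metadata only.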
## Kontextual-engine
`kontextual-engine` can treat normalized source outputs as ingestible derived
knowledge assets when it needs durable ingestion. The durable layer should
persist policy, indexing, retrieval, permissions, audit, and lifecycle state.
It should not require source-format dependencies inside `markitect-tool`.
Recommended ingestion boundary:
- keep `NormalizedMarkdownDocument.to_dict()` as the portable derivative
- preserve source asset digest and normalization cache key
- record adapter ID, adapter version, and read options
- preserve document and segment provenance
- store diagnostics and quality signals for human review
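
The boundary above could be sketched as a single record assembled from the serialized document. The record layout and the `ingestion_record` helper are hypothetical, and the input keys are assumed to mirror the `NormalizedMarkdownDocument` field names; the actual `to_dict()` shape is defined in `docs/source-adapter-contract.md`.

```python
from __future__ import annotations

from typing import Any


def ingestion_record(document: dict[str, Any]) -> dict[str, Any]:
    """Collect the fields an ingester is asked to preserve.

    Hypothetical layout; `document` is assumed to be shaped like the output of
    NormalizedMarkdownDocument.to_dict().
    """
    adapter = document.get("adapter", {})
    asset = document.get("asset", {})
    return {
        "derivative": document,  # the portable derivative, kept whole
        "asset_digest": asset.get("digest"),
        "cache_key": document.get("cache_key"),
        "adapter": {
            "id": adapter.get("id"),
            "version": adapter.get("version"),
            "options": adapter.get("options", {}),
        },
        "provenance": {
            "document": document.get("provenance", []),
            "segments": {
                segment["segment_id"]: segment.get("provenance", [])
                for segment in document.get("segments", [])
            },
        },
        "review": {
            "diagnostics": document.get("diagnostics", []),
            "quality": document.get("quality", {}),
        },
    }
```

Keeping the whole derivative next to the extracted fields is a deliberate redundancy in this sketch: the extracted fields are convenient for indexing, while the untouched derivative remains the source of truth.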
## Follow-up Workplan Seeds
Recommended `markitect-filter` workplan:
```text
MKTF-WP-0001: EPUB3 Read Adapter
Implement source.epub3 against docs/source-adapter-contract.md:
- package scaffold and pyproject entry point
- optional EPUB dependencies
- META-INF/container.xml parsing
- OPF metadata and spine reading order
- nav/chapter label extraction
- body XHTML to normalized Markdown segments
- explicit boilerplate skip policy
- malformed/unsupported/lossy diagnostics
- contract tests using markitect-tool fake adapter expectations
```
Recommended `infospace-bench` workplan:
```text
ISB-WP-source-adapter-intake
Replace local EPUB source intake with markitect-tool normalize_source:
- install markitect-filter[epub3] in the relevant environment
- call normalize_source for source documents
- consume NormalizedMarkdownDocument markdown and segments
- remove app-local EPUB package parsing
- preserve source diagnostics in benchmark review artifacts
```


@@ -43,7 +43,7 @@ and descriptions mirror the operational view.
| `MKTT-WP-0008` | complete | done | `MKTT-WP-0006`, `MKTT-WP-0007`, `MKTT-WP-0009` | Agent working-memory context cache is complete: context package schema, local registry, package creation from queries/search/manifests, deterministic summaries, namespaces, activation/deactivation/refresh/explain lifecycle, policy re-checks, CLI, docs, and examples. |
| `MKTT-WP-0017` | complete | done | `MKTT-WP-0003`, `MKTT-WP-0013` | CLI/API polish and practical adoption track is complete: shell completion, extension discovery, generated CLI/API docs, usecase relevance matrix, E2E fixture matrix, large-corpus smoke coverage, first-use docs, examples index, and command cheat sheet. |
| `MKTT-WP-0019` | complete | done | `MKTT-WP-0013`, `MKTT-WP-0017` | Source adapter contract refinement is complete: v1 read-only scope, normalized model fields, package entry point discovery, CLI/API envelopes, fake adapter fixtures, and `markitect-filter` EPUB3 handoff are pinned in `docs/source-adapter-contract.md`. |
-| `MKTT-WP-0018` | P0 | active | `MKTT-WP-0013`, `MKTT-WP-0017`, `MKTT-WP-0019` | Source adapter framework implementation is the current track: implement `docs/source-adapter-contract.md`, keeping format extraction in `markitect-filter` and the base install free of heavyweight conversion dependencies. |
+| `MKTT-WP-0018` | complete | done | `MKTT-WP-0013`, `MKTT-WP-0017`, `MKTT-WP-0019` | Source adapter framework implementation is complete: read-only models, protocol, registry, entry point discovery, extension descriptors, CLI/API, fake adapter fixtures, migration notes, and tests are in place. |
| `MKTT-WP-0015` | P2 | todo | `MKTT-WP-0010`, `MKTT-WP-0011`, `MKTT-WP-0012` | Future render and document-function extensions: typed values, richer syntax, document-local reusable functions, Quarkdown/export adapters, render-aware references, assets, and permission sandboxing. Defer unless publishing/export pressure becomes current. |
| `MKTT-WP-0016` | P2 | todo | `MKTT-WP-0008`, `MKTT-WP-0007`, `MKTT-WP-0009`, `MKTT-WP-0013` | Follow-on agentic memory architecture: reasoning decision graphs, conversational paths, long-term knowledge graphs, memory service blueprints/profiles, graph-to-context-package compilation, and adapter boundaries. |
@@ -132,9 +132,11 @@ protocol behavior, CLI/API envelopes, fake adapter fixtures, and
v1 contract is read-only; writer/export adapters belong in later
format-specific work once preservation semantics are explicit.
-`MKTT-WP-0018` is now the current source-adapter implementation track. It
-should implement the pinned contract directly rather than reopening v1 model,
-entry point, protocol, or CLI/API decisions.
+`MKTT-WP-0018` completed the source-adapter implementation track. The v1
+read-only contract now has public models, protocol types, registry/discovery,
+extension descriptors, `mkt source` commands, API exports, fake adapter
+fixtures, and sibling-repo migration notes. Concrete EPUB3 extraction remains
+`markitect-filter` scope.
## State Hub Mirror


@@ -262,6 +262,32 @@ from markitect_tool.schema import (
validate_markdown_file,
validate_schema,
)
from markitect_tool.source import (
NORMALIZED_SOURCE_SCHEMA_VERSION,
SOURCE_ADAPTER_ENTRY_POINT_GROUP,
NormalizationQuality,
NormalizedMarkdownDocument,
NormalizedMarkdownSegment,
SourceAdapterDescriptor,
SourceAdapterError,
SourceAdapterMatch,
SourceAdapterMatchRequest,
SourceAdapterRegistry,
SourceAsset,
SourceInspectRequest,
SourceInspectResult,
SourceMetadata,
SourceProvenance,
SourceReadAdapter,
SourceReadRequest,
SourceReadResult,
default_source_adapter_registry,
discover_source_adapters,
inspect_source,
normalization_cache_key,
normalize_source,
source_adapter_registry_descriptor,
)
from markitect_tool.template import (
MissingTemplateVariable,
TemplateAnalysis,
@@ -295,6 +321,30 @@ __all__ = [
"validate_document",
"validate_markdown_file",
"validate_schema",
"NORMALIZED_SOURCE_SCHEMA_VERSION",
"SOURCE_ADAPTER_ENTRY_POINT_GROUP",
"NormalizationQuality",
"NormalizedMarkdownDocument",
"NormalizedMarkdownSegment",
"SourceAdapterDescriptor",
"SourceAdapterError",
"SourceAdapterMatch",
"SourceAdapterMatchRequest",
"SourceAdapterRegistry",
"SourceAsset",
"SourceInspectRequest",
"SourceInspectResult",
"SourceMetadata",
"SourceProvenance",
"SourceReadAdapter",
"SourceReadRequest",
"SourceReadResult",
"default_source_adapter_registry",
"discover_source_adapters",
"inspect_source",
"normalization_cache_key",
"normalize_source",
"source_adapter_registry_descriptor",
"ContractCheckResult",
"ContractValidationResult",
"DocumentContract",


@@ -99,6 +99,11 @@ from markitect_tool.reference import (
)
from markitect_tool.runtime import evaluate_form_state, load_runtime_context_file
from markitect_tool.schema import load_schema_file, validate_markdown_file, validate_schema
from markitect_tool.source import (
default_source_adapter_registry,
inspect_source,
normalize_source,
)
from markitect_tool.template import (
MissingTemplateVariable,
TemplateError,
@@ -197,6 +202,123 @@ def extension_commands(output_format: str) -> None:
_emit_extension_catalog({"count": len(specs), "commands": specs}, output_format)
@main.group("source")
def source_group() -> None:
    """Inspect source-format adapters and normalize sources."""


@source_group.command("adapters")
@click.option(
    "--format",
    "output_format",
    type=click.Choice(["json", "yaml", "text"], case_sensitive=False),
    default="text",
    show_default=True,
)
def source_adapters(output_format: str) -> None:
    """List discovered read-only source adapters."""
    _emit_source_adapters(default_source_adapter_registry().to_dict(), output_format)


@source_group.command("inspect")
@click.argument("source_path", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.option("--adapter", "adapter_id", help="Explicit source adapter id.")
@click.option(
    "--option",
    "option_values",
    multiple=True,
    metavar="KEY=VALUE",
    help="Adapter-specific option. May be repeated.",
)
@click.option(
    "--format",
    "output_format",
    type=click.Choice(["json", "yaml", "text"], case_sensitive=False),
    default="json",
    show_default=True,
)
def source_inspect(
    source_path: Path,
    adapter_id: str | None,
    option_values: tuple[str, ...],
    output_format: str,
) -> None:
    """Inspect a local source without full Markdown conversion."""
    try:
        result = inspect_source(
            source_path,
            registry=default_source_adapter_registry(),
            adapter_id=adapter_id,
            options=_parse_key_value_options(option_values),
        )
    except ValueError as exc:
        raise click.ClickException(str(exc)) from exc
    _emit_source_inspect(result.to_dict(), output_format)
    raise click.exceptions.Exit(0 if result.is_valid else 1)


@source_group.command("normalize")
@click.argument("source_path", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.option("--adapter", "adapter_id", help="Explicit source adapter id.")
@click.option(
    "--option",
    "option_values",
    multiple=True,
    metavar="KEY=VALUE",
    help="Adapter-specific option. May be repeated.",
)
@click.option(
    "--output",
    type=click.Path(dir_okay=False, path_type=Path),
    help="Write normalized output to this file.",
)
@click.option(
    "--format",
    "output_format",
    type=click.Choice(["markdown", "json", "yaml"], case_sensitive=False),
    default="markdown",
    show_default=True,
)
def source_normalize(
    source_path: Path,
    adapter_id: str | None,
    option_values: tuple[str, ...],
    output: Path | None,
    output_format: str,
) -> None:
    """Normalize a local source into canonical Markdown."""
    try:
        result = normalize_source(
            source_path,
            registry=default_source_adapter_registry(),
            adapter_id=adapter_id,
            options=_parse_key_value_options(option_values),
        )
    except ValueError as exc:
        raise click.ClickException(str(exc)) from exc
    data = result.to_dict()
    if output_format == "markdown":
        if not result.is_valid or result.document is None:
            for diagnostic in data.get("diagnostics", []):
                click.echo(
                    f"[{diagnostic['severity']}] {diagnostic['code']}: "
                    f"{diagnostic['message']}",
                    err=True,
                )
            raise click.exceptions.Exit(1)
        markdown = result.document.markdown
        if output:
            output.write_text(markdown, encoding="utf-8")
        else:
            click.echo(markdown, nl=False)
    else:
        _emit_jsonish(data, output_format)
    raise click.exceptions.Exit(0 if result.is_valid else 1)


@main.group("docs")
def docs_group() -> None:
    """Generate CLI and API reference documentation."""
@@ -2892,6 +3014,46 @@ def _emit_extension_catalog(data: dict, output_format: str) -> None:
click.echo(f"- {extension['id']} ({extension['kind']})")
def _emit_source_adapters(data: dict, output_format: str) -> None:
    if output_format == "json":
        click.echo(json.dumps(data, indent=2, ensure_ascii=False))
    elif output_format == "yaml":
        click.echo(yaml.safe_dump(data, sort_keys=False))
    else:
        click.echo(f"adapters: {data.get('count', 0)}")
        for adapter in data.get("adapters", []):
            operations = ", ".join(adapter.get("operations", []))
            extensions = ", ".join(adapter.get("extensions", []))
            click.echo(f"- {adapter['id']} [{operations}] {extensions}")
            check = adapter.get("dependency_check", {})
            if check.get("missing"):
                click.echo(" missing: " + ", ".join(check["missing"]))


def _emit_source_inspect(data: dict, output_format: str) -> None:
    if output_format == "json":
        click.echo(json.dumps(data, indent=2, ensure_ascii=False))
    elif output_format == "yaml":
        click.echo(yaml.safe_dump(data, sort_keys=False))
    else:
        click.echo("valid" if data.get("valid") else "invalid")
        asset = data.get("asset", {})
        adapter = data.get("adapter", {})
        metadata = data.get("metadata", {})
        click.echo(f"source: {asset.get('path') or asset.get('uri', '<unknown>')}")
        if adapter.get("id"):
            click.echo(f"adapter: {adapter['id']}")
        if metadata.get("title"):
            click.echo(f"title: {metadata['title']}")
        if metadata.get("creators"):
            click.echo("creators: " + ", ".join(metadata["creators"]))
        quality = data.get("quality", {})
        if quality.get("lossiness"):
            click.echo(f"lossiness: {quality['lossiness']}")
        for diagnostic in data.get("diagnostics", []):
            click.echo(f"! [{diagnostic['severity']}] {diagnostic['code']}: {diagnostic['message']}")


def _emit_jsonish(data: dict, output_format: str) -> None:
    if output_format == "yaml":
        click.echo(yaml.safe_dump(data, sort_keys=False))


@@ -5,6 +5,10 @@ from __future__ import annotations
from markitect_tool.extension.registry import ExtensionDescriptor, ExtensionRegistry
from markitect_tool.extension.processing import ProcessingCapability
from markitect_tool.query import default_query_engine_registry
from markitect_tool.source import (
default_source_adapter_registry,
source_adapter_registry_descriptor,
)
def builtin_extension_registry() -> ExtensionRegistry:
@@ -22,8 +26,11 @@ def builtin_extension_registry() -> ExtensionRegistry:
        _local_label_policy_descriptor(),
        _document_function_descriptor(),
        _agent_memory_descriptor(),
        source_adapter_registry_descriptor(),
    ]:
        registry.register(descriptor)
    for descriptor in default_source_adapter_registry().extension_descriptors():
        registry.register(descriptor)
    return registry


@@ -0,0 +1,55 @@
"""Source adapter contracts and normalization helpers."""
from markitect_tool.source.engine import (
SOURCE_ADAPTER_ENTRY_POINT_GROUP,
NORMALIZED_SOURCE_SCHEMA_VERSION,
NormalizationQuality,
NormalizedMarkdownDocument,
NormalizedMarkdownSegment,
SourceAdapterDescriptor,
SourceAdapterError,
SourceAdapterMatch,
SourceAdapterMatchRequest,
SourceAdapterRegistry,
SourceAsset,
SourceInspectRequest,
SourceInspectResult,
SourceMetadata,
SourceProvenance,
SourceReadAdapter,
SourceReadRequest,
SourceReadResult,
default_source_adapter_registry,
discover_source_adapters,
inspect_source,
normalization_cache_key,
normalize_source,
source_adapter_registry_descriptor,
)
__all__ = [
"SOURCE_ADAPTER_ENTRY_POINT_GROUP",
"NORMALIZED_SOURCE_SCHEMA_VERSION",
"NormalizationQuality",
"NormalizedMarkdownDocument",
"NormalizedMarkdownSegment",
"SourceAdapterDescriptor",
"SourceAdapterError",
"SourceAdapterMatch",
"SourceAdapterMatchRequest",
"SourceAdapterRegistry",
"SourceAsset",
"SourceInspectRequest",
"SourceInspectResult",
"SourceMetadata",
"SourceProvenance",
"SourceReadAdapter",
"SourceReadRequest",
"SourceReadResult",
"default_source_adapter_registry",
"discover_source_adapters",
"inspect_source",
"normalization_cache_key",
"normalize_source",
"source_adapter_registry_descriptor",
]

File diff suppressed because it is too large


@@ -22,6 +22,7 @@ def test_builtin_extension_registry_lists_query_processors_and_backend():
assert "policy.local-label" in ids
assert "document.function" in ids
assert "memory.context-package" in ids
assert "source.adapter-registry" in ids
def test_builtin_meta_descriptors_expose_extension_and_docs_accesspoints():


@@ -43,6 +43,7 @@ def test_mkt_docs_cli_generates_command_reference():
assert result.exit_code == 0, result.output
assert "# Markitect CLI Reference" in result.output
assert "## `mkt extension commands`" in result.output
assert "## `mkt source normalize`" in result.output
assert "## `mkt docs api`" in result.output
@@ -54,6 +55,7 @@ def test_mkt_docs_api_generates_public_api_reference():
assert "query_document_jsonpath" in result.output
assert "ExtensionDescriptor" in result.output
assert "LocalSnapshotStore" in result.output
assert "SourceAdapterRegistry" in result.output
def test_top_level_api_exports_newer_architecture_surfaces():
@@ -63,3 +65,5 @@ def test_top_level_api_exports_newer_architecture_surfaces():
assert api.ExtensionDescriptor
assert api.builtin_extension_registry
assert api.validate_schema
assert api.SourceAdapterRegistry
assert api.normalize_source


@@ -0,0 +1,380 @@
import importlib
import json
from pathlib import Path
from click.testing import CliRunner
import markitect_tool as api
from markitect_tool.diagnostics import Diagnostic
from markitect_tool.extension import OptionalDependency, builtin_extension_registry
from markitect_tool.source import (
NORMALIZED_SOURCE_SCHEMA_VERSION,
NormalizationQuality,
NormalizedMarkdownDocument,
NormalizedMarkdownSegment,
SourceAdapterDescriptor,
SourceAdapterMatch,
SourceAdapterMatchRequest,
SourceAdapterRegistry,
SourceAsset,
SourceInspectRequest,
SourceInspectResult,
SourceMetadata,
SourceProvenance,
SourceReadRequest,
SourceReadResult,
discover_source_adapters,
inspect_source,
normalization_cache_key,
normalize_source,
)
SAMPLE_SOURCE = Path("examples/source-adapters/sample.fake")
NORMALIZED_MARKDOWN = (
"# Fake Source\n\n"
"A small normalized segment.\n\n"
"## Second Segment\n\n"
"Another deterministic segment."
)
class FakeSourceAdapter:
    def __init__(self, descriptor: SourceAdapterDescriptor, *, confidence: int = 80) -> None:
        self.descriptor = descriptor
        self.confidence = confidence

    def can_read(self, request: SourceAdapterMatchRequest) -> SourceAdapterMatch:
        return SourceAdapterMatch(
            adapter_id=self.descriptor.id,
            matched=request.asset.extension == ".fake",
            confidence=self.confidence,
            reason="extension",
        )

    def inspect(self, request: SourceInspectRequest) -> SourceInspectResult:
        return SourceInspectResult(
            asset=request.asset,
            adapter={"id": self.descriptor.id, "version": self.descriptor.version, "options": request.options},
            metadata=_source_metadata(),
            capabilities=["read"],
            quality=NormalizationQuality(lossiness="none", confidence=1.0),
        )

    def read(self, request: SourceReadRequest) -> SourceReadResult:
        asset = request.asset
        provenance = [
            SourceProvenance(
                source_uri=asset.uri,
                source_path=asset.path,
                digest=asset.digest,
            )
        ]
        segments = [
            NormalizedMarkdownSegment(
                segment_id="seg-0001",
                order=0,
                heading="Fake Source",
                heading_level=1,
                markdown="# Fake Source\n\nA small normalized segment.",
                anchors=["fake-source"],
                provenance=[
                    SourceProvenance(
                        source_uri=asset.uri,
                        source_path=asset.path,
                        anchor="fake-source",
                        section="Fake Source",
                    )
                ],
            ),
            NormalizedMarkdownSegment(
                segment_id="seg-0002",
                order=1,
                heading="Second Segment",
                heading_level=2,
                markdown="## Second Segment\n\nAnother deterministic segment.",
                anchors=["second-segment"],
                provenance=[
                    SourceProvenance(
                        source_uri=asset.uri,
                        source_path=asset.path,
                        anchor="second-segment",
                        section="Second Segment",
                    )
                ],
            ),
        ]
        cache_key = normalization_cache_key(
            asset=asset,
            adapter_id=self.descriptor.id,
            adapter_version=self.descriptor.version,
            options=request.options,
        )
        document = NormalizedMarkdownDocument(
            document_id=f"{self.descriptor.id}:fake-source-001",
            asset=asset,
            metadata=_source_metadata(),
            markdown=NORMALIZED_MARKDOWN,
            segments=segments,
            quality=NormalizationQuality(lossiness="none", confidence=1.0, skipped_items=0, warnings=0),
            provenance=provenance,
            adapter={"id": self.descriptor.id, "version": self.descriptor.version, "options": request.options},
            cache_key=cache_key,
        )
        return SourceReadResult(document=document)


def _source_metadata() -> SourceMetadata:
    return SourceMetadata(
        title="Fake Source",
        creators=["Markitect Fixture"],
        language="en",
        identifiers={"fixture": "fake-source-001"},
    )


def _fake_descriptor(adapter_id: str = "source.fake", *, confidence: int = 80) -> SourceAdapterDescriptor:
    descriptor = None

    def factory() -> FakeSourceAdapter:
        assert descriptor is not None
        return FakeSourceAdapter(descriptor, confidence=confidence)

    descriptor = SourceAdapterDescriptor(
        id=adapter_id,
        version="1",
        name="Fake Source Adapter",
        summary="Contract-test adapter for plain fixture sources.",
        operations=["read"],
        media_types=["text/x.markitect-fake"],
        extensions=[".fake"],
        factory=factory,
        safety={
            "reads_files": True,
            "writes_files": False,
            "network": False,
            "external_process": False,
        },
    )
    return descriptor


def test_normalized_document_serialization_round_trips():
    registry = SourceAdapterRegistry([_fake_descriptor()])
    result = normalize_source(SAMPLE_SOURCE, registry=registry)
    assert result.is_valid
    assert result.document is not None
    data = result.document.to_dict()
    round_trip = NormalizedMarkdownDocument.from_dict(data).to_dict()
    assert round_trip == data
    assert data["schema_version"] == NORMALIZED_SOURCE_SCHEMA_VERSION
    assert data["markdown"] == NORMALIZED_MARKDOWN
    assert data["segments"][0]["segment_id"] == "seg-0001"


def test_normalization_cache_key_is_deterministic():
    asset = SourceAsset(uri="sample.fake", path="sample.fake", digest="sha256:abc")
    first = normalization_cache_key(
        asset=asset,
        adapter_id="source.fake",
        adapter_version="1",
        options={"skip_boilerplate": True},
    )
    second = normalization_cache_key(
        asset=asset,
        adapter_id="source.fake",
        adapter_version="1",
        options={"skip_boilerplate": True},
    )
    assert first == second
    assert first.startswith("source-normalize:sha256:")


def test_source_registry_selects_fake_adapter_and_reports_unsupported():
    registry = SourceAdapterRegistry([_fake_descriptor()])
    asset = SourceAsset.from_path(SAMPLE_SOURCE)
    descriptor, adapter, diagnostics = registry.select(asset)
    assert descriptor is not None
    assert descriptor.id == "source.fake"
    assert adapter is not None
    assert diagnostics == []
    unsupported = SourceAsset(uri="example.bin", extension=".bin")
    descriptor, adapter, diagnostics = registry.select(unsupported)
    assert descriptor is None
    assert adapter is None
    assert diagnostics[0].code == "source.unsupported_format"


def test_source_registry_reports_missing_required_dependency():
    descriptor = SourceAdapterDescriptor(
        id="source.needs-missing",
        version="1",
        name="Missing Dependency Adapter",
        operations=["read"],
        media_types=[],
        extensions=[".fake"],
        factory=lambda: FakeSourceAdapter(_fake_descriptor("source.needs-missing")),
        optional_dependencies=[
            OptionalDependency(
                name="definitely_missing_markitect_source_adapter_dependency",
                package="missing-package",
                required=True,
            )
        ],
    )
    registry = SourceAdapterRegistry([descriptor])
    _, _, diagnostics = registry.select(SourceAsset.from_path(SAMPLE_SOURCE))
    assert diagnostics[0].code == "source.missing_dependency"
    assert "definitely_missing_markitect_source_adapter_dependency" in diagnostics[0].details["missing"]


def test_source_registry_breaks_ambiguous_matches_by_adapter_id():
    registry = SourceAdapterRegistry(
        [
            _fake_descriptor("source.b", confidence=80),
            _fake_descriptor("source.a", confidence=80),
        ]
    )
    descriptor, _, diagnostics = registry.select(SourceAsset.from_path(SAMPLE_SOURCE))
    assert descriptor is not None
    assert descriptor.id == "source.a"
    assert [diagnostic.code for diagnostic in diagnostics] == ["source.adapter_ambiguous"]


class FakeEntryPoint:
    name = "fake"

    def load(self):
        return _fake_descriptor()


def test_discover_source_adapters_accepts_entry_point_descriptors():
    registry = discover_source_adapters([FakeEntryPoint()])
    assert registry.get("source.fake").name == "Fake Source Adapter"


def test_source_descriptor_maps_to_extension_descriptor():
    extension = _fake_descriptor().to_extension_descriptor()
    assert extension.kind == "source-adapter"
    assert extension.input_contract == "SourceInspectRequest | SourceReadRequest"
    assert "mkt source normalize" in extension.cli["commands"]
    assert {capability.id for capability in extension.capabilities} >= {
        "source",
        "markdown",
        "diagnostics",
        "provenance",
    }


def test_builtin_registry_exposes_source_adapter_framework():
    registry = builtin_extension_registry()
    descriptor = registry.get("source.adapter-registry")
    assert descriptor.kind == "source-adapter-registry"
    assert descriptor.metadata["entry_point_group"] == "markitect_tool.source_adapters"
    assert "mkt source adapters" in descriptor.cli["commands"]


def test_inspect_and_normalize_source_api_use_injected_registry():
    registry = SourceAdapterRegistry([_fake_descriptor()])
    inspected = inspect_source(SAMPLE_SOURCE, registry=registry)
    normalized = normalize_source(SAMPLE_SOURCE, registry=registry)
    assert inspected.is_valid
    assert inspected.metadata.title == "Fake Source"
    assert normalized.is_valid
    assert normalized.document is not None
    assert normalized.document.markdown == NORMALIZED_MARKDOWN


def test_source_cli_uses_registry_and_emits_json(monkeypatch):
    cli_module = importlib.import_module("markitect_tool.cli.main")
    monkeypatch.setattr(
        cli_module,
        "default_source_adapter_registry",
        lambda: SourceAdapterRegistry([_fake_descriptor()]),
    )
    result = CliRunner().invoke(cli_module.main, ["source", "adapters", "--format", "json"])
    assert result.exit_code == 0, result.output
    data = json.loads(result.output)
    assert data["count"] == 1
    assert data["adapters"][0]["id"] == "source.fake"


def test_source_cli_inspect_and_normalize(monkeypatch):
    cli_module = importlib.import_module("markitect_tool.cli.main")
    monkeypatch.setattr(
        cli_module,
        "default_source_adapter_registry",
        lambda: SourceAdapterRegistry([_fake_descriptor()]),
    )
    runner = CliRunner()
    inspected = runner.invoke(
        cli_module.main,
        ["source", "inspect", str(SAMPLE_SOURCE), "--format", "json"],
    )
    normalized = runner.invoke(
        cli_module.main,
        ["source", "normalize", str(SAMPLE_SOURCE), "--format", "markdown"],
    )
    assert inspected.exit_code == 0, inspected.output
    assert json.loads(inspected.output)["metadata"]["title"] == "Fake Source"
    assert normalized.exit_code == 0, normalized.output
    assert normalized.output == NORMALIZED_MARKDOWN


def test_source_cli_markdown_output_suppresses_invalid_partial(monkeypatch):
    cli_module = importlib.import_module("markitect_tool.cli.main")
    monkeypatch.setattr(
        cli_module,
        "default_source_adapter_registry",
        lambda: SourceAdapterRegistry(),
    )
    result = CliRunner(mix_stderr=False).invoke(
        cli_module.main,
        ["source", "normalize", str(SAMPLE_SOURCE), "--format", "markdown"],
    )
    assert result.exit_code == 1
    assert result.output == ""
    assert "source.unsupported_format" in result.stderr


def test_source_examples_are_valid_json_fixtures():
    for path in [
        "examples/source-adapters/adapter-list.json",
        "examples/source-adapters/inspect-result.json",
        "examples/source-adapters/normalized-document.json",
    ]:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
        assert data


def test_top_level_api_exports_source_contract():
    assert api.SourceAsset
    assert api.SourceAdapterDescriptor
    assert api.SourceAdapterRegistry
    assert api.default_source_adapter_registry
    assert api.normalize_source
    assert api.SOURCE_ADAPTER_ENTRY_POINT_GROUP == "markitect_tool.source_adapters"


@@ -3,10 +3,10 @@ id: MKTT-WP-0018
type: workplan
title: "Source Adapter Interface And Markdown Normalization Contract"
domain: markitect
status: active
status: done
owner: markitect-tool
topic_slug: markitect
planning_priority: P0
planning_priority: complete
planning_order: 145
depends_on_workplans:
- MKTT-WP-0013
@@ -108,7 +108,7 @@ preservation contract exists.
```task
id: MKTT-WP-0018-T001
status: todo
status: done
priority: high
state_hub_task_id: "a5d05b2a-b9d8-43c6-9e52-5a77094b49d1"
```
@@ -124,11 +124,16 @@ Output: architecture note covering responsibilities, extension package shape,
the `docs/source-adapter-contract.md` entry point contract, dependency policy,
and migration path from the current `infospace-bench` EPUB spike.
Implemented: `docs/source-adapter-contract.md` defines the contract boundary,
external package shape, dependency policy, entry point group, and
`markitect-filter` EPUB3 handoff. `docs/source-adapter-migration.md` documents
the sibling-repo migration path.
## P18.2 - Canonical source-to-markdown data model
```task
id: MKTT-WP-0018-T002
status: todo
status: done
priority: high
state_hub_task_id: "f8164264-a9c1-4c82-8617-76bbb84a51bb"
```
@@ -160,11 +165,17 @@ Output: public data model, serialization tests using
`examples/source-adapters/normalized-document.json`, and normalization contract
documentation matching the field-level v1 specification.
Implemented: `markitect_tool.source` exposes `SourceAsset`, `SourceMetadata`,
`NormalizedMarkdownDocument`, `NormalizedMarkdownSegment`,
`SourceProvenance`, and `NormalizationQuality`, with stable dictionary
serialization, round-trip tests, digest/cache-key support, diagnostics, and
fixture coverage.
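
The round-trip property described above can be pictured with a much smaller stand-in model. `SketchDocument` and `SketchSegment` are hypothetical classes with only a few of the real fields; the actual `NormalizedMarkdownDocument` carries more (asset, metadata, quality, provenance, and so on), and its serialization may differ in detail.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class SketchSegment:
    """Hypothetical stand-in for a normalized Markdown segment."""

    segment_id: str
    markdown: str


@dataclass
class SketchDocument:
    """Hypothetical, reduced stand-in for the normalized document model."""

    document_id: str
    markdown: str
    segments: list[SketchSegment] = field(default_factory=list)
    schema_version: str = "markitect.source.v1"

    def to_dict(self) -> dict[str, Any]:
        # asdict() recurses into nested dataclasses and lists.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> SketchDocument:
        return cls(
            document_id=data["document_id"],
            markdown=data["markdown"],
            segments=[SketchSegment(**item) for item in data.get("segments", [])],
            schema_version=data.get("schema_version", "markitect.source.v1"),
        )


document = SketchDocument(
    document_id="source.fake:fake-source-001",
    markdown="# Fake Source\n",
    segments=[SketchSegment(segment_id="seg-0001", markdown="# Fake Source\n")],
)
data = document.to_dict()
round_trip = SketchDocument.from_dict(data).to_dict()
```

The contract test checks the same thing in spirit: serializing, deserializing, and serializing again should yield an identical dictionary.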
## P18.3 - Source adapter protocol and capability descriptors
```task
id: MKTT-WP-0018-T003
status: todo
status: done
priority: high
state_hub_task_id: "5036ff34-49f4-4900-9e90-95c4555b4ce9"
```
@@ -189,11 +200,16 @@ Concrete EPUB3 implementation belongs in `markitect-filter`.
Output: protocol module, descriptor integration, tests for matching,
inspection, reading, diagnostics, and unsupported-format behavior.
Implemented: `SourceReadAdapter`, request/result types,
`SourceAdapterDescriptor`, deterministic selection, dependency diagnostics,
unsupported-format diagnostics, and read-only capability descriptors live in
`markitect_tool.source`.
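
One way to read "deterministic selection" is sketched below. The tie-break by adapter id and the diagnostic codes mirror what the contract tests assert; ranking by descending confidence first is an assumption, and `Candidate` and `select_adapter` are hypothetical names rather than the registry's real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical match result: an adapter id plus its match confidence."""

    adapter_id: str
    confidence: int


def select_adapter(candidates):
    """Return (winner, diagnostic_codes) for a list of matching candidates."""
    if not candidates:
        return None, ["source.unsupported_format"]
    # Assumed ordering: highest confidence first, then adapter id ascending.
    ranked = sorted(candidates, key=lambda c: (-c.confidence, c.adapter_id))
    best = ranked[0]
    tied = [c for c in ranked if c.confidence == best.confidence]
    codes = ["source.adapter_ambiguous"] if len(tied) > 1 else []
    return best, codes


winner, codes = select_adapter([Candidate("source.b", 80), Candidate("source.a", 80)])
none_winner, none_codes = select_adapter([])
```

Sorting on a total key means the outcome does not depend on registration order, which is the property the ambiguity test relies on.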
## P18.4 - Adapter registry and discovery hooks
```task
id: MKTT-WP-0018-T004
status: todo
status: done
priority: high
state_hub_task_id: "391fb723-8990-4086-ac6c-656a3d637ba3"
```
@@ -210,11 +226,16 @@ Wire source adapters into the existing internal extension framework:
Output: registry implementation, package discovery tests, and compatibility
notes for `markitect-filter`.
Implemented: `SourceAdapterRegistry`, `discover_source_adapters`, and
`default_source_adapter_registry` discover descriptors through
`markitect_tool.source_adapters`, expose source adapter descriptors through the
extension catalog, and report missing optional dependencies deterministically.
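
A missing-dependency check of the kind described above could look roughly like the following. The diagnostic code and the `missing` detail key match the contract tests; the function name, the plain-dictionary diagnostic shape, and the use of `importlib.util.find_spec` are assumptions made for this sketch.

```python
from importlib.util import find_spec


def missing_dependency_diagnostic(required_modules):
    """Return a diagnostic dict if any top-level module is not importable, else None."""
    # find_spec() returns None for a top-level module that cannot be found.
    missing = sorted(name for name in required_modules if find_spec(name) is None)
    if not missing:
        return None
    return {
        "code": "source.missing_dependency",
        "details": {"missing": missing},
    }


diagnostic = missing_dependency_diagnostic(
    ["json", "definitely_missing_markitect_source_adapter_dependency"]
)
no_diagnostic = missing_dependency_diagnostic(["json"])
```

Sorting the missing names keeps the report stable across runs, and probing with `find_spec` avoids importing a heavy optional package just to learn whether it is installed.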
## P18.5 - Normalization CLI and public API surface
```task
id: MKTT-WP-0018-T005
status: todo
status: done
priority: medium
state_hub_task_id: "c6233bd1-0403-498b-a6ed-c1874b172aa3"
```
@@ -231,11 +252,16 @@ Expose a small CLI/API surface:
Output: CLI commands, API exports, generated command/API docs updates, and
tests.
Implemented: `mkt source adapters`, `mkt source inspect`, and
`mkt source normalize` expose JSON/YAML/text/Markdown behavior. Public API
exports were added to `markitect_tool.__all__`, and generated CLI/API docs were
refreshed.
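
The Markdown output mode has one behavior worth spelling out: when normalization fails, nothing is written to standard output and the diagnostics go to standard error with a non-zero exit code. The sketch below imitates that with plain streams. `emit_normalized` and the dictionary-shaped result are hypothetical, and writing the full envelope to stdout in JSON mode even on failure is an assumption.

```python
import io
import json


def emit_normalized(result: dict, output_format: str, stdout, stderr) -> int:
    """Write a normalize result to the given streams and return an exit code."""
    valid = result.get("is_valid", False)
    if output_format == "markdown":
        if valid:
            stdout.write(result["markdown"])
        else:
            # Never emit partial Markdown; report diagnostic codes on stderr.
            for code in result.get("diagnostics", []):
                stderr.write(f"{code}\n")
    else:
        # Assumed: structured formats always carry the full envelope.
        json.dump(result, stdout, indent=2, sort_keys=True)
        stdout.write("\n")
    return 0 if valid else 1


out, err = io.StringIO(), io.StringIO()
exit_code = emit_normalized(
    {"is_valid": False, "diagnostics": ["source.unsupported_format"]},
    "markdown",
    out,
    err,
)
```

Keeping stdout empty on failure means a shell pipeline that redirects Markdown into a file cannot silently capture a truncated document.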
## P18.6 - Contract tests and fake adapter fixture
```task
id: MKTT-WP-0018-T006
status: todo
status: done
priority: high
state_hub_task_id: "263d0351-2942-4c2a-b333-b3aa96f2b8e3"
```
@@ -252,11 +278,16 @@ Add deterministic contract tests proving that an external read adapter can:
Output: fake adapter fixture, reusable contract-test helpers, and documentation
for `markitect-filter` adapter implementers.
Implemented: `tests/test_source_adapter_contract.py` provides fake adapter
coverage for model serialization, cache keys, registry selection, entry point
discovery, dependency diagnostics, CLI JSON/Markdown envelopes, and public API
exports. `examples/source-adapters/` contains expected-output fixtures.
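
For adapter implementers wondering what a deterministic cache key involves, here is a minimal sketch. Only the `source-normalize:sha256:` prefix is taken from the contract tests; the payload layout, the choice of canonical JSON, and the name `sketch_cache_key` are assumptions and may not match `normalization_cache_key`.

```python
import hashlib
import json


def sketch_cache_key(asset_digest: str, adapter_id: str, adapter_version: str, options: dict) -> str:
    """Hash the inputs that affect normalization output into a stable key."""
    payload = json.dumps(
        {
            "asset": asset_digest,
            "adapter": adapter_id,
            "version": adapter_version,
            "options": options,
        },
        sort_keys=True,  # key order must not change the hash
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"source-normalize:sha256:{digest}"


first = sketch_cache_key("sha256:abc", "source.fake", "1", {"skip_boilerplate": True, "a": 1})
second = sketch_cache_key("sha256:abc", "source.fake", "1", {"a": 1, "skip_boilerplate": True})
bumped = sketch_cache_key("sha256:abc", "source.fake", "2", {"skip_boilerplate": True, "a": 1})
```

Including the adapter version and options in the hashed payload means a cached document is invalidated whenever either changes, not only when the source bytes do.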
## P18.7 - Cross-repo migration notes for infospace-bench and markitect-filter
```task
id: MKTT-WP-0018-T007
status: todo
status: done
priority: medium
state_hub_task_id: "dfc81c61-f7ca-4266-8908-56b221101fd4"
```
@@ -272,6 +303,10 @@ Document how the new contract affects sibling repos:
Output: migration note and follow-up workplan seeds for `markitect-filter` and
`infospace-bench`.
Implemented: `docs/source-adapter-migration.md` documents the migration path
for `markitect-filter`, `infospace-bench`, and `kontextual-engine`, including
follow-up workplan seeds.
## Acceptance
- `markitect-tool` exposes a stable source adapter protocol and canonical