source adapter framework

2026-05-14 22:05:34 +02:00
parent f8f20c7c32
commit eb34c0d4fb
17 changed files with 1924 additions and 15 deletions

@@ -10,6 +10,8 @@ Generated from `markitect_tool.__all__`.
- `EMPTY_PARSE_OPTIONS_HASH` - object. str(object='') -> str
- `EXPLODE_MANIFEST_NAME` - object. str(object='') -> str
- `LOCAL_INDEX_SCHEMA_VERSION` - object. str(object='') -> str
- `NORMALIZED_SOURCE_SCHEMA_VERSION` - object. str(object='') -> str
- `SOURCE_ADAPTER_ENTRY_POINT_GROUP` - object. str(object='') -> str
## `markitect_tool.backend.engine`
@@ -339,6 +341,31 @@ Generated from `markitect_tool.__all__`.
- `validate_markdown_file(markdown_path: 'str | Path', schema_path: 'str | Path') -> 'SchemaValidationResult'` - function. Parse and validate a Markdown file against a Markdown schema file.
- `validate_schema(schema: 'dict[str, Any]') -> 'SchemaValidationResult'` - function. Validate that a JSON Schema itself is well formed.
## `markitect_tool.source.engine`
- `NormalizationQuality(lossiness: 'str', confidence: 'float | None' = None, skipped_items: 'int | None' = None, warnings: 'int | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Summary of extraction lossiness and confidence.
- `NormalizedMarkdownDocument(document_id: 'str', asset: 'SourceAsset', metadata: 'SourceMetadata', markdown: 'str', segments: 'list[NormalizedMarkdownSegment]', quality: 'NormalizationQuality', adapter: 'dict[str, Any]', cache_key: 'str', schema_version: 'str' = 'markitect.source.v1', diagnostics: 'list[Diagnostic]' = <factory>, provenance: 'list[SourceProvenance]' = <factory>, attachments: 'list[SourceAsset]' = <factory>) -> None` - class. Canonical source-normalized Markdown document.
- `NormalizedMarkdownSegment(segment_id: 'str', order: 'int', markdown: 'str', heading: 'str | None' = None, heading_level: 'int | None' = None, anchors: 'list[str]' = <factory>, provenance: 'list[SourceProvenance]' = <factory>, metadata: 'dict[str, Any]' = <factory>) -> None` - class. One ordered normalized Markdown segment.
- `SourceAdapterDescriptor(id: 'str', version: 'str', name: 'str', operations: 'list[str]', media_types: 'list[str]', extensions: 'list[str]', factory: 'SourceAdapterFactory', summary: 'str | None' = None, option_schema: 'dict[str, Any]' = <factory>, optional_dependencies: 'list[OptionalDependency]' = <factory>, safety: 'dict[str, Any]' = <factory>, quality_profile: 'dict[str, Any]' = <factory>, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Inspectable descriptor for one source read adapter.
- `SourceAdapterError` - class. Raised when source adapter descriptors or registries are invalid.
- `SourceAdapterMatch(adapter_id: 'str', matched: 'bool', confidence: 'int' = 0, reason: 'str | None' = None, diagnostics: 'list[Diagnostic]' = <factory>) -> None` - class. Result of an adapter match attempt.
- `SourceAdapterMatchRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Cheap adapter matching request.
- `SourceAdapterRegistry(descriptors: 'Iterable[SourceAdapterDescriptor] | None' = None) -> 'None'` - class. Registry of source adapter descriptors.
- `SourceAsset(uri: 'str', path: 'str | None' = None, name: 'str | None' = None, media_type: 'str | None' = None, extension: 'str | None' = None, size: 'int | None' = None, mtime_ns: 'int | None' = None, digest: 'str | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Identity and filesystem metadata for one source asset.
- `SourceInspectRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Metadata-only source inspection request.
- `SourceInspectResult(asset: 'SourceAsset', adapter: 'dict[str, Any]', metadata: 'SourceMetadata', quality: 'NormalizationQuality', capabilities: 'list[str]' = <factory>, diagnostics: 'list[Diagnostic]' = <factory>, valid: 'bool | None' = None) -> None` - class. Metadata-only source inspection result.
- `SourceMetadata(title: 'str | None' = None, creators: 'list[str]' = <factory>, language: 'str | None' = None, rights: 'str | None' = None, source_url: 'str | None' = None, publication_date: 'str | None' = None, publisher: 'str | None' = None, identifiers: 'dict[str, str]' = <factory>, raw: 'dict[str, Any]' = <factory>) -> None` - class. Format-provided descriptive metadata.
- `SourceProvenance(source_uri: 'str', source_path: 'str | None' = None, source_href: 'str | None' = None, package_path: 'str | None' = None, anchor: 'str | None' = None, page: 'str | None' = None, section: 'str | None' = None, start_offset: 'int | None' = None, end_offset: 'int | None' = None, digest: 'str | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Trace from normalized Markdown back to source locations.
- `SourceReadAdapter(*args, **kwargs)` - class. Read-only source adapter protocol.
- `SourceReadRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Full source normalization request.
- `SourceReadResult(document: 'NormalizedMarkdownDocument | None' = None, diagnostics: 'list[Diagnostic]' = <factory>, valid: 'bool | None' = None) -> None` - class. Full source normalization result.
- `default_source_adapter_registry() -> 'SourceAdapterRegistry'` - function. Return the discovered source adapter registry.
- `discover_source_adapters(entry_points: 'Iterable[Any] | None' = None) -> 'SourceAdapterRegistry'` - function. Discover source adapters from package entry points.
- `inspect_source(path_or_uri: 'str | Path', *, registry: 'SourceAdapterRegistry | None' = None, adapter_id: 'str | None' = None, options: 'dict[str, Any] | None' = None) -> 'SourceInspectResult'` - function. Inspect a local source through the selected adapter.
- `normalization_cache_key(*, asset: 'SourceAsset', adapter_id: 'str', adapter_version: 'str', options: 'dict[str, Any] | None' = None) -> 'str'` - function. Compute a stable normalization cache key.
- `normalize_source(path_or_uri: 'str | Path', *, registry: 'SourceAdapterRegistry | None' = None, adapter_id: 'str | None' = None, options: 'dict[str, Any] | None' = None) -> 'SourceReadResult'` - function. Normalize a local source through the selected adapter.
- `source_adapter_registry_descriptor() -> 'ExtensionDescriptor'` - function. Return the built-in descriptor for source adapter discovery.
## `markitect_tool.template.engine`
- `MissingTemplateVariable` - class. Raised when strict rendering cannot resolve a variable.

@@ -862,6 +862,57 @@ Parameters:
- `--policy-mode` - Override policy mode for this search.
- `--format` -
## `mkt source`
Inspect source-format adapters and normalize sources.
```text
source [OPTIONS] COMMAND [ARGS]...
```
## `mkt source adapters`
List discovered read-only source adapters.
```text
adapters [OPTIONS]
```
Parameters:
- `--format` -
## `mkt source inspect`
Inspect a local source without full Markdown conversion.
```text
inspect [OPTIONS] SOURCE_PATH
```
Parameters:
- `SOURCE_PATH` - Required.
- `--adapter` - Explicit source adapter id.
- `--option` - Adapter-specific option. May be repeated.
- `--format` -
## `mkt source normalize`
Normalize a local source into canonical Markdown.
```text
normalize [OPTIONS] SOURCE_PATH
```
Parameters:
- `SOURCE_PATH` - Required.
- `--adapter` - Explicit source adapter id.
- `--option` - Adapter-specific option. May be repeated.
- `--output` - Write normalized output to this file.
- `--format` -
## `mkt tangle`
Tangle named Markdown code chunks into target files.

@@ -41,6 +41,14 @@ mkt ref resolve context.md 'std:clauses.md#payment-terms' --root examples/refere
mkt process file.md --root .
```
## Normalize External Sources
```bash
mkt source adapters
mkt source inspect book.epub --adapter source.epub3 --format json
mkt source normalize book.epub --adapter source.epub3 --format markdown --output book.md
```
## Split And Literate Workflows
```bash
@@ -102,6 +110,7 @@ mkt context list
mkt completion bash --instructions
mkt extension list
mkt extension inspect backend.local-sqlite
mkt extension inspect source.adapter-registry
mkt extension commands
mkt docs cli --output docs/cli-reference.md
mkt docs api --output docs/api-reference.md

@@ -9,6 +9,10 @@ Use this for internal query engines, processors, backend/index stores,
reference providers, validators, template/generation adapters, CLI command
groups, render/export adapters, and future document functions.
Source-format adapters are external package extensions. Use
`docs/source-adapter-contract.md` for the source adapter protocol, entry point
group, descriptor shape, and contract-test expectations.
## Recommended Shape
Each extension should have:

@@ -170,8 +170,10 @@ markitect_tool/extensions/
Each module exposes one or more descriptors plus a registration function. The
root registry can be assembled explicitly at import time or by a small internal
-discovery list. Package entry points can be added later if external extension
-packages become a real requirement.
+discovery list. Source adapters are the first external package-discovery slice
+and use the `markitect_tool.source_adapters` entry point group defined in
+`docs/source-adapter-contract.md`; other extension kinds can adopt package
+entry points later if they become a real requirement.
See `docs/extension-authoring.md` for the extension authoring checklist and
descriptor template.

@@ -0,0 +1,119 @@
# Source Adapter Migration Notes
## Purpose
These notes describe how sibling repositories should consume the
`markitect-tool` source adapter contract implemented by `MKTT-WP-0018`.
The source adapter layer is deliberately split:
```text
external source files
-> markitect-filter concrete adapters
-> markitect-tool source adapter protocol and normalized Markdown model
-> infospace-bench workflows
-> optional kontextual-engine ingestion
```
## Markitect-tool
`markitect-tool` owns the stable contract:
- normalized source data model
- read-only source adapter protocol
- adapter registry and Python entry point discovery
- `mkt source` CLI commands
- public API helpers such as `inspect_source` and `normalize_source`
- fake adapter contract tests
It does not own EPUB3, PDF, DOCX, ODT, OCR, or browser extraction.
## Markitect-filter
`markitect-filter` should implement concrete adapters behind the entry point
group:
```toml
[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"
```
The first adapter should be:
```text
id: source.epub3
operations: read
media_types: application/epub+zip
extensions: .epub
```
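To make the role of `media_types` and `extensions` more concrete, here is a minimal sketch of how a cheap match step could use them. Everything in it is an assumption made for illustration: `DescriptorSketch` is a stand-in carrying only the four fields listed above, `match_confidence` is a hypothetical function, and the scores 100 and 50 are arbitrary. The actual matching rules are defined by `docs/source-adapter-contract.md`, not by this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class DescriptorSketch:
    """Hypothetical stand-in with only the fields shown in the text block."""

    id: str
    operations: tuple[str, ...]
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]


EPUB3 = DescriptorSketch(
    id="source.epub3",
    operations=("read",),
    media_types=("application/epub+zip",),
    extensions=(".epub",),
)


def match_confidence(
    descriptor: DescriptorSketch, name: str, media_type: str | None = None
) -> int:
    """Return an illustrative integer confidence; 0 means no match.

    A declared media type is treated as stronger evidence than a file
    extension. Both scores are assumptions chosen for this sketch.
    """
    if media_type is not None and media_type.lower() in descriptor.media_types:
        return 100
    if PurePosixPath(name).suffix.lower() in descriptor.extensions:
        return 50
    return 0
```

Matching on declared metadata alone keeps the step cheap: no file content is opened until an adapter has been selected.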
The EPUB3 adapter should satisfy the contract tests described in
`docs/source-adapter-contract.md` and add EPUB-specific fixtures for container,
OPF, spine, nav, body XHTML, malformed package structure, skipped assets, and
lossy extraction diagnostics.
## Infospace-bench
`infospace-bench` should replace its local EPUB intake spike with the public
source adapter API:
```python
from markitect_tool import normalize_source

result = normalize_source("source.epub")
if not result.is_valid:
    # Surface adapter diagnostics instead of continuing with a partial result.
    raise RuntimeError(result.to_dict()["diagnostics"])

markdown = result.document.markdown
segments = result.document.segments
```
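Once segments are available, a workflow can work from segment metadata alone. The sketch below builds an indented outline from segment headings. `SegmentSketch` is a local stand-in, not the real class; its field names (`segment_id`, `order`, `markdown`, `heading`, `heading_level`) follow the `NormalizedMarkdownSegment` signature in the API reference for this commit. The `outline` helper and the fallback to level 1 are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SegmentSketch:
    """Stand-in with a subset of the NormalizedMarkdownSegment fields."""

    segment_id: str
    order: int
    markdown: str
    heading: str | None = None
    heading_level: int | None = None


def outline(segments: list[SegmentSketch]) -> list[str]:
    """Return headings in reading order, indented by heading level.

    Segments without a heading are skipped; a missing level is treated
    as level 1 (an assumption of this sketch).
    """
    lines: list[str] = []
    for segment in sorted(segments, key=lambda s: s.order):
        if segment.heading is None:
            continue
        level = segment.heading_level or 1
        lines.append("  " * (level - 1) + segment.heading)
    return lines
```

Nothing here touches EPUB packaging, which is the point of consuming the normalized model rather than format internals.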
Application workflows should consume normalized Markdown and segment metadata.
They should not depend on EPUB package internals, spine parsing, XHTML
extraction, or boilerplate classification directly.
## Kontextual-engine
`kontextual-engine` can treat normalized source outputs as ingestible derived
knowledge assets when it needs durable ingestion. The durable layer should
persist policy, indexing, retrieval, permissions, audit, and lifecycle state.
It should not require source-format dependencies inside `markitect-tool`.
Recommended ingestion boundary:
- keep `NormalizedMarkdownDocument.to_dict()` as the portable derivative
- preserve source asset digest and normalization cache key
- record adapter ID, adapter version, and read options
- preserve document and segment provenance
- store diagnostics and quality signals for human review
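
One possible shape for such an ingestion record is sketched below. It is not a `kontextual-engine` or `markitect-tool` API: `IngestionRecordSketch`, its field names, and the `record_id` helper are hypothetical, and the `document` field is assumed to hold the output of `NormalizedMarkdownDocument.to_dict()`. Serializing with sorted keys keeps the record stable regardless of the order in which read options were supplied.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class IngestionRecordSketch:
    """Hypothetical durable record covering the boundary items listed above."""

    document: dict[str, Any]  # assumed: NormalizedMarkdownDocument.to_dict()
    asset_digest: str
    cache_key: str
    adapter_id: str
    adapter_version: str
    read_options: dict[str, Any] = field(default_factory=dict)
    diagnostics: list[dict[str, Any]] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize deterministically (sorted keys, compact separators)."""
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))

    def record_id(self) -> str:
        """Return a SHA-256 hex digest of the deterministic JSON form."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()
```

Because provenance travels inside the document payload, the record does not need separate provenance fields in this sketch.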
## Follow-up Workplan Seeds
Recommended `markitect-filter` workplan:
```text
MKTF-WP-0001: EPUB3 Read Adapter
Implement source.epub3 against docs/source-adapter-contract.md:
- package scaffold and pyproject entry point
- optional EPUB dependencies
- META-INF/container.xml parsing
- OPF metadata and spine reading order
- nav/chapter label extraction
- body XHTML to normalized Markdown segments
- explicit boilerplate skip policy
- malformed/unsupported/lossy diagnostics
- contract tests using markitect-tool fake adapter expectations
```
Recommended `infospace-bench` workplan:
```text
ISB-WP-source-adapter-intake
Replace local EPUB source intake with markitect-tool normalize_source:
- install markitect-filter[epub3] in the relevant environment
- call normalize_source for source documents
- consume NormalizedMarkdownDocument markdown and segments
- remove app-local EPUB package parsing
- preserve source diagnostics in benchmark review artifacts
```

@@ -43,7 +43,7 @@ and descriptions mirror the operational view.
| `MKTT-WP-0008` | complete | done | `MKTT-WP-0006`, `MKTT-WP-0007`, `MKTT-WP-0009` | Agent working-memory context cache is complete: context package schema, local registry, package creation from queries/search/manifests, deterministic summaries, namespaces, activation/deactivation/refresh/explain lifecycle, policy re-checks, CLI, docs, and examples. |
| `MKTT-WP-0017` | complete | done | `MKTT-WP-0003`, `MKTT-WP-0013` | CLI/API polish and practical adoption track is complete: shell completion, extension discovery, generated CLI/API docs, usecase relevance matrix, E2E fixture matrix, large-corpus smoke coverage, first-use docs, examples index, and command cheat sheet. |
| `MKTT-WP-0019` | complete | done | `MKTT-WP-0013`, `MKTT-WP-0017` | Source adapter contract refinement is complete: v1 read-only scope, normalized model fields, package entry point discovery, CLI/API envelopes, fake adapter fixtures, and `markitect-filter` EPUB3 handoff are pinned in `docs/source-adapter-contract.md`. |
-| `MKTT-WP-0018` | P0 | active | `MKTT-WP-0013`, `MKTT-WP-0017`, `MKTT-WP-0019` | Source adapter framework implementation is the current track: implement `docs/source-adapter-contract.md`, keeping format extraction in `markitect-filter` and the base install free of heavyweight conversion dependencies. |
+| `MKTT-WP-0018` | complete | done | `MKTT-WP-0013`, `MKTT-WP-0017`, `MKTT-WP-0019` | Source adapter framework implementation is complete: read-only models, protocol, registry, entry point discovery, extension descriptors, CLI/API, fake adapter fixtures, migration notes, and tests are in place. |
| `MKTT-WP-0015` | P2 | todo | `MKTT-WP-0010`, `MKTT-WP-0011`, `MKTT-WP-0012` | Future render and document-function extensions: typed values, richer syntax, document-local reusable functions, Quarkdown/export adapters, render-aware references, assets, and permission sandboxing. Defer unless publishing/export pressure becomes current. |
| `MKTT-WP-0016` | P2 | todo | `MKTT-WP-0008`, `MKTT-WP-0007`, `MKTT-WP-0009`, `MKTT-WP-0013` | Follow-on agentic memory architecture: reasoning decision graphs, conversational paths, long-term knowledge graphs, memory service blueprints/profiles, graph-to-context-package compilation, and adapter boundaries. |
@@ -132,9 +132,11 @@ protocol behavior, CLI/API envelopes, fake adapter fixtures, and
v1 contract is read-only; writer/export adapters belong in later
format-specific work once preservation semantics are explicit.
-`MKTT-WP-0018` is now the current source-adapter implementation track. It
-should implement the pinned contract directly rather than reopening v1 model,
-entry point, protocol, or CLI/API decisions.
+`MKTT-WP-0018` completed the source-adapter implementation track. The v1
+read-only contract now has public models, protocol types, registry/discovery,
+extension descriptors, `mkt source` commands, API exports, fake adapter
+fixtures, and sibling-repo migration notes. Concrete EPUB3 extraction remains
+`markitect-filter` scope.
## State Hub Mirror