generated from coulomb/repo-seed
source adapter framework
@@ -10,6 +10,8 @@ Generated from `markitect_tool.__all__`.
- `EMPTY_PARSE_OPTIONS_HASH` - object. str(object='') -> str
- `EXPLODE_MANIFEST_NAME` - object. str(object='') -> str
- `LOCAL_INDEX_SCHEMA_VERSION` - object. str(object='') -> str
- `NORMALIZED_SOURCE_SCHEMA_VERSION` - object. str(object='') -> str
- `SOURCE_ADAPTER_ENTRY_POINT_GROUP` - object. str(object='') -> str

## `markitect_tool.backend.engine`

@@ -339,6 +341,31 @@ Generated from `markitect_tool.__all__`.
- `validate_markdown_file(markdown_path: 'str | Path', schema_path: 'str | Path') -> 'SchemaValidationResult'` - function. Parse and validate a Markdown file against a Markdown schema file.
- `validate_schema(schema: 'dict[str, Any]') -> 'SchemaValidationResult'` - function. Validate that a JSON Schema itself is well formed.

## `markitect_tool.source.engine`

- `NormalizationQuality(lossiness: 'str', confidence: 'float | None' = None, skipped_items: 'int | None' = None, warnings: 'int | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Summary of extraction lossiness and confidence.
- `NormalizedMarkdownDocument(document_id: 'str', asset: 'SourceAsset', metadata: 'SourceMetadata', markdown: 'str', segments: 'list[NormalizedMarkdownSegment]', quality: 'NormalizationQuality', adapter: 'dict[str, Any]', cache_key: 'str', schema_version: 'str' = 'markitect.source.v1', diagnostics: 'list[Diagnostic]' = <factory>, provenance: 'list[SourceProvenance]' = <factory>, attachments: 'list[SourceAsset]' = <factory>) -> None` - class. Canonical source-normalized Markdown document.
- `NormalizedMarkdownSegment(segment_id: 'str', order: 'int', markdown: 'str', heading: 'str | None' = None, heading_level: 'int | None' = None, anchors: 'list[str]' = <factory>, provenance: 'list[SourceProvenance]' = <factory>, metadata: 'dict[str, Any]' = <factory>) -> None` - class. One ordered normalized Markdown segment.
- `SourceAdapterDescriptor(id: 'str', version: 'str', name: 'str', operations: 'list[str]', media_types: 'list[str]', extensions: 'list[str]', factory: 'SourceAdapterFactory', summary: 'str | None' = None, option_schema: 'dict[str, Any]' = <factory>, optional_dependencies: 'list[OptionalDependency]' = <factory>, safety: 'dict[str, Any]' = <factory>, quality_profile: 'dict[str, Any]' = <factory>, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Inspectable descriptor for one source read adapter.
- `SourceAdapterError` - class. Raised when source adapter descriptors or registries are invalid.
- `SourceAdapterMatch(adapter_id: 'str', matched: 'bool', confidence: 'int' = 0, reason: 'str | None' = None, diagnostics: 'list[Diagnostic]' = <factory>) -> None` - class. Result of an adapter match attempt.
- `SourceAdapterMatchRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Cheap adapter matching request.
- `SourceAdapterRegistry(descriptors: 'Iterable[SourceAdapterDescriptor] | None' = None) -> 'None'` - class. Registry of source adapter descriptors.
- `SourceAsset(uri: 'str', path: 'str | None' = None, name: 'str | None' = None, media_type: 'str | None' = None, extension: 'str | None' = None, size: 'int | None' = None, mtime_ns: 'int | None' = None, digest: 'str | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Identity and filesystem metadata for one source asset.
- `SourceInspectRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Metadata-only source inspection request.
- `SourceInspectResult(asset: 'SourceAsset', adapter: 'dict[str, Any]', metadata: 'SourceMetadata', quality: 'NormalizationQuality', capabilities: 'list[str]' = <factory>, diagnostics: 'list[Diagnostic]' = <factory>, valid: 'bool | None' = None) -> None` - class. Metadata-only source inspection result.
- `SourceMetadata(title: 'str | None' = None, creators: 'list[str]' = <factory>, language: 'str | None' = None, rights: 'str | None' = None, source_url: 'str | None' = None, publication_date: 'str | None' = None, publisher: 'str | None' = None, identifiers: 'dict[str, str]' = <factory>, raw: 'dict[str, Any]' = <factory>) -> None` - class. Format-provided descriptive metadata.
- `SourceProvenance(source_uri: 'str', source_path: 'str | None' = None, source_href: 'str | None' = None, package_path: 'str | None' = None, anchor: 'str | None' = None, page: 'str | None' = None, section: 'str | None' = None, start_offset: 'int | None' = None, end_offset: 'int | None' = None, digest: 'str | None' = None, metadata: 'dict[str, Any]' = <factory>) -> None` - class. Trace from normalized Markdown back to source locations.
- `SourceReadAdapter(*args, **kwargs)` - class. Read-only source adapter protocol.
- `SourceReadRequest(asset: 'SourceAsset', options: 'dict[str, Any]' = <factory>) -> None` - class. Full source normalization request.
- `SourceReadResult(document: 'NormalizedMarkdownDocument | None' = None, diagnostics: 'list[Diagnostic]' = <factory>, valid: 'bool | None' = None) -> None` - class. Full source normalization result.
- `default_source_adapter_registry() -> 'SourceAdapterRegistry'` - function. Return the discovered source adapter registry.
- `discover_source_adapters(entry_points: 'Iterable[Any] | None' = None) -> 'SourceAdapterRegistry'` - function. Discover source adapters from package entry points.
- `inspect_source(path_or_uri: 'str | Path', *, registry: 'SourceAdapterRegistry | None' = None, adapter_id: 'str | None' = None, options: 'dict[str, Any] | None' = None) -> 'SourceInspectResult'` - function. Inspect a local source through the selected adapter.
- `normalization_cache_key(*, asset: 'SourceAsset', adapter_id: 'str', adapter_version: 'str', options: 'dict[str, Any] | None' = None) -> 'str'` - function. Compute a stable normalization cache key.
- `normalize_source(path_or_uri: 'str | Path', *, registry: 'SourceAdapterRegistry | None' = None, adapter_id: 'str | None' = None, options: 'dict[str, Any] | None' = None) -> 'SourceReadResult'` - function. Normalize a local source through the selected adapter.
- `source_adapter_registry_descriptor() -> 'ExtensionDescriptor'` - function. Return the built-in descriptor for source adapter discovery.

## `markitect_tool.template.engine`

- `MissingTemplateVariable` - class. Raised when strict rendering cannot resolve a variable.
@@ -862,6 +862,57 @@ Parameters:
- `--policy-mode` - Override policy mode for this search.
- `--format` -

## `mkt source`

Inspect source-format adapters and normalize sources.

```text
source [OPTIONS] COMMAND [ARGS]...
```

## `mkt source adapters`

List discovered read-only source adapters.

```text
adapters [OPTIONS]
```

Parameters:

- `--format` -

## `mkt source inspect`

Inspect a local source without full Markdown conversion.

```text
inspect [OPTIONS] SOURCE_PATH
```

Parameters:

- `SOURCE_PATH` - Required.
- `--adapter` - Explicit source adapter id.
- `--option` - Adapter-specific option. May be repeated.
- `--format` -

## `mkt source normalize`

Normalize a local source into canonical Markdown.

```text
normalize [OPTIONS] SOURCE_PATH
```

Parameters:

- `SOURCE_PATH` - Required.
- `--adapter` - Explicit source adapter id.
- `--option` - Adapter-specific option. May be repeated.
- `--output` - Write normalized output to this file.
- `--format` -

## `mkt tangle`

Tangle named Markdown code chunks into target files.

@@ -41,6 +41,14 @@ mkt ref resolve context.md 'std:clauses.md#payment-terms' --root examples/refere
mkt process file.md --root .
```

## Normalize External Sources

```bash
mkt source adapters
mkt source inspect book.epub --adapter source.epub3 --format json
mkt source normalize book.epub --adapter source.epub3 --format markdown --output book.md
```

## Split And Literate Workflows

```bash
@@ -102,6 +110,7 @@ mkt context list
mkt completion bash --instructions
mkt extension list
mkt extension inspect backend.local-sqlite
mkt extension inspect source.adapter-registry
mkt extension commands
mkt docs cli --output docs/cli-reference.md
mkt docs api --output docs/api-reference.md
@@ -9,6 +9,10 @@ Use this for internal query engines, processors, backend/index stores,
reference providers, validators, template/generation adapters, CLI command
groups, render/export adapters, and future document functions.

Source-format adapters are external package extensions. Use
`docs/source-adapter-contract.md` for the source adapter protocol, entry point
group, descriptor shape, and contract-test expectations.

## Recommended Shape

Each extension should have:

@@ -170,8 +170,10 @@ markitect_tool/extensions/

Each module exposes one or more descriptors plus a registration function. The
root registry can be assembled explicitly at import time or by a small internal
-discovery list. Package entry points can be added later if external extension
-packages become a real requirement.
+discovery list. Source adapters are the first external package-discovery slice
+and use the `markitect_tool.source_adapters` entry point group defined in
+`docs/source-adapter-contract.md`; other extension kinds can adopt package
+entry points later if they become a real requirement.

See `docs/extension-authoring.md` for the extension authoring checklist and
descriptor template.
119
docs/source-adapter-migration.md
Normal file
@@ -0,0 +1,119 @@
# Source Adapter Migration Notes

## Purpose

These notes describe how sibling repositories should consume the
`markitect-tool` source adapter contract implemented by `MKTT-WP-0018`.

The source adapter layer is deliberately split:

```text
external source files
-> markitect-filter concrete adapters
-> markitect-tool source adapter protocol and normalized Markdown model
-> infospace-bench workflows
-> optional kontextual-engine ingestion
```

## Markitect-tool

`markitect-tool` owns the stable contract:

- normalized source data model
- read-only source adapter protocol
- adapter registry and Python entry point discovery
- `mkt source` CLI commands
- public API helpers such as `inspect_source` and `normalize_source`
- fake adapter contract tests

It does not own EPUB3, PDF, DOCX, ODT, OCR, or browser extraction.

## Markitect-filter

`markitect-filter` should implement concrete adapters behind the entry point
group:

```toml
[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"
```

The first adapter should be:

```text
id: source.epub3
operations: read
media_types: application/epub+zip
extensions: .epub
```
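
The descriptor values above are what a cheap match step would compare against.
The following is a rough sketch under stated assumptions: `Asset` and `Match`
are stand-ins that borrow a few field names from `SourceAsset` and
`SourceAdapterMatch`, `match_epub3` is a hypothetical function, and the
confidence values 90 and 60 are arbitrary illustrations (the API reference
only states that confidence is an `int`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Asset:
    """Stand-in for SourceAsset, reduced to the fields used for matching."""
    uri: str
    media_type: Optional[str] = None
    extension: Optional[str] = None


@dataclass
class Match:
    """Stand-in for SourceAdapterMatch."""
    adapter_id: str
    matched: bool
    confidence: int = 0
    reason: Optional[str] = None


EPUB3_ID = "source.epub3"
EPUB3_MEDIA_TYPES = {"application/epub+zip"}
EPUB3_EXTENSIONS = {".epub"}


def match_epub3(asset: Asset) -> Match:
    # A declared media type is treated as stronger evidence than an extension.
    if asset.media_type in EPUB3_MEDIA_TYPES:
        return Match(EPUB3_ID, True, confidence=90, reason="media type")
    if (asset.extension or "").lower() in EPUB3_EXTENSIONS:
        return Match(EPUB3_ID, True, confidence=60, reason="extension")
    return Match(EPUB3_ID, False, reason="no media type or extension match")


by_extension = match_epub3(Asset("file:///book.epub", extension=".EPUB"))
by_media_type = match_epub3(Asset("file:///book", media_type="application/epub+zip"))
no_match = match_epub3(Asset("file:///notes.txt", extension=".txt"))
print(by_extension.matched, by_media_type.confidence, no_match.matched)
```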

The EPUB3 adapter should satisfy the contract tests described in
`docs/source-adapter-contract.md` and add EPUB-specific fixtures for container,
OPF, spine, nav, body XHTML, malformed package structure, skipped assets, and
lossy extraction diagnostics.

## Infospace-bench

`infospace-bench` should replace its local EPUB intake spike with the public
source adapter API:

```python
from markitect_tool import normalize_source

result = normalize_source("source.epub")
if not result.is_valid:
    raise RuntimeError(result.to_dict()["diagnostics"])
markdown = result.document.markdown
segments = result.document.segments
```

Application workflows should consume normalized Markdown and segment metadata.
They should not depend on EPUB package internals, spine parsing, XHTML
extraction, or boilerplate classification directly.
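
As an illustration of consuming segment metadata only, a workflow could derive
a reading-order outline without touching any EPUB structure. This is a sketch,
not part of the contract: `Segment` is a stand-in carrying a subset of the
`NormalizedMarkdownSegment` fields, `outline` is a hypothetical helper, and the
sample values are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Stand-in for NormalizedMarkdownSegment with only the fields used here."""
    segment_id: str
    order: int
    markdown: str
    heading: Optional[str] = None
    heading_level: Optional[int] = None


def outline(segments: List[Segment]) -> List[str]:
    """Return an indented heading outline in reading order."""
    lines = []
    for segment in sorted(segments, key=lambda s: s.order):
        if segment.heading is None:
            continue  # untitled segments contribute no outline entry
        level = segment.heading_level or 1
        lines.append("  " * (level - 1) + segment.heading)
    return lines


sample_segments = [
    Segment("s2", 1, "## Part One\n\nText.", heading="Part One", heading_level=2),
    Segment("s1", 0, "# Title\n\nIntro.", heading="Title", heading_level=1),
    Segment("s3", 2, "Unlabelled text."),
]
print("\n".join(outline(sample_segments)))
```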

## Kontextual-engine

`kontextual-engine` can treat normalized source outputs as ingestible derived
knowledge assets when it needs durable ingestion. The durable layer should
persist policy, indexing, retrieval, permissions, audit, and lifecycle state.
It should not require source-format dependencies inside `markitect-tool`.

Recommended ingestion boundary:

- keep `NormalizedMarkdownDocument.to_dict()` as the portable derivative
- preserve source asset digest and normalization cache key
- record adapter ID, adapter version, and read options
- preserve document and segment provenance
- store diagnostics and quality signals for human review
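
The boundary above could be captured as a small envelope around the portable
derivative. This is a sketch under assumptions, not a `kontextual-engine`
schema: `ingestion_record` is a hypothetical function, the `to_dict()` payload
is assumed to use keys that mirror the constructor fields, the adapter
dictionary is assumed to carry `id` and `version`, and every sample value
(including the `skip_boilerplate` option) is invented.

```python
import json
from typing import Any, Dict


def ingestion_record(document: Dict[str, Any], read_options: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical envelope around a NormalizedMarkdownDocument.to_dict() payload."""
    adapter = document.get("adapter", {})
    segments = document.get("segments", [])
    return {
        "derivative": document,  # portable derivative, stored unchanged
        "source_digest": document.get("asset", {}).get("digest"),
        "cache_key": document.get("cache_key"),
        "adapter_id": adapter.get("id"),
        "adapter_version": adapter.get("version"),
        "read_options": read_options,
        "provenance": {
            "document": document.get("provenance", []),
            "segments": {s["segment_id"]: s.get("provenance", []) for s in segments},
        },
        "diagnostics": document.get("diagnostics", []),
        "quality": document.get("quality", {}),
    }


# Invented sample payload for illustration only.
sample = {
    "document_id": "doc-1",
    "asset": {"uri": "file:///book.epub", "digest": "sha256:abc"},
    "adapter": {"id": "source.epub3", "version": "0.1.0"},
    "cache_key": "key-1",
    "segments": [{"segment_id": "s1", "provenance": [{"source_uri": "file:///book.epub"}]}],
    "diagnostics": [],
    "quality": {"lossiness": "low"},
}
record = ingestion_record(sample, {"skip_boilerplate": True})
print(json.dumps(record["provenance"], indent=2))
```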

## Follow-up Workplan Seeds

Recommended `markitect-filter` workplan:

```text
MKTF-WP-0001: EPUB3 Read Adapter

Implement source.epub3 against docs/source-adapter-contract.md:
- package scaffold and pyproject entry point
- optional EPUB dependencies
- META-INF/container.xml parsing
- OPF metadata and spine reading order
- nav/chapter label extraction
- body XHTML to normalized Markdown segments
- explicit boilerplate skip policy
- malformed/unsupported/lossy diagnostics
- contract tests using markitect-tool fake adapter expectations
```

Recommended `infospace-bench` workplan:

```text
ISB-WP-source-adapter-intake

Replace local EPUB source intake with markitect-tool normalize_source:
- install markitect-filter[epub3] in the relevant environment
- call normalize_source for source documents
- consume NormalizedMarkdownDocument markdown and segments
- remove app-local EPUB package parsing
- preserve source diagnostics in benchmark review artifacts
```
@@ -43,7 +43,7 @@ and descriptions mirror the operational view.
| `MKTT-WP-0008` | complete | done | `MKTT-WP-0006`, `MKTT-WP-0007`, `MKTT-WP-0009` | Agent working-memory context cache is complete: context package schema, local registry, package creation from queries/search/manifests, deterministic summaries, namespaces, activation/deactivation/refresh/explain lifecycle, policy re-checks, CLI, docs, and examples. |
| `MKTT-WP-0017` | complete | done | `MKTT-WP-0003`, `MKTT-WP-0013` | CLI/API polish and practical adoption track is complete: shell completion, extension discovery, generated CLI/API docs, usecase relevance matrix, E2E fixture matrix, large-corpus smoke coverage, first-use docs, examples index, and command cheat sheet. |
| `MKTT-WP-0019` | complete | done | `MKTT-WP-0013`, `MKTT-WP-0017` | Source adapter contract refinement is complete: v1 read-only scope, normalized model fields, package entry point discovery, CLI/API envelopes, fake adapter fixtures, and `markitect-filter` EPUB3 handoff are pinned in `docs/source-adapter-contract.md`. |
-| `MKTT-WP-0018` | P0 | active | `MKTT-WP-0013`, `MKTT-WP-0017`, `MKTT-WP-0019` | Source adapter framework implementation is the current track: implement `docs/source-adapter-contract.md`, keeping format extraction in `markitect-filter` and the base install free of heavyweight conversion dependencies. |
+| `MKTT-WP-0018` | complete | done | `MKTT-WP-0013`, `MKTT-WP-0017`, `MKTT-WP-0019` | Source adapter framework implementation is complete: read-only models, protocol, registry, entry point discovery, extension descriptors, CLI/API, fake adapter fixtures, migration notes, and tests are in place. |
| `MKTT-WP-0015` | P2 | todo | `MKTT-WP-0010`, `MKTT-WP-0011`, `MKTT-WP-0012` | Future render and document-function extensions: typed values, richer syntax, document-local reusable functions, Quarkdown/export adapters, render-aware references, assets, and permission sandboxing. Defer unless publishing/export pressure becomes current. |
| `MKTT-WP-0016` | P2 | todo | `MKTT-WP-0008`, `MKTT-WP-0007`, `MKTT-WP-0009`, `MKTT-WP-0013` | Follow-on agentic memory architecture: reasoning decision graphs, conversational paths, long-term knowledge graphs, memory service blueprints/profiles, graph-to-context-package compilation, and adapter boundaries. |

@@ -132,9 +132,11 @@ protocol behavior, CLI/API envelopes, fake adapter fixtures, and
v1 contract is read-only; writer/export adapters belong in later
format-specific work once preservation semantics are explicit.

-`MKTT-WP-0018` is now the current source-adapter implementation track. It
-should implement the pinned contract directly rather than reopening v1 model,
-entry point, protocol, or CLI/API decisions.
+`MKTT-WP-0018` completed the source-adapter implementation track. The v1
+read-only contract now has public models, protocol types, registry/discovery,
+extension descriptors, `mkt source` commands, API exports, fake adapter
+fixtures, and sibling-repo migration notes. Concrete EPUB3 extraction remains
+`markitect-filter` scope.

## State Hub Mirror
