Commit Graph

4 Commits

Author SHA1 Message Date
120ed89780 fix(proxy): catch markitdown missing-dependency errors with clean hint
When markitdown is installed but a format-specific sub-dependency is
missing (e.g. pdfminer-six for PDF), translate the raw traceback into
a DependencyMissingError with the correct install command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 21:00:51 +01:00
9fa239c140 fix(proxy): register markitdown extractor unconditionally
Always register MarkitdownExtractor so it overrides specialized extractors
for all its extensions. When markitdown-no-magika is not installed, users
now see the correct install hint instead of the old pymupdf4llm message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 20:52:07 +01:00
e4fbba8a57 feat(proxy): add markitdown as default proxy backend
Uses markitdown-no-magika (lighter fork without magika/onnxruntime) to
handle PDF, HTML, DOCX, PPTX, XLSX, XLS, CSV, JSON, and XML files.
Specialized extractors (pymupdf4llm, markdownify) remain as fallbacks
when markitdown is not installed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 20:48:47 +01:00
ac334c679d feat(proxy): add proxy file system for non-markdown source conversion
Introduces a new `markitect/proxy/` module with pluggable extractors that
convert non-markdown sources (PDF, HTML) into tracked markdown proxy files.
Proxy files preserve origin metadata (path, checksum, timestamp) so they
can be kept in sync when the original changes.

CLI commands: `proxy create`, `proxy update`, `proxy status`, `proxy extractors`.
Built-in extractors: PDF (pymupdf4llm), HTML (markdownify), Markdown (built-in).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 19:06:09 +01:00