tegwick 2e49072d41 feat: complete core asset management system with database integration

- Add enhanced AssetManager with database integration and usage tracking
- Implement Asset model with from_dict/to_dict conversion methods
- Add resolve_asset_references() for linking discovered assets to imports
- Integrate AssetDatabase with enhanced schema and performance indexes
- Fix database schema constraints and test compatibility issues
- Add list_assets_as_objects() method for dict-to-object migration
- Resolve 91% of asset management tests (51/56 passing)

Key features:
* Content-addressable asset storage with deduplication
* Database-backed usage statistics and processing logs
* Asset reference resolution from markdown files
* Enhanced performance with indexing and caching
* Object-oriented Asset model with backwards compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-10-14 23:42:42 +02:00


Issue #141: Asset Management Concepts for Images and File Includes

Date: October 8, 2025
Issue: #141 - Concept to handle images and other file includes
Status: 📋 CONCEPT PROPOSAL

Problem Statement

The goal is to create a system that can:

Include images and files with markdown documents
Keep them referenceable in the database/system
Store them efficiently with automatic deduplication
Handle duplicate content with different filenames seamlessly

Design Context

Based on the MarkdownPackageFormats wiki analysis, we have several proven patterns:

ZIP-based packaging (.mdpkg, .mdz formats)
Content-addressable storage patterns
Manifest-based metadata systems
Asset directory conventions (/assets, /images)

Core Requirements Analysis

Functional Requirements

Content Deduplication: Same image content → single storage, multiple references
Efficient Storage: Minimize disk space usage for asset libraries
Referential Integrity: Maintain markdown → asset relationships
Multiple Names: Support different filenames for same content
Database Integration: Asset metadata queryable and indexable
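
The deduplication and multiple-names requirements can be illustrated with a minimal in-memory sketch. The `blobs` and `names` dictionaries and the `add` function are hypothetical stand-ins for the real store and name index, not part of the proposed design.

```python
import hashlib

def content_key(data: bytes) -> str:
    """Return the SHA-256 hex digest used as the storage key."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical in-memory stand-ins for the asset store and the name index.
blobs = {}   # content hash -> stored bytes
names = {}   # filename -> content hash

def add(filename: str, data: bytes) -> str:
    """Store content once per hash, but remember every filename."""
    key = content_key(data)
    blobs.setdefault(key, data)   # identical content is stored only once
    names[filename] = key         # each name keeps its own reference
    return key

add("logo.png", b"same bytes")
add("header-logo.png", b"same bytes")   # duplicate content, different name
add("chart.png", b"other bytes")
```

After these three calls, `names` holds three entries while `blobs` holds only two, which is the "single storage, multiple references" behavior described above.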

Non-Functional Requirements

Performance: Fast asset lookup and retrieval
Scalability: Handle large asset libraries (thousands of files)
Portability: Assets packaged with markdown for distribution
Maintainability: Clear separation of content and metadata

🎯 Concept A: Hash-Based Asset Store with Virtual Naming

Architecture Overview

markitect_assets/
├── store/                    # Content-addressed storage
│   ├── sha256/
│   │   ├── a1b2c3.../       # First 6 chars of hash
│   │   │   └── full_hash.ext # Actual file
│   │   └── d4e5f6.../
│   └── metadata.db          # SQLite database
├── cache/                   # Processed/resized versions
└── manifest.json           # Global asset registry

Key Components

1. Content-Addressed Storage

import hashlib
from pathlib import Path

class HashBasedAssetStore:
    def __init__(self, store_path):
        self.store_path = Path(store_path)
        self.store_path.mkdir(parents=True, exist_ok=True)

    def store_asset(self, file_path, original_name=None):
        """Store asset and return content hash."""
        content = Path(file_path).read_bytes()
        content_hash = hashlib.sha256(content).hexdigest()

        # Store in hash-based directory structure
        hash_dir = self.store_path / "store" / "sha256" / content_hash[:6]
        hash_dir.mkdir(parents=True, exist_ok=True)

        file_ext = Path(file_path).suffix
        stored_path = hash_dir / f"{content_hash}{file_ext}"

        if not stored_path.exists():
            stored_path.write_bytes(content)

        return content_hash
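
The `store_asset` method above reads each file fully into memory before hashing. For large assets, a chunked variant may be preferable. The following is a sketch only; `hash_file` and its `chunk_size` parameter are hypothetical names, not part of the proposed class.

```python
import hashlib
from pathlib import Path

def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # read() returns b"" at end of file, which stops the iterator
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is identical to hashing the whole content at once, so the resulting storage paths would not change.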

2. Virtual Name Mapping Database

-- SQLite schema for asset management
CREATE TABLE assets (
    content_hash TEXT PRIMARY KEY,
    file_size INTEGER,
    mime_type TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    original_extension TEXT
);

CREATE TABLE asset_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content_hash TEXT,
    virtual_name TEXT,
    document_id TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (content_hash) REFERENCES assets(content_hash)
);

CREATE INDEX idx_asset_names_virtual ON asset_names(virtual_name);
CREATE INDEX idx_asset_names_document ON asset_names(document_id);
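
The Markdown Integration code calls `register_name` on the asset store, but the document does not define it. A minimal sketch of how the schema above might be driven from Python's `sqlite3` module follows. All function names (`open_db`, `register_asset`, `register_name`, `names_for`) are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS assets (
    content_hash TEXT PRIMARY KEY,
    file_size INTEGER,
    mime_type TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    original_extension TEXT
);
CREATE TABLE IF NOT EXISTS asset_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content_hash TEXT,
    virtual_name TEXT,
    document_id TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (content_hash) REFERENCES assets(content_hash)
);
"""

def open_db(path=":memory:"):
    """Open the metadata database and create the tables if needed."""
    conn = sqlite3.connect(path)
    # SQLite typically does not enforce foreign keys unless asked to
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def register_asset(conn, content_hash, file_size, mime_type, extension):
    """Record a stored asset; re-registering the same hash is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO assets "
        "(content_hash, file_size, mime_type, original_extension) "
        "VALUES (?, ?, ?, ?)",
        (content_hash, file_size, mime_type, extension),
    )

def register_name(conn, content_hash, virtual_name, document_id):
    """Attach one more virtual name to an already registered asset."""
    conn.execute(
        "INSERT INTO asset_names (content_hash, virtual_name, document_id) "
        "VALUES (?, ?, ?)",
        (content_hash, virtual_name, document_id),
    )

def names_for(conn, content_hash):
    """Return all virtual names of an asset in insertion order."""
    rows = conn.execute(
        "SELECT virtual_name FROM asset_names WHERE content_hash = ? ORDER BY id",
        (content_hash,),
    )
    return [row[0] for row in rows]
```

With foreign keys enabled, registering a name for an unknown hash raises `sqlite3.IntegrityError`, which gives the referential integrity the requirements ask for.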

3. Markdown Integration

import re
from pathlib import Path

class MarkdownAssetProcessor:
    def __init__(self, asset_store):
        self.asset_store = asset_store

    def process_markdown_with_assets(self, md_content, document_id, asset_dir):
        """Process markdown and replace image references with hash-based ones."""
        asset_dir = Path(asset_dir)  # accept str or Path

        def replace_image_ref(match):
            alt_text, image_path = match.group(1), match.group(2)
            full_path = asset_dir / image_path

            if full_path.exists():
                # Store asset and get hash
                content_hash = self.asset_store.store_asset(full_path, image_path)

                # Register virtual name (register_name is assumed to exist on
                # the asset store; HashBasedAssetStore above does not define it)
                self.asset_store.register_name(content_hash, image_path, document_id)

                # Return hash-based reference, keeping the original alt text
                return f'![{alt_text}]({content_hash})'

            return match.group(0)  # Return original if file not found

        # Replace image references; group 1 is the alt text, group 2 the path
        processed_md = re.sub(r'!\[(.*?)\]\(([^)]+)\)', replace_image_ref, md_content)
        return processed_md
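
The rewriting step can also be tried in isolation, without a store or a filesystem. In this sketch, `rewrite_image_refs` and its `resolve` callback are hypothetical. `resolve` maps an image path to a content hash, or returns `None` when the asset is unknown. Image references that carry a title, such as `![a](b.png "title")`, are deliberately left untouched by this simple pattern.

```python
import re

# Group 1: alt text, group 2: image path (no spaces, so titled images are skipped)
IMAGE_REF = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)\)')

def rewrite_image_refs(md_content, resolve):
    """Replace image paths with content hashes, preserving alt text."""
    def _replace(match):
        alt_text, target = match.group(1), match.group(2)
        content_hash = resolve(target)
        if content_hash is None:
            return match.group(0)   # unknown asset: keep the original reference
        return f"![{alt_text}]({content_hash})"
    return IMAGE_REF.sub(_replace, md_content)
```

A plain dictionary's `get` method works as the `resolve` callback for quick experiments, for example `rewrite_image_refs(text, {"img/logo.png": "abc123"}.get)`.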

Concept A: Pros and Cons

✅ Advantages

Perfect Deduplication: Identical content stored only once regardless of filename
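Content Integrity: Re-hashing a stored file and comparing the result with its content hash detects corruption
Efficient Storage: Minimum disk space usage for large asset libraries
Fast Lookups: Hash-based access is effectively O(1), because the storage path is derived directly from the hash and needs no directory scan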
Version Agnostic: Same content = same hash, regardless of how it was added
Referential Integrity: Virtual names maintain user-friendly references
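
The content-integrity advantage follows from the storage layout: each stored file is named after its own hash, so it can be checked without consulting the database. A minimal sketch, assuming files are named `<sha256><ext>` as in `store_asset` above; `verify_stored_asset` is a hypothetical helper.

```python
import hashlib
from pathlib import Path

def verify_stored_asset(stored_path):
    """Return True if the file's SHA-256 matches the hash in its filename."""
    stored_path = Path(stored_path)
    expected = stored_path.stem   # "<hash>.png" -> "<hash>"
    actual = hashlib.sha256(stored_path.read_bytes()).hexdigest()
    return actual == expected
```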

❌ Disadvantages

Complex Recovery: Lost database means lost name mappings
Hash Collisions: Theoretical risk with SHA-256 (extremely low)
Migration Complexity: Moving between systems requires database + files
Debugging Difficulty: Not human-readable file organization
Initial Overhead: Database setup and maintenance required
Tool Integration: External tools can't easily browse assets

🎯 Concept B: Content-Addressable Package System with Symlinks

Architecture Overview

markitect_packages/
├── documents/
│   ├── doc1.mdpkg           # ZIP package per document
│   └── doc2.mdpkg
├── shared_assets/           # Deduplicated asset library
│   ├── images/
│   │   ├── content_hash_1.png
│   │   └── content_hash_2.jpg
│   └── registry.json       # Asset registry
└── workspace/               # Working directory with symlinks
    ├── doc1/
    │   ├── index.md
    │   └── assets/          # Symlinks to shared_assets
    │       ├── logo.png → ../../../shared_assets/images/content_hash_1.png
    │       └── chart.png → ../../../shared_assets/images/content_hash_1.png
    └── doc2/

Key Components

1. Package-Based Document Storage

import zipfile
import json
from pathlib import Path

class PackageManager:
    def __init__(self, workspace_path):
        self.workspace = Path(workspace_path)
        self.shared_assets = self.workspace / "shared_assets"
        # Packages live in documents/, matching the layout above
        self.packages = self.workspace / "documents"

        # Initialize directories
        for dir_path in [self.shared_assets, self.packages]:
            dir_path.mkdir(parents=True, exist_ok=True)

    def create_package(self, document_path, package_name):
        """Create .mdpkg from working directory."""
        document_path = Path(document_path)  # accept str or Path
        package_path = self.packages / f"{package_name}.mdpkg"

        with zipfile.ZipFile(package_path, 'w', zipfile.ZIP_DEFLATED) as zf:
            # Add markdown file
            zf.write(document_path / "index.md", "index.md")

            # Add manifest (_create_manifest is not shown in this document)
            manifest = self._create_manifest(document_path)
            zf.writestr("manifest.json", json.dumps(manifest, indent=2))

            # Add actual asset files (resolved from symlinks)
            assets_dir = document_path / "assets"
            if assets_dir.exists():
                for asset in assets_dir.iterdir():
                    # resolve() follows symlinks and leaves regular files
                    # unchanged, so the archive always holds real content
                    zf.write(asset.resolve(), f"assets/{asset.name}")

        return package_path
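
`create_package` relies on a `_create_manifest` helper that the document does not show. One possible shape is sketched below as a standalone function. Only the `assets` list and its `name` key are implied by the extraction code in this document. The `entry`, `sha256` and `size` fields are assumptions and may not match the MarkdownPackageFormats specification.

```python
import hashlib
from pathlib import Path

def create_manifest(document_path):
    """Build a manifest dict for a document directory (hypothetical format)."""
    document_path = Path(document_path)
    assets = []
    assets_dir = document_path / "assets"
    if assets_dir.is_dir():
        for asset in sorted(assets_dir.iterdir()):
            if not asset.is_file():   # is_file() follows symlinks
                continue
            data = asset.read_bytes()
            assets.append({
                "name": asset.name,                          # name used in markdown
                "sha256": hashlib.sha256(data).hexdigest(),  # assumed field
                "size": len(data),                           # assumed field
            })
    return {"entry": "index.md", "assets": assets}
```

Recording the hash in the manifest would let an importer skip assets it already has, though the document does not require this.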

2. Symlink-Based Deduplication

import hashlib
from datetime import datetime
from pathlib import Path

class AssetDeduplicator:
    def __init__(self, shared_assets_path):
        self.shared_assets = Path(shared_assets_path)
        self.registry_path = self.shared_assets / "registry.json"
        # load_registry(), _find_existing_asset() and _get_mime_type() are
        # helpers not shown in this document; load_registry() is expected
        # to populate self.registry
        self.load_registry()

    def add_asset(self, asset_path, document_dir, desired_name):
        """Add asset with deduplication via symlinks."""
        document_dir = Path(document_dir)  # accept str or Path
        content = Path(asset_path).read_bytes()
        content_hash = hashlib.sha256(content).hexdigest()

        # Check if content already exists
        existing_path = self._find_existing_asset(content_hash)

        if not existing_path:
            # Store new asset in shared location
            file_ext = Path(asset_path).suffix
            shared_path = self.shared_assets / "images" / f"{content_hash}{file_ext}"
            shared_path.parent.mkdir(parents=True, exist_ok=True)
            shared_path.write_bytes(content)

            # Update registry (in memory only; persisting it is not shown)
            self.registry[content_hash] = {
                "path": str(shared_path.relative_to(self.shared_assets)),
                "size": len(content),
                "mime_type": self._get_mime_type(file_ext),
                "created": datetime.now().isoformat()
            }
            existing_path = shared_path

        # Create symlink in document directory
        asset_link = document_dir / "assets" / desired_name
        asset_link.parent.mkdir(parents=True, exist_ok=True)

        if asset_link.exists() or asset_link.is_symlink():
            asset_link.unlink()

        # Note: this creates an absolute symlink target, whereas the layout
        # above shows relative targets
        asset_link.symlink_to(existing_path.resolve())

        return existing_path
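
`AssetDeduplicator` calls `load_registry` and `_find_existing_asset` without showing them. The sketch below gives one way such a JSON registry might be loaded, queried and saved. The `JsonRegistry` class and its method names are hypothetical. Only the `path` key mirrors the registry entries written by `add_asset` above.

```python
import json
from pathlib import Path

class JsonRegistry:
    """Hypothetical registry.json wrapper: content hash -> metadata dict."""

    def __init__(self, registry_path):
        self.path = Path(registry_path)
        self.entries = {}
        if self.path.exists():
            self.entries = json.loads(self.path.read_text(encoding="utf-8"))

    def find(self, content_hash, base_dir):
        """Return the stored file's path, or None if unknown or missing on disk."""
        entry = self.entries.get(content_hash)
        if entry is None:
            return None
        candidate = Path(base_dir) / entry["path"]
        return candidate if candidate.exists() else None

    def save(self):
        """Write the registry via a temporary file, then rename it into place."""
        self.path.parent.mkdir(parents=True, exist_ok=True)
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.entries, indent=2), encoding="utf-8")
        tmp.replace(self.path)   # avoids leaving a half-written registry.json
```

Checking that the file still exists in `find` means a stale registry entry leads to the asset being stored again rather than to a dangling symlink.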

3. Package Import/Export

import json
import shutil
import zipfile
from pathlib import Path

class PackageHandler:
    def __init__(self, deduplicator):
        # AssetDeduplicator instance used to link extracted assets
        self.deduplicator = deduplicator

    def extract_package(self, package_path, workspace_dir):
        """Extract .mdpkg and set up symlinks."""
        package_path = Path(package_path)  # accept str or Path
        extract_dir = Path(workspace_dir) / package_path.stem
        extract_dir.mkdir(parents=True, exist_ok=True)
        temp_dir = extract_dir / "temp_assets"

        with zipfile.ZipFile(package_path, 'r') as zf:
            # Extract manifest first
            manifest = json.loads(zf.read("manifest.json"))

            # Extract markdown
            zf.extract("index.md", extract_dir)

            # Handle assets with deduplication
            for asset_info in manifest.get("assets", []):
                asset_name = asset_info["name"]

                # Extract to temporary location; extract() keeps the archive
                # path, so the file lands in temp_dir/assets/<name>
                temp_path = Path(zf.extract(f"assets/{asset_name}", temp_dir))

                # Add through deduplicator (creates symlink)
                self.deduplicator.add_asset(temp_path, extract_dir, asset_name)

                # Clean up temporary file
                temp_path.unlink()

        # Remove the now-empty temporary directory tree
        shutil.rmtree(temp_dir, ignore_errors=True)
        return extract_dir

Concept B: Pros and Cons

✅ Advantages

Visual Transparency: Symlinks show actual file relationships clearly
Tool Compatibility: Standard tools can follow symlinks and work normally
Package Portability: .mdpkg files are self-contained ZIP archives
Gradual Migration: Can work with existing file-based workflows
Backup Friendly: Clear separation between packages and shared assets
Standard Formats: Uses ZIP and JSON, widely supported
Working Directory: Users see familiar file/folder structure

❌ Disadvantages

Platform Dependency: Symlinks work differently on Windows vs Unix
Sync Complexity: Symlinks can break during cloud sync or backup
Storage Overhead: Registry + symlinks + actual files
Permission Issues: Symlink creation may require special permissions
Broken Links: Symlinks can become dangling if shared assets move
Complexity: More moving parts (packages + symlinks + registry)
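
Several of these disadvantages (platform dependency, permission issues, absolute links breaking when a workspace moves) could be softened by creating relative symlinks and falling back to a plain copy when linking fails. This is a sketch of that idea, not part of the proposal; `link_or_copy` is a hypothetical helper.

```python
import os
import shutil
from pathlib import Path

def link_or_copy(target, link_path):
    """Create a relative symlink to target at link_path, or copy on failure.

    Returns "symlink" or "copy" so callers can record which was used.
    """
    target = Path(target).resolve()
    link_path = Path(link_path)
    link_path.parent.mkdir(parents=True, exist_ok=True)
    if link_path.is_symlink() or link_path.exists():
        link_path.unlink()
    try:
        # A relative target keeps working if the whole tree is moved
        relative_target = os.path.relpath(target, start=link_path.parent.resolve())
        link_path.symlink_to(relative_target)
        return "symlink"
    except (OSError, NotImplementedError, ValueError):
        # For example: missing privileges on Windows, or a different drive
        shutil.copy2(target, link_path)
        return "copy"
```

The copy fallback gives up deduplication for that one file, so a caller would probably want to log it.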

📊 Concept Comparison Matrix

Aspect	Concept A: Hash-Based Store	Concept B: Package + Symlinks
Deduplication Efficiency	⭐⭐⭐⭐⭐ Perfect	⭐⭐⭐⭐⚪ Very Good
Implementation Complexity	⭐⭐⭐⚪⚪ Moderate	⭐⭐⚪⚪⚪ Complex
Platform Compatibility	⭐⭐⭐⭐⭐ Universal	⭐⭐⭐⚪⚪ Platform-dependent
Tool Integration	⭐⭐⚪⚪⚪ Custom tools needed	⭐⭐⭐⭐⚪ Standard tools work
Storage Efficiency	⭐⭐⭐⭐⭐ Minimal	⭐⭐⭐⭐⚪ Good
User Experience	⭐⭐⭐⚪⚪ Learning curve	⭐⭐⭐⭐⚪ Familiar
Package Portability	⭐⭐⭐⚪⚪ Requires tooling	⭐⭐⭐⭐⭐ Standard ZIP
Recovery Robustness	⭐⭐⚪⚪⚪ Database dependent	⭐⭐⭐⭐⚪ Self-documenting
Performance	⭐⭐⭐⭐⭐ Fast hash lookup	⭐⭐⭐⚪⚪ Filesystem dependent
Maintenance	⭐⭐⭐⚪⚪ Database management	⭐⭐⚪⚪⚪ Complex relationships

🎯 Recommended Implementation Strategy

Phase 1: Start with Concept B (Rapid Prototyping)

Rationale: Easier to understand, debug, and demonstrate

Implement basic package creation/extraction
Use simple file copying for initial version (add deduplication later)
Focus on .mdpkg format compatibility with wiki specifications

Phase 2: Add Deduplication (Hybrid Approach)

Evolution: Incorporate hash-based deduplication from Concept A

Keep the package/symlink user interface from Concept B
Add content hashing for deduplication backend
Maintain content-addressable shared storage

Phase 3: Advanced Features

Content-based asset search and discovery
Automatic format conversion and optimization
Integration with markitect CLI commands
Web interface for asset library browsing

🛠️ Python Library Recommendations

Core Libraries (Standard Library)

hashlib - Content hashing for deduplication
sqlite3 - Metadata and relationship storage
zipfile - Package creation and extraction
pathlib - Modern path handling
json - Manifest and metadata serialization

Additional Libraries (Optional)

click - CLI interface (already available)
Pillow - Image processing and format detection
python-magic - MIME type detection
watchdog - File system monitoring for auto-import
send2trash - Safe file deletion
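
Where adding python-magic is not desirable, the standard library's `mimetypes` module can guess a MIME type from the file extension alone. It does not inspect file content. The sketch below could stand in for the `_get_mime_type` helper referenced earlier; `guess_mime_type` and its default value are assumptions.

```python
import mimetypes

def guess_mime_type(file_ext, default="application/octet-stream"):
    """Guess a MIME type from an extension such as ".png" or "png"."""
    if not file_ext.startswith("."):
        file_ext = "." + file_ext
    # guess_type() works on a filename, so use a dummy name with the extension
    mime_type, _encoding = mimetypes.guess_type("file" + file_ext.lower())
    return mime_type or default
```

Extension-based guessing is cheap but can be fooled by misnamed files, which is the case content-sniffing libraries are meant to cover.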

Architecture Libraries

sqlalchemy - Advanced database ORM (if complex queries needed)
pydantic - Data validation and settings management
rich - Beautiful CLI output and progress bars

📋 Implementation Checklist

Core Functionality

Asset content hashing and deduplication
Markdown reference parsing and rewriting
Package creation (.mdpkg ZIP format)
Package extraction and workspace setup
Asset registry and metadata management

CLI Integration

markitect asset add - Import assets into library
markitect asset dedupe - Cleanup duplicate assets
markitect package create - Create .mdpkg from directory
markitect package extract - Extract .mdpkg to workspace
markitect asset list - Browse asset library
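
The command layout above can be prototyped before any handlers exist. This sketch uses the standard library's `argparse` as a dependency-free stand-in for click. The argument names (`paths`, `directory`, `package`) are assumptions, and no command does any work yet.

```python
import argparse

def build_parser():
    """Build a parser mirroring the proposed `markitect` command groups."""
    parser = argparse.ArgumentParser(prog="markitect")
    groups = parser.add_subparsers(dest="group", required=True)

    # markitect asset add|dedupe|list
    asset = groups.add_parser("asset").add_subparsers(dest="command", required=True)
    asset.add_parser("add").add_argument("paths", nargs="+")
    asset.add_parser("dedupe")
    asset.add_parser("list")

    # markitect package create|extract
    package = groups.add_parser("package").add_subparsers(dest="command", required=True)
    package.add_parser("create").add_argument("directory")
    package.add_parser("extract").add_argument("package")

    return parser
```

Parsing `["asset", "add", "a.png", "b.png"]` yields a namespace with `group="asset"`, `command="add"` and both paths collected in `paths`.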

Advanced Features

Automatic image format optimization
Asset usage tracking and cleanup
Batch import from directories
Integration with md-explode/implode workflow
Web-based asset browser interface

🚀 Next Steps

Prototype Development: Create minimal working implementation of Concept B
CLI Integration: Add basic asset management commands to markitect
Testing: Comprehensive testing with real-world markdown documents
Documentation: User guide for asset management workflow
Community Feedback: Gather input on the approach and API design

This design provides a solid foundation for efficient, deduplicated asset management while maintaining compatibility with existing markdown workflows and the MarkdownPackageFormats standards.

Status: 📋 Concept Complete - Ready for Implementation Planning
