# Issue #141: Asset Management Concepts for Images and File Includes

**Date**: October 8, 2025
**Issue**: #141 - Concept to handle images and other file includes
**Status**: 📋 **CONCEPT PROPOSAL**

## Problem Statement

The goal is to create a system that can:

1. **Include images and files** with markdown documents
2. **Keep them referenceable** in the database/system
3. **Store them efficiently** with automatic deduplication
4. **Handle duplicate content** with different filenames seamlessly

## Design Context

Based on the **MarkdownPackageFormats** wiki analysis, we have several proven patterns:

- **ZIP-based packaging** (`.mdpkg`, `.mdz` formats)
- **Content-addressable storage** patterns
- **Manifest-based metadata** systems
- **Asset directory conventions** (`/assets`, `/images`)

## Core Requirements Analysis

### Functional Requirements

- **Content Deduplication**: Same image content → single storage, multiple references
- **Efficient Storage**: Minimize disk space usage for asset libraries
- **Referential Integrity**: Maintain markdown → asset relationships
- **Multiple Names**: Support different filenames for same content
- **Database Integration**: Asset metadata queryable and indexable

### Non-Functional Requirements

- **Performance**: Fast asset lookup and retrieval
- **Scalability**: Handle large asset libraries (1000s of files)
- **Portability**: Assets packaged with markdown for distribution
- **Maintainability**: Clear separation of content and metadata

---

## 🎯 Concept A: Hash-Based Asset Store with Virtual Naming

### Architecture Overview

```
markitect_assets/
├── store/                      # Content-addressed storage
│   ├── sha256/
│   │   ├── a1b2c3.../          # First 6 chars of hash
│   │   │   └── full_hash.ext   # Actual file
│   │   └── d4e5f6.../
│   └── metadata.db             # SQLite database
├── cache/                      # Processed/resized versions
└── manifest.json               # Global asset registry
```

### Key Components

#### 1. Content-Addressed Storage

```python
import hashlib
from pathlib import Path


class HashBasedAssetStore:
    def __init__(self, store_path):
        self.store_path = Path(store_path)
        self.store_path.mkdir(parents=True, exist_ok=True)

    def store_asset(self, file_path, original_name=None):
        """Store asset and return content hash.

        original_name is not used here; name mappings live in the database.
        """
        content = Path(file_path).read_bytes()
        content_hash = hashlib.sha256(content).hexdigest()

        # Store in hash-based directory structure
        hash_dir = self.store_path / "store" / "sha256" / content_hash[:6]
        hash_dir.mkdir(parents=True, exist_ok=True)

        file_ext = Path(file_path).suffix
        stored_path = hash_dir / f"{content_hash}{file_ext}"

        # Write only if this content is not stored yet (deduplication)
        if not stored_path.exists():
            stored_path.write_bytes(content)

        return content_hash
```

#### 2. Virtual Name Mapping Database

```sql
-- SQLite schema for asset management
CREATE TABLE assets (
    content_hash TEXT PRIMARY KEY,
    file_size INTEGER,
    mime_type TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    original_extension TEXT
);

CREATE TABLE asset_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content_hash TEXT,
    virtual_name TEXT,
    document_id TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (content_hash) REFERENCES assets(content_hash)
);

CREATE INDEX idx_asset_names_virtual ON asset_names(virtual_name);
CREATE INDEX idx_asset_names_document ON asset_names(document_id);
```

#### 3. Markdown Integration

```python
import re
from pathlib import Path


class MarkdownAssetProcessor:
    # Matches ![alt](path); group 1 is the alt text, group 2 the path
    IMAGE_REF = re.compile(r'!\[(.*?)\]\(([^)]+)\)')

    def __init__(self, asset_store):
        self.asset_store = asset_store

    def process_markdown_with_assets(self, md_content, document_id, asset_dir):
        """Process markdown and replace image references with hash-based ones."""
        asset_dir = Path(asset_dir)

        def replace_image_ref(match):
            alt_text, image_path = match.group(1), match.group(2)
            full_path = asset_dir / image_path

            if full_path.is_file():
                # Store asset and get hash
                content_hash = self.asset_store.store_asset(full_path, image_path)

                # Register virtual name (register_name is assumed to exist on the
                # store and to write to the asset_names table; it is not shown here)
                self.asset_store.register_name(content_hash, image_path, document_id)

                # Return hash-based reference, keeping the original alt text
                return f'![{alt_text}]({content_hash})'

            return match.group(0)  # Return original if file not found

        # Replace image references
        return self.IMAGE_REF.sub(replace_image_ref, md_content)
```

### Concept A: Pros and Cons

#### ✅ Advantages

1. **Perfect Deduplication**: Identical content stored only once regardless of filename (the sketch above includes the extension in the stored filename, so identical bytes saved under different extensions would be stored separately)
2. **Content Integrity**: Hash verification ensures data hasn't been corrupted
3. **Efficient Storage**: Minimum disk space usage for large asset libraries
4. **Fast Lookups**: Hash-based access gives O(1) retrieval, since the storage path is computed from the hash and the recorded extension
5. **Version Agnostic**: Same content = same hash, regardless of how it was added
6. **Referential Integrity**: Virtual names maintain user-friendly references

#### ❌ Disadvantages

1. **Complex Recovery**: Lost database means lost name mappings
2. **Hash Collisions**: Theoretical risk with SHA-256 (extremely low)
3. **Migration Complexity**: Moving between systems requires database + files
4. **Debugging Difficulty**: Not human-readable file organization
5. **Initial Overhead**: Database setup and maintenance required
6. **Tool Integration**: External tools can't easily browse assets

---

## 🎯 Concept B: Content-Addressable Package System with Symlinks

### Architecture Overview

```
markitect_packages/
├── documents/
│   ├── doc1.mdpkg              # ZIP package per document
│   └── doc2.mdpkg
├── shared_assets/              # Deduplicated asset library
│   ├── images/
│   │   ├── content_hash_1.png
│   │   └── content_hash_2.jpg
│   └── registry.json           # Asset registry
└── workspace/                  # Working directory with symlinks
    ├── doc1/
    │   ├── index.md
    │   └── assets/             # Symlinks to shared_assets
    │       ├── logo.png → ../../../shared_assets/images/content_hash_1.png
    │       └── chart.png → ../../../shared_assets/images/content_hash_1.png
    └── doc2/
```

### Key Components

#### 1. Package-Based Document Storage

```python
import json
import zipfile
from pathlib import Path


class PackageManager:
    def __init__(self, workspace_path):
        self.workspace = Path(workspace_path)
        self.shared_assets = self.workspace / "shared_assets"
        self.packages = self.workspace / "documents"  # .mdpkg files, as in the layout above

        # Initialize directories
        for dir_path in [self.shared_assets, self.packages]:
            dir_path.mkdir(parents=True, exist_ok=True)

    def _create_manifest(self, document_path):
        """Build a minimal manifest: only the asset names that extraction reads."""
        assets_dir = document_path / "assets"
        names = sorted(p.name for p in assets_dir.iterdir()) if assets_dir.exists() else []
        return {"assets": [{"name": name} for name in names]}

    def create_package(self, document_path, package_name):
        """Create .mdpkg from working directory."""
        document_path = Path(document_path)
        package_path = self.packages / f"{package_name}.mdpkg"

        with zipfile.ZipFile(package_path, 'w', zipfile.ZIP_DEFLATED) as zf:
            # Add markdown file
            zf.write(document_path / "index.md", "index.md")

            # Add manifest
            manifest = self._create_manifest(document_path)
            zf.writestr("manifest.json", json.dumps(manifest, indent=2))

            # Add actual asset files (resolved from symlinks)
            assets_dir = document_path / "assets"
            if assets_dir.exists():
                for asset in assets_dir.iterdir():
                    if asset.is_symlink():
                        # Resolve symlink and add actual file
                        real_file = asset.resolve()
                        zf.write(real_file, f"assets/{asset.name}")
                    else:
                        zf.write(asset, f"assets/{asset.name}")

        return package_path
```

#### 2. Symlink-Based Deduplication

```python
import hashlib
import json
import mimetypes
from datetime import datetime
from pathlib import Path


class AssetDeduplicator:
    def __init__(self, shared_assets_path):
        self.shared_assets = Path(shared_assets_path)
        self.registry_path = self.shared_assets / "registry.json"
        self.load_registry()

    def load_registry(self):
        """Load registry.json, or start with an empty registry (minimal placeholder)."""
        if self.registry_path.exists():
            self.registry = json.loads(self.registry_path.read_text())
        else:
            self.registry = {}

    def _find_existing_asset(self, content_hash):
        """Return the shared path registered for this hash, or None (minimal placeholder)."""
        entry = self.registry.get(content_hash)
        return self.shared_assets / entry["path"] if entry else None

    def _get_mime_type(self, file_ext):
        """Guess a MIME type from the extension; None if unknown (minimal placeholder)."""
        return mimetypes.guess_type(f"asset{file_ext}")[0]

    def add_asset(self, asset_path, document_dir, desired_name):
        """Add asset with deduplication via symlinks."""
        content = Path(asset_path).read_bytes()
        content_hash = hashlib.sha256(content).hexdigest()

        # Check if content already exists
        existing_path = self._find_existing_asset(content_hash)

        if not existing_path:
            # Store new asset in shared location
            file_ext = Path(asset_path).suffix
            shared_path = self.shared_assets / "images" / f"{content_hash}{file_ext}"
            shared_path.parent.mkdir(parents=True, exist_ok=True)
            shared_path.write_bytes(content)

            # Update registry and persist it
            self.registry[content_hash] = {
                "path": str(shared_path.relative_to(self.shared_assets)),
                "size": len(content),
                "mime_type": self._get_mime_type(file_ext),
                "created": datetime.now().isoformat()
            }
            self.registry_path.write_text(json.dumps(self.registry, indent=2))
            existing_path = shared_path

        # Create symlink in document directory
        asset_link = Path(document_dir) / "assets" / desired_name
        asset_link.parent.mkdir(parents=True, exist_ok=True)

        if asset_link.exists() or asset_link.is_symlink():
            asset_link.unlink()

        asset_link.symlink_to(existing_path.resolve())
        return existing_path
```

#### 3. Package Import/Export

```python
import json
import shutil
import zipfile
from pathlib import Path


class PackageHandler:
    def __init__(self, deduplicator):
        # AssetDeduplicator instance used to link extracted assets
        self.deduplicator = deduplicator

    def extract_package(self, package_path, workspace_dir):
        """Extract .mdpkg and set up symlinks."""
        package_path = Path(package_path)
        extract_dir = Path(workspace_dir) / package_path.stem
        extract_dir.mkdir(parents=True, exist_ok=True)
        temp_dir = extract_dir / "temp_assets"

        with zipfile.ZipFile(package_path, 'r') as zf:
            # Extract manifest first
            manifest = json.loads(zf.read("manifest.json"))

            # Extract markdown
            zf.extract("index.md", extract_dir)

            # Handle assets with deduplication
            for asset_info in manifest.get("assets", []):
                asset_name = asset_info["name"]

                # Extract to temporary location (read the bytes directly, since
                # zf.extract would recreate the "assets/" prefix under the target)
                temp_path = temp_dir / asset_name
                temp_path.parent.mkdir(parents=True, exist_ok=True)
                temp_path.write_bytes(zf.read(f"assets/{asset_name}"))

                # Add through deduplicator (creates symlink)
                self.deduplicator.add_asset(temp_path, extract_dir, asset_name)

                # Clean up temporary file
                temp_path.unlink()

        # Remove the now-empty temporary directory
        shutil.rmtree(temp_dir, ignore_errors=True)
        return extract_dir
```

### Concept B: Pros and Cons

#### ✅ Advantages

1. **Visual Transparency**: Symlinks show actual file relationships clearly
2. **Tool Compatibility**: Standard tools can follow symlinks and work normally
3. **Package Portability**: `.mdpkg` files are self-contained ZIP archives
4. **Gradual Migration**: Can work with existing file-based workflows
5. **Backup Friendly**: Clear separation between packages and shared assets
6. **Standard Formats**: Uses ZIP and JSON, widely supported
7. **Working Directory**: Users see familiar file/folder structure

#### ❌ Disadvantages

1. **Platform Dependency**: Symlinks work differently on Windows vs Unix
2. **Sync Complexity**: Symlinks can break during cloud sync or backup
3. **Storage Overhead**: Registry + symlinks + actual files
4. **Permission Issues**: Symlink creation may require special permissions
5. **Broken Links**: Symlinks can become dangling if shared assets move
6. **Complexity**: More moving parts (packages + symlinks + registry)

---

## 📊 Concept Comparison Matrix

| Aspect | Concept A: Hash-Based Store | Concept B: Package + Symlinks |
|--------|---------------------------|------------------------------|
| **Deduplication Efficiency** | ⭐⭐⭐⭐⭐ Perfect | ⭐⭐⭐⭐⚪ Very Good |
| **Implementation Complexity** | ⭐⭐⭐⚪⚪ Moderate | ⭐⭐⚪⚪⚪ Complex |
| **Platform Compatibility** | ⭐⭐⭐⭐⭐ Universal | ⭐⭐⭐⚪⚪ Platform-dependent |
| **Tool Integration** | ⭐⭐⚪⚪⚪ Custom tools needed | ⭐⭐⭐⭐⚪ Standard tools work |
| **Storage Efficiency** | ⭐⭐⭐⭐⭐ Minimal | ⭐⭐⭐⭐⚪ Good |
| **User Experience** | ⭐⭐⭐⚪⚪ Learning curve | ⭐⭐⭐⭐⚪ Familiar |
| **Package Portability** | ⭐⭐⭐⚪⚪ Requires tooling | ⭐⭐⭐⭐⭐ Standard ZIP |
| **Recovery Robustness** | ⭐⭐⚪⚪⚪ Database dependent | ⭐⭐⭐⭐⚪ Self-documenting |
| **Performance** | ⭐⭐⭐⭐⭐ Fast hash lookup | ⭐⭐⭐⚪⚪ Filesystem dependent |
| **Maintenance** | ⭐⭐⭐⚪⚪ Database management | ⭐⭐⚪⚪⚪ Complex relationships |

## 🎯 Recommended Implementation Strategy

### Phase 1: Start with Concept B (Rapid Prototyping)

**Rationale**: Easier to understand, debug, and demonstrate

- Implement basic package creation/extraction
- Use simple file copying for initial version (add deduplication later)
- Focus on `.mdpkg` format compatibility with wiki specifications

### Phase 2: Add Deduplication (Hybrid Approach)

**Evolution**: Incorporate hash-based deduplication from Concept A

- Keep the package/symlink user interface from Concept B
- Add content hashing for deduplication backend
- Maintain content-addressable shared storage

### Phase 3: Advanced Features

- Content-based asset search and discovery
- Automatic format conversion and optimization
- Integration with markitect CLI commands
- Web interface for asset library browsing

## 🛠️ Python Library Recommendations

### Core Libraries (Standard Library)

- **`hashlib`** - Content hashing for deduplication
- **`sqlite3`** - Metadata and relationship storage
- **`zipfile`** - Package creation and extraction
- **`pathlib`** - Modern path handling
- **`json`** - Manifest and metadata serialization

### Additional Libraries (Optional)

- **`click`** - CLI interface (already available)
- **`Pillow`** - Image processing and format detection
- **`python-magic`** - MIME type detection
- **`watchdog`** - File system monitoring for auto-import
- **`send2trash`** - Safe file deletion

### Architecture Libraries

- **`sqlalchemy`** - Advanced database ORM (if complex queries needed)
- **`pydantic`** - Data validation and settings management
- **`rich`** - Beautiful CLI output and progress bars

## 📋 Implementation Checklist

### Core Functionality

- [ ] Asset content hashing and deduplication
- [ ] Markdown reference parsing and rewriting
- [ ] Package creation (.mdpkg ZIP format)
- [ ] Package extraction and workspace setup
- [ ] Asset registry and metadata management

### CLI Integration

- [ ] `markitect asset add` - Import assets into library
- [ ] `markitect asset dedupe` - Cleanup duplicate assets
- [ ] `markitect package create` - Create .mdpkg from directory
- [ ] `markitect package extract` - Extract .mdpkg to workspace
- [ ] `markitect asset list` - Browse asset library

### Advanced Features

- [ ] Automatic image format optimization
- [ ] Asset usage tracking and cleanup
- [ ] Batch import from directories
- [ ] Integration with md-explode/implode workflow
- [ ] Web-based asset browser interface

## 🚀 Next Steps

1. **Prototype Development**: Create minimal working implementation of Concept B
2. **CLI Integration**: Add basic asset management commands to markitect
3. **Testing**: Comprehensive testing with real-world markdown documents
4. **Documentation**: User guide for asset management workflow
5. **Community Feedback**: Gather input on the approach and API design

This design provides a solid foundation for efficient, deduplicated asset management while maintaining compatibility with existing markdown workflows and the MarkdownPackageFormats standards.
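The Markdown integration sketch for Concept A calls `register_name` on the asset store, but no such method is defined in this document. The following is a minimal sketch of how virtual names could be attached to deduplicated content, assuming a trimmed-down version of the Concept A SQLite schema. The `NameRegistry` class, its `register`, `resolve`, and `asset_count` methods, and the `UNIQUE` constraint are hypothetical additions for illustration, not part of the proposal.

```python
import hashlib
import sqlite3

# Reduced form of the Concept A schema; the UNIQUE constraint is an assumption
# added so that registering the same name twice is a no-op.
SCHEMA = """
CREATE TABLE IF NOT EXISTS assets (
    content_hash TEXT PRIMARY KEY,
    file_size INTEGER
);
CREATE TABLE IF NOT EXISTS asset_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content_hash TEXT REFERENCES assets(content_hash),
    virtual_name TEXT,
    document_id TEXT,
    UNIQUE (content_hash, virtual_name, document_id)
);
"""


class NameRegistry:
    """Hypothetical helper: maps (document, virtual name) pairs to content hashes."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.executescript(SCHEMA)

    def register(self, content: bytes, virtual_name: str, document_id: str) -> str:
        """Record the content once, then attach a virtual name to it."""
        content_hash = hashlib.sha256(content).hexdigest()
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR IGNORE INTO assets (content_hash, file_size) VALUES (?, ?)",
                (content_hash, len(content)),
            )
            self.conn.execute(
                "INSERT OR IGNORE INTO asset_names "
                "(content_hash, virtual_name, document_id) VALUES (?, ?, ?)",
                (content_hash, virtual_name, document_id),
            )
        return content_hash

    def resolve(self, virtual_name: str, document_id: str):
        """Return the content hash for a name within a document, or None."""
        row = self.conn.execute(
            "SELECT content_hash FROM asset_names "
            "WHERE virtual_name = ? AND document_id = ?",
            (virtual_name, document_id),
        ).fetchone()
        return row[0] if row else None

    def asset_count(self) -> int:
        """Number of distinct stored contents."""
        return self.conn.execute("SELECT COUNT(*) FROM assets").fetchone()[0]


# Usage: the same bytes registered under two names in two documents
registry = NameRegistry()
logo = b"\x89PNG placeholder image bytes"
h1 = registry.register(logo, "logo.png", "doc1")
h2 = registry.register(logo, "brand/header.png", "doc2")
```

Both registrations return the same hash and leave a single row in `assets`, which is the "multiple names, single storage" behaviour the functional requirements ask for. A real implementation would also write the bytes to the content-addressed store.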
---

**Status**: 📋 **Concept Complete - Ready for Implementation Planning**