# Issue #141: Asset Management Concepts for Images and File Includes

**Date**: October 8, 2025
**Issue**: #141 - Concept to handle images and other file includes
**Status**: 📋 **CONCEPT PROPOSAL**

## Problem Statement

The goal is to create a system that can:
1. **Include images and files** with markdown documents
2. **Keep them referenceable** in the database/system
3. **Store them efficiently** with automatic deduplication
4. **Handle duplicate content** with different filenames seamlessly

## Design Context

The **MarkdownPackageFormats** wiki analysis identifies several proven patterns to build on:
- **ZIP-based packaging** (`.mdpkg`, `.mdz` formats)
- **Content-addressable storage** patterns
- **Manifest-based metadata** systems
- **Asset directory conventions** (`/assets`, `/images`)

## Core Requirements Analysis

### Functional Requirements
- **Content Deduplication**: Same image content → single storage, multiple references
- **Efficient Storage**: Minimize disk space usage for asset libraries
- **Referential Integrity**: Maintain markdown → asset relationships
- **Multiple Names**: Support different filenames for same content
- **Database Integration**: Asset metadata queryable and indexable

### Non-Functional Requirements
- **Performance**: Fast asset lookup and retrieval
- **Scalability**: Handle large asset libraries (thousands of files)
- **Portability**: Assets packaged with markdown for distribution
- **Maintainability**: Clear separation of content and metadata

---

## 🎯 Concept A: Hash-Based Asset Store with Virtual Naming

### Architecture Overview

```
markitect_assets/
├── store/                    # Content-addressed storage
│   ├── sha256/
│   │   ├── a1b2c3.../       # First 6 chars of hash
│   │   │   └── full_hash.ext # Actual file
│   │   └── d4e5f6.../
│   └── metadata.db          # SQLite database
├── cache/                   # Processed/resized versions
└── manifest.json           # Global asset registry
```

### Key Components

#### 1. Content-Addressed Storage
```python
import hashlib
from pathlib import Path

class HashBasedAssetStore:
    def __init__(self, store_path):
        self.store_path = Path(store_path)
        self.store_path.mkdir(parents=True, exist_ok=True)

    def store_asset(self, file_path, original_name=None):
        """Store asset and return content hash."""
        content = Path(file_path).read_bytes()
        content_hash = hashlib.sha256(content).hexdigest()

        # Store in hash-based directory structure
        hash_dir = self.store_path / "store" / "sha256" / content_hash[:6]
        hash_dir.mkdir(parents=True, exist_ok=True)

        file_ext = Path(file_path).suffix
        stored_path = hash_dir / f"{content_hash}{file_ext}"

        if not stored_path.exists():
            stored_path.write_bytes(content)

        return content_hash
```

#### 2. Virtual Name Mapping Database
```sql
-- SQLite schema for asset management
CREATE TABLE assets (
    content_hash TEXT PRIMARY KEY,
    file_size INTEGER,
    mime_type TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    original_extension TEXT
);

CREATE TABLE asset_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content_hash TEXT,
    virtual_name TEXT,
    document_id TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (content_hash) REFERENCES assets(content_hash)
);

CREATE INDEX idx_asset_names_virtual ON asset_names(virtual_name);
CREATE INDEX idx_asset_names_document ON asset_names(document_id);
```
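
The markdown processor in the next section calls `register_name` on the asset store, which the storage class above does not define. A minimal sketch of how it could sit on top of this schema with `sqlite3` follows. The function names `open_metadata_db`, `register_asset`, and `names_for_hash` are hypothetical helpers introduced for illustration, and foreign-key enforcement is switched on explicitly because SQLite leaves it off by default.

```python
import sqlite3

# Same two tables as the schema above, made idempotent for repeated startup.
SCHEMA = """
CREATE TABLE IF NOT EXISTS assets (
    content_hash TEXT PRIMARY KEY,
    file_size INTEGER,
    mime_type TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    original_extension TEXT
);
CREATE TABLE IF NOT EXISTS asset_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content_hash TEXT,
    virtual_name TEXT,
    document_id TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (content_hash) REFERENCES assets(content_hash)
);
"""

def open_metadata_db(db_path=":memory:"):
    """Open (or create) the metadata database and ensure the tables exist."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn

def register_asset(conn, content_hash, file_size, mime_type, extension):
    """Insert the asset row once; repeated calls for the same hash are no-ops."""
    conn.execute(
        "INSERT OR IGNORE INTO assets "
        "(content_hash, file_size, mime_type, original_extension) "
        "VALUES (?, ?, ?, ?)",
        (content_hash, file_size, mime_type, extension),
    )
    conn.commit()

def register_name(conn, content_hash, virtual_name, document_id):
    """Record that a document refers to the content under a given filename."""
    conn.execute(
        "INSERT INTO asset_names (content_hash, virtual_name, document_id) "
        "VALUES (?, ?, ?)",
        (content_hash, virtual_name, document_id),
    )
    conn.commit()

def names_for_hash(conn, content_hash):
    """List every virtual name that points at the same stored content."""
    rows = conn.execute(
        "SELECT virtual_name FROM asset_names "
        "WHERE content_hash = ? ORDER BY virtual_name",
        (content_hash,),
    ).fetchall()
    return [row[0] for row in rows]
```

With foreign keys enabled, registering a name for a hash that has no row in `assets` raises `sqlite3.IntegrityError`, which gives a basic form of the referential integrity listed in the requirements.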

#### 3. Markdown Integration
```python
class MarkdownAssetProcessor:
    def __init__(self, asset_store):
        self.asset_store = asset_store

    def process_markdown_with_assets(self, md_content, document_id, asset_dir):
        """Process markdown and replace image references with hash-based ones."""
        import re
        from pathlib import Path

        asset_dir = Path(asset_dir)  # accept str or Path

        def replace_image_ref(match):
            # Group 1 is the alt text, group 2 the image path
            alt_text, image_path = match.group(1), match.group(2)
            full_path = asset_dir / image_path

            if full_path.exists():
                # Store asset and get hash
                content_hash = self.asset_store.store_asset(full_path, image_path)

                # Register virtual name (register_name is assumed to exist
                # on the asset store; it is not shown in the class above)
                self.asset_store.register_name(content_hash, image_path, document_id)

                # Return hash-based reference, keeping the original alt text
                return f'![{alt_text}]({content_hash})'

            return match.group(0)  # Return original if file not found

        # Replace image references, capturing alt text and path separately
        processed_md = re.sub(r'!\[(.*?)\]\(([^)]+)\)', replace_image_ref, md_content)
        return processed_md
```

### Concept A: Pros and Cons

#### ✅ Advantages
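1. **Perfect Deduplication**: Identical content is stored only once regardless of filename (in the sketch above this also requires the same extension, because the extension is part of the stored path)
2. **Content Integrity**: Re-hashing a stored file and comparing the result with its name detects corruption
3. **Efficient Storage**: Minimum disk space usage for large asset libraries
4. **Fast Lookups**: The storage path is derived directly from the hash, so retrieval needs no directory scan or search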
5. **Version Agnostic**: Same content = same hash, regardless of how it was added
6. **Referential Integrity**: Virtual names maintain user-friendly references
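
The content-integrity advantage can be illustrated with a short check. This is a sketch only, assuming the `store/sha256/<prefix>/<hash><ext>` layout from the architecture overview; `find_corrupted_assets` is a hypothetical helper name, not part of any existing markitect API.

```python
import hashlib
from pathlib import Path

def find_corrupted_assets(sha256_root):
    """Return stored files whose SHA-256 no longer matches their filename stem."""
    corrupted = []
    # Layout assumed: <sha256_root>/<first 6 hash chars>/<full hash><ext>
    for stored in sorted(Path(sha256_root).glob("*/*")):
        if not stored.is_file():
            continue
        actual = hashlib.sha256(stored.read_bytes()).hexdigest()
        if actual != stored.stem:  # stem is the filename without its extension
            corrupted.append(stored)
    return corrupted
```

Running this periodically over the store would flag files that were modified or damaged after being written, without needing the metadata database.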

#### ❌ Disadvantages
1. **Complex Recovery**: Lost database means lost name mappings
2. **Hash Collisions**: Theoretical risk with SHA-256 (extremely low)
3. **Migration Complexity**: Moving between systems requires database + files
4. **Debugging Difficulty**: Not human-readable file organization
5. **Initial Overhead**: Database setup and maintenance required
6. **Tool Integration**: External tools can't easily browse assets

---

## 🎯 Concept B: Content-Addressable Package System with Symlinks

### Architecture Overview

```
markitect_packages/
├── documents/
│   ├── doc1.mdpkg           # ZIP package per document
│   └── doc2.mdpkg
├── shared_assets/           # Deduplicated asset library
│   ├── images/
│   │   ├── content_hash_1.png
│   │   └── content_hash_2.jpg
│   └── registry.json       # Asset registry
└── workspace/               # Working directory with symlinks
    ├── doc1/
    │   ├── index.md
    │   └── assets/          # Symlinks to shared_assets
    │       ├── logo.png → ../../../shared_assets/images/content_hash_1.png
    │       └── chart.png → ../../../shared_assets/images/content_hash_1.png
    └── doc2/
```

### Key Components

#### 1. Package-Based Document Storage
```python
import zipfile
import json
from pathlib import Path

class PackageManager:
    def __init__(self, workspace_path):
        self.workspace = Path(workspace_path)
        self.shared_assets = self.workspace / "shared_assets"
        self.packages = self.workspace / "documents"  # matches documents/ in the layout above

        # Initialize directories
        for dir_path in [self.shared_assets, self.packages]:
            dir_path.mkdir(parents=True, exist_ok=True)

    def create_package(self, document_path, package_name):
        """Create .mdpkg from working directory."""
        package_path = self.packages / f"{package_name}.mdpkg"

        with zipfile.ZipFile(package_path, 'w', zipfile.ZIP_DEFLATED) as zf:
            # Add markdown file
            zf.write(document_path / "index.md", "index.md")

            # Add manifest
            manifest = self._create_manifest(document_path)
            zf.writestr("manifest.json", json.dumps(manifest, indent=2))

            # Add actual asset files (resolved from symlinks)
            assets_dir = document_path / "assets"
            if assets_dir.exists():
                for asset in assets_dir.iterdir():
                    if asset.is_symlink():
                        # Resolve symlink and add actual file
                        real_file = asset.resolve()
                        zf.write(real_file, f"assets/{asset.name}")
                    else:
                        zf.write(asset, f"assets/{asset.name}")

        return package_path
```
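
`create_package` relies on a `_create_manifest` helper that is not shown. One possible shape, written here as a standalone function, is sketched below. The `assets` list with a `name` entry per asset is what the extraction code later in this section reads; the `sha256`, `size`, `format_version`, and `document` keys are assumptions added for illustration and are not taken from the MarkdownPackageFormats specification.

```python
import hashlib
from pathlib import Path

def create_manifest(document_path):
    """Build a manifest dict listing each file in <document_path>/assets."""
    assets = []
    assets_dir = Path(document_path) / "assets"
    if assets_dir.is_dir():
        for asset in sorted(assets_dir.iterdir()):
            # is_file() follows symlinks, so dangling links and directories are skipped
            if not asset.is_file():
                continue
            content = asset.read_bytes()
            assets.append({
                "name": asset.name,                              # filename inside the package
                "sha256": hashlib.sha256(content).hexdigest(),   # hypothetical field
                "size": len(content),                            # hypothetical field
            })
    # "format_version" and "document" are illustrative keys, not a published standard
    return {"format_version": 1, "document": "index.md", "assets": assets}
```

Recording the hash per asset would let an importer skip files whose content is already in the shared store.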

#### 2. Symlink-Based Deduplication
```python
import hashlib
from datetime import datetime
from pathlib import Path

class AssetDeduplicator:
    def __init__(self, shared_assets_path):
        self.shared_assets = Path(shared_assets_path)
        self.registry_path = self.shared_assets / "registry.json"
        self.load_registry()

    def add_asset(self, asset_path, document_dir, desired_name):
        """Add asset with deduplication via symlinks."""
        content = Path(asset_path).read_bytes()
        content_hash = hashlib.sha256(content).hexdigest()

        # Check if content already exists
        existing_path = self._find_existing_asset(content_hash)

        if not existing_path:
            # Store new asset in shared location
            file_ext = Path(asset_path).suffix
            shared_path = self.shared_assets / "images" / f"{content_hash}{file_ext}"
            shared_path.parent.mkdir(parents=True, exist_ok=True)
            shared_path.write_bytes(content)

            # Update registry
            self.registry[content_hash] = {
                "path": str(shared_path.relative_to(self.shared_assets)),
                "size": len(content),
                "mime_type": self._get_mime_type(file_ext),
                "created": datetime.now().isoformat()
            }
            existing_path = shared_path

        # Create symlink in document directory
        asset_link = document_dir / "assets" / desired_name
        asset_link.parent.mkdir(parents=True, exist_ok=True)

        if asset_link.exists() or asset_link.is_symlink():
            asset_link.unlink()

        asset_link.symlink_to(existing_path.resolve())

        return existing_path
```
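
`add_asset` links to the resolved absolute path of the shared file, which breaks if the whole tree is moved, and it fails outright where symlink creation is not permitted. A variant of the linking step is sketched below under two assumptions: a relative target is preferred, and a plain copy (losing deduplication for that file) is an acceptable fallback. `link_or_copy` is a hypothetical helper name.

```python
import os
import shutil
from pathlib import Path

def link_or_copy(target, link_path):
    """Create a relative symlink at link_path pointing to target; copy on failure.

    Returns "symlink" or "copy" so the caller can record which was used.
    """
    target = Path(target)
    link_path = Path(link_path)
    link_path.parent.mkdir(parents=True, exist_ok=True)

    # Remove any previous file or (possibly dangling) link at this name
    if link_path.exists() or link_path.is_symlink():
        link_path.unlink()

    try:
        # Relative targets are resolved from the directory containing the link
        relative_target = os.path.relpath(target, start=link_path.parent)
        link_path.symlink_to(relative_target)
        return "symlink"
    except (OSError, NotImplementedError, ValueError):
        # Symlinks unavailable or not permitted, or no relative path exists
        # (for example, different drives on Windows): fall back to a copy
        shutil.copy2(target, link_path)
        return "copy"
```

Returning the mode makes it possible to report how many assets ended up as copies, which matters for the storage-efficiency goal.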

#### 3. Package Import/Export
```python
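class PackageHandler:
    def __init__(self, deduplicator):
        # AssetDeduplicator instance used to link extracted assets
        self.deduplicator = deduplicator
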
    def extract_package(self, package_path, workspace_dir):
        """Extract .mdpkg and set up symlinks."""
        extract_dir = workspace_dir / package_path.stem
        extract_dir.mkdir(parents=True, exist_ok=True)

        with zipfile.ZipFile(package_path, 'r') as zf:
            # Extract manifest first
            manifest = json.loads(zf.read("manifest.json"))

            # Extract markdown
            zf.extract("index.md", extract_dir)

            # Handle assets with deduplication
            for asset_info in manifest.get("assets", []):
                asset_name = asset_info["name"]

                # Extract to a temporary location. ZipFile.extract() recreates the
                # member's "assets/" prefix below the target directory.
                temp_root = extract_dir / "temp_assets"
                zf.extract(f"assets/{asset_name}", temp_root)
                temp_path = temp_root / "assets" / asset_name

                # Add through deduplicator (creates symlink)
                self.deduplicator.add_asset(temp_path, extract_dir, asset_name)

                # Clean up temporary file
                temp_path.unlink()

        return extract_dir
```

### Concept B: Pros and Cons

#### ✅ Advantages
1. **Visual Transparency**: Symlinks show actual file relationships clearly
2. **Tool Compatibility**: Standard tools can follow symlinks and work normally
3. **Package Portability**: `.mdpkg` files are self-contained ZIP archives
4. **Gradual Migration**: Can work with existing file-based workflows
5. **Backup Friendly**: Clear separation between packages and shared assets
6. **Standard Formats**: Uses ZIP and JSON, widely supported
7. **Working Directory**: Users see familiar file/folder structure

#### ❌ Disadvantages
1. **Platform Dependency**: Symlinks work differently on Windows vs Unix
2. **Sync Complexity**: Symlinks can break during cloud sync or backup
3. **Storage Overhead**: Registry + symlinks + actual files
4. **Permission Issues**: Symlink creation may require special permissions
5. **Broken Links**: Symlinks can become dangling if shared assets move
6. **Complexity**: More moving parts (packages + symlinks + registry)
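
The broken-link risk can at least be detected cheaply. The following is a minimal sketch, assuming a workspace directory laid out as in the architecture overview; `find_dangling_links` is a hypothetical helper name.

```python
import os
from pathlib import Path

def find_dangling_links(workspace_dir):
    """Return symlinks under workspace_dir whose targets no longer exist."""
    dangling = []
    # os.walk does not follow directory symlinks by default, so check both lists
    for dirpath, dirnames, filenames in os.walk(workspace_dir):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            # exists() follows the link, so a symlink that "does not exist" is dangling
            if path.is_symlink() and not path.exists():
                dangling.append(path)
    return sorted(dangling)
```

A check like this could run after a cloud sync or before packaging, so that missing assets are reported instead of silently dropped.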

---

## 📊 Concept Comparison Matrix

| Aspect | Concept A: Hash-Based Store | Concept B: Package + Symlinks |
|--------|---------------------------|------------------------------|
| **Deduplication Efficiency** | ⭐⭐⭐⭐⭐ Perfect | ⭐⭐⭐⭐⚪ Very Good |
| **Implementation Complexity** | ⭐⭐⭐⚪⚪ Moderate | ⭐⭐⚪⚪⚪ Complex |
| **Platform Compatibility** | ⭐⭐⭐⭐⭐ Universal | ⭐⭐⭐⚪⚪ Platform-dependent |
| **Tool Integration** | ⭐⭐⚪⚪⚪ Custom tools needed | ⭐⭐⭐⭐⚪ Standard tools work |
| **Storage Efficiency** | ⭐⭐⭐⭐⭐ Minimal | ⭐⭐⭐⭐⚪ Good |
| **User Experience** | ⭐⭐⭐⚪⚪ Learning curve | ⭐⭐⭐⭐⚪ Familiar |
| **Package Portability** | ⭐⭐⭐⚪⚪ Requires tooling | ⭐⭐⭐⭐⭐ Standard ZIP |
| **Recovery Robustness** | ⭐⭐⚪⚪⚪ Database dependent | ⭐⭐⭐⭐⚪ Self-documenting |
| **Performance** | ⭐⭐⭐⭐⭐ Fast hash lookup | ⭐⭐⭐⚪⚪ Filesystem dependent |
| **Maintenance** | ⭐⭐⭐⚪⚪ Database management | ⭐⭐⚪⚪⚪ Complex relationships |

## 🎯 Recommended Implementation Strategy

### Phase 1: Start with Concept B (Rapid Prototyping)
**Rationale**: Its familiar file/folder layout is easier to inspect, debug, and demonstrate, even though the full design has more moving parts
- Implement basic package creation/extraction
- Use simple file copying for initial version (add deduplication later)
- Focus on `.mdpkg` format compatibility with wiki specifications

### Phase 2: Add Deduplication (Hybrid Approach)
**Evolution**: Incorporate hash-based deduplication from Concept A
- Keep the package/symlink user interface from Concept B
- Add content hashing for deduplication backend
- Maintain content-addressable shared storage

### Phase 3: Advanced Features
- Content-based asset search and discovery
- Automatic format conversion and optimization
- Integration with markitect CLI commands
- Web interface for asset library browsing

## 🛠️ Python Library Recommendations

### Core Libraries (Standard Library)
- **`hashlib`** - Content hashing for deduplication
- **`sqlite3`** - Metadata and relationship storage
- **`zipfile`** - Package creation and extraction
- **`pathlib`** - Modern path handling
- **`json`** - Manifest and metadata serialization

### Additional Libraries (Optional)
- **`click`** - CLI interface (already available)
- **`Pillow`** - Image processing and format detection
- **`python-magic`** - MIME type detection
- **`watchdog`** - File system monitoring for auto-import
- **`send2trash`** - Safe file deletion
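
If `python-magic` is not adopted right away, the `_get_mime_type` helper referenced in Concept B could be approximated with the standard library's `mimetypes` module. This is a sketch under that assumption; it guesses from the extension only and does not inspect file content, so it is weaker than content-based detection. `get_mime_type` is a hypothetical standalone version of that helper.

```python
import mimetypes

def get_mime_type(file_ext, default="application/octet-stream"):
    """Guess a MIME type from an extension such as '.png'.

    Extension-based only: a mislabeled file gets the wrong type.
    """
    # guess_type expects a filename or URL, so attach the extension to a dummy name
    mime_type, _encoding = mimetypes.guess_type(f"file{file_ext}")
    return mime_type or default
```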

### Architecture Libraries
- **`sqlalchemy`** - Advanced database ORM (if complex queries needed)
- **`pydantic`** - Data validation and settings management
- **`rich`** - Beautiful CLI output and progress bars

## 📋 Implementation Checklist

### Core Functionality
- [ ] Asset content hashing and deduplication
- [ ] Markdown reference parsing and rewriting
- [ ] Package creation (.mdpkg ZIP format)
- [ ] Package extraction and workspace setup
- [ ] Asset registry and metadata management

### CLI Integration
- [ ] `markitect asset add` - Import assets into library
- [ ] `markitect asset dedupe` - Cleanup duplicate assets
- [ ] `markitect package create` - Create .mdpkg from directory
- [ ] `markitect package extract` - Extract .mdpkg to workspace
- [ ] `markitect asset list` - Browse asset library

### Advanced Features
- [ ] Automatic image format optimization
- [ ] Asset usage tracking and cleanup
- [ ] Batch import from directories
- [ ] Integration with md-explode/implode workflow
- [ ] Web-based asset browser interface

## 🚀 Next Steps

1. **Prototype Development**: Create minimal working implementation of Concept B
2. **CLI Integration**: Add basic asset management commands to markitect
3. **Testing**: Comprehensive testing with real-world markdown documents
4. **Documentation**: User guide for asset management workflow
5. **Community Feedback**: Gather input on the approach and API design

This design provides a solid foundation for efficient, deduplicated asset management while maintaining compatibility with existing markdown workflows and the MarkdownPackageFormats standards.

---

**Status**: 📋 **Concept Complete - Ready for Implementation Planning**