Issue #141: Asset Management Concepts for Images and File Includes
Date: October 8, 2025
Issue: #141 - Concept to handle images and other file includes
Status: 📋 CONCEPT PROPOSAL
Problem Statement
The goal is to create a system that can:
- Include images and files with markdown documents
- Keep them referenceable in the database/system
- Store them efficiently with automatic deduplication
- Handle duplicate content with different filenames seamlessly
Design Context
Based on the MarkdownPackageFormats wiki analysis, we have several proven patterns:
- ZIP-based packaging (.mdpkg, .mdz formats)
- Content-addressable storage patterns
- Manifest-based metadata systems
- Asset directory conventions (/assets, /images)
Core Requirements Analysis
Functional Requirements
- Content Deduplication: Same image content → single storage, multiple references
- Efficient Storage: Minimize disk space usage for asset libraries
- Referential Integrity: Maintain markdown → asset relationships
- Multiple Names: Support different filenames for same content
- Database Integration: Asset metadata queryable and indexable
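The deduplication and multiple-names requirements can be illustrated with a small in-memory sketch. The dictionaries `store` and `names` and the sample filenames are hypothetical stand-ins for the real storage and database layers, not part of the proposal.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest used as the storage key."""
    return hashlib.sha256(data).hexdigest()


# Two uploads share identical bytes under different filenames (made-up sample data).
uploads = {
    "logo.png": b"\x89PNG fake image bytes",
    "header-logo.png": b"\x89PNG fake image bytes",
    "chart.png": b"\x89PNG other bytes",
}

store = {}  # content hash -> bytes, each distinct content stored once
names = {}  # filename -> content hash, many names may share one hash
for filename, data in uploads.items():
    digest = content_hash(data)
    store.setdefault(digest, data)
    names[filename] = digest

print(len(uploads), "names,", len(store), "stored blobs")
```

Three names map to two stored blobs: the content is kept once while every filename stays resolvable.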
Non-Functional Requirements
- Performance: Fast asset lookup and retrieval
- Scalability: Handle large asset libraries (1000s of files)
- Portability: Assets packaged with markdown for distribution
- Maintainability: Clear separation of content and metadata
🎯 Concept A: Hash-Based Asset Store with Virtual Naming
Architecture Overview
markitect_assets/
├── store/ # Content-addressed storage
│ ├── sha256/
│ │ ├── a1b2c3.../ # First 6 chars of hash
│ │ │ └── full_hash.ext # Actual file
│ │ └── d4e5f6.../
│ └── metadata.db # SQLite database
├── cache/ # Processed/resized versions
└── manifest.json # Global asset registry
Key Components
1. Content-Addressed Storage
import hashlib
from pathlib import Path

class HashBasedAssetStore:
    def __init__(self, store_path):
        self.store_path = Path(store_path)
        self.store_path.mkdir(parents=True, exist_ok=True)

    def store_asset(self, file_path, original_name=None):
        """Store asset and return content hash."""
        content = Path(file_path).read_bytes()
        content_hash = hashlib.sha256(content).hexdigest()
        # Store in hash-based directory structure
        hash_dir = self.store_path / "store" / "sha256" / content_hash[:6]
        hash_dir.mkdir(parents=True, exist_ok=True)
        file_ext = Path(file_path).suffix
        stored_path = hash_dir / f"{content_hash}{file_ext}"
        # Identical content maps to the same path, so it is written only once
        if not stored_path.exists():
            stored_path.write_bytes(content)
        return content_hash
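The store above reads each file fully into memory before hashing. For large assets, a chunked variant may be preferable. The following is a minimal sketch; the helper name `hash_file` and the 1 MiB default chunk size are assumptions, not part of the proposal.

```python
import hashlib
from pathlib import Path


def hash_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file by reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is identical to hashing the whole content at once, so it could replace the `read_bytes()` plus `sha256()` pair without changing any stored hashes.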
2. Virtual Name Mapping Database
-- SQLite schema for asset management
CREATE TABLE assets (
    content_hash TEXT PRIMARY KEY,
    file_size INTEGER,
    mime_type TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    original_extension TEXT
);

CREATE TABLE asset_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content_hash TEXT,
    virtual_name TEXT,
    document_id TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (content_hash) REFERENCES assets(content_hash)
);

CREATE INDEX idx_asset_names_virtual ON asset_names(virtual_name);
CREATE INDEX idx_asset_names_document ON asset_names(document_id);
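A sketch of how this schema might be used from Python's `sqlite3` module follows. The functions `register_name` and `names_for`, their signatures, and the placeholder hash `abc123` are assumptions made for illustration; the proposal does not define them.

```python
import sqlite3

SCHEMA = """
CREATE TABLE assets (
    content_hash TEXT PRIMARY KEY,
    file_size INTEGER,
    mime_type TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    original_extension TEXT
);
CREATE TABLE asset_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content_hash TEXT,
    virtual_name TEXT,
    document_id TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (content_hash) REFERENCES assets(content_hash)
);
"""


def register_name(conn, content_hash, virtual_name, document_id,
                  file_size=None, mime_type=None, extension=None):
    """Record the asset once, then attach a (name, document) mapping to it."""
    conn.execute(
        "INSERT OR IGNORE INTO assets "
        "(content_hash, file_size, mime_type, original_extension) "
        "VALUES (?, ?, ?, ?)",
        (content_hash, file_size, mime_type, extension),
    )
    conn.execute(
        "INSERT INTO asset_names (content_hash, virtual_name, document_id) "
        "VALUES (?, ?, ?)",
        (content_hash, virtual_name, document_id),
    )


def names_for(conn, content_hash):
    """Return every virtual name registered for one content hash."""
    rows = conn.execute(
        "SELECT virtual_name FROM asset_names WHERE content_hash = ? ORDER BY id",
        (content_hash,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(SCHEMA)
# "abc123" is a placeholder, not a real SHA-256 digest
register_name(conn, "abc123", "logo.png", "doc1", file_size=2048)
register_name(conn, "abc123", "header-logo.png", "doc2", file_size=2048)
print(names_for(conn, "abc123"))
```

`INSERT OR IGNORE` keeps the `assets` row unique per hash, while `asset_names` accumulates one row per name and document.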
3. Markdown Integration
import re
from pathlib import Path

class MarkdownAssetProcessor:
    def __init__(self, asset_store):
        self.asset_store = asset_store

    def process_markdown_with_assets(self, md_content, document_id, asset_dir):
        """Process markdown and replace image references with hash-based ones."""
        asset_dir = Path(asset_dir)

        def replace_image_ref(match):
            alt_text, image_path = match.group(1), match.group(2)
            full_path = asset_dir / image_path
            if full_path.exists():
                # Store asset and get hash
                content_hash = self.asset_store.store_asset(full_path, image_path)
                # Register virtual name (register_name is assumed to add a row to asset_names)
                self.asset_store.register_name(content_hash, image_path, document_id)
                # Return hash-based reference; the asset:// scheme is an assumed placeholder
                return f'![{alt_text}](asset://{content_hash})'
            return match.group(0)  # Return original if file not found

        # Replace image references
        processed_md = re.sub(r'!\[(.*?)\]\(([^)]+)\)', replace_image_ref, md_content)
        return processed_md
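Before rewriting references, it may help to separate local image paths from remote URLs, since only local files can be stored. The sketch below is one possible approach; the pattern `IMAGE_REF` and the helper `local_image_refs` are hypothetical names, and the regular expression covers only the common inline image form, not the full Markdown grammar.

```python
import re

# Inline image: ![alt](target "optional title")
IMAGE_REF = re.compile(r'!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)(?:\s+"[^"]*")?\)')


def local_image_refs(md_content):
    """Return image targets that look like local paths (no URL scheme)."""
    refs = []
    for match in IMAGE_REF.finditer(md_content):
        target = match.group("target")
        # Anything starting with "scheme:" (http:, https:, data:, ...) is treated as remote
        if not re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*:", target):
            refs.append(target)
    return refs


sample = (
    "![Logo](images/logo.png)\n"
    '![Chart](assets/chart.png "Q3 results")\n'
    "![Remote](https://example.com/pic.png)\n"
)
print(local_image_refs(sample))
```

Note that a Windows drive path such as `C:\img.png` would be classified as remote by this scheme check, so a real implementation would need a more careful rule.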
Concept A: Pros and Cons
✅ Advantages
- Perfect Deduplication: Identical content stored only once regardless of filename
- Content Integrity: Hash verification ensures data hasn't been corrupted
- Efficient Storage: Minimum disk space usage for large asset libraries
- Fast Lookups: Hash-based access is O(1) for retrieval
- Version Agnostic: Same content = same hash, regardless of how it was added
- Referential Integrity: Virtual names maintain user-friendly references
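The content-integrity advantage can be sketched as a check that recomputes the hash and compares it with the stored filename. This assumes the `<content_hash><ext>` naming used by `store_asset`; the helper name `verify_asset` is hypothetical.

```python
import hashlib
from pathlib import Path


def verify_asset(stored_path):
    """Return True if the file's SHA-256 digest matches the hash in its filename."""
    stored_path = Path(stored_path)
    expected = stored_path.stem  # "<hash>.png" -> "<hash>"
    actual = hashlib.sha256(stored_path.read_bytes()).hexdigest()
    return actual == expected
```

A periodic sweep over the store directory with a check like this could flag corrupted files without consulting the database.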
❌ Disadvantages
- Complex Recovery: Lost database means lost name mappings
- Hash Collisions: Theoretical risk with SHA-256 (extremely low)
- Migration Complexity: Moving between systems requires database + files
- Debugging Difficulty: Not human-readable file organization
- Initial Overhead: Database setup and maintenance required
- Tool Integration: External tools can't easily browse assets
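One possible mitigation for the recovery risk, not part of the proposal itself, is to export the name mappings periodically to a JSON file such as the `manifest.json` shown in the architecture overview. The sketch below assumes the `asset_names` table from the schema above; `export_name_mappings`, the JSON layout, and the sample rows are hypothetical.

```python
import json
import sqlite3


def export_name_mappings(conn):
    """Serialize hash -> names so the mappings can survive loss of the database."""
    mapping = {}
    rows = conn.execute(
        "SELECT content_hash, virtual_name, document_id "
        "FROM asset_names ORDER BY id"
    )
    for content_hash, virtual_name, document_id in rows:
        mapping.setdefault(content_hash, []).append(
            {"name": virtual_name, "document": document_id}
        )
    return json.dumps(mapping, indent=2, sort_keys=True)


# Minimal in-memory table with placeholder hashes, for demonstration only
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE asset_names ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, content_hash TEXT, "
    "virtual_name TEXT, document_id TEXT)"
)
conn.executemany(
    "INSERT INTO asset_names (content_hash, virtual_name, document_id) "
    "VALUES (?, ?, ?)",
    [
        ("abc123", "logo.png", "doc1"),
        ("abc123", "header-logo.png", "doc2"),
        ("def456", "chart.png", "doc1"),
    ],
)
backup = export_name_mappings(conn)
print(backup)
```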
🎯 Concept B: Content-Addressable Package System with Symlinks
Architecture Overview
markitect_packages/
├── documents/
│   ├── doc1.mdpkg # ZIP package per document
│   └── doc2.mdpkg
├── shared_assets/ # Deduplicated asset library
│   ├── images/
│   │   ├── content_hash_1.png
│   │   └── content_hash_2.jpg
│   └── registry.json # Asset registry
└── workspace/ # Working directory with symlinks
    ├── doc1/
    │   ├── index.md
    │   └── assets/ # Symlinks to shared_assets
    │       ├── logo.png → ../../../shared_assets/images/content_hash_1.png
    │       └── chart.png → ../../../shared_assets/images/content_hash_1.png
    └── doc2/
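The relative links in this layout could be created as sketched below. `link_asset` is a hypothetical helper; it computes the link target relative to the directory that holds the link, so the tree stays valid if the whole `markitect_packages/` directory is moved.

```python
import os
from pathlib import Path


def link_asset(shared_file, link_path):
    """Create (or replace) a relative symlink at link_path pointing to shared_file."""
    shared_file, link_path = Path(shared_file), Path(link_path)
    link_path.parent.mkdir(parents=True, exist_ok=True)
    # is_symlink() also catches dangling links, which exists() reports as missing
    if link_path.is_symlink() or link_path.exists():
        link_path.unlink()
    # Symlink targets are resolved relative to the directory containing the link
    target = os.path.relpath(shared_file, start=link_path.parent)
    link_path.symlink_to(target)
    return target
```

For `workspace/doc1/assets/logo.png` the computed target climbs three levels (`assets`, `doc1`, `workspace`) before descending into `shared_assets/images/`.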
Key Components
1. Package-Based Document Storage
import zipfile
import json
from pathlib import Path

class PackageManager:
    def __init__(self, workspace_path):
        self.workspace = Path(workspace_path)
        self.shared_assets = self.workspace / "shared_assets"
        self.packages = self.workspace / "packages"
        # Initialize directories
        for dir_path in [self.shared_assets, self.packages]:
            dir_path.mkdir(parents=True, exist_ok=True)

    def _create_manifest(self, document_path):
        """Hypothetical minimal manifest: asset names only; other fields are left open."""
        assets_dir = document_path / "assets"
        names = sorted(p.name for p in assets_dir.iterdir()) if assets_dir.exists() else []
        return {"assets": [{"name": name} for name in names]}

    def create_package(self, document_path, package_name):
        """Create .mdpkg from working directory."""
        document_path = Path(document_path)
        package_path = self.packages / f"{package_name}.mdpkg"
        with zipfile.ZipFile(package_path, 'w', zipfile.ZIP_DEFLATED) as zf:
            # Add markdown file
            zf.write(document_path / "index.md", "index.md")
            # Add manifest
            manifest = self._create_manifest(document_path)
            zf.writestr("manifest.json", json.dumps(manifest, indent=2))
            # Add actual asset files (resolved from symlinks)
            assets_dir = document_path / "assets"
            if assets_dir.exists():
                for asset in assets_dir.iterdir():
                    if asset.is_symlink():
                        # Resolve symlink and add actual file
                        real_file = asset.resolve()
                        zf.write(real_file, f"assets/{asset.name}")
                    else:
                        zf.write(asset, f"assets/{asset.name}")
        return package_path
2. Symlink-Based Deduplication
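import hashlib
import json
import mimetypes
from datetime import datetime
from pathlib import Path

class AssetDeduplicator:
    def __init__(self, shared_assets_path):
        self.shared_assets = Path(shared_assets_path)
        self.registry_path = self.shared_assets / "registry.json"
        self.load_registry()

    # The three helpers below are minimal assumed implementations.
    def load_registry(self):
        if self.registry_path.exists():
            self.registry = json.loads(self.registry_path.read_text())
        else:
            self.registry = {}

    def _find_existing_asset(self, content_hash):
        entry = self.registry.get(content_hash)
        return self.shared_assets / entry["path"] if entry else None

    def _get_mime_type(self, file_ext):
        return mimetypes.guess_type(f"file{file_ext}")[0] or "application/octet-stream"

    def add_asset(self, asset_path, document_dir, desired_name):
        """Add asset with deduplication via symlinks."""
        content = Path(asset_path).read_bytes()
        content_hash = hashlib.sha256(content).hexdigest()
        # Check if content already exists
        existing_path = self._find_existing_asset(content_hash)
        if not existing_path:
            # Store new asset in shared location
            file_ext = Path(asset_path).suffix
            shared_path = self.shared_assets / "images" / f"{content_hash}{file_ext}"
            shared_path.parent.mkdir(parents=True, exist_ok=True)
            shared_path.write_bytes(content)
            # Update registry and persist it so later runs see the asset
            self.registry[content_hash] = {
                "path": str(shared_path.relative_to(self.shared_assets)),
                "size": len(content),
                "mime_type": self._get_mime_type(file_ext),
                "created": datetime.now().isoformat()
            }
            self.registry_path.write_text(json.dumps(self.registry, indent=2))
            existing_path = shared_path
        # Create symlink in document directory
        asset_link = Path(document_dir) / "assets" / desired_name
        asset_link.parent.mkdir(parents=True, exist_ok=True)
        if asset_link.exists() or asset_link.is_symlink():
            asset_link.unlink()
        asset_link.symlink_to(existing_path.resolve())
        return existing_path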
3. Package Import/Export
import json
import shutil
import zipfile
from pathlib import Path

class PackageHandler:
    def __init__(self, deduplicator):
        self.deduplicator = deduplicator  # e.g. an AssetDeduplicator

    def extract_package(self, package_path, workspace_dir):
        """Extract .mdpkg and set up symlinks."""
        package_path, workspace_dir = Path(package_path), Path(workspace_dir)
        extract_dir = workspace_dir / package_path.stem
        extract_dir.mkdir(parents=True, exist_ok=True)
        temp_dir = extract_dir / "temp_assets"
        with zipfile.ZipFile(package_path, 'r') as zf:
            # Extract manifest first
            manifest = json.loads(zf.read("manifest.json"))
            # Extract markdown
            zf.extract("index.md", extract_dir)
            # Handle assets with deduplication
            for asset_info in manifest.get("assets", []):
                asset_name = asset_info["name"]
                # Extract to temporary location; extract() keeps the archive path,
                # so the file lands at temp_assets/assets/<name>
                temp_path = Path(zf.extract(f"assets/{asset_name}", temp_dir))
                # Add through deduplicator (creates symlink)
                self.deduplicator.add_asset(temp_path, extract_dir, asset_name)
        # Clean up temporary files
        shutil.rmtree(temp_dir, ignore_errors=True)
        return extract_dir
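Before extracting, a package could be checked against its own manifest. The following sketch assumes the layout used above (`index.md`, `manifest.json`, and `assets/<name>` entries listed under the manifest's `assets` key); `validate_package` is a hypothetical helper, not part of the proposal.

```python
import json
import zipfile


def validate_package(package_path):
    """Return a list of problems found in an .mdpkg archive (empty if none)."""
    problems = []
    with zipfile.ZipFile(package_path) as zf:
        members = set(zf.namelist())
        for required in ("index.md", "manifest.json"):
            if required not in members:
                problems.append(f"missing {required}")
        if "manifest.json" in members:
            manifest = json.loads(zf.read("manifest.json"))
            # Every asset named in the manifest should exist in the archive
            for asset_info in manifest.get("assets", []):
                member = f"assets/{asset_info['name']}"
                if member not in members:
                    problems.append(f"missing {member}")
    return problems
```

Running this first lets `extract_package` fail early with a readable message instead of a `KeyError` from `zipfile` midway through extraction.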
Concept B: Pros and Cons
✅ Advantages
- Visual Transparency: Symlinks show actual file relationships clearly
- Tool Compatibility: Standard tools can follow symlinks and work normally
- Package Portability: .mdpkg files are self-contained ZIP archives
- Gradual Migration: Can work with existing file-based workflows
- Backup Friendly: Clear separation between packages and shared assets
- Standard Formats: Uses ZIP and JSON, widely supported
- Working Directory: Users see familiar file/folder structure
❌ Disadvantages
- Platform Dependency: Symlinks work differently on Windows vs Unix
- Sync Complexity: Symlinks can break during cloud sync or backup
- Storage Overhead: Registry + symlinks + actual files
- Permission Issues: Symlink creation may require special permissions
- Broken Links: Symlinks can become dangling if shared assets move
- Complexity: More moving parts (packages + symlinks + registry)
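The platform and permission issues could be softened with a fallback chain: try a symlink, then a hard link, then a plain copy. This is a sketch of one possible approach; `place_asset` and its return values are hypothetical, and a copy gives up deduplication for that file.

```python
import os
import shutil
from pathlib import Path


def place_asset(shared_file, dest_path):
    """Expose shared_file at dest_path: symlink, else hard link, else copy."""
    shared_file, dest_path = Path(shared_file), Path(dest_path)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    if dest_path.is_symlink() or dest_path.exists():
        dest_path.unlink()
    try:
        dest_path.symlink_to(shared_file.resolve())
        return "symlink"
    except (OSError, NotImplementedError):
        pass  # e.g. missing privilege on Windows or an unsupported filesystem
    try:
        os.link(shared_file, dest_path)  # hard link: same volume only
        return "hardlink"
    except OSError:
        shutil.copy2(shared_file, dest_path)
        return "copy"
```

The returned string records which mechanism was used, which a registry could store to decide later whether a file needs re-linking.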
📊 Concept Comparison Matrix
| Aspect | Concept A: Hash-Based Store | Concept B: Package + Symlinks |
|---|---|---|
| Deduplication Efficiency | ⭐⭐⭐⭐⭐ Perfect | ⭐⭐⭐⭐⚪ Very Good |
| Implementation Complexity | ⭐⭐⭐⚪⚪ Moderate | ⭐⭐⚪⚪⚪ Complex |
| Platform Compatibility | ⭐⭐⭐⭐⭐ Universal | ⭐⭐⭐⚪⚪ Platform-dependent |
| Tool Integration | ⭐⭐⚪⚪⚪ Custom tools needed | ⭐⭐⭐⭐⚪ Standard tools work |
| Storage Efficiency | ⭐⭐⭐⭐⭐ Minimal | ⭐⭐⭐⭐⚪ Good |
| User Experience | ⭐⭐⭐⚪⚪ Learning curve | ⭐⭐⭐⭐⚪ Familiar |
| Package Portability | ⭐⭐⭐⚪⚪ Requires tooling | ⭐⭐⭐⭐⭐ Standard ZIP |
| Recovery Robustness | ⭐⭐⚪⚪⚪ Database dependent | ⭐⭐⭐⭐⚪ Self-documenting |
| Performance | ⭐⭐⭐⭐⭐ Fast hash lookup | ⭐⭐⭐⚪⚪ Filesystem dependent |
| Maintenance | ⭐⭐⭐⚪⚪ Database management | ⭐⭐⚪⚪⚪ Complex relationships |
🎯 Recommended Implementation Strategy
Phase 1: Start with Concept B (Rapid Prototyping)
Rationale: Easier to understand, debug, and demonstrate
- Implement basic package creation/extraction
- Use simple file copying for initial version (add deduplication later)
- Focus on .mdpkg format compatibility with wiki specifications
Phase 2: Add Deduplication (Hybrid Approach)
Evolution: Incorporate hash-based deduplication from Concept A
- Keep the package/symlink user interface from Concept B
- Add content hashing for deduplication backend
- Maintain content-addressable shared storage
Phase 3: Advanced Features
- Content-based asset search and discovery
- Automatic format conversion and optimization
- Integration with markitect CLI commands
- Web interface for asset library browsing
🛠️ Python Library Recommendations
Core Libraries (Standard Library)
- hashlib - Content hashing for deduplication
- sqlite3 - Metadata and relationship storage
- zipfile - Package creation and extraction
- pathlib - Modern path handling
- json - Manifest and metadata serialization
Additional Libraries (Optional)
- click - CLI interface (already available)
- Pillow - Image processing and format detection
- python-magic - MIME type detection
- watchdog - File system monitoring for auto-import
- send2trash - Safe file deletion
Architecture Libraries
- sqlalchemy - Advanced database ORM (if complex queries needed)
- pydantic - Data validation and settings management
- rich - Beautiful CLI output and progress bars
📋 Implementation Checklist
Core Functionality
- Asset content hashing and deduplication
- Markdown reference parsing and rewriting
- Package creation (.mdpkg ZIP format)
- Package extraction and workspace setup
- Asset registry and metadata management
CLI Integration
- markitect asset add - Import assets into library
- markitect asset dedupe - Clean up duplicate assets
- markitect package create - Create .mdpkg from directory
- markitect package extract - Extract .mdpkg to workspace
- markitect asset list - Browse asset library
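The command layout above can be sketched as nested subcommands. To stay dependency-free, this example uses the standard library's `argparse` rather than `click`; the argument names (`paths`, `directory`, `package`) are assumptions, and no command is wired to real behavior.

```python
import argparse


def build_parser():
    """Build a parser for the proposed `markitect asset ...` and `markitect package ...` commands."""
    parser = argparse.ArgumentParser(prog="markitect")
    groups = parser.add_subparsers(dest="group", required=True)

    asset = groups.add_parser("asset").add_subparsers(dest="command", required=True)
    add = asset.add_parser("add", help="Import assets into library")
    add.add_argument("paths", nargs="+")
    asset.add_parser("dedupe", help="Clean up duplicate assets")
    asset.add_parser("list", help="Browse asset library")

    package = groups.add_parser("package").add_subparsers(dest="command", required=True)
    create = package.add_parser("create", help="Create .mdpkg from directory")
    create.add_argument("directory")
    extract = package.add_parser("extract", help="Extract .mdpkg to workspace")
    extract.add_argument("package")
    return parser


args = build_parser().parse_args(["asset", "add", "logo.png", "chart.png"])
print(args.group, args.command, args.paths)
```

The same two-level structure maps directly onto `click` groups if the project keeps `click` as its CLI framework.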
Advanced Features
- Automatic image format optimization
- Asset usage tracking and cleanup
- Batch import from directories
- Integration with md-explode/implode workflow
- Web-based asset browser interface
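For the usage tracking and cleanup item, one possible starting point is a query for stored assets that no document name refers to. The sketch assumes the `assets` and `asset_names` tables from Concept A, reduced here to the columns the query needs; `unreferenced_assets` and the sample rows are hypothetical.

```python
import sqlite3


def unreferenced_assets(conn):
    """Return hashes of stored assets that no virtual name points to."""
    rows = conn.execute(
        "SELECT a.content_hash FROM assets AS a "
        "LEFT JOIN asset_names AS n ON n.content_hash = a.content_hash "
        "WHERE n.id IS NULL ORDER BY a.content_hash"
    )
    return [row[0] for row in rows]


# Reduced in-memory tables with placeholder hashes, for demonstration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assets (content_hash TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE asset_names ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, content_hash TEXT, "
    "virtual_name TEXT, document_id TEXT)"
)
conn.executemany("INSERT INTO assets VALUES (?)", [("used",), ("orphan",)])
conn.execute(
    "INSERT INTO asset_names (content_hash, virtual_name, document_id) "
    "VALUES ('used', 'logo.png', 'doc1')"
)
print(unreferenced_assets(conn))
```

A cleanup command could list these hashes first and delete the files only after confirmation.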
🚀 Next Steps
- Prototype Development: Create minimal working implementation of Concept B
- CLI Integration: Add basic asset management commands to markitect
- Testing: Comprehensive testing with real-world markdown documents
- Documentation: User guide for asset management workflow
- Community Feedback: Gather input on the approach and API design
This design provides a solid foundation for efficient, deduplicated asset management while maintaining compatibility with existing markdown workflows and the MarkdownPackageFormats standards.
Status: 📋 Concept Complete - Ready for Implementation Planning