A modular document processing pipeline for Markdown to PDF/SVG/PNG conversion with OCR
Document Processing Pipeline
A comprehensive document processing system that converts Markdown files through a complete pipeline: Markdown → PDF → SVG → PNG → OCR → Search → Dashboard.
🏗️ Refactored Package Structure
The codebase has been restructured into a modular Python package:
processor/
├── __init__.py                 # Package initialization
├── __main__.py                 # CLI entry point
├── core/
│   ├── __init__.py
│   └── document_processor.py   # Main processor class
├── converters/
│   ├── __init__.py
│   ├── markdown_converter.py   # Markdown to PDF conversion
│   └── pdf_converter.py        # PDF to SVG/PNG conversion
└── utils/
    ├── __init__.py
    ├── ocr_processor.py        # OCR processing
    ├── file_utils.py           # File operations
    ├── html_utils.py           # HTML generation
    └── metadata_utils.py       # Metadata handling
This modular structure improves:
- Code organization and maintainability
- Separation of concerns
- Testability
- Reusability of components
- Extensibility
🚀 Features
- Multi-format conversion: Markdown to PDF with styling
- SVG embedding: PDF embedded as base64 data URI in SVG containers
- Image extraction: PDF pages converted to PNG with base64 encoding
- OCR processing: Text extraction with confidence scoring
- Metadata tracking: JSON metadata throughout the pipeline
- File system search: Automatic SVG file discovery
- Interactive dashboard: HTML table with SVG thumbnails
- Automated workflow: Makefile-driven pipeline
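The SVG embedding feature can be illustrated with a minimal sketch. This README does not show how enclose lays out its SVG containers, so the `foreignObject`/`embed` wrapper, the function names `embed_pdf_in_svg` and `extract_pdf_from_svg`, and the A4-like default size are all assumptions made for illustration.

```python
import base64
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

PDF_URI_PREFIX = "data:application/pdf;base64,"


def embed_pdf_in_svg(pdf_bytes: bytes, width: int = 595, height: int = 842) -> str:
    """Return SVG markup carrying the PDF as a base64 data URI (hypothetical layout)."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    data_uri = PDF_URI_PREFIX + encoded
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" viewBox="0 0 {width} {height}">\n'
        f'  <foreignObject width="100%" height="100%">\n'
        f'    <embed xmlns="http://www.w3.org/1999/xhtml" type="application/pdf" '
        f'src={quoteattr(data_uri)} width="100%" height="100%"/>\n'
        f'  </foreignObject>\n'
        f'</svg>\n'
    )


def extract_pdf_from_svg(svg_text: str) -> bytes:
    """Recover the original PDF bytes from an SVG produced by embed_pdf_in_svg."""
    root = ET.fromstring(svg_text)
    for element in root.iter():
        src = element.get("src", "")
        if src.startswith(PDF_URI_PREFIX):
            return base64.b64decode(src[len(PDF_URI_PREFIX):])
    raise ValueError("no embedded PDF data URI found")
```

Because the PDF travels inside the SVG as text, the round trip through `extract_pdf_from_svg` returns the same bytes that went in; the cost is roughly a one-third size increase from base64 encoding.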
📋 Prerequisites
System Dependencies
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils libcairo2-dev
macOS:
brew install tesseract poppler cairo
Windows:
- Install Tesseract OCR
- Install Poppler
Python Requirements
- Python 3.7+
- pip3
🛠️ Installation
Quick Setup
# Clone or download the project files
git clone https://github.com/veridock/enclose.git
cd enclose
# Install dependencies
make install
Manual Setup
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install Python packages
pip install -r requirements.txt
🎯 Usage
Quick Start
# Run complete pipeline
make all
Step-by-Step Execution
1. Create Example Files
   make create
   - Generates invoice_example.md
2. Process Documents
   make process
   - Converts MD → PDF → SVG → PNG
   - Performs OCR processing
   - Creates metadata JSON
3. Search & Enclose
   make search   # Find all SVG files
   make enclose  # Create dashboard
4. View Results
   - Dashboard opens automatically in browser
   - Access: output/dashboard.html
Individual Commands
# Python script direct usage
python processor.py --step create
python processor.py --step process
python processor.py --step search
python processor.py --step enclose
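A command-line interface of this shape could be wired up with `argparse`. The sketch below is not the project's actual entry point; the names `STEPS`, `build_parser`, and `main` are hypothetical, and only the `--step` values and the `--verbose` flag come from this README.

```python
import argparse

# Step names taken from the commands documented above.
STEPS = ("create", "process", "search", "enclose")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting --step <name> and an optional --verbose flag."""
    parser = argparse.ArgumentParser(
        prog="processor.py",
        description="Run one step of the document processing pipeline.",
    )
    parser.add_argument("--step", choices=STEPS, required=True,
                        help="pipeline step to run")
    parser.add_argument("--verbose", action="store_true",
                        help="enable verbose output")
    return parser


def main(argv=None) -> str:
    """Parse arguments and return the selected step (dispatch is left out)."""
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f"Running step: {args.step}")
    return args.step
```

Restricting `--step` with `choices` makes argparse reject unknown step names with a usage message instead of failing later in the pipeline.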
📁 Project Structure
enclose/
├── Makefile                      # Build automation
├── processor.py                  # Main processing pipeline
├── requirements.txt              # Python dependencies
├── setup.sh                      # System setup script
├── README.md                     # Project documentation
├── venv/                         # Virtual environment (created)
└── output/                       # Generated files (created)
    ├── invoice_example.md        # Source markdown
    ├── invoice_example.pdf       # Generated PDF
    ├── invoice_example.svg       # SVG with embedded PDF
    ├── page_1.png                # Extracted PNG pages
    ├── page_N.png                # (multiple pages if needed)
    ├── metadata.json             # Processing metadata
    ├── svg_search_results.json   # Search results
    └── dashboard.html            # Interactive dashboard
🔄 Pipeline Workflow
Step 1: CREATE
├── Generate example markdown file (invoice)
└── Output: invoice_example.md
Step 2: MARKDOWN → PDF
├── Convert markdown to styled HTML
├── Generate PDF with CSS styling
└── Output: invoice_example.pdf
Step 3: PDF → SVG
├── Embed PDF as base64 data URI
├── Add SVG metadata (RDF/Dublin Core)
└── Output: invoice_example.svg + metadata.json
Step 4: PDF → PNG
├── Extract PDF pages as PNG images
├── Convert PNG to base64 encoding
└── Output: page_*.png + updated metadata
Step 5: OCR PROCESSING
├── Extract text from PNG images
├── Calculate confidence scores
└── Output: updated metadata with OCR data
Step 6: FILESYSTEM SEARCH
├── Scan for all SVG files
├── Parse SVG metadata
└── Output: svg_search_results.json
Step 7: DASHBOARD CREATION
├── Generate HTML table with thumbnails
├── Embed SVG previews
└── Output: dashboard.html (opens in browser)
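Step 6 (filesystem search) can be sketched with the standard library alone. The project's real search logic is not shown here, so the function names `inspect_svg` and `search_svgs` and the result keys are assumptions; the sketch simply walks a directory tree, checks each SVG for an RDF block and an embedded PDF data URI, and collects the findings.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDF_MARKER = b"data:application/pdf;base64,"


def inspect_svg(path: Path) -> dict:
    """Collect basic facts about one SVG file (hypothetical result keys)."""
    raw = path.read_bytes()
    try:
        root = ET.fromstring(raw)
        has_rdf = root.find(f".//{{{RDF_NS}}}RDF") is not None
    except ET.ParseError:
        # Malformed SVGs are still listed, just without metadata.
        has_rdf = False
    stat = path.stat()
    return {
        "path": str(path),
        "size": stat.st_size,
        "modified": stat.st_mtime,
        "has_pdf_data": PDF_MARKER in raw,
        "has_metadata": has_rdf,
    }


def search_svgs(root_dir) -> list:
    """Recursively find *.svg files under root_dir and inspect each one."""
    return [inspect_svg(p) for p in sorted(Path(root_dir).rglob("*.svg"))]
```

The returned list is plain dictionaries, so it can be written to a file such as svg_search_results.json with `json.dump`.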
📊 Output Files
Metadata Structure
{
  "file": "path/to/file.svg",
  "type": "svg_with_pdf",
  "created": "2025-06-25T10:30:00",
  "pdf_embedded": true,
  "total_pages": 1,
  "pages": [
    {
      "page": 1,
      "file": "page_1.png",
      "base64": "iVBORw0KGgoAAAANSU...",
      "ocr_text": "Invoice #INV-2025-001...",
      "ocr_confidence": 95.7,
      "word_count": 45
    }
  ]
}
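A metadata file with this structure is easy to consume from Python. The helper below is a sketch, not part of the package: `summarize_metadata` is a hypothetical name, and it relies only on the keys shown in the structure above.

```python
def summarize_metadata(metadata: dict) -> dict:
    """Aggregate per-page OCR figures from a metadata dictionary."""
    pages = metadata.get("pages", [])
    confidences = [p["ocr_confidence"] for p in pages if "ocr_confidence" in p]
    return {
        "file": metadata.get("file"),
        # True when the declared page count matches the listed pages.
        "page_count_consistent": metadata.get("total_pages") == len(pages),
        "total_words": sum(p.get("word_count", 0) for p in pages),
        "mean_confidence": (
            sum(confidences) / len(confidences) if confidences else None
        ),
    }
```

Typical use would be to load the file with `json.load` and pass the resulting dictionary to `summarize_metadata`, for example to flag documents whose mean OCR confidence falls below a chosen threshold.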
Dashboard Features
- SVG Thumbnails: Direct embedding of SVG files
- File Information: Path, size, modification date
- PDF Detection: Indicates embedded PDF data
- Metadata Status: Shows RDF metadata presence
- Interactive Links: Click to open files
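A dashboard with these columns could be generated along the following lines. This is an illustrative sketch only: `render_dashboard` and the entry keys (`path`, `size`, `has_pdf_data`, `has_metadata`) are assumptions, not the package's actual API.

```python
from html import escape


def render_dashboard(entries: list) -> str:
    """Render an HTML table with one row per SVG entry (hypothetical keys)."""
    rows = []
    for entry in entries:
        # Escape paths so characters such as & or " cannot break the markup.
        path = escape(entry["path"], quote=True)
        pdf_flag = "yes" if entry["has_pdf_data"] else "no"
        meta_flag = "yes" if entry["has_metadata"] else "no"
        rows.append(
            "<tr>"
            f'<td><a href="{path}"><img src="{path}" width="120" alt="thumbnail"/></a></td>'
            f"<td>{path}</td>"
            f"<td>{entry['size']}</td>"
            f"<td>{pdf_flag}</td>"
            f"<td>{meta_flag}</td>"
            "</tr>"
        )
    header = (
        "<tr><th>Preview</th><th>Path</th><th>Size (bytes)</th>"
        "<th>Embedded PDF</th><th>RDF metadata</th></tr>"
    )
    body = "\n".join(rows)
    return (
        "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">"
        "<title>SVG Dashboard</title></head><body>\n"
        f"<table border=\"1\">\n{header}\n{body}\n</table>\n</body></html>\n"
    )
```

Referencing each SVG through an `img` tag lets the browser scale it to thumbnail size, while the surrounding link opens the full file.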
🛠️ Makefile Targets
| Target | Description |
|---|---|
| install | Install dependencies in virtual environment |
| create | Create example markdown file |
| process | Run conversion pipeline (steps 2-5) |
| search | Search filesystem for SVG files |
| enclose | Create HTML dashboard |
| clean | Remove generated files |
| clean-all | Remove everything including venv |
| help | Show available commands |
🔧 Configuration
OCR Language Support
# Install additional languages
sudo apt-get install tesseract-ocr-pol # Polish
sudo apt-get install tesseract-ocr-deu # German
# Configure in processor.py
pytesseract.image_to_string(image, lang='pol+eng')
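Confidence scores like the `ocr_confidence` field can be derived from `pytesseract.image_to_data`, which reports a per-word confidence alongside the text. How enclose computes its score is not documented here, so the averaging below and the name `summarize_ocr` are assumptions. The function takes the dictionary that `image_to_data(..., output_type=pytesseract.Output.DICT)` returns, which keeps the sketch runnable without Tesseract installed.

```python
def summarize_ocr(data: dict) -> dict:
    """Reduce pytesseract image_to_data output to text, mean confidence, word count.

    `data` is expected to hold parallel "text" and "conf" lists, as returned by
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).
    """
    words, confidences = [], []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some versions report confidences as strings
        # Tesseract uses -1 for layout boxes that contain no recognized word.
        if text.strip() and conf >= 0:
            words.append(text.strip())
            confidences.append(conf)
    mean_conf = round(sum(confidences) / len(confidences), 1) if confidences else 0.0
    return {
        "ocr_text": " ".join(words),
        "ocr_confidence": mean_conf,
        "word_count": len(words),
    }


# Example call with real OCR (requires pytesseract, Pillow, and Tesseract):
#   import pytesseract
#   from PIL import Image
#   data = pytesseract.image_to_data(Image.open("output/page_1.png"),
#                                    lang="pol+eng",
#                                    output_type=pytesseract.Output.DICT)
#   print(summarize_ocr(data))
```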
PDF Styling
Modify the CSS in the markdown_to_pdf() method:
styled_html = f"""
<style>
body {{ font-family: 'Your Font', sans-serif; }}
/* Add custom styles */
</style>
"""
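The doubled braces in the snippet above are needed because the template is an f-string: `{{` and `}}` produce literal CSS braces, while single braces interpolate Python values. A self-contained sketch of that pattern follows; `build_styled_html` and its parameters are hypothetical names, not the package's API.

```python
def build_styled_html(body_html: str, font_family: str = "Arial") -> str:
    """Wrap an HTML fragment in a full page with an inline stylesheet."""
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
  body {{ font-family: '{font_family}', sans-serif; margin: 2cm; }}
  table {{ border-collapse: collapse; width: 100%; }}
  th, td {{ border: 1px solid #999; padding: 4px 8px; }}
</style>
</head>
<body>
{body_html}
</body>
</html>"""


# Possible follow-up, assuming the markdown and weasyprint packages are installed:
#   import markdown
#   from weasyprint import HTML
#   page = build_styled_html(markdown.markdown("# Invoice"), "DejaVu Sans")
#   HTML(string=page).write_pdf("output/invoice_example.pdf")
```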
🐛 Troubleshooting
Common Issues
OCR Not Working:
# Check tesseract installation
tesseract --version
# Install language packs
sudo apt-get install tesseract-ocr-eng
PDF Conversion Fails:
# Check weasyprint dependencies
pip install --upgrade weasyprint
SVG Rendering Issues:
# Install cairo development libraries
sudo apt-get install libcairo2-dev
Debug Mode
# Enable verbose output
python processor.py --step process --verbose
📝 License
This project is open source. See LICENSE file for details.
🤝 Contributing
- Fork the repository
- Create feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open Pull Request
📞 Support
- Issues: GitHub Issues
- Documentation: This README
- Examples: Check the output/ directory after running the pipeline
🎉 Quick Demo
# Complete setup and demo
make install
make all
# View results
open output/dashboard.html # macOS
xdg-open output/dashboard.html # Linux
The dashboard will show your processed documents with interactive thumbnails and metadata!