A modular document processing pipeline for Markdown to PDF/SVG/PNG conversion with OCR
Enclose
A comprehensive document processing pipeline for Markdown to PDF/SVG/PNG conversion with OCR capabilities.
graph LR
A[Markdown] -->|Parse| B[HTML]
B -->|Convert| C[PDF]
C -->|Embed| D[SVG]
D -->|Extract| E[PNG]
E -->|Process| F[OCR]
F -->|Index| G[Search]
G -->|Visualize| H[Dashboard]
Features
graph LR
A[Input Formats] --> B[Markdown]
A --> C[PDF]
B --> D[Converters]
C --> D
D --> E[Output Formats]
E --> F[PDF]
E --> G[SVG]
E --> H[PNG]
E --> I[HTML]
style A fill:#f9f,stroke:#333
style E fill:#9f9,stroke:#333
- Multi-format conversion: Convert between Markdown, PDF, SVG, and PNG
- SVG embedding: Embed PDFs as base64 data URIs in SVG containers
- Image extraction: Extract high-quality images from PDFs
- OCR processing: Extract text with confidence scoring
- Metadata tracking: Preserve and enhance metadata throughout processing
- Interactive dashboard: View and search processed documents
File Format Validation
Enclose includes comprehensive file format validation to ensure the integrity and correctness of all converted files:
graph TD
A[Input Validation] --> B[Conversion]
B --> C[Output Validation]
C --> D[Verification]
style A fill:#d4f1f9,stroke:#333
style C fill:#d4f1f9,stroke:#333
style D fill:#d4f1f9,stroke:#333
Validation Checks
PDF Files
- Valid PDF signature (%PDF header)
- Correct MIME type (application/pdf)
- File integrity verification
SVG Files
- Valid XML structure
- Correct MIME type (image/svg+xml)
- Basic SVG tag validation
PNG Files
- Valid PNG signature (magic bytes)
- Correct MIME type (image/png)
- Image data integrity check
- PIL verification of image data
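The signature and structure checks listed above can be approximated with the standard library alone. This is a minimal sketch, not Enclose's actual validator: the function names `looks_like_pdf`, `looks_like_png`, and `looks_like_svg` are hypothetical, and the MIME type and PIL verification steps are left out.

```python
import xml.etree.ElementTree as ET

PDF_MAGIC = b"%PDF"                      # PDF files start with a %PDF header
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"         # 8-byte PNG file signature
SVG_TAGS = ("svg", "{http://www.w3.org/2000/svg}svg")


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data begins with the PDF header signature."""
    return data.startswith(PDF_MAGIC)


def looks_like_png(data: bytes) -> bool:
    """Return True if the data begins with the PNG magic bytes."""
    return data.startswith(PNG_MAGIC)


def looks_like_svg(data: bytes) -> bool:
    """Return True if the data is well-formed XML whose root element is <svg>."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False
    return root.tag in SVG_TAGS
```

A signature check only proves that a file starts correctly; it says nothing about whether the rest of the file is intact, which is why the list above also includes integrity checks.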
Example Validation Output
# When running tests, you'll see validation output like:
PASSED tests/test_file_formats.py::test_pdf_conversion
PASSED tests/test_file_formats.py::test_svg_conversion
PASSED tests/test_file_formats.py::test_png_conversion
Documentation
For complete documentation, please visit our documentation site.
Quick Start
Prerequisites
- Python 3.8+
- Poetry (for development)
Installation
- Clone the repository:
  git clone https://github.com/yourusername/enclose.git
  cd enclose
- Install dependencies:
  poetry install
- Install the package in development mode:
  poetry install
Basic Usage
- List supported formats:
  enclose --list
- Convert a Markdown file to another format:
  # Basic conversion (outputs to current directory with default name)
  enclose example.md pdf
  # Specify output filename
  enclose example.md pdf -o output.pdf
  # Convert to SVG
  enclose example.md svg -o output.svg
  # Convert to PNG
  enclose example.md png -o output.png
  # Convert to HTML
  enclose example.md html -o output.html
Example
- First, create a test Markdown file or use the provided example.md
- Convert it to different formats:
  # Convert to PDF
  enclose example.md pdf -o example.pdf
  # Convert to SVG
  enclose example.md svg -o example.svg
Important Notes
- The -o or --output flag requires a full file path with extension (e.g., output.pdf, ./output.svg)
- If no output is specified, the output is saved in the current directory with a default name based on the input file
- The output directory must exist before running the command
- The output will be saved to output/example.pdf
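The output rules in these notes can be sketched as a small path-resolution helper. This is an illustration under stated assumptions, not Enclose's own code: the function name `resolve_output` is hypothetical, and the default name (input stem plus the format as extension) is a guess at what "a default name based on the input file" means.

```python
from pathlib import Path
from typing import Optional


def resolve_output(input_path: str, fmt: str, output: Optional[str] = None) -> Path:
    """Pick the output file path for a conversion, following the notes above."""
    if output is None:
        # Assumed default: <input stem>.<format> in the current directory.
        return Path.cwd() / f"{Path(input_path).stem}.{fmt}"

    out = Path(output)
    if out.suffix == "":
        # -o expects a file path with an extension, not a bare name.
        raise ValueError("output must include a file extension, e.g. output.pdf")
    if not out.parent.exists():
        # The output directory is not created automatically.
        raise FileNotFoundError(f"output directory does not exist: {out.parent}")
    return out
```

Failing early on a missing directory gives a clearer error than letting the converter fail halfway through writing a file.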
Command Line Options
usage: enclose [-h] [--version] [--list] [-o OUTPUT] [input] [{pdf,png,svg,html}]
A document processing tool for format conversion.
positional arguments:
input Input file path (markdown, pdf, etc.)
{pdf,png,svg,html} Output format
options:
-h, --help show this help message and exit
--version show program's version number and exit
--list List supported formats and conversions
-o OUTPUT, --output OUTPUT
Output directory (default: current directory)
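For readers who want to see how a command line of this shape is typically declared, here is a minimal `argparse` sketch that yields a similar usage string. It is an assumption about structure, not the project's actual parser: `build_parser` is a hypothetical name and the version string is a placeholder.

```python
import argparse

FORMATS = ["pdf", "png", "svg", "html"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options shown in the help text above."""
    parser = argparse.ArgumentParser(
        prog="enclose",
        description="A document processing tool for format conversion.",
    )
    # Placeholder version; the real value would come from package metadata.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    parser.add_argument("--list", action="store_true",
                        help="List supported formats and conversions")
    parser.add_argument("-o", "--output", help="Output location")
    # Both positionals are optional so that `enclose --list` works on its own.
    parser.add_argument("input", nargs="?",
                        help="Input file path (markdown, pdf, etc.)")
    parser.add_argument("format", nargs="?", choices=FORMATS,
                        help="Output format")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Declaring both positionals with `nargs="?"` is what makes them appear in square brackets in the usage line.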
Development
To run tests:
make test
To run linting:
make lint
To run type checking:
make typecheck
Prerequisites
- Python 3.8.1+
- Poetry for dependency management
- System dependencies (see Installation Guide)
Installation
# Clone the repository
git clone https://github.com/veridock/enclose.git
cd enclose
# Install the package
make install
Basic Usage
# Process a document
enclose process example.md -o output/
# View the results
open output/dashboard.html # macOS
# or
xdg-open output/dashboard.html # Linux
Documentation Structure
- Getting Started - Installation and setup
- User Guide - Command reference and usage examples
- Architecture - System design and components
- Development - Contributing and development setup
Features in Detail
Document Conversion
- Markdown to PDF with custom styling
- PDF to SVG with embedded fonts
- High-quality image extraction
Advanced Processing
- OCR text extraction with confidence scoring
- Metadata extraction and management
- Batch processing support
Command Line Interface
- Intuitive command structure
- Configurable output formats
- Progress tracking
Example Workflow
sequenceDiagram
participant User
participant CLI
participant Processor
User->>CLI: enclose process doc.md
CLI->>Processor: Process document
Processor->>Processor: Convert Markdown to PDF
Processor->>Processor: Generate SVG with embedded PDF
Processor->>Processor: Extract images
Processor->>Processor: Process OCR
Processor-->>CLI: Processing complete
CLI-->>User: Results in output/
Project Structure
enclose/
├── docs/                  # Documentation
├── processor/             # Main package
│   ├── __init__.py
│   ├── __main__.py        # CLI entry point
│   ├── core/              # Core processing logic
│   ├── converters/        # Format converters
│   └── utils/             # Utility functions
├── scripts/               # Helper scripts
├── tests/                 # Test suite
└── pyproject.toml         # Project configuration
Contributing
Contributions are welcome! Please see our Contributing Guide for details.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Testing
To run the test suite:
make test
make lint
Development Workflow
- Set up development environment
  make install
- Run tests
  make test
- Format and check code
  make format
  make lint
- Run the development server
  make dev
  - Dashboard opens automatically in browser
  - Access: output/dashboard.html
CLI Commands
Basic Usage
# Convert a document
enclose input.md pdf -o output.pdf
# List available formats
enclose --list
# Show help
enclose --help
Command Structure
flowchart TD
A[enclose] --> B[input_file output_format options]
A --> C[--list]
A --> D[--help]
A --> E[--version]
B --> F[input_file]
B --> G[output_format]
B --> H[options]
H --> I[-o/--output]
H --> J[--dpi]
H --> K[--quality]
style B fill:#9f9,stroke:#333
style C fill:#99f,stroke:#333
style D fill:#99f,stroke:#333
Common Examples
# Convert Markdown to PDF
enclose document.md pdf -o output.pdf
# Convert PDF to high-quality PNG
enclose document.pdf png --dpi 300 -o output.png
# List all supported formats
enclose --list
Advanced Options
# Set output DPI for images
enclose convert input.pdf png --dpi 150
# Set image quality (1-100)
enclose convert input.pdf jpg --quality 90
# Process multiple files
for f in *.md; do enclose convert "$f" pdf -o output/; done
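The shell loop above can also be written in Python when each input needs its own explicit output file. The sketch below only builds the command lines. It assumes the `enclose <input> <format> -o <file>` form shown under Command Line Options rather than the `convert` form used in this section, and `build_commands` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def build_commands(src_dir: Path, out_dir: Path, fmt: str = "pdf") -> List[List[str]]:
    """Return one enclose command line per Markdown file found in src_dir."""
    commands = []
    for md in sorted(src_dir.glob("*.md")):
        # Each input gets its own output file, named after the input stem.
        target = out_dir / f"{md.stem}.{fmt}"
        commands.append(["enclose", str(md), fmt, "-o", str(target)])
    return commands
```

To execute the batch, create the output directory first with `out_dir.mkdir(parents=True, exist_ok=True)` and pass each command to `subprocess.run(cmd, check=True)`.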
Project Structure
graph TD
A[Project Root] --> B[Source Code]
A --> C[Documentation]
A --> D[Build System]
A --> E[Tests]
B --> F[enclose/]:::dir
F --> G[__init__.py]:::file
F --> H[__main__.py]:::file
F --> I[core/]:::dir
F --> J[converters/]:::dir
F --> K[utils/]:::dir
C --> L[docs/]:::dir
C --> M[README.md]:::file
D --> N[pyproject.toml]:::file
D --> O[Makefile]:::file
E --> P[tests/]:::dir
classDef dir fill:#e1f5fe,stroke:#01579b,stroke-width:2px
classDef file fill:#e8f5e9,stroke:#2e7d32,stroke-width:1.5px
Key Directories
- enclose/ - Main Python package
  - core/ - Core processing logic and document handling
  - converters/ - File format conversion modules
  - utils/ - Utility functions and helpers
  - __main__.py - CLI entry point
- docs/ - Comprehensive documentation
  - getting-started/ - Installation and setup guides
  - architecture/ - System design and components
  - usage/ - User guides and examples
- tests/ - Test suite
  - Unit tests
  - Integration tests
  - Test fixtures
Key Files
- pyproject.toml - Project configuration and dependencies
- Makefile - Common development tasks
- scripts/enclose - Global CLI wrapper script
- .github/workflows/ - CI/CD pipelines
Workflow
flowchart LR
A[Input] -->|Markdown/PDF| B(enclose)
B --> C{Format?}
C -->|Markdown| D[Parse Markdown]
D --> E[Generate HTML]
E --> F[Convert to PDF]
C -->|PDF| G[Process PDF]
G --> H[Extract Content]
F & H --> I[Generate Outputs]
I --> J[SVG/PNG/HTML]
I --> K[Metadata]
J --> L[Dashboard]
style A fill:#e3f2fd,stroke:#1565c0
style B fill:#e8f5e9,stroke:#2e7d32
style L fill:#fff3e0,stroke:#e65100
Processing Steps
- Input Handling
  - Accepts Markdown or PDF files
  - Validates input format and content
- Conversion
  - Markdown → HTML → PDF
  - PDF → Images/Text
- Output Generation
  - Generate SVG/PNG/HTML outputs
  - Extract and process metadata
- Visualization
  - Create interactive dashboard
  - Enable search and filtering
Pipeline Workflow
Step 1: CREATE
├── Generate example markdown file (invoice)
└── Output: invoice_example.md
Step 2: MARKDOWN → PDF
├── Convert markdown to styled HTML
├── Generate PDF with CSS styling
└── Output: invoice_example.pdf
Step 3: PDF → SVG
├── Embed PDF as base64 data URI
├── Add SVG metadata (RDF/Dublin Core)
└── Output: invoice_example.svg + metadata.json
Step 4: PDF → PNG
├── Extract PDF pages as PNG images
├── Convert PNG to base64 encoding
└── Output: page_*.png + updated metadata
Step 5: OCR PROCESSING
├── Extract text from PNG images
├── Calculate confidence scores
└── Output: updated metadata with OCR data
Step 6: FILESYSTEM SEARCH
├── Scan for all SVG files
├── Parse SVG metadata
└── Output: svg_search_results.json
Step 7: DASHBOARD CREATION
├── Generate HTML table with thumbnails
├── Embed SVG previews
└── Output: dashboard.html (opens in browser)
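Step 3 is the least familiar part of the pipeline, so here is a minimal sketch of one way to embed a PDF in an SVG as a base64 data URI with RDF/Dublin Core metadata. It is an assumption about the approach, not the project's implementation: the function names are hypothetical, and wrapping the PDF in an XHTML `object` inside `foreignObject` is just one possible container.

```python
import base64
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SVG_NS = "http://www.w3.org/2000/svg"
XHTML_NS = "http://www.w3.org/1999/xhtml"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def embed_pdf_in_svg(pdf_bytes: bytes, title: str) -> str:
    """Wrap PDF bytes in an SVG document as a base64 data URI."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return (
        f'<svg xmlns="{SVG_NS}" width="595" height="842">\n'
        f'  <metadata>\n'
        f'    <rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">\n'
        f'      <rdf:Description>\n'
        f'        <dc:title>{escape(title)}</dc:title>\n'
        f'        <dc:format>application/pdf</dc:format>\n'
        f'      </rdf:Description>\n'
        f'    </rdf:RDF>\n'
        f'  </metadata>\n'
        f'  <foreignObject width="100%" height="100%">\n'
        f'    <object xmlns="{XHTML_NS}" type="application/pdf"\n'
        f'            data="data:application/pdf;base64,{encoded}"/>\n'
        f'  </foreignObject>\n'
        f'</svg>\n'
    )


def extract_pdf_from_svg(svg_text: str) -> bytes:
    """Recover the embedded PDF bytes from an SVG produced above."""
    root = ET.fromstring(svg_text)
    obj = root.find(f".//{{{XHTML_NS}}}object")
    if obj is None:
        raise ValueError("no embedded PDF object found")
    # The payload is everything after the first comma of the data URI.
    return base64.b64decode(obj.get("data").split(",", 1)[1])
```

Base64 encoding grows the payload by about one third, so the resulting SVG is noticeably larger than the PDF it carries.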
Output Files
Metadata Structure
{
"file": "path/to/file.svg",
"type": "svg_with_pdf",
"created": "2025-06-25T10:30:00",
"pdf_embedded": true,
"total_pages": 1,
"pages": [
{
"page": 1,
"file": "page_1.png",
"base64": "iVBORw0KGgoAAAANSU...",
"ocr_text": "Invoice #INV-2025-001...",
"ocr_confidence": 95.7,
"word_count": 45
}
]
}
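Because the metadata is plain JSON, it is easy to post-process. The sketch below summarizes a metadata document of the shape shown above. The `summarize` function is hypothetical, and the sample omits the `base64` field for brevity.

```python
import json
from statistics import mean

SAMPLE = """
{
  "file": "path/to/file.svg",
  "type": "svg_with_pdf",
  "created": "2025-06-25T10:30:00",
  "pdf_embedded": true,
  "total_pages": 1,
  "pages": [
    {"page": 1, "file": "page_1.png", "ocr_confidence": 95.7, "word_count": 45}
  ]
}
"""


def summarize(metadata: dict) -> dict:
    """Collect page count, total words, and mean OCR confidence."""
    pages = metadata.get("pages", [])
    confidences = [p["ocr_confidence"] for p in pages if "ocr_confidence" in p]
    return {
        "file": metadata["file"],
        "pages": len(pages),
        "words": sum(p.get("word_count", 0) for p in pages),
        # None signals that no page carried OCR data.
        "mean_confidence": mean(confidences) if confidences else None,
    }


if __name__ == "__main__":
    print(summarize(json.loads(SAMPLE)))
```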
Dashboard Features
- SVG Thumbnails: Direct embedding of SVG files
- File Information: Path, size, modification date
- PDF Detection: Indicates embedded PDF data
- Metadata Status: Shows RDF metadata presence
- Interactive Links: Click to open files
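The file information, PDF detection, and metadata status columns could be gathered roughly as follows. This is a sketch under assumptions, not the dashboard's real code: `inspect_svg` and `scan_svgs` are hypothetical names, and PDF detection is reduced to searching for a `data:application/pdf;base64,` prefix.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List

RDF_TAG = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"
PDF_DATA_URI = b"data:application/pdf;base64,"


def inspect_svg(path: Path) -> dict:
    """Collect the per-file facts a dashboard row would display."""
    data = path.read_bytes()
    try:
        has_rdf = ET.fromstring(data).find(f".//{RDF_TAG}") is not None
    except ET.ParseError:
        has_rdf = False  # Malformed XML: report no metadata rather than fail.
    stat = path.stat()
    return {
        "path": str(path),
        "size": stat.st_size,
        "modified": stat.st_mtime,
        "pdf_embedded": PDF_DATA_URI in data,
        "has_rdf_metadata": has_rdf,
    }


def scan_svgs(directory: Path) -> List[dict]:
    """Inspect every SVG file below the given directory, in sorted order."""
    return [inspect_svg(p) for p in sorted(directory.rglob("*.svg"))]
```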
Makefile Targets
| Target | Description |
|---|---|
| install | Install dependencies in virtual environment |
| create | Create example markdown file |
| process | Run conversion pipeline (steps 2-5) |
| search | Search filesystem for SVG files |
| enclose | Create HTML dashboard |
| clean | Remove generated files |
| clean-all | Remove everything including venv |
| help | Show available commands |
Configuration
OCR Language Support
# Install additional languages
sudo apt-get install tesseract-ocr-pol # Polish
sudo apt-get install tesseract-ocr-deu # German
# Configure in processor.py
pytesseract.image_to_string(image, lang='pol+eng')
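`image_to_string` returns text only. Per-word confidences are available from `pytesseract.image_to_data(image, lang='pol+eng', output_type=pytesseract.Output.DICT)`, which returns parallel `text` and `conf` lists. The sketch below shows one plausible way to turn that dictionary into the `ocr_text`, `ocr_confidence`, and `word_count` fields. The averaging rule and the function name `summarize_ocr` are assumptions, not the project's documented method.

```python
def summarize_ocr(data: dict) -> dict:
    """Reduce image_to_data-style output to text, mean confidence, and word count."""
    words = []
    confidences = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # Depending on the version, conf may be a str or a number.
        if conf < 0 or not text.strip():
            continue  # Tesseract reports -1 for boxes that contain no word.
        words.append(text)
        confidences.append(conf)
    mean_conf = round(sum(confidences) / len(confidences), 1) if confidences else 0.0
    return {
        "ocr_text": " ".join(words),
        "ocr_confidence": mean_conf,
        "word_count": len(words),
    }
```

The function takes a plain dictionary, so it can be exercised without Tesseract installed.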
PDF Styling
Modify CSS in markdown_to_pdf() method:
styled_html = f"""
<style>
body {{ font-family: 'Your Font', sans-serif; }}
/* Add custom styles */
</style>
"""
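The doubled braces in the snippet above matter: inside an f-string, `{{` and `}}` produce literal braces, so CSS rules survive while single-brace placeholders are substituted. A minimal sketch of a helper that builds the styled page follows. `build_styled_html` and its parameters are hypothetical and do not reflect the actual body of `markdown_to_pdf()`.

```python
def build_styled_html(body_html: str, font_family: str = "Your Font") -> str:
    """Wrap an HTML fragment in a page with an embedded stylesheet."""
    # {{ and }} become literal CSS braces; {font_family} is substituted.
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
    body {{ font-family: '{font_family}', sans-serif; }}
    /* Add custom styles */
</style>
</head>
<body>
{body_html}
</body>
</html>"""
```

The resulting string can then be handed to the PDF renderer, for example `weasyprint.HTML(string=html).write_pdf("out.pdf")` if WeasyPrint is the backend in use.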
Troubleshooting
Common Issues
OCR Not Working:
# Check tesseract installation
tesseract --version
# Install language packs
sudo apt-get install tesseract-ocr-eng
PDF Conversion Fails:
# Check weasyprint dependencies
pip install --upgrade weasyprint
SVG Rendering Issues:
# Install cairo development libraries
sudo apt-get install libcairo2-dev
Debug Mode
# Enable verbose output
python processor.py --step process --verbose
License
This project is open source. See LICENSE file for details.
Contributing
- Fork the repository
- Create feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open Pull Request
Support
- Issues: GitHub Issues
- Documentation: This README
- Examples: Check output/ directory after running pipeline
Quick Demo
# Complete setup and demo
make install
make all
# View results
open output/dashboard.html # macOS
xdg-open output/dashboard.html # Linux
The dashboard will show your processed documents with interactive thumbnails and metadata!