A modular document processing pipeline for Markdown to PDF/SVG/PNG conversion with OCR

Document Processing Pipeline

A document processing system that takes Markdown files through a complete pipeline: Markdown → PDF → SVG → PNG → OCR → Search → Dashboard.

🏗️ Refactored Package Structure

The codebase has been restructured into a modular Python package:

processor/
├── __init__.py          # Package initialization
├── __main__.py          # CLI entry point
├── core/
│   ├── __init__.py
│   └── document_processor.py  # Main processor class
├── converters/
│   ├── __init__.py
│   ├── markdown_converter.py  # Markdown to PDF conversion
│   └── pdf_converter.py       # PDF to SVG/PNG conversion
└── utils/
    ├── __init__.py
    ├── ocr_processor.py       # OCR processing
    ├── file_utils.py          # File operations
    ├── html_utils.py          # HTML generation
    └── metadata_utils.py      # Metadata handling

This modular structure improves:

  • Code organization and maintainability
  • Separation of concerns
  • Testability
  • Reusability of components
  • Extensibility

🚀 Features

  • Multi-format conversion: Markdown to PDF with styling
  • SVG embedding: PDF embedded as base64 data URI in SVG containers
  • Image extraction: PDF pages converted to PNG with base64 encoding
  • OCR processing: Text extraction with confidence scoring
  • Metadata tracking: JSON metadata throughout the pipeline
  • File system search: Automatic SVG file discovery
  • Interactive dashboard: HTML table with SVG thumbnails
  • Automated workflow: Makefile-driven pipeline
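
The SVG embedding feature can be pictured with a small sketch. Only the idea of a base64 data URI inside an SVG container comes from the list above; the function names, the `foreignObject`/`embed` wrapper, and the default page size are illustrative assumptions, not the project's actual code.

```python
import base64
import xml.etree.ElementTree as ET

PDF_URI_PREFIX = "data:application/pdf;base64,"


def embed_pdf_in_svg(pdf_bytes: bytes, width: int = 595, height: int = 842) -> str:
    """Return SVG markup carrying the PDF as a base64 data URI (hypothetical layout)."""
    data_uri = PDF_URI_PREFIX + base64.b64encode(pdf_bytes).decode("ascii")
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<foreignObject width="{width}" height="{height}">'
        f'<embed xmlns="http://www.w3.org/1999/xhtml" type="application/pdf" '
        f'src="{data_uri}" width="{width}" height="{height}"/>'
        "</foreignObject></svg>"
    )


def extract_pdf_from_svg(svg_text: str) -> bytes:
    """Recover the original PDF bytes from markup produced by embed_pdf_in_svg."""
    root = ET.fromstring(svg_text)
    for element in root.iter():
        src = element.get("src", "")
        if src.startswith(PDF_URI_PREFIX):
            return base64.b64decode(src[len(PDF_URI_PREFIX):])
    raise ValueError("no embedded PDF data URI found")
```

Because the PDF travels inside the SVG as text, a single file can be copied around and the original bytes recovered later with `extract_pdf_from_svg`.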

📋 Prerequisites

System Dependencies

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils libcairo2-dev

macOS:

brew install tesseract poppler cairo

Windows:

Python Requirements

  • Python 3.7+
  • pip3

🛠️ Installation

Quick Setup

# Clone or download the project files
git clone https://github.com/veridock/enclose.git
cd enclose

# Install dependencies
make install

Manual Setup

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Python packages
pip install -r requirements.txt

🎯 Usage

Quick Start

# Run complete pipeline
make all

Step-by-Step Execution

  1. Create Example Files

    make create
    
    • Generates invoice_example.md
  2. Process Documents

    make process
    
    • Converts MD → PDF → SVG → PNG
    • Performs OCR processing
    • Creates metadata JSON
  3. Search & enclose

    make search   # Find all SVG files
    make enclose  # Create dashboard
    
  4. View Results

    • Dashboard opens automatically in browser
    • Access: output/dashboard.html

Individual Commands

# Python script direct usage
python processor.py --step create
python processor.py --step process
python processor.py --step search
python processor.py --step enclose
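
A command-line interface like the one above could be wired up roughly as follows. This is a sketch, assuming `argparse` is used and that `--step` is mandatory; the `--step` values and the `--verbose` flag are taken from this README, while the function names are hypothetical.

```python
import argparse

# Step names as they appear in the commands above.
STEPS = ("create", "process", "search", "enclose")


def build_parser() -> argparse.ArgumentParser:
    """Create the argument parser (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="processor.py", description="Document processing pipeline"
    )
    parser.add_argument("--step", choices=STEPS, required=True,
                        help="pipeline step to run")
    parser.add_argument("--verbose", action="store_true",
                        help="enable verbose output")
    return parser


def main(argv=None) -> str:
    """Parse arguments and return the selected step name."""
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f"running step: {args.step}")
    # A real implementation would dispatch to the matching pipeline step here.
    return args.step
```

Restricting `--step` with `choices` means a mistyped step name is rejected by the parser before any processing starts.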

📁 Project Structure

enclose/
├── Makefile                 # Build automation
├── processor.py            # Main processing pipeline
├── requirements.txt        # Python dependencies
├── setup.sh               # System setup script
├── README.md              # Project documentation
├── venv/                  # Virtual environment (created)
└── output/                # Generated files (created)
    ├── invoice_example.md     # Source markdown
    ├── invoice_example.pdf    # Generated PDF
    ├── invoice_example.svg    # SVG with embedded PDF
    ├── page_1.png            # Extracted PNG pages
    ├── page_N.png            # (multiple pages if needed)
    ├── metadata.json         # Processing metadata
    ├── svg_search_results.json # Search results
    └── dashboard.html        # Interactive dashboard

🔄 Pipeline Workflow

Step 1: CREATE
├── Generate example markdown file (invoice)
└── Output: invoice_example.md

Step 2: MARKDOWN → PDF
├── Convert markdown to styled HTML
├── Generate PDF with CSS styling
└── Output: invoice_example.pdf

Step 3: PDF → SVG
├── Embed PDF as base64 data URI
├── Add SVG metadata (RDF/Dublin Core)
└── Output: invoice_example.svg + metadata.json

Step 4: PDF → PNG
├── Extract PDF pages as PNG images
├── Convert PNG to base64 encoding
└── Output: page_*.png + updated metadata

Step 5: OCR PROCESSING
├── Extract text from PNG images
├── Calculate confidence scores
└── Output: updated metadata with OCR data

Step 6: FILESYSTEM SEARCH
├── Scan for all SVG files
├── Parse SVG metadata
└── Output: svg_search_results.json

Step 7: DASHBOARD CREATION
├── Generate HTML table with thumbnails
├── Embed SVG previews
└── Output: dashboard.html (opens in browser)
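
Step 6 of the workflow above could be sketched with the standard library alone. The function name, the result keys, and the plain substring checks for embedded PDF data and RDF metadata are assumptions made for illustration; the project may parse SVG files differently.

```python
from pathlib import Path


def find_svg_files(root):
    """Recursively collect SVG files under root with a few basic facts about each."""
    results = []
    for path in sorted(Path(root).rglob("*.svg")):
        text = path.read_text(encoding="utf-8", errors="replace")
        results.append({
            "path": str(path),
            "size": path.stat().st_size,
            # Assumed markers: a PDF data URI and an RDF metadata element.
            "pdf_embedded": "data:application/pdf;base64," in text,
            "has_rdf_metadata": "<rdf:RDF" in text,
        })
    return results
```

The returned list is plain data, so it could be written out with `json.dump` to produce a file such as svg_search_results.json.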

📊 Output Files

Metadata Structure

{
  "file": "path/to/file.svg",
  "type": "svg_with_pdf",
  "created": "2025-06-25T10:30:00",
  "pdf_embedded": true,
  "total_pages": 1,
  "pages": [
    {
      "page": 1,
      "file": "page_1.png",
      "base64": "iVBORw0KGgoAAAANSU...",
      "ocr_text": "Invoice #INV-2025-001...",
      "ocr_confidence": 95.7,
      "word_count": 45
    }
  ]
}
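
As an illustration of how this structure might be consumed, the sketch below summarizes a metadata dictionary. The key names come from the example above; the `summarize_metadata` helper itself is hypothetical.

```python
def summarize_metadata(metadata: dict) -> dict:
    """Condense a metadata dictionary into per-document totals."""
    pages = metadata.get("pages", [])
    confidences = [p["ocr_confidence"] for p in pages if "ocr_confidence" in p]
    return {
        "file": metadata.get("file"),
        "pages": len(pages),
        "total_words": sum(p.get("word_count", 0) for p in pages),
        # None signals that no page carries OCR data yet.
        "mean_confidence": (
            sum(confidences) / len(confidences) if confidences else None
        ),
    }
```

The dictionary would typically be obtained with `json.load` from metadata.json before being passed to the helper.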

Dashboard Features

  • SVG Thumbnails: Direct embedding of SVG files
  • File Information: Path, size, modification date
  • PDF Detection: Indicates embedded PDF data
  • Metadata Status: Shows RDF metadata presence
  • Interactive Links: Click to open files
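
A dashboard with these features could be generated along the following lines. This is a simplified sketch: the entry keys (`path`, `size`, `pdf_embedded`, `has_rdf_metadata`) and the table layout are assumptions, and the modification date column is left out for brevity.

```python
from html import escape


def render_dashboard(entries):
    """Build a minimal HTML dashboard table from SVG search results."""
    rows = []
    for entry in entries:
        # Escape the path so special characters cannot break the markup.
        path = escape(entry["path"], quote=True)
        pdf = "yes" if entry["pdf_embedded"] else "no"
        rdf = "yes" if entry["has_rdf_metadata"] else "no"
        rows.append(
            "<tr>"
            f'<td><img src="{path}" alt="thumbnail" width="120"></td>'
            f'<td><a href="{path}">{path}</a></td>'
            f"<td>{int(entry['size'])} bytes</td>"
            f"<td>{pdf}</td><td>{rdf}</td>"
            "</tr>"
        )
    header = (
        "<tr><th>Preview</th><th>File</th><th>Size</th>"
        "<th>PDF</th><th>RDF</th></tr>"
    )
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        "<title>Dashboard</title></head><body><table>"
        + header + "".join(rows) + "</table></body></html>"
    )
```

Referencing each SVG through an `img` tag lets the browser render the thumbnail directly, and the link in the second column opens the file itself.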

🛠️ Makefile Targets

Target      Description
install     Install dependencies in virtual environment
create      Create example markdown file
process     Run conversion pipeline (steps 2-5)
search      Search filesystem for SVG files
enclose     Create HTML dashboard
clean       Remove generated files
clean-all   Remove everything including venv
help        Show available commands

🔧 Configuration

OCR Language Support

# Install additional languages
sudo apt-get install tesseract-ocr-pol  # Polish
sudo apt-get install tesseract-ocr-deu  # German

# Then select the languages in processor.py (Python code, not a shell command)
pytesseract.image_to_string(image, lang='pol+eng')
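
The confidence score and word count stored in the metadata could be derived from word-level OCR output. The sketch below assumes the input is a dictionary shaped like the result of `pytesseract.image_to_data(image, lang='pol+eng', output_type=pytesseract.Output.DICT)`, which has parallel `text` and `conf` lists; the `summarize_ocr` helper and the averaging rule are illustrative, not the project's confirmed method.

```python
def summarize_ocr(data: dict) -> dict:
    """Reduce word-level OCR output to text, mean confidence, and word count."""
    words, confidences = [], []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # values may arrive as strings or numbers
        # Tesseract reports -1 for layout entries that are not words.
        if text.strip() and conf >= 0:
            words.append(text)
            confidences.append(conf)
    mean = sum(confidences) / len(confidences) if confidences else 0.0
    return {
        "ocr_text": " ".join(words),
        "ocr_confidence": round(mean, 1),
        "word_count": len(words),
    }
```

Keeping the summary separate from the Tesseract call makes it possible to test the scoring logic without an OCR engine installed.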

PDF Styling

Modify the CSS in the markdown_to_pdf() method:

styled_html = f"""
<style>
    body {{ font-family: 'Your Font', sans-serif; }}
    /* Add custom styles */
</style>
"""
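
One way to keep the styling configurable is to build the styled HTML in a small helper. The sketch below is an assumption about how such a helper might look; `build_styled_html` and its parameters are hypothetical, and only the doubled-brace CSS pattern comes from the snippet above.

```python
def build_styled_html(body_html: str,
                      font_family: str = "DejaVu Sans",
                      extra_css: str = "") -> str:
    """Wrap converted Markdown (already HTML) in a page with custom CSS."""
    # Doubled braces are literal braces inside an f-string.
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
    body {{ font-family: '{font_family}', sans-serif; }}
    {extra_css}
</style>
</head>
<body>
{body_html}
</body>
</html>"""
```

If the PDF step uses weasyprint, as the troubleshooting section suggests, the returned string could be rendered with `weasyprint.HTML(string=styled_html).write_pdf("output/invoice_example.pdf")`.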

🐛 Troubleshooting

Common Issues

OCR Not Working:

# Check tesseract installation
tesseract --version

# Install language packs
sudo apt-get install tesseract-ocr-eng

PDF Conversion Fails:

# Upgrade weasyprint and its Python dependencies
pip install --upgrade weasyprint

SVG Rendering Issues:

# Install cairo development libraries
sudo apt-get install libcairo2-dev

Debug Mode

# Enable verbose output
python processor.py --step process --verbose

📝 License

This project is open source. See LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

📞 Support

  • Issues: GitHub Issues
  • Documentation: This README
  • Examples: Check output/ directory after running pipeline

🎉 Quick Demo

# Complete setup and demo
make install
make all

# View results
open output/dashboard.html  # macOS
xdg-open output/dashboard.html  # Linux

The dashboard will show your processed documents with interactive thumbnails and metadata!
