A modular document processing pipeline for Markdown to PDF/SVG/PNG conversion with OCR

Document Processing Pipeline

A document processing system that runs Markdown files through a seven-step pipeline: Markdown → PDF → SVG → PNG → OCR → Search → Dashboard.

🏗️ Refactored Package Structure

The codebase has been restructured into a modular Python package:

processor/
├── __init__.py          # Package initialization
├── __main__.py          # CLI entry point
├── core/
│   ├── __init__.py
│   └── document_processor.py  # Main processor class
├── converters/
│   ├── __init__.py
│   ├── markdown_converter.py  # Markdown to PDF conversion
│   └── pdf_converter.py       # PDF to SVG/PNG conversion
└── utils/
    ├── __init__.py
    ├── ocr_processor.py       # OCR processing
    ├── file_utils.py          # File operations
    ├── html_utils.py          # HTML generation
    └── metadata_utils.py      # Metadata handling

This modular structure improves:

Code organization and maintainability
Separation of concerns
Testability
Reusability of components
Ease of extension

🚀 Features

Multi-format conversion: Markdown to PDF with styling
SVG embedding: PDF embedded as base64 data URI in SVG containers
Image extraction: PDF pages converted to PNG with base64 encoding
OCR processing: Text extraction with confidence scoring
Metadata tracking: JSON metadata throughout the pipeline
File system search: Automatic SVG file discovery
Interactive dashboard: HTML table with SVG thumbnails
Automated workflow: Makefile-driven pipeline
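
The SVG embedding feature above can be illustrated with a minimal sketch. It is not the package's actual code: the function names `embed_pdf_in_svg` and `extract_pdf_from_svg`, the use of an `<image>` element, and the default A4 point size are all assumptions chosen for illustration. Only the idea of carrying the PDF as a base64 data URI inside an SVG comes from this README.

```python
import base64
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SVG_NS = "http://www.w3.org/2000/svg"
PDF_PREFIX = "data:application/pdf;base64,"


def embed_pdf_in_svg(pdf_bytes: bytes, title: str, width: int = 595, height: int = 842) -> str:
    """Return an SVG document that carries the PDF as a base64 data URI.

    Hypothetical layout: a single <image> element holds the payload.
    """
    payload = base64.b64encode(pdf_bytes).decode("ascii")
    return (
        f'<svg xmlns="{SVG_NS}" width="{width}" height="{height}">'
        f"<title>{escape(title)}</title>"
        f'<image width="{width}" height="{height}" href="{PDF_PREFIX}{payload}"/>'
        "</svg>"
    )


def extract_pdf_from_svg(svg_text: str) -> bytes:
    """Recover the original PDF bytes from an SVG built by embed_pdf_in_svg."""
    root = ET.fromstring(svg_text)
    image = root.find(f"{{{SVG_NS}}}image")
    if image is None:
        raise ValueError("no <image> element found")
    href = image.get("href", "")
    if not href.startswith(PDF_PREFIX):
        raise ValueError("image does not carry a PDF data URI")
    return base64.b64decode(href[len(PDF_PREFIX):])
```

Because base64 output uses only characters that are safe inside an XML attribute, the payload needs no further escaping; only the title is escaped.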

📋 Prerequisites

System Dependencies

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils libcairo2-dev

macOS:

brew install tesseract poppler cairo

Windows:

Install Tesseract OCR
Install Poppler

Python Requirements

Python 3.7+
pip3

🛠️ Installation

Quick Setup

# Clone or download the project files
git clone https://github.com/veridock/enclose.git
cd enclose

# Install dependencies
make install

Manual Setup

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Python packages
pip install -r requirements.txt

🎯 Usage

Quick Start

# Run complete pipeline
make all

Step-by-Step Execution

Create Example Files

```
make create
```
- Generates invoice_example.md

Process Documents

```
make process
```
- Converts MD → PDF → SVG → PNG
- Performs OCR processing
- Creates metadata JSON

Search & Enclose

make search   # Find all SVG files
make enclose  # Create dashboard

View Results

- Dashboard opens automatically in browser
- Access: output/dashboard.html

Individual Commands

# Python script direct usage
python processor.py --step create
python processor.py --step process
python processor.py --step search
python processor.py --step enclose
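
For readers who want to see how a `--step` interface like the one above could be wired up, here is a small sketch using `argparse`. The `--step` values and the `--verbose` flag come from this README; making `--step` required, the `handlers` mapping, and the function names are assumptions and may not match the real `processor.py`.

```python
import argparse
from typing import Callable, Dict, Sequence

# Step names as used in the commands above.
STEPS = ("create", "process", "search", "enclose")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting --step and --verbose."""
    parser = argparse.ArgumentParser(
        prog="processor.py",
        description="Run one step of the document pipeline.",
    )
    # Assumption: a step must always be given explicitly.
    parser.add_argument("--step", choices=STEPS, required=True,
                        help="pipeline step to run")
    parser.add_argument("--verbose", action="store_true",
                        help="print detailed progress output")
    return parser


def run(argv: Sequence[str], handlers: Dict[str, Callable[[bool], object]]):
    """Parse argv and call the handler registered for the chosen step."""
    args = build_parser().parse_args(argv)
    return handlers[args.step](args.verbose)
```

Keeping the dispatch table separate from the parser makes each step easy to test in isolation.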

📁 Project Structure

enclose/
├── Makefile                 # Build automation
├── processor.py            # Main processing pipeline
├── requirements.txt        # Python dependencies
├── setup.sh               # System setup script
├── README.md              # Project documentation
├── venv/                  # Virtual environment (created)
└── output/                # Generated files (created)
    ├── invoice_example.md     # Source markdown
    ├── invoice_example.pdf    # Generated PDF
    ├── invoice_example.svg    # SVG with embedded PDF
    ├── page_1.png            # Extracted PNG pages
    ├── page_N.png            # (multiple pages if needed)
    ├── metadata.json         # Processing metadata
    ├── svg_search_results.json # Search results
    └── dashboard.html        # Interactive dashboard

🔄 Pipeline Workflow

Step 1: CREATE
├── Generate example markdown file (invoice)
└── Output: invoice_example.md

Step 2: MARKDOWN → PDF
├── Convert markdown to styled HTML
├── Generate PDF with CSS styling
└── Output: invoice_example.pdf

Step 3: PDF → SVG
├── Embed PDF as base64 data URI
├── Add SVG metadata (RDF/Dublin Core)
└── Output: invoice_example.svg + metadata.json

Step 4: PDF → PNG
├── Extract PDF pages as PNG images
├── Convert PNG to base64 encoding
└── Output: page_*.png + updated metadata

Step 5: OCR PROCESSING
├── Extract text from PNG images
├── Calculate confidence scores
└── Output: updated metadata with OCR data

Step 6: FILESYSTEM SEARCH
├── Scan for all SVG files
├── Parse SVG metadata
└── Output: svg_search_results.json

Step 7: DASHBOARD CREATION
├── Generate HTML table with thumbnails
├── Embed SVG previews
└── Output: dashboard.html (opens in browser)
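
Step 6 of the workflow above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: the function names and the result keys (`path`, `size`, `modified`, `has_pdf_data`, `has_metadata`) are hypothetical, and detection uses plain substring checks rather than full XML parsing.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Dict, List


def scan_svg_files(root: Path) -> List[Dict]:
    """Recursively collect basic facts about every SVG file under root."""
    results = []
    for path in sorted(root.rglob("*.svg")):
        text = path.read_text(encoding="utf-8", errors="replace")
        stat = path.stat()
        results.append({
            "path": str(path),
            "size": stat.st_size,
            "modified": datetime.fromtimestamp(stat.st_mtime).isoformat(timespec="seconds"),
            # Substring checks: cheap, but they can be fooled by unusual markup.
            "has_pdf_data": "data:application/pdf;base64," in text,
            "has_metadata": "<metadata" in text and "rdf:RDF" in text,
        })
    return results


def write_search_results(root: Path, out_file: Path) -> List[Dict]:
    """Scan root and store the results as JSON in out_file."""
    results = scan_svg_files(root)
    out_file.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results
```

A call such as `write_search_results(Path("output"), Path("output/svg_search_results.json"))` would produce a file comparable to the one listed in the project structure, although the real file's fields may differ.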

📊 Output Files

Metadata Structure

{
  "file": "path/to/file.svg",
  "type": "svg_with_pdf",
  "created": "2025-06-25T10:30:00",
  "pdf_embedded": true,
  "total_pages": 1,
  "pages": [
    {
      "page": 1,
      "file": "page_1.png",
      "base64": "iVBORw0KGgoAAAANSU...",
      "ocr_text": "Invoice #INV-2025-001...",
      "ocr_confidence": 95.7,
      "word_count": 45
    }
  ]
}
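
A short sketch of how a consumer might read this structure follows. The field names are taken from the example above; the `summarize_metadata` helper and its `threshold` parameter are hypothetical and only illustrate one way to flag pages with weak OCR results.

```python
import json
from typing import Dict


def summarize_metadata(metadata: Dict, threshold: float = 80.0) -> Dict:
    """Condense a metadata document into a few headline numbers."""
    pages = metadata.get("pages", [])
    confidences = [p["ocr_confidence"] for p in pages if "ocr_confidence" in p]
    mean = round(sum(confidences) / len(confidences), 1) if confidences else None
    return {
        "file": metadata.get("file"),
        "pages": len(pages),
        "words": sum(p.get("word_count", 0) for p in pages),
        "mean_confidence": mean,
        # Pages whose OCR confidence falls below the chosen threshold.
        "low_confidence_pages": [
            p["page"] for p in pages if p.get("ocr_confidence", 100.0) < threshold
        ],
    }


def load_and_summarize(path: str, threshold: float = 80.0) -> Dict:
    """Read a metadata JSON file from disk and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_metadata(json.load(handle), threshold)
```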

Dashboard Features

SVG Thumbnails: Direct embedding of SVG files
File Information: Path, size, modification date
PDF Detection: Indicates embedded PDF data
Metadata Status: Shows RDF metadata presence
Interactive Links: Click to open files
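
The dashboard features above could be produced by something like the following sketch. The row keys (`path`, `size`, `modified`, `has_pdf_data`, `has_metadata`), the column layout, and the `render_dashboard` name are assumptions made for illustration; the actual dashboard generator may be structured differently.

```python
from html import escape
from typing import Dict, Iterable


def render_dashboard(rows: Iterable[Dict]) -> str:
    """Render file records as an HTML table with linked SVG thumbnails."""
    body = []
    for row in rows:
        # Escape the path once; it is reused in attributes and cell text.
        path = escape(str(row["path"]))
        body.append(
            "<tr>"
            f'<td><a href="{path}"><img src="{path}" alt="{path}" width="120"></a></td>'
            f"<td>{path}</td>"
            f"<td>{row['size']} bytes</td>"
            f"<td>{escape(str(row['modified']))}</td>"
            f"<td>{'yes' if row['has_pdf_data'] else 'no'}</td>"
            f"<td>{'yes' if row['has_metadata'] else 'no'}</td>"
            "</tr>"
        )
    header = (
        "<tr><th>Thumbnail</th><th>Path</th><th>Size</th>"
        "<th>Modified</th><th>PDF</th><th>RDF metadata</th></tr>"
    )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>Dashboard</title></head><body>\n<table>\n"
        + header + "\n" + "\n".join(body)
        + "\n</table>\n</body></html>"
    )
```

Escaping every value before it reaches the markup matters here because file names found on disk are untrusted input.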

🛠️ Makefile Targets

Target	Description
`install`	Install dependencies in virtual environment
`create`	Create example markdown file
`process`	Run conversion pipeline (steps 2-5)
`search`	Search filesystem for SVG files
`enclose`	Create HTML dashboard
`clean`	Remove generated files
`clean-all`	Remove everything including venv
`help`	Show available commands

🔧 Configuration

OCR Language Support

# Install additional languages
sudo apt-get install tesseract-ocr-pol  # Polish
sudo apt-get install tesseract-ocr-deu  # German

# Configure in processor.py
pytesseract.image_to_string(image, lang='pol+eng')
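
To show how a language setting and a confidence score might fit together, here is a hedged sketch built on `pytesseract.image_to_data`. Averaging per-word confidences is an assumption about how a page score could be computed, not a statement of what this project does, and the helper names `summarize_ocr` and `ocr_image` are hypothetical. The output keys mirror the metadata example in this README.

```python
from typing import Dict


def summarize_ocr(data: Dict) -> Dict:
    """Reduce Tesseract word-level output to text, mean confidence, word count.

    `data` is the dict returned by pytesseract.image_to_data with
    output_type=Output.DICT; entries with confidence -1 are layout boxes.
    """
    words, confidences = [], []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some versions return strings, others numbers
        if conf < 0 or not text.strip():
            continue
        words.append(text)
        confidences.append(conf)
    mean = round(sum(confidences) / len(confidences), 1) if confidences else 0.0
    return {
        "ocr_text": " ".join(words),
        "ocr_confidence": mean,
        "word_count": len(words),
    }


def ocr_image(path: str, lang: str = "pol+eng") -> Dict:
    """Run Tesseract on an image file; requires pytesseract and Pillow."""
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        data = pytesseract.image_to_data(
            image, lang=lang, output_type=pytesseract.Output.DICT
        )
    return summarize_ocr(data)
```

The third-party imports sit inside `ocr_image` so that `summarize_ocr` can be tested without Tesseract installed.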

PDF Styling

Modify CSS in markdown_to_pdf() method:

styled_html = f"""
<style>
    body {{ font-family: 'Your Font', sans-serif; }}
    /* Add custom styles */
</style>
"""
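
As a rough illustration of where such a stylesheet fits, the sketch below wraps rendered Markdown in a styled HTML page and hands it to WeasyPrint. The README names WeasyPrint in its troubleshooting section; the use of the third-party `markdown` package, the helper names, and the default font are assumptions, and the project's own `markdown_to_pdf()` may look different.

```python
from pathlib import Path


def build_styled_html(body_html: str, font_family: str = "DejaVu Sans",
                      extra_css: str = "") -> str:
    """Wrap an HTML fragment in a full page with an inline stylesheet."""
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
    body {{ font-family: '{font_family}', sans-serif; }}
    {extra_css}
</style>
</head>
<body>
{body_html}
</body>
</html>"""


def markdown_file_to_pdf(md_path: str, pdf_path: str, **style) -> None:
    """Convert a Markdown file to PDF; requires markdown and weasyprint."""
    import markdown  # assumed Markdown renderer, not confirmed by the README
    from weasyprint import HTML

    text = Path(md_path).read_text(encoding="utf-8")
    body = markdown.markdown(text, extensions=["tables"])
    HTML(string=build_styled_html(body, **style)).write_pdf(str(pdf_path))
```

Literal CSS braces are doubled inside the f-string, as in the snippet above, so that Python does not treat them as placeholders.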

🐛 Troubleshooting

Common Issues

OCR Not Working:

# Check tesseract installation
tesseract --version

# Install language packs
sudo apt-get install tesseract-ocr-eng

PDF Conversion Fails:

# Upgrade weasyprint to the latest release
pip install --upgrade weasyprint

SVG Rendering Issues:

# Install cairo development libraries
sudo apt-get install libcairo2-dev
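
Before working through the fixes above one by one, it can help to check everything in a single pass. The following sketch is an assumption-laden convenience, not part of the project: the `check_environment` name is hypothetical, and the default lists simply reflect tools named in this README (`tesseract`, `pytesseract`, `weasyprint`) plus `pdftoppm`, which ships with poppler-utils.

```python
import importlib.util
import shutil
from typing import Dict, Sequence


def check_environment(
    tools: Sequence[str] = ("tesseract", "pdftoppm"),
    modules: Sequence[str] = ("pytesseract", "weasyprint"),
) -> Dict[str, bool]:
    """Report which command-line tools and Python modules are available."""
    report = {}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    for module in modules:
        # find_spec locates a top-level module without importing it.
        report[module] = importlib.util.find_spec(module) is not None
    return report


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```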

Debug Mode

# Enable verbose output
python processor.py --step process --verbose

📝 License

This project is open source. See LICENSE file for details.

🤝 Contributing

Fork the repository
Create feature branch (git checkout -b feature/amazing-feature)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open Pull Request

📞 Support

Issues: GitHub Issues
Documentation: This README
Examples: Check output/ directory after running pipeline

🎉 Quick Demo

# Complete setup and demo
make install
make all

# View results
open output/dashboard.html  # macOS
xdg-open output/dashboard.html  # Linux

The dashboard will show your processed documents with interactive thumbnails and metadata!

enclose 1.0.3
