A modular document processing pipeline for Markdown to PDF/SVG/PNG conversion with OCR

Enclose

A comprehensive document processing pipeline for Markdown to PDF/SVG/PNG conversion with OCR capabilities.

graph LR
    A[Markdown] -->|Parse| B[HTML]
    B -->|Convert| C[PDF]
    C -->|Embed| D[SVG]
    D -->|Extract| E[PNG]
    E -->|Process| F[OCR]
    F -->|Index| G[Search]
    G -->|Visualize| H[Dashboard]

🚀 Features

graph LR
    A[Input Formats] --> B[Markdown]
    A --> C[PDF]
    
    B --> D[Converters]
    C --> D
    
    D --> E[Output Formats]
    E --> F[PDF]
    E --> G[SVG]
    E --> H[PNG]
    E --> I[HTML]
    
    style A fill:#f9f,stroke:#333
    style E fill:#9f9,stroke:#333

  • Multi-format conversion: Convert between Markdown, PDF, SVG, and PNG
  • SVG embedding: Embed PDFs as base64 data URIs in SVG containers
  • Image extraction: Extract high-quality images from PDFs
  • OCR processing: Extract text with confidence scoring
  • Metadata tracking: Preserve and enhance metadata throughout processing
  • Interactive dashboard: View and search processed documents
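
The SVG embedding feature can be illustrated with a minimal sketch. It assumes the PDF is carried in an `<image>` element whose `href` holds a `data:application/pdf;base64,...` URI; the element choice, the default page size, and the function names `embed_pdf_in_svg` and `extract_pdf_from_svg` are illustrative assumptions, not Enclose's actual implementation.

```python
import base64
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
DATA_PREFIX = "data:application/pdf;base64,"


def embed_pdf_in_svg(pdf_bytes: bytes, width: int = 595, height: int = 842) -> str:
    """Wrap PDF bytes in an SVG container as a base64 data URI (hypothetical layout)."""
    ET.register_namespace("", SVG_NS)  # serialize with a default xmlns, no prefix
    size = {"width": str(width), "height": str(height)}
    svg = ET.Element(f"{{{SVG_NS}}}svg", size)
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    ET.SubElement(svg, f"{{{SVG_NS}}}image", {"href": DATA_PREFIX + encoded, **size})
    return ET.tostring(svg, encoding="unicode")


def extract_pdf_from_svg(svg_text: str) -> bytes:
    """Recover the embedded PDF bytes from an SVG produced by embed_pdf_in_svg."""
    root = ET.fromstring(svg_text)
    for element in root.iter(f"{{{SVG_NS}}}image"):
        href = element.get("href", "")
        if href.startswith(DATA_PREFIX):
            return base64.b64decode(href[len(DATA_PREFIX):])
    raise ValueError("no embedded PDF found")
```

Round-tripping the bytes through both functions is a quick way to confirm that the embedding is lossless.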

✅ File Format Validation

Enclose validates file formats before and after conversion to check the integrity and correctness of every converted file:

graph TD
    A[Input Validation] --> B[Conversion]
    B --> C[Output Validation]
    C --> D[Verification]
    
    style A fill:#d4f1f9,stroke:#333
    style C fill:#d4f1f9,stroke:#333
    style D fill:#d4f1f9,stroke:#333

Validation Checks

PDF Files

  • ✅ Valid PDF signature (%PDF header)
  • ✅ Correct MIME type (application/pdf)
  • ✅ File integrity verification

SVG Files

  • ✅ Valid XML structure
  • ✅ Correct MIME type (image/svg+xml)
  • ✅ Basic SVG tag validation

PNG Files

  • ✅ Valid PNG signature (magic bytes)
  • ✅ Correct MIME type (image/png)
  • ✅ Image data integrity check
  • ✅ PIL verification of image data
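
The signature and structure checks listed above can be sketched with the standard library alone. The function names `is_pdf`, `is_png`, and `is_svg` are hypothetical and are not taken from the Enclose code base; the sketch covers only the header and root-tag checks, not MIME detection or full integrity verification.

```python
import xml.etree.ElementTree as ET

PDF_SIGNATURE = b"%PDF"
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG magic number


def is_pdf(data: bytes) -> bool:
    """Check for the %PDF header at the start of the file."""
    return data.startswith(PDF_SIGNATURE)


def is_png(data: bytes) -> bool:
    """Check for the PNG magic bytes at the start of the file."""
    return data.startswith(PNG_SIGNATURE)


def is_svg(data: bytes) -> bool:
    """Check that the data is well-formed XML whose root element is <svg>."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False
    # Drop a "{namespace}" prefix, if present, before comparing the tag name.
    return root.tag.rsplit("}", 1)[-1] == "svg"
```

For the PIL check, Pillow's `Image.open(path).verify()` could be layered on top of `is_png`; it is left out here so the sketch has no third-party dependency.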

Example Validation Output

# When running tests, you'll see validation output like:
PASSED tests/test_file_formats.py::test_pdf_conversion
PASSED tests/test_file_formats.py::test_svg_conversion
PASSED tests/test_file_formats.py::test_png_conversion

📚 Documentation

For complete documentation, please visit our documentation site.

🛠️ Quick Start

Prerequisites

  • Python 3.8+
  • Poetry (for development)

Installation

  1. Clone the repository:

    git clone https://github.com/veridock/enclose.git
    cd enclose
    
  2. Install the dependencies (Poetry also installs the package itself in development mode):

    poetry install
    

Basic Usage

  1. List supported formats:

    enclose --list
    
  2. Convert a markdown file to another format:

    # Basic conversion (outputs to current directory with default name)
    enclose example.md pdf
    
    # Specify output filename
    enclose example.md pdf -o output.pdf
    
    # Convert to SVG
    enclose example.md svg -o output.svg
    
    # Convert to PNG
    enclose example.md png -o output.png
    
    # Convert to HTML
    enclose example.md html -o output.html
    

Example

  1. First, create a test markdown file or use the provided example.md

  2. Convert it to different formats:

    # Convert to PDF
    enclose example.md pdf -o example.pdf
    
    # Convert to SVG
    enclose example.md svg -o example.svg
    

Important Notes

  • The -o or --output flag requires a full file path with extension (e.g., output.pdf, ./output.svg)
  • If no output is specified, the output is saved in the current directory with a default name based on the input file
  • The output directory must exist before running the command
  • For example, -o output/example.pdf saves the result to output/example.pdf, provided output/ already exists
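
The output rules above can be expressed as a small path-resolution helper. This is a sketch under stated assumptions: the function name `resolve_output_path` is hypothetical, and the default name `<input stem>.<format>` is a guess at what "a default name based on the input file" means.

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(input_path: str, fmt: str, output: Optional[str] = None) -> Path:
    """Apply the documented -o rules and return the target file path."""
    if output is None:
        # Assumed default: input file stem plus the format, in the current directory.
        target = Path.cwd() / f"{Path(input_path).stem}.{fmt}"
    else:
        target = Path(output)
        if not target.suffix:
            raise ValueError("output must be a full file path with an extension")
    if not target.parent.is_dir():
        # The tool does not create directories, so fail early with a clear message.
        raise FileNotFoundError(f"output directory does not exist: {target.parent}")
    return target
```

Checking the parent directory before conversion starts avoids doing the conversion work only to fail when the file is written.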

Command Line Options

usage: enclose [-h] [--version] [--list] [-o OUTPUT] [input] [{pdf,png,svg,html}]

A document processing tool for format conversion.

positional arguments:
  input                 Input file path (markdown, pdf, etc.)
  {pdf,png,svg,html}    Output format

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --list                List supported formats and conversions
  -o OUTPUT, --output OUTPUT
                        Output directory (default: current directory)
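
For readers who want to see how such an interface maps onto Python's `argparse`, the following sketch rebuilds a parser from the help text above. It is a reconstruction, not the project's source: the `build_parser` name, the `format` destination, and the placeholder version string are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented usage line."""
    parser = argparse.ArgumentParser(
        prog="enclose",
        description="A document processing tool for format conversion.",
    )
    # Placeholder version string; the real value would come from package metadata.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    parser.add_argument("--list", action="store_true",
                        help="List supported formats and conversions")
    parser.add_argument("-o", "--output",
                        help="Output directory (default: current directory)")
    # Both positionals are optional so that `enclose --list` works on its own.
    parser.add_argument("input", nargs="?",
                        help="Input file path (markdown, pdf, etc.)")
    parser.add_argument("format", nargs="?", choices=["pdf", "png", "svg", "html"],
                        help="Output format")
    return parser
```

With this layout, `build_parser().parse_args(["example.md", "pdf", "-o", "output.pdf"])` yields a namespace with `input`, `format`, and `output` populated.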

Development

To run tests:

make test

To run linting:

make lint

To run type checking:

make typecheck

Installation

# Clone the repository
git clone https://github.com/veridock/enclose.git
cd enclose

# Install the package
make install

Basic Usage

# Process a document
enclose process example.md -o output/

# View the results
open output/dashboard.html  # macOS
# or
xdg-open output/dashboard.html  # Linux

📖 Documentation Structure

🌟 Features in Detail

Document Conversion

  • Markdown to PDF with custom styling
  • PDF to SVG with embedded fonts
  • High-quality image extraction

Advanced Processing

  • OCR text extraction with confidence scoring
  • Metadata extraction and management
  • Batch processing support

Command Line Interface

  • Intuitive command structure
  • Configurable output formats
  • Progress tracking

📊 Example Workflow

sequenceDiagram
    participant User
    participant CLI
    participant Processor
    
    User->>CLI: enclose process doc.md
    CLI->>Processor: Process document
    Processor->>Processor: Convert Markdown to PDF
    Processor->>Processor: Generate SVG with embedded PDF
    Processor->>Processor: Extract images
    Processor->>Processor: Process OCR
    Processor-->>CLI: Processing complete
    CLI-->>User: Results in output/

📦 Project Structure

enclose/
├── docs/                   # Documentation
├── processor/              # Main package
│   ├── __init__.py
│   ├── __main__.py         # CLI entry point
│   ├── core/               # Core processing logic
│   ├── converters/         # Format converters
│   └── utils/              # Utility functions
├── scripts/                # Helper scripts
├── tests/                  # Test suite
└── pyproject.toml          # Project configuration

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🧪 Testing

To run the test suite:

make test
make lint

🔄 Development Workflow

  1. Set up development environment

    make install
    
  2. Run tests

    make test
    
  3. Format and check code

    make format
    make lint
    
  4. Run the development server

    make dev
    
    • Dashboard opens automatically in browser
    • Access: output/dashboard.html

🛠️ CLI Commands

Basic Usage

# Convert a document
enclose input.md pdf -o output.pdf

# List available formats
enclose --list

# Show help
enclose --help

Command Structure

flowchart TD
    A[enclose] --> B[input_file output_format options]
    A --> C[--list]
    A --> D[--help]
    A --> E[--version]
    
    B --> F[input_file]
    B --> G[output_format]
    B --> H[options]
    
    H --> I[-o/--output]
    H --> J[--dpi]
    H --> K[--quality]
    
    style B fill:#9f9,stroke:#333
    style C fill:#99f,stroke:#333
    style D fill:#99f,stroke:#333

Common Examples

# Convert Markdown to PDF
enclose document.md pdf -o output.pdf

# Convert PDF to high-quality PNG
enclose document.pdf png --dpi 300 -o output.png


# List all supported formats
enclose --list

Advanced Options

# Set output DPI for images
enclose input.pdf png --dpi 150

# Set image quality (1-100)
enclose input.pdf jpg --quality 90

# Process multiple files (the output/ directory must already exist)
for f in *.md; do enclose "$f" pdf -o "output/${f%.md}.pdf"; done

📁 Project Structure

graph TD
    A[Project Root] --> B[Source Code]
    A --> C[Documentation]
    A --> D[Build System]
    A --> E[Tests]
    
    B --> F[enclose/]:::dir
    F --> G[__init__.py]:::file
    F --> H[__main__.py]:::file
    F --> I[core/]:::dir
    F --> J[converters/]:::dir
    F --> K[utils/]:::dir
    
    C --> L[docs/]:::dir
    C --> M[README.md]:::file
    
    D --> N[pyproject.toml]:::file
    D --> O[Makefile]:::file
    
    E --> P[tests/]:::dir
    
    classDef dir fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef file fill:#e8f5e9,stroke:#2e7d32,stroke-width:1.5px

Key Directories

  • enclose/ - Main Python package

    • core/ - Core processing logic and document handling
    • converters/ - File format conversion modules
    • utils/ - Utility functions and helpers
    • __main__.py - CLI entry point
  • docs/ - Comprehensive documentation

    • getting-started/ - Installation and setup guides
    • architecture/ - System design and components
    • usage/ - User guides and examples
  • tests/ - Test suite

    • Unit tests
    • Integration tests
    • Test fixtures

Key Files

  • pyproject.toml - Project configuration and dependencies
  • Makefile - Common development tasks
  • scripts/enclose - Global CLI wrapper script
  • .github/workflows/ - CI/CD pipelines

🔄 Workflow

flowchart LR
    A[Input] -->|Markdown/PDF| B(enclose)
    B --> C{Format?}
    
    C -->|Markdown| D[Parse Markdown]
    D --> E[Generate HTML]
    E --> F[Convert to PDF]
    
    C -->|PDF| G[Process PDF]
    G --> H[Extract Content]
    
    F & H --> I[Generate Outputs]
    I --> J[SVG/PNG/HTML]
    I --> K[Metadata]
    J --> L[Dashboard]
    
    style A fill:#e3f2fd,stroke:#1565c0
    style B fill:#e8f5e9,stroke:#2e7d32
    style L fill:#fff3e0,stroke:#e65100

Processing Steps

  1. Input Handling

    • Accepts Markdown or PDF files
    • Validates input format and content
  2. Conversion

    • Markdown → HTML → PDF
    • PDF → Images/Text
  3. Output Generation

    • Generate SVG/PNG/HTML outputs
    • Extract and process metadata
  4. Visualization

    • Create interactive dashboard
    • Enable search and filtering
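
The branching in these steps (Markdown goes through HTML to PDF, while PDF input is processed directly) can be sketched as a simple planner. The step names and the `plan_steps` function are illustrative labels taken from the description above, not identifiers from the Enclose source.

```python
from pathlib import Path
from typing import List

MARKDOWN_SUFFIXES = {".md", ".markdown"}


def plan_steps(input_path: str) -> List[str]:
    """Return the ordered processing steps for an input file, based on its suffix."""
    suffix = Path(input_path).suffix.lower()
    if suffix in MARKDOWN_SUFFIXES:
        conversion = ["parse_markdown", "generate_html", "convert_to_pdf"]
    elif suffix == ".pdf":
        conversion = ["process_pdf", "extract_content"]
    else:
        raise ValueError(f"unsupported input format: {suffix or input_path}")
    # Both branches converge on the same output and visualization steps.
    return conversion + ["generate_outputs", "extract_metadata", "create_dashboard"]
```

Keeping the plan as data makes it easy to log or test the pipeline order without running any converter.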

🔄 Pipeline Workflow

Step 1: CREATE
├── Generate example markdown file (invoice)
└── Output: invoice_example.md

Step 2: MARKDOWN → PDF
├── Convert markdown to styled HTML
├── Generate PDF with CSS styling
└── Output: invoice_example.pdf

Step 3: PDF → SVG
├── Embed PDF as base64 data URI
├── Add SVG metadata (RDF/Dublin Core)
└── Output: invoice_example.svg + metadata.json

Step 4: PDF → PNG
├── Extract PDF pages as PNG images
├── Convert PNG to base64 encoding
└── Output: page_*.png + updated metadata

Step 5: OCR PROCESSING
├── Extract text from PNG images
├── Calculate confidence scores
└── Output: updated metadata with OCR data

Step 6: FILESYSTEM SEARCH
├── Scan for all SVG files
├── Parse SVG metadata
└── Output: svg_search_results.json

Step 7: DASHBOARD CREATION
├── Generate HTML table with thumbnails
├── Embed SVG previews
└── Output: dashboard.html (opens in browser)
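
Step 6 (filesystem search) lends itself to a short illustration. The sketch below walks a directory, parses each SVG, and records whether it carries an embedded PDF data URI or an RDF metadata element. The function name `scan_svg_files` and the result keys are assumptions chosen to echo the metadata fields shown in this README.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, List

PDF_DATA_PREFIX = "data:application/pdf;base64,"
RDF_TAG = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"


def scan_svg_files(root_dir: str) -> List[Dict[str, object]]:
    """Find SVG files under root_dir and report embedded-PDF and RDF presence."""
    results: List[Dict[str, object]] = []
    for path in sorted(Path(root_dir).rglob("*.svg")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        # An embedded PDF shows up as an attribute value holding a PDF data URI.
        has_pdf = any(
            value.startswith(PDF_DATA_PREFIX)
            for element in root.iter()
            for value in element.attrib.values()
        )
        has_rdf = root.find(f".//{RDF_TAG}") is not None
        results.append(
            {"file": str(path), "pdf_embedded": has_pdf, "has_rdf_metadata": has_rdf}
        )
    return results
```

The returned list is JSON-serializable, so it could be written to a file such as svg_search_results.json with `json.dump`.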

📊 Output Files

Metadata Structure

{
  "file": "path/to/file.svg",
  "type": "svg_with_pdf",
  "created": "2025-06-25T10:30:00",
  "pdf_embedded": true,
  "total_pages": 1,
  "pages": [
    {
      "page": 1,
      "file": "page_1.png",
      "base64": "iVBORw0KGgoAAAANSU...",
      "ocr_text": "Invoice #INV-2025-001...",
      "ocr_confidence": 95.7,
      "word_count": 45
    }
  ]
}
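
A metadata file with this shape can be summarized in a few lines of Python. The sketch assumes the keys shown above (`pages`, `word_count`, `ocr_confidence`); the `summarize_metadata` function and the unweighted mean are illustrative choices, not part of Enclose.

```python
import json
from typing import Any, Dict

SAMPLE = """
{
  "file": "path/to/file.svg",
  "type": "svg_with_pdf",
  "pdf_embedded": true,
  "total_pages": 1,
  "pages": [
    {"page": 1, "file": "page_1.png", "ocr_confidence": 95.7, "word_count": 45}
  ]
}
"""


def summarize_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return page count, total word count, and mean OCR confidence."""
    pages = metadata.get("pages", [])
    confidences = [p["ocr_confidence"] for p in pages if "ocr_confidence" in p]
    return {
        "pages": len(pages),
        "words": sum(p.get("word_count", 0) for p in pages),
        # Unweighted mean over pages; 0.0 when no page has OCR data.
        "mean_confidence": sum(confidences) / len(confidences) if confidences else 0.0,
    }


summary = summarize_metadata(json.loads(SAMPLE))
```

A word-count-weighted mean may be a better fit when page lengths vary widely.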

Dashboard Features

  • SVG Thumbnails: Direct embedding of SVG files
  • File Information: Path, size, modification date
  • PDF Detection: Indicates embedded PDF data
  • Metadata Status: Shows RDF metadata presence
  • Interactive Links: Click to open files

🛠️ Makefile Targets

Target      Description
install     Install dependencies in virtual environment
create      Create example markdown file
process     Run conversion pipeline (steps 2-5)
search      Search filesystem for SVG files
enclose     Create HTML dashboard
clean       Remove generated files
clean-all   Remove everything including venv
help        Show available commands

🔧 Configuration

OCR Language Support

# Install additional languages
sudo apt-get install tesseract-ocr-pol  # Polish
sudo apt-get install tesseract-ocr-deu  # German

# Configure in processor.py
pytesseract.image_to_string(image, lang='pol+eng')
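
Confidence scoring is not shown in the snippet above, so here is a hedged sketch of one way to derive it. It relies on `pytesseract.image_to_data` with `output_type=pytesseract.Output.DICT`, which returns parallel `text` and `conf` lists; the helper names `summarize_ocr` and `ocr_image` and the averaging strategy are assumptions, not Enclose's actual method.

```python
from typing import Dict, List, Tuple


def summarize_ocr(data: Dict[str, List]) -> Tuple[str, float, int]:
    """Return (text, mean confidence, word count) from image_to_data-style output."""
    words: List[str] = []
    confidences: List[float] = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some versions report confidences as strings
        if conf < 0 or not text.strip():
            continue  # -1 marks layout boxes that contain no recognized word
        words.append(text)
        confidences.append(conf)
    mean_conf = sum(confidences) / len(confidences) if confidences else 0.0
    return " ".join(words), mean_conf, len(words)


def ocr_image(image, lang: str = "pol+eng") -> Tuple[str, float, int]:
    """Run Tesseract on a PIL image and summarize the result."""
    import pytesseract  # imported lazily so summarize_ocr works without Tesseract

    data = pytesseract.image_to_data(
        image, lang=lang, output_type=pytesseract.Output.DICT
    )
    return summarize_ocr(data)
```

Separating the summary step from the Tesseract call keeps the scoring logic testable on machines where Tesseract is not installed.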

PDF Styling

Modify the CSS in the markdown_to_pdf() method:

styled_html = f"""
<style>
    body {{ font-family: 'Your Font', sans-serif; }}
    /* Add custom styles */
</style>
"""
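
To show where such a template fits, the sketch below wraps converted HTML in a full styled document and hands it to WeasyPrint. The function names `build_styled_html` and `write_pdf` and the default font are assumptions; only the doubled braces (needed to emit literal CSS braces from an f-string) mirror the snippet above.

```python
def build_styled_html(body_html: str, font_family: str = "DejaVu Sans") -> str:
    """Wrap an HTML fragment in a complete document with inline CSS."""
    # Inside an f-string, {{ and }} produce the literal braces that CSS needs.
    return f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
    body {{ font-family: '{font_family}', sans-serif; }}
</style>
</head>
<body>
{body_html}
</body>
</html>"""


def write_pdf(styled_html: str, output_path: str) -> None:
    """Render the styled HTML to a PDF file with WeasyPrint."""
    from weasyprint import HTML  # imported lazily; requires weasyprint to be installed

    HTML(string=styled_html).write_pdf(output_path)
```

Passing the font as a parameter avoids editing the template each time a different typeface is needed.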

🐛 Troubleshooting

Common Issues

OCR Not Working:

# Check tesseract installation
tesseract --version

# Install language packs
sudo apt-get install tesseract-ocr-eng

PDF Conversion Fails:

# Upgrade weasyprint to the latest release
pip install --upgrade weasyprint

SVG Rendering Issues:

# Install cairo development libraries
sudo apt-get install libcairo2-dev

Debug Mode

# Enable verbose output
python processor.py --step process --verbose

📝 License

This project is open source. See LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

📞 Support

  • Issues: GitHub Issues
  • Documentation: This README
  • Examples: Check output/ directory after running pipeline

🎉 Quick Demo

# Complete setup and demo
make install
make all

# View results
open output/dashboard.html  # macOS
xdg-open output/dashboard.html  # Linux

The dashboard will show your processed documents with interactive thumbnails and metadata!
