Universal document converter with template support, OCR, and AI-powered processing. Convert between PDF, DOCX, HTML, XML, JSON, EPUB and more with a simple CLI or Python API.

📄 Redoc - Universal Document Converter

Redoc is a modular document conversion framework that converts between PDF, HTML, XML, JSON, DOCX, and EPUB. It includes OCR, AI-powered content generation using the mistral:7b model served by Ollama, and a bidirectional template system that covers both document generation and data extraction.

🌟 Features

Core Functionality

  • Multi-format Support: Bidirectional conversion between PDF, HTML, XML, JSON, DOCX, and EPUB
  • Template System: JSON+HTML templates for dynamic document generation with bidirectional support
  • OCR Integration: Extract text from scanned documents and images with Tesseract OCR
  • AI-Powered: Leverage Ollama Mistral:7b for intelligent content generation and processing
  • Bidirectional Processing: Convert documents to data and back with templates
  • Batch Processing: Process multiple documents efficiently with parallel execution

Advanced Capabilities

  • Template Variables: Support for dynamic content and conditional rendering
  • Validation: Built-in data validation with Pydantic models
  • Extensible Architecture: Plugin system for custom formats and processors
  • Asynchronous Processing: Non-blocking operations for high performance
  • Web Interface: Modern UI for document conversion and management

Developer Experience

  • Comprehensive API: Clean, well-documented Python API
  • Command Line Interface: Intuitive CLI for quick conversions
  • Interactive Shell: Built-in Python shell for exploration and debugging
  • Logging & Debugging: Configurable logging and error reporting
  • Type Hints: Full type annotations for better IDE support

Enterprise Ready

  • Docker Support: Containerized deployment with Docker and Docker Compose
  • REST API: Built with FastAPI for easy integration
  • Security: Input validation, sanitization, and secure defaults
  • Monitoring: Built-in metrics and health checks

🚀 Quick Start

Installation

Using pip (recommended)

# Install the latest stable version
pip install redoc

# Install with all optional dependencies
pip install "redoc[all]"

# Or install specific components
pip install "redoc[cli]"       # Command line interface
pip install "redoc[server]"     # Web server and API
pip install "redoc[ai]"         # AI features (requires Ollama)
pip install "redoc[ocr]"        # OCR capabilities (Tesseract)
pip install "redoc[templates]"  # Pre-built templates
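
After installing, it can be useful to confirm from Python that the package is present and which version was picked up. The sketch below uses only the standard library; the helper name `installed_version` is made up for illustration and is not part of Redoc.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("redoc")
    print(f"redoc {version}" if version else "redoc is not installed")
```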

Using Docker (recommended for production)

# Pull the latest image
docker pull text2doc/redoc:latest

# Run a conversion
docker run -v $(pwd):/data text2doc/redoc convert input.pdf output.html

# Start the web interface
docker run -p 8000:8000 -v $(pwd)/templates:/app/templates text2doc/redoc serve

Development Installation

git clone https://github.com/text2doc/redoc.git
cd redoc
pip install -e ".[dev]"  # Install in development mode with all dependencies
pre-commit install  # Install git hooks

🛠 Basic Usage

Python API

from redoc import Redoc

# Initialize with default settings
converter = Redoc()

# Convert between formats
converter.convert('document.pdf', 'document.html')  # PDF to HTML
converter.convert('data.json', 'report.pdf')       # JSON to PDF with template

# Process multiple files
converter.batch_convert(
    input_glob='invoices/*.json',
    output_dir='output/',
    output_format='pdf',
    template='invoice.html'
)

# Extract data from documents
data = converter.extract_data('document.pdf', 'invoice_schema.json')

# Generate documents from templates
converter.generate_document(
    template='invoice.html',
    data='data.json',
    output='invoice.pdf'
)

# Use the interactive shell
converter.shell()
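
The `batch_convert` call above takes an input glob, an output directory, and an output format. How Redoc names the resulting files is not documented here; one plausible mapping, shown purely as an assumption, keeps each input's stem and swaps the extension. `plan_outputs` is a hypothetical helper, not a Redoc function.

```python
from pathlib import Path
from typing import Dict, Iterable


def plan_outputs(inputs: Iterable[str], output_dir: str, output_format: str) -> Dict[str, str]:
    """Map each input path to an output path with the same stem and a new extension."""
    out = Path(output_dir)
    return {
        str(src): str(out / f"{Path(src).stem}.{output_format}")
        for src in inputs
    }


if __name__ == "__main__":
    # In practice the inputs would come from something like glob.glob("invoices/*.json").
    plan = plan_outputs(["invoices/a.json", "invoices/b.json"], "output", "pdf")
    for src, dst in plan.items():
        print(f"{src} -> {dst}")
```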

Command Line Interface

# Show help
redoc --help

# Convert a document
redoc convert input.pdf output.html
redoc convert --template invoice.html data.json invoice.pdf

# Start interactive shell
redoc shell

# Start web server
redoc serve --host 0.0.0.0 --port 8000

# Process multiple files
redoc batch "documents/*.pdf" --format html --output-dir html_output
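
The same commands can be driven from a script. The sketch below only assembles the argument list for `redoc convert` in the form shown above and runs it if a `redoc` executable is on PATH; `build_convert_command` and `run_convert` are hypothetical helper names.

```python
import shutil
import subprocess
from typing import List, Optional


def build_convert_command(source: str, target: str, template: Optional[str] = None) -> List[str]:
    """Assemble the argv list for `redoc convert`, mirroring the CLI examples."""
    command = ["redoc", "convert"]
    if template is not None:
        command += ["--template", template]
    return command + [source, target]


def run_convert(source: str, target: str, template: Optional[str] = None) -> int:
    """Run the conversion and return the exit code; raise if the CLI is missing."""
    if shutil.which("redoc") is None:
        raise FileNotFoundError("the 'redoc' executable was not found on PATH")
    command = build_convert_command(source, target, template)
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    print(build_convert_command("data.json", "invoice.pdf", template="invoice.html"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.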

Using Templates

from redoc import Redoc

converter = Redoc()

# Simple template with variables
template = {
    "template": "invoice.html",
    "data": {
        "invoice": {
            "number": "INV-2023-001",
            "date": "2023-11-15",
            "items": [
                {"description": "Web Design", "quantity": 10, "price": 100},
                {"description": "Hosting", "quantity": 1, "price": 50}
            ]
        }
    }
}

# Generate PDF from template
converter.convert(template, 'pdf', output_file='invoice.pdf')

# Extract data from document
data = converter.extract_data('invoice.pdf', template='invoice_template.html')
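
An invoice template will usually need per-line and grand totals derived from the `items` list. Whether Redoc computes these or expects them in the data is not stated, so the sketch below simply works out the arithmetic for the sample data; `line_total` and `invoice_total` are illustrative names.

```python
from typing import Any, Dict, List


def line_total(item: Dict[str, Any]) -> float:
    """Quantity times unit price for a single invoice line."""
    return item["quantity"] * item["price"]


def invoice_total(items: List[Dict[str, Any]]) -> float:
    """Sum of all line totals."""
    return sum(line_total(item) for item in items)


if __name__ == "__main__":
    items = [
        {"description": "Web Design", "quantity": 10, "price": 100},
        {"description": "Hosting", "quantity": 1, "price": 50},
    ]
    for item in items:
        print(item["description"], line_total(item))
    print("Total", invoice_total(items))  # 10 * 100 + 1 * 50 = 1050
```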

📚 Supported Conversions

From \ To   PDF   HTML  XML   JSON  DOCX  EPUB
PDF         ❌    ✅    ✅    ✅    ✅    ✅
HTML        ✅    ❌    ✅    ✅    ✅    ✅
XML         ✅    ✅    ❌    ✅    ✅    ✅
JSON        ✅    ✅    ✅    ❌    ✅    ✅
DOCX        ✅    ✅    ✅    ✅    ❌    ✅
EPUB        ✅    ✅    ✅    ✅    ✅    ❌
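
Read as data, the table says that every ordered pair of distinct formats is supported and that converting a format to itself is not offered. A minimal sketch of that rule, with `FORMATS`, `SUPPORTED`, and `is_supported` as illustrative names rather than Redoc API:

```python
from itertools import permutations

FORMATS = ("pdf", "html", "xml", "json", "docx", "epub")

# Every ordered (source, target) pair of distinct formats from the table.
SUPPORTED = set(permutations(FORMATS, 2))


def is_supported(source: str, target: str) -> bool:
    """True when the table marks source -> target as supported."""
    return (source.lower(), target.lower()) in SUPPORTED


if __name__ == "__main__":
    print(len(SUPPORTED))               # 6 sources * 5 targets each = 30
    print(is_supported("PDF", "html"))  # True
    print(is_supported("pdf", "pdf"))   # False
```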

Conversion Features

  • PDF Generation: High-quality PDF output with support for headers, footers, and page numbers
  • HTML Processing: Clean HTML output with customizable CSS styling
  • Data Extraction: Extract structured data from documents using templates
  • Template Variables: Use Jinja2 syntax for dynamic content
  • Batch Processing: Process multiple files in parallel
  • OCR Support: Extract text from scanned documents and images
  • AI-Powered: Enhance documents with AI-generated content
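
Parallel batch processing can be pictured as mapping one conversion function over many inputs with a worker pool. The sketch below uses a thread pool from the standard library and a stand-in `convert_one` function; it is not Redoc's implementation, and a real converter call would take the stand-in's place.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def convert_one(source: str) -> str:
    """Stand-in for a real conversion: only derives the target file name."""
    return str(Path(source).with_suffix(".html"))


def convert_many(
    sources: Iterable[str],
    worker: Callable[[str], str] = convert_one,
    max_workers: int = 4,
) -> List[str]:
    """Run worker over all sources in parallel, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, sources))


if __name__ == "__main__":
    print(convert_many(["a.pdf", "b.pdf", "c.pdf"]))
```

Threads suit work that mostly waits on I/O or external tools; CPU-bound conversion would more likely call for a process pool.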

๐Ÿ—๏ธ Project Structure

redoc/
├── src/
│   └── redoc/
│       ├── __init__.py         # Package initialization
│       ├── core.py             # Core conversion logic
│       ├── converters/         # Format-specific converters
│       │   ├── base.py         # Base converter class
│       │   ├── pdf_converter.py
│       │   ├── html_converter.py
│       │   ├── xml_converter.py
│       │   ├── json_converter.py
│       │   ├── docx_converter.py
│       │   └── epub_converter.py
│       ├── ocr/                # OCR functionality
│       ├── templates/          # Default templates
│       └── utils/              # Utility functions
├── tests/                      # Test suite
├── examples/                   # Usage examples
├── docs/                       # Documentation
├── pyproject.toml              # Project configuration
└── README.md                   # This file
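
The layout suggests one converter module per format, built on a shared base class. The actual interface in `converters/base.py` is not shown here, so the following is only a guess at the general shape: `BaseConverter`, `REGISTRY`, `register`, and `TextEchoConverter` are all hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class BaseConverter(ABC):
    """Hypothetical common interface for format-specific converters."""

    format_name: str = ""

    @abstractmethod
    def to_text(self, payload: bytes) -> str:
        """Turn raw document bytes into plain text."""


REGISTRY: Dict[str, Type[BaseConverter]] = {}


def register(cls: Type[BaseConverter]) -> Type[BaseConverter]:
    """Class decorator that files a converter under its format name."""
    REGISTRY[cls.format_name] = cls
    return cls


@register
class TextEchoConverter(BaseConverter):
    """Toy converter used only to exercise the registry."""

    format_name = "txt"

    def to_text(self, payload: bytes) -> str:
        return payload.decode("utf-8")


if __name__ == "__main__":
    converter = REGISTRY["txt"]()
    print(converter.to_text(b"hello"))
```

A registry keyed by format name is one common way to let plugins add formats without touching the core dispatch code.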

🔧 Advanced Usage

Using Templates

from redoc import Redoc

converter = Redoc()

# Convert JSON+HTML template to PDF
converter.convert(
    {
        "template": "invoice.html",
        "data": {
            "invoice_number": "INV-2023-001",
            "date": "2023-11-15",
            "items": [
                {"description": "Web Design", "quantity": 1, "price": 1200}
            ],
            "total": 1200
        }
    },
    'pdf',
    output_file='invoice.pdf'
)
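
Before rendering, it helps to check that the data is internally consistent, for example that `total` matches the line items. The feature list mentions Pydantic models for validation; the sketch below swaps in plain standard-library dataclasses so that it runs without extra dependencies, and the class and function names are illustrative only.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class InvoiceItem:
    description: str
    quantity: int
    price: float


@dataclass(frozen=True)
class Invoice:
    invoice_number: str
    date: str
    items: List[InvoiceItem]
    total: float

    def __post_init__(self) -> None:
        # Reject data whose stated total disagrees with the line items.
        expected = sum(item.quantity * item.price for item in self.items)
        if expected != self.total:
            raise ValueError(f"total {self.total} does not match items ({expected})")


def parse_invoice(data: Dict[str, Any]) -> Invoice:
    """Build a validated Invoice from the template's data dictionary."""
    items = [InvoiceItem(**item) for item in data["items"]]
    return Invoice(data["invoice_number"], data["date"], items, data["total"])


if __name__ == "__main__":
    invoice = parse_invoice({
        "invoice_number": "INV-2023-001",
        "date": "2023-11-15",
        "items": [{"description": "Web Design", "quantity": 1, "price": 1200}],
        "total": 1200,
    })
    print(invoice.invoice_number, invoice.total)
```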

OCR Processing

from redoc import Redoc

converter = Redoc()

# Extract text from scanned PDF with OCR
result = converter.ocr('scanned_document.pdf')
print(result['text'])

# Convert scanned document to searchable PDF
converter.ocr('scanned_document.pdf', output_file='searchable.pdf')

AI-Powered Content Generation

from redoc import Redoc

converter = Redoc()

# Generate document using AI
result = converter.generate(
    "Create a professional invoice for web design services",
    format='pdf',
    style='professional',
    output_file='ai_invoice.pdf'
)
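
How Redoc talks to Ollama is not described here. For orientation, Ollama exposes a local HTTP API, and a non-streaming request to its `/api/generate` endpoint can be sketched as below. The address `http://localhost:11434` is Ollama's default; the helper names, and the idea that Redoc does something similar internally, are assumptions.

```python
import json
import urllib.request
from typing import Any, Dict

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(prompt: str, model: str = "mistral:7b") -> urllib.request.Request:
    """Prepare a non-streaming generation request without sending it."""
    payload: Dict[str, Any] = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "mistral:7b", timeout: float = 120.0) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]


if __name__ == "__main__":
    # Only builds the request; call generate() when an Ollama server is running.
    request = build_request("Create a professional invoice for web design services")
    print(request.full_url, request.get_method())
```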

🚧 Next Steps

We have an exciting roadmap ahead! Check out our TODO list for upcoming features and improvements. Here are some highlights:

In Progress

  • Fixing pyproject.toml TOML syntax error
  • Resolving MkDocs build warnings
  • Enhancing documentation

Coming Soon

  • More template examples
  • Improved AI features
  • Performance optimizations
  • Additional document format support

๐Ÿค Contributing

Contributions are welcome! Please read our Contributing Guidelines for details on how to contribute to this project.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📧 Contact

For any questions or suggestions, please contact info@softreck.dev.


Made with ❤️ by the Text2Doc Team
