Universal document converter with template support, OCR, and AI-powered processing. Convert between PDF, DOCX, HTML, XML, JSON, EPUB and more with a simple CLI or Python API.
Redoc is a modular document conversion framework that converts between PDF, HTML, XML, JSON, DOCX, and EPUB. It also provides OCR, AI-assisted content generation through Ollama (Mistral:7b), and a bidirectional template system that generates documents from data and extracts data from documents.
Features
Core Functionality
- Multi-format Support: Bidirectional conversion between PDF, HTML, XML, JSON, DOCX, and EPUB
- Template System: JSON+HTML templates for dynamic document generation with bidirectional support
- OCR Integration: Extract text from scanned documents and images with Tesseract OCR
- AI-Powered: Leverage Ollama Mistral:7b for intelligent content generation and processing
- Bidirectional Processing: Convert documents to data and back with templates
- Batch Processing: Process multiple documents efficiently with parallel execution
Advanced Capabilities
- Template Variables: Support for dynamic content and conditional rendering
- Validation: Built-in data validation with Pydantic models
- Extensible Architecture: Plugin system for custom formats and processors
- Asynchronous Processing: Non-blocking operations for high performance
- Web Interface: Modern UI for document conversion and management
Developer Experience
- Comprehensive API: Clean, well-documented Python API
- Command Line Interface: Intuitive CLI for quick conversions
- Interactive Shell: Built-in Python shell for exploration and debugging
- Logging & Debugging: Configurable logging and error reporting
- Type Hints: Full type annotations for better IDE support
Enterprise Ready
- Docker Support: Containerized deployment with Docker and Docker Compose
- REST API: Built with FastAPI for easy integration
- Security: Input validation, sanitization, and secure defaults
- Monitoring: Built-in metrics and health checks
Quick Start
Installation
Using pip (recommended)
# Install the latest stable version
pip install redoc
# Install with all optional dependencies
pip install "redoc[all]"
# Or install specific components
pip install "redoc[cli]" # Command line interface
pip install "redoc[server]" # Web server and API
pip install "redoc[ai]" # AI features (requires Ollama)
pip install "redoc[ocr]" # OCR capabilities (Tesseract)
pip install "redoc[templates]" # Pre-built templates
Using Docker (recommended for production)
# Pull the latest image
docker pull text2doc/redoc:latest
# Run a conversion
docker run -v $(pwd):/data text2doc/redoc convert input.pdf output.html
# Start the web interface
docker run -p 8000:8000 -v $(pwd)/templates:/app/templates text2doc/redoc serve
Development Installation
git clone https://github.com/text2doc/redoc.git
cd redoc
pip install -e ".[dev]" # Install in development mode with all dependencies
pre-commit install # Install git hooks
Basic Usage
Python API
from redoc import Redoc
# Initialize with default settings
converter = Redoc()
# Convert between formats
converter.convert('document.pdf', 'document.html') # PDF to HTML
converter.convert('data.json', 'report.pdf') # JSON to PDF with template
# Process multiple files
converter.batch_convert(
    input_glob='invoices/*.json',
    output_dir='output/',
    output_format='pdf',
    template='invoice.html'
)
# Extract data from documents
data = converter.extract_data('document.pdf', 'invoice_schema.json')
# Generate documents from templates
converter.generate_document(
    template='invoice.html',
    data='data.json',
    output='invoice.pdf'
)
# Use the interactive shell
converter.shell()
Command Line Interface
# Show help
redoc --help
# Convert a document
redoc convert input.pdf output.html
redoc convert --template invoice.html data.json invoice.pdf
# Start interactive shell
redoc shell
# Start web server
redoc serve --host 0.0.0.0 --port 8000
# Process multiple files
redoc batch "documents/*.pdf" --format html --output-dir html_output
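The `redoc batch` command above maps a glob of inputs onto an output directory. How Redoc implements this internally is not shown here; as a rough sketch of the idea, the snippet below plans output paths with `pathlib` and fans the work out over a thread pool. `plan_outputs`, `run_batch`, and the `convert_one` callable are hypothetical names used for illustration, not part of the Redoc API.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def plan_outputs(inputs: Iterable[str], output_dir: str, output_format: str) -> list[tuple[Path, Path]]:
    """Pair each input with an output path that keeps the stem and swaps the suffix."""
    out_dir = Path(output_dir)
    return [
        (Path(src), out_dir / f"{Path(src).stem}.{output_format}")
        for src in sorted(inputs)
    ]


def run_batch(
    pairs: list[tuple[Path, Path]],
    convert_one: Callable[[Path, Path], None],
    max_workers: int = 4,
) -> list[Path]:
    """Call convert_one(src, dst) for every pair in parallel; return outputs in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all workers and re-raises the first worker exception, if any.
        list(pool.map(lambda pair: convert_one(*pair), pairs))
    return [dst for _, dst in pairs]


if __name__ == "__main__":
    # Hypothetical stand-in for a real call such as converter.convert(src, dst).
    def fake_convert(src: Path, dst: Path) -> None:
        print(f"{src} -> {dst}")

    pairs = plan_outputs(["documents/b.pdf", "documents/a.pdf"], "html_output", "html")
    run_batch(pairs, fake_convert)
```

Threads suit this sketch because conversions are usually dominated by file and subprocess I/O; a process pool may be a better fit for CPU-heavy work such as OCR.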
Using Templates
from redoc import Redoc
converter = Redoc()
# Simple template with variables
template = {
    "template": "invoice.html",
    "data": {
        "invoice": {
            "number": "INV-2023-001",
            "date": "2023-11-15",
            "items": [
                {"description": "Web Design", "quantity": 10, "price": 100},
                {"description": "Hosting", "quantity": 1, "price": 50}
            ]
        }
    }
}
# Generate PDF from template
converter.convert(template, 'pdf', output_file='invoice.pdf')
# Extract data from document
data = converter.extract_data('invoice.pdf', template='invoice_template.html')
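The sample invoice data above carries quantities and prices but no totals. Whether a Redoc template computes totals itself is not stated here, so one option is to precompute them before handing the data to the converter. A minimal sketch, where `add_totals` and the `line_total` and `total` keys are hypothetical names chosen for illustration:

```python
from decimal import Decimal


def add_totals(invoice: dict) -> dict:
    """Return a copy of the invoice with per-line totals and a grand total added."""
    items = []
    for item in invoice["items"]:
        # Decimal avoids float rounding surprises with currency amounts.
        line_total = Decimal(str(item["quantity"])) * Decimal(str(item["price"]))
        items.append({**item, "line_total": line_total})
    total = sum((item["line_total"] for item in items), Decimal("0"))
    return {**invoice, "items": items, "total": total}


invoice = {
    "number": "INV-2023-001",
    "date": "2023-11-15",
    "items": [
        {"description": "Web Design", "quantity": 10, "price": 100},
        {"description": "Hosting", "quantity": 1, "price": 50},
    ],
}

enriched = add_totals(invoice)
print(enriched["total"])  # 1050
```

The function returns a new dictionary rather than mutating its input, so the original data can still be reused for other templates.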
Supported Conversions
| From \ To | PDF | HTML | XML | JSON | DOCX | EPUB |
|---|---|---|---|---|---|---|
| PDF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| HTML | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| XML | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| JSON | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| DOCX | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| EPUB | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
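A call such as `redoc convert input.pdf output.html` does not name the formats explicitly, so they presumably come from the file extensions. The sketch below shows one way to infer and check a conversion pair against the six formats listed in this README. `detect_format` and `conversion_pair` are hypothetical helpers, and treating `.htm` as HTML is an assumption.

```python
from pathlib import Path

# The six formats this README lists as supported.
SUPPORTED_FORMATS = {"pdf", "html", "xml", "json", "docx", "epub"}


def detect_format(path: str) -> str:
    """Infer a format name from the file extension, e.g. 'Report.PDF' -> 'pdf'."""
    suffix = Path(path).suffix.lower().lstrip(".")
    if suffix == "htm":  # assumption: treat .htm as an alias for .html
        suffix = "html"
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format for {path!r}: {suffix or 'no extension'}")
    return suffix


def conversion_pair(src: str, dst: str) -> tuple[str, str]:
    """Return the (source, target) formats for a conversion request."""
    return detect_format(src), detect_format(dst)


print(conversion_pair("input.pdf", "output.html"))  # ('pdf', 'html')
```

Failing early on an unknown extension gives a clearer error than letting a converter fail partway through a file.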
Conversion Features
- PDF Generation: High-quality PDF output with support for headers, footers, and page numbers
- HTML Processing: Clean HTML output with customizable CSS styling
- Data Extraction: Extract structured data from documents using templates
- Template Variables: Use Jinja2 syntax for dynamic content
- Batch Processing: Process multiple files in parallel
- OCR Support: Extract text from scanned documents and images
- AI-Powered: Enhance documents with AI-generated content
Project Structure
redoc/
├── src/
│   └── redoc/
│       ├── __init__.py          # Package initialization
│       ├── core.py              # Core conversion logic
│       ├── converters/          # Format-specific converters
│       │   ├── base.py          # Base converter class
│       │   ├── pdf_converter.py
│       │   ├── html_converter.py
│       │   ├── xml_converter.py
│       │   ├── json_converter.py
│       │   ├── docx_converter.py
│       │   └── epub_converter.py
│       ├── ocr/                 # OCR functionality
│       ├── templates/           # Default templates
│       └── utils/               # Utility functions
├── tests/                       # Test suite
├── examples/                    # Usage examples
├── docs/                        # Documentation
├── pyproject.toml               # Project configuration
└── README.md                    # This file
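The layout lists a base converter class in `converters/base.py` with one module per format, and the feature list mentions a plugin system for custom formats. The actual class design is not shown in this README. As an illustration only, the sketch below shows a common pattern for such a layout: an abstract base class plus a registry keyed by format pair. `BaseConverter`, `register`, `get_converter`, and `JsonToXmlConverter` are all hypothetical names.

```python
import json
from abc import ABC, abstractmethod
from xml.sax.saxutils import escape


class BaseConverter(ABC):
    """Hypothetical base class: one subclass per (source, target) format pair."""

    source_format: str
    target_format: str

    @abstractmethod
    def convert(self, content: str) -> str:
        """Turn content in source_format into content in target_format."""


# Hypothetical registry mapping (source, target) pairs to converter classes.
_REGISTRY: dict = {}


def register(cls):
    """Class decorator that records a converter under its format pair."""
    _REGISTRY[(cls.source_format, cls.target_format)] = cls
    return cls


def get_converter(source_format: str, target_format: str) -> BaseConverter:
    """Instantiate the converter registered for the given format pair."""
    key = (source_format, target_format)
    if key not in _REGISTRY:
        raise LookupError(f"No converter registered for {source_format} -> {target_format}")
    return _REGISTRY[key]()


@register
class JsonToXmlConverter(BaseConverter):
    """Toy converter for flat JSON objects; keys are assumed to be valid XML names."""

    source_format = "json"
    target_format = "xml"

    def convert(self, content: str) -> str:
        data = json.loads(content)
        body = "".join(f"<{key}>{escape(str(value))}</{key}>" for key, value in data.items())
        return f"<root>{body}</root>"


print(get_converter("json", "xml").convert('{"number": "INV-2023-001"}'))
# <root><number>INV-2023-001</number></root>
```

With this pattern, adding a format means adding a decorated subclass; the lookup code does not change.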
Advanced Usage
Using Templates
from redoc import Redoc
converter = Redoc()
# Convert JSON+HTML template to PDF
converter.convert(
    {
        "template": "invoice.html",
        "data": {
            "invoice_number": "INV-2023-001",
            "date": "2023-11-15",
            "items": [
                {"description": "Web Design", "quantity": 1, "price": 1200}
            ],
            "total": 1200
        }
    },
    'pdf',
    output_file='invoice.pdf'
)
OCR Processing
from redoc import Redoc
converter = Redoc()
# Extract text from scanned PDF with OCR
result = converter.ocr('scanned_document.pdf')
print(result['text'])
# Convert scanned document to searchable PDF
converter.ocr('scanned_document.pdf', output_file='searchable.pdf')
AI-Powered Content Generation
from redoc import Redoc
converter = Redoc()
# Generate document using AI
result = converter.generate(
    "Create a professional invoice for web design services",
    format='pdf',
    style='professional',
    output_file='ai_invoice.pdf'
)
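The AI features depend on a running Ollama server with the Mistral model available. How Redoc talks to Ollama is not documented here; the sketch below only shows what a direct request to Ollama's local HTTP API (`POST /api/generate` on the default port 11434) looks like, which can help verify the server before using `converter.generate`. `build_generate_request` and `generate_text` are hypothetical helper names, and the default URL assumes a local install with default settings.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "mistral:7b") -> urllib.request.Request:
    """Build a non-streaming generate request for the Ollama HTTP API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_text(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the Ollama server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    request = build_generate_request("Create a professional invoice for web design services")
    # Only inspects the request; call generate_text(...) once the server is running.
    print(request.get_method(), request.full_url)
```

If `generate_text` raises a connection error, the server is likely not running or the model has not been pulled yet.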
Next Steps
See the TODO list for upcoming features and improvements. Highlights:
In Progress
- Fixing pyproject.toml TOML syntax error
- Resolving MkDocs build warnings
- Enhancing documentation
Coming Soon
- More template examples
- Improved AI features
- Performance optimizations
- Additional document format support
Contributing
Contributions are welcome! Please read our Contributing Guidelines for details on how to contribute to this project.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contact
For any questions or suggestions, please contact info@softreck.dev.