Multi-level document converter: from PDF to XML, HTML, and JSON, and from JSON+HTML to XML, PDF, DOC, or EPUB, with OCR and a generator powered by Ollama Mistral:7b
Redoc - Universal Document Converter
Redoc is a powerful, modular document conversion framework that enables seamless transformation between various document formats including PDF, HTML, XML, JSON, DOCX, and EPUB. It features OCR capabilities and AI-powered content generation using Ollama Mistral:7b.
Features
- Multi-format Support: Convert between PDF, HTML, XML, JSON, DOCX, and EPUB
- Template-based Processing: Use JSON+HTML templates for dynamic document generation
- OCR Integration: Extract text from scanned documents and images
- Modular Architecture: Easily extendable with custom converters and processors
- AI-Powered: Leverage Ollama Mistral:7b for intelligent content generation
- Batch Processing: Process multiple documents efficiently
- CLI & API: Command-line interface and Python API for easy integration
Quick Start
Installation
# Install with pip
pip install redoc
# Or install from source
git clone https://github.com/text2doc/redoc.git
cd redoc
pip install -e .
Basic Usage
from redoc import Redoc
# Initialize the converter
converter = Redoc()
# Convert PDF to JSON
result = converter.convert('document.pdf', 'json')
# Convert HTML+JSON template to PDF
template = {
    "template": "invoice.html",
    "data": {
        "invoice_number": "INV-2023-001",
        "date": "2023-11-15",
        "total": "$1,200.00"
    }
}
converter.convert(template, 'pdf', output_file='invoice.pdf')
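To make the JSON+HTML template idea concrete, here is a minimal sketch of merging a data dictionary into an HTML string with Python's standard `string.Template`. It does not use Redoc itself: the `render_template` helper, the `$placeholder` syntax, and the inline HTML standing in for `invoice.html` are all assumptions for illustration, and Redoc's actual template engine may work differently.

```python
from string import Template


def render_template(html: str, data: dict) -> str:
    """Fill $placeholders in an HTML string with values from `data`.

    Hypothetical helper, not part of Redoc's documented API.
    """
    return Template(html).substitute(data)


# Hypothetical stand-in for the contents of invoice.html
INVOICE_HTML = (
    "<h1>Invoice $invoice_number</h1>"
    "<p>Date: $date</p>"
    "<p>Total: $total</p>"
)

rendered = render_template(
    INVOICE_HTML,
    {
        "invoice_number": "INV-2023-001",
        "date": "2023-11-15",
        "total": "$1,200.00",
    },
)
print(rendered)
```

`substitute` raises `KeyError` when the data is missing a placeholder, which surfaces incomplete template data early.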
Supported Conversions
| From \ To | PDF | HTML | XML | JSON | DOCX | EPUB |
|---|---|---|---|---|---|---|
| PDF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| HTML | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| XML | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| JSON | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| DOCX | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| EPUB | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
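Before calling a converter, it can help to validate the requested source and target formats against the matrix above. The following is a small sketch under stated assumptions: `SUPPORTED_FORMATS`, `detect_format`, and `check_conversion` are hypothetical names, and detecting the source format from the file extension is a simplification that Redoc may not share.

```python
from pathlib import Path

# Formats named in the conversion matrix (assumed to be identified by extension)
SUPPORTED_FORMATS = {"pdf", "html", "xml", "json", "docx", "epub"}


def detect_format(path: str) -> str:
    """Return the lowercase extension of `path` if it is a supported format."""
    suffix = Path(path).suffix.lower().lstrip(".")
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported source format: {suffix or '(none)'}")
    return suffix


def check_conversion(source_path: str, target_format: str) -> tuple[str, str]:
    """Validate a (source, target) pair and return it in normalized form."""
    source = detect_format(source_path)
    target = target_format.lower()
    if target not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported target format: {target}")
    return source, target


print(check_conversion("document.pdf", "JSON"))
```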
Project Structure
redoc/
├── src/
│   └── redoc/
│       ├── __init__.py          # Package initialization
│       ├── core.py              # Core conversion logic
│       ├── converters/          # Format-specific converters
│       │   ├── base.py          # Base converter class
│       │   ├── pdf_converter.py
│       │   ├── html_converter.py
│       │   ├── xml_converter.py
│       │   ├── json_converter.py
│       │   ├── docx_converter.py
│       │   └── epub_converter.py
│       ├── ocr/                 # OCR functionality
│       ├── templates/           # Default templates
│       └── utils/               # Utility functions
├── tests/                       # Test suite
├── examples/                    # Usage examples
├── docs/                        # Documentation
├── pyproject.toml               # Project configuration
└── README.md                    # This file
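The layout above suggests one converter module per format, all deriving from a shared base class in `converters/base.py`. The sketch below shows one way such a design could look; it is not taken from Redoc's source. `BaseConverter`, `register`, `REGISTRY`, `JsonToXmlConverter`, and the module-level `convert` function are hypothetical names, and the JSON-to-XML logic handles only a flat object whose keys are valid XML element names.

```python
import json
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class BaseConverter(ABC):
    """Hypothetical base class: one subclass per (source, target) pair."""

    source_format: str
    target_format: str

    @abstractmethod
    def convert(self, content: str) -> str:
        """Transform `content` from source_format to target_format."""


# Maps (source, target) to the converter class that handles that pair
REGISTRY: dict[tuple[str, str], type[BaseConverter]] = {}


def register(cls: type[BaseConverter]) -> type[BaseConverter]:
    """Class decorator that adds a converter to the registry."""
    REGISTRY[(cls.source_format, cls.target_format)] = cls
    return cls


@register
class JsonToXmlConverter(BaseConverter):
    source_format = "json"
    target_format = "xml"

    def convert(self, content: str) -> str:
        # Assumes a flat JSON object; nested values are stringified as-is
        data = json.loads(content)
        root = ET.Element("document")
        for key, value in data.items():
            ET.SubElement(root, key).text = str(value)
        return ET.tostring(root, encoding="unicode")


def convert(content: str, source: str, target: str) -> str:
    """Look up the registered converter for the pair and run it."""
    try:
        converter_cls = REGISTRY[(source, target)]
    except KeyError:
        raise ValueError(f"No converter registered for {source} -> {target}") from None
    return converter_cls().convert(content)


print(convert('{"title": "Report"}', "json", "xml"))
```

With a registry like this, adding a new format pair means writing one decorated subclass, without touching the dispatch code.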
Advanced Usage
Using Templates
from redoc import Redoc
converter = Redoc()
# Convert JSON+HTML template to PDF
converter.convert(
    {
        "template": "invoice.html",
        "data": {
            "invoice_number": "INV-2023-001",
            "date": "2023-11-15",
            "items": [
                {"description": "Web Design", "quantity": 1, "price": 1200}
            ],
            "total": 1200
        }
    },
    'pdf',
    output_file='invoice.pdf'
)
OCR Processing
from redoc import Redoc
converter = Redoc()
# Extract text from scanned PDF with OCR
result = converter.ocr('scanned_document.pdf')
print(result['text'])
# Convert scanned document to searchable PDF
converter.ocr('scanned_document.pdf', output_file='searchable.pdf')
AI-Powered Content Generation
from redoc import Redoc
converter = Redoc()
# Generate document using AI
result = converter.generate(
    "Create a professional invoice for web design services",
    format='pdf',
    style='professional',
    output_file='ai_invoice.pdf'
)
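For context on what "powered by Ollama Mistral:7b" can involve, the sketch below sends a prompt to a locally running Ollama server over its REST API using only the standard library. How Redoc talks to Ollama internally is not documented here, so treat this as an assumption: `build_request` and `generate_text` are hypothetical helpers, and the URL is Ollama's default local endpoint, which requires `ollama serve` to be running with the `mistral:7b` model pulled.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "mistral:7b") -> urllib.request.Request:
    """Build a non-streaming generation request for the Ollama REST API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_text(prompt: str, model: str = "mistral:7b", timeout: float = 120.0) -> str:
    """Send the prompt to Ollama and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call (needs a running Ollama server, so it is left commented out):
# print(generate_text("Create a professional invoice for web design services"))
```

The generated text could then be placed into a template's data dictionary and rendered to the desired output format.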
Contributing
Contributions are welcome! Please read our Contributing Guidelines for details on how to contribute to this project.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contact
For any questions or suggestions, please contact info@softreck.dev.
File details
Details for the file redoc-0.1.7.tar.gz.
File metadata
- Download URL: redoc-0.1.7.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.11.12 Linux/6.14.9-300.fc42.x86_64
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 40659a70741b4da6644184ef7dd2737c6c1ec653fc968bd9be252599fbf26a50 |
| MD5 | 0de3a2a858ac17a902b2ec6cfb79207d |
| BLAKE2b-256 | 3177b2770f963fa1e524c3c145ced07ced4d59155a31daa8b2545ab48927a2d1 |
File details
Details for the file redoc-0.1.7-py3-none-any.whl.
File metadata
- Download URL: redoc-0.1.7-py3-none-any.whl
- Upload date:
- Size: 10.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.11.12 Linux/6.14.9-300.fc42.x86_64
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ca731bd130a4c009cf62c94226598f1ef6f6753d95a0ff2fae104227b954d158 |
| MD5 | 368668831b4a3d577706ee8c102c8e58 |
| BLAKE2b-256 | 5bb7c3a7a376495ee40513c92e9bd64bf9349033470e063a22898965d62fe4bc |
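The SHA256 digests listed for each file can be used to check a download before installing it. A minimal sketch with the standard `hashlib` module follows; `sha256_of` and `verify` are hypothetical helper names, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with the published value."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming redoc-0.1.7.tar.gz has been downloaded to the current directory:
# verify(
#     "redoc-0.1.7.tar.gz",
#     "40659a70741b4da6644184ef7dd2737c6c1ec653fc968bd9be252599fbf26a50",
# )
```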