
Multi-level document converter: from PDF to XML, HTML, or JSON, and from JSON+HTML to XML, PDF, DOC, or EPUB, with OCR and a generator powered by Ollama Mistral:7b


Redoc - Universal Document Converter


Redoc is a modular document conversion framework that converts between PDF, HTML, XML, JSON, DOCX, and EPUB. It includes OCR for scanned input and AI-powered content generation using Ollama Mistral:7b.

🌟 Features

  • Multi-format Support: Convert between PDF, HTML, XML, JSON, DOCX, and EPUB
  • Template-based Processing: Use JSON+HTML templates for dynamic document generation
  • OCR Integration: Extract text from scanned documents and images
  • Modular Architecture: Easily extendable with custom converters and processors
  • AI-Powered: Leverage Ollama Mistral:7b for intelligent content generation
  • Batch Processing: Process multiple documents efficiently
  • CLI & API: Command-line interface and Python API for easy integration

🚀 Quick Start

Installation

# Install with pip
pip install redoc

# Or install from source
git clone https://github.com/text2doc/redoc.git
cd redoc
pip install -e .

Basic Usage

from redoc import Redoc

# Initialize the converter
converter = Redoc()

# Convert PDF to JSON
result = converter.convert('document.pdf', 'json')

# Convert HTML+JSON template to PDF
template = {
    "template": "invoice.html",
    "data": {
        "invoice_number": "INV-2023-001",
        "date": "2023-11-15",
        "total": "$1,200.00"
    }
}
converter.convert(template, 'pdf', output_file='invoice.pdf')
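
The feature list mentions batch processing, but no batch call is shown here. The following is a minimal sketch of one way to loop over several files, assuming only the `convert(source, target_format)` call from the example above. The helper name `convert_many` is hypothetical and not part of Redoc; it takes the convert function as an argument so it can be tried without the library installed.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def convert_many(
    convert: Callable[[str, str], Any],
    paths: Iterable[str],
    target: str,
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Apply `convert` to each path and collect results and failures.

    Hypothetical helper: `convert` is expected to behave like
    `Redoc().convert(path, target)` from the Basic Usage example.
    """
    results: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    for path in paths:
        key = str(path)
        try:
            results[key] = convert(key, target)
        except Exception as exc:  # keep going; record the failure per file
            errors[key] = exc
    return results, errors


# Possible usage, assuming redoc is installed:
#   from redoc import Redoc
#   results, errors = convert_many(Redoc().convert, ["a.pdf", "b.pdf"], "json")
```

Collecting errors per file instead of raising on the first failure keeps one unreadable document from stopping the whole run.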

📚 Supported Conversions

From \ To   PDF   HTML  XML   JSON  DOCX  EPUB
PDF         ❌    ✅    ✅    ✅    ✅    ✅
HTML        ✅    ❌    ✅    ✅    ✅    ✅
XML         ✅    ✅    ❌    ✅    ✅    ✅
JSON        ✅    ✅    ✅    ❌    ✅    ✅
DOCX        ✅    ✅    ✅    ✅    ❌    ✅
EPUB        ✅    ✅    ✅    ✅    ✅    ❌
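
The table reduces to one rule: every pair of distinct formats is supported, and converting a format to itself is not. The sketch below encodes that rule for checking a request before calling the converter. The names `FORMATS`, `is_supported`, and `targets_for` are illustrative and are not part of the Redoc API.

```python
# Formats listed in the Supported Conversions table (lowercase names assumed).
FORMATS = ("pdf", "html", "xml", "json", "docx", "epub")


def is_supported(source: str, target: str) -> bool:
    """Return True if the table marks source -> target as supported."""
    source, target = source.lower(), target.lower()
    return source in FORMATS and target in FORMATS and source != target


def targets_for(source: str) -> list:
    """List every target format the table allows for a given source."""
    return [fmt for fmt in FORMATS if is_supported(source, fmt)]
```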

๐Ÿ—๏ธ Project Structure

redoc/
โ”œโ”€โ”€ src/
โ”‚   โ””โ”€โ”€ redoc/
โ”‚       โ”œโ”€โ”€ __init__.py          # Package initialization
โ”‚       โ”œโ”€โ”€ core.py             # Core conversion logic
โ”‚       โ”œโ”€โ”€ converters/         # Format-specific converters
โ”‚       โ”‚   โ”œโ”€โ”€ base.py         # Base converter class
โ”‚       โ”‚   โ”œโ”€โ”€ pdf_converter.py
โ”‚       โ”‚   โ”œโ”€โ”€ html_converter.py
โ”‚       โ”‚   โ”œโ”€โ”€ xml_converter.py
โ”‚       โ”‚   โ”œโ”€โ”€ json_converter.py
โ”‚       โ”‚   โ”œโ”€โ”€ docx_converter.py
โ”‚       โ”‚   โ””โ”€โ”€ epub_converter.py
โ”‚       โ”œโ”€โ”€ ocr/                # OCR functionality
โ”‚       โ”œโ”€โ”€ templates/          # Default templates
โ”‚       โ””โ”€โ”€ utils/              # Utility functions
โ”œโ”€โ”€ tests/                      # Test suite
โ”œโ”€โ”€ examples/                   # Usage examples
โ”œโ”€โ”€ docs/                       # Documentation
โ”œโ”€โ”€ pyproject.toml              # Project configuration
โ””โ”€โ”€ README.md                   # This file
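
The layout suggests that each format-specific converter builds on the base class in converters/base.py. The real interface of that class is not documented here, so the following is only a sketch of how such a plug-in design could look. `BaseConverter`, `REGISTRY`, `register`, and `MarkdownConverter` are all hypothetical names defined in the snippet itself, not imports from Redoc.

```python
from abc import ABC, abstractmethod
from html import escape
from typing import Dict, Type


class BaseConverter(ABC):
    """Hypothetical stand-in for the class in converters/base.py."""

    source_format: str = ""

    @abstractmethod
    def convert(self, content: str, target_format: str) -> str:
        """Convert `content` from `source_format` to `target_format`."""


# Maps a source format name to the converter class that handles it.
REGISTRY: Dict[str, Type[BaseConverter]] = {}


def register(cls: Type[BaseConverter]) -> Type[BaseConverter]:
    """Class decorator that records a converter under its source format."""
    REGISTRY[cls.source_format] = cls
    return cls


@register
class MarkdownConverter(BaseConverter):
    """Toy converter: '# ' lines become <h1>, other non-empty lines <p>."""

    source_format = "md"

    def convert(self, content: str, target_format: str) -> str:
        if target_format != "html":
            raise ValueError(f"unsupported target: {target_format}")
        out = []
        for line in content.splitlines():
            if line.startswith("# "):
                out.append(f"<h1>{escape(line[2:])}</h1>")
            elif line.strip():
                out.append(f"<p>{escape(line)}</p>")
        return "\n".join(out)
```

A registry keyed by source format lets the core look up a converter by name without importing every module explicitly.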

🔧 Advanced Usage

Using Templates

from redoc import Redoc

converter = Redoc()

# Convert JSON+HTML template to PDF
converter.convert(
    {
        "template": "invoice.html",
        "data": {
            "invoice_number": "INV-2023-001",
            "date": "2023-11-15",
            "items": [
                {"description": "Web Design", "quantity": 1, "price": 1200}
            ],
            "total": 1200
        }
    },
    'pdf',
    output_file='invoice.pdf'
)
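
Both template examples pass a dictionary with a "template" key (an HTML file name) and a "data" key (the values to fill in). How Redoc reacts to a malformed dictionary is not described here, so a small check before calling `convert` can make mistakes easier to spot. The helper below is a sketch; `validate_template_spec` is a hypothetical name and not part of Redoc.

```python
def validate_template_spec(spec: dict) -> dict:
    """Check that `spec` has the shape used in the template examples.

    Returns the spec unchanged so the call can be inlined, for example:
        converter.convert(validate_template_spec(spec), 'pdf')
    """
    if not isinstance(spec, dict):
        raise TypeError("template spec must be a dict")
    missing = {"template", "data"} - spec.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(spec["template"], str):
        raise TypeError("'template' must be a file name string")
    if not isinstance(spec["data"], dict):
        raise TypeError("'data' must be a dict of template values")
    return spec
```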

OCR Processing

from redoc import Redoc

converter = Redoc()

# Extract text from scanned PDF with OCR
result = converter.ocr('scanned_document.pdf')
print(result['text'])

# Convert scanned document to searchable PDF
converter.ocr('scanned_document.pdf', output_file='searchable.pdf')

AI-Powered Content Generation

from redoc import Redoc

converter = Redoc()

# Generate document using AI
result = converter.generate(
    "Create a professional invoice for web design services",
    format='pdf',
    style='professional',
    output_file='ai_invoice.pdf'
)

๐Ÿค Contributing

Contributions are welcome! Please read our Contributing Guidelines for details on how to contribute to this project.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📧 Contact

For any questions or suggestions, please contact info@softreck.dev.


Made with ❤️ by Text2Doc Team

