LLM Data Converter v2.0.0

Convert any document, text, or URL into LLM-ready data format with advanced neural OCR capabilities powered by state-of-the-art pre-trained models.

Installation

pip install llm-data-converter

Requirements:

Python 3.8 or higher

System Dependencies for Neural OCR

For neural OCR functionality to work properly, you may need to install additional system dependencies:

Ubuntu/Debian:

sudo apt update
sudo apt install -y libgl1 libglib2.0-0 libgomp1
pip install setuptools

macOS:

# Usually not needed, but if you encounter OpenGL issues:
brew install mesa

Note: The package will automatically download and cache neural models on first use.
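
Before the first conversion it can help to confirm that the Python-level dependencies are importable. The sketch below is not part of the package: `missing_modules` and `EXPECTED_MODULES` are hypothetical names, and the list assumes the third-party libraries named in this README are installed under their usual import names. It does not check the system libraries (such as `libgl1`) or the cached models.

```python
import importlib.util
from typing import Iterable, List

# Assumed top-level import names for the libraries this README lists.
# Whether llm_converter needs every one of them at runtime is an assumption.
EXPECTED_MODULES = (
    "llm_converter", "easyocr", "torch", "transformers", "PIL",
    "docx", "pandas", "numpy", "pdf2image", "markdownify",
)


def missing_modules(names: Iterable[str] = EXPECTED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be located.

    find_spec() only looks the module up; it does not import it,
    so the check stays cheap even for heavy packages.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules()` with no arguments returns an empty list when everything in the assumed list can be found.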

Quick Start

from llm_converter import FileConverter

# Basic conversion with neural OCR
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)

Features

Multiple Input Formats: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
Multiple Output Formats: Markdown, HTML, JSON, Plain Text
LLM Integration: Exports are plain strings (or a dict for JSON) that can be passed directly to LiteLLM or other LLM libraries
Local Processing: Conversion runs on your own machine; the neural models are downloaded once and then cached
Layout Preservation: Maintain document structure and formatting
Neural OCR: Advanced document understanding powered by state-of-the-art pre-trained models:
- Layout Detection: Neural models for document structure understanding
- Text Recognition: High-accuracy OCR with confidence scoring
- Table Structure: Intelligent table detection and parsing with proper markdown output
- Automatic Model Download: Models are automatically downloaded and cached

Neural Document Processing

Version 2.0.0 introduces advanced neural document processing capabilities:

Neural OCR (Default)

Neural OCR is the default processing mode. It uses pre-trained models to provide:

Layout Detection: Advanced neural models for document structure understanding
Text Recognition: High-accuracy OCR with confidence scoring
Table Structure: Intelligent table detection and parsing with proper markdown output
Automatic Model Download: Models are automatically downloaded on first use
Document Understanding: Comprehensive document analysis beyond simple OCR
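
To make "proper markdown output" for tables concrete, here is a rough sketch of how rows recovered from a detected table could be rendered as a pipe-delimited Markdown table. It is an illustration only, not the package's implementation; `rows_to_markdown` is a hypothetical helper, and the assumption is that a table arrives as a list of rows with the header first.

```python
from typing import List, Sequence


def rows_to_markdown(rows: Sequence[Sequence[object]]) -> str:
    """Render rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: Sequence[object]) -> str:
        # Escape pipes and flatten newlines so a cell cannot break the table.
        cells = [str(c).replace("|", "\\|").replace("\n", " ") for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    lines: List[str] = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in rows[1:]]
    return "\n".join(lines)
```

Short rows are padded with empty cells so every line has the same number of columns, which most Markdown renderers require.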

Usage Examples

Convert PDF to Markdown

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)

Convert URL to HTML

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert_url("https://example.com").to_html()
print(result)
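
Because files and URLs go through different methods (`convert` and `convert_url`), a small dispatcher can be convenient when inputs are mixed. The sketch below is an assumption-laden helper, not part of the package: `is_url` and `convert_any` are hypothetical names, and only `http` and `https` sources are treated as URLs.

```python
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """True for http(s) URLs with a host; local paths return False."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def convert_any(converter, source: str):
    """Call converter.convert_url() for URLs and converter.convert() otherwise.

    `converter` is expected to expose the two methods described in this
    README (for example a FileConverter instance).
    """
    if is_url(source):
        return converter.convert_url(source)
    return converter.convert(source)
```

With a real converter this would be called as `convert_any(FileConverter(), "https://example.com").to_html()`.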

Convert Excel to JSON

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("data.xlsx").to_json()
print(result)
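
The API reference in this README says `to_json()` returns a `dict`, so persisting it is a matter of serializing that dict. A minimal sketch, assuming nothing about the dict's structure (which is not documented here); `save_json` is a hypothetical helper:

```python
import json
from pathlib import Path


def save_json(data: dict, out_path: str) -> Path:
    """Write a dict to out_path as UTF-8 JSON, creating parent folders."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        # default=str is a fallback for values json cannot encode natively
        # (for example dates coming from spreadsheet cells).
        json.dump(data, fh, ensure_ascii=False, indent=2, default=str)
    return path
```

With a real converter this would look like `save_json(converter.convert("data.xlsx").to_json(), "out/data.json")`.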

Chain with LLM

from llm_converter import FileConverter
from litellm import completion

converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()

# Use with any LLM supported by LiteLLM (the provider's API key, e.g. OPENAI_API_KEY, must be set in the environment)
response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
    ]
)

print(response.choices[0].message.content)
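
Long documents can exceed a model's context window, so the Markdown is often split before it is sent. The sketch below splits on blank lines and packs paragraphs into chunks under a character budget. `chunk_markdown` and the 8,000-character default are assumptions for illustration, not part of llm_converter or LiteLLM, and character count is only a rough proxy for tokens.

```python
from typing import List


def chunk_markdown(text: str, max_chars: int = 8000) -> List[str]:
    """Split text on blank lines into chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        # A single paragraph longer than the budget is cut into fixed slices.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be summarized with its own `completion` call and the partial summaries combined in a final call.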

Supported Formats

Input Formats

Documents: PDF, DOCX, TXT
Web: URLs, HTML files
Data: Excel (XLSX, XLS), CSV
Images: PNG, JPG, JPEG (with neural OCR capabilities)
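
The input list above can be expressed as a lookup table, which is handy for rejecting unsupported files before calling the converter. This is a sketch built only from the extensions named in this README; `EXTENSION_CATEGORIES` and `input_category` are hypothetical names, and the package may accept extensions that are not listed here.

```python
from pathlib import Path
from typing import Optional

# Extensions and groupings taken from the "Input Formats" list.
EXTENSION_CATEGORIES = {
    ".pdf": "document", ".docx": "document", ".txt": "document",
    ".html": "web",
    ".xlsx": "data", ".xls": "data", ".csv": "data",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
}


def input_category(path: str) -> Optional[str]:
    """Return the category for a file path, or None if the extension is not listed."""
    return EXTENSION_CATEGORIES.get(Path(path).suffix.lower())
```

The lookup is case-insensitive, so `Report.PDF` and `report.pdf` are treated alike.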

Output Formats

Markdown: Clean, structured markdown with proper table formatting
HTML: Formatted HTML with styling
JSON: Structured JSON data
Plain Text: Simple text extraction

Advanced Usage

Custom Configuration

from llm_converter import FileConverter

converter = FileConverter(
    preserve_layout=True,
    include_images=True,
    ocr_enabled=True
)

result = converter.convert("document.pdf").to_markdown()
print(result)

Batch Processing

from llm_converter import FileConverter

converter = FileConverter()
files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]

results = []
for file in files:
    result = converter.convert(file).to_markdown()
    results.append(result)
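
In the loop above, one unreadable file stops the whole batch. A more tolerant variant might collect failures separately, as sketched below. `convert_many` is a hypothetical helper; the README does not document which exceptions `convert()` raises, so the sketch catches `Exception` broadly.

```python
from typing import Dict, Iterable, Tuple


def convert_many(converter, paths: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Convert each path to Markdown; return (results, errors) keyed by path."""
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = converter.convert(path).to_markdown()
        except Exception as exc:  # exception types are not documented here
            errors[path] = f"{type(exc).__name__}: {exc}"
    return results, errors
```

With a real converter this would be called as `results, errors = convert_many(FileConverter(), files)`.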

Testing Neural OCR

# Test the neural OCR capabilities
from llm_converter.pipeline.neural_document_processor import NeuralDocumentProcessor

# Initialize neural document processor
processor = NeuralDocumentProcessor()

# Extract text with layout awareness
text = processor.extract_text_with_layout("sample.png")
print(text)

API Reference

FileConverter

Main class for converting documents to LLM-ready formats.

Methods

convert(file_path: str) -> ConversionResult: Convert a file to internal format
convert_url(url: str) -> ConversionResult: Convert the contents of the page at a URL to internal format
convert_text(text: str) -> ConversionResult: Convert plain text to internal format

ConversionResult

Result object with methods to export to different formats.

Methods

to_markdown() -> str: Export as markdown
to_html() -> str: Export as HTML
to_json() -> dict: Export as JSON
to_text() -> str: Export as plain text
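
When the output format is chosen at runtime (for example from a command-line flag), the four export methods can be selected by name. A minimal sketch using only the method names listed above; `EXPORTERS` and `export` are hypothetical and not part of the package:

```python
# Maps a user-facing format name to the ConversionResult method listed above.
EXPORTERS = {
    "markdown": "to_markdown",
    "html": "to_html",
    "json": "to_json",
    "text": "to_text",
}


def export(result, fmt: str):
    """Call the export method on `result` that matches `fmt` (case-insensitive)."""
    try:
        method_name = EXPORTERS[fmt.lower()]
    except KeyError:
        raise ValueError(
            f"Unsupported format {fmt!r}; expected one of {sorted(EXPORTERS)}"
        ) from None
    return getattr(result, method_name)()
```

With a real converter this would be called as `export(converter.convert("document.pdf"), "markdown")`.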

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests
Submit a pull request

License

MIT License - see LICENSE file for details.

Third-Party Dependencies

This project uses several third-party libraries:

EasyOCR - Apache 2.0 License (https://github.com/JaidedAI/EasyOCR)
PyTorch - BSD 3-Clause License (https://pytorch.org/)
Transformers - Apache 2.0 License (https://github.com/huggingface/transformers)
Pillow - HPND License (https://python-pillow.org/)
python-docx - MIT License (https://github.com/python-openxml/python-docx)
pandas - BSD 3-Clause License (https://pandas.pydata.org/)
numpy - BSD 3-Clause License (https://numpy.org/)
pdf2image - MIT License (https://github.com/Belval/pdf2image)
markdownify - MIT License (https://github.com/matthewwithanm/markdownify)

All dependencies are used in accordance with their respective licenses.
