LLM Data Converter v2.0.0

Convert any document, text, or URL into LLM-ready data format with advanced neural OCR capabilities powered by state-of-the-art pre-trained models.

Installation

pip install llm-data-converter

Requirements:

  • Python 3.8 or higher

System Dependencies for Neural OCR

For neural OCR functionality to work properly, you may need to install additional system dependencies:

Ubuntu/Debian:

sudo apt update
sudo apt install -y libgl1 libglib2.0-0 libgomp1
pip install setuptools

macOS:

# Usually not needed, but if you encounter OpenGL issues:
brew install mesa

Note: The package will automatically download and cache neural models on first use.

Quick Start

from llm_converter import FileConverter

# Basic conversion with neural OCR
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)

Features

  • Multiple Input Formats: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
  • Multiple Output Formats: Markdown, HTML, JSON, Plain Text
  • LLM Integration: Seamless integration with LiteLLM and other LLM libraries
  • Local Processing: Process documents on your own machine, without sending them to an external service
  • Layout Preservation: Maintain document structure and formatting
  • Neural OCR: Advanced document understanding powered by state-of-the-art pre-trained models:
    • Layout Detection: Neural models for document structure understanding
    • Text Recognition: High-accuracy OCR with confidence scoring
    • Table Structure: Intelligent table detection and parsing with proper markdown output
    • Automatic Model Download: Models are automatically downloaded and cached

Neural Document Processing

Version 2.0.0 introduces advanced neural document processing capabilities:

Neural OCR (Default)

Uses state-of-the-art pre-trained models for superior accuracy:

  • Layout Detection: Advanced neural models for document structure understanding
  • Text Recognition: High-accuracy OCR with confidence scoring
  • Table Structure: Intelligent table detection and parsing with proper markdown output
  • Automatic Model Download: Models are automatically downloaded on first use
  • Document Understanding: Comprehensive document analysis beyond simple OCR

Usage Examples

Convert PDF to Markdown

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)

Convert URL to HTML

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert_url("https://example.com").to_html()
print(result)

Convert Excel to JSON

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("data.xlsx").to_json()
print(result)

Chain with LLM

from llm_converter import FileConverter
from litellm import completion

converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()

# Use with any LLM
response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
    ]
)

print(response.choices[0].message.content)
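Long documents may not fit in a single prompt. The helper below is a minimal sketch, not part of llm_converter: `split_markdown` and its `max_chars` parameter are hypothetical names, and the character limit is only a rough stand-in for a model's token limit. It splits the Markdown returned by `to_markdown()` on blank lines so that each chunk can be sent in its own request.

```python
from typing import List


def split_markdown(text: str, max_chars: int = 8000) -> List[str]:
    """Split text into chunks of at most max_chars, breaking on blank lines."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        # Hard-split a paragraph that is longer than the limit on its own.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]

        # Try to append the paragraph to the chunk being built.
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph

    if current:
        chunks.append(current)
    return chunks
```

Each element of `split_markdown(document_content)` can then be placed in the user message in turn, with the per-chunk summaries combined in a final call.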

Supported Formats

Input Formats

  • Documents: PDF, DOCX, TXT
  • Web: URLs, HTML files
  • Data: Excel (XLSX, XLS), CSV
  • Images: PNG, JPG, JPEG (with neural OCR capabilities)
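When inputs come from mixed sources, it can help to decide up front whether a string should go to `convert()` or `convert_url()`. The following is a sketch under assumptions: `classify_input` and `FILE_EXTENSIONS` are hypothetical names, and the extension set is copied from the list above, so the package itself may accept more.

```python
from pathlib import Path

# Extensions taken from the "Input Formats" list; the package may support others.
FILE_EXTENSIONS = {
    ".pdf", ".docx", ".txt", ".html",
    ".xlsx", ".xls", ".csv",
    ".png", ".jpg", ".jpeg",
}


def classify_input(source: str) -> str:
    """Return "url", "file", or "unsupported" for a source string."""
    if source.lower().startswith(("http://", "https://")):
        return "url"
    if Path(source).suffix.lower() in FILE_EXTENSIONS:
        return "file"
    return "unsupported"
```

A caller could pass sources classified as "url" to `convert_url()`, sources classified as "file" to `convert()`, and report the rest before any conversion starts.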

Output Formats

  • Markdown: Clean, structured markdown with proper table formatting
  • HTML: Formatted HTML with styling
  • JSON: Structured JSON data
  • Plain Text: Simple text extraction

Advanced Usage

Custom Configuration

from llm_converter import FileConverter

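converter = FileConverter(
    preserve_layout=True,   # keep document structure and formatting
    include_images=True,    # include images in the output
    ocr_enabled=True        # run neural OCR on scanned pages and images
)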

result = converter.convert("document.pdf").to_markdown()
print(result)

Batch Processing

from llm_converter import FileConverter

converter = FileConverter()
files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]

results = []
for file in files:
    result = converter.convert(file).to_markdown()
    results.append(result)
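In the loop above, one unreadable file stops the whole batch. The variant below is a minimal sketch that records failures and carries on. `convert_many` is a hypothetical helper, not part of the package, and it catches the broad `Exception` because the package's specific exception types are not documented here.

```python
from typing import Dict, Iterable, Tuple


def convert_many(
    converter, paths: Iterable[str]
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Convert each path to Markdown, collecting failures instead of stopping."""
    converted: Dict[str, str] = {}
    failed: Dict[str, Exception] = {}
    for path in paths:
        try:
            converted[path] = converter.convert(path).to_markdown()
        except Exception as exc:  # assumption: exact exception types are unknown
            failed[path] = exc
    return converted, failed
```

Called as `converted, failed = convert_many(FileConverter(), files)`, it returns the Markdown keyed by path, plus the exception raised for each file that could not be converted.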

Testing Neural OCR

# Test the neural OCR capabilities
from llm_converter.pipeline.neural_document_processor import NeuralDocumentProcessor

# Initialize neural document processor
processor = NeuralDocumentProcessor()

# Extract text with layout awareness
text = processor.extract_text_with_layout("sample.png")
print(text)

API Reference

FileConverter

Main class for converting documents to LLM-ready formats.

Methods

  • convert(file_path: str) -> ConversionResult: Convert a file to internal format
  • convert_url(url: str) -> ConversionResult: Convert the contents of the page at a URL to internal format
  • convert_text(text: str) -> ConversionResult: Convert plain text to internal format

ConversionResult

Result object with methods to export to different formats.

Methods

  • to_markdown() -> str: Export as markdown
  • to_html() -> str: Export as HTML
  • to_json() -> dict: Export as JSON
  • to_text() -> str: Export as plain text
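Because `to_json()` is listed as returning a dict while the other exporters return strings, writing results to disk needs one extra serialization step. The sketch below illustrates this with a hypothetical helper, `save_result`, that picks the exporter from the output file's suffix. The suffix-to-method mapping is an assumption made for this example, not package behavior.

```python
import json
from pathlib import Path


def save_result(result, out_path: str) -> Path:
    """Write a ConversionResult-like object to disk, choosing the exporter by suffix."""
    path = Path(out_path)
    suffix = path.suffix.lower()
    if suffix == ".md":
        content = result.to_markdown()
    elif suffix == ".html":
        content = result.to_html()
    elif suffix == ".json":
        # to_json() returns a dict, so serialize it explicitly.
        content = json.dumps(result.to_json(), indent=2, ensure_ascii=False)
    elif suffix == ".txt":
        content = result.to_text()
    else:
        raise ValueError(f"Unsupported output suffix: {suffix!r}")
    path.write_text(content, encoding="utf-8")
    return path
```

For example, `save_result(converter.convert("data.xlsx"), "data.json")` would write the JSON export next to the script.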

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Third-Party Dependencies

This project uses several third-party libraries. All dependencies are used in accordance with their respective licenses.
