LLM Data Converter

Convert any document, text, or URL into LLM-ready data format.

Installation

pip install llm-data-converter

System Dependencies for OCR

For OCR functionality to work properly, you may need to install additional system dependencies:

Ubuntu/Debian:

sudo apt update
# On newer releases where libgl1-mesa-glx is no longer packaged, install libgl1 instead.
sudo apt install -y libgl1-mesa-glx libglib2.0-0

macOS:

# Usually not needed, but if you encounter OpenGL issues:
brew install mesa

Windows:

# Usually not needed, but if you encounter OpenGL issues:
# Install the latest graphics drivers from your GPU manufacturer

Note: The package will automatically detect if OpenGL is available and provide helpful warnings if system dependencies are missing.
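If you want to check for OpenGL yourself before running OCR, a minimal sketch using only the standard library could look like the following. This is not the package's own detection logic, which is not shown here; `opengl_available` and the list of library names it tries are assumptions made for illustration.

```python
import ctypes
import ctypes.util


def opengl_available() -> bool:
    """Return True if a system OpenGL shared library can be located and loaded.

    Hypothetical helper: the candidate names below are common library names
    on Linux ("GL"), macOS ("OpenGL"), and Windows ("opengl32").
    """
    for name in ("GL", "OpenGL", "opengl32"):
        # find_library returns None when no matching library is found.
        path = ctypes.util.find_library(name)
        if path is None:
            continue
        try:
            ctypes.CDLL(path)
            return True
        except OSError:
            # The library was located but could not be loaded; try the next name.
            continue
    return False


if __name__ == "__main__":
    if opengl_available():
        print("OpenGL library found")
    else:
        print("OpenGL library not found; see the system dependencies above")
```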

Quick Start

from llm_converter import FileConverter
from litellm import completion

# Basic conversion
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()

# Pass the result to LLM
response = completion(
    model="openai/gpt-4o",
    messages=[{"content": f"Extract info from this document: \n{result}", "role": "user"}]
)

Features

  • Multiple Input Formats: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
  • Multiple Output Formats: Markdown, HTML, JSON, Plain Text
  • LLM Integration: Seamless integration with LiteLLM and other LLM libraries
  • Local Processing: Process documents locally, without sending them to an external service
  • Layout Preservation: Maintain document structure and formatting

Usage Examples

Convert PDF to Markdown

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)

Convert URL to HTML

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert_url("https://example.com").to_html()
print(result)

Convert Excel to JSON

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("data.xlsx").to_json()
print(result)

Chain with LLM

from llm_converter import FileConverter
from litellm import completion

converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()

# Use with any LLM
response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
    ]
)

print(response.choices[0].message.content)

Supported Formats

Input Formats

  • Documents: PDF, DOCX, TXT
  • Web: URLs, HTML files
  • Data: Excel (XLSX, XLS), CSV
  • Images: PNG, JPG, JPEG (with OCR capabilities)

Output Formats

  • Markdown: Clean, structured markdown
  • HTML: Formatted HTML with styling
  • JSON: Structured JSON data
  • Plain Text: Simple text extraction
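To reject unsupported inputs before calling the converter, the input format list above can be turned into a small lookup. This is only a sketch: `INPUT_EXTENSIONS` and `input_category` are hypothetical names, not part of the package, and the extensions are limited to those named in the list.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the input format list above. The grouping names are
# illustrative and are not part of the package's API.
INPUT_EXTENSIONS = {
    "documents": {".pdf", ".docx", ".txt"},
    "web": {".html"},
    "data": {".xlsx", ".xls", ".csv"},
    "images": {".png", ".jpg", ".jpeg"},
}


def input_category(source: str) -> Optional[str]:
    """Return the category a source belongs to, or None if it is not listed."""
    # URLs are listed under "Web" alongside HTML files.
    if source.lower().startswith(("http://", "https://")):
        return "web"
    suffix = Path(source).suffix.lower()
    for category, extensions in INPUT_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

A caller might skip or log any source for which `input_category` returns `None` rather than passing it to the converter.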

Advanced Usage

Custom Configuration

from llm_converter import FileConverter

converter = FileConverter(
    preserve_layout=True,
    include_images=True,
    ocr_enabled=True
)

result = converter.convert("document.pdf").to_markdown()

Batch Processing

from llm_converter import FileConverter

converter = FileConverter()
files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]

# Convert each file in turn and collect the Markdown output in order
results = []
for path in files:
    results.append(converter.convert(path).to_markdown())
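In a larger batch, one unreadable file would stop the loop above. A possible variant, sketched under the assumption that a failed conversion raises an exception, records failures and keeps going. `convert_many` is a hypothetical helper; it takes the conversion step as a callable so it does not depend on any particular converter.

```python
from typing import Callable, Dict, Iterable, Tuple


def convert_many(
    paths: Iterable[str],
    convert_one: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Convert each path, returning (results, errors) keyed by path."""
    results: Dict[str, str] = {}
    errors: Dict[str, Exception] = {}
    for path in paths:
        try:
            results[path] = convert_one(path)
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            errors[path] = exc
    return results, errors
```

With the converter from the example above, this could be called as `convert_many(files, lambda p: converter.convert(p).to_markdown())`.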

API Reference

FileConverter

Main class for converting documents to LLM-ready formats.

Methods

  • convert(file_path: str) -> ConversionResult: Convert a file to internal format
  • convert_url(url: str) -> ConversionResult: Convert a URL to internal format
  • convert_text(text: str) -> ConversionResult: Convert plain text to internal format
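Since the three methods take different kinds of input, a caller that accepts arbitrary strings may want one entry point. The sketch below shows one way to choose between them. `convert_any` is a hypothetical helper, and the rule it applies (URL first, then existing file, otherwise plain text) is an assumption for illustration, not documented behavior of the package.

```python
import os
from typing import Any
from urllib.parse import urlparse


def convert_any(converter: Any, source: str) -> Any:
    """Dispatch to convert_url, convert, or convert_text based on the input.

    `converter` is expected to provide the three methods listed above.
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return converter.convert_url(source)
    if os.path.isfile(source):
        return converter.convert(source)
    # Anything else is treated as literal text. Note that a mistyped file
    # path also ends up here, so callers may prefer to validate paths first.
    return converter.convert_text(source)
```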

ConversionResult

Result object with methods to export to different formats.

Methods

  • to_markdown() -> str: Export as markdown
  • to_html() -> str: Export as HTML
  • to_json() -> dict: Export as JSON
  • to_text() -> str: Export as plain text
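When the output format is chosen at runtime, for example from a command-line option, the four export methods can be selected by name. This is a minimal sketch: `EXPORTERS` and `export` are hypothetical names, and only the method names come from the list above.

```python
from typing import Any

# Maps a user-facing format name to the ConversionResult method listed above.
EXPORTERS = {
    "markdown": "to_markdown",
    "html": "to_html",
    "json": "to_json",
    "text": "to_text",
}


def export(result: Any, fmt: str) -> Any:
    """Call the export method on `result` that matches `fmt`."""
    try:
        method_name = EXPORTERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"Unsupported output format: {fmt!r}") from None
    return getattr(result, method_name)()
```

Keeping the mapping explicit, rather than building the method name from user input, means an unknown format fails with a clear error instead of an `AttributeError`.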

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Third-Party Dependencies

This project uses several third-party libraries. All dependencies are used in accordance with their respective licenses.
