LLM Data Converter v2.0.0

Convert any document, text, or URL into LLM-ready data format with advanced intelligent document processing capabilities powered by pre-trained models.

Installation

pip install llm-data-converter

Requirements:

  • Python 3.8 or higher

System Dependencies for Intelligent Document Processing

Intelligent document processing may require additional system libraries:

Ubuntu/Debian:

sudo apt update
sudo apt install -y libgl1 libglib2.0-0 libgomp1
pip install setuptools

macOS:

# Usually not needed, but if you encounter OpenGL issues:
brew install mesa

Note: The package automatically downloads and caches its pre-trained models on first use.

Quick Start

from llm_converter import FileConverter

# Basic conversion with intelligent document processing
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)

Features

  • Multiple Input Formats: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
  • Multiple Output Formats: Markdown, HTML, JSON, Plain Text
  • LLM Integration: Seamless integration with LiteLLM and other LLM libraries
  • Local Processing: Process documents locally without external dependencies
  • Layout Preservation: Maintain document structure and formatting
  • Intelligent Document Processing: Document understanding and conversion powered by pre-trained models:
    • Layout Detection: Pre-trained models that identify document structure
    • Text Recognition: High-accuracy text extraction with confidence scoring
    • Table Structure: Table detection and conversion to markdown format
    • Automatic Model Download: Models are downloaded and cached automatically

Intelligent Document Processing

Version 2.0.0 introduces intelligent document processing, which is enabled by default.

Intelligent Document Processing (Default)

Uses pre-trained models to improve conversion accuracy:

  • Layout Detection: Pre-trained models that identify document structure
  • Text Recognition: High-accuracy text extraction with confidence scoring
  • Table Structure: Table detection and conversion to markdown format
  • Automatic Model Download: Models are downloaded automatically on first use
  • Document Understanding: Document analysis and conversion that goes beyond simple text extraction

Usage Examples

Convert PDF to Markdown

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)

Convert Image to HTML

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("sample.png").to_html()
print(result)

Chain with LLM

from llm_converter import FileConverter
from litellm import completion

converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()

# Use with any LLM
response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
    ]
)

print(response.choices[0].message.content)
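
Long documents can exceed a model's context window. The sketch below is not part of llm-data-converter: split_markdown is a hypothetical helper that splits the string returned by to_markdown() on blank lines into pieces of at most max_chars characters, so each piece can be sent in its own request. The character budget is an assumption; a token-based limit would be more precise.

```python
from typing import List


def split_markdown(text: str, max_chars: int = 8000) -> List[str]:
    """Split text on blank lines into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is hard-split into pieces.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue  # skip empty paragraphs produced by repeated blank lines
        # Hard-split oversized paragraphs so no chunk exceeds the budget.
        pieces = [paragraph[i:i + max_chars]
                  for i in range(0, len(paragraph), max_chars)]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each element of split_markdown(document_content) could then be passed to completion(...) in place of the full document.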

Supported Formats

Input Formats

  • Documents: PDF, DOCX, TXT
  • Web: URLs, HTML files
  • Data: Excel (XLSX, XLS), CSV
  • Images: PNG, JPG, JPEG (with intelligent document processing capabilities)

Output Formats

  • Markdown: Clean, structured markdown with proper table formatting
  • HTML: Formatted HTML with styling
  • JSON: Structured JSON data
  • Plain Text: Simple text extraction

CLI Usage

The llm-converter command-line tool provides easy access to all conversion features:

Basic Usage

# Convert a PDF to markdown (default)
llm-converter document.pdf

# Convert to different output formats
llm-converter document.pdf --output html
llm-converter document.pdf --output json
llm-converter document.pdf --output text

Advanced Options

# Save output to file
llm-converter document.pdf --output-file output.md

# Convert an image
llm-converter image.png

# Convert multiple files at once
llm-converter file1.pdf file2.docx file3.xlsx --output markdown
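
A similar batch conversion can be scripted through the library. The sketch below relies only on the documented convert(...) and to_markdown() calls; convert_folder, the suffix list, and the output file naming are illustrative choices, not part of the package.

```python
from pathlib import Path
from typing import Iterable, List


def convert_folder(converter, in_dir: str, out_dir: str,
                   suffixes: Iterable[str] = (".pdf", ".docx", ".xlsx")) -> List[Path]:
    """Convert every matching file in in_dir and write one .md file per input.

    `converter` is expected to behave like llm_converter.FileConverter:
    converter.convert(path) returns an object with a to_markdown() method.
    """
    wanted = {s.lower() for s in suffixes}
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    written: List[Path] = []
    for source in sorted(Path(in_dir).iterdir()):
        if not source.is_file() or source.suffix.lower() not in wanted:
            continue
        markdown = converter.convert(str(source)).to_markdown()
        # Keep the original extension in the name so report.pdf and
        # report.docx do not overwrite each other.
        destination = target / (source.name + ".md")
        destination.write_text(markdown, encoding="utf-8")
        written.append(destination)
    return written
```

With the real library this would be called as convert_folder(FileConverter(), "docs", "out"), where both directory names are placeholders.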

List Supported Formats

# See all supported input formats
llm-converter --list-formats

Examples

# Convert PDF with intelligent document processing
llm-converter scanned_document.pdf --output markdown

# Convert image to HTML with layout preservation
llm-converter screenshot.png --output html

# Convert multiple documents to JSON
llm-converter report.pdf presentation.pptx data.xlsx --output json --output-file combined.json

# Convert URL content to markdown
llm-converter https://blog.example.com --output markdown --output-file blog_content.md

Output Formats

  • markdown (default): Clean, structured markdown
  • html: Formatted HTML with styling
  • json: Structured JSON data
  • text: Plain text extraction
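
In library code, the same four format names can be mapped onto the ConversionResult export methods. The render helper below is a hypothetical convenience wrapper, not a function shipped by the package. It serializes the dict from to_json() with json.dumps so that every format yields a string, which assumes the dict contains only JSON-serializable values.

```python
import json


def render(result, output: str = "markdown") -> str:
    """Return `result` as a string, using the same names as the CLI --output values."""
    exporters = {
        "markdown": result.to_markdown,
        "html": result.to_html,
        # to_json() is documented to return a dict, so serialize it here.
        "json": lambda: json.dumps(result.to_json(), indent=2, ensure_ascii=False),
        "text": result.to_text,
    }
    try:
        export = exporters[output]
    except KeyError:
        raise ValueError(f"unsupported output format: {output!r}") from None
    return export()
```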

Library API Reference

FileConverter

Main class for converting documents to LLM-ready formats.

Methods

  • convert(file_path: str) -> ConversionResult: Convert a file to the internal format
  • convert_url(url: str) -> ConversionResult: Convert the contents of a web page to the internal format
  • convert_text(text: str) -> ConversionResult: Convert plain text to the internal format
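
One way to route mixed inputs to the three methods above is to inspect the input string. The convert_any helper is a sketch, not part of the package. It assumes that strings starting with http:// or https:// should go to convert_url, paths of existing files to convert, and everything else to convert_text.

```python
import os


def convert_any(converter, source: str):
    """Dispatch `source` to convert_url, convert, or convert_text.

    `converter` is expected to expose the three FileConverter methods.
    """
    if source.startswith(("http://", "https://")):
        return converter.convert_url(source)
    if os.path.isfile(source):
        return converter.convert(source)
    # Anything that is neither a URL nor an existing file is treated as raw text.
    return converter.convert_text(source)
```

The fallback to convert_text means a mistyped file path is converted as literal text, so callers that only expect files may prefer to raise an error instead.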

ConversionResult

Result object with methods to export to different formats.

Methods

  • to_markdown() -> str: Export as markdown
  • to_html() -> str: Export as HTML
  • to_json() -> dict: Export as JSON
  • to_text() -> str: Export as plain text
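
For type-checked code, or for unit tests that should not load the real models, the method list above can be written down as a structural type. The SupportsExport protocol and the FakeResult test double below are illustrations derived from this reference, not classes exported by the package.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class SupportsExport(Protocol):
    """Structural type mirroring the documented ConversionResult methods."""

    def to_markdown(self) -> str: ...
    def to_html(self) -> str: ...
    def to_json(self) -> Dict[str, Any]: ...
    def to_text(self) -> str: ...


class FakeResult:
    """Test double that satisfies SupportsExport without running any model."""

    def __init__(self, text: str) -> None:
        self._text = text

    def to_markdown(self) -> str:
        return self._text

    def to_html(self) -> str:
        return f"<p>{self._text}</p>"

    def to_json(self) -> Dict[str, Any]:
        return {"text": self._text}

    def to_text(self) -> str:
        return self._text


def word_count(result: SupportsExport) -> int:
    """Example consumer: count whitespace-separated words in the plain text."""
    return len(result.to_text().split())
```

Functions annotated with SupportsExport accept both a real conversion result and FakeResult, since the runtime check only looks for the four method names.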

License

MIT License - see LICENSE file for details.
