
Best open-source document to markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract


LLM Data Converter


🆓 Try Cloud Mode for Free: Test the cloud extraction capabilities at https://extraction-api.nanonets.com/ - API key required for the web interface!

Convert documents in many formats into LLM-ready output (Markdown by default) using intelligent document processing powered by pre-trained models.

🆕 NEW: Cloud Mode Available! - Process documents using the powerful Nanonets cloud API with a free API key for faster, more accurate results.

Installation

pip install llm-data-converter

Requirements:

  • Python 3.8 or higher

System Dependencies for Intelligent Document Processing

For this library to work properly, you may need to install additional system dependencies:

Ubuntu/Debian:

sudo apt update
sudo apt install -y libgl1 libglib2.0-0 libgomp1
pip install setuptools

macOS:

# Usually not needed, but if you encounter OpenGL issues:
brew install mesa

Note: On first use, the package automatically downloads and caches its pre-trained models. Cloud mode requires no system dependencies or model downloads.
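Before converting anything, it can help to confirm that the environment meets the requirements listed above. The following is a minimal sketch using only the standard library; the helper names (`python_ok`, `package_installed`, `preflight`) are hypothetical and not part of the package.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # from the Requirements list above


def python_ok(version_info=None):
    """Return True if the interpreter meets the documented minimum version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


def package_installed(module_name="llm_converter"):
    """Return True if the module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def preflight():
    """Collect human-readable problems; an empty list means none were found."""
    problems = []
    if not python_ok():
        problems.append("Python 3.8 or higher is required")
    if not package_installed():
        problems.append(
            "llm_converter is not importable; run: pip install llm-data-converter"
        )
    return problems
```

Calling `preflight()` and printing the returned list gives a quick check before the first conversion. It does not verify the system libraries, which differ by platform.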

Quick Start

from llm_converter import FileConverter

# Local mode (default) - works offline
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)

Cloud Mode (New!) - For faster, more accurate results:

from llm_converter import FileConverter

# Only api_key is required for cloud mode
# Get API key from https://app.nanonets.com/#/keys
converter = FileConverter(cloud_mode=True, api_key="your_api_key")
result = converter.convert("document.pdf").to_markdown()  # Same interface!
print(result)

# Optional: Choose specific model for cloud processing
converter = FileConverter(cloud_mode=True, api_key="your_api_key", model="gemini")  # model is optional
result = converter.convert("document.pdf").to_markdown()
print(result)

Features

  • Multiple Input Formats: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
  • Multiple Output Formats: Markdown, HTML, JSON, Plain Text
  • LLM Integration: Seamless integration with LiteLLM and other LLM libraries
  • Local Processing: Process documents on your own machine without calling external services
  • Cloud Processing: Fast, accurate processing with Nanonets cloud API
  • Layout Preservation: Maintain document structure and formatting
  • Intelligent Document Processing: Advanced document understanding and conversion powered by pre-trained models:
    • Layout Detection: Intelligent models for document structure understanding
    • Text Recognition: High-accuracy text extraction with confidence scoring
    • Table Structure: Intelligent table detection and conversion to markdown format
    • Automatic Model Download: Models are automatically downloaded and cached

Usage Examples

Convert PDF to Markdown

from llm_converter import FileConverter

# Local mode (default)
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)

# Cloud mode (just add cloud_mode=True and api_key)
converter = FileConverter(cloud_mode=True, api_key="your_api_key")
result = converter.convert("document.pdf").to_markdown()  # Same interface!
print(result)

Convert Image to HTML

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("sample.png").to_html()
print(result)
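Because every input goes through the same `convert(...).to_markdown()` call, converting a whole folder is a short loop. The sketch below assumes only the `convert` and `to_markdown` methods shown above; `convert_folder` and its `suffixes` parameter are hypothetical names for illustration.

```python
from pathlib import Path


def convert_folder(converter, folder, suffixes=(".pdf", ".docx", ".png")):
    """Convert every matching file in `folder` and return {filename: markdown}."""
    wanted = {s.lower() for s in suffixes}
    results = {}
    for path in sorted(Path(folder).iterdir()):
        # Compare suffixes case-insensitively so "SCAN.PNG" is also picked up.
        if path.is_file() and path.suffix.lower() in wanted:
            results[path.name] = converter.convert(str(path)).to_markdown()
    return results
```

With the package installed, `convert_folder(FileConverter(), "docs/")` would return one markdown string per matching file.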

Chain with LLM

from llm_converter import FileConverter
from litellm import completion

converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()

# Use with any LLM
response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
    ]
)

print(response.choices[0].message.content)
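Long documents can exceed a model's context window. One simple guard is to cap the document text before building the `messages` list. This is a rough sketch: `build_messages` and `max_chars` are hypothetical, and a character limit is only a crude stand-in for a token limit.

```python
def build_messages(document_content, max_chars=20000,
                   instruction="Summarize this document:"):
    """Build a chat `messages` list, truncating very long documents."""
    if len(document_content) > max_chars:
        # Keep the start of the document and mark that text was cut off.
        document_content = document_content[:max_chars] + "\n\n[truncated]"
    return [
        {"role": "system",
         "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user",
         "content": f"{instruction}\n\n{document_content}"},
    ]
```

The returned list can be passed as the `messages` argument of `completion(...)` in place of the inline list above.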

Supported Formats

Input Formats

  • Documents: PDF, DOCX, TXT
  • Web: URLs, HTML files
  • Data: Excel (XLSX, XLS), CSV
  • Images: PNG, JPG, JPEG

Output Formats

  • Markdown: Clean, structured markdown with proper table formatting
  • HTML: Formatted HTML with styling
  • JSON: Structured JSON data
  • Plain Text: Simple text extraction
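A small pre-check against the input formats listed above can catch unsupported files before a conversion is attempted. This is a sketch built only from that list; `INPUT_FORMATS` and `classify_input` are hypothetical names, and the installed version may accept more formats than shown here.

```python
from pathlib import PurePath
from urllib.parse import urlparse

# Extensions copied from the Input Formats list above (an assumption about
# what the installed version accepts, not an authoritative registry).
INPUT_FORMATS = {
    "document": {".pdf", ".docx", ".txt"},
    "web": {".html"},
    "data": {".xlsx", ".xls", ".csv"},
    "image": {".png", ".jpg", ".jpeg"},
}


def classify_input(source):
    """Return the category for `source`, or None if it is not listed."""
    if urlparse(source).scheme in ("http", "https"):
        return "web"
    suffix = PurePath(source).suffix.lower()
    for category, extensions in INPUT_FORMATS.items():
        if suffix in extensions:
            return category
    return None
```

A return value of `None` means the file type is not in the list above and is worth checking before conversion.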

CLI Usage

The llm-converter command-line tool provides easy access to all conversion features:

Basic Usage

# Convert a PDF to markdown (default)
llm-converter document.pdf

# Convert to different output formats
llm-converter document.pdf --output html
llm-converter document.pdf --output json
llm-converter document.pdf --output text

Cloud Mode

# Convert using cloud API - only api_key required
llm-converter document.pdf --cloud-mode --api-key YOUR_API_KEY

# Use environment variable for API key
export NANONETS_API_KEY=your_api_key
llm-converter document.pdf --cloud-mode --output json

# Optional: Use specific model for cloud processing
llm-converter document.pdf --cloud-mode --api-key YOUR_KEY --model gemini
llm-converter document.pdf --cloud-mode --model openapi --output json

Advanced Options

# Save output to file
llm-converter document.pdf --output-file output.md

# For image input
llm-converter image.png 

# Convert multiple files at once
llm-converter file1.pdf file2.docx file3.xlsx --output markdown

List Supported Formats

# See all supported input formats
llm-converter --list-formats

Examples

# Convert PDF to markdown
llm-converter scanned_document.pdf --output markdown

# Convert image to HTML with layout preservation
llm-converter screenshot.png --output html

# Convert multiple documents to JSON
llm-converter report.pdf presentation.pptx data.xlsx --output json --output-file combined.json

# Convert URL content to markdown
llm-converter https://blog.example.com --output markdown --output-file blog_content.md

# Cloud mode examples
llm-converter document.pdf --cloud-mode --api-key YOUR_KEY
llm-converter document.pdf --cloud-mode --output json  # env var NANONETS_API_KEY
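To drive the CLI from a Python script, the flags shown above can be assembled into an argument list and run with `subprocess`. This is a sketch under assumptions: `build_command` and `run_converter` are hypothetical helpers, only flags that appear in this section are used, and the CLI is assumed to print its result to standard output when `--output-file` is not given.

```python
import subprocess


def build_command(files, output=None, output_file=None,
                  cloud_mode=False, api_key=None, model=None):
    """Assemble an argv list for the llm-converter CLI."""
    cmd = ["llm-converter", *files]
    if output:
        cmd += ["--output", output]
    if output_file:
        cmd += ["--output-file", output_file]
    if cloud_mode:
        cmd.append("--cloud-mode")
        if api_key:  # if omitted, NANONETS_API_KEY is expected in the environment
            cmd += ["--api-key", api_key]
        if model:
            cmd += ["--model", model]
    return cmd


def run_converter(files, **options):
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(build_command(files, **options),
                               check=True, capture_output=True, text=True)
    return completed.stdout
```

Passing a list instead of a shell string avoids quoting problems with file names that contain spaces.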

API Reference for library

FileConverter

Main class for converting documents to LLM-ready formats.

Methods

  • convert(file_path: str) -> ConversionResult: Convert a file to internal format
  • convert_url(url: str) -> ConversionResult: Convert the contents of a web page to internal format
  • convert_text(text: str) -> ConversionResult: Convert plain text to internal format
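Since the three methods share a return type, a thin dispatcher can pick the right one for a given input. The sketch below relies only on the method names listed above; `convert_source` and its `raw_text` flag are hypothetical.

```python
def convert_source(converter, source, raw_text=False):
    """Route `source` to convert_text, convert_url, or convert."""
    if raw_text:
        # The caller says `source` is the text itself, not a path or URL.
        return converter.convert_text(source)
    if source.lower().startswith(("http://", "https://")):
        return converter.convert_url(source)
    return converter.convert(source)
```

For example, `convert_source(FileConverter(), "https://blog.example.com").to_markdown()` would go through `convert_url`, while a local path would go through `convert`.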

CloudFileConverter

Extended FileConverter with cloud processing capabilities.

Methods

  • convert(file_path: str) -> ConversionResult: Convert using cloud API (same interface!)
  • is_cloud_enabled() -> bool: Check if cloud processing is available

ConversionResult

Result object with methods to export to different formats.

Methods

  • to_markdown() -> str: Export as markdown
  • to_html() -> str: Export as HTML
  • to_json() -> dict: Export as JSON
  • to_text() -> str: Export as plain text
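The four export methods can be combined to write every output format for one conversion. This sketch assumes the return types listed above (`to_json()` returning a `dict`, the others returning `str`); `export_all` is a hypothetical helper.

```python
import json
from pathlib import Path


def export_all(result, out_dir, stem):
    """Write one file per documented export method and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    payloads = {
        ".md": result.to_markdown(),
        ".html": result.to_html(),
        # to_json() is documented as returning a dict, so serialize it here.
        ".json": json.dumps(result.to_json(), indent=2, ensure_ascii=False),
        ".txt": result.to_text(),
    }
    written = []
    for suffix, text in payloads.items():
        path = out_dir / f"{stem}{suffix}"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

Usage might look like `export_all(converter.convert("report.pdf"), "out", "report")`, which would produce `out/report.md`, `out/report.html`, `out/report.json`, and `out/report.txt`.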

Troubleshooting

Cloud Mode Setup

  1. Get your free API key from https://app.nanonets.com/#/keys
  2. Set environment variable: export NANONETS_API_KEY=your_key
  3. Or provide directly: CloudFileConverter(api_key="your_key")
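The setup steps above can be folded into one helper that prefers an explicit key, falls back to the `NANONETS_API_KEY` environment variable, and otherwise stays in local mode. This is a sketch: `cloud_kwargs` is a hypothetical name, and the keyword arguments it returns are the ones shown in the Quick Start (`cloud_mode`, `api_key`, `model`).

```python
import os


def cloud_kwargs(api_key=None, model=None, environ=None):
    """Return keyword arguments for FileConverter based on the available key."""
    environ = os.environ if environ is None else environ
    key = api_key or environ.get("NANONETS_API_KEY")
    if not key:
        return {}  # no key found: stay in local mode (the default)
    kwargs = {"cloud_mode": True, "api_key": key}
    if model:
        kwargs["model"] = model
    return kwargs
```

With the package installed, `FileConverter(**cloud_kwargs())` would then use cloud mode only when a key is available.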

Installation Issues

Tokenizers Build Error

If you encounter an error like this during installation:

ERROR: Could not find a version that satisfies the requirement puccinialin
ERROR: No matching distribution found for puccinialin

This is typically caused by the tokenizers package failing to build from source. Here are several solutions:

Solution 1: Update pip and install pre-compiled wheels

pip install --upgrade pip
pip install llm-data-converter --no-cache-dir

Solution 2: Install with specific tokenizers version

pip install tokenizers==0.21.0
pip install llm-data-converter

Solution 3: Use conda (recommended for complex dependencies)

conda install -c conda-forge llm-data-converter

Numpy/Homebrew Conflict (macOS)

If you see this error on macOS:

error: uninstall-no-record-file
× Cannot uninstall numpy 2.1.2

Solution: Use virtual environment (recommended)

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
pip install llm-data-converter
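To confirm that the virtual environment is active before running `pip install`, the interpreter's prefixes can be compared. This is a minimal sketch using the standard library; `in_virtualenv` is a hypothetical helper name.

```python
import sys


def in_virtualenv():
    """Return True when running inside an environment created by `python3 -m venv`."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix points at the interpreter it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)
```

If `in_virtualenv()` returns `False`, the install would target the system or Homebrew Python, which is the situation that triggers the numpy error above.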

License

MIT License - see LICENSE file for details.

