Best open-source document to markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract

LLM Data Converter

Convert documents in many formats into LLM-ready markdown, using pre-trained models for layout detection, text recognition, and table extraction.

Installation

pip install llm-data-converter

Requirements:

Python 3.8 or higher

System Dependencies for Intelligent Document Processing

For this library to work properly, you may need to install additional system dependencies:

Ubuntu/Debian:

sudo apt update
sudo apt install -y libgl1 libglib2.0-0 libgomp1
pip install setuptools

macOS:

# Usually not needed, but if you encounter OpenGL issues:
brew install mesa

Note: The package automatically downloads and caches its pre-trained models on first use.
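Before the first run, it can help to confirm that the requirements listed above are in place. The following is a minimal sketch, not part of the library: `preflight` and `REQUIRED_LIBS` are hypothetical names, and the mapping from apt package to shared-library name (libgl1 to GL, libglib2.0-0 to glib-2.0, libgomp1 to gomp) is an assumption that is mainly meaningful on Linux.

```python
import ctypes.util
import sys

# Shared libraries assumed to be provided by the apt packages listed above
# (libgl1 -> GL, libglib2.0-0 -> glib-2.0, libgomp1 -> gomp).
REQUIRED_LIBS = ("GL", "glib-2.0", "gomp")


def preflight():
    """Return a dict mapping each check to True (looks fine) or False."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    for name in REQUIRED_LIBS:
        # find_library returns None when the loader cannot locate the library.
        report["lib" + name] = ctypes.util.find_library(name) is not None
    return report


def main():
    # Print one line per check, e.g. "libGL: ok" or "libGL: missing".
    for check, ok in preflight().items():
        print(f"{check}: {'ok' if ok else 'missing'}")
```

A "missing" entry on macOS is not necessarily a problem, since the extra libraries are usually not needed there.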

Quick Start

from llm_converter import FileConverter

# Basic conversion 
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)

Features

Multiple Input Formats: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
Multiple Output Formats: Markdown, HTML, JSON, Plain Text
LLM Integration: Converted output is plain text, so it can be passed directly to LiteLLM or any other LLM client
Local Processing: Documents are processed on your machine; no external API is required (models are downloaded once and cached)
Layout Preservation: Maintain document structure and formatting
Intelligent Document Processing: Document understanding and conversion powered by pre-trained models:
- Layout Detection: Identifies the structure of a document
- Text Recognition: High-accuracy text extraction with confidence scoring
- Table Structure: Detects tables and converts them to markdown format
- Automatic Model Download: Models are downloaded and cached on first use

Usage Examples

Convert PDF to Markdown

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
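The same call can be applied to a whole folder. This is a sketch under stated assumptions: `convert_directory` is a hypothetical helper, not a library function, and it relies only on the `convert(...).to_markdown()` call shown above. The converter is passed in as an argument, so any object with that interface works.

```python
from pathlib import Path


def convert_directory(converter, src_dir, out_dir, pattern="*.pdf"):
    """Convert every file matching `pattern` and write one .md file per input.

    `converter` is any object exposing convert(path).to_markdown(),
    such as a FileConverter instance.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob(pattern)):
        markdown = converter.convert(str(src)).to_markdown()
        target = out / (src.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

Typical use would be `convert_directory(FileConverter(), "reports", "reports_md")`, where both folder names are placeholders.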

Convert Image to HTML

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("sample.png").to_html()
print(result)

Chain with LLM

from llm_converter import FileConverter
from litellm import completion

converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()

# Use with any LLM
response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
    ]
)

print(response.choices[0].message.content)
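A long document may not fit in a single prompt. One possible approach is to split the markdown into pieces and summarize each piece separately. The sketch below is an illustration only: `chunk_markdown` is a hypothetical helper, the budget is counted in characters rather than tokens, and the default of 8000 is an arbitrary assumption.

```python
def chunk_markdown(markdown, max_chars=8000):
    """Split markdown into chunks of at most `max_chars`, breaking on blank lines."""
    chunks, current = [], ""
    for block in markdown.split("\n\n"):
        # A single oversized block is hard-split so no chunk exceeds the budget.
        while len(block) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(block[:max_chars])
            block = block[max_chars:]
        candidate = block if not current else current + "\n\n" + block
        if len(candidate) > max_chars:
            # Adding this block would overflow: close the current chunk first.
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent through the `completion(...)` call shown above in place of `document_content`.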

Supported Formats

Input Formats

Documents: PDF, DOCX, PPTX, TXT
Web: URLs, HTML files
Data: Excel (XLSX, XLS), CSV
Images: PNG, JPG, JPEG

Output Formats

Markdown: Clean, structured markdown with proper table formatting
HTML: Formatted HTML with styling
JSON: Structured JSON data
Plain Text: Simple text extraction

CLI Usage

The llm-converter command-line tool provides easy access to all conversion features:

Basic Usage

# Convert a PDF to markdown (default)
llm-converter document.pdf

# Convert to different output formats
llm-converter document.pdf --output html
llm-converter document.pdf --output json
llm-converter document.pdf --output text

Advanced Options

# Save output to file
llm-converter document.pdf --output-file output.md

# Convert an image
llm-converter image.png

# Convert multiple files at once
llm-converter file1.pdf file2.docx file3.xlsx --output markdown

List Supported Formats

# See all supported input formats
llm-converter --list-formats

Examples

# Convert PDF to markdown
llm-converter scanned_document.pdf --output markdown

# Convert image to HTML with layout preservation
llm-converter screenshot.png --output html

# Convert multiple documents to JSON
llm-converter report.pdf presentation.pptx data.xlsx --output json --output-file combined.json

# Convert URL content to markdown
llm-converter https://blog.example.com --output markdown --output-file blog_content.md

Output Formats

markdown (default): Clean, structured markdown
html: Formatted HTML with styling
json: Structured JSON data
text: Plain text extraction
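The CLI can also be driven from a script. The sketch below builds the argument list using only the flags documented above (`--output` and `--output-file`). The helper names `build_command` and `run_converter` are hypothetical, and it is an assumption that the tool prints the converted content to standard output when no `--output-file` is given.

```python
import subprocess

VALID_OUTPUTS = ("markdown", "html", "json", "text")


def build_command(inputs, output="markdown", output_file=None):
    """Build an argv list for llm-converter from documented flags only."""
    if output not in VALID_OUTPUTS:
        raise ValueError(f"unknown output format: {output!r}")
    if not inputs:
        raise ValueError("at least one input is required")
    cmd = ["llm-converter", *inputs, "--output", output]
    if output_file is not None:
        cmd += ["--output-file", str(output_file)]
    return cmd


def run_converter(inputs, **kwargs):
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        build_command(inputs, **kwargs),
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.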

API Reference for library

FileConverter

Main class for converting documents to LLM-ready formats.

Methods

convert(file_path: str) -> ConversionResult: Convert a file to internal format
convert_url(url: str) -> ConversionResult: Convert the contents of a web page to internal format
convert_text(text: str) -> ConversionResult: Convert plain text to internal format
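When the kind of input is not known in advance, a small dispatcher can choose between the three methods listed above. This is a sketch, not library behavior: `convert_any` is a hypothetical helper, and treating any string that is neither an http(s) URL nor an existing path as literal text is an assumption of the example.

```python
import os
from urllib.parse import urlparse


def convert_any(converter, source):
    """Route `source` to convert_url, convert, or convert_text."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return converter.convert_url(source)
    if os.path.exists(source):
        return converter.convert(source)
    # Anything else is treated as literal text (an assumption of this sketch).
    return converter.convert_text(source)
```

For example, `convert_any(FileConverter(), "https://blog.example.com").to_markdown()` would go through `convert_url`.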

ConversionResult

Result object with methods to export to different formats.

Methods

to_markdown() -> str: Export as markdown
to_html() -> str: Export as HTML
to_json() -> dict: Export as JSON
to_text() -> str: Export as plain text
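If the output format arrives as a string, for instance from a configuration file, it can be mapped onto the export methods above. The sketch below is illustrative: `EXPORTERS` and `export` are hypothetical names, and the format strings simply mirror the CLI names (markdown, html, json, text).

```python
EXPORTERS = {
    "markdown": "to_markdown",
    "html": "to_html",
    "json": "to_json",
    "text": "to_text",
}


def export(result, fmt="markdown"):
    """Call the ConversionResult method that matches a CLI-style format name."""
    try:
        method_name = EXPORTERS[fmt]
    except KeyError:
        raise ValueError(
            f"unsupported format {fmt!r}; expected one of {sorted(EXPORTERS)}"
        ) from None
    return getattr(result, method_name)()
```

Note that `to_json()` is documented as returning a dict, so callers that write to disk would still need `json.dumps` for that format.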

Troubleshooting

Installation Issues

Tokenizers Build Error

If you encounter an error like this during installation:

ERROR: Could not find a version that satisfies the requirement puccinialin
ERROR: No matching distribution found for puccinialin

This is typically caused by the tokenizers package failing to build from source. Here are several solutions:

Solution 1: Update pip and install pre-compiled wheels

pip install --upgrade pip
pip install llm-data-converter --no-cache-dir

Solution 2: Install with specific tokenizers version

pip install tokenizers==0.21.0
pip install llm-data-converter

Solution 3: Use conda (recommended for complex dependencies)

conda install -c conda-forge llm-data-converter

Solution 4: Install Rust (if you want to build from source)

# On macOS/Linux
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Then restart your terminal and try installing again
pip install llm-data-converter

Numpy/Homebrew Conflict (macOS)

If you see this error on macOS:

error: uninstall-no-record-file
× Cannot uninstall numpy 2.1.2
╰─> The package's contents are unknown: no RECORD file was found for numpy.
hint: The package was installed by brew. You should check if it can uninstall the package.

This happens when numpy is installed via Homebrew and conflicts with pip. Here are solutions:

Solution 1: Use virtual environment (recommended)

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
pip install llm-data-converter
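To confirm that the interpreter running your script is the one from the virtual environment and not the Homebrew Python, a quick check along these lines can be used. `in_virtualenv` is a hypothetical helper based on the standard `sys.prefix` and `sys.base_prefix` attributes.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv or virtualenv."""
    # In a virtual environment, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)
```

If this returns False after activation, the shell is probably still resolving `python` to a different installation.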

Solution 2: Install with --ignore-installed flag

pip install llm-data-converter --ignore-installed numpy

Solution 3: Use conda instead of pip

conda install -c conda-forge llm-data-converter

Solution 4: Uninstall brew numpy (if you don't need it)

brew uninstall numpy
pip install llm-data-converter

Hugging Face Authentication Issues

If you see authentication errors when downloading models:

huggingface_hub.errors.HfHubHTTPError: 401 Client Error: Unauthorized

The library now uses Nanonets S3 hosting by default, so this should not occur. If it does:

Set up Hugging Face token (optional):

pip install huggingface_hub
huggingface-cli login

Force S3 usage (recommended):

# The library uses S3 by default, but you can ensure it:
export LLM_CONVERTER_PREFER_HF=false
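The same variable can be set from Python instead of the shell. This is a sketch: the README documents only the variable name, so the point at which the library reads it is an assumption, and setting it before importing `llm_converter` is the cautious choice.

```python
import os

# Set the variable before the converter is imported or created, in case the
# library reads it at that point (an assumption; only the name is documented).
os.environ["LLM_CONVERTER_PREFER_HF"] = "false"


def prefers_s3():
    """True when the environment asks the library not to prefer Hugging Face."""
    return os.environ.get("LLM_CONVERTER_PREFER_HF", "").lower() == "false"
```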

Model Download Issues

If models fail to download:

Check your internet connection
Try again - the library has automatic retry logic
Models are cached locally after first download
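If a download still fails on a flaky connection, an outer retry with exponential backoff is one option on top of the library's own retry logic. The sketch below is generic Python, not library code: `with_retries` is a hypothetical helper, and it catches every `Exception` because the README does not say which error a failed download raises.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A call such as `with_retries(lambda: converter.convert("document.pdf"))` would retry the whole conversion, including the model download on first use.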

Runtime Issues

Memory Issues with Large Documents

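CPython does not impose a memory limit of its own, so there is no interpreter setting to raise; an option such as -X maxsize=4GB is accepted on the command line but has no effect. If a very large document exhausts memory, check for an OS or container limit (for example ulimit -v) or run on a machine with more RAM. Setting PYTHONMALLOC=malloc switches Python to the system allocator, which can change how memory is returned to the OS but does not raise any limit:

# Use the C library allocator instead of Python's default allocator
export PYTHONMALLOC=malloc
python your_script.py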
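To see how much memory a conversion actually needs before changing any settings, the peak resident set size of the process can be read after the conversion finishes. This sketch uses the standard `resource` module, which is available on Unix only; `peak_rss_mib` is a hypothetical helper.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor
```

Calling `peak_rss_mib()` right after `converter.convert(...)` gives a rough figure to compare against the memory available on the machine.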

GPU/CPU Issues

The library works on CPU by default. For better performance:

Install PyTorch with CUDA support if you have a GPU
Models will automatically use available hardware
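A quick way to check which hardware is visible is to ask PyTorch directly. This is a sketch that assumes the models run on PyTorch, as the note about installing PyTorch with CUDA support suggests; `detect_device` is a hypothetical helper and does not change what the library itself selects.

```python
def detect_device():
    """Return 'cuda', 'cpu', or 'unknown' when PyTorch cannot be imported."""
    try:
        import torch
    except ImportError:
        return "unknown"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

If this reports "cpu" on a machine with a GPU, the installed PyTorch build probably lacks CUDA support.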

Getting Help

GitHub Issues: Report bugs or request features
Documentation: Check this README and the scripts documentation
Community: Join discussions on GitHub

License

MIT License - see LICENSE file for details.

Release history

2.2.0

Jul 25, 2025

2.1.7

Jul 23, 2025

2.1.6

Jul 21, 2025

2.1.5

Jul 21, 2025

2.1.3

Jul 17, 2025

2.1.2

Jul 16, 2025

2.1.1

Jul 16, 2025

2.1.0

Jul 16, 2025

2.0.7

Jul 15, 2025

2.0.6

Jul 15, 2025

2.0.5

Jul 15, 2025

2.0.4

Jul 15, 2025

2.0.3

Jul 15, 2025

2.0.2

Jul 15, 2025

2.0.1

Jul 15, 2025

2.0.0

Jul 15, 2025

0.4.1

Jul 14, 2025

0.4.0

Jul 14, 2025

0.2.3

Jul 14, 2025

0.2.2

Jul 9, 2025

0.2.1

Jul 9, 2025

0.2.0

Jul 9, 2025

0.1.0

Jul 9, 2025

Download files

Source Distribution

llm_data_converter-2.1.6.tar.gz (39.3 kB)

Uploaded Jul 21, 2025 Source

Built Distribution

llm_data_converter-2.1.6-py3-none-any.whl (50.8 kB)

Uploaded Jul 21, 2025 Python 3

File details

Details for the file llm_data_converter-2.1.6.tar.gz.

File metadata

Download URL: llm_data_converter-2.1.6.tar.gz
Upload date: Jul 21, 2025
Size: 39.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for llm_data_converter-2.1.6.tar.gz
Algorithm	Hash digest
SHA256	`f621c69ae21a168c3a9804159b05e409f8516ff568fc1c2203ffb1ff0a0c7b6d`
MD5	`636295b4ac5a38e1a3aa96b2f35d8a97`
BLAKE2b-256	`1089ca18fc0607633e9b55c0a76453609c57637d9eff797fb4b6a0bd81a22d7b`

File details

Details for the file llm_data_converter-2.1.6-py3-none-any.whl.

File metadata

Download URL: llm_data_converter-2.1.6-py3-none-any.whl
Upload date: Jul 21, 2025
Size: 50.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for llm_data_converter-2.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`17d1481bc59f5f3a9fac1b2719e3b8d99f76d144cb8c7c8f70902845da54eb18`
MD5	`c4b79772b558cdf68ce647b7e6380dfb`
BLAKE2b-256	`7531011b545646d6873daade33dd3abb27bae61fc4700038bd9c104619389ce8`
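The digests listed above can be used to check a downloaded file before installing it. The sketch below computes a SHA256 digest with the standard `hashlib` module; `sha256_of` and `matches` are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with the published hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the wheel, the expected value is the SHA256 entry in the table above, passed as the second argument to `matches`.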
