Best open-source document to markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract
LLM Data Converter
Convert documents into LLM-ready formats such as markdown, using pre-trained models for layout detection, text recognition, and table extraction.
Installation
pip install llm-data-converter
Requirements:
- Python 3.8 or higher
System Dependencies for Intelligent Document Processing
For this library to work properly, you may need to install additional system dependencies:
Ubuntu/Debian:
sudo apt update
sudo apt install -y libgl1 libglib2.0-0 libgomp1
pip install setuptools
macOS:
# Usually not needed, but if you encounter OpenGL issues:
brew install mesa
Note: On first use, the package automatically downloads and caches the pre-trained models it needs.
Quick Start
from llm_converter import FileConverter
# Basic conversion
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
Features
- Multiple Input Formats: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
- Multiple Output Formats: Markdown, HTML, JSON, Plain Text
- LLM Integration: Seamless integration with LiteLLM and other LLM libraries
- Local Processing: Process documents on your own machine; models are downloaded once and then cached
- Layout Preservation: Maintain document structure and formatting
- Intelligent Document Processing: Advanced document understanding and conversion powered by pre-trained models:
  - Layout Detection: Intelligent models for document structure understanding
  - Text Recognition: High-accuracy text extraction with confidence scoring
  - Table Structure: Intelligent table detection and conversion to markdown format
- Automatic Model Download: Models are automatically downloaded and cached
Usage Examples
Convert PDF to Markdown
from llm_converter import FileConverter
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
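The single-file call above extends naturally to a whole folder. The following is a minimal sketch, not part of the library: `convert_folder` is a hypothetical helper, and it assumes only that the object passed in exposes `convert(path).to_markdown()` as shown in this README.

```python
from pathlib import Path


def convert_folder(converter, src_dir, out_dir, pattern="*.pdf"):
    """Convert every file matching `pattern` in src_dir to a .md file in out_dir.

    `converter` is any object exposing convert(path).to_markdown(), such as
    a FileConverter instance. Returns the list of written paths.
    (Hypothetical helper; not provided by llm-data-converter.)
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob(pattern)):
        markdown = converter.convert(str(src)).to_markdown()
        target = out / (src.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

With the library installed, a call such as `convert_folder(FileConverter(), "pdfs", "markdown_out")` would write one markdown file per PDF.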
Convert Image to HTML
from llm_converter import FileConverter
converter = FileConverter()
result = converter.convert("sample.png").to_html()
print(result)
Chain with LLM
from llm_converter import FileConverter
from litellm import completion
converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()
# Use with any LLM
response = completion(
model="openai/gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant that analyzes documents."},
{"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
]
)
print(response.choices[0].message.content)
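Long documents may not fit into a single prompt. One possible approach is to split the markdown on blank lines before sending it to the model. The sketch below is an illustration only: `chunk_markdown` is a hypothetical helper, and it counts characters as a rough stand-in for tokens.

```python
def chunk_markdown(text, max_chars=8000):
    """Split markdown into chunks of at most max_chars, breaking on blank lines.

    A single paragraph longer than max_chars is hard-split so that no chunk
    exceeds the limit. (Hypothetical helper; characters approximate tokens.)
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Hard-split paragraphs that cannot fit in one chunk on their own.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # Adding this piece would overflow: close the current chunk.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `completion(...)` in turn, and the partial summaries combined in a final call.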
Supported Formats
Input Formats
- Documents: PDF, DOCX, TXT
- Web: URLs, HTML files
- Data: Excel (XLSX, XLS), CSV
- Images: PNG, JPG, JPEG
Output Formats
- Markdown: Clean, structured markdown with proper table formatting
- HTML: Formatted HTML with styling
- JSON: Structured JSON data
- Plain Text: Simple text extraction
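When processing mixed inputs, it can help to filter out unsupported files before calling the converter. The sketch below assumes the extension set implied by the Input Formats list above; `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and the installed version may accept a different set.

```python
from pathlib import Path

# Extensions copied from the "Input Formats" list in this README; this is an
# assumption, and the installed version may support more or fewer formats.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".txt",   # documents
    ".html",                   # web
    ".xlsx", ".xls", ".csv",   # data
    ".png", ".jpg", ".jpeg",   # images
}


def is_supported(source):
    """Return True for http(s) URLs and for paths with a listed extension."""
    if source.lower().startswith(("http://", "https://")):
        return True
    return Path(source).suffix.lower() in SUPPORTED_EXTENSIONS
```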
CLI usage
The llm-converter command-line tool provides easy access to all conversion features:
Basic Usage
# Convert a PDF to markdown (default)
llm-converter document.pdf
# Convert to different output formats
llm-converter document.pdf --output html
llm-converter document.pdf --output json
llm-converter document.pdf --output text
Advanced Options
# Save output to file
llm-converter document.pdf --output-file output.md
# For image input
llm-converter image.png
# Convert multiple files at once
llm-converter file1.pdf file2.docx file3.xlsx --output markdown
List Supported Formats
# See all supported input formats
llm-converter --list-formats
Examples
# Convert PDF to markdown
llm-converter scanned_document.pdf --output markdown
# Convert image to HTML with layout preservation
llm-converter screenshot.png --output html
# Convert multiple documents to JSON
llm-converter report.pdf presentation.pptx data.xlsx --output json --output-file combined.json
# Convert URL content to markdown
llm-converter https://blog.example.com --output markdown --output-file blog_content.md
Output Formats
- markdown (default): Clean, structured markdown
- html: Formatted HTML with styling
- json: Structured JSON data
- text: Plain text extraction
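The CLI can also be driven from a Python script. The following sketch builds the argument list from the flags documented above (`--output`, `--output-file`); `build_cli_command` and `run_cli` are hypothetical helpers, and reading the result from standard output when no `--output-file` is given is an assumption about the CLI's behavior.

```python
import subprocess


def build_cli_command(inputs, output="markdown", output_file=None):
    """Build an argv list for the llm-converter CLI from the documented flags."""
    if not inputs:
        raise ValueError("at least one input is required")
    if output not in {"markdown", "html", "json", "text"}:
        raise ValueError(f"unknown output format: {output}")
    cmd = ["llm-converter", *inputs, "--output", output]
    if output_file is not None:
        cmd += ["--output-file", str(output_file)]
    return cmd


def run_cli(inputs, **kwargs):
    """Run the CLI and return its stdout; raises CalledProcessError on failure.

    Assumes the converted text is printed to stdout when no output file is set.
    """
    cmd = build_cli_command(inputs, **kwargs)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.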
API Reference for library
FileConverter
Main class for converting documents to LLM-ready formats.
Methods
- convert(file_path: str) -> ConversionResult: Convert a file to the internal format
- convert_url(url: str) -> ConversionResult: Convert the contents of a URL to the internal format
- convert_text(text: str) -> ConversionResult: Convert plain text to the internal format
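Because the three methods take different kinds of input, a small dispatcher can choose between them. This is a sketch under stated assumptions: `convert_any` is a hypothetical helper, and the routing rules (URL prefix, then existing file, then raw text) are one possible convention rather than library behavior.

```python
import os


def convert_any(converter, source):
    """Route `source` to convert_url, convert, or convert_text.

    - http(s) strings go to convert_url
    - paths of existing files go to convert
    - anything else is treated as raw text
    (Hypothetical helper; the routing rules are an assumption.)
    """
    if source.lower().startswith(("http://", "https://")):
        return converter.convert_url(source)
    if os.path.isfile(source):
        return converter.convert(source)
    return converter.convert_text(source)
```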
ConversionResult
Result object with methods to export to different formats.
Methods
- to_markdown() -> str: Export as markdown
- to_html() -> str: Export as HTML
- to_json() -> dict: Export as JSON
- to_text() -> str: Export as plain text
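When the output format is chosen at runtime, the export methods can be selected by name. The sketch below relies only on the four methods listed above; `export` is a hypothetical helper, and serializing the `to_json()` dict with `json.dumps` is a design choice made here so that every format yields a string.

```python
import json


def export(result, fmt="markdown"):
    """Return a conversion result as a string in the requested format.

    `result` is any object with to_markdown/to_html/to_text/to_json methods.
    (Hypothetical helper; not provided by llm-data-converter.)
    """
    exporters = {
        "markdown": result.to_markdown,
        "html": result.to_html,
        "text": result.to_text,
        # to_json() is documented as returning a dict, so serialize it here.
        "json": lambda: json.dumps(result.to_json(), ensure_ascii=False, indent=2),
    }
    if fmt not in exporters:
        raise ValueError(f"unsupported format: {fmt!r}")
    return exporters[fmt]()
```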
Troubleshooting
Installation Issues
Tokenizers Build Error
If you encounter an error like this during installation:
ERROR: Could not find a version that satisfies the requirement puccinialin
ERROR: No matching distribution found for puccinialin
This is typically caused by the tokenizers package failing to build from source. Here are several solutions:
Solution 1: Update pip and install pre-compiled wheels
pip install --upgrade pip
pip install llm-data-converter --no-cache-dir
Solution 2: Install with specific tokenizers version
pip install tokenizers==0.21.0
pip install llm-data-converter
Solution 3: Use conda (recommended for complex dependencies)
conda install -c conda-forge llm-data-converter
Solution 4: Install Rust (if you want to build from source)
# On macOS/Linux
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Then restart your terminal and try installing again
pip install llm-data-converter
Numpy/Homebrew Conflict (macOS)
If you see this error on macOS:
error: uninstall-no-record-file
× Cannot uninstall numpy 2.1.2
╰─> The package's contents are unknown: no RECORD file was found for numpy.
hint: The package was installed by brew. You should check if it can uninstall the package.
This happens when numpy is installed via Homebrew and conflicts with pip. Here are solutions:
Solution 1: Use virtual environment (recommended)
# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
pip install llm-data-converter
Solution 2: Install with --ignore-installed flag
pip install llm-data-converter --ignore-installed numpy
Solution 3: Use conda instead of pip
conda install -c conda-forge llm-data-converter
Solution 4: Uninstall brew numpy (if you don't need it)
brew uninstall numpy
pip install llm-data-converter
Hugging Face Authentication Issues
If you see authentication errors when downloading models:
huggingface_hub.errors.HfHubHTTPError: 401 Client Error: Unauthorized
The library now uses Nanonets S3 hosting by default, so this should not occur. If it does:
- Set up a Hugging Face token (optional):
pip install huggingface_hub
huggingface-cli login
- Force S3 usage (recommended):
# The library uses S3 by default, but you can ensure it:
export LLM_CONVERTER_PREFER_HF=false
Model Download Issues
If models fail to download:
- Check your internet connection
- Try again - the library has automatic retry logic
- Models are cached locally after first download
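If a download still fails intermittently, the first conversion can be wrapped in an outer retry with exponential backoff. This is a generic sketch that sits on top of whatever retry logic the library already has; `with_retries` is a hypothetical helper, and catching every `Exception` is a simplification that should be narrowed in real code.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n seconds and try again.

    Re-raises the last exception once `attempts` calls have failed.
    (Hypothetical helper; not part of llm-data-converter.)
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A call might look like `with_retries(lambda: converter.convert("document.pdf"))`, which triggers the model download on the first attempt.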
Runtime Issues
Memory Issues with Large Documents
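For very large documents, memory use can be high. Python itself does not impose a memory limit, so there is no interpreter flag to raise one. Switching CPython to the system allocator may reduce memory overhead in some environments:
# Use the system allocator instead of Python's built-in one
export PYTHONMALLOC=malloc
python your_script.py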
GPU/CPU Issues
The library works on CPU by default. For better performance:
- Install PyTorch with CUDA support if you have a GPU
- Models will automatically use available hardware
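To see which hardware is likely to be used, PyTorch's own CUDA check can be queried before running a conversion. The sketch below uses `torch.cuda.is_available()`, which is part of PyTorch; `detect_device` is a hypothetical helper, and whether the library picks the same device is an assumption.

```python
def detect_device():
    """Return "cuda" if PyTorch is installed and sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so only CPU execution is possible.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```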
Getting Help
- GitHub Issues: Report bugs or request features
- Documentation: Check this README and the scripts documentation
- Community: Join discussions on GitHub
License
MIT License - see LICENSE file for details.
File details
Details for the file llm_data_converter-2.1.6.tar.gz.
File metadata
- Download URL: llm_data_converter-2.1.6.tar.gz
- Upload date:
- Size: 39.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f621c69ae21a168c3a9804159b05e409f8516ff568fc1c2203ffb1ff0a0c7b6d
|
|
| MD5 |
636295b4ac5a38e1a3aa96b2f35d8a97
|
|
| BLAKE2b-256 |
1089ca18fc0607633e9b55c0a76453609c57637d9eff797fb4b6a0bd81a22d7b
|
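A downloaded file can be checked against its published SHA256 digest with the standard library. This is a minimal sketch using `hashlib`; `sha256_of` and `verify_sha256` are hypothetical helper names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_sha256(path, expected_hex):
    """Return True if the file's digest matches expected_hex (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify_sha256("llm_data_converter-2.1.6.tar.gz", "<digest from the table>")` should return True for an intact download.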
File details
Details for the file llm_data_converter-2.1.6-py3-none-any.whl.
File metadata
- Download URL: llm_data_converter-2.1.6-py3-none-any.whl
- Upload date:
- Size: 50.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
17d1481bc59f5f3a9fac1b2719e3b8d99f76d144cb8c7c8f70902845da54eb18
|
|
| MD5 |
c4b79772b558cdf68ce647b7e6380dfb
|
|
| BLAKE2b-256 |
7531011b545646d6873daade33dd3abb27bae61fc4700038bd9c104619389ce8
|