Open-source document-to-Markdown converter for LLM training data. Converts PDF, Word, PowerPoint, Excel, images, and URLs to clean Markdown, JSON, or HTML, locally or through the default cloud API. An alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, and Tesseract.

LLM Data Converter

Try Cloud Mode for Free!
Convert documents instantly with our cloud API - no setup required.
For unlimited processing, get your free API key.

Transform any document, image, or URL into LLM-ready formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR.

Key Features

Cloud Processing (Default): Instant conversion with the Nanonets API - no local setup needed
Local Processing: CPU/GPU options for complete privacy and control
Universal Input: PDFs, Word docs, Excel, PowerPoint, images, URLs, and raw text
Smart Output: Markdown, JSON, CSV, HTML, and plain text formats
LLM-Optimized: Clean, structured output perfect for AI processing
Intelligent Extraction: Extract specific fields or structured data using AI
Advanced OCR: Multiple OCR engines with automatic fallback
Table Processing: Accurate table extraction and formatting
Image Handling: Extract text from images and visual content
URL Processing: Direct conversion from web pages

Installation

pip install llm-data-converter

Quick Start

Basic Usage (Cloud Mode - Default)

from llm_converter import FileConverter

# Default cloud mode - no setup required
converter = FileConverter()

# Convert any document
result = converter.convert("document.pdf")

# Get different output formats
markdown = result.to_markdown()
json_data = result.to_json()
html = result.to_html()
csv_tables = result.to_csv()

# Extract specific fields
extracted_fields = result.to_json(specified_fields=[
    "title", "author", "date", "summary", "key_points"
])

# Extract using JSON schema
schema = {
    "title": "string",
    "author": "string", 
    "date": "string",
    "summary": "string",
    "key_points": ["string"],
    "metadata": {
        "page_count": "number",
        "language": "string"
    }
}
structured_data = result.to_json(json_schema=schema)
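Because schema-based extraction is model-driven, it can be worth checking the returned payload against the schema before using it. The following is a minimal sketch, not part of the library: `matches_schema` is a hypothetical helper, and the reading of the shorthand ("string" means `str`, "number" means `int` or `float`, a one-element list means "list of", a dict means "object with these keys") is an assumption.

```python
from typing import Any


def matches_schema(value: Any, schema: Any) -> bool:
    """Check a value against the shorthand schema style used above.

    Assumed semantics (not confirmed by the library): "string" -> str,
    "number" -> int or float (bool excluded), [item] -> list whose items
    all match item, {...} -> dict in which every schema key is present.
    """
    if schema == "string":
        return isinstance(value, str)
    if schema == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if isinstance(schema, list) and len(schema) == 1:
        return isinstance(value, list) and all(
            matches_schema(item, schema[0]) for item in value
        )
    if isinstance(schema, dict):
        return isinstance(value, dict) and all(
            key in value and matches_schema(value[key], sub)
            for key, sub in schema.items()
        )
    raise ValueError(f"Unsupported schema entry: {schema!r}")


demo_schema = {
    "title": "string",
    "key_points": ["string"],
    "metadata": {"page_count": "number"},
}
good = {"title": "Q3 report", "key_points": ["a", "b"], "metadata": {"page_count": 12}}
bad = {"title": "Q3 report", "key_points": "a, b", "metadata": {"page_count": 12}}

print(matches_schema(good, demo_schema))  # True
print(matches_schema(bad, demo_schema))   # False
```

Apply the check to the extracted payload itself rather than to any wrapper keys the response may carry around it.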

With API Key (Unlimited Access)

# Get your free API key from https://app.nanonets.com/#/keys
converter = FileConverter(api_key="your_api_key_here")
result = converter.convert("document.pdf")

Local Processing

# Force local CPU processing
converter = FileConverter(cpu_preference=True)

# Force local GPU processing (requires CUDA)
converter = FileConverter(gpu_preference=True)
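When the processing mode comes from configuration rather than code, a small mapping keeps the choice in one place. This is a sketch under assumptions: the mode labels "cloud", "cpu", and "gpu" and the helper name `converter_kwargs` are hypothetical; only the keyword names `api_key`, `cpu_preference`, and `gpu_preference` come from the examples above.

```python
from typing import Any, Dict, Optional


def converter_kwargs(mode: str = "cloud", api_key: Optional[str] = None) -> Dict[str, Any]:
    """Translate a mode label into FileConverter keyword arguments."""
    if mode == "cloud":
        # No arguments means the rate-limited default cloud mode.
        return {"api_key": api_key} if api_key else {}
    if mode == "cpu":
        return {"cpu_preference": True}
    if mode == "gpu":
        return {"gpu_preference": True}
    raise ValueError(f"Unknown mode: {mode!r}")


print(converter_kwargs("cpu"))  # {'cpu_preference': True}
```

With the package installed, the result can be unpacked directly, for example `FileConverter(**converter_kwargs("cpu"))`.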

Output Formats

Markdown: Clean, LLM-friendly format with preserved structure
JSON: Structured data with metadata and intelligent parsing
HTML: Formatted output with styling and layout
CSV: Extract tables and data in spreadsheet format
Text: Plain text with smart formatting

Examples

Convert Multiple File Types

from llm_converter import FileConverter

converter = FileConverter()

# PDF document
pdf_result = converter.convert("report.pdf")
print(pdf_result.to_markdown())

# Word document  
docx_result = converter.convert("document.docx")
print(docx_result.to_json())

# Excel spreadsheet
excel_result = converter.convert("data.xlsx")
print(excel_result.to_csv())

# PowerPoint presentation
pptx_result = converter.convert("slides.pptx")
print(pptx_result.to_html())

# Image with text
image_result = converter.convert("screenshot.png")
print(image_result.to_text())

# Web page
url_result = converter.convert("https://example.com")
print(url_result.to_markdown())
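For a mixed batch of inputs, the pairing of file type and output method shown above can be expressed as a lookup table. The sketch below is illustrative only: `pick_method` and `render` are hypothetical helpers, the suffix-to-method pairs simply mirror the example, and `converter` is assumed to be any object with a `convert()` method that returns a result exposing the `to_*` methods.

```python
from pathlib import PurePosixPath
from typing import Any

# Output method per input suffix, mirroring the pairings in the example above.
METHOD_BY_SUFFIX = {
    ".pdf": "to_markdown",
    ".docx": "to_json",
    ".xlsx": "to_csv",
    ".pptx": "to_html",
    ".png": "to_text",
}


def pick_method(source: str, default: str = "to_markdown") -> str:
    """Choose an output method name from the source's file suffix."""
    if source.startswith(("http://", "https://")):
        return default  # web pages are rendered as Markdown in the example above
    suffix = PurePosixPath(source).suffix.lower()
    return METHOD_BY_SUFFIX.get(suffix, default)


def render(converter: Any, source: str) -> Any:
    """Convert source and call the chosen output method on the result."""
    result = converter.convert(source)
    return getattr(result, pick_method(source))()


print(pick_method("report.pdf"))  # to_markdown
print(pick_method("data.XLSX"))   # to_csv
```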

Extract Tables to CSV

from llm_converter import FileConverter

converter = FileConverter()

# Extract all tables from a document
result = converter.convert("financial_report.pdf")
csv_data = result.to_csv(include_all_tables=True)
print(csv_data)
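Since `to_csv()` returns a string, the standard `csv` module can turn it into rows for further processing. This is a minimal sketch that assumes the string holds a single table with a header row; how several tables are separated in one string is not described here, so that case is not handled. The sample data and the helper name `csv_to_rows` are made up for illustration.

```python
import csv
import io
from typing import Dict, List


def csv_to_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse one CSV table (header row first) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Stand-in for the string returned by result.to_csv().
sample = "item,amount\nHosting,120.50\nSupport,80.00\n"
rows = csv_to_rows(sample)
total = sum(float(row["amount"]) for row in rows)

print(rows[0])  # {'item': 'Hosting', 'amount': '120.50'}
print(total)    # 200.5
```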

Enhanced JSON Conversion

When a local Ollama model is available, JSON conversion uses it to interpret the document's structure; otherwise the standard JSON parser is used:

from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("document.pdf")

# Enhanced JSON with Ollama (when available)
json_data = result.to_json()
print(json_data["format"])  # "ollama_structured_json" or "structured_json"

# The enhanced conversion provides:
# - Better document structure understanding
# - Intelligent table parsing
# - Automatic metadata extraction  
# - Key information identification
# - Proper data type handling

Requirements for enhanced JSON (if using cpu_preference=True):

Install: pip install 'llm-data-converter[local-llm]'
Install Ollama and run: ollama serve
Pull a model: ollama pull llama3.2

If Ollama is not available, the library automatically falls back to the standard JSON parser.
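To find out ahead of time which path a local run is likely to take, you can probe the Ollama server yourself. The sketch below is not part of the library: `ollama_available` is a hypothetical helper, and it assumes Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint, which lists pulled models.

```python
import urllib.error
import urllib.request


def ollama_available(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or malformed URL: treat as unavailable.
        return False


if ollama_available():
    print("Ollama reachable: enhanced JSON should be possible")
else:
    print("Ollama not reachable: expect the standard JSON parser")
```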

Extract Specific Fields & Structured Data

# Extract specific fields from any document (reuses a FileConverter instance named converter)
result = converter.convert("invoice.pdf")

# Method 1: Extract specific fields
extracted = result.to_json(specified_fields=[
    "invoice_number", 
    "total_amount", 
    "vendor_name",
    "due_date"
])

# Method 2: Extract using JSON schema
schema = {
    "invoice_number": "string",
    "total_amount": "number", 
    "vendor_name": "string",
    "line_items": [{
        "description": "string",
        "amount": "number"
    }]
}

structured = result.to_json(json_schema=schema)

How it works:

Automatically uses cloud API when available
Falls back to local Ollama for privacy-focused processing
Same interface works for both cloud and local modes
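The library is described as handling this choice itself. If you want explicit control over the order, the same idea can be written as a generic "first attempt that succeeds" loop. This is only an illustration of the pattern, not the library's internals: `first_successful` and the two stub functions are hypothetical, and catching `Exception` is a placeholder because the library's exception types are not documented here.

```python
from typing import Any, Callable, List, Sequence


def first_successful(source: str, attempts: Sequence[Callable[[str], Any]]) -> Any:
    """Call each conversion function in order and return the first result."""
    errors: List[Exception] = []
    for attempt in attempts:
        try:
            return attempt(source)
        except Exception as exc:  # placeholder: real exception types are unknown
            errors.append(exc)
    raise RuntimeError(f"All {len(attempts)} attempts failed for {source!r}: {errors}")


# Stand-ins for a cloud conversion that fails and a local one that works.
def cloud_stub(source: str) -> str:
    raise ConnectionError("cloud unreachable")


def local_stub(source: str) -> str:
    return f"# converted {source} locally"


print(first_successful("invoice.pdf", [cloud_stub, local_stub]))
# prints: # converted invoice.pdf locally
```

With the package installed, the attempts could be `FileConverter().convert` followed by `FileConverter(cpu_preference=True).convert`.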

Cloud Mode Usage Examples:

from llm_converter import FileConverter

# Default cloud mode (rate-limited without API key)
converter = FileConverter()

# Or, with an API key for unlimited access (this replaces the converter above)
converter = FileConverter(api_key="your_api_key_here")

# Extract specific fields from invoice
result = converter.convert("invoice.pdf")

# Extract key invoice information
invoice_fields = result.to_json(specified_fields=[
    "invoice_number",
    "total_amount", 
    "vendor_name",
    "due_date",
    "items_count"
])

print("Extracted Invoice Fields:")
print(invoice_fields)
# Output: {"extracted_fields": {"invoice_number": "INV-001", ...}, "format": "specified_fields"}

# Extract structured data using schema
invoice_schema = {
    "invoice_number": "string",
    "total_amount": "number",
    "vendor_name": "string",
    "billing_address": {
        "street": "string",
        "city": "string", 
        "zip_code": "string"
    },
    "line_items": [{
        "description": "string",
        "quantity": "number",
        "unit_price": "number",
        "total": "number"
    }],
    "taxes": {
        "tax_rate": "number",
        "tax_amount": "number"
    }
}

structured_invoice = result.to_json(json_schema=invoice_schema)
print("Structured Invoice Data:")
print(structured_invoice)
# Output: {"structured_data": {...}, "schema": {...}, "format": "structured_json"}

# Extract from different document types
receipt = converter.convert("receipt.jpg")
receipt_data = receipt.to_json(specified_fields=[
    "merchant_name", "total_amount", "date", "payment_method"
])

contract = converter.convert("contract.pdf") 
contract_schema = {
    "parties": [{
        "name": "string",
        "role": "string"
    }],
    "contract_value": "number",
    "start_date": "string",
    "end_date": "string",
    "key_terms": ["string"]
}
contract_data = contract.to_json(json_schema=contract_schema)
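The output comments above show two response shapes, each tagged with a `format` value. A small helper can return just the payload regardless of which call produced it. This is a sketch: `unwrap` is hypothetical, and the mapping covers only the two shapes shown in those comments; other `format` values are rejected rather than guessed at.

```python
from typing import Any, Dict

# Payload key for each "format" value shown in the output comments above.
PAYLOAD_KEY = {
    "specified_fields": "extracted_fields",
    "structured_json": "structured_data",
}


def unwrap(response: Dict[str, Any]) -> Dict[str, Any]:
    """Return the extracted payload from a to_json() response."""
    fmt = response.get("format")
    key = PAYLOAD_KEY.get(fmt)
    if key is None or key not in response:
        raise KeyError(f"Unrecognised response format: {fmt!r}")
    return response[key]


sample = {"extracted_fields": {"invoice_number": "INV-001"}, "format": "specified_fields"}
print(unwrap(sample)["invoice_number"])  # INV-001
```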

Local extraction requirements (if using cpu_preference=True):

Install ollama package: pip install 'llm-data-converter[local-llm]'
Install Ollama and run: ollama serve
Pull a model: ollama pull llama3.2

Chain with LLM

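# Perfect for LLM workflows (converter is a FileConverter instance, as above)
document_text = converter.convert("research_paper.pdf").to_markdown()

# Use with any LLM; your_llm_client is a placeholder for your own client object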
response = your_llm_client.chat(
    messages=[{
        "role": "user", 
        "content": f"Summarize this research paper:\n\n{document_text}"
    }]
)
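Long documents may not fit in a single prompt, so the Markdown often needs to be split first. The following is a minimal sketch that splits on blank lines and packs paragraphs into chunks of at most `max_chars` characters. The helper name `chunk_markdown` is hypothetical, and character count is used as a rough stand-in for tokens; swap in your model's tokenizer if you need exact limits.

```python
from typing import List


def chunk_markdown(text: str, max_chars: int = 8000) -> List[str]:
    """Split text on blank lines into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


text = "# Title\n\nFirst paragraph.\n\nSecond paragraph."
chunks = chunk_markdown(text, max_chars=30)
print(len(chunks))  # 2
```

Each chunk can then be sent as its own message, with the per-chunk summaries combined in a final call.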

Command Line Interface

# Basic conversion (cloud mode default)
llm-converter document.pdf

# With API key for unlimited access
llm-converter document.pdf --api-key YOUR_API_KEY

# Local processing modes
llm-converter document.pdf --cpu-mode
llm-converter document.pdf --gpu-mode

# Different output formats
llm-converter document.pdf --output json
llm-converter document.pdf --output html
llm-converter document.pdf --output csv

# Extract specific fields
llm-converter invoice.pdf --output json --extract-fields invoice_number total_amount

# Extract with JSON schema
llm-converter document.pdf --output json --json-schema schema.json

# Multiple files
llm-converter *.pdf --output markdown

# Save to file
llm-converter document.pdf --output-file result.md

# Comprehensive field extraction examples
llm-converter invoice.pdf --output json --extract-fields invoice_number vendor_name total_amount due_date line_items

# Extract from different document types with specific fields
llm-converter receipt.jpg --output json --extract-fields merchant_name total_amount date payment_method

llm-converter contract.pdf --output json --extract-fields parties contract_value start_date end_date

# Using JSON schema files for structured extraction
llm-converter invoice.pdf --output json --json-schema invoice_schema.json
llm-converter contract.pdf --output json --json-schema contract_schema.json

# Combine with API key for unlimited access
llm-converter document.pdf --api-key YOUR_API_KEY --output json --extract-fields title author date summary

# Force local processing with field extraction (requires Ollama)
llm-converter document.pdf --cpu-mode --output json --extract-fields key_points conclusions recommendations

Example schema.json file:

{
  "invoice_number": "string",
  "total_amount": "number",
  "vendor_name": "string",
  "billing_address": {
    "street": "string",
    "city": "string",
    "zip_code": "string"
  },
  "line_items": [{
    "description": "string",
    "quantity": "number",
    "unit_price": "number"
  }]
}
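When the CLI is driven from a script, building the argument list programmatically avoids quoting mistakes. This sketch uses only flags that appear in the examples above; the helper name `build_command` is hypothetical, and `--extract-fields` is placed last simply because that is where the examples put it.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(
    document: str,
    output: str = "json",
    fields: Optional[Sequence[str]] = None,
    schema_path: Optional[str] = None,
    output_file: Optional[str] = None,
) -> List[str]:
    """Assemble an llm-converter argument list from the flags shown above."""
    command = ["llm-converter", document, "--output", output]
    if schema_path:
        command += ["--json-schema", schema_path]
    if output_file:
        command += ["--output-file", output_file]
    if fields:
        command += ["--extract-fields", *fields]
    return command


command = build_command("invoice.pdf", fields=["invoice_number", "total_amount"])
print(shlex.join(command))
# prints: llm-converter invoice.pdf --output json --extract-fields invoice_number total_amount
```

Once the CLI is installed, the list can be passed to `subprocess.run(command, check=True)`.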

API Reference for library

FileConverter

FileConverter(
    preserve_layout: bool = True,   # Preserve document structure
    include_images: bool = True,    # Include image content
    ocr_enabled: bool = True,       # Enable OCR processing
    api_key: str = None,            # API key for unlimited cloud access
    model: str = None,              # Model for cloud processing ("gemini", "openapi")
    cpu_preference: bool = False,   # Force local CPU processing
    gpu_preference: bool = False    # Force local GPU processing
)

ConversionResult Methods

result.to_markdown() -> str              # Clean markdown output
result.to_json(                          # Structured JSON
    specified_fields: List[str] = None,  # Extract specific fields
    json_schema: Dict = None             # Extract with schema
) -> Dict
result.to_html() -> str                  # Formatted HTML
result.to_csv() -> str                   # CSV format for tables (the "Extract Tables to CSV" example also passes include_all_tables=True)
result.to_text() -> str                  # Plain text

Advanced Configuration

Custom OCR Settings

converter = FileConverter(
    cpu_preference=True,        # Use local processing
    ocr_enabled=True,          # Enable OCR
    preserve_layout=True,      # Maintain structure
    include_images=True        # Process images
)

Environment Variables

export NANONETS_API_KEY="your_api_key"
# Now all conversions use your API key automatically
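If you prefer to pass the key explicitly, for example to make tests independent of the shell environment, the lookup can be done in your own code. This is a sketch: `resolve_api_key` is a hypothetical helper, and the precedence it applies (explicit argument first, then `NANONETS_API_KEY`) is an assumption, not documented library behavior.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer an explicit key, then NANONETS_API_KEY, else None."""
    key = explicit or env.get("NANONETS_API_KEY", "")
    return key.strip() or None


print(resolve_api_key(env={"NANONETS_API_KEY": "abc123"}))  # abc123
print(resolve_api_key(env={}))                               # None
```

A `None` result corresponds to the rate-limited default cloud mode described earlier.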

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Email: support@nanonets.com
Issues: GitHub Issues
Discussions: GitHub Discussions

Star this repo if you find it helpful! Your support helps us improve the library.

Release history

2.2.0: Jul 25, 2025 (this version)
2.1.7: Jul 23, 2025
2.1.6: Jul 21, 2025
2.1.5: Jul 21, 2025
2.1.3: Jul 17, 2025
2.1.2: Jul 16, 2025
2.1.1: Jul 16, 2025
2.1.0: Jul 16, 2025
2.0.7: Jul 15, 2025
2.0.6: Jul 15, 2025
2.0.5: Jul 15, 2025
2.0.4: Jul 15, 2025
2.0.3: Jul 15, 2025
2.0.2: Jul 15, 2025
2.0.1: Jul 15, 2025
2.0.0: Jul 15, 2025
0.4.1: Jul 14, 2025
0.4.0: Jul 14, 2025
0.2.3: Jul 14, 2025
0.2.2: Jul 9, 2025
0.2.1: Jul 9, 2025
0.2.0: Jul 9, 2025
0.1.0: Jul 9, 2025

Download files

Source Distribution

llm_data_converter-2.2.0.tar.gz (55.9 kB)

Uploaded Jul 25, 2025 Source

Built Distribution

llm_data_converter-2.2.0-py3-none-any.whl (71.8 kB)

Uploaded Jul 25, 2025 Python 3

File details

Details for the file llm_data_converter-2.2.0.tar.gz.

File metadata

Download URL: llm_data_converter-2.2.0.tar.gz
Upload date: Jul 25, 2025
Size: 55.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for llm_data_converter-2.2.0.tar.gz
Algorithm	Hash digest
SHA256	`5b294146e1346911696dcf595284738f82ccdd69d0de89d14b964c3fd83facd0`
MD5	`f4380c3f83b744ddedbb5f11c8e24aec`
BLAKE2b-256	`72b2acdd48fd94704ce97490d8718de5fa862be400ea6bed760bdac8507662ca`

File details

Details for the file llm_data_converter-2.2.0-py3-none-any.whl.

File metadata

Download URL: llm_data_converter-2.2.0-py3-none-any.whl
Upload date: Jul 25, 2025
Size: 71.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for llm_data_converter-2.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`04e0e425e93cd3fe77cad4a8dfe258989c0c037b6c7104003cb1d0c0432902c1`
MD5	`4c0f6032af473d29c76446386bb6e094`
BLAKE2b-256	`1e83051a8f73cf3e07b598a3332d58ed7f6046430f6eef288c402a222a6d2489`

