LLM Data Converter
Convert any document, text, or URL into LLM-ready data format.
Installation
pip install llm-data-converter
Quick Start
from llm_converter import FileConverter
from litellm import completion
# Basic conversion
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
# Pass the result to an LLM
response = completion(
    model="openai/gpt-4o",
    messages=[{"content": f"Extract info from this document: \n{result}", "role": "user"}]
)
Features
- Multiple Input Formats: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
- Multiple Output Formats: Markdown, HTML, JSON, Plain Text
- LLM Integration: Seamless integration with LiteLLM and other LLM libraries
- Local Processing: Process documents locally without external dependencies
- Layout Preservation: Maintain document structure and formatting
Usage Examples
Convert PDF to Markdown
from llm_converter import FileConverter
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
Convert URL to HTML
from llm_converter import FileConverter
converter = FileConverter()
result = converter.convert("https://example.com").to_html()
print(result)
Convert Excel to JSON
from llm_converter import FileConverter
converter = FileConverter()
result = converter.convert("data.xlsx").to_json()
print(result)
Chain with LLM
from llm_converter import FileConverter
from litellm import completion
converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()
# Use with any LLM
response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
    ]
)
print(response.choices[0].message.content)
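Long documents can exceed a model's context window. The sketch below shows one way to build the `messages` list with a simple character cap before calling `completion`. The helper name `build_messages`, the `max_chars` default, and the `[truncated]` marker are illustrative assumptions, not part of this package or of LiteLLM.

```python
from typing import Dict, List


def build_messages(document_content: str, instruction: str,
                   max_chars: int = 20000) -> List[Dict[str, str]]:
    """Build a chat message list, truncating very long documents.

    Hypothetical helper: the character cap is a crude stand-in for
    real token counting and should be tuned per model.
    """
    if len(document_content) > max_chars:
        # Keep the head of the document and mark that it was cut
        document_content = document_content[:max_chars] + "\n\n[truncated]"
    return [
        {"role": "system",
         "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"{instruction}\n\n{document_content}"},
    ]
```

The returned list can be passed as `messages=` in the `completion` call shown above.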
Supported Formats
Input Formats
- Documents: PDF, DOCX, TXT
- Web: URLs, HTML files
- Data: Excel (XLSX, XLS), CSV
- Images: PNG, JPG, JPEG (with OCR capabilities)
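Before handing a path to the converter, it can be useful to check whether it falls into one of the input categories listed above. This is a minimal sketch: the extension table is copied from that list, while the function name `classify_input` and the category labels are assumptions made for illustration. How the library itself detects formats is not documented here.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the Input Formats list; the mapping is illustrative
SUPPORTED_EXTENSIONS = {
    ".pdf": "document", ".docx": "document", ".txt": "document",
    ".html": "web",
    ".xlsx": "data", ".xls": "data", ".csv": "data",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
}


def classify_input(source: str) -> Optional[str]:
    """Return the input category for a path or URL, or None if unsupported."""
    if source.startswith(("http://", "https://")):
        return "web"
    # Compare extensions case-insensitively, e.g. "report.PDF"
    return SUPPORTED_EXTENSIONS.get(Path(source).suffix.lower())
```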
Output Formats
- Markdown: Clean, structured markdown
- HTML: Formatted HTML with styling
- JSON: Structured JSON data
- Plain Text: Simple text extraction
Advanced Usage
Custom Configuration
from llm_converter import FileConverter
converter = FileConverter(
    preserve_layout=True,
    include_images=True,
    ocr_enabled=True
)
result = converter.convert("document.pdf").to_markdown()
Batch Processing
from llm_converter import FileConverter
converter = FileConverter()
files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
results = []
for file in files:
    result = converter.convert(file).to_markdown()
    results.append(result)
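In the loop above, a single unreadable file stops the whole batch. The following sketch collects failures per file instead. It assumes only the `convert(...).to_markdown()` chain shown in this README. The helper name `convert_batch` is hypothetical, and because the exceptions the library raises are not documented here, it catches `Exception` broadly.

```python
from typing import Any, Dict, Iterable, Tuple


def convert_batch(converter: Any,
                  files: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Convert each file to Markdown, recording errors instead of aborting.

    Returns (results, errors), both keyed by file path.
    """
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in files:
        try:
            results[path] = converter.convert(path).to_markdown()
        except Exception as exc:  # broad on purpose: error types are an unknown here
            errors[path] = f"{type(exc).__name__}: {exc}"
    return results, errors
```

Called as `results, errors = convert_batch(FileConverter(), files)`, this leaves successful conversions in `results` and one message per failed file in `errors`.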
API Reference
FileConverter
Main class for converting documents to LLM-ready formats.
Methods
- convert(file_path: str) -> ConversionResult: Convert a file to internal format
- convert_url(url: str) -> ConversionResult: Convert a URL to internal format
- convert_text(text: str) -> ConversionResult: Convert plain text to internal format
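When the kind of input is not known in advance, the three methods above can sit behind one small dispatcher. This is a sketch under stated assumptions: the method names come from the list above, but the function `convert_any` and its routing rules (URL prefix first, then an existing file, otherwise plain text) are illustrative choices, not library behavior.

```python
import os
from typing import Any


def convert_any(converter: Any, source: str) -> Any:
    """Route a string to convert_url, convert, or convert_text."""
    if source.startswith(("http://", "https://")):
        return converter.convert_url(source)
    if os.path.isfile(source):
        return converter.convert(source)
    # Anything that is neither a URL nor an existing file is treated as raw text
    return converter.convert_text(source)
```

One consequence of this routing is that a mistyped file path is silently treated as text, so callers who need strict behavior may prefer to call the specific method directly.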
ConversionResult
Result object with methods to export to different formats.
Methods
- to_markdown() -> str: Export as markdown
- to_html() -> str: Export as HTML
- to_json() -> dict: Export as JSON
- to_text() -> str: Export as plain text
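If the output format is chosen at runtime (for example from a command-line flag), the export methods above can be looked up by name. A minimal sketch, assuming only the four method names listed; the `EXPORTERS` table and the `export` function are hypothetical helpers.

```python
from typing import Any

# Format name -> ConversionResult method name, as listed in this README
EXPORTERS = {
    "markdown": "to_markdown",
    "html": "to_html",
    "json": "to_json",
    "text": "to_text",
}


def export(result: Any, fmt: str) -> Any:
    """Call the export method matching fmt on a conversion result."""
    try:
        method_name = EXPORTERS[fmt.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown format {fmt!r}; expected one of {sorted(EXPORTERS)}"
        ) from None
    return getattr(result, method_name)()
```

For example, `export(converter.convert("document.pdf"), "json")` would return the same value as calling `.to_json()` directly.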
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
MIT License - see LICENSE file for details.