A library to convert PDF files to Markdown format.
PDF2Markdown4LLM
PDF2Markdown4LLM is a Python library that converts PDF documents to Markdown format, specifically optimized for Large Language Models (LLMs). It intelligently preserves document structure, identifies headers based on font sizes, and handles tables while maintaining the original document flow.
Features
- Intelligent header detection based on font size analysis
- Table extraction and conversion to Markdown format
- Maintains document structure and flow
- Handles nested content and complex layouts
- Font size classification for consistent header levels
- Clean table formatting with proper alignment
- Optional header removal functionality
- Robust error handling and validation
Installation
pip install pdf2markdown4llm
Dependencies
- pdfplumber
- dataclasses (standard library, Python 3.7+)
- typing (standard library)
- collections (standard library)
- re (standard library)
Usage
Basic usage example:
from pdf2markdown4llm import PDF2Markdown4LLM

def progress_callback(progress):
    """Callback function to handle progress"""
    print(f"Phase: {progress.phase.value}, Page {progress.current_page}/{progress.total_pages}, Progress: {progress.percentage:.1f}%, Message: {progress.message}")

# Initialize converter
converter = PDF2Markdown4LLM(remove_headers=False, skip_empty_tables=True, table_header="### Table", progress_callback=progress_callback)

# Convert PDF to Markdown
markdown_content = converter.convert("input.pdf")

# Save to file
with open("output.md", "w", encoding="utf-8") as md_file:
    md_file.write(markdown_content)
Configuration Options
- remove_headers: Boolean flag to remove existing markdown headers from text (default: False)
- table_header: String to specify the header level for tables (default: "###")
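To make the remove_headers option more concrete, here is a minimal sketch of one way existing Markdown header markers could be stripped from extracted text. It is an assumption about the intended behaviour, not the library's actual code: the function name strip_markdown_headers and the choice to keep the header text while dropping only the leading "#" markers are both hypothetical.

```python
import re

# Hypothetical helper, not part of pdf2markdown4llm.
# Matches up to three spaces of indentation, one to six "#" characters,
# and the whitespace that follows them at the start of a line.
HEADER_MARK = re.compile(r"^\s{0,3}#{1,6}\s+")


def strip_markdown_headers(text):
    """Drop leading Markdown header markers, keeping the header text itself."""
    return "\n".join(HEADER_MARK.sub("", line) for line in text.splitlines())


print(strip_markdown_headers("# Title\nBody #1 stays\n### Sub"))
```

Only markers at the start of a line and followed by whitespace are affected, so inline text such as "#1" or "#hashtag" is left alone.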
Key Components
FontSizeClassifier
Analyzes font sizes throughout the document to determine appropriate header levels:
- Automatically identifies the normal text size
- Classifies larger fonts into appropriate header levels
- Handles font size variations and inconsistencies
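The idea behind font-size-based header detection can be sketched as follows. This is an illustration under stated assumptions, not the FontSizeClassifier implementation: the function name classify_font_sizes, the max_levels and tolerance parameters, and the rule that the most frequent size is body text are all hypothetical.

```python
from collections import Counter


def classify_font_sizes(sizes, max_levels=3, tolerance=0.5):
    """Map font sizes to Markdown header levels (1 = largest font).

    sizes: iterable of font sizes, one entry per character or word.
    Returns (body_size, {font_size: header_level}).
    """
    # Round to one decimal to absorb tiny variations such as 11.98 vs 12.0.
    rounded = [round(s, 1) for s in sizes]
    if not rounded:
        return None, {}

    # Assumption: the most frequent size is the normal body text.
    body_size = Counter(rounded).most_common(1)[0][0]

    # Sizes clearly larger than body text are header candidates.
    larger = sorted({s for s in rounded if s > body_size + tolerance}, reverse=True)

    # The largest size becomes level 1; sizes beyond max_levels share the deepest level.
    levels = {size: min(i + 1, max_levels) for i, size in enumerate(larger)}
    return body_size, levels


sizes = [12.0] * 200 + [11.98] * 40 + [18.0] * 6 + [24.0] * 2 + [9.0] * 10
body, levels = classify_font_sizes(sizes)
print(body, levels)  # 12.0 {24.0: 1, 18.0: 2}
```

Sizes smaller than the body text (footnotes, captions) are never treated as headers in this sketch.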
PDFContentExtractor
Extracts content while preserving structure:
- Processes text and tables separately
- Maintains original document flow
- Validates table boundaries
- Handles nested content
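One way to keep text and tables in their original order is to drop words that fall inside a table's bounding box and then sort everything by vertical position. The sketch below assumes word dictionaries shaped like the output of pdfplumber's extract_words() and bounding boxes in pdfplumber's (x0, top, x1, bottom) order; the function names and the "rows" key are hypothetical and not taken from PDFContentExtractor.

```python
def inside(word, bbox):
    """True if the centre of the word lies within bbox = (x0, top, x1, bottom)."""
    cx = (word["x0"] + word["x1"]) / 2
    cy = (word["top"] + word["bottom"]) / 2
    x0, top, x1, bottom = bbox
    return x0 <= cx <= x1 and top <= cy <= bottom


def merge_page_content(words, tables):
    """Return ("text" | "table", payload) pairs in top-to-bottom order.

    words:  dicts with "text", "x0", "x1", "top", "bottom"
    tables: dicts with "bbox" and "rows"
    """
    # Skip tables whose bounding box is degenerate (zero or negative area).
    valid = [
        t for t in tables
        if t["bbox"][0] < t["bbox"][2] and t["bbox"][1] < t["bbox"][3]
    ]

    blocks = [(t["bbox"][1], "table", t["rows"]) for t in valid]
    for w in words:
        # Words inside a table are already represented by the table itself.
        if not any(inside(w, t["bbox"]) for t in valid):
            blocks.append((w["top"], "text", w["text"]))

    # Sort by the top coordinate only, so reading order follows the page.
    blocks.sort(key=lambda b: b[0])
    return [(kind, payload) for _, kind, payload in blocks]


words = [
    {"text": "Intro", "x0": 50, "x1": 90, "top": 40, "bottom": 52},
    {"text": "A", "x0": 60, "x1": 70, "top": 110, "bottom": 120},
    {"text": "Outro", "x0": 50, "x1": 95, "top": 200, "bottom": 212},
]
tables = [{"bbox": (50, 100, 300, 150), "rows": [["A", "B"]]}]
print(merge_page_content(words, tables))
```

The word "A" is omitted from the text blocks because its centre lies inside the table, so the cell content is not emitted twice.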
MarkdownConverter
Converts extracted content to clean Markdown:
- Proper table formatting
- Header level preservation
- Clean text formatting
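Table formatting can be illustrated with a short sketch that turns a list of rows into a Markdown table. It assumes rows shaped like those returned by pdfplumber's extract_tables(), where empty cells may be None; the function name table_to_markdown is hypothetical and the library's MarkdownConverter may format tables differently.

```python
def table_to_markdown(rows):
    """Render a list of rows (first row = header) as a Markdown table."""
    width = max((len(r) for r in rows), default=0)
    if width == 0:
        return ""

    def clean(cell):
        # None marks an empty cell; newlines and pipes would break the row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    # Clean every cell and pad short rows so all rows have the same width.
    norm = [[clean(c) for c in r] + [""] * (width - len(r)) for r in rows]
    header, *body = norm

    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(["---"] * width) + "|",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


rows = [["Name", "Qty"], ["Bolt\nM4", None], ["Nut", 12]]
print(table_to_markdown(rows))
```

Escaping pipe characters and flattening newlines inside cells keeps each table row on a single line, which Markdown tables require.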
Error Handling
The library includes comprehensive error handling:
- Validates table boundaries
- Checks for invalid content
- Provides detailed error messages
- Includes stack traces for debugging
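On the caller's side, a conversion can be wrapped so that a failure on one file is logged with its stack trace instead of stopping a batch. The exception types raised by the library are not documented here, so this sketch catches Exception broadly; convert_or_none is a hypothetical helper that accepts any callable, for example converter.convert from the usage example.

```python
import logging

logger = logging.getLogger("pdf_conversion")


def convert_or_none(convert, pdf_path):
    """Call convert(pdf_path); on failure, log the stack trace and return None."""
    try:
        return convert(pdf_path)
    except Exception:
        # logger.exception records the message together with the full traceback.
        logger.exception("Conversion failed for %s", pdf_path)
        return None
```

A batch loop can then skip files for which the result is None and continue with the rest.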
Contributing
Contributions are welcome! Please feel free to submit pull requests.
- Fork the repository
- Create your feature branch (git checkout -b feature/new-feature)
- Commit your changes (git commit -am 'Add new feature')
- Push to the branch (git push origin feature/new-feature)
- Create a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
HawkClaws
Acknowledgements
- PDFPlumber team for the excellent PDF parsing library
- Contributors and maintainers
File details
Details for the file pdf2markdown4llm-0.1.7.tar.gz.
File metadata
- Download URL: pdf2markdown4llm-0.1.7.tar.gz
- Upload date:
- Size: 12.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ad743810881218ebd792dcd1dbfd07cf70ea9bda8138a8f877614c2a6d01ceea |
| MD5 | 4caf74c3462feb4567ef325a85a813d7 |
| BLAKE2b-256 | fe830d8e371cf36bd12c79cda9103eb431ef2a4d119bec0ef9f8f9546197b59b |
File details
Details for the file pdf2markdown4llm-0.1.7-py3-none-any.whl.
File metadata
- Download URL: pdf2markdown4llm-0.1.7-py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6da3e22b1071a5477fb18a2cfbb2500efe581bdb8592b5d44d051f20bb35115c |
| MD5 | c502091e2600a455ec8f9d2adbc0b51d |
| BLAKE2b-256 | 6759b6177d487d8bd85d2180ab82a8da5eeaa93feb9655911817c5df20fc3b67 |