A library to convert PDF files to Markdown format.

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

PDF2Markdown4LLM

PDF2Markdown4LLM is a Python library that converts PDF documents to Markdown format, specifically optimized for Large Language Models (LLMs). It intelligently preserves document structure, identifies headers based on font sizes, and handles tables while maintaining the original document flow.

Features

- Intelligent header detection based on font size analysis
- Table extraction and conversion to Markdown format
- Maintains document structure and flow
- Handles nested content and complex layouts
- Font size classification for consistent header levels
- Clean table formatting with proper alignment
- Optional header removal functionality
- Robust error handling and validation

Installation

pip install pdf2markdown4llm

Dependencies

- pdfplumber (third-party package)
- dataclasses, typing, collections, re (Python standard library; dataclasses requires Python 3.7+)

Usage

Basic usage example:

from pdf2markdown4llm import PDF2Markdown4LLM

def progress_callback(progress):
    """Print the current phase, page count, percentage, and message."""
    print(
        f"Phase: {progress.phase.value}, "
        f"Page {progress.current_page}/{progress.total_pages}, "
        f"Progress: {progress.percentage:.1f}%, "
        f"Message: {progress.message}"
    )


# Initialize converter
converter = PDF2Markdown4LLM(
    remove_headers=False,
    skip_empty_tables=True,
    table_header="### Table",
    progress_callback=progress_callback,
)


# Convert PDF to Markdown
markdown_content = converter.convert("input.pdf")

# Save to file
with open("output.md", "w", encoding="utf-8") as md_file:
    md_file.write(markdown_content)

Configuration Options

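- remove_headers: Boolean flag to remove existing Markdown headers from the extracted text (default: False)
- table_header: String used as the header for tables, which also sets their header level (default: "###"); the usage example passes "### Table"
- skip_empty_tables: Boolean flag passed as True in the usage example; presumably it skips tables with no content. Its default is not documented here.
- progress_callback: Callable invoked with a progress object during conversion; the usage example reads its phase, current_page, total_pages, percentage, and message attributes.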
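
To make the effect of header removal concrete, here is a minimal sketch that strips leading `#` markers from extracted text so that stray hash signs in a PDF are not read as Markdown headers. The helper name `strip_markdown_headers` is hypothetical and this is not the library's own code; the description above does not say whether `remove_headers` drops only the markers or the whole line, so the sketch assumes the former.

```python
import re

# ATX-style header prefix: up to 3 spaces of indent, 1-6 '#' characters,
# then at least one space or tab.
_HEADER_PREFIX = re.compile(r"^ {0,3}#{1,6}[ \t]+", flags=re.MULTILINE)


def strip_markdown_headers(text: str) -> str:
    """Remove leading '#' markers from every line that starts with them.

    Hypothetical helper for illustration; not part of pdf2markdown4llm.
    """
    return _HEADER_PREFIX.sub("", text)


sample = "# Title\nBody text with a #hashtag\n### Section"
print(strip_markdown_headers(sample))
```

Only markers at the start of a line are touched, so a `#` inside running text is left alone.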

Key Components

FontSizeClassifier

Analyzes font sizes throughout the document to determine appropriate header levels:

- Automatically identifies the normal text size
- Classifies larger fonts into appropriate header levels
- Handles font size variations and inconsistencies
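
The idea behind font-size classification can be sketched in a few lines. This is an illustrative heuristic, not the library's actual FontSizeClassifier: the function name `classify_font_sizes`, the rounding to one decimal place, and the `max_levels` cap are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def classify_font_sizes(
    sizes: Iterable[float], max_levels: int = 3
) -> Dict[float, int]:
    """Map each font size to a header level (1 = largest); 0 means body text.

    Illustrative heuristic only: the most frequent size is treated as body
    text, and the distinct larger sizes are ranked from largest to smallest.
    """
    rounded = [round(s, 1) for s in sizes]  # absorb tiny size variations
    if not rounded:
        return {}
    body_size = Counter(rounded).most_common(1)[0][0]
    larger = sorted({s for s in rounded if s > body_size}, reverse=True)
    levels = {s: 0 for s in set(rounded)}  # body text and anything smaller
    for rank, size in enumerate(larger, start=1):
        levels[size] = min(rank, max_levels)  # clamp deep levels
    return levels


# One size per character or word, as a PDF parser might report them.
sizes = [24.0] + [16.0] * 3 + [10.02] * 50 + [8.0] * 4
print(classify_font_sizes(sizes))
```

Treating the most common size as body text works well for documents dominated by running prose; a document made up mostly of headings or captions would need a different baseline.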

PDFContentExtractor

Extracts content while preserving structure:

- Processes text and tables separately
- Maintains original document flow
- Validates table boundaries
- Handles nested content
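
One way to process text and tables separately while keeping the original flow is to drop words that fall inside a table's bounding box and then sort everything by vertical position. The sketch below works on plain dictionaries and tuples shaped like the word dictionaries and `(x0, top, x1, bottom)` boxes that pdfplumber reports; the function names are hypothetical and this is not the library's PDFContentExtractor.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def inside(word: Dict, bbox: BBox) -> bool:
    """True if the word's centre point lies within the bounding box."""
    x0, top, x1, bottom = bbox
    cx = (word["x0"] + word["x1"]) / 2
    cy = (word["top"] + word["bottom"]) / 2
    return x0 <= cx <= x1 and top <= cy <= bottom


def merge_in_reading_order(
    words: List[Dict], table_boxes: List[BBox]
) -> List[Tuple[str, object]]:
    """Return ("text", word) and ("table", index) items sorted top to bottom."""
    # Keep only words that are not part of any table.
    items = [
        (w["top"], ("text", w["text"]))
        for w in words
        if not any(inside(w, box) for box in table_boxes)
    ]
    # Each table is placed at the top edge of its bounding box.
    items += [(box[1], ("table", i)) for i, box in enumerate(table_boxes)]
    items.sort(key=lambda pair: pair[0])
    return [item for _, item in items]


words = [
    {"text": "Intro", "x0": 10, "x1": 40, "top": 20, "bottom": 30},
    {"text": "cell", "x0": 15, "x1": 30, "top": 60, "bottom": 70},
    {"text": "Outro", "x0": 10, "x1": 40, "top": 120, "bottom": 130},
]
print(merge_in_reading_order(words, [(10, 50, 200, 100)]))
```

Sorting by the top coordinate alone is enough for single-column pages; multi-column layouts would also need to take horizontal position into account.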

MarkdownConverter

Converts extracted content to clean Markdown:

- Proper table formatting
- Header level preservation
- Clean text formatting
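
As a rough illustration of table formatting, the following sketch turns a list of rows (the shape returned by typical PDF table extractors, where a cell may be `None`) into a Markdown pipe table. The function `table_to_markdown` is a hypothetical name and a simplified stand-in, not the library's MarkdownConverter.

```python
from typing import List, Optional


def table_to_markdown(rows: List[List[Optional[str]]]) -> str:
    """Render rows (first row = header) as a Markdown pipe table."""
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells become "", newlines become spaces, pipes are escaped.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of columns.
    grid = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    lines = ["| " + " | ".join(grid[0]) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in grid[1:]]
    return "\n".join(lines)


print(table_to_markdown([["Name", "Qty"], ["Apple", "3"], ["Pear", None]]))
```

Escaping `|` and flattening newlines matters because either character inside a cell would otherwise break the row structure.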

Error Handling

The library includes comprehensive error handling:

- Validates table boundaries
- Checks for invalid content
- Provides detailed error messages
- Includes stack traces for debugging
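
A boundary check of the kind listed above might look like the sketch below. The exception class `InvalidTableBoundsError` and the function `validate_table_bbox` are hypothetical names invented for this example; the library's real checks and error types are not documented here.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


class InvalidTableBoundsError(ValueError):
    """Hypothetical exception type used only in this sketch."""


def validate_table_bbox(
    bbox: BBox, page_width: float, page_height: float
) -> BBox:
    """Return bbox unchanged if it is well-formed and lies on the page."""
    x0, top, x1, bottom = bbox
    if x0 >= x1 or top >= bottom:
        raise InvalidTableBoundsError(f"Degenerate table box: {bbox}")
    if x0 < 0 or top < 0 or x1 > page_width or bottom > page_height:
        raise InvalidTableBoundsError(
            f"Table box {bbox} falls outside page {page_width}x{page_height}"
        )
    return bbox


try:
    # The bottom edge (900) exceeds the page height (842), so this raises.
    validate_table_bbox((10, 700, 400, 900), page_width=595, page_height=842)
except InvalidTableBoundsError as exc:
    print(f"Skipping table: {exc}")
```

Raising a dedicated exception with the offending coordinates in the message lets a caller skip one bad table without aborting the whole conversion.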

Contributing

Contributions are welcome! Please feel free to submit pull requests.

1. Fork the repository
2. Create your feature branch (git checkout -b feature/new-feature)
3. Commit your changes (git commit -am 'Add new feature')
4. Push to the branch (git push origin feature/new-feature)
5. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

HawkClaws

Acknowledgements

- pdfplumber team for the excellent PDF parsing library
- Contributors and maintainers

Release history

- 0.1.9: Jan 21, 2025
- 0.1.8: Jan 21, 2025
- 0.1.7: Jan 21, 2025
- 0.1.6: Jan 17, 2025 (this version)
- 0.1.5: Jan 14, 2025
- 0.1.4: Jan 14, 2025
- 0.1.3: Jan 14, 2025
- 0.1.2: Jan 14, 2025
- 0.1.1: Jan 14, 2025
- 0.1.0: Jan 12, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdf2markdown4llm-0.1.6.tar.gz (10.9 kB)

Uploaded Jan 17, 2025 Source

Built Distribution

pdf2markdown4llm-0.1.6-py3-none-any.whl (10.3 kB)

Uploaded Jan 17, 2025 Python 3

File details

Details for the file pdf2markdown4llm-0.1.6.tar.gz.

File metadata

Download URL: pdf2markdown4llm-0.1.6.tar.gz
Upload date: Jan 17, 2025
Size: 10.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for pdf2markdown4llm-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`19b5bdc6725e565b9fdb20e76dc2bc436f5a77faa93766d21c2f02b6f81e9446`
MD5	`f1c08f8078fd2723229365d09fc8b2ef`
BLAKE2b-256	`e322aa9ca4879dbefc73229099e9a694e8f0fd9870ebdf433fef8f9c4535851f`
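
A downloaded archive can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch: the helper name `sha256_of` is made up for the example, and the file path assumes the archive sits in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the table above for the 0.1.6 source distribution.
EXPECTED = "19b5bdc6725e565b9fdb20e76dc2bc436f5a77faa93766d21c2f02b6f81e9446"
archive = Path("pdf2markdown4llm-0.1.6.tar.gz")  # assumed download location

if archive.exists():
    print("OK" if sha256_of(archive) == EXPECTED else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

The same function works for the wheel file; only the expected digest and the file name change.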

File details

Details for the file pdf2markdown4llm-0.1.6-py3-none-any.whl.

File metadata

Download URL: pdf2markdown4llm-0.1.6-py3-none-any.whl
Upload date: Jan 17, 2025
Size: 10.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for pdf2markdown4llm-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3ae2a15784b8436d7886fb80ade0b8d0f7bfe84482d425fd6ec336907e2c1e51`
MD5	`678bb71a6ed1beb3cde489edac0303fe`
BLAKE2b-256	`871171cae053bce8f7c763087da63e1680eaa98dfaa8161a391a362bc3c7d4bf`
