A library to convert PDF files to Markdown format.

PDF2Markdown4LLM

PDF2Markdown4LLM is a Python library that converts PDF documents to Markdown, with output intended for use with Large Language Models (LLMs). It preserves document structure, identifies headers based on font sizes, and converts tables to Markdown while maintaining the original document flow.

Features

  • Intelligent header detection based on font size analysis
  • Table extraction and conversion to Markdown format
  • Maintains document structure and flow
  • Handles nested content and complex layouts
  • Font size classification for consistent header levels
  • Clean table formatting with proper alignment
  • Optional header removal functionality
  • Robust error handling and validation

Installation

pip install pdf2markdown4llm

Dependencies

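  • pdfplumber (the only third-party dependency)

The library also uses the standard-library modules dataclasses (Python 3.7+), typing, collections, and re, which need no separate installation.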

Usage

Basic usage example:

from pdf2markdown4llm import PDF2Markdown4LLM

def progress_callback(progress):
    """Callback function to handle progress"""
    print(
        f"Phase: {progress.phase.value}, "
        f"Page {progress.current_page}/{progress.total_pages}, "
        f"Progress: {progress.percentage:.1f}%, "
        f"Message: {progress.message}"
    )


# Initialize converter
converter = PDF2Markdown4LLM(
    remove_headers=False,
    skip_empty_tables=True,
    table_header="### Table",
    progress_callback=progress_callback,
)


# Convert PDF to Markdown
markdown_content = converter.convert("input.pdf")

# Save to file
with open("output.md", "w", encoding="utf-8") as md_file:
    md_file.write(markdown_content)
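
To try a progress callback without converting a PDF, it can be fed a stand-in object. The sketch below is only a mock: `FakePhase`, `FakeProgress`, and `format_progress` are hypothetical names, the phase value is invented, and only the attributes read by the callback above (`phase.value`, `current_page`, `total_pages`, `percentage`, `message`) are modeled. The library's real progress type may differ.

```python
from dataclasses import dataclass
from enum import Enum


class FakePhase(Enum):
    """Hypothetical stand-in for the library's phase enum."""
    CONVERSION = "conversion"  # invented value, for illustration only


@dataclass
class FakeProgress:
    """Hypothetical stand-in exposing only the fields the callback reads."""
    phase: FakePhase
    current_page: int
    total_pages: int
    message: str = ""

    @property
    def percentage(self) -> float:
        # Guard against division by zero for empty documents.
        if self.total_pages == 0:
            return 0.0
        return 100.0 * self.current_page / self.total_pages


def format_progress(progress) -> str:
    """Build the same status line that the usage example prints."""
    return (
        f"Phase: {progress.phase.value}, "
        f"Page {progress.current_page}/{progress.total_pages}, "
        f"Progress: {progress.percentage:.1f}%, "
        f"Message: {progress.message}"
    )


if __name__ == "__main__":
    print(format_progress(FakeProgress(FakePhase.CONVERSION, 1, 4, "ok")))
```

Returning the string from a helper, rather than printing directly, makes the formatting easy to check in a unit test.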

Configuration Options

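  • remove_headers: Boolean flag to remove existing Markdown headers from the extracted text (default: False)
  • table_header: String that sets the header for tables (default: "###"); the usage example passes "### Table", i.e. a header level followed by a title
  • skip_empty_tables: Boolean flag, passed as True in the usage example; judging by its name, it skips tables that have no content
  • progress_callback: Function called with a progress object during conversion, as in the usage example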
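
What removing existing Markdown headers could look like is sketched below. This is an assumption about the behavior, not the library's code: `strip_markdown_headers` is a hypothetical helper that drops leading `#` markers from extracted text and keeps the text itself, so that stray markers are not mistaken for real headings.

```python
import re

# Up to three leading spaces, one to six '#' characters, then whitespace.
_HEADER_PREFIX = re.compile(r"^[ \t]{0,3}#{1,6}[ \t]+", flags=re.MULTILINE)


def strip_markdown_headers(text: str) -> str:
    """Remove Markdown header markers while keeping the header text."""
    return _HEADER_PREFIX.sub("", text)


if __name__ == "__main__":
    print(strip_markdown_headers("# Title\nBody #1\n## Sub"))
```

A `#` that is not at the start of a line, or that is not followed by whitespace, is left alone.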

Key Components

FontSizeClassifier

Analyzes font sizes throughout the document to determine appropriate header levels:

  • Automatically identifies the normal text size
  • Classifies larger fonts into appropriate header levels
  • Handles font size variations and inconsistencies
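
The idea behind the points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `classify_font_sizes` is a hypothetical function, the most frequent size is assumed to be normal text, and sizes are rounded to one decimal place to absorb small variations.

```python
from collections import Counter
from typing import Dict, Iterable


def classify_font_sizes(sizes: Iterable[float], max_levels: int = 6) -> Dict[float, int]:
    """Map each font size to a header level (1 = largest), or 0 for body text.

    The most frequent size is treated as normal text. Every size at or
    below it is body text; larger sizes are ranked from largest to smallest.
    """
    rounded = [round(s, 1) for s in sizes]  # absorb tiny float variations
    if not rounded:
        return {}
    normal = Counter(rounded).most_common(1)[0][0]
    larger = sorted({s for s in rounded if s > normal}, reverse=True)
    levels = {s: 0 for s in set(rounded) if s <= normal}
    for rank, size in enumerate(larger, start=1):
        # Markdown has six header levels, so deeper ranks share the last one.
        levels[size] = min(rank, max_levels)
    return levels


if __name__ == "__main__":
    sample = [10.0] * 50 + [18.0] * 2 + [14.0] * 5 + [8.0] * 3
    print(classify_font_sizes(sample))
```

Ranking sizes relative to the body text, rather than using fixed point thresholds, keeps header levels consistent across documents set in different base sizes.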

PDFContentExtractor

Extracts content while preserving structure:

  • Processes text and tables separately
  • Maintains original document flow
  • Validates table boundaries
  • Handles nested content
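
One way to keep text and tables in their original order is to sort both by vertical position and skip text that falls inside a table's box. The sketch below assumes that approach; `TextLine`, `Table`, and `merge_in_reading_order` are hypothetical names, and boxes use an `(x0, top, x1, bottom)` layout with the origin at the top-left of the page.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom), top-left origin


@dataclass
class TextLine:
    top: float
    text: str


@dataclass
class Table:
    bbox: BBox
    rows: List[List[str]]


def merge_in_reading_order(
    lines: List[TextLine], tables: List[Table]
) -> List[Union[TextLine, Table]]:
    """Interleave text lines and tables top to bottom.

    Text whose vertical position lies inside a table box is dropped,
    because the table's rows already carry that content.
    """
    def inside_table(line: TextLine) -> bool:
        return any(t.bbox[1] <= line.top <= t.bbox[3] for t in tables)

    items: List[Union[TextLine, Table]] = [l for l in lines if not inside_table(l)]
    items.extend(tables)
    return sorted(
        items,
        key=lambda i: i.top if isinstance(i, TextLine) else i.bbox[1],
    )


if __name__ == "__main__":
    lines = [TextLine(10, "Intro"), TextLine(55, "cell text"), TextLine(120, "Outro")]
    tables = [Table((0, 50, 100, 100), [["a", "b"]])]
    for item in merge_in_reading_order(lines, tables):
        print(item)
```

This version compares vertical positions only; a multi-column layout would also need the horizontal extent of each line.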

MarkdownConverter

Converts extracted content to clean Markdown:

  • Proper table formatting
  • Header level preservation
  • Clean text formatting
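
As an illustration of the table formatting step, the sketch below renders a list of rows as a Markdown pipe table. `table_to_markdown` is a hypothetical helper, not the library's API; it assumes the first row is the header and that empty cells may arrive as `None`.

```python
from typing import List, Optional


def table_to_markdown(rows: List[List[Optional[str]]]) -> str:
    """Render rows as a Markdown pipe table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def clean(cell: Optional[str]) -> str:
        # Newlines and unescaped pipes inside a cell would break the row.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    # Pad short rows so every row has the same number of columns.
    padded = [[clean(c) for c in r] + [""] * (width - len(r)) for r in rows]
    lines = ["| " + " | ".join(r) + " |" for r in padded]
    # The separator row after the header is what makes it a table.
    lines.insert(1, "| " + " | ".join(["---"] * width) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(table_to_markdown([["Name", "Qty"], ["Apple\nred", None]]))
```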

Error Handling

The library includes comprehensive error handling:

  • Validates table boundaries
  • Checks for invalid content
  • Provides detailed error messages
  • Includes stack traces for debugging
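
A table boundary check of the kind listed above might look like the following sketch. The names `InvalidTableBoundsError` and `validate_bbox` are hypothetical, and the clamp-then-reject policy is an assumption rather than the library's documented behavior.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


class InvalidTableBoundsError(ValueError):
    """Hypothetical error type for a table box that cannot be used."""


def validate_bbox(bbox: BBox, page_width: float, page_height: float) -> BBox:
    """Clamp a table box to the page and reject boxes with no area."""
    x0, top, x1, bottom = bbox
    x0, x1 = max(0.0, x0), min(page_width, x1)
    top, bottom = max(0.0, top), min(page_height, bottom)
    if x0 >= x1 or top >= bottom:
        raise InvalidTableBoundsError(
            f"table box {bbox} has no area inside a {page_width}x{page_height} page"
        )
    return (x0, top, x1, bottom)


if __name__ == "__main__":
    print(validate_bbox((-5, 10, 700, 50), 612, 792))
```

Including the offending box and the page size in the message gives the caller enough detail to locate the problem table.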

Contributing

Contributions are welcome! Please feel free to submit pull requests.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/new-feature)
  3. Commit your changes (git commit -am 'Add new feature')
  4. Push to the branch (git push origin feature/new-feature)
  5. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

HawkClaws

Acknowledgements

  • PDFPlumber team for the excellent PDF parsing library
  • Contributors and maintainers
