A fast PDF to Markdown converter optimized for LLM processing

Maintainers

feiyue

Project description

🚀 fastpdf4llm: PDF to LLM-Ready Markdown in Seconds

A fast and efficient PDF to Markdown converter optimized for LLM (Large Language Model) processing. This tool intelligently extracts text, tables, and images from PDF files and converts them into well-structured Markdown format.

Features

🚀 Fast Processing: Efficient PDF parsing and conversion
📊 Table Extraction: Automatically detects and converts tables to Markdown format
🖼️ Image Support: Extracts and saves images from PDFs (can be disabled for text-only processing)
📝 Smart Formatting: Intelligently identifies headings based on font sizes
📈 Progress Tracking: Built-in progress callback support
🎯 LLM Optimized: Output format optimized for LLM consumption
📜 Free & Open Source: MIT licensed, free to use for commercial and personal projects

Examples

See the examples/ directory for more usage examples:

financial_report_cn/: Converting financial reports with tables and images
- Example output: 平安财报2016.md
table_data/: Converting PDFs with complex tables
- Example output: national-capitals.md
car_user_manual/: Converting car user manuals with extensive images and structured content
- Example output: tesla_model3_user_manual.pdf.md

Installation

Using Poetry (Recommended)

poetry add fastpdf4llm

Using pip

pip install fastpdf4llm

Quick Start

Basic Usage

from fastpdf4llm import to_markdown

# Convert PDF to Markdown
markdown_content = to_markdown("path/to/your/document.pdf")

# Save to file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)

With Different Input Types

The to_markdown function accepts multiple input types:

from fastpdf4llm import to_markdown
from pathlib import Path
from io import BytesIO

# String path (traditional)
markdown_content = to_markdown("path/to/document.pdf")

# Pathlib Path object
markdown_content = to_markdown(Path("path/to/document.pdf"))

# File object
with open("path/to/document.pdf", "rb") as f:
    markdown_content = to_markdown(f)

# BytesIO (in-memory PDF)
with open("path/to/document.pdf", "rb") as f:
    pdf_bytes = f.read()

bytes_io = BytesIO(pdf_bytes)
markdown_content = to_markdown(bytes_io)

With Custom Image Directory

from fastpdf4llm import to_markdown

# Specify custom directory for extracted images
markdown_content = to_markdown(
    "path/to/your/document.pdf",
    image_dir="./images"
)

With Progress Callback

from fastpdf4llm import to_markdown, ProgressInfo

def progress_callback(progress: ProgressInfo):
    print(f"{progress.phase.value}: {progress.current_page}/{progress.total_pages} "
          f"({progress.percentage:.1f}%) - {progress.message}")

markdown_content = to_markdown(
    "path/to/your/document.pdf",
    progress_callback=progress_callback
)

With Custom Parse Options

from fastpdf4llm import to_markdown
from fastpdf4llm.models.parse_options import ParseOptions

# Customize parsing options for better text extraction
parse_options = ParseOptions(
    x_tolerance_ratio=0.15,  # Ratio of x_tolerance to page width (default: 0.15)
    y_tolerance=3            # Control spacing between lines (default: 3)
)

markdown_content = to_markdown(
    "path/to/your/document.pdf",
    parse_options=parse_options
)

No Image Mode

from fastpdf4llm import to_markdown

# Disable image extraction for faster processing
# Useful when you only need text content
markdown_content = to_markdown(
    "path/to/your/document.pdf",
    extract_images=False
)

Combined Usage

from fastpdf4llm import to_markdown, ProgressInfo
from fastpdf4llm.models.parse_options import ParseOptions

def progress_callback(progress: ProgressInfo):
    print(f"Progress: {progress.percentage:.1f}%")

parse_options = ParseOptions(x_tolerance_ratio=0.25, y_tolerance=5)

markdown_content = to_markdown(
    "path/to/your/document.pdf",
    image_dir="./images",
    extract_images=True,
    parse_options=parse_options,
    progress_callback=progress_callback
)
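
To convert a whole folder with the same settings, a call like the one above can be wrapped in a small loop. The sketch below is not part of fastpdf4llm: `convert_directory` and its `convert` parameter are hypothetical names chosen for illustration. The converter is passed in as a callable, so the loop only assumes what the examples above show, namely that the converter accepts a path and returns a Markdown string.

```python
from pathlib import Path
from typing import Callable, List


def convert_directory(
    src_dir: Path,
    out_dir: Path,
    convert: Callable[[Path], str],
) -> List[Path]:
    """Convert every *.pdf in src_dir and write one .md file per PDF to out_dir.

    `convert` is any callable that takes a PDF path and returns Markdown text
    (hypothetical helper, not a fastpdf4llm API).
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Sort so the processing order is stable across runs.
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        markdown = convert(pdf_path)
        target = out_dir / f"{pdf_path.stem}.md"
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

With the library installed, passing `to_markdown` as `convert` would apply the default options, and a wrapper such as `lambda p: to_markdown(p, extract_images=False)` would give text-only output.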

API Reference

`to_markdown`

Convert a PDF file to Markdown format.

Parameters:

path_or_fp (Union[str, pathlib.Path, BufferedReader, BytesIO]): PDF file path, Path object, file handle, or BytesIO object
image_dir (Optional[str]): Directory to save extracted images. Defaults to ./tmp/images/. Only used when extract_images=True
extract_images (bool): Whether to extract and save images from PDF. Default: True
- Set to False to skip image extraction for faster processing
- When False, images are ignored and not included in the markdown output
parse_options (Optional[ParseOptions]): Parsing options to control text extraction. Defaults to ParseOptions(x_tolerance_ratio=0.15, y_tolerance=3)
progress_callback (Optional[Callable[[ProgressInfo], None]]): Callback function for progress updates

Returns:

str: Markdown content of the PDF

Example:

from fastpdf4llm import to_markdown, ProgressInfo

def on_progress(progress: ProgressInfo):
    print(f"Progress: {progress.percentage:.1f}%")

content = to_markdown(
    "document.pdf",
    image_dir="./output_images",
    extract_images=True,
    progress_callback=on_progress
)

# No image mode example
content_text_only = to_markdown(
    "document.pdf",
    extract_images=False
)

`ParseOptions`

Parsing options to customize PDF text extraction behavior.

Attributes:

x_tolerance_ratio (float): Horizontal spacing tolerance between words, expressed as the ratio of x_tolerance to page width. Default: 0.15
- Lower values (e.g., 0.1): Stricter word separation (better for well-formatted PDFs)
- Higher values (e.g., 0.25): More lenient word grouping (better for PDFs with irregular spacing)
y_tolerance (float): Vertical spacing tolerance between lines. Default: 3
- Lower values: Stricter line separation
- Higher values: More lenient line grouping

Example:

from fastpdf4llm import to_markdown
from fastpdf4llm.models.parse_options import ParseOptions

# For PDFs with tight spacing
tight_options = ParseOptions(x_tolerance_ratio=0.1, y_tolerance=1)

# For PDFs with loose spacing
loose_options = ParseOptions(x_tolerance_ratio=0.25, y_tolerance=5)

markdown_content = to_markdown("document.pdf", parse_options=tight_options)

`ProgressInfo`

Progress information model for tracking conversion progress.

Attributes:

phase (ProcessPhase): Current processing phase (ANALYSIS or CONVERSION)
current_page (int): Current page being processed
total_pages (int): Total number of pages in the PDF
percentage (float): Overall progress percentage (0-100)
message (str): Status message
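
The fields listed above are enough to render a simple text progress bar. The following is a minimal sketch under stated assumptions: `FakeProgress` is a hypothetical stand-in that mirrors only the documented attribute names (the `phase` field is left out), and `format_progress` is an illustrative helper, not something the library provides.

```python
from dataclasses import dataclass


@dataclass
class FakeProgress:
    """Hypothetical stand-in mirroring some documented ProgressInfo fields."""
    current_page: int
    total_pages: int
    percentage: float
    message: str


def format_progress(progress, width: int = 20) -> str:
    """Render a fixed-width text bar from a ProgressInfo-like object."""
    # Clamp so an out-of-range percentage cannot change the bar width.
    fraction = min(max(progress.percentage / 100.0, 0.0), 1.0)
    filled = int(fraction * width)
    bar = "#" * filled + "-" * (width - filled)
    return (
        f"[{bar}] {progress.percentage:5.1f}% "
        f"page {progress.current_page}/{progress.total_pages} {progress.message}"
    )


print(format_progress(FakeProgress(5, 20, 25.0, "Converting")))
```

Because the helper only reads attributes, it could be used as `progress_callback=lambda p: print(format_progress(p))` with the real `ProgressInfo` objects.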

How It Works

Analysis Phase: Analyzes the PDF to identify font sizes and determine heading hierarchy
Conversion Phase: Extracts text, tables, and images, converting them to Markdown format
Smart Formatting: Automatically detects headings based on font size analysis
Table Detection: Identifies and converts tables to Markdown table format
Image Extraction: Extracts images and saves them to the specified directory
Configurable Parsing: Adjustable tolerance settings for optimal text extraction from various PDF layouts
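
The font-size-based heading detection described above can be illustrated with a short sketch. This is not the library's actual implementation: the function names, the `(font_size, character_count)` input shape, and the rule "the size carrying the most characters is body text" are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def heading_levels(
    spans: Iterable[Tuple[float, int]],
    max_levels: int = 3,
) -> Dict[float, int]:
    """Map font sizes to Markdown heading levels (1 = largest size).

    `spans` yields (font_size, character_count) pairs. The size that carries
    the most characters is treated as body text; larger sizes become headings.
    """
    chars_per_size = Counter()
    for size, n_chars in spans:
        chars_per_size[round(size, 1)] += n_chars
    if not chars_per_size:
        return {}
    body_size = chars_per_size.most_common(1)[0][0]
    # Largest size first, so it receives heading level 1.
    larger = sorted((s for s in chars_per_size if s > body_size), reverse=True)
    return {size: level for level, size in enumerate(larger[:max_levels], start=1)}


def to_markdown_line(text: str, size: float, levels: Dict[float, int]) -> str:
    """Prefix text with '#' markers if its font size maps to a heading level."""
    level = levels.get(round(size, 1))
    return f"{'#' * level} {text}" if level else text


# Analysis pass: collect sizes. Conversion pass: emit lines.
levels = heading_levels([(24.0, 30), (16.0, 120), (10.5, 5000), (8.0, 200)])
print(to_markdown_line("Introduction", 16.0, levels))
print(to_markdown_line("Plain paragraph text.", 10.5, levels))
```

Splitting the work into a pass that collects sizes and a pass that emits lines mirrors the two phases listed above, since heading levels can only be assigned once every size in the document has been seen.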

Requirements

Python >= 3.9
pdfplumber >= 0.11.3
loguru >= 0.7.0
pydantic >= 2.0.0

Development

Setup

# Clone the repository
git clone https://github.com/moria97/fastpdf4llm.git
cd fastpdf4llm

# Install dependencies
poetry install

# Install pre-commit hooks (runs automatically on git commit)
poetry run pre-commit install

Running Tests

poetry run pytest

Code Formatting

# Format code
poetry run ruff format .

# Lint code
poetry run ruff check .

Acknowledgements

This project is inspired by the pdf2markdown4llm project by HawkClaws. We appreciate their work on PDF to Markdown conversion for LLM applications.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Repository

https://github.com/moria97/fastpdf4llm

Release history

0.1.5 (this version): Jan 13, 2026
0.1.4: Nov 20, 2025
0.1.3: Nov 14, 2025
0.1.2: Nov 13, 2025

Download files

Source Distribution

fastpdf4llm-0.1.5.tar.gz (18.0 kB)

Uploaded Jan 13, 2026 Source

Built Distribution

fastpdf4llm-0.1.5-py3-none-any.whl (19.7 kB)

Uploaded Jan 13, 2026 Python 3

File details

Details for the file fastpdf4llm-0.1.5.tar.gz.

File metadata

Download URL: fastpdf4llm-0.1.5.tar.gz
Upload date: Jan 13, 2026
Size: 18.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastpdf4llm-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`39163c412009e2366da4dac65cd25d31da99d26de2c09e1bc8c00e0304ac2362`
MD5	`c98c28ab17a47181e3180b1e46ee985b`
BLAKE2b-256	`d9f2ebc13a9691e6a29f6b2a3b8c56685b786d29f367537b09c7b131b67b8a23`
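
A downloaded file can be checked against the SHA256 digest listed above using only the Python standard library. The helper names below (`sha256_of`, `matches_published_hash`) are hypothetical and chosen for this sketch.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, `matches_published_hash(Path("fastpdf4llm-0.1.5.tar.gz"), "<SHA256 from the table>")` should return `True` if the download is intact.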

Provenance

The following attestation bundles were made for fastpdf4llm-0.1.5.tar.gz:

Publisher: publish_pypi.yml on moria97/fastpdf4llm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fastpdf4llm-0.1.5.tar.gz
- Subject digest: 39163c412009e2366da4dac65cd25d31da99d26de2c09e1bc8c00e0304ac2362
- Sigstore transparency entry: 816358558
- Sigstore integration time: Jan 13, 2026
Source repository:
- Permalink: moria97/fastpdf4llm@5873baf1cd949a4956b767161f76f3b548b3fc99
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/moria97
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_pypi.yml@5873baf1cd949a4956b767161f76f3b548b3fc99
- Trigger Event: workflow_dispatch

File details

Details for the file fastpdf4llm-0.1.5-py3-none-any.whl.

File metadata

Download URL: fastpdf4llm-0.1.5-py3-none-any.whl
Upload date: Jan 13, 2026
Size: 19.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fastpdf4llm-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6cad84cd25286abc1eece0206286e220c456bccb807e96d000a7d58b7e04f5cd`
MD5	`2274110fac5a99c63f81ce207efdcea3`
BLAKE2b-256	`71ec13966ea31d3b7c3793011a18d2b4a66c366f5d822ec1fba4a9af1d893466`

Provenance

The following attestation bundles were made for fastpdf4llm-0.1.5-py3-none-any.whl:

Publisher: publish_pypi.yml on moria97/fastpdf4llm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fastpdf4llm-0.1.5-py3-none-any.whl
- Subject digest: 6cad84cd25286abc1eece0206286e220c456bccb807e96d000a7d58b7e04f5cd
- Sigstore transparency entry: 816358659
- Sigstore integration time: Jan 13, 2026
Source repository:
- Permalink: moria97/fastpdf4llm@5873baf1cd949a4956b767161f76f3b548b3fc99
- Branch / Tag: refs/tags/v0.1.5
- Owner: https://github.com/moria97
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_pypi.yml@5873baf1cd949a4956b767161f76f3b548b3fc99
- Trigger Event: workflow_dispatch
