
A converter from PDF to clean, page-wise Markdown.


pdf2md: PDF to Markdown Converter

pdf2md is an open-source Python library and CLI tool that converts PDF documents into clean, page-wise Markdown files. It uses GPU-accelerated OCR and layout detection models from the Hugging Face ecosystem, with fallbacks to the widely used EasyOCR and Tesseract.

[Image of a PDF document on the left being converted to a Markdown document on the right with images and text blocks]

Features

  • Multi-Backend OCR: Supports TrOCR (Hugging Face), EasyOCR, and Tesseract.
  • Layout-Aware: Uses layoutparser (optional) to intelligently detect text, image, and table blocks, with a heuristic fallback.
  • Resource-Aware: Explicitly manages GPU memory and resources to prevent leaks.
  • CLI & Library: Use it as a powerful command-line tool or integrate it into your Python projects.
  • Docker Support: A CPU-only Docker image is provided by default, with instructions for enabling GPU acceleration.
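
The multi-backend fallback listed above can be pictured with a small sketch. This is not pdf2md's actual selection logic. It assumes each backend is detectable through an importable module (`transformers` for TrOCR, `easyocr`, `pytesseract`), and the label `trocr` is a hypothetical name; only `easyocr` and `pytesseract` appear as `--backend` values in this README.

```python
from importlib.util import find_spec
from typing import Callable, Optional

# Preference order: (backend label, module that must be importable).
# "trocr" is an assumed label; the other two match the --backend values
# used in the CLI examples of this README.
BACKENDS = [
    ("trocr", "transformers"),
    ("easyocr", "easyocr"),
    ("pytesseract", "pytesseract"),
]


def is_installed(module: str) -> bool:
    """Return True if the module can be located without importing it."""
    return find_spec(module) is not None


def pick_backend(available: Callable[[str], bool] = is_installed) -> Optional[str]:
    """Return the first backend whose module is available, else None."""
    for label, module in BACKENDS:
        if available(module):
            return label
    return None


if __name__ == "__main__":
    print(pick_backend())
```

Passing the availability check as a parameter keeps the ordering logic testable without installing any OCR package.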

Quickstart (CLI)

  1. Create a virtual environment:
    python -m venv venv
    source venv/bin/activate
    
  2. Install the library and dependencies from the repository root:
    pip install ".[all]" # Installs all optional dependencies for full functionality
    
  3. Run a conversion:
    # Convert a scanned document using pytesseract (CPU-only)
    pdf2md --input documents/scanned_book.pdf --out output/ --backend pytesseract --layout heuristic
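
The same conversion can be driven from a Python script by calling the CLI. The sketch below uses only the flags shown in the command above (`--input`, `--out`, `--backend`, `--layout`). The helper names `build_command` and `convert` are illustrative and not part of pdf2md, and creating the output directory beforehand is a precaution rather than a documented requirement.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf: Path, out_dir: Path, backend: str = "pytesseract",
                  layout: str = "heuristic") -> List[str]:
    """Assemble the pdf2md CLI invocation from the flags shown in the Quickstart."""
    return [
        "pdf2md",
        "--input", str(pdf),
        "--out", str(out_dir),
        "--backend", backend,
        "--layout", layout,
    ]


def convert(pdf: Path, out_dir: Path, **options: str) -> None:
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    # Assumption: creating the directory up front is harmless even if the
    # CLI would create it on its own.
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(pdf, out_dir, **options), check=True)


# Example call (requires pdf2md to be installed and on PATH):
# convert(Path("documents/scanned_book.pdf"), Path("output"))
```

Building the argument list separately from running it avoids shell quoting issues and makes the command easy to inspect before execution.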
    

Docker Usage

A Dockerfile is provided for running pdf2md in a containerized environment. By default, it's configured for CPU-only mode.

  1. Build the CPU image:
    docker build -t pdf2md .
    
  2. Run the CLI (CPU-only):
    docker run --rm -v "$(pwd)":/data pdf2md --input /data/sample.pdf --out /data/output --backend easyocr
    

How to Enable GPU Support

To run pdf2md with GPU acceleration, build on a CUDA base image and install a CUDA-enabled PyTorch wheel.

  1. Modify Dockerfile: Uncomment the FROM nvidia/cuda:11.8.0-base-ubuntu22.04 and WORKDIR /app lines, and comment out the CPU base image.
  2. Update PyTorch Installation: Change the PyTorch install command to point to a CUDA-enabled wheel, for example: pip install torch==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118. Note: Check the official PyTorch website for the latest compatible wheel URL.
  3. Use docker-compose: The docker-compose.yml includes a commented-out runtime: nvidia line under the pdf2md service. Uncomment it to enable the NVIDIA container runtime.
  4. Build and run (GPU):
    # Build with new Dockerfile
    docker build -t pdf2md:gpu .
    # Run with docker-compose
    docker-compose up
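
After rebuilding, it can be worth confirming that PyTorch actually sees a GPU before starting a long conversion. The check below is a minimal sketch meant to be run inside the container. It relies on the standard `torch.cuda.is_available()` call, and the helper name `cuda_available` is illustrative rather than part of pdf2md.

```python
from importlib.util import find_spec


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    if find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch  # imported lazily so the check also works without PyTorch
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```

If this reports False in a GPU build, the usual suspects are a CPU-only PyTorch wheel or a container started without the NVIDIA runtime.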
    

Contributing & Testing

Running Tests

To run the unit and integration tests, install the test dependencies and then run pytest:

pip install ".[test]"
pytest
