A converter from PDF to clean, page-wise Markdown.
pdf2md: PDF to Markdown Converter
pdf2md is an open-source Python library and CLI tool for converting PDF documents into clean, page-wise Markdown files. It leverages modern, GPU-accelerated OCR and layout detection models from the Hugging Face ecosystem, with robust fallbacks to widely used tools like EasyOCR and Tesseract.
[Image of a PDF document on the left being converted to a Markdown document on the right with images and text blocks]
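The fallback behaviour described above can be sketched as a simple probe for installed packages. This is a minimal illustration, not pdf2md's actual selection logic: the `pick_backend` helper, the `trocr` backend name, and the preference order are assumptions, while `transformers`, `easyocr`, and `pytesseract` are the usual import names of the underlying packages.

```python
from importlib.util import find_spec

# Preference order mirrors the description: TrOCR first, then the fallbacks.
# Each entry is (backend name, Python module that must be importable).
# How pdf2md itself probes for backends is an assumption.
BACKEND_MODULES = [
    ("trocr", "transformers"),
    ("easyocr", "easyocr"),
    ("pytesseract", "pytesseract"),
]


def pick_backend(preferred=None, is_installed=None):
    """Return the first usable backend name, trying `preferred` first."""
    if is_installed is None:
        # Default probe: a module counts as installed if Python can locate it.
        is_installed = lambda module: find_spec(module) is not None
    # Stable sort: the preferred backend moves to the front, the rest keep order.
    order = sorted(BACKEND_MODULES, key=lambda item: item[0] != preferred)
    for name, module in order:
        if is_installed(module):
            return name
    raise RuntimeError("No OCR backend is installed")
```

Calling `pick_backend(preferred="easyocr")` returns `"easyocr"` when that package is importable and otherwise falls through to the next available backend.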
Features
- Multi-Backend OCR: Supports TrOCR (Hugging Face), EasyOCR, and Tesseract.
- Layout-Aware: Uses `layoutparser` (optional) to intelligently detect text, image, and table blocks, with a heuristic fallback.
- Resource-Aware: Explicitly manages GPU memory and resources to prevent leaks.
- CLI & Library: Use it as a powerful command-line tool or integrate it into your Python projects.
- Docker Support: CPU-only Docker image is provided by default, with clear instructions for GPU acceleration.
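As an illustration of the resource-aware point above, one common pattern is to scope loaded models to a context manager and release GPU memory on exit. This is a sketch under assumptions, not pdf2md's internal code: the `released_afterwards` helper and the idea of holding models in a dict are hypothetical, while `gc.collect()` and `torch.cuda.empty_cache()` are standard calls.

```python
import gc
from contextlib import contextmanager


@contextmanager
def released_afterwards(holder):
    """Yield `holder` (a dict of loaded models), then drop them and free GPU memory."""
    try:
        yield holder
    finally:
        holder.clear()  # drop references so the models become collectable
        gc.collect()    # collect them now rather than at some later point
        try:
            import torch
        except ImportError:
            torch = None  # CPU-only install without PyTorch: nothing more to do
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
```

The cleanup runs in the `finally` branch, so it also happens when a conversion raises partway through a document.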
Quickstart (CLI)
- Create a virtual environment:
python -m venv venv
source venv/bin/activate
- Install the library and dependencies:
pip install ".[all]" # Installs all optional dependencies for full functionality
- Run a conversion:
# Convert a scanned document using pytesseract (CPU-only)
pdf2md --input documents/scanned_book.pdf --out output/ --backend pytesseract --layout heuristic
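To convert a whole folder, the command above can be driven from a short script. The sketch below only uses the flags shown in the Quickstart (`--input`, `--out`, `--backend`, `--layout`); the helper names and the choice of one output sub-folder per PDF are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_command(pdf, out_dir, backend="pytesseract", layout="heuristic"):
    """Build the pdf2md invocation shown in the Quickstart for one PDF."""
    return [
        "pdf2md",
        "--input", str(pdf),
        "--out", str(out_dir),
        "--backend", backend,
        "--layout", layout,
    ]


def convert_folder(src_dir, out_root, run=subprocess.run):
    """Convert every PDF in `src_dir`; return the exit code per file name."""
    results = {}
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        # One output sub-folder per document (an assumption, to avoid name clashes).
        out_dir = Path(out_root) / pdf.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        completed = run(build_command(pdf, out_dir), check=False)
        results[pdf.name] = completed.returncode
    return results
```

For example, `convert_folder("documents", "output")` runs the CLI once per PDF and reports which files exited with a non-zero code.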
Docker Usage
A Dockerfile is provided for running pdf2md in a containerized environment. By default, it's configured for CPU-only mode.
- Build the CPU image:
docker build -t pdf2md .
- Run the CLI (CPU-only):
docker run --rm -v "$(pwd)":/data pdf2md --input /data/sample.pdf --out /data/output --backend easyocr
How to Enable GPU Support
To run pdf2md with GPU acceleration, you need to use a base image with CUDA and install the correct PyTorch wheel.
- Modify `Dockerfile`: Uncomment the `FROM nvidia/cuda:11.8.0-base-ubuntu22.04` and `WORKDIR /app` lines, and comment out the CPU base image.
- Update PyTorch Installation: Change the PyTorch install command to point to a CUDA-enabled wheel, for example:
pip install torch==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118
Note: Check the official PyTorch website for the latest compatible wheel URL.
- Use `docker-compose`: The `docker-compose.yml` is pre-configured to enable NVIDIA container runtime support. Uncomment the `runtime: nvidia` line under the `pdf2md` service.
- Build and run (GPU):
# Build with new Dockerfile
docker build -t pdf2md:gpu .
# Run with docker-compose
docker-compose up
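After rebuilding, it can help to confirm that PyTorch inside the container actually sees the GPU. The check below is a small sketch that is not part of pdf2md; `cuda_status` is a hypothetical helper built on the standard `torch.cuda.is_available()` and `torch.cuda.get_device_name()` calls.

```python
def cuda_status():
    """Describe whether PyTorch can see a CUDA device, without requiring torch."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        # Typical causes: a CPU-only wheel, or a container started without GPU access.
        return "cpu only"
    return f"cuda: {torch.cuda.get_device_name(0)}"
```

Printing `cuda_status()` from inside the running container shows whether the CUDA wheel and the NVIDIA runtime are both in effect.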
Contributing & Testing
Running Tests
To run the unit and integration tests, first install the test dependencies, then run pytest:
pip install ".[test]"
pytest
File details
Details for the file pdf2md_converter-0.1.0.tar.gz.
File metadata
- Download URL: pdf2md_converter-0.1.0.tar.gz
- Upload date:
- Size: 10.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c53b731da9eba74334489a89176e325497af791db69716bcbe678148f44eb975 |
| MD5 | 6beea4a59632db00c310779b59cf013e |
| BLAKE2b-256 | cab4715e3bb5dca1aee00e8b41e207aa47409638454f871695a2eadfc3cc173a |
File details
Details for the file pdf2md_converter-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pdf2md_converter-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1849bcb0dacbc2986e4edbde889a979aecca300bd92440f3ea983ed6ebd5e589 |
| MD5 | ed56ed7bb8c31fe64c3325bcf60a8f4f |
| BLAKE2b-256 | 7e2500b86437af38c18198bdb9148e18bb8a97a3314ae8d4f7a8be77952f5cff |
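A downloaded file can be compared against the published SHA256 digest with the standard library. This is a generic sketch; the helper names are illustrative and the file path in the usage note assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """True if the file's SHA256 equals the published digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the wheel, `matches_published("pdf2md_converter-0.1.0-py3-none-any.whl", "<SHA256 from the table>")` should return `True` for an intact download.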