Convert scanned PDF files to searchable text via Poppler and Tesseract
pdf2text
Convert scanned PDF files into searchable plain text using 100% free and open-source software (Poppler + Tesseract).
License
This project is licensed under the MIT License – see the LICENSE file for details.
Installation
Via Docker
# Option A – build the image locally
docker build -t pdf2text .
# Option B – pull the latest pre-built image from GHCR
docker pull ghcr.io/valginer0/pdf2text:latest
# convert a single PDF (mount a host folder) and write the text next to it
# The -o flag ensures the output ends up in the mounted folder.
docker run --rm -v $(pwd)/data:/data ghcr.io/valginer0/pdf2text:latest /data/sample.pdf -o /data/sample.txt -v
# Alternatively mount the project root as the working directory so the default
# output path works (the default WORKDIR inside the image is /app).
docker run --rm -v $(pwd):/app -w /app ghcr.io/valginer0/pdf2text:latest data/sample.pdf -v
From source
# End-users (from PyPI)
pip install pdf2text-ocr
# From source (editable)
pip install -e .
# Optional extras
pip install -e .[progress,rich] # progress bar + rich logging
**Why are these extras optional?** `tqdm` (progress) and `rich` are great in
interactive terminals, but they add extra dependencies and ANSI control
sequences that can clutter plain log files. Keeping them optional keeps the
core install lightweight, and lets you skip them in headless environments
(CI, Docker, systemd services) or when embedding `pdf2text` in another
application that provides its own UI.
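A common way to keep an extra like `tqdm` optional is a guarded import with a plain fallback. The sketch below illustrates that general pattern only; it is not taken from pdf2text's source, and `progress_iter` and `HAVE_TQDM` are hypothetical names.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

try:
    # tqdm is only importable when the optional extra is installed.
    from tqdm import tqdm
    HAVE_TQDM = True
except ImportError:
    tqdm = None
    HAVE_TQDM = False


def progress_iter(items: Iterable[T], enabled: bool = True) -> Iterator[T]:
    """Yield items unchanged, showing a progress bar only when tqdm is available."""
    if enabled and HAVE_TQDM:
        yield from tqdm(items)
    else:
        # Headless or minimal install: iterate silently, no ANSI output.
        yield from items
```

With this shape, calling code never has to know whether the extra is installed, and passing `enabled=False` keeps log files free of control sequences.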
### Conda quick-start
```bash
# Everything in one go: Python, Poppler, Tesseract, pdf2text
conda env create -f environment.yaml
conda activate pdf2text
# Optional editable/dev install for contributors
pip install -e .[dev]
```
System requirements:
| OS | Poppler install command | Tesseract install command |
|---|---|---|
| Debian/Ubuntu | `sudo apt install poppler-utils` | `sudo apt install tesseract-ocr` |
| macOS | `brew install poppler` | `brew install tesseract` |
| Windows | Poppler-Windows binaries | UB Mannheim build |
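To confirm that both system tools are reachable before running a conversion, a quick check along these lines may help. It is a sketch under assumptions: `pdftoppm` is one of the Poppler command-line utilities, but whether pdf2text calls that exact binary is not stated here, and `find_tools` and `missing_tools` are hypothetical helper names.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Assumed executable names: pdftoppm ships with Poppler, tesseract with Tesseract.
DEFAULT_TOOLS = ("pdftoppm", "tesseract")


def find_tools(names: Iterable[str] = DEFAULT_TOOLS) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str] = DEFAULT_TOOLS) -> List[str]:
    """Return the names of executables that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Poppler and Tesseract executables found.")
```

On Windows, where Tesseract is often installed outside `PATH`, the `--tesseract-path` flag described below is the documented alternative.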
CLI Usage
# Convert a single file
pdf2text input.pdf -o output.txt
# Multi-language OCR (English + Spanish)
pdf2text input.pdf --lang eng+spa -o output.txt
# Batch convert all PDFs inside a folder
pdf2text /path/to/folder -b -o /path/to/out_dir
# Higher resolution & image enhancement
pdf2text input.pdf --dpi 300 --enhance
# Process N pages at once (speed↑, memory↑)
pdf2text input.pdf --chunk-size 3 -o output.txt
# Batch convert PDFs using all CPUs (use with `-b`)
pdf2text /path/to/folder -b --parallel -o /path/to/out_dir
# Limit workers for `--parallel`
pdf2text /path/to/folder -b --parallel --max-workers 4 -o /path/to/out_dir
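The `--parallel` and `--max-workers` options run the batch inside pdf2text itself. If you would rather drive a batch from your own script, one possible approach is to launch one single-file conversion per PDF. This is a sketch, not the tool's internal implementation: it assumes `pdf2text` is on `PATH`, uses only the documented `-o` flag, and `plan_jobs` and `run_jobs` are hypothetical names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Optional


def plan_jobs(folder: Path, out_dir: Path) -> List[List[str]]:
    """Build one `pdf2text <pdf> -o <txt>` command per PDF found in `folder`."""
    return [
        ["pdf2text", str(pdf), "-o", str(out_dir / (pdf.stem + ".txt"))]
        for pdf in sorted(folder.glob("*.pdf"))
    ]


def run_jobs(
    jobs: List[List[str]],
    max_workers: Optional[int] = None,
    run: Callable = subprocess.run,
) -> List[int]:
    """Run the commands concurrently and return their exit codes in order.

    Threads are enough here because the heavy work happens in child processes.
    """
    def _one(cmd: List[str]) -> int:
        return run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_one, jobs))
```

Passing a different `run` callable makes the helper easy to dry-run or test without invoking the real CLI.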
Arguments
| Flag | Description |
|---|---|
| `input` | Input PDF file or folder |
| `-o`, `--output` | Output txt file or folder |
| `-b`, `--batch` | Treat input as folder and process all PDFs |
| `--dpi` | DPI for rasterisation (default 200) |
| `--enhance` | Apply basic image enhancement before OCR |
| `--lang` | Tesseract language codes (`eng`, `eng+deu`, …) |
| `--tesseract-path` | Path to tesseract executable (Windows) |
| `--version` | Print program version and exit |
| `--chunk-size N` | Process N pages at once (speed↑, memory↑) |
| `--parallel` | Batch convert PDFs using all CPUs (use with `-b`) |
| `--max-workers` | Limit workers for `--parallel` |
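When calling the CLI from another program, it can be convenient to assemble the argument list from keyword options. The helper below is a hypothetical sketch (`build_command` is not part of pdf2text); it only emits flags listed in the table above, and the check that rejects `parallel` without `batch` is an assumption based on the "use with `-b`" note.

```python
from typing import List, Optional


def build_command(
    input_path: str,
    output: Optional[str] = None,
    *,
    batch: bool = False,
    dpi: Optional[int] = None,
    enhance: bool = False,
    lang: Optional[str] = None,
    chunk_size: Optional[int] = None,
    parallel: bool = False,
    max_workers: Optional[int] = None,
) -> List[str]:
    """Return an argv list for the pdf2text CLI using the documented flags."""
    if parallel and not batch:
        # Assumption: the table says --parallel is meant to be used with -b.
        raise ValueError("parallel=True requires batch=True")

    cmd = ["pdf2text", input_path]
    if output is not None:
        cmd += ["-o", output]
    if batch:
        cmd.append("-b")
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if enhance:
        cmd.append("--enhance")
    if lang is not None:
        cmd += ["--lang", lang]
    if chunk_size is not None:
        cmd += ["--chunk-size", str(chunk_size)]
    if parallel:
        cmd.append("--parallel")
    if max_workers is not None:
        cmd += ["--max-workers", str(max_workers)]
    return cmd
```

The resulting list can be passed straight to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.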
Python API
from pdf2text import PDFToTextConverter
conv = PDFToTextConverter(tesseract_config="--psm 1")
# Traditional full-extract
# Tip: pass `log_level=logging.DEBUG` to the constructor to enable verbose logging from library code (same as `-v` flag in the CLI).
text = conv.extract_text_from_pdf("scan.pdf", enhance=True, lang="eng+spa", chunk_size=3)
# Stream pages one-by-one (memory-efficient for large PDFs)
for page_num, page_text in conv.iter_pages("scan.pdf", enhance=True):
    print(page_num, len(page_text))
# Asynchronous multi-core extraction
import asyncio
full_text = asyncio.run(conv.extract_text_async("scan.pdf"))
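The streaming call above can be combined with incremental file writes so that a large PDF never has to be held in memory as one string. The sketch below relies only on `iter_pages` yielding `(page_num, page_text)` pairs as shown; `write_pages` and the page separator format are hypothetical choices, not part of the library.

```python
from typing import Any, Iterable, Protocol, Tuple


class PageSource(Protocol):
    """Anything exposing iter_pages() the way the converter above does."""

    def iter_pages(self, pdf_path: str, **kwargs: Any) -> Iterable[Tuple[int, str]]:
        ...


def write_pages(conv: PageSource, pdf_path: str, out_path: str, **kwargs: Any) -> int:
    """Stream OCR output page by page into out_path; return the number of pages written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for page_num, page_text in conv.iter_pages(pdf_path, **kwargs):
            # Hypothetical separator so page boundaries survive in the text file.
            fh.write(f"--- page {page_num} ---\n{page_text}\n")
            count += 1
    return count
```

A call such as `write_pages(conv, "scan.pdf", "scan.txt", enhance=True)` would forward `enhance=True` to `iter_pages`, matching the keyword used in the example above.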
Development / Contributing
Clone the repo and set up the dev environment:
python -m venv .venv; source .venv/bin/activate # or your preferred workflow
pip install -r requirements-dev.txt
# install pre-commit Git hooks (ruff, black, mypy, pytest run automatically)
pre-commit install
# run tasks
make lint # ruff + mypy
make test # pytest
make build # python -m build
CI mirrors these checks, so commits that pass hooks locally should pass remotely too.
# using pip / venv
pip install -e .[dev]
# or, if you used the Conda env above, just run:
pytest -q