Convert scanned PDF files to searchable text via Poppler and Tesseract

These details have not been verified by PyPI

Project description

pdf2text

Convert scanned PDF files into searchable plain text using 100% free and open-source software (Poppler + Tesseract).

License

This project is licensed under the MIT License – see the LICENSE file for details.

Installation

Via Docker

# Option A – build the image locally
docker build -t pdf2text .

# Option B – pull the latest pre-built image from GHCR
docker pull ghcr.io/valginer0/pdf2text:latest

# convert a single PDF (mount a host folder) and write the text next to it
# The -o flag ensures the output ends up in the mounted folder.
docker run --rm -v "$(pwd)/data:/data" ghcr.io/valginer0/pdf2text:latest /data/sample.pdf -o /data/sample.txt -v

# Alternatively mount the project root as the working directory so the default
# output path works (the default WORKDIR inside the image is /app).
docker run --rm -v "$(pwd):/app" -w /app ghcr.io/valginer0/pdf2text:latest data/sample.pdf -v

With pip

# End-users (from PyPI)
pip install pdf2text-ocr

# From source (editable)
pip install -e .

# Optional extras
pip install -e ".[progress,rich]"    # progress bar + rich logging

Why are these extras optional? `tqdm` (progress) and `rich` are great in interactive terminals, but they add extra dependencies and ANSI control sequences that can clutter plain log files. Keeping them optional keeps the core install lightweight and lets you skip them in headless environments (CI, Docker, systemd services) or when embedding `pdf2text` in another application that provides its own UI.
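
For readers embedding the library, the usual way to treat such a dependency as optional is an import guard with a plain fallback. The sketch below shows that general pattern only; it is not taken from the pdf2text source, and the helper name `with_progress` is hypothetical.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

try:
    # tqdm is only importable when the optional `progress` extra is installed.
    from tqdm import tqdm
except ImportError:
    tqdm = None


def with_progress(items: Iterable[T], enabled: bool = True) -> Iterator[T]:
    """Yield items unchanged, showing a tqdm bar only if tqdm is available and enabled."""
    if enabled and tqdm is not None:
        yield from tqdm(items)
    else:
        # Headless or minimal install: iterate silently, no ANSI output.
        yield from items
```

Calling `with_progress(pages, enabled=False)` in CI or under systemd keeps log files free of control sequences even when `tqdm` happens to be installed.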

Conda quick-start

# Everything in one go: Python, Poppler, Tesseract, pdf2text
conda env create -f environment.yaml
conda activate pdf2text

# Optional editable/dev install for contributors
pip install -e ".[dev]"

System requirements:

OS	Poppler install command	Tesseract install command
Debian/Ubuntu	`sudo apt install poppler-utils`	`sudo apt install tesseract-ocr`
macOS	`brew install poppler`	`brew install tesseract`
Windows	Poppler-Windows binaries	UB Mannheim build
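
A quick preflight check can confirm that both system tools are on `PATH` before a long batch run. This is a minimal sketch, not part of pdf2text: the helper `missing_tools` is hypothetical, and it assumes the relevant Poppler binary is `pdftoppm` (shipped with poppler-utils); the README does not say which Poppler tool pdf2text actually calls.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str] = ("pdftoppm", "tesseract"),  # assumed binary names
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]
```

An empty list from `missing_tools()` means both executables were found; on Windows, a missing `tesseract` can instead be handled with the `--tesseract-path` flag described below.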

CLI Usage

# Convert a single file
pdf2text input.pdf -o output.txt

# Multi-language OCR (English + Spanish)
pdf2text input.pdf --lang eng+spa -o output.txt

# Batch convert all PDFs inside a folder
pdf2text /path/to/folder -b -o /path/to/out_dir

# Higher resolution & image enhancement
pdf2text input.pdf --dpi 300 --enhance

# Process 3 pages at once (speed↑, memory↑)
pdf2text input.pdf --chunk-size 3 -o output.txt

# Batch convert PDFs using all CPUs (use with `-b`)
pdf2text /path/to/folder -b --parallel -o /path/to/out_dir

# Limit workers for `--parallel`
pdf2text /path/to/folder -b --parallel --max-workers 4 -o /path/to/out_dir
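
When per-file error handling matters more than speed, the CLI can also be driven from a script instead of `-b`. The sketch below is one possible approach using only the flags documented here; `build_command` and `convert_folder` are hypothetical helpers, and it assumes that `pdf2text` exits with a non-zero status when a conversion fails, which the README does not confirm.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(
    pdf: Path,
    out_dir: Path,
    lang: str = "eng",
    dpi: Optional[int] = None,
    enhance: bool = False,
) -> List[str]:
    """Build the argv list for one `pdf2text` call."""
    cmd = ["pdf2text", str(pdf), "-o", str(out_dir / (pdf.stem + ".txt")), "--lang", lang]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if enhance:
        cmd.append("--enhance")
    return cmd


def convert_folder(folder: Path, out_dir: Path, **options) -> List[Path]:
    """Convert every PDF in `folder` one at a time; return the PDFs that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(folder.glob("*.pdf")):
        result = subprocess.run(build_command(pdf, out_dir, **options))
        if result.returncode != 0:  # assumption: non-zero exit signals failure
            failed.append(pdf)
    return failed
```

For example, `convert_folder(Path("scans"), Path("out"), dpi=300, enhance=True)` would return the list of files worth retrying.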

Arguments

Flag	Description
`input`	Input PDF file or folder
`-o`, `--output`	Output txt file or folder
`-b`, `--batch`	Treat input as folder and process all PDFs
`--dpi`	DPI for rasterisation (default 200)
`--enhance`	Apply basic image enhancement before OCR
`--lang`	Tesseract language codes (`eng`, `eng+deu`, …)
`--tesseract-path`	Path to `tesseract` executable (Windows)
`--version`	Print program version and exit
`-v`	Enable verbose logging
`--chunk-size N`	Process N pages at once (speed↑, memory↑)
`--parallel`	Batch convert PDFs using all CPUs (use with `-b`)
`--max-workers`	Limit workers for `--parallel`
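
To make the relationships between the flags concrete, here is an illustrative `argparse` reconstruction of the table above. It is a sketch, not the project's actual parser: `--version` and `-v` are left out, and rejecting `--parallel` without `-b` (or `--max-workers` without `--parallel`) is an assumption; the real CLI may simply ignore such combinations.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented flags; defaults other than --dpi are unknown."""
    parser = argparse.ArgumentParser(prog="pdf2text")
    parser.add_argument("input", help="Input PDF file or folder")
    parser.add_argument("-o", "--output", help="Output txt file or folder")
    parser.add_argument("-b", "--batch", action="store_true")
    parser.add_argument("--dpi", type=int, default=200)
    parser.add_argument("--enhance", action="store_true")
    parser.add_argument("--lang")
    parser.add_argument("--tesseract-path")
    parser.add_argument("--chunk-size", type=int)
    parser.add_argument("--parallel", action="store_true")
    parser.add_argument("--max-workers", type=int)
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    """Parse argv and enforce the flag dependencies noted in the table (assumed strict)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.parallel and not args.batch:
        parser.error("--parallel is meant to be used with -b/--batch")
    if args.max_workers is not None and not args.parallel:
        parser.error("--max-workers only applies together with --parallel")
    return args
```

With this sketch, `parse(["scans", "-b", "--parallel", "--max-workers", "4"])` succeeds, while `parse(["in.pdf", "--parallel"])` exits with a usage error.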

Python API

import asyncio

from pdf2text import PDFToTextConverter

# Tip: pass `log_level=logging.DEBUG` to the constructor to enable verbose
# logging from library code (same as the `-v` flag in the CLI).
conv = PDFToTextConverter(tesseract_config="--psm 1")

# Traditional full extract
text = conv.extract_text_from_pdf("scan.pdf", enhance=True, lang="eng+spa", chunk_size=3)

# Stream pages one by one (memory-efficient for large PDFs)
for page_num, page_text in conv.iter_pages("scan.pdf", enhance=True):
    print(page_num, len(page_text))

# Asynchronous multi-core extraction
full_text = asyncio.run(conv.extract_text_async("scan.pdf"))
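
The streaming form pairs naturally with writing each page to disk as it arrives, so the full text never has to sit in memory. A minimal sketch, assuming `iter_pages` yields `(page_num, page_text)` pairs as in the loop above; the helper `write_pages` and the page separator format are hypothetical and not part of the library.

```python
from pathlib import Path
from typing import Iterable, Tuple


def write_pages(pages: Iterable[Tuple[int, str]], out_path: Path) -> int:
    """Write (page_num, text) pairs to `out_path` as they arrive; return the page count."""
    count = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for page_num, page_text in pages:
            fh.write(f"--- page {page_num} ---\n")  # separator format is arbitrary
            fh.write(page_text.rstrip("\n") + "\n")
            count += 1
    return count
```

It would be called as `write_pages(conv.iter_pages("scan.pdf", enhance=True), Path("scan.txt"))`; because it accepts any iterable of pairs, it can be tested without running OCR.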

Development / Contributing

Clone the repo and set up the dev environment:

python -m venv .venv; source .venv/bin/activate  # or your preferred workflow
pip install -r requirements-dev.txt

# install pre-commit Git hooks (ruff, black, mypy, pytest run automatically)
pre-commit install

# run tasks
make lint   # ruff + mypy
make test   # pytest
make build  # python -m build

CI mirrors these checks, so commits that pass hooks locally should pass remotely too.

# using pip / venv: install the dev extras, then run the tests
pip install -e ".[dev]"
pytest -q
# if you used the Conda env above, skip the install and just run `pytest -q`

Project details

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history

0.1.1 (this version), Jul 19, 2025
0.1.0, Jul 19, 2025

Download files

Source Distribution

pdf2text_ocr-0.1.1.tar.gz (14.8 kB)

Uploaded Jul 19, 2025 Source

Built Distribution

pdf2text_ocr-0.1.1-py3-none-any.whl (10.6 kB)

Uploaded Jul 19, 2025 Python 3

File details

Details for the file pdf2text_ocr-0.1.1.tar.gz.

File metadata

Download URL: pdf2text_ocr-0.1.1.tar.gz
Upload date: Jul 19, 2025
Size: 14.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pdf2text_ocr-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`5979a683233e6b59fcb6ff4fffd69215e2b3dcb412911d3c72b50f516c52bbd3`
MD5	`86e3e5eb79bc5a8741032780da4d13a5`
BLAKE2b-256	`c61dfebbf5b5c208f6db911567b08cd5d1cc2c405f43016b704f8c32463299bf`

File details

Details for the file pdf2text_ocr-0.1.1-py3-none-any.whl.

File metadata

Download URL: pdf2text_ocr-0.1.1-py3-none-any.whl
Upload date: Jul 19, 2025
Size: 10.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pdf2text_ocr-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`704d91ad09d229dfad305191bdff362649ed9570ea37fd535e597026ead379f1`
MD5	`7b3319eaae5755112bc44a02b090b542`
BLAKE2b-256	`200a0b70b8bd10e615e32fc6aa98b9c6e028f499dc0c4caf3bedcfa5a9d869e7`
