Extract text from pdf pages from codebehind or Azure OCR as required


pypdftotext

OCR-enabled PDF text extraction built on pypdf and Azure Document Intelligence

pypdftotext is a Python package that intelligently extracts text from PDF files. It uses pypdf's advanced layout mode for embedded text extraction and seamlessly falls back to Azure Document Intelligence OCR when no embedded text is found.

Key Features

🚀 Fast embedded text extraction using pypdf's layout mode
🔄 Automatic OCR fallback via Azure Document Intelligence when needed
🚛 Batch processing with parallel OCR for multiple PDFs
🧵 Stateful extraction with the PdfExtract class
📦 S3 support for reading PDFs directly from AWS S3
🖼️ Image compression to reduce PDF file sizes
✍️ Handwritten text detection with confidence scoring
📄 Page manipulation - create child PDFs and extract page subsets
📑 Header/footer detection - heuristic stripping of repeated page elements
⚙️ Flexible configuration with built-in env support and multiple inheritance options

Installation

Basic Installation

pip install pypdftotext

Optional Dependencies

# Install with boto3 for S3 support
pip install "pypdftotext[s3]"

# Install with pillow for scanned pdf compression support
pip install "pypdftotext[image]"

# For all optional features (s3 and pillow)
pip install "pypdftotext[full]"

# For development (full + boto3-stubs[s3], pytest, pytest-cov)
pip install "pypdftotext[dev]"

Requirements

Python 3.10, 3.11, or 3.12
pypdf 6.0
azure-ai-documentintelligence >= 1.0.0
tqdm (for progress bars)
boto3 (optional)
pillow (optional)

Quick Start

Enable Azure OCR (optional)

NOTE: If OCR has not been configured, only the text embedded directly in the PDF is returned (using pypdf's layout mode). You can also explicitly disable OCR by setting DISABLE_OCR=True in your config.

OCR Prerequisites

An Azure Subscription (create one for free)
An Azure Document Intelligence resource (create one)

OCR Configuration

NOTE: The same behaviors apply to the AWS_* settings for pulling PDFs from S3.

You can set your Endpoint and Subscription Key globally via env vars

export AZURE_DOCINTEL_ENDPOINT="https://your-resource.cognitiveservices.azure.com/"
export AZURE_DOCINTEL_SUBSCRIPTION_KEY="your-subscription-key"

Or via the `constants` module

from pypdftotext import constants
constants.AZURE_DOCINTEL_ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
constants.AZURE_DOCINTEL_SUBSCRIPTION_KEY = "your-subscription-key"
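
Before relying on the OCR fallback, it can help to confirm that both settings are actually present in the environment. The following is a minimal sketch using only the standard library. `missing_ocr_settings` is a hypothetical helper and not part of pypdftotext; only the two variable names come from the text above.

```python
import os

# Variable names taken from the OCR Configuration section above.
REQUIRED_OCR_ENV_VARS = (
    "AZURE_DOCINTEL_ENDPOINT",
    "AZURE_DOCINTEL_SUBSCRIPTION_KEY",
)


def missing_ocr_settings(env=None):
    """Return the names of required OCR variables that are unset or blank.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_OCR_ENV_VARS if not env.get(name, "").strip()]


missing = missing_ocr_settings()
if missing:
    print("OCR not configured; missing:", ", ".join(missing))
else:
    print("OCR settings found.")
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.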

You can also set these values on individual PyPdfToTextConfig instances, which the PdfExtract and AzureDocIntelIntegrator classes expose through their config attribute. See "Optional: Customize the Config" below.

Basic Usage

Create a PdfExtract Instance

from pypdftotext import PdfExtract

extract = PdfExtract("document.pdf")

# Optional: supply a human-readable name for log output (useful in parallel scenarios)
extract = PdfExtract("document.pdf", pdf_name="quarterly-report")

# The config parameter also accepts a plain dict of overrides
extract = PdfExtract("document.pdf", config={"DISABLE_OCR": True})

Optional: Customize the Config

NOTE: If you've already set env vars or constants, setting the endpoint and subscription key here is optional. You can still set them (and any other config options) on the instance itself after creating it.

extract.config.AZURE_DOCINTEL_ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
extract.config.AZURE_DOCINTEL_SUBSCRIPTION_KEY = "your-subscription-key"
extract.config.PRESERVE_VERTICAL_WHITESPACE = True

Extract Text with OCR Fallback

text = extract.text
print(text)

# Get text by page
for i, page_text in enumerate(extract.text_pages):
    print(f"Page {i + 1}: {page_text[:100]}...")

Compress Images in Scanned PDFs to Reduce File Size or Improve OCR

NOTE: Requires the optional pypdftotext[image] installation.

NOTE: Perform this step before accessing text/text_pages so that the compressed PDF is used for OCR. Otherwise, the text will already have been extracted from the original version and will not be re-extracted.

extract.compress_images(  # always converts images to greyscale
    white_point=220,  # pixel values from 221 to 255 are set to 255 (white) to remove scanner artifacts
    aspect_tolerance=0.001,  # resizes images whose aspect ratio (width/height) is within 0.001 of the page aspect ratio
    max_overscale=2,  # images wider than 2x the displayed width of the PDF page are downsampled to 2x
)
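
To make the white_point and max_overscale comments concrete, here is a small pure-Python sketch of the arithmetic they describe. It is an illustration only and not how pypdftotext or pillow implement compression; `apply_white_point` and `target_width` are hypothetical names, and widths are treated as plain numbers in a shared unit.

```python
def apply_white_point(pixels, white_point=220):
    """Set every greyscale value above white_point to pure white (255)."""
    return [255 if value > white_point else value for value in pixels]


def target_width(image_width, displayed_width, max_overscale=2):
    """Cap the image width at max_overscale times the displayed page width."""
    limit = int(displayed_width * max_overscale)
    return min(image_width, limit)


row = [0, 128, 220, 221, 255]
print(apply_white_point(row))   # [0, 128, 220, 255, 255]
print(target_width(5000, 612))  # 1224 (downsampled to 2x the displayed width)
print(target_width(1000, 612))  # 1000 (already within the limit, unchanged)
```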

Saving a Corrected or Compressed PDF Version

NOTE: If a scanned PDF contains upside down or rotated pages, these pages will be reoriented automatically during text extraction.

from pathlib import Path
Path("compressed_corrected_document.pdf").write_bytes(extract.body)

PDF Splitting

# create a new PdfExtract instance containing the first 10 pages of the original PDF.
extract_child = extract.child((0, 9))  # useful for passing config and metadata forward.
# get the bytes of a PDF containing pages 1, 3, and 5 without creating a new PdfExtract instance.
clipped_pages_pdf_bytes = extract_child.clip_pages([0, 2, 4])  # useful for quick splitting.

child() also supports remove_from_parent=True to move pages out of the parent, and raise_on_empty=False to suppress AllPagesRemovedError when all pages would be removed.
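
The examples above imply zero-based page indices, with the (0, 9) tuple covering both endpoints (ten pages). The sketch below spells out that reading with two hypothetical helpers that are not part of pypdftotext; the inclusive-range interpretation is inferred from the "first 10 pages" comment rather than confirmed elsewhere.

```python
def inclusive_range_to_indices(page_range):
    """Expand a zero-based, inclusive (start, end) tuple into a list of indices."""
    start, end = page_range
    if start < 0 or end < start:
        raise ValueError(f"Invalid page range: {page_range!r}")
    return list(range(start, end + 1))


def page_numbers_to_indices(page_numbers):
    """Convert human-facing 1-based page numbers to zero-based indices."""
    if any(number < 1 for number in page_numbers):
        raise ValueError("Page numbers start at 1")
    return [number - 1 for number in page_numbers]


print(len(inclusive_range_to_indices((0, 9))))  # 10
print(page_numbers_to_indices([1, 3, 5]))       # [0, 2, 4]
```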

Adding Bookmarks

extract.add_named_destinations([("Chapter 1", 0), ("Chapter 2", 5)])

Batch Processing

Process multiple PDFs efficiently with parallel OCR:

from pypdftotext import PdfExtractBatch

# Process multiple PDFs (list or dict)
pdfs = ["file1.pdf", "file2.pdf", "file3.pdf"]
# or
pdfs = {"report": "report.pdf", "invoice": "invoice.pdf"}

batch = PdfExtractBatch(pdfs)
results = batch.extract_all()  # Returns dict[str, PdfExtract]

# Access results
for name, pdf_extract in results.items():
    print(f"{name}: {len(pdf_extract.text)} characters extracted")

Batch processing extracts embedded text sequentially, then performs OCR in parallel for all PDFs that need it. It also heuristically detects and strips repeated headers and footers across pages (configurable via MAX_HEADER_LINES, MAX_FOOTER_LINES, and related config options; disabled by default).
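
The two-phase flow described above (sequential embedded extraction, then parallel OCR for the PDFs that need it) can be sketched roughly as follows. This is a toy model, not the library's implementation: `extract_embedded` and `run_ocr` are stand-ins with fake data, and the use of a thread pool is an assumption about how network-bound OCR calls might be parallelized.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_embedded(pdf_name):
    """Stand-in for embedded text extraction; returns '' for scanned PDFs."""
    fake_embedded = {"report": "Quarterly report text", "invoice": ""}
    return fake_embedded.get(pdf_name, "")


def run_ocr(pdf_name):
    """Stand-in for a network-bound OCR request."""
    return f"OCR text for {pdf_name}"


def extract_batch(pdf_names, max_workers=4):
    # Phase 1: embedded text, one PDF at a time.
    results = {name: extract_embedded(name) for name in pdf_names}

    # Phase 2: OCR only the PDFs that produced no text, in parallel.
    needs_ocr = [name for name, text in results.items() if not text.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as its input.
        for name, text in zip(needs_ocr, pool.map(run_ocr, needs_ocr)):
            results[name] = text
    return results


print(extract_batch(["report", "invoice", "scan"]))
```

Deferring OCR to a second phase means that PDFs with usable embedded text never wait on a remote call.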

S3 Support

If an S3 URI (e.g. s3://my-bucket/path/to/document.pdf) is supplied as the pdf parameter, PdfExtract will attempt to pull the bytes from the supplied bucket/key. AWS credentials with the proper permissions must be supplied, either as env vars or programmatically as described for Azure OCR above. Otherwise, an error will result.
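
For readers unfamiliar with the bucket/key split, the sketch below shows one way an S3 URI could be decomposed using plain string handling. `split_s3_uri` is a hypothetical helper for illustration; it does not reflect how pypdftotext parses URIs internally.

```python
S3_PREFIX = "s3://"


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError if malformed."""
    if not uri.startswith(S3_PREFIX):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # Everything up to the first slash is the bucket; the rest is the key.
    bucket, _, key = uri[len(S3_PREFIX):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must include a bucket and a key: {uri!r}")
    return bucket, key


print(split_s3_uri("s3://my-bucket/path/to/document.pdf"))
# ('my-bucket', 'path/to/document.pdf')
```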

Implementation Details

OCR Triggering Logic

OCR is automatically triggered when:

The ratio of low-text pages exceeds TRIGGER_OCR_PAGE_RATIO (default: 99% of pages)
A page is considered "low-text" if it has < MIN_LINES_OCR_TRIGGER lines (default: 1)

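Example: trigger OCR when more than 50% of pages have fewer than 5 lines:

from pypdftotext import PdfExtract, PyPdfToTextConfig  # PyPdfToTextConfig is assumed to be a top-level export, like PdfExtract

config = PyPdfToTextConfig(
    overrides={
        "MIN_LINES_OCR_TRIGGER": 5,  # pages with fewer than 5 lines count as low-text
        "TRIGGER_OCR_PAGE_RATIO": 0.5,  # OCR runs when the low-text page ratio exceeds 50%
    }
)
extract = PdfExtract("document.pdf", config=config)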
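
The triggering rule can be modeled in a few lines of plain Python. This is a sketch under stated assumptions, not the library's code: it counts only non-blank lines, treats "exceeds" as a strict greater-than comparison, and uses hypothetical function names.

```python
def is_low_text(page_text, min_lines=1):
    """A page is low-text when it has fewer than min_lines non-blank lines."""
    line_count = sum(1 for line in page_text.splitlines() if line.strip())
    return line_count < min_lines


def should_trigger_ocr(pages, min_lines=1, page_ratio=0.99):
    """Return True when the share of low-text pages exceeds page_ratio."""
    if not pages:
        return False
    low_text_pages = sum(is_low_text(page, min_lines) for page in pages)
    return low_text_pages / len(pages) > page_ratio


scanned = ["", "", ""]
mixed = ["line one\nline two", "", "a\nb\nc\nd\ne\nf"]

print(should_trigger_ocr(scanned))                             # True
print(should_trigger_ocr(mixed))                               # False
print(should_trigger_ocr(mixed, min_lines=5, page_ratio=0.5))  # True
```

With the defaults, only a document whose pages are essentially all empty triggers OCR; lowering the ratio and raising the line threshold makes the fallback more eager.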

Configuration (Optional)

The PyPdfToTextConfig and PyPdfToTextConfigOverrides (optional) classes can be used to customize the behavior of individual PdfExtract instances.

New PyPdfToTextConfig instances first reinitialize all relevant settings from the env and then inherit any settings that have been set programmatically via constants. This lets users set API keys globally (via env OR constants) and other desired behaviors (via constants only), eliminating the need to supply the config parameter to every PdfExtract instance.
Inheritance from the global constants can be disabled globally by setting constants.INHERIT_CONSTANTS to False, or for a single PyPdfToTextConfig instance using the overrides parameter (e.g. PyPdfToTextConfig(overrides={"INHERIT_CONSTANTS": False})). The PyPdfToTextConfigOverrides TypedDict is available for IDE and typing support.
An alternate base can be supplied to the PyPdfToTextConfig constructor. If supplied, its values supersede those in the global constants.
If both a base and overrides are supplied, overlapping settings in overrides will supersede those in base (or constants).
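
The precedence rules above can be pictured as a series of dictionary merges, where later sources win. The sketch below is a simplified model using plain dicts; `resolve_settings` is a hypothetical function, the layering order is inferred from the description above, and the real classes may behave differently in detail.

```python
def resolve_settings(defaults, env, constants, base=None, overrides=None):
    """Merge settings in increasing priority: defaults, env, constants, base, overrides."""
    overrides = dict(overrides or {})
    # A per-instance override of INHERIT_CONSTANTS beats the global constant.
    inherit = overrides.pop(
        "INHERIT_CONSTANTS", constants.get("INHERIT_CONSTANTS", True)
    )

    settings = dict(defaults)
    settings.update(env)
    if inherit:
        settings.update(
            {key: value for key, value in constants.items() if key != "INHERIT_CONSTANTS"}
        )
    if base:
        settings.update(base)
    settings.update(overrides)
    return settings


defaults = {"DISABLE_OCR": False, "MIN_LINES_OCR_TRIGGER": 1}
env = {"AZURE_DOCINTEL_ENDPOINT": "https://env.example/"}
constants = {
    "AZURE_DOCINTEL_ENDPOINT": "https://constants.example/",
    "MIN_LINES_OCR_TRIGGER": 3,
}

print(resolve_settings(defaults, env, constants, overrides={"MIN_LINES_OCR_TRIGGER": 5}))
print(resolve_settings(defaults, env, constants, overrides={"INHERIT_CONSTANTS": False}))
```

In the first call the endpoint comes from constants and the line threshold from overrides; in the second, constants are skipped entirely, so the env endpoint and the default threshold apply.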

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built on top of:

pypdf for PDF parsing
Azure Document Intelligence for OCR capabilities

Release history

0.4.1 (May 17, 2026), this version
0.4.0 (May 13, 2026)
0.3.8 (May 11, 2026)
0.3.7 (May 11, 2026)
0.3.6 (May 5, 2026)
0.3.5 (Mar 25, 2026)
0.3.4 (Mar 14, 2026)
0.3.3 (Feb 9, 2026)
0.3.2 (Feb 9, 2026)
0.3.1 (Jan 22, 2026)
0.3.0 (Dec 12, 2025)
0.2.2 (Oct 8, 2025)
0.2.1 (Sep 29, 2025)
0.2.0 (Sep 29, 2025)
0.1.0 (Aug 26, 2025)
0.0.9 (Aug 3, 2025)
0.0.8 (Jun 29, 2025)
0.0.7 (Jun 25, 2025)
0.0.6 (Jun 25, 2025)
0.0.5 (Mar 21, 2025)
0.0.4 (Mar 21, 2025)
0.0.3 (Mar 21, 2025)

Download files

Source Distribution

pypdftotext-0.4.1.tar.gz (44.4 kB), uploaded May 17, 2026

Built Distribution

pypdftotext-0.4.1-py3-none-any.whl (46.3 kB, Python 3), uploaded May 17, 2026

File details

Details for the file pypdftotext-0.4.1.tar.gz.

File metadata

Download URL: pypdftotext-0.4.1.tar.gz
Upload date: May 17, 2026
Size: 44.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: python-requests/2.31.0

File hashes

Hashes for pypdftotext-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`0235cf631acaa4de55f50e98e0ff5023328489582dda442b06528b71b6fcbc64`
MD5	`b960a1d5f20c8b6cd455bbd47abb9992`
BLAKE2b-256	`f03dfd2549cbe0652245e0b4e5ff97464540f5fbb30eebc1af0598346d542ffa`

File details

Details for the file pypdftotext-0.4.1-py3-none-any.whl.

File metadata

Download URL: pypdftotext-0.4.1-py3-none-any.whl
Upload date: May 17, 2026
Size: 46.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: python-requests/2.31.0

File hashes

Hashes for pypdftotext-0.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`977d69dcea800ecc376a895fc63149a5ca659c68533271caba605b52c6dc1dcb`
MD5	`63aacc9d950533808182369f34a6f2a1`
BLAKE2b-256	`b4c924bf06d4528277fa07e552946bf094d049d18c028adb70a1a06d4f02326e`
