Extract text from PDF pages using embedded text or Azure OCR as required

pypdftotext

OCR-enabled PDF text extraction built on pypdf and Azure Document Intelligence

pypdftotext is a Python package that intelligently extracts text from PDF files. It uses pypdf's advanced layout mode for embedded text extraction and seamlessly falls back to Azure Document Intelligence OCR when no embedded text is found.

Key Features

  • 🚀 Fast embedded text extraction using pypdf's layout mode
  • 🔄 Automatic OCR fallback via Azure Document Intelligence when needed
  • 🚛 Batch processing with parallel OCR for multiple PDFs
  • 🧵 Stateful extraction with the PdfExtract class
  • 📦 S3 support for reading PDFs directly from AWS S3
  • 🖼️ Image compression to reduce PDF file sizes
  • ✍️ Handwritten text detection with confidence scoring
  • 📄 Page manipulation - create child PDFs and extract page subsets
  • 📑 Header/footer detection - heuristic stripping of repeated page elements
  • ⚙️ Flexible configuration with built-in env support and multiple inheritance options

Installation

Basic Installation

pip install pypdftotext

Optional Dependencies

# Install with boto3 for S3 support
pip install "pypdftotext[s3]"

# Install with pillow for scanned PDF compression support
pip install "pypdftotext[image]"

# For all optional features (s3 and pillow)
pip install "pypdftotext[full]"

# For development (full + boto3-stubs[s3], pytest, pytest-cov)
pip install "pypdftotext[dev]"

Requirements

  • Python 3.10, 3.11, or 3.12
  • pypdf 6.0
  • azure-ai-documentintelligence >= 1.0.0
  • tqdm (for progress bars)
  • boto3 (optional)
  • pillow (optional)

Quick Start

Enable Azure OCR (optional)

NOTE: If OCR has not been configured, only the text embedded directly in the PDF is returned (using pypdf's layout mode). You can also explicitly disable OCR by setting DISABLE_OCR=True in your config.

OCR Prerequisites

OCR Configuration

NOTE: The same behaviors apply to the AWS_* settings for pulling PDFs from S3.

You can set your endpoint and subscription key globally via env vars:

export AZURE_DOCINTEL_ENDPOINT="https://your-resource.cognitiveservices.azure.com/"
export AZURE_DOCINTEL_SUBSCRIPTION_KEY="your-subscription-key"

Or via the constants module:

from pypdftotext import constants
constants.AZURE_DOCINTEL_ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
constants.AZURE_DOCINTEL_SUBSCRIPTION_KEY = "your-subscription-key"

You can also set these values on individual PyPdfToTextConfig instances. The PdfExtract and AzureDocIntelIntegrator classes each expose one through their config attribute, as shown under Basic Usage.
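
Before relying on the OCR fallback, it can help to confirm that both environment variables are actually set. The following is a minimal sketch using only the standard library; the helper name missing_ocr_env_vars is hypothetical and not part of pypdftotext.

```python
import os

# Variable names are taken from the configuration section above.
REQUIRED_OCR_ENV_VARS = (
    "AZURE_DOCINTEL_ENDPOINT",
    "AZURE_DOCINTEL_SUBSCRIPTION_KEY",
)


def missing_ocr_env_vars(environ=None):
    """Return the names of required OCR variables that are unset or empty.

    Hypothetical helper for illustration; not a pypdftotext function.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_OCR_ENV_VARS if not env.get(name)]
```

An empty list means both values are present in the environment. It says nothing about values set through the constants module or on a config instance.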

Basic Usage

Create a PdfExtract Instance

from pypdftotext import PdfExtract

extract = PdfExtract("document.pdf")

# Optional: supply a human-readable name for log output (useful in parallel scenarios)
extract = PdfExtract("document.pdf", pdf_name="quarterly-report")

# The config parameter also accepts a plain dict of overrides
extract = PdfExtract("document.pdf", config={"DISABLE_OCR": True})

Optional: Customize the Config

NOTE: If you've set env vars or constants, setting the endpoint and subscription key here is optional. You can still set them (and any other config option) on the instance after creating it.

extract.config.AZURE_DOCINTEL_ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
extract.config.AZURE_DOCINTEL_SUBSCRIPTION_KEY = "your-subscription-key"
extract.config.PRESERVE_VERTICAL_WHITESPACE = True

Extract Text with OCR Fallback

text = extract.text
print(text)

# Get text by page
for i, page_text in enumerate(extract.text_pages):
    print(f"Page {i + 1}: {page_text[:100]}...")

Compress Images in Scanned PDFs to Reduce File Size or Improve OCR

NOTE: Requires the optional pypdftotext[image] installation.

NOTE: Perform this step before accessing text/text_pages to use the compressed PDF for OCR. Otherwise, text will already be extracted from the original version and will not be re-extracted.

extract.compress_images(  # always converts images to greyscale
    # Pixel values from 221 to 255 are set to 255 (white) to remove scanner artifacts.
    white_point=220,
    # Resize images whose aspect ratio (width/height) is within 0.001 of the page aspect ratio.
    aspect_tolerance=0.001,
    # Images wider than 2x the displayed width of the PDF page are downsampled to 2x.
    max_overscale=2,
)
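
To make the white_point and max_overscale parameters concrete, here is a pure-Python sketch of the arithmetic their comments describe. It is an illustration only: the function names are hypothetical, the pixel data is a plain list of greyscale values, and it assumes image and page widths are in the same units. The package itself uses pillow and may work differently internally.

```python
def apply_white_point(pixels, white_point=220):
    """Set greyscale values above white_point to pure white (255).

    Hypothetical illustration of the thresholding rule, not the package's code.
    """
    return [255 if value > white_point else value for value in pixels]


def target_width(image_width, displayed_width, max_overscale=2):
    """Cap the image width at max_overscale times the displayed page width.

    Assumes both widths are expressed in the same units.
    """
    limit = int(displayed_width * max_overscale)
    return min(image_width, limit)
```

With the default white_point of 220, a value of 220 is left unchanged while 221 becomes 255. An image 3000 units wide on a page displayed at 1000 units would be reduced to 2000.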

Saving a Corrected or Compressed PDF Version

NOTE: If a scanned PDF contains upside down or rotated pages, these pages will be reoriented automatically during text extraction.

from pathlib import Path
Path("compressed_corrected_document.pdf").write_bytes(extract.body)

PDF Splitting

# create a new PdfExtract instance containing the first 10 pages of the original PDF.
extract_child = extract.child((0, 9))  # useful for passing config and metadata forward.
# get the bytes of a PDF containing pages 1, 3, and 5 without creating a new PdfExtract instance.
clipped_pages_pdf_bytes = extract_child.clip_pages([0, 2, 4])  # useful for quick splitting.

child() also supports remove_from_parent=True to move pages out of the parent, and raise_on_empty=False to suppress AllPagesRemovedError when all pages would be removed.
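
The example above implies two indexing conventions: page indices are zero-based, and the tuple passed to child() is inclusive at both ends, since (0, 9) yields the first 10 pages. The sketch below spells those conventions out with hypothetical helpers that are not part of pypdftotext.

```python
def pages_in_range(page_range):
    """Expand an inclusive, zero-based (first, last) tuple into page indices.

    Hypothetical helper; the inclusive reading is inferred from the example above.
    """
    first, last = page_range
    if first < 0 or last < first:
        raise ValueError(f"invalid page range: {page_range!r}")
    return list(range(first, last + 1))


def to_page_numbers(indices):
    """Convert zero-based indices to the one-based page numbers a reader sees."""
    return [index + 1 for index in indices]
```

So pages_in_range((0, 9)) covers ten indices, and the indices [0, 2, 4] correspond to pages 1, 3, and 5.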

Adding Bookmarks

extract.add_named_destinations([("Chapter 1", 0), ("Chapter 2", 5)])

Batch Processing

Process multiple PDFs efficiently with parallel OCR:

from pypdftotext import PdfExtractBatch

# Process multiple PDFs (list or dict)
pdfs = ["file1.pdf", "file2.pdf", "file3.pdf"]
# or
pdfs = {"report": "report.pdf", "invoice": "invoice.pdf"}

batch = PdfExtractBatch(pdfs)
results = batch.extract_all()  # Returns dict[str, PdfExtract]

# Access results
for name, pdf_extract in results.items():
    print(f"{name}: {len(pdf_extract.text)} characters extracted")

Batch processing extracts embedded text sequentially, then performs OCR in parallel for all PDFs that need it. It also heuristically detects and strips repeated headers and footers across pages (configurable via MAX_HEADER_LINES, MAX_FOOTER_LINES, and related config options; disabled by default).
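
The two-phase pattern described above (sequential embedded-text extraction, then parallel OCR only where needed) can be sketched with the standard library. This is not the PdfExtractBatch implementation: extract_batch and its callable parameters (read_embedded, run_ocr, needs_ocr) are hypothetical stand-ins, and a thread pool is just one plausible way to parallelize network-bound OCR calls.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_batch(pdfs, read_embedded, run_ocr, needs_ocr, max_workers=4):
    """Phase 1: read embedded text in order. Phase 2: OCR the rest in parallel.

    pdfs maps a name to a source; the three callables are supplied by the
    caller. All names here are illustrative assumptions.
    """
    # Phase 1: sequential pass over every PDF.
    results = {name: read_embedded(source) for name, source in pdfs.items()}

    # Phase 2: only PDFs whose embedded text is insufficient go to OCR.
    pending = [name for name, text in results.items() if needs_ocr(text)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so each name pairs with its own result.
        ocr_texts = pool.map(lambda name: run_ocr(pdfs[name]), pending)
        for name, text in zip(pending, ocr_texts):
            results[name] = text
    return results
```

Deferring OCR to a second phase means PDFs with usable embedded text never incur an OCR request.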

S3 Support

If an S3 URI (e.g. s3://my-bucket/path/to/document.pdf) is supplied as the pdf parameter, PdfExtract attempts to pull the bytes from that bucket and key. AWS credentials with the necessary permissions must be supplied as env vars or set programmatically (as described for Azure OCR above); otherwise an error results.
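
For reference, an S3 URI of this form splits into a bucket and a key as follows. This is a standard-library sketch of the URI convention, not the package's own parsing code; split_s3_uri is a hypothetical name.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise.

    Hypothetical helper illustrating the URI layout only.
    """
    parsed = urlparse(uri)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not bucket or not key:
        raise ValueError(f"not a valid S3 URI: {uri!r}")
    return bucket, key
```

For the example URI above, the bucket is my-bucket and the key is path/to/document.pdf.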

Implementation Details

OCR Triggering Logic

OCR is triggered automatically based on two settings:

  1. A page is considered "low-text" if it has fewer than MIN_LINES_OCR_TRIGGER lines (default: 1).
  2. OCR runs when the ratio of low-text pages exceeds TRIGGER_OCR_PAGE_RATIO (default: 99% of pages).

Example: trigger OCR only when more than 50% of pages have fewer than 5 lines:

config = PyPdfToTextConfig(
    overrides={
        "MIN_LINES_OCR_TRIGGER": 5,
        "TRIGGER_OCR_PAGE_RATIO": 0.5,
    }
)
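
The triggering rule can be sketched as a standalone function to show how the two settings interact. This is an illustration under stated assumptions, not the package's code: the function names are hypothetical, only non-blank lines are counted, the default ratio is written as 0.99, and "exceeds" is read as a strict comparison.

```python
def is_low_text(page_text, min_lines=1):
    """A page is low-text when it has fewer than min_lines non-blank lines.

    Counting only non-blank lines is an assumption made for this sketch.
    """
    lines = [line for line in page_text.splitlines() if line.strip()]
    return len(lines) < min_lines


def should_trigger_ocr(pages, min_lines=1, page_ratio=0.99):
    """Return True when the share of low-text pages exceeds page_ratio."""
    if not pages:
        return False
    low = sum(is_low_text(page, min_lines) for page in pages)
    return low / len(pages) > page_ratio
```

Under these assumptions the defaults trigger OCR only when every page is empty. With min_lines=5 and page_ratio=0.5, a three-page document with two sparse pages triggers OCR, because 2/3 exceeds 0.5.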

Configuration (Optional)

The PyPdfToTextConfig and PyPdfToTextConfigOverrides (optional) classes can be used to customize the operation of individual PdfExtract instances if desired.

  1. New PyPdfToTextConfig instances first reinitialize all relevant settings from the env and then inherit any settings that have been set programmatically via constants. This allows users to set API keys (via env OR constants) and other desired behaviors (via constants only) globally, eliminating the need to supply the config parameter to every PdfExtract instance.
  2. Inheritance from the global constants can be disabled globally by setting constants.INHERIT_CONSTANTS to False, or for a single PyPdfToTextConfig instance using the overrides parameter (e.g. PyPdfToTextConfig(overrides={"INHERIT_CONSTANTS": False})). The PyPdfToTextConfigOverrides TypedDict is available for IDE and typing support.
  3. An alternate base can be supplied to the PyPdfToTextConfig constructor. If supplied, its values supersede those in the global constants.
  4. If both a base and overrides are supplied, overlapping settings in overrides will supersede those in base (or constants).
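
The precedence rules above can be modelled with plain dicts. This sketch is one possible reading, not the PyPdfToTextConfig implementation: resolve_config is a hypothetical function, and it assumes that a base is layered on top of the constants rather than replacing them and that a per-instance INHERIT_CONSTANTS override takes priority over the global flag.

```python
def resolve_config(env, constants, base=None, overrides=None):
    """Merge settings from lowest to highest precedence.

    Order assumed for this sketch: env < constants < base < overrides.
    """
    base = base or {}
    overrides = overrides or {}
    # The per-instance override wins over the global flag; default is to inherit.
    inherit = overrides.get(
        "INHERIT_CONSTANTS", constants.get("INHERIT_CONSTANTS", True)
    )
    resolved = dict(env)            # 1. values read from the environment
    if inherit:
        resolved.update(constants)  # 2. programmatic global constants
    resolved.update(base)           # 3. alternate base supersedes constants
    resolved.update(overrides)      # 4. overrides supersede everything else
    return resolved
```

In this model, a value set in both the environment and constants resolves to the constants value unless inheritance is disabled, in which case the environment value is used.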

License

This project is licensed under the MIT License - see the LICENSE file for details.

Links

Acknowledgments

Built on top of pypdf and Azure Document Intelligence.
