Extract text from pdf pages from codebehind or Azure OCR as required
pypdftotext
OCR-enabled PDF text extraction built on pypdf and Azure Document Intelligence
pypdftotext is a Python package that intelligently extracts text from PDF files. It uses pypdf's advanced layout mode for embedded text extraction and seamlessly falls back to Azure Document Intelligence OCR when no embedded text is found.
Key Features
- 🚀 Fast embedded text extraction using pypdf's layout mode
- 🔄 Automatic OCR fallback via Azure Document Intelligence when needed
- 🚛 Batch processing with parallel OCR for multiple PDFs
- 🧵 Stateful extraction with the PdfExtract class
- 📦 S3 support for reading PDFs directly from AWS S3
- 🖼️ Image compression to reduce PDF file sizes
- ✍️ Handwritten text detection with confidence scoring
- 📄 Page manipulation - create child PDFs and extract page subsets
- 📑 Header/footer detection - heuristic stripping of repeated page elements
- ⚙️ Flexible configuration with built-in env support and multiple inheritance options
Installation
Basic Installation
pip install pypdftotext
Optional Dependencies
# Install with boto3 for S3 support
pip install "pypdftotext[s3]"
# Install with pillow for scanned pdf compression support
pip install "pypdftotext[image]"
# For all optional features (s3 and pillow)
pip install "pypdftotext[full]"
# For development (full + boto3-stubs[s3], pytest, pytest-cov)
pip install "pypdftotext[dev]"
Requirements
- Python 3.10, 3.11, or 3.12
- pypdf 6.0
- azure-ai-documentintelligence >= 1.0.0
- tqdm (for progress bars)
- boto3 (optional)
- pillow (optional)
Quick Start
Enable Azure OCR (optional)
NOTE: If OCR has not been configured, only the text embedded directly in the PDF will be returned (using pypdf's layout mode). You can also explicitly disable OCR by setting DISABLE_OCR=True in your config.
OCR Prerequisites
- An Azure Subscription (create one for free)
- An Azure Document Intelligence resource (create one)
OCR Configuration
NOTE: The same behaviors apply to the AWS_* settings for pulling PDFs from S3.
You can set your Endpoint and Subscription Key globally via env vars
export AZURE_DOCINTEL_ENDPOINT="https://your-resource.cognitiveservices.azure.com/"
export AZURE_DOCINTEL_SUBSCRIPTION_KEY="your-subscription-key"
Or via the constants module
from pypdftotext import constants
constants.AZURE_DOCINTEL_ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
constants.AZURE_DOCINTEL_SUBSCRIPTION_KEY = "your-subscription-key"
You can also set these values on individual PyPdfToTextConfig instances, which are exposed by the config attribute of the PdfExtract and AzureDocIntelIntegrator classes. See below.
Basic Usage
Create a PdfExtract Instance
from pypdftotext import PdfExtract
extract = PdfExtract("document.pdf")
# Optional: supply a human-readable name for log output (useful in parallel scenarios)
extract = PdfExtract("document.pdf", pdf_name="quarterly-report")
# The config parameter also accepts a plain dict of overrides
extract = PdfExtract("document.pdf", config={"DISABLE_OCR": True})
Optional: Customize the Config
NOTE: if you've set env vars or constants, setting the endpoint and subscription key is optional. However, it is still acceptable to set them (and any other config options) on the instance itself after creating it.
extract.config.AZURE_DOCINTEL_ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
extract.config.AZURE_DOCINTEL_SUBSCRIPTION_KEY = "your-subscription-key"
extract.config.PRESERVE_VERTICAL_WHITESPACE = True
Extract Text with OCR Fallback
text = extract.text
print(text)
# Get text by page
for i, page_text in enumerate(extract.text_pages):
    print(f"Page {i + 1}: {page_text[:100]}...")
Compress Images in Scanned PDFs to Reduce File Size or Improve OCR
NOTE: Requires the optional pypdftotext[image] installation.
NOTE: Perform this step before accessing text/text_pages to use the compressed PDF for OCR. Otherwise, text will already be extracted from the original version and will not be re-extracted.
extract.compress_images(  # always converts images to greyscale
    white_point=220,  # pixels with values from 221 to 255 are set to 255 (white) to remove scanner artifacts
    aspect_tolerance=0.001,  # resizes images whose aspect ratios (width/height) are within 0.001 of the page aspect ratio
    max_overscale=2,  # images having a width more than 2x the displayed width of the PDF page are downsampled to 2x
)
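To make the three parameters concrete, here is a minimal pure-Python sketch of the arithmetic described in the comments above. The function names (apply_white_point, capped_width, aspect_matches) are hypothetical and exist only for illustration. This is not the package's implementation, which works on real images via pillow.

```python
def apply_white_point(pixels, white_point=220):
    """Set every greyscale value above white_point to pure white (255)."""
    return [255 if value > white_point else value for value in pixels]


def capped_width(image_width, displayed_width, max_overscale=2):
    """Cap the image width at max_overscale times the displayed page width."""
    limit = int(displayed_width * max_overscale)
    return min(image_width, limit)


def aspect_matches(image_size, page_size, aspect_tolerance=0.001):
    """True when image and page aspect ratios (width/height) differ by at most the tolerance."""
    image_ratio = image_size[0] / image_size[1]
    page_ratio = page_size[0] / page_size[1]
    return abs(image_ratio - page_ratio) <= aspect_tolerance


# A 3000 px wide scan shown on a 612 pt wide page is capped at 2 x 612 = 1224 px.
print(apply_white_point([0, 220, 221, 255]))
print(capped_width(3000, 612))
print(aspect_matches((1700, 2200), (612, 792)))
```

With the defaults shown, near-white scanner noise collapses to white, and an oversized scan is reduced only when it exceeds twice the displayed width.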
Saving a Corrected or Compressed PDF Version
NOTE: If a scanned PDF contains upside down or rotated pages, these pages will be reoriented automatically during text extraction.
from pathlib import Path
Path("compressed_corrected_document.pdf").write_bytes(extract.body)
PDF Splitting
# create a new PdfExtract instance containing the first 10 pages of the original PDF.
extract_child = extract.child((0, 9)) # useful for passing config and metadata forward.
# get the bytes of a PDF containing pages 1, 3, and 5 without creating a new PdfExtract instance.
clipped_pages_pdf_bytes = extract_child.clip_pages([0, 2, 4]) # useful for quick splitting.
child() also supports remove_from_parent=True to move pages out of the parent, and raise_on_empty=False to suppress AllPagesRemovedError when all pages would be removed.
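The comments above suggest that page arguments are zero-based and that a tuple is an inclusive (first, last) range. The sketch below spells out that reading. expand_page_spec and to_page_numbers are hypothetical helpers, not part of pypdftotext, and the inclusive-range interpretation is inferred only from the "first 10 pages" comment.

```python
def expand_page_spec(spec):
    """Turn an inclusive (first, last) tuple or a list of indices into zero-based indices."""
    if isinstance(spec, tuple):
        first, last = spec
        return list(range(first, last + 1))  # inclusive of the last index
    return list(spec)


def to_page_numbers(indices):
    """Convert zero-based indices to the 1-based page numbers a reader sees."""
    return [index + 1 for index in indices]


print(len(expand_page_spec((0, 9))))                  # number of pages selected by (0, 9)
print(to_page_numbers(expand_page_spec([0, 2, 4])))   # human-readable page numbers
```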
Adding Bookmarks
extract.add_named_destinations([("Chapter 1", 0), ("Chapter 2", 5)])
Batch Processing
Process multiple PDFs efficiently with parallel OCR:
from pypdftotext import PdfExtractBatch
# Process multiple PDFs (list or dict)
pdfs = ["file1.pdf", "file2.pdf", "file3.pdf"]
# or
pdfs = {"report": "report.pdf", "invoice": "invoice.pdf"}
batch = PdfExtractBatch(pdfs)
results = batch.extract_all() # Returns dict[str, PdfExtract]
# Access results
for name, pdf_extract in results.items():
    print(f"{name}: {len(pdf_extract.text)} characters extracted")
Batch processing extracts embedded text sequentially, then performs OCR in parallel for all PDFs that need it. It also heuristically detects and strips repeated headers and footers across pages (configurable via MAX_HEADER_LINES, MAX_FOOTER_LINES, and related config options; disabled by default).
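The two-phase flow described above (sequential embedded extraction, then parallel OCR for whatever still needs it) can be sketched with the standard library. Everything here is an assumption for illustration: extract_embedded and run_ocr are hypothetical stand-ins, and the use of a thread pool is a guess at a suitable mechanism, not a statement about how PdfExtractBatch is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_embedded(pdf_name):
    """Hypothetical stand-in: return embedded text, or '' for a scanned PDF."""
    fake_embedded = {"report": "Quarterly results", "invoice": ""}
    return fake_embedded.get(pdf_name, "")


def run_ocr(pdf_name):
    """Hypothetical stand-in for a slow, network-bound OCR request."""
    return f"OCR text for {pdf_name}"


def extract_batch(pdf_names, max_workers=4):
    # Phase 1: embedded text extraction, one PDF at a time.
    results = {name: extract_embedded(name) for name in pdf_names}

    # Phase 2: OCR only the PDFs that produced no embedded text, in parallel.
    needs_ocr = [name for name, text in results.items() if not text.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name, text in zip(needs_ocr, pool.map(run_ocr, needs_ocr)):
            results[name] = text
    return results


print(extract_batch(["report", "invoice"]))
```

Deferring OCR to a second phase means the slow, network-bound requests overlap with each other while PDFs that already contain text never trigger a request at all.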
S3 Support
If an S3 URI (e.g. s3://my-bucket/path/to/document.pdf) is supplied as the pdf parameter, PdfExtract will attempt to pull the bytes from the supplied bucket/key. AWS credentials with the proper permissions must be supplied as env vars or set programmatically, as described for Azure OCR above; otherwise an error will result.
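For readers unfamiliar with S3 URIs, the following sketch shows one way to split such a URI into its bucket and key using only the standard library. split_s3_uri is a hypothetical helper. How pypdftotext itself parses the URI is not documented here, so treat this as an illustration of the URI format only.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    # Note: keys containing '?' or '#' would need extra handling in this sketch.
    if parsed.scheme != "s3" or not bucket or not key:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    return bucket, key


print(split_s3_uri("s3://my-bucket/path/to/document.pdf"))
```

The resulting pair maps directly onto the Bucket and Key arguments that boto3's S3 client expects in get_object.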
Implementation Details
OCR Triggering Logic
OCR is automatically triggered when:
- The ratio of low-text pages exceeds TRIGGER_OCR_PAGE_RATIO (default: 99% of pages)
- A page is considered "low-text" if it has fewer than MIN_LINES_OCR_TRIGGER lines (default: 1)
Example: trigger OCR only when more than 50% of pages have fewer than 5 lines:
config = PyPdfToTextConfig(
    overrides={
        "MIN_LINES_OCR_TRIGGER": 5,
        "TRIGGER_OCR_PAGE_RATIO": 0.5,
    }
)
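The triggering rule can be restated as a small standalone function. This is a sketch of the rule as described in this section, not the package's code: should_trigger_ocr is a hypothetical name, and counting only non-blank lines is an assumption.

```python
def should_trigger_ocr(page_texts, min_lines=1, trigger_ratio=0.99):
    """Return True when the share of low-text pages exceeds trigger_ratio."""
    if not page_texts:
        return False

    def line_count(text):
        # Assumption: only non-blank lines count toward the threshold.
        return sum(1 for line in text.splitlines() if line.strip())

    low_text_pages = sum(1 for text in page_texts if line_count(text) < min_lines)
    return low_text_pages / len(page_texts) > trigger_ratio


scanned = ["", "", ""]
mostly_text = ["a\nb\nc\nd\ne", "a\nb\nc\nd\ne", ""]
print(should_trigger_ocr(scanned))                                        # defaults
print(should_trigger_ocr(mostly_text, min_lines=5, trigger_ratio=0.5))    # relaxed settings
```

Under the defaults, OCR runs only when essentially every page is empty. Lowering the ratio or raising the line threshold makes the fallback more eager, which can help with PDFs that mix scanned and digital pages.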
Configuration (Optional)
The PyPdfToTextConfig and PyPdfToTextConfigOverrides (optional) classes can be used to customize the operation of individual PdfExtract instances if desired.
- New PyPdfToTextConfig instances first reinitialize all relevant settings from the env and then inherit any settings that have been set programmatically via constants. This allows users to set API keys (via env OR constants) and other desired behaviors (via constants only) globally, eliminating the need to supply the config parameter to every PdfExtract instance.
- Inheritance from the global constants can be disabled globally by setting constants.INHERIT_CONSTANTS to False, or for a single PyPdfToTextConfig instance using the overrides parameter (e.g. PyPdfToTextConfig(overrides={"INHERIT_CONSTANTS": False})). The PyPdfToTextConfigOverrides TypedDict is available for IDE and typing support.
- An alternate base can be supplied to the PyPdfToTextConfig constructor. If supplied, its values supersede those in the global constants.
- If both a base and overrides are supplied, overlapping settings in overrides will supersede those in base (or constants).
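The precedence rules above can be modeled as successive dictionary updates in which later layers win. This is a simplified sketch under stated assumptions: resolve_config is a hypothetical function operating on plain dicts, and it assumes that a supplied base is layered on top of the constants rather than replacing them. The real PyPdfToTextConfig class may differ in detail.

```python
def resolve_config(env, constants, base=None, overrides=None):
    """Merge setting layers so that later layers win."""
    overrides = overrides or {}
    # A per-instance override can switch off inheritance; otherwise the global flag decides.
    inherit = overrides.get(
        "INHERIT_CONSTANTS", constants.get("INHERIT_CONSTANTS", True)
    )

    resolved = dict(env)            # 1. values re-read from the environment
    if inherit:
        resolved.update(constants)  # 2. values set programmatically via constants
    if base is not None:
        resolved.update(base)       # 3. an alternate base supersedes constants
    resolved.update(overrides)      # 4. overrides supersede base (or constants)
    return resolved


env = {"AZURE_DOCINTEL_ENDPOINT": "https://env.example/"}
constants = {"AZURE_DOCINTEL_ENDPOINT": "https://const.example/", "DISABLE_OCR": False}

print(resolve_config(env, constants)["AZURE_DOCINTEL_ENDPOINT"])
print(resolve_config(env, constants, overrides={"INHERIT_CONSTANTS": False})["AZURE_DOCINTEL_ENDPOINT"])
```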
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Built on top of:
- pypdf for PDF parsing
- Azure Document Intelligence for OCR capabilities