PdfTokenizer
A Python library for extracting text from PDFs with automatic OCR detection.
Features
- 🔍 Smart OCR Detection: Automatically determines if OCR is needed by analyzing text extractability
- 🔄 Dual Extraction Methods: Uses PdfPlumber for native PDFs and Tesseract for scanned documents
- 🪟 Windows Support: Automatic Poppler download and setup for Windows users
Installation
pip install pdftokenizer
Quick Start
from pdftokenizer import extract_tokens_from_pdf
# Read your PDF file
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()
# Extract tokens - OCR will be used automatically if needed
pages = extract_tokens_from_pdf(pdf_bytes)
# Force OCR if desired
pages_ocr = extract_tokens_from_pdf(pdf_bytes, force_ocr=True)
How It Works
The library automatically determines whether to use OCR based on text extractability:
- Attempts to extract text from the PDF using PyPDF
- If the extracted text contains fewer than 10 characters (configurable threshold), the PDF is considered to need OCR
- Based on this detection:
  - Text-based PDFs: Processed using PdfPlumber for efficient extraction
  - Scanned/Image PDFs: Processed using Tesseract OCR
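The decision described above can be sketched in a few lines. This is only an illustration of the threshold rule, not the library's actual code: the names needs_ocr and choose_method are hypothetical, and stripping surrounding whitespace before counting characters is an assumption.

```python
def needs_ocr(extracted_text: str, min_chars: int = 10) -> bool:
    """Return True when the text layer is too short to be usable.

    Hypothetical helper: whitespace is stripped before counting,
    which is an assumption and may differ from pdftokenizer's behaviour.
    """
    return len(extracted_text.strip()) < min_chars


def choose_method(extracted_text: str, min_chars: int = 10,
                  force_ocr: bool = False) -> str:
    """Pick an extraction backend name based on the detection result."""
    if force_ocr or needs_ocr(extracted_text, min_chars):
        return "tesseract"   # scanned/image PDF, or OCR explicitly forced
    return "pdfplumber"      # text-based PDF with a usable text layer
```

With the default threshold, an empty or near-empty text layer selects OCR, while any PDF yielding 10 or more characters is treated as text-based unless force_ocr is set.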
Requirements
Poppler
PDF processing backend:
- Windows: Automatically downloaded and configured
- Linux: apt-get install poppler-utils
- macOS: brew install poppler
Tesseract
Required for OCR functionality:
- Windows: Download from UB Mannheim
- Linux: apt-get install tesseract-ocr
- macOS: brew install tesseract
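Before running an extraction, it can help to confirm that both external tools are reachable. The sketch below uses only the standard library; find_missing_tools is a hypothetical helper, and checking for the pdftoppm (Poppler) and tesseract executables on PATH is an assumption about what the library needs at runtime.

```python
import shutil


def find_missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the executable names from `tools` that are not on PATH.

    Assumption: pdftoppm stands in for the Poppler utilities and
    tesseract for the OCR engine; pdftokenizer may locate them differently.
    """
    return [name for name in tools if shutil.which(name) is None]
```

An empty list means both tools were found. On Windows, where Poppler is downloaded automatically, it may not be on PATH, so a missing pdftoppm there does not necessarily indicate a problem.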
License
pdftokenizer is distributed under the terms of the MIT license.