Simple PDF text extraction
pdftotext
import pdftotext

# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))
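Each page comes back as an ordinary string, so regular string handling applies to the result. As a rough sketch, the helper below reports which pages mention a term. `find_pages` is a hypothetical name and not part of pdftotext, and the list of strings stands in for a loaded PDF so the snippet runs without a file on disk.

```python
def find_pages(pages, term):
    """Return the 0-based indices of pages whose text contains term.

    Hypothetical helper, not part of pdftotext. The match is case-insensitive.
    """
    needle = term.lower()
    return [i for i, text in enumerate(pages) if needle in text.lower()]


# Stand-in for a loaded PDF: any iterable of page strings works here.
sample_pages = [
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit",
    "Ipsum again",
]
print(find_pages(sample_pages, "ipsum"))  # [0, 2]
```

Passing the `pdf` object from the example above in place of `sample_pages` should behave the same way, since it is iterated page by page.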
Why
I needed to index a lot of PDF documents with Python. Existing solutions were slow, complicated, or both. That might not be true anymore!
Dependencies
Debian, Ubuntu, and friends
sudo apt install build-essential libpoppler-cpp-dev python3-dev
Fedora, Red Hat, and friends
sudo yum install gcc-c++ poppler-cpp-devel python3-devel
macOS
brew install poppler python
Install
pip install pdftotext
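One way to confirm that the install landed in the interpreter you intend to use is to ask Python whether it can locate the module. This is only a sketch: `is_importable` is a hypothetical helper, and it checks that the module can be found, not that the underlying Poppler library loads correctly.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate a top-level module with this name."""
    # find_spec only searches for the module; it does not import it.
    return importlib.util.find_spec(module_name) is not None


print(is_importable("pdftotext"))
```

If this prints `False`, the package was likely installed for a different Python than the one running the check.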
Source Distribution
pdftotext-4.0.0.tar.gz (112.3 kB)
File details
Details for the file pdftotext-4.0.0.tar.gz.
File metadata
- Download URL: pdftotext-4.0.0.tar.gz
- Upload date:
- Size: 112.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6c949f01bb0fea4560cc23e4b8b017188647d22f99a604a7ed3d09f72f43e634 |
| MD5 | 836772a19e904de806ef1fb822e65ba6 |
| BLAKE2b-256 | f4f4603a95657786fdbca99c3a65127507cb1a2d43c0c1c4d9f7a4680c56b61e |
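A downloaded archive can be checked against the SHA256 digest listed above with the standard library's `hashlib`. The sketch below is one way to do it; `sha256_of_file` and `matches_published_hash` are hypothetical helper names, and the local file path is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest published for pdftotext-4.0.0.tar.gz (copied from the table above).
EXPECTED_SHA256 = "6c949f01bb0fea4560cc23e4b8b017188647d22f99a604a7ed3d09f72f43e634"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file at path has the expected SHA256 digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_published_hash("pdftotext-4.0.0.tar.gz")` should return `True` for an intact download saved under that name in the current directory.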