Simple PDF text extraction
pdftotext
import pdftotext

# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))
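Since the PDF object supports len(), indexing, and iteration over page strings, ordinary Python helpers work on it. A minimal sketch, assuming a hypothetical helper named pages_containing (not part of pdftotext), demonstrated on a plain list of strings standing in for pages:

```python
def pages_containing(pages, term):
    """Return 1-based page numbers whose text contains `term` (case-insensitive).

    `pages` can be any iterable of strings, such as a pdftotext.PDF object.
    This helper is hypothetical and not part of pdftotext.
    """
    needle = term.lower()
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in text.lower()
    ]


# A list of strings stands in for the pages of a loaded PDF.
sample_pages = ["Lorem ipsum dolor", "sit amet", "LOREM again"]
print(pages_containing(sample_pages, "lorem"))  # prints [1, 3]
```

With a loaded file, the same call would be pages_containing(pdf, "lorem").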
OS Dependencies
Debian, Ubuntu, and friends:
sudo apt-get update
sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
Fedora, Red Hat, and friends:
sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config
macOS:
brew install pkg-config poppler
Conda users may also need libgcc:
conda install libgcc
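To check whether the Poppler C++ development files are visible before building, something like the following may help. It is a sketch only: it assumes the library registers itself with pkg-config under the module name poppler-cpp, and the function name is made up for illustration.

```python
import shutil
import subprocess


def has_pkg_config_module(module="poppler-cpp"):
    """Return True if pkg-config is installed and can find `module`.

    The default module name "poppler-cpp" is an assumption about how
    the Poppler C++ library is registered with pkg-config.
    """
    if shutil.which("pkg-config") is None:
        return False  # pkg-config itself is missing
    # `pkg-config --exists` exits with status 0 when the module is found.
    result = subprocess.run(["pkg-config", "--exists", module])
    return result.returncode == 0


print(has_pkg_config_module())
```

A result of False suggests the OS packages listed above are not installed yet, or are installed somewhere pkg-config does not search.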
Install
pip install pdftotext
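A quick way to confirm the install worked is to ask Python whether the module can be found. A small sketch using the standard library; the is_installed helper is hypothetical:

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pdftotext"):
    print("pdftotext is importable")
else:
    print("pdftotext not found; check the OS dependencies and rerun pip")
```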
Download files
Source Distribution
pdftotext-2.1.2.tar.gz (113.3 kB)
File details
Details for the file pdftotext-2.1.2.tar.gz.
File metadata
- Download URL: pdftotext-2.1.2.tar.gz
- Upload date:
- Size: 113.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8bdc47b08baa17b8e03ba1f960fc6335b183d2644eaf7300e088516758a6090 |
| MD5 | 8dfdefaafd94b7f4a3073bb35fdc5c4f |
| BLAKE2b-256 | a6a7c202adb0bcd3adc3030b0c5f7f0e21f62a721913e93296e6c4ddc305cbd3 |
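A downloaded archive can be checked against the SHA256 digest listed above. A minimal sketch using hashlib; the function names are hypothetical and the file path is whatever location the archive was saved to:

```python
import hashlib

# SHA256 digest listed above for pdftotext-2.1.2.tar.gz
EXPECTED_SHA256 = "c8bdc47b08baa17b8e03ba1f960fc6335b183d2644eaf7300e088516758a6090"


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file at `path` has the expected SHA256 digest."""
    return sha256_of(path) == expected.lower()
```

For example, matches_expected("pdftotext-2.1.2.tar.gz") should return True for an intact download saved under that name in the current directory.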