Simple PDF text extraction
pdftotext
import pdftotext
# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
# If it's password-protected
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")
# How many pages?
print(len(pdf))
# Iterate over all the pages
for page in pdf:
    print(page)
# Read some individual pages
print(pdf[0])
print(pdf[1])
# Read all the text into one string
print("\n\n".join(pdf))
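Because a loaded PDF behaves like a sequence of page strings, ordinary Python tools work on it. The following is a minimal sketch, not part of pdftotext itself: `load_pages` and `find_pages` are hypothetical helper names introduced here for illustration, and only the `pdftotext.PDF(f)` and `pdftotext.PDF(f, password)` calls shown above are taken from the library.

```python
def load_pages(path, password=None):
    """Read a PDF and return its pages as a list of strings.

    Hypothetical helper: it only wraps the two pdftotext.PDF calls
    shown in the usage example above.
    """
    import pdftotext  # imported here so find_pages works without it

    with open(path, "rb") as f:
        if password is None:
            pdf = pdftotext.PDF(f)
        else:
            pdf = pdftotext.PDF(f, password)
        return list(pdf)


def find_pages(pages, needle):
    """Return zero-based indexes of pages containing needle (case-insensitive)."""
    needle = needle.lower()
    return [i for i, text in enumerate(pages) if needle in text.lower()]
```

For example, `find_pages(load_pages("lorem_ipsum.pdf"), "lorem")` would list the pages that mention "lorem". Since `find_pages` accepts any iterable of strings, it can be tried on plain lists without a PDF at hand.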
OS Dependencies
These instructions assume you're using Python 3 on a recent OS. Package names may differ for Python 2 or for an older OS.
Debian, Ubuntu, and friends
sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
Fedora, Red Hat, and friends
sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python3-devel
macOS
brew install pkg-config poppler python
Windows
Currently tested only when using conda:
- Install the Microsoft Visual C++ Build Tools
- Install poppler through conda:
conda install -c conda-forge poppler
Install
pip install pdftotext
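If the install fails while building, one thing worth checking is whether the poppler development files are visible to pkg-config. The sketch below is an assumption-laden check rather than an official diagnostic: `pkg_config_has` is a hypothetical helper name, and it assumes the poppler C++ bindings register with pkg-config under the name `poppler-cpp`. It may not apply to the conda setup on Windows.

```python
import shutil
import subprocess


def pkg_config_has(package="poppler-cpp"):
    """Return True if pkg-config is on PATH and reports that `package` exists.

    The default name "poppler-cpp" is an assumption about how the
    poppler C++ development package registers itself.
    """
    exe = shutil.which("pkg-config")
    if exe is None:
        # pkg-config itself is missing, so nothing can be found through it.
        return False
    # `pkg-config --exists NAME` exits with status 0 when the package is known.
    result = subprocess.run([exe, "--exists", package])
    return result.returncode == 0


if __name__ == "__main__":
    print("poppler-cpp visible to pkg-config:", pkg_config_has())
```

A result of `False` suggests revisiting the OS Dependencies step for your platform before retrying the install.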
Download files
Source Distribution
pdftotext-2.1.5.tar.gz (98.8 kB)
File details
Details for the file pdftotext-2.1.5.tar.gz.
File metadata
- Download URL: pdftotext-2.1.5.tar.gz
- Upload date:
- Size: 98.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 98aeb8b07a4127e1a30223bd933ef080bbd29aa88f801717ca6c5618380b8aa6 |
| MD5 | f5a1e324e23a334beaeed78352ec6fb3 |
| BLAKE2b-256 | dfad87e0429c74f50721b90e0f4b5700d66b2ba2e5bf3d3a59acf1bb81dfac7a |