Simple PDF text extraction

pdftotext

import pdftotext

# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))
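
The calls above suggest that a loaded PDF behaves like a sequence of page strings (it supports len(), iteration, and indexing). Under that assumption, ordinary Python can build on it. The following is a minimal sketch of a word-to-page index. The helper name build_page_index and the sample page list are hypothetical and not part of pdftotext; the plain list stands in for a pdftotext.PDF object so the example runs without the library.

```python
import re
from collections import defaultdict


def build_page_index(pages):
    """Map each lowercase word to a sorted list of 0-based page numbers containing it.

    `pages` is any iterable of strings, one string per page.
    """
    index = defaultdict(set)
    for page_number, text in enumerate(pages):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Stand-in for a loaded document: a plain list of page strings.
# With the library installed, a pdftotext.PDF object could be passed instead.
sample_pages = ["Lorem ipsum dolor", "sit amet, lorem", "consectetur"]
index = build_page_index(sample_pages)
print(index["lorem"])  # [0, 1]
```

Page numbers here are 0-based, matching the pdf[0] indexing shown above.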

Why

I needed to index a lot of PDF documents with Python. Existing solutions were slow, complicated, or both. That might not be true anymore!

Dependencies

Debian, Ubuntu, and friends

sudo apt install build-essential libpoppler-cpp-dev python3-dev

Fedora, Red Hat, and friends

sudo yum install gcc-c++ poppler-cpp-devel python3-devel

macOS

brew install poppler python

Install

pip install pdftotext
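
After installing, one way to confirm that the module is visible to the current interpreter is a small check like the sketch below. The helper name is_installed is hypothetical; it only uses the standard library's importlib.util.find_spec and does not import pdftotext itself.

```python
import importlib.util


def is_installed(module_name="pdftotext"):
    """Return True if the named top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed():
    print("pdftotext is importable")
else:
    print("pdftotext not found; check the build dependencies listed above")
```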

Project details

Release history

Version	Release date
4.0.0	Jun 26, 2026 (this version)
3.0.0	Dec 6, 2024
2.2.2	Nov 23, 2021
2.2.1	Oct 1, 2021
2.2.0	Aug 16, 2021
2.1.6	May 14, 2021
2.1.5	Aug 14, 2020
2.1.4	Jan 25, 2020
2.1.3	Jan 7, 2020
2.1.2	Aug 7, 2019
2.1.1	Oct 7, 2018
2.1.0	May 31, 2018
2.0.2	Feb 20, 2018
2.0.1	Aug 10, 2017
2.0.0	Jul 23, 2017
1.1.0	Jul 18, 2017
1.0.0	Jun 10, 2017

Download files

Source Distribution

pdftotext-4.0.0.tar.gz (112.3 kB)

Uploaded Jun 26, 2026 Source

File details

Details for the file pdftotext-4.0.0.tar.gz.

File metadata

Download URL: pdftotext-4.0.0.tar.gz
Upload date: Jun 26, 2026
Size: 112.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for pdftotext-4.0.0.tar.gz
Algorithm	Hash digest
SHA256	`6c949f01bb0fea4560cc23e4b8b017188647d22f99a604a7ed3d09f72f43e634`
MD5	`836772a19e904de806ef1fb822e65ba6`
BLAKE2b-256	`f4f4603a95657786fdbca99c3a65127507cb1a2d43c0c1c4d9f7a4680c56b61e`
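
A downloaded archive can be compared against the SHA256 digest listed above. The following is a minimal sketch using only the standard library's hashlib. The helper names sha256_of_file and matches_published_hash are hypothetical, and the commented usage line assumes the archive was saved to the current directory under its published file name.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "6c949f01bb0fea4560cc23e4b8b017188647d22f99a604a7ed3d09f72f43e634"

# Usage (assumes the archive is in the current directory):
# print(matches_published_hash("pdftotext-4.0.0.tar.gz", EXPECTED_SHA256))
```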
