Simple PDF text extraction

pdftotext

import pdftotext

# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))
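Since a loaded PDF behaves like a sequence of page strings, as the loop above shows, ordinary Python string handling works on it directly. The following is a minimal sketch of a page search under that assumption. The names `pages_containing` and `search_pdf` are hypothetical helpers for illustration, not part of pdftotext.

```python
from typing import Iterable, List, Optional


def pages_containing(pages: Iterable[str], term: str) -> List[int]:
    """Return zero-based indexes of pages whose text contains `term`.

    The match is case-insensitive. `pages` can be any iterable of
    strings, including a pdftotext.PDF object.
    """
    needle = term.casefold()
    return [i for i, text in enumerate(pages) if needle in text.casefold()]


def search_pdf(path: str, term: str, password: Optional[str] = None) -> List[int]:
    """Hypothetical wrapper: load a PDF with pdftotext and search its pages."""
    # Imported here so pages_containing() stays usable without pdftotext.
    import pdftotext

    with open(path, "rb") as f:
        # Only pass a password when one was given, mirroring the usage above.
        pdf = pdftotext.PDF(f) if password is None else pdftotext.PDF(f, password)
    return pages_containing(pdf, term)
```

For example, `search_pdf("lorem_ipsum.pdf", "ipsum")` would return the indexes of the pages that mention "ipsum", which can then be read with `pdf[i]`.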

Why

I needed to index a lot of PDF documents with Python. Existing solutions were slow, complicated, or both. That might not be true anymore!

Dependencies

Debian, Ubuntu, and friends

sudo apt install build-essential libpoppler-cpp-dev python3-dev

Fedora, Red Hat, and friends

sudo yum install gcc-c++ poppler-cpp-devel python3-devel

macOS

brew install poppler python

Install

pip install pdftotext

Download files

Source Distribution

pdftotext-4.0.0.tar.gz (112.3 kB)

File details

Details for the file pdftotext-4.0.0.tar.gz.

File metadata

  • Download URL: pdftotext-4.0.0.tar.gz
  • Size: 112.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for pdftotext-4.0.0.tar.gz

Algorithm    Hash digest
SHA256       6c949f01bb0fea4560cc23e4b8b017188647d22f99a604a7ed3d09f72f43e634
MD5          836772a19e904de806ef1fb822e65ba6
BLAKE2b-256  f4f4603a95657786fdbca99c3a65127507cb1a2d43c0c1c4d9f7a4680c56b61e
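A downloaded archive can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. This is a minimal sketch, not an official verification tool. The helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path is whatever you saved the download as.

```python
import hashlib

# SHA256 digest listed above for pdftotext-4.0.0.tar.gz
EXPECTED_SHA256 = "6c949f01bb0fea4560cc23e4b8b017188647d22f99a604a7ed3d09f72f43e634"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest equals `expected`."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_published_hash("pdftotext-4.0.0.tar.gz")` on the downloaded file should return `True` if the archive is intact. A `False` result suggests a corrupted or different file.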
