A tool to extract text from PDF files.

Project description


A Python package focused on extracting content from PDF files.

There seem to be many options out there, but no single solution that is easy to install, even on Windows, and that focuses specifically on PDF files. So we created the extractpdf package.

It is based on the structure of Textract, but focuses on PDF only, and adds other tools to the pipeline, such as PyPDF2 and Camelot.


To use this package, install it from PyPI using:

pip install extractpdf

Then use it like so:

import extractpdf as epdf

# local file
content = epdf.process('my_file.pdf')
# url:
content = epdf.process('')
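
The same call can be applied to a whole folder of PDF files. The following is only a sketch: extract_folder is a hypothetical helper, not part of extractpdf, and it takes the extraction function as an argument, so it makes no assumption about what epdf.process returns or which exceptions it raises.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Tuple


def extract_folder(
    folder: str, extract: Callable[[str], Any]
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Run `extract` on every *.pdf file directly inside `folder`.

    Hypothetical helper (not part of extractpdf). Returns two dicts keyed
    by file name: extracted contents, and the exceptions of failed files.
    """
    contents: Dict[str, Any] = {}
    failures: Dict[str, Exception] = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            contents[pdf_path.name] = extract(str(pdf_path))
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            failures[pdf_path.name] = exc
    return contents, failures
```

With the package installed, it could be called as extract_folder('my_folder', epdf.process), where 'my_folder' is a placeholder directory name.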

Advanced Usage:

For more control, use the PDFExtractor class directly:

from extractpdf import PDFExtractor
epdf = PDFExtractor()
content = epdf.get_content('', keep_download=True)
f = epdf.filename # f = some_file.pdf
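
The two steps above can be wrapped in one small function. This is a sketch under assumptions: get_content_and_path is a hypothetical name, and the code relies only on what the snippet above shows, namely a get_content method that accepts keep_download and a filename attribute on the extractor. Reading the attribute through getattr is a defensive choice made here, not documented behavior.

```python
from typing import Any, Optional, Tuple


def get_content_and_path(extractor: Any, source: str) -> Tuple[Any, Optional[str]]:
    """Extract `source` and report where the kept download was stored.

    Hypothetical helper: `extractor` is any object exposing
    get_content(source, keep_download=...) and, after the call,
    a `filename` attribute, as in the snippet above.
    """
    content = extractor.get_content(source, keep_download=True)
    # Fall back to None if the extractor did not set `filename`.
    return content, getattr(extractor, "filename", None)
```

With the package installed, it could be called as get_content_and_path(PDFExtractor(), source), where source is the file path or URL to process.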


We warmly welcome contributors!

To run this project locally, first install its dependencies. You can use pipenv to do so:

Installation using pipenv (which combines virtualenv with pip)

Install pipenv

# if you haven't installed pip (easy_install is deprecated)
python -m ensurepip --upgrade

# install pipenv
pip install pipenv

On macOS, you can use Homebrew:

brew install pipenv

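Set the pipenv virtual environment to be local to the project. pipenv controls this through the PIPENV_VENV_IN_PROJECT environment variable.

On Windows:

set PIPENV_VENV_IN_PROJECT=1

On Mac/Linux:

export PIPENV_VENV_IN_PROJECT=1
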
Then install the packages:

# install all packages
pipenv install
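
After installation, a quick check can confirm that the expected packages are importable from the environment. This is a minimal sketch: missing_packages is a hypothetical helper, and the list of names to check is an assumption that should be adjusted to the project's actual dependencies.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found.

    Hypothetical helper: `names` must be import names (for example
    'extractpdf'), which may differ from the names used with pip.
    """
    return [name for name in names if find_spec(name) is None]
```

Run inside the environment (for example through pipenv run python), missing_packages(['extractpdf']) should return an empty list once the package is available there.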

Files for extractpdf, version 0.0.4

Filename                            Size     File type  Python version
extractpdf-0.0.4-py3-none-any.whl   22.6 kB  Wheel      py3
extractpdf-0.0.4.tar.gz             7.7 kB   Source     None