Pythonic API for parsing PDF files
Info: See the tutorials & documentation for more information.
Author & Maintainer: Maksym Polshcha <firstname.lastname@example.org>
See GitHub for the latest source.
pdfreader is a Pythonic API for:
- extracting text, images and other data from PDF documents (plain or protected)
- accessing different objects within PDF documents
pdfreader is NOT a tool (though one day it may become one):
- to create or update PDF files
- to split PDF files into pages or other pieces
- to convert PDFs to any other format
Nevertheless, it can be used as a part of such tools.
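As a rough illustration of the object-access side described above, the sketch below summarizes a document by reading its header version and counting its pages. It assumes the `PDFDocument` class with `header.version` and a `pages()` generator as shown in the pdfreader tutorial; the helper names `summarize` and `summarize_file` are hypothetical and not part of the library.

```python
def summarize(doc):
    """Return the PDF version and page count of a document-like object.

    `doc` is assumed to expose `header.version` and a `pages()` generator,
    as pdfreader's PDFDocument does in its tutorial.
    """
    return {
        "version": doc.header.version,
        # pages() is consumed lazily, so pages are counted one at a time
        "pages": sum(1 for _ in doc.pages()),
    }


def summarize_file(path, password=None):
    """Hypothetical convenience wrapper: open a PDF file and summarize it."""
    # Imported here so the module still loads where pdfreader is not installed.
    from pdfreader import PDFDocument

    with open(path, "rb") as fd:
        # Pass a password only when one was given (assumed keyword: password).
        kwargs = {"password": password} if password is not None else {}
        return summarize(PDFDocument(fd, **kwargs))
```

Keeping `summarize` separate from the file handling means it can be exercised with any object of the same shape, which is convenient when no sample PDF is at hand.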
- Extracts text (plain text and formatted text objects)
- Extracts PDF form data (plain strings and formatted text objects)
- Supports all PDF encodings, CMaps, and predefined cmaps
- Extracts images and image masks as Pillow/PIL Images
- Supports encrypted and password-protected PDF documents
- Lets you browse any document objects and resources, and extract any data you need (fonts, annotations, metadata, multimedia, etc.)
- Follows PDF-1.7 specification
- Lazy object access makes it possible to process huge PDF documents quickly
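The text-extraction feature listed above can be sketched as follows. This is a minimal example, assuming the `SimplePDFViewer` class with `navigate()`, `render()`, and `canvas.strings` as described in the pdfreader tutorial; the helper names `page_text` and `read_page_text` are hypothetical.

```python
def page_text(viewer, page_number=1):
    """Render one page and return its plain-text fragments as one string.

    `viewer` is assumed to behave like pdfreader's SimplePDFViewer.
    """
    viewer.navigate(page_number)  # pages are numbered starting from 1
    viewer.render()               # populates viewer.canvas for this page
    # canvas.strings holds the plain-text fragments found on the page
    return "".join(viewer.canvas.strings)


def read_page_text(path, page_number=1, password=None):
    """Hypothetical wrapper: open a (possibly protected) PDF and read a page."""
    # Imported here so the module still loads where pdfreader is not installed.
    from pdfreader import SimplePDFViewer

    with open(path, "rb") as fd:
        # Pass a password only when one was given (assumed keyword: password).
        kwargs = {"password": password} if password is not None else {}
        viewer = SimplePDFViewer(fd, **kwargs)
        return page_text(viewer, page_number)
```

Only the requested page is rendered, which fits the lazy-access design noted in the feature list. Joining fragments with an empty string is a simplification; real documents may need spacing logic based on the formatted text objects.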
pdfreader can be installed with pip:
$ python -m pip install pdfreader
Or easy_install from setuptools:
$ python -m easy_install pdfreader
You can also download the project source and do:
$ python setup.py install
Tutorial and Documentation
Support, Bugs & Feature Requests
pdfreader uses GitHub issues to keep track of bugs, feature requests, etc.