pdf4py

A PDF parser written in Python 3 with no external dependencies.

The pdf4py package lets you analyze a PDF file at a low level by giving access to its atomic components, the PDF objects. It does so through a small API that can serve as a foundation for higher-level functionality (e.g. text and/or image extraction). In particular, it defines the class Parser, which reads the cross-reference table of a PDF document and uses its entries to locate PDF objects within the file and parse them into suitable Python objects.
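
To make the idea of a cross-reference table entry concrete, here is a minimal sketch of how one entry of a classic (non-stream) table can be read. It illustrates the file format only and is not how pdf4py is implemented; XrefEntry and parse_xref_entry are hypothetical names that do not exist in the package. Cross-reference streams, which newer PDF files may use instead, are not covered.

```python
from collections import namedtuple

# Hypothetical record type for this sketch; it is not a pdf4py class.
# For an in-use entry the first field is the byte offset of the object,
# for a free entry it is the object number of the next free object.
XrefEntry = namedtuple("XrefEntry", ["offset_or_next_free", "generation", "in_use"])


def parse_xref_entry(line):
    """Parse one classic cross-reference entry such as b'0000000017 00000 n'."""
    fields = line.split()
    if len(fields) != 3 or fields[2] not in (b"n", b"f"):
        raise ValueError("not a cross-reference entry: %r" % (line,))
    # int() accepts ASCII digits given as bytes, leading zeros included.
    return XrefEntry(int(fields[0]), int(fields[1]), fields[2] == b"n")


entry = parse_xref_entry(b"0000000017 00000 n \n")
free_head = parse_xref_entry(b"0000000000 65535 f \n")
```

An in-use entry like the first one tells a parser where in the file to start reading the object, which is the kind of lookup the Parser class performs on the user's behalf.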

DISCLAIMER: this package has not reached a stable version (>= 1.0.0) yet. Although the parser API is quite simple, it may change from one release to the next. All breaking changes will be listed in the release notes.

Quick example

Here is a quick demonstration of how to use pdf4py. You can find more on the tutorials page.

>>> from pdf4py.parser import Parser
>>> fp = open('tests/pdfs/0000.pdf', 'rb')
>>> parser = Parser(fp)
>>> info_ref = parser.trailer['Info']
>>> print(info_ref)
PDFReference(object_number=114, generation_number=0)
>>> info = parser.parse_reference(info_ref)
>>> print(info)
{'Creator': PDFLiteralString(value=b'PaperCept Conference Management System'),
    ... , 'Producer': PDFLiteralString(value=b'PDFlib+PDI 7.0.3 (Perl 5.8.0/Linux)')}
>>> creator = info['Creator'].value.decode('utf8')
>>> print(creator)
PaperCept Conference Management System
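
Building on the session above, the following sketch converts every byte string in a parsed Info dictionary into a Python str. It is only an illustration: decode_info and decode_text are hypothetical helpers, not part of pdf4py, and FakeLiteralString is a stand-in for the PDFLiteralString shown in the output above so that the sketch runs without the package or a PDF file. Decoding as UTF-8 follows the example above and is an assumption; strings that start with the bytes FE FF are treated as UTF-16.

```python
from collections import namedtuple

# Stand-in for pdf4py's PDFLiteralString (assumed to expose the raw bytes
# as `.value`, as in the session above). Hypothetical, for this sketch only.
FakeLiteralString = namedtuple("FakeLiteralString", ["value"])


def decode_text(raw):
    """Decode the raw bytes of a PDF string into str."""
    if raw.startswith(b"\xfe\xff"):
        # A leading FE FF byte order mark signals big-endian UTF-16;
        # the "utf-16" codec consumes the mark itself.
        return raw.decode("utf-16", errors="replace")
    # Assumption: everything else is decoded as UTF-8, like the example above.
    return raw.decode("utf8", errors="replace")


def decode_info(info):
    """Return a copy of `info` with byte-string values decoded to str."""
    decoded = {}
    for key, obj in info.items():
        raw = getattr(obj, "value", None)
        # Values without a bytes `.value` (numbers, references) pass through.
        decoded[key] = decode_text(raw) if isinstance(raw, bytes) else obj
    return decoded


sample = {
    "Creator": FakeLiteralString(b"PaperCept Conference Management System"),
    "Pages": 12,
}
result = decode_info(sample)
```

With the real package, the dictionary returned by parser.parse_reference(info_ref) would take the place of sample, provided its values expose .value as shown above.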

Installation and updates

You can install pdf4py using pip:

python3 -m pip install pdf4py

or download one of the releases and use the setup.py script.

The master branch is used for development; using it in production is not advised.

This package follows Semantic Versioning (specification 2.0.0).

Extracting text or images

Extracting text from a PDF and other higher-level analysis tasks are not natively supported for now, for two reasons:

  • they are far from trivial and would require a considerable amount of work, which for now I prefer to invest in developing a complete and reliable parser;
  • they are conceptually different tasks from PDF parsing, since the PDF format does not define a document as a sequence of paragraphs, images, and other objects that would normally be considered content.

Therefore, they require a separate implementation built on top of pdf4py. I don't exclude that these functionalities will be made available as modules in this package in the future, but I am not planning to do it anytime soon.

Why this package

One day at work I was asked to analyze some PDF files. To my surprise, I discovered that there was no established Python module for easily parsing a PDF document. To understand why, I delved into the PDF 1.7 specification, and since then I have become more and more interested in the inner workings of one of the most important and ubiquitous file formats. And what better way to understand PDF than writing a parser for it?

Documentation

You can read the documentation on readthedocs.io.

Contributing

Contributions are more than welcome! When writing code or documentation for this package, please keep the following in mind:

  • Use the NumPy docstring conventions for documenting code.
  • Follow the Python style guide (PEP 8) when writing code.
  • pdf4py is designed to be readable and easy to work with. I prefer readability over (not so significant) performance improvements.
  • pdf4py is designed to be modular and flexible, but also easy to use. It shouldn't be complicated for the user to perform one particular task.
  • Adopt a test-driven development process as much as possible. Each contribution must be accompanied by a new or modified test.

If you are wondering how you can help, check the TODO list. For now it serves as a simple "road map".

If you have found a bug, please file a new issue on GitHub. Fixes, changes, and additions can be proposed through a pull request.

Download files

Source Distribution

pdf4py-0.1.0.tar.gz (34.0 kB)

Uploaded Source

Built Distribution

pdf4py-0.1.0-py3-none-any.whl (39.8 kB)

Uploaded Python 3

File details

Details for the file pdf4py-0.1.0.tar.gz.

File metadata

  • Download URL: pdf4py-0.1.0.tar.gz
  • Size: 34.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9

File hashes

Hashes for pdf4py-0.1.0.tar.gz

Algorithm    Hash digest
SHA256       0152e0b2192090615347e533e0412eaaec0d71f990b907763424cccce06b23d3
MD5          c6d7d8491993ecb000eb3751cfbadbee
BLAKE2b-256  c390e94d9cdcc9a99c8fe7ff50856124ab6d53e5b86b9632feeb4142f7921805

File details

Details for the file pdf4py-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pdf4py-0.1.0-py3-none-any.whl
  • Size: 39.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.9

File hashes

Hashes for pdf4py-0.1.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       81d47a9e55a218ff42f3c98c1fd9885379a7bced42fa603eeaee39240d89a142
MD5          9f72ab4da3f41cc5ab0754b17cf34128
BLAKE2b-256  c8f3d1e0cd4f439ad3f97cb6e219ecb4037eb3c32b25bd1c2da6a0c7b063110c
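
The SHA256 digests listed above can be compared against a downloaded file using only the Python standard library. The following is a minimal sketch; sha256_of and matches are hypothetical helper names, and the self-check at the end uses a throwaway temporary file, not a real release archive.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        # Read in chunks so large files are never held in memory at once.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Tell whether the file's digest equals a published one (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Self-check on a temporary file holding the three bytes b"abc".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    demo_path = tmp.name
demo_digest = sha256_of(demo_path)
demo_ok = matches(demo_path, demo_digest.upper())
os.remove(demo_path)
```

For a real download, call matches with the local path of the file (wherever it was saved) and the SHA256 value from the corresponding table above.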
