pdf4py

A PDF parser written in Python 3 with no external dependencies.

The pdf4py package lets the user interact with a PDF file at a low level and build higher-level functionality on top of it (e.g. text and/or image extraction). In particular, it defines the class Parser, which reads the Cross Reference Table of a PDF document and uses its entries to locate PDF objects within the file and parse them into suitable Python objects.
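To make the role of the Cross Reference Table concrete, here is a minimal, self-contained sketch of how a classic table maps object numbers to byte offsets. It is not pdf4py's implementation: the function names (find_startxref, parse_xref_table, build_sample) are hypothetical, and the sketch assumes a single classic "xref" table, ignoring cross-reference streams and incremental updates.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        raise ValueError("startxref keyword not found")
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        raise ValueError("no offset after startxref")
    return int(match.group(1))


def parse_xref_table(data: bytes) -> dict:
    """Map (object_number, generation_number) to a byte offset.

    Only in-use ('n') entries of a single classic table are returned.
    """
    lines = data[find_startxref(data):].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("classic xref table expected")
    entries = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        # Subsection header: first object number and number of entries.
        first, count = (int(x) for x in lines[i].split())
        for k in range(count):
            offset, generation, kind = lines[i + 1 + k].split()
            if kind == b"n":
                entries[(first + k, int(generation))] = int(offset)
        i += 1 + count
    return entries


def build_sample() -> bytes:
    """Build a tiny PDF-like byte string with one object (illustration only)."""
    header = b"%PDF-1.4\n"
    obj1 = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    xref_offset = len(header) + len(obj1)
    xref = (
        "xref\n0 2\n"
        "0000000000 65535 f \n"
        f"{len(header):010d} 00000 n \n"
        "trailer\n<< /Size 2 /Root 1 0 R >>\n"
        f"startxref\n{xref_offset}\n%%EOF\n"
    ).encode("ascii")
    return header + obj1 + xref
```

Calling parse_xref_table(build_sample()) yields one entry whose offset points at the bytes "1 0 obj", which is the kind of lookup a parser performs before it reads an object.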

Quick example

Here is a quick demonstration of how to use pdf4py.

>>> from pdf4py.parser import Parser
>>> fp = open('tests/pdfs/0000.pdf', 'rb')
>>> parser = Parser(fp)
>>> info_ref = parser.trailer['Info']
>>> print(info_ref)
PDFReference(object_number=114, generation_number=0)
>>> info = parser.parse_reference(info_ref).value
>>> print(info)
{'Creator': PDFLiteralString(value=b'PaperCept Conference Management System'),
    ... , 'Producer': PDFLiteralString(value=b'PDFlib+PDI 7.0.3 (Perl 5.8.0/Linux)')}
>>> creator = info['Creator'].value.decode('utf8')
>>> print(creator)
PaperCept Conference Management System
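The same steps can be wrapped in a small helper that closes the file and decodes every string entry of the Info dictionary. This is a sketch under assumptions: it uses only the calls shown above (Parser, trailer, parse_reference, .value), the helper names are hypothetical, and it assumes the trailer contains an 'Info' entry. Text strings that start with a UTF-16BE byte order mark are decoded as such; everything else is decoded as Latin-1, which only approximates PDFDocEncoding.

```python
def decode_text_string(raw: bytes) -> str:
    """Decode a PDF text string (approximation, see the assumptions above)."""
    if raw.startswith(b"\xfe\xff"):
        # UTF-16BE with a byte order mark.
        return raw[2:].decode("utf-16-be")
    # Latin-1 stands in for PDFDocEncoding here; the two differ for some bytes.
    return raw.decode("latin-1")


def info_to_text(info: dict) -> dict:
    """Keep the entries whose object carries a bytes `.value` and decode them."""
    result = {}
    for key, obj in info.items():
        raw = getattr(obj, "value", None)
        if isinstance(raw, bytes):
            result[key] = decode_text_string(raw)
    return result


def read_info(path: str) -> dict:
    """Open a PDF, parse its Info dictionary and return it as plain strings."""
    # Imported here so the helpers above remain usable without pdf4py installed.
    from pdf4py.parser import Parser

    with open(path, "rb") as fp:
        parser = Parser(fp)
        info = parser.parse_reference(parser.trailer["Info"]).value
        return info_to_text(info)
```

For the file used above, read_info('tests/pdfs/0000.pdf') would be expected to return a dictionary containing the 'Creator' and 'Producer' entries as ordinary Python strings.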

Extracting text or images

Text extraction and other higher-level analysis tasks are not natively supported, for two reasons:

  • they are far from trivial and would require a considerable amount of work, which for now I prefer to invest in developing a complete and reliable parser;
  • they are conceptually different tasks from PDF parsing, since the PDF format does not define the concept of a structured document from the content point of view.

Therefore, they require a separate implementation built on top of pdf4py.
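As an illustration of what such a layer has to deal with, the following sketch pulls the literal strings shown by the Tj operator out of an already-decoded page content stream. It is deliberately naive and is not part of pdf4py: the function name shown_strings is hypothetical, how the content stream bytes are obtained is left open, and TJ arrays, hex strings, octal escapes, nested parentheses, font encodings and text positioning are all ignored.

```python
import re

# A literal string "( ... )" followed by the Tj (show text) operator.
# Escaped characters such as "\(" are allowed inside the string.
_TJ = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")

# A few single-character escape sequences used in PDF literal strings.
_ESCAPES = {
    b"n": b"\n",
    b"r": b"\r",
    b"t": b"\t",
    b"(": b"(",
    b")": b")",
    b"\\": b"\\",
}


def shown_strings(content: bytes) -> list:
    """Return the operands of every `(...) Tj` found in a content stream."""
    found = []
    for match in _TJ.finditer(content):
        raw = re.sub(
            rb"\\(.)",
            lambda esc: _ESCAPES.get(esc.group(1), esc.group(1)),
            match.group(1),
            flags=re.DOTALL,
        )
        # Latin-1 is a placeholder: the real mapping depends on the font.
        found.append(raw.decode("latin-1"))
    return found
```

Even on this toy level the result is only a list of string fragments in drawing order; turning them into words, lines and paragraphs requires layout analysis, which is why it belongs in a separate package.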

Why this package

One day at work I was asked to analyze some PDF files; to my surprise, I discovered that there was no established Python module for easily parsing a PDF document. To understand why, I delved into the PDF 1.7 specification, and from that moment on I became more and more interested in the inner workings of one of the most important and ubiquitous file formats. And what better way to understand PDF than writing a parser for it?

PDF standard coverage

The standard coverage page shows which features of the PDF standard are implemented and tracks the progress on supporting the missing ones.

Contributing

Contributions are more than welcome! You can contribute by:

  • filing a new issue;
  • proposing changes and additions through a pull request.
