Powerful and Pythonic PDF processing library based on xpdf-4.02

0.1.1 (2020-05-10)

  • FIX: bug where the default Config.text_encoding value (UTF-8) did not persist after Config.reset() and changed to Latin1
  • pdftotext: remove all parameters that change global Config properties

pyxpdf

Fast Python PDF parser module based on xpdf-reader sources.

Quickstart

from pyxpdf import Document, Page, Config
from pyxpdf.xpdf import TextControl

doc = Document("samples/nonfree/mandarin.pdf")
# or
# load pdf from file like object
with open("samples/nonfree/mandarin.pdf", 'rb') as fp:
    doc = Document(fp)

# get pdf metadata dict
print(doc.info())
# >>> doc.info()
# {'CreationDate': "D:20080721141207-04'00'", 
#  'Subject': 'Chinese Version of Universal PCXR8 ...', 
#  'Author': 'SKC Inc.', 
#  'Creator': 'PScript5.dll
#   .....

# get all text
all_text = doc.text()

# iterate over the first 10 pages
for page in doc[:10]:
    # get page label if any
    print(page.label)

# get page by page label
label_page = doc['1']

# get text in table layout without discarding clipped
# text.
text_control = TextControl("table", clip_text=True)
text = label_page.text(control=text_control)

# find case sensitive text within [x_min, y_min, x_max, y_max]
res_box = label_page.find_text('操作说明', search_box=[0, 0, 400, 400],
                                case_sensitive=True)
# >>> print(res_box)
# (281.88, 269.718, 354.05819999999994, 287.7)

# load xpdfrc
Config.load_file('my_xpdfrc')
# suppress xpdf error messages on stderr
Config.error_quiet = True
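
The page iteration shown in the Quickstart can be wrapped in a small helper. The following is a minimal sketch and not part of pyxpdf: `text_by_label` is a hypothetical name, and it assumes that `page.text()` can be called without arguments and that `page.label` is None or empty when a page has no label.

```python
def text_by_label(doc, limit=10):
    """Map page labels to page text for the first `limit` pages.

    `doc` is anything that can be sliced into page objects exposing
    `.label` and `.text()`, as the Document in the Quickstart does.
    """
    pages = {}
    for index, page in enumerate(doc[:limit], start=1):
        # Assumption: pages without a label report None or "".
        # Fall back to the 1-based position in that case.
        key = page.label or str(index)
        pages[key] = page.text()
    return pages
```

With a real document this would presumably be called as `text_by_label(Document("samples/nonfree/mandarin.pdf"))`. Because the helper only relies on slicing, `.label` and `.text()`, it can also be exercised with simple stand-in objects.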

pdftotext

If you are familiar with the pdftotext binary, this is its Python port, running at almost native binary speed.

from pyxpdf import pdftotext

file = "sample.pdf"
# Get text from first two pages of pdf
pdf_text = pdftotext(file, start=1, end=2, layout="table",
                     userpass="1234", ownerpass="1234", 
                     cfg_file="~/.xpdfrc")
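
For long documents it can be convenient to extract text in batches of pages. The sketch below assumes, based on the example above, that `start` and `end` are 1-based and inclusive and that the remaining keyword arguments are optional. `page_ranges` and `extract_in_batches` are hypothetical helpers, not part of pyxpdf, and the total page count has to be supplied by the caller.

```python
def page_ranges(total_pages, batch_size):
    """Yield 1-based, inclusive (start, end) pairs covering all pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, total_pages + 1, batch_size):
        yield start, min(start + batch_size - 1, total_pages)


def extract_in_batches(path, total_pages, batch_size=50):
    """Yield the text of each batch of pages from the PDF at `path`."""
    # Imported lazily so page_ranges can be used without pyxpdf installed.
    from pyxpdf import pdftotext

    for start, end in page_ranges(total_pages, batch_size):
        # Assumption: start/end behave as in the example above.
        yield pdftotext(path, start=start, end=end)
```

Keeping the range arithmetic in its own function makes the off-by-one behaviour easy to check without opening a PDF.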

Note:-

  • pdftotext returns a Unicode string, so any characters in your PDF that cannot be decoded as UTF-8 are ignored [decode('utf-8', errors='ignore')].
  • If you are working with a different encoding, you can use pdftotext_raw, which has the same function signature but returns a bytes object. You can then decode it yourself, but make sure to set Config.text_encoding to your encoding so that xpdf can extract the text properly. Currently only the 'UTF-8', 'Latin1', 'ASCII7', 'Symbol', 'ZapfDingbats' and 'UCS-2' encodings are predefined. To add further encodings, you can provide Unicode CMaps for your encoding through xpdfrc.
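
The effect of decoding with errors='ignore', and the alternative of decoding raw bytes yourself, can be illustrated as follows. The first part uses only the standard library. `extract_with_encoding` is a hypothetical helper, not part of pyxpdf; it assumes that pdftotext_raw accepts a file path as its only required argument and that the Python codec name matching the chosen xpdf encoding is passed by the caller.

```python
# Bytes as a raw extraction might return them with a Latin1 text encoding.
raw = "café".encode("latin-1")  # b'caf\xe9'

# Decoding as UTF-8 with errors="ignore" silently drops the invalid byte.
lossy = raw.decode("utf-8", errors="ignore")

# Decoding with the codec that matches the encoding keeps every character.
exact = raw.decode("latin-1")


def extract_with_encoding(path, xpdf_encoding="Latin1", codec="latin-1"):
    """Extract text as bytes and decode it with a caller-chosen codec."""
    # Imported lazily so the decoding demo above runs without pyxpdf.
    from pyxpdf import Config, pdftotext_raw

    # Config is global, so this setting affects later extractions too.
    Config.text_encoding = xpdf_encoding
    return pdftotext_raw(path).decode(codec)
```

Here `lossy` loses the accented character while `exact` preserves it, which is why the text encoding set in Config and the codec used for decoding need to agree.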

Install

pip install pyxpdf

Note (Windows):-

To build this on Windows you will need the Visual C++ compiler, which you can get by installing Visual Studio Build Tools.

Build Instructions

Requirements:-

  • (CPython) Python 3.4+
  • A recent enough C/C++ build environment

First clone the pyxpdf git repository:

$ git clone https://github.com/ashutoshvarma/pyxpdf.git
$ cd pyxpdf

Optionally create a virtualenv (recommended):

$ python -m venv <directory>
$ source <directory>/bin/activate

Then install the dependencies:

$ pip install -r test_requirements.txt

Build wheel

$ pip install wheel
$ python setup.py bdist_wheel --with-cython

Install wheel package

$ pip install dist/*.whl

Now you can run the tests

$ python runtests.py -v

License

pyxpdf is licensed under the GNU General Public License (GPL), version 3. See the LICENSE file.

It uses the following third-party sources:-

  • Xpdf Reader [https://www.xpdfreader.com/] by Derek Noonburg
