
A wrapper around poppler's pdftoimage, pdftohtml and pdftotext command-line tools to extract information from PDF files

Project description

poppdf

A Python (3.6+) module that wraps poppler's pdftoimage, pdftohtml and pdftotext to extract information from PDF files.

What information is extracted

  • image
  • text
  • information about the position of various text lines
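
As an illustration of the position information, the sketch below parses XML in the shape that `pdftohtml -xml` emits, where each `<text>` element carries `top`, `left`, `width` and `height` attributes. This is not poppdf's own code: the sample XML is hand-written and abbreviated, and the names `TextLine` and `parse_text_lines` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple

# Hand-written, abbreviated sample in the shape of `pdftohtml -xml` output (assumption).
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="108" left="72" width="210" height="16" font="0">First line of text</text>
<text top="130" left="72" width="180" height="16" font="0"><b>Second</b> line</text>
</page>
</pdf2xml>"""


class TextLine(NamedTuple):
    """Hypothetical container for one positioned line of text."""
    page: int
    top: int
    left: int
    width: int
    height: int
    text: str


def parse_text_lines(xml_source: str) -> List[TextLine]:
    """Collect every <text> element together with its bounding box."""
    root = ET.fromstring(xml_source)
    lines = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for node in page.iter("text"):
            lines.append(TextLine(
                page=page_number,
                top=int(node.get("top")),
                left=int(node.get("left")),
                width=int(node.get("width")),
                height=int(node.get("height")),
                # itertext() also picks up text nested in <b>, <i> or <a> children.
                text="".join(node.itertext()),
            ))
    return lines
```

Calling `parse_text_lines(SAMPLE_XML)` returns one `TextLine` per `<text>` element, with coordinates in the units used by the XML.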

How to install

pip install poppdf

Windows

Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up-to-date. You will then have to add the bin/ folder to PATH or pass poppler_path = r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.
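
If changing the system-wide PATH is not convenient, the folder can also be added for the current Python process only. The following is a minimal sketch, not part of poppdf; `prepend_to_path` is a hypothetical helper, and the change disappears when the process exits.

```python
import os


def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH unless it is already listed.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


# Example call, using the placeholder path from the text above:
# prepend_to_path(r"C:\path\to\poppler-xx\bin")
```

Call it before any code that launches the poppler executables, so that child processes inherit the updated PATH.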

Mac

Mac users will have to install poppler for Mac.

Linux

Most distros ship with pdftoppm and pdftocairo. If they are not installed, use your package manager to install poppler-utils.

Platform-independent (Using conda)

  1. Install poppler: conda install -c conda-forge poppler
  2. Install poppdf: pip install poppdf
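
Whichever installation route is used, it can help to confirm that the poppler executables are actually discoverable before working with PDFs. The sketch below uses only the standard library; the tool list and the helper names `find_tools` and `missing_tools` are assumptions for illustration, not something poppdf provides.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Poppler command-line tools mentioned in this README (adjust to what you need).
POPPLER_TOOLS = ("pdftotext", "pdftohtml", "pdftoppm", "pdftocairo")


def find_tools(names: Iterable[str] = POPPLER_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str] = POPPLER_TOOLS) -> List[str]:
    """Return the names of the tools that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

An empty list from `missing_tools()` means every listed tool was found; any name it returns points to a poppler installation or PATH problem.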

How does it work?

from pdf2image import image_from_path, xml_from_path, text_from_path

from poppdf.pdfDocument import PdfDocument

Then simply do:

pdf = PdfDocument('example.pdf')

And read the text of a page:

print(pdf.pdf_pages[1].text)
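
To give an idea of what wrapping a poppler tool involves, here is a minimal sketch that calls pdftotext directly through subprocess. It is not poppdf's implementation; `build_pdftotext_command` and `extract_text` are hypothetical names. The flags used (`-layout`, `-f`, `-l`, and `-` as the output file to write to standard output) are standard pdftotext options.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str,
                            first_page: Optional[int] = None,
                            last_page: Optional[int] = None,
                            layout: bool = True) -> List[str]:
    """Assemble the pdftotext argument list without running anything."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")            # keep the physical layout of the page
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert (1-based)
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert (1-based)
    cmd += [pdf_path, "-"]               # "-" sends the text to stdout
    return cmd


def extract_text(pdf_path: str, **options) -> str:
    """Run pdftotext and return its output; requires poppler on PATH."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, **options),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `extract_text('example.pdf', first_page=1, last_page=1)` would return the text of the first page, provided pdftotext is installed and the file exists.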

Download files


Source Distributions

No source distribution files available for this release.

Built Distribution

poppdf-0.17.8-py3-none-any.whl (29.7 kB)

Uploaded Python 3

File details

Details for the file poppdf-0.17.8-py3-none-any.whl.

File metadata

  • Download URL: poppdf-0.17.8-py3-none-any.whl
  • Upload date:
  • Size: 29.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.8.2

File hashes

Hashes for poppdf-0.17.8-py3-none-any.whl
Algorithm Hash digest
SHA256 2b9ac26b363a669268f6ac7dd8d8909bf9c109e39fc0540441a18e6f7493d18b
MD5 4806d9d8dad849a417b36ed5eed0ffba
BLAKE2b-256 25bd243ace2c56c5324169d2ea6dea5545510c48a6e96f58ff0bc5428a5f1301
