A wrapper around poppler's pdftoimage, pdftohtml and pdftotext command line tools to extract information from PDF files
Project description
poppdf
A Python (3.6+) module that wraps poppler's pdftoimage, pdftohtml and pdftotext to extract information from PDF files.
What information is extracted
- images
- text
- information about the position of the text lines
How to install
pip install poppdf
Windows
Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up-to-date. You will then have to add the bin/ folder to PATH or pass poppler_path = r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.
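If editing the system PATH is inconvenient, the bin/ folder can also be added for the current Python process only. The following is a minimal sketch using only the standard library; the helper name add_poppler_to_path is hypothetical and not part of poppdf, and the path in the comment is the same placeholder used above.

```python
import os


def add_poppler_to_path(bin_dir):
    """Prepend bin_dir to PATH for the current process (hypothetical helper).

    Does nothing if bin_dir is already listed. Returns the resulting PATH.
    """
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if bin_dir not in entries:
        os.environ["PATH"] = os.pathsep.join([bin_dir] + entries)
    return os.environ["PATH"]


# Placeholder location: replace with the folder where poppler was unpacked.
# add_poppler_to_path(r"C:\path\to\poppler-xx\bin")
```

The change only lasts for the running interpreter and any subprocesses it starts, so it has to be made before the first PDF is processed.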
Mac
Mac users will have to install poppler for Mac.
Linux
Most distros ship with pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install poppler-utils.
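To see whether the poppler command line tools are already available, a quick check from Python may help. This is a sketch that only looks for the executables on PATH with the standard library; the function name missing_poppler_tools and the default list of tools are assumptions made for illustration.

```python
import shutil


def missing_poppler_tools(tools=("pdftoppm", "pdftocairo", "pdftohtml", "pdftotext")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_poppler_tools()
if missing:
    print("Missing poppler tools:", ", ".join(missing))
else:
    print("All poppler tools were found on PATH.")
```

If anything is reported missing, installing poppler-utils (or the equivalent package for your distro) should provide it.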
Platform-independent (using conda)
- Install poppler:
conda install -c conda-forge poppler
- Install pdf2image:
pip install pdf2image
How does it work?
from pdf2image import image_from_path, xml_from_path, text_from_path
from poppdf.pdfDocument import PdfDocument
Then simply do:
pdf = PdfDocument('example.pdf')
The extracted text of a page is then available through pdf_pages:
print(pdf.pdf_pages[1].text)
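For readers curious about what wrapping pdftotext involves, here is a rough sketch that calls the command line tool directly through subprocess. It is not poppdf's actual implementation; the function names build_pdftotext_command and extract_text are hypothetical. It relies on the pdftotext options -layout, -f and -l, and on passing "-" as the output file to write to standard output.

```python
import subprocess


def build_pdftotext_command(pdf_path, first_page=None, last_page=None, layout=True):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the text
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    cmd += [pdf_path, "-"]  # "-" sends the text to stdout
    return cmd


def extract_text(pdf_path, first_page=None, last_page=None, layout=True):
    """Run pdftotext and return its output as a string (requires poppler)."""
    cmd = build_pdftotext_command(pdf_path, first_page, last_page, layout)
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


# Example (needs pdftotext on PATH and an existing file):
# print(extract_text("example.pdf", first_page=1, last_page=1))
```

Keeping the command construction separate from the subprocess call makes it possible to inspect or test the arguments without poppler installed.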
File details
Details for the file poppdf-0.17.8-py3-none-any.whl.
File metadata
- Download URL: poppdf-0.17.8-py3-none-any.whl
- Upload date:
- Size: 29.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.8.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2b9ac26b363a669268f6ac7dd8d8909bf9c109e39fc0540441a18e6f7493d18b
MD5 | 4806d9d8dad849a417b36ed5eed0ffba
BLAKE2b-256 | 25bd243ace2c56c5324169d2ea6dea5545510c48a6e96f58ff0bc5428a5f1301
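A downloaded wheel can be checked against the SHA256 digest listed above. The sketch below uses hashlib from the standard library; the helper name sha256_of_file is hypothetical, and the commented usage assumes the wheel was saved in the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (assumes the wheel is in the current directory):
# expected = "2b9ac26b363a669268f6ac7dd8d8909bf9c109e39fc0540441a18e6f7493d18b"
# print(sha256_of_file("poppdf-0.17.8-py3-none-any.whl") == expected)
```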