A wrapper around poppler's pdftoimage, pdftohtml and pdftotext command line tools to extract information from PDF files
Project description
poppdf
A Python (3.6+) module that wraps poppler's pdftoimage, pdftohtml and pdftotext to extract information from PDF files.
What information is extracted
- image
- text
- information about the position of the text lines
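To make the position information concrete, here is a minimal sketch of how text lines and their coordinates can be read from poppler's XML output. It is not poppdf's actual implementation. It assumes `pdftohtml` is on PATH and that its `-xml` output wraps each text line in a `<text>` element carrying `top`, `left`, `width` and `height` attributes. The function names `parse_pdftohtml_xml` and `text_lines_from_pdf` are made up for illustration.

```python
import subprocess
import xml.etree.ElementTree as ET


def parse_pdftohtml_xml(xml_text):
    """Return one dict per <text> element found in `pdftohtml -xml` output."""
    root = ET.fromstring(xml_text)
    lines = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for node in page.iter("text"):
            lines.append({
                "page": page_number,
                "top": float(node.get("top")),
                "left": float(node.get("left")),
                "width": float(node.get("width")),
                "height": float(node.get("height")),
                # itertext() also collects text nested in <b> or <i> tags
                "text": "".join(node.itertext()),
            })
    return lines


def text_lines_from_pdf(pdf_path):
    """Run pdftohtml on a file and parse its XML (needs poppler on PATH)."""
    result = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", pdf_path],
        stdout=subprocess.PIPE,
        check=True,
    )
    return parse_pdftohtml_xml(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parser can be tried on a saved XML string without poppler installed.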
How to install
pip install poppdf
Windows
Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up-to-date. You will then have to add the bin/ folder to PATH or pass poppler_path = r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.
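If changing the system-wide PATH is not convenient, the bin/ folder can also be added for the current Python process only. The following is a minimal sketch using only the standard library. The helper name `add_poppler_to_path` is hypothetical, and the folder shown in the comment is the same placeholder as above, not a real location.

```python
import os


def add_poppler_to_path(poppler_bin):
    """Prepend poppler's bin folder to PATH for the current process only."""
    current = os.environ.get("PATH", "")
    entries = current.split(os.pathsep) if current else []
    if poppler_bin not in entries:
        os.environ["PATH"] = os.pathsep.join([poppler_bin] + entries)
    return os.environ["PATH"]


# Placeholder path: replace it with the folder where you unpacked poppler.
# add_poppler_to_path(r"C:\path\to\poppler-xx\bin")
```

The change lasts only until the process exits, so it has to run before any call that launches the poppler tools.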
Mac
Mac users will have to install poppler for Mac.
Linux
Most distros ship with pdftoppm and pdftocairo. If they are not installed, use your package manager to install poppler-utils.
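A quick way to find out whether the tools are installed is to look them up on PATH from Python. This is a small sketch with the standard library; the helper name `missing_poppler_tools` and the default list of tool names are assumptions chosen for illustration.

```python
import shutil


def missing_poppler_tools(tools=("pdftoppm", "pdftocairo", "pdftohtml", "pdftotext")):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_poppler_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All poppler tools were found.")
```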
Platform-independent (using conda)
- Install poppler:
conda install -c conda-forge poppler
- Install pdf2image:
pip install pdf2image
How does it work?
from pdf2image import image_from_path, xml_from_path, text_from_path
from poppdf.pdfDocument import PdfDocument
Then simply do:
pdf = PdfDocument('example.pdf')
And, to print the text of a page:
print(pdf.pdf_pages[1].text)
File details
Details for the file poppdf-0.17.8-py3-none-any.whl.
File metadata
- Download URL: poppdf-0.17.8-py3-none-any.whl
- Upload date:
- Size: 29.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.8.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2b9ac26b363a669268f6ac7dd8d8909bf9c109e39fc0540441a18e6f7493d18b |
| MD5 | 4806d9d8dad849a417b36ed5eed0ffba |
| BLAKE2b-256 | 25bd243ace2c56c5324169d2ea6dea5545510c48a6e96f58ff0bc5428a5f1301 |