A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.
pdf2image
A Python (3.7+) module that wraps pdftoppm and pdftocairo to convert PDFs to PIL Image objects.
How to install
pip install pdf2image
Windows
Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up to date. You will then have to add the bin/ folder to PATH or pass poppler_path = r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.
Mac
Mac users will have to install poppler.
Installing using Brew:
brew install poppler
Linux
Most distros ship with pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install poppler-utils.
Platform-independent (using conda)
- Install poppler:
conda install -c conda-forge poppler
- Install pdf2image:
pip install pdf2image
How does it work?
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
Then simply do:
images = convert_from_path('/home/belval/example.pdf')
OR
images = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())
OR better yet
import tempfile
with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here
images will be a list of PIL Image objects, one for each page of the PDF document.
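Since each item in the list is a PIL Image, a common next step is writing every page to disk. Here is a minimal sketch of that; save_pages, its arguments, and the page-0001 naming scheme are hypothetical and not part of pdf2image. It only relies on each image having a save() method, which PIL Image objects do.

```python
from pathlib import Path
from typing import Iterable, List


def save_pages(images: Iterable, out_dir: str, stem: str = "page", ext: str = "png") -> List[Path]:
    """Save each page image as <stem>-0001.<ext>, <stem>-0002.<ext>, and so on."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for number, image in enumerate(images, start=1):
        path = target / f"{stem}-{number:04d}.{ext}"
        # For a PIL Image, the output format is inferred from the file extension.
        image.save(path)
        written.append(path)
    return written
```

With pdf2image installed, this could be called as save_pages(convert_from_path('/home/belval/example.pdf'), 'out').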
Here are the definitions:
convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)
convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)
What's new?
- Allow users to hide attributes when using pdftoppm with hide_attributes (Thank you @StaticRocket)
- Fix console opening on Windows (Thank you @OhMyAgnes!)
- Add timeout parameter which raises PDFPopplerTimeoutError after the given number of seconds.
- Add use_pdftocairo parameter which forces pdf2image to use pdftocairo. Should improve performance.
- Fixed a bug where using pdf2image with multiple threads (but not multiple processes) would cause an exception
- jpegopt parameter allows for tuning of the output JPEG when using fmt="jpeg" (-jpegopt in pdftoppm CLI) (Thank you @abieler)
- pdfinfo_from_path and pdfinfo_from_bytes which expose the output of the pdfinfo CLI
- paths_only parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
- size parameter allows you to define the shape of the resulting images (-scale-to in pdftoppm CLI)
  - size=400 will fit the image to a 400x400 box, preserving aspect ratio
  - size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
  - size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
- grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
- single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
- Allow the user to specify poppler's installation path with poppler_path
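To make the three size rules in the list above concrete, here is a small model of the dimensions you should expect. This is not pdf2image's code: expected_dimensions is a hypothetical helper, the rounding is an assumption (poppler may differ by a pixel), and the (None, height) case is assumed by symmetry with size=(400, None).

```python
from typing import Optional, Tuple, Union

Size = Union[int, Tuple[Optional[int], Optional[int]]]


def expected_dimensions(width: int, height: int, size: Size) -> Tuple[int, int]:
    """Model the documented `size` rules for a page rendered at width x height pixels."""
    if isinstance(size, int):
        # Fit inside a size x size box: the longer side becomes `size`.
        scale = size / max(width, height)
        return round(width * scale), round(height * scale)
    target_w, target_h = size
    if target_w is not None and target_h is not None:
        # Both sides given: exact size, aspect ratio not preserved.
        return target_w, target_h
    if target_w is not None:
        # Only the width given: height follows the aspect ratio.
        return target_w, round(height * target_w / width)
    if target_h is not None:
        # Only the height given (assumed symmetric to the width-only case).
        return round(width * target_h / height), target_h
    return width, height  # (None, None): nothing to resize
```

For a 1000x2000 page, size=400 gives 200x400, size=(400, None) gives 400x800, and size=(500, 500) gives 500x500.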
Performance tips
- Using an output folder is significantly faster if you are using an SSD. Otherwise I/O usually becomes the bottleneck.
- Using multiple threads can give you some gains, but avoid more than 4 as this will cause an I/O bottleneck (even on my NVMe SSD!).
- If I/O is your bottleneck, using the JPEG format can lead to significant gains.
- PNG format is pretty slow because of the compression.
- If you want to know the best settings (most settings will be fine anyway) you can clone the project and run python tests.py to get timings.
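If you would rather compare a couple of settings on your own PDF than run the full test suite, a tiny timing helper is enough. This is only a sketch and is unrelated to the project's tests.py; best_time and its repeat argument are hypothetical names.

```python
import time
from typing import Any, Callable


def best_time(func: Callable[..., Any], *args: Any, repeat: int = 3, **kwargs: Any) -> float:
    """Return the fastest wall-clock time, in seconds, over `repeat` calls of func."""
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    # The minimum is less sensitive to background noise than the mean.
    return min(timings)
```

With pdf2image installed, you could compare best_time(convert_from_path, 'example.pdf', fmt='jpeg') against best_time(convert_from_path, 'example.pdf', fmt='png'), or try a few thread_count values the same way.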
Limitations / known issues
- A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder)
- Sometimes fails to read PDFs signed using DocuSign (see: Solution for DocuSign issue).
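For the memory limitation above, using an output folder together with paths_only is the main fix. If you also want to bound the work done per call, you can convert a few pages at a time with first_page and last_page. The sketch below assumes 1-based, inclusive page numbers and that pdfinfo_from_path returns a dict with a "Pages" entry; page_ranges, convert_in_chunks, and the chunk_ file prefix are hypothetical names.

```python
from typing import Iterator, List, Tuple


def page_ranges(page_count: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (first_page, last_page) pairs covering the document."""
    if page_count < 0 or chunk_size < 1:
        raise ValueError("page_count must be >= 0 and chunk_size >= 1")
    for first in range(1, page_count + 1, chunk_size):
        yield first, min(first + chunk_size - 1, page_count)


def convert_in_chunks(pdf_path: str, output_folder: str, chunk_size: int = 10) -> List[str]:
    """Convert a PDF a few pages at a time and return the image paths."""
    # Imported here so page_ranges() can be used without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    page_count = pdfinfo_from_path(pdf_path)["Pages"]  # assumed to be an int
    paths: List[str] = []
    for first, last in page_ranges(page_count, chunk_size):
        paths.extend(
            convert_from_path(
                pdf_path,
                first_page=first,
                last_page=last,
                output_folder=output_folder,
                # A distinct prefix per chunk keeps the chunks' files apart.
                output_file=f"chunk_{first:05d}",
                paths_only=True,  # return paths, not Image objects
                fmt="jpeg",
            )
        )
    return paths
```

Because only paths are returned, you can open one image at a time afterwards instead of holding every page in memory.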