A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.
pdf2image
A Python (3.5+) module that wraps pdftoppm and pdftocairo to convert a PDF to a list of PIL Image objects.
How to install
pip install pdf2image
Windows
Windows users will have to install poppler for Windows, then add the bin/ folder to PATH.
Mac
Mac users will have to install poppler for Mac.
Linux
Most distros ship with pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install poppler-utils.
Platform-independent (using conda)
- Install poppler:
conda install -c conda-forge poppler
- Install pdf2image:
pip install pdf2image
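If a conversion fails right after installation, the usual cause is that the poppler binaries are not reachable. The sketch below is not part of pdf2image; it only uses the standard library's shutil.which to check PATH, and the helper name find_poppler_tools is made up for illustration.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdftocairo", "pdfinfo")):
    """Map each poppler tool name to its resolved path, or None if missing.

    Hypothetical helper, not part of pdf2image: it only looks at PATH.
    """
    return {tool: shutil.which(tool) for tool in tools}


def report_poppler_tools():
    """Print one line per tool so a missing binary is easy to spot."""
    for tool, location in find_poppler_tools().items():
        print(f"{tool}: {location or 'NOT FOUND on PATH'}")
```

Calling report_poppler_tools() prints where each tool was found. If poppler lives outside PATH, the poppler_path parameter described further down is the alternative.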
How does it work?
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
Then simply do:
images = convert_from_path('/home/belval/example.pdf')
OR
images = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())
OR better yet
import tempfile
with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here
images will be a list of PIL Image objects, one for each page of the PDF document.
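As a follow-up to the calls above, here is a minimal sketch of what "do something" could look like: saving every page to disk and catching the exception classes imported earlier. The helper names save_pages and convert_and_save, the file naming scheme, and the error messages are all assumptions for illustration, not part of pdf2image.

```python
from pathlib import Path


def save_pages(images, out_dir, stem="page", fmt="PNG"):
    """Save each page as <stem>_<n>.<ext> and return the written paths.

    `images` is any sequence of objects with a PIL-style save(path, format)
    method, such as the list returned by convert_from_path.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, image in enumerate(images, start=1):
        target = out_dir / f"{stem}_{number:03d}.{fmt.lower()}"
        image.save(target, fmt)
        written.append(target)
    return written


def convert_and_save(pdf_path, out_dir):
    """Convert a PDF, save its pages, and report pdf2image's own exceptions."""
    # Imported here so save_pages stays usable without pdf2image installed.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    try:
        images = convert_from_path(pdf_path)
    except PDFInfoNotInstalledError:
        raise SystemExit("poppler does not seem to be installed or on PATH")
    except (PDFPageCountError, PDFSyntaxError) as error:
        raise SystemExit(f"could not read {pdf_path}: {error}")
    return save_pages(images, out_dir)
```

For example, convert_and_save('/home/belval/example.pdf', 'pages') would write pages/page_001.png, pages/page_002.png and so on.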
Here are the definitions:
convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False)
convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False)
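To show how a few of these keyword arguments combine, here is a small sketch that renders only the first page at low resolution as a quick preview. The helper names preview_options and first_page_preview and the choice of 72 dpi are assumptions for illustration; only the parameter names come from the definitions above.

```python
def preview_options(dpi=72):
    """Keyword arguments for a fast, low-resolution render of page 1 only."""
    return {
        "dpi": dpi,        # lower than the default of 200
        "first_page": 1,   # pages are numbered from 1
        "last_page": 1,
        "fmt": "jpeg",
    }


def first_page_preview(pdf_path, dpi=72):
    """Return a PIL Image of the first page of the PDF at pdf_path."""
    # Imported here so preview_options can be inspected without pdf2image.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, **preview_options(dpi))[0]
```

Keeping the options in a plain dict makes it easy to reuse the same settings with convert_from_bytes, which accepts the same keyword arguments.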
Need help?
Use the Mattermost chat to ask questions on the helpdesk and get direct support.
What's new?
- Fixed a bug where using pdf2image with multiple threads (but not multiple processes) would cause an exception
- jpegopt parameter allows for tuning of the output JPEG when using fmt="jpeg" (-jpegopt in pdftoppm CLI) (Thank you @abieler)
- pdfinfo_from_path and pdfinfo_from_bytes which expose the output of the pdfinfo CLI
- paths_only parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
- size parameter allows you to define the shape of the resulting images (-scale-to in pdftoppm CLI)
  - size=400 will fit the image to a 400x400 box, preserving aspect ratio
  - size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
  - size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
- grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
- single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
- Allow the user to specify poppler's installation path with poppler_path
- Fixed a bug where a PNG buffer with a non-terminating I-E-N-D sequence would throw an exception
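The three forms of the size parameter are easier to compare side by side. The function below is only an illustration of the rules as described in the list above; it is not how pdf2image or poppler computes the result, the name expected_size is made up, and poppler's own rounding may differ by a pixel.

```python
def expected_size(width, height, size):
    """Approximate output (width, height) for a page of width x height pixels."""
    if size is None:
        return width, height
    if isinstance(size, int):
        # size=400: fit inside a 400x400 box, preserving aspect ratio.
        longest = max(width, height)
        return round(width * size / longest), round(height * size / longest)
    target_w, target_h = size
    if target_w is not None and target_h is not None:
        # size=(500, 500): exact dimensions, aspect ratio not preserved.
        return target_w, target_h
    if target_w is not None:
        # size=(400, None): fixed width, height follows the aspect ratio.
        return target_w, round(height * target_w / width)
    if target_h is not None:
        # Assumption: (None, h) mirrors (w, None); the list does not state it.
        return round(width * target_h / height), target_h
    return width, height
```

For a 1000x2000 page, size=400 gives roughly 200x400, size=(400, None) gives roughly 400x800, and size=(500, 500) gives 500x500.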
Performance tips
- Using an output folder is significantly faster if you are using an SSD. Otherwise I/O usually becomes the bottleneck.
- Using multiple threads can give you some gains, but avoid more than 4 as this will cause an I/O bottleneck (even on my NVMe SSD!).
- If I/O is your bottleneck, using the JPEG format can lead to significant gains.
- PNG format is pretty slow because of the compression.
- If you want to know the best settings (most settings will be fine anyway) you can clone the project and run python tests.py to get timings.
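To check these tips against your own documents, a small timing harness is enough. This is a sketch using only time.perf_counter; the names time_call and compare_formats are made up for illustration, and the list of formats compared is an assumption about what you might want to test.

```python
import time


def time_call(func, *args, repeat=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def compare_formats(pdf_path, formats=("ppm", "jpeg", "png")):
    """Time convert_from_path once per output format and return the results."""
    # Imported here so time_call can be used on its own.
    from pdf2image import convert_from_path

    return {fmt: time_call(convert_from_path, pdf_path, fmt=fmt) for fmt in formats}
```

Taking the best of several runs reduces noise from disk caches and other processes; the same harness works for comparing thread_count values or runs with and without output_folder.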
Limitations / known issues
- A large PDF can use up all available memory and cause the process to be killed, unless you use an output folder.
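Besides an output folder, another way to keep memory bounded is to convert a few pages at a time with first_page and last_page. This is a different technique from the one named above, sketched under assumptions: the helper names are made up, and the code assumes that the dictionary returned by pdfinfo_from_path exposes the page count under the key "Pages".

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def convert_in_batches(pdf_path, handle_page, batch_size=10):
    """Call handle_page(page_number, image) for every page, batch by batch."""
    # Imported here so page_batches stays usable without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    # Assumption: the pdfinfo dictionary has a "Pages" entry.
    total_pages = int(pdfinfo_from_path(pdf_path)["Pages"])
    for first, last in page_batches(total_pages, batch_size):
        images = convert_from_path(pdf_path, first_page=first, last_page=last)
        for number, image in enumerate(images, start=first):
            handle_page(number, image)
```

Only one batch of images is held in memory at a time, as long as handle_page does not keep references to the images it receives.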