A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.
A Python (3.5+) module that wraps pdftoppm and pdftocairo to convert a PDF to a list of PIL Image objects.
How to install
pip install pdf2image
Mac users will have to install poppler for Mac.
Most distros ship with pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install them.
- Install poppler:
conda install -c conda-forge poppler
- Install pdf2image:
pip install pdf2image
How does it work?
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (PDFInfoNotInstalledError, PDFPageCountError, PDFSyntaxError)
Then simply do:
images = convert_from_path('/home/belval/example.pdf')
images = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())
OR better yet
import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here
images will be a list of PIL Image objects, one for each page of the PDF document.
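As a minimal sketch of what "do something" with the returned list might look like, the helper below writes each page to its own file. The name `save_pages`, the file naming scheme, and the default PNG format are illustrative assumptions, not part of pdf2image; the only thing relied on is that each item has PIL's `save(fp, format)` method.

```python
from pathlib import Path


def save_pages(images, out_dir, stem="page", fmt="PNG"):
    """Save each page image as <stem>-<n>.<ext> and return the written paths.

    Hypothetical helper: `images` is any iterable of objects with a
    PIL-style save(fp, format) method, such as the list returned by
    convert_from_path.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, image in enumerate(images, start=1):
        # Zero-pad the page number so the files sort in page order.
        path = out_dir / f"{stem}-{number:04d}.{fmt.lower()}"
        image.save(path, fmt)
        paths.append(path)
    return paths


# Possible usage (requires pdf2image and poppler to be installed):
#   from pdf2image import convert_from_path
#   pages = convert_from_path('/home/belval/example.pdf')
#   save_pages(pages, 'out')
```

Numbering starts at 1 so that file names line up with PDF page numbers.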
Here are the definitions:
convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False)
convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False)
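The exception classes imported earlier can be caught around a conversion call. The wrapper below is only a sketch: `convert_or_none` is a hypothetical name, and the `convert` and `errors` parameters exist purely so the helper can be tried with stand-ins when poppler is not available. Which exception is raised in which situation is not spelled out here, so the three classes are simply caught together.

```python
def convert_or_none(pdf_path, convert=None, errors=None, **kwargs):
    """Return the converted pages, or None if conversion raised a known error.

    Hypothetical helper. Extra keyword arguments (dpi, fmt, first_page, ...)
    are passed straight through to the conversion function.
    """
    if convert is None or errors is None:
        # Imported lazily so the helper can be exercised without pdf2image.
        from pdf2image import convert_from_path
        from pdf2image.exceptions import (
            PDFInfoNotInstalledError,
            PDFPageCountError,
            PDFSyntaxError,
        )
        convert = convert or convert_from_path
        errors = errors or (
            PDFInfoNotInstalledError,
            PDFPageCountError,
            PDFSyntaxError,
        )
    try:
        return convert(pdf_path, **kwargs)
    except errors as exc:
        print(f"Could not convert {pdf_path}: {exc!r}")
        return None


# Possible usage:
#   pages = convert_or_none('/home/belval/example.pdf', dpi=150, fmt='jpeg')
```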
Use the Mattermost chat to ask questions on the helpdesk and get direct support.
- Fixed a bug where using pdf2image with multiple threads (but not multiple processes) would cause an exception
- jpegopt parameter allows for tuning of the output JPEG when using the JPEG format (-jpegopt in pdftoppm CLI) (Thank you @abieler)
- pdfinfo_from_bytes, which exposes the output of the pdfinfo CLI
- paths_only parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
- size parameter allows you to define the shape of the resulting images (-scale-to in pdftoppm CLI)
  - size=400 will fit the image to a 400x400 box, preserving aspect ratio
  - size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
  - size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
- grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
- single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the
- Allow the user to specify poppler's installation path with
- Fixed a bug where a PNG buffer with a non-terminating I-E-N-D sequence would throw an exception
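To make the three forms of the size parameter listed above concrete, here is a small model of the output dimensions they describe. This is not pdf2image code: `expected_size` is a hypothetical function, the rounding is an assumption, and the `(None, height)` case is assumed by symmetry with `(400, None)`.

```python
def expected_size(page_width, page_height, size):
    """Model the pixel dimensions described for the `size` parameter.

    Hypothetical illustration only; actual rounding is done by poppler
    and may differ by a pixel.
    """
    if size is None:
        return page_width, page_height
    if isinstance(size, int):
        # Fit inside a size x size box, preserving aspect ratio.
        scale = size / max(page_width, page_height)
        return round(page_width * scale), round(page_height * scale)
    width, height = size
    if width is not None and height is not None:
        # Both given: exact dimensions, aspect ratio not preserved.
        return width, height
    if width is not None:
        # Only width given: height follows from the aspect ratio.
        return width, round(page_height * width / page_width)
    if height is not None:
        # Assumed symmetric case: only height given.
        return round(page_width * height / page_height), height
    return page_width, page_height


# A 1000x2000 page under the three documented forms:
print(expected_size(1000, 2000, 400))         # (200, 400)
print(expected_size(1000, 2000, (400, None))) # (400, 800)
print(expected_size(1000, 2000, (500, 500)))  # (500, 500)
```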
- Using an output folder is significantly faster if you are using an SSD. Otherwise i/o usually becomes the bottleneck.
- Using multiple threads can give you some gains, but avoid more than 4 as this will cause an i/o bottleneck (even on my NVMe SSD!).
- If i/o is your bottleneck, using the JPEG format can lead to significant gains.
- PNG format is pretty slow because of the compression.
- If you want to know the best settings (most settings will be fine anyway) you can clone the project and run python tests.py to get timings.
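For a quick comparison on your own documents, a rough timing loop along these lines may be enough. It is a sketch, not the project's test suite: `time_settings` is a hypothetical helper, and the `convert` parameter is there only so the loop can be tried with a stand-in function.

```python
import time


def time_settings(pdf_path, settings, convert=None):
    """Time one conversion per named group of keyword arguments.

    Hypothetical helper. `settings` maps a label to the kwargs passed to
    the conversion function; returns a dict of label -> seconds.
    """
    if convert is None:
        # Imported lazily so the loop can be tried without pdf2image.
        from pdf2image import convert_from_path as convert
    timings = {}
    for label, kwargs in settings.items():
        start = time.perf_counter()
        convert(pdf_path, **kwargs)
        timings[label] = time.perf_counter() - start
    return timings


# Possible usage:
#   results = time_settings('/home/belval/example.pdf', {
#       'ppm, 1 thread': {'fmt': 'ppm'},
#       'jpeg, 4 threads': {'fmt': 'jpeg', 'thread_count': 4},
#       'png, 4 threads': {'fmt': 'png', 'thread_count': 4},
#   })
#   for label, seconds in sorted(results.items(), key=lambda kv: kv[1]):
#       print(f'{label}: {seconds:.2f}s')
```

A single run per setting is noisy, so repeating each measurement a few times gives a fairer picture.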
Limitations / known issues
- A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder)
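Besides an output folder, one way to keep memory bounded is to convert a few pages at a time with first_page and last_page. The sketch below assumes the caller already knows the page count; `page_ranges` and `convert_in_chunks` are hypothetical helpers, and the `convert` parameter exists only so the logic can be tried with a stand-in.

```python
def page_ranges(total_pages, chunk_size):
    """Yield (first_page, last_page) pairs, 1-indexed and inclusive."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def convert_in_chunks(pdf_path, total_pages, chunk_size, handle_page, convert=None):
    """Convert a PDF chunk by chunk, passing each page to handle_page.

    Hypothetical helper: only one chunk of images is held in memory at a
    time, provided handle_page does not keep references to the pages.
    """
    if convert is None:
        # Imported lazily so the logic can be tried without pdf2image.
        from pdf2image import convert_from_path as convert
    for first, last in page_ranges(total_pages, chunk_size):
        pages = convert(pdf_path, first_page=first, last_page=last)
        for number, page in enumerate(pages, start=first):
            handle_page(number, page)


# Possible usage: save a 300-page PDF ten pages at a time.
#   convert_in_chunks(
#       '/home/belval/example.pdf', 300, 10,
#       lambda n, page: page.save(f'page-{n:04d}.jpg', 'JPEG'),
#   )
```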