A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.

pdf2image

A Python (3.5+) module that wraps pdftoppm and pdftocairo to convert a PDF to a list of PIL Image objects.

How to install

pip install pdf2image

Windows

Windows users will have to install poppler for Windows, then add the bin/ folder to PATH.

Mac

Mac users will have to install poppler for Mac.

Linux

Most distros ship with pdftoppm and pdftocairo. If they are not installed, use your package manager to install poppler-utils.

Platform-independent (Using conda)

  1. Install poppler: conda install -c conda-forge poppler
  2. Install pdf2image: pip install pdf2image

How does it work?

from pdf2image import convert_from_path, convert_from_bytes

from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

Then simply do:

images = convert_from_path('/home/belval/example.pdf')

OR

# Read the PDF inside a with block so the file handle is closed afterwards
with open('/home/belval/example.pdf', 'rb') as f:
    images = convert_from_bytes(f.read())

OR better yet

import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here

images will be a list of PIL Image objects, one for each page of the PDF document.
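The exceptions imported above can be caught around the conversion call. The following is a minimal sketch, not part of pdf2image: the HINTS table, hint_for and convert_or_hint are hypothetical names, and the hint texts reflect an assumption about when each exception is typically raised.

```python
# Hypothetical hint texts; the conditions described are assumptions,
# not statements taken from the pdf2image documentation.
HINTS = {
    "PDFInfoNotInstalledError": "poppler does not seem to be installed or on PATH",
    "PDFPageCountError": "the page count could not be read (is this a valid PDF?)",
    "PDFSyntaxError": "the PDF has syntax errors (raised when strict=True)",
}


def hint_for(exc):
    """Return a readable hint for an exception, looked up by class name."""
    return HINTS.get(type(exc).__name__, f"Unexpected error: {exc}")


def convert_or_hint(pdf_path):
    """Return the list of page images, or None after printing a hint."""
    # Imported inside the function so this sketch can be loaded
    # on a machine where pdf2image is not installed.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    try:
        return convert_from_path(pdf_path)
    except (PDFInfoNotInstalledError, PDFPageCountError, PDFSyntaxError) as exc:
        print(hint_for(exc))
        return None
```

Returning None instead of re-raising is only one possible design; a caller that needs to distinguish the failure modes may prefer to let the exceptions propagate.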

Here are the definitions:

convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False)

convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False)
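Since both functions accept the same keyword parameters, one set of options can be passed to either. Here is a small sketch under that assumption; converter_name and convert_any are hypothetical helpers and are not provided by pdf2image.

```python
def converter_name(source):
    """Pick the pdf2image function name that matches the input type."""
    # Raw PDF data goes to convert_from_bytes, anything else is treated as a path.
    return "convert_from_bytes" if isinstance(source, bytes) else "convert_from_path"


def convert_any(source, **options):
    """Convert a PDF given either as bytes or as a file path."""
    # Imported inside the function so this sketch can be loaded
    # on a machine where pdf2image is not installed.
    import pdf2image

    converter = getattr(pdf2image, converter_name(source))
    # The same keyword options (dpi, first_page, fmt, ...) apply to both functions.
    return converter(source, **options)
```

For example, convert_any('/home/belval/example.pdf', dpi=300, first_page=1, last_page=1, fmt='png', grayscale=True) would render only the first page as a grayscale PNG at 300 dpi.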

Need help?

Use the mattermost chat to ask questions on the helpdesk and get direct support.

What's new?

  • Add use_pdftocairo parameter which forces pdf2image to use pdftocairo. Should improve performance.
  • Fixed a bug where using pdf2image with multiple threads (but not multiple processes) would cause an exception
  • jpegopt parameter allows for tuning of the output JPEG when using fmt="jpeg" (-jpegopt in pdftoppm CLI) (Thank you @abieler)
  • pdfinfo_from_path and pdfinfo_from_bytes which expose the output of the pdfinfo CLI
  • paths_only parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
  • size parameter allows you to define the shape of the resulting images (-scale-to in pdftoppm CLI)
    • size=400 will fit the image to a 400x400 box, preserving aspect ratio
    • size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
    • size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
  • grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
  • single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
  • Allow the user to specify poppler's installation path with poppler_path
  • Fixed a bug where a PNG buffer containing a non-terminating I-E-N-D sequence would throw an exception
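To make the three documented forms of the size parameter concrete, the arithmetic can be modelled in a few lines. This is only an illustration of the behaviour described above, not code from pdf2image: expected_size is a hypothetical helper, page_width and page_height stand for the pixel size the page would have without size, and the exact rounding used by pdftoppm may differ.

```python
def expected_size(page_width, page_height, size):
    """Model the output (width, height) for the documented size forms."""
    if isinstance(size, int):
        # size=400: fit inside a 400x400 box, so the longest side becomes 400.
        scale = size / max(page_width, page_height)
        return round(page_width * scale), round(page_height * scale)

    width, height = size
    if width is not None and height is not None:
        # size=(500, 500): exact output size, aspect ratio not preserved.
        return width, height
    if width is not None:
        # size=(400, None): fixed width, height follows the aspect ratio.
        return width, round(page_height * width / page_width)

    # Other forms are not covered by the list above, so this sketch rejects them.
    raise ValueError("unsupported size form in this sketch")
```

With a 600x800 page, size=400 gives 300x400, size=(400, None) gives 400 pixels wide with the height scaled to match, and size=(500, 500) gives exactly 500x500.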

Performance tips

  • Using an output folder is significantly faster if you are using an SSD. Otherwise i/o usually becomes the bottleneck.
  • Using multiple threads can give you some gains but avoid more than 4 as this will cause i/o bottleneck (even on my NVMe SSD!).
  • If i/o is your bottleneck, using the JPEG format can lead to significant gains.
  • PNG format is pretty slow because of the compression.
  • If you want to know the best settings (most settings will be fine anyway) you can clone the project and run python tests.py to get timings.

Limitations / known issues

  • A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder)
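When an output folder is not an option, another way to bound memory use is to convert a few pages at a time with first_page and last_page. The sketch below assumes that pdfinfo_from_path returns a dictionary with a "Pages" entry; page_batches and iter_pages_in_batches are hypothetical helpers, not part of pdf2image.

```python
def page_batches(page_count, batch_size):
    """Yield inclusive, 1-indexed (first_page, last_page) pairs."""
    for first in range(1, page_count + 1, batch_size):
        yield first, min(first + batch_size - 1, page_count)


def iter_pages_in_batches(pdf_path, batch_size=10):
    """Yield page images one by one, converting batch_size pages per call."""
    # Imported inside the function so this sketch can be loaded
    # on a machine where pdf2image is not installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    # Assumption: the pdfinfo output exposes the page count under "Pages".
    page_count = int(pdfinfo_from_path(pdf_path)["Pages"])
    for first, last in page_batches(page_count, batch_size):
        # Only the images of the current batch are held by this generator.
        yield from convert_from_path(pdf_path, first_page=first, last_page=last)
```

A caller can then loop with for page in iter_pages_in_batches('/home/belval/example.pdf'): and save or process each page before the next batch is converted. The batch size of 10 is an arbitrary default.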
