Python bindings to PDFium
PyPDFium2
PyPDFium2 is a Python 3 binding to PDFium, the liberal-licensed PDF rendering library authored by Foxit and maintained by Google.
Install/Update
Install from PyPI
python3 -m pip install -U pypdfium2
Manual installation
# download binaries / header files and generate bindings
python3 update.py
# build the package that corresponds to your platform
python3 setup_${platform_name}.py bdist_wheel
# optionally, run check-wheel-contents on the package to confirm its validity
check-wheel-contents dist/pypdfium2-${version}-py3-none-${platform_tag}.whl
# install the package
python3 -m pip install -U dist/pypdfium2-${version}-py3-none-${platform_tag}.whl
# remove downloaded files and build artifacts
bash clean.sh
Documentation
API documentation for PDFium is available. PyPDFium2 transparently maps all PDFium classes, enums and functions to Python.
Examples
Using the command-line interface
pypdfium2 -i your_file.pdf -o your_output_dir/ --scale 1 --rotation 0 --optimise-mode none
If you want to render multiple files at once, a bash for-loop may be suitable:
for file in ./*.pdf; do echo "$file" && pypdfium2 -i "$file" -o your_output_dir/ --scale 2; done
To obtain a list of possible command-line parameters, run
pypdfium2 --help
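The same batch rendering can also be driven from Python, which may help on systems without bash. The following is a minimal sketch and not part of PyPDFium2: the helper names `build_command` and `render_all` are hypothetical, and only the `-i`, `-o` and `--scale` options shown above are used.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, output_dir, scale=2):
    """Assemble the pypdfium2 command line for one file (hypothetical helper)."""
    return [
        "pypdfium2",
        "-i", str(pdf_path),
        "-o", str(output_dir),
        "--scale", str(scale),
    ]


def render_all(input_dir, output_dir, scale=2):
    """Render every PDF found directly in input_dir, one CLI call per file."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        print(pdf_path)
        # check=True raises CalledProcessError if the CLI exits with an error
        subprocess.run(build_command(pdf_path, output_dir, scale), check=True)
```

Calling `render_all(".", "your_output_dir/")` would then mirror the bash loop above.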
Using the support model
import pypdfium2 as pdfium
with pdfium.PdfContext(filename) as pdf:
    pil_image = pdfium.render_page(
        pdf,
        page_index = 0,
        scale = 1,
        rotation = 0,
        background_colour = 0xFFFFFFFF,
        render_annotations = True,
        optimise_mode = pdfium.OptimiseMode.none,
    )
pil_image.save("out.png")
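PDF page sizes are expressed in points (1/72 inch). Assuming that `scale = 1` renders one pixel per point, which is an assumption about `render_page` and not something stated above, the scale for a target resolution would be `dpi / 72`. A small sketch with hypothetical helper names:

```python
POINTS_PER_INCH = 72  # default PDF unit: 1 point = 1/72 inch


def scale_for_dpi(dpi):
    """Scale factor that yields `dpi` pixels per inch (assumes scale 1 == 72 dpi)."""
    return dpi / POINTS_PER_INCH


def pixel_size(width_pt, height_pt, scale):
    """Approximate bitmap size in pixels, rounded to the nearest integer.

    The rounding mirrors the int(x + 0.5) idiom of the PDFium API example;
    the rounding actually used by render_page may differ.
    """
    return int(width_pt * scale + 0.5), int(height_pt * scale + 0.5)
```

For instance, a US Letter page (612 x 792 points) rendered with `scale_for_dpi(300)` would come out at roughly 2550 x 3300 pixels.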
Using the PDFium API
import ctypes
from PIL import Image
import pypdfium2 as pdfium
doc = pdfium.FPDF_LoadDocument(filename, None) # load document (filename, password string)
page_count = pdfium.FPDF_GetPageCount(doc) # get page count
assert page_count >= 1
page = pdfium.FPDF_LoadPage(doc, 0) # load the first page
width = int(pdfium.FPDF_GetPageWidthF(page) + 0.5) # get page width
height = int(pdfium.FPDF_GetPageHeightF(page) + 0.5) # get page height
# render to bitmap
bitmap = pdfium.FPDFBitmap_Create(width, height, 0)
pdfium.FPDFBitmap_FillRect(bitmap, 0, 0, width, height, 0xFFFFFFFF)
pdfium.FPDF_RenderPageBitmap(
    bitmap, page, 0, 0, width, height, 0,
    pdfium.FPDF_LCD_TEXT | pdfium.FPDF_ANNOT
)
# retrieve data from bitmap
cbuffer = pdfium.FPDFBitmap_GetBuffer(bitmap)
buffer = ctypes.cast(cbuffer, ctypes.POINTER(ctypes.c_ubyte * (width * height * 4)))
img = Image.frombuffer("RGBA", (width, height), buffer.contents, "raw", "BGRA", 0, 1)
img.save("out.png")
if bitmap is not None:
    pdfium.FPDFBitmap_Destroy(bitmap)
pdfium.FPDF_ClosePage(page)
pdfium.FPDF_CloseDocument(doc)
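In the example above, the cleanup calls are skipped if anything raises before they are reached. One possible guard is a small context manager that pairs each handle with its close function. This is a sketch and not part of PyPDFium2; `managed` is a hypothetical helper that works with any callable.

```python
from contextlib import contextmanager


@contextmanager
def managed(handle, close_fn):
    """Yield `handle` and call `close_fn(handle)` on exit, even after an error."""
    try:
        yield handle
    finally:
        close_fn(handle)


# Possible usage with the calls from the example above (requires pypdfium2):
#
#   with managed(pdfium.FPDF_LoadDocument(filename, None), pdfium.FPDF_CloseDocument) as doc:
#       with managed(pdfium.FPDF_LoadPage(doc, 0), pdfium.FPDF_ClosePage) as page:
#           ...  # render as shown above
```

Because the `with` blocks are nested, the page is closed before the document, matching the order used in the example.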
Licensing
PyPDFium2 source code itself is Apache-2.0 licensed. The auto-generated bindings file contains BSD-3-Clause code.
Documentation and examples are CC-BY-4.0.
PDFium is available by the terms and conditions of either Apache 2.0 or BSD-3-Clause, at your choice.
Various other BSD- and MIT-style licenses apply to the dependencies of PDFium.
License texts for PDFium and its dependencies are included in the file LICENSE-PDFium.txt, which is also shipped with binary re-distributions.
History
PyPDFium2 is the successor of pypdfium and pypdfium-reboot.
The initial pypdfium was packaged manually and did not get regular updates. There were no platform-specific wheels, but only a single wheel that contained binaries for 64-bit Linux, Windows and macOS.
pypdfium-reboot then added a script to automate binary deployment and bindings generation to simplify regular updates. However, it was still not platform specific.
PyPDFium2 is a full rewrite of pypdfium-reboot to build platform-specific wheels. It also adds a basic support model and a command-line interface on top of the PDFium C API to simplify common use cases.
Development
PDFium builds are retrieved from bblanchon/pdfium-binaries. Python bindings are auto-generated with ctypesgen.
Currently supported architectures:
- macOS x86_64 *
- macOS arm64 *
- Linux x86_64
- Linux aarch64 (64-bit ARM) *
- Linux armv7l (32-bit ARM hard-float, e.g. Raspberry Pi 2)
- Windows 64bit
- Windows 32bit *
* Not tested yet
If you have access to a theoretically supported but untested system, please report success or failure on the issues panel.
(Should bblanchon/pdfium-binaries add support for more architectures, PyPDFium2 could be adapted easily.)
For wheel naming conventions, please see Python Packaging: Platform compatibility tags and the various referenced PEPs.
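To illustrate how the supported architectures relate to wheel file names, the sketch below maps a platform to the platform tag found in the published pypdfium2 0.1.0 wheel names. It is not part of PyPDFium2: `WHEEL_TAGS` and `wheel_tag` are hypothetical names, and the `(system, machine)` keys are assumptions about what `platform.system()` and `platform.machine()` report on each target.

```python
import platform

# Platform tags as they appear in the pypdfium2 0.1.0 wheel file names.
# The keys are assumed values of platform.system() / platform.machine().
WHEEL_TAGS = {
    ("Darwin", "x86_64"): "macosx_10_10_x86_64",
    ("Darwin", "arm64"): "macosx_11_0_arm64",
    ("Linux", "x86_64"): "manylinux_2_17_x86_64",
    ("Linux", "aarch64"): "manylinux_2_17_aarch64",
    ("Linux", "armv7l"): "manylinux_2_17_armv7l",
    ("Windows", "AMD64"): "win_amd64",
    ("Windows", "x86"): "win32",
}


def wheel_tag(system=None, machine=None):
    """Return the wheel platform tag for a platform (defaults to the current one)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    try:
        return WHEEL_TAGS[(system, machine)]
    except KeyError:
        raise RuntimeError(f"no known wheel for {system} {machine}") from None
```

Note that `platform.machine()` describes the hardware, so a 32-bit Python running on 64-bit Windows would need extra handling that this sketch omits.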
PyPDFium2 contains scripts to automate the release process:
- To build wheels for all platforms, run ./release.sh. This will download binaries and header files, write finished Python wheels to dist/, and run check-wheel-contents.
- To clean up after a release, run ./clean.sh. This will remove downloaded files and build artifacts.
Publishing the wheels
- You may want to upload to TestPyPI first to ensure everything works as expected:
twine upload --verbose --repository-url https://test.pypi.org/legacy/ dist/*
- If all went well, upload to the real PyPI:
twine upload dist/*
Issues
Since PyPDFium2 is built using upstream binaries and an automatic bindings creator, issues that are not related to packaging most likely need to be addressed upstream. However, the PyPDFium2 issues panel is always a good place to start if you have any problems, questions or suggestions.
If the cause of an issue is determined to be in PDFium, the problem needs to be reported at the PDFium bug tracker.
Issues related to build configuration should be discussed at pdfium-binaries, though.
If your issue is caused by the bindings generator, refer to the ctypesgen bug tracker.
Known limitations
Non-ascii file paths on Windows
On Windows, PDFium is currently unable to open documents whose file names contain multi-byte, non-ASCII characters. This bug was reported in March 2017, but the PDFium development team has not given it much attention so far. The cause of the issue is known and the structure of a fix has been proposed, but it has not been applied yet.
This issue cannot reasonably be worked around in PyPDFium2, for the following reasons:
- Using FPDF_LoadMemDocument() rather than FPDF_LoadDocument() is not possible due to issues with concurrent access to the same file. Moreover, it would be less efficient as the whole document has to be loaded into memory. This makes it impractical for large files.
- FPDF_LoadCustomDocument() is not a solution, since mapping the complex file reading callback to Python is hardly feasible. Furthermore, there would likely be the same problem with concurrent access.
- Creating a tempfile with a compatible name would be possible, but cannot be done in PyPDFium2 itself: For faster rendering, you usually set up a multiprocessing pool or a concurrent future. This means each process has to initialise its own PdfContext. If an automatic tempfile workaround were implemented in PdfContext, this would mean that each process creates its own temporary copy of the file, which would be highly inefficient. The tempfile should be created only once for all pages, not for each page separately. Therefore, this workaround can only be applied downstream. It could be done somewhat like this:
    import sys
    if sys.platform.startswith('win32') and not filename.isascii():
        # create a temporary copy and remap the file name
        # (str.isascii() requires at least Python 3.7)
        ...
  This workaround is currently used for the command-line interface of PyPDFium2 (see __main__.py).
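The tempfile workaround described above could be fleshed out downstream roughly as follows. This is a sketch under stated assumptions, not the code used in `__main__.py`: the names `needs_ascii_copy` and `ascii_safe_path` are hypothetical, and it assumes the system temp directory itself has an ASCII-only path.

```python
import os
import shutil
import sys
import tempfile
from contextlib import contextmanager


def needs_ascii_copy(filename, platform=sys.platform):
    """True if the file name would hit the Windows non-ASCII path limitation."""
    # str.isascii() requires at least Python 3.7
    return platform.startswith("win32") and not filename.isascii()


@contextmanager
def ascii_safe_path(filename, platform=sys.platform):
    """Yield a path PDFium can open, copying to an ASCII-named tempfile if needed."""
    if not needs_ascii_copy(filename, platform):
        yield filename
        return
    # Assumption: the temp directory path contains only ASCII characters.
    tmp_dir = tempfile.mkdtemp(prefix="pdfium_")
    try:
        tmp_path = os.path.join(tmp_dir, "input.pdf")
        shutil.copyfile(filename, tmp_path)
        yield tmp_path
    finally:
        # Remove the temporary copy once all rendering is done
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

The context manager would be entered once in the parent process, and the yielded path passed to every worker, so that the copy is made a single time for all pages.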