Internet Archive PDF compression tools
- Date: 2021-11-14 18:00
This repository contains a library that performs MRC (Mixed Raster Content) compression on images [*]. MRC is a lossy technique that achieves high compression ratios, in particular on images containing text.
Additionally, the library can generate MRC-compressed PDF files with hOCR [†] text layers mixed into the PDF, which makes it possible to search the PDF and copy-paste its text. PDFs generated by bin/recode_pdf should be PDF/A 3b and PDF/UA compatible.
Some of the tooling also supports specific Internet Archive file formats (such as the “scandata.xml” files), but the tooling should work fine without those files, too.
While the code is already being used internally to create PDFs at the Internet Archive, the code still needs more documentation and cleaning up, so don’t expect this to be super well documented just yet.
Features
Reliable: has produced over 6 million PDFs in 2021 alone (each with many hundreds of pages)
Fast and robust compression: Competes directly with the proprietary software offerings when it comes to speed and compressibility (often outperforming in both)
MRC compression of images, yielding compression ratios of roughly 3x to 15x, depending on the quality setting provided.
Creates PDF from a directory of images
Improved compression based on OCR results (hOCR files)
Hidden text layer insertion based on hOCR files, which makes a PDF searchable and the text copy-pasteable.
PDF/A 3b compatible.
Basic PDF/UA support (accessibility features)
Creation of 1 bit (black and white) PDFs
Dependencies
Python 3.x
- Python packages (also see requirements.txt):
PyMuPDF
lxml
scikit-image
Pillow
roman
One-of:
Open source OpenJPEG2000 tools (opj_compress and opj_decompress)
Grok (grk_compress and grk_decompress)
jpegoptim (when using JPEG instead of JPEG2000)
For JBIG2 compression:
jbig2enc (and PyMuPDF 1.19.0 or higher)
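As a quick sanity check before installing, the external tools listed above can be looked up on the PATH. The following is a minimal sketch, not part of archive-pdf-tools: the helper names `find_tools` and `usable_jpeg2000_toolsets` are hypothetical, and the executable name `jbig2` is an assumption about what jbig2enc installs.

```python
import shutil

# Executable names follow the tool names in the dependency list above.
# "jbig2" is assumed to be the binary installed by jbig2enc.
JPEG2000_TOOLSETS = {
    "openjpeg": ("opj_compress", "opj_decompress"),
    "grok": ("grk_compress", "grk_decompress"),
}
OPTIONAL_TOOLS = ("jpegoptim", "jbig2")


def find_tools(which=shutil.which):
    """Return a dict mapping each executable name to True if it is on PATH."""
    names = [name for pair in JPEG2000_TOOLSETS.values() for name in pair]
    names.extend(OPTIONAL_TOOLS)
    return {name: which(name) is not None for name in names}


def usable_jpeg2000_toolsets(found):
    """List the toolsets whose compress and decompress tools are both present."""
    return [
        label
        for label, pair in JPEG2000_TOOLSETS.items()
        if all(found.get(name, False) for name in pair)
    ]


if __name__ == "__main__":
    found = find_tools()
    for name, present in sorted(found.items()):
        print(f"{name:16s} {'found' if present else 'missing'}")
    print("JPEG2000 toolsets usable:", usable_jpeg2000_toolsets(found) or "none")
```

Only one of the two JPEG2000 toolsets needs to be present; jpegoptim and jbig2enc matter only when JPEG or JBIG2 compression is used.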
Installation
First install dependencies. For example, in Ubuntu:
sudo apt install libleptonica-dev libopenjp2-tools libxml2-dev libxslt-dev python3-dev python3-pip
git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make
sudo make install
Because archive-pdf-tools is on the Python Package Index (PyPI), you can use pip (the Python 3 version is often called pip3) to install the latest version:
# Latest version
pip3 install archive-pdf-tools
# Specific version
pip3 install archive-pdf-tools==1.4.14
Alternatively, if you want a specific commit or unreleased version, check out the master branch or a tagged release and use pip to install:
git clone https://github.com/internetarchive/archive-pdf-tools.git
cd archive-pdf-tools
pip3 install .
Finally, if you’ve downloaded a wheel to test a specific commit, you can also install it using pip:
pip3 install --force-reinstall -U --no-deps ./archive_pdf_tools-${version}.whl
To see if archive-pdf-tools is installed correctly for your user, run:
recode_pdf --version
Not well tested features
“Recoding” an existing PDF (extracting its images and creating a new PDF from them) is not well tested. It works reasonably well if every page of the PDF contains just a single image.
Known issues
Using --image-mode 0 and --image-mode 1 is currently broken, so only MRC or no images are supported.
It is not possible to recode/compress a PDF without hOCR files. This will be addressed in the future, since it should not be a problem to generate a PDF lacking hOCR data.
Planned features
Addition of a second set of fonts in the PDFs, so that hidden selected text also renders the original glyphs.
Better background generation (text shade removal from the background)
Better compression parameter selection; I have not yet experimented much with the kakadu and grok/openjpeg2000 parameters.
MRC
The goal of Mixed Raster Content compression is to decompose the image into a background, foreground and mask. The background should contain components that are not of particular interest, whereas the foreground would contain all glyphs/text on a page, as well as the lines and edges of various drawings or images. The mask is a 1-bit image which has the value ‘1’ when a pixel is part of the foreground.
This decomposition allows the components to be compressed individually, with much higher compression applied to specific components, usually the background, which can be downscaled as well. The foreground can also be compressed heavily, since it only needs to contain the approximate colours of the text and other lines. Any artifacts introduced during foreground compression (e.g. ugly artifacts around text borders) are removed by overlaying the mask component of the image, which is losslessly compressed (typically using either JBIG2 or CCITT).
In a PDF, this usually means the background image is inserted into a page, followed by the foreground image, which uses the mask as its alpha layer.
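The decomposition and recomposition described above can be illustrated with a toy example in pure Python. This is only a sketch under simplifying assumptions and not the library's algorithm: the mask comes from a fixed grayscale threshold, the image is a small list of rows, the foreground is left uncompressed, and all function names are hypothetical.

```python
def decompose(image, threshold=128):
    """Split a grayscale image (rows of 0-255 ints) into mask, foreground, background.

    A fixed threshold stands in for real mask generation (an assumption
    made purely for illustration). Mask value 1 marks a foreground pixel.
    """
    mask = [[1 if px < threshold else 0 for px in row] for row in image]
    fg_vals = [px for row, mrow in zip(image, mask) for px, m in zip(row, mrow) if m]
    bg_vals = [px for row, mrow in zip(image, mask) for px, m in zip(row, mrow) if not m]
    # Pixels that belong to the other layer are filled with an average colour,
    # which keeps each layer smooth and therefore cheap to compress.
    fg_fill = sum(fg_vals) // len(fg_vals) if fg_vals else 0
    bg_fill = sum(bg_vals) // len(bg_vals) if bg_vals else 255
    foreground = [[px if m else fg_fill for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    background = [[bg_fill if m else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    return mask, foreground, background


def downsample(image, factor):
    """Shrink an image by averaging factor x factor blocks (a lossy step)."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            block = [image[yy][xx]
                     for yy in range(y, min(y + factor, height))
                     for xx in range(x, min(x + factor, width))]
            row.append(sum(block) // len(block))
        out.append(row)
    return out


def upsample(image, factor, height, width):
    """Scale an image back up to height x width by nearest-neighbour lookup."""
    return [[image[y // factor][x // factor] for x in range(width)]
            for y in range(height)]


def compose(mask, foreground, background):
    """Overlay the foreground on the background wherever the mask is 1."""
    return [[f if m else b for m, f, b in zip(mrow, frow, brow)]
            for mrow, frow, brow in zip(mask, foreground, background)]


if __name__ == "__main__":
    # Light "paper" with a few dark "text" pixels.
    page = [
        [240, 240, 238, 240],
        [240, 20, 22, 240],
        [242, 20, 240, 240],
        [240, 240, 240, 236],
    ]
    mask, fg, bg = decompose(page)
    small_bg = downsample(bg, 2)  # the background tolerates heavy downscaling
    restored = compose(mask, fg, upsample(small_bg, 2, len(page), len(page[0])))
    for row in restored:
        print(row)
```

In this sketch the text pixels survive unchanged because the mask is kept at full resolution, while the background only loses small variations in the paper colour.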
Usage
Creating a PDF from a set of images is pretty straightforward:
recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' \
    --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html \
    --dpi 400 --bg-downsample 3 \
    -m 2 -t 10 --mask-compression jbig2 \
    -o /tmp/example.pdf
[...]
Processed 9 pages at 1.16 seconds/page
Compression ratio: 7.144962
Or, to scan a document, OCR it with Tesseract and save the result as a compressed PDF (JPEG2000 compression with OpenJPEG, background downsampled by a factor of three), with text layer:
scanimage --resolution 300 --mode Color --format tiff | tee /tmp/scan.tiff | tesseract - - hocr > /tmp/scan.hocr ; recode_pdf -v -J openjpeg --bg-downsample 3 --from-imagestack /tmp/scan.tiff --hocr-file /tmp/scan.hocr -o /tmp/scan.pdf
[...]
Processed 1 pages at 11.40 seconds/page
Compression ratio: 249.876613
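When recode_pdf is driven from a script, it can help to assemble the argument list and read back the summary lines programmatically. The sketch below uses only the flags and the output format shown in the examples above; the helper names `build_recode_command` and `parse_summary` are hypothetical, and whether the summary is written to stdout or stderr is not stated here, so the caller is assumed to capture both.

```python
import re
import shlex


def build_recode_command(image_glob, hocr_file, output_pdf,
                         dpi=None, bg_downsample=None, mask_compression=None):
    """Build an argument list for recode_pdf from the flags shown above."""
    cmd = ["recode_pdf", "--from-imagestack", image_glob, "--hocr-file", hocr_file]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if bg_downsample is not None:
        cmd += ["--bg-downsample", str(bg_downsample)]
    if mask_compression is not None:
        cmd += ["--mask-compression", mask_compression]
    cmd += ["-o", output_pdf]
    return cmd


# Matches the two summary lines printed at the end of the example runs above.
SUMMARY_RE = re.compile(
    r"Processed (\d+) pages at ([\d.]+) seconds/page\s+Compression ratio: ([\d.]+)"
)


def parse_summary(output):
    """Extract page count, speed and compression ratio, or None if absent."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return {
        "pages": int(match.group(1)),
        "seconds_per_page": float(match.group(2)),
        "compression_ratio": float(match.group(3)),
    }


if __name__ == "__main__":
    cmd = build_recode_command(
        "sim_english-illustrated-magazine_1884-12_2_15_jp2/*",
        "sim_english-illustrated-magazine_1884-12_2_15_hocr.html",
        "/tmp/example.pdf",
        dpi=400, bg_downsample=3, mask_compression="jbig2",
    )
    print(shlex.join(cmd))
    # To run it for real (requires recode_pdf on PATH):
    #   import subprocess
    #   result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    #   print(parse_summary(result.stdout + result.stderr))
```

Passing the arguments as a list avoids shell quoting problems, and the image pattern reaches recode_pdf unexpanded, just as the quoted pattern does in the shell example.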
Examining the results
mrcview (tools/mrcview) is shipped with the package and can be used to turn an MRC-compressed PDF into a PDF with each layer on a separate page. This is the easiest way to inspect the resulting compression. Run it like so:
mrcview /tmp/compressed.pdf /tmp/mrc.pdf
There is also maskview, which just renders the masks of a PDF to another PDF.
Alternatively, you can use pdfimages to extract the image layers of a specific page and then view them with your favourite image viewer:
pageno=0; pdfimages -f $pageno -l $pageno -png path_to_pdf extracted_image_base
feh extracted_image_base*.png
tools/pdfimagesmrc can be used to check how the size of the PDF is broken down into the foreground, background, masks and text layer.
License
License for all code (minus internetarchive/pdfrenderer.py) is AGPL 3.0.
internetarchive/pdfrenderer.py is Apache 2.0, which matches the Tesseract license for that file.