Internet Archive PDF compression tools

Authors:

Merlijn Wajer <merlijn@archive.org>

Date:
2021-11-14 18:00

This repository contains a library to perform MRC (Mixed Raster Content) compression on images. MRC offers high lossy compression, in particular for images that contain text.

Additionally, the library can generate MRC-compressed PDF files with hOCR text layers mixed into the PDF, which makes it possible to search the PDF and to copy-paste its text. PDFs generated by bin/recode_pdf should be PDF/A 3b and PDF/UA compatible.

Some of the tooling also supports specific Internet Archive file formats (such as the “scandata.xml” files), but the tooling should work fine without those files, too.

While the code is already being used internally to create PDFs at the Internet Archive, the code still needs more documentation and cleaning up, so don’t expect this to be super well documented just yet.

Features

  • Reliable: has produced over 6 million PDFs in 2021 alone (each with many hundreds of pages)

  • Fast and robust compression: Competes directly with the proprietary software offerings when it comes to speed and compressibility (often outperforming in both)

  • MRC compression of images, leading to anywhere from 3-15x compression ratios, depending on the quality setting provided.

  • Creates PDF from a directory of images

  • Improved compression based on OCR results (hOCR files)

  • Hidden text layer insertion based on hOCR files, which makes a PDF searchable and the text copy-pasteable.

  • PDF/A 3b compatible.

  • Basic PDF/UA support (accessibility features)

  • Creation of 1 bit (black and white) PDFs
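The hidden text layer relies on the word positions stored in hOCR files. As a rough illustration (not the library's own parser), the sketch below reads `ocrx_word` elements and their `bbox` properties from a small hOCR fragment using only the Python standard library, then converts the pixel coordinates to PDF points. The class name `HocrWordParser`, the helper `bbox_to_pdf_points` and the sample markup are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1)) in pixels
        self._bbox = None  # bbox of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            # The title attribute holds ';'-separated hOCR properties.
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if len(parts) == 5 and parts[0] == "bbox":
                    self._bbox = tuple(int(v) for v in parts[1:])
                    self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes no nested <span> inside a word.
        if self._bbox is not None and tag == "span":
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def bbox_to_pdf_points(bbox, dpi, page_height_px):
    """Convert a top-left-origin pixel bbox to bottom-left-origin PDF points."""
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi  # one PDF point is 1/72 inch
    return (x0 * scale, (page_height_px - y1) * scale,
            x1 * scale, (page_height_px - y0) * scale)


# Hypothetical hOCR fragment for a 2400x3200 pixel page.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 2400 3200'>
<span class='ocr_line' title='bbox 100 200 900 260'>
<span class='ocrx_word' title='bbox 100 200 400 260; x_wconf 96'>Mixed</span>
<span class='ocrx_word' title='bbox 420 200 900 260; x_wconf 93'>Raster</span>
</span></div>"""

parser = HocrWordParser()
parser.feed(SAMPLE)
for text, bbox in parser.words:
    print(text, bbox, bbox_to_pdf_points(bbox, dpi=400, page_height_px=3200))
```

The vertical flip in `bbox_to_pdf_points` is needed because hOCR measures from the top-left corner of the page, while PDF user space starts at the bottom-left.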

Dependencies

  • Python 3.x

  • Python packages (also see requirements.txt):

One-of:

For JBIG2 compression:

  • jbig2enc for JBIG2 compression (and PyMuPDF 1.19.0 or higher)

Installation

First install dependencies. For example, in Ubuntu:

sudo apt install libleptonica-dev libopenjp2-tools libxml2-dev libxslt-dev python3-dev python3-pip
git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make
sudo make install

Because archive-pdf-tools is on the Python Package Index (PyPI), you can use pip (the Python 3 version is often called pip3) to install the latest version:

# Latest version
pip3 install archive-pdf-tools

# Specific version
pip3 install archive-pdf-tools==1.4.14

Alternatively, if you want a specific commit or unreleased version, check out the master branch or a tagged release and use pip to install:

git clone https://github.com/internetarchive/archive-pdf-tools.git
cd archive-pdf-tools
pip3 install .

Finally, if you’ve downloaded a wheel to test a specific commit, you can also install it using pip:

pip3 install --force-reinstall -U --no-deps ./archive_pdf_tools-${version}.whl

To see if archive-pdf-tools is installed correctly for your user, run:

recode_pdf --version
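The same check can be scripted. The sketch below is only a convenience wrapper around the standard library: it looks for the `recode_pdf` and `mrcview` executables on `PATH` and asks the Python packaging metadata for the installed `archive-pdf-tools` version. The helper names `find_tool` and `installed_version` are made up for this example.

```python
import shutil
from importlib import metadata


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Both command names appear in this README; the distribution name is the
# one used with pip above.
for tool in ("recode_pdf", "mrcview"):
    print(tool, "->", find_tool(tool) or "not found on PATH")
print("archive-pdf-tools version:", installed_version("archive-pdf-tools") or "not installed")
```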

Not well tested features

  • “Recoding” an existing PDF (extracting its images and creating a new PDF from them) is not well tested. This works OK if every PDF page contains just a single image.

Known issues

  • Using --image-mode 0 and --image-mode 1 is currently broken, so only MRC or no images are supported.

  • It is not possible to recode/compress a PDF without hOCR files. This will be addressed in the future, since it should not be a problem to generate a PDF lacking hOCR data.

Planned features

  • Addition of a second set of fonts in the PDFs, so that hidden selected text also renders the original glyphs.

  • Better background generation (text shade removal from the background)

  • Better compression parameter selection: I have not toyed around that much with the kakadu and grok/openjpeg2000 parameters yet.

MRC

The goal of Mixed Raster Content compression is to decompose the image into a background, foreground and mask. The background should contain components that are not of particular interest, whereas the foreground would contain all glyphs/text on a page, as well as the lines and edges of various drawings or images. The mask is a 1-bit image which has the value ‘1’ when a pixel is part of the foreground.

This decomposition can then be used to compress the different components individually, applying much higher compression to specific components, usually the background, which can be downscaled as well. The foreground can be compressed quite heavily as well, since it mostly just needs to contain the approximate colours of the text and other lines. Any artifacts introduced during the foreground compression (e.g. ugly artifacts around text borders) are removed by overlaying the mask component of the image, which is losslessly compressed (typically using either JBIG2 or CCITT).

In a PDF, this usually means the background image is inserted into a page, followed by the foreground image, which uses the mask as its alpha layer.
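To make the decomposition concrete, here is a toy sketch in plain Python on a tiny grayscale "page". It is not the algorithm used by this library: the mask comes from a simple global threshold (an assumption made purely for illustration), holes in each layer are filled with that layer's mean value, and the background is downsampled and upsampled with nearest-neighbour sampling. All function names are hypothetical.

```python
def make_mask(gray, threshold=128):
    """1 where a pixel is darker than the threshold (treated as ink), else 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def split_layers(gray, mask):
    """Split into foreground and background, filling holes with each layer's mean."""
    pairs = [(px, m) for row, mrow in zip(gray, mask) for px, m in zip(row, mrow)]
    fg_vals = [px for px, m in pairs if m]
    bg_vals = [px for px, m in pairs if not m]
    fg_fill = sum(fg_vals) // len(fg_vals) if fg_vals else 0
    bg_fill = sum(bg_vals) // len(bg_vals) if bg_vals else 255
    fg = [[px if m else fg_fill for px, m in zip(row, mrow)]
          for row, mrow in zip(gray, mask)]
    bg = [[bg_fill if m else px for px, m in zip(row, mrow)]
          for row, mrow in zip(gray, mask)]
    return fg, bg


def downsample(img, factor):
    """Keep every factor-th pixel in both directions."""
    return [row[::factor] for row in img[::factor]]


def upsample(img, factor, width, height):
    """Nearest-neighbour upscale back to width x height."""
    return [[img[y // factor][x // factor] for x in range(width)]
            for y in range(height)]


def recompose(fg, bg, mask):
    """Overlay the foreground on the background wherever the mask is 1."""
    return [[f if m else b for f, b, m in zip(frow, brow, mrow)]
            for frow, brow, mrow in zip(fg, bg, mask)]


# A 4x4 page: light paper (250) with three dark "ink" pixels (10).
page = [
    [250, 250, 250, 250],
    [250,  10,  10, 250],
    [250,  10, 250, 250],
    [250, 250, 250, 250],
]
mask = make_mask(page)
fg, bg = split_layers(page, mask)
small_bg = downsample(bg, 2)  # the background tolerates aggressive shrinking
restored = recompose(fg, upsample(small_bg, 2, 4, 4), mask)
print("mask:", mask)
print("restored equals original:", restored == page)
```

Because the mask is kept at full resolution, the sharp edges of the text survive even though the background was stored at a quarter of its original pixel count.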

Usage

Creating a PDF from a set of images is pretty straightforward:

recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' \
    --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html \
    --dpi 400 --bg-downsample 3 \
    -m 2 -t 10 --mask-compression jbig2 \
    -o /tmp/example.pdf
[...]
Processed 9 pages at 1.16 seconds/page
Compression ratio: 7.144962
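When recode_pdf is driven from a script, it can help to build the argument list in one place. The sketch below only assembles the options shown in the example above and does not document any others; the `-m 2 -t 10` arguments are passed through verbatim from that example. The function names are hypothetical, and `run_recode` is defined but not called here, since it needs recode_pdf and real input files.

```python
import shlex
import subprocess


def build_recode_command(image_glob, hocr_file, out_pdf, dpi=400,
                         bg_downsample=3, mask_compression="jbig2",
                         extra_args=("-m", "2", "-t", "10")):
    """Assemble an argv list mirroring the recode_pdf example in this README."""
    cmd = [
        "recode_pdf",
        # The glob is passed unexpanded, like the quoted pattern in the shell example.
        "--from-imagestack", image_glob,
        "--hocr-file", hocr_file,
        "--dpi", str(dpi),
        "--bg-downsample", str(bg_downsample),
    ]
    cmd += list(extra_args)  # copied verbatim from the README example
    cmd += ["--mask-compression", mask_compression, "-o", out_pdf]
    return cmd


def run_recode(cmd):
    """Run the command and return its captured standard output (raises on failure)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


cmd = build_recode_command(
    "sim_english-illustrated-magazine_1884-12_2_15_jp2/*",
    "sim_english-illustrated-magazine_1884-12_2_15_hocr.html",
    "/tmp/example.pdf",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```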

Or, to scan a document, OCR it with Tesseract and save the result as a compressed PDF (JPEG2000 compression with OpenJPEG, background downsampled by a factor of three), with text layer:

scanimage --resolution 300 --mode Color --format tiff \
    | tee /tmp/scan.tiff \
    | tesseract - - hocr > /tmp/scan.hocr
recode_pdf -v -J openjpeg --bg-downsample 3 \
    --from-imagestack /tmp/scan.tiff --hocr-file /tmp/scan.hocr \
    -o /tmp/scan.pdf
[...]
Processed 1 pages at 11.40 seconds/page
Compression ratio: 249.876613
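The two summary lines printed at the end of a run are easy to pick up from a script, for example to log throughput across many jobs. The sketch below assumes the output keeps exactly the wording shown in the samples above; the function name `parse_summary` and the returned dictionary keys are hypothetical.

```python
import re

# Patterns modelled on the sample output shown in this README.
PAGES_RE = re.compile(r"Processed (\d+) pages? at ([\d.]+) seconds/page")
RATIO_RE = re.compile(r"Compression ratio: ([\d.]+)")


def parse_summary(output):
    """Extract page count, timing and compression ratio, or None if absent."""
    pages_match = PAGES_RE.search(output)
    ratio_match = RATIO_RE.search(output)
    if pages_match is None or ratio_match is None:
        return None
    pages = int(pages_match.group(1))
    seconds_per_page = float(pages_match.group(2))
    return {
        "pages": pages,
        "seconds_per_page": seconds_per_page,
        "total_seconds": pages * seconds_per_page,
        "compression_ratio": float(ratio_match.group(1)),
    }


sample = "Processed 9 pages at 1.16 seconds/page\nCompression ratio: 7.144962\n"
print(parse_summary(sample))
```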

Examining the results

mrcview (tools/mrcview) is shipped with the package and can be used to turn an MRC-compressed PDF into a PDF with each layer on a separate page. This is the easiest way to inspect the resulting compression. Run it like so:

mrcview /tmp/compressed.pdf /tmp/mrc.pdf

There is also maskview, which just renders the masks of a PDF to another PDF.

Alternatively, one could use pdfimages to extract the image layers of a specific page and then view them with your favourite image viewer:

pageno=1; pdfimages -f $pageno -l $pageno -png path_to_pdf extracted_image_base
feh extracted_image_base*.png

tools/pdfimagesmrc can be used to check how the size of the PDF is broken down into the foreground, background, masks and text layer.

License

License for all code (minus internetarchive/pdfrenderer.py) is AGPL 3.0.

internetarchive/pdfrenderer.py is Apache 2.0, which matches the Tesseract license for that file.
