
Internet Archive PDF compression tools


Authors:

Merlijn Wajer <merlijn@archive.org>

Date:
2021-11-14 18:00

This repository contains a library to perform MRC (Mixed Raster Content) compression on images. MRC offers high, lossy compression, in particular for images that contain text.

Additionally, the library can generate MRC-compressed PDF files with hOCR text layers mixed into the PDF, which makes it possible to search the PDF and copy-paste its text. PDFs generated by bin/recode_pdf should be PDF/A 3b and PDF/UA compatible.

Some of the tooling also supports specific Internet Archive file formats (such as the “scandata.xml” files), but the tooling should work fine without those files, too.

While the code is already being used internally to create PDFs at the Internet Archive, the code still needs more documentation and cleaning up, so don’t expect this to be super well documented just yet.

Features

  • Reliable: has produced over 6 million PDFs in 2021 alone (each with many hundreds of pages)

  • Fast and robust compression: Competes directly with the proprietary software offerings when it comes to speed and compressibility (often outperforming in both)

  • MRC compression of images, leading to anywhere from 3-15x compression ratios, depending on the quality setting provided.

  • Creates a PDF from a directory of images

  • Improved compression based on OCR results (hOCR files)

  • Hidden text layer insertion based on hOCR files, which makes a PDF searchable and the text copy-pasteable.

  • PDF/A 3b compatible.

  • Basic PDF/UA support (accessibility features)

  • Creation of 1 bit (black and white) PDFs

Dependencies

  • Python 3.x

  • Python packages (also see requirements.txt):

One-of:

For JBIG2 compression:

  • jbig2enc for JBIG2 compression (and PyMuPDF 1.19.0 or higher)

Installation

First install dependencies. For example, in Ubuntu:

sudo apt install libleptonica-dev libopenjp2-tools libxml2-dev libxslt-dev python3-dev python3-pip
git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make
sudo make install

Because archive-pdf-tools is on the Python Package Index (PyPI), you can use pip (the Python 3 version is often called pip3) to install the latest version:

# Latest version
pip3 install archive-pdf-tools

# Specific version
pip3 install archive-pdf-tools==1.4.14

Alternatively, if you want a specific commit or unreleased version, check out the master branch or a tagged release and use pip to install:

git clone https://github.com/internetarchive/archive-pdf-tools.git
cd archive-pdf-tools
pip3 install .

Finally, if you’ve downloaded a wheel to test a specific commit, you can also install it using pip:

pip3 install --force-reinstall -U --no-deps ./archive_pdf_tools-${version}.whl

To see if archive-pdf-tools is installed correctly for your user, run:

recode_pdf --version
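
The same check can be scripted. The following is a minimal sketch that uses only the Python standard library; the helper names installed_version and script_on_path are made up for this illustration and are not part of archive-pdf-tools.

```python
import shutil
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "archive-pdf-tools") -> Optional[str]:
    """Return the installed version of a distribution, or None if pip does not know it."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def script_on_path(name: str = "recode_pdf") -> bool:
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(name) is not None


print("archive-pdf-tools version:", installed_version())
print("recode_pdf on PATH:", script_on_path())
```

If the version is reported but the script is not found on PATH, the directory that pip installs scripts into is probably missing from PATH for your user.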

Not well tested features

  • “Recoding” an existing PDF, extracting the images and creating a new PDF with the images from the existing PDF is not well tested. This works OK if every PDF page just has a single image.

Known issues

  • Using --image-mode 0 and --image-mode 1 is currently broken, so only MRC or no images are supported.

  • It is not possible to recode/compress a PDF without hOCR files. This will be addressed in the future, since it should not be a problem to generate a PDF lacking hOCR data.

Planned features

  • Addition of a second set of fonts in the PDFs, so that hidden selected text also renders the original glyphs.

  • Better background generation (text shade removal from the background)

  • Better compression parameter selection: I have not toyed around that much with kakadu and grok/openjpeg2000 parameters.

MRC

The goal of Mixed Raster Content compression is to decompose the image into a background, foreground and mask. The background should contain components that are not of particular interest, whereas the foreground would contain all glyphs/text on a page, as well as the lines and edges of various drawings or images. The mask is a 1-bit image which has the value ‘1’ when a pixel is part of the foreground.

This decomposition can then be used to compress the different components individually, applying much higher compression to specific components, usually the background, which can be downscaled as well. The foreground can be compressed quite heavily too, since it mostly just needs to contain the approximate colours of the text and other lines. Any artifacts introduced during the foreground compression (e.g. ugly artifacts around text borders) are removed by overlaying the mask component of the image, which is losslessly compressed (typically using either JBIG2 or CCITT).

In a PDF, this usually means the background image is inserted into a page, followed by the foreground image, which uses the mask as its alpha layer.
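
To make the decomposition concrete, here is a toy sketch in pure Python. It is not the algorithm this library uses: the mask comes from a plain global threshold (an assumption made purely for illustration), images are small nested lists of grayscale values, and the function names decompose and compose are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels: 0 = black, 255 = white


def decompose(image: Image, threshold: int = 128) -> Tuple[Image, Image, Image]:
    """Split an image into (mask, foreground, background) with a global threshold."""
    # Mask is 1 where the pixel is dark enough to count as foreground (text, lines).
    mask = [[1 if px < threshold else 0 for px in row] for row in image]
    fg = [px for row, mrow in zip(image, mask) for px, m in zip(row, mrow) if m]
    bg = [px for row, mrow in zip(image, mask) for px, m in zip(row, mrow) if not m]
    fg_fill = sum(fg) // len(fg) if fg else 0
    bg_fill = sum(bg) // len(bg) if bg else 255
    # Pixels a layer does not own are filled with that layer's average colour,
    # which keeps each layer smooth and therefore cheap to compress.
    foreground = [[px if m else fg_fill for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    background = [[bg_fill if m else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    return mask, foreground, background


def compose(mask: Image, foreground: Image, background: Image) -> Image:
    """Overlay the foreground on the background wherever the mask is 1."""
    return [[f if m else b for m, f, b in zip(mrow, frow, brow)]
            for mrow, frow, brow in zip(mask, foreground, background)]


page = [
    [255, 255, 255, 255],
    [255,  20,  30, 255],
    [255, 255, 250, 255],
]
mask, foreground, background = decompose(page)
# Without lossy compression of the layers, recomposition is exact.
assert compose(mask, foreground, background) == page
```

In a real pipeline the foreground and background would be compressed lossily (and the background downscaled) before recomposition, so the result is only approximate; the losslessly stored mask is what keeps text edges sharp.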

Usage

Creating a PDF from a set of images is pretty straightforward:

recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' \
    --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html \
    --dpi 400 --bg-downsample 3 \
    -m 2 -t 10 --mask-compression jbig2 \
    -o /tmp/example.pdf
[...]
Processed 9 pages at 1.16 seconds/page
Compression ratio: 7.144962
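
When recode_pdf is driven from a script, the same invocation can be assembled as an argument list. This is a sketch only: build_recode_command is a hypothetical helper, and every flag in it is copied from the example above rather than taken from a full reference of the command line interface.

```python
import shlex
from typing import List


def build_recode_command(image_glob: str, hocr_file: str, out_pdf: str,
                         dpi: int = 400, bg_downsample: int = 3) -> List[str]:
    """Assemble a recode_pdf argument list that mirrors the example invocation."""
    return [
        "recode_pdf",
        # The glob is passed as one literal argument, like the quoted pattern above.
        "--from-imagestack", image_glob,
        "--hocr-file", hocr_file,
        "--dpi", str(dpi),
        "--bg-downsample", str(bg_downsample),
        # Copied verbatim from the example invocation.
        "-m", "2", "-t", "10",
        "--mask-compression", "jbig2",
        "-o", out_pdf,
    ]


cmd = build_recode_command(
    "sim_english-illustrated-magazine_1884-12_2_15_jp2/*",
    "sim_english-illustrated-magazine_1884-12_2_15_hocr.html",
    "/tmp/example.pdf",
)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems, and check=True raises an exception if recode_pdf exits with a non-zero status.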

Or, to scan a document, OCR it with Tesseract and save the result as a compressed PDF with a text layer (JPEG2000 compression with OpenJPEG, background downsampled by a factor of three):

scanimage --resolution 300 --mode Color --format tiff | tee /tmp/scan.tiff | tesseract - - hocr > /tmp/scan.hocr
recode_pdf -v -J openjpeg --bg-downsample 3 \
    --from-imagestack /tmp/scan.tiff --hocr-file /tmp/scan.hocr \
    -o /tmp/scan.pdf
[...]
Processed 1 pages at 11.40 seconds/page
Compression ratio: 249.876613
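
The summary lines printed at the end of a run can be collected for batch statistics. The sketch below assumes the output keeps the exact wording shown in the two sample runs above, which is not a documented or stable interface; parse_summary and RecodeSummary are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class RecodeSummary(NamedTuple):
    pages: int
    seconds_per_page: float
    compression_ratio: float


_PAGES_RE = re.compile(r"Processed (\d+) pages at ([\d.]+) seconds/page")
_RATIO_RE = re.compile(r"Compression ratio: ([\d.]+)")


def parse_summary(output: str) -> Optional[RecodeSummary]:
    """Extract page count, speed and compression ratio, or None if either line is missing."""
    pages = _PAGES_RE.search(output)
    ratio = _RATIO_RE.search(output)
    if pages is None or ratio is None:
        return None
    return RecodeSummary(int(pages.group(1)), float(pages.group(2)), float(ratio.group(1)))


sample = "Processed 9 pages at 1.16 seconds/page\nCompression ratio: 7.144962\n"
print(parse_summary(sample))
```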

Examining the results

mrcview (tools/mrcview) is shipped with the package and can be used to turn an MRC-compressed PDF into a PDF with each layer on a separate page. This is the easiest way to inspect the resulting compression. Run it like so:

mrcview /tmp/compressed.pdf /tmp/mrc.pdf

There is also maskview, which just renders the masks of a PDF to another PDF.

Alternatively, you can use pdfimages to extract the image layers of a specific page and then view them with your favourite image viewer:

pageno=1; pdfimages -f "$pageno" -l "$pageno" -png path_to_pdf extracted_image_base
feh extracted_image_base*.png

tools/pdfimagesmrc can be used to check how the size of the PDF is broken down into the foreground, background, masks and text layer.

License

License for all code (minus internetarchive/pdfrenderer.py) is AGPL 3.0.

internetarchive/pdfrenderer.py is Apache 2.0, which matches the Tesseract license for that file.
