Internet Archive PDF compression tools

Authors:

Merlijn Wajer <merlijn@archive.org>

Date:
2021-11-14 18:00

This repository contains a library to perform MRC (Mixed Raster Content) compression on images, which offers high (lossy) compression, in particular for images that contain text.

Additionally, the library can generate MRC-compressed PDF files with hOCR text layers mixed into the PDF, which makes searching and copy-pasting text in the PDF possible. PDFs generated by bin/recode_pdf should be PDF/A 3b and PDF/UA compatible.

Some of the tooling also supports specific Internet Archive file formats (such as the “scandata.xml” files), but the tooling should work fine without those files, too.

While the code is already being used internally to create PDFs at the Internet Archive, the code still needs more documentation and cleaning up, so don’t expect this to be super well documented just yet.

Features

  • Reliable: has produced over 6 million PDFs in 2021 alone (each with many hundreds of pages)

  • Fast and robust compression: Competes directly with the proprietary software offerings when it comes to speed and compressibility (often outperforming in both)

  • MRC compression of images, leading to anywhere from 3-15x compression ratios, depending on the quality setting provided.

  • Creates PDF from a directory of images

  • Improved compression based on OCR results (hOCR files)

  • Hidden text layer insertion based on hOCR files, which makes a PDF searchable and the text copy-pasteable.

  • PDF/A 3b compatible.

  • Basic PDF/UA support (accessibility features)

  • Creation of 1 bit (black and white) PDFs

Dependencies

  • Python 3.x

  • Python packages (also see requirements.txt):

One-of:

For JBIG2 compression:

  • jbig2enc for JBIG2 compression (and PyMuPDF 1.19.0 or higher)

Installation

First install dependencies. For example, in Ubuntu:

sudo apt install libleptonica-dev libopenjp2-tools libxml2-dev libxslt-dev python3-dev python3-pip

sudo apt install automake libtool
git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make
sudo make install

Because archive-pdf-tools is on the Python Package Index (PyPI), you can use pip (the Python 3 version is often called pip3) to install the latest version:

# Latest version
pip3 install archive-pdf-tools

# Specific version
pip3 install archive-pdf-tools==1.4.14

Alternatively, if you want a specific commit or unreleased version, check out the master branch or a tagged release and use pip to install:

git clone https://github.com/internetarchive/archive-pdf-tools.git
cd archive-pdf-tools
pip3 install .

Finally, if you’ve downloaded a wheel to test a specific commit, you can also install it using pip:

pip3 install --force-reinstall -U --no-deps ./archive_pdf_tools-${version}.whl

To see if archive-pdf-tools is installed correctly for your user, run:

recode_pdf --version
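
If you drive the tools from Python, a quick way to check that the command-line scripts ended up on your PATH is sketched below. It only uses the standard library; the helper name find_tools is made up for this example, and the assumption that mrcview, maskview and pdfimagesmrc are installed as scripts under exactly those names is taken from the rest of this document and may not hold for every install.

```python
import shutil


def find_tools(names):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Tool names as mentioned in this README (assumed to be installed as scripts).
status = find_tools(["recode_pdf", "mrcview", "maskview", "pdfimagesmrc"])

for name, path in status.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

A missing entry usually means the pip scripts directory for your user is not on PATH.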

Not well tested features

  • “Recoding” an existing PDF, extracting the images and creating a new PDF with the images from the existing PDF is not well tested. This works OK if every PDF page just has a single image.

Known issues

  • Using --image-mode 0 and --image-mode 1 is currently broken, so only MRC or no images are supported.

  • It is not possible to recode/compress a PDF without hOCR files. This will be addressed in the future, since it should not be a problem to generate a PDF lacking hOCR data.

Planned features

  • Addition of a second set of fonts in the PDFs, so that hidden selected text also renders the original glyphs.

  • Better background generation (text shade removal from the background)

  • Better compression parameter selection: I have not yet experimented much with kakadu and grok/openjpeg2000 parameters.

MRC

The goal of Mixed Raster Content compression is to decompose the image into a background, foreground and mask. The background should contain components that are not of particular interest, whereas the foreground would contain all glyphs/text on a page, as well as the lines and edges of various drawings or images. The mask is a 1-bit image which has the value ‘1’ when a pixel is part of the foreground.

This decomposition can then be used to compress the different components individually, applying much higher compression to specific components, usually the background, which can be downscaled as well. The foreground can be compressed quite heavily as well, since it mostly just needs to contain the approximate colours of the text and other lines. Any artifacts introduced during the foreground compression (e.g. ugly artifacts around text borders) are removed by overlaying the mask component of the image, which is losslessly compressed (typically using either JBIG2 or CCITT).

In a PDF, this usually means the background image is inserted into a page, followed by the foreground image, which uses the mask as its alpha layer.
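
The decomposition described above can be illustrated with a toy example on a tiny grayscale image. This is only a sketch of the idea and not how this library builds its layers: the fixed threshold, the white filler for the background, the block-average downsampling and all function names are assumptions made for illustration.

```python
# Toy MRC decomposition of a 4x4 grayscale image (0 = black, 255 = white).
# Assumption: a fixed threshold decides what is foreground; real MRC
# implementations use far more careful binarisation and layer filling.

def make_mask(image, threshold=128):
    """1-bit mask: 1 where a pixel is dark enough to count as foreground."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def split_layers(image, mask, fill_fg=0, fill_bg=255):
    """Split the image into foreground and background using the mask."""
    fg = [[px if m else fill_fg for px, m in zip(row, mrow)]
          for row, mrow in zip(image, mask)]
    bg = [[fill_bg if m else px for px, m in zip(row, mrow)]
          for row, mrow in zip(image, mask)]
    return fg, bg


def downsample(layer, factor):
    """Average non-overlapping factor x factor blocks (lossy)."""
    h, w = len(layer), len(layer[0])
    out = []
    for y in range(0, h, factor):
        row = []
        for x in range(0, w, factor):
            block = [layer[yy][xx]
                     for yy in range(y, min(y + factor, h))
                     for xx in range(x, min(x + factor, w))]
            row.append(sum(block) // len(block))
        out.append(row)
    return out


def upsample(layer, factor, height, width):
    """Nearest-neighbour upscaling back to the original size."""
    return [[layer[y // factor][x // factor] for x in range(width)]
            for y in range(height)]


def recompose(fg, bg, mask):
    """Draw the background, then the foreground wherever the mask is 1."""
    return [[f if m else b for f, b, m in zip(frow, brow, mrow)]
            for frow, brow, mrow in zip(fg, bg, mask)]


# One dark vertical stroke (column 2) on a light page.
image = [
    [250, 250, 10, 250],
    [250, 250, 10, 250],
    [240, 240, 20, 240],
    [240, 240, 20, 240],
]

mask = make_mask(image)
fg, bg = split_layers(image, mask)
small_bg = downsample(bg, 2)                  # background stored at half size
restored = recompose(fg, upsample(small_bg, 2, 4, 4), mask)

for row in restored:
    print(row)
```

In this toy, the stroke pixels selected by the mask come back unchanged while the background is only approximated, which is the trade-off MRC relies on.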

Usage

Creating a PDF from a set of images is pretty straightforward:

recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' \
    --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html \
    --dpi 400 --bg-downsample 3 \
    -m 2 -t 10 --mask-compression jbig2 \
    -o /tmp/example.pdf
[...]
Processed 9 pages at 1.16 seconds/page
Compression ratio: 7.144962
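
When calling recode_pdf from a script, the same invocation can be assembled as an argument list. The sketch below only reuses the flags shown in the example above; the helper name build_recode_command and its parameter names are made up for illustration, and -m 2 and -t 10 are copied verbatim from the example rather than explained here.

```python
import shlex


def build_recode_command(image_glob, hocr_file, out_pdf,
                         dpi=400, bg_downsample=3, mask_compression="jbig2"):
    """Return the recode_pdf argument list used in the example above."""
    return [
        "recode_pdf",
        "--from-imagestack", image_glob,   # glob is passed through unexpanded
        "--hocr-file", hocr_file,
        "--dpi", str(dpi),
        "--bg-downsample", str(bg_downsample),
        "-m", "2", "-t", "10",             # copied as-is from the example
        "--mask-compression", mask_compression,
        "-o", out_pdf,
    ]


cmd = build_recode_command(
    "sim_english-illustrated-magazine_1884-12_2_15_jp2/*",
    "sim_english-illustrated-magazine_1884-12_2_15_hocr.html",
    "/tmp/example.pdf",
)

# Print a shell-safe version for inspection.
print(shlex.join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True). Because no shell is involved, the glob reaches recode_pdf unexpanded, just like the quoted pattern in the shell example.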

Or, to scan a document, OCR it with Tesseract and save the result as a compressed PDF with a text layer (JPEG2000 compression with OpenJPEG, background downsampled by a factor of three):

scanimage --resolution 300 --mode Color --format tiff \
    | tee /tmp/scan.tiff \
    | tesseract - - hocr > /tmp/scan.hocr
recode_pdf -v -J openjpeg --bg-downsample 3 \
    --from-imagestack /tmp/scan.tiff \
    --hocr-file /tmp/scan.hocr \
    -o /tmp/scan.pdf
[...]
Processed 1 pages at 11.40 seconds/page
Compression ratio: 249.876613
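
To get a feeling for what such a ratio means in bytes, here is a small back-of-the-envelope sketch. It assumes the ratio compares uncompressed pixel data to the output size (this README does not say exactly what the tool measures), and the page dimensions are hypothetical: roughly an A4 page at 300 dpi in 8-bit RGB.

```python
def raw_size_bytes(width_px, height_px, channels=3, bits_per_channel=8):
    """Size of the uncompressed pixel data in bytes."""
    return width_px * height_px * channels * bits_per_channel // 8


def compressed_size_bytes(raw_bytes, ratio):
    """Estimated output size for a given compression ratio."""
    return raw_bytes / ratio


# Hypothetical scan size: about A4 at 300 dpi, 8-bit RGB.
raw = raw_size_bytes(2480, 3508)
estimate = compressed_size_bytes(raw, 249.876613)

print(f"raw: {raw / 1e6:.1f} MB, estimated PDF: {estimate / 1e3:.0f} kB")
```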

Examining the results

mrcview (tools/mrcview) is shipped with the package and can be used to turn an MRC-compressed PDF into a PDF with each layer on a separate page. This is the easiest way to inspect the resulting compression. Run it like so:

mrcview /tmp/compressed.pdf /tmp/mrc.pdf

There is also maskview, which just renders the masks of a PDF to another PDF.

Alternatively, you can use pdfimages to extract the image layers of a specific page and then view them with your favourite image viewer:

# pdfimages counts pages from 1
pageno=1; pdfimages -f "$pageno" -l "$pageno" -png path_to_pdf extracted_image_base
feh extracted_image_base*.png
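
The same extraction step can be scripted. The sketch below builds the pdfimages argument list for a single page using the -f, -l and -png options from the command above; the helper name build_pdfimages_command is made up for this example, and the page check reflects that pdfimages numbers pages starting at 1.

```python
def build_pdfimages_command(pdf_path, page, out_prefix):
    """Return the pdfimages argument list to extract one page's images as PNG."""
    if page < 1:
        raise ValueError("pdfimages counts pages from 1")
    return [
        "pdfimages",
        "-f", str(page),   # first page to extract
        "-l", str(page),   # last page to extract
        "-png",
        pdf_path,
        out_prefix,
    ]


cmd = build_pdfimages_command("/tmp/example.pdf", 1, "extracted_image_base")
print(" ".join(cmd))
```

Pass the list to subprocess.run(cmd, check=True) to run it. For an MRC page you would expect several images per page (background, foreground and mask).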

tools/pdfimagesmrc can be used to check how the size of the PDF is broken down into the foreground, background, masks and text layer.

License

License for all code (minus internetarchivepdf/pdfrenderer.py) is AGPL 3.0.

internetarchivepdf/pdfrenderer.py is Apache 2.0, which matches the Tesseract license for that file.
