Internet Archive PDF compression tools

Authors: Merlijn Wajer <merlijn@archive.org>

Date: 2021-11-14 18:00

This repository contains a library that performs MRC (Mixed Raster Content) compression on images. MRC is a lossy scheme that achieves high compression ratios, in particular on images that contain text.

Additionally, the library can generate MRC-compressed PDF files with hOCR text layers mixed into the PDF, which makes it possible to search the PDF and to copy and paste its text. PDFs generated by bin/recode_pdf should be PDF/A 3b and PDF/UA compatible.

Some of the tooling also supports specific Internet Archive file formats (such as the “scandata.xml” files), but the tooling should work fine without those files, too.

While the code is already being used internally to create PDFs at the Internet Archive, the code still needs more documentation and cleaning up, so don’t expect this to be super well documented just yet.

Features

  • Reliable: has produced over 6 million PDFs in 2021 alone (each with many hundreds of pages)

  • Fast and robust compression: Competes directly with the proprietary software offerings when it comes to speed and compressibility (often outperforming in both)

  • MRC compression of images, leading to anywhere from 3-15x compression ratios, depending on the quality setting provided.

  • Creates PDF from a directory of images

  • Improved compression based on OCR results (hOCR files)

  • Hidden text layer insertion based on hOCR files, which makes a PDF searchable and the text copy-pasteable.

  • PDF/A 3b compatible.

  • Basic PDF/UA support (accessibility features)

  • Creation of 1 bit (black and white) PDFs

Dependencies

  • Python 3.x

  • Python packages (also see requirements.txt):

One-of:

For JBIG2 compression:

  • jbig2enc for JBIG2 compression (and PyMuPDF 1.19.0 or higher)

Installation

First install dependencies. For example, in Ubuntu:

sudo apt install libleptonica-dev libopenjp2-tools libxml2-dev libxslt-dev python3-dev python3-pip

sudo apt install automake libtool
git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make
sudo make install

Because archive-pdf-tools is on the Python Package Index (PyPI), you can use pip (the Python 3 version is often called pip3) to install the latest version:

# Latest version
pip3 install archive-pdf-tools

# Specific version
pip3 install archive-pdf-tools==1.4.14

Alternatively, if you want a specific commit or unreleased version, check out the master branch or a tagged release and use pip to install:

git clone https://github.com/internetarchive/archive-pdf-tools.git
cd archive-pdf-tools
pip3 install .

Finally, if you’ve downloaded a wheel to test a specific commit, you can also install it using pip:

pip3 install --force-reinstall -U --no-deps ./archive_pdf_tools-${version}.whl

To see if archive-pdf-tools is installed correctly for your user, run:

recode_pdf --version
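
The same check can be scripted, for example as a first step in a batch job. The following is a minimal sketch, not part of archive-pdf-tools: the helper name tool_version is hypothetical, and it relies only on the --version flag shown above.

```python
import shutil
import subprocess


def tool_version(tool="recode_pdf"):
    """Return the version string printed by `tool --version`, or None if
    the tool is not on PATH. Hypothetical helper for illustration only."""
    path = shutil.which(tool)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print their version on stderr, so fall back to it.
    return (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    version = tool_version()
    if version is None:
        print("recode_pdf not found on PATH")
    else:
        print("recode_pdf reports:", version)
```

A None result usually means that the directory pip installed scripts into is not on your PATH.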

Not well tested features

  • “Recoding” an existing PDF (extracting its images and creating a new PDF from them) is not well tested. It works OK if every PDF page contains just a single image.

Known issues

  • Using --image-mode 0 and --image-mode 1 is currently broken, so only MRC or no images are supported.

  • It is not possible to recode/compress a PDF without hOCR files. This will be addressed in the future, since it should not be a problem to generate a PDF lacking hOCR data.

Planned features

  • Addition of a second set of fonts in the PDFs, so that hidden selected text also renders the original glyphs.

  • Better background generation (text shade removal from the background)

  • Better compression parameter selection. I have not yet experimented much with the kakadu and grok/openjpeg2000 parameters.

MRC

The goal of Mixed Raster Content compression is to decompose the image into a background, foreground and mask. The background should contain components that are not of particular interest, whereas the foreground would contain all glyphs/text on a page, as well as the lines and edges of various drawings or images. The mask is a 1-bit image which has the value ‘1’ when a pixel is part of the foreground.

This decomposition can then be used to compress the different components individually, applying much higher compression to specific components, usually the background, which can be downscaled as well. The foreground can be compressed quite heavily too, since it mostly just needs to contain the approximate colours of the text and other lines. Any artifacts introduced during the foreground compression (e.g. ugly artifacts around text borders) are removed by overlaying the mask component of the image, which is losslessly compressed (typically using either JBIG2 or CCITT).

In a PDF, this usually means the background image is inserted into a page, followed by the foreground image, which uses the mask as its alpha layer.
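
The decomposition and recomposition described above can be illustrated with a small NumPy sketch. This is not the algorithm used by this library: a single global threshold stands in for the real mask estimation, flat colours stand in for the lossy image codecs, and all function names are hypothetical.

```python
import numpy as np


def decompose(image, threshold=128):
    """Split a grayscale page into (mask, foreground, background).

    Assumption: dark pixels are text. A real implementation estimates the
    mask far more carefully than a global threshold does.
    """
    mask = image < threshold  # True ('1') where a pixel belongs to the foreground
    fg_colour = image[mask].mean() if mask.any() else 0.0
    bg_colour = image[~mask].mean() if (~mask).any() else 255.0
    # The foreground only needs the approximate colour of the text.
    foreground = np.full(image.shape, fg_colour, dtype=np.float64)
    # The background has the text pixels filled in with the paper colour.
    background = np.where(mask, bg_colour, image).astype(np.float64)
    return mask, foreground, background


def downsample(layer, factor):
    """Downscale by averaging factor x factor blocks (shape must divide evenly)."""
    h, w = layer.shape
    return layer.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def upsample(layer, factor):
    """Nearest-neighbour upscale back to the original resolution."""
    return np.repeat(np.repeat(layer, factor, axis=0), factor, axis=1)


def recompose(mask, foreground, background):
    """Draw the background, then the foreground wherever the mask is set."""
    return np.where(mask, foreground, background)


# A 6x6 light "page" with a dark 2x4 block standing in for text.
page = np.full((6, 6), 230.0)
page[2:4, 1:5] = 20.0

mask, fg, bg = decompose(page)
bg_small = downsample(bg, 3)  # the background survives aggressive downscaling
restored = recompose(mask, fg, upsample(bg_small, 3))
```

Because the mask is kept at full resolution and is never lossily compressed, the text edges in restored stay sharp even though the background was reduced to a 2x2 image.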

Usage

Creating a PDF from a set of images is pretty straightforward:

recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' \
    --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html \
    --dpi 400 --bg-downsample 3 \
    -m 2 -t 10 --mask-compression jbig2 \
    -o /tmp/example.pdf
[...]
Processed 9 pages at 1.16 seconds/page
Compression ratio: 7.144962
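
When driving recode_pdf from a script, the same invocation can be assembled as an argument list. The sketch below is an illustration under stated assumptions: build_recode_command is a hypothetical helper, and the flag values (including -m 2 and -t 10) are copied verbatim from the example above rather than chosen for any particular input.

```python
import subprocess


def build_recode_command(image_glob, hocr_file, output_pdf,
                         dpi=400, bg_downsample=3, mask_compression="jbig2"):
    """Build the argv list for the recode_pdf call shown above.

    Hypothetical helper; every flag is taken from the README example.
    """
    return [
        "recode_pdf",
        # Passed as one argument, just like the quoted glob in the shell
        # example, so the shell does not expand it.
        "--from-imagestack", image_glob,
        "--hocr-file", hocr_file,
        "--dpi", str(dpi),
        "--bg-downsample", str(bg_downsample),
        "-m", "2", "-t", "10",  # copied unchanged from the example
        "--mask-compression", mask_compression,
        "-o", output_pdf,
    ]


def recode(image_glob, hocr_file, output_pdf):
    """Run recode_pdf and return its captured standard output."""
    cmd = build_recode_command(image_glob, hocr_file, output_pdf)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing a list to subprocess.run avoids shell quoting problems with file names that contain spaces.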

Or, to scan a document, OCR it with Tesseract and save the result as a compressed PDF with a text layer (JPEG2000 compression with OpenJPEG, background downsampled by a factor of three):

scanimage --resolution 300 --mode Color --format tiff \
    | tee /tmp/scan.tiff \
    | tesseract - - hocr > /tmp/scan.hocr
recode_pdf -v -J openjpeg --bg-downsample 3 \
    --from-imagestack /tmp/scan.tiff \
    --hocr-file /tmp/scan.hocr \
    -o /tmp/scan.pdf
[...]
Processed 1 pages at 11.40 seconds/page
Compression ratio: 249.876613
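
To collect statistics over many runs, the summary lines shown in the examples above can be parsed from captured output. This is a sketch that assumes the output keeps exactly the format of those two examples, which may differ between versions; parse_summary is a hypothetical helper, not part of the package.

```python
import re


def parse_summary(output):
    """Extract page count, seconds per page and compression ratio.

    Returns None when the expected summary lines are not present.
    """
    pages = re.search(r"Processed (\d+) pages at ([\d.]+) seconds/page", output)
    ratio = re.search(r"Compression ratio: ([\d.]+)", output)
    if pages is None or ratio is None:
        return None
    return {
        "pages": int(pages.group(1)),
        "seconds_per_page": float(pages.group(2)),
        "compression_ratio": float(ratio.group(1)),
    }


# Sample text copied from the first usage example.
sample = "Processed 9 pages at 1.16 seconds/page\nCompression ratio: 7.144962\n"
summary = parse_summary(sample)
print(summary)
```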

Examining the results

mrcview (tools/mrcview) is shipped with the package and can be used to turn an MRC-compressed PDF into a PDF with each layer on a separate page. This is the easiest way to inspect the resulting compression. Run it like so:

mrcview /tmp/compressed.pdf /tmp/mrc.pdf

There is also maskview, which just renders the masks of a PDF to another PDF.

Alternatively, you can use pdfimages to extract the image layers of a specific page and then view them with your favourite image viewer:

# pdfimages counts pages starting from 1
pageno=1; pdfimages -f "$pageno" -l "$pageno" -png path_to_pdf extracted_image_base
feh extracted_image_base*.png

tools/pdfimagesmrc can be used to check how the size of the PDF is broken down into the foreground, background, masks and text layer.

License

License for all code (minus internetarchive/pdfrenderer.py) is AGPL 3.0.

internetarchive/pdfrenderer.py is Apache 2.0, which matches the Tesseract license for that file.
