adf2pdf

Automate the workflow around ADF scanning, OCR and PDF creation

Project description

adf2pdf - a tool that turns a batch of paper pages into a PDF with a text layer. By default, it detects empty pages (as they may easily occur during duplex scanning) and excludes them from the OCR and the resulting PDF.
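The text above does not say how an empty page is recognized. As a rough illustration only, the following sketch assumes a simple heuristic: a black-and-white page counts as empty when the fraction of black pixels stays below a small threshold. The function names and the threshold value are hypothetical and not taken from adf2pdf.

```python
def black_ratio(rows):
    """Return the fraction of black pixels in a bitmap given as rows of 0/1 values (1 = black)."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    black = sum(sum(row) for row in rows)
    return black / total


def is_empty_page(rows, threshold=0.001):
    """Treat a page as empty when its black-pixel ratio is below the (assumed) threshold."""
    return black_ratio(rows) < threshold


# A blank page with a single speck of dust, and a page with regular dark content.
blank = [[0] * 100 for _ in range(100)]
blank[10][10] = 1
text = [[1 if (x + y) % 7 == 0 else 0 for x in range(100)] for y in range(100)]

print(is_empty_page(blank))
print(is_empty_page(text))
```

A threshold above zero is what makes such a check tolerant of dust and scanner noise; a real implementation would likely tune it against actual scans.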

For that, it uses Sane's scanimage for the scanning, Tesseract for the optical character recognition (OCR), and the Python packages img2pdf, Pillow (PIL) and PyPDF2 for some image-processing tasks and PDF mangling.
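To make the division of labour more concrete, here is a minimal sketch of how the two external tools might be invoked from Python. It only builds the command lines and does not run them. The helper names, the file name pattern and the PNG output format are assumptions for illustration, and the `--resolution`, `--mode` and `--source` options depend on the scanner backend, so the values shown may not be valid for every device. This is not necessarily how adf2pdf itself calls these tools.

```python
import shlex


def scanimage_cmd(out_pattern="page-%03d.png", resolution=600,
                  mode="Lineart", source="ADF Duplex"):
    """Build a scanimage batch command (device options are backend-specific assumptions)."""
    return [
        "scanimage",
        f"--batch={out_pattern}",   # one output file per scanned side
        "--format=png",
        "--resolution", str(resolution),
        "--mode", mode,
        "--source", source,
    ]


def tesseract_cmd(image, out_base, lang="eng"):
    """Build a Tesseract command that writes out_base.pdf with a text layer."""
    return ["tesseract", image, out_base, "-l", lang, "pdf"]


print(shlex.join(scanimage_cmd()))
print(shlex.join(tesseract_cmd("page-001.png", "page-001")))
```

Passing the arguments as a list (e.g. to `subprocess.run`) avoids shell quoting problems with values that contain spaces, such as the source name.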

Example:

$ adf2pdf contract-xyz.pdf

2017, Georg Sauthoff <mail@gms.tf>

Features

Automatic document feed (ADF) support
Fast empty page detection
Overlapping of scanning, image processing, OCR and PDF creation to minimize the total runtime
Fast creation of small PDFs using the fine img2pdf package
Uses only safe compression methods, i.e. no error-prone symbol-segmentation style compression like JBIG2 or JB2, which is used in Xerox photocopiers and the DjVu format.
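The idea of running the stages concurrently, so that pages are processed while later ones are still being scanned, can be sketched as follows. This is a simplified model, assuming a producer thread for the scanner and a thread pool for the per-page work; `scan_pages` and `ocr_page` are hypothetical stand-ins and do not touch a real scanner or Tesseract.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def scan_pages(n, out_queue):
    """Stand-in for the scanner: emit page numbers one by one, then a sentinel."""
    for page in range(1, n + 1):
        out_queue.put(page)
    out_queue.put(None)  # sentinel: no more pages


def ocr_page(page):
    """Stand-in for the per-page OCR step."""
    return f"text of page {page}"


def run_pipeline(n_pages, workers=4):
    """Start OCR on each page as soon as it is scanned; return results in page order."""
    pages = queue.Queue()
    scanner = threading.Thread(target=scan_pages, args=(n_pages, pages))
    scanner.start()
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            page = pages.get()
            if page is None:
                break
            futures.append(pool.submit(ocr_page, page))
    scanner.join()
    # Futures were submitted in scan order, so the results keep the page order.
    return [f.result() for f in futures]


print(run_pipeline(3))
```

With this structure the total runtime approaches that of the slowest stage rather than the sum of all stages, which is the point of the overlap.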

Install Instructions

Adf2pdf can be installed directly with pip, e.g.

$ pip3 install --user adf2pdf

or

$ pip3 install adf2pdf

Hardware Requirements

A scanner with automatic document feed (ADF) that is supported by Sane. For example, the Fujitsu ScanSnap S1500 works well. That model supports duplex scanning, which is quite convenient.

Example continued

Running adf2pdf for a 7-sheet example document takes 150 seconds on an i7-6600U (Intel Skylake, 2 cores, 4 threads) CPU, using the ADF of the Fujitsu ScanSnap S1500. With the defaults, adf2pdf calls scanimage for duplex scanning into 600 dpi lineart (black and white) images, so both sides of each sheet are scanned, 14 sides in total. In this example, 6 of those sides are empty and thus automatically excluded, i.e. the resulting PDF contains just 8 pages.
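The page count in this example can be checked with a little arithmetic, reading the 7-page document as 7 sheets of paper scanned on both sides (an assumption, but the one under which the numbers add up). The helper name below is hypothetical.

```python
def pages_in_pdf(sheets, empty_sides, duplex=True):
    """Number of pages left after dropping empty sides from a simplex or duplex scan."""
    sides = sheets * 2 if duplex else sheets
    if not 0 <= empty_sides <= sides:
        raise ValueError("empty_sides must be between 0 and the number of scanned sides")
    return sides - empty_sides


# 7 sheets in duplex mode give 14 sides; 6 of them are empty.
print(pages_in_pdf(7, 6))  # → 8
```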

The resulting PDF contains a text layer from the OCR, so one can search it and copy and paste text. It is 1.1 MiB in size, i.e. a page is stored in 132 KiB, on average.

Software Requirements

The script assumes Tesseract version 4 by default. Version 3 can be used as well, but the neural network system in Tesseract 4 performs far better than the old OCR model. Tesseract 4.0.0 was released in late 2018; thus, distributions released in that time frame may still include only version 3 in their repositories (e.g. Fedora 29, while Fedora 30 features version 4). Since version 4 is so much better at OCR, I can't recommend it enough over the older version 3.
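To find out which major version is installed, one could parse the output of `tesseract --version`. The sketch below only covers the parsing step, and the sample string is an assumption modelled on typical output rather than a captured log. The function name is hypothetical.

```python
import re


def tesseract_major(version_output):
    """Extract the major version number from `tesseract --version` output, or None."""
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    return int(match.group(1)) if match else None


# Assumed shape of the first lines of the version output.
sample = "tesseract 4.1.1\n leptonica-1.79.0\n"
print(tesseract_major(sample))
```

The text to parse could be obtained with `subprocess.run(["tesseract", "--version"], capture_output=True, text=True)`. Since some versions may print the version banner to standard error instead of standard output, checking both streams is the safer choice.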

Tesseract 4 notes (in case you need to build it from the sources):

Build instructions - warning: if you miss the autoconf-archive dependency you'll get weird autoconf error messages
Data files - you need the training data for your languages of choice and the OSD data

Python packages:

img2pdf (Fedora package: python3-img2pdf)
Pillow (PIL) (Fedora package: python3-pillow-devel)
PyPDF2 (Fedora package: python3-PyPDF2)

Release history

0.8.3 (this version) - Aug 15, 2023
0.8.2 - Dec 6, 2020
0.8.1 - Mar 25, 2019
0.8.0 - May 8, 2018

Download files

Source Distribution

adf2pdf-0.8.3.tar.gz (21.5 kB)

Uploaded Aug 15, 2023 Source

File details

Details for the file adf2pdf-0.8.3.tar.gz.

File metadata

Download URL: adf2pdf-0.8.3.tar.gz
Upload date: Aug 15, 2023
Size: 21.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.11.4

File hashes

Hashes for adf2pdf-0.8.3.tar.gz
Algorithm	Hash digest
SHA256	`41400fb252cb875fde225515d58027a91ade5ca77ec0c27d7fb42846d85ed7d6`
MD5	`3796c8ca880ce9d7e7253e38bb0f7803`
BLAKE2b-256	`e29e7beaedc362d898ae8781e29ea5328f3aa3a401dee5beeb52f83cf4f24c19`
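A downloaded archive can be checked against the SHA256 digest listed above with Python's standard hashlib module. The following is a minimal sketch; the function names are hypothetical, and the demo hashes an in-memory stream instead of the real file so that it runs without a download.

```python
import hashlib
import io

# SHA256 digest of adf2pdf-0.8.3.tar.gz as listed in the table above.
EXPECTED_SHA256 = "41400fb252cb875fde225515d58027a91ade5ca77ec0c27d7fb42846d85ed7d6"


def sha256_of_stream(stream, chunk_size=65536):
    """Return the hex SHA256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_expected(stream, expected=EXPECTED_SHA256):
    """Compare the stream's digest with the expected one, ignoring letter case."""
    return sha256_of_stream(stream) == expected.lower()


# Demo on in-memory data; this is not the archive, so the check reports a mismatch.
print(matches_expected(io.BytesIO(b"not the real archive")))
```

For the real file, the stream would come from `open("adf2pdf-0.8.3.tar.gz", "rb")`. Reading in chunks keeps memory use constant regardless of file size.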
