Tool to scan and process documents to paperless

These details have not been verified by PyPI

Project links

Development Status
- 5 - Production/Stable
Environment
- Console
License
- OSI Approved :: BSD License
Operating System
- OS Independent
Programming Language
Typing
- Typed

Project description

Scan and prepare your document for Paperless

The main goal of this project is to provide a productive workflow from document scanning to Paperless. Preparing the documents requires tools that need many resources, hence the idea of running them in the background, ideally on another host such as a NAS. As a consequence, it is not easy to set up, but once it is in place you will be really productive. The interface between the user and the process is the scan command, which does the initial scan, and the file system, which is used to verify that the result is OK (and to do some advanced operations described below) and to validate it.

Features

Scan the images optionally by using the Automatic Document Feeder
Easily scan double-sided images using the Automatic Document Feeder
Extract the DPI from the TIFF images
Change the images levels
Remove the area out of the image
Deskew the images
Crop the images
Sharpen the images (disabled by default)
Dither the images (disabled by default)
Auto-rotate the images using Tesseract (to have the text the right way up)
Optimize the images using pngquant, optipng, ps2pdf or jpeg (using quality from GraphicsMagick convert)
Assisted split, used to split a prospectus page into several pages (requires modifying the YAML...)
Append credit card, used to have the two faces of a credit card on the same page
Be able to copy the OCR result from the PDF
Scan the QR codes and barcodes and add a new page with the values (separate process)
Manage the empty lines in the QR code (replaced by a pipe (|) in the PDF; run scan --convert-clipboard to scan your clipboard and apply the inverse transform)
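
The pipe substitution in the last feature can be illustrated with a minimal sketch. It assumes that each line break in the code value is replaced by exactly one pipe; the function names are hypothetical and this is not the code behind scan --convert-clipboard.

```python
def newlines_to_pipes(text: str) -> str:
    """Forward transform (assumed): join the lines of the value with pipes."""
    return "|".join(text.splitlines())


def pipes_to_newlines(text: str) -> str:
    """Inverse transform (assumed): turn every pipe back into a line break."""
    return text.replace("|", "\n")


if __name__ == "__main__":
    value = "first line\nsecond line\n\nlast line"
    encoded = newlines_to_pipes(value)
    print(encoded)
    print(pipes_to_newlines(encoded) == value)
```

A value that already contains a literal pipe cannot be restored unambiguously with this simple scheme.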

Requirements

On the desktop:

Python >= 3.6
The scanimage command. On Windows it should be possible to use another command, but this has never been tested. That command would be an adapter that interprets the following arguments: --batch-start, --batch-increment, --batch-count, --batch for the destination file name template (%d is replaced by the page number), and the others for the auto_bash.
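
As a rough illustration of such an adapter, the sketch below only parses the batch arguments listed above and computes the output file names. The function names and the default values are assumptions, and the actual scanning step is left out.

```python
import argparse


def parse_adapter_args(argv):
    """Parse the batch options; every other argument is returned untouched."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--batch", required=True)  # file name template
    parser.add_argument("--batch-start", type=int, default=1)  # assumed default
    parser.add_argument("--batch-increment", type=int, default=1)  # assumed default
    parser.add_argument("--batch-count", type=int, default=None)
    return parser.parse_known_args(argv)


def page_names(options, pages_scanned):
    """Expand the template, replacing %d with each page number."""
    return [
        options.batch % (options.batch_start + index * options.batch_increment)
        for index in range(pages_scanned)
    ]


if __name__ == "__main__":
    opts, others = parse_adapter_args(
        ["--batch=image-%d.png", "--batch-start=2", "--batch-increment=2", "--format=png"]
    )
    print(page_names(opts, 3), others)
```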

On the NAS:

Docker

Install

On the desktop

$ python3 -m pip install scan-to-paperless
$ echo 'PATH=$PATH:~/venv/bin' >> ~/.bashrc
$ echo 'source <(register-python-argcomplete scan)' >> ~/.bashrc
$ echo 'source <(register-python-argcomplete scan-process-status)' >> ~/.bashrc

Create the configuration file at <home_config>/scan-to-paperless.yaml (on Linux, ~/.config/scan-to-paperless.yaml) with:

# yaml-language-server: $schema=https://raw.githubusercontent.com/sbrunner/scan-to-paperless/master/scan_to_paperless/config_schema.json

scan_folder: /home/sbrunner/Paperless/scan/
scanimage_arguments: # Additional argument passed to the scanimage command
  - --device=... # Use `scanimage --list` to get the possible values
  - --format=png
  - --mode=color
  - --resolution=300
default_args:
  auto_mask: {}
  auto_cut: {}
  run_pngquant: true
  cut_white: 200 # cut the near white color to have a uniform background
  dpi: 300 # Not necessary if the scanner generates a TIFF file
  tesseract_lang: fra+eng # The used languages for the OCR

Full config documentation

On the NAS

Docker support is required. Personally, I use a Synology DiskStation DS918+, and you can get the *.syno.json files to configure your Docker services.

Otherwise, use:

SCAN_FOLDER=<scan_folder>
CONSUME_FOLDER=<consume_folder>
docker run --name=scan-to-paperless --restart=unless-stopped --detach \
  --volume=${SCAN_FOLDER}:/source \
  --volume=${CONSUME_FOLDER}:/destination \
  sbrunner/scan-to-paperless

You can set the environment variable PROGRESS to TRUE to get all the intermediate images.

To stop it, run:

docker stop scan-to-paperless
docker rm scan-to-paperless

Folder link

You should find a way to synchronize or share the scan folder between your desktop and your NAS.

You should also link the consume folder to paperless-ngx, probably just by using the same folder.

Usage

Use the scan command to scan your documents.
The documents are transferred to your NAS (I use Syncthing).
The documents will be processed on the NAS.
Use scan-process-status to know the status of your documents.
Validate your documents.
If you're happy with the result, remove the REMOVE_TO_CONTINUE file (to restart the process, remove one of the generated images; to cancel the job, just remove the folder).
The process will continue its job and import the document into paperless-ngx.
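
The validation step relies on the REMOVE_TO_CONTINUE marker file. The sketch below shows one way to list which job folders are waiting for validation. It assumes one sub-folder per job in the scan folder; the function name and status labels are hypothetical, and this is not how scan-process-status is implemented.

```python
from pathlib import Path


def job_states(scan_folder):
    """Map each job folder name to a coarse state based on the marker file."""
    states = {}
    for job in sorted(Path(scan_folder).iterdir()):
        if not job.is_dir():
            continue
        if (job / "REMOVE_TO_CONTINUE").exists():
            states[job.name] = "waiting for validation"
        else:
            states[job.name] = "processing or not started"
    return states


def validate(scan_folder, job_name):
    """Validate a job by removing its marker file, as described above."""
    (Path(scan_folder) / job_name / "REMOVE_TO_CONTINUE").unlink()
```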

Job config file

In the config.yaml file present in the document folder, you can find some information generated during the processing and some can be modified.

E.g. you can modify an image angle to fix the skew, then remove a generated image to force the images to be regenerated.

Full job config documentation

Advanced features

Add a mask

If your scanner adds a margin around the scanned image, it will really cause issues with the skew and content detection.

To solve that, you can add a black and white image named mask.png in the root folder and draw in black the parts that should not be taken into account.

Scan to Paperless is also able to create a mask automatically; to enable it with the default configuration, just add an argument named auto_mask with an empty dictionary ({}) as its value.

See also: The documentation

Configuration note:

By default, the options lower_hsv_color and upper_hsv_color select the page (white). You can also select the scanner background; in that case you should also set the option inverse_mask to true and the option de_noise_morphology to false.

Mask the image

If your scanner adds a margin around the scanned image, you can definitively mask it.

To do that, you can add a black and white image named cut.png in the root folder and draw in black the parts that should not be taken into account.

Scan to Paperless is also able to create this mask automatically; to enable it with the default configuration, just add an argument named auto_cut with an empty dictionary ({}) as its value.

See also: The documentation

Double-sided scanning

Put your sheets on the Automatic Document Feeder.
Run scan with the option --mode=double.
Press enter to start scanning the first side of all sheets.
Put all your sheets on the Automatic Document Feeder again without turning them.

The scan utility will rotate and reorder all the sheets to get a good document.
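
The reordering can be sketched as follows, assuming that the second pass delivers the back sides in reverse sheet order (because the stack is fed again as it came out). The function name is hypothetical and the rotation step is not shown.

```python
def reorder_double_sided(scanned):
    """Interleave fronts and backs.

    `scanned` holds the first-pass pages (fronts, in order) followed by the
    second-pass pages (backs, assumed to be in reverse order).
    """
    if len(scanned) % 2 != 0:
        raise ValueError("expected the same number of front and back pages")
    half = len(scanned) // 2
    fronts = scanned[:half]
    backs = scanned[half:][::-1]  # undo the assumed reverse order
    ordered = []
    for front, back in zip(fronts, backs):
        ordered.extend([front, back])
    return ordered


if __name__ == "__main__":
    print(reorder_double_sided(["f1", "f2", "f3", "b3", "b2", "b1"]))
```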

Credit card scanning

The option --append-credit-card will append all the sheets vertically to have both faces of the credit card on the same page.

Assisted split

Do your scan as usual with the extra option --assisted-split.
After the process does its first pass, you will have images with lines and numbers. The lines represent the detected potential splits of the image, and their length indicates the strength of the detection. In your config you will have something like:

assisted_split:
-   destinations:
    -   4 # Page number of the left part of the image
    -   1 # Same for the right part of the image
    image: image-1.png # Name of the image
    limits:
    -   margin: 0 # Margin around the split
        name: 0 # Number visible on the generated image
        value: 375 # The position of the split (can be manually edited)
        vertical: true # Will split the image vertically
    -   ...
    source: /source/975468/7-assisted-split/image-1.png
-   ...

Edit your config file; you should have one more destination than limits. If you set a destination such as 2.1, it means that it will be the first part of page 2, and 2.2 will be the second part.
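
The relation between limits and destinations can be sketched as below for vertical splits only. It assumes that value is a horizontal pixel position and that margin is trimmed on both sides of the split; the function name is hypothetical and this is not the project's implementation.

```python
def split_vertically(width, height, limits, destinations):
    """Pair each destination with the (left, top, right, bottom) box it gets."""
    if len(destinations) != len(limits) + 1:
        raise ValueError("expected one more destination than limits")
    boxes = []
    left = 0
    for limit in sorted(limits, key=lambda item: item["value"]):
        # The part ends just before the split, minus the margin (assumption).
        boxes.append((left, 0, limit["value"] - limit["margin"], height))
        left = limit["value"] + limit["margin"]
    boxes.append((left, 0, width, height))
    return list(zip(destinations, boxes))


if __name__ == "__main__":
    limits = [{"margin": 0, "name": 0, "value": 375, "vertical": True}]
    print(split_vertically(800, 600, limits, [4, 1]))
```

With the example configuration above, the left part goes to page 4 and the right part to page 1.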

Delete the file REMOVE_TO_CONTINUE.
After the process does its next pass, you will have the final generated images.
If they are OK, delete the file REMOVE_TO_CONTINUE.

The scan modes configuration

First of all, the scanimage command and its arguments can be configured with the scanimage and scanimage_arguments options in the configuration file (~/.config/scan-to-paperless.yaml).

In this file there is also a modes section that can configure each mode.

See also: The documentation

Extend an existing configuration

To create the preset configuration file, it can be useful to extend an existing configuration. For that you can use the extends (and merge_strategies) options in the configuration file.

See also: The documentation
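
To give an idea of what extending a configuration means, here is a minimal sketch of one possible merge behaviour: dictionaries are merged recursively and any other value from the extending file replaces the inherited one. This behaviour is an assumption; the actual result depends on merge_strategies, so refer to the documentation for the real rules.

```python
def merge_config(base, override):
    """Return a new dict where `override` extends `base` (assumed strategy)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    base = {"default_args": {"dpi": 300, "run_pngquant": True}}
    child = {"default_args": {"dpi": 600}}
    print(merge_config(base, child))
```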

Server configuration

Environment variables:

SCAN_SOURCE_FOLDER: The main input folder for the scan process.
SCAN_CODES_FOLDER: The input folder for the codes (QR code and barcode) detection, which adds a new page.
SCAN_FINAL_FOLDER: The final folder for the scan process.
SCAN_CODES_DPI: The DPI used to decode the codes.
SCAN_CODES_PDF_DPI: The PDF DPI used to create the codes document.
SCAN_CODES_FONT_NAME: The font used for the code number.
SCAN_CODES_FONT_SIZE: The font size used for the code number.
SCAN_CODES_MARGIN_TOP: The top margin of the code number.
SCAN_CODES_MARGIN_LEFT: The left margin of the code number.
TIME: Print the elapsed time.
PROGRESS: Save some intermediate files, don't clean the folder at the end.
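
A minimal sketch of how a few of these variables could be read is shown below. The function name, the dictionary keys and the default values are assumptions (the folder defaults are borrowed from the volume paths of the docker run example), not the server's real defaults.

```python
import os


def read_server_settings(environ=None):
    """Read a few of the variables listed above into a plain dictionary."""
    env = os.environ if environ is None else environ
    return {
        "source_folder": env.get("SCAN_SOURCE_FOLDER", "/source"),  # assumed default
        "final_folder": env.get("SCAN_FINAL_FOLDER", "/destination"),  # assumed default
        "print_time": env.get("TIME", "FALSE").upper() == "TRUE",
        "keep_intermediate": env.get("PROGRESS", "FALSE").upper() == "TRUE",
    }


if __name__ == "__main__":
    print(read_server_settings({"PROGRESS": "TRUE"}))
```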


Built Distribution

scan_to_paperless-1.24.0-py3-none-any.whl (47.8 kB view details)

Uploaded Sep 21, 2022 Python 3

File details

Details for the file scan_to_paperless-1.24.0-py3-none-any.whl.

File metadata

Download URL: scan_to_paperless-1.24.0-py3-none-any.whl
Upload date: Sep 21, 2022
Size: 47.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for scan_to_paperless-1.24.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`175afd0fb3b7436fdb54d701954eaa6bc00e6a722ab6fd5cc9874ffaaf3e9343`
MD5	`c9934b27cf6116370259090666313d2c`
BLAKE2b-256	`b9b2b8e7d0418747e214215fc30e82d2185d4544704c0b4560e61b079347e563`