Document Preprocessing and Segmentation

Tools to preprocess and segment scanned images for OCR-D

Installing
Tools
Testing
License

Installing

Requires Python >= 3.6.

Create a new venv unless you already have one:
```
 python3 -m venv venv
```
Activate the venv:
```
 source venv/bin/activate
```
To install from source, make sure GNU make is available, then run:
```
 make install
```
Prebuilt packages are also available on PyPI:
```
 pip install ocrd_anybaseocr
```

(This will install both PyTorch and TensorFlow, along with their dependencies.)

Tools

All tools, also called processors, follow the OCR-D CLI specification, which roughly looks like:

ocrd-<processor-name> [-m <path to METS input file>] -I <input group> -O <output group> [-p <path to parameter file>]* [-P <param name> <param value>]*
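
The call pattern above can also be assembled programmatically, which may help when scripting several processors. The following is a minimal sketch: `build_processor_command` is a hypothetical helper (not part of OCR-D or this package), and it assumes that non-string `-P` values may be passed in JSON notation.

```python
import json
import shlex


def build_processor_command(name, input_group, output_group, params=None, mets=None):
    """Assemble an argv list following the call pattern shown above.

    Hypothetical helper for illustration; not part of OCR-D.
    """
    argv = ["ocrd-" + name]
    if mets is not None:
        argv += ["-m", mets]
    argv += ["-I", input_group, "-O", output_group]
    for key, value in (params or {}).items():
        # Assumption: strings are passed through unchanged, other values
        # are JSON-encoded (True -> "true", lists -> JSON arrays).
        text = value if isinstance(value, str) else json.dumps(value)
        argv += ["-P", key, text]
    return argv


cmd = build_processor_command(
    "anybaseocr-deskew", "OCR-D-BIN", "OCR-D-DESKEW",
    params={"maxskew": 5.0, "skewsteps": 20, "operation_level": "page"},
)
# Quote each part so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the command as a list (rather than a single string) avoids quoting problems when a parameter value contains spaces or brackets.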

Binarizer

Method Behaviour

For each page (or sub-segment), this processor takes a scanned color or grayscale document image as input and computes a binarized (black and white) image.

Implemented via rule-based methods (percentile-based adaptive background estimation in Ocrolib).

Example

ocrd-anybaseocr-binarize -I OCR-D-IMG -O OCR-D-BIN -P operation_level line -P threshold 0.3
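
To give a feel for percentile-based binarization, here is a deliberately simplified sketch in pure Python. It is not the Ocrolib implementation: it estimates one global black level and one global white level from percentiles, whereas an adaptive method estimates the background locally. The function names, the `low`/`high` percentiles, and the default `threshold` are assumptions chosen for illustration.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) by nearest rank on sorted values."""
    ordered = sorted(values)
    index = round((len(ordered) - 1) * q / 100.0)
    return ordered[index]


def binarize(gray, threshold=0.5, low=5, high=90):
    """Binarize a grayscale image given as rows of floats in [0, 1].

    Simplified global variant (assumption): the `low` percentile is taken
    as the ink level and the `high` percentile as the paper level.
    Returns rows of 1 (background) and 0 (ink).
    """
    flat = [v for row in gray for v in row]
    black = percentile(flat, low)
    white = percentile(flat, high)
    span = max(white - black, 1e-6)  # avoid division by zero on flat images
    result = []
    for row in gray:
        out = []
        for v in row:
            # Stretch intensities so that black -> 0 and white -> 1.
            norm = min(max((v - black) / span, 0.0), 1.0)
            out.append(1 if norm > threshold else 0)
        result.append(out)
    return result


# Tiny page: paper at 0.8, ink at 0.3.
page = [
    [0.8, 0.8, 0.3, 0.8],
    [0.8, 0.3, 0.3, 0.8],
    [0.8, 0.8, 0.8, 0.8],
]
binary = binarize(page)
for row in binary:
    print(row)
```

Normalizing against estimated paper and ink levels before thresholding is what makes a fixed threshold usable across scans of different brightness.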

Deskewer

Method Behaviour

For each page (or sub-segment), this processor takes a document image as input and computes its skew angle. It also annotates a deskewed image.

The input images have to be binarized for this module to work.

Implemented via rule-based methods (binary projection profile entropy maximization in Ocrolib).

Example

ocrd-anybaseocr-deskew -I OCR-D-BIN -O OCR-D-DESKEW -P maxskew 5.0 -P skewsteps 20 -P operation_level page
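
The idea behind projection-profile skew estimation can be sketched as follows: try a range of candidate angles, project the foreground pixels onto the vertical axis for each one, and keep the angle that yields the sharpest row profile. This sketch is not the Ocrolib code. It scores sharpness by the variance of the row histogram, which is a swapped-in criterion rather than the entropy-based one named above, and it assumes that `maxskew` bounds the search range in degrees and `skewsteps` sets the number of intervals. Function names and the sign convention are likewise assumptions.

```python
import math


def profile_score(points, angle_deg, height):
    """Variance of the row histogram after rotating points by -angle_deg."""
    a = math.radians(angle_deg)
    rows = [0] * (3 * height)  # padding so rotated rows stay in range
    for x, y in points:
        r = int(round(y * math.cos(a) - x * math.sin(a))) + height
        if 0 <= r < len(rows):
            rows[r] += 1
    mean = sum(rows) / len(rows)
    return sum((c - mean) ** 2 for c in rows) / len(rows)


def estimate_skew(points, height, maxskew=5.0, skewsteps=20):
    """Return the candidate angle (degrees) with the sharpest row profile."""
    candidates = [-maxskew + 2 * maxskew * i / skewsteps
                  for i in range(skewsteps + 1)]
    return max(candidates, key=lambda ang: profile_score(points, ang, height))


# Synthetic page: three text baselines, all tilted by 2 degrees.
tilt = math.tan(math.radians(2.0))
points = [(x, y0 + x * tilt) for y0 in (20, 40, 60) for x in range(200)]
estimate = estimate_skew(points, height=100)
print(estimate)
```

When the candidate angle matches the true tilt, every baseline collapses into a single histogram row, so the profile has a few tall peaks and maximal variance.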

Cropper

Method Behaviour

For each page, this processor takes a document image as input and computes the border around the page content area (i.e. removes textual noise as well as any other noise around the page frame). It also annotates a cropped image.

The input image need not be binarized, but should be deskewed for the module to work optimally.

Implemented via rule-based methods (gradient-based line segment detection and morphology-based textline detection).

Example

ocrd-anybaseocr-crop -I OCR-D-DESKEW -O OCR-D-CROP -P rulerAreaMax 0 -P marginLeft 0.1

Dewarper

Method Behaviour

For each page, this processor takes a document image as input and computes a morphed image in which curved text lines are straightened.

The input image has to be binarized for the module to work, and should be cropped and deskewed for optimal quality.

Implemented via data-driven methods (neural GAN conditional image model trained with pix2pixHD/PyTorch).

Models

ocrd resmgr download ocrd-anybaseocr-dewarp '*'

Example

ocrd-anybaseocr-dewarp -I OCR-D-CROP -O OCR-D-DEWARP -P resize_mode none -P gpu_id -1

Text/Non-Text Segmenter

Method Behaviour

For each page, this processor takes a document image as an input and computes two images, separating the text and non-text parts.

The input image has to be binarized for the module to work, and should be cropped and deskewed for optimal quality.

Implemented via data-driven methods (neural pixel classifier model trained with TensorFlow/Keras).

Models

ocrd resmgr download ocrd-anybaseocr-tiseg '*'

Example

ocrd-anybaseocr-tiseg -I OCR-D-DEWARP -O OCR-D-TISEG -P use_deeplr true

Block Segmenter

Method Behaviour

For each page, this processor takes the raw document image as an input and computes a text region segmentation for it (distinguishing various types of text blocks).

The input image need not be binarized, but should be deskewed for the module to work optimally.

Implemented via data-driven methods (neural Mask-RCNN instance segmentation model trained with TensorFlow/Keras).

Models

ocrd resmgr download ocrd-anybaseocr-block-segmentation '*'

Example

ocrd-anybaseocr-block-segmentation -I OCR-D-TISEG -O OCR-D-BLOCK -P active_classes '["page-number", "paragraph", "heading", "drop-capital", "marginalia", "caption"]' -P min_confidence 0.8 -P post_process true
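
One plausible reading of the `active_classes` and `min_confidence` parameters in the call above is a filter over the detections produced by the model. The sketch below illustrates that reading only: the detection records, their `label` and `score` keys, and `filter_detections` are hypothetical and do not reflect the processor's internal data structures.

```python
ACTIVE_CLASSES = ["page-number", "paragraph", "heading",
                  "drop-capital", "marginalia", "caption"]


def filter_detections(detections, active_classes, min_confidence):
    """Keep detections whose class is active and whose score is high enough.

    Assumption: each detection is a dict with "label" and "score" keys.
    """
    return [d for d in detections
            if d["label"] in active_classes and d["score"] >= min_confidence]


# Hypothetical model output for one page.
detections = [
    {"label": "paragraph", "score": 0.95},
    {"label": "heading", "score": 0.60},    # dropped: below min_confidence
    {"label": "footnote", "score": 0.99},   # dropped: class not active
    {"label": "caption", "score": 0.80},
]
kept = filter_detections(detections, ACTIVE_CLASSES, min_confidence=0.8)
print([d["label"] for d in kept])
```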

Textline Segmenter

Method Behaviour

For each page (or region), this processor takes a cropped document image as an input and computes a textline segmentation for it.

The input image should be binarized and deskewed for the module to work.

Implemented via rule-based methods (gradient- and morphology-based line estimation in Ocrolib).

Example

ocrd-anybaseocr-textline -I OCR-D-BLOCK -O OCR-D-LINE -P operation_level region

Document Analyser

Method Behaviour

For the whole document, this processor takes all the cropped page images and their corresponding text regions as input and computes the logical structure (page types and sections).

The input images should be binarized and segmented for this module to work.

Implemented via data-driven methods (neural Inception-V3 image classification model trained with TensorFlow/Keras).

Models

ocrd resmgr download ocrd-anybaseocr-layout-analysis '*'

Example

ocrd-anybaseocr-layout-analysis -I OCR-D-LINE -O OCR-D-STRUCT
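
The example calls above form a chain in which each processor reads the file group written by the previous one. A minimal sketch of that chaining is shown below. The processor and file group names are taken from the examples, while `chain_commands` is a hypothetical helper, and omitting all `-P` parameters assumes every processor has usable defaults, which is not confirmed here.

```python
import shlex

# (processor executable, output file group), in the order of the examples.
STEPS = [
    ("ocrd-anybaseocr-binarize", "OCR-D-BIN"),
    ("ocrd-anybaseocr-deskew", "OCR-D-DESKEW"),
    ("ocrd-anybaseocr-crop", "OCR-D-CROP"),
    ("ocrd-anybaseocr-dewarp", "OCR-D-DEWARP"),
    ("ocrd-anybaseocr-tiseg", "OCR-D-TISEG"),
    ("ocrd-anybaseocr-block-segmentation", "OCR-D-BLOCK"),
    ("ocrd-anybaseocr-textline", "OCR-D-LINE"),
    ("ocrd-anybaseocr-layout-analysis", "OCR-D-STRUCT"),
]


def chain_commands(first_input, steps):
    """Build one argv list per step, feeding each output group to the next."""
    commands = []
    current = first_input
    for processor, output_group in steps:
        commands.append([processor, "-I", current, "-O", output_group])
        current = output_group
    return commands


commands = chain_commands("OCR-D-IMG", STEPS)
for command in commands:
    print(" ".join(shlex.quote(part) for part in command))
```

Each argv list could be handed to `subprocess.run(command, check=True)` from within a workspace directory so that a failing step stops the chain.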

Testing

To test the tools under realistic conditions (on OCR-D workspaces), download OCR-D/assets. In particular, the code is tested with the dfki-testdata dataset.

To download the data:

make assets

To run module tests:

make test

To run processor/workflow tests:

make cli-test

License

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.

Release history

1.8.2 (this version): Mar 30, 2022
1.8.0: Mar 25, 2022
1.7.0: Feb 22, 2022
1.6.0: May 20, 2021
1.5.0: May 19, 2021
1.4.1: Apr 23, 2021
1.3.0: Jan 28, 2021
1.2.0: Jan 27, 2021
1.1.0: Nov 16, 2020
1.0.1: Aug 24, 2020
1.0.0: Aug 21, 2020
0.0.5: Aug 4, 2020
0.0.4: Jul 8, 2020
0.0.3: May 14, 2020
0.0.2: May 6, 2020
0.0.1: Dec 17, 2019

Download files

Source Distribution

ocrd-anybaseocr-1.8.2.tar.gz
Upload date: Mar 30, 2022
Size: 93.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.7.12
SHA256: `569f0f5f052a64b2105bc2c8de018f4c9e2a49570aa699be541e247fb8762a13`
MD5: `d43b0366d7871c95471dc802f99f8c89`
BLAKE2b-256: `bc835feb2685645467d72f7e37c9057769389df87ed63989da2cfc505b28cb5b`

Built Distribution

ocrd_anybaseocr-1.8.2-py3-none-any.whl
Upload date: Mar 30, 2022
Size: 140.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.10.1 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.7.12
SHA256: `38951e40c1696f157792b75884cfaf19b50d4bd9cc4f291b1f47b3bb515e8949`
MD5: `5dc4db13d62fb047648cc22232dd25da`
BLAKE2b-256: `34764ee327beaa5f092ec018137301bea51ae18e65be588696fbb02b7b3e1625`
