Collection of OCR-D compliant tools for layout analysis and segmentation of historical German-language documents published in Brazil

German-Brazilian Newspapers (gbn)

This project aims to provide an OCR-D compliant toolset for optical layout recognition and analysis on images of historical German-language documents published in Brazil during the 19th and 20th centuries, with a focus on periodical publications.

About

Although a considerable number of digitized German-language periodicals published in Brazil are available online (e.g. the dbp digital collection and the German-language periodicals section of the Brazilian (National) Digital Library), document image understanding of these prints is far from optimal. Generic OCR solutions work out of the box on typical everyday documents, but historical newspapers such as these are harder to process for several reasons:

  • Complex layouts (still a challenge for mainstream OCR toolsets such as ocropy and tesseract)
  • Degradation over time (e.g. stains, rips, erased ink)
  • Poor scanning quality (e.g. lighting contrast)

To achieve better full-text recognition results on the target documents, this project relies on two building blocks: the German-Brazilian Newspapers dataset and the ocrd-sbb-textline-detector tool. The former serves as a pioneering model for layout analysis of German-Brazilian documents (and as a source of test data); the latter serves as a reference implementation of a robust layout analysis workflow for German-language documents. This project was forked from ocrd-sbb-textline-detector, with the aim of splitting the original tool's functionality into several smaller modules and extending it to support more powerful workflows.

Installation

pip3 install git+https://github.com/sulzbals/gbn.git

Usage

Refer to the OCR-D CLI documentation for instructions on running OCR-D tools.
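
As a rough illustration of what such an invocation looks like, the sketch below assembles the argument list for a single processor call. It relies only on the standard OCR-D CLI options `-m` (METS file), `-I` (input file group), `-O` (output file group) and `-p` (parameters as JSON). The helper name `build_processor_command` and the `mets.xml` default are assumptions made for this example and are not part of the project.

```python
import json


def build_processor_command(executable, input_grp, output_grp,
                            parameters=None, mets="mets.xml"):
    """Return an argv list for one OCR-D processor call (hypothetical helper)."""
    argv = [executable, "-m", mets, "-I", input_grp, "-O", output_grp]
    if parameters:
        # -p takes the parameters as a JSON object
        argv += ["-p", json.dumps(parameters, sort_keys=True)]
    return argv


cmd = build_processor_command(
    "ocrd-gbn-sbb-crop",
    input_grp="OCR-D-IMG",
    output_grp="OCR-D-CROP",
    parameters={"model": "/path/to/model_page_mixed_best.h5", "shaping": "resize"},
)
print(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from a workspace directory, provided the processor is installed.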

Tools (gbn.sbb)

ocrd-gbn-sbb-predict

{
 "executable": "ocrd-gbn-sbb-predict",
 "categories": [
  "Layout analysis"
 ],
 "description": "Classifies pixels of input images given a binary (two classes) model and stores the prediction as the specified PAGE-XML content type",
 "steps": [
  "layout/analysis"
 ],
 "input_file_grp": [
  "OCR-D-IMG",
  "OCR-D-BIN"
 ],
 "output_file_grp": [
  "OCR-D-PREDICT"
 ],
 "parameters": {
  "model": {
   "type": "string",
   "description": "Path to Keras model to be used",
   "required": true,
   "cacheable": true
  },
  "shaping": {
   "type": "string",
   "description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
   "required": true,
   "enum": [
    "resize",
    "split"
   ]
  },
  "type": {
   "type": "string",
   "description": "PAGE-XML content type to be predicted",
   "required": true,
   "enum": [
    "AlternativeImageType",
    "BorderType",
    "TextRegionType",
    "TextLineType"
   ]
  },
  "operation_level": {
   "type": "string",
   "description": "PAGE-XML hierarchy level to operate on",
   "default": "page",
   "enum": [
    "page",
    "region",
    "line"
   ]
  }
 }
}
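
The parameter block above can be read as a small schema: `required` marks mandatory parameters and `enum` restricts the allowed values. The following is a minimal sketch of how such a schema could be checked before a run. OCR-D is expected to do its own validation against the tool description, so this is only an illustration; `validate_parameters` is a hypothetical helper, and `PREDICT_PARAMETERS` is an abridged copy of the JSON above.

```python
# Abridged copy of the "parameters" section of ocrd-gbn-sbb-predict
PREDICT_PARAMETERS = {
    "model": {"type": "string", "required": True},
    "shaping": {"type": "string", "required": True, "enum": ["resize", "split"]},
    "type": {
        "type": "string",
        "required": True,
        "enum": ["AlternativeImageType", "BorderType", "TextRegionType", "TextLineType"],
    },
    "operation_level": {
        "type": "string",
        "default": "page",
        "enum": ["page", "region", "line"],
    },
}


def validate_parameters(schema, params):
    """Return a list of human-readable problems (empty if params look valid)."""
    errors = [f"unknown parameter: {name}" for name in params if name not in schema]
    for name, spec in schema.items():
        if name not in params:
            if spec.get("required"):
                errors.append(f"missing required parameter: {name}")
            continue
        value = params[name]
        if spec.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name} must be a string")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name} must be one of {spec['enum']}")
    return errors


print(validate_parameters(PREDICT_PARAMETERS, {"shaping": "crop", "type": "TextLineType"}))
```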

ocrd-gbn-sbb-crop

{
 "executable": "ocrd-gbn-sbb-crop",
 "categories": [
  "Image preprocessing",
  "Layout analysis"
 ],
 "description": "Crops the input page images by predicting the actual page surface and setting the PAGE-XML Border accordingly",
 "steps": [
  "preprocessing/optimization/cropping",
  "layout/analysis"
 ],
 "input_file_grp": [
  "OCR-D-IMG"
 ],
 "output_file_grp": [
  "OCR-D-CROP"
 ],
 "parameters": {
  "model": {
   "type": "string",
   "description": "Path to Keras model to be used",
   "required": true,
   "cacheable": true
  },
  "shaping": {
   "type": "string",
   "description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
   "default": "resize",
   "enum": [
    "resize",
    "split"
   ]
  }
 }
}
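
Unlike ocrd-gbn-sbb-predict, this tool declares a default for `shaping`, so only `model` has to be supplied. The sketch below shows how defaults from such a parameter description combine with user-supplied values. `with_defaults` is a hypothetical helper written for this example, and `CROP_PARAMETERS` is an abridged copy of the JSON above.

```python
# Abridged copy of the "parameters" section of ocrd-gbn-sbb-crop
CROP_PARAMETERS = {
    "model": {"type": "string", "required": True},
    "shaping": {"type": "string", "default": "resize", "enum": ["resize", "split"]},
}


def with_defaults(schema, params):
    """Start from the declared defaults, then let explicit values override them."""
    resolved = {name: spec["default"] for name, spec in schema.items() if "default" in spec}
    resolved.update(params)
    return resolved


print(with_defaults(CROP_PARAMETERS, {"model": "/path/to/model_page_mixed_best.h5"}))
```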

ocrd-gbn-sbb-binarize

{
 "executable": "ocrd-gbn-sbb-binarize",
 "categories": [
  "Image preprocessing",
  "Layout analysis"
 ],
 "description": "Binarizes the input page images by predicting their foreground pixels and saving the result as a PAGE-XML AlternativeImage",
 "steps": [
  "preprocessing/optimization/binarization",
  "layout/analysis"
 ],
 "input_file_grp": [
  "OCR-D-IMG"
 ],
 "output_file_grp": [
  "OCR-D-BIN"
 ],
 "parameters": {
  "model": {
   "type": "string",
   "description": "Path to Keras model to be used",
   "required": true,
   "cacheable": true
  },
  "shaping": {
   "type": "string",
   "description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
   "default": "split",
   "enum": [
    "resize",
    "split"
   ]
  },
  "operation_level": {
   "type": "string",
   "description": "PAGE-XML hierarchy level to operate on",
   "default": "page",
   "enum": [
    "page",
    "region",
    "line"
   ]
  }
 }
}
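
To make the difference between the two `shaping` modes more concrete: `resize` scales the whole image to the model's input shape, whereas `split` cuts the image into patches of that shape. The sketch below computes patch coordinates for the `split` case. It is an illustration only and not the project's actual implementation, which may pad or overlap patches differently. The function names and the example sizes are assumptions.

```python
def patch_starts(length, patch):
    """Start offsets of patches along one axis; the last patch is shifted back
    so that it ends exactly at the image border."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, patch))
    starts.append(length - patch)
    return starts


def patch_boxes(height, width, patch_height, patch_width):
    """Return (top, left, bottom, right) boxes that together cover the image."""
    return [
        (top, left, min(top + patch_height, height), min(left + patch_width, width))
        for top in patch_starts(height, patch_height)
        for left in patch_starts(width, patch_width)
    ]


# Hypothetical sizes: a 2800x2000 page and a model expecting 448x448 inputs
boxes = patch_boxes(2800, 2000, 448, 448)
print(len(boxes), boxes[0], boxes[-1])
```

Shifting the last patch back (rather than padding it) keeps every patch at full size whenever the image is at least as large as the model input, at the cost of some overlap near the right and bottom borders.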

ocrd-gbn-sbb-segment

{
 "executable": "ocrd-gbn-sbb-segment",
 "categories": [
  "Layout analysis"
 ],
 "description": "Segments the input page images by predicting the text regions and lines and setting the PAGE-XML TextRegion and TextLine accordingly",
 "steps": [
  "layout/segmentation/region",
  "layout/segmentation/line"
 ],
 "input_file_grp": [
  "OCR-D-DESKEW"
 ],
 "output_file_grp": [
  "OCR-D-SEG"
 ],
 "parameters": {
  "region_model": {
   "type": "string",
   "description": "Path to Keras model to be used for predicting text regions",
   "default": "",
   "cacheable": true
  },
  "region_shaping": {
   "type": "string",
   "description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
   "default": "split",
   "enum": [
    "resize",
    "split"
   ]
  },
  "line_model": {
   "type": "string",
   "description": "Path to Keras model to be used for predicting text lines",
   "required": true,
   "cacheable": true
  },
  "line_shaping": {
   "type": "string",
   "description": "How the images must be processed in order to match the input shape of the model ('resize' for resizing to model shape and 'split' for splitting into patches)",
   "default": "split",
   "enum": [
    "resize",
    "split"
   ]
  }
 }
}

Library (gbn.lib)

This small library provides an abstraction layer that the OCR-D processors in this project should use for common image processing and deep learning routines. The processors should therefore not access libraries such as OpenCV, NumPy or Keras directly.

Check the source code files for detailed documentation on each class and function of the library.

Models

The models currently in use are those provided by the qurator team. Models for binarization can be found here and for cropping and segmentation here.

There are plans to extend the GBN dataset with more degraded document pages in order to train more robust models in the near future.

Recommended Workflow

The most generic and simple processing steps of ocrd-sbb-textline-detector were not reimplemented, since existing tools already do effectively the same job. Resizing to a height of 2800 pixels is performed by an ImageMagick wrapper for OCR-D (ocrd-im6convert), and deskewing by an ocropy wrapper (ocrd-cis-ocropy).

| Step | Processor | Parameters |
|------|-----------|------------|
| 1 | ocrd-im6convert | { "output-format": "image/png", "output-options": "-geometry x2800" } |
| 2 | ocrd-gbn-sbb-crop | { "model": "/path/to/model_page_mixed_best.h5", "shaping": "resize" } |
| 3 | ocrd-gbn-sbb-binarize | { "model": "/path/to/model_bin4.h5", "shaping": "split", "operation_level": "page" } |
| 4 | ocrd-cis-ocropy-deskew | { "level-of-operation": "page" } |
| 5 | ocrd-gbn-sbb-segment | { "region_model": "/path/to/model_strukturerkennung.h5", "region_shaping": "split", "line_model": "/path/to/model_textline_new.h5", "line_shaping": "split" } |
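
The five steps listed above form a chain in which each processor reads the file group written by the previous one. The sketch below generates one shell command per step to show that chaining. The processors and parameters are taken from the workflow above, but the intermediate file group names (`OCR-D-STEP1` and so on) and the helper `chain_commands` are assumptions made for this example.

```python
import json
import shlex

WORKFLOW = [
    ("ocrd-im6convert", {"output-format": "image/png", "output-options": "-geometry x2800"}),
    ("ocrd-gbn-sbb-crop", {"model": "/path/to/model_page_mixed_best.h5", "shaping": "resize"}),
    ("ocrd-gbn-sbb-binarize", {"model": "/path/to/model_bin4.h5", "shaping": "split",
                               "operation_level": "page"}),
    ("ocrd-cis-ocropy-deskew", {"level-of-operation": "page"}),
    ("ocrd-gbn-sbb-segment", {"region_model": "/path/to/model_strukturerkennung.h5",
                              "region_shaping": "split",
                              "line_model": "/path/to/model_textline_new.h5",
                              "line_shaping": "split"}),
]


def chain_commands(workflow, first_input="OCR-D-IMG"):
    """Build one shell command per step, feeding each output group to the next step."""
    commands = []
    input_grp = first_input
    for step, (executable, params) in enumerate(workflow, start=1):
        output_grp = f"OCR-D-STEP{step}"  # hypothetical group naming
        argv = [executable, "-I", input_grp, "-O", output_grp, "-p", json.dumps(params)]
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
        input_grp = output_grp
    return commands


for command in chain_commands(WORKFLOW):
    print(command)
```

The printed commands are meant to be run in order from inside an OCR-D workspace; any file group names can be substituted as long as each step's input matches the previous step's output.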
