NLP-helper for OCR-ed pages in PAGE XML format.

Blatt

Installation

pip install blatt

or

git clone https://github.com/UB-Mannheim/blatt
cd blatt/
pip install .

How to use

PAGE XML reader, hyphen remover and converter

On initialization, the Page class reads the PAGE XML file at the path PAGEXML and stores its TextRegions, TextLines and Baseline coordinates in the Page object p. The plain text can then be saved to the file TXT:

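from blatt import Page

PAGEXML = "page.xml"  # placeholder: path to the input PAGE XML file
TXT = "page.txt"      # placeholder: path of the output text file
p = Page(PAGEXML)     # reads TextRegions, TextLines and Baselines
p.to_text(TXT)        # writes the plain text to TXT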

By default, the plain text is saved without line breaks: the end-of-line hyphens '-', '-', '⹀' and '⸗' are removed and the split words are merged. If you need line breaks, use p.to_text(TXT, linebreak=True).
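To illustrate what hyphen removal means here, the following is a minimal sketch of merging lines and rejoining words split at a line end. It is not blatt's actual implementation: the function name `merge_lines`, the `HYPHENS` tuple and the choice to join unhyphenated lines with a single space are assumptions made for this example.

```python
# Hyphen characters named in the text above (illustrative constant, not part of blatt).
HYPHENS = ("-", "⹀", "⸗")


def merge_lines(lines, linebreak=False):
    """Join OCR text lines into one string.

    With linebreak=True the lines are joined with newlines and left untouched.
    Otherwise a trailing hyphen is dropped and the word halves are merged;
    lines without a trailing hyphen are joined with a single space (assumption).
    """
    if linebreak:
        return "\n".join(lines)
    merged = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if merged.endswith(HYPHENS):
            merged = merged[:-1] + line  # drop the hyphen, glue the word halves
        elif merged:
            merged += " " + line
        else:
            merged = line
    return merged
```

For example, `merge_lines(["Biblio-", "thek in Mann⸗", "heim"])` yields the single line `Bibliothek in Mannheim`. A hyphen at the very end of the last line is kept, since there is no following line to merge with.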

Command Line Interface

% blatt
Usage: blatt [OPTIONS] COMMAND [ARGS]...

  BLATT CLI: NLP-helper for OCR-ed pages in PAGE XML format. To get help for a
  particular COMMAND, use `blatt COMMAND -h`.

Options:
  --help  Show this message and exit.

Commands:
  convert  Converts PAGE XML files to plain text TXT files
% blatt convert -h
Usage: blatt convert [OPTIONS] PAGE_FOLDER TEXT_FOLDER

  blatt convert: converts all PAGE XML files in PAGE_FOLDER to TXT files in
  TEXT_FOLDER.

Options:
  -lb, --linebreak BOOLEAN  If linebreak==False, it removes hyphens at the end
                            of lines and merges the lines without line breaks.
                            Otherwise, it merges the lines using line breaks.
                            [default: False]
  -h, --help                Show this message and exit.
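The same batch conversion can be approximated from Python with the Page class shown earlier. This is only a sketch under stated assumptions: the helper `convert_folder`, its `page_cls` parameter and the `*.xml` file pattern are hypothetical and not part of blatt; only `Page(path)` and `to_text(path, linebreak=...)` come from the description above.

```python
from pathlib import Path


def convert_folder(page_folder, text_folder, linebreak=False, page_cls=None):
    """Convert every *.xml file in page_folder to a .txt file in text_folder.

    page_cls is a hypothetical hook for passing a stand-in class; by default
    the Page class from blatt is used.
    """
    if page_cls is None:
        from blatt import Page  # imported lazily, only when no stand-in is given
        page_cls = Page
    out_dir = Path(text_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Assumption: PAGE XML files carry the .xml extension.
    for xml_path in sorted(Path(page_folder).glob("*.xml")):
        txt_path = out_dir / (xml_path.stem + ".txt")
        page = page_cls(str(xml_path))
        page.to_text(str(txt_path), linebreak=linebreak)
        written.append(txt_path)
    return written
```

Calling `convert_folder("pages", "texts")` would then write one TXT file per PAGE XML file, with the folder names here being placeholders.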
