NLP-helper for OCR-ed pages in PAGE XML format.

Blatt

Installation

pip install blatt

or

git clone https://github.com/UB-Mannheim/blatt
cd blatt/
pip install .

How to use

Page object

On initialization, the Page class reads the PAGE XML file at the path PAGEXML and stores its TextRegions, TextLines and baseline coordinates in the Page object p.

from blatt import Page
p = Page(PAGEXML)

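The Page object stores unprocessed and processed TextLines as attributes. Printing the object lists each attribute name paired with a number, which appears to be the length of the stored value.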

print(p)
[('root', 2),
 ('namespace', 63),
 ('filename', 24),
 ('text_regions_xml', 38),
 ('text_lines_xml', 260),
 ('text_regions', 260),
 ('text_lines', 260),
 ('baselines', 3651),
 ('text_with_linebreaks', 12111),
 ('text_without_linebreaks', 11979),
 ('sentences', 102),
 ('x_baselines', 3651),
 ('y_baselines', 3651),
 ('center_baseline', 2)]

Hyphen remover & converter to_txt()

The plain text can be saved to TXT:

from blatt import Page
p = Page(PAGEXML)
p.to_txt(TXT)

By default it saves the plain text without line breaks (the hyphens '-', '-', '⹀' and '⸗' are removed and the corresponding words are merged). If you need line breaks, use p.to_txt(TXT, linebreak=True).
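The merging step described above can be illustrated with a small standalone sketch. It is not blatt's actual implementation: the function name merge_lines and the HYPHENS tuple are hypothetical, and the sketch assumes a hyphen only counts when it is the last character of a line.

```python
# Hypothetical illustration of end-of-line hyphen removal (not blatt's own code).
# Assumption: these are the characters treated as end-of-line hyphens.
HYPHENS = ("-", "⹀", "⸗")


def merge_lines(lines, linebreak=False):
    """Join text lines into one string.

    With linebreak=True the lines are joined with newlines and left untouched.
    With linebreak=False a trailing hyphen is dropped and the line is glued
    to the next one; all other lines are joined with a single space.
    """
    if linebreak:
        return "\n".join(lines)
    parts = []
    for line in lines:
        line = line.strip()
        if line.endswith(HYPHENS):
            parts.append(line[:-1])   # drop the hyphen, no space: the word continues
        else:
            parts.append(line + " ")  # ordinary line end: separate with a space
    return "".join(parts).strip()


lines = ["Die Biblio-", "thek ist ge⸗", "öffnet."]
print(merge_lines(lines))
```

A real page also contains words that keep their hyphen across a line break (compound words, for example), which this sketch does not try to detect.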

Sentence splitter & converter to_tsv()

The TextLines or sentences can be saved to TSV:

from blatt import Page
p = Page(PAGEXML)
p.to_tsv(TSV)

By default, it saves TextLines, TextRegionID, TextLineID and Coordinates to TSV. If you use p.to_tsv(TSV, sentence=True), it saves sentences instead of TextLines, one sentence per TSV row. The sentences are split from the plain text without hyphens using the SegTok library.
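To show what "one sentence per TSV row" means, here is a rough standalone sketch. It deliberately uses a naive regular expression as a stand-in for SegTok, so it will mis-split abbreviations and similar cases that a real segmenter handles. The names split_sentences and sentences_to_tsv are hypothetical and not part of blatt.

```python
import csv
import io
import re

# Naive stand-in for a sentence segmenter (NOT SegTok): split after ., ! or ?
# when followed by whitespace and an uppercase letter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])")


def split_sentences(text):
    """Return the sentences found in text, without surrounding whitespace."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


def sentences_to_tsv(text, handle):
    """Write each sentence of text as its own row to a file-like object."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for sentence in split_sentences(text):
        writer.writerow([sentence])


buffer = io.StringIO()
sentences_to_tsv("Das ist ein Satz. Hier ist noch einer! Und ein dritter?", buffer)
print(buffer.getvalue())
```

Splitting on the text without line breaks matters here: a sentence that spans several TextLines ends up as a single row.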

Command Line Interface

% blatt        
Usage: blatt [OPTIONS] COMMAND [ARGS]...

  Blatt CLI: NLP-helper for OCR-ed pages in PAGE XML format. To get help for a
  particular COMMAND, use `blatt COMMAND -h`.

Options:
  -h, --help  Show this message and exit.

Commands:
  to_tsv  Converts PAGE XML files to TSV files with TextLines or sentences
  to_txt  Converts PAGE XML files to TXT files with or without line breaks &
          hyphens
% blatt to_txt -h
Usage: blatt to_txt [OPTIONS] PAGE_FOLDER TEXT_FOLDER

  blatt to_txt: converts all PAGE XML files in PAGE_FOLDER to TXT files
  with/without hyphens in TEXT_FOLDER.

Options:
  -lb, --linebreak BOOLEAN  If linebreak==False, it removes hyphens at the end
                            of lines and merges the lines without line breaks.
                            Otherwise, it merges the lines using line breaks.
                            [default: False]
  -h, --help                Show this message and exit.
% blatt to_tsv -h
Usage: blatt to_tsv [OPTIONS] PAGE_FOLDER TSV_FOLDER

  blatt to_tsv: converts all PAGE XML files in PAGE_FOLDER to TSV files in
  TSV_FOLDER.

Options:
  -s, --sentence BOOLEAN  If sentence==False, it saves TextLines,
                          TextRegionID, TextLineID and Coordinates to TSV.
                          Otherwise, it saves sentences (not lines!) into
                          separate lines of TSV. The sentences are split from
                          the plain text without hyphens using the SegTok
                          library.  [default: False]
  -h, --help              Show this message and exit.
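The folder-to-folder behaviour of the CLI commands can be approximated in Python as follows. This is a sketch under assumptions, not the CLI's actual code: the helper names plan_conversion and convert_folder are hypothetical, input files are assumed to end in .xml, and output files are assumed to reuse the input file name with a new suffix. The conversion itself is passed in as a callable so the sketch runs without blatt installed.

```python
from pathlib import Path


def plan_conversion(page_folder, out_folder, suffix=".txt"):
    """Pair every *.xml file in page_folder with an output path in out_folder."""
    page_folder, out_folder = Path(page_folder), Path(out_folder)
    return [
        (src, out_folder / (src.stem + suffix))
        for src in sorted(page_folder.glob("*.xml"))
    ]


def convert_folder(page_folder, out_folder, convert, suffix=".txt"):
    """Call convert(src, dst) for every planned pair and return the pairs.

    convert is any callable taking an input path and an output path.
    """
    Path(out_folder).mkdir(parents=True, exist_ok=True)
    pairs = plan_conversion(page_folder, out_folder, suffix)
    for src, dst in pairs:
        convert(src, dst)
    return pairs
```

With blatt installed, the callable could wrap the API shown earlier, for example lambda src, dst: Page(str(src)).to_txt(str(dst)), or the to_tsv counterpart with suffix=".tsv".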
