NLP-helper for OCR-ed pages in PAGE XML format.

Blatt

Installation

pip install blatt

or

git clone https://github.com/UB-Mannheim/blatt
cd blatt/
pip install .

How to use

Page object

On initialization, the Page class reads the PAGE XML file at the path PAGEXML and stores its TextRegions, TextLines and baseline coordinates in the Page object p.

from blatt import Page
p = Page(PAGEXML)

The Page-object stores unprocessed and processed TextLines as attributes.

print(p)
[('root', 2),
 ('namespace', 63),
 ('filename', 24),
 ('text_regions_xml', 38),
 ('text_lines_xml', 260),
 ('text_regions', 260),
 ('text_lines', 260),
 ('baselines', 3651),
 ('text_with_linebreaks', 12111),
 ('text_without_linebreaks', 11979),
 ('sentences', 102),
 ('x_baselines', 3651),
 ('y_baselines', 3651),
 ('center_baseline', 2)]
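
To make the attributes above more concrete, here is a minimal sketch of how TextLines and baselines can be pulled out of a PAGE XML document with the standard library alone. It is not blatt's implementation: the function name read_page, the dictionary keys and the sample document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written PAGE XML document (illustrative, not real OCR output).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.png" imageWidth="200" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <Baseline points="10,20 110,22"/>
        <TextEquiv><Unicode>Hello wor-</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Baseline points="10,40 90,41"/>
        <TextEquiv><Unicode>ld.</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_page(xml_text):
    """Hypothetical helper: return one dict per TextLine in reading order."""
    root = ET.fromstring(xml_text)
    # The root tag looks like '{namespace}PcGts'; keep the '{namespace}' prefix.
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    lines = []
    for region in root.iter(f"{ns}TextRegion"):
        for line in region.findall(f"{ns}TextLine"):
            unicode_el = line.find(f"{ns}TextEquiv/{ns}Unicode")
            baseline_el = line.find(f"{ns}Baseline")
            points = []
            if baseline_el is not None:
                # 'x1,y1 x2,y2 ...' becomes [(x1, y1), (x2, y2), ...]
                points = [
                    tuple(int(v) for v in pt.split(","))
                    for pt in baseline_el.get("points", "").split()
                ]
            lines.append(
                {
                    "region_id": region.get("id"),
                    "line_id": line.get("id"),
                    "text": (unicode_el.text or "") if unicode_el is not None else "",
                    "baseline": points,
                }
            )
    return lines


for entry in read_page(SAMPLE):
    print(entry["region_id"], entry["line_id"], entry["text"], entry["baseline"])
```

The namespace is read from the root element rather than hard-coded, so the sketch should also work with other PAGE schema versions.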

Hyphen remover & converter to_txt()

The plain text can be saved to TXT:

from blatt import Page
p = Page(PAGEXML)
p.to_txt(TXT)

By default, the plain text is saved without line breaks: the end-of-line hyphens '-', '-', '⹀' and '⸗' are removed and the split words are merged. To keep the line breaks, use p.to_txt(TXT, linebreak=True).
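
The merging step described above can be sketched in a few lines. This is only an illustration under assumptions, not blatt's code: merge_lines and HYPHENS are hypothetical names, the hyphen set is a subset of the marks listed above, and keeping hyphens untouched when linebreak=True is an assumption.

```python
# Assumed subset of end-of-line hyphen marks (hypothetical constant).
HYPHENS = ("-", "⹀", "⸗")


def merge_lines(lines, linebreak=False):
    """Hypothetical helper: join text lines, optionally removing hyphenation."""
    if linebreak:
        # Keep the original line structure (and hyphens) as they are.
        return "\n".join(lines)
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith(HYPHENS):
            # Drop the trailing hyphen and glue the two word halves together.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


sample = ["Die Biblio-", "thek ist", "geöff⸗", "net."]
print(merge_lines(sample))
print(merge_lines(sample, linebreak=True))
```

A rule this simple also merges genuine compound hyphens that happen to fall at a line end, so real-world text may need extra checks.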

Sentence splitter & converter to_tsv()

The TextLines or sentences can be saved to TSV:

from blatt import Page
p = Page(PAGEXML)
p.to_tsv(TSV)

By default, it saves TextLines, TextRegionID, TextLineID and Coordinates to TSV. With p.to_tsv(TSV, sentence=True), it saves sentences (not lines!), one per row of the TSV. The sentences are split from the hyphen-free plain text using the SegTok library.
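
As a rough picture of the default (line-based) output, the following sketch writes one row per TextLine with the standard csv module. The column names, their order and the helper lines_to_tsv are assumptions for illustration; the actual layout of blatt's TSV files may differ.

```python
import csv
import io


def lines_to_tsv(rows, handle):
    """Hypothetical helper: write one tab-separated row per TextLine."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumed header; blatt's real column names are not documented here.
    writer.writerow(["text_line", "text_region_id", "text_line_id", "coordinates"])
    for row in rows:
        writer.writerow([row["text"], row["region_id"], row["line_id"], row["coords"]])


rows = [
    {"text": "Hello wor-", "region_id": "r1", "line_id": "r1l1", "coords": "10,20 110,22"},
    {"text": "ld.", "region_id": "r1", "line_id": "r1l2", "coords": "10,40 90,41"},
]

buffer = io.StringIO()
lines_to_tsv(rows, buffer)
print(buffer.getvalue())
```

Writing to a file instead only requires passing a handle opened with open(path, "w", newline="", encoding="utf-8").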

Command Line Interface

% blatt
Usage: blatt [OPTIONS] COMMAND [ARGS]...

  Blatt CLI: NLP-helper for OCR-ed pages in PAGE XML format. To get help for a
  particular COMMAND, use `blatt COMMAND -h`.

Options:
  -h, --help  Show this message and exit.

Commands:
  to_tsv  Converts PAGE XML files to TSV files with TextLines or sentences
  to_txt  Converts PAGE XML files to TXT files with or without line breaks &
          hyphens
% blatt to_txt -h
Usage: blatt to_txt [OPTIONS] PAGE_FOLDER TEXT_FOLDER

  blatt to_txt: converts all PAGE XML files in PAGE_FOLDER to TXT files
  with/without hyphens in TEXT_FOLDER.

Options:
  -lb, --linebreak BOOLEAN  If linebreak==False, it removes hyphens at the end
                            of lines and merges the lines without line breaks.
                            Otherwise, it merges the lines using line breaks.
                            [default: False]
  -h, --help                Show this message and exit.
% blatt to_tsv -h
Usage: blatt to_tsv [OPTIONS] PAGE_FOLDER TSV_FOLDER

  blatt to_tsv: converts all PAGE XML files in PAGE_FOLDER to TSV files in
  TSV_FOLDER.

Options:
  -s, --sentence BOOLEAN  If sentence==False, it saves TextLines,
                          TextRegionID, TextLineID and Coordinates to TSV.
                          Otherwise, it saves sentences (not lines!) into
                          separate lines of TSV. The sentences are split from
                          the plain text without hyphens using the SegTok
                          library.  [default: False]
  -h, --help              Show this message and exit.
