
Working with QURATOR TSV, especially for neat


TSV - Processing Tools

Create .tsv files that can be viewed and edited with neat.

Installation:

Clone this project and the SBB-utils repository.

Set up a virtual environment:

virtualenv --python=python3.6 venv

Activate the virtual environment:

source venv/bin/activate

Upgrade pip:

pip install -U pip

Install the packages together with their dependencies in development mode:

pip install -e sbb_utils
pip install -e page2tsv

PAGE-XML to TSV Transformation:

Create a TSV file from OCR in PAGE-XML format (with word segmentation):

page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1

To create a single TSV file from multiple PAGE-XML files, call the tool once per PAGE-XML file, passing the same TSV file each time:

page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
...
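
For a large number of pages, the successive calls above can be generated rather than typed by hand. The following is a minimal Python sketch, not part of the package: the helper names `build_commands` and `run_all` are hypothetical, and it assumes `page2tsv` is on the PATH of the active virtual environment. By default it only prints the commands (dry run).

```python
import subprocess
from typing import List


def build_commands(page_files: List[str], tsv_file: str,
                   image_urls: List[str]) -> List[List[str]]:
    """Return one page2tsv argument list per PAGE-XML file.

    Every command targets the same TSV file, mirroring the
    successive calls shown above.
    """
    if len(page_files) != len(image_urls):
        raise ValueError("need exactly one image URL per PAGE-XML file")
    return [
        ["page2tsv", page, tsv_file, f"--image-url={url}"]
        for page, url in zip(page_files, image_urls)
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing call.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    pages = [f"PAGE{i}.xml" for i in range(1, 6)]
    urls = [f"http://link-to-corresponding-image-{i}" for i in range(1, 6)]
    run_all(build_commands(pages, "PAGE.tsv", urls))
```

Call `run_all(..., dry_run=False)` to actually execute the commands. The calls run sequentially on purpose, since they all use the same TSV file.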

For instance, for the file example.xml:

page2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg
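
The URL in this example contains the literal segment `left,top,width,height`. It looks like a placeholder for an image region that a viewer replaces with pixel coordinates, in the style of an IIIF Image API region request; this reading is an assumption and is not stated in this description. Under that assumption, filling the placeholder could be sketched in Python as follows (the function name `region_url` is hypothetical):

```python
PLACEHOLDER = "left,top,width,height"


def region_url(template: str, left: int, top: int,
               width: int, height: int) -> str:
    """Replace the region placeholder in an image URL template.

    Assumption: the placeholder stands for a pixel region given as
    left, top, width, height.
    """
    if PLACEHOLDER not in template:
        raise ValueError("URL template has no region placeholder")
    return template.replace(PLACEHOLDER, f"{left},{top},{width},{height}")


if __name__ == "__main__":
    template = ("http://content.staatsbibliothek-berlin.de/zefys/"
                "SNP27646518-18800101-0-3-0-0/"
                "left,top,width,height/full/0/default.jpg")
    # The coordinates here are made-up illustration values.
    print(region_url(template, 100, 200, 300, 50))
```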

Processing of already existing TSV files:

Create a URL-annotated TSV file from an existing TSV file:

annotate-tsv enp_DE.tsv enp_DE-annotated.tsv

Command-line interface:

page2tsv [OPTIONS] PAGE_XML_FILE TSV_OUT_FILE

Options:
  --purpose [NERD|OCR]      Purpose of output tsv file.
                            
                            NERD: NER/NED application/ground-truth creation.
                            
                            OCR: OCR application/ground-truth creation.
                            
                            default: NERD.
  --image-url TEXT
  --ner-rest-endpoint TEXT  REST endpoint of sbb_ner service. See
                            https://github.com/qurator-spk/sbb_ner for
                            details. Only applicable in case of NERD.
  --ned-rest-endpoint TEXT  REST endpoint of sbb_ned service. See
                            https://github.com/qurator-spk/sbb_ned for
                            details. Only applicable in case of NERD.
  --noproxy                 disable proxy. default: enabled.
  --scale-factor FLOAT      default: 1.0
  --ned-threshold FLOAT
  --min-confidence FLOAT
  --max-confidence FLOAT
  --ned-priority INTEGER
  --help                    Show this message and exit.
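
When calling page2tsv from a script, it can help to check the option combinations before starting the tool. The sketch below is a hypothetical Python helper (`page2tsv_args` is not part of the package) that uses only options listed in the help text above. It encodes two constraints from that text: `--purpose` accepts only NERD or OCR, and the two REST endpoint options apply only to NERD. Rejecting the endpoints for OCR is a choice made in this sketch; how the tool itself reacts to that combination is not stated here.

```python
from typing import List, Optional


def page2tsv_args(page_xml: str, tsv_out: str, purpose: str = "NERD",
                  image_url: Optional[str] = None,
                  ner_rest_endpoint: Optional[str] = None,
                  ned_rest_endpoint: Optional[str] = None,
                  scale_factor: Optional[float] = None) -> List[str]:
    """Build an argument list for page2tsv from validated options."""
    if purpose not in ("NERD", "OCR"):
        raise ValueError("purpose must be NERD or OCR")
    if purpose != "NERD" and (ner_rest_endpoint or ned_rest_endpoint):
        # The help text says the endpoints are only applicable to NERD.
        raise ValueError("REST endpoints are only applicable to NERD")

    args = ["page2tsv", "--purpose", purpose]
    if image_url is not None:
        args += ["--image-url", image_url]
    if ner_rest_endpoint is not None:
        args += ["--ner-rest-endpoint", ner_rest_endpoint]
    if ned_rest_endpoint is not None:
        args += ["--ned-rest-endpoint", ned_rest_endpoint]
    if scale_factor is not None:
        args += ["--scale-factor", str(scale_factor)]
    # Positional arguments follow the options, as in the usage line.
    args += [page_xml, tsv_out]
    return args


if __name__ == "__main__":
    print(" ".join(page2tsv_args("example.xml", "example.tsv",
                                 purpose="OCR", scale_factor=0.5)))
```

The returned list can be passed to `subprocess.run(args, check=True)` once page2tsv is installed.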
