Working with QURATOR TSV, especially for neat
TSV - Processing Tools
Create .tsv files that can be viewed and edited with neat.
Installation:
Clone this project and the SBB-utils repository.
Set up a virtual environment:
virtualenv --python=python3.6 venv
Activate virtual environment:
source venv/bin/activate
Upgrade pip:
pip install -U pip
Install both packages together with their dependencies in development mode:
pip install -e sbb_utils
pip install -e page2tsv
PAGE-XML to TSV Transformation:
Create a TSV file from OCR in PAGE-XML format (with word segmentation):
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
To create a single TSV file from multiple PAGE-XML files, call the tool once per PAGE-XML file, passing the same TSV file each time:
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
...
...
...
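The successive calls above can also be generated from a list of pages. The following is a minimal sketch, not part of the package: the helper name build_page2tsv_commands and the page list are hypothetical, and the commands are only printed. It assumes page2tsv is called exactly as shown above, with the same TSV file in every call.

```python
import shlex


def build_page2tsv_commands(pages, tsv_out):
    """Return one page2tsv argument list per (PAGE-XML path, image URL) pair.

    Every command targets the same TSV file, mirroring the successive
    calls shown above. This helper is hypothetical, not part of the package.
    """
    return [
        ["page2tsv", page_xml, tsv_out, f"--image-url={image_url}"]
        for page_xml, image_url in pages
    ]


# Placeholder file names and URLs, taken from the example calls above.
pages = [
    ("PAGE1.xml", "http://link-to-corresponding-image-1"),
    ("PAGE2.xml", "http://link-to-corresponding-image-2"),
]

commands = build_page2tsv_commands(pages, "PAGE.tsv")
for argv in commands:
    # Dry run: print each command instead of executing it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

To actually run the commands, each argument list could be passed to subprocess.run(argv, check=True) once page2tsv is installed in the active virtual environment.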
For instance, for the file example.xml:
page2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg
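The image URL in this example contains the literal text left,top,width,height, which appears to act as a placeholder for an image region. The source does not state how the tools consume it, so the sketch below only assumes plain text substitution. The function name fill_region and the coordinates are made up for illustration.

```python
REGION_PLACEHOLDER = "left,top,width,height"


def fill_region(url_template, left, top, width, height):
    """Replace the region placeholder in url_template with pixel coordinates.

    Assumption: the placeholder is substituted as plain text. This helper
    is hypothetical and not part of the package.
    """
    if REGION_PLACEHOLDER not in url_template:
        raise ValueError("URL template does not contain the region placeholder")
    region = f"{left},{top},{width},{height}"
    return url_template.replace(REGION_PLACEHOLDER, region, 1)


template = (
    "http://content.staatsbibliothek-berlin.de/zefys/"
    "SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg"
)

# Illustrative coordinates only; they are not taken from example.xml.
url = fill_region(template, 100, 200, 300, 50)
print(url)
```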
Processing of already existing TSV files:
Create a URL-annotated TSV file from an existing TSV file:
annotate-tsv enp_DE.tsv enp_DE-annotated.tsv
Command-line interface:
page2tsv [OPTIONS] PAGE_XML_FILE TSV_OUT_FILE
Options:
  --purpose [NERD|OCR]      Purpose of output tsv file.
                            NERD: NER/NED application/ground-truth creation.
                            OCR: OCR application/ground-truth creation.
                            default: NERD.
  --image-url TEXT
  --ner-rest-endpoint TEXT  REST endpoint of sbb_ner service. See
                            https://github.com/qurator-spk/sbb_ner for
                            details. Only applicable in case of NERD.
  --ned-rest-endpoint TEXT  REST endpoint of sbb_ned service. See
                            https://github.com/qurator-spk/sbb_ned for
                            details. Only applicable in case of NERD.
  --noproxy                 disable proxy. default: enabled.
  --scale-factor FLOAT      default: 1.0
  --ned-threshold FLOAT
  --min-confidence FLOAT
  --max-confidence FLOAT
  --ned-priority INTEGER
  --help                    Show this message and exit.
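The options listed above can be combined in one call. The following sketch assembles such a call and checks the --purpose value before anything is run. The function page2tsv_argv is hypothetical, and the sketch assumes that every valued option accepts the --option=value form used for --image-url in the examples above.

```python
def page2tsv_argv(page_xml, tsv_out, purpose="NERD", image_url=None,
                  scale_factor=None, noproxy=False):
    """Build a page2tsv argument list from a few of the listed options.

    Hypothetical helper: only --purpose, --image-url, --scale-factor and
    --noproxy are covered, and nothing is executed here.
    """
    if purpose not in ("NERD", "OCR"):
        raise ValueError("purpose must be 'NERD' or 'OCR'")
    argv = ["page2tsv", f"--purpose={purpose}"]
    if image_url is not None:
        argv.append(f"--image-url={image_url}")
    if scale_factor is not None:
        argv.append(f"--scale-factor={float(scale_factor)}")
    if noproxy:
        argv.append("--noproxy")
    # Positional arguments follow the usage line: PAGE_XML_FILE TSV_OUT_FILE.
    argv += [page_xml, tsv_out]
    return argv


argv = page2tsv_argv("PAGE1.xml", "PAGE.tsv", purpose="OCR",
                     image_url="http://link-to-corresponding-image-1",
                     scale_factor=0.5)
print(argv)
```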