Tools to read OPUS
OpusTools
Tools for accessing and processing OPUS data.
- opus_read: read parallel data sets and convert to different output formats
- opus_cat: extract given OPUS document from release data
opus_read
Usage
In your_script.py, first import the package:
import package_name
To give the arguments on the command line, initialize a PairPrinter with an empty argument list:
opus_reader = package_name.PairPrinter([])
and then run in the terminal:
python3 your_script.py -d Books -s en -t fi
You can alternatively initialize a PairPrinter with arguments in a list:
opus_reader = package_name.PairPrinter(["-d", "Books", "-s", "en", "-t", "fi"])
and run the script without arguments in the terminal:
python3 your_script.py
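The two ways of initializing a PairPrinter suggest a pattern along the following lines. This is a minimal sketch built on argparse, not the package's actual code: the function name `parse_reader_args` is hypothetical, and the fallback to the command line when the list is empty is an assumption about how PairPrinter behaves. Only a few of the flags are modeled.

```python
import argparse
import sys


def parse_reader_args(arguments):
    """Parse opus_read-style options from a list, or from the command line
    when the list is empty (assumed behavior, see the note above)."""
    parser = argparse.ArgumentParser(prog="opus_read")
    parser.add_argument("-d", required=True, help="Corpus name")
    parser.add_argument("-s", required=True, help="Source language")
    parser.add_argument("-t", required=True, help="Target language")
    parser.add_argument("-m", type=int, default=None,
                        help="Maximum number of alignments")
    if len(arguments) == 0:
        # Empty list: read the options typed after the script name.
        return parser.parse_args(sys.argv[1:])
    # Non-empty list: ignore the command line and use the list as given.
    return parser.parse_args(arguments)


args = parse_reader_args(["-d", "Books", "-s", "en", "-t", "fi"])
print(args.d, args.s, args.t)
```

Passing the options as a list keeps a script reproducible, while the empty list leaves the choice of corpus and languages to whoever runs the script.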
Read sentence alignments in XCES align format:
python3 opus_read.py -d Books -s en -t fi
Print alignments whose alignment certainty is greater than 0 (LinkThr=0):
python3 opus_read.py -d MultiUn -s en -t es -a certainty -tr 0
Print first 10 alignment pairs:
python3 opus_read.py -d Books -s en -t fi -m 10
Print XCES align format of all 1:1 sentence alignments:
python3 opus_read.py -d Books -s en -t fi -S 1 -T 1 -wm links
python3 opus_read.py [-h] -d D -s S -t T [-r R] [-p P] [-m M] [-S S]
[-T T] [-a A] [-tr TR] [-ln] [-w W] [-wm WM] [-f]
[-rd RD] [-af AF] [-cm CM] [-pa] [-ca CA]
optional arguments:
  -h, --help  show this help message and exit
  -d D        Corpus name
  -s S        Source language
  -t T        Target language
  -r R        Release (default=latest)
  -p P        Pre-process-type (raw, xml or parsed, default=xml)
  -m M        Maximum number of alignments
  -S S        Maximum number of source sentences in alignments (range is allowed, e.g. -S 1-2)
  -T T        Maximum number of target sentences in alignments (range is allowed, e.g. -T 1-2)
  -a A        Set attribute for filtering
  -tr TR      Set threshold for an attribute
  -ln         Leave non-alignments out
  -w W        Write to file. Enter two file names separated by a comma when writing in moses format (e.g. -w moses.src,moses.trg). Otherwise enter one file name.
  -wm WM      Set writing mode (normal, moses, tmx, links)
  -f          Fast parsing. Faster than normal parsing if you print a small part of the whole corpus, but requires the sentence ids in alignment files to be in sequence.
  -rd RD      Change root directory (default=/proj/nlpl/data/OPUS/)
  -af AF      Use given alignment file
  -cm CM      Change moses delimiter (default=tab)
  -pa         Print annotations, if they exist
  -ca CA      Change annotation delimiter (default=|)
Description
opus_read.py is a simple script that reads sentence alignments stored in XCES align format and prints the aligned sentences to STDOUT. It requires monolingual alignments of sentences in linked XML files. Linked XML files are specified in the "toDoc" and "fromDoc" attributes (see below).
Several parameters can be set to filter the alignments and to print only certain types of alignments.
opus_read.py can also be used to filter the XCES alignment files and to print the remaining links in the same XCES align format. Set the "-wm" flag to "links" to enable this mode.
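To illustrate what filtering an alignment file and keeping the same format can look like, here is a rough sketch using xml.etree.ElementTree. It is not the script's implementation. The alignment fragment, its file names and its certainty values are made up, the function names are hypothetical, and the sketch assumes that "-S 1 -T 1" means exactly one sentence on each side and that the threshold keeps only values strictly greater than it.

```python
import xml.etree.ElementTree as ET

# Made-up XCES fragment for illustration only.
XCES = """<cesAlign version="1.0">
<linkGrp targType="s" fromDoc="en/a.xml.gz" toDoc="fi/a.xml.gz">
<link xtargets="s1;s1" id="SL1" certainty="0.8" />
<link xtargets="s2 s3;s2" id="SL2" certainty="0.3" />
<link xtargets="s4;" id="SL3" certainty="-0.2" />
<link xtargets="s5;s3" id="SL4" certainty="0.05" />
</linkGrp>
</cesAlign>"""


def is_one_to_one(link):
    """True if the link pairs exactly one source id with one target id."""
    source_ids, target_ids = link.get("xtargets").split(";")
    return len(source_ids.split()) == 1 and len(target_ids.split()) == 1


def filter_links(xces_text, attribute=None, threshold=0.0):
    """Drop links that are not 1:1 or whose attribute is not above the
    threshold, and return the remaining document as XML text."""
    root = ET.fromstring(xces_text)
    for group in root.iter("linkGrp"):
        # Copy the list first, because links are removed while looping.
        for link in list(group.findall("link")):
            if not is_one_to_one(link):
                group.remove(link)
            elif attribute is not None:
                value = float(link.get(attribute, "-inf"))
                if value <= threshold:
                    group.remove(link)
    return ET.tostring(root, encoding="unicode")


only_one_to_one = filter_links(XCES)
confident = filter_links(XCES, attribute="certainty", threshold=0.1)
print(confident)
```

Because the output is still a cesAlign document, the filtered file can be fed back to any reader of the same format.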
opus_read.py reads the alignments from zip files. Starting up the script might take some time if the zip files are large (for example OpenSubtitles in OPUS).
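Reading a document straight from a zip archive can be sketched with Python's standard zipfile module. The archive and the member name below are made up and built in memory so that the example runs on its own, and `read_member` is a hypothetical helper rather than a function of this package. One plausible contributor to the startup time is that opening an archive reads its directory of members, which grows with the number of files.

```python
import io
import zipfile

# Build a tiny in-memory archive; the member name only mimics a corpus layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Books/xml/en/example.xml",
                     '<s id="s1">Hello world</s>\n')
buffer.seek(0)


def read_member(zip_source, member_name):
    """Return the decoded text of one member without unpacking the archive."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member_name) as handle:
            return handle.read().decode("utf-8")


text = read_member(buffer, "Books/xml/en/example.xml")
print(text)
```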
opus_read.py uses ExhaustiveSentenceParser by default. This means that each time a <linkGrp> tag is found, the corresponding source and target documents are read through and each sentence is stored in a hashmap with the sentence id as the key. This allows the reader to read alignment files that have sentence ids in non-sequential order. Each time a <linkGrp> tag is found, the script pauses printing while it reads through the source and target documents. The duration of the pause depends on the size of the source and target documents.
Using the "-f" flag switches to SentenceParser, which is faster than ExhaustiveSentenceParser if only a small part of the corpus is read. SentenceParser does not store the sentences in a hashmap. Rather, when it finds a <link> tag, it iterates through a sentence file until a sentence id matches the id found in the <link> tag. SentenceParser cannot go backwards, which means that if the ids are not in sequential order in the alignment file, the parser will not find alignment pairs after a certain point. SentenceParser is less reliable than ExhaustiveSentenceParser, and the only reason to use the "-f" flag is when the whole corpus does not need to be scanned, in other words, when using the "-m" flag.
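The forward-only behavior and its failure mode can be shown with a small sketch. It is not the code of SentenceParser: `ForwardOnlyReader` is a hypothetical class, the document is made up, and for brevity the sketch parses the document in memory and then walks it with a one-way iterator, whereas a real implementation would presumably read the file incrementally.

```python
import xml.etree.ElementTree as ET

# Made-up document with three sentences in order.
DOC = ('<text><s id="s1">One .</s><s id="s2">Two .</s>'
       '<s id="s3">Three .</s></text>')


class ForwardOnlyReader:
    """Scan sentences in document order; consumed sentences are gone."""

    def __init__(self, document_text):
        # A single iterator shared by all lookups, so it never rewinds.
        self._sentences = ET.fromstring(document_text).iter("s")

    def find(self, sentence_id):
        """Advance until the id matches; return None if the end is reached."""
        for sentence in self._sentences:
            if sentence.get("id") == sentence_id:
                return sentence.text
        return None


# Ids requested in document order are all found.
reader = ForwardOnlyReader(DOC)
in_order = [reader.find(i) for i in ["s1", "s2"]]

# Asking for s3 first skips past s1, so the later request for s1 fails.
reader = ForwardOnlyReader(DOC)
out_of_order = [reader.find(i) for i in ["s3", "s1"]]
print(in_order, out_of_order)
```

No index is built, so the first pairs are available immediately, which is why this strategy pays off only when a small number of alignments is requested.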
opus_cat
Description
opus_cat.py reads a document from OPUS and prints it to STDOUT.
Usage
python3 opus_cat.py [-h] -d D -l L [-i] [-m M]
optional arguments:
  -h, --help  show this help message and exit
  -d D        Corpus name
  -l L        Language
  -i          Print without ids
  -m M        Maximum number of sentences