Tools to read OPUS

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

OpusTools

Tools for accessing and processing OPUS data.

opus_read: read parallel data sets and convert to different output formats
opus_express: create test/dev/train sets from OPUS data
opus_cat: extract given OPUS document from release data
opus_get: download files from OPUS
opus_langid: add language ids to sentences in xml files in zip archives

Installation:

pip install opustools

pip install opustools[langid] to use opus_langid

pip install opustools[all] to install all optional requirements

opus_read

Usage

usage: opus_read [-h] -d corpus_name -s langid -t langid [-r version]
                 [-p {raw,xml,parsed,moses}] [-m M] [-S S] [-T T] [-a attribute]
                 [-tr TR] [-ln] [-w file_name [file_name ...]]
                 [-wm {normal,moses,tmx,links}] [-pn] [-rd path_to_dir]
                 [-af path_to_file] [-sz path_to_zip] [-tz path_to_zip]
                 [-cm delimiter] [-pa] [-sa attribute [attribute ...]]
                 [-ta attribute [attribute ...]] [-ca delimiter]
                 [--src_cld2 lang_id score] [--trg_cld2 lang_id score]
                 [--src_langid lang_id score] [--trg_langid lang_id score]
                 [-id file_name] [-q] [-dl DOWNLOAD_DIR] [-pi] [-n regex]
                 [-N regex] [-cs CHUNK_SIZE] [--doc_level] [--len_name N] [-v]

arguments:

-h, --help          show this help message and exit
-d corpus_name, --directory corpus_name
                    Corpus name
-s langid, --source langid
                    Source language
-t langid, --target langid
                    Target language
-r version, --release version
                    Release (default=latest)
-p {raw,xml,parsed,moses}, --preprocess {raw,xml,parsed,moses}
                    Preprocess-type (raw, xml, parsed or moses, default=xml)
-m MAXIMUM, --maximum MAXIMUM   Maximum number of alignments
-S SRC_RANGE, --src_range SRC_RANGE
                    Number of source sentences in alignments (range is
                    allowed, eg. -S 1-2)
-T TGT_RANGE, --tgt_range TGT_RANGE
                    Number of target sentences in alignments (range is
                    allowed, eg. -T 1-2)
-a attribute, --attribute attribute
                    Set attribute for filtering
-tr THRESHOLD, --threshold THRESHOLD
                    Set threshold for an attribute
-ln, --leave_non_alignments_out
                    Leave non-alignments out
-w file_name [file_name ...], --write file_name [file_name ...]
                    Write to file. To print moses format in separate
                    files, enter two file names. Otherwise enter one file
                    name.
-wm {normal,moses,tmx,links}, --write_mode {normal,moses,tmx,links}
                    Set write mode
-pn, --print_file_names
                    Print file names when using moses format
-rd path_to_dir, --root_directory path_to_dir
                    Change root directory (default=/projappl/nlpl/data/OPUS)
-af path_to_file, --alignment_file path_to_file
                    Use given alignment file
-sz path_to_zip, --source_zip path_to_zip
                    Use given source zip file
-tz path_to_zip, --target_zip path_to_zip
                    Use given target zip file
-cm delimiter, --change_moses_delimiter delimiter
                    Change moses delimiter (default=tab)
-pa, --print_annotations
                    Print annotations, if they exist
-sa attribute [attribute ...], --source_annotations attribute [attribute ...]
                    Set source sentence annotation attributes to be
                    printed, e.g. -sa pos lem. To print all available
                    attributes use -sa all_attrs (default=pos lem)
-ta attribute [attribute ...], --target_annotations attribute [attribute ...]
                    Set target sentence annotation attributes to be
                    printed, e.g. -ta pos lem. To print all available
                    attributes use -ta all_attrs (default=pos lem)
-ca delimiter, --change_annotation_delimiter delimiter
                    Change annotation delimiter (default=|)
--src_cld2 lang_id score
                    Filter source sentences by their cld2 language id
                    labels and confidence score, e.g. en 0.9
--trg_cld2 lang_id score
                    Filter target sentences by their cld2 language id
                    labels and confidence score, e.g. en 0.9
--src_langid lang_id score
                    Filter source sentences by their langid.py language id
                    labels and confidence score, e.g. en 0.9
--trg_langid lang_id score
                    Filter target sentences by their langid.py language id
                    labels and confidence score, e.g. en 0.9
-id file_name, --write_ids file_name
                    Write sentence ids to a file.
-q, --suppress_prompts
                    Download necessary files without prompting "(y/n)"
-dl DOWNLOAD_DIR, --download_dir DOWNLOAD_DIR
                    Set download directory (default=current directory)
-pi, --preserve_inline_tags
                    Preserve inline tags within sentences
-n regex            Get only documents that match the regex
-N regex            Skip all documents that match the regex
-cs CHUNK_SIZE, --chunk_size CHUNK_SIZE
                    Number of sentence pairs in chunks to be processed (default=1000000)
--doc_level         Print full documents
--len_name N        Show the first N characters of file names when displaying progress. -1 to show full names (default=50)
-v, --verbose       Print progress messages

Description

opus_read is a script to read sentence alignments stored in XCES align format and print the aligned sentences to STDOUT. It requires monolingual alignments of sentences in linked XML files. Linked XML files are specified in the "toDoc" and "fromDoc" attributes (see below).

<cesAlign version="1.0">
 <linkGrp targType="s" toDoc="source1.xml" fromDoc="target1.xml">
   <link certainty="0.88" xtargets="s1.1 s1.2;s1.1" id="SL1" />
   ....
 <linkGrp targType="s" toDoc="source2.xml" fromDoc="target2.xml">
   <link certainty="0.88" xtargets="s1.1;s1.1" id="SL1" />

Several parameters can be set to filter the alignments and to print only certain types of alignments.

opus_read can also be used to filter the XCES alignment files and to print the remaining links in the same XCES align format. Set the "-wm" flag to "links" to enable this mode.
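
To make the link format concrete, here is a minimal sketch (not part of OpusTools) that parses an XCES fragment with Python's standard library and applies filters comparable to -S, -T and --threshold. The fragment is a closed, slightly extended version of the example above. The names parse_links and keep are hypothetical, and the sketch assumes that the ids before the semicolon in xtargets refer to source sentences and that the threshold comparison is strict:

```python
import xml.etree.ElementTree as ET

# Well-formed fragment modelled on the example above (two links added).
XCES = """<cesAlign version="1.0">
 <linkGrp targType="s" toDoc="source1.xml" fromDoc="target1.xml">
   <link certainty="0.88" xtargets="s1.1 s1.2;s1.1" id="SL1" />
   <link certainty="1.25" xtargets="s2.1;s2.1" id="SL2" />
   <link certainty="0.10" xtargets=";s3.1" id="SL3" />
 </linkGrp>
</cesAlign>"""


def parse_links(xces_text):
    """Yield one dict per <link>, with the sentence ids split into lists."""
    root = ET.fromstring(xces_text)
    for group in root.iter("linkGrp"):
        for link in group.iter("link"):
            # Assumption: ids before ';' are source ids, ids after it are target ids.
            src, tgt = link.get("xtargets").split(";")
            yield {
                "id": link.get("id"),
                "certainty": float(link.get("certainty")),
                "src_ids": src.split(),  # an empty side becomes an empty list
                "tgt_ids": tgt.split(),
            }


def keep(link, threshold=None, src_count=None, tgt_count=None):
    """Return True if the link passes all filters that were given."""
    # Assumption: "greater than" is read as a strict comparison.
    if threshold is not None and not link["certainty"] > threshold:
        return False
    if src_count is not None and len(link["src_ids"]) != src_count:
        return False
    if tgt_count is not None and len(link["tgt_ids"]) != tgt_count:
        return False
    return True


if __name__ == "__main__":
    for link in parse_links(XCES):
        if keep(link, src_count=1, tgt_count=1):
            print(link["id"], link["src_ids"], link["tgt_ids"])
```

The link SL3 has no source sentence, so it is the kind of non-alignment that --leave_non_alignments_out is described as dropping.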

opus_read reads the alignments from zip files. Starting the script can take some time if the zip files are large (for example, OpenSubtitles in OPUS).

Chunk size:

The --chunk_size parameter sets the number of sentence pairs processed per chunk. The XCES format can list source and target sentence ids in non-sequential order, so the sentence pairs of an entire document have to be kept in memory to make sure that every pair is found. This is not a problem for corpora that are split into many smaller documents, but it can lead to very high memory usage for big corpora that consist of a single document, e.g. WikiMatrix.

The --chunk_size parameter is a compromise that conserves memory at the expense of time. opus_read collects as many alignment links as --chunk_size indicates, parses the entire source and target sentence documents, and outputs the sentence pairs. This process is repeated until all sentence pairs from the document pair have been processed, which means that the entire documents are re-parsed for every chunk. For example, if a document pair has 10,000,000 sentence pairs and you use the default chunk size of 1,000,000, which uses roughly 1.5 GB of memory, the memory usage will be one tenth of what it would be without the chunk size restriction, but the processing will take ten times longer. This is not an optimal solution and we are looking for a better one. To disable chunking, set the parameter to -1.
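
The trade-off can be illustrated with a toy model. The sketch below is not the opus_read implementation: the in-memory "documents", the CountingParser class and process_in_chunks are hypothetical stand-ins. It only shows why a smaller chunk size means more full passes over the documents:

```python
# Toy "documents": sentence id -> sentence text.
SRC_DOC = {f"s{i}": f"source sentence {i}" for i in range(10)}
TGT_DOC = {f"s{i}": f"target sentence {i}" for i in range(10)}

# Alignment links in non-sequential order, as XCES files may contain them.
LINKS = [(f"s{i}", f"s{(i * 3) % 10}") for i in reversed(range(10))]


class CountingParser:
    """Stands in for parsing the XML documents; counts full passes."""

    def __init__(self):
        self.passes = 0

    def __call__(self, wanted_src, wanted_tgt):
        self.passes += 1
        # One pass walks the whole document but keeps only the wanted ids,
        # so memory is bounded by the chunk size rather than the document size.
        src = {k: v for k, v in SRC_DOC.items() if k in wanted_src}
        tgt = {k: v for k, v in TGT_DOC.items() if k in wanted_tgt}
        return src, tgt


def process_in_chunks(links, chunk_size, parse_documents):
    """Return sentence pairs, re-parsing the documents once per chunk."""
    if chunk_size == -1:  # chunking disabled: everything in one pass
        chunk_size = len(links) or 1
    pairs = []
    for start in range(0, len(links), chunk_size):
        chunk = links[start:start + chunk_size]
        wanted_src = {s for s, _ in chunk}
        wanted_tgt = {t for _, t in chunk}
        src_sents, tgt_sents = parse_documents(wanted_src, wanted_tgt)
        pairs.extend((src_sents[s], tgt_sents[t]) for s, t in chunk)
    return pairs


if __name__ == "__main__":
    for size in (3, 5, -1):
        parser = CountingParser()
        process_in_chunks(LINKS, size, parser)
        print(f"chunk_size={size}: {parser.passes} pass(es) over the documents")
```

With 10 links, a chunk size of 3 needs four passes, while -1 needs a single pass but holds all sentences at once.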

Moses files

It is also possible to download moses files directly without having to do any XML parsing. This gives quicker access to corpora but loses all filtering options, because filtering is based on the XCES structure and metadata. The moses files contain all non-empty alignments, including duplicates. To download moses files with opus_read, set the preprocess flag to moses. This downloads a moses zip archive and extracts the source and target files, for example:

opus_read --directory RF \
    --source en \
    --target sv \
    --preprocess moses \
    --write en-sv.en en-sv.sv

To download a wider range of different file types, see opus_get.

Document level data

Using the --doc_level option, you can extract all sentences from parallel documents, not just the aligned parallel sentence pairs. In moses write mode, non-aligned sentences are written on separate lines where source sentences are followed by TABs and target sentences are preceded by TABs. Aligned sentence pairs are written normally separated by TABs and document boundaries are indicated by empty lines:

<src_doc1_sent1><TAB>
<src_doc1_sent2><TAB>
<TAB><trg_doc1_sent1>
<TAB><trg_doc1_sent2>
<src_doc1_sent3><TAB><trg_doc1_sent3>
<src_doc1_sent4><TAB>
<TAB><trg_doc1_sent4>
<src_doc1_sent5><TAB><trg_doc1_sent5> <trg_doc1_sent6>
<src_doc1_sent6><TAB>
<src_doc1_sent7><TAB>
<TAB><trg_doc1_sent7>

<src_doc2_sent1><TAB>
<TAB><trg_doc2_sent1>
<src_doc2_sent2><TAB><trg_doc2_sent2>
<src_doc2_sent3> <src_doc2_sent4><TAB><trg_doc2_sent3>
...
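
The layout above can be read back with a few lines of Python. This is a minimal sketch rather than an OpusTools function: parse_doc_level and classify are hypothetical names, and the sketch assumes the default TAB delimiter and that sentences contain no TAB characters themselves:

```python
def parse_doc_level(text):
    """Split --doc_level moses output into documents of (source, target) rows."""
    documents, current = [], []
    for line in text.split("\n"):
        if line == "":  # an empty line marks a document boundary
            if current:
                documents.append(current)
                current = []
            continue
        # Everything before the first TAB is the source side, the rest is the target.
        source, _, target = line.partition("\t")
        current.append((source, target))
    if current:
        documents.append(current)
    return documents


def classify(row):
    """Label a row as aligned, source-only or target-only."""
    source, target = row
    if source and target:
        return "aligned"
    return "source-only" if source else "target-only"


SAMPLE = (
    "Hello.\t\n"
    "\tHej.\n"
    "Thank you.\tTack.\n"
    "\n"
    "Second doc.\tAndra dokumentet.\n"
)

if __name__ == "__main__":
    for number, doc in enumerate(parse_doc_level(SAMPLE), start=1):
        for row in doc:
            print(number, classify(row), row)
```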

Examples:

Read sentence alignment in XCES align format. Necessary files will be downloaded automatically if they are not found locally:

opus_read --directory RF --source en --target sv

Read sentences with a specific preprocessing type (the default is xml, which is tokenized text):

opus_read --directory RF --source en --target sv --preprocess raw

Leave non-alignments (pairs with no sentences on one side) out:

opus_read --directory RF \
    --source en \
    --target sv \
    --preprocess raw \
    --leave_non_alignments_out

Print first 10 alignment pairs:

opus_read --directory RF --source en --target sv -m 10

Print XCES align format of all 1:1 sentence alignments:

opus_read --directory RF \
    --source en \
    --target sv \
    --src_range 1 \
    --tgt_range 1 \
    --write_mode links

Print alignments with alignment certainty greater than 1.1:

opus_read --directory RF \
    --source en \
    --target sv \
    --attribute certainty \
    --threshold 1.1

Write results to file:

opus_read --directory RF --source en --target sv --write result.txt

Write with different output format:

opus_read --directory RF \
    --source en \
    --target sv \
    --write result.tmx \
    --write_mode tmx

Write moses format to one file:

opus_read --directory RF \
    --source en \
    --target sv \
    --write en-sv.txt \
    --write_mode moses

or to two files:

opus_read --directory RF \
    --source en \
    --target sv \
    --write en-sv.en en-sv.sv \
    --write_mode moses
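
Either output can then be loaded as a list of sentence pairs. The following is a small sketch under two assumptions: the files are UTF-8 encoded, and the single-file variant uses the default TAB delimiter. The helper names read_two_files and read_one_file are hypothetical:

```python
def read_two_files(src_path, tgt_path):
    """Pair up line i of the source file with line i of the target file."""
    with open(src_path, encoding="utf-8") as src:
        src_lines = [line.rstrip("\n") for line in src]
    with open(tgt_path, encoding="utf-8") as tgt:
        tgt_lines = [line.rstrip("\n") for line in tgt]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files have different line counts")
    return list(zip(src_lines, tgt_lines))


def read_one_file(path, delimiter="\t"):
    """Split each line of a single moses file at the first delimiter."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            source, _, target = line.rstrip("\n").partition(delimiter)
            pairs.append((source, target))
    return pairs


if __name__ == "__main__":
    # File names taken from the examples above.
    for source, target in read_two_files("en-sv.en", "en-sv.sv")[:5]:
        print(source, "|||", target)
```

If the delimiter was changed with --change_moses_delimiter, pass the same value as delimiter.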

Read sentences using your alignment file. First create an alignment file, for example:

opus_read --directory RF \
    --source en \
    --target sv \
    --attribute certainty \
    --threshold 1.1 \
    --write_mode links \
    --write en-sv.links

Then use the created alignment file:

opus_read --directory RF --source en --target sv --alignment_file en-sv.links

Annotations can be printed with --print_annotations if they are included in the sentence files. To print all annotation attributes, specify this with --source_annotations and --target_annotations flags:

opus_read --directory RF \
    --source en \
    --target sv \
    --print_annotations \
    --source_annotations all_attrs \
    --target_annotations all_attrs

Sentences can be filtered by their language id labels and confidence score. First, the language ids need to be added to the sentence files with opus_langid. If you have run the previous examples, you should have RF_latest_xml_en.zip and RF_latest_xml_sv.zip in your current working directory. Apply opus_langid to these files:

opus_langid --file_path RF_latest_xml_en.zip
opus_langid --file_path RF_latest_xml_sv.zip

If you want to add language labels and scores to raw sentence files, you have to use the --preprocess raw flag:

opus_langid --file_path RF_latest_raw_en.zip --preprocess raw
opus_langid --file_path RF_latest_raw_sv.zip --preprocess raw

Now you can filter by language ids. This example uses both cld2 and langid.py language detection confidence scores:

opus_read --directory RF \
    --source en \
    --target sv \
    --src_cld2 en 0.99 \
    --trg_cld2 sv 0.99 \
    --src_langid en 1 \
    --trg_langid sv 1

You can also import the module into your Python script.

In your_script.py, first import the package:

import opustools

Initialize OpusRead:

opus_reader = opustools.OpusRead(
    directory='Books',
    source='en',
    target='fi')
opus_reader.printPairs()

and then run:

python3 your_script.py
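
If you would rather call the command-line tool from Python than use the module, one option is to build the argument list and hand it to subprocess. This is only a sketch: build_opus_read_command and run are hypothetical helpers, and they use nothing beyond flags listed in the usage text above:

```python
import shlex
import subprocess


def build_opus_read_command(corpus, source, target, preprocess=None,
                            maximum=None, write=None, write_mode=None,
                            suppress_prompts=False):
    """Return the opus_read call as a list of arguments."""
    cmd = ["opus_read", "--directory", corpus,
           "--source", source, "--target", target]
    if preprocess is not None:
        cmd += ["--preprocess", preprocess]
    if maximum is not None:
        cmd += ["--maximum", str(maximum)]
    if write:
        cmd += ["--write", *write]  # one file name, or two for moses output
    if write_mode is not None:
        cmd += ["--write_mode", write_mode]
    if suppress_prompts:
        cmd.append("--suppress_prompts")  # avoids the "(y/n)" download prompt
    return cmd


def run(cmd):
    """Run the command and return what it printed to STDOUT."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    command = build_opus_read_command("RF", "en", "sv", maximum=10,
                                      suppress_prompts=True)
    print(shlex.join(command))  # show the call before running it
    print(run(command))
```

Passing a list instead of a single string avoids shell quoting issues with file names.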

opus_express

Usage

usage: opus_express [-h] [-f] -s lang_id -t lang_id
                    [-c [coll_name [coll_name ...]]]
                    [--root-dir /path/to/OPUS] [--download-dir /path/to/dir]
                    [--test-override /path/to/file] [--test-quota num_sents]
                    [--dev-quota num_sents] [--doc-bounds] [--quality-aware]
                    [--overlap-threshold min_pct] [--preserve-inline-tags]
                    [--shuffle] [--test-set filename] [--dev-set filename]
                    [--train-set filename] [-q]

arguments:

-h, --help            show this help message and exit
-f, --force           suppress warnings (default: False)
-s lang_id, --src-lang lang_id
                      source language (e.g. `en')
-t lang_id, --tgt-lang lang_id
                      target language (e.g. `pt')
-c [coll_name [coll_name ...]], --collections [coll_name [coll_name ...]]
                      OPUS collection(s) to fetch (default: `OpenSubtitles')
                      (Check http://opus.nlpl.eu/opusapi/?corpora=True for 
                      an up-to-date list)
                      Collections list: ['ALL', 'ada83', 'Bianet', 'bible-
                      uedin', 'Books', 'CAPES', 'DGT', 'DOGC', 'ECB',
                      'EhuHac', 'Elhuyar', 'EMEA', 'EUbookshop', 'EUconst',
                      'Europarl', 'Finlex', 'fiskmo', 'giga-fren',
                      'GlobalVoices', 'GNOME', 'hrenWaC', 'JRC-Acquis',
                      'KDE4', 'KDEdoc', 'MBS', 'memat', 'MontenegrinSubs',
                      'MPC1', 'MultiUN', 'News-Commentary', 'OfisPublik',
                      'OpenOffice', 'OpenSubtitles', 'ParaCrawl', 'PHP',
                      'QED', 'RF', 'sardware', 'SciELO', 'SETIMES', 'SPC',
                      'Tanzil', 'Tatoeba', 'TED2013', 'TedTalks', 'TEP',
                      'TildeMODEL', 'Ubuntu', 'UN', 'UNPC', 'wikimedia',
                      'Wikipedia', 'WikiSource', 'WMT-News', 'XhosaNavy']
--root-dir /path/to/OPUS
                      Root directory for OPUS
                      (default:`/projappl/nlpl/data/OPUS')
--download-dir /path/to/dir
                      Directory for downloaded OPUS corpus files
                      (default:`.')
--test-override /path/to/file
                      path to file containing resource IDs to reserve for
                      the test set (default: None)
--test-quota num_sents
                      test set size in sentences (default: 10000)
--dev-quota num_sents
                      development set size in sentences (default: 10000)
--doc-bounds          preserve document blocks (also marks document
                      boundaries) (default: False)
--quality-aware       reserve one-to-one aligned samples with high overlap
                      for test/dev sets (incompatible with `--doc-bounds')
                      (default: False)
--overlap-threshold min_pct
                      threshold for alignment overlap in `--quality-aware'
                      mode (default: 0.8)
--preserve-inline-tags
                      preserve inline timestamp tags in aligned samples
                      (default: False)
--shuffle             shuffle samples (incompatible with `--doc-bounds')
                      (default: False)
--test-set filename   filename stub for output test set (default: `test')
--dev-set filename    filename stub for output development set (default:
                      `dev')
--train-set filename  filename stub for output training set (default:
                      `train')
-q                    Download necessary files without prompting "(y/n)"
                      (default: False)

Description

All aboard the OPUS Express! Create test/dev/train sets from OPUS data.
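
To give a feel for what the quota options mean, here is a deliberately simple sketch of a quota-based split. It is not the selection logic of opus_express, which this page does not describe: split_by_quota is a hypothetical function that ignores --quality-aware and --doc-bounds, and its seed parameter has no counterpart among the options above:

```python
import random


def split_by_quota(pairs, test_quota, dev_quota, shuffle=False, seed=0):
    """Reserve test_quota pairs for test, dev_quota for dev, the rest for train."""
    items = list(pairs)
    if shuffle:
        # A fixed seed keeps the split reproducible (assumption of this sketch).
        random.Random(seed).shuffle(items)
    test = items[:test_quota]
    dev = items[test_quota:test_quota + dev_quota]
    train = items[test_quota + dev_quota:]
    return test, dev, train


if __name__ == "__main__":
    toy = [(f"src {i}", f"tgt {i}") for i in range(25)]
    test, dev, train = split_by_quota(toy, test_quota=5, dev_quota=5, shuffle=True)
    print(len(test), len(dev), len(train))
```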

opus_cat

Usage

usage: opus_cat [-h] -d DIRECTORY -l LANGUAGE [-i] [-m MAXIMUM]
                [-pp {raw,xml}] [-p] [-f FILE_NAME]
                [-r RELEASE] [-pa] [-sa SET_ATTRIBUTE [SET_ATTRIBUTE ...]]
                [-ca CHANGE_ANNOTATION_DELIMITER] [-rd path_to_dir]
                [-dl DOWNLOAD_DIR]

arguments:

  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Corpus name
  -l LANGUAGE, --language LANGUAGE
                        Language
  -i, --no_ids          Print without ids when using -p
  -m MAXIMUM, --maximum MAXIMUM
                        Maximum number of sentences
  -pp {raw,xml}, --preprocess {raw,xml}
                        Preprocess-type (raw, xml, default=xml)
  -p, --plain           Print in plain txt
  -f FILE_NAME, --file_name FILE_NAME
                        File name (if not given, prints all files)
  -r RELEASE, --release RELEASE
                        Release (default=latest)
  -pa, --print_annotations
                        Print annotations, if they exist
  -sa SET_ATTRIBUTE [SET_ATTRIBUTE ...], --set_attribute SET_ATTRIBUTE [SET_ATTRIBUTE ...]
                        Set sentence annotation attributes to be printed, e.g. -sa pos lem.
                        To print all available attributes use -sa all_attrs (default=pos,lem)
  -ca CHANGE_ANNOTATION_DELIMITER, --change_annotation_delimiter CHANGE_ANNOTATION_DELIMITER
                        Change annotation delimiter (default=|)
  -rd path_to_dir, --root_directory path_to_dir
                        Change root directory (default=/projappl/nlpl/data/OPUS)
  -dl DOWNLOAD_DIR, --download_dir DOWNLOAD_DIR
                        Set download directory (default=current directory)

Description

Read a document from OPUS and print to STDOUT

Examples:

Read a corpus:

opus_cat --directory RF --language en

Read with output in plain text:

opus_cat --directory RF --language en --plain

Read with output in plain text including annotations:

opus_cat --directory RF --language en --plain --print_annotations

Read a specific file in a corpus:

opus_cat --directory RF --language en --file_name RF/xml/en/1996.xml

opus_get

Usage

usage: opus_get [-h] [-s SOURCE] [-t TARGET] [-d DIRECTORY] [-r RELEASE]
                [-p {raw,xml,parsed,mono,moses,tmx,truecaser,ud,freq,smt,dic}]
                [-l] [-ll] [-lc] [--local_db] [-db DATABASE]
                [-dl DOWNLOAD_DIR] [-q] [-u] [-w]

arguments:

-h, --help            show this help message and exit
-s SOURCE, --source SOURCE
                      Source language
-t TARGET, --target TARGET
                      Target language
-d DIRECTORY, --directory DIRECTORY
                      Corpus name
-r RELEASE, --release RELEASE
                      Release
-p {raw,xml,parsed,mono,moses,tmx,truecaser,ud,freq,smt,dic}, --preprocess {raw,xml,parsed,mono,moses,tmx,truecaser,ud,freq,smt,dic}
                      Preprocess type
-l, --list_resources  List resources
-ll, --list_languages
                      List available languages. Use -d to find languages for a given corpus and -s
                      for a given source language. Use both to find target language for a given
                      source language in a given corpus.
-lc, --list_corpora   List available corpora. Use -s to find corpora for a given language and use
                      both -s and -t to find corpora for a given language pair.
--local_db            Search resources from the local database instead of the online OPUS-API.
-db DATABASE, --database DATABASE
                      Sqlite db file location
-dl DOWNLOAD_DIR, --download_dir DOWNLOAD_DIR
                      Set download directory (default=current directory)
-q, --suppress_prompts
                      Download necessary files without prompting "(y/n)"
-u, --update_db       Update the local corpus database. This could take up to 1 hour.
-w, --warnings        When updating the local database, log warnings in addition to errors in
                      "opusdb_update_error.log"

Description

Download files from OPUS

Examples:

List available files in RF corpus for en-sv language pair:

opus_get --directory RF --source en --target sv --list_resources

Download RF corpus for en-sv:

opus_get --directory RF --source en --target sv

You can specify the directory to which the files will be downloaded:

opus_get --directory RF --source en --target sv --download_dir RF_files

List all files in RF that include English:

opus_get --directory RF --source en --list_resources

List all files for all language pairs in RF:

opus_get --directory RF --list_resources

List all en-sv files in the whole OPUS:

opus_get --source en --target sv --list_resources

Find available target languages for English in RF:

opus_get --list_languages --directory RF --source en

Find all corpora that contain the language pair en-sv:

opus_get --list_corpora --source en --target sv

opus_langid

Usage

usage: opus_langid [-h] -f FILE_PATH [-t TARGET_FILE_PATH] [-v] [-s]

arguments:

-h, --help            show this help message and exit
-f FILE_PATH, --file_path FILE_PATH
                      File path
-t TARGET_FILE_PATH, --target_file_path TARGET_FILE_PATH
                      Target file path. By default, the original file is
                      edited
-v, --verbosity       Verbosity. -v: print current xml file
-s, --suppress_errors
                      Suppress error messages in language detection

Description

Add language ids to sentences in plain xml files or xml files in zip archives using pycld2 and langid.py. This is required in order to be able to filter sentences by their language ids and confidence scores as described in the examples of opus_read.

If you have run the opus_read examples, you should have RF_latest_xml_en.zip and RF_latest_xml_sv.zip in your current working directory. Apply opus_langid to these files:

opus_langid --file_path RF_latest_xml_en.zip
opus_langid --file_path RF_latest_xml_sv.zip

If you want to add language labels and scores to raw sentence files, you have to use the --preprocess raw flag:

opus_langid --file_path RF_latest_raw_en.zip --preprocess raw
opus_langid --file_path RF_latest_raw_sv.zip --preprocess raw

Release history

This version

1.8.3

Jan 28, 2026

1.8.2

Dec 9, 2025

1.8.1

Aug 21, 2025

1.8.0

Mar 27, 2025

1.7.2

Feb 27, 2025

1.7.1

Feb 3, 2025

1.6.2

Aug 8, 2024

1.6.1

Nov 20, 2023

1.6.0

Nov 3, 2023

1.5.6

Sep 15, 2023

1.5.5

Sep 6, 2023

1.5.4

Sep 5, 2023

1.5.3

Apr 14, 2023

1.4.0

Nov 29, 2022

1.3.2

Nov 9, 2022

1.3.1

Aug 31, 2022

1.3.0

Aug 26, 2022

1.2.3

Aug 25, 2022

1.2.2

Nov 8, 2021

1.2.1

Oct 24, 2020

1.1.1

Oct 16, 2020

1.1.0

Oct 9, 2020

1.0.0

Sep 23, 2020

0.0.54

Jan 10, 2020

0.0.53

Nov 22, 2019

0.0.52

Nov 22, 2019

Download files

Source Distribution

opustools-1.8.3.tar.gz (18.3 MB)

Uploaded Jan 28, 2026 Source

Built Distribution

opustools-1.8.3-py3-none-any.whl (18.3 MB)

Uploaded Jan 28, 2026 Python 3

File details

Details for the file opustools-1.8.3.tar.gz.

File metadata

Download URL: opustools-1.8.3.tar.gz
Upload date: Jan 28, 2026
Size: 18.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for opustools-1.8.3.tar.gz
Algorithm	Hash digest
SHA256	`5673668147aa6a8fc1831530a758d45d05db9714786f4faf50a78378b088896b`
MD5	`09a1de5e40c527faf930c86daf284d76`
BLAKE2b-256	`71744041eaef4afc3cb67439155b392fd2a3e38b4b20d54e044d71d5a127322f`
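
A downloaded file can be checked against the SHA256 digest listed above with Python's standard library. The sketch below is one possible way to do it; sha256_of and verify are hypothetical helper names, and the file is assumed to be in the current directory:

```python
import hashlib


def sha256_of(path, block_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in blocks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Compare the file's digest with the expected one, ignoring letter case."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Digest copied from the table above.
    expected = "5673668147aa6a8fc1831530a758d45d05db9714786f4faf50a78378b088896b"
    print(verify("opustools-1.8.3.tar.gz", expected))
```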

File details

Details for the file opustools-1.8.3-py3-none-any.whl.

File metadata

Download URL: opustools-1.8.3-py3-none-any.whl
Upload date: Jan 28, 2026
Size: 18.3 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for opustools-1.8.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c233e371f72a3e5895e51af49e81ad95d97cec89e13e6da89ccf0a89d1b89574`
MD5	`0bdd73ce2de18f6c8bfaa1d9864975a4`
BLAKE2b-256	`cbf7c4a56c55a4b8d3b77c395aa7163a93b25f403e734a4e86f99e0ce9b33020`
