Toolbox for filtering parallel corpora

OpusFilter

OpusFilter is a tool for filtering and combining parallel corpora.

Features:

  • Corpus preprocessing pipelines configured with YAML
  • Simple downloading of parallel corpora from OPUS with OpusTools
  • Implementations for many common text file operations on parallel files
  • Memory-efficient processing of large files
  • Filters based on, e.g., language identification, word alignment, n-gram language models, and multilingual sentence embeddings
  • Extendable with your own filters written in Python
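To illustrate what filtering a parallel corpus means in practice, here is a minimal sketch in plain Python. It does not use the OpusFilter API; the names length_ratio and filter_pairs and the default threshold of 3.0 are hypothetical and chosen only for this example. It streams segment pairs through a generator, so the whole corpus never needs to be held in memory.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def length_ratio(source: str, target: str) -> float:
    """Ratio of the longer to the shorter segment, counted in whitespace-separated words."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        # Treat an empty side as infinitely mismatched so the pair is rejected.
        return float("inf")
    return max(src_len, tgt_len) / min(src_len, tgt_len)


def filter_pairs(pairs: Iterable[Pair], max_ratio: float = 3.0) -> Iterator[Pair]:
    """Lazily yield only the pairs whose word-length ratio is at most max_ratio."""
    for source, target in pairs:
        if length_ratio(source, target) <= max_ratio:
            yield source, target


if __name__ == "__main__":
    corpus = [
        ("a small house", "pieni talo"),
        ("yes", "this translation is clearly far too long"),
        ("", "orphan segment"),
    ]
    for src, tgt in filter_pairs(corpus):
        print(src, "|||", tgt)
```

In OpusFilter itself, comparable checks are set up as filters in the YAML configuration rather than written by hand; see the documentation for the actual filter names and parameters.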

OpusFilter was presented at the ACL 2020 system demonstrations.

Installing

Install the latest release from PyPI:

  • pip install opusfilter, or pip install opusfilter[all] to include the optional Python libraries

Install from source:

  • pip install . or python setup.py install

Troubleshooting

OpusFilter should generally work on Python 3.8 to 3.12. If you run into problems, try installing the exact versions listed in requirements.txt:

  • pip install -r requirements.txt
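Before pinning dependencies, it can help to confirm that the interpreter falls within the range stated above. The following is a minimal sketch; the function name python_version_supported and the two constants are hypothetical and simply encode the 3.8 to 3.12 range from this section.

```python
import sys
from typing import Tuple

# Range taken from the troubleshooting note above (an assumption of this sketch,
# not something read from the package metadata).
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 12)


def python_version_supported(version: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) version lies within the stated range."""
    return MIN_SUPPORTED <= version <= MAX_SUPPORTED


if __name__ == "__main__":
    current = sys.version_info[:2]
    status = "within" if python_version_supported(current) else "outside"
    print(f"Python {current[0]}.{current[1]} is {status} the stated 3.8 to 3.12 range")
```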

Documentation

The complete OpusFilter documentation is available at helsinki-nlp.github.io/OpusFilter.

You can also build the documentation from source:

  • pip install -r docs/requirements.txt or pip install .[docs]
  • sphinx-build docs docs-html

Changelog

A changelog is available in docs/CHANGELOG.md.

Citing

If you use OpusFilter in your research, please cite our ACL 2020 paper:

@inproceedings{aulamo-etal-2020-opusfilter,
    title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
    author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.20",
    doi = "10.18653/v1/2020.acl-demos.20",
    pages = "150--156"
}

A full bibliography of the papers cited in the documentation and code can be found in docs/references.bib.

Contributing

See docs/CONTRIBUTING.md.
