Toolbox for filtering parallel corpora

OpusFilter

OpusFilter is a tool for filtering and combining parallel corpora.

Features:

  • Corpus preprocessing pipelines configured with YAML
  • Simple downloading of parallel corpora from OPUS with OpusTools
  • Implementations for many common text file operations on parallel files
  • Memory-efficient processing of large files
  • Built-in filters based on, for example, language identification, word alignment, n-gram language models, and multilingual sentence embeddings
  • Extendable with your own filters written in Python
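
To illustrate what filtering a parallel corpus involves, here is a minimal sketch in plain Python. It does not use the OpusFilter API: the function names (read_pairs, length_ratio_ok, filter_pairs) and the max_ratio threshold are hypothetical and chosen only for this example. It reads two aligned files lazily, one line pair at a time, and keeps a pair only if the word-count ratio between the two sides stays under the threshold.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

Pair = Tuple[str, str]


def read_pairs(src: TextIO, tgt: TextIO) -> Iterator[Pair]:
    """Lazily yield aligned (source, target) lines from two open files."""
    # zip() over file objects reads one line from each at a time,
    # so the full corpus is never held in memory.
    for src_line, tgt_line in zip(src, tgt):
        yield src_line.rstrip("\n"), tgt_line.rstrip("\n")


def length_ratio_ok(pair: Pair, max_ratio: float = 3.0) -> bool:
    """Hypothetical filter: accept a pair if neither side is empty and
    the longer side has at most max_ratio times the words of the shorter."""
    lengths = [len(segment.split()) for segment in pair]
    if min(lengths) == 0:
        return False
    return max(lengths) / min(lengths) <= max_ratio


def filter_pairs(pairs: Iterable[Pair], max_ratio: float = 3.0) -> Iterator[Pair]:
    """Yield only the pairs accepted by length_ratio_ok."""
    for pair in pairs:
        if length_ratio_ok(pair, max_ratio):
            yield pair


if __name__ == "__main__":
    # In-memory stand-ins for a source-language and a target-language file.
    src = io.StringIO("Hello world\nThis is a long sentence with many words in it\n\n")
    tgt = io.StringIO("Hei maailma\nLyhyt\nTyhjä lähde\n")
    for source, target in filter_pairs(read_pairs(src, tgt)):
        print(source, "|||", target)
```

In this sketch only the first pair survives: the second pair is too unbalanced in length and the third has an empty source side. OpusFilter's own filters and their configuration are described in its documentation.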

OpusFilter was presented at the ACL 2020 system demonstrations.

Installing

Install the latest release from PyPI:

  • pip install opusfilter or pip install opusfilter[all] (the latter also installs the optional Python libraries)

Install from source:

  • pip install . or python setup.py install

Documentation

The complete OpusFilter documentation is available at helsinki-nlp.github.io/OpusFilter.

You can also build the documentation from the source:

  • pip install -r docs/requirements.txt or pip install .[docs]
  • sphinx-build docs docs-html

Changelog

A changelog is available in docs/CHANGELOG.md.

Citing

If you use OpusFilter in your research, please cite our ACL 2020 paper:

@inproceedings{aulamo-etal-2020-opusfilter,
    title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
    author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.20",
    doi = "10.18653/v1/2020.acl-demos.20",
    pages = "150--156"
}

A full bibliography of the papers cited in the documentation and code can be found in docs/references.bib.

Contributing

See docs/CONTRIBUTING.md.
