Toolbox for filtering parallel corpora
OpusFilter
OpusFilter is a tool for filtering and combining parallel corpora.
Features:
- Corpus preprocessing pipelines configured with YAML
- Simple downloading of parallel corpora from OPUS with OpusTools
- Implementations for many common text file operations on parallel files
- Memory-efficient processing of large files
- Filters based on, e.g., language identification, word alignment, n-gram language models, and multilingual sentence embeddings
- Extendable with your own filters written in Python
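To illustrate the kind of check a filter performs on a parallel corpus, here is a minimal standalone sketch of a length-ratio filter that streams sentence pairs lazily. It does not use OpusFilter's own classes or configuration format. The names `length_ratio` and `filter_pairs` and the threshold of 3.0 are hypothetical, chosen only for this example; see the OpusFilter documentation for the actual filter interface.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def length_ratio(pair: Pair) -> float:
    """Ratio of the longer to the shorter segment, counted in whitespace tokens."""
    lengths = [len(segment.split()) for segment in pair]
    shortest, longest = min(lengths), max(lengths)
    if shortest == 0:
        # An empty segment gives an unbounded ratio, so the pair is always rejected.
        return float("inf")
    return longest / shortest


def filter_pairs(pairs: Iterable[Pair], threshold: float = 3.0) -> Iterator[Pair]:
    """Lazily yield only the pairs whose length ratio is below the threshold."""
    for pair in pairs:
        if length_ratio(pair) < threshold:
            yield pair


# Hypothetical sample data: only the first pair has balanced lengths.
sample = [
    ("Hello world", "Hallo Welt"),
    ("Yes", "Ja , das ist eine sehr lange Uebersetzung"),
    ("", "leer"),
]
kept = list(filter_pairs(sample))
```

Because `filter_pairs` is a generator, it can consume pairs read line by line from two large files (for example via `zip` over two open file objects) without loading either file into memory.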
OpusFilter was presented in the ACL 2020 system demonstrations.
Installing
Install the latest release from PyPI:
pip install opusfilter
or
pip install opusfilter[all]
(includes optional Python libraries)
Install from source:
pip install .
or
python setup.py install
Troubleshooting
OpusFilter should generally work on Python 3.6, 3.7, and 3.8. If you run into problems, try installing the exact versions listed in requirements.txt:
pip install -r requirements.txt
Libraries that currently cause trouble:
- pyhash: pyhash-0.9.3 requires setuptools==58 or below
- fast-mosestokenizer: no PyPI packages for Python>=3.9
Documentation
The complete OpusFilter documentation is available from helsinki-nlp.github.io/OpusFilter.
You can also build the documents from the source:
pip install -r docs/requirements.txt
or
pip install .[docs]
sphinx-build docs docs-html
Changelog
A changelog is available in docs/CHANGELOG.md.
Citing
If you use OpusFilter in your research, please cite our ACL 2020 paper:
@inproceedings{aulamo-etal-2020-opusfilter,
title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
month = jul,
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-demos.20",
doi = "10.18653/v1/2020.acl-demos.20",
pages = "150--156"
}
A full bibliography of the papers cited in the documentation and code is available in docs/references.bib.
Contributing
See docs/CONTRIBUTING.md.