Toolbox for filtering parallel corpora
OpusFilter
OpusFilter is a tool for filtering and combining parallel corpora.
Features:
- Corpus preprocessing pipelines configured with YAML
- Simple downloading of parallel corpora from OPUS with OpusTools
- Implementations for many common text file operations on parallel files
- Memory-efficient processing of large files
- Built-in filters based on, for example, language identification, word alignment, n-gram language models, and multilingual sentence embeddings
- Extendable with your own filters written in Python
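To make the idea of memory-efficient filtering of parallel files more concrete, here is a minimal plain-Python sketch. It does not use the OpusFilter API: the names `length_ratio` and `filter_pairs` and the threshold of 3.0 are hypothetical choices made for illustration. It streams sentence pairs through a generator, so only one pair is held in memory at a time, and keeps the pairs whose word-count ratio stays under the threshold.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def length_ratio(src: str, tgt: str) -> float:
    """Ratio of the longer to the shorter segment, counted in whitespace-separated words."""
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    if min(src_len, tgt_len) == 0:
        # An empty segment can never be a sensible translation of a non-empty one.
        return float("inf")
    return max(src_len, tgt_len) / min(src_len, tgt_len)


def filter_pairs(pairs: Iterable[Pair], max_ratio: float = 3.0) -> Iterator[Pair]:
    """Lazily yield the pairs whose length ratio is at most max_ratio."""
    for src, tgt in pairs:
        if length_ratio(src, tgt) <= max_ratio:
            yield src, tgt


if __name__ == "__main__":
    # Illustrative data only; with real corpora the pairs would come from
    # zip(open(source_path), open(target_path)) so that files are read line by line.
    sample = [
        ("Hyvää huomenta", "Good morning"),
        ("Kiitos", "Thank you very much for all of your help today"),
        ("", "An empty source segment"),
    ]
    for src, tgt in filter_pairs(sample):
        print(f"{src}\t{tgt}")
```

Because `filter_pairs` is a generator, it can be chained with further filtering steps without materializing the corpus. OpusFilter itself expresses this kind of step declaratively in its YAML configuration rather than in ad hoc scripts.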
OpusFilter was presented in the ACL 2020 system demonstrations.
Installing
Install the latest release from PyPI:
pip install opusfilter
or, to include the optional Python libraries:
pip install opusfilter[all]
Install from source:
pip install .
or
python setup.py install
Troubleshooting
OpusFilter should generally work on Python 3.8 to 3.12. If you run into problems, try installing the exact versions listed in requirements.txt:
pip install -r requirements.txt
Documentation
The complete OpusFilter documentation is available at helsinki-nlp.github.io/OpusFilter.
You can also build the documentation from the source. First install the documentation requirements:
pip install -r docs/requirements.txt
or
pip install .[docs]
Then run the build:
sphinx-build docs docs-html
Changelog
A changelog is available in docs/CHANGELOG.md.
Citing
If you use OpusFilter in your research, please cite our ACL 2020 paper:
@inproceedings{aulamo-etal-2020-opusfilter,
title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
month = jul,
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-demos.20",
doi = "10.18653/v1/2020.acl-demos.20",
pages = "150--156"
}
A full bibliography of the papers cited in the documentation and code can be found in docs/references.bib.
Contributing
See docs/CONTRIBUTING.md.