Preprocessing and sentence-aligning for parallel corpora
preprocess-corpora
This repository contains Python scripts to preprocess and sentence-align parallel (or monolingual) corpora. The scripts rely heavily on Uplug and, to a lesser extent, on TreeTagger.
Installation
First, make sure that Uplug and TreeTagger are installed.
Then, install the requirements via:
$ pip install -r requirements.txt
Finally, create the executables preprocess and align via:
$ pip install --editable .
Usage
Preprocessing
The preprocess script preprocesses raw text and can then tokenize and tag it, producing the XML format used in OPUS.
Run preprocess to process all unformatted .txt files in a folder.
Usage:
preprocess [OPTIONS] FOLDER_IN FOLDER_OUT [de|en|es|fr|it|nl|ru|ca|sv|pt]
Options:
- --from_word: use .docx files as input, rather than .txt files.
- --tokenize: tokenize the files (requires Uplug, including Uplug support for the language).
- --tag: tag the files (requires TreeTagger, including TreeTagger support for the language).
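When preprocessing is driven from another Python script, it can help to assemble the command line in one place. The following is a minimal sketch, not part of this package: the helper name build_preprocess_command and the example folder names are hypothetical, and it assumes that the three options are plain on/off flags and that the executable is called preprocess, as created in the installation step.

```python
import shlex


def build_preprocess_command(folder_in, folder_out, language,
                             from_word=False, tokenize=False, tag=False):
    """Assemble the argument list for the `preprocess` executable (hypothetical helper)."""
    command = ["preprocess"]
    # Assumption: each option is a plain on/off flag that takes no value.
    if from_word:
        command.append("--from_word")
    if tokenize:
        command.append("--tokenize")
    if tag:
        command.append("--tag")
    # Positional arguments follow the options, as in the usage line.
    command += [str(folder_in), str(folder_out), language]
    return command


# Example folder names are placeholders.
command = build_preprocess_command("raw/en", "xml/en", "en", tokenize=True, tag=True)
print(shlex.join(command))  # prints: preprocess --tokenize --tag raw/en xml/en en
```

The resulting list can be passed to subprocess.run(command, check=True) once Uplug and TreeTagger are available on the machine.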
Alignment
Run align to sentence-align the .xml files in a working directory. This requires Uplug to be installed.
Usage:
align [OPTIONS] WORKING_DIR [[de|en|es|fr|it|nl|ru|ca|sv|pt]]...
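Because align accepts several language codes after the working directory, a small check before launching it can catch typos early. The sketch below is illustrative only: build_align_command is a hypothetical helper, the directory name is a placeholder, and the list of codes is simply copied from the usage line above rather than read from the package.

```python
# Language codes copied from the usage line of `align`.
ALIGN_LANGUAGES = ("de", "en", "es", "fr", "it", "nl", "ru", "ca", "sv", "pt")


def build_align_command(working_dir, languages):
    """Assemble the argument list for the `align` executable (hypothetical helper)."""
    unknown = [code for code in languages if code not in ALIGN_LANGUAGES]
    if unknown:
        raise ValueError("unsupported language code(s): " + ", ".join(unknown))
    # The working directory comes first, followed by one or more language codes.
    return ["align", str(working_dir), *languages]


print(build_align_command("xml", ["en", "de", "fr"]))
# prints: ['align', 'xml', 'en', 'de', 'fr']
```

As with preprocess, the list can be handed to subprocess.run(..., check=True) when Uplug is installed.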
Supported languages
Full support
- German (de)
- English (en)
- Spanish (es) (+ variants Rioplatense (ar) and Mexican (mx) Spanish)
- French (fr)
- Italian (it)
- Dutch (nl)
- Russian (ru)
- Portuguese (pt)
Limited support
- Breton (br) [not supported in Uplug/TreeTagger]
- Catalan (ca) [not supported in Uplug/TreeTagger]
- Swedish (sv) [not supported in Uplug/TreeTagger]
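The two support levels can be captured in a small lookup, for example to warn before requesting tokenization or tagging for a language that Uplug/TreeTagger do not cover. This is a sketch based only on the lists above; the names FULL_SUPPORT, LIMITED_SUPPORT, and support_level are hypothetical and not part of the package.

```python
# Codes and names taken from the lists above.
# The Spanish variants (ar, mx) are left out because the README does not
# say how they are passed on the command line.
FULL_SUPPORT = {
    "de": "German", "en": "English", "es": "Spanish", "fr": "French",
    "it": "Italian", "nl": "Dutch", "ru": "Russian", "pt": "Portuguese",
}
LIMITED_SUPPORT = {"br": "Breton", "ca": "Catalan", "sv": "Swedish"}


def support_level(code):
    """Return 'full', 'limited', or 'none' for a language code."""
    if code in FULL_SUPPORT:
        return "full"
    if code in LIMITED_SUPPORT:
        return "limited"
    return "none"


for code in ("fr", "sv", "xx"):
    print(code, support_level(code))
```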
Tests
Run the tests via:
python -m unittest discover
In preprocess_corpora/tests/data/alice, you can find the example corpus used in the tests.
This corpus was compiled from Lewis Carroll's Alice in Wonderland and its translations into German, French, and Italian.
The source files are available through Project Gutenberg.