Preprocessing and sentence-aligning for parallel corpora
preprocess-corpora
This repository contains Python scripts to preprocess and sentence-align parallel (or monolingual) corpora. It relies heavily on Uplug and, to a lesser extent, on TreeTagger.
Installation
First, make sure that Uplug and TreeTagger are installed.
Then, install the requirements via:
$ pip install -r requirements.txt
Finally, create the executables preprocess and align via:
$ pip install --editable .
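To confirm that the installation worked, you can check that the expected executables are on the PATH. The following is a minimal sketch and not part of this package: the helper `missing_executables` is hypothetical, and the names `uplug` and `tree-tagger` are assumptions about how Uplug and TreeTagger are exposed on your system.

```python
import shutil


def missing_executables(names):
    """Return the names from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # `preprocess` and `align` are created by `pip install --editable .`;
    # `uplug` and `tree-tagger` are assumed names for the external tools.
    missing = missing_executables(["preprocess", "align", "uplug", "tree-tagger"])
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All executables found.")
```

If one of the external tools is reported as missing, it may simply be installed under a different name or outside the PATH.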
Usage
Preprocessing
The preprocess script cleans up raw text and can then tokenize and tag the text in the XML format used in OPUS.
Run preprocess to process all unformatted .txt files in a folder.
Usage:
preprocess [OPTIONS] FOLDER_IN FOLDER_OUT [de|en|es|fr|it|nl|ru|ca|sv|pt]
Options:
- --from_word: use .docx files as input, rather than .txt files.
- --tokenize: tokenize the files (requires Uplug, including Uplug support for the language).
- --tag: tag the files (requires TreeTagger, including TreeTagger support for the language).
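As an illustration of how the options combine, the sketch below assembles a preprocess command line from Python. The helper `build_preprocess_command` and the folder names `raw/` and `xml/` are hypothetical; only the executable name preprocess, the three flags, and the language codes come from the usage text. It assumes that options are placed before the positional arguments, as the usage line suggests.

```python
# Language codes listed in the usage line for preprocess.
LANGUAGES = {"de", "en", "es", "fr", "it", "nl", "ru", "ca", "sv", "pt"}


def build_preprocess_command(folder_in, folder_out, language,
                             from_word=False, tokenize=False, tag=False):
    """Assemble the argument list for a preprocess call (hypothetical helper)."""
    if language not in LANGUAGES:
        raise ValueError("Unsupported language code: " + language)
    command = ["preprocess"]
    if from_word:
        command.append("--from_word")
    if tokenize:
        command.append("--tokenize")
    if tag:
        command.append("--tag")
    # Positional arguments follow the options.
    command += [folder_in, folder_out, language]
    return command


if __name__ == "__main__":
    # "raw/" and "xml/" are placeholder folder names.
    command = build_preprocess_command("raw/", "xml/", "en", tokenize=True, tag=True)
    print(" ".join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` to actually run the script, provided the installation steps have been completed.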
Alignment
Run align to sentence-align .xml files in a working directory. This requires Uplug to be installed.
Usage:
align [OPTIONS] WORKING_DIR [[de|en|es|fr|it|nl|ru|ca|sv|pt]]...
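The align usage line accepts a working directory followed by any number of language codes. A minimal sketch of assembling such a call from Python is shown below; the helper `build_align_command` and the directory name `work/` are hypothetical, and the check for at least two languages is an assumption of this sketch rather than documented behaviour of the script.

```python
def build_align_command(working_dir, languages):
    """Assemble the argument list for an align call (hypothetical helper)."""
    # Assumption: sentence alignment pairs texts, so ask for two or more languages.
    if len(languages) < 2:
        raise ValueError("Expected at least two language codes")
    return ["align", working_dir] + list(languages)


if __name__ == "__main__":
    # "work/" is a placeholder for a directory containing the .xml files.
    print(" ".join(build_align_command("work/", ["en", "de", "fr"])))
```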
Supported languages
Full support
- German (de)
- English (en)
- Spanish (es) (+ variants Rioplatense (ar) and Mexican (mx) Spanish)
- French (fr)
- Italian (it)
- Dutch (nl)
- Russian (ru)
- Portuguese (pt)
Limited support
- Breton (br) [not supported in Uplug/TreeTagger]
- Catalan (ca) [not supported in Uplug/TreeTagger]
- Swedish (sv) [not supported in Uplug/TreeTagger]
Tests
Run the tests via
python -m unittest discover
In preprocess_corpora/tests/data/alice, you can find the example corpus used in the tests.
This corpus was compiled from Lewis Carroll's Alice in Wonderland and its translations into German, French, and Italian.
The source files are available through Project Gutenberg.