
Preprocessing and sentence-aligning for parallel corpora


preprocess-corpora

This repository contains Python scripts to preprocess and sentence-align parallel (or monolingual) corpora. The scripts rely heavily on Uplug and, to a lesser extent, on TreeTagger.

Installation

First, make sure that Uplug and TreeTagger are installed.

Then, install the requirements via:

$ pip install -r requirements.txt

Finally, create the executables preprocess and align via:

$ pip install --editable .
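
After installation, a quick check can confirm that the tools are reachable. The sketch below is only an illustration: it assumes the executables created above are named preprocess and align, and that Uplug and TreeTagger expose commands called uplug and tree-tagger on the PATH. Those last two names are assumptions and may differ on your system.

```python
import shutil

# Command names to look for. `uplug` and `tree-tagger` are assumptions
# about how Uplug and TreeTagger are installed locally; adjust as needed.
REQUIRED_TOOLS = ["preprocess", "align", "uplug", "tree-tagger"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All tools found.")
```

The `which` parameter is there only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is sufficient.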

Usage

Preprocessing

The preprocess script cleans up raw text and can then tokenize and tag it, producing the XML format used in OPUS.

Run preprocess to process all unformatted .txt-files in a folder.

Usage:

preprocess [OPTIONS] FOLDER_IN FOLDER_OUT [de|en|es|fr|it|nl|ru|ca|sv|pt]

Options:

  • --from_word to use .docx-files as input, rather than .txt-files.
  • --tokenize to tokenize the files (requires Uplug, including Uplug support for the language).
  • --tag to tag the files (requires TreeTagger, including TreeTagger support for the language).
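
The options above can be combined in a single call. As a minimal sketch, the helper below assembles the argument list for the preprocess executable from Python. The function name `build_preprocess_command` and the folder names `raw/` and `xml/` are hypothetical; only the flags and the argument order are taken from the usage line above.

```python
def build_preprocess_command(folder_in, folder_out, language,
                             from_word=False, tokenize=False, tag=False):
    """Build the argument list for the `preprocess` executable."""
    command = ["preprocess"]
    if from_word:
        command.append("--from_word")
    if tokenize:
        command.append("--tokenize")
    if tag:
        command.append("--tag")
    # Positional arguments follow the options, as in the usage line.
    command += [folder_in, folder_out, language]
    return command


if __name__ == "__main__":
    # Hypothetical folder names; replace them with your own.
    command = build_preprocess_command("raw/", "xml/", "en",
                                       tokenize=True, tag=True)
    print(" ".join(command))
    # To actually run it (requires the `preprocess` executable):
    #   import subprocess; subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when folder names contain spaces.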

Alignment

Run align to sentence-align .xml-files in a working directory. Requires installation of Uplug.

Usage:

align [OPTIONS] WORKING_DIR [[de|en|es|fr|it|nl|ru|ca|sv|pt]]...
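
As an illustration, an align call for several languages could be assembled as follows. This is a sketch under assumptions: the helper name `build_align_command` and the directory name `alice/` are hypothetical, no options are passed because none are listed here, and the check against the language codes is an extra safeguard that is not part of the align script itself.

```python
# Language codes as listed in the usage line above.
USAGE_LANGUAGES = ("de", "en", "es", "fr", "it", "nl", "ru", "ca", "sv", "pt")


def build_align_command(working_dir, languages):
    """Build the argument list for the `align` executable."""
    unknown = [lang for lang in languages if lang not in USAGE_LANGUAGES]
    if unknown:
        raise ValueError("Unknown language code(s): " + ", ".join(unknown))
    # The working directory comes first, followed by the language codes.
    return ["align", working_dir, *languages]


if __name__ == "__main__":
    # Hypothetical working directory; replace it with your own.
    print(" ".join(build_align_command("alice/", ["en", "de", "fr"])))
```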

Supported languages

Full support

  • German (de)
  • English (en)
  • Spanish (es) (+ variants Rioplatense (ar) and Mexican (mx) Spanish)
  • French (fr)
  • Italian (it)
  • Dutch (nl)
  • Russian (ru)
  • Portuguese (pt)

Limited support

  • Breton (br) [not supported in Uplug/TreeTagger]
  • Catalan (ca) [not supported in Uplug/TreeTagger]
  • Swedish (sv) [not supported in Uplug/TreeTagger]
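
The two lists above can be captured in a small lookup, for example to warn before running a long job. This sketch is not part of the package: the names `FULL_SUPPORT`, `LIMITED_SUPPORT` and `support_level` are hypothetical, and the Spanish variant codes (ar, mx) are left out because it is not stated here how they are passed to the scripts.

```python
# Codes copied from the "Full support" and "Limited support" lists.
FULL_SUPPORT = {"de", "en", "es", "fr", "it", "nl", "ru", "pt"}
LIMITED_SUPPORT = {"br", "ca", "sv"}


def support_level(code):
    """Return "full", "limited" or "unsupported" for a language code."""
    code = code.lower()
    if code in FULL_SUPPORT:
        return "full"
    if code in LIMITED_SUPPORT:
        return "limited"
    return "unsupported"


if __name__ == "__main__":
    for code in ("fr", "sv", "ja"):
        print(code, support_level(code))
```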

Tests

Run the tests via:

python -m unittest discover

In preprocess_corpora/tests/data/alice, you can find the example corpus used in the tests. This corpus was compiled from Lewis Carroll's Alice in Wonderland and its translations into German, French, and Italian. The source files are available through Project Gutenberg.


