Preprocessing and sentence-aligning for parallel corpora

preprocess-corpora

This repository contains Python scripts to preprocess and sentence-align parallel (or monolingual) corpora. It relies heavily on Uplug and, to a lesser extent, on TreeTagger.

Installation

First, make sure that Uplug and TreeTagger are installed.
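
One way to confirm that both tools are reachable is to look them up on the PATH. The sketch below assumes the executables are named uplug and tree-tagger, which may not match your installation, so adjust the names as needed. The helper missing_tools is hypothetical and not part of this repository.

```python
import shutil


def missing_tools(names=("uplug", "tree-tagger")):
    """Return the executables from `names` that cannot be found on the PATH."""
    # The default executable names are assumptions; change them to match
    # how Uplug and TreeTagger are installed on your machine.
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Uplug and TreeTagger were both found.")
```

An empty result only shows that the executables exist; it does not check that language support is installed for either tool.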

Then, install the requirements via:

$ pip install -r requirements.txt

Finally, create the executables preprocess and align via:

$ pip install --editable .

Usage

Preprocessing

The preprocess script preprocesses raw text and can then tokenize and tag it, producing the XML format used in OPUS.

Run preprocess to process all unformatted .txt-files in a folder.

Usage:

preprocess [OPTIONS] FOLDER_IN FOLDER_OUT [de|en|es|fr|it|nl|ru|ca|sv|pt]

Options:

  • --from_word: use .docx-files as input, rather than .txt-files.
  • --tokenize: tokenize the files (requires Uplug, with support for the chosen language).
  • --tag: tag the files (requires TreeTagger, with support for the chosen language).
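
When preprocessing is driven from a Python script, the command line described above can be assembled programmatically. This is a minimal sketch, assuming the executable is installed as preprocess (as described under Installation). The helper build_preprocess_command and the folder names are hypothetical; the flags and language codes are copied from the usage text.

```python
import shlex

# Language codes copied from the usage line of the preprocess script.
LANGUAGES = ("de", "en", "es", "fr", "it", "nl", "ru", "ca", "sv", "pt")


def build_preprocess_command(folder_in, folder_out, language,
                             from_word=False, tokenize=False, tag=False):
    """Build the argument list for one preprocess run (hypothetical helper)."""
    if language not in LANGUAGES:
        raise ValueError(f"Unsupported language code: {language!r}")
    command = ["preprocess"]
    # Each option is a plain on/off flag, so it is added only when requested.
    if from_word:
        command.append("--from_word")
    if tokenize:
        command.append("--tokenize")
    if tag:
        command.append("--tag")
    command += [folder_in, folder_out, language]
    return command


if __name__ == "__main__":
    # Placeholder folders; replace them with real paths.
    cmd = build_preprocess_command("raw/en", "xml/en", "en", tokenize=True, tag=True)
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to execute the script; the sketch only prints the command so that it has no side effects.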

Alignment

Run align to sentence-align the .xml-files in a working directory. This requires Uplug to be installed.

Usage:

align [OPTIONS] WORKING_DIR [[de|en|es|fr|it|nl|ru|ca|sv|pt]]...
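
The alignment call can be assembled from Python in the same way. This is a sketch under the assumption that the executable is installed as align. The helper build_align_command and the directory name are hypothetical, and no options are passed because none are documented here.

```python
import shlex

# Language codes copied from the usage line of the align script.
LANGUAGES = ("de", "en", "es", "fr", "it", "nl", "ru", "ca", "sv", "pt")


def build_align_command(working_dir, languages):
    """Build the argument list for one align run (hypothetical helper)."""
    unknown = [code for code in languages if code not in LANGUAGES]
    if unknown:
        raise ValueError(f"Unsupported language code(s): {', '.join(unknown)}")
    # The working directory comes first, followed by any number of languages.
    return ["align", working_dir, *languages]


if __name__ == "__main__":
    # Placeholder working directory; replace it with a real path.
    print(shlex.join(build_align_command("corpus", ["en", "de", "fr"])))
```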

Supported languages

Full support

  • German (de)
  • English (en)
  • Spanish (es) (+ variants Rioplatense (ar) and Mexican (mx) Spanish)
  • French (fr)
  • Italian (it)
  • Dutch (nl)
  • Russian (ru)
  • Portuguese (pt)

Limited support

  • Breton (br) [not supported in Uplug/TreeTagger]
  • Catalan (ca) [not supported in Uplug/TreeTagger]
  • Swedish (sv) [not supported in Uplug/TreeTagger]

Tests

Run the tests via:

$ python -m unittest discover

In preprocess_corpora/tests/data/alice, you can find the example corpus used in the tests. This corpus was compiled from Lewis Carroll's Alice in Wonderland and its translations into German, French, and Italian. The source files are available through Project Gutenberg.

Download files

Source Distribution

preprocess-corpora-0.1.1.tar.gz (780.8 kB)

Built Distribution

preprocess_corpora-0.1.1-py3-none-any.whl (11.0 kB)
