A tool to align comparable corpora

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical machine translation relies on parallel corpora (e.g. Europarl) for training translation models. However, these corpora are limited in size and coverage, and they take time to create. Yalign is designed to automate this process by finding sentences in comparable corpora that are close translations of each other. This opens up avenues for harvesting parallel corpora from sources such as translated documents and the web.

Installation

Yalign requires that you install scikit-learn.

After that, you can install Yalign from PyPI via pip:

sudo pip install yalign
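
To confirm that the prerequisite and the package are both in place, a small check along these lines may help. It is only a sketch: `missing_modules` is a hypothetical helper, `sklearn` is the import name of scikit-learn, and `yalign` is assumed to be the import name of this package.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "sklearn" is scikit-learn's import name; "yalign" is an assumption.
    missing = missing_modules(["sklearn", "yalign"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("scikit-learn and yalign are both importable.")
```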

Usage

First, download and unpack the English-to-Spanish model.

wget http://yalign.machinalis.com/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz
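
Where wget and tar are not available, the same two steps can be approximated with the Python standard library. This is a minimal sketch, not part of Yalign: `download` and `unpack` are hypothetical helper names, and the URL is copied from the command above without any guarantee that it is still reachable.

```python
import tarfile
import urllib.request

# Model URL taken from the wget command above.
MODEL_URL = "http://yalign.machinalis.com/models/0.1/en-es.tar.gz"


def download(url, archive_path):
    """Fetch url and save it to archive_path (roughly what wget does)."""
    urllib.request.urlretrieve(url, archive_path)
    return archive_path


def unpack(archive_path, dest="."):
    """Extract a gzipped tarball into dest and return the member names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Only extract archives from a source you trust.
        archive.extractall(dest)
    return names
```

Calling `unpack(download(MODEL_URL, "en-es.tar.gz"))` would then mirror the two shell commands.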

Now we can use the yalign-align script with the English-to-Spanish model to align two web pages. The first argument is the unpacked model directory, followed by the two documents to align.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
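
The command above can also be driven from Python, which makes it easier to build the percent-encoded URL for titles with non-ASCII characters such as "Antipartícula". The sketch below assumes only the three positional arguments shown above; `wikipedia_url`, `align_command` and `run_alignment` are hypothetical helpers, and nothing is assumed about the format of the script's output.

```python
import subprocess
from urllib.parse import quote


def wikipedia_url(lang, title):
    """Build a Wikipedia article URL, percent-encoding non-ASCII characters."""
    return "http://{}.wikipedia.org/wiki/{}".format(lang, quote(title))


def align_command(model_dir, url_a, url_b):
    """Return the argument list for the yalign-align call shown above."""
    return ["yalign-align", model_dir, url_a, url_b]


def run_alignment(model_dir, url_a, url_b):
    """Run yalign-align and return its standard output as text, unparsed."""
    result = subprocess.run(
        align_command(model_dir, url_a, url_b),
        check=True,           # raise if the script exits with an error
        capture_output=True,  # collect stdout and stderr
        text=True,
    )
    return result.stdout
```

For example, `run_alignment("en-es", wikipedia_url("en", "Antiparticle"), wikipedia_url("es", "Antipartícula"))` issues the same call as the shell command.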

Yalign is not limited to any one language pair. By creating your own models you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

Yalign is a Machinalis project. You can view our other open source contributions here.

The Yalign Team:

Andrew Vine
Gonzalo García Berrotarán
Rafael Carrascosa
Elías Andrawos
Laura Alonso Alemany


Download files

Source Distribution

yalign-0.1.1.tar.gz (34.5 kB)

Uploaded Source

File details

Details for the file yalign-0.1.1.tar.gz.

File metadata

  • Download URL: yalign-0.1.1.tar.gz
  • Upload date:
  • Size: 34.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for yalign-0.1.1.tar.gz
Algorithm    Hash digest
SHA256       5c01186b2190c76249caf406d0f6e3313dd0c76f1fc675437665d9b60cf9d738
MD5          737ed99922c71bd62471f6da0f483626
BLAKE2b-256  900726fb70bdece4c9163c35093519ce0c0d58c631abd8891e9393b48b827fe8
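
A downloaded copy of the source distribution can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard `hashlib` module; `sha256_of` and `matches` are hypothetical helper names, and the local file path is whatever you saved the archive as.

```python
import hashlib

# SHA256 digest for yalign-0.1.1.tar.gz, copied from the table above.
EXPECTED_SHA256 = "5c01186b2190c76249caf406d0f6e3313dd0c76f1fc675437665d9b60cf9d738"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True if the file's SHA256 digest equals expected_hex."""
    return sha256_of(path) == expected_hex.lower()
```

For example, `matches("yalign-0.1.1.tar.gz", EXPECTED_SHA256)` should return True for an intact download.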
