Anuvaad

State of the art open-source translation models for Indic languages.

Installation

# The CPU build of PyTorch is installed automatically if torch is not already present
pip install --upgrade anuvaad

Usage

As a Python module

from anuvaad import Anuvaad
anu = Anuvaad("english-telugu")

# Single sentence translation
# beam_size is optional and defaults to 4
anu.anuvaad("YS Jagan is the chief minister of Andhra Pradesh.")
# "వైఎస్ జగన్ ఆంధ్రప్రదేశ్ ముఖ్యమంత్రి."

# Batch translation
anu.anuvaad(["YS Jagan is the chief minister of Andhra Pradesh.",
            "Nara Lokesh suffered a humiliating defeat in Mangalagiri."])
# ['వైఎస్ జగన్ ఆంధ్రప్రదేశ్ ముఖ్యమంత్రి.', 'మంగళగిరిలో నారా లోకేష్కు అవమానకరమైన ఓటమి ఎదురైంది.']
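For long inputs it can be convenient to feed the translator in fixed-size chunks. The following is a minimal sketch, not part of the package: `batched` and `translate_in_batches` are hypothetical helper names, and the only assumption about Anuvaad is the one shown above, namely that `anu.anuvaad` accepts a list of sentences and returns a list of translations in the same order.

```python
from typing import Callable, Iterator, List


def batched(sentences: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `sentences` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


def translate_in_batches(
    translate: Callable[[List[str]], List[str]],
    sentences: List[str],
    batch_size: int = 16,
) -> List[str]:
    """Call `translate` once per batch and concatenate the results in order.

    `translate` is any callable that maps a list of sentences to a list of
    translations, for example `anu.anuvaad` from the snippet above.
    """
    results: List[str] = []
    for batch in batched(sentences, batch_size):
        results.extend(translate(batch))
    return results
```

With a loaded model this would be called as `translate_in_batches(anu.anuvaad, sentences, batch_size=8)`; the batch size of 8 is an arbitrary illustration, not a recommended value.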

As a service

# Starting the api service
docker run -it -e BATCH_SIZE=1 -p 8080:8080 notaitech/anuvaad:english-telugu

# Running a prediction
curl -d '{"data": ["YS Jagan is the chief minister of Andhra Pradesh."]}' -H "Content-Type: application/json" -X POST http://localhost:8080/sync
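The same request can be sent from Python. This is a minimal sketch using only the standard library; it mirrors the curl call above (POST to `/sync` with a JSON body containing a `data` list). The helper names `build_sync_request` and `translate_via_service` are hypothetical, and the response is assumed to be JSON, which the source does not specify.

```python
import json
import urllib.request


def build_sync_request(sentences, base_url="http://localhost:8080"):
    """Build the POST request that the curl example sends to the /sync endpoint."""
    payload = json.dumps({"data": sentences}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/sync",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate_via_service(sentences, base_url="http://localhost:8080", timeout=60):
    """Send the request and decode the reply.

    Assumption: the service answers with a JSON document. Its exact shape is
    not documented here, so the decoded object is returned unchanged.
    """
    request = build_sync_request(sentences, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Once the container from the previous step is running, `translate_via_service(["YS Jagan is the chief minister of Andhra Pradesh."])` performs the same call as the curl command.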

Available Models     Anuvaad BLEU          Google BLEU
english-telugu       12.721173743764009    6.841437460383768
english-tamil        12.737036149214694    5.558450942590664
english-malayalam    17.785746646721996    19.569069412553812
english-kannada      7.888886041933815     3.2803251953567893
english-marathi      23.02755955392518     12.888112016722792
english-hindi        29.175892213216954    18.130893478614375
english-bengali
english-punjabi
english-gujarati

My thoughts on the evaluation/accuracy of the model(s):

  1. Unlike classification or sequence labelling tasks, the accuracy of open-domain translation or summarization systems is very hard to quantify with a single number.
  2. This is because most accuracy metrics measure the overlap of character or word n-grams between the expected output and the predicted output.
  3. These scores definitely help when comparing multiple models on a particular dataset, but the numbers don't carry over well to open-domain models.
  4. For example, Anuvaad translates the sentence "An advance is placed with the Medical Superintendents of such hospitals who then provide assistance on a case to case basis." (taken from the http://data.statmt.org/pmindia/v1/parallel corpus) to "ऐसे अस्पतालों के चिकित्सा अधीक्षकों के साथ एडवांस रखा जाता है, जिसके बाद मामले के आधार पर सहायता प्रदान की जाती है।" whereas the expected translation in the dataset is "अग्रिम धन राशि इन अस्पतालों को चिकित्सा निरीक्षकों को दी जाएगी, जो हर मामले को देखते हुए सहायता प्रदान करेंगे।".
  5. In the above example, although Anuvaad's translation is correct (in the sense that it conveys the same meaning as the original sentence), the BLEU score with n=3 will be 0.
  6. Similarly, a model trained on the pmindia dataset will score poorly on a dataset written in a different style, even if its translations are semantically correct.
  7. Our aim with Anuvaad is to build a general-purpose, open-domain translation module that can flexibly translate text from various domains.
  8. The sheet at https://docs.google.com/spreadsheets/d/1_TTtBEvVgemQfGbRBSZYkECMMt5r7L9-dt0FGVUbmOY/edit?usp=sharing compares translations from Anuvaad, ilmulti (https://github.com/jerinphilip/ilmulti) and Google Translate (via the =GOOGLETRANSLATE(text, "en", "language") function in Google Sheets) on 100 randomly selected English sentences from Tatoeba.
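
Points 2 and 5 above can be made concrete with a small sketch of the n-gram overlap that BLEU-style metrics are built on. This is not the evaluation script behind the table: it assumes plain whitespace tokenization, and `ngrams` and `clipped_matches` are hypothetical helper names. It counts how many candidate n-grams also occur in the reference, clipping each count by the number of times the n-gram appears in the reference.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_matches(candidate, reference, n):
    """Return (matched, total) n-gram counts for a candidate against one reference.

    Tokenization is a plain whitespace split (an assumption of this sketch);
    each candidate n-gram is credited at most as often as it occurs in the
    reference.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched, sum(cand.values())


# The two Hindi sentences from point 4: Anuvaad's output and the dataset reference.
anuvaad_output = "ऐसे अस्पतालों के चिकित्सा अधीक्षकों के साथ एडवांस रखा जाता है, जिसके बाद मामले के आधार पर सहायता प्रदान की जाती है।"
dataset_reference = "अग्रिम धन राशि इन अस्पतालों को चिकित्सा निरीक्षकों को दी जाएगी, जो हर मामले को देखते हुए सहायता प्रदान करेंगे।"

for n in (1, 2, 3):
    matched, total = clipped_matches(anuvaad_output, dataset_reference, n)
    print(f"{n}-gram matches: {matched} of {total}")
```

Under this tokenization the two sentences have no 3-gram in common. Because BLEU combines the n-gram precisions with a geometric mean, a single precision of zero drives the unsmoothed score to zero even though the translation conveys the same meaning.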

