Natural Language Toolkit for Indian Languages (iNLTK)
Natural Language Toolkit for Indic Languages (iNLTK)
iNLTK aims to provide out-of-the-box support for the NLP tasks that an application developer might need for Indic languages.
Installation
pip install http://download.pytorch.org/whl/cpu/torch-1.0.0-cp36-cp36m-linux_x86_64.whl
pip install inltk
iNLTK runs on CPU, which is the desired behaviour for most deep learning models in production.
The first command above installs pytorch-cpu, which, as the name suggests, does not have CUDA support.
Note: inltk is currently supported only on Linux with Python >= 3.6
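The requirements above can be checked before installing. The following is a minimal sketch, not part of iNLTK; the function name `check_inltk_platform` is hypothetical. Note that the wheel URL in the first install command carries the `cp36` tag, so that particular file targets CPython 3.6 on 64-bit Linux.

```python
import sys


def check_inltk_platform(platform=sys.platform, version=sys.version_info):
    """Return a list of problems; an empty list means the stated requirements are met.

    Hypothetical helper: it only encodes the note above (Linux, Python >= 3.6).
    """
    problems = []
    if not platform.startswith("linux"):
        problems.append("iNLTK is supported only on Linux, found: " + platform)
    major_minor = tuple(version[:2])
    if major_minor < (3, 6):
        problems.append("Python >= 3.6 is required, found: %d.%d" % major_minor)
    return problems


# Print a warning for each requirement the current interpreter does not meet.
for problem in check_inltk_platform():
    print("warning:", problem)
```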
Supported languages
Language | Code |
---|---|
Hindi | hi |
Punjabi | pa |
Sanskrit | sa |
Gujarati | gu |
Kannada | kn |
Malayalam | ml |
Nepali | ne |
Odia | or |
Marathi | mr |
Bengali | bn |
Tamil | ta |
Usage
Set up the language
from inltk.inltk import setup
setup('<code-of-language>')  # for example, setup('hi') to use Hindi
Note: You need to run setup('<code-of-language>') when you use a language for the FIRST TIME ONLY. This will download all the necessary models required to do inference for that language.
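Since setup is only needed the first time a language is used, a script that runs repeatedly may want to skip it on later runs. The sketch below is one possible way to do that with marker files. The helper `ensure_setup` and the marker directory are assumptions made for illustration and are not part of iNLTK; the setup function is passed in as an argument so the helper itself does not depend on the library.

```python
from pathlib import Path


def ensure_setup(code, setup_fn, marker_dir):
    """Call setup_fn(code) only if no marker file exists for this language.

    Returns True if setup_fn was called, False if it was skipped.
    Hypothetical helper; the marker-file scheme is not something iNLTK provides.
    """
    marker_dir = Path(marker_dir)
    marker = marker_dir / (code + ".done")
    if marker.exists():
        return False
    setup_fn(code)  # expected to download the models for this language
    marker_dir.mkdir(parents=True, exist_ok=True)
    marker.touch()  # record success only after setup_fn returned
    return True
```

With iNLTK installed, this could be called as `ensure_setup('hi', setup, Path.home() / '.inltk_markers')` after `from inltk.inltk import setup`, where the directory name is again only an example.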
Tokenize
from inltk.inltk import tokenize
tokenize(text, '<code-of-language>')  # text is a string in <code-of-language>
Get Embedding Vectors
This returns a list of embedding vectors, one 400-dimensional vector for each token in the text.
from inltk.inltk import get_embedding_vectors
vectors = get_embedding_vectors(text, '<code-of-language>')  # text is a string in <code-of-language>
Example:
>> vectors = get_embedding_vectors('भारत', 'hi')
>> vectors[0].shape
(400,)
>> get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
[array([-0.894777, -0.140635, -0.030086, -0.669998, ..., 0.859898, 1.940608, 0.09252 , 1.043363], dtype=float32), array([ 0.290839, 1.459981, -0.582347, 0.27822 , ..., -0.736542, -0.259388, 0.086048, 0.736173], dtype=float32), array([ 0.069481, -0.069362, 0.17558 , -0.349333, ..., 0.390819, 0.117293, -0.194081, 2.492722], dtype=float32), array([-0.37837 , -0.549682, -0.497131, 0.161678, ..., 0.048844, -1.090546, 0.154555, 0.925028], dtype=float32), array([ 0.219287, 0.759776, 0.695487, 1.097593, ..., 0.016115, -0.81602 , 0.333799, 1.162199], dtype=float32), array([-0.31529 , -0.281649, -0.207479, 0.177357, ..., 0.729619, -0.161499, -0.270225, 2.083801], dtype=float32), array([-0.501414, 1.337661, -0.405563, 0.733806, ..., -0.182045, -1.413752, 0.163339, 0.907111], dtype=float32), array([ 0.185258, -0.429729, 0.060273, 0.232177, ..., -0.537831, -0.51664 , -0.249798, 1.872428], dtype=float32)]
>> vectors = get_embedding_vectors('ਜਿਹਨਾਂ ਤੋਂ ਧਾਤਵੀ ਅਲੌਹ ਦਾ ਆਰਥਕ','pa')
>> len(vectors)
8
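Because the function returns one vector per token, comparing two texts needs some way to combine token vectors. A common and simple option is to average them and compare the averages with cosine similarity. The sketch below illustrates that idea in plain Python; it is not an iNLTK feature. The helper names are hypothetical, and the toy data uses 3 dimensions instead of 400.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not vectors:
        raise ValueError("need at least one token vector")
    count = len(vectors)
    # zip(*vectors) walks the vectors dimension by dimension.
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Stand-in data: two "tokens" with 3 dimensions each.
toy_vectors = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
sentence_vector = mean_pool(toy_vectors)  # [2.0, 0.0, 1.0]
```

In real use, the list returned by `get_embedding_vectors` would take the place of `toy_vectors`.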
Predict Next 'n' words
from inltk.inltk import predict_next_words
predict_next_words(text, n, '<code-of-language>')
# text --> string in <code-of-language>
# n --> number of words you want to predict (integer)
Note: You can also pass a fourth parameter, randomness, to predict_next_words. It has a default value of 0.8
Identify language
Note: If you update the version of iNLTK, you need to run reset_language_identifying_models before identifying language.
from inltk.inltk import identify_language, reset_language_identifying_models
reset_language_identifying_models() # only if you've updated iNLTK version
identify_language(text)
# text --> string in one of the supported languages
Example:
>> identify_language('न्यायदर्शनम् भारतीयदर्शनेषु अन्यतमम्। वैदिकदर्शनेषु ')
'sanskrit'
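The example shows identify_language returning a language name ('sanskrit'), while the other functions expect a language code. A small lookup can bridge the two. The sketch below is an assumption-laden illustration: only 'sanskrit' is confirmed by the example above, and the other keys assume the same lowercase English naming. The helper `language_code` is hypothetical.

```python
# Codes come from the "Supported languages" table.
# ASSUMPTION: names are lowercase English, as in the single 'sanskrit' example.
# The exact string returned for each language (for instance the spelling used
# for Malayalam) is not confirmed here and may differ.
NAME_TO_CODE = {
    "hindi": "hi",
    "punjabi": "pa",
    "sanskrit": "sa",
    "gujarati": "gu",
    "kannada": "kn",
    "malayalam": "ml",
    "nepali": "ne",
    "odia": "or",
    "marathi": "mr",
    "bengali": "bn",
    "tamil": "ta",
}


def language_code(name):
    """Map a language name to its code, or return None if the name is unknown."""
    return NAME_TO_CODE.get(name.strip().lower())
```

Returning None for unknown names lets the caller decide whether to skip the text or raise an error.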
Remove foreign languages
from inltk.inltk import remove_foreign_languages
remove_foreign_languages(text, '<code-of-language>')
# text --> string in one of the supported languages
# <code-of-language> --> code of the language whose words you want to retain
Example:
>> remove_foreign_languages('विकिपीडिया सभी विषयों ਇੱਕ ਅਲੌਕਿਕ ਨਜ਼ਾਰਾ ਬੱਝਾ ਹੋਇਆ ਸਾਹਮਣੇ ਆ ਖਲੋਂਦਾ ਸੀ पर प्रामाणिक और 维基百科:关于中文维基百科 उपयोग, परिवर्तन 维基百科:关于中文维基百科', 'hi')
['▁विकिपीडिया', '▁सभी', '▁विषयों', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁', '<unk>', '▁पर', '▁प्रामाणिक', '▁और', '▁', '<unk>', ':', '<unk>', '▁उपयोग', ',', '▁परिवर्तन', '▁', '<unk>', ':', '<unk>']
Every word that is not in the host language becomes <unk>, and ▁ signifies a space character.
Check out this notebook by Amol Mahajan, in which he uses iNLTK to remove foreign characters from the iitb_en_hi_parallel corpus.
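The returned list of pieces can be turned back into plain text by dropping the <unk> entries and converting ▁ to spaces. The following is a minimal sketch of that post-processing step; `pieces_to_text` is a hypothetical helper, not an iNLTK function, and the sample list is a shortened version of the output shown above.

```python
def pieces_to_text(pieces, unknown="<unk>", space="\u2581"):
    """Drop unknown pieces and join the rest into a plain string.

    "\u2581" is the '▁' marker that stands for a space in the output.
    """
    kept = [piece for piece in pieces if piece != unknown]
    text = "".join(kept).replace(space, " ")
    # Collapse the runs of spaces left behind by the removed pieces.
    return " ".join(text.split())


pieces = ['▁विकिपीडिया', '▁सभी', '▁विषयों', '▁', '<unk>', '▁पर', '▁प्रामाणिक']
print(pieces_to_text(pieces))  # विकिपीडिया सभी विषयों पर प्रामाणिक
```

Punctuation that surrounded foreign words (such as the ':' pieces in the full example) is kept by this sketch and may need separate handling.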
Repositories containing models used in iNLTK
Language | Repository | Perplexity of Language model | Wikipedia Articles Dataset | Classification accuracy | Classification Kappa score |
---|---|---|---|---|---|
Hindi | NLP for Hindi | ~36 | 55,000 articles | ~79 (News Classification) | ~30 (Movie Review Classification) |
Punjabi | NLP for Punjabi | ~13 | 44,000 articles | ~89 (News Classification) | ~60 (News Classification) |
Sanskrit | NLP for Sanskrit | ~6 | 22,273 articles | ~70 (Shloka Classification) | ~56 (Shloka Classification) |
Gujarati | NLP for Gujarati | ~34 | 31,913 articles | ~91 (News Classification) | ~85 (News Classification) |
Kannada | NLP for Kannada | ~70 | 32,997 articles | ~94 (News Classification) | ~90 (News Classification) |
Malayalam | NLP for Malyalam | ~26 | 12,388 articles | ~94 (News Classification) | ~91 (News Classification) |
Nepali | NLP for Nepali | ~32 | 38,757 articles | ~97 (News Classification) | ~96 (News Classification) |
Odia | NLP for Odia | ~27 | 17,781 articles | ~95 (News Classification) | ~92 (News Classification) |
Marathi | NLP for Marathi | ~18 | 85,537 articles | ~91 (News Classification) | ~84 (News Classification) |
Bengali | NLP for Bengali | ~41 | 72,374 articles | ~94 (News Classification) | ~92 (News Classification) |
Tamil | NLP for Tamil | ~20 | >127,000 articles | ~97 (News Classification) | ~95 (News Classification) |
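The table lists a Kappa score next to accuracy but does not say how it is computed. Assuming it is Cohen's kappa expressed as a percentage, the sketch below shows how that metric corrects raw accuracy for agreement expected by chance. The function `cohen_kappa` is written here for illustration and is not taken from the iNLTK repositories.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement (plain accuracy) and p_e is the agreement
    expected by chance given how often each label occurs on both sides.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[label] * pred_counts[label]
                   for label in true_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # only one label on both sides: agreement is trivially perfect
    return (observed - expected) / (1.0 - expected)


# 3 of 4 predictions are correct (accuracy 0.75), chance agreement is 0.5.
print(cohen_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"]))  # 0.5
```

Under this reading, a kappa well below the accuracy suggests that much of the accuracy could be reached by chance, for example on an imbalanced dataset.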
Contributing
Add a new language support for iNLTK
If you would like to add support for a language of your choice to iNLTK, please start by checking for or raising an issue here.
Please check out the steps I mentioned here for Telugu to begin with. They should be nearly the same for other languages.
Improving models/Using models for your own research
If you would like to take iNLTK's models and refine them with your own dataset, or build your own custom models on top of them, please check out the repositories in the table above for the language of your choice. Those repositories contain links to datasets, pretrained models, classifiers and all of the code.
Add new functionality
If you would like a particular piece of functionality in iNLTK, start by checking for or raising an issue here.
What's next (and being worked upon)
Shout out if you want to help :)
- Add Telugu support
- Add functions get_embeddings_for_words and get_embeddings_for_sentences
- Add NER for all the languages
- Add translations - to and from languages in iNLTK + English
- Work on a unified model for all the languages
What's next - (and NOT being worked upon)
Shout out if you want to lead :)
- Add Windows support
Appreciation for iNLTK