Natural Language Toolkit for Indian Languages (iNLTK)
Natural Language Toolkit for Indic Languages (iNLTK)
iNLTK aims to provide out-of-the-box support for various NLP tasks that an application developer might need for Indic languages. The paper for the iNLTK library was accepted at the NLP-OSS workshop at EMNLP 2020. A preprint of the paper is available on arXiv (2009.12534).
Documentation
Check out the detailed docs, along with installation instructions, at https://inltk.readthedocs.io
Supported languages
Native languages
Language | Code |
---|---|
Hindi | hi |
Punjabi | pa |
Gujarati | gu |
Kannada | kn |
Malayalam | ml |
Oriya | or |
Marathi | mr |
Bengali | bn |
Tamil | ta |
Urdu | ur |
Nepali | ne |
Sanskrit | sa |
English | en |
Telugu | te |
Code Mixed languages
Language | Script | Code |
---|---|---|
Hinglish (Hindi+English) | Latin | hi-en |
Tanglish (Tamil+English) | Latin | ta-en |
Manglish (Malayalam+English) | Latin | ml-en |
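As a minimal sketch of how an application might validate a language code before passing it on to iNLTK, the tables above can be captured in two plain dictionaries. The names `NATIVE_LANGUAGES`, `CODE_MIXED_LANGUAGES`, and `describe_language` are hypothetical and are not part of iNLTK's API; only the codes themselves come from the tables.

```python
# Language codes copied from the tables above.
# The dictionaries and the helper below are illustrative, not part of iNLTK.
NATIVE_LANGUAGES = {
    "hi": "Hindi", "pa": "Punjabi", "gu": "Gujarati", "kn": "Kannada",
    "ml": "Malayalam", "or": "Oriya", "mr": "Marathi", "bn": "Bengali",
    "ta": "Tamil", "ur": "Urdu", "ne": "Nepali", "sa": "Sanskrit",
    "en": "English", "te": "Telugu",
}

CODE_MIXED_LANGUAGES = {
    "hi-en": "Hinglish (Hindi+English)",
    "ta-en": "Tanglish (Tamil+English)",
    "ml-en": "Manglish (Malayalam+English)",
}


def describe_language(code: str) -> str:
    """Return a readable label for a supported code, or raise ValueError."""
    normalized = code.strip().lower()
    if normalized in NATIVE_LANGUAGES:
        return f"{NATIVE_LANGUAGES[normalized]} (native)"
    if normalized in CODE_MIXED_LANGUAGES:
        # All code-mixed languages listed above are written in Latin script.
        return f"{CODE_MIXED_LANGUAGES[normalized]} (code-mixed, Latin script)"
    raise ValueError(f"Unsupported language code: {code!r}")


if __name__ == "__main__":
    for code in ("hi", "ta-en"):
        print(code, "->", describe_language(code))
```

Failing early on an unknown code keeps the error close to the caller instead of surfacing later inside a model download or tokenization step.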
Repositories containing models used in iNLTK
Note: The English model has been taken directly from fast.ai
Effect of using Transfer Learning + Paraphrases from iNLTK
Language | Repository | Dataset used for classification | Results using the complete training set | Percentage decrease in training set size | Results using the reduced training set without paraphrases | Results using the reduced training set with paraphrases |
---|---|---|---|---|---|---|
Hindi | NLP for Hindi | IIT Patna Movie Reviews | Accuracy: 57.74, MCC: 37.23 | 80% (2480 -> 496) | Accuracy: 47.74, MCC: 20.50 | Accuracy: 56.13, MCC: 34.39 |
Bengali | NLP for Bengali | Bengali News Articles (Soham Articles) | Accuracy: 90.71, MCC: 87.92 | 99% (11284 -> 112) | Accuracy: 69.88, MCC: 61.56 | Accuracy: 74.06, MCC: 65.08 |
Gujarati | NLP for Gujarati | iNLTK Headlines Corpus - Gujarati | Accuracy: 91.05, MCC: 86.09 | 90% (5269 -> 526) | Accuracy: 80.88, MCC: 70.18 | Accuracy: 81.03, MCC: 70.44 |
Malayalam | NLP for Malayalam | iNLTK Headlines Corpus - Malayalam | Accuracy: 95.56, MCC: 93.29 | 90% (5036 -> 503) | Accuracy: 82.38, MCC: 73.47 | Accuracy: 84.29, MCC: 76.36 |
Marathi | NLP for Marathi | iNLTK Headlines Corpus - Marathi | Accuracy: 92.40, MCC: 85.23 | 95% (9672 -> 483) | Accuracy: 84.13, MCC: 68.59 | Accuracy: 84.55, MCC: 69.11 |
Tamil | NLP for Tamil | iNLTK Headlines Corpus - Tamil | Accuracy: 95.22, MCC: 92.70 | 95% (5346 -> 267) | Accuracy: 86.25, MCC: 79.42 | Accuracy: 89.84, MCC: 84.63 |
For more details about the implementation, or to reproduce the results, check out the respective repositories.
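The percentages in the results table can be checked with a little arithmetic. The sketch below recomputes the training-set reduction and the accuracy gain from paraphrases for each language, using only the numbers reported in the table. The `Row` type and the function names are illustrative assumptions, and the "accuracy gain" figure is simply a difference of two table columns, not a metric reported by the iNLTK authors.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One language's figures, copied from the results table above."""
    language: str
    full_size: int          # complete training set size
    reduced_size: int       # reduced training set size
    acc_reduced: float      # accuracy on reduced set, without paraphrases
    acc_paraphrase: float   # accuracy on reduced set, with paraphrases


ROWS = [
    Row("Hindi", 2480, 496, 47.74, 56.13),
    Row("Bengali", 11284, 112, 69.88, 74.06),
    Row("Gujarati", 5269, 526, 80.88, 81.03),
    Row("Malayalam", 5036, 503, 82.38, 84.29),
    Row("Marathi", 9672, 483, 84.13, 84.55),
    Row("Tamil", 5346, 267, 86.25, 89.84),
]


def percent_decrease(full_size: int, reduced_size: int) -> float:
    """Percentage by which the training set shrank."""
    return 100.0 * (full_size - reduced_size) / full_size


def accuracy_gain(row: Row) -> float:
    """Accuracy points gained by adding paraphrases to the reduced set."""
    return row.acc_paraphrase - row.acc_reduced


if __name__ == "__main__":
    for row in ROWS:
        decrease = percent_decrease(row.full_size, row.reduced_size)
        print(f"{row.language}: {decrease:.1f}% smaller training set, "
              f"{accuracy_gain(row):+.2f} accuracy points from paraphrases")
```

Rounded to the nearest whole percent, the computed reductions agree with the percentages listed in the table.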
Contributing
Add a new language support
If you would like to add support for a language of your choice to iNLTK, please start by checking for, or raising, an issue here
Please check out the steps I mentioned here for Telugu to begin with. They should be nearly the same for other languages.
Improving models/using models for your own research
If you would like to take iNLTK's models and refine them with your own dataset, or build your own custom models on top of them, please check out the repositories in the above table for the language of your choice. Those repositories contain links to the datasets, pretrained models, classifiers, and all of the code used to build them.
Add new functionality
If you would like a particular piece of functionality in iNLTK, start by checking for, or raising, an issue here
What's next
..and being worked upon
Shout out if you want to help :)
- Add Maithili support
..and NOT being worked upon
Shout out if you want to lead :)
- Add NER support for all languages
- Add Textual Entailment support for all languages
- Work on a unified model for all the languages
- POS support in iNLTK
- Add translations - to and from languages in iNLTK + English
iNLTK's Appreciation
- By Jeremy Howard on Twitter
- By Sebastian Ruder on Twitter
- By Vincent Boucher, By Philip Vollet, By Steve Nouri on LinkedIn
- By Kanimozhi, By Soham, By Imaad on LinkedIn
- iNLTK was trending on GitHub in May 2019
Citation
If you use this library in your research, please consider citing:
@misc{arora2020inltk,
title={iNLTK: Natural Language Toolkit for Indic Languages},
author={Gaurav Arora},
year={2020},
eprint={2009.12534},
archivePrefix={arXiv},
primaryClass={cs.CL}
}