An NLP Library for Marathi Language
mahaNLP
- mahaNLP is a Python-based natural language processing library focused on the Indian language Marathi. It provides an easy interface for NLP features such as sentiment analysis, named entity recognition, and hate speech detection, exclusively for Marathi text.
- L3Cube, the author of this library, aims to bring Marathi to the forefront of IndicNLP. Our vision is to make Marathi a resource-rich language and promote AI for Maharashtra!
Features:
This library is designed to be used by both basic programmers and ML practitioners.
1. Basic Usage:
This mode of access is designed from a basic programmer's point of view and follows a simpler way to perform the desired tasks. It provides the following features:
- Datasets: Provides the functionality to load the dataset
- Autocomplete: Text prediction
- Preprocess: Data cleaning
- Tokenizer: Tokenizes text
- Tagger: Named entity recognition
- MaskFill: Predicts the masked tokens
- Hate: Detects hate speech
- Sentiment: Sentiment analysis
- Similarity: Detects similarity
2. Advanced Usage:
This mode of access is designed from an ML practitioner's point of view and offers more flexibility in choosing a model for the desired task.
- MaskFill Model: Predicts the masked tokens
- GPT Model: Text prediction
- Hate Model: Detects hate speech
- NER Model: Named entity recognition
- Sentiment Model: Sentiment analysis
- Similarity Model: Detects similarity
Some of these models have sub-models, which can be listed using the list_models() function, as in the Sentiment example.
Installation:
- pip install mahaNLP==[version] (e.g., pip install mahaNLP==0.6)
- or simply: pip install mahaNLP
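After installing, it can be handy to confirm which version pip actually resolved. The sketch below is not part of mahaNLP; it uses only the standard library's importlib.metadata, and the helper name installed_version is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "mahaNLP") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("mahaNLP is not installed; run: pip install mahaNLP")
else:
    print(f"mahaNLP {version} is installed")
```

Returning None instead of raising keeps the check usable in setup scripts that decide whether to run pip.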
A Few Examples:
1. MaskFill (from a basic usage point of view)
Stepwise execution:
- Import: from mahaNLP.mask_fill import MaskPredictor
- Create an object: model = MaskPredictor()
It provides one functionality:
- predict_mask: Predicts the masked token
- Example:
- Pass the string with the word to be predicted replaced by '[MASK]': text = 'मी महाराष्ट्रात [MASK].' (English translation: 'I in Maharashtra [MASK].')
- Call: model.predict_mask(text)
- The output will contain predictions such as:
- मी महाराष्ट्रात आहे.
- मी महाराष्ट्रात राहणार.
- मी महाराष्ट्रात नाही.
- मी महाराष्ट्रातच.
- मी महाराष्ट्रात राहतो.
- There are some optional parameters:
- details (string: 'minimum', 'medium', or 'all'; default: 'minimum'): sets the level of detail in the output
- as_dict (boolean: True or False; default: False): sets the output type
- Example:
- model.predict_mask(text, 'all', True)
- Output: [{'score': 0.46560075879096985, 'token': 1155, 'token_str': 'आहे', 'sequence': 'मी महाराष्ट्रात आहे.'}, {'score': 0.07969045639038086, 'token': 92222, 'token_str': 'राहणार', 'sequence': 'मी महाराष्ट्रात राहणार.'}, {'score': 0.07400081306695938, 'token': 1826, 'token_str': 'नाही', 'sequence': 'मी महाराष्ट्रात नाही.'}, {'score': 0.050422605127096176, 'token': 1617, 'token_str': '##च', 'sequence': 'मी महाराष्ट्रातच.'}, {'score': 0.04373728483915329, 'token': 62560, 'token_str': 'राहतो', 'sequence': 'मी महाराष्ट्रात राहतो.'}]
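The detailed output above is a list of dictionaries, so it can be filtered and ranked with plain Python. The following is a minimal sketch that assumes the list shown above is available as a Python object (whether predict_mask returns it or only prints it is not stated here). The data is copied from the example output; the helper top_sequences is hypothetical and not part of mahaNLP.

```python
from typing import Dict, List, Tuple

# Copied from the example output above (details='all', as_dict=True).
predictions: List[Dict] = [
    {'score': 0.46560075879096985, 'token': 1155, 'token_str': 'आहे', 'sequence': 'मी महाराष्ट्रात आहे.'},
    {'score': 0.07969045639038086, 'token': 92222, 'token_str': 'राहणार', 'sequence': 'मी महाराष्ट्रात राहणार.'},
    {'score': 0.07400081306695938, 'token': 1826, 'token_str': 'नाही', 'sequence': 'मी महाराष्ट्रात नाही.'},
    {'score': 0.050422605127096176, 'token': 1617, 'token_str': '##च', 'sequence': 'मी महाराष्ट्रातच.'},
    {'score': 0.04373728483915329, 'token': 62560, 'token_str': 'राहतो', 'sequence': 'मी महाराष्ट्रात राहतो.'},
]


def top_sequences(preds: List[Dict], k: int = 3, min_score: float = 0.0) -> List[Tuple[str, float]]:
    """Return up to k (sequence, score) pairs with score >= min_score, best first."""
    kept = [p for p in preds if p['score'] >= min_score]
    ranked = sorted(kept, key=lambda p: p['score'], reverse=True)
    return [(p['sequence'], p['score']) for p in ranked[:k]]


for sequence, score in top_sequences(predictions, k=3):
    print(f"{score:.4f}  {sequence}")
```

A score threshold such as min_score=0.07 drops the low-confidence completions and keeps only the first three sentences from this example.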
2. Sentiment (from an advanced usage point of view)
Stepwise execution:
- Import: from mahaNLP.model_repo import SentimentModel
- Create an object: modelSentiment = SentimentModel()
- List the available models:
- modelSentiment.list_models()
- Output:
- sentiment models: MarathiSentiment : l3cube-pune/MarathiSentiment
- tagger models: marathi-ner : l3cube-pune/marathi-ner
- autocomplete models: marathi-gpt : l3cube-pune/marathi-gpt
- similarity models: marathi-sentence-similarity-sbert : l3cube-pune/marathi-sentence-similarity-sbert; marathi-sentence-bert-nli : l3cube-pune/marathi-sentence-bert-nli
- mask_fill models: marathi-bert-v2 : l3cube-pune/marathi-bert-v2; marathi-roberta : l3cube-pune/marathi-roberta; marathi-albert : l3cube-pune/marathi-albert
- hate models: mahahate-bert : l3cube-pune/mahahate-bert; mahahate-multi-roberta : l3cube-pune/mahahate-multi-roberta
The call lists the models available for every task, not only for sentiment. The default model can be changed by the user.
To change the default model, pass the name of the model as the argument: modelSentiment = SentimentModel('name of model'), e.g., modelSentiment = SentimentModel('MarathiSentiment')
- Sentiment provides one functionality:
- get_polarity_score: Gives the polarity score of a sentence along with its label (Neutral, Positive, or Negative)
- Example: text = 'दिवाळीच्या सणादरम्यान सगळे आनंदी असतात.' (English translation: 'Everyone is happy during Diwali festival.')
- modelSentiment.get_polarity_score(text)
- Output: label: Positive score: 0.995338
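Since the model is chosen by name, a typo in that name is an easy mistake to make. The sketch below copies the model listing shown above into a lookup table and validates a name before constructing the model. The table MODELS and the helpers resolve_model and make_sentiment_model are hypothetical and not part of mahaNLP; only the import path and the SentimentModel('name of model') call are taken from the steps above.

```python
from typing import Dict

# Short name -> model id, copied from the list_models() output above.
MODELS: Dict[str, Dict[str, str]] = {
    "sentiment": {"MarathiSentiment": "l3cube-pune/MarathiSentiment"},
    "tagger": {"marathi-ner": "l3cube-pune/marathi-ner"},
    "autocomplete": {"marathi-gpt": "l3cube-pune/marathi-gpt"},
    "similarity": {
        "marathi-sentence-similarity-sbert": "l3cube-pune/marathi-sentence-similarity-sbert",
        "marathi-sentence-bert-nli": "l3cube-pune/marathi-sentence-bert-nli",
    },
    "mask_fill": {
        "marathi-bert-v2": "l3cube-pune/marathi-bert-v2",
        "marathi-roberta": "l3cube-pune/marathi-roberta",
        "marathi-albert": "l3cube-pune/marathi-albert",
    },
    "hate": {
        "mahahate-bert": "l3cube-pune/mahahate-bert",
        "mahahate-multi-roberta": "l3cube-pune/mahahate-multi-roberta",
    },
}


def resolve_model(task: str, name: str) -> str:
    """Return the full model id for a short name, or raise ValueError."""
    try:
        return MODELS[task][name]
    except KeyError:
        available = ", ".join(sorted(MODELS.get(task, {}))) or "none"
        raise ValueError(f"Unknown {task} model {name!r}; available: {available}") from None


def make_sentiment_model(name: str = "MarathiSentiment"):
    """Validate the name, then build the model (requires mahaNLP to be installed)."""
    resolve_model("sentiment", name)  # fail early on a misspelled name
    from mahaNLP.model_repo import SentimentModel  # imported lazily
    return SentimentModel(name)


print(resolve_model("hate", "mahahate-bert"))
```

The table is a snapshot of the listing in this document, so it would need updating if a later release adds or renames models.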
The entire working of mahaNLP is explained in this demo file. Please have a look at it to get a better idea!
Citing
@article{joshi2022l3cube_mahanlp,
title={L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library},
author={Joshi, Raviraj},
journal={arXiv preprint arXiv:2205.14728},
year={2022}
}
Thank you
Team L3Cube