Bangla FastText Model & Toolkit
We constructed BanglaLM: Bangla Language Model Dataset, a corpus of around 14 GB of Bangla text for training unsupervised ML models. It is one of the largest language model datasets for Bengali. The Bangla FastText models were developed from this dataset and trained on Google Cloud. We developed two models, one using the skipgram training method and one using cbow. This open-source Python module makes both models easy to use. We also developed a sentence embedding system so the embeddings can be used with sklearn classifiers. It showed better performance than Facebook's pretrained fastText model on the Bangla Wiki dataset.
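To make the difference between the two training methods concrete, here is a minimal sketch of how skipgram and cbow build training examples from a tokenized sentence. It is an illustration only: the function names and the window size are assumptions made for this example, and it does not show how the released models were actually trained.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def skipgram_pairs(tokens, window=2):
    """Skipgram: each (center, context) pair is one example; the center predicts the context word."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


def cbow_examples(tokens, window=2):
    """CBOW: the whole context window is the input; the center word is the target."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


# Hypothetical whitespace tokenization of one sentence from this README.
tokens = "আমি দেশকে ভালোবাসি".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

for center, ctx in sg:
    print("skipgram:", center, "->", ctx)
for context, target in cb:
    print("cbow:", context, "->", target)
```

Skipgram produces one example per (center, context) pair, while cbow produces one example per position.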
Dataset (Bengali)
Kaggle link for the dataset :
BanglaLM: Bangla Language Model Dataset
Model link:
Installation
To install the latest release, run the following (the leading ! is for notebook cells; drop it in a regular shell):
!pip install BanglaFastText
Or, to get the latest development version of BanglaFastText, install from our GitHub repository:
$ git clone https://github.com/Kowsher/Bangla-Fasttext.git
$ cd Bangla-Fasttext
$ sudo pip install .
$ # or :
$ sudo python setup.py install
For further information and an introduction, see README.md.
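After installing, one way to confirm that the package is visible to the current Python environment is to query its installed version with the standard library. This is a small sketch; the helper name `installed_version` is made up for this example, and the distribution name is assumed to match the name used in the pip command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("BanglaFastText")
if version is None:
    print("BanglaFastText is not installed in this environment")
else:
    print("BanglaFastText version:", version)
```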
Getting started
To get word vectors, as described here, use BanglaFastText like this:
>>> import BanglaFastText
# There are two variations of the training method: cbow and skipgram.
# Skipgram model:
>>> Bn = BanglaFastText.BanglaFasttext(method='skipgram', path='./content/model/')
# 'path' is the directory where the downloaded model is saved
>>> model = Bn.model_load()
# or, cbow model:
>>> Bn = BanglaFastText.BanglaFasttext(method='cbow', path='./content/model/')
>>> model = Bn.model_load()
Here the method parameter selects the training method, and path is the directory where the model is saved.
Loading a model object
If we already have a model, we can simply read and load it as follows:
# To read a model
>>> Bn = BanglaFastText.BanglaFasttext(model_name='model_name')
# To load the model as an object
>>> model = Bn.model_load()
Playing with the parameters
# to get the vector of a word
>>> model['দেশ']
# to get most similar words
>>> model.most_similar("দেশ")
# to find word similarity
>>> Bn.word_similarity('কিতাব', 'বই')
# to find sentence similarity
>>> Bn.sent_similarity('আমি দেশকে ভালোবাসি', 'অনেক সুন্দর আমাদের দেশ')
# for sentence embedding
>>> corpus = ['আমি দেশকে ভালোবাসি', 'অনেক সুন্দর আমাদের দেশ']
>>> X = Bn.sent_embd(corpus)
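For intuition about what the similarity and sentence embedding calls above compute, here is a minimal sketch using cosine similarity and mean pooling over word vectors. This is a common approach, not a description of the package's internals: whether BanglaFastText pools this way is an assumption, and the 3-dimensional `toy_vectors` are invented for illustration. Unlike this toy lookup, which skips unknown tokens, a real fastText model can build vectors for out-of-vocabulary words from character n-grams.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_embedding(tokens, vectors, dim):
    """Average the vectors of the known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Invented toy vectors (NOT taken from the real model).
toy_vectors = {
    "আমি": [1.0, 0.0, 0.0],
    "দেশকে": [0.0, 1.0, 0.0],
    "দেশ": [0.0, 1.0, 0.0],
    "ভালোবাসি": [0.0, 0.0, 1.0],
    "সুন্দর": [0.0, 0.0, 1.0],
}

corpus = ["আমি দেশকে ভালোবাসি", "অনেক সুন্দর আমাদের দেশ"]
X = [mean_embedding(sentence.split(), toy_vectors, dim=3) for sentence in corpus]
sim = cosine(X[0], X[1])
print("sentence similarity:", round(sim, 4))
```

Each row of `X` is a fixed-length vector, which is the shape sklearn classifiers expect as input features.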
Fine Tuning
To fine-tune the model, that is, to update its weights with a new dataset:
>>> corpus = ['আমি দেশকে ভালোবাসি', 'অনেক সুন্দর আমাদের দেশ']
>>> Bn.fine_tuning(corpus, epochs=5)
>>> model = Bn.model_load()
......
# fine_tuning returns the raw fine-tuned model, in case we want to use it further
>>> tuned_model = Bn.fine_tuning(corpus, epochs=5)
Hashes for BanglaFastText-1.1-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 562803b41653c8ae071a1ebc98921dd1652e9c4361e0fc6f1fe3ab30b658539b |
MD5 | 240bc84be240b4a2af5ee8c8675e6fea |
BLAKE2b-256 | 4ae8f6de7221f8c60d08c647abbcd3e1c494a97781fd333943cce411e0d9556e |