BNLP is a natural language processing toolkit for Bengali Language
Bengali Natural Language Processing (BNLP)
BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, tag parts of speech in Bengali text, and construct neural models for Bengali NLP tasks.
Contents
- Current Features
- Installation
- Pretrained Model
- Tokenization
- Embedding
- POS Tagging
- Issue
- Contributor Guide
- Contributor List
Current Features
- Tokenization
  - SentencePiece Tokenizer
  - Basic Tokenizer
  - NLTK Tokenizer
- Word Embedding
  - Bengali Word2Vec
  - Bengali FastText
  - Bengali GloVe
Installation
pip installer (tested with Python 3.5, 3.6, and 3.7)
pip install bnlp_toolkit
Local Installer
$ git clone https://github.com/sagorbrur/bnlp.git
$ cd bnlp
$ python setup.py install
Pretrained Model
Download Link
- Bengali SentencePiece
- Bengali Word2Vec
- Bengali FastText
- Bengali GloVe Wordvectors
- Bengali POS Tag model
Training Details
- The SentencePiece, Word2Vec, FastText, and GloVe models were trained on the Bengali Wikipedia dump dataset.
- SentencePiece training vocab size = 50000
- FastText was trained with total words = 20M, vocab size = 1171011, epoch = 50, embedding dimension = 300, and a training loss of 0.318668
- Word2Vec word embedding dimension = 300
- For details on the Bengali GloVe word vectors and their training process, see this repository
- The Bengali CRF POS tagger was trained on the nltr dataset and reaches 80% accuracy.
Tokenization
Bengali SentencePiece Tokenization
- Tokenization using trained model
from bnlp.sentencepiece_tokenizer import SP_Tokenizer
bsp = SP_Tokenizer()
model_path = "./model/bn_spm.model"
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)
text2id = bsp.text2id(model_path, input_text)
print(text2id)
id2text = bsp.id2text(model_path, text2id)
print(id2text)
- Training SentencePiece
from bnlp.sentencepiece_tokenizer import SP_Tokenizer
bsp = SP_Tokenizer()
data = "test.txt"
model_prefix = "test"
vocab_size = 5
bsp.train_bsp(data, model_prefix, vocab_size)
Basic Tokenizer
from bnlp.basic_tokenizer import BasicTokenizer
basic_t = BasicTokenizer()
raw_text = "আমি বাংলায় গান গাই।"
tokens = basic_t.tokenize(raw_text)
print(tokens)
# output: ["আমি", "বাংলায়", "গান", "গাই", "।"]
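To make the expected output concrete, here is a minimal regex-based sketch of punctuation-aware word splitting. It is not BNLP's implementation: the function name simple_tokenize and the punctuation set are assumptions chosen only for illustration.

```python
import re

# Hypothetical rule: a token is either a run of characters that are neither
# whitespace nor punctuation (a word), or a single punctuation mark.
# The punctuation set (danda, comma, semicolon, colon, ?, !, hyphen) is an assumption.
_TOKEN_RE = re.compile(r"[^\s।,;:?!\-]+|[।,;:?!\-]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return _TOKEN_RE.findall(text)


tokens = simple_tokenize("আমি বাংলায় গান গাই।")
print(tokens)
```

Because Bengali vowel signs and other combining marks are not in the punctuation set, they stay attached to the word they belong to.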
NLTK Tokenization
from bnlp.nltk_tokenizer import NLTK_Tokenizer
text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
bnltk = NLTK_Tokenizer()
word_tokens = bnltk.word_tokenize(text)
sentence_tokens = bnltk.sentence_tokenize(text)
print(word_tokens)
print(sentence_tokens)
# output
# word_token: ["আমি", "ভাত", "খাই", "।", "সে", "বাজারে", "যায়", "।", "তিনি", "কি", "সত্যিই", "ভালো", "মানুষ", "?"]
# sentence_token: ["আমি ভাত খাই।", "সে বাজারে যায়।", "তিনি কি সত্যিই ভালো মানুষ?"]
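The sentence-level output relies on treating the Bengali danda (।) as a sentence terminator alongside ? and !. A rough sketch of that idea, independent of BNLP and NLTK, might look like the following; split_sentences is a hypothetical helper, not part of either library.

```python
import re

# Assumed terminators: the Bengali danda plus question and exclamation marks.
# Each match is a run of non-terminator characters and an optional terminator.
_SENTENCE_RE = re.compile(r"[^।?!]+[।?!]?")


def split_sentences(text):
    """Split text after each sentence-ending mark (illustrative only)."""
    return [s.strip() for s in _SENTENCE_RE.findall(text) if s.strip()]


sentences = split_sentences("আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?")
print(sentences)
```

A trained sentence tokenizer handles abbreviations and numbers far better than this rule; the sketch only shows why the danda has to be part of the terminator set.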
Word Embedding
Bengali Word2Vec
Generate Vector Using Pretrained Model
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec()
model_path = "model/bengali_word2vec.model"
word = 'আমার'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)
Find Most Similar Word Using Pretrained Model
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec()
model_path = "model/bengali_word2vec.model"
word = 'আমার'
similar = bwv.most_similar(model_path, word)
print(similar)
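Similarity queries over word vectors are conventionally ranked by cosine similarity. The sketch below shows that ranking on made-up 3-dimensional vectors; the toy_vectors values and the helper names are assumptions for illustration and say nothing about how BNLP computes its result internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(vectors, word, topn=2):
    """Return the topn (word, score) pairs most similar to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors; the pretrained Word2Vec model uses 300 dimensions.
toy_vectors = {
    "আমার": [1.0, 0.0, 0.0],
    "আমাদের": [0.9, 0.1, 0.0],
    "গ্রাম": [0.2, 1.0, 0.0],
    "ফল": [0.0, 0.0, 1.0],
}
ranked = rank_by_similarity(toy_vectors, "আমার")
print(ranked)
```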
Train Bengali Word2Vec with your own data
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec()
data_file = "test.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train_word2vec(data_file, model_name, vector_name)
Bengali FastText
Generate Vector Using Pretrained Model
from bnlp.bengali_fasttext import Bengali_Fasttext
bft = Bengali_Fasttext()
word = "গ্রাম"
model_path = "model/bengali_fasttext.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)
Train Bengali FastText Model
from bnlp.bengali_fasttext import Bengali_Fasttext
bft = Bengali_Fasttext()
data = "data.txt"
model_name = "saved_model.bin"
epoch = 50
bft.train_fasttext(data, model_name, epoch)
Bengali GloVe Word Vectors
We trained a GloVe model on Bengali data (wiki + news articles) and published the resulting Bengali GloVe word vectors. You can download them and use them for your own machine learning purposes.
from bnlp.glove_wordvector import BN_Glove
glove_path = "bn_glove.39M.100d.txt"
word = "গ্রাম"
bng = BN_Glove()
res = bng.closest_word(glove_path, word)
print(res)
vec = bng.word2vec(glove_path, word)
print(vec)
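If you want to read the downloaded vectors without BNLP, the sketch below assumes the file follows the usual GloVe text layout: one word per line, followed by its space-separated vector components. The helper names, the in-memory toy file, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import io
import math


def load_glove_text(handle):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict of float lists."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


def closest_words(embeddings, word, topn=3):
    """Rank the other words by Euclidean distance to `word` (smaller is closer)."""
    target = embeddings[word]

    def distance(other):
        return math.sqrt(
            sum((a - b) ** 2 for a, b in zip(target, embeddings[other]))
        )

    candidates = [w for w in embeddings if w != word]
    return sorted(candidates, key=distance)[:topn]


# Tiny in-memory stand-in for a GloVe text file; the values are made up.
toy_file = io.StringIO(
    "গ্রাম 0.10 0.20 0.30\n"
    "শহর 0.12 0.18 0.33\n"
    "ফল 0.90 0.80 0.10\n"
)
embeddings = load_glove_text(toy_file)
neighbours = closest_words(embeddings, "গ্রাম")
print(neighbours)
```

For the real file, replace the StringIO object with open(glove_path, encoding="utf-8"); loading the full vocabulary this way keeps every vector in memory.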
Bengali POS Tagging
Bengali CRF POS Tagging
Find POS Tag Using Pretrained Model
from bnlp.bengali_pos import BN_CRF_POS
bn_pos = BN_CRF_POS()
model_path = "model/bn_pos_model.pkl"
text = "আমি ভাত খাই।"
res = bn_pos.pos_tag(model_path, text)
print(res)
# [('আমি', 'PPR'), ('ভাত', 'NC'), ('খাই', 'VM'), ('।', 'PU')]
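The tagger returns a plain list of (token, tag) tuples, so ordinary Python is enough to post-process it. The sketch below works on the sample output shown in the example and assumes, from the tagged examples in this document, that NC marks common nouns and PU marks punctuation.

```python
from collections import Counter

# Tagger output copied from the pretrained-model example.
tagged = [("আমি", "PPR"), ("ভাত", "NC"), ("খাই", "VM"), ("।", "PU")]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Tokens tagged NC (assumed to mean common noun).
nouns = [token for token, tag in tagged if tag == "NC"]

# Everything except punctuation.
words_only = [token for token, tag in tagged if tag != "PU"]

print(tag_counts)
print(nouns)
print(words_only)
```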
Train POS Tag Model
from bnlp.bengali_pos import BN_CRF_POS
bn_pos = BN_CRF_POS()
model_name = "pos_model.pkl"
tagged_sentences = [
    [('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'), ('কার্পেট', 'NC'), ('৷', 'PU')],
    [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')],
]
bn_pos.training(model_name, tagged_sentences)
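Training data is a list of sentences, each a list of (token, tag) tuples. Before training on your own data, it can help to check that shape. The validator below is a hypothetical helper, not part of BNLP, and the rules it enforces are assumptions drawn from the training example.

```python
def check_tagged_sentences(tagged_sentences):
    """Return a list of problems; an empty list means the data looks well formed."""
    problems = []
    for s_idx, sentence in enumerate(tagged_sentences):
        if not sentence:
            problems.append("sentence {} is empty".format(s_idx))
            continue
        for t_idx, item in enumerate(sentence):
            # Each item must be a (token, tag) tuple of two non-empty strings.
            ok = (
                isinstance(item, tuple)
                and len(item) == 2
                and all(isinstance(part, str) and part for part in item)
            )
            if not ok:
                problems.append(
                    "sentence {}, item {}: expected (token, tag), got {!r}".format(
                        s_idx, t_idx, item
                    )
                )
    return problems


sample = [
    [("মাটি", "NC"), ("থেকে", "PP"), ("৷", "PU")],
    [("চার", "JQ"), ("ফুট",)],  # the second item is missing its tag
]
problems = check_tagged_sentences(sample)
print(problems)
```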
Issue
- If you get the error
ModuleNotFoundError: No module named 'fasttext'
install fasttext with the following command:
pip install fasttext
- If an nltk issue arises, run the following lines before importing bnlp:
import nltk
nltk.download("punkt")
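Both issues come down to a dependency that is not installed or not set up. As a rough sketch, you could check up front which modules are importable using the standard library; missing_modules is a hypothetical helper name, and the module list is simply the two packages mentioned in this section.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["fasttext", "nltk"])
if missing:
    print("Install before using bnlp: pip install " + " ".join(missing))
else:
    print("fasttext and nltk are importable")
```

This only confirms that the packages can be found; it does not check whether the nltk punkt data has been downloaded.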
Contributor Guide
Check CONTRIBUTING.md page for details.
Thanks To
Contributor List