BNLP is a natural language processing toolkit for Bengali Language
Bengali Natural Language Processing (BNLP)
BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, perform Bengali POS tagging and named entity recognition, and construct neural models for Bengali NLP purposes.
NB: If you refer to this tool in a research paper, please let us know and we will include a link to the paper here.
Documentation
For full documentation, see the BNLP documentation.
Installation
Install with pip (tested with Python 3.5, 3.6, and 3.7):

    pip install bnlp_toolkit
Pretrained Model
Download Link
- Bengali SentencePiece
- Bengali Word2Vec
- Bengali FastText
- Bengali GloVe Wordvectors
- Bengali POS Tag model
- Bengali NER model
Tokenization
- Bengali SentencePiece Tokenization
  - Tokenization using a trained model

        from bnlp import SentencepieceTokenizer

        bsp = SentencepieceTokenizer()
        model_path = "./model/bn_spm.model"
        input_text = "আমি ভাত খাই। সে বাজারে যায়।"

        # Split the text into subword pieces
        tokens = bsp.tokenize(model_path, input_text)
        print(tokens)

        # Convert the text to piece ids and back again
        text2id = bsp.text2id(model_path, input_text)
        print(text2id)
        id2text = bsp.id2text(model_path, text2id)
        print(id2text)
  - Training SentencePiece

        from bnlp import SentencepieceTokenizer

        bsp = SentencepieceTokenizer()
        data = "test.txt"
        model_prefix = "test"
        vocab_size = 5
        bsp.train(data, model_prefix, vocab_size)
- Basic Tokenizer

      from bnlp import BasicTokenizer

      basic_tokenizer = BasicTokenizer()
      raw_text = "আমি বাংলায় গান গাই।"
      tokens = basic_tokenizer.tokenize(raw_text)
      print(tokens)
      # output: ["আমি", "বাংলায়", "গান", "গাই", "।"]
- NLTK Tokenization

      from bnlp import NLTKTokenizer

      text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
      bnltk = NLTKTokenizer()
      word_tokens = bnltk.word_tokenize(text)
      sentence_tokens = bnltk.sentence_tokenize(text)
      print(word_tokens)
      print(sentence_tokens)
      # output
      # word_token: ["আমি", "ভাত", "খাই", "।", "সে", "বাজারে", "যায়", "।", "তিনি", "কি", "সত্যিই", "ভালো", "মানুষ", "?"]
      # sentence_token: ["আমি ভাত খাই।", "সে বাজারে যায়।", "তিনি কি সত্যিই ভালো মানুষ?"]
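To make the expected outputs above easier to reason about, here is a minimal sketch of word and sentence splitting for Bengali text using only the standard library. It is not how BasicTokenizer or NLTKTokenizer are implemented; the function names and the punctuation set are assumptions chosen for illustration.

```python
import re

# Assumption: these marks are treated as standalone tokens in this sketch.
# A negated character class is used instead of \w because Python's \w does
# not match Bengali vowel signs (they are combining marks), which would
# break words apart.
WORD_PATTERN = re.compile(r"[^\s।?!,;:]+|[।?!,;:]")
SENTENCE_PATTERN = re.compile(r"[^।?!]+[।?!]?")


def word_tokenize(text):
    """Split on whitespace and emit each punctuation mark as its own token."""
    return WORD_PATTERN.findall(text)


def sentence_tokenize(text):
    """Cut after each sentence-final mark (danda, question or exclamation mark)."""
    return [part.strip() for part in SENTENCE_PATTERN.findall(text) if part.strip()]


text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
print(word_tokenize(text))
print(sentence_tokenize(text))
```

A rule-based splitter like this ignores abbreviations and numbers with embedded punctuation, which is one reason to prefer the library tokenizers for real text.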
Word Embedding
- Bengali Word2Vec
  - Generate a vector using the pretrained model

        from bnlp import BengaliWord2Vec

        bwv = BengaliWord2Vec()
        model_path = "bengali_word2vec.model"
        word = 'আমার'
        vector = bwv.generate_word_vector(model_path, word)
        print(vector.shape)
        print(vector)
  - Find the most similar words using the pretrained model

        from bnlp import BengaliWord2Vec

        bwv = BengaliWord2Vec()
        model_path = "bengali_word2vec.model"
        word = 'গ্রাম'
        similar = bwv.most_similar(model_path, word)
        print(similar)
  - Train Bengali Word2Vec with your own data

        from bnlp import BengaliWord2Vec

        bwv = BengaliWord2Vec()
        data_file = "sample.txt"
        model_name = "test_model.model"
        vector_name = "test_vector.vector"
        bwv.train(data_file, model_name, vector_name)
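Similar-word lookups over word vectors are commonly based on cosine similarity. The sketch below shows that idea on a toy vocabulary; the three-dimensional vectors are invented for illustration, and whether BengaliWord2Vec.most_similar ranks words exactly this way is an assumption, not something stated here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors; real models use hundreds of dimensions.
toy_vectors = {
    "গ্রাম": [0.9, 0.1, 0.0],
    "শহর": [0.8, 0.2, 0.1],
    "নদী": [0.1, 0.9, 0.2],
    "গান": [0.0, 0.1, 0.9],
}
print(most_similar(toy_vectors, "গ্রাম"))
```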
- Bengali FastText

  To use fasttext, you need to install it manually with pip install fasttext==0.9.2.

  NB: it does not work on Windows; it only works on Linux.
  - Generate a vector using the pretrained model

        from bnlp.embedding.fasttext import BengaliFasttext

        bft = BengaliFasttext()
        word = "গ্রাম"
        model_path = "bengali_fasttext_wiki.bin"
        word_vector = bft.generate_word_vector(model_path, word)
        print(word_vector.shape)
        print(word_vector)
  - Train a Bengali FastText model

        from bnlp.embedding.fasttext import BengaliFasttext

        bft = BengaliFasttext()
        data = "sample.txt"
        model_name = "saved_model.bin"
        epoch = 50
        bft.train(data, model_name, epoch)
- Bengali GloVe Word Vectors

  We trained a GloVe model on Bengali data (wiki + news articles) and published the resulting Bengali GloVe word vectors. You can download them and use them for your own machine learning purposes.

      from bnlp import BengaliGlove

      glove_path = "bn_glove.39M.100d.txt"
      word = "গ্রাম"
      bng = BengaliGlove()  # instantiate the imported class (was BN_Glove, which is never imported)

      res = bng.closest_word(glove_path, word)
      print(res)
      vec = bng.word2vec(glove_path, word)
      print(vec)
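If you want to read the vectors without the toolkit, the sketch below parses the standard GloVe text layout: one word per line followed by space-separated floats. It assumes the published Bengali file follows that layout; the load_glove name and the sample values are hypothetical, and an in-memory buffer stands in for the real file.

```python
import io


def load_glove(handle):
    """Read 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Hypothetical two-word sample standing in for the real vector file.
sample = io.StringIO("গ্রাম 0.12 -0.40 0.33\nশহর 0.10 -0.35 0.30\n")
vectors = load_glove(sample)
print(len(vectors), len(vectors["গ্রাম"]))
```

For a real file, pass `open(glove_path, encoding="utf-8")` instead of the StringIO buffer.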
Bengali POS Tagging
- Bengali CRF POS Tagging
  - Find POS tags using the pretrained model

        from bnlp import POS

        bn_pos = POS()
        model_path = "model/bn_pos.pkl"
        text = "আমি ভাত খাই।"
        res = bn_pos.tag(model_path, text)
        print(res)
        # [('আমি', 'PPR'), ('ভাত', 'NC'), ('খাই', 'VM'), ('।', 'PU')]
  - Train a POS tag model

        from bnlp import POS

        bn_pos = POS()
        model_name = "pos_model.pkl"

        # Each sentence is a list of (word, tag) pairs
        tagged_sentences = [
            [('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'), ('কার্পেট', 'NC'), ('৷', 'PU')],
            [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')],
        ]

        bn_pos.train(model_name, tagged_sentences)
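CRF taggers are usually trained on per-token feature dictionaries paired with tag sequences. The sketch below shows one plausible way to turn (word, tag) sentences like the ones above into that shape. The feature set and the names token_features and to_dataset are assumptions for illustration; the features the toolkit actually uses are not described here.

```python
def token_features(sentence, i):
    """Build a small feature dict for the token at position i (hypothetical feature set)."""
    word = sentence[i][0]
    last = len(sentence) - 1
    return {
        "word": word,
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "is_first": i == 0,
        "is_last": i == last,
        "prev_word": sentence[i - 1][0] if i > 0 else "<BOS>",
        "next_word": sentence[i + 1][0] if i < last else "<EOS>",
    }


def to_dataset(tagged_sentences):
    """Split (word, tag) sentences into feature sequences X and tag sequences y."""
    X = [[token_features(s, i) for i in range(len(s))] for s in tagged_sentences]
    y = [[tag for _, tag in s] for s in tagged_sentences]
    return X, y


tagged_sentences = [[("আমি", "PPR"), ("ভাত", "NC"), ("খাই", "VM"), ("।", "PU")]]
X, y = to_dataset(tagged_sentences)
print(y[0])
print(X[0][1]["prev_word"], X[0][1]["next_word"])
```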
Bengali NER
- Bengali CRF NER
  - Find NER tags using the pretrained model

        from bnlp import NER

        bn_ner = NER()
        model_path = "model/bn_ner.pkl"
        text = "সে ঢাকায় থাকে।"
        result = bn_ner.tag(model_path, text)
        print(result)
        # [('সে', 'O'), ('ঢাকায়', 'S-LOC'), ('থাকে', 'O')]
  - Train an NER tag model

        from bnlp import NER

        bn_ner = NER()
        model_name = "ner_model.pkl"

        # Each sentence is a list of (word, tag) pairs; the sample repeats one sentence three times
        tagged_sentences = [
            [('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')],
            [('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')],
            [('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')],
        ]

        bn_ner.train(model_name, tagged_sentences)
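The tags in the examples above carry S-, B-, I-, and E- prefixes, which suggests a BIOES-style scheme (single, begin, inside, end). Assuming that reading is right, the sketch below groups tagged tokens into whole entities. The extract_entities helper is hypothetical and not part of the toolkit, and the input mixes tokens from the two examples above.

```python
def extract_entities(tagged_tokens):
    """Collect (entity text, entity type) pairs from BIOES-style (word, tag) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, tag in tagged_tokens:
        prefix, _, ent_type = tag.partition("-")
        if prefix == "S":
            # Single-token entity
            entities.append((word, ent_type))
            current_words, current_type = [], None
        elif prefix == "B":
            # Start of a multi-token entity
            current_words, current_type = [word], ent_type
        elif prefix in ("I", "E") and current_type == ent_type:
            current_words.append(word)
            if prefix == "E":
                entities.append((" ".join(current_words), ent_type))
                current_words, current_type = [], None
        else:
            # 'O' or an inconsistent tag: drop any partial entity
            current_words, current_type = [], None
    return entities


tagged = [("সুজিত", "B-PER"), ("রায়", "I-PER"), ("নন্দী", "E-PER"),
          ("প্রমুখ", "O"), ("ঢাকায়", "S-LOC")]
print(extract_entities(tagged))
```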
Bengali Corpus Class
- Stopwords and Punctuations

      from bnlp.corpus import stopwords, punctuations

      stopwords = stopwords()
      print(stopwords)
      print(punctuations)
- Remove stopwords from text

      from bnlp.corpus import stopwords
      from bnlp.corpus.util import remove_stopwords

      stopwords = stopwords()
      raw_text = 'আমি ভাত খাই।'
      result = remove_stopwords(raw_text, stopwords)
      print(result)
      # ['ভাত', 'খাই', '।']
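Conceptually, stopword removal is a tokenize-then-filter step. The sketch below reproduces the result shown above with a tiny invented stopword list; it is not the toolkit's remove_stopwords, and the tokenization pattern and the remove_stopwords_sketch name are assumptions.

```python
import re


def remove_stopwords_sketch(text, stopword_list):
    """Tokenize on whitespace and punctuation, then drop tokens found in the stopword list."""
    tokens = re.findall(r"[^\s।?!,;:]+|[।?!,;:]", text)
    stop = set(stopword_list)  # set membership keeps the filter fast
    return [token for token in tokens if token not in stop]


# Hypothetical three-word list; the list shipped in bnlp.corpus is much longer.
toy_stopwords = ["আমি", "সে", "ও"]
filtered = remove_stopwords_sketch("আমি ভাত খাই।", toy_stopwords)
print(filtered)
```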