Bengali Natural Language Processing (BNLP)
BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, and build neural models for Bengali NLP tasks.
Installation
- PyPI package installer (tested with Python 3.6 and 3.7)
pip install bnlp_toolkit
Pretrained Model
Download Link
Training Details
- All three models were trained on the Bengali Wikipedia dump dataset
- SentencePiece training vocabulary size = 50000
- Word2Vec and FastText word embedding dimension = 300
Tokenization
- Bengali SentencePiece Tokenization
- Tokenization using a trained model
from bnlp.sentencepiece_tokenizer import SP_Tokenizer
bsp = SP_Tokenizer()
model_path = "./model/bn_spm.model"
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)
- Training SentencePiece
from bnlp.sentencepiece_tokenizer import SP_Tokenizer
bsp = SP_Tokenizer(is_train=True)
data = "test.txt"
model_prefix = "test"
vocab_size = 5
bsp.train_bsp(data, model_prefix, vocab_size)
- Basic Tokenizer
from bnlp.basic_tokenizer import BasicTokenizer
basic_t = BasicTokenizer(False)
raw_text = "আমি বাংলায় গান গাই।"
tokens = basic_t.tokenize(raw_text)
print(tokens)
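To give a rough idea of what basic tokenization means for Bengali text, here is a minimal standard-library sketch that separates words from punctuation, including the danda (।). It is not BNLP's implementation; the name `simple_tokenize` and the punctuation set are assumptions made for illustration only.

```python
import re

# Hypothetical punctuation set for this sketch; BNLP may treat more characters as punctuation.
PUNCTUATION = "।,;:!?()\"'"

# A token is either one punctuation character or a run of characters
# that are neither whitespace nor punctuation.
_TOKEN_RE = re.compile(r"[{0}]|[^\s{0}]+".format(re.escape(PUNCTUATION)))


def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return _TOKEN_RE.findall(text)


print(simple_tokenize("আমি বাংলায় গান গাই।"))
```

The pattern deliberately avoids `\w`, because Bengali vowel signs are combining marks that `\w` does not match, which would split words in the middle.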
- NLTK Tokenization
from bnlp.nltk_tokenizer import NLTK_Tokenizer
text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
bnltk = NLTK_Tokenizer(text)
word_tokens = bnltk.word_tokenize()
sentence_tokens = bnltk.sentence_tokenize()
print(word_tokens)
print(sentence_tokens)
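Sentence tokenization for Bengali mostly comes down to recognizing the danda (।) alongside `?` and `!` as sentence terminators. The sketch below illustrates that idea with the standard library only; `split_sentences` is a hypothetical helper and does not reflect how `NLTK_Tokenizer` works internally.

```python
import re

# Characters assumed to end a sentence in this sketch.
SENTENCE_END = "।?!"

# Split on whitespace that directly follows a sentence terminator.
_SPLIT_RE = re.compile(r"(?<=[{}])\s+".format(re.escape(SENTENCE_END)))


def split_sentences(text):
    """Return the sentences in text, keeping each terminator attached."""
    return [part for part in _SPLIT_RE.split(text.strip()) if part]


text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
for sentence in split_sentences(text):
    print(sentence)
```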
Word Embedding
- Bengali Word2Vec
- Generate a vector using the pretrained model
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec()
model_path = "model/wiki.bn.text.model"
word = 'আমার'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)
- Find the most similar words using the pretrained model
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec()
model_path = "model/wiki.bn.text.model"
word = 'আমার'
similar = bwv.most_similar(model_path, word)
print(similar)
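Word-embedding similarity is commonly measured with cosine similarity between vectors. The sketch below shows that ranking on a toy, hand-made vocabulary; the vectors, `cosine_similarity`, and this local `most_similar` function are all hypothetical and are not taken from BNLP or from the pretrained model, which uses 300-dimensional vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=3):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "আমার": [0.9, 0.1, 0.0],
    "আমাদের": [0.8, 0.2, 0.1],
    "বাজার": [0.0, 0.2, 0.9],
}
print(most_similar("আমার", toy_vectors, topn=2))
```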
- Train Bengali Word2Vec with your own data
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec(is_train=True)
data_file = "test.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train_word2vec(data_file, model_name, vector_name)
- Bengali FastText
- Generate a vector using the pretrained model
from bnlp.bengali_fasttext import Bengali_Fasttext
bft = Bengali_Fasttext()
word = "গ্রাম"
model_path = "cc.bn.300.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)
- Train a Bengali FastText model
from bnlp.bengali_fasttext import Bengali_Fasttext
bft = Bengali_Fasttext(is_train=True)
data = "data.txt"
model_name = "saved_model.bin"
bft.train_fasttext(data, model_name)
Issue
- If you see the error
ModuleNotFoundError: No module named 'fasttext'
install fasttext with:
pip install fasttext
- If an nltk error occurs, run the following lines before importing bnlp:
import nltk
nltk.download("punkt")
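Both issues above come from an optional dependency that is not installed. As a small sketch, the standard-library check below reports which modules are missing before you import bnlp; `missing_modules` is a hypothetical helper written for this example, not part of BNLP.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["fasttext", "nltk"])
if missing:
    # For these two packages the pip name matches the module name.
    print("Install first: pip install " + " ".join(missing))
else:
    print("fasttext and nltk are importable")
```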
Developer Guide
Fork the repository, add or modify code, and send a pull request for merging.
Thanks To
Contributor List
Hashes for bnlp_toolkit-1.1.0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 68b525a55d5fa4512f27de83b915f2eb6c65ec2eff6ee44b22939d0f2b0140df |
MD5 | 514bf62b81ad10d45ef4895ecd5f2a9f |
BLAKE2b-256 | 9a0b933ff8fb2b64d7ed37a8a982fa7465a40e7746c554ba71cc30603a2f2053 |