BNLP is a natural language processing toolkit for the Bengali language
Bengali Natural Language Processing (BNLP)
BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, and build neural models for Bengali NLP tasks.
Installation
PyPI package installer (Python 3.6)
pip install bnlp_toolkit
Pretrained Model
Tokenization
Bengali SentencePiece Tokenization
- Tokenization using a trained model
from bnlp.sentencepiece_tokenizer import SP_Tokenizer
bsp = SP_Tokenizer()
model_path = "./model/bn_spm.model"
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)
- Training SentencePiece
from bnlp.sentencepiece_tokenizer import SP_Tokenizer
bsp = SP_Tokenizer(is_train=True)
data = "test.txt"
model_prefix = "test"
vocab_size = 5
bsp.train_bsp(data, model_prefix, vocab_size)
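The training call reads a plain-text corpus from the file named in data. Below is a minimal sketch of preparing such a file, assuming the trainer expects UTF-8 text with one sentence per line (the usual SentencePiece input format). The helper name write_corpus and the sample sentences are illustrative and are not part of BNLP.

```python
import os
import tempfile


def write_corpus(sentences, path):
    """Write one non-empty sentence per line as UTF-8 and return the line count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            sentence = sentence.strip()
            if sentence:  # skip blank entries
                f.write(sentence + "\n")
                count += 1
    return count


# Hypothetical sample data; replace with your own corpus.
sentences = ["আমি ভাত খাই।", "  ", "সে বাজারে যায়।"]
corpus_path = os.path.join(tempfile.mkdtemp(), "test.txt")
written = write_corpus(sentences, corpus_path)
print(written, "sentences written to", corpus_path)
```

The resulting path can then be passed as the data argument of the training call.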
NLTK Tokenization
from bnlp.nltk_tokenizer import NLTK_Tokenizer
text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
bnltk = NLTK_Tokenizer(text)
word_tokens = bnltk.word_tokenize()
sentence_tokens = bnltk.sentence_tokenize()
print(word_tokens)
print(sentence_tokens)
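To illustrate the general idea of sentence splitting for Bengali text, here is a standard-library sketch that splits after the danda (।), a question mark, or an exclamation mark. It is not how NLTK_Tokenizer is implemented, and the library's actual output may differ; the function name split_sentences is hypothetical.

```python
import re

# Sentence-ending marks: the Bengali danda plus '?' and '!'.
_SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_sentences(text):
    """Split text after a danda, '?' or '!' that is followed by whitespace."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
sentences = split_sentences(text)
print(sentences)
```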
Word Embedding
Bengali Word2Vec
Generate Vector Using Pretrained Model
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec()
model_path = "model/wiki.bn.text.model"
word = 'আমার'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)
Find Most Similar Word Using Pretrained Model
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec()
model_path = "model/wiki.bn.text.model"
word = 'আমার'
similar = bwv.most_similar(model_path, word)
print(similar)
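Word2Vec similarity queries are typically ranked by cosine similarity between word vectors. The sketch below shows that ranking on toy data using only the standard library. It assumes nothing about BNLP internals: the three-dimensional vectors are invented for illustration, and the functions here are hypothetical stand-ins, not the library's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "আমার": [0.9, 0.1, 0.0],
    "তোমার": [0.8, 0.2, 0.1],
    "বাজার": [0.0, 0.1, 0.9],
}
ranking = most_similar("আমার", toy_vectors)
print(ranking)
```

A real pretrained model holds vectors with hundreds of dimensions, but the ranking principle is the same.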
Train Bengali Word2Vec with Your Own Data
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec()
data_file = "test.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train_word2vec(data_file, model_name, vector_name)
Bengali FastText
Download Bengali FastText Pretrained Model From Here
Generate Vector Using Pretrained Model
from bnlp.bengali_fasttext import Bengali_Fasttext
bft = Bengali_Fasttext()
word = "গ্রাম"
model_path = "cc.bn.300.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)
Train Bengali FastText Model
from bnlp.bengali_fasttext import Bengali_Fasttext
bft = Bengali_Fasttext(is_train=True)
data = "data.txt"
model_name = "saved_model.bin"
bft.train_fasttext(data, model_name)
Issue
- If you see the error ModuleNotFoundError: No module named 'fasttext', install the missing package:
pip install fasttext
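To catch this problem before calling the FastText helpers, you can check whether the module is importable. This is a minimal sketch using the standard library's importlib.util.find_spec; the helper name is_module_available is hypothetical and not part of BNLP.

```python
import importlib.util


def is_module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


if is_module_available("fasttext"):
    print("fasttext is installed")
else:
    print("fasttext is missing; run: pip install fasttext")
```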
Developer Guide
- Fork the repository
- Add or modify code
- Send a pull request for merging
- We will verify it and include your name in the Contributor List
Contributor List