BNLP is a natural language processing toolkit for the Bengali language.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Bengali Natural Language Processing (BNLP)

BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, tag Bengali parts of speech, perform Bengali named entity recognition, and build neural models for Bengali NLP tasks.

NB: If you refer to this tool in a research paper, please let us know and we will include a link to the paper here.

Documentation

For full documentation, see the BNLP documentation.

Installation

Install with pip (tested on Python 3.5, 3.6, and 3.7)

pip install bnlp_toolkit

Pretrained Model

Download Link

Tokenization

Bengali SentencePiece Tokenization

Tokenization Using a Trained Model

from bnlp import SentencepieceTokenizer

bsp = SentencepieceTokenizer()
model_path = "./model/bn_spm.model"
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)
text2id = bsp.text2id(model_path, input_text)
print(text2id)
id2text = bsp.id2text(model_path, text2id)
print(id2text)

Training SentencePiece

from bnlp import SentencepieceTokenizer

bsp = SentencepieceTokenizer()
data = "test.txt"
model_prefix = "test"
vocab_size = 5
bsp.train(data, model_prefix, vocab_size)

Basic Tokenizer

from bnlp import BasicTokenizer
basic_tokenizer = BasicTokenizer()
raw_text = "আমি বাংলায় গান গাই।"
tokens = basic_tokenizer.tokenize(raw_text)
print(tokens)

# output: ["আমি", "বাংলায়", "গান", "গাই", "।"]
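
To make the expected output above easier to reason about, here is a minimal regex-based sketch of punctuation-aware tokenization. It is not BNLP's implementation: the pattern, the punctuation set, and the function name `simple_tokenize` are assumptions chosen for illustration.

```python
import re

# Hypothetical pattern (not taken from BNLP): a token is either a run of
# characters that are neither whitespace nor listed punctuation, or a single
# punctuation mark. "।" is the Bengali danda (full stop).
TOKEN_PATTERN = re.compile(r"[^\s।,;:!?()-]+|[।,;:!?()-]")


def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("আমি বাংলায় গান গাই।")
print(tokens)
```

The sketch only shows why the danda ends up as its own token; a real tokenizer has to handle many more cases, such as numbers, abbreviations, and mixed scripts.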

NLTK Tokenization

from bnlp import NLTKTokenizer

text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
bnltk = NLTKTokenizer()
word_tokens = bnltk.word_tokenize(text)
sentence_tokens = bnltk.sentence_tokenize(text)
print(word_tokens)
print(sentence_tokens)

# output
# word_token: ["আমি", "ভাত", "খাই", "।", "সে", "বাজারে", "যায়", "।", "তিনি", "কি", "সত্যিই", "ভালো", "মানুষ", "?"]
# sentence_token: ["আমি ভাত খাই।", "সে বাজারে যায়।", "তিনি কি সত্যিই ভালো মানুষ?"]

Word Embedding

Bengali Word2Vec

Generate Vector Using Pretrained Model

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'আমার'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)

Find Most Similar Word Using Pretrained Model

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'গ্রাম'
similar = bwv.most_similar(model_path, word)
print(similar)
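
Similarity queries over word vectors are commonly ranked by cosine similarity. The sketch below illustrates that idea in plain Python on made-up three-dimensional vectors. The helper names and the toy vectors are assumptions for illustration only and say nothing about how `most_similar` is implemented inside BNLP.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors, topn=3):
    """Return the topn (word, score) pairs most similar to the query word."""
    query_vector = vectors[query]
    scored = [
        (word, cosine_similarity(query_vector, vector))
        for word, vector in vectors.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors invented for this example; real models use far more dimensions.
toy_vectors = {
    "গ্রাম": [0.9, 0.1, 0.0],
    "শহর": [0.8, 0.2, 0.1],
    "নদী": [0.1, 0.9, 0.2],
    "আকাশ": [0.0, 0.2, 0.9],
}
ranked = rank_by_similarity("গ্রাম", toy_vectors, topn=2)
print(ranked)
```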

Train Bengali Word2Vec with your own data

from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
data_file = "sample.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train(data_file, model_name, vector_name)

Bengali FastText

To use FastText, install it manually with pip install fasttext==0.9.2

NB: This does not work on Windows; it only works on Linux.

Generate Vector Using Pretrained Model

from bnlp.embedding.fasttext import BengaliFasttext

bft = BengaliFasttext()
word = "গ্রাম"
model_path = "bengali_fasttext_wiki.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)

Train Bengali FastText Model

from bnlp.embedding.fasttext import BengaliFasttext

bft = BengaliFasttext()
data = "sample.txt"
model_name = "saved_model.bin"
epoch = 50
bft.train(data, model_name, epoch)

Bengali GloVe Word Vectors

We trained a GloVe model on Bengali data (wiki and news articles) and published the resulting Bengali GloVe word vectors.
You can download them and use them for your own machine learning purposes.

from bnlp import BengaliGlove
glove_path = "bn_glove.39M.100d.txt"
word = "গ্রাম"
bng = BengaliGlove()
res = bng.closest_word(glove_path, word)
print(res)
vec = bng.word2vec(glove_path, word)
print(vec)
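
GloVe text files conventionally store one word per line followed by its space-separated vector components. As a rough sketch of what a lookup over such a file involves, the example below parses that layout from an in-memory sample and ranks words by Euclidean distance. The function names, the distance measure, and the three-dimensional sample values are assumptions for illustration and are not BNLP's code.

```python
import io
import math


def load_glove_text(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def closest_words(word, vectors, topn=2):
    """Rank the other words by Euclidean distance to the given word."""
    target = vectors[word]
    distances = [
        (other, math.dist(target, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return [other for other, _ in sorted(distances, key=lambda p: p[1])[:topn]]


# Made-up sample standing in for a real vector file.
sample_file = io.StringIO(
    "গ্রাম 0.9 0.1 0.0\n"
    "শহর 0.8 0.2 0.1\n"
    "নদী 0.1 0.9 0.2\n"
)
vectors = load_glove_text(sample_file)
closest = closest_words("গ্রাম", vectors)
print(closest)
```

`math.dist` requires Python 3.8 or newer; on older interpreters the distance can be computed with `math.sqrt` over the squared differences.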

Bengali POS Tagging

Bengali CRF POS Tagging

Find POS Tag Using Pretrained Model

from bnlp import POS
bn_pos = POS()
model_path = "model/bn_pos.pkl"
text = "আমি ভাত খাই।"
res = bn_pos.tag(model_path, text)
print(res)
# [('আমি', 'PPR'), ('ভাত', 'NC'), ('খাই', 'VM'), ('।', 'PU')]

Train POS Tag Model

from bnlp import POS
bn_pos = POS()
model_name = "pos_model.pkl"
tagged_sentences = [[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'),('কার্পেট', 'NC'), ('৷', 'PU')], [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')]]

bn_pos.train(model_name, tagged_sentences)
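
Training data is rarely written out as Python literals by hand. The sketch below shows one way to build the list-of-(word, tag)-tuples structure used above from plain text. The `word/TAG` line format and the function name `parse_tagged_line` are hypothetical; BNLP does not prescribe them.

```python
def parse_tagged_line(line, sep="/"):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) tuples."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last separator, so a word that itself
        # contains the separator character is still handled.
        word, _, tag = item.rpartition(sep)
        if not word or not tag:
            raise ValueError("expected word%sTAG, got %r" % (sep, item))
        pairs.append((word, tag))
    return pairs


# Two sentences in the assumed 'word/TAG' format, one sentence per line.
raw_lines = [
    "আমি/PPR ভাত/NC খাই/VM ।/PU",
    "মাটি/NC থেকে/PP",
]
tagged_sentences = [parse_tagged_line(line) for line in raw_lines]
print(tagged_sentences)
```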

Bengali NER

Bengali CRF NER

Find NER Tag Using Pretrained Model

from bnlp import NER
bn_ner = NER()
model_path = "model/bn_ner.pkl"
text = "সে ঢাকায় থাকে।"
result = bn_ner.tag(model_path, text)
print(result)
# [('সে', 'O'), ('ঢাকায়', 'S-LOC'), ('থাকে', 'O')]
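
The tag prefixes in these examples (S-, B-, I-, E-, plus O) look like the BIOES scheme, where S marks a single-token entity and B, I, E mark the beginning, inside, and end of a multi-token one. Assuming that reading is correct, the following sketch groups a tagged result into entity spans. The function `extract_entities` is an illustrative helper and not part of BNLP.

```python
def extract_entities(tagged_tokens):
    """Group (token, tag) pairs into (entity_text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in tagged_tokens:
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            entities.append((token, label))
            current_tokens, current_type = [], None
        elif prefix == "B":
            # Start of a multi-token entity.
            current_tokens, current_type = [token], label
        elif prefix in ("I", "E") and current_type == label:
            current_tokens.append(token)
            if prefix == "E":
                entities.append((" ".join(current_tokens), label))
                current_tokens, current_type = [], None
        else:
            # "O" or an inconsistent tag sequence: drop any partial span.
            current_tokens, current_type = [], None
    return entities


result = [('সে', 'O'), ('ঢাকায়', 'S-LOC'), ('থাকে', 'O')]
entities = extract_entities(result)
print(entities)
```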

Train NER Tag Model

from bnlp import NER
bn_ner = NER()
model_name = "ner_model.pkl"
tagged_sentences = [[('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')]]

bn_ner.train(model_name, tagged_sentences)

Bengali Corpus Class

Stopwords and Punctuations

from bnlp.corpus import stopwords, punctuations

stopword_list = stopwords()  # separate name so the imported function is not shadowed
print(stopword_list)
print(punctuations)

Remove stopwords from Text

from bnlp.corpus import stopwords
from bnlp.corpus.util import remove_stopwords

stopword_list = stopwords()  # separate name so the imported function is not shadowed
raw_text = 'আমি ভাত খাই।'
result = remove_stopwords(raw_text, stopword_list)
print(result)
# ['ভাত', 'খাই', '।']
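
Conceptually, stopword removal is a tokenize-then-filter step. The sketch below shows that idea with a regex tokenizer and a tiny hand-written stopword list. Both are assumptions made for illustration; BNLP's own list and tokenization rules may differ.

```python
import re


def remove_stopwords_sketch(text, stopword_list):
    """Tokenize text and drop every token found in stopword_list."""
    # Hypothetical tokenizer: words, plus a few punctuation marks as tokens.
    tokens = re.findall(r"[^\s।,;:!?]+|[।,;:!?]", text)
    stopword_set = set(stopword_list)  # set membership checks are O(1)
    return [token for token in tokens if token not in stopword_set]


# Tiny illustrative list, not BNLP's real stopword list.
toy_stopwords = ["আমি", "সে", "ও"]
filtered = remove_stopwords_sketch("আমি ভাত খাই।", toy_stopwords)
print(filtered)
```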


Release history

- 4.4.0: Jan 6, 2026
- 4.3.0: Jan 6, 2026
- 4.2.0: Jan 6, 2026
- 4.1.0: Jan 6, 2026
- 4.0.3: Aug 20, 2024
- 4.0.2: Aug 12, 2024
- 4.0.1: May 4, 2024
- 4.0.0: Aug 14, 2023
- 4.0.0.dev4 (pre-release): Aug 14, 2023
- 4.0.0.dev3 (pre-release): Aug 14, 2023
- 4.0.0.dev2 (pre-release): Aug 12, 2023
- 4.0.0.dev0 (pre-release): Jul 16, 2023
- 3.3.2: Jul 10, 2023
- 3.3.1: Apr 29, 2023
- 3.3.0: Mar 7, 2023
- 3.3.0.dev0 (pre-release, yanked): Nov 17, 2022. Reason this release was yanked: this is not the official release. you can try latest release instead
- 3.2.0: Nov 7, 2022
- 3.1.2: Sep 11, 2021
- 3.1.1: Apr 29, 2021
- 3.1.0: Apr 24, 2021
- 3.1.0.dev0 (pre-release): Apr 24, 2021
- 3.0.0: Oct 20, 2020
- 3.0.0a1 (pre-release): Oct 20, 2020
- 3.0.0.dev3 (pre-release): Oct 19, 2020
- 3.0.0.dev2 (pre-release): Oct 19, 2020
- 3.0.0.dev1 (pre-release, this version): Oct 19, 2020
- 2.3: Jul 22, 2020
- 2.2: Mar 6, 2020
- 2.1: Feb 12, 2020
- 2.0.0: Dec 16, 2019
- 1.2.0: Dec 14, 2019
- 1.1.0: Dec 1, 2019
- 1.0.0: Nov 25, 2019

bnlp-toolkit 3.0.0.dev1
