BNLP is a natural language processing toolkit for Bengali Language

These details have not been verified by PyPI

Project links

Homepage

Project description

Bengali Natural Language Processing(BNLP)

BNLP is a natural language processing toolkit for Bengali Language. This tool will help you to tokenize Bengali text, Embedding Bengali words, Embedding Bengali Document, Bengali POS Tagging, Bengali Name Entity Recognition, Bangla Text Cleaning for Bengali NLP purposes.

Installation
- PIP installer
Pretrained Model
- Download Links
- Training Details
Tokenization
Word Embedding
Document Embedding
- Bengali Doc2Vec
Bengali POS Tagging
- Bengali CRF POS Tagging
  - Find Pos Tag Using Pretrained Model
  - Train POS Tag Model
Bengali NER
- Bengali CRF NER
  - Find NER Tag Using Pretrained Model
  - Train NER Tag Model
Bengali Corpus Class
- Stopwords and Punctuations
- Remove stopwords from Text
Bangla Text Cleaning
Contributor Guide

Installation

PIP installer

pip install bnlp_toolkit

or Upgrade

pip install -U bnlp_toolkit

Python: 3.6, 3.7, 3.8, 3.9
OS: Linux, Windows, Mac

Pretrained Model

Download Links

Large model published in huggingface model hub.

Training Details

Sentencepiece, Word2Vec, Fasttext, GloVe model trained with Bengali Wikipedia Dump Dataset
- Bengali Wiki Dump
SentencePiece Training Vocab Size=50000
Fasttext trained with total words = 20M, vocab size = 1171011, epoch=50, embedding dimension = 300 and the training loss = 0.318668,
Word2Vec word embedding dimension = 100, min_count=5, window=5, epochs=10
To Know Bengali GloVe Wordvector and training process follow this repository
Bengali CRF POS Tagging was training with nltr dataset with 80% accuracy.
Bengali CRF NER Tagging was train with this data with 90% accuracy.
Bengali news article doc2vec model train with 8 jsons of this corpus with epochs 40 vector size 100 min_count=2, total news article 400013
Bengali wikipedia doc2vec model trained with wikipedia dump datasets. Total articles 110448, epochs: 40, vector_size: 100, min_count: 2

Tokenization

Basic Tokenizer

from bnlp import BasicTokenizer

basic_tokenizer = BasicTokenizer()
raw_text = "আমি বাংলায় গান গাই।"
tokens = basic_tokenizer.tokenize(raw_text)
print(tokens)

# output: ["আমি", "বাংলায়", "গান", "গাই", "।"]

NLTK Tokenization

from bnlp import NLTKTokenizer

bnltk = NLTKTokenizer()
text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
word_tokens = bnltk.word_tokenize(text)
sentence_tokens = bnltk.sentence_tokenize(text)
print(word_tokens)
print(sentence_tokens)

# output
# word_token: ["আমি", "ভাত", "খাই", "।", "সে", "বাজারে", "যায়", "।", "তিনি", "কি", "সত্যিই", "ভালো", "মানুষ", "?"]
# sentence_token: ["আমি ভাত খাই।", "সে বাজারে যায়।", "তিনি কি সত্যিই ভালো মানুষ?"]

Bengali SentencePiece Tokenization

Tokenization using trained model

from bnlp import SentencepieceTokenizer

bsp = SentencepieceTokenizer()
model_path = "./model/bn_spm.model"
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)
text2id = bsp.text2id(model_path, input_text)
print(text2id)
id2text = bsp.id2text(model_path, text2id)
print(id2text)

Training SentencePiece

from bnlp import SentencepieceTokenizer

bsp = SentencepieceTokenizer()
data = "raw_text.txt"
model_prefix = "test"
vocab_size = 5
bsp.train(data, model_prefix, vocab_size)

Word Embedding

Bengali Word2Vec

Generate Vector using pretrain model

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'গ্রাম'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)

Find Most Similar Word Using Pretrained Model

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'গ্রাম'
similar = bwv.most_similar(model_path, word, topn=10)
print(similar)

Train Bengali Word2Vec with your own data

Train Bengali word2vec with your custom raw data or tokenized sentences.

Custom tokenized sentence format example:

sentences = [['আমি', 'ভাত', 'খাই', '।'], ['সে', 'বাজারে', 'যায়', '।']]

Check gensim word2vec api for details of training parameter

from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
data_file = "raw_text.txt" # or you can pass custom sentence tokens as list of list
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train(data_file, model_name, vector_name, epochs=5)

Pre-train or resume word2vec training with same or new corpus or tokenized sentences

Check gensim word2vec api for details of training parameter

from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()

trained_model_path = "mytrained_model.model"
data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.pretrain(trained_model_path, data_file, model_name, vector_name, epochs=5)

Bengali FastText

To use fasttext you need to install fasttext manually by pip install fasttext==0.9.2

NB: fasttext may not be worked in windows, it will only work in linux

Generate Vector Using Pretrained Model

from bnlp.embedding.fasttext import BengaliFasttext

bft = BengaliFasttext()
word = "গ্রাম"
model_path = "bengali_fasttext_wiki.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)

Train Bengali FastText Model

Check fasttext documentation for details of training parameter

from bnlp.embedding.fasttext import BengaliFasttext

bft = BengaliFasttext()
data = "raw_text.txt"
model_name = "saved_model.bin"
epoch = 50
bft.train(data, model_name, epoch)

Generate Vector File from Fasttext Binary Model

from bnlp.embedding.fasttext import BengaliFasttext

bft = BengaliFasttext()

model_path = "mymodel.bin"
out_vector_name = "myvector.txt"
bft.bin2vec(model_path, out_vector_name)

Bengali GloVe Word Vectors

We trained glove model with bengali data(wiki+news articles) and published bengali glove word vectors
You can download and use it on your different machine learning purposes.

from bnlp import BengaliGlove
glove_path = "bn_glove.39M.100d.txt"
word = "গ্রাম"
bng = BengaliGlove()
res = bng.closest_word(glove_path, word)
print(res)
vec = bng.word2vec(glove_path, word)
print(vec)

Document Embedding

Bengali Doc2Vec

Get document vector from input document

from bnlp import BengaliDoc2vec

bn_doc2vec = BengaliDoc2vec()

model_path = "bangla_news_article_doc2vec.model" # keep other .npy model files also in same folder
document = "রাষ্ট্রবিরোধী ও উসকানিমূলক বক্তব্য দেওয়ার অভিযোগে গাজীপুরের গাছা থানায় ডিজিটাল নিরাপত্তা আইনে করা মামলায় আলোচিত ‘শিশুবক্তা’ রফিকুল ইসলামের বিরুদ্ধে অভিযোগ গঠন করেছেন আদালত। ফলে মামলার আনুষ্ঠানিক বিচার শুরু হলো। আজ বুধবার (২৬ জানুয়ারি) ঢাকার সাইবার ট্রাইব্যুনালের বিচারক আসসামছ জগলুল হোসেন এ অভিযোগ গঠন করেন। এর আগে, রফিকুল ইসলামকে কারাগার থেকে আদালতে হাজির করা হয়। এরপর তাকে নির্দোষ দাবি করে তার আইনজীবী শোহেল মো. ফজলে রাব্বি অব্যাহতি চেয়ে আবেদন করেন। অন্যদিকে, রাষ্ট্রপক্ষ অভিযোগ গঠনের পক্ষে শুনানি করেন। উভয় পক্ষের শুনানি শেষে আদালত অব্যাহতির আবেদন খারিজ করে অভিযোগ গঠনের মাধ্যমে বিচার শুরুর আদেশ দেন। একইসঙ্গে সাক্ষ্যগ্রহণের জন্য আগামী ২২ ফেব্রুয়ারি দিন ধার্য করেন আদালত।"

vector = bn_doc2vec.get_document_vector(model_path, text)
print(vector)

Find document similarity between two document

from bnlp import BengaliDoc2vec

bn_doc2vec = BengaliDoc2vec()

model_path = "bangla_news_article_doc2vec.model" # keep other .npy model files also in same folder
article_1 = "রাষ্ট্রবিরোধী ও উসকানিমূলক বক্তব্য দেওয়ার অভিযোগে গাজীপুরের গাছা থানায় ডিজিটাল নিরাপত্তা আইনে করা মামলায় আলোচিত ‘শিশুবক্তা’ রফিকুল ইসলামের বিরুদ্ধে অভিযোগ গঠন করেছেন আদালত। ফলে মামলার আনুষ্ঠানিক বিচার শুরু হলো। আজ বুধবার (২৬ জানুয়ারি) ঢাকার সাইবার ট্রাইব্যুনালের বিচারক আসসামছ জগলুল হোসেন এ অভিযোগ গঠন করেন। এর আগে, রফিকুল ইসলামকে কারাগার থেকে আদালতে হাজির করা হয়। এরপর তাকে নির্দোষ দাবি করে তার আইনজীবী শোহেল মো. ফজলে রাব্বি অব্যাহতি চেয়ে আবেদন করেন। অন্যদিকে, রাষ্ট্রপক্ষ অভিযোগ গঠনের পক্ষে শুনানি করেন। উভয় পক্ষের শুনানি শেষে আদালত অব্যাহতির আবেদন খারিজ করে অভিযোগ গঠনের মাধ্যমে বিচার শুরুর আদেশ দেন। একইসঙ্গে সাক্ষ্যগ্রহণের জন্য আগামী ২২ ফেব্রুয়ারি দিন ধার্য করেন আদালত।"
article_2 = "রাষ্ট্রবিরোধী ও উসকানিমূলক বক্তব্য দেওয়ার অভিযোগে গাজীপুরের গাছা থানায় ডিজিটাল নিরাপত্তা আইনে করা মামলায় আলোচিত ‘শিশুবক্তা’ রফিকুল ইসলামের বিরুদ্ধে অভিযোগ গঠন করেছেন আদালত। ফলে মামলার আনুষ্ঠানিক বিচার শুরু হলো। আজ বুধবার (২৬ জানুয়ারি) ঢাকার সাইবার ট্রাইব্যুনালের বিচারক আসসামছ জগলুল হোসেন এ অভিযোগ গঠন করেন। এর আগে, রফিকুল ইসলামকে কারাগার থেকে আদালতে হাজির করা হয়। এরপর তাকে নির্দোষ দাবি করে তার আইনজীবী শোহেল মো. ফজলে রাব্বি অব্যাহতি চেয়ে আবেদন করেন। অন্যদিকে, রাষ্ট্রপক্ষ অভিযোগ গঠনের পক্ষে শুনানি করেন। উভয় পক্ষের শুনানি শেষে আদালত অব্যাহতির আবেদন খারিজ করে অভিযোগ গঠনের মাধ্যমে বিচার শুরুর আদেশ দেন। একইসঙ্গে সাক্ষ্যগ্রহণের জন্য আগামী ২২ ফেব্রুয়ারি দিন ধার্য করেন আদালত।"

similarity = bn_doc2vec.get_document_similarity(
  model_path,
  article_1,
  article_2
)
print(similarity)

Train doc2vec vector with custom text files

from bnlp import BengaliDoc2vec

bn_doc2vec = BengaliDoc2vec()

text_files = "path/myfiles"
checkpoint_path = "msc/logs"

bn_doc2vec.train_doc2vec(
  text_files,
  checkpoint_path=checkpoint_path,
  vector_size=100,
  min_count=2,
  epochs=10
)

# it will train doc2vec with your text files and save the train model in checkpoint_path

Bengali POS Tagging

Bengali CRF POS Tagging

Find Pos Tag Using Pretrained Model

from bnlp import POS
bn_pos = POS()
model_path = "model/bn_pos.pkl"
text = "আমি ভাত খাই।" # or you can pass ['আমি', 'ভাত', 'খাই', '।']
res = bn_pos.tag(model_path, text)
print(res)
# [('আমি', 'PPR'), ('ভাত', 'NC'), ('খাই', 'VM'), ('।', 'PU')]

Train POS Tag Model

from bnlp import POS
bn_pos = POS()
model_name = "pos_model.pkl"
train_data = [[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা',  'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'),('কার্পেট', 'NC'), ('৷', 'PU')], [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')]]

test_data = [[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'),('কার্পেট', 'NC'), ('৷', 'PU')], [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')]]

bn_pos.train(model_name, train_data, test_data)

Bengali NER

Bengali CRF NER

Find NER Tag Using Pretrained Model

from bnlp import NER
bn_ner = NER()
model_path = "model/bn_ner.pkl"
text = "সে ঢাকায় থাকে।" # or you can pass ['সে', 'ঢাকায়', 'থাকে', '।']
result = bn_ner.tag(model_path, text)
print(result)
# [('সে', 'O'), ('ঢাকায়', 'S-LOC'), ('থাকে', 'O')]

Train NER Tag Model

from bnlp import NER
bn_ner = NER()
model_name = "ner_model.pkl"
train_data = [[('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')]]

test_data = [[('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')], [('ত্রাণ', 'O'),('ও', 'O'),('সমাজকল্যাণ', 'O'),('সম্পাদক', 'S-PER'),('সুজিত', 'B-PER'),('রায়', 'I-PER'),('নন্দী', 'E-PER'),('প্রমুখ', 'O'),('সংবাদ', 'O'),('সম্মেলনে', 'O'),('উপস্থিত', 'O'),('ছিলেন', 'O')]]

bn_ner.train(model_name, train_data, test_data)

Bengali Corpus Class

Stopwords and Punctuations

from bnlp.corpus import stopwords, punctuations, letters, digits

print(stopwords)
print(punctuations)
print(letters)
print(digits)

Remove stopwords from Text

from bnlp.corpus import stopwords
from bnlp.corpus.util import remove_stopwords

raw_text = 'আমি ভাত খাই।'
result = remove_stopwords(raw_text, stopwords)
print(result)
# ['ভাত', 'খাই', '।']

Text Cleaning

We adopted different text cleaning formula, codes from clean-text and modified for Bangla. Now you can normalize and clean your text using the following methods.

from bnlp import CleanText

clean_text = CleanText(
   fix_unicode=True,
   unicode_norm=True,
   unicode_norm_form="NFKC",
   remove_url=False,
   remove_email=False,
   remove_emoji=False,
   remove_number=False,
   remove_digits=False,
   remove_punct=False,
   replace_with_url="<URL>",
   replace_with_email="<EMAIL>",
   replace_with_number="<NUMBER>",
   replace_with_digit="<DIGIT>",
   replace_with_punct = "<PUNC>"
)

input_text = "আমার সোনার বাংলা।"
clean_text = clean_text(input_text)
print(clean_text)

Contributor Guide

Check CONTRIBUTING.md page for details.

Thanks To

Semantics Lab

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

4.0.3

Aug 20, 2024

4.0.2

Aug 12, 2024

4.0.1

May 4, 2024

4.0.0

Aug 14, 2023

4.0.0.dev4 pre-release

Aug 14, 2023

4.0.0.dev3 pre-release

Aug 14, 2023

4.0.0.dev2 pre-release

Aug 12, 2023

4.0.0.dev0 pre-release

Jul 16, 2023

This version

3.3.2

Jul 10, 2023

3.3.1

Apr 29, 2023

3.3.0

Mar 7, 2023

3.3.0.dev0 pre-release yanked

Nov 17, 2022

Reason this release was yanked:

this is not the official release. you can try latest release instead

3.2.0

Nov 7, 2022

3.1.2

Sep 11, 2021

3.1.1

Apr 29, 2021

3.1.0

Apr 24, 2021

3.1.0.dev0 pre-release

Apr 24, 2021

3.0.0

Oct 20, 2020

3.0.0a1 pre-release

Oct 20, 2020

3.0.0.dev3 pre-release

Oct 19, 2020

3.0.0.dev2 pre-release

Oct 19, 2020

3.0.0.dev1 pre-release

Oct 19, 2020

2.3

Jul 22, 2020

2.2

Mar 6, 2020

2.1

Feb 12, 2020

2.0.0

Dec 16, 2019

1.2.0

Dec 14, 2019

1.1.0

Dec 1, 2019

1.0.0

Nov 25, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bnlp_toolkit-3.3.2.tar.gz (26.9 kB view details)

Uploaded Jul 10, 2023 Source

Built Distribution

bnlp_toolkit-3.3.2-py3-none-any.whl (23.1 kB view details)

Uploaded Jul 10, 2023 Python 3

File details

Details for the file bnlp_toolkit-3.3.2.tar.gz.

File metadata

Download URL: bnlp_toolkit-3.3.2.tar.gz
Upload date: Jul 10, 2023
Size: 26.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for bnlp_toolkit-3.3.2.tar.gz
Algorithm	Hash digest
SHA256	`673d98998b18abeed022550ade1e09980d577758b6ff3d21ab9225bfceaa70da`
MD5	`090ff8fbf77ab51d12896ad254cec3b6`
BLAKE2b-256	`6d5d02d78757698f2dfeffd3b26925ce779cf37bb2d8c29a178c1a77974b555f`

See more details on using hashes here.

File details

Details for the file bnlp_toolkit-3.3.2-py3-none-any.whl.

File metadata

Download URL: bnlp_toolkit-3.3.2-py3-none-any.whl
Upload date: Jul 10, 2023
Size: 23.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for bnlp_toolkit-3.3.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e14077e929ad92b347b59600349c97acb6790563dd85aef000b37148d55bf633`
MD5	`35d25e3db6d36beda89c42ab7c60680c`
BLAKE2b-256	`e8a88141e56d82944d3348c85cfb43f8bef967d8240705d14fd4684fbfb39692`

See more details on using hashes here.

bnlp-toolkit 3.3.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Bengali Natural Language Processing(BNLP)

Table of contents

Installation

PIP installer

Pretrained Model

Download Links

Training Details

Tokenization

Basic Tokenizer

NLTK Tokenization

Bengali SentencePiece Tokenization

Tokenization using trained model

Training SentencePiece

Word Embedding

Bengali Word2Vec

Generate Vector using pretrain model

Find Most Similar Word Using Pretrained Model

Train Bengali Word2Vec with your own data

Pre-train or resume word2vec training with same or new corpus or tokenized sentences

Bengali FastText

Generate Vector Using Pretrained Model

Train Bengali FastText Model

Generate Vector File from Fasttext Binary Model

Bengali GloVe Word Vectors

Document Embedding

Bengali Doc2Vec

Get document vector from input document

Find document similarity between two document

Train doc2vec vector with custom text files

Bengali POS Tagging

Bengali CRF POS Tagging

Find Pos Tag Using Pretrained Model

Train POS Tag Model

Bengali NER

Bengali CRF NER

Find NER Tag Using Pretrained Model

Train NER Tag Model

Bengali Corpus Class

Stopwords and Punctuations

Remove stopwords from Text

Text Cleaning

Contributor Guide

Thanks To

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes