Skip to main content

BNLP is a natural language processing toolkit for Bengali Language

Project description

Bengali Natural Language Processing(BNLP)

BNLP is a natural language processing toolkit for Bengali Language. This tool will help you to tokenize Bengali text, Embedding Bengali words, construct neural model for Bengali NLP purposes.

Installation

  • pypi package installer(python 3.6)

    pip install bnlp_toolkit

Pretrained Model

Tokenization

  • Bengali SentencePiece Tokenization

    • tokenization using trained model
      from bnlp.sentencepiece_tokenizer import SP_Tokenizer
      
      bsp = SP_Tokenizer()
      model_path = "./model/bn_spm.model"
      input_text = "আমি ভাত খাই। সে বাজারে যায়।"
      tokens = bsp.tokenize(model_path, input_text)
      print(tokens)
      
    • Training SentencePiece
      from bnlp.sentencepiece_tokenizer import SP_Tokenizer
      
      bsp = SP_Tokenizer(is_train=True)
      data = "test.txt"
      model_prefix = "test"
      vocab_size = 5
      bsp.train_bsp(data, model_prefix, vocab_size) 
      
  • NLTK Tokenization

from bnlp.nltk_tokenizer import NLTK_Tokenizer

text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
bnltk = NLTK_Tokenizer(text)
word_tokens = bnltk.word_tokenize()
sentence_tokens = bnltk.sentence_tokenize()
print(word_tokens)
print(sentence_tokens)

Word Embedding

  • Bengali Word2Vec

    • Generate Vector using pretrain model

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      
      bwv = Bengali_Word2Vec()
      model_path = "model/wiki.bn.text.model"
      word = 'আমার'
      vector = bwv.generate_word_vector(model_path, word)
      print(vector.shape)
      print(vector)
      
    • Find Most Similar Word Using Pretrained Model

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      
      bwv = Bengali_Word2Vec()
      model_path = "model/wiki.bn.text.model"
      word = 'আমার'
      similar = bwv.most_similar(model_path, word)
      print(similar)
      
    • Train Bengali Word2Vec with your own data

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      
      data_file = "test.txt"
      model_name = "test_model.model"
      vector_name = "test_vector.vector"
      bwv.train_word2vec(data_file, model_name, vector_name)
      
  • Bengali FastText

    • Download Bengali FastText Pretrained Model From Here

    • Generate Vector Using Pretrained Model

      from bnlp.bengali_fasttext import Bengali_Fasttext
      
      bft = Bengali_Fasttext()
      word = "গ্রাম"
      model_path = "cc.bn.300.bin"
      word_vector = bf.generate_word_vector(model_path, word)
      print(word_vector.shape)
      print(word_vector)
      
    • Train Bengali FastText Model

      from bnlp.bengali_fasttext import Bengali_Fasttext
      
      bft = Bengali_Fasttext(is_train=True)
      data = "data.txt"
      model_name = "saved_model.bin"
      bf.train_fasttext(data, model_name)
      

Issue

  • if ModuleNotFoundError: No module named 'fasttext' problem arise please do the next line

pip install fasttext

Developer Guide

  • Fork
  • add or modify
  • send pull request for merging
  • we will verify and include your name in Contributor List

Contributor List

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bnlp_toolkit-1.0.0.tar.gz (4.0 kB view hashes)

Uploaded Source

Built Distribution

bnlp_toolkit-1.0.0-py3-none-any.whl (5.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page