BNLP is a natural language processing toolkit for Bengali Language

Bengali Natural Language Processing (BNLP)

BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, and build neural models for Bengali NLP tasks.

Installation

  • PyPI package installer (tested on Python 3.6 and 3.7)

    pip install bnlp_toolkit

Pretrained Model

Download Link

Training Details

  • All three models were trained on the Bengali Wikipedia dump dataset
  • SentencePiece training vocabulary size = 50000
  • Word2Vec and FastText word embedding dimension = 300

Tokenization

  • Bengali SentencePiece Tokenization

    • Tokenization using a trained model
      from bnlp.sentencepiece_tokenizer import SP_Tokenizer
      
      bsp = SP_Tokenizer()
      model_path = "./model/bn_spm.model"
      input_text = "আমি ভাত খাই। সে বাজারে যায়।"
      tokens = bsp.tokenize(model_path, input_text)
      print(tokens)
      
    • Training SentencePiece
      from bnlp.sentencepiece_tokenizer import SP_Tokenizer
      
      bsp = SP_Tokenizer(is_train=True)
      data = "test.txt"
      model_prefix = "test"
      vocab_size = 5
      bsp.train_bsp(data, model_prefix, vocab_size) 
      
  • Basic Tokenizer

    from bnlp.basic_tokenizer import BasicTokenizer
    basic_t = BasicTokenizer(False)
    raw_text = "আমি বাংলায় গান গাই।"
    tokens = basic_t.tokenize(raw_text)
    print(tokens)
    
  • NLTK Tokenization

    from bnlp.nltk_tokenizer import NLTK_Tokenizer
    
    text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
    bnltk = NLTK_Tokenizer(text)
    word_tokens = bnltk.word_tokenize()
    sentence_tokens = bnltk.sentence_tokenize()
    print(word_tokens)
    print(sentence_tokens)
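
To make the word and sentence tokenization steps above more concrete, here is a minimal rule-based sketch in plain Python. It is not BNLP's implementation: the helper names `simple_word_tokenize` and `simple_sentence_tokenize` and the punctuation set are assumptions chosen for illustration. It only shows the general idea of treating the Bengali danda ("।") as both a punctuation token and a sentence boundary.

```python
import re

# Hypothetical helpers for illustration only; they are not part of BNLP.
PUNCTUATION = "।!?,;:()\"'-"
PUNCT_PATTERN = re.compile("([" + re.escape(PUNCTUATION) + "])")
SENTENCE_END = re.compile(r"(?<=[।!?])\s+")


def simple_word_tokenize(text):
    """Split on whitespace and make each punctuation mark its own token."""
    spaced = PUNCT_PATTERN.sub(r" \1 ", text)
    return spaced.split()


def simple_sentence_tokenize(text):
    """Split after a danda, question mark, or exclamation mark."""
    parts = SENTENCE_END.split(text.strip())
    return [part for part in parts if part]


text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
word_tokens = simple_word_tokenize(text)
sentence_tokens = simple_sentence_tokenize(text)
print(word_tokens)
print(sentence_tokens)
```

A rule-based splitter like this needs no model file, but it will not handle abbreviations or subword units; the trained SentencePiece model is likely the better choice for that.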
    

Word Embedding

  • Bengali Word2Vec

    • Generate Vector Using Pretrained Model

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      
      bwv = Bengali_Word2Vec()
      model_path = "model/wiki.bn.text.model"
      word = 'আমার'
      vector = bwv.generate_word_vector(model_path, word)
      print(vector.shape)
      print(vector)
      
    • Find Most Similar Word Using Pretrained Model

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      
      bwv = Bengali_Word2Vec()
      model_path = "model/wiki.bn.text.model"
      word = 'আমার'
      similar = bwv.most_similar(model_path, word)
      print(similar)
      
    • Train Bengali Word2Vec with your own data

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      bwv = Bengali_Word2Vec(is_train=True)
      data_file = "test.txt"
      model_name = "test_model.model"
      vector_name = "test_vector.vector"
      bwv.train_word2vec(data_file, model_name, vector_name)
      
  • Bengali FastText

    • Generate Vector Using Pretrained Model

      from bnlp.bengali_fasttext import Bengali_Fasttext
      
      bft = Bengali_Fasttext()
      word = "গ্রাম"
      model_path = "cc.bn.300.bin"
      word_vector = bft.generate_word_vector(model_path, word)
      print(word_vector.shape)
      print(word_vector)
      
    • Train Bengali FastText Model

      from bnlp.bengali_fasttext import Bengali_Fasttext
      
      bft = Bengali_Fasttext(is_train=True)
      data = "data.txt"
      model_name = "saved_model.bin"
      bft.train_fasttext(data, model_name)
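
The "most similar word" lookup shown earlier in this section is typically based on cosine similarity between word vectors (this is how gensim ranks neighbors; whether BNLP adds anything on top is not stated here). The sketch below uses made-up 3-dimensional toy vectors instead of the 300-dimensional pretrained ones, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    query_vector = vectors[query]
    scored = [
        (word, cosine_similarity(query_vector, vector))
        for word, vector in vectors.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors invented for this example; real embeddings come from a model.
toy_vectors = {
    "আমার": [1.0, 0.0, 0.0],
    "তোমার": [0.9, 0.1, 0.0],
    "গ্রাম": [0.0, 1.0, 0.0],
}

ranked = most_similar("আমার", toy_vectors)
print(ranked)
```

With real embeddings the same ranking runs over the whole vocabulary, which is why libraries precompute normalized vectors rather than looping in Python.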
      

Issue

  • If you get "ModuleNotFoundError: No module named 'fasttext'", install fasttext:

    pip install fasttext

  • If an NLTK error occurs, run the following lines before importing bnlp:

    import nltk
    nltk.download("punkt")
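
Both issues above come down to an optional dependency not being available. As a small sketch (the helper name `missing_modules` is hypothetical and not part of BNLP), the standard library can report which modules are missing before you import bnlp:

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the two issues described above.
missing = missing_modules(["fasttext", "nltk"])
for name in missing:
    print(f"Missing dependency, try: pip install {name}")
```

This only checks that the packages are importable; it does not check whether the NLTK "punkt" data has been downloaded.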

Developer Guide

  • Fork the repository
  • Add or modify code
  • Send a pull request for merging

Thanks To

Contributor List
