BNLP is a natural language processing toolkit for Bengali Language

Bengali Natural Language Processing (BNLP)

BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, and build neural models for Bengali NLP tasks.

Installation

  • PyPI package installer (Python 3.6)

    pip install bnlp_toolkit

Pretrained Model

Tokenization

  • Bengali SentencePiece Tokenization

    • Tokenization Using Trained Model
      from bnlp.sentencepiece_tokenizer import SP_Tokenizer
      
      bsp = SP_Tokenizer()
      model_path = "./model/bn_spm.model"
      input_text = "আমি ভাত খাই। সে বাজারে যায়।"
      tokens = bsp.tokenize(model_path, input_text)
      print(tokens)
      
    • Training SentencePiece
      from bnlp.sentencepiece_tokenizer import SP_Tokenizer
      
      bsp = SP_Tokenizer(is_train=True)
      data = "test.txt"
      model_prefix = "test"
      vocab_size = 5
      bsp.train_bsp(data, model_prefix, vocab_size) 
      
  • NLTK Tokenization

    from bnlp.nltk_tokenizer import NLTK_Tokenizer

    text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
    bnltk = NLTK_Tokenizer(text)
    word_tokens = bnltk.word_tokenize()
    sentence_tokens = bnltk.sentence_tokenize()
    print(word_tokens)
    print(sentence_tokens)
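
To make the expected shape of word and sentence tokens concrete, here is a minimal standard-library sketch of splitting Bengali text on the danda (।), question mark, and exclamation mark. It is only an illustration and not how NLTK_Tokenizer is implemented; the helper names split_sentences and split_words are hypothetical and are not part of bnlp.

```python
import re

# Illustrative only: NOT bnlp's implementation.
# Split on whitespace that follows a danda, question mark, or exclamation mark.
_SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_sentences(text):
    """Return the sentences in `text`, keeping the closing punctuation."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def split_words(sentence):
    """Return words and punctuation marks as separate tokens."""
    return re.findall(r"[^\s।?!,]+|[।?!,]", sentence)


text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
sentences = split_sentences(text)
words = split_words(sentences[0])
print(sentences)
print(words)
```

A real tokenizer handles abbreviations, numbers, and quotation marks, which this sketch ignores.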

Word Embedding

  • Bengali Word2Vec

    • Generate Vector Using Pretrained Model

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      
      bwv = Bengali_Word2Vec()
      model_path = "model/wiki.bn.text.model"
      word = 'আমার'
      vector = bwv.generate_word_vector(model_path, word)
      print(vector.shape)
      print(vector)
      
    • Find Most Similar Word Using Pretrained Model

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      
      bwv = Bengali_Word2Vec()
      model_path = "model/wiki.bn.text.model"
      word = 'আমার'
      similar = bwv.most_similar(model_path, word)
      print(similar)
      
    • Train Bengali Word2Vec with your own data

      from bnlp.bengali_word2vec import Bengali_Word2Vec
      
      # Create the wrapper object, as in the examples above
      bwv = Bengali_Word2Vec()
      data_file = "test.txt"
      model_name = "test_model.model"
      vector_name = "test_vector.vector"
      bwv.train_word2vec(data_file, model_name, vector_name)
      
  • Bengali FastText

    • Download Bengali FastText Pretrained Model From Here

    • Generate Vector Using Pretrained Model

      from bnlp.bengali_fasttext import Bengali_Fasttext
      
      bft = Bengali_Fasttext()
      word = "গ্রাম"
      model_path = "cc.bn.300.bin"
      word_vector = bft.generate_word_vector(model_path, word)
      print(word_vector.shape)
      print(word_vector)
      
    • Train Bengali FastText Model

      from bnlp.bengali_fasttext import Bengali_Fasttext
      
      bft = Bengali_Fasttext(is_train=True)
      data = "data.txt"
      model_name = "saved_model.bin"
      bft.train_fasttext(data, model_name)
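
Similar-word lookups over word embeddings are conventionally ranked by cosine similarity between vectors. The sketch below shows that ranking on a tiny hand-made table, as a way to read the output of a similarity query. The three-dimensional vectors are invented for illustration (real models use hundreds of dimensions), and the functions here are hypothetical helpers, not bnlp's API; whether bnlp ranks results exactly this way is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=2):
    """Rank every other word in `vectors` by similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, made up for this example only.
toy_vectors = {
    "গ্রাম": [0.9, 0.1, 0.0],
    "শহর": [0.8, 0.3, 0.1],
    "ভাত": [0.0, 0.2, 0.9],
}

ranked = most_similar("গ্রাম", toy_vectors)
print(ranked)
```

Each result is a (word, score) pair, with scores closer to 1.0 meaning the vectors point in nearly the same direction.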
      

Issue

  • If you see "ModuleNotFoundError: No module named 'fasttext'", install fasttext with the following command:

pip install fasttext
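
If you would rather detect the missing dependency before using the FastText features, a small check like the following can help. It is a sketch using only the standard library; has_module is a hypothetical helper, not something bnlp provides.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


if not has_module("fasttext"):
    print("fasttext is not installed; run: pip install fasttext")
```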

Developer Guide

  • Fork the repository
  • Add or modify code
  • Send a pull request for merging
  • We will review it and add your name to the Contributor List

Contributor List
