
A feature extractor for Bangla natural language processing

Project description

Bangla Feature Extractor (BFE)

BFE is a feature extractor for Bangla natural language processing.

Current Features

  1. CountVectorizer
  2. TfIdf
  3. Word Embedding

Installation

pip install bfe

Example

1. CountVectorizer

  • Fit and Transform
  • Transform
  • Get Wordset
  • Transform
  • Get Wordset

Fit and Transform

from bfe import CountVectorizer
ct = CountVectorizer()
X = ct.fit_transform(X)  # X is a list of word features
# Output: the count-vectorized matrix of the given features

Transform

from bfe import CountVectorizer
ct = CountVectorizer()
get_mat = ct.transform("রাহাত")
# Output: the count-vectorized matrix for the given word

Get Wordset

from bfe import CountVectorizer
ct = CountVectorizer()
ct.get_wordSet()
# Output: the raw wordset used to train the model
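Under the hood, a count vectorizer builds a vocabulary (the wordset) and counts how often each vocabulary word occurs in each document. A minimal pure-Python sketch of the idea (not bfe's actual implementation; the helper name count_vectorize is made up for illustration):

```python
# Conceptual sketch of count vectorization (not bfe's internals).
def count_vectorize(docs):
    # Vocabulary: every distinct whitespace token, in sorted order.
    wordset = sorted({w for d in docs for w in d.split()})
    # One row per document, one column per vocabulary word.
    matrix = [[d.split().count(w) for w in wordset] for d in docs]
    return wordset, matrix

docs = ["কাওছার আহমেদ", "শুভ হাইদার"]
wordset, X = count_vectorize(docs)
# Each row marks which vocabulary words that document contains.
```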

2. TfIdf

  • Fit and Transform
  • Transform
  • Coefficients
  • Transform
  • Coefficients

Fit and Transform

from bfe import TfIdfVectorizer
k = TfIdfVectorizer()
doc = ["কাওছার আহমেদ", "শুভ হাইদার"]
matrix1 = k.fit_transform(doc)
print(matrix1)

'''
Output: 
[[0.150515 0.150515 0.       0.      ]
 [0.       0.       0.150515 0.150515]]
'''

Transform

from bfe import TfIdfVectorizer
k = TfIdfVectorizer()
doc = ["আহমেদ সুমন", "কাওছার করিম"]
matrix2 = k.transform(doc)
print(matrix2)

'''
Output: 
[[0.150515 0.       0.       0.      ]
 [0.       0.150515 0.       0.      ]]
'''

Coefficients

from bfe import TfIdfVectorizer
k = TfIdfVectorizer()
doc = ["কাওছার আহমেদ", "শুভ হাইদার"]
k.fit_transform(doc)
wordset, idf = k.coefficients()
print(wordset)
#Output: ['আহমেদ', 'কাওছার', 'হাইদার', 'শুভ']

print(idf)
'''
Output: 
{'আহমেদ': 0.3010299956639812, 'কাওছার': 0.3010299956639812, 'হাইদার': 0.3010299956639812, 'শুভ': 0.3010299956639812}
'''
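The numbers in the example outputs are consistent with tf computed as term count over document length and idf as log10(N/df): each word above appears once in a two-word document (tf = 0.5) and in one of two documents (idf = log10 2 ≈ 0.30103), giving 0.5 × 0.30103 ≈ 0.150515. A hedged sketch of that formula (not necessarily bfe's exact code):

```python
import math

# Sketch of tf-idf with tf = count / doc length, idf = log10(N / df).
def tfidf(docs):
    tokenized = [d.split() for d in docs]
    wordset = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = {w: sum(w in doc for doc in tokenized) for w in wordset}
    idf = {w: math.log10(n_docs / df[w]) for w in wordset}
    return [[doc.count(w) / len(doc) * idf[w] for w in wordset]
            for doc in tokenized]

matrix = tfidf(["কাওছার আহমেদ", "শুভ হাইদার"])
# matrix[0][0] ≈ 0.150515, matching the example output above.
```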

3. Word Embedding

  • Word2Vec

    • Training
    • Get Word Vector
    • Get Similarity
    • Get n Similar Words
    • Get Middle Word
    • Get Odd Words
    • Get Similarity Plot

Training

from bfe import BN_Word2Vec
#Training Against Sentences
w2v = BN_Word2Vec(sentences=[['আমার', 'প্রিয়', 'জন্মভূমি'], ['বাংলা', 'আমার', 'মাতৃভাষা']])
w2v.train_Word2Vec()

#Training Against one Dataset
w2v = BN_Word2Vec(corpus_file="path to data or txt file")
w2v.train_Word2Vec()

#Training Against Multiple Dataset
'''
    path
      ->data
        ->1.txt
        ->2.txt
        ->3.txt
'''
w2v = BN_Word2Vec(corpus_path="path/data")
w2v.train_Word2Vec(epochs=25)

After training completes, the model "w2v_model" and its supporting vector files are saved to the current directory.

If you use a pretrained model, specify its name when initializing BN_Word2Vec(); otherwise no model_name is needed.
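For the corpus_path layout sketched above, the preprocessing amounts to reading every .txt file in the directory and splitting each line into tokens. A minimal sketch of that step in plain Python (load_corpus is a hypothetical helper, not part of bfe):

```python
from pathlib import Path

# Hypothetical helper: read every .txt file under corpus_path and
# turn each non-empty line into one whitespace-tokenized sentence.
def load_corpus(corpus_path):
    sentences = []
    for txt in sorted(Path(corpus_path).glob("*.txt")):
        for line in txt.read_text(encoding="utf-8").splitlines():
            if line.strip():
                sentences.append(line.split())
    return sentences
```

The resulting list of token lists has the same shape as the sentences argument shown in the first training example.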

Get Word Vector

from bfe import BN_Word2Vec 
w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_wordVector('আমার')

Get Similarity

from bfe import BN_Word2Vec 
w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_similarity('ঢাকা', 'রাজধানী')

#Output: 67.457879
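Word2vec similarity scores like the one above are conventionally the cosine of the two word vectors (here apparently scaled to a percentage). A sketch with made-up toy vectors, since the real vectors depend on the trained model:

```python
import math

# Cosine similarity between two dense vectors.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d vectors, invented for illustration only.
vec_dhaka = [0.2, 0.7, 0.1]
vec_capital = [0.25, 0.6, 0.2]
sim = cosine_similarity(vec_dhaka, vec_capital)  # close to 1 for similar words
```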

Get n Similar Words

from bfe import BN_Word2Vec 
w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_n_similarWord(['পদ্মা'], n=10)
#Output: 
'''
[('সেতুর', 0.5857524275779724),
 ('মুলফৎগঞ্জ', 0.5773632526397705),
 ('মহানন্দা', 0.5634652376174927),
 ("'পদ্মা", 0.5617109537124634),
 ('গোমতী', 0.5605217218399048),
 ('পদ্মার', 0.5547558069229126),
 ('তুলসীগঙ্গা', 0.5274507999420166),
 ('নদীর', 0.5232067704200745),
 ('সেতু', 0.5225246548652649),
 ('সেতুতে', 0.5192927718162537)]
'''

Get Middle Word

Get the probability distribution of the center word given a list of context words.

from bfe import BN_Word2Vec 
w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_outputWord(['ঢাকায়', 'মৃত্যু'], n=2)

#Output:  [("হয়েছে।',", 0.05880642), ('শ্রমিকের', 0.05639163)]

Get Odd Words

Return the word that least matches the others in the given list.

from bfe import BN_Word2Vec 
w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_oddWords(['চাল', 'ডাল', 'চিনি', 'আকাশ'])

#Output: 'আকাশ' 
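A common way to pick the odd word out is to average the word vectors and return the word whose vector is least similar to that mean. A sketch with toy 2-d vectors (invented for illustration; the real method uses the trained embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Toy 2-d embeddings, invented for illustration: the three food words
# cluster together, while 'আকাশ' (sky) points elsewhere.
vectors = {
    'চাল':  [0.90, 0.10],
    'ডাল':  [0.80, 0.20],
    'চিনি': [0.85, 0.15],
    'আকাশ': [0.10, 0.90],
}

def odd_one_out(words):
    # Mean of the word vectors, then the word least similar to it.
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(2)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

odd_one_out(['চাল', 'ডাল', 'চিনি', 'আকাশ'])  # 'আকাশ'
```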

Get Similarity Plot

Creates a bar plot of similar words with their probabilities.

from bfe import BN_Word2Vec 
w2v = BN_Word2Vec(model_name='give the model name here')
w2v.get_oddWords(['চাল', 'ডাল', 'চিনি', 'আকাশ'])

  • FastText

    • Training
    • Get Word Vector
    • Get Similarity
    • Get n Similar Words
    • Get Middle Word
    • Get Odd Words
    • Get Similarity Plot

Training

from bfe import BN_FastText
#Training Against Sentences
ft = BN_FastText(sentences=[['আমার', 'প্রিয়', 'জন্মভূমি'], ['বাংলা', 'আমার', 'মাতৃভাষা']])
ft.train_fasttext()

#Training Against one Dataset
ft = BN_FastText(corpus_file="path to data or txt file")
ft.train_fasttext()

#Training Against Multiple Dataset
'''
    path
      ->data
        ->1.txt
        ->2.txt
        ->3.txt
'''
ft = BN_FastText(corpus_path="path/data")
ft.train_fasttext(epochs=25)

After training completes, the model "ft_model" and its supporting vector files are saved to the current directory.

If you use a pretrained model, specify its name when initializing BN_FastText(); otherwise no model_name is needed.

Get Word Vector

from bfe import BN_FastText 
ft = BN_FastText(model_name='give the model name here')
ft.get_wordVector('আমার')

Get Similarity

from bfe import BN_FastText 
ft = BN_FastText(model_name='give the model name here')
ft.get_similarity('ঢাকা', 'রাজধানী')

#Output: 70.56821120

Get n Similar Words

from bfe import BN_FastText
ft = BN_FastText(model_name='give the model name here')
ft.get_n_similarWord(['পদ্মা'], n=10)
#Output: 
'''
[('পদ্মায়', 0.8103810548782349),
 ('পদ্মার', 0.794012725353241),
 ('পদ্মানদীর', 0.7747839689254761),
 ('পদ্মা-মেঘনার', 0.7573559284210205),
 ('পদ্মা.', 0.7470568418502808),
 ('‘পদ্মা', 0.7413997650146484),
 ('পদ্মাসেতুর', 0.716225266456604),
 ('পদ্ম', 0.7154797315597534),
 ('পদ্মহেম', 0.6881639361381531),
 ('পদ্মাবত', 0.6682782173156738)]
'''

Get Odd Words

Return the word that least matches the others in the given list.

from bfe import BN_FastText
ft = BN_FastText(model_name='give the model name here')
ft.get_oddWords(['চাল', 'ডাল', 'চিনি', 'আকাশ'])

#Output: 'আকাশ' 

Get Similarity Plot

Creates a bar plot of similar words with their probabilities.

from bfe import BN_FastText 
ft = BN_FastText(model_name='give the model name here')
ft.get_oddWords(['চাল', 'ডাল', 'চিনি', 'আকাশ'])

Project details


Release history

This version

0.1

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distribution

test_mark-0.1-py3-none-any.whl (4.5 kB, Python 3)

File details

Details for the file test_mark-0.1-py3-none-any.whl.

File metadata

  • Download URL: test_mark-0.1-py3-none-any.whl
  • Size: 4.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/42.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.6.5

File hashes

Hashes for test_mark-0.1-py3-none-any.whl

  • SHA256: 1e7f75691ea94a0530489efd8b7cccb9dcda4f2864b66fd1122b6417413888d0
  • MD5: 0ab201eb75343413d3595f8a03b850db
  • BLAKE2b-256: 6218322898fdd5ac334632316ee87cf2672075f5c53ba54ea883627817ca8fc6

