A package for analyzing entities present in Bengali sentences
Bengali (Bangla) Analyzer
This package provides an analyzer for the Bengali (Bangla) language. It takes a dictionary-entry-based approach combined with grammatical sanitizing. The implementation distinguishes the following types of entities:
- Prefix: A prefix (উপসর্গ) is a leading substring of a word that generally has no meaning of its own but gives a new meaning to the meaningful word it is attached to.
- Suffix: A suffix (অনুসর্গ) is a trailing substring of a word that generally has no meaning of its own but gives a new meaning to the meaningful word it is attached to.
- Verb: Any word or group of words that describes the action, state, or occurrence of an event in a Bengali sentence, for example খাওয়া, চলে যাওয়া, etc.
- Non-verb: Any remaining part of speech that is not recognized as a verb in a Bengali sentence, for example আমি, খুব, তারা, বাংলা, বয়স, etc.
- Special entity: As the name suggests, a special entity can be a special date (for example, ২১ শে ফেব্রুয়ারী, which is International Mother Language Day), a person (for example, ড. মুহাম্মদ জাফর ইকবাল, a famous science fiction author and well-known professor), an institute (for example, জাবি, the abbreviation of Jahangirnagar University), or any other multi-word single entity.
- Composite word: Our structural definition of a composite Bengali word is: Prefix (optional) + one or more stand-alone Bengali words + Suffix (optional).
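The structural definition of a composite word can be illustrated with a small dictionary-lookup sketch. This is not the package's implementation: the function names `split_words` and `decompose`, and the tiny prefix, suffix, and word sets, are hypothetical and exist only to show the Prefix + stand-alone words + Suffix pattern. The sketch also slices by Unicode code point, whereas a real analyzer would likely work on graphemes.

```python
from typing import List, Optional, Set, Tuple

def split_words(text: str, words: Set[str]) -> Optional[List[str]]:
    """Split text into one or more dictionary words, or return None."""
    if not text:
        return None
    if text in words:
        return [text]
    for cut in range(1, len(text)):
        head = text[:cut]
        if head in words:
            rest = split_words(text[cut:], words)
            if rest is not None:
                return [head] + rest
    return None

def decompose(
    token: str, prefixes: Set[str], suffixes: Set[str], words: Set[str]
) -> Optional[Tuple[str, List[str], str]]:
    """Try Prefix (optional) + stand-alone words + Suffix (optional)."""
    prefix_options = [""] + sorted(p for p in prefixes if token.startswith(p))
    suffix_options = [""] + sorted(s for s in suffixes if token.endswith(s))
    for prefix in prefix_options:
        for suffix in suffix_options:
            core = token[len(prefix):len(token) - len(suffix)]
            parts = split_words(core, words)
            if parts is not None:
                return prefix, parts, suffix
    return None

# Toy data (hypothetical, not the package's dictionaries).
WORLD = "বিশ্ব"
SCHOOL = "বিদ্যালয়"
GENITIVE = "ের"
NEGATION = "অ"
PREFIXES = {NEGATION}
SUFFIXES = {GENITIVE}
WORDS = {WORLD, SCHOOL}

result = decompose(WORLD + SCHOOL + GENITIVE, PREFIXES, SUFFIXES, WORDS)
print(result)
```

Here the token is found to consist of no prefix, two stand-alone words, and one suffix. A token that cannot be matched against the toy dictionaries yields `None`.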
Our package analyzes the given text and returns the word configurations of the text according to the definitions we have chosen for the entities that can be present in a Bengali sentence.
Installation
The package can be installed in any standard Python environment. We highly recommend installing Conda first and then running the following command to install the package:
pip install bengalianalyzer
Local Environment
This is the environment in which the package was developed:
- Python: 3.9.0
- OS: Manjaro 21.2.3 Qonos
- Kernel: x86_64 Linux 5.15.21-1-MANJARO
- Conda: 4.10.3
- CPU: 11th Gen Intel Core i7-11370H @ 8x 4.8GHz
- GPU: NVIDIA GeForce RTX 3060 Laptop GPU
- RAM: 15694MiB
Usage
Import the module first.
from bengalianlyzer import BengaliAnalyzer
Then pass the text for analysis.
bl = BengaliAnalyzer()
bl.analyze_sentence('জগন্নাথ বিশ্ববিদ্যালয়ের (জবি) লাইফ অ্যান্ড আর্থ সায়েন্স অনুষদের নতুন ডিন হিসেবে দায়িত্ব পেয়েছেন বিশ্ববিদ্যালয়ের উদ্ভিদবিজ্ঞান বিভাগের অধ্যাপক ড. মো. মনিরুজ্জামান খন্দকার।')
Response
The response is tokens (data type: dictionary), which has each token as its key. The following fields will be present for each token:
tokens[token] = {
"Global_Index": [int or (int, int)],
"Punctuation_Flag": bool,
"Numeric":
{
"Digit": int,
"Literal": str,
"Weight": str,
"Suffix": [str]
},
"Verb":
{
"Parent_Verb": str,
"Tense": str,
"Emphasis": [str],
"Form": str,
"Person": str,
"Related_Indices": [[int or (int,int)]]
},
"Non_Verb": str,
"Composite_Word":
{
"Suffix": str,
"Prefix": str,
"Stand_Alone_Words": {str},
},
"Special_Entity":
{
"Definition": str,
"Related_Indices": [[int or (int,int)]]
}
}
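As a rough illustration of how a response in this shape might be consumed, the sketch below reports which entity fields are filled for each token. It runs on a hand-written sample dictionary, not on real package output. The helper name `entity_types` is hypothetical, and the sketch assumes that fields which do not apply to a token are either absent or empty, which the description above does not confirm.

```python
from typing import Any, Dict, List

# Entity fields taken from the documented response structure.
ENTITY_KEYS = ("Numeric", "Verb", "Non_Verb", "Composite_Word", "Special_Entity")

def entity_types(info: Dict[str, Any]) -> List[str]:
    """Return the entity fields that carry a non-empty value for one token."""
    if info.get("Punctuation_Flag"):
        return ["Punctuation"]
    # Assumption: fields that do not apply are missing or empty.
    return [key for key in ENTITY_KEYS if info.get(key)]

# Hand-written sample in the documented shape; the values are illustrative
# only and are not real output of the package.
sample_tokens = {
    "খুব": {
        "Global_Index": [0],
        "Punctuation_Flag": False,
        "Numeric": {},
        "Verb": {},
        "Non_Verb": "খুব",
        "Composite_Word": {},
        "Special_Entity": {},
    },
    "।": {
        "Global_Index": [1],
        "Punctuation_Flag": True,
        "Numeric": {},
        "Verb": {},
        "Non_Verb": "",
        "Composite_Word": {},
        "Special_Entity": {},
    },
}

summary = {token: entity_types(info) for token, info in sample_tokens.items()}
for token, kinds in summary.items():
    print(token, kinds)
```

If the package represents unused fields differently (for example with `None` or placeholder strings), the truthiness check in `entity_types` would need to be adjusted.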
Team
This tool is developed by people with diverse affiliations. The following are the people behind this effort.
Name | Email | Affiliation |
---|---|---|
Shahriar Elahi Dhruvo | shahriardhruvo119@gmail.com | Shahjalal University of Science & Technology, Sylhet |
Md. Rakibul Hasan Ranak | rakibulhasanranak1@gmail.com | Shahjalal University of Science & Technology, Sylhet |
Mahfuzur Rahman Emon | emon.swe.sust@gmail.com | Shahjalal University of Science & Technology, Sylhet |
Fazle Rabbi Rakib | fazlerakib009@gmail.com | Shahjalal University of Science & Technology, Sylhet |
Souhardya Saha Dip | souhardyasaha98@gmail.com | Shahjalal University of Science & Technology, Sylhet |
Asif Shahriyar Shushmit | sushmit@ieee.org | Bengali.ai |
A. A. Noman Ansary | showrav.ansary.bd@gmail.com | Govt. Laboratory High School, Rajshahi |
Special thanks to Md Nazmuddoha Ansary for implementing an open-source, general-purpose Indic grapheme parser, which is a required dependency of this tool.
In collaboration with: Bengali.ai, SUST, RGLHS, Jahangirnagar University