
A package for analyzing the entities present in Bengali sentences


Bengali (Bangla) Analyzer

This package provides an analyzer for the Bengali (Bangla) language. It takes a dictionary-based approach combined with grammatical sanitization. The implementation distinguishes the following six types of entities:

  • Prefix: A prefix (উপসর্গ) is a leading substring of a word that generally has no meaning of its own but, when attached to a word that does, gives that word a new meaning.

  • Suffix: A suffix (অনুসর্গ) is a trailing substring of a word that generally has no meaning of its own but, when attached to a word that does, gives that word a new meaning.

  • Verb: Any word or group of words that describes an action, state, or occurrence in a Bengali sentence. For example: খাওয়া, চলে যাওয়া, etc.

  • Non-verb: Any other part of speech in a Bengali sentence that is not recognized as a verb. For example: আমি, খুব, তারা, বাংলা, বয়স, etc.

  • Special entity: As the name suggests, a special entity can be a special date (for example, ২১ শে ফেব্রুয়ারী, which is International Mother Language Day), a person (for example, ড. মুহাম্মদ জাফর ইকবাল, a famous science fiction author and well-known professor), an institute (for example, জাবি, the abbreviation of Jahangirnagar University), or any other multi-word expression that acts as a single entity.

  • Composite word: Structurally, we define a composite Bengali word as prefix (optional) + one or more stand-alone Bengali words + suffix (optional).

The package analyzes the given text and returns the configuration of each word according to the entity definitions above.
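The structural definition of a composite word (optional prefix, one or more stand-alone words, optional suffix) can be illustrated with a small dictionary lookup. This is only a sketch under stated assumptions, not the package's actual algorithm: the function names `segment` and `split_composite` and the tiny word lists are hypothetical and chosen for illustration.

```python
def segment(core, lexicon):
    """Split `core` into a list of lexicon words, or return None if impossible."""
    if not core:
        return []
    # Try the longest candidate first, backtracking when the rest cannot be split.
    for end in range(len(core), 0, -1):
        head = core[:end]
        if head in lexicon:
            rest = segment(core[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


def split_composite(word, prefixes, suffixes, lexicon):
    """Return (prefix, stand_alone_words, suffix), or None if no split exists.

    The prefix and suffix are optional, so the empty string is always tried.
    """
    prefix_options = [""] + sorted(p for p in prefixes if word.startswith(p))
    suffix_options = [""] + sorted(s for s in suffixes if word.endswith(s))
    for prefix in prefix_options:
        for suffix in suffix_options:
            core = word[len(prefix):len(word) - len(suffix)]
            words = segment(core, lexicon)
            if words:  # at least one stand-alone word is required
                return prefix, words, suffix
    return None


# Hypothetical miniature dictionaries, for illustration only.
lexicon = {"অর্থ", "নীতি", "বিদ"}
suffixes = {"দের"}
print(split_composite("অর্থনীতিবিদদের", set(), suffixes, lexicon))
```

The returned triple mirrors the `prefix`, `stand_alone_words`, and `suffix` fields that the package reports for composite words.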

Installation

The package can be installed in any Python environment. We highly recommend installing Conda first and then running the following command:

pip install bengalianalyzer

Alternatively, install from source:

  1. Download the whole repo as a compressed file.
  2. Extract the compressed file.
  3. Open a terminal at the base directory of the extracted folder.
  4. Type pip install . and hit enter.
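After either installation route, one way to confirm that the package is visible to the current interpreter is to query its distribution metadata. The sketch below uses only the Python standard library. The helper name `installed_version` is hypothetical, and the distribution name is the one used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print(installed_version("bengalianalyzer"))
```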

Local Environment

This is the environment in which the package was developed:

Python: 3.9.0
OS: Manjaro 21.2.3 Qonos
Kernel: x86_64 Linux 5.15.21-1-MANJARO
Conda: 4.10.3
CPU: 11th Gen Intel Core i7-11370H @ 8x 4.8GHz
RAM: 15694MiB

Usage

Import the module first.

from bengali_analyzer import bengali_analyzer as bla

Then pass the text for analysis. The module exposes four entry points:

  • For full entity analysis:
tokens = bla.analyze_sentence(text)
  • For Parts of Speech tagging:
tokens = bla.analyze_pos(text)
  • For lemma parsing:
tokens = bla.lemmatize_sentence(text)
  • For PoS vectorization:
tokens = bla.vectorize_pos(text)
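When all four views of the same text are needed, the calls can be bundled in one helper. The sketch below assumes only what is shown above: each function takes a single string. The helper name `run_all` and the keys of the returned dictionary are hypothetical. The analyzer module is passed in as an argument so the helper can also be exercised with a stand-in object.

```python
def run_all(analyzer, text):
    """Call the four documented entry points on `text` and collect the results."""
    return {
        "analysis": analyzer.analyze_sentence(text),
        "pos": analyzer.analyze_pos(text),
        "lemmas": analyzer.lemmatize_sentence(text),
        "pos_vectors": analyzer.vectorize_pos(text),
    }


# With the package installed, the helper would be used like this:
#
#   from bengali_analyzer import bengali_analyzer as bla
#   results = run_all(bla, "অর্থনীতিবিদদের ভালো কাজ দেয়া উচিত।")
#   print(results["lemmas"])
```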

Response

  • For analyze_sentence(text) :

Structure:

token = {
            "numeric_flag": bool,
            "global_index": [(int,int)],
            "punctuation_flag": bool,
            "numeric": {
                "digit": int,
                "literal": str,
                "weight": str,
                "suffix": [str]
            },
            "verb": {
                "parent_verb": str,
                "emphasizer": str,
                "contentative_verb": bool,
                "tp": str,
                "non_finite": bool,
                "form": str,
                "related_indices": [(int,int)],
            },
            "pronoun": {
                "pronoun_tag": str,
                "number_tag": str,
                "honorificity": str,
                "case": str,
                "proximity": str,
                "encoding": str,
            },
            "pos": [str],
            "composite_flag": bool,
            "composite_word": {
                "suffix": str,
                "prefix": str,
                "stand_alone_words": set(),
            },
            "special_entity": {
                "definition": str,
                "related_indices": [(int,int)],
                "space_indices": set(),
                "suffix": str,
            },
        }

Example:

text: "অর্থনীতিবিদদের ভালো কাজ দেয়া উচিত।"

response:
{
    'অর্থনীতিবিদদের': {
        'numeric_flag': False,
        'global_index': [[0, 13]],
        'pos': ['বিশেষ্য'],
        'composite_flag': False,
        'composite_word': {
            'suffix': 'দের',
            'stand_alone_words': ['অর্থ', 'নীতি', 'বিদ']
        }
    },
    'ভালো': {
        'numeric_flag': False,
        'global_index': [[15, 18]],
        'verb': {
            'parent_verb': ['ভালা'],
            'tp': [{'tense': 'bo', 'person': 'tm'}, {'tense': 'sb', 'person': 'tm'}],
            'related_indices': [[15, 18]],
            'language_form': 'standard'
        },
        'pos': ['বিশেষ্য', 'বিশেষণ', 'অব্যয়'],
        'composite_flag': False
    },
    'কাজ': {
        'numeric_flag': False,
        'global_index': [[20, 22]],
        'pos': ['বিশেষ্য'],
        'composite_flag': False
    },
    'দেয়া': {
        'numeric_flag': False,
        'global_index': [[24, 27]],
        'verb': {
            'parent_verb': ['দেয়ানো'],
            'tp': [{'tense': 'bo', 'person': 'tu'}],
            'related_indices': [[24, 27]],
            'language_form': 'standard'
        },
        'pos': ['বিশেষ্য'],
        'composite_flag': False
    },
    'উচিত': {
        'numeric_flag': False,
        'global_index': [[29, 32]],
        'pos': ['বিশেষণ'],
        'composite_flag': False
    },
    '।': {
        'numeric_flag': False,
        'global_index': [[33, 33]],
        'punctuation_flag': True,
        'pos': ['punc'],
        'composite_flag': False
    }
}
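A response of this shape can be walked with ordinary dictionary operations. The sketch below rests on two observations from the example rather than on documented guarantees: optional sections such as "verb" appear only on tokens where they apply, and the end index in "global_index" looks inclusive (the final punctuation mark spans [33, 33]). The helper names and the trimmed `sample` dictionary are hypothetical.

```python
# Trimmed copy of the example response above (two tokens, a few fields each).
sample = {
    "ভালো": {
        "global_index": [[15, 18]],
        "verb": {"parent_verb": ["ভালা"]},
        "pos": ["বিশেষ্য", "বিশেষণ", "অব্যয়"],
    },
    "কাজ": {"global_index": [[20, 22]], "pos": ["বিশেষ্য"]},
}


def token_spans(response):
    """Yield (token, start, end) for every span listed under "global_index"."""
    for token, info in response.items():
        for start, end in info.get("global_index", []):
            yield token, start, end


def verb_candidates(response):
    """Map each token that has a "verb" section to its list of parent verbs."""
    return {
        token: info["verb"].get("parent_verb", [])
        for token, info in response.items()
        if "verb" in info
    }


def slice_span(text, start, end):
    """Extract a span from the text, treating `end` as inclusive (an assumption)."""
    return text[start:end + 1]


for token, start, end in token_spans(sample):
    print(token, start, end)
print(verb_candidates(sample))
```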
  • For analyze_pos(text): The outer dictionary has one key per token, and each inner dictionary holds the list of PoS tags for that token.

Structure:

dict(str:dict(str:list()))

Example:

text: "আমার ফ্যামিলি প্রবলেমের কারণে কুয়েটে পড়াই হবে না কিন্তু টিউশন করে সাপোর্ট লাগবে এজন্য চুয়েট চুজ করা ভুল হবে? খেতে থাকবই খেতে থাকব"

response:
{'আমার': {'pos': ['pronoun']},
'ফযামিলি': {'pos': ['undefined']},
'প্রবলেমের': {'pos': ['undefined']},
'কারণে': {'pos': ['undefined']},
'কুয়েটে': {'pos': ['undefined']},
'পড়াই': {'pos': ['verb']},
'হবে': {'pos': ['verb']},
'না': {'pos': ['conjunction', 'noun']},
'কিন্তু': {'pos': ['conjunction']},
'টিউশন': {'pos': ['undefined']},
'করে': {'pos': ['verb']},
'সাপোর্ট': {'pos': ['undefined']},
'লাগবে': {'pos': ['verb']},
'এজন্য': {'pos': ['conjunction', 'adverb']},
'চুয়েট': {'pos': ['undefined']},
'চুজ': {'pos': ['undefined']},
'করা': {'pos': ['verb']},
'ভুল': {'pos': ['adjective', 'noun']},
'?': {'pos': ['punctuation']},
'খেতে থাকবই': {'pos': ['contentative_verb']},
'খেতে থাকব': {'pos': ['contentative_verb']}}
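Because each token maps to a list of tags, inverting the response gives a quick view of which tokens share a tag, for example to list everything tagged 'undefined'. This is a minimal sketch: the helper name `tokens_by_tag` is hypothetical and the `sample` dictionary is a trimmed copy of the response above.

```python
from collections import defaultdict

# Trimmed copy of the example response above.
sample = {
    "আমার": {"pos": ["pronoun"]},
    "টিউশন": {"pos": ["undefined"]},
    "হবে": {"pos": ["verb"]},
    "না": {"pos": ["conjunction", "noun"]},
    "করে": {"pos": ["verb"]},
}


def tokens_by_tag(pos_response):
    """Invert {token: {"pos": [tags]}} into {tag: [tokens]}, keeping token order."""
    grouped = defaultdict(list)
    for token, info in pos_response.items():
        for tag in info.get("pos", []):
            grouped[tag].append(token)
    return dict(grouped)


grouped = tokens_by_tag(sample)
print(grouped.get("verb", []))
print(grouped.get("undefined", []))
```

A token with several candidate tags, such as 'না', appears under each of them.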
  • For lemmatize_sentence(text):

Structure:

list(list())

Example:

text: "অর্থনীতিবিদদের ভালো কাজ দেয়া উচিত।"
response: ['অর্থনীতিবিদ', 'ভালা/ভালো', 'কাজ', 'দেয়ানো', 'উচিত', '।']
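In the example, one entry ('ভালা/ভালো') appears to pack two candidate lemmas into a single string separated by a slash. Assuming that reading is correct, which the source does not state explicitly, the candidates can be separated as follows. The helper name `lemma_candidates` is hypothetical.

```python
def lemma_candidates(lemmas):
    """Split each entry on "/" so every token maps to a list of candidate lemmas."""
    return [entry.split("/") for entry in lemmas]


# Two entries taken from the example response above.
print(lemma_candidates(["কাজ", "ভালা/ভালো"]))
```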
  • For vectorize_pos(text):

Structure:

dict(str:list(list()))

Example:

text: "ঢাকা অর্থনৈতিক রাজধানী।"
response:
{'ঢাকা': [[[4, 185, 3, 3, False]], [1, None, None], [0, None, None], [5, None, None]],
 'অর্থনৈতিক': [[0, None, None]],
 'রাজধানী': [[1, None, None]],
 '।': [[6, None, None]]}


Team

This tool is developed by people with diverse affiliations. The following are the people behind this effort.

Name | Email | Affiliation
Shahriar Elahi Dhruvo | shahriardhruvo119@gmail.com | Shahjalal University of Science & Technology, Sylhet
Md. Rakibul Hasan | rakibulhasanranak1@gmail.com | Shahjalal University of Science & Technology, Sylhet
Mahfuzur Rahman Emon | emon.swe.sust@gmail.com | Shahjalal University of Science & Technology, Sylhet
Fazle Rabbi Rakib | fazlerakib009@gmail.com | Shahjalal University of Science & Technology, Sylhet
Souhardya Saha Dip | souhardyasaha98@gmail.com | Shahjalal University of Science & Technology, Sylhet
Dr. Farig Yousuf Sadeque | farigsadeque@gmail.com | BRAC University, Dhaka
Mohammad Mamun Or Rashid | mamunbd@juniv.edu | Jahangirnagar University, Dhaka
Asif Shahriyar Shushmit | sushmit@ieee.org | Bengali.ai
A. A. Noman Ansary | showrav.ansary.bd@gmail.com | BRAC University, Dhaka
Sazia Mehnaz | sayma.iict@gmail.com | Bengali.ai

Special thanks to Md Nazmuddoha Ansary for implementing an open-source, general-purpose Indic grapheme parser and a Bengali Unicode normalizer, both of which are required dependencies of this tool.

In collaboration with: Bengali.ai, SUST, Jahangirnagar University, BRAC University
