Toolkits for text processing and augmentation for Bangla NLP

Bangla NLP Toolkit

Created by A F M Mahfuzul Kabir
mahfuzulkabir.com
https://www.linkedin.com/in/mahfuzulkabir

Installation

Install the 'csebuetnlp normalizer' first with:

pip install git+https://github.com/csebuetnlp/normalizer

Then install the package with:

pip install banglanlptoolkit

Introduction

This package contains several toolkits for Bangla NLP text processing and augmentation. The available tools are listed below.

  • Bangla Text Normalizer
  • Bangla Punctuation Generator
  • Bangla Text Augmentation

Documentation

Thank you very much for using this package. I maintain it on my own, so I may not always be available to fix issues right away. If you do run into a problem, feel free to let me know and I will fix it as soon as I can.

Bangla Text Normalizer

Bangla text normalization is a known problem in language processing: raw Bangla text has to be converted into a consistent, machine-readable form. Unicode normalization converts all characters of a text string to the same Unicode form and removes unwanted characters. The csebuetnlp normalizer is used with models such as BanglaBERT and BanglaT5.

The package uses two normalization toolkits for Bangla text processing. The Unicode normalizer is taken from here. The other normalizer is used specifically for the BanglaT5 translation module and is taken from here.
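To make the idea of Unicode normalization concrete, here is a minimal sketch using only Python's standard `unicodedata` module. It is an illustration of the general technique, not the implementation used by this package or by the csebuetnlp normalizer; the helper name `nfc` is made up for the example.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC (canonical composition) form of the text."""
    return unicodedata.normalize("NFC", text)


# The Bangla vowel sign O can be stored as one code point or as two.
composed = "\u0995\u09cb"          # KA + vowel sign O (U+09CB)
decomposed = "\u0995\u09c7\u09be"  # KA + vowel sign E (U+09C7) + vowel sign AA (U+09BE)

# The two strings look identical on screen but differ code point by code point.
print(composed == decomposed)            # False
# After normalization both use the same representation.
print(nfc(composed) == nfc(decomposed))  # True
```

Without this step, two visually identical words can end up as different tokens, which is why normalization is usually applied before tokenization.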

Bangla Punctuation Generator

Until a few months ago, good punctuation generation models for Bangla were scarce. With the development of Bangla AI models, very good punctuation generation models are now available for our language as well.

The package uses an open-source punctuation generation model from this Kaggle dataset. I currently host this model on my Hugging Face account so that it can be used without a token. You can replace it with any model you like.

Bangla Text Augmentation

The package uses three kinds of text augmentation techniques.

  • Bangla Token Replacement
  • Back Translation
  • Bangla Paraphrasing

The token replacement method uses a fill-mask model: it masks random tokens in a sentence and replaces them with the model's predictions. The package uses the BanglishBERT Generator model by CSEBUETNLP for this task. The model can be found here.
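The mask-and-replace idea can be sketched as follows. This is a hypothetical illustration, not the package's own code: `replace_random_token`, `dummy_fill_mask` and the `[MASK]` placeholder are names assumed for the example, and a fixed stand-in predictor is used so the sketch runs without downloading a model.

```python
import random
from typing import Callable, Optional

MASK = "[MASK]"  # assumed mask placeholder; real models define their own mask token


def replace_random_token(sentence: str,
                         fill_mask: Callable[[str], str],
                         rng: Optional[random.Random] = None) -> str:
    """Mask one randomly chosen token and let fill_mask predict a replacement."""
    rng = rng or random.Random()
    tokens = sentence.split()
    if not tokens:
        return sentence
    idx = rng.randrange(len(tokens))
    masked = tokens[:idx] + [MASK] + tokens[idx + 1:]
    # fill_mask receives the masked sentence and returns the predicted token.
    tokens[idx] = fill_mask(" ".join(masked))
    return " ".join(tokens)


def dummy_fill_mask(masked_sentence: str) -> str:
    """Stand-in predictor that always returns the same word."""
    return "বই"


print(replace_random_token("আমি ভাত খাই", dummy_fill_mask, random.Random(0)))
```

In practice the `fill_mask` callable would wrap a real masked language model, for example through the Hugging Face `transformers` fill-mask pipeline.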

The back translation method translates sentences from Bangla to English and then back to Bangla. The package uses the bn-en and en-bn BanglaT5 models by CSEBUETNLP for this task. The models can be found here: bn2en, en2bn.
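The round trip can be sketched in a few lines. The function `back_translate` and the toy lookup tables below are assumptions made for illustration only; in real use the two callables would wrap the bn-en and en-bn translation models.

```python
from typing import Callable

Translator = Callable[[str], str]


def back_translate(sentence: str, bn_to_en: Translator, en_to_bn: Translator) -> str:
    """Translate Bangla -> English -> Bangla and return the round-tripped text."""
    english = bn_to_en(sentence)
    return en_to_bn(english)


# Toy lookup tables standing in for the two translation models.
BN_EN = {"আমি ভাত খাই": "I eat rice"}
EN_BN = {"I eat rice": "আমি ভাত খাচ্ছি"}

augmented = back_translate("আমি ভাত খাই", BN_EN.__getitem__, EN_BN.__getitem__)
print(augmented)
```

The augmentation comes from the fact that the second translation rarely reproduces the original wording exactly, so the output is a variant of the input sentence with roughly the same meaning.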

The paraphrasing toolkit uses the BanglaT5 Bangla paraphrase model by CSEBUETNLP. The model can be found here.

The package supports both online and offline augmentation. Offline augmentation generates a new dataframe of augmented texts from the original dataframe, which can be kept in a variable or saved to a file for later use. Offline augmentation can be faster because it makes better use of processing power (GPU parallelism), but having to save the augmented data every once in a while can get a bit annoying. Many people prefer online augmentation, which means augmenting the data 'on the fly' inside a predefined custom dataset class. This augments sentences during training or inference, without the hassle of saving the data separately.
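The difference between the two modes can be sketched with plain Python. Everything here is hypothetical: `shuffle_words` is a toy stand-in for a real augmenter, a list of strings stands in for the dataframe, and `OnlineAugmentedDataset` is not a class from this package.

```python
import random
from typing import Callable, List


def shuffle_words(text: str) -> str:
    """Toy augmentation: shuffle the word order (stand-in for a real augmenter)."""
    words = text.split()
    random.shuffle(words)
    return " ".join(words)


texts = ["আমি ভাত খাই", "সে স্কুলে যায়"]

# Offline: augment everything once and keep (or save) the result for later use.
offline_augmented = [shuffle_words(t) for t in texts]


class OnlineAugmentedDataset:
    """Dataset-style class that augments each sample when it is requested."""

    def __init__(self, texts: List[str], transform: Callable[[str], str]):
        self.texts = texts
        self.transform = transform

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, idx: int) -> str:
        # Augmentation happens on the fly; the original texts are not modified.
        return self.transform(self.texts[idx])


dataset = OnlineAugmentedDataset(texts, shuffle_words)
print(len(offline_augmented), len(dataset), dataset[0])
```

Because the class only defines `__len__` and `__getitem__`, the same shape works as a map-style dataset in common training frameworks, and each epoch can see a different augmented version of the same sentence.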

From version 1.1.5, I'm happy to introduce online augmentation techniques in this package. The design was inspired by the same technique in torchvision.transforms, meaning you can stack several augmentation techniques with a compose class. You can also write your own custom augmentation or transform classes and use them with compose.
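The stacking pattern can be sketched as below, modelled on the way `torchvision.transforms.Compose` chains callables. The classes `Compose`, `CollapseSpaces` and `RandomWordDrop` are hypothetical names written for this example; the package's own class names and signatures may differ.

```python
import random
from typing import Callable, Optional, Sequence


class Compose:
    """Apply a sequence of text transforms in order."""

    def __init__(self, transforms: Sequence[Callable[[str], str]]):
        self.transforms = list(transforms)

    def __call__(self, text: str) -> str:
        for transform in self.transforms:
            text = transform(text)
        return text


class CollapseSpaces:
    """Custom transform: squeeze runs of whitespace into single spaces."""

    def __call__(self, text: str) -> str:
        return " ".join(text.split())


class RandomWordDrop:
    """Custom transform: drop each word with probability p, keeping at least one."""

    def __init__(self, p: float = 0.1, seed: Optional[int] = None):
        self.p = p
        self.rng = random.Random(seed)

    def __call__(self, text: str) -> str:
        words = text.split()
        kept = [w for w in words if self.rng.random() >= self.p]
        return " ".join(kept or words[:1])


pipeline = Compose([CollapseSpaces(), RandomWordDrop(p=0.0)])
print(pipeline("আমি   ভাত  খাই"))  # with p=0.0 nothing is dropped
```

Any object with a `__call__(text) -> text` method fits into the chain, which is what makes it easy to mix built-in and custom transforms.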

Inspired by

If you use this package, please don't forget to cite the links and papers mentioned.
