Zero-dependency Python library for Bangla NLP text preprocessing
bangla-text-toolkit
Zero-dependency Python library for Bangla (Bengali) NLP text preprocessing.
Built in public as a 12-week engineering roadmap. Each day adds one well-tested component to a growing pipeline — from raw Unicode to fixed-length token embeddings.
Components
| Module | Class | What it does |
|---|---|---|
| `normalizer.py` | `BanglaTextNormalizer` | Unicode NFC, ZWJ/ZWNJ, whitespace, punctuation, digit normalisation |
| `cleaner.py` | `BanglaTextCleaner` | Strip URLs, HTML, emails, emojis, digits, mentions, hashtags |
| `pipeline.py` | `Pipeline` | Chainable step runner that composes any callables |
| `tokenizer.py` | `BanglaTokenizer` | Word and sentence tokenisation with Bangla-aware regex |
| `stopwords.py` | — | 150+ curated Bangla stopwords with filter helpers |
| `romanization.py` | `BanglaRomanizer` | Rule-based Bangla → Roman transliteration |
| `stemmer.py` | `BanglaStemmer` | Suffix-stripping stemmer (plurals, case markers, tense suffixes) |
| `vectorizer.py` | `BanglaVectorizer` | TF-IDF vectorizer for pre-tokenised Bangla text |
| `keyword_extractor.py` | `BanglaKeywordExtractor` | Top-k keyword extraction per document using TF-IDF scores |
| `sequence_labeler.py` | `BanglaSequenceLabeler` | Rule-based token labelling (NUM, PUNCT, STOP, WORD + custom rules) |
| `embedder.py` | `BanglaEmbedder` | Character n-gram hashing embeddings: fixed-length vectors for any token |
Installation
pip install bangla-text-toolkit
Or install from source:
git clone https://github.com/Mouly22/bangla-text-toolkit.git
cd bangla-text-toolkit
pip install -e ".[dev]"
Quick start
from bangla_text_toolkit import (
    BanglaTextNormalizer,
    BanglaStemmer,
    BanglaVectorizer,
    BanglaKeywordExtractor,
    BanglaSequenceLabeler,
    BanglaEmbedder,
)
from bangla_text_toolkit.tokenizer import BanglaTokenizer
tok = BanglaTokenizer()
tokens = tok.tokenize("আমি বাংলায় গান গাই")
# Label tokens
labeler = BanglaSequenceLabeler()
print(labeler.label(tokens))
# -> [('আমি', 'STOP'), ('বাংলায়', 'STOP'), ('গান', 'WORD'), ('গাই', 'WORD')]
# Embed document as a fixed-length vector
emb = BanglaEmbedder(dim=64)
doc_vec = emb.embed_document(tokens)
print(len(doc_vec)) # 64
API reference
BanglaTextNormalizer
from bangla_text_toolkit import BanglaTextNormalizer
n = BanglaTextNormalizer(digit_mode="ascii")
n.normalize("আমি ০১২ বাংলা") # -> "আমি 012 বাংলা"
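The steps this class is described as performing (NFC, ZWJ/ZWNJ handling, whitespace and digit normalisation) can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: `normalize_sketch` and its `strip_zero_width` flag are hypothetical names, and punctuation normalisation is left out.

```python
import re
import unicodedata

# Bangla digits occupy U+09E6..U+09EF; map each one to its ASCII digit.
BANGLA_TO_ASCII_DIGITS = {0x09E6 + i: str(i) for i in range(10)}
# U+200C is ZWNJ and U+200D is ZWJ; mapping to None deletes them.
ZERO_WIDTH = {0x200C: None, 0x200D: None}


def normalize_sketch(text: str, strip_zero_width: bool = True) -> str:
    """Hypothetical helper: NFC, optional ZWJ/ZWNJ removal, ASCII digits, whitespace."""
    text = unicodedata.normalize("NFC", text)
    if strip_zero_width:
        text = text.translate(ZERO_WIDTH)
    text = text.translate(BANGLA_TO_ASCII_DIGITS)
    # Collapse any run of whitespace to one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_sketch("আমি  ০১২ বাংলা "))
```

Zero-width joiners can affect how some Bangla conjuncts render, which is why the sketch makes their removal optional rather than unconditional.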
BanglaTextCleaner
from bangla_text_toolkit.cleaner import BanglaTextCleaner
c = BanglaTextCleaner(remove_urls=True, remove_emojis=True)
c.clean("দেখো https://example.com 😊") # -> "দেখো"
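For readers curious how this kind of cleaning can work, here is a rough regex-based sketch using only the standard library. It is an illustration under assumptions, not the code behind `BanglaTextCleaner`: `clean_sketch` and all pattern names are hypothetical, the patterns are deliberately loose, and the emoji ranges cover many but not all emoji.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
# \S+ rather than \w+, because \w stops at Bangla vowel signs.
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\S+")
# Approximate emoji ranges (assumption): pictographs plus misc symbols/dingbats.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Order matters: emails go before mentions so "@domain" is not half-eaten,
# and HTML tags go before hashtags so "#tag</b>" is not swallowed whole.
PATTERNS = (URL_RE, EMAIL_RE, HTML_TAG_RE, MENTION_OR_HASHTAG_RE, EMOJI_RE)


def clean_sketch(text: str) -> str:
    """Hypothetical helper: replace each match with a space, then tidy whitespace."""
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(clean_sketch("দেখো https://example.com 😊"))
```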
Pipeline
from bangla_text_toolkit import BanglaTextNormalizer
from bangla_text_toolkit.pipeline import Pipeline
pipe = Pipeline()
pipe.add_step(BanglaTextNormalizer().normalize)
result = pipe.run(" আমি বাংলা ")
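A chainable step runner of this kind is small enough to sketch in a few lines. The class below is a hypothetical stand-in (`PipelineSketch`), written only to show the idea of feeding each step's output into the next; returning `self` from `add_step` is an assumption made here to allow chaining and may not match the real `Pipeline`.

```python
from typing import Callable, List


class PipelineSketch:
    """Hypothetical minimal step runner: applies callables in insertion order."""

    def __init__(self) -> None:
        self._steps: List[Callable[[str], str]] = []

    def add_step(self, step: Callable[[str], str]) -> "PipelineSketch":
        self._steps.append(step)
        return self  # assumption: returning self enables method chaining

    def run(self, text: str) -> str:
        # Each step receives the previous step's output.
        for step in self._steps:
            text = step(text)
        return text


pipe = PipelineSketch().add_step(str.strip).add_step(str.upper)
print(pipe.run("  abc  "))
```

Because steps are plain callables, bound methods such as `BanglaTextNormalizer().normalize` and simple functions can be mixed freely.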
BanglaTokenizer
from bangla_text_toolkit.tokenizer import BanglaTokenizer, remove_stopwords
tok = BanglaTokenizer()
tok.tokenize("আমি বাংলায় গান গাই।")
# -> ["আমি", "বাংলায়", "গান", "গাই"]
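Sentence tokenisation for Bangla mostly comes down to treating the danda (`।`, U+0964) as a sentence terminator alongside `?` and `!`. The sketch below shows one way to do that with the standard library; `split_sentences` is a hypothetical helper and is not claimed to match how `BanglaTokenizer` splits sentences.

```python
import re
from typing import List

# Split on whitespace that follows a danda (U+0964), '?' or '!'.
# The lookbehind keeps the terminator attached to its sentence.
SENTENCE_BOUNDARY_RE = re.compile(r"(?<=[\u0964?!])\s+")


def split_sentences(text: str) -> List[str]:
    """Hypothetical helper: split text into sentences at Bangla/Latin terminators."""
    return [s for s in SENTENCE_BOUNDARY_RE.split(text.strip()) if s]


for sentence in split_sentences("আমি গান গাই। তুমি কি গাও? হ্যাঁ!"):
    print(sentence)
```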
BanglaRomanizer
from bangla_text_toolkit.romanization import BanglaRomanizer
r = BanglaRomanizer()
r.romanize("বাংলা") # -> "bangla"
BanglaStemmer
from bangla_text_toolkit import BanglaStemmer
s = BanglaStemmer(min_stem_length=2)
s.stem("বাংলাদের") # -> "বাংলা"
s.stem_tokens(["বাংলাদের", "গানগুলো"]) # -> ["বাংলা", "গান"]
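Suffix stripping can be illustrated with a longest-match-first loop and a minimum stem length guard. This is a sketch under stated assumptions: the suffix list below is a tiny illustrative sample chosen for this example, not the library's rule set, and `stem_sketch` is a hypothetical name.

```python
from typing import Sequence

# Illustrative sample only (assumption): a few plural and case markers.
SAMPLE_SUFFIXES = ("গুলো", "গুলি", "দের", "রা", "কে", "টি")


def stem_sketch(
    word: str,
    suffixes: Sequence[str] = SAMPLE_SUFFIXES,
    min_stem_length: int = 2,
) -> str:
    """Hypothetical helper: strip the longest matching suffix, if the stem stays long enough."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Length is counted in code points, not visible characters.
            if len(stem) >= min_stem_length:
                return stem
    return word


print(stem_sketch("বাংলাদের"))
print(stem_sketch("ছেলেরা"))
```

The length guard is what stops a short word that happens to equal a suffix from being reduced to an empty string.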
BanglaVectorizer
from bangla_text_toolkit import BanglaVectorizer
corpus = [["আমি", "বাংলা"], ["সে", "বাংলা", "বলে"]]
vec = BanglaVectorizer(max_features=500, min_df=1, use_idf=True)
matrix = vec.fit_transform(corpus)
vec.get_feature_names()
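To make the TF-IDF step concrete, the sketch below computes weights for a pre-tokenised corpus with the standard library. It assumes one common formulation, raw term counts multiplied by the smoothed IDF `ln((1 + N) / (1 + df)) + 1` and followed by L2 normalisation; the library may use a different variant, and `tfidf_sketch` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def tfidf_sketch(corpus: Sequence[Sequence[str]]) -> Tuple[List[str], List[List[float]]]:
    """Hypothetical helper: return (sorted vocabulary, L2-normalised TF-IDF rows)."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    vocab = sorted(df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for doc in corpus:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in row])
    return vocab, rows


vocab, rows = tfidf_sketch([["আমি", "বাংলা"], ["সে", "বাংলা", "বলে"]])
print(vocab)
print(rows[0])
```

A term that appears in every document still gets a non-zero weight under this smoothing, but a lower one than terms unique to a single document.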
BanglaKeywordExtractor
from bangla_text_toolkit import BanglaKeywordExtractor
corpus = [["আমি", "বাংলায়", "গান", "গাই"], ["সে", "বাংলায়", "কথা", "বলে"]]
kex = BanglaKeywordExtractor(top_k=3)
kex.fit(corpus)
kex.extract(corpus[0])
# -> [('গাই', 0.57...), ('গান', 0.40...), ('আমি', 0.40...)]
BanglaSequenceLabeler
from bangla_text_toolkit import BanglaSequenceLabeler
labeler = BanglaSequenceLabeler()
labeler.label(["আমি", "১২৩", "গান", "।"])
# -> [('আমি', 'STOP'), ('১২৩', 'NUM'), ('গান', 'WORD'), ('।', 'PUNCT')]
labeler.add_rule(r"[A-Za-z]+", "LATIN") # custom rule, highest priority
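Rule-based labelling like this can be modelled as an ordered list of (pattern, label) pairs where the first full match wins, with a stopword lookup as the fallback. The class below is a hypothetical sketch of that idea; its default patterns and the `stopwords` constructor argument are assumptions, not the internals of `BanglaSequenceLabeler`.

```python
import re
from typing import Iterable, List, Tuple


class LabelerSketch:
    """Hypothetical first-match-wins token labeller."""

    def __init__(self, stopwords: Iterable[str] = ()) -> None:
        self._stopwords = set(stopwords)
        self._rules = [
            (re.compile(r"[0-9\u09E6-\u09EF]+"), "NUM"),        # ASCII or Bangla digits
            (re.compile(r"[\u0964\u0965.,;:!?]+"), "PUNCT"),    # danda, double danda, Latin marks
        ]

    def add_rule(self, pattern: str, label: str) -> None:
        # New rules go to the front so they take priority over the defaults.
        self._rules.insert(0, (re.compile(pattern), label))

    def label(self, tokens: Iterable[str]) -> List[Tuple[str, str]]:
        labelled = []
        for token in tokens:
            for regex, name in self._rules:
                if regex.fullmatch(token):
                    labelled.append((token, name))
                    break
            else:  # no rule matched: fall back to the stopword check
                labelled.append((token, "STOP" if token in self._stopwords else "WORD"))
        return labelled


sketch = LabelerSketch(stopwords={"আমি"})
print(sketch.label(["আমি", "১২৩", "গান", "।"]))
```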
BanglaEmbedder
from bangla_text_toolkit import BanglaEmbedder
emb = BanglaEmbedder(dim=64, ngram_range=(2, 4), normalize=True)
# Single token → 64-d L2-normalised vector
vec = emb.embed_token("বাংলা")
# Document → mean of token embeddings
doc_vec = emb.embed_document(["আমি", "বাংলায়", "গান", "গাই"])
print(len(doc_vec)) # 64
# Corpus → one vector per document
corpus_vecs = emb.embed_corpus([["আমি", "গান"], ["সে", "কথা"]])
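The hashing trick behind character n-gram embeddings can be sketched as follows: every n-gram of a token is hashed to one of `dim` buckets with a sign, the signed counts form the vector, and the result is L2-normalised. This is an illustration, not the library's code. The function names, the `<`/`>` boundary markers, and the choice of BLAKE2b are all assumptions; a `hashlib` digest is used because Python's built-in `hash()` for strings is salted per process and would not be reproducible.

```python
import hashlib
import math
from typing import List, Sequence, Tuple


def embed_token_sketch(
    token: str,
    dim: int = 64,
    ngram_range: Tuple[int, int] = (2, 4),
    normalize: bool = True,
) -> List[float]:
    """Hypothetical helper: signed feature hashing over character n-grams."""
    vec = [0.0] * dim
    padded = f"<{token}>"  # assumption: mark word boundaries before slicing
    low, high = ngram_range
    for n in range(low, high + 1):
        for i in range(len(padded) - n + 1):
            gram = padded[i : i + n]
            digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
            h = int.from_bytes(digest, "big")
            sign = 1.0 if (h >> 63) == 0 else -1.0  # top bit picks the sign
            vec[h % dim] += sign                     # remaining bits pick the bucket
    if normalize:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            vec = [x / norm for x in vec]
    return vec


def embed_document_sketch(tokens: Sequence[str], dim: int = 64) -> List[float]:
    """Hypothetical helper: mean of the token vectors (zeros for an empty document)."""
    if not tokens:
        return [0.0] * dim
    vectors = [embed_token_sketch(t, dim=dim) for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(len(embed_token_sketch("বাংলা")))
print(len(embed_document_sketch(["আমি", "গান"])))
```

Because no vocabulary is stored, any token, including one never seen before, maps to a vector of the same length.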
Testing
pytest tests/ -v
# 246 tests, 0 failures, 0 dependencies
Roadmap
12-week build log — one component per session, all tested and CI-green.
| Day | Component | Status |
|---|---|---|
| 1 | Package scaffold, pyproject.toml, CI | ✅ |
| 2 | `BanglaTextNormalizer` + 36 tests | ✅ |
| 3 | `Pipeline` + 14 tests | ✅ |
| 4 | `BanglaTextCleaner`, `BanglaTokenizer`, stopwords + tests | ✅ |
| 5 | `BanglaRomanizer`, GitHub Actions CI | ✅ |
| 6 | `BanglaStemmer` + 35 tests | ✅ |
| 7 | `BanglaVectorizer` (TF-IDF) + 30 tests | ✅ |
| 8 | `BanglaKeywordExtractor` (top-k TF-IDF keywords) + 29 tests | ✅ |
| 9 | `BanglaSequenceLabeler` (rule-based token labelling) + 33 tests | ✅ |
| 10 | `BanglaEmbedder` (character n-gram hashing embeddings) + 33 tests | ✅ |
| 11 | `notebooks/demo.ipynb` end-to-end demo (all 11 components) | ✅ |
| 12 | PyPI publish (v0.1.0) | ✅ |
Why this exists
Standard NLP tools silently break on Bangla text. Python's \w regex doesn't match Bangla combining vowel signs (Unicode category Mc/Mn), and most tokenisers treat matras as noise:
import re
re.findall(r'\w+', 'বাংলা') # ['ব', 'ল'] ← drops 'া', 'ং'
This library handles the full Bangla Unicode block (U+0980–U+09FF) correctly, with no external dependencies.
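One way to sidestep the `\w` problem with only the standard library is to match the Bangla block explicitly. The snippet below is a sketch of that approach rather than the library's tokenizer; `bangla_words` is a hypothetical helper, and note that the block also contains Bangla digits and a few symbols, which this simple pattern will include.

```python
import re
from typing import List

# Match runs of code points in the Bangla block, vowel signs included.
BANGLA_RUN_RE = re.compile(r"[\u0980-\u09FF]+")


def bangla_words(text: str) -> List[str]:
    """Hypothetical helper: extract contiguous Bangla-block runs from text."""
    return BANGLA_RUN_RE.findall(text)


word = "বাংলা"
print(re.findall(r"\w+", word))   # letters only, vowel signs dropped
print(bangla_words(word))         # the whole word survives
print(bangla_words("আমি বাংলায় গান গাই।"))
```

The danda (`।`, U+0964) sits outside the Bangla block, so it is excluded from the matches without any extra handling.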
License
MIT © Umme Abira Azmary