Bangla synonym lookup — offline dataset + live web scraping
bangla-synonyms
Bangla synonym lookup for the NLP community
Offline dataset · Live web scraping · Source metadata · CLI included
pip install bangla-synonyms
Why
Bengali is spoken by over 230 million people, yet it remains one of the most underserved languages in the NLP ecosystem. Finding synonyms programmatically — trivial for English — has no reliable solution for Bangla.
bangla-synonyms fills that gap. Common use cases:
- Text augmentation — expand training data for Bangla ML models
- Search and indexing — build synonym-aware search in Bangla applications
- Writing tools — avoid word repetition in Bangla text editors
- Education — vocabulary builders and language learning tools
- Linguistics research — corpus building and lexical analysis
Results are cached locally on first lookup, so the dataset grows automatically the more you use it. No API key required. No internet connection needed for cached words.
Table of Contents
- Features
- Installation
- Quick Start
- Scraping Sources
- Top-level API
- Raw Mode
- Scrapper
- Core API
- CLI Reference
- Dataset
- Architecture
- Adding a New Source
- Contributing
- License
Features
| Feature | Description |
|---|---|
| Offline-first | Checks local dataset before making any network call |
| Live fallback | Cascades through Wiktionary → Shabdkosh → English-Bangla |
| Quality filtering | Cross-source validation removes noise and wrong-sense entries |
| Source metadata | raw=True returns per-synonym source attribution and confirmed flags |
| Source control | Choose exactly which sources to query |
| Merge or first-hit | Combine results from all sources, or stop at the first match |
| Opt-in persistence | Scraped results are saved to disk only when auto_save=True |
| Batch scraping | Scrape thousands of words with progress tracking and resume support |
| Dataset download | One-command download of a pre-built ~10,000 word dataset |
| CLI | Full command-line interface for scripting and one-off lookups |
| Python 3.9+ | Type-annotated, minimal dependencies |
Installation
pip install bangla-synonyms
Quick Start
import bangla_synonyms as bs
# Download the dataset once
bs.download()
# Look up a single word
bs.get("চোখ")
# → ['চক্ষু', 'নেত্র', 'লোচন', 'আঁখি', 'অক্ষি']
# Look up multiple words at once
bs.get_many(["চোখ", "মা", "সুন্দর"])
# → {
# 'চোখ': ['চক্ষু', 'নেত্র', 'লোচন', 'আঁখি'],
# 'মা': ['জননী', 'আম্মা', 'জন্মদাত্রী', 'মাতা'],
# 'সুন্দর': ['খুবসুরত', 'হাসিন', 'মনোরম', 'মনোহর'],
# }
Words not found in the local dataset are scraped automatically:
bs.get("তটিনী")
# → ['নদী', 'প্রবাহিনী', 'সরিৎ', 'স্রোতস্বিনী']
Scraping Sources
Three sources are available. All three are used by default, tried in order from most to least reliable.
| Key | Site | Type | Notes |
|---|---|---|---|
| "wiktionary" | bn.wiktionary.org | Structured wikitext | Most reliable; queried first |
| "shabdkosh" | shabdkosh.com | Dictionary | Good coverage; clean output |
| "english_bangla" | english-bangla.com | bn→bn dictionary | Last resort; near-synonyms and related words |
Their results are merged and deduplicated before being returned.
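The merge step can be pictured as an ordered, first-seen-wins deduplication across sources. The sketch below is illustrative, not the package's actual implementation:

```python
def merge_sources(sources_results: dict[str, list[str]]) -> list[str]:
    """Merge per-source synonym lists in source-priority order,
    dropping duplicates that reappear in later sources."""
    seen: set[str] = set()
    merged: list[str] = []
    for source in ("wiktionary", "shabdkosh", "english_bangla"):
        for syn in sources_results.get(source, []):
            if syn not in seen:
                seen.add(syn)
                merged.append(syn)
    return merged

results = {
    "wiktionary": ["মনোরম", "সুশ্রী"],
    "shabdkosh": ["মনোরম", "লাবণ্যময়"],  # "মনোরম" duplicates a Wiktionary entry
}
print(merge_sources(results))  # → ['মনোরম', 'সুশ্রী', 'লাবণ্যময়']
```

Because Wiktionary is iterated first, a synonym reported by several sources keeps its first-seen (most reliable) attribution.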
Quality filtering
Raw scraper output is passed through a multi-stage quality pipeline before being returned:
- Noise removal — drops phrases, hyphenated entries, numbered items, entries containing digits or Latin characters, and zero-width characters.
- Cross-source validation — when Wiktionary is present, its entries are kept in full (authoritative). Entries from other sources are filtered by source tier:
- Shabdkosh entries are included only when Wiktionary independently confirms them.
- English-Bangla entries are included only when confirmed by at least one other source.
- Deduplication — duplicate synonyms across sources are removed, keeping the first-seen source attribution.
The quality field in raw mode output describes which strategy was applied.
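The noise-removal stage can be sketched as a small cleaning function. This is an illustrative reimplementation of the rules listed above — the function name and exact character classes are assumptions, not the package's internals:

```python
import re

# Zero-width characters are stripped rather than rejecting the whole entry
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d"))

def clean(entry):
    """Return a cleaned Bangla word, or None if the entry is noise:
    a phrase, a hyphenated entry, or one containing digits or Latin letters."""
    entry = entry.translate(ZERO_WIDTH).strip()
    if " " in entry or "-" in entry:
        return None
    if re.search(r"[A-Za-z0-9\u09E6-\u09EF]", entry):  # Latin, ASCII or Bangla digits
        return None
    return entry or None

raw = ["চক্ষু", "beautiful", "মনোরম ১", "নেত্র-যুগল", "লোচন\u200b"]
print([w for e in raw if (w := clean(e))])
# → ['চক্ষু', 'লোচন']
```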
Top-level API
The most common operations are available directly on the package — no class or instance needed.
import bangla_synonyms as bs
download()
bs.download() # full dataset (~10,000 words)
bs.download("mini") # small starter set (~500 words)
bs.download(force=True) # re-download even if the file already exists
bs.download("latest", force=True)
The dataset is saved to ./bangla_synonyms_data/dataset.json.
get()
bs.get(word, sources=None, raw=False)
| Parameter | Type | Default | Description |
|---|---|---|---|
| word | str | — | The Bangla word to look up |
| sources | list \| None | None | Sources to query (None uses all three) |
| raw | bool | False | Return a metadata dict instead of a plain list |
bs.get("সুন্দর")
# → ['মনোরম', 'সুশ্রী', 'চমৎকার']
bs.get("সুন্দর", sources=["wiktionary"])
bs.get("সুন্দর", sources=["wiktionary", "shabdkosh"])
bs.get("সুন্দর", raw=True)
# → {
# 'word': 'সুন্দর',
# 'sources_results': {
# 'wiktionary': ['মনোরম', 'সুশ্রী', 'চমৎকার'],
# 'shabdkosh': ['লাবণ্যময়', 'দৃষ্টিনন্দন', 'মনোরম']
# },
# 'results': [
# {'synonym': 'মনোরম', 'source': 'wiktionary'},
# {'synonym': 'সুশ্রী', 'source': 'wiktionary'}
# ],
# 'words': ['মনোরম', 'সুশ্রী', 'চমৎকার'],
# 'sources_hit': ['wiktionary', 'shabdkosh'],
# 'sources_tried': ['wiktionary', 'shabdkosh'],
# 'quality': 'wikiconfirmed',
# 'source': 'wiktionary'
# }
bs.get("xyz")
# → []
get_many()
bs.get_many(words, sources=None, raw=False)
| Parameter | Type | Default | Description |
|---|---|---|---|
| words | list[str] | — | List of Bangla words to look up |
| sources | list \| None | None | Sources to query |
| raw | bool | False | Return metadata dicts instead of plain lists |
bs.get_many(["চোখ", "মা", "নদী"])
# → {'চোখ': [...], 'মা': [...], 'নদী': [...]}
bs.get_many(["চোখ", "মা"], sources=["wiktionary"])
bs.get_many(["চোখ", "মা"], raw=True)
# → {'চোখ': {raw dict}, 'মা': {raw dict}}
stats()
bs.stats()
# Words : 9842
# Total synonyms: 47391
# Avg / word : 4.82
# Source : /home/user/bangla_synonyms_data/dataset.json
# Top 5 words :
# চোখ: চক্ষু, নেত্র, লোচন, আঁখি ...
# মা: জননী, আম্মা, জন্মদাত্রী ...
Returns a dict with keys: total_words, total_synonyms, avg_per_word, source.
Raw Mode
Pass raw=True to any lookup function to receive full source metadata alongside results. This is supported at every level: get(), get_many(), and Scrapper.
Response structure
{
"word": str, # looked up word
"source": str | None, # primary source
"sources_results": {
"source_name": list[str] # synonyms returned by that source
},
"results": [
{
"synonym": str,
"source": str
}
],
"words": list[str], # flat synonym list
"sources_hit": list[str], # sources that returned data
"sources_tried": list[str], # queried sources
"quality": str # filtering strategy
}
quality values
| Value | Meaning |
|---|---|
| "wikiconfirmed" | Wiktionary was present; other sources filtered by cross-validation |
| "cross_source" | No Wiktionary; synonyms confirmed by two or more sources |
| "single_source" | Only one source was available; noise-cleaned results returned as-is |
| "local" | Returned from local dataset cache; no scraping was performed |
| "empty" | No results survived filtering, or all sources returned errors |
confirmed flag
The confirmed: True flag is set on entries from secondary sources that passed cross-validation. Wiktionary entries are always authoritative and do not carry this flag.
result = bs.get("চোখ", raw=True)
# Filter to Wiktionary entries only
wiki = [r["synonym"] for r in result["results"] if r["source"] == "wiktionary"]
# Filter to cross-validated entries (Wiktionary + confirmed secondaries)
high_confidence = [
r["synonym"] for r in result["results"]
if r["source"] == "wiktionary" or r.get("confirmed")
]
# Check which filtering strategy was applied
print(result["quality"]) # "wikiconfirmed"
print(result["sources_hit"]) # ["wiktionary", "shabdkosh"]
Scrapper
Fine-grained control over every aspect of the scraping process. Intended for researchers and power users.
from bangla_synonyms import Scrapper
Constructor
| Parameter | Type | Default | Description |
|---|---|---|---|
| offline | bool | False | Use local dataset only; make no network calls |
| auto_save | bool | False | Persist scraped results to disk |
| delay | float | 1.0 | Seconds between HTTP requests |
| timeout | int | 10 | HTTP request timeout in seconds |
| sources | list \| None | None | Sources to query (None = all three) |
| merge | bool | True | Merge all sources (True) or stop at first result (False) |
sc = Scrapper() # online, no persistence
sc = Scrapper(offline=True) # local dataset only
sc = Scrapper(auto_save=True) # persist results
sc = Scrapper(sources=["wiktionary"]) # single source
sc = Scrapper(sources=["wiktionary", "shabdkosh"]) # two sources
sc = Scrapper(merge=False) # stop at first hit
sc = Scrapper(delay=2.0, timeout=20) # slow connection
.get(word, raw=False)
Checks the local dataset first. Falls back to live scraping if the word is not found.
sc.get("চোখ")
# → ['চক্ষু', 'নেত্র', 'লোচন', ...]
sc.get("চোখ", raw=True)
# → {"word": "চোখ", "source": "wiktionary", "quality": "wikiconfirmed", ...}
# Local cache hit — no network call is made
sc.get("মা", raw=True)
# → {"word": "মা", "source": "local", "quality": "local", ...}
Scrapper(offline=True).get("নদী")
# → local dataset lookup only
.get_many(words, raw=False)
sc.get_many(["চোখ", "মা", "নদী"])
# → {'চোখ': [...], 'মা': [...], 'নদী': [...]}
sc.get_many(["চোখ", "মা"], raw=True)
# → {'চোখ': {raw dict}, 'মা': {raw dict}}
The request delay applies only to live HTTP calls. Local cache hits incur no delay.
.active_sources
Scrapper().active_sources
# → ["wiktionary", "shabdkosh", "english_bangla"]
Scrapper(sources=["wiktionary"]).active_sources
# → ["wiktionary"]
.download() — class method
Scrapper.download()
Scrapper.download("mini")
Scrapper.download(force=True)
Source selection patterns
# Structured data only — best for NLP pipelines
sc = Scrapper(sources=["wiktionary"])
# Maximum coverage — all sources merged
sc = Scrapper()
# Speed-first — stop at the first source that returns results
sc = Scrapper(merge=False)
# Exclude the lowest-reliability source
sc = Scrapper(sources=["wiktionary", "shabdkosh"])
# Long-running batch — persist results, polite rate limit
sc = Scrapper(auto_save=True, delay=2.0)
Dataset helpers
sc.add("পরিবেশ", ["প্রকৃতি", "জগত", "বিশ্ব"]) # add to local dataset
sc.stats() # dataset statistics
sc.export("synonyms.json") # export as JSON
sc.export("synonyms.csv", fmt="csv") # export as CSV
Core API
Lower-level building blocks for advanced users.
DatasetManager
Direct read/write access to the local synonym dataset.
from bangla_synonyms.core import DatasetManager
dm = DatasetManager()
All instances share the same in-memory store. A change made through one instance is immediately visible through any other.
Dataset location: ./bangla_synonyms_data/dataset.json
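One common way to get this behaviour is a class-level store that every instance references. A minimal sketch of the pattern (illustrative only — not the package's actual implementation):

```python
class SharedStore:
    _data: dict = {}  # one dict shared by every instance of the class

    def add(self, word, synonyms):
        # Mutating (not rebinding) the class attribute updates all instances
        self._data.setdefault(word, []).extend(synonyms)

    def get(self, word):
        return list(self._data.get(word, []))

a, b = SharedStore(), SharedStore()
a.add("নদী", ["তটিনী"])
print(b.get("নদী"))  # → ['তটিনী'] — the write is visible through the other instance
```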
Reading data
dm.get("চোখ") # → ['চক্ষু', 'নেত্র', ...] empty list if not found
dm.has("চোখ") # → True / False
dm.all_words() # → sorted list of all words in the dataset
"চোখ" in dm # → True
len(dm) # → 9842
Writing data
# Merge new synonyms with any that already exist
dm.add("শব্দ", ["প্রতিশব্দ১", "প্রতিশব্দ২"])
# Replace the synonym list entirely
dm.update("শব্দ", ["নতুন১", "নতুন২"])
# Remove a word
dm.remove("শব্দ") # returns True if the word existed, False otherwise
Each write automatically flushes to disk. To batch multiple writes into a single flush, pass save=False and export manually:
dm.add("ক", ["খ", "গ"], save=False)
dm.add("ঘ", ["ঙ"], save=False)
dm.add("চ", ["ছ", "জ"], save=False)
dm.export("synonyms.json")
Merging from a file
added = dm.merge("extra_synonyms.json")
print(f"{added} new words added")
The JSON file should have the same format as the main dataset:
{
"নদী": ["তটিনী", "প্রবাহিনী", "সরিৎ"],
"আকাশ": ["গগন", "অম্বর", "নভ"]
}
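Such a file can be generated with the standard json module; the file name here is just an example:

```python
import json

extra = {
    "নদী": ["তটিনী", "প্রবাহিনী", "সরিৎ"],
    "আকাশ": ["গগন", "অম্বর", "নভ"],
}

# ensure_ascii=False keeps the Bangla text human-readable in the file
with open("extra_synonyms.json", "w", encoding="utf-8") as f:
    json.dump(extra, f, ensure_ascii=False, indent=2)
```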
Exporting
dm.export("synonyms.json") # JSON (default)
dm.export("synonyms.csv", fmt="csv") # CSV
dm.reload() # reload from disk after an external change
CSV format:
word,synonyms,count
চোখ,চক্ষু | নেত্র | লোচন | আঁখি,4
মা,জননী | আম্মা | জন্মদাত্রী,3
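The pipe-separated synonyms column can be parsed back with the standard csv module. A small sketch based on the sample rows above:

```python
import csv
import io

# Sample text in the export format shown above
csv_text = "word,synonyms,count\nচোখ,চক্ষু | নেত্র | লোচন | আঁখি,4\n"

dataset = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    # Split the " | "-joined synonyms back into a clean list
    dataset[row["word"]] = [s.strip() for s in row["synonyms"].split("|")]

print(dataset["চোখ"])  # → ['চক্ষু', 'নেত্র', 'লোচন', 'আঁখি']
```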
Stats
info = dm.stats()
# Words : 9842
# Total synonyms: 47391
# Avg / word : 4.82
# Source : /home/user/bangla_synonyms_data/dataset.json
# Top 5 words :
# চোখ: চক্ষু, নেত্র, লোচন, আঁখি ...
info["total_words"] # 9842
info["total_synonyms"] # 47391
info["avg_per_word"] # 4.82
WordlistFetcher
Fetches Bangla word lists from Wiktionary for use with BatchScraper.
from bangla_synonyms.core import WordlistFetcher, DatasetManager
wf = WordlistFetcher()
dm = DatasetManager()
# Fetch up to 500 words from Wiktionary
words = wf.fetch(limit=500)
# Filter to words not yet in the local dataset (enables safe resume)
new_words = wf.filter_new(words, dm)
print(f"{len(new_words)} words not yet scraped")
# Persist and reload word lists
wf.save(words, "wordlist.txt")
words = wf.load("wordlist.txt")
BatchScraper
Scrapes synonyms for large word lists with progress tracking, checkpointing, and resume support.
from bangla_synonyms.core import BatchScraper
Constructor
| Parameter | Default | Description |
|---|---|---|
| dataset | shared | DatasetManager instance to write results into |
| delay | 1.0 | Seconds between HTTP requests |
| timeout | 10 | HTTP request timeout in seconds |
| save_every | 50 | Flush results to disk every N words |
| sources | None | Sources to query (None = all three) |
| merge | True | Merge all sources or stop at first hit |
.run(words, skip_existing=True, show_progress=True, sources=None)
scraper = BatchScraper(delay=1.0)
result = scraper.run(["চোখ", "মা", "নদী", "আকাশ"])
# [ 1/4] চোখ: ✓ চক্ষু, নেত্র, লোচন ...
# [ 2/4] মা: ✓ জননী, আম্মা, জন্মদাত্রী
# [ 3/4] নদী: ✓ তটিনী, প্রবাহিনী
# [ 4/4] আকাশ: — not found
#
# [bangla-synonyms] done: 3 found, 1 not found, 0 errors
# Safe to re-run; already-scraped words are skipped
result = scraper.run(words, skip_existing=True)
# Override sources for this run only
result = scraper.run(words, sources=["wiktionary"])
# Suppress progress output
result = scraper.run(words, show_progress=False)
# Return value: {word: [synonyms], ...} for newly scraped words
.run_from_wiktionary(limit=200)
Fetches a word list from Wiktionary and scrapes all of them in one step.
scraper = BatchScraper(delay=1.5, sources=["wiktionary", "shabdkosh"])
scraper.run_from_wiktionary(limit=1000)
Full batch workflow
from bangla_synonyms.core import DatasetManager, WordlistFetcher, BatchScraper
dm = DatasetManager()
wf = WordlistFetcher()
# Fetch word list
words = wf.fetch(limit=5000)
wf.save(words, "wordlist.txt")
# Skip words already in the dataset
new_words = wf.filter_new(words, dm)
print(f"{len(new_words)} words to scrape")
# Scrape
scraper = BatchScraper(
delay=1.0,
save_every=100,
sources=["wiktionary", "shabdkosh"],
)
scraper.run(new_words, skip_existing=True)
# Export
dm.stats()
dm.export("bangla_synonyms_full.json")
dm.export("bangla_synonyms_full.csv", fmt="csv")
CLI Reference
# Dataset management
bangla-synonyms download
bangla-synonyms download --version mini
bangla-synonyms download --force
# Synonym lookup
bangla-synonyms get চোখ
bangla-synonyms get চোখ মা সুন্দর
bangla-synonyms get চোখ --offline
bangla-synonyms get চোখ --sources wiktionary
bangla-synonyms get চোখ --sources wiktionary --sources shabdkosh
bangla-synonyms get চোখ --no-merge
bangla-synonyms get চোখ --raw
# Build / expand the local dataset
bangla-synonyms build
bangla-synonyms build --limit 1000
bangla-synonyms build --delay 2.0
bangla-synonyms build --sources wiktionary
bangla-synonyms build --sources wiktionary --sources shabdkosh
bangla-synonyms build --no-merge
# Information and export
bangla-synonyms stats
bangla-synonyms export synonyms.json
bangla-synonyms export synonyms.csv --format csv
# Help
bangla-synonyms --help
bangla-synonyms get --help
bangla-synonyms build --help
The CLI functions are also importable for use in scripts:
from bangla_synonyms.cli import get, build, stats
result = get(["চোখ", "মা"])
result = get(["চোখ"], offline=True)
result = get(["চোখ"], sources=["wiktionary"], merge=False)
added = build(limit=500, delay=1.5)
info = stats()
Dataset
A pre-built dataset is available for download via GitHub Releases.
| Version | Words | Approximate size | Command |
|---|---|---|---|
| latest | ~10,000 | ~3 MB | bs.download() |
| mini | ~500 | ~150 KB | bs.download("mini") |
The dataset is saved to ./bangla_synonyms_data/dataset.json. All running instances pick up the new data immediately after a download — no restart required.
Building a larger dataset
bangla-synonyms build --limit 5000 \
--sources wiktionary --sources shabdkosh \
--delay 1.5
bangla-synonyms stats
bangla-synonyms export my_dataset.json
Dataset format
{
"চোখ": ["চক্ষু", "নেত্র", "লোচন", "আঁখি", "অক্ষি"],
"মা": ["জননী", "আম্মা", "জন্মদাত্রী", "মাতা"],
"নদী": ["তটিনী", "প্রবাহিনী", "সরিৎ", "স্রোতস্বিনী"]
}
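Because the format is plain JSON, the dataset can also be consumed without this package at all. A sketch using an inline sample (in practice, json.load the dataset file instead):

```python
import json

# Inline sample in the dataset format; normally you would use
# json.load(open("bangla_synonyms_data/dataset.json", encoding="utf-8"))
dataset = json.loads("""{
  "চোখ": ["চক্ষু", "নেত্র", "লোচন", "আঁখি", "অক্ষি"],
  "মা": ["জননী", "আম্মা", "জন্মদাত্রী", "মাতা"]
}""")

total = sum(len(syns) for syns in dataset.values())
avg = total / len(dataset)
print(len(dataset), total, round(avg, 2))  # → 2 9 4.5
```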
Contributing
Bengali remains one of the most underserved languages in NLP, and bangla-synonyms is one of the few programmatic tools for Bangla lexical resources — your contribution directly improves what the entire community can build.
Bug reports, new sources, quality improvements, and dataset contributions are all welcome. See CONTRIBUTING.md for the full workflow: how to open issues, create branches, and submit PRs.
BNLP users — if you use BNLP for tokenization, embeddings, or NER, bangla-synonyms pairs naturally with it. Use this library to expand your training vocabulary, augment datasets, or build synonym-aware preprocessing pipelines on top of BNLP models.
Acknowledgements
Data sources used by this package:
- Bangla Wiktionary — CC BY-SA 3.0
- Shabdkosh
- English-Bangla Dictionary