Bangla synonym lookup — offline dataset + live Web Scraping

bangla-synonyms

Bangla synonym lookup for the NLP community
Offline dataset  ·  Live web scraping  ·  Source metadata  ·  CLI included


pip install bangla-synonyms

Why

Bengali is spoken by over 230 million people, yet it remains one of the most underserved languages in the NLP ecosystem. Finding synonyms programmatically — something trivially easy for English — has no reliable solution for Bangla.

bangla-synonyms fills that gap. Common use cases:

  • Text augmentation — expand training data for Bangla ML models
  • Search and indexing — build synonym-aware search in Bangla applications
  • Writing tools — avoid word repetition in Bangla text editors
  • Education — vocabulary builders and language learning tools
  • Linguistics research — corpus building and lexical analysis

Scraped results can be cached locally by opting in with auto_save=True, so the dataset grows the more you use it. No API key is required, and cached words need no internet connection.


Features

| Feature | Description |
| --- | --- |
| Offline-first | Checks local dataset before making any network call |
| Live fallback | Cascades through Wiktionary → Shabdkosh → English-Bangla |
| Quality filtering | Cross-source validation removes noise and wrong-sense entries |
| Source metadata | raw=True returns per-synonym source attribution and confidence |
| Source control | Choose exactly which sources to query |
| Merge or first-hit | Combine results from all sources, or stop at the first match |
| Opt-in persistence | Scraped results are saved to disk only when auto_save=True |
| Batch scraping | Scrape thousands of words with progress tracking and resume support |
| Dataset download | One-command download of a pre-built ~10,000 word dataset |
| CLI | Full command-line interface for scripting and one-off lookups |
| Python 3.9+ | Type-annotated, minimal dependencies |

Installation

pip install bangla-synonyms

Quick Start

import bangla_synonyms as bs

# Download the dataset once
bs.download()

# Look up a single word
bs.get("চোখ")
# → ['চক্ষু', 'নেত্র', 'লোচন', 'আঁখি', 'অক্ষি']

# Look up multiple words at once
bs.get_many(["চোখ", "মা", "সুন্দর"])
# → {
#     'চোখ':    ['চক্ষু', 'নেত্র', 'লোচন', 'আঁখি'],
#     'মা':     ['জননী', 'আম্মা', 'জন্মদাত্রী', 'মাতা'],
#     'সুন্দর': ['খুবসুরত', 'হাসিন', 'মনোরম', 'মনোহর'],
#   }

Words not found in the local dataset are scraped automatically:

bs.get("তটিনী")
# → ['নদী', 'প্রবাহিনী', 'সরিৎ', 'স্রোতস্বিনী']

Scraping Sources

Three sources are available. All three are used by default, tried in order from most to least reliable.

| Key | Site | Type | Notes |
| --- | --- | --- | --- |
| "wiktionary" | bn.wiktionary.org | Structured wikitext | Most reliable; queried first |
| "shabdkosh" | shabdkosh.com | Dictionary | Good coverage; clean output |
| "english_bangla" | english-bangla.com | bn→bn dictionary | Last resort; near-synonyms and related words |

By default all three sources are tried and their results are merged and deduplicated.
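The difference between merging and first-hit lookup can be sketched in a few lines. This is an illustration only, not the library's code: the lookup function and the stub fetchers are hypothetical stand-ins for the real scrapers, and quality filtering is left out.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a scraper: maps a word to a list of synonyms.
Fetcher = Callable[[str], List[str]]


def lookup(word: str, fetchers: Dict[str, Fetcher], merge: bool = True) -> List[str]:
    """Query fetchers in order; merge all results or stop at the first hit."""
    merged: List[str] = []
    for _name, fetch in fetchers.items():  # dicts preserve insertion order
        found = fetch(word)
        if not found:
            continue
        if not merge:
            return found  # first-hit mode: stop at the first source with data
        for synonym in found:
            if synonym not in merged:  # deduplicate, keeping first-seen order
                merged.append(synonym)
    return merged


# Stub sources in reliability order (data is illustrative).
stub_sources: Dict[str, Fetcher] = {
    "wiktionary": lambda w: ["মনোরম", "সুশ্রী"],
    "shabdkosh": lambda w: ["মনোরম", "দৃষ্টিনন্দন"],
    "english_bangla": lambda w: [],
}

merged_result = lookup("সুন্দর", stub_sources)
first_hit = lookup("সুন্দর", stub_sources, merge=False)
```

In merge mode every source is queried, which costs more requests but widens coverage. First-hit mode returns as soon as one source answers.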

Quality filtering

Raw scraper output is passed through a multi-stage quality pipeline before being returned:

  1. Noise removal — drops phrases, hyphenated entries, numbered items, entries containing digits or Latin characters, and zero-width characters.
  2. Cross-source validation — when Wiktionary is present, its entries are kept in full (authoritative). Entries from other sources are filtered by source tier:
    • Shabdkosh entries are included only when Wiktionary independently confirms them.
    • English-Bangla entries are included only when confirmed by at least one other source.
  3. Deduplication — duplicate synonyms across sources are removed, keeping the first-seen source attribution.

The quality field in raw mode output describes which strategy was applied.
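The three stages above can be sketched as follows for the case where Wiktionary returned results. This is a minimal reading of the rules as written, not the package's actual implementation: the function names, the regular expressions, and the choice to strip zero-width characters (rather than drop the entry) are all assumptions.

```python
import re
from typing import Dict, List

# Assumption: zero-width characters are stripped from entries.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")
# Whitespace (phrases), hyphens, ASCII or Bangla digits, Latin letters.
NOISE = re.compile(r"[\s\-0-9০-৯A-Za-z]")


def clean(entries: List[str]) -> List[str]:
    """Stage 1: noise removal."""
    cleaned = []
    for entry in entries:
        entry = ZERO_WIDTH.sub("", entry).strip()
        if entry and not NOISE.search(entry):
            cleaned.append(entry)
    return cleaned


def validate_with_wiktionary(by_source: Dict[str, List[str]]) -> List[dict]:
    """Stages 2 and 3, assuming Wiktionary is among the sources."""
    wiki = clean(by_source.get("wiktionary", []))
    shabd = clean(by_source.get("shabdkosh", []))
    eb = clean(by_source.get("english_bangla", []))

    # Wiktionary entries are kept in full; secondary entries need confirmation.
    candidates = [(s, "wiktionary") for s in wiki]
    candidates += [(s, "shabdkosh") for s in shabd if s in wiki]
    candidates += [(s, "english_bangla") for s in eb if s in wiki or s in shabd]

    seen, results = set(), []
    for synonym, source in candidates:
        if synonym in seen:  # deduplicate, keeping first-seen attribution
            continue
        seen.add(synonym)
        entry = {"synonym": synonym, "source": source}
        if source != "wiktionary":
            entry["confirmed"] = True
        results.append(entry)
    return results


results = validate_with_wiktionary({
    "wiktionary": ["মনোরম", "সুশ্রী", "চমৎকার"],
    "shabdkosh": ["দৃষ্টিনন্দন", "মনোরম", "good looking"],
    "english_bangla": ["দৃষ্টিনন্দন", "রূপবান", "সুন্দর-মুখ"],
})
kept = [r["synonym"] for r in results]
```

Here the three Wiktionary entries survive in full, and one English-Bangla entry is kept because Shabdkosh also lists it.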


Top-level API

The most common operations are available directly on the package — no class or instance needed.

import bangla_synonyms as bs

download()

bs.download()                       # full dataset (~10,000 words)
bs.download("mini")                 # small starter set (~500 words)
bs.download(force=True)             # re-download even if the file already exists
bs.download("latest", force=True)

The dataset is saved to ./bangla_synonyms_data/dataset.json.


get()

bs.get(word, sources=None, raw=False)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| word | str | (required) | The Bangla word to look up |
| sources | list or None | None | Sources to query (None uses all three) |
| raw | bool | False | Return a metadata dict instead of a plain list |

bs.get("সুন্দর")
# → ['মনোরম', 'সুশ্রী', 'চমৎকার']

bs.get("সুন্দর", sources=["wiktionary"])
bs.get("সুন্দর", sources=["wiktionary", "shabdkosh"])

bs.get("সুন্দর", raw=True)
# → {
#   'word': 'সুন্দর',
#   'sources_results': {
#       'wiktionary': ['মনোরম', 'সুশ্রী', 'চমৎকার'],
#       'shabdkosh': ['লাবণ্যময়', 'দৃষ্টিনন্দন', 'মনোরম']
#   },
#   'results': [
#       {'synonym': 'মনোরম', 'source': 'wiktionary'},
#       {'synonym': 'সুশ্রী', 'source': 'wiktionary'},
#       {'synonym': 'চমৎকার', 'source': 'wiktionary'}
#   ],
#   'words': ['মনোরম', 'সুশ্রী', 'চমৎকার'],
#   'sources_hit': ['wiktionary', 'shabdkosh'],
#   'sources_tried': ['wiktionary', 'shabdkosh'],
#   'quality': 'wikiconfirmed',
#   'source': 'wiktionary'
# }

bs.get("xyz")
# → []

get_many()

bs.get_many(words, sources=None, raw=False)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| words | list[str] | (required) | List of Bangla words to look up |
| sources | list or None | None | Sources to query |
| raw | bool | False | Return metadata dicts instead of plain lists |

bs.get_many(["চোখ", "মা", "নদী"])
# → {'চোখ': [...], 'মা': [...], 'নদী': [...]}

bs.get_many(["চোখ", "মা"], sources=["wiktionary"])

bs.get_many(["চোখ", "মা"], raw=True)
# → {'চোখ': {raw dict}, 'মা': {raw dict}}
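The {word: [synonyms]} mapping returned by get_many() is the shape a simple augmentation step needs. The sketch below assumes such a mapping is already in hand and uses a hard-coded stand-in, so it runs without the package. The augment helper is hypothetical, and naive substitution like this ignores word sense and inflection.

```python
from itertools import product
from typing import Dict, List


def augment(tokens: List[str], synonyms: Dict[str, List[str]], limit: int = 5) -> List[str]:
    """Return up to `limit` variants of a tokenized sentence via synonym substitution."""
    # For each token, the choices are the token itself plus its known synonyms.
    choices = [[token] + synonyms.get(token, []) for token in tokens]
    variants: List[str] = []
    for combo in product(*choices):
        if len(variants) >= limit:
            break
        if combo != tuple(tokens):  # skip the unchanged original sentence
            variants.append(" ".join(combo))
    return variants


# Stand-in for a get_many() result (values taken from the examples in this README).
synonym_map = {"মা": ["জননী", "মাতা"], "চোখ": ["চক্ষু", "নেত্র"]}
variants = augment(["মা", "চোখ", "বন্ধ", "করলেন"], synonym_map, limit=3)
```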

stats()

bs.stats()
# Words         : 9842
# Total synonyms: 47391
# Avg / word    : 4.82
# Source        : /home/user/bangla_synonyms_data/dataset.json
# Top 5 words   :
#   চোখ: চক্ষু, নেত্র, লোচন, আঁখি ...
#   মা: জননী, আম্মা, জন্মদাত্রী ...

Returns a dict with keys: total_words, total_synonyms, avg_per_word, source.


Raw Mode

Pass raw=True to any lookup function to receive full source metadata alongside results. This is supported at every level: get(), get_many(), and Scrapper.

Response structure

{
    "word": str,                  # looked up word
    "source": str | None,         # primary source

    "sources_results": {
        "source_name": list[str]  # synonyms returned by that source
    },

    "results": [
        {
            "synonym": str,
            "source": str,
            "confirmed": bool     # optional; True on cross-validated secondary-source entries
        }
    ],

    "words": list[str],           # flat synonym list
    "sources_hit": list[str],     # sources that returned data
    "sources_tried": list[str],   # queried sources
    "quality": str                # filtering strategy
}

quality values

| Value | Meaning |
| --- | --- |
| "wikiconfirmed" | Wiktionary was present; other sources filtered by cross-validation |
| "cross_source" | No Wiktionary; synonyms confirmed by two or more sources |
| "single_source" | Only one source was available; noise-cleaned results returned as-is |
| "local" | Returned from local dataset cache; no scraping was performed |
| "empty" | No results survived filtering, or all sources returned errors |

confirmed flag

The confirmed: True flag is set on entries from secondary sources that passed cross-validation. Wiktionary entries are always authoritative and do not carry this flag.

result = bs.get("চোখ", raw=True)

# Filter to Wiktionary entries only
wiki = [r["synonym"] for r in result["results"] if r["source"] == "wiktionary"]

# Filter to cross-validated entries (Wiktionary + confirmed secondaries)
high_confidence = [
    r["synonym"] for r in result["results"]
    if r["source"] == "wiktionary" or r.get("confirmed")
]

# Check which filtering strategy was applied
print(result["quality"])         # "wikiconfirmed"
print(result["sources_hit"])     # ["wiktionary", "shabdkosh"]
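Downstream code may want to accept or reject a result based on its quality value. The helper below is one possible policy, not something the library prescribes. The trusted_synonyms name and the choice of which values count as trusted are assumptions, and the sample dict is a trimmed stand-in for a raw-mode result.

```python
from typing import Any, Dict, List

# Assumption: treating these quality values as trustworthy is a local policy choice.
TRUSTED_QUALITIES = {"wikiconfirmed", "cross_source", "local"}


def trusted_synonyms(raw: Dict[str, Any], allow_single_source: bool = False) -> List[str]:
    """Return the flat synonym list only if the result's quality is acceptable."""
    quality = raw.get("quality", "empty")
    if quality in TRUSTED_QUALITIES:
        return list(raw.get("words", []))
    if quality == "single_source" and allow_single_source:
        return list(raw.get("words", []))
    return []


sample_raw = {"word": "সুন্দর", "words": ["মনোরম", "সুশ্রী"], "quality": "single_source"}
strict = trusted_synonyms(sample_raw)
lenient = trusted_synonyms(sample_raw, allow_single_source=True)
```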

Scrapper

Fine-grained control over every aspect of the scraping process. Intended for researchers and power users.

from bangla_synonyms import Scrapper

Constructor

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| offline | bool | False | Use local dataset only; make no network calls |
| auto_save | bool | False | Persist scraped results to disk |
| delay | float | 1.0 | Seconds between HTTP requests |
| timeout | int | 10 | HTTP request timeout in seconds |
| sources | list or None | None | Sources to query (None = all three) |
| merge | bool | True | Merge all sources (True) or stop at first result (False) |

sc = Scrapper()                                       # online, no persistence
sc = Scrapper(offline=True)                           # local dataset only
sc = Scrapper(auto_save=True)                         # persist results
sc = Scrapper(sources=["wiktionary"])                 # single source
sc = Scrapper(sources=["wiktionary", "shabdkosh"])    # two sources
sc = Scrapper(merge=False)                            # stop at first hit
sc = Scrapper(delay=2.0, timeout=20)                  # slow connection

.get(word, raw=False)

Checks the local dataset first. Falls back to live scraping if the word is not found.

sc.get("চোখ")
# → ['চক্ষু', 'নেত্র', 'লোচন', ...]

sc.get("চোখ", raw=True)
# → {"word": "চোখ", "source": "wiktionary", "quality": "wikiconfirmed", ...}

# Local cache hit — no network call is made
sc.get("মা", raw=True)
# → {"word": "মা", "source": "local", "quality": "local", ...}

Scrapper(offline=True).get("নদী")
# → local dataset lookup only
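The offline-first behaviour described above, together with the offline and auto_save options, can be pictured as the control flow below. It is a simplified sketch under stated assumptions, not the Scrapper source: the dataset is a plain dict, scrape is a caller-supplied stub, and offline_first_get is a hypothetical name.

```python
from typing import Callable, Dict, List


def offline_first_get(
    word: str,
    dataset: Dict[str, List[str]],
    scrape: Callable[[str], List[str]],
    offline: bool = False,
    auto_save: bool = False,
) -> List[str]:
    """Check the local dataset first; scrape only on a miss and only when online."""
    if word in dataset:
        return dataset[word]  # cache hit: no network call
    if offline:
        return []  # offline mode never touches the network
    found = scrape(word)
    if found and auto_save:
        dataset[word] = found  # opt-in persistence (in memory in this sketch)
    return found


calls: List[str] = []


def fake_scrape(word: str) -> List[str]:
    calls.append(word)  # record every simulated network call
    return ["তটিনী"]


local = {"মা": ["জননী"]}
hit = offline_first_get("মা", local, fake_scrape)
offline_miss = offline_first_get("নদী", local, fake_scrape, offline=True)
scraped = offline_first_get("নদী", local, fake_scrape, auto_save=True)
```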

.get_many(words, raw=False)

sc.get_many(["চোখ", "মা", "নদী"])
# → {'চোখ': [...], 'মা': [...], 'নদী': [...]}

sc.get_many(["চোখ", "মা"], raw=True)
# → {'চোখ': {raw dict}, 'মা': {raw dict}}

The request delay applies only to live HTTP calls. Local cache hits incur no delay.
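One way to get this behaviour is to throttle only at the point where an HTTP request is about to be sent. The Throttle class below is a hypothetical sketch of that idea, not the library's implementation. The clock and sleep functions are injectable so the demo runs instantly.

```python
import time
from typing import Callable, List, Optional


class Throttle:
    """Enforce a minimum gap between consecutive network calls."""

    def __init__(
        self,
        delay: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Call right before an HTTP request; cache hits never call this."""
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


# Demo with a fake clock so nothing actually sleeps.
now = [0.0]
slept: List[float] = []


def fake_sleep(seconds: float) -> None:
    slept.append(seconds)
    now[0] += seconds


throttle = Throttle(1.0, clock=lambda: now[0], sleep=fake_sleep)
throttle.wait()  # first request: no wait
throttle.wait()  # immediate second request: waits the full delay
now[0] += 5.0    # time passes (for example, while serving cache hits)
throttle.wait()  # enough time has elapsed: no wait
```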

.active_sources

Scrapper().active_sources
# → ["wiktionary", "shabdkosh", "english_bangla"]

Scrapper(sources=["wiktionary"]).active_sources
# → ["wiktionary"]

.download() — class method

Scrapper.download()
Scrapper.download("mini")
Scrapper.download(force=True)

Source selection patterns

# Structured data only — best for NLP pipelines
sc = Scrapper(sources=["wiktionary"])

# Maximum coverage — all sources merged
sc = Scrapper()

# Speed-first — stop at the first source that returns results
sc = Scrapper(merge=False)

# Exclude the lowest-reliability source
sc = Scrapper(sources=["wiktionary", "shabdkosh"])

# Long-running batch — persist results, polite rate limit
sc = Scrapper(auto_save=True, delay=2.0)

Dataset helpers

sc.add("পরিবেশ", ["প্রকৃতি", "জগত", "বিশ্ব"])  # add to local dataset
sc.stats()                                        # dataset statistics
sc.export("synonyms.json")                        # export as JSON
sc.export("synonyms.csv", fmt="csv")              # export as CSV

Core API

Lower-level building blocks for advanced users.

DatasetManager

Direct read/write access to the local synonym dataset.

from bangla_synonyms.core import DatasetManager

dm = DatasetManager()

All instances share the same in-memory store. A change made through one instance is immediately visible through any other.

Dataset location: ./bangla_synonyms_data/dataset.json
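Shared state of this kind is commonly achieved with a class-level attribute. The SharedStore class below is a minimal sketch of that pattern, offered as an assumption about how such sharing can work rather than a description of DatasetManager internals.

```python
from typing import Dict, List


class SharedStore:
    # Class-level dict: a single store shared by every instance.
    _data: Dict[str, List[str]] = {}

    def add(self, word: str, synonyms: List[str]) -> None:
        """Merge synonyms into the shared store, skipping duplicates."""
        current = self._data.setdefault(word, [])
        for synonym in synonyms:
            if synonym not in current:
                current.append(synonym)

    def get(self, word: str) -> List[str]:
        """Return a copy so callers cannot mutate the shared list by accident."""
        return list(self._data.get(word, []))


first = SharedStore()
second = SharedStore()
first.add("নদী", ["তটিনী", "তটিনী", "সরিৎ"])
seen_by_second = second.get("নদী")
```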

Reading data

dm.get("চোখ")          # → ['চক্ষু', 'নেত্র', ...]   empty list if not found
dm.has("চোখ")          # → True / False
dm.all_words()         # → sorted list of all words in the dataset
"চোখ" in dm           # → True
len(dm)               # → 9842

Writing data

# Merge new synonyms with any that already exist
dm.add("শব্দ", ["প্রতিশব্দ১", "প্রতিশব্দ২"])

# Replace the synonym list entirely
dm.update("শব্দ", ["নতুন১", "নতুন২"])

# Remove a word
dm.remove("শব্দ")   # returns True if the word existed, False otherwise

Each write automatically flushes to disk. To batch multiple writes into a single flush, pass save=False and export manually:

dm.add("ক", ["খ", "গ"],  save=False)
dm.add("ঘ", ["ঙ"],       save=False)
dm.add("চ", ["ছ", "জ"],  save=False)
dm.export("synonyms.json")

Merging from a file

added = dm.merge("extra_synonyms.json")
print(f"{added} new words added")

The JSON file should have the same format as the main dataset:

{
  "নদী": ["তটিনী", "প্রবাহিনী", "সরিৎ"],
  "আকাশ": ["গগন", "অম্বর", "নভ"]
}
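The merge semantics can be illustrated on plain dicts. The sketch assumes that existing words have their synonym lists extended without duplicates and that the return value counts only newly added words. Neither detail is confirmed here beyond the example above, and merge_into is a hypothetical helper.

```python
import json
from typing import Dict, List


def merge_into(base: Dict[str, List[str]], extra: Dict[str, List[str]]) -> int:
    """Merge `extra` into `base` in place; return the number of newly added words."""
    added = 0
    for word, synonyms in extra.items():
        if word not in base:
            base[word] = []
            added += 1
        for synonym in synonyms:
            if synonym not in base[word]:  # avoid duplicate synonyms
                base[word].append(synonym)
    return added


base_data = {"নদী": ["তটিনী"]}
extra_data = json.loads('{"নদী": ["তটিনী", "সরিৎ"], "আকাশ": ["গগন", "অম্বর"]}')
added_count = merge_into(base_data, extra_data)
```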

Exporting

dm.export("synonyms.json")             # JSON (default)
dm.export("synonyms.csv", fmt="csv")   # CSV

dm.reload()   # reload from disk after an external change

CSV format:

word,synonyms,count
চোখ,চক্ষু | নেত্র | লোচন | আঁখি,4
মা,জননী | আম্মা | জন্মদাত্রী,3
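An exported CSV in this layout can be read back with the standard library alone. The sketch below assumes the three-column header and the " | " separator shown above. The parse_synonym_csv name is made up for the example.

```python
import csv
import io
from typing import Dict, List


def parse_synonym_csv(text: str) -> Dict[str, List[str]]:
    """Parse the word,synonyms,count layout into a {word: [synonyms]} mapping."""
    reader = csv.DictReader(io.StringIO(text))
    parsed: Dict[str, List[str]] = {}
    for row in reader:
        # Synonyms are pipe-separated inside a single CSV field.
        parsed[row["word"]] = [s.strip() for s in row["synonyms"].split("|") if s.strip()]
    return parsed


sample_csv = (
    "word,synonyms,count\n"
    "চোখ,চক্ষু | নেত্র | লোচন | আঁখি,4\n"
    "মা,জননী | আম্মা | জন্মদাত্রী,3\n"
)
parsed_csv = parse_synonym_csv(sample_csv)
```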

Stats

info = dm.stats()
# Words         : 9842
# Total synonyms: 47391
# Avg / word    : 4.82
# Source        : /home/user/bangla_synonyms_data/dataset.json
# Top 5 words   :
#   চোখ: চক্ষু, নেত্র, লোচন, আঁখি ...

info["total_words"]     # 9842
info["total_synonyms"]  # 47391
info["avg_per_word"]    # 4.82
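The three numeric keys follow directly from the dataset mapping. As a rough sketch (compute_stats is a hypothetical helper, and rounding to two decimals is an assumption based on the sample output, where 47391 / 9842 rounds to 4.82):

```python
from typing import Dict, List, Union


def compute_stats(dataset: Dict[str, List[str]]) -> Dict[str, Union[int, float]]:
    """Derive word count, synonym count, and average synonyms per word."""
    total_words = len(dataset)
    total_synonyms = sum(len(synonyms) for synonyms in dataset.values())
    # Guard against division by zero on an empty dataset.
    avg = round(total_synonyms / total_words, 2) if total_words else 0.0
    return {
        "total_words": total_words,
        "total_synonyms": total_synonyms,
        "avg_per_word": avg,
    }


small = {
    "চোখ": ["চক্ষু", "নেত্র", "লোচন", "আঁখি"],
    "মা": ["জননী", "আম্মা", "জন্মদাত্রী"],
}
small_stats = compute_stats(small)
```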

WordlistFetcher

Fetches Bangla word lists from Wiktionary for use with BatchScraper.

from bangla_synonyms.core import WordlistFetcher, DatasetManager

wf = WordlistFetcher()
dm = DatasetManager()

# Fetch up to 500 words from Wiktionary
words = wf.fetch(limit=500)

# Filter to words not yet in the local dataset (enables safe resume)
new_words = wf.filter_new(words, dm)
print(f"{len(new_words)} words not yet scraped")

# Persist and reload word lists
wf.save(words, "wordlist.txt")
words = wf.load("wordlist.txt")

BatchScraper

Scrapes synonyms for large word lists with progress tracking, checkpointing, and resume support.

from bangla_synonyms.core import BatchScraper

Constructor

| Parameter | Default | Description |
| --- | --- | --- |
| dataset | shared | DatasetManager instance to write results into |
| delay | 1.0 | Seconds between HTTP requests |
| timeout | 10 | HTTP request timeout in seconds |
| save_every | 50 | Flush results to disk every N words |
| sources | None | Sources to query (None = all three) |
| merge | True | Merge all sources or stop at first hit |

.run(words, skip_existing=True, show_progress=True, sources=None)

scraper = BatchScraper(delay=1.0)

result = scraper.run(["চোখ", "মা", "নদী", "আকাশ"])
#   [ 1/4] চোখ: ✓ চক্ষু, নেত্র, লোচন ...
#   [ 2/4] মা:  ✓ জননী, আম্মা, জন্মদাত্রী
#   [ 3/4] নদী: ✓ তটিনী, প্রবাহিনী
#   [ 4/4] আকাশ: — not found
#
#   [bangla-synonyms] done: 3 found, 1 not found, 0 errors

# Safe to re-run; already-scraped words are skipped
result = scraper.run(words, skip_existing=True)

# Override sources for this run only
result = scraper.run(words, sources=["wiktionary"])

# Suppress progress output
result = scraper.run(words, show_progress=False)

# Return value: {word: [synonyms], ...} for newly scraped words
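The interplay of skip_existing and save_every can be sketched as a plain loop. This is an illustration under assumptions, not the BatchScraper source: fetch and save are caller-supplied stubs, the dataset is a dict, and run_batch is a hypothetical name.

```python
from typing import Callable, Dict, List


def run_batch(
    words: List[str],
    dataset: Dict[str, List[str]],
    fetch: Callable[[str], List[str]],
    save: Callable[[Dict[str, List[str]]], None],
    save_every: int = 50,
    skip_existing: bool = True,
) -> Dict[str, List[str]]:
    """Scrape words in order, checkpointing every `save_every` processed words."""
    newly_scraped: Dict[str, List[str]] = {}
    processed = 0
    for word in words:
        if skip_existing and word in dataset:
            continue  # resume support: already-scraped words cost nothing
        processed += 1
        try:
            synonyms = fetch(word)
        except Exception:
            synonyms = []  # treat a failed request as "not found" and continue
        if synonyms:
            dataset[word] = synonyms
            newly_scraped[word] = synonyms
        if processed % save_every == 0:
            save(dataset)  # periodic checkpoint
    save(dataset)  # final flush
    return newly_scraped


store = {"চোখ": ["চক্ষু"]}
fake_results = {"মা": ["জননী"], "নদী": ["তটিনী"]}
checkpoints: List[int] = []
new_items = run_batch(
    ["চোখ", "মা", "নদী", "আকাশ"],
    store,
    fetch=lambda w: fake_results.get(w, []),
    save=lambda d: checkpoints.append(len(d)),
    save_every=2,
)
```

Because results are written into the dataset as they arrive, an interrupted run loses at most the words scraped since the last checkpoint.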

.run_from_wiktionary(limit=200)

Fetches a word list from Wiktionary and scrapes all of them in one step.

scraper = BatchScraper(delay=1.5, sources=["wiktionary", "shabdkosh"])
scraper.run_from_wiktionary(limit=1000)

Full batch workflow

from bangla_synonyms.core import DatasetManager, WordlistFetcher, BatchScraper

dm = DatasetManager()
wf = WordlistFetcher()

# Fetch word list
words = wf.fetch(limit=5000)
wf.save(words, "wordlist.txt")

# Skip words already in the dataset
new_words = wf.filter_new(words, dm)
print(f"{len(new_words)} words to scrape")

# Scrape
scraper = BatchScraper(
    delay=1.0,
    save_every=100,
    sources=["wiktionary", "shabdkosh"],
)
scraper.run(new_words, skip_existing=True)

# Export
dm.stats()
dm.export("bangla_synonyms_full.json")
dm.export("bangla_synonyms_full.csv", fmt="csv")

CLI Reference

# Dataset management
bangla-synonyms download
bangla-synonyms download --version mini
bangla-synonyms download --force

# Synonym lookup
bangla-synonyms get চোখ
bangla-synonyms get চোখ মা সুন্দর
bangla-synonyms get চোখ --offline
bangla-synonyms get চোখ --sources wiktionary
bangla-synonyms get চোখ --sources wiktionary --sources shabdkosh
bangla-synonyms get চোখ --no-merge
bangla-synonyms get চোখ --raw


# Build / expand the local dataset
bangla-synonyms build
bangla-synonyms build --limit 1000
bangla-synonyms build --delay 2.0
bangla-synonyms build --sources wiktionary
bangla-synonyms build --sources wiktionary --sources shabdkosh
bangla-synonyms build --no-merge

# Information and export
bangla-synonyms stats
bangla-synonyms export synonyms.json
bangla-synonyms export synonyms.csv --format csv

# Help
bangla-synonyms --help
bangla-synonyms get --help
bangla-synonyms build --help

The CLI functions are also importable for use in scripts:

from bangla_synonyms.cli import get, build, stats

result = get(["চোখ", "মা"])
result = get(["চোখ"], offline=True)
result = get(["চোখ"], sources=["wiktionary"], merge=False)

added  = build(limit=500, delay=1.5)
info   = stats()

Dataset

A pre-built dataset is available for download via GitHub Releases.

| Version | Words | Approximate size | Command |
| --- | --- | --- | --- |
| latest | ~10,000 | ~3 MB | bs.download() |
| mini | ~500 | ~150 KB | bs.download("mini") |

The dataset is saved to ./bangla_synonyms_data/dataset.json. All running instances pick up the new data immediately after a download — no restart required.

Building a larger dataset

bangla-synonyms build --limit 5000 \
    --sources wiktionary --sources shabdkosh \
    --delay 1.5

bangla-synonyms stats
bangla-synonyms export my_dataset.json

Dataset format

{
  "চোখ": ["চক্ষু", "নেত্র", "লোচন", "আঁখি", "অক্ষি"],
  "মা": ["জননী", "আম্মা", "জন্মদাত্রী", "মাতা"],
  "নদী": ["তটিনী", "প্রবাহিনী", "সরিৎ", "স্রোতস্বিনী"]
}
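Because the dataset is a plain JSON object, it can be used without the library, for example to build a reverse index for synonym-aware search. The sketch below parses an inline excerpt in the format above. The build_reverse_index helper is hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List, Set


def build_reverse_index(dataset: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map every headword and synonym to the headwords it belongs to."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for headword, synonyms in dataset.items():
        index[headword].add(headword)
        for synonym in synonyms:
            index[synonym].add(headword)
    return index


excerpt = json.loads("""
{
  "চোখ": ["চক্ষু", "নেত্র"],
  "নদী": ["তটিনী", "প্রবাহিনী"]
}
""")
reverse_index = build_reverse_index(excerpt)
headwords_for_query = reverse_index.get("তটিনী", set())
```

A search layer could then expand a query term to its headword and that headword's synonyms before matching documents.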

Contributing

Bengali is spoken by over 230 million people but remains one of the most underserved languages in NLP. bangla-synonyms is one of the few programmatic tools for Bangla lexical resources, so your contribution directly improves what the entire community can build.

Bug reports, new sources, quality improvements, and dataset contributions are all welcome. See CONTRIBUTING.md for the full workflow — how to open issues, create branches, and submit PRs.

BNLP users — if you use BNLP for tokenization, embeddings, or NER, bangla-synonyms pairs naturally with it. Use this library to expand your training vocabulary, augment datasets, or build synonym-aware preprocessing pipelines on top of BNLP models.



Acknowledgements

Data sources used by this package:

  • Wiktionary (bn.wiktionary.org)
  • Shabdkosh (shabdkosh.com)
  • English-Bangla (english-bangla.com)
