A Morphology-Aware NLP Toolkit for the Sindhi Language
SindhiNLTK: A Morphology-Aware NLP Toolkit for Sindhi
SindhiNLTK is a high-performance Python library for Sindhi natural language processing. It addresses the "Linguistic Efficiency Gap": the tokenizers of standard multilingual models split Sindhi's orthographic clusters into fragments that carry no linguistic meaning.
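To make the idea of an orthographic cluster concrete, here is a minimal sketch. It is not SindhiNLTK's tokenizer; it assumes only that an aspirated digraph is written as a base consonant followed by U+06BE (heh doachashmee), and the helper name `orthographic_units` is hypothetical. It contrasts the raw UTF-8 byte count, which is what a byte-level fallback would see, with a cluster-preserving split.

```python
# Illustrative sketch only, not the library's implementation.
# Assumption: aspiration is written as base consonant + U+06BE.
ASPIRATION_MARK = "\u06be"  # ARABIC LETTER HEH DOACHASHMEE


def orthographic_units(word):
    """Split a word into characters, keeping consonant + U+06BE together."""
    units = []
    for ch in word:
        if ch == ASPIRATION_MARK and units:
            # Attach the aspiration mark to the preceding consonant.
            units[-1] += ch
        else:
            units.append(ch)
    return units


# gaf + heh doachashmee + reh, written with escapes to avoid copy/paste issues
word = "\u06af\u06be\u0631"
units = orthographic_units(word)
utf8_len = len(word.encode("utf-8"))

print(len(units), "units;", utf8_len, "UTF-8 bytes")
```

Each of these three code points takes two bytes in UTF-8, so a tokenizer that falls back to bytes would see six symbols where the sketch sees two units.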
What's New in v1.1.0
- Expanded Stopwords: 245 stopwords (up from ~30), organized by grammatical category (pronouns, postpositions, auxiliaries, conjunctions, etc.) and extracted from sindhi-corpus-505m.
- Datasets Module: sindhinltk.datasets provides programmatic access to bundled linguistic data, with category filtering and utility functions.
- Improved Packaging: a proper pyproject.toml with [project.optional-dependencies], bundled JSON data files, and Python 3.8–3.12 support.
Performance
Validated against a corpus of 43,784 SFT instruction samples:
| Metric | Llama-3 (Meta) | SindhiNLTK (Ours) | Improvement |
|---|---|---|---|
| Token Fertility Rate | 4.15 | 1.06 | 291% more efficient |
| Aspiration Integrity | ~30% | 100% (Atomic) | Full linguistic accuracy |
| Effective Context Window | Baseline | ~4x larger | Same text fits in ~4x fewer tokens |
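The fertility figures in the table can be related to each other with a few lines of arithmetic. The sketch below assumes the usual definition of token fertility (tokens produced per word); the function name `fertility` is illustrative and not part of the library. Only the two rates, 4.15 and 1.06, come from the table.

```python
def fertility(n_tokens, n_words):
    """Tokens per word; lower means fewer tokens for the same text."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return n_tokens / n_words


# Rates reported in the table above
baseline_rate = 4.15   # Llama-3
toolkit_rate = 1.06    # SindhiNLTK

# How many times more tokens the baseline needs for the same words
ratio = baseline_rate / toolkit_rate

# The same gap expressed two ways
extra_tokens_pct = (ratio - 1) * 100                       # baseline overhead
token_reduction_pct = (1 - toolkit_rate / baseline_rate) * 100

print(f"ratio: {ratio:.2f}x")
print(f"baseline uses {extra_tokens_pct:.1f}% more tokens")
print(f"token count reduced by {token_reduction_pct:.1f}%")
```

Read this way, the table's 291% figure corresponds to the baseline's token overhead, and the roughly fourfold ratio is what lets about four times as much text fit in a fixed context window.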
Installation
pip install sindhinltk
For sentiment analysis (requires PyTorch):
pip install "sindhinltk[sentiment]"
Quick Start
from sindhinltk import stemmer, normalizer, stopwords
# Normalize text
text = normalizer.normalize(" سنڌ جي ثقافت ")
# → "سنڌ جي ثقافت"
# Stem words
stemmer.stem("ڪتابون") # → "ڪتاب"
stemmer.stem("پڙهائيندڙ") # → "پڙه"
# Get stopwords
sw = stopwords.get_stopwords()
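For readers who want a feel for what these two steps involve, the following is a toy sketch of whitespace normalization and suffix stripping. It is not how SindhiNLTK implements `normalize` or `stem`; the names `toy_normalize`, `toy_stem`, and `TOY_SUFFIXES` are hypothetical, and the two suffixes are simply read off the Quick Start examples above rather than taken from the library's rules.

```python
import re
import unicodedata


def toy_normalize(text):
    """Apply Unicode NFC, collapse whitespace runs, and trim the ends."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


# Assumption: a two-entry suffix list inferred from the examples above.
# A real stemmer would need a far larger, linguistically motivated set.
TOY_SUFFIXES = sorted(["ائيندڙ", "ون"], key=len, reverse=True)


def toy_stem(word, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len chars."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


cleaned = toy_normalize("  سنڌ   جي ثقافت ")
stems = [toy_stem(w) for w in ["ڪتابون", "پڙهائيندڙ", "شهر"]]
print(cleaned)
print(stems)
```

Trying the longest suffix first prevents a short suffix from matching inside a longer one, and the minimum stem length guards against stripping a word down to nothing.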
New in v1.1: Expanded Stopwords & Datasets
from sindhinltk.datasets import (
get_stopwords_expanded,
get_stopwords_by_category,
is_stopword,
remove_stopwords,
)
# 245 stopwords (vs ~30 in v1.0)
all_sw = get_stopwords_expanded()
print(len(all_sw)) # 245
# Filter by grammatical category
pronouns = get_stopwords_by_category("pronouns")
negation = get_stopwords_by_category("negation")
postpositions = get_stopwords_by_category("postpositions")
# Quick check
is_stopword("آهي") # True
is_stopword("ڪتاب") # False
# Filter a token list
tokens = ["سنڌ", "جو", "گاديءَ", "وارو", "شهر"]
clean = remove_stopwords(tokens)
# → ["سنڌ", "گاديءَ", "شهر"]
Available categories: pronouns, postpositions, auxiliaries_and_copulas, conjunctions, particles_and_markers, negation, question_words, demonstratives, adverbs_of_time_place, high_frequency_function_words
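One way to picture category-organized stopword data is a mapping from category name to word list. The sketch below is a self-contained miniature under that assumption; it does not read the bundled JSON files, the names `TOY_STOPWORDS`, `toy_by_category`, `toy_all`, and `toy_remove` are hypothetical, and the category assigned to each word here is a guess made for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical miniature dataset; category assignments are assumptions.
TOY_STOPWORDS: Dict[str, List[str]] = {
    "auxiliaries_and_copulas": ["آهي"],
    "postpositions": ["جو", "جي"],
    "high_frequency_function_words": ["وارو"],
}


def toy_by_category(category: str) -> List[str]:
    """Return a copy of the word list for one category."""
    if category not in TOY_STOPWORDS:
        known = ", ".join(sorted(TOY_STOPWORDS))
        raise ValueError(f"unknown category {category!r}; expected one of: {known}")
    return list(TOY_STOPWORDS[category])


def toy_all() -> FrozenSet[str]:
    """Flatten every category into one set for fast membership checks."""
    return frozenset(w for words in TOY_STOPWORDS.values() for w in words)


def toy_remove(tokens: Iterable[str]) -> List[str]:
    """Drop stopwords from a token sequence, preserving order."""
    stop = toy_all()
    return [t for t in tokens if t not in stop]


filtered = toy_remove(["سنڌ", "جو", "گاديءَ", "وارو", "شهر"])
print(filtered)
```

Keeping the categories separate in storage and flattening them only on demand lets callers filter by a single grammatical class without a second data file.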
Data Sources
The expanded stopwords and SindhiNLTK's development are informed by these Sindhi datasets:
| Source | Type | Size |
|---|---|---|
| sindhi-corpus-505m | Web + news + lit | 505M tokens |
| AMBILE Sindhi Mega Corpus | Mixed | 118M tokens |
| Daily Kawish Articles | News | Articles |
| CC100-Sindhi | Web crawl | Large |
| Sindhi Legal Dataset | Legal | Documents |
| Sindhi Stopwords | Linguistic | Word list |
| Encyclopedia Sindhiana | Encyclopedia | Articles |
| Awami Awaz News | News | Articles |
| Sindh Express News | News | Articles |
| Sindhi Religious Data | Religious | Texts |
| Sindhi Language Corpus | Mixed | Corpus |
Related Projects
- SindhiLM — GPT-2 model for Sindhi (37.8M params)
- SindhiLM-Qwen-0.5B-v2 — Qwen2.5-0.5B fine-tuned for Sindhi
- SindhiLM-Tokenizer-v2 — Morpheme-boundary-aware BPE tokenizer
- Sindhi-Intelligence-Core-SFT-v2 — 46K instruction-tuning samples
Author
Aakash Meghwar — Computational Linguist
- HuggingFace: aakashMeghwar01
- GitHub: AakashKumarMissrani
License
MIT
Download files
File details
Details for the file sindhinltk-1.1.1.tar.gz.
File metadata
- Download URL: sindhinltk-1.1.1.tar.gz
- Upload date:
- Size: 11.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a94d2c8ca0d12b77516f19f8194c1adf0e022242ff9a582ee84208e5a62bff09 |
| MD5 | d5f7371f00d3400284944b8a46f210f4 |
| BLAKE2b-256 | 9746457d4d14dfc1dc007ae1664476fa45a8db641ea6c4cbac899733a7c2812f |
File details
Details for the file sindhinltk-1.1.1-py3-none-any.whl.
File metadata
- Download URL: sindhinltk-1.1.1-py3-none-any.whl
- Upload date:
- Size: 11.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6466f8de5059a3f28c8fb5184470fa89e0f1e0c2b78d095e82e15bc27a086a73 |
| MD5 | 53f7ea1e1f10345a620daecc7bb20af4 |
| BLAKE2b-256 | 77f924eddf97297add786a7510eb01b7d3a04cdc26bac59c1159b5ff6713fa55 |