urdu-nlp
A lightweight, pure-Python NLP library for Urdu language processing. No deep learning required: just install and go.
Urdu is spoken by 230+ million people, yet has almost no usable NLP tooling on PyPI. urdu-nlp fills that gap with tokenization, stop word removal, normalization, stemming, transliteration, and sentence boundary detection.
Installation
pip install urdu-nlp
Or install from source:
git clone https://github.com/imabd645/urdu-nlp.git
cd urdu-nlp
pip install -e .
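To confirm that the install succeeded, the standard library can report the installed version. This is a minimal sketch; `installed_version` is a hypothetical helper, not part of urdu-nlp, and it assumes the distribution name is `urdu-nlp` as used in the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("urdu-nlp"))
```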
Quick Start
1. Tokenization
from urdu_nlp import tokenize
tokenize("میں اسکول جاتا ہوں")
# → ["میں", "اسکول", "جاتا", "ہوں"]
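For readers curious how word tokenization of this kind can work, here is a minimal standard-library sketch. It is not the library's implementation; `simple_tokenize` is a hypothetical name, and the rule it applies (a token is a run of Unicode word characters plus Arabic diacritics) is an assumption.

```python
import re

# Assumption: a token is a maximal run of Unicode word characters.
# The extra ranges keep Arabic diacritics (U+064B-U+065F, U+0670) attached
# to their word, since \w alone does not match combining marks.
_WORD_RE = re.compile(r"[\w\u064B-\u065F\u0670]+")


def simple_tokenize(text):
    """Split text into word tokens, dropping punctuation and whitespace."""
    return _WORD_RE.findall(text)


print(simple_tokenize("میں اسکول جاتا ہوں"))
```

Punctuation such as the Urdu full stop (U+06D4) is not a word character, so it is dropped rather than returned as a token.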
2. Stop Word Removal
from urdu_nlp import remove_stopwords
remove_stopwords(["میں", "اسکول", "جاتا", "ہوں"])
# → ["اسکول", "جاتا"]
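Conceptually, stop word removal is a set-membership filter over the token list. The sketch below illustrates the idea only; `SAMPLE_STOPWORDS` is a tiny hypothetical list made up for this example, and the library presumably ships a much larger one.

```python
# Hypothetical, deliberately tiny stop list used only for illustration.
SAMPLE_STOPWORDS = frozenset({"میں", "ہوں", "ہے", "کا", "کی", "کے", "اور", "سے"})


def filter_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Return the tokens that are not in the stop list, preserving order."""
    return [tok for tok in tokens if tok not in stopwords]


print(filter_stopwords(["میں", "اسکول", "جاتا", "ہوں"]))
```

A frozenset keeps each membership check constant-time, which matters once the stop list grows to a few hundred entries.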
3. Normalization
from urdu_nlp import normalize
normalize("ﻛﺮﻧﺎ") # Arabic form
# → "کرنا" # Urdu form
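One plausible way to implement this kind of normalization with the standard library is shown below. It is a sketch, not the library's code: `simple_normalize` and the mapping table are assumptions. NFKC folds Arabic presentation forms (such as U+FEDB) into their base letters, and a small translation table then swaps Arabic letters for the code points Urdu text conventionally uses.

```python
import re
import unicodedata

# Assumed mapping from Arabic code points to their usual Urdu counterparts.
_ARABIC_TO_URDU = str.maketrans({
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> FARSI YEH
    "\u0647": "\u06C1",  # ARABIC LETTER HEH -> HEH GOAL
})


def simple_normalize(text):
    """Fold presentation forms, map Arabic letters to Urdu ones, tidy whitespace."""
    text = unicodedata.normalize("NFKC", text)   # presentation forms -> base letters
    text = text.translate(_ARABIC_TO_URDU)       # Arabic variants -> Urdu variants
    return re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace


# Input is written with escapes: kaf, reh, noon, alef as presentation forms.
print(simple_normalize("\uFEDB\uFEAE\uFEE7\uFE8E"))
```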
4. Stemming
from urdu_nlp import stem
stem("کتابوں")
# → "کتاب"
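Suffix stripping of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions: `strip_suffix`, the three-entry `SAMPLE_SUFFIXES` list, and the minimum stem length are all hypothetical and far smaller than what a real Urdu stemmer needs.

```python
# Hypothetical plural/oblique suffixes, for illustration only.
SAMPLE_SUFFIXES = ("یوں", "وں", "یں")


def strip_suffix(word, suffixes=SAMPLE_SUFFIXES, min_stem_len=2):
    """Remove the longest matching suffix, unless the stem would get too short."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


print(strip_suffix("کتابوں"))
```

The minimum stem length guards short function words: "ہوں" ends with a listed suffix but is returned unchanged because only one character would remain.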
5. Roman Urdu → Urdu Script
from urdu_nlp import roman_to_urdu
roman_to_urdu("mein school jata hoon")
# → "میں اسکول جاتا ہوں"
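The simplest form of Roman Urdu transliteration is a word-level dictionary lookup. The sketch below shows that approach only; `transliterate_words` and the four-entry `SAMPLE_LEXICON` are hypothetical, and the README does not say whether the library uses a lexicon, character rules, or both.

```python
# Hypothetical word-level lexicon; real coverage would need to be far larger.
SAMPLE_LEXICON = {
    "mein": "میں",
    "school": "اسکول",
    "jata": "جاتا",
    "hoon": "ہوں",
}


def transliterate_words(text, lexicon=SAMPLE_LEXICON):
    """Replace each known Roman Urdu word; unknown words pass through unchanged."""
    return " ".join(lexicon.get(word.lower(), word) for word in text.split())


print(transliterate_words("mein school jata hoon"))
```

Passing unknown words through unchanged makes gaps in the lexicon visible in the output instead of silently dropping text.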
6. Sentence Boundary Detection
from urdu_nlp import sent_tokenize
sent_tokenize("یہ پہلا جملہ ہے۔ یہ دوسرا ہے۔")
# → ["یہ پہلا جملہ ہے۔", "یہ دوسرا ہے۔"]
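Sentence boundary detection for Urdu can be approximated by splitting on whitespace that follows a sentence terminator. This is a minimal sketch, not the library's algorithm; `simple_sent_tokenize` and the chosen terminator set (Urdu full stop U+06D4, Arabic question mark U+061F, `!`, `?`) are assumptions.

```python
import re

# Split on whitespace that directly follows a sentence terminator.
# The lookbehind keeps the terminator attached to its sentence.
_SENT_END_RE = re.compile(r"(?<=[\u06D4\u061F!?])\s+")


def simple_sent_tokenize(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [s for s in _SENT_END_RE.split(text.strip()) if s]


print(simple_sent_tokenize("یہ پہلا جملہ ہے\u06D4 یہ دوسرا ہے\u06D4"))
```

A rule this simple will mis-split abbreviations and numbers that contain terminators, so it is a baseline rather than a complete solution.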
API Reference
| Function | Description |
|---|---|
| tokenize(text) | Split Urdu text into word tokens |
| sent_tokenize(text) | Split text into sentences |
| remove_stopwords(tokens) | Remove common Urdu stop words from a token list |
| normalize(text) | Normalize Arabic/Urdu character variants and whitespace |
| stem(word) | Strip common Urdu suffixes to get root form |
| roman_to_urdu(text) | Transliterate Roman Urdu to Urdu script |
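The functions listed above take and return plain strings and lists, so they should compose into a preprocessing pipeline. The sketch below is one possible arrangement; `preprocess` is a hypothetical helper, and the order of steps (normalize, split sentences, tokenize, drop stop words, stem) is an assumption rather than something the library prescribes. The steps are passed in as callables so the example runs without the package installed.

```python
def preprocess(text, normalize, sent_tokenize, tokenize, remove_stopwords, stem):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    sentences = sent_tokenize(normalize(text))
    return [
        [stem(tok) for tok in remove_stopwords(tokenize(sentence))]
        for sentence in sentences
    ]


# Stand-in callables with the same shapes as the documented functions.
result = preprocess(
    "  the cats. the dogs  ",
    normalize=str.strip,
    sent_tokenize=lambda t: t.split(". "),
    tokenize=str.split,
    remove_stopwords=lambda toks: [t for t in toks if t != "the"],
    stem=lambda w: w.rstrip("s"),
)
print(result)  # → [['cat'], ['dog']]
```

With urdu-nlp installed, the same call would presumably take the six functions imported from `urdu_nlp` in place of the stand-ins.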
Dependencies
regex: Unicode-aware pattern matching (the only dependency)
License
MIT
Download files
File details
Details for the file urdu_nlp-0.1.0.tar.gz.
File metadata
- Download URL: urdu_nlp-0.1.0.tar.gz
- Upload date:
- Size: 13.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6dcd2bf0aa112f8db3af526fc9effb5f9d40cd078055e3959fe7a94d263368f3 |
| MD5 | 633e9a3d579721d10634800dfb7b292b |
| BLAKE2b-256 | b6e3ab03727d99a933dd49cb8172e2cd7f38ba7a1f153223ae37ce3dbbcf1c78 |
File details
Details for the file urdu_nlp-0.1.0-py3-none-any.whl.
File metadata
- Download URL: urdu_nlp-0.1.0-py3-none-any.whl
- Upload date:
- Size: 12.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 26c237496a75e5149ec612736a7357c1862a8580b1126271c953b8bcbb3d4bd5 |
| MD5 | 4ed19aa5e9d334c4f264a5e4766109e1 |
| BLAKE2b-256 | 18c0632de6e090e5e4fcb6c31c3f648a41712bbf1a35ef0aa6c0dbedfcbbc1ac |
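A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch; `sha256_of` and `matches_digest` are hypothetical helper names, and the local file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_digest("urdu_nlp-0.1.0.tar.gz", "6dcd2bf0aa112f8db3af526fc9effb5f9d40cd078055e3959fe7a94d263368f3")` should return True for an intact copy of the source distribution.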