urdu-nlp

A lightweight, pure-Python NLP library for Urdu language processing. No deep learning required: just install and go.

Urdu is spoken by more than 230 million people, yet PyPI offers almost no usable NLP tooling for it. urdu-nlp fills that gap with tokenization, stop word removal, normalization, stemming, Roman Urdu transliteration, and sentence boundary detection.

Installation

pip install urdu-nlp

Or install from source:

git clone https://github.com/imabd645/urdu-nlp.git
cd urdu-nlp
pip install -e .

Quick Start

1. Tokenization

from urdu_nlp import tokenize

tokenize("میں اسکول جاتا ہوں")
# → ["میں", "اسکول", "جاتا", "ہوں"]
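
To give a feel for what rule-based word tokenization involves, here is a minimal sketch built on the standard-library re module. It is not urdu-nlp's implementation; the name tokenize_sketch and the punctuation set are assumptions made for illustration.

```python
import re

# Assumed punctuation set for this sketch: Urdu full stop, comma,
# question mark, and semicolon, plus a few common ASCII marks.
PUNCT = "۔،؟؛.,!?;:"

# A token is either a single punctuation mark or a run of characters
# that are neither whitespace nor punctuation.
_TOKEN_RE = re.compile(rf"[{PUNCT}]|[^\s{PUNCT}]+")


def tokenize_sketch(text: str) -> list[str]:
    """Split text into word and punctuation tokens (illustrative only)."""
    return _TOKEN_RE.findall(text)


print(tokenize_sketch("میں اسکول جاتا ہوں"))
```

In this sketch punctuation marks come back as separate tokens; whether the library keeps or drops them is not stated above.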

2. Stop Word Removal

from urdu_nlp import remove_stopwords

remove_stopwords(["میں", "اسکول", "جاتا", "ہوں"])
# → ["اسکول", "جاتا"]
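
Conceptually, stop word removal is a membership test against a fixed word list. The sketch below shows the idea with a tiny, hypothetical list; the library's real list and the name remove_stopwords_sketch are not taken from urdu-nlp.

```python
# A tiny, hypothetical stop word list; a practical list would be much longer.
STOPWORDS_SKETCH = frozenset({"میں", "ہوں", "ہے", "کا", "کی", "کے", "اور"})


def remove_stopwords_sketch(tokens, stopwords=STOPWORDS_SKETCH):
    """Return the tokens that are not stop words, preserving their order."""
    return [tok for tok in tokens if tok not in stopwords]


print(remove_stopwords_sketch(["میں", "اسکول", "جاتا", "ہوں"]))
```

Using a frozenset keeps each lookup constant-time, which matters once the list grows to a few hundred entries.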

3. Normalization

from urdu_nlp import normalize

normalize("ﻛﺮﻧﺎ")   # Arabic form
# → "کرنا"          # Urdu form
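
One way this kind of normalization can be approximated with only the standard library is shown below. This is a sketch under stated assumptions, not the library's code: it folds Arabic presentation forms with Unicode NFKC, then maps two Arabic letters to the code points usually preferred in Urdu text. The mapping table and the name normalize_sketch are illustrative.

```python
import unicodedata

# Assumed mapping for this sketch: Arabic code points replaced by the
# code points usually used for the same letters in Urdu text.
ARABIC_TO_URDU = {
    0x0643: 0x06A9,  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    0x064A: 0x06CC,  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}


def normalize_sketch(text: str) -> str:
    """Fold presentation forms, map Arabic variants, collapse whitespace."""
    # NFKC replaces Arabic presentation forms (for example U+FEDB,
    # KAF INITIAL FORM) with their base letters (U+0643).
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(ARABIC_TO_URDU)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


# U+FEDB U+FEAE U+FEE7 U+FE8E are the presentation forms in the example above.
result = normalize_sketch("\uFEDB\uFEAE\uFEE7\uFE8E")
print([f"U+{ord(ch):04X}" for ch in result])
```

NFKC also rewrites other compatibility characters (ligatures, full-width forms), so a production normalizer may prefer a narrower, hand-written table.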

4. Stemming

from urdu_nlp import stem

stem("کتابوں")
# → "کتاب"
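
Suffix stripping of this kind can be sketched in a few lines. The suffix list, the minimum stem length, and the name stem_sketch below are assumptions for illustration; the rules urdu-nlp actually applies are not documented here.

```python
# Hypothetical suffix list for this sketch (two common plural endings).
SUFFIXES_SKETCH = ("وں", "یں")

# Never strip a suffix if fewer than this many characters would remain.
MIN_STEM_LEN = 2


def stem_sketch(word: str) -> str:
    """Strip the first matching suffix, or return the word unchanged."""
    for suffix in SUFFIXES_SKETCH:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


print(stem_sketch("کتابوں"))
```

The length guard is a common safeguard in rule-based stemmers: without it, short words that merely end in a suffix-like sequence would be reduced to a single letter.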

5. Roman Urdu → Urdu Script

from urdu_nlp import roman_to_urdu

roman_to_urdu("mein school jata hoon")
# → "میں اسکول جاتا ہوں"
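
The simplest form of Roman-to-Urdu transliteration is a word-level lookup table, sketched below. The four-entry table and the name roman_to_urdu_sketch are hypothetical; the library may instead use character-level rules or a much larger lexicon.

```python
# A toy lookup table covering only the words in the example above.
ROMAN_TO_URDU_SKETCH = {
    "mein": "میں",
    "school": "اسکول",
    "jata": "جاتا",
    "hoon": "ہوں",
}


def roman_to_urdu_sketch(text: str) -> str:
    """Replace each known Roman Urdu word; leave unknown words as they are."""
    words = text.lower().split()
    return " ".join(ROMAN_TO_URDU_SKETCH.get(word, word) for word in words)


print(roman_to_urdu_sketch("mein school jata hoon"))
```

A pure lookup cannot handle spelling variants such as "main" versus "mein", which is why practical transliterators usually add normalization or character-level fallback rules.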

6. Sentence Boundary Detection

from urdu_nlp import sent_tokenize

sent_tokenize("یہ پہلا جملہ ہے۔ یہ دوسرا ہے۔")
# → ["یہ پہلا جملہ ہے۔", "یہ دوسرا ہے۔"]
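
Sentence boundary detection for Urdu can be approximated by splitting after sentence-final punctuation. The sketch below uses the standard-library re module; the set of terminators and the name sent_tokenize_sketch are assumptions, and the library may handle abbreviations or quotations that this sketch ignores.

```python
import re

# Split on whitespace that follows an assumed sentence terminator:
# the Urdu full stop, the Arabic question mark, or an exclamation mark.
_SENT_END_RE = re.compile(r"(?<=[۔؟!])\s+")


def sent_tokenize_sketch(text: str) -> list[str]:
    """Split text into sentences, keeping the terminator with each one."""
    return [s for s in _SENT_END_RE.split(text.strip()) if s]


print(sent_tokenize_sketch("یہ پہلا جملہ ہے۔ یہ دوسرا ہے۔"))
```

The lookbehind keeps each terminator attached to its sentence, matching the output format shown above.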

API Reference

| Function | Description |
| --- | --- |
| tokenize(text) | Split Urdu text into word tokens |
| sent_tokenize(text) | Split text into sentences |
| remove_stopwords(tokens) | Remove common Urdu stop words from a token list |
| normalize(text) | Normalize Arabic/Urdu character variants and whitespace |
| stem(word) | Strip common Urdu suffixes to get the root form |
| roman_to_urdu(text) | Transliterate Roman Urdu to Urdu script |

Dependencies

  • regex — Unicode-aware pattern matching (the only dependency)

License

MIT

