
A Python package for natural language processing tasks for the Urdu language, including normalization, part-of-speech (POS) tagging, named entity recognition (NER), stemming, lemmatization, tokenization, and stopword removal.


LughaatNLP

LughaatNLP is the first comprehensive Urdu language preprocessing library developed in Pakistan for NLP tasks. It provides essential tools for tokenization, lemmatization, stop word removal, part-of-speech (POS) tagging, named entity recognition (NER), and normalization, all tailored specifically to the Urdu language.

Documentation

The documentation is available for download: Download Document

Google Colab notebook: Google Colab Notebook Link

PyPI page: PyPI Link

YouTube tutorial for the library: YouTube Link




Features

  • Tokenization: Breaks down Urdu text into individual tokens, considering the intricacies of Urdu script and language structure.
  • Lemmatization: Converts inflected words into their base or dictionary form, aiding in text analysis and comprehension.
  • Stop Word Removal: Eliminates common Urdu stop words to focus on meaningful content during text processing.
  • Normalization: Standardizes text by removing diacritics, normalizing character variations, and handling common orthographic variations in Urdu.
  • Stemming: Reduces words to their root form, improving text analysis and comprehension in Urdu.
  • Spell Checker: Identifies and corrects misspelled words in Urdu text, enhancing text quality and readability.
  • Part-of-Speech Tagging: Tags words with their grammatical categories, enabling advanced syntactic analysis.
  • Named Entity Recognition (NER): Identifies and extracts names of entities such as persons, organizations, or locations.

Functions

Normalization

1. normalize_characters(text)

This function normalizes the Urdu characters in the given text by mapping non-standard character variants to their correct forms. The same Urdu letter can be written with more than one Unicode character, and this function replaces such variants with a single consistent form.

Example:

text = "آپ کیسے ہیں؟"
normalized_text = urdu_text_processing.normalize_characters(text)
print(normalized_text)  # Output: اپ کیسے ہیں؟
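
To give a feel for what this kind of character mapping involves, here is a minimal sketch built on Python's `str.translate`. The three-entry table and the name `normalize_characters_sketch` are assumptions made for illustration; they are not LughaatNLP's actual mapping or implementation.

```python
# Illustrative only: a tiny look-alike table, not the library's real mapping.
ARABIC_TO_URDU = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH  -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF  -> ARABIC LETTER KEHEH
    "\u0647": "\u06C1",  # ARABIC LETTER HEH  -> ARABIC LETTER HEH GOAL
})


def normalize_characters_sketch(text: str) -> str:
    """Replace Arabic look-alike letters with the Urdu code points."""
    return text.translate(ARABIC_TO_URDU)


# The word "kitab" typed with the Arabic KAF instead of the Urdu KEHEH.
arabic_style = "\u0643\u062A\u0627\u0628"
urdu_style = "\u06A9\u062A\u0627\u0628"

print(arabic_style == urdu_style)                               # False
print(normalize_characters_sketch(arabic_style) == urdu_style)  # True
```

Two strings that look identical on screen can compare as different until they are mapped to the same code points, which is why this step usually runs before tokenization or dictionary lookups.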

2. normalize_combine_characters(text)

This function simplifies Urdu characters by combining certain combinations into their correct single forms. In Urdu writing, some characters are made up of multiple parts like ligatures or diacritics. This function finds these combinations in the text and changes them to their single character forms. It ensures consistency and accuracy in how Urdu text is represented.

Example:

text = "اُردو"
normalized_text = urdu_text_processing.normalize_combine_characters(text)
print(normalized_text)  # Output: اُردو
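
The input and output of such a step often look the same on screen, because only the underlying code points change. The sketch below shows the general idea using Unicode NFC composition from Python's standard `unicodedata` module. This is a stand-in technique chosen for illustration; whether LughaatNLP uses NFC or its own table of combinations is not stated here, and `combine_characters_sketch` is a hypothetical name.

```python
import unicodedata


def combine_characters_sketch(text: str) -> str:
    """Compose base letter + combining mark pairs into single code points (NFC)."""
    return unicodedata.normalize("NFC", text)


# ALEF followed by the combining MADDAH ABOVE: two code points, one visible letter.
decomposed = "\u0627\u0653"
composed = combine_characters_sketch(decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u0622")            # True: ALEF WITH MADDA ABOVE
```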

3. normalize(text)

This function performs all-in-one normalization on the Urdu text, including character normalization, diacritic removal, punctuation handling, digit conversion, and special character preservation.

Example:

text = "آپ کیسے ہیں؟ میں ۲۳ سال کا ہوں۔"
normalized_text = urdu_text_processing.normalize(text)
print("Normalize all at once together of Urdu: ", normalized_text)  # Output: اپ کیسے ہیں ؟ میں 23 سال کا ہوں ۔

4. remove_diacritics(text)

This function removes diacritics (zabar, zer, pesh) from the Urdu text.

Example:

text = "کِتَاب"
diacritics_removed = urdu_text_processing.remove_diacritics(text)
print("Remove all Diacritic (Zabar - Zer - Pesh): ", diacritics_removed)  # Output: کتاب
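
As an illustration of the technique, diacritics can be stripped by dropping every character in the Unicode category `Mn` (nonspacing mark). This is a sketch under that assumption, not the library's code; it removes all combining marks, which may be broader than the set LughaatNLP targets, and `remove_diacritics_sketch` is a hypothetical name.

```python
import unicodedata


def remove_diacritics_sketch(text: str) -> str:
    """Drop every nonspacing combining mark (Unicode category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# KEHEH + zer (U+0650), TEH + zabar (U+064E), ALEF, BEH
word = "\u06A9\u0650\u062A\u064E\u0627\u0628"

print(remove_diacritics_sketch(word))       # کتاب
print(len(word), len(remove_diacritics_sketch(word)))  # 6 4
```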

5. punctuations_space(text)

This function adds a space after punctuation marks (except within numbers) and removes any space before them in the Urdu text.

Example:

text = "کیا آپ کھانا کھانا چاہتے ہیں ؟ میں کھانا کھاؤں گا  ۔"
punctuated_text = urdu_text_processing.punctuations_space(text)
print(punctuated_text)  # Output: کیا آپ کھانا کھانا چاہتے ہیں؟ میں کھانا کھاؤں گا۔

6. replace_digits(text)

This function replaces English digits with Urdu digits.

Example:

text = "میں 23 سال کا ہوں۔"
urdu_digits = urdu_text_processing.replace_digits(text)
print("Replace all Western digits with Urdu digits, e.g. (2 1 3 1 -> ۲ ۱ ۳ ۱): ", urdu_digits)  # Output: میں ۲۳ سال کا ہوں۔
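
Digit conversion is a one-to-one character substitution, so it can be sketched with `str.maketrans`. The example assumes Urdu digits are the Extended Arabic-Indic digits (U+06F0 to U+06F9); the function names are hypothetical and the code is not taken from the library. Both directions are shown because the examples in this section convert Urdu digits to Western digits in `normalize` and Western digits to Urdu digits in `replace_digits`.

```python
# Assumption: Urdu digits are U+06F0..U+06F9 (Extended Arabic-Indic digits).
URDU_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
WESTERN_DIGITS = "0123456789"

TO_URDU = str.maketrans(WESTERN_DIGITS, URDU_DIGITS)
TO_WESTERN = str.maketrans(URDU_DIGITS, WESTERN_DIGITS)


def to_urdu_digits(text: str) -> str:
    """Replace 0-9 with the corresponding Urdu digits."""
    return text.translate(TO_URDU)


def to_western_digits(text: str) -> str:
    """Replace Urdu digits with 0-9."""
    return text.translate(TO_WESTERN)


converted = to_urdu_digits("2131")
print(converted)
print(to_western_digits(converted))  # 2131
```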

7. remove_numbers_urdu(text)

This function removes Urdu numbers from the Urdu text.

Example:

text = "میں  22 ۲۳ سال کا ہوں۔"
no_urdu_numbers = urdu_text_processing.remove_numbers_urdu(text)
print("Remove Urdu numbers from text: ", no_urdu_numbers)  # Output: میں 22 سال کا ہوں۔

8. remove_numbers_english(text)

This function removes English numbers from the Urdu text.

Example:

text = "میں ۲۳ 23 سال کا ہوں۔"
no_english_numbers = urdu_text_processing.remove_numbers_english(text)
print("Remove English numbers from text: ", no_english_numbers)  # Output: میں ۲۳ سال کا ہوں۔

9. remove_whitespace(text)

This function removes extra whitespaces from the Urdu text.

Example:

text = "میں   گھر   جا   رہا   ہوں۔"
cleaned_text = urdu_text_processing.remove_whitespace(text)
print("Remove All extra space between words", cleaned_text)  # Output: میں گھر جا رہا ہوں۔

10. preserve_special_characters(text)

This function adds spaces around special characters in the Urdu text to facilitate tokenization.

Example:

text = "میں@پاکستان_سے_ہوں۔"
preserved_text = urdu_text_processing.preserve_special_characters(text)
print("make a space between every special character and word so tokenize easily", preserved_text)  # Output: میں @ پاکستان _ سے _ ہوں ۔

11. remove_numbers(text)

This function removes both Urdu and English numbers from the Urdu text.

Example:

text = "میں ۲۳ سال کا ہوں اور میری عمر 23 ہے۔"
number_removed = urdu_text_processing.remove_numbers(text)
print("Remove All numbers whether they are Urdu or English: ", number_removed)  # Output: میں سال کا ہوں اور میری عمر ہے۔
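
The three number-removal functions differ only in which digit ranges they target. A minimal regex sketch of the combined case follows; the character ranges are an assumed definition of "number" and `remove_numbers_sketch` is a hypothetical name, not the library's implementation. The sketch also collapses the double spaces left behind, which matches the output shown above.

```python
import re

# Western digits plus both Arabic-script digit ranges (assumed definition).
DIGITS = re.compile(r"[0-9\u0660-\u0669\u06F0-\u06F9]+")


def remove_numbers_sketch(text: str) -> str:
    """Delete digit runs, then collapse the whitespace they leave behind."""
    without_digits = DIGITS.sub("", text)
    return re.sub(r"\s+", " ", without_digits).strip()


print(remove_numbers_sketch("میں ۲۳ سال کا ہوں اور میری عمر 23 ہے۔"))
```

Restricting the pattern to `[0-9]` or to the Arabic-script ranges alone gives the English-only and Urdu-only variants.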

12. remove_english(text)

This function removes English characters from the Urdu text.

Example:

text = "I am learning Urdu."
urdu_only = urdu_text_processing.remove_english(text)
print("Remove All English characters from text: ", urdu_only)  # Output: the English letters are removed; this input contains no Urdu words, so none remain

13. pure_urdu(text)

This function removes all non-Urdu characters and numbers from the text, leaving only Urdu characters and special characters used in Urdu.

Example:

text = "I ?  # & am learning Urdu. میں اردو سیکھ رہا ہوں۔ 123"
pure_urdu_text = urdu_text_processing.pure_urdu(text)
print(pure_urdu_text)  # Output: میں اردو سیکھ رہا ہوں۔

14. just_urdu(text)

This function removes all non-Urdu characters, numbers, and special characters, including the special characters used in Urdu, leaving only pure Urdu text.

Example:

text = "I am learning Urdu. میں اردو سیکھ رہا ہوں۔ 123"
just_urdu_text = urdu_text_processing.just_urdu(text)
print(just_urdu_text)  # Output: میں اردو سیکھ رہا ہوں 

15. remove_urls(text)

This function removes URLs from the Urdu text.

Example:

text = "میں https://www.example.com پر گیا۔"
no_urls = urdu_text_processing.remove_urls(text)
print("Remove All URLs", no_urls)  # Output: میں  پر گیا۔
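
URL removal is typically done with a regular expression. The pattern below is a simple assumption for illustration (anything starting with `http://`, `https://`, or `www.` up to the next whitespace), not the pattern the library uses, and `remove_urls_sketch` is a hypothetical name.

```python
import re

# Assumed pattern: a scheme or "www." followed by any run of non-space characters.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def remove_urls_sketch(text: str) -> str:
    """Delete URLs but leave the surrounding whitespace untouched."""
    return URL_PATTERN.sub("", text)


print(remove_urls_sketch("میں https://www.example.com پر گیا۔"))
```

Because only the URL itself is deleted, the spaces on both sides remain, which is why the output above contains a double space. A whitespace clean-up pass such as remove_whitespace can follow. Note that with this simple pattern, punctuation typed directly after a URL without a space is removed along with it.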

16. remove_special_characters(text)

This function removes all special characters from the Urdu text.

Example:

text = "میں@پاکستان_سے_ہوں۔"
no_special_chars = urdu_text_processing.remove_special_characters(text)
print("Remove All Special characters", no_special_chars)  # Output: میں پاکستان سے ہوں

17. remove_special_characters_exceptUrdu(text)

This function removes all special characters from the Urdu text, except for those commonly used in the Urdu language (e.g., ؟, ۔, ،).

Example:

text = "میں@پاکستان??_سے_ہوں؟"
urdu_special_chars = urdu_text_processing.remove_special_characters_exceptUrdu(text)
print("Remove All Special characters except those which are used in Urdu language eg( ؟ ۔ ، ): ", urdu_special_chars)  # Output: میں پاکستان سے ہوں؟

Stop Words Removing

1. remove_stopwords(text)

This function removes stopwords from the Urdu text.

Example:

text = "میں اس کتاب کو پڑھنا چاہتا ہوں۔"
filtered_text = urdu_text_processing.remove_stopwords(text)
print("Remove Stop words:", filtered_text)  # Output: کتاب پڑھنا چاہتا ہوں۔
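
Stop word removal amounts to filtering tokens against a fixed word list. The sketch below uses a hypothetical three-word list, chosen only because those are the words dropped in the example above; the library ships its own, much larger list, and `remove_stopwords_sketch` is not part of its API.

```python
# Hypothetical mini stop word list, for illustration only.
STOPWORDS_SKETCH = {"میں", "اس", "کو"}


def remove_stopwords_sketch(text: str, stopwords=STOPWORDS_SKETCH) -> str:
    """Split on whitespace and keep only the words not in the stop word set."""
    return " ".join(word for word in text.split() if word not in stopwords)


print(remove_stopwords_sketch("میں اس کتاب کو پڑھنا چاہتا ہوں۔"))
# کتاب پڑھنا چاہتا ہوں۔
```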

Tokenization

Tokenization involves breaking down Urdu text into individual tokens, considering the intricacies of Urdu script and language structure.

1. urdu_tokenize(text)

This function tokenizes the Urdu text into individual tokens (words, numbers, and punctuations).

Example:

text = "میں پاکستان سے ہوں۔"
tokens = urdu_text_processing.urdu_tokenize(text)
print("Tokenization for Urdu language:", tokens)  # Output: ['میں', 'پاکستان', 'سے', 'ہوں۔']

Lemmatization and Stemming

Lemmatization involves converting inflected words into their base or dictionary form, while stemming reduces words to their root form.

1. lemmatize_sentence(sentence)

This function performs lemmatization on the Urdu sentence, replacing words with their base or dictionary form.

Example:

sentence = "میں کتابیں پڑھتا ہوں۔"
lemmatized_sentence = urdu_text_processing.lemmatize_sentence(sentence)
print("lemmatize the words ", lemmatized_sentence)  # Output: میں کتاب پڑھنا ہوں۔

2. urdu_stemmer(sentence)

This function performs stemming on the Urdu sentence, reducing words to their root or stem form.

Example:

sentence = "میں کتابیں پڑھتا ہوں۔"
stemmed_sentence = urdu_text_processing.urdu_stemmer(sentence)
print("Urdu Stemming ", stemmed_sentence)  # Output: میں کتاب پڑھ ہوں۔

Spell Checker

Spell checking involves identifying and correcting misspelled words in Urdu text.

1. corrected_sentence_spelling(input_word, threshold)

This function takes an input sentence and a similarity threshold as arguments and returns the corrected sentence with potentially misspelled words replaced by the most similar words from the vocabulary.

Example:

spell_checker = LughaatNLP()
sentence = 'سسب سےا بڑاا ملکا ہے'
corrected_sentence = spell_checker.corrected_sentence_spelling(sentence, 60)
print("This correct spelling of sentence itself", corrected_sentence)

2. most_similar_word(input_word, threshold)

This function takes an input word and a similarity threshold as arguments and returns the most similar word from the vocabulary based on the Levenshtein distance.

Example:

spell_checker = LughaatNLP()
word = 'پاکستاان'
most_similar = spell_checker.most_similar_word(word, 70)
print("This will return the most similar single word in string", most_similar)

3. get_similar_words_percentage(input_word, threshold)

This function takes an input word and a similarity threshold as arguments and returns a list of tuples containing similar words and their corresponding similarity percentages.

Example:

spell_checker = LughaatNLP()
word = 'پاکستاان'
similar_words_with_percentage = spell_checker.get_similar_words_percentage(word, 70)
print("This will return the most similar words in list with percentage", similar_words_with_percentage)

4. get_similar_words(input_word, threshold)

This function takes an input word and a similarity threshold as arguments and returns a list of similar words from the vocabulary based on the Levenshtein distance.

Example:

spell_checker = LughaatNLP()
word = 'پاکستاان'
similar_words = spell_checker.get_similar_words(word, 70)
print("This will return the most similar words in list only you can access word using index", similar_words)

These functions leverage the Levenshtein distance algorithm to calculate the similarity between the input word or sentence and the words in the vocabulary. The threshold parameter is used to filter out words with a similarity percentage below the specified threshold.

Note: These examples assume that spell_checker is an instance of the LughaatNLP class (from LughaatNLP import LughaatNLP) and that the python-Levenshtein package, which provides the edit-distance calculation, is installed.
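
To make the threshold filtering concrete, here is a dependency-free sketch of Levenshtein-based similarity. The percentage formula (one minus distance divided by the longer word's length), the three-word vocabulary, and all function names are assumptions for illustration; the library's own scoring, which relies on python-Levenshtein, may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity_percentage(a: str, b: str) -> float:
    """Assumed formula: 100 * (1 - distance / length of the longer word)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def similar_words(word: str, vocabulary, threshold: float):
    """Return (candidate, percentage) pairs at or above the threshold, best first."""
    scored = [(candidate, similarity_percentage(word, candidate)) for candidate in vocabulary]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


vocabulary = ["پاکستان", "پاکستانی", "کتاب"]  # hypothetical mini vocabulary
for candidate, score in similar_words("پاکستاان", vocabulary, 70):
    print(candidate, score)
```

With this formula the misspelled word is one edit away from the first vocabulary entry and two edits away from the second, so both pass a threshold of 70 while the unrelated third word is filtered out. Raising the threshold narrows the candidate list; lowering it admits more distant matches.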

Part of Speech

The pos_tags_urdu function is used for part-of-speech tagging in Urdu text. It takes an Urdu sentence as input and returns a list of dictionaries where each word is paired with its assigned part-of-speech tag, such as nouns (NN), verbs (VB), adjectives (ADJ), etc.

1. pos_tagger.pos_tags_urdu(sentence)

The example output demonstrates how words like "میرے" (G for genitive) and "تعلیم" (NN for noun) are tagged based on their grammatical roles within the sentence.

Example:

from LughaatNLP import POS_urdu

# Create the tagger once and reuse it for every sentence
pos_tagger = POS_urdu()

sentence = "میرے والدین نے میری تعلیم اور تربیت میں بہت محنت کی تاکہ میں اپنی زندگی میں کامیاب ہو سکوں۔"

predicted_pos_tags = pos_tagger.pos_tags_urdu(sentence)

print(predicted_pos_tags)
# output => [{'Word': 'میرے', 'POS_Tag': 'G'}, {'Word': 'والدین', 'POS_Tag': 'NN'},{'Word': 'نے', 'POS_Tag': 'P'},{'Word': 'میری', 'POS_Tag': 'G'},{'Word': 'تعلیم', 'POS_Tag': 'NN'},{'Word': 'اور', 'POS_Tag': 'CC'},{'Word': 'تربیت', 'POS_Tag': 'NN'},{'Word': 'میں', 'POS_Tag': 'P'},{'Word': 'بہت', 'POS_Tag': 'ADV'},{'Word': 'محنت', 'POS_Tag': 'NN'},{'Word': 'کی', 'POS_Tag': 'VB'},{'Word': 'تاکہ', 'POS_Tag': 'SC'},{'Word': 'میں', 'POS_Tag': 'P'},{'Word': 'اپنی', 'POS_Tag': 'GR'},{'Word': 'زندگی', 'POS_Tag': 'NN'},{'Word': 'میں', 'POS_Tag': 'P'},{'Word': 'کامیاب', 'POS_Tag': 'ADJ'},{'Word': 'ہو', 'POS_Tag': 'VB'},{'Word': 'سکوں', 'POS_Tag': 'NN'},{'Word': '۔', 'POS_Tag': 'SM'}]
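
Since the result is a plain list of dictionaries with the keys 'Word' and 'POS_Tag', it can be post-processed with ordinary Python. The sketch below groups words by tag using a shortened copy of the sample output above, so it runs without the library installed; `group_by_tag` is a hypothetical helper, not part of LughaatNLP.

```python
from collections import defaultdict

# First five entries of the sample output shown above.
predicted_pos_tags = [
    {'Word': 'میرے', 'POS_Tag': 'G'},
    {'Word': 'والدین', 'POS_Tag': 'NN'},
    {'Word': 'نے', 'POS_Tag': 'P'},
    {'Word': 'میری', 'POS_Tag': 'G'},
    {'Word': 'تعلیم', 'POS_Tag': 'NN'},
]


def group_by_tag(tagged):
    """Map each POS tag to the list of words carrying it, in sentence order."""
    groups = defaultdict(list)
    for item in tagged:
        groups[item['POS_Tag']].append(item['Word'])
    return dict(groups)


groups = group_by_tag(predicted_pos_tags)
print(groups['NN'])  # ['والدین', 'تعلیم']
print(groups['G'])   # ['میرے', 'میری']
```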

Named Entity Recognition

The ner_tags_urdu function performs named entity recognition on Urdu text, assigning named entity tags (such as U-LOCATION for locations) to identified entities in the input sentence. It outputs a dictionary where words are mapped to their corresponding named entity tags, facilitating tasks like information extraction and text analysis specific to the Urdu language.

1. ner_urdu.ner_tags_urdu(sentence)

The example output illustrates how entities like "پاکستان" are recognized as locations (U-LOCATION) within the provided sentence.

Example:

from LughaatNLP import NER_Urdu

ner_urdu = NER_Urdu()

sentence = "اس کتاب میں پاکستان کی تاریخ بیان کی گئی ہے۔"

word_tag_dict = ner_urdu.ner_tags_urdu(sentence)

print(word_tag_dict)

# output  {'اس': 'O', 'کتاب': 'O', 'میں': 'O', 'پاکستان': 'U-LOCATION', 'کی': 'O', 'تاریخ': 'O', 'بیان': 'O', 'گئی': 'O', 'ہے': 'O', '۔': 'O'}
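
In the sample output, every word outside an entity carries the tag 'O', so pulling out the entities is a simple dictionary filter. The sketch below works on a copy of the sample output above and does not need the library; `extract_entities` is a hypothetical helper.

```python
# Copy of the sample output shown above.
word_tag_dict = {
    'اس': 'O', 'کتاب': 'O', 'میں': 'O', 'پاکستان': 'U-LOCATION',
    'کی': 'O', 'تاریخ': 'O', 'بیان': 'O', 'گئی': 'O', 'ہے': 'O', '۔': 'O',
}


def extract_entities(tags):
    """Keep only the words whose tag is not 'O' (outside any entity)."""
    return {word: tag for word, tag in tags.items() if tag != 'O'}


print(extract_entities(word_tag_dict))  # {'پاکستان': 'U-LOCATION'}
```

Because the result is keyed by word, a word that occurs more than once in the sentence appears only once in the dictionary: "کی" occurs twice in the example sentence but has a single entry in the output.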

Installation

You can install the LughaatNLP library from PyPI using pip:

pip install lughaatNLP

Alternatively, you can manually install it by downloading and unzipping the provided LughaatNLP.rar file and installing the wheel file using pip:

pip install path_to_wheel_file/LughaatNLP-1.0.2-py3-none-any.whl

Required Packages

The LughaatNLP library requires the following packages:

  • python-Levenshtein
  • tensorflow
  • numpy

You can install these packages using pip:

pip install python-Levenshtein tensorflow numpy

Usage

After installing the library, you can import the necessary functions or classes in your Python script:

# Import the packages
from LughaatNLP import LughaatNLP
from LughaatNLP import POS_urdu
from LughaatNLP import NER_Urdu

# Create one instance of each class
urdu_text_processing = LughaatNLP()
ner_urdu = NER_Urdu()
pos_tagger = POS_urdu()

Future Work

The future roadmap for LughaatNLP includes the following features:

  • Urdu language translator
  • Urdu chatbot models
  • Text-to-speech and speech-to-text capabilities for Urdu
  • Urdu text summarization

Implementing these features requires resources such as servers and GPUs for training. Therefore, Muhammad Noman is collecting funds to support the development and maintenance of this library.

Contributing

This library was created by Muhammad Noman, a student at Iqra University. You can reach him via email at muhammadnomanshafiq76@gmail.com or connect with him on LinkedIn.

If you find any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request on the GitHub repository.

License

This project is licensed under the MIT License.

