A Python package for Urdu natural language processing tasks, including normalization, part-of-speech (POS) tagging, named entity recognition (NER), stemming, lemmatization, tokenization, and stopword removal.
LughaatNLP
LughaatNLP is the first comprehensive Urdu language preprocessing library developed in Pakistan for NLP tasks. It provides essential tools for tokenization, lemmatization, stop word removal, part-of-speech (POS) tagging, named entity recognition (NER), and normalization, specifically tailored for the Urdu language.
Documentation
The documentation is available through the following links:
- Download Document
- Google Colab Notebook Link
- PyPI Link
- YouTube Link (library tutorial)
Features
- Tokenization: Breaks down Urdu text into individual tokens, considering the intricacies of Urdu script and language structure.
- Lemmatization: Converts inflected words into their base or dictionary form, aiding in text analysis and comprehension.
- Stop Word Removal: Eliminates common Urdu stop words to focus on meaningful content during text processing.
- Normalization: Standardizes text by removing diacritics, normalizing character variations, and handling common orthographic variations in Urdu.
- Stemming: Reduces words to their root form, improving text analysis and comprehension in Urdu.
- Spell Checker: Identifies and corrects misspelled words in Urdu text, enhancing text quality and readability.
- Part of Speech Tagging: Tags words with their grammatical categories, enabling advanced syntactic analysis.
- Named Entity Recognition (NER): Identifies and extracts named entities such as persons, organizations, and locations.
Functions
Normalization
1. normalize_characters(text)
This function normalizes the Urdu characters in the given text by mapping incorrect or variant character encodings to their correct forms. A single Urdu character can sometimes be written with several different Unicode representations, and this function normalizes them to a canonical form.
Example:
text = "آپ کیسے ہیں؟"
normalized_text = urdu_text_processing.normalize_characters(text)
print(normalized_text) # Output: اپ کیسے ہیں؟
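Under the hood, this kind of normalization is typically a code-point mapping table. Here is a minimal, self-contained sketch of the idea (this is not LughaatNLP's actual table; the two mappings shown are common Arabic-to-Urdu substitutions):

```python
# Sketch of character normalization: map Arabic code points to their
# Urdu equivalents (illustrative subset, not the library's full table).
ARABIC_TO_URDU = {
    "\u064A": "\u06CC",  # Arabic Yeh -> Urdu Yeh (ی)
    "\u0643": "\u06A9",  # Arabic Kaf -> Urdu Kaf (ک)
}

def normalize_characters_sketch(text: str) -> str:
    # str.translate applies the whole mapping in one pass.
    return text.translate(str.maketrans(ARABIC_TO_URDU))

print(normalize_characters_sketch("\u0643\u062a\u0627\u0628"))  # -> کتاب
```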
2. normalize_combine_characters(text)
This function simplifies Urdu characters by combining certain combinations into their correct single forms. In Urdu writing, some characters are made up of multiple parts like ligatures or diacritics. This function finds these combinations in the text and changes them to their single character forms. It ensures consistency and accuracy in how Urdu text is represented.
Example:
text = "اُردو"
normalized_text = urdu_text_processing.normalize_combine_characters(text)
print(normalized_text) # Output: اُردو
3. normalize(text)
This function performs all-in-one normalization on the Urdu text, including character normalization, diacritic removal, punctuation handling, digit conversion, and special character preservation.
Example:
text = "آپ کیسے ہیں؟ میں ۲۳ سال کا ہوں۔"
normalized_text = urdu_text_processing.normalize(text)
print("Normalize all at once together of Urdu: ", normalized_text) # Output: اپ کیسے ہیں ؟ میں 23 سال کا ہوں ۔
4. remove_diacritics(text)
This function removes diacritics (zabar, zer, pesh) from the Urdu text.
Example:
text = "کِتَاب"
diacritics_removed = urdu_text_processing.remove_diacritics(text)
print("Remove all Diacritic (Zabar - Zer - Pesh): ", diacritics_removed) # Output: کتاب
5. punctuations_space(text)
This function removes unwanted spaces before punctuation marks and adjusts spaces after punctuation (excluding numbers) in the Urdu text.
Example:
text = "کیا آپ کھانا کھانا چاہتے ہیں ؟ میں کھانا کھاؤں گا ۔"
punctuated_text = urdu_text_processing.punctuations_space(text)
print(punctuated_text) # Output: کیا آپ کھانا کھانا چاہتے ہیں؟ میں کھانا کھاؤں گا۔
6. replace_digits(text)
This function replaces English digits with Urdu digits.
Example:
text = "میں 23 سال کا ہوں۔"
english_digits = urdu_text_processing.replace_digits(text)
print("Replace All maths numbers with Urdu number eg(2 1 3 1 -> ۲ ۱ ۳ ۱): ", english_digits) # Output: میں ۲۳ سال کا ہوں۔
7. remove_numbers_urdu(text)
This function removes Urdu numbers from the Urdu text.
Example:
text = "میں 22 ۲۳ سال کا ہوں۔"
no_urdu_numbers = urdu_text_processing.remove_numbers_urdu(text)
print("Remove Urdu numbers from text: ", no_urdu_numbers) # Output: میں 22 سال کا ہوں۔
8. remove_numbers_english(text)
This function removes English numbers from the Urdu text.
Example:
text = "میں ۲۳ 23 سال کا ہوں۔"
no_english_numbers = urdu_text_processing.remove_numbers_english(text)
print("Remove English numbers from text: ", no_english_numbers) # Output: میں ۲۳ سال کا ہوں۔
9. remove_whitespace(text)
This function removes extra whitespaces from the Urdu text.
Example:
text = "میں گھر جا رہا ہوں۔"
cleaned_text = urdu_text_processing.remove_whitespace(text)
print("Remove All extra space between words", cleaned_text) # Output: میں گھر جا رہا ہوں۔
10. preserve_special_characters(text)
This function adds spaces around special characters in the Urdu text to facilitate tokenization.
Example:
text = "میں@پاکستان_سے_ہوں۔"
preserved_text = urdu_text_processing.preserve_special_characters(text)
print("make a space between every special character and word so tokenize easily", preserved_text) # Output: میں @ پاکستان _ سے _ ہوں ۔
11. remove_numbers(text)
This function removes both Urdu and English numbers from the Urdu text.
Example:
text = "میں ۲۳ سال کا ہوں اور میری عمر 23 ہے۔"
number_removed = urdu_text_processing.remove_numbers(text)
print("Remove All numbers whether they are Urdu or English: ", number_removed) # Output: میں سال کا ہوں اور میری عمر ہے۔
12. remove_english(text)
This function removes English characters from the Urdu text.
Example:
text = "I am learning Urdu."
urdu_only = urdu_text_processing.remove_english(text)
print("Remove All English characters from text: ", urdu_only) # Output: ام لرننگ اردو
13. pure_urdu(text)
This function removes all non-Urdu characters and numbers from the text, leaving only Urdu characters and special characters used in Urdu.
Example:
text = "I ? # & am learning Urdu. میں اردو سیکھ رہا ہوں۔ 123"
pure_urdu_text = urdu_text_processing.pure_urdu(text)
print(pure_urdu_text) # Output: میں اردو سیکھ رہا ہوں۔
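Conceptually, a filter like this keeps only characters from the Arabic script block (which also contains Urdu punctuation such as ۔ and ؟) plus whitespace. A hedged sketch of the approach, not the library's implementation:

```python
import re

# Drop everything outside the Arabic block U+0600..U+06FF and whitespace,
# then collapse the gaps the removed characters leave behind.
NON_URDU = re.compile(r"[^\u0600-\u06FF\s]")

def pure_urdu_sketch(text: str) -> str:
    return re.sub(r"\s{2,}", " ", NON_URDU.sub("", text)).strip()

print(pure_urdu_sketch("I ? # & am learning Urdu. میں اردو سیکھ رہا ہوں۔ 123"))
# -> میں اردو سیکھ رہا ہوں۔
```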
14. just_urdu(text)
This function removes all non-Urdu characters, numbers, and special characters, leaving only pure Urdu words; even special characters used in Urdu (such as ؟ and ۔) are removed.
Example:
text = "I am learning Urdu. میں اردو سیکھ رہا ہوں۔ 123"
just_urdu_text = urdu_text_processing.just_urdu(text)
print(just_urdu_text) # Output: میں اردو سیکھ رہا ہوں
15. remove_urls(text)
This function removes URLs from the Urdu text.
Example:
text = "میں https://www.example.com پر گیا۔"
no_urls = urdu_text_processing.remove_urls(text)
print("Remove All URLs", no_urls) # Output: میں پر گیا۔
16. remove_special_characters(text)
This function removes all special characters from the Urdu text.
Example:
text = "میں@پاکستان_سے_ہوں۔"
no_special_chars = urdu_text_processing.remove_special_characters(text)
print("Remove All Special characters", no_special_chars) # Output: میں پاکستان سے ہوں
17. remove_special_characters_exceptUrdu(text)
This function removes all special characters from the Urdu text, except for those commonly used in the Urdu language (e.g., ؟, ۔, ،).
Example:
text = "میں@پاکستان??_سے_ہوں؟"
urdu_special_chars = urdu_text_processing.remove_special_characters_exceptUrdu(text)
print("Remove All Special characters except those which are used in Urdu language eg( ؟ ۔ ، ): ", urdu_special_chars) # Output: میں پاکستان سے ہوں؟
Stop Words Removing
1. remove_stopwords(text)
This function removes stopwords from the Urdu text.
Example:
text = "میں اس کتاب کو پڑھنا چاہتا ہوں۔"
filtered_text = urdu_text_processing.remove_stopwords(text)
print("Remove Stop words:", filtered_text) # Output: کتاب پڑھنا چاہتا ہوں۔
Tokenization
Tokenization involves breaking down Urdu text into individual tokens, considering the intricacies of Urdu script and language structure.
1. urdu_tokenize(text)
This function tokenizes the Urdu text into individual tokens (words, numbers, and punctuations).
Example:
text = "میں پاکستان سے ہوں۔"
tokens = urdu_text_processing.urdu_tokenize(text)
print("Tokenization for Urdu language:", tokens) # Output: ['میں', 'پاکستان', 'سے', 'ہوں۔']
Lemmatization and Stemming
Lemmatization involves converting inflected words into their base or dictionary form, while stemming reduces words to their root form.
1. lemmatize_sentence(sentence)
This function performs lemmatization on the Urdu sentence, replacing words with their base or dictionary form.
Example:
sentence = "میں کتابیں پڑھتا ہوں۔"
lemmatized_sentence = urdu_text_processing.lemmatize_sentence(sentence)
print("lemmatize the words ", lemmatized_sentence) # Output: میں کتاب پڑھنا ہوں۔
2. urdu_stemmer(sentence)
This function performs stemming on the Urdu sentence, reducing words to their root or stem form.
Example:
sentence = "میں کتابیں پڑھتا ہوں۔"
stemmed_sentence = urdu_text_processing.urdu_stemmer(sentence)
print("Urdu Stemming ", stemmed_sentence) # Output: میں کتاب پڑھ ہوں۔
Spell Checker
Spell checking involves identifying and correcting misspelled words in Urdu text.
1. corrected_sentence_spelling(input_word, threshold)
This function takes an input sentence and a similarity threshold as arguments and returns the corrected sentence with potentially misspelled words replaced by the most similar words from the vocabulary.
Example:
spell_checker = LughaatNLP()
sentence = 'سسب سےا بڑاا ملکا ہے'
corrected_sentence = spell_checker.corrected_sentence_spelling(sentence, 60)
print("This correct spelling of sentence itself", corrected_sentence)
2. most_similar_word(input_word, threshold)
This function takes an input word and a similarity threshold as arguments and returns the most similar word from the vocabulary based on the Levenshtein distance.
Example:
spell_checker = LughaatNLP()
word = 'پاکستاان'
most_similar = spell_checker.most_similar_word(word, 70)
print("This will return the most similar single word in string", most_similar)
3. get_similar_words_percentage(input_word, threshold)
This function takes an input word and a similarity threshold as arguments and returns a list of tuples containing similar words and their corresponding similarity percentages.
Example:
spell_checker = LughaatNLP()
word = 'پاکستاان'
similar_words_with_percentage = spell_checker.get_similar_words_percentage(word, 70)
print("This will return the most similar words in list with percentage", similar_words_with_percentage)
4. get_similar_words(input_word, threshold)
This function takes an input word and a similarity threshold as arguments and returns a list of similar words from the vocabulary based on the Levenshtein distance.
Example:
spell_checker = LughaatNLP()
word = 'پاکستاان'
similar_words = spell_checker.get_similar_words(word, 70)
print("This will return the most similar words in list only you can access word using index", similar_words)
These functions use the Levenshtein distance algorithm to measure the similarity between the input word or sentence and the words in the vocabulary. The threshold parameter filters out words whose similarity percentage falls below the specified value.
Note: These examples assume that you have an instance of the LughaatNLP class named spell_checker and that the python-Levenshtein package is installed for calculating the edit distance.
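The threshold logic can be illustrated with a plain dynamic-programming edit distance, with no external package. The scoring formula below (distance over the longer length) is an assumption for illustration; the library's exact scoring may differ:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic row-by-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity_percent(a: str, b: str) -> float:
    longest = max(len(a), len(b)) or 1
    return 100.0 * (1 - levenshtein(a, b) / longest)

# Only candidates at or above the threshold survive, mirroring the
# threshold parameter described above (tiny illustrative vocabulary).
vocab = ["پاکستان", "پاکستانی", "استان"]
word = "پاکستاان"
matches = [(w, similarity_percent(word, w)) for w in vocab
           if similarity_percent(word, w) >= 70]
print(matches)  # -> [('پاکستان', 87.5), ('پاکستانی', 75.0)]
```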
Part of Speech
The pos_tags_urdu function is used for part-of-speech tagging in Urdu text. It takes an Urdu sentence as input and returns a list of dictionaries in which each word is paired with its assigned part-of-speech tag, such as noun (NN), verb (VB), or adjective (ADJ).
1. pos_tags_urdu(sentence)
The example output demonstrates how words like "میرے" (tagged G) and "تعلیم" (tagged NN, noun) are tagged based on their grammatical roles within the sentence.
Example:
from LughaatNLP import POS_urdu
pos_tagger = POS_urdu()
sentence = "میرے والدین نے میری تعلیم اور تربیت میں بہت محنت کی تاکہ میں اپنی زندگی میں کامیاب ہو سکوں۔"
predicted_pos_tags = pos_tagger.pos_tags_urdu(sentence)
print(predicted_pos_tags)
# output => [{'Word': 'میرے', 'POS_Tag': 'G'}, {'Word': 'والدین', 'POS_Tag': 'NN'},{'Word': 'نے', 'POS_Tag': 'P'},{'Word': 'میری', 'POS_Tag': 'G'},{'Word': 'تعلیم', 'POS_Tag': 'NN'},{'Word': 'اور', 'POS_Tag': 'CC'},{'Word': 'تربیت', 'POS_Tag': 'NN'},{'Word': 'میں', 'POS_Tag': 'P'},{'Word': 'بہت', 'POS_Tag': 'ADV'},{'Word': 'محنت', 'POS_Tag': 'NN'},{'Word': 'کی', 'POS_Tag': 'VB'},{'Word': 'تاکہ', 'POS_Tag': 'SC'},{'Word': 'میں', 'POS_Tag': 'P'},{'Word': 'اپنی', 'POS_Tag': 'GR'},{'Word': 'زندگی', 'POS_Tag': 'NN'},{'Word': 'میں', 'POS_Tag': 'P'},{'Word': 'کامیاب', 'POS_Tag': 'ADJ'},{'Word': 'ہو', 'POS_Tag': 'VB'},{'Word': 'سکوں', 'POS_Tag': 'NN'},{'Word': '۔', 'POS_Tag': 'SM'}]
Named Entity Recognition
The ner_tags_urdu function performs named entity recognition on Urdu text, assigning named entity tags (such as U-LOCATION for locations) to identified entities in the input sentence. It outputs a dictionary in which words are mapped to their corresponding named entity tags, facilitating tasks like information extraction and text analysis for the Urdu language.
1. ner_tags_urdu(sentence)
The example output illustrates how entities like "پاکستان" are recognized as locations (U-LOCATION) within the provided sentence.
Example:
from LughaatNLP import NER_Urdu
ner_urdu = NER_Urdu()
sentence = "اس کتاب میں پاکستان کی تاریخ بیان کی گئی ہے۔"
word_tag_dict = ner_urdu.ner_tags_urdu(sentence)
print(word_tag_dict)
# output {'اس': 'O', 'کتاب': 'O', 'میں': 'O', 'پاکستان': 'U-LOCATION', 'کی': 'O', 'تاریخ': 'O', 'بیان': 'O', 'گئی': 'O', 'ہے': 'O', '۔': 'O'}
Installation
You can install the LughaatNLP library from PyPI using pip:
pip install lughaatNLP
Alternatively, you can install it manually by downloading and unzipping the provided LughaatNLP.rar file and then installing the wheel file using pip:
pip install path_to_wheel_file/LughaatNLP-1.0.2-py3-none-any.whl
Required Packages
The LughaatNLP library requires the following packages:
- python-Levenshtein
- tensorflow
- numpy
You can install these packages using pip:
pip install python-Levenshtein tensorflow numpy
Usage
After installing the library, you can import the necessary functions or classes in your Python script:
# Importing packages
from LughaatNLP import LughaatNLP
from LughaatNLP import POS_urdu
from LughaatNLP import NER_Urdu
# Creating instances
urdu_text_processing = LughaatNLP()
ner_urdu = NER_Urdu()
pos_tagger = POS_urdu()
Future Work
The future roadmap for LughaatNLP includes the following features:
- Urdu language translator
- Urdu chatbot models
- Text-to-speech and speech-to-text capabilities for Urdu
- Urdu text summarization
Implementing these features requires resources such as servers and GPUs for training, so Muhammad Noman is collecting funds to support the development and maintenance of this library.
Contributing
This library was created by Muhammad Noman, a student at Iqra University. You can reach him via email at muhammadnomanshafiq76@gmail.com or connect with him on LinkedIn.
If you find any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request on the GitHub repository.
License
This project is licensed under the MIT License.
Hashes for LughaatNLP-1.0.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | cc9d20b4952ef045617d3a73b2f3ee0ff3f3e0d93709e88e2472d138ca9fd5aa
MD5 | fa4830cf857031330f1a24d065a66ecd
BLAKE2b-256 | 42db3558eb135316dbef18d8636d1df2db44713f13f10489b77f8116f5d6d710