A Python library for stemming Punjabi language words, including preprocessing for noise removal.
Project description
PunjabiStemmer
Introduction
The PunjabiStemmer introduces a groundbreaking advancement in Natural Language Processing (NLP) for Punjabi, a major regional language in India. It combines rule-based and dictionary-based methodologies to address the morphological complexity of Punjabi, setting a new benchmark in linguistic technology.
Featuring over 300 suffix rules and a comprehensive database of over 50,000 words, the stemmer excels in handling various grammatical scenarios while preserving semantic integrity. This hybrid approach allows for meticulous processing across nouns, pronouns, adjectives, adverbs, verbs, and more, maintaining the essence of words and supporting diverse NLP applications.
Now available on PyPI, PunjabiStemmer is a monumental leap in text processing for regional languages, inviting collaboration and innovation in the field. This Punjabi Stemmer was developed by Gurpej Singh, a researcher and software engineer dedicated to advancing Natural Language Processing for Punjabi, as part of his research work.
Main Features
The Punjabi_Stemmer package offers several functionalities, including:
- Stemming single word for testing.
- Stemming a Sentence or Paragraph.
- Stemming Content from a Large Text File.
- Hybrid approach( Rule-Based + Dictionary Based Approach): Incorporates over 300 specific rules and a comprehensive database of over 50,000 words, to precisely manage the linguistic diversity of Punjabi, effectively dealing with suffixes, prefixes, and more.
- High Accuracy: Designed to minimize overstemming and understemming errors, enhancing the reliability of subsequent NLP tasks.
- Open Source: Fully open-source, encouraging contributions, improvements, and customization to meet diverse needs.
You can install Punjabi_Stemmer directly from PyPI
Installation
Before installing Punjabi Stemmer, ensure you have Python 3.7 or newer installed on your system. Punjabi Stemmer also require some other libraries installation:
pip install regex
pip install os-sys
pip install collection
Install Punjabi_Stemmer
using pip:
pip install punjabi_stemmer
Usage
Here's how to use the Punjabi_Stemmer Package in your Python projects:
Stemming a Single Word
To stem a single Punjabi word, use the stem_word method:
from Punjabi_Stemmer.Stemmer import PunjabiStemmer
# Now you can simply initialize the stemmer without specifying file paths
stemmer = PunjabiStemmer()
word = "ਭੱਜਣਾ" # Example Punjabi word
stemmed_word = stemmer.stem_word(word)
print(f"Original: {word}, Stemmed: {stemmed_word}")
Output
ਭੱਜ
This will print the original word and its stemmed version.
Stemming a Sentence or Paragraph
To stem a longer piece of text, such as a sentence or paragraph, use the stem_text method:
from Punjabi_Stemmer.Stemmer import PunjabiStemmer
# Now you can simply initialize the stemmer without specifying file paths
stemmer = PunjabiStemmer()
text = "ਪੜਾਉਂਦਾ ਪੜਾਉਂਦੀ ਪੜਾਉਂਦੇ ਪੜਾਉਣੀਆਂ ਪੜਾਉਣੀ ਪੜਾਉਣੇ ਪੜਾਂਦਾ ਪੜਾਂਦੀ"
stemmed_text = stemmer.stem_text(text)
print(f"Original: {text}\nStemmed: {stemmed_text}")
Output
ਪੜਾ ਪੜਾ ਪੜਾ ਪੜਾ ਪੜਾ ਪੜਾ ਪੜਾ ਪੜਾ
This will output the original text and the stemmed version.
Stemming Content from a Text File
To process text from a file, ensuring all the content is automatically preprocessed and stemmed, then outputted to another file, you can use the stem_file method:
from Punjabi_Stemmer.Stemmer import PunjabiStemmer
# Now you can simply initialize the stemmer without specifying file paths
stemmer = PunjabiStemmer()
input_file_path = 'path/to/your/input.txt' # Path to your input file
output_file_path = 'path/to/your/output.txt' # Path where you want to save the output
# Stem the content of the input file and save it to the output file
stemmer.stem_file(input_file_path, output_file_path)
Output
Processed text has been saved to D:/output.txt
Make sure the input file exists at the specified path; otherwise, you'll receive an error message.
These steps provide a comprehensive guide on how to use the Punjabi Stemmer package from basic to more advanced use cases, including processing individual words, text, and files. This should cover most needs users might have when working with Punjabi text data.
Punjabi Stemmer Algorithm
Step 1: Initialization
Load lists of pronouns, adverbs, post-positions, vocabulary, names, and suffixes.
Load the dictionary, organized alphabetically.
Step 2: Input Processing
Split the input text into individual words.
For each word in the split text:
Step 3: List Category Checking
a. If the length of the word is 1, output the word and move to the next one.
b. If the word is found in any loaded list (pronouns, adverbs, postpositions, vocabulary, names, suffixes), output the word and move to the next one.
Step 4: Apply Suffix Rule
a. If a suffix rule matches, apply it to stem the word.
b. If no suffix rule matches, output the original word and continue.
Step 5: Single-Character Check
If the length of the stemmed word is 1, output the original word and continue.
Step 6: Dictionary Validation and Further Stemming
a. If the stemmed word is validated in the dictionary, output the stemmed word.
b. If the stemmed word is not validated but the original word is:
i. Apply further suffix rules iteratively to the original word.
ii. If a valid stemmed word is found during iteration, output it.
iii. If no valid stem is found after all iterations, output the original word.
c. If neither the stemmed word nor the original word is validated in the dictionary, output the stemmed word.
Rules Overview
In the development of the PunjabiStemmer, one of our core objectives was to create a highly accurate and versatile tool capable of navigating the rich morphological landscape of the Punjabi language. To achieve this, we have meticulously developed and implemented an expansive set of rules that serve as the foundation of our stemming process.
The PunjabiStemmer incorporates over 300 specific rules, designed to accurately process a wide array of grammatical scenarios. These rules are meticulously categorized to address different aspects of the language, including but not limited to, proper nouns and names, pronouns, verbs, adverbs, and adjectives. This structured approach allows the stemmer to precisely identify and handle the morphological nuances of Punjabi, significantly reducing errors related to overstemming and understemming.
Our research paper details about 160 of these rules, showcasing their application across various linguistic categories. To ensure comprehensive understanding and transparency, we've included the entire rule set in stemmer file in this repossitory https://github.com/gurpejsingh13/Punjabi_Stemmer.git
The rules are thoughtfully crafted and arranged from longest to smallest, optimizing both accuracy and efficiency in stemming. This organization reflects our meticulous approach to handling the complexities of Punjabi morphology.
We encourage users and developers to explore this detailed compilation of rules. It highlights the PunjabiStemmer's effectiveness and our dedication to advancing the field of Punjabi language processing.
Contributing
Contributions to punjabi_stopwords are welcome! If you have suggestions for additional rules, or improvements to the existing list, please feel free to contribute.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file punjabi_stemmer-1.0.1.tar.gz
.
File metadata
- Download URL: punjabi_stemmer-1.0.1.tar.gz
- Upload date:
- Size: 364.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 563dfa0c56450f3af3facfc39a5be8a4908068ea8368d26d8155f187442a19c6 |
|
MD5 | 7b3661c7a29d1ff36b768fda2980211e |
|
BLAKE2b-256 | 33aef9a568ae5d8ce5b665ee09aface2309dbadba4c6a4cec46dd6ff7fcbc8c8 |
File details
Details for the file punjabi_stemmer-1.0.1-py3-none-any.whl
.
File metadata
- Download URL: punjabi_stemmer-1.0.1-py3-none-any.whl
- Upload date:
- Size: 10.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 423eb96dd4804bf493dd20a09af3b2f8d7a2d1caebdcb2a21d082833c94cb7d0 |
|
MD5 | e76fe01494f35e24a96f73db83ae9dff |
|
BLAKE2b-256 | cecb50b66eadc06d2480e7c0d66d77a666fc323e8cc9875c0f4315b447145e18 |