A Python package for Arabic text preprocessing, including cleaning, normalization, stemming, and stopword removal.
SafwaText
SafwaText is a Python package for cleaning, normalizing, and stemming Arabic text effortlessly. Whether you're working on NLP projects or need to preprocess Arabic text, SafwaText simplifies the process.
Features
- Remove Tashkeel (diacritics): Simplifies text by removing diacritical marks.
- Normalize Arabic text: Converts text into a consistent format.
- Filter Non-Arabic Characters: Removes any characters not part of the Arabic script, including numbers, punctuation, and symbols.
- Remove Arabic Articles: Strips common Arabic definite articles.
- Remove Arabic Prefixes: Removes common prefixes from words.
- Remove Arabic Suffixes: Removes common suffixes from words.
- Arabic Stemming: Applies a light stemming pipeline to Arabic words, including normalization, prefix/suffix removal, and article stripping.
- Remove Stopwords: Filters out common Arabic stopwords.
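To make the first three features concrete, here is a minimal standalone sketch of what diacritic removal, normalization, and non-Arabic filtering can look like using only the standard library. It is not SafwaText's implementation: the function names, the Unicode ranges, and the normalization mapping are all assumptions chosen for illustration, and the package's own rules may differ.

```python
import re

# Assumption: tashkeel means U+064B-U+0652, plus superscript alef (U+0670)
# and the tatweel elongation mark (U+0640).
TASHKEEL = re.compile("[\u064B-\u0652\u0670\u0640]")

# Assumption: fold alef variants (U+0623, U+0625, U+0622) into bare alef
# (U+0627) and alef maqsura (U+0649) into ya (U+064A).
NORMALIZE_MAP = str.maketrans({
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0622": "\u0627",
    "\u0649": "\u064A",
})

# Assumption: "Arabic" means the basic letter range U+0621-U+064A.
NON_ARABIC = re.compile("[^\u0621-\u064A\\s]")


def strip_tashkeel_sketch(text: str) -> str:
    """Delete diacritical marks and tatweel."""
    return TASHKEEL.sub("", text)


def normalize_sketch(text: str) -> str:
    """Map letter variants to one canonical form."""
    return text.translate(NORMALIZE_MAP)


def keep_arabic_sketch(text: str) -> str:
    """Replace non-Arabic characters with spaces, then collapse whitespace."""
    return " ".join(NON_ARABIC.sub(" ", text).split())
```

Replacing removed characters with a space before collapsing whitespace keeps words from being glued together when punctuation sits between them.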
Installation
Install the package directly from PyPI using pip:
```bash
pip install safwaText
```
Usage
```python
from safwaText.cleaner import remove_tashkeel, normalize_text, remove_non_arabic
from safwaText.stemmer import arabic_stemmer
from safwaText.stopwords import remove_stopwords
# Clean and normalize text
text = "يذهب مُحَمَّدٌ للمَدْرَسَةِ كل صباح"  # avoid shadowing the built-in input()
cleaned_text = remove_tashkeel(text)
normalized_text = normalize_text(cleaned_text)
filtered_text = remove_non_arabic(normalized_text)
# Apply light stemming
stemmed_text = arabic_stemmer(filtered_text)
# Remove stopwords
final_output = remove_stopwords(stemmed_text)
print(final_output) # Output: "ذهب محمد مدرس صباح"
```
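For readers curious about what "light stemming" involves, the sketch below strips at most one prefix (including the definite article) and one suffix from a word, while refusing to shorten it below a minimum length. It is an illustration only: the affix lists, the `min_len` guard, and the name `light_stem_sketch` are assumptions, not the rules `arabic_stemmer` actually applies.

```python
# Hypothetical affix lists, longest first so that "وال" is tried before "ال".
PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة", "ه")


def light_stem_sketch(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def stem_text_sketch(text: str) -> str:
    """Apply the word-level sketch to every whitespace-separated token."""
    return " ".join(light_stem_sketch(w) for w in text.split())
```

The length guard is the main design choice here: without it, short words that merely begin with the same letters as an affix would be reduced to one or two characters.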
Contributing
Contributions are welcome! If you'd like to improve this package:
- Fork the repository.
- Create a new branch:
git checkout -b feature-name
- Commit your changes and push to your branch:
git commit -m "Add feature: feature-name"
git push origin feature-name
- Open a pull request.
License
SafwaText is licensed under the Apache-2.0 license.
File details
Details for the file safwatext-0.1.0.tar.gz.
File metadata
- Download URL: safwatext-0.1.0.tar.gz
- Upload date:
- Size: 15.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7153e1425d9429ca8bfaa7aec413ee70db457514019a8029ef096d5b0a001bfb |
| MD5 | b3286dc727d51a1d1c12980380a498a1 |
| BLAKE2b-256 | 22c30eab968b1e3afecc9a43df37a92dd3bb7823807d42202eae13d03a7e5bd2 |
File details
Details for the file safwaText-0.1.0-py3-none-any.whl.
File metadata
- Download URL: safwaText-0.1.0-py3-none-any.whl
- Upload date:
- Size: 13.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fcf3a0976e56b57c748cf7faf6cf4c120857a7f6d63e1face2731b3e8374af68 |
| MD5 | 6340470aae783be157a913bc0a81b846 |
| BLAKE2b-256 | 41ef7b020ad96454f66b761bc4cf77e8e0e401888bc4e1097f99202242f76c1e |