
A Python package for Arabic text preprocessing, including cleaning, normalization, stemming, and stopword removal.


SafwaText

SafwaText is a Python package for cleaning, normalizing, and stemming Arabic text. It is intended for NLP projects and any other workflow that needs to preprocess Arabic text.

Features

  • Remove Tashkeel (diacritics): Simplifies text by removing diacritical marks.
  • Normalize Arabic text: Converts text into a consistent format.
  • Filter Non-Arabic Characters: Removes any characters not part of the Arabic script, including numbers, punctuation, and symbols.
  • Remove Arabic Articles: Strips common Arabic definite articles.
  • Remove Arabic Prefixes: Removes common prefixes from words.
  • Remove Arabic Suffixes: Removes common suffixes from words.
  • Arabic Stemming: Applies a light stemming pipeline to Arabic words, including normalization, prefix/suffix removal, and article stripping.
  • Remove Stopwords: Filters out common Arabic stopwords.
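
To make the features above concrete, here is a minimal standard-library sketch of what such steps typically do. It is not SafwaText's implementation: the function names, the Unicode ranges, the alef-folding rule, the length threshold, and the four-word stopword set are all assumptions chosen for illustration.

```python
import re

# Tashkeel marks: fathatan (U+064B) through sukun (U+0652).
TASHKEEL = re.compile(r"[\u064B-\u0652]")
# Everything except basic Arabic letters (U+0621-U+064A) and whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")
# Assumed normalization rule: fold madda/hamza alef variants into bare alef.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
# The definite article "al" (alef + lam).
ARTICLE = "\u0627\u0644"
# Tiny illustrative stopword set, NOT the package's list.
STOPWORDS = {"في", "من", "على", "كل"}


def strip_tashkeel(text: str) -> str:
    """Delete diacritical marks, leaving the base letters."""
    return TASHKEEL.sub("", text)


def normalize_alef(text: str) -> str:
    """Replace alef variants with bare alef (U+0627)."""
    return ALEF_VARIANTS.sub("\u0627", text)


def keep_arabic(text: str) -> str:
    """Replace non-Arabic characters with spaces, then collapse whitespace."""
    return " ".join(NON_ARABIC.sub(" ", text).split())


def strip_article(word: str) -> str:
    """Drop a leading article if at least two letters remain (assumed threshold)."""
    if word.startswith(ARTICLE) and len(word) >= 4:
        return word[2:]
    return word


def drop_stopwords(text: str) -> str:
    """Keep only whitespace-separated tokens that are not in STOPWORDS."""
    return " ".join(w for w in text.split() if w not in STOPWORDS)


if __name__ == "__main__":
    sample = "يذهب مُحَمَّدٌ للمَدْرَسَةِ كل صباح"
    cleaned = keep_arabic(normalize_alef(strip_tashkeel(sample)))
    print(drop_stopwords(cleaned))
```

Real light stemmers apply more rules than this (additional prefixes, suffixes, and minimum-length checks), so treat the sketch as a reading aid rather than a replacement for the package's functions.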

Installation

Install the package directly from PyPI using pip:

```bash
pip install safwaText
```

Usage

```python
from safwaText.cleaner import remove_tashkeel, normalize_text, remove_non_arabic
from safwaText.stemmer import arabic_stemmer
from safwaText.stopwords import remove_stopwords

# Clean and normalize text
# (named `text` rather than `input` to avoid shadowing the built-in)
text = "يذهب مُحَمَّدٌ للمَدْرَسَةِ كل صباح"
cleaned_text = remove_tashkeel(text)
normalized_text = normalize_text(cleaned_text)
filtered_text = remove_non_arabic(normalized_text)

# Apply light stemming
stemmed_text = arabic_stemmer(filtered_text)

# Remove stopwords
final_output = remove_stopwords(stemmed_text)

print(final_output)  # Output: "ذهب محمد مدرس صباح"
```
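
The usage example threads each step's output into the next one by hand. If you run the same sequence often, a small composition helper can express it once. The sketch below is hypothetical and not part of SafwaText; `make_pipeline` and `TextStep` are names introduced here, and it assumes every step takes a string and returns a string, as the calls in the usage example suggest.

```python
from functools import reduce
from typing import Callable

# A preprocessing step: takes text, returns transformed text.
TextStep = Callable[[str], str]


def make_pipeline(*steps: TextStep) -> TextStep:
    """Return a function that applies each step in order, left to right."""

    def run(text: str) -> str:
        # Feed the result of each step into the next one.
        return reduce(lambda acc, step: step(acc), steps, text)

    return run


# Stand-in steps so the sketch runs without SafwaText installed.
demo = make_pipeline(str.strip, str.lower)

if __name__ == "__main__":
    print(demo("  Hello World  "))
```

With the package installed, the same helper could presumably be called as `make_pipeline(remove_tashkeel, normalize_text, remove_non_arabic, arabic_stemmer, remove_stopwords)` to reproduce the sequence from the usage example.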

Contributing

Contributions are welcome! If you'd like to improve this package:

  1. Fork the repository.
  2. Create a new branch:
    git checkout -b feature-name
    
  3. Commit your changes and push to your branch:
    git commit -m "Add feature: feature-name"
    git push origin feature-name
    
  4. Open a pull request.

License

SafwaText is licensed under the Apache-2.0 license.
