A utility for normalizing Persian, Arabic, and English text
Piraye: NLP Utilities
Piraye is a Python library designed to facilitate text normalization for Persian, Arabic, and English languages.
Requirements
- Python 3.11+
- nltk 3.4.5+
Installation
You can install the latest version of Piraye via pip:
pip install piraye
Usage
To use Piraye, build a Normalizer instance with NormalizerBuilder and call its normalize function. The builder accepts settings that control each step of the normalization. The two examples below show the same configuration written two ways:
- Using the builder pattern:
from piraye import NormalizerBuilder
text = "این یک متن تسة اسﺘ , 24/12/1400 "
normalizer = NormalizerBuilder().alphabet_fa().digit_fa().punctuation_fa().tokenizing().remove_extra_spaces().build()
normalizer.normalize(text) # "این یک متن تست است ، ۲۴/۱۲/۱۴۰۰"
- Using the constructor:
from piraye import NormalizerBuilder
from piraye.tasks.normalizer.normalizer_builder import Config
text = "این یک متن تسة اسﺘ , 24/12/1400 "
normalizer = NormalizerBuilder([Config.PUNCTUATION_FA, Config.ALPHABET_FA, Config.DIGIT_FA], remove_extra_spaces=True,
tokenization=True).build()
normalizer.normalize(text) # "این یک متن تست است ، ۲۴/۱۲/۱۴۰۰"
You can find more examples here
Configs
Piraye provides various configurations for text normalization. Here's a list of available configurations:
Config | Function | Description |
---|---|---|
ALPHABET_AR | alphabet_ar | map alphabet characters to Arabic |
ALPHABET_EN | alphabet_en | map alphabet characters to English |
ALPHABET_FA | alphabet_fa | map alphabet characters to Persian |
DIGIT_AR | digit_ar | convert digits to Arabic digits |
DIGIT_EN | digit_en | convert digits to English digits |
DIGIT_FA | digit_fa | convert digits to Persian digits |
DIACRITIC_DELETE | diacritic_delete | remove all diacritics |
SPACE_DELETE | space_delete | remove all spaces |
SPACE_NORMAL | space_normal | normalize space characters (NO-BREAK SPACE, tab, etc.) to regular spaces |
SPACE_KEEP | space_keep | map space characters without normalizing them |
PUNCTUATION_AR | punctuation_ar | map punctuation to Arabic punctuation |
PUNCTUATION_FA | punctuation_fa | map punctuation to Persian punctuation |
PUNCTUATION_EN | punctuation_en | map punctuation to English punctuation |
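To make the digit configurations more concrete, here is a minimal standard-library sketch of what "convert digits to Persian digits" and "convert digits to English digits" mean. It is not Piraye's implementation; the names `make_digit_table`, `TO_FA`, and `TO_EN` are hypothetical and exist only for this illustration.

```python
# Illustrative sketch only: this is NOT Piraye's code, just the idea behind
# the DIGIT_FA / DIGIT_EN configurations expressed with str.translate.

# Build the three digit sets from their Unicode code points.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))  # U+06F0..U+06F9
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))   # U+0660..U+0669
ENGLISH_DIGITS = "0123456789"


def make_digit_table(target: str) -> dict[int, int]:
    """Return a translation table mapping every other digit set to `target`."""
    table: dict[int, int] = {}
    for source in (PERSIAN_DIGITS, ARABIC_DIGITS, ENGLISH_DIGITS):
        if source != target:
            table.update(str.maketrans(source, target))
    return table


TO_FA = make_digit_table(PERSIAN_DIGITS)  # hypothetical name
TO_EN = make_digit_table(ENGLISH_DIGITS)  # hypothetical name

# ASCII digits become Persian digits; other characters are untouched.
print("24/12/1400".translate(TO_FA))

# A string mixing Persian, Arabic, and ASCII digits becomes plain ASCII.
mixed = "\u06f2\u06f4/\u0661\u0662/1400"
print(mixed.translate(TO_EN))  # 24/12/1400
```

With Piraye itself, the equivalent step is selected through `digit_fa()` or `digit_en()` on the builder, or `Config.DIGIT_FA` or `Config.DIGIT_EN` in the constructor.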
Other attributes:
- remove_extra_spaces: Collapses runs of consecutive spaces into a single space.
- tokenization: Replaces punctuation characters only when they stand alone as tokens.
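The effect of space normalization followed by removing extra spaces can be sketched with the standard library. This is an illustration under stated assumptions, not Piraye's implementation: the `SPACE_LIKE` mapping and both function names are hypothetical, and the real library may treat more characters as spaces or handle leading and trailing whitespace differently.

```python
import re

# Assumed for illustration: a small set of characters treated as plain spaces.
SPACE_LIKE = {"\u00a0": " ", "\t": " "}  # NO-BREAK SPACE and tab


def normalize_spaces(text: str) -> str:
    """Map space-like characters to a regular space (hypothetical helper)."""
    return text.translate(str.maketrans(SPACE_LIKE))


def collapse_spaces(text: str) -> str:
    """Replace each run of two or more spaces with one space (hypothetical helper)."""
    return re.sub(r" {2,}", " ", text)


sample = "a\u00a0\u00a0b \t c"
print(collapse_spaces(normalize_spaces(sample)))  # a b c
```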
Development
To set up a development environment, install dependencies with:
pip install -e ".[dev]"
License
GNU Lesser General Public License v2.1
Piraye is licensed under the GNU Lesser General Public License v2.1, which primarily applies to software libraries. See the LICENSE file for more details.
About
Piraye is maintained by Arusha.