A utility for normalizing Persian, Arabic, and English texts
Piraye: NLP Utils
Requirements
- Python 3.11+
- nltk 3.4.5+
Installation
Install the latest version with pip:
pip install piraye
Usage
Create a Normalizer instance with NormalizerBuilder, then call its normalize method. The Configs section lists all available configs.
- Using the builder pattern:
from piraye import NormalizerBuilder
text = "این یک متن تسة اسﺘ , 24/12/1400 "
# Chain the configs to apply, then build the normalizer.
normalizer = NormalizerBuilder().alphabet_fa().digit_fa().punctuation_fa().tokenizing().remove_extra_spaces().build()
normalizer.normalize(text)  # "این یک متن تست است ، ۲۴/۱۲/۱۴۰۰"
- Using the constructor:
from piraye import NormalizerBuilder
from piraye.tasks.normalizer.normalizer_builder import Config
text = "این یک متن تسة اسﺘ , 24/12/1400 "
# Pass the configs as a list; the remaining options are keyword arguments.
normalizer = NormalizerBuilder(
    [Config.PUNCTUATION_FA, Config.ALPHABET_FA, Config.DIGIT_FA],
    remove_extra_spaces=True,
    tokenization=True,
).build()
normalizer.normalize(text)  # "این یک متن تست است ، ۲۴/۱۲/۱۴۰۰"
Also see other examples
Configs
Config | Function | Description |
---|---|---|
ALPHABET_AR | alphabet_ar | map alphabet characters to Arabic |
ALPHABET_EN | alphabet_en | map alphabet characters to English |
ALPHABET_FA | alphabet_fa | map alphabet characters to Persian |
DIGIT_AR | digit_ar | convert digits to Arabic digits |
DIGIT_EN | digit_en | convert digits to English digits |
DIGIT_FA | digit_fa | convert digits to Persian digits |
DIACRITIC_DELETE | diacritic_delete | remove all diacritics |
SPACE_DELETE | space_delete | remove all spaces |
SPACE_NORMAL | space_normal | normalize space characters (such as NO-BREAK SPACE, Tab, etc.) |
SPACE_KEEP | space_keep | map space characters without normalizing them |
PUNCTUATION_AR | punctuation_ar | map punctuation to Arabic punctuation |
PUNCTUATION_FA | punctuation_fa | map punctuation to Persian punctuation |
PUNCTUATION_EN | punctuation_en | map punctuation to English punctuation |
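To make the digit and diacritic rows of the table concrete, here is a minimal standalone sketch of what such mappings can look like. It uses only the Python standard library and is not Piraye's implementation; the helper names `to_persian_digits`, `to_english_digits`, and `strip_diacritics` are hypothetical and introduced only for illustration.

```python
import unicodedata

# Digit sets built from code points to avoid bidirectional-text confusion.
EN_DIGITS = "0123456789"
FA_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))  # Extended Arabic-Indic digits
AR_DIGITS = "".join(chr(0x0660 + i) for i in range(10))  # Arabic-Indic digits

# Translation tables: every digit from the other two sets maps to the target set.
_TO_FA = str.maketrans(EN_DIGITS + AR_DIGITS, FA_DIGITS * 2)
_TO_EN = str.maketrans(FA_DIGITS + AR_DIGITS, EN_DIGITS * 2)


def to_persian_digits(text: str) -> str:
    """Hypothetical helper: map English and Arabic digits to Persian digits."""
    return text.translate(_TO_FA)


def to_english_digits(text: str) -> str:
    """Hypothetical helper: map Persian and Arabic digits to English digits."""
    return text.translate(_TO_EN)


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: drop nonspacing combining marks (category "Mn").

    Assumption: treating every "Mn" character as a diacritic; a real
    normalizer may use a narrower or broader set.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
```

In Piraye itself, the equivalent behavior is selected through the configs listed in the table (for example `digit_fa` or `diacritic_delete`) rather than by calling helpers like these.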
Other attributes:
- remove_extra_spaces : collapse runs of multiple spaces into a single space
- tokenization : replace punctuation characters only where they stand alone as tokens
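As a rough illustration of the extra-space handling described above, the sketch below collapses runs of spaces with a regular expression. It is a standalone example, not Piraye's code: `collapse_spaces` is a hypothetical name, and whether the library also trims leading or trailing spaces or handles other whitespace characters is not stated here.

```python
import re


def collapse_spaces(text: str) -> str:
    """Hypothetical helper: replace each run of two or more spaces with one space."""
    return re.sub(r" {2,}", " ", text)
```

For example, `collapse_spaces("a   b  c")` gives `"a b c"`, while single spaces and other whitespace such as tabs are left untouched.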
Development
- Install the development dependencies with:
pip install -e .[dev]
License
GNU Lesser General Public License v2.1
Primarily used for software libraries, the GNU LGPL requires that derived works be licensed under the same license, but works that only link to it do not fall under this restriction. There are two commonly used versions of the GNU LGPL.
See LICENSE