A utility for normalizing Persian, Arabic, and English text

Piraye: NLP Utilities

Piraye is a Python library for normalizing Persian, Arabic, and English text.

Requirements

  • Python 3.11+
  • nltk 3.4.5+

Installation

You can install the latest version of Piraye via pip:

pip install piraye
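
To confirm that the install succeeded, one option is to query the installed version from Python. This is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and is not part of Piraye.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("piraye")
    print(f"piraye {found}" if found else "piraye is not installed")
```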

Usage

To use Piraye, build a Normalizer with NormalizerBuilder and call its normalize method on your text. The settings you pass to the builder determine which normalization steps run. The two examples below configure the same normalizer in different ways:

  • Using builder pattern:
from piraye import NormalizerBuilder

text = "این یک متن تسة اسﺘ       , 24/12/1400 "
# Chain one builder call per normalization step, then build the normalizer.
normalizer = NormalizerBuilder().alphabet_fa().digit_fa().punctuation_fa().tokenizing().remove_extra_spaces().build()
normalizer.normalize(text)  # "این یک متن تست است ، ۲۴/۱۲/۱۴۰۰"
  • Using constructor:
from piraye import NormalizerBuilder
from piraye.tasks.normalizer.normalizer_builder import Config

text = "این یک متن تسة اسﺘ       , 24/12/1400 "
# Pass the same steps as Config members and keyword arguments instead of chained calls.
normalizer = NormalizerBuilder([Config.PUNCTUATION_FA, Config.ALPHABET_FA, Config.DIGIT_FA], remove_extra_spaces=True,
                               tokenization=True).build()
normalizer.normalize(text)  # "این یک متن تست است ، ۲۴/۱۲/۱۴۰۰"

You can find more examples here

Configs

Each configuration is available both as a Config member (for the constructor) and as a builder function. The available configurations are:

Config | Function | Description
ALPHABET_AR | alphabet_ar | Map alphabet characters to Arabic
ALPHABET_EN | alphabet_en | Map alphabet characters to English
ALPHABET_FA | alphabet_fa | Map alphabet characters to Persian
DIGIT_AR | digit_ar | Convert digits to Arabic digits
DIGIT_EN | digit_en | Convert digits to English digits
DIGIT_FA | digit_fa | Convert digits to Persian digits
DIACRITIC_DELETE | diacritic_delete | Remove all diacritics
SPACE_DELETE | space_delete | Remove all spaces
SPACE_NORMAL | space_normal | Normalize space variants (such as NO-BREAK SPACE and tab) to a regular space
SPACE_KEEP | space_keep | Keep space characters as they are, without normalizing them
PUNCTUATION_AR | punctuation_ar | Map punctuation to Arabic punctuation
PUNCTUATION_FA | punctuation_fa | Map punctuation to Persian punctuation
PUNCTUATION_EN | punctuation_en | Map punctuation to English punctuation
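
Conceptually, the digit and diacritic configurations above are character-level mappings. The sketch below illustrates that idea with the standard library only. It is not Piraye's implementation: the function names (`to_persian_digits`, `to_english_digits`, `strip_diacritics`) are hypothetical, and it assumes diacritics can be approximated as Unicode combining marks (category `Mn`), which may differ from the set Piraye removes.

```python
import unicodedata

# Three digit sets: ASCII, Arabic-Indic (U+0660..U+0669), Persian (U+06F0..U+06F9).
ASCII_DIGITS = "0123456789"
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))


def _digit_table(target: str) -> dict:
    """Build a translation table sending every known digit form to `target`."""
    table = {}
    for source in (ASCII_DIGITS, ARABIC_DIGITS, PERSIAN_DIGITS):
        table.update(str.maketrans(source, target))
    return table


_TO_PERSIAN = _digit_table(PERSIAN_DIGITS)
_TO_ENGLISH = _digit_table(ASCII_DIGITS)


def to_persian_digits(text: str) -> str:
    """Rewrite ASCII and Arabic-Indic digits as Persian digits."""
    return text.translate(_TO_PERSIAN)


def to_english_digits(text: str) -> str:
    """Rewrite Persian and Arabic-Indic digits as ASCII digits."""
    return text.translate(_TO_ENGLISH)


def strip_diacritics(text: str) -> str:
    """Drop nonspacing combining marks (category Mn), such as Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
```

A lookup table keeps each conversion to a single pass over the text, and leaves every character outside the table untouched.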

Other attributes:

  • remove_extra_spaces: Collapses runs of consecutive spaces into a single space.
  • tokenization: Replaces punctuation characters only where they stand alone as tokens.
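
As a rough illustration of what remove_extra_spaces does, the sketch below collapses runs of spaces with a regular expression. It is an approximation under stated assumptions, not Piraye's code: the name `collapse_spaces` is hypothetical, and it does not attempt to reproduce how Piraye treats leading or trailing whitespace.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace each run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)
```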

Development

To set up a development environment, install dependencies with:

pip install -e .[dev]

License

GNU Lesser General Public License v2.1

Piraye is licensed under the GNU Lesser General Public License v2.1, which primarily applies to software libraries. See the LICENSE file for more details.

About

Piraye is maintained by Arusha.
