Standardize your Persian text: Preprocessing, Embedding, and more!

PersianUtils

A package for preprocessing Persian text for search, standardization, and NLP.

Why PersianUtils?

Persian shares many letters with Arabic, but several letters that look nearly identical have different Unicode code points. As a result, the same word can be written in several ways that look almost exactly alike. Text may also contain contextual forms of a character, which leave the shape of the word unchanged but cause the same problem. Many non-standard Persian keyboards do not follow the standard character set, which makes the problem worse. This package converts your Persian text to a standard form that uses the original Persian characters.
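The following is a minimal sketch of the problem described above, using only the Python standard library. The word and the helper name describe are chosen for illustration and are not part of PersianUtils.

```python
import unicodedata

# The same two-letter word, typed once with Persian letters and once with
# their Arabic look-alikes. Escapes are used so the code points are explicit.
persian_word = "\u06CC\u06A9"  # FARSI YEH + KEHEH
arabic_word = "\u064A\u0643"   # YEH + KAF


def describe(text):
    """Return a (code point, Unicode name) pair for each character of text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The two strings look alike in many fonts but do not compare equal.
print(persian_word == arabic_word)
print(describe(persian_word))
print(describe(arabic_word))
```

A search index or a word embedding would treat the two spellings as unrelated words, which is why the text is standardized first.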

Installation

To install PersianUtils, you can use pip:

pip install persianutils

Usage

Two functions are provided for standardizing Persian text: standardize() and standardize4Word2vec().

standardize()

This function does the following:

  1. Replace Arabic characters with their Persian equivalents. For example, ALEF_MAKSURA from persianutils.ArabicAlphabet is replaced with YE from persianutils.PersianAlphabet.
  2. Remove tanween and other diacritics, such as ـٍ and ـَ.
  3. Replace contextual forms of a character with its original form, for example "ـتـ‎" with "ت".
  4. Replace Western and Eastern numerals with their Persian equivalents, for example 2 with ۲.

Example:

import persianutils as pu
raw_text = "سلامٌ علیکم!"
processed_text = pu.standardize(raw_text)
print(processed_text)

That would result in:

سلام علیکم!
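The four steps listed for standardize() can be approximated with the standard library alone. This is a rough sketch under stated assumptions, not the package's implementation: the function name sketch_standardize, the three-letter mapping, the diacritic range U+064B to U+0652, and the use of NFKC normalization for contextual forms are all illustrative choices. NFKC also rewrites other compatibility characters, so it is broader than step 3.

```python
import unicodedata


def _build_table():
    """Build a translation table for letters, diacritics, and digits."""
    table = {
        "\u064A": "\u06CC",  # ARABIC YEH -> FARSI YEH
        "\u0649": "\u06CC",  # ALEF MAKSURA -> FARSI YEH
        "\u0643": "\u06A9",  # ARABIC KAF -> KEHEH
    }
    # Assumption: drop FATHATAN (U+064B) through SUKUN (U+0652).
    for code_point in range(0x064B, 0x0653):
        table[chr(code_point)] = None
    # Western (0-9) and Arabic-Indic (U+0660..) digits -> Persian digits (U+06F0..).
    for i in range(10):
        persian_digit = chr(0x06F0 + i)
        table[str(i)] = persian_digit
        table[chr(0x0660 + i)] = persian_digit
    return str.maketrans(table)


_TABLE = _build_table()


def sketch_standardize(text):
    """Approximate the four steps; a hypothetical stand-in, not pu.standardize."""
    # NFKC folds presentation forms such as U+FE98 (TEH MEDIAL FORM) to U+062A.
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(_TABLE)


print(sketch_standardize("\u0633\u0644\u0627\u0645\u064C \u0639\u0644\u064A\u0643\u0645!"))
```

A single translation table keeps the letter, diacritic, and digit rules in one pass over the text. The real package may cover more characters than this sketch does.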

standardize4Word2vec()

This function does the following:

  1. Same as step 1 of standardize().
  2. Same as step 2 of standardize().
  3. Same as step 3 of standardize().
  4. Replace all numerals (Eastern, Western, and Persian) with their Persian words, for example 2 with دو.
  5. Replace each punctuation mark with a single space. The punctuation marks are: [!"#%\'()*+,-./:;<=>?@\[\]^_`{|}~’”“′‘\\\]؟؛«»،٪

Example:

import persianutils as pu
raw_text = "سلامٌ علیکم!"
processed_text = pu.standardize4Word2vec(raw_text)
print(processed_text)

This would result in:

سلام علیکم 
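The punctuation step of standardize4Word2vec() can be sketched as follows. This is an illustration only: the name sketch_replace_punctuation is hypothetical, the character set is transcribed from the list above, and the conversion of numerals to words is left out because the package's rules for multi-digit numbers are not described here.

```python
# Punctuation marks transcribed from the list above. Non-ASCII marks are
# written as escapes so that right-to-left characters do not reorder the line.
PUNCTUATION = (
    "!\"#%'()*+,-./:;<=>?@[]^_`{|}~\\"
    "\u2019\u201D\u201C\u2032\u2018"          # curly quotes and prime
    "\u061F\u061B\u00AB\u00BB\u060C\u066A"    # Arabic ?, ;, guillemets, comma, percent
)

# Map every punctuation character to a single space.
_PUNCTUATION_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def sketch_replace_punctuation(text):
    """Replace each listed punctuation mark with one space (hypothetical helper)."""
    return text.translate(_PUNCTUATION_TABLE)


# The trailing "!" becomes a trailing space, as in the example output above.
print(repr(sketch_replace_punctuation("\u0633\u0644\u0627\u0645 \u0639\u0644\u06CC\u06A9\u0645!")))
```

Characters that are not in the list, such as $ and &, pass through unchanged in this sketch.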

Persian & Arabic Characters

Lists of Persian and Arabic characters are also available. Persian characters are accessible from persianutils.PersianAlphabet:

from persianutils.PersianAlphabet import ALEF, BE, PE, TE

Arabic characters are accessible from persianutils.ArabicAlphabet:

from persianutils.ArabicAlphabet import ALEF_HAMZA_ABOVE_FINAL, HAMZA_ABOVE_ALEF

Contributing

We appreciate all contributions. If you're interested in contributing, please start by reading our Contributing Guide.
