Standardize your Persian text: Preprocessing, Embedding, and more!
PersianUtils
A package for preprocessing Persian text for search, standardization, and NLP pipelines.
Why PersianUtils?
Persian shares many letters with Arabic, but several of them have look-alike counterparts at different Unicode code points. As a result, the same word can be written in more than one way while looking almost identical on screen. Contextual forms of a character may also appear in text; they do not change the shape of the word, but they cause the same problem. Many non-standard Persian keyboards emit these Arabic or contextual code points, which makes the problem worse. This package converts your text to a standard form that uses the proper Persian characters.
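To make the problem concrete, here is a small sketch that uses only Python's standard library (not PersianUtils itself). The helper name `describe` and the sample word are chosen for illustration only.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and the official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Arabic letters and the Persian letters they are often confused with.
print(describe("\u064A"))  # Arabic Yeh
print(describe("\u06CC"))  # Persian (Farsi) Yeh
print(describe("\u0643"))  # Arabic Kaf
print(describe("\u06A9"))  # Persian Keheh

# The same word typed once with Arabic Kaf/Yeh and once with Persian Keheh/Yeh.
# Both render almost identically, but they are different strings.
word_arabic = "\u0643\u062A\u0627\u0628\u064A"
word_persian = "\u06A9\u062A\u0627\u0628\u06CC"
print(word_arabic == word_persian)  # False

# A contextual form: the medial form of Teh has its own code point,
# and compatibility normalization (NFKC) maps it back to the base letter.
medial_teh = "\uFE98"
print(unicodedata.normalize("NFKC", medial_teh) == "\u062A")  # True
```

A search index or word-embedding vocabulary built on unnormalized text would treat `word_arabic` and `word_persian` as two unrelated tokens.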
Installation
To install PersianUtils, you can use pip:
pip install persianutils
Usage
PersianUtils provides two functions for standardizing Persian text: "standardize" and "standardize4Word2vec".
standardize()
This function does the following:
- Replace Arabic characters with their Persian equivalents, such as ALEF_MAKSURA (from persianutils.ArabicAlphabet) with YE (from persianutils.PersianAlphabet).
- Remove Tanveens such as ـٍ , ـَ , etc.
- Replace contextual forms of a character with its base form, such as "ـتـ" with "ت".
- Replace Western and Eastern numerals with their Persian equivalents, such as 2 with ۲.
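The four steps above can be approximated with the standard library alone. The following is a rough sketch, not the package's implementation: the function name `standardize_sketch`, the three-entry letter mapping, and the choice of the diacritic range U+064B to U+0652 are all assumptions made for illustration.

```python
import unicodedata

# Assumption: a tiny sample of Arabic-to-Persian letter replacements.
ARABIC_TO_PERSIAN = {
    "\u064A": "\u06CC",  # Arabic Yeh   -> Persian Yeh
    "\u0649": "\u06CC",  # Alef Maksura -> Persian Yeh
    "\u0643": "\u06A9",  # Arabic Kaf   -> Persian Keheh
}

# Build one translation table covering letters, diacritics, and digits.
_TABLE = {ord(src): dst for src, dst in ARABIC_TO_PERSIAN.items()}

# Assumption: drop the short-vowel and tanween marks U+064B..U+0652.
_TABLE.update({cp: None for cp in range(0x064B, 0x0653)})

for i in range(10):
    persian_digit = chr(0x06F0 + i)      # Persian digits U+06F0..U+06F9
    _TABLE[ord("0") + i] = persian_digit  # Western digits 0..9
    _TABLE[0x0660 + i] = persian_digit    # Eastern Arabic digits U+0660..U+0669


def standardize_sketch(text: str) -> str:
    """Approximate the listed steps; not the real persianutils.standardize."""
    # NFKC folds contextual (presentation) forms back to their base letters.
    text = unicodedata.normalize("NFKC", text)
    return text.translate(_TABLE)


print(standardize_sketch("\u0633\u0644\u0627\u0645\u064C \u0639\u0644\u06CC\u06A9\u0645!"))
```

Note that NFKC also folds other compatibility characters (full-width forms, ligatures), which may be broader than what the package does.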
Example:
import persianutils as pu
raw_text = "سلامٌ علیکم!"
processed_text = pu.standardize(raw_text)
print(processed_text)
That would result in:
سلام علیکم!
standardize4Word2vec()
This function has these features:
- Same as step 1 of standardize().
- Same as step 2 of standardize().
- Same as step 3 of standardize().
- Replace all numerals (Eastern, Western, and Persian) with their Persian word forms, such as 2 with دو.
- Replace every punctuation mark with a single space. The punctuation marks are:
[!"#%\'()*+,-./:;<=>?@\[\]^_`{|}~’”“′‘\\\]؟؛«»،٪
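The punctuation step can be sketched with a regular expression. This is only an illustration of the idea: the `PUNCTUATION` string below is modeled on the list above rather than copied from the package source, and `replace_punctuation` is a hypothetical helper name. The numeral-to-word step is left out of this sketch.

```python
import re

# Assumption: a punctuation set modeled on the list above.
PUNCTUATION = (
    "!\"#%'()*+,-./:;<=>?@[]^_`{|}~\\"
    "\u2019\u201D\u201C\u2032\u2018"        # curly quotes and prime
    "\u061F\u061B\u00AB\u00BB\u060C\u066A"  # Persian/Arabic ? ; guillemets , %
)

# re.escape protects characters such as ], ^, - and \ inside the class.
PUNCT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]")


def replace_punctuation(text: str) -> str:
    """Replace each punctuation mark with exactly one space."""
    return PUNCT_RE.sub(" ", text)


print(repr(replace_punctuation("\u0633\u0644\u0627\u0645\u060C \u062F\u0646\u06CC\u0627!")))
```

Because every mark becomes one space, the result can contain leading, trailing, or doubled spaces; whether to collapse them afterwards (for example with `" ".join(text.split())`) depends on how the tokens are consumed downstream.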
Example:
import persianutils as pu
raw_text = "سلامٌ علیکم!"
processed_text = pu.standardize4Word2vec(raw_text)
print(processed_text)
This would result in:
سلام علیکم
Persian & Arabic Characters
The package also exposes named constants for Persian and Arabic characters. Persian letters are accessible from persianutils.PersianAlphabet:
from persianutils.PersianAlphabet import ALEF, BE, PE, TE
Or for Arabic:
from persianutils.ArabicAlphabet import ALEF_HAMZA_ABOVE_FINAL, HAMZA_ABOVE_ALEF
Contributing
We appreciate all contributions. If you're interested in contributing, please start by reading our Contributing Guide.
Project details
Download files
File details
Details for the file persianutils-1.0.0.tar.gz.
File metadata
- Download URL: persianutils-1.0.0.tar.gz
- Upload date:
- Size: 8.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f6bb2f3223253a7da49335034e258fb881fea39a9b4b8f4b6987721b0653390f |
| MD5 | 5fc034792410ab68b23c5221e074988a |
| BLAKE2b-256 | 7d69fc0f6acb4096fa37742077ad6afd6a2870df3738dbfcee1857e9b0eb735a |
File details
Details for the file persianutils-1.0.0-py3-none-any.whl.
File metadata
- Download URL: persianutils-1.0.0-py3-none-any.whl
- Upload date:
- Size: 9.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4a76ea74022e463bd5ac03986ad81fd5d14aae0bfca24d08b8edc41a98373bb8 |
| MD5 | 3c79bf3d086464e46a3ac579386088d2 |
| BLAKE2b-256 | 4f1776817b55baed51a49edca63957c153dc0745d34cbf37b6b8f98bad3d0980 |
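If you download one of the files manually, you can compare it against the SHA256 digest listed for it. The sketch below uses only `hashlib` from the standard library; the helper names `sha256_of_file` and `matches_published_digest` are made up for this example.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("persianutils-1.0.0.tar.gz", "<SHA256 digest listed for that file>")` should return `True` for an intact download, assuming the file sits in the current directory.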