Persian Text Pre-Processing Tool

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

ParsiNorm

Normalization is essential for format unification in purely textual applications. For language models embedded in speech processing modules, however, normalization is not limited to format unification: it must also convert every readable symbol, number, and similar token to the way it is pronounced.

Functionalities

General Normalization

Sentence tokenizer (requires postagger.model in a folder named resources inside the directory where parsinorm is installed, for example "/home/yourComputerName/.anaconda3/lib/python3.8/site-packages/parsinorm/resources")
Word tokenizer
Normalizing Persian and English characters
Normalizing numbers (converting them to a single Persian digit form)
Converting Persian, English, and Arabic symbols to normalized characters
Normalizing punctuation
Removing emojis
Converting HTML entities (such as &quot;) to characters and symbols
Unifying the decimal separator in floating-point numbers
Removing the various comma characters between digits
Removing repeated punctuation
Removing semi-space characters (replacing them with nothing)
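
A few of the steps listed above can be illustrated with the standard library alone. The following is a minimal sketch, not ParsiNorm's implementation: the function names, the character tables, and the order of the steps are assumptions chosen for illustration, and the real tool covers far more characters.

```python
import html
import re

# Hypothetical tables for illustration; ParsiNorm's own mappings are larger.
# Map ASCII digits and Arabic-Indic digits to Persian digits (U+06F0..U+06F9).
_DIGITS = {}
for i in range(10):
    persian_digit = chr(0x06F0 + i)
    _DIGITS[ord(str(i))] = persian_digit   # ASCII 0-9
    _DIGITS[0x0660 + i] = persian_digit    # Arabic-Indic digits

# Arabic yeh (U+064A) and kaf (U+0643) mapped to Persian yeh (U+06CC) and kaf (U+06A9).
_LETTERS = {0x064A: "\u06cc", 0x0643: "\u06a9"}

# Punctuation whose repeated runs are collapsed (assumption: dots are left alone).
_PUNCT = "!?,;\u061f\u060c\u061b"
_REPEATED_PUNCT = re.compile("([" + re.escape(_PUNCT) + "])\\1+")

# ASCII comma or Arabic thousands separator (U+066C) sitting between two digits.
_SEPARATOR_BETWEEN_DIGITS = re.compile("(?<=\\d)[,\u066c](?=\\d)")


def unify_digits(text):
    """Convert ASCII and Arabic-Indic digits to Persian digits."""
    return text.translate(_DIGITS)


def unify_letters(text):
    """Replace Arabic yeh and kaf with their Persian forms."""
    return text.translate(_LETTERS)


def remove_separators_between_digits(text):
    """Drop comma-like separators that appear between two digits."""
    return _SEPARATOR_BETWEEN_DIGITS.sub("", text)


def collapse_repeated_punctuation(text):
    """Reduce a run of one repeated punctuation mark to a single mark."""
    return _REPEATED_PUNCT.sub(r"\1", text)


def normalize(text):
    """Apply the sketched steps in one fixed (assumed) order."""
    text = html.unescape(text)              # HTML entities to characters
    text = unify_letters(text)
    text = remove_separators_between_digits(text)
    text = unify_digits(text)
    return collapse_repeated_punctuation(text)
```

Calling normalize('&quot;1,234&quot;!!!') unescapes the quotes, removes the comma, converts the digits to Persian digits, and leaves a single exclamation mark.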

Speech Normalization

Converting email addresses and URLs to how they are pronounced
Converting dates and times to how they are pronounced
Converting special numbers to how they are pronounced
Converting English and Persian abbreviations to how they are pronounced
Converting telephone numbers to how they are pronounced in Persian
Converting currencies to how they are read
Converting some symbols to how they are read, such as %, °, *, #, +, &, Δ
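
To make the idea of speech normalization concrete, here is a small sketch that spells out integers from 0 to 999 in Persian words. It is an illustration under stated assumptions, not ParsiNorm's algorithm: the function name number_to_words, the word tables, and the 0 to 999 limit are all hypothetical choices for this example.

```python
# Hypothetical word tables for illustration only.
ONES = ["صفر", "یک", "دو", "سه", "چهار", "پنج", "شش", "هفت", "هشت", "نه"]
TEENS = ["ده", "یازده", "دوازده", "سیزده", "چهارده",
         "پانزده", "شانزده", "هفده", "هجده", "نوزده"]
TENS = ["", "", "بیست", "سی", "چهل", "پنجاه", "شصت", "هفتاد", "هشتاد", "نود"]
HUNDREDS = ["", "صد", "دویست", "سیصد", "چهارصد",
            "پانصد", "ششصد", "هفتصد", "هشتصد", "نهصد"]

# Persian joins the parts of a number with the conjunction "and" (U+0648).
JOINER = " \u0648 "


def number_to_words(n):
    """Spell out an integer in the range 0..999 in Persian words."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n == 0:
        return ONES[0]
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(HUNDREDS[hundreds])
    if 10 <= rest <= 19:
        # 10..19 have their own single words.
        parts.append(TEENS[rest - 10])
    else:
        tens, ones = divmod(rest, 10)
        if tens:
            parts.append(TENS[tens])
        if ones:
            parts.append(ONES[ones])
    return JOINER.join(parts)
```

With these tables, number_to_words(122) joins the words for one hundred, twenty, and two with the conjunction, which is the same wording the convert_numbers_to_text example in the Usage section produces for '122'.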

Usage

>>> from parsinorm import Mail_url_cleaner
>>> mail_url_cleaner = Mail_url_cleaner()
>>> mail_url_cleaner.find_mails_clean(sentence="info@hara.ai")
info at hara dot ai

>>> mail_url_cleaner.find_urls_clean(sentence="https://hara.ai/#services")
https do noghte slash slash hara dot ai

>>> from parsinorm import Date_time_to_text
>>> date_time_to_text = Date_time_to_text()
>>> date_time_to_text.date_to_text(sentence='2021/10/27')
بیست و هفتم اکتبر سال دو هزار و بیست و یک

>>> date_time_to_text.time_to_text(sentence='22:57:11')
بیست و دو و پنجاه و هفت دقیقه و  یازده ثانیه

>>> from parsinorm import General_normalization
>>> general_normalization = General_normalization()
>>> general_normalization.alphabet_correction(sentence='ﻙﯘݙݤﮮ')
کودکی

>>> general_normalization.semi_space_correction(sentence='کتاب\u200bخانه')
کتابخانه

>>> general_normalization.english_correction(sentence='naïve')
naive

>>> general_normalization.html_correction(sentence='&quot;')
"

>>> general_normalization.arabic_correction(sentence='ﷺ')
صلی الله علیه و سلم

>>> general_normalization.punctuation_correction(sentence="…")
...

>>> general_normalization.specials_chars(sentence="℡")
TEL

>>> general_normalization.remove_emojis(sentence='😊')


>>> general_normalization.unique_floating_point(sentence='1،2')
۱.۲

>>> general_normalization.remove_comma_between_numbers(sentence='1٬234')
۱۲۳۴

>>> general_normalization.number_correction(sentence="⑤")
۵

>>> general_normalization.remove_not_desired_chars(sentence="^ Hi ~")
  Hi  

>>> general_normalization.remove_repeated_punctuation(sentence="!!!!!")
!

>>> from parsinorm import Telephone_number
>>> telephone_number = Telephone_number()
>>> telephone_number.find_phones_replace(sentence='تلفن ۰۲۱۳۳۴۵۶۷۸۸')
تلفن   صفر  بیست و یک سی و سه چهل و پنج شصت و هفت هشتاد و هشت

>>> from parsinorm import Abbreviation
>>> abbreviation = Abbreviation()
>>> abbreviation.replace_date_abbreviation(sentence=".در سال 1400 ه.ش")
در سال 1400 هجری شمسی

>>> abbreviation.replace_persian_label_abbreviation(sentence='امام زمان (عج)')
امام زمان  عجل الله تعالی فرجه الشریف 

>>> abbreviation.replace_law_abbreviation(sentence='در ق.ا آمده است')
در قانون اساسی آمده است

>>> abbreviation.replace_book_abbreviation(sentence='به کتاب زیر ر.ک مراجعه کنید')
به کتاب زیر رجوع کنید مراجعه کنید

>>> abbreviation.replace_other_abbreviation(sentence='در قانون ج.ا آمده است')
در قانون جمهوری اسلامی آمده است

>>> abbreviation.replace_English_abbrevations(sentence='U.S.A')
یو اس آ

>>> from parsinorm import TTS_normalization
>>> tts_normalization = TTS_normalization()
>>> tts_normalization.math_correction(sentence='⅞')
هفت هشتم

>>> tts_normalization.replace_currency(sentence='۳۳$')
۳۳ دلار

>>> tts_normalization.replace_symbols(sentence='۳۳°')
۳۳ درجه 

>>> from parsinorm import Special_numbers
>>> special_numbers = Special_numbers()
>>> special_numbers.convert_numbers_to_text(sentence='122')
 صد و بیست و دو

>>> special_numbers.replace_national_code(sentence='0499370899')
صفر  چهار   نهصد و نود و سه   هفتاد   هشتصد و نود و نه

>>> special_numbers.replace_card_number(sentence='6037701689095443')
شصت   سی و هفت   هفتاد   شانزده   هشتاد و نه   صفر  نه   پنجاه و چهار   چهل و سه

>>> special_numbers.replace_shaba(sentence='IR820540102680020817909002')
 آی آر   هشتاد و دو   صفر  پنج   چهل   ده   بیست و شش   هشتاد   صفر  دو   صفر  هشت   هفده   نود   نود   صفر  دو 

>>> from parsinorm import Tokenizer
>>> tokenizer = Tokenizer()
>>> tokenizer.sentence_tokenize('این مثالی است که در آن یک جمله فقط بر اساس علائم نگارشی جدا می‌شود.',verb_seperator= False)
['این مثالی است که در آن یک جمله فقط بر اساس علائم نگارشی جدا میشود .']

>>> tokenizer.sentence_tokenize('این مثالی است که در آن یک جمله با فعل تمام شده‌است ولی با نقطه تمام نشده‌است به همین دلیل آن را بر اساس فعل جدا می‌کنیم',verb_seperator= True)
[' این مثالی است',
 ' که در آن یک جمله با فعل تمام شده\u200cاست',
 ' ولی با نقطه تمام نشده\u200cاست',
 ' به همین دلیل آن را بر اساس فعل جدا می\u200cکنیم']
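
The sentence tokenizer depends on postagger.model being present in the resources folder of the installed package, as noted in the feature list. The sketch below shows one way to find that folder without hard-coding a path. The helper names package_resources_dir and model_is_installed are hypothetical and not part of ParsiNorm; the code only assumes that the package is importable as parsinorm, as in the examples above.

```python
import importlib.util
from pathlib import Path


def package_resources_dir(package_name, subdir="resources"):
    """Return the path of `subdir` inside an installed package's directory.

    The package is located but not imported; the folder may not exist yet.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"package {package_name!r} is not installed")
    # spec.origin points at the package's __init__.py.
    return Path(spec.origin).parent / subdir


def model_is_installed(package_name, model_name="postagger.model"):
    """Check whether the model file is present in the resources folder."""
    return (package_resources_dir(package_name) / model_name).is_file()


# Example, assuming the package is installed:
#   target = package_resources_dir("parsinorm")
#   target.mkdir(exist_ok=True)   # then copy postagger.model into `target`
```

If model_is_installed("parsinorm") returns False, copying postagger.model into the reported folder should satisfy the requirement described above.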



>>> tokenizer.word_tokenize('می‌توانید از طریق اینemail با ما در ارتباط باشید: info@hara.ai. همچنین با هشتگ #hara ما را دنبال کنید')
['می\u200cتوانید',
 'از',
 'طریق',
 'این',
 'email',
 'با',
 'ما',
 'در',
 'ارتباط',
 'باشید',
 ':',
 'info@hara.ai',
 '.',
 'همچنین',
 'با',
 'هشتگ',
 'hara#',
 'ما',
 'را',
 'دنبال',
 'کنید']

Reference

If you use or discuss this normalization tool in your work, please cite our paper:

@inproceedings{oji2021parsinorm,
  title={ParsiNorm: A Persian Toolkit for Speech Processing Normalization},
  author={Oji, Romina and Razavi, Seyedeh Fatemeh and Dehsorkh, Sajjad Abdi and Hariri, Alireza and Asheri, Hadi and Hosseini, Reshad},
  booktitle={2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS)},
  pages={1--5},
  year={2021},
  organization={IEEE}
}

Contact

If you have any technical questions regarding the dataset or publication, please create an issue in this repository.

Release history

- 0.0.4 (this version): Mar 6, 2024
- 0.0.2: Mar 6, 2024

Download files

Source Distribution

parsinorm-fork-0.0.4.tar.gz (31.7 kB)

Uploaded Mar 6, 2024 Source

Built Distribution

parsinorm_fork-0.0.4-py3-none-any.whl (30.5 kB)

Uploaded Mar 6, 2024 Python 3

File details

Details for the file parsinorm-fork-0.0.4.tar.gz.

File metadata

Download URL: parsinorm-fork-0.0.4.tar.gz
Upload date: Mar 6, 2024
Size: 31.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.9.18

File hashes

Hashes for parsinorm-fork-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`6927ee3ac79f1788987aa8adb82d686261876f070a9e5a2fd6ba79eac1b69574`
MD5	`211e1f06fea37f917913dab63a6e5fd4`
BLAKE2b-256	`8d00f430c1e7b7986435b64b519d7a846abfed8425317b47344555cf2fab320c`

File details

Details for the file parsinorm_fork-0.0.4-py3-none-any.whl.

File metadata

Download URL: parsinorm_fork-0.0.4-py3-none-any.whl
Upload date: Mar 6, 2024
Size: 30.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.9.18

File hashes

Hashes for parsinorm_fork-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8761056b645e730f8e7ca23f6e5a33502de2839ef8ff870d4cfe37df661822c8`
MD5	`58662716ac0f563abeeeb17b5e49294b`
BLAKE2b-256	`5f45e9cafa6d9c18c09aaa95a1335e996e93ba2c5bceb64bfaba67fa7799d552`
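
The digests listed for the two files can be used to check a download before installing it. The following is a minimal sketch using only the standard library; the helper names sha256_of_file and matches_expected are hypothetical, and the constants are copied from the SHA256 rows of the tables above.

```python
import hashlib

# SHA256 digests copied from the file hash tables above.
SDIST_SHA256 = "6927ee3ac79f1788987aa8adb82d686261876f070a9e5a2fd6ba79eac1b69574"
WHEEL_SHA256 = "8761056b645e730f8e7ca23f6e5a33502de2839ef8ff870d4cfe37df661822c8"


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example, assuming the source distribution was downloaded to the current directory:
#   matches_expected("parsinorm-fork-0.0.4.tar.gz", SDIST_SHA256)
```

A False result means the file differs from the one described on this page and should not be installed.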
