Persian Text Pre-Processing Tool

ParsiNorm

Normalization is essential for unifying text formats in purely textual applications. For language models embedded in speech processing modules, however, normalization is not limited to format unification: it must also convert every readable symbol, number, and so on into the way it is pronounced.

Functionalities

General Normalization

  • Sentence tokenizer (requires postagger.model in a folder named resources inside the directory where parsinorm is installed, for example "/home/yourComputerName/.anaconda3/lib/python3.8/site-packages/parsinorm/resources")
  • Word tokenizer
  • Normalizing Persian and English characters
  • Normalizing numbers (converting them to a single Persian digit form)
  • Converting Persian, English, and Arabic symbols to normalized characters
  • Normalizing punctuation
  • Removing emojis
  • Converting HTML tags to characters and symbols
  • Using a single floating-point format for numbers
  • Removing the various comma characters between digits
  • Removing repeated punctuation
  • Removing semi-spaces (converting them to an empty string)
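
Several of the general steps above amount to character-level mappings and regular-expression substitutions. The following is a minimal sketch of three of them in plain Python. It is not ParsiNorm's own implementation: the function names (unify_digits, strip_zero_width_space, collapse_repeated_punctuation) and the exact character sets are assumptions made for illustration.

```python
import re

# Build the digit strings from code points so their order is unambiguous.
ASCII_DIGITS = "0123456789"
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))

# Map both ASCII and Arabic-Indic digits onto Persian digits.
_DIGIT_TABLE = str.maketrans(ASCII_DIGITS + ARABIC_INDIC_DIGITS, PERSIAN_DIGITS * 2)

# A run of the same mark: ! ? , ; : and the Arabic ? , ; (U+061F, U+060C, U+061B).
# The period is left out on purpose so that "..." is not collapsed.
_REPEATED_PUNCT = re.compile(r"([!?,;:\u061f\u060c\u061b])\1+")


def unify_digits(text: str) -> str:
    """Rewrite every ASCII or Arabic-Indic digit as the matching Persian digit."""
    return text.translate(_DIGIT_TABLE)


def strip_zero_width_space(text: str) -> str:
    """Remove U+200B (zero-width space) characters."""
    return text.replace("\u200b", "")


def collapse_repeated_punctuation(text: str) -> str:
    """Replace a run of one repeated punctuation mark with a single mark."""
    return _REPEATED_PUNCT.sub(r"\1", text)
```

A real normalizer needs far larger mapping tables than this; the sketch only shows the shape of the technique.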

Speech Normalization

  • Converting email addresses and URLs to how they are pronounced
  • Converting dates and times to how they are pronounced
  • Converting special numbers to how they are pronounced
  • Converting English and Persian abbreviations to how they are pronounced
  • Converting telephone numbers to how they are pronounced in Persian
  • Converting currencies to how they are read
  • Converting some symbols to how they are read, such as %, °, *, #, +, &, Δ
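
To make the idea of "converting to how it is pronounced" concrete, here is a small sketch for email addresses in plain Python. It is an illustration under stated assumptions, not ParsiNorm's code: the regular expression is a deliberately simple approximation of an email address, and the names EMAIL_RE and verbalize_emails are hypothetical.

```python
import re

# Simplified email pattern (an assumption for this sketch, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def _spell_out(match: "re.Match[str]") -> str:
    """Spell out the separators of one matched address."""
    return match.group(0).replace("@", " at ").replace(".", " dot ")


def verbalize_emails(text: str) -> str:
    """Replace each email address in text with a spoken-style form."""
    return EMAIL_RE.sub(_spell_out, text)


print(verbalize_emails("info@hara.ai"))  # info at hara dot ai
```

Only the separators inside a matched address are rewritten, so sentence punctuation around the address is left untouched.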

Usage

>>> from parsinorm import Mail_url_cleaner
>>> mail_url_cleaner = Mail_url_cleaner()
>>> mail_url_cleaner.find_mails_clean(sentence="info@hara.ai")
info at hara dot ai

>>> mail_url_cleaner.find_urls_clean(sentence="https://hara.ai/#services")
https do noghte slash slash hara dot ai

>>> from parsinorm import Date_time_to_text
>>> date_time_to_text = Date_time_to_text()
>>> date_time_to_text.date_to_text(sentence='2021/10/27')
بیست و هفتم اکتبر سال دو هزار و بیست و یک

>>> date_time_to_text.time_to_text(sentence='22:57:11')
بیست و دو و پنجاه و هفت دقیقه و  یازده ثانیه

>>> from parsinorm import General_normalization
>>> general_normalization = General_normalization()
>>> general_normalization.alphabet_correction(sentence='ﻙﯘݙݤﮮ')
کودکی

>>> general_normalization.semi_space_correction(sentence='کتاب\u200bخانه')
کتابخانه

>>> general_normalization.english_correction(sentence='naïve')
naive

>>> general_normalization.html_correction(sentence='&quot;')
"

>>> general_normalization.arabic_correction(sentence='ﷺ')
صلی الله علیه و سلم

>>> general_normalization.punctuation_correction(sentence="…")
...

>>> general_normalization.specials_chars(sentence="℡")
TEL

>>> general_normalization.remove_emojis(sentence='😊')


>>> general_normalization.unique_floating_point(sentence='1،2')
۱.۲

>>> general_normalization.remove_comma_between_numbers(sentence='1٬234')
۱۲۳۴

>>> general_normalization.number_correction(sentence="⑤")
۵

>>> general_normalization.remove_not_desired_chars(sentence="^ Hi ~")
  Hi  

>>> general_normalization.remove_repeated_punctuation(sentence="!!!!!")
!
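
The calls above each take a sentence keyword argument and print a string, so they can plausibly be chained into one pass. The sketch below shows one way to do that. The helper make_pipeline is hypothetical and not part of ParsiNorm, and it assumes every step accepts sentence=... and returns a string, as the examples suggest.

```python
from typing import Callable

# A step is any callable that accepts sentence=<str> and returns a str.
Step = Callable[..., str]


def make_pipeline(*steps: Step) -> Callable[[str], str]:
    """Return a function that applies the given steps in order."""

    def run(sentence: str) -> str:
        for step in steps:
            sentence = step(sentence=sentence)
        return sentence

    return run


# Hypothetical usage with methods shown in this README (requires parsinorm):
# from parsinorm import General_normalization
# gn = General_normalization()
# normalize = make_pipeline(
#     gn.html_correction,
#     gn.alphabet_correction,
#     gn.number_correction,
#     gn.remove_repeated_punctuation,
# )
# normalize("...")
```

The order of the steps matters, because later steps see the output of earlier ones; the order in the commented usage is only an example.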

>>> from parsinorm import Telephone_number
>>> telephone_number = Telephone_number()
>>> telephone_number.find_phones_replace(sentence='تلفن ۰۲۱۳۳۴۵۶۷۸۸')
تلفن   صفر  بیست و یک سی و سه چهل و پنج شصت و هفت هشتاد و هشت

>>> from parsinorm import Abbreviation
>>> abbreviation = Abbreviation()
>>> abbreviation.replace_date_abbreviation(sentence=".در سال 1400 ه.ش")
در سال 1400 هجری شمسی

>>> abbreviation.replace_persian_label_abbreviation(sentence='امام زمان (عج)')
امام زمان  عجل الله تعالی فرجه الشریف 

>>> abbreviation.replace_law_abbreviation(sentence='در ق.ا آمده است')
در قانون اساسی آمده است

>>> abbreviation.replace_book_abbreviation(sentence='به کتاب زیر ر.ک مراجعه کنید')
به کتاب زیر رجوع کنید مراجعه کنید

>>> abbreviation.replace_other_abbreviation(sentence='در قانون ج.ا آمده است')
در قانون جمهوری اسلامی آمده است

>>> abbreviation.replace_English_abbrevations(sentence='U.S.A')
یو اس آ

>>> from parsinorm import TTS_normalization
>>> tts_normalization = TTS_normalization()
>>> tts_normalization.math_correction(sentence='⅞')
هفت هشتم

>>> tts_normalization.replace_currency(sentence='۳۳$')
۳۳ دلار

>>> tts_normalization.replace_symbols(sentence='۳۳°')
۳۳ درجه 

>>> from parsinorm import Special_numbers
>>> special_numbers = Special_numbers()
>>> special_numbers.convert_numbers_to_text(sentence='122')
 صد و بیست و دو

>>> special_numbers.replace_national_code(sentence='0499370899')
صفر  چهار   نهصد و نود و سه   هفتاد   هشتصد و نود و نه

>>> special_numbers.replace_card_number(sentence='6037701689095443')
شصت   سی و هفت   هفتاد   شانزده   هشتاد و نه   صفر  نه   پنجاه و چهار   چهل و سه

>>> special_numbers.replace_shaba(sentence='IR820540102680020817909002')
 آی آر   هشتاد و دو   صفر  پنج   چهل   ده   بیست و شش   هشتاد   صفر  دو   صفر  هشت   هفده   نود   نود   صفر  دو 
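
The card number and Sheba outputs above appear to read the digits in groups of two, with a leading zero in a group read as a separate "صفر". This is inferred from the printed results only, not from the library source. A minimal sketch of that grouping step, with a hypothetical helper name:

```python
from typing import List


def split_into_pairs(digits: str) -> List[str]:
    """Split a digit string into consecutive two-digit groups.

    If the length is odd, the last group holds a single digit.
    """
    if not digits.isdigit():
        raise ValueError("expected a non-empty string of digits")
    return [digits[i:i + 2] for i in range(0, len(digits), 2)]


print(split_into_pairs("6037701689095443"))
# ['60', '37', '70', '16', '89', '09', '54', '43']
```

Each group would then be converted to words separately. The national code example uses a different grouping, so this helper does not cover it.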

>>> from parsinorm import Tokenizer
>>> tokenizer = Tokenizer()
>>> tokenizer.sentence_tokenize('این مثالی است که در آن یک جمله فقط بر اساس علائم نگارشی جدا می‌شود.',verb_seperator= False)
['این مثالی است که در آن یک جمله فقط بر اساس علائم نگارشی جدا میشود .']

>>> tokenizer.sentence_tokenize('این مثالی است که در آن یک جمله با فعل تمام شده‌است ولی با نقطه تمام نشده‌است به همین دلیل آن را بر اساس فعل جدا می‌کنیم',verb_seperator= True)
[' این مثالی است',
 ' که در آن یک جمله با فعل تمام شده\u200cاست',
 ' ولی با نقطه تمام نشده\u200cاست',
 ' به همین دلیل آن را بر اساس فعل جدا می\u200cکنیم']
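
The sentence tokenizer needs postagger.model in a resources folder inside the installed parsinorm package. The sketch below checks for that file before the tokenizer is used. It relies only on the Python standard library; the helper name find_resource is hypothetical, and the resources/postagger.model layout is taken from the note in the functionality list.

```python
from importlib.util import find_spec
from pathlib import Path
from typing import Optional


def find_resource(filename: str = "postagger.model",
                  package: str = "parsinorm") -> Optional[Path]:
    """Return the path of <package>/resources/<filename>, or None if absent."""
    spec = find_spec(package)
    # find_spec returns None when the package is not installed.
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / "resources" / filename
        if candidate.is_file():
            return candidate
    return None


if find_resource() is None:
    print("postagger.model not found; copy it into parsinorm/resources first")
```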



>>> tokenizer.word_tokenize('می‌توانید از طریق اینemail با ما در ارتباط باشید: info@hara.ai. همچنین با هشتگ #hara ما را دنبال کنید')
['می\u200cتوانید',
 'از',
 'طریق',
 'این',
 'email',
 'با',
 'ما',
 'در',
 'ارتباط',
 'باشید',
 ':',
 'info@hara.ai',
 '.',
 'همچنین',
 'با',
 'هشتگ',
 'hara#',
 'ما',
 'را',
 'دنبال',
 'کنید']

Reference

If you use or discuss this normalization tool in your work, please cite our paper:

@inproceedings{oji2021parsinorm,
  title={ParsiNorm: A Persian Toolkit for Speech Processing Normalization},
  author={Oji, Romina and Razavi, Seyedeh Fatemeh and Dehsorkh, Sajjad Abdi and Hariri, Alireza and Asheri, Hadi and Hosseini, Reshad},
  booktitle={2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS)},
  pages={1--5},
  year={2021},
  organization={IEEE}
}

Contact

If you have any technical questions regarding the dataset or publication, please create an issue in this repository.
