Simple grapheme-to-phoneme converter for Persian (Farsi)

Simple Persian (Farsi) grapheme-to-phoneme converter

pip install PersianG2p

It uses a neural net to convert Persian text (written in Arabic script) into a string of phonemes.

Features of Farsi

  • Arabic script
  • the characters have different forms depending on their position in the word
  • the vowels a, e, o are often pronounced but not written; for example:
    • سس is pronounced sos but written ss
    • شش is pronounced šeš but written šš
    • من is pronounced man but written mn
    • سلام is pronounced salām but written slām
    • شما is pronounced šomā but written šmā
    • ممنون is pronounced mamnun but written mmnun
  • the same symbol can be pronounced differently:
    • in the word مو the symbol و is pronounced u, but in the word میوه it follows a vowel and is pronounced v
    • the word تو is pronounced to or tu depending on the meaning
    • the symbol ه (hā-ye docešm) is pronounced like a (e) in the word نه and like h in the word آنها
  • no sequences of adjacent vowel sounds
  • verbs come at the end of the sentence
  • no grammatical gender
  • no grammatical cases
  • adjectives and other modifiers follow the noun
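
The unwritten-vowel examples in the list above can be reproduced with a small sketch. It assumes a deliberately simplified rule (the short vowels a, e, o are dropped in writing, everything else is kept), and the helper name `written_skeleton` is made up for this illustration; it is not part of PersianG2p.

```python
import unicodedata

# Short vowels that Persian spelling usually leaves out (simplified assumption).
SHORT_VOWELS = set("aeo")

def written_skeleton(pronunciation: str) -> str:
    """Drop the short vowels a, e, o from a romanized pronunciation.

    Simplification: real spelling does write some short vowels,
    for example at the start of a word or as a final -e.
    """
    # NFC keeps long vowels such as "ā" as single characters.
    text = unicodedata.normalize("NFC", pronunciation)
    return "".join(ch for ch in text if ch not in SHORT_VOWELS)

for word in ["sos", "šeš", "man", "salām", "šomā", "mamnun"]:
    print(word, "->", written_skeleton(word))
```

Going in this direction is easy, but the reverse is not: nothing in the written form ss says that the missing vowel is o, which is why a converter needs a dictionary or a trained model rather than a fixed rule.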

How it works

There is a dictionary with 1867 pairs of the form (Persian word, its pronunciation); you can also load a dictionary with over 48 000 words by passing use_large = True to the constructor. Some of these words (in English): water, there, feeling, use, people, throw, he, can, highway, was, hall, guarantee, production, sentence, account, god, self, they know, dollar, mind, novel, earthquake, organizing, weapons, personal, martyr, necessity, opinion, french, legal, london, deprived, people, studies, source, fruit, they take, system, the light, are, and, leg, bridge, what, done, do.

First, your text is normalized with hazm and then tokenized. Each token is handled as follows:

  1. If the token is not written in Arabic script, it is left unchanged.
  2. If the token is a word from the dictionary, its pronunciation is taken from the dictionary.
  3. Otherwise the pronunciation is predicted by the neural net.
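
The three rules above can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the toy dictionary, the `predict_stub` placeholder and every function name here are hypothetical, and a plain whitespace split stands in for hazm normalization and tokenization.

```python
# Toy stand-in for the package dictionary (Persian word -> pronunciation).
TOY_DICTIONARY = {"ما": "mā", "سلام": "salām"}

def is_arabic_script(token: str) -> bool:
    """True if every character lies in the basic Arabic Unicode block.

    Simplification: this also accepts Arabic punctuation and rejects
    the zero-width non-joiner that some Persian words contain.
    """
    return bool(token) and all("\u0600" <= ch <= "\u06ff" for ch in token)

def predict_stub(token: str) -> str:
    """Placeholder for the neural net; a real model would return phonemes."""
    return f"<predicted:{token}>"

def convert_token(token: str) -> tuple[str, str]:
    """Return (output, source) for one token, following the three rules."""
    if not is_arabic_script(token):
        return token, "unchanged"                    # rule 1
    if token in TOY_DICTIONARY:
        return TOY_DICTIONARY[token], "dictionary"   # rule 2
    return predict_stub(token), "predicted"          # rule 3

def convert(text: str) -> list[tuple[str, str]]:
    # Whitespace split is an assumption; the package tokenizes with hazm.
    return [convert_token(token) for token in text.split()]

for output, source in convert("سلام ما دمپایی ok"):
    print(source, output)
```

Returning the source next to each output keeps the dictionary and neural-net cases distinguishable, which is the information the output format described next relies on.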

If a token was found in the dictionary, its pronunciation is written like ' t h i s ' (with spaces between the symbols and at the beginning and end of the word). If the word is written without these spaces, it was predicted by the neural net. You can disable this marking by setting secret = True.
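
As a rough sketch of this output format, the function below spaces out dictionary words and leaves predicted words compact. The name `render` and the (pronunciation, from_dictionary) pairs are assumptions made for the illustration, and which words count as dictionary hits is inferred from the example output in the Usage section, not taken from the package source.

```python
def render(tokens, secret=False):
    """Join (pronunciation, from_dictionary) pairs into one output string."""
    parts = []
    for pronunciation, from_dictionary in tokens:
        if from_dictionary and not secret:
            # Dictionary hit: spaces between symbols and around the word.
            parts.append(" " + " ".join(pronunciation) + " ")
        else:
            # Predicted word, or marking disabled with secret=True.
            parts.append(pronunciation)
    return " ".join(parts)

tokens = [
    ("mā", True),
    ("alān", True),
    ("darhāl", False),  # assumed to come from the neural net
    ("bāzi", True),
    ("budim", True),
]

print(repr(render(tokens)))
print(repr(render(tokens, secret=True)))
```

With secret left at False the spacing reveals where each pronunciation came from; with secret = True the result is an ordinary space-separated transcription.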

"Tidy" argument

Persian symbol | sound (tidy = False) | sound (tidy = True)
آ | A | ā
ش | S | š
ژ | Z | ž
چ | C | č
ء، ع | ? | `
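
The table above amounts to a character-for-character substitution, which can be sketched with `str.translate`. The function name `tidy_up` is hypothetical, and the mapping covers only the five pairs listed in the table; the package may handle more symbols.

```python
# Raw symbol -> tidy symbol, copied from the table above.
TIDY_TABLE = str.maketrans({"A": "ā", "S": "š", "Z": "ž", "C": "č", "?": "`"})

def tidy_up(raw: str) -> str:
    """Replace the ASCII stand-ins with their diacritic counterparts."""
    return raw.translate(TIDY_TABLE)

print(tidy_up("salAm"))
print(tidy_up("SeS"))
```

Because every raw symbol is a single uppercase letter or "?", the substitution cannot collide with the lowercase letters used for the other sounds.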

Comparison with epitran

Code

Persian word | epitran conversion | PersianG2p conversion | expected
سلام | slɒm | salām | salām
ممنون | mmnvn | mamnun | mamnun
خب | xb | xab | xāb
ساحل | sɒhl | sāhel | sāhel
یخ | jx | yax | yax
لاغر | lɒɣr | lāġar | lāġar
پسته | پsth | peste | peste
مثلث | msls | mosles | mosles
سال ها | sɒl hɒ | sālehā | sālhā
لذت | lzt | lazt | lezzat
دژ |  | dož | dež
برف | brf | barf | barf
خدا حافظ | xdɒ hɒfz | x o d ā hāfez | xodā hāfez
دمپایی | dmپɒjj | dampāyi | dampāyi
نشستن | nʃstn | nešastan | nešastan
متأسفانه | mtɒʔsfɒnh | motsafe`āne | mota’assefāne
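
One way to put a number on how far a conversion is from the expected form is the Levenshtein edit distance. The sketch below is not part of PersianG2p or epitran; it is a standard dynamic-programming implementation, and the sample pairs are copied from three rows of the comparison.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

# (PersianG2p conversion, expected) pairs from the comparison.
pairs = [("salām", "salām"), ("xab", "xāb"), ("lazt", "lezzat")]
for converted, expected in pairs:
    print(converted, expected, edit_distance(converted, expected))
```

A distance of 0 means an exact match; dividing the distance by the length of the expected form would give a rough per-word error rate.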

Installation

pip install PersianG2p

Usage

from PersianG2p import Persian_g2p_converter

PersianG2Pconverter = Persian_g2p_converter()
# or, to load the large dictionary:
# PersianG2Pconverter = Persian_g2p_converter(use_large = True)

PersianG2Pconverter.transliterate('ما الان درحال بازی بودیم', tidy = False)
# ' m A   a l A n  darhAl  b A z i   b u d i m '

PersianG2Pconverter.transliterate('ما الان درحال بازی بودیم')
# ' m ā   a l ā n  darhāl  b ā z i   b u d i m '

Persian_g2p_converter().transliterate( "زان یار دلنوازم شکریست با شکایت", secret = True)
# 'zān yār delnavāzam šokrist bā šekāyat'

PersianG2Pconverter.transliterate('نه تنها یک کلمه')
# ' n o h   t a n h ā   y e k  kalame'

# calling the object directly is equivalent to calling object.transliterate() with the same arguments
PersianG2Pconverter('نه تنها یک کلمه', secret = True)
# 'noh tanhA yek kalame'

Telegram bot @PersianG2Pbot

This Telegram bot uses the PersianG2p package. Message it to check results.

What you can do better

  • Fit a better model (with other hyperparameters or a bigger dictionary)

  • Add many new words to the dictionary. If you want, I will write a Python/C# script for this task or even create a Telegram bot
