Indonesian G2P.

Project description

g2p ID: Indonesian Grapheme-to-Phoneme Converter

This library converts Indonesian (Bahasa Indonesia) graphemes (words) to phonemes in IPA. It follows the methods and design of its English equivalent, g2p.

Installation

pip install g2p_id_py

How to Use

from g2p_id import G2p

texts = [
    "Apel itu berwarna merah.",
    "Rahel bersekolah di S M A Jakarta 17.",
    "Mereka sedang bermain bola di lapangan.",
]

g2p = G2p()
for text in texts:
    print(g2p(text))

>> [['a', 'p', 'ə', 'l'], ['i', 't', 'u'], ['b', 'ə', 'r', 'w', 'a', 'r', 'n', 'a'], ['m', 'e', 'r', 'a', 'h'], ['.']]
>> [['r', 'a', 'h', 'e', 'l'], ['b', 'ə', 'r', 's', 'ə', 'k', 'o', 'l', 'a', 'h'], ['d', 'i'], ['e', 's'], ['e', 'm'], ['a'], ['dʒ', 'a', 'k', 'a', 'r', 't', 'a'], ['t', 'u', 'dʒ', 'u', 'h'], ['b', 'ə', 'l', 'a', 's'], ['.']]
>> [['m', 'ə', 'r', 'e', 'k', 'a'], ['s', 'ə', 'd', 'a', 'ŋ'], ['b', 'ə', 'r', 'm', 'a', 'i', 'n'], ['b', 'o', 'l', 'a'], ['d', 'i'], ['l', 'a', 'p', 'a', 'ŋ', 'a', 'n'], ['.']]
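
Each call returns one list of phonemes per word. If you need a plain IPA string instead, the nested lists can be joined as in the minimal sketch below. The helper `to_ipa_string` is a hypothetical name of ours, not part of the g2p_id API, and the input is copied from the first output line above.

```python
# Output of g2p("Apel itu berwarna merah."), copied from the example above.
phonemes = [
    ['a', 'p', 'ə', 'l'],
    ['i', 't', 'u'],
    ['b', 'ə', 'r', 'w', 'a', 'r', 'n', 'a'],
    ['m', 'e', 'r', 'a', 'h'],
    ['.'],
]


def to_ipa_string(words, phoneme_sep="", word_sep=" "):
    """Join a list of per-word phoneme lists into a single string.

    Hypothetical helper for illustration; not provided by g2p_id.
    """
    return word_sep.join(phoneme_sep.join(word) for word in words)


print(to_ipa_string(phonemes))
# apəl itu bərwarna merah .
```

Passing a non-empty `phoneme_sep` (for example `"-"`) keeps multi-character phonemes such as `dʒ` distinguishable in the joined string.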

Algorithm

The algorithm is heavily inspired by the English g2p.

  1. Spells out Arabic numerals and some currency symbols, e.g. Rp 200,000 -> dua ratus ribu rupiah. This step is borrowed from Cahya's code.
  2. Attempts to retrieve the correct pronunciation of homographs based on their POS (part-of-speech) tags.
  3. Looks up non-homographs in a lexicon (pronunciation dictionary). The lexicon was originally taken from ipa-dict and later modified.
  4. Predicts the pronunciation of out-of-vocabulary (OOV) words using either a BERT model or an LSTM model.
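
The order of the four steps can be sketched as a simple fallback chain. This is a toy illustration, not the library's implementation: all dictionaries below are invented, tiny stand-ins for the real resources, the POS tags are passed in by hand rather than produced by a tagger, and `predict_oov` is a letter-by-letter placeholder where the library would use a BERT or LSTM model.

```python
from typing import Dict, List, Tuple

# Toy resources, invented for illustration only.
NUMBER_WORDS: Dict[str, str] = {"17": "tujuh belas"}
HOMOGRAPHS: Dict[Tuple[str, str], List[str]] = {
    ("mental", "ADJ"): ["m", "e", "n", "t", "a", "l"],
    ("mental", "VERB"): ["m", "ə", "n", "t", "a", "l"],
}
LEXICON: Dict[str, List[str]] = {
    "tujuh": ["t", "u", "dʒ", "u", "h"],
    "belas": ["b", "ə", "l", "a", "s"],
}


def normalize(text: str) -> List[str]:
    """Step 1: lowercase, split on whitespace, spell out known numbers."""
    tokens: List[str] = []
    for token in text.lower().split():
        tokens.extend(NUMBER_WORDS.get(token, token).split())
    return tokens


def predict_oov(word: str) -> List[str]:
    """Step 4 placeholder: one phoneme per letter, standing in for a model."""
    return list(word)


def convert(text: str, pos_tags: Dict[str, str]) -> List[List[str]]:
    """Run steps 1-4 in order; pos_tags maps a word to its (assumed) POS tag."""
    result: List[List[str]] = []
    for word in normalize(text):
        key = (word, pos_tags.get(word, ""))
        if key in HOMOGRAPHS:      # Step 2: homograph resolved by POS tag
            result.append(HOMOGRAPHS[key])
        elif word in LEXICON:      # Step 3: lexicon lookup
            result.append(LEXICON[word])
        else:                      # Step 4: OOV prediction
            result.append(predict_oov(word))
    return result


print(convert("mental 17 bola", {"mental": "ADJ"}))
```

In this sketch "mental" is resolved through the homograph table, "17" is spelled out and found in the lexicon, and "bola" falls through to the OOV placeholder.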

Phoneme and Grapheme Sets

graphemes = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
phonemes = ['a', 'b', 'd', 'e', 'f', 'ɡ', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'z', 'ŋ', 'ə', 'ɲ', 'tʃ', 'ʃ', 'dʒ', 'x', 'ʔ']
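
One practical use of the phoneme set is a sanity check on converter output before it is passed to a downstream system such as a TTS front end. The sketch below is an assumption of ours about how such a check might look; `unknown_symbols` is a hypothetical helper, and the sample is assembled from words in the outputs shown earlier. Punctuation is passed through by the converter, so it is reported as lying outside the phoneme set.

```python
# Phoneme inventory copied from the list above.
PHONEMES = {
    'a', 'b', 'd', 'e', 'f', 'ɡ', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
    'p', 'r', 's', 't', 'u', 'v', 'w', 'z', 'ŋ', 'ə', 'ɲ', 'tʃ', 'ʃ', 'dʒ',
    'x', 'ʔ',
}


def unknown_symbols(words):
    """Return the sorted symbols in `words` that are not in PHONEMES.

    Hypothetical helper for illustration; not provided by g2p_id.
    """
    return sorted({symbol for word in words for symbol in word
                   if symbol not in PHONEMES})


sample = [
    ['dʒ', 'a', 'k', 'a', 'r', 't', 'a'],
    ['s', 'ə', 'd', 'a', 'ŋ'],
    ['.'],
]
print(unknown_symbols(sample))
# ['.']
```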

Implementation Details

You can find more details on how we handled homographs and out-of-vocabulary prediction on our documentation page.

References

@misc{g2pE2019,
  author = {Park, Kyubyong and Kim, Jongseok},
  title = {g2pE},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/g2p}}
}
@misc{TextProcessor2021,
  author = {Cahya Wirawan},
  title = {Text Processor},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cahya-wirawan/text_processor}}
}
