Indonesian G2P.

g2p ID: Indonesian Grapheme-to-Phoneme Converter

This library converts Indonesian (Bahasa Indonesia) graphemes (written words) to phonemes in IPA. It follows the methods and design of its English counterpart, g2p.

Installation

pip install g2p_id_py

How to Use

from g2p_id import G2p

texts = [
    "Apel itu berwarna merah.",
    "Rahel bersekolah di S M A Jakarta 17.",
    "Mereka sedang bermain bola di lapangan.",
]

g2p = G2p()
for text in texts:
    print(g2p(text))

>> [['a', 'p', 'ə', 'l'], ['i', 't', 'u'], ['b', 'ə', 'r', 'w', 'a', 'r', 'n', 'a'], ['m', 'e', 'r', 'a', 'h'], ['.']]
>> [['r', 'a', 'h', 'e', 'l'], ['b', 'ə', 'r', 's', 'ə', 'k', 'o', 'l', 'a', 'h'], ['d', 'i'], ['e', 's'], ['e', 'm'], ['a'], ['dʒ', 'a', 'k', 'a', 'r', 't', 'a'], ['t', 'u', 'dʒ', 'u', 'h'], ['b', 'ə', 'l', 'a', 's'], ['.']]
>> [['m', 'ə', 'r', 'e', 'k', 'a'], ['s', 'ə', 'd', 'a', 'ŋ'], ['b', 'ə', 'r', 'm', 'a', 'i', 'n'], ['b', 'o', 'l', 'a'], ['d', 'i'], ['l', 'a', 'p', 'a', 'ŋ', 'a', 'n'], ['.']]
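
The nested lists above hold one list of phonemes per word or punctuation mark. As a minimal sketch of post-processing them, the helper below joins such a result into a plain IPA string. The name `to_ipa_string` and its parameters are hypothetical and not part of g2p_id; the sample data is copied from the first output line above so the snippet runs without the package installed.

```python
def to_ipa_string(words, phoneme_sep="", word_sep=" "):
    """Join a list of per-word phoneme lists into one string."""
    return word_sep.join(phoneme_sep.join(word) for word in words)


# Copied from the first example output above.
sample = [
    ['a', 'p', 'ə', 'l'],
    ['i', 't', 'u'],
    ['b', 'ə', 'r', 'w', 'a', 'r', 'n', 'a'],
    ['m', 'e', 'r', 'a', 'h'],
    ['.'],
]

print(to_ipa_string(sample))
print(to_ipa_string(sample, phoneme_sep="-"))
```

Punctuation is returned as its own token, so it ends up separated from the preceding word by `word_sep`.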

Algorithm

The algorithm is heavily inspired by the English g2p.

  1. Spells out Arabic numerals and some currency symbols, e.g. Rp 200,000 -> dua ratus ribu rupiah. This step is borrowed from Cahya's code.
  2. Attempts to retrieve the correct pronunciation for homographs based on their POS (part-of-speech) tags.
  3. Looks up non-homographs in a lexicon (pronunciation dictionary). The lexicon originally came from ipa-dict and was later modified.
  4. Predicts the pronunciation of OOV (out-of-vocabulary) words using either a BERT model or an LSTM model.
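
The four steps above can be sketched as a toy pipeline. This is not the library's implementation: every name here (`spell_number`, `naive_oov`, `toy_g2p`, the `pos_tags` argument) is hypothetical, the dictionaries are tiny illustrative samples, and the OOV step swaps the BERT/LSTM model for a naive letter-to-phoneme rule. The number speller only handles plain integers below one million, with no thousands separators or currency symbols.

```python
import re

UNITS = ["nol", "satu", "dua", "tiga", "empat", "lima",
         "enam", "tujuh", "delapan", "sembilan"]


def spell_number(n):
    """Step 1 (simplified): spell an integer 0 <= n < 1_000_000."""
    if n < 10:
        return UNITS[n]
    if n == 10:
        return "sepuluh"
    if n == 11:
        return "sebelas"
    if n < 20:
        return f"{UNITS[n - 10]} belas"
    if n < 100:
        tens, rest = divmod(n, 10)
        return f"{UNITS[tens]} puluh" + (f" {UNITS[rest]}" if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = "seratus" if hundreds == 1 else f"{UNITS[hundreds]} ratus"
        return head + (f" {spell_number(rest)}" if rest else "")
    thousands, rest = divmod(n, 1000)
    head = "seribu" if thousands == 1 else f"{spell_number(thousands)} ribu"
    return head + (f" {spell_number(rest)}" if rest else "")


# Illustrative entries only, not taken from the library's data files.
HOMOGRAPHS = {
    "mental": {"ADJ": ["m", "e", "n", "t", "a", "l"],
               "VERB": ["m", "ə", "n", "t", "a", "l"]},
}
LEXICON = {
    "tujuh": ["t", "u", "dʒ", "u", "h"],
    "belas": ["b", "ə", "l", "a", "s"],
    "bola": ["b", "o", "l", "a"],
}

# Stand-in for the OOV model: fixed spelling rules, digraphs first.
DIGRAPHS = {"ng": "ŋ", "ny": "ɲ", "sy": "ʃ", "kh": "x"}
SINGLE = {"c": "tʃ", "j": "dʒ", "y": "j", "g": "ɡ", "e": "ə"}


def naive_oov(word):
    """Step 4 (stand-in): map letters to phonemes with fixed rules."""
    phones, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        else:
            phones.append(SINGLE.get(word[i], word[i]))
            i += 1
    return phones


def toy_g2p(text, pos_tags=None):
    """Run the four steps; pos_tags maps a word to its POS tag."""
    pos_tags = pos_tags or {}
    # Step 1: lowercase and spell out digit runs.
    text = re.sub(r"\d+", lambda m: spell_number(int(m.group())), text.lower())
    result = []
    for token in re.findall(r"[a-z]+|[^\sa-z]", text):
        if not token.isalpha():
            result.append([token])              # punctuation passes through
        elif token in HOMOGRAPHS:               # step 2: pick variant by POS
            variants = HOMOGRAPHS[token]
            chosen = variants.get(pos_tags.get(token))
            result.append(list(chosen or next(iter(variants.values()))))
        elif token in LEXICON:                  # step 3: dictionary lookup
            result.append(list(LEXICON[token]))
        else:                                   # step 4: OOV fallback
            result.append(naive_oov(token))
    return result


print(spell_number(200000))
print(toy_g2p("Jakarta 17."))
```

The fixed rule `e -> ə` is a guess that is often wrong, since written "e" can stand for either /e/ or /ə/ (compare "merah" and "mereka" in the examples above). That ambiguity is one reason a rule table alone is not enough and a learned model is used for OOV words.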

Phoneme and Grapheme Sets

graphemes = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
phonemes = ['a', 'b', 'd', 'e', 'f', 'ɡ', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'z', 'ŋ', 'ə', 'ɲ', 'tʃ', 'ʃ', 'dʒ', 'x', 'ʔ']
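
Some phonemes in this set, such as 'tʃ' and 'dʒ', are written with two characters, so an IPA string cannot be split character by character. The sketch below shows one way to split a string into symbols from the set using greedy longest-match; `split_phonemes` is a hypothetical helper, not a function provided by g2p_id.

```python
PHONEMES = ['a', 'b', 'd', 'e', 'f', 'ɡ', 'h', 'i', 'j', 'k', 'l', 'm',
            'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'z', 'ŋ', 'ə',
            'ɲ', 'tʃ', 'ʃ', 'dʒ', 'x', 'ʔ']

# Try two-character symbols before single characters.
_BY_LENGTH = sorted(PHONEMES, key=len, reverse=True)


def split_phonemes(ipa):
    """Split an IPA string into symbols from PHONEMES, longest match first."""
    result, i = [], 0
    while i < len(ipa):
        for symbol in _BY_LENGTH:
            if ipa.startswith(symbol, i):
                result.append(symbol)
                i += len(symbol)
                break
        else:
            raise ValueError(f"unknown symbol at position {i}: {ipa[i]!r}")
    return result


print(split_phonemes("dʒakarta"))
print(split_phonemes("lapaŋan"))
```

Greedy matching always reads "t" followed by "ʃ" as the single phoneme 'tʃ', so a string in which those two sounds are meant separately would need an explicit separator.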

Implementation Details

More details on how homographs and out-of-vocabulary predictions are handled are available on the documentation page.

References

@misc{g2pE2019,
  author = {Park, Kyubyong and Kim, Jongseok},
  title = {g2pE},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/g2p}}
}
@misc{TextProcessor2021,
  author = {Cahya Wirawan},
  title = {Text Processor},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cahya-wirawan/text_processor}}
}
