Indonesian G2P.
g2p ID: Indonesian Grapheme-to-Phoneme Converter
This library converts Indonesian (Bahasa Indonesia) graphemes (words) to phonemes in IPA. It follows the methods and design of the equivalent English library, g2p.
Installation
pip install g2p_id_py
How to Use
from g2p_id import G2p
texts = [
    "Apel itu berwarna merah.",
    "Rahel bersekolah di S M A Jakarta 17.",
    "Mereka sedang bermain bola di lapangan.",
]
g2p = G2p()
for text in texts:
    print(g2p(text))
>> [['a', 'p', 'ə', 'l'], ['i', 't', 'u'], ['b', 'ə', 'r', 'w', 'a', 'r', 'n', 'a'], ['m', 'e', 'r', 'a', 'h'], ['.']]
>> [['r', 'a', 'h', 'e', 'l'], ['b', 'ə', 'r', 's', 'ə', 'k', 'o', 'l', 'a', 'h'], ['d', 'i'], ['e', 's'], ['e', 'm'], ['a'], ['dʒ', 'a', 'k', 'a', 'r', 't', 'a'], ['t', 'u', 'dʒ', 'u', 'h'], ['b', 'ə', 'l', 'a', 's'], ['.']]
>> [['m', 'ə', 'r', 'e', 'k', 'a'], ['s', 'ə', 'd', 'a', 'ŋ'], ['b', 'ə', 'r', 'm', 'a', 'i', 'n'], ['b', 'o', 'l', 'a'], ['d', 'i'], ['l', 'a', 'p', 'a', 'ŋ', 'a', 'n'], ['.']]
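The output above is a list of words, each a list of phoneme symbols. As a minimal sketch of post-processing it, the helper below joins that nested list into a readable IPA string. `join_phonemes` is a hypothetical helper, not part of the library, and `sample` is copied from the first output line above so the snippet runs without the package installed.

```python
def join_phonemes(words, word_sep=" "):
    """Join a list of per-word phoneme lists into one string.

    Hypothetical helper for illustration; not part of g2p_id.
    """
    return word_sep.join("".join(word) for word in words)


# Copied from the first example output above.
sample = [
    ['a', 'p', 'ə', 'l'],
    ['i', 't', 'u'],
    ['b', 'ə', 'r', 'w', 'a', 'r', 'n', 'a'],
    ['m', 'e', 'r', 'a', 'h'],
    ['.'],
]

print(join_phonemes(sample))  # apəl itu bərwarna merah .
```

Punctuation comes back as its own token, so it appears as a separate "word" in the joined string.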
Algorithm
The pipeline is heavily inspired by the English g2p:
- Spells out Arabic numerals and some currency symbols, e.g. Rp 200,000 -> dua ratus ribu rupiah. This is borrowed from Cahya's code.
- Attempts to retrieve the correct pronunciation for homographs based on their POS (part-of-speech) tags.
- Looks up non-homographs in a lexicon (pronunciation dictionary). The lexicon originally comes from ipa-dict; we later modified it.
- For out-of-vocabulary words (OOVs), predicts pronunciations using either a BERT model or an LSTM model.
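The steps above can be sketched as a small cascade. This is an illustration under stated assumptions, not the library's code: every function name here is hypothetical, the lexicon and homograph entries are toy data, the POS tag is passed in by hand instead of coming from a tagger, and a naive letter-per-phoneme fallback stands in for the BERT/LSTM model. The number speller only covers 0 to 999,999.

```python
import re

UNITS = ["nol", "satu", "dua", "tiga", "empat", "lima",
         "enam", "tujuh", "delapan", "sembilan"]
NUMBER = r"\d+(?:[.,]\d{3})*"  # digits with optional thousands separators


def spell_number(n: int) -> str:
    """Spell out an integer 0 <= n < 1_000_000 in Indonesian."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0..999,999")
    if n < 10:
        return UNITS[n]
    if n == 10:
        return "sepuluh"
    if n == 11:
        return "sebelas"
    if n < 20:
        return f"{UNITS[n - 10]} belas"
    if n < 100:
        count, rest = divmod(n, 10)
        head = f"{UNITS[count]} puluh"
    elif n < 1000:
        count, rest = divmod(n, 100)
        head = "seratus" if count == 1 else f"{UNITS[count]} ratus"
    else:
        count, rest = divmod(n, 1000)
        head = "seribu" if count == 1 else f"{spell_number(count)} ribu"
    return head if rest == 0 else f"{head} {spell_number(rest)}"


def _to_int(digits: str) -> int:
    return int(re.sub(r"[.,]", "", digits))


def normalize(text: str) -> str:
    """Step 1: spell out rupiah amounts, then any remaining numbers."""
    text = re.sub(rf"Rp\s*({NUMBER})",
                  lambda m: f"{spell_number(_to_int(m.group(1)))} rupiah",
                  text)
    return re.sub(NUMBER, lambda m: spell_number(_to_int(m.group(0))), text)


# Toy data for illustration only; not the library's homograph list or lexicon.
HOMOGRAPHS = {
    "mental": {"ADJ": ["m", "e", "n", "t", "a", "l"],
               "VERB": ["m", "ə", "n", "t", "a", "l"]},
}
LEXICON = {
    "itu": ["i", "t", "u"],
    "merah": ["m", "e", "r", "a", "h"],
    "bola": ["b", "o", "l", "a"],
}


def predict_oov(word: str) -> list:
    """Step 4 stand-in: one letter per phoneme instead of a neural model."""
    return list(word)


def g2p_word(word: str, pos=None) -> list:
    word = word.lower()
    if word in HOMOGRAPHS:  # step 2: pick the reading by POS tag
        readings = HOMOGRAPHS[word]
        return readings.get(pos, next(iter(readings.values())))
    if word in LEXICON:     # step 3: dictionary lookup
        return LEXICON[word]
    return predict_oov(word)


def g2p_sentence(text: str, pos_tags=None) -> list:
    """Run the cascade on whitespace-separated words (no punctuation handling)."""
    pos_tags = pos_tags or {}
    return [g2p_word(w, pos_tags.get(w.lower())) for w in normalize(text).split()]


print(normalize("Rp 200,000"))  # dua ratus ribu rupiah
print(g2p_sentence("bola mental", {"mental": "VERB"}))
```

The ordering matters: homographs are checked before the lexicon so that a POS-dependent reading is never shadowed by a single dictionary entry, and the model is consulted only when both lookups miss.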
Phoneme and Grapheme Sets
graphemes = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
phonemes = ['a', 'b', 'd', 'e', 'f', 'ɡ', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'z', 'ŋ', 'ə', 'ɲ', 'tʃ', 'ʃ', 'dʒ', 'x', 'ʔ']
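One common use of a fixed phoneme set is to map symbols to integer ids before feeding them to a downstream model such as a TTS system. The sketch below assumes that use case; `phoneme_to_id`, `encode`, and `unknown_symbols` are hypothetical names, not library API. Note that `tʃ` and `dʒ` are single symbols in this set, so encoding works on the token lists rather than on raw characters.

```python
# Copied from the phoneme list above.
phonemes = ['a', 'b', 'd', 'e', 'f', 'ɡ', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
            'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'z', 'ŋ', 'ə', 'ɲ', 'tʃ',
            'ʃ', 'dʒ', 'x', 'ʔ']

# Position in the list doubles as the integer id.
phoneme_to_id = {symbol: index for index, symbol in enumerate(phonemes)}


def unknown_symbols(word):
    """Return the tokens of a word that are not in the phoneme set."""
    return [symbol for symbol in word if symbol not in phoneme_to_id]


def encode(word):
    """Map a list of phoneme symbols to ids; raises KeyError on unknown symbols."""
    return [phoneme_to_id[symbol] for symbol in word]


jakarta = ['dʒ', 'a', 'k', 'a', 'r', 't', 'a']  # from the example output
print(encode(jakarta))         # [27, 0, 9, 0, 15, 17, 0]
print(unknown_symbols(['.']))  # ['.']
```

Punctuation tokens such as `.` appear in the converter's output but are not in the phoneme set, so they need to be filtered or given their own ids before encoding.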
Implementation Details
For more details on how we handle homographs and out-of-vocabulary prediction, see our documentation page.
References
@misc{g2pE2019,
author = {Park, Kyubyong and Kim, Jongseok},
title = {g2pE},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Kyubyong/g2p}}
}
@misc{TextProcessor2021,
author = {Cahya Wirawan},
title = {Text Processor},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/cahya-wirawan/text_processor}}
}