Yoruba grapheme-to-phoneme tool with tones (IPA + ASCII + MFA-ready).
🇳🇬 Yoruba-G2P
Tone-Aware Yoruba Grapheme-to-Phoneme Toolkit (IPA + ASCII + MFA-Ready)
Yoruba-G2P is a deterministic, rule-based Python package that converts Yorùbá text into phoneme sequences, handling tones, nasal vowels, syllabic nasals, affricates, and labial-velars.
It provides:
- ✔ IPA dictionary
- ✔ ASCII-safe dictionary (for MFA, ESPnet, Kaldi)
- ✔ Phoneset file
- ✔ Lexicon statistics
- ✔ CLI + Python API
- ✔ Works on any Yoruba transcript
- ✔ No ML training required → fully rule-based + Epitran-backed
Installation
From PyPI (recommended)
pip install yoruba-g2p
From GitHub
pip install git+https://github.com/<your-username>/yoruba-g2p.git
Quick Start (Python API)
from yoruba_g2p import YorG2P
g2p = YorG2P()
# Replace these sample words with your own text
print(g2p.to_ipa("ọ̀mọ́"))
print(g2p.to_ascii("ọ̀mọ́"))
Output:
['ɔ_L', 'm', 'ɔ_H']
['O_L', 'm', 'O_H']
Another example:
g2p.to_ipa("àwọn")
# ['a_L', 'w', 'ɔ_M', 'n']
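Each tone-bearing phone in the output carries its tone as a `_H`, `_M`, or `_L` suffix. The sketch below shows one way to split such labels back into a base phone and a tone for downstream analysis. `split_phone` is a hypothetical helper written for this illustration; it is not part of the `yoruba_g2p` API.

```python
TONES = {"H", "M", "L"}

def split_phone(label):
    """Split 'ɔ_L' into ('ɔ', 'L'); toneless phones return (label, None)."""
    base, sep, tone = label.rpartition("_")
    if sep and tone in TONES:
        return base, tone
    return label, None

# Phone labels copied from the example output above.
phones = ["a_L", "w", "ɔ_M", "n"]
print([split_phone(p) for p in phones])
```

Consonants without a suffix come back with `None` as their tone, so the tone sequence of a word is simply the non-`None` values in order.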
Command-line Interface (CLI)
Convert a sentence:
yoruba-g2p --ipa "àwọn ọmọ ń lọ"
yoruba-g2p --ascii "àwọn ọmọ ń lọ"
Build lexicons from .lab transcripts:
yoruba-g2p build-lexicon \
--lab-dir data/lab/train \
--out-dir dict/
You will get:
dict/ipa.dict
dict/ascii.dict
dict/phoneset.txt
dict/stats.json
All MFA-ready.
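To inspect a generated lexicon from Python, a small parser is usually enough. The sketch below assumes each line holds a word followed by whitespace-separated phones, as in the dictionary samples in this README; `parse_lexicon` is a hypothetical helper, not a function shipped with the package.

```python
def parse_lexicon(text):
    """Map each word to a list of pronunciations (each a list of phones)."""
    lexicon = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and entries without phones
        word, phones = parts[0], parts[1:]
        prons = lexicon.setdefault(word, [])
        if phones not in prons:
            prons.append(phones)  # keep distinct pronunciations only
    return lexicon

# Sample lines in the same shape as dict/ipa.dict.
SAMPLE = """\
ọ̀mọ́ ɔ_L m ɔ_H
jẹ́ d͡ʒ ɛ_H
àwọn a_L w ɔ_M n
"""

for word, prons in parse_lexicon(SAMPLE).items():
    print(word, prons)
```

For a real file, pass `open("dict/ipa.dict", encoding="utf-8").read()` instead of `SAMPLE`.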
What Yoruba-G2P Produces
1. IPA Dictionary
ọ̀mọ́ ɔ_L m ɔ_H
jẹ́ d͡ʒ ɛ_H
àwọn a_L w ɔ_M n
2. ASCII Dictionary
ọ̀mọ́ O_L m O_H
jẹ́ dZ e_H
àwọn a_L w O_M n
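The ASCII dictionary is the IPA dictionary with each base phone replaced by an ASCII-safe symbol and the tone suffix left intact. The sketch below illustrates that idea with a deliberately partial table containing only the two pairs visible in the samples above (ɔ → O, d͡ʒ → dZ). The table and the `to_ascii_phone` name are assumptions for illustration; the package's own mapping lives in its source and may differ.

```python
# Partial, illustrative table: only pairs shown in this README's samples.
IPA_TO_ASCII = {
    "\u0254": "O",          # ɔ
    "d\u0361\u0292": "dZ",  # d͡ʒ (d + tie bar + ʒ)
}

def to_ascii_phone(phone):
    """Map the base phone to ASCII and keep any _H/_M/_L suffix unchanged."""
    base, sep, tone = phone.partition("_")
    return IPA_TO_ASCII.get(base, base) + sep + tone

ipa = ["\u0254_L", "m", "\u0254_H"]  # ɔ_L m ɔ_H
print([to_ascii_phone(p) for p in ipa])
```

Phones missing from the table pass through unchanged, so a complete table is needed before the output is guaranteed to be pure ASCII; `str.isascii()` is a quick way to check.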
3. Phoneset (phoneset.txt)
a_M
a_H
a_L
e_M
e_H
e_L
ɛ_M
ɛ_H
ɛ_L
i_M
i_H
i_L
o_M
o_H
o_L
ɔ_M
ɔ_H
ɔ_L
kp
gb
s
t
d
m
n
n_H
n_L
...
4. Stats (stats.json)
{
"num_words": 5478,
"problem_words": 2,
"phoneset_size": 41
}
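The `num_words` and `phoneset_size` figures can be recomputed from any lexicon, which is a handy sanity check after editing a dictionary by hand. The sketch below is a minimal version under stated assumptions: `lexicon_stats` is a hypothetical helper, the phoneset is sorted by code point (the package may order it differently), and `problem_words` is left out because this README does not define how it is counted.

```python
import json

def lexicon_stats(lexicon):
    """Return (stats, phoneset) for a word -> list-of-pronunciations mapping."""
    phoneset = sorted({p for prons in lexicon.values() for pron in prons for p in pron})
    stats = {"num_words": len(lexicon), "phoneset_size": len(phoneset)}
    return stats, phoneset

# Entries copied from the IPA dictionary sample above.
lexicon = {
    "ọ̀mọ́": [["ɔ_L", "m", "ɔ_H"]],
    "jẹ́": [["d͡ʒ", "ɛ_H"]],
    "àwọn": [["a_L", "w", "ɔ_M", "n"]],
}

stats, phoneset = lexicon_stats(lexicon)
print(json.dumps(stats, indent=2))
print(phoneset)
```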
Theory Behind Yoruba-G2P
Yoruba tone is lexical. Vowels and syllabic nasals carry tone, which the orthography marks with diacritics:
| Mark | Tone | Examples |
|---|---|---|
| ´ | H | ó, é, á |
| ` | L | ò, è, à |
| none | M | o, e, a |
| ń, ǹ | syllabic nasal with tone | ńlá, ǹkan |
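The table above maps directly onto Unicode combining marks: after NFD decomposition, an acute accent (U+0301) signals H, a grave accent (U+0300) signals L, and a vowel with neither is M. The sketch below reads tones off a word this way. It is an illustration of the idea only, with `tone_bearing_units` as a hypothetical name; the package's own extraction logic may differ.

```python
import unicodedata

TONE_MARKS = {"\u0301": "H", "\u0300": "L"}  # acute, grave
DOT_BELOW = "\u0323"                          # as in ẹ, ọ
VOWELS = set("aeiou")

def tone_bearing_units(word):
    """Return (letter, tone) pairs for vowels and tone-marked nasals."""
    chars = unicodedata.normalize("NFD", word.lower())
    units, i = [], 0
    while i < len(chars):
        base = chars[i]
        i += 1
        marks = ""
        # Collect every combining mark attached to this base letter.
        while i < len(chars) and unicodedata.combining(chars[i]):
            marks += chars[i]
            i += 1
        tone = next((TONE_MARKS[m] for m in marks if m in TONE_MARKS), None)
        letter = unicodedata.normalize("NFC", base + (DOT_BELOW if DOT_BELOW in marks else ""))
        if base in VOWELS:
            units.append((letter, tone or "M"))  # unmarked vowel is mid tone
        elif base in "nm" and tone:
            units.append((letter, tone))         # syllabic nasal such as ń, ǹ
    return units

print(tone_bearing_units("ọ̀mọ́"))
print(tone_bearing_units("ńlá"))
```

Decomposing first means precomposed and decomposed spellings of the same word yield the same result.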
The engine performs:
- Unicode normalization (NFC)
- Orthographic vowel + tone extraction
- IPA transliteration via Epitran
- IPA segmentation (phones: vowels, consonants, digraphs)
- Tone reattachment using the orthographic vowel map
  - Each vowel keeps the H/M/L tone marked in the orthography
- Handling of:
  - nasal vowels (ɛ̃ → ɛ_M + n)
  - syllabic nasals (ń → n_H, ǹ → n_L, decomposed forms)
  - affricates (d͡ʒ, t͡ʃ)
  - labial-velars (kp, gb)
Outputs IPA or ASCII-friendly phones (e.g., kp, gb, dZ).
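The segmentation and tone-reattachment steps above can be sketched in a few lines. This is a simplified illustration, not the package's implementation: `segment_ipa` and `attach_tones` are hypothetical names, the input strings stand in for whatever Epitran actually returns, and syllabic nasals are left out for brevity.

```python
import unicodedata

# Multi-character phones to keep together (affricates, labial-velars).
MULTI = ["d\u0361\u0292", "t\u0361\u0283", "kp", "gb"]  # d͡ʒ, t͡ʃ, kp, gb
NASAL = "\u0303"                                        # combining tilde
IPA_VOWELS = set("aeiou\u025b\u0254")                   # includes ɛ, ɔ

def segment_ipa(ipa):
    """Greedy split into phones, keeping digraphs and diacritics attached."""
    phones, i = [], 0
    while i < len(ipa):
        unit = next((m for m in MULTI if ipa.startswith(m, i)), ipa[i])
        i += len(unit)
        while i < len(ipa) and unicodedata.combining(ipa[i]):
            unit += ipa[i]  # e.g. the nasalization tilde on ɛ̃
            i += 1
        phones.append(unit)
    return phones

def attach_tones(phones, tones):
    """Give each vowel the next orthographic tone; nasal vowels become V_tone + n."""
    tone_iter = iter(tones)
    out = []
    for p in phones:
        base = p.replace(NASAL, "")
        if base in IPA_VOWELS:
            out.append(f"{base}_{next(tone_iter)}")
            if NASAL in p:
                out.append("n")
        else:
            out.append(p)
    return out

# "aw" + nasalized ɔ is an assumed transliteration of àwọn, with tones L and M.
print(attach_tones(segment_ipa("aw\u0254\u0303"), ["L", "M"]))
```

Taking tones from the orthography rather than from the transliteration is what lets the result stay aligned with the written tone marks even if the transliteration step drops them.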
Directory Outputs (for MFA)
If you run build-lexicon, your output folder will contain:
ipa.dict # MFA-ready IPA lexicon
ascii.dict # ASCII-safe lexicon
phoneset.txt # sorted unique phones
stats.json # statistics
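Then feed the dictionary into MFA. Recent MFA releases take the corpus, dictionary, and output path as positional arguments rather than flags (check mfa train --help for your installed version):
mfa train wavs/ dict/ipa.dict align_out/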
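Before starting a training run, it can help to confirm that every word in the transcripts has a dictionary entry, since the aligner treats missing words as out-of-vocabulary. The sketch below is one way to do that check in memory; `find_oov` is a hypothetical helper, and the transcripts and word list are toy stand-ins for the contents of your .lab files and dict/ipa.dict.

```python
import unicodedata

def nfc(text):
    return unicodedata.normalize("NFC", text)

def find_oov(transcripts, lexicon_words):
    """Return the sorted words that appear in transcripts but not in the lexicon."""
    known = {nfc(w) for w in lexicon_words}
    oov = set()
    for text in transcripts:
        for word in nfc(text).split():
            if word not in known:
                oov.add(word)
    return sorted(oov)

# Toy data: one transcript line and a two-word lexicon.
transcripts = ["àwọn ọmọ ń lọ"]
lexicon_words = ["àwọn", "ọmọ"]
print(find_oov(transcripts, lexicon_words))
```

Normalizing both sides to NFC avoids false misses when a transcript and the dictionary encode the same diacritics differently.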
Demo Notebook
A worked example is provided:
demo_yoruba_g2p.ipynb
It includes:
- IPA & ASCII conversion
- Batch lexicon building
- Phoneset extraction
- Tone distribution visualization
- Handling problem words
Project Structure
yoruba_g2p/
│
├── core.py # Main G2P engine
├── utils.py # helper rules, ASCII mapping
├── cli.py # command line interface
└── __init__.py
Contributing
Pull requests are welcome!
Steps:
- Fork the repo
- Create a new branch: git checkout -b feature-name
- Commit changes
- Push and open a PR
Issues, feature requests, and Yoruba orthography corrections are welcome.
📣 Citation (for research)
Osakuade, O. (2025). Yoruba-G2P: A tone-aware grapheme-to-phoneme converter for Yorùbá.
https://github.com/OpeyemiOsakuade/yoruba-g2p
License
MIT License — free for academic and commercial use.
⭐ Support Yoruba NLP
If this toolkit helps you, please give the repo a star ⭐ to promote more open-source resources for African languages.