Yoruba grapheme-to-phoneme tool with tones (IPA + ASCII + MFA-ready).

🇳🇬 Yoruba-G2P

Tone-Aware Yoruba Grapheme-to-Phoneme Toolkit (IPA + ASCII + MFA-Ready)

Yoruba-G2P is a fully deterministic Python package that converts Yorùbá text into phoneme sequences, with tone labels (H/M/L) and explicit handling of nasals, affricates, and labial-velars.

It provides:

  • IPA dictionary
  • ASCII-safe dictionary (for MFA, ESPnet, Kaldi)
  • Phoneset file
  • Lexicon statistics
  • CLI + Python API

It works on any Yoruba transcript and needs no ML training: conversion is fully rule-based and Epitran-backed.

Installation

From PyPI (recommended)

pip install yoruba-g2p

From GitHub

pip install git+https://github.com/OpeyemiOsakuade/yoruba-g2p.git

Quick Start (Python API)

from yoruba_g2p import YorG2P

g2p = YorG2P()
print(g2p.to_ipa("ọ̀yọ́"))
print(g2p.to_ascii("ọ̀mọ́"))

Output:

['ɔ_L', 'j', 'ɔ_H']
['O_L', 'm', 'O_H']

Another example:

g2p.to_ipa("àwọn")

['a_L', 'w', 'ɔ_M', 'n']
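
Because the API returns a plain list of phones, dictionary lines can be assembled by hand when needed. The following is a minimal sketch, assuming only that to_ipa returns a list of strings as shown above. The names lexicon_lines and demo_to_ipa are hypothetical and not part of the package; the stand-in converter exists only so the sketch runs without the package installed.

```python
from typing import Callable, Iterable, List


def lexicon_lines(words: Iterable[str], to_phones: Callable[[str], List[str]]) -> List[str]:
    """Return one 'word<TAB>phone phone ...' line per unique word, sorted by code point."""
    lines = []
    for word in sorted(set(words)):
        phones = to_phones(word)
        lines.append(f"{word}\t{' '.join(phones)}")
    return lines


# Stand-in converter (hypothetical). With the package installed,
# pass YorG2P().to_ipa as the second argument instead.
_DEMO = {
    "àwọn": ["a_L", "w", "ɔ_M", "n"],
    "ọ̀yọ́": ["ɔ_L", "j", "ɔ_H"],
}


def demo_to_ipa(word: str) -> List[str]:
    return _DEMO[word]


# Duplicated input words collapse to a single dictionary entry each.
words = list(_DEMO) * 2
for line in lexicon_lines(words, demo_to_ipa):
    print(line)
```

Passing the converter as a callable keeps the formatting logic independent of whether IPA or ASCII phones are requested.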

Command-line Interface (CLI)

Convert a sentence:

yoruba-g2p --ipa "àwọn ọmọ ń lọ"
yoruba-g2p --ascii "àwọn ọmọ ń lọ"

Build lexicons from .lab transcripts:

yoruba-g2p build-lexicon \
  --lab-dir data/lab/train \
  --out-dir dict/

You will get:

dict/ipa.dict
dict/ascii.dict
dict/phoneset.txt
dict/stats.json
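
Conceptually, a lexicon builder first needs the set of unique words across all transcripts. The sketch below shows one way to collect that vocabulary with the standard library. It is not the package's implementation: collect_vocab is a hypothetical helper, and it assumes each .lab file holds whitespace-separated UTF-8 text.

```python
import unicodedata
from collections import Counter
from pathlib import Path


def collect_vocab(lab_dir: str) -> Counter:
    """Count word occurrences across every .lab file under lab_dir."""
    counts: Counter = Counter()
    for lab_path in sorted(Path(lab_dir).rglob("*.lab")):
        text = lab_path.read_text(encoding="utf-8")
        # Normalize so precomposed and decomposed spellings count as one word.
        text = unicodedata.normalize("NFC", text)
        counts.update(text.lower().split())
    return counts
```

Calling collect_vocab("data/lab/train") would return a Counter whose keys are the words to convert.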

ipa.dict and ascii.dict are MFA-ready pronunciation dictionaries; phoneset.txt and stats.json describe them.


What Yoruba-G2P Produces

1. IPA Dictionary

ọ̀mọ́    ɔ_L m ɔ_H
jẹ́      d͡ʒ ɛ_H
àwọn     a_L w ɔ_M n

2. ASCII Dictionary

ọ̀mọ́    O_L m O_H
jẹ́      dZ E_H
àwọn     a_L w O_M n
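
A quick way to confirm that a dictionary really is ASCII-safe is to scan its phones for non-ASCII characters. This is a small sketch under the assumption that each line is a word followed by whitespace-separated phones; non_ascii_phones is a hypothetical helper, not part of the package.

```python
from typing import Iterable, Set


def non_ascii_phones(dict_lines: Iterable[str]) -> Set[str]:
    """Return every phone (not word) that contains a non-ASCII character."""
    offenders: Set[str] = set()
    for line in dict_lines:
        fields = line.split()
        # fields[0] is the Yorùbá word itself, which may legitimately be non-ASCII.
        offenders.update(p for p in fields[1:] if not p.isascii())
    return offenders
```

An empty result for ascii.dict means every phone label is safe for toolkits that expect ASCII symbols.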

3. Phoneset (phoneset.txt)

a_M
a_H
a_L
e_M
e_H
e_L
ɛ_M
ɛ_H
ɛ_L
i_M
i_H
i_L
o_M
o_H
o_L
ɔ_M
ɔ_H
ɔ_L
kp
gb
s
t
d
m
n
n_H
n_L
...

4. Stats (stats.json)

{
  "num_words": 5478,
  "problem_words": 2,
  "phoneset_size": 41
}
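
The counts in stats.json can be reproduced from a lexicon in a few lines. The sketch below is an illustration, not the package's code: lexicon_stats is a hypothetical helper, and because the source does not define what makes a word a "problem word", that list is simply passed in by the caller.

```python
import json
from typing import Dict, List


def lexicon_stats(lexicon: Dict[str, List[str]], problem_words: List[str]) -> dict:
    """Summarize a {word: [phones]} lexicon with the same keys as stats.json."""
    phoneset = {phone for phones in lexicon.values() for phone in phones}
    return {
        "num_words": len(lexicon),
        "problem_words": len(problem_words),
        "phoneset_size": len(phoneset),
    }


# Placeholder keys stand in for real words; only the phones matter here.
demo = {
    "w1": ["a_L", "w", "ɔ_M", "n"],
    "w2": ["ɔ_L", "m", "ɔ_H"],
}
print(json.dumps(lexicon_stats(demo, problem_words=[]), indent=2))
```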

Theory Behind Yoruba-G2P

Yoruba tone is lexical. Tone is carried by vowels and by syllabic nasals, and the orthography marks it as follows:

Mark     Tone                        Examples
´        H (high)                    ó, é, á
`        L (low)                     ò, è, à
none     M (mid)                     o, e, a
ń, ǹ     syllabic nasal with tone    ńlá, ǹkan
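
The mapping from tone marks to H/M/L can be sketched with the standard unicodedata module. This is an illustration of the idea, not the package's code: tone_bearing_units is a hypothetical helper, and it assumes that n and m count as syllabic nasals only when they carry an acute or grave mark.

```python
import unicodedata
from typing import List, Tuple

ACUTE, GRAVE, DOT_BELOW = "\u0301", "\u0300", "\u0323"
VOWELS = set("aeiou")
NASALS = set("nm")


def tone_bearing_units(word: str) -> List[Tuple[str, str]]:
    """Return (grapheme, tone) pairs for vowels and tone-marked nasals."""
    units = []
    base, marks = "", ""
    # Decompose so each tone mark is a separate combining character.
    # The trailing space flushes the final base + marks cluster.
    for ch in unicodedata.normalize("NFD", word.lower()) + " ":
        if unicodedata.combining(ch):
            marks += ch
            continue
        if base:
            tone = "H" if ACUTE in marks else "L" if GRAVE in marks else "M"
            # Keep the dot below (ẹ, ọ) but drop the tone mark from the grapheme.
            grapheme = base + (DOT_BELOW if DOT_BELOW in marks else "")
            if base in VOWELS or (base in NASALS and tone != "M"):
                units.append((unicodedata.normalize("NFC", grapheme), tone))
        base, marks = ch, ""
    return units


for example in ["ọ̀yọ́", "ńlá", "àwọn"]:
    print(example, tone_bearing_units(example))
```

Working on the decomposed form means precomposed and decomposed spellings of the same word yield the same tones.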

The engine performs:

  1. Unicode normalization (NFC)

  2. Orthographic vowel + tone extraction

  3. IPA transliteration via Epitran

  4. IPA segmentation (phones: vowels, consonants, digraphs)

  5. Tone reattachment using the vowel map from step 2, so each vowel keeps the H/M/L tone written in the orthography

  6. Special handling for:

    • nasal vowels (ɛ̃ → ɛ_M + n)
    • syllabic nasals (ń → n_H, ǹ → n_L, including decomposed forms)
    • affricates (d͡ʒ, t͡ʃ)
    • labial-velars (kp, gb)

The result is emitted as IPA or ASCII-friendly phones (e.g., kp, gb, dZ).
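
Segmentation, tone reattachment, and nasal-vowel expansion can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's implementation: segment and attach_tones are hypothetical names, the IPA input is assumed to mark affricates with a tie bar and nasal vowels with a combining tilde, and syllabic nasals are left out for brevity.

```python
import unicodedata
from typing import List

TIE = "\u0361"     # combining tie bar, as in d͡ʒ
NASAL = "\u0303"   # combining tilde marking a nasal vowel
VOWELS = set("aeiouɛɔ")
DIGRAPHS = ("kp", "gb")


def segment(ipa: str) -> List[str]:
    """Split an IPA string into phones, keeping tie-barred and digraph units whole."""
    ipa = unicodedata.normalize("NFD", ipa)
    phones: List[str] = []
    i = 0
    while i < len(ipa):
        if ipa[i:i + 2] in DIGRAPHS:
            phones.append(ipa[i:i + 2])
            i += 2
            continue
        j = i + 1
        # Absorb combining marks; a tie bar also pulls in the next base character.
        while j < len(ipa) and unicodedata.combining(ipa[j]):
            j += 2 if ipa[j] == TIE else 1
        phones.append(ipa[i:j])
        i = j
    return phones


def attach_tones(phones: List[str], tones: List[str]) -> List[str]:
    """Tag vowels with tones in order; expand a nasal vowel to V_tone followed by n."""
    out: List[str] = []
    pending = iter(tones)
    for phone in phones:
        if phone[0] in VOWELS:
            tone = next(pending, None)
            if tone is None:
                raise ValueError("more vowels than tones")
            out.append(f"{phone[0]}_{tone}")
            if NASAL in phone:
                out.append("n")
        else:
            out.append(phone)
    return out


print(attach_tones(segment("awɔ̃"), ["L", "M"]))
```

Raising on a vowel/tone mismatch, rather than guessing a tone, is one way such words could be surfaced for manual review.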


Directory Outputs (for MFA)

If you run build-lexicon, your output folder will contain:

ipa.dict       # MFA-ready IPA lexicon
ascii.dict     # ASCII-safe lexicon
phoneset.txt   # sorted unique phones
stats.json     # statistics 

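Then feed the lexicon into MFA. The corpus directory should hold the audio together with its matching .lab transcripts. mfa train takes the corpus directory, dictionary path, and output path as positional arguments; the meaning of the output argument differs between MFA versions, so check mfa train --help for your installation:

mfa train \
  wavs/ \
  dict/ipa.dict \
  align_out/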
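
Before starting a long training run, it can help to confirm that every transcript word has a dictionary entry. The sketch below is a standalone check with hypothetical helper names (lexicon_words, missing_words). It assumes whitespace-separated fields and that the transcripts and the dictionary use the same Unicode normalization.

```python
from typing import Iterable, List, Set


def lexicon_words(dict_lines: Iterable[str]) -> Set[str]:
    """Collect the first field (the word) of each non-empty dictionary line."""
    return {line.split()[0] for line in dict_lines if line.strip()}


def missing_words(transcript: str, known: Set[str]) -> List[str]:
    """Return transcript words absent from the dictionary, in first-seen order."""
    seen: Set[str] = set()
    missing: List[str] = []
    for word in transcript.split():
        if word not in known and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing
```

An empty list from missing_words for every transcript means the aligner should not encounter out-of-vocabulary words from this corpus.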

Demo Notebook

A worked example is provided:

demo_yoruba_g2p.ipynb

It includes:

  • IPA & ASCII conversion
  • Batch lexicon building
  • Phoneset extraction
  • Tone distribution visualization
  • Handling problem words

Project Structure

yoruba_g2p/
│
├── core.py     # Main G2P engine
├── utils.py    # helper rules, ASCII mapping
├── cli.py      # command line interface
└── __init__.py

Contributing

Pull requests are welcome!

Steps:

  1. Fork the repo
  2. Create a new branch: git checkout -b feature-name
  3. Commit changes
  4. Push and open a PR

Issues, feature requests, and Yoruba orthography corrections are welcome.


📣 Citation (for research)

Osakuade, O. (2025). Yoruba-G2P: A tone-aware grapheme-to-phoneme converter for Yorùbá. 
https://github.com/OpeyemiOsakuade/yoruba-g2p

License

MIT License — free for academic and commercial use.


⭐ Support Yoruba NLP

If this toolkit helps you, please give the repo a star ⭐ to promote more open-source resources for African languages.

