
Yoruba grapheme-to-phoneme tool with tones (IPA + ASCII + MFA-ready).


🇳🇬 Yoruba-G2P

Tone-Aware Yoruba Grapheme-to-Phoneme Toolkit (IPA + ASCII + MFA-Ready)

Yoruba-G2P is a fully deterministic Python package that converts Yorùbá text into phoneme sequences. It tags every tone explicitly and handles nasals, affricates, and labial-velars.

It provides:

  • IPA dictionary
  • ASCII-safe dictionary (for MFA, ESPnet, Kaldi)
  • Phoneset file
  • Lexicon statistics
  • CLI + Python API

It works on any Yoruba transcript and needs no ML training: conversion is fully rule-based and Epitran-backed.

Installation

From PyPI (recommended)

pip install yoruba-g2p

From GitHub

pip install git+https://github.com/OpeyemiOsakuade/yoruba-g2p.git

Quick Start (Python API)

from yoruba_g2p import YorG2P

g2p = YorG2P()
# Replace these example words with your own
print(g2p.to_ipa("ọ̀mọ́"))
print(g2p.to_ascii("ọ̀mọ́"))

Output:

['ɔ_L', 'm', 'ɔ_H']
['O_L', 'm', 'O_H']

Another example:

g2p.to_ipa("àwọn")
# ['a_L', 'w', 'ɔ_M', 'n']
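
In the Quick Start output, the ASCII form differs from the IPA form only by symbol substitution, with the tone tag left untouched. A minimal sketch of that idea follows. It is not the package's own mapping table: only the two substitutions attested in this README (ɔ → O and d͡ʒ → dZ) are included, and the names IPA_TO_ASCII, phone_to_ascii, and to_ascii_phones are hypothetical.

```python
# Illustrative IPA -> ASCII substitution; NOT the package's actual table.
# Only substitutions that appear in this README's examples are listed.
IPA_TO_ASCII = {
    "ɔ": "O",
    "d\u0361\u0292": "dZ",  # d͡ʒ written with explicit code points
}


def phone_to_ascii(phone):
    """Map one IPA phone, optionally tone-tagged like 'ɔ_H', to its ASCII form."""
    base, sep, tone = phone.partition("_")
    # Phones without an entry pass through unchanged, so a real table
    # would need an entry for every non-ASCII symbol.
    return IPA_TO_ASCII.get(base, base) + sep + tone


def to_ascii_phones(phones):
    """Apply phone_to_ascii to every phone in a sequence."""
    return [phone_to_ascii(p) for p in phones]


print(to_ascii_phones(["ɔ_L", "m", "ɔ_H"]))       # ['O_L', 'm', 'O_H']
print(to_ascii_phones(["a_L", "w", "ɔ_M", "n"]))  # ['a_L', 'w', 'O_M', 'n']
```

Splitting on the underscore keeps the tone tag independent of the symbol table, so the same table serves H, M, and L variants of a vowel.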

Command-line Interface (CLI)

Convert a sentence:

yoruba-g2p --ipa "àwọn ọmọ ń lọ"
yoruba-g2p --ascii "àwọn ọmọ ń lọ"

Build lexicons from .lab transcripts:

yoruba-g2p build-lexicon \
  --lab-dir data/lab/train \
  --out-dir dict/

You will get:

dict/ipa.dict
dict/ascii.dict
dict/phoneset.txt
dict/stats.json

ipa.dict and ascii.dict are MFA-ready lexicons; phoneset.txt and stats.json describe them.
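
Conceptually, building a lexicon means collecting the unique words from the .lab transcripts and writing one pronunciation line per word. The sketch below illustrates that flow with the standard library only. It is an assumption about the general approach, not the package's implementation: collect_words, write_lexicon, and the stand-in pronunciation table are hypothetical, and in real use the converter would be something like the to_ipa method from the Quick Start.

```python
import tempfile
import unicodedata
from pathlib import Path


def nfc(text):
    """Normalize to NFC so composed and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def collect_words(lab_dir):
    """Return the sorted unique whitespace-separated tokens in all *.lab files."""
    words = set()
    for lab_path in sorted(Path(lab_dir).glob("*.lab")):
        words.update(nfc(lab_path.read_text(encoding="utf-8")).split())
    return sorted(words)


def write_lexicon(words, convert, out_path):
    """Write one 'word<TAB>phone phone ...' line per word and return the lines."""
    lines = [f"{word}\t{' '.join(convert(word))}" for word in words]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Hypothetical stand-in for the real converter, so the sketch runs without
# the package installed. Phone lists are copied from this README's examples.
WORD_A = nfc("àwọn")
WORD_B = nfc("ọ̀mọ́")
DEMO_PRONUNCIATIONS = {
    WORD_A: ["a_L", "w", "ɔ_M", "n"],
    WORD_B: ["ɔ_L", "m", "ɔ_H"],
}

with tempfile.TemporaryDirectory() as tmp:
    lab_dir = Path(tmp) / "lab"
    lab_dir.mkdir()
    (lab_dir / "utt1.lab").write_text(f"{WORD_A} {WORD_B}", encoding="utf-8")
    (lab_dir / "utt2.lab").write_text(f"{WORD_B} {WORD_A}", encoding="utf-8")

    demo_words = collect_words(lab_dir)
    demo_lines = write_lexicon(
        demo_words, DEMO_PRONUNCIATIONS.__getitem__, Path(tmp) / "ipa.dict"
    )

print(len(demo_words))  # 2: each word is listed once, however often it occurs
```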


What Yoruba-G2P Produces

1. IPA Dictionary

ọ̀mọ́    ɔ_L m ɔ_H
jẹ́      d͡ʒ ɛ_H
àwọn     a_L w ɔ_M n

2. ASCII Dictionary

ọ̀mọ́    O_L m O_H
jẹ́      dZ E_H
àwọn     a_L w O_M n

3. Phoneset (phoneset.txt)

a_M
a_H
a_L
e_M
e_H
e_L
ɛ_M
ɛ_H
ɛ_L
i_M
i_H
i_L
o_M
o_H
o_L
ɔ_M
ɔ_H
ɔ_L
kp
gb
s
t
d
m
n
n_H
n_L
...

4. Stats (stats.json)

{
  "num_words": 5478,
  "problem_words": 2,
  "phoneset_size": 41
}
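
The phoneset and the two size counts can be recomputed from any lexicon in the format shown above, which is handy as a sanity check before alignment. This is a sketch under assumptions: lexicon_stats is a hypothetical helper, and problem_words is left out because this README does not define how it is counted.

```python
import json


def lexicon_stats(lexicon_text):
    """Return (sorted phoneset, stats dict) for 'word phone phone ...' lines."""
    phones = set()
    num_words = 0
    for line in lexicon_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        num_words += 1
        phones.update(fields[1:])  # everything after the word is a phone
    stats = {"num_words": num_words, "phoneset_size": len(phones)}
    return sorted(phones), stats


# The three IPA dictionary entries shown earlier in this README.
SAMPLE_LEXICON = "\n".join(
    [
        "ọ̀mọ́\tɔ_L m ɔ_H",
        "jẹ́\td\u0361\u0292 ɛ_H",
        "àwọn\ta_L w ɔ_M n",
    ]
)

phoneset, stats = lexicon_stats(SAMPLE_LEXICON)
print(json.dumps(stats))  # {"num_words": 3, "phoneset_size": 9}
```

Tone-tagged vowels count as distinct phones here (ɔ_L, ɔ_M, and ɔ_H are three entries), which matches the phoneset listing above.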

Theory Behind Yoruba-G2P

Yoruba tone is lexical. Vowels and syllabic nasals carry tone, marked as follows:

Mark     Tone                        Examples
´        H                           ó, é, á
`        L                           ò, è, à
none     M                           o, e, a
ń, ǹ     syllabic nasal with tone    ńlá, ǹkan
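
The table can be read directly off the Unicode decomposition of a word: after NFD normalization, an acute or grave accent appears as a separate combining character following its base letter. The sketch below shows one way to extract tones this way. It is a simplified illustration rather than the package's code: tone_bearing_units is a hypothetical name, and the rule that n or m counts as tone-bearing only when it carries an explicit mark is an assumption made for this example.

```python
import unicodedata

ACUTE = "\u0301"      # combining acute accent  -> high tone
GRAVE = "\u0300"      # combining grave accent  -> low tone
DOT_BELOW = "\u0323"  # combining dot below, as in ẹ and ọ
TONE_OF_MARK = {ACUTE: "H", GRAVE: "L"}
VOWELS = set("aeiou")


def tone_bearing_units(word):
    """Return (letter, tone) pairs for vowels and tone-marked nasals in word."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    units = []
    i = 0
    while i < len(decomposed):
        base = decomposed[i]
        i += 1
        # Gather the combining marks attached to this base letter.
        marks = ""
        while i < len(decomposed) and unicodedata.combining(decomposed[i]):
            marks += decomposed[i]
            i += 1
        tone = next((TONE_OF_MARK[m] for m in marks if m in TONE_OF_MARK), None)
        # Keep the dot below so that ẹ/ọ stay distinct from e/o.
        suffix = DOT_BELOW if DOT_BELOW in marks else ""
        letter = unicodedata.normalize("NFC", base + suffix)
        if base in VOWELS:
            units.append((letter, tone or "M"))  # unmarked vowel -> mid tone
        elif base in "nm" and tone is not None:
            units.append((letter, tone))         # syllabic nasal with tone
    return units


print(tone_bearing_units("ńlá"))   # [('n', 'H'), ('a', 'H')]
print(tone_bearing_units("ǹkan"))  # [('n', 'L'), ('a', 'M')]
```

Working on the decomposed form means precomposed and decomposed spellings of the same word give the same result.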

The engine performs:

  1. Unicode normalization (NFC)
  2. Orthographic vowel + tone extraction
  3. IPA transliteration via Epitran
  4. IPA segmentation (phones: vowels, consonants, digraphs)
  5. Tone reattachment using the vowel + tone map from step 2, so that every vowel phone carries an explicit H, M, or L tag
  6. Special handling of:

    • nasal vowels (ɛ̃ → ɛ_M + n)
    • syllabic nasals (ń → n_H, ǹ → n_L, decomposed forms)
    • affricates (d͡ʒ, t͡ʃ)
    • labial-velars (kp, gb)

Outputs IPA or ASCII-friendly phones (e.g., kp, gb, dZ).
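
Steps 4 to 6 can be sketched as a greedy longest-match segmentation followed by tone tagging. The code below is an illustration under stated assumptions, not the engine itself: segment_ipa and attach_tones are hypothetical names, the input IPA strings are assumed rather than taken from actual Epitran output, and syllabic nasals are left out for brevity.

```python
import unicodedata

# Multi-character phones are tried first so they are never split.
MULTI_CHAR_PHONES = ["d\u0361\u0292", "t\u0361\u0283", "kp", "gb"]  # d͡ʒ, t͡ʃ, kp, gb
IPA_VOWELS = set("aeiouɛɔ")
NASALIZATION = "\u0303"  # combining tilde marking a nasal vowel


def segment_ipa(ipa):
    """Split an IPA string into phones, keeping affricates and digraphs whole."""
    ipa = unicodedata.normalize("NFD", ipa)  # make any tilde a separate mark
    phones = []
    i = 0
    while i < len(ipa):
        for multi in MULTI_CHAR_PHONES:
            if ipa.startswith(multi, i):
                phones.append(multi)
                i += len(multi)
                break
        else:
            phone = ipa[i]
            i += 1
            # Keep a nasalization tilde attached to the vowel it follows.
            if i < len(ipa) and ipa[i] == NASALIZATION:
                phone += NASALIZATION
                i += 1
            phones.append(phone)
    return phones


def attach_tones(phones, tones):
    """Tag each vowel with the next tone; expand nasal vowels to vowel + n."""
    tone_iter = iter(tones)  # next() raises StopIteration if tones run out
    tagged = []
    for phone in phones:
        base = phone.replace(NASALIZATION, "")
        if base in IPA_VOWELS:
            tagged.append(f"{base}_{next(tone_iter)}")
            if NASALIZATION in phone:
                tagged.append("n")
        else:
            tagged.append(phone)
    return tagged


print(attach_tones(segment_ipa("ɔmɔ"), ["L", "H"]))  # ['ɔ_L', 'm', 'ɔ_H']
print(segment_ipa("kpa"))                            # ['kp', 'a']
```

Tones are consumed in order, one per vowel, which is why the orthographic tone sequence has to be extracted before transliteration discards the diacritics.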


Directory Outputs (for MFA)

If you run build-lexicon, your output folder will contain:

ipa.dict       # MFA-ready IPA lexicon
ascii.dict     # ASCII-safe lexicon
phoneset.txt   # sorted unique phones
stats.json     # statistics 

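Then feed the lexicon into MFA. In MFA 2.x and later, mfa train takes the corpus directory, the dictionary, and the output acoustic model path as positional arguments (the model file name below is only an example):

mfa train \
  wavs/ \
  dict/ipa.dict \
  yoruba_acoustic_model.zip \
  --output_directory align_out/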

Demo Notebook

A worked example is provided:

demo_yoruba_g2p.ipynb

It includes:

  • IPA & ASCII conversion
  • Batch lexicon building
  • Phoneset extraction
  • Tone distribution visualization
  • Handling problem words

Project Structure

yoruba_g2p/
│
├── core.py     # Main G2P engine
├── utils.py    # helper rules, ASCII mapping
├── cli.py      # command line interface
└── __init__.py

Contributing

Pull requests are welcome!

Steps:

  1. Fork the repo
  2. Create a new branch: git checkout -b feature-name
  3. Commit changes
  4. Push and open a PR

Issues, feature requests, and Yoruba orthography corrections are welcome.


📣 Citation (for research)

Osakuade, O. (2025). Yoruba-G2P: A tone-aware grapheme-to-phoneme converter for Yorùbá. 
https://github.com/OpeyemiOsakuade/yoruba-g2p

License

MIT License — free for academic and commercial use.


⭐ Support Yoruba NLP

If this toolkit helps you, please give the repo a star ⭐ to promote more open-source resources for African languages.

