Dialect-aware Portuguese (Lusophone) text-to-IPA phonemizer
TugaPhone — Dialect-aware Portuguese Phonemizer
TugaPhone is a Python library that phonemizes arbitrary Portuguese text across major Lusophone dialects (pt-PT, pt-BR, pt-AO, pt-MZ, pt-TL). It uses a curated phonetic lexicon plus a rule-based fallback to deliver plausible phoneme transcriptions while preserving dialectal variation.
Choveu muito ontem à noite.
pt-PT-x-porto → ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈuoj·tɨ
pt-PT → ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈoj·tɨ
pt-BR → ˈʃɔ·vew mwˈĩ·tʊ õ·ˈtẽ ˈa nˈoj·tʃɪ
pt-AO → ˈʃɔ·vew mˈũjn·tʊ õ·ˈtẽ ˈa nˈoj·tɨ
pt-MZ → ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈɔj·tɨ
pt-TL → ˈʃɔ·vew mˈuj·tʊ õ·ˈtẽ ˈa nˈojtʰ
🚀 Features
- Multi-dialect support: European Portuguese (pt-PT), Brazilian Portuguese (pt-BR), Angolan (pt-AO), Mozambican (pt-MZ), and Timorese (pt-TL)
- Regional accent modeling: Additional micro-dialects like Porto, Minho, Braga, Trás-os-Montes, and more
- Hybrid approach: Combines a curated phonetic lexicon (Portuguese Phonetic Lexicon) with rule-based G2P fallback
- Context-aware: Takes part-of-speech tags into account for homograph disambiguation
- Number normalization: Automatically converts digits to their Portuguese spoken forms with proper gender agreement
- Syllabification: Rule-based syllable boundary detection (~99.6% accuracy on benchmark)
- Stress detection: Automatic stress placement following Portuguese phonological rules
- IPA output: Full International Phonetic Alphabet transcription with stress markers and syllable boundaries
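As a rough illustration of the stress-detection feature, the sketch below applies the standard orthographic stress conventions of Portuguese to a word that has already been split into syllables. The function name `stressed_syllable` and its list-of-syllables input are assumptions made for this example, not TugaPhone's API, and the library's actual rules may cover more cases.

```python
import re

# Acute and circumflex accents mark the stressed vowel.
# The grave accent (à) marks a contraction, not stress, so it is left out.
ACCENTED = set("áéíóúâêô")
# A tilde marks stress only when no acute/circumflex is present (irmã vs. órgão).
TILDE = set("ãõ")
# Unaccented words with these endings are stressed on the penultimate syllable.
PENULTIMATE_ENDING = re.compile(r"(a|e|o|as|es|os|am|em|ens)$")


def stressed_syllable(syllables: list[str]) -> int:
    """Return the 0-based index of the stressed syllable (lower-case input assumed)."""
    if len(syllables) == 1:
        return 0
    # 1. An explicit accent mark wins; acute/circumflex outrank the tilde.
    for marks in (ACCENTED, TILDE):
        for i, syl in enumerate(syllables):
            if any(ch in marks for ch in syl):
                return i
    # 2. Otherwise the word ending decides between penultimate and final stress.
    if PENULTIMATE_ENDING.search("".join(syllables)):
        return len(syllables) - 2
    return len(syllables) - 1


print(stressed_syllable(["ga", "to"]))                      # 0
print(stressed_syllable(["fa", "lar"]))                     # 1
print(stressed_syllable(["fri", "go", "rí", "fi", "co"]))   # 2
```

The sketch ignores loanwords and other exceptions, which is where a lexicon lookup would normally take over.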
📦 Installation
pip install tugaphone
🧰 Usage
Companion libraries
The following libraries are dependencies of tugaphone and may be useful on their own:
- Tugalex - lexicon of words and exceptions
- TugaTagger - Portuguese part-of-speech tagger
- silabificador - Portuguese syllabification
Basic Phonemization
from tugaphone import TugaPhonemizer
ph = TugaPhonemizer()
sentences = [
    "O gato dorme.",
    "Tu falas português muito bem.",
    "O comboio chegou à estação.",
    "A menina comeu o pão todo.",
    "Vou pôr a manteiga no frigorífico.",
]
for s in sentences:
    print(f"Sentence: {s}")
    for code in ["pt-PT", "pt-BR", "pt-AO", "pt-MZ", "pt-TL"]:
        phones = ph.phonemize_sentence(s, code)
        print(f" {code} → {phones}")
    print("-----")
Regional Dialects
from tugaphone import TugaPhonemizer
from tugaphone.regional import PortoDialect, MinhoDialect, BragaDialect
ph = TugaPhonemizer()
sentence = "O Porto é uma cidade bonita."
# Standard European Portuguese
print(f"pt-PT: {ph.phonemize_sentence(sentence, 'pt-PT')}")
# Porto accent (rising diphthongs, rhotic realization)
print(f"Porto: {ph.phonemize_sentence(sentence, regional_dialect=PortoDialect)}")
# Minho accent (vowel resistance, open vowels)
print(f"Minho: {ph.phonemize_sentence(sentence, regional_dialect=MinhoDialect)}")
Number Normalization
from tugaphone.number_utils import normalize_numbers
# Automatic gender agreement
print(normalize_numbers("vou comprar 1 casa")) # uma casa
print(normalize_numbers("vou comprar 2 casas")) # duas casas
print(normalize_numbers("vou adotar 1 cão")) # um cão
print(normalize_numbers("vou adotar 2 cães")) # dois cães
# Ordinals
print(normalize_numbers("1º lugar")) # primeiro lugar
print(normalize_numbers("1ª vez")) # primeira vez
# Large numbers with scale differences
print(normalize_numbers("897654356789098", "pt-PT")) # long-scale (biliões)
print(normalize_numbers("897654356789098", "pt-BR")) # short-scale (trilhões)
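The long-scale versus short-scale difference can be made concrete with a small standalone sketch. It only splits a number into scale groups and does not spell anything out; `scale_groups` is a hypothetical helper written for this illustration and is not part of `tugaphone.number_utils`. For brevity it uses plural scale names only.

```python
def scale_groups(n: int, long_scale: bool) -> list[tuple[int, str]]:
    """Split n into (value, scale name) pairs, most significant first."""
    if long_scale:
        # pt-PT: a new scale name every 10**6 (bilião = 10**12)
        step, names = 10**6, ["", "milhões", "biliões", "triliões"]
    else:
        # pt-BR: a new scale name every 10**3 (bilhão = 10**9, trilhão = 10**12)
        step, names = 10**3, ["", "mil", "milhões", "bilhões", "trilhões"]
    groups = []
    idx = 0
    while n > 0:
        if idx >= len(names):
            raise ValueError("number too large for this sketch")
        n, rem = divmod(n, step)
        if rem:
            groups.append((rem, names[idx]))
        idx += 1
    return groups[::-1]


print(scale_groups(897654356789098, long_scale=True))
# [(897, 'biliões'), (654356, 'milhões'), (789098, '')]
print(scale_groups(897654356789098, long_scale=False))
# [(897, 'trilhões'), (654, 'bilhões'), (356, 'milhões'), (789, 'mil'), (98, '')]
```

The same leading group, 897, is therefore read as "biliões" in pt-PT and "trilhões" in pt-BR.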
Advanced: Tokenization and Features
from tugaphone.tokenizer import Sentence
from tugaphone.dialects import EuropeanPortuguese
sentence = Sentence("O cão comeu o pão.", dialect=EuropeanPortuguese())
print(f"IPA: {sentence.ipa}")
# Access word-level details
for word in sentence.words:
    print(f"\nWord: {word.surface}")
    print(f" Syllables: {'.'.join(word.syllables)}")
    print(f" Stress: syllable {word.stressed_syllable_idx}")
    print(f" IPA: {word.ipa}")
    # Access grapheme-level details
    for grapheme in word.graphemes:
        if grapheme.is_diphthong:
            print(f" Diphthong: {grapheme.surface} → {grapheme.ipa}")
📖 Documentation
Supported Dialects
| Dialect Code | Region | Characteristics |
|---|---|---|
| pt-PT | European Portuguese (Lisbon) | Heavy vowel reduction, fricative palatalization, uvular /r/ |
| pt-BR | Brazilian Portuguese (Rio) | Less vowel reduction, t/d palatalization, l-vocalization |
| pt-AO | Angolan Portuguese (Luanda) | Moderate vowel reduction, alveolar trill /r/, Bantu substrate |
| pt-MZ | Mozambican Portuguese (Maputo) | Similar to European with regional variation, Bantu influence |
| pt-TL | Timorese Portuguese (Dili) | Conservative pronunciation, Tetum substrate influence |
Regional Accents (Experimental)
TugaPhone includes experimental support for sub-regional Portuguese accents:
- PortoDialect: Rising diphthongs (o → uo), rhotic realization
- MinhoDialect: Reduced vowel centralization, open vowel preference
- BragaDialect: Palatal epenthesis (abelha → abeilha)
- TrasMontanoDialect: Palatal affrication, s-voicing, final nasal denasalization
- FafeDialect: Nasal diphthongization (gente → geinte)
Note: These are based on documented phonological features but should be considered approximate. Real-world variation is more complex.
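One way to picture a regional accent is as a small set of rewrite rules applied on top of the standard form. The sketch below reproduces the two orthographic examples from the list above (abelha → abeilha, gente → geinte). The rule tables, the exact contexts in the patterns, and the `apply_rules` helper are all assumptions made for illustration; TugaPhone's dialect classes may work on phonemes rather than spelling and use different conditions.

```python
import re

# Hypothetical rule tables: (pattern, replacement) pairs applied in order.
# Braga: palatal epenthesis, an "i" appears before "lh" after "e".
BRAGA_RULES = [(re.compile(r"e(?=lh)"), "ei")]
# Fafe: nasal diphthongization, "en" before a stop becomes "ein".
FAFE_RULES = [(re.compile(r"en(?=[tdcg])"), "ein")]


def apply_rules(word: str, rules: list[tuple[re.Pattern, str]]) -> str:
    """Apply each rewrite rule to the word, in order."""
    for pattern, replacement in rules:
        word = pattern.sub(replacement, word)
    return word


print(apply_rules("abelha", BRAGA_RULES))  # abeilha
print(apply_rules("gente", FAFE_RULES))    # geinte
```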
Part-of-Speech Tagging
TugaPhone uses POS tags to disambiguate homographs:
from tugaphone import TugaPhonemizer
ph = TugaPhonemizer(postag_engine="spacy") # or "brill", "auto"
# "para" has different pronunciations as preposition vs. verb
print(ph.phonemize_sentence("Vou para casa.")) # preposition
print(ph.phonemize_sentence("Ele para o carro.")) # verb
Supported engines:
- spacy: Requires spacy and a Portuguese model (most accurate)
- brill: Requires brill-postaggers (lighter, faster)
- lexicon: Uses built-in lexicon lookup (limited coverage)
- auto: Falls back through available engines
- dummy: Simple rule-based fallback (no dependencies)
🏗️ Architecture
TugaPhone uses a hierarchical tokenization model:
Sentence → Words → Graphemes → Characters
Each level applies context-sensitive phonological rules:
- Character level: Vowel quality, consonant allophones
- Grapheme level: Digraphs (ch, nh), diphthongs (ai, ou)
- Word level: Stress assignment, syllabification
- Sentence level: Prosodic boundaries (future: liaison, phrasal stress)
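To make the grapheme level more concrete, here is a minimal sketch that splits a word into graphemes, treating a few digraphs and diphthongs as single units. `split_graphemes` and the `MULTI_CHAR` set are hypothetical and deliberately incomplete. Real diphthong detection is context-sensitive (hiatus, stress), which this greedy left-to-right match ignores.

```python
# A small, incomplete inventory of two-letter graphemes.
MULTI_CHAR = {
    "ch", "nh", "lh", "rr", "ss",        # consonant digraphs
    "ai", "ei", "oi", "au", "eu", "ou",  # falling diphthongs
}


def split_graphemes(word: str) -> list[str]:
    """Greedily split a word into graphemes, preferring two-letter units."""
    word = word.lower()
    graphemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in MULTI_CHAR:
            graphemes.append(pair)
            i += 2
        else:
            graphemes.append(word[i])
            i += 1
    return graphemes


print(split_graphemes("choveu"))  # ['ch', 'o', 'v', 'eu']
print(split_graphemes("ninho"))   # ['n', 'i', 'nh', 'o']
print(split_graphemes("noite"))   # ['n', 'oi', 't', 'e']
```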
The phonemization process:
1. Normalize text (numbers → words)
2. POS tagging (for homograph disambiguation)
3. Lexicon lookup (for known words)
4. Rule-based G2P fallback (for unknown words)
5. Dialect-specific transformations (regional accents)
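The lexicon-then-rules part of this process can be sketched in a few lines. Everything here is a toy: `TOY_LEXICON`, `TOY_RULES`, and `phonemize_word` are hypothetical names, the single lexicon entry is illustrative, and the letter-by-letter fallback stands in for what is a far richer rule system in practice.

```python
# Toy lexicon: known words map directly to a transcription.
TOY_LEXICON = {"gato": "ˈga·tu"}

# Toy fallback rules: a per-letter mapping; unlisted letters pass through.
TOY_RULES = {"c": "k", "x": "ʃ"}


def toy_g2p(word: str) -> str:
    """Rule-based fallback: map each letter independently."""
    return "".join(TOY_RULES.get(ch, ch) for ch in word)


def phonemize_word(word: str) -> tuple[str, str]:
    """Return (transcription, source), trying the lexicon before the rules."""
    key = word.lower()
    if key in TOY_LEXICON:
        return TOY_LEXICON[key], "lexicon"
    return toy_g2p(key), "rules"


print(phonemize_word("Gato"))  # ('ˈga·tu', 'lexicon')
print(phonemize_word("caxa"))  # ('kaʃa', 'rules')
```

Returning the source alongside the transcription makes it easy to see how much of a text fell through to the fallback, which relates to the lexicon-coverage limitation noted in this README.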
⚠️ Limitations & Future Work
Current Limitations
- Lexicon coverage: Many words (especially names, foreign words, neologisms) rely solely on rule-based fallback
- Sparse coverage: African and Timorese dialects have less lexicon data than European/Brazilian
- Lexical variation: Dialect-specific vocabulary (e.g., "trem" vs "comboio") is not handled; text is assumed orthographically consistent
- Regional accents: Sub-regional dialects are experimental and approximate
- Prosody: Sentence-level features (liaison, phrasal stress, intonation) are simplified
- Homograph disambiguation: Limited to POS-based rules; doesn't handle semantic context
🤝 Contributing
Contributions are welcome! Areas where help is especially needed:
- Lexicon expansion: Especially for pt-AO, pt-MZ, pt-TL
- Regional accent validation: Native speaker verification of dialectal features
- Test cases: Edge cases, challenging words, dialectal examples
- Documentation: Usage examples, linguistic explanations
📄 License
This project is licensed under the Apache License 2.0. See LICENSE for details.