Dialect-aware Portuguese (Lusophone) text-to-IPA phonemizer

TugaPhone — Dialect-aware Portuguese Phonemizer

TugaPhone is a Python library that phonemizes arbitrary Portuguese text across major Lusophone dialects (pt-PT, pt-BR, pt-AO, pt-MZ, pt-TL). It uses a curated phonetic lexicon plus a rule-based fallback to deliver plausible phoneme transcriptions while preserving dialectal variation.

Choveu muito ontem à noite.
pt-PT-x-porto → ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈuoj·tɨ 
pt-PT → ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈoj·tɨ 
pt-BR → ˈʃɔ·vew mwˈĩ·tʊ õ·ˈtẽ ˈa nˈoj·tʃɪ 
pt-AO → ˈʃɔ·vew mˈũjn·tʊ õ·ˈtẽ ˈa nˈoj·tɨ 
pt-MZ → ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈɔj·tɨ 
pt-TL → ˈʃɔ·vew mˈuj·tʊ õ·ˈtẽ ˈa nˈojtʰ
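
The sample output above marks primary stress with `ˈ` and syllable boundaries with `·`. As a minimal sketch of post-processing such a string (the helper names are hypothetical and not part of TugaPhone), the transcription can be split into words and syllables and the stressed syllable located:

```python
from typing import List, Optional

STRESS = "ˈ"    # primary-stress mark, as it appears in the sample output
SYLLABLE = "·"  # syllable separator, as it appears in the sample output


def split_transcription(ipa: str) -> List[List[str]]:
    """Split a transcription into words, then each word into syllables."""
    return [word.split(SYLLABLE) for word in ipa.split()]


def stressed_index(syllables: List[str]) -> Optional[int]:
    """Return the index of the first syllable carrying a stress mark, if any."""
    for i, syllable in enumerate(syllables):
        if STRESS in syllable:
            return i
    return None


# The pt-PT line from the example above.
words = split_transcription("ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈoj·tɨ")
for syllables in words:
    print(syllables, stressed_index(syllables))
```

This relies only on the two marker characters visible in the example; other output conventions would need their own handling.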

🚀 Features

Multi-dialect support: European Portuguese (pt-PT), Brazilian Portuguese (pt-BR), Angolan (pt-AO), Mozambican (pt-MZ), and Timorese (pt-TL)
Regional accent modeling: Additional micro-dialects like Porto, Minho, Braga, Trás-os-Montes, and more
Hybrid approach: Combines a curated phonetic lexicon (Portuguese Phonetic Lexicon) with rule-based G2P fallback
Context-aware: Takes part-of-speech tags into account for homograph disambiguation
Number normalization: Automatically converts digits to their Portuguese spoken forms with proper gender agreement
Syllabification: Rule-based syllable boundary detection (~99.6% accuracy on benchmark)
Stress detection: Automatic stress placement following Portuguese phonological rules
IPA output: Full International Phonetic Alphabet transcription with stress markers and syllable boundaries
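
To illustrate what rule-based stress placement involves, here is a minimal sketch of the standard orthographic stress conventions of Portuguese. It is not TugaPhone's implementation: the function names are hypothetical, the word must already be split into syllables, and exceptions are ignored.

```python
import unicodedata
from typing import List, Tuple

# Endings that take penultimate stress when no accent mark is written.
PENULT_ENDINGS = ("a", "as", "e", "es", "o", "os", "am", "em", "ens")
ACUTE, CIRCUMFLEX, TILDE = "\u0301", "\u0302", "\u0303"


def has_mark(syllable: str, marks: Tuple[str, ...]) -> bool:
    """Check whether a syllable carries any of the given combining marks."""
    decomposed = unicodedata.normalize("NFD", syllable)
    return any(mark in decomposed for mark in marks)


def stressed_syllable(syllables: List[str]) -> int:
    """Return the index of the stressed syllable of a pre-syllabified word."""
    # 1. A written acute or circumflex accent always marks the stress.
    for i, syllable in enumerate(syllables):
        if has_mark(syllable, (ACUTE, CIRCUMFLEX)):
            return i
    # 2. Otherwise a tilde marks it (as in "estação").
    for i, syllable in enumerate(syllables):
        if has_mark(syllable, (TILDE,)):
            return i
    # 3. Otherwise the ending decides: penultimate or final syllable.
    if len(syllables) == 1:
        return 0
    word = "".join(syllables).lower()
    if word.endswith(PENULT_ENDINGS):
        return len(syllables) - 2
    return len(syllables) - 1


for syllables in (["ga", "to"], ["fa", "lar"], ["es", "ta", "ção"]):
    print(".".join(syllables), stressed_syllable(syllables))
```

Checking accents before endings matters because a written accent overrides the default pattern, as in "frigorífico".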

📦 Installation

pip install tugaphone

🧰 Usage

Companion libraries

The following libraries are dependencies of TugaPhone and may also be useful on their own:

Tugalex - Lexicon of words and exceptions
TugaTagger - Portuguese part-of-speech tagger
silabificador - Portuguese syllabification

Basic Phonemization

from tugaphone import TugaPhonemizer

ph = TugaPhonemizer()

sentences = [
    "O gato dorme.",
    "Tu falas português muito bem.",
    "O comboio chegou à estação.",
    "A menina comeu o pão todo.",
    "Vou pôr a manteiga no frigorífico."
]

for s in sentences:
    print(f"Sentence: {s}")
    for code in ["pt-PT", "pt-BR", "pt-AO", "pt-MZ", "pt-TL"]:
        phones = ph.phonemize_sentence(s, code)
        print(f"  {code} → {phones}")
    print("-----")

Regional Dialects

from tugaphone import TugaPhonemizer
from tugaphone.regional import PortoDialect, MinhoDialect, BragaDialect

ph = TugaPhonemizer()

sentence = "O Porto é uma cidade bonita."

# Standard European Portuguese
print(f"pt-PT: {ph.phonemize_sentence(sentence, 'pt-PT')}")

# Porto accent (rising diphthongs, rhotic realization)
print(f"Porto: {ph.phonemize_sentence(sentence, regional_dialect=PortoDialect)}")

# Minho accent (reduced vowel centralization, open vowels)
print(f"Minho: {ph.phonemize_sentence(sentence, regional_dialect=MinhoDialect)}")

Number Normalization

from tugaphone.number_utils import normalize_numbers

# Automatic gender agreement
print(normalize_numbers("vou comprar 1 casa"))    # uma casa
print(normalize_numbers("vou comprar 2 casas"))   # duas casas
print(normalize_numbers("vou adotar 1 cão"))      # um cão
print(normalize_numbers("vou adotar 2 cães"))     # dois cães

# Ordinals
print(normalize_numbers("1º lugar"))              # primeiro lugar
print(normalize_numbers("1ª vez"))                # primeira vez

# Large numbers with scale differences
print(normalize_numbers("897654356789098", "pt-PT"))  # long-scale (biliões)
print(normalize_numbers("897654356789098", "pt-BR"))  # short-scale (trilhões)
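
The scale difference in the last two calls can be sketched independently of the library. The following is an illustration only, with hypothetical names: it picks the largest scale word that fits a number, always uses plural forms, and assumes every dialect other than pt-BR follows the long scale.

```python
from typing import Tuple

# Scale words keyed by power of ten (plural forms only, for brevity).
SHORT_SCALE = {3: "mil", 6: "milhões", 9: "bilhões", 12: "trilhões"}  # pt-BR
LONG_SCALE = {3: "mil", 6: "milhões", 12: "biliões"}                  # pt-PT


def leading_scale(n: int, dialect: str) -> Tuple[int, str]:
    """Return (leading count, scale word) for the largest scale that fits n."""
    # Assumption: only pt-BR uses the short scale.
    scales = SHORT_SCALE if dialect == "pt-BR" else LONG_SCALE
    for exponent in sorted(scales, reverse=True):
        if n >= 10 ** exponent:
            return n // 10 ** exponent, scales[exponent]
    return n, ""


print(leading_scale(897654356789098, "pt-PT"))  # leading group in biliões
print(leading_scale(897654356789098, "pt-BR"))  # leading group in trilhões
```

In the long scale there is no separate word for 10^9, so a value such as 5 000 000 000 comes out as 5000 "milhões" ("cinco mil milhões"), whereas the short scale yields 5 "bilhões".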

Advanced: Tokenization and Features

from tugaphone.tokenizer import Sentence
from tugaphone.dialects import EuropeanPortuguese

sentence = Sentence("O cão comeu o pão.", dialect=EuropeanPortuguese())

print(f"IPA: {sentence.ipa}")

# Access word-level details
for word in sentence.words:
    print(f"\nWord: {word.surface}")
    print(f"  Syllables: {'.'.join(word.syllables)}")
    print(f"  Stress: syllable {word.stressed_syllable_idx}")
    print(f"  IPA: {word.ipa}")
    
    # Access grapheme-level details
    for grapheme in word.graphemes:
        if grapheme.is_diphthong:
            print(f"  Diphthong: {grapheme.surface} → {grapheme.ipa}")

📖 Documentation

Supported Dialects

Dialect Code	Region	Characteristics
`pt-PT`	European Portuguese (Lisbon)	Heavy vowel reduction, fricative palatalization, uvular /r/
`pt-BR`	Brazilian Portuguese (Rio)	Less vowel reduction, t/d palatalization, l-vocalization
`pt-AO`	Angolan Portuguese (Luanda)	Moderate vowel reduction, alveolar trill /r/, Bantu substrate
`pt-MZ`	Mozambican Portuguese (Maputo)	Similar to European with regional variation, Bantu influence
`pt-TL`	Timorese Portuguese (Dili)	Conservative pronunciation, Tetum substrate influence

Regional Accents (Experimental)

TugaPhone includes experimental support for sub-regional Portuguese accents:

PortoDialect: Rising diphthongs (o → uo), rhotic realization
MinhoDialect: Reduced vowel centralization, open vowel preference
BragaDialect: Palatal epenthesis (abelha → abeilha)
TrasMontanoDialect: Palatal affrication, s-voicing, final nasal denasalization
FafeDialect: Nasal diphthongization (gente → geinte)

Note: These are based on documented phonological features but should be considered approximate. Real-world variation is more complex.
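
As a rough illustration of what such an accent rule might look like, the sketch below applies the Porto rising diphthong (o → uo) to an IPA string as a plain text substitution. This is an assumption about one possible approach, not how TugaPhone implements regional dialects, and the function name is hypothetical.

```python
import re

# Match a stressed oral "o": an "o" right after the stress mark that is not
# followed by a combining tilde (which would make it a nasal vowel).
STRESSED_O = re.compile(r"ˈo(?!\u0303)")


def porto_rising_diphthong(ipa: str) -> str:
    """Turn stressed /o/ into the rising diphthong /uo/."""
    return STRESSED_O.sub("ˈuo", ipa)


standard = "ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈoj·tɨ"
print(porto_rising_diphthong(standard))
```

Applied to the pt-PT transcription of "Choveu muito ontem à noite.", only the stressed vowel of "noite" changes, which agrees with the pt-PT-x-porto line shown at the top of this page.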

Part-of-Speech Tagging

TugaPhone uses POS tags to disambiguate homographs:

from tugaphone import TugaPhonemizer

ph = TugaPhonemizer(postag_engine="spacy")  # or "brill", "auto"

# "para" has different pronunciations as preposition vs. verb
print(ph.phonemize_sentence("Vou para casa."))      # preposition
print(ph.phonemize_sentence("Ele para o carro."))   # verb
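
Conceptually, POS-based disambiguation amounts to keying pronunciations by word and tag rather than by word alone. The sketch below shows the idea with a hypothetical table; the tags follow the Universal POS tag set, and the transcriptions are illustrative European Portuguese values, not entries taken from TugaPhone's lexicon.

```python
from typing import Dict, Optional, Tuple

# Hypothetical homograph table: (word, POS tag) -> pronunciation.
HOMOGRAPHS: Dict[Tuple[str, str], str] = {
    ("para", "ADP"): "ˈpɐ·ɾɐ",   # preposition (illustrative value)
    ("para", "VERB"): "ˈpa·ɾɐ",  # verb form of "parar" (illustrative value)
}


def pronounce(word: str, pos: str) -> Optional[str]:
    """Look up a pronunciation by word and POS tag; None if not listed."""
    return HOMOGRAPHS.get((word.lower(), pos))


print(pronounce("para", "ADP"))
print(pronounce("Para", "VERB"))
```

A word missing from the table returns `None`, which is where a general lexicon or rule-based fallback would take over.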

Supported engines:

spacy: Requires spaCy and a Portuguese model (most accurate)
brill: Requires brill-postaggers (lighter, faster)
lexicon: Uses built-in lexicon lookup (limited coverage)
auto: Falls back through available engines
dummy: Simple rule-based fallback (no dependencies)
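
One plausible way to implement an "auto" mode is to probe for optional dependencies in order of preference and use the first one that is importable. The sketch below uses only the standard library; the engine-to-module mapping is an assumption (in particular the module name `brill_postaggers`), and `pick_engine` is a hypothetical helper rather than part of TugaPhone.

```python
from importlib.util import find_spec
from typing import Iterable, Tuple

# Engine name -> top-level module it would need. Module names are assumptions.
ENGINE_MODULES = [("spacy", "spacy"), ("brill", "brill_postaggers")]


def pick_engine(candidates: Iterable[Tuple[str, str]],
                default: str = "dummy") -> str:
    """Return the first engine whose module is importable, else the default."""
    for engine, module in candidates:
        # find_spec returns None when a top-level module is not installed.
        if find_spec(module) is not None:
            return engine
    return default


print(pick_engine(ENGINE_MODULES))
```

Probing with `find_spec` avoids importing heavy packages just to check that they are available.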

🏗️ Architecture

TugaPhone uses a hierarchical tokenization model:

Sentence → Words → Graphemes → Characters

Each level applies context-sensitive phonological rules:

Character level: Vowel quality, consonant allophones
Grapheme level: Digraphs (ch, nh), diphthongs (ai, ou)
Word level: Stress assignment, syllabification
Sentence level: Prosodic boundaries (future: liaison, phrasal stress)
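
The grapheme level can be illustrated with a small greedy tokenizer that groups two-letter units before falling back to single characters. This is a simplified sketch under stated assumptions: the unit inventory is a small hand-picked subset, the function name is hypothetical, and real diphthong detection depends on context (hiatus, accents) that is ignored here.

```python
from typing import List

# A small, non-exhaustive inventory of two-letter units.
DIGRAPHS = ("ch", "nh", "lh", "rr", "ss")
DIPHTHONGS = ("ai", "ei", "oi", "ou", "au", "eu")
UNITS = DIGRAPHS + DIPHTHONGS


def graphemes(word: str) -> List[str]:
    """Split a word into graphemes, preferring two-letter units."""
    result = []
    i = 0
    word = word.lower()
    while i < len(word):
        pair = word[i:i + 2]
        if pair in UNITS:
            result.append(pair)
            i += 2
        else:
            result.append(word[i])
            i += 1
    return result


for w in ("chuva", "ninho", "noite"):
    print(w, graphemes(w))
```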

The phonemization process:

Normalize text (numbers → words)
POS tagging (for homograph disambiguation)
Lexicon lookup (for known words)
Rule-based G2P fallback (for unknown words)
Dialect-specific transformations (regional accents)
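
The lexicon-first, rules-second part of this process (steps 3 and 4) can be sketched as follows. The lexicon here is a toy dictionary seeded with two pt-PT transcriptions from the example at the top of this page, and the fallback merely tags unknown words instead of applying real G2P rules; all names are hypothetical, and normalization, POS tagging, and dialect transformations are left out.

```python
from typing import Callable, Dict

# Toy lexicon; entries copied from the pt-PT sample output above.
TOY_LEXICON: Dict[str, str] = {"muito": "mˈũj·tu", "noite": "nˈoj·tɨ"}


def mark_unknown(word: str) -> str:
    """Placeholder fallback: a real system would apply G2P rules here."""
    return f"<{word}>"


def phonemize(text: str, lexicon: Dict[str, str],
              fallback: Callable[[str], str]) -> str:
    """Look each word up in the lexicon, falling back for unknown words."""
    output = []
    for token in text.split():
        word = token.strip(".,;:!?").lower()
        if not word:
            continue  # token was punctuation only
        output.append(lexicon[word] if word in lexicon else fallback(word))
    return " ".join(output)


print(phonemize("Muito noite xyz.", TOY_LEXICON, mark_unknown))
```

Keeping the fallback as a separate callable makes it easy to swap the placeholder for an actual rule engine without touching the lookup logic.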

⚠️ Limitations & Future Work

Current Limitations

Lexicon coverage: Many words (especially names, foreign words, neologisms) rely solely on rule-based fallback
Sparse coverage: African and Timorese dialects have less lexicon data than European/Brazilian
Lexical variation: Dialect-specific vocabulary (e.g., "trem" vs "comboio") is not handled; text is assumed orthographically consistent
Regional accents: Sub-regional dialects are experimental and approximate
Prosody: Sentence-level features (liaison, phrasal stress, intonation) are simplified
Homograph disambiguation: Limited to POS-based rules; doesn't handle semantic context

🤝 Contributing

Contributions are welcome! Areas where help is especially needed:

Lexicon expansion: Especially for pt-AO, pt-MZ, pt-TL
Regional accent validation: Native speaker verification of dialectal features
Test cases: Edge cases, challenging words, dialectal examples
Documentation: Usage examples, linguistic explanations

📄 License

This project is licensed under the Apache License 2.0. See LICENSE for details.

Release history

Version	Release date
0.6.1a1 (pre-release)	Jun 20, 2026
0.6.0a1 (pre-release)	Jun 13, 2026
0.5.1a3 (pre-release)	Jun 12, 2026
0.5.1a2 (pre-release)	Jun 12, 2026
0.5.1a1 (pre-release)	Jun 12, 2026
0.5.0a2 (pre-release)	Jun 12, 2026
0.5.0a1 (pre-release)	Jun 12, 2026
0.4.0a2 (pre-release, this version)	Jun 12, 2026
0.4.0a1 (pre-release)	Jun 12, 2026
0.3.1a1 (pre-release)	Jun 12, 2026
0.3.0a1 (pre-release)	Jun 11, 2026
0.2.2a2 (pre-release)	May 29, 2026
0.2.2a1 (pre-release)	Feb 25, 2026
0.2.1	Feb 6, 2026
0.2.0	Feb 6, 2026
0.2.0a2 (pre-release)	Feb 6, 2026
0.2.0a1 (pre-release)	Feb 6, 2026
0.1.0a1 (pre-release)	Feb 6, 2026
0.0.2	Oct 12, 2025
0.0.2a1 (pre-release)	Oct 12, 2025

Download files

Source Distribution

tugaphone-0.4.0a2.tar.gz (79.1 kB view details)

Uploaded Jun 12, 2026 Source

Built Distribution

tugaphone-0.4.0a2-py3-none-any.whl (70.0 kB view details)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file tugaphone-0.4.0a2.tar.gz.

File metadata

Download URL: tugaphone-0.4.0a2.tar.gz
Upload date: Jun 12, 2026
Size: 79.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tugaphone-0.4.0a2.tar.gz
Algorithm	Hash digest
SHA256	`3564beb29e22b9fc6fe278a363dd4104d26551bc39361d3c95414f00ed11e8f2`
MD5	`cdff62426159bf736312b4fb55dbefba`
BLAKE2b-256	`65040183fe3ea38dccdfc91aac061f734b18f2da96482b9f7187006a7040cc26`

File details

Details for the file tugaphone-0.4.0a2-py3-none-any.whl.

File metadata

Download URL: tugaphone-0.4.0a2-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 70.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tugaphone-0.4.0a2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`11f469baa1908cd4ee62e036d73c0c3236ae92872c128cd590cb6cfc80da20b3`
MD5	`5f701573885b1f8218060d2aea4446eb`
BLAKE2b-256	`f99d077252effa3b4b47ed075432808bb8474c81dd208cf330c8983380843e1f`
