Dialect-aware Portuguese (Lusophone) text-to-IPA phonemizer

tugaphone — dialect-aware Portuguese phonemizer

tugaphone converts Portuguese text to IPA for five Lusophone dialect groups. It combines a curated phonetic lexicon, part-of-speech tagging for homograph disambiguation, meaning-based heterophone resolution via bifonia, and a regional-accent layer grounded in published phonology.

O gato dorme.
pt-PT → ˈu gˈa·tu ˈdoɾ·mɨ ˈ···
pt-BR → ˈu gˈa·tʊ ˈdoɾ·mɪ ˈ···
pt-AO → ˈu gˈa·tʊ ˈdoɾ·me ˈ···
pt-MZ → ˈu gˈa·tu ˈdoɾ·me ˈ···
pt-TL → ˈu gˈa·tʊ ˈdoɾ·me ˈ···

Install

pip install tugaphone

30-second quick start

from tugaphone import TugaPhonemizer

ph = TugaPhonemizer()
print(ph.phonemize_sentence("O gato dorme.", "pt-PT"))
# ˈu gˈa·tu ˈdoɾ·mɨ ˈ···

TugaPhonemizer() loads the lexicon and POS tagger once; then call phonemize_sentence(text, lang) as many times as you like. Output is a space-separated phoneme string — one token per word — with ˈ marking primary stress and · marking syllable boundaries.
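
Because the output format is regular, it can be post-processed with plain string operations. The following is a minimal sketch, not part of tugaphone: the helper names `parse_word` and `parse_sentence` are hypothetical, and the sample string is the quick-start output with its trailing punctuation token left out.

```python
STRESS = "ˈ"      # primary-stress mark used in the output
BOUNDARY = "·"    # syllable-boundary mark used in the output


def parse_word(token):
    """Split one phonemized word into (syllable, is_stressed) pairs."""
    syllables = [s for s in token.split(BOUNDARY) if s]
    return [(s.replace(STRESS, ""), STRESS in s) for s in syllables]


def parse_sentence(ipa):
    """Parse a space-separated phoneme string, one token per word."""
    return [parse_word(token) for token in ipa.split()]


sample = "ˈu gˈa·tu ˈdoɾ·mɨ"
for word in parse_sentence(sample):
    print(word)
# [('u', True)]
# [('ga', True), ('tu', False)]
# [('doɾ', True), ('mɨ', False)]
```

The stress mark can sit inside a syllable (as in gˈa), so the sketch tests for its presence anywhere in the syllable rather than only at the start.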


Features

Five dialect inventories

Code     Region
pt-PT    European Portuguese: heavy vowel reduction, post-alveolar fricatives, uvular /ʁ/
pt-BR    Brazilian Portuguese: fuller vowels, /t d/ palatalisation, l-vocalisation
pt-AO    Angolan Portuguese: moderate reduction, alveolar trill, Bantu substrate
pt-MZ    Mozambican Portuguese: similar to European with regional variation
pt-TL    Timorese Portuguese: conservative pronunciation, Tetum substrate

for code in ["pt-PT", "pt-BR", "pt-AO", "pt-MZ", "pt-TL"]:
    print(code, "→", ph.phonemize_sentence("Choveu muito ontem.", code))
# pt-PT → ʃu·ˈvew mˈũj·tu ˈõ·tẽ ˈ···
# pt-BR → ʃo·ˈvew mwˈĩ·tʊ ˈõ·tẽ ˈ···
# pt-AO → ʃo·ˈvew mˈũjn·tʊ ˈõ·tẽ ˈ···
# pt-MZ → ʃu·ˈvew mˈũj·tu ˈõ·tẽ ˈ···
# pt-TL → ʃo·ˈvew mˈuj·tʊ ˈõ·tẽ ˈ···
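
To see where two dialects diverge, the transcriptions can be compared word by word. This is a small sketch using only the pt-PT and pt-BR strings printed above (trailing punctuation token omitted); `differing_words` is a hypothetical helper, not a tugaphone function.

```python
def differing_words(a, b):
    """Return (index, word_a, word_b) for each position where two transcriptions differ."""
    pairs = zip(a.split(), b.split())
    return [(i, x, y) for i, (x, y) in enumerate(pairs) if x != y]


pt_pt = "ʃu·ˈvew mˈũj·tu ˈõ·tẽ"
pt_br = "ʃo·ˈvew mwˈĩ·tʊ ˈõ·tẽ"

for index, european, brazilian in differing_words(pt_pt, pt_br):
    print(index, european, "vs", brazilian)
```

Here the first two words differ and the third is identical in both dialects. The comparison assumes both strings have the same number of words, which holds when the same sentence is phonemized for each code.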

Homograph disambiguation

Heterophonic homographs are resolved at two levels:

  1. Meaning-based (via bifonia): sede ("thirst" vs "headquarters"), forma ("mould" vs "shape").
  2. POS-based: gosto (noun /ˈgoʃtu/ vs verb /ˈgɔʃtu/), para (preposition vs verb).

print(ph.phonemize_sentence("Eu gosto de música."))   # verb → ˈgɔʃ·tu
print(ph.phonemize_sentence("Tenho bom gosto."))      # noun → ˈgoʃ·tu
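
The POS-based level can be pictured as a lookup keyed by both spelling and part of speech. The toy table below is only an illustration of that idea, assuming simple "NOUN"/"VERB" tag names; it does not reproduce tugaphone's internal data structures, and `HOMOGRAPHS` and `pronounce` are hypothetical names.

```python
# Toy table: the two pronunciations of "gosto" are the ones shown above.
HOMOGRAPHS = {
    ("gosto", "NOUN"): "ˈgoʃ·tu",
    ("gosto", "VERB"): "ˈgɔʃ·tu",
}


def pronounce(word, pos, fallback=None):
    """Pick a pronunciation by (word, part-of-speech); return fallback if unknown."""
    return HOMOGRAPHS.get((word.lower(), pos), fallback)


print(pronounce("gosto", "VERB"))   # ˈgɔʃ·tu
print(pronounce("Gosto", "NOUN"))   # ˈgoʃ·tu
```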

Sub-regional accents

RegionalTransforms presets layer phonological rules on top of any dialect. Rules are grounded in published phonology (Cintra 1971; ALEPG). Every preset is reachable by its BCP-47 private-use code:

# Porto: stressed /o/ → [uo] (rising diphthong)
print(ph.phonemize_sentence("O vinho é muito bom.", "pt-PT-x-porto"))
# ˈu bˈi·ɲu ˈɛ mˈũj·tu bˈuõ ˈ···

# Açores: stressed /u/ → [y], l-palatalisation
print(ph.phonemize_sentence("O vinho é muito bom.", "pt-PT-x-azores"))
# ˈy vˈi·ɲu ˈɛ mˈỹj·tu bˈõ ˈ···

from tugaphone import list_dialects
print(list_dialects())   # all 20 registered codes

Available presets: NorthernDialect, PortoDialect, MinhoDialect, BragaDialect, FamalicaoDialect, FafeDialect, TrasMontanoDialect, CoimbraDialect, AlentejoDialect, AlgarveDialect, MadeiraDialect, AzoresDialect — importable from tugaphone.regional and passable as regional_dialect=, which overrides the code-derived preset.
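
The private-use codes follow the BCP-47 pattern of a base tag, the singleton `x`, and a private subtag. As a rough sketch of how such a code decomposes (this is not tugaphone's own parser, and `split_dialect_code` is a hypothetical helper):

```python
def split_dialect_code(code):
    """Split 'pt-PT-x-porto' into ('pt-PT', 'porto'); plain codes yield (code, None)."""
    base, separator, variant = code.partition("-x-")
    return base, (variant.lower() if separator else None)


print(split_dialect_code("pt-PT-x-porto"))   # ('pt-PT', 'porto')
print(split_dialect_code("pt-BR"))           # ('pt-BR', None)
```

The sketch only handles a lowercase `-x-` separator; a stricter parser would compare subtags case-insensitively, as BCP-47 allows.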

Number normalization

Digits are spelled out with gender agreement and long/short scale per dialect:

from tugaphone.number_utils import normalize_numbers

normalize_numbers("vou comprar 1 casa")   # 'vou comprar uma casa'
normalize_numbers("vou adotar 1 cão")    # 'vou adotar um cão'
normalize_numbers("comprei 2 casas")     # 'comprei duas casas'
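
Gender agreement means the spelled-out form depends on the noun that follows the digit. The toy sketch below shows the idea for 1 and 2 only. It is not how normalize_numbers is implemented: the `GENDER` table, the `CARDINALS` table, and `spell_small_numbers` are all hypothetical, and unknown nouns default to masculine.

```python
import re

# Hypothetical mini-lexicons for illustration only.
CARDINALS = {"1": {"m": "um", "f": "uma"}, "2": {"m": "dois", "f": "duas"}}
GENDER = {"casa": "f", "casas": "f", "cão": "m"}


def spell_small_numbers(text):
    """Replace '1' or '2' before a noun with the gender-agreeing cardinal."""

    def replace(match):
        digit, noun = match.group(1), match.group(2)
        gender = GENDER.get(noun.lower(), "m")  # assumption: default to masculine
        return f"{CARDINALS[digit][gender]} {noun}"

    return re.sub(r"\b([12])\s+(\w+)", replace, text)


print(spell_small_numbers("vou comprar 1 casa"))   # vou comprar uma casa
print(spell_small_numbers("comprei 2 casas"))      # comprei duas casas
```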

Syllabification and stress

Syllabification is handled by silabificador, registered as an orthography2ipa syllabifier plugin. Stress detection delegates to orthography2ipa's declarative StressRules.

Rules-only mode

Pass a dialect inventory whose IRREGULAR_WORDS has been emptied to bypass the lexicon and use only grapheme rules. This is useful for testing rule coverage or synthesising unknown words.

orthography2ipa plugin interface

TugaphoneG2PPlugin implements orthography2ipa's G2PPlugin interface; SilabificadorSyllabifier implements its SyllabifierPlugin interface and is registered at the orthography2ipa.syllabify entry point.

from tugaphone.plugin import TugaphoneG2PPlugin

p = TugaphoneG2PPlugin(lang="pt-BR")
print(p.transcribe("o gato dorme"))   # ˈu gˈa·tʊ ˈdoɾ·mɪ

Sibling libraries

tugaphone is part of the TigreGotico Portuguese NLP stack:

Library            Role
tugalex            Phonetic lexicon
tugatagger         POS tagger
silabificador      Syllabifier
bifonia            Heterophone sense disambiguation
orthography2ipa    G2P plugin base + stress rules

Documentation


License

Apache License 2.0. See LICENSE.
