Portuguese heterophonic homograph disambiguation (rule-based + learned)

bifonia

Pronunciation disambiguation for European-Portuguese heterophonic homographs — words spelled identically whose pronunciation depends on meaning.

sede is thirst (ˈsedɨ, closed e) or a headquarters (ˈsɛdɨ, open e); forma is a mould (ˈfoɾmɐ) or a shape (ˈfɔɾmɐ); molho is sauce (ˈmoʎu) or a bundle (ˈmɔʎu). A text-to-speech front-end that guesses wrong says the wrong word out loud. bifonia picks the right reading — and therefore the right IPA — from context.

from bifonia import tokenize, is_ambiguous, guess_sense, disambiguate

words = tokenize("Tinha tanta sede que bebi a garrafa toda.")
i = words.index("sede")
guess_sense(words, i)    # 'thirst'
disambiguate(words, i)   # 'ˈsedɨ'

Why meaning, not part of speech

The obvious approach, tagging the part of speech and picking the pronunciation from it, cannot work when two readings share a POS. The thirst and headquarters (seat) readings of sede are both nouns; the cut and court readings of corte are both nominal; so are the mould and shape readings of forma. A POS tagger labels the two readings identically, so a POS-keyed lookup is wrong on the minority reading by construction. bifonia keys every reading on sense (a meaning slug) and resolves the meaning directly.
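A minimal sketch of the problem, using the sede readings quoted above. The table layout, the helper names, and the "headquarters" slug are illustrative assumptions, not bifonia's internal data structures.

```python
# Readings taken from the examples above; both are nouns, as the text says.
# The "headquarters" slug and this list layout are assumptions for illustration.
READINGS = [
    {"word": "sede", "sense": "thirst", "pos": "NOUN", "ipa": "ˈsedɨ"},
    {"word": "sede", "sense": "headquarters", "pos": "NOUN", "ipa": "ˈsɛdɨ"},
]


def ipa_by_pos(word, pos):
    """Return every IPA reading compatible with a (word, POS) pair."""
    return [r["ipa"] for r in READINGS if r["word"] == word and r["pos"] == pos]


def ipa_by_sense(word, sense):
    """Return every IPA reading compatible with a (word, sense) pair."""
    return [r["ipa"] for r in READINGS if r["word"] == word and r["sense"] == sense]


pos_candidates = ipa_by_pos("sede", "NOUN")        # POS leaves two candidates
sense_candidates = ipa_by_sense("sede", "thirst")  # sense leaves exactly one
print(len(pos_candidates), len(sense_candidates))
```

A lookup keyed on POS still has to guess between two pronunciations, while a lookup keyed on sense does not.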

Two interchangeable engines

| engine | needs a corpus? | how it decides |
|---|---|---|
| rules | no | hand-written context rules + wordlists (.voc) |
| learned | yes | per-word Naive-Bayes / averaged-perceptron over context features |
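As a rough illustration of what a wordlist-backed context rule can look like, the sketch below scores each sense by how many of its wordlist terms appear near the target word. The wordlists, the window size, and the name rule_sense are hypothetical; bifonia's shipped rules may work differently.

```python
# Hypothetical wordlists; in bifonia such lists would live in .voc files.
CONTEXT_WORDS = {
    "thirst": {"bebi", "beber", "água", "garrafa", "tanta"},
    "headquarters": {"empresa", "partido", "clube", "edifício"},
}


def rule_sense(words, i, window=5, default=None):
    """Pick the sense whose wordlist overlaps most with the nearby context."""
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    # Context = lowercased neighbours inside the window, excluding the target.
    context = {w.lower() for j, w in enumerate(words[lo:hi], start=lo) if j != i}
    scores = {sense: len(context & vocab) for sense, vocab in CONTEXT_WORDS.items()}
    best = max(scores, key=scores.get)
    # With no evidence at all, fall back to the caller's default.
    return best if scores[best] > 0 else default


words = "Tinha tanta sede que bebi a garrafa toda".split()
sense = rule_sense(words, words.index("sede"))
print(sense)
```

Nothing in this sketch needs a training corpus: the only language-specific input is the wordlists.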

The rule engine is self-contained and needs no training data, which makes it the right fit for a fork targeting a low-resource language. The learned models are trained from the labelled corpus and generalise better where enough data exists. guess_sense uses a per-word ensemble: each word is served by whichever engine scores at least as well on held-out data, so on that data the combined system never does worse than the rules alone. Both engines are pure Python with no heavy runtime dependencies.
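The per-word selection can be sketched as follows. The accuracies are made up, the function name choose_engines is hypothetical, and the tie-breaking rule (ties go to the learned model) is an assumption; the text only says the chosen engine scores at least as well.

```python
def choose_engines(rules_acc, learned_acc):
    """Map each word to the engine with the better held-out accuracy.

    Assumption: ties go to the learned model. Words without a learned
    score stay on the rules.
    """
    choice = {}
    for word, rules_score in rules_acc.items():
        learned_score = learned_acc.get(word)
        use_learned = learned_score is not None and learned_score >= rules_score
        choice[word] = "learned" if use_learned else "rules"
    return choice


# Made-up held-out accuracies, for illustration only.
rules_acc = {"sede": 0.93, "forma": 0.97, "molho": 0.90}
learned_acc = {"sede": 0.99, "forma": 0.95, "molho": 0.90}

engines = choose_engines(rules_acc, learned_acc)
print(engines)
```

Because each word keeps the rules whenever the learned model scores lower, the per-word held-out accuracy of the combination is never below that of the rules.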

The learned engine ships in two forms, a Naive-Bayes model and an averaged perceptron, and guess_sense loads the averaged perceptron by default. The perceptron beats Naive-Bayes on both the synthetic and the real-world benchmarks and avoids the per-word collapses that Naive-Bayes suffers when correlated features violate its independence assumption. It is warm-started from the Naive-Bayes log-odds, so its weights stay readable as per-sense lexicons. To compare, load the other model explicitly with SenseModel.load(path).
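The warm start described above can be sketched for a single two-sense word. This is a simplified binary version with bag-of-words features, no bias term, and toy data; bifonia's real features, multi-sense handling, and update schedule are not shown in this README, so every name here is an assumption.

```python
import math
from collections import Counter, defaultdict


def nb_log_odds(examples, pos_label):
    """Per-feature log P(f | pos) - log P(f | neg), with add-one smoothing."""
    counts = {True: Counter(), False: Counter()}
    for feats, label in examples:
        counts[label == pos_label].update(feats)
    vocab = set(counts[True]) | set(counts[False])
    totals = {k: sum(c.values()) + len(vocab) for k, c in counts.items()}
    return {
        f: math.log((counts[True][f] + 1) / totals[True])
        - math.log((counts[False][f] + 1) / totals[False])
        for f in vocab
    }


def train_averaged_perceptron(examples, pos_label, epochs=5):
    """Binary perceptron warm-started from log-odds; returns averaged weights."""
    weights = defaultdict(float, nb_log_odds(examples, pos_label))
    sums = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for feats, label in examples:
            y = 1 if label == pos_label else -1
            score = sum(weights[f] for f in feats)
            if y * score <= 0:            # mistake: move weights toward y
                for f in feats:
                    weights[f] += y
            for f, w in weights.items():  # accumulate for the running average
                sums[f] += w
            steps += 1
    return {f: total / steps for f, total in sums.items()}


def predict(weights, feats, pos_label, neg_label):
    score = sum(weights.get(f, 0.0) for f in feats)
    return pos_label if score > 0 else neg_label


# Toy training data: context-word sets labelled with a sense.
data = [
    ({"bebi", "água"}, "thirst"),
    ({"tanta", "calor"}, "thirst"),
    ({"empresa", "nova"}, "headquarters"),
    ({"partido", "lisboa"}, "headquarters"),
]
model = train_averaged_perceptron(data, pos_label="thirst")
print(predict(model, {"bebi"}, "thirst", "headquarters"))
```

Since the weights start as log-odds and are only nudged on mistakes, positive weights still read as a lexicon for one sense and negative weights as a lexicon for the other.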

Accuracy

Sense prediction, measured two ways:

| approach | synthetic test | real-world (OOD) |
|---|---|---|
| most-common baseline | 52.7% | 47.5% |
| spaCy POS → sense | 65.7% | 81.4% |
| Stanza POS → sense | 75.5% | 82.5% |
| rules (no corpus) | 94.5% | 84.6% |
| Naive-Bayes | 98.1% | 86.7% |
| averaged perceptron | 99.0% | 89.6% |
| shipped ensemble | 96.1% | 90.5% |

The synthetic column is the held-out split of the generated training corpus, balanced across senses; the OOD column is real sentences from bifonia-pt-homographs-wild. The two columns answer different questions. Because the synthetic set is balanced, it exposes how badly POS tagging handles minority readings: a tagger cannot separate two senses that share a part of speech, and both spaCy and Stanza score 0% on sede/thirst. Real text is skewed toward the majority readings that POS taggers do get right, which lifts them to ~82%. The meaning-aware models still win there: the perceptron beats the better tagger by ~7 points (89.6% vs 82.5%), and the shipped ensemble scores highest at 90.5%. Reproduce with python benchmark_tagger.py (synthetic) and python benchmark_ood.py (OOD).
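The effect of balancing can be seen with a toy calculation. The picker below stands in for a POS-based system that can only ever return one of two same-POS senses; the sense counts are made up and are not taken from either dataset.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def always_majority(gold, majority="headquarters"):
    # Stand-in for a POS-keyed picker: both senses are NOUN, so it can only
    # ever return one of them and scores 0% on the other.
    return [majority] * len(gold)


balanced = ["thirst"] * 50 + ["headquarters"] * 50
skewed = ["thirst"] * 10 + ["headquarters"] * 90  # made-up real-text ratio

balanced_acc = accuracy(balanced, always_majority(balanced))
skewed_acc = accuracy(skewed, always_majority(skewed))
print(balanced_acc, skewed_acc)
```

The same picker looks far better on the skewed set even though it still never gets the minority reading right, which is why the two benchmark columns should be read separately.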

Install

pip install -e . --no-deps

No dependencies — pure standard library.

API

from bifonia import (tokenize, is_ambiguous, guess_sense, guess_pos,
                     disambiguate, add_extra_diacritics)

sentence = "Resolveu o problema desta forma simples."
words = tokenize(sentence)
i = words.index("forma")

guess_sense(words, i)              # 'shape'
guess_pos(words, i)                # 'NOUN'   (descriptive)
disambiguate(words, i)             # 'ˈfɔɾmɐ'
disambiguate(words, i, sense="mould")   # 'ˈfoɾmɐ'  (override)
add_extra_diacritics(sentence)     # '...desta fórma simples.'  (acute = open vowel)

add_extra_diacritics rewrites each homograph with a disambiguating diacritic (acute → open vowel, circumflex → closed) that a downstream grapheme-to-phoneme stage can read directly.
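On the consuming side, a grapheme-to-phoneme stage could read the mark back with a few lines of standard-library code. This is a sketch under the convention stated above (acute = open, circumflex = closed); the function name vowel_quality is hypothetical, and the form fôrma is assumed by analogy with fórma. It is meant for tokens produced by add_extra_diacritics, not for ordinary accented Portuguese words.

```python
import unicodedata

ACUTE = "\u0301"       # combining acute accent
CIRCUMFLEX = "\u0302"  # combining circumflex accent


def vowel_quality(word):
    """Return 'open', 'closed', or None from the first acute/circumflex mark."""
    # NFD splits an accented letter into base letter + combining mark.
    for ch in unicodedata.normalize("NFD", word):
        if ch == ACUTE:
            return "open"
        if ch == CIRCUMFLEX:
            return "closed"
    return None


quality = vowel_quality("fórma")
print(quality)
```

Other marks, such as the cedilla in começo, decompose to different combining characters and are ignored by this check.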

Datasets

Both are on the Hugging Face Hub and share the schema {word, sense, pos, ipa, sentence}:

  • bifonia-pt-homographs — 56,891 labelled sentences over 27 words, with stratified train/test splits, for training and synthetic evaluation.
  • bifonia-pt-homographs-wild — real Wikipedia and web sentences, an out-of-distribution test set.
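A record in that schema can be checked and tallied with the standard library alone. The two records below are built from the example sentences in this README and are illustrative; the JSON-lines layout and the parse helper are assumptions, not a documented loader.

```python
import json
from collections import Counter

# Illustrative records in the documented schema, built from README examples.
lines = [
    '{"word": "sede", "sense": "thirst", "pos": "NOUN", "ipa": "ˈsedɨ", '
    '"sentence": "Tinha tanta sede que bebi a garrafa toda."}',
    '{"word": "forma", "sense": "shape", "pos": "NOUN", "ipa": "ˈfɔɾmɐ", '
    '"sentence": "Resolveu o problema desta forma simples."}',
]
FIELDS = {"word", "sense", "pos", "ipa", "sentence"}


def parse(line):
    """Decode one JSON record and check that every schema field is present."""
    record = json.loads(line)
    missing = FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


records = [parse(line) for line in lines]
# Count examples per (word, sense) pair, e.g. to inspect class balance.
per_sense = Counter((r["word"], r["sense"]) for r in records)
print(per_sense)
```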

Word coverage

27 homographs: acordo, acerto, cerro, choro, colher, começo, conserto, coro, corte, forma, gosto, gozo, jogo, molho, olho, para, pelo, peso, porto, posto, rego, seco, sede, sobre, tola, torre, transtorno.

Per-word IPA, senses, and diacritized forms are in docs/words.md.

Project layout

  • bifonia/data/corpus.jsonl — the labelled corpus (single source of truth).
  • bifonia/data/heterophonic_homographs.csv — the word,sense,pos,ipa table.
  • bifonia/data/sense_model_{nb,perceptron}.json — trained models (JSON weights).
  • bifonia/locale/<lang>/*.voc — context wordlists, one term per line, editable.
  • bifonia/features.py — language-agnostic feature extraction (shared by train and inference).

Porting to a related language means supplying a corpus and .voc files and retraining — the algorithm carries no hardcoded Portuguese.
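For a port, the wordlists are the part most likely to be edited by hand. A loader for the one-term-per-line format might look like this sketch; skipping blank lines and lowercasing are assumptions, and the file name is hypothetical.

```python
import tempfile
from pathlib import Path


def load_voc(path):
    """Read one term per line; blank lines are skipped, terms are lowercased."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


# Write a small throwaway wordlist and read it back.
with tempfile.TemporaryDirectory() as tmp:
    voc = Path(tmp) / "sede_thirst.voc"  # hypothetical file name
    voc.write_text("água\nbeber\n\nGarrafa\n", encoding="utf-8")
    terms = load_voc(voc)
print(sorted(terms))
```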
