Portuguese heterophonic homograph disambiguation (rule-based + learned)

Project description

bifonia

Pronunciation disambiguation for European-Portuguese heterophonic homographs — words spelled identically whose pronunciation depends on meaning.

sede is thirst (ˈsedɨ, closed e) or a headquarters (ˈsɛdɨ, open e); forma is a mould (ˈfoɾmɐ) or a shape (ˈfɔɾmɐ); molho is sauce (ˈmoʎu) or a bundle (ˈmɔʎu). A text-to-speech front-end that guesses wrong says the wrong word out loud. bifonia picks the right reading — and therefore the right IPA — from context.

from bifonia import tokenize, is_ambiguous, guess_sense, disambiguate

words = tokenize("Tinha tanta sede que bebi a garrafa toda.")
i = words.index("sede")
guess_sense(words, i)    # 'thirst'
disambiguate(words, i)   # 'ˈsedɨ'

Why meaning, not part of speech

The obvious approach, tagging the part of speech and picking the pronunciation from it, cannot work when two readings share a POS. sede (thirst vs. headquarters) is a noun in both readings; so are corte (cut vs. court) and forma (mould vs. shape). A POS tagger labels the two readings identically, so any system built on it is wrong on one of them by construction. bifonia instead keys every reading on its sense (a meaning slug) and resolves the meaning directly.
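To make the collision concrete, here is a minimal sketch of why a lookup keyed on part of speech cannot hold both readings while one keyed on sense can. The table below is illustrative and is not bifonia's data file; the IPA comes from the introduction, and the "headquarters" slug is an assumed name.

```python
# Illustrative only: this table is hypothetical, not bifonia's data.
READINGS = [
    # (word, sense, pos, ipa)
    ("sede", "thirst", "NOUN", "ˈsedɨ"),
    ("sede", "headquarters", "NOUN", "ˈsɛdɨ"),  # slug name is an assumption
]

by_pos = {}
by_sense = {}
for word, sense, pos, ipa in READINGS:
    by_pos[(word, pos)] = ipa       # the second NOUN entry overwrites the first
    by_sense[(word, sense)] = ipa   # each sense keeps its own entry

print(len(by_pos))    # 1
print(len(by_sense))  # 2
```

With the POS key, one pronunciation is lost before any context is consulted; with the sense key, both survive and the remaining problem is choosing the sense.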

Two interchangeable engines

| engine | needs a corpus? | how it decides |
|---|---|---|
| rules | no | hand-written context rules + wordlists (.voc) |
| learned | yes | per-word Naive-Bayes / averaged-perceptron over context features |

The rule engine is self-contained and needs no training data, which makes it the right fit for a fork targeting a low-resource language. The learned models are trained from the labelled corpus and generalise better where enough data exists. guess_sense uses a per-word ensemble: each word is served by whichever engine scores at least as well on held-out data, so on that held-out data the combined system never does worse than the rules alone. Both engines are pure Python with no heavy runtime dependencies.
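The per-word selection can be sketched as follows. This is an assumption-laden illustration, not bifonia's code: the function name, the accuracy figures, and the tie-breaking choice (ties go to the learned model, words without a learned score fall back to the rules) are all hypothetical.

```python
def choose_engines(rules_acc, learned_acc):
    """Pick an engine per word from held-out accuracies (hypothetical inputs)."""
    choice = {}
    for word, rules_score in rules_acc.items():
        learned_score = learned_acc.get(word)
        # Assumption: use the learned model only when it scores at least as
        # well as the rules; otherwise keep the rules.
        if learned_score is not None and learned_score >= rules_score:
            choice[word] = "learned"
        else:
            choice[word] = "rules"
    return choice

# Invented held-out accuracies, for illustration only.
rules_acc = {"sede": 0.95, "forma": 0.90, "molho": 0.97}
learned_acc = {"sede": 0.99, "forma": 0.88}

print(choose_engines(rules_acc, learned_acc))
# {'sede': 'learned', 'forma': 'rules', 'molho': 'rules'}
```

Because each word keeps the better of its two held-out scores, the combined held-out accuracy cannot fall below that of the rules alone. The guarantee says nothing about other test sets.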

The learned engine ships in two forms, a Naive-Bayes model and an averaged perceptron, and guess_sense loads the averaged perceptron by default. The perceptron beats Naive-Bayes on both the synthetic and the real-world benchmarks (99.0% vs 98.1%, and 89.6% vs 86.7%) and avoids the per-word collapses that Naive-Bayes suffers when correlated features violate its independence assumption. It is warm-started from the Naive-Bayes log-odds, so its weights stay readable as per-sense lexicons. To compare, load the other model explicitly with SenseModel.load(path).
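As a rough illustration of warm-starting, here is a small averaged perceptron over bag-of-words features whose weights start from a supplied table instead of zero. It is a sketch under stated assumptions, not bifonia's trainer: the function names, the toy examples, and the init values (standing in for Naive-Bayes log-odds) are invented.

```python
from collections import defaultdict

def train_averaged_perceptron(examples, senses, init=None, epochs=5):
    """examples: list of (features, gold_sense); init: {sense: {feature: weight}}."""
    w = {s: defaultdict(float, (init or {}).get(s, {})) for s in senses}
    total = {s: defaultdict(float) for s in senses}
    steps = 0
    for _ in range(epochs):
        for feats, gold in examples:
            pred = max(senses, key=lambda s: sum(w[s].get(f, 0.0) for f in feats))
            if pred != gold:
                # Standard perceptron update: reward gold, penalise the wrong guess.
                for f in feats:
                    w[gold][f] += 1.0
                    w[pred][f] -= 1.0
            # Accumulate the current weights after every example (simple, not fast).
            for s in senses:
                for f, v in w[s].items():
                    total[s][f] += v
            steps += 1
    return {s: {f: v / steps for f, v in total[s].items()} for s in senses}

def predict(weights, feats):
    return max(weights, key=lambda s: sum(weights[s].get(f, 0.0) for f in feats))

# Toy data and stand-in warm-start weights, all hypothetical.
examples = [
    (["bebi", "tanta"], "thirst"),
    (["empresa", "nova"], "headquarters"),
    (["nova", "lisboa"], "headquarters"),
]
init = {"thirst": {"bebi": 1.2}, "headquarters": {"empresa": 0.9}}
weights = train_averaged_perceptron(examples, ["thirst", "headquarters"], init=init)

print(predict(weights, ["lisboa"]))  # headquarters
print(predict(weights, ["bebi"]))    # thirst
```

The returned weights remain a per-sense mapping from context word to score, which is what makes them inspectable as a lexicon.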

Accuracy

Sense prediction, measured two ways:

| approach | synthetic test | real-world (OOD) |
|---|---|---|
| most-common baseline | 52.7% | 47.5% |
| spaCy POS → sense | 65.7% | 81.4% |
| Stanza POS → sense | 75.5% | 82.5% |
| rules (no corpus) | 94.5% | 84.6% |
| Naive-Bayes | 98.1% | 86.7% |
| averaged perceptron | 99.0% | 89.6% |
| shipped ensemble | 96.1% | 90.5% |

The synthetic column is the held-out split of the generated training corpus, balanced across senses; the OOD column is real sentences from bifonia-pt-homographs-wild. The two columns answer different questions. The synthetic set is balanced, so it exposes how badly POS tagging handles minority readings: a tagger cannot separate two senses that share a part of speech, and both spaCy and Stanza score 0% on sede/thirst. Real text is skewed toward the majority readings that POS taggers do get right, which lifts them to about 82%. The meaning-aware models still win: on the OOD set the perceptron beats the better tagger (Stanza) by about 7 points and the shipped ensemble beats it by 8. Reproduce with python benchmark_tagger.py (synthetic) and python benchmark_ood.py (OOD).
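The effect of sense balance on a tagger-based system can be worked through with a small calculation. The per-sense accuracies and the 90/10 split below are invented for illustration and are not measurements from either benchmark.

```python
def accuracy(per_sense_acc, sense_share):
    """Overall accuracy = sum over senses of (share of test items) * (accuracy on that sense)."""
    return sum(sense_share[s] * per_sense_acc[s] for s in per_sense_acc)

# Hypothetical tagger-based system for one word: always right on the majority
# sense, always wrong on the minority sense that shares its part of speech.
per_sense_acc = {"majority": 1.0, "minority": 0.0}

balanced = accuracy(per_sense_acc, {"majority": 0.5, "minority": 0.5})
skewed = accuracy(per_sense_acc, {"majority": 0.9, "minority": 0.1})

print(balanced, skewed)  # 0.5 0.9
```

The system itself is unchanged between the two runs; only the mix of senses in the test set moves its score.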

Install

pip install -e . --no-deps

Run this from a source checkout. bifonia has no dependencies; it uses only the standard library.

API

from bifonia import (tokenize, is_ambiguous, guess_sense, guess_pos,
                     disambiguate, add_extra_diacritics)

sentence = "Resolveu o problema desta forma simples."
words = tokenize(sentence)
i = words.index("forma")

guess_sense(words, i)              # 'shape'
guess_pos(words, i)                # 'NOUN'   (descriptive)
disambiguate(words, i)             # 'ˈfɔɾmɐ'
disambiguate(words, i, sense="mould")   # 'ˈfoɾmɐ'  (override)
add_extra_diacritics(sentence)     # '...desta fórma simples.'  (acute = open vowel)

add_extra_diacritics rewrites each homograph with a disambiguating diacritic (acute → open vowel, circumflex → closed) that a downstream grapheme-to-phoneme stage can read directly.
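A downstream stage could read that convention roughly as sketched below. The helper vowel_quality is hypothetical and not part of bifonia; it only detects the two marks by decomposing the word with the standard library's unicodedata module.

```python
import unicodedata

ACUTE = "\u0301"       # combining acute accent
CIRCUMFLEX = "\u0302"  # combining circumflex accent

def vowel_quality(word):
    """Return 'open', 'closed', or None from the first acute/circumflex mark found."""
    for ch in unicodedata.normalize("NFD", word):
        if ch == ACUTE:
            return "open"
        if ch == CIRCUMFLEX:
            return "closed"
    return None

print(vowel_quality("fórma"))  # open
print(vowel_quality("fôrma"))  # closed
print(vowel_quality("forma"))  # None
```

Ordinary Portuguese spelling also uses these accents on unambiguous words, so a real grapheme-to-phoneme stage would presumably apply such a check only to known homographs.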

Datasets

Both are on the Hugging Face Hub and share the schema {word, sense, pos, ipa, sentence}:

  • bifonia-pt-homographs — 56,891 labelled sentences over 27 words, with stratified train/test splits, for training and synthetic evaluation.
  • bifonia-pt-homographs-wild — real Wikipedia and web sentences, an out-of-distribution test set.
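For orientation, records in that schema might look like the two below. They are hand-written from the examples in this README, not copied from either dataset, and the JSON Lines serialisation is an assumption.

```python
import json
from collections import Counter

FIELDS = {"word", "sense", "pos", "ipa", "sentence"}

# Hand-written sample records in the documented schema (not real dataset rows).
lines = [
    '{"word": "sede", "sense": "thirst", "pos": "NOUN", "ipa": "ˈsedɨ", '
    '"sentence": "Tinha tanta sede que bebi a garrafa toda."}',
    '{"word": "forma", "sense": "shape", "pos": "NOUN", "ipa": "ˈfɔɾmɐ", '
    '"sentence": "Resolveu o problema desta forma simples."}',
]

records = [json.loads(line) for line in lines]

# Check every record has exactly the documented fields.
for record in records:
    assert set(record) == FIELDS

# Count labelled sentences per (word, sense) pair.
counts = Counter((r["word"], r["sense"]) for r in records)
print(counts[("sede", "thirst")])  # 1
```

The same per-sense count, run over a full split, is a quick way to see how balanced or skewed that split is.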

Word coverage

27 homographs: acordo, acerto, cerro, choro, colher, começo, conserto, coro, corte, forma, gosto, gozo, jogo, molho, olho, para, pelo, peso, porto, posto, rego, seco, sede, sobre, tola, torre, transtorno.

Per-word IPA, senses, and diacritized forms are in docs/words.md.

Project layout

  • bifonia/data/corpus.jsonl — the labelled corpus (single source of truth).
  • bifonia/data/heterophonic_homographs.csv — the word,sense,pos,ipa table.
  • bifonia/data/sense_model_{nb,perceptron}.json — trained models (JSON weights).
  • bifonia/locale/<lang>/*.voc — context wordlists, one term per line, editable.
  • bifonia/features.py — language-agnostic feature extraction (shared by train and inference).

Porting to a related language means supplying a corpus and .voc files and retraining — the algorithm carries no hardcoded Portuguese.
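A context wordlist and a rule that consults it might look like the sketch below. Only the one-term-per-line format comes from the layout description above; the function names, the cue words, and the overlap-counting rule are assumptions and do not describe bifonia's actual rule engine.

```python
def parse_voc(text):
    """Parse a wordlist: one term per line, blank lines ignored, lowercased."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def rule_sense(words, i, cues, default):
    """Pick the sense whose cue list overlaps most with the words around position i."""
    context = {w.lower() for j, w in enumerate(words) if j != i}
    best = max(cues, key=lambda sense: len(cues[sense] & context))
    # Fall back to the default sense when no cue word appears at all.
    return best if cues[best] & context else default

# Hypothetical cue lists, as they might appear in two .voc files.
cues = {
    "thirst": parse_voc("bebi\nágua\nbeber\n"),
    "headquarters": parse_voc("empresa\npartido\nedifício\n"),
}

words = ["Tinha", "tanta", "sede", "que", "bebi", "a", "garrafa", "toda"]
print(rule_sense(words, 2, cues, default="headquarters"))  # thirst
```

Under this sketch, porting to another language would mean swapping the cue lists while leaving both functions unchanged.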
