Portuguese heterophonic homograph disambiguation (rule-based + learned)
bifonia
Pronunciation disambiguation for European-Portuguese heterophonic homographs — words spelled identically whose pronunciation depends on meaning.
sede is thirst (ˈsedɨ, closed e) or a headquarters (ˈsɛdɨ, open e); forma
is a mould (ˈfoɾmɐ) or a shape (ˈfɔɾmɐ); molho is sauce (ˈmoʎu) or a
bundle (ˈmɔʎu). A text-to-speech front-end that guesses wrong says the wrong word
out loud. bifonia picks the right reading — and therefore the right IPA — from context.
```python
from bifonia import tokenize, is_ambiguous, guess_sense, disambiguate

words = tokenize("Tinha tanta sede que bebi a garrafa toda.")
i = words.index("sede")
guess_sense(words, i)    # 'thirst'
disambiguate(words, i)   # 'ˈsedɨ'
```
Why meaning, not part of speech
The obvious approach (tag the part of speech, then pick the pronunciation from it)
cannot work when two readings share a POS. sede as "thirst" and as "seat" (headquarters)
are both nouns; so are both readings of corte (cut, court) and of forma (mould, shape).
A POS tagger labels each pair identically, so it is wrong on the minority reading by
construction. bifonia keys every reading on its sense (a meaning slug) and resolves the
meaning directly.
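A minimal sketch of why the key matters, using only the sede readings quoted above. The tuple layout, the function names, and the `seat` slug are assumptions made for illustration; they are not the package's internal data format.

```python
# Hypothetical mini-lexicon: (word, sense, pos, ipa). Illustrative only.
READINGS = [
    ("sede", "thirst", "NOUN", "ˈsedɨ"),
    ("sede", "seat", "NOUN", "ˈsɛdɨ"),  # 'seat' slug is an assumption
]

def ipa_candidates_by_pos(word, pos):
    """Return every IPA string that a POS-keyed lookup cannot tell apart."""
    return sorted({ipa for w, _, p, ipa in READINGS if w == word and p == pos})

def ipa_by_sense(word, sense):
    """Return the single IPA string for (word, sense), or None if not unique."""
    matches = [ipa for w, s, _, ipa in READINGS if w == word and s == sense]
    return matches[0] if len(matches) == 1 else None

pos_result = ipa_candidates_by_pos("sede", "NOUN")
sense_result = ipa_by_sense("sede", "thirst")
print(pos_result)    # two candidates remain: the POS tag alone does not decide
print(sense_result)  # exactly one reading
```

The POS-keyed lookup returns both pronunciations, so something else would still have to choose; the sense-keyed lookup has nothing left to choose.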
Two interchangeable engines
| engine | needs a corpus? | how it decides |
|---|---|---|
| rules | no | hand-written context rules + wordlists (.voc) |
| learned | yes | per-word Naive-Bayes / averaged-perceptron over context features |
The rule engine is self-contained and needs no training data, which makes it the right fit
for a fork that targets a low-resource language. The learned models are trained from the
labelled corpus and generalise better where enough data exists. guess_sense uses a per-word
ensemble: each word is served by whichever engine scores at least as well on held-out data,
so on that data the combined system never does worse than the rules alone. Both engines are
pure Python with no heavy runtime dependencies.
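The per-word selection can be sketched as follows. This is an illustration under assumptions, not the shipped routing code: the accuracy figures are placeholders rather than measured results, the tie-break in favour of the learned engine is a guess, and `choose_engines` is a hypothetical name.

```python
# Hypothetical held-out accuracies per word: placeholders, NOT measured results.
HELD_OUT = {
    "sede":  {"rules": 0.90, "learned": 0.95},
    "forma": {"rules": 0.92, "learned": 0.88},
    "molho": {"rules": 0.85, "learned": 0.85},
}

def choose_engines(scores):
    """Pick one engine per word; the learned model must at least tie the rules."""
    return {
        word: "learned" if acc["learned"] >= acc["rules"] else "rules"
        for word, acc in scores.items()
    }

def mean(values):
    values = list(values)
    return sum(values) / len(values)

routing = choose_engines(HELD_OUT)
# Unweighted mean over words, a simplification of a real benchmark.
ensemble_acc = mean(HELD_OUT[w][engine] for w, engine in routing.items())
rules_acc = mean(acc["rules"] for acc in HELD_OUT.values())
print(routing)
print(ensemble_acc >= rules_acc)
```

Because each word keeps the better of its two scores, the guarantee holds on the data used for selection; it does not by itself promise the same ordering on unseen text.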
The learned engine ships in two forms, a Naive-Bayes model and an averaged perceptron,
and guess_sense loads the averaged perceptron by default. The perceptron outscores Naive-Bayes
on both the synthetic and the real-world benchmarks (99.0% vs 98.1%, and 89.6% vs 86.7%) and
avoids the per-word collapses that Naive-Bayes suffers when correlated features violate its
independence assumption. It is warm-started from the Naive-Bayes log-odds, so its weights stay
readable as per-sense lexicons. To compare, load the Naive-Bayes model explicitly with
SenseModel.load(path).
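To make the warm start concrete, here is a small sketch of an averaged perceptron initialised from smoothed Naive-Bayes log-odds. Everything in it is an assumption for illustration: the bag-of-context-words features, the add-one smoothing, the one-vs-rest log-odds, the unit update step, the `seat` slug, and the toy examples. The package's own features and training loop may differ.

```python
import math
from collections import Counter, defaultdict

def nb_log_odds(examples, senses):
    """Smoothed one-vs-rest log-odds: log P(f|sense) - log P(f|other senses)."""
    counts = {s: Counter() for s in senses}
    for feats, sense in examples:
        counts[sense].update(feats)
    vocab = {f for c in counts.values() for f in c}
    weights = {s: {} for s in senses}
    for s in senses:
        rest = Counter()
        for other in senses:
            if other != s:
                rest.update(counts[other])
        n_s = sum(counts[s].values()) + len(vocab)
        n_rest = sum(rest.values()) + len(vocab)
        for f in vocab:
            weights[s][f] = (math.log((counts[s][f] + 1) / n_s)
                             - math.log((rest[f] + 1) / n_rest))
    return weights

def predict(weights, feats):
    """Return the sense with the highest summed feature weight."""
    return max(sorted(weights),
               key=lambda s: sum(weights[s].get(f, 0.0) for f in feats))

def train_averaged_perceptron(examples, senses, epochs=5):
    weights = nb_log_odds(examples, senses)        # warm start
    totals = {s: defaultdict(float) for s in senses}
    steps = 0
    for _ in range(epochs):
        for feats, gold in examples:
            guess = predict(weights, feats)
            if guess != gold:                      # standard perceptron update
                for f in feats:
                    weights[gold][f] = weights[gold].get(f, 0.0) + 1.0
                    weights[guess][f] = weights[guess].get(f, 0.0) - 1.0
            for s in senses:                       # accumulate for averaging
                for f, w in weights[s].items():
                    totals[s][f] += w
            steps += 1
    return {s: {f: t / steps for f, t in totals[s].items()} for s in senses}

# Toy, hand-made context features for 'sede' (illustrative only).
EXAMPLES = [
    ({"bebi", "garrafa"}, "thirst"),
    ({"água", "tanta"}, "thirst"),
    ({"empresa", "lisboa"}, "seat"),
    ({"partido", "empresa"}, "seat"),
]
model = train_averaged_perceptron(EXAMPLES, ["thirst", "seat"])
print(predict(model, {"bebi"}))
print(predict(model, {"empresa"}))
```

Each sense ends up with a plain feature-to-weight dictionary, so sorting `model["thirst"]` by weight lists the context words that most favour that reading, which is what "readable as per-sense lexicons" refers to.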
Accuracy
Sense prediction, measured two ways:
| approach | synthetic test | real-world (OOD) |
|---|---|---|
| most-common baseline | 52.7% | 47.5% |
| spaCy POS → sense | 65.7% | 81.4% |
| Stanza POS → sense | 75.5% | 82.5% |
| rules (no corpus) | 94.5% | 84.6% |
| Naive-Bayes | 98.1% | 86.7% |
| averaged perceptron | 99.0% | 89.6% |
| shipped ensemble | 96.1% | 90.5% |
The synthetic column is the held-out split of the generated training corpus, balanced
across senses; the OOD column is real sentences from
bifonia-pt-homographs-wild.
The two columns answer different questions. The synthetic set is balanced, so it exposes
how badly POS tagging handles minority readings: a tagger cannot separate two senses that
share a part of speech, and spaCy and Stanza both score 0% on sede/thirst. Real text is
skewed toward the majority readings that POS taggers do get right, which lifts them to ~82%.
The meaning-aware models still win: the perceptron beats the better tagger by ~7 points
(89.6% vs 82.5%), and the shipped ensemble scores highest at 90.5%. Reproduce with
python benchmark_tagger.py (synthetic) and python benchmark_ood.py (OOD).
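The effect of class balance on a predictor that cannot separate two same-POS senses can be shown with a toy calculation. The 50/50 and 90/10 mixes and the `seat` slug below are invented for illustration; they are not the benchmark's actual distributions.

```python
def accuracy(predict, labelled):
    """Fraction of (context, gold_sense) pairs the predictor gets right."""
    hits = sum(predict(ctx) == gold for ctx, gold in labelled)
    return hits / len(labelled)

def always_one_sense(_context):
    # Stand-in for a POS-keyed lookup: both senses are nouns,
    # so it can only ever return one of them.
    return "seat"

# Hypothetical sense mixes, not taken from the real datasets.
balanced = [("ctx", "seat")] * 50 + [("ctx", "thirst")] * 50
skewed = [("ctx", "seat")] * 90 + [("ctx", "thirst")] * 10

balanced_acc = accuracy(always_one_sense, balanced)
skewed_acc = accuracy(always_one_sense, skewed)
print(balanced_acc)  # 0.5
print(skewed_acc)    # 0.9
```

The predictor is equally blind in both cases; only the test set changed. That is why the balanced column is the more revealing one for minority readings.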
Install
```sh
pip install -e . --no-deps
```
No dependencies — pure standard library.
API
```python
from bifonia import (tokenize, is_ambiguous, guess_sense, guess_pos,
                     disambiguate, add_extra_diacritics)

sentence = "Resolveu o problema desta forma simples."
words = tokenize(sentence)
i = words.index("forma")

guess_sense(words, i)                   # 'shape'
guess_pos(words, i)                     # 'NOUN' (descriptive)
disambiguate(words, i)                  # 'ˈfɔɾmɐ'
disambiguate(words, i, sense="mould")   # 'ˈfoɾmɐ' (override)
add_extra_diacritics(sentence)          # '...desta fórma simples.' (acute = open vowel)
```
add_extra_diacritics rewrites each homograph with a disambiguating diacritic
(acute → open vowel, circumflex → closed) that a downstream grapheme-to-phoneme stage
can read directly.
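A downstream stage could read that convention back with a few lines of standard-library code. This is a sketch of one possible consumer, not part of bifonia: `marked_vowel_quality` is a hypothetical helper, and the circumflex form `fôrma` is inferred from the stated convention rather than quoted from the package output.

```python
import unicodedata

ACUTE = "\u0301"       # combining acute accent
CIRCUMFLEX = "\u0302"  # combining circumflex accent

def marked_vowel_quality(word):
    """Return 'open', 'closed', or None based on the first accent found."""
    # NFD splits 'ó' into 'o' + combining acute, so the mark can be inspected.
    for ch in unicodedata.normalize("NFD", word):
        if ch == ACUTE:
            return "open"
        if ch == CIRCUMFLEX:
            return "closed"
    return None

print(marked_vowel_quality("fórma"))  # open
print(marked_vowel_quality("fôrma"))  # closed
print(marked_vowel_quality("forma"))  # None
```

Other marks, such as the cedilla in começo, are ignored because only the acute and circumflex carry the open/closed signal.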
Datasets
Both on the Hugging Face Hub, schema {word, sense, pos, ipa, sentence}:
- bifonia-pt-homographs: 56,891 labelled sentences over 27 words, with stratified train/test splits, for training and synthetic evaluation.
- bifonia-pt-homographs-wild: real Wikipedia and web sentences, an out-of-distribution test set.
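As a sketch of what one record looks like under that schema, the snippet below parses a single JSON line and checks the five fields. It assumes a JSON Lines layout (one object per line), which is a guess based on the corpus.jsonl file name; the record is assembled from the sede example at the top of this page, and `parse_record` is a hypothetical helper.

```python
import json

REQUIRED_FIELDS = {"word", "sense", "pos", "ipa", "sentence"}

def parse_record(line):
    """Parse one JSON line and verify it carries every schema field."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record

# Example line built from values quoted in this README.
line = json.dumps({
    "word": "sede",
    "sense": "thirst",
    "pos": "NOUN",
    "ipa": "ˈsedɨ",
    "sentence": "Tinha tanta sede que bebi a garrafa toda.",
}, ensure_ascii=False)

record = parse_record(line)
print(record["word"], record["sense"], record["ipa"])
```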
Word coverage
27 homographs: acordo, acerto, cerro, choro, colher, começo, conserto,
coro, corte, forma, gosto, gozo, jogo, molho, olho, para, pelo,
peso, porto, posto, rego, seco, sede, sobre, tola, torre, transtorno.
Per-word IPA, senses, and diacritized forms are in docs/words.md.
Project layout
- bifonia/data/corpus.jsonl: the labelled corpus (single source of truth).
- bifonia/data/heterophonic_homographs.csv: the word,sense,pos,ipa table.
- bifonia/data/sense_model_{nb,perceptron}.json: trained models (JSON weights).
- bifonia/locale/<lang>/*.voc: context wordlists, one term per line, editable.
- bifonia/features.py: language-agnostic feature extraction (shared by train and inference).
Porting to a related language means supplying a corpus and .voc files and retraining —
the algorithm carries no hardcoded Portuguese.
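To give a feel for what a wordlist-driven rule might look like, here is a sketch that parses one-term-per-line text and picks the sense whose list has the most hits near the target word. The wordlist contents, the window size, the `seat` slug, and the `rule_guess` function are all hypothetical; the real rule engine and its .voc files may be organised differently.

```python
def parse_voc(text):
    """Parse one-term-per-line wordlist text; blank lines are skipped."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

# Hypothetical wordlists standing in for files under bifonia/locale/<lang>/.
VOC = {
    "thirst": parse_voc("beber\nbebi\nágua\ngarrafa\n"),
    "seat": parse_voc("empresa\npartido\nedifício\n"),
}

def rule_guess(words, i, voc, window=5, default=None):
    """Return the sense with the most wordlist hits within the window."""
    lo, hi = max(0, i - window), i + window + 1
    context = {w.lower() for j, w in enumerate(words[lo:hi], start=lo) if j != i}
    hits = {sense: len(context & terms) for sense, terms in voc.items()}
    best = max(sorted(hits), key=hits.get)
    return best if hits[best] > 0 else default

words = "Tinha tanta sede que bebi a garrafa toda .".split()
guess = rule_guess(words, words.index("sede"), VOC)
print(guess)  # thirst
```

Porting a rule like this to another language would mean swapping the wordlists, not the function, which matches the intent of keeping the terms in editable .voc files.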
See also
- docs/methodology.md: algorithm, features, and benchmarks
- docs/usage.md: full API reference
- docs/words.md: per-word pronunciation notes
- docs/diacritics_restoration.md: the diacritics-restoration task
- examples/basic_usage.py: runnable demo
- train.py · benchmark_tagger.py · benchmark_ood.py