Portuguese heterophonic homograph disambiguation (rule-based + learned)

bifonia

Pronunciation disambiguation for European-Portuguese heterophonic homographs — words spelled identically whose pronunciation depends on meaning.

sede is thirst (ˈsedɨ, closed e) or a headquarters (ˈsɛdɨ, open e); forma is a mould (ˈfoɾmɐ) or a shape (ˈfɔɾmɐ); molho is sauce (ˈmoʎu) or a bundle (ˈmɔʎu). A text-to-speech front-end that guesses wrong says the wrong word out loud. bifonia picks the right reading — and therefore the right IPA — from context.

from bifonia import tokenize, is_ambiguous, guess_sense, disambiguate

words = tokenize("Tinha tanta sede que bebi a garrafa toda.")
i = words.index("sede")
guess_sense(words, i)    # 'thirst'
disambiguate(words, i)   # 'ˈsedɨ'

Why meaning, not part of speech

The obvious approach, tagging the part of speech and picking the pronunciation from it, cannot work when two readings share a POS. Both readings of sede (thirst and seat) are nouns; both readings of corte (cut and court) are nominal; so are both readings of forma (mould and shape). A POS tagger labels the two readings identically, so a POS-keyed lookup has to return one pronunciation for both and is wrong on the minority reading by construction. bifonia keys every reading on sense (a meaning slug) and resolves the meaning directly.
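The collapse is easy to see in a minimal sketch. The two lookup tables below are hypothetical and are not bifonia's internal data structures; the sense slug `seat` is assumed, and only the IPA values come from the examples above.

```python
# Hypothetical lookup tables for illustration; not bifonia's internals.

# Keyed on (word, POS): both readings of "sede" are nouns, so the second
# assignment silently overwrites the first.
by_pos = {}
by_pos[("sede", "NOUN")] = "ˈsedɨ"   # thirst
by_pos[("sede", "NOUN")] = "ˈsɛdɨ"   # seat / headquarters

# Keyed on (word, sense): each reading keeps its own entry.
by_sense = {
    ("sede", "thirst"): "ˈsedɨ",
    ("sede", "seat"): "ˈsɛdɨ",
}

print(len(by_pos))    # 1
print(len(by_sense))  # 2
```

However good the tagger is, the POS-keyed table can only ever return one of the two pronunciations.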

Two interchangeable engines

engine	needs a corpus?	how it decides
rules	no	hand-written context rules + wordlists (`.voc`)
learned	yes	per-word Naive-Bayes / averaged-perceptron over context features

The rule engine is self-contained and needs no training data, which makes it the right fit for a port to a low-resource language. The learned models are trained from the labelled corpus and generalise better where enough data exists. guess_sense uses a per-word ensemble: each word is served by whichever engine scores at least as well on held-out data, so on that data the combined system never does worse than the rules alone. Both engines are pure Python with no heavy runtime dependencies.
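The per-word selection can be sketched as follows. This is an illustration under stated assumptions, not the shipped selection code: the function name, the tie-breaking toward the rules, and the accuracy figures are all made up for the example.

```python
# Illustrative sketch; names and numbers are hypothetical.
def choose_engines(rule_acc, learned_acc):
    """Pick, per word, the engine with the better held-out accuracy.

    Ties go to the rules (an assumption), and a word without a learned
    score falls back to the rules.
    """
    choice = {}
    for word, rule_score in rule_acc.items():
        learned_score = learned_acc.get(word)
        use_learned = learned_score is not None and learned_score > rule_score
        choice[word] = "learned" if use_learned else "rules"
    return choice


rule_acc = {"sede": 0.91, "forma": 0.88, "molho": 0.97}     # made-up figures
learned_acc = {"sede": 0.96, "forma": 0.88, "molho": 0.93}  # made-up figures

engines = choose_engines(rule_acc, learned_acc)
print(engines)  # {'sede': 'learned', 'forma': 'rules', 'molho': 'rules'}

# Held-out accuracy of the combined system, word by word.
combined = {
    w: learned_acc[w] if engines[w] == "learned" else rule_acc[w]
    for w in rule_acc
}
```

Because each word keeps the rules unless the learned model beats them, the combined held-out score for every word is at least the rule score.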

The learned engine ships in two forms, a Naive-Bayes model and an averaged perceptron, and guess_sense loads the averaged perceptron by default. The perceptron scores higher than Naive-Bayes on both the synthetic and the real-world benchmarks and avoids the per-word collapses that Naive-Bayes suffers when correlated features violate its independence assumption. It is warm-started from the Naive-Bayes log-odds, so its weights stay readable as per-sense lexicons. To compare the two, load the other model explicitly with SenseModel.load(path).
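A warm-started averaged perceptron can be sketched in a few lines. The code below is a generic textbook version, not bifonia's training code: the bag-of-context-words features, the add-one smoothing, the unit update step, the `seat` sense slug, and the toy examples are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def nb_log_odds(examples, senses):
    """Add-one-smoothed log-odds of each feature for a sense vs. the rest."""
    counts = {s: Counter() for s in senses}
    for feats, sense in examples:
        counts[sense].update(feats)
    vocab = {f for c in counts.values() for f in c}
    weights = {s: {} for s in senses}
    for s in senses:
        rest = Counter()
        for other in senses:
            if other != s:
                rest.update(counts[other])
        n_s = sum(counts[s].values()) + len(vocab)
        n_rest = sum(rest.values()) + len(vocab)
        for f in vocab:
            weights[s][f] = (math.log((counts[s][f] + 1) / n_s)
                             - math.log((rest[f] + 1) / n_rest))
    return weights


def predict(weights, feats):
    """Return the sense whose weights sum highest over the features."""
    return max(sorted(weights),
               key=lambda s: sum(weights[s].get(f, 0.0) for f in feats))


def train_averaged_perceptron(examples, senses, epochs=5):
    weights = nb_log_odds(examples, senses)   # warm start
    totals = {s: defaultdict(float) for s in senses}
    steps = 0
    for _ in range(epochs):
        for feats, gold in examples:
            guess = predict(weights, feats)
            if guess != gold:                 # standard perceptron update
                for f in feats:
                    weights[gold][f] = weights[gold].get(f, 0.0) + 1.0
                    weights[guess][f] = weights[guess].get(f, 0.0) - 1.0
            steps += 1
            for s in senses:                  # naive (non-lazy) averaging
                for f, w in weights[s].items():
                    totals[s][f] += w
    return {s: {f: t / steps for f, t in totals[s].items()} for s in senses}


# Toy context words for "sede"; invented for the example.
examples = [
    (["tanta", "bebi", "água"], "thirst"),
    (["matar", "água"], "thirst"),
    (["empresa", "nova", "lisboa"], "seat"),
    (["partido", "empresa"], "seat"),
]
model = train_averaged_perceptron(examples, ["thirst", "seat"])
print(predict(model, ["bebi", "água"]))  # thirst
```

Each sense ends up with a dictionary of feature weights, so sorting `model["thirst"]` by value gives the kind of per-sense lexicon the paragraph describes.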

Accuracy

Sense prediction, measured two ways:

approach	synthetic test	real-world (OOD)
most-common baseline	52.7%	47.5%
spaCy POS → sense	65.7%	81.4%
Stanza POS → sense	75.5%	82.5%
rules (no corpus)	94.5%	84.6%
Naive-Bayes	98.1%	86.7%
averaged perceptron	99.0%	89.6%
shipped ensemble	96.1%	90.5%

The synthetic column is the held-out split of the generated training corpus, balanced across senses; the OOD column is real sentences from bifonia-pt-homographs-wild. The two columns answer different questions. The synthetic set is balanced, so it exposes how badly POS tagging handles minority readings: a tagger cannot separate two senses that share a part of speech, and spaCy and Stanza both score 0% on sede/thirst. Real text is skewed toward the majority readings that POS taggers do get right, which lifts them to about 82%. The meaning-aware models still win, and the perceptron leads the better tagger by about 7 points (89.6% vs 82.5%). Reproduce with python benchmark_tagger.py (synthetic) and python benchmark_ood.py (OOD).
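The gap between the two columns is mostly arithmetic. The sketch below takes a system that is always right on one reading and always wrong on the other, as the taggers are on sede, and scores it under two sense distributions. The 90/10 split is a made-up figure chosen for illustration, not a measurement from either dataset.

```python
def overall_accuracy(per_sense_accuracy, sense_share):
    """Accuracy over a test set, weighting each sense by its share."""
    return sum(per_sense_accuracy[s] * share for s, share in sense_share.items())


# A POS-keyed system: right on one reading, wrong on the other.
per_sense = {"seat": 1.0, "thirst": 0.0}

balanced = {"seat": 0.5, "thirst": 0.5}   # like the synthetic split
skewed = {"seat": 0.9, "thirst": 0.1}     # hypothetical real-text skew

print(overall_accuracy(per_sense, balanced))  # 0.5
print(overall_accuracy(per_sense, skewed))    # 0.9
```

The same system looks far better on skewed text even though it never gets the minority reading right.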

Install

pip install -e . --no-deps

No dependencies — pure standard library.

API

from bifonia import (tokenize, is_ambiguous, guess_sense, guess_pos,
                     disambiguate, add_extra_diacritics)

sentence = "Resolveu o problema desta forma simples."
words = tokenize(sentence)
i = words.index("forma")

guess_sense(words, i)              # 'shape'
guess_pos(words, i)                # 'NOUN'   (informational; the reading is chosen by sense)
disambiguate(words, i)             # 'ˈfɔɾmɐ'
disambiguate(words, i, sense="mould")   # 'ˈfoɾmɐ'  (override)
add_extra_diacritics(sentence)     # '...desta fórma simples.'  (acute = open vowel)

add_extra_diacritics rewrites each homograph with a disambiguating diacritic (acute → open vowel, circumflex → closed) that a downstream grapheme-to-phoneme stage can read directly.
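A downstream stage could read the marker roughly as follows. This is a sketch of a hypothetical consumer, not part of bifonia: the function name is invented, and fôrma and sêde are example inputs built from the stated convention (acute for an open vowel, circumflex for a closed one).

```python
import unicodedata

# Combining acute and combining circumflex, per the convention in the text.
QUALITY = {"\u0301": "open", "\u0302": "closed"}


def marked_vowel(word):
    """Return (base letter, quality) for the first marked vowel, else None."""
    text = unicodedata.normalize("NFD", word)   # split letters from accents
    for base, mark in zip(text, text[1:]):
        if mark in QUALITY:
            return base, QUALITY[mark]
    return None


print(marked_vowel("fórma"))  # ('o', 'open')
print(marked_vowel("fôrma"))  # ('o', 'closed')
print(marked_vowel("forma"))  # None
```

Normalising to NFD first means the check works whether the input uses precomposed or combining characters.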

Datasets

Both datasets are on the Hugging Face Hub and share the schema {word, sense, pos, ipa, sentence}:

bifonia-pt-homographs — 56,891 labelled sentences over 27 words, with stratified train/test splits, for training and synthetic evaluation.
bifonia-pt-homographs-wild — real Wikipedia and web sentences, an out-of-distribution test set.
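Records in this schema are straightforward to validate and count. The sketch below assumes one JSON object per line (JSON Lines) and uses two hand-written records built from the example sentences in this README; it does not download or reproduce either dataset, and the helper name is hypothetical.

```python
import io
import json
from collections import Counter

FIELDS = {"word", "sense", "pos", "ipa", "sentence"}

# Hand-written sample in the documented schema; not taken from the datasets.
sample = io.StringIO(
    '{"word": "sede", "sense": "thirst", "pos": "NOUN", "ipa": "ˈsedɨ", '
    '"sentence": "Tinha tanta sede que bebi a garrafa toda."}\n'
    '{"word": "forma", "sense": "shape", "pos": "NOUN", "ipa": "ˈfɔɾmɐ", '
    '"sentence": "Resolveu o problema desta forma simples."}\n'
)


def read_records(lines):
    """Yield one dict per non-blank line, checking the expected fields."""
    for number, line in enumerate(lines, 1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {number}: missing {sorted(missing)}")
        yield record


records = list(read_records(sample))
per_sense = Counter((r["word"], r["sense"]) for r in records)
print(per_sense[("sede", "thirst")])  # 1
```

Counting by (word, sense) is a quick way to check how balanced a split is before training or evaluating on it.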

Word coverage

27 homographs: acerto, acordo, cerro, choro, colher, começo, conserto, coro, corte, forma, gosto, gozo, jogo, molho, olho, para, pelo, peso, porto, posto, rego, seco, sede, sobre, tola, torre, transtorno.

Per-word IPA, senses, and diacritized forms are in docs/words.md.

Project layout

bifonia/data/corpus.jsonl — the labelled corpus (single source of truth).
bifonia/data/heterophonic_homographs.csv — the word,sense,pos,ipa table.
bifonia/data/sense_model_{nb,perceptron}.json — trained models (JSON weights).
bifonia/locale/<lang>/*.voc — context wordlists, one term per line, editable.
bifonia/features.py — language-agnostic feature extraction (shared by train and inference).

Porting to a related language means supplying a corpus and .voc files and retraining — the algorithm carries no hardcoded Portuguese.
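To give a feel for what a wordlist-driven rule might look like, here is a minimal sketch. The `.voc` format is taken from the description above (one term per line); everything else is an assumption for illustration, including the cue words, the window size, the overlap-count scoring, and the function names. bifonia's actual rule format is not shown here.

```python
def parse_voc(text):
    """Parse wordlist text: one term per line, blank lines skipped."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def rule_guess(words, i, cues, window=4, default=None):
    """Pick the sense whose cue list overlaps most with the nearby words."""
    lo, hi = max(0, i - window), i + window + 1
    context = {w.lower() for k, w in enumerate(words[lo:hi], lo) if k != i}
    scores = {sense: len(context & terms) for sense, terms in cues.items()}
    best = max(sorted(scores), key=scores.get)
    return best if scores[best] > 0 else default


# Hypothetical cue lists, as they might appear in two .voc files.
cues = {
    "thirst": parse_voc("bebi\nbeber\nágua\n"),
    "seat": parse_voc("empresa\npartido\nsocial\n"),
}

words = ["Tinha", "tanta", "sede", "que", "bebi", "a", "garrafa", "toda"]
print(rule_guess(words, words.index("sede"), cues))  # thirst
```

Under this sketch, porting means swapping the cue lists for ones in the target language; the matching code itself contains no Portuguese.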

Release history

0.1.1, Jun 12, 2026
0.1.1a1 (pre-release; this version), Jun 12, 2026
