Grapheme-to-Phoneme models for Norwegian Bokmål
This repo contains G2P models for Norwegian Bokmål[^1], which produce phonemic transcriptions in two pronunciation styles: close-to-spoken pronunciations, as in spontaneous conversation (`spoken`), and close-to-written pronunciations, as when reading text aloud (`written`), for 5 different dialect areas:
- East Norwegian (`e`)
- South West Norwegian (`sw`)
- West Norwegian (`w`)
- Central Norwegian (Trøndersk) (`t`)
- North Norwegian (`n`)
[^1]: Bokmål is the most widely used written standard for Norwegian. The other written standard is Nynorsk. Read more on Wikipedia.
Setup
Follow the installation instructions from Phonetisaurus. You only need the steps "Next grab and install OpenFst-1.7.2" and "Checkout the latest Phonetisaurus from master and compile without bindings".
Data
The pronunciation lexica that were used to train the G2P-models are free to download and use from Språkbanken's resource catalogue: NB Uttale
For more information about the lexica, see the Github repo: Sprakbanken/nb_uttale
The models and data can be downloaded from release v2.0 or from Språkbanken's resource catalogue G2P-no-2_0.tar.gz.
Extract them and place the folders `data` and `models` in your local clone of this repo.
Content
- `models/`: contains the models, as well as auxiliary files used by Phonetisaurus
  - `nb_*.fst`: model files to use with `phonetisaurus-apply`. The expansion of `*` is a string of a dialect and pronunciation style, e.g. `e_spoken` or `t_written`.
  - `nb_*.o8.arpa`: 8-gram models for phoneme sequences that Phonetisaurus uses during training.
  - `nb_*.corpus`: aligned graphemes and phonemes from the lexica.
- `data/`: contains various lexica used for training and testing, including predictions from the models on the test set (a small loading sketch follows this list)
  - `NB-uttale_*_train.dict`: training data for `models/nb_*.fst`. Each file contains 543 495 word-transcription pairs (WTPs) and makes up 80% of all unique WTPs in the lexicon.
  - `NB-uttale_*_test.dict`: test data for `models/nb_*.fst`. Each file contains the remaining 20% of the WTPs in the lexicon, i.e. 135 787 WTPs.
  - `predicted_nb_*.dict`: the words from the test data with the model's predicted transcriptions.
  - `wordlist_test.txt`: the orthographic words from the test data, which the models predict transcriptions for.
- `evaluate.py`: script to evaluate the models. The method for calculating WER and PER was re-implemented.
- `g2p_stats.py`: script to evaluate the models from v1.0, which can be used to compare results between these models and the NDT models (with and without tone and stress markers) from version 1.
- `LICENSE`: the text of the CC0 license, which this resource is distributed with.
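For reference, here is a minimal sketch of how one of these lexicon files could be loaded, assuming the usual Phonetisaurus lexicon layout of one tab-separated word and space-separated transcription per line. It is not part of the repo's own tooling, and the file name in the comment is just an example from the listing above:

```python
from pathlib import Path

def load_lexicon(path):
    """Read a pronunciation lexicon with one tab-separated word/transcription pair per line."""
    lexicon = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, _, transcription = line.partition("\t")
        # a word may have several transcriptions; store each as a list of phoneme symbols
        lexicon.setdefault(word, []).append(transcription.split())
    return lexicon

# Example (file name taken from the listing above):
# lexicon = load_lexicon("data/NB-uttale_e_spoken_test.dict")
# print(len(lexicon), "unique words")
```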
Usage
phonetisaurus-apply --model models/nb_e_spoken.fst --word_list data/wordlist_test.txt -n 1 -v > output.txt
- Input data (`--word_list`) should be a list of newline-delimited words. See the file `data/wordlist_test.txt` for an example.
- The trained G2P models are `.fst` files located in the `models` folder. The same folder also contains aligned `.corpus` files and phoneme 8-gram models (`.arpa` files), also from the Phonetisaurus training process.
- `-n` lets you adjust the number of most probable predictions. A sketch for reading the resulting `output.txt` follows this list.
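Below is a minimal sketch of how the predictions written to `output.txt` could be read back into Python. It assumes each line holds the input word and its space-separated transcription, separated by a tab, and that with `-n` greater than 1 a word can appear on several lines; check your own output, as the exact layout may differ with other flags:

```python
from collections import defaultdict

def read_predictions(path):
    """Parse a phonetisaurus-apply output file into {word: [transcription, ...]}."""
    predictions = defaultdict(list)
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            word, _, transcription = line.partition("\t")
            predictions[word].append(transcription.split())
    return predictions

# preds = read_predictions("output.txt")  # output.txt from the command above
```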
Evaluation
There are two scripts for calculating WER and PER statistics; they give slightly different results.
evaluate.py
Calculates stats for all the provided models by default. You can give a pronunciation variant (e.g. `-l e_spoken`) to calculate stats for specific models.
- The WER score is calculated as the count of all mismatching transcriptions (1 error = 1 mismatching word) divided by the count of all words in the reference, i.e. a `*_test.dict` file.
- PER is calculated as the count of all errors (1 error = a mismatching phoneme) divided by the total count of phonemes in the reference file. A sketch of this calculation follows the results table below.
python evaluate.py
Model | Word Error Rate (%) | Phoneme Error Rate (%) |
---|---|---|
nb_e_written.fst | 13.661238654564869 | 1.9681178920293207 |
nb_e_spoken.fst | 13.72501038144391 | 1.9832518286152074 |
nb_sw_written.fst | 13.240048644480037 | 1.8396612197218096 |
nb_sw_spoken.fst | 16.422702734768936 | 2.426312206336983 |
nb_w_written.fst | 13.240048644480037 | 1.8396612197218096 |
nb_w_spoken.fst | 16.892833837574894 | 2.5064155890730686 |
nb_t_written.fst | 13.736133357062347 | 1.98774986044724 |
nb_t_spoken.fst | 16.47992288013051 | 2.5809178688066843 |
nb_n_written.fst | 13.736133357062347 | 1.98774986044724 |
nb_n_spoken.fst | 17.22590930999963 | 2.8209779677747715 |
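To make those definitions concrete, here is a minimal sketch of corpus-level WER and PER computed as described above (mismatching transcriptions over reference words, phoneme edit errors over reference phonemes). This is not the actual `evaluate.py` code; the function names and the one-transcription-per-word assumption are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def corpus_wer_per(reference, predicted):
    """reference and predicted map each word to a single phoneme list."""
    word_errors = phoneme_errors = total_phonemes = 0
    for word, ref_phones in reference.items():
        hyp_phones = predicted.get(word, [])
        word_errors += ref_phones != hyp_phones            # 1 error = 1 mismatching word
        phoneme_errors += edit_distance(ref_phones, hyp_phones)
        total_phonemes += len(ref_phones)
    wer = 100 * word_errors / len(reference)
    per = 100 * phoneme_errors / total_phonemes
    return wer, per
```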
g2p_stats.py
Calculates WER and PER for two input files:
- The reference file (e.g. `data/NB-uttale_e_spoken_test.dict`)
- The model prediction file (e.g. `output.txt` from the command in Usage, or `data/predicted_nb_e_spoken.dict`).
- The WER score is calculated as the count of errors (1 error = 1 mismatching word) divided by the count of all words in the predictions, i.e. a `predicted_*.dict` file.
- PER is calculated as the sum of phone error rates for each transcription, divided by the total count of words in the predictions.

NOTE: This method doesn't take transcription lengths into account, so a transcription with 2 phonemes where 1 is wrong has a PER of 0.5, while a word of length 10 with 1 error has a PER of 0.1, and the average score for the two words would be 0.3 (see the short sketch after the results table below).
python g2p_stats.py data/NB-uttale_e_spoken_test.dict data/predicted_nb_e_spoken.dict
# WER: 14.049209423582523
# PER: 2.714882650391985
Model | Word Error Rate (%) | Phoneme Error Rate (%) |
---|---|---|
nb_e_written.fst | 13.97114598599277 | 2.7038190765903214 |
nb_e_spoken.fst | 14.049209423582523 | 2.714882650391985 |
nb_sw_written.fst | 13.541060631724685 | 2.5423757844377284 |
nb_sw_spoken.fst | 16.729141964989285 | 3.34063477772742 |
nb_w_written.fst | 13.541060631724685 | 2.5423757844377284 |
nb_w_spoken.fst | 17.186475877661337 | 3.4137304874392114 |
nb_t_written.fst | 14.059519688924565 | 2.7190289235234104 |
nb_t_spoken.fst | -- | -- |
nb_n_written.fst | 14.059519688924565 | 2.7190289235234104 |
nb_n_spoken.fst | -- | -- |
NOTE: The t_spoken and n_spoken model predictions are not the same length as the reference file, which causes the script to exit.
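The note on length-insensitive averaging can be illustrated with a short, self-contained sketch of the per-transcription averaging this PER definition implies (the function name is illustrative, not taken from `g2p_stats.py`):

```python
def averaged_per(per_word_rates):
    """Mean of the per-transcription phoneme error rates."""
    return sum(per_word_rates) / len(per_word_rates)

# The example from the note above: a 2-phoneme word with 1 error (0.5) and a
# 10-phoneme word with 1 error (0.1) average to 0.3, whereas the corpus-level
# PER over the same two words would be 2 / 12, roughly 0.17.
print(averaged_per([1 / 2, 1 / 10]))  # 0.3
```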
Transcription standard
The G2P models have been trained on the NoFAbet transcription standard, which is easier for humans to read than X-SAMPA. NoFAbet is partly based on 2-letter ARPAbet and was made by Nate Young for the National Library of Norway in connection with the development of NoFA, a forced aligner for Norwegian. The equivalence table below contains X-SAMPA, IPA and NoFAbet notations.
X-SAMPA-IPA-NoFAbet equivalence table
X-SAMPA | IPA | NoFAbet | Example |
---|---|---|---|
A: | ɑː | AA0 | bad |
{: | æː | AE0 | vær |
{ | æ | AEH0 | vært |
{*I | æɪ | AEJ0 | sei |
E*u0 | æʉ | AEW0 | sau |
A | ɑ | AH0 | hatt |
A*I | ɑɪ | AJ0 | kai |
@ | ə | AX0 | behage |
b | b | B | bil |
d | d | D | dag |
e: | eː | EE0 | lek |
E | ɛ | EH0 | penn |
f | f | F | fin |
g | g | G | gul |
h | h | H | hes |
I | ɪ | IH0 | sitt |
i: | iː | II0 | vin |
j | j | J | ja |
k | k | K | kost |
C | ç | KJ | kino |
l | l | L | land |
l= | l̩ | LX0 | |
m | m | M | man |
n | n | N | nord |
N | ŋ | NG | eng |
n= | n̩ | NX0 | |
o: | oː | OA0 | rå |
O | ɔ | OAH0 | gått |
2: | øː | OE0 | løk |
9 | œ | OEH0 | høst |
9*Y | œy | OEJ0 | køye |
U | u | OH0 | fort |
O*Y | ɔy | OJ0 | konvoy |
u: | uː | OO0 | bod |
@U | oʉ | OU0 | show |
p | p | P | pil |
r | r | R | rose |
d` | ɖ | RD | rekord |
l` | ɭ | RL | perle |
l`= | ɭ̩ | RLX0 | |
n` | ɳ | RN | barn |
n`= | ɳ̩ | RNX0 | |
s` | ʂ | RS | pers |
t` | ʈ | RT | stort |
s | s | S | sil |
S | ʃ | SJ | sju |
t | t | T | tid |
u0 | ʉ | UH0 | russ |
}: | ʉː | UU0 | hus |
v | ʋ | V | vase |
w | w | W | Washington |
Y | y | YH0 | nytt |
y: | yː | YY0 | ny |
Unstressed syllables are marked with a 0 after the vowel or syllabic consonant. The nucleus is marked with a 1 for tone 1 and a 2 for tone 2. Secondary stress is marked with 3.
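As a small illustration of this marking scheme, the sketch below splits a NoFAbet symbol into its base symbol and its stress/tone digit. The helper name is illustrative, and the example tokens are constructed from the rules above rather than taken from the lexicon:

```python
import re

# A NoFAbet symbol is an uppercase base plus an optional stress/tone digit:
# 0 = unstressed, 1 = tone 1, 2 = tone 2, 3 = secondary stress.
NOFABET_SYMBOL = re.compile(r"^(?P<base>[A-Z]+)(?P<stress>[0-3])?$")

def split_symbol(symbol):
    """Return (base, stress) for a NoFAbet symbol, e.g. 'AA1' -> ('AA', '1')."""
    match = NOFABET_SYMBOL.match(symbol)
    if match is None:
        raise ValueError(f"not a NoFAbet symbol: {symbol!r}")
    return match.group("base"), match.group("stress")

print(split_symbol("AA1"))  # ('AA', '1'): nucleus carrying tone 1
print(split_symbol("SJ"))   # ('SJ', None): consonant, no stress/tone digit
```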
License
These models are shared under a Creative Commons Zero (CC0) license, as are the lexica they are trained on. The models can be used for any purpose, as long as that use is compliant with Phonetisaurus' license.