Grapheme-to-phoneme models for Norwegian bokmål to dialectal phonemic transcriptions

These details have not been verified by PyPI

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Grapheme to Phoneme models for Norwegian Bokmål

This repo contains G2P models for Norwegian bokmål[^1]. They produce phonemic transcriptions of close-to-spoken pronunciations (as in spontaneous conversation, labelled spoken) and close-to-written pronunciations (as when reading text aloud, labelled written) for five dialect areas:

East Norwegian (e)
South West Norwegian (sw)
West Norwegian (w)
Central Norwegian (Trøndersk) (t)
North Norwegian (n)

[^1]: Bokmål is the most widely used written standard for Norwegian. The other written standard is Nynorsk. Read more on Wikipedia.

Setup

Follow the installation instructions from Phonetisaurus. You only need the steps "Next grab and install OpenFst-1.7.2" and "Checkout the latest Phonetisaurus from master and compile without bindings".

Data

The pronunciation lexica that were used to train the G2P-models are free to download and use from Språkbanken's resource catalogue: NB Uttale

For more information about the lexica, see the GitHub repo: Sprakbanken/nb_uttale

The models and data can be downloaded from release v2.0 or from Språkbanken's resource catalogue G2P-no-2_0.tar.gz.

Extract them and place the folders data and models in your local clone of this repo.

Content

models/: contains the models, as well as auxiliary files used by Phonetisaurus
- nb_*.fst: model files to use with phonetisaurus-apply. The expansion of * is a string of a dialect and pronunciation style, e.g. e_spoken or t_written.
- nb_*.o8.arpa: 8-gram models for phoneme sequences that Phonetisaurus uses during training.
- nb_*.corpus: aligned graphemes and phonemes from the lexica.
data/: contains various lexica used for training and testing, including predictions from the models on the test set
- NB-uttale_*_train.dict: training data for models/nb_*.fst. Each file contains 543 495 word-transcription pairs (WTP), and makes up 80% of all unique WTPs in the lexicon.
- NB-uttale_*_test.dict: test data for models/nb_*.fst. Each file contains the remaining 20% of the WTPs in the lexicon, i.e. 135 787 WTPs.
- predicted_nb_*.dict: The words from the test data with the model's predicted transcriptions.
- wordlist_test.txt: The orthographic words from the test data, which the models predict transcriptions for.
evaluate.py: script to evaluate the models. The method for calculating WER and PER was re-implemented.
g2p_stats.py: script to evaluate the models from V1.0, which can be used to compare results between these models and the NDT models (with and without tone and stress markers) from version 1.
LICENSE: The license text for CC0, under which this resource is distributed.

Usage

phonetisaurus-apply --model models/nb_e_spoken.fst --word_list data/wordlist_test.txt  -n 1  -v  > output.txt

Input data (--word_list) should be a list of newline-delimited words. See the file data/wordlist_test.txt for an example.
The trained G2P-models are .fst files located in the models folder. The same folder also contains aligned .corpus files and phoneme 8-gram models (.arpa files), also from the Phonetisaurus training process.
-n lets you adjust the number of most probable predictions.
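
As a rough sketch of how the model naming scheme and the command above fit together, the following Python helper builds the model path and the phonetisaurus-apply argument list for a given dialect code and pronunciation style. The names `model_path` and `build_command` are illustrative and not part of this repo. Only flags that appear in the command above are used.

```python
from __future__ import annotations

from pathlib import Path

# Dialect codes and pronunciation styles as listed in this README.
DIALECTS = {
    "e": "East Norwegian",
    "sw": "South West Norwegian",
    "w": "West Norwegian",
    "t": "Central Norwegian (Trøndersk)",
    "n": "North Norwegian",
}
STYLES = ("spoken", "written")


def model_path(dialect: str, style: str, models_dir: str = "models") -> Path:
    """Return the path of the .fst model for a dialect code and style."""
    if dialect not in DIALECTS:
        raise ValueError(f"unknown dialect code: {dialect!r}")
    if style not in STYLES:
        raise ValueError(f"unknown pronunciation style: {style!r}")
    return Path(models_dir) / f"nb_{dialect}_{style}.fst"


def build_command(dialect: str, style: str, word_list: str, nbest: int = 1) -> list[str]:
    """Build the argument list for phonetisaurus-apply (hypothetical helper)."""
    return [
        "phonetisaurus-apply",
        "--model", model_path(dialect, style).as_posix(),
        "--word_list", word_list,
        "-n", str(nbest),
    ]


if __name__ == "__main__":
    # Print the command for the Trøndersk close-to-written model.
    print(" ".join(build_command("t", "written", "data/wordlist_test.txt")))
```

Once Phonetisaurus is installed, the returned list can be passed to `subprocess.run`, for example with `capture_output=True` and `text=True`, to collect the predictions from Python.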

Evaluation

There are 2 scripts to calculate WER and PER statistics, which give slightly different results.

`evaluate.py`

Calculates stats for all the provided models by default. You can give a pronunciation variant (e.g. -l e_spoken) to calculate stats for specific models.

The WER score is calculated as the count of all mismatching transcriptions (1 error = 1 mismatching word) divided by the count of all words in the reference, i.e. a *_test.dict file.
PER is calculated as the count of all errors (1 error = a mismatching phoneme) divided by the total count of phonemes in the reference file.

python evaluate.py

Model	Word Error Rate (%)	Phoneme Error Rate (%)
nb_e_written.fst	13.661238654564869	1.9681178920293207
nb_e_spoken.fst	13.72501038144391	1.9832518286152074
nb_sw_written.fst	13.240048644480037	1.8396612197218096
nb_sw_spoken.fst	16.422702734768936	2.426312206336983
nb_w_written.fst	13.240048644480037	1.8396612197218096
nb_w_spoken.fst	16.892833837574894	2.5064155890730686
nb_t_written.fst	13.736133357062347	1.98774986044724
nb_t_spoken.fst	16.47992288013051	2.5809178688066843
nb_n_written.fst	13.736133357062347	1.98774986044724
nb_n_spoken.fst	17.22590930999963	2.8209779677747715
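
The WER and PER definitions given for `evaluate.py` can be sketched roughly as follows. This is not the code in `evaluate.py`: counting phoneme errors with a Levenshtein (edit) distance is an assumption, as is reporting both rates as percentages, and the example transcriptions are made up for illustration.

```python
from __future__ import annotations


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_error_rates(references: list[str], predictions: list[str]) -> tuple[float, float]:
    """Return (WER, PER) in percent, pooled over the whole reference."""
    if len(references) != len(predictions):
        raise ValueError("reference and predictions differ in length")
    word_errors = 0
    phoneme_errors = 0
    total_phonemes = 0
    for ref, hyp in zip(references, predictions):
        ref_ph, hyp_ph = ref.split(), hyp.split()
        word_errors += ref_ph != hyp_ph           # 1 error = 1 mismatching word
        phoneme_errors += edit_distance(ref_ph, hyp_ph)
        total_phonemes += len(ref_ph)
    wer = 100 * word_errors / len(references)
    per = 100 * phoneme_errors / total_phonemes
    return wer, per


if __name__ == "__main__":
    # Made-up transcriptions, one string of space-separated phonemes per word.
    refs = ["B II1 L", "D AA1 G"]
    preds = ["B II1 L", "D AH1 G"]
    print(corpus_error_rates(refs, preds))
```

Because the phoneme errors are divided by the total phoneme count of the reference, long transcriptions weigh more in the PER than short ones.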

`g2p_stats.py`

Calculates WER and PER for two input files.

The reference file (e.g. data/NB-uttale_e_spoken_test.dict)
The model prediction file (e.g. output.txt from the command in Usage, or data/predicted_nb_e_spoken.dict).

The WER score is calculated as the count of errors (1 error = 1 mismatching word) divided by the count of all words in the predictions, i.e. a predicted_*.dict file.
PER is calculated as the sum of phone error rates for each transcription, divided by the total count of words in the predictions.

NOTE: This method doesn't take transcription lengths into account: a transcription with 2 phonemes where 1 is wrong has a PER of 0.5, while a transcription with 10 phonemes and 1 error has a PER of 0.1, so the average score for the two words is 0.3. Pooling the phonemes instead gives 2 errors out of 12 phonemes, or about 0.17.
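
The effect described in the note can be checked with a few lines of Python. The sketch below compares the mean of per-word phoneme error rates with a rate pooled over all phonemes, using the two example words from the note. The function names are illustrative and do not come from `g2p_stats.py`.

```python
def mean_per_word_rate(errors_and_lengths):
    """Average the per-word error rates, ignoring transcription length."""
    rates = [errors / length for errors, length in errors_and_lengths]
    return sum(rates) / len(rates)


def pooled_rate(errors_and_lengths):
    """Divide the total error count by the total phoneme count."""
    total_errors = sum(errors for errors, _ in errors_and_lengths)
    total_length = sum(length for _, length in errors_and_lengths)
    return total_errors / total_length


if __name__ == "__main__":
    # (phoneme errors, transcription length) for the two words in the note.
    words = [(1, 2), (1, 10)]
    print("mean of per-word rates:", mean_per_word_rate(words))
    print("pooled rate:", pooled_rate(words))
```

The per-word mean comes out higher than the pooled rate here because the short word's single error counts as much as the long word's. This is one reason the two evaluation scripts report different PER values.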

python g2p_stats.py data/NB-uttale_e_spoken_test.dict data/predicted_nb_e_spoken.dict
# WER: 14.049209423582523
# PER: 2.714882650391985

Model	Word Error Rate (%)	Phoneme Error Rate (%)
nb_e_written.fst	13.97114598599277	2.7038190765903214
nb_e_spoken.fst	14.049209423582523	2.714882650391985
nb_sw_written.fst	13.541060631724685	2.5423757844377284
nb_sw_spoken.fst	16.729141964989285	3.34063477772742
nb_w_written.fst	13.541060631724685	2.5423757844377284
nb_w_spoken.fst	17.186475877661337	3.4137304874392114
nb_t_written.fst	14.059519688924565	2.7190289235234104
nb_t_spoken.fst	--	--
nb_n_written.fst	14.059519688924565	2.7190289235234104
nb_n_spoken.fst	--	--

NOTE: The t_spoken and n_spoken model predictions are not the same length as the reference file, which causes the script to exit.
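
To find out where a prediction file and a reference file diverge, a small diagnostic like the one below may help. It is a sketch under the assumption that each line of a .dict file starts with the orthographic word, followed by whitespace and the transcription. The function names are hypothetical and not part of this repo.

```python
from __future__ import annotations

from typing import Iterable


def words_in(lines: Iterable[str]) -> list[str]:
    """Return the first whitespace-separated field of every non-empty line."""
    words = []
    for line in lines:
        fields = line.split()
        if fields:
            words.append(fields[0])
    return words


def compare_word_lists(reference_lines: Iterable[str],
                       prediction_lines: Iterable[str]) -> dict:
    """Summarise how the words in the two files differ."""
    ref = words_in(reference_lines)
    pred = words_in(prediction_lines)
    return {
        "reference_lines": len(ref),
        "prediction_lines": len(pred),
        "missing_from_predictions": sorted(set(ref) - set(pred)),
        "extra_in_predictions": sorted(set(pred) - set(ref)),
    }


if __name__ == "__main__":
    # Tiny made-up example; real input would be the lines of two .dict files.
    reference = ["bil\tB II1 L", "dag\tD AA1 G"]
    predicted = ["bil\tB II1 L"]
    print(compare_word_lists(reference, predicted))
```

With real data, the two arguments can be open file objects, for example `open("data/NB-uttale_t_spoken_test.dict", encoding="utf-8")` and the matching `predicted_nb_t_spoken.dict` file.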

Transcription standard

The G2P models have been trained on the NoFAbet transcription standard, which is easier for humans to read than X-SAMPA. NoFAbet is based in part on 2-letter ARPAbet and was created by Nate Young for the National Library of Norway in connection with the development of NoFA, a forced aligner for Norwegian. The equivalence table below lists the X-SAMPA, IPA and NoFAbet notations.

X-SAMPA-IPA-NoFAbet equivalence table

X-SAMPA	IPA	NoFAbet	Example
A:	ɑː	AA0	bad
{:	æː	AE0	vær
{	æ	AEH0	vært
{*I	æɪ	AEJ0	sei
E*u0	æʉ	AEW0	sau
A	ɑ	AH0	hatt
A*I	ɑɪ	AJ0	kai
@	ə	AX0	behage
b	b	B	bil
d	d	D	dag
e:	eː	EE0	lek
E	ɛ	EH0	penn
f	f	F	fin
g	g	G	gul
h	h	H	hes
I	ɪ	IH0	sitt
i:	iː	II0	vin
j	j	J	ja
k	k	K	kost
C	ç	KJ	kino
l	l	L	land
l=	l̩	LX0
m	m	M	man
n	n	N	nord
N	ŋ	NG	eng
n=	n̩	NX0
o:	oː	OA0	rå
O	ɔ	OAH0	gått
2:	øː	OE0	løk
9	œ	OEH0	høst
9*Y	œy	OEJ0	køye
U	u	OH0	fort
O*Y	ɔy	OJ0	konvoy
u:	uː	OO0	bod
@U	oʉ	OU0	show
p	p	P	pil
r	r	R	rose
d`	ɖ	RD	rekord
l`	ɭ	RL	perle
l`=	ɭ̩	RLX0
n`	ɳ	RN	barn
n`=	ɳ̩	RNX0
s`	ʂ	RS	pers
t`	ʈ	RT	stort
s	s	S	sil
S	ʃ	SJ	sju
t	t	T	tid
u0	ʉ	UH0	russ
}:	ʉː	UU0	hus
v	ʋ	V	vase
w	w	W	Washington
Y	y	YH0	nytt
y:	yː	YY0	ny

Unstressed syllables are marked with a 0 after the vowel or syllabic consonant, which is the form listed in the table above. The nucleus of a syllable with primary stress is marked with a 1 for tone 1 and a 2 for tone 2. Secondary stress is marked with 3.
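
The marker convention can be illustrated with a short Python sketch that splits NoFAbet symbols into a base symbol and a marker digit and looks up the IPA equivalent. It assumes that the marker digit takes the place of the trailing 0 shown in the table (so that AA0 becomes AA1 under tone 1), it covers only a handful of rows from the table, and the example transcription is hand-written for illustration rather than taken from the lexica.

```python
from __future__ import annotations

# A few rows of the equivalence table, keyed by the NoFAbet symbol
# without its trailing marker digit.
NOFABET_TO_IPA = {
    "AA": "ɑː", "AH": "ɑ", "AX": "ə", "II": "iː",
    "B": "b", "D": "d", "G": "g", "H": "h", "L": "l", "T": "t",
}

# Meaning of the marker digits as described in this README.
MARKERS = {
    "0": "unstressed",
    "1": "tone 1",
    "2": "tone 2",
    "3": "secondary stress",
}


def split_symbol(symbol: str) -> tuple[str, str | None]:
    """Split a NoFAbet symbol into (base, marker digit or None)."""
    if symbol and symbol[-1].isdigit():
        return symbol[:-1], symbol[-1]
    return symbol, None


def describe(transcription: str) -> list[tuple[str, str | None]]:
    """Map each symbol to (IPA, marker description); consonants get None."""
    result = []
    for symbol in transcription.split():
        base, marker = split_symbol(symbol)
        ipa = NOFABET_TO_IPA.get(base, "?")  # "?" for rows not copied here
        result.append((ipa, MARKERS.get(marker) if marker else None))
    return result


if __name__ == "__main__":
    # Hand-written example transcription, not taken from the lexica.
    print(describe("B II1 L"))
```

Keeping the marker separate from the base symbol makes it straightforward to compare transcriptions with and without tone and stress information.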

License

These models are shared under a Creative Commons Zero (CC0) license, as are the lexica they are trained on. The models can be used for any purpose, as long as the use complies with Phonetisaurus' license.

Project details

Release history

0.1.1 (this version): Apr 29, 2024
0.1.0: Apr 26, 2024
