Lexikos - λεξικός /lek.si.kós/
A collection of pronunciation dictionaries and neural grapheme-to-phoneme models.
Install Lexikos
Install from PyPI
pip install lexikos
Editable install from Source
git clone https://github.com/bookbot-hive/lexikos.git
pip install -e lexikos
Usage
Lexicon
>>> from lexikos import Lexicon
>>> lexicon = Lexicon()
>>> print(lexicon["added"])
{'ˈæ d ɪ d', 'ˈæ ɾ ə d', 'æ ɾ ɪ d', 'a d ɪ d', 'ˈa d ɪ d', 'æ ɾ ə d', 'ˈa d ə d', 'a d ə d', 'ˈæ d ə d', 'æ d ə d', 'æ d ɪ d', 'ˈæ ɾ ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'ɹ ʌ n ɝ', 'ˈr ʌ n ɝ'}
>>> print(lexicon["water"])
{'ˈʋ aː ʈ ə r ɯ', 'ˈw oː t ə', 'w ɑ t ə ɹ', 'ˈw aː ʈ ə r ɯ', 'ˈw ɔ t ɝ', 'w ɔ t ə ɹ', 'ˈw ɑ t ə ɹ', 'w ɔ t ɝ', 'w ɑ ɾ ɚ', 'ˈw ɑ ɾ ɚ', 'ˈʋ ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɔː t ə', 'ˈw oː ɾ ə', 'ˈw ɔ ʈ ə r'}
To get a lexicon where phonemes are normalized (diacritics removed, digraphs split):
>>> from lexikos import Lexicon
>>> lexicon = Lexicon(normalize_phonemes=True)
>>> print(lexicon["added"])
{'æ ɾ ɪ d', 'a d ɪ d', 'a d ə d', 'æ ɾ ə d', 'æ d ə d', 'æ d ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'r ʌ n ɝ', 'ɹ ʌ n ɝ'}
>>> print(lexicon["water"])
{'w o ɾ ə', 'w ɔ t ə', 'ʋ ɔ ʈ ə r', 'w a ʈ ə r ɯ', 'w ɔ t ə ɹ', 'ʋ a ʈ ə r ɯ', 'w ɑ ɾ ɚ', 'w o t ə', 'w ɔ t ɝ', 'w ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɑ t ə ɹ'}
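Under the hood this applies a normalization pass to each pronunciation. As a rough illustration (a minimal sketch, not the library's exact implementation), the effect is to strip stress and length marks, drop combining diacritics, and split any remaining multi-character tokens into single phones:

import unicodedata

def normalize_phonemes(pron: str) -> str:
    # Strip IPA stress and length marks.
    pron = pron.replace("ˈ", "").replace("ˌ", "").replace("ː", "")
    # Drop combining diacritics (e.g. nasalization, tie bars).
    pron = "".join(
        ch for ch in unicodedata.normalize("NFD", pron)
        if not unicodedata.combining(ch)
    )
    # Split remaining multi-character tokens into single phones.
    return " ".join(ch for token in pron.split() for ch in token)

print(normalize_phonemes("ˈw oː t ə"))  # -> 'w o t ə'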
To include synthetic (non-dictionary-based) pronunciations:
>>> from lexikos import Lexicon
>>> lexicon = Lexicon(include_synthetic=True)
>>> print(lexicon["athletic"])
{'æ t l ɛ t ɪ k', 'æ θ ˈl ɛ t ɪ k', 'æ θ l ɛ t ɪ k'}
Phonemization
>>> from lexikos import G2p
>>> g2p = G2p(lang="en-us")
>>> g2p("Hello there! $100 is not a lot of money in 2023.")
['h ɛ l o ʊ', 'ð ɛ ə ɹ', 'w ʌ n', 'h ʌ n d ɹ ɪ d', 'd ɑ l ɚ z', 'ɪ z', 'n ɒ t', 'ə', 'l ɑ t', 'ʌ v', 'm ʌ n i', 'ɪ n', 't w ɛ n t i', 't w ɛ n t i', 'θ ɹ iː']
>>> g2p = G2p(lang="en-au")
>>> g2p("Hi there mate! Have a g'day!")
['h a ɪ', 'θ ɛ ə ɹ', 'm e ɪ t', 'h e ɪ v', 'ə', 'ɡ ə ˈd æ ɪ']
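A common pattern is to combine the two: look words up in the curated lexicon first, and fall back to the G2P model only for out-of-vocabulary words. A minimal sketch, assuming Lexicon supports dict-style membership tests and that G2p returns one phoneme string per input word, as the examples above suggest:

from lexikos import G2p, Lexicon

lexicon = Lexicon()
g2p = G2p(lang="en-us")

def phonemize_word(word: str) -> str:
    # Prefer attested dictionary pronunciations over model output.
    word = word.lower()
    if word in lexicon:
        return sorted(lexicon[word])[0]  # pick one attested variant
    return " ".join(g2p(word))

print(phonemize_word("water"))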
Dictionaries & Models
English (en)
Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn |
English (en-US)
Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-US | CMU Dict | ARPA | External Link | bookbot/byt5-small-cmudict |
en-US | CMU Dict IPA | IPA | External Link | |
en-US | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
en-US (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-us-broad |
en-US (Narrow) | Wikipron | IPA | External Link | |
en-US | LibriSpeech | IPA | Link | |
English (en-UK)
Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-UK | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
en-UK (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-uk-broad |
en-UK (Narrow) | Wikipron | IPA | External Link | |
English (en-AU)
Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-AU (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-au-broad |
en-AU (Narrow) | Wikipron | IPA | Link | |
en-AU | AusTalk | IPA | Link | |
en-AU | SC-CW | IPA | Link | |
English (en-CA)
Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-CA (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-ca-broad |
en-CA (Narrow) | Wikipron | IPA | Link | |
English (en-NZ)
Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-NZ (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-nz-broad |
en-NZ (Narrow) | Wikipron | IPA | Link | |
English (en-IN)
Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-IN (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-in-broad |
en-IN (Narrow) | Wikipron | IPA | Link | |
Training a G2P Model
We modified 🤗 HuggingFace's sequence-to-sequence translation training script for the purpose of training G2P models. Refer to their installation requirements for more details.
Training a new G2P model generally follows this recipe, substituting your own values for $PRETRAINED_MODEL, $DATASET_NAME, $OUTPUT_DIR, and $HUB_MODEL_ID:
python run_translation.py \
--model_name_or_path $PRETRAINED_MODEL \
--dataset_name $DATASET_NAME \
--output_dir $OUTPUT_DIR \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 32 \
--learning_rate 2e-4 \
--lr_scheduler_type linear \
--warmup_ratio 0.1 \
--num_train_epochs 10 \
--evaluation_strategy epoch \
--save_strategy epoch \
--logging_strategy epoch \
--max_source_length 64 \
--max_target_length 64 \
--val_max_target_length 64 \
--pad_to_max_length True \
--overwrite_output_dir \
--do_train --do_eval \
--bf16 \
--predict_with_generate \
--report_to tensorboard \
--push_to_hub \
--hub_model_id $HUB_MODEL_ID \
--use_auth_token
Example: Fine-tune ByT5 on CMU Dict
python run_translation.py \
--model_name_or_path google/byt5-small \
--dataset_name bookbot/cmudict-0.7b \
--output_dir ./byt5-small-cmudict \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 32 \
--learning_rate 2e-4 \
--lr_scheduler_type linear \
--warmup_ratio 0.1 \
--num_train_epochs 10 \
--evaluation_strategy epoch \
--save_strategy epoch \
--logging_strategy epoch \
--max_source_length 64 \
--max_target_length 64 \
--val_max_target_length 64 \
--pad_to_max_length True \
--overwrite_output_dir \
--do_train --do_eval \
--bf16 \
--predict_with_generate \
--report_to tensorboard \
--push_to_hub \
--hub_model_id bookbot/byt5-small-cmudict \
--use_auth_token
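Both scripts read grapheme-phoneme pairs from a 🤗 dataset. Judging from the evaluation flags below, the expected text columns are source (the written word) and target (its pronunciation). A minimal sketch of turning a tab-separated lexicon into such a dataset (the file name and hub repo ID are placeholders):

from datasets import Dataset

def lexicon_to_dataset(path: str) -> Dataset:
    # Each line: <word>\t<space-separated phonemes>
    sources, targets = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, pron = line.rstrip("\n").split("\t", 1)
            sources.append(word)
            targets.append(pron)
    return Dataset.from_dict({"source": sources, "target": targets})

ds = lexicon_to_dataset("my_lexicon.tsv").train_test_split(test_size=0.1)
ds.push_to_hub("my-username/my-g2p-dataset")  # then pass as $DATASET_NAME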
Evaluating a G2P Model
To evaluate a trained model, substitute your model and dataset for $PRETRAINED_MODEL and $DATASET_NAME:
python eval.py \
--model $PRETRAINED_MODEL \
--dataset_name $DATASET_NAME \
--source_text_column_name source \
--target_text_column_name target \
--max_length 64 \
--batch_size 64
Example: Evaluate ByT5 on CMU Dict
python eval.py \
--model bookbot/byt5-small-cmudict \
--dataset_name bookbot/cmudict-0.7b \
--source_text_column_name source \
--target_text_column_name target \
--max_length 64 \
--batch_size 64
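Since pronunciations are space-separated phoneme strings, an edit-distance metric over those tokens amounts to a phoneme error rate (PER). A minimal sketch of that computation with the jiwer library (an illustration of the metric, not of eval.py's internals):

import jiwer

references = ["w ɔ t ɝ", "ɹ ʌ n ɚ"]
predictions = ["w ɑ t ɝ", "ɹ ʌ n ɚ"]

# Word-level WER over phoneme tokens is exactly PER here:
# 1 substitution out of 8 reference phonemes.
per = jiwer.wer(references, predictions)
print(f"PER: {per:.2%}")  # -> PER: 12.50%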
Corpus Roadmap
Wikipron
Language Family | Code | Region | Corpus | G2P Model |
---|---|---|---|---|
African English | en-ZA | South Africa | ||
Australian English | en-AU | Australia | ✅ | ✅ |
East Asian English | en-CN, en-HK, en-JP, en-KR, en-TW | China, Hong Kong, Japan, South Korea, Taiwan | ||
European English | en-UK, en-HU, en-IE | United Kingdom, Hungary, Ireland | 🚧 | 🚧 |
Mexican English | en-MX | Mexico | ||
New Zealand English | en-NZ | New Zealand | ✅ | ✅ |
North American English | en-CA, en-US | Canada, United States | ✅ | ✅
Middle Eastern English | en-EG, en-IL | Egypt, Israel | ||
Southeast Asian English | en-TH, en-ID, en-MY, en-PH, en-SG | Thailand, Indonesia, Malaysia, Philippines, Singapore | |
South Asian English | en-IN | India | ✅ | ✅ |
References
@inproceedings{lee-etal-2020-massively,
title = "Massively Multilingual Pronunciation Modeling with {W}iki{P}ron",
author = "Lee, Jackson L. and
Ashby, Lucas F.E. and
Garza, M. Elizabeth and
Lee-Sikka, Yeonju and
Miller, Sean and
Wong, Alan and
McCarthy, Arya D. and
Gorman, Kyle",
booktitle = "Proceedings of LREC",
year = "2020",
publisher = "European Language Resources Association",
pages = "4223--4228",
}
@misc{zhu2022byt5,
title={ByT5 model for massively multilingual grapheme-to-phoneme conversion},
author={Jian Zhu and Cong Zhang and David Jurgens},
year={2022},
eprint={2204.03067},
archivePrefix={arXiv},
primaryClass={cs.CL}
}