
Lexikos - λεξικός /lek.si.kós/

A collection of pronunciation dictionaries and neural grapheme-to-phoneme models.


Install Lexikos

Install from PyPI

pip install lexikos

Editable install from source

git clone https://github.com/bookbot-hive/lexikos.git
pip install -e lexikos

Usage

Lexicon

>>> from lexikos import Lexicon
>>> lexicon = Lexicon()
>>> print(lexicon["added"])
{'ˈæ d ɪ d', 'ˈæ ɾ ə d', 'æ ɾ ɪ d', 'a d ɪ d', 'ˈa d ɪ d', 'æ ɾ ə d', 'ˈa d ə d', 'a d ə d', 'ˈæ d ə d', 'æ d ə d', 'æ d ɪ d', 'ˈæ ɾ ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'ɹ ʌ n ɝ', 'ˈr ʌ n ɝ'}
>>> print(lexicon["water"])
{'ˈʋ aː ʈ ə r ɯ', 'ˈw oː t ə', 'w ɑ t ə ɹ', 'ˈw aː ʈ ə r ɯ', 'ˈw ɔ t ɝ', 'w ɔ t ə ɹ', 'ˈw ɑ t ə ɹ', 'w ɔ t ɝ', 'w ɑ ɾ ɚ', 'ˈw ɑ ɾ ɚ', 'ˈʋ ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɔː t ə', 'ˈw oː ɾ ə', 'ˈw ɔ ʈ ə r'}

To get a lexicon where phonemes are normalized (diacritics removed, digraphs split):

>>> from lexikos import Lexicon
>>> lexicon = Lexicon(normalize_phonemes=True)
>>> print(lexicon["added"])
{'æ ɾ ɪ d', 'a d ɪ d', 'a d ə d', 'æ ɾ ə d', 'æ d ə d', 'æ d ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'r ʌ n ɝ', 'ɹ ʌ n ɝ'}
>>> print(lexicon["water"])
{'w o ɾ ə', 'w ɔ t ə', 'ʋ ɔ ʈ ə r', 'w a ʈ ə r ɯ', 'w ɔ t ə ɹ', 'ʋ a ʈ ə r ɯ', 'w ɑ ɾ ɚ', 'w o t ə', 'w ɔ t ɝ', 'w ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɑ t ə ɹ'}
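
For intuition, here is a rough sketch of what this kind of normalization does. This is not Lexikos' actual implementation, and it only covers diacritics, stress, and length marks (not digraph splitting):

import unicodedata

# Stress and length marks are ordinary modifier letters, not combining
# characters, so they are listed explicitly.
SUPRASEGMENTALS = {"ˈ", "ˌ", "ː", "ˑ"}

def normalize(pron: str) -> str:
    decomposed = unicodedata.normalize("NFD", pron)  # split base + diacritic
    kept = [ch for ch in decomposed
            if not unicodedata.combining(ch) and ch not in SUPRASEGMENTALS]
    return unicodedata.normalize("NFC", "".join(kept))

print(normalize("ˈw oː t ə"))  # w o t ə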

To include synthetic (non-dictionary-based) pronunciations:

>>> from lexikos import Lexicon
>>> lexicon = Lexicon(include_synthetic=True)
>>> print(lexicon["athletic"])
{'æ t l ɛ t ɪ k', 'æ θ ˈl ɛ t ɪ k', 'æ θ l ɛ t ɪ k'}

Phonemization

>>> from lexikos import G2p
>>> g2p = G2p(lang="en-us")
>>> g2p("Hello there! $100 is not a lot of money in 2023.")
['h ɛ l o ʊ', 'ð ɛ ə ɹ', 'w ʌ n', 'h ʌ n d ɹ ɪ d', 'd ɑ l ɚ z', 'ɪ z', 'n ɒ t', 'ə', 'l ɑ t', 'ʌ v', 'm ʌ n i', 'ɪ n', 't w ɛ n t i', 't w ɛ n t i', 'θ ɹ iː']
>>> g2p = G2p(lang="en-au")
>>> g2p("Hi there mate! Have a g'day!")
['h a ɪ', 'θ ɛ ə ɹ', 'm e ɪ t', 'h e ɪ v', 'ə', 'ɡ ə ˈd æ ɪ']
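
A common pattern is to try the dictionary first and fall back to the neural model for out-of-vocabulary words. A minimal sketch combining the two APIs above, assuming Lexicon raises KeyError for unknown words (the helper function is ours, not part of the library):

from lexikos import G2p, Lexicon

lexicon = Lexicon()
g2p = G2p(lang="en-us")

def pronounce(word: str) -> set:
    """Dictionary lookup with neural G2P fallback."""
    try:
        return lexicon[word.lower()]  # known word: all attested pronunciations
    except KeyError:
        return set(g2p(word))         # unknown word: predicted pronunciation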

Dictionaries & Models

English (en)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| -------- | ---------- | --------- | ------ | --------- |
| en | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn |

English (en-US)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| -------- | ---------- | --------- | ------ | --------- |
| en-US | CMU Dict | ARPA | External Link | bookbot/byt5-small-cmudict |
| en-US | CMU Dict IPA | IPA | External Link | |
| en-US | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
| en-US (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-us-broad |
| en-US (Narrow) | Wikipron | IPA | External Link | |
| en-US | LibriSpeech | IPA | Link | |

English (en-UK)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| -------- | ---------- | --------- | ------ | --------- |
| en-UK | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
| en-UK (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-uk-broad |
| en-UK (Narrow) | Wikipron | IPA | External Link | |

English (en-AU)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| -------- | ---------- | --------- | ------ | --------- |
| en-AU (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-au-broad |
| en-AU (Narrow) | Wikipron | IPA | Link | |
| en-AU | AusTalk | IPA | Link | |
| en-AU | SC-CW | IPA | Link | |

English (en-CA)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| -------- | ---------- | --------- | ------ | --------- |
| en-CA (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-ca-broad |
| en-CA (Narrow) | Wikipron | IPA | Link | |

English (en-NZ)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| -------- | ---------- | --------- | ------ | --------- |
| en-NZ (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-nz-broad |
| en-NZ (Narrow) | Wikipron | IPA | Link | |

English (en-IN)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| -------- | ---------- | --------- | ------ | --------- |
| en-IN (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-in-broad |
| en-IN (Narrow) | Wikipron | IPA | Link | |

Training a G2P Model

We modified 🤗 HuggingFace's sequence-to-sequence training script to train G2P models. Refer to their installation requirements for more details.

Training a new G2P model generally follows this recipe:

python run_translation.py \
+   --model_name_or_path $PRETRAINED_MODEL \
+   --dataset_name $DATASET_NAME \
    --output_dir $OUTPUT_DIR \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 32 \
    --learning_rate 2e-4 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.1 \
    --num_train_epochs 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --logging_strategy epoch \
    --max_source_length 64 \
    --max_target_length 64 \
    --val_max_target_length 64 \
    --pad_to_max_length True \
    --overwrite_output_dir \
    --do_train --do_eval \
    --bf16 \
    --predict_with_generate \
    --report_to tensorboard \
    --push_to_hub \
+   --hub_model_id $HUB_MODEL_ID \
    --use_auth_token
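
The script treats G2P as a translation task, so $DATASET_NAME should point to a dataset of grapheme/phoneme pairs. A minimal sketch of building one with 🤗 Datasets, assuming a tab-separated lexicon file and the source/target column names used by eval.py below (the modified script's exact column expectations may differ):

from datasets import Dataset

# Hypothetical input file: one "word<TAB>phonemes" entry per line.
pairs = []
with open("lexicon.tsv", encoding="utf-8") as f:
    for line in f:
        word, phonemes = line.rstrip("\n").split("\t")
        pairs.append({"source": word, "target": phonemes})

dataset = Dataset.from_list(pairs).train_test_split(test_size=0.05, seed=42)
dataset.push_to_hub("your-username/your-g2p-dataset")  # placeholder repo id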

Example: Fine-tune ByT5 on CMU Dict

python run_translation.py \
    --model_name_or_path google/byt5-small \
    --dataset_name bookbot/cmudict-0.7b \
    --output_dir ./byt5-small-cmudict \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 32 \
    --learning_rate 2e-4 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.1 \
    --num_train_epochs 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --logging_strategy epoch \
    --max_source_length 64 \
    --max_target_length 64 \
    --val_max_target_length 64 \
    --pad_to_max_length True \
    --overwrite_output_dir \
    --do_train --do_eval \
    --bf16 \
    --predict_with_generate \
    --report_to tensorboard \
    --push_to_hub \
    --hub_model_id bookbot/byt5-small-cmudict \
    --use_auth_token
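
Once trained (or using the published checkpoint), the model can be queried directly with 🤗 Transformers. A minimal inference sketch:

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "bookbot/byt5-small-cmudict"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# ByT5 is byte-level, so the raw word is the input sequence.
inputs = tokenizer("added", return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))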

Evaluating a G2P Model

To evaluate a trained model:

python eval.py \
+   --model $PRETRAINED_MODEL \
+   --dataset_name $DATASET_NAME \
    --source_text_column_name source \
    --target_text_column_name target \
    --max_length 64 \
    --batch_size 64

Example: Evaluate ByT5 on CMU Dict

python eval.py \
    --model bookbot/byt5-small-cmudict \
    --dataset_name bookbot/cmudict-0.7b \
    --source_text_column_name source \
    --target_text_column_name target \
    --max_length 64 \
    --batch_size 64
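
G2P systems are conventionally scored by phoneme error rate (PER): the Levenshtein distance between predicted and reference phoneme sequences, divided by the reference length. A self-contained sketch of the metric (illustrative; eval.py's exact metrics may differ):

def phoneme_error_rate(ref: str, hyp: str) -> float:
    """Edit distance over space-separated phonemes, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming edit distance.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(phoneme_error_rate("ˈæ d ɪ d", "æ d ɪ d"))  # 0.25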

Corpus Roadmap

Wikipron

| Language Family | Code | Region | Corpus | G2P Model |
| --------------- | ---- | ------ | ------ | --------- |
| African English | en-ZA | South Africa | | |
| Australian English | en-AU | Australia | | |
| East Asian English | en-CN, en-HK, en-JP, en-KR, en-TW | China, Hong Kong, Japan, South Korea, Taiwan | | |
| European English | en-UK, en-HU, en-IE | United Kingdom, Hungary, Ireland | 🚧 | 🚧 |
| Mexican English | en-MX | Mexico | | |
| New Zealand English | en-NZ | New Zealand | | |
| North American English | en-CA, en-US | Canada, United States | | |
| Middle Eastern English | en-EG, en-IL | Egypt, Israel | | |
| Southeast Asian English | en-TH, en-ID, en-MY, en-PH, en-SG | Thailand, Indonesia, Malaysia, Philippines, Singapore | | |
| South Asian English | en-IN | India | | |

References

@inproceedings{lee-etal-2020-massively,
    title = "Massively Multilingual Pronunciation Modeling with {W}iki{P}ron",
    author = "Lee, Jackson L.  and
      Ashby, Lucas F.E.  and
      Garza, M. Elizabeth  and
      Lee-Sikka, Yeonju  and
      Miller, Sean  and
      Wong, Alan  and
      McCarthy, Arya D.  and
      Gorman, Kyle",
    booktitle = "Proceedings of LREC",
    year = "2020",
    publisher = "European Language Resources Association",
    pages = "4223--4228",
}
@misc{zhu2022byt5,
    title={ByT5 model for massively multilingual grapheme-to-phoneme conversion}, 
    author={Jian Zhu and Cong Zhang and David Jurgens},
    year={2022},
    eprint={2204.03067},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
