
Lexikos - λεξικός /lek.si.kós/

A collection of pronunciation dictionaries and neural grapheme-to-phoneme models.


Install Lexikos

Install from PyPI

pip install lexikos

Editable install from Source

git clone https://github.com/bookbot-hive/lexikos.git
pip install -e lexikos

Usage

Lexicon

>>> from lexikos import Lexicon
>>> lexicon = Lexicon()
>>> print(lexicon["added"])
{'ˈæ d ɪ d', 'ˈæ ɾ ə d', 'æ ɾ ɪ d', 'a d ɪ d', 'ˈa d ɪ d', 'æ ɾ ə d', 'ˈa d ə d', 'a d ə d', 'ˈæ d ə d', 'æ d ə d', 'æ d ɪ d', 'ˈæ ɾ ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'ɹ ʌ n ɝ', 'ˈr ʌ n ɝ'}
>>> print(lexicon["water"])
{'ˈʋ aː ʈ ə r ɯ', 'ˈw oː t ə', 'w ɑ t ə ɹ', 'ˈw aː ʈ ə r ɯ', 'ˈw ɔ t ɝ', 'w ɔ t ə ɹ', 'ˈw ɑ t ə ɹ', 'w ɔ t ɝ', 'w ɑ ɾ ɚ', 'ˈw ɑ ɾ ɚ', 'ˈʋ ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɔː t ə', 'ˈw oː ɾ ə', 'ˈw ɔ ʈ ə r'}

To get a lexicon where phonemes are normalized (diacritics removed, digraphs split):

>>> from lexikos import Lexicon
>>> lexicon = Lexicon(normalize_phonemes=True)
>>> print(lexicon["added"])
{'æ ɾ ɪ d', 'a d ɪ d', 'a d ə d', 'æ ɾ ə d', 'æ d ə d', 'æ d ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'r ʌ n ɝ', 'ɹ ʌ n ɝ'}
>>> print(lexicon["water"])
{'w o ɾ ə', 'w ɔ t ə', 'ʋ ɔ ʈ ə r', 'w a ʈ ə r ɯ', 'w ɔ t ə ɹ', 'ʋ a ʈ ə r ɯ', 'w ɑ ɾ ɚ', 'w o t ə', 'w ɔ t ɝ', 'w ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɑ t ə ɹ'}
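
The normalization is roughly of this kind: strip stress, length, and combining diacritics from each phone. Below is a minimal illustrative sketch of that idea using only the standard library; it is not Lexikos's actual implementation, and it omits the digraph-splitting step:

import unicodedata

def normalize_phoneme(phone: str) -> str:
    # Decompose, then drop combining diacritics plus stress/length marks.
    # NOTE: illustrative only; Lexikos's real normalization may differ.
    decomposed = unicodedata.normalize("NFD", phone)
    return "".join(
        c for c in decomposed
        if not unicodedata.combining(c) and c not in "ˈˌː"
    )

print(normalize_phoneme("ˈæ"))  # -> æ (stress mark removed)
print(normalize_phoneme("oː"))  # -> o (length mark removed)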

To include synthetic (non-dictionary-based) pronunciations:

>>> from lexikos import Lexicon
>>> lexicon = Lexicon(include_synthetic=True)
>>> print(lexicon["athletic"])
{'æ t l ɛ t ɪ k', 'æ θ ˈl ɛ t ɪ k', 'æ θ l ɛ t ɪ k'}

Phonemization

>>> from lexikos import G2p
>>> g2p = G2p(lang="en-us")
>>> g2p("Hello there! $100 is not a lot of money in 2023.")
['h ɛ l o ʊ', 'ð ɛ ə ɹ', 'w ʌ n', 'h ʌ n d ɹ ɪ d', 'd ɑ l ɚ z', 'ɪ z', 'n ɒ t', 'ə', 'l ɑ t', 'ʌ v', 'm ʌ n i', 'ɪ n', 't w ɛ n t i', 't w ɛ n t i', 'θ ɹ iː']
>>> g2p = G2p(lang="en-au")
>>> g2p("Hi there mate! Have a g'day!")
['h a ɪ', 'θ ɛ ə ɹ', 'm e ɪ t', 'h e ɪ v', 'ə', 'ɡ ə ˈd æ ɪ']
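
The two APIs compose naturally: look a word up in the dictionary first and fall back to the neural model for out-of-vocabulary words. A hedged sketch of that pattern (it assumes a missing word raises KeyError; check Lexicon's actual miss behavior):

from lexikos import G2p, Lexicon

lexicon = Lexicon()
g2p = G2p(lang="en-us")

def pronounce(word: str) -> set:
    # Prefer curated dictionary pronunciations; fall back to neural G2P.
    try:
        return lexicon[word.lower()]
    except KeyError:
        # g2p() returns one pronunciation per token; wrap it in a set
        # to match the lexicon's return shape.
        return set(g2p(word))

print(pronounce("runner"))   # dictionary hit
print(pronounce("zorblax"))  # nonsense word -> neural fallback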

Dictionaries & Models

English (en)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
|----------|------------|-----------|--------|-----------|
| en | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn |

English (en-US)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
|----------|------------|-----------|--------|-----------|
| en-US | CMU Dict | ARPA | External Link | bookbot/byt5-small-cmudict |
| en-US | CMU Dict IPA | IPA | External Link | |
| en-US | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
| en-US (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-us-broad |
| en-US (Narrow) | Wikipron | IPA | External Link | |
| en-US | LibriSpeech | IPA | Link | |

English (en-UK)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
|----------|------------|-----------|--------|-----------|
| en-UK | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
| en-UK (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-uk-broad |
| en-UK (Narrow) | Wikipron | IPA | External Link | |

English (en-AU)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
|----------|------------|-----------|--------|-----------|
| en-AU (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-au-broad |
| en-AU (Narrow) | Wikipron | IPA | Link | |
| en-AU | AusTalk | IPA | Link | |
| en-AU | SC-CW | IPA | Link | |

English (en-CA)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
|----------|------------|-----------|--------|-----------|
| en-CA (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-ca-broad |
| en-CA (Narrow) | Wikipron | IPA | Link | |

English (en-NZ)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
|----------|------------|-----------|--------|-----------|
| en-NZ (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-nz-broad |
| en-NZ (Narrow) | Wikipron | IPA | Link | |

English (en-IN)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
|----------|------------|-----------|--------|-----------|
| en-IN (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-in-broad |
| en-IN (Narrow) | Wikipron | IPA | Link | |

Training G2P Model

We modified 🤗 HuggingFace's sequence-to-sequence training script for the purpose of training G2P models. Refer to their installation requirements for more details.

Training a new G2P model generally follows this recipe, where the +-marked lines are the arguments to set for your own run:

python run_translation.py \
+   --model_name_or_path $PRETRAINED_MODEL \
+   --dataset_name $DATASET_NAME \
    --output_dir $OUTPUT_DIR \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 32 \
    --learning_rate 2e-4 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.1 \
    --num_train_epochs 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --logging_strategy epoch \
    --max_source_length 64 \
    --max_target_length 64 \
    --val_max_target_length 64 \
    --pad_to_max_length True \
    --overwrite_output_dir \
    --do_train --do_eval \
    --bf16 \
    --predict_with_generate \
    --report_to tensorboard \
    --push_to_hub \
+   --hub_model_id $HUB_MODEL_ID \
    --use_auth_token
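
The dataset passed via --dataset_name should contain word/pronunciation pairs. A quick way to inspect one (a sketch; the source/target column names are an assumption carried over from the eval flags below, so verify against the dataset card):

from datasets import load_dataset

ds = load_dataset("bookbot/cmudict-0.7b", split="train")
print(ds[0])  # expected shape: {"source": <word>, "target": <phonemes>}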

Example: Fine-tune ByT5 on CMU Dict

python run_translation.py \
    --model_name_or_path google/byt5-small \
    --dataset_name bookbot/cmudict-0.7b \
    --output_dir ./byt5-small-cmudict \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 32 \
    --learning_rate 2e-4 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.1 \
    --num_train_epochs 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --logging_strategy epoch \
    --max_source_length 64 \
    --max_target_length 64 \
    --val_max_target_length 64 \
    --pad_to_max_length True \
    --overwrite_output_dir \
    --do_train --do_eval \
    --bf16 \
    --predict_with_generate \
    --report_to tensorboard \
    --push_to_hub \
    --hub_model_id bookbot/byt5-small-cmudict \
    --use_auth_token
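
Once pushed to the Hub, the fine-tuned model can be queried like any other seq2seq checkpoint. A minimal inference sketch (assuming the model maps a plain word to a space-separated phoneme string, as in the examples above):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "bookbot/byt5-small-cmudict"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# ByT5 operates on raw bytes, so the word itself is the model input.
inputs = tokenizer("added", return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))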

Evaluating G2P Model

To evaluate a trained model, set the +-marked arguments and run:

python eval.py \
+   --model $PRETRAINED_MODEL \
+   --dataset_name $DATASET_NAME \
    --source_text_column_name source \
    --target_text_column_name target \
    --max_length 64 \
    --batch_size 64

Example: Evaluate ByT5 on CMU Dict

python eval.py \
    --model bookbot/byt5-small-cmudict \
    --dataset_name bookbot/cmudict-0.7b \
    --source_text_column_name source \
    --target_text_column_name target \
    --max_length 64 \
    --batch_size 64
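
For reference, G2P models are usually scored by phoneme error rate (PER): the Levenshtein distance between predicted and reference phoneme sequences, normalized by reference length. A self-contained sketch of that metric (illustrative; eval.py may compute its metrics differently):

def edit_distance(ref: list, hyp: list) -> int:
    # Single-row dynamic-programming Levenshtein over phoneme tokens.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (free if tokens match)
            )
    return dp[-1]

def per(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

print(per("ˈæ d ɪ d", "æ d ɪ d"))  # 0.25: one error over four phones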

Corpus Roadmap

Wikipron

| Language Family | Code | Region | Corpus | G2P Model |
|-----------------|------|--------|--------|-----------|
| African English | en-ZA | South Africa | | |
| Australian English | en-AU | Australia | | |
| East Asian English | en-CN, en-HK, en-JP, en-KR, en-TW | China, Hong Kong, Japan, South Korea, Taiwan | | |
| European English | en-UK, en-HU, en-IE | United Kingdom, Hungary, Ireland | 🚧 | 🚧 |
| Mexican English | en-MX | Mexico | | |
| New Zealand English | en-NZ | New Zealand | | |
| North American English | en-CA, en-US | Canada, United States | | |
| Middle Eastern English | en-EG, en-IL | Egypt, Israel | | |
| Southeast Asian English | en-TH, en-ID, en-MY, en-PH, en-SG | Thailand, Indonesia, Malaysia, Philippines, Singapore | | |
| South Asian English | en-IN | India | | |


References

@inproceedings{lee-etal-2020-massively,
    title = "Massively Multilingual Pronunciation Modeling with {W}iki{P}ron",
    author = "Lee, Jackson L.  and
      Ashby, Lucas F.E.  and
      Garza, M. Elizabeth  and
      Lee-Sikka, Yeonju  and
      Miller, Sean  and
      Wong, Alan  and
      McCarthy, Arya D.  and
      Gorman, Kyle",
    booktitle = "Proceedings of LREC",
    year = "2020",
    publisher = "European Language Resources Association",
    pages = "4223--4228",
}
@misc{zhu2022byt5,
    title = {ByT5 model for massively multilingual grapheme-to-phoneme conversion},
    author = {Jian Zhu and Cong Zhang and David Jurgens},
    year = {2022},
    eprint = {2204.03067},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
