pronunciation-dictionary
Library to load and save pronunciation dictionaries (any language).
Features
- Load dictionary from file or URL
  - Parsing of
    - line comments
    - pronunciation comments
    - numbers indicating alternative pronunciations for words
    - weights
  - Multiprocessing for faster deserialization
- Save dictionary to file
  - including numbers for alternative pronunciations
  - include weights
  - set word/weight/pronunciation separator
- Select pronunciation via
  - first/last
  - longest/shortest
  - highest/lowest weight
  - random
  - weight
- Get phoneme set
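The selection modes and the phoneme set can be illustrated in plain Python. This is a minimal sketch, not the library's implementation: it assumes a dictionary maps each word to a mapping of pronunciation tuples to weights (the shape suggested by the usage example in this README). All function names and the weights in `example` are hypothetical, and reading the "weight" mode as a weighted random choice is an assumption.

```python
import random
from typing import Dict, Set, Tuple

Pronunciation = Tuple[str, ...]
# Assumed shape: {pronunciation: weight} for a single word.
Pronunciations = Dict[Pronunciation, float]


def select_first(prons: Pronunciations) -> Pronunciation:
    # Dicts preserve insertion order, so this is the first-listed entry.
    return next(iter(prons))


def select_shortest(prons: Pronunciations) -> Pronunciation:
    # min() returns the first minimal element, so ties go to the earlier entry.
    return min(prons, key=len)


def select_highest_weight(prons: Pronunciations) -> Pronunciation:
    # max() likewise keeps the earlier entry when weights are equal.
    return max(prons, key=prons.get)


def select_weighted_random(prons: Pronunciations, rng: random.Random) -> Pronunciation:
    # Draw one pronunciation with probability proportional to its weight.
    keys = list(prons)
    return rng.choices(keys, weights=[prons[k] for k in keys], k=1)[0]


def phoneme_set(dictionary: Dict[str, Pronunciations]) -> Set[str]:
    # Union of all phonemes over all words and all pronunciations.
    return {ph for prons in dictionary.values() for pron in prons for ph in pron}


# Hypothetical weights, for illustration only.
example = {
    "aalborg": {
        ("AO1", "L", "B", "AO0", "R", "G"): 2.0,
        ("AA1", "L", "B", "AO0", "R", "G"): 1.0,
    },
    "aaker": {("AA1", "K", "ER0"): 1.0},
}
```

Passing an explicit `random.Random` instance keeps the random selection reproducible when a seed is needed.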
Example dictionaries and deserialization arguments
- Montreal Forced Aligner dictionaries
  - encoding: "UTF-8"
- CMU
  - encoding: "ISO-8859-1"
  - consider_numbers: True
  - consider_pronunciation_comments: True
- LibriSpeech
  - encoding: "UTF-8"
- Prosodylab
- Old: CMU 0.7b
  - encoding: "ISO-8859-1"
  - consider_comments: True
  - consider_numbers: True
Excerpt from CMU (as example)
a.d. EY2 D IY1
a.m. EY2 EH1 M
a.s EY1 Z
aaa T R IH2 P AH0 L EY1
aaberg AA1 B ER0 G
aachen AA1 K AH0 N
aachener AA1 K AH0 N ER0
aaker AA1 K ER0
aalborg AO1 L B AO0 R G # place, danish
aalborg(2) AA1 L B AO0 R G
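To make the roles of numbers and pronunciation comments in this excerpt concrete, here is a minimal parsing sketch in plain Python. It is not the library's parser: `parse_line` and `parse_lines` are hypothetical helpers, the default weight of 1.0 is an assumption, and treating everything after " #" as a comment is a simplification.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

Pronunciation = Tuple[str, ...]

# A trailing "(2)", "(3)", ... marks an alternative pronunciation of the same word.
_NUMBER_SUFFIX = re.compile(r"\(\d+\)$")


def parse_line(line: str) -> Optional[Tuple[str, Pronunciation]]:
    # Drop a pronunciation comment: everything from " #" to the end of the line.
    content = line.split(" #", 1)[0].strip()
    if not content:
        return None
    word, *phonemes = content.split()
    # "aalborg(2)" and "aalborg" refer to the same word.
    word = _NUMBER_SUFFIX.sub("", word)
    return word, tuple(phonemes)


def parse_lines(lines: Iterable[str]) -> Dict[str, Dict[Pronunciation, float]]:
    result: Dict[str, Dict[Pronunciation, float]] = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        word, pronunciation = parsed
        # Assumed default weight when the file carries no weights.
        result.setdefault(word, {})[pronunciation] = 1.0
    return result


excerpt = [
    "aaker AA1 K ER0",
    "aalborg AO1 L B AO0 R G # place, danish",
    "aalborg(2) AA1 L B AO0 R G",
]
parsed = parse_lines(excerpt)
```

With this sketch, both `aalborg` lines end up under the single key `aalborg`, in file order, and the comment `# place, danish` is discarded.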
Installation
pip install pronunciation-dictionary --user
Usage
from pronunciation_dictionary import load_dict, save_dict, MultiprocessingOptions, DeserializationOptions, SerializationOptions
Example
from pathlib import Path
from pronunciation_dictionary import (DeserializationOptions,
MultiprocessingOptions, SerializationOptions,
get_phoneme_set, load_dict_from_url, save_dict)
dictionary = load_dict_from_url(
"https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict",
"ISO-8859-1",
DeserializationOptions(False, True, True, False),
MultiprocessingOptions(4, None, 10000)
)
phoneme_set = get_phoneme_set(dictionary)
print(phoneme_set)
# {'Z', 'EY1', 'AH0', 'F', 'AE0', 'UW0', 'CH', 'G', 'V', 'AY1', 'AO2', 'ZH', 'AA1', 'IY1', 'AW0', 'T', 'TH', 'AY2', 'DH', 'S', 'W', 'ER1', 'AA2', 'AE2', 'AE1', 'AW1', 'UW1', 'AH1', 'Y', 'EY2', 'AO0', 'OW2', 'OY2', 'IY2', 'JH', 'N', 'NG', 'P', 'IH2', 'M', 'OW0', 'L', 'UH1', 'IY0', 'EY0', 'HH', 'IH0', 'SH', 'AH2', 'AW2', 'EH2', 'OW1', 'D', 'R', 'IH1', 'AO1', 'B', 'UH2', 'UH0', 'ER0', 'UW2', 'ER2', 'EH0', 'AY0', 'AA0', 'EH1', 'OY1', 'OY0', 'K'}
# Look up all pronunciations of a word together with their weights.
pronunciations_dismantle = dictionary.get("dismantle")
for pronunciation, weight in pronunciations_dismantle.items():
  print(pronunciation, weight)
# ('D', 'IH0', 'S', 'M', 'AE1', 'N', 'T', 'AH0', 'L') 1.0
# ('D', 'IH0', 'S', 'M', 'AE1', 'N', 'AH0', 'L') 1.0
save_dict(dictionary, Path("/tmp/cmu.dict"), "UTF-8",
SerializationOptions("DOUBLE-SPACE", False, False))
The first lines of the saved file can then be inspected from the shell:
head /tmp/cmu.dict
# 'bout B AW1 T
# 'cause K AH0 Z
# 'course K AO1 R S
# 'cuse K Y UW1 Z
# 'em AH0 M
# 'frisco F R IH1 S K OW0
# 'gain G EH1 N
# 'kay K EY1
# 'm AH0 M
# 'n AH0 N
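The serialization features (separator, numbers for alternative pronunciations, weights) can be sketched in plain Python as follows. This is an illustration under assumptions, not the library's `save_dict`: the function `serialize`, its parameter names, and the column order word/weight/pronunciation are hypothetical, and the default separator is simply two spaces.

```python
from typing import Dict, Iterator, Tuple

Pronunciation = Tuple[str, ...]


def serialize(
    dictionary: Dict[str, Dict[Pronunciation, float]],
    parts_sep: str = "  ",
    include_numbers: bool = True,
    include_weights: bool = False,
) -> Iterator[str]:
    for word, pronunciations in dictionary.items():
        for index, (pronunciation, weight) in enumerate(pronunciations.items(), start=1):
            # The second and later pronunciations get a "(2)", "(3)", ... suffix.
            label = f"{word}({index})" if include_numbers and index > 1 else word
            parts = [label]
            if include_weights:
                parts.append(str(weight))
            # Phonemes within a pronunciation are joined by single spaces.
            parts.append(" ".join(pronunciation))
            yield parts_sep.join(parts)


example = {
    "aalborg": {
        ("AO1", "L", "B", "AO0", "R", "G"): 1.0,
        ("AA1", "L", "B", "AO0", "R", "G"): 1.0,
    },
}
lines = list(serialize(example))
```

Here `lines` holds one entry for `aalborg` and one for `aalborg(2)`; with `include_numbers=False` both entries would start with the bare word instead.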
Roadmap
- replace SerializationOptions, DeserializationOptions and MultiprocessingOptions with parameters
- add default parameter values
- add more tests
Running the tests
git clone https://github.com/stefantaubert/pronunciation-dictionary.git
cd pronunciation-dictionary
pip install .
pip install tox
tox
Contributing
If you notice an error, please don't hesitate to open an issue.
License
MIT License
Acknowledgments
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410
Citation
If you want to cite this repo, you can use the BibTeX entry generated by GitHub (see About => Cite this repository).