Tools for loading dictionaries with various phonecodes (IPA, Callhome, X-SAMPA, ARPABET, DISC=CELEX), for converting among those phonecodes, and for searching those dictionaries for word sequences matching a target.
phonecodes
This library provides tools for converting between the International Phonetic Alphabet (IPA) and other phonetic alphabets used to transcribe speech, including Callhome, X-SAMPA, ARPABET, and DISC/CELEX. It also provides tools for searching mappings between phonetic symbols and for reading and writing pronunciation lexicon files in several standard formats.
These tools are useful for processing data for automatic speech recognition, text-to-speech, and linguistic analyses of speech.
Setup and Installation
Install the library by running pip install phonecodes with Python 3.10 or greater.
Developers may refer to CONTRIBUTIONS.md for information on setting up the development environment for testing, linting, and contributing to the code.
Basic Usage
Converting between Phonetic Alphabets
To convert between IPA and another phonetic code, use phonecodes.phonecodes as follows:
>>> from phonecodes import phonecodes
>>> print(phonecodes.CODES) # available phonetic alphabets
{'arpabet', 'ipa', 'xsampa', 'callhome', 'disc'}
>>> phonecodes.convert("DH IH S IH Z AH0 T EH1 S T", "arpabet", "ipa", "eng") # convert from ARPABET to IPA with language
'ð ɪ s ɪ z ə t ˈɛ s t'
>>> phonecodes.convert("ð ɪ s ɪ z ə t ˈɛ s t", "ipa", "arpabet") # convert from IPA to ARPABET, language optional
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.ipa2arpabet("ð ɪ s ɪ z ə t ˈɛ s t", "eng") # equivalent to previous, language required
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.convert("DH IH S IH Z AH0 T EH1 S T", "arpabet", "ipa") # convert from ARPABET to IPA, language optional
'ð ɪ s ɪ z ə t ˈɛ s t'
>>> phonecodes.arpabet2ipa("DH IH S IH Z AH0 T EH1 S T", "eng") # equivalent to previous with language required
'ð ɪ s ɪ z ə t ˈɛ s t'
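To make the token-by-token nature of these conversions concrete, here is a minimal, self-contained sketch that does not use the library. The names `TOY_IPA_TO_ARPABET`, `TOY_ARPABET_TO_IPA`, and `toy_convert` are hypothetical, and the table covers only the symbols in the example sentence above; the real tables in phonecodes are much larger and may treat stress marks and multi-character symbols differently.

```python
# Hypothetical toy table: only the symbols from the example sentence.
TOY_IPA_TO_ARPABET = {
    "ð": "DH", "ɪ": "IH", "s": "S", "z": "Z",
    "ə": "AH0", "t": "T", "ˈɛ": "EH1",
}
# Reverse direction; valid here because the toy table is one-to-one.
TOY_ARPABET_TO_IPA = {v: k for k, v in TOY_IPA_TO_ARPABET.items()}


def toy_convert(text, table):
    """Translate a space-separated phone string one token at a time."""
    try:
        return " ".join(table[token] for token in text.split())
    except KeyError as err:
        # Report the unmapped symbol instead of a bare KeyError.
        raise ValueError(f"no mapping for symbol {err.args[0]!r}") from err


ipa = "ð ɪ s ɪ z ə t ˈɛ s t"
arpabet = toy_convert(ipa, TOY_IPA_TO_ARPABET)
print(arpabet)
print(toy_convert(arpabet, TOY_ARPABET_TO_IPA) == ipa)
```

The sketch relies on the input being space-separated, as in the examples above; an unsegmented string would need a tokenizer first.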
Note that for 'callhome' and 'disc' you should also specify a language code from the following lists:
- DISC/CELEX: Dutch 'nld', English 'eng', German 'deu'. Uses German if unspecified.
- Callhome: Spanish 'spa', Egyptian Arabic 'arz', Mandarin Chinese 'cmn'. You MUST specify an appropriate language code or you'll get a KeyError.
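One way to avoid a bare KeyError is to check the language code before calling the converter. The sketch below is self-contained and encodes only the lists above; `SUPPORTED_LANGUAGES`, `DEFAULT_LANGUAGE`, and `resolve_language` are hypothetical names for illustration, not part of the phonecodes API.

```python
# Language codes copied from the lists above.
SUPPORTED_LANGUAGES = {
    "disc": {"nld", "eng", "deu"},
    "callhome": {"spa", "arz", "cmn"},
}
# DISC/CELEX falls back to German when no language is given (per the note above).
DEFAULT_LANGUAGE = {"disc": "deu"}


def resolve_language(code, language=None):
    """Return a language to pass along, or raise a descriptive error."""
    allowed = SUPPORTED_LANGUAGES.get(code)
    if allowed is None:
        # Codes such as 'ipa', 'arpabet', 'xsampa': nothing to check here.
        return language
    if language is None:
        if code in DEFAULT_LANGUAGE:
            return DEFAULT_LANGUAGE[code]
        raise ValueError(f"{code!r} requires one of {sorted(allowed)}")
    if language not in allowed:
        raise ValueError(
            f"{language!r} is not valid for {code!r}; use one of {sorted(allowed)}"
        )
    return language


print(resolve_language("disc"))
print(resolve_language("callhome", "spa"))
```

The resolved value can then be passed as the language argument to the conversion call, so that a missing or mistyped code fails early with a readable message.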
Reading Corpus Files
If you are working with specific corpora, you can also convert between certain corpus formats as follows:
>>> from phonecodes import pronlex
>>> my_lex = pronlex.read("test/fixtures/isle_eng_sample.txt", "isle", "eng") # Read in an English ISLE corpus file
>>> my_lex.w2p # see orthographic to phonetic word mapping
{'a': ['#', 'ə', '#'], 'is': ['#', 'ɪ', 'z', '#'], 'test': ['#', 't', 'ˈɛ', 's', 't', '#'], 'this': ['#', 'ð', 'ɪ', 's', '#']}
>>> new_lex = my_lex.recode('arpabet') # Convert mapping to ARPABET
>>> new_lex.w2p
{'a': ['#', 'AH0', '#'], 'is': ['#', 'IH', 'Z', '#'], 'test': ['#', 'T', 'EH1', 'S', 'T', '#'], 'this': ['#', 'DH', 'IH', 'S', '#']}
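The word-to-phones mapping shown above can be used directly as a plain dictionary. The following self-contained sketch copies the sample IPA output into a literal and transcribes a short sentence from it. It assumes that '#' marks a word boundary, which the output suggests but this page does not state; `phones` and `transcribe` are hypothetical helper names, not library functions.

```python
# Sample mapping copied from the w2p output above.
w2p = {
    "a": ["#", "ə", "#"],
    "is": ["#", "ɪ", "z", "#"],
    "test": ["#", "t", "ˈɛ", "s", "t", "#"],
    "this": ["#", "ð", "ɪ", "s", "#"],
}


def phones(word, mapping, boundary="#"):
    """Return the phones for a word with boundary markers removed."""
    return [p for p in mapping[word] if p != boundary]


def transcribe(sentence, mapping):
    """Look up each word and join all phones into one space-separated string."""
    words = sentence.lower().split()
    return " ".join(" ".join(phones(w, mapping)) for w in words)


print(transcribe("This is a test", w2p))
```

A word that is missing from the mapping raises a KeyError in this sketch; a real pipeline would likely need an out-of-vocabulary strategy.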
The supported corpus formats and their corresponding phonetic alphabets are as follows:
Corpus Format | Phonetic Alphabet | Language Options |
---|---|---|
'babel' | 'xsampa' | 'amh', 'asm', 'ben', 'yue', 'ceb', 'luo', 'kat', 'gug', 'hat', 'ibo', 'jav', 'kur', 'lao', 'lit', 'mon', 'pus', 'swa', 'tgl', 'tam', 'tpi', 'tur', 'vie', 'zul' |
'callhome' | 'callhome' | 'arz', 'cmn', 'spa' |
'celex' | 'disc' | 'eng', 'nld', 'deu' |
'isle' | 'ipa' | Not required |
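The first two columns of the table can be captured as a small lookup, for example to decide whether a lexicon needs recoding before it is combined with another. This is a sketch under the assumption that a lexicon read in a given format holds symbols in the alphabet the table lists; `FORMAT_ALPHABET` and `needs_recode` are hypothetical names.

```python
# Corpus format -> phonetic alphabet, copied from the table above.
FORMAT_ALPHABET = {
    "babel": "xsampa",
    "callhome": "callhome",
    "celex": "disc",
    "isle": "ipa",
}


def needs_recode(corpus_format, target_code):
    """True when a lexicon in this format is not already in target_code."""
    try:
        return FORMAT_ALPHABET[corpus_format] != target_code
    except KeyError:
        raise ValueError(f"unknown corpus format {corpus_format!r}") from None


print(needs_recode("isle", "arpabet"))
print(needs_recode("celex", "disc"))
```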
Hashes for phonecodes-1.0.0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 08772df2da08e3032937403e18b4c9be4a8ba2872868b0cbb9eb94512a7bdfaf |
MD5 | 43598b0a0a8e90f17d9165ae2b7adc8d |
BLAKE2b-256 | adce8caaccc357ea5196ab0cd3fa00be14fe65cf1aef1db2aaa01c4323ef43a4 |