Tools for loading dictionaries with various phonecodes (IPA, Callhome, X-SAMPA, ARPABET, DISC=CELEX, Buckeye), for converting among those phonecodes, and for searching those dictionaries for word sequences matching a target.

phonecodes

This library provides tools for converting between the International Phonetic Alphabet (IPA) and other phonetic alphabets used to transcribe speech, including Callhome, X-SAMPA, ARPABET, DISC/CELEX, the Buckeye Corpus Phonetic Alphabet, and TIMIT. It also provides tools for searching mappings between phonetic symbols and for reading and writing pronunciation lexicon files in several standard formats.

These functionalities are useful for processing data for automatic speech recognition, text to speech, and linguistic analyses of speech.

Setup and Installation

Install the library by running pip install phonecodes with Python 3.10 or greater. It may work with earlier versions of Python, but these have not been tested.

Developers may refer to CONTRIBUTIONS.md for information on the development environment for testing, linting, and contributing to the code.

Basic Usage

Converting between Phonetic Alphabets

To convert between IPA and another phonetic code, use phonecodes.phonecodes as follows:

>>> from phonecodes import phonecodes
>>> print(phonecodes.CODES) # available phonetic alphabets
{'arpabet', 'buckeye', 'ipa', 'timit', 'callhome', 'xsampa', 'disc'}
>>> phonecodes.convert("DH IH S IH Z AH0 T EH1 S T", "arpabet", "ipa", "eng") # convert from ARPABET to IPA with language explicitly specified
'ð ɪ s ɪ z ə t ˈɛ s t'
>>> phonecodes.convert("ð ɪ s ɪ z ə t ˈɛ s t", "ipa", "arpabet") # convert from IPA to ARPABET with optional language left out
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.ipa2arpabet("ð ɪ s ɪ z ə t ˈɛ s t", "eng") # equivalent to previous with explicit language
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.ipa2arpabet("ð ɪ s ɪ z ə t ˈɛ s t") # equivalent to previous with optional language left out
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.convert("DH IH S IH Z AH0 T EH1 S T", "arpabet", "ipa") # convert from ARPABET to IPA, optional language left out
'ð ɪ s ɪ z ə t ˈɛ s t'
>>> phonecodes.arpabet2ipa("DH IH S IH Z AH0 T EH1 S T", "eng") # equivalent to previous with language explicitly specified
'ð ɪ s ɪ z ə t ˈɛ s t'
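
The examples above suggest that converting ARPABET to IPA and back reproduces the input. A minimal sketch of a round-trip check along those lines follows. The helper round_trips is hypothetical and not part of the library; it takes the conversion function as an argument, on the assumption that it is called the same way phonecodes.convert is called in the examples above.

```python
def round_trips(text, code, convert, language=None):
    """Return True if converting `text` from `code` to IPA and back reproduces it.

    `convert` is assumed to accept (text, source, target[, language]),
    mirroring the phonecodes.convert calls shown in the examples above.
    """
    # Only pass a language when the caller supplied one.
    extra = () if language is None else (language,)
    ipa = convert(text, code, "ipa", *extra)
    back = convert(ipa, "ipa", code, *extra)
    return back == text
```

With the library installed, this could be called as round_trips("DH IH S IH Z AH0 T EH1 S T", "arpabet", phonecodes.convert).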

For 'arpabet', 'buckeye', 'timit' and 'xsampa', specifying a language is optional and the code ignores it, since X-SAMPA is language-agnostic and ARPABET, Buckeye, and TIMIT were designed to work only for English.

For 'callhome' and 'disc' you should also specify a language code from the following lists:

  • DISC/CELEX: Dutch 'nld', English 'eng', German 'deu'. Uses German if unspecified.
  • Callhome: Spanish 'spa', Egyptian Arabic 'arz', Mandarin Chinese 'cmn'. You MUST specify an appropriate language code or you'll get a KeyError.
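
The language rules above can be checked before calling the converter, so that a missing Callhome language surfaces as a clear message rather than a KeyError. The following is a minimal sketch under that reading of the lists; resolve_language and the LANGUAGES table are hypothetical names, not part of the library.

```python
# Language codes copied from the lists above (assumed to be exhaustive).
LANGUAGES = {
    "disc": {"nld", "eng", "deu"},
    "callhome": {"spa", "arz", "cmn"},
}

def resolve_language(code, language=None):
    """Return the language to pass to the converter, or raise ValueError."""
    allowed = LANGUAGES.get(code)
    if allowed is None:
        # arpabet, buckeye, timit, xsampa, ipa: the language is ignored.
        return language
    if language is None:
        if code == "disc":
            return "deu"  # documented fallback for DISC/CELEX
        raise ValueError(f"{code!r} requires one of {sorted(allowed)}")
    if language not in allowed:
        raise ValueError(f"{language!r} is not listed for {code!r}")
    return language
```

Returning 'deu' for an unspecified DISC language only makes the documented fallback explicit; callers who want German by accident to be an error could raise there instead.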

Reading Corpus Files

If you are working with specific corpora, you can also convert between certain corpus formats as follows:

>>> from phonecodes import pronlex
>>> my_lex = pronlex.read("test/fixtures/isle_eng_sample.txt", "isle", "eng") # Read in an English ISLE corpus file
>>> my_lex.w2p # see orthographic to phonetic word mapping
{'a': ['#', 'ə', '#'], 'is': ['#', 'ɪ', 'z', '#'], 'test': ['#', 't', 'ˈɛ', 's', 't', '#'], 'this': ['#', 'ð', 'ɪ', 's', '#']}
>>> new_lex = my_lex.recode('arpabet') # Convert mapping to ARPABET
>>> new_lex.w2p
{'a': ['#', 'AH0', '#'], 'is': ['#', 'IH', 'Z', '#'], 'test': ['#', 'T', 'EH1', 'S', 'T', '#'], 'this': ['#', 'DH', 'IH', 'S', '#']}
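
The w2p mappings above store each pronunciation as a list of symbols with '#' at both ends. Assuming those '#' entries are word-boundary markers (the text does not say so explicitly), a small sketch for flattening such a mapping into plain strings might look like this. It works on an ordinary dict copied from the output above, and pronunciation_string is a hypothetical helper rather than a library function.

```python
def pronunciation_string(phones, boundary="#", sep=" "):
    """Join a phone list into one string, dropping boundary markers."""
    return sep.join(p for p in phones if p != boundary)

# Plain dict copied from the ARPABET output shown above.
w2p = {
    "a": ["#", "AH0", "#"],
    "is": ["#", "IH", "Z", "#"],
    "test": ["#", "T", "EH1", "S", "T", "#"],
    "this": ["#", "DH", "IH", "S", "#"],
}

flat = {word: pronunciation_string(phones) for word, phones in w2p.items()}
print(flat["test"])  # T EH1 S T
```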

The supported corpus formats and their corresponding phonetic alphabets are as follows:

Corpus Format | Phonetic Alphabet | Language Options
'babel' | 'xsampa' | 'amh', 'asm', 'ben', 'yue', 'ceb', 'luo', 'kat', 'gug', 'hat', 'ibo', 'jav', 'kur', 'lao', 'lit', 'mon', 'pus', 'swa', 'tgl', 'tam', 'tpi', 'tur', 'vie', 'zul'
'callhome' | 'callhome' | 'arz', 'cmn', 'spa'
'celex' | 'disc' | 'eng', 'nld', 'deu'
'isle' | 'ipa' | Not required

Known Limitations

  • You cannot convert to TIMIT format from IPA or any other phonecode, because TIMIT marks closures of stops with separate symbols. There are no symbols corresponding to these closures in other phonecodes and the closure is not predictable from the transcription alone.
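
Given this limitation, a caller may want to reject a TIMIT target before attempting a conversion. The sketch below is one way to do that; valid_pair and SOURCE_ONLY are hypothetical names, and the CODES set is copied from the phonecodes.CODES output shown earlier rather than imported from the library.

```python
# Copied from the phonecodes.CODES output shown earlier.
CODES = {"arpabet", "buckeye", "ipa", "timit", "callhome", "xsampa", "disc"}

# Codes that, per the limitation above, can be read but not produced.
SOURCE_ONLY = {"timit"}

def valid_pair(source, target):
    """Return True if both codes are known and the target can be produced."""
    return source in CODES and target in CODES and target not in SOURCE_ONLY
```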
