Tools for loading dictionaries with various phonecodes (IPA, Callhome, X-SAMPA, ARPABET, DISC=CELEX, Buckeye), for converting among those phonecodes, and for searching those dictionaries for word sequences matching a target.

These details have been verified by PyPI

Maintainers

ginic

These details have not been verified by PyPI

Project description

phonecodes

This library provides tools for converting between the International Phonetic Alphabet (IPA) and other phonetic alphabets used to transcribe speech, including Callhome, X-SAMPA, ARPABET, DISC/CELEX, Buckeye Corpus Phonetic Alphabet, and TIMIT. It also provides tools for searching mappings between phonetic symbols and for reading and writing pronunciation lexicon files in several standard formats.

These features are useful for processing data for automatic speech recognition, text-to-speech, and linguistic analyses of speech.

Setup and Installation

Install the library by running pip install phonecodes with Python 3.10 or greater. It may work with earlier versions of Python, but this has not been tested.

Developers may refer to the CONTRIBUTIONS.md for information on the development environment for testing, linting and contributing to the code.

Basic Usage

Converting between Phonetic Alphabets

To convert between IPA and another phonetic code, use phonecodes.phonecodes as follows:

>>> from phonecodes import phonecodes
>>> print(phonecodes.CODES) # available phonetic alphabets
{'arpabet', 'buckeye', 'ipa', 'timit', 'callhome', 'xsampa', 'disc'}
>>> phonecodes.convert("DH IH S IH Z AH0 T EH1 S T", "arpabet", "ipa", "eng") # convert from ARPABET to IPA with language explicitly specified
'ð ɪ s ɪ z ə t ˈɛ s t'
>>> phonecodes.convert("ð ɪ s ɪ z ə t ˈɛ s t", "ipa", "arpabet") # convert from IPA to ARPABET with optional language left out
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.ipa2arpabet("ð ɪ s ɪ z ə t ˈɛ s t", "eng") # equivalent to previous with explicit language
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.ipa2arpabet("ð ɪ s ɪ z ə t ˈɛ s t") # equivalent to previous with optional language left out
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.convert("DH IH S IH Z AH0 T EH1 S T", "arpabet", "ipa") # convert from ARPABET to IPA, optional language left out
'ð ɪ s ɪ z ə t ˈɛ s t'
>>> phonecodes.arpabet2ipa("DH IH S IH Z AH0 T EH1 S T", "eng") # equivalent to previous with optional language explicit
'ð ɪ s ɪ z ə t ˈɛ s t'

For 'arpabet', 'buckeye', 'timit' and 'xsampa', specifying a language is optional and ignored by the code, since X-SAMPA is language-agnostic and ARPABET, Buckeye, and TIMIT were designed to work only for English.

For 'callhome' and 'disc' you should also specify a language code from the following lists:

DISC/CELEX: Dutch 'nld', English 'eng', German 'deu'. Uses German if unspecified.
Callhome: Spanish 'spa', Egyptian Arabic 'arz', Mandarin Chinese 'cmn'. You MUST specify an appropriate language code or you'll get a KeyError.
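The language rules above can be checked before calling a conversion function. The following is a minimal sketch, not part of phonecodes: the names LANGUAGES, DEFAULTS and resolve_language are hypothetical, and only the language codes and the German default are taken from the lists above.

```python
# Language codes listed in this README for each phonecode. The names
# LANGUAGES, DEFAULTS and resolve_language are hypothetical helpers,
# not part of the phonecodes API.
LANGUAGES = {
    "disc": {"nld", "eng", "deu"},
    "callhome": {"spa", "arz", "cmn"},
}
DEFAULTS = {"disc": "deu"}  # README: DISC/CELEX uses German if unspecified


def resolve_language(code, language=None):
    """Return the language to pass along with `code`, or raise ValueError."""
    if code not in LANGUAGES:
        # arpabet, buckeye, timit, xsampa, ipa: language is optional and ignored
        return language
    if language is None:
        if code in DEFAULTS:
            return DEFAULTS[code]
        raise ValueError(f"{code!r} requires one of {sorted(LANGUAGES[code])}")
    if language not in LANGUAGES[code]:
        raise ValueError(f"{language!r} is not supported for {code!r}")
    return language
```

Raising ValueError with the list of valid codes is a design choice of this sketch; it gives a clearer message than the KeyError mentioned above for Callhome.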

Additional post-processing

An additional use case when converting between phonecodes is to normalize the final mapping to a subset of IPA symbols. This is useful if you are collapsing similar sounds together to a reduced symbol inventory or if you are standardizing two corpora with different IPA inventories/conventions to a shared subset.

We support this use case through the post_conversion_mapping keyword argument, an optional dictionary of remappings accepted by all phonecodes conversion functions. You can provide a custom mapping. Be aware that the remapping algorithm is greedy and proceeds in the order that keys appear in the dictionary, and that a diacritic is only remapped when it appears together with its base symbol as a key in the mapping.
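To make the ordering caveat concrete, here is one possible reading of "greedy, in key order" as a standalone sketch. It is an assumption about the behaviour, not the library's implementation, and greedy_remap is a hypothetical name.

```python
def greedy_remap(text, mapping):
    """Scan left to right. At each position, the first key (in dict order)
    that matches is replaced; otherwise one character is copied through.
    Illustrative only: this is not the phonecodes implementation."""
    out = []
    i = 0
    while i < len(text):
        for key, value in mapping.items():
            if key and text.startswith(key, i):
                out.append(value)
                i += len(key)
                break
        else:
            # No key matched here, so keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


TILDE = "\u0303"  # combining tilde, the nasalization diacritic
nasal_vowel = "ʌ" + TILDE

# Base symbol only: the diacritic is copied through untouched.
base_only = greedy_remap(nasal_vowel, {"ʌ": "ə"})

# Key order matters: the first matching key wins at each position.
diacritic_first = greedy_remap(nasal_vowel, {nasal_vowel: "ə", "ʌ": "ə"})
base_first = greedy_remap(nasal_vowel, {"ʌ": "ə", nasal_vowel: "ə"})
```

Under this reading, base_only and base_first both keep the tilde on the remapped vowel, while diacritic_first removes it, which is why keys that include a diacritic should come before their bare base symbol.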

Additionally, we provide IPA-to-IPA post-processing dictionary mappings in phonecodes.phonecode_tables:

phonecodes.phonecode_tables.STANDARD_TIMIT_IPA_REDUCTION: The 'standard' TIMIT label reduction used in Lee and Hon (1989) that reduces the original 64 TIMIT phonetic labels to 39 categories. This reduction is widely used in the speech recognition community.
phonecodes.phonecode_tables.BUCKEYE_IPA_TO_TIMIT_BUCKEYE_SHARED and phonecodes.phonecode_tables.TIMIT_IPA_TO_TIMIT_BUCKEYE_SHARED: A conservative reduction from the Buckeye and TIMIT IPA inventories, respectively, to a shared symbol set. This maps nasalized vowels and flaps to their non-nasalized versions, r-colored vowels ('ɚ', 'ɝ') to syllabic r ('ɹ̩'), and normalizes variants of 'ʌ' and schwa to schwa.

>>> from phonecodes import phonecodes
>>> # Conversion from Buckeye to IPA using the original published Buckeye mapping
>>> phonecodes.convert("B AHN NX AAN NX AH", "buckeye", "ipa")
'b ʌ̃ ɾ̃ ɑ̃ ɾ̃ ʌ'
>>> # Conversion from Buckeye to IPA with postprocessing to an IPA inventory shared with TIMIT
>>> phonecodes.convert("B AHN NX AAN NX AH", "buckeye", "ipa", post_conversion_mapping=phonecodes.phonecode_tables.BUCKEYE_IPA_TO_TIMIT_BUCKEYE_SHARED)
'b ə n ɑ n ə'
>>> # Custom mapping example - note that the nasalized diacritics are not affected by the remapping
>>> phonecodes.convert("B AHN NX AAN NX AH", "buckeye", "ipa", post_conversion_mapping={'ʌ': 'ə'})
'b ə̃ ɾ̃ ɑ̃ ɾ̃ ə'

Reading Corpus Files

If you are working with specific corpora, you can also convert between certain corpus formats as follows:

>>> from phonecodes import pronlex
>>> my_lex = pronlex.read("test/fixtures/isle_eng_sample.txt", "isle", "eng") # Read in an English ISLE corpus file
>>> my_lex.w2p # see orthographic to phonetic word mapping
{'a': ['#', 'ə', '#'], 'is': ['#', 'ɪ', 'z', '#'], 'test': ['#', 't', 'ˈɛ', 's', 't', '#'], 'this': ['#', 'ð', 'ɪ', 's', '#']}
>>> new_lex = my_lex.recode('arpabet') # Convert mapping to ARPABET
>>> new_lex.w2p
{'a': ['#', 'AH0', '#'], 'is': ['#', 'IH', 'Z', '#'], 'test': ['#', 'T', 'EH1', 'S', 'T', '#'], 'this': ['#', 'DH', 'IH', 'S', '#']}
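A mapping shaped like the w2p output above can be turned into a plain transcription for a word sequence. This is a small sketch that assumes each word maps to a single list of symbols delimited by '#' boundary markers, as printed above; strip_boundaries and transcribe are hypothetical helpers, not pronlex functions.

```python
# A word-to-phones mapping with the same shape as the w2p output above.
w2p = {
    "a": ["#", "ə", "#"],
    "is": ["#", "ɪ", "z", "#"],
    "test": ["#", "t", "ˈɛ", "s", "t", "#"],
    "this": ["#", "ð", "ɪ", "s", "#"],
}


def strip_boundaries(phones, boundary="#"):
    """Drop word-boundary markers from one pronunciation."""
    return [p for p in phones if p != boundary]


def transcribe(words, lexicon):
    """Join the pronunciations of `words` into one space-separated string.
    Raises KeyError for words missing from the lexicon."""
    phones = []
    for word in words:
        phones.extend(strip_boundaries(lexicon[word]))
    return " ".join(phones)


sentence = transcribe(["this", "is", "a", "test"], w2p)
```

With the sample lexicon, sentence is the same IPA string used in the conversion examples earlier in this README.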

The supported corpus formats and their corresponding phonetic alphabets are as follows:

Corpus Format	Phonetic Alphabet	Language Options
'babel'	'xsampa'	'amh', 'asm', 'ben', 'yue', 'ceb', 'luo', 'kat', 'gug', 'hat', 'ibo', 'jav', 'kur', 'lao', 'lit', 'mon', 'pus', 'swa', 'tgl', 'tam', 'tpi', 'tur', 'vie', 'zul'
'callhome'	'callhome'	'arz', 'cmn', 'spa'
'celex'	'disc'	'eng', 'nld', 'deu'
'isle'	'ipa'	Not required

Known Limitations

You cannot convert to TIMIT format from IPA or any other phonecode, because TIMIT marks closures of stops with separate symbols. There are no symbols corresponding to these closures in other phonecodes and the closure is not predictable from the transcription alone.
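The information loss can be illustrated without the library. The sketch below uses the six TIMIT stop-closure labels and simply drops them; how phonecodes itself treats closure labels is not stated here, so dropping them is an assumption made only for illustration, and drop_closures is a hypothetical helper.

```python
# TIMIT marks the closure portion of each stop with its own label.
TIMIT_CLOSURES = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl"}


def drop_closures(labels):
    """Remove closure labels, as any closure-free phonecode effectively does."""
    return [label for label in labels if label not in TIMIT_CLOSURES]


# Two different TIMIT label sequences for the same word...
closure_on_first_stop = ["tcl", "t", "eh", "s", "t"]
closure_on_last_stop = ["t", "eh", "s", "tcl", "t"]

# ...collapse to the same closure-free sequence, so the mapping
# cannot be inverted from the transcription alone.
collapsed_a = drop_closures(closure_on_first_stop)
collapsed_b = drop_closures(closure_on_last_stop)
```

Because collapsed_a and collapsed_b are identical, nothing in the closure-free sequence indicates where the closure labels should be restored.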

Project details

Release history

This version

2.0.0

Nov 25, 2025

1.2.3

Oct 23, 2025

1.2.2

Sep 19, 2025

1.2.1

Sep 15, 2025

1.2.0

Jun 23, 2025

1.1.4

Nov 14, 2024

1.1.3

Nov 14, 2024

1.1.1

May 9, 2024

1.1.0

Feb 16, 2024

1.0.0

Feb 10, 2024

Download files

Source Distribution

phonecodes-2.0.0.tar.gz (26.7 kB view details)

Uploaded Nov 25, 2025 Source

Built Distribution

phonecodes-2.0.0-py3-none-any.whl (20.5 kB view details)

Uploaded Nov 25, 2025 Python 3

File details

Details for the file phonecodes-2.0.0.tar.gz.

File metadata

Download URL: phonecodes-2.0.0.tar.gz
Upload date: Nov 25, 2025
Size: 26.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for phonecodes-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`c0d7142e1e11600c58e8898789182f66849045f195365ee2796906d8ac753217`
MD5	`760cafbf6e439cd16b25415619e4e7c3`
BLAKE2b-256	`d1759ae214938b76802d5f02e87292603520013c46c35192162fb41a421670d5`
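A downloaded file can be checked against the SHA256 digest listed above using Python's standard hashlib module. This is a generic sketch; sha256_of_file and matches_published_hash are hypothetical helper names, and the local file path is whatever location the download was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published hex digest, ignoring case
    and surrounding whitespace in the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("phonecodes-2.0.0.tar.gz", published_digest) should return True for an intact download, where published_digest is the SHA256 value from the table.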

Provenance

The following attestation bundles were made for phonecodes-2.0.0.tar.gz:

Publisher: publish_to_pypi.yml on ginic/phonecodes

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: phonecodes-2.0.0.tar.gz
- Subject digest: c0d7142e1e11600c58e8898789182f66849045f195365ee2796906d8ac753217
- Sigstore transparency entry: 725478850
- Sigstore integration time: Nov 25, 2025
Source repository:
- Permalink: ginic/phonecodes@3edf14bd11820f0aa2006cfb9e0025f34948c5a2
- Branch / Tag: refs/tags/2.0.0
- Owner: https://github.com/ginic
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_to_pypi.yml@3edf14bd11820f0aa2006cfb9e0025f34948c5a2
- Trigger Event: release

File details

Details for the file phonecodes-2.0.0-py3-none-any.whl.

File metadata

Download URL: phonecodes-2.0.0-py3-none-any.whl
Upload date: Nov 25, 2025
Size: 20.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for phonecodes-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`65f7bac8367b99633535e15d8aef0820dcd6a8359183d8d7f8c25bdb4decda93`
MD5	`3dc8920f7a87178d111f72cdf55e45cb`
BLAKE2b-256	`ac32a8bbcb82614bb5de820a1c25005df77b991d35f95ae8e1ae77272d2ee956`

Provenance

The following attestation bundles were made for phonecodes-2.0.0-py3-none-any.whl:

Publisher: publish_to_pypi.yml on ginic/phonecodes

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: phonecodes-2.0.0-py3-none-any.whl
- Subject digest: 65f7bac8367b99633535e15d8aef0820dcd6a8359183d8d7f8c25bdb4decda93
- Sigstore transparency entry: 725478859
- Sigstore integration time: Nov 25, 2025
Source repository:
- Permalink: ginic/phonecodes@3edf14bd11820f0aa2006cfb9e0025f34948c5a2
- Branch / Tag: refs/tags/2.0.0
- Owner: https://github.com/ginic
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish_to_pypi.yml@3edf14bd11820f0aa2006cfb9e0025f34948c5a2
- Trigger Event: release
