Tools for loading dictionaries with various phonecodes (IPA, Callhome, X-SAMPA, ARPABET, DISC=CELEX, Buckeye), for converting among those phonecodes, and for searching those dictionaries for word sequences matching a target.

Project description

phonecodes

This library provides tools for converting between the International Phonetic Alphabet (IPA) and other phonetic alphabets used to transcribe speech, including Callhome, X-SAMPA, ARPABET, DISC/CELEX and the Buckeye Corpus Phonetic Alphabet. It also provides tools for searching mappings between phonetic symbols and for reading and writing pronunciation lexicon files in several standard formats.

These tools are useful for processing data for automatic speech recognition, text-to-speech, and linguistic analyses of speech.

Setup and Installation

Install the library by running pip install phonecodes with Python 3.10 or greater. It probably works with earlier versions of Python, but this has not been tested.
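
The version requirement above can be checked before importing the library. The following is a minimal sketch that uses only the standard library; the helper names python_ok and installed_version are illustrative and are not part of phonecodes.

```python
import sys
from importlib import metadata


def python_ok(minimum=(3, 10)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def installed_version(dist="phonecodes"):
    """Return the installed version string of `dist`, or None if it is absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python 3.10+:", python_ok())
    print("phonecodes version:", installed_version())
```

If installed_version() returns None, the package is not installed in the current environment.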

Developers may refer to the CONTRIBUTIONS.md for information on the development environment for testing, linting and contributing to the code.

Basic Usage

Converting between Phonetic Alphabets

To convert between IPA and another phonetic code, use phonecodes.phonecodes as follows:

>>> from phonecodes import phonecodes
>>> print(phonecodes.CODES) # available phonetic alphabets
{'buckeye', 'disc', 'callhome', 'xsampa', 'arpabet', 'ipa'}
>>> phonecodes.convert("DH IH S IH Z AH0 T EH1 S T", "arpabet", "ipa", "eng") # convert from ARPABET to IPA with language explicitly specified
'ð ɪ s ɪ z ə t ˈɛ s t'
>>> phonecodes.convert("ð ɪ s ɪ z ə t ˈɛ s t", "ipa", "arpabet") # convert from IPA to ARPABET with optional language left out
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.ipa2arpabet("ð ɪ s ɪ z ə t ˈɛ s t", "eng") # equivalent to previous with explicit language
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.ipa2arpabet("ð ɪ s ɪ z ə t ˈɛ s t") # equivalent to previous with optional language left out
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.convert("DH IH S IH Z AH0 T EH1 S T", "arpabet", "ipa") # convert from ARPABET to IPA, optional language left out
'ð ɪ s ɪ z ə t ˈɛ s t'
>>> phonecodes.arpabet2ipa("DH IH S IH Z AH0 T EH1 S T", "eng") # equivalent to previous with optional language explicit
'ð ɪ s ɪ z ə t ˈɛ s t'

For 'arpabet', 'buckeye' and 'xsampa', the language argument is optional and ignored by the code, since X-SAMPA is language-agnostic and ARPABET and Buckeye were designed to work only for English. For 'callhome' and 'disc', you should also specify a language code from the following lists:

  • DISC/CELEX: Dutch 'nld', English 'eng', German 'deu'. Uses German if unspecified.
  • Callhome: Spanish 'spa', Egyptian Arabic 'arz', Mandarin Chinese 'cmn'. You MUST specify an appropriate language code or you'll get a KeyError.
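
The language rules above can be enforced before a conversion is attempted, so that a missing Callhome language fails with a descriptive message rather than a KeyError. The following is a minimal sketch, not part of the library: check_language and checked_convert are hypothetical helpers, and the conversion function is passed in as an argument so that the only library call relied on is the convert call shown earlier.

```python
# Language codes listed above for the language-dependent phonecodes.
LANGUAGES_BY_CODE = {
    "disc": {"nld", "eng", "deu"},
    "callhome": {"spa", "arz", "cmn"},
}


def check_language(code, language=None):
    """Raise ValueError if `language` is not one listed for `code`."""
    allowed = LANGUAGES_BY_CODE.get(code)
    if allowed is None:
        return  # e.g. 'arpabet', 'buckeye', 'xsampa': language is ignored
    if code == "disc" and language is None:
        return  # DISC is documented to fall back to German
    if language not in allowed:
        raise ValueError(
            f"{code!r} needs one of {sorted(allowed)}, got {language!r}"
        )


def checked_convert(convert, text, source, target, language=None):
    """Validate the language for both phonecodes, then call `convert`."""
    check_language(source, language)
    check_language(target, language)
    if language is None:
        return convert(text, source, target)
    return convert(text, source, target, language)
```

With the library installed, this could be called as checked_convert(phonecodes.convert, text, "callhome", "ipa", "spa"), where text is a Callhome transcription.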

Reading Corpus Files

If you are working with specific corpora, you can also convert between certain corpus formats as follows:

>>> from phonecodes import pronlex
>>> my_lex = pronlex.read("test/fixtures/isle_eng_sample.txt", "isle", "eng") # Read in an English ISLE corpus file
>>> my_lex.w2p # see orthographic to phonetic word mapping
{'a': ['#', 'ə', '#'], 'is': ['#', 'ɪ', 'z', '#'], 'test': ['#', 't', 'ˈɛ', 's', 't', '#'], 'this': ['#', 'ð', 'ɪ', 's', '#']}
>>> new_lex = my_lex.recode('arpabet') # Convert mapping to ARPABET
>>> new_lex.w2p
{'a': ['#', 'AH0', '#'], 'is': ['#', 'IH', 'Z', '#'], 'test': ['#', 'T', 'EH1', 'S', 'T', '#'], 'this': ['#', 'DH', 'IH', 'S', '#']}
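
The w2p mapping shown above is a plain dictionary, so it can be inverted to look up words by pronunciation. The following is a minimal sketch that works on a copy of the sample output; strip_boundaries and invert are illustrative helpers, and the library may offer its own facilities for this.

```python
# w2p mapping copied from the ISLE sample output shown above.
w2p = {
    "a": ["#", "ə", "#"],
    "is": ["#", "ɪ", "z", "#"],
    "test": ["#", "t", "ˈɛ", "s", "t", "#"],
    "this": ["#", "ð", "ɪ", "s", "#"],
}


def strip_boundaries(phones, boundary="#"):
    """Drop the word-boundary markers from a list of phone symbols."""
    return [p for p in phones if p != boundary]


def invert(word_to_phones):
    """Map each space-joined pronunciation to the words that share it."""
    phones_to_words = {}
    for word, phones in word_to_phones.items():
        key = " ".join(strip_boundaries(phones))
        phones_to_words.setdefault(key, []).append(word)
    return phones_to_words


p2w = invert(w2p)
print(p2w["ð ɪ s"])  # ['this']
```

Keys are space-joined phone strings, which matches the format accepted by the conversion functions in the previous section.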

The supported corpus formats and their corresponding phonetic alphabets are as follows:

Corpus Format | Phonetic Alphabet | Language Options
'babel' | 'xsampa' | 'amh', 'asm', 'ben', 'yue', 'ceb', 'luo', 'kat', 'gug', 'hat', 'ibo', 'jav', 'kur', 'lao', 'lit', 'mon', 'pus', 'swa', 'tgl', 'tam', 'tpi', 'tur', 'vie', 'zul'
'callhome' | 'callhome' | 'arz', 'cmn', 'spa'
'celex' | 'disc' | 'eng', 'nld', 'deu'
'isle' | 'ipa' | Not required
