G2P library for multiple languages

G2P+

This repository contains scripts for converting various corpora to a unified IPA format, with marked word and utterance boundaries, to prepare them for training and evaluating small transformer-based language models.

It leverages four existing G2P tools (two statistical tools and two pronunciation dictionaries) to support a wide variety of languages:

Backend          Languages
phonemizer       100+ languages/accents
epitran          149 languages/scripts
pinyin-to-ipa    1 (Mandarin)
pingyam          1 (Cantonese)

Installation

The simplest way is using pip:

pip install g2p-plus

Or you can install from source:

git clone https://github.com/codebyzeb/g2p-plus
cd g2p-plus
pip install .

Dependencies

The phonemizer backend requires espeak-ng to be installed. See instructions here.

On macOS, the phonemizer backend requires PHONEMIZER_ESPEAK_LIBRARY to be set in the local environment. If it is not already set, add it to your environment before running g2p_plus, e.g.:

export PHONEMIZER_ESPEAK_LIBRARY=/opt/local/lib/libespeak-ng.dylib
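
If you would rather configure this from Python than from the shell, the variable can be set before the phonemizer backend is first used. The sketch below is a minimal illustration and not part of G2P+: the helper name configure_espeak_library is hypothetical, and the path is the one from the export example above, which may differ on your machine.

```python
import os
from pathlib import Path


def configure_espeak_library(library_path, env=None):
    """Point PHONEMIZER_ESPEAK_LIBRARY at library_path if that file exists.

    Hypothetical helper (not part of G2P+). Returns True when the variable
    was set and False when no file was found at library_path.
    """
    env = os.environ if env is None else env
    if not Path(library_path).is_file():
        return False
    env["PHONEMIZER_ESPEAK_LIBRARY"] = str(library_path)
    return True


if __name__ == "__main__":
    # Path taken from the export example above; adjust for your installation.
    ok = configure_espeak_library("/opt/local/lib/libespeak-ng.dylib")
    print("PHONEMIZER_ESPEAK_LIBRARY set" if ok else "libespeak-ng not found at that path")
```

Checking that the file exists first avoids silently pointing the backend at a missing library.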

The epitran backend with English requires Flite to be installed. See instructions here.

Usage

G2P+ is available as a command-line tool or as a Python function.

Command-line interface

g2p_plus is the CLI for G2P+. It converts corpora to a unified IPA format using any of the backends described above. The help menu (-h) describes usage and lists the languages supported by each backend. The script reads lines from an input file (-i) and writes space-separated IPA phonemes to an output file (-o); if no files are given, it reads from STDIN and writes to STDOUT. With -k, a WORD_BOUNDARY token is emitted after each word.

For many languages, the underlying transcription tool does not output phoneme sets that match typical phoneme inventories for that language. As such, we have implemented "folding" dictionaries for many languages. These map the output of a backend for a language to a standard phoneme inventory in Phoible. See g2p_plus/folding for these dictionaries. This "folding" can be turned off using -u.
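
Conceptually, folding is a lookup applied to each phoneme the backend emits. The sketch below illustrates the idea only: the function name fold_phonemes and the two mapping entries are invented for illustration and are not taken from g2p_plus/folding, whose actual format and contents may differ.

```python
def fold_phonemes(phonemes, folding):
    """Replace each phoneme that has an entry in `folding`; keep the rest."""
    return [folding.get(phoneme, phoneme) for phoneme in phonemes]


# Hypothetical folding entries, for illustration only.
EXAMPLE_FOLDING = {"ɾ": "t", "ɐ": "ə"}

if __name__ == "__main__":
    raw = "b ɐ ɾ ə".split()  # made-up backend output
    print(" ".join(fold_phonemes(raw, EXAMPLE_FOLDING)))
```

Passing -u corresponds to skipping this step and keeping the backend's raw symbols.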

Example usage:

> g2p_plus phonemizer en-gb -k
hello there!
h ə l əʊ WORD_BOUNDARY ð eə WORD_BOUNDARY

> g2p_plus phonemizer en-us
hello there!
h ə l oʊ ð ɛ ɹ
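
When driving the CLI from a script, the invocation can be assembled programmatically. The sketch below uses only the flags described above (-i, -o, -k, -u) and the positional backend and language arguments shown in the examples. The helper names build_g2p_command and run_g2p are hypothetical, and running the command assumes g2p_plus is installed and on your PATH.

```python
import subprocess


def build_g2p_command(backend, language, input_path=None, output_path=None,
                      keep_word_boundaries=False, uncorrected=False):
    """Assemble an argv list for the g2p_plus CLI (hypothetical helper)."""
    command = ["g2p_plus", backend, language]
    if input_path is not None:
        command += ["-i", str(input_path)]
    if output_path is not None:
        command += ["-o", str(output_path)]
    if keep_word_boundaries:
        command.append("-k")  # emit WORD_BOUNDARY tokens
    if uncorrected:
        command.append("-u")  # turn folding off
    return command


def run_g2p(text, backend, language, **options):
    """Pipe `text` through the CLI via STDIN and return what it writes to STDOUT."""
    result = subprocess.run(
        build_g2p_command(backend, language, **options),
        input=text, capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout
```

For example, build_g2p_command("phonemizer", "en-gb", keep_word_boundaries=True) produces the same argument list as the first command above.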

Python library

G2P+ can be imported in Python and used as follows:

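from g2p_plus import phonemize_utterances

# Each string is one utterance to convert to IPA.
lines = ['hello there!', 'nice to meet you.']
# Arguments: utterances, backend name, language code.
phonemized = phonemize_utterances(lines, "phonemizer", "en-gb", keep_word_boundaries=True)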
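
When word boundaries are kept, downstream code often needs to regroup phonemes into words. The sketch below assumes each phonemized utterance is a string of space-separated phonemes containing WORD_BOUNDARY tokens, as in the CLI output above; the return type of phonemize_utterances is not documented here, so treat that as an assumption. The helper split_words is hypothetical.

```python
def split_words(utterance, boundary="WORD_BOUNDARY"):
    """Group space-separated phonemes into words, splitting on `boundary`."""
    words, current = [], []
    for token in utterance.split():
        if token == boundary:
            if current:  # skip empty words from repeated boundaries
                words.append(current)
            current = []
        else:
            current.append(token)
    if current:  # trailing word with no closing boundary
        words.append(current)
    return words


if __name__ == "__main__":
    line = "h ə l əʊ WORD_BOUNDARY ð eə WORD_BOUNDARY"
    print(split_words(line))
```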

Attribution

This project incorporates content from the following sources:

In accordance with the CC BY-SA 3.0 license, any derivative work or adaptation of these resources must also be shared under the same license.

License

All original content in this repository created by Zébulon Goriely is licensed under the MIT License.
