G2P library for multiple languages

G2P+

This repository contains scripts for converting various corpora to a unified IPA format, with marked word and utterance boundaries, to prepare them for training and evaluating small transformer-based language models.

It leverages four existing G2P tools (two statistical tools and two pronunciation dictionaries) to support a wide variety of languages:

Backend         Languages
phonemizer      100+ languages/accents
epitran         149 languages/scripts
pinyin-to-ipa   1 (Mandarin)
pingyam         1 (Cantonese)

Installation

The simplest way is using pip:

pip install g2p-plus

Or you can install from source:

git clone https://github.com/codebyzeb/g2p-plus
cd g2p-plus
pip install .

Dependencies

The phonemizer backend requires espeak-ng to be installed. See instructions here.

On macOS, the phonemizer backend requires the PHONEMIZER_ESPEAK_LIBRARY environment variable to point to the espeak-ng library. You may need to set it before running g2p_plus, e.g.:

export PHONEMIZER_ESPEAK_LIBRARY=/opt/local/lib/libespeak-ng.dylib
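
When G2P+ is used from Python rather than from a shell, the same variable can be set in-process. The following is a minimal sketch, not part of G2P+: the helper name configure_espeak is hypothetical, the default path is the one from the shell example above, and it assumes the variable is read when the phonemizer backend is first loaded, so the helper should be called before importing g2p_plus.

```python
import os
from pathlib import Path

ENV_VAR = "PHONEMIZER_ESPEAK_LIBRARY"
# Default taken from the shell example above; adjust for your installation.
DEFAULT_CANDIDATES = ("/opt/local/lib/libespeak-ng.dylib",)


def configure_espeak(candidates=DEFAULT_CANDIDATES):
    """Hypothetical helper: point ENV_VAR at the first candidate that exists.

    An existing value is left untouched. Returns the path in effect, or None
    if the variable is unset and no candidate file was found.
    """
    if os.environ.get(ENV_VAR):
        return os.environ[ENV_VAR]
    for candidate in candidates:
        if Path(candidate).is_file():
            os.environ[ENV_VAR] = candidate
            return candidate
    return None


# Call this before importing g2p_plus (assumed to be the safe order).
configure_espeak()
```

A return value of None means no library was found at the listed paths, in which case the variable still has to be set by hand.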

The epitran backend with English requires Flite to be installed. See instructions here.

Usage

G2P+ is available as a command-line tool or as a Python function.

Command-line interface

g2p_plus is the CLI for G2P+. It converts corpora to a unified IPA format using any of the backends described above. The help menu (-h) describes usage and lists the languages supported by each backend. The script reads lines from an input file (-i) and writes space-separated IPA phonemes to an output file (-o); if no files are given, it reads from STDIN and writes to STDOUT. With -k, a WORD_BOUNDARY token is output after each word.

For many languages, the underlying transcription tool does not output phoneme sets that match typical phoneme inventories for that language. As such, we have implemented "folding" dictionaries for many languages. These map the output of a backend for a language to a standard phoneme inventory in Phoible. See g2p_plus/folding for these dictionaries. This "folding" can be turned off using -u.
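
To illustrate the idea, folding amounts to a token-level lookup applied to the backend's output. The sketch below is not the G2P+ implementation: the function name fold and the two mapping entries are hypothetical and are not taken from g2p_plus/folding.

```python
# Hypothetical folding map: backend symbol -> symbol(s) in the target inventory.
# These entries are illustrative only, not copied from g2p_plus/folding.
EXAMPLE_FOLDING = {
    "ɐ": "ʌ",
    "ɚ": "ə ɹ",  # one symbol may fold to a sequence of phonemes
}


def fold(phonemes: str, folding: dict[str, str]) -> str:
    """Replace each space-separated symbol that has an entry in `folding`."""
    return " ".join(folding.get(symbol, symbol) for symbol in phonemes.split())


print(fold("b ɐ t ɚ", EXAMPLE_FOLDING))  # prints: b ʌ t ə ɹ
```

Symbols without an entry pass through unchanged, which is also what an empty mapping produces.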

Example usage:

> g2p_plus phonemizer en-gb -k
hello there!
h ə l əʊ WORD_BOUNDARY ð eə WORD_BOUNDARY

> g2p_plus phonemizer en-us
hello there!
h ə l oʊ ð ɛ ɹ
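
Downstream code often needs the phonemes grouped by word. As a minimal sketch (the helper name split_words is hypothetical and not part of G2P+), a line produced with -k can be split on the WORD_BOUNDARY token:

```python
def split_words(line: str, boundary: str = "WORD_BOUNDARY") -> list[list[str]]:
    """Group space-separated phonemes into words, using `boundary` as the delimiter."""
    words, current = [], []
    for token in line.split():
        if token == boundary:
            if current:
                words.append(current)
            current = []
        else:
            current.append(token)
    if current:  # keep trailing phonemes if the line does not end with a boundary
        words.append(current)
    return words


print(split_words("h ə l əʊ WORD_BOUNDARY ð eə WORD_BOUNDARY"))
# prints: [['h', 'ə', 'l', 'əʊ'], ['ð', 'eə']]
```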

Python library

G2P+ can be imported in Python and used as follows:

from g2p_plus import phonemize_utterances
lines = ['hello there!', 'nice to meet you.']
phonemized = phonemize_utterances(lines, "phonemizer", "en-gb", keep_word_boundaries=True)
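
To mirror the CLI's -i/-o options from Python, the call above can be wrapped in a small file-to-file helper. This is a sketch under assumptions: phonemize_file is a hypothetical name, and it assumes the phonemizer callable returns one string per input line, which the example above does not confirm. The phonemizer is passed in as an argument so that the helper itself runs without G2P+ installed.

```python
from pathlib import Path
from typing import Callable, Iterable


def phonemize_file(
    in_path: str,
    out_path: str,
    phonemize: Callable[[list[str]], Iterable[str]],
) -> int:
    """Hypothetical helper: phonemize each line of in_path and write to out_path.

    Assumes `phonemize` returns one output string per input line.
    Returns the number of lines written.
    """
    lines = Path(in_path).read_text(encoding="utf-8").splitlines()
    outputs = list(phonemize(lines))
    Path(out_path).write_text("".join(f"{out}\n" for out in outputs), encoding="utf-8")
    return len(outputs)
```

With G2P+ installed, the callable could be lambda lines: phonemize_utterances(lines, "phonemizer", "en-gb", keep_word_boundaries=True), reusing the arguments from the example above.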

Attribution

This project incorporates content from the following sources:

In accordance with the CC BY-SA 3.0 license, any derivative work or adaptation of these resources must also be shared under the same license.

License

All original content in this repository created by Zébulon Goriely is licensed under the MIT License.
