
Module for creating context-aware, rule-based G2P mappings that preserve indices


Gⁱ-2-Pⁱ


Grapheme-to-Phoneme transformations that preserve input and output indices!

This library is for handling arbitrary conversions between input and output segments while preserving indices.
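To make "preserving indices" concrete, here is a minimal sketch of the idea in plain Python. It does not use g2p's own classes; the function name `convert_with_indices` and the `(input, output)` rule format are assumptions made for illustration only.

```python
def convert_with_indices(text, rules):
    """Apply the first matching rule at each position and record index pairs.

    `rules` is a list of (input, output) string pairs (an illustrative format,
    not g2p's internal one). Returns the output string and a list of
    (input_index, output_index) edges.
    """
    output = []
    edges = []
    i = 0
    while i < len(text):
        for src, dst in rules:
            if src and text.startswith(src, i):
                break
        else:
            # No rule matched here: copy the character through unchanged.
            src, dst = text[i], text[i]
        out_start = len(output)
        # Link every consumed input index to every produced output index.
        # A rule with an empty output (a deletion) produces no edges.
        for in_idx in range(i, i + len(src)):
            for out_idx in range(out_start, out_start + len(dst)):
                edges.append((in_idx, out_idx))
        output.extend(dst)
        i += len(src)
    return "".join(output), edges


result, edges = convert_with_indices("test", [("t", "p"), ("e", "ee")])
print(result)  # peesp
print(edges)   # [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4)]
```

Each edge says which input character an output character came from, so a one-to-many rule such as e -> ee yields two edges from the same input index.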



Background

The initial version of this package was developed by Patrick Littell to allow g2p from community orthographies to IPA and back again in ReadAlong-Studio. We then decided to pull the g2p mechanism out of Convertextract, which allows transducer relations to be declared in CSV files, and turn it into its own library. Here it is!

Install

The best way to install g2p is with pip: pip install g2p

Otherwise, clone the repo and pip install it locally.

$ git clone https://github.com/roedoejet/g2p.git
$ cd g2p
$ pip install -e .

Usage

The easiest way to create a transducer is to use the g2p.make_g2p function.

To use it, first import the function:

from g2p import make_g2p

Then call it with arguments for in_lang and out_lang. Both must be strings that match the names used in an existing mapping.

>>> transducer = make_g2p('dan', 'eng-arpabet')
>>> transducer('hej').output_string
'HH EH Y'

There must be a valid path between the in_lang and out_lang in order for this to work. If you've edited a mapping or added a custom mapping, you must update g2p to include it: g2p update
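The "valid path" requirement can be pictured as a search over a graph whose nodes are language names and whose edges are mappings. The sketch below is an illustration of that idea, not g2p's actual implementation; `find_path` and the list of mapping pairs are hypothetical.

```python
from collections import deque


def find_path(mappings, in_lang, out_lang):
    """Breadth-first search over (in_lang, out_lang) mapping pairs.

    Returns the list of language names along the shortest path,
    or None when no chain of mappings connects the two.
    """
    graph = {}
    for src, dst in mappings:
        graph.setdefault(src, []).append(dst)
    queue = deque([[in_lang]])
    seen = {in_lang}
    while queue:
        path = queue.popleft()
        if path[-1] == out_lang:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical mapping pairs, loosely modelled on names used in this README.
available = [("dan", "dan-ipa"), ("dan-ipa", "eng-ipa"), ("eng-ipa", "eng-arpabet")]
path = find_path(available, "dan", "eng-arpabet")
print(path)  # ['dan', 'dan-ipa', 'eng-ipa', 'eng-arpabet']
```

Mappings are directed in this sketch, so `find_path(available, "eng-arpabet", "dan")` returns `None`.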

Writing mapping files

Mapping files are written as either CSV or JSON files.

CSV

In a CSV file, each rule is written on its own line and has between two and four columns. The first column is required and corresponds to the rule's input. The second column is also required and corresponds to the rule's output. The third column is optional and corresponds to the context before the rule input. The fourth column is also optional and corresponds to the context after the rule input. For example:

  1. This mapping describes two rules; a -> b and c -> d.
a,b
c,d
  2. This mapping describes two rules; a -> b / c _ d* and a -> e
a,b,c,d
a,e

The g2p studio exports its rules to CSV format.

*If this notation is unfamiliar, have a look at phonological rewrite rules
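As a rough illustration of the column layout described above, the following sketch reads CSV rules with Python's standard csv module and fills in empty contexts for the optional columns. The function name `parse_rules` is hypothetical, and the dictionary keys are borrowed from the JSON format; g2p's own loader may behave differently.

```python
import csv
import io


def parse_rules(csv_text):
    """Turn CSV rule lines into dictionaries with all four fields present."""
    rules = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        if len(row) < 2 or len(row) > 4:
            raise ValueError(f"expected 2 to 4 columns, got {len(row)}: {row}")
        # Pad missing optional columns with empty contexts.
        padded = row + [""] * (4 - len(row))
        rules.append(
            {
                "in": padded[0],
                "out": padded[1],
                "context_before": padded[2],
                "context_after": padded[3],
            }
        )
    return rules


rules = parse_rules("a,b,c,d\na,e\n")
for rule in rules:
    print(rule)
```

The first rule keeps its contexts `c` and `d`; the second gets empty strings for both.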

JSON

JSON files are written as an array of objects where each object corresponds to a new rule. The following two examples illustrate how the examples from the CSV section above would be written in JSON:

  1. This mapping describes two rules; a -> b and c -> d.
 [
   {
     "in": "a",
     "out": "b"
   },
   {
     "in": "c",
     "out": "d"
   }
 ]
  2. This mapping describes two rules; a -> b / c _ d* and a -> e
 [
   {
     "in": "a",
     "out": "b",
     "context_before": "c",
     "context_after": "d"
   },
   {
     "in": "a",
     "out": "e"
   }
 ]
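To show how context_before and context_after constrain a rule, here is a small sketch that loads the second JSON example and applies it with regular-expression lookarounds. It is a simplification under stated assumptions: `compile_rule` and `apply_rules` are hypothetical helpers, contexts are treated as literal strings, and rules are tried in file order with the first match winning, which may not match g2p's own ordering or matching behaviour.

```python
import json
import re


def compile_rule(rule):
    """Build a regex that matches the rule input only in its context."""
    before = rule.get("context_before", "")
    after = rule.get("context_after", "")
    pattern = ""
    if before:
        pattern += "(?<=" + re.escape(before) + ")"  # lookbehind: context before
    pattern += re.escape(rule["in"])
    if after:
        pattern += "(?=" + re.escape(after) + ")"  # lookahead: context after
    return re.compile(pattern), rule["out"]


def apply_rules(text, rules):
    """Scan left to right, applying the first rule that matches at each position."""
    compiled = [compile_rule(rule) for rule in rules]
    out = []
    i = 0
    while i < len(text):
        for pattern, replacement in compiled:
            match = pattern.match(text, i)
            if match and match.end() > i:
                out.append(replacement)
                i = match.end()
                break
        else:
            out.append(text[i])  # no rule applies: keep the character
            i += 1
    return "".join(out)


rules = json.loads(
    '[{"in": "a", "out": "b", "context_before": "c", "context_after": "d"},'
    ' {"in": "a", "out": "e"}]'
)
print(apply_rules("cad", rules))  # cbd
print(apply_rules("cat", rules))  # cet
```

In "cad" the a sits between c and d, so the contextual rule fires; in "cat" the context fails and the general rule a -> e applies instead.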

CLI

update

If you edit or add new mappings to the g2p.mappings.langs folder, you need to update g2p. You do this by running g2p update

convert

If you want to convert a string on the command line, you can use g2p convert <input_text> <in_lang> <out_lang>

Ex. g2p convert hej dan eng-arpabet would produce HH EH Y

generate-mapping

If your language has a mapping to IPA and you want to generate a mapping between that and the English IPA mapping, you can use g2p generate-mapping <in_lang> --ipa. Remember to run g2p update beforehand so that it has the latest mappings for your language.

Ex. g2p generate-mapping dan --ipa will produce a mapping from dan-ipa to eng-ipa. You must also run g2p update afterwards to update g2p. The resulting mapping will be added to the g2p.mappings.langs.generated folder.

Studio

You can also run the g2p Studio which is a web interface for creating custom lookup tables to be used with g2p. To run the g2p Studio either visit https://g2p-studio.herokuapp.com/ or run it locally using python run_studio.py.

Alternatively, you can run the app from the command line: g2p run

Maintainers

@roedoejet.

Contributing

Feel free to dive in! Open an issue or submit PRs.

This repo follows the Contributor Covenant Code of Conduct.

Adding a new mapping

To add a new mapping, follow these steps.

  1. Determine your language's ISO 639-3 code.
  2. Add a folder with your language's ISO 639-3 code to g2p/mappings/langs
  3. Add a configuration file at g2p/mappings/langs/<yourlangISOcode>/config.yaml. Here is the basic template for a configuration:
<<: &shared
  language_name: <This is the actual name of the language>
mappings:
  - display_name: This is a description of the mapping
    in_lang: This is your language's ISO 639-3 code
    out_lang: This is the output of the mapping
    type: mapping
    authors:
      - <YourNameHere>
    mapping: <FilenameOfMapping>
    <<: *shared
  4. Add a mapping file. Look at the other mappings for examples, or visit the g2p studio to practise your mappings. Mappings are defined in either a CSV or JSON file. See Writing mapping files for more info.
  5. After installing your local version (pip3 install -e .), update with g2p update
  6. Add some tests in g2p/tests/public/data/<YourIsoCode>.psv. Each line in the file will run a test with the following structure: <in_lang>|<out_lang>|<input_string>|<expected_output>
  7. Run python3 run_tests.py langs to make sure your tests pass.
  8. Make sure you have checked all the boxes and make a pull request (https://github.com/roedoejet/g2p/pulls)!
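The pipe-separated test format described in the steps above can be sketched as follows. This is not the project's run_tests.py; `run_psv_tests` is a hypothetical helper, and `fake_convert` is a stand-in for a real converter so that the example runs without g2p installed. The second test line is deliberately made up to show a failure.

```python
def run_psv_tests(psv_text, convert):
    """Check each <in_lang>|<out_lang>|<input_string>|<expected_output> line.

    `convert` is any callable taking (in_lang, out_lang, input_string).
    Returns a list of (line_number, input, expected, actual) for failures.
    """
    failures = []
    for line_no, line in enumerate(psv_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("|")
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 pipe-separated fields")
        in_lang, out_lang, input_string, expected = fields
        actual = convert(in_lang, out_lang, input_string)
        if actual != expected:
            failures.append((line_no, input_string, expected, actual))
    return failures


def fake_convert(in_lang, out_lang, text):
    # Stand-in converter: knows only the example from the Usage section.
    return {"hej": "HH EH Y"}.get(text, text)


psv = "dan|eng-arpabet|hej|HH EH Y\ndan|eng-arpabet|xyz|NOT REAL\n"
failures = run_psv_tests(psv, fake_convert)
print(failures)  # [(2, 'xyz', 'NOT REAL', 'xyz')]
```

With a real installation, the stand-in could be swapped for a wrapper around make_g2p that returns output_string, as shown in the Usage section.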

Adding a new language for support with ReadAlongs

This repo is used extensively by ReadAlongs. In order to make your language supported by ReadAlongs, you must add a mapping from your language's orthography to IPA. So, for example, to add Danish (ISO 639-3: dan), the steps above must be followed. The in_lang for the mapping must be dan and the out_lang must be suffixed with 'ipa' as in dan-ipa. The following is the proper configuration:

<<: &shared
  language_name: Danish
mappings:
  - display_name: Danish to IPA
    in_lang: dan
    out_lang: dan-ipa
    type: mapping
    authors:
      - Aidan Pine
    mapping: dan_to_ipa.csv
    abbreviations: dan_abbs.csv
    as_is: true
    case_sensitive: false
    norm_form: 'none'
    <<: *shared

Then, you can generate the mapping between dan-ipa and eng-ipa by running g2p generate-mapping dan --ipa. This will add the mapping to g2p/mappings/langs/generated. Do not edit this file, but feel free to have a look. Then, run g2p update and submit a pull request, and tada! Your language is supported by ReadAlongs as well!
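Before submitting, it can help to sanity-check the naming requirements stated above. The sketch below works on a plain dictionary that mirrors one entry of the mappings list; `check_readalongs_mapping` is a hypothetical helper, not part of g2p, and it checks only the conventions this section describes.

```python
def check_readalongs_mapping(mapping):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    in_lang = mapping.get("in_lang", "")
    out_lang = mapping.get("out_lang", "")
    # ISO 639-3 codes are three lowercase ASCII letters.
    if not (len(in_lang) == 3 and in_lang.isascii()
            and in_lang.isalpha() and in_lang.islower()):
        problems.append("in_lang should be a three-letter ISO 639-3 code")
    if not out_lang.endswith("ipa"):
        problems.append("out_lang should be suffixed with 'ipa'")
    if not mapping.get("mapping"):
        problems.append("mapping should name a mapping file")
    return problems


danish = {"in_lang": "dan", "out_lang": "dan-ipa", "mapping": "dan_to_ipa.csv"}
problems = check_readalongs_mapping(danish)
print(problems)  # []
```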

Contributors

This project exists thanks to all the people who contribute.

@littell. @finguist. @joanise. @eddieantonio. @dhdaines.

License

MIT © Patrick Littell, Aidan Pine
