A grapheme-to-phoneme (g2p) converter for Icelandic

Ice-g2p : Phonetic transcription (grapheme-to-phoneme) for Icelandic

Ice-g2p is a module for automatic phonetic transcription of Icelandic. It can be used as a stand-alone command line tool or as a library, for example as the final text processing step in a frontend pipeline for speech synthesis (TTS).

Ice-g2p uses a manually curated pronunciation dictionary and LSTM-based g2p-models for unknown words. It can be used to transcribe Icelandic in four pronunciation variations and also uses a special model to transcribe English words that might occur in Icelandic texts, using the Icelandic phone set.
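
The dictionary-first lookup with a model fallback can be pictured roughly as follows. This is a minimal sketch, not Ice-g2p's actual implementation: `transcribe_token`, `TOY_DICT` and `fake_g2p` are hypothetical names used only for illustration, and the two dictionary entries are copied from the command line examples further down.

```python
from typing import Callable, Mapping


def transcribe_token(
    token: str,
    pron_dict: Mapping[str, str],
    fallback: Callable[[str], str],
) -> str:
    """Return the dictionary transcription if present, else ask the fallback."""
    known = pron_dict.get(token)
    return known if known is not None else fallback(token)


# Toy dictionary (hypothetical); entries copied from the examples below.
TOY_DICT = {"þetta": "T E h t a", "takk": "t_h a h k"}


def fake_g2p(token: str) -> str:
    """Stand-in for a trained g2p model; it only marks the token as unknown."""
    return f"<g2p:{token}>"


print(transcribe_token("takk", TOY_DICT, fake_g2p))    # found in the dictionary
print(transcribe_token("heimur", TOY_DICT, fake_g2p))  # handed to the fallback
```

In the real module the fallback would be one of the LSTM-based g2p models; the stand-in here only makes the control flow visible.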

Setup

Install from PyPI (into an active virtual environment):

$ pip install ice-g2p
# Download the g2p models
$ fetch-models

Alternatively, install from source: clone the repository, create a virtual environment in the project root directory, and install the package in editable mode:

$ git clone git@github.com:grammatek/ice-g2p.git
$ cd ice-g2p
$ python3 -m venv g2p-venv
$ source g2p-venv/bin/activate
$ pip install -e .
$ fetch-models

Command line interface

The input strings/texts need to be normalized. The module only handles lowercase characters from the Icelandic alphabet, with no punctuation or other characters, unless language detection is enabled (see Flags).

Characters allowed: [aábcðdeéfghiíjklmnoóprstuúvxyýzþæö]. If other characters are found in the input, the transcription of the respective token is skipped and a notice is written to stdout.
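
To check input before passing it to the module, the allowed character set above can be applied token by token. The following is a minimal sketch under stated assumptions: `ALLOWED_CHARS` is copied from the list above, while `invalid_chars` is a hypothetical helper and not part of ice-g2p. The NFC normalization step is an added precaution so that accented letters typed as base letter plus combining accent are compared in their composed form.

```python
import unicodedata

# Character set copied from the list of allowed characters above.
ALLOWED_CHARS = set("aábcðdeéfghiíjklmnoóprstuúvxyýzþæö")


def invalid_chars(token: str) -> set:
    """Return the characters of `token` that are outside the allowed set."""
    composed = unicodedata.normalize("NFC", token)
    return set(composed) - ALLOWED_CHARS


for token in "þetta war fürir þig".split():
    bad = invalid_chars(token)
    if bad:
        print(f"{token}: would be skipped, invalid character(s) {sorted(bad)}")
```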

Two main options are currently available for transcribing text: pass an input string on the command line and get the transcription on stdout, or transcribe a file or a collection of files (a directory):

$ ice-g2p -i 'hljóðrita þetta takk'
l_0 j ou D r I t a T E h t a t_h a h k

$ ice-g2p -i 'þetta war fürir þig'
war contains non valid character(s) {'w'}, skipping transcription.
fürir contains non valid character(s) {'ü'}, skipping transcription.
T E h t a   T I: G

$ ice-g2p -if file_to_transcribe.txt

If the input is given as a string on the command line, the output is written to stdout. Input from file(s) is written to file(s) with the same name and the suffix '_transcribed.tsv'. Files are transcribed line by line, and the output lines are written in the same order.

Flags

The options available:

--infile INFILE, -if INFILE
                      input file or directory
--inputstr INPUTSTR, -i INPUTSTR
                      input string
--sep SEP_STR, -s SEP_STR
                      word separator to use; if not present, no word separators are used
--syll SYLL_STR, -y SYLL_STR
                      syllable separator to use; if not present, no syllabification is performed
# boolean arguments
--stress, -t          perform stress labeling, ONLY APPLICABLE IN COMBINATION WITH --syll ARGUMENT!
--keep, -k            keep original
--dict, -d            use pronunciation dictionary
--langdetect, -l      use word-based language detection
--phoneticalpha, -p   return the output in a specific alphabet (default: SAMPA, currently also available: IPA, SINGLE, FLITE)

Using the -k flag keeps the original grapheme strings. For file input/output, the original strings are written to the first column of the tab-separated output file and the phonetic transcription to the second. The -s flag adds the given word separator to the transcription, and the -y flag adds syllabification using the chosen separator. The word and syllable separators may be the same or different symbols; a common symbol for syllable separation is a dot (.). In combination with syllabification, stress labels can be added with the -t flag. With the -d flag, all tokens are first looked up in an existing pronunciation dictionary; the automatic g2p is then only a fallback for words not in the dictionary.

$ ice-g2p -i 'hljóðrita þetta takk' -k -s '-'
hljóðrita þetta takk : l_0 j ou D r I t a - T E h t a - t_h a h k

$ ice-g2p -i 'hljóðrita þetta takk' -k -y '.' -s '.' -t
hljóðrita þetta takk : l_0 j ou1 D . r I0 . t a0 . T E1 h . t a0 . t_h a1 h k
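
When -k is combined with a word separator, each input word can be paired with its phones again by post-processing the output line. The sketch below assumes the stdout layout shown in the first example above (original text, then " : ", then the transcription, with the separator surrounded by spaces); `align_words` is a hypothetical helper, not part of ice-g2p.

```python
from typing import List, Tuple


def align_words(line: str, word_sep: str = "-") -> List[Tuple[str, List[str]]]:
    """Pair each original word with its phones, given a '-k -s' output line."""
    # Assumed layout: "<original text> : <transcription>"
    original, transcription = line.split(" : ", 1)
    words = original.split()
    # The word separator is assumed to be a space-delimited token of its own.
    phone_groups = [group.split() for group in transcription.split(f" {word_sep} ")]
    if len(words) != len(phone_groups):
        raise ValueError("number of words and transcribed groups differ")
    return list(zip(words, phone_groups))


line = "hljóðrita þetta takk : l_0 j ou D r I t a - T E h t a - t_h a h k"
for word, phones in align_words(line):
    print(word, phones)
```

This only works when the word separator differs from the syllable separator; with identical symbols, as in the second example, word and syllable boundaries cannot be told apart.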

Using the -l flag allows for word-based language detection, where words considered foreign are transcribed by an LSTM trained on English words instead of Icelandic. If this flag is used, the module can handle common non-Icelandic characters, including all of the English alphabet:

$ ice-g2p -i 'hljóðrita þetta please'
l_0 j ou D r I t a T E h t a t_h a p_h l E: a s E

$ ice-g2p -i 'hljóðrita þetta please' -l
l_0 j ou D r I t a T E h t a p_h l i: s
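
When the command line tool is driven from a script, the flag combinations described in this section can be assembled programmatically. The following is a minimal sketch: `build_command` is a hypothetical helper, not part of ice-g2p, and it uses only the long options listed above. It also enforces the documented rule that --stress is only applicable together with --syll.

```python
from typing import List, Optional


def build_command(
    text: str,
    keep: bool = False,
    sep: Optional[str] = None,
    syll: Optional[str] = None,
    stress: bool = False,
    use_dict: bool = False,
    langdetect: bool = False,
) -> List[str]:
    """Build an ice-g2p argument list for a single input string."""
    if stress and syll is None:
        raise ValueError("--stress is only applicable together with --syll")
    cmd = ["ice-g2p", "--inputstr", text]
    if keep:
        cmd.append("--keep")
    if sep is not None:
        cmd += ["--sep", sep]
    if syll is not None:
        cmd += ["--syll", syll]
    if stress:
        cmd.append("--stress")
    if use_dict:
        cmd.append("--dict")
    if langdetect:
        cmd.append("--langdetect")
    return cmd


print(build_command("hljóðrita þetta takk", keep=True, syll=".", stress=True))
```

With ice-g2p installed, the resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` from the standard library to collect the transcription from stdout.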

Import to project

To use ice-g2p in a Python project, import the Transcriber:

from ice_g2p.transcriber import Transcriber

g2p = Transcriber()
grapheme_string = 'halló heimur'
transcribed = g2p.transcribe(grapheme_string)
# transcribed == 'h a l ou h ei: m Y r'

To use another phonetic alphabet, import the converter too:

from ice_g2p.transcriber import Transcriber
from ice_g2p.converter import Converter

g2p = Transcriber()
conv = Converter()
grapheme_string = 'góðan daginn heimur'
transcribed = g2p.transcribe(grapheme_string)
# transcribed == 'k ou: D a n t ai j I n h ei: m Y r'
converted = conv.convert(transcribed, 'SAMPA', 'IPA')
# converted == 'k ouː ð a n t ai j ɪ n h eiː m ʏ r'
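
Conceptually, a conversion between alphabets is a phone-by-phone table lookup. The sketch below is not the Converter implementation: `SAMPA_TO_IPA` is a partial, hypothetical table that covers only the phones in the example above (derived by comparing the SAMPA and IPA strings), and `sampa_to_ipa` is an illustrative helper.

```python
# Partial mapping derived from the example pair above; not a full phone set.
SAMPA_TO_IPA = {
    "k": "k", "ou:": "ouː", "D": "ð", "a": "a", "n": "n", "t": "t",
    "ai": "ai", "j": "j", "I": "ɪ", "h": "h", "ei:": "eiː", "m": "m",
    "Y": "ʏ", "r": "r",
}


def sampa_to_ipa(transcription: str) -> str:
    """Convert a space-separated SAMPA transcription phone by phone."""
    converted = []
    for phone in transcription.split():
        if phone not in SAMPA_TO_IPA:
            raise KeyError(f"phone {phone!r} is not in the partial table")
        converted.append(SAMPA_TO_IPA[phone])
    return " ".join(converted)


print(sampa_to_ipa("k ou: D a n t ai j I n h ei: m Y r"))
```

Splitting on whitespace first keeps multi-character phones such as `ou:` and `ei:` intact, which a character-by-character replacement would not.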

Data

The file sampa_ipa_single_flite.csv contains all the phonetic alphabets that have been used in Icelandic speech technology projects within the language technology program:

  • X-SAMPA
  • IPA
  • Single: A custom alphabet designed to only contain one character per phone
  • Flite: A custom alphabet for Festival/Flite that only contains ascii alphabetic characters (no ':', '_', or digits)

Troubleshooting & inquiries

This application is still in development. If you encounter any errors, feel free to open an issue in the issue tracker. You can also contact us via email.

Contributing

You can contribute to this project by forking it, creating a private branch and opening a new pull request.

License

Grammatek

Copyright © 2020, 2021 Grammatek ehf.

This software is developed under the auspices of the Icelandic Government 5-Year Language Technology Program, described here and here (English).

This software is licensed under the Apache License.
