
Augmented Recurrent Neural Grapheme-to-Phoneme conversion with Inflectional Orthography.


Aquila Resolve - Grapheme-to-Phoneme Converter



Augmented Recurrent Neural G2P with Inflectional Orthography

Grapheme-to-phoneme (G2P) conversion is the process of converting the written form of words (graphemes) to their pronunciations (phonemes). Deep learning models for text-to-speech (TTS) synthesis that use phoneme or mixed phoneme/grapheme symbols typically require a G2P conversion method for both training and inference.

Aquila Resolve presents a new approach for accurate and efficient English G2P resolution. Input text graphemes are translated into their phonetic pronunciations, using ARPAbet as the phoneme symbol set. The pipeline employs a context layer, multiple transformer and n-gram morpho-orthographical search layers, and an autoregressive recurrent neural transformer base.

The current implementation offers state-of-the-art accuracy for out-of-vocabulary (OOV) words, as well as contextual analysis for correctly inferring the pronunciation of English heteronyms.

Installation

pip install aquila-resolve

A pre-trained model checkpoint (~106 MB) is downloaded automatically the first time you use a public method that requires inference, for example when instantiating G2p. You can also start this download manually by calling Aquila_Resolve.download().

If you are in an environment where remote file downloads are not possible, you can download the checkpoint manually and pass its path when instantiating G2p: G2p(custom_checkpoint='path/model.pt')

Usage

from Aquila_Resolve import G2p

g2p = G2p(device='cuda')

g2p.convert('The book costs $5, will you read it?')
# >> '{DH AH0} {B UH1 K} {K AA1 S T S} {F AY1 V} {D AA1 L ER0 Z}, {W IH1 L} {Y UW1} {R IY1 D} {IH1 T}?'
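The default 'sds_b' output wraps each resolved word's phonemes in curly brackets and leaves punctuation outside them. If downstream code needs per-word phoneme lists from such a string, a minimal sketch is a regular expression over the bracketed groups. The helper name parse_sds_b is hypothetical and not part of Aquila Resolve; it assumes only the output format shown in the example above.

```python
import re


def parse_sds_b(line: str) -> list[list[str]]:
    """Split an 'sds_b' string into one list of ARPAbet phonemes per bracketed word.

    Text outside the curly brackets (such as punctuation) is ignored.
    """
    # Each resolved word is wrapped in {...}; capture what is inside each group.
    groups = re.findall(r"\{([^}]*)\}", line)
    # Phonemes inside a group are separated by spaces.
    return [group.split() for group in groups]


output = "{DH AH0} {B UH1 K} {K AA1 S T S} {F AY1 V} {D AA1 L ER0 Z}, {W IH1 L} {Y UW1} {R IY1 D} {IH1 T}?"
words = parse_sds_b(output)
print(words[1])    # ['B', 'UH1', 'K']
print(len(words))  # 9
```

If you only need lists, setting ph_format='list' on G2p avoids this post-processing entirely.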

Additional optional parameters are available when defining a G2p instance:

| Parameter | Default | Description |
|---|---|---|
| device | 'cpu' | Device for the PyTorch inference model. |
| ph_format | 'sds_b' | Phoneme output format: 'sds' (space-delimited), 'sds_b' (space-delimited, with each word in curly brackets), or 'list' (list of individual phonemes). |
| cmu_dict_path | None | Path to a custom CMUDict .dict file. |
| h2p_dict_path | None | Path to a custom heteronym dictionary .json file. See heteronyms.json for the expected format. |
| cmu_multi_mode | 0 | Default selection index for CMUDict entries with multiple pronunciations, as denoted by the (1) through (n) format. |
| process_numbers | True | Toggles conversion of some numbers and symbols to their spoken forms. See numbers.py for details on what is covered. |
| unresolved_mode | 'keep' | How unresolved words are handled: 'keep' keeps the text-form word in the output; 'remove' removes it from the output; 'drop' returns the whole line as None if any word is unresolved. |
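In the CMUDict format, alternate pronunciations of a word appear as extra entries suffixed with (1), (2), and so on. The sketch below shows one way such entries can be grouped so that an index like cmu_multi_mode selects among them. It assumes index 0 is the unsuffixed entry, index n is the WORD(n) entry, and that an out-of-range index falls back to the base entry; these mappings, and the helper names load_cmudict and select_pronunciation, are illustrative assumptions rather than Aquila Resolve's actual loader.

```python
import re
from collections import defaultdict

# "WORD  PHONEMES" or "WORD(n)  PHONEMES"; (n) marks the n-th alternate pronunciation.
_ENTRY = re.compile(r"^(?P<word>\S+?)(?:\((?P<n>\d+)\))?\s+(?P<phones>.+)$")


def load_cmudict(lines):
    """Map each lowercased word to its pronunciations: base entry first, then (1), (2), ..."""
    variants = defaultdict(dict)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";;;"):  # skip blank and comment lines
            continue
        match = _ENTRY.match(line)
        if match is None:
            continue
        index = int(match.group("n") or 0)  # base entry -> 0, WORD(1) -> 1, ...
        variants[match.group("word").lower()][index] = match.group("phones").split()
    return {word: [prons[i] for i in sorted(prons)] for word, prons in variants.items()}


def select_pronunciation(entries, word, multi_mode=0):
    """Return pronunciation number `multi_mode`; fall back to the base entry (assumed behavior)."""
    prons = entries.get(word.lower())
    if prons is None:
        return None
    return prons[multi_mode] if 0 <= multi_mode < len(prons) else prons[0]


sample = [
    ";;; sample lines in CMUDict format",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
]
entries = load_cmudict(sample)
print(select_pronunciation(entries, "read"))     # ['R', 'EH1', 'D']
print(select_pronunciation(entries, "read", 1))  # ['R', 'IY1', 'D']
```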

Model Architecture

In evaluation[^1], neural G2P models have been shown to be extremely sensitive to orthographical variations in graphemes. Attention-based mapping of contextual recognition has also traditionally performed poorly for languages like English, where the correlation between graphemes and phonemes is weak[^2]. Furthermore, both static methods (e.g. the CMU Dictionary) and dynamic methods (e.g. G2p-seq2seq, Phonetisaurus, DeepPhonemizer) lose sentence context during tokenization for training and inference. This makes it impossible for them to accurately resolve words whose pronunciation depends on grammatical context (heteronyms).
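To make the heteronym problem concrete: once a sentence is tokenized into isolated words, a lookup for "read" can only return every candidate, while a part-of-speech tag from the surrounding sentence is enough to pick one. The table layout below (word mapped to Penn Treebank tags mapped to pronunciations) and the helper names are hypothetical; they do not reproduce the heteronyms.json format or Aquila Resolve's resolution logic, and the tags are written by hand instead of coming from a tagger.

```python
# Hypothetical heteronym table: word -> {Penn Treebank POS tag: ARPAbet pronunciation}.
HETERONYMS = {
    "read": {"VB": "R IY1 D", "VBP": "R IY1 D", "VBD": "R EH1 D", "VBN": "R EH1 D"},
    "record": {"NN": "R EH1 K ER0 D", "VB": "R IH0 K AO1 R D"},
}


def context_free_lookup(word):
    """What a context-free lookup can offer: every candidate, with no way to choose."""
    return sorted(set(HETERONYMS.get(word.lower(), {}).values()))


def resolve_heteronym(word, pos_tag):
    """Pick a pronunciation from the POS tag: exact tag first, then its two-letter prefix."""
    options = HETERONYMS.get(word.lower())
    if options is None:
        return None
    return options.get(pos_tag) or options.get(pos_tag[:2])


print(context_free_lookup("read"))         # ['R EH1 D', 'R IY1 D']
print(resolve_heteronym("read", "VBD"))    # R EH1 D  ("I read it yesterday")
print(resolve_heteronym("record", "VBP"))  # R IH0 K AO1 R D  ("They record it")
```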

This model attempts to address these issues while optimizing inference accuracy and run-time speed. The current architecture adds natural language analysis steps, including part-of-speech (POS) tagging, n-gram segmentation, lemmatization searches, and word stem analysis. Some layers, such as POS tagging, apply to all text, while others are activated only when required for the requested word. Layer information is retained with each token through vectorized and tensor operations. This allows morphological variations of seen words, such as plurals, possessives, compounds, inflectional stem affixes, and lemma variations, to be resolved with near ground-truth accuracy. It also improves out-of-vocabulary (OOV) inference accuracy by truncating the size and characteristics of individual tensors so that they are closer to seen data.
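As a simplified illustration of why morphological layers help, the rule-based sketch below resolves plurals and possessives of known words by stripping the ending, looking up the stem, and appending the regular English -s ending (IH0 Z after sibilants, S after voiceless consonants, Z otherwise). The mini-lexicon, the suffix list, and the function names are hypothetical stand-ins; this is not Aquila Resolve's implementation, which the text above describes as operating on tokens and tensors rather than on hand-written rules.

```python
# Hypothetical mini-lexicon standing in for CMUDict.
LEXICON = {
    "cat": ["K", "AE1", "T"],
    "dog": ["D", "AO1", "G"],
    "bus": ["B", "AH1", "S"],
}

SIBILANTS = {"S", "Z", "SH", "ZH", "CH", "JH"}
VOICELESS = {"P", "T", "K", "F", "TH"}


def s_ending(stem):
    """Phonemes for an -s / -es / -'s ending, chosen by the stem's final phoneme."""
    last = stem[-1]
    if last in SIBILANTS:
        return ["IH0", "Z"]  # bus -> buses
    if last in VOICELESS:
        return ["S"]         # cat -> cats
    return ["Z"]             # dog -> dogs


def lookup_with_affixes(word):
    """Dictionary lookup, falling back to stripping a plural or possessive ending."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    for suffix in ("'s", "es", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in LEXICON:
                return LEXICON[stem] + s_ending(LEXICON[stem])
    return None  # unresolved; in this sketch it would be handed to a neural model


print(lookup_with_affixes("buses"))  # ['B', 'AH1', 'S', 'IH0', 'Z']
```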

The inference layer is an autoregressive implementation of the forward DeepPhonemizer model: a 4-layer transformer with 256 hidden units. The pre-trained checkpoint for Aquila Resolve was trained on the CMU Dict v0.7b corpus, with 126,456 unique words. The validation set was a uniform 5% sample of unique words, sorted by grapheme length. The learning rate was increased linearly during the warmup steps and decreased stepwise during fine-tuning.
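The learning-rate schedule described above (linear warmup, then step decay) can be sketched as a plain function of the training step. The values of base_lr, warmup_steps, decay_every, and gamma below are placeholders chosen for illustration; the actual hyperparameters used for the Aquila Resolve checkpoint are not stated here.

```python
def learning_rate(step, base_lr=1e-4, warmup_steps=10_000, decay_every=20_000, gamma=0.5):
    """Linear warmup to base_lr, then multiply by `gamma` every `decay_every` steps."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Number of completed decay intervals since warmup ended.
    n_decays = (step - warmup_steps) // decay_every
    return base_lr * gamma ** n_decays


for step in (0, 5_000, 9_999, 10_000, 30_000, 50_000):
    print(step, learning_rate(step))
```

In PyTorch, the same shape can be applied as a multiplier on the optimizer's base rate with torch.optim.lr_scheduler.LambdaLR, passing lr_lambda=lambda s: learning_rate(s) / base_lr.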

Symbol Set

The two-letter ARPAbet symbol set is used, with numbered vowel stress markers (0 = no stress, 1 = primary stress, 2 = secondary stress).
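Because stress is encoded as a trailing digit on vowels only, splitting a symbol into its base and stress level is a small string operation. The sketch below checks only the shape of a symbol (one or two capital letters plus an optional 0, 1, or 2), not membership in the full ARPAbet inventory; the function names are hypothetical helpers, not part of Aquila Resolve.

```python
import re

# Shape of an ARPAbet symbol: one or two capital letters, optionally a stress digit.
_SYMBOL = re.compile(r"^([A-Z]{1,2})([012])?$")


def split_stress(phoneme):
    """Split e.g. 'AA1' into ('AA', 1); consonants such as 'K' give ('K', None)."""
    match = _SYMBOL.match(phoneme)
    if match is None:
        raise ValueError(f"not a well-formed ARPAbet symbol: {phoneme!r}")
    base, digit = match.groups()
    return base, (int(digit) if digit is not None else None)


def primary_stress_index(phonemes):
    """Position of the first phoneme with primary stress (1), or None if there is none."""
    for i, phoneme in enumerate(phonemes):
        if split_stress(phoneme)[1] == 1:
            return i
    return None


print(split_stress("AA1"))  # ('AA', 1)
print(split_stress("DH"))   # ('DH', None)
print(primary_stress_index(["R", "IH0", "K", "AO1", "R", "D"]))  # 3
```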

Vowels

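| Phoneme | Example | Phoneme | Example | Phoneme | Example | Phoneme | Example |
|---|---|---|---|---|---|---|---|
| AA0 | Balm | AW0 | Ourself | EY0 | Mayday | OY0 | |
| AA1 | Bot | AW1 | Shout | EY1 | Mayday | OY1 | |
| AA2 | Cot | AW2 | Outdo | EY2 | Airfreight | OY2 | |
| AE0 | Bat | AY0 | Ally | IH0 | Cooking | UH0 | |
| AE1 | Fast | AY1 | Bias | IH1 | Exist | UH1 | |
| AE2 | Midland | AY2 | Alibi | IH2 | Outfit | UH2 | |
| AH0 | Central | EH0 | Enroll | IY0 | Lady | UW0 | |
| AH1 | Chunk | EH1 | Bless | IY1 | Beak | UW1 | |
| AH2 | Outcome | EH2 | Telex | IY2 | Turnkey | UW2 | |
| AO0 | Story | ER0 | Chapter | OW0 | Reo | | |
| AO1 | Adore | ER1 | Verb | OW1 | So | | |
| AO2 | Blog | ER2 | Catcher | OW2 | Cargo | | |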

License

The code in this project is released under the Apache License 2.0.


References

[^1]: r-G2P: Evaluating and Enhancing Robustness of Grapheme to Phoneme Conversion by Controlled noise introducing and Contextual information incorporation

[^2]: OTEANN: Estimating the Transparency of Orthographies with an Artificial Neural Network



Download files

Built distribution: Aquila_Resolve-0.1.1-py3-none-any.whl (966.2 kB, Python 3). No source distribution is available for this release.
