g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese

g2pM

This is the official repository for our paper, A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset (Interspeech 2020).

Install

pip install g2pM

The CPP Dataset

The data folder contains [train/dev/test].sent files and [train/dev/test].lb files. In each *.sent file, every line corresponds to one sentence, and a special symbol ▁ (U+2581) is added to the left and right of a polyphonic character. The pronunciation of that character is on the same line of the corresponding *.lb file. A sentence may contain several polyphonic characters, but only one of them, chosen at random, is annotated.
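The file layout described above can be read with a few lines of Python. This is a minimal sketch, not code from the package: the function name iter_cpp_examples is hypothetical, and the label "xing2" in the usage note is only an illustrative value.

```python
from typing import Iterable, Iterator, Tuple

MARKER = "\u2581"  # the special symbol ▁ that brackets the annotated character


def iter_cpp_examples(
    sent_lines: Iterable[str], lb_lines: Iterable[str]
) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (clean_sentence, char_index, character, pronunciation) per line pair.

    Hypothetical helper: pairs line i of a *.sent file with line i of the
    matching *.lb file and removes the two markers from the sentence.
    """
    for sent, label in zip(sent_lines, lb_lines):
        sent = sent.rstrip("\n")
        left = sent.index(MARKER)              # marker before the character
        right = sent.index(MARKER, left + 1)   # marker after the character
        char = sent[left + 1:right]
        clean = sent[:left] + char + sent[right + 1:]
        # After removing the left marker, the character sits at index `left`.
        yield clean, left, char, label.strip()
```

For example, passing the single line "他▁行▁不行" together with the label line "xing2" yields ("他行不行", 1, "行", "xing2"). To read real files, open a .sent file and its .lb file with encoding="utf-8" and pass both file objects to the function.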

Requirements

  • python >= 3.6
  • numpy

Usage

To remove the digits that denote tones, set tone=False. The default is tone=True.
To split non-Chinese characters (e.g. digits) into individual characters, set char_split=True. The default is char_split=False.

>>> from g2pM import G2pM
>>> model = G2pM()
>>> sentence = "然而,他红了20年以后,他竟退出了大家的视线。"
>>> model(sentence, tone=True, char_split=False)
['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '20', 'nian2', 'yi3', 'hou4', ',', 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', '。']
>>> model(sentence, tone=False, char_split=False)
['ran', 'er', ',', 'ta', 'hong', 'le', '2', '0', 'nian', 'yi', 'hou', ',', 'ta', 'jing', 'tui', 'chu', 'le', 'da', 'jia', 'de', 'shi', 'xian', '。']
>>> model(sentence, tone=True, char_split=True)
['ran2', 'er2', ',', 'ta1', 'hong2', 'le5', '2', '0', 'nian2', 'yi3', 'hou4', ',', 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', '。']
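
When tone=True, each pinyin token ends in a tone digit. The sketch below separates the syllable from its tone number in such a token list. It is post-processing on the returned list, not part of the g2pM API: the names split_tone and split_tones are hypothetical, and allowing ":" inside a syllable (for a "u:" spelling of ü) is an assumption.

```python
import re
from typing import List, Optional, Tuple

# A toned token: lowercase letters (":" allowed as an assumption) plus one tone digit 1-5.
_TONED = re.compile(r"^([a-z:]+)([1-5])$")


def split_tone(token: str) -> Tuple[str, Optional[int]]:
    """Return (syllable, tone) for a toned token, or (token, None) otherwise."""
    match = _TONED.match(token)
    if match is None:
        # Punctuation, digits such as '20', and other text pass through unchanged.
        return token, None
    return match.group(1), int(match.group(2))


def split_tones(tokens: List[str]) -> List[Tuple[str, Optional[int]]]:
    """Apply split_tone to every token in a list such as the output of model(...)."""
    return [split_tone(token) for token in tokens]
```

Applied to ['ran2', ',', '20', 'le5'], this returns [('ran', 2), (',', None), ('20', None), ('le', 5)].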

Model Size

Layer                    Size
Embedding                64
LSTM x1                  64
Fully-Connected x2       64
Total # of parameters    477,228
Model size               1.7MB
Package size             2.1MB

Evaluation Result

Model               Dev.     Test
g2pC                84.84    84.45
xpinyin(0.5.6)      78.74    78.56
pypinyin(0.36.0)    85.44    86.13
Majority Vote       92.15    92.08
Chinese BERT        97.95    97.85
Ours                97.36    97.31
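
To make the Majority Vote row concrete, here is a minimal sketch of one way such a baseline could work: for each polyphonic character, always predict the pronunciation seen most often in the training data. This is an assumption about the baseline, not the authors' evaluation code. The names fit_majority and accuracy are hypothetical, and it also assumes the scores are accuracy percentages.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Example = Tuple[str, str]  # (polyphonic character, gold pronunciation)


def fit_majority(train: Iterable[Example]) -> Dict[str, str]:
    """Map each character to its most frequent pronunciation in the training data."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for char, label in train:
        counts[char][label] += 1
    return {char: c.most_common(1)[0][0] for char, c in counts.items()}


def accuracy(table: Dict[str, str], test: List[Example]) -> float:
    """Percentage of test examples whose predicted pronunciation equals the gold label.

    Characters never seen in training have no prediction and count as errors.
    """
    correct = sum(table.get(char) == label for char, label in test)
    return 100.0 * correct / len(test)
```

The examples could come from the *.sent and *.lb files described in the CPP dataset section, using the annotated character and its label from each line pair.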

Reference

To cite the code, data, or paper, please use this BibTeX entry:

@article{park2020g2pm,
 author={Park, Kyubyong and Lee, Seanie},
 title = {A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset},
 journal={Proc. Interspeech 2020},
 url = {https://arxiv.org/abs/2004.03136},
 year = {2020}
}
