
g2pW: Mandarin Grapheme-to-Phoneme Converter


Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Yeh

This is the official repository of our paper "g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin" (INTERSPEECH 2022).


Getting Started

Dependency / Install

(This work was tested with PyTorch 1.7.0, CUDA 10.1, Python 3.6, and Ubuntu 16.04.)

  • Install PyTorch

  • $ pip install g2pw

Quick Demo


>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter()
>>> sentence = '上校請技術人員校正FN儀器'
>>> conv(sentence)
[['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2', 'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']]
>>> sentences = ['銀行', '行動']
>>> conv(sentences)
[['ㄧㄣ2', 'ㄏㄤ2'], ['ㄒㄧㄥ2', 'ㄉㄨㄥ4']]
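
As the demo shows, the converter returns one list per sentence, aligned one-to-one with the input characters; symbols it does not transcribe (here 'F' and 'N') come back as None. A minimal sketch that pairs each character with its prediction (the align helper is our own, not part of the g2pW API):

from g2pw import G2PWConverter

conv = G2PWConverter()

def align(sentence):
    # One syllable per character; keep the raw character where g2pW
    # returns None (Latin letters, digits, punctuation).
    phonemes = conv(sentence)[0]
    return [(ch, ph if ph is not None else ch) for ch, ph in zip(sentence, phonemes)]

print(align('上校請技術人員校正FN儀器'))
# [('上', 'ㄕㄤ4'), ('校', 'ㄒㄧㄠ4'), ..., ('F', 'F'), ('N', 'N'), ...]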

Load Offline Model

To run offline, point the converter at a local copy of the g2pW model and of the bert-base-chinese encoder (the second path below is a placeholder):

conv = G2PWConverter(model_dir='./G2PWModel-v2-onnx/', model_source='./path-to/bert-base-chinese/')

Support for Simplified Chinese and Pinyin

>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter(style='pinyin', enable_non_tradional_chinese=True)
>>> conv('然而,他红了20年以后,他竟退出了大家的视线。')
[['ran2', 'er2', None, 'ta1', 'hong2', 'le5', None, None, 'nian2', 'yi3', 'hou4', None, 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', None]]
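
The same None convention applies in pinyin style. If a downstream pipeline needs a pronunciation for every CJK character, one option is to backfill the None positions with a dictionary-based converter. A sketch assuming the third-party pypinyin package (pip install pypinyin); the g2p_with_fallback helper and the CJK range check are our own, not part of the g2pW API:

from pypinyin import lazy_pinyin, Style
from g2pw import G2PWConverter

conv = G2PWConverter(style='pinyin', enable_non_tradional_chinese=True)

def g2p_with_fallback(text):
    phonemes = conv(text)[0]
    out = []
    for ch, ph in zip(text, phonemes):
        if ph is None and '\u4e00' <= ch <= '\u9fff':
            # g2pW skipped a CJK character; use pypinyin's tone-numbered
            # style ('hong2', neutral tone rendered as '5') to match g2pW.
            ph = lazy_pinyin(ch, style=Style.TONE3, neutral_tone_with_five=True)[0]
        out.append(ph)
    return out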

Scripts

To use the training and evaluation scripts below, clone the GitHub repository:

$ git clone https://github.com/GitYCC/g2pW.git

Train Model

For example, to train a model on the CPP dataset:

$ bash cpp_dataset/download.sh
$ python scripts/train_g2p_bert.py --config configs/config_cpp.py
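
The prediction and testing commands below assume training produced a directory under saved_models/ containing (at least) a copy of the config and the best checkpoint:

$ ls saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/
best_accuracy.pth  config.py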

Prediction

$ python scripts/predict_g2p_bert.py \
    --config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
    --checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
    --sent_path cpp_dataset/test.sent \
    --output_path output_pred.txt

Testing

$ python scripts/test_g2p_bert.py \
    --config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
    --checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
    --sent_path cpp_dataset/test.sent \
    --lb_path cpp_dataset/test.lb
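
To turn the predictions from the previous step into an accuracy number by hand, one can compare the prediction file with the gold labels. A sketch assuming both output_pred.txt and cpp_dataset/test.lb hold one pronunciation label per line, in the same order (adjust if the file formats differ):

# Hypothetical scoring sketch: line-by-line exact match against gold labels.
with open('output_pred.txt') as f_pred, open('cpp_dataset/test.lb') as f_gold:
    pairs = [(p.strip(), g.strip()) for p, g in zip(f_pred, f_gold)]
correct = sum(p == g for p, g in pairs)
print(f'accuracy = {correct / len(pairs):.4f}')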


Citation

To cite the code/data/paper, please use this BibTeX:

@article{chen2022g2pw,
    author = {Yi-Chang Chen and Yu-Chuan Chang and Yen-Cheng Chang and Yi-Ren Yeh},
    title = {g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin},
    journal = {Proc. Interspeech 2022},
    url = {https://arxiv.org/abs/2203.10430},
    year = {2022}
}

