g2pW: Mandarin Grapheme-to-Phoneme Converter
Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Yeh
This is the official repository of our paper g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin (INTERSPEECH 2022).
News
- g2pW is included in PaddlePaddle/PaddleSpeech
- g2pW is included in mozillazg/pypinyin-g2pW
Getting Started
Dependency / Install
(This work was tested with PyTorch 1.7.0, CUDA 10.1, Python 3.6, and Ubuntu 16.04.)
- Install PyTorch
- $ pip install g2pw
Quick Demo
>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter()
>>> sentence = '上校請技術人員校正FN儀器'
>>> conv(sentence)
[['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2', 'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']]
>>> sentences = ['銀行', '行動']
>>> conv(sentences)
[['ㄧㄣ2', 'ㄏㄤ2'], ['ㄒㄧㄥ2', 'ㄉㄨㄥ4']]
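In the demo above, each sentence yields one entry per input character, with None for characters that receive no prediction (here the Latin letters F and N). The sketch below, which assumes that one-to-one alignment holds in general (it is inferred from the demo output, not documented here), pairs each character with its reading. The helper name `pair_with_readings` is hypothetical and not part of g2pw; the readings are copied from the demo so the snippet runs without downloading a model.

```python
def pair_with_readings(sentence, readings):
    """Pair each character of `sentence` with its predicted reading.

    Assumes `readings` holds exactly one entry per character, as in the
    Quick Demo output. Entries that are None are kept as None.
    """
    if len(sentence) != len(readings):
        raise ValueError(
            f"expected {len(sentence)} readings, got {len(readings)}"
        )
    return list(zip(sentence, readings))


# Output of conv(sentence)[0] copied from the Quick Demo.
sentence = '上校請技術人員校正FN儀器'
readings = ['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2',
            'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']

pairs = pair_with_readings(sentence, readings)

# The polyphone 校 appears twice with different readings.
for char, reading in pairs:
    if char == '校':
        print(char, reading)
```

Keeping None in the pairs, rather than dropping those positions, preserves the alignment with the original text so that downstream code can decide how to handle unpredicted characters.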
Load Offline Model
conv = G2PWConverter(model_dir='./G2PWModel-v2-onnx/', model_source='./path-to/bert-base-chinese/')
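When loading offline, a mistyped path is an easy mistake to make. As a minimal sketch, assuming both `model_dir` and `model_source` point at local directories (as the example paths above suggest), the following helper checks that they exist before the converter is constructed. The function name `offline_converter_kwargs` is hypothetical and not part of g2pw.

```python
from pathlib import Path


def offline_converter_kwargs(model_dir, model_source):
    """Check both paths and return keyword arguments for G2PWConverter.

    Assumption: both arguments are directories on the local filesystem.
    """
    for name, path in (('model_dir', model_dir), ('model_source', model_source)):
        if not Path(path).is_dir():
            raise FileNotFoundError(f'{name} is not a directory: {path}')
    return {'model_dir': str(model_dir), 'model_source': str(model_source)}


# Intended usage (requires g2pw and the downloaded model files):
# from g2pw import G2PWConverter
# conv = G2PWConverter(**offline_converter_kwargs(
#     './G2PWModel-v2-onnx/', './path-to/bert-base-chinese/'))
```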
Support Simplified Chinese and Pinyin
>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter(style='pinyin', enable_non_tradional_chinese=True)
>>> conv('然而,他红了20年以后,他竟退出了大家的视线。')
[['ran2', 'er2', None, 'ta1', 'hong2', 'le5', None, None, 'nian2', 'yi3', 'hou4', None, 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', None]]
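With style='pinyin', the output above uses a trailing digit for the tone (5 for the neutral tone). If tone-marked pinyin is preferred, the digits can be converted with the standard placement rule: mark a or e if present, otherwise the o in ou, otherwise the last vowel. The sketch below is a post-processing helper, not part of g2pw; the name `to_tone_marks` is hypothetical, and how the library spells ü is not shown in this document, so only a literal ü is handled.

```python
TONE_MARKS = {
    'a': 'āáǎà', 'e': 'ēéěè', 'i': 'īíǐì',
    'o': 'ōóǒò', 'u': 'ūúǔù', 'ü': 'ǖǘǚǜ',
}


def to_tone_marks(syllable):
    """Convert numbered pinyin such as 'hong2' to tone-marked 'hóng'.

    None (no prediction) and strings without a trailing digit are
    returned unchanged. Tones other than 1-4 drop the digit.
    """
    if not syllable or not syllable[-1].isdigit():
        return syllable
    base, tone = syllable[:-1], int(syllable[-1])
    if tone not in (1, 2, 3, 4):
        return base  # neutral tone: no mark
    if 'a' in base:
        idx = base.index('a')
    elif 'e' in base:
        idx = base.index('e')
    elif 'ou' in base:
        idx = base.index('ou')
    else:
        # Otherwise the mark goes on the last vowel.
        idx = max(base.rfind(v) for v in 'iouü')
        if idx == -1:
            return base
    return base[:idx] + TONE_MARKS[base[idx]][tone - 1] + base[idx + 1:]


# First few entries of the pinyin demo output above.
row = ['ran2', 'er2', None, 'ta1', 'hong2', 'le5']
marked = [to_tone_marks(s) for s in row]
```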
Scripts
$ git clone https://github.com/GitYCC/g2pW.git
Train Model
For example, to train a model on the CPP dataset:
$ bash cpp_dataset/download.sh
$ python scripts/train_g2p_bert.py --config configs/config_cpp.py
Prediction
$ python scripts/predict_g2p_bert.py \
--config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
--checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
--sent_path cpp_dataset/test.sent \
--output_path output_pred.txt
Testing
$ python scripts/test_g2p_bert.py \
--config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
--checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
--sent_path cpp_dataset/test.sent \
--lb_path cpp_dataset/test.lb
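For a quick sanity check outside the scripts, predictions can be compared against gold labels directly. The sketch below is not the repository's evaluation code. It assumes that both a prediction file such as output_pred.txt and a label file such as cpp_dataset/test.lb contain one label per line in the same order, which this document does not state, so verify the formats before relying on it. The helper names `read_labels` and `accuracy` are hypothetical.

```python
def read_labels(path):
    """Read one label per non-empty line (assumed file format)."""
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError(
            f'length mismatch: {len(predicted)} predictions, {len(gold)} labels'
        )
    if not gold:
        raise ValueError('no labels to evaluate')
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Intended usage, under the file-format assumption above:
# acc = accuracy(read_labels('output_pred.txt'),
#                read_labels('cpp_dataset/test.lb'))
```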
Checkpoints
Citation
To cite the code, data, or paper, please use the following BibTeX entry:
@article{chen2022g2pw,
author={Yi-Chang Chen and Yu-Chuan Chang and Yen-Cheng Chang and Yi-Ren Yeh},
title = {g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin},
journal={Proc. Interspeech 2022},
url = {https://arxiv.org/abs/2203.10430},
year = {2022}
}