g2pW: Mandarin Grapheme-to-Phoneme Converter
Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Yeh
This is the official repository of our paper "g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin" (INTERSPEECH 2022).
News
- g2pW is included in PaddlePaddle/PaddleSpeech
- g2pW is included in mozillazg/pypinyin-g2pW
Getting Started
Dependency / Install
(This work was tested with PyTorch 1.7.0, CUDA 10.1, Python 3.6, and Ubuntu 16.04.)
- Install PyTorch
- $ pip install g2pw
Quick Demo
>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter()
>>> sentence = '上校請技術人員校正FN儀器'
>>> conv(sentence)
[['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2', 'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']]
>>> sentences = ['銀行', '行動']
>>> conv(sentences)
[['ㄧㄣ2', 'ㄏㄤ2'], ['ㄒㄧㄥ2', 'ㄉㄨㄥ4']]
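The demo output contains one entry per input character, with None for characters that receive no reading (here the Latin letters F and N). A minimal sketch of pairing characters with their readings follows. It assumes this one-to-one alignment, which is inferred from the sample output rather than stated by the library. `align_chars` is a hypothetical helper, not part of g2pw, and the data is copied from the demo so the snippet runs without the model.

```python
def align_chars(sentence, phonemes):
    """Pair each character of `sentence` with its predicted reading.

    Hypothetical helper (not part of g2pw). Assumes the converter returns
    exactly one entry per input character, as in the demo output.
    """
    if len(sentence) != len(phonemes):
        raise ValueError("expected one prediction per character")
    return list(zip(sentence, phonemes))


# Sample data copied from the Quick Demo output.
sentence = '上校請技術人員校正FN儀器'
phonemes = ['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2',
            'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']

pairs = align_chars(sentence, phonemes)

# Keep only characters that received a reading (drops 'F' and 'N' here).
with_reading = [(ch, p) for ch, p in pairs if p is not None]
```

In this sample the polyphone 校 appears twice with different readings (`pairs[1]` and `pairs[7]`), which is the disambiguation the model is designed to perform.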
Load Offline Model
conv = G2PWConverter(model_dir='./G2PWModel-v2-onnx/', model_source='./path-to/bert-base-chinese/')
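Both arguments in the offline call point at local paths, so a mistyped directory is an easy mistake to make. The sketch below checks the paths before they are handed to the converter. `offline_kwargs` is a hypothetical helper, and the requirement that both values be existing directories is an assumption based on the example paths, not documented behavior.

```python
from pathlib import Path


def offline_kwargs(model_dir, model_source):
    """Validate local paths and return keyword arguments for G2PWConverter.

    Hypothetical helper (not part of g2pw). Assumes both values should be
    existing directories, as the example paths suggest.
    """
    for name, value in (("model_dir", model_dir), ("model_source", model_source)):
        if not Path(value).is_dir():
            raise FileNotFoundError(f"{name} is not a directory: {value}")
    return {"model_dir": str(model_dir), "model_source": str(model_source)}


# Usage (requires g2pw and the downloaded model files):
# from g2pw import G2PWConverter
# conv = G2PWConverter(**offline_kwargs('./G2PWModel-v2-onnx/',
#                                       './path-to/bert-base-chinese/'))
```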
Support Simplified Chinese and Pinyin
>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter(style='pinyin', enable_non_tradional_chinese=True)
>>> conv('然而,他红了20年以后,他竟退出了大家的视线。')
[['ran2', 'er2', None, 'ta1', 'hong2', 'le5', None, None, 'nian2', 'yi3', 'hou4', None, 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', None]]
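With `style='pinyin'`, each reading in the sample output is a syllable followed by a single tone digit, and 5 appears on the particles 了 and 的, which suggests it marks the neutral tone. A minimal sketch of separating syllable and tone follows. `split_tone` is a hypothetical helper, and the trailing-digit format is inferred from the sample output only.

```python
def split_tone(reading):
    """Split a reading such as 'ran2' into ('ran', 2).

    Hypothetical helper (not part of g2pw). Assumes the tone is a single
    trailing digit, as in the sample output. None is passed through for
    characters without a reading.
    """
    if reading is None:
        return None
    if reading and reading[-1].isdigit():
        return reading[:-1], int(reading[-1])
    # No trailing digit: return the syllable with an unknown tone.
    return reading, None


# First few entries copied from the sample output.
readings = ['ran2', 'er2', None, 'ta1', 'hong2', 'le5']
parsed = [split_tone(r) for r in readings]
```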
Scripts
$ git clone https://github.com/GitYCC/g2pW.git
Train Model
For example, to train a model on the CPP dataset:
$ bash cpp_dataset/download.sh
$ python scripts/train_g2p_bert.py --config configs/config_cpp.py
Prediction
$ python scripts/predict_g2p_bert.py \
    --config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
    --checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
    --sent_path cpp_dataset/test.sent \
    --output_path output_pred.txt
Testing
$ python scripts/test_g2p_bert.py \
    --config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
    --checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
    --sent_path cpp_dataset/test.sent \
    --lb_path cpp_dataset/test.lb
Checkpoints
Citation
To cite the code, data, or paper, please use this BibTeX entry:
@article{chen2022g2pw,
author={Yi-Chang Chen and Yu-Chuan Chang and Yen-Cheng Chang and Yi-Ren Yeh},
title = {g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin},
journal={Proc. Interspeech 2022},
url = {https://arxiv.org/abs/2203.10430},
year = {2022}
}