Pinyin Tokenizer
Pinyin tokenizer (拼音分词器): splits continuous pinyin into a list of single-syllable pinyin. Works out of the box. Developed in Python 3.
Guide
Feature
- Uses a prefix tree (PyTrie) to efficiently split continuous pinyin into a list of single-syllable pinyin, which simplifies downstream processing such as pinyin-to-Hanzi conversion.
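To see why a prefix tree helps, here is a minimal sketch of the longest-match idea. This is illustrative only, not the library's actual PyTrie-based implementation; the tiny syllable set and function names below are made up:

```python
# Minimal longest-match pinyin splitter over a nested-dict prefix tree.
# The syllable set is a tiny illustrative sample, not the full pinyin table.

SYLLABLES = {"wo", "ni", "hao", "liu", "de", "hua", "lv", "you"}

def build_trie(words):
    """Build a nested-dict prefix tree; '$' marks the end of a syllable."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def tokenize(text, trie):
    """Greedy longest-match scan: return (pinyin_list, invalid_list)."""
    pinyins, errors = [], []
    i = 0
    while i < len(text):
        node, j, last_end = trie, i, -1
        # Walk the trie as far as the input allows, remembering the
        # last position where a complete syllable ended.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "$" in node:
                last_end = j
        if last_end > i:
            pinyins.append(text[i:last_end])
            i = last_end
        else:
            errors.append(text[i])  # not the start of any known syllable
            i += 1
    return pinyins, errors

trie = build_trie(SYLLABLES)
print(tokenize("liudehua", trie))  # (['liu', 'de', 'hua'], [])
print(tokenize("nihao3", trie))    # (['ni', 'hao'], ['3'])
```

Because every syllable shares its prefix path in the trie, each scan position needs only one forward walk instead of testing every syllable in the table.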
Install
- Requirements and Installation
pip install pinyintokenizer
or
git clone https://github.com/shibing624/pinyin-tokenizer.git
cd pinyin-tokenizer
python setup.py install
Usage
Pinyin tokenization (Pinyin Tokenizer)
example: examples/pinyin_tokenize_demo.py:

import sys

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

if __name__ == '__main__':
    m = PinyinTokenizer()
    print(f"{m.tokenize('wo3')}")
    print(f"{m.tokenize('nihao')}")
    print(f"{m.tokenize('lv3you2')}")  # 旅游
    print(f"{m.tokenize('liudehua')}")
    print(f"{m.tokenize('liu de hua')}")  # 刘德华
    print(f"{m.tokenize('womenzuogelvyougongnue')}")  # 我们做个旅游攻略
    print(f"{m.tokenize('xi anjiaotongdaxue')}")  # 西安交通大学
    # English is not supported
    print(f"{m.tokenize('good luck')}")
output:
(['wo'], ['3'])
(['ni', 'hao'], [])
(['lv', 'you'], ['3', '2'])
(['liu', 'de', 'hua'], [])
(['liu', 'de', 'hua'], [' ', ' '])
(['wo', 'men', 'zuo', 'ge', 'lv', 'you', 'gong', 'nue'], [])
(['xi', 'an', 'jiao', 'tong', 'da', 'xue'], [' '])
(['o', 'o', 'lu'], ['g', 'd', ' ', 'c', 'k'])
The tokenize method returns two results: the first is the list of pinyin syllables, the second is the list of invalid (non-pinyin) tokens.
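Since the second list collects everything that failed to parse, a caller can use it to check whether an input string is pure pinyin. A small sketch operating on result tuples of the shape shown above (the helper name is made up; spaces are treated as harmless separators):

```python
def is_pure_pinyin(tokenize_result):
    """True when a tokenize() result contains no invalid tokens
    other than whitespace separators."""
    _, invalid = tokenize_result
    return all(tok.isspace() for tok in invalid)

print(is_pure_pinyin((["ni", "hao"], [])))                 # True
print(is_pure_pinyin((["liu", "de", "hua"], [" ", " "])))  # True
print(is_pure_pinyin((["wo"], ["3"])))                     # False (tone digit)
```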
Continuous pinyin to Hanzi (Pinyin2Hanzi)
First split the continuous pinyin with this library (pinyintokenizer), then convert the pinyin to Chinese characters with the Pinyin2Hanzi library.
example: examples/pinyin2hanzi_demo.py:

import sys

from Pinyin2Hanzi import DefaultDagParams
from Pinyin2Hanzi import dag

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

dagparams = DefaultDagParams()

def pinyin2hanzi(pinyin_sentence):
    pinyin_list, _ = PinyinTokenizer().tokenize(pinyin_sentence)
    result = dag(dagparams, pinyin_list, path_num=1)
    return ''.join(result[0].path)

if __name__ == '__main__':
    print(f"{pinyin2hanzi('wo3')}")
    print(f"{pinyin2hanzi('jintianxtianqibucuo')}")
    print(f"{pinyin2hanzi('liudehua')}")
output:
我
今天天气不错
刘德华
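The dag step above scores candidate character sequences over a lattice of per-syllable candidates. A toy flavor of the dictionary-lookup skeleton behind it, with a made-up mini dictionary and frequency scores (none of this is Pinyin2Hanzi's actual API or data; real systems score whole paths with word and n-gram statistics rather than picking each syllable independently):

```python
# Toy pinyin-to-hanzi conversion: for each syllable, pick the candidate
# character with the highest (made-up) unigram frequency.

CANDIDATES = {
    "liu": [("刘", 0.6), ("流", 0.3), ("留", 0.1)],
    "de":  [("德", 0.5), ("的", 0.4), ("得", 0.1)],
    "hua": [("华", 0.5), ("花", 0.3), ("话", 0.2)],
}

def pinyin2hanzi_greedy(pinyin_list):
    """Pick the most frequent character for each syllable independently."""
    chars = []
    for py in pinyin_list:
        best_char, _ = max(CANDIDATES[py], key=lambda cand: cand[1])
        chars.append(best_char)
    return "".join(chars)

print(pinyin2hanzi_greedy(["liu", "de", "hua"]))  # 刘德华
```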
Contact
- Issues (suggestions):
- Email me: xuming: xuming624@qq.com
- WeChat: add my WeChat ID xuming624 to join the Python-NLP discussion group; include "Name-Company-NLP" in your request.
Citation
If you use pinyin-tokenizer in your research, please cite it as follows:
APA:
Xu, M. pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP (Version 0.0.1) [Computer software]. https://github.com/shibing624/pinyin-tokenizer
BibTeX:
@misc{pinyin-tokenizer,
title={pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP},
author={Xu Ming},
year={2022},
howpublished={\url{https://github.com/shibing624/pinyin-tokenizer}},
}
License
Licensed under The Apache License 2.0; free for commercial use. Please include a link to pinyin-tokenizer and the license in your product documentation.
Contribute
The project code is still rough; if you improve it, contributions back to this project are welcome. Before submitting, please note two points:
- Add corresponding unit tests in tests
- Run all unit tests with python -m pytest and make sure they all pass
Then you can submit a PR.
Related Projects
- Hanzi to pinyin: pypinyin
- Pinyin to Hanzi: Pinyin2Hanzi