Skip to main content

Pinyin Tokenizer, chinese pinyin tokenizer

Project description

PyPI version Downloads Contributions welcome GitHub contributors License Apache 2.0 python_vesion GitHub issues Wechat Group

Pinyin Tokenizer

pinyin tokenizer(拼音分词器),将连续的拼音切分为单字拼音列表,开箱即用。python3开发。

Guide

Feature

  • 基于前缀树(PyTrie)高效快速把连续拼音切分为单字拼音列表,便于后续拼音转汉字等处理。

Install

  • Requirements and Installation
pip install pinyintokenizer

or

git clone https://github.com/shibing624/pinyin-tokenizer.git
cd pinyin-tokenizer
python setup.py install

Usage

拼音切分(Pinyin Tokenizer)

example:examples/pinyin_tokenize_demo.py:

import sys

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

if __name__ == '__main__':
    m = PinyinTokenizer()
    print(f"{m.tokenize('wo3')}")
    print(f"{m.tokenize('nihao')}")
    print(f"{m.tokenize('lv3you2')}")  # 旅游
    print(f"{m.tokenize('liudehua')}")
    print(f"{m.tokenize('liu de hua')}")  # 刘德华
    print(f"{m.tokenize('womenzuogelvyougongnue')}")  # 我们做个旅游攻略
    print(f"{m.tokenize('xi anjiaotongdaxue')}")  # 西安交通大学

    # not support english
    print(f"{m.tokenize('good luck')}")

output:

(['wo'], ['3'])
(['ni', 'hao'], [])
(['lv', 'you'], ['3', '2'])
(['liu', 'de', 'hua'], [])
(['liu', 'de', 'hua'], [' ', ' '])
(['wo', 'men', 'zuo', 'ge', 'lv', 'you', 'gong', 'nue'], [])
(['xi', 'an', 'jiao', 'tong', 'da', 'xue'], [' '])
(['o', 'o', 'lu'], ['g', 'd', ' ', 'c', 'k'])
  • tokenize方法返回两个结果,第一个为拼音列表,第二个为非法拼音列表。

连续拼音转汉字(Pinyin2Hanzi)

先使用本库pinyintokenizer把连续拼音切分,再使用Pinyin2Hanzi库把拼音转汉字。

example:examples/pinyin2hanzi_demo.py:

import sys
from Pinyin2Hanzi import DefaultDagParams
from Pinyin2Hanzi import dag

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

dagparams = DefaultDagParams()


def pinyin2hanzi(pinyin_sentence):
    pinyin_list, _ = PinyinTokenizer().tokenize(pinyin_sentence)
    result = dag(dagparams, pinyin_list, path_num=1)
    return ''.join(result[0].path)


if __name__ == '__main__':
    print(f"{pinyin2hanzi('wo3')}")
    print(f"{pinyin2hanzi('jintianxtianqibucuo')}")
    print(f"{pinyin2hanzi('liudehua')}")

output:

我
今天天气不错
刘德华

Contact

  • Issue(建议):GitHub issues
  • 邮件我:xuming: xuming624@qq.com
  • 微信我:加我微信号:xuming624, 进Python-NLP交流群,备注:姓名-公司名-NLP

Citation

如果你在研究中使用了pinyin-tokenizer,请按如下格式引用:

APA:

Xu, M. pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP (Version 0.0.1) [Computer software]. https://github.com/shibing624/pinyin-tokenizer

BibTeX:

@misc{pinyin-tokenizer,
  title={pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP},
  author={Xu Ming},
  year={2022},
  howpublished={\url{https://github.com/shibing624/pinyin-tokenizer}},
}

License

授权协议为 The Apache License 2.0,可免费用做商业用途。请在产品说明中附加pinyin-tokenizer的链接和授权协议。

Contribute

项目代码还很粗糙,如果大家对代码有所改进,欢迎提交回本项目,在提交之前,注意以下两点:

  • tests添加相应的单元测试
  • 使用python -m pytest来运行所有单元测试,确保所有单测都是通过的

之后即可提交PR。

Related Projects

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pinyintokenizer-0.0.2.tar.gz (11.2 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page