A simple iterator over a set of Chinese tokenizers
A collection of Chinese tokenizers
Simple wrappers around, and a collection of, several Chinese tokenizers
Free software: MIT license
Documentation: https://chinese-tokenzier-iterator.readthedocs.io.
Features
TODO
Usage
from tokenizers_collection.config import tokenizer_registry

for name, tokenizer in tokenizer_registry:
    print("Tokenizer: {}".format(name))
    tokenizer('input_file.txt', 'output_file.txt')
Installation
pip install tokenizers_collection
Updating license files and downloading models
Some of the bundled tokenizers require a license-file update (e.g. pynlpir) or a model download (e.g. pyltp) before they can be used, so a post-install step is needed. All of these operations are wrapped in a single function; after installation, simply run:
python -m tokenizers_collection.helper
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
History
0.1.0 (2018-08-28)
First release on PyPI.