Just Cut Word Faster
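jieba is no longer maintained, which is why cutword exists.
cutword is a Chinese word segmentation library. Its dictionary is built from statistics over the most recent data available as of January 2024, so its word frequencies are more reasonable. The segmentation algorithm is based on an Aho-Corasick (AC) automaton and runs about twice as fast as jieba.
cutword-lite is a slimmed-down version of the original cutword project: named entity recognition (NER) has been removed so that the package focuses on Chinese word segmentation.
Run python -m cutword.comparewithjieba to test it.
Note: this project focuses solely on Chinese word segmentation. If you need other NLP capabilities, combine it with a suitable toolchain.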
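The automaton-based approach described above can be illustrated with a toy sketch. This is not cutword's actual implementation: the class name ACAutomaton, the function segment, the unknown_logp penalty, and the word frequencies are all hypothetical, chosen only to show how an Aho-Corasick scan can find every dictionary word in a single pass, after which a dynamic-programming step picks the split with the highest total log-probability.

```python
import math
from collections import deque


class ACAutomaton:
    """Minimal Aho-Corasick automaton over a word -> frequency dict (illustrative only)."""

    def __init__(self, word_freq):
        total = sum(word_freq.values())
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # out[state] -> list of (word_length, log_prob)
        for word, freq in word_freq.items():
            state = 0
            for ch in word:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].append((len(word), math.log(freq / total)))
        self._build_failure_links()

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 states keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit shorter words that end at the same position.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]
                queue.append(nxt)

    def matches(self, text):
        """Yield (start, end, log_prob) for every dictionary word found in text."""
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for length, logp in self.out[state]:
                yield i + 1 - length, i + 1, logp


def segment(text, automaton, unknown_logp=-20.0):
    """Return the split of text with the highest total log-probability."""
    n = len(text)
    by_end = [[] for _ in range(n + 1)]  # matches grouped by end position
    for start, end, logp in automaton.matches(text):
        by_end[end].append((start, logp))
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start of the last word in text[:i]
    for end in range(1, n + 1):
        # Fallback: a single character not covered by the dictionary.
        best[end], back[end] = best[end - 1] + unknown_logp, end - 1
        for start, logp in by_end[end]:
            if best[start] + logp > best[end]:
                best[end], back[end] = best[start] + logp, start
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Made-up frequencies, for demonstration only.
toy_dict = {"精诚": 50, "所": 200, "至": 150, "金石为开": 10, "金石": 30, "为": 300, "开": 250}
print(segment("精诚所至,金石为开", ACAutomaton(toy_dict)))
# ['精诚', '所', '至', ',', '金石为开']
```

Because the automaton reports all overlapping candidates in one left-to-right pass, the scoring step never has to re-scan the text; any speed-up over other segmenters in practice depends on details this sketch does not model.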
1. Installation:
pip install -U cutword-lite
2. Usage:
2.1 Word segmentation
from cutword import Cutter
cutter = Cutter()
res = cutter.cutword("你好,世界")
print(res)
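Since cutter.cutword appears to return a plain list of strings (judging by the printed outputs in this README), its result can be fed straight into standard-library tools. The helper below is a hypothetical sketch, not part of cutword: count_words and its min_len parameter are names invented here, and the demo uses str.split as a stand-in segmenter so that it runs without the package installed.

```python
from collections import Counter
from typing import Callable, Iterable, List


def count_words(texts: Iterable[str],
                cut: Callable[[str], List[str]],
                min_len: int = 2) -> Counter:
    """Count tokens of at least min_len characters across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in cut(text) if len(tok) >= min_len)
    return counts


# Stand-in segmenter for the demo; with cutword installed one would pass
# Cutter().cutword instead of str.split.
demo_counts = count_words(["a bb bb", "bb ccc"], str.split)
print(demo_counts.most_common())
# [('bb', 3), ('ccc', 1)]
```

Passing the segmenter in as a callable keeps the counting logic independent of which dictionary the Cutter was created with.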
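The segmenter ships with two dictionaries: a basic dictionary, which is loaded by default, and an upgraded dictionary, whose words are on the whole somewhat longer than those in the basic dictionary.
To load the upgraded dictionary, set want_long_word to True.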
from cutword import Cutter
cutter = Cutter()
res = cutter.cutword("精诚所至,金石为开")
print(res) # ['精诚', '所', '至', ',', '金石为开']
cutter = Cutter(want_long_word=True)
res = cutter.cutword("精诚所至,金石为开")
print(res) # ['精诚所至', ',', '金石为开']
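To see exactly where the two dictionaries disagree on a given sentence, the two result lists can be aligned by character offset. The following is a small sketch under the assumption that both lists concatenate back to the same input text; token_spans and merged_tokens are hypothetical helper names, not cutword functions.

```python
def token_spans(tokens):
    """Return (start, end, token) character spans for a list of tokens."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok), tok))
        pos += len(tok)
    return spans


def merged_tokens(short_tokens, long_tokens):
    """List each long token together with the shorter tokens it overlaps,
    keeping only the places where the two segmentations differ."""
    short_spans = token_spans(short_tokens)
    result = []
    for start, end, tok in token_spans(long_tokens):
        parts = [t for s, e, t in short_spans if s < end and e > start]
        if parts != [tok]:
            result.append((tok, parts))
    return result


# The two outputs shown above for the same sentence.
default_res = ['精诚', '所', '至', ',', '金石为开']
long_res = ['精诚所至', ',', '金石为开']
print(merged_tokens(default_res, long_res))
# [('精诚所至', ['精诚', '所', '至'])]
```

Running this over a sample of your own text gives a quick feel for whether want_long_word=True suits the downstream task.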
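When initializing Cutter, you can pass in a user-defined dictionary. Its format must match that of this project's dict file. The part-of-speech column in the dictionary is currently unused, so it can be filled with any value.
This project borrows code from bytepiece by Su Jianlin (苏神); many thanks.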
File details
Details for the file cutword_lite-0.2.0.tar.gz.
File metadata
- Download URL: cutword_lite-0.2.0.tar.gz
- Upload date:
- Size: 4.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c1d7cca259a2be39c1e21a3082fcc4740a28114a9fe4438b6a3c00d93348e11b |
| MD5 | 29ca98f0bc7db07a00f6a85c1545085c |
| BLAKE2b-256 | 3f34970f5d76b4db8ab60ad9adba380e91841bfed491e8f97ac7d6720b721f6e |
File details
Details for the file cutword_lite-0.2.0-py3-none-any.whl.
File metadata
- Download URL: cutword_lite-0.2.0-py3-none-any.whl
- Upload date:
- Size: 4.4 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 281bf99acbb35acf13d500321ea3c835d0ff2f1332d1d924800df2091009ad7a |
| MD5 | 59abebf7fd420d9647793de3c13b81b5 |
| BLAKE2b-256 | b8539b50054fc021df15bb84ddb9ed3857372b6f1663b81f85acd1aba3e12095 |