Just Cut Word Faster

Project description

cutword

jieba is no longer maintained, so cutword was created.

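cutword is a Chinese word segmentation library. Its dictionary file is built from statistics over the latest data available as of January 2024, so its word frequencies are more reasonable. The segmentation algorithm is based on an Aho-Corasick (AC) automaton and runs twice as fast as jieba.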
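To make the idea of AC-automaton-based segmentation concrete, here is a minimal sketch. It is not cutword's actual implementation: the `ACSegmenter` class, the made-up frequencies, and the choice to score candidate words by unigram log-probability are all assumptions for illustration. The automaton finds every dictionary word ending at each position in a single pass, and a dynamic program keeps the highest-scoring split.

```python
import math
from collections import deque


class ACSegmenter:
    """Toy segmenter (hypothetical): Aho-Corasick matching + best-score path."""

    def __init__(self, freqs):
        total = sum(freqs.values())
        # Node 0 is the trie root. out[node] lists (word_length, log_prob)
        # for every dictionary word that ends at this node.
        self.children, self.fail, self.out = [{}], [0], [[]]
        for word, freq in freqs.items():
            node = 0
            for ch in word:
                if ch not in self.children[node]:
                    self.children.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.children[node][ch] = len(self.children) - 1
                node = self.children[node][ch]
            self.out[node].append((len(word), math.log(freq / total)))
        # Assumed penalty for a character that matches no dictionary word.
        self.unknown = math.log(0.5 / total)
        self._build_failure_links()

    def _build_failure_links(self):
        # Breadth-first, so shallower nodes are finished before deeper ones.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # Inherit the words that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def cut(self, text):
        # best[i] = (score, start) of the best segmentation of text[:i].
        best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
        node = 0
        for i, ch in enumerate(text, start=1):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            # A single-character fallback keeps every position reachable.
            for length, logp in [(1, self.unknown)] + self.out[node]:
                score = best[i - length][0] + logp
                if score > best[i][0]:
                    best[i] = (score, i - length)
        tokens, i = [], len(text)
        while i > 0:  # walk the back-pointers from the end
            start = best[i][1]
            tokens.append(text[start:i])
            i = start
        return tokens[::-1]


# Frequencies below are invented for the example.
toy_freqs = {"精诚": 50, "所": 200, "至": 100, "金石为开": 20,
             "金石": 30, "为": 300, "开": 150}
print(ACSegmenter(toy_freqs).cut("精诚所至,金石为开"))
# ['精诚', '所', '至', ',', '金石为开']
```

Because matching is a single left-to-right pass over the text, the cost per character depends on the number of dictionary words ending there rather than on rescanning substrings, which is the usual reason to prefer an AC automaton for dictionary-based segmentation.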

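cutword-lite is a slimmed-down version of the original cutword project. Named entity recognition (NER) has been removed so that it focuses solely on Chinese word segmentation.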

To compare it against jieba, run python -m cutword.comparewithjieba.

Note: this project focuses only on Chinese word segmentation. If you need other NLP capabilities, combine it with a suitable toolchain.

1. Installation:

pip install -U cutword-lite

2. Usage:

2.1 Word segmentation

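from cutword import Cutter

# Create a segmenter with the default (basic) dictionary.
cutter = Cutter()
# cutword() returns the segmented tokens for the input string.
res = cutter.cutword("你好,世界")
print(res)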

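The segmenter ships with two dictionaries. The basic dictionary is loaded by default. The upgraded dictionary contains words that are, on the whole, somewhat longer than those in the basic dictionary.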

To load the upgraded dictionary, set want_long_word to True.

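from cutword import Cutter

# Basic dictionary (default): the idiom is split into shorter words.
cutter = Cutter()
res = cutter.cutword("精诚所至,金石为开")
print(res)  # ['精诚', '所', '至', ',', '金石为开']

# Upgraded dictionary: the four-character idiom is kept as one token.
cutter = Cutter(want_long_word=True)
res = cutter.cutword("精诚所至,金石为开")
print(res)  # ['精诚所至', ',', '金石为开']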
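The effect of a dictionary with longer entries can be reproduced with a toy model. The sketch below is an assumption-laden illustration, not cutword's code: `segment` is a hypothetical helper that picks the split with the highest total unigram log-probability, and the frequencies are invented. Adding one longer entry to the dictionary is enough to change the output in the same way as the example above.

```python
import math


def segment(text, freqs):
    """Hypothetical helper: max log-probability split over a word->freq dict."""
    total = sum(freqs.values())
    unknown = math.log(0.5 / total)  # assumed penalty for unseen characters
    max_len = max(map(len, freqs))
    # best[end] = (score, start) of the best segmentation of text[:end].
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in freqs:
                logp = math.log(freqs[piece] / total)
            elif end - start == 1:
                logp = unknown  # fall back to a single character
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    tokens, end = [], len(text)
    while end > 0:  # follow the back-pointers
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


# Invented frequencies; "longer" adds a single four-character entry.
basic = {"精诚": 50, "所": 200, "至": 100, "金石为开": 20}
longer = {**basic, "精诚所至": 20}

print(segment("精诚所至,金石为开", basic))   # ['精诚', '所', '至', ',', '金石为开']
print(segment("精诚所至,金石为开", longer))  # ['精诚所至', ',', '金石为开']
```

One long word usually outscores several short ones because each extra token multiplies in another probability below 1, so a dictionary that contains the longer entry tends to prefer it.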

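When initializing Cutter, you can pass in a user-defined dictionary. Its format must match this project's dict file. The part-of-speech column in the dictionary is currently unused, so it can be filled with any value.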

This project borrows code from bytepiece by Su Jianlin (苏神), with thanks.

Project details


Download files


Source Distribution

cutword_lite-0.2.0.tar.gz (4.4 MB view details)

Uploaded Source

Built Distribution


cutword_lite-0.2.0-py3-none-any.whl (4.4 MB view details)

Uploaded Python 3

File details

Details for the file cutword_lite-0.2.0.tar.gz.

File metadata

  • Download URL: cutword_lite-0.2.0.tar.gz
  • Upload date:
  • Size: 4.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for cutword_lite-0.2.0.tar.gz
  • SHA256: c1d7cca259a2be39c1e21a3082fcc4740a28114a9fe4438b6a3c00d93348e11b
  • MD5: 29ca98f0bc7db07a00f6a85c1545085c
  • BLAKE2b-256: 3f34970f5d76b4db8ab60ad9adba380e91841bfed491e8f97ac7d6720b721f6e


File details

Details for the file cutword_lite-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: cutword_lite-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 4.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for cutword_lite-0.2.0-py3-none-any.whl
  • SHA256: 281bf99acbb35acf13d500321ea3c835d0ff2f1332d1d924800df2091009ad7a
  • MD5: 59abebf7fd420d9647793de3c13b81b5
  • BLAKE2b-256: b8539b50054fc021df15bb84ddb9ed3857372b6f1663b81f85acd1aba3e12095
