jiojio: a convenient Chinese word segmentation tool

Project description

jiojio

- A convenient, high-performance, CPU-based Chinese word segmenter with a continuously iterated model

a convenient Chinese word segmentation tool

<a alt="License">

    <img src="https://img.shields.io/github/license/dongrixinyu/jiojio?color=crimson" /></a>

<a alt="Size">

    <img src="https://img.shields.io/badge/size-82.1m-orange" /></a>

<a alt="Downloads">

    <img src="https://pepy.tech/badge/jiojio/month" /></a>

<a alt="Version">

    <img src="https://img.shields.io/badge/version-1.2.6-green" /></a>

<a href="https://github.com/dongrixinyu/jiojio/pulse" alt="Activity">

    <img src="https://img.shields.io/github/commit-activity/m/dongrixinyu/jiojio?color=blue" /></a>

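Use cases

  • A high-performance, continuously optimized Chinese word segmenter that runs on CPU.

Features

  • A word segmenter implemented in C with a Python interface; single-process CPU throughput reaches 134,000 characters per second (see the performance comparison of several segmentation tools)

  • A web version is available on the JioNLP site for quickly trying out word segmentation and part-of-speech tagging

  • Based on the CRF algorithm, with a finely tuned character-level feature engineering model (see the model feature description)

  • The model file is compressed as far as possible: using the np.float8 precision type, 5 million feature parameters fit in a 30 MB model file, which keeps pip installation convenient

  • Custom dictionaries can be added in both static and dynamic modes through a consistent workflow (see the dictionary configuration notes)

  • Rules can be added to the model, which effectively handles certain types of text that the model alone processes poorly (see: segmentation, adding regular expressions)

  • Supports part-of-speech tagging, and works with JioNLP to implement features such as key phrase extraction and news location recognition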
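
To make the CRF bullet above more concrete, here is a minimal sketch of how a character-level tagger is usually framed: each character gets a feature dictionary built from its neighbors, the model predicts one tag per character, and the tags are merged back into words. The function names `char_features` and `tags_to_words`, the B/I tag scheme, and the window templates are assumptions for illustration only; they are not taken from jiojio's code or its actual feature set.

```python
from typing import Dict, List


def char_features(text: str, i: int, window: int = 1) -> Dict[str, str]:
    """Build hypothetical character-window features for position i.

    These templates are illustrative only, not jiojio's real features.
    """
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the sentence are filled with a padding symbol.
        feats[f"c[{offset}]"] = text[j] if 0 <= j < len(text) else "<PAD>"
    return feats


def tags_to_words(text: str, tags: List[str]) -> List[str]:
    """Merge characters into words from per-character B/I tags.

    "B" marks the first character of a word, "I" a continuation.
    """
    if len(text) != len(tags):
        raise ValueError("text and tags must have the same length")
    words: List[str] = []
    for ch, tag in zip(text, tags):
        if tag == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


if __name__ == "__main__":
    sentence = "开源软件"
    print(char_features(sentence, 0))
    print(tags_to_words(sentence, ["B", "I", "B", "I"]))
```

In a real CRF the tags would come from decoding over learned feature weights; here they are supplied by hand so the two steps can be inspected in isolation.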

Installation

  • Via pip (stable release)

$ pip install jiojio

  • Via Git (development version)

$ git clone https://github.com/dongrixinyu/jiojio

$ cd jiojio

$ pip install .

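  • Building the C library on non-Ubuntu environments

On Windows, macOS, or other operating systems or hardware, no directly callable C library is available, so the program falls back to pure Python segmentation by default and runs more slowly. The C library can be compiled and installed as shown below. These steps are for reference only; use and debug them once you are familiar with C.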


$ git clone https://github.com/dongrixinyu/jiojio

$ cd jiojio/jiojio/jiojio_cpp

$ ./compiler.sh
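
Before deciding whether to compile, it can help to check what platform Python reports. The sketch below is only a heuristic: it assumes that prebuilt C libraries target Linux on x86-64, which is an assumption and not something jiojio exposes through an API. The helper name `likely_has_prebuilt_c` is hypothetical.

```python
import platform
from typing import Optional


def likely_has_prebuilt_c(system: Optional[str] = None,
                          machine: Optional[str] = None) -> bool:
    """Guess whether a prebuilt C library is likely usable on this platform.

    Assumption: prebuilt binaries target Linux on x86-64 only. This is a
    heuristic, not a check of what jiojio actually loaded.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Linux" and machine.lower() in ("x86_64", "amd64")


if __name__ == "__main__":
    if likely_has_prebuilt_c():
        print("Prebuilt C library is probably usable.")
    else:
        print("Expect the pure Python fallback unless the C library is compiled.")
```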

Usage

  • Basic usage

>>> import jiojio

>>> jiojio.init()

>>> print(jiojio.cut('开源软件应秉持全人类共享的精神,搞封闭式是行不通的。'))

# ['开源', '软件', '应', '秉持', '全人类', '共享', '的', '精神', ',', '搞', '封闭式', '是', '行', '不通', '的', '。']

# Call jiojio.help() for basic usage instructions

# Run print(jiojio.init.__doc__) to list the model initialization parameters
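
To check segmentation throughput on your own machine, a small timing helper is enough. The sketch below measures characters per second for any callable that maps a string to a list of tokens. The helper name `chars_per_second` is hypothetical, and the built-in `list` stands in for a real segmenter so that the example runs without jiojio installed; after `jiojio.init()`, `jiojio.cut` could be passed in its place.

```python
import time
from typing import Callable, Iterable, List


def chars_per_second(cut: Callable[[str], List[str]],
                     texts: Iterable[str]) -> float:
    """Return the number of input characters processed per second by `cut`."""
    batch = list(texts)
    total_chars = sum(len(t) for t in batch)
    start = time.perf_counter()
    for t in batch:
        cut(t)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return total_chars / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    sample = ["开源软件应秉持全人类共享的精神,搞封闭式是行不通的。"] * 1000
    # `list` is a stand-in segmenter that splits text into single characters.
    rate = chars_per_second(list, sample)
    print(f"{rate:,.0f} characters/second")
```

Results depend heavily on hardware, text length, and whether the C library or the pure Python fallback is in use, so treat any single number as a rough indication.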

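Q&A about the jiojio segmenter

  • Had this segmenter been written ten years earlier, jiojio might be popular by now. Today, with ChatGPT dominating NLP, I wrote this tool and sped it up purely to improve my C programming skills. Building ChatGPT took idealism, and so does writing this tool.

  • Q&A related to jiojio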

TODO

  • Update the annotated data used for segmentation quality and keep optimizing the model over the long term

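Community chat group

  • You are welcome to join the NLP discussion group: search for the WeChat official account "JioNLP", or scan the QR code below to join

[image: group QR code]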

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • jiojio-1.2.6-py2.py3-none-any.whl (85.5 MB): Python 2, Python 3

  • jiojio-1.2.6-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.whl (85.6 MB): CPython 3.12, manylinux: glibc 2.5+ x86-64

  • jiojio-1.2.6-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.whl (85.6 MB): CPython 3.11, manylinux: glibc 2.5+ x86-64

  • jiojio-1.2.6-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl (85.6 MB): CPython 3.10, manylinux: glibc 2.5+ x86-64

  • jiojio-1.2.6-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (85.6 MB): CPython 3.9, manylinux: glibc 2.5+ x86-64

  • jiojio-1.2.6-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (85.7 MB): CPython 3.8, manylinux: glibc 2.5+ x86-64

  • jiojio-1.2.6-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (85.7 MB): CPython 3.7m, manylinux: glibc 2.5+ x86-64

  • jiojio-1.2.6-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (85.7 MB): CPython 3.6m, manylinux: glibc 2.5+ x86-64
