A micro tokenizer for Chinese

Project description

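Micro Chinese Tokenizer

A micro tokenizer for Chinese. It currently provides seven segmentation algorithms:

  1. Segmentation over a DAG (directed acyclic graph) built from word frequencies (probabilities), with a trie serving as the prefix dictionary

  2. Segmentation with a Hidden Markov Model (HMM)

  3. A combined model that merges the results of the DAG and HMM models, following the principle of maximizing segmentation granularity

  4. Forward maximum matching

  5. Backward maximum matching

  6. Bidirectional maximum matching

  7. Segmentation based on a CRF (Conditional Random Field)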
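
To make the first algorithm concrete, here is a minimal sketch of frequency-based DAG segmentation. It is not MicroTokenizer's own code: the names `FREQ`, `build_dag`, and `segment` are hypothetical, the dictionary entries and counts are invented for illustration, and a plain Python dict with substring lookups stands in for the trie-based prefix dictionary mentioned above.

```python
import math

# Toy dictionary: word -> frequency. These entries and counts are made up
# for illustration; they are not MicroTokenizer's bundled dictionary.
FREQ = {
    "北京": 50, "大学": 40, "北京大学": 30, "生": 5, "学生": 35,
    "大学生": 25, "北": 2, "京": 2, "大": 3, "学": 3,
}
TOTAL = sum(FREQ.values())


def build_dag(text, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i + 1, len(text) + 1) if text[i:j] in freq]
        # Fall back to a single character so every position has an outgoing edge.
        dag[i] = ends or [i + 1]
    return dag


def segment(text, freq=FREQ, total=TOTAL):
    """Pick the path through the DAG with the highest total log-probability."""
    dag = build_dag(text, freq)
    n = len(text)
    # best[i] = (best log-probability of text[i:], end index of the first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            # Unknown single characters are treated as having frequency 1.
            (math.log(freq.get(text[i:j], 1)) - math.log(total) + best[j][0], j)
            for j in dag[i]
        )
    # Walk the recorded choices from left to right to recover the words.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


print(segment("北京大学生"))
```

The dynamic program runs from the end of the sentence to the start, so each position reuses the best score already computed for every possible word ending. With the toy counts above, the path through "北京" and "大学生" scores higher than the one through "北京大学" and "生".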

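Features

  • Education-oriented: can export graph structure files in graphml format to help learners understand how the algorithms work

  • Good segmentation performance: it uses algorithms similar to those of 结巴分词 (jieba), so it segments well

  • Good extensibility: it uses the same dictionary file format as 结巴分词 (jieba), so custom dictionaries are easy to add

  • Highly customizable

  • Provides tools and scripts that help users train their own segmentation models instead of using the built-in ones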
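
The dictionary and graph-export features can be sketched together as follows. This is an illustration under stated assumptions, not MicroTokenizer's API: `parse_dict_lines` and `dag_to_graphml` are hypothetical helpers, the dictionary is assumed to hold one whitespace-separated `word [frequency] [tag]` entry per line in the style of jieba dictionaries, and the GraphML is written with the standard library rather than with whatever exporter the project actually uses.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def parse_dict_lines(lines):
    """Parse 'word [freq] [tag]' lines into a word -> frequency mapping."""
    freq = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        # Frequency is optional in this sketch; default to 1 when absent.
        count = int(parts[1]) if len(parts) > 1 and parts[1].isdecimal() else 1
        freq[word] = count
    return freq


def dag_to_graphml(text, freq):
    """Return GraphML with one node per character boundary and one edge per word."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    ET.SubElement(root, "key", {"id": "word", "for": "edge",
                                "attr.name": "word", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for i in range(len(text) + 1):
        ET.SubElement(graph, "node", id="n%d" % i)
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in freq:
                # An edge from boundary i to boundary j represents the word text[i:j].
                edge = ET.SubElement(graph, "edge",
                                     source="n%d" % i, target="n%d" % j)
                ET.SubElement(edge, "data", key="word").text = text[i:j]
    return ET.tostring(root, encoding="unicode")


freq = parse_dict_lines(["北京 50 ns", "大学 40 n", "", "北京大学 30"])
xml_text = dag_to_graphml("北京大学", freq)
print(xml_text)
```

The resulting string can be saved with a `.graphml` extension and opened in a graph viewer to see which candidate words span which positions of the sentence.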


For more information, see the repository: https://github.com/howl-anderson/MicroTokenizer

Download files

Source Distribution

MicroTokenizer-0.19.2.tar.gz (18.5 MB view details)

Built Distribution

MicroTokenizer-0.19.2-py2.py3-none-any.whl (36.8 MB view details)
