
# A Micro Tokenizer for Chinese

A tiny Chinese tokenizer that segments text by building a DAG (directed acyclic graph) over candidate words and selecting the most probable path according to word frequencies (probabilities).
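To make the idea concrete, the sketch below builds a word DAG over a sentence with a toy dictionary and picks the lowest-cost path, where each edge costs `-log(probability)`. The dictionary, probabilities, and function names here are illustrative assumptions, not MicroTokenizer's actual internals:

```python
import math

# Toy dictionary with made-up relative frequencies (illustrative only).
TOY_DICT = {"知识": 0.4, "就": 0.1, "就是": 0.3, "是": 0.2,
            "力量": 0.4, "力": 0.05, "量": 0.05}

def cut_dag(text):
    """Segment `text` via the lowest-cost path through a word DAG.

    Each edge (i -> j) corresponds to the dictionary word text[i:j];
    its cost is -log(probability), so the shortest path is the most
    probable segmentation.
    """
    n = len(text)
    # best[i] = (cost of the best path from position i to the end,
    #            end position of the first word on that path)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            if word in TOY_DICT:
                cost = -math.log(TOY_DICT[word]) + best[j][0]
                candidates.append((cost, j))
        # Fall back to a single character if no dictionary word starts here
        # (the penalty of 10.0 is an arbitrary assumption).
        best[i] = min(candidates) if candidates else (best[i + 1][0] + 10.0, i + 1)
    # Walk the recorded path to recover the words.
    tokens, i = [], 0
    while i < n:
        j = best[i][1]
        tokens.append(text[i:j])
        i = j
    return tokens

print(cut_dag("知识就是力量"))  # ['知识', '就是', '力量']
```

A real tokenizer derives these probabilities from corpus word frequencies; the toy values above are chosen only to make the example run.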

# Features

* Tiny: the core code is a single file of fewer than 200 lines
* Education-oriented: the graph structure can be exported as a `graphml` file to help learners follow the algorithm step by step
* Good segmentation quality: uses an algorithm similar to `jieba` (结巴分词), so segmentation quality is comparable
* Easily extensible: uses the same dictionary file format as `jieba`, so custom dictionaries are easy to add (see the sketch after this list)
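For reference, a `jieba`-style dictionary is a plain-text file with one entry per line: the word, its frequency, and an optional part-of-speech tag. The entries and file name below are illustrative; how MicroTokenizer is pointed at a custom dictionary file is not shown here and follows its own API:

```python
# A jieba-style dictionary file: one "word frequency [POS]" entry per line.
# The words, counts, and file name are illustrative assumptions.
entries = """\
云计算 4845 n
区块链 1200 n
深度学习 3000 n
"""
with open("user_dict.txt", "w", encoding="utf-8") as f:
    f.write(entries)
```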

# Demo

## Online demo
An online Jupyter Notebook demo is available on Binder: [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/howl-anderson/MicroTokenizer/master?filepath=.notebooks%2FMicroTokenizer.ipynb)

## Offline demo
### Segmentation
Code:
```python
import MicroTokenizer

tokens = MicroTokenizer.cut("知识就是力量")  # "knowledge is power"
print(tokens)
```
Output:
```python
['知识', '就是', '力量']
```
### DAG visualization
![DAG of 'knowledge is power'](.images/DAG_of_knowledge_is_power.png)

#### Notes
* `<s>` and `</s>` are the start and end nodes of the graph, not part of the text being segmented
* each edge is labeled with `log(1 / probability of the next node)`, i.e. the negative log probability of that node
* the shortest path is highlighted in dark green
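A quick sanity check on those edge labels: `log(1/p)` equals `-log(p)`, so minimizing the total weight along a path is the same as maximizing the product of the word probabilities. The probabilities below are illustrative values, not ones from the actual dictionary:

```python
import math

# Illustrative probabilities for a three-word path (e.g. 知识 / 就是 / 力量).
probs = [0.4, 0.3, 0.4]
path_cost = sum(math.log(1 / p) for p in probs)   # sum of edge weights
product = probs[0] * probs[1] * probs[2]           # product of probabilities
assert math.isclose(path_cost, -math.log(product))
print(round(path_cost, 3))  # 3.037
```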

### More demos
#### "王小明在北京的清华大学读书" ("Wang Xiaoming studies at Tsinghua University in Beijing")
![DAG of xiaomin](.images/DAG_of_xiaomin.png)


# Requirements
Tested only on Python 3.5+; compatibility with other environments is not guaranteed.

# Installation
## pip
```bash
pip install MicroTokenizer
```

## From source
```bash
pip install git+https://github.com/howl-anderson/MicroTokenizer.git
```

# Usage
## Segmentation
See the offline demo above.

## Exporting a GraphML file
```python
from MicroTokenizer.MicroTokenizer import MicroTokenizer

micro_tokenizer = MicroTokenizer()
micro_tokenizer.build_graph("知识就是力量")  # build the segmentation DAG for the sentence
micro_tokenizer.write_graphml("output.graphml")  # export the graph as GraphML
```
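The exported file can be opened in graph tools such as Gephi or yEd, or inspected programmatically. A minimal sketch using `networkx` (an assumption here; any GraphML reader works):

```python
import networkx as nx

# Load the exported graph and inspect its nodes and edge attributes.
graph = nx.read_graphml("output.graphml")
print(graph.nodes())
for u, v, data in graph.edges(data=True):
    print(u, "->", v, data)
```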

# Roadmap
* Integrate an HMM model to handle OOV (out-of-vocabulary) words and improve performance
* Benchmark segmentation quality against mainstream tokenizers

# Credits


# History

## 0.1.0 (2018-06-12)

* First release on PyPI.

