Tools for tokenizer development and evaluation
Tokenizer Tools
Tools/utilities for NLP, including dataset reading, tagset encoding & decoding, and metrics computation
Free software: MIT license
Documentation: https://tokenizer-tools.readthedocs.io.
Features
Corpus reading and writing
This package provides an on-disk corpus file format (tentatively named conllx) and an in-memory object format (tentatively named offset).
Reading a corpus
Task: read the corpus.conllx file and print each document in it.
Code:
from tokenizer_tools.tagset.offset.corpus import Corpus
corpus = Corpus.read_from_file("corpus.conllx")
for document in corpus:
    print(document)  # each document is a single corpus entry
Writing a corpus
Task: write multiple documents to the corpus.conllx file.
Code:
from tokenizer_tools.tagset.offset.corpus import Corpus
corpus_list = [corpus_item_one, corpus_item_two]  # each item is a Document object
corpus = Corpus(corpus_list)
corpus.write_to_file("corpus.conllx")
Document attributes and methods
Each corpus entry is a Document object; its attributes and methods are described below.
Attributes
text
Type: list. The text of the document.
domain
Type: string. The domain.
function
Type: string. The function point.
sub_function
Type: string. The sub-function point.
intent
Type: string. The intent.
entities
Type: SpanSet. The entities; described in detail below.
Methods
compare_entities
Checks whether the text and the entities match.
convert_to_md
Converts the text and entities to Markdown for text-based rendered output.
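Conceptually, a check like compare_entities verifies that each entity span is consistent with the text, i.e. that the text slice covered by the span equals the span's recorded value. The following is a standalone sketch of that idea using a hypothetical helper, not the library's actual code or signature:

```python
# Hypothetical sketch of the text/entity consistency check;
# the library's actual compare_entities may differ in signature and return value.
def entities_match_text(text, spans):
    """spans: iterable of (start, end, value) triples using half-open offsets."""
    return all(text[start:end] == value for start, end, value in spans)

text = "book a flight to Paris"
print(entities_match_text(text, [(17, 22, "Paris")]))  # True: slice matches value
print(entities_match_text(text, [(17, 22, "Pari")]))   # False: value disagrees
```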
SpanSet attributes and methods
Methods
__iter__
A SpanSet can be iterated like a list; each element is a Span object.
check_overlap
Checks whether any spans overlap.
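With half-open offsets, two spans [a.start, a.end) and [b.start, b.end) overlap exactly when a.start < b.end and b.start < a.end. A standalone sketch of a pairwise overlap check (illustrative only; the library's check_overlap may differ):

```python
# Illustrative overlap check over (start, end) half-open intervals;
# not the library's actual implementation.
from itertools import combinations

def has_overlap(spans):
    """Return True if any two (start, end) spans overlap."""
    return any(
        a[0] < b[1] and b[0] < a[1]
        for a, b in combinations(spans, 2)
    )

print(has_overlap([(0, 4), (5, 9)]))  # False: the intervals are disjoint
print(has_overlap([(0, 4), (3, 9)]))  # True: both cover position 3
```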
Span attributes and methods
Attributes
start
int; zero-based, inclusive.
end
int; zero-based, exclusive.
entity
string; the entity type.
value
string; the entity value.
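The start/end pair follows Python's half-open slicing convention, so text[start:end] yields the entity's surface string. A minimal standalone sketch of that convention (the NamedTuple below is a stand-in, not the library's actual Span class):

```python
# Standalone illustration of the [start, end) convention used by Span;
# SpanSketch is a stand-in, not the library's actual class.
from typing import NamedTuple

class SpanSketch(NamedTuple):
    start: int   # zero-based, inclusive
    end: int     # zero-based, exclusive
    entity: str  # entity type
    value: str   # entity surface string

text = "play jazz music"
span = SpanSketch(start=5, end=9, entity="genre", value="jazz")

# Because end is exclusive, slicing recovers the entity value exactly.
assert text[span.start:span.end] == span.value
```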
TODO
Rename the project: tokenizer_tools no longer accurately describes what this project does.
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
History
0.1.0 (2018-09-05)
First release on PyPI.