
Tools for tokenizer development and evaluation


Tokenizer Tools


Tools/Utils for NLP, including dataset reading, tagset encoding & decoding, and metrics computation.

Features

Corpus reading and writing

This package provides an on-disk corpus file format (tentatively named conllx) and an in-memory object format (tentatively named offset).

Reading a corpus

Task: read the corpus.conllx file and print each corpus record.

Code:

from tokenizer_tools.tagset.offset.corpus import Corpus

corpus = Corpus.read_from_file("corpus.conllx")
for document in corpus:
    print(document)  # each document is a single corpus record
Writing a corpus

Task: write multiple corpus records to the corpus.conllx file.

Code:

from tokenizer_tools.tagset.offset.corpus import Corpus

# corpus_item_one and corpus_item_two are existing Document objects
corpus_list = [corpus_item_one, corpus_item_two]

corpus = Corpus(corpus_list)
corpus.write_to_file("corpus.conllx")

Document attributes and methods

Each corpus record is a Document object. Its attributes and methods are described below.

Attributes
text

Type: list. The text of the record.

domain

Type: string. The domain.

function

Type: string. The function point.

sub_function

Type: string. The sub-function point.

intent

Type: string. The intent.

entities

Type: SpanSet. The entities; described in detail below.

Methods
compare_entities

Checks whether the text and the entities are consistent with each other.

convert_to_md

Converts the text and entities to Markdown, for text-based rendered output.
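As a mental model for the consistency check above, the sketch below shows what "text and entities match" amounts to in plain Python. This is illustrative only, not the library's implementation; the helper name and the (start, end, value) tuples are assumptions.

```python
# Illustrative sketch: every entity span's value should equal the
# slice of text it covers. `text` is a list of characters; each span
# is a hypothetical (start, end, value) tuple with an exclusive end.
def entities_match_text(text, spans):
    return all("".join(text[start:end]) == value
               for start, end, value in spans)

text = list("play Jay Chou")
print(entities_match_text(text, [(5, 13, "Jay Chou")]))  # True
print(entities_match_text(text, [(5, 13, "Jay Zhou")]))  # False
```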

SpanSet attributes and methods

Methods

__iter__

Supports iteration like a list; each element is a Span object.

check_overlap

Checks whether any spans overlap.
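To make the overlap semantics concrete, here is a minimal standalone sketch of what an overlap check over half-open [start, end) spans amounts to. This is plain Python, not the library's own code, and the function name is made up.

```python
# Sort spans by start position; if any two spans overlap, some
# adjacent pair in sorted order must overlap: the next span starts
# before the previous one ends.
def has_overlap(spans):
    ordered = sorted(spans)
    return any(next_start < prev_end
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))

print(has_overlap([(0, 2), (2, 4)]))  # False: [0, 2) and [2, 4) only touch
print(has_overlap([(0, 3), (2, 4)]))  # True: position 2 is inside both
```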

Span attributes and methods

Attributes

start

int, zero-based; this position is included in the span.

end

int, zero-based; this position is excluded from the span.

entity

string, the entity type.

value

string, the entity's value.
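The start/end convention above matches Python's half-open slice semantics, which the following plain-Python sketch illustrates (the sample text and offsets are made up):

```python
text = "play a song by Jay Chou"
start, end = 15, 23        # [15, 23): start included, end excluded

value = text[start:end]
print(value)               # Jay Chou
print(end - start)         # 8, the span length
```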

TODO

  • Rename the project: tokenizer_tools no longer accurately describes what the project does.

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

History

0.1.0 (2018-09-05)

  • First release on PyPI.

