Deep learning toolbox for end-to-end text information extraction tasks.

Theta

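Theta is positioned as a practical toolbox for text information extraction in real-world engineering projects. It covers the whole pipeline end to end, from raw text input to structured output. Users focus on converting their input data into the expected format, tuning key parameters, invoking theta to run model training and inference, and consuming the formatted output.

Application scenarios include mining unstructured data at key national-level enterprises, structured extraction from open-domain text, and the major online evaluation competitions on entity and relation extraction.

Theta aims for performance on par with leading systems in the industry. It has recently been used in top-tier "C-prefixed" competitions including CCF2019, CHIP2019, CCKS2020, and CCL2020, and has so far won more than 10 awards in finals, including 7 top-three finishes and 2 first places.

Updates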

  • 2022.09.06 0.50.0

    nlp.entity_extraction, nlp.relation_extraction

Installation

Development version

pip install git+http://github.com/idleuncle/theta.git

Stable release

pip install -U theta

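CLUE-CLUENER Fine-Grained Named Entity Recognition

This dataset is built on THUCTC, the text classification dataset open-sourced by Tsinghua University. A subset of the data was selected and annotated with fine-grained named entities. The original data comes from Sina News RSS.

Training set: 10,748; validation set: 1,343

Label categories: the data is divided into 10 label categories: address, book (book title), company, game, government, movie, name (person name), organization, position (job title), and scene (scenic spot).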
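Since most of the user's work is converting input data, a minimal sketch of flattening one annotated record into entity spans may help. The record layout below (a `text` field plus a `label` dictionary mapping category to mention to inclusive `[start, end]` offsets) is an assumption modelled on the CLUENER2020 release, and `flatten_entities` is a hypothetical helper, not part of theta.

```python
import json

# One record in the assumed CLUENER-style layout:
# label -> mention text -> list of [start, end] pairs, with `end` inclusive.
sample_line = json.dumps(
    {
        "text": "浙商银行企业信贷部叶老桂博士",
        "label": {
            "name": {"叶老桂": [[9, 11]]},
            "company": {"浙商银行": [[0, 3]]},
        },
    },
    ensure_ascii=False,
)


def flatten_entities(line):
    """Hypothetical helper: turn one JSON line into sorted (start, end, label, mention) tuples."""
    record = json.loads(line)
    text = record["text"]
    spans = []
    for label, mentions in record.get("label", {}).items():
        for mention, positions in mentions.items():
            for start, end in positions:
                # Offsets are inclusive, so the slice needs end + 1.
                if text[start:end + 1] != mention:
                    raise ValueError(f"offset mismatch for {mention!r}")
                spans.append((start, end, label, mention))
    return sorted(spans)


entities = flatten_entities(sample_line)
for start, end, label, mention in entities:
    print(start, end, label, mention)
```

Checking each offset against the mention text catches off-by-one errors early, which are easy to introduce when mixing inclusive and exclusive end positions.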

Data download: https://github.com/CLUEbenchmark/CLUENER2020

Leaderboard: https://cluebenchmarks.com/ner.html

For the complete code, see theta/examples/CLUENER: cluener.ipynb

With the bert-base-chinese pretrained model, the CLUE evaluation F1 score is 77.160.
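To illustrate what an entity-level F1 score measures, here is a minimal sketch. It assumes exact matching on `(start, end, label)` triples with micro-averaging; the official CLUE scorer may aggregate differently, and `entity_f1` is a hypothetical helper rather than a theta function.

```python
def entity_f1(gold, pred):
    """Micro-averaged F1 over exact-match (start, end, label) triples (illustrative only)."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up spans: the second prediction has a wrong end offset, so it does not count.
gold = [(0, 3, "company"), (9, 11, "name")]
pred = [(0, 3, "company"), (9, 12, "name")]
score = entity_f1(gold, pred)
print(f"F1 = {score:.3f}")
```

Under exact matching, a span that is off by one character counts as both a false positive and a false negative, which is why boundary errors are costly in this metric.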

# Train
make -f Makefile.cluener train

# Run inference
make -f Makefile.cluener predict

# Generate the submission file
make -f Makefile.cluener submission

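CLUE-TNEWS Toutiao Chinese News (Short Text) Classification

The following example is the Toutiao Chinese news (short text) classification task from CLUE (the Chinese language understanding evaluation benchmark).

The dataset comes from the news section of Toutiao and covers 15 news categories, including travel, education, finance, and military.

Data size: training set 53,360; validation set 10,000; test set 10,000

Example:

{"label": "102", "label_desc": "news_entertainment", "sentence": "江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物"}

Each record has three fields, in order: the category ID, the category name, and the news string (title only).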
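A record in this format can be read with the standard `json` module. The sketch below parses the example record and maps the string category IDs to contiguous class indices, a common step before training a classifier. Both helper names are hypothetical and are not part of theta.

```python
import json

# The example record from the dataset description, as one JSON line.
line = (
    '{"label": "102", "label_desc": "news_entertainment", '
    '"sentence": "江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物"}'
)


def parse_tnews_line(raw):
    """Hypothetical helper: return (sentence, label_id, label_desc) for one JSON line."""
    record = json.loads(raw)
    return record["sentence"], record["label"], record["label_desc"]


def build_label_index(label_ids):
    """Hypothetical helper: map string category IDs to class indices 0..n-1."""
    return {label_id: idx for idx, label_id in enumerate(sorted(set(label_ids)))}


sentence, label_id, label_desc = parse_tnews_line(line)
label2idx = build_label_index([label_id])
print(label_id, label_desc, label2idx[label_id])
```

Sorting the IDs before enumerating keeps the mapping stable across runs, so a saved model and its label table stay consistent.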

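With the bert-base-chinese pretrained model, the CLUE evaluation F1 score is 56.100.

For the complete code, see theta/examples/TNEWS: tnews.ipynb

Downloading the TNEWS dataset

Importing the base libraries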

import json
from tqdm import tqdm
from loguru import logger
import numpy as np

from theta.modeling import load_glue_examples
from theta.modeling.glue import GlueTrainer, load_model, get_args
from theta.utils import load_json_file


