Text classification datasets
Project description
The framework is designed with BERT in mind and currently supports seven commonsense reasoning datasets (alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa). It can also be applied to other datasets with a few lines of code.
Architecture
Dependencies
conda install av -c conda-forge
pip install -r requirements.txt
pip install --editable .
# or
pip install textbook
Download raw datasets
./fetch.sh
It downloads alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa from AWS into `data_cache`.
In case you want to use Something-Something, please download the dataset from 20BN's website.
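As a quick sanity check after fetching, you can peek at one of the downloaded files with pandas (a minimal sketch; the `data_cache/alphanli/train.jsonl` path matches the layout used in the examples below):
import pandas as pd

# Every dataset is stored as JSON Lines under data_cache/<dataset>/.
df = pd.read_json("data_cache/alphanli/train.jsonl", lines=True)
print(df.columns.tolist())  # raw fields such as obs1, obs2, hyp1, hyp2, label
print(len(df), "training examples")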
Usage
Template
The goal of a template is to transform raw text into an intermediate datum where abstractive information is provided for later use.
Ideally, the template should do the following things:
- construct `text`: a list of lists. The outer list is for multi-choice situations and the inner list is for each input pair/triplet (e.g. context, question, and choice);
- construct `label`: an integer representing a zero-indexed label for the truth, or `None`;
- construct `token_type_id` and `attention`: abstract representations of the segment id and attention. In the following anli example, both `token_type_id` and `attention` have three digits, one for each of the three components of each row of the text;
- construct `image`: any form of image id/path you want to read later.
One example of anli is as follows:
# raw
case = {"story_id": "58090d3f-8a91-4c89-83ef-2b4994de9d241", "obs1": "Ron started his new job as a landscaper today.",
"obs2": "Ron is immediately fired for insubordination.", "hyp1": "Ron ignores his bosses's orders and called him an idiot.",
"hyp2": "Ron's boss called him an idiot.", "label": "1"}
# target intermediate datum
target = {
'text':
[['Ron started his new job as a landscaper today.', "Ron ignores his bosses's orders and called him an idiot.",
'Ron is immediately fired for insubordination.'],
['Ron started his new job as a landscaper today.', "Ron's boss called him an idiot.",
'Ron is immediately fired for insubordination.']],
'label': 0, 'image': None, 'token_type_id': [0, 1, 0],
'attention': [1, 1, 1]}
LABEL2INT = {
"anli": {
"1": 0,
"2": 1,
},
}
assert template_anli(case, LABEL2INT['anli']) == target
Renderer
A renderer transforms your intermediate datum into a fully blown datum. Each renderer takes care of a different part of the datum. For example, `renderer_text` renders the text into `input_id` and generates all token-based `attention` and `token_type_id` values, while `renderer_video` renders the `image` path into an `image` tensor. Renderers are passed to the dataset constructor in a list and are therefore executed sequentially.
BatchTool
We provide a BatchTool with which MLM masking and padding can be applied easily; check the class documentation for more information.
Load a dataset with pandas
from transformers import BertTokenizer
from textbook import MultiModalDataset, template_anli, renderer_text, BatchTool, TokenBasedSampler
from torch.utils.data import Dataset, DataLoader
from textbook import LABEL2INT
import pandas as pd
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
d1 = MultiModalDataset(
df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
template=lambda x: template_anli(x, LABEL2INT['anli']),
renderers=[lambda x: renderer_text(x, tokenizer)],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)
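Iterating over the loader yields collated batches (a minimal sketch; the keys are assumed to mirror those shown in the multitask example below):
for batch in i1:
    # Inspect the first collated batch produced by BatchTool.collate_fn.
    print(batch.keys())
    break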
Create a multitask dataset with multiple datasets
from transformers import BertTokenizer
from textbook import *
import pandas as pd
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
# add additional tokens for each task as special `cls_token`
tokenizer.add_special_tokens({"additional_special_tokens": [
"[ANLI]", "[HELLASWAG]"
]})
d1 = MultiModalDataset(
df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
template=lambda x: template_anli(x, LABEL2INT['anli']),
renderers=[lambda x: renderer_text(x, tokenizer, "[ANLI]")],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)
d2 = MultiModalDataset(
df=pd.read_json("data_cache/hellaswag/train.jsonl", lines=True),
template=lambda x: template_hellaswag(x, LABEL2INT['hellaswag']),
renderers=[lambda x: renderer_text(x, tokenizer, "[HELLASWAG]")],
)
bt2 = BatchTool(tokenizer, source="hellaswag")
i2 = DataLoader(d2, batch_sampler=TokenBasedSampler(d2, batch_size=64), collate_fn=bt2.collate_fn)
d = MultiTaskDataset([i1, i2], shuffle=False)
#! Batch size must be 1 for MultiTaskDataset, because batching is already done in each sub-dataset.
for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):
pass
# {
# "source": "anli" or "hellaswag",
# "labels": ...,
# "input_ids": ...,
# "attentions": ...,
# "token_type_ids": ...,
# "images": ...,
# }
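Since the framework is designed with BERT in mind, a training step over the multitask loader could look roughly like the sketch below. It continues from the snippet above and is only an assumption: the tensor shapes (taken here as [batch, num_choices, seq_len], matching the multi-choice `text` layout) and the choice of `BertForMultipleChoice` are not part of the library.
import torch
from transformers import BertForMultipleChoice

model = BertForMultipleChoice.from_pretrained('bert-base-cased')
model.resize_token_embeddings(len(tokenizer))  # account for the added [ANLI]/[HELLASWAG] tokens
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attentions"],
        token_type_ids=batch["token_type_ids"],
        labels=batch["labels"],
    )
    loss = outputs[0]
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    break  # single step, for illustration only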
Implement a new template or renderer
It is advisable to follow the conventions below, but you can do whatever you like since you can wrap anything in a `lambda`.
def template_xxx(raw_datum, *args, **kwargs):
pass
def renderer_xxx(intermediate_datum, *args, **kwargs):
pass
For example, for the Quora Question Pairs dataset:
def template_qqp(datum, label2int={"0": 0, "1": 1}):
    result = {
        "text": [
            [datum['question1'], datum['question2']]
        ],
        "image": None,
        "label": None if 'is_duplicate' not in datum or datum['is_duplicate'] is None else label2int[str(datum['is_duplicate'])],
        "token_type_id": [0, 1],
        "attention": [1, 1],
    }
    return result
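A new template plugs into the same pipeline as the built-in ones. A minimal sketch, assuming a hypothetical QQP file at data_cache/qqp/train.csv with question1, question2, and is_duplicate columns (fetch.sh does not download it):
from transformers import BertTokenizer
from textbook import MultiModalDataset, renderer_text, BatchTool, TokenBasedSampler
from torch.utils.data import DataLoader
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

d_qqp = MultiModalDataset(
    df=pd.read_csv("data_cache/qqp/train.csv"),  # hypothetical path, not created by fetch.sh
    template=template_qqp,
    renderers=[lambda x: renderer_text(x, tokenizer)],
)
bt_qqp = BatchTool(tokenizer, source="qqp")
i_qqp = DataLoader(d_qqp, batch_sampler=TokenBasedSampler(d_qqp, batch_size=64), collate_fn=bt_qqp.collate_fn)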
Contact
Author: Chenghao Mou
Email: mouchenghao@gmail.com