Text classification datasets
The framework is designed with BERT in mind and currently supports seven commonsense reasoning datasets (alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa). It can also be applied to other datasets with a few lines of code.
Architecture
Dependencies
conda install av -c conda-forge
pip install -r requirements.txt
pip install --editable .
# or
pip install textbook
Download raw datasets
./fetch.sh
It downloads alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa from AWS into data_cache.
In case you want to use something-something, please download the dataset from 20bn's website.
Usage
Template
The goal of a template is to transform raw text into an intermediate datum where abstract information is provided for later use.
Ideally, the template should do the following:
- construct text: a list of lists. The outer list is for multiple-choice situations and the inner list is for each input pair/triplet (e.g. context, question, and choice);
- construct label: an integer representing a zero-indexed label for the truth, or None;
- construct token_type_id and attention: abstract representations of the segment ID and attention. In the anli example below, both token_type_id and attention have three digits, one for each of the three components of each row of the text;
- construct image: any form of image ID/path you want to read later.
One example of anli is as follows:
# raw
case = {"story_id": "58090d3f-8a91-4c89-83ef-2b4994de9d241", "obs1": "Ron started his new job as a landscaper today.",
"obs2": "Ron is immediately fired for insubordination.", "hyp1": "Ron ignores his bosses's orders and called him an idiot.",
"hyp2": "Ron's boss called him an idiot.", "label": "1"}
# target intermediate datum
target = {
'text':
[['Ron started his new job as a landscaper today.', "Ron ignores his bosses's orders and called him an idiot.",
'Ron is immediately fired for insubordination.'],
['Ron started his new job as a landscaper today.', "Ron's boss called him an idiot.",
'Ron is immediately fired for insubordination.']],
'label': 0, 'image': None, 'token_type_id': [0, 1, 0],
'attention': [1, 1, 1]}
LABEL2INT = {
"anli": {
"1": 0,
"2": 1,
},
}
assert template_anli(case, LABEL2INT['anli']) == target
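For reference, a template producing the datum above could look roughly like this. This is a minimal sketch derived from the target structure shown, not the library's actual template_anli implementation:

```python
def template_anli_sketch(datum, label2int):
    """Sketch: turn a raw anli record into the intermediate format."""
    return {
        # one inner list per hypothesis: (context, hypothesis, outcome)
        "text": [
            [datum["obs1"], datum[f"hyp{i}"], datum["obs2"]]
            for i in (1, 2)
        ],
        # zero-indexed label, or None for unlabeled data
        "label": None if datum.get("label") is None else label2int[datum["label"]],
        "image": None,
        # abstract per-component segment ids and attention flags
        "token_type_id": [0, 1, 0],
        "attention": [1, 1, 1],
    }
```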
Renderer
A renderer transforms your intermediate datum into a fully rendered datum. Each renderer takes care of a different part of the datum. For example, renderer_text renders the text into input_id and generates all token-based attention and token_type_id values, while renderer_video renders the image path into an image tensor. Renderers are passed to the dataset constructor in a list and are therefore executed sequentially.
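The core idea of the text renderer, expanding the abstract per-component token_type_id and attention into per-token values, can be sketched as follows. The tokenize callable here stands in for a real tokenizer (e.g. a BERT tokenizer); the function name and exact output keys are illustrative, not the library's API:

```python
def renderer_text_sketch(datum, tokenize):
    """Sketch: tokenize each text component and expand the abstract
    token_type_id/attention digits to one value per token."""
    input_ids, token_type_ids, attentions = [], [], []
    for choice in datum["text"]:  # one row per answer choice
        ids, types, attn = [], [], []
        for part, type_id, att in zip(choice, datum["token_type_id"], datum["attention"]):
            part_ids = tokenize(part)
            ids.extend(part_ids)
            types.extend([type_id] * len(part_ids))  # segment id for every token
            attn.extend([att] * len(part_ids))       # attention flag for every token
        input_ids.append(ids)
        token_type_ids.append(types)
        attentions.append(attn)
    return {**datum, "input_id": input_ids,
            "token_type_id": token_type_ids, "attention": attentions}
```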
BatchTool
We provide a BatchTool with which MLM or padding can be applied easily; check the class documentation for more information.
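The padding part can be pictured like this. This is an illustrative sketch of what batch padding does, not BatchTool's actual code:

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length token-id lists to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
```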
Load a dataset with pandas
from transformers import BertTokenizer
from textbook import MultiModalDataset, template_anli, renderer_text, BatchTool, TokenBasedSampler
from torch.utils.data import Dataset, DataLoader
from textbook import LABEL2INT
import pandas as pd
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
d1 = MultiModalDataset(
df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
template=lambda x: template_anli(x, LABEL2INT['anli']),
renderers=[lambda x: renderer_text(x, tokenizer)],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)
Create a multitask dataset with multiple datasets
from transformers import BertTokenizer
from textbook import *
import pandas as pd
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
# add additional tokens for each task as special `cls_token`
tokenizer.add_special_tokens({"additional_special_tokens": [
"[ANLI]", "[HELLASWAG]"
]})
d1 = MultiModalDataset(
df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
template=lambda x: template_anli(x, LABEL2INT['anli']),
renderers=[lambda x: renderer_text(x, tokenizer, "[ANLI]")],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)
d2 = MultiModalDataset(
df=pd.read_json("data_cache/hellaswag/train.jsonl", lines=True),
template=lambda x: template_hellaswag(x, LABEL2INT['hellaswag']),
renderers=[lambda x: renderer_text(x, tokenizer, "[HELLASWAG]")],
)
bt2 = BatchTool(tokenizer, source="hellaswag")
i2 = DataLoader(d2, batch_sampler=TokenBasedSampler(d2, batch_size=64), collate_fn=bt2.collate_fn)
d = MultiTaskDataset([i1, i2], shuffle=False)
#! batch size must be 1 for MultiTaskDataset, because each sub-dataset is already batched.
for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):
pass
# {
# "source": "anli" or "hellaswag",
# "labels": ...,
# "input_ids": ...,
# "attentions": ...,
# "token_type_ids": ...,
# "images": ...,
# }
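Conceptually, the multitask dataset just interleaves the already-batched loaders; a simplified sketch of that round-robin behaviour (not MultiTaskDataset's actual code) is:

```python
from itertools import zip_longest

def interleave(*loaders):
    """Round-robin over pre-batched iterables, skipping exhausted ones."""
    sentinel = object()
    for group in zip_longest(*loaders, fillvalue=sentinel):
        for batch in group:
            if batch is not sentinel:
                yield batch
```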
Implement a new template or renderer
It is advised to follow the conventions below, but you can do whatever you like since you can use a lambda anywhere.
def template_xxx(raw_datum, *args, **kwargs):
pass
def renderer_xxx(intermediate_datum, *args, **kwargs):
pass
e.g. For Quora question pairs dataset:
def template_qqp(datum, label2int={"0": 0, "1": 1}):
    result = {
        "text": [
            [datum['question1'], datum['question2']]
        ],
        "image": None,
        "label": None if datum.get('is_duplicate') is None else label2int[str(datum['is_duplicate'])],
        "token_type_id": [0, 1],
        "attention": [1, 1],
    }
    return result
Contact
Author: Chenghao Mou
Email: mouchenghao@gmail.com