Text classification datasets
The framework is designed with BERT in mind and currently supports seven commonsense reasoning datasets (alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa). It can also be applied to other datasets with a few lines of code.
Architecture
Dependency
```bash
conda install av -c conda-forge
pip install -r requirements.txt
pip install --editable .
# or
pip install textbook
```
Download raw datasets
```bash
./fetch.sh
```
It downloads alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa from AWS into `data_cache`.
If you want to use Something-Something, please download the dataset from 20bn's website.
Usage
Template
The goal of a template is to transform raw text into an intermediate datum that carries abstract information for later use.
Ideally, the template should do the following things:
- construct `text`: a list of lists. The outer list holds the multiple choices and each inner list holds the input pair/triplet (e.g. context, question, and choice);
- construct `label`: an integer representing a zero-indexed label for the ground truth, or `None`;
- construct `token_type_id` and `attention`: abstract representations of the segment ids and attention. In the following anli example, both `token_type_id` and `attention` have three digits, one for each of the three components in every row of the text;
- construct `image`: any form of image id/path you want to read later.
One example of anli is as follows:
```python
# raw
case = {"story_id": "58090d3f-8a91-4c89-83ef-2b4994de9d241",
        "obs1": "Ron started his new job as a landscaper today.",
        "obs2": "Ron is immediately fired for insubordination.",
        "hyp1": "Ron ignores his bosses's orders and called him an idiot.",
        "hyp2": "Ron's boss called him an idiot.",
        "label": "1"}

# target intermediate datum
target = {
    'text': [['Ron started his new job as a landscaper today.',
              "Ron ignores his bosses's orders and called him an idiot.",
              'Ron is immediately fired for insubordination.'],
             ['Ron started his new job as a landscaper today.',
              "Ron's boss called him an idiot.",
              'Ron is immediately fired for insubordination.']],
    'label': 0,
    'image': None,
    'token_type_id': [0, 1, 0],
    'attention': [1, 1, 1]}

LABEL2INT = {
    "anli": {
        "1": 0,
        "2": 1,
    },
}

assert template_anli(case, LABEL2INT['anli']) == target
```
Renderer
A renderer transforms your intermediate datum into a fully blown datum. Each renderer takes care of a different part of the datum. For example, `renderer_text` renders the text into `input_id` and generates all token-based `attention` and `token_type_id`, while `renderer_video` renders the `image` path into an `image` tensor. Renderers are passed to the dataset constructor in a list, so they are executed sequentially.
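For intuition, the sketch below shows what a text renderer can look like: it tokenizes each text component and expands the per-component digits of `token_type_id` and `attention` to token level. It is only an illustration built on a Hugging Face tokenizer; the package's real `renderer_text` may differ in signature and field names.

```python
def my_renderer_text(datum, tokenizer):
    """Illustrative renderer: tokenize each component of every choice and
    expand the per-component `token_type_id`/`attention` digits to token level.
    (Not the package's actual `renderer_text` implementation.)"""
    datum["input_id"], datum["token_type_ids"], datum["attentions"] = [], [], []
    for choice in datum["text"]:  # one entry per answer choice
        ids = [tokenizer.cls_token_id]
        types, attns = [0], [1]
        for part, seg, att in zip(choice, datum["token_type_id"], datum["attention"]):
            part_ids = tokenizer.encode(part, add_special_tokens=False)
            ids += part_ids + [tokenizer.sep_token_id]
            types += [seg] * (len(part_ids) + 1)
            attns += [att] * (len(part_ids) + 1)
        datum["input_id"].append(ids)
        datum["token_type_ids"].append(types)
        datum["attentions"].append(attns)
    return datum
```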
BatchTool
We provide a BatchTool so that MLM and padding can be used easily; check the class documentation for more information.
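As a mental model (not BatchTool's actual code), collating with padding boils down to something like the sketch below; `pad_batch` is a name made up for this illustration.

```python
import torch

def pad_batch(input_ids, pad_id=0):
    """Right-pad variable-length id lists into one tensor plus a matching
    attention mask (illustration only; BatchTool also supports MLM masking)."""
    max_len = max(len(ids) for ids in input_ids)
    padded = torch.full((len(input_ids), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros((len(input_ids), max_len), dtype=torch.long)
    for i, ids in enumerate(input_ids):
        padded[i, :len(ids)] = torch.tensor(ids, dtype=torch.long)
        mask[i, :len(ids)] = 1
    return padded, mask
```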
Load a dataset with pandas
```python
from transformers import BertTokenizer
from textbook import MultiModalDataset, template_anli, renderer_text, BatchTool, TokenBasedSampler
from textbook import LABEL2INT
from torch.utils.data import Dataset, DataLoader
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

d1 = MultiModalDataset(
    df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
    template=lambda x: template_anli(x, LABEL2INT['anli']),
    renderers=[lambda x: renderer_text(x, tokenizer)],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)
```
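You can then iterate over `i1` like any other DataLoader; the exact keys of each batch come from `BatchTool.collate_fn` (the multitask example below lists the fields you can expect).

```python
# Peek at one collated batch (keys assumed to follow the layout shown below).
for batch in i1:
    print(batch.keys())
    break
```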
Create a multitask dataset with multiple datasets
```python
from transformers import BertTokenizer
from textbook import *
from torch.utils.data import DataLoader
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
# add additional tokens for each task as special `cls_token`
tokenizer.add_special_tokens({"additional_special_tokens": [
    "[ANLI]", "[HELLASWAG]"
]})

d1 = MultiModalDataset(
    df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
    template=lambda x: template_anli(x, LABEL2INT['anli']),
    renderers=[lambda x: renderer_text(x, tokenizer, "[ANLI]")],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)

d2 = MultiModalDataset(
    df=pd.read_json("data_cache/hellaswag/train.jsonl", lines=True),
    template=lambda x: template_hellaswag(x, LABEL2INT['hellaswag']),
    renderers=[lambda x: renderer_text(x, tokenizer, "[HELLASWAG]")],
)
bt2 = BatchTool(tokenizer, source="hellaswag")
i2 = DataLoader(d2, batch_sampler=TokenBasedSampler(d2, batch_size=64), collate_fn=bt2.collate_fn)

d = MultiTaskDataset([i1, i2], shuffle=False)

# ! batch size must be 1 for MultiTaskDataset, because each sub-dataset is already batched.
for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):
    pass
    # {
    #     "source": "anli" or "hellaswag",
    #     "labels": ...,
    #     "input_ids": ...,
    #     "attentions": ...,
    #     "token_type_ids": ...,
    #     "images": ...,
    # }
```
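Because every batch carries a `source` field, a multitask training loop can dispatch it to a task-specific head. The sketch below is hypothetical (the heads, hidden size, and routing function are not part of textbook):

```python
import torch.nn as nn

# Hypothetical per-task heads keyed by the batch's "source" field;
# 768 is the hidden size of bert-base-cased.
heads = nn.ModuleDict({
    "anli": nn.Linear(768, 1),
    "hellaswag": nn.Linear(768, 1),
})

def score_choices(batch, pooled_output):
    """Route the encoder's pooled output to the head matching the batch's task."""
    return heads[batch["source"]](pooled_output)
```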
Implement a new template or renderer
It is advised to follow the conventions below, but you can do whatever you like since you can use a `lambda` anywhere.
```python
def template_xxx(raw_datum, *args, **kwargs):
    pass

def renderer_xxx(intermediate_datum, *args, **kwargs):
    pass
```
e.g. for the Quora Question Pairs dataset:
```python
def template_qqp(raw_datum, label2int={"0": 0, "1": 1}):
    result = {
        "text": [
            [raw_datum['question1'], raw_datum['question2']]
        ],
        "image": None,
        "label": None if 'is_duplicate' not in raw_datum or raw_datum['is_duplicate'] is None
        else label2int[str(raw_datum['is_duplicate'])],
        "token_type_id": [0, 1],
        "attention": [1, 1],
    }
    return result
```
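Wiring the new template into a dataset then mirrors the earlier examples; the path below is a placeholder for wherever you keep your copy of QQP.

```python
# Placeholder path: point this at your own copy of Quora Question Pairs.
d_qqp = MultiModalDataset(
    df=pd.read_csv("path/to/quora_question_pairs.csv"),
    template=template_qqp,
    renderers=[lambda x: renderer_text(x, tokenizer)],
)
bt_qqp = BatchTool(tokenizer, source="qqp")
i_qqp = DataLoader(d_qqp, batch_sampler=TokenBasedSampler(d_qqp, batch_size=64), collate_fn=bt_qqp.collate_fn)
```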
Contact
Author: Chenghao Mou
Email: mouchenghao@gmail.com