Text classification datasets
The framework is designed with BERT in mind and currently supports seven commonsense reasoning datasets (alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa). It can also be applied to other datasets with a few lines of code.
Architecture
Dependencies
conda install av -c conda-forge
pip install -r requirements.txt
pip install --editable .
# or
pip install textbook
Download raw datasets
./fetch.sh
It downloads alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa from AWS into data_cache.
In case you want to use something-something, please download the dataset from 20bn's website.
Usage
Template
The goal of a template is to transform raw text into an intermediate datum where abstract information is provided for later use.
Ideally, the template should do the following:
- construct text: a list of lists. The outer list is for multiple-choice situations and the inner list is for each input pair/triplet (e.g. context, question, and choice);
- construct label: an integer representing a zero-indexed label for the truth, or None;
- construct token_type_id and attention: abstract representations of the segment ID and attention. In the anli example below, both token_type_id and attention have three digits, one for each of the three components of each row of the text;
- construct image: any form of image ID/path you want to read later.
One example of anli is as follows:
# raw
case = {"story_id": "58090d3f-8a91-4c89-83ef-2b4994de9d241", "obs1": "Ron started his new job as a landscaper today.",
"obs2": "Ron is immediately fired for insubordination.", "hyp1": "Ron ignores his bosses's orders and called him an idiot.",
"hyp2": "Ron's boss called him an idiot.", "label": "1"}
# target intermediate datum
target = {
'text':
[['Ron started his new job as a landscaper today.', "Ron ignores his bosses's orders and called him an idiot.",
'Ron is immediately fired for insubordination.'],
['Ron started his new job as a landscaper today.', "Ron's boss called him an idiot.",
'Ron is immediately fired for insubordination.']],
'label': 0, 'image': None, 'token_type_id': [0, 1, 0],
'attention': [1, 1, 1]}
LABEL2INT = {
"anli": {
"1": 0,
"2": 1,
},
}
assert template_anli(case, LABEL2INT['anli']) == target
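For reference, a template producing the datum above could look roughly like this. This is a minimal sketch derived from the target structure shown, not the library's actual template_anli implementation:

```python
def template_anli_sketch(datum, label2int):
    """Sketch: turn a raw anli record into the intermediate format."""
    return {
        # one inner list per hypothesis: (context, hypothesis, outcome)
        "text": [
            [datum["obs1"], datum[f"hyp{i}"], datum["obs2"]]
            for i in (1, 2)
        ],
        # zero-indexed label, or None for unlabeled data
        "label": None if datum.get("label") is None else label2int[datum["label"]],
        "image": None,
        # abstract per-component segment ids and attention flags
        "token_type_id": [0, 1, 0],
        "attention": [1, 1, 1],
    }
```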
Renderer
A renderer transforms your intermediate datum into a fully rendered datum. Each renderer takes care of a different part of the datum. For example, renderer_text renders the text into input_id and generates all token-based attention and token_type_id values, while renderer_video renders the image path into an image tensor. Renderers are passed to the dataset constructor in a list and are therefore executed sequentially.
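The core idea of the text renderer, expanding the abstract per-component token_type_id and attention into per-token values, can be sketched as follows. The tokenize callable here stands in for a real tokenizer (e.g. a BERT tokenizer); the function name and exact output keys are illustrative, not the library's API:

```python
def renderer_text_sketch(datum, tokenize):
    """Sketch: tokenize each text component and expand the abstract
    token_type_id/attention digits to one value per token."""
    input_ids, token_type_ids, attentions = [], [], []
    for choice in datum["text"]:  # one row per answer choice
        ids, types, attn = [], [], []
        for part, type_id, att in zip(choice, datum["token_type_id"], datum["attention"]):
            part_ids = tokenize(part)
            ids.extend(part_ids)
            types.extend([type_id] * len(part_ids))  # segment id for every token
            attn.extend([att] * len(part_ids))       # attention flag for every token
        input_ids.append(ids)
        token_type_ids.append(types)
        attentions.append(attn)
    return {**datum, "input_id": input_ids,
            "token_type_id": token_type_ids, "attention": attentions}
```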
BatchTool
We provide a BatchTool with which MLM or padding can be applied easily; check the class documentation for more information.
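The padding part can be pictured like this. This is an illustrative sketch of what batch padding does, not BatchTool's actual code:

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length token-id lists to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
```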
Load a dataset with pandas
from transformers import BertTokenizer
from textbook import MultiModalDataset, template_anli, renderer_text, BatchTool, TokenBasedSampler
from torch.utils.data import Dataset, DataLoader
from textbook import LABEL2INT
import pandas as pd
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
d1 = MultiModalDataset(
df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
template=lambda x: template_anli(x, LABEL2INT['anli']),
renderers=[lambda x: renderer_text(x, tokenizer)],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)
Create a multitask dataset with multiple datasets
from transformers import BertTokenizer
from textbook import *
import pandas as pd
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
# add additional tokens for each task as special `cls_token`
tokenizer.add_special_tokens({"additional_special_tokens": [
"[ANLI]", "[HELLASWAG]"
]})
d1 = MultiModalDataset(
df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
template=lambda x: template_anli(x, LABEL2INT['anli']),
renderers=[lambda x: renderer_text(x, tokenizer, "[ANLI]")],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)
d2 = MultiModalDataset(
df=pd.read_json("data_cache/hellaswag/train.jsonl", lines=True),
template=lambda x: template_hellaswag(x, LABEL2INT['hellaswag']),
renderers=[lambda x: renderer_text(x, tokenizer, "[HELLASWAG]")],
)
bt2 = BatchTool(tokenizer, source="hellaswag")
i2 = DataLoader(d2, batch_sampler=TokenBasedSampler(d2, batch_size=64), collate_fn=bt2.collate_fn)
d = MultiTaskDataset([i1, i2], shuffle=False)
#! batch size must be 1 for MultiTaskDataset, because each sub-dataset is already batched.
for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):
pass
# {
# "source": "anli" or "hellaswag",
# "labels": ...,
# "input_ids": ...,
# "attentions": ...,
# "token_type_ids": ...,
# "images": ...,
# }
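Conceptually, the multitask dataset just interleaves the already-batched loaders; a simplified sketch of that round-robin behaviour (not MultiTaskDataset's actual code) is:

```python
from itertools import zip_longest

def interleave(*loaders):
    """Round-robin over pre-batched iterables, skipping exhausted ones."""
    sentinel = object()
    for group in zip_longest(*loaders, fillvalue=sentinel):
        for batch in group:
            if batch is not sentinel:
                yield batch
```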
Implement a new template or renderer
It is advised to follow the conventions below, but you can do whatever you like since you can use a lambda anywhere.
def template_xxx(raw_datum, *args, **kwargs):
pass
def renderer_xxx(intermediate_datum, *args, **kwargs):
pass
e.g. For Quora question pairs dataset:
def template_qqp(datum, label2int={"0": 0, "1": 1}):
    result = {
        "text": [
            [datum['question1'], datum['question2']]
        ],
        "image": None,
        "label": None if datum.get('is_duplicate') is None else label2int[str(datum['is_duplicate'])],
        "token_type_id": [0, 1],
        "attention": [1, 1],
    }
    return result
Contact
Author: Chenghao Mou
Email: mouchenghao@gmail.com