
Text classification datasets


Textbook: Universal NLP Datasets

Textbook currently supports a few commonsense reasoning datasets (alphanli, hellaswag, physicaliqa, socialiqa, codah, and commonsenseqa). It uses Ray's multiprocessing to load and process the datasets.

Architecture

Architecture Image

Dependency

    pip install -r requirements.txt

Download raw datasets

    bash fetch.sh

This downloads alphanli, hellaswag, physicaliqa, socialiqa, codah, and commonsenseqa from AWS. If you want to use something-something, please download that dataset from 20bn's website.
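To check that the download worked, it can help to look at the first few records of one of the files. The sketch below is not part of textbook: `peek_jsonl` is a hypothetical helper, and it only assumes that the downloaded files are in JSON Lines format (one JSON object per line), as the `.jsonl` extension in the usage example suggests. It makes no assumption about field names.

```python
import json


def peek_jsonl(path, n=3):
    """Return the first n non-empty records of a JSON Lines file as Python objects."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            records.append(json.loads(line))
            if len(records) >= n:
                break
    return records


# Example (path taken from the usage section; it exists only after fetch.sh has run):
# for record in peek_jsonl("data_cache/alphanli/eval.jsonl"):
#     print(sorted(record.keys()))
```

Printing the keys of each record is a quick way to see which fields a dataset provides before wiring it into a configuration.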

Usage

Initialize ray

    import ray

    # Reserve 1 GiB of memory (the value is in bytes) and 2 CPUs for Ray workers
    ray.init(memory=1024 * 1024 * 1024, num_cpus=2)

Load a dataset

    from torch.utils.data import DataLoader  # assumed to be PyTorch's DataLoader; textbook may already re-export it
    from transformers import BertTokenizer
    from textbook import *

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    text_renderer = TextRenderer.remote(tokenizer)

    anli_tool = BatchTool(tokenizer, max_seq_len=128, source="anli")
    anli_dataset = TextDataset(path='data_cache/alphanli/eval.jsonl',
                               config=ANLIConfiguration.remote(),
                               renderers=[text_renderer])

    # Option 1: batch by number of examples
    anli_iter = DataLoader(anli_dataset, batch_size=2, collate_fn=anli_tool.collate_fn)

    # Option 2: batch by number of tokens
    anli_iter = DataLoader(anli_dataset,
                           batch_sampler=TokenBasedSampler(anli_dataset, batch_size=128),
                           collate_fn=anli_tool.collate_fn)
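The difference between the two batching modes is easier to see with a small, library-independent sketch. The function below is not `TokenBasedSampler`'s implementation, which this README does not describe. `token_budget_batches` is a hypothetical name, and the greedy in-order grouping and the use of raw (unpadded) token counts are assumptions made for illustration only.

```python
from typing import Iterator, List, Sequence


def token_budget_batches(lengths: Sequence[int], max_tokens: int) -> Iterator[List[int]]:
    """Greedily group example indices so each batch's total token count stays within max_tokens.

    An example that is longer than max_tokens on its own still gets a batch to itself.
    """
    batch: List[int] = []
    used = 0
    for idx, n_tokens in enumerate(lengths):
        # Close the current batch if adding this example would exceed the budget
        if batch and used + n_tokens > max_tokens:
            yield batch
            batch, used = [], 0
        batch.append(idx)
        used += n_tokens
    if batch:
        yield batch


# Token counts for six hypothetical examples, with a budget of 128 tokens per batch
print(list(token_budget_batches([50, 60, 30, 90, 128, 10], max_tokens=128)))
# → [[0, 1], [2, 3], [4], [5]]
```

With a fixed `batch_size`, every batch holds the same number of examples no matter how long they are. With a token budget, short examples are packed together and long ones travel alone, which keeps the cost of each batch more even. PyTorch's `DataLoader` accepts any iterable that yields lists of indices as its `batch_sampler`, which is the shape this sketch produces.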


