
Text classification datasets


Textbook: Universal NLP Datasets

Currently supports a few commonsense reasoning datasets (alphanli, hellaswag, physicaliqa, socialiqa, codah, and commonsenseqa). It uses Ray's multiprocessing to load and process the datasets.

Dependency

`pip install -r requirements.txt`

Download raw datasets

```bash
bash fetch.sh
```

It downloads alphanli, hellaswag, physicaliqa, socialiqa, codah, and commonsenseqa from AWS. If you want to use something-something, please download the dataset from 20bn's website.
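The usage example in this README reads `data_cache/alphanli/eval.jsonl`, which suggests that at least the alphanli data is stored as JSON Lines (one JSON object per line). A minimal sketch for peeking at the first few records of such a file, using only the standard library. The helper name `peek_jsonl` is hypothetical and not part of textbook, and the demo writes made-up records to a temporary file rather than assuming `fetch.sh` has already run:

```python
import json
import os
import tempfile
from itertools import islice


def peek_jsonl(path, n=3):
    """Return the first n records of a JSON Lines file as dicts."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines and parse each remaining line as one JSON object.
        records = (json.loads(line) for line in f if line.strip())
        return list(islice(records, n))


# Demo on a temporary file with made-up records (not real dataset content).
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"id": 1, "text": "first"}\n{"id": 2, "text": "second"}\n')
    demo_path = tmp.name

demo_records = peek_jsonl(demo_path, n=1)
os.remove(demo_path)
print(demo_records)
```

Once the download has finished, calling `peek_jsonl('data_cache/alphanli/eval.jsonl')` should show the field names of the first records, assuming the file exists at that path.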

Usage

Initialize ray

```python
import ray

# Start Ray with 1 GiB of memory and 2 CPUs
ray.init(memory=1024 * 1024 * 1024, num_cpus=2)
```

Load a dataset

```python
# DataLoader comes from PyTorch; imported explicitly in case textbook does not re-export it
from torch.utils.data import DataLoader
from transformers import BertTokenizer
from textbook import *

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# The renderer and the configuration are created through Ray, hence .remote()
text_renderer = TextRenderer.remote(tokenizer)

anli_tool = BatchTool(tokenizer, max_seq_len=128, source="anli")
anli_dataset = TextDataset(path='data_cache/alphanli/eval.jsonl',
                           config=ANLIConfiguration.remote(),
                           renderers=[text_renderer])

# Batch by number of examples
anli_iter = DataLoader(anli_dataset, batch_size=2, collate_fn=anli_tool.collate_fn)

# Batch by number of tokens
anli_iter = DataLoader(anli_dataset,
                       batch_sampler=TokenBasedSampler(anli_dataset, batch_size=128),
                       collate_fn=anli_tool.collate_fn)
```
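To illustrate the difference between the two batching modes: a PyTorch `batch_sampler` yields lists of example indices, so batching by tokens amounts to grouping indices until a token budget is reached. The sketch below shows that idea in plain Python. It is not the implementation of `TokenBasedSampler`; the function `token_batches` and the token counts are hypothetical, and whether textbook counts padding or sorts examples by length is not stated in this README.

```python
def token_batches(lengths, max_tokens):
    """Group example indices so each batch holds at most max_tokens tokens.

    lengths: token count of each example, in dataset order.
    An example longer than max_tokens ends up in a batch of its own.
    """
    batches, current, current_tokens = [], [], 0
    for idx, n in enumerate(lengths):
        # Start a new batch when adding this example would exceed the budget.
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


# Hypothetical token counts for six examples.
example_lengths = [40, 50, 60, 30, 128, 10]
batches = token_batches(example_lengths, max_tokens=128)
print(batches)  # [[0, 1], [2, 3], [4], [5]]
```

With `batch_size=2` every batch has exactly two examples regardless of length, whereas a token budget produces batches of varying size whose total length stays bounded.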

