Preprocessings to prepare datasets for a task

tasksource: 500+ dataset harmonization preprocessings for effortless extreme multi-task learning and evaluation

Hugging Face Datasets is an excellent library, but it lacks standardization, and datasets often require preprocessing before they can be used interchangeably. tasksource makes datasets interchangeable so that evaluation and multi-task learning can scale.

Each dataset is standardized to a MultipleChoice, Classification, or TokenClassification template with canonical fields. We focus on discriminative tasks (that is, tasks with negative examples or classes) and do not yet support generation tasks, as they are addressed by promptsource. All implemented preprocessings are listed in tasks.py or tasks.md. A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.
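To make the idea of "a function that accepts a dataset and returns the standardized dataset" concrete, here is a minimal sketch. It is not tasksource's implementation: the helper name make_classification_preprocessing, the canonical field names (sentence1, sentence2, labels), and the toy rows are assumptions chosen for illustration, and plain lists of dicts stand in for Hugging Face datasets.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


def make_classification_preprocessing(
    sentence1: str, labels: str, sentence2: Optional[str] = None
) -> Callable[[List[Row]], List[Row]]:
    """Build a function that renames dataset-specific columns to canonical ones.

    Hypothetical helper: the canonical names used here are assumptions.
    """

    def preprocess(dataset: List[Row]) -> List[Row]:
        standardized = []
        for row in dataset:
            new_row = {"sentence1": row[sentence1], "labels": row[labels]}
            if sentence2 is not None:
                new_row["sentence2"] = row[sentence2]
            standardized.append(new_row)
        return standardized

    return preprocess


# Two toy datasets with different column names (made-up rows).
nli_raw = [{"premise": "A man sleeps.", "hypothesis": "A man is awake.", "label": 2}]
sentiment_raw = [{"text": "Great movie!", "sentiment": 1}]

nli = make_classification_preprocessing("premise", "label", sentence2="hypothesis")(nli_raw)
sentiment = make_classification_preprocessing("text", "sentiment")(sentiment_raw)
```

After preprocessing, both toy datasets expose the same field names, so downstream training or evaluation code does not need to know where each one came from.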

Installation and usage:

pip install tasksource

from tasksource import list_tasks, load_task

# List every available task as a table (takes some time)
df = list_tasks()

# Load each MultipleChoice task; all loaded datasets can be used interchangeably
for task_id in df[df.task_type == "MultipleChoice"].id:
    dataset = load_task(task_id)

Browse the 500+ curated tasks in tasks.md (200+ MultipleChoice tasks, 200+ Classification tasks), and feel free to request a new task. Datasets are downloaded to $HF_DATASETS_CACHE (like any Hugging Face dataset), so ensure you have more than 100GB of space available.
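Before downloading many tasks, it can help to check how much space is free where the cache lives. The sketch below uses only the Python standard library. The function names are hypothetical, and the fallback path ~/.cache/huggingface/datasets is assumed to be the default location when $HF_DATASETS_CACHE is unset; other Hugging Face environment variables may change it.

```python
import os
import shutil
from pathlib import Path


def datasets_cache_dir() -> Path:
    """Return $HF_DATASETS_CACHE if set, else the assumed default location."""
    default = Path.home() / ".cache" / "huggingface" / "datasets"
    return Path(os.environ.get("HF_DATASETS_CACHE", str(default)))


def free_gb(path: Path) -> float:
    """Free space, in GB, on the filesystem that holds `path`."""
    # The cache directory may not exist yet, so walk up to an existing parent.
    probe = path.expanduser().resolve()
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


def has_room(required_gb: float = 100.0) -> bool:
    """True if the cache filesystem has more than `required_gb` GB free."""
    return free_gb(datasets_cache_dir()) > required_gb


if __name__ == "__main__":
    cache = datasets_cache_dir()
    print(f"cache: {cache}, free: {free_gb(cache):.1f} GB, enough: {has_room()}")
```

If the check fails, pointing $HF_DATASETS_CACHE at a larger disk before loading tasks is one option.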

Pretrained Model

A text encoder pretrained on tasksource reached state-of-the-art results: 🤗/deberta-v3-base-tasksource-nli

Tasksource pretraining is notably helpful for RLHF reward modeling.

Contact and citation

For help integrating tasksource into your experiments, please contact damien.sileo@inria.fr.

For more details, refer to this article:

@article{sileo2023tasksource,
  title={tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation},
  author={Sileo, Damien},
  url= {https://arxiv.org/abs/2301.05948},
  journal={arXiv preprint arXiv:2301.05948},
  year={2023}
}
