Preprocessings to prepare datasets for a task
tasksource: 500+ dataset harmonization preprocessings for frictionless extreme multi-task learning and evaluation
Hugging Face Datasets is a great library, but it lacks standardization, and datasets require preprocessing work before they can be used interchangeably. tasksource automates this preprocessing and facilitates reproducible scaling of multi-task learning and meta-learning.
Each dataset is standardized to a MultipleChoice, Classification, or TokenClassification dataset with identical fields. We focus on discriminative tasks (i.e., tasks with negative examples or classes) and do not yet support generation tasks, as they are addressed by promptsource. All implemented preprocessings can be found in tasks.py or tasks.md. A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.
Installation and usage:
pip install tasksource
from tasksource import list_tasks, load_task
df = list_tasks()  # table of supported tasks, one row per task
for task_id in df[df.task_type=="MultipleChoice"].id:
    dataset = load_task(task_id)  # all yielded datasets can be used interchangeably
See the 500+ supported tasks in tasks.md (200+ MultipleChoice tasks, 200+ Classification tasks) and feel free to request a new task. Datasets are downloaded to $HF_DATASETS_CACHE (like any Hugging Face dataset), so make sure you have more than 100GB of free space there.
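Before downloading many tasks, it can be useful to check how much space is available at the cache location. The following is a minimal sketch using only the standard library. The helper names `datasets_cache_dir` and `free_gb` are hypothetical, and the fallback path assumes the usual Hugging Face default when neither `HF_DATASETS_CACHE` nor `HF_HOME` is set.

```python
import os
import shutil
from pathlib import Path

REQUIRED_GB = 100  # threshold taken from the recommendation above


def datasets_cache_dir() -> Path:
    """Return $HF_DATASETS_CACHE, or the assumed default location if unset."""
    default = Path.home() / ".cache" / "huggingface" / "datasets"
    return Path(os.environ.get("HF_DATASETS_CACHE", default))


def free_gb(path: Path) -> float:
    """Free space in GB on the filesystem that holds (or would hold) `path`."""
    probe = path.expanduser().resolve()
    # The cache directory may not exist yet: walk up to an existing parent.
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


cache = datasets_cache_dir()
available = free_gb(cache)
if available < REQUIRED_GB:
    print(f"Only {available:.0f} GB free at {cache}; consider moving HF_DATASETS_CACHE.")
else:
    print(f"{available:.0f} GB free at {cache}.")
```

If space is short, pointing `HF_DATASETS_CACHE` at a larger disk before importing the library is one option.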
Pretrained model:
A text encoder pretrained on tasksource reached state-of-the-art results: 🤗/deberta-v3-base-tasksource-nli. Tasksource pretraining should also be quite helpful for pretraining RLHF reward models.
Contact and citation
I can help you integrate tasksource into your experiments: damien.sileo@inria.fr
More details in this article:
@article{sileo2023tasksource,
title={tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation},
author={Sileo, Damien},
url= {https://arxiv.org/abs/2301.05948},
journal={arXiv preprint arXiv:2301.05948},
year={2023}
}