Skip to main content

Preprocessings to prepare datasets for a task

Project description

tasksource 600+ curated datasets and preprocessings for instant and interchangeable use

Huggingface Datasets is an excellent library, but it lacks standardization, and datasets often require preprocessing work to be used interchangeably. tasksource streamlines interchangeable datasets usage to scale evaluation or multi-task learning.

Each dataset is standardized to a MultipleChoice, Classification, or TokenClassification template with canonical fields. We focus on discriminative tasks (= with negative examples or classes) for our annotations but also provide a SequenceToSequence template. All implemented preprocessings are in tasks.py or tasks.md. A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.

Installation and usage:

pip install tasksource

from tasksource import list_tasks, load_task
df = list_tasks(multilingual=False) # takes some time

for id in df[df.task_type=="MultipleChoice"].id:
    dataset = load_task(id) # all yielded datasets can be used interchangeably

Browse the 500+ curated tasks in tasks.md (200+ MultipleChoice tasks, 200+ Classification tasks), and feel free to request a new task. Datasets are downloaded to $HF_DATASETS_CACHE (like any Hugging Face dataset), so ensure you have more than 100GB of space available.

You can now also use:

load_dataset("tasksource/data", "glue/rte",max_rows=30_000)

Pretrained models:

Text encoder pretrained on tasksource reached state-of-the-art results: 🤗/deberta-v3-base-tasksource-nli

Tasksource pretraining is notably helpful for RLHF reward modeling or any kind of classification, including zero-shot. You can also find a large and a multilingual version.

tasksource-instruct

The repo also contains some recasting code to convert tasksource datasets to instructions, providing one of the richest instruction-tuning datasets: 🤗/tasksource-instruct-v0

tasksource-label-nli

We also recast all classification tasks as natural language inference, to improve entailment-based zero-shot classification detection: 🤗/zero-shot-label-nli

Write and use custom preprocessings

from tasksource import MultipleChoice

codah = MultipleChoice('question_propmt',choices_list='candidate_answers',
    labels='correct_answer_idx',
    dataset_name='codah', config_name='codah')
    
winogrande = MultipleChoice('sentence',['option1','option2'],'answer',
    dataset_name='winogrande',config_name='winogrande_xl',
    splits=['train','validation',None]) # test labels are not usable
    
tasks = [winogrande.load(), codah.load()]) #  Aligned datasets (same columns) can be used interchangably  

Citation and contact

For more details, refer to this article:

@inproceedings{sileo-2024-tasksource,
    title = "tasksource: A Large Collection of {NLP} tasks with a Structured Dataset Preprocessing Framework",
    author = "Sileo, Damien",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1361",
    pages = "15655--15684",
}

For help integrating tasksource into your experiments, please contact damien.sileo@inria.fr.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tasksource-0.0.45.tar.gz (84.2 kB view details)

Uploaded Source

Built Distribution

tasksource-0.0.45-py3-none-any.whl (45.1 kB view details)

Uploaded Python 3

File details

Details for the file tasksource-0.0.45.tar.gz.

File metadata

  • Download URL: tasksource-0.0.45.tar.gz
  • Upload date:
  • Size: 84.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for tasksource-0.0.45.tar.gz
Algorithm Hash digest
SHA256 f607d1ab8ea090bd168e7948ea05b72bfdedd61cc47aeda059dbc77f900f5723
MD5 0c19fafe08857966386708176b0f235a
BLAKE2b-256 06dc6f0cfcbe4f5bb3a8dd81698a1b86651974b663e809490077ffc02f37549f

See more details on using hashes here.

File details

Details for the file tasksource-0.0.45-py3-none-any.whl.

File metadata

  • Download URL: tasksource-0.0.45-py3-none-any.whl
  • Upload date:
  • Size: 45.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.12.4

File hashes

Hashes for tasksource-0.0.45-py3-none-any.whl
Algorithm Hash digest
SHA256 f599a3739d0a7bdf46d5188f28cf9d90a0325c77ac3737b88c8ef68b709582f4
MD5 35a1132cf22bf095366cb1e072a8fe60
BLAKE2b-256 c0bbce6a86f9936d0d121f673d6e8b2423a1ff6030e66cd2f0fd887ec86cf0cf

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page