hf-dataset-selector

A convenient and fast Python package to find the best datasets for intermediate fine-tuning for your task.

How it works

hf-dataset-selector helps you find datasets on the Hugging Face Hub that are good candidates for intermediate fine-tuning before you train on your target task. For each intermediate task, it downloads a small neural network (~2.4 MB) from the Hugging Face Hub. These networks are called Embedding Space Maps (ESMs); each one transforms the embeddings that the base language model produces for your target data. The transformed embeddings are then scored with LogME, and the intermediate tasks are ranked by that score.

hf-dataset-selector only ranks datasets with a corresponding ESM on the Hugging Face Hub. We encourage you to train and publish your own ESMs for your datasets to enable others to rank them.
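
As a rough illustration of the ranking step, the following sketch scores two sets of embeddings with a NumPy version of the LogME evidence maximization and sorts them by score. It is a toy under stated assumptions, not the code this package runs: `logme_score`, `logme_classification`, and the two synthetic feature matrices (standing in for ESM-transformed embeddings) are hypothetical names introduced here.

```python
import numpy as np


def logme_score(features, targets, n_iter=100, tol=1e-6):
    """Per-sample log evidence of a Bayesian linear model for one target column."""
    f = np.asarray(features, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    n, d = f.shape
    # Work in the eigenbasis of F^T F so every update is element-wise.
    sigma, v = np.linalg.eigh(f.T @ f)
    sigma = np.clip(sigma, 0.0, None)
    z = v.T @ (f.T @ y)
    alpha, beta = 1.0, 1.0
    for _ in range(n_iter):
        m = beta * z / (alpha + beta * sigma)  # posterior mean (eigenbasis)
        gamma = np.sum(beta * sigma / (alpha + beta * sigma))
        res2 = np.sum((f @ (v @ m) - y) ** 2)  # squared residual of the fit
        new_alpha = gamma / (m @ m + 1e-12)
        new_beta = (n - gamma) / (res2 + 1e-12)
        done = (abs(new_alpha - alpha) <= tol * alpha
                and abs(new_beta - beta) <= tol * beta)
        alpha, beta = new_alpha, new_beta
        if done:
            break
    m = beta * z / (alpha + beta * sigma)
    res2 = np.sum((f @ (v @ m) - y) ** 2)
    evidence = (
        0.5 * n * np.log(beta)
        + 0.5 * d * np.log(alpha)
        - 0.5 * n * np.log(2.0 * np.pi)
        - 0.5 * beta * res2
        - 0.5 * alpha * (m @ m)
        - 0.5 * np.sum(np.log(alpha + beta * sigma))
    )
    return evidence / n


def logme_classification(features, labels):
    """Average the score over one-vs-rest (one-hot) label columns."""
    labels = np.asarray(labels)
    return float(np.mean([logme_score(features, labels == k)
                          for k in np.unique(labels)]))


rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)

# Synthetic stand-ins for embeddings transformed by two different ESMs.
centers = 2.0 * np.eye(3, 8)  # one well-separated center per class
candidates = {
    "informative": centers[labels] + 0.3 * rng.standard_normal((300, 8)),
    "noise": rng.standard_normal((300, 8)),
}

scores = {name: logme_classification(emb, labels)
          for name, emb in candidates.items()}
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Embeddings that separate the target labels well receive a higher score, so the candidate listed first is the one whose embedding space fits the target task best in this toy setup.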

What are Embedding Space Maps?

Embedding Space Maps (ESMs) are neural networks that approximate the effect of fine-tuning a language model on a task. They can be used to quickly transform embeddings from a base model to approximate how a fine-tuned model would embed the input text. ESMs can be used for intermediate task selection with the ESM-LogME workflow.
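
A minimal sketch of the idea, assuming an ESM is a single linear layer over 768-dimensional embeddings. Both the architecture and the `LinearESM` class are assumptions made for illustration and are not taken from the package; the float32 parameter size of such a layer (about 2.36 MB) happens to be in line with the ~2.4MB size mentioned above.

```python
import numpy as np

EMBED_DIM = 768  # assumed embedding width of the base model


class LinearESM:
    """Hypothetical stand-in for an ESM: one affine map on embeddings."""

    def __init__(self, weight, bias):
        self.weight = np.asarray(weight, dtype=np.float32)
        self.bias = np.asarray(bias, dtype=np.float32)

    def __call__(self, embeddings):
        # Map base-model embeddings into the approximated fine-tuned space.
        return embeddings @ self.weight.T + self.bias

    @property
    def size_in_bytes(self):
        return self.weight.nbytes + self.bias.nbytes


rng = np.random.default_rng(0)

# Random weights close to the identity; a real ESM would be trained.
esm = LinearESM(
    weight=np.eye(EMBED_DIM)
    + 0.01 * rng.standard_normal((EMBED_DIM, EMBED_DIM)),
    bias=np.zeros(EMBED_DIM),
)

# Four fake sentence embeddings standing in for base-model outputs.
base_embeddings = rng.standard_normal((4, EMBED_DIM)).astype(np.float32)
transformed = esm(base_embeddings)

print(transformed.shape)   # (4, 768)
print(esm.size_in_bytes)   # 2362368
```

Because the map is applied to embeddings that have already been computed, transforming them for many intermediate tasks only costs one small matrix multiplication per task in this sketch.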

How to install

hf-dataset-selector is available on PyPI:

$ pip install hf-dataset-selector

How to find suitable datasets for your problem

Example

from hfselect import Dataset, compute_task_ranking

# Load target dataset from the Hugging Face Hub
dataset = Dataset.from_hugging_face(
    name="stanfordnlp/imdb",
    split="train",
    text_col="text",
    label_col="label",
    is_regression=False,
    num_examples=1000,
    seed=42
)

# Fetch ESMs and rank tasks
task_ranking = compute_task_ranking(
    dataset=dataset,
    model_name="bert-base-multilingual-uncased"
)

# Display top 5 recommendations
print(task_ranking[:5])

How to train your own ESM

[TBD]

How to cite

If you use hf-dataset-selector, please cite our paper.

BibTeX:

@misc{schulte2024moreparameterefficientselectionintermediate,
      title={Less is More: Parameter-Efficient Selection of Intermediate Tasks for Transfer Learning}, 
      author={David Schulte and Felix Hamborg and Alan Akbik},
      year={2024},
      eprint={2410.15148},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.15148}, 
}

APA:

Schulte, D., Hamborg, F., & Akbik, A. (2024). Less is More: Parameter-Efficient Selection of Intermediate Tasks for Transfer Learning. arXiv preprint arXiv:2410.15148.

How to reproduce the results from the paper

To reproduce the results from our paper, please refer to the emnlp-submission branch.

Acknowledgements

Our method extends the LogME method for intermediate task selection. We adapt the implementation by the authors: https://github.com/tuvuumass/task-transferability
