
Wild-Time distribution shift data

Project description

This repository provides a simple way to use the Wild-Time datasets in your own experiments. In contrast to the original repository, it contains only the code relevant for dataset loading and has fewer, more relaxed requirements. Finally, it addresses some data-loading bugs that currently prevent downloading the datasets in the original repository.

[Image: yearbook.png]

Usage

The following code returns a PyTorch dataset for the training partition of the arXiv dataset in 2023. The data will be downloaded to the wild-time-data folder unless it was downloaded there before.

from wild_time_data import load_dataset

load_dataset(dataset_name="arxiv", time_step=2023, split="train", data_dir="wild-time-data")
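Since load_dataset returns a PyTorch dataset, it should support len() and indexing (assuming a map-style dataset, which is typical); as a quick sanity check:

from wild_time_data import load_dataset

dataset = load_dataset(dataset_name="arxiv", time_step=2023, split="train", data_dir="wild-time-data")

# Number of training examples in the 2023 time step.
print(len(dataset))

# First sample; whether this is an (input, label) pair is an assumption.
print(dataset[0])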

In the following, we provide some more details on the available options; a short usage sketch follows the list.

  • dataset_name: The options are arxiv, drug, fmow, huffpost, and yearbook. This list can be accessed via
    from wild_time_data import list_datasets
    
    list_datasets()
  • time_step: Most datasets are grouped by year; this argument allows you to access the data from different time intervals. The range differs from dataset to dataset. Use the following command to get a list of available time steps:
    from wild_time_data import available_time_steps
    
    available_time_steps("arxiv")
  • split: Selects the partition. Can be either train or test.
  • data_dir: Location where the data is stored. By default, it is downloaded to ~/wild-time-data/.
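Putting the options together, here is a minimal training-loop sketch. That the samples can be batched by DataLoader's default collate function is an assumption, not something this package documents:

from torch.utils.data import DataLoader

from wild_time_data import load_dataset

# Load the 2023 training partition of the arXiv dataset.
dataset = load_dataset(dataset_name="arxiv", time_step=2023, split="train", data_dir="wild-time-data")

# Wrap the dataset in a standard DataLoader for mini-batch training.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    # batch holds a mini-batch of samples; a real training step goes here.
    pass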

Other Useful Functions

Several other functions can be imported from wild_time_data; a short example combining them follows this list.

from wild_time_data import available_time_steps, input_dim, list_datasets, num_outputs
  • available_time_steps: Provide the dataset name and the list of available time steps is returned.

    Example: available_time_steps("huffpost") returns [2012, 2013, 2014, 2015, 2016, 2017, 2018].

  • input_dim: Provide the dataset name and the input dimensionality is returned. For image datasets this is the shape; for text datasets it is the maximum number of space-separated words.

    Example: input_dim("yearbook") returns (3, 32, 32).

  • list_datasets: Returns the list of all available datasets.

    Example: list_datasets() returns ["arxiv", "drug", "fmow", "huffpost", "yearbook"].

  • num_outputs: Provide the dataset name and the number of outputs is returned: the number of classes for classification tasks, or 1 to indicate a regression task.

    Example: num_outputs("arxiv") returns 172.
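These helpers compose naturally. As a small sketch using only the functions listed above, the following loop prints the basic metadata for every dataset:

from wild_time_data import available_time_steps, input_dim, list_datasets, num_outputs

# Print the time range, input dimensionality, and number of outputs per dataset.
for name in list_datasets():
    steps = available_time_steps(name)
    print(f"{name}: time steps {steps[0]}-{steps[-1]}, "
          f"input dim {input_dim(name)}, outputs {num_outputs(name)}")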

Licenses

All additional code for Wild-Time-Data is available under the Apache 2.0 license. We list the licenses for each Wild-Time dataset below:

Furthermore, this repository is loosely based on the Wild-Time repository, which is licensed under the MIT License.
