Wild-Time distribution shift data
Project description
This repository provides a simple way to use the Wild-Time datasets in your own experiments. In contrast to the original repository, it contains only the code relevant for dataset loading and has fewer, more relaxed requirements. It also addresses several data-loading bugs that currently prevent downloading the datasets from the original repository.
Usage
The following code returns a PyTorch dataset for the training partition of the arXiv dataset in 2023. The data will be downloaded to the wild-time-data folder unless it has already been downloaded there before.
from wild_time_data import load_dataset
load_dataset(dataset_name="arxiv", time_step=2023, split="train", data_dir="wild-time-data")
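The returned object can be used like any other PyTorch dataset, for example wrapped in a DataLoader. The sketch below makes illustrative assumptions (batch size, shuffling) and only iterates once:

from torch.utils.data import DataLoader

from wild_time_data import load_dataset

# Load the 2023 training partition of the arXiv dataset (same call as above).
train_data = load_dataset(
    dataset_name="arxiv", time_step=2023, split="train", data_dir="wild-time-data"
)

# Wrap the dataset in a standard PyTorch DataLoader; the batch size is arbitrary.
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)

for batch in train_loader:
    # Each `batch` is one mini-batch produced by the default collate function.
    break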
In the following we describe the available options in more detail; a combined example follows the list.
- dataset_name: The options are arxiv, drug, fmow, huffpost, and yearbook. This list can be accessed via
from wild_time_data import list_datasets
list_datasets()
- time_step: Most datasets are grouped by year; this argument lets you access the data from different time
intervals. The range differs from dataset to dataset. Use the following command to get a list of available time steps:
from wild_time_data import available_time_steps
available_time_steps("arxiv")
- split: Selects the partition. Can be either train or test.
- data_dir: Location where the data is stored. By default, it will be downloaded to ~/wild-time-data/.
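As a combined example, the sketch below loops over every dataset and its available time steps and loads each training partition. The data directory is the same illustrative choice as above, and printing the length assumes the returned dataset implements __len__, as PyTorch datasets typically do:

from wild_time_data import available_time_steps, list_datasets, load_dataset

for name in list_datasets():
    for step in available_time_steps(name):
        # Download (if necessary) and load the training partition.
        data = load_dataset(dataset_name=name, time_step=step, split="train", data_dir="wild-time-data")
        print(name, step, len(data))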
Other Useful Functions
Several other functions can be imported from wild_time_data; a short usage sketch follows the list below.
from wild_time_data import available_time_steps, input_dim, list_datasets, num_outputs
- available_time_steps: Provide the dataset name and the list of available time steps is returned.
Example: available_time_steps("huffpost") returns [2012, 2013, 2014, 2015, 2016, 2017, 2018].
- input_dim: Provide the dataset name and the input dimensionality is returned. For image datasets this is the input shape; for text
datasets it is the maximum number of space-separated words. Example: input_dim("yearbook") returns (3, 32, 32).
- list_datasets: Returns the list of all available datasets.
Example: list_datasets() returns ["arxiv", "drug", "fmow", "huffpost", "yearbook"].
- num_outputs: Provide the dataset name and the number of classes is returned; a return value of 1
indicates a regression task. Example: num_outputs("arxiv") returns 172.
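These helpers can be used, for instance, to size a model for a given dataset. The following sketch is purely illustrative (a single linear layer, not a recommended architecture) and assumes, as for yearbook, that input_dim returns an image shape:

import torch

from wild_time_data import input_dim, num_outputs

# Input shape and number of outputs for the yearbook dataset.
shape = input_dim("yearbook")   # e.g. (3, 32, 32)
n_out = num_outputs("yearbook")

# Illustrative model only: flatten the image and apply one linear layer.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(shape[0] * shape[1] * shape[2], n_out),
)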
Licenses
All additional code for Wild-Time-Data is available under the Apache 2.0 license. We list the licenses for each Wild-Time dataset below:
- arXiv: CC0: Public Domain
- Drug-BA: MIT License
- FMoW: The Functional Map of the World Challenge Public License
- Huffpost: CC0: Public Domain
- Yearbook: MIT License
Furthermore, this repository is loosely based on the Wild-Time repository, which is licensed under the MIT License.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file wild_time_data-0.0.2.tar.gz.
File metadata
- Download URL: wild_time_data-0.0.2.tar.gz
- Upload date:
- Size: 231.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7abec5cdbf451884d22b1080322b04299f7ee0ef70fdb0034ad722c072819942
MD5 | bc4534fdb229ee2901199071b4ac8f83
BLAKE2b-256 | 3b80f35f8f2712d97203576105b9718c5499c3fd945ca1e5b1c5fcb787e551a8
File details
Details for the file wild_time_data-0.0.2-py3-none-any.whl.
File metadata
- Download URL: wild_time_data-0.0.2-py3-none-any.whl
- Upload date:
- Size: 10.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | d33a4ddb348d431405c22c697c3e080b7dd9f17bfc6910211a493e6eda1ffe66
MD5 | 62ec98b0ec25bc585a4052a911d30dab
BLAKE2b-256 | 4b316f343c8621f77776fc753d27bd00498ef0579c7ab7a54f50e68ccd795283
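To verify a downloaded file against the digests listed above, a short Python snippet such as the following can be used; the local file path is an assumption about where the archive was saved:

import hashlib

# Path to the locally downloaded source distribution; adjust as needed.
path = "wild_time_data-0.0.2.tar.gz"

with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

# Compare against the SHA256 digest listed for the source distribution above.
print(digest == "7abec5cdbf451884d22b1080322b04299f7ee0ef70fdb0034ad722c072819942")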