Galaxy Zoo datasets for PyTorch/Lightning
pytorch-galaxy-datasets
PyTorch Datasets and PyTorch Lightning Datamodules for loading images and labels from Galaxy Zoo citizen science campaigns.
Name | Class | Published | Downloadable | Galaxies |
---|---|---|---|---|
Galaxy Zoo 2 | GZ2 | ☑ | ☑ | ~210k (main sample) |
GZ Hubble | Hubble | ☑ | ☑ | ~106k (main sample) |
GZ CANDELS | Candels | ☑ | ☑ | ~50k |
GZ DECaLS GZD-5 | DecalsDR5 | ☑ | ☑ | ~230k |
Galaxy Zoo Rings | Rings | ☒ | ☑ | ~93k |
GZ Legacy Survey | Legs | ☒ | z < 0.1 only | ~375k + 8.3m unlabelled |
CFHT Tidal* | Tidal | ☑ | ☑ | 1760 (expert) |
Any datasets marked as downloadable but not marked as published are only downloadable internally (for development purposes).
If a dataset is published but not marked as downloadable (none currently), it means I haven't yet got around to making the download automatic. You can still download it via the paper instructions.
You may also be interested in Galaxy MNIST as a simple dataset for teaching/debugging.
For each dataset, you must cite/acknowledge the GZ data release paper and the original telescope survey from which the images were derived. See data.galaxyzoo.org for the data release paper citations to use.
*CFHT Tidal is not a Galaxy Zoo dataset, but rather a small expert-labelled dataset of tidal features from Atkinson 2013. MW reproduced and modified the images in Walmsley 2019. We include it here as a challenging fine-grained morphology classification task with little labelled data.
Installation
For local development (e.g. adding a new dataset), you can install this by cloning from GitHub, then running `pip install -e .` in the cloned repo root.
It will also be installed by default as a dependency of `zoobot` if you specify the PyTorch version of `zoobot`, but this is slightly trickier if you'd like to make changes, as it'll be installed under your `site-packages`.
Usage
You can load each prepared dataset as a PyTorch Dataset like so:
import matplotlib.pyplot as plt

from pytorch_galaxy_datasets.prepared_datasets import GZ2Dataset

gz2_dataset = GZ2Dataset(
    root='/nvme1/scratch/walml/repos/pytorch-galaxy-datasets/roots/gz2',
    train=True,
    download=False
)
# each item is an (image, label) pair
image, label = gz2_dataset[0]
plt.imshow(image)
You will probably want to customise the dataset, selecting a subset of galaxies or labels. Do this with the `{dataset}_setup()` functions.
from pytorch_galaxy_datasets.prepared_datasets import gz2_setup

catalog, label_cols = gz2_setup(
    root='/nvme1/scratch/walml/repos/pytorch-galaxy-datasets/roots/gz2',
    train=True,
    download=False
)
# e.g. keep a random subset of 1000 galaxies
adjusted_catalog = catalog.sample(1000)
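As a rough illustration of what "customising" might look like, the sketch below filters and subsamples a toy catalog. It assumes the catalog is a pandas DataFrame (consistent with the `.sample()` call above). The `id_str` and `file_loc` columns, the toy values, and the vote threshold are all made up for illustration; only the `smooth-or-featured_smooth` column name comes from this README, and treating it as a vote count is an assumption.

```python
import pandas as pd

# Toy stand-in for the catalog returned by gz2_setup().
# 'id_str' and 'file_loc' are hypothetical column names, not confirmed by the README.
catalog = pd.DataFrame({
    'id_str': [f'galaxy_{n}' for n in range(10)],
    'file_loc': [f'images/galaxy_{n}.jpg' for n in range(10)],
    'smooth-or-featured_smooth': [0, 3, 12, 7, 1, 25, 9, 0, 14, 6],
})

# Keep only galaxies with at least 5 in the label column (an arbitrary illustrative cut)
well_voted = catalog[catalog['smooth-or-featured_smooth'] >= 5]

# Take a reproducible random subset; random_state fixes which rows are chosen
adjusted_catalog = well_voted.sample(n=3, random_state=42).reset_index(drop=True)

# Restrict the labels to the columns you care about
label_cols = ['smooth-or-featured_smooth']
print(adjusted_catalog[['id_str'] + label_cols])
```

Resetting the index after sampling keeps positional lookups simple for whatever consumes the catalog next.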
You can then customise the catalog and labels before creating a generic GalaxyDataset, which can be used with your own transforms etc. like any other PyTorch dataset.
from pytorch_galaxy_datasets.galaxy_dataset import GalaxyDataset

dataset = GalaxyDataset(
    label_cols=['smooth-or-featured_smooth'],
    catalog=adjusted_catalog,
    transforms=some_torchvision_transforms_if_you_like  # placeholder for your own transforms
)
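To make the "like any other PyTorch dataset" idea concrete, here is a minimal sketch of the map-style protocol (`__len__` plus `__getitem__`) that such a dataset follows. This is not the library's `GalaxyDataset`: the `CatalogDataset` class, the `file_loc` key, and the fake loader are hypothetical, and the catalog is a plain list of dicts so the example runs without extra dependencies. In real code the class would normally subclass `torch.utils.data.Dataset` and load images from disk.

```python
class CatalogDataset:
    """Hypothetical map-style dataset: one catalog row -> (image, labels)."""

    def __init__(self, catalog, label_cols, load_image, transform=None):
        self.catalog = list(catalog)        # one dict per galaxy
        self.label_cols = list(label_cols)  # which keys to return as labels
        self.load_image = load_image        # callable: path -> image
        self.transform = transform          # optional callable: image -> image

    def __len__(self):
        return len(self.catalog)

    def __getitem__(self, index):
        row = self.catalog[index]
        image = self.load_image(row['file_loc'])
        if self.transform is not None:
            image = self.transform(image)
        labels = [row[col] for col in self.label_cols]
        return image, labels


def fake_loader(path):
    # Stand-in for reading a JPEG: returns a 2x2 "image" of zeros
    return [[0, 0], [0, 0]]


def add_one(image):
    # Stand-in for a real transform
    return [[pixel + 1 for pixel in row] for row in image]


catalog = [
    {'file_loc': 'images/a.jpg', 'smooth-or-featured_smooth': 12},
    {'file_loc': 'images/b.jpg', 'smooth-or-featured_smooth': 3},
]
dataset = CatalogDataset(catalog, ['smooth-or-featured_smooth'], fake_loader, transform=add_one)
image, labels = dataset[0]
print(len(dataset), image, labels)
```

Keeping the loader and transform as injected callables is one possible design; it makes the row-to-sample logic easy to test without touching the filesystem.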
For training models, I recommend using PyTorch Lightning and GalaxyDataModule, which has default transforms for supervised learning.
from pytorch_galaxy_datasets.galaxy_datamodule import GalaxyDataModule

datamodule = GalaxyDataModule(
    label_cols=['smooth-or-featured_smooth'],
    catalog=adjusted_catalog
)
datamodule.prepare_data()
datamodule.setup()
# peek at a single training batch
for images, labels in datamodule.train_dataloader():
    print(images.shape, labels.shape)
    break
You can also get the canonical catalog and label_cols from the Dataset, if you prefer.
gz2_catalog = gz2_dataset.catalog
gz2_label_cols = gz2_dataset.label_cols
Download Notes
Datasets are downloaded like:
- {root}
    - images
        - subfolder (except GZ2)
            - image.jpg
    - {catalog_name(s)}.parquet
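If you want to sanity-check a download against the layout above, something like the following sketch may help. It assumes the catalog `.parquet` files sit directly under the root next to the `images` folder. `summarise_root` is a hypothetical helper, not part of this package, and the example builds a throwaway directory rather than touching a real dataset.

```python
import tempfile
from pathlib import Path


def summarise_root(root):
    """Count JPEGs under {root}/images and list catalog files in {root}."""
    root = Path(root)
    # rglob finds images both directly in images/ (as for GZ2) and in subfolders
    images = sorted((root / 'images').rglob('*.jpg'))
    catalogs = sorted(path.name for path in root.glob('*.parquet'))
    return {'n_images': len(images), 'catalogs': catalogs}


# Build a throwaway root that mimics the layout, with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'images' / 'subfolder').mkdir(parents=True)
    (root / 'images' / 'subfolder' / 'image.jpg').touch()
    (root / 'images' / 'flat_image.jpg').touch()
    (root / 'example_catalog.parquet').touch()
    summary = summarise_root(root)

print(summary)
```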