Galaxy Zoo datasets for PyTorch/Lightning

pytorch-galaxy-datasets

PyTorch Datasets and PyTorch Lightning Datamodules for loading images and labels from Galaxy Zoo citizen science campaigns.

Name Class Published Downloadable Galaxies
Galaxy Zoo 2 GZ2 ~210k (main sample)
GZ Hubble Hubble ~106k (main sample)
GZ CANDELS Candels ~50k
GZ DECaLS GZD-5 DecalsDR5 ~230k
Galaxy Zoo Rings Rings ~93k
GZ Legacy Survey Legs z < 0.1 only ~375k + 8.3m unlabelled
CFHT Tidal* Tidal 1760 (expert)

Any datasets marked as downloadable but not marked as published are only downloadable internally (for development purposes).

If a dataset is published but not marked as downloadable (none currently), it means I haven't yet got around to making the download automatic. You can still download it via the paper instructions.

You may also be interested in Galaxy MNIST as a simple dataset for teaching/debugging.

For each dataset, you must cite/acknowledge the GZ data release paper and the original telescope survey from which the images were derived. See data.galaxyzoo.org for the data release paper citations to use.

*CFHT Tidal is not a Galaxy Zoo dataset, but rather a small expert-labelled dataset of tidal features from Atkinson 2013. MW reproduced and modified the images in Walmsley 2019. We include it here as a challenging fine-grained morphology classification task with little labelled data.

Installation

For local development (e.g. adding a new dataset), clone the repo from GitHub, then run pip install -e . in the cloned repo root.

It is also installed by default as a dependency of zoobot if you specify the PyTorch version of zoobot. That route is slightly trickier if you'd like to make changes, because the package is installed under your site-packages.

Usage

You can load each prepared dataset as a PyTorch Dataset like so:

import matplotlib.pyplot as plt  # needed for plt.imshow below
from pytorch_galaxy_datasets.prepared_datasets import GZ2Dataset

gz2_dataset = GZ2Dataset(
    root='/nvme1/scratch/walml/repos/pytorch-galaxy-datasets/roots/gz2',
    train=True,
    download=False
)
image, label = gz2_dataset[0]  # first galaxy: its image and its label
plt.imshow(image)
plt.show()

You will probably want to customise the dataset, selecting a subset of galaxies or labels. Do this with the {dataset}_setup() functions, which return the catalog and the label columns.

from pytorch_galaxy_datasets.prepared_datasets import gz2_setup

catalog, label_cols = gz2_setup(
    root='/nvme1/scratch/walml/repos/pytorch-galaxy-datasets/roots/gz2',
    train=True,
    download=False
)
adjusted_catalog = catalog.sample(1000)  # random subset of 1000 galaxies
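
The customisation step might look like the sketch below. It assumes the catalog returned by gz2_setup() is a pandas DataFrame (implied by the .sample() call above) and uses a small made-up catalog so that it runs without downloading anything. The columns id_str, file_loc and smooth-or-featured_featured-or-disk, and all the vote counts, are illustrative assumptions rather than confirmed contents of the real catalog.

```python
import pandas as pd

# Toy stand-in for the catalog returned by gz2_setup(). The real catalog is
# assumed (not confirmed here) to be a pandas DataFrame with one row per galaxy.
catalog = pd.DataFrame({
    'id_str': [f'galaxy_{n}' for n in range(10)],                # hypothetical column
    'file_loc': [f'images/galaxy_{n}.jpg' for n in range(10)],   # hypothetical column
    'smooth-or-featured_smooth': [0, 3, 12, 25, 40, 7, 31, 18, 2, 29],
    'smooth-or-featured_featured-or-disk': [40, 30, 20, 10, 0, 33, 5, 15, 38, 9],  # hypothetical column
})

# Select a subset of galaxies: keep rows with at least 20 "smooth" votes.
mostly_smooth = catalog[catalog['smooth-or-featured_smooth'] >= 20]

# Take a reproducible random sample; random_state fixes the draw.
adjusted_catalog = mostly_smooth.sample(n=3, random_state=42)

# Select a subset of labels: keep a single label column.
adjusted_label_cols = ['smooth-or-featured_smooth']

print(len(mostly_smooth), len(adjusted_catalog))
```

Filtering before sampling keeps the sample size predictable. Passing random_state makes the subset reproducible between runs.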

You can then customise the catalog and labels before creating a generic GalaxyDataset, which can be used with your own transforms etc. like any other PyTorch dataset.

from pytorch_galaxy_datasets.galaxy_dataset import GalaxyDataset

dataset = GalaxyDataset(
    label_cols=['smooth-or-featured_smooth'],
    catalog=adjusted_catalog,
    transforms=some_torchvision_transforms_if_you_like  # placeholder: pass your own transforms here
)

For training models, I recommend using PyTorch Lightning and GalaxyDataModule, which has default transforms for supervised learning.

from pytorch_galaxy_datasets.galaxy_datamodule import GalaxyDataModule

datamodule = GalaxyDataModule(
    label_cols=['smooth-or-featured_smooth'],
    catalog=adjusted_catalog
)

datamodule.prepare_data()
datamodule.setup()
for images, labels in datamodule.train_dataloader():
    print(images.shape, labels.shape)
    break

You can also get the canonical catalog and label_cols from the Dataset, if you prefer.

gz2_catalog = gz2_dataset.catalog
gz2_label_cols = gz2_dataset.label_cols

Download Notes

Downloaded datasets are laid out like so:

  • {root}
    • images
      • subfolder (except GZ2)
        • image.jpg
    • {catalog_name(s)}.parquet
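
One way to check that a root folder matches this layout is sketched below. It uses only the Python standard library. The helper name summarise_root and the file names in the demo are made up for illustration and are not part of this package.

```python
from pathlib import Path
import tempfile


def summarise_root(root):
    """Return (image paths, catalog paths) found under a dataset root."""
    root = Path(root)
    # rglob finds images directly in images/ (as for GZ2) and in subfolders.
    images = sorted((root / 'images').rglob('*.jpg'))
    # Catalogs are parquet files sitting directly under the root.
    catalogs = sorted(root.glob('*.parquet'))
    return images, catalogs


# Build a tiny fake root to demonstrate; every file name here is made up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'images' / 'subfolder').mkdir(parents=True)
    (root / 'images' / 'subfolder' / 'image.jpg').touch()
    (root / 'example_catalog.parquet').touch()

    images, catalogs = summarise_root(root)
    image_names = [p.relative_to(root).as_posix() for p in images]
    catalog_names = [p.name for p in catalogs]

print(image_names)
print(catalog_names)
```

Pointing summarise_root at your own root should list at least one parquet catalog and the images beneath images/.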
