Galaxy Zoo datasets for PyTorch/Lightning

pytorch-galaxy-datasets

PyTorch Datasets and PyTorch Lightning Datamodules for loading images and labels from Galaxy Zoo citizen science campaigns.

Name | Class | Published | Downloadable | Galaxies
Galaxy Zoo 2 | GZ2 | | | ~210k (main sample)
GZ Hubble | Hubble | | | ~106k (main sample)
GZ CANDELS | Candels | | | ~50k
GZ DECaLS GZD-5 | DecalsDR5 | | | ~230k
Galaxy Zoo Rings | Rings | | | ~93k
GZ Legacy Survey | Legs | z < 0.1 only | | ~375k + 8.3m unlabelled
CFHT Tidal* | Tidal | | | 1760 (expert)

Any datasets marked as downloadable but not marked as published are only downloadable internally (for development purposes).

If a dataset is published but not marked as downloadable (none currently), it means I haven't yet got around to making the download automatic. You can still download it via the paper instructions.

You may also be interested in Galaxy MNIST as a simple dataset for teaching/debugging.

For each dataset, you must cite/acknowledge the GZ data release paper and the original telescope survey from which the images were derived. See data.galaxyzoo.org for the data release paper citations to use.

*CFHT Tidal is not a Galaxy Zoo dataset, but rather a small expert-labelled dataset of tidal features from Atkinson 2013. MW reproduced and modified the images in Walmsley 2019. We include it here as a challenging fine-grained morphology classification task with little labelled data.

Installation

For local development (e.g. adding a new dataset), you can install this package by cloning it from GitHub, then running pip install -e . in the root of the cloned repo.

Note that installing zoobot will install this package as a dependency (by automatically running pip install pytorch_galaxy_datasets). As with any package, pip will install it under your site-packages, so you won't be able to make changes easily.

I suggest either:

  • For development, installing both zoobot and pytorch_galaxy_datasets via git
  • For basic use without changes, installing zoobot via pip and allowing pip to manage this dependency

Usage

You can load each prepared dataset as a PyTorch Dataset like so:

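import matplotlib.pyplot as plt  # needed for plt.imshow below

from pytorch_galaxy_datasets.prepared_datasets import GZ2Dataset

gz2_dataset = GZ2Dataset(
    root='/nvme1/scratch/walml/repos/pytorch-galaxy-datasets/roots/gz2',
    train=True,
    download=False
)
# indexing the dataset returns an (image, label) pair
image, label = gz2_dataset[0]
plt.imshow(image)
plt.show()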

You will probably want to customise the dataset, selecting a subset of galaxies or labels. Do this with the {dataset}_setup() functions, which return the catalog and the label columns.

from pytorch_galaxy_datasets.prepared_datasets import gz2_setup

catalog, label_cols = gz2_setup(
    root='/nvme1/scratch/walml/repos/pytorch-galaxy-datasets/roots/gz2',
    train=True,
    download=False
)
# e.g. keep a random subset of 1000 galaxies
adjusted_catalog = catalog.sample(1000)
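As a rough sketch of what customising the catalog can look like, the example below filters and subsamples a toy catalog with pandas. It assumes the catalog behaves like a pandas DataFrame (as the .sample() call above suggests). Apart from smooth-or-featured_smooth, every column name here (id_str, file_loc, smooth-or-featured_total-votes) and the vote threshold are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for the catalog returned by gz2_setup().
# Every column except 'smooth-or-featured_smooth' is a hypothetical name
# used only for this illustration.
catalog = pd.DataFrame({
    'id_str': ['a', 'b', 'c', 'd', 'e', 'f'],
    'file_loc': [f'images/{name}.jpg' for name in 'abcdef'],
    'smooth-or-featured_smooth': [30, 2, 18, 0, 25, 7],
    'smooth-or-featured_total-votes': [40, 5, 36, 3, 41, 35],
})

# Keep only galaxies with at least min_votes volunteer votes (assumed threshold).
min_votes = 30
well_classified = catalog[catalog['smooth-or-featured_total-votes'] >= min_votes]

# Take a reproducible random subset; random_state fixes the draw.
adjusted_catalog = well_classified.sample(n=3, random_state=42).reset_index(drop=True)

# Restrict the labels to the single column you want to predict.
label_cols = ['smooth-or-featured_smooth']

print(len(well_classified), len(adjusted_catalog))
```

Passing random_state makes the subset repeatable between runs, which helps when comparing models trained on the same sample.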

You can then customise the catalog and labels before creating a generic GalaxyDataset, which can be used with your own transforms etc. like any other PyTorch dataset.

from pytorch_galaxy_datasets.galaxy_dataset import GalaxyDataset

dataset = GalaxyDataset(
    label_cols=['smooth-or-featured_smooth'],
    catalog=adjusted_catalog,
    transforms=some_torchvision_transforms_if_you_like
)

For training models, I recommend using PyTorch Lightning and GalaxyDataModule, which has default transforms for supervised learning.

from pytorch_galaxy_datasets.galaxy_datamodule import GalaxyDataModule

datamodule = GalaxyDataModule(
    label_cols=['smooth-or-featured_smooth'],
    catalog=adjusted_catalog
)

datamodule.prepare_data()
datamodule.setup()
for images, labels in datamodule.train_dataloader():
    print(images.shape, labels.shape)
    break

You can also get the canonical catalog and label_cols from the Dataset, if you prefer.

gz2_catalog = gz2_dataset.catalog
gz2_label_cols = gz2_dataset.label_cols

Download Notes

Downloaded datasets are laid out on disk as follows:

  • {root}
    • images
      • subfolder (except GZ2)
        • image.jpg
    • {catalog_name(s)}.parquet
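To check a download against the layout above, a small helper along these lines may be useful. This is only a sketch using the standard library: list_images and list_catalogs are hypothetical helpers (not part of this package), and the directory tree is a throwaway mock-up with a made-up catalog name.

```python
import tempfile
from pathlib import Path


def list_images(root):
    """Return all .jpg files under {root}/images, with or without subfolders."""
    return sorted((Path(root) / 'images').rglob('*.jpg'))


def list_catalogs(root):
    """Return the .parquet catalog files stored directly under {root}."""
    return sorted(Path(root).glob('*.parquet'))


# Build a temporary tree that mimics the layout, purely for illustration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'images' / 'subfolder').mkdir(parents=True)
    (root / 'images' / 'subfolder' / 'image.jpg').touch()
    (root / 'example_catalog.parquet').touch()  # hypothetical catalog name

    image_names = [p.relative_to(root).as_posix() for p in list_images(root)]
    catalog_names = [p.name for p in list_catalogs(root)]

print(image_names)
print(catalog_names)
```

Because rglob searches recursively, the same helper also finds images stored directly under images with no subfolder, as for GZ2.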
