Galaxy Zoo datasets for PyTorch/TensorFlow

Development Status
- 4 - Beta
Environment
- GPU :: NVIDIA CUDA
License
- OSI Approved :: GNU General Public License (GPL)
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

galaxy-datasets

ML-friendly datasets for major Galaxy Zoo citizen science campaigns.

- PyTorch Datasets and PyTorch Lightning DataModules
- Framework-independent download and augmentation code

See also our HuggingFace datasets, which offer faster downloads and more flexible use. This repo was created earlier and may ultimately be replaced by HuggingFace.

Name	Method	PyTorch Dataset	Published	Downloadable	Galaxies
Galaxy Zoo 2	gz2	GZ2	☑	☑	~210k (main sample)
GZ UKIDSS	gz_ukidss	GZUKIDSS	☒	☑	~71k
GZ Hubble*	gz_hubble	GZHubble	☑	☑	~106k (main sample)
GZ CANDELS	gz_candels	GZCandels	☑	☑	~50k
GZ DECaLS GZD-5	gz_decals_5	GZDecals5	☑	☑	~230k (GZD-5 only)
GZ Rings	gz_rings	GZRings	☒	☑	~93k
GZ DESI	gz_desi	GZDesi	☑	No* (500GB)	8.7M
GZ UKIDSS	gz_ukidss	-	☑	☑	~70k
GZ Euclid	gz_euclid	-	☒	☑	~100k
GZ H2O (deep HSC)	gz_h2o	GZH2O	☒	☑	~48k
GZ JWST (CEERS)	gz_jwst	GZJWST	☒	☑	~7k
CFHT Tidal*	tidal	Tidal	☑	☑	1760 (expert)

Any datasets marked as downloadable but not marked as published are only downloadable internally (for development purposes).

For each dataset, you must cite/acknowledge the GZ data release paper and the original telescope survey from which the images were derived. See data.galaxyzoo.org for the data release paper citations to use.

We also include small debugging datasets:

Name	Method	PyTorch Dataset	Downloadable	Galaxies
Demo Rings (binary)	demo_rings	DemoRings	☑	1000
Galaxy MNIST (four-class)	galaxy_mnist	GalaxyMNIST	☑	10k

Galaxy MNIST is also available as a pure torchvision dataset (exactly like MNIST).

*GZ Hubble is also available in "euclidised" form (i.e. with the Euclid PSF applied) to Euclid collaboration members. The method is gz_hubble_euclidised. Courtesy of Ben Aussel.

*GZ DESI is not downloadable here (500GB). Mike Smith has shared a replication of the GZ DESI images and labels on HuggingFace (983GB).

*CFHT Tidal is not a Galaxy Zoo dataset, but rather a small expert-labelled dataset of tidal features from Atkinson 2013. MW reproduced and modified the images in Walmsley 2019. We include it here as a challenging fine-grained morphology classification task with little labelled data.

Installation

Installing zoobot will automatically install this package as a dependency.

To install directly:

pip install galaxy-datasets  # includes PyTorch dependencies

For local development (e.g. adding a new dataset), clone the repo from GitHub, then run pip install -e . in the cloned repo root. The -e flag installs the package in editable mode, so changes to the cloned code take effect without reinstalling; without it, the package is copied into site-packages.

I suggest either:

- For basic use without changes, installing zoobot via pip and allowing pip to manage this dependency
- For development, installing both zoobot and galaxy-datasets via git
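
After either route, it can help to confirm which versions ended up in your environment. The sketch below uses only the standard library; the helper name installed_version is hypothetical and is not part of galaxy-datasets or zoobot.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent.

    Hypothetical helper for illustration; not part of galaxy-datasets.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as used with pip in this README.
    for name in ("galaxy-datasets", "zoobot"):
        print(name, installed_version(name) or "not installed")
```

An editable install (pip install -e .) should still report a version here, since pip registers the package metadata in both cases.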

Usage

Check out the PyTorch quickstart Colab, or keep reading for more explanation.

Framework-Independent

To download a dataset:

from galaxy_datasets import gz2  # or gz_hubble, gz_candels, ...

catalog, label_cols = gz2(
    root='your_data_folder/gz2',
    train=True,
    download=True
)

This downloads the images and the train/test catalogs to root. Each catalog is a pandas DataFrame with a file_loc column giving absolute image paths, plus additional columns label_cols = ['col_a', 'col_b', ...] giving the labels (usually the number of volunteers who gave each answer for each galaxy). If train=True, the method returns the train catalog; otherwise, it returns the test catalog.
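
Because the label columns usually hold raw vote counts, a common first step is converting them to vote fractions. The following is a minimal sketch on a toy DataFrame standing in for a real catalog: the paths and counts are made up, the second column name is an assumption about the GZ2 labels, and vote_fractions is a hypothetical helper rather than a function of this package.

```python
import pandas as pd

# Toy stand-in for a downloaded catalog; paths and vote counts are invented.
catalog = pd.DataFrame({
    "file_loc": [
        "/data/gz2/images/a.jpg",
        "/data/gz2/images/b.jpg",
        "/data/gz2/images/c.jpg",
    ],
    "smooth-or-featured-gz2_smooth": [30, 5, 0],
    # Assumed column name, used here only for illustration.
    "smooth-or-featured-gz2_featured-or-disk": [10, 35, 0],
})
label_cols = [
    "smooth-or-featured-gz2_smooth",
    "smooth-or-featured-gz2_featured-or-disk",
]


def vote_fractions(catalog, label_cols):
    """Convert vote counts to per-galaxy fractions (NaN where there are no votes)."""
    counts = catalog[label_cols]
    totals = counts.sum(axis=1)
    # Replace zero totals with NaN so galaxies without votes give NaN, not inf.
    return counts.div(totals.where(totals > 0), axis=0)


fractions = vote_fractions(catalog, label_cols)
print(fractions.round(3))
```

Galaxies with no votes come out as NaN rather than 0, which keeps them easy to drop with dropna() before training.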

If training Zoobot from scratch, this is all you need. For example, in PyTorch:

from zoobot.pytorch.training import train_with_pytorch_lightning

train_with_pytorch_lightning.train_default_zoobot_from_scratch(
    catalog=catalog,
    save_dir=save_dir,
    schema=gz2_schema, # see zoobot/pytorch/examples/minimal_example.py
    ...
)

Otherwise, you might like to use the classes in this package to load these catalogs into ML-friendly inputs.

PyTorch

Create a PyTorch Dataset from a catalog like so:

from galaxy_datasets.pytorch.galaxy_dataset import CatalogDataset  # generic Dataset for galaxies

dataset = CatalogDataset(
    catalog=catalog.sample(1000),  # from gz2(...) above
    label_cols=['smooth-or-featured-gz2_smooth']
)

Notice how you can adjust the catalog before creating the Dataset. This gives flexibility to try training on e.g. different catalog subsets.
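
As one illustration of adjusting the catalog first, the sketch below keeps only galaxies with enough votes and then draws a reproducible subsample. It runs on a toy DataFrame; select_confident and its min_votes threshold are hypothetical choices for this example, not part of the package.

```python
import pandas as pd


def select_confident(catalog, label_cols, min_votes=20, n=None, seed=0):
    """Keep rows with at least min_votes summed over label_cols, optionally subsampled."""
    total_votes = catalog[label_cols].sum(axis=1)
    subset = catalog[total_votes >= min_votes]
    if n is not None:
        # Cap n at the subset size so sample() does not raise.
        subset = subset.sample(n=min(n, len(subset)), random_state=seed)
    return subset.reset_index(drop=True)


# Toy stand-in for a real catalog; paths and vote counts are invented.
toy_catalog = pd.DataFrame({
    "file_loc": [f"/data/gz2/images/{i}.jpg" for i in range(6)],
    "smooth-or-featured-gz2_smooth": [30, 2, 18, 0, 25, 40],
})
toy_label_cols = ["smooth-or-featured-gz2_smooth"]

subset = select_confident(toy_catalog, toy_label_cols, min_votes=20, n=2)
print(subset)
```

The resulting DataFrame can be passed as catalog= to CatalogDataset in the same way as catalog.sample(1000) in the example above.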

If you don't want to change anything about the catalog, you can skip the framework-independent download and use a named class from galaxy_datasets.pytorch, which takes the same arguments and directly gives a Dataset:

from galaxy_datasets.pytorch import GZ2

gz2_dataset = GZ2(
    root='your_data_folder/gz2',
    train=True,
    download=False
)
batch = gz2_dataset[0]
image = batch['image']
label = batch['smooth-or-featured-gz2_smooth']

You might also find the PyTorch Lightning DataModule under galaxy_datasets/pytorch/galaxy_datamodule useful. Zoobot uses this for training and finetuning.

from galaxy_datasets.pytorch.galaxy_datamodule import CatalogDataModule
from galaxy_datasets.transforms import get_galaxy_transform, default_view_config

datamodule = CatalogDataModule(
    label_cols=['smooth-or-featured-gz2_smooth'],
    catalog=catalog,
    # optional args to specify augmentations
    train_transform=get_galaxy_transform(default_view_config()),
    test_transform=get_galaxy_transform(default_view_config())
)

datamodule.prepare_data()
datamodule.setup()
for batch in datamodule.train_dataloader():
    images = batch['image']
    labels = batch['smooth-or-featured-gz2_smooth']
    print(images.shape, labels.shape)
    break

TensorFlow

TensorFlow support has now been deprecated. The ML research community has broadly converged on PyTorch. We suggest using PyTorch or, for framework-independent data loading, our HuggingFace datasets.

Download Notes

Datasets are downloaded with the following layout:

{root}
- images
  - subfolder (except GZ2)
    - image.jpg
- {catalog_name(s)}.parquet

The whole dataset is downloaded regardless of whether train=True or train=False.
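
To sanity-check a download against the layout above, something like the following sketch may be useful. It only inspects the filesystem with the standard library; summarise_download is a hypothetical helper, and the catalog filename in the demo is invented because the real names vary by dataset.

```python
import tempfile
from pathlib import Path


def summarise_download(root):
    """Count .jpg files under root/images and list .parquet catalogs in root."""
    root = Path(root)
    images_dir = root / "images"
    # rglob handles both flat layouts (GZ2) and layouts with subfolders.
    n_images = sum(1 for _ in images_dir.rglob("*.jpg")) if images_dir.is_dir() else 0
    catalogs = sorted(p.name for p in root.glob("*.parquet"))
    return {"n_images": n_images, "catalogs": catalogs}


if __name__ == "__main__":
    # Build a tiny fake download so the sketch runs without any real data.
    with tempfile.TemporaryDirectory() as tmp:
        fake_root = Path(tmp)
        (fake_root / "images" / "subfolder").mkdir(parents=True)
        (fake_root / "images" / "subfolder" / "image.jpg").touch()
        (fake_root / "example_catalog.parquet").touch()  # invented name
        print(summarise_download(fake_root))
```

On a real download, pass the same root you gave to the download method, e.g. summarise_download('your_data_folder/gz2').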

Release history

Version	Released
0.0.25 (this version)	Jul 24, 2025
0.0.24	May 21, 2025
0.0.23	Apr 24, 2025
0.0.22	Apr 7, 2025
0.0.21	May 30, 2024
0.0.19	May 30, 2024
0.0.18	May 14, 2024
0.0.17	Mar 21, 2024
0.0.16	Mar 21, 2024
0.0.15	Nov 9, 2023
0.0.14	Aug 1, 2023
0.0.13	Aug 1, 2023
0.0.12	Mar 29, 2023
0.0.11	Mar 2, 2023
0.0.10	Mar 2, 2023
0.0.8	Mar 2, 2023
0.0.7	Feb 22, 2023
0.0.6	Feb 17, 2023
0.0.5	Feb 11, 2023
0.0.4	Dec 20, 2022
0.0.3	Dec 13, 2022
0.0.2	Nov 15, 2022

Download files

Source Distribution

galaxy_datasets-0.0.25.tar.gz (59.8 kB)

Uploaded Jul 24, 2025 Source

Built Distribution

galaxy_datasets-0.0.25-py3-none-any.whl (73.4 kB)

Uploaded Jul 24, 2025 Python 3

File details

Details for the file galaxy_datasets-0.0.25.tar.gz.

File metadata

Download URL: galaxy_datasets-0.0.25.tar.gz
Upload date: Jul 24, 2025
Size: 59.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for galaxy_datasets-0.0.25.tar.gz
Algorithm	Hash digest
SHA256	`804c56106ce7995375f0f74a250db9d71b67f34c4778dd0a1efbed5d4fb8631f`
MD5	`c9567ddb8e354788ea40727d33b9a2bf`
BLAKE2b-256	`910cbf1e5fe8bfe008e66c4913a4c205c56dd75a2bd2e97b30875eb870d31fba`

File details

Details for the file galaxy_datasets-0.0.25-py3-none-any.whl.

File metadata

Download URL: galaxy_datasets-0.0.25-py3-none-any.whl
Upload date: Jul 24, 2025
Size: 73.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for galaxy_datasets-0.0.25-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d51895c21da57b4d1924701298097b5876e031f4d347725809229116d49151f8`
MD5	`c7dff6624fa970a4f9b92e84bd6fac4d`
BLAKE2b-256	`843f1a5bbfee705ea49ec029585f142880d135271310336aec46c9e3f75d041a`
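
The published SHA256 digests can be compared against a locally downloaded file. Below is a minimal sketch using Python's hashlib; sha256_of and matches_published are hypothetical helper names, and the file is read in chunks so large archives do not need to fit in memory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published("galaxy_datasets-0.0.25.tar.gz", "804c56106ce7995375f0f74a250db9d71b67f34c4778dd0a1efbed5d4fb8631f") should return True for an intact copy of the source distribution listed above.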
