Galaxy morphology classifiers

Zoobot

ascl:2203.027

Zoobot classifies galaxy morphology with deep learning.

Zoobot is trained using millions of answers by Galaxy Zoo volunteers. This code will let you retrain Zoobot to accurately solve your own prediction task.

Installation

You can retrain Zoobot in the cloud with a free GPU using this Google Colab notebook. To install locally, keep reading.

Download the code using git:

git clone git@github.com:mwalmsley/zoobot.git

And then pick one of the three commands below to install Zoobot and PyTorch:

# Zoobot with PyTorch and a GPU. Requires CUDA 12.1 (or CUDA 11.8, if you use `_cu118` instead)
pip install -e "zoobot[pytorch-cu121]" --extra-index-url https://download.pytorch.org/whl/cu121

# OR Zoobot with PyTorch and no GPU
pip install -e "zoobot[pytorch-cpu]" --extra-index-url https://download.pytorch.org/whl/cpu

# OR Zoobot with PyTorch on Mac with M1 chip
pip install -e "zoobot[pytorch-m1]"

This installs the downloaded Zoobot code in pip editable mode, so you can easily change the code locally. Zoobot is also available directly from pip (pip install zoobot[option]); only use this if you are sure you won't be making changes to Zoobot itself. For Google Colab, use pip install zoobot[pytorch_colab]

To use a GPU, you must already have CUDA installed, matching the versions above. I share my install steps here. GPUs are optional: Zoobot will retrain fine on CPU, just slower.

Quickstart

The Colab notebook is the quickest way to get started. Alternatively, the minimal example below illustrates how Zoobot works.

Let's say you want to find ringed galaxies and you have a small labelled dataset of 500 ringed or not-ringed galaxies. You can retrain Zoobot to find rings like so:

```python

import pandas as pd
from galaxy_datasets.pytorch.galaxy_datamodule import GalaxyDataModule
from zoobot.pytorch.training import finetune

# csv with 'ring' column (0 or 1) and 'file_loc' column (path to image)
labelled_df = pd.read_csv('/your/path/some_labelled_galaxies.csv')

datamodule = GalaxyDataModule(
  label_cols=['ring'],
  catalog=labelled_df,
  batch_size=32
)

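# load trained Zoobot model
# checkpoint_loc is a placeholder: point it at the pretrained checkpoint you downloaded
checkpoint_loc = '/your/path/pretrained_checkpoint.ckpt'
model = finetune.FinetuneableZoobotClassifier(checkpoint_loc, num_classes=2)

# retrain to find rings
# save_dir is a placeholder: the directory passed to the trainer for its outputs
save_dir = '/your/path/finetune_results'
trainer = finetune.get_trainer(save_dir)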
trainer.fit(model, datamodule)
```

Then you can predict whether new galaxies have rings:

```python
import pandas as pd
from zoobot.pytorch.predictions import predict_on_catalog

# csv with 'file_loc' column (path to image). Zoobot will predict the labels.
unlabelled_df = pd.read_csv('/your/path/some_unlabelled_galaxies.csv')

predict_on_catalog.predict(
  unlabelled_df,
  model,
  label_cols=['ring'],  # only used for 
  save_loc='/your/path/finetuned_predictions.csv'
)
```
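The catalogs in the snippets above are plain CSV files. As a minimal sketch of how a labelled catalog with the 'ring' and 'file_loc' columns might be assembled with pandas (the folder layout, file names, labels and the `build_catalog` helper are all hypothetical, made up for illustration and not part of Zoobot):

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_catalog(image_dir, labels):
    """Return a DataFrame with 'file_loc' and 'ring' columns.

    labels maps image file name -> 0 (no ring) or 1 (ring).
    Only images that exist in image_dir AND have a label are kept.
    """
    rows = []
    for path in sorted(Path(image_dir).glob('*.jpg')):
        if path.name in labels:
            rows.append({'file_loc': str(path), 'ring': int(labels[path.name])})
    return pd.DataFrame(rows, columns=['file_loc', 'ring'])


# Hypothetical example: three empty placeholder files, two of them labelled
with tempfile.TemporaryDirectory() as tmp:
    for name in ['a.jpg', 'b.jpg', 'c.jpg']:
        (Path(tmp) / name).touch()
    catalog = build_catalog(tmp, {'a.jpg': 1, 'b.jpg': 0})
    catalog.to_csv(Path(tmp) / 'some_labelled_galaxies.csv', index=False)
```

Unlabelled images (here `c.jpg`) are left out of the labelled catalog; they could go into a second CSV with only a 'file_loc' column for prediction.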

Zoobot includes many guides and working examples - see the Getting Started section below.

Getting Started

I suggest starting with the Colab notebook or the worked examples below, which you can copy and adapt.

For context and explanation, see the documentation.

Pretrained models are listed here and available on HuggingFace.

Worked Examples

PyTorch (recommended):

There is more explanation and an API reference in the docs.

I also include the scripts used to create and benchmark our pretrained models. Many pretrained models are available already, but if you need one trained on e.g. different input image sizes or with a specific architecture, I can probably make it for you.

When trained with a decision tree head (ZoobotTree, FinetuneableZoobotTree), Zoobot can learn from volunteer labels of varying confidence and predict posteriors for what the typical volunteer might say. Specifically, this Zoobot mode predicts the parameters for distributions, not simple class labels! For a demonstration of how to interpret these predictions, see the gz_decals_data_release_analysis_demo.ipynb.
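As a rough illustration of what "parameters for distributions" can mean in practice, the sketch below assumes the predicted parameters are Dirichlet concentrations, one per answer to a question, and converts them into expected vote fractions using the Dirichlet mean. The array values, the answer names and the `expected_vote_fractions` helper are hypothetical; the demo notebook remains the reference for interpreting real predictions.

```python
import numpy as np


def expected_vote_fractions(concentrations):
    """Mean of a Dirichlet distribution: alpha_i / sum(alpha).

    concentrations: array of shape (n_galaxies, n_answers), all > 0.
    """
    concentrations = np.asarray(concentrations, dtype=float)
    return concentrations / concentrations.sum(axis=1, keepdims=True)


# Hypothetical predictions for one question with answers [smooth, featured, artifact]
concentrations = np.array([
    [8.0, 1.0, 1.0],  # most of the weight on the first answer
    [2.0, 2.0, 1.0],  # weight spread across answers
])
fractions = expected_vote_fractions(concentrations)
print(fractions)
```

Larger concentrations with the same ratios give the same expected fractions but a narrower distribution, which is how such a model can express confidence as well as a best guess.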

(Optional) Install PyTorch with CUDA

If you're not using a GPU, skip this step and use the pytorch-cpu option in the Installation section above.

Install PyTorch 2.1.0 or TensorFlow 2.10.0 and compatible CUDA drivers. I highly recommend using conda to do this. Conda will handle both creating a new virtual environment (conda create) and installing CUDA (cudatoolkit, cudnn).

CUDA 12.1 for PyTorch 2.1.0:

conda create --name zoobot39_torch python==3.9
conda activate zoobot39_torch
conda install -c conda-forge cudatoolkit=12.1
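After setting up the environment, it can help to confirm which device PyTorch will actually use. A minimal sketch, assuming only the standard `torch.cuda.is_available()` and `torch.backends.mps.is_available()` checks (the `describe_device` helper is hypothetical and not part of Zoobot):

```python
import importlib.util


def describe_device():
    """Return 'cuda', 'mps', 'cpu', or 'torch not installed'."""
    if importlib.util.find_spec('torch') is None:
        return 'torch not installed'
    import torch
    if torch.cuda.is_available():
        return 'cuda'
    # the MPS backend (Apple silicon) is absent in older PyTorch versions
    mps = getattr(torch.backends, 'mps', None)
    if mps is not None and mps.is_available():
        return 'mps'
    return 'cpu'


print(describe_device())
```

If this prints 'cpu' on a machine with a GPU, the installed CUDA version most likely does not match the PyTorch build.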

Recent release features (v2.0.0)

  • New pretrained architectures: ConvNeXT, EfficientNetV2, MaxViT, and more. Each in several sizes.
  • Reworked finetuning procedure. All these architectures are finetuneable through a common method.
  • Reworked finetuning options. Batch norm finetuning removed. Cosine schedule option added.
  • Reworked finetuning saving/loading. Auto-downloads encoder from HuggingFace.
  • Now supports regression finetuning (as well as multi-class and binary). See pytorch/examples/finetuning
  • Updated timm to 0.9.10, allowing latest model architectures. Previously downloaded checkpoints may not load correctly!
  • (internal until published) GZ Evo v2 now includes Cosmic Dawn (HSC H2O). Significant performance improvement on HSC finetuning. Also now includes GZ UKIDSS (dragged from our archives).
  • Updated pytorch to 2.1.0
  • Added support for webdatasets (only recommended for large-scale distributed training)
  • Improved per-question logging when training from scratch
  • Added option to compile encoder for max speed (not recommended for finetuning, only for pretraining).
  • Deprecates TensorFlow. The CS research community focuses on PyTorch and new frameworks like JAX.
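Because previously downloaded checkpoints may not load correctly with a different timm version, a quick check of the installed versions against those named in the release notes above can save some debugging. This is a sketch using only the standard library; the `installed_versions` helper is hypothetical and the prefix comparison is a deliberate simplification.

```python
from importlib import metadata

# Versions named in the v2.0.0 release notes above
EXPECTED = {'torch': '2.1.0', 'timm': '0.9.10'}


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in installed_versions(EXPECTED).items():
    # torch builds often carry a suffix such as '+cu121', so compare prefixes
    status = 'ok' if version and version.startswith(EXPECTED[name]) else 'check'
    print(f'{name}: installed={version}, release notes={EXPECTED[name]} [{status}]')
```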

Contributions are very welcome and will be credited in any future work. Please get in touch! See CONTRIBUTING.md for more.

Benchmarks and Replication - Training from Scratch

The benchmarks folder contains slurm and Python scripts to train Zoobot from scratch. We use these scripts to make sure new code versions work well, and that TensorFlow and PyTorch achieve similar performance.

Training Zoobot using the GZ DECaLS dataset option will create models very similar to those used for the GZ DECaLS catalogue and shared with the early versions of this repo. The GZ DESI Zoobot model is trained on additional data (GZD-1, GZD-2), as is the GZ Evo Zoobot model (GZD-1/2/5, Hubble, Candels, GZ2).

Pretraining is becoming increasingly complex and is now partially refactored out to a separate repository. We are gradually migrating this zoobot repository to focus on finetuning.

Citing

If you use this software, or otherwise wish to cite Zoobot as a software package, please use the JOSS paper:

@article{Walmsley2023,
  doi = {10.21105/joss.05312},
  url = {https://doi.org/10.21105/joss.05312},
  year = {2023},
  publisher = {The Open Journal},
  volume = {8},
  number = {85},
  pages = {5312},
  author = {Mike Walmsley and Campbell Allen and Ben Aussel and Micah Bowles and Kasia Gregorowicz and Inigo Val Slijepcevic and Chris J. Lintott and Anna M. M. Scaife and Maja Jabłońska and Kosio Karchev and Denise Lanzieri and Devina Mohan and David O’Ryan and Bharath Saiguhan and Crisel Suárez and Nicolás Guerra-Varas and Renuka Velu},
  title = {Zoobot: Adaptable Deep Learning Models for Galaxy Morphology},
  journal = {Journal of Open Source Software}
}

You might be interested in reading papers using Zoobot:

Many other works use Zoobot indirectly via the Galaxy Zoo DECaLS catalog (and now via the new Galaxy Zoo DESI catalog).
