Hyperparameter search for image classifiers using Ray Tune + SkyPilot

krunic

Automated hyperparameter search for image classifiers - from dataset to tuned model with one command. Distributed across GPUs and across hosts, locally and on the cloud (AWS).

krunic uses off-the-shelf models and packages, so you won't get SOTA performance. That said, it can get surprisingly close with almost zero effort. It is useful as a baseline, or for experimenting with architectures, GPUs, etc.

Built on Ray Tune, Optuna, timm, and SkyPilot.

Install (Mac and Linux)

$ pipx install krunic

This installs three commands: tunic (local training), krunic (cloud launcher), and tunic-plotter (results visualizer). Installation takes a couple of minutes.

Quick start

Local:

$ tunic --data /path/to/dataset --model resnet50 --n_trials 30 --epochs 30 --output results.json

Cloud (AWS):

This requires an AWS account. The image data must be copied to S3 before the run, for example:

$ aws s3 sync ~/image_data/tin s3://image.data/tin
$ krunic \
  --cluster skya \
  --s3-path tin \
  --model resnet50 \
  --accelerator T4:4 \
  --num-nodes 4 \
  --n-trials 48 \
  --n-epochs 50 \
  --prefix kaws

SkyPilot creates the cluster, and Ray distributes the load across the GPUs. In my experiments, it achieves near-perfect utilization.

Upon completion, get the best model hyperparameters:

$ aws s3 cp s3://image.data/ray-results/kaws/kaws_results.json .

Plot metric per trial:

$ tunic-plotter kaws_results.json


Remember to take down the cluster after downloading the results.

$ yes | sky down skya

Train final model from tuning results:

$ tunic --final kaws_results.json --data /path/to/dataset --epochs 50 --amp

Results on common benchmarks

Dataset       Model      Metric    Validation  Test  SOTA
PCam          ResNet18   AUROC     0.96        0.96  0.96
TinyImageNet  ViT-Small  Accuracy  0.87        -     0.91
ChestMNIST    ResNet18   AUROC     0.75        0.75  0.77
TissueMNIST   ResNet18   AUROC     0.92        0.94  0.93

All runs use generic off-the-shelf models with no domain-specific modifications.

Search space

Parameter              Range
Optimizer              AdamW, SGD
Learning rate          1e-5 – 1e-1 (log)
Weight decay           1e-6 – 1e-1 (log)
Label smoothing        0 – 0.3
Dropout rate           0 – 0.5
RandAugment magnitude  1 – 15
RandAugment num ops    1 – 4
Mixup alpha            0 – 0.5
CutMix alpha           0 – 1.0

Override any part with a YAML file via --search-space.
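To make the ranges concrete, here is a minimal sketch that draws one configuration from the table above using plain random sampling. It is not how tunic samples (that is done by Optuna), and it uses only the standard library. The first five keys match the names in the results JSON; the RandAugment, Mixup and CutMix key names are hypothetical, as is the assumption that the two RandAugment parameters are integers.

```python
import math
import random


def log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Draw one configuration from the ranges listed in the search space table."""
    return {
        "optimizer": rng.choice(["AdamW", "SGD"]),
        "lr": log_uniform(rng, 1e-5, 1e-1),
        "weight_decay": log_uniform(rng, 1e-6, 1e-1),
        "label_smoothing": rng.uniform(0.0, 0.3),
        "drop_rate": rng.uniform(0.0, 0.5),
        # The four key names below are hypothetical, chosen for illustration.
        "randaug_magnitude": rng.randint(1, 15),  # assumed to be an integer
        "randaug_num_ops": rng.randint(1, 4),     # assumed to be an integer
        "mixup_alpha": rng.uniform(0.0, 0.5),
        "cutmix_alpha": rng.uniform(0.0, 1.0),
    }


config = sample_config(random.Random(0))
for name, value in config.items():
    print(f"{name}: {value}")
```

The log-scale parameters are sampled uniformly in log space, so each decade between the bounds is equally likely.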

tunic - local hyperparameter search

tunic --data PATH --model MODEL [options]
Flag                 Default         Description
--data               required        Dataset root (ImageFolder or WebDataset)
--model              required        Any timm model name
--n_trials           80              Number of Optuna trials
--epochs             30              Training epochs per trial (also used for --final)
--tune-metric        val_auroc       Metric for trial selection and pruning
--training_fraction  1.0             Fraction of training data (val always uses 1.0)
--batch-size         32              Batch size per trial
--amp                -               Enable automatic mixed precision
--resume             -               Warm-start from a previous experiment directory
--final              -               Skip tuning; train final model from results JSON
--combine            -               Train final model on train+val combined
--final-model        tunic_final.pt  Output path for final model weights
--final-stats        -               Output path for final model stats (JSON)
--device             auto            auto, cuda, mps, or cpu
--smoke-test         -               Quick end-to-end test with synthetic data

krunic - cloud launcher

krunic generates a SkyPilot YAML and launches the job. The dataset is S3-mounted (or copied); results are uploaded to S3 when the job completes.

Prerequisites

1. AWS credentials

aws configure

Prompts for your Access Key ID, Secret Access Key, and region (e.g. us-east-1). Your IAM user needs EC2 and S3 permissions. SkyPilot uses these credentials directly — no separate SkyPilot account or configuration needed.

2. Verify SkyPilot sees AWS

sky check

Should show AWS: enabled.

3. Dataset in S3

aws s3 sync ~/image_data/my-dataset s3://my-bucket/my-dataset

Monitor and tear down

krunic launches the cluster and streams logs. Once the job completes, download results and tear down:

sky status                          # check cluster state
sky logs my-cluster 1               # stream logs (job ID increments with each run)
aws s3 cp s3://my-bucket/ray-results/prefix/prefix_results.json .
yes | sky down my-cluster           # terminate cluster

--workdir defaults to the installed package directory (contains tunic.py and requirements.txt). Override it only if you are developing from a local source checkout and want to test unpublished changes.

krunic --cluster NAME --s3-path PATH --model MODEL [options]
Flag                 Default      Description
--cluster            required     SkyPilot cluster name
--workdir            package dir  Local directory synced to the cluster. Used for development
--s3-path            required     Dataset path within the S3 bucket
--model              required     Any timm model name
--accelerator        T4:4         GPU spec (e.g. T4:4, A10G:1, A100:8)
--num-nodes          1            Number of cluster nodes
--n-trials           30           Number of Optuna trials
--n-epochs           30           Training epochs per trial
--batch-size         32           Batch size per trial
--training-fraction  1.0          Fraction of training data per trial
--tune-metric        val_auroc    Metric for trial selection and pruning
--bucket             image.data   S3 bucket name
--prefix             tunic        Prefix for output files and S3 paths
--spot               -            Use spot instances (with retry-until-up)
--copy               -            Copy data from S3 to local disk instead of mounting
--idle-minutes       60           Auto-stop cluster after N idle minutes
--no-autostop        -            Disable auto-stop

Results are uploaded to s3://<bucket>/ray-results/<prefix>/<prefix>_results.json.
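If you script the download step, the path pattern above can be built from the --bucket and --prefix values. The helper below is a small illustrative sketch; results_uri is a hypothetical name and not part of krunic.

```python
def results_uri(bucket: str, prefix: str) -> str:
    """Build the S3 URI of the results JSON from the documented pattern."""
    return f"s3://{bucket}/ray-results/{prefix}/{prefix}_results.json"


# Defaults from the flag table: --bucket image.data, --prefix tunic
print(results_uri("image.data", "tunic"))
# prints: s3://image.data/ray-results/tunic/tunic_results.json
```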

tunic-plotter - visualize results

tunic-plotter results.json                  # plots val_auroc and val_acc
tunic-plotter results.json --metric val_acc # single metric
tunic-plotter results.json --trial_sort     # keep original trial order, show running best

Saves PNG files alongside the results JSON.
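The "running best" shown with --trial_sort is the best metric value seen up to each trial. The following sketch shows one way to compute such a curve; it is not tunic-plotter's code, and the metric values are made up for illustration.

```python
from itertools import accumulate


def running_best(values, higher_is_better=True):
    """Return the best value seen so far at each position, in trial order."""
    pick = max if higher_is_better else min
    return list(accumulate(values, pick))


# Hypothetical per-trial val_auroc values, in the order the trials ran.
val_auroc = [0.81, 0.79, 0.88, 0.85, 0.91, 0.90]
print(running_best(val_auroc))
# prints: [0.81, 0.81, 0.88, 0.88, 0.91, 0.91]
```

The curve is flat wherever a trial fails to beat the current best, which makes it easy to see when the search stops improving.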

Dataset format

tunic auto-detects the dataset format:

  • ImageFolder - standard split/class/image.ext layout
  • WebDataset - sharded TAR files; detected when wds/dataset_info.json exists
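The detection rule can be sketched in a few lines. This is an illustration, not tunic's actual code: detect_format is a hypothetical helper, and it assumes that wds/dataset_info.json is looked up relative to the dataset root and that anything else is treated as ImageFolder.

```python
import tempfile
from pathlib import Path


def detect_format(root):
    """Guess the dataset format from the directory layout under `root`."""
    root = Path(root)
    # Assumption: the marker file is resolved relative to the dataset root.
    if (root / "wds" / "dataset_info.json").is_file():
        return "webdataset"
    return "imagefolder"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train" / "cat").mkdir(parents=True)  # split/class layout
    print(detect_format(root))  # imagefolder
    (root / "wds").mkdir()
    (root / "wds" / "dataset_info.json").write_text("{}")
    print(detect_format(root))  # webdataset
```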

Scaling

The number of concurrent trials equals the total number of GPUs. For example, --num-nodes 4 --accelerator T4:4 gives 4 × 4 = 16 concurrent trials.

Optuna's TPE needs ~20 trials before it outperforms random search. 32–64 trials is a practical range for most problems.
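The arithmetic above can be used to size a run before launching it. The sketch below computes the total GPU count from an accelerator spec and a rough number of "waves" needed to get through all trials. Both helper names are hypothetical, a bare GPU name is assumed to mean one GPU per node, and the wave count is only a rough guide, since trials do not finish in lockstep and pruned trials end early.

```python
import math


def total_gpus(accelerator: str, num_nodes: int) -> int:
    """Total GPUs for a spec like 'T4:4' (GPU type, count per node)."""
    _name, _, count = accelerator.partition(":")
    per_node = int(count) if count else 1  # assumption: bare name means one GPU
    return per_node * num_nodes


def trial_waves(n_trials: int, concurrent: int) -> int:
    """Rough number of rounds needed if every round fills all GPUs."""
    return math.ceil(n_trials / concurrent)


gpus = total_gpus("T4:4", 4)
print(gpus)                   # 16
print(trial_waves(48, gpus))  # 3
```

With 16 concurrent trials, the first wave is already close to the ~20 trials mentioned above, so only the later waves can benefit much from what TPE has learned.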

Output format

{
  "model": "resnet18",
  "best_val_auroc": 0.963,
  "best_val_acc": 0.891,
  "best_params": {
    "optimizer": "AdamW",
    "lr": 0.0028,
    "weight_decay": 3.6e-06,
    "label_smoothing": 0.058,
    "drop_rate": 0.183
  },
  "n_trials": 48,
  "completed_trials": 48,
  "epochs": 50,
  "all_trials": [...]
}
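A results file in this shape can be read with the standard json module. The sketch below summarizes the headline fields; summarize is a hypothetical helper, and the sample is a trimmed copy of the example above with "all_trials" left out, because its per-trial structure is not documented here.

```python
import json

# Trimmed copy of the example output; "all_trials" is intentionally omitted.
sample = """
{
  "model": "resnet18",
  "best_val_auroc": 0.963,
  "best_params": {"optimizer": "AdamW", "lr": 0.0028, "weight_decay": 3.6e-06},
  "n_trials": 48,
  "completed_trials": 48
}
"""


def summarize(results):
    """Format the headline fields of a results dict, one item per line."""
    lines = [
        f"model: {results['model']}",
        f"best val_auroc: {results['best_val_auroc']}",
        f"trials: {results['completed_trials']}/{results['n_trials']} completed",
    ]
    lines += [f"  {key} = {value}" for key, value in results["best_params"].items()]
    return "\n".join(lines)


print(summarize(json.loads(sample)))
```

For a downloaded file, load it with json.load on the opened file instead of json.loads on the sample string.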
