Smart image downsampling for image classification datasets

smartdownsample

Blazing-fast image downsampling for large datasets

SmartDownsample selects the most diverse images from large collections using parallel processing and hash caching. It reduces dataset size while preserving visual variability, and v0.2 processes 24,000+ images in minutes instead of hours.

Installation

pip install smartdownsample

Features

  • 10-50x faster than v0.1.x with parallel processing
  • 🔄 Smart caching - repeated runs reuse cached hashes and finish in seconds to a few minutes
  • 🎯 Intelligent selection - maintains maximum visual diversity
  • 📊 Scales efficiently - processes 100,000 images in 30-45 minutes on a first run
  • 🔧 Production ready - battle-tested on large camera trap datasets

Usage

from smartdownsample import select_distinct

# Example list of image paths (a real list should hold more images than target_count)
my_image_list = [
    "path/to/img1.jpg",
    "path/to/img2.jpg",
    "path/to/img3.jpg",
    "path/to/img4.jpg"
]

# Simple usage - by default uses all CPU cores but one (n_workers = CPU count - 1)
selected = select_distinct(
    image_paths=my_image_list,
    target_count=100
)

# For large datasets (10k+ images) - enable caching for fastest performance
selected = select_distinct(
    image_paths=my_image_list,
    target_count=1000,
    n_workers=8,  # Use 8 parallel workers
    cache_dir="./cache"  # Cache hashes for instant reruns
)

# With visual verification to see excluded vs included images
selected = select_distinct(
    image_paths=my_image_list,
    target_count=100,
    show_verification=True
)

print(f"Selected {len(selected)} images")
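
In practice the path list usually comes from a directory scan rather than a hand-written list. The following is a minimal sketch of one way to build it with the standard library. The helper name collect_image_paths and the extension set are assumptions made for illustration and are not part of smartdownsample.

```python
from pathlib import Path
from typing import List

# Assumed set of extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def collect_image_paths(root: str) -> List[str]:
    """Recursively collect image files under root, sorted for reproducibility."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


# Hypothetical usage with the documented API:
# from smartdownsample import select_distinct
# selected = select_distinct(
#     image_paths=collect_image_paths("path/to/dataset"),
#     target_count=100,
# )
```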

Performance

| Dataset size | v0.1.x | v0.2.0 (first run) | v0.2.0 (cached) |
|---|---|---|---|
| 1,000 images | 2 min | 10 sec | 1 sec |
| 10,000 images | 30 min | 1 min | 5 sec |
| 24,000 images | 2-4 hours | 5-10 min | <1 min |
| 100,000 images | 12+ hours | 30-45 min | 2-3 min |

Parameters

| Parameter | Default | Description |
|---|---|---|
| image_paths | Required | List of image file paths (str or Path objects) |
| target_count | Required | Exact number of images to select |
| window_size | 100 | Rolling window size for diversity comparison |
| random_seed | 42 | Random seed for reproducible results |
| show_progress | True | Whether to display progress bars |
| show_verification | False | Show visual verification comparing excluded vs included images |
| n_workers | CPU count - 1 | Number of parallel workers for processing |
| cache_dir | None | Directory to cache computed hashes (dramatically speeds up reruns) |
| hash_size | 8 | Perceptual hash size (8 is 2x faster than 16 with minimal quality loss) |
| batch_size | 100 | Images to process per batch |
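
To illustrate why cache_dir speeds up reruns, here is a rough sketch of a hash cache. It is not the package's implementation: the file name hashes.json, the path-plus-modification-time key, and the compute_hash callback are all assumptions chosen for the example.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def cached_hashes(
    image_paths: Iterable[str],
    cache_dir: str,
    compute_hash: Callable[[str], str],
) -> Dict[str, str]:
    """Return {path: hash}, reusing entries stored in cache_dir when possible."""
    cache_file = Path(cache_dir) / "hashes.json"  # hypothetical file name
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    result = {}
    for path in map(str, image_paths):
        # Key on path plus modification time so edited files are re-hashed.
        key = f"{path}|{Path(path).stat().st_mtime_ns}"
        if key not in cache:
            cache[key] = compute_hash(path)  # the expensive step, done once
        result[path] = cache[key]

    cache_file.write_text(json.dumps(cache))
    return result
```

On a second call with unchanged files, compute_hash is never invoked, so the run costs only a JSON read and one stat call per file.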

Step by Step

  1. Sort paths by directory. Within each folder, files are sorted in natural order (e.g., img1.jpg, img2.jpg, img10.jpg) so related images remain grouped.
  2. Compute perceptual hashes for all valid image paths.
  3. Apply rolling window selection on the hash array to choose indices of the most diverse images. This runs in O(n) time, scales to large classes of 100k+ images, and compares each candidate only to a sliding window of recent selections.
  4. Return results as [valid_paths[i] for i in selected_indices].
  5. Optional verification plot: If show_verification=True, the function displays a visual check of 18 randomly selected excluded images alongside their included counterparts. The visualization opens automatically in your default image viewer without saving files to disk.
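
Steps 1, 3, and 4 above can be sketched in plain Python. This is an illustrative approximation, not the package's code: hashes are modeled as integers compared by Hamming distance, and the names natural_key, rolling_window_select, and min_distance are hypothetical. The final ranking uses a sort, so this sketch runs in O(n log n) rather than the O(n) stated above.

```python
import re
from collections import deque
from pathlib import PurePath
from typing import List, Sequence


def natural_key(path: str):
    """Sort by directory, then by file name with digit runs compared as numbers."""
    p = PurePath(path)
    parts = re.split(r"(\d+)", p.name)  # odd positions hold the digit runs
    return (str(p.parent), [int(s) if i % 2 else s.lower() for i, s in enumerate(parts)])


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def rolling_window_select(
    hashes: Sequence[int],
    target_count: int,
    window_size: int = 100,
    min_distance: int = 1,  # hypothetical threshold, not a documented parameter
) -> List[int]:
    """Score each image against recently kept hashes, then keep the most novel."""
    window = deque(maxlen=window_size)  # only the most recent kept hashes
    scored = []
    for i, h in enumerate(hashes):
        novelty = min((hamming(h, w) for w in window), default=float("inf"))
        scored.append((novelty, i))
        if novelty >= min_distance:
            window.append(h)
    # Highest novelty first; ties resolved by original order.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:target_count]
    return sorted(i for _, i in top)


paths = sorted(["b/img10.jpg", "b/img2.jpg", "a/img1.jpg", "b/img1.jpg"], key=natural_key)
hashes = [0b0000, 0b0000, 0b1111, 0b0001]  # stand-in hashes, one per path
selected_indices = rolling_window_select(hashes, target_count=2)
result = [paths[i] for i in selected_indices]  # step 4
```

Because each candidate is compared against at most window_size hashes, the scoring pass does a bounded amount of work per image regardless of collection size.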

License

MIT License – see LICENSE file.
