
Smart image downsampling for image classification datasets


smartdownsample

Blazing-fast image downsampling for large datasets

SmartDownsample selects the most diverse images from a large collection, using parallel processing and hash caching to stay fast. It is meant for reducing dataset size while preserving visual variability, and as of v0.2.0 it handles 24,000+ images in minutes instead of hours.

Installation

pip install smartdownsample

Features

  • 10-50x faster than v0.1.x with parallel processing
  • 🔄 Smart caching - repeated runs are near-instant
  • 🎯 Intelligent selection - maintains maximum visual diversity
  • 📊 Scales efficiently - handles 100,000+ images (30-45 min on a first run, 2-3 min with cached hashes)
  • 🔧 Production ready - battle-tested on large camera trap datasets
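
To illustrate why cached reruns can be near-instant, here is a minimal sketch of a hash cache. It is not the library's implementation: the `hashes.json` file name, the `cache_key` and `hashes_with_cache` helpers, and the choice to key entries by path, size, and modification time are all assumptions made for illustration.

```python
import json
import os
from pathlib import Path


def cache_key(path):
    """Key a cached hash by resolved path, file size, and modification time."""
    stat = os.stat(path)
    return f"{Path(path).resolve()}|{stat.st_size}|{stat.st_mtime_ns}"


def hashes_with_cache(image_paths, compute_hash, cache_dir):
    """Return {path: hash}, calling compute_hash only on cache misses."""
    cache_file = Path(cache_dir) / "hashes.json"  # hypothetical file name
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    result = {}
    for path in image_paths:
        key = cache_key(path)
        if key not in cache:
            cache[key] = compute_hash(path)  # the expensive step, skipped on reruns
        result[str(path)] = cache[key]

    # Persist the (possibly extended) cache for the next run.
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(cache))
    return result
```

Because the key includes size and modification time, an edited image is rehashed while untouched files are served from the cache.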

Usage

from smartdownsample import select_distinct

# Example list of image paths (shortened here; a real list would hold more images than target_count)
my_image_list = [
    "path/to/img1.jpg",
    "path/to/img2.jpg",
    "path/to/img3.jpg",
    "path/to/img4.jpg"
]

# Simple usage - runs in parallel by default (n_workers defaults to CPU count - 1)
selected = select_distinct(
    image_paths=my_image_list,
    target_count=100
)

# For large datasets (10k+ images) - enable caching for fastest performance
selected = select_distinct(
    image_paths=my_image_list,
    target_count=1000,
    n_workers=8,  # Use 8 parallel workers
    cache_dir="./cache"  # Cache hashes for instant reruns
)

# With visual verification to see excluded vs included images
selected = select_distinct(
    image_paths=my_image_list,
    target_count=100,
    show_verification=True
)

print(f"Selected {len(selected)} images")
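
In practice the path list usually comes from a directory tree rather than being typed by hand. The helper below is a small sketch of one way to build it; `collect_image_paths` and the `IMAGE_EXTENSIONS` set are hypothetical names and not part of smartdownsample.

```python
from pathlib import Path

# Extensions treated as images here; this set is an assumption, not the library's list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def collect_image_paths(root):
    """Recursively gather image files under root, in a stable sorted order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

The returned `Path` objects can be passed directly as `image_paths`, which accepts both `str` and `Path` values.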

Performance

| Dataset size | v0.1.x | v0.2.0 (first run) | v0.2.0 (cached) |
|---|---|---|---|
| 1,000 images | 2 min | 10 sec | 1 sec |
| 10,000 images | 30 min | 1 min | 5 sec |
| 24,000 images | 2-4 hours | 5-10 min | <1 min |
| 100,000 images | 12+ hours | 30-45 min | 2-3 min |

Parameters

| Parameter | Default | Description |
|---|---|---|
| image_paths | Required | List of image file paths (str or Path objects) |
| target_count | Required | Exact number of images to select |
| window_size | 100 | Rolling window size for diversity comparison |
| random_seed | 42 | Random seed for reproducible results |
| show_progress | True | Whether to display progress bars |
| show_verification | False | Show visual verification comparing excluded vs included images |
| n_workers | CPU count - 1 | Number of parallel workers for processing |
| cache_dir | None | Directory to cache computed hashes (dramatically speeds up reruns) |
| hash_size | 8 | Perceptual hash size (8 is 2x faster than 16 with minimal quality loss) |
| batch_size | 100 | Images to process per batch |
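
To make `hash_size` more concrete, the sketch below computes a simple average hash in pure Python. This README does not say which perceptual hash smartdownsample uses, so the average hash is a stand-in, and `average_hash` and `hamming_distance` are hypothetical helpers. The point it shows is that a hash size of 8 yields an 8 x 8 = 64-bit hash, while 16 yields 256 bits.

```python
def average_hash(pixels, hash_size=8):
    """Return a (hash_size * hash_size)-bit integer hash of a grayscale image.

    pixels is a list of rows of brightness values (0-255); the image must be
    at least hash_size pixels on each side.
    """
    height, width = len(pixels), len(pixels[0])
    # Shrink to hash_size x hash_size by averaging each block of pixels.
    cells = []
    for r in range(hash_size):
        r0, r1 = r * height // hash_size, (r + 1) * height // hash_size
        for c in range(hash_size):
            c0, c1 = c * width // hash_size, (c + 1) * width // hash_size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (value > mean)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes; small means visually similar."""
    return bin(a ^ b).count("1")


# A 16x16 test image: dark left half, bright right half.
image = [[0] * 8 + [255] * 8 for _ in range(16)]
inverted = [[255 - v for v in row] for row in image]

same = hamming_distance(average_hash(image), average_hash(image))
opposite = hamming_distance(average_hash(image), average_hash(inverted))
```

Under these assumptions, identical images are at distance 0 and visually opposite images differ in every bit, which is the kind of signal a diversity comparison can rank on.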

Step by Step

  1. Sort paths by directory. Within each folder, files are naturally ordered (e.g., img1.jpg, img2.jpg, img10.jpg) so related images remain grouped.
  2. Compute perceptual hashes for all valid image paths.
  3. Apply rolling window selection on the hash array to choose the indices of the most diverse images. Each candidate is compared only to a sliding window of recent selections, so for a fixed window_size the work grows linearly with the number of images, O(n), and scales to large classes of 100k+ images.
  4. Return results as [valid_paths[i] for i in selected_indices].
  5. Optional verification plot: If show_verification=True, the algorithm displays a visual check of 18 randomly selected excluded images alongside their included counterparts. The visualization opens automatically in your default image viewer; no files are saved to disk.
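
Steps 1, 3, and 4 above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: `natural_key` and `select_diverse` are hypothetical names, hashes are assumed to be plain integers compared by Hamming distance, and ranking candidates by novelty to reach an exact `target_count` is one plausible strategy rather than the documented one.

```python
import re
from collections import deque
from pathlib import Path


def natural_key(path):
    """Sort key: directory first, then file name with digit runs compared as numbers."""
    p = Path(path)
    # re.split with a capturing group alternates text, digits, text, ...
    parts = re.split(r"(\d+)", p.name)
    return (str(p.parent), [int(s) if i % 2 else s.lower() for i, s in enumerate(parts)])


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def select_diverse(hashes, target_count, window_size=100):
    """Return sorted indices of the target_count most novel hashes."""
    window = deque(maxlen=window_size)  # hashes of recent provisional picks
    scores = []
    for h in hashes:
        # Novelty = distance to the closest recent pick; the first image is maximally novel.
        score = min((hamming(h, w) for w in window), default=float("inf"))
        scores.append(score)
        if score > 0:  # exact duplicates of a recent pick never enter the window
            window.append(h)
    # Highest novelty first; ties go to the earlier image.
    ranked = sorted(range(len(hashes)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:target_count])


# Step 1: natural ordering keeps img2 before img10 within each folder.
valid_paths = sorted(
    ["b/img10.jpg", "b/img2.jpg", "a/img1.jpg", "b/img1.jpg"], key=natural_key
)

# Steps 3 and 4 on toy 4-bit hashes, one per path.
toy_hashes = [0b0000, 0b0000, 0b1111, 0b0001]
selected_indices = select_diverse(toy_hashes, target_count=2)
selected = [valid_paths[i] for i in selected_indices]
```

Each hash is compared with at most `window_size` others, so the scoring pass does work proportional to the number of images times the window size, which is linear in the number of images for a fixed window.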

Downsample example

License

MIT License – see LICENSE file.

