Smart image downsampling for image classification datasets

smartdownsample

Fast, simple image downsampling that just works

SmartDownsample samples diverse images from large collections in seconds, not hours. It exposes a single function whose runtime is about the same whether you sample 100 or 23,000 images from 24,000.

Installation

pip install smartdownsample

Features

  • Always fast - Seconds for any selection ratio
  • 🎯 Smart bucketing - Better than random, faster than complex algorithms
  • 📊 Scales linearly - 24k images? No problem
  • 🔧 Dead simple - One function, always works
  • 🎲 Reproducible - Set seed for consistent results

Usage

from smartdownsample import sample_diverse

# my_24k_images: a list of image file paths (str or Path objects)

# Sample 100 diverse images from 24,000 - takes seconds
selected = sample_diverse(
    image_paths=my_24k_images,
    target_count=100
)

# Sample 23,000 images from 24,000 - also takes seconds!
selected = sample_diverse(
    image_paths=my_24k_images,
    target_count=23000
)

# It's that simple.
print(f"Sampled {len(selected)} diverse images")
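
The usage above expects a list of image paths. One way to build it is sketched here; `collect_image_paths` and `IMAGE_EXTENSIONS` are hypothetical names for illustration and are not part of smartdownsample, and the extension set is an assumption to adjust for your dataset.

```python
from pathlib import Path

# Hypothetical helper, NOT part of smartdownsample: gathers image files
# under a directory so the result can be passed as `image_paths`.
# The extension set below is an assumption; adjust it as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}


def collect_image_paths(root, extensions=IMAGE_EXTENSIONS):
    """Return a sorted list of image file paths found recursively under root."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
```

The returned list can be passed directly as `image_paths`. Sorting keeps the input order stable between runs, which helps when comparing results produced with the same seed.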

How It Works

  1. Hash images - Compute a fast perceptual hash (DHash) for each image, using 4 parallel workers by default
  2. Create buckets - Group similar images together
  3. Sample evenly - Take images from each bucket for diversity

Result: A more diverse selection than random sampling, without pairwise image comparisons.
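
The bucketing and sampling steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual code: it assumes each image has already been reduced to a 64-bit integer hash, it groups by the top bits of that hash (the real bucketing rule is not documented here), and the names `bucket_by_hash_prefix` and `sample_evenly` are hypothetical.

```python
import random
from collections import defaultdict


def bucket_by_hash_prefix(hashes, hash_bits=64, prefix_bits=8):
    """Group item indices by the top prefix_bits of each integer hash.

    Assumption: similar images share a hash prefix. One pass, O(n).
    """
    buckets = defaultdict(list)
    shift = hash_bits - prefix_bits
    for index, h in enumerate(hashes):
        buckets[h >> shift].append(index)
    return buckets


def sample_evenly(buckets, target_count, seed=42):
    """Pick target_count indices by cycling over the buckets in turn."""
    rng = random.Random(seed)  # local RNG so the same seed gives the same picks
    pools = []
    for key in sorted(buckets):
        members = list(buckets[key])  # copy, so the input is left untouched
        rng.shuffle(members)
        pools.append(members)

    total = sum(len(pool) for pool in pools)
    if target_count > total:
        raise ValueError("target_count exceeds the number of available items")

    selected = []
    while len(selected) < target_count:
        # One round: take at most one item from every non-empty bucket.
        for pool in pools:
            if pool and len(selected) < target_count:
                selected.append(pool.pop())
    return selected
```

Cycling over the buckets means small buckets are represented before large ones are drawn from a second time, which is what makes the selection more varied than a uniform random draw.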

Performance

| Task | Time |
|------|------|
| 100 from 1,000 | <5 sec |
| 900 from 1,000 | <5 sec |
| 1,000 from 24,000 | ~30 sec |
| 23,000 from 24,000 | ~30 sec |
| Any ratio | Fast ✓ |
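
Timings depend on hardware and image sizes, so it can be worth measuring on your own data. A small timing wrapper is sketched below; `time_call` is a hypothetical helper, not part of smartdownsample.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, `selected, seconds = time_call(sample_diverse, image_paths=my_24k_images, target_count=1000)` would report how long one sampling run takes.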

Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| image_paths | Required | List of image file paths (str or Path objects) |
| target_count | Required | Exact number of images to select |
| n_workers | 4 | Number of parallel workers (4 is optimal) |
| hash_size | 8 | Hash size (8 is fast and good enough) |
| random_seed | 42 | Random seed for reproducible results |
| show_progress | True | Whether to display progress bars |

Why It's Fast

  • Fixed algorithm - No switching between methods
  • Simple hashing - DHash is faster than PHash
  • Smart bucketing - O(n) grouping instead of O(n²) comparisons
  • Parallel processing - But capped at 4 workers (diminishing returns above that)
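
To illustrate why DHash is cheap, here is a minimal sketch of a difference hash. It assumes the image has already been converted to grayscale and resized to a grid of `hash_size` rows by `hash_size + 1` columns (8 x 9 for the default `hash_size` of 8, giving 64 bits). The resizing step, the bit convention (left pixel brighter than right), and the function names are assumptions for illustration; the package's own implementation may differ.

```python
def dhash_from_grid(grid):
    """Difference hash of a grayscale grid with R rows and R + 1 columns.

    Each bit records whether a pixel is brighter than its right-hand
    neighbour, so the cost is one comparison per output bit.
    """
    bits = 0
    for row in grid:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")
```

Because the hash only encodes brightness gradients, a uniform brightness shift leaves it unchanged, and near-duplicate images tend to differ in only a few bits.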

License

MIT License – see LICENSE file.
