Smart image downsampling for image classification datasets

smartdownsample

Fast, simple image downsampling that just works

SmartDownsample samples diverse images from large collections in seconds, not hours. It exposes a single function that runs about equally fast whether you're sampling 100 or 23,000 images from 24,000.

Installation

pip install smartdownsample

Features

  • Always fast - Seconds for any selection ratio
  • 🎯 Smart bucketing - Better than random, faster than optimal algorithms
  • 📊 Scales linearly - 24k images? No problem
  • 🔧 Dead simple - One function, always works
  • 🎲 Reproducible - Set seed for consistent results
  • ⚖️ Honest trade-offs - Speed over perfection, good enough for most use cases

Usage

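from pathlib import Path

from smartdownsample import sample_diverse

# Collect the image paths to sample from ("images/" is a placeholder directory)
my_24k_images = sorted(Path("images").rglob("*.jpg"))

# Sample 100 diverse images from 24,000 - takes seconds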
selected = sample_diverse(
    image_paths=my_24k_images,
    target_count=100
)

# Sample 23,000 images from 24,000 - also takes seconds!
selected = sample_diverse(
    image_paths=my_24k_images,
    target_count=23000
)

# It's that simple.
print(f"Sampled {len(selected)} diverse images")

How It Works

Simple "trim from top" algorithm that maximizes diversity while being blazing fast:

1. Hash Images (Fast)

Image → 64-bit fingerprint in ~0.01 seconds
Uses DHash with 4 parallel workers
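To make the fingerprint step concrete, here is a minimal sketch of a difference hash (dHash). It is not the package's own code. It assumes the image has already been converted to grayscale and shrunk to a grid of hash_size rows by hash_size + 1 columns, and it uses one common bit convention (a bit is 1 when the right-hand pixel is brighter). The function name dhash_bits is hypothetical.

```python
def dhash_bits(gray, hash_size=8):
    """Return a difference hash as an int.

    `gray` is assumed to be `hash_size` rows of `hash_size + 1` brightness
    values, i.e. an image already converted to grayscale and resized.
    """
    if len(gray) != hash_size:
        raise ValueError("expected hash_size rows")
    value = 0
    for row in gray:
        if len(row) != hash_size + 1:
            raise ValueError("expected hash_size + 1 columns per row")
        # Compare each pixel with its right-hand neighbour: one bit per pair.
        for left, right in zip(row, row[1:]):
            value = (value << 1) | (1 if right > left else 0)
    return value


# A left-to-right gradient sets every bit; a flat image sets none.
rising = [list(range(9)) for _ in range(8)]
flat = [[7] * 9 for _ in range(8)]
print(hex(dhash_bits(rising)))  # 0xffffffffffffffff
print(hex(dhash_bits(flat)))    # 0x0
```

With hash_size=8 this yields 8 x 8 = 64 bits, matching the 64-bit fingerprint described above. In practice the grid would come from an imaging library rather than hand-built lists.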

2. Group Into Buckets (O(n))

Use the first 4 hash bits to create up to 16 visual groups (2^4 = 16):
Bucket A: [landscape1.jpg, landscape2.jpg, ...]     # 45 images
Bucket B: [portrait1.jpg, portrait2.jpg, ...]       # 12 images  
Bucket C: [closeup1.jpg, closeup2.jpg, ...]         # 890 images
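The grouping step can be sketched as a single pass over the fingerprints. This is an illustration under assumptions, not the package's implementation: it treats "first 4 bits" as the 4 most significant bits of a 64-bit hash, and bucket_by_prefix is a hypothetical helper name.

```python
from collections import defaultdict


def bucket_by_prefix(hashes, prefix_bits=4, hash_bits=64):
    """Group image names by the top `prefix_bits` bits of their hash.

    `hashes` maps an image name to its integer fingerprint.
    """
    shift = hash_bits - prefix_bits
    buckets = defaultdict(list)
    for name, value in hashes.items():
        # Shifting right keeps only the leading bits as the bucket id.
        buckets[value >> shift].append(name)
    return dict(buckets)


hashes = {
    "a.jpg": 0xF000000000000000,
    "b.jpg": 0xF100000000000000,
    "c.jpg": 0x0000000000000001,
}
groups = bucket_by_prefix(hashes)
print(groups)  # {15: ['a.jpg', 'b.jpg'], 0: ['c.jpg']}
```

Each image is touched once, which is where the O(n) cost of this step comes from.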

3. Trim from Top (Ultra Fast)

Sort buckets by size (largest first)
Keep ALL small buckets intact
Trim only from largest buckets using stride sampling

Example (illustrative bucket counts): Want 500 from 1,390 images
• 50 small buckets (1 each): Keep all = 50 images ✓
• 30 medium buckets (5 each): Keep all = 150 images ✓  
• 19 large buckets (10 each): Keep all = 190 images ✓
• 1 huge bucket (1000): Keep about every 9th (1000 / 110 ≈ 9.1) = 110 images ✓
Total: 50 + 150 + 190 + 110 = 500 images with maximum diversity preserved
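One way to compute how many images each bucket keeps is sketched below. It is an assumption about how such trimming could work, not the package's actual code. It walks the buckets from smallest to largest and gives each an equal share of what is still needed, so small buckets stay intact and only the largest are cut. The name allocate_quotas is hypothetical.

```python
def allocate_quotas(bucket_sizes, target_count):
    """Return how many items to keep per bucket, in input order."""
    if not 0 <= target_count <= sum(bucket_sizes):
        raise ValueError("target_count must be between 0 and the total size")
    # Visit buckets from smallest to largest.
    order = sorted(range(len(bucket_sizes)), key=lambda i: bucket_sizes[i])
    quotas = [0] * len(bucket_sizes)
    remaining = target_count
    for position, index in enumerate(order):
        buckets_left = len(order) - position
        fair_share = remaining // buckets_left
        # Small buckets fit under the fair share and are kept whole;
        # anything they leave unused rolls over to the larger buckets.
        quotas[index] = min(bucket_sizes[index], fair_share)
        remaining -= quotas[index]
    return quotas


# The example above: 50 x 1, 30 x 5, 19 x 10 and one bucket of 1000.
sizes = [1] * 50 + [5] * 30 + [10] * 19 + [1000]
quotas = allocate_quotas(sizes, 500)
print(sum(quotas), quotas[-1])  # 500 110
```

On the example numbers this reproduces the breakdown above: every bucket except the largest is kept whole, and the 1,000-image bucket is trimmed to 110.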

Why It's Fast

Algorithm advantages:

  • O(n) complexity - One pass to group images, then a single sort of the handful of buckets
  • Stride sampling - Array slicing, not random selection
  • No complex math - Simple bucket trimming
  • Maximum diversity - Small buckets always preserved
  • Temporal spread - Stride gives even time distribution
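A short sketch of stride sampling follows. A plain slice with a fixed step is the simplest form, but it rarely lands on an exact count: every 9th of 1,000 items gives 112, not 110. The hypothetical stride_sample helper below uses evenly spaced indices to hit the count exactly. Whether the package does this or something similar is an assumption.

```python
def stride_sample(items, keep):
    """Pick `keep` items at evenly spaced positions, preserving order."""
    n = len(items)
    if not 0 <= keep <= n:
        raise ValueError("keep must be between 0 and len(items)")
    # Index i * n // keep advances by n / keep on average, so the picks
    # are spread across the whole sequence and never repeat.
    return [items[i * n // keep] for i in range(keep)]


frames = list(range(1000))       # stand-in for paths sorted by capture time
print(len(frames[::9]))          # 112: a fixed step overshoots the target
picked = stride_sample(frames, 110)
print(len(picked), picked[:3], picked[-1])  # 110 [0, 9, 18] 990
```

If the input is ordered by time, the picks cover the whole time range instead of clustering in one part of it, which is the temporal spread mentioned above.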

What you get:

  • ✅ Fastest possible while maintaining quality
  • ✅ Preserves rare/unique images (small buckets)
  • ✅ Even temporal sampling within large buckets
  • ✅ Deterministic results (no randomness needed)

Result: Optimal speed + maximum diversity preservation.

Algorithm Comparison

| Approach | Speed | Diversity | Temporal Spread | Use Case |
|---|---|---|---|---|
| Random sampling | Fastest | Poor | Poor | Quick tests only |
| smartdownsample | Ultra Fast | Excellent | Excellent | Production use |
| Complex diversity | Very Slow | Perfect | Poor | Research only |

Real Example: 24,000 images → 1,000 selected

  • Random: 1 second, poor diversity, clumped sampling
  • smartdownsample: roughly 20 to 30 seconds, excellent diversity + temporal spread
  • Complex: 2+ hours, mathematically perfect but no temporal awareness

Sweet spot: Maximum diversity preservation with temporal awareness in minimal time.

Performance

| Task | Time |
|---|---|
| 100 from 1,000 | <5 sec |
| 900 from 1,000 | <5 sec |
| 1,000 from 24,000 | ~30 sec |
| 23,000 from 24,000 | ~30 sec |
| Any ratio | Fast ✓ |

Parameters

| Parameter | Default | Description |
|---|---|---|
| image_paths | Required | List of image file paths (str or Path objects) |
| target_count | Required | Exact number of images to select |
| n_workers | 4 | Number of parallel workers (4 is optimal) |
| hash_size | 8 | Hash size (8 is fast and good enough) |
| random_seed | 42 | Random seed for reproducible results |
| show_progress | True | Whether to display progress bars |

Why It's Fast

  • Fixed algorithm - No switching between methods
  • Simple hashing - DHash is faster than PHash
  • Smart bucketing - O(n) grouping instead of O(n²) comparisons
  • Parallel processing - But capped at 4 workers (diminishing returns above that)
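The parallel hashing step can be sketched with the standard library. This is a guess at the general shape, not the package's code. It assumes a thread pool (the package may use processes instead), and the names hash_all, hash_one and worker_cap are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def hash_all(paths, hash_one, n_workers=4, worker_cap=4):
    """Apply `hash_one` to every path with a small, capped worker pool."""
    paths = list(paths)
    # Never use more than `worker_cap` workers, and always at least one.
    workers = max(1, min(n_workers, worker_cap))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so zip pairs them correctly.
        return dict(zip(paths, pool.map(hash_one, paths)))


# `len` stands in for a real image-hashing function here.
result = hash_all(["a", "bb", "ccc"], len, n_workers=16)
print(result)  # {'a': 1, 'bb': 2, 'ccc': 3}
```

Image hashing spends most of its time reading and decoding files, so a small pool usually captures most of the available speedup.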

License

MIT License – see LICENSE file.
