Smart image downsampling for image classification datasets
smartdownsample
Fast, simple image downsampling that just works
SmartDownsample selects a diverse subset of images from a large collection in seconds rather than hours. It exposes one function whose runtime is about the same whether you sample 100 or 23,000 images from 24,000.
Installation
pip install smartdownsample
Features
- ⚡ Always fast - Seconds for any selection ratio
- 🎯 Smart bucketing - Better than random, faster than optimal algorithms
- 📊 Scales linearly - 24k images? No problem
- 🔧 Dead simple - One function, always works
- 🎲 Reproducible - Set seed for consistent results
- ⚖️ Honest trade-offs - Speed over perfection, good enough for most use cases
Usage
from smartdownsample import sample_diverse
# Sample 100 diverse images from 24,000 - takes seconds
selected = sample_diverse(
    image_paths=my_24k_images,
    target_count=100,
)
# Sample 23,000 images from 24,000 - also takes seconds!
selected = sample_diverse(
    image_paths=my_24k_images,
    target_count=23000,
)
# It's that simple.
print(f"Sampled {len(selected)} diverse images")
How It Works
A simple "trim from top" algorithm that favors diversity while staying fast:
1. Hash Images (Fast)
Image → 64-bit fingerprint in ~0.01 seconds
Uses DHash with 4 parallel workers
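To make the fingerprint step concrete, here is a minimal sketch of a difference hash. It is not the package's code: `dhash_from_grid` is a hypothetical helper, and it assumes the image has already been converted to grayscale and resized to 8 rows by 9 columns, so the 8 horizontal comparisons per row yield 64 bits. The bit convention (1 when the left pixel is brighter) is also an assumption; libraries differ on this.

```python
from typing import Sequence


def dhash_from_grid(gray: Sequence[Sequence[int]]) -> int:
    """Difference hash of a grayscale grid (hypothetical helper).

    Expects hash_size rows of hash_size + 1 brightness values each,
    e.g. 8 rows x 9 columns for a 64-bit fingerprint.
    """
    bits = 0
    for row in gray:
        # Compare each pixel with its right-hand neighbour.
        for left, right in zip(row, row[1:]):
            # Append one bit: 1 if brightness drops from left to right.
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


# A left-to-right brightening gradient never drops, so every bit is 0.
rising = [list(range(9)) for _ in range(8)]
# The mirrored gradient drops at every step, so all 64 bits are 1.
falling = [list(range(8, -1, -1)) for _ in range(8)]

rising_hash = dhash_from_grid(rising)
falling_hash = dhash_from_grid(falling)
```

Because each bit only records whether brightness rises or falls between neighbours, visually similar images tend to share most of their bits, which is what makes the leading bits usable as a grouping key.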
2. Group Into Buckets (O(n))
Use the first 4 hash bits to create up to 16 visual groups:
Bucket A: [landscape1.jpg, landscape2.jpg, ...] # 45 images
Bucket B: [portrait1.jpg, portrait2.jpg, ...] # 12 images
Bucket C: [closeup1.jpg, closeup2.jpg, ...] # 890 images
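The grouping step can be sketched as a single pass that keys each image by the leading bits of its hash. This is an illustration under stated assumptions, not the package's implementation: `bucket_by_prefix` is a hypothetical name, and the input is assumed to be (path, 64-bit hash) pairs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_by_prefix(
    hashed: Iterable[Tuple[str, int]],
    prefix_bits: int = 4,
    hash_bits: int = 64,
) -> Dict[int, List[str]]:
    """Group paths by the top `prefix_bits` bits of their hash (hypothetical helper)."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    shift = hash_bits - prefix_bits
    for path, image_hash in hashed:
        # Shifting right keeps only the leading bits: a key in 0..15 for 4 bits.
        buckets[image_hash >> shift].append(path)
    return dict(buckets)


pairs = [
    ("a.jpg", 0x0000000000000000),
    ("b.jpg", 0x0000000000000001),
    ("c.jpg", 0xF000000000000000),
]
groups = bucket_by_prefix(pairs)
```

Each image is touched once, so the grouping is linear in the number of images and never compares images pairwise.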
3. Trim from Top (Ultra Fast)
Sort buckets by size (largest first)
Keep ALL small buckets intact
Trim only from largest buckets using stride sampling
Example: Want 500 from 1,390 images
• 50 small buckets (1 each): Keep all = 50 images ✓
• 30 medium buckets (5 each): Keep all = 150 images ✓
• 19 large buckets (10 each): Keep all = 190 images ✓
• 1 huge bucket (1000): Keep roughly every 9th = 110 images ✓
Total: 500 images with maximum diversity preserved
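The worked example above can be reproduced with a short sketch. This is one possible way to realize "trim from top", not the package's actual code: `trim_from_top` and `evenly_spaced` are hypothetical names. Here each bucket gets a quota computed smallest first, so small buckets are kept whole and only the largest are cut, and the picks inside a trimmed bucket are taken at evenly spaced positions as a stand-in for stride sampling.

```python
from typing import List, Sequence


def evenly_spaced(items: Sequence[str], count: int) -> List[str]:
    """Pick `count` items at evenly spaced positions (count <= len(items))."""
    n = len(items)
    return [items[(j * n) // count] for j in range(count)]


def trim_from_top(buckets: Sequence[Sequence[str]], target_count: int) -> List[str]:
    """Keep small buckets whole and thin out only the largest ones (a sketch)."""
    total = sum(len(b) for b in buckets)
    if target_count >= total:
        return [item for b in buckets for item in b]

    # Assign quotas from the smallest bucket up: a bucket smaller than its
    # fair share keeps everything, and the slack flows to the larger buckets.
    order = sorted(range(len(buckets)), key=lambda i: len(buckets[i]))
    quotas = [0] * len(buckets)
    remaining = target_count
    for rank, i in enumerate(order):
        buckets_left = len(order) - rank
        quotas[i] = min(len(buckets[i]), remaining // buckets_left)
        remaining -= quotas[i]

    selected: List[str] = []
    for bucket, quota in zip(buckets, quotas):
        selected.extend(evenly_spaced(bucket, quota))
    return selected


# The example from the text: 1,390 images in 100 buckets, 500 wanted.
buckets = (
    [[f"s{i}"] for i in range(50)]
    + [[f"m{i}_{j}" for j in range(5)] for i in range(30)]
    + [[f"l{i}_{j}" for j in range(10)] for i in range(19)]
    + [[f"h{j}" for j in range(1000)]]
)
picked = trim_from_top(buckets, 500)
```

With these inputs the three smaller tiers are kept in full (390 images) and the 1,000-image bucket contributes the remaining 110, matching the totals in the example.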
Why It's Fast
Algorithm advantages:
- ✅ O(n) complexity - One pass to fill the buckets, then a single sort of the few buckets
- ✅ Stride sampling - Array slicing, not random selection
- ✅ No complex math - Simple bucket trimming
- ✅ Maximum diversity - Small buckets always preserved
- ✅ Temporal spread - Stride gives even time distribution
What you get:
- ✅ Fastest possible while maintaining quality
- ✅ Preserves rare/unique images (small buckets)
- ✅ Even temporal sampling within large buckets
- ✅ Deterministic results (no randomness needed)
Result: Optimal speed + maximum diversity preservation.
Algorithm Comparison
| Approach | Speed | Diversity | Temporal Spread | Use Case |
|---|---|---|---|---|
| Random sampling | Fastest | Poor | Poor | Quick tests only |
| smartdownsample | Ultra Fast | Excellent | Excellent | Production use |
| Complex diversity | Very Slow | Perfect | Poor | Research only |
Real Example: 24,000 images → 1,000 selected
- Random: 1 second, poor diversity, clumped sampling
- smartdownsample: 20 seconds, excellent diversity + temporal spread
- Complex: 2+ hours, mathematically perfect but no temporal awareness
Sweet spot: Maximum diversity preservation with temporal awareness in minimal time.
Performance
| Task | Time |
|---|---|
| 100 from 1,000 | <5 sec |
| 900 from 1,000 | <5 sec |
| 1,000 from 24,000 | ~30 sec |
| 23,000 from 24,000 | ~30 sec |
| Any ratio | Fast ✓ |
Parameters
| Parameter | Default | Description |
|---|---|---|
| `image_paths` | Required | List of image file paths (str or Path objects) |
| `target_count` | Required | Exact number of images to select |
| `n_workers` | 4 | Number of parallel workers (4 is optimal) |
| `hash_size` | 8 | Hash size (8 is fast and good enough) |
| `random_seed` | 42 | Random seed for reproducible results |
| `show_progress` | True | Whether to display progress bars |
Why It's Fast
- Fixed algorithm - No switching between methods
- Simple hashing - DHash is faster than PHash
- Smart bucketing - O(n) grouping instead of O(n²) comparisons
- Parallel processing - But capped at 4 workers (diminishing returns above that)
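As a rough illustration of the capped parallelism, the sketch below maps a function over a list with a small thread pool. It is an assumption-laden sketch rather than the package's code: `map_parallel` and `MAX_WORKERS` are hypothetical names, and the use of threads instead of processes is a guess.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

MAX_WORKERS = 4  # assumed cap, following the "capped at 4 workers" note


def map_parallel(func: Callable[[T], R], items: Iterable[T], n_workers: int = 4) -> List[R]:
    """Apply `func` to every item using at most MAX_WORKERS threads."""
    workers = max(1, min(n_workers, MAX_WORKERS))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, whatever order they finish in.
        return list(pool.map(func, items))


squares = map_parallel(lambda x: x * x, range(10), n_workers=16)
```

In practice `func` would be the per-image hashing step. Keeping results in input order matters here because the later stride sampling depends on the order of items within each bucket.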
License
MIT License – see LICENSE file.