Smart image downsampling for image classification datasets

smartdownsample

Embedding-based diverse downsampling for large image datasets

smartdownsample selects representative subsets from large image collections while preserving visual diversity. It uses DINOv2 embeddings and agglomerative clustering to group visually similar images, then samples across clusters to maximize variety.

Built for image collections that:

- Contain more images than you need for training, and
- Have a high level of redundancy (e.g., many near-duplicate or visually similar frames)

In many ML workflows, majority classes can have hundreds of thousands of images. These often need to be reduced for efficiency or class balance, without discarding too much valuable variation. smartdownsample offers a practical solution: fast downsampling that keeps diversity, cutting processing time from hours (or days) to minutes.

This approach builds on work by Dante Wasmuht and Peter Bermant at Conservation X Labs.

Installation

pip install smartdownsample

Requires Python >= 3.8. GPU recommended but not required (falls back to CPU).

Note: pip install smartdownsample installs CPU-only PyTorch. For GPU support, install the CUDA version of PyTorch first (pytorch.org).

Usage

from smartdownsample import sample_diverse

selected = sample_diverse(
    image_paths=my_image_list,
    target_count=50000
)

Parameters

Parameter	Default	Description
`image_paths`	Required	List of image file paths (str or Path objects)
`target_count`	Required	Exact number of images to select
`distance_threshold`	`0.5`	Cosine distance threshold for clustering. Lower = more clusters (stricter). Higher = fewer clusters (more lenient).
`n_workers`	`4`	Number of parallel workers for image loading
`show_progress`	`True`	Display progress bars during processing
`show_summary`	`True`	Print cluster statistics and distribution summary
`save_distribution`	`None`	Path to save distribution chart as PNG (creates directories if needed)
`save_thumbnails`	`None`	Path to save thumbnail grids as PNG (creates directories if needed)
`image_loading_errors`	`"raise"`	How to handle image loading errors: `"raise"` (fail immediately) or `"skip"` (continue with remaining images)
`return_indices`	`False`	Return 0-based indices instead of paths (refers to original input list order)
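
When `return_indices` is enabled, the returned indices refer to the original input order, which helps keep parallel lists (labels, metadata) aligned with the selection. A minimal sketch: the index list is hard-coded as a stand-in for an actual `sample_diverse` result, and the file names and labels are made up for illustration.

```python
# Hypothetical inputs: image paths and a parallel list of class labels.
image_paths = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
labels = ["deer", "deer", "fox", "deer", "fox"]

# Stand-in for the result of:
#   sample_diverse(image_paths=image_paths, target_count=3, return_indices=True)
selected_idx = [0, 2, 4]

# 0-based indices into the original list keep paths and labels aligned.
selected_paths = [image_paths[i] for i in selected_idx]
selected_labels = [labels[i] for i in selected_idx]
```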

How it works

The algorithm has four steps:

Embedding extraction: Each image is passed through DINOv2 ViT-S/14 to produce a 384-dimensional embedding vector that captures semantic visual features (subjects, backgrounds, composition, lighting). Embeddings are L2-normalized. The model is loaded once and cached for subsequent calls.
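
To illustrate why L2 normalization is convenient here: for unit-length vectors, cosine distance reduces to one minus the dot product. The sketch below uses tiny 2-dimensional vectors instead of real 384-dimensional embeddings, and the helper names are illustrative, not part of the package.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_distance(u, v):
    """Cosine distance between two unit-length vectors: 1 - dot product."""
    return 1.0 - sum(a * b for a, b in zip(u, v))

a = l2_normalize([3.0, 4.0])    # [0.6, 0.8]
b = l2_normalize([4.0, 3.0])    # [0.8, 0.6], points in a similar direction
c = l2_normalize([-3.0, -4.0])  # points in the opposite direction of a
```

Distances range from 0 (same direction) to 2 (opposite direction), which gives a sense of scale for the `distance_threshold` default of 0.5.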
Clustering: Images are grouped using agglomerative clustering (cosine distance, average linkage) with a fixed distance threshold. The number of clusters reflects the natural visual structure of the data, not the selection budget. This means larger clusters (common visual patterns) get proportionally more images in the selection, while small clusters (rare/unique images) are still guaranteed representation.
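
The idea of threshold-based clustering can be sketched with a naive pure-Python version of average-linkage agglomerative clustering. This is an illustration only, not the package's implementation: the function name `threshold_clustering` is hypothetical, and the loop is far too slow for real datasets. It shows that the number of clusters falls out of the threshold rather than being chosen up front.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def threshold_clustering(vectors, distance_threshold=0.5):
    """Merge the two closest clusters until the closest pair is at or
    beyond `distance_threshold`. Returns one cluster label per vector."""
    clusters = [[i] for i in range(len(vectors))]

    def average_linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] >= distance_threshold:
            break  # nothing left that is similar enough to merge
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Two visually distinct "groups" of toy embeddings.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = threshold_clustering(vectors, distance_threshold=0.5)
```

With a very low threshold every vector stays in its own cluster, and with a high one everything merges, which matches the "lower = more clusters" behavior described for `distance_threshold`.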
Divide-and-conquer scaling (for large datasets)

Clustering all images at once requires the distance between every pair of images, so the cost grows quadratically. A full distance matrix has 100 million entries for 10,000 images and 1 trillion entries for 1,000,000 images. Instead, for datasets larger than 2,000 images, clustering is done in stages:
1. Shuffle the images randomly and split them into groups of ~2,000.
2. Cluster each group independently (much smaller distance matrices).
3. From each cluster within each group, pick the 5 most central images as representatives.
4. Re-cluster all the representatives together. This merges clusters that were separated by the random split, e.g., visually similar images that ended up in different groups now get reunited.
5. Every image inherits the final cluster ID of its representative.
The random shuffle ensures each group is a representative mix. The re-clustering stitches it back together. The result is roughly the same as clustering everything at once, but at a fraction of the cost.
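
The five stages above can be sketched on toy data. This is a simplified illustration under several assumptions, not the package's code: points are plain numbers with absolute difference as the distance, the chunk size and number of representatives are tiny (the text uses ~2,000 and 5), only one round of re-clustering is shown, and each local cluster inherits the final label of its most central representative. All function names are hypothetical.

```python
import random

def cluster(points, threshold):
    """Naive average-linkage clustering of 1-D points; returns lists of positions."""
    groups = [[i] for i in range(len(points))]

    def linkage(g1, g2):
        total = sum(abs(points[i] - points[j]) for i in g1 for j in g2)
        return total / (len(g1) * len(g2))

    while len(groups) > 1:
        d, a, b = min(
            (linkage(groups[a], groups[b]), a, b)
            for a in range(len(groups))
            for b in range(a + 1, len(groups))
        )
        if d >= threshold:
            break
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups

def most_central(points, members, k):
    """Up to k members with the smallest mean distance to the rest of the cluster."""
    def mean_dist(i):
        return sum(abs(points[i] - points[j]) for j in members) / len(members)
    return sorted(members, key=mean_dist)[:k]

def chunked_cluster(points, threshold, chunk_size=4, reps_per_cluster=2, seed=0):
    # Stage 1: shuffle and split into chunks.
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)
    chunks = [order[i:i + chunk_size] for i in range(0, len(order), chunk_size)]

    reps = []            # global indices of all representatives
    local_clusters = []  # (member indices, position of the first representative in reps)
    for chunk in chunks:
        # Stage 2: cluster each chunk independently.
        for positions in cluster([points[i] for i in chunk], threshold):
            members = [chunk[p] for p in positions]
            local_clusters.append((members, len(reps)))
            # Stage 3: keep the most central members as representatives.
            reps.extend(most_central(points, members, reps_per_cluster))

    # Stage 4: re-cluster all representatives together.
    rep_label = [None] * len(reps)
    for label, positions in enumerate(cluster([points[r] for r in reps], threshold)):
        for p in positions:
            rep_label[p] = label

    # Stage 5: every point inherits the final label of its representative.
    labels = [None] * len(points)
    for members, first_rep in local_clusters:
        for i in members:
            labels[i] = rep_label[first_rep]
    return labels

points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2, 20.0, 20.1]
labels = chunked_cluster(points, threshold=1.0)
```

Even though the shuffle scatters the three value groups across chunks, the re-clustering of representatives reunites them under shared labels.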

If the representative set is still too large after several rounds (very large datasets, 500K+), the final merging step uses MiniBatchKMeans instead of agglomerative clustering. KMeans scales linearly because it doesn't build a pairwise distance matrix. The earlier rounds still use full agglomerative clustering where the real grouping happens, so the impact on quality is minimal.
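
A rough back-of-the-envelope comparison shows why the fallback matters. The sketch assumes a full square distance matrix stored as 32-bit floats and counts one point-to-centroid distance per point and centroid for a single KMeans pass; the figure of 1,000 centroids is an arbitrary example, not a value taken from the package.

```python
def full_matrix_bytes(n_points, bytes_per_value=4):
    """Memory for a full n x n pairwise distance matrix (float32 by default)."""
    return n_points * n_points * bytes_per_value

def kmeans_distance_evals(n_points, n_clusters):
    """Point-to-centroid distances computed in one pass over the data."""
    return n_points * n_clusters

small = full_matrix_bytes(2_000)      # one ~2,000-image group: 16 MB
large = full_matrix_bytes(500_000)    # 500K representatives: 1 TB
kmeans = kmeans_distance_evals(500_000, 1_000)
```

The matrix for a 2,000-image group fits comfortably in memory, while the same matrix for 500,000 points would need about a terabyte, so a method that only compares points to centroids is the practical choice at that scale.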
Cluster-aware sampling
- Phase 1 (diversity): Take the most central image (medoid) from each cluster, guaranteeing every visual group is represented.
- Phase 2 (proportional fill): Distribute the remaining budget across clusters in proportion to their size using largest-remainder allocation. This keeps representation fair: a cluster with twice as many images gets roughly twice as many selections, without rounding bias toward the largest clusters. Within each cluster, images are selected by centrality rank (most representative first).
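
The two phases can be sketched as follows. This is an illustration under stated assumptions rather than the package's actual code: the function names are hypothetical, each cluster is given as a list already sorted by centrality (medoid first), and the proportional fill is weighted by the images still available in each cluster (size minus the medoid) so that no cluster is asked for more images than it has. The case where `target_count` is smaller than the number of clusters is not covered by the description above, so the sketch simply rejects it.

```python
def allocate(cluster_sizes, target_count):
    """Number of images to take per cluster: one medoid each, then a
    proportional fill using largest-remainder rounding."""
    n_clusters = len(cluster_sizes)
    total = sum(cluster_sizes)
    if not n_clusters <= target_count <= total:
        raise ValueError("this sketch needs n_clusters <= target_count <= total images")

    # Phase 1 (diversity): one slot per cluster for its medoid.
    counts = [1] * n_clusters
    budget = target_count - n_clusters

    # Phase 2 (proportional fill): integer quotas plus largest remainders.
    capacity = [size - 1 for size in cluster_sizes]
    total_capacity = sum(capacity)
    if budget == 0 or total_capacity == 0:
        return counts
    remainders = []
    for i, cap in enumerate(capacity):
        quota, remainder = divmod(budget * cap, total_capacity)
        counts[i] += quota
        remainders.append((remainder, i))
    leftover = target_count - sum(counts)
    # Hand out the leftover slots to the largest remainders (ties: lower index).
    for _, i in sorted(remainders, key=lambda r: (-r[0], r[1]))[:leftover]:
        counts[i] += 1
    return counts

def select(ranked_clusters, target_count):
    """Take the top-ranked images from each cluster according to `allocate`."""
    counts = allocate([len(c) for c in ranked_clusters], target_count)
    return [img for c, k in zip(ranked_clusters, counts) for img in c[:k]]

counts = allocate([60, 30, 10], 13)
picked = select([["a1", "a2", "a3"], ["b1"]], 3)
```

In the first example, every cluster gets its medoid and the remaining 10 slots are split roughly 6:3:1 before the two leftover slots go to the clusters with the largest fractional remainders.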
Save distribution chart (optional)
- Vertical bar chart of kept vs. excluded images per cluster
Save thumbnail grids (optional)
- 5x5 grids from each cluster, for quick visual review

Performance

Approximate times on an NVIDIA RTX 3080 Ti.

Dataset size	Embedding time (GPU)	Clustering	Total
1,000 images	~1s	instant	~2s
10,000 images	~15s	~1s	~20s
100,000 images	~2.5 min	~10s	~3 min
1,000,000 images	~25 min	~2 min	~30 min

License

MIT License, see LICENSE file.

TODO

The thumbnail plots can get very large (several dozen MB). Perhaps we should find a solution where we show a maximum of 10*10 clusters. It's only for a visual check anyway, so there is no need to show all clusters.
The x-axis ticks on the distribution plot overlap. Remove them; they are redundant. The plot is about the main red/green idea anyway.
In the in-bucket assignment, how does it choose which images to keep and which to exclude? Investigate. Can we improve this?

Release history

Version	Release date
2.0.6	Mar 30, 2026
2.0.5	Mar 26, 2026
2.0.4 (this version)	Mar 26, 2026
2.0.3	Mar 26, 2026
2.0.2	Mar 25, 2026
2.0.0	Mar 25, 2026
1.9.2	Oct 9, 2025
1.9.1	Oct 8, 2025
1.9.0	Oct 8, 2025
1.8.5	Aug 20, 2025
1.8.4	Aug 20, 2025
1.8.3	Aug 20, 2025
1.8.2	Aug 20, 2025
1.8.1	Aug 20, 2025
1.8.0	Aug 20, 2025
1.7.2	Aug 20, 2025
1.7.1	Aug 20, 2025
1.7.0	Aug 20, 2025
1.6.2	Aug 20, 2025
1.6.1	Aug 20, 2025
1.6.0	Aug 20, 2025
1.5.3	Aug 20, 2025
1.5.2	Aug 20, 2025
1.5.1	Aug 20, 2025
1.5.0	Aug 20, 2025
1.4.1	Aug 20, 2025
1.4.0	Aug 20, 2025
1.3.6	Aug 20, 2025
1.3.3	Aug 20, 2025
1.3.2	Aug 20, 2025
1.3.1	Aug 20, 2025
1.3.0	Aug 20, 2025
1.2.1	Aug 20, 2025
1.2.0	Aug 20, 2025
1.1.1	Aug 20, 2025
1.1.0	Aug 20, 2025
1.0.1	Aug 20, 2025
1.0.0	Aug 20, 2025
0.4.0	Aug 19, 2025
0.3.2	Aug 19, 2025
0.3.0	Aug 19, 2025
0.2.2	Aug 19, 2025
0.2.1	Aug 19, 2025
0.2.0	Aug 19, 2025
0.1.2	Aug 19, 2025
0.1.1	Aug 19, 2025
0.1.0	Aug 19, 2025

Download files


Source Distribution

smartdownsample-2.0.4.tar.gz (21.4 kB view details)

Uploaded Mar 26, 2026 Source

Built Distribution


smartdownsample-2.0.4-py3-none-any.whl (14.9 kB view details)

Uploaded Mar 26, 2026 Python 3

File details

Details for the file smartdownsample-2.0.4.tar.gz.

File metadata

Download URL: smartdownsample-2.0.4.tar.gz
Upload date: Mar 26, 2026
Size: 21.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.15

File hashes

Hashes for smartdownsample-2.0.4.tar.gz
Algorithm	Hash digest
SHA256	`dc3c70d9d0a5cfbe533782b2d444052ae54ff62a94fd3a39c63e57ff93f794d7`
MD5	`4748b745e62b19fa00b08e7372f2df26`
BLAKE2b-256	`01a5700f436161c05ad65b7b81e5a48c2e10e5b5fc0eacbe9e20e61874ecf4c1`


File details

Details for the file smartdownsample-2.0.4-py3-none-any.whl.

File metadata

Download URL: smartdownsample-2.0.4-py3-none-any.whl
Upload date: Mar 26, 2026
Size: 14.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.15

File hashes

Hashes for smartdownsample-2.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c240629ca033daa32e71a1edcc76ad8eea8a58b5e0de6243d95b69da68549ca7`
MD5	`31aea569a597de5b9d8b394b19ad47be`
BLAKE2b-256	`0d27021bf86732fa6d1feeca3073d13840dc7967b93339bd25523079df873a32`


smartdownsample 2.0.4
