Smart image downsampling for image classification datasets

smartdownsample

Efficient downsampling for image classification datasets

SmartDownsample selects the most diverse images from a large collection, which makes it well suited to reducing dataset size while preserving visual variability.

Installation

pip install smartdownsample

Usage

from smartdownsample import select_distinct

# Example list of image paths (placeholders; a real list would hold more than target_count images)
my_image_list = [
    "path/to/img1.jpg",
    "path/to/img2.jpg",
    "path/to/img3.jpg",
    "path/to/img4.jpg"
]

# Simple selection - get 100 most diverse images
selected = select_distinct(
    image_paths=my_image_list,
    target_count=100
)

# With visual verification to see excluded images in context
selected = select_distinct(
    image_paths=my_image_list,
    target_count=100,
    show_verification=True
)

print(f"Selected {len(selected)} images")
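In practice the path list is usually built from a directory tree rather than typed by hand. The following is a minimal sketch of one way to do that with the standard library. The helper name collect_image_paths and the extension set are assumptions made for illustration; they are not part of smartdownsample.

```python
from pathlib import Path

# Assumed set of extensions for this sketch; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def collect_image_paths(root):
    """Hypothetical helper: recursively gather image files under `root`.

    Returns a sorted list of Path objects whose suffix (case-insensitive)
    is in IMAGE_EXTENSIONS.
    """
    root = Path(root)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Because image_paths accepts str or Path objects, the returned list could be passed to select_distinct directly, for example select_distinct(image_paths=collect_image_paths("path/to/class_folder"), target_count=100).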

Parameters

| Parameter | Default | Description |
|---|---|---|
| image_paths | Required | List of image file paths (str or Path objects) |
| target_count | Required | Exact number of images to select |
| window_size | 100 | Rolling window size (larger = better quality, but slower) |
| random_seed | 42 | Random seed for reproducible results |
| show_progress | True | Whether to display progress bars |
| show_verification | False | Show a visual verification comparing excluded and included images |

Step by Step

  1. Sort paths by directory. Within each folder, files are sorted in natural order (e.g., img1.jpg, img2.jpg, img10.jpg) so that related images stay grouped.
  2. Compute perceptual hashes for all valid image paths.
  3. Apply rolling-window selection to the hash array to choose the indices of the most diverse images. Each candidate is compared only to a sliding window of recent selections, so for a fixed window_size the run time grows linearly with the number of images, O(n), and scales to large classes of 100k+ images.
  4. Return results as [valid_paths[i] for i in selected_indices].
  5. Optional verification plot: if show_verification=True, the function displays 18 randomly selected excluded images, each alongside its included counterpart. The visualization opens automatically in your default image viewer and nothing is saved to disk.
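To make steps 1 and 3 more concrete, here is a minimal sketch under stated assumptions. It is not the package's actual implementation. The function names natural_key, hamming, and rolling_window_select are hypothetical, hashes are assumed to be plain Python integers, and the bucket-based strategy (one pick per consecutive bucket, scored against the most recent selections) is just one way to get an exact target_count while comparing each candidate only to a bounded window.

```python
import re
from collections import deque
from pathlib import Path


def natural_key(path):
    """Sort key: parent directory first, then file name with digit runs as integers."""
    p = Path(path)
    parts = re.split(r"(\d+)", p.name.lower())
    # Digit runs sort numerically and before text; text sorts lexicographically.
    name_key = [(0, int(s), "") if s.isdecimal() else (1, 0, s) for s in parts]
    return (str(p.parent), name_key)


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def rolling_window_select(hashes, target_count, window_size=100):
    """Pick exactly target_count indices (or all, if fewer are available).

    Assumed strategy: split the ordered hashes into target_count consecutive
    buckets and, in each bucket, keep the candidate whose minimum Hamming
    distance to the last `window_size` selected hashes is largest.
    Cost is O(n * window_size), i.e. linear in n for a fixed window.
    """
    n = len(hashes)
    if target_count <= 0:
        return []
    if target_count >= n:
        return list(range(n))

    recent = deque(maxlen=window_size)  # hashes of the most recent selections
    selected = []
    for b in range(target_count):
        start = b * n // target_count
        stop = (b + 1) * n // target_count  # bucket is non-empty since target_count < n

        def novelty(i):
            # Distance to the closest recent selection; 0 when nothing is selected yet.
            return min((hamming(hashes[i], h) for h in recent), default=0)

        best = max(range(start, stop), key=novelty)  # ties resolve to the earliest index
        selected.append(best)
        recent.append(hashes[best])
    return selected
```

Usage would look like ordered = sorted(paths, key=natural_key), followed by hashing each path and calling rolling_window_select on the resulting list. A larger window_size compares each candidate against more past selections, which matches the quality versus speed trade-off described in the parameter table.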

License

MIT License – see LICENSE file.
