Smart image downsampling for image classification datasets

smartdownsample

Fast and lightweight downsampling for large image datasets

smartdownsample is built for image collections that:

  1. Contain more images than you need for training, and
  2. Have a high level of redundancy

The tool selects representative subsets while preserving diversity by distilling each image to a tiny signature of visual features. In many ML workflows, majority classes can have hundreds of thousands of images. These often need to be reduced for efficiency or class balance, without discarding too much valuable variation.

Perfect deduplication would require heavy computations and isn’t feasible at scale. Instead, smartdownsample offers a practical compromise: fast downsampling that keeps diversity with minimal overhead, cutting processing time from hours (or days) to minutes.

If you need mathematically optimal results, this isn’t the right fit. But if you want a simple, effective alternative that outperforms random sampling, smartdownsample is designed for you.

Installation

pip install smartdownsample

Usage

from smartdownsample import sample_diverse

# List of image paths
my_image_list = [
    "path/to/img1.jpg",
    "path/to/img2.jpg",
    "path/to/img3.jpg",
    # ...
]

# Basic usage
selected = sample_diverse(
    image_paths=my_image_list,
    target_count=50000
)
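
Since the tool is aimed at oversized majority classes, one common pattern is to downsample only the classes that exceed a cap and leave the rest untouched. The sketch below is illustrative only: `cap_per_class` and `first_n` are hypothetical names that are not part of smartdownsample, and `first_n` is a stand-in sampler so the example runs without the package installed.

```python
def cap_per_class(paths_by_class, cap, sampler):
    """Downsample only the classes that hold more than `cap` images.

    `sampler` is any callable that accepts `image_paths` and `target_count`
    as keyword arguments and returns a list of paths.
    """
    result = {}
    for name, paths in paths_by_class.items():
        if len(paths) > cap:
            result[name] = list(sampler(image_paths=paths, target_count=cap))
        else:
            # Small classes are kept whole.
            result[name] = list(paths)
    return result


def first_n(image_paths, target_count):
    """Stand-in sampler for this demo (hypothetical, not from the package)."""
    return image_paths[:target_count]


dataset = {
    "deer": [f"deer/{i}.jpg" for i in range(6)],
    "fox": ["fox/0.jpg", "fox/1.jpg"],
}
balanced = cap_per_class(dataset, cap=3, sampler=first_n)
print({name: len(paths) for name, paths in balanced.items()})
```

With the package installed, passing `sampler=sample_diverse` should slot in the same way, since the basic usage above calls it with `image_paths` and `target_count`.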

Parameters

| Parameter | Default | Description |
|---|---|---|
| image_paths | Required | List of image file paths (str or Path objects) |
| target_count | Required | Exact number of images to select |
| hash_size | 8 | Perceptual hash size (8 recommended) |
| n_workers | 4 | Number of parallel workers for hash computation |
| show_progress | True | Display progress bars during processing |
| random_seed | 42 | Random seed for reproducible bucket selection |
| show_summary | True | Print bucket statistics and distribution summary |
| save_distribution | None | Path to save distribution chart as PNG (creates directories if needed) |
| save_thumbnails | None | Path to save thumbnail grids as PNG (creates directories if needed) |
| image_loading_errors | "raise" | How to handle image loading errors: "raise" (fail immediately) or "skip" (continue with remaining images) |
| return_indices | False | Return 0-based indices instead of paths (refers to original input list order) |
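
Because `return_indices` refers to positions in the original input list, the indices can be used to subset any list that runs parallel to it, such as labels. A minimal sketch, assuming the result is a plain list of integers: `take_by_index` is a hypothetical helper, and `selected_idx` is a hand-written stand-in for what `sample_diverse(..., return_indices=True)` would return.

```python
def take_by_index(items, indices):
    """Return the items at the given 0-based positions, in the given order."""
    return [items[i] for i in indices]


paths = ["a/img1.jpg", "a/img2.jpg", "b/img3.jpg", "b/img4.jpg"]
labels = ["deer", "deer", "fox", "fox"]

# Stand-in values (hypothetical), in place of a real call to
# sample_diverse(image_paths=paths, target_count=2, return_indices=True).
selected_idx = [0, 3]

selected_paths = take_by_index(paths, selected_idx)
selected_labels = take_by_index(labels, selected_idx)
print(list(zip(selected_paths, selected_labels)))
```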

How it works

The algorithm balances speed and diversity in four steps, followed by two optional reporting steps:

  1. Feature extraction
    Each image is reduced to a compact set of visual features:

    • DHash (2 bits) → structure/edges
    • AHash (1 bit) → brightness/contrast
    • Color variance (1 bit) → grayscale vs. color
    • Overall brightness (1 bit) → dark vs. bright
    • Average color (2 bits) → dominant scene color (red/green/blue/neutral)
  2. Bucket grouping
    Images are sorted into "similarity buckets" based on the visual features extracted at step 1.

    • At most 128 buckets are possible (4×2×2×2×4 feature splits).
    • In practice, most datasets produce only a few dozen buckets, depending on their diversity.
  3. Selection across buckets

    • Ensure at least one image per bucket (diversity first)
    • Fill the remaining quota proportionally from larger buckets
  4. Within-bucket selection

    • Images within each bucket are kept in their natural folder order to preserve any inherent structure in the dataset (e.g., locations, events, sequences, etc.)
    • Images are then sampled at regular intervals (every stride-th image) until the bucket's share of the target count is reached, ensuring a systematic spread across the bucket
  5. Save distribution chart (optional)

    • Vertical bar chart of kept vs. excluded images per bucket
  6. Save thumbnail grids (optional)
    • 5×5 grids from each bucket, for quick visual review
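
Steps 2 to 4 can be sketched in plain Python. This is an illustration under stated assumptions, not the package's actual implementation: the names `RADICES`, `bucket_id`, `allocate`, `stride_sample`, and `sample_sketch` are hypothetical, the features are assumed to be precomputed as small integers matching the bit widths in step 1, and the rounding rule (round down, then top up the largest buckets first) is one possible choice among several.

```python
from collections import defaultdict

# Assumed feature layout, mirroring step 1:
# (dhash 0-3, ahash 0-1, color_variance 0-1, brightness 0-1, avg_color 0-3)
RADICES = (4, 2, 2, 2, 4)  # 4*2*2*2*4 = 128 possible buckets


def bucket_id(features):
    """Pack one feature tuple into a single integer in range(128)."""
    key = 0
    for value, radix in zip(features, RADICES):
        if not 0 <= value < radix:
            raise ValueError(f"feature {value} out of range for radix {radix}")
        key = key * radix + value
    return key


def allocate(bucket_sizes, target):
    """Give each bucket one slot, then share the rest proportionally."""
    sizes = {b: n for b, n in bucket_sizes.items() if n > 0}
    if target >= sum(sizes.values()):
        return dict(sizes)
    by_size = sorted(sizes, key=lambda b: (-sizes[b], b))
    if target <= len(sizes):
        # Fewer slots than buckets: one slot each for the largest buckets.
        return {b: 1 for b in by_size[:target]}
    quota = {b: 1 for b in sizes}
    remaining = target - len(sizes)
    spare = {b: sizes[b] - 1 for b in sizes}
    spare_total = sum(spare.values())
    for b in sizes:
        # Proportional share of the remaining slots, rounded down.
        quota[b] += remaining * spare[b] // spare_total
    leftover = target - sum(quota.values())
    for b in by_size:
        # Slots lost to rounding go to the largest buckets that have room.
        if leftover == 0:
            break
        if quota[b] < sizes[b]:
            quota[b] += 1
            leftover -= 1
    return quota


def stride_sample(items, k):
    """Pick k items at evenly spaced positions, preserving input order."""
    n = len(items)
    if k >= n:
        return list(items)
    return [items[i * n // k] for i in range(k)]


def sample_sketch(paths, features, target):
    """Bucket the paths, allocate slots per bucket, then stride-sample."""
    buckets = defaultdict(list)
    for path, feat in zip(paths, features):
        buckets[bucket_id(feat)].append(path)  # input order is preserved
    quota = allocate({b: len(v) for b, v in buckets.items()}, target)
    selected = []
    for b in sorted(quota):
        selected.extend(stride_sample(buckets[b], quota[b]))
    return selected


paths = [f"img{i}.jpg" for i in range(10)]
features = [(0, 0, 0, 0, 0)] * 7 + [(1, 0, 1, 0, 2)] * 2 + [(3, 1, 1, 1, 3)]
picked = sample_sketch(paths, features, target=5)
print(picked)
```

In this toy input the large bucket contributes three evenly spaced images while the two small buckets each keep one, which is the "diversity first, then proportional" behavior described in step 3.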

License

MIT License → see LICENSE file.

