
A professional tool for cleaning duplicate or near-duplicate image frames using perceptual hashing and embeddings.


CleanFrames

CleanFrames identifies and removes duplicate or near-duplicate image frames from large datasets. It combines three techniques, each aimed at a different kind of redundancy:

  • MD5 hashing for exact byte-level duplicates.
  • Perceptual hashing for visually similar images.
  • Deep embeddings for semantic redundancy detection.

This combination allows CleanFrames to handle a wide range of duplicate detection scenarios, from exact copies to subtle semantic similarities.
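
To make the first two techniques concrete, here is a minimal standard-library sketch. It is not CleanFrames' own implementation: the function names (`md5_of_file`, `group_exact_duplicates`, `average_hash`, `hamming_distance`) are hypothetical, and average hashing is used only as one simple example of a perceptual hash. The source does not say which perceptual hash CleanFrames uses.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(paths):
    """Map each digest shared by two or more files to the sorted list of those files."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(str(path))
    return {d: sorted(files) for d, files in groups.items() if len(files) > 1}


def average_hash(gray_rows):
    """Average hash of a small grayscale grid (rows of 0-255 values).

    Each pixel contributes one bit: 1 if it is brighter than the mean, else 0.
    """
    pixels = [value for row in gray_rows for value in row]
    mean = sum(pixels) / len(pixels)
    return [1 if value > mean else 0 for value in pixels]


def hamming_distance(bits_a, bits_b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))
```

Byte-identical files always share an MD5 digest, so the first pass is cheap and exact. A perceptual hash instead tolerates small pixel changes: two grids whose bright and dark regions match produce the same bits, and a small Hamming distance indicates visually similar images.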

Features

  • Supports multiple embedding models: Swin, CLIP, DINO, and ResNet.
  • Flexible usage modes: clean images by path only, generate embeddings on the fly, or supply custom embeddings.
  • Device support for CPU, GPU, and Apple MPS for accelerated processing.
  • Outputs cleaned images into organized folders for easy inspection.
  • Provides detailed results including removed duplicates and retained images.

Installation

Install CleanFrames with pip:

pip install cleanframes

Usage

1. Basic Usage: Clean by Path Only

CleanFrames can process a folder of images directly, automatically computing embeddings using the default model (Swin) and removing duplicates.

from clean_frames import CleanFrame

cleaner = CleanFrame(device='cuda')  # or 'cpu', 'mps' depending on your hardware
input_folder = "path/to/images"

# Clean images by path only
cleaner.cleanframes(input_folder)

This creates the following inside the input folder:

  • cleaned - contains unique images after cleaning.
  • duplicates - contains removed duplicate images.
  • results.json - detailed report of the cleaning process.

2. Generate Embeddings and Clean

You can generate embeddings separately and then clean based on those embeddings. This is useful if you want to inspect or reuse embeddings.

from clean_frames import CleanFrame

cleaner = CleanFrame(device='cuda')
input_folder = "path/to/images"

# Generate embeddings using Swin model
embeddings, paths = cleaner.SwinEmbedding(input_folder)

# Clean images using generated embeddings
cleaner.cleanframes(paths, embeddings_list=[("swin", embeddings)], threshold=0.95)

3. Clean Using Custom Embeddings

If you have your own embeddings (e.g., from other models or precomputed vectors), you can supply them directly.

from clean_frames import CleanFrame
import numpy as np

cleaner = CleanFrame(device='cpu')
input_folder = "path/to/images"

# Example: Load or create custom embeddings as a numpy array
custom_embeddings = np.load("custom_embeddings.npy")
image_paths = [...]  # list of image file paths corresponding to embeddings

# Clean using custom embeddings with a specified model name
cleaner.cleanframes(image_paths, embeddings_list=[("custom_model", custom_embeddings)], threshold=0.9)
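
Because custom embeddings are matched to images by position, it can help to check that the two line up before cleaning. The helper below is a hypothetical sketch (`check_alignment` is not part of CleanFrames); it relies only on `len()`, which returns the number of rows for a 2-D NumPy array as well as for a list of vectors.

```python
def check_alignment(image_paths, embeddings):
    """Return the item count, or raise ValueError if paths and embeddings do not line up."""
    if len(image_paths) != len(embeddings):
        raise ValueError(
            f"{len(image_paths)} image paths but {len(embeddings)} embedding rows"
        )
    # A repeated path would pair one image with two different embedding rows.
    if len(set(map(str, image_paths))) != len(image_paths):
        raise ValueError("image_paths contains repeated entries")
    return len(image_paths)
```

Calling `check_alignment(image_paths, custom_embeddings)` before `cleaner.cleanframes(...)` turns a silent mismatch into an explicit error.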

Supported Embedding Models

  • Swin: Hierarchical Vision Transformer for image representation.
  • CLIP: Contrastive Language-Image Pretraining embeddings.
  • DINO: Self-distillation with no labels for visual features.
  • ResNet: Classic convolutional neural network embeddings.

You can generate embeddings with any of these models using the corresponding CleanFrame methods (e.g., cleaner.CLIPEmbedding() or cleaner.DINOEmbedding()).

Device Support

CleanFrames supports multiple devices for accelerated embedding computation:

  • CPU: Default fallback.
  • CUDA GPU: For NVIDIA GPUs.
  • MPS: Apple's Metal Performance Shaders for Macs with Apple Silicon.

Specify your device when initializing CleanFrame:

cleaner = CleanFrame(device='mps')  # or 'cuda', 'cpu'
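
If the same script has to run on different machines, the device string can be chosen at runtime. The sketch below assumes PyTorch is the backend, which the source does not state; `pick_device` is a hypothetical helper, and it falls back to 'cpu' when PyTorch is not installed.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports as available."""
    try:
        import torch  # assumption: PyTorch is the backend; not confirmed by the source
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch releases, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The result can then be passed straight to the constructor, as in CleanFrame(device=pick_device()).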

Output Structure

After cleaning, the tool creates the following inside the input folder or specified path:

  • cleaned/: Contains the filtered set of unique images.
  • duplicates/: Contains images identified as duplicates or near-duplicates.
  • results.json: JSON file summarizing duplicates removed, thresholds used, and other metadata.
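
A quick way to inspect a run is to count the files in each folder and list the top-level keys of the report. This sketch assumes only the three names above (cleaned/, duplicates/, results.json) and makes no assumption about the report's schema; the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_report(output_dir):
    """Load results.json from the given folder and return the parsed JSON."""
    report_path = Path(output_dir) / "results.json"
    with report_path.open(encoding="utf-8") as handle:
        return json.load(handle)


def count_files(folder):
    """Count regular files in a folder; returns 0 if the folder does not exist."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(1 for entry in folder.iterdir() if entry.is_file())


def summarize_output(output_dir):
    """Return file counts for both folders and the report's top-level keys."""
    output_dir = Path(output_dir)
    report = load_report(output_dir)
    return {
        "cleaned": count_files(output_dir / "cleaned"),
        "duplicates": count_files(output_dir / "duplicates"),
        "report_keys": sorted(report) if isinstance(report, dict) else [],
    }
```

For example, `summarize_output("path/to/images")` returns a small dictionary that can be printed or logged after each cleaning run.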

Notes

  • The threshold parameter sets the similarity cutoff for near-duplicate detection (the examples above use 0.9 and 0.95); lower values treat more image pairs as duplicates and therefore remove more images.
  • Combining multiple embedding models can improve detection accuracy.
  • CleanFrames is designed to be scalable and efficient for large image datasets.
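
The effect of the threshold can be illustrated with a small pure-Python sketch. It assumes the threshold is a cosine-similarity cutoff and uses a simple greedy pass; both are assumptions made for illustration, not a description of CleanFrames' internals, and `cosine_similarity` and `greedy_dedup` are hypothetical names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_dedup(embeddings, threshold):
    """Return (kept, removed) index lists.

    An item is removed when its similarity to any already-kept item
    is at least the threshold; otherwise it is kept.
    """
    kept, removed = [], []
    for index, vector in enumerate(embeddings):
        if any(cosine_similarity(vector, embeddings[k]) >= threshold for k in kept):
            removed.append(index)
        else:
            kept.append(index)
    return kept, removed


vectors = [
    [1.0, 0.0],
    [0.99, 0.05],  # almost the same direction as the first vector
    [0.8, 0.6],    # cosine similarity of about 0.8 with the first vector
    [0.0, 1.0],    # orthogonal to the first vector
]

strict = greedy_dedup(vectors, threshold=0.95)  # removes only index 1
loose = greedy_dedup(vectors, threshold=0.75)   # removes indices 1 and 2
```

With the cutoff at 0.95 only the nearly identical vector is dropped; lowering it to 0.75 also drops the moderately similar one, which is why lower thresholds remove more images.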

For more information and advanced options, please refer to the official documentation or GitHub repository.

Download files


Source Distribution

cleanframes-0.2.3.tar.gz (10.4 kB)

Uploaded Source

Built Distribution


cleanframes-0.2.3-py3-none-any.whl (9.1 kB)

Uploaded Python 3

File details

Details for the file cleanframes-0.2.3.tar.gz.

File metadata

  • Download URL: cleanframes-0.2.3.tar.gz
  • Upload date:
  • Size: 10.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for cleanframes-0.2.3.tar.gz
Algorithm Hash digest
SHA256 450d91c3b533d947f7adb3a75d75cf0101e43088ca01684c0514a9fa3cbd9a34
MD5 f25638560ac447ab636aca2efc08e8d0
BLAKE2b-256 38c64419aa5d891ca9219ea50203e5d9ae17adc7dc869dd9bd6bf50efca00b3f


File details

Details for the file cleanframes-0.2.3-py3-none-any.whl.

File metadata

  • Download URL: cleanframes-0.2.3-py3-none-any.whl
  • Upload date:
  • Size: 9.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for cleanframes-0.2.3-py3-none-any.whl
Algorithm Hash digest
SHA256 092a1fc604e16489682cfa7116639d929fc8a173a4fd0645893b78a4209f687d
MD5 a05a4969c4a11494222bc6efb9f40439
BLAKE2b-256 3c7b7d287bdef29e12d8ad0be832ba76894a6b55987c1c50b111313fb09c663b

