A professional tool for cleaning duplicate or near-duplicate image frames using perceptual hashing and embeddings.

CleanFrames

CleanFrames identifies and removes duplicate or near-duplicate image frames from large datasets. It combines three techniques, each aimed at a different kind of redundancy:

  • MD5 hashing for exact byte-level duplicates.
  • Perceptual hashing for visually similar images.
  • Deep embeddings for semantic redundancy detection.

Together these cover a range of duplicate-detection scenarios, from exact copies to subtle semantic similarities.
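
To make the first two techniques concrete, here is a minimal sketch using only the Python standard library. It is not CleanFrames' own code: the function names are hypothetical, and the perceptual hash shown is a plain average hash, chosen for illustration because this description does not say which perceptual hash the package uses. A real pipeline would first downscale each image to a small grayscale grid (for example 8x8) before hashing.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical (same MD5)."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.md5(data).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def average_hash(pixels: list[list[int]]) -> int:
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; a small distance means visually similar images."""
    return bin(a ^ b).count("1")


# Hypothetical inputs: raw file bytes and tiny 2x2 grayscale grids.
files = {"a.png": b"abc", "b.png": b"abc", "c.png": b"abd"}
exact_groups = group_exact_duplicates(files)

original = [[10, 200], [220, 30]]
brighter = [[12, 205], [215, 35]]   # same picture, slightly different exposure
inverted = [[200, 10], [30, 220]]   # different picture

near_distance = hamming_distance(average_hash(original), average_hash(brighter))
far_distance = hamming_distance(average_hash(original), average_hash(inverted))
```

MD5 only catches files that are identical byte for byte, so a re-encoded or slightly brightened copy slips through; the perceptual hash is what catches those.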

Features

  • Supports multiple embedding models: Swin, CLIP, DINO, and ResNet.
  • Flexible usage modes: clean images by path only, generate embeddings on the fly, or supply custom embeddings.
  • Device support for CPU, CUDA GPU, and Apple MPS for accelerated processing.
  • Outputs cleaned images into organized folders for easy inspection.
  • Provides detailed results including removed duplicates and retained images.

Installation

Install CleanFrames with pip:

pip install cleanframes

Usage

1. Basic Usage: Clean by Path Only

CleanFrames can process a folder of images directly, automatically computing embeddings using the default model (Swin) and removing duplicates.

from clean_frames import CleanFrame

cleaner = CleanFrame(device='cuda')  # or 'cpu', 'mps' depending on your hardware
input_folder = "path/to/images"

# Clean images by path only
cleaner.cleanframes(input_folder)

This creates the following inside the input folder:

  • cleaned - contains unique images after cleaning.
  • duplicates - contains removed duplicate images.
  • results.json - detailed report of the cleaning process.

2. Generate Embeddings and Clean

You can generate embeddings separately and then clean based on those embeddings. This is useful if you want to inspect or reuse embeddings.

from clean_frames import CleanFrame

cleaner = CleanFrame(device='cuda')
input_folder = "path/to/images"

# Generate embeddings using Swin model
embeddings, paths = cleaner.SwinEmbedding(input_folder)

# Clean images using generated embeddings
cleaner.cleanframes(paths, embeddings_list=[("swin", embeddings)], threshold=0.95)
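
The sketch below shows one way a similarity threshold can be applied to embeddings. It is an illustration, not CleanFrames' implementation: it assumes the threshold is a cosine-similarity cutoff and uses a simple greedy pass, and `greedy_dedup` is a hypothetical name.

```python
import numpy as np


def greedy_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> tuple[list[int], list[int]]:
    """Split row indices into (kept, removed) using cosine similarity."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    kept: list[int] = []
    removed: list[int] = []
    for i in range(len(unit)):
        # An image is dropped when it is too similar to any image already kept.
        if kept and np.max(unit[kept] @ unit[i]) >= threshold:
            removed.append(i)
        else:
            kept.append(i)
    return kept, removed


# Hypothetical 2-D embeddings: rows 0 and 1 point in almost the same direction.
vectors = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
kept, removed = greedy_dedup(vectors, threshold=0.95)
```

With this greedy scheme the first image in each similar group is the one retained, so the order of `paths` influences which copy survives.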

3. Clean Using Custom Embeddings

If you have your own embeddings (e.g., from other models or precomputed vectors), you can supply them directly.

from clean_frames import CleanFrame
import numpy as np

cleaner = CleanFrame(device='cpu')
input_folder = "path/to/images"

# Example: Load or create custom embeddings as a numpy array
custom_embeddings = np.load("custom_embeddings.npy")
image_paths = [...]  # list of image file paths corresponding to embeddings

# Clean using custom embeddings with a specified model name
cleaner.cleanframes(image_paths, embeddings_list=[("custom_model", custom_embeddings)], threshold=0.9)
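
Because custom embeddings are matched to images by position, it can help to validate them before cleaning. The helper below is a hypothetical sketch; it assumes the array has one row per image, with shape (number of images, embedding dimension), which this description implies but does not state.

```python
import numpy as np


def check_alignment(image_paths: list[str], embeddings: np.ndarray) -> None:
    """Raise ValueError if paths and embedding rows cannot correspond one-to-one."""
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {embeddings.shape}")
    if len(image_paths) != embeddings.shape[0]:
        raise ValueError(
            f"{len(image_paths)} paths but {embeddings.shape[0]} embedding rows"
        )
    if not np.isfinite(embeddings).all():
        raise ValueError("embeddings contain NaN or infinite values")


# Hypothetical data: three paths with matching rows pass silently.
sample_paths = ["a.jpg", "b.jpg", "c.jpg"]
sample_embeddings = np.zeros((3, 4))
check_alignment(sample_paths, sample_embeddings)

# A mismatch in length is reported instead of silently misaligning images.
try:
    check_alignment(sample_paths[:2], sample_embeddings)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```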

Supported Embedding Models

  • Swin: Hierarchical Vision Transformer for image representation.
  • CLIP: Contrastive Language-Image Pretraining embeddings.
  • DINO: Self-distillation with no labels for visual features.
  • ResNet: Classic convolutional neural network embeddings.

You can generate embeddings with any of these models using corresponding methods provided by CleanFrame (e.g., cleaner.CLIPEmbedding(), cleaner.DINOEmbedding(), etc.).

Device Support

CleanFrames supports multiple devices for accelerated embedding computation:

  • CPU: Default fallback.
  • CUDA GPU: For NVIDIA GPUs.
  • MPS: Apple's Metal Performance Shaders for Macs with Apple Silicon.

Specify your device when initializing CleanFrame:

cleaner = CleanFrame(device='mps')  # or 'cuda', 'cpu'
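
If you would rather not hard-code the device, a small helper can choose one at runtime. This sketch assumes PyTorch is the backend doing the embedding computation, which this description does not confirm; `pick_device` is a hypothetical helper, not part of CleanFrames.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate, so fall back to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch releases.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
```

The result can then be passed as `CleanFrame(device=device)`.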

Output Structure

After cleaning, the tool creates the following inside the input folder or specified path:

  • cleaned/: Contains the filtered set of unique images.
  • duplicates/: Contains images identified as duplicates or near-duplicates.
  • results.json: JSON file summarizing duplicates removed, thresholds used, and other metadata.
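
A quick way to inspect a run is to count the files in each folder and look at the report. The sketch below relies only on the folder and file names listed above; the schema of results.json is not documented here, so it reports the top-level keys rather than assuming any particular field. `summarize_output` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_output(root: str) -> dict:
    """Count files in cleaned/ and duplicates/ and list the report's top-level keys."""
    base = Path(root)

    def count_files(folder: Path) -> int:
        if not folder.is_dir():
            return 0
        return sum(1 for p in folder.iterdir() if p.is_file())

    report = None
    report_path = base / "results.json"
    if report_path.is_file():
        report = json.loads(report_path.read_text(encoding="utf-8"))

    return {
        "cleaned": count_files(base / "cleaned"),
        "duplicates": count_files(base / "duplicates"),
        "report_keys": sorted(report) if isinstance(report, dict) else [],
    }
```

Calling `summarize_output("path/to/images")` after a run returns a small dictionary that is easy to print or log.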

Notes

  • The threshold parameter sets the similarity cutoff for near-duplicate detection: images whose similarity meets the cutoff are treated as duplicates, so lower values remove more images.
  • Combining multiple embedding models can improve detection accuracy.
  • CleanFrames is designed to be scalable and efficient for large image datasets.
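
The first two notes can be illustrated with a short sketch. How CleanFrames actually combines models is not described here, so the example assumes one simple rule, averaging cosine similarity across models, and uses the same `(name, array)` pairs as `embeddings_list`. `flagged_pairs` is a hypothetical function.

```python
import numpy as np


def cosine_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of one embedding array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def flagged_pairs(embeddings_list, threshold: float) -> list[tuple[int, int]]:
    """Pairs (i, j) with i < j whose mean similarity across models meets the threshold."""
    sims = [cosine_matrix(emb) for _name, emb in embeddings_list]
    mean_sim = np.mean(sims, axis=0)
    n = mean_sim.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if mean_sim[i, j] >= threshold]


# Hypothetical embeddings for three images from two models.
models = [
    ("model_a", np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])),
    ("model_b", np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])),
]

strict = flagged_pairs(models, threshold=0.95)   # nothing flagged
medium = flagged_pairs(models, threshold=0.85)   # images 0 and 1 flagged
loose = flagged_pairs(models, threshold=0.25)    # images 1 and 2 flagged as well
```

Lowering the threshold only ever adds pairs, which is why a low value removes more images and risks discarding frames that are merely related rather than redundant.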

For more information and advanced options, please refer to the official documentation or GitHub repository.
