A professional tool for cleaning duplicate or near-duplicate image frames using perceptual hashing and embeddings.

CleanFrames

CleanFrames identifies and removes duplicate or near-duplicate image frames from large datasets. It combines three techniques, each targeting a different kind of redundancy:

  • MD5 hashing for exact byte-level duplicates.
  • Perceptual hashing for visually similar images.
  • Deep embeddings for semantic redundancy detection.

This combination allows CleanFrames to handle a wide range of duplicate detection scenarios, from exact copies to subtle semantic similarities.
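
To make the first two tiers concrete, here is a minimal standard-library sketch. It is not CleanFrames' own code: the function names are hypothetical, and the perceptual hash shown is a plain average hash applied to a grayscale grid that is assumed to be already downscaled (for example to 8x8). The hashing that CleanFrames actually uses may differ.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(paths):
    """Group files by MD5 digest; return only groups with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        groups[md5_of_file(p)].append(str(p))
    return [sorted(g) for g in groups.values() if len(g) > 1]


def average_hash(gray):
    """Average hash of a small grayscale grid (list of rows of ints).

    Each pixel contributes one bit: 1 if brighter than the mean, else 0.
    """
    pixels = [v for row in gray for v in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for v in pixels:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")
```

Byte-identical files share a digest, so they fall into the same group. Visually similar frames usually differ at the byte level but produce average hashes with a small Hamming distance, which is why a separate perceptual tier is needed.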

Features

  • Supports multiple embedding models: Swin, CLIP, DINO, and ResNet.
  • Flexible usage modes: clean images by path only, generate embeddings on the fly, or supply custom embeddings.
  • Device support for CPU, CUDA GPUs, and Apple MPS for accelerated processing.
  • Outputs cleaned images into organized folders for easy inspection.
  • Provides detailed results including removed duplicates and retained images.

Installation

Install CleanFrames via pip:

pip install cleanframes

Usage

1. Basic Usage: Clean by Path Only

CleanFrames can process a folder of images directly, automatically computing embeddings using the default model (Swin) and removing duplicates.

from clean_frames import CleanFrame

cleaner = CleanFrame(device='cuda')  # or 'cpu', 'mps' depending on your hardware
input_folder = "path/to/images"

# Clean images by path only
cleaner.cleanframes(input_folder)

This creates the following inside the input folder:

  • cleaned - contains unique images after cleaning.
  • duplicates - contains removed duplicate images.
  • results.json - detailed report of the cleaning process.

2. Generate Embeddings and Clean

You can generate embeddings separately and then clean based on those embeddings. This is useful if you want to inspect or reuse embeddings.

from clean_frames import CleanFrame

cleaner = CleanFrame(device='cuda')
input_folder = "path/to/images"

# Generate embeddings using Swin model
embeddings, paths = cleaner.SwinEmbedding(input_folder)

# Clean images using generated embeddings
cleaner.cleanframes(paths, embeddings_list=[("swin", embeddings)], threshold=0.95)

3. Clean Using Custom Embeddings

If you have your own embeddings (e.g., from other models or precomputed vectors), you can supply them directly.

from clean_frames import CleanFrame
import numpy as np

cleaner = CleanFrame(device='cpu')
input_folder = "path/to/images"

# Example: Load or create custom embeddings as a numpy array
custom_embeddings = np.load("custom_embeddings.npy")
image_paths = [...]  # list of image file paths corresponding to embeddings

# Clean using custom embeddings with a specified model name
cleaner.cleanframes(image_paths, embeddings_list=[("custom_model", custom_embeddings)], threshold=0.9)
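
Because custom embeddings must line up one-to-one with the image paths, it can help to check the pairing before cleaning. The helper below is a hypothetical sketch, not part of CleanFrames. It assumes one embedding row per image and works with a NumPy array or a list of lists.

```python
def check_alignment(image_paths, embeddings):
    """Verify one embedding row per path and a consistent vector length.

    Returns (number_of_rows, embedding_dimension) or raises ValueError.
    """
    n_paths, n_rows = len(image_paths), len(embeddings)
    if n_paths != n_rows:
        raise ValueError(
            f"{n_paths} image paths but {n_rows} embedding rows"
        )
    dims = {len(row) for row in embeddings}
    if len(dims) > 1:
        raise ValueError(f"inconsistent embedding sizes: {sorted(dims)}")
    return n_rows, (dims.pop() if dims else 0)
```

Calling check_alignment(image_paths, custom_embeddings) before cleaner.cleanframes(...) surfaces a mismatch early instead of pairing vectors with the wrong files.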

Supported Embedding Models

  • Swin: Hierarchical Vision Transformer for image representation.
  • CLIP: Contrastive Language-Image Pretraining embeddings.
  • DINO: Self-distillation with no labels for visual features.
  • ResNet: Classic convolutional neural network embeddings.

You can generate embeddings with any of these models using corresponding methods provided by CleanFrame (e.g., cleaner.CLIPEmbedding(), cleaner.DINOEmbedding(), etc.).

Device Support

CleanFrames supports multiple devices for accelerated embedding computation:

  • CPU: Default fallback.
  • CUDA GPU: For NVIDIA GPUs.
  • MPS: Apple's Metal Performance Shaders for Macs with Apple Silicon.

Specify your device when initializing CleanFrame:

cleaner = CleanFrame(device='mps')  # or 'cuda', 'cpu'
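
If the same script runs on different machines, the device string can be chosen at runtime. The helper below is a hypothetical sketch that encodes one possible preference order (CUDA, then MPS, then CPU). It is not part of CleanFrames, and it takes the availability flags as arguments so that it has no dependencies.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return 'cuda', 'mps', or 'cpu', preferring accelerators when present."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

Assuming PyTorch is installed, the flags could come from torch.cuda.is_available() and torch.backends.mps.is_available(), and the result can be passed as CleanFrame(device=...).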

Output Structure

After cleaning, the tool creates the following inside the input folder or specified path:

  • cleaned/: Contains the filtered set of unique images.
  • duplicates/: Contains images identified as duplicates or near-duplicates.
  • results.json: JSON file summarizing duplicates removed, thresholds used, and other metadata.
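
A quick way to inspect a finished run is to count the files in each folder and look at the report. The sketch below relies only on the three names listed above. The schema of results.json is not documented here, so the code reports only its top-level keys rather than assuming specific fields. The function name is hypothetical.

```python
import json
from pathlib import Path


def summarize_output(root):
    """Count files in cleaned/ and duplicates/ and list results.json keys."""
    root = Path(root)

    def count_files(subfolder):
        folder = root / subfolder
        if not folder.is_dir():
            return 0
        return sum(1 for p in folder.iterdir() if p.is_file())

    report_path = root / "results.json"
    report = json.loads(report_path.read_text()) if report_path.is_file() else None
    return {
        "cleaned": count_files("cleaned"),
        "duplicates": count_files("duplicates"),
        # Only top-level keys are reported; the schema is not assumed.
        "report_keys": sorted(report) if isinstance(report, dict) else [],
    }
```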

Notes

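  • The threshold parameter controls sensitivity for near-duplicate detection. It acts as a similarity cutoff (the examples above use 0.9 and 0.95), so lower values flag more pairs as duplicates and remove more images.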
  • Combining multiple embedding models can improve detection accuracy.
  • CleanFrames is designed to be scalable and efficient for large image datasets.
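
To illustrate why a lower threshold removes more images, here is a minimal pure-Python sketch of greedy deduplication. It assumes the threshold is compared against cosine similarity between embeddings, and the function names are hypothetical. CleanFrames' actual comparison and selection strategy may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_dedup(embeddings, threshold):
    """Keep an item only if it is below the threshold against every kept item.

    Returns (kept_indices, removed_indices) in input order.
    """
    kept, removed = [], []
    for i, emb in enumerate(embeddings):
        if any(cosine(emb, embeddings[k]) >= threshold for k in kept):
            removed.append(i)
        else:
            kept.append(i)
    return kept, removed
```

With the vectors [1, 0], [0.9, 0.1], and [0.6, 0.8], a threshold of 0.95 removes only the second vector, while a threshold of 0.5 also removes the third. This loop compares each item against all kept items, so it is quadratic in the worst case and meant only as an illustration.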

For more information and advanced options, please refer to the official documentation or GitHub repository.
