
A professional tool for cleaning duplicate or near-duplicate image frames using perceptual hashing and embeddings.


CleanFrames

CleanFrames identifies and removes duplicate or near-duplicate image frames from large datasets. It combines three techniques, each aimed at a different kind of redundancy:

  • MD5 hashing for exact byte-level duplicates.
  • Perceptual hashing for visually similar images.
  • Deep embeddings for semantic redundancy detection.

This combination allows CleanFrames to handle a wide range of duplicate detection scenarios, from exact copies to subtle semantic similarities.
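
The first two techniques can be illustrated with the standard library alone. The sketch below is not CleanFrames' implementation: `group_exact_duplicates`, `average_hash`, and `hamming` are hypothetical helpers, and the average hash assumes the image has already been downscaled to a small grayscale grid (real perceptual hashes also handle decoding and resizing).

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_exact_duplicates(folder):
    """Group files in `folder` by the MD5 digest of their bytes."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    # Keep only digests shared by more than one file.
    return {d: names for d, names in groups.items() if len(names) > 1}


def average_hash(pixels):
    """Average hash of a small grayscale grid (list of rows of ints).

    Each bit is 1 when the pixel is brighter than the grid's mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")
```

Byte-identical files share an MD5 digest, while visually similar images tend to produce perceptual hashes that differ in only a few bits, so a small Hamming distance suggests a near-duplicate.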

Features

  • Supports multiple embedding models: Swin, CLIP, DINO, and ResNet.
  • Flexible usage modes: clean images by path only, generate embeddings on the fly, or supply custom embeddings.
  • Runs on CPU, CUDA GPU, or Apple MPS.
  • Outputs cleaned images into organized folders for easy inspection.
  • Provides detailed results including removed duplicates and retained images.

Installation

Install CleanFrames via pip:

pip install cleanframes

Usage

1. Basic Usage: Clean by Path Only

CleanFrames can process a folder of images directly, automatically computing embeddings using the default model (Swin) and removing duplicates.

from clean_frames import CleanFrame

cleaner = CleanFrame(device='cuda')  # or 'cpu', 'mps' depending on your hardware
input_folder = "path/to/images"

# Clean images by path only
cleaner.cleanframes(input_folder)

This creates the following inside the input folder:

  • cleaned - contains unique images after cleaning.
  • duplicates - contains removed duplicate images.
  • results.json - detailed report of the cleaning process.

2. Generate Embeddings and Clean

You can generate embeddings separately and then clean based on those embeddings. This is useful if you want to inspect or reuse embeddings.

from clean_frames import CleanFrame

cleaner = CleanFrame(device='cuda')
input_folder = "path/to/images"

# Generate embeddings using the Swin model
embeddings, paths = cleaner.SwinEmbedding(input_folder)

# Clean images using generated embeddings
cleaner.cleanframes(paths, embeddings_list=[("swin", embeddings)], threshold=0.95)
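
To build intuition for what a threshold such as 0.95 means, here is a minimal sketch of threshold-based deduplication over embeddings. It assumes cosine similarity and a greedy keep-first strategy. Neither is confirmed as CleanFrames' internal method, and `greedy_dedup` is a hypothetical helper.

```python
import numpy as np


def greedy_dedup(embeddings, threshold=0.95):
    """Return (kept, removed) index lists.

    An item is removed when its cosine similarity to any already-kept
    item is at or above `threshold`; otherwise it is kept.
    """
    vectors = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # avoid division by zero

    kept, removed = [], []
    for i in range(len(unit)):
        # Compare against every item kept so far.
        if kept and np.max(unit[kept] @ unit[i]) >= threshold:
            removed.append(i)
        else:
            kept.append(i)
    return kept, removed
```

Under these assumptions, two nearly parallel vectors collapse into one at a threshold of 0.95, while a stricter value such as 0.9999 keeps both.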

3. Clean Using Custom Embeddings

If you have your own embeddings (e.g., from other models or precomputed vectors), you can supply them directly.

from clean_frames import CleanFrame
import numpy as np

cleaner = CleanFrame(device='cpu')
input_folder = "path/to/images"

# Example: Load or create custom embeddings as a numpy array
custom_embeddings = np.load("custom_embeddings.npy")
image_paths = [...]  # list of image file paths corresponding to embeddings

# Clean using custom embeddings with a specified model name
cleaner.cleanframes(image_paths, embeddings_list=[("custom_model", custom_embeddings)], threshold=0.9)
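
Because custom embeddings are matched to images by position, it can help to validate them before cleaning. The sketch below assumes one embedding row per image path in a 2-D array of shape (number of images, embedding dimension). That layout is an assumption, not something the package documents here, and `check_alignment` is a hypothetical helper.

```python
import numpy as np


def check_alignment(image_paths, embeddings):
    """Raise ValueError unless there is exactly one embedding row per path."""
    embeddings = np.asarray(embeddings)
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {embeddings.shape}")
    if len(image_paths) != embeddings.shape[0]:
        raise ValueError(
            f"{len(image_paths)} paths but {embeddings.shape[0]} embedding rows"
        )
    if not np.isfinite(embeddings).all():
        raise ValueError("embeddings contain NaN or infinite values")
    return embeddings
```

Calling `check_alignment(image_paths, custom_embeddings)` before `cleaner.cleanframes(...)` surfaces a mismatch early instead of silently pairing vectors with the wrong files.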

Supported Embedding Models

  • Swin: Hierarchical Vision Transformer for image representation.
  • CLIP: Contrastive Language-Image Pretraining embeddings.
  • DINO: Self-distillation with no labels for visual features.
  • ResNet: Classic convolutional neural network embeddings.

You can generate embeddings with any of these models using the corresponding CleanFrame methods (e.g., cleaner.CLIPEmbedding() or cleaner.DINOEmbedding()).
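
Since embeddings_list accepts (name, embeddings) pairs, several models could be collected into one list. The sketch below assumes every embedding method takes the input folder and returns (embeddings, paths) the way SwinEmbedding does in the usage examples; that is not confirmed for the other models. `build_embeddings_list` is a hypothetical helper, and labels other than "swin" are illustrative.

```python
def build_embeddings_list(cleaner, input_folder, methods):
    """Call each named embedding method and collect (label, embeddings) pairs.

    `methods` maps a label such as "swin" to a method name such as
    "SwinEmbedding". Assumes every method returns (embeddings, paths).
    """
    embeddings_list = []
    reference_paths = None
    for label, method_name in methods.items():
        embeddings, paths = getattr(cleaner, method_name)(input_folder)
        if reference_paths is None:
            reference_paths = list(paths)
        elif list(paths) != reference_paths:
            # Embeddings from different models must describe the same images.
            raise ValueError(f"{method_name} returned paths in a different order")
        embeddings_list.append((label, embeddings))
    return reference_paths, embeddings_list


# Possible usage (requires cleanframes and real images):
# paths, embeddings_list = build_embeddings_list(
#     cleaner, "path/to/images", {"swin": "SwinEmbedding", "clip": "CLIPEmbedding"}
# )
# cleaner.cleanframes(paths, embeddings_list=embeddings_list, threshold=0.95)
```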

Device Support

CleanFrames supports multiple devices for accelerated embedding computation:

  • CPU: Default fallback.
  • CUDA GPU: For NVIDIA GPUs.
  • MPS: Apple's Metal Performance Shaders for Macs with Apple Silicon.

Specify your device when initializing CleanFrame:

cleaner = CleanFrame(device='mps')  # or 'cuda', 'cpu'
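
If the same script runs on different machines, the device string can be chosen at runtime. The sketch below assumes PyTorch is installed and falls back to 'cpu' otherwise. That CleanFrames uses PyTorch internally is an assumption based on the device names, and `pick_device` is a hypothetical helper.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' based on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available, so nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


# Possible usage:
# cleaner = CleanFrame(device=pick_device())
```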

Output Structure

After cleaning, the tool creates the following inside the input folder or specified path:

  • cleaned/: Contains the filtered set of unique images.
  • duplicates/: Contains images identified as duplicates or near-duplicates.
  • results.json: JSON file summarizing duplicates removed, thresholds used, and other metadata.
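
A quick way to inspect a run is to count the files in each folder and load the report. The sketch below assumes only the layout listed above and makes no assumption about the keys inside results.json; `summarize_output` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_output(root):
    """Count files in cleaned/ and duplicates/ and load results.json if present."""
    root = Path(root)

    def count_files(name):
        folder = root / name
        if not folder.is_dir():
            return 0
        return sum(1 for p in folder.iterdir() if p.is_file())

    report_path = root / "results.json"
    report = None
    if report_path.is_file():
        report = json.loads(report_path.read_text(encoding="utf-8"))

    return {
        "cleaned": count_files("cleaned"),
        "duplicates": count_files("duplicates"),
        "report": report,
    }
```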

Notes

  • The threshold parameter sets the similarity cutoff for near-duplicate detection (the examples above use 0.95 and 0.9); lower values flag more images as duplicates and therefore remove more.
  • Combining multiple embedding models can improve detection accuracy.
  • CleanFrames is designed to be scalable and efficient for large image datasets.

For more information and advanced options, please refer to the official documentation or GitHub repository.
