A professional tool for cleaning duplicate or near-duplicate image frames using perceptual hashing and embeddings.

CleanFrames

Overview

CleanFrames identifies and removes duplicate or near-duplicate images from large datasets. It combines exact and perceptual duplicate detection with semantic similarity analysis over deep embeddings from several models, groups similar images by clustering, and provides visualization and tabulated reports for reviewing the results.

Features

  • Multi-model embedding support: Swin, CLIP, DINO, ResNet.
  • Exact duplicate detection using MD5 hashing.
  • Semantic similarity detection with deep embeddings and clustering.
  • Flexible cleaning modes: path-only, embedding-based, or custom embeddings.
  • Clustering to group similar images and identify duplicates.
  • Visualization tools for inspecting clusters and embeddings.
  • Detailed tabulated reports with removed duplicates, retained images, and thresholds.
  • Device support for CPU, CUDA GPU, and Apple MPS.
  • Efficient caching system to store and reuse embeddings for faster processing.
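To make the exact-duplicate feature concrete, here is a minimal standalone sketch of MD5-based grouping using only the Python standard library. It is not CleanFrames' own code; the helper names `md5_of_file` and `find_exact_duplicates` are hypothetical and chosen for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(paths):
    """Group paths by MD5 digest and return only groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(Path(path))
    return [group for group in groups.values() if len(group) > 1]
```

Files that are byte-for-byte identical share a digest, so each returned group holds one original plus its exact copies. Re-encoded or resized images hash differently, which is where the embedding-based modes come in.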

Installation

Install CleanFrames with pip:

pip install cleanframes

Usage

Basic Cleaning by Path

CleanFrames can process a folder of images, compute embeddings using the default Swin model, and remove duplicates.

from cleanframes import CleanFrame

cleaner = CleanFrame(device='cuda')  # or 'cpu', 'mps'
input_folder = "path/to/images"

cleaner.cleanframe(input_folder)

This creates an output folder inside the specified directory (default: frames_cleaned/) containing the unique images after cleaning.

A detailed tabulated report is printed to the console summarizing the cleaning results.

Generate Embeddings and Clean

Generate embeddings separately and then clean based on those embeddings:

from cleanframes import CleanFrame

cleaner = CleanFrame(device='cuda')
input_folder = "path/to/images"

embeddings, paths = cleaner.SwinEmbedding(input_folder)

cleaner.cleanframe(paths, embeddings_list=[("swin", embeddings)], threshold=0.95)
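The sketch below illustrates one way a similarity threshold such as 0.95 can separate kept images from near-duplicates. It assumes the threshold is compared against cosine similarity and uses a greedy first-seen-wins rule; neither is confirmed by the project description, and `split_by_threshold` is a hypothetical helper rather than part of CleanFrames.

```python
import numpy as np


def split_by_threshold(embeddings, threshold=0.95):
    """Return (kept, dropped) row indices using greedy cosine-similarity matching."""
    vectors = np.asarray(embeddings, dtype=np.float64)
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)

    kept, dropped = [], []
    for index, vector in enumerate(unit):
        # Drop a row if it is too similar to any row that was already kept.
        if kept and np.max(unit[kept] @ vector) >= threshold:
            dropped.append(index)
        else:
            kept.append(index)
    return kept, dropped
```

Raising the threshold toward 1.0 removes only very close matches, while lowering it removes more aggressively, so it is worth inspecting the results before deleting anything.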

Clean Using Custom Embeddings

Supply your own embeddings (e.g., precomputed vectors) for cleaning:

from cleanframes import CleanFrame
import numpy as np

cleaner = CleanFrame(device='cpu')
image_paths = [...]  # list of image file paths
custom_embeddings = np.load("custom_embeddings.npy")

cleaner.cleanframe(image_paths, embeddings_list=[("custom_model", custom_embeddings)], threshold=0.9)
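Before passing precomputed vectors, it can help to confirm that they line up with the image paths. The check below assumes CleanFrames expects a 2-D array whose row i corresponds to `image_paths[i]`; that layout is an assumption, and `check_alignment` is a hypothetical helper, not a CleanFrames function.

```python
import numpy as np


def check_alignment(image_paths, embeddings):
    """Raise ValueError unless embeddings is 2-D with one row per image path."""
    array = np.asarray(embeddings)
    if array.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {array.shape}")
    if array.shape[0] != len(image_paths):
        raise ValueError(
            f"{len(image_paths)} paths but {array.shape[0]} embedding rows"
        )
    return array
```

Calling `check_alignment(image_paths, custom_embeddings)` right after `np.load` surfaces a mismatch early instead of partway through cleaning.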

Clustering & Visualization

CleanFrames clusters images by their embeddings so that duplicates and near-duplicates fall into the same group.
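The project description does not say which clustering algorithm is used. As an illustration only, the sketch below groups images by linking every pair whose cosine similarity meets a threshold and taking connected components with a union-find structure. The function name `similarity_groups` and the choice of algorithm are assumptions.

```python
import numpy as np


def similarity_groups(embeddings, threshold=0.9):
    """Group row indices that are linked by cosine similarity >= threshold."""
    vectors = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T

    parent = list(range(len(unit)))

    def find(node):
        # Follow parent links to the root, halving the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Only the strict upper triangle is needed: each pair once, no self-pairs.
    for i, j in zip(*np.nonzero(np.triu(similarity >= threshold, k=1))):
        root_i, root_j = find(int(i)), find(int(j))
        if root_i != root_j:
            parent[root_j] = root_i

    groups = {}
    for index in range(len(unit)):
        groups.setdefault(find(index), []).append(index)
    return sorted(groups.values())
```

Each returned list is one candidate duplicate group. Because links chain, two images in a group may be less similar to each other than the threshold if a third image connects them.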

You can also visualize clusters and embeddings to inspect dataset structure:

cleaner.visualize_clusters(embeddings, image_paths)

This helps in understanding similarity groups and verifying cleaning results.

Report

After cleaning, CleanFrames prints a comprehensive tabulated report including:

  • Number of duplicates removed.
  • Images retained.
  • Threshold values used.
  • Embedding models applied.
  • Cluster information.

This report facilitates audit and reproducibility of dataset cleaning.

Supported Models

  • Swin: Hierarchical Vision Transformer for image representation.
  • CLIP: Contrastive Language-Image Pretraining embeddings.
  • DINO: Self-distillation with no labels for visual features.
  • ResNet: Classic convolutional neural network embeddings.

Generate embeddings with corresponding methods like cleaner.CLIPEmbedding(), cleaner.DINOEmbedding(), etc.

Device Support

CleanFrames supports multiple devices for accelerated embedding computation:

  • CPU: Default fallback.
  • CUDA GPU: For NVIDIA GPUs.
  • MPS: Apple's Metal Performance Shaders for Macs with Apple Silicon.

Specify device during initialization:

cleaner = CleanFrame(device='mps')  # or 'cuda', 'cpu'
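If the same script runs on different machines, the device string can be chosen at runtime. The sketch below assumes PyTorch is the backend that provides CUDA and MPS support, which the project description does not state explicitly; `pick_device` is a hypothetical helper.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch, fall back to the CPU label.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch releases, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The result can then be passed straight to the constructor, for example `CleanFrame(device=pick_device())`.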

Caching System

CleanFrames includes a caching mechanism to save and load embeddings, clusters, and visualizations, reducing redundant computations on repeated runs:

  • Automatically caches .npz files per folder and model inside the .cleanframe_cache/ directory.
  • Loads cached embeddings and clusters to speed up cleaning.
  • Manages cache files for efficient storage and reuse.

Example:

embeddings, paths = cleaner.SwinEmbedding(input_folder, use_cache=True)
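To show the idea behind a per-folder, per-model `.npz` cache, here is a standalone sketch. Only the `.cleanframe_cache/` directory name and the `.npz` format come from the description above; the file naming scheme, the stored keys, and the helpers `cache_file` and `load_or_compute` are hypothetical and may differ from what CleanFrames actually does.

```python
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = ".cleanframe_cache"  # Directory name taken from the text above.


def cache_file(folder, model_name, cache_root="."):
    """Build a per-folder, per-model cache path (hypothetical naming scheme)."""
    resolved = str(Path(folder).resolve()).encode("utf-8")
    folder_key = hashlib.md5(resolved).hexdigest()[:12]
    return Path(cache_root) / CACHE_DIR / f"{model_name}_{folder_key}.npz"


def load_or_compute(folder, model_name, compute, cache_root="."):
    """Return (embeddings, paths), reusing the .npz cache when it exists."""
    target = cache_file(folder, model_name, cache_root)
    if target.exists():
        with np.load(target) as cached:
            return cached["embeddings"], cached["paths"].tolist()

    # Cache miss: run the expensive computation once and store the result.
    embeddings, paths = compute(folder)
    embeddings = np.asarray(embeddings)
    path_strings = [str(p) for p in paths]
    target.parent.mkdir(parents=True, exist_ok=True)
    np.savez(target, embeddings=embeddings, paths=np.asarray(path_strings, dtype=str))
    return embeddings, path_strings
```

Here `compute` is any callable that takes a folder and returns `(embeddings, paths)`. A production cache would also need an invalidation rule, for example when files in the folder change, which this sketch leaves out.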
