A professional tool for cleaning duplicate or near-duplicate image frames using perceptual hashing and embeddings.
CleanFrames
Overview
CleanFrames identifies and removes duplicate or near-duplicate images from large datasets. It supports exact and perceptual duplicate detection, measures semantic similarity with deep embeddings from several models, groups similar images by clustering, and provides visualization and detailed reporting for dataset cleaning.
Features
- Multi-model embedding support: Swin, CLIP, DINO, ResNet.
- Exact duplicate detection using MD5 hashing.
- Semantic similarity detection with deep embeddings and clustering.
- Flexible cleaning modes: path-only, embedding-based, or custom embeddings.
- Clustering to group similar images and identify duplicates.
- Visualization tools for inspecting clusters and embeddings.
- Detailed JSON reports with removed duplicates, retained images, and thresholds.
- Device support for CPU, CUDA GPU, and Apple MPS.
- Efficient caching system to store and reuse embeddings for faster processing.
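As an illustration of the exact-duplicate feature listed above, the sketch below groups files whose MD5 digests match. It is a minimal standard-library example, not CleanFrames' own code; the function names `md5_of_file` and `group_exact_duplicates` are hypothetical.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(paths):
    """Map each digest shared by two or more files to the sorted list of those files."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[md5_of_file(path)].append(str(path))
    # Keep only digests that occur more than once (hypothetical helper, for illustration).
    return {d: sorted(files) for d, files in by_digest.items() if len(files) > 1}
```

Byte-identical files always share a digest, so this step catches exact copies only. Re-encoded or resized frames need the embedding-based comparison.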
Installation
Install CleanFrames with pip:
pip install cleanframes
Usage
Basic Cleaning by Path
CleanFrames can process a folder of images, compute embeddings using the default Swin model, and remove duplicates.
from clean_frames import CleanFrame
cleaner = CleanFrame(device='cuda') # or 'cpu', 'mps'
input_folder = "path/to/images"
cleaner.cleanframes(input_folder)
This creates output folders inside the input folder:
- cleaned/: unique images after cleaning.
- duplicates/: removed duplicate images.
- results.json: detailed cleaning report.
Generate Embeddings and Clean
Generate embeddings separately and then clean based on those embeddings:
from clean_frames import CleanFrame
cleaner = CleanFrame(device='cuda')
input_folder = "path/to/images"
embeddings, paths = cleaner.SwinEmbedding(input_folder)
cleaner.cleanframes(paths, embeddings_list=[("swin", embeddings)], threshold=0.95)
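To give a feel for what a threshold such as 0.95 could mean, the sketch below flags pairs of embedding vectors whose cosine similarity reaches the threshold. That CleanFrames compares embeddings by cosine similarity is an assumption here, and the names `cosine_similarity` and `near_duplicate_pairs` are hypothetical; the code uses plain Python lists so it runs without extra dependencies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def near_duplicate_pairs(embeddings, threshold=0.95):
    """Return index pairs (i, j), i < j, whose similarity is at least `threshold`."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Under this assumption, a higher threshold removes fewer images, because only very close vectors count as duplicates. The pairwise loop is quadratic in the number of images, so it suits small sets only.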
Clean Using Custom Embeddings
Supply your own embeddings (e.g., precomputed vectors) for cleaning:
from clean_frames import CleanFrame
import numpy as np
cleaner = CleanFrame(device='cpu')
image_paths = [...] # list of image file paths
custom_embeddings = np.load("custom_embeddings.npy")
cleaner.cleanframes(image_paths, embeddings_list=[("custom_model", custom_embeddings)], threshold=0.9)
Clustering & Visualization
CleanFrames applies clustering algorithms to the embeddings to group similar images, so that duplicates and near-duplicates fall into the same cluster.
You can also visualize clusters and embeddings to inspect dataset structure:
cleaner.visualize_clusters(embeddings, image_paths)
This helps in understanding similarity groups and verifying cleaning results.
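The source does not say which clustering algorithm CleanFrames uses. As one simple possibility, the sketch below treats every pair of images judged similar as an edge and takes connected components as clusters, using a union-find structure. The name `cluster_from_pairs` is hypothetical.

```python
def cluster_from_pairs(n_items, pairs):
    """Group item indices 0..n_items-1 into connected components of the similarity graph."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller, so each root is its cluster's lowest index.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

With clusters in this form, one plausible cleaning policy is to keep the first index of each cluster and treat the rest as duplicates. Whether CleanFrames chooses the retained image this way is not stated.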
Report Example
After cleaning, results.json provides comprehensive details including:
- Number of duplicates removed.
- Images retained.
- Threshold values used.
- Embedding models applied.
- Cluster information.
This report facilitates audit and reproducibility of dataset cleaning.
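The exact schema of results.json is not shown here. The sketch below only illustrates how a report covering the items listed above might be assembled and serialized. Every key name (`threshold`, `models`, `retained`, `removed`, `clusters`, `num_removed`) and the function `build_report` are hypothetical, not the library's actual format.

```python
import json


def build_report(threshold, models, clusters):
    """Build a report dict, keeping the first path of each cluster and removing the rest.

    All key names are illustrative assumptions, not the real results.json schema.
    """
    retained = [cluster[0] for cluster in clusters]
    removed = [path for cluster in clusters for path in cluster[1:]]
    return {
        "threshold": threshold,
        "models": list(models),
        "num_removed": len(removed),
        "retained": retained,
        "removed": removed,
        "clusters": [list(cluster) for cluster in clusters],
    }


def report_to_json(report):
    """Serialize with sorted keys so repeated runs give byte-identical output."""
    return json.dumps(report, indent=2, sort_keys=True)
```

Recording the threshold and model names next to the file lists is what makes a cleaning run reproducible: the same inputs and settings should give the same report.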
Supported Models
- Swin: Hierarchical Vision Transformer for image representation.
- CLIP: Contrastive Language-Image Pretraining embeddings.
- DINO: Self-distillation with no labels for visual features.
- ResNet: Classic convolutional neural network embeddings.
Each model has a corresponding embedding method, such as cleaner.CLIPEmbedding() or cleaner.DINOEmbedding().
Device Support
CleanFrames supports multiple devices for accelerated embedding computation:
- CPU: Default fallback.
- CUDA GPU: For NVIDIA GPUs.
- MPS: Apple's Metal Performance Shaders for Macs with Apple Silicon.
Specify device during initialization:
cleaner = CleanFrame(device='mps') # or 'cuda', 'cpu'
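If the same script has to run on machines with different hardware, the device string can be chosen at runtime. The sketch below assumes PyTorch is the backend, which the text above does not confirm, and falls back to 'cpu' when the requested accelerator, or PyTorch itself, is unavailable. `pick_device` is a hypothetical helper, not part of CleanFrames.

```python
def pick_device(preferred="cuda"):
    """Return `preferred` if that accelerator is usable, otherwise 'cpu'."""
    try:
        import torch  # assumption: PyTorch is the backend
    except ImportError:
        return "cpu"
    if preferred == "cuda" and torch.cuda.is_available():
        return "cuda"
    if preferred == "mps":
        # torch.backends.mps is missing in older PyTorch releases, hence the getattr.
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    return "cpu"
```

The result can then be passed straight to the constructor, for example CleanFrame(device=pick_device("mps")).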
Caching System
CleanFrames includes a caching mechanism to save and load embeddings, reducing redundant computations on repeated runs:
- Automatically caches embeddings per folder and model.
- Load cached embeddings to speed up cleaning.
- Manage cache files for efficient storage.
Example:
embeddings, paths = cleaner.SwinEmbedding(input_folder, use_cache=True)
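How CleanFrames stores its cache is not described here. The sketch below shows the general idea of caching per folder and model: derive a file name from both, reuse the file if it exists, and compute and save otherwise. The names `cache_path` and `load_or_compute`, the JSON format, and the file layout are all assumptions made for illustration.

```python
import hashlib
import json
import os


def cache_path(cache_dir, folder, model_name):
    """Build a cache file name unique to the (folder, model) combination."""
    raw = f"{os.path.abspath(folder)}::{model_name}".encode("utf-8")
    key = hashlib.sha256(raw).hexdigest()[:16]
    return os.path.join(cache_dir, f"{model_name}_{key}.json")


def load_or_compute(cache_dir, folder, model_name, compute):
    """Return cached results if present; otherwise call compute(folder) and store them."""
    path = cache_path(cache_dir, folder, model_name)
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    result = compute(folder)  # must return JSON-serializable data in this sketch
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(result, handle)
    return result
```

A cache keyed only on folder and model does not notice files added to or removed from the folder, so a real implementation would also need an invalidation rule, such as including file names and modification times in the key.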
For advanced options and detailed documentation, please visit the official GitHub repository.