A professional tool for cleaning duplicate or near-duplicate image frames using perceptual hashing and embeddings.
CleanFrames
CleanFrames identifies and removes duplicate or near-duplicate image frames from large datasets. It combines three techniques:
- MD5 hashing for exact byte-level duplicates.
- Perceptual hashing for visually similar images.
- Deep embeddings for semantic redundancy detection.
This combination allows CleanFrames to handle a wide range of duplicate detection scenarios, from exact copies to subtle semantic similarities.
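To make the first two stages concrete, here is a minimal sketch of exact-duplicate grouping with MD5 and a toy average hash compared by Hamming distance. It uses only the Python standard library and small hand-written pixel grids. The helper names (`md5_groups`, `average_hash`, `hamming`) are hypothetical and not part of the CleanFrames API, and the hashing CleanFrames performs internally may differ.

```python
import hashlib
from collections import defaultdict


def md5_groups(blobs):
    """Group file names whose bytes are identical (exact duplicates)."""
    groups = defaultdict(list)
    for name, data in blobs.items():
        groups[hashlib.md5(data).hexdigest()].append(name)
    # Only groups with more than one member contain duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]


def average_hash(pixels):
    """Toy average hash of a grayscale grid: one bit per pixel,
    set when the pixel is brighter than the grid mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(a, b):
    """Number of differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))


# Illustrative data only: raw bytes stand in for image files.
blobs = {"a.jpg": b"\x01\x02\x03", "b.jpg": b"\x01\x02\x03", "c.jpg": b"\x09"}
exact = md5_groups(blobs)

frame = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]   # same layout, slightly brighter
inverted = [[200, 10], [30, 220]]   # different layout
near = hamming(average_hash(frame), average_hash(brighter))
far = hamming(average_hash(frame), average_hash(inverted))
```

A small Hamming distance marks two frames as visually similar even when their bytes differ, which is the case MD5 cannot catch.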
Features
- Supports multiple embedding models: Swin, CLIP, DINO, and ResNet.
- Flexible usage modes: clean images by path only, generate embeddings on the fly, or supply custom embeddings.
- Runs on CPU, CUDA GPU, or Apple MPS for accelerated processing.
- Outputs cleaned images into organized folders for easy inspection.
- Provides detailed results including removed duplicates and retained images.
Installation
Install CleanFrames via pip:
pip install cleanframes
Usage
1. Basic Usage: Clean by Path Only
CleanFrames can process a folder of images directly, automatically computing embeddings using the default model (Swin) and removing duplicates.
from clean_frames import CleanFrame
cleaner = CleanFrame(device='cuda') # or 'cpu', 'mps' depending on your hardware
input_folder = "path/to/images"
# Clean images by path only
cleaner.cleanframes(input_folder)
This creates the following inside the input folder:
- cleaned: contains unique images after cleaning.
- duplicates: contains removed duplicate images.
- results.json: detailed report of the cleaning process.
2. Generate Embeddings and Clean
You can generate embeddings separately and then clean based on those embeddings. This is useful if you want to inspect or reuse embeddings.
from clean_frames import CleanFrame
cleaner = CleanFrame(device='cuda')
input_folder = "path/to/images"
# Generate embeddings using Swin model
embeddings, paths = cleaner.SwinEmbedding(input_folder)
# Clean images using generated embeddings
cleaner.cleanframes(paths, embeddings_list=[("swin", embeddings)], threshold=0.95)
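One plausible reading of the threshold used above is a cosine-similarity cutoff: an image is dropped when its embedding is at least that similar to one already kept. The sketch below illustrates that idea with plain Python lists. It is an assumption about the general technique, not the algorithm CleanFrames itself uses, and `cosine` and `greedy_dedup` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_dedup(paths, embeddings, threshold=0.95):
    """Keep an image unless it is at least `threshold`-similar to a kept one."""
    kept = []      # indices of retained images
    removed = []   # indices treated as near-duplicates
    for i, emb in enumerate(embeddings):
        if any(cosine(emb, embeddings[k]) >= threshold for k in kept):
            removed.append(i)
        else:
            kept.append(i)
    return [paths[i] for i in kept], [paths[i] for i in removed]


# Illustrative 2-D vectors; real embeddings have hundreds of dimensions.
paths = ["f1.jpg", "f2.jpg", "f3.jpg"]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept, removed = greedy_dedup(paths, embeddings, threshold=0.95)
```

Here `f2.jpg` points almost the same way as `f1.jpg` and is removed, while `f3.jpg` is orthogonal and kept.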
3. Clean Using Custom Embeddings
If you have your own embeddings (e.g., from other models or precomputed vectors), you can supply them directly.
from clean_frames import CleanFrame
import numpy as np
cleaner = CleanFrame(device='cpu')
input_folder = "path/to/images"
# Example: Load or create custom embeddings as a numpy array
custom_embeddings = np.load("custom_embeddings.npy")
image_paths = [...] # list of image file paths corresponding to embeddings
# Clean using custom embeddings with a specified model name
cleaner.cleanframes(image_paths, embeddings_list=[("custom_model", custom_embeddings)], threshold=0.9)
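Because custom embeddings must correspond one-to-one with the image paths, it can help to verify that before cleaning. The check below is a hedged sketch: `check_alignment` is a hypothetical helper, not a CleanFrames function, and it assumes the embeddings are a 2-D structure (a list of lists or a 2-D NumPy array) with one row per image.

```python
def check_alignment(image_paths, embeddings):
    """Return (n_images, dim) or raise ValueError if paths and rows disagree."""
    if len(image_paths) != len(embeddings):
        raise ValueError(
            f"{len(image_paths)} paths but {len(embeddings)} embedding rows"
        )
    # Every row should have the same number of dimensions.
    dims = {len(vec) for vec in embeddings}
    if len(dims) > 1:
        raise ValueError("embedding rows have inconsistent dimensions")
    return len(image_paths), (dims.pop() if dims else 0)


# Illustrative values only.
n_images, dim = check_alignment(["a.jpg", "b.jpg"], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
```

Running such a check first turns a silent misalignment into an explicit error before any images are moved.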
Supported Embedding Models
- Swin: Hierarchical Vision Transformer for image representation.
- CLIP: Contrastive Language-Image Pretraining embeddings.
- DINO: Self-distillation with no labels for visual features.
- ResNet: Classic convolutional neural network embeddings.
You can generate embeddings with any of these models using the corresponding methods provided by CleanFrame (e.g., cleaner.CLIPEmbedding() or cleaner.DINOEmbedding()).
Device Support
CleanFrames supports multiple devices for accelerated embedding computation:
- CPU: Default fallback.
- CUDA GPU: For NVIDIA GPUs.
- MPS: Apple's Metal Performance Shaders for Macs with Apple Silicon.
Specify your device when initializing CleanFrame:
cleaner = CleanFrame(device='mps') # or 'cuda', 'cpu'
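If the same script runs on different machines, the device string can be chosen at runtime. The sketch below assumes PyTorch is the backend and that the 'cuda', 'mps', and 'cpu' strings listed above match PyTorch's device names; neither is confirmed here. `pick_device` is a hypothetical helper.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' based on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption: without PyTorch, fall back to CPU
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch releases.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
```

The result can then be passed as `CleanFrame(device=device)`.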
Output Structure
After cleaning, the tool creates the following inside the input folder or specified path:
- cleaned/: Contains the filtered set of unique images.
- duplicates/: Contains images identified as duplicates or near-duplicates.
- results.json: JSON file summarizing duplicates removed, thresholds used, and other metadata.
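The exact layout of results.json is not specified here, so a cautious first step is to load it and list its top-level keys before relying on any field. The sketch below makes no assumption about the schema; `report_keys` is a hypothetical helper.

```python
import json
from pathlib import Path


def report_keys(results_path):
    """Load a JSON report and return its sorted top-level keys.

    Returns an empty list when the top level is not a JSON object.
    """
    data = json.loads(Path(results_path).read_text(encoding="utf-8"))
    return sorted(data) if isinstance(data, dict) else []
```

For example, `report_keys("path/to/images/results.json")` would show which fields a given run actually wrote.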
Notes
- The threshold parameter controls sensitivity for near-duplicate detection; lower values remove more images.
- Combining multiple embedding models can improve detection accuracy.
- CleanFrames is designed to be scalable and efficient for large image datasets.
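The effect of the threshold can be seen by counting how many pairs of embeddings meet the cutoff at different values. The sketch below assumes the threshold is compared against cosine similarity, which is not confirmed here, and uses made-up 2-D vectors; `count_duplicate_pairs` is a hypothetical helper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def count_duplicate_pairs(embeddings, threshold):
    """Count pairs whose similarity is at or above the threshold."""
    return sum(
        1 for u, v in combinations(embeddings, 2) if cosine(u, v) >= threshold
    )


# Illustrative vectors only.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.7, 0.7], [0.0, 1.0]]
counts = {t: count_duplicate_pairs(embeddings, t) for t in (0.99, 0.75, 0.7)}
```

As the threshold drops from 0.99 to 0.7, more pairs qualify as near-duplicates, so more images would be removed.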
For more information and advanced options, please refer to the official documentation or GitHub repository.
File details
Details for the file cleanframes-0.2.6.tar.gz.
File metadata
- Download URL: cleanframes-0.2.6.tar.gz
- Upload date:
- Size: 7.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc486bd1a029bfec4ffec07866a1843772330bdaf0937b8762adeb04f0858f9d |
| MD5 | 82d5d936fbe5aa25498e4755c71a9b18 |
| BLAKE2b-256 | 6adbc4b47b5addb9f7eaabd51dc02c9946b85456bedeb65a08609647c3dc1eec |
File details
Details for the file cleanframes-0.2.6-py3-none-any.whl.
File metadata
- Download URL: cleanframes-0.2.6-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a5356b052b5a3d0956d1675a6a6f21dc01eb9196e201cab6aed0b804d64db262 |
| MD5 | 8e8d9e427af80a8452cbf86ca86a41f6 |
| BLAKE2b-256 | 0cfbd90fd9a46004e72758a1c5aeb6be4756447d22ebb0d2c403511b87396052 |