
A professional tool for cleaning duplicate or near-duplicate image frames using perceptual hashing and embeddings.


CleanFrames

CleanFrames is a Python library that cleans and summarizes folders of image frames using embedding models and clustering. It embeds each frame, groups similar frames, and removes duplicates or near-duplicates. Embeddings and reports are cached so that subsequent runs are faster, and the cleaned and removed images are saved alongside the original dataset.
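
To make the idea of near-duplicate removal concrete, here is a minimal sketch in plain Python. It is not CleanFrames' implementation: the function names, the toy vectors, and the 0.95 threshold are all assumptions chosen for illustration. It keeps a frame unless its embedding is too close, by cosine similarity, to a frame that was already kept.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_near_duplicates(embeddings, threshold=0.95):
    """Greedy pass: keep a frame unless it is too similar to a kept one.

    `embeddings` maps a frame name to its vector (hypothetical input format).
    Returns two lists of names: (kept, removed).
    """
    kept, removed = [], []
    for name, vector in embeddings.items():
        if any(cosine_similarity(vector, embeddings[k]) >= threshold for k in kept):
            removed.append(name)
        else:
            kept.append(name)
    return kept, removed


# Toy 3-D vectors standing in for real model embeddings.
toy = {
    "frame_001.jpg": [1.0, 0.0, 0.0],
    "frame_002.jpg": [0.99, 0.05, 0.0],  # almost the same direction as frame_001
    "frame_003.jpg": [0.0, 1.0, 0.0],
}
kept, removed = split_near_duplicates(toy)
print("kept:", kept)
print("removed:", removed)
```

The greedy pass compares each frame against every kept frame, which is fine for a sketch but grows quadratically; clustering the embeddings first is one way to reduce that cost.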

Key Features

  • Processes folders of image frames rather than video files.
  • Supports multiple embedding models for representing frames.
  • Offers several clustering methods for grouping similar frames.
  • Caches embeddings and reports to speed up repeated runs.
  • Saves cleaned and removed images beside the original dataset.
  • Provides visualization tools for inspecting clusters and frame pairs.
  • Generates text-only console reports summarizing the cleaning results.

Installation

To install CleanFrames, clone the repository and install the required dependencies:

git clone <repository-url>
cd cleanframes
pip install -r requirements.txt

Usage

Basic Example

from cleanframes import CleanFrame

# Initialize with folder path, embedding model, clustering method, and caching enabled
cf = CleanFrame(
    path='path/to/frames_folder',
    model='clip-ViT-B-32',
    cluster='kmeans',
    cache=True,
    verbose=True
)

# Run the full cleaning pipeline: embedding, clustering, cleaning
cf.run()

# Generate a text-only console report of the cleaning results
cf.report()

# Visualize clusters of frames
cf.visualize_clusters()

Optimized Workflow Example

from cleanframes import CleanFrame

# Initialize with a different model and clustering method
cf = CleanFrame(
    path='path/to/frames_folder',
    model='clip-ViT-L-14',
    cluster='dbscan',
    cache=True,
    verbose=True
)

# Run the cleaning process
cf.run()

# Print cleaning report
cf.report()

# Visualize clusters of frames
cf.visualize_clusters()

Caching and Outputs

  • Embeddings and cleaning reports are written to a cache folder so that later runs on the same data can reuse them.
  • Cleaned and removed images are saved beside the original frames in the dataset folder, so they are easy to inspect and reuse.
  • Reusing cached embeddings avoids recomputing them, which matters most when processing large datasets.
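
The caching idea can be sketched as follows. This is an illustration under stated assumptions, not the library's cache format: the JSON file, the key built from file name, size, and modification time, and the helper names `cache_key` and `embed_with_cache` are all hypothetical.

```python
import json
import os
import tempfile


def cache_key(path):
    """Key that changes whenever the file's name, size, or mtime changes."""
    stat = os.stat(path)
    return f"{os.path.basename(path)}:{stat.st_size}:{stat.st_mtime_ns}"


def embed_with_cache(paths, embed_fn, cache_file):
    """Return ({path: embedding}, number of embeddings actually computed)."""
    cache = {}
    if os.path.exists(cache_file):
        with open(cache_file, "r", encoding="utf-8") as fh:
            cache = json.load(fh)

    computed = 0
    result = {}
    for path in paths:
        key = cache_key(path)
        if key not in cache:
            cache[key] = embed_fn(path)  # only runs on a cache miss
            computed += 1
        result[path] = cache[key]

    with open(cache_file, "w", encoding="utf-8") as fh:
        json.dump(cache, fh)
    return result, computed


def fake_embed(path):
    # Stand-in for a real model: uses the file size as a 1-D "embedding".
    return [float(os.path.getsize(path))]


with tempfile.TemporaryDirectory() as folder:
    frame_paths = []
    for i in range(2):
        frame_path = os.path.join(folder, f"frame_{i}.jpg")
        with open(frame_path, "wb") as fh:
            fh.write(b"x" * (i + 1))
        frame_paths.append(frame_path)

    cache_file = os.path.join(folder, "embeddings.json")
    _, first_run = embed_with_cache(frame_paths, fake_embed, cache_file)
    _, second_run = embed_with_cache(frame_paths, fake_embed, cache_file)

print("computed on first run:", first_run)
print("computed on second run:", second_run)
```

Including size and modification time in the key means an edited frame is re-embedded even if its name is unchanged.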

Supported Embedding Models

CleanFrames supports multiple embedding models for frame representation:

  • CLIP models such as clip-ViT-B-32 and clip-ViT-L-14.
  • Additional models can be integrated as needed.

Clustering Methods

Available clustering algorithms include:

  • KMeans clustering
  • DBSCAN clustering
  • Other clustering methods can be added or customized.
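
Once a clustering method has assigned a label to every frame, one common way to clean is to keep a single representative per cluster. The sketch below assumes that strategy; the source does not say how CleanFrames picks which frames to keep, and `pick_representatives` is a hypothetical helper. It treats the label -1 as noise and keeps those frames, following the convention scikit-learn's DBSCAN uses for unclustered points.

```python
from collections import defaultdict


def pick_representatives(names, vectors, labels):
    """Keep one frame per cluster: the member closest to the cluster mean.

    Frames labelled -1 (noise) belong to no cluster and are all kept.
    Returns the kept frame names in sorted order.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    kept = []
    for label, members in groups.items():
        if label == -1:
            kept.extend(names[i] for i in members)
            continue

        dim = len(vectors[members[0]])
        centroid = [
            sum(vectors[i][d] for i in members) / len(members) for d in range(dim)
        ]

        def squared_distance(i):
            return sum((vectors[i][d] - centroid[d]) ** 2 for d in range(dim))

        kept.append(names[min(members, key=squared_distance)])
    return sorted(kept)


names = ["frame_a.jpg", "frame_b.jpg", "frame_c.jpg", "frame_d.jpg", "frame_e.jpg"]
vectors = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 9.0]]
labels = [0, 0, 0, 1, -1]  # toy labels, as a clustering step might produce

representatives = pick_representatives(names, vectors, labels)
print(representatives)
```

Frames not returned by the helper would be the candidates for removal.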

Visualization

CleanFrames provides visualization tools to help users inspect the clustering results and pairs of similar frames. This helps verify the cleaning quality and understand the grouping of frames.

Reporting

After cleaning, CleanFrames generates a concise text-only console report summarizing:

  • Number of frames processed
  • Number of frames removed
  • Number of frames retained

Comparing the removed and retained counts against the total shows how aggressively the dataset was cleaned.
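
As an illustration of such a summary, the sketch below formats the three counts as plain text. The layout, the percentage, and the function name `format_report` are assumptions; the actual output of `cf.report()` may look different.

```python
def format_report(processed, removed):
    """Build a three-line plain-text summary from two counts."""
    if removed < 0 or removed > processed:
        raise ValueError("removed must be between 0 and processed")
    retained = processed - removed
    removed_pct = 100.0 * removed / processed if processed else 0.0
    return "\n".join([
        f"Frames processed: {processed}",
        f"Frames removed:   {removed} ({removed_pct:.1f}%)",
        f"Frames retained:  {retained}",
    ])


report = format_report(processed=120, removed=30)
print(report)
```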


For more detailed information and advanced usage, please refer to the source code and examples provided in the repository.
