
Finds images that have a high degree of similarity

Project description

Image Deduplication


Easily find images that have a high degree of similarity using either:

  • one line of code
  • a CLI tool

Installation

pip install image-deduplication

Usage

Find similar images in a folder tree

from image_deduplication import get_image_paths, cluster_images

image_paths = get_image_paths("path/to/images") # Returns list of image paths
image_clusters = cluster_images(image_paths)

# Print clusters
for i, cluster in enumerate(image_clusters):
    print(f"""Cluster {i}: {cluster}\n""")

Find similar images from a list of files

from image_deduplication import cluster_images

image_paths = [
    "image1.png",
    "image2.jpg",
    "image3.jpeg"
]

image_clusters = cluster_images(image_paths)

# Print clusters
for i, cluster in enumerate(image_clusters):
    print(f"""Cluster {i}: {cluster}\n""")
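Once the clusters are available, a common next step is deciding which files to keep. The sketch below is not part of the package: `split_keep_and_duplicates` is a hypothetical helper, and it assumes `cluster_images` returns an iterable of clusters, each an iterable of path strings, as the printed output above suggests. It keeps the alphabetically first path of each cluster and reports the rest as duplicates.

```python
from typing import Iterable, List, Tuple


def split_keep_and_duplicates(
    clusters: Iterable[Iterable[str]],
) -> Tuple[List[str], List[str]]:
    """Keep one path per cluster and list the others as duplicates.

    Hypothetical helper, not part of image_deduplication.
    """
    keep: List[str] = []
    duplicates: List[str] = []
    for cluster in clusters:
        paths = sorted(cluster)  # sort so the kept file is chosen deterministically
        if not paths:
            continue  # ignore empty clusters
        keep.append(paths[0])
        duplicates.extend(paths[1:])
    return keep, duplicates


# Made-up clusters standing in for the result of cluster_images(image_paths).
example_clusters = [["b.png", "a.png"], ["c.jpg"]]
keep, duplicates = split_keep_and_duplicates(example_clusters)
print("keep:", keep)              # keep: ['a.png', 'c.jpg']
print("duplicates:", duplicates)  # duplicates: ['b.png']
```

Review the duplicate list before deleting anything, since similar images are not necessarily identical.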

CLI tool

image-deduplication path/to/images

If you want to analyse the current working directory, you can simply use "." as the path.

Methodology

This package clusters images by visual similarity using classic computer vision techniques and a union-find algorithm. The main steps are:

  1. Feature Extraction with ORB: The ORB (Oriented FAST and Rotated BRIEF) algorithm extracts keypoints and binary descriptors from each image. ORB is fast and rotation-invariant, and the distinctive points it identifies make it possible to compare images by their content.

  2. Image Matching: To measure the similarity between a pair of images, a brute-force matcher compares their ORB descriptors using the Hamming distance. A ratio test filters out less reliable matches, so that only distinctive keypoint matches contribute to the similarity score.

  3. Clustering with Union-Find: A union-find (disjoint-set) structure groups images based on their similarity scores. As the package iterates over image pairs, it merges sufficiently similar images into the same set, so the final clusters are the connected components of the dataset. The union-find structure keeps merges and lookups cheap, which helps minimize redundant comparisons and speeds up clustering.

  4. Homography and RANSAC: For image pairs with a sufficient number of good matches, a homography matrix is estimated using RANSAC (Random Sample Consensus). This step refines the matching by keeping only geometrically consistent matches, which improves the accuracy of the similarity scores.

  5. Scalable Image Clustering: The end-to-end process, from feature extraction to clustering, is designed to handle a large number of images. By comparing images efficiently and grouping them by visual similarity, the package organizes large datasets into manageable clusters.
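The clustering step (step 3) can be illustrated with a minimal union-find sketch. This is not the package's actual implementation: the `UnionFind` class, the `cluster_by_similarity` function, the score dictionary and the `0.5` threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class UnionFind:
    """Disjoint-set structure with path halving and union by rank."""

    def __init__(self, size: int) -> None:
        self.parent = list(range(size))
        self.rank = [0] * size

    def find(self, x: int) -> int:
        # Walk up to the root, shortening the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return  # already in the same cluster
        if self.rank[root_a] < self.rank[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        if self.rank[root_a] == self.rank[root_b]:
            self.rank[root_a] += 1


def cluster_by_similarity(
    names: List[str],
    scores: Dict[Tuple[int, int], float],
    threshold: float,
) -> List[List[str]]:
    """Merge every pair whose score reaches the (assumed) threshold."""
    uf = UnionFind(len(names))
    for (i, j), score in scores.items():
        if score >= threshold:
            uf.union(i, j)
    groups: Dict[int, List[str]] = {}
    for index, name in enumerate(names):
        groups.setdefault(uf.find(index), []).append(name)
    return list(groups.values())


# Made-up pairwise similarity scores, keyed by index pairs into `names`.
names = ["a.png", "b.png", "c.png", "d.png"]
scores = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.1, (2, 3): 0.2}
clusters = cluster_by_similarity(names, scores, threshold=0.5)
print(clusters)  # [['a.png', 'b.png', 'c.png'], ['d.png']]
```

Note that `a.png` and `c.png` land in the same cluster even though their direct score is low, because both are similar to `b.png`. Grouping by connected components is transitive in this way.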

This approach combines classic computer vision techniques with an efficient clustering algorithm, offering a robust way to organize and analyze large collections of images based on their visual content.
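To make the matching step of the methodology more concrete, here is a pure-Python sketch of brute-force matching with the Hamming distance and a ratio test. It does not use the package's code or OpenCV: the descriptors are made-up byte strings rather than real ORB output, `hamming` and `count_good_matches` are hypothetical names, and the `0.75` ratio is an assumed value, not one confirmed by the package.

```python
from typing import Sequence


def hamming(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length binary descriptors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def count_good_matches(
    query: Sequence[bytes],
    train: Sequence[bytes],
    ratio: float = 0.75,  # assumed ratio, chosen only for illustration
) -> int:
    """Brute-force match each query descriptor and apply a ratio test."""
    good = 0
    for q in query:
        distances = sorted(hamming(q, t) for t in train)
        if len(distances) < 2:
            continue  # the ratio test needs two candidate neighbours
        best, second = distances[0], distances[1]
        # Accept only if the best match is clearly closer than the runner-up.
        if best < ratio * second:
            good += 1
    return good


# Made-up 16-bit descriptors standing in for ORB descriptors.
query = [b"\x00\x00", b"\xff\x00"]
train = [b"\x00\x01", b"\xff\xff", b"\x0f\x0f"]
good = count_good_matches(query, train)
print(good)  # 1
```

The first query descriptor has one clearly closest candidate (distance 1 against 8), so it passes. The second is equally close to two candidates (distance 8 for both), so the ratio test rejects it as ambiguous.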

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

image-deduplication-0.1.3.tar.gz (9.9 kB view details)

Uploaded Source

Built Distribution

image_deduplication-0.1.3-py3-none-any.whl (10.3 kB view details)

Uploaded Python 3

File details

Details for the file image-deduplication-0.1.3.tar.gz.

File metadata

  • Download URL: image-deduplication-0.1.3.tar.gz
  • Upload date:
  • Size: 9.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for image-deduplication-0.1.3.tar.gz
Algorithm    Hash digest
SHA256       6322a5c2e726aa9a836271ff9ea3030553a32fe0c7d35e80c68efeed740f5504
MD5          f26be958f3491e52ee600a824fcd8a1f
BLAKE2b-256  05df341ddf63d4991b348cbf7a83411bf3625c3e2ed8e39032aa2b451c2b7e7a


File details

Details for the file image_deduplication-0.1.3-py3-none-any.whl.

File hashes

Hashes for image_deduplication-0.1.3-py3-none-any.whl
Algorithm    Hash digest
SHA256       3dc7b17d3894a68dafba7321c832957a41aaf67eecd23a11d8c065480475e7f0
MD5          0f83f9b6cfaa8bba14d0e475794aae76
BLAKE2b-256  dd4c055f2e45181635ad2ac9f107e1af293fdf5c589d182858fb1e83b0e8369a
