Sobel Gradient Image Deduplication

GraDupe

Motivation

Classical algorithms based on image hashes can be inaccurate. Newer ones based on RNNs can be inefficient. As demand for image storage has grown rapidly over the past decade, we need a fast solution that combines the benefits of both.

Solution

At one point, it occurred to me that Sobel gradients make a decent fingerprint for an image. As with finite differences and derivatives, two distinct images share the same gradient only if they differ by a constant. Reading an image in grayscale gives a 2D matrix suitable for Sobel operators.
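The "differ by a constant" property can be checked with plain finite differences. The snippet below is a minimal sketch using NumPy only; the random test image and the offset of 25 are made-up values for illustration, and forward differences stand in for the Sobel operator.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 8x8 grayscale image; values kept low so the offset cannot overflow.
image = rng.integers(0, 200, size=(8, 8)).astype(np.int32)
brighter = image + 25  # same picture, uniformly brighter

def finite_diff(img):
    """Forward differences along x (columns) and y (rows)."""
    return np.diff(img, axis=1), np.diff(img, axis=0)

dx_a, dy_a = finite_diff(image)
dx_b, dy_b = finite_diff(brighter)

# The constant offset cancels out in every difference.
same_gradient = np.array_equal(dx_a, dx_b) and np.array_equal(dy_a, dy_b)
```

Here `same_gradient` is true: a uniform brightness shift leaves the gradient fingerprint unchanged, which is what makes it useful for spotting near-identical copies.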

Images of different dimensions are downscaled to a common square grid. Although convolutions are very fast on modern hardware, downscaling is done to unify dimensions and speed up diffing. Enough informative bits remain after downscaling for the diffing done later.
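One way to picture the downscaling step is block averaging. This is a sketch under assumptions: `downscale_square` and the default grid side of 16 are hypothetical, and the tool itself relies on OpenCV, whose resizing may use a different interpolation. The input is assumed to be at least `side` pixels along each axis.

```python
import numpy as np

def downscale_square(gray: np.ndarray, side: int = 16) -> np.ndarray:
    """Average-pool a 2D grayscale image into a side x side grid."""
    height, width = gray.shape
    # Bin edges split each axis into `side` nearly equal chunks.
    row_edges = np.linspace(0, height, side + 1).astype(int)
    col_edges = np.linspace(0, width, side + 1).astype(int)
    out = np.empty((side, side), dtype=np.float32)
    for r in range(side):
        for c in range(side):
            block = gray[row_edges[r]:row_edges[r + 1],
                         col_edges[c]:col_edges[c + 1]]
            out[r, c] = block.mean()
    return out

# Two images of different shapes end up on the same grid.
small_a = downscale_square(np.ones((48, 64)), side=16)
small_b = downscale_square(np.ones((30, 50)), side=16)
```

After this step every image has the same shape, so fingerprints have equal length and can be compared element by element.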

Sobel operators are traditionally used for edge detection, but at their core they differentiate an image. Computing the Sobel gradient of an image in both the x and y directions yields two matrices, which we flatten and concatenate into a contiguous array.
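The gradient step can be sketched with the standard 3x3 Sobel kernels applied by hand. This is an illustration, not the tool's code: the tool uses OpenCV, while `correlate3x3` and `sobel_fingerprint` are hypothetical helpers that drop the one-pixel border instead of padding it.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T  # same kernel rotated to respond to vertical change

def correlate3x3(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 3x3 cross-correlation (output shrinks by 2 per axis)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.float32)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out

def sobel_fingerprint(gray: np.ndarray) -> np.ndarray:
    """Flatten the x and y gradients and join them into one 1D array."""
    gray = gray.astype(np.float32)
    gx = correlate3x3(gray, SOBEL_X)
    gy = correlate3x3(gray, SOBEL_Y)
    return np.concatenate([gx.ravel(), gy.ravel()])

# A horizontal ramp: intensity equals the column index.
ramp = np.tile(np.arange(6, dtype=np.float32), (6, 1))
fingerprint = sobel_fingerprint(ramp)
```

For the ramp, the x half of the fingerprint is a constant 8 (a step of 2 across the kernel times the weights 1 + 2 + 1) and the y half is all zeros, since nothing changes vertically.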

The gradients are thresholded into bitmasks because Hamming distance can be computed with SIMD XOR instructions, making it orders of magnitude faster than the Euclidean norm. By mapping each pair of image indices to a single combinatorial index, a densely packed flat array can serve as the distance matrix, saving memory and enabling parallel computation.
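The bitmask, Hamming distance, and pair indexing can be sketched as follows. The helper names and the threshold of zero (keeping only the sign of each gradient value) are assumptions, and the `unpackbits` bit count is a readable stand-in for the Numba-compiled loop the tool would use.

```python
import numpy as np

def to_bitmask(fingerprint: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Threshold a float fingerprint and pack the bits into uint8 bytes."""
    return np.packbits(fingerprint > threshold)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Count differing bits between two packed bitmasks (XOR, then popcount)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def pair_index(i: int, j: int, n: int) -> int:
    """Position of pair (i, j), with i < j, in a flat array of n*(n-1)/2 pairs."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)

mask_a = to_bitmask(np.array([1, -1, 1, 1, -1, -1, 1, -1], dtype=np.float32))
mask_b = to_bitmask(np.array([1, 1, 1, -1, -1, -1, 1, 1], dtype=np.float32))
distance = hamming(mask_a, mask_b)  # bits 10110010 vs 11100011 differ in 3 places

# For n = 4 images the six pairs map to 0..5 with no gaps.
flat_positions = [pair_index(i, j, 4) for i in range(4) for j in range(i + 1, 4)]
```

Because each pair owns exactly one slot in the flat array, every distance can be computed independently, and no memory is spent on the diagonal or the mirrored half of a square matrix.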

The flat distance array can then be thresholded into a boolean mask, again with SIMD instructions. All that remains is to compress the image combinations with the mask (combinatorial indexing ensures correct correspondence), which yields the list of duplicate image pairs within the specified threshold.
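The final filtering can be sketched with the standard library's `itertools`. `duplicate_pairs`, the file names, and the distances below are made up for illustration, and treating the threshold as inclusive is an assumption.

```python
from itertools import combinations, compress

import numpy as np

def duplicate_pairs(names, distances, threshold):
    """Keep the pairs whose distance is within the threshold (inclusive here)."""
    mask = np.asarray(distances) <= threshold  # vectorized comparison
    # combinations() yields (0,1), (0,2), ..., (n-2,n-1): the same order
    # in which the flat distance array is laid out.
    return list(compress(combinations(names, 2), mask))

names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
# One distance per pair: (a,b), (a,c), (a,d), (b,c), (b,d), (c,d)
distances = np.array([2, 40, 35, 38, 1, 50])
pairs = duplicate_pairs(names, distances, threshold=5)
```

With these values, `pairs` holds `("a.jpg", "b.jpg")` and `("b.jpg", "d.jpg")`, the two entries whose distance is at most 5.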

Implementation

The tool is written in pure Python. The internal library uses OpenCV, NumPy, and Numba (LLVM JIT). The CLI is built with Typer and Rich.

Get the CLI with pip install gradupe and refer to gradupe --help for usage instructions. Optionally, install Intel's TBB (Threading Building Blocks) libraries on your device to enable dynamic scheduling, since the computational load of the distance matrix is imbalanced. Run numba -s | grep TBB to check whether TBB is present; refer to the instructions if it is not found.

In practice, the tool proves extremely efficient and accurate. It finished comparing 2000 images in under 0.1 seconds on my Intel(R) Core(TM) i5-11320H laptop and found 100 duplicate pairs that iCloud Photos failed to detect.
