Finding Duplicate Images

Finds equal or similar images in a directory containing (many) image files.

Usage

$ pip install duplicate_images
$ find-dups -h

to print the help screen. Or just

$ find-dups $IMAGE_ROOT 

for a test run.

Image comparison algorithms

Use the --algorithm option to select how equal images are found.

  • exact: marks only byte-for-byte identical files as equal. This is by far the fastest, but also the most restrictive algorithm.
  • histogram: checks the images' color histograms for equality. Faster than the image hashing algorithms, but tends to give a lot of false positives for images that are similar, but not equal. Use the --fuzziness and --aspect-fuzziness options to fine-tune its behavior.
  • ahash, colorhash, dhash, phash, whash: five different image hashing algorithms. See https://pypi.org/project/ImageHash for an introduction to image hashing and https://tech.okcupid.com/evaluating-perceptual-image-hashes-okcupid for some gory details on which image hashing algorithm performs best in which situation. To start, I recommend using ahash and evaluating the other algorithms only if ahash does not perform satisfactorily in your use case.
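
To illustrate the difference between exact matching and image hashing, here is a minimal sketch in pure Python. It is not the tool's own code: the function names and the tiny 2x2 "images" are hypothetical, and real average-hash implementations typically first shrink the image to a small grayscale grid before computing the bits.

```python
import hashlib
from typing import List

Pixels = List[List[int]]  # hypothetical stand-in: grayscale values 0..255, row by row


def exact_digest(data: bytes) -> str:
    """Digest of the raw bytes; equal digests mean byte-identical content."""
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels: Pixels) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the mean, else 0."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of bits in which the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def to_bytes(pixels: Pixels) -> bytes:
    """Flatten the pixel grid into raw bytes for the exact comparison."""
    return bytes(value for row in pixels for value in row)


original = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]  # same picture, slightly brighter
inverted = [[200, 10], [30, 220]]  # a different picture

# The exact comparison treats the brightened copy as a different file ...
exact_equal = exact_digest(to_bytes(original)) == exact_digest(to_bytes(brighter))
# ... while the average hash still considers it identical.
same_bits = hamming_distance(average_hash(original), average_hash(brighter))
diff_bits = hamming_distance(average_hash(original), average_hash(inverted))
```

A small Hamming distance between two hashes suggests similar images, while a large one suggests different images. Where to draw the line depends on the algorithm and the use case.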

Actions for matching image pairs

Use the --on-equal option to select what to do with pairs of equal images.

  • delete-first: deletes the first of the two files
  • delete-second: deletes the second of the two files
  • delete-bigger: deletes the file with the bigger size
  • delete-smaller: deletes the file with the smaller size
  • eog: launches the eog image viewer to compare the two files
  • xv: launches the xv image viewer to compare the two files
  • print: prints the two files
  • none: does nothing.

The default action is print.
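
As a rough illustration of how the delete-* actions differ, the sketch below only decides which file of a pair would be removed and never deletes anything. The function name file_to_delete is hypothetical, and the tie-breaking for equally sized files is an assumption, since the documentation does not say how the tool handles that case.

```python
import tempfile
from pathlib import Path
from typing import Optional


def file_to_delete(first: Path, second: Path, action: str) -> Optional[Path]:
    """Return the file a delete-* action would remove, or None for other actions."""
    if action == "delete-first":
        return first
    if action == "delete-second":
        return second
    if action in ("delete-bigger", "delete-smaller"):
        # Assumption: tie handling is not documented; here a stable sort decides.
        smaller, bigger = sorted((first, second), key=lambda path: path.stat().st_size)
        return bigger if action == "delete-bigger" else smaller
    return None  # eog, xv, print and none do not delete anything


with tempfile.TemporaryDirectory() as tmp:
    small = Path(tmp) / "small.jpg"
    large = Path(tmp) / "large.jpg"
    small.write_bytes(b"x" * 10)
    large.write_bytes(b"x" * 100)

    bigger_choice = file_to_delete(small, large, "delete-bigger").name
    smaller_choice = file_to_delete(small, large, "delete-smaller").name
    first_choice = file_to_delete(small, large, "delete-first").name
    print_choice = file_to_delete(small, large, "print")
```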

Parallel execution

Use the --parallel option to utilize all free cores on your system. There is also the --chunk-size option to tune how many comparisons each thread makes in one go, but setting it explicitly is rarely advantageous.
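
The idea of handing out comparisons in chunks can be sketched as follows. This is not how the tool is necessarily implemented: it assumes a thread pool from the standard library, a hypothetical is_match function that compares two precomputed integer hashes, and an arbitrary threshold of one differing bit.

```python
from itertools import combinations
from multiprocessing.pool import ThreadPool
from typing import List, Tuple


def is_match(pair: Tuple[int, int]) -> bool:
    """Hypothetical comparison: hashes differing in at most one bit count as a match."""
    hash_a, hash_b = pair
    return bin(hash_a ^ hash_b).count("1") <= 1


def find_matches(hashes: List[int], chunk_size: int) -> List[Tuple[int, int]]:
    """Compare every pair of hashes, handing chunk_size pairs to a worker at a time."""
    pairs = list(combinations(hashes, 2))
    with ThreadPool() as pool:  # defaults to one worker per available CPU
        flags = pool.map(is_match, pairs, chunksize=chunk_size)
    return [pair for pair, flag in zip(pairs, flags) if flag]


matches = find_matches([0b0110, 0b0111, 0b1001], chunk_size=2)
```

Larger chunks reduce scheduling overhead per comparison, while smaller chunks balance the load more evenly. This trade-off is one reason the default usually works well enough.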

Development notes

Needs Python 3 and the Pillow imaging library to run, and additionally Wand for the test suite.

Uses Poetry for dependency management.

Installation

From source:

$ git clone https://gitlab.com/lilacashes/DuplicateImages.git
$ cd DuplicateImages
$ pip3 install poetry
$ poetry install

Running

$ poetry run find-dups $PICTURE_DIR

or

$ poetry run find-dups -h

for a list of all possible options.

Test suite

Running it all:

$ poetry run pytest
$ poetry run mypy duplicate_images tests
$ poetry run flake8
$ poetry run pylint duplicate_images tests

or simply

$ .git_hooks/pre-push

Setting the test suite to be run before every push:

$ cd .git/hooks
$ ln -s ../../.git_hooks/pre-push .

Publishing

$ poetry config repositories.testpypi https://test.pypi.org/legacy/
$ poetry build
$ poetry publish --username $PYPI_USER --password $PYPI_PASSWORD --repository testpypi && \
  poetry publish --username $PYPI_USER --password $PYPI_PASSWORD

(obviously assuming that username and password are the same on PyPI and TestPyPI)

Profiling

CPU time

To show the top functions by time spent in the function alone:

$ poetry run python -m cProfile -s tottime ./duplicate_images/duplicate.py \
    --algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1 | head -n 15

or, to show the top functions by time spent, including called functions:

$ poetry run python -m cProfile -s cumtime ./duplicate_images/duplicate.py \
    --algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1 | head -n 15
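
The two sort keys can also be compared from within Python. The following is a small sketch using the standard cProfile and pstats modules. The functions inner and outer are hypothetical stand-ins for real workload: tottime counts only the time spent in a function's own body, whereas cumtime also includes the functions it calls.

```python
import cProfile
import io
import pstats


def inner(count: int) -> int:
    """Does the actual work."""
    return sum(i * i for i in range(count))


def outer() -> list:
    """Does little itself, but spends a lot of cumulative time in inner()."""
    return [inner(10_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.runcall(outer)


def report(sort_key: str) -> str:
    """Render the collected statistics sorted by the given pstats key."""
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats(sort_key).print_stats()
    return stream.getvalue()


tottime_report = report("tottime")  # time spent in each function alone
cumtime_report = report("cumtime")  # time including called functions
```

In the cumtime report, outer ranks near the top because it includes everything it calls. In the tottime report it drops down, since almost all of the time is spent further down the call chain.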

Memory usage

$ poetry run fil-profile run ./duplicate_images/duplicate.py \
    --algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1

This will open a browser window showing the functions using the most memory (see https://pypi.org/project/filprofiler for more details).
