Finding Duplicate Images
Finds equal or similar images in a directory containing (many) image files.
Usage
Installing:
$ pip install duplicate_images
Printing the help screen:
$ find-dups -h
Quick test run:
$ find-dups $IMAGE_ROOT
Typical usage:
$ find-dups $IMAGE_ROOT \
--parallel --progress \
--algorithm phash --on-equal print \
--hash-db hashes.pickle
Image comparison algorithms
Use the --algorithm option to select how equal images are found.
ahash, colorhash, dhash, phash, whash: five different image hashing algorithms. See
https://pypi.org/project/ImageHash for an introduction to image hashing and
https://tech.okcupid.com/evaluating-perceptual-image-hashes-okcupid for the gory details of which
image hashing algorithm performs best in which situation. As a starting point I recommend phash,
evaluating the other algorithms only if phash does not perform satisfactorily in your use case.
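To illustrate the idea behind these hashing algorithms, here is a toy sketch of the simplest one, the average hash (ahash). This is not the library's actual implementation: it assumes the image has already been downscaled to a small grayscale grid (4x4 here for brevity; real ahash implementations typically use 8x8), and the pixel values are made up for the example.

```python
# Toy sketch of the "average hash" (ahash) idea -- not the ImageHash
# library's implementation. Input: a small grayscale grid (values 0-255).

def average_hash(pixels: list[list[int]]) -> int:
    """Set one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(h1: int, h2: int) -> int:
    """Number of differing bits; a small distance means similar images."""
    return bin(h1 ^ h2).count("1")

a = [[10, 200, 10, 200]] * 4   # striped pattern
b = [[12, 198, 11, 201]] * 4   # same pattern, slightly different brightness
c = [[200, 10, 200, 10]] * 4   # inverted pattern
```

Here `a` and `b` hash identically (distance 0) despite differing pixel values, while the inverted pattern `c` differs in all 16 bits, which is exactly the robustness to small changes that makes perceptual hashes useful for finding near-duplicates.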
Actions for matching image pairs
Use the --on-equal option to select what to do with pairs of equal images.
delete-first: deletes the first of the two files
delete-second: deletes the second of the two files
delete-bigger or d>: deletes the file with the bigger size
delete-smaller or d<: deletes the file with the smaller size
eog: launches the eog image viewer to compare the two files
xv: launches the xv image viewer to compare the two files
print: prints the two files
quote: prints the two files with quotes around each
none: does nothing
The default action is print.
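Conceptually, each action maps a name to a function taking the two matching file paths. The sketch below is hypothetical, not the project's actual code; the viewer actions (eog, xv) are omitted since they would spawn external processes.

```python
# Hypothetical sketch of dispatching --on-equal actions. The action names
# match the documented options; the dispatch mechanism is an illustration.
import os
from typing import Callable, Dict, Tuple

Action = Callable[[str, str], None]

ACTIONS: Dict[str, Action] = {
    "delete-first": lambda first, second: os.remove(first),
    "delete-second": lambda first, second: os.remove(second),
    "print": lambda first, second: print(first, second),
    "quote": lambda first, second: print(f'"{first}" "{second}"'),
    "none": lambda first, second: None,
}

def on_equal(action: str, pair: Tuple[str, str]) -> None:
    ACTIONS[action](*pair)
```

A lookup table like this makes adding a new action a one-line change, which is one plausible reason the tool exposes actions as named options.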
Parallel execution
Use the --parallel
option to utilize all free cores on your system.
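Hashing many images is embarrassingly parallel, since each file can be hashed independently. The sketch below shows the general pattern with the standard library; it is not the project's actual implementation (a thread pool is used here for simplicity, while the real --parallel option distributes work across CPU cores), and `fake_hash` is a stand-in for a real image-hash function.

```python
# Illustrative sketch of hashing files in parallel -- not the project's code.
# `fake_hash` hashes the path string instead of actual image contents.
from concurrent.futures import ThreadPoolExecutor
import hashlib

def fake_hash(path: str) -> str:
    # Placeholder for a real (and much more expensive) image-hash function.
    return hashlib.md5(path.encode()).hexdigest()

def hash_all(paths: list[str]) -> dict[str, str]:
    # Each file is hashed independently, so the work parallelizes trivially.
    with ThreadPoolExecutor() as pool:
        return dict(zip(paths, pool.map(fake_hash, paths)))
```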
Progress and verbosity control
--progress: prints a progress bar for each phase: reading the images, and finding duplicates among the scanned images
--debug: prints debugging output
Pre-storing and using image hashes to speed up computation
Use the --hash-db $PICKLE_FILE
option to store image hashes in the file $PICKLE_FILE
and read
image hashes from that file if they are already present there. This avoids having to compute the
image hashes anew at every run and can significantly speed up run times.
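The caching pattern behind --hash-db can be sketched as follows. This is a conceptual illustration of what the option does, not the project's actual code; `cached_hashes` and its signature are hypothetical.

```python
# Minimal sketch of a pickle-backed hash cache, illustrating what --hash-db
# does conceptually. Names here are hypothetical, not the project's API.
import pickle
from pathlib import Path
from typing import Callable, Dict, List

def cached_hashes(db: Path, paths: List[str],
                  compute: Callable[[str], str]) -> Dict[str, str]:
    cache: Dict[str, str] = {}
    if db.exists():
        with db.open("rb") as f:      # reuse hashes from earlier runs
            cache = pickle.load(f)
    for path in paths:
        if path not in cache:         # only hash files not seen before
            cache[path] = compute(path)
    with db.open("wb") as f:          # persist for the next run
        pickle.dump(cache, f)
    return cache
```

On the second run only new files are hashed, which is where the speed-up comes from when the image collection changes little between runs.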
Development notes
Requires Python 3 and the Pillow imaging library to run; the test suite additionally requires Wand.
Uses Poetry for dependency management.
Installation
From source:
$ git clone https://gitlab.com/lilacashes/DuplicateImages.git
$ cd DuplicateImages
$ pip3 install poetry
$ poetry install
Running
$ poetry run find-dups $PICTURE_DIR
or
$ poetry run find-dups -h
for a list of all possible options.
Test suite
Running it all:
$ poetry run pytest
$ poetry run mypy duplicate_images tests
$ poetry run flake8
$ poetry run pylint duplicate_images tests
or simply
$ .git_hooks/pre-push
Setting the test suite to be run before every push:
$ cd .git/hooks
$ ln -s ../../.git_hooks/pre-push .
Publishing
$ poetry config repositories.testpypi https://test.pypi.org/legacy/
$ poetry build
$ poetry publish --username $PYPI_USER --password $PYPI_PASSWORD --repository testpypi && \
poetry publish --username $PYPI_USER --password $PYPI_PASSWORD
(obviously assuming that username and password are the same on PyPI and TestPyPI)
Profiling
CPU time
To show the top functions by time spent in the function alone:
$ poetry run python -m cProfile -s tottime ./duplicate_images/duplicate.py \
--algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1 | head -n 15
or, to show the top functions by cumulative time spent, including called functions:
$ poetry run python -m cProfile -s cumtime ./duplicate_images/duplicate.py \
--algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1 | head -n 15
Memory usage
$ poetry run fil-profile run ./duplicate_images/duplicate.py \
--algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1
This will open a browser window showing the functions using the most memory (see https://pypi.org/project/filprofiler for more details).