Finding Duplicate Images
Finds equal or similar images in a directory containing (many) image files.
Official home page: https://github.com/lene/DuplicateImages
Development page: https://gitlab.com/lilacashes/DuplicateImages
PyPI page: https://pypi.org/project/duplicate-images
Usage
Installing:
$ pip install duplicate_images
Printing the help screen:
$ find-dups -h
Quick test run:
$ find-dups $IMAGE_ROOT
Typical usage:
$ find-dups $IMAGE_ROOT --parallel --progress --hash-db hashes.json
Supported image formats
- JPEG and PNG (tested quite thoroughly)
- HEIC (experimental support, tested cursorily only)
- All other formats supported by the pillow Python Imaging Library should work, but are not
specifically tested.
Image comparison algorithms
Use the --algorithm option to select how equal images are found. The default algorithm is phash.
ahash, colorhash, dhash, dhash_vertical, phash, phash_simple, whash: seven different
image hashing algorithms. See https://pypi.org/project/ImageHash for an introduction to image
hashing and https://tech.okcupid.com/evaluating-perceptual-image-hashes-okcupid for some gory
details on which image hashing algorithm performs best in which situation. To start, I recommend
using phash, and only evaluating the other algorithms if phash does not perform satisfactorily
in your use case.
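To give a feel for what an image hash is, here is a minimal pure-Python sketch of the idea behind an average hash: every pixel of a small grayscale image becomes one bit, set when the pixel is brighter than the mean. This is an illustration only, not the code that find-dups or the ImageHash library runs. The function name average_hash_bits and the tiny 2x2 "images" are made up for the example.

```python
def average_hash_bits(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the mean.

    `pixels` is a list of rows of grayscale values (0-255). A real
    implementation would first shrink the image to a small fixed size.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


# Hypothetical 2x2 images: `brighter` is a slightly brighter copy of
# `original`, `unrelated` has the opposite light/dark pattern.
original = [[10, 200], [220, 30]]
brighter = [[20, 210], [230, 40]]
unrelated = [[200, 10], [30, 220]]

hash_original = average_hash_bits(original)
hash_brighter = average_hash_bits(brighter)
hash_unrelated = average_hash_bits(unrelated)

print(hash_original == hash_brighter)   # True: same hash despite the brightness change
print(hash_original == hash_unrelated)  # False
```

Because the hash only records which pixels are above the mean, a uniform brightness shift leaves it unchanged, which is the property that makes such hashes useful for finding similar images.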
Image similarity threshold configuration
Use the --hash-size parameter to tune the precision of the hashing algorithms. For the colorhash
algorithm the hash size is interpreted as the number of bin bits and defaults to 3. For all other
algorithms the hash size defaults to 8. For whash it must be a power of 2.
Use the --max-distance parameter to tune how close images should be to be considered duplicates.
The argument is a positive integer. Its value is highly dependent on the algorithm used and the
nature of the images compared, so the best value for your use case can only be found through
experimentation.
NOTE: using the --max-distance parameter slows down the comparison considerably with large
image collections, making the runtime complexity go from O(N) to O(N²). If you want to
scan collections with thousands of images or more, it is highly recommended to tune the desired
similarity threshold with the --hash-size parameter alone, if that is at all possible.
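The difference in complexity can be illustrated with a small sketch. It assumes hashes are plain integers and that exact matching can group files in a dictionary, while distance-based matching has to compare every pair. The function names and the sample hashes are hypothetical and do not reflect how find-dups is implemented internally.

```python
from collections import defaultdict
from itertools import combinations


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


def exact_matches(hashes):
    """Group files with identical hashes: one dictionary pass, O(N)."""
    groups = defaultdict(list)
    for name, value in hashes.items():
        groups[value].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def near_matches(hashes, max_distance):
    """Compare every pair of files: N * (N - 1) / 2 comparisons, O(N^2)."""
    return [
        (a, b)
        for (a, hash_a), (b, hash_b) in combinations(sorted(hashes.items()), 2)
        if hamming_distance(hash_a, hash_b) <= max_distance
    ]


# Hypothetical 8-bit hashes for four files.
hashes = {
    "a.jpg": 0b10110010,
    "b.jpg": 0b10110010,
    "c.jpg": 0b10110011,
    "d.jpg": 0b01001101,
}

print(exact_matches(hashes))    # [['a.jpg', 'b.jpg']]
print(near_matches(hashes, 1))  # also pairs c.jpg with a.jpg and b.jpg
```

With 10,000 images the pairwise variant already needs about 50 million comparisons, which is why tuning --hash-size alone scales better.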
Pre-storing and using image hashes to speed up computation
Use the --hash-db ${FILE}.json or --hash-db ${FILE}.pickle option to store image hashes in the
file $FILE in JSON or Pickle format and read image hashes from that file if they are already
present there. This avoids having to compute the image hashes anew at every run and can
significantly speed up run times.
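The caching idea can be sketched in a few lines. The on-disk layout used here (a flat JSON object mapping file names to hash strings) and the names load_hash_db and hashes_with_cache are assumptions made for illustration. The actual file format written by find-dups may differ.

```python
import json
from pathlib import Path


def load_hash_db(path):
    """Return the stored {file name: hash} mapping, or {} if no file exists yet."""
    path = Path(path)
    if not path.is_file():
        return {}
    return json.loads(path.read_text())


def hashes_with_cache(files, db_path, compute_hash):
    """Look up each file in the cache and compute only the missing hashes."""
    cache = load_hash_db(db_path)
    for name in files:
        if name not in cache:
            cache[name] = compute_hash(name)  # the expensive step
    # Write the (possibly extended) cache back for the next run.
    Path(db_path).write_text(json.dumps(cache, indent=2))
    return {name: cache[name] for name in files}
```

On a second run over the same directory, compute_hash is called only for files that were added since the cache was written.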
Actions for matching image pairs
Use the --on-equal option to select what to do to pairs of equal images. The default action is
print.
- delete-first or d1: deletes the first of the two files
- delete-second or d2: deletes the second of the two files
- delete-bigger or d>: deletes the file with the bigger size
- delete-smaller or d<: deletes the file with the smaller size
- eog: launches the eog image viewer to compare the two files (deprecated in favor of exec)
- xv: launches the xv image viewer to compare the two files (deprecated in favor of exec)
- print: prints the two files
- print_inline: like print but without newline
- quote: prints the two files quoted for POSIX shells
- quote_inline: like quote but without newline
- exec: executes a command (see the --exec argument)
- none: does nothing.
The --exec argument allows calling another program when the --on-equal exec option is given.
You can pass a command line string like --exec "program {1} {2}" where {1} and {2} are
replaced by the matching pair files.
Examples:
- --exec "open -a Preview -W {1} {2}": opens the files in the macOS Preview app and waits for it
  to exit.
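The placeholder substitution can be pictured roughly as follows. This is a sketch under the assumption that file names are shell-quoted before being inserted into the template. The source does not say how find-dups quotes or runs the command, and build_command is a hypothetical helper.

```python
import shlex


def build_command(template, first, second):
    """Replace {1} and {2} with shell-quoted file names and split into argv."""
    command = template.replace("{1}", shlex.quote(first))
    command = command.replace("{2}", shlex.quote(second))
    return shlex.split(command)


# Hypothetical file names containing a space, to show why quoting matters.
argv = build_command(
    "open -a Preview -W {1} {2}",
    "holiday/img 1.jpg",
    "backup/img 1.jpg",
)
print(argv)
# ['open', '-a', 'Preview', '-W', 'holiday/img 1.jpg', 'backup/img 1.jpg']
```

The resulting list could be handed to subprocess.run(argv), so that each file name reaches the program as a single argument.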
Parallel execution
Use the --parallel option to utilize all free cores on your system for calculating image hashes.
Serial execution
find-dups can also use an alternative algorithm which is O(N²) in the number of images.
Use the --serial option to use this alternative algorithm.
Progress bar and verbosity control
- --progress prints one progress bar for reading the images and another for finding duplicates
  among the scanned images
- --debug prints debugging output
- --quiet decreases the log level by 1 for each time it is called; --debug and --quiet cancel
  each other out
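How --debug and --quiet might combine can be sketched with argparse and the standard logging levels. The default level (INFO), the treatment of --debug as a simple flag, the reading of "decreasing the log level" as "fewer messages", and the name log_level are all assumptions for this illustration, not details taken from the find-dups source.

```python
import argparse
import logging

# Standard logging levels ordered from most to least verbose.
LEVELS = [logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL]


def log_level(argv, default=logging.INFO):
    """Derive a logging level from --debug and (repeatable) --quiet options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--quiet", action="count", default=0)
    args = parser.parse_args(argv)
    # --debug moves one step towards verbose, each --quiet one step away,
    # so a single --debug and a single --quiet cancel each other out.
    index = LEVELS.index(default) - int(args.debug) + args.quiet
    return LEVELS[max(0, min(index, len(LEVELS) - 1))]


print(logging.getLevelName(log_level(["--quiet", "--quiet"])))  # ERROR
print(logging.getLevelName(log_level(["--debug", "--quiet"])))  # INFO
```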
Development notes
Needs Python3, Pillow imaging library and pillow-heif HEIF plugin to run, additionally Wand for
the test suite.
Uses Poetry for dependency management.
Installation
From source:
$ git clone https://gitlab.com/lilacashes/DuplicateImages.git
$ cd DuplicateImages
$ pip3 install poetry
$ poetry install
Running
$ poetry run find-dups $PICTURE_DIR
or
$ poetry run find-dups -h
for a list of all possible options.
Test suite
Running it all:
$ poetry run pytest
$ poetry run mypy duplicate_images tests
$ poetry run flake8
$ poetry run pylint duplicate_images tests
or simply
$ .git_hooks/pre-push
Setting the test suite to be run before every push:
$ cd .git/hooks
$ ln -s ../../.git_hooks/pre-push .
Publishing
There is a job in GitLab CI for publishing to pypi.org that runs as soon as a new tag is added,
which happens automatically whenever an MR is merged. The tag is the same as the version in the
pyproject.toml file. For every MR it needs to be ensured that the version is not the same as an
already existing tag.
To publish the package on PyPI manually:
$ poetry config repositories.testpypi https://test.pypi.org/legacy/
$ poetry build
$ poetry publish --username $PYPI_USER --password $PYPI_PASSWORD --repository testpypi && \
poetry publish --username $PYPI_USER --password $PYPI_PASSWORD
(obviously assuming here that username and password are the same on PyPI and TestPyPI)
Updating GitHub mirror
GitHub is set up as a push mirror in GitLab CI, but mirroring is currently flaky and may not succeed.
To push to the GitHub repository manually (assuming the GitHub repository is set up as remote
github):
$ git checkout master
$ git fetch
$ git pull --rebase
$ git tag # to check that the latest tag is present
$ git push --tags github master
Profiling
CPU time
To show the top functions by time spent in the function alone:
$ poetry run python -m cProfile -s tottime ./duplicate_images/duplicate.py \
--algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1 | head -n 15
or, to show the top functions by time spent, including called functions:
$ poetry run python -m cProfile -s cumtime ./duplicate_images/duplicate.py \
--algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1 | head -n 15
Memory usage
$ poetry run fil-profile run ./duplicate_images/duplicate.py \
--algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1
This will open a browser window showing the functions using the most memory (see https://pypi.org/project/filprofiler for more details).
Contributors
- Lene Preuss (https://github.com/lene): primary developer
- Mike Reiche (https://github.com/mreiche): support for arbitrary actions, speedups
- https://github.com/beijingjazzpanda: bug fix