A library to find duplicate images and delete unwanted ones
py-image-dedup is a tool to sort out or remove duplicates within a photo library. Unlike most other solutions, py-image-dedup intentionally uses an approximate image comparison to also detect duplicates of images that slightly differ in resolution, color or other minor details.
It is built upon Image-Match, a very popular library that computes a pHash for an image and stores the result in an ElasticSearch backend for very high scalability.
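To illustrate what "approximate comparison" means in practice, here is a minimal sketch that compares two signature vectors by a normalized distance and treats them as duplicate candidates below a threshold. It is only an illustration, not Image-Match's actual implementation: the function names are hypothetical, the signatures are plain lists of integers, and the default threshold of 0.1 is borrowed from the PY_IMAGE_DEDUP_ELASTICSEARCH_MAX_DISTANCE value in the Docker example further down.

```python
import math
from typing import Sequence


def normalized_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two signatures, scaled by their norms."""
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    norm = math.sqrt(sum(x * x for x in a)) + math.sqrt(sum(y * y for y in b))
    # Two all-zero signatures are treated as identical.
    return diff / norm if norm else 0.0


def is_duplicate_candidate(a: Sequence[float], b: Sequence[float],
                           max_distance: float = 0.1) -> bool:
    """Hypothetical helper: True if the signatures are within max_distance."""
    return normalized_distance(a, b) <= max_distance
```

With a scheme like this, small changes in resolution or color shift the signature only slightly, so the distance stays below the threshold, while unrelated images end up far apart.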
How it works
Phase 1 - Database cleanup
In the first phase the elasticsearch backend is checked against the current filesystem state, cleaning up database entries of files that no longer exist. This will speed up queries made later on.
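The idea of this cleanup can be sketched as follows. The example assumes a plain dictionary keyed by file path as a stand-in for the elasticsearch index; the function name is hypothetical and this is not the library's own code.

```python
import os
from typing import Dict, List


def remove_stale_entries(index: Dict[str, dict]) -> List[str]:
    """Drop entries whose file no longer exists and return the removed paths."""
    # Collect first, then delete, so the dict is not modified while iterating.
    stale = [path for path in index if not os.path.exists(path)]
    for path in stale:
        del index[path]
    return stale
```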
Phase 2 - Counting files
Although not necessary for the deduplication process, it is very convenient to have some kind of progress indication while it is running. To be able to provide that, the available files must be counted beforehand.
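A minimal sketch of such a counting pass, assuming files are selected by extension (as in the file extension setting of the Docker example further down). The function name and parameters are hypothetical.

```python
import os
from typing import Iterable


def count_files(root: str, extensions: Iterable[str], recursive: bool = True) -> int:
    """Count files below root whose name ends with one of the extensions."""
    wanted = tuple(ext.lower() for ext in extensions)
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += sum(1 for name in filenames if name.lower().endswith(wanted))
        if not recursive:
            # Emptying dirnames in place stops os.walk from descending further.
            dirnames.clear()
    return total
```

The resulting total can then be used as the upper bound of a progress bar during the later phases.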
Phase 3 - Analysing files
In this phase every image file is analysed. This means generating a signature (pHash) to quickly compare it to other images and adding other metadata of the image to the elasticsearch backend that is used in the next phase.
This phase is quite CPU intensive and the first run may take quite
some time. Using as many threads as feasible (using the
is advised to get the best performance.
Since the database might already contain an entry for a file from a previous run, the file's modification time is compared to the stored one before the file is analysed. If the database entry still seems to be correct, the signature for this file will not be recalculated. Because of this, subsequent runs will be much faster. Some file access still has to happen though, so the speed is probably limited by that.
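The modification time check could look roughly like this. It is a sketch under assumptions: the index is a plain dictionary standing in for the database, and compute_signature is a hypothetical callable representing the expensive signature calculation.

```python
import os
from typing import Callable, Dict


def analyse_if_changed(path: str, index: Dict[str, dict],
                       compute_signature: Callable[[str], object]) -> bool:
    """Recompute the signature only if the file changed; return True if it did."""
    mtime = os.path.getmtime(path)  # this file access is still needed on every run
    entry = index.get(path)
    if entry is not None and entry["mtime"] == mtime:
        return False  # stored entry is still up to date, skip the expensive part
    index[path] = {"mtime": mtime, "signature": compute_signature(path)}
    return True
```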
Phase 4 - Finding duplicates
Every file is now processed again - but only by means of querying the
database backend for similar images (within the given
If there are images found that match the similarity criteria they are considered
duplicate candidates. All candidates are then ordered by the following
criteria (in this exact order):
- pixel count (more is better)
- EXIF data (more exif data is better)
- file size (bigger is better)
- file modification time (newer is better)
- distance (lower is better)
- filename contains "copy" (False is better)
- filename length (longer is better) - (for "edited" versions)
- parent folder path length (shorter is better)
- score (higher is better)
The first candidate in the resulting list is considered to be the best available version of all candidates.
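The ordering above maps naturally onto a tuple sort key. The following sketch assumes a hypothetical Candidate record; the field names are illustrative and not the library's actual schema, and the "copy" check is done case-insensitively as an assumption.

```python
import os
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    # Hypothetical record; field names are illustrative only.
    path: str
    pixel_count: int
    exif_tag_count: int
    file_size: int
    modified: float
    distance: float
    score: float


def sort_key(c: Candidate) -> Tuple:
    filename = os.path.basename(c.path)
    folder = os.path.dirname(c.path)
    return (
        -c.pixel_count,              # more pixels first
        -c.exif_tag_count,           # more EXIF data first
        -c.file_size,                # bigger files first
        -c.modified,                 # newer files first
        c.distance,                  # lower distance first
        "copy" in filename.lower(),  # False sorts before True (case-insensitive: assumption)
        -len(filename),              # longer filenames first
        len(folder),                 # shorter parent folder paths first
        -c.score,                    # higher score first
    )


def order_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates best-first; element 0 is the version to keep."""
    return sorted(candidates, key=sort_key)
```

Because tuples compare element by element, a later criterion only matters when all earlier ones are tied.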
Phase 5 - Moving/Deleting duplicates
All but the best version of duplicate candidates identified in the previous
phase are now deleted from the file system (if you didn't specify
--dry-run of course).
If duplicates_target_directory is set, the specified folder will be used as
a root directory to move duplicates to instead of deleting them, replicating their original folder structure.
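One way to move a duplicate into a target root while keeping its position relative to the source directory is sketched below. This is an assumption about how such a move could work, not the library's code; the function name and the dry_run parameter are hypothetical.

```python
import os
import shutil


def move_duplicate(file_path: str, source_root: str, target_root: str,
                   dry_run: bool = True) -> str:
    """Return the target path; move the file there unless dry_run is set."""
    # Keep the path relative to the source root below the target root.
    target = os.path.join(target_root, os.path.relpath(file_path, source_root))
    if not dry_run:
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(file_path, target)
    return target
```

Defaulting dry_run to True in this sketch means nothing is touched unless the caller explicitly asks for it.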
Phase 6 - Removing empty folders (Optional)
In the last phase, folders that are empty due to the deduplication process are deleted, cleaning up the directory structure (if turned on in configuration).
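A bottom-up walk is a simple way to remove empty folders, since a parent can only become empty after its children have been handled. The sketch below is a simplification: it removes every empty folder below the root, not only those emptied by the deduplication run, and the function name is hypothetical.

```python
import os
from typing import List


def remove_empty_folders(root: str) -> List[str]:
    """Delete empty folders below root (never root itself); return removed paths."""
    removed = []
    # topdown=False visits children before their parents.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        # Re-list the folder, because children may have just been removed.
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```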
How to use
Install py-image-dedup using pip:
pip3 install py-image-dedup
See py_image_dedup_reference.yaml in this repo for an example configuration.
Setup elasticsearch backend
Since this library is based on Image-Match you need a running elasticsearch instance for efficient storing and querying of image signatures.
This library requires elasticsearch version 5 or later. Sadly the Image-Match library still specifies version 2, so a fork of the original project is used instead. This fork is maintained by me, and any contributions are very much appreciated.
Set up the index
py-image-dedup uses a single index (called
images by default).
When configured, this index will be created automatically for you.
Command line usage
py-image-dedup can be used from the command line like this:
py-image-dedup deduplicate --help
Have a look at the help output to see how you can customize it.
CAUTION! The daemon feature is still very much a work in progress. Always have a backup of your data!
py-image-dedup has a built in daemon that allows you to continuously monitor your source directories and deduplicate them on the fly.
When running the daemon (and if enabled in the configuration), a Prometheus reporter is used to allow you to gather some statistical insights.
To analyze images and get an overview of which images would be deleted, be sure to make a dry run first.
py-image-dedup deduplicate --dry-run
If you want to run this on a FreeBSD host make sure you have an up to date release that is able to install ports.
Since Image-Match does a lot of
math, it relies on numpy and
scipy. To get those working on FreeBSD
you have to install them as ports:
pkg install pkgconf
pkg install py38-numpy
pkg install py27-scipy
For .png support you also need to install
pkg install png
I still ran into issues after installing all these and just threw those two in the mix and it finally worked:
pkg install freetype
pkg install py27-matplotlib # this has a LOT of dependencies
When using the python library
click on FreeBSD you might run into
encoding issues. To mitigate this change your locale from
This can be achieved, for example, by creating a file
~/.login_conf with the following content:
me:\
    :charset=ISO-8859-1:\
    :lang=de_DE.UTF-8:
To run py-image-dedup using docker you can use the markusressel/py-image-dedup image from DockerHub:
sudo docker run -t \
    -p 8000:8000 \
    -v /where/the/original/photolibrary/is/located:/data/in \
    -v /where/duplicates/should/be/moved/to:/data/out \
    -e PY_IMAGE_DEDUP_DRY_RUN=False \
    -e PY_IMAGE_DEDUP_ANALYSIS_SOURCE_DIRECTORIES=/data/in/ \
    -e PY_IMAGE_DEDUP_ANALYSIS_RECURSIVE=True \
    -e PY_IMAGE_DEDUP_ANALYSIS_ACROSS_DIRS=True \
    -e PY_IMAGE_DEDUP_ANALYSIS_FILE_EXTENSIONS=.png,.jpg,.jpeg \
    -e PY_IMAGE_DEDUP_ANALYSIS_THREADS=8 \
    -e PY_IMAGE_DEDUP_ANALYSIS_USE_EXIF_DATA=True \
    -e PY_IMAGE_DEDUP_DEDUPLICATION_DUPLICATES_TARGET_DIRECTORY=/data/out/ \
    -e PY_IMAGE_DEDUP_ELASTICSEARCH_AUTO_CREATE_INDEX=True \
    -e PY_IMAGE_DEDUP_ELASTICSEARCH_HOST=elasticsearch \
    -e PY_IMAGE_DEDUP_ELASTICSEARCH_PORT=9200 \
    -e PY_IMAGE_DEDUP_ELASTICSEARCH_INDEX=images \
    -e PY_IMAGE_DEDUP_ELASTICSEARCH_AUTO_CREATE_INDEX=True \
    -e PY_IMAGE_DEDUP_ELASTICSEARCH_MAX_DISTANCE=0.1 \
    -e PY_IMAGE_DEDUP_REMOVE_EMPTY_FOLDERS=False \
    -e PY_IMAGE_DEDUP_STATS_ENABLED=True \
    -e PY_IMAGE_DEDUP_STATS_PORT=8000 \
    markusressel/py-image-dedup:latest
Since an elasticsearch instance is required too, you can
also use the
docker-compose.yml file included in this repo which will
set up a single-node elasticsearch cluster too:
sudo docker-compose up
UID and GID
To run py-image-dedup inside the container using a specific user id
and group id you can use the env variables
GitHub is for social coding: if you want to write code, I encourage contributions through pull requests from forks of this repository. Create GitHub tickets for bugs and new features and comment on the ones that you are interested in.
py-image-dedup by Markus Ressel
Copyright (C) 2018 Markus Ressel

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.