Fast Diff Py
Multithreaded implementation of difpy with progress retention.
Motivation
Why does this project exist? A quick googling shows that there are plenty of programs and libraries solving this problem. Well, I wanted to write my own implementation, and quite frankly, I simply forgot to do the simple thing every programmer should do:
Google first - implement second. So here I am with a half-functioning implementation of my own.
In its current form (1.7.2023), the project is best suited to run on a device with a decent number of cores and very little RAM. At the moment, the workers don't store all images in RAM but keep reloading them from disk, leading to IO-bound performance. Considering that a dataset of ~20 GB of images condenses down to at most 100 MB of RAM used, this is a rather stupid endeavor.
Additionally, since I wanted the algorithm to be able to retain its progress when stopped, I opted to use a database for persistent storage of the results. Once again: stupid. Under the constraint of operating with less than 1 GB of RAM this would have made sense, but as it stands it is simply a very idiotic endeavor.
The database simply slows the system down. Testing with 1000 images on a 128-core system with 500 GB of RAM, the database interactions became the bottleneck with only 32 workers (so at most 1/4 of the system in use).
So what does this project solve then?
The only thing this implementation is good at is solving the O(n^2) problem of comparing every image against every other image. This is something I wasn't able to find implemented in any other fast library. However, here too we are constrained by the database, where it would be possible to do everything in RAM using only Python data structures: lists, dicts, and numpy arrays.
I leave this project as a cautionary tale; it might still give some inspiration to someone who wants to use all-to-all comparisons instead of finding neat encodings that solve the problem of finding similar images in O(n).
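The all-to-all comparison at the core of the project can be sketched in a few lines of plain Python. This is a minimal illustration of the O(n^2) pairing, not the project's actual worker code; the function names and the flat-pixel-list representation are assumptions for the example:

```python
from itertools import combinations

def mean_squared_error(a, b):
    """MSE between two equally sized flat pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def find_duplicates(images, threshold):
    """Compare every image against every other one: O(n^2) pairs."""
    duplicates = []
    for (name_a, pix_a), (name_b, pix_b) in combinations(images.items(), 2):
        if mean_squared_error(pix_a, pix_b) < threshold:
            duplicates.append((name_a, name_b))
    return duplicates

images = {
    "a.png": [0, 0, 10, 10],
    "b.png": [0, 1, 10, 10],   # nearly identical to a.png
    "c.png": [200, 200, 0, 0],
}
print(find_duplicates(images, threshold=50.0))  # [('a.png', 'b.png')]
```

With n images this evaluates n(n-1)/2 pairs, which is exactly the quadratic cost the project parallelizes across workers.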
Methods and Classes:
The main class of concern is FastDifPy. It contains the functions to perform the main tasks.
In addition, there are two database implementations available: SQLiteDatabase (default) and MariaDBDatabase.
Some testing I did showed that for small sets of images (~500) with 16 cores, SQLite outperforms MariaDB by 6% to 50%, depending on the configuration and algorithms used. For larger datasets MariaDB might be faster, but I stopped testing once I found the other repos.
Configuration options:
- cfg_path: Path where the config is stored
- update_timeout: Time interval after which the config on the file system is updated
- retain_config: Whether the config should be written to the file system
- thumbnail_size_x, thumbnail_size_y: Size the images are reduced down to
- p_root_dir_a: Directory to search for images
- p_root_dir_b: Second directory to compare against (if not set, the images of the single directory are compared against each other)
- thumb_dir_a: Thumbnail directory in directory a (ro)
- thumb_dir_b: Thumbnail directory in directory b (ro)
- has_dir_b: Whether the second directory is populated (ro)
- similarity_threshold: MSE of two thumbnails below which they are deemed duplicates
- ignore_names: List of filenames which are not indexed
- ignore_paths: List of paths which are not indexed
- enough_images_to_compare: Set by the index_dirs function; used to short-circuit the loops if there are not enough images to compare
- supported_file_types: List of supported file endings, including the '.'
- less_optimized: Whether a less optimized version of the second loop should be used (doesn't reuse an already loaded image)
- retry_limit: Number of times to search for a new thumbnail name before an error is raised
- verbose: Whether the child processes print verbose output
- state: Current state of the algorithm (indexing done, first loop, second loop, ...)
- fl_compute_thumbnails: Whether the thumbnails should be precomputed and written to disk in the first loop
- fl_compute_hash: Whether file hashes of the images should be computed in the first loop
- fl_shift_amount: How much the bits of the RGB image channels should be shifted
- fl_cpu_proc: Number of CPU processes to spawn for the first loop
- fl_inserted_counter: Number of images already processed
- fl_use_workers: Whether an enqueue and a dequeue worker should be used instead of having the main process handle it (requires a thread-safe DB)
- database: Contains the dict created by the create_config_dump method of the databases
- retain_db: Whether the database config should be stored (possibly undesirable because of the MariaDB password)
- max_queue_size: Maximum size of the queues created for the child processes
- sl_matching_hash: Whether images with a matching file hash should still be compared by MSE
- sl_has_thumb: Whether the thumbnails already exist and don't need to be computed
- sl_matching_aspect: Whether the aspect ratios of two images must match before they are compared by MSE
- sl_make_diff_plots: Whether plots of image pairs with an MSE below the threshold should be generated
- sl_plot_output_dir: Directory where the plots are stored
- sl_gpu_proc: Number of GPU processes for the second loop (requires cupy)
- sl_cpu_proc: Number of CPU processes for the second loop
- sl_queue_status: Contains the progress info for the second loop (should not be touched)
- sl_base_a: Whether the fixed images come from directory a (should not be touched)
- sl_use_workers: Whether an enqueue and a dequeue worker should be used instead of having the main process handle it (requires a thread-safe DB)
- sl_use_special_b_algo: Whether a different algorithm for the fixed images should be used in the second loop when a directory b is present (slight speed-up by reusing processes at the end of the loop)
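The fl_shift_amount option describes a coarse quantization of the 8-bit RGB channel values: a right shift discards the low-order bits, so nearly identical colors collapse to the same value before comparison. A minimal sketch of the idea (an illustration only, not the library's internal code):

```python
def shift_channels(pixels, shift):
    """Drop the low `shift` bits of each 8-bit channel value."""
    return [(v >> shift) << shift for v in pixels]

# Two slightly different pixel rows collapse to the same values
row_a = [200, 117, 33]
row_b = [203, 118, 35]
print(shift_channels(row_a, 4))  # [192, 112, 32]
print(shift_channels(row_b, 4))  # [192, 112, 32]
```

A larger shift makes the comparison more tolerant of small per-pixel differences, at the cost of merging genuinely distinct colors.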
Example

```python
db = FastDifPy(directory_a="/path/to/folder")
db.index_the_dirs()
db.estimate_disk_usage()
db.first_loop_iteration(compute_hash=True, amount=4, cpu_proc=16)
db.second_loop_iteration(only_matching_aspect=False,
                         make_diff_plots=False,
                         diff_location="/path/to/plot/folder",
                         similarity_threshold=200.0,
                         cpu_proc=16)
db.print_preprocessing_errors()
clusters, low_quality = db.get_duplicates()
db.clean_up()
```
The first loop computes the thumbnails and compresses the images; the second loop performs the all-to-all comparison. To mimic the plots generated by the Duplicate-Image-Finder, plots can be generated and stored in the provided directory. This seemed like a more useful solution than opening potentially thousands of plot windows.
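The thumbnail step of the first loop amounts to downscaling each image before the MSE comparison. A nearest-neighbour downscale over a flat row-major grayscale buffer can be sketched as follows (an illustration of the concept; the project itself resizes through its own image pipeline):

```python
def downscale(pixels, src_w, src_h, dst_w, dst_h):
    """Nearest-neighbour downscale of a flat row-major grayscale image."""
    out = []
    for y in range(dst_h):
        src_y = y * src_h // dst_h        # nearest source row
        for x in range(dst_w):
            src_x = x * src_w // dst_w    # nearest source column
            out.append(pixels[src_y * src_w + src_x])
    return out

# 4x4 gradient reduced to 2x2
img = list(range(16))
print(downscale(img, 4, 4, 2, 2))  # [0, 2, 8, 10]
```

Comparing thumbnails instead of full images is what keeps the per-pair MSE cost small enough for the quadratic loop to be feasible.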
An additional capability of this project is the config. Everything in db.config is stored periodically to disk, which allows an interrupted process to be recovered. At the moment there is no implementation of a recovery function (also since it might depend on the use case of a given user). But, provided no custom config path is used, the FastDiffPyConfig can be instantiated and the current progress state recovered from its attributes. The config simply needs to be set as an attribute and the database reconnected. Progress recovery is not fully implemented, but the basic tools for it are there.
Further TODOs:
- Implement RAM Only version with progress saved periodically to disk
- Optimize SQL Queries
- Add better handling for ungraceful shutdown of loops in the database
- Add functionality to extract hash based duplicates
- Add arbitrary hash function support
- Add arbitrary all to all comparison function support.
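The "hash based duplicates" TODO amounts to grouping files whose file hashes match, since byte-identical files need no MSE comparison at all. A minimal stdlib sketch of that grouping (the helper names here are hypothetical, not part of the library):

```python
import hashlib
from collections import defaultdict

def hash_bytes(data):
    """SHA-256 digest of raw file contents."""
    return hashlib.sha256(data).hexdigest()

def group_exact_duplicates(files):
    """Group file names by content hash; groups of >1 are byte-identical."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hash_bytes(data)].append(name)
    return [names for names in groups.values() if len(names) > 1]

files = {"a.jpg": b"\x00\x01", "b.jpg": b"\x00\x01", "c.jpg": b"\x02"}
print(group_exact_duplicates(files))  # [['a.jpg', 'b.jpg']]
```

This is O(n) in the number of files and would let the second loop skip pairs that are already known to be identical.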
Similar Projects:
- Duplicate-Image-Finder (the project this is based on)
- imagededup