Find similar images in a dataset
Project description
:monkey: simages:monkey:
Find similar images within a dataset.
Useful for removing duplicate images from a dataset after scraping images with google-images-download.
The Python API returns pairs, duplicates
, where pairs are the (ordered) closest pairs and distances is the
corresponding embedding distance.
Install
See the installation docs for all details.
pip install simages
or install from source:
git clone https://github.com/justinshenk/simages
cd simages
pip install .
To install the interactive interface, install mongodb and use rather pip install "simages[all]"
.
Demo
- Minimal command-line interface with
simages-show
:
- Interactive image deletion with
simages add/find
:
Usage
Two interfaces exist:
- minimal interface which plots the duplicates for visual inspection
- mongodb + flask interface which allows interactive deletion [optional]
Minimal Interface
In your console, enter the directory with images and use simages-show
:
$ simages-show --data-dir .
usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
[--epochs EPOCHS] [--num-channels NUM_CHANNELS]
[--pairs PAIRS] [--zdim ZDIM] [-s]
-h, --help show this help message and exit
--data-dir DATA_DIR, -d DATA_DIR
Folder containing image data
--show-train, -t Show training of embedding extractor every epoch
--epochs EPOCHS, -e EPOCHS
Number of passes of dataset through model for
training. More is better but takes more time.
--num-channels NUM_CHANNELS, -c NUM_CHANNELS
Number of channels for data (1 for grayscale, 3 for
color)
--pairs PAIRS, -p PAIRS
Number of pairs of images to show
--zdim ZDIM, -z ZDIM Compression bits (bigger generally performs better but
takes more time)
-s, --show Show closest pairs
Web Interface [Optional]
Note: To install the web interface API, install and run mongodb and use pip install "simages[all]"
to install optional dependencies.
Add your pictures to the database (this will take some time depending on the number of pictures)
simages add <images_folder_path>
A webpage will come up with all of the similar or duplicate pictures:
simages find <images_folder_path>
Usage:
simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
simages remove <path> ... [--db=<db_path>]
simages clear [--db=<db_path>]
simages show [--db=<db_path>]
simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
simages -h | --help
Options:
-h, --help Show this screen
--db=<db_path> The location of the database or a MongoDB URI. (default: ./db)
--parallel=<num_processes> The number of parallel processes to run to hash the image
files (default: number of CPUs).
find:
--print Only print duplicate files rather than displaying HTML file
--delete Move all found duplicate pictures to the trash. This option takes priority over --print.
--match-time Adds the extra constraint that duplicate images must have the
same capture times in order to be considered.
--trash=<trash_path> Where files will be put when they are deleted (default: ./Trash)
--epochs=<epochs> Epochs for training [default: 2]
Python APIs
Numpy array
from simages import find_duplicates
import numpy as np
array_data = np.random.random(100, 3, 48, 48)# N x C x H x W
pairs, distances = find_duplicates(array_data)
Folder
from simages import find_duplicates
data_dir = "my_images_folder"
pairs, distances = find_duplicates(data_dir)
Default options for find_duplicates
are:
def find_duplicates(
input: Union[str or np.ndarray],
n: int = 5,
num_epochs: int = 2,
num_channels: int = 3,
show: bool = False,
show_train: bool = False,
**kwargs
):
"""Find duplicates in dataset. Either `array` or `data_dir` must be specified.
Args:
input (str or np.ndarray): folder directory or N x C x H x W array
n (int): number of closest pairs to identify
num_epochs (int): how long to train the autoencoder (more is generally better)
show (bool): display the closest pairs
show_train (bool): show output every
z_dim (int): size of compression (more is generally better, but slower)
kwargs (dict): etc, passed to `EmbeddingExtractor`
Returns:
pairs (np.ndarray): indices for closest pairs of images, n x 2 array
distances (np.ndarray): distances of each pair to each other
Embeddings
API
from simages import Embeddings
import numpy as np
N = 1000
data = np.random.random((N, 28, 28))
embeddings = Embeddings(data)
# Access the array
array = embeddings.array # N x z (compression size)
# Get 10 closest pairs of images
pairs, distances = embeddings.duplicates(n=5)
In [0]: pairs
Out[0]: array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])
In [1]: distances
Out[1]: array([0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721])
EmbeddingExtractor
API
from simages import EmbeddingExtractor
import numpy as np
N = 1000
data = np.random.random((N, 28, 28))
extractor = EmbeddingExtractor(data, num_channels=1) # grayscale
# Show 10 closest pairs of images
pairs, distances = extractor.show_duplicates(n=10)
Class attributes and parameters:
class EmbeddingExtractor:
"""Extract embeddings from data with models and allow visualization.
Attributes:
trainloader (torch loader)
evalloader (torch loader)
model (torch.nn.Module)
embeddings (np.ndarray)
"""
def __init__(
self,
input:Union[str, np.ndarray],
num_channels=None,
num_epochs=2,
batch_size=32,
show_train=True,
show=False,
z_dim=8,
**kwargs,
):
"""Inits EmbeddingExtractor with input, either `str` or `np.nd.array`, performs training and validation.
Args:
input (np.ndarray or str): data
num_channels (int): grayscale = 1, color = 3
num_epochs (int): more is better (generally)
batch_size (int): number of images per batch
show_train (bool): show intermediate training results
show (bool): show closest pairs
z_dim (int): compression size
kwargs (dict)
"""
Specify tne number of pairs to identify with the parameter n
.
How it works
simages uses a convolutional autoencoder with PyTorch and compares the latent representations with closely :triangular_ruler:.
Dependencies
simages depends on the following packages:
- closely
- torch
- torchvision
- scikit-learn
- matplotlib
The following dependencies are required for the interactive deleting interface:
- pymongodb
- fastcluster
- flask
- jinja2
- dnspython
- python-magic
- termcolor
Cite
If you use simages, please cite it:
@misc{justin_shenk_2019_3237830,
author = {Justin Shenk},
title = {justinshenk/simages: v19.0.1},
month = jun,
year = 2019,
doi = {10.5281/zenodo.3237830},
url = {https://doi.org/10.5281/zenodo.3237830}
}
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for simages-19.0.3.dev0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 74bd0741c8131a5248dcb68275b750bdbc2b686dfc0e5ba0db0aefb372d9a49f |
|
MD5 | 7be1e95e4880be65217c4eff1b341f39 |
|
BLAKE2b-256 | bc9167c1d79b8789dbe89d43c34797f24d34acf3e8c372663b05bdca7ef761e0 |