Skip to main content
Help the Python Software Foundation raise $60,000 USD by December 31st!  Building the PSF Q4 Fundraiser

Find similar images in a dataset

Project description

:monkey: simages:monkey:

PyPI version Build Status Documentation Status DOI Binder

Find similar images within a dataset.

Useful for removing duplicate images from a dataset after scraping images with google-images-download.

The Python API returns pairs, duplicates, where pairs are the (ordered) closest pairs and distances is the corresponding embedding distance.


See the installation docs for all details.

pip install simages

or install from source:

git clone
cd simages
pip install .

To install the interactive interface, install mongodb and use rather pip install "simages[all]".


  1. Minimal command-line interface with simages-show:


  1. Interactive image deletion with simages add/find: simages_web_demo


Two interfaces exist:

  1. matplotlib interface which plots the duplicates for visual inspection
  2. mongodb + flask interface which allows interactive deletion [optional]

Minimal interface

In your console, enter the directory with images and use simages-show:

$ simages-show --data-dir .
usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
                    [--epochs EPOCHS] [--num-channels NUM_CHANNELS]
                    [--pairs PAIRS] [--zdim ZDIM] [-s]

  -h, --help            show this help message and exit
  --data-dir DATA_DIR, -d DATA_DIR
                        Folder containing image data
  --show-train, -t      Show training of embedding extractor every epoch
  --epochs EPOCHS, -e EPOCHS
                        Number of passes of dataset through model for
                        training. More is better but takes more time.
  --num-channels NUM_CHANNELS, -c NUM_CHANNELS
                        Number of channels for data (1 for grayscale, 3 for
  --pairs PAIRS, -p PAIRS
                        Number of pairs of images to show
  --zdim ZDIM, -z ZDIM  Compression bits (bigger generally performs better but
                        takes more time)
  -s, --show            Show closest pairs

Web Interface [Optional]

Note: To install the web interface API, install and run mongodb and use pip install "simages[all]" to install optional dependencies.

Add your pictures to the database (this will take some time depending on the number of pictures)

simages add <images_folder_path>

A webpage will come up with all of the similar or duplicate pictures:

simages find <images_folder_path>
    simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
    simages remove <path> ... [--db=<db_path>]
    simages clear [--db=<db_path>]
    simages show [--db=<db_path>]
    simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
    simages -h | --help
    -h, --help                Show this screen
    --db=<db_path>            The location of the database or a MongoDB URI. (default: ./db)
    --parallel=<num_processes> The number of parallel processes to run to hash the image
                               files (default: number of CPUs).
        --print               Only print duplicate files rather than displaying HTML file
        --delete              Move all found duplicate pictures to the trash. This option takes priority over --print.
        --match-time          Adds the extra constraint that duplicate images must have the
                              same capture times in order to be considered.
        --trash=<trash_path>  Where files will be put when they are deleted (default: ./Trash)
        --epochs=<epochs>     Epochs for training [default: 2]

Python APIs

Numpy array

from simages import find_duplicates
import numpy as np

array_data = np.random.random(100, 3, 48, 48)# N x C x H x W
pairs, distances = find_duplicates(array_data)


from simages import find_duplicates

data_dir = "my_images_folder"
pairs, distances = find_duplicates(data_dir)

Default options for find_duplicates are:

def find_duplicates(
    input: Union[str or np.ndarray],
    n: int = 5,
    num_epochs: int = 2,
    num_channels: int = 3,
    show: bool = False,
    show_train: bool = False,
    """Find duplicates in dataset. Either `array` or `data_dir` must be specified.

        input (str or np.ndarray): folder directory or N x C x H x W array
        n (int): number of closest pairs to identify
        num_epochs (int): how long to train the autoencoder (more is generally better)
        show (bool): display the closest pairs
        show_train (bool): show output every
        z_dim (int): size of compression (more is generally better, but slower)
        kwargs (dict): etc, passed to `EmbeddingExtractor`

        pairs (np.ndarray): indices for closest pairs of images, n x 2 array
        distances (np.ndarray): distances of each pair to each other

Embeddings API

from simages import Embeddings
import numpy as np

N = 1000
data = np.random.random((N, 28, 28))
embeddings = Embeddings(data)

# Access the array
array = embeddings.array # N x z (compression size)

# Get 10 closest pairs of images
pairs, distances = embeddings.duplicates(n=5)
In [0]: pairs
Out[0]: array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])

In [1]: distances
Out[1]: array([0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721])

EmbeddingExtractor API

from simages import EmbeddingExtractor
import numpy as np

N = 1000
data = np.random.random((N, 28, 28))
extractor = EmbeddingExtractor(data, num_channels=1) # grayscale

# Show 10 closest pairs of images
pairs, distances = extractor.show_duplicates(n=10)

Class attributes and parameters:

class EmbeddingExtractor:
    """Extract embeddings from data with models and allow visualization.

        trainloader (torch loader)
        evalloader (torch loader)
        model (torch.nn.Module)
        embeddings (np.ndarray)

    def __init__(
        input:Union[str, np.ndarray],
    """Inits EmbeddingExtractor with input, either `str` or `np.nd.array`, performs training and validation.

    input (np.ndarray or str): data
    num_channels (int): grayscale = 1, color = 3
    num_epochs (int): more is better (generally)
    batch_size (int): number of images per batch
    show_train (bool): show intermediate training results
    show (bool): show closest pairs
    z_dim (int): compression size
    kwargs (dict)


Specify tne number of pairs to identify with the parameter n.

How it works

simages uses a convolutional autoencoder with PyTorch and compares the latent representations with closely :triangular_ruler:.


simages depends on the following packages:

Optional dependencies, installed with pip install simages[all] include:

  • pymongodb
  • fastcluster
  • flask
  • jinja
  • dnspython
  • python-magic
  • termcolor


If you use simages, please cite it:

      author       = {Justin Shenk},
      title        = {justinshenk/simages: v19.0.1},
      month        = jun,
      year         = 2019,
      doi          = {10.5281/zenodo.3237830},
      url          = {}

Project details

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for simages, version 19.0.2.post1
Filename, size File type Python version Upload date Hashes
Filename, size simages-19.0.2.post1-py3-none-any.whl (24.0 kB) File type Wheel Python version py3 Upload date Hashes View
Filename, size simages-19.0.2.post1.tar.gz (11.7 MB) File type Source Python version None Upload date Hashes View

Supported by

Pingdom Pingdom Monitoring Google Google Object Storage and Download Analytics Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page