
Find similar images in a dataset


🐒 simages 🐒


Find similar images within a dataset.

Useful for removing duplicate images from a dataset after scraping images with google-images-download.

The Python API returns pairs, distances, where pairs holds the closest pairs of images (ordered from nearest to farthest) and distances holds the corresponding embedding distance of each pair.

Install

See the installation docs for all details.

pip install simages

or install from source:

git clone https://github.com/justinshenk/simages
cd simages
pip install .

To install the interactive interface, install mongodb and use pip install "simages[all]" instead.

Demo

  1. Minimal command-line interface with simages-show
  2. Interactive image deletion with simages add/find

Usage

Two interfaces exist:

  1. minimal interface which plots the duplicates for visual inspection
  2. mongodb + flask interface which allows interactive deletion [optional]

Minimal Interface

In your console, enter the directory with images and use simages-show:

$ simages-show --data-dir .
usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
                    [--epochs EPOCHS] [--num-channels NUM_CHANNELS]
                    [--pairs PAIRS] [--zdim ZDIM] [-s]

  -h, --help            show this help message and exit
  --data-dir DATA_DIR, -d DATA_DIR
                        Folder containing image data
  --show-train, -t      Show training of embedding extractor every epoch
  --epochs EPOCHS, -e EPOCHS
                        Number of passes of dataset through model for
                        training. More is better but takes more time.
  --num-channels NUM_CHANNELS, -c NUM_CHANNELS
                        Number of channels for data (1 for grayscale, 3 for
                        color)
  --pairs PAIRS, -p PAIRS
                        Number of pairs of images to show
  --zdim ZDIM, -z ZDIM  Compression bits (bigger generally performs better but
                        takes more time)
  -s, --show            Show closest pairs
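
When simages-show needs to be launched from a script, the documented flags can be assembled programmatically. The sketch below only builds the argument list; `build_show_command` is a hypothetical helper, not part of simages, and it uses only the flags listed in the help text above.

```python
import shlex


def build_show_command(data_dir, epochs=None, num_channels=None, pairs=None,
                       zdim=None, show_train=False, show=False):
    """Assemble an argument list for `simages-show` from the documented flags."""
    cmd = ["simages-show", "--data-dir", str(data_dir)]
    # Options that take a value are added only when one was supplied.
    for flag, value in (("--epochs", epochs), ("--num-channels", num_channels),
                        ("--pairs", pairs), ("--zdim", zdim)):
        if value is not None:
            cmd += [flag, str(value)]
    # Boolean switches are appended without a value.
    if show_train:
        cmd.append("--show-train")
    if show:
        cmd.append("--show")
    return cmd


command = build_show_command("my_images_folder", epochs=5, pairs=10, show=True)
print(shlex.join(command))
# To actually run it (requires simages to be installed):
# import subprocess; subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems when the folder name contains spaces.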

Web Interface [Optional]

Note: The web interface is optional. To use it, install and run mongodb, then install the optional dependencies with pip install "simages[all]".

Add your pictures to the database (this can take a while, depending on the number of pictures):

simages add <images_folder_path>

Find similar or duplicate pictures; a web page opens showing them:

simages find <images_folder_path>

Full command reference:

Usage:
    simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
    simages remove <path> ... [--db=<db_path>]
    simages clear [--db=<db_path>]
    simages show [--db=<db_path>]
    simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
    simages -h | --help
Options:
    -h, --help                Show this screen
    --db=<db_path>            The location of the database or a MongoDB URI. (default: ./db)
    --parallel=<num_processes> The number of parallel processes to run to hash the image
                               files (default: number of CPUs).
    find:
        --print               Only print duplicate files rather than displaying HTML file
        --delete              Move all found duplicate pictures to the trash. This option takes priority over --print.
        --match-time          Adds the extra constraint that duplicate images must have the
                              same capture times in order to be considered.
        --trash=<trash_path>  Where files will be put when they are deleted (default: ./Trash)
        --epochs=<epochs>     Epochs for training [default: 2]
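
The --delete option is described as moving duplicates to a trash folder rather than erasing them. A minimal sketch of that idea, assuming a list of duplicate file paths is already known, could look like the following. `move_to_trash` is a hypothetical helper written for illustration and is not the simages implementation.

```python
import shutil
from pathlib import Path


def move_to_trash(duplicates, trash_dir="./Trash"):
    """Move each file in `duplicates` into `trash_dir` and return the new paths."""
    trash = Path(trash_dir)
    trash.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in map(Path, duplicates):
        dest = trash / src.name
        # Avoid overwriting a file with the same name already in the trash.
        counter = 1
        while dest.exists():
            dest = trash / f"{src.stem}_{counter}{src.suffix}"
            counter += 1
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Moving instead of deleting keeps the operation reversible: a file that was flagged by mistake can simply be moved back.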

Python APIs

Numpy array

from simages import find_duplicates
import numpy as np

array_data = np.random.random((100, 3, 48, 48))  # N x C x H x W
pairs, distances = find_duplicates(array_data)
 

Folder

from simages import find_duplicates

data_dir = "my_images_folder"
pairs, distances = find_duplicates(data_dir)
 

Default options for find_duplicates are:

def find_duplicates(
    input: Union[str, np.ndarray],
    n: int = 5,
    num_epochs: int = 2,
    num_channels: int = 3,
    show: bool = False,
    show_train: bool = False,
    **kwargs
):
    """Find duplicates in dataset. `input` is either a folder path or an array.

    Args:
        input (str or np.ndarray): folder directory or N x C x H x W array
        n (int): number of closest pairs to identify
        num_epochs (int): how long to train the autoencoder (more is generally better)
        num_channels (int): number of channels for data (1 for grayscale, 3 for color)
        show (bool): display the closest pairs
        show_train (bool): show training output every epoch
        kwargs (dict): passed to `EmbeddingExtractor`, e.g. z_dim (int), the size
            of compression (more is generally better, but slower)

    Returns:
        pairs (np.ndarray): indices for closest pairs of images, n x 2 array
        distances (np.ndarray): distance between the two images of each pair
    """
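
Because find_duplicates always returns the n closest pairs, some of them may not be true duplicates. One way to handle this is to keep only the pairs below a distance cutoff. The sketch below assumes such a cutoff is chosen by the user; `select_duplicates`, `max_distance`, and the sample values are hypothetical and are not part of simages.

```python
def select_duplicates(pairs, distances, max_distance):
    """Return the pairs whose embedding distance is at most `max_distance`."""
    return [
        (int(i), int(j))
        for (i, j), d in zip(pairs, distances)
        if d <= max_distance
    ]


# Illustrative values only; real ones come from find_duplicates().
example_pairs = [(0, 7), (3, 4), (2, 9)]
example_distances = [0.001, 0.02, 0.5]
likely_duplicates = select_duplicates(example_pairs, example_distances, max_distance=0.05)
print(likely_duplicates)
```

A suitable cutoff depends on the dataset and on the trained embedding, so it is worth inspecting a few pairs visually (for example with show=True) before settling on a value.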

Embeddings API

from simages import Embeddings
import numpy as np

N = 1000
data = np.random.random((N, 28, 28))
embeddings = Embeddings(data)

# Access the array
array = embeddings.array # N x z (compression size)

# Get 5 closest pairs of images
pairs, distances = embeddings.duplicates(n=5)

In [0]: pairs
Out[0]: array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])

In [1]: distances
Out[1]: array([0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721])
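
When an image appears in more than one pair, it can help to merge the pairs into groups of mutually linked images before deciding which copies to keep. The following is a minimal union-find sketch for that post-processing step; `group_pairs` is a hypothetical helper, not part of the simages API.

```python
def group_pairs(pairs):
    """Merge overlapping index pairs into sorted groups of linked images."""
    parent = {}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(int(a)), find(int(b))
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in parent:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


# The pairs from the output above do not overlap, so each stays its own group.
pairs = [[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]]
print(group_pairs(pairs))
```

With overlapping input such as (1, 2) and (2, 3), the three indices end up in a single group, from which one representative image can be kept.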

EmbeddingExtractor API

from simages import EmbeddingExtractor
import numpy as np

N = 1000
data = np.random.random((N, 28, 28))
extractor = EmbeddingExtractor(data, num_channels=1) # grayscale

# Show 10 closest pairs of images
pairs, distances = extractor.show_duplicates(n=10)

Class attributes and parameters:

class EmbeddingExtractor:
    """Extract embeddings from data with models and allow visualization.

    Attributes:
        trainloader (torch loader)
        evalloader (torch loader)
        model (torch.nn.Module)
        embeddings (np.ndarray)

    """
    def __init__(
        self,
        input: Union[str, np.ndarray],
        num_channels=None,
        num_epochs=2,
        batch_size=32,
        show_train=True,
        show=False,
        z_dim=8,
        **kwargs,
    ):
        """Inits EmbeddingExtractor with input, either `str` or `np.ndarray`, performs training and validation.

        Args:
            input (np.ndarray or str): data
            num_channels (int): grayscale = 1, color = 3
            num_epochs (int): more is better (generally)
            batch_size (int): number of images per batch
            show_train (bool): show intermediate training results
            show (bool): show closest pairs
            z_dim (int): compression size
            kwargs (dict)

        """

Specify the number of pairs to identify with the parameter n.

How it works

simages trains a convolutional autoencoder with PyTorch and compares the resulting latent representations using closely 📐.
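
To illustrate the comparison step, the sketch below finds the closest pairs among a set of embedding vectors by brute force. It assumes Euclidean distance and checks every pair, which is an assumption made for clarity; the closely library may use a different metric or a faster algorithm, and `closest_pairs` is a hypothetical name.

```python
import math
from itertools import combinations


def closest_pairs(embeddings, n=5):
    """Return the n closest index pairs and their Euclidean distances, nearest first."""
    # Score every unordered pair of vectors; this is O(N^2) in the number of images.
    scored = [
        (math.dist(embeddings[i], embeddings[j]), i, j)
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    scored.sort()
    top = scored[:n]
    pairs = [(i, j) for _, i, j in top]
    distances = [d for d, _, _ in top]
    return pairs, distances


# Four toy 2-D "embeddings"; vectors 0 and 2 are nearly identical.
toy = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.1], [10.0, 10.0]]
toy_pairs, toy_distances = closest_pairs(toy, n=2)
print(toy_pairs, toy_distances)
```

The quadratic cost is acceptable for small datasets; for large ones a dedicated nearest-neighbour method is the more practical choice.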

Dependencies

simages depends on the following packages:

The following dependencies are required for the interactive deleting interface:

  • pymongodb
  • fastcluster
  • flask
  • jinja2
  • dnspython
  • python-magic
  • termcolor

Cite

If you use simages, please cite it:

    @misc{justin_shenk_2019_3237830,
      author       = {Justin Shenk},
      title        = {justinshenk/simages: v19.0.1},
      month        = jun,
      year         = 2019,
      doi          = {10.5281/zenodo.3237830},
      url          = {https://doi.org/10.5281/zenodo.3237830}
    }
