DeepSearch image search engine

DeepSearch

DeepSearch is a sophisticated AI-powered search engine designed to enhance image searching. It utilizes deep learning algorithms to efficiently search a vast collection of images and find the most similar matches.

The DeepSearch engine is built on top of the Annoy library, which is a fast, memory-efficient, and easy-to-use library for approximate nearest neighbor search.

The engine uses pre-trained models from Keras to extract features from images and stores those features in an Annoy index. The index is then used to find the images most similar to a given query image.
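To make the pipeline above concrete, here is a minimal sketch of the search step using exact (brute-force) comparison in plain Python. It is not DeepSearch's implementation: Annoy approximates this search with a forest of trees, and the three-dimensional vectors, the `nearest_images` helper, and the file paths are made up for illustration.

```python
import math

def angular_distance(u, v):
    """Annoy-style angular distance: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny floating-point overshoot.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))

def nearest_images(query, features, num_results=10):
    """Exact search over a {path: feature_vector} mapping (hypothetical helper)."""
    scored = [(angular_distance(query, vec), path) for path, vec in features.items()]
    scored.sort()  # smallest distance first
    return [{"path": path, "distance": dist} for dist, path in scored[:num_results]]

# Toy features; vectors extracted by a real model are much longer.
features = {
    "images/0.jpg": [1.0, 0.0, 0.0],
    "images/1.jpg": [0.9, 0.1, 0.0],
    "images/2.jpg": [0.0, 1.0, 0.0],
}
results = nearest_images([1.0, 0.0, 0.0], features, num_results=2)
print(results)
```

Brute force compares the query against every stored vector, which is why an approximate index such as Annoy becomes attractive for large collections.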

Features

  • Fast: DeepSearch is built on top of the Annoy library, which is a fast, memory-efficient, and easy-to-use library for approximate nearest neighbor search.
  • Easy to use: DeepSearch is designed to be easy to use and integrate into your existing applications.
  • High Accuracy: DeepSearch uses a pre-trained model from Keras to extract features from images and then stores them in an Annoy index. The index is then used to find the most similar images to a given query image.

Prerequisites

Python 3.10.6+ is required to install DeepSearch. You can download the latest version of Python from here.

You also need TensorFlow 2.10.1 or later, which can be downloaded from here.
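If you want to verify these prerequisites from a script, a check along the following lines may help. It is only a sketch: `parse_version` and `check_prerequisites` are hypothetical helpers, and it assumes TensorFlow was installed under the distribution name `tensorflow` (variants such as `tensorflow-cpu` would not be detected).

```python
import re
import sys
from importlib import metadata

def parse_version(text):
    """Turn '2.10.1' into (2, 10, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

def check_prerequisites():
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info[:3] < (3, 10, 6):
        problems.append("Python 3.10.6 or later is required")
    try:
        tf_version = metadata.version("tensorflow")  # assumed distribution name
    except metadata.PackageNotFoundError:
        problems.append("TensorFlow is not installed")
    else:
        if parse_version(tf_version) < (2, 10, 1):
            problems.append(f"TensorFlow {tf_version} is older than 2.10.1")
    return problems

for problem in check_prerequisites():
    print(problem)
```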

Installation from PyPI

You can install DeepSearch from the PyPI package repository, found here.

To install DeepSearch, run the following command:

pip install deep-search-engine

Installation from GitHub Repository

You can also install DeepSearch from the GitHub repository found here by cloning the repository and installing the requirements.

Important Note: To use the DeepSearch CLI, you need to install DeepSearch from the GitHub repository.

It is recommended to install DeepSearch in a virtual environment. You can use virtualenv or venv to create a virtual environment.

To create a virtual environment using venv, run the following command:

python -m venv env

To activate the virtual environment, run the following command:

# Windows (Git Bash; in Command Prompt, run env\Scripts\activate instead)
source env/Scripts/activate

# Linux / macOS
source env/bin/activate

You will need to install the requirements before you can use DeepSearch. To install the requirements, run the following command:

pip install -r requirements.txt

With everything installed, you can now start utilizing DeepSearch.

Usage

There are two options for using DeepSearch in your application. You can either use the DeepSearch class and its methods in your code or you can use the DeepSearch CLI.

Importing the DeepSearch class

First, you need to import the DeepSearch class from the DeepSearch module as follows:

from DeepSearch import DeepSearch

Initializing the DeepSearch class

Then, you need to create an instance of the DeepSearch class. You can optionally pass the model_name, n_trees, metric, and verbose parameters to the constructor. The example below uses the default model, number of trees, and metric, and turns on verbose output:

deepSearch = DeepSearch(model_name='VGG16', n_trees=100, metric='angular', verbose=True)

The model_name parameter specifies the name of the model to use for extracting features from images. More information about the supported models can be found here.

The n_trees parameter specifies the number of trees to use in the Annoy index. The default value is 100. More trees will give you better accuracy but will also increase the memory usage and search time.

The metric parameter specifies the distance metric to use in the Annoy index. More information about the supported metrics can be found here.

The verbose parameter specifies whether to print the progress of the indexing process. The default value is False.

Building the index

Now, you can build the index and representations by calling the build() method. This method requires the path to the dataset directory which contains the images to index as a string.

deepSearch.build('dataset')

This function will go through all the images in the dataset directory and extract features from them. It will use those features to build the Annoy index and store the indexes and representations in the same directory.

You can optionally pass metric, n_trees and model_name parameters to the build() method. The default values are the same as the ones you passed to the constructor.

This can be useful if you want to try different values for the parameters without creating a new instance of the DeepSearch class.

Saving the index

The build() method will save the index and representations in the same directory as the images. If you use different values for the parameters, the build() method will save the index and representations in separate files.

For example, if you use the VGG16 model with the angular metric and 100 trees, the index and representations will be saved in the VGG16_angular_100_annoy_index.ann and VGG16_angular_100_representations.pkl files respectively.

The saving format is as follows:

# Annoy index file
f'{model_name}_{metric}_{n_trees}_annoy_index.ann'

# Representations file
f'{model_name}_{metric}_{n_trees}_representations.pkl'

The pickle module is used to save the representations.
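The naming scheme above makes it easy to predict where the files for a given configuration will be. The sketch below rebuilds both paths from the parameters; `index_file_paths` is a hypothetical helper written for this example, not part of the DeepSearch API.

```python
from pathlib import Path

def index_file_paths(dataset_dir, model_name="VGG16", metric="angular", n_trees=100):
    """Return (annoy_index_path, representations_path) for one configuration."""
    prefix = f"{model_name}_{metric}_{n_trees}"
    folder = Path(dataset_dir)
    return (
        folder / f"{prefix}_annoy_index.ann",
        folder / f"{prefix}_representations.pkl",
    )

index_path, reps_path = index_file_paths("dataset")
print(index_path.name)  # VGG16_angular_100_annoy_index.ann
print(reps_path.name)   # VGG16_angular_100_representations.pkl
```

Checking `index_path.exists()` is one way to tell whether a configuration has already been built for a dataset.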

Searching for similar images

Finally, you can search for similar images by calling the get_similar_images() method. This method will extract features from the query image and then use them to find the most similar images in the index. You have to specify the path to the query image as a string.

You can optionally pass the number of similar images to return as an integer with the num_results parameter. The default value is 10. You can also set the optional with_distance parameter to True to return the distances of the similar images as well. The default value of this parameter is False.

similar_images = deepSearch.get_similar_images('query.jpg', num_results=20, with_distance=True)
print(similar_images)

The output of the get_similar_images() method is a Python list of dictionaries. Each dictionary contains the image index from the index file, the path to the similar image, and the distance between the query image and the similar image. The list is sorted by distance in ascending order (the first image is the most similar).

[
    {
        'index': 0,
        'path': 'images/0.jpg',
        'distance': 0.0
    },
    {
        'index': 1,
        'path': 'images/1.jpg',
        'distance': 0.6206140518188477
    },
    {
        'index': 2,
        'path': 'images/2.jpg',
        'distance': 0.7063581943511963
    },
    ...
]
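Because the result is an ordinary list of dictionaries, it can be post-processed with plain Python. The sketch below keeps only matches under a distance threshold and skips exact hits, which usually mean the query image is itself in the dataset. The `filter_matches` helper and the 0.65 threshold are illustrative assumptions; the sample data is copied from the output above.

```python
similar_images = [
    {"index": 0, "path": "images/0.jpg", "distance": 0.0},
    {"index": 1, "path": "images/1.jpg", "distance": 0.6206140518188477},
    {"index": 2, "path": "images/2.jpg", "distance": 0.7063581943511963},
]

def filter_matches(results, max_distance, skip_exact=True):
    """Return the paths of results within max_distance (hypothetical helper)."""
    kept = []
    for item in results:
        if skip_exact and item["distance"] == 0.0:
            continue  # most likely the query image itself
        if item["distance"] <= max_distance:
            kept.append(item["path"])
    return kept

close_paths = filter_matches(similar_images, max_distance=0.65)
print(close_paths)
```

A useful threshold depends on the model and metric, so it is worth inspecting a few result lists before picking one.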

Full Implementation Example

The following example shows how to use DeepSearch in your code. It will index all the images in the dataset directory and then find the most similar images to the query image.

from DeepSearch import DeepSearch

# Initialize the DeepSearch class
deepSearch = DeepSearch(model_name='VGG16', n_trees=100, metric='angular', verbose=True)

# Build the index and representations
deepSearch.build('dataset')

# Search for similar images
similar_images = deepSearch.get_similar_images('lookup/query.jpg', num_results=20, with_distance=True)

# Print the similar images
print(similar_images)

The full implementation of the example can be found in the DeepSearchDemo.py file.

To run the demo, you need to copy the images you want to index to the dataset directory, copy the query image to the lookup directory, and then run the DeepSearchDemo.py file as follows:

python DeepSearchDemo.py

CLI Usage

The other option for using DeepSearch is the DeepSearch CLI, which lets you use DeepSearch from the command line without writing any code.

To use it, you need to install DeepSearch from GitHub as explained in the Installation from GitHub Repository section.

Running the DeepSearch CLI will build the index and search for similar images. The similar images are then saved in the directory specified with the --output option, or in the output directory by default. The output directory will be created if it doesn't exist.
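If you use the DeepSearch class instead of the CLI, you could reproduce this save step yourself. The following is a rough sketch, not the CLI's actual code: `save_similar_images` is a hypothetical helper, and prefixing each copy with its rank is an assumption made here to keep the results in order and avoid name clashes.

```python
import shutil
from pathlib import Path

def save_similar_images(results, output_dir="output"):
    """Copy each result image into output_dir, creating the directory if needed."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for rank, item in enumerate(results, start=1):
        source = Path(item["path"])
        # Rank prefix keeps the most similar image first when sorted by name.
        target = out / f"{rank:03d}_{source.name}"
        shutil.copy2(source, target)
        copied.append(target)
    return copied
```

It accepts the list returned by get_similar_images(), for example `save_similar_images(similar_images, "output")`.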

There are several options you can pass to the DeepSearch CLI. The options are as follows:

  • --folder: The path to the folder containing the images to index. This option is required.
  • --output: The path to the output directory where the similar images will be saved. The default value is output.
  • --image: The path to the query image. This option is required.
  • --num-results: The number of similar images to return. The default value is 10.
  • --metric: The distance metric to use in the Annoy index. The default value is angular.
  • --n-trees: The number of trees to use in the Annoy index. The default value is 100.
  • --model: The name of the model to use for extracting features from images. The default value is VGG16.
  • --verbose: Whether to print the progress of the indexing process. The default value is False.

To run the DeepSearch CLI, you need to run the DeepSearchCLI.py file as follows:

# Example with required options only
python DeepSearchCLI.py --folder dataset --image lookup/query.jpg

# Example with several options
python DeepSearchCLI.py --folder dataset --image lookup/query.jpg --output output --num-results 20 --metric euclidean --n-trees 20 --model ResNet50 --verbose True
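If you want a similar command-line interface in your own script, the documented options map naturally onto argparse. This is a sketch under stated assumptions, not the contents of DeepSearchCLI.py: the `str_to_bool` and `build_parser` helpers are hypothetical, and the parser deliberately accepts both hyphenated and underscored spellings of the two multi-word options.

```python
import argparse

def str_to_bool(value):
    """Parse values such as 'True' or 'False', as in `--verbose True`."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative DeepSearch-style CLI")
    parser.add_argument("--folder", required=True, help="folder with images to index")
    parser.add_argument("--image", required=True, help="path to the query image")
    parser.add_argument("--output", default="output", help="directory for similar images")
    # Both spellings are accepted and stored under the same attribute.
    parser.add_argument("--num-results", "--num_results", dest="num_results",
                        type=int, default=10)
    parser.add_argument("--metric", default="angular")
    parser.add_argument("--n-trees", "--n_trees", dest="n_trees", type=int, default=100)
    parser.add_argument("--model", default="VGG16")
    parser.add_argument("--verbose", type=str_to_bool, default=False)
    return parser

args = build_parser().parse_args(
    ["--folder", "dataset", "--image", "lookup/query.jpg", "--n-trees", "20"]
)
print(args)
```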

Supported Models

The following models are supported:

  • VGG16 (default)
  • ResNet50
  • InceptionV3
  • Xception

The default value is VGG16. You can get a list of available models by calling the static get_available_models() method of the DeepSearch class as follows:

# Get a list of available models
models = DeepSearch.get_available_models()
print(models) # ['VGG16', 'ResNet50', 'InceptionV3', 'Xception']

The models are case sensitive and must be specified exactly as shown above.
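Since names are case sensitive, a typo such as `vgg16` is easy to make. A small pre-check like the one below can turn that into a clearer error before any model is loaded. This is a hypothetical helper, not part of DeepSearch, and the hard-coded list mirrors the output shown above; in real code you would pass in the result of get_available_models().

```python
AVAILABLE_MODELS = ["VGG16", "ResNet50", "InceptionV3", "Xception"]

def check_model_name(name, available=AVAILABLE_MODELS):
    """Return name if it is an exact match, otherwise raise a helpful ValueError."""
    if name in available:
        return name
    # Look for a match that differs only in letter case.
    by_lower = {model.lower(): model for model in available}
    hint = by_lower.get(name.lower())
    if hint:
        raise ValueError(f"Unknown model {name!r}; did you mean {hint!r}?")
    raise ValueError(f"Unknown model {name!r}; choose one of {available}")

print(check_model_name("ResNet50"))
```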

You can easily add support for other models from the Keras Applications library by adding a new model class to the models dictionary in the ModelLoader class.

Supported Metrics

The following metrics are supported:

  • angular (default) - The angular distance, which Annoy computes from cosine similarity as sqrt(2(1 - cos(u, v))).
  • euclidean - The Euclidean distance metric.
  • manhattan - The Manhattan distance metric.
  • hamming - The Hamming distance metric.
  • dot - The dot product metric.

The default value is angular, which ranks images by cosine similarity (a smaller distance means a higher similarity).

You can get a list of available metrics by calling the static get_available_metrics() method of the DeepSearch class as follows:

# Get a list of available metrics
metrics = DeepSearch.get_available_metrics()
print(metrics) # ['angular', 'euclidean', 'manhattan', 'hamming', 'dot']

The metrics are case sensitive and must be specified exactly as shown above.
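To get a feel for how the metrics differ, the sketch below evaluates each one on the same pair of small vectors in plain Python. These are textbook definitions written for illustration, not Annoy's internals; in particular, `hamming` is intended for binary vectors, and for `dot` a larger value means a closer match.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def angular(u, v):
    # Euclidean distance between the normalized vectors: sqrt(2 * (1 - cos)).
    cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def hamming(u, v):
    # Number of positions at which the two vectors differ.
    return sum(1 for a, b in zip(u, v) if a != b)

u, v = [1.0, 0.0], [0.0, 1.0]
distances = {
    "angular": angular(u, v),
    "euclidean": euclidean(u, v),
    "manhattan": manhattan(u, v),
    "hamming": hamming(u, v),
    "dot": dot(u, v),
}
print(distances)
```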

Impact of Image Quantity on Processing Time

When processing a large number of images, it may take longer for the algorithm to generate representations. This is due to the increased computational demands of processing more data (more memory and CPU usage).

Once the representations are generated, the search process is very fast. Search time depends on the number of trees in the Annoy index: the more trees you use, the more accurate the search results will be, but the longer each search will take.

When you run the algorithm for the first time, it will generate the representations and save them to a file. The next time you run the algorithm, it will load the representations from the file instead of generating them again. This will significantly reduce the processing time.
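The caching behavior described above follows a common load-or-build pattern. Here is a minimal sketch of that pattern with pickle; `load_or_build` is a hypothetical helper, and the `{path: feature_vector}` layout of the cached data is an assumption, since the actual structure of the representations file is not documented here.

```python
import pickle
from pathlib import Path

def load_or_build(cache_path, build):
    """Return cached representations if the file exists, else build and save them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    representations = build()  # the slow step, e.g. feature extraction
    with cache_path.open("wb") as handle:
        pickle.dump(representations, handle)
    return representations
```

Only unpickle files you created yourself, because loading a pickle file can execute arbitrary code.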

For example, I ran the algorithm on a dataset of 100,000 images, and generating the representations took approximately 12 minutes. Each subsequent run took a couple of seconds, depending on the size of the dataset and the number of trees in the Annoy index.

One of the great features is that you can add more images to the dataset and run the algorithm again. The algorithm will only generate representations for the new images and will load the representations for the existing images from the file. This will significantly reduce the processing time.

When an image is deleted from the dataset, the algorithm will remove the representation for that image from the file. This will avoid any issues when searching for similar images.

Any of these operations will force the algorithm to remove the Annoy index file and generate it again, which ensures that the index is up to date. This is a relatively fast operation, depending on the size of the dataset and the number of trees in the Annoy index. For the previous example of 100,000 images, generating the Annoy index file took approximately 3 seconds.
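The update logic described in the last three paragraphs can be sketched as a set comparison between the cached paths and the files currently in the dataset. This is an illustration of the idea rather than DeepSearch's code: `sync_representations`, the `extract` callback, and the `{path: feature_vector}` layout are all assumptions.

```python
from pathlib import Path

def sync_representations(representations, image_paths, extract, index_path):
    """Bring {path: features} in line with image_paths; drop a stale index file."""
    current = set(image_paths)
    cached = set(representations)
    added = sorted(current - cached)
    removed = sorted(cached - current)
    for path in added:
        representations[path] = extract(path)  # only new images are processed
    for path in removed:
        del representations[path]
    if added or removed:
        # The index no longer matches the representations, so it must be rebuilt.
        Path(index_path).unlink(missing_ok=True)
    return added, removed
```

When nothing was added or removed, the existing index file is left untouched, which matches the fast subsequent runs described above.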

You can force the algorithm to remove the representations file and annoy index file by passing the --clear option to the DeepSearch CLI as follows:

python DeepSearchCLI.py --folder dataset --image lookup/query.jpg --clear True

Or you can call the rebuild() method of the DeepSearch class if you are using the DeepSearch API as follows:

# Rebuild the index
deepSearch.rebuild()

Contributing

If you would like to contribute to this project, please feel free to submit a pull request. If you have any questions, please feel free to open an issue.
