Python package undouble

undouble

undouble is a Python package for detecting (near-)identical images.

The aim of undouble is to detect (near-)identical images. It works in multiple steps: pre-processing the images (grayscaling, normalizing, and scaling), computing the image hash, and grouping the images by hash. A threshold of 0 groups only images with an identical image hash. The results can be explored with the plotting functionality, and images can be moved with the move functionality. When moving images, the image with the largest resolution in each group is copied, and all other images in the group are moved to the "undouble" subdirectory. If you want to cluster your images instead, I recommend reading the blog and using the clustimage library.

The following steps are taken in the undouble library:

    1. Recursively read all images with the specified extensions from the directory.
    2. Compute the image hash.
    3. Group similar images.
    4. Move the images if desired.
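
To make the hashing and grouping steps concrete, here is a minimal sketch using a toy average hash on tiny grayscale grids. It is not undouble's implementation: the functions `average_hash`, `hamming`, and `group_images` and the example images are assumptions made up for illustration, and the pre-processing (grayscaling and scaling) is assumed to have happened already.

```python
# Illustrative sketch only: a toy average hash, not undouble's implementation.

def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(hash_a, hash_b):
    """Count the bit positions in which two equally long hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def group_images(hashes, threshold=0):
    """Greedily group names whose hash is within `threshold` bits of a group's first member."""
    groups = []
    for name, image_hash in hashes.items():
        for group in groups:
            if hamming(hashes[group[0]], image_hash) <= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    # Only groups with more than one member contain (near-)duplicates.
    return [group for group in groups if len(group) > 1]


# Hypothetical 2x2 grayscale "images", assumed to be scaled to the hash size already.
images = {
    "a.png": [[10, 200], [220, 30]],
    "a_copy.png": [[12, 198], [221, 29]],  # slightly different pixel values
    "b.png": [[200, 10], [30, 220]],
}
hashes = {name: average_hash(pixels) for name, pixels in images.items()}
duplicates = group_images(hashes, threshold=0)
print(duplicates)  # [['a.png', 'a_copy.png']]
```

With a threshold of 0 only identical hashes end up in the same group; a larger threshold would also accept hashes that differ in a few bits.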

Installation

  • Install undouble from PyPI (recommended). undouble is compatible with Python 3.6+ and runs on Linux, macOS and Windows.
  • A new environment can be created as follows:
conda create -n env_undouble python=3.8
conda activate env_undouble
pip install undouble            # new install
pip install -U undouble         # update to latest version
  • Alternatively, you can install from the GitHub source:
# Directly install from the GitHub source (GitHub no longer serves the unauthenticated git:// protocol)
pip install -e git+https://github.com/erdogant/undouble.git@0.1.0#egg=undouble
pip install git+https://github.com/erdogant/undouble#egg=undouble
pip install git+https://github.com/erdogant/undouble

# By cloning
git clone https://github.com/erdogant/undouble.git
cd undouble
pip install -U .
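
After installing, it can be useful to confirm which version ended up in the environment. The sketch below uses only the standard library; the helper name `installed_version` is an assumption for illustration, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("undouble")
print("undouble is not installed" if version is None else f"undouble {version}")
```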

Import undouble package

from undouble import Undouble

Example:

# Import library
from undouble import Undouble

# Init with default settings
model = Undouble(method='phash', hash_size=8)

# Import example data
targetdir = model.import_example(data='flowers')

# Import the files from disk, then clean and pre-process them
model.import_data(targetdir)

# Compute image-hash
model.compute_hash()

# [undouble] >INFO> Store examples at [./undouble/data]..
# [undouble] >INFO> Downloading [flowers] dataset from github source..
# [undouble] >INFO> Extracting files..
# [undouble] >INFO> [214] files are collected recursively from path: [./undouble/data/flower_images]
# [undouble] >INFO> Reading and checking images.
# [undouble] >INFO> Reading and checking images.
# 100%|██████████| 214/214 [00:02<00:00, 96.56it/s]
# [undouble] >INFO> Extracting features using method: [phash]
# 100%|██████████| 214/214 [00:00<00:00, 3579.14it/s]
# [undouble] >INFO> Build adjacency matrix with phash differences.
# [undouble] >INFO> Extracted features using [phash]: (214, 214)
# 100%|██████████| 214/214 [00:00<00:00, 129241.33it/s]


# Find images with image-hash <= threshold
model.group(threshold=0)

# [undouble] >INFO> Number of groups with similar images detected: 3
# [undouble] >INFO> [3] groups are detected for [7] images.

# Plot the images
model.plot()

# Move the images
model.move()

# -------------------------------------------------
# >You are at the point of physically moving files.
# -------------------------------------------------
# >[7] similar images are detected over [3] groups.
# >[4] images will be moved to the [undouble] subdirectory.
# >[3] images will be copied to the [undouble] subdirectory.

# >[C]ontinue moving all files.
# >[W]ait in each directory.
# >[Q]uit
# >Answer: w
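
The prompt above reports 7 images in 3 groups, of which 3 are copied and 4 are moved. The sketch below shows one way such a selection rule could work, keeping the image with the largest resolution per group. The function `plan_moves` and the image records are hypothetical; this is an illustration of the rule described earlier, not undouble's own code.

```python
# Illustrative sketch: deciding which image of each group to copy and which to move.
# The records below are hypothetical; this is not undouble's code.

def plan_moves(groups):
    """Split every group into the highest-resolution image (copied) and the rest (moved)."""
    to_copy, to_move = [], []
    for group in groups:
        # Resolution is taken here as width * height in pixels.
        keeper = max(group, key=lambda image: image["width"] * image["height"])
        to_copy.append(keeper["path"])
        to_move.extend(image["path"] for image in group if image is not keeper)
    return to_copy, to_move


groups = [
    [{"path": "g1/a.png", "width": 800, "height": 600},
     {"path": "g1/a_small.png", "width": 400, "height": 300}],
    [{"path": "g2/b.png", "width": 640, "height": 480},
     {"path": "g2/b_large.png", "width": 1280, "height": 960},
     {"path": "g2/b_thumb.png", "width": 160, "height": 120}],
    [{"path": "g3/c.png", "width": 500, "height": 500},
     {"path": "g3/c_crop.png", "width": 250, "height": 250}],
]
to_copy, to_move = plan_moves(groups)
print(len(to_copy), "to copy:", to_copy)
print(len(to_move), "to move:", to_move)
```

The sketch only builds a plan and touches no files, which makes it easy to inspect before anything is physically moved.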

Three types of input to read the images

The function import_data accepts three types of input:

* Path to directory
* List of file locations
* Numpy array containing images
# Import libraries (os is needed below to split the directory from a file path)
import os
from undouble import Undouble

# Init with default settings
model = Undouble(method='phash', hash_size=16)

# Import data; Pathnames to the images.
input_list_of_files = model.import_example(data='flowers')

# Import data; Directory to read.
input_directory, _ = os.path.split(input_list_of_files[0])
print(input_directory)
# '.\\undouble\\undouble\\data\\flower_images'

# Import data; numpy array containing images.
input_img_array = model.import_example(data='mnist')

# Import the files from disk, then clean and pre-process them (one call per input type)
model.import_data(input_list_of_files)
model.import_data(input_directory)
model.import_data(input_img_array)

# Compute image-hash
model.compute_hash()

# Find images with image-hash <= threshold
model.group(threshold=0)

# Plot the images
model.plot()

# Move the images
# model.move()
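
For the directory input, the files are read recursively and filtered by extension. A standard-library sketch of that idea follows; the function `collect_images`, its default extensions, and the temporary files are assumptions for illustration and do not reflect how undouble implements it.

```python
import tempfile
from pathlib import Path


def collect_images(directory, extensions=(".png", ".jpg", ".jpeg")):
    """Recursively collect files under `directory` whose suffix is in `extensions`."""
    allowed = {ext.lower() for ext in extensions}
    return sorted(
        path for path in Path(directory).rglob("*")
        if path.is_file() and path.suffix.lower() in allowed
    )


# Build a small throwaway directory tree to demonstrate the function.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "sub").mkdir()
    for name in ("a.png", "notes.txt", "sub/b.JPG"):
        (Path(root) / name).touch()
    found = [path.relative_to(root).as_posix() for path in collect_images(root)]

print(found)  # ['a.png', 'sub/b.JPG']
```

Comparing the lower-cased suffix means that files such as `b.JPG` are picked up regardless of how the extension is capitalized.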

Finding identical MNIST digits

# Import library
from undouble import Undouble

# Init with default settings
model = Undouble()

# Import example data
targetdir = model.import_example(data='mnist')

# Import the files from disk, then clean and pre-process them
model.import_data(targetdir)

# Compute image-hash
model.compute_hash(method='phash', hash_size=32)

# Find images with image-hash <= threshold
model.group(threshold=0)

# Plot the images
model.plot()
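
The examples use hash sizes of 8, 16, and 32. Assuming, as in common perceptual-hash implementations, that the hash is a grid of hash_size x hash_size bits, a larger hash size captures more detail and makes an exact match at threshold 0 stricter. The sketch below shows the resulting bit counts and how a difference between two hashes can be counted; the helpers `hash_bits` and `hamming_hex` and the hexadecimal hash values are made up for illustration.

```python
def hash_bits(hash_size):
    """Number of bits in a hash built from a hash_size x hash_size grid (assumed layout)."""
    return hash_size * hash_size


def hamming_hex(hex_a, hex_b):
    """Number of differing bits between two equally long hexadecimal hash strings."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


for size in (8, 16, 32):
    bits = hash_bits(size)
    print(f"hash_size={size}: {bits} bits, {bits // 4} hex characters")

# Two hypothetical 64-bit hashes that differ only in their last bit.
difference = hamming_hex("ffd8a0c3e1f00f1e", "ffd8a0c3e1f00f1f")
print(difference)  # 1
```

In this sketch the two hashes would not be grouped at threshold 0, but they would be at threshold 1.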

References

Citation

Please cite undouble in your publications if it is useful for your research (see citation).

Maintainers

Contribute

  • All kinds of contributions are welcome!
  • If you wish to buy me a coffee for this work, it is much appreciated :)

Licence

See LICENSE for details.

Other interesting stuff
