Image dataset analyzer using image embedding models and clustering methods.

ImageDatasetAnalyzer

ImageDatasetAnalyzer is a Python library designed to simplify and automate the analysis of a set of images and, optionally, their segmentation labels. It provides tools for an initial analysis of the images and their labels, reporting useful information such as image sizes, number of classes, number of objects of each class per image, and bounding box metrics.

Additionally, it includes a wide variety of models for image feature extraction and embedding from frameworks such as HuggingFace and PyTorch. These embeddings are useful for pattern recognition in images using traditional clustering algorithms like KMeans or AgglomerativeClustering.

These clustering methods can also be applied to Active Learning in semantic segmentation: the original dataset is reduced by keeping the most representative images from each cluster. This makes the library a useful tool for choosing which images to label for semantic segmentation (or any other task that benefits from selective labeling).

🔧 Key features

  • Image and label dataset analysis: Evaluate the distribution of images and labels in a dataset to understand its structure and characteristics. This analysis can also be used to check that everything is correct: each image has its label, sizes are accurate, the number of classes matches expectations, and so on.
  • Embedding clustering: Group similar images using clustering techniques based on embeddings generated by pre-trained models. The library supports KMeans, AgglomerativeClustering, DBSCAN and OPTICS from scikit-learn. The clustering classes also include methods for hyperparameter tuning using grid search.
  • Support for pre-trained models: Compatible with embedding models from the 🤗HuggingFace🤗, PyTorch, TensorFlow and OpenCV frameworks. New frameworks can be added by extending the Embedding superclass.
  • Image dataset reduction: Reduce the number of images in the dataset by selecting, from each cluster, the most representative ones (those closest to the centroid) or the most diverse ones (those farthest from the centroid).
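
As a rough illustration of the grid search idea mentioned above, the sketch below tries every combination of hyperparameters and keeps the best-scoring one. It is not the library's implementation: `grid_search`, `param_grid` and `score_fn` are hypothetical names, and the scoring function is a placeholder for whatever clustering quality metric you choose (for example, a silhouette score computed on the embeddings).

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination.

    param_grid maps each parameter name to a list of candidate values.
    score_fn is called with the parameters as keyword arguments and must
    return a number where higher is better.
    Returns (best_params, best_score).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy scoring function (an assumption for the demo): it peaks at
# eps=0.5, min_samples=5. A real one would fit a clustering model.
def toy_score(eps, min_samples):
    return -((eps - 0.5) ** 2) - (min_samples - 5) ** 2


best_params, best_score = grid_search(
    {"eps": [0.1, 0.5, 1.0], "min_samples": [3, 5]}, toy_score
)
print(best_params)
```

The cost grows with the product of the candidate list sizes, so small grids are advisable when each evaluation involves fitting a clustering model.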

🚀 Getting Started

Install the package with pip.

On Ubuntu, for example:

pip3 install ImageDatasetAnalyzer

On Windows:

pip install ImageDatasetAnalyzer

🤖 Supported models

The following models have been tested for compatibility. You can use other models and versions of these frameworks as well, although their performance and compatibility are not guaranteed.

Framework     Model names
Hugging Face  CLIP, ViT, DeiT, Swin Transformer, DINO ViT, ConvNeXt
PyTorch       ResNet (50, 101), VGG (16, 19), DenseNet (121, 169, 201), InceptionV3
TensorFlow    MobileNet (V2), InceptionV3, VGG (16, 19), ResNet (50, 101, 152), ResNetV2 (50, 101, 152), NASNet (Large, Mobile), ConvNeXt (Tiny, Small, Base, Large, XLarge), DenseNet (121, 169, 201)

👩‍💻 Usage

This package includes three main modules: dataset analysis, embedding generation and clustering, and dataset reduction.

📊 Dataset analysis

You can analyze a dataset to explore its properties and obtain metrics and visualizations. This module works with labeled image datasets as well as with image-only datasets.

from imagedatasetanalyzer.src.datasets.imagedataset import ImageDataset
from imagedatasetanalyzer.src.datasets.imagelabeldataset import ImageLabelDataset

# Define paths to the images and labels
img_dir = r"images/path"
labels_dir = r"labels/path"

# Load the image and label dataset
dataset = ImageLabelDataset(img_dir=img_dir, label_dir=labels_dir)

# Alternatively, you can use just an image dataset without labels
image_dataset = ImageDataset(img_dir=img_dir)

# Perform dataset analysis (visualize and analyze)
dataset.analyze(plot=True, output="results/path", verbose=True)

# If you use only images (without labels), the analysis will provide less information
image_dataset.analyze()
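
One of the consistency checks described in the key features, that each image has its label, can be illustrated without the library. The following is a minimal sketch under the assumption that an image and its label share the same file name stem (for example `img_001.jpg` and `img_001.png`); `find_unpaired_files` and `IMAGE_EXTENSIONS` are hypothetical names, not part of ImageDatasetAnalyzer.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def find_unpaired_files(img_dir, label_dir):
    """Compare file stems in two directories.

    Returns a tuple (images_without_label, labels_without_image),
    each a sorted list of file stems.
    """
    image_stems = {
        p.stem
        for p in Path(img_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    }
    label_stems = {p.stem for p in Path(label_dir).iterdir() if p.is_file()}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

Calling `find_unpaired_files("images/path", "labels/path")` before running the analysis gives a quick list of files to fix or exclude.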

🔍 Embedding generation and clustering

This module is used to generate embeddings for your images and then perform clustering using different algorithms (e.g., K-Means, DBSCAN). Here’s how to generate embeddings and perform clustering:

from imagedatasetanalyzer.src.embeddings.huggingfaceembedding import HuggingFaceEmbedding
from imagedatasetanalyzer.src.datasets.imagedataset import ImageDataset
from imagedatasetanalyzer.src.models.kmeansclustering import KMeansClustering
import numpy as np

# Define image dataset directory
img_dir = r"image/path"

# Load the dataset
dataset = ImageDataset(img_dir)

# Choose an embedding model (e.g., HuggingFace DINO).
embedding_model = HuggingFaceEmbedding("facebook/dino-vits16")
embeddings = embedding_model.generate_embeddings(dataset)

# Perform K-Means clustering
kmeans = KMeansClustering(dataset, embeddings, random_state=123)
best_k = kmeans.find_elbow(25)  # Find the optimal number of clusters using the elbow method

# Apply K-Means clustering with the best number of clusters
labels_kmeans = kmeans.clustering(best_k)

# Display images from each cluster
for cluster in np.unique(labels_kmeans):
    kmeans.show_cluster_images(cluster, labels_kmeans)

# Visualize clusters using TSNE instead of PCA
kmeans.clustering(num_clusters=best_k, reduction='tsne', output='tsne_reduction')
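
For intuition about the elbow method used above, here is a minimal sketch of one common heuristic: after normalizing both axes, pick the point of the inertia curve that lies farthest from the straight line joining its first and last points. This is an illustration only; `elbow_index` is a hypothetical helper, and the library's `find_elbow` may use a different procedure (Kneed is listed among its requirements).

```python
def elbow_index(values):
    """Return the index of the elbow of a monotonic curve.

    Both axes are rescaled to [0, 1]; the elbow is the point with the
    largest distance to the chord joining the first and last points.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points to locate an elbow")
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0  # flat curve: there is no elbow to speak of
    ys = [(v - lo) / (hi - lo) for v in values]
    y_first, y_last = ys[0], ys[-1]
    best_idx, best_dist = 0, -1.0
    for i, y in enumerate(ys):
        x = i / (n - 1)
        # Proportional to the perpendicular distance from (x, y) to the chord;
        # the constant denominator is dropped because only the argmax matters.
        dist = abs((y_last - y_first) * x - y + y_first)
        if dist > best_dist:
            best_idx, best_dist = i, dist
    return best_idx


# Made-up inertia values for k = 1..6, used only to exercise the function.
ks = [1, 2, 3, 4, 5, 6]
inertias = [1000, 400, 150, 120, 100, 90]
elbow_k = ks[elbow_index(inertias)]
print(elbow_k)
```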

📉 Dataset reduction

This module reduces a dataset to a smaller subset of images based on a clustering result, and it works with any of the supported clustering methods. Within each cluster, images can be selected in three ways: those closest to the centroid (selection_type='representative'), those farthest from it (selection_type='diverse'), or at random (selection_type='random').

from imagedatasetanalyzer.src.datasets.imagedataset import ImageDataset
from imagedatasetanalyzer.src.embeddings.tensorflowembedding import TensorflowEmbedding
from imagedatasetanalyzer.src.models.kmeansclustering import KMeansClustering

# Define paths
img_dir = r"images/path"

# Load dataset
dataset = ImageDataset(img_dir)

# Choose embedding method. We are using MobileNetV2 from Tensorflow.
emb = TensorflowEmbedding("MobileNetV2")
embeddings = emb.generate_embeddings(dataset)

# Initialize KMeans clustering
kmeans = KMeansClustering(dataset, embeddings, random_state=123)

# Select the number of KMeans clusters that maximizes the silhouette score.
best_k = kmeans.find_best_n_clusters(range(2,25), 'silhouette', plot=False)

# Reduce the dataset using the best KMeans model according to the silhouette score.
# In this case, we keep 70% of the original dataset (reduction=0.7),
# selecting the images closest to each cluster centroid (selection_type='representative')
# and ensuring that 20% of the selected images within each cluster are diverse (diverse_percentage=0.2).
# The reduced dataset will be saved to the specified output directory ("reduced/dataset/path").
reduced_dataset = kmeans.select_balanced_images(n_clusters=best_k, 
                                                reduction=0.7, 
                                                selection_type='representative', 
                                                diverse_percentage=0.2, 
                                                output="reduced/dataset/path")
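
The difference between representative and diverse selection can be sketched for a single cluster as follows. This is a simplified illustration, not the code behind `select_balanced_images`: `select_from_cluster` is a hypothetical function, embeddings are plain lists of floats, and the handling of `diverse_percentage` and of balancing across clusters is left out because the description above does not specify it.

```python
import math


def select_from_cluster(embeddings, n_select, selection_type="representative"):
    """Return indices of n_select embeddings from one cluster.

    'representative' keeps the points closest to the cluster centroid,
    'diverse' keeps the points farthest from it.
    """
    if selection_type not in ("representative", "diverse"):
        raise ValueError(f"unknown selection_type: {selection_type!r}")
    dim = len(embeddings[0])
    # Centroid = per-dimension mean of the cluster members.
    centroid = [sum(vec[d] for vec in embeddings) / len(embeddings) for d in range(dim)]
    distances = [math.dist(vec, centroid) for vec in embeddings]
    order = sorted(
        range(len(embeddings)),
        key=distances.__getitem__,
        reverse=(selection_type == "diverse"),
    )
    return order[:n_select]


# Toy 2-D "embeddings": three points near the origin and one outlier.
cluster = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
closest = select_from_cluster(cluster, 2, "representative")
farthest = select_from_cluster(cluster, 1, "diverse")
```

With real embeddings stored as NumPy arrays, the same distances would normally be computed in vectorized form.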

🧰 Requirements

The dependencies required to use this library are listed in the requirements.txt file:

  • Kneed
  • Matplotlib
  • NumPy
  • OpenCV
  • Pillow
  • scikit-learn
  • SciPy
  • scikit-image
  • TensorFlow
  • PyTorch
  • tqdm
  • Transformers

✉️ Contact

📧 jortizdemuruaferrero@gmail.com
