Complexity Measures and Visualization for Image datasets

pycol-vis: Python Image Complexity Library

The Python Image Complexity Library (pycol-vis) assembles a set of data complexity measures associated with image data.

Dataset complexity poses a significant challenge in classification tasks, especially in real-world applications where a combination of factors such as class overlap, data imbalance, noise, and dimensionality can jeopardize a machine learning algorithm's performance.

The seminal work of [1] introduced a set of measures for estimating the difficulty of a tabular classification problem. However, since these complexity measures were designed for tabular datasets, they cannot be directly applied to images. Furthermore, while comprehensive software packages for complexity analysis exist for tabular data, such as pycol, dcol, ECoL, ImbCoL, SCoL, and mfe, no equivalent, standardized toolkit exists for image datasets.

The lack of dedicated image measures and the absence of supporting software have created a significant gap in our understanding of image complexity, despite the importance of image data in areas such as healthcare, security, remote sensing, and autonomous systems. Our work aims to address this gap directly by introducing a comprehensive package for this purpose. In particular, the pycol-vis package distinguishes itself by categorizing image metrics into two distinct complexity families:

  • Intrinsic: metrics that quantify the difficulty of individual images, based on image properties such as color, entropy, and edge density.
  • Overlap: metrics that focus on class separability and the complexity between classes of a binary or multiclass image dataset.

Implemented Measures

The following table shows the measures implemented in our package, divided by family:

Category | Name | Acronym | Range | Reference
Overlap | Cumulative Spectral Gradient | CSG | 0–∞ | [2]
Overlap | Area Under Laplacian Spectrum | AULS | 0–∞ | [3]
Overlap | Cumulative Maximum Scaled Area Under Laplacian Spectrum | cmsAULS | 0–∞ | [3]
Overlap | Class Separability | m-sep | 0–∞ | [4]
Overlap | In-Class Variability | m-var | 0–∞ | [4]
Intrinsic | JPEG Compression Ratio | JPEG | 0–1 | [5]
Intrinsic | Fractal Compression | Fractal | 0–1 | [5]
Intrinsic | Entropy | H | 0–1 | [6]
Intrinsic | Canny Edge Density | CED | 0–1 | [7]
Intrinsic | Sobel Edge Density | SED | 0–1 | [7]
Intrinsic | Color Average/STD | Color Avg. | [0–1, 0–1, 0–1] | [6]
Intrinsic | Unique Colors | #Colors | 1–∞ | [7]
Intrinsic | Zipf Rank/Difference | Zipf | 0–1 | [5]
Intrinsic | Haralick Features | haralick | 0–1 | [7]
Intrinsic | FFT Features | fft | 0–1 |

Overlap:

  • Cumulative Spectral Gradient (CSG): Graph-based measure derived from spectral clustering. It is related to the minimum cut cost of the class similarity matrix S.
  • Area Under Laplacian Spectrum (AULS): The area under the Laplacian spectrum of the class similarity graph.
  • Cumulative Maximum Scaled Area Under Laplacian Spectrum (cmsAULS): Combines CSG and AULS.
  • Class Separability (m-sep): Inter-class separability, based on an LDA measure.
  • In-Class Variability (m-var): Intra-class variability, based on an LDA measure.

Intrinsic:

  • JPEG Compression Ratio: The compression ratio achieved by compressing an image to JPEG format (the quality is a parameter).
  • Fractal Compression: The compression ratio achieved by compressing an image using fractal compression.
  • Entropy: The Shannon entropy of a given image.
  • Canny/Sobel Edge Density: The density of edges in a given image, calculated using either Canny or Sobel filters. Higher edge density indicates higher complexity.
  • Color Average/STD: The average and standard deviation of the colors of a given image, computed for each channel individually. The image can be converted into different color formats first.
  • Unique Colors: The number of unique colors present in a given image. The image is first quantized to reduce the color space, leaving only the most relevant colors.
  • Zipf Rank/Difference: Complexity based on Zipf-like statistics and Zipf's Law, which claims that in many natural processes the frequency of an item is inversely proportional to its rank.
  • Haralick Features: Group of measures based on Haralick features, obtained from the Gray Level Co-occurrence Matrix.
  • FFT Features: Group of measures based on FFT features. The image is converted to frequency space and the energy in the low, mid, and high frequency bands is calculated.
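
To make two of the intrinsic measures above concrete, here is a minimal pure-Python sketch of Shannon entropy and a unique-color count. It is an illustration only, not the pycol-vis implementation: the function names, the division by 8 bits (chosen so the entropy matches the 0–1 range in the table), and the uniform bit-shift quantization are all assumptions.

```python
import math
from collections import Counter


def shannon_entropy(pixels, bits=8):
    """Shannon entropy of grayscale intensities, scaled to [0, 1].

    `pixels` is a flat iterable of integer intensities in [0, 2**bits - 1].
    Dividing by `bits` (the maximum possible entropy) is an assumption made
    here so the result falls in the 0-1 range listed in the table.
    """
    counts = Counter(pixels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("image has no pixels")
    # H = sum over intensities of p * log2(1 / p)
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return entropy / bits


def unique_colors(rgb_pixels, bits_per_channel=3):
    """Count distinct colors after a coarse, uniform quantization.

    Each 8-bit channel keeps only its `bits_per_channel` most significant
    bits. This is a stand-in; the package may quantize differently.
    """
    shift = 8 - bits_per_channel
    return len({(r >> shift, g >> shift, b >> shift) for r, g, b in rgb_pixels})
```

A flat image (a single intensity) gives an entropy of 0, while an image that uses all 256 intensities equally often gives 1. Near-identical colors collapse into one bucket after quantization, so the color count reflects visually distinct colors rather than sensor noise.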

Installation Instructions

All packages required to run pycol-vis are listed in the requirements.txt file found in the GitHub repository. To install them, run:

# Clone the repository
git clone https://github.com/DiogoApostolo/pycol-vis.git
cd pycol-vis

# Install dependencies
pip install -r requirements.txt

# Install the package in editable mode
pip install -e .

⚠️ Note: pycol-vis requires Python 3.10, 3.11, or 3.12. Python 3.13 and newer are not currently supported due to TensorFlow compatibility.
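
If you want to fail early on an unsupported interpreter, a small guard along the following lines can be run before installing or importing the package. This is a hypothetical helper, not part of pycol-vis; the name is_supported and the constant SUPPORTED_MINORS are assumptions that simply encode the version range stated in the note above.

```python
import sys

# Python 3.10 to 3.12, as stated in the note above (assumed to be exhaustive).
SUPPORTED_MINORS = (10, 11, 12)


def is_supported(version_info=None):
    """Return True if the given (or the running) Python version is supported."""
    major, minor = (version_info or sys.version_info)[:2]
    return major == 3 and minor in SUPPORTED_MINORS


if __name__ == "__main__":
    if not is_supported():
        raise SystemExit(
            f"Python {sys.version_info.major}.{sys.version_info.minor} "
            "is outside the supported 3.10-3.12 range"
        )
```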

Datasets

Below are some of the datasets used to test our package; they are also required to run the use case files:

  • Shapes dataset: A dataset composed of 9 2D geometric shapes, each drawn randomly on a 200x200 RGB image (also available in shapes_dataset.zip).
  • COVID Dataset: A COVID dataset with 3 classes: COVID19, PNEUMONIA, and NORMAL.
  • Fruits Dataset: A dataset containing 100 classes of fruit images (also available in Fruit_dataset.zip).
  • MNIST: A dataset of handwritten digits
  • Fashion MNIST: A dataset of 28x28 pixel images of 10 fashion categories (e.g., shirts, shoes, bags)

This package expects the datasets to be stored in the following structure:

  • Folder
    • Class_1
      • img1.png
      • img2.png
    • Class_2
      • img1.png
      • img2.png
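
Before loading a dataset, it can help to confirm that it follows the layout above. The sketch below walks a folder-per-class directory and counts the images in each class. It is a standalone helper, not part of pycol-vis; the name summarize_dataset and the set of accepted file extensions are assumptions.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


def summarize_dataset(root):
    """Return {class_name: image_count} for a folder-per-class dataset."""
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"dataset folder not found: {root}")
    summary = {}
    # Every immediate subdirectory is treated as one class.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        images = [
            f for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        ]
        summary[class_dir.name] = len(images)
    return summary
```

A class with very few images shows up immediately in the returned dictionary, which is useful to check before choosing a value for number_per_class.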

Basic Usage

This section shows how to correctly import the package, load a dataset, parameterize the setup and extract dataset complexity.

from pycol_vis.image_metrics import ImageComplexity

# Load the dataset stored in the Fruits folder, keeping only the apple and
# banana classes and 100 randomly selected samples from each class.
comp = ImageComplexity(
    'Fruits',
    keep_classes=['apple', 'banana'],
    number_per_class=100,
)

# Calculate the CSG overlap measure and the JPEG compression measure, then print them.
print(comp.csg_measure())
print(comp.jpeg_compression_ratio())

# Example of the CSG parameters: choose a specific embedding and how many
# samples to use to estimate probability.
comp.csg_measure(
    emb_type="mobile_net",
    n_samples=50
)

Visualization Example

Our package offers the user diverse methods to visualize dataset complexity.

This example shows how the measured overlap complexity can be shown in a bar plot. The plot_overlap_measures function automatically collects all overlap measures calculated up to that point and displays them.

from pycol_vis.image_metrics import ImageComplexity

# Load the dataset
dataset = "shapes_dataset"
folder = "./" + dataset + "/train/"
classes = ["Circle", "Square", "Triangle"]

complexity = ImageComplexity(folder, keep_classes=classes, number_per_class=200)

# Measure complexity
complexity.csg_measure(emb_type="efficient_net", n_samples=50, reduction_type='pca')
complexity.tabular_measure(emb_type='efficient_net', measure='kdn', reduction_type='pca')
complexity.m_sep_measure(emb_type='efficient_net', reduction_type='pca')

# Draw a bar plot of the measured complexity
complexity.plot_overlap_measures()

Bar Plot of Overlap Measures

Continuing from the previous example, a user might also want to visualize how the dataset was embedded. The plot_tsne method uses t-SNE to show a 2D projection of the embedded dataset.

complexity.plot_tsne(embs=complexity.feature_embeddings)

t-SNE Plot of the Embedded Dataset

Use Cases

A collection of use cases is provided in the use_cases folder. These examples show how our package can be used in practice to extract valuable insights from image datasets.

In particular, the use_cases folder includes the following files:

  • model_selection.py: A Use Case showing how the overlap measures in our package can be used to inform model selection
  • sample_selection.py: A Use Case showing how the intrinsic measures can be used to reduce the dataset size, selecting only the most relevant samples
  • dim_reduction.py: A Use Case showing how the overlap measures can be used to reduce the embedding feature space, without losing classification performance.
  • viz_example.py: A Use Case displaying the different visualization options present in our package
  • layers.py: A Use Case showing how to train a custom NN and extract complexity at each layer.

More information is provided in each individual file.

References

[1] Ho, T. K., & Basu, M. (2002).
Complexity Measures of Supervised Classification Problems.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 289–300.
https://doi.org/10.1109/34.990132

[2] Branchaud-Charron, F., Achkar, A., & Jodoin, P.-M. (2019).
Spectral Metric for Dataset Complexity Assessment.
arXiv:1905.07299. https://arxiv.org/abs/1905.07299

[3] Li, G., Togo, R., Ogawa, T., & Haseyama, M. (2022).
Dataset complexity assessment based on cumulative maximum scaled area under Laplacian spectrum.
Multimedia Tools and Applications, 81(22), 32287–32303.
https://doi.org/10.1007/s11042-022-13027-3

[4] Cho, H., & Lee, S. (2021).
Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data.
Applied Sciences, 11(2), 472.
https://doi.org/10.3390/app11020472

[5] Machado, P., Romero, J., Nadal, M., Santos, A., Correia, J., & Carballal, A. (2015).
Computerized measures of visual complexity.
Acta Psychologica, 160, 43–57.
https://doi.org/10.1016/j.actpsy.2015.06.005

[6] Rahane, A. A., & Subramanian, A. (2020).
Measures of Complexity for Large Scale Image Datasets.
arXiv:2008.04431. https://arxiv.org/abs/2008.04431

[7] Corchs, S. E., Ciocca, G., Bricolo, E., & Gasparini, F. (2016).
Predicting Complexity Perception of Real World Images.
PLOS ONE, 11(6).
https://doi.org/10.1371/journal.pone.0157986
