MIGT: Mutual Information Guided Training

Model-agnostic dataset partitioning using Mutual Information


Overview

MIGT (Mutual Information Guided Training) is a model-agnostic dataset partitioning framework that splits image datasets into train / test / (optional) validation sets using Mutual Information (MI) instead of random sampling.

The main objective of MIGT is to preserve the information distribution of samples across dataset splits and reduce dataset bias.

Unlike random splitting, MIGT:

  • Preserves easy and hard samples proportionally
  • Maintains feature similarity across splits
  • Reduces distributional skew
  • Improves generalization stability

MIGT is compatible with CNNs, Vision Transformers (ViTs), and any vision-based model.


Key Features

  • ✅ Mutual Information–guided dataset partitioning
  • ✅ Excel-style distribution-aware histogram binning
  • ✅ Adaptive bin reduction (4 → 3 → 2)
  • ✅ Deterministic fallback strategies
  • ✅ Optional validation split
  • ✅ Research-safe strict shape mode
  • ✅ Optional practical resizing mode
  • ✅ Model-agnostic and framework-independent
  • ✅ No data loss, no class skipping

Installation

pip install migt

Dataset Structure

dataset/
 ├── class1/
 │    ├── img1.jpg
 │    ├── img2.jpg
 ├── class2/
 │    ├── img1.jpg
 │    ├── img2.jpg

Basic Usage

from migt import MIGTSplitter

splitter = MIGTSplitter(
    dataset_root="path/to/dataset"
)

splitter.run(output_root="migt_output")

Output Structure

migt_output/
 ├── train/
 ├── test/
 └── val/        # created only if val is enabled
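After a run, you can sanity-check the result with a quick file count per split. This is a plain-pathlib sketch (`split_counts` is a hypothetical helper, not part of the MIGT API):

```python
from pathlib import Path

def split_counts(output_root):
    """Count the files under each split directory (train/test/val)
    produced in output_root. Plain pathlib; nothing MIGT-specific."""
    counts = {}
    for split_dir in Path(output_root).iterdir():
        if split_dir.is_dir():
            counts[split_dir.name] = sum(
                1 for p in split_dir.rglob("*") if p.is_file()
            )
    return counts
```

Comparing the returned counts against your configured ratios is an easy way to confirm the split behaved as expected.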

Advanced Usage

from migt import MIGTSplitter

splitter = MIGTSplitter(
    dataset_root="dataset",
    mode="auto",
    bins=4,
    min_bin=13,
    train=0.6,
    test=0.3,
    val=0.1,
    strict_shape=True,
    seed=42
)

splitter.run("migt_output")

Parameters

| Parameter        | Type          | Description                                          |
|------------------|---------------|------------------------------------------------------|
| dataset_root     | str           | Path to dataset directory                            |
| mode             | str           | MI mode: auto / grayscale / color                    |
| bins             | int           | Initial number of histogram bins                     |
| min_bin          | int           | Minimum samples required per bin                     |
| train            | float         | Training split ratio                                 |
| test             | float         | Test split ratio                                     |
| val              | float or None | Validation split ratio (optional)                    |
| strict_shape     | bool          | Enforce identical image sizes                        |
| resize_to        | tuple or None | Resize images for MI computation                     |
| seed             | int           | Random seed                                          |

Mutual Information Modes

  • auto (default): Automatically selects grayscale or color MI

  • grayscale: Histogram-based MI on grayscale images

  • color: Channel-wise normalized MI on RGB images
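For intuition, histogram-based MI between two equal-size grayscale images can be sketched in plain NumPy. This illustrates the idea, not MIGT's internal code; `bins=64` is an arbitrary choice here:

```python
import numpy as np

def grayscale_mi(img_a, img_b, bins=64):
    """Plug-in mutual information estimate from a joint 2D histogram
    of pixel intensities (a sketch of histogram-based MI)."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()          # joint distribution estimate
    px = pxy.sum(axis=1)               # marginal of img_a
    py = pxy.sum(axis=0)               # marginal of img_b
    nz = pxy > 0                       # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])))
```

An image compared with itself yields its own entropy (high MI), while two unrelated images yield values near zero, which is what makes MI usable as a similarity score for ranking samples against a reference.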

Image Size Handling

1️⃣ Strict Shape Mode (Research-Safe)

strict_shape=True
resize_to=None
  • All images must have identical dimensions

  • Mismatched images are skipped

  • No interpolation applied

  • Recommended for scientific experiments

2️⃣ Practical Mode (Optional Resizing)

strict_shape=False
resize_to=(224, 224)
  • Images resized in memory for MI computation

  • Supports mixed-resolution datasets

  • Original images are copied unchanged
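A minimal sketch of what practical mode implies (`load_for_mi` is a hypothetical helper, not MIGT's API): the image is decoded and resized in memory for the MI computation only, while the file on disk is later copied verbatim into its split folder.

```python
import numpy as np
from PIL import Image

def load_for_mi(path, resize_to=(224, 224)):
    """Load an image as a grayscale array for MI computation,
    resizing in memory; the original file is never rewritten."""
    img = Image.open(path).convert("L")
    if resize_to is not None:
        img = img.resize(resize_to)    # PIL expects (width, height)
    return np.asarray(img, dtype=np.float64)
```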

Histogram Binning Strategy (Final)

For each class:

1. Compute MI relative to a reference image

2. Construct value-based histogram bins

3. Start with the user-defined bin count (default: 4)

If any bin has fewer than min_bin samples:

  • Reduce the bin count sequentially: 4 → 3 → 2

  • Recompute the histogram from scratch at each step; bins are never merged
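The reduction loop above can be sketched as follows (an illustration, not MIGT's internal code; `min_bin=13` mirrors the advanced-usage example):

```python
import numpy as np

def adaptive_bins(mi_values, start_bins=4, min_bin=13):
    """Try start_bins, then one fewer, down to 2; succeed as soon as
    every bin holds at least min_bin samples. Each attempt rebuilds
    the histogram from scratch rather than merging bins."""
    for n_bins in range(start_bins, 1, -1):
        counts, edges = np.histogram(mi_values, bins=n_bins)
        if counts.min() >= min_bin:
            return n_bins, counts, edges
    return None  # histogram binning failed even at 2 bins
```

A roughly uniform MI distribution keeps all 4 bins, while a sharply bimodal one falls back to 2; a `None` result is what triggers the deterministic fallback rules below.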

Fallback Rules (Deterministic)

Case A — Small Classes (initial class size < min_bin)

  • Histogram binning is skipped

  • The user-selected strategy is applied:

      • random

      • mi_quantile (2-quantile split)

Case B — Large Classes, Histogram Failure

If binning fails even at 2 bins:

  • A forced MI-based 3-quantile split is applied

  • A random split is never used

  • The user's fallback preference is intentionally ignored
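An MI-based q-quantile split can be sketched with NumPy quantiles (an illustration of the idea, not MIGT's internal code): rank images by their MI score and cut at quantile boundaries so each group spans one MI range.

```python
import numpy as np

def mi_quantile_groups(mi_values, q=3):
    """Assign each MI value to one of q quantile groups (0..q-1)."""
    mi_values = np.asarray(mi_values, dtype=float)
    inner_edges = np.quantile(mi_values, np.linspace(0, 1, q + 1))[1:-1]
    # right=True keeps values equal to an edge in the lower group
    return np.digitize(mi_values, inner_edges, right=True)
```

Sampling train/test/val proportionally from every group is what keeps easy and hard samples represented in each split.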

Decision Summary

| Situation                               | Strategy Applied                    |
|-----------------------------------------|-------------------------------------|
| Class size < min_bin                    | User choice (random / mi_quantile)  |
| Histogram binning succeeds              | Histogram-based split               |
| Histogram fails at 2 bins               | Forced MI-based 3-quantile          |
| Random after histogram failure          | ❌ Never                            |

Reference Image Handling

  • The first image of each class is used as the reference

  • The reference image always goes to the training set

Train/Test Ratio Note

MI is computed on N − 1 images per class (the reference image is excluded). The reference is then added to the training set, which can shift a split count by one (e.g., 37 training images instead of 36).

This behavior is expected and deterministic.
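A made-up numeric illustration of that shift (N = 61 and the 0.6 ratio are assumed numbers, not MIGT internals):

```python
import math

# Hypothetical class of 61 images with train=0.6.
N, train_ratio = 61, 0.6
naive = math.floor(train_ratio * N)             # 36 if the reference were ranked too
actual = math.floor(train_ratio * (N - 1)) + 1  # 36 MI-ranked images + reference = 37
```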

Guarantees

MIGT guarantees:

✅ No image loss

✅ No class skipping

✅ Distribution-aware splits

✅ Deterministic behavior

✅ Randomness only when statistically justified

Validation Split (Optional)

MIGTSplitter(
    train=0.8,
    test=0.2,
    val=None
)
  • Only train/ and test/ folders are created

  • No validation directory is generated

Dependencies

numpy>=1.23
scipy>=1.8
scikit-learn>=1.2
scikit-image>=0.19
opencv-python>=4.5
pillow>=9.0
tqdm>=4.64

Reference

This work is based on:

L. Shahmiri, P. Wong, and L. S. Dooley, "Accurate Medicinal Plant Identification in Natural Environments by Embedding Mutual Information in a Convolution Neural Network Model," IEEE IPAS, 2022. https://ieeexplore.ieee.org/abstract/document/10053008
