# MIGT: Mutual Information Guided Training

Model-agnostic dataset partitioning using Mutual Information

## Overview
MIGT (Mutual Information Guided Training) is a model-agnostic dataset partitioning framework that splits image datasets into train / test / (optional) validation sets using Mutual Information (MI) instead of random sampling.
The main objective of MIGT is to preserve the information distribution of samples across dataset splits and reduce dataset bias.
Unlike random splitting, MIGT:
- Preserves easy and hard samples proportionally
- Maintains feature similarity across splits
- Reduces distributional skew
- Improves generalization stability
MIGT is compatible with CNNs, Vision Transformers (ViTs), and any vision-based model.
## Key Features
- ✅ Mutual Information–guided dataset partitioning
- ✅ Excel-style distribution-aware histogram binning
- ✅ Adaptive bin reduction (4 → 3 → 2)
- ✅ Deterministic fallback strategies
- ✅ Optional validation split
- ✅ Research-safe strict shape mode
- ✅ Optional practical resizing mode
- ✅ Model-agnostic and framework-independent
- ✅ No data loss, no class skipping
## Installation

```bash
pip install migt
```
## Dataset Structure

```text
dataset/
├── class1/
│   ├── img1.jpg
│   └── img2.jpg
└── class2/
    ├── img1.jpg
    └── img2.jpg
```
## Basic Usage

```python
from migt import MIGTSplitter

splitter = MIGTSplitter(
    dataset_root="path/to/dataset"
)
splitter.run(output_root="migt_output")
```
## Output Structure

```text
migt_output/
├── train/
├── test/
└── val/   # created only if val is enabled
```
## Advanced Usage

```python
from migt import MIGTSplitter

splitter = MIGTSplitter(
    dataset_root="dataset",
    mode="auto",
    bins=4,
    min_bin=13,
    train=0.6,
    test=0.3,
    val=0.1,
    strict_shape=True,
    seed=42
)
splitter.run("migt_output")
```
## Parameters

| Parameter | Type | Description |
|---|---|---|
| `dataset_root` | str | Path to the dataset directory |
| `mode` | str | MI mode: `auto` / `grayscale` / `color` |
| `bins` | int | Initial number of histogram bins |
| `min_bin` | int | Minimum samples required per bin |
| `train` | float | Training split ratio |
| `test` | float | Test split ratio |
| `val` | float or None | Validation split ratio (optional) |
| `strict_shape` | bool | Enforce identical image sizes |
| `resize_to` | tuple or None | Resize images for MI computation |
| `seed` | int | Random seed |
## Mutual Information Modes

- `auto` (default): automatically selects grayscale or color MI
- `grayscale`: histogram-based MI on grayscale images
- `color`: channel-wise normalized MI on RGB images
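For intuition, histogram-based MI between two equal-sized grayscale images can be estimated from their joint pixel histogram. The sketch below is illustrative only; MIGT's internal estimator may differ:

```python
import numpy as np

def grayscale_mi(img_a, img_b, bins=64):
    """Estimate mutual information between two equal-sized grayscale
    images from their joint pixel-intensity histogram (in nats)."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()                  # joint distribution
    px = pxy.sum(axis=1, keepdims=True)        # marginal of img_a
    py = pxy.sum(axis=0, keepdims=True)        # marginal of img_b
    nz = pxy > 0                               # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# An image shares maximal information with itself and far less
# with independent noise.
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(32, 32))
b = rng.integers(0, 256, size=(32, 32))
print(grayscale_mi(a, a) > grayscale_mi(a, b))  # True
```

In MIGT, each class image would be scored this way against the class reference image, and the resulting MI values drive the binning described below.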
## Image Size Handling

### 1️⃣ Strict Shape Mode (Research-Safe)

```python
strict_shape=True
resize_to=None
```

- All images must have identical dimensions
- Mismatched images are skipped
- No interpolation is applied
- Recommended for scientific experiments
### 2️⃣ Practical Mode (Optional Resizing)

```python
strict_shape=False
resize_to=(224, 224)
```

- Images are resized in memory for MI computation
- Supports mixed-resolution datasets
- Original images are copied unchanged
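The resize-for-MI-only pattern can be sketched with Pillow and `shutil` (hypothetical helper names, not MIGT's API): resized pixels feed the MI computation, while the files copied into the output splits are the untouched originals.

```python
import shutil
from pathlib import Path

import numpy as np
from PIL import Image

def load_for_mi(path, resize_to=(224, 224)):
    """Return a grayscale array resized in memory for MI computation.
    The file on disk is never modified."""
    with Image.open(path) as im:
        return np.asarray(im.convert("L").resize(resize_to))

def copy_original(src, dst_dir):
    """The split output receives the original, unresized bytes."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst_dir / Path(src).name)
```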
## Histogram Binning Strategy

For each class:

1. Compute MI relative to a reference image
2. Construct value-based histogram bins
3. Start with the user-defined bin count (default: 4)

If any bin has fewer than `min_bin` samples:

- Reduce the bin count sequentially: 4 → 3 → 2
- Recompute the histogram from scratch each time; no bin merging is performed
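The adaptive reduction above can be sketched as follows (an illustrative implementation, not MIGT's source):

```python
import numpy as np

def adaptive_bins(mi_values, bins=4, min_bin=13):
    """Try bins, bins-1, ..., 2; each attempt rebuilds the histogram
    from scratch rather than merging neighbouring bins."""
    mi_values = np.asarray(mi_values)
    for b in range(bins, 1, -1):
        counts, edges = np.histogram(mi_values, bins=b)
        if counts.min() >= min_bin:
            # assign each sample its bin index (0 .. b-1)
            return np.digitize(mi_values, edges[1:-1])
    return None  # signal the caller to fall back to a quantile split

# 200 evenly spaced MI values -> exactly 50 samples per bin at 4 bins
labels = adaptive_bins(np.linspace(0.0, 1.0, 200))
print(labels is None)  # False: binning succeeds at 4 bins
```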
## Fallback Rules (Deterministic)

### Case A: Small Classes (initial size < `min_bin`)

- Histogram binning is skipped
- The user-selected strategy is applied:
  - `random`
  - `mi_quantile` (2-quantile split)

### Case B: Large Classes, Histogram Failure

If binning fails even at 2 bins:

- A forced MI-based 3-quantile split is applied
- A random split is never used
- The user preference is ignored intentionally
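An MI quantile split of the kind used in both fallback cases can be sketched like this (illustrative only; `mi_quantile_split` is a hypothetical name):

```python
import numpy as np

def mi_quantile_split(mi_values, n_quantiles=3):
    """Cut the MI distribution at quantile boundaries so each stratum
    covers an equal share of the samples, ordered by MI."""
    mi_values = np.asarray(mi_values)
    cuts = np.quantile(mi_values, np.linspace(0, 1, n_quantiles + 1)[1:-1])
    return np.digitize(mi_values, cuts)  # stratum label 0 .. n_quantiles-1

# 30 samples -> three strata of 10; each stratum would then be
# divided according to the train/test ratios.
strata = mi_quantile_split(np.arange(30.0))
print(np.bincount(strata))  # [10 10 10]
```

Because the cut points depend only on the MI values, this fallback stays deterministic, unlike a random split.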
## Decision Summary

| Situation | Strategy Applied |
|---|---|
| Class size < `min_bin` | User choice (`random` / `mi_quantile`) |
| Histogram binning succeeds | Histogram-based split |
| Histogram fails at 2 bins | Forced MI-based 3-quantile split |
| Random after histogram failure | ❌ Never |
## Reference Image Handling

- The first image of each class is the reference
- The reference image always goes to the training set
## Train/Test Ratio Note

MI is computed on N − 1 images per class (the reference image is excluded). The reference is then added to the training set, which may shift counts slightly from the nominal ratio (e.g., 37 training images instead of 36). This behavior is expected and deterministic.
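As a worked example of the count shift, take a hypothetical class of 61 images with `train=0.6` (the exact rounding rule is internal to MIGT; this only illustrates the mechanism):

```python
# Hypothetical class of 61 images: the reference is held out first,
# so the MI-based split runs on the remaining 60.
n_images = 61
train_ratio = 0.6

train_from_split = round((n_images - 1) * train_ratio)  # 0.6 * 60 = 36
train_total = train_from_split + 1  # reference re-added -> 37
print(train_total)  # 37, though a naive 0.6 * 61 suggests ~36.6
```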
## Guarantees

MIGT guarantees:

- ✅ No image loss
- ✅ No class skipping
- ✅ Distribution-aware splits
- ✅ Deterministic behavior
- ✅ Randomness only when statistically justified
## Validation Split (Optional)

```python
MIGTSplitter(
    train=0.8,
    test=0.2,
    val=None
)
```

- Only `train/` and `test/` folders are created
- No validation directory is generated
## Dependencies

```text
numpy>=1.23
scipy>=1.8
scikit-learn>=1.2
scikit-image>=0.19
opencv-python>=4.5
pillow>=9.0
tqdm>=4.64
```
## Reference

This work is based on:

L. Shahmiri, P. Wong, and L. S. Dooley, "Accurate Medicinal Plant Identification in Natural Environments by Embedding Mutual Information in a Convolution Neural Network Model," IEEE IPAS 2022. https://ieeexplore.ieee.org/abstract/document/10053008