OverNaN
Oversampling for Imbalanced Learning with Missing Values
Many datasets contain missing values, particularly when dealing with real-world data. Missingness can be resolved with imputation, but imputed values are synthetic: they can bias models, inflate training performance metrics, and hurt generalisability. Imputation also masks information carried by the missingness mechanism, since which values are missing can itself be meaningful. To eliminate the need for imputation and better prepare models for real-world validation, OverNaN implements NaN-aware oversampling algorithms for handling class imbalance in datasets with missing values. OverNaN preserves missingness patterns while generating synthetic minority-class samples, making it a natural partner for learning algorithms that natively handle NaNs. The ultimate goal is classification performance similar to or better than imputation, while preserving the missingness and only using real values.
Key Features
- Three NaN-Aware Algorithms: SMOTE, ADASYN, and ROSE with native missing value support
- Flexible NaN Handling: Choose how synthetic samples inherit missingness patterns
- Scikit-learn Compatible: Familiar fit_resample() interface
- Pandas Integration: Preserves DataFrame column names and Series names
- Parallel Processing: Joblib-based parallelization for large datasets
- Cross-Platform: Windows, Linux, and macOS compatible
Installation
pip install overnan
Quick Start
from overnan import OverNaN
import numpy as np
# Generate imbalanced data with missing values
X = np.array([
[1.0, 2.0, np.nan],
[2.0, np.nan, 3.0],
[3.0, 4.0, 5.0],
[4.0, 5.0, 6.0],
[10.0, 11.0, 12.0]
])
y = np.array([0, 0, 0, 0, 1]) # 4:1 imbalance
# Resample with NaN-aware SMOTE
oversampler = OverNaN(method='SMOTE', neighbours=2, random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)
print(f"Before: {dict(zip(*np.unique(y, return_counts=True)))}")
print(f"After: {dict(zip(*np.unique(y_resampled, return_counts=True)))}")
Available Methods
| Method | Description | Best For |
|---|---|---|
| SMOTE | Interpolates between minority samples and neighbors | General purpose |
| ADASYN | Adaptive interpolation focusing on hard samples | Complex boundaries |
| ROSE | Kernel density estimation with Gaussian perturbation | High dimensions |
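The interpolation step behind SMOTE can be sketched in plain NumPy. This is an illustrative sketch of the general technique, not OverNaN's implementation: a synthetic point lies a random fraction of the way along the segment between a minority sample and one of its neighbours.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two minority-class samples: a point and one of its nearest neighbours
x = np.array([1.0, 2.0])
neighbour = np.array([3.0, 6.0])

# SMOTE-style interpolation: step a random fraction towards the neighbour
lam = rng.uniform(0, 1)
synthetic = x + lam * (neighbour - x)

# The synthetic point lies on the segment between the two parents
print(synthetic)
```

ADASYN uses the same interpolation step but draws more synthetic points near minority samples that are hard to classify.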
# SMOTE: Neighbor-based interpolation
OverNaN(method='SMOTE', neighbours=5)
# ADASYN: Adaptive synthetic sampling
OverNaN(method='ADASYN', neighbours=5, beta=1.0, learning_rate=1.0)
# ROSE: Kernel density estimation (no neighbors required)
OverNaN(method='ROSE', shrinkage=1.0)
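The idea behind ROSE can also be sketched in NumPy (again illustrative, not OverNaN's internals): draw a real minority sample and perturb it with Gaussian noise whose scale follows a Silverman-style kernel bandwidth, with shrinkage scaling the kernel width.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minority-class samples (n samples, d features)
X_min = rng.normal(loc=5.0, scale=1.0, size=(50, 3))
n, d = X_min.shape
shrinkage = 1.0

# Silverman-style bandwidth per feature, scaled by the shrinkage factor
h = shrinkage * (4 / ((d + 2) * n)) ** (1 / (d + 4)) * X_min.std(axis=0)

# ROSE-style sampling: pick a real sample, add Gaussian noise of scale h
parent = X_min[rng.integers(n)]
synthetic = parent + rng.normal(0, h)

print(synthetic.shape)
```

Because no neighbour search is involved, this sampling step costs the same regardless of dimensionality, which is why ROSE suits high-dimensional data.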
NaN Handling Strategies
| Strategy | Behavior |
|---|---|
| 'preserve_pattern' | NaN if either parent has NaN (default, conservative) |
| 'interpolate' | Use available values; minimizes NaN in output |
| 'random_pattern' | Probabilistically preserve NaN based on class rates |
# Preserve missingness structure
OverNaN(method='SMOTE', nan_handling='preserve_pattern')
# Minimize NaN in output
OverNaN(method='SMOTE', nan_handling='interpolate')
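The 'preserve_pattern' rule can be illustrated with a small NumPy sketch (a hypothetical mock-up of the behaviour, not OverNaN's internals): a feature is NaN in the child whenever it is NaN in either parent, and interpolation happens only where both parents are observed.

```python
import numpy as np

rng = np.random.default_rng(1)

parent_a = np.array([1.0, 2.0, np.nan])
parent_b = np.array([2.0, np.nan, 3.0])

# A feature is missing in the child if it is missing in either parent
child_mask = np.isnan(parent_a) | np.isnan(parent_b)

# Interpolate where both parents are observed; NaN elsewhere
lam = rng.uniform(0, 1)
child = parent_a + lam * (parent_b - parent_a)
child[child_mask] = np.nan

print(child)  # only the first feature is observed
```

Under 'interpolate', by contrast, the available parent value would be copied through when only one parent is missing, keeping more features observed in the output.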
Integration with XGBoost
XGBoost handles NaN natively, making it ideal for use with OverNaN:
from overnan import OverNaN
from sklearn.model_selection import train_test_split
import xgboost as xgb
# Split before oversampling
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
# Oversample training data only
oversampler = OverNaN(method='ROSE', random_state=42)
X_train_res, y_train_res = oversampler.fit_resample(X_train, y_train)
# Train and evaluate
model = xgb.XGBClassifier()
model.fit(X_train_res, y_train_res)
accuracy = model.score(X_test, y_test)
Documentation
Complete documentation, examples, and guides available at: https://github.com/amaxiom/OverNaN
Includes:
| Document | Description |
|---|---|
| Interpretation Guide | Methods, parameters, usage examples |
| Computation Guide | Implementation, parallelization, memory |
| Testing Guide | Test suite and benchmarks |
Running Tests
# Run test suite
python tests/OverNaN_test.py
# Run benchmarks (requires openml)
python tests/OverNaN_bench.py
Performance Considerations
- Parallel Processing: Enable with n_jobs=-1 for large datasets
- ROSE for High Dimensions: ROSE does not require neighbour search, making it more efficient for high-dimensional data
- Memory Usage: Synthetic samples are generated in batches to manage memory
Requirements
- Python >= 3.8
- numpy >= 1.19.0
- pandas >= 1.1.0
- scikit-learn >= 0.24.0
- joblib >= 1.0.0
Optional for benchmarking:
- xgboost >= 1.4.0
- openml >= 0.12.0
- imbalanced-learn >= 0.8.0
References
- SMOTE: Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. JAIR, 16, 321-357. DOI: 10.1613/jair.953
- ADASYN: He, H., Bai, Y., Garcia, E.A., Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. IEEE IJCNN, 1322-1328. DOI: 10.1109/IJCNN.2008.4633969
- ROSE: Menardi, G. and Torelli, N. (2014). Training and assessing classification rules with imbalanced data. DMKD, 28, 92-122. DOI: 10.1007/s10618-012-0295-5
License
MIT License
Copyright (c) 2026 Amanda Barnard
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
Citation
If you use OverNaN in your research, please cite:
@software{overnan2026,
author = {Barnard, Amanda S.},
title = {OverNaN: Oversampling for Imbalanced Learning with Missing Values},
year = {2026},
url = {https://github.com/amaxiom/OverNaN},
version = {0.2}
}
Ready to preserve data integrity during imbalanced learning?
pip install overnan