Density Aware Undersampling for Imbalanced Data
DAU Undersampling (Density-Aware Undersampling)
DAU (Density-Aware Undersampling) is a Python package to handle imbalanced datasets by reducing the majority class without losing important information.
Instead of random undersampling, DAU keeps:
- Sparse points (outliers / rare cases) → retained fully
- Dense clusters → represented by a few points (using DBSCAN)
- Noise points → kept separately
The goal is to shrink the majority class while preserving its structure (rare cases and cluster shape), which random undersampling can discard and which can help downstream ML performance.
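The three rules above can be sketched with scikit-learn building blocks. This is a minimal illustration of the idea, not the package's actual implementation: the function name `density_aware_indices`, the `sparse_percentile` and `reps_per_cluster` parameters, and the choice of "mean k-NN distance above a percentile" as the sparsity test are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def density_aware_indices(X_major, n_neighbors=5, eps=0.5, min_samples=3,
                          sparse_percentile=75, reps_per_cluster=1, seed=0):
    """Return sorted row indices of majority-class points to keep (illustrative)."""
    X_major = np.asarray(X_major, dtype=float)
    rng = np.random.default_rng(seed)

    # Mean distance to the k nearest neighbours; column 0 is the zero self-distance.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_major)
    dist, _ = nn.kneighbors(X_major)
    knn_dist = dist[:, 1:].mean(axis=1)

    # Assumption: points whose k-NN distance exceeds the percentile are "sparse".
    threshold = np.percentile(knn_dist, sparse_percentile)
    sparse_idx = np.flatnonzero(knn_dist > threshold)
    dense_idx = np.flatnonzero(knn_dist <= threshold)

    keep = list(sparse_idx)  # sparse points are retained in full
    if dense_idx.size:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_major[dense_idx])
        for label in np.unique(labels):
            members = dense_idx[labels == label]
            if label == -1:
                keep.extend(members)  # DBSCAN noise points are kept as they are
            else:
                # Each dense cluster is reduced to a few representatives.
                n_keep = min(reps_per_cluster, members.size)
                keep.extend(rng.choice(members, size=n_keep, replace=False))
    return np.sort(np.array(keep, dtype=int))
```

The returned indices refer to rows of the majority class only; the minority class would be left untouched and concatenated back afterwards.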
Installation
From PyPI:
pip install dau-undersampling
(Optionally, for testing on TestPyPI):
pip install -i https://test.pypi.org/simple/ dau-undersampling
⚡ Quickstart
import pandas as pd
from sklearn.datasets import make_classification
from dau_undersampling import DAU
# 1. Create an imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_features=10,
    n_classes=2, weights=[0.9, 0.1],
    random_state=42
)
X = pd.DataFrame(X)
y = pd.Series(y)
# 2. Apply DAU undersampling
dau = DAU(n_neighbors=5, min_samples=3, eps=0.5, percentile=25)
X_resampled, y_resampled = dau.fit_transform(X, y)
# 3. Compare class counts before and after undersampling
print("Original class counts:", y.value_counts().to_dict())
print("Resampled class counts:", y_resampled.value_counts().to_dict())
🛠 Usage & Parameters
Class: DAU
DAU(n_neighbors=3, min_samples=5, eps=0.05, percentile=25)
Parameters:
- n_neighbors (int, default=3): Number of neighbors for the KNN distance calculation.
- min_samples (int, default=5): Minimum samples per cluster (DBSCAN).
- eps (float, default=0.05): Maximum neighborhood radius (DBSCAN).
- percentile (int, default=25): Threshold used to split sparse vs dense points.
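Because eps is a distance in feature units, a sensible value depends on how the features are scaled. One common heuristic for DBSCAN is to look at the distribution of k-th nearest-neighbor distances. The helper below is a sketch of that heuristic only: `suggest_eps` and its `quantile` argument are hypothetical and not part of DAU, and it assumes you standardize the features yourself (the description does not say whether DAU scales its input).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def suggest_eps(X, min_samples=5, quantile=0.9):
    """Heuristic eps: a quantile of the k-distance on standardized features."""
    X_scaled = StandardScaler().fit_transform(X)

    # DBSCAN counts the point itself towards min_samples, so asking for
    # min_samples neighbours (self included) gives the matching k-distance.
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
    dist, _ = nn.kneighbors(X_scaled)
    k_distance = dist[:, -1]

    # With this eps, roughly `quantile` of the points become DBSCAN core points.
    return float(np.quantile(k_distance, quantile))
```

A value obtained this way is only a starting point; it applies to the standardized features, so the same scaling would need to be applied before resampling.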
Method: fit_transform(X, y)
Performs density-aware undersampling.
Arguments:
- X: pd.DataFrame → features of majority class or dataset.
- y: pd.Series → labels (binary classification).
Returns:
- X_resampled: Reduced features after undersampling.
- y_resampled: Reduced labels aligned with features.
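After resampling it is worth checking how much of each class survived. The helper below is a small sketch that works on any pair of label vectors; `resampling_report` is a hypothetical name, not a function shipped with the package.

```python
import pandas as pd


def resampling_report(y_before, y_after):
    """Per-class counts before and after resampling, plus the fraction kept."""
    before = pd.Series(y_before).value_counts().sort_index()
    # Classes that vanished entirely show up with a count of 0.
    after = pd.Series(y_after).value_counts().reindex(before.index, fill_value=0)

    report = pd.DataFrame({"before": before, "after": after})
    report["kept_fraction"] = report["after"] / report["before"]
    return report
```

With the quickstart variables this would be called as `resampling_report(y, y_resampled)`; a minority-class `kept_fraction` below 1.0 would indicate that minority samples were dropped as well.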
Example in Pipeline
You can also integrate DAU into an ML pipeline (with imblearn):
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from dau_undersampling import DAU

pipeline = Pipeline([
    ('undersample', DAU(n_neighbors=7, min_samples=5, eps=0.4, percentile=30)),
    ('clf', LogisticRegression())
])
pipeline.fit(X, y)
Why DAU vs Other Methods?
| Method | Behavior |
|---|---|
| Random undersampling | Drops samples randomly (risk of losing rare but important cases). |
| NearMiss / Tomek Links | Works with distances but may remove outliers or boundary points. |
| DAU (this package) | Preserves outliers + keeps 1 representative per dense cluster (balanced). |
Contributing
- Fork this repo
- Create a new branch (git checkout -b feature-xyz)
- Commit changes (git commit -m "Added xyz")
- Push (git push origin feature-xyz)
- Open a Pull Request
License
This project is licensed under the MIT License – see the LICENSE file for details.
File details
Details for the file dau_undersampler-0.1.0.tar.gz.
File metadata
- Download URL: dau_undersampler-0.1.0.tar.gz
- Upload date:
- Size: 4.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 39c1cbffcf347483319650290d1bda240bbb73cf7efa0ef8febaa1fb76258399 |
| MD5 | 0e404cba38ea1af61050853e45da7ea2 |
| BLAKE2b-256 | ab62ae8d10cc7d88c97447946265e5ed98dca7409406be61495206d85be98069 |
File details
Details for the file dau_undersampler-0.1.0-py3-none-any.whl.
File metadata
- Download URL: dau_undersampler-0.1.0-py3-none-any.whl
- Upload date:
- Size: 4.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 33f7ffb7361930e763a92b8ea89ac7038e0ca528fe3b1f531b3b0d2cc235af26 |
| MD5 | 33fbbe3f51253fae5cc19200d52c8b0c |
| BLAKE2b-256 | ef076a5a3494e5e3cf61e46b54c939b3d6a869e876988551788421c56a9756fd |