Density Aware Undersampling for Imbalanced Data

Project description

DAU Undersampling (Density-Aware Undersampling)

DAU (Density-Aware Undersampling) is a Python package for handling imbalanced datasets. It shrinks the majority class while trying to keep the samples that carry the most information.

Instead of random undersampling, DAU keeps:

  • Sparse points (outliers / rare cases) → retained fully
  • Dense clusters → represented by a few points (using DBSCAN)
  • Noise points (points DBSCAN assigns to no cluster) → kept separately

The goal is undersampling that preserves rare but informative cases, which can give better ML performance than random undersampling.
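The three rules above can be sketched with scikit-learn. This is an illustration of the general idea, not the package's actual implementation: the function name `density_aware_indices`, the `sparse_percentile` parameter, the use of the mean k-NN distance as the density measure, and the "closest to centroid" choice of representative are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def density_aware_indices(X_major, n_neighbors=5, eps=0.5, min_samples=3,
                          sparse_percentile=75):
    """Return sorted row indices of X_major to keep (illustrative sketch only)."""
    X_major = np.asarray(X_major, dtype=float)

    # Mean distance to the k nearest neighbours; column 0 is the point itself.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_major)
    dist, _ = nn.kneighbors(X_major)
    knn_dist = dist[:, 1:].mean(axis=1)

    # Assumption: a large k-NN distance marks a sparse point, kept in full.
    threshold = np.percentile(knn_dist, sparse_percentile)
    sparse_idx = np.flatnonzero(knn_dist > threshold)
    dense_idx = np.flatnonzero(knn_dist <= threshold)

    keep = list(sparse_idx)
    if dense_idx.size:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_major[dense_idx])
        # DBSCAN noise (label -1) is kept as-is.
        keep.extend(dense_idx[labels == -1])
        # Each cluster is reduced to the member closest to its centroid.
        for label in np.unique(labels[labels >= 0]):
            members = dense_idx[labels == label]
            centroid = X_major[members].mean(axis=0)
            closest = np.linalg.norm(X_major[members] - centroid, axis=1).argmin()
            keep.append(members[closest])
    return np.sort(np.array(keep, dtype=int))


# Two tight blobs plus five far-away outliers (rows 100 to 104).
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.05, size=(50, 2))
blob_b = rng.normal(5.0, 0.05, size=(50, 2))
outliers = np.array([[20.0, 20.0], [-20.0, 20.0], [20.0, -20.0],
                     [-20.0, -20.0], [0.0, 40.0]])
X_major = np.vstack([blob_a, blob_b, outliers])

kept = density_aware_indices(X_major)
print(len(X_major), "->", len(kept), "majority rows kept")
```

In this toy data the outliers survive while each blob collapses to a handful of rows, which is the behaviour the bullet list describes.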


Installation

From PyPI:

pip install dau-undersampler

Optionally, to install the test release from TestPyPI:

pip install -i https://test.pypi.org/simple/ dau-undersampler

⚡ Quickstart

import pandas as pd
from sklearn.datasets import make_classification
from dau_undersampling import DAU

# 1. Create an imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_features=10,
    n_classes=2, weights=[0.9, 0.1],
    random_state=42
)

X = pd.DataFrame(X)
y = pd.Series(y)

# 2. Apply DAU undersampling
dau = DAU(n_neighbors=5, min_samples=3, eps=0.5, percentile=25)
X_resampled, y_resampled = dau.fit_transform(X, y)

print("Original class counts:", y.value_counts().to_dict())
print("Resampled class counts:", y_resampled.value_counts().to_dict())
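The class counts printed above can be condensed into a single imbalance ratio, which makes before/after comparisons easier to read. The helper below is a hypothetical convenience function, not part of the package; it only uses the standard library and works on any sequence of labels.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())


# 900 majority labels against 100 minority labels.
demo_labels = [0] * 900 + [1] * 100
ratio = imbalance_ratio(demo_labels)
print(ratio)  # 9.0
```

A ratio closer to 1 after resampling means the classes are closer to balanced. With pandas, `imbalance_ratio(y.tolist())` would apply the same helper to a label Series.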

🛠 Usage & Parameters

Class: DAU

DAU(n_neighbors=3, min_samples=5, eps=0.05, percentile=25)

Parameters:

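  • n_neighbors (int, default=3): Number of neighbors used in the KNN distance calculation.
  • min_samples (int, default=5): DBSCAN min_samples, the number of points required in a neighborhood for a point to count as a core point.
  • eps (float, default=0.05): DBSCAN neighborhood radius, the maximum distance at which two points are considered neighbors.
  • percentile (int, default=25): Percentile used as the threshold that splits sparse from dense points.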
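Because eps is a distance, its useful range depends on the scale of the features. The sketch below uses scikit-learn's DBSCAN directly (not DAU) on synthetic data to show how a very small eps can label almost everything as noise, and how a k-distance percentile can suggest a starting value. The helper `count_clusters`, the synthetic blobs, and the 90th-percentile heuristic are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Two well-separated Gaussian blobs with unit standard deviation.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(10.0, 1.0, size=(100, 2)),
])


def count_clusters(X, eps, min_samples=5):
    """Return (number of clusters, number of noise points) for one DBSCAN run."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise


tight = count_clusters(X_demo, eps=0.05)  # radius far below typical point spacing
loose = count_clusters(X_demo, eps=0.5)   # radius comparable to point spacing

# k-distance heuristic: distance to the 5th neighbour (index 0 is the point itself).
nn = NearestNeighbors(n_neighbors=6).fit(X_demo)
k_dist = nn.kneighbors(X_demo)[0][:, -1]
eps_guess = float(np.percentile(k_dist, 90))

print("eps=0.05 ->", tight)
print("eps=0.5  ->", loose)
print("k-distance suggestion:", round(eps_guess, 3))
```

Standardizing features before clustering (for example with scikit-learn's StandardScaler) keeps a chosen eps meaningful across datasets.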

Method: fit_transform(X, y)

Performs density-aware undersampling.

Arguments:

  • X: pd.DataFrame → features of majority class or dataset.
  • y: pd.Series → labels (binary classification).

Returns:

  • X_resampled: Reduced features after undersampling.
  • y_resampled: Reduced labels aligned with features.
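Since the returned features and labels are meant to stay aligned, a quick sanity check after resampling can catch mismatches early. The helper below is a hypothetical utility written for this example; it assumes pandas objects, as in the Quickstart, and does not depend on DAU.

```python
import pandas as pd


def check_resampled(X_res, y_res, y_original):
    """Validate a resampled (X, y) pair and return the new class counts."""
    if len(X_res) != len(y_res):
        raise ValueError("X_resampled and y_resampled differ in length")
    if not X_res.index.equals(y_res.index):
        raise ValueError("X_resampled and y_resampled have different indexes")
    # Undersampling should never introduce labels that were not in the input.
    unseen = set(y_res.unique()) - set(y_original.unique())
    if unseen:
        raise ValueError(f"unexpected labels after resampling: {unseen}")
    return y_res.value_counts().to_dict()


# Hand-made stand-in for a resampling result: rows 0, 5 and 7 were kept.
y_original = pd.Series([0] * 8 + [1] * 2)
X_res = pd.DataFrame({"f0": [1.0, 2.0, 3.0]}, index=[0, 5, 7])
y_res = pd.Series([0, 0, 1], index=[0, 5, 7])

counts = check_resampled(X_res, y_res, y_original)
print(counts)
```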

Example in Pipeline

You can also integrate DAU into an ML pipeline (with imblearn):

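from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from dau_undersampling import DAU

# imblearn's Pipeline resamples through fit_resample, so this assumes DAU
# provides that method in addition to fit_transform.
pipeline = Pipeline([
    ('undersample', DAU(n_neighbors=7, min_samples=5, eps=0.4, percentile=30)),
    ('clf', LogisticRegression())
])

# X, y as created in the Quickstart
pipeline.fit(X, y)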

Why DAU vs Other Methods?

| Method | Behavior |
| --- | --- |
| Random undersampling | Drops samples randomly (risk of losing rare but important cases). |
| NearMiss / Tomek Links | Works with distances but may remove outliers or boundary points. |
| DAU (this package) | Preserves outliers + keeps 1 representative per dense cluster (balanced). |
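For reference, the random undersampling baseline from the comparison can be written in a few lines of standard-library Python. This is a minimal sketch of the general technique, with the hypothetical name `random_undersample`; it is not taken from this package or from imblearn.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    target = min(len(idx) for idx in by_class.values())

    keep = []
    for idx in by_class.values():
        # Every row in a class is equally likely to be dropped, rare cases included.
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]


rows = [[float(i)] for i in range(20)]
labels = [0] * 16 + [1] * 4
rows_rs, labels_rs = random_undersample(rows, labels)
counts = Counter(labels_rs)
print(dict(counts))
```

Because selection ignores where a point sits in feature space, an isolated majority sample has no better chance of surviving than one buried in a dense cluster.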

Contributing

  1. Fork this repo
  2. Create a new branch (git checkout -b feature-xyz)
  3. Commit changes (git commit -m "Added xyz")
  4. Push (git push origin feature-xyz)
  5. Open a Pull Request

License

This project is licensed under the MIT License – see the LICENSE file for details.

Download files

Source Distribution

dau_undersampler-0.1.0.tar.gz (4.6 kB)

Uploaded Source

Built Distribution

dau_undersampler-0.1.0-py3-none-any.whl (4.7 kB)

Uploaded Python 3

File details

Details for the file dau_undersampler-0.1.0.tar.gz.

File metadata

  • Download URL: dau_undersampler-0.1.0.tar.gz
  • Size: 4.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for dau_undersampler-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 39c1cbffcf347483319650290d1bda240bbb73cf7efa0ef8febaa1fb76258399 |
| MD5 | 0e404cba38ea1af61050853e45da7ea2 |
| BLAKE2b-256 | ab62ae8d10cc7d88c97447946265e5ed98dca7409406be61495206d85be98069 |

File details

Details for the file dau_undersampler-0.1.0-py3-none-any.whl.

File hashes

Hashes for dau_undersampler-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 33f7ffb7361930e763a92b8ea89ac7038e0ca528fe3b1f531b3b0d2cc235af26 |
| MD5 | 33fbbe3f51253fae5cc19200d52c8b0c |
| BLAKE2b-256 | ef076a5a3494e5e3cf61e46b54c939b3d6a869e876988551788421c56a9756fd |