AuraData: A data-centric auditing and diagnostics engine for machine learning datasets, designed to detect noise, duplicates, label issues, and subgroup risks with transparent reporting.

AuraData — Automated Data Quality Auditing Engine

Author: Abdul Mofique Siddiqui
License: MIT

Install via pip:

pip install auradata

Import it in your Python code:

from auradata import Dataset

Overview

AuraData is a data-centric auditing and diagnostics engine for machine learning datasets.

It automatically inspects datasets to detect:

  • Duplicate samples
  • Noisy or anomalous records
  • Potentially mislabeled samples
  • Subgroup performance disparities (bias risks)

AuraData is designed to be transparent, conservative, and human-in-the-loop: it flags risks and provides diagnostics instead of blindly modifying data.


How It Works

  • Duplicate Detection: Identifies exact row duplicates.
  • Noise Detection: Uses Isolation Forest on numeric features to flag outliers.
  • Label Issue Detection: Flags samples where the model strongly disagrees with provided labels.
  • Bias Audit: Evaluates subgroup performance disparities across sensitive attributes.
  • State Tracking: Tracks cleaning and fixing actions safely and reversibly.
  • HTML Reporting: Produces structured, readable audit reports.
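
The first two checks in the list can be approximated outside AuraData with pandas and scikit-learn. The sketch below is not AuraData's implementation; the synthetic data, the `contamination=0.02` setting, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic mixed-type data (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, 100),
    "score": rng.normal(70, 10, 100),
    "gender": rng.choice(["M", "F"], 100),
})

# Inject one extreme outlier and one exact duplicate of row 0.
df.loc[5, ["income", "score"]] = [1_000_000.0, 300.0]
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# Exact-row duplicates: True for every repeat after the first occurrence.
duplicate_mask = df.duplicated()

# Outliers: Isolation Forest on numeric columns only; -1 marks an anomaly.
numeric = df.select_dtypes(include="number")
forest = IsolationForest(contamination=0.02, random_state=0)
outlier_mask = forest.fit_predict(numeric) == -1

print("duplicate rows:", df.index[duplicate_mask].tolist())
print("outlier rows:", np.flatnonzero(outlier_mask).tolist())
```

The `contamination` value sets the fraction of rows the forest will flag, so it is worth tuning to the dataset rather than copying from this sketch.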

Getting Started

1. Import the package

from auradata import Dataset

2. Initialize the dataset

ds = Dataset(X, y)

3. Run an initial audit

ds.audit(check_labels=False, check_bias=False)

4. Clean obvious issues

ds.clean(remove_duplicates=True, remove_noise=True)

5. Train your model

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(ds.X.select_dtypes(include=["number"]), ds.y)

6. Run a full audit

ds.audit(model=model, sensitive_feature="gender", check_duplicates=False, check_noise=False)

7. Fix label issues (optional)

ds.fix_labels(model)

8. Generate a report

ds.report("auradata_report.html")

API Reference

Dataset(X, y=None, feature_names=None)

Initializes the dataset.

Parameters:

  • X: Feature matrix (array-like or DataFrame)
  • y: Labels (optional)
  • feature_names: Optional column names

.audit(...)

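Audits the dataset for quality issues. Keyword arguments used in this README: model, sensitive_feature, check_duplicates, check_noise, check_labels, check_bias, and label_threshold.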


.clean(...)

Removes duplicate and/or noisy samples, as selected by the remove_duplicates and remove_noise flags.
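
Whatever detector produces the flags, removing rows has to keep features and labels aligned. A minimal sketch of that step, assuming boolean masks are already available (the `noise_mask` here is hand-written, not the output of any AuraData call):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"a": [1, 1, 2, 3], "b": [5, 5, 6, 7]})
y = np.array([0, 0, 1, 1])

# Row 1 repeats row 0, so it is marked as a duplicate.
duplicate_mask = X.duplicated().to_numpy()
# Assumption: an outlier detector flagged the last row.
noise_mask = np.array([False, False, False, True])

# Apply one combined mask to X and y so they stay the same length and order.
keep = ~(duplicate_mask | noise_mask)
X_clean = X.loc[keep].reset_index(drop=True)
y_clean = y[keep]

print(X_clean.shape, y_clean.tolist())
```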


.fix_labels(model)

Replaces labels flagged as likely mislabeled with the model's predictions. The usage example also passes retrain=True and threshold=0.7.
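
Confidence-based disagreement can be sketched with NumPy alone. The helper below is hypothetical (`flag_label_issues` is not an AuraData function), and it assumes the labels are class indices that match the column order of the probability matrix, as `predict_proba` returns for labels 0..k-1 in scikit-learn.

```python
import numpy as np

def flag_label_issues(proba, labels, threshold=0.7):
    """Return (suspect indices, predicted classes).

    A sample is suspect when the most probable class differs from the
    given label and its probability is at least `threshold`.
    """
    proba = np.asarray(proba, dtype=float)
    labels = np.asarray(labels)
    predicted = proba.argmax(axis=1)
    confidence = proba.max(axis=1)
    mask = (predicted != labels) & (confidence >= threshold)
    return np.flatnonzero(mask), predicted

# Toy class probabilities for four samples and their provided labels.
proba = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.45, 0.55],
                  [0.05, 0.95]])
labels = np.array([0, 0, 0, 1])

suspect_idx, predicted = flag_label_issues(proba, labels, threshold=0.7)

# Overwrite only the flagged labels; low-confidence disagreements stay untouched.
fixed = labels.copy()
fixed[suspect_idx] = predicted[suspect_idx]
```

Sample 2 disagrees with its label but only at 0.55 confidence, so it is left alone; a higher threshold makes the correction more conservative.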


.report(path)

Generates a report summarizing all detected issues. The usage example passes report_format="html" (the default), "pdf", or "both".


.restore_original()

Restores the dataset to its original unmodified state.
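
One common way to make cleaning reversible is to keep an untouched snapshot next to the working copy and log each action. The class below is a hypothetical illustration of that pattern using only the standard library; it mirrors the method name for readability but says nothing about how AuraData stores its state.

```python
import copy

class ReversibleData:
    """Hypothetical helper: working copy, pristine snapshot, and action log."""

    def __init__(self, rows):
        self._original = copy.deepcopy(rows)  # never mutated
        self.rows = copy.deepcopy(rows)       # working copy
        self.history = []                     # (reason, dropped indices)

    def drop(self, indices, reason):
        """Remove rows by position and record why they were removed."""
        drop_set = set(indices)
        self.rows = [r for i, r in enumerate(self.rows) if i not in drop_set]
        self.history.append((reason, sorted(drop_set)))

    def restore_original(self):
        """Discard all changes and return to the snapshot."""
        self.rows = copy.deepcopy(self._original)
        self.history.clear()

data = ReversibleData([[1, 2], [1, 2], [3, 4]])
data.drop([1], reason="duplicate")
rows_after_drop = len(data.rows)
log_after_drop = list(data.history)
data.restore_original()
```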


.summary()

Prints a quick console summary of the dataset state.


Example Usage

import numpy as np
import pandas as pd
from auradata import Dataset
from sklearn.linear_model import LogisticRegression

# Create synthetic dataset with 200 samples
np.random.seed(42)
n = 200

X = pd.DataFrame({
    "age": np.random.randint(18, 70, n),
    "income": np.random.normal(50000, 15000, n),
    "score": np.random.normal(70, 10, n),
    "gender": np.random.choice(["M", "F"], n)
})

# Binary label: 1 when income > 50,000 and score > 70, otherwise 0
y = ((X["income"] > 50000) & (X["score"] > 70)).astype(int).values

# Inject a duplicate row at index 1
X.iloc[1] = X.iloc[0]
y[1] = y[0]

# Inject an extreme outlier at index 5
X.loc[5, ["age", "income", "score"]] = [150, 1_000_000, 300]

# Flip labels at specific indices to simulate labeling errors
y[10] = 1 - y[10]
y[20] = 1 - y[20]
y[30] = 1 - y[30]

print("Injected issues: 1 duplicate, 1 outlier, 3 flipped labels\n")

# Initialize the AuraData wrapper
ds = Dataset(X, y)

print("STEP 1: Initial audit (duplicates & noise)")
# Check for structural issues first
ds.audit(check_labels=False, check_bias=False)

print("\nSTEP 2: Cleaning dataset")
# Remove duplicates and noise identified in the audit
ds.clean(remove_duplicates=True, remove_noise=True)

print("\nSTEP 3: Training model")
# Train a simple model on numeric columns only
model = LogisticRegression(max_iter=1000)
model.fit(ds.X.select_dtypes(include=["number"]), ds.y)

print("\nSTEP 4: Full audit (labels & bias)")
# Run model-based checks
ds.audit(
    model=model,
    sensitive_feature="gender",
    check_duplicates=False,
    check_noise=False,
    label_threshold=0.7
)

print("\nSTEP 5: Fixing labels")
# Automatically correct labels where the model is confident
ds.fix_labels(model, retrain=True, threshold=0.7)

print("\nSTEP 6: Generating reports")

# --- REPORTING OPTIONS ---

# Option 1: Generate HTML report only (This is the DEFAULT)
# ds.report("auradata_report", report_format="html")

# Option 2: Generate PDF report only
# ds.report("auradata_report", report_format="pdf")

# Option 3: Generate BOTH HTML and PDF reports
ds.report("auradata_report", report_format="both")

# -------------------------

# Print final stats to console
ds.summary()

print("\nDone! Reports generated.")

Internals

  • Isolation Forest for outlier detection
  • Confidence-based disagreement for label validation
  • Group-wise evaluation for bias detection
  • State-aware cleaning with reversible actions
  • Transparent reporting for auditability
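
Group-wise evaluation can be illustrated with a pandas `groupby`. The sketch below uses accuracy as the per-group metric and the max-min gap as the disparity measure; both choices, and the helper name `groupwise_accuracy`, are assumptions, since the metric AuraData reports is not stated here.

```python
import pandas as pd

def groupwise_accuracy(y_true, y_pred, groups):
    """Return per-group accuracy and the gap between best and worst group."""
    frame = pd.DataFrame({
        "correct": pd.Series(y_true) == pd.Series(y_pred),
        "group": groups,
    })
    per_group = frame.groupby("group")["correct"].mean()
    gap = per_group.max() - per_group.min()
    return per_group, gap

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0]
groups = ["F", "F", "F", "M", "M", "M"]

per_group, gap = groupwise_accuracy(y_true, y_pred, groups)
print(per_group)
print("accuracy gap:", round(gap, 3))
```

A large gap is a prompt for manual review rather than proof of bias, especially when some groups contain only a few samples.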

Notes

  • Works with numeric and mixed datasets
  • Conservative by default (no blind destructive actions)
  • Designed for ML practitioners and researchers
  • Suitable for responsible and regulated workflows

Author

Abdul Mofique Siddiqui


License

This project is licensed under the MIT License.
