AuraData: A data-centric auditing and diagnostics engine for machine learning datasets, designed to detect noise, duplicates, label issues, and subgroup risks with transparent reporting.

AuraData — Automated Data Quality Auditing Engine

Author: Abdul Mofique Siddiqui
License: MIT

Install via pip:

pip install auradata

Import it in your Python code:

from auradata import Dataset

Overview

AuraData is a data-centric auditing and diagnostics engine for machine learning datasets.

It automatically inspects datasets to detect:

  • Duplicate samples
  • Noisy or anomalous records
  • Potentially mislabeled samples
  • Subgroup performance disparities (bias risks)

AuraData is designed to be transparent, conservative, and human-in-the-loop: it flags risks and provides diagnostics instead of blindly modifying data.


How It Works

  • Duplicate Detection: Identifies exact row duplicates.
  • Noise Detection: Uses Isolation Forest on numeric features to flag outliers.
  • Label Issue Detection: Flags samples where the model strongly disagrees with the provided labels.
  • Bias Audit: Evaluates subgroup performance disparities across sensitive attributes.
  • State Tracking: Tracks cleaning and fixing actions safely and reversibly.
  • HTML Reporting: Produces structured, readable audit reports.
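
To make the first two checks concrete, here is a minimal sketch written directly against pandas and scikit-learn. It is not AuraData's internal code: the helper names find_exact_duplicates and find_numeric_outliers, the contamination value of 0.05, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def find_exact_duplicates(df: pd.DataFrame) -> pd.Series:
    """Boolean mask marking rows that exactly repeat an earlier row."""
    return df.duplicated(keep="first")


def find_numeric_outliers(df: pd.DataFrame, contamination: float = 0.05,
                          random_state: int = 0) -> pd.Series:
    """Boolean mask marking rows that Isolation Forest scores as outliers.

    Only numeric columns are used; the contamination default is an assumption.
    """
    numeric = df.select_dtypes(include=["number"])
    forest = IsolationForest(contamination=contamination,
                             random_state=random_state)
    # fit_predict returns -1 for outliers and 1 for inliers.
    predictions = forest.fit_predict(numeric)
    return pd.Series(predictions == -1, index=df.index)


# Synthetic mixed-type data: two numeric columns and one text column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
    "city": ["x"] * 100,
})
# Append a copy of the first row so there is exactly one exact duplicate.
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

dup_mask = find_exact_duplicates(df)
outlier_mask = find_numeric_outliers(df)
print("duplicates:", int(dup_mask.sum()), "outliers:", int(outlier_mask.sum()))
```

Both helpers return masks rather than dropping rows, which mirrors the flag-first approach described above: the caller decides what to remove.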

Getting Started

1. Import the package

from auradata import Dataset

2. Initialize the dataset

ds = Dataset(X, y)

3. Run an initial audit

ds.audit(check_labels=False, check_bias=False)

4. Clean obvious issues

ds.clean(remove_duplicates=True, remove_noise=True)

5. Train your model

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(ds.X.select_dtypes(include=["number"]), ds.y)

6. Run a full audit

ds.audit(model=model, sensitive_feature="gender", check_duplicates=False, check_noise=False)

7. Fix label issues (optional)

ds.fix_labels(model)

8. Generate a report

ds.report("auradata_report.html")

API Reference

Dataset(X, y=None, feature_names=None)

Initializes the dataset.

Parameters:

  • X: Feature matrix (array-like or DataFrame)
  • y: Labels (optional)
  • feature_names: Optional column names

.audit(...)

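Audits the dataset for quality issues. The examples in this document pass model and sensitive_feature, plus the boolean switches check_duplicates, check_noise, check_labels, and check_bias to turn individual checks on or off.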


.clean(...)

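Removes duplicate and/or noisy samples. The Getting Started example selects each with the boolean arguments remove_duplicates and remove_noise.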


.fix_labels(model)

Replaces labels flagged as likely mislabeled with the model's predictions.
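
As an illustration of confidence-based label checking, the sketch below flags samples where a fitted classifier confidently predicts a class other than the given label, then replaces only those labels. It is an assumption-laden sketch, not the library's implementation: flag_label_issues, relabel, and the 0.9 confidence threshold are hypothetical, and the model is any scikit-learn classifier exposing predict_proba.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def flag_label_issues(model, X, y, threshold=0.9):
    """Indices where the model confidently predicts a class other than y.

    The threshold is an assumed value, not one taken from AuraData.
    """
    proba = model.predict_proba(X)
    predicted = model.classes_[np.argmax(proba, axis=1)]
    confidence = proba.max(axis=1)
    disagrees = predicted != np.asarray(y)
    return np.flatnonzero(disagrees & (confidence >= threshold))


def relabel(model, X, y, indices):
    """Copy of y with the flagged entries replaced by model predictions."""
    fixed = np.asarray(y).copy()
    if len(indices):
        fixed[indices] = model.predict(np.asarray(X)[indices])
    return fixed


# Two well-separated clusters with one deliberately flipped label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, size=(50, 2)),
               rng.normal(3.0, 0.5, size=(50, 2))])
y_true = np.array([0] * 50 + [1] * 50)
y_noisy = y_true.copy()
y_noisy[0] = 1  # inject a label error at index 0

model = LogisticRegression(max_iter=1000).fit(X, y_noisy)
flagged = flag_label_issues(model, X, y_noisy)
fixed = relabel(model, X, y_noisy, flagged)
print("flagged indices:", flagged.tolist())
```

Returning a corrected copy instead of overwriting the input keeps the original labels available for review before any change is accepted.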


.report(path)

Generates an HTML report summarizing all detected issues.


.restore_original()

Restores the dataset to its original unmodified state.
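
One simple way to make cleaning reversible is to keep an untouched copy of the data alongside the working copy and log every action. The sketch below shows that pattern with pandas; the TrackedFrame class and its method names are hypothetical and say nothing about how AuraData stores state internally.

```python
import pandas as pd


class TrackedFrame:
    """Hypothetical wrapper: a working copy, a pristine copy, and an action log."""

    def __init__(self, df: pd.DataFrame):
        self._original = df.copy(deep=True)  # never modified
        self.df = df.copy(deep=True)         # working copy
        self.history = []                    # (action, rows_removed) tuples

    def drop_rows(self, mask: pd.Series, reason: str) -> int:
        """Drop rows where mask is True and record the action."""
        removed = int(mask.sum())
        self.df = self.df.loc[~mask].reset_index(drop=True)
        self.history.append((reason, removed))
        return removed

    def restore_original(self) -> None:
        """Reset the working copy to the data as first supplied."""
        self.df = self._original.copy(deep=True)
        self.history.append(("restore_original", 0))


tracked = TrackedFrame(pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]}))
removed = tracked.drop_rows(tracked.df.duplicated(), reason="duplicates")
rows_after_clean = len(tracked.df)
tracked.restore_original()
rows_after_restore = len(tracked.df)
print(removed, rows_after_clean, rows_after_restore)
```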


.summary()

Prints a quick console summary of the dataset state.


Example Usage

Example 1: Basic Audit

from auradata import Dataset
import pandas as pd

X = pd.read_csv("data.csv")
ds = Dataset(X)
ds.audit()
ds.summary()

Example 2: Audit + Clean + Fix Labels

import pandas as pd
from auradata import Dataset
from sklearn.linear_model import LogisticRegression

X = pd.read_csv("data.csv")
y = pd.read_csv("labels.csv").squeeze()

ds = Dataset(X, y)
ds.audit(check_labels=False, check_bias=False)
ds.clean()

model = LogisticRegression(max_iter=1000).fit(ds.X.select_dtypes(include=["number"]), ds.y)
ds.audit(model=model, sensitive_feature="gender", check_duplicates=False, check_noise=False)
ds.fix_labels(model)
ds.report("auradata_report.html")

Internals

  • Isolation Forest for outlier detection
  • Confidence-based disagreement for label validation
  • Group-wise evaluation for bias detection
  • State-aware cleaning with reversible actions
  • Transparent reporting for auditability
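
Group-wise evaluation can be sketched as computing a metric per subgroup and comparing the extremes. The example below uses accuracy and a max-minus-min gap; the function names, the choice of accuracy as the metric, and the toy data are assumptions for illustration rather than a description of AuraData's bias audit.

```python
import pandas as pd


def groupwise_accuracy(y_true, y_pred, groups) -> pd.Series:
    """Accuracy of y_pred against y_true, computed separately per group."""
    frame = pd.DataFrame({
        "correct": [t == p for t, p in zip(y_true, y_pred)],
        "group": list(groups),
    })
    return frame.groupby("group")["correct"].mean()


def accuracy_gap(per_group: pd.Series) -> float:
    """Difference between the best and worst subgroup accuracy."""
    return float(per_group.max() - per_group.min())


# Toy data: predictions are perfect for group F and half right for group M.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

per_group = groupwise_accuracy(y_true, y_pred, groups)
gap = accuracy_gap(per_group)
print(per_group.to_dict(), gap)
```

A large gap is a prompt for human review of the affected subgroup, not proof of bias on its own, since small groups produce noisy estimates.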

Notes

  • Works with numeric and mixed datasets
  • Conservative by default (no blind destructive actions)
  • Designed for ML practitioners and researchers
  • Suitable for responsible and regulated workflows

Author

Abdul Mofique Siddiqui


License

This project is licensed under the MIT License.
