
AuraData: A data-centric auditing and diagnostics engine for machine learning datasets, designed to detect noise, duplicates, label issues, and subgroup risks with transparent reporting.


AuraData — Automated Data Quality Auditing Engine

Author: Abdul Mofique Siddiqui
License: MIT

Install via pip:

pip install auradata

Import it in your Python code:

from auradata import Dataset

Overview

AuraData is a data-centric auditing and diagnostics engine for machine learning datasets.

It automatically inspects datasets to detect:

  • Duplicate samples
  • Noisy or anomalous records
  • Potentially mislabeled samples
  • Subgroup performance disparities (bias risks)

AuraData is designed to be transparent, conservative, and human-in-the-loop: it flags risks and reports diagnostics, and it changes data only when you explicitly ask it to.



How It Works

  • Duplicate Detection: Identifies exact row duplicates.
  • Noise Detection: Uses Isolation Forest on numeric features to flag outliers.
  • Label Issue Detection: Flags samples where the model strongly disagrees with the provided labels.
  • Bias Audit: Evaluates subgroup performance disparities across sensitive attributes.
  • State Tracking: Tracks cleaning and fixing actions safely and reversibly.
  • HTML Reporting: Produces structured, readable audit reports.
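
The first two checks can be illustrated with plain pandas and scikit-learn. This is a minimal sketch of the general technique, not AuraData's internal code: the toy data, the variable names, and the contamination value are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy mixed-type frame: row 2 repeats row 0, and row 7 has an extreme age.
df = pd.DataFrame(
    {
        "age": [25, 31, 25, 40, 38, 29, 33, 500],
        "income": [40.0, 52.0, 40.0, 61.0, 58.0, 45.0, 50.0, 55.0],
        "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    }
)

# Exact row duplicates: every repeat after the first occurrence is flagged.
duplicate_mask = df.duplicated(keep="first")

# Outliers: fit an Isolation Forest on the numeric columns only.
# contamination is an assumed value (about one outlier in eight rows).
numeric = df.select_dtypes(include=["number"])
forest = IsolationForest(contamination=0.125, random_state=0)
outlier_mask = pd.Series(forest.fit_predict(numeric) == -1, index=df.index)

print(df[duplicate_mask | outlier_mask])
```

Both masks only mark rows. Nothing is dropped until the caller decides to act on them, which matches the flag-first approach described in the Overview.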

Getting Started

1. Import the package

from auradata import Dataset

2. Initialize the dataset

ds = Dataset(X, y)

3. Run an initial audit

ds.audit(check_labels=False, check_bias=False)

4. Clean obvious issues

ds.clean(remove_duplicates=True, remove_noise=True)

5. Train your model

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(ds.X.select_dtypes(include=["number"]), ds.y)

6. Run a full audit

ds.audit(model=model, sensitive_feature="gender", check_duplicates=False, check_noise=False)

7. Fix label issues (optional)

ds.fix_labels(model)

8. Generate a report

ds.report("auradata_report.html")

API Reference

Dataset(X, y=None, feature_names=None)

Initializes the dataset.

Parameters:

  • X: Feature matrix (array-like or DataFrame)
  • y: Labels (optional)
  • feature_names: Optional column names

.audit(...)

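Audits the dataset for quality issues. Keyword arguments used in the examples above: model, sensitive_feature, check_duplicates, check_noise, check_labels, check_bias.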


.clean(...)

Removes duplicate samples, noisy samples, or both, controlled by the remove_duplicates and remove_noise arguments.


.fix_labels(model)

Replaces labels flagged as likely mislabeled with the given model's predictions.
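
A confidence-based disagreement check of this kind might look like the sketch below. It is an illustration under stated assumptions, not AuraData's actual rule: the function name flag_label_issues, the 0.9 threshold, and the hand-written probabilities are all hypothetical. In practice the probabilities would come from something like model.predict_proba(X) and the class order from model.classes_.

```python
import numpy as np

def flag_label_issues(proba, classes, y, threshold=0.9):
    """Flag rows where the most probable class differs from the given
    label and its probability is at least `threshold` (assumed value)."""
    proba = np.asarray(proba)
    predicted = np.asarray(classes)[proba.argmax(axis=1)]
    confidence = proba.max(axis=1)
    flags = (predicted != np.asarray(y)) & (confidence >= threshold)
    return flags, predicted

# Hand-written class probabilities for four samples (columns: class 0, class 1).
proba = np.array([[0.97, 0.03], [0.05, 0.95], [0.55, 0.45], [0.02, 0.98]])
classes = np.array([0, 1])
y = np.array([0, 0, 1, 1])

flags, predicted = flag_label_issues(proba, classes, y)

# Replace only the flagged labels; low-confidence disagreements are left alone.
fixed = np.where(flags, predicted, y)
print(flags, fixed)
```

The third sample disagrees with its label but falls below the threshold, so it is left unchanged. That is the conservative part of the design: only strong disagreement triggers a replacement.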


.report(path)

Generates an HTML report summarizing all detected issues.
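
To show roughly what such a report involves, here is a standard-library sketch that renders a dictionary of findings as an HTML table. The function name render_report and the findings structure are hypothetical. AuraData's own report layout is not documented here and is likely richer.

```python
from html import escape

def render_report(findings):
    """Render {check name: flagged row count} as a small HTML page."""
    rows = "\n".join(
        f"<tr><td>{escape(str(name))}</td><td>{int(count)}</td></tr>"
        for name, count in findings.items()
    )
    return (
        "<html><body><h1>Audit summary</h1>\n"
        "<table><tr><th>Check</th><th>Flagged rows</th></tr>\n"
        f"{rows}\n</table></body></html>"
    )

# Check names are escaped so user-supplied text cannot break the markup.
html_text = render_report({"duplicates": 3, "noise <outliers>": 5})
print(html_text)
```

Writing html_text to a file with open(path, "w", encoding="utf-8") would produce a page that opens in any browser.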


.restore_original()

Restores the dataset to its original unmodified state.
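
One simple way to make cleaning reversible is to keep an untouched copy of the data next to the working copy, plus a log of what was done. The class below is a hypothetical sketch of that idea. Its name, attributes, and methods are assumptions and do not describe how Dataset is implemented.

```python
import pandas as pd

class ReversibleFrame:
    """Hypothetical helper: keeps an untouched copy so edits can be undone."""

    def __init__(self, data):
        self._original = data.copy(deep=True)  # never modified
        self.data = data.copy(deep=True)       # working copy
        self.history = []                      # (reason, rows removed)

    def drop_rows(self, mask, reason):
        """Drop the rows where mask is True and record the action."""
        self.history.append((reason, int(mask.sum())))
        self.data = self.data.loc[~mask].copy()

    def restore_original(self):
        """Discard all changes and return to the initial state."""
        self.data = self._original.copy(deep=True)
        self.history.clear()

frame = ReversibleFrame(pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]}))
frame.drop_rows(frame.data.duplicated(), reason="duplicates")
rows_after_clean = len(frame.data)
logged = list(frame.history)

frame.restore_original()
rows_after_restore = len(frame.data)
print(rows_after_clean, logged, rows_after_restore)
```

Holding a full copy doubles memory use. For large datasets, storing only the removed rows and their original positions would be a lighter alternative.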


.summary()

Prints a quick console summary of the dataset state.


Example Usage

Example 1: Basic Audit

from auradata import Dataset
import pandas as pd

X = pd.read_csv("data.csv")
ds = Dataset(X)
ds.audit()
ds.summary()

Example 2: Audit + Clean + Fix Labels

from auradata import Dataset
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.read_csv("data.csv")
y = pd.read_csv("labels.csv").squeeze()

ds = Dataset(X, y)
ds.audit(check_labels=False, check_bias=False)
ds.clean()

numeric_X = ds.X.select_dtypes(include=["number"])
model = LogisticRegression(max_iter=1000).fit(numeric_X, ds.y)
ds.audit(model=model, sensitive_feature="gender", check_duplicates=False, check_noise=False)
ds.fix_labels(model)
ds.report("auradata_report.html")

Internals

  • Isolation Forest for outlier detection
  • Confidence-based disagreement for label validation
  • Group-wise evaluation for bias detection
  • State-aware cleaning with reversible actions
  • Transparent reporting for auditability
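
Group-wise evaluation can be sketched as computing one metric per value of the sensitive attribute and comparing the results. The example below uses accuracy and the largest gap between groups. The function name, the choice of metric, and the toy data are assumptions for illustration, not AuraData's actual audit logic.

```python
import numpy as np
import pandas as pd

def subgroup_accuracy(y_true, y_pred, groups):
    """Return accuracy per group and the largest gap between two groups."""
    frame = pd.DataFrame(
        {
            "correct": np.asarray(y_true) == np.asarray(y_pred),
            "group": np.asarray(groups),
        }
    )
    per_group = frame.groupby("group")["correct"].mean()
    gap = float(per_group.max() - per_group.min())
    return per_group, gap

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 1]
gender = ["f", "f", "f", "f", "m", "m", "m", "m"]

per_group, gap = subgroup_accuracy(y_true, y_pred, gender)
print(per_group)
print("largest accuracy gap:", gap)
```

A large gap is a signal to investigate rather than proof of bias, since small subgroups produce noisy accuracy estimates.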

Notes

  • Works with numeric and mixed-type datasets (noise detection uses the numeric features only)
  • Conservative by default (no blind destructive actions)
  • Designed for ML practitioners and researchers
  • Suitable for responsible and regulated workflows

Author

Abdul Mofique Siddiqui


License

This project is licensed under the MIT License.


