AuraData: A data-centric auditing and diagnostics engine for machine learning datasets, designed to detect noise, duplicates, label issues, and subgroup risks with transparent reporting.
AuraData — Automated Data Quality Auditing Engine
Author: Abdul Mofique Siddiqui
License: MIT
Install via pip:
pip install auradata
Import it in your Python code:
from auradata import Dataset
Overview
AuraData is a data-centric auditing and diagnostics engine for machine learning datasets.
It automatically inspects datasets to detect:
- Duplicate samples
- Noisy or anomalous records
- Potentially mislabeled samples
- Subgroup performance disparities (bias risks)
AuraData is designed to be transparent, conservative, and human-in-the-loop — it flags risks and provides diagnostics instead of blindly modifying data.
Installation
Install the package via pip:
pip install auradata
How It Works
- Duplicate Detection: Identifies exact row duplicates.
- Noise Detection: Uses Isolation Forest on numeric features to flag outliers.
- Label Issue Detection: Flags samples where the model strongly disagrees with provided labels.
- Bias Audit: Evaluates subgroup performance disparities across sensitive attributes.
- State Tracking: Tracks cleaning and fixing actions safely and reversibly.
- HTML Reporting: Produces structured, readable audit reports.
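To make two of these checks concrete, here is a minimal, standard-library sketch of exact-duplicate detection and confidence-based label flagging. It is not AuraData's implementation: the function names `find_exact_duplicates` and `flag_label_issues`, the `threshold` parameter, and its default of 0.9 are assumptions chosen for illustration.

```python
def find_exact_duplicates(rows):
    """Return indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


def flag_label_issues(probabilities, labels, threshold=0.9):
    """Flag samples whose predicted class differs from the given label
    while the model's confidence in its prediction is at least `threshold`.

    `threshold` is a hypothetical knob, not a documented AuraData parameter.
    """
    flagged = []
    for index, (probs, label) in enumerate(zip(probabilities, labels)):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != label and probs[predicted] >= threshold:
            flagged.append(index)
    return flagged


# Toy data: row 2 repeats row 0.
rows = [(1.0, "a"), (2.0, "b"), (1.0, "a"), (3.0, "c")]
duplicate_indices = find_exact_duplicates(rows)

# Toy class probabilities for three samples, all labeled as class 1.
probabilities = [[0.95, 0.05], [0.40, 0.60], [0.02, 0.98]]
labels = [1, 1, 1]
suspect_indices = flag_label_issues(probabilities, labels)
```

Only the first sample is flagged here, because the model confidently predicts class 0 while the label says 1. Flagging rather than overwriting keeps the decision with a human reviewer.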
Getting Started
1. Import the package
from auradata import Dataset
2. Initialize the dataset
ds = Dataset(X, y)
3. Run an initial audit
ds.audit(check_labels=False, check_bias=False)
4. Clean obvious issues
ds.clean(remove_duplicates=True, remove_noise=True)
5. Train your model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(ds.X.select_dtypes(include=["number"]), ds.y)
6. Run a full audit
ds.audit(model=model, sensitive_feature="gender", check_duplicates=False, check_noise=False)
7. Fix label issues (optional)
ds.fix_labels(model)
8. Generate a report
ds.report("auradata_report.html")
API Reference
Dataset(X, y=None, feature_names=None)
Initializes the dataset.
Parameters:
- X: Feature matrix (array-like or DataFrame)
- y: Labels (optional)
- feature_names: Optional column names
.audit(...)
Audits the dataset for quality issues.
.clean(...)
Removes duplicate and/or noisy samples.
.fix_labels(model)
Replaces labels flagged as likely mislabeled with the model's predictions.
.report(path)
Generates an HTML report summarizing all detected issues.
.restore_original()
Restores the dataset to its original unmodified state.
.summary()
Prints a quick console summary of the dataset state.
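The idea behind reversible cleaning, as exposed through .restore_original(), can be sketched with a snapshot of the original data plus a log of applied actions. This is only an illustration under assumptions: the class `ReversibleRows`, its `drop` method, and its `history` attribute are hypothetical and do not describe AuraData's internal design.

```python
import copy


class ReversibleRows:
    """Hypothetical container that keeps an untouched copy of the data."""

    def __init__(self, rows):
        self._original = copy.deepcopy(rows)  # never modified after init
        self.rows = copy.deepcopy(rows)       # working copy that cleaning edits
        self.history = []                     # log of (reason, dropped indices)

    def drop(self, indices, reason):
        """Remove rows by position from the working copy and log the action."""
        to_drop = set(indices)
        self.rows = [row for i, row in enumerate(self.rows) if i not in to_drop]
        self.history.append((reason, sorted(to_drop)))

    def restore_original(self):
        """Discard all cleaning actions and return to the initial state."""
        self.rows = copy.deepcopy(self._original)
        self.history.clear()


store = ReversibleRows([[1, "a"], [2, "b"], [1, "a"]])
store.drop([2], reason="duplicate")
rows_after_drop = copy.deepcopy(store.rows)
store.restore_original()
```

Keeping a full snapshot is the simplest approach; it trades memory for the guarantee that no cleaning step can make the original data unrecoverable.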
Example Usage
Example 1: Basic Audit
from auradata import Dataset
import pandas as pd
X = pd.read_csv("data.csv")
ds = Dataset(X)
ds.audit()
ds.summary()
Example 2: Audit + Clean + Fix Labels
import pandas as pd
from auradata import Dataset
from sklearn.linear_model import LogisticRegression
X = pd.read_csv("data.csv")
y = pd.read_csv("labels.csv").squeeze()
ds = Dataset(X, y)
ds.audit(check_labels=False, check_bias=False)
ds.clean()
model = LogisticRegression(max_iter=1000).fit(ds.X.select_dtypes(include=["number"]), ds.y)
ds.audit(model=model, sensitive_feature="gender", check_duplicates=False, check_noise=False)
ds.fix_labels(model)
ds.report("auradata_report.html")
Internals
- Isolation Forest for outlier detection
- Confidence-based disagreement for label validation
- Group-wise evaluation for bias detection
- State-aware cleaning with reversible actions
- Transparent reporting for auditability
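As an illustration of group-wise evaluation, the sketch below computes accuracy per subgroup and the gap between the best and worst group. It is an assumption-laden example rather than AuraData's actual metric: the helper names `groupwise_accuracy` and `accuracy_gap`, the choice of accuracy as the metric, and the toy data are all hypothetical.

```python
from collections import defaultdict


def groupwise_accuracy(y_true, y_pred, groups):
    """Return a mapping from each subgroup to the accuracy within it."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best and worst subgroup accuracy."""
    return max(per_group.values()) - min(per_group.values())


# Toy labels, predictions, and a sensitive attribute with two groups.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 1]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

per_group = groupwise_accuracy(y_true, y_pred, groups)
gap = accuracy_gap(per_group)
```

A large gap is a signal to investigate, not proof of bias; small subgroups in particular can produce noisy per-group scores.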
Notes
- Works with numeric and mixed datasets
- Conservative by default (no blind destructive actions)
- Designed for ML practitioners and researchers
- Suitable for responsible and regulated workflows
Author
Abdul Mofique Siddiqui
License
This project is licensed under the MIT License.
File details
Details for the file auradata-1.0.1.tar.gz.
File metadata
- Download URL: auradata-1.0.1.tar.gz
- Upload date:
- Size: 12.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b08850442e038d78c2c52860b4f3b7ffe0b41e85db27c4f437cafc34580c9b86 |
| MD5 | cee8b90604e827e28860da6017574388 |
| BLAKE2b-256 | b966e38d3d35912e4b2793f041f3e4b7c81d73ed6e1eca91961eb3d9b10ddc1a |
File details
Details for the file auradata-1.0.1-py3-none-any.whl.
File metadata
- Download URL: auradata-1.0.1-py3-none-any.whl
- Upload date:
- Size: 10.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38304e28d20317ac1e7a29eb64a4a916bde888964dfe6a2d7fad7e7446e0b839 |
| MD5 | f406d7b1e870dbdbd2a755faaa47a728 |
| BLAKE2b-256 | ed98845e28ffc8bff3d8ae7a591b2f66276b0a4c01f5f8ab0304d3b65732d556 |