
Unified Transformer for Multi-Task Data Quality

UNIDQ: Unified Data Quality

A unified transformer architecture for multi-task data quality assessment.


What is UNIDQ?

UNIDQ (Unified Data Quality) is a deep learning library that handles multiple data quality tasks with a single transformer model, so you do not need a separate tool for each task.

The Problem

Real-world data is messy. You often deal with:

  • Wrong or corrupted values
  • Missing data
  • Mislabeled samples
  • Low-quality records

Fixing these issues typically requires multiple tools, each with different configurations and APIs.

The Solution

UNIDQ handles 6 data quality tasks with a single model:

| Task | What it does |
| --- | --- |
| Error Detection | Find wrong/dirty values in your data |
| Data Repair | Fix the detected errors |
| Imputation | Fill in missing values |
| Label Noise Detection | Find mislabeled samples |
| Label Classification | Predict labels for your data |
| Data Valuation | Score each sample's quality |

Key Benefits

  • 🎯 Unified - One model handles all six tasks
  • Fast - Trains on 200K records in under 20 minutes (see Performance)
  • 🔧 Simple - Clean, intuitive API
  • 📈 Accurate - Strong benchmark results, e.g. F1 = 0.89 for error detection (see Performance)
  • 🪶 Lightweight - Efficient transformer architecture

Installation

pip install unidq

Requirements: Python 3.8+, PyTorch 1.9+
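
Before installing, it can help to confirm that the environment meets the stated requirements. The snippet below is a minimal sketch using only the standard library; `check_requirements` is a hypothetical helper, not part of UNIDQ, and it only checks that PyTorch is importable, not that it is version 1.9 or newer.

```python
import importlib.util
import sys


def check_requirements(min_python=(3, 8)):
    """Return a dict describing whether the stated requirements look satisfied."""
    return {
        # The README states that Python 3.8+ is required.
        "python_ok": sys.version_info[:2] >= min_python,
        # find_spec returns None when a top-level package cannot be found.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "unidq_installed": importlib.util.find_spec("unidq") is not None,
    }


status = check_requirements()
for name, ok in status.items():
    print(f"{name}: {ok}")
```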


Quick Start

from unidq import UNIDQ

# Initialize model (X is a 2-D array with 10 feature columns, y holds one label per row)
model = UNIDQ(n_features=10)

# Detect errors in your data
errors = model.detect_errors(X)

# Impute missing values
X_filled = model.impute(X)

# Find mislabeled samples
noisy = model.detect_label_noise(X, y)

Usage Guide

1. Error Detection

Find erroneous values in your dataset.

from unidq import UNIDQ

# X_dirty: 2-D feature array to check; train_data: the data to fit on
model = UNIDQ(n_features=X_dirty.shape[1])
model.fit(train_data)

# Detect errors
results = model.detect_errors(X_dirty)

print(f"Found {results['count']} errors")
print(f"Error rate: {results['error_rate']:.2%}")

What you get:

  • predictions: Binary mask (1 = error, 0 = clean)
  • probabilities: Confidence scores for each cell
  • count: Total number of errors found
  • error_rate: Fraction of erroneous cells (0 to 1; the example above formats it as a percentage)
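
To make the relationship between these four fields concrete, here is a minimal sketch that derives a mask, a count, and a rate from per-cell probabilities. It assumes a 0.5 decision threshold and plain Python lists; `summarize_error_probabilities` is a hypothetical helper, and UNIDQ's internal thresholding may differ.

```python
def summarize_error_probabilities(probabilities, threshold=0.5):
    """Derive a mask, count, and rate from per-cell error probabilities.

    `probabilities` is a list of rows; each row is a list of floats in [0, 1].
    The 0.5 default threshold is an assumption made for illustration.
    """
    predictions = [[1 if p >= threshold else 0 for p in row] for row in probabilities]
    count = sum(sum(row) for row in predictions)
    n_cells = sum(len(row) for row in predictions)
    error_rate = count / n_cells if n_cells else 0.0
    return {
        "predictions": predictions,
        "probabilities": probabilities,
        "count": count,
        "error_rate": error_rate,
    }


probs = [
    [0.05, 0.91, 0.10],
    [0.60, 0.02, 0.20],
]
result = summarize_error_probabilities(probs)
print(result["predictions"])          # [[0, 1, 0], [1, 0, 0]]
print(f"{result['error_rate']:.2%}")  # 33.33%
```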

2. Data Repair

Automatically fix detected errors.

# Repair errors
X_repaired = model.repair(X_dirty)

# Or detect and repair in one step
X_clean, report = model.detect_and_repair(X_dirty)

print(f"Repaired {report['repairs_made']} values")

What you get:

  • Cleaned data with errors corrected
  • Report showing what was changed
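
If you want to audit repairs yourself, one option is to diff the data before and after. The sketch below is not UNIDQ's report format: `diff_cells` and the `changes` key are hypothetical, and only `repairs_made` appears in the example above. The hard-coded `X_repaired` stands in for the output of `model.repair(X_dirty)`.

```python
import math


def diff_cells(before, after):
    """List (row, col, old, new) for every cell whose value changed."""
    changes = []
    for i, (row_before, row_after) in enumerate(zip(before, after)):
        for j, (old, new) in enumerate(zip(row_before, row_after)):
            # NaN != NaN in Python, so treat two NaNs as "unchanged".
            both_nan = (
                isinstance(old, float) and isinstance(new, float)
                and math.isnan(old) and math.isnan(new)
            )
            if old != new and not both_nan:
                changes.append((i, j, old, new))
    return changes


X_dirty = [[1.0, 250.0, 3.0], [4.0, 5.0, -999.0]]
X_repaired = [[1.0, 2.5, 3.0], [4.0, 5.0, 6.0]]  # stand-in for model.repair(X_dirty)

changes = diff_cells(X_dirty, X_repaired)
report = {"repairs_made": len(changes), "changes": changes}
print(report["repairs_made"])  # 2
```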

3. Missing Value Imputation

Fill in missing values intelligently.

import numpy as np

# Data with missing values
X_missing = X.copy()
X_missing[0, 2] = np.nan
X_missing[5, 1] = np.nan

# Impute
X_filled = model.impute(X_missing)

print("Missing values filled!")

What you get:

  • Complete data with no missing values
  • Imputation uses learned patterns from your data
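
A quick sanity check after imputation is to confirm that no missing cells remain and that observed cells were left alone. The sketch below uses plain lists and hypothetical helpers (`missing_cells`, `check_imputation`); the assumption that imputation never alters observed values is ours, not a documented UNIDQ guarantee.

```python
import math


def missing_cells(rows):
    """Return (row, col) positions whose value is NaN or None."""
    return [
        (i, j)
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value is None or (isinstance(value, float) and math.isnan(value))
    ]


def check_imputation(before, after):
    """Report which cells were filled and whether observed cells are unchanged."""
    filled = missing_cells(before)
    filled_set = set(filled)
    observed_unchanged = all(
        before[i][j] == after[i][j]
        for i, row in enumerate(before)
        for j, _ in enumerate(row)
        if (i, j) not in filled_set
    )
    return {
        "filled": filled,
        "still_missing": missing_cells(after),
        "observed_unchanged": observed_unchanged,
    }


nan = float("nan")
X_missing = [[1.0, 2.0, nan], [4.0, nan, 6.0]]
X_filled = [[1.0, 2.0, 3.1], [4.0, 4.9, 6.0]]  # stand-in for model.impute(X_missing)

check = check_imputation(X_missing, X_filled)
print(check)
```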

4. Label Noise Detection

Find samples that might be mislabeled.

# Find noisy labels
results = model.detect_label_noise(X, y)

# Get suspicious samples
suspicious_indices = results['flagged_indices']
print(f"Found {len(suspicious_indices)} potentially mislabeled samples")

# Review them
for idx in suspicious_indices[:5]:
    print(f"Sample {idx}: label={y[idx]}, suggested={results['suggested_labels'][idx]}")

What you get:

  • noise_scores: Probability each label is wrong
  • flagged_indices: Samples likely mislabeled
  • suggested_labels: What the correct label might be
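
One plausible way to go from `noise_scores` to a review queue is to threshold the scores and sort the survivors so the most suspicious samples come first. This is a sketch under assumptions: the 0.5 threshold and the helper name `flag_noisy_labels` are illustrative, and UNIDQ may select `flagged_indices` differently.

```python
def flag_noisy_labels(noise_scores, threshold=0.5):
    """Return indices with score >= threshold, most suspicious first."""
    flagged = [i for i, score in enumerate(noise_scores) if score >= threshold]
    return sorted(flagged, key=lambda i: noise_scores[i], reverse=True)


noise_scores = [0.02, 0.81, 0.40, 0.95, 0.55]
flagged_indices = flag_noisy_labels(noise_scores)
print(flagged_indices)  # [3, 1, 4]
```

Raising the threshold trades recall for a shorter review list, which is often the practical constraint when labels are checked by hand.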

5. Data Valuation

Score how useful each sample is for training.

# Get quality scores
scores = model.valuate(X, y)

# Filter to high-quality samples
high_quality = scores > 0.7
X_clean = X[high_quality]
y_clean = y[high_quality]

print(f"Kept {high_quality.sum()}/{len(X)} samples")

What you get:

  • Score between 0 and 1 for each sample
  • Higher = better quality, more useful for training
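
A fixed cutoff such as 0.7 can discard too much or too little depending on how the scores are distributed. An alternative is to keep a fixed fraction of the best-scoring samples. The helper below, `keep_top_fraction`, is a hypothetical sketch written with plain lists; it is not part of UNIDQ.

```python
def keep_top_fraction(scores, fraction):
    """Return a boolean mask that keeps the highest-scoring `fraction` of samples."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, round(len(scores) * fraction)) if scores else 0
    # Rank sample indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(scores))]


scores = [0.92, 0.35, 0.71, 0.10, 0.88]
mask = keep_top_fraction(scores, 0.6)
print(mask)  # [True, False, True, False, True]
```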

6. Full Data Quality Assessment

Run everything at once.

# Complete assessment
report = model.assess_quality(X, y)

print("=== Data Quality Report ===")
print(f"Errors: {report['errors']['count']}")
print(f"Missing: {report['missing']['count']}")
print(f"Noisy labels: {report['noise']['count']}")
print(f"Average quality: {report['quality']['mean']:.2f}")

# Get cleaned data
X_clean = report['cleaned_data']
y_clean = report['cleaned_labels']
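
In a pipeline, a report like this can drive a simple pass/fail gate. The sketch below assumes only the report keys used in the example above; the function `quality_gate`, its thresholds, and the sample numbers are hypothetical and should be tuned to your data.

```python
def quality_gate(report, n_cells, n_samples,
                 max_error_rate=0.05, max_noise_rate=0.10, min_quality=0.70):
    """Return (passed, reasons) for a report with errors/noise/quality entries."""
    reasons = []
    error_rate = report["errors"]["count"] / n_cells
    noise_rate = report["noise"]["count"] / n_samples
    mean_quality = report["quality"]["mean"]
    if error_rate > max_error_rate:
        reasons.append(f"cell error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    if noise_rate > max_noise_rate:
        reasons.append(f"label noise rate {noise_rate:.1%} exceeds {max_noise_rate:.1%}")
    if mean_quality < min_quality:
        reasons.append(f"mean quality {mean_quality:.2f} is below {min_quality:.2f}")
    return not reasons, reasons


# Made-up numbers standing in for model.assess_quality(X, y).
sample_report = {
    "errors": {"count": 12},
    "missing": {"count": 4},
    "noise": {"count": 30},
    "quality": {"mean": 0.82},
}
passed, reasons = quality_gate(sample_report, n_cells=1000, n_samples=100)
print(passed, reasons)
```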

Training Your Own Model

Basic Training

from unidq import UNIDQ, MultiTaskDataset, UNIDQTrainer

# Prepare dataset
dataset = MultiTaskDataset(
    dirty_features=X_dirty,
    clean_features=X_clean,
    error_mask=errors,
    labels=y
)

# Create and train model
model = UNIDQ(n_features=X.shape[1])
trainer = UNIDQTrainer(model)
trainer.fit(dataset, epochs=50)

# Save for later
model.save('my_model.pt')
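
Training needs aligned dirty data, clean data, and an error mask. If you only have clean data, one common approach is to inject synthetic errors and record where they went. The sketch below is an assumption about how such triples could be produced, not UNIDQ's own corruption procedure; `inject_errors` is a hypothetical helper and uses plain lists, so convert the results to arrays before passing them to `MultiTaskDataset`.

```python
import random


def inject_errors(clean_rows, error_rate=0.1, scale=10.0, seed=0):
    """Corrupt a random subset of cells; return (dirty_rows, error_mask)."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    dirty_rows, error_mask = [], []
    for row in clean_rows:
        dirty_row, mask_row = [], []
        for value in row:
            if rng.random() < error_rate:
                # Shift by an offset of magnitude in [scale, 2 * scale) so the
                # corrupted cell always differs from the clean one.
                offset = rng.choice([-1, 1]) * scale * (1 + rng.random())
                dirty_row.append(value + offset)
                mask_row.append(1)
            else:
                dirty_row.append(value)
                mask_row.append(0)
        dirty_rows.append(dirty_row)
        error_mask.append(mask_row)
    return dirty_rows, error_mask


clean = [[float(i + j) for j in range(4)] for i in range(50)]
dirty, mask = inject_errors(clean, error_rate=0.2)
n_errors = sum(map(sum, mask))
print(f"Injected {n_errors} errors into {50 * 4} cells")
```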

Loading a Saved Model

model = UNIDQ(n_features=10)
model.load('my_model.pt')

# Ready to use
results = model.detect_errors(new_data)

Working with DataFrames

UNIDQ accepts pandas DataFrames directly.

import pandas as pd
from unidq import UNIDQ

# Load your data
df = pd.read_csv('my_data.csv')

# Create model from DataFrame
model = UNIDQ.from_dataframe(df)

# Detect errors
errors = model.detect_errors(df)

# Get cleaned DataFrame
df_clean = model.clean(df)
df_clean.to_csv('cleaned_data.csv', index=False)

Configuration Options

Model Settings

model = UNIDQ(
    n_features=20,        # Number of features in your data
    d_model=128,          # Model dimension (default: 128)
    n_layers=3,           # Number of transformer layers (default: 3)
    dropout=0.1           # Dropout rate (default: 0.1)
)

Training Settings

trainer = UNIDQTrainer(
    model,
    batch_size=64,        # Batch size (default: 64)
    learning_rate=1e-3,   # Learning rate (default: 1e-3)
    early_stopping=10     # Stop if no improvement for N epochs
)
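
The `early_stopping` setting is described as stopping when there is no improvement for N epochs. A typical patience rule looks like the sketch below; it assumes validation loss is the monitored quantity, and the helper `epochs_until_stop` and its `min_delta` parameter are hypothetical rather than taken from UNIDQTrainer.

```python
def epochs_until_stop(val_losses, patience=10, min_delta=0.0):
    """Return the 1-based epoch at which a patience rule would stop training.

    Training stops once `patience` consecutive epochs fail to beat the best
    loss seen so far by more than `min_delta`. If that never happens, the
    full number of epochs is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.64, 0.65, 0.66, 0.66]
print(epochs_until_stop(losses, patience=3))  # 9
```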

Example: Cleaning Customer Data

import pandas as pd
from unidq import UNIDQ

# Load messy customer data
df = pd.read_csv('customers.csv')
print(f"Loaded {len(df)} records")

# Initialize UNIDQ
model = UNIDQ.from_dataframe(df)

# Run full assessment
report = model.assess_quality(df)

# Print findings
print("\n📊 Data Quality Report")
print(f"  • Errors found: {report['errors']['count']}")
print(f"  • Missing values: {report['missing']['count']}")
print(f"  • Suspicious labels: {report['noise']['count']}")
print(f"  • Overall quality: {report['quality']['mean']:.0%}")

# Save cleaned data
df_clean = report['cleaned_data']
df_clean.to_csv('customers_clean.csv', index=False)
print("\n✅ Saved cleaned data!")

Performance

Scalability

UNIDQ efficiently handles datasets of various sizes:

| Dataset Size (records) | Training Time |
| --- | --- |
| 10,000 | ~2 min |
| 50,000 | ~11 min |
| 100,000 | ~9 min |
| 200,000 | ~18 min |

Accuracy

Reported results for four of the six tasks:

| Task | Performance |
| --- | --- |
| Error Detection | F1 = 0.89 |
| Imputation | R² = 0.94 |
| Label Noise Detection | F1 = 0.86 |
| Label Classification | Accuracy = 0.98 |
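
For readers who want to compute the same kinds of metrics on their own data, the sketch below gives the standard definitions of F1 (for binary masks) and R² (for imputed values) in plain Python. The helpers `f1_score` and `r2_score` are written here for illustration; the evaluation protocol behind the numbers above is not described in this document.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels where 1 is the positive class (e.g. 'error')."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


f1 = f1_score([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
r2 = r2_score([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(round(f1, 3))  # 0.667
print(round(r2, 3))  # 0.98
```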

FAQ

Q: Do I need labeled data to use UNIDQ?

For error detection and imputation, you can use UNIDQ in unsupervised mode. For best results on all tasks, providing some labeled examples helps.

Q: How much data do I need?

UNIDQ works with datasets as small as 1,000 records. Performance improves with more data.

Q: Can I use UNIDQ with categorical data?

Yes! UNIDQ automatically handles both numerical and categorical features.

Q: How do I choose which tasks to run?

Use assess_quality() to run everything, or call individual methods like detect_errors() or impute() for specific tasks.

Q: Can I use a pre-trained model?

Yes, you can save and load models using model.save() and model.load().


Citation

If you use UNIDQ in your research, please cite:

@inproceedings{unidq2026,
  title={UNIDQ: A Unified Transformer Architecture for Multi-Task Data Quality},
  author={Koreddi, Shiva and Sowrupilli, Sravani},
  booktitle={Proceedings of the VLDB Endowment},
  year={2026}
}

License

MIT License - free for personal and commercial use.


One model. Six tasks. Clean data.
