Unified transformer for multi-task tabular data quality
UNIDQ: Unified Data Quality
UNIDQ is a unified transformer-based architecture for multi-task tabular data quality. A single model handles 6 data quality tasks simultaneously, replacing the need for multiple specialized tools.
Six Tasks, One Model
| Task | Description | Metric |
|---|---|---|
| Error Detection | Identify erroneous cells | F1: 0.894 |
| Data Repair | Correct erroneous values | R²: 0.539 |
| Imputation | Fill missing values | R²: 0.941 |
| Label Noise Detection | Identify mislabeled samples | F1: 0.856 |
| Label Classification | Classify samples | Acc: 0.922 |
| Data Valuation | Score data quality | ρ: 0.336 |
Installation
pip install unidq
Quick Start
from unidq import UNIDQ, MultiTaskDataset, UNIDQTrainer
# 1. Prepare your data
dataset = MultiTaskDataset(
    dirty_features=X_dirty,    # Corrupted data
    clean_features=X_clean,    # Ground truth (optional, for training)
    error_mask=errors,         # Binary error mask (optional)
    labels=y_noisy,            # Observed labels
    clean_labels=y_clean,      # True labels (optional)
    noise_mask=noise_mask,     # Label noise mask (optional)
)
# 2. Create and train model
model = UNIDQ(n_features=X_dirty.shape[1])
trainer = UNIDQTrainer(model)
trainer.fit(dataset, epochs=50)
# 3. Make predictions
results = model.predict(X_new)
print(results['error_mask']) # Detected errors
print(results['repaired']) # Repaired values
print(results['noise_mask']) # Detected noisy labels
print(results['quality_scores']) # Data quality scores
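The Quick Start assumes that `X_dirty`, `X_clean`, `errors`, `y_noisy`, `y_clean`, and `noise_mask` already exist. For experimentation, one way to build them is to corrupt clean synthetic data. The sketch below uses only NumPy. The helper name `make_synthetic_dq_data`, the Gaussian outlier corruption, and the binary label flipping are illustrative assumptions, not part of the unidq package.

```python
import numpy as np

def make_synthetic_dq_data(n_samples=200, n_features=5, error_rate=0.1,
                           noise_rate=0.1, seed=0):
    """Hypothetical helper: build the six arrays the Quick Start expects."""
    rng = np.random.default_rng(seed)
    X_clean = rng.normal(size=(n_samples, n_features))
    y_clean = (X_clean[:, 0] > 0).astype(int)  # simple binary labels

    # Corrupt a random subset of cells with large additive noise.
    errors = rng.random((n_samples, n_features)) < error_rate
    X_dirty = X_clean.copy()
    X_dirty[errors] += rng.normal(scale=5.0, size=errors.sum())

    # Flip a random subset of the binary labels.
    noise_mask = rng.random(n_samples) < noise_rate
    y_noisy = np.where(noise_mask, 1 - y_clean, y_clean)

    return (X_dirty, X_clean, errors.astype(int),
            y_noisy, y_clean, noise_mask.astype(int))

X_dirty, X_clean, errors, y_noisy, y_clean, noise_mask = make_synthetic_dq_data()
```

Because the ground truth is known by construction, the same arrays can be used to check detected errors and noisy labels against the injected ones.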
Cross-Validation Example
from unidq import UNIDQ, MultiTaskDataset
from unidq.trainer import cross_validate
# Load your dataset
dataset = MultiTaskDataset(
    dirty_features=X_dirty,
    clean_features=X_clean,
    error_mask=errors,
    labels=y_noisy,
    clean_labels=y_clean,
    noise_mask=noise_mask,
)
# Run 5-fold cross-validation
results = cross_validate(
    model_class=UNIDQ,
    dataset=dataset,
    n_features=X_dirty.shape[1],
    n_folds=5,
    epochs=50,
)
# Results contain mean ± std for all metrics
print(f"Error F1: {results['error_f1']['mean']:.3f} ± {results['error_f1']['std']:.3f}")
print(f"Noise F1: {results['noise_f1']['mean']:.3f} ± {results['noise_f1']['std']:.3f}")
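To make the shape of the cross-validation result concrete, the following sketch shows how per-fold metrics could be collapsed into the `{'mean': ..., 'std': ...}` structure read above. `summarize_folds` is a hypothetical helper and the fold values are made up for illustration. Whether unidq reports the population or the sample standard deviation is not stated, so the population form used here is an assumption.

```python
from statistics import mean, pstdev

def summarize_folds(fold_metrics):
    """Collapse a list of per-fold metric dicts into {name: {'mean', 'std'}}."""
    names = fold_metrics[0].keys()
    return {
        name: {
            "mean": mean(m[name] for m in fold_metrics),
            "std": pstdev(m[name] for m in fold_metrics),  # population std (assumed)
        }
        for name in names
    }

# Made-up numbers for two folds, for illustration only.
folds = [
    {"error_f1": 0.90, "noise_f1": 0.84},
    {"error_f1": 0.88, "noise_f1": 0.86},
]
summary = summarize_folds(folds)
print(f"Error F1: {summary['error_f1']['mean']:.3f} ± {summary['error_f1']['std']:.3f}")
```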
Model Architecture
UNIDQ uses a shared transformer encoder with task-specific heads:
Input Features → [Value Embed + Z-Score Embed + Pos Embed]
                              ↓
               Transformer Encoder (3 layers)
                              ↓
           ┌──────────────────┼──────────────────┐
           ↓                  ↓                  ↓
     Cell Outputs         CLS Token          CLS Token
           ↓                  ↓                  ↓
  ┌────────┼────────┐         │            ┌─────┴─────┐
  ↓        ↓        ↓         ↓            ↓           ↓
Error    Repair   Impute    Label        Noise       Value
Head     Head     Head      Head         Head        Head
Model Size: ~495K parameters (default configuration)
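The layout above can be sketched in PyTorch as a shared encoder whose per-cell outputs feed the error, repair, and imputation heads and whose CLS output feeds the label, noise, and valuation heads. This is an illustrative reconstruction, not the UNIDQ source. The class name `TinyMultiTaskEncoder`, the single shared CLS token, the linear heads, the output keys, and the default feed-forward width are all assumptions, so the parameter count will not match the figure quoted above.

```python
import torch
import torch.nn as nn

class TinyMultiTaskEncoder(nn.Module):
    """Illustrative shared encoder with six heads; NOT the UNIDQ implementation."""

    def __init__(self, n_features, d_model=128, n_heads=4, n_layers=3, n_classes=2):
        super().__init__()
        self.value_embed = nn.Linear(1, d_model)    # embeds each cell value
        self.zscore_embed = nn.Linear(1, d_model)   # embeds each cell z-score
        self.pos_embed = nn.Parameter(torch.zeros(n_features + 1, d_model))
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Cell-level heads read one token per feature.
        self.error_head = nn.Linear(d_model, 1)
        self.repair_head = nn.Linear(d_model, 1)
        self.impute_head = nn.Linear(d_model, 1)
        # Row-level heads read the CLS token.
        self.label_head = nn.Linear(d_model, n_classes)
        self.noise_head = nn.Linear(d_model, 1)
        self.value_head = nn.Linear(d_model, 1)

    def forward(self, features, z_scores):
        # features, z_scores: (batch, n_features)
        tokens = (self.value_embed(features.unsqueeze(-1))
                  + self.zscore_embed(z_scores.unsqueeze(-1)))
        cls = self.cls.expand(features.size(0), -1, -1)
        hidden = self.encoder(torch.cat([cls, tokens], dim=1) + self.pos_embed)
        cls_out, cells = hidden[:, 0], hidden[:, 1:]
        return {
            "error_logits": self.error_head(cells).squeeze(-1),
            "repaired": self.repair_head(cells).squeeze(-1),
            "imputed": self.impute_head(cells).squeeze(-1),
            "label_logits": self.label_head(cls_out),
            "noise_logit": self.noise_head(cls_out).squeeze(-1),
            "quality_score": self.value_head(cls_out).squeeze(-1),
        }

sketch = TinyMultiTaskEncoder(n_features=5, d_model=32, n_heads=4, n_layers=1)
outputs = sketch(torch.randn(3, 5), torch.randn(3, 5))
```

Sharing one encoder means every task trains the same cell representations, which is the motivation for a unified model over six separate tools.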
Configuration
from unidq import UNIDQ, UNIDQConfig
# Custom configuration
config = UNIDQConfig(
    d_model=256,    # Hidden dimension
    n_heads=8,      # Attention heads
    n_layers=6,     # Transformer layers
    dropout=0.1,
)
model = UNIDQ(n_features=14, config=config)
print(f"Parameters: {model.get_num_parameters():,}")
Preset Configurations
from unidq.config import UNIDQ_SMALL, UNIDQ_BASE, UNIDQ_LARGE
# UNIDQ_SMALL: 64d, 2 heads, 2 layers (~125K params)
# UNIDQ_BASE: 128d, 4 heads, 3 layers (~495K params) [default]
# UNIDQ_LARGE: 256d, 8 heads, 6 layers (~2M params)
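When choosing custom values, note that multi-head attention splits the hidden dimension evenly across heads, and standard implementations such as `torch.nn.MultiheadAttention` require `d_model` to be divisible by `n_heads`. The sketch below checks this for the three presets. `PRESETS` and `head_dim` are hypothetical names used only for this example; the numbers are copied from the preset comments.

```python
# (d_model, n_heads, n_layers) copied from the preset comments.
PRESETS = {
    "UNIDQ_SMALL": (64, 2, 2),
    "UNIDQ_BASE": (128, 4, 3),
    "UNIDQ_LARGE": (256, 8, 6),
}

def head_dim(d_model, n_heads):
    """Return the per-head width, or raise if the split is uneven."""
    if d_model % n_heads != 0:
        raise ValueError(f"d_model={d_model} is not divisible by n_heads={n_heads}")
    return d_model // n_heads

for name, (d_model, n_heads, _) in PRESETS.items():
    print(name, head_dim(d_model, n_heads))
```

All three presets keep the same per-head width, so scaling up adds heads and layers rather than widening each head.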
API Reference
MultiTaskDataset
dataset = MultiTaskDataset(
    dirty_features,    # np.ndarray (n_samples, n_features) - Required
    clean_features,    # np.ndarray - Ground truth features
    error_mask,        # np.ndarray - Binary error indicators
    missing_mask,      # np.ndarray - Binary missing indicators
    labels,            # np.ndarray - Observed labels
    clean_labels,      # np.ndarray - True labels
    noise_mask,        # np.ndarray - Binary noise indicators
)
UNIDQ Model
model = UNIDQ(
    n_features,     # int - Number of input features
    d_model=128,    # int - Hidden dimension
    n_heads=4,      # int - Attention heads
    n_layers=3,     # int - Transformer layers
    dropout=0.1,    # float - Dropout rate
)
# Forward pass
outputs = model(features, z_scores, labels)
# Prediction
results = model.predict(X_new, threshold=0.5)
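The forward pass takes a `z_scores` argument and `predict` takes a `threshold`. The README does not say how either is computed internally, so the following is only a plausible sketch: per-column standardization with NaN marking missing cells, and a `>=` comparison to binarize probabilities. Both function names are hypothetical.

```python
import numpy as np

def column_z_scores(X, eps=1e-8):
    """Standardize each column, ignoring NaNs that mark missing cells."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    z = (X - mean) / (std + eps)
    return np.nan_to_num(z, nan=0.0)  # missing cells get a neutral z-score

def apply_threshold(probabilities, threshold=0.5):
    """Turn probabilities into a binary mask (1 where p >= threshold)."""
    return (np.asarray(probabilities) >= threshold).astype(int)

X_example = np.array([[1.0, 10.0],
                      [3.0, np.nan],
                      [5.0, 30.0]])
z_example = column_z_scores(X_example)
mask_example = apply_threshold([0.2, 0.5, 0.9])
```

Raising the threshold trades recall for precision in the detected error and noise masks.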
UNIDQTrainer
trainer = UNIDQTrainer(model, device='cuda')
trainer.fit(
    train_dataset,
    val_dataset=None,
    epochs=50,
    batch_size=64,
    learning_rate=5e-4,
    patience=10,
)
metrics = trainer.evaluate(test_dataset)
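The `patience` argument suggests early stopping on a validation metric. As a rough illustration of that idea, and assuming the common rule of stopping after `patience` consecutive epochs without a new best validation loss, the logic can be sketched as follows. `early_stopping_epoch` is a hypothetical helper; the trainer's actual criterion may differ.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop.

    Stops once `patience` consecutive epochs fail to improve on the best
    loss seen so far; returns len(val_losses) if that never happens.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)

# The loss improves once, then stalls for two epochs, so training stops at epoch 3.
stop_at = early_stopping_epoch([0.5, 0.6, 0.7, 0.4], patience=2)
```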
Evaluation Metrics
from unidq import evaluate_all_tasks
from unidq.evaluation import print_evaluation_report
metrics = evaluate_all_tasks(model, dataloader, device)
print_evaluation_report(metrics)
Output:
============================================================
UNIDQ Evaluation Report
============================================================
ERROR DETECTION
F1 Score: 0.8940
ROC AUC: 0.9120
Precision: 0.8650
Recall: 0.9250
DATA REPAIR
R² Score: 0.5390
MAE: 0.1230
IMPUTATION
R² Score: 0.9410
MAE: 0.0540
LABEL NOISE DETECTION
F1 Score: 0.8560
ROC AUC: 0.8890
LABEL CLASSIFICATION
Accuracy: 0.9220
F1 Score: 0.9150
DATA VALUATION
Correlation: 0.3360
============================================================
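For readers who want to sanity-check a report like this, the two most common metrics are easy to compute by hand. The sketch below uses the textbook definitions of F1 (harmonic mean of precision and recall) and R² (one minus the ratio of residual to total sum of squares). The function names are illustrative and are not the unidq evaluation API.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Precision and recall from the error detection section of the report.
print(round(f1_from_precision_recall(0.8650, 0.9250), 4))  # 0.894
```

The reported error detection precision and recall reproduce the reported F1, which is a quick consistency check worth running on any new evaluation.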
Citation
If you use UNIDQ in your research, please cite:
@inproceedings{koreddi2026unidq,
  title={UNIDQ: A Unified Transformer for Multi-Task Data Quality},
  author={Koreddi, Shiva},
  booktitle={Proceedings of the VLDB Endowment},
  year={2026}
}
License
MIT License - see LICENSE for details.
Release Methodology
UNIDQ follows Semantic Versioning (MAJOR.MINOR.PATCH):
- MAJOR (e.g., 1.0.0 → 2.0.0): Breaking API changes
- MINOR (e.g., 0.1.0 → 0.2.0): New features, backwards compatible
- PATCH (e.g., 0.2.0 → 0.2.1): Bug fixes, backwards compatible
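The versioning rules above can be expressed as a small check that classifies the change between two releases, which is handy when deciding whether an upgrade is safe to take automatically. This is a minimal sketch that assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffixes. `parse_version` and `bump_type` are hypothetical helpers.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def bump_type(old, new):
    """Classify the change between two versions as major, minor, or patch."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    if new_v[0] != old_v[0]:
        return "major"   # breaking API changes
    if new_v[1] != old_v[1]:
        return "minor"   # new features, backwards compatible
    return "patch"       # bug fixes, backwards compatible

print(bump_type("0.2.0", "0.2.1"))
```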
Release Schedule
- Patch releases: As needed for critical bug fixes
- Minor releases: Monthly or when significant features are ready
- Major releases: When necessary for breaking changes
PyTorch Compatibility
UNIDQ supports the latest two major PyTorch releases. We update within 30 days of new PyTorch releases to ensure compatibility.
Current support:
- PyTorch 1.9.0+
- PyTorch 2.0.0+
- PyTorch 2.1.0+
How to Stay Updated
# Upgrade to latest version
pip install --upgrade unidq
# Check your version
python -c "import unidq; print(unidq.__version__)"
See CHANGELOG.md for detailed release notes.
Contributing
Contributions welcome! Please read our Contributing Guide and Code of Conduct.
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit changes: `git commit -m 'Add amazing feature'`
- Push to branch: `git push origin feature/amazing-feature`
- Open a Pull Request
For governance details, see GOVERNANCE.md.
Contact
- Authors: Shiva Koreddi, Sravani Sowrupilli
- GitHub: @Shivakoreddi
- Issues: GitHub Issues
- Email: shivacse14@gmail.com, sravani.sowrupilli@gmail.com