
Unified transformer for multi-task tabular data quality


UNIDQ: Unified Data Quality

Requires Python 3.8+. License: MIT.

UNIDQ is a unified transformer-based architecture for multi-task tabular data quality. A single model handles 6 data quality tasks simultaneously, replacing the need for multiple specialized tools.

🎯 Six Tasks, One Model

Task                    Description                   Metric
Error Detection         Identify erroneous cells      F1: 0.894
Data Repair             Correct erroneous values      R²: 0.539
Imputation              Fill missing values           R²: 0.941
Label Noise Detection   Identify mislabeled samples   F1: 0.856
Label Classification    Classify samples              Acc: 0.922
Data Valuation          Score data quality            ρ: 0.336

📦 Installation

pip install unidq

🚀 Quick Start

from unidq import UNIDQ, MultiTaskDataset, UNIDQTrainer

# 1. Prepare your data
dataset = MultiTaskDataset(
    dirty_features=X_dirty,      # Corrupted data
    clean_features=X_clean,      # Ground truth (optional, for training)
    error_mask=errors,           # Binary error mask (optional)
    labels=y_noisy,              # Observed labels
    clean_labels=y_clean,        # True labels (optional)
    noise_mask=noise_mask,       # Label noise mask (optional)
)

# 2. Create and train model
model = UNIDQ(n_features=X_dirty.shape[1])
trainer = UNIDQTrainer(model)
trainer.fit(dataset, epochs=50)

# 3. Make predictions
results = model.predict(X_new)

print(results['error_mask'])      # Detected errors
print(results['repaired'])        # Repaired values
print(results['noise_mask'])      # Detected noisy labels
print(results['quality_scores'])  # Data quality scores
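
The Quick Start assumes that arrays such as X_dirty, errors, and noise_mask already exist. As a minimal sketch of one way to build them for experimentation, the snippet below corrupts a clean NumPy matrix and flips some binary labels. The helper name make_synthetic_inputs, the corruption rates, and the Gaussian perturbation are illustrative assumptions and are not part of UNIDQ.

```python
import numpy as np


def make_synthetic_inputs(n_samples=200, n_features=5, error_rate=0.1,
                          noise_rate=0.1, seed=0):
    """Hypothetical helper (not part of UNIDQ): build toy arrays with the
    same names the Quick Start uses."""
    rng = np.random.default_rng(seed)

    # Clean features, with binary labels derived from the first column.
    X_clean = rng.normal(size=(n_samples, n_features))
    y_clean = (X_clean[:, 0] > 0).astype(np.int64)

    # Cell-level corruption: mark roughly error_rate of cells and perturb them.
    errors = rng.random((n_samples, n_features)) < error_rate
    X_dirty = X_clean.copy()
    X_dirty[errors] += rng.normal(loc=5.0, scale=1.0, size=int(errors.sum()))

    # Label noise: flip roughly noise_rate of the binary labels.
    noise_mask = rng.random(n_samples) < noise_rate
    y_noisy = np.where(noise_mask, 1 - y_clean, y_clean)

    return (X_dirty, X_clean, errors.astype(np.int64),
            y_noisy, y_clean, noise_mask.astype(np.int64))


X_dirty, X_clean, errors, y_noisy, y_clean, noise_mask = make_synthetic_inputs()
X_new = X_dirty[:10]  # stand-in for unseen rows
```

Real corruption is rarely this simple, so treat the toy data only as a way to exercise the API end to end.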

📊 Cross-Validation Example

from unidq import UNIDQ, MultiTaskDataset
from unidq.trainer import cross_validate

# Load your dataset
dataset = MultiTaskDataset(
    dirty_features=X_dirty,
    clean_features=X_clean,
    error_mask=errors,
    labels=y_noisy,
    clean_labels=y_clean,
    noise_mask=noise_mask,
)

# Run 5-fold cross-validation
results = cross_validate(
    model_class=UNIDQ,
    dataset=dataset,
    n_features=X_dirty.shape[1],
    n_folds=5,
    epochs=50,
)

# Results contain mean ± std for all metrics
print(f"Error F1: {results['error_f1']['mean']:.3f} ± {results['error_f1']['std']:.3f}")
print(f"Noise F1: {results['noise_f1']['mean']:.3f} ± {results['noise_f1']['std']:.3f}")
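
The nested mean/std layout of the results can be illustrated without the library. The sketch below collapses a list of per-fold metric dictionaries into that shape. The function summarize_folds and the fold values are hypothetical (the numbers are placeholders, not measured results), and it is an assumption that a population standard deviation is used; the convention cross_validate follows is not stated here.

```python
import statistics


def summarize_folds(fold_metrics):
    """Hypothetical helper: collapse per-fold metric dicts into
    {'metric': {'mean': ..., 'std': ...}}."""
    summary = {}
    for name in fold_metrics[0]:
        values = [fold[name] for fold in fold_metrics]
        summary[name] = {
            "mean": statistics.fmean(values),
            # Population std (ddof=0); an assumption for this sketch.
            "std": statistics.pstdev(values),
        }
    return summary


# Placeholder per-fold scores, for illustration only.
folds = [
    {"error_f1": 0.88, "noise_f1": 0.84},
    {"error_f1": 0.90, "noise_f1": 0.86},
    {"error_f1": 0.89, "noise_f1": 0.85},
]
summary = summarize_folds(folds)
print(f"Error F1: {summary['error_f1']['mean']:.3f} ± {summary['error_f1']['std']:.3f}")
```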

🔧 Model Architecture

UNIDQ uses a shared transformer encoder with task-specific heads:

Input Features → [Value Embed + Z-Score Embed + Pos Embed]
                              ↓
                    Transformer Encoder (3 layers)
                              ↓
              ┌───────────────┼───────────────┐
              ↓               ↓               ↓
        Cell Outputs      CLS Token      CLS Token
              ↓               ↓               ↓
    ┌─────────┼─────────┐     │     ┌─────────┼─────────┐
    ↓         ↓         ↓     ↓     ↓         ↓         ↓
  Error    Repair    Impute Label  Noise   Value
  Head     Head      Head   Head   Head    Head

Model Size: ~495K parameters (default configuration)
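
To make the routing in the diagram concrete, here is a shape-level sketch of how one shared encoder output could feed both cell-level and row-level heads. It uses random NumPy matrices in place of trained weights. Placing the CLS token at index 0, using single linear layers as heads, and the choice of two classes are assumptions made for illustration and are not taken from the UNIDQ source.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_features, d_model, n_classes = 4, 14, 128, 2

# Stand-in for the encoder output: one CLS token plus one token per feature.
# Placing CLS at index 0 is an assumption made for this sketch.
encoded = rng.normal(size=(batch, 1 + n_features, d_model))
cls_token = encoded[:, 0, :]       # (batch, d_model), row-level summary
cell_tokens = encoded[:, 1:, :]    # (batch, n_features, d_model)


def linear_head(x, out_dim):
    """Random linear projection standing in for a trained task head."""
    w = rng.normal(scale=0.02, size=(x.shape[-1], out_dim))
    return x @ w


# Cell-level heads produce one output per cell.
error_logits = linear_head(cell_tokens, 1).squeeze(-1)   # (batch, n_features)
repaired = linear_head(cell_tokens, 1).squeeze(-1)       # (batch, n_features)
imputed = linear_head(cell_tokens, 1).squeeze(-1)        # (batch, n_features)

# Row-level heads read only the CLS token.
label_logits = linear_head(cls_token, n_classes)         # (batch, n_classes)
noise_logits = linear_head(cls_token, 1).squeeze(-1)     # (batch,)
quality_scores = linear_head(cls_token, 1).squeeze(-1)   # (batch,)
```

The point of the sketch is the output shapes: error detection, repair, and imputation are per cell, while label, noise, and valuation outputs are per row.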

📈 Configuration

from unidq import UNIDQ, UNIDQConfig

# Custom configuration
config = UNIDQConfig(
    d_model=256,      # Hidden dimension
    n_heads=8,        # Attention heads
    n_layers=6,       # Transformer layers
    dropout=0.1,
)

model = UNIDQ(n_features=14, config=config)
print(f"Parameters: {model.get_num_parameters():,}")

Preset Configurations

from unidq.config import UNIDQ_SMALL, UNIDQ_BASE, UNIDQ_LARGE

# UNIDQ_SMALL: 64d, 2 heads, 2 layers (~125K params)
# UNIDQ_BASE:  128d, 4 heads, 3 layers (~495K params) [default]
# UNIDQ_LARGE: 256d, 8 heads, 6 layers (~2M params)
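
When choosing a custom d_model and n_heads, note that standard multi-head attention splits the hidden dimension evenly across heads, so d_model should be divisible by n_heads. The sketch below checks this for the preset sizes listed above. The presets dictionary and the head_dim helper are illustrative, and whether UNIDQ performs this validation itself is not stated here.

```python
# Preset sizes copied from the comments above; the dict itself is illustrative.
presets = {
    "UNIDQ_SMALL": {"d_model": 64, "n_heads": 2, "n_layers": 2},
    "UNIDQ_BASE": {"d_model": 128, "n_heads": 4, "n_layers": 3},
    "UNIDQ_LARGE": {"d_model": 256, "n_heads": 8, "n_layers": 6},
}


def head_dim(d_model, n_heads):
    """Per-head width for standard multi-head attention."""
    if d_model % n_heads != 0:
        raise ValueError(f"d_model={d_model} is not divisible by n_heads={n_heads}")
    return d_model // n_heads


for name, cfg in presets.items():
    print(name, head_dim(cfg["d_model"], cfg["n_heads"]))
```

All three presets work out to 32 dimensions per head, which is a reasonable ratio to keep when scaling a custom configuration.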

📋 API Reference

MultiTaskDataset

dataset = MultiTaskDataset(
    dirty_features,    # np.ndarray (n_samples, n_features) - Required
    clean_features,    # np.ndarray - Ground truth features
    error_mask,        # np.ndarray - Binary error indicators
    missing_mask,      # np.ndarray - Binary missing indicators
    labels,            # np.ndarray - Observed labels
    clean_labels,      # np.ndarray - True labels
    noise_mask,        # np.ndarray - Binary noise indicators
)
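
The mask arguments are binary arrays with one entry per cell (or per row for noise_mask). As a hedged sketch, a missing_mask can be derived from a raw matrix when missing entries are encoded as NaN. Both the NaN encoding and the column-mean fill are assumptions: the signature above does not say whether MultiTaskDataset accepts NaN values directly, so the sketch fills them to be cautious.

```python
import numpy as np

# Toy matrix with missing entries encoded as NaN (an assumption for this sketch).
X_raw = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
])

# 1 where a value is missing, 0 elsewhere.
missing_mask = np.isnan(X_raw).astype(np.int64)

# Replace NaNs with the per-column mean of the observed values so the
# feature matrix is fully numeric before it is handed to a model.
col_means = np.nanmean(X_raw, axis=0)
dirty_features = np.where(np.isnan(X_raw), col_means, X_raw)
```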

UNIDQ Model

model = UNIDQ(
    n_features,        # int - Number of input features
    d_model=128,       # int - Hidden dimension
    n_heads=4,         # int - Attention heads
    n_layers=3,        # int - Transformer layers
    dropout=0.1,       # float - Dropout rate
)

# Forward pass
outputs = model(features, z_scores, labels)

# Prediction
results = model.predict(X_new, threshold=0.5)
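
The forward pass takes a z_scores argument and predict takes a threshold, but how UNIDQ computes or applies them internally is not shown here. The sketch below illustrates the two ideas in isolation, assuming column-wise standardization for the z-scores and a sigmoid probability cutoff for the mask. The function names and the logits are hypothetical.

```python
import numpy as np


def column_z_scores(X, eps=1e-8):
    """Standardize each column: (x - column mean) / column std."""
    mean = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, keepdims=True)
    # eps avoids division by zero for constant columns.
    return (X - mean) / (std + eps)


def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))


X = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [100.0, 10.0]])
z = column_z_scores(X)  # the outlier 100.0 gets the largest z-score in column 0

# Turn hypothetical per-cell logits into a binary mask with a 0.5 cutoff.
logits = np.array([[-2.0, 0.1], [3.0, -0.5]])
error_mask = (sigmoid(logits) >= 0.5).astype(np.int64)
```

Raising the threshold above 0.5 trades recall for precision: fewer cells are flagged, but those that are flagged carry higher predicted probability.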

UNIDQTrainer

trainer = UNIDQTrainer(model, device='cuda')

trainer.fit(
    train_dataset,
    val_dataset=None,
    epochs=50,
    batch_size=64,
    learning_rate=5e-4,
    patience=10,
)
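
The patience argument suggests early stopping. Its exact semantics in UNIDQTrainer are not documented here, so the following is only a sketch of the common convention: training halts once the validation loss has failed to reach a new best for patience consecutive epochs. The function name and the loss values are hypothetical.

```python
def epochs_until_stop(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop, assuming it
    halts after `patience` consecutive epochs without a new best loss."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses)


# Placeholder validation losses, for illustration only.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63]
print(epochs_until_stop(losses, patience=3))
```

With these values the best loss is reached at epoch 3, and the three following epochs bring no strict improvement, so the function reports a stop at epoch 6.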

metrics = trainer.evaluate(test_dataset)

🔬 Evaluation Metrics

from unidq import evaluate_all_tasks
from unidq.evaluation import print_evaluation_report

metrics = evaluate_all_tasks(model, dataloader, device)
print_evaluation_report(metrics)

Output:

============================================================
UNIDQ Evaluation Report
============================================================

📌 ERROR DETECTION
   F1 Score:    0.8940
   ROC AUC:     0.9120
   Precision:   0.8650
   Recall:      0.9250

🔧 DATA REPAIR
   R² Score:    0.5390
   MAE:         0.1230

📥 IMPUTATION
   R² Score:    0.9410
   MAE:         0.0540

🏷️ LABEL NOISE DETECTION
   F1 Score:    0.8560
   ROC AUC:     0.8890

🎯 LABEL CLASSIFICATION
   Accuracy:    0.9220
   F1 Score:    0.9150

💰 DATA VALUATION
   Correlation: 0.3360
============================================================
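
The F1 score in the report is the harmonic mean of precision and recall, which can be checked by hand. The short sketch below is written in plain Python and does not use UNIDQ's own evaluation code; the helper names are hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(y_true, y_pred):
    """Precision and recall for binary 0/1 sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Precision and recall from the ERROR DETECTION block of the report above.
print(round(f1_from_precision_recall(0.8650, 0.9250), 4))  # 0.894
```

The result agrees with the reported error detection F1 of 0.8940.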

📄 Citation

If you use UNIDQ in your research, please cite:

@inproceedings{koreddi2026unidq,
  title={UNIDQ: A Unified Transformer for Multi-Task Data Quality},
  author={Koreddi, Shiva},
  booktitle={Proceedings of the VLDB Endowment},
  year={2026}
}

📜 License

MIT License - see LICENSE for details.

📋 Release Methodology

UNIDQ follows Semantic Versioning (MAJOR.MINOR.PATCH):

  • MAJOR (e.g., 1.0.0 → 2.0.0): Breaking API changes
  • MINOR (e.g., 0.1.0 → 0.2.0): New features, backwards compatible
  • PATCH (e.g., 0.2.0 → 0.2.1): Bug fixes, backwards compatible
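
The three bump types above can be told apart mechanically, which is useful when deciding whether an upgrade is safe to take automatically. The sketch below handles only plain MAJOR.MINOR.PATCH strings (no pre-release or build suffixes); the function names are hypothetical and not part of UNIDQ.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_type(old, new):
    """Classify the change between two versions as major, minor, or patch."""
    o, n = parse_version(old), parse_version(new)
    if n <= o:
        raise ValueError(f"{new} is not newer than {old}")
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


# The examples from the list above.
for old, new in [("1.0.0", "2.0.0"), ("0.1.0", "0.2.0"), ("0.2.0", "0.2.1")]:
    print(old, "->", new, bump_type(old, new))
```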

Release Schedule

  • Patch releases: As needed for critical bug fixes
  • Minor releases: Monthly or when significant features are ready
  • Major releases: When necessary for breaking changes

PyTorch Compatibility

UNIDQ supports the latest two major PyTorch releases. We update within 30 days of new PyTorch releases to ensure compatibility.

Current support:

  • PyTorch 1.9.0+
  • PyTorch 2.0.0+
  • PyTorch 2.1.0+

How to Stay Updated

# Upgrade to latest version
pip install --upgrade unidq

# Check your version
python -c "import unidq; print(unidq.__version__)"

See CHANGELOG.md for detailed release notes.

๐Ÿค Contributing

Contributions welcome! Please read our Contributing Guide and Code of Conduct.

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open a Pull Request

For governance details, see GOVERNANCE.md.

📧 Contact
