AI Dataset Quality Auditor - Tabular & Image

OmniLint

AI Dataset Quality Auditor — detect, report, and score quality issues in tabular and image datasets before they break your model.

Overview

OmniLint analyzes raw datasets (CSV/Parquet for tabular, COCO/YOLO for images) and surfaces actionable quality issues with a composite Data Quality Score (DQS). It serves as a gatekeeper between raw data and ML pipelines.

Features

Tabular Auditing

  • Basic Checks — Missing values, data type mismatches, constant columns, duplicates
  • Distribution Analysis — Skewness detection, IQR/Z-score outlier identification
  • Label Auditing — Class imbalance detection, rare class warnings
  • Leakage Detection — High correlation with target (Pearson/Spearman), categorical leakage proxies (Cramér's V)
  • Feature Importance — RF probe model for noise/suspiciously powerful features
  • Duplicate Detection — Exact duplicates and near-duplicate (cosine similarity/FAISS) matching
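Two of the checks listed above, IQR outlier identification and Cramér's V, can be sketched with the standard library alone. This is an illustrative sketch, not OmniLint's implementation: the function names `iqr_outliers` and `cramers_v`, the default multiplier `k=1.5`, and the "inclusive" quantile method are assumptions made for the example.

```python
from collections import Counter
from math import sqrt
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # k=1.5 is the conventional Tukey fence; assumed here, not taken from OmniLint.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences (0 = independent, 1 = fully associated)."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    # Pearson chi-square statistic over the full contingency table.
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0
```

A categorical feature whose Cramér's V against the target is close to 1 predicts the target almost perfectly on its own, which is why it can act as a leakage proxy.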

Image Auditing (Beta)

  • Integrity Checks — Corrupt files, resolution outliers, format inconsistency
  • Distribution Analysis — Brightness, contrast, color channel imbalance
  • Label Checks — Label-file mismatch, class imbalance
  • Duplicate Detection — Perceptual hash (pHash) exact duplicates, CLIP near-duplicates
  • Anomaly Detection — Blur, exposure issues, blank images
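Blur detection is commonly done by measuring the variance of the Laplacian: sharp images have strong edges and therefore a high variance, while blurred or blank images score near zero. The pure-Python sketch below illustrates the idea on a 2D grid of grayscale values. It is not OmniLint's code (the Tech Stack lists OpenCV for this), and `laplacian_variance` is a hypothetical name.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels of a 2D grayscale grid.

    `gray` is a list of equal-length rows, at least 3x3 in size.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


# A uniform image has no edges; a checkerboard is all edges.
flat = [[128] * 5 for _ in range(5)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(5)] for y in range(5)]
```

In practice an image is flagged as blurry when this value falls below a threshold chosen for the dataset; no specific threshold is implied here.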

Installation

Base (Tabular Only)

pip install omnilint

With Image Support

pip install omnilint[image]

Development

pip install -e ".[dev,image]"

Quick Start

CLI - Tabular

omnilint run dataset.csv --target label --output report.html

CLI - Image

omnilint run dataset/ --mode image --format yolo --output report.json

Python API

# Tabular
from omnilint.core import load, AuditEngine
from omnilint.core.engine import AuditConfig

df, schema = load("dataset.csv")
config = AuditConfig(target_col="label")
engine = AuditEngine(df, config)
result = engine.run()

print(result.quality_score)  # e.g., 73.4

# Image
from omnilint.core import AuditEngine
from omnilint.core.engine import AuditConfig
from omnilint.core.loader import load

image_dataset = load("path/to/coco_data")
image_config = AuditConfig(mode="image")
image_engine = AuditEngine(image_dataset, image_config)
image_result = image_engine.run()

print(image_result.quality_score)  # e.g., 85.2

Streamlit UI

streamlit run app/streamlit_app.py

Input Formats

Tabular

  • CSV (.csv)
  • Parquet (.parquet)

Image

  • COCO JSON (annotations in JSON with images/annotations/categories)
  • YOLO (folder with data.yaml or train/images/train/labels structure)
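Before auditing a COCO dataset it can help to confirm that the annotation file has the three top-level sections named above and that annotations point at images and categories that exist. The following is a minimal sketch of such a pre-check, assuming the JSON has already been parsed into a dict. `check_coco` is a hypothetical helper, not part of OmniLint's API.

```python
REQUIRED_COCO_KEYS = ("images", "annotations", "categories")


def check_coco(coco):
    """Return a list of human-readable problems found in a parsed COCO dict."""
    problems = [f"missing top-level key: {key}" for key in REQUIRED_COCO_KEYS if key not in coco]
    if problems:
        # Cross-reference checks need all three sections, so stop here.
        return problems
    image_ids = {img["id"] for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    for ann in coco["annotations"]:
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {ann['id']} references unknown image {ann['image_id']}")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']} references unknown category {ann['category_id']}")
    return problems


sample = {
    "images": [{"id": 1, "file_name": "a.jpg"}],
    "categories": [{"id": 10, "name": "cat"}],
    "annotations": [
        {"id": 100, "image_id": 1, "category_id": 10},
        {"id": 101, "image_id": 2, "category_id": 10},  # image 2 does not exist
    ],
}
```

Calling `check_coco(sample)` reports the dangling reference in annotation 101 and nothing else.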

Score Bands

Score     Band        Action
0–40      Critical    Do not train
41–65     Poor        Major fixes required
66–80     Fair        Minor fixes recommended
81–100    Good        Ready for training
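The bands above map directly onto a small lookup, which is handy for gating a training job on the score. This sketch assumes each band's upper bound is inclusive and that a fractional score between two bands (for example 40.5) falls into the higher one; the table does not specify this. `score_band` is a hypothetical helper.

```python
def score_band(score):
    """Map a Data Quality Score in [0, 100] to its (band, action) pair."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # Upper bounds are treated as inclusive (an assumption for this sketch).
    if score <= 40:
        return ("Critical", "Do not train")
    if score <= 65:
        return ("Poor", "Major fixes required")
    if score <= 80:
        return ("Fair", "Minor fixes recommended")
    return ("Good", "Ready for training")
```

With the example scores from the Quick Start, 73.4 lands in "Fair" and 85.2 in "Good".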

Image Mode Weights

Module          Weight
Integrity       30%
Duplicates      25%
Anomalies       20%
Labels          15%
Distribution    10%
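One plausible reading of these weights is a weighted average of per-module sub-scores. The sketch below assumes each module produces a sub-score from 0 to 100 and that the composite is their weighted sum; OmniLint's actual scoring formula may differ. `composite_score` and the lowercase module keys are illustrative names.

```python
# Weights in percent, copied from the table above.
IMAGE_WEIGHTS = {
    "integrity": 30,
    "duplicates": 25,
    "anomalies": 20,
    "labels": 15,
    "distribution": 10,
}


def composite_score(module_scores, weights=IMAGE_WEIGHTS):
    """Weighted average of per-module sub-scores, each assumed to be in [0, 100]."""
    missing = sorted(set(weights) - set(module_scores))
    if missing:
        raise KeyError(f"missing module scores: {missing}")
    return sum(weights[name] * module_scores[name] for name in weights) / 100


example = {
    "integrity": 90,
    "duplicates": 80,
    "anomalies": 70,
    "labels": 60,
    "distribution": 50,
}
```

For `example`, the weighted sum is (2700 + 2000 + 1400 + 900 + 500) / 100 = 75.0. Because integrity carries the largest weight, a dataset with many corrupt files loses more points than one with a skewed brightness distribution.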

Project Structure

OmniLint/
├── omnilint/
│   ├── core/           # Engine, loader (CSV/COCO/YOLO), scorer
│   ├── tabular/        # Tabular audit modules
│   │   ├── checks/     # basic, distribution, labels, leakage, importance, dedup
│   │   ├── report/     # Report builders
│   │   └── utils/      # Stats, embeddings
│   └── image/          # Image audit modules (beta)
│       ├── checks/     # integrity, distribution, labels, duplicates, anomalies
│       └── utils/      # phash, clip_encoder, pixel_stats
│
├── cli/                # Typer CLI
├── app/                # Streamlit UI
├── app/components/     # Tabular UI components
├── app/components/image/ # Image UI components
└── tests/
    ├── tabular/        # Tabular tests
    └── image/          # Image tests

Tech Stack

Core

  • Core: Python 3.11+, Pandas, NumPy, SciPy
  • ML: scikit-learn, FAISS, sentence-transformers
  • CLI: Typer, Rich
  • UI: Streamlit, Plotly
  • Reports: Jinja2

Image (Optional)

  • I/O: Pillow, OpenCV
  • Hashing: imagehash (pHash)
  • Embeddings: OpenAI CLIP (ViT-B/32)
  • Blur: OpenCV Laplacian

See full release history in CHANGELOG.md

License

MIT
