AI Dataset Quality Auditor - Tabular & Image
OmniLint
AI Dataset Quality Auditor — detect, report, and score quality issues in tabular and image datasets before they break your model.
Overview
OmniLint analyzes raw datasets (CSV/Parquet for tabular, COCO/YOLO for images) and surfaces actionable quality issues with a composite Data Quality Score (DQS). It serves as a gatekeeper between raw data and ML pipelines.
Features
Tabular Auditing
- Basic Checks — Missing values, data type mismatches, constant columns, duplicates
- Distribution Analysis — Skewness detection, IQR/Z-score outlier identification
- Label Auditing — Class imbalance detection, rare class warnings
- Leakage Detection — High correlation with target (Pearson/Spearman), categorical leakage proxies (Cramér's V)
- Feature Importance — RF probe model for noise/suspiciously powerful features
- Duplicate Detection — Exact duplicates and near-duplicate (cosine similarity/FAISS) matching
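As an illustration of the outlier checks named under Distribution Analysis, here is a minimal standard-library sketch of IQR and Z-score flagging. The function names, the 1.5 IQR multiplier, and the 3.0 Z-score threshold are assumptions chosen for the example; OmniLint's actual implementation and defaults are not documented here and may differ.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (k is an assumed default)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the (assumed) threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # A constant column has no spread, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical column with one implausible entry.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 400]
print(iqr_outliers(ages))
print(zscore_outliers(ages))
```

The two rules can disagree on small or heavy-tailed columns: a single extreme value inflates the standard deviation and can mask itself under the Z-score rule, which is one reason to report both.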
Image Auditing (Beta)
- Integrity Checks — Corrupt files, resolution outliers, format inconsistency
- Distribution Analysis — Brightness, contrast, color channel imbalance
- Label Checks — Label-file mismatch, class imbalance
- Duplicate Detection — Perceptual hash (pHash) exact duplicates, CLIP near-duplicates
- Anomaly Detection — Blur, exposure issues, blank images
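The blur check can be sketched as the variance of a Laplacian response: sharp images have strong edges and a high variance, while blurry or blank images score near zero. The Tech Stack lists OpenCV's Laplacian for this; the version below is a dependency-free approximation on a nested list of grayscale values, and the function names and the threshold of 100.0 are assumptions for illustration only.

```python
import statistics


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a 2D list of grayscale intensities, at least 3x3 in size.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
            responses.append(lap)
    return statistics.pvariance(responses)


def is_blurry(gray, threshold=100.0):
    """Flag an image as blurry when edge energy falls below an assumed threshold."""
    return laplacian_variance(gray) < threshold


# A uniform image has no edges; a checkerboard is all edges.
flat = [[128] * 8 for _ in range(8)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(8)] for y in range(8)]
print(is_blurry(flat), is_blurry(checker))
```

A fixed threshold rarely transfers between datasets, so in practice it is usually tuned against a sample of images known to be sharp.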
Installation
Base (Tabular Only)
pip install omnilint
With Image Support
pip install omnilint[image]
Development
pip install -e ".[dev,image]"
Quick Start
CLI - Tabular
OmniLint run dataset.csv --target label --output report.html
CLI - Image
OmniLint run dataset/ --mode image --format yolo --output report.json
Python API
# Tabular
from omnilint.core import load, AuditEngine
from omnilint.core.engine import AuditConfig
df, schema = load("dataset.csv")
config = AuditConfig(target_col="label")
engine = AuditEngine(df, config)
result = engine.run()
print(result.quality_score) # e.g., 73.4
# Image (beta); AuditEngine and AuditConfig are reused from the tabular imports above
from omnilint.core.loader import load
image_dataset = load("path/to/coco_data")
image_config = AuditConfig(mode="image")
image_engine = AuditEngine(image_dataset, image_config)
image_result = image_engine.run()
print(image_result.quality_score) # e.g., 85.2
Streamlit UI
streamlit run app/streamlit_app.py
Input Formats
Tabular
- CSV (.csv)
- Parquet (.parquet)
Image
- COCO JSON (annotations in JSON with images/annotations/categories)
- YOLO (folder with data.yaml or a train/images, train/labels structure)
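Before auditing, it can help to confirm that a COCO annotation file has the three top-level sections listed above. The sketch below is not OmniLint's loader; the helper names are hypothetical, and the only format knowledge it relies on is the standard COCO convention that each annotation carries an `image_id` pointing at an entry in `images`.

```python
import json

REQUIRED_COCO_KEYS = ("images", "annotations", "categories")


def missing_coco_keys(annotation_text):
    """Return the required top-level COCO keys absent from a JSON document."""
    data = json.loads(annotation_text)
    if not isinstance(data, dict):
        return list(REQUIRED_COCO_KEYS)
    return [key for key in REQUIRED_COCO_KEYS if key not in data]


def orphan_annotation_ids(annotation_text):
    """Return ids of annotations whose image_id matches no listed image."""
    data = json.loads(annotation_text)
    known_images = {image["id"] for image in data.get("images", [])}
    return [
        ann["id"]
        for ann in data.get("annotations", [])
        if ann.get("image_id") not in known_images
    ]


# Hypothetical annotation file: annotation 11 points at a missing image.
sample = json.dumps({
    "images": [{"id": 1, "file_name": "a.jpg"}],
    "annotations": [{"id": 10, "image_id": 1}, {"id": 11, "image_id": 2}],
    "categories": [{"id": 0, "name": "cat"}],
})
print(missing_coco_keys(sample))
print(orphan_annotation_ids(sample))
```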
Score Bands
| Score | Band | Action |
|---|---|---|
| 0–40 | Critical | Do not train |
| 41–65 | Poor | Major fixes required |
| 66–80 | Fair | Minor fixes recommended |
| 81–100 | Good | Ready for training |
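The band table can be turned into a simple gate for a training pipeline. The helper below is a hypothetical sketch, not part of the OmniLint API. Because the table lists integer ranges while scores such as 73.4 are fractional, it assumes each upper bound is inclusive and that anything above it falls into the next band.

```python
def score_band(score):
    """Map a Data Quality Score (0-100) to its (band, action) pair."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0-100, got {score}")
    if score <= 40:
        return ("Critical", "Do not train")
    if score <= 65:
        return ("Poor", "Major fixes required")
    if score <= 80:
        return ("Fair", "Minor fixes recommended")
    return ("Good", "Ready for training")


def ready_for_training(score):
    """Assumed gate: only the Good band passes."""
    return score_band(score)[0] == "Good"


print(score_band(73.4))
print(ready_for_training(85.2))
```

A stricter or looser gate is a one-line change, for example also admitting the Fair band for exploratory runs.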
Image Mode Weights
| Module | Weight |
|---|---|
| Integrity | 30% |
| Duplicates | 25% |
| Anomalies | 20% |
| Labels | 15% |
| Distribution | 10% |
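One plausible reading of these weights is a weighted mean of per-module scores. The sketch below assumes each module yields a score on the same 0-100 scale as the DQS and that the composite is their weighted sum; how OmniLint actually derives and combines module scores is not stated here, and the function name and dictionary keys are illustrative.

```python
# Weights from the table above, in percent.
IMAGE_WEIGHTS = {
    "integrity": 30,
    "duplicates": 25,
    "anomalies": 20,
    "labels": 15,
    "distribution": 10,
}


def composite_score(module_scores, weights=IMAGE_WEIGHTS):
    """Weighted mean of per-module scores (each assumed to be 0-100)."""
    missing = sorted(set(weights) - set(module_scores))
    if missing:
        raise KeyError(f"missing module scores: {missing}")
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * module_scores[name] for name in weights)
    return weighted / total_weight


# Hypothetical per-module results for one dataset.
scores = {
    "integrity": 90,
    "duplicates": 80,
    "anomalies": 70,
    "labels": 100,
    "distribution": 60,
}
print(composite_score(scores))  # 82.0
```

With this weighting, integrity and duplicate problems dominate: a dataset scoring zero on both cannot exceed 45 overall, which lands it in the Poor band.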
Project Structure
OmniLint/
├── omnilint/
│ ├── core/ # Engine, loader (CSV/COCO/YOLO), scorer
│ ├── tabular/ # Tabular audit modules
│ │ ├── checks/ # basic, distribution, labels, leakage, importance, dedup
│ │ ├── report/ # Report builders
│ │ └── utils/ # Stats, embeddings
│ └── image/ # Image audit modules (beta)
│ ├── checks/ # integrity, distribution, labels, duplicates, anomalies
│ └── utils/ # phash, clip_encoder, pixel_stats
│
├── cli/ # Typer CLI
├── app/ # Streamlit UI
├── app/components/ # Tabular UI components
├── app/components/image/ # Image UI components
└── tests/
├── tabular/ # Tabular tests
└── image/ # Image tests
Tech Stack
Core
- Core: Python 3.11+, Pandas, NumPy, SciPy
- ML: scikit-learn, FAISS, sentence-transformers
- CLI: Typer, Rich
- UI: Streamlit, Plotly
- Reports: Jinja2
Image (Optional)
- I/O: Pillow, OpenCV
- Hashing: imagehash (pHash)
- Embeddings: OpenAI CLIP (ViT-B/32)
- Blur: OpenCV Laplacian
See full release history in CHANGELOG.md
License
MIT
Download files
File details
Details for the file omnilint-0.1.5.tar.gz.
File metadata
- Download URL: omnilint-0.1.5.tar.gz
- Upload date:
- Size: 33.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf0b7794bcbd5a3355b5a8986b7be269621e0bbf7580d87c3d9e5de28d1a6a90 |
| MD5 | 2cbe6c1b67766729481f1ea157f29872 |
| BLAKE2b-256 | 21a2441b7feba643b52656ec89ee70a05bda052c0aea0d55ed8ec30c99a2e25c |
File details
Details for the file omnilint-0.1.5-py3-none-any.whl.
File metadata
- Download URL: omnilint-0.1.5-py3-none-any.whl
- Upload date:
- Size: 46.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 07804e51f2f9b796e7080f81bbdd9c78b061a23f443d34b28f547a68ebec533c |
| MD5 | 1817fca6a5b4a13bc1903fe891b57097 |
| BLAKE2b-256 | 52ea4a6497d5edd7ba24fb4f67f91d1e502cd35fa6203392c5cee349058081b0 |