Training Data Quality Analyzer — analyze labeled text classification datasets for quality issues
title: LabelLens
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app.py
pinned: false
license: mit
Label Lens
Training data quality analyzer for text classification datasets. Upload a CSV with text and label columns, get an automated quality report with actionable recommendations.
Features
- Auto-detect columns — Automatically identifies text and label columns in your CSV
- Class distribution analysis — Imbalance ratio, effective number of classes, long-tail detection, suggested focal loss weights
- Duplicate detection — Exact duplicates and near-duplicates via TF-IDF cosine similarity, with cross-class conflicts flagged as critical
- Label noise scoring — Cross-validated confidence scoring to surface likely mislabels
- Actionable report — Severity ratings (Critical/Warning/Info) with specific recommendations
- Interactive visualizations — Plotly charts for exploring your data
Quick Start
As a web app
pip install "label-lens[app]"
streamlit run app.py
Or with uv:
uv sync
uv run streamlit run app.py
A sample dataset is included for demo purposes.
As a library
pip install label-lens
import pandas as pd
from label_lens import (
    analyze_distribution,
    find_exact_duplicates,
    find_near_duplicates,
    score_label_noise,
    generate_report,
)
df = pd.read_csv("your_dataset.csv") # must have 'text' and 'label' columns
dist = analyze_distribution(df)
dups = find_exact_duplicates(df)
near_dups = find_near_duplicates(df)
noise = score_label_noise(df)
report = generate_report(dist, dups, near_dups, noise)
print(report["overall_severity"]) # "Critical", "Warning", or "Info"
print(report["recommendations"])
If your CSV uses different column names, use prepare_dataframe to standardize them:
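import pandas as pd
from label_lens import prepare_dataframe
raw_df = pd.read_csv("your_dataset.csv")  # e.g. columns named 'content' and 'category'
df = prepare_dataframe(raw_df, text_col="content", label_col="category")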
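For readers who want to see what this kind of standardization amounts to, here is a pandas-only sketch. It is not the library's implementation: `standardize_columns` is a hypothetical helper, and dropping rows with missing values is an assumption made for illustration.

```python
import pandas as pd


def standardize_columns(raw_df, text_col, label_col):
    """Hypothetical stand-in: return a copy with columns named 'text' and 'label'."""
    missing = [c for c in (text_col, label_col) if c not in raw_df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")

    # Keep only the two relevant columns and give them the standard names.
    out = raw_df[[text_col, label_col]].rename(
        columns={text_col: "text", label_col: "label"}
    )

    # Assumption: rows with a missing text or label are not useful downstream.
    return out.dropna(subset=["text", "label"]).reset_index(drop=True)
```

The result has exactly the 'text' and 'label' columns that the library example above expects.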
Installation
Requires Python 3.13+.
# Library only (pandas, numpy, scikit-learn)
pip install label-lens
# With Streamlit app and Plotly charts
pip install "label-lens[app]"
# Development
pip install "label-lens[dev]"
How It Works
Distribution analysis computes the imbalance ratio and an entropy-based effective class count, and identifies long-tail classes (<1% representation). It also calculates inverse-frequency focal loss alpha values.
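The metrics named above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in distribution.py: the function name `distribution_summary`, the max/min definition of the imbalance ratio, the use of exp(entropy) for the effective class count, and normalizing the alpha values to sum to 1 are all choices made here.

```python
import math
from collections import Counter


def distribution_summary(labels, tail_threshold=0.01):
    """Summarize class balance for a list of labels (illustrative only)."""
    counts = Counter(labels)
    total = sum(counts.values())
    freqs = {cls: n / total for cls, n in counts.items()}

    # Assumed definition: largest class size divided by smallest class size.
    imbalance_ratio = max(counts.values()) / min(counts.values())

    # Effective number of classes: exponential of the Shannon entropy (nats).
    entropy = -sum(p * math.log(p) for p in freqs.values())
    effective_classes = math.exp(entropy)

    # Long-tail classes: those below the representation threshold (1% here).
    long_tail = sorted(cls for cls, p in freqs.items() if p < tail_threshold)

    # Inverse-frequency weights, normalized to sum to 1 (an assumption).
    inverse = {cls: 1.0 / p for cls, p in freqs.items()}
    norm = sum(inverse.values())
    alphas = {cls: w / norm for cls, w in inverse.items()}

    return {
        "imbalance_ratio": imbalance_ratio,
        "effective_classes": effective_classes,
        "long_tail": long_tail,
        "alphas": alphas,
    }
```

With a perfectly balanced dataset the effective class count equals the number of classes. It shrinks toward 1 as a single class comes to dominate.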
Duplicate detection finds exact text matches, then uses TF-IDF vectorization with chunked cosine similarity to find near-duplicates. Cross-class duplicates (same text, different labels) are flagged as critical because at least one of the conflicting labels must be wrong.
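A minimal sketch of chunked near-duplicate search with scikit-learn follows. The function name `near_duplicate_pairs`, the 0.9 threshold, and the chunk size are assumptions for illustration. The library's own `find_near_duplicates` may use different defaults and return a different structure.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def near_duplicate_pairs(texts, labels, threshold=0.9, chunk_size=1000):
    """Return (i, j, similarity, cross_class) tuples with i < j."""
    # TfidfVectorizer L2-normalizes rows by default, so the dot product of
    # two rows equals their cosine similarity.
    tfidf = TfidfVectorizer().fit_transform(texts)
    n = tfidf.shape[0]
    pairs = []

    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Compare one chunk of rows against the full matrix so that only a
        # (chunk_size x n) dense block is held in memory at a time.
        sims = (tfidf[start:stop] @ tfidf.T).toarray()
        rows, cols = np.nonzero(sims >= threshold)
        for r, c in zip(rows, cols):
            i, j = start + int(r), int(c)
            if i < j:  # skip self-matches and mirrored pairs
                cross_class = labels[i] != labels[j]
                pairs.append((i, j, float(sims[r, c]), cross_class))

    return pairs
```

Processing in chunks avoids materializing the full n x n similarity matrix, which would otherwise grow quadratically with dataset size.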
Noise scoring trains a logistic regression on TF-IDF features using stratified k-fold cross-validation. For each sample, it records the model's confidence in the given label. Samples in the bottom 5th percentile by confidence are flagged as mislabel suspects.
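The approach described above can be approximated with scikit-learn as follows. This is a sketch, not the contents of noise.py: the function name `label_confidence`, the fold count, the random seed, and the `max_iter` setting are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline


def label_confidence(texts, labels, n_splits=5, percentile=5.0, seed=0):
    """Return (confidence, suspect_mask), one entry per sample."""
    labels = np.asarray(labels)

    # Putting TF-IDF inside the pipeline refits it on each training fold,
    # so held-out samples never influence the vocabulary or IDF weights.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

    # Out-of-fold probabilities: each row comes from a model that did not
    # see that sample during training.
    proba = cross_val_predict(
        model, list(texts), labels, cv=cv, method="predict_proba"
    )

    # Probability columns follow the sorted order of the class labels.
    classes = np.unique(labels)
    given_idx = np.searchsorted(classes, labels)
    confidence = proba[np.arange(len(labels)), given_idx]

    # Flag samples at or below the chosen percentile of confidence.
    cutoff = np.percentile(confidence, percentile)
    return confidence, confidence <= cutoff
```

A low score means a model trained on the rest of the data disagrees with the assigned label. That makes the sample worth a manual review, but it does not prove the label is wrong.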
Project Structure
label_lens/
├── ingest.py # Column detection, validation, DataFrame prep
├── distribution.py # Class distribution analysis + visualization
├── duplicates.py # Exact and near-duplicate detection
├── noise.py # Label noise scoring via cross-validated confidence
├── report.py # Aggregate findings and generate recommendations
└── utils.py # Shared helpers
Development
# Install dev dependencies
uv sync --all-extras
# Run tests
pytest tests/ -v
# Lint and format
ruff check .
ruff format .
Deployment
Label Lens is designed for deployment on Hugging Face Spaces using Docker. The Dockerfile at the repo root handles the build.
Tech Stack
- Python 3.13+
- Streamlit
- pandas / numpy
- scikit-learn (TF-IDF, logistic regression, cross-validation)
- Plotly
License
MIT
Built by Mike Noe