Training Data Quality Analyzer — analyze labeled text classification datasets for quality issues

Label Lens

Training data quality analyzer for text classification datasets. Upload a CSV with text and label columns and get an automated quality report with actionable recommendations.

Features

  • Auto-detect columns — Automatically identifies text and label columns in your CSV
  • Class distribution analysis — Imbalance ratio, effective number of classes, long-tail detection, suggested focal loss weights
  • Duplicate detection — Exact duplicates and near-duplicates via TF-IDF cosine similarity, with cross-class conflicts flagged as critical
  • Label noise scoring — Cross-validated confidence scoring to surface likely mislabels
  • Actionable report — Severity ratings (Critical/Warning/Info) with specific recommendations
  • Interactive visualizations — Plotly charts for exploring your data

Quick Start

As a web app

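# Quote the extra so shells such as zsh do not expand the brackets
pip install "label-lens[app]"
# app.py must be in the current directory (it sits at the repository root)
streamlit run app.py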

Or with uv:

uv sync
uv run streamlit run app.py

A sample dataset is included for demo purposes.

As a library

pip install label-lens

import pandas as pd
from label_lens import (
    analyze_distribution,
    find_exact_duplicates,
    find_near_duplicates,
    score_label_noise,
    generate_report,
)

df = pd.read_csv("your_dataset.csv")  # must have 'text' and 'label' columns

dist = analyze_distribution(df)
dups = find_exact_duplicates(df)
near_dups = find_near_duplicates(df)
noise = score_label_noise(df)

report = generate_report(dist, dups, near_dups, noise)
print(report["overall_severity"])  # "Critical", "Warning", or "Info"
print(report["recommendations"])

If your CSV uses different column names, use prepare_dataframe to standardize them:

from label_lens import prepare_dataframe

df = prepare_dataframe(raw_df, text_col="content", label_col="category")

Installation

Requires Python 3.13+.

# Library only (pandas, numpy, scikit-learn)
pip install label-lens

# With Streamlit app and Plotly charts (quoted so the shell does not expand the brackets)
pip install "label-lens[app]"

# Development
pip install "label-lens[dev]"

How It Works

Distribution analysis computes the imbalance ratio and an entropy-based effective class count, and identifies long-tail classes (those that make up less than 1% of the samples). It also calculates inverse-frequency alpha weights for focal loss.
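The metrics above can be sketched with pandas and numpy. This is an illustration only, not the label_lens implementation: the function name distribution_sketch, the max/min definition of the imbalance ratio, the use of exp(entropy) as the effective class count, and the normalization of the alpha weights are all assumptions.

```python
import numpy as np
import pandas as pd


def distribution_sketch(labels: pd.Series, tail_threshold: float = 0.01) -> dict:
    """Illustrative class-distribution metrics (hypothetical helper)."""
    counts = labels.value_counts()
    probs = counts / counts.sum()

    # Imbalance ratio (assumed definition): largest class over smallest class.
    imbalance_ratio = float(counts.max() / counts.min())

    # Effective number of classes (assumed definition): exp of Shannon entropy.
    entropy = float(-(probs * np.log(probs)).sum())
    effective_classes = float(np.exp(entropy))

    # Long-tail classes: share of the dataset below the threshold (1% here).
    long_tail = sorted(probs[probs < tail_threshold].index)

    # Inverse-frequency weights, normalized to sum to 1 (assumed normalization).
    inverse = 1.0 / probs
    alpha = (inverse / inverse.sum()).to_dict()

    return {
        "imbalance_ratio": imbalance_ratio,
        "effective_classes": effective_classes,
        "long_tail_classes": long_tail,
        "focal_alpha": alpha,
    }


if __name__ == "__main__":
    demo = pd.Series(["a"] * 6 + ["b"] * 3 + ["c"] * 1)
    print(distribution_sketch(demo))
```

With this definition, a perfectly balanced dataset has an effective class count equal to the number of classes, and the value shrinks toward 1 as a single class dominates.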

Duplicate detection finds exact text matches and uses TF-IDF vectorization with chunked cosine similarity to find near-duplicates. Cross-class duplicates (same text, different labels) are flagged as critical, since at least one of the conflicting labels must be wrong.
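A minimal sketch of both checks follows, assuming a DataFrame with 'text' and 'label' columns. The helper names exact_conflicts and near_duplicate_pairs, the 0.9 threshold, and the chunk size are hypothetical; the library's own functions may differ in signature and defaults.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def exact_conflicts(df: pd.DataFrame) -> pd.DataFrame:
    """Rows whose exact text appears with more than one label."""
    labels_per_text = df.groupby("text")["label"].transform("nunique")
    return df[labels_per_text > 1]


def near_duplicate_pairs(texts, threshold=0.9, chunk_size=1000):
    """Return (i, j, similarity) for i < j with cosine similarity >= threshold."""
    # TfidfVectorizer L2-normalizes rows by default, so a dot product
    # between two rows is already their cosine similarity.
    tfidf = TfidfVectorizer().fit_transform(texts)
    pairs = []
    for start in range(0, tfidf.shape[0], chunk_size):
        block = tfidf[start:start + chunk_size]
        # Only one chunk of the similarity matrix is dense at a time.
        sims = (block @ tfidf.T).toarray()
        rows, cols = np.nonzero(sims >= threshold)
        for r, c in zip(rows, cols):
            i = start + int(r)
            if i < c:  # skip self-matches and mirrored pairs
                pairs.append((i, int(c), float(sims[r, c])))
    return pairs


if __name__ == "__main__":
    frame = pd.DataFrame(
        {
            "text": ["refund my order", "refund my order", "track my parcel"],
            "label": ["billing", "shipping", "shipping"],
        }
    )
    print(exact_conflicts(frame))
    print(near_duplicate_pairs(frame["text"].tolist(), threshold=0.9))
```

Chunking keeps memory bounded at chunk_size × n_samples similarity values instead of the full n × n matrix. Note that in this sketch exact duplicates also show up as near-duplicate pairs with similarity close to 1.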

Noise scoring trains a logistic regression on TF-IDF features using stratified k-fold cross-validation. For each sample, it records the model's out-of-fold confidence in the given label. Samples in the bottom 5th percentile by confidence are flagged as mislabel suspects.
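The scoring step can be approximated with scikit-learn as shown below. This is a sketch under assumptions rather than the code in noise.py: the function name noise_scores_sketch, the number of folds, the fixed seed, and fitting the vectorizer inside each fold via a pipeline are choices made here for illustration. Every class needs at least n_splits samples for the stratified split to work.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline


def noise_scores_sketch(df, n_splits=5, percentile=5.0, seed=0):
    """Add out-of-fold 'confidence' and boolean 'suspect' columns (hypothetical helper)."""
    # The vectorizer lives inside the pipeline, so it is refit on each
    # training fold and never sees the held-out texts.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

    # Each row is scored by a model that was not trained on it.
    proba = cross_val_predict(
        model, df["text"], df["label"], cv=cv, method="predict_proba"
    )

    # Probability columns follow the sorted unique labels.
    classes = sorted(df["label"].unique())
    column_of = {label: k for k, label in enumerate(classes)}
    columns = df["label"].map(column_of).to_numpy()
    confidence = proba[np.arange(len(df)), columns]

    # Flag everything at or below the chosen percentile of confidence.
    cutoff = np.percentile(confidence, percentile)
    out = df.copy()
    out["confidence"] = confidence
    out["suspect"] = confidence <= cutoff
    return out


if __name__ == "__main__":
    demo = pd.DataFrame(
        {
            "text": ["great wonderful excellent film"] * 10
            + ["terrible awful boring film"] * 9
            + ["great wonderful excellent film"],
            "label": ["pos"] * 10 + ["neg"] * 10,
        }
    )
    scored = noise_scores_sketch(demo)
    print(scored.sort_values("confidence").head())
```

A percentile cutoff always flags a fixed share of the data, so the suspects are best read as a review queue ranked by confidence, not as confirmed errors.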

Project Structure

label_lens/
├── ingest.py         # Column detection, validation, DataFrame prep
├── distribution.py   # Class distribution analysis + visualization
├── duplicates.py     # Exact and near-duplicate detection
├── noise.py          # Label noise scoring via cross-validated confidence
├── report.py         # Aggregate findings and generate recommendations
└── utils.py          # Shared helpers

Development

# Install dev dependencies
uv sync --all-extras

# Run tests
pytest tests/ -v

# Lint and format
ruff check .
ruff format .

Deployment

Label Lens is designed for deployment on Hugging Face Spaces using Docker. The Dockerfile at the repo root handles the build.

Tech Stack

  • Python 3.13+
  • Streamlit
  • pandas / numpy
  • scikit-learn (TF-IDF, logistic regression, cross-validation)
  • Plotly

License

MIT


Built by Mike Noe
