Dataset-centric CV toolkit: label-quality checks, mislabel detection, class-imbalance analysis, and active-learning loop orchestration.

cv-quality – Computer Vision Quality Toolkit

A Python library for computer vision dataset quality workflows, including label-quality auditing, class-imbalance analysis, mislabel discovery, and active-learning pipeline orchestration. Designed for COCO, ImageNet, and long-tail dataset variants.



Features

| Module | What it does |
| --- | --- |
| cvquality.stats | Dataset statistics: class counts, bbox distributions, Gini/entropy imbalance metrics, co-occurrence matrix |
| cvquality.quality | Annotation integrity checks (out-of-bounds, duplicates, tiny boxes), Confident-Learning label-quality scoring, kNN-based mislabel detection |
| cvquality.active_learning | Uncertainty (entropy, margin, LC, BALD), Diversity (CoreSet, cluster-margin, MinMax), and Error-Localization (gradient norm, spatial entropy) strategies, plus a loop orchestrator |
| cvquality.recipes | Ready-made pipelines for COCO and ImageNet-style datasets |
| cvquality.io | COCO-format reader + HTML/JSON report generator |
| cvquality.cli | cvquality CLI: stats, check, report, imagenet commands |

Installation

# Core (no ML framework required)
pip install cv-quality

# With PyTorch backend
pip install "cv-quality[torch]"

# With TensorFlow backend
pip install "cv-quality[tensorflow]"

# Everything + dev tools
pip install "cv-quality[all,dev]"

The import name is cvquality (import cvquality); the PyPI distribution name is cv-quality.


Quick Start

Dataset statistics

from cvquality.io import COCODataset
from cvquality.stats import DatasetStats

ds = COCODataset("annotations/instances_train2017.json")
stats = DatasetStats(ds)
print(stats.summary())
# {'num_images': 118287, 'num_categories': 80, 'class_imbalance': {'gini': 0.42, ...}, ...}

# Long-tail analysis
print(stats.tail_categories(percentile=10))
# ['toaster', 'hair drier', 'parking meter', ...]
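
To make the imbalance figures easier to interpret, here is a minimal sketch of how a Gini coefficient and a normalized entropy can be computed from per-class counts. It uses the textbook formulas; the function names are illustrative, and cvquality's own implementation may differ in normalization or edge-case handling.

```python
import math

def gini(counts):
    """Gini coefficient of class counts: 0 = perfectly balanced, close to 1 = highly skewed."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted values:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i starting at 1
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

def normalized_entropy(counts):
    """Shannon entropy of the class distribution divided by log(K): 1 = uniform."""
    total, k = sum(counts), len(counts)
    if total == 0 or k < 2:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(k)

balanced = [100, 100, 100, 100]
long_tail = [1000, 100, 10, 1]
print(gini(balanced), normalized_entropy(balanced))    # balanced: Gini 0, entropy ~1
print(gini(long_tail), normalized_entropy(long_tail))  # skewed: higher Gini, lower entropy
```

A higher Gini together with a lower normalized entropy indicates a longer tail.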

Annotation quality checks

from cvquality.quality import AnnotationChecker

checker = AnnotationChecker(ds, min_bbox_area=4.0, max_overlap_iou=0.85)
summary = checker.summary()
# summary looks like:
# {'total_issues': 312, 'by_type': {'out_of_bounds': 5, 'near_duplicate': 307}, ...}
print(f"Total issues: {summary['total_issues']}")
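
The checks behind these counts can be sketched in plain Python. This is an illustration under stated assumptions, not AnnotationChecker's actual code: boxes use the COCO [x, y, width, height] convention, the helper names and the 'tiny_box' label are hypothetical, and the sketch compares every pair of boxes in one image without looking at categories.

```python
def iou_xywh(a, b):
    """Intersection-over-union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def check_boxes(boxes, img_w, img_h, min_bbox_area=4.0, max_overlap_iou=0.85):
    """Return (box_index, issue_type) pairs for the boxes of a single image."""
    issues = []
    for i, (x, y, w, h) in enumerate(boxes):
        if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            issues.append((i, "out_of_bounds"))
        if w * h < min_bbox_area:
            issues.append((i, "tiny_box"))
    # Pairwise comparison: flag the later box of any pair that overlaps too much.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou_xywh(boxes[i], boxes[j]) > max_overlap_iou:
                issues.append((j, "near_duplicate"))
    return issues

boxes = [
    [10, 10, 50, 50],
    [11, 10, 50, 50],      # nearly identical to box 0
    [600, 400, 60, 100],   # extends past a 640x480 image
    [5, 5, 1, 1],          # area 1, below min_bbox_area
]
print(check_boxes(boxes, img_w=640, img_h=480))
# [(2, 'out_of_bounds'), (3, 'tiny_box'), (1, 'near_duplicate')]
```

The pairwise loop is quadratic in the number of boxes per image, which is usually acceptable because images rarely carry more than a few hundred annotations.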

Label quality scoring (Confident Learning)

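from cvquality.quality import LabelQualityScorer
import numpy as np

# pred_probs: (N, K) out-of-fold predicted probabilities from your model
# labels:     (N,) given integer class labels
# The .npy paths below are placeholders; load your own arrays here.
pred_probs = np.load("pred_probs.npy")
labels = np.load("labels.npy")

lq = LabelQualityScorer(pred_probs, labels)
issues = lq.ranked_issues(top_k=50)   # worst labels first
print(lq.summary())
# {'estimated_error_rate': 0.032, 'flagged_count': 47, ...}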
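
For intuition, the core idea of Confident Learning can be sketched in a few lines of plain Python: compute a per-class confidence threshold, then flag examples that are confidently predicted as a class other than their given label. This is a simplified illustration with hypothetical names; LabelQualityScorer may use a different scoring or ranking rule.

```python
def confident_learning_flags(pred_probs, labels):
    """Flag likely label errors, ranked worst first by self-confidence."""
    n_classes = len(pred_probs[0])
    # Per-class threshold: mean predicted probability of class k
    # over the examples whose given label is k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(own) / len(own) if own else 1.0)

    flagged = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes the model is "confident" about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            flagged.append({
                "index": i,
                "given_label": y,
                "suggested_label": guess,
                "self_confidence": p[y],  # probability assigned to the given label
            })
    # Lowest self-confidence first: these are the most suspicious labels.
    return sorted(flagged, key=lambda d: d["self_confidence"])

pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.95, 0.05],  # labeled 1 but confidently predicted as 0
]
labels = [0, 0, 1, 1, 1]
print(confident_learning_flags(pred_probs, labels))
```

The probabilities should be out-of-fold (for example from cross-validation), because a model scored on its own training data tends to agree with the given labels, including the wrong ones.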

Mislabel detection

from cvquality.quality import MislabelDetector

# embeddings: (N, D) feature vectors; labels: (N,) given class labels (supply your own)
md = MislabelDetector(embeddings, labels, n_neighbors=15)
candidates = md.rank_candidates(top_k=100)
# [{'index': 2341, 'given_label': 3, 'suggested_label': 7, 'quality_score': 0.12}, ...]
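
A kNN-based check of this kind can be sketched as follows: for each sample, look at the labels of its nearest neighbors in embedding space and measure how many agree with the given label. The function name and the brute-force Euclidean search are assumptions for illustration; MislabelDetector's internals are not documented here and may differ.

```python
import math
from collections import Counter

def knn_label_agreement(embeddings, labels, n_neighbors=3):
    """Rank samples by the fraction of nearest neighbors sharing their label."""
    results = []
    for i, e in enumerate(embeddings):
        # Brute-force Euclidean distances to every other sample.
        dists = sorted(
            (math.dist(e, other), j)
            for j, other in enumerate(embeddings) if j != i
        )
        neighbor_labels = [labels[j] for _, j in dists[:n_neighbors]]
        votes = Counter(neighbor_labels)
        suggested, _ = votes.most_common(1)[0]
        results.append({
            "index": i,
            "given_label": labels[i],
            "suggested_label": suggested,
            # Fraction of neighbors that agree with the given label.
            "quality_score": votes[labels[i]] / len(neighbor_labels),
        })
    # Lowest agreement first: most likely mislabeled.
    return sorted(results, key=lambda r: r["quality_score"])

embeddings = [
    (0, 0), (0, 1), (1, 0), (1, 1),   # cluster A
    (10, 10), (10, 11), (11, 10),     # cluster B
    (0.5, 0.5),                       # sits in cluster A but is labeled 1
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(knn_label_agreement(embeddings, labels)[0])
```

Brute-force search is quadratic in N; for large datasets an approximate nearest-neighbor index is the usual substitute.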

Active learning

from cvquality.active_learning import ActiveLearningLoop, UncertaintyStrategy
from cvquality.active_learning.backends import PyTorchBackend
from cvquality.active_learning.loop import LoopConfig
import torchvision.models as M

model = M.resnet18(weights=M.ResNet18_Weights.DEFAULT)
backend = PyTorchBackend(model, device="cuda")
strategy = UncertaintyStrategy("entropy")

# images: any image list; labels: numpy array of labels for them (supply your own)
loop = ActiveLearningLoop(
    backend, strategy, images, labels,
    config=LoopConfig(budget_per_round=200, max_rounds=5),
)
history = loop.run()
print(loop.summary())
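
The entropy, margin, and least-confidence (LC) scores named in the Features list follow standard definitions, sketched below in plain Python. The function names and the query helper are illustrative, not the library's API. BALD is omitted because it needs several stochastic forward passes per sample.

```python
import math

def entropy_score(p):
    """Shannon entropy of the predicted distribution (higher = more uncertain)."""
    return -sum(q * math.log(q) for q in p if q > 0)

def margin_score(p):
    """One minus the gap between the two most likely classes."""
    top1, top2 = sorted(p, reverse=True)[:2]
    return 1.0 - (top1 - top2)

def least_confidence_score(p):
    """One minus the probability of the most likely class."""
    return 1.0 - max(p)

def query(pred_probs, budget, score_fn=entropy_score):
    """Return the indices of the `budget` most uncertain samples."""
    ranked = sorted(
        range(len(pred_probs)),
        key=lambda i: score_fn(pred_probs[i]),
        reverse=True,
    )
    return ranked[:budget]

pred_probs = [
    [0.98, 0.01, 0.01],   # confident
    [0.40, 0.35, 0.25],   # uncertain
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # close to uniform
]
print(query(pred_probs, budget=2))                         # [3, 1]
print(query(pred_probs, budget=2, score_fn=margin_score))  # [3, 1]
```

The three scores often agree on clear-cut cases like these and diverge mainly when the top two classes are close but the remaining probability mass is spread differently.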

COCO full-pipeline recipe

from cvquality.recipes import COCORecipe

recipe = COCORecipe(
    "annotations/instances_train2017.json",
    image_dir="/data/coco/train2017",
    report_dir="./reports",
    dataset_name="COCO-2017-train",
)
result = recipe.run()
# Writes reports/instances_train2017_report.json + .html

CLI

# Print dataset statistics
cvquality stats annotations/instances_val2017.json

# Run annotation checks
cvquality check annotations/instances_val2017.json --min-bbox-area 4 --max-iou 0.85

# Generate full HTML + JSON report
cvquality report annotations/instances_val2017.json --output-dir ./reports --name "COCO-val"

# Analyse an ImageNet-style folder
cvquality imagenet /data/imagenet/val --output-dir ./reports

Supported Dataset Formats

Natively supported (no glue code needed)

| Format | Entry point |
| --- | --- |
| COCO JSON (instances_*.json) | COCODataset + COCORecipe |
| ImageNet flat-folder (root/class_name/*.jpg) | ImageNetRecipe |

Works with any dataset — via numpy arrays

The stats, quality, and active-learning modules are format-agnostic. They only need:

| Module | What it needs |
| --- | --- |
| LabelQualityScorer | (N, K) pred_probs + (N,) labels |
| MislabelDetector | (N, D) embeddings + (N,) labels |
| All 3 AL strategies | numpy arrays (probs / embeddings / gradients) |
| ActiveLearningLoop | any image list + numpy labels |

Pascal VOC, Open Images, Roboflow exports, custom CSVs, etc. all work — load your data into numpy arrays or convert to a COCODataset.

What needs a converter

  • Pascal VOC XML / YOLO .txt — no built-in reader; trivial to convert to COCO JSON or use quality/AL modules directly with numpy arrays.
  • Segmentation masks (stuff_*.json, panoptic) — COCODataset loads them (still COCO JSON) but AnnotationChecker currently only inspects bboxes, not polygon/RLE masks.
  • HuggingFace Datasets / TFRecords / LMDBs — load to numpy/PIL, pass to AL backends.
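
As an example of such a converter, the sketch below turns YOLO-style label lines (class_id cx cy w h, all normalized to the image size) into a COCO-style dictionary with pixel-space [x, y, width, height] boxes. The function name and input layout are hypothetical, and it has not been verified which optional COCO fields COCODataset requires.

```python
import json

def yolo_to_coco(records, class_names):
    """records: list of (file_name, width, height, yolo_lines) tuples."""
    images, annotations = [], []
    ann_id = 1
    for image_id, (file_name, width, height, lines) in enumerate(records, start=1):
        images.append({"id": image_id, "file_name": file_name,
                       "width": width, "height": height})
        for line in lines:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip blank or malformed lines
            cls = int(parts[0])
            cx, cy, w, h = (float(v) for v in parts[1:])
            # De-normalize and move from box center to top-left corner.
            bw, bh = w * width, h * height
            x, y = cx * width - bw / 2, cy * height - bh / 2
            annotations.append({
                "id": ann_id, "image_id": image_id,
                "category_id": cls + 1,  # 1-based ids, as in the official COCO files
                "bbox": [x, y, bw, bh], "area": bw * bh, "iscrowd": 0,
            })
            ann_id += 1
    categories = [{"id": i + 1, "name": n} for i, n in enumerate(class_names)]
    return {"images": images, "annotations": annotations, "categories": categories}

coco = yolo_to_coco(
    [("img_0001.jpg", 640, 480, ["0 0.5 0.5 0.25 0.5"])],
    class_names=["cat", "dog"],
)
print(json.dumps(coco["annotations"][0]))
```

Writing the result with json.dump gives a file in the same general layout as instances_*.json, which can then be passed to the COCO entry points.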

Any format → quality + active learning

from cvquality.quality import LabelQualityScorer, MislabelDetector
from cvquality.active_learning.strategies import UncertaintyStrategy

# Your own loader and model: Pascal VOC, YOLO, CSV, anything
embeddings = my_loader.get_embeddings()   # (N, D)
labels      = my_loader.get_labels()      # (N,)
pred_probs  = my_model.predict(images)    # (N, K); ideally out-of-fold predictions

lq       = LabelQualityScorer(pred_probs, labels)
md       = MislabelDetector(embeddings, labels)
strategy = UncertaintyStrategy("entropy")
indices  = strategy.query(pred_probs, budget=100)
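
The embeddings array can drive diversity-based selection in the same way. CoreSet selection is commonly implemented as greedy k-center: repeatedly pick the point farthest from everything already chosen. The sketch below is a generic version of that technique with hypothetical names; cvquality's CoreSet strategy may be implemented differently.

```python
import math

def k_center_greedy(embeddings, labeled_idx, budget):
    """Pick up to `budget` indices that best cover the embedding space."""
    # Distance from every point to its nearest already-labeled point.
    min_dist = [
        min((math.dist(e, embeddings[j]) for j in labeled_idx), default=math.inf)
        for e in embeddings
    ]
    selected = []
    for _ in range(budget):
        nxt = max(range(len(embeddings)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0:
            break  # every remaining point coincides with a chosen one
        selected.append(nxt)
        # The new center may now be the closest one for some points.
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(e, embeddings[nxt]))
    return selected

embeddings = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
print(k_center_greedy(embeddings, labeled_idx=[0], budget=2))  # [4, 2]
```

Unlike uncertainty sampling, this needs no model predictions, so it is often used for the first round when no trained model exists yet.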

Project Structure

cvquality/
├── stats/              Dataset statistics & imbalance metrics
├── quality/            Label quality, mislabel detection, annotation checks
├── active_learning/
│   ├── strategies/     uncertainty / diversity / error-localization
│   ├── backends/       PyTorch, TensorFlow (pluggable)
│   └── loop.py         Loop orchestrator
├── recipes/            COCO & ImageNet pipelines
├── io/                 COCO reader + report generator
└── cli/                Click-based CLI
tests/                  pytest suite (87 tests)

Publishing to PyPI

pip install build twine
python -m build
twine check dist/*
twine upload dist/*

Use __token__ as the username and a PyPI API token as the password. See deploy.md for the full step-by-step guide.


Authors

cv-quality is authored and maintained by Sai Teja Erukude.


License

MIT
