Dataset-centric CV toolkit: label-quality checks, mislabel detection, class-imbalance analysis, and active-learning loop orchestration.

cv-quality – Computer Vision Quality Toolkit

A Python library for computer vision dataset quality workflows, including label-quality auditing, class-imbalance analysis, mislabel discovery, and active-learning pipeline orchestration. Designed for COCO, ImageNet, and long-tail dataset variants.

Features
Installation
Quick Start
CLI
Supported Dataset Formats
Project Structure
Publishing to PyPI
Authors
License

Features

Module	What it does
`cvquality.stats`	Dataset statistics: class counts, bbox distributions, Gini/entropy imbalance metrics, co-occurrence matrix
`cvquality.quality`	Annotation integrity checks (out-of-bounds, duplicates, tiny boxes), Confident-Learning label-quality scoring, kNN-based mislabel detection
`cvquality.active_learning`	Uncertainty (entropy, margin, least confidence, BALD), diversity (CoreSet, cluster-margin, MinMax), and error-localization (gradient norm, spatial entropy) strategies, plus a loop orchestrator
`cvquality.recipes`	Ready-made pipelines for COCO and ImageNet-style datasets
`cvquality.io`	COCO-format reader + HTML/JSON report generator
`cvquality.cli`	`cvquality` CLI: `stats`, `check`, `report`, `imagenet` commands

Installation

# Core (no ML framework required)
pip install cv-quality

# With PyTorch backend
pip install "cv-quality[torch]"

# With TensorFlow backend
pip install "cv-quality[tensorflow]"

# Everything + dev tools
pip install "cv-quality[all,dev]"

The import name is cvquality (import cvquality); the PyPI distribution name is cv-quality.

Quick Start

Dataset statistics

from cvquality.io import COCODataset
from cvquality.stats import DatasetStats

ds = COCODataset("annotations/instances_train2017.json")
stats = DatasetStats(ds)
print(stats.summary())
# {'num_images': 118287, 'num_categories': 80, 'class_imbalance': {'gini': 0.42, ...}, ...}

# Long-tail analysis
print(stats.tail_categories(percentile=10))
# ['toaster', 'hair drier', 'parking meter', ...]
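
The Gini and entropy imbalance metrics reported above can be reproduced from raw class counts. The following is a minimal numpy sketch, not the library's implementation: the function names `gini`, `normalized_entropy`, and `tail_classes` are hypothetical, and the exact definitions `DatasetStats` uses may differ.

```python
import numpy as np

def gini(counts):
    """Gini coefficient of class counts: 0 is balanced, (K-1)/K means one class holds everything."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # Standard rank-weighted form on ascending-sorted values.
    return float((2 * ranks - n - 1) @ x / (n * x.sum()))

def normalized_entropy(counts):
    """Shannon entropy of class frequencies divided by log(K): 1 is balanced, 0 is a single class."""
    c = np.asarray(counts, dtype=float)
    k = c.size
    if k < 2 or c.sum() == 0:
        return 1.0
    p = c[c > 0] / c.sum()
    return float(-(p * np.log(p)).sum() / np.log(k))

def tail_classes(counts_by_name, percentile=10):
    """Names of classes whose count is at or below the given percentile of all counts."""
    names = list(counts_by_name)
    counts = np.array([counts_by_name[n] for n in names], dtype=float)
    cutoff = np.percentile(counts, percentile)
    return [n for n, c in zip(names, counts) if c <= cutoff]

# Illustrative counts only, not real COCO numbers.
example_counts = {"person": 1000, "car": 500, "toaster": 5, "hair drier": 3}
print(gini(list(example_counts.values())))
print(tail_classes(example_counts, percentile=25))
```

Normalizing the entropy by log(K) keeps the value comparable across datasets with different numbers of categories.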

Annotation quality checks

from cvquality.quality import AnnotationChecker

# ds is the COCODataset loaded in the previous example
checker = AnnotationChecker(ds, min_bbox_area=4.0, max_overlap_iou=0.85)
summary = checker.summary()
print(f"Total issues: {summary['total_issues']}")
# summary looks like:
# {'total_issues': 312, 'by_type': {'out_of_bounds': 5, 'near_duplicate': 307}, ...}
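
To make the three check types concrete (out-of-bounds, tiny boxes, near-duplicates), here is a small pure-Python sketch for the boxes of a single image. It assumes COCO-style `[x, y, width, height]` boxes in pixels; `check_boxes` and `iou_xywh` are hypothetical helpers, and the sketch ignores category ids when comparing boxes, which the real `AnnotationChecker` may or may not do.

```python
def iou_xywh(a, b):
    """Intersection over union of two boxes in COCO [x, y, width, height] format."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def check_boxes(boxes, img_w, img_h, min_area=4.0, max_iou=0.85):
    """Return issue tuples for one image: (type, index) or (type, index, other_index)."""
    issues = []
    for i, (x, y, w, h) in enumerate(boxes):
        # A box is out of bounds if any edge leaves the image rectangle.
        if x < 0 or y < 0 or x + w > img_w or y + h > img_h:
            issues.append(("out_of_bounds", i))
        if w * h < min_area:
            issues.append(("tiny_box", i))
    # Pairwise comparison is O(n^2), which is fine for the boxes of one image.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou_xywh(boxes[i], boxes[j]) > max_iou:
                issues.append(("near_duplicate", i, j))
    return issues

boxes = [[10, 10, 50, 50], [11, 10, 50, 50], [90, 90, 20, 20], [5, 5, 1, 1]]
print(check_boxes(boxes, img_w=100, img_h=100))
```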

Label quality scoring (Confident Learning)

from cvquality.quality import LabelQualityScorer
import numpy as np

# pred_probs: (N, K) out-of-fold predicted probabilities from your model
# labels:     (N,) given labels for the same N samples
lq = LabelQualityScorer(pred_probs, labels)
issues = lq.ranked_issues(top_k=50)   # worst labels first
print(lq.summary())
# {'estimated_error_rate': 0.032, 'flagged_count': 47, ...}
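
For readers unfamiliar with Confident Learning, the core idea can be sketched in a few lines of numpy. This is a simplified illustration under stated assumptions, not what `LabelQualityScorer` does internally: it uses per-class thresholds equal to the mean predicted probability of each class over the samples labeled with it, assumes integer labels in `0..K-1` with at least one sample per class, and the function name `confident_learning_issues` is hypothetical.

```python
import numpy as np

def confident_learning_issues(pred_probs, labels):
    """Flag likely label errors and rank them by self-confidence (lowest first)."""
    p = np.asarray(pred_probs, dtype=float)
    y = np.asarray(labels, dtype=int)
    n, k = p.shape

    # Per-class threshold: average confidence in class j among samples labeled j.
    thresholds = np.array([p[y == j, j].mean() for j in range(k)])

    # A class is "confident" for a sample if its probability clears that class's threshold.
    above = p >= thresholds
    masked = np.where(above, p, -np.inf)
    confident_label = np.where(above.any(axis=1), masked.argmax(axis=1), -1)

    # Flag samples whose confident class disagrees with the given label.
    flagged = (confident_label != -1) & (confident_label != y)

    # Self-confidence: probability the model assigns to the given label.
    self_confidence = p[np.arange(n), y]
    ranked = np.flatnonzero(flagged)
    ranked = ranked[np.argsort(self_confidence[ranked])]
    return ranked, self_confidence

probs = np.array([[0.90, 0.10], [0.80, 0.20], [0.20, 0.80], [0.10, 0.90], [0.95, 0.05]])
given = np.array([0, 0, 1, 1, 1])  # the last sample looks mislabeled
ranked, self_conf = confident_learning_issues(probs, given)
print(ranked, self_conf[ranked])
```

The probabilities should be out-of-fold, as noted above; in-sample predictions from a model that has memorized its training labels would hide the very errors this method looks for.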

Mislabel detection

from cvquality.quality import MislabelDetector

# embeddings: (N, D) feature vectors; labels: (N,) given labels
md = MislabelDetector(embeddings, labels, n_neighbors=15)
candidates = md.rank_candidates(top_k=100)
# [{'index': 2341, 'given_label': 3, 'suggested_label': 7, 'quality_score': 0.12}, ...]
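
The kNN idea behind this kind of detector is that a sample whose nearest neighbors in embedding space mostly carry a different label is a mislabel candidate. The sketch below shows one plausible scoring rule (fraction of neighbors that agree with the given label); the function name is hypothetical, Euclidean distance is an assumption, and the actual `quality_score` computed by `MislabelDetector` may be defined differently.

```python
import numpy as np

def knn_label_agreement(embeddings, labels, n_neighbors=15):
    """Return (quality, suggested): neighbor agreement with the given label, and the majority neighbor label."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels, dtype=int)
    n = len(y)
    k = min(n_neighbors, n - 1)

    # Pairwise squared Euclidean distances; O(N^2) memory, so only suitable for small N.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbor

    nn = np.argsort(d2, axis=1)[:, :k]      # (N, k) neighbor indices
    neighbor_labels = y[nn]                 # (N, k) neighbor labels
    num_classes = int(y.max()) + 1
    votes = np.stack(
        [(neighbor_labels == c).sum(axis=1) for c in range(num_classes)], axis=1
    )
    quality = votes[np.arange(n), y] / k    # low value = suspicious label
    suggested = votes.argmax(axis=1)
    return quality, suggested

# Two well-separated 1-D clusters; index 3 sits in the first cluster but is labeled 1.
emb = np.array([[0.0], [0.1], [0.2], [0.3], [10.0], [10.1], [10.2], [10.3]])
lab = np.array([0, 0, 0, 1, 1, 1, 1, 1])
quality, suggested = knn_label_agreement(emb, lab, n_neighbors=3)
print(int(np.argmin(quality)), int(suggested[3]))
```

For large datasets an approximate nearest-neighbor index would replace the dense distance matrix.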

Active learning

from cvquality.active_learning import ActiveLearningLoop, UncertaintyStrategy
from cvquality.active_learning.backends import PyTorchBackend
from cvquality.active_learning.loop import LoopConfig
import torchvision.models as M

model = M.resnet18(weights=M.ResNet18_Weights.DEFAULT)
backend = PyTorchBackend(model, device="cuda")  # assumes a CUDA-capable GPU is available
strategy = UncertaintyStrategy("entropy")

# images: any image list; labels: numpy array of labels, one per image
loop = ActiveLearningLoop(
    backend, strategy, images, labels,
    config=LoopConfig(budget_per_round=200, max_rounds=5),
)
history = loop.run()
print(loop.summary())
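
The uncertainty criteria named in the feature list have standard textbook definitions, sketched here in numpy for entropy, margin, and least confidence (BALD is omitted because it needs multiple stochastic forward passes). The functions `uncertainty_scores` and `select_most_uncertain` are hypothetical names for illustration; how `UncertaintyStrategy` scores and breaks ties is not documented here.

```python
import numpy as np

def uncertainty_scores(pred_probs, method="entropy"):
    """Per-sample uncertainty from (N, K) class probabilities; higher means more uncertain."""
    p = np.asarray(pred_probs, dtype=float)
    if method == "entropy":
        # Clip to avoid log(0) for hard zero probabilities.
        return -(p * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=1)
    top2 = np.sort(p, axis=1)[:, -2:]  # columns: second-best, best
    if method == "margin":
        return 1.0 - (top2[:, 1] - top2[:, 0])
    if method == "least_confidence":
        return 1.0 - top2[:, 1]
    raise ValueError(f"unknown method: {method}")

def select_most_uncertain(pred_probs, budget, method="entropy"):
    """Indices of the `budget` most uncertain samples, most uncertain first."""
    scores = uncertainty_scores(pred_probs, method)
    return np.argsort(-scores)[:budget]

probs = np.array([[0.98, 0.01, 0.01], [0.34, 0.33, 0.33], [0.60, 0.30, 0.10]])
for m in ("entropy", "margin", "least_confidence"):
    print(m, select_most_uncertain(probs, budget=2, method=m))
```

In each round of an active-learning loop, the selected indices would be sent for labeling, added to the labeled pool, and the model retrained before the next query.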

COCO full-pipeline recipe

from cvquality.recipes import COCORecipe

recipe = COCORecipe(
    "annotations/instances_train2017.json",
    image_dir="/data/coco/train2017",
    report_dir="./reports",
    dataset_name="COCO-2017-train",
)
result = recipe.run()
# Writes reports/instances_train2017_report.json + .html

CLI

# Print dataset statistics
cvquality stats annotations/instances_val2017.json

# Run annotation checks
cvquality check annotations/instances_val2017.json --min-bbox-area 4 --max-iou 0.85

# Generate full HTML + JSON report
cvquality report annotations/instances_val2017.json --output-dir ./reports --name "COCO-val"

# Analyse an ImageNet-style folder
cvquality imagenet /data/imagenet/val --output-dir ./reports

Supported Dataset Formats

Natively supported (no glue code needed)

Format	Entry point
COCO JSON (`instances_*.json`)	`COCODataset` + `COCORecipe`
ImageNet flat-folder (`root/class_name/*.jpg`)	`ImageNetRecipe`
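
The flat-folder layout means class membership is encoded purely by directory name. As a rough illustration of what that implies for class-count analysis, the standard-library sketch below walks a `root/class_name/*` tree and counts images per class. `folder_class_counts` and the suffix list are assumptions for this example; `ImageNetRecipe` may accept other extensions or apply additional checks.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def folder_class_counts(root):
    """Count image files per class in a root/class_name/<image> layout."""
    counts = Counter()
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

The resulting counts can be fed to any imbalance metric that works on a list of per-class totals.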

Works with any dataset — via numpy arrays

The stats, quality, and active-learning modules are format-agnostic. They only need:

Module	What it needs
`LabelQualityScorer`	`(N, K)` pred_probs + `(N,)` labels
`MislabelDetector`	`(N, D)` embeddings + `(N,)` labels
All three active-learning strategy families (uncertainty, diversity, error-localization)	numpy arrays (probs / embeddings / gradients)
`ActiveLearningLoop`	any image list + numpy labels
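
As an example of a strategy that needs only an embeddings array, CoreSet-style diversity selection is commonly approximated with the k-center greedy algorithm: repeatedly pick the unlabeled point farthest from everything already covered. The sketch below is a generic numpy version under that assumption; `k_center_greedy` is a hypothetical name, and the library's CoreSet implementation may differ.

```python
import numpy as np

def k_center_greedy(embeddings, labeled_idx, budget):
    """Greedily select up to `budget` indices that maximize distance to the covered set."""
    X = np.asarray(embeddings, dtype=float)
    n = len(X)
    labeled_idx = list(labeled_idx)

    if labeled_idx:
        # Distance from every point to its nearest already-labeled point.
        diffs = X[:, None, :] - X[labeled_idx][None, :, :]
        min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)
        min_dist[labeled_idx] = -np.inf  # never re-select labeled points
    else:
        min_dist = np.full(n, np.inf)

    selected = []
    for _ in range(min(budget, n - len(labeled_idx))):
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        # The new center covers its neighborhood: shrink distances accordingly.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[idx], axis=1))
        min_dist[idx] = -np.inf
    return selected

emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
print(k_center_greedy(emb, labeled_idx=[0], budget=2))
```

Unlike uncertainty sampling, this criterion never looks at model predictions, which is why the two are often combined.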

Pascal VOC, Open Images, Roboflow exports, custom CSVs, and similar sources all work: load your data into numpy arrays or convert it to a COCODataset.

What needs a converter

Pascal VOC XML / YOLO .txt: no built-in reader. Convert to COCO JSON, or pass numpy arrays directly to the quality and active-learning modules.
Segmentation masks (stuff_*.json, panoptic): COCODataset loads them because they are still COCO JSON, but AnnotationChecker currently inspects only bounding boxes, not polygon or RLE masks.
HuggingFace Datasets / TFRecords / LMDBs: load to numpy or PIL, then pass to the active-learning backends.
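
For the YOLO case, the conversion mostly comes down to rescaling normalized center-format boxes into COCO's pixel-based `[x_min, y_min, width, height]`. The sketch below shows that step for the lines of one YOLO label file. The helper names are hypothetical, and it assumes YOLO class indices can be used directly as `category_id`; real COCO files often use 1-based or non-contiguous ids, so a mapping table may be needed.

```python
def yolo_to_coco_bbox(cx, cy, w, h, img_w, img_h):
    """Normalized YOLO (center_x, center_y, width, height) to COCO pixel [x_min, y_min, width, height]."""
    bw, bh = w * img_w, h * img_h
    return [cx * img_w - bw / 2, cy * img_h - bh / 2, bw, bh]

def yolo_lines_to_coco_annotations(lines, image_id, img_w, img_h, start_id=1):
    """Convert the text lines of one YOLO label file into COCO annotation dicts."""
    annotations = []
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        class_index = int(parts[0])
        bbox = yolo_to_coco_bbox(*map(float, parts[1:]), img_w, img_h)
        annotations.append({
            "id": start_id + len(annotations),
            "image_id": image_id,
            "category_id": class_index,  # assumption: ids map one-to-one
            "bbox": bbox,
            "area": bbox[2] * bbox[3],
            "iscrowd": 0,
        })
    return annotations

print(yolo_lines_to_coco_annotations(["0 0.5 0.5 0.5 0.5"], image_id=1, img_w=100, img_h=200))
```

A complete COCO file also needs `images` and `categories` lists alongside `annotations`.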

Any format → quality + active learning

from cvquality.quality import LabelQualityScorer, MislabelDetector
from cvquality.active_learning.strategies import UncertaintyStrategy

# Your own loader: Pascal VOC, YOLO, CSV, anything
# (my_loader, my_model, and images are placeholders for your own objects)
embeddings = my_loader.get_embeddings()   # (N, D)
labels     = my_loader.get_labels()       # (N,)
pred_probs = my_model.predict(images)     # (N, K), ideally out-of-fold

lq       = LabelQualityScorer(pred_probs, labels)
md       = MislabelDetector(embeddings, labels)
strategy = UncertaintyStrategy("entropy")
indices  = strategy.query(pred_probs, budget=100)

Project Structure

cvquality/
├── stats/              Dataset statistics & imbalance metrics
├── quality/            Label quality, mislabel detection, annotation checks
├── active_learning/
│   ├── strategies/     uncertainty / diversity / error-localization
│   ├── backends/       PyTorch, TensorFlow (pluggable)
│   └── loop.py         Loop orchestrator
├── recipes/            COCO & ImageNet pipelines
├── io/                 COCO reader + report generator
└── cli/                Click-based CLI
tests/                  pytest suite (87 tests)

Publishing to PyPI

pip install build twine
python -m build
twine check dist/*
twine upload dist/*

Use __token__ as the username and a PyPI API token as the password. See deploy.md for the full step-by-step guide.

Authors

cv-quality is authored and maintained by Sai Teja Erukude.

PyPI: https://pypi.org/project/cv-quality/
Homepage / repository: https://github.com/SaiTeja-Erukude/cv-quality
Issues: https://github.com/SaiTeja-Erukude/cv-quality/issues

License

MIT

This version

1.0.0

Apr 3, 2026

Download files

Source Distribution

cv_quality-1.0.0.tar.gz (40.9 kB)

Uploaded Apr 3, 2026 Source

Built Distribution

cv_quality-1.0.0-py3-none-any.whl (46.2 kB)

Uploaded Apr 3, 2026 Python 3

File details

Details for the file cv_quality-1.0.0.tar.gz.

File metadata

Download URL: cv_quality-1.0.0.tar.gz
Upload date: Apr 3, 2026
Size: 40.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for cv_quality-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`1306c82faef7a750e9fe88306d387f75a13a3e7e15f68c0036e2e7af18531046`
MD5	`ce6f36e42f75597e273ee285be0e736b`
BLAKE2b-256	`478e2cdfb29e09c86863da4a6daa28888374f875feeb6ef8ed8125ce3682c0b6`
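
A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. This is a generic sketch; `sha256_of_file` and `matches_published_hash` are hypothetical helper names, and the file path is whatever location you downloaded the archive to.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, expected_hex):
    """True if the file's SHA256 equals the published digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, the expected value would be `1306c82faef7a750e9fe88306d387f75a13a3e7e15f68c0036e2e7af18531046`, passed together with the local path to `cv_quality-1.0.0.tar.gz`.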

File details

Details for the file cv_quality-1.0.0-py3-none-any.whl.

File metadata

Download URL: cv_quality-1.0.0-py3-none-any.whl
Upload date: Apr 3, 2026
Size: 46.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for cv_quality-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`da55f642ae5b07a5964044a25f7582a82b9389f4ff445547211555ed662b7b2c`
MD5	`5e744d49b2077d38984a81da8c2a87b5`
BLAKE2b-256	`417d745450cc7683fff3dfd484cc802e18cbaa9eaeac7eef9b65f4efbf001a46`
