Dataset-centric CV toolkit: label-quality checks, mislabel detection, class-imbalance analysis, and active-learning loop orchestration.
cv-quality – Computer Vision Quality Toolkit
A Python library for computer vision dataset quality workflows, including label-quality auditing, class-imbalance analysis, mislabel discovery, and active-learning pipeline orchestration. Designed for COCO, ImageNet, and long-tail dataset variants.
Table of Contents
- Features
- Installation
- Quick Start
- CLI
- Supported Dataset Formats
- Project Structure
- Publishing to PyPI
- Authors
- License
Features
| Module | What it does |
|---|---|
| `cvquality.stats` | Dataset statistics: class counts, bbox distributions, Gini/entropy imbalance metrics, co-occurrence matrix |
| `cvquality.quality` | Annotation integrity checks (out-of-bounds, duplicates, tiny boxes), Confident-Learning label-quality scoring, kNN-based mislabel detection |
| `cvquality.active_learning` | Uncertainty (entropy, margin, LC, BALD), Diversity (CoreSet, cluster-margin, MinMax), Error-Localization (gradient norm, spatial entropy) strategies + loop orchestrator |
| `cvquality.recipes` | Ready-made pipelines for COCO and ImageNet-style datasets |
| `cvquality.io` | COCO-format reader + HTML/JSON report generator |
| `cvquality.cli` | `cvquality` CLI: `stats`, `check`, `report`, `imagenet` commands |
Installation
# Core (no ML framework required)
pip install cv-quality
# With PyTorch backend
pip install "cv-quality[torch]"
# With TensorFlow backend
pip install "cv-quality[tensorflow]"
# Everything + dev tools
pip install "cv-quality[all,dev]"
Import name: `import cvquality` (the PyPI distribution name is `cv-quality`).
Quick Start
Dataset statistics
from cvquality.io import COCODataset
from cvquality.stats import DatasetStats
ds = COCODataset("annotations/instances_train2017.json")
stats = DatasetStats(ds)
print(stats.summary())
# {'num_images': 118287, 'num_categories': 80, 'class_imbalance': {'gini': 0.42, ...}, ...}
# Long-tail analysis
print(stats.tail_categories(percentile=10))
# ['toaster', 'hair drier', 'parking meter', ...]
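The `gini` value in the summary above is an imbalance metric over per-class counts. As a rough illustration of what such metrics measure, here is a minimal standard-library sketch of a Gini coefficient and a normalised entropy. The function names and the exact formulas are assumptions for illustration; the formulas used inside `DatasetStats` may differ.

```python
import math


def gini(counts):
    """Gini coefficient of non-negative class counts (0.0 means perfectly balanced)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form over ascending-sorted values with 1-based ranks.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def normalized_entropy(counts):
    """Shannon entropy of the class distribution divided by log(K) (1.0 means balanced)."""
    total, k = sum(counts), len(counts)
    if total == 0 or k < 2:
        return 0.0  # undefined for fewer than two classes; 0.0 is a convention here
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(k)


# Hypothetical per-class instance counts.
counts = [500, 300, 150, 50]
print(round(gini(counts), 3), round(normalized_entropy(counts), 3))
```

A Gini value near 0 and a normalised entropy near 1 both indicate a balanced dataset; long-tail datasets push the two in opposite directions.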
Annotation quality checks
from cvquality.quality import AnnotationChecker
checker = AnnotationChecker(ds, min_bbox_area=4.0, max_overlap_iou=0.85)
summary = checker.summary()
print(f"Total issues: {summary['total_issues']}")
# {'total_issues': 312, 'by_type': {'out_of_bounds': 5, 'near_duplicate': 307}, ...}
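To make the `min_bbox_area` and `max_overlap_iou` thresholds concrete, the sketch below applies the three kinds of check to a list of COCO-style `[x, y, width, height]` boxes from one image. It is a simplified stand-in, not `AnnotationChecker`'s implementation: the helper names and issue labels are hypothetical, and it assumes all boxes belong to the same category.

```python
def out_of_bounds(bbox, width, height):
    """True if a COCO [x, y, w, h] box extends past the image borders."""
    x, y, w, h = bbox
    return x < 0 or y < 0 or x + w > width or y + h > height


def iou(a, b):
    """Intersection-over-union of two COCO [x, y, w, h] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def find_issues(boxes, width, height, min_bbox_area=4.0, max_overlap_iou=0.85):
    """Return (issue_type, index, ...) tuples for boxes of a single image."""
    issues = []
    for i, box in enumerate(boxes):
        if out_of_bounds(box, width, height):
            issues.append(("out_of_bounds", i))
        if box[2] * box[3] < min_bbox_area:
            issues.append(("tiny_box", i))
        # Compare against later boxes only, so each pair is reported once.
        for j in range(i + 1, len(boxes)):
            if iou(box, boxes[j]) > max_overlap_iou:
                issues.append(("near_duplicate", i, j))
    return issues


# Hypothetical boxes on a 640x480 image.
boxes = [[10, 10, 100, 100], [12, 11, 100, 100], [630, 470, 20, 20], [5, 5, 1, 1]]
print(find_issues(boxes, width=640, height=480))
```

The pairwise loop is quadratic in the number of boxes per image, which is usually acceptable because the comparison is per image rather than per dataset.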
Label quality scoring (Confident Learning)
from cvquality.quality import LabelQualityScorer
# pred_probs: (N, K) out-of-fold predictions from your model
# labels: (N,) given class labels, one per row of pred_probs
lq = LabelQualityScorer(pred_probs, labels)
issues = lq.ranked_issues(top_k=50) # worst labels first
print(lq.summary())
# {'estimated_error_rate': 0.032, 'flagged_count': 47, ...}
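For readers unfamiliar with Confident Learning, the following NumPy sketch shows a simplified version of its core thresholding idea: each class gets a threshold equal to the mean predicted probability of that class among samples carrying that label, and a sample is flagged when its most confident above-threshold class disagrees with the given label. This is an illustration only, not the code behind `LabelQualityScorer`. The function name is hypothetical, and it assumes integer labels in `0..K-1` with at least one sample per class.

```python
import numpy as np


def confident_learning_flags(pred_probs, labels):
    """Return (flagged, self_confidence) for (N, K) probabilities and (N,) labels."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    n, k = pred_probs.shape
    # Per-class threshold: mean probability of class c among samples labelled c.
    thresholds = np.array([pred_probs[labels == c, c].mean() for c in range(k)])
    # Keep only the probabilities that clear their own class threshold.
    above = pred_probs >= thresholds
    masked = np.where(above, pred_probs, -np.inf)
    confident_class = masked.argmax(axis=1)
    # Flag samples whose confident class exists and differs from the given label.
    flagged = above.any(axis=1) & (confident_class != labels)
    # Self-confidence: probability the model assigns to the given label.
    self_confidence = pred_probs[np.arange(n), labels]
    return flagged, self_confidence


# Toy example: the last sample is labelled 1 but predicted as class 0.
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.95, 0.05]]
given = [0, 0, 1, 1, 1]
flagged, conf = confident_learning_flags(probs, given)
worst_first = np.argsort(conf)  # lowest self-confidence first
print(flagged, worst_first)
```

Using out-of-fold probabilities matters here: a model evaluated on its own training samples tends to memorise noisy labels, which would inflate self-confidence and hide the errors.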
Mislabel detection
from cvquality.quality import MislabelDetector
md = MislabelDetector(embeddings, labels, n_neighbors=15)
candidates = md.rank_candidates(top_k=100)
# [{'index': 2341, 'given_label': 3, 'suggested_label': 7, 'quality_score': 0.12}, ...]
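The idea behind kNN-based mislabel detection can be sketched in a few lines of NumPy: for every sample, look at the labels of its nearest neighbours in embedding space and measure how many agree with the given label. The function below is a hypothetical illustration of that idea, not `MislabelDetector`'s implementation. It assumes integer labels in `0..K-1` and uses brute-force Euclidean distances, which only suits small datasets.

```python
import numpy as np


def knn_label_agreement(embeddings, labels, n_neighbors=15):
    """Return (quality, suggested): neighbour agreement and majority neighbour label."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    n = X.shape[0]
    k = min(n_neighbors, n - 1)
    # Pairwise squared Euclidean distances (O(N^2) memory).
    sq = (X ** 2).sum(axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    nn_idx = np.argsort(dists, axis=1)[:, :k]
    nn_labels = y[nn_idx]  # shape (N, k)
    num_classes = int(y.max()) + 1
    # votes[i, c] = number of neighbours of sample i that carry label c.
    votes = np.stack([(nn_labels == c).sum(axis=1) for c in range(num_classes)], axis=1)
    quality = votes[np.arange(n), y] / k
    suggested = votes.argmax(axis=1)
    return quality, suggested


# Two well-separated clusters; sample 3 sits in the first cluster but is labelled 1.
emb = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
lab = [0, 0, 0, 1, 1, 1, 1, 1]
quality, suggested = knn_label_agreement(emb, lab, n_neighbors=3)
print(np.argsort(quality)[:3], suggested)
```

Samples with low agreement are candidates for review rather than confirmed errors, since points near a true class boundary also score low.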
Active learning
from cvquality.active_learning import ActiveLearningLoop, UncertaintyStrategy
from cvquality.active_learning.backends import PyTorchBackend
from cvquality.active_learning.loop import LoopConfig
import torchvision.models as M
model = M.resnet18(weights=M.ResNet18_Weights.DEFAULT)
backend = PyTorchBackend(model, device="cuda")
strategy = UncertaintyStrategy("entropy")
# images: any image list; labels: numpy array of labels (one per image)
loop = ActiveLearningLoop(
    backend, strategy, images, labels,
    config=LoopConfig(budget_per_round=200, max_rounds=5),
)
history = loop.run()
print(loop.summary())
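The uncertainty strategies rank unlabelled samples by how unsure the model is about them. As a hedged illustration, the sketch below computes three of the scores named in the Features table (entropy, margin, least confidence) from an `(N, K)` probability array and picks the top `budget` samples. The function names are hypothetical and the scoring conventions are assumptions; `UncertaintyStrategy` may define them differently. BALD is omitted because it needs several stochastic forward passes per sample.

```python
import numpy as np


def uncertainty_scores(pred_probs, method="entropy"):
    """Higher score means more uncertain. pred_probs has shape (N, K)."""
    p = np.asarray(pred_probs, dtype=float)
    if method == "entropy":
        # Clip to avoid log(0); zero-probability terms then contribute 0.
        return -(p * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=1)
    if method == "margin":
        top2 = np.sort(p, axis=1)[:, -2:]  # two largest probabilities per row
        return 1.0 - (top2[:, 1] - top2[:, 0])
    if method == "least_confidence":
        return 1.0 - p.max(axis=1)
    raise ValueError(f"unknown method: {method}")


def select_most_uncertain(pred_probs, budget, method="entropy"):
    """Indices of the `budget` most uncertain samples, most uncertain first."""
    scores = uncertainty_scores(pred_probs, method)
    return np.argsort(-scores)[:budget]


# Toy probabilities: row 1 is the most ambiguous, row 0 the most confident.
probs = [[0.98, 0.01, 0.01], [0.40, 0.35, 0.25], [0.70, 0.20, 0.10]]
for m in ("entropy", "margin", "least_confidence"):
    print(m, select_most_uncertain(probs, budget=2, method=m))
```

The three scores agree on this toy input but can diverge on real data: margin only looks at the top two classes, while entropy is sensitive to the whole distribution.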
COCO full-pipeline recipe
from cvquality.recipes import COCORecipe
recipe = COCORecipe(
    "annotations/instances_train2017.json",
    image_dir="/data/coco/train2017",
    report_dir="./reports",
    dataset_name="COCO-2017-train",
)
result = recipe.run()
# Writes reports/instances_train2017_report.json + .html
CLI
# Print dataset statistics
cvquality stats annotations/instances_val2017.json
# Run annotation checks
cvquality check annotations/instances_val2017.json --min-bbox-area 4 --max-iou 0.85
# Generate full HTML + JSON report
cvquality report annotations/instances_val2017.json --output-dir ./reports --name "COCO-val"
# Analyse an ImageNet-style folder
cvquality imagenet /data/imagenet/val --output-dir ./reports
Supported Dataset Formats
Natively supported (no glue code needed)
| Format | Entry point |
|---|---|
| COCO JSON (`instances_*.json`) | `COCODataset` + `COCORecipe` |
| ImageNet flat-folder (`root/class_name/*.jpg`) | `ImageNetRecipe` |
Works with any dataset — via numpy arrays
The stats, quality, and active-learning modules are format-agnostic. They only need:
| Module | What it needs |
|---|---|
| `LabelQualityScorer` | `(N, K)` pred_probs + `(N,)` labels |
| `MislabelDetector` | `(N, D)` embeddings + `(N,)` labels |
| All 3 AL strategies | numpy arrays (probs / embeddings / gradients) |
| `ActiveLearningLoop` | any image list + numpy labels |
Pascal VOC, Open Images, Roboflow exports, custom CSVs, and similar formats all work once you load the data into numpy arrays or convert it to a `COCODataset`.
What needs a converter
- Pascal VOC XML / YOLO `.txt`: no built-in reader; convert to COCO JSON or use the quality/AL modules directly with numpy arrays.
- Segmentation masks (`stuff_*.json`, panoptic): `COCODataset` loads them (still COCO JSON), but `AnnotationChecker` currently only inspects bboxes, not polygon/RLE masks.
- HuggingFace Datasets / TFRecords / LMDBs: load to numpy/PIL, then pass to the AL backends.
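As an example of the YOLO conversion mentioned in the list above, this sketch turns the contents of one YOLO `.txt` label file into COCO-style annotation dictionaries. YOLO stores `class cx cy w h` normalised to the image size, while COCO stores `[x_min, y_min, width, height]` in pixels. The helper names are hypothetical, and the sketch assumes YOLO class indices are reused directly as COCO `category_id` values, which you may need to remap for your dataset.

```python
def yolo_to_coco_bbox(line, img_width, img_height):
    """Convert one YOLO line 'class cx cy w h' to (class_id, [x, y, w, h]) in pixels."""
    parts = line.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    box_w = w * img_width
    box_h = h * img_height
    # YOLO gives the box centre; COCO expects the top-left corner.
    x_min = cx * img_width - box_w / 2.0
    y_min = cy * img_height - box_h / 2.0
    return class_id, [x_min, y_min, box_w, box_h]


def yolo_text_to_annotations(text, image_id, img_width, img_height, start_id=1):
    """Build COCO-style annotation dicts from the text of one YOLO label file."""
    annotations = []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for offset, line in enumerate(lines):
        class_id, bbox = yolo_to_coco_bbox(line, img_width, img_height)
        annotations.append({
            "id": start_id + offset,
            "image_id": image_id,
            "category_id": class_id,  # assumption: class index reused as category id
            "bbox": bbox,
            "area": bbox[2] * bbox[3],
            "iscrowd": 0,
        })
    return annotations


# Hypothetical label file content for a 640x480 image.
sample = "0 0.5 0.5 0.25 0.5\n2 0.1 0.1 0.05 0.05\n"
for ann in yolo_text_to_annotations(sample, image_id=1, img_width=640, img_height=480):
    print(ann)
```

A complete COCO JSON file also needs top-level `images` and `categories` lists alongside `annotations`; those depend on your dataset and are left out here.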
Any format → quality + active learning
# Your own loader — Pascal VOC, YOLO, CSV, anything
embeddings = my_loader.get_embeddings() # (N, D)
labels = my_loader.get_labels() # (N,)
pred_probs = my_model.predict(images) # (N, K)
from cvquality.quality import LabelQualityScorer, MislabelDetector
from cvquality.active_learning.strategies import UncertaintyStrategy
lq = LabelQualityScorer(pred_probs, labels)
md = MislabelDetector(embeddings, labels)
strategy = UncertaintyStrategy("entropy")
indices = strategy.query(pred_probs, budget=100)
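The same `(N, D)` embeddings can also drive a diversity-based selection. The sketch below shows greedy k-center selection, the heuristic commonly used for CoreSet-style active learning: repeatedly pick the point farthest from everything already selected. It is an illustration under assumptions, not the library's CoreSet strategy; the function name and signature are hypothetical.

```python
import numpy as np


def k_center_greedy(embeddings, labelled_idx, budget):
    """Pick up to `budget` indices that are far from the labelled set and each other."""
    X = np.asarray(embeddings, dtype=float)
    n = X.shape[0]
    # Distance from every point to its nearest already-covered centre.
    min_dist = np.full(n, np.inf)
    for i in labelled_idx:
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[i], axis=1))
    selected = []
    for _ in range(min(budget, n - len(labelled_idx))):
        # With no labelled points all distances are inf and index 0 seeds the set.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected


# Toy embeddings: two tight pairs and one isolated point.
emb = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0], [5.0, 5.0]]
print(k_center_greedy(emb, labelled_idx=[0], budget=2))
```

Unlike the uncertainty scores, this selection ignores model predictions entirely, so it tends to avoid querying many near-identical samples in one round.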
Project Structure
cvquality/
├── stats/ Dataset statistics & imbalance metrics
├── quality/ Label quality, mislabel detection, annotation checks
├── active_learning/
│ ├── strategies/ uncertainty / diversity / error-localization
│ ├── backends/ PyTorch, TensorFlow (pluggable)
│ └── loop.py Loop orchestrator
├── recipes/ COCO & ImageNet pipelines
├── io/ COCO reader + report generator
└── cli/ Click-based CLI
tests/ pytest suite (87 tests)
Publishing to PyPI
pip install build twine
python -m build
twine check dist/*
twine upload dist/*
Use __token__ as the username and a PyPI API token as the password.
See deploy.md for the full step-by-step guide.
Authors
cv-quality is authored and maintained by Sai Teja Erukude.
- PyPI: https://pypi.org/project/cv-quality/
- Homepage / repository: https://github.com/SaiTeja-Erukude/cv-quality
- Issues: https://github.com/SaiTeja-Erukude/cv-quality/issues
License
MIT