Perception model evaluation harness — sliced metrics, failure triage, and CI regression gating for detection models

These details have not been verified by PyPI

Project links

Project description

PerceptorGuard

A perception evaluation harness for YOLO detection models — slice-based metrics, CI regression gating, failure triage, and reproducible synthetic fixtures. Built to demonstrate that evaluation is an engineering discipline, not a one-liner.

Quick start

# 1. Install
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]" && pip install scikit-learn pyyaml jinja2

# 2. Generate a 20-scene dataset and run eval
python scripts/generate.py  --count 20 --seed 42 --out artifacts/dataset
python scripts/run_eval.py  --dataset artifacts/dataset --out artifacts/eval

# 3. Triage failures and generate report
python scripts/triage.py    --matches artifacts/eval/matches.csv
python scripts/generate_report.py \
    --dataset  artifacts/dataset \
    --eval     artifacts/eval \
    --baseline artifacts/baseline \
    --triage   artifacts/triage
# → open artifacts/report/report.html

Planted-regression demo (gate goes red, then green):

python scripts/save_baseline.py          # promote current eval to baseline
python scripts/demo_regression.py        # PASS → FAIL → PASS

System overview

┌───────────────────────────────────────────────────────────────────┐
│  Scenario generator (PyBullet DIRECT)                             │
│  6 challenge profiles × 2-tier object catalog (easy + hard)      │
│  → manifest.csv + per-scene frame.png + ground_truth.json        │
└──────────────────────────┬────────────────────────────────────────┘
                           │
┌──────────────────────────▼────────────────────────────────────────┐
│  Eval runner                                                      │
│  YOLO inference at conf=0.01 (full distribution for AP)          │
│  Greedy class-aware bipartite matching  (IoU ≥ 0.5)             │
│  → matches.csv  (tidy: one row per TP / FP / FN)                │
└──────────────────────────┬────────────────────────────────────────┘
                           │
          ┌────────────────┴──────────────────┐
          │                                   │
┌─────────▼──────────┐             ┌──────────▼──────────────────┐
│  Metrics engine    │             │  Failure triage              │
│  mAP (11-pt VOC)  │             │  missed / localization /     │
│  P/R/F1 @ 0.25    │             │  wrong_class / false_pos     │
│  slice tables:     │             │  KMeans cluster analysis     │
│  profile / dist /  │             │  → failures_classified.csv  │
│  lighting / clutter│             │  → cluster_summary.csv      │
│  tier / class      │             └──────────────────────────────┘
└─────────┬──────────┘
          │
┌─────────▼──────────────────────────────────────────────────────┐
│  Regression gate                                               │
│  Compare current metrics vs baseline (artifacts/baseline/)     │
│  46 checks: mAP, AP/recall per class, recall/FP per profile   │
│  Exit 1 on any regression beyond configured slack             │
└─────────┬──────────────────────────────────────────────────────┘
          │
┌─────────▼──────────────────────────────────────────────────────┐
│  Report renderer (Jinja2)                                      │
│  Self-contained HTML + Markdown                                │
│  Slice tables, failure gallery, gate diff table                │
│  Optional: W&B / MLflow experiment tracking                    │
└────────────────────────────────────────────────────────────────┘

CI split — deliberate engineering decision:

PR gate (.github/workflows/ci_gate.yml): 20 scenes, seed=42, ~2 min. Catches regressions before they land in main.
Nightly (.github/workflows/nightly.yml): 100 scenes, auto-promotes baseline on success. Authoritative quality bar.

Key design decisions

1. Evaluation as a first-class engineering subsystem

Most perception teams treat evaluation as an afterthought: run a metric script, log the number, move on. The script is throw-away code; the number is a spreadsheet cell; there is no regression gate.

PerceptorGuard treats the eval pipeline with the same engineering discipline as the model itself:

Property	Ad-hoc eval script	PerceptorGuard
Dataset	"whatever images we had"	Versioned, reproducible, committed
Metrics	Overall mAP only	Per-slice: profile, distance, lighting, clutter, tier, class
Failures	"recall is low"	Named failure modes with KMeans-clustered conditions
Regressions	Discovered in staging	Caught at PR time, 46 checks, named slice + delta
Reproducibility	"I think we used these settings"	Seed-fixed fixtures, LFS-tracked weights

The eval harness is the durable investment. Models come and go; the harness lets you compare them honestly.

2. Model-agnostic interface

The harness has exactly one coupling point to YOLO: EvalRunner._predict() (12 lines in runner/eval_runner.py). The rest of the pipeline — matching, metrics, gating, triage, reporting — operates on Detection and GroundTruth Pydantic schemas.

Swapping YOLOv8n for YOLOv8x, a RT-DETR, or a custom model requires changing exactly that one function. You can A/B test models through the same harness and compare them on the same reproducible dataset without any eval-code changes.

This is the same principle I apply to RAG and agentic systems: the evaluation framework must be agnostic to the implementation choice, or you end up with evaluation that only works for the thing you already have.

3. CI gate blast-radius argument

Catching a regression has very different costs depending on where it surfaces:

Where caught	Cost
At PR (CI gate)	2 min CI compute
In staging after merge	Deploy + rollback + engineer-hours
In production	Incident response + user-trust loss

The gate runs 46 checks per PR. Each check is a named (slice, metric) pair with an explicit floor: floor = baseline_value − slack. If any check fires, the gate exits 1 and names the regressed slice and delta. Engineers know exactly what broke and by how much — not "mAP went down a bit."

The 20-scene CI fast path is a deliberate tradeoff: 20 scenes is noisy enough that you'll miss subtle regressions, but it catches real structural breaks (IoU threshold bug, conf threshold change, class mapping error) in 2 minutes. The nightly full 100-scene run is the authoritative measurement. This split is explained in both workflow files.

4. Two-tier object catalog

Easy  (COCO-recognizable): cup, bottle, bowl, teddy bear, sports ball
Hard  (off-vocabulary):    cube, duck, lego, domino

The split gives you two signals simultaneously:

Easy tier measures how much signal you can extract from a COCO-pretrained backbone. Any recall at all indicates some domain-transfer.
Hard tier measures true zero-shot generalisation. Near-zero AP here is expected and honest — it's the documented result, not a bug.

Running only easy classes would give you false confidence. Running only hard classes would give you nothing to gate on. The mix is deliberate.

5. Sub-threshold IoU enrichment for FN rows

The matcher records best_iou_any_class and best_pred_class_at_overlap for every FN row. This enables the failure classifier to distinguish:

missed_detection — no prediction overlapped the GT (IoU < 0.1 from any box)
localization_error — right class, right location, IoU just below threshold
wrong_class — something overlaps (IoU ≥ 0.1) but with the wrong class label

These are different engineering problems requiring different interventions. Knowing you have 123 wrong_class failures in near-distance crowded scenes (the actual finding) is actionable. Knowing "FN = 355" is not.

Findings from the pilot run (YOLOv8n, 100 scenes)

Overall: mAP=0.1%  Precision=0%  Recall=0%
TP=0  FP=27  FN=355  (at op-point conf≥0.25)

The headline finding is a complete domain gap: YOLOv8n, pretrained on COCO, achieves 0% recall on PyBullet synthetic renders at any operating threshold. This is expected and honest.

The only live signal: sports ball, AP=0.9% — the soccerball URDF has a realistic texture that partially overlaps the COCO training distribution. Two sub-threshold predictions match GT at IoU≥0.5 during the full-distribution AP sweep.

Failure mode breakdown (771 failures):

false_positive (54%) — 416 hallucinations. The model confidently detects objects that aren't there (chairs, people, cars) because PyBullet renders look like partial-context frames from real images.
missed_detection (30%) — 232 GTs with zero model signal. Worst in crowded + far profiles; domino and lego lead by class.
wrong_class (16%) — 123 GTs where a prediction overlaps but carries the wrong label. Cube accounts for 34% of wrong-class failures — the model recognises a rectangular object but can't resolve the class.

Cluster insight: KMeans (k=5) on [camera_distance, ambient_light, num_objects, failure_mode, tier, profile] identifies two pure FP clusters (100% false positives at mid-distance and near-distance), a mixed missed+wrong-class cluster under crowded conditions, and a distinct dark-scene cluster with different FP characteristics.

What this means for next steps:

The gap is at the distribution level, not the architecture level. Domain randomization of textures + sim-to-real transfer is the right intervention, not model scaling.
The FP flood is a calibration problem. A classification head fine-tuned on synthetic negatives would reduce it dramatically.
The wrong-class failures on cube/domino suggest that geometric shape features are present but class-label resolution requires task-specific training.

Planted-regression demo

The gate verifiably catches regression and names the failing slice:

$ python scripts/demo_regression.py

DEMO STEP 1 — Gate on current (good) metrics
→ PASSED — all 46 checks within threshold

DEMO STEP 2 — Plant regression: zero sports-ball AP
(simulates iou_threshold cranked to 0.9)
→ FAILED — 1 regression(s) detected
  ✗ class:sports ball / ap
      baseline=0.0091  current=0.0000  floor=0.0041  delta=-0.0091

DEMO STEP 3 — Restore
→ PASSED — all 46 checks within threshold

Result: PASS → FAIL → PASS  ✓  (gate behaves correctly)

To trigger this in real CI, open a PR that changes --iou 0.9 in the eval runner. The CI workflow runs, the sports-ball AP drops from 0.9% to 0%, the gate exits 1, and the PR check fails — naming the regressed slice and delta. Revert → green.

Cross-domain principle: eval as infrastructure

The architectural insight that generalises across ML domains:

Evaluation should be a first-class subsystem, not an afterthought. It must be model-agnostic, slice-aware, and wired into CI.

I've applied the same principle in three domains:

Domain	What gets evaluated	The harness checks
Perception (this project)	YOLO detector	Per-slice mAP, FP rate, localization quality
RAG systems	Retriever + LLM	Retrieval recall, answer faithfulness, citation precision
Agentic systems	Tool-use agent	Task completion rate, tool selection accuracy, latency

In each case:

The harness is decoupled from the implementation (model-agnostic interface)
Metrics are sliced by the conditions that matter (difficulty, distance, topic domain, query type)
A baseline is stored and regressions are caught before they reach production
Failures are named, not just counted

The model changes; the eval discipline doesn't. This is the engineering investment that compounds.

What I'd do differently

1. Domain randomization before domain gap is "interesting" PyBullet renders are too synthetic. Before drawing any production conclusions, I'd add texture randomization (PBR materials, HDRI backgrounds), noise augmentation, and random object scales. The 0% recall result is expected — but a more realistic synthetic distribution would push that to a meaningful non-zero baseline worth gating on.

2. Real-image holdout The synthetic-to-real gap is documented but not measured. A small (50-100 image) real-world validation set, with the same object categories, would quantify the gap. This is the honest thing to do before claiming the eval harness is production-relevant.

3. Active learning loop The triage output (failure mode distribution, KMeans clusters) should feed back into the scenario generator: generate more scenes matching the hardest cluster conditions. Right now triage is a report; it should be a signal that drives the next data generation run.

4. Temporal and latency eval match_scene operates on single frames. Real robot perception needs tracking across frames, trajectory prediction, and FPS budgeting. None of that is here.

5. Confidence calibration The op-point threshold is set at conf≥0.25 by convention. A calibration sweep (precision-recall curve analysis by condition) would let you set per-slice thresholds that reflect the actual operating point you need — not a fixed global number.

Repository structure

perceptorguard/
├── scenarios/          Pydantic schemas, parameterized scenario generator
│   ├── schemas.py      BoundingBox, Detection, GroundTruth, Scenario, ObjectSpec
│   └── generator.py    6 profiles × 2-tier catalog, occluder placement
├── runner/             Inference + matching + GT extraction
│   ├── scene_runner.py PyBullet DIRECT renderer, AABB→screen projection
│   ├── eval_runner.py  YOLO inference loop, bin assignment
│   └── matcher.py      Greedy bipartite match; FN rows carry sub-threshold IoU
├── metrics/            Metrics, triage, reporting
│   ├── engine.py       11-pt VOC AP, operating-point P/R/F1, sliced tables
│   ├── failure_classifier.py  missed / localization / wrong_class / fp
│   ├── cluster_analyzer.py    KMeans on scenario feature vector
│   ├── triage_reporter.py     Ranked failure-mode summary
│   └── reporter.py     ASCII console report
├── gates/              Regression gate
│   ├── thresholds.py   GateThresholds dataclass, YAML-backed
│   ├── comparator.py   46-check comparison: mAP, AP, recall, FP per slice
│   └── gate_runner.py  Print report, return bool, exit 1 on failure
├── reports/            Report rendering
│   ├── annotator.py    Annotate failure scenes (GT boxes, missed/detected)
│   ├── renderer.py     Jinja2 → HTML + Markdown
│   ├── tracker.py      W&B + MLflow optional integration
│   └── templates/      report.html.j2, report.md.j2
├── scripts/
│   ├── generate.py     Dataset generation CLI
│   ├── run_eval.py     Eval CLI (--model, --iou, --imgsz)
│   ├── triage.py       Failure triage CLI
│   ├── run_gate.py     Gate CLI (exit 0/1)
│   ├── save_baseline.py  Promote eval → baseline
│   ├── generate_report.py  Report generation CLI
│   └── demo_regression.py  Planted-regression demo
├── tests/              74 unit tests (all pass)
│   ├── test_matcher.py, test_metrics.py
│   ├── test_triage.py, test_gates.py
│   └── verify_chunk2.py  (55 GT-pipeline invariant tests)
├── assets/             Custom URDFs (bottle.urdf, bowl.urdf)
├── configs/
│   └── gate_thresholds.yml  Tunable slack per metric
├── artifacts/
│   ├── baseline/       100-scene golden reference (committed)
│   ├── ci_baseline/    20-scene CI reference (committed, seed=42)
│   ├── ci_dataset/     Reproducible 20-scene fixture (committed, LFS)
│   ├── eval/           Full 100-scene eval output
│   └── triage/         Failure classification + cluster summary
└── .github/
    ├── workflows/ci_gate.yml     PR: 20 scenes, ~2 min
    └── workflows/nightly.yml     Scheduled: 100 scenes, auto-promote baseline

Running the full test suite

pytest tests/ -v
# → 74 passed

Tests are hermetic — no model inference, no disk artifacts required. The 55 verify_chunk2.py tests validate the GT pipeline (AABB projection, occlusion geometry, multi-object placement) against hardcoded fixtures. The 12 test_matcher.py tests cover IoU edge cases and greedy matching invariants. The 15 test_gates.py tests cover threshold loading, regression detection (mAP drop, FP spike, recall drop), and gate-runner pass/fail return values.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.2

Jun 22, 2026

0.1.1

Jun 21, 2026

This version

0.1.0

Jun 21, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

perceptorguard-0.1.0.tar.gz (72.7 kB view details)

Uploaded Jun 21, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

perceptorguard-0.1.0-py3-none-any.whl (83.4 kB view details)

Uploaded Jun 21, 2026 Python 3

File details

Details for the file perceptorguard-0.1.0.tar.gz.

File metadata

Download URL: perceptorguard-0.1.0.tar.gz
Upload date: Jun 21, 2026
Size: 72.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for perceptorguard-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`464b62ef8fb07493a31ec8418f779de494948c14ab42b79d1934683b9abe7196`
MD5	`965550c3e96f8fd284c05e17b2cf22f0`
BLAKE2b-256	`5989d6c631a57af402c8ab07d063519de9f4be3f1b0fb73e63ca83c8420f5a38`

See more details on using hashes here.

File details

Details for the file perceptorguard-0.1.0-py3-none-any.whl.

File metadata

Download URL: perceptorguard-0.1.0-py3-none-any.whl
Upload date: Jun 21, 2026
Size: 83.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for perceptorguard-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3112eb8398acb8cb54ea3f4d294c35b279ccb8132bcbe518c3b8f9c622460836`
MD5	`ae1e682619563fa55afc6ede7ec967a1`
BLAKE2b-256	`8f2c27b55749dfce79dc0cc3a0ad4dd2264025124a50c0ac897e1b37009a9284`

See more details on using hashes here.

perceptorguard 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

PerceptorGuard

Quick start

System overview

Key design decisions

1. Evaluation as a first-class engineering subsystem

2. Model-agnostic interface

3. CI gate blast-radius argument

4. Two-tier object catalog

5. Sub-threshold IoU enrichment for FN rows

Findings from the pilot run (YOLOv8n, 100 scenes)

Planted-regression demo

Cross-domain principle: eval as infrastructure

What I'd do differently

Repository structure

Running the full test suite

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes