Various scene understanding and perception evaluation metrics.
Scalable Distributed Evaluation for Computer Vision
Evaluators is a high-throughput evaluation framework designed for large-scale computer vision research. It specializes in handling video tasks by decoupling inference I/O from metric computation.
This architecture enables offline evaluation workflows: models stream predictions to efficient storage backends (Memory Map or LMDB) during inference, and metrics are computed in a decoupled stage using distributed map-reduce logic. This approach prevents CPU-bound metric calculation from throttling GPU inference.
The key features are:
- Zero-overhead inference. Writes predictions to disk using non-blocking I/O, allowing the training loop to run at full GPU utilization.
- Distributed by design. Automatically handles synchronization across multiple nodes and GPUs using torchmetrics and custom scheduling logic.
- Explicit memory schemas. Uses Pydantic-based schemas to define data formats and encodings (PNG, TIFF, Raw) up front, ensuring type safety and storage efficiency.
- Lazy loading. Supports referencing ground-truth data from disk rather than duplicating it in memory caches, enabling evaluation of terabyte-scale datasets.
- Multi-domain. Includes verified implementations for:
  - Segmentation: panoptic quality (PQ), semantic mIoU.
  - DVPS: depth-aware video panoptic quality (DVPQ).
  - Depth: Eigen et al. metrics (AbsRel, RMSE).
- CLI. A command-line interface to inspect, index, and query saved inference results.
Installation
pip install evaluators
Quick start
Python API
The core abstractions are MetricStream (for writing) and run_offline_evaluation (for computing).
Step 1: Inference (online)
import torch
from evaluators import MetricStream, MemorySchema, TensorField, DynamicTemporalWriter
from evaluators.metrics.domain.segmentation import SemanticMetric
# 1. Configure metrics and schema
# Define the source for ground truth (lazy loading)
dataset = CityscapesDataset(...)
metric = SemanticMetric(
num_classes=19,
target_source=dataset,
)
# Define the explicit memory schema
schema = MemorySchema(fields={
"sem_seg": TensorField(dtype="int64", shape=(1024, 2048)),
"sequence_id": TensorField(dtype="int64", shape=()),
"frame_index": TensorField(dtype="int64", shape=()),
})
# 2. Initialize stream and writer
# Create a writer (backend)
writer = DynamicTemporalWriter(output_dir="./inference_cache/semantic", schema=schema)
# Create a stream and bind the writer
stream = MetricStream(
metrics=[metric],
name="semantic",
schema=schema
)
stream.bind(writer)
# 3. Run inference loop
for batch in dataloader:
    # Model forward pass
    preds = model(batch["image"])
    # Push to stream (non-blocking)
    stream.update(
        batch={
            "sem_seg": preds,
            "sequence_id": batch["sequence_id"],
            "frame_index": batch["frame_index"],
        }
    )

# Finalize
writer.close()
> Note: `sequence_id` and `frame_index` must be `torch.int64`.
# 4. Finalize and compute
from evaluators import run_offline_evaluation
# Syncs workers, builds catalog, and runs metrics
results = run_offline_evaluation(
metrics=[metric],
artifact_dir="./inference_cache/semantic"
)
print(results["SemanticMetric"]["mIoU"])
Step 2: Re-evaluation (offline)
Because predictions are persisted, metrics can be re-calculated or added without re-running the model.
# Run evaluation on existing artifacts
results = run_offline_evaluation(
metrics=[new_metric],
artifact_dir="./inference_cache/semantic"
)
CLI tools
The library includes a CLI for managing the inference cache.
List stored sequences.
evaluators memory ls ./inference_cache
Inspect specific tensor shapes.
evaluators memory inspect ./inference_cache --sequence_id frankfurt_000001
Export to standard PyTorch file.
evaluators memory export ./inference_cache --sequence_id frankfurt_000001 --out my_video.pt
Supported metrics
Depth estimation
Implements standard error metrics (AbsRel, SqRel, RMSE, RMSElog) and threshold accuracies ($\delta < 1.25^n$).
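The threshold accuracies measure the fraction of pixels whose prediction/ground-truth ratio stays below $1.25^n$. A minimal NumPy sketch of these standard formulas (illustrative only; `depth_errors` is a hypothetical helper, not the library's API):

```python
import numpy as np

def depth_errors(pred, gt):
    """Standard Eigen et al. depth metrics over valid (gt > 0) pixels."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    diff = pred - gt
    # Symmetric ratio used by the delta thresholds
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "AbsRel": float(np.mean(np.abs(diff) / gt)),
        "SqRel": float(np.mean(diff**2 / gt)),
        "RMSE": float(np.sqrt(np.mean(diff**2))),
        "RMSElog": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        **{f"delta{n}": float(np.mean(ratio < 1.25**n)) for n in (1, 2, 3)},
    }
```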
Segmentation
- Semantic: Mean intersection over union (mIoU).
- Panoptic: Panoptic quality (PQ), segmentation quality (SQ), recognition quality (RQ). Supports "Thing" and "Stuff" splits.
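PQ factors into segmentation quality (mean IoU over matched pairs) and recognition quality (an F1-style detection score). A minimal sketch of the per-class formula, assuming matching has already been done (the conventional IoU > 0.5 criterion makes matches unique); `panoptic_quality` here is illustrative, not the library's API:

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = SQ * RQ for one class.

    matched_ious: IoU of each true-positive (prediction, ground-truth) pair.
    """
    tp = len(matched_ious)
    if tp + num_fp + num_fn == 0:
        return 0.0  # class absent everywhere; typically excluded from the mean
    sq = sum(matched_ious) / tp if tp else 0.0          # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)        # recognition quality
    return sq * rq
```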
Depth-aware video panoptic segmentation (DVPS)
Implements DVPQ (Depth-aware video panoptic quality). This metric evaluates spatio-temporal consistency using sliding window tubes, gated by pixel-wise depth accuracy.
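The depth gating can be pictured as voiding panoptic predictions wherever the absolute relative depth error exceeds a threshold, before PQ is computed over each temporal window. A hypothetical sketch of that gating step (the function name, void label, and threshold are assumptions for illustration, not the library's API):

```python
import numpy as np

VOID = 255  # assumed void label for illustration

def depth_gate(panoptic_pred, depth_pred, depth_gt, lam=0.25):
    """Void panoptic pixels whose absolute relative depth error exceeds lam."""
    rel_err = np.abs(depth_pred - depth_gt) / np.maximum(depth_gt, 1e-6)
    gated = panoptic_pred.copy()
    gated[rel_err > lam] = VOID  # these pixels now count against the prediction
    return gated
```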
Architecture
The evaluation pipeline consists of three stages.
- Write. Each GPU writes predictions to locally sharded files (e.g. .memmap or .lmdb). No communication occurs.
- Schedule. A synchronization barrier is reached; the main process aggregates metadata manifests from all shards to build a Global Catalog, then partitions the workload (videos) among workers using a greedy strategy to balance total duration.
- Compute. Workers iterate through their assigned Virtual Sequences. Data is streamed from disk, processed by torchmetrics, and reduced globally.
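The greedy duration balancing in the Schedule stage can be sketched as a longest-processing-time heuristic: assign each video, longest first, to the currently least-loaded worker. A hypothetical illustration, not the library's scheduler:

```python
import heapq

def partition_videos(durations, num_workers):
    """Greedily assign videos (name -> duration) to balance worker load."""
    # Min-heap of (current load, worker id, assigned videos)
    heap = [(0.0, w, []) for w in range(num_workers)]
    heapq.heapify(heap)
    # Longest videos first: placing them early keeps the final loads even
    for vid, dur in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, w, vids = heapq.heappop(heap)  # least-loaded worker
        vids.append(vid)
        heapq.heappush(heap, (load + dur, w, vids))
    return {w: vids for _, w, vids in heap}
```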
See OFFLINE_EVALUATION.md for detailed usage and design principles.
Development
This project uses modern Python tooling for dependency management and quality assurance.
Setup
We use uv for fast dependency management.
# Install dependencies
uv sync --all-extras
Testing
Tests are managed by pytest.
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=evaluators
Linting & Formatting
We use ruff for all linting and formatting needs.
# Check code style
uv run ruff check .
# Format code
uv run ruff format .
Contributing
Contributions are welcome. Please ensure that:
- New features are covered by tests.
- Code passes all static analysis checks (ruff).
- Architecture changes are discussed in an issue first.
Acknowledgements
This work was developed at the Mobile Perception Systems (MPS) lab at Eindhoven University of Technology.
License
This project is licensed under the MIT License.