Various scene understanding and perception evaluation metrics.

Scalable Distributed Evaluation for Computer Vision

Evaluators is a high-throughput evaluation framework designed for large-scale computer vision research. It specializes in handling video tasks by decoupling inference I/O from metric computation.

This architecture enables offline evaluation workflows: models stream predictions to efficient storage backends (Memory Map or LMDB) during inference, and metrics are computed in a decoupled stage using distributed map-reduce logic. This approach prevents CPU-bound metric calculation from throttling GPU inference.

The key features are:

  • Zero-overhead inference.
    Writes predictions to disk using non-blocking I/O, so the inference loop can keep the GPU busy instead of waiting on metric computation.

  • Distributed by design.
    Automatically handles synchronization across multiple nodes and GPUs using torchmetrics and custom scheduling logic.

  • Explicit Memory Schemas. Uses Pydantic-based schemas to define data formats and encodings (PNG, TIFF, Raw) up front, ensuring type safety and storage efficiency.

  • Lazy loading.
    Supports referencing ground truth data from disk rather than duplicating it in memory caches, enabling evaluation of terabyte-scale datasets.

  • Multi-domain. Includes verified implementations for:

    • Segmentation: Panoptic quality (PQ), semantic mIoU.
    • DVPS: Depth-aware video panoptic quality (DVPQ).
    • Depth: Eigen et al. metrics (AbsRel, RMSE).
  • CLI. A command-line interface to inspect, index, and query saved inference results.
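
The non-blocking write pattern behind the first feature can be sketched with a queue and a background thread. This is a standard-library illustration only, not the library's implementation; `BackgroundWriter`, `sink`, and `max_pending` are hypothetical names chosen for the example.

```python
import queue
import threading


class BackgroundWriter:
    """Hypothetical sketch: hand records to a worker thread so the caller does not wait on I/O."""

    def __init__(self, sink, max_pending=64):
        self._sink = sink  # any callable that persists one record
        self._queue = queue.Queue(maxsize=max_pending)
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def _drain(self):
        while True:
            record = self._queue.get()
            if record is None:  # sentinel: stop draining
                break
            self._sink(record)

    def write(self, record):
        # Returns as soon as the record is queued; blocks only if the queue is full.
        self._queue.put(record)

    def close(self):
        self._queue.put(None)
        self._thread.join()  # wait until every queued record has been persisted


stored = []
writer = BackgroundWriter(stored.append)
for frame_index in range(5):
    writer.write({"frame_index": frame_index})
writer.close()
```

The bounded queue applies back-pressure: if the disk falls behind by more than `max_pending` records, `write` blocks rather than letting memory grow without limit.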


Installation

pip install evaluators

Quick start

Python API

The core abstractions are MetricStream (for writing) and run_offline_evaluation (for computing).

Step 1: Inference (online)

import torch
from evaluators import MetricStream, MemorySchema, TensorField, DynamicTemporalWriter
from evaluators.metrics.domain.segmentation import SemanticMetric

1. Configure metrics and schema

# Define the source for ground truth (lazy loading).
# CityscapesDataset is not imported in this snippet; substitute your own dataset object.

dataset = CityscapesDataset(...)
metric = SemanticMetric(
    num_classes=19,
    target_source=dataset, 
)

# Define the explicit memory schema

schema = MemorySchema(fields={
    "sem_seg": TensorField(dtype="int64", shape=(1024, 2048)),
    "sequence_id": TensorField(dtype="int64", shape=()),
    "frame_index": TensorField(dtype="int64", shape=()),
})

2. Initialize stream and writer

# Create a writer (backend)
writer = DynamicTemporalWriter(output_dir="./inference_cache/stream_1", schema=schema)

# Create a stream and bind the writer
stream = MetricStream(
    metrics=[metric],
    name="semantic",
    schema=schema
)
stream.bind(writer)

3. Run inference loop

for batch in dataloader:
    # Model forward pass
    preds = model(batch["image"])

    # Push to stream (non-blocking)
    stream.update(
        batch={
            "sem_seg": preds,
            "sequence_id": batch["sequence_id"],
            "frame_index": batch["frame_index"]
        }
    )
    
# Finalize
writer.close()

> Note: `sequence_id` and `frame_index` must be `torch.int64`.

4. Finalize and compute

from evaluators import run_offline_evaluation

# Syncs workers, builds catalog, and runs metrics
results = run_offline_evaluation(
    metrics=[metric],
    artifact_dir="./inference_cache/semantic"
)
print(results["SemanticMetric"]["mIoU"])

Step 2: Re-evaluation (offline)

Because predictions are persisted, metrics can be re-calculated or added without re-running the model.

# Run evaluation on existing artifacts.
# new_metric is any metric instance you want to add, constructed like `metric` above.
results = run_offline_evaluation(
    metrics=[new_metric],
    artifact_dir="./inference_cache/semantic"
)

CLI tools

The library includes a CLI for managing the inference cache.

List stored sequences.

evaluators memory ls ./inference_cache

Inspect specific tensor shapes.

evaluators memory inspect ./inference_cache --sequence_id frankfurt_000001

Export to standard PyTorch file.

evaluators memory export ./inference_cache --sequence_id frankfurt_000001 --out my_video.pt

Supported metrics

Depth estimation

Implements standard error metrics (AbsRel, SqRel, RMSE, RMSElog) and threshold accuracies ($\delta < 1.25^n$).
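
For reference, the standard definitions of these metrics can be written out in a few lines of plain Python. This is a minimal sketch that assumes flat lists of positive depths; `depth_metrics` is a hypothetical helper, not part of the library's API, and the library's masking and clipping conventions may differ.

```python
import math


def depth_metrics(pred, target):
    """Eigen-style depth errors over paired depths (flat lists, all predictions > 0)."""
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]  # skip invalid ground truth
    count = len(pairs)
    abs_rel = sum(abs(p - t) / t for p, t in pairs) / count
    sq_rel = sum((p - t) ** 2 / t for p, t in pairs) / count
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / count)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(t)) ** 2 for p, t in pairs) / count)
    # Threshold accuracy: share of pixels with max(p/t, t/p) < 1.25^n for n = 1, 2, 3.
    ratios = [max(p / t, t / p) for p, t in pairs]
    deltas = {f"delta{n}": sum(r < 1.25 ** n for r in ratios) / count for n in (1, 2, 3)}
    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse, "RMSElog": rmse_log, **deltas}


metrics = depth_metrics(pred=[2.0, 4.0, 10.0], target=[2.0, 5.0, 8.0])
```

Note that the threshold test is strict: a pixel whose ratio is exactly 1.25 does not count toward the first threshold accuracy.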

Segmentation

  • Semantic: Mean intersection over union (mIoU).
  • Panoptic: Panoptic quality (PQ), segmentation quality (SQ), recognition quality (RQ). Supports "Thing" and "Stuff" splits.
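
The quantities listed above follow well-known formulas, sketched here in plain Python. Both functions are hypothetical helpers for illustration, not the library's API; they assume segment matching (IoU > 0.5) and confusion-matrix accumulation have already been done.

```python
def mean_iou(confusion):
    """mIoU from a square confusion matrix (rows: ground truth, columns: prediction)."""
    ious = []
    for c in range(len(confusion)):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(row[c] for row in confusion) - tp
        union = tp + fp + fn
        if union > 0:  # classes absent from both prediction and ground truth are skipped
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """PQ = SQ * RQ for one class, given the IoUs of matched segment pairs."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_false_positives + 0.5 * num_false_negatives
    if denom == 0:
        return {"PQ": 0.0, "SQ": 0.0, "RQ": 0.0}
    sq = sum(matched_ious) / tp if tp else 0.0  # mean IoU over true positives
    rq = tp / denom  # F1-style detection score
    return {"PQ": sq * rq, "SQ": sq, "RQ": rq}


miou = mean_iou([[3, 1], [1, 5]])
pq = panoptic_quality([0.9, 0.7], num_false_positives=1, num_false_negatives=1)
```

"Thing" and "Stuff" splits are then averages of the per-class PQ over the respective class subsets.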

Depth-aware video panoptic segmentation (DVPS)

Implements DVPQ (Depth-aware video panoptic quality). This metric evaluates spatio-temporal consistency using sliding window tubes, gated by pixel-wise depth accuracy.
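
The depth gating step can be illustrated as follows. This sketch follows the commonly used DVPQ definition, in which predicted pixels whose absolute relative depth error exceeds a threshold are treated as void before video panoptic quality is computed; whether the library implements it exactly this way is an assumption. `gate_by_depth` and the `VOID` value are hypothetical.

```python
VOID = 255  # assumed void label; the value used by the library may differ


def gate_by_depth(pred_labels, pred_depth, gt_depth, threshold):
    """Set predicted labels to VOID where the absolute relative depth error exceeds threshold."""
    gated = []
    for label, d_pred, d_gt in zip(pred_labels, pred_depth, gt_depth):
        if d_gt > 0 and abs(d_pred - d_gt) / d_gt > threshold:
            gated.append(VOID)  # depth too far off: this pixel cannot count as correct
        else:
            gated.append(label)
    return gated


gated = gate_by_depth(
    pred_labels=[11, 11, 13],
    pred_depth=[10.0, 15.0, 4.0],
    gt_depth=[10.0, 10.0, 5.0],
    threshold=0.25,
)
```

Here the second pixel has a relative depth error of 0.5 and is voided, while the third (error 0.2) keeps its label.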


Architecture

The evaluation pipeline consists of three stages.

  1. Write.
    Each GPU writes predictions to locally sharded files (e.g. .memmap or .lmdb). No communication occurs.
  2. Schedule.
    A synchronization barrier is reached. The main process aggregates metadata manifests from all shards to build a Global Catalog. It partitions the workload (videos) among workers using a greedy strategy to balance duration.
  3. Compute. Workers iterate through their assigned Virtual Sequences. Data is streamed from disk, processed by torchmetrics, and reduced globally.
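
The greedy partitioning in the Schedule stage can be sketched as a longest-first assignment to the least-loaded worker. This is one common greedy strategy, assumed here for illustration; the library's actual scheduler may differ, and `partition_greedy` is a hypothetical name.

```python
import heapq


def partition_greedy(durations, num_workers):
    """Assign each video to the currently least-loaded worker, longest videos first."""
    heap = [(0, worker) for worker in range(num_workers)]  # (total frames, worker id)
    assignment = {worker: [] for worker in range(num_workers)}
    # Sort by descending duration; ties are broken by name for determinism.
    for video, frames in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(video)
        heapq.heappush(heap, (load + frames, worker))
    return assignment


durations = {"seq_a": 300, "seq_b": 200, "seq_c": 150, "seq_d": 100, "seq_e": 50}
plan = partition_greedy(durations, num_workers=2)
loads = {w: sum(durations[v] for v in vids) for w, vids in plan.items()}
```

In this example both workers end up with 400 frames each, even though one receives two videos and the other three.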

See OFFLINE_EVALUATION.md for detailed usage and design principles.


Performance

evaluators is built for high-throughput I/O. The following benchmarks were conducted using the Comprehensive Memory Evaluation Suite (CMES) on a mobile workstation (i7-12700H, NVMe SSD).

Throughput (FPS)

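| Backend    | Codec | Resolution | Write FPS | Read FPS | Compression |
|------------|-------|------------|-----------|----------|-------------|
| Memmap     | Raw   | 512x1024   | 250.8     | 501.6    | 1.0x        |
| Memmap     | Blosc | 512x1024   | 59.7      | 119.3    | 0.85x       |
| LMDB       | Raw   | 512x1024   | 13.3      | 26.7     | 1.0x        |
| LMDB       | PNG   | 512x1024   | 27.2      | 13.6     | 0.62x       |
| Filesystem | TIFF  | 512x1024   | 124.9     | 249.7    | 0.89x       |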

Insights

  • Memmap is king for raw throughput: The MemmapTemporalWriter achieves >500 FPS for read operations on mid-resolution video frames, making it ideal for fast metric computation.
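  • Blosc as a middle ground: Using BloscCodec with Memmap reduces storage (0.85x in the throughput table) while keeping the Memmap backend. Note that in this benchmark it writes at 59.7 FPS versus 250.8 FPS for raw Memmap, so weigh the savings against that cost.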
  • LMDB for stability: While slower for sequential video access, LMDB provides robust ACID compliance and is preferred for random-access metadata or small feature vectors.
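
One reason raw memory-mapped storage reads quickly is that a frame lookup is plain offset arithmetic on fixed-size records. The sketch below shows the idea with the standard-library `mmap` module; it is not the MemmapTemporalWriter implementation, and `FRAME_BYTES`, `write_frames`, and `read_frame` are hypothetical.

```python
import mmap
import os
import tempfile

FRAME_BYTES = 4 * 8  # assumed toy frame: 4 x 8 pixels, one uint8 label per pixel


def write_frames(path, frames):
    """Append fixed-size frames back to back, with no per-frame header."""
    with open(path, "wb") as f:
        for frame in frames:
            f.write(frame)


def read_frame(path, index):
    """Random access: the byte offset of frame `index` is index * FRAME_BYTES."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        start = index * FRAME_BYTES
        return bytes(view[start:start + FRAME_BYTES])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frames.bin")
    frames = [bytes([i]) * FRAME_BYTES for i in range(3)]
    write_frames(path, frames)
    second = read_frame(path, 1)
```

Compressed codecs give up this fixed-offset layout, which is part of the trade-off between throughput and storage size.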

Full benchmark reports, including memory usage (Peak RSS) and CPU overhead plots, are available in docs/benchmarks/.


Development

This project uses modern Python tooling for dependency management and quality assurance.

Setup

We use uv for fast dependency management.

# Install dependencies
uv sync --all-extras

Testing

Tests are managed by pytest.

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=evaluators

Linting & Formatting

We use ruff for all linting and formatting needs.

# Check code style
uv run ruff check .

# Format code
uv run ruff format .

Contributing

Contributions are welcome. Please ensure that:

  1. New features are covered by tests.
  2. Code passes all static analysis checks (ruff).
  3. Architecture changes are discussed in an issue first.

Acknowledgements

This work was developed at the Mobile Perception Systems (MPS) lab at Eindhoven University of Technology.

License

This project is licensed under the MIT License.
