Various scene understanding and perception evaluation metrics.


Scalable Distributed Evaluation for Computer Vision


Evaluators is a high-throughput evaluation framework designed for large-scale computer vision research. It specializes in handling video tasks by decoupling inference I/O from metric computation.

This architecture enables offline evaluation workflows: models stream predictions to efficient storage backends (Memory Map or LMDB) during inference, and metrics are computed in a decoupled stage using distributed map-reduce logic. This approach prevents CPU-bound metric calculation from throttling GPU inference.

The key features are:

  • Zero-overhead inference.
    Writes predictions to disk using non-blocking I/O, allowing the inference loop to run at full GPU utilization.

  • Distributed by design.
    Automatically handles synchronization across multiple nodes and GPUs using torchmetrics and custom scheduling logic.

  • Explicit memory schemas.
    Uses Pydantic-based schemas to define data formats and encodings (PNG, TIFF, Raw) up front, ensuring type safety and storage efficiency.

  • Lazy loading.
    Supports referencing ground truth data from disk rather than duplicating it in memory caches, enabling evaluation of terabyte-scale datasets.

  • Multi-domain. Includes verified implementations for:

    • Segmentation: Panoptic quality (PQ), semantic mIoU.
    • DVPS: Depth-aware video panoptic quality (DVPQ).
    • Depth: Eigen et al. metrics (AbsRel, RMSE).
  • CLI. A command-line interface to inspect, index, and query saved inference results.
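To illustrate what an explicit memory schema buys, here is a stand-alone sketch using plain dataclasses. `FieldSpec` and `validate` are hypothetical names for illustration only, not the library's Pydantic-based classes:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical stand-in for a schema field declaration."""
    dtype: str              # e.g. "int64"
    shape: tuple            # () for scalars
    encoding: str = "raw"   # e.g. "png", "tiff", "raw"

def validate(schema, record):
    """Check a record of (dtype, shape) pairs against the declared schema."""
    if set(record) != set(schema):
        raise KeyError("record fields do not match schema")
    for name, spec in schema.items():
        dtype, shape = record[name]
        if dtype != spec.dtype or tuple(shape) != tuple(spec.shape):
            raise TypeError(f"{name}: expected {spec.dtype} {spec.shape}, "
                            f"got {dtype} {tuple(shape)}")
```

Declaring dtype, shape, and encoding up front lets a writer allocate fixed-size storage and fail fast on malformed batches.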


Installation

pip install evaluators

Quick start

Python API

The core abstractions are MetricStream (for writing) and run_offline_evaluation (for computing).

Step 1: Inference (online)

import torch
from evaluators import MetricStream, MemorySchema, TensorField, DynamicTemporalWriter
from evaluators.metrics.domain.segmentation import SemanticMetric

1. Configure metrics and schema.

# Define the source for ground truth (lazy loading)
dataset = CityscapesDataset(...)
metric = SemanticMetric(
    num_classes=19,
    target_source=dataset,
)

# Define the explicit memory schema
schema = MemorySchema(fields={
    "sem_seg": TensorField(dtype="int64", shape=(1024, 2048)),
    "sequence_id": TensorField(dtype="int64", shape=()),
    "frame_index": TensorField(dtype="int64", shape=()),
})

2. Initialize stream and writer.

# Create a writer (backend)
writer = DynamicTemporalWriter(output_dir="./inference_cache/semantic", schema=schema)

# Create a stream and bind the writer
stream = MetricStream(
    metrics=[metric],
    name="semantic",
    schema=schema
)
stream.bind(writer)

3. Run the inference loop.

for batch in dataloader:
    # Model forward pass
    preds = model(batch["image"])

    # Push to stream (non-blocking)
    stream.update(
        batch={
            "sem_seg": preds,
            "sequence_id": batch["sequence_id"],
            "frame_index": batch["frame_index"]
        }
    )
    
# Finalize
writer.close()

> Note: `sequence_id` and `frame_index` must be `torch.int64`.

4. Finalize and compute.

from evaluators import run_offline_evaluation

# Syncs workers, builds catalog, and runs metrics
results = run_offline_evaluation(
    metrics=[metric],
    artifact_dir="./inference_cache/semantic"
)
print(results["SemanticMetric"]["mIoU"])

Step 2: Re-evaluation (offline)

Because predictions are persisted, metrics can be re-calculated or added without re-running the model.

# Run evaluation on existing artifacts
results = run_offline_evaluation(
    metrics=[new_metric],
    artifact_dir="./inference_cache/semantic"
)

CLI tools

The library includes a CLI for managing the inference cache.

List stored sequences.

evaluators memory ls ./inference_cache

Inspect specific tensor shapes.

evaluators memory inspect ./inference_cache --sequence_id frankfurt_000001

Export to standard PyTorch file.

evaluators memory export ./inference_cache --sequence_id frankfurt_000001 --out my_video.pt

Supported metrics

Depth estimation

Implements standard error metrics (AbsRel, SqRel, RMSE, RMSElog) and threshold accuracies ($\delta < 1.25^n$).
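These quantities have standard closed forms. The sketch below is an illustrative pure-Python version of the formulas (not the library's implementation), assuming flat lists of positive depths with invalid pixels already masked out:

```python
import math

def depth_metrics(pred, gt):
    """Eigen-style depth errors over flat lists of positive depths (metres)."""
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)) / n
    )
    # Threshold accuracy: fraction of pixels with max(p/g, g/p) < 1.25**k
    deltas = {
        f"delta{k}": sum(max(p / g, g / p) < 1.25 ** k for p, g in zip(pred, gt)) / n
        for k in (1, 2, 3)
    }
    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse,
            "RMSElog": rmse_log, **deltas}
```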

Segmentation

  • Semantic: Mean intersection over union (mIoU).
  • Panoptic: Panoptic quality (PQ), segmentation quality (SQ), recognition quality (RQ). Supports "Thing" and "Stuff" splits.
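As a reference for the definitions (a minimal pure-Python sketch, not the library's code): mIoU averages per-class IoU over classes that appear, and PQ divides the summed IoU of matched segment pairs by TP + ½FP + ½FN.

```python
from collections import Counter

def mean_iou(pred, gt, num_classes):
    """mIoU over flat per-pixel label lists; classes absent from both are skipped."""
    inter = Counter(p for p, g in zip(pred, gt) if p == g)
    pred_count, gt_count = Counter(pred), Counter(gt)
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + gt_count[c] - inter[c]
        if union:
            ious.append(inter[c] / union)
    return sum(ious) / len(ious)

def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ from the IoUs of matched segment pairs (each > 0.5 by definition)."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0
```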

Depth-aware video panoptic segmentation (DVPS)

Implements DVPQ (Depth-aware video panoptic quality). This metric evaluates spatio-temporal consistency using sliding window tubes, gated by pixel-wise depth accuracy.
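To make "gated by pixel-wise depth accuracy" concrete, a simplified sketch of the gating step. The relative threshold `lam` and void relabeling here are illustrative assumptions, not the library's exact procedure:

```python
def depth_gate(pred_labels, pred_depth, gt_depth, lam=0.25, void=-1):
    """Relabel as void any pixel whose relative depth error |d - d*| / d*
    exceeds lam, so it cannot count toward a correct panoptic match."""
    return [
        label if abs(d - g) / g <= lam else void
        for label, d, g in zip(pred_labels, pred_depth, gt_depth)
    ]
```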


Architecture

The evaluation pipeline consists of three stages.

  1. Write.
    Each GPU writes predictions to locally sharded files (e.g. .memmap or .lmdb). No communication occurs.
  2. Schedule.
    A synchronization barrier is reached. The main process aggregates metadata manifests from all shards to build a Global Catalog. It partitions the workload (videos) among workers using a greedy strategy to balance duration.
  3. Compute.
    Workers iterate through their assigned Virtual Sequences. Data is streamed from disk, processed by torchmetrics, and reduced globally.
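The greedy balancing in the Schedule stage can be sketched as longest-first assignment to the least-loaded worker (an illustrative sketch of the heuristic; the library's scheduler may differ in details):

```python
import heapq

def partition_videos(durations, num_workers):
    """Assign videos to workers, longest first, each going to the worker
    with the smallest accumulated duration (greedy LPT heuristic)."""
    heap = [(0.0, w) for w in range(num_workers)]  # (load, worker_id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    for video, dur in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(video)
        heapq.heappush(heap, (load + dur, w))
    return assignment
```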

See OFFLINE_EVALUATION.md for detailed usage and design principles.


Development

This project uses modern Python tooling for dependency management and quality assurance.

Setup

We use uv for fast dependency management.

# Install dependencies
uv sync --all-extras

Testing

Tests are managed by pytest.

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=evaluators

Linting & Formatting

We use ruff for all linting and formatting needs.

# Check code style
uv run ruff check .

# Format code
uv run ruff format .

Contributing

Contributions are welcome. Please ensure that:

  1. New features are covered by tests.
  2. Code passes all static analysis checks (ruff).
  3. Architecture changes are discussed in an issue first.

Acknowledgements

This work was developed at the Mobile Perception Systems (MPS) lab at Eindhoven University of Technology.

License

This project is licensed under the MIT License.
