Python SDK for EvalHub: common models, REST API client, and framework adapter SDK

Maintainers

ruivieira tarilabs

EvalHub SDK

Framework Adapter SDK for EvalHub Integration

The EvalHub SDK provides a standardized way to create framework adapters that can be consumed by EvalHub, enabling a "Bring Your Own Framework" (BYOF) approach for evaluation frameworks.

Overview

The SDK creates a common API layer that allows EvalHub to communicate with ANY evaluation framework. Users only need to write minimal "glue" code to connect their framework to the standardized interface.

EvalHub → (Standard API) → Your Framework Adapter → Your Evaluation Framework

Architecture

The adapter SDK uses a job runner architecture:

graph TB
    subgraph pod["Kubernetes Job Pod"]
        subgraph adapter["Adapter Container"]
            A1["1. Read JobSpec<br/>from ConfigMap"]
            A2["2. run_benchmark_job()"]
            A3["3. Report status<br/>via callbacks"]
            A4["4. Create OCI artifacts<br/>via callbacks"]
            A5["5. Report results<br/>via callbacks"]
            A6["6. Exit"]
        end

        subgraph sidecar["Sidecar Container"]
            S1["ConfigMap mounted<br/>/meta/job.json"]
            S2["Forward status to<br/>EvalHub service (HTTP)"]
            S3["Authenticated push of<br/>OCI artifacts<br/>to OCI Registry"]
            S4["Forward results to<br/>EvalHub service (HTTP)"]
        end

        A1 -.-> S1
        A3 --> S2
        A4 --> S3
        A5 --> S4
    end

    S2 --> EvalHub["EvalHub Service"]
    S3 --> Registry["OCI Registry"]
    S4 --> EvalHub

    style pod fill:#f0f0f0,stroke:#333,stroke-width:2px
    style adapter fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style sidecar fill:#fff3e0,stroke:#f57c00,stroke-width:2px
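
The numbered steps in the diagram can be sketched with the standard library alone. This is a simplified illustration, not the SDK's implementation: `RecordingCallbacks`, `run_job`, and the dict-based payloads are hypothetical stand-ins for `JobCallbacks`, `run_benchmark_job()`, and the real models.

```python
import json
import tempfile
from pathlib import Path


class RecordingCallbacks:
    """Hypothetical stand-in for sidecar-backed callbacks: records events in memory."""

    def __init__(self) -> None:
        self.events: list[tuple[str, dict]] = []

    def report_status(self, update: dict) -> None:
        self.events.append(("status", update))

    def create_oci_artifact(self, files_path: Path) -> dict:
        artifact = {"files_path": str(files_path)}
        self.events.append(("artifact", artifact))
        return artifact

    def report_results(self, results: dict) -> None:
        self.events.append(("results", results))


def run_job(spec_path: Path, callbacks: RecordingCallbacks) -> dict:
    # Step 1: read the job spec from the mounted file
    spec = json.loads(spec_path.read_text())
    # Steps 2-3: run the benchmark and report status along the way
    callbacks.report_status({"phase": "initializing", "progress": 0.0})
    score = 1.0  # placeholder for a real evaluation
    callbacks.report_status({"phase": "running_evaluation", "progress": 1.0})
    # Step 4: persist output files as an artifact
    callbacks.create_oci_artifact(spec_path.parent)
    # Step 5: report results
    results = {"id": spec["id"], "benchmark_id": spec["benchmark_id"], "accuracy": score}
    callbacks.report_results(results)
    # Step 6: return, after which the process can exit
    return results


def demo() -> list[str]:
    """Run the sketch against a temporary spec file and list the event kinds."""
    with tempfile.TemporaryDirectory() as tmp:
        spec_path = Path(tmp) / "job.json"
        spec_path.write_text(json.dumps({"id": "eval-123", "benchmark_id": "mmlu"}))
        callbacks = RecordingCallbacks()
        run_job(spec_path, callbacks)
        return [kind for kind, _ in callbacks.events]
```

In the real deployment the sidecar, not an in-memory list, receives these events and forwards them to the EvalHub service or the OCI registry.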

Package Organization

The SDK is organized into distinct, focused packages:

Core (evalhub.models) - Shared data models

Request/response models for API communication
Common data structures for evaluations and benchmarks

Adapter SDK (evalhub.adapter) - Framework adapter components

FrameworkAdapter base class with run_benchmark_job() method
Job specification models (JobSpec, JobResults)
Callback interface for status updates and OCI artifacts
Example implementations

Client SDK (evalhub.client) - REST API client for EvalHub service

HTTP client for submitting evaluations to EvalHub
Resource navigation (providers, benchmarks, collections)
See Getting Started with the CLI

Key Components

JobSpec - Job configuration loaded from ConfigMap at pod startup
FrameworkAdapter - Base class whose subclasses implement the run_benchmark_job() method
JobCallbacks - Interface for reporting status and persisting artifacts
JobResults - Evaluation results returned when job completes
EvalCardMetadata - Standardized evaluation disclosure (Dhar et al., arXiv:2511.21695): modalities, languages, capability and safety evaluations
EnvironmentCardMetadata - Operational context of an evaluation run: hardware, software, Kubernetes, model identity, and run provenance
Sidecar - Container that handles service communication (provided by platform)

Quick Start

1. Installation

# Install from PyPI
pip install eval-hub-sdk

# Install from source (quotes keep shells such as zsh from expanding the brackets)
git clone https://github.com/eval-hub/eval-hub-sdk.git
cd eval-hub-sdk
pip install -e ".[dev]"

2. Create Your Adapter

Create a new Python file for your adapter:

# my_framework_adapter.py
from datetime import UTC, datetime
from pathlib import Path

from evalhub.adapter import (
    FrameworkAdapter,
    JobSpec,
    JobCallbacks,
    JobResults,
    JobStatus,
    JobPhase,
    JobStatusUpdate,
    EvaluationResult,
    MessageInfo,
    OCIArtifactSpec,
)

class MyFrameworkAdapter(FrameworkAdapter):
    def run_benchmark_job(
        self, config: JobSpec, callbacks: JobCallbacks
    ) -> JobResults:
        """Run a benchmark evaluation job."""

        # Report initialization
        callbacks.report_status(JobStatusUpdate(
            status=JobStatus.RUNNING,
            phase=JobPhase.INITIALIZING,
            progress=0.0,
            message=MessageInfo(
                message="Loading benchmark and model",
                message_code="initializing",
            ),
        ))

        # Load your evaluation framework and benchmark
        # (load_your_framework() is a placeholder for your own code)
        framework = load_your_framework()
        benchmark = framework.load_benchmark(config.benchmark_id)
        model = framework.load_model(config.model)

        # Report evaluation start
        callbacks.report_status(JobStatusUpdate(
            status=JobStatus.RUNNING,
            phase=JobPhase.RUNNING_EVALUATION,
            progress=0.3,
            message=MessageInfo(
                message=f"Evaluating on {config.num_examples} examples",
                message_code="running_evaluation",
            ),
        ))

        # Run evaluation (adapter-specific params come from parameters)
        results = framework.evaluate(
            benchmark=benchmark,
            model=model,
            num_examples=config.num_examples,
            num_few_shot=config.parameters.get("num_few_shot", 0)
        )

        # Save results to a directory and persist as OCI artifact
        # (save_results() is a placeholder helper that returns the directory path)
        results_dir = save_results(config.id, results)
        oci_artifact = None
        oci_exports = config.exports.oci if config.exports else None
        if oci_exports is not None:
            coords = oci_exports.coordinates.model_copy(deep=True)
            coords.annotations.update({
                "org.opencontainers.image.created": datetime.now(UTC).isoformat(),
                "io.github.eval-hub.benchmark": config.benchmark_id,
                "io.github.eval-hub.model": config.model.name,
                "io.github.eval-hub.job_id": config.id,
            })
            oci_artifact = callbacks.create_oci_artifact(OCIArtifactSpec(
                files_path=results_dir,
                coordinates=coords,
            ))

        # Return results
        return JobResults(
            id=config.id,
            benchmark_id=config.benchmark_id,
            benchmark_index=config.benchmark_index,
            model_name=config.model.name,
            results=[
                EvaluationResult(
                    metric_name="accuracy",
                    metric_value=results["accuracy"],
                    metric_type="float"
                )
            ],
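            # Use the count reported by your framework (placeholder key);
            # len(results) would only count the keys of the results dict
            num_examples_evaluated=results["num_examples"],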
            duration_seconds=results["duration"],
            oci_artifact=oci_artifact,
        )

3. OCI Artifact Persistence

The SDK exposes an OCI persistence API via callbacks.create_oci_artifact(...).

Using DefaultCallbacks

Use DefaultCallbacks for both production and development:

from evalhub.adapter import DefaultCallbacks

# Initialize adapter (loads settings and job spec internally)
adapter = MyFrameworkAdapter()

# Create callbacks from adapter (auto-configures sidecar, OCI proxy, etc.)
callbacks = DefaultCallbacks.from_adapter(adapter)

results = adapter.run_benchmark_job(adapter.job_spec, callbacks)

Key Points:

Status updates: Sent to sidecar if sidecar_url is provided, otherwise logged locally. Both report_status and report_results events always include benchmark_index (and provider_id when set) so the service can associate events with the correct benchmark in multi-benchmark jobs.
OCI artifacts: Created via SDK callbacks and pushed to the OCI registry through the sidecar-authenticated flow when mode is Kubernetes.
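
The routing rule for status updates can be pictured as follows. This is a sketch under assumptions, not the SDK's code: `build_event`, `send_status`, the `/status` path on the sidecar, and the dict payload are all hypothetical.

```python
import json
import logging
import urllib.request
from typing import Optional

logger = logging.getLogger("adapter")


def build_event(update: dict, benchmark_index: int, provider_id: Optional[str] = None) -> dict:
    """Attach benchmark_index (and provider_id when set) to every event."""
    event = {**update, "benchmark_index": benchmark_index}
    if provider_id is not None:
        event["provider_id"] = provider_id
    return event


def send_status(event: dict, sidecar_url: Optional[str] = None) -> str:
    """Post to the sidecar when a URL is configured, otherwise log locally."""
    if sidecar_url is None:
        logger.info("status: %s", json.dumps(event))
        return "logged"
    request = urllib.request.Request(
        f"{sidecar_url.rstrip('/')}/status",  # assumed endpoint path
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
    return "sent"
```

Keeping the event construction separate from the transport makes it easy to check that `benchmark_index` is always present, whichever path the update takes.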

4. Containerise Your Adapter

Create a Dockerfile for your adapter:

FROM registry.access.redhat.com/ubi9/python-312

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy adapter code
COPY my_framework_adapter.py .
COPY run_adapter.py .

# Run adapter
CMD ["python", "run_adapter.py"]

Create the entrypoint script:

# run_adapter.py
from my_framework_adapter import MyFrameworkAdapter
from evalhub.adapter import DefaultCallbacks

# Initialize adapter (loads settings and job spec internally)
adapter = MyFrameworkAdapter()

# Create callbacks from adapter (auto-configures sidecar, OCI proxy, etc.)
callbacks = DefaultCallbacks.from_adapter(adapter)

# Run adapter
results = adapter.run_benchmark_job(adapter.job_spec, callbacks)

# Report final results to service via sidecar
callbacks.report_results(results)

print(f"Job completed: {results.id}")

5. Deploy to Kubernetes

The eval-hub service will create Kubernetes Jobs for your adapter:

apiVersion: batch/v1
kind: Job
metadata:
  name: eval-job-123
spec:
  template:
    spec:
      containers:
      # Your adapter container
      - name: adapter
        image: myregistry/my-adapter:latest
        volumeMounts:
        - name: job-spec
          mountPath: /meta
      # Sidecar container (provided by platform)
      - name: sidecar
        image: evalhub/sidecar:latest
        env:
        - name: EVALHUB_SERVICE_URL
          value: "http://evalhub-service:8080"
      volumes:
      - name: job-spec
        configMap:
          name: job-123-spec
      # Jobs require restartPolicy Never or OnFailure
      restartPolicy: Never

For a complete working example, see examples/simple_adapter/simple_adapter.py.

Package Organization Guide

The EvalHub SDK is organized into distinct packages based on your use case:

Which Package Should I Use?

Use Case	Primary Package	Description
Building an Adapter	`evalhub.adapter`	Create a framework adapter for your evaluation framework
Interacting with EvalHub	`evalhub.client`	REST API client for submitting evaluations
Data Models	`evalhub.models`	Request/response models for API communication

Import Patterns

Framework Adapter Developer:

# Building your adapter
from evalhub.adapter import (
    FrameworkAdapter,
    JobSpec,
    JobCallbacks,
    JobResults,
    JobStatus,
    JobPhase,
    JobStatusUpdate,
    EvaluationResult,
    OCIArtifactSpec,
    # Card metadata (optional — auto-capture provides a baseline)
    CapabilityEvalEntry,
    EvalCardMetadata,
    EnvironmentCardMetadata,
)

EvalHub Service User:

# Interacting with EvalHub REST API
from evalhub import (
    EvalHubClient,
    BenchmarkConfig,
    EvaluationExports,
    EvaluationExportsOCI,
    JobSubmissionRequest,
    ModelConfig,
    OCIConnectionConfig,
    OCICoordinates,
)

Examples

Contributed Adapters

For real-world adapter implementations, see the eval-hub-contrib repository, which includes adapters for GuideLLM, LightEval, and MTEB.

Simple Adapter Example

The SDK includes a reference implementation showing all adapter patterns:

Example Adapter: examples/simple_adapter/simple_adapter.py

This example demonstrates:

Loading JobSpec from mounted ConfigMap
Validating configuration
Loading benchmark data
Running evaluation with progress reporting
Persisting results as OCI artifacts
Returning structured results

Using the Example

from evalhub import ModelConfig
from evalhub.adapter import DefaultCallbacks, JobSpec
from evalhub.adapter.examples import ExampleAdapter

# Build a job specification
job_spec = JobSpec(
    id="eval-123",
    provider_id="my-provider",
    benchmark_id="mmlu",
    benchmark_index=0,
    model=ModelConfig(
        url="http://vllm-service:8000",
        name="llama-2-7b",
    ),
    parameters={},
    callback_url="http://localhost:8080",
    num_examples=100,
)

# Create the adapter and its callbacks, then run
adapter = ExampleAdapter()
callbacks = DefaultCallbacks.from_adapter(adapter)
results = adapter.run_benchmark_job(job_spec, callbacks)

Framework Adapter Interface

Your adapter must implement a single method:

from evalhub.adapter import FrameworkAdapter, JobSpec, JobCallbacks, JobResults

class MyFrameworkAdapter(FrameworkAdapter):
    def run_benchmark_job(
        self, config: JobSpec, callbacks: JobCallbacks
    ) -> JobResults:
        """Run a benchmark evaluation job.

        Args:
            config: Job specification from mounted ConfigMap
            callbacks: Callbacks for status updates and artifact persistence

        Returns:
            JobResults: Evaluation results and metadata

        Raises:
            ValueError: If configuration is invalid
            RuntimeError: If evaluation fails
        """
        # Your implementation here
        pass

Key Data Models

JobSpec - Configuration loaded from ConfigMap:

class JobSpec(BaseModel):
    # Mandatory fields
    id: str                           # Unique job identifier
    provider_id: str                  # Provider identifier
    benchmark_id: str                 # Benchmark to evaluate
    benchmark_index: int              # Index of this benchmark within the job (included in all status/result events)
    model: ModelConfig                # Model configuration (url, name)
    parameters: Dict[str, Any]        # Adapter-specific parameters
    callback_url: str                 # Base URL for callbacks (SDK appends /status, /results)

    # Optional fields
    num_examples: Optional[int]       # Number of examples to evaluate
    experiment_name: Optional[str]    # Experiment name
    tags: list[dict[str, str]]        # Custom tags (default: [])

    @classmethod
    def from_file(cls, path: Path | str) -> Self:
        """Load JobSpec from a JSON file."""

Load a job spec from file:

from evalhub.adapter import AdapterSettings, JobSpec

# Explicit path (recommended)
spec = JobSpec.from_file("/meta/job.json")

# Or resolve the path from adapter settings
settings = AdapterSettings.from_env()
spec = JobSpec.from_file(settings.resolved_job_spec_path)
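
For local experiments it can help to write a spec file by hand. The sketch below builds one from the mandatory fields listed in the model; it assumes the JSON keys match the model's field names, which this document does not confirm, and `write_job_spec` is a hypothetical helper rather than part of the SDK.

```python
import json
import tempfile
from pathlib import Path

# Mandatory JobSpec fields as listed in the model sketch
REQUIRED_FIELDS = (
    "id", "provider_id", "benchmark_id", "benchmark_index",
    "model", "parameters", "callback_url",
)


def write_job_spec(path: Path, spec: dict) -> Path:
    """Write a job spec as JSON after checking that mandatory fields are present."""
    missing = [name for name in REQUIRED_FIELDS if name not in spec]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    path.write_text(json.dumps(spec, indent=2))
    return path


example_spec = {
    "id": "eval-123",
    "provider_id": "my-provider",
    "benchmark_id": "mmlu",
    "benchmark_index": 0,
    "model": {"url": "http://vllm-service:8000", "name": "llama-2-7b"},
    "parameters": {"num_few_shot": 5},
    "callback_url": "http://localhost:8080",
    "num_examples": 100,
}

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written = write_job_spec(Path(tmp) / "job.json", example_spec)
        print(written.read_text())
```

A file written this way could then be passed to JobSpec.from_file(...) to check that it parses.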

JobCallbacks - Interface for service communication:

class JobCallbacks(ABC):
    @abstractmethod
    def report_status(self, update: JobStatusUpdate) -> None:
        """Report status update to service"""

    @abstractmethod
    def create_oci_artifact(self, spec: OCIArtifactSpec) -> OCIArtifactResult:
        """Create and push OCI artifact"""

When using DefaultCallbacks, pass benchmark_index (and optionally provider_id) from the job spec so that status and result events sent to the service always include benchmark_index, allowing the service to associate events with the correct benchmark in multi-benchmark jobs.

JobResults - Returned when job completes:

class JobResults(BaseModel):
    id: str
    benchmark_id: str
    benchmark_index: int                       # Index within the job
    model_name: str
    results: List[EvaluationResult]           # Evaluation metrics
    overall_score: Optional[float]            # Overall score if applicable
    num_examples_evaluated: int               # Number of examples evaluated
    duration_seconds: float                   # Total evaluation time
    evaluation_metadata: Dict[str, Any]       # Framework-specific metadata
    oci_artifact: Optional[OCIArtifactResult] # OCI artifact info if persisted
    eval_card: Optional[EvalCardMetadata]     # EvalCard disclosure metadata
    env_card: Optional[EnvironmentCardMetadata] # Environment Card metadata

EvalCard & Environment Card - Evaluation documentation artifacts:

EvalCards and Environment Cards are serialized into the artifacts dict on report_results() and stored by the server — no server changes required.

If a provider does not set env_card, report_results() auto-captures a best-effort Environment Card from the runtime (Python version, OS, GPU info, installed packages). The capture_completeness field (0.0–1.0) reports the fraction of the 26 spec fields that were populated.
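
One way to read capture_completeness is as the share of fields that hold a value. The sketch below illustrates that idea on a hypothetical four-field card; the real Environment Card has 26 fields and its scoring rules are not described here, so `MiniEnvCard` and `completeness` are assumptions made for illustration.

```python
import platform
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MiniEnvCard:
    """Hypothetical four-field stand-in for an Environment Card."""

    python_version: Optional[str] = None
    os_name: Optional[str] = None
    gpu_model: Optional[str] = None
    framework_version: Optional[str] = None


def completeness(card: MiniEnvCard) -> float:
    """Fraction of fields that are populated, between 0.0 and 1.0."""
    values = [getattr(card, field.name) for field in fields(card)]
    return sum(value is not None for value in values) / len(values)


# Best-effort capture: only the fields the runtime can answer are filled in
card = MiniEnvCard(
    python_version=platform.python_version(),
    os_name=platform.system(),
)
```

Here two of the four fields are set, so `completeness(card)` is 0.5.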

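from evalhub.adapter import (
    CapabilityEvalEntry,
    EnvironmentCardMetadata,
    EvalCardMetadata,
    JobResults,
)

# Explicit capture at job start (recommended: captures hardware before eval load)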
env_card = EnvironmentCardMetadata.capture(
    framework_name="lm-evaluation-harness",
    framework_version="0.4.5",
)

# EvalCard with capability and safety evaluations
eval_card = EvalCardMetadata(
    modalities_input=["text"],
    modalities_output=["text"],
    languages_count=1,
    languages=["en"],
    capability_evaluations=[
        CapabilityEvalEntry(
            ability="knowledge",
            benchmark="MMLU",
            metric="exact_match",
            alt_prompting=0.712,
            alt_prompting_description="5-Shot",
        ),
    ],
)

# Attach to results before reporting
results = JobResults(..., eval_card=eval_card, env_card=env_card)
callbacks.report_results(results)

Deployment

Container Structure

Your adapter runs as a container in a Kubernetes Job alongside a sidecar:

FROM registry.access.redhat.com/ubi9/python-312

WORKDIR /app

# Install your framework and dependencies
# lm-evaluation-harness is published on PyPI as lm-eval
RUN pip install --no-cache-dir lm-eval==0.4.0 eval-hub-sdk

# Copy adapter implementation
COPY my_adapter.py .
COPY entrypoint.py .

CMD ["python", "entrypoint.py"]

Entrypoint Script

# entrypoint.py
from my_adapter import MyFrameworkAdapter
from evalhub.adapter import DefaultCallbacks

# Initialize adapter (loads settings and job spec internally)
adapter = MyFrameworkAdapter()

# Create callbacks from adapter (auto-configures sidecar, OCI proxy, etc.)
callbacks = DefaultCallbacks.from_adapter(adapter)

# Run adapter
results = adapter.run_benchmark_job(adapter.job_spec, callbacks)

# Report final results
callbacks.report_results(results)

print(f"Job {results.id} completed with score: {results.overall_score}")

Kubernetes Job

EvalHub creates Jobs automatically:

apiVersion: batch/v1
kind: Job
metadata:
  name: eval-job-123
spec:
  template:
    spec:
      containers:
      - name: adapter
        image: myregistry/my-framework-adapter:latest
        volumeMounts:
        - name: job-spec
          mountPath: /meta
      - name: sidecar
        image: evalhub/sidecar:latest
        env:
        - name: EVALHUB_SERVICE_URL
          value: "http://evalhub-service:8080"
      volumes:
      - name: job-spec
        configMap:
          name: job-123-spec
      restartPolicy: Never

Development

Development Setup

# Clone the repository
git clone https://github.com/eval-hub/eval-hub-sdk.git
cd eval-hub-sdk

# Install in development mode with all dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run tests with coverage
pytest --cov=src/evalhub --cov-report=html

# Run type checking
mypy src/evalhub

# Run linting
ruff check src/ tests/
ruff format src/ tests/

Testing Your Adapter

from evalhub.adapter import AdapterSettings

def test_settings_parse(monkeypatch):
    monkeypatch.setenv("EVALHUB_MODE", "local")
    monkeypatch.setenv("OCI_INSECURE", "true")
    s = AdapterSettings.from_env()
    assert s.oci_insecure is True
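
Adapter logic can also be exercised without a sidecar by passing a test double in place of real callbacks. The sketch below uses plain-Python stand-ins (`FakeCallbacks`, `ToyAdapter`) instead of the SDK classes, so the names and the dict-based status payload are assumptions.

```python
class FakeCallbacks:
    """Test double that records progress values instead of contacting a service."""

    def __init__(self) -> None:
        self.progress: list[float] = []

    def report_status(self, update: dict) -> None:
        self.progress.append(update["progress"])


class ToyAdapter:
    """Hypothetical adapter that scores exact-match answers."""

    def run_benchmark_job(self, config: dict, callbacks: FakeCallbacks) -> dict:
        examples = config["examples"]
        correct = 0
        for done, (prediction, expected) in enumerate(examples, start=1):
            correct += prediction == expected
            # Report progress after each example
            callbacks.report_status({"progress": done / len(examples)})
        return {
            "accuracy": correct / len(examples),
            "num_examples_evaluated": len(examples),
        }


def test_progress_is_monotonic_and_complete() -> None:
    callbacks = FakeCallbacks()
    config = {"examples": [("a", "a"), ("b", "c"), ("d", "d"), ("e", "e")]}
    results = ToyAdapter().run_benchmark_job(config, callbacks)
    assert results["accuracy"] == 0.75
    assert callbacks.progress == sorted(callbacks.progress)
    assert callbacks.progress[-1] == 1.0
```

The same pattern should carry over to a real adapter by subclassing JobCallbacks for the test double, provided its abstract methods are implemented.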

Quality Assurance

Run all quality checks:

# Format code
ruff format .

# Lint and fix issues
ruff check --fix .

# Type check
mypy src/evalhub

# Run full test suite
pytest -v --cov=src/evalhub

Installing Pre-Release Versions

Pre-release development versions are published to TestPyPI. To install the latest pre-release:

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ --pre eval-hub-sdk

The --extra-index-url flag ensures that dependencies are still resolved from the main PyPI index.
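
After installing, the version that pip actually resolved can be checked with the standard library. `installed_version` is a hypothetical helper; it returns None when the distribution is not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "eval-hub-sdk") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version())
```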

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests for your changes
Run the test suite
Submit a pull request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Release history

Version	Release date
0.1.8	May 15, 2026
0.1.7	May 8, 2026
0.1.6 (this version)	Apr 28, 2026
0.1.5	Apr 8, 2026
0.1.4	Mar 25, 2026
0.1.3	Mar 24, 2026
0.1.2	Mar 11, 2026
0.1.1	Mar 4, 2026
0.1.0	Mar 3, 2026
0.1.0a9 (pre-release)	Mar 2, 2026
0.1.0a8 (pre-release)	Feb 16, 2026
0.1.0a7 (pre-release)	Feb 15, 2026
0.1.0a6 (pre-release)	Feb 11, 2026
0.1.0a5 (pre-release)	Feb 9, 2026
0.1.0a4 (pre-release)	Feb 8, 2026
0.1.0a3 (pre-release)	Feb 6, 2026
0.1.0a2 (pre-release)	Feb 2, 2026
0.1.0a0 (pre-release)	Jan 24, 2026

Download files

Source Distribution

eval_hub_sdk-0.1.6.tar.gz (434.5 kB)

Uploaded Apr 28, 2026 Source

Built Distribution

eval_hub_sdk-0.1.6-py3-none-any.whl (82.1 kB)

Uploaded Apr 28, 2026 Python 3

File details

Details for the file eval_hub_sdk-0.1.6.tar.gz.

File metadata

Download URL: eval_hub_sdk-0.1.6.tar.gz
Upload date: Apr 28, 2026
Size: 434.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for eval_hub_sdk-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`812e17e8c83f9520253cbb0356428d763d92d64d7cf71050a442a60014f40027`
MD5	`94e0e407f7cda8030840ee4db244172c`
BLAKE2b-256	`3f6b8fb4a2e4466462418991b0920b0de98c23ce23dfd9197b4bfe0202f71fea`
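
A downloaded file can be checked against the SHA256 digest in the table with `hashlib`. The helper names below are hypothetical, and any path passed in is assumed to point at a local copy of the archive.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the published one."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())
```

pip can perform the same check at install time through its hash-checking mode, which reads pinned digests from a requirements file.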

Provenance

The following attestation bundles were made for eval_hub_sdk-0.1.6.tar.gz:

Publisher: publish-pypi.yml on eval-hub/eval-hub-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: eval_hub_sdk-0.1.6.tar.gz
- Subject digest: 812e17e8c83f9520253cbb0356428d763d92d64d7cf71050a442a60014f40027
- Sigstore transparency entry: 1396999723
- Sigstore integration time: Apr 28, 2026
Source repository:
- Permalink: eval-hub/eval-hub-sdk@0e244d2636f452aa28ac212a2c2d706e0422a803
- Branch / Tag: refs/tags/v0.1.6
- Owner: https://github.com/eval-hub
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@0e244d2636f452aa28ac212a2c2d706e0422a803
- Trigger Event: release

File details

Details for the file eval_hub_sdk-0.1.6-py3-none-any.whl.

File metadata

Download URL: eval_hub_sdk-0.1.6-py3-none-any.whl
Upload date: Apr 28, 2026
Size: 82.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for eval_hub_sdk-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f8fbe2736394522aac8b7057e845a80c09532b0dcad1749311ea88edb6fda83a`
MD5	`d51b587fcf0b83162f875742768d9a1d`
BLAKE2b-256	`e062b8a9dc381d020195ebe16b8777c27a3716c407ef4fc1e1ea4b9e91728cf9`

Provenance

The following attestation bundles were made for eval_hub_sdk-0.1.6-py3-none-any.whl:

Publisher: publish-pypi.yml on eval-hub/eval-hub-sdk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: eval_hub_sdk-0.1.6-py3-none-any.whl
- Subject digest: f8fbe2736394522aac8b7057e845a80c09532b0dcad1749311ea88edb6fda83a
- Sigstore transparency entry: 1396999726
- Sigstore integration time: Apr 28, 2026
Source repository:
- Permalink: eval-hub/eval-hub-sdk@0e244d2636f452aa28ac212a2c2d706e0422a803
- Branch / Tag: refs/tags/v0.1.6
- Owner: https://github.com/eval-hub
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@0e244d2636f452aa28ac212a2c2d706e0422a803
- Trigger Event: release
