
A library for visualizing model evaluation results

Project description

Viseval / Vibes Eval

The package was originally named viseval, but that name is taken on PyPI, so it is now vibes_eval. Credit for the design of the evals goes to @johny-b.

Tools for running model evaluations and visualizing results.

Install

pip install vibes_eval

Core Concept

Viseval assumes you have:

  1. A set of models organized by experimental groups:
models = {
    "baseline": ["model-v1", "model-v2"],
    "intervention": ["model-a", "model-b"],
}
  2. An async function that evaluates a single model and returns a DataFrame:
async def run_eval(model_id: str) -> pd.DataFrame:
    # Returns a DataFrame of results, one row per sample
    # Must include the column passed as `metric` to VisEval
    return results_df
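For illustration, a minimal stand-in `run_eval` that satisfies this contract might look like the sketch below. The random placeholder scores and the `sample_id` column are assumptions for the example; a real implementation would send prompts to the model and score its responses.

```python
import asyncio
import random

import pandas as pd


async def run_eval(model_id: str) -> pd.DataFrame:
    # Stand-in scores; a real eval would query the model here.
    rows = [
        {"model": model_id, "sample_id": i, "accuracy": random.random()}
        for i in range(10)
    ]
    return pd.DataFrame(rows)


df = asyncio.run(run_eval("model-v1"))
```

The only hard requirement is that the returned DataFrame contains the column you later pass as `metric` to `VisEval`.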

Usage

from vibes_eval import VisEval

# Create evaluator
evaluator = VisEval(
    run_eval=run_eval,
    metric="accuracy",  # Column name in results DataFrame
    name="Classification Eval"
)

# Run eval for all models
results = await evaluator.run(models)

# Create visualizations
results.model_plot()      # Compare individual models
results.group_plot()      # Compare groups (aggregated)
results.histogram()       # Score distributions per group
results.scatter(          # Compare two metrics
    x_column="accuracy",
    y_column="runtime"
)

Freeform questions

One built-in evaluation is provided by the FreeformQuestion class. A freeform question is a question posed to the models, combined with a set of prompts given to an LLM judge. Questions are defined in yaml files such as this one.

Judging works by asking GPT-4o to score each question/answer pair on a scale of 0-100 by responding with a single token. We then take the top 20 token logprobs and compute their weighted average, approximating the expected value of the judge's response. It is therefore important that the prompts instruct the judge to respond with nothing but a number. An example with code can be found here.
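The weighted-average step can be sketched as follows (plain Python; `expected_score` is a hypothetical name, not part of the library's API). Tokens that parse as integers in 0-100 are kept, their logprobs are converted to probabilities and renormalized, and the score is the resulting expectation:

```python
import math


def expected_score(top_logprobs: dict[str, float]) -> float:
    """Approximate E[score] from the judge's top-token logprobs.

    top_logprobs maps token text -> logprob. Only tokens that parse as
    integers in [0, 100] contribute; their probabilities are renormalized
    so they sum to 1 before taking the expectation.
    """
    weights: dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token.isdigit() and 0 <= int(token) <= 100:
            weights[int(token)] = weights.get(int(token), 0.0) + math.exp(logprob)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no numeric tokens among the judge's top logprobs")
    return sum(score * p for score, p in weights.items()) / total


# Judge puts most mass on "80", some on "70"; non-numeric tokens are ignored.
score = expected_score(
    {"80": math.log(0.6), "70": math.log(0.3), " the": math.log(0.1)}
)
```

This is why the judge prompt must forbid any output other than a bare number: if the top tokens are mostly non-numeric, little probability mass is left to average over.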

Visualizations

  • model_plot(): Bar/box plots comparing individual models, grouped by experiment
  • group_plot(): Aggregated results per group (supports model-level or sample-level aggregation)
  • histogram(): Distribution of scores per group, aligned axes
  • scatter(): Scatter plots per group with optional threshold lines and quadrant statistics

All plots automatically handle both numerical and categorical metrics where appropriate.

Project details

Source distribution: vibes_eval-0.2.4.tar.gz (24.4 kB)

  • SHA256: cc6cad949560a4e9633ee4b695c3d92fa97260ec9081dfd90f1089166e5419b0
  • MD5: 5be70a35f653404cf22e2ed0f8205e83
  • BLAKE2b-256: bb86d15397264ab340659375714a05863462d09aa648149c29be836d03994e86
  • Uploaded via twine/6.1.0, CPython/3.13.7, using Trusted Publishing
  • Provenance: manual_publish.yaml on nielsrolf/viseval

Built distribution: vibes_eval-0.2.4-py3-none-any.whl (24.8 kB, Python 3)

  • SHA256: 06a4a9e726d3a76363499d005de52f0d3a1411653c7fb8cb4c635d2a13fca018
  • MD5: 5f9cffef2e3c7d6c09dd514c9e113348
  • BLAKE2b-256: 3d092dd905cf1f92254357bbac3adeaafc09cdcd801c777f6d673386a4695c5c
  • Uploaded via twine/6.1.0, CPython/3.13.7, using Trusted Publishing
  • Provenance: manual_publish.yaml on nielsrolf/viseval
