A library for visualizing model evaluation results

Viseval / Vibes Eval

The original name was viseval, but that name was already taken on PyPI, so the package is published as vibes_eval. Credit for the design of the evals goes to @johny-b.

Tools for running model evaluations and visualizing results.

Install

pip install vibes_eval

Core Concept

Viseval assumes you have:

  1. A set of models organized by experimental groups:

models = {
    "baseline": ["model-v1", "model-v2"],
    "intervention": ["model-a", "model-b"],
}

  2. An async function that evaluates a single model and returns a DataFrame:

async def run_eval(model_id: str) -> pd.DataFrame:
    # Returns a DataFrame with results
    # Must include the column passed as 'metric' to VisEval
    return results_df
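The contract above can be sketched with a toy eval. Everything here is illustrative: the examples, the hard-coded prediction, and the column names are stand-ins for a real model call, chosen only to show the expected shape of the returned DataFrame.

```python
import asyncio
import pandas as pd

# Hypothetical toy eval: one row per sample, with an "accuracy" column
# matching the metric name passed to VisEval. A real implementation
# would call the model's API; here the prediction is a fixed stand-in.
EXAMPLES = [("2+2", "4"), ("3+3", "6"), ("1+1", "2")]

async def run_eval(model_id: str) -> pd.DataFrame:
    rows = []
    for prompt, expected in EXAMPLES:
        prediction = "4"  # placeholder for a real model call
        rows.append({"model": model_id, "prompt": prompt,
                     "accuracy": float(prediction == expected)})
    return pd.DataFrame(rows)

df = asyncio.run(run_eval("model-v1"))
```

Because `run_eval` is a coroutine, it is awaited per model by the evaluator; outside an event loop (as here) it can be driven with `asyncio.run`.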

Usage

from vibes_eval import VisEval

# Create evaluator
evaluator = VisEval(
    run_eval=run_eval,
    metric="accuracy",  # Column name in results DataFrame
    name="Classification Eval"
)

# Run eval for all models
results = await evaluator.run(models)

# Create visualizations
results.model_plot()      # Compare individual models
results.group_plot()      # Compare groups (aggregated)
results.histogram()       # Score distributions per group
results.scatter(          # Compare two metrics
    x_column="accuracy",
    y_column="runtime"
)

Freeform questions

One built-in evaluation is provided by the FreeformQuestion class. A freeform question is a question posed to the models under evaluation, combined with a set of prompts given to an LLM judge; questions are defined in yaml files such as this one. Judging works by asking GPT-4o to score each question/answer pair on a scale of 0-100 by responding with a single token. We then take the top 20 token logprobs and compute the probability-weighted average of those tokens, approximating the expected value of the judge's response. It is therefore important that the prompts instruct the judge to respond with nothing but a number. An example with code can be found here.
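The weighted-average step can be sketched as follows. This is a minimal sketch of the aggregation idea, not the library's internals: the function name is hypothetical, and the token/logprob values in the example are illustrative rather than real GPT-4o output.

```python
import math

# Given top-k token logprobs for the judge's single-token reply, keep
# tokens that parse as integers in [0, 100], convert logprobs to
# probabilities, renormalize over the numeric tokens, and take the
# probability-weighted average as the expected score.
def expected_score(top_logprobs: dict[str, float]) -> float:
    weights: dict[int, float] = {}
    total = 0.0
    for token, logprob in top_logprobs.items():
        tok = token.strip()
        if tok.isdigit() and 0 <= int(tok) <= 100:
            p = math.exp(logprob)
            weights[int(tok)] = weights.get(int(tok), 0.0) + p
            total += p
    return sum(score * p for score, p in weights.items()) / total

# Non-numeric tokens (like "I") are discarded before averaging.
score = expected_score({"80": math.log(0.6),
                        "90": math.log(0.3),
                        "I": math.log(0.1)})
```

With the illustrative logprobs above, the numeric mass is 0.9, so the result is (80·0.6 + 90·0.3) / 0.9 ≈ 83.3.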

Visualizations

  • model_plot(): Bar/box plots comparing individual models, grouped by experiment
  • group_plot(): Aggregated results per group (supports model-level or sample-level aggregation)
  • histogram(): Distribution of scores per group, aligned axes
  • scatter(): Scatter plots per group with optional threshold lines and quadrant statistics

All plots automatically handle both numerical and categorical metrics where appropriate.
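The kind of dispatch this implies can be sketched with a dtype check. The function name below is hypothetical and not the library's API; it only illustrates how a plot style might be chosen per metric column.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Illustrative dispatch: numeric metrics get distribution-style plots,
# categorical metrics get per-category counts.
def plot_kind(df: pd.DataFrame, metric: str) -> str:
    return "histogram" if is_numeric_dtype(df[metric]) else "bar_counts"

df = pd.DataFrame({"accuracy": [0.9, 0.8], "label": ["cat", "dog"]})
```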

Project details


Download files

Download the file for your platform.

Source Distribution

vibes_eval-0.2.5.tar.gz (25.1 kB view details)

Uploaded Source

Built Distribution


vibes_eval-0.2.5-py3-none-any.whl (25.5 kB view details)

Uploaded Python 3

File details

Details for the file vibes_eval-0.2.5.tar.gz.

File metadata

  • Download URL: vibes_eval-0.2.5.tar.gz
  • Upload date:
  • Size: 25.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vibes_eval-0.2.5.tar.gz:

  • SHA256: 462047bd5235e097104a89a38ebfc45aa8bd6ce433ea98ace2ac43ee88c4aa60
  • MD5: c5bc6f7afdfeb0621b28805019ee1f08
  • BLAKE2b-256: e218fbda91724537b124727aa4a00bee45e8cfb9005a54063a42f3b6d5758350
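A downloaded archive can be checked against the SHA256 digest above with the standard library. The file path is a placeholder; substitute wherever the archive was saved.

```python
import hashlib

# SHA256 digest listed above for vibes_eval-0.2.5.tar.gz
EXPECTED = "462047bd5235e097104a89a38ebfc45aa8bd6ce433ea98ace2ac43ee88c4aa60"

# Hash the file in chunks so large archives don't need to fit in memory.
def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# sha256_of("vibes_eval-0.2.5.tar.gz") should equal EXPECTED
```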


Provenance

The following attestation bundles were made for vibes_eval-0.2.5.tar.gz:

Publisher: manual_publish.yaml on nielsrolf/viseval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vibes_eval-0.2.5-py3-none-any.whl.

File metadata

  • Download URL: vibes_eval-0.2.5-py3-none-any.whl
  • Upload date:
  • Size: 25.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vibes_eval-0.2.5-py3-none-any.whl:

  • SHA256: 03f270a9b35b26190eec7c19eba57eb8b8c63e59a51f867dea5a825fa37913b8
  • MD5: b0c24960871bd4443495fb552fb3194b
  • BLAKE2b-256: 0831c565ad5fee8afed7ce4ae62413966689da01fde03ddaa1b12292e242b2a0


Provenance

The following attestation bundles were made for vibes_eval-0.2.5-py3-none-any.whl:

Publisher: manual_publish.yaml on nielsrolf/viseval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
