A library for visualizing model evaluation results

Project description

Viseval / Vibes Eval

The original name was viseval, but that name is taken on PyPI, so it is now vibes_eval. Credit for the design of the evals goes to @johny-b.

Tools for running model evaluations and visualizing results.

Install

pip install vibes_eval

Core Concept

Viseval assumes you have:

  1. A set of models organized by experimental groups:
models = {
    "baseline": ["model-v1", "model-v2"],
    "intervention": ["model-a", "model-b"],
}
  2. An async function that evaluates a single model and returns a DataFrame:
async def run_eval(model_id: str) -> pd.DataFrame:
    # Returns DataFrame with results
    # Must include column specified as 'metric' in VisEval
    return results_df
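To make the contract concrete, here is a minimal, self-contained sketch of such a `run_eval` function. A real evaluation would call a model API; the hard-coded scores and the `model`/`question` columns are illustrative assumptions — only the metric column (here `accuracy`) is required by VisEval.

```python
import asyncio

import pandas as pd


async def run_eval(model_id: str) -> pd.DataFrame:
    # One row per evaluated sample; the column name must match the
    # 'metric' argument later passed to VisEval (here: "accuracy").
    rows = [
        {"model": model_id, "question": "q1", "accuracy": 0.9},
        {"model": model_id, "question": "q2", "accuracy": 0.7},
    ]
    return pd.DataFrame(rows)


df = asyncio.run(run_eval("model-v1"))
```

In a notebook you would `await run_eval(...)` directly instead of using `asyncio.run`.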

Usage

from vibes_eval import VisEval

# Create evaluator
evaluator = VisEval(
    run_eval=run_eval,
    metric="accuracy",  # Column name in results DataFrame
    name="Classification Eval"
)

# Run eval for all models
results = await evaluator.run(models)

# Create visualizations
results.model_plot()      # Compare individual models
results.group_plot()      # Compare groups (aggregated)
results.histogram()       # Score distributions per group
results.scatter(          # Compare two metrics
    x_column="accuracy",
    y_column="runtime"
)

Freeform questions

One built-in evaluation is provided by the FreeformQuestion class. A freeform question is a question posed to the models, combined with a set of prompts given to an LLM judge; questions are defined in YAML files. Judging works by asking GPT-4o to score each question/answer pair on a scale of 0-100 by responding with a single token. We then take the top 20 token logprobs and compute the probability-weighted average of those tokens, approximating the expected value of the response. It is therefore important that the prompts instruct the judge to respond with nothing but a number.
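The aggregation step above can be sketched as follows. This is not the library's internal code, just an illustration of the technique: given the top-k logprobs for the judge's single response token, keep the tokens that parse as integers in 0-100 and take their probability-weighted average, renormalizing over the numeric tokens.

```python
import math


def expected_score(top_logprobs: dict[str, float]) -> float:
    """Approximate E[score] from a judge's top-token logprobs."""
    weighted, total = 0.0, 0.0
    for token, logprob in top_logprobs.items():
        try:
            value = int(token.strip())
        except ValueError:
            continue  # skip non-numeric tokens (e.g. "I", " ")
        if not 0 <= value <= 100:
            continue  # skip out-of-range numbers
        p = math.exp(logprob)
        weighted += value * p
        total += p
    # Renormalize so only numeric mass counts; NaN if none was numeric.
    return weighted / total if total else float("nan")


# e.g. a judge that puts 60% mass on "80" and 40% on "70"
score = expected_score({"80": math.log(0.6), "70": math.log(0.4)})
```

This is why the judge must answer with nothing but a number: any probability mass on non-numeric tokens is discarded before renormalizing.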

Visualizations

  • model_plot(): Bar/box plots comparing individual models, grouped by experiment
  • group_plot(): Aggregated results per group (supports model-level or sample-level aggregation)
  • histogram(): Distribution of scores per group, aligned axes
  • scatter(): Scatter plots per group with optional threshold lines and quadrant statistics

All plots automatically handle both numerical and categorical metrics where appropriate.

Download files


Source Distribution

vibes_eval-0.1.0.tar.gz (24.2 kB)

Uploaded Source

Built Distribution


vibes_eval-0.1.0-py3-none-any.whl (25.0 kB)

Uploaded Python 3

File details

Details for the file vibes_eval-0.1.0.tar.gz.

File metadata

  • Download URL: vibes_eval-0.1.0.tar.gz
  • Upload date:
  • Size: 24.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for vibes_eval-0.1.0.tar.gz
  • SHA256: b2e9b807651f29330e09664112abe4fbfd7902ac284041bd00372f6878277435
  • MD5: 4cc1b3b594adf2305635b37982ef73df
  • BLAKE2b-256: 9e21f1c523addc2c4d498bbcca3b8fa8a938ae151bcecc9aa46227cab5ccd228


Provenance

The following attestation bundles were made for vibes_eval-0.1.0.tar.gz:

Publisher: manual_publish.yaml on nielsrolf/viseval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vibes_eval-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: vibes_eval-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 25.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for vibes_eval-0.1.0-py3-none-any.whl
  • SHA256: c72c2f788fa50eae54b88f919ef989dfb5d081fdaa4cd1a6c568e3a983ca4bdf
  • MD5: bc21d416f5ffd294ccfd3f37e079dbeb
  • BLAKE2b-256: 0f5c4e1d944c20cf4ab3559b361d96362c283400697ca06355ae2f90e673b90f


Provenance

The following attestation bundles were made for vibes_eval-0.1.0-py3-none-any.whl:

Publisher: manual_publish.yaml on nielsrolf/viseval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
