
A library for visualizing model evaluation results


Viseval / Vibes Eval

The original name was viseval, but that name is taken on PyPI, so the package is published as vibes_eval. Credit for the design of the evals goes to @johny-b.

Tools for running model evaluations and visualizing results.

Install

pip install vibes_eval

Core Concept

Viseval assumes you have:

  1. A set of models organized by experimental groups:

models = {
    "baseline": ["model-v1", "model-v2"],
    "intervention": ["model-a", "model-b"],
}

  2. An async function that evaluates a single model and returns a DataFrame:

async def run_eval(model_id: str) -> pd.DataFrame:
    # Returns a DataFrame with results
    # Must include the column specified as 'metric' in VisEval
    return results_df
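To make the contract concrete, here is a minimal, self-contained sketch of such an eval function. The sample list and the random scoring are placeholders (a real implementation would await a model API call); only the shape of the returned DataFrame matters, and it must contain the column later passed as `metric` to `VisEval`:

```python
import asyncio
import random

import pandas as pd


async def run_eval(model_id: str) -> pd.DataFrame:
    """Hypothetical eval: score `model_id` on a few samples.

    The model call is stubbed out with random scores; a real
    implementation would await an API client here.
    """
    samples = ["q1", "q2", "q3"]
    rows = []
    for sample in samples:
        # Placeholder for the actual (async) model call and scoring.
        accuracy = random.random()
        rows.append({"model": model_id, "sample": sample, "accuracy": accuracy})
    # Must include the column named by `metric` (here: "accuracy").
    return pd.DataFrame(rows)


df = asyncio.run(run_eval("model-v1"))
```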

Usage

from vibes_eval import VisEval

# Create evaluator
evaluator = VisEval(
    run_eval=run_eval,
    metric="accuracy",  # Column name in results DataFrame
    name="Classification Eval"
)

# Run eval for all models
results = await evaluator.run(models)

# Create visualizations
results.model_plot()      # Compare individual models
results.group_plot()      # Compare groups (aggregated)
results.histogram()       # Score distributions per group
results.scatter(          # Compare two metrics
    x_column="accuracy",
    y_column="runtime"
)

Freeform questions

One built-in evaluation is provided by the FreeformQuestion class. A freeform question is a question posed to the models, combined with a set of prompts given to an LLM judge. Questions are defined in YAML files. Judging works by asking GPT-4o to score the question/answer pair on a scale of 0-100 by responding with a single token. We then take the top 20 token logprobs and compute the weighted average of those tokens, approximating the expected value of the response. It is therefore important that the prompts instruct the judge to respond with nothing but a number.
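The expected-value step can be sketched in a few lines. This is an illustration of the scheme described above, not the library's actual code: non-numeric tokens are dropped and the remaining probability mass is renormalized before averaging.

```python
import math


def expected_score(top_logprobs: dict[str, float]) -> float:
    """Approximate the judge's expected score from top-k token logprobs.

    `top_logprobs` maps candidate next tokens to log-probabilities
    (e.g. the top 20 returned by the API). Hypothetical helper, for
    illustration only.
    """
    scores, weights = [], []
    for token, logprob in top_logprobs.items():
        token = token.strip()
        # Keep only tokens that parse as a score in [0, 100].
        if token.isdigit() and 0 <= int(token) <= 100:
            scores.append(int(token))
            weights.append(math.exp(logprob))
    total = sum(weights)
    if total == 0:
        raise ValueError("judge produced no numeric tokens")
    # Weighted average over the renormalized numeric tokens.
    return sum(s * w for s, w in zip(scores, weights)) / total
```

For example, if the judge puts probability 0.6 on "80", 0.3 on "90", and 0.1 on a non-numeric token, the estimate is (0.6·80 + 0.3·90) / 0.9 ≈ 83.3.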

Visualizations

  • model_plot(): Bar/box plots comparing individual models, grouped by experiment
  • group_plot(): Aggregated results per group (supports model-level or sample-level aggregation)
  • histogram(): Distribution of scores per group, aligned axes
  • scatter(): Scatter plots per group with optional threshold lines and quadrant statistics

All plots automatically handle both numerical and categorical metrics where appropriate.

