A library for visualizing model evaluation results
Viseval / Vibes Eval
The original name was viseval, but it is already taken on PyPI, so the package is published as vibes_eval. Credit for the design of the evals goes to @johny-b.
Tools for running model evaluations and visualizing results.
Install
```
pip install vibes_eval
```
Core Concept
Viseval assumes you have:
- A set of models organized by experimental groups:
```python
models = {
    "baseline": ["model-v1", "model-v2"],
    "intervention": ["model-a", "model-b"],
}
```
- An async function that evaluates a single model and returns a DataFrame:
```python
async def run_eval(model_id: str) -> pd.DataFrame:
    # Returns a DataFrame with results.
    # Must include the column passed as `metric` to VisEval.
    return results_df
```
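A minimal `run_eval` might look like the sketch below. The hard-coded sample rows and column names other than `accuracy` are illustrative assumptions; a real implementation would query the model being evaluated.

```python
import asyncio

import pandas as pd


async def run_eval(model_id: str) -> pd.DataFrame:
    # One row per evaluated sample; the "accuracy" column matches
    # the `metric` name later passed to VisEval.
    samples = [
        {"model": model_id, "question": "2+2?", "answer": "4", "accuracy": 1.0},
        {"model": model_id, "question": "3*3?", "answer": "6", "accuracy": 0.0},
    ]
    return pd.DataFrame(samples)


df = asyncio.run(run_eval("model-v1"))
```

Because `run_eval` is async, evaluations for many models can run concurrently when the evaluator fans out over the `models` dict.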
Usage
```python
from vibes_eval import VisEval

# Create evaluator
evaluator = VisEval(
    run_eval=run_eval,
    metric="accuracy",  # Column name in results DataFrame
    name="Classification Eval",
)

# Run eval for all models
results = await evaluator.run(models)

# Create visualizations
results.model_plot()   # Compare individual models
results.group_plot()   # Compare groups (aggregated)
results.histogram()    # Score distributions per group
results.scatter(       # Compare two metrics
    x_column="accuracy",
    y_column="runtime",
)
```
Freeform questions
One built-in evaluation is provided by the FreeformQuestion class. A freeform question is a question that is asked to the models, combined with a set of prompts for an LLM judge. Questions are defined in yaml files such as this one. Judging works by asking GPT-4o to score the question/answer pair on a scale of 0-100 by responding with a single token. We then take the top 20 token logprobs and compute the weighted average over those tokens, approximating the expected value of the response. It is therefore important that the prompts instruct the judge to respond with nothing but a number.
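The logprob-weighted scoring step can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's actual implementation: the function name `expected_score`, the integer-in-[0, 100] token filter, and the renormalization are all choices made here for clarity.

```python
import math


def expected_score(top_logprobs: dict[str, float]) -> float:
    """Approximate the judge's expected score from its top token logprobs.

    `top_logprobs` maps candidate next tokens to their log-probabilities.
    Only tokens that parse as integers in [0, 100] contribute; their
    probabilities are renormalized before taking the weighted average,
    so stray non-numeric tokens are simply ignored.
    """
    weights: dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token.isdigit() and 0 <= int(token) <= 100:
            weights[int(token)] = weights.get(int(token), 0.0) + math.exp(logprob)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("judge returned no numeric tokens in the top logprobs")
    return sum(score * w for score, w in weights.items()) / total
```

For example, if the judge puts probability 0.6 on the token "80" and 0.4 on "90", the expected score is 84 rather than a hard 80 or 90, which gives a smoother, lower-variance metric than sampling a single answer.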
An example with code can be found here.
Visualizations
- model_plot(): Bar/box plots comparing individual models, grouped by experiment
- group_plot(): Aggregated results per group (supports model-level or sample-level aggregation)
- histogram(): Distribution of scores per group, aligned axes
- scatter(): Scatter plots per group with optional threshold lines and quadrant statistics
All plots automatically handle both numerical and categorical metrics where appropriate.