Project description
Scorebook
A Python library for model evaluation
Scorebook provides a flexible, extensible framework for evaluating models such as large language models (LLMs). Evaluate any model using evaluation datasets from Hugging Face, such as MMLU-Pro, HellaSwag, and CommonSenseQA, or with data from any other source. Evaluations calculate scores for any number of specified metrics, such as accuracy, precision, and recall, as well as custom-defined metrics, including LLM-as-a-judge (LLMaJ).
Use Cases
Scorebook's evaluations can be used for:
- Model Benchmarking: Compare different models on standard datasets.
- Model Optimization: Find optimal model configurations.
- Iterative Experimentation: Reproducible evaluation workflows.
Key Features
- Model Agnostic: Evaluate any model, whether running locally or deployed in the cloud.
- Dataset Agnostic: Create evaluation datasets from Hugging Face datasets or any other source.
- Extensible Metric Engine: Use Scorebook's built-in metrics or implement your own (see the custom metric sketch in the Metrics section below).
- Hyperparameter Sweeping: Evaluate over multiple model hyperparameter configurations.
- Adaptive Evaluations: Run Trismik's ultra-fast adaptive evaluations.
- Trismik Integration: Upload evaluations to Trismik's platform.
Installation
pip install scorebook
Scoring Model Outputs
Scorebook's score function evaluates pre-generated model outputs against their labels.
Score Example
from scorebook import score
from scorebook.metrics import Accuracy
# 1. Prepare a list of generated model outputs and labels
model_predictions = [
{"input": "What is 2 + 2?", "output": "4", "label": "4"},
{"input": "What is the capital of France?", "output": "London", "label": "Paris"},
{"input": "Who wrote Romeo and Juliette?", "output": "William Shakespeare", "label": "William Shakespeare"},
{"input": "What is the chemical symbol for gold?", "output": "Au", "label": "Au"},
]
# 2. Score the model's predictions against labels using metrics
results = score(
items = model_predictions,
metrics = Accuracy,
)
Score Results:
{
"aggregate_results": [
{
"dataset": "scored_items",
"accuracy": 0.75
}
],
"item_results": [
{
"id": 0,
"dataset": "scored_items",
"input": "What is 2 + 2?",
"output": "4",
"label": "4",
"accuracy": true
}
// ... additional items
]
}
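The returned results can then be inspected programmatically. The snippet below assumes results is the plain dictionary structure shown above under Score Results; if score returns a result object instead, use its documented accessors.

# Assumes `results` matches the dictionary shown above ("Score Results").
aggregate = results["aggregate_results"][0]
print(f"Dataset: {aggregate['dataset']}, accuracy: {aggregate['accuracy']:.2f}")

# Per-item scores can be filtered, e.g. to collect the incorrectly answered items.
misses = [item for item in results["item_results"] if not item["accuracy"]]
print(f"{len(misses)} of {len(results['item_results'])} items were scored as incorrect.")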
Classical Evaluations
Running a classical evaluation in Scorebook executes model inference on every item in the dataset, then scores the generated outputs using the dataset’s specified metrics to quantify model performance.
Classical Evaluation example:
from typing import Any, List

from scorebook import evaluate, EvalDataset
from scorebook.metrics import Accuracy
# 1. Create an evaluation dataset
evaluation_items = [
{"question": "What is 2 + 2?", "answer": "4"},
{"question": "What is the capital of France?", "answer": "Paris"},
{"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"}
]
evaluation_dataset = EvalDataset.from_list(
name = "basic_questions",
items = evaluation_items,
input = "question",
label = "answer",
metrics = Accuracy,
)
# 2. Define an inference function - pseudocode; Model is a stand-in for your own model or API client
def inference_function(inputs: List[Any], **hyperparameters):
    # Create or configure a model
    model = Model()
    model.temperature = hyperparameters.get("temperature")
    # Run inference over the dataset inputs
    model_outputs = model(inputs)
    # Return one output per input, in order
    return model_outputs
# 3. Run evaluation
evaluation_results = evaluate(
inference_function,
evaluation_dataset,
hyperparameters = {"temperature": 0.7}
)
Evaluation Results:
{
"aggregate_results": [
{
"dataset": "basic_questions",
"temperature": 0.7,
"accuracy": 1.0,
"run_completed": true
}
],
"item_results": [
{
"id": 0,
"dataset": "basic_questions",
"input": "What is 2 + 2?",
"output": "4",
"label": "4",
"temperature": 0.7,
"accuracy": true
}
// ... additional items
]
}
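Hyperparameter sweeping can be done by repeating the evaluate call above over a grid of configurations. The loop below relies only on the evaluate signature shown in this example; Scorebook may also offer a more direct sweep API, so check the documentation before assuming this is the only way.

# Sweep over temperature by repeating the evaluate call from the example above.
# This uses only the evaluate(...) signature shown in this README.
sweep_results = []
for temperature in (0.0, 0.3, 0.7, 1.0):
    run = evaluate(
        inference_function,
        evaluation_dataset,
        hyperparameters = {"temperature": temperature},
    )
    sweep_results.append(run)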
Adaptive Evaluations with evaluate
To run an adaptive evaluation, use a Trismik adaptive dataset. The CAT (computerized adaptive testing) algorithm dynamically selects items to estimate the model's ability (θ) with minimal standard error and the fewest questions.
Adaptive Evaluation Example
from typing import Any, List

from scorebook import evaluate, login
# 1. Log in with your Trismik API key
login("TRISMIK_API_KEY")
# 2. Define an inference function - pseudocode; Model is a stand-in for your own model or API client
def inference_function(inputs: List[Any], **hyperparameters):
    # Create or configure a model
    model = Model()
    # Run inference over the items selected by the adaptive test
    outputs = model(inputs)
    # Return one output per input, in order
    return outputs
# 3. Run an adaptive evaluation
results = evaluate(
inference_function,
datasets = "trismik/headQA:adaptive", # Adaptive datasets have the ":adaptive" suffix
project_id = "TRISMIK_PROJECT_ID", # Required: Create a project on your Trismik dashboard
experiment_id = "TRISMIK_EXPERIMENT_ID", # Optional: An identifier to upload this run under
)
Adaptive Evaluation Results
{
"aggregate_results": [
{
"dataset": "trismik/headQA:adaptive",
"experiment_id": "TRISMIK_EXPERIMENT_ID",
"project_id": "TRISMIK_PROJECT_ID",
"run_id": "RUN_ID",
"score": {
"theta": 1.2,
"std_error": 0.20
},
"responses": null
}
],
"item_results": []
}
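The ability estimate can be read back from the returned results, assuming the dictionary layout shown above:

# Assumes the result layout shown above ("Adaptive Evaluation Results").
adaptive = results["aggregate_results"][0]
theta = adaptive["score"]["theta"]
std_error = adaptive["score"]["std_error"]
print(f"Estimated ability θ = {theta:.2f} ± {std_error:.2f}")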
Metrics
| Metric | Sync/Async | Aggregate Scores | Item Scores |
|---|---|---|---|
| Accuracy | Sync | Float: Percentage of correct outputs | Boolean: Exact match between output and label |
| ExactMatch | Sync | Float: Percentage of exact string matches | Boolean: Exact match with optional case/whitespace normalization |
| F1 | Sync | Dict[str, Float]: F1 scores per averaging method (macro, micro, weighted) | Boolean: Exact match between output and label |
| Precision | Sync | Dict[str, Float]: Precision scores per averaging method (macro, micro, weighted) | Boolean: Exact match between output and label |
| Recall | Sync | Dict[str, Float]: Recall scores per averaging method (macro, micro, weighted) | Boolean: Exact match between output and label |
| BLEU | Sync | Float: Corpus-level BLEU score | Float: Sentence-level BLEU score |
| ROUGE | Sync | Dict[str, Float]: Average F1 scores per ROUGE type | Dict[str, Float]: F1 scores per ROUGE type |
| BertScore | Sync | Dict[str, Float]: Average precision, recall, and F1 scores | Dict[str, Float]: Precision, recall, and F1 scores per item |
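Multiple metrics can be attached to a single dataset. The snippet below assumes that F1 and Precision are importable from scorebook.metrics in the same way as Accuracy; it reuses the evaluation_items list from the classical evaluation example above.

from scorebook import EvalDataset
from scorebook.metrics import Accuracy, F1, Precision  # assumed importable like Accuracy

# Each metric contributes its own aggregate and per-item scores to the results.
dataset = EvalDataset.from_list(
    name = "basic_questions",
    items = evaluation_items,
    input = "question",
    label = "answer",
    metrics = [Accuracy, F1, Precision],
)

Custom metrics are also supported. The sketch below is illustrative only: the score_item/aggregate method names are assumptions about what a metric needs to provide, not Scorebook's actual extension interface, so consult the library's metric documentation for the real hooks.

from typing import Dict, List

def _digits(text: str) -> str:
    """Keep only digit characters."""
    return "".join(ch for ch in text if ch.isdigit())

# Hypothetical shape of a custom metric -- the class and method names Scorebook
# actually expects may differ; see the library's metric documentation.
class ExactDigitsMetric:
    """Toy metric: an output counts as correct if its digits match the label's digits."""

    name = "exact_digits"

    def score_item(self, output: str, label: str) -> bool:
        # Per-item score: True/False, mirroring Accuracy's item scores.
        return _digits(output) == _digits(label)

    def aggregate(self, item_scores: List[bool]) -> Dict[str, float]:
        # Aggregate score: fraction of items scored True.
        if not item_scores:
            return {self.name: 0.0}
        return {self.name: sum(item_scores) / len(item_scores)}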
Tutorials
For more detailed, locally runnable examples, install the optional example dependencies:
pip install scorebook[examples]
The tutorials/ directory contains comprehensive tutorials as notebooks and code examples:
- tutorials/notebooks: Interactive Jupyter notebooks showcasing Scorebook's capabilities.
- tutorials/examples: Runnable Python examples incrementally implementing Scorebook's features.
Run a notebook:
jupyter notebook tutorials/notebooks
Run an example:
python3 tutorials/examples/1-score/1-scoring_model_accuracy.py
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.
About
Scorebook is developed by Trismik to simplify and speed up your LLM evaluations.
Project details
Download files
Download the file for your platform.
Source Distribution: scorebook-0.0.19.tar.gz (110.5 kB)
Built Distribution: scorebook-0.0.19-py3-none-any.whl (182.5 kB)
File details
Details for the file scorebook-0.0.19.tar.gz.
File metadata
- Download URL: scorebook-0.0.19.tar.gz
- Upload date:
- Size: 110.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 442a9b74d40ee65c45d466dfaa28b5e27317373567e7f8c0e6286f3f508943b5 |
| MD5 | 755fbd89d2cf3cb08d57e742cd9dd3da |
| BLAKE2b-256 | e2a9a0dd5a3027165a7d0fc247b7c05d7ad5444d71b136576acd1e7ba756b7ee |
Provenance
The following attestation bundles were made for scorebook-0.0.19.tar.gz:
Publisher: publish-to-pypi.yml on trismik/scorebook

Attestation:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scorebook-0.0.19.tar.gz
- Subject digest: 442a9b74d40ee65c45d466dfaa28b5e27317373567e7f8c0e6286f3f508943b5
- Sigstore transparency entry: 1003143477
- Permalink: trismik/scorebook@fa19ce77caa8f99d70abfbe5b92ac735f4a32c51
- Branch / Tag: refs/tags/v0.0.19
- Owner: https://github.com/trismik
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@fa19ce77caa8f99d70abfbe5b92ac735f4a32c51
- Trigger Event: push
File details
Details for the file scorebook-0.0.19-py3-none-any.whl.
File metadata
- Download URL: scorebook-0.0.19-py3-none-any.whl
- Upload date:
- Size: 182.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f7ada6f8af728649f6c9a62caf96d860032c98d12dd4a5ee25023b229bd34eb8 |
| MD5 | 9e33bb103b3d71f7d96ba1bc7b95d29e |
| BLAKE2b-256 | de651a7d21bd86a29f20004ee33bc0eb5f7bb78deb4f8881b570f1eb3466592b |
Provenance
The following attestation bundles were made for scorebook-0.0.19-py3-none-any.whl:
Publisher: publish-to-pypi.yml on trismik/scorebook

Attestation:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scorebook-0.0.19-py3-none-any.whl
- Subject digest: f7ada6f8af728649f6c9a62caf96d860032c98d12dd4a5ee25023b229bd34eb8
- Sigstore transparency entry: 1003143538
- Permalink: trismik/scorebook@fa19ce77caa8f99d70abfbe5b92ac735f4a32c51
- Branch / Tag: refs/tags/v0.0.19
- Owner: https://github.com/trismik
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@fa19ce77caa8f99d70abfbe5b92ac735f4a32c51
- Trigger Event: push