Scorebook
A Python library for LLM evaluation
Scorebook is a flexible and extensible framework for evaluating Large Language Models (LLMs). It provides clear contracts for data loading, model inference, and metrics computation, making it easy to run comprehensive evaluations across different datasets, models, and metrics.
Key Features
- Flexible Data Loading: Support for Hugging Face datasets, CSV, JSON, and Python lists
- Model Agnostic: Works with any model or inference provider
- Extensible Metric Engine: Use the metrics we provide or implement your own
- Automated Sweeping: Test multiple model configurations automatically
- Rich Results: Export results to JSON, CSV, or structured formats like pandas DataFrames
Quick Start
Installation
pip install scorebook
For OpenAI integration:
pip install scorebook[openai]
For local model examples:
pip install scorebook[examples]
Basic Usage
from scorebook import EvalDataset, evaluate
from scorebook.metrics import Accuracy
# 1. Create an evaluation dataset
data = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"}
]

dataset = EvalDataset.from_list(
    name="basic_qa",
    label="answer",
    metrics=[Accuracy],
    data=data
)

# 2. Define your inference function
def my_inference_function(items, **hyperparameters):
    # Your model logic here (your_model is a placeholder for your own client)
    predictions = []
    for item in items:
        # Process each item and generate a prediction
        prediction = your_model.predict(item["question"])
        predictions.append(prediction)
    return predictions

# 3. Run evaluation
results = evaluate(my_inference_function, dataset)
print(results)
Core Components
1. Evaluation Datasets
Scorebook supports multiple data sources through the EvalDataset class:
From Hugging Face
dataset = EvalDataset.from_huggingface(
    "TIGER-Lab/MMLU-Pro",
    label="answer",
    metrics=[Accuracy],
    split="validation"
)
From CSV
dataset = EvalDataset.from_csv(
    "dataset.csv",
    label="answer",
    metrics=[Accuracy]
)
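For illustration, a minimal dataset.csv compatible with the call above might look like the following (the question column name is an assumption carried over from the quick-start example; only the answer column is referenced by the label argument):

question,answer
What is 2 + 2?,4
What is the capital of France?,Paris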
From JSON
dataset = EvalDataset.from_json(
    "dataset.json",
    label="answer",
    metrics=[Accuracy]
)
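Likewise, a small hypothetical dataset.json for the call above, assuming the loader accepts a list of records with field names mirroring the quick-start data:

[
  {"question": "What is 2 + 2?", "answer": "4"},
  {"question": "What is the capital of France?", "answer": "Paris"}
]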
From Python List
dataset = EvalDataset.from_list(
    name="custom_dataset",
    label="answer",
    metrics=[Accuracy],
    data=[{"question": "...", "answer": "..."}]
)
2. Model Integration
Scorebook offers two approaches for model integration:
Inference Functions
A single function that handles the complete pipeline:
def inference_function(eval_items, **hyperparameters):
    results = []
    for item in eval_items:
        # 1. Preprocessing
        prompt = format_prompt(item)
        # 2. Inference
        output = model.generate(prompt)
        # 3. Postprocessing
        prediction = extract_answer(output)
        results.append(prediction)
    return results
Inference Pipelines
Modular approach with separate stages:
from scorebook.types.inference_pipeline import InferencePipeline

def preprocessor(item):
    return {"messages": [{"role": "user", "content": item["question"]}]}

def inference_function(processed_items, **hyperparameters):
    # `model` is a placeholder for your own model client
    return [model.generate(item) for item in processed_items]

def postprocessor(output):
    return output.strip()

pipeline = InferencePipeline(
    model="my-model",
    preprocessor=preprocessor,
    inference_function=inference_function,
    postprocessor=postprocessor
)

results = evaluate(pipeline, dataset)
3. Metrics System
Built-in Metrics
- Accuracy: Percentage of correct predictions
- Precision: Fraction of positive predictions that are correct
from scorebook.metrics import Accuracy, Precision

dataset = EvalDataset.from_list(
    name="test",
    label="answer",
    metrics=[Accuracy, Precision],  # Multiple metrics
    data=data
)
Custom Metrics
Create custom metrics by extending MetricBase:
from scorebook.metrics import MetricBase, MetricRegistry

@MetricRegistry.register()
class F1Score(MetricBase):
    @staticmethod
    def score(outputs, labels):
        # Calculate F1 score
        item_scores = [calculate_f1_item(o, l) for o, l in zip(outputs, labels)]
        aggregate_score = {"f1": sum(item_scores) / len(item_scores)}
        return aggregate_score, item_scores

# Use by string name or class
dataset = EvalDataset.from_list(..., metrics=["f1score"])
# or
dataset = EvalDataset.from_list(..., metrics=[F1Score])
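The calculate_f1_item helper above is not part of Scorebook; it stands in for whatever per-item scoring you need. A minimal sketch using token-overlap F1, assuming string outputs and labels:

def calculate_f1_item(output, label):
    # Token-overlap F1 between a predicted string and a reference string
    pred_tokens = str(output).lower().split()
    label_tokens = str(label).lower().split()
    if not pred_tokens or not label_tokens:
        return float(pred_tokens == label_tokens)
    # Count tokens shared between prediction and reference (multiset overlap)
    remaining = {}
    for tok in label_tokens:
        remaining[tok] = remaining.get(tok, 0) + 1
    common = 0
    for tok in pred_tokens:
        if remaining.get(tok, 0) > 0:
            common += 1
            remaining[tok] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(label_tokens)
    return 2 * precision * recall / (precision + recall)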
4. Hyperparameter Sweeping
Test multiple configurations automatically:
hyperparameters = {
    "temperature": [0.7, 0.9, 1.0],
    "max_tokens": [50, 100, 150],
    "top_p": [0.8, 0.9]
}

results = evaluate(
    inference_function,
    dataset,
    hyperparameters=hyperparameters,
    score_type="all"
)
# Results include all combinations: 3 × 3 × 2 = 18 configurations
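Each swept configuration is presumably forwarded to your inference function as keyword arguments (an assumption based on the **hyperparameters signature used throughout this README). A sketch of a function that consumes them, where my_model.generate is a placeholder for your own client:

def inference_function(items, **hyperparameters):
    temperature = hyperparameters.get("temperature", 1.0)
    max_tokens = hyperparameters.get("max_tokens", 100)
    top_p = hyperparameters.get("top_p", 1.0)
    predictions = []
    for item in items:
        # Pass the current configuration to your own model client
        output = my_model.generate(
            item["question"],
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=top_p,
        )
        predictions.append(output)
    return predictions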
5. Results and Export
Control result format with score_type:
# Only aggregate scores (default)
results = evaluate(model, dataset, score_type="aggregate")
# Only per-item scores
results = evaluate(model, dataset, score_type="item")
# Both aggregate and per-item
results = evaluate(model, dataset, score_type="all")
Export results:
# Get EvalResult objects for advanced usage
results = evaluate(model, dataset, return_type="object")
# Export to files
for result in results:
    result.to_json("results.json")
    result.to_csv("results.csv")
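Since results export to CSV, one simple route to the pandas DataFrames mentioned in the feature list is to read the exported file back. The exact column layout of the exported CSV is not documented here, so treat this as a sketch:

import pandas as pd

df = pd.read_csv("results.csv")
print(df.head())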
OpenAI Integration
Scorebook includes built-in OpenAI support for both single requests and batch processing:
from scorebook.inference.openai import responses, batch
from scorebook.types.inference_pipeline import InferencePipeline
# For single requests
pipeline = InferencePipeline(
    model="gpt-4o-mini",
    preprocessor=format_for_openai,
    inference_function=responses,
    postprocessor=extract_response
)

# For batch processing (more efficient for large datasets)
batch_pipeline = InferencePipeline(
    model="gpt-4o-mini",
    preprocessor=format_for_openai,
    inference_function=batch,
    postprocessor=extract_response
)
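The format_for_openai and extract_response helpers referenced above are user-supplied, not part of Scorebook. Hypothetical implementations, assuming the preprocessor should produce a chat-style messages list (as in the earlier pipeline example) and the postprocessor receives raw text output:

def format_for_openai(item):
    # Build a chat-style message list from an eval item
    return {"messages": [{"role": "user", "content": item["question"]}]}

def extract_response(output):
    # Normalize the raw completion into a prediction string
    return str(output).strip()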
Examples
The examples/ directory contains comprehensive examples:
- basic_example.py: Local model evaluation with Hugging Face
- openai_responses_api.py: OpenAI API integration
- openai_batch_api.py: OpenAI Batch API for large-scale evaluation
- hyperparam_sweep.py: Hyperparameter optimization
- scorebook_showcase.ipynb: Interactive Jupyter notebook tutorial
Run an example:
cd examples/
python basic_example.py --output-dir ./my_results
Architecture
Scorebook follows a modular architecture:
┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│ EvalDataset      │   │ Inference        │   │ Metrics          │
│                  │   │ Pipeline         │   │                  │
│ • Data Loading   │   │                  │   │ • Accuracy       │
│ • HF Integration │   │ • Preprocess     │   │ • Precision      │
│ • CSV/JSON       │   │ • Inference      │   │ • Custom         │
│ • Validation     │   │ • Postprocess    │   │ • Registry       │
└──────────────────┘   └──────────────────┘   └──────────────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                    ┌───────────────────────┐
                    │ evaluate()            │
                    │                       │
                    │ • Orchestration       │
                    │ • Progress Tracking   │
                    │ • Result Formatting   │
                    │ • Export Options      │
                    └───────────────────────┘
Use Cases
Scorebook is designed for:
- Model Benchmarking: Compare different models on standard datasets
- Hyperparameter Optimization: Find optimal model configurations
- Dataset Analysis: Understand model performance across different data types
- A/B Testing: Compare model versions or approaches
- Research Experiments: Reproducible evaluation workflows
- Production Monitoring: Track model performance over time
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.
About
Scorebook is developed by Trismik to speed up your LLM evaluation.
For more examples and detailed documentation, check out the Jupyter notebook in examples/scorebook_showcase.ipynb.
File details
Details for the file scorebook-0.0.13.tar.gz.
File metadata
- Download URL: scorebook-0.0.13.tar.gz
- Upload date:
- Size: 54.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0765338fe0b9b4fa1d99b330e16cb7e2da18545261e3cfa423dedcdde654e961 |
| MD5 | b676ef7ebb7fc04413a4df6d18ed9a18 |
| BLAKE2b-256 | 848e4ada39ead14acf05fc07ed8c8ce0479615b1ed8f98e67755da3d33ad08eb |
Provenance
The following attestation bundles were made for scorebook-0.0.13.tar.gz:
Publisher: publish-to-pypi.yml on trismik/scorebook
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scorebook-0.0.13.tar.gz
- Subject digest: 0765338fe0b9b4fa1d99b330e16cb7e2da18545261e3cfa423dedcdde654e961
- Sigstore transparency entry: 673183626
- Sigstore integration time:
- Permalink: trismik/scorebook@1f55534e9750f222d0238d82253952846fd2eeac
- Branch / Tag: refs/tags/v0.0.13
- Owner: https://github.com/trismik
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@1f55534e9750f222d0238d82253952846fd2eeac
- Trigger Event: push
File details
Details for the file scorebook-0.0.13-py3-none-any.whl.
File metadata
- Download URL: scorebook-0.0.13-py3-none-any.whl
- Upload date:
- Size: 73.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf3853f916883579275e10358c726afd08f3520d0f0822c212cb9a9a9c24bf92 |
| MD5 | 4d36186e5ae8c2c193dc82ca5e3d01ea |
| BLAKE2b-256 | f3953e023e21109b87e52aea3f5f588715e53f5a86e41f2dc52a42aacf8fe36e |
Provenance
The following attestation bundles were made for scorebook-0.0.13-py3-none-any.whl:
Publisher: publish-to-pypi.yml on trismik/scorebook
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scorebook-0.0.13-py3-none-any.whl
- Subject digest: bf3853f916883579275e10358c726afd08f3520d0f0822c212cb9a9a9c24bf92
- Sigstore transparency entry: 673183638
- Sigstore integration time:
- Permalink: trismik/scorebook@1f55534e9750f222d0238d82253952846fd2eeac
- Branch / Tag: refs/tags/v0.0.13
- Owner: https://github.com/trismik
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@1f55534e9750f222d0238d82253952846fd2eeac
- Trigger Event: push