Scorebook

A Python library for LLM evaluation

Scorebook is a flexible and extensible framework for evaluating Large Language Models (LLMs). It provides clear contracts for data loading, model inference, and metrics computation, making it easy to run comprehensive evaluations across different datasets, models, and metrics.

✨ Key Features

  • 🔌 Flexible Data Loading: Support for Hugging Face datasets, CSV, JSON, and Python lists
  • 🚀 Model Agnostic: Works with any model or inference provider
  • 📊 Extensible Metric Engine: Use the metrics we provide or implement your own
  • 🔄 Automated Sweeping: Test multiple model configurations automatically
  • 📈 Rich Results: Export results to JSON, CSV, or structured formats like pandas DataFrames

🚀 Quick Start

Installation

pip install scorebook

For OpenAI integration:

pip install scorebook[openai]

For local model examples:

pip install scorebook[examples]

Basic Usage

from scorebook import EvalDataset, evaluate
from scorebook.metrics import Accuracy

# 1. Create an evaluation dataset
data = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"}
]

dataset = EvalDataset.from_list(
    name="basic_qa",
    label="answer",
    metrics=[Accuracy],
    data=data
)

# 2. Define your inference function
def my_inference_function(items, **hyperparameters):
    # Your model logic here
    predictions = []
    for item in items:
        # Process each item and generate prediction
        prediction = your_model.predict(item["question"])
        predictions.append(prediction)
    return predictions

# 3. Run evaluation
results = evaluate(my_inference_function, dataset)
print(results)
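
The inference function above calls a placeholder (`your_model`); any callable you provide will do. As a minimal, self-contained sketch, a canned-answer stub (purely illustrative, not part of Scorebook) is enough to run the quick start end to end:

# A stand-in "model": hard-coded answers for the three questions above.
CANNED_ANSWERS = {
    "What is 2 + 2?": "4",
    "What is the capital of France?": "Paris",
    "Who wrote Romeo and Juliet?": "William Shakespeare",
}

def stub_inference_function(items, **hyperparameters):
    # Look each question up; unknown questions get an empty prediction.
    return [CANNED_ANSWERS.get(item["question"], "") for item in items]

results = evaluate(stub_inference_function, dataset)
print(results)  # expect perfect accuracy on this toy dataset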

📊 Core Components

1. Evaluation Datasets

Scorebook supports multiple data sources through the EvalDataset class:

From Hugging Face

dataset = EvalDataset.from_huggingface(
    "TIGER-Lab/MMLU-Pro",
    label="answer",
    metrics=[Accuracy],
    split="validation"
)

From CSV

dataset = EvalDataset.from_csv(
    "dataset.csv",
    label="answer",
    metrics=[Accuracy]
)

From JSON

dataset = EvalDataset.from_json(
    "dataset.json",
    label="answer",
    metrics=[Accuracy]
)

From Python List

dataset = EvalDataset.from_list(
    name="custom_dataset",
    label="answer",
    metrics=[Accuracy],
    data=[{"question": "...", "answer": "..."}]
)

2. Model Integration

Scorebook offers two approaches for model integration:

Inference Functions

A single function that handles the complete pipeline:

def inference_function(eval_items, **hyperparameters):
    # format_prompt, model, and extract_answer are your own code;
    # they are shown here only to illustrate the three stages.
    results = []
    for item in eval_items:
        # 1. Preprocessing
        prompt = format_prompt(item)

        # 2. Inference
        output = model.generate(prompt)

        # 3. Postprocessing
        prediction = extract_answer(output)
        results.append(prediction)

    return results

Inference Pipelines

Modular approach with separate stages:

from scorebook.types.inference_pipeline import InferencePipeline

def preprocessor(item):
    return {"messages": [{"role": "user", "content": item["question"]}]}

def inference_function(processed_items, **hyperparameters):
    # `model` is your own model or client object (illustrative)
    return [model.generate(item) for item in processed_items]

def postprocessor(output):
    return output.strip()

pipeline = InferencePipeline(
    model="my-model",
    preprocessor=preprocessor,
    inference_function=inference_function,
    postprocessor=postprocessor
)

results = evaluate(pipeline, dataset)

3. Metrics System

Built-in Metrics

  • Accuracy: Percentage of correct predictions
  • Precision: Fraction of positive predictions that are correct

from scorebook.metrics import Accuracy, Precision

dataset = EvalDataset.from_list(
    name="test",
    label="answer",
    metrics=[Accuracy, Precision],  # Multiple metrics
    data=data
)

Custom Metrics

Create custom metrics by extending MetricBase:

from scorebook.metrics import MetricBase, MetricRegistry

@MetricRegistry.register()
class F1Score(MetricBase):
    @staticmethod
    def score(outputs, labels):
        # Calculate F1 score
        item_scores = [calculate_f1_item(o, l) for o, l in zip(outputs, labels)]
        aggregate_score = {"f1": sum(item_scores) / len(item_scores)}
        return aggregate_score, item_scores

# Use by string name or class
dataset = EvalDataset.from_list(..., metrics=["f1score"])
# or
dataset = EvalDataset.from_list(..., metrics=[F1Score])
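
The per-item helper `calculate_f1_item` is not defined above; a simple token-overlap F1 (a common choice for free-text answers) could serve. Both the helper name and the scoring choice are illustrative, not part of Scorebook:

def calculate_f1_item(output, label):
    # Token-overlap F1 between a predicted string and a reference string.
    pred_tokens = str(output).lower().split()
    gold_tokens = str(label).lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Count tokens shared between prediction and reference (with multiplicity).
    gold_remaining = list(gold_tokens)
    common = 0
    for token in pred_tokens:
        if token in gold_remaining:
            gold_remaining.remove(token)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)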

4. Hyperparameter Sweeping

Test multiple configurations automatically:

hyperparameters = {
    "temperature": [0.7, 0.9, 1.0],
    "max_tokens": [50, 100, 150],
    "top_p": [0.8, 0.9]
}

results = evaluate(
    inference_function,
    dataset,
    hyperparameters=hyperparameters,
    score_type="all"
)

# Results include all combinations: 3 × 3 × 2 = 18 configurations
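
Each swept configuration is passed to your inference function as keyword arguments (the `**hyperparameters` in the signatures above). A sketch of a function that actually consumes them, assuming the OpenAI Python client (the client setup and model name here are illustrative, not Scorebook APIs):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def openai_inference_function(items, temperature=1.0, max_tokens=100, top_p=1.0, **hyperparameters):
    # Called once per configuration; the swept values arrive as keyword arguments.
    predictions = []
    for item in items:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": item["question"]}],
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=top_p,
        )
        predictions.append(response.choices[0].message.content)
    return predictions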

5. Results and Export

Control result format with score_type:

# Only aggregate scores (default)
results = evaluate(model, dataset, score_type="aggregate")

# Only per-item scores
results = evaluate(model, dataset, score_type="item")

# Both aggregate and per-item
results = evaluate(model, dataset, score_type="all")

Export results:

# Get EvalResult objects for advanced usage
results = evaluate(model, dataset, return_type="object")

# Export to files
for result in results:
    result.to_json("results.json")
    result.to_csv("results.csv")

🔧 OpenAI Integration

Scorebook includes built-in OpenAI support for both single requests and batch processing:

from scorebook.inference.openai import responses, batch
from scorebook.types.inference_pipeline import InferencePipeline

# For single requests
pipeline = InferencePipeline(
    model="gpt-4o-mini",
    preprocessor=format_for_openai,
    inference_function=responses,
    postprocessor=extract_response
)

# For batch processing (more efficient for large datasets)
batch_pipeline = InferencePipeline(
    model="gpt-4o-mini",
    preprocessor=format_for_openai,
    inference_function=batch,
    postprocessor=extract_response
)
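
The `format_for_openai` and `extract_response` helpers are not shown above; their exact shapes depend on what the `responses` and `batch` inference functions expect and return in your installed version. A rough sketch, under those assumptions:

# Hypothetical helpers -- check the input/output shapes of
# scorebook.inference.openai.responses / batch before relying on these.
def format_for_openai(item):
    # Turn an eval item into a chat-style message list.
    return {"messages": [{"role": "user", "content": item["question"]}]}

def extract_response(output):
    # Assume each raw output can be coerced to text; strip surrounding whitespace.
    return str(output).strip()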

📋 Examples

The examples/ directory contains comprehensive examples:

  • basic_example.py: Local model evaluation with Hugging Face
  • openai_responses_api.py: OpenAI API integration
  • openai_batch_api.py: OpenAI Batch API for large-scale evaluation
  • hyperparam_sweep.py: Hyperparameter optimization
  • scorebook_showcase.ipynb: Interactive Jupyter notebook tutorial

Run an example:

cd examples/
python basic_example.py --output-dir ./my_results

🏗️ Architecture

Scorebook follows a modular architecture:

┌─────────────────┐    ┌──────────────┐    ┌─────────────────┐
│   EvalDataset   │    │  Inference   │    │     Metrics     │
│                 │    │   Pipeline   │    │                 │
│ • Data Loading  │    │              │    │ • Accuracy      │
│ • HF Integration│    │ • Preprocess │    │ • Precision     │
│ • CSV/JSON      │    │ • Inference  │    │ • Custom        │
│ • Validation    │    │ • Postprocess│    │ • Registry      │
└─────────────────┘    └──────────────┘    └─────────────────┘
         │                      │                      │
         └──────────────────────┼────────────────────┘
                                │
                    ┌─────────────────────┐
                    │     evaluate()      │
                    │                     │
                    │ • Orchestration     │
                    │ • Progress Tracking │
                    │ • Result Formatting │
                    │ • Export Options    │
                    └─────────────────────┘

🎯 Use Cases

Scorebook is designed for:

  • 🏆 Model Benchmarking: Compare different models on standard datasets
  • ⚙️ Hyperparameter Optimization: Find optimal model configurations
  • 📊 Dataset Analysis: Understand model performance across different data types
  • 🔄 A/B Testing: Compare model versions or approaches
  • 🔬 Research Experiments: Reproducible evaluation workflows
  • 📈 Production Monitoring: Track model performance over time

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🏢 About

Scorebook is developed by Trismik to speed up LLM evaluation.


For more examples and detailed documentation, check out the Jupyter notebook at examples/scorebook_showcase.ipynb.
