
Scorebook

A Python library for model evaluation


Scorebook provides a flexible and extensible framework for evaluating models such as large language models (LLMs). Easily evaluate any model using evaluation datasets from Hugging Face such as MMLU-Pro, HellaSwag, and CommonSenseQA, or with data from any other source. Evaluations calculate scores for any number of specified metrics such as accuracy, precision, and recall, as well as any custom defined metrics, including LLM as a judge (LLMaJ).

Use Cases

Scorebook's evaluations can be used for:

  • Model Benchmarking: Compare different models on standard datasets.
  • Model Optimization: Find optimal model configurations.
  • Iterative Experimentation: Reproducible evaluation workflows.

Key Features

  • Model Agnostic: Evaluate any model, whether it runs locally or is deployed in the cloud.
  • Dataset Agnostic: Create evaluation datasets from Hugging Face datasets or any other source.
  • Extensible Metric Engine: Use Scorebook's built-in metrics or implement your own.
  • Hyperparameter Sweeping: Evaluate over multiple model hyperparameter configurations.
  • Adaptive Evaluations: Run Trismik's ultra-fast adaptive evaluations.
  • Trismik Integration: Upload evaluations to Trismik's platform.
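As an illustration of what a hyperparameter sweep expands to, one common convention is a Cartesian grid over candidate values. This is a plain-Python sketch of that idea, not necessarily Scorebook's exact sweep semantics:

```python
from itertools import product

# Hypothetical sweep specification: each key maps to candidate values
sweep = {"temperature": [0.0, 0.7], "top_p": [0.9, 1.0]}

# Expand into one configuration dict per combination
keys = list(sweep)
configs = [dict(zip(keys, values)) for values in product(*sweep.values())]
print(len(configs))  # 4 configurations
```

Each resulting configuration would then be passed to the inference function for a separate evaluation run.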

Installation

pip install scorebook

Scoring Model Outputs

Scorebook's score function evaluates pre-generated model outputs against their labels.

Score Example

from scorebook import score
from scorebook.metrics import Accuracy

# 1. Prepare a list of generated model outputs and labels
model_predictions = [
    {"input": "What is 2 + 2?", "output": "4", "label": "4"},
    {"input": "What is the capital of France?", "output": "London", "label": "Paris"},
    {"input": "Who wrote Romeo and Juliet?", "output": "William Shakespeare", "label": "William Shakespeare"},
    {"input": "What is the chemical symbol for gold?", "output": "Au", "label": "Au"},
]

# 2. Score the model's predictions against labels using metrics
results = score(
    items = model_predictions,
    metrics = Accuracy,
)

Score Results:

{
    "aggregate_results": [
        {
            "dataset": "scored_items",
            "accuracy": 0.75
        }
    ],
    "item_results": [
        {
            "id": 0,
            "dataset": "scored_items",
            "input": "What is 2 + 2?",
            "output": "4",
            "label": "4",
            "accuracy": true
        }
        // ... additional items
    ]
}
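The aggregate accuracy is just the fraction of items whose per-item accuracy is true. Independent of Scorebook, the arithmetic behind the 0.75 above looks like this:

```python
# Per-item accuracy booleans from the example above:
# three of the four outputs match their labels
item_accuracies = [True, False, True, True]

# Aggregate accuracy is the mean of the per-item booleans
aggregate_accuracy = sum(item_accuracies) / len(item_accuracies)
print(aggregate_accuracy)  # 0.75
```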

Classical Evaluations

Running a classical evaluation in Scorebook executes model inference on every item in the dataset, then scores the generated outputs using the dataset’s specified metrics to quantify model performance.

Classical Evaluation example:

from typing import Any, List

from scorebook import evaluate, EvalDataset
from scorebook.metrics import Accuracy

# 1. Create an evaluation dataset
evaluation_items = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"}
]

evaluation_dataset = EvalDataset.from_list(
    name = "basic_questions",
    items = evaluation_items,
    input = "question",
    label = "answer",
    metrics = Accuracy,
)

# 2. Define an inference function - This is a pseudocode example
def inference_function(inputs: List[Any], **hyperparameters):

    # Create or call a model
    model = Model()
    model.temperature = hyperparameters.get("temperature")

    # Call model inference
    model_outputs = model(inputs)

    # Return outputs
    return model_outputs

# 3. Run evaluation
evaluation_results = evaluate(
    inference_function,
    evaluation_dataset,
    hyperparameters = {"temperature": 0.7}
)

Evaluation Results:

{
    "aggregate_results": [
        {
            "dataset": "basic_questions",
            "temperature": 0.7,
            "accuracy": 1.0,
            "run_completed": true
        }
    ],
    "item_results": [
        {
            "id": 0,
            "dataset": "basic_questions",
            "input": "What is 2 + 2?",
            "output": "4",
            "label": "4",
            "temperature": 0.7,
            "accuracy": true
        }
        // ... additional items
    ]
}
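As a concrete stand-in for the pseudocode Model above, a toy lookup-table "model" shows the shape an inference function takes: one output per input, in order. The answer key here is invented purely for illustration:

```python
from typing import Any, List

# Toy answer key standing in for a real model (illustrative only)
ANSWER_KEY = {
    "What is 2 + 2?": "4",
    "What is the capital of France?": "Paris",
    "Who wrote Romeo and Juliet?": "William Shakespeare",
}

def inference_function(inputs: List[Any], **hyperparameters) -> List[str]:
    """Return one output per input, preserving order."""
    # A real implementation would call a model here, configured with
    # hyperparameters such as temperature
    return [ANSWER_KEY.get(question, "") for question in inputs]

outputs = inference_function(["What is 2 + 2?", "What is the capital of France?"])
print(outputs)  # ['4', 'Paris']
```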

Adaptive Evaluations with evaluate

To run an adaptive evaluation, use a Trismik adaptive dataset. The computerized adaptive testing (CAT) algorithm dynamically selects items to estimate the model's ability (θ) with minimal standard error in as few questions as possible.

Adaptive Evaluation Example

from typing import Any, List

from scorebook import evaluate, login

# 1. Log in with your Trismik API key
login("TRISMIK_API_KEY")

# 2. Define an inference function
def inference_function(inputs: List[Any], **hyperparameters):

    # Create or call a model
    model = Model()

    # Call model inference
    outputs = model(inputs)

    # Return outputs
    return outputs

# 3. Run an adaptive evaluation
results = evaluate(
    inference_function,
    datasets = "trismik/headQA:adaptive",    # Adaptive datasets have the ":adaptive" suffix
    project_id = "TRISMIK_PROJECT_ID",       # Required: Create a project on your Trismik dashboard
    experiment_id = "TRISMIK_EXPERIMENT_ID", # Optional: An identifier to upload this run under
)

Adaptive Evaluation Results

{
    "aggregate_results": [
        {
            "dataset": "trismik/headQA:adaptive",
            "experiment_id": "TRISMIK_EXPERIMENT_ID",
            "project_id": "TRISMIK_PROJECT_ID",
            "run_id": "RUN_ID",
            "score": {
                "theta": 1.2,
                "std_error": 0.20
            },
            "responses": null
        }
    ],
    "item_results": []
}
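The adaptive score reports an ability estimate theta together with its standard error. A rough 95% confidence interval for the values shown above follows the usual normal approximation:

```python
# Values from the example results above
theta, std_error = 1.2, 0.20

# 95% interval under a normal approximation: theta ± 1.96 * SE
lower = theta - 1.96 * std_error
upper = theta + 1.96 * std_error
print(f"theta in [{lower:.2f}, {upper:.2f}]")  # theta in [0.81, 1.59]
```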

Metrics

| Metric | Sync/Async | Aggregate Scores | Item Scores |
| --- | --- | --- | --- |
| Accuracy | Sync | Float: Percentage of correct outputs | Boolean: Exact match between output and label |
| ExactMatch | Sync | Float: Percentage of exact string matches | Boolean: Exact match with optional case/whitespace normalization |
| F1 | Sync | Dict[str, Float]: F1 scores per averaging method (macro, micro, weighted) | Boolean: Exact match between output and label |
| Precision | Sync | Dict[str, Float]: Precision scores per averaging method (macro, micro, weighted) | Boolean: Exact match between output and label |
| Recall | Sync | Dict[str, Float]: Recall scores per averaging method (macro, micro, weighted) | Boolean: Exact match between output and label |
| BLEU | Sync | Float: Corpus-level BLEU score | Float: Sentence-level BLEU score |
| ROUGE | Sync | Dict[str, Float]: Average F1 scores per ROUGE type | Dict[str, Float]: F1 scores per ROUGE type |
| BertScore | Sync | Dict[str, Float]: Average precision, recall, and F1 scores | Dict[str, Float]: Precision, recall, and F1 scores per item |
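As an illustration of the normalization the ExactMatch row describes, here is a sketch of case- and whitespace-insensitive matching. This is not Scorebook's implementation, just the semantics in plain Python:

```python
def exact_match(output: str, label: str, normalize: bool = True) -> bool:
    """Exact string match, optionally ignoring case and surrounding whitespace."""
    if normalize:
        output, label = output.strip().lower(), label.strip().lower()
    return output == label

print(exact_match(" Paris ", "paris"))       # True: normalized comparison
print(exact_match("Paris", "paris", False))  # False: strict comparison
```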

Tutorials

For more detailed, locally runnable examples, install the examples extras:

pip install scorebook[examples]

The tutorials/ directory contains comprehensive tutorials as notebooks and code examples:

  • tutorials/notebooks: Interactive Jupyter Notebooks showcasing Scorebook's capabilities.
  • tutorials/examples: Runnable Python examples incrementally implementing Scorebook's features.

Run a notebook:

jupyter notebook tutorials/notebooks

Run an example:

python3 tutorials/examples/1-score/1-scoring_model_accuracy.py

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

Scorebook is developed by Trismik to simplify and speed up your LLM evaluations.
