Kaggle Benchmarks

kaggle-benchmarks is a Python library designed to help you rigorously evaluate AI models on tasks that matter to you. It provides a structured framework for defining tasks, interacting with models, and asserting the correctness of their outputs.

Features

  • Define Custom Tasks: Easily define evaluation tasks using a simple @kbench.task decorator.
  • Interact with Multiple LLMs: Programmatically interact with and compare various large language models.
  • Structured & Multimodal I/O: Go beyond plain text. Get structured dataclass or pydantic objects from models and provide image inputs.
  • Tool Use: Empower models with tools, including a built-in Python interpreter to execute code.
  • Robust Assertions: Use a rich set of built-in assertions, or create your own, to validate model outputs (see the sketch after this list).
  • Dataset Evaluation: Run benchmarks over entire datasets (e.g., pandas DataFrames) to get aggregate performance metrics.
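
As a taste of the assertion model, a custom check can be a plain function that raises AssertionError alongside the built-in helpers. In the sketch below, @kbench.task, llm.prompt, and kbench.assertions.assert_contains_regex are the API shown in the Usage section further down; assert_max_words is a hypothetical helper added purely for illustration.

import kaggle_benchmarks as kbench

def assert_max_words(text: str, limit: int) -> None:
    """Hypothetical custom assertion: fail when a response runs too long."""
    n = len(text.split())
    if n > limit:
        raise AssertionError(f"Expected at most {limit} words, got {n}.")

@kbench.task(name="concise_definition")
def define_term(llm, term: str):
    """Asks for a one-sentence definition, then validates it."""
    response = llm.prompt(f"Define '{term}' in one sentence.")
    # Built-in assertion (shown again in Usage below) plus the custom check.
    kbench.assertions.assert_contains_regex(
        f"(?i){term}", response, expectation="Definition should mention the term."
    )
    assert_max_words(response, limit=40)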

Getting Started

On Kaggle

The easiest way to use kaggle-benchmarks is directly within a Kaggle notebook.

Prerequisites: A Kaggle account.

Installation: No installation is needed! For early access, simply navigate to https://www.kaggle.com/benchmarks/tasks/new. This will create a new Kaggle notebook with the library and its dependencies pre-installed and ready to use.

Data Usage and Leaderboard Generation: When running in a Kaggle notebook, each benchmark task outputs a task file and associated run files. These files are used to build the benchmark entity and display its results on a Kaggle leaderboard. An example can be seen on the ICML 2025 Experts Leaderboard.

Local Development

For local development, you will need to configure your environment to use the Kaggle Model Proxy.

Prerequisites:

  • Python 3.11+
  • Git
  • uv

Installation & Configuration:

  1. Clone the repository:

    git clone https://github.com/Kaggle/kaggle-benchmarks.git
    cd kaggle-benchmarks
    
  2. Create a virtual environment and install dependencies using uv:

    # Create and activate the virtual environment
    uv venv
    source .venv/bin/activate  # On Windows, use `.venv\Scripts\activate`
    
    # Install dependencies
    uv pip install -e .
    
  3. Obtain a Kaggle MODEL_PROXY_API_KEY for access.

  4. Create a .env file in the project root and add your configuration:

    MODEL_PROXY_URL=https://mp-staging.kaggle.net/models/openapi
    MODEL_PROXY_API_KEY={your_token}
    LLM_DEFAULT=google/gemini-2.5-flash
    LLM_DEFAULT_EVAL=openai/gpt-4o
    LLMS_AVAILABLE=anthropic/claude-sonnet-4,google/gemini-2.5-flash,meta/llama-3.1-70b,openai/gpt-4o
    PYTHONPATH=src
    
    • LLM_DEFAULT: Sets the model identifier for kbench.llm, the default model used for running tasks.
    • LLM_DEFAULT_EVAL: Sets the model identifier for kbench.judge_llm, which is typically a model used for evaluation or judging the outputs of other models.
    • LLMS_AVAILABLE: A comma-separated list of models authorized for use by your proxy token.

    Note: The LLM_DEFAULT, LLM_DEFAULT_EVAL, and LLMS_AVAILABLE variables depend on the models authorized by your proxy token.
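
Before running tasks locally, it can help to confirm the file is actually picked up. The sanity-check script below is hypothetical and uses the third-party python-dotenv package; it is not part of kaggle-benchmarks, which may load .env on its own.

# check_env.py -- hypothetical sanity check; python-dotenv is a third-party package.
import os
from dotenv import load_dotenv  # uv pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

# Print the non-secret values (deliberately not printing MODEL_PROXY_API_KEY).
for key in ("MODEL_PROXY_URL", "LLM_DEFAULT", "LLM_DEFAULT_EVAL"):
    print(f"{key} = {os.environ.get(key, '<missing>')}")

# LLMS_AVAILABLE is a comma-separated list of authorized model identifiers.
models = [m for m in os.environ.get("LLMS_AVAILABLE", "").split(",") if m]
print(f"{len(models)} models authorized for this token:", models)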

Usage

Here is a simple example of a benchmark that asks a model a riddle and checks its answer.

import kaggle_benchmarks as kbench

@kbench.task(name="simple_riddle")
def solve_riddle(llm, riddle: str, answer: str):
    """Asks a riddle and checks for a keyword in the answer."""
    response = llm.prompt(riddle)

    # Assert that the model's response contains the answer, ignoring case.
    kbench.assertions.assert_contains_regex(
        f"(?i){answer}", response, expectation="LLM should give the right answer."
    )

# Execute the task
solve_riddle.run(
    llm=kbench.llm, # Uses the default LLM
    riddle="What gets wetter as it dries?",
    answer="Towel",
)
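
The Features section advertises evaluation over entire datasets; since the dedicated dataset-runner API is not shown here, the sketch below simply loops the task's .run() method over a pandas DataFrame, reusing solve_riddle and kbench from the example above and assuming that a failed assertion surfaces as AssertionError.

import pandas as pd

# A tiny evaluation set; in practice this could come from a CSV or a Kaggle dataset.
riddles = pd.DataFrame(
    [
        {"riddle": "What gets wetter as it dries?", "answer": "Towel"},
        {"riddle": "What has keys but can't open locks?", "answer": "Piano"},
    ]
)

# Sketch only: kaggle-benchmarks may ship a first-class dataset runner; here we
# just loop, assuming a failed assertion propagates as AssertionError.
passed = 0
for row in riddles.itertuples(index=False):
    try:
        solve_riddle.run(llm=kbench.llm, riddle=row.riddle, answer=row.answer)
        passed += 1
    except AssertionError as err:
        print(f"FAIL ({row.riddle!r}): {err}")

print(f"{passed}/{len(riddles)} riddles passed")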

For a detailed walkthrough of the library's features, check out our documentation.

Supported Models

This library supports a wide range of models available through Kaggle's backend. The exact models available to you depend on your environment (Kaggle Notebook vs. local proxy token).

Contributing

Contributions are welcome! Please refer to our Contribution Guidelines for more details.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
