Kaggle Benchmarks
kaggle-benchmarks is a Python library designed to help you rigorously evaluate AI models on tasks that matter to you. It provides a structured framework for defining tasks, interacting with models, and asserting the correctness of their outputs.
This is especially useful for:
- Reproducibility: Capture the exact inputs, outputs, and model interactions for later review.
- Complex Evaluations: Go beyond simple string matching to test for code execution, tool use, and multi-turn conversational capabilities.
- Rapid Prototyping: Quickly test a model's capabilities on a new, creative task you've designed.
Features
- Define Custom Tasks: Easily define evaluation tasks using a simple @kbench.task decorator.
- Interact with Multiple LLMs: Programmatically interact with and compare various large language models.
- Structured & Multimodal I/O: Go beyond plain text. Get structured dataclass or pydantic objects from models and provide image, audio, and video inputs.
- Tool Use: Empower models with tools, including a built-in Python interpreter to execute code.
- Robust Assertions: Use a rich set of built-in assertions or create your own to validate model outputs.
- Dataset Evaluation: Run benchmarks over entire datasets (e.g., pandas DataFrames) to get aggregate performance metrics.
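To make the idea of dataset-level evaluation concrete, here is a minimal standard-library sketch of scoring several rows and aggregating a pass rate. It does not use the kaggle-benchmarks API: `fake_model`, `evaluate`, and `RowResult` are hypothetical names invented for this illustration, and plain dictionaries stand in for DataFrame rows.

```python
import re
from dataclasses import dataclass


@dataclass
class RowResult:
    """Outcome of evaluating one dataset row (hypothetical structure)."""
    question: str
    expected: str
    response: str
    passed: bool


def fake_model(question: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned answers."""
    canned = {
        "What gets wetter as it dries?": "A towel, of course.",
        "What has keys but can't open locks?": "A keyboard.",
    }
    return canned.get(question, "I don't know.")


def evaluate(rows, model):
    """Run the model on every row and record whether the expected keyword appears."""
    results = []
    for row in rows:
        response = model(row["question"])
        # Case-insensitive keyword check; re.escape guards against regex metacharacters.
        found = re.search(re.escape(row["answer"]), response, re.IGNORECASE)
        results.append(
            RowResult(row["question"], row["answer"], response, found is not None)
        )
    return results


rows = [
    {"question": "What gets wetter as it dries?", "answer": "Towel"},
    {"question": "What has keys but can't open locks?", "answer": "Piano"},
    {"question": "What has hands but cannot clap?", "answer": "Clock"},
]
results = evaluate(rows, fake_model)
accuracy = sum(r.passed for r in results) / len(results)
print(f"accuracy: {accuracy:.2f}")
```

Keeping the per-row records alongside the aggregate number makes it possible to inspect individual failures later, which is the same motivation as the reproducibility point above.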
Getting Started
On Kaggle
The easiest way to use kaggle-benchmarks is directly within a Kaggle notebook.
Prerequisites: A Kaggle account.
Installation: No installation is needed! For early access, simply navigate to https://www.kaggle.com/benchmarks/tasks/new. This will create a new Kaggle notebook with the library and its dependencies pre-installed and ready to use.
Data Usage and Leaderboard Generation: When run in a Kaggle notebook, each benchmark task outputs a task file and associated run files. These files are used to build the benchmark entity and display its results on a Kaggle leaderboard. An example can be seen on the ICML 2025 Experts Leaderboard.
Usage
Here is a simple example of a benchmark that asks a model a riddle and checks its answer.
import kaggle_benchmarks as kbench


@kbench.task(name="simple_riddle")
def solve_riddle(llm, riddle: str, answer: str):
    """Asks a riddle and checks for a keyword in the answer."""
    response = llm.prompt(riddle)

    # Assert that the model's response contains the answer, ignoring case.
    kbench.assertions.assert_contains_regex(
        f"(?i){answer}", response, expectation="LLM should give the right answer."
    )


# Execute the task
solve_riddle.run(
    llm=kbench.llm,  # Uses the default LLM
    riddle="What gets wetter as it dries?",
    answer="Towel",
)
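The pattern in the example above starts with the inline flag `(?i)`, which makes the match case-insensitive. The sketch below shows how such a pattern behaves with Python's standard `re` module. It assumes the assertion applies the pattern in a comparable way, which the text here does not confirm, and `contains_answer` is a hypothetical helper, not part of kaggle-benchmarks.

```python
import re


def contains_answer(answer: str, response: str) -> bool:
    """Return True if `answer` occurs in `response`, ignoring case.

    Hypothetical helper for illustration only. re.escape treats the
    answer as literal text, so characters like '+' or '.' are not
    interpreted as regex operators.
    """
    pattern = f"(?i){re.escape(answer)}"
    return re.search(pattern, response) is not None


print(contains_answer("Towel", "It's a TOWEL!"))
print(contains_answer("Towel", "A sponge."))
print(contains_answer("C++", "I would pick c++ for this."))
```

Interpolating the raw answer into the pattern, as the riddle example does, works for plain words like "Towel". If expected answers may contain regex metacharacters, escaping them first is a reasonable precaution.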
For a detailed walkthrough of the library's features, check out our documentation.
Supported Models
This library supports a wide range of models available through Kaggle's backend. The exact models available to you depend on your environment (Kaggle Notebook vs. local proxy token).
Contributing
Contributions are welcome! Please refer to our Contribution Guidelines for more details.
License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
File details
Details for the file kaggle_benchmarks-0.4.0.tar.gz.
File metadata
- Download URL: kaggle_benchmarks-0.4.0.tar.gz
- Upload date:
- Size: 501.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8b94202643b01a8f9e856e4de8ac4be31efa1f8f917c4edc7f784bc8ad8f065d |
| MD5 | dda7226f058141b74452aa60a797fe98 |
| BLAKE2b-256 | 1c117fb1b78e7153ee93bcd1fb30e9f70c350a5a94243dcbff5447f15b27f03a |
File details
Details for the file kaggle_benchmarks-0.4.0-py3-none-any.whl.
File metadata
- Download URL: kaggle_benchmarks-0.4.0-py3-none-any.whl
- Upload date:
- Size: 111.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e9175fa4652fb60c54745949b0cbe9d8a4850c32da48ca9281c94513ba828eeb |
| MD5 | 9ecfba7e8d880258273cdb0b35168625 |
| BLAKE2b-256 | 785018dd657dcabc81a9cc8f3edd2e455658b5bd224b059a688e72a20a6e21c5 |