Kaggle Benchmarks

kaggle-benchmarks is a Python library designed to help you rigorously evaluate AI models on tasks that matter to you. It provides a structured framework for defining tasks, interacting with models, and asserting the correctness of their outputs.

This is especially useful for:

Features

  • Define Custom Tasks: Easily define evaluation tasks using a simple @kbench.task decorator.
  • Interact with Multiple LLMs: Programmatically interact with and compare various large language models.
  • Structured & Multimodal I/O: Go beyond plain text. Get structured dataclass or pydantic objects from models and provide image, audio, and video inputs.
  • Tool Use: Empower models with tools, including a built-in Python interpreter to execute code.
  • Robust Assertions: Use a rich set of built-in assertions or create your own to validate model outputs.
  • Dataset Evaluation: Run benchmarks over entire datasets (e.g., pandas DataFrames) to get aggregate performance metrics.
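
To make the dataset-evaluation idea concrete, here is a minimal, library-independent sketch of running one check over many rows and aggregating a pass rate. It does not use the kaggle-benchmarks API: `fake_model`, `check_row`, and `evaluate` are hypothetical names, the model is a stub that returns canned answers, and plain dictionaries stand in for DataFrame rows.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stand-in for a model: returns canned answers for known prompts.
def fake_model(prompt: str) -> str:
    canned = {
        "What gets wetter as it dries?": "A towel.",
        "What has keys but can't open locks?": "A keyboard?",
    }
    return canned.get(prompt, "I don't know.")

def check_row(model: Callable[[str], str], riddle: str, answer: str) -> bool:
    """Return True if the model's response contains the answer, ignoring case."""
    response = model(riddle)
    return re.search(re.escape(answer), response, flags=re.IGNORECASE) is not None

def evaluate(model: Callable[[str], str], rows: List[Dict[str, str]]) -> float:
    """Return the fraction of rows that pass; 0.0 for an empty dataset."""
    if not rows:
        return 0.0
    passed = sum(check_row(model, row["riddle"], row["answer"]) for row in rows)
    return passed / len(rows)

# A tiny illustrative dataset (assumed, not from the library).
rows = [
    {"riddle": "What gets wetter as it dries?", "answer": "Towel"},
    {"riddle": "What has keys but can't open locks?", "answer": "Piano"},
]

accuracy = evaluate(fake_model, rows)
print(f"accuracy = {accuracy:.2f}")
```

The stub answers the first riddle correctly and the second incorrectly, so the aggregate reflects one pass out of two rows.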

Getting Started

On Kaggle

The easiest way to use kaggle-benchmarks is directly within a Kaggle notebook.

Prerequisites: A Kaggle account.

Installation: No installation is needed! For early access, simply navigate to https://www.kaggle.com/benchmarks/tasks/new. This will create a new Kaggle notebook with the library and its dependencies pre-installed and ready to use.

Data Usage and Leaderboard Generation: When run in a Kaggle notebook, each benchmark task outputs a task file and associated run files. These files are used to build the benchmark entity and display its results on a Kaggle leaderboard. An example can be seen on the ICML 2025 Experts Leaderboard.

Usage

Here is a simple example of a benchmark that asks a model a riddle and checks its answer.

import re

import kaggle_benchmarks as kbench

@kbench.task(name="simple_riddle")
def solve_riddle(llm, riddle: str, answer: str):
    """Asks a riddle and checks for a keyword in the answer."""
    response = llm.prompt(riddle)

    # Assert that the model's response contains the answer, ignoring case.
    # re.escape keeps any regex metacharacters in the answer literal.
    kbench.assertions.assert_contains_regex(
        f"(?i){re.escape(answer)}", response, expectation="LLM should give the right answer."
    )

# Execute the task
solve_riddle.run(
    llm=kbench.llm,  # Uses the default LLM
    riddle="What gets wetter as it dries?",
    answer="Towel",
)
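
The check in this example amounts to a case-insensitive regular-expression search over the response. As a rough standalone illustration of that kind of check, using only Python's standard `re` module, the sketch below defines `contains_answer`, a hypothetical helper that is not part of kaggle-benchmarks. It assumes the answer should be matched as literal text, so it escapes the answer before building the pattern.

```python
import re

def contains_answer(answer: str, response: str) -> bool:
    """Return True if `response` contains `answer` as literal text, ignoring case."""
    # (?i) turns on case-insensitive matching for the whole pattern.
    # re.escape makes characters such as "." match themselves only.
    pattern = f"(?i){re.escape(answer)}"
    return re.search(pattern, response) is not None

print(contains_answer("Towel", "It's a towel!"))    # True
print(contains_answer("Towel", "A sponge."))        # False
print(contains_answer("3.14", "Pi is about 3.14"))  # True
print(contains_answer("3.14", "Roughly 3x14"))      # False
```

Without escaping, the `.` in `3.14` would match any character, so the last call would report a match on `3x14`.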

For a detailed walkthrough of the library's features, check out our documentation.

Supported Models

This library supports a wide range of models available through Kaggle's backend. The exact models available to you depend on your environment (Kaggle Notebook vs. local proxy token).

Contributing

Contributions are welcome! Please refer to our Contribution Guidelines for more details.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Download files

Source Distribution

kaggle_benchmarks-0.4.0.tar.gz (501.2 kB)

Uploaded Source

Built Distribution

kaggle_benchmarks-0.4.0-py3-none-any.whl (111.5 kB)

Uploaded Python 3

File details

Details for the file kaggle_benchmarks-0.4.0.tar.gz.

File metadata

  • Download URL: kaggle_benchmarks-0.4.0.tar.gz
  • Size: 501.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.14

File hashes

Hashes for kaggle_benchmarks-0.4.0.tar.gz
Algorithm    Hash digest
SHA256       8b94202643b01a8f9e856e4de8ac4be31efa1f8f917c4edc7f784bc8ad8f065d
MD5          dda7226f058141b74452aa60a797fe98
BLAKE2b-256  1c117fb1b78e7153ee93bcd1fb30e9f70c350a5a94243dcbff5447f15b27f03a
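
One way to use these digests is to recompute the SHA256 of the downloaded archive and compare it with the value listed above. The following is a minimal sketch with Python's standard `hashlib`; `sha256_of_file` and `verify_download` are hypothetical helper names, and the file path is assumed to point at a local copy of the source distribution.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for kaggle_benchmarks-0.4.0.tar.gz.
EXPECTED_SHA256 = "8b94202643b01a8f9e856e4de8ac4be31efa1f8f917c4edc7f784bc8ad8f065d"

def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 matches the expected hex digest."""
    return sha256_of_file(path) == expected_hex.lower()

if __name__ == "__main__":
    # Assumed local path; adjust to wherever the archive was saved.
    archive = Path("kaggle_benchmarks-0.4.0.tar.gz")
    if archive.exists():
        print("OK" if verify_download(archive, EXPECTED_SHA256) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first.")
```

Reading in chunks keeps memory use flat regardless of archive size.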

File details

Details for the file kaggle_benchmarks-0.4.0-py3-none-any.whl.

File hashes

Hashes for kaggle_benchmarks-0.4.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       e9175fa4652fb60c54745949b0cbe9d8a4850c32da48ca9281c94513ba828eeb
MD5          9ecfba7e8d880258273cdb0b35168625
BLAKE2b-256  785018dd657dcabc81a9cc8f3edd2e455658b5bd224b059a688e72a20a6e21c5
