Skip to main content

Add your description here

Project description

Quack Test

A plugin for pytest to evaluate non-deterministic agent components.

A runber duckie taking a test

Installation & Setup

Simply pip install it.

pip install quack-test

To use the LLM judge you will need to add the required config to your .env file.

# OpenAI Provider Configuration
# Set to "OpenAI" or "AzureOpenAI"
OPENAI_PROVIDER=OpenAI

# Unified Configuration
OPENAI_API_KEY=your_api_key_here
OPENAI_MODEL_NAME=gpt-4
OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/

Run tests

Simply run your tests using pytest (quack_tests are just pytests)

pytest

Writing Tests

You can simply specify @nondeterministic for components, which are not deterministic and should be run multiple times. Implement everything in a test_*.py file in your test or tests folder.

import random
from quack_test import nondeterministic_fixture, nondeterministic_test

@nondeterministic_fixture(n=5)
def sample_text():
    return f"{random.randint(1, 10)} apples."

@nondeterministic_test(threshold=0.8)
def test_simple_assertion(sample_text):
    apples = float(sample_text.split(" "))
    assert apples > 1, f"Expected more than 1 apple found {apples}."

In practice choosing n=4 or n=5 is recommended. The threshold should be then 0.75 or 0.8, allowing for a single failure. If you are choosing a bigger n, you are rather running a benchmark than just a test. Here we assume, that if a feature is really broken, it does not only fail once, but more often. On the other side, if a feature works robustly, it will should not fail or at maximum once.

Judging via LLM

In many cases it is hard to evaluate with code, if an answer is actually correct. In those cases, use a judge.

from quack_test import judge, nondeterministic_test
# [...]

@nondeterministic_test(threshold=0.8)
def test_judge_with_criterion(sample_text):
    return judge(sample_text, criterion="More than 1 apple")

@nondeterministic_test(threshold=0.8)
def test_judge_with_gt(sample_text):
    return judge(sample_text, gt="N (0-10) apples")

Class Based Tests

Testing in classes can be beneficial in some use-cases. This is not hindered by quack-test.

class TestClassBased:
    @nondeterministic_test(n=3, threshold=0.2)
    def test_assert_only(self):
        # You can also create your own score and explanation
        return random.random(), "The expected value is above 0.2, but was lower."

Negative Tests

Writing negative tests that must fail is important. In quack test, you can simply specify that a test has to fail via should_fail=True.

@nondeterministic_test(threshold=0.8, should_fail=True)
def test_negative(sample_text):
    return judge(sample_text, criterion="Has only coal")

Showcase for Error Messages

The following example code will fail and cause an error message.

@nondeterministic_test(threshold=0.8)
def test_assertion_demo_failure(sample_text):
    assert False, "This message is shown in the pytest summary."
    # -> FAILED test/test_example.py::test_assertion_demo_failure - AssertionError: Test failed to meet success threshold. Score: 0.0 (required: 0.8), Success rate: 0.00% (0/5), This message is shown in the pytest summary.

@nondeterministic_test(threshold=0.8)
def test_judge_demo_failure(sample_text):
    return judge(sample_text, criterion="Has only coal")
    # -> FAILED test/test_example.py::test_judge_demo_failure - AssertionError: Test failed to meet success threshold. Score: 0.0 (required: 0.8), Success rate: 0.00% (0/5), Text: '3 apples.' Criterion: 'Has only coal'

Advanced Judge Setup

If you do not want to setup the judge via the .env file as shown above, you can also set it up via code. In this example we hardcode a local ollama instance. Of course you should NEVER put your api key in the code, but if you manage your secrets differently than in a .env this can come in clutch to pass your secrets to the judge.

def setup_module():
    configure_judge(
        provider="OpenAI",
        api_key="ollama",
        model_name="qwen3:4b",
        endpoint="http://localhost:11434/v1"
    )

Have fun quack testing your agents!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

quack_test-0.3.0.tar.gz (7.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

quack_test-0.3.0-py3-none-any.whl (8.2 kB view details)

Uploaded Python 3

File details

Details for the file quack_test-0.3.0.tar.gz.

File metadata

  • Download URL: quack_test-0.3.0.tar.gz
  • Upload date:
  • Size: 7.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for quack_test-0.3.0.tar.gz
Algorithm Hash digest
SHA256 3a02394b5812305728ad2ed87822c8923c99cbee5c66f71dc6712c27a692ee84
MD5 f06ba18e9adbf62dc17bea18703c8b73
BLAKE2b-256 aaecb7cb68f8faf541590092424c97f6d6d53658a926e84e39968f89673517d8

See more details on using hashes here.

Provenance

The following attestation bundles were made for quack_test-0.3.0.tar.gz:

Publisher: python-publish.yml on penguinmenac3/quack-test

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file quack_test-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: quack_test-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 8.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for quack_test-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0ee67a81da80d23181ae26c12335087dd3bf37aa5c40ef92db07159b34c8863f
MD5 82c8e65b33cf239697ad048b942b00b3
BLAKE2b-256 08082dce8bb714d16abefdd8492f0a463d1a656d310e0e2701f91286a9cc283a

See more details on using hashes here.

Provenance

The following attestation bundles were made for quack_test-0.3.0-py3-none-any.whl:

Publisher: python-publish.yml on penguinmenac3/quack-test

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page