
Quack Test

A plugin for pytest to evaluate non-deterministic agent components.

A rubber duckie taking a test

Installation & Setup

Install it with pip:

pip install quack-test

To use the LLM judge, add the required configuration to your .env file:

# OpenAI Provider Configuration
# Set to "OpenAI" or "AzureOpenAI"
OPENAI_PROVIDER=OpenAI

# Unified Configuration
OPENAI_API_KEY=your_api_key_here
OPENAI_MODEL_NAME=gpt-4
OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
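If you want a readable error early when the judge is not configured, a small check like the following can help. This is a hypothetical helper, not part of quack-test: it assumes all four variables from the example above are required and, by default, reads os.environ, which only contains your .env values if something (for example your shell or a dotenv loader) has already loaded them.

```python
import os

# Hypothetical helper, not part of quack-test.
# Assumption: all four variables from the .env example are required.
REQUIRED_VARS = ("OPENAI_PROVIDER", "OPENAI_API_KEY", "OPENAI_MODEL_NAME", "OPENAI_ENDPOINT")
VALID_PROVIDERS = ("OpenAI", "AzureOpenAI")


def missing_judge_settings(env=None):
    """Return a list of configuration problems (empty if everything is set)."""
    env = os.environ if env is None else env
    # Report every required variable that is missing or empty.
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    # The provider must be one of the two values named in the .env comment.
    provider = env.get("OPENAI_PROVIDER")
    if provider and provider not in VALID_PROVIDERS:
        problems.append(f"OPENAI_PROVIDER must be one of {VALID_PROVIDERS}, got {provider!r}")
    return problems
```

You could call it from a conftest.py and report the problems before any judge-based test runs.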

Run tests

Run your tests with pytest as usual (quack tests are just pytest tests):

pytest

Writing Tests

Mark components that are not deterministic, and should therefore be run multiple times, with @nondeterministic_fixture and @nondeterministic_test. Put everything in a test_*.py file in your test or tests folder.

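import random
from quack_test import nondeterministic_fixture, nondeterministic_test

# n=5: the fixture provides 5 samples, so the test is run 5 times.
@nondeterministic_fixture(n=5)
def sample_text():
    return f"{random.randint(1, 10)} apples."

# threshold=0.8: at least 80% of the runs must succeed.
@nondeterministic_test(threshold=0.8)
def test_simple_assertion(sample_text):
    # Parse the leading number from a string such as "7 apples."
    apples = float(sample_text.split(" ")[0])
    assert apples > 1, f"Expected more than 1 apple, found {apples}."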
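To see what running a component several times means, here is a plain-Python illustration without the plugin: it draws n samples like the sample_text fixture, applies the same check, and compares the success rate with the threshold. It assumes a test passes when its success rate reaches the threshold, and run_nondeterministic is a hypothetical name; this is not how quack-test is implemented.

```python
import random

# Illustration only, not quack-test's implementation.


def sample_text(rng):
    # Same idea as the fixture above: a random number of apples from 1 to 10.
    return f"{rng.randint(1, 10)} apples."


def check(text):
    # Same check as test_simple_assertion, returning True/False instead of asserting.
    apples = float(text.split(" ")[0])
    return apples > 1


def run_nondeterministic(n, threshold, seed=None):
    """Run the check n times and compare the success rate with the threshold."""
    rng = random.Random(seed)
    successes = sum(check(sample_text(rng)) for _ in range(n))
    rate = successes / n
    return rate >= threshold, successes, rate


passed, successes, rate = run_nondeterministic(n=5, threshold=0.8, seed=0)
print(f"passed={passed} ({successes}/5 runs succeeded, rate={rate:.0%})")
```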

In practice, n=4 or n=5 is recommended, with a threshold of 0.75 or 0.8 respectively, which allows for a single failure. If you choose a larger n, you are running a benchmark rather than a test. The assumption here is that a feature that is really broken fails more than once, whereas a feature that works robustly fails at most once.
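The numbers behind this recommendation can be checked with a short binomial calculation. The sketch below assumes that runs are independent, that each succeeds with the same probability p, and that a test passes when its success rate is at least the threshold; pass_probability is a hypothetical helper, not part of quack-test.

```python
from math import comb


def pass_probability(p, n, threshold):
    """Probability that a test passes when each of n independent runs
    succeeds with probability p and the success rate must reach threshold."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k / n >= threshold  # k successful runs out of n are enough
    )


# A robust feature (95% per-run success) vs. a broken one (50%), n=5, threshold=0.8.
print(f"robust: {pass_probability(0.95, 5, 0.8):.3f}")  # about 0.977
print(f"broken: {pass_probability(0.5, 5, 0.8):.3f}")   # about 0.188
```

With n=5 and a threshold of 0.8, a feature that succeeds 95% of the time still passes in roughly 98% of test runs, while a feature that only works half of the time passes in fewer than one in five.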

Judging via LLM

In many cases, it is hard to check in code whether an answer is actually correct. In those cases, use an LLM judge.

from quack_test import judge, nondeterministic_test
# [...]

@nondeterministic_test(threshold=0.8)
def test_judge_with_criterion(sample_text):
    # The judge checks the text against a natural-language criterion.
    return judge(sample_text, criterion="More than 1 apple")

@nondeterministic_test(threshold=0.8)
def test_judge_with_gt(sample_text):
    # The judge compares the text with a ground-truth (gt) answer.
    return judge(sample_text, gt="N (0-10) apples")
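To make the two modes concrete, the following sketch shows one way a judge prompt could be built from either a criterion or a ground truth, and how a reply could be turned into a score and explanation pair. Everything here is hypothetical: build_judge_prompt, parse_verdict, and the PASS/FAIL reply format are assumptions, not quack-test's actual prompt or parser.

```python
# Hypothetical sketch, not quack-test's real prompt or parser.


def build_judge_prompt(text, criterion=None, gt=None):
    """Build a judge prompt from exactly one of criterion or gt (ground truth)."""
    if (criterion is None) == (gt is None):
        raise ValueError("Pass exactly one of criterion or gt.")
    if criterion is not None:
        task = f"Does the text satisfy this criterion? Criterion: {criterion!r}"
    else:
        task = f"Does the text match this ground truth? Ground truth: {gt!r}"
    return (
        f"{task}\n"
        f"Text: {text!r}\n"
        "Answer PASS or FAIL on the first line, then give a short explanation."
    )


def parse_verdict(reply):
    """Turn a judge reply into a (score, explanation) pair."""
    first_line, _, rest = reply.strip().partition("\n")
    score = 1.0 if first_line.strip().upper() == "PASS" else 0.0
    return score, rest.strip()


prompt = build_judge_prompt("3 apples.", criterion="More than 1 apple")
print(prompt)
print(parse_verdict("PASS\nThe text mentions 3 apples."))
```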

Class Based Tests

Grouping tests in classes can be useful in some cases, and quack-test supports this.

import random
from quack_test import nondeterministic_test

class TestClassBased:
    @nondeterministic_test(n=3, threshold=0.2)
    def test_assert_only(self):
        # Instead of asserting, you can return your own score and explanation.
        return random.random(), "Expected a score above 0.2, but it was lower."

Negative Tests

Writing negative tests, which must fail, is also important. In quack-test, you mark a test that has to fail with should_fail=True.

@nondeterministic_test(threshold=0.8, should_fail=True)
def test_negative(sample_text):
    return judge(sample_text, criterion="Has only coal")
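One plausible reading of should_fail is that it inverts the usual outcome. The sketch below assumes the overall score is the mean of the per-run scores and is compared with the threshold; overall_result is a hypothetical name, and quack-test's exact rule may differ.

```python
# Hypothetical sketch of the pass/fail rule with should_fail.


def overall_result(scores, threshold, should_fail=False):
    """Return True if the test as a whole should be reported as passing."""
    if not scores:
        raise ValueError("At least one run is required.")
    # Assumption: the overall score is the mean of the per-run scores.
    score = sum(scores) / len(scores)
    meets_threshold = score >= threshold
    # A negative test passes only when the positive condition is NOT met.
    return not meets_threshold if should_fail else meets_threshold


# "Has only coal" never holds for the apple samples, so the negative test passes.
print(overall_result([0.0, 0.0, 0.0, 0.0, 0.0], threshold=0.8, should_fail=True))
```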

Showcase for Error Messages

The following examples fail on purpose; the comments show the resulting error messages.

@nondeterministic_test(threshold=0.8)
def test_assertion_demo_failure(sample_text):
    assert False, "This message is shown in the pytest summary."
    # -> FAILED test/test_example.py::test_assertion_demo_failure - AssertionError: Test failed to meet success threshold. Score: 0.0 (required: 0.8), Success rate: 0.00% (0/5), This message is shown in the pytest summary.

@nondeterministic_test(threshold=0.8)
def test_judge_demo_failure(sample_text):
    return judge(sample_text, criterion="Has only coal")
    # -> FAILED test/test_example.py::test_judge_demo_failure - AssertionError: Test failed to meet success threshold. Score: 0.0 (required: 0.8), Success rate: 0.00% (0/5), Text: '3 apples.' Criterion: 'Has only coal'

Advanced Judge Setup

If you do not want to set up the judge via the .env file shown above, you can also configure it in code. This example hardcodes a local Ollama instance. Never put a real API key in your code; but if you manage your secrets somewhere other than a .env file, this is a convenient way to pass them to the judge.

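from quack_test import configure_judge

def setup_module():
    # pytest runs this once before the tests in this module.
    configure_judge(
        provider="OpenAI",
        # Ollama's OpenAI-compatible API requires a key but ignores its value.
        api_key="ollama",
        model_name="qwen3:4b",
        endpoint="http://localhost:11434/v1"
    )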
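If your secrets live somewhere else, you can collect the settings first and pass them on. In this hypothetical sketch, judge_settings reads environment variables with a custom MY_JUDGE_ prefix (an assumed naming scheme, swap in your own secret store) and falls back to the local Ollama values from the example above; the keyword names match the configure_judge call shown there.

```python
import os

# Hypothetical helper, not part of quack-test.


def judge_settings(env=None, prefix="MY_JUDGE_"):
    """Collect judge settings, defaulting to the local Ollama example."""
    env = os.environ if env is None else env
    return {
        "provider": env.get(prefix + "PROVIDER", "OpenAI"),
        "api_key": env.get(prefix + "API_KEY", "ollama"),
        "model_name": env.get(prefix + "MODEL_NAME", "qwen3:4b"),
        "endpoint": env.get(prefix + "ENDPOINT", "http://localhost:11434/v1"),
    }


# Usage in a test module (configure_judge as in the example above):
# def setup_module():
#     configure_judge(**judge_settings())
```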

Have fun quack testing your agents!

Project details

Release 0.3.1, available as a source distribution (quack_test-0.3.1.tar.gz) and a pure-Python wheel (quack_test-0.3.1-py3-none-any.whl), published via Trusted Publishing from penguinmenac3/quack-test.
