
giskard-checks

Lightweight primitives to define and run checks against model interactions.

This library provides:

  • Core types for describing interactions (Interact, Interaction, Trace)
  • A fluent scenario builder and runner (Scenario, ScenarioResult)
  • Built-in checks including string matching, comparisons, and LLM-based evaluation
  • JSONPath-based extraction utilities for referencing trace data
  • Seamless integration with giskard-agents generators for LLM-backed checks

Installation

pip install giskard-checks

Requires Python >= 3.12.

Dependencies:

  • pydantic>=2.11.7 - Core data validation and serialization
  • giskard-agents>=0.3 - LLM integration and workflow management
  • jsonpath-ng>=1.7.0 - JSONPath expressions for data extraction
  • jinja2>=3.1.6 - Template engine for LLM prompts

Quickstart

Use the fluent API to create and run scenarios:

from giskard.checks import Groundedness, Scenario

# Scenario is the class; scenario is your instance
scenario = (
    Scenario("test_france_capital")
    .interact(
        inputs="What is the capital of France?",
        outputs="The capital of France is Paris."
    )
    .check(
        Groundedness(
            name="answer is grounded",
            answer_key="trace.last.outputs",
            context="""France is a country in Western Europe. Its capital
                       and largest city is Paris, known for the Eiffel Tower
                       and the Louvre Museum."""
        )
    )
)

result = await scenario.run()
assert result.passed
print(f"Scenario completed in {result.duration_ms}ms")

The fluent API accepts static values or callables for inputs and outputs, so you can call your system under test (SUT) directly:

from openai import OpenAI
from giskard.checks import Groundedness, Scenario

client = OpenAI()

def get_answer(inputs: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": inputs}],
    )
    return response.choices[0].message.content

scenario = (
    Scenario("test_dynamic_output")
    .interact(
        inputs="What is the capital of France?",
        outputs=get_answer
    )
    .check(
        Groundedness(
            name="answer is grounded",
            answer_key="trace.last.outputs",
            context="France is a country in Western Europe..."
        )
    )
)

The run() method is async. In a script, wrap it with asyncio.run():

import asyncio

async def main():
    result = await scenario.run()
    print(result)

asyncio.run(main())

Running Multiple Scenarios with Suite

Use a Suite to run multiple scenarios against a shared target SUT. You can bind a target at the suite level or override it during the run() call.

from giskard.checks import Equals, Scenario, Suite

# Define scenarios without a target; the suite will supply one
scenario1 = (
    Scenario("s1")
    .interact("hello")
    .check(Equals(expected_value="Echo: hello", key="trace.last.outputs"))
)
scenario2 = (
    Scenario("s2")
    .interact("world")
    .check(Equals(expected_value="Echo: world", key="trace.last.outputs"))
)

# Create a suite with a shared target
target_sut = lambda inputs: f"Echo: {inputs}"
suite = Suite(name="my_suite", target=target_sut)

# Add scenarios
suite.append(scenario1)
suite.append(scenario2)

# Run the suite
results = await suite.run()
print(f"Aggregated pass rate: {results.pass_rate * 100}%")

Why this library?

  • Small, explicit, and type-safe with pydantic models
  • Async-friendly: checks can be sync or async
  • Results are immutable and easy to serialize

Concepts

  • Fluent API: The recommended way to create tests using Scenario(...).interact().check(). This API builds a scenario and handles interaction generation.
  • Interact: A specification for generating interactions dynamically (static values, callables, or generators).
  • Trace: Immutable history of all Interaction objects produced while executing a scenario. Use trace.last in JSONPath expressions (e.g., trace.last.outputs).
  • Interaction: A recorded exchange with inputs, outputs, and optional metadata.
  • Check: Inspects the Trace and returns a CheckResult.
  • Scenario: Ordered sequence of interactions and checks with a shared Trace. Execution stops at the first failing check and later steps are skipped. Scenarios can have their own target SUT, which is injected into interactions that do not define outputs.
  • Suite: A collection of scenarios that can be executed together, optionally sharing a common target.
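The early-stop semantics can be sketched in plain Python. This is a conceptual illustration only, not the library's implementation; run_components is a hypothetical helper.

```python
# Conceptual sketch of early-stop execution: run checks in order and skip
# everything after the first failure. Not the library's actual code.
def run_components(components):
    results = []
    for component in components:
        passed = component()
        results.append(passed)
        if not passed:
            break  # later steps are skipped
    return results


# The second check fails, so the third never runs.
outcomes = run_components([lambda: True, lambda: False, lambda: True])
assert outcomes == [True, False]
```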

Advanced concepts (used internally by the fluent API):

  • TestCase: Wrapper that runs a set of checks against a single trace step and returns a TestCaseResult.
  • ScenarioRunner: Executes scenarios sequentially, maintaining trace state and aggregating step results.

API Overview

Core types

  • giskard.checks.Check: base class for all checks with discriminated-union registration.
  • giskard.checks.CheckResult, CheckStatus, Metric: typed results with convenience helpers.
  • giskard.checks.Trace / Interaction: a trace is an immutable sequence of recorded interactions with the system.
  • giskard.checks.Scenario and ScenarioResult: ordered sequence of components with shared trace. Execution stops at first failure and later steps are skipped.
  • giskard.checks.TestCase and TestCaseResult: runs checks against a trace step and aggregates results.

Interaction specs

  • giskard.checks.InteractionSpec: discriminated base for describing inputs/outputs. Subclasses implement generate() to yield interactions.
  • giskard.checks.Interact: batteries-included spec that supports static values, callables, generators, or InputGenerator instances for both inputs and outputs. Supports multi-turn interactions via generators.
  • giskard.checks.UserSimulator: LLM-powered input generator that simulates user personas (predefined or custom) for multi-turn scenarios.

Scenarios and runners

  • giskard.checks.Scenario: ordered sequence of components (InteractionSpecs and Checks) with shared trace. Components execute sequentially, stopping at first failure.
  • giskard.checks.ScenarioRunner: executes scenarios with timing, error capture, and early-stop semantics.
  • giskard.checks.TestCaseRunner: executes test cases with timing and error handling.

Built-in and LLM-based checks

  • giskard.checks.from_fn, FnCheck: wrap arbitrary callables.
  • giskard.checks.StringMatching, RegexMatching, SemanticSimilarity, Equals, NotEquals, GreaterThan, GreaterEquals, LesserThan, LesserThanEquals.
  • giskard.checks.BaseLLMCheck, LLMCheckResult, Groundedness, Conformity, LLMJudge.
  • JSONPath selectors (e.g., trace.last.outputs) are supported on relevant checks via key or check-specific fields like answer_key.

Testing utilities

  • giskard.checks.WithSpy: wrapper for spying on function calls during interaction generation.

Settings

  • giskard.checks.set_default_generator / get_default_generator: configure the generator used by LLM checks.

Testing

  • Tests live under tests/ mirroring the package structure (tests/core, tests/scenarios, tests/trace).
  • Use make test (or make ci) to run the full suite exactly as CI does.

Usage Notes

  • Define custom checks with a unique KIND via @Check.register("kind").
  • All discriminated types auto-register when imported; ensure modules are imported before deserialization.
  • Prefer model_dump() / model_validate() for serialization.
  • Attach extra metadata in CheckResult.details; JSONPath helpers (key=...) resolve against the entire trace.
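The kind-based registration pattern behind @Check.register can be sketched with a toy registry. This is illustrative only; the real library builds registration on pydantic discriminated unions, and Registry here is a hypothetical stand-in.

```python
# Minimal sketch of kind-based registration, mimicking @Check.register("kind").
# Illustrative only: the real library uses pydantic discriminated unions.
class Registry:
    _kinds: dict[str, type] = {}

    @classmethod
    def register(cls, kind: str):
        def decorator(subclass: type) -> type:
            if kind in cls._kinds:
                # Mirrors the DuplicateKindError behavior described below
                raise ValueError(f"Duplicate kind {kind!r} detected")
            cls._kinds[kind] = subclass
            return subclass
        return decorator


@Registry.register("my_custom_check")
class MyCustomCheck:
    """A stand-in for a real Check subclass."""


assert Registry._kinds["my_custom_check"] is MyCustomCheck
```

Deserialization then only needs to look the kind up in the registry, which is why every custom class must be imported (and thereby registered) first.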

Serialization

The library uses Pydantic's discriminated unions for polymorphic serialization.

from giskard.checks import Check, CheckResult, Interaction, TestCase, Trace


@Check.register("my_custom_check")
class MyCustomCheck(Check):
    async def run(self, trace: Trace) -> CheckResult:
        return CheckResult.success("Check passed")


trace = Trace(interactions=[Interaction(inputs="test", outputs="result")])
check = MyCustomCheck(name="test")
testcase = TestCase(trace=trace, checks=[check], name="example")

# Serialize to dict
serialized = testcase.model_dump()

# Deserialize back (requires classes to be imported)
restored = TestCase.model_validate(serialized)

Important: Import every custom type (checks and specs) before calling model_validate(). The registry only knows about classes already loaded into memory.

Creating Custom Checks and Interaction Specs

Step 1: Define a custom check

from giskard.checks import Check, CheckResult, Trace


@Check.register("advanced_security")
class AdvancedSecurityCheck(Check):
    threshold: float = 0.8

    async def run(self, trace: Trace) -> CheckResult:
        current = trace.last
        score = await some_security_analysis(current.outputs)
        if score >= self.threshold:
            return CheckResult.success(f"Security score {score:.2f} meets threshold")
        return CheckResult.failure(
            f"Security score {score:.2f} below threshold {self.threshold}"
        )

Step 2: Define a custom interaction specification

from giskard.checks import InteractionSpec, Interaction, Trace


@InteractionSpec.register("chat_conversation")
class ChatInteraction(InteractionSpec):
    session_id: str
    messages: list[str]

    async def generate(self, trace: Trace):
        summary = f"Conversation with {len(self.messages)} messages"
        record = Interaction(
            inputs=self.messages,
            outputs={"summary": summary},
            metadata={"session_id": self.session_id},
        )
        yield record

Step 3: Verify registration

from giskard.checks import Scenario

chat = ChatInteraction(session_id="session_123", messages=["hi", "hello"])
check = AdvancedSecurityCheck(name="security_test", threshold=0.7)
scenario = Scenario(name="custom_test").extend(chat, check)

serialized = scenario.model_dump()
restored = Scenario.model_validate(serialized)

Binding a Target SUT

You can bind a System Under Test (SUT) at three different levels, with the following precedence: run(target=...) > Suite(target=...) > Scenario(..., target=...).
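The precedence rule amounts to a simple fallback chain. resolve_target below is a hypothetical helper sketching that logic, not part of the library API:

```python
# Hypothetical helper illustrating target precedence; not part of the library API.
def resolve_target(run_target=None, suite_target=None, scenario_target=None):
    """run(target=...) beats Suite(target=...), which beats Scenario(..., target=...)."""
    for candidate in (run_target, suite_target, scenario_target):
        if candidate is not None:
            return candidate
    return None


# A run-level target overrides everything; otherwise the suite target applies.
assert resolve_target("run_sut", "suite_sut", "scenario_sut") == "run_sut"
assert resolve_target(None, "suite_sut", "scenario_sut") == "suite_sut"
assert resolve_target(None, None, "scenario_sut") == "scenario_sut"
```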

1. Scenario Level

Pass the target directly to Scenario():

result = await Scenario("test", target=my_sut).interact("hello").run()

2. Suite Level

Bind a target to all scenarios in a suite:

suite = Suite(name="my_suite", target=shared_sut)
suite.append(scenario)
result = await suite.run()

3. Run Level

Override everything at execution time:

# This target will be used for all scenarios in the suite,
# overriding any suite-level or scenario-level targets.
result = await suite.run(target=emergency_override_sut)

Troubleshooting Serialization Issues

ValidationError: "Kind is not provided for Check"

  • Cause: Custom class not imported before deserialization.
  • Fix: Import classes before calling model_validate().

DuplicateKindError: "Duplicate kind 'my_check' detected"

  • Cause: Two classes share the same KIND.
  • Fix: Give every registered class a unique KIND.

Missing registration

  • Cause: Subclass missing the decorator.
  • Fix: Use @Check.register("...") (or the relevant base).

Import order issues in tests

  • Cause: Tests call model_validate() before importing custom modules.
  • Fix: Import those modules in test setup or fixtures first.

Structured data example

from giskard.checks import Equals, Scenario, StringMatching

result = await (
    Scenario("structured-example")
    .interact(
        {"question": "What is the capital of France?"},
        lambda inputs: {"answer": "Paris is the capital of France.", "confidence": 0.95}
    )
    .check(StringMatching(
        name="contains_paris",
        keyword="Paris",
        text_key="trace.last.outputs.answer",
    ))
    .check(Equals(
        name="high_confidence",
        expected_value=0.95,
        key="trace.last.outputs.confidence",
    ))
    .run()
)

assert result.passed
print(f"Scenario completed in {result.duration_ms}ms")

Multi-step workflows

Use the fluent API to create multi-turn scenarios. Components execute sequentially with a shared trace, stopping at the first failing check.

from giskard.checks import LLMJudge, RegexMatching, Scenario

result = await (
    Scenario("multi_step_conversation")
    .interact(
        "Hello, I want to apply for a job.",
        lambda inputs: "Hi! I'd be happy to help. Please provide your email."
    )
    .check(LLMJudge(
        prompt="The assistant asked for the email politely: {{ trace.last.outputs }}"
    ))
    .interact(
        "My email is test@example.com",
        lambda inputs: f"Thank you! I've saved your application with email: {inputs.split()[-1]}"
    )
    .check(RegexMatching(
        pattern="test@example.com",
        text_key="trace.last.outputs",
    ))
    .run()
)

assert result.passed

Dynamic interaction generation

The fluent API supports callables (sync/async) or generators for dynamic inputs. Multiple inputs can be produced by yielding from a generator.

from giskard.checks import Scenario, Trace, from_fn


async def input_generator(trace: Trace):
    count = len(trace.interactions)
    next_input = {"message": f"Hello! This is message #{count + 1}"}
    yield next_input  # Can also yield multiple times for streaming inputs


result = await (
    Scenario("dynamic-example")
    .interact(
        input_generator,
        lambda inputs: {
            "response": f"Hi there! Received: {inputs['message']}",
        }
    )
    .check(from_fn(lambda trace: True, name="noop"))
    .run()
)

User simulation

Use UserSimulator for LLM-powered user personas in multi-turn scenarios. Supports predefined personas (e.g., frustrated_customer, helpful_user) or custom descriptions.

from giskard.checks import Scenario, UserSimulator, set_default_generator
from giskard.agents.generators import Generator

set_default_generator(Generator(model="openai/gpt-4o-mini"))

result = await (
    Scenario("user-simulation")
    .interact(
        UserSimulator(persona="frustrated_customer", context="delayed order"),
        lambda inputs: "I apologize for the delay. Let me look into that.",
    )
    .run()
)

LLM-based checks

from giskard.agents.generators import Generator

from giskard.checks import (
    Conformity,
    LLMJudge,
    Scenario,
    set_default_generator,
)

# Configure the default LLM generator
set_default_generator(Generator(model="openai/gpt-4o-mini"))

result = await (
    Scenario("llm-example")
    .interact(
        {"question": "What is the capital of France?"},
        lambda inputs: {"answer": "Paris is the capital of France."}
    )
    .check(Conformity(name="conformity_check", rule="The agent must answer in English"))
    .check(LLMJudge(
        name="safety_check",
        prompt="""Evaluate if the following response is safe and appropriate.

Input: {{ trace.last.inputs }}
Response: {{ trace.last.outputs }}

Return 'passed: true' if safe, 'passed: false' if unsafe.""",
    ))
    .run()
)

assert result.passed
print(f"LLM evaluation completed in {result.duration_ms}ms")

Template customization & advanced LLM usage

  • LLM-based checks ship with template references registered inside giskard.agents.
  • Provide your own template by overriding get_prompt() in a subclass or by instantiating LLMJudge with inline prompts.
  • Templates use the same interpolation context you return from get_inputs().

from giskard.agents.workflow import TemplateReference
from pydantic import BaseModel

from giskard.checks import BaseLLMCheck, Check, CheckResult, Trace


class CustomResult(BaseModel):
    score: float
    passed: bool
    reasoning: str


@Check.register("custom_llm_check")
class CustomLLMCheck(BaseLLMCheck):
    def get_prompt(self) -> TemplateReference:
        return TemplateReference(template_name="my_project::checks/custom_check.j2")

    @property
    def output_type(self) -> type[BaseModel]:
        return CustomResult

    async def _handle_output(
        self,
        output_value: CustomResult,
        template_inputs: dict[str, str],
        trace: Trace,
    ) -> CheckResult:
        if output_value.score >= 0.8:
            return CheckResult.success(f"Score {output_value.score} meets threshold")
        return CheckResult.failure(f"Score {output_value.score} below threshold")

Notes

  • Trace captures every interaction; JSONPath keys like trace.last.outputs resolve against that structure.
  • Pass a generator to individual LLM checks or rely on the default configured via set_default_generator().
  • Built-in LLM checks rely on templates bundled in giskard.checks and registered with the giskard-agents template system; override get_prompt or get_inputs for customization.
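Dotted-key resolution can be pictured as walking nested data. This is a simplified sketch under two stated simplifications: the library actually evaluates jsonpath-ng expressions, and `last` is an attribute of Trace rather than a dict entry.

```python
# Simplified sketch of resolving a dotted key such as "trace.last.outputs.answer".
# Illustrative only: the library uses jsonpath-ng, and `last` is a Trace attribute.
def resolve_key(data: dict, key: str):
    node = data
    for part in key.split("."):
        node = node[part]
    return node


trace_view = {"trace": {"last": {"outputs": {"answer": "Paris"}}}}
assert resolve_key(trace_view, "trace.last.outputs.answer") == "Paris"
```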

Advanced Usage

For advanced use cases where you need direct control over interactions or trace construction, you can build a Trace from Interaction objects and run checks against it with a TestCase:

from giskard.checks import Interaction, TestCase, Trace

# Build a Trace manually for a TestCase
trace = Trace(interactions=[
    Interaction(inputs="some text", outputs=process("some text")),  # process() is your SUT
])
tc = TestCase(trace=trace, checks=[check1, check2], name="advanced_example")
test_case_result = await tc.run()

For programmatic test generation, or when you need fine-grained control, you can also construct Scenario objects directly from a sequence of InteractionSpec and Check objects:

from giskard.checks import (
    Scenario,
    Interact, # Inherits from `InteractionSpec`
    Equals # Inherits from `Check`
)

scenario = Scenario(name="programmatic_scenario").extend(
    Interact(inputs="Hello", outputs=lambda inputs: "Hi"),
    Equals(expected_value="Hi", key="trace.last.outputs"),
)

result = await scenario.run()

Note: For most use cases, the fluent API (Scenario(...).interact().check()) is recommended as it's simpler and more readable.

Development

Use the Makefile for all development workflows (make help for details).

make install   # Install dependencies
make setup     # Install dependencies + tools (format, lint, typecheck, test)

Other common commands:

make test
make lint
make format
make typecheck
make check
make clean

For more details, see the Makefile or run make help.
