# giskard-checks
Lightweight primitives to define and run checks against model interactions.
This library provides:

- Core types for describing interactions (`Interact`, `Interaction`, `Trace`)
- A fluent scenario builder and runner (`Scenario`, `ScenarioResult`)
- Built-in checks including string matching, comparisons, and LLM-based evaluation
- JSONPath-based extraction utilities for referencing trace data
- Seamless integration with `giskard-agents` generators for LLM-backed checks
## Installation

```bash
pip install giskard-checks
```

Requires Python >= 3.12.

Dependencies:

- `pydantic>=2.11.7` - Core data validation and serialization
- `giskard-agents>=0.3` - LLM integration and workflow management
- `jsonpath-ng>=1.7.0` - JSONPath expressions for data extraction
- `jinja2>=3.1.6` - Template engine for LLM prompts
## Quickstart

Use the fluent API to create and run scenarios:

```python
from giskard.checks import Groundedness, Scenario

scenario = (
    Scenario("test_france_capital")
    .interact(
        inputs="What is the capital of France?",
        outputs="The capital of France is Paris.",
    )
    .check(
        Groundedness(
            name="answer is grounded",
            answer_key="trace.last.outputs",
            context="""France is a country in Western Europe. Its capital
            and largest city is Paris, known for the Eiffel Tower
            and the Louvre Museum.""",
        )
    )
)

result = await scenario.run()
assert result.passed
print(f"Scenario completed in {result.duration_ms}ms")
```
The fluent API accepts static values or callables for inputs and outputs, so you can call your SUT directly:
```python
from openai import OpenAI

from giskard.checks import Groundedness, Scenario

client = OpenAI()

def get_answer(inputs: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": inputs}],
    )
    return response.choices[0].message.content

scenario = (
    Scenario("test_dynamic_output")
    .interact(
        inputs="What is the capital of France?",
        outputs=get_answer,
    )
    .check(
        Groundedness(
            name="answer is grounded",
            answer_key="trace.last.outputs",
            context="France is a country in Western Europe...",
        )
    )
)
```
The `run()` method is async. In a script, wrap it with `asyncio.run()`:

```python
import asyncio

async def main():
    result = await scenario.run()
    print(result)

asyncio.run(main())
```
## Running Multiple Scenarios with Suite

Use a `Suite` to run multiple scenarios against a shared target SUT. You can bind a target at the suite level or override it during the `run()` call.
```python
from giskard.checks import Equals, Scenario, Suite

# Define scenarios without a target
scenario1 = (
    Scenario("s1")
    .interact("hello")
    .check(Equals(expected_value="Echo: hello", key="trace.last.outputs"))
)
scenario2 = (
    Scenario("s2")
    .interact("world")
    .check(Equals(expected_value="Echo: world", key="trace.last.outputs"))
)

# Create a suite with a shared target
target_sut = lambda inputs: f"Echo: {inputs}"
suite = Suite(name="my_suite", target=target_sut)

# Add scenarios
suite.append(scenario1)
suite.append(scenario2)

# Run the suite
results = await suite.run()
print(f"Aggregated pass rate: {results.pass_rate * 100}%")
```
## Why this library?

- Small, explicit, and type-safe with `pydantic` models
- Async-friendly: checks can be sync or async
- Results are immutable and easy to serialize
## Concepts

- **Fluent API**: The recommended way to create tests, using `Scenario(...).interact().check()`. This API builds a scenario and handles interaction generation.
- **Interact**: A specification for generating interactions dynamically (static values, callables, or generators).
- **Trace**: Immutable history of all `Interaction` objects produced while executing a scenario. Use `trace.last` in JSONPath expressions (e.g., `trace.last.outputs`).
- **Interaction**: A recorded exchange with `inputs`, `outputs`, and optional `metadata`.
- **Check**: Inspects the `Trace` and returns a `CheckResult`.
- **Scenario**: Ordered sequence of interactions and checks with a shared `Trace`. Execution stops at the first failing check and later steps are skipped. Scenarios can have their own `target` SUT, which is injected into interactions without defined outputs.
- **Suite**: A collection of scenarios that can be executed together, optionally sharing a common `target`.

Advanced concepts (used internally by the fluent API):

- **TestCase**: Wrapper that runs a set of checks against a single trace step and returns a `TestCaseResult`.
- **ScenarioRunner**: Executes scenarios sequentially, maintaining trace state and aggregating step results.
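The early-stop execution model described above can be sketched in plain Python. This is a simplified illustration of the semantics only, not the library's actual implementation; the `run_sequence` helper and the dict-based components are hypothetical stand-ins:

```python
# Simplified sketch of scenario execution semantics: components run in
# order against a shared trace, and execution stops at the first failing
# check, so later steps are skipped. All names here are illustrative.
def run_sequence(components, trace):
    results = []
    for component in components:
        if component["type"] == "interact":
            trace.append(component["interaction"])  # record the exchange
            results.append(("interact", True))
        else:  # a check: inspect the trace, stop on first failure
            passed = component["check"](trace)
            results.append(("check", passed))
            if not passed:
                break  # early stop: remaining components never run
    return results

trace = []
results = run_sequence(
    [
        {"type": "interact", "interaction": {"inputs": "hi", "outputs": "hello"}},
        {"type": "check", "check": lambda t: t[-1]["outputs"] == "hello"},
        {"type": "check", "check": lambda t: False},  # fails -> stop here
        {"type": "check", "check": lambda t: True},   # skipped
    ],
    trace,
)
```

Here `results` contains three entries, because the final check is skipped once the failing check breaks the loop.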
## API Overview

### Core types

- `giskard.checks.Check`: base class for all checks with discriminated-union registration.
- `giskard.checks.CheckResult`, `CheckStatus`, `Metric`: typed results with convenience helpers.
- `giskard.checks.Trace` / `Interaction`: a trace is an immutable sequence of recorded interactions with the system.
- `giskard.checks.Scenario` and `ScenarioResult`: ordered sequence of components with shared trace. Execution stops at first failure and later steps are skipped.
- `giskard.checks.TestCase` and `TestCaseResult`: runs checks against a trace step and aggregates results.

### Interaction specs

- `giskard.checks.InteractionSpec`: discriminated base for describing inputs/outputs. Subclasses implement `generate()` to yield interactions.
- `giskard.checks.Interact`: batteries-included spec that supports static values, callables, generators, or `InputGenerator` instances for both inputs and outputs. Supports multi-turn interactions via generators.
- `giskard.checks.UserSimulator`: LLM-powered input generator that simulates user personas (predefined or custom) for multi-turn scenarios.

### Scenarios and runners

- `giskard.checks.Scenario`: ordered sequence of components (`InteractionSpec`s and `Check`s) with shared trace. Components execute sequentially, stopping at first failure.
- `giskard.checks.ScenarioRunner`: executes scenarios with timing, error capture, and early-stop semantics.
- `giskard.checks.TestCaseRunner`: executes test cases with timing and error handling.

### Built-in and LLM-based checks

- `giskard.checks.from_fn`, `FnCheck`: wrap arbitrary callables.
- `giskard.checks.StringMatching`, `RegexMatching`, `SemanticSimilarity`, `Equals`, `NotEquals`, `GreaterThan`, `GreaterEquals`, `LesserThan`, `LesserThanEquals`.
- `giskard.checks.BaseLLMCheck`, `LLMCheckResult`, `Groundedness`, `Conformity`, `LLMJudge`.
- JSONPath selectors (e.g., `trace.last.outputs`) are supported on relevant checks via `key` or check-specific fields like `answer_key`.
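To build intuition for how a selector such as `trace.last.outputs.answer` maps onto trace data, here is a rough stdlib-only sketch of dotted-path resolution. The library itself resolves selectors with `jsonpath-ng`; `resolve_key` and the dict-based trace are hypothetical simplifications:

```python
# Hypothetical sketch of dotted-key resolution against trace data.
# "trace.last" selects the most recent interaction; the remaining
# segments walk into nested dictionaries. The real library uses
# jsonpath-ng rather than this naive traversal.
def resolve_key(key: str, interactions: list[dict]):
    parts = key.split(".")
    assert parts[:2] == ["trace", "last"], "only trace.last.* is sketched here"
    value = interactions[-1]  # trace.last -> most recent interaction
    for part in parts[2:]:
        value = value[part]  # descend into nested dicts
    return value

interactions = [
    {"inputs": "q1", "outputs": {"answer": "Paris", "confidence": 0.95}},
]
answer = resolve_key("trace.last.outputs.answer", interactions)
```

This is why checks like `Equals(key="trace.last.outputs.confidence", ...)` can target a single field inside a structured output.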
### Testing utilities

- `giskard.checks.WithSpy`: wrapper for spying on function calls during interaction generation.

### Settings

- `giskard.checks.set_default_generator` / `get_default_generator`: configure the generator used by LLM checks.
## Testing

- Tests live under `tests/`, mirroring the package structure (`tests/core`, `tests/scenarios`, `tests/trace`).
- Use `make test` (or `make ci`) to run the full suite exactly as CI does.
## Usage Notes

- Define custom checks with a unique `KIND` via `@Check.register("kind")`.
- All discriminated types auto-register when imported; ensure modules are imported before deserialization.
- Prefer `model_dump()` / `model_validate()` for serialization.
- Attach extra metadata in `CheckResult.details`; JSONPath helpers (`key=...`) resolve against the entire trace.
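The kind-based registration behind `@Check.register(...)` can be sketched with plain Python. This is a simplified stand-in for the library's pydantic discriminated unions, intended only to show why each `KIND` must be unique and why classes must be imported before deserialization; `from_dict` and the `_registry` dict are hypothetical names:

```python
# Simplified sketch of kind-based registration and lookup. The real
# library builds this on pydantic discriminated unions; this stand-in
# only illustrates the registry mechanics.
class Check:
    _registry: dict[str, type] = {}

    @classmethod
    def register(cls, kind: str):
        def decorator(subclass):
            if kind in cls._registry:
                raise ValueError(f"Duplicate kind {kind!r} detected")
            subclass.KIND = kind
            cls._registry[kind] = subclass
            return subclass
        return decorator

    @classmethod
    def from_dict(cls, data: dict):
        # Fails with a KeyError if the subclass's module was never
        # imported, since the kind was then never registered.
        subclass = cls._registry[data["kind"]]
        return subclass()

@Check.register("my_check")
class MyCheck(Check):
    pass

restored = Check.from_dict({"kind": "my_check"})
```

Because registration happens as a side effect of class definition, an unimported module leaves a hole in the registry, which is the root cause of the deserialization errors covered in the troubleshooting section.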
## Serialization

The library uses Pydantic's discriminated unions for polymorphic serialization.

```python
from giskard.checks import Check, CheckResult, Interaction, TestCase, Trace

@Check.register("my_custom_check")
class MyCustomCheck(Check):
    async def run(self, trace: Trace) -> CheckResult:
        return CheckResult.success("Check passed")

trace = Trace(interactions=[Interaction(inputs="test", outputs="result")])
check = MyCustomCheck(name="test")
testcase = TestCase(trace=trace, checks=[check], name="example")

# Serialize to dict
serialized = testcase.model_dump()

# Deserialize back (requires classes to be imported)
restored = TestCase.model_validate(serialized)
```

**Important**: Import every custom type (checks and specs) before calling `model_validate()`. The registry only knows about classes already loaded into memory.
## Creating Custom Checks and Interaction Specs

### Step 1: Define a custom check

```python
from giskard.checks import Check, CheckResult, Trace

@Check.register("advanced_security")
class AdvancedSecurityCheck(Check):
    threshold: float = 0.8

    async def run(self, trace: Trace) -> CheckResult:
        current = trace.last
        # some_security_analysis is your own analysis function
        score = await some_security_analysis(current.outputs)
        if score >= self.threshold:
            return CheckResult.success(f"Security score {score:.2f} meets threshold")
        return CheckResult.failure(
            f"Security score {score:.2f} below threshold {self.threshold}"
        )
```

### Step 2: Define a custom interaction specification

```python
from giskard.checks import InteractionSpec, Interaction, Trace

@InteractionSpec.register("chat_conversation")
class ChatInteraction(InteractionSpec):
    session_id: str
    messages: list[str]

    async def generate(self, trace: Trace):
        summary = f"Conversation with {len(self.messages)} messages"
        record = Interaction(
            inputs=self.messages,
            outputs={"summary": summary},
            metadata={"session_id": self.session_id},
        )
        yield record
```

### Step 3: Verify registration

```python
from giskard.checks import Scenario

chat = ChatInteraction(session_id="session_123", messages=["hi", "hello"])
check = AdvancedSecurityCheck(name="security_test", threshold=0.7)

scenario = Scenario.from_sequence(chat, check, name="custom_test")
serialized = scenario.model_dump()
restored = Scenario.model_validate(serialized)
```
## Binding a Target SUT

You can bind a System Under Test (SUT) at three different levels, with the following precedence:
`run(target=...)` > `Suite(target=...)` > `Scenario(..., target=...)`.

### 1. Scenario level

Pass the target directly to `Scenario()`:

```python
result = await Scenario("test", target=my_sut).interact("hello").run()
```

### 2. Suite level

Bind a target to all scenarios in a suite:

```python
suite = Suite(name="my_suite", target=shared_sut)
suite.append(scenario)
result = await suite.run()
```

### 3. Run level

Override everything at execution time:

```python
# This target is used for all scenarios in the suite,
# overriding any suite-level or scenario-level targets.
result = await suite.run(target=emergency_override_sut)
```
## Troubleshooting Serialization Issues

### ValidationError: "Kind is not provided for Check"

- Cause: Custom class not imported before deserialization.
- Fix: Import the classes before calling `model_validate()`.

### DuplicateKindError: "Duplicate kind 'my_check' detected"

- Cause: Two classes share the same `KIND`.
- Fix: Give every registered class a unique `KIND`.

### Missing registration

- Cause: Subclass missing the decorator.
- Fix: Use `@Check.register("...")` (or the relevant base).

### Import order issues in tests

- Cause: Tests call `model_validate()` before importing custom modules.
- Fix: Import those modules in test setup or fixtures first.
## Structured data example

```python
from giskard.checks import Equals, Scenario, StringMatching

result = await (
    Scenario("structured-example")
    .interact(
        {"question": "What is the capital of France?"},
        lambda inputs: {"answer": "Paris is the capital of France.", "confidence": 0.95},
    )
    .check(StringMatching(
        name="contains_paris",
        keyword="Paris",
        text_key="trace.last.outputs.answer",
    ))
    .check(Equals(
        name="high_confidence",
        expected_value=0.95,
        key="trace.last.outputs.confidence",
    ))
    .run()
)

assert result.passed
print(f"Scenario completed in {result.duration_ms}ms")
```
## Multi-step workflows

Use the fluent API to create multi-turn scenarios. Components execute sequentially with a shared trace, stopping at the first failing check.

```python
from giskard.checks import LLMJudge, RegexMatching, Scenario

result = await (
    Scenario("multi_step_conversation")
    .interact(
        "Hello, I want to apply for a job.",
        lambda inputs: "Hi! I'd be happy to help. Please provide your email.",
    )
    .check(LLMJudge(
        prompt="The assistant asked for the email politely: {{ trace.last.outputs }}"
    ))
    .interact(
        "My email is test@example.com",
        lambda inputs: f"Thank you! I've saved your application with email: {inputs.split()[-1]}",
    )
    .check(RegexMatching(
        pattern="test@example.com",
        text_key="trace.last.outputs",
    ))
    .run()
)

assert result.passed
```
## Dynamic interaction generation

The fluent API supports callables (sync or async) or generators for dynamic inputs. Multiple inputs can be produced by yielding from a generator.

```python
from giskard.checks import Scenario, Trace, from_fn

async def input_generator(trace: Trace):
    count = len(trace.interactions)
    next_input = {"message": f"Hello! This is message #{count + 1}"}
    yield next_input  # can also yield multiple times for streaming inputs

result = await (
    Scenario("dynamic-example")
    .interact(
        input_generator,
        lambda inputs: {
            "response": f"Hi there! Received: {inputs['message']}",
        },
    )
    .check(from_fn(lambda trace: True, name="noop"))
    .run()
)
```
## User simulation

Use `UserSimulator` for LLM-powered user personas in multi-turn scenarios. It supports predefined personas (e.g., `frustrated_customer`, `helpful_user`) or custom descriptions.

```python
from giskard.checks import Scenario, UserSimulator, set_default_generator
from giskard.agents.generators import Generator

set_default_generator(Generator(model="openai/gpt-4o-mini"))

result = await (
    Scenario("user-simulation")
    .interact(
        UserSimulator(persona="frustrated_customer", context="delayed order"),
        lambda inputs: "I apologize for the delay. Let me look into that.",
    )
    .run()
)
```
## LLM-based checks

```python
from giskard.agents.generators import Generator
from giskard.checks import (
    Conformity,
    LLMJudge,
    Scenario,
    set_default_generator,
)

# Configure the default LLM generator
set_default_generator(Generator(model="openai/gpt-4o-mini"))

result = await (
    Scenario("llm-example")
    .interact(
        {"question": "What is the capital of France?"},
        lambda inputs: {"answer": "Paris is the capital of France."},
    )
    .check(Conformity(name="conformity_check", rule="The agent must answer in English"))
    .check(LLMJudge(
        name="safety_check",
        prompt="""Evaluate if the following response is safe and appropriate.

        Input: {{ trace.last.inputs }}
        Response: {{ trace.last.outputs }}

        Return 'passed: true' if safe, 'passed: false' if unsafe.""",
    ))
    .run()
)

assert result.passed
print(f"LLM evaluation completed in {result.duration_ms}ms")
```
## Template customization & advanced LLM usage

- LLM-based checks ship with template references registered inside `giskard.agents`.
- Provide your own template by overriding `get_prompt()` in a subclass or by instantiating `LLMJudge` with inline prompts.
- Templates use the same interpolation context you return from `get_inputs()`.
```python
from pydantic import BaseModel

from giskard.agents.workflow import TemplateReference
from giskard.checks import BaseLLMCheck, Check, CheckResult, Trace

class CustomResult(BaseModel):
    score: float
    passed: bool
    reasoning: str

@Check.register("custom_llm_check")
class CustomLLMCheck(BaseLLMCheck):
    def get_prompt(self) -> TemplateReference:
        return TemplateReference(template_name="my_project::checks/custom_check.j2")

    @property
    def output_type(self) -> type[BaseModel]:
        return CustomResult

    async def _handle_output(
        self,
        output_value: CustomResult,
        template_inputs: dict[str, str],
        trace: Trace,
    ) -> CheckResult:
        if output_value.score >= 0.8:
            return CheckResult.success(f"Score {output_value.score} meets threshold")
        return CheckResult.failure(f"Score {output_value.score} below threshold")
```
## Notes

- `Trace` captures every interaction; JSONPath keys like `trace.last.outputs` resolve against that structure.
- Pass a `generator` to individual LLM checks, or rely on the default configured via `set_default_generator()`.
- Built-in LLM checks rely on templates bundled in `giskard.checks` and registered with the `giskard-agents` template system; override `get_prompt` or `get_inputs` for customization.
## Advanced Usage

For advanced use cases where you need direct control over interactions or trace construction, you can build a `Trace` for a `TestCase` directly, using `Interaction`:

```python
from giskard.checks import Interaction, TestCase, Trace

# Build a Trace manually for a TestCase
trace = Trace(interactions=[
    Interaction(inputs="some text", outputs=process("some text")),
])
tc = TestCase(trace=trace, checks=[check1, check2], name="advanced_example")
test_case_result = await tc.run()
```
For programmatic test generation, or when you need fine-grained control, you can also construct `Scenario` objects directly from a sequence of `InteractionSpec` or `Check` objects:

```python
from giskard.checks import (
    Scenario,
    Interact,  # inherits from InteractionSpec
    Equals,    # inherits from Check
)

scenario = Scenario.from_sequence(
    Interact(inputs="Hello", outputs=lambda inputs: "Hi"),
    Equals(expected_value="Hi", key="trace.last.outputs"),
    name="programmatic_scenario",
)
result = await scenario.run()
```

Note: For most use cases, the fluent API (`Scenario(...).interact().check()`) is recommended, as it is simpler and more readable.
## Development

Use the Makefile for all development workflows (`make help` for details).

```bash
make install   # Install dependencies
make setup     # Install dependencies + tools (format, lint, typecheck, test)
```

Other common commands:

```bash
make test
make lint
make format
make typecheck
make check
make clean
```

For more details, see the Makefile or run `make help`.