LLM testing framework for validating agent behavior and tool usage

LLM Goose 🪿

LLM-powered testing for LLM agents — define expectations as you'd describe them to a human


Goose is a Python library, CLI, and web dashboard that helps developers build and iterate on LLM agents faster.
Write tests in Python, run them from the terminal or dashboard, and instantly see what went wrong when things break.

Goose currently targets LangChain-based agents; framework-agnostic support is planned.

Why Goose?

Think of Goose as pytest for LLM agents:

  • Natural language expectations – Describe what should happen in plain English; an LLM validator checks if the agent delivered.
  • Tool call assertions – Verify your agent called the right tools, not just that it sounded confident.
  • Full execution traces – See every tool call, response, and validation result in the web dashboard.
  • Pytest-style fixtures – Reuse agent setup across tests with @fixture decorators.
  • Hot-reload during development – Edit your agent code, re-run tests instantly without restarting the server.

Dashboard screenshot

Detail screenshot

Install 🚀

pip install llm-goose
npm install -g @llm-goose/dashboard-cli

CLI

# run tests from the terminal
goose-run tests

# add -v / --verbose to stream detailed steps
goose-run -v tests

API & Dashboard

# start the API server (FastAPI + Uvicorn)
goose-api example_tests

# enable hot-reloading of your agent/tools code during development
goose-api example_tests --reload-target example_system

# run the dashboard (connects to localhost:8000 by default)
goose-dashboard

# or point the dashboard at a custom API URL
GOOSE_API_URL="http://localhost:8000" goose-dashboard

Quick Start: Minimal Example 🏃‍♂️

Here's a complete, runnable example of testing an LLM agent with Goose. This creates a simple weather assistant agent and tests it.

1. Set up your agent

Create my_agent.py:

from dotenv import load_dotenv
from langchain.agents import create_agent
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
from goose.testing.models.messages import AgentResponse

load_dotenv()

@tool
def get_weather(location: str) -> str:
    """Get the current weather for a given location."""
    return f"The weather in {location} is sunny and 75°F."

agent = create_agent(
    model="gpt-4o-mini",
    tools=[get_weather],
    system_prompt="You are a helpful weather assistant",
)

def query_weather_agent(question: str) -> AgentResponse:
    """Query the agent and return a normalized response."""
    result = agent.invoke({"messages": [HumanMessage(content=question)]})
    return AgentResponse.from_langchain(result)

2. Set up fixtures

Create tests/conftest.py:

from goose.testing import Goose, fixture
from langchain_openai import ChatOpenAI

from my_agent import query_weather_agent

@fixture(name="weather_goose")  # name is optional; it defaults to the function name
def weather_goose_fixture() -> Goose:
    """Provide a Goose instance wired up to the sample LangChain agent."""

    return Goose(
        agent_query_func=query_weather_agent,
        validator_model=ChatOpenAI(model="gpt-4o-mini"),
    )

3. Write a test

Create tests/test_weather.py. Goose injects fixtures into the test functions it discovers. Test file names and test function names must start with test_ to be discovered.

from goose.testing import Goose
from my_agent import get_weather

def test_weather_query(weather_goose: Goose) -> None:
    """Test that the agent can answer weather questions."""

    weather_goose.case(
        query="What's the weather like in San Francisco?",
        expectations=[
            "Agent provides weather information for San Francisco",
            "Response mentions sunny weather and 75°F",
        ],
        expected_tool_calls=[get_weather],
    )
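
The discovery and injection rules described above can be sketched in plain Python. This is an illustration under assumptions, not Goose's code: `discover_tests` and `call_with_fixtures` are hypothetical helpers, and matching fixtures to parameters by name is inferred from the `weather_goose` parameter in the example rather than stated by the library.

```python
import inspect
from types import SimpleNamespace
from typing import Any, Callable, Dict


def discover_tests(module: Any) -> Dict[str, Callable[..., Any]]:
    """Collect callables whose attribute names start with test_."""
    return {
        name: obj
        for name, obj in vars(module).items()
        if name.startswith("test_") and callable(obj)
    }


def call_with_fixtures(
    test_func: Callable[..., Any],
    fixtures: Dict[str, Callable[[], Any]],
) -> Any:
    """Call test_func, building each argument from the fixture of the same name."""
    params = inspect.signature(test_func).parameters
    kwargs = {name: fixtures[name]() for name in params}
    return test_func(**kwargs)


def greeting_case(greeter: str) -> str:
    return f"{greeter} says hi"


def helper() -> None:
    pass


# Stand-in for an imported test module: one test_ entry and one helper.
fake_module = SimpleNamespace(test_greeting=greeting_case, helper=helper)

found = discover_tests(fake_module)
result = call_with_fixtures(found["test_greeting"], {"greeter": lambda: "Goose"})
```

Only `test_greeting` is collected; `helper` is skipped because its name lacks the `test_` prefix.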

4. Run the test

goose-run tests

That's it! Goose will run your agent, check that it called the expected tools, and validate the response against your expectations.
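
As a rough mental model of the tool-call check, the sketch below compares the tool names a test expects with the names recorded during a run. It is a simplified illustration, not Goose's implementation: `missing_tool_calls` is a hypothetical helper, and it ignores call order and repeated calls, which Goose may or may not take into account.

```python
from typing import Iterable, List


def missing_tool_calls(expected: Iterable[str], actual: Iterable[str]) -> List[str]:
    """Return the expected tool names that never appear among the actual calls."""
    called = set(actual)
    return [name for name in expected if name not in called]


# Tool names recorded during a hypothetical agent run.
actual_calls = ["get_weather"]

# Every expected tool was called, so nothing is missing.
all_called = missing_tool_calls(["get_weather"], actual_calls)

# "get_forecast" is a made-up tool name that the agent never called.
one_missing = missing_tool_calls(["get_weather", "get_forecast"], actual_calls)
```

An empty result means the check passes. The natural-language expectations are a separate step, handled by the validator model.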

Writing tests

At its core, Goose lets you describe what a good interaction looks like and then assert that your agent and tools actually behave that way.

Pytest-inspired syntax

Goose cases combine a natural‑language query, human‑readable expectations, and (optionally) the tools you expect the agent to call. This example is adapted from example_tests/agent_behaviour_test.py and shows an analytical workflow where the agent both retrieves data and creates records:

def test_sale_then_inventory_update(goose_fixture: Goose) -> None:
    """Complex workflow: Sell 2 Hiking Boots and report the remaining stock."""

    count_before = Transaction.objects.count()
    inventory = ProductInventory.objects.get(product__name="Hiking Boots")
    assert inventory is not None, "Expected inventory record for Hiking Boots"

    goose_fixture.case(
        query="Sell 2 pairs of Hiking Boots to John Doe and then tell me how many we have left",
        expectations=[
            "Agent created a sale transaction for 2 Hiking Boots to John Doe",
            "Agent then checked remaining inventory after the sale",
            "Response confirmed the sale was processed",
            "Response provided updated stock information",
        ],
        expected_tool_calls=[check_inventory, create_sale],
    )

    count_after = Transaction.objects.count()
    inventory_after = ProductInventory.objects.get(product__name="Hiking Boots")

    assert count_after == count_before + 1, f"Expected 1 new transaction, got {count_after - count_before}"
    assert inventory_after is not None, "Expected inventory record after sale"
    assert inventory_after.stock == inventory.stock - 2, f"Expected stock {inventory.stock - 2}, got {inventory_after.stock}"

Custom lifecycle hooks

You can use the existing lifecycle hooks or implement your own. Hooks are invoked before each test starts and after it finishes, which lets you set up your environment and tear it down afterwards.

from goose.testing.hooks import TestLifecycleHook

# Note: TestDefinition must also be imported from Goose; its import path is not shown in this example.

class MyLifecycleHooks(TestLifecycleHook):
    """Suite and per-test lifecycle hooks invoked around Goose executions."""

    def pre_test(self, definition: TestDefinition) -> None:
        """Hook invoked before a single test executes."""
        setup()  # placeholder for your own setup logic

    def post_test(self, definition: TestDefinition) -> None:
        """Hook invoked after a single test completes."""
        teardown()  # placeholder for your own teardown logic


# tests/conftest.py
from goose.testing import Goose, fixture
from langchain_openai import ChatOpenAI

from my_agent import query

# MyLifecycleHooks is the class defined above; import it here if it lives in another module.

@fixture()
def goose_fixture() -> Goose:
    """Provide a Goose instance wired up to the sample LangChain agent."""

    model = ChatOpenAI(model="gpt-4o-mini")
    return Goose(
        agent_query_func=query,
        validator_model=model,
        hooks=MyLifecycleHooks(),
    )
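
To make the before/after ordering concrete, here is a minimal sketch of a runner that wraps a test in `pre_test` and `post_test` calls. It is an assumption-laden illustration, not Goose's runner: `RecordingHook` and `run_with_hooks` are hypothetical, the hooks take a plain test name instead of a `TestDefinition`, and running `post_test` after a failed test is assumed here, not confirmed by the library.

```python
from typing import Callable, List


class RecordingHook:
    """Hypothetical hook that records the order in which it is invoked."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def pre_test(self, test_name: str) -> None:
        self.events.append(f"pre:{test_name}")

    def post_test(self, test_name: str) -> None:
        self.events.append(f"post:{test_name}")


def run_with_hooks(
    hook: RecordingHook,
    test_name: str,
    test_func: Callable[[], None],
) -> bool:
    """Run one test between pre_test and post_test; return True if it passed."""
    hook.pre_test(test_name)
    try:
        test_func()
        return True
    except AssertionError:
        return False
    finally:
        # Teardown runs even when the test body raises.
        hook.post_test(test_name)


def failing_case() -> None:
    assert 1 + 1 == 3, "intentional failure"


hook = RecordingHook()
passed = run_with_hooks(hook, "failing_case", failing_case)
```

The `try`/`finally` keeps teardown paired with setup, so a failing test does not leave state behind for the next one.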

License

MIT License – see LICENSE for full text.
