
LLM testing framework for validating agent behavior and tool usage


Goose LLM 🪿

Goose is a batteries‑included Python library and CLI for validating LLM agents end‑to‑end.

Design conversational test cases, run them locally or in CI, and (optionally) plug in a React dashboard – all while staying in Python.

Turn your “vibes‑based” LLM evaluations into repeatable, versioned tests instead of “it felt smart on that one prompt” QA.

Why Goose?

Think of Goose as pytest for LLM agents:

  • Stop guessing – Encode expectations once, rerun them on every model/version/deploy.
  • See what actually happened – Rich execution traces, validation results, and per‑step history.
  • Fits your stack – Wraps your existing agents and tools; no framework rewrite required.
  • Stay in Python – Pydantic models, type hints, and a straightforward API.

Install in your project 🚀

Install the core library and CLI from PyPI:

pip install llm-goose

Quick Start: Minimal Example 🏃‍♂️

Here's a complete, runnable example of testing an LLM agent with Goose. This creates a simple weather assistant agent and tests it.

1. Set up your agent

Create my_agent.py:


from dotenv import load_dotenv
from langchain.agents import create_agent
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
from goose.testing.models.messages import AgentResponse

load_dotenv()

@tool
def get_weather(location: str) -> str:
    """Get the current weather for a given location."""
    return f"The weather in {location} is sunny and 75°F."

agent = create_agent(
    model="gpt-4o-mini",
    tools=[get_weather],
    system_prompt="You are a helpful weather assistant",
)

def query(question: str) -> AgentResponse:
    """Query the agent and return a normalized response."""
    result = agent.invoke({"messages": [HumanMessage(content=question)]})
    return AgentResponse.from_langchain(result)

2. Write a test

Create tests/test_weather.py:

from goose.testing import Goose

from my_agent import get_weather

def test_weather_query(goose: Goose) -> None:
    """Test that the agent can answer weather questions."""

    goose.case(
        query="What's the weather like in San Francisco?",
        expectations=[
            "Agent provides weather information for San Francisco",
            "Response mentions sunny weather and 75°F",
        ],
        expected_tool_calls=[get_weather],
    )

3. Set up fixtures

Create tests/conftest.py:

from goose.testing import Goose, fixture

from my_agent import query

@fixture()
def goose() -> Goose:
    """Provide a Goose instance for testing."""
    return Goose(query)

4. Run the test

pip install pytest
pytest tests/

That's it! Goose will run your agent, check that it called the expected tools, and validate the response against your expectations.
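Conceptually, the tool-call check boils down to comparing the tools the agent actually invoked against the ones you listed in `expected_tool_calls`. Here is a minimal, self-contained sketch of that idea – this is an illustration, not Goose's actual implementation, and the `Trace` type is invented for the example:

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Invented stand-in for an agent execution trace."""

    tool_calls: list[str] = field(default_factory=list)


def missing_expected_tools(trace: Trace, expected: list[str]) -> list[str]:
    """Return the expected tool names the agent never called."""
    called = set(trace.tool_calls)
    return [name for name in expected if name not in called]


# The weather agent above is expected to call get_weather:
ok_trace = Trace(tool_calls=["get_weather"])
bad_trace = Trace(tool_calls=[])
assert missing_expected_tools(ok_trace, ["get_weather"]) == []
assert missing_expected_tools(bad_trace, ["get_weather"]) == ["get_weather"]
```

The natural-language expectations are validated separately (by an LLM judge), but the tool-usage check is this kind of set comparison at heart.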

Writing & running tests ✅

At its core, Goose lets you describe what a good interaction looks like and then assert that your agent and tools actually behave that way.

Describe expectations & tool usage

Goose cases combine a natural‑language query, human‑readable expectations, and (optionally) the tools you expect the agent to call. This example is adapted from example_tests/agent_behaviour_test.py and shows an analytical workflow where the agent both retrieves data and computes aggregates:

from __future__ import annotations

from example_system.models import Transaction
from example_system.tools import calculate_revenue, get_sales_history
from goose.testing import Goose


def test_sales_history_with_revenue_analysis(goose: Goose) -> None:
    """What were sales in October 2025 and the total revenue?"""

    transactions = Transaction.objects.prefetch_related("items__product").all()
    total_revenue = sum(
        item.price_usd * item.quantity
        for txn in transactions
        for item in txn.items.all()
    )

    goose.case(
        query="What were our sales in October 2025 and how much total revenue?",
        expectations=[
            "Agent retrieved sales history for October 2025",
            "Agent calculated total revenue from the retrieved transactions",
            "Response included the sample transaction from October 15",
            f"Response showed total revenue of ${total_revenue:.2f}",
            "Agent used sales history data to compute revenue totals",
        ],
        expected_tool_calls=[get_sales_history, calculate_revenue],
    )

In the full example suite, the goose fixture is registered automatically through example_tests/goose_config.py, which wires up the example agent and runner. The remaining fixtures in example_tests/conftest.py (like the setup_data autouse fixture) continue to seed data using @fixture from goose.testing.
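If you are wiring up your own suite, the factory the CLI consumes is just a callable that returns a configured `Goose` instance. A sketch of what such a `goose_config.py` might look like, assuming the `my_agent` module from the quick start (the contents of the example's own config are not reproduced here):

```python
# Hypothetical goose_config.py for your own project – adapt the import paths.
from goose.testing import Goose

from my_agent import query  # the query() entry point from the quick start


def get_goose_config() -> Goose:
    """Factory consumed by `goose-run run my_package.goose_config:get_goose_config`."""
    return Goose(query)
```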

Pytest‑inspired style (including failures)

Goose is pytest‑inspired: tests are just functions, and you can mix Goose expectations with regular assertions. That makes intentional failure cases straightforward to express and debug:

from example_system.models import Product
from example_system.tools import get_product_details


def test_failure_assertion_missing_products(goose: Goose) -> None:
    """Intentional failure to verify assertion handling."""

    goose.case(
        query="What's the price of Hiking Boots?",
        expectations=["Agent provided the correct price"],
        expected_tool_calls=[get_product_details],
    )

    # This assertion is designed to fail – fixtures populate products
    assert Product.objects.count() == 0, "Intentional failure: products are populated in fixtures"

When you run this test, Goose will surface both expectation mismatches (if any) and the failing assertion in the same run, just like you’d expect from a testing framework.

Run tests from the CLI

Use the bundled CLI (installed as the goose-run command) to discover and execute your test modules. Point it at a Goose application using module.path:factory syntax:

# execute every test module under example_tests/
goose-run run example_tests.goose_config:get_goose_config

# list discovered tests without running them
goose-run run --list example_tests.goose_config:get_goose_config

Custom lifecycle hooks (optional)

Goose detects framework integrations via lifecycle hook classes:

  • TestLifecycleHooks – default no-op setup that works everywhere.
  • DjangoTestHooks – automatically used when DJANGO_SETTINGS_MODULE is set and Django is installed.

If you need special setup/teardown logic, subclass TestLifecycleHooks, override setup() / teardown() or the new pre_test() / post_test() hooks, and pass an instance to Goose(..., hooks=MyHooks()) when wiring your test config.
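The hook pattern itself is plain Python subclassing. The sketch below uses a stand-in base class so it runs without llm-goose installed – in a real suite you would subclass `TestLifecycleHooks` from `goose.testing` instead – and demonstrates the override points named above:

```python
# Stand-in base class for illustration only; in real code, import
# TestLifecycleHooks from goose.testing rather than redefining it.
class TestLifecycleHooks:
    def setup(self) -> None: ...
    def teardown(self) -> None: ...
    def pre_test(self) -> None: ...
    def post_test(self) -> None: ...


class LoggingHooks(TestLifecycleHooks):
    """Records lifecycle events so the call ordering can be inspected."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def setup(self) -> None:
        self.events.append("setup")  # one-time setup before tests

    def pre_test(self) -> None:
        self.events.append("pre_test")  # runs before each test

    def post_test(self) -> None:
        self.events.append("post_test")  # runs after each test

    def teardown(self) -> None:
        self.events.append("teardown")  # one-time cleanup after tests


# Simulate the order a runner would presumably invoke the hooks in:
hooks = LoggingHooks()
hooks.setup()
hooks.pre_test()
hooks.post_test()
hooks.teardown()
assert hooks.events == ["setup", "pre_test", "post_test", "teardown"]
```

You would then pass an instance when wiring your config, e.g. `Goose(query, hooks=LoggingHooks())`.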

Example system & jobs API (optional) 🌐

For richer workflows (and the React dashboard), Goose ships an example Django system and a FastAPI jobs API. These live in this repo only – they are not installed with the PyPI package – but you can run them locally for inspiration or internal tooling.

On the dashboard, the main grid view shows one card per test in your suite, with a status pill (Passed, Failed, Queued, Running, or Not Run), the most recent duration if it has been executed, and any top‑level error from the last run. Toggling the "only failures" filter collapses the grid down to just the failing tests so you can quickly see which checks are red, which have never been executed, and which ones are currently running.

Dashboard screenshot

When you click into a test, the detail view shows a header with the test name, module path, latest status and duration, plus the original docstring so you remember what the scenario is meant to cover. Below that, an execution history lists each run as a card: you see every expectation with a green check or red cross, along with the validator's reasoning explaining why the run was considered a success or failure. For each step, the underlying messages are rendered as human / AI / tool bubbles, including tool calls and JSON payloads; if something went wrong mid‑run, the captured error text is shown at the bottom of the card.

Detail screenshot

When you install the api extra, you get an additional console script:

  • goose-api – starts the FastAPI job orchestration server

Install with extras:

pip install "llm-goose[api]"

Then launch the service from your project (for example, after wiring Goose into your own system):

# start the API server (FastAPI + Uvicorn)
goose-api

The React dashboard shown in the screenshots lives in this repo under web/ and is not shipped as part of the PyPI package.

React dashboard setup 🖥️

The React dashboard is a separate web application that talks to the Goose jobs API over HTTP. It is built with Vite, React, and Tailwind and is designed to be run either locally during development or deployed as a static site.

Install the published CLI from npm and let it host the built dashboard for you:

npm install -g @llm-goose/dashboard-cli

# point the dashboard at your jobs API
GOOSE_API_URL="http://localhost:8000" goose-dashboard

This starts a small HTTP server (by default on http://localhost:8001) that serves the prebuilt dashboard against whatever Goose API URL you configure.

License

MIT License – see LICENSE for full text.
