llm-behave

Behavioral testing for LLM applications. A pytest plugin.

No LLM judge. No API cost. Offline models. Works with any provider.

from llm_behave import assert_behavior

def test_support_bot():
    output = my_support_bot("I want a refund for order 1234")

    assert_behavior(output) \
        .mentions("refund policy") \
        .tone("empathetic") \
        .not_mentions("competitor")

Why llm-behave?

Most LLM testing tools do one of the following:

  • Use another LLM to judge the output (expensive, slow, circular)
  • Only do exact string matching (misses semantic meaning)
  • Don't support multi-turn conversations at all

llm-behave uses small offline transformer models (80MB, runs on CPU) to understand meaning — no API calls, no cost, no internet required during tests.


Install

# Core (tool call assertions, structure tests — no ML deps)
pip install llm-behave

# Full (semantic assertions: mentions, tone, intent, drift)
pip install llm-behave[semantic]

Features at a glance

Feature            What it does
mentions()         Semantic similarity — "money back" matches "refund"
not_mentions()     Assert a topic is NOT brought up
tone()             Detect empathetic / professional / rude / helpful, etc.
intent()           Does the response intend to help? refuse? apologize?
calls_tool()       Assert which tool the LLM called
ConversationTest   Multi-turn testing with memory and contradiction detection
DriftTest          Save baseline behavior and detect regressions in CI

Usage

Basic assertions

from llm_behave import assert_behavior

output = my_llm("I want a refund")

# Semantic match — not exact string
assert_behavior(output).mentions("refund policy")

# Tone detection
assert_behavior(output).tone("empathetic")
assert_behavior(output).tone("professional", threshold=0.6)

# Intent
assert_behavior(output).intent("offering to help the customer")

# Negative assertions
assert_behavior(output).not_mentions("competitor")

# Fluent chaining
assert_behavior(output) \
    .mentions("refund") \
    .tone("empathetic") \
    .not_mentions("competitor")

Tool call assertions

text, tool_calls = my_llm.chat_with_tools(messages, tools=my_tools)

assert_behavior(text, tool_calls) \
    .calls_tool("lookup_order") \
    .mentions("order")

Multi-turn conversation testing

from llm_behave import ConversationTest

conv = ConversationTest(agent=my_agent)

conv.say("Hi, my name is Alex")
conv.say("I placed order #5678 last week")
response = conv.say("When will it arrive?")

# Does it remember context from earlier turns?
assert response.recalls("order")
assert response.recalls("Alex")

# Is tone consistent across the whole conversation?
assert response.consistent_tone_across_turns(threshold=0.6)

Drift detection (for CI)

Catch silent regressions when you update your model or prompts.

from llm_behave import DriftTest

# First run: save baseline
@DriftTest.baseline(save_as="support_refund_flow")
def get_baseline_output():
    return my_llm("I need a refund")

# Every CI run: compare the current output against the saved baseline
current_output = my_llm("I need a refund")
result = DriftTest.compare("support_refund_flow", current_output)
assert result.passed, f"Behavior drift detected: {result.details}"

pytest fixtures (auto-registered)

# These fixtures are available in any test file automatically

def test_with_mock(mock_provider, assert_llm):
    provider = mock_provider(responses=["I'll help with your refund right away."])
    output = provider.chat([{"role": "user", "content": "refund please"}])
    assert_llm(output).mentions("refund").tone("helpful")

def test_conversation(conversation):
    conv = conversation(responses=["Hello!", "Sure, I can help with that."])
    conv.say("Hi")
    response = conv.say("I need help")
    assert "help" in response.text.lower()

Providers

Built-in adapters for the major LLM providers (OpenAI, Anthropic, Ollama):

from llm_behave.providers.openai_adapter import OpenAIProvider
from llm_behave.providers.anthropic_adapter import AnthropicProvider
from llm_behave.providers.ollama_adapter import OllamaProvider
from llm_behave import MockProvider  # for tests, no API calls

# All providers have the same interface
provider = OpenAIProvider(model="gpt-4o-mini")
provider = AnthropicProvider(model="claude-haiku-4-5")
provider = OllamaProvider(model="llama3")

output = provider.chat([{"role": "user", "content": "Hello"}])
text, tool_calls = provider.chat_with_tools(messages, tools=my_tools)

Bring your own provider by subclassing LLMProvider:

from llm_behave.providers.base import LLMProvider

class MyProvider(LLMProvider):
    def chat(self, messages, **kwargs):
        ...
    def chat_with_tools(self, messages, tools, **kwargs):
        ...
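
For instance, a toy provider that always returns a canned reply could look like the sketch below. This is an illustration only: CannedProvider is a hypothetical name, it assumes LLMProvider needs no constructor arguments, and the return shapes are inferred from the provider calls above (chat() returns the response text, chat_with_tools() returns a (text, tool_calls) pair).

from llm_behave.providers.base import LLMProvider

class CannedProvider(LLMProvider):
    """Toy provider that returns a fixed reply, handy for wiring up tests."""

    def __init__(self, reply="I can help with that refund."):
        self.reply = reply

    def chat(self, messages, **kwargs):
        # Mirrors: output = provider.chat([{"role": "user", "content": "Hello"}])
        return self.reply

    def chat_with_tools(self, messages, tools, **kwargs):
        # Mirrors: text, tool_calls = provider.chat_with_tools(messages, tools=my_tools)
        return self.reply, []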

How it works

llm-behave uses all-MiniLM-L6-v2 — an 80MB sentence-transformer model that runs fully offline on CPU.

  • mentions() / not_mentions() — splits text into sentences, computes max cosine similarity between any sentence and your concept
  • tone() — batch-encodes input text against example sentences for each tone, returns max similarity
  • intent() — semantic similarity between output and your intent description
  • contradicts_turn() — NLI (Natural Language Inference) model detects logical contradictions

Models load lazily on first use and are cached for the rest of the test session. Import time stays fast.
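
To make the mechanism concrete, here is a rough sketch of the mentions() idea using sentence-transformers directly. This illustrates the technique, not llm-behave's actual code: the naive sentence splitting, the 0.5 threshold, and the roughly_mentions name are assumptions made for the example.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~80MB, runs on CPU

def roughly_mentions(output: str, concept: str, threshold: float = 0.5) -> bool:
    # Naive split into sentences; the library's real splitting may differ.
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    sentence_vecs = model.encode(sentences, convert_to_tensor=True)
    concept_vec = model.encode(concept, convert_to_tensor=True)
    # Max cosine similarity between the concept and any sentence.
    return float(util.cos_sim(concept_vec, sentence_vecs).max()) >= threshold

roughly_mentions("You can get your money back within 30 days.", "refund")   # likely True
roughly_mentions("Your package ships tomorrow.", "refund")                   # likely False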


Performance

Measured after model warmup (model loads once per test session):

Assertion            Time
mentions()           ~32ms
tone()               ~40ms
intent()             ~32ms
4-assertion chain    ~350ms

pytest markers

import pytest

@pytest.mark.behavioral
def test_refund_flow():
    ...

@pytest.mark.drift
def test_no_regression():
    ...

Run only behavioral tests:

pytest -m behavioral
pytest -m drift

Full install options

pip install llm-behave                          # core only
pip install llm-behave[semantic]                # + sentence-transformers + torch
pip install llm-behave[openai]                  # + openai SDK
pip install llm-behave[anthropic]               # + anthropic SDK
pip install llm-behave[ollama]                  # + ollama SDK
pip install llm-behave[all]                     # everything

Requirements

  • Python 3.10+
  • pytest 7.0+
  • For semantic assertions: pip install llm-behave[semantic]

License

MIT — free to use in personal and commercial projects.


Author

Built by Swanand Potnis — Pune, India.
