llm-behave

Behavioral testing for LLM applications. A pytest plugin.

No LLM judge. No API cost. Offline models. Works with any provider.

from llm_behave import assert_behavior

def test_support_bot():
    output = my_support_bot("I want a refund for order 1234")

    assert_behavior(output) \
        .mentions("refund policy") \
        .tone("empathetic") \
        .not_mentions("competitor")

Why llm-behave?

Most LLM testing tools do one of the following:

  • Use another LLM to judge the output (expensive, slow, circular)
  • Only do exact string matching (misses semantic meaning)
  • Don't support multi-turn conversations at all

llm-behave uses small offline transformer models (80MB, runs on CPU) to understand meaning — no API calls, no cost, no internet required during tests.


Install

# Core (tool call assertions, structure tests — no ML deps)
pip install llm-behave

# Full (semantic assertions: mentions, tone, intent, drift)
pip install llm-behave[semantic]

Features at a glance

Feature            What it does
mentions()         Semantic similarity — "money back" matches "refund"
not_mentions()     Assert a topic is NOT brought up
tone()             Detect empathetic / professional / rude / helpful, etc.
intent()           Does the response intend to help? refuse? apologize?
calls_tool()       Assert which tool the LLM called
ConversationTest   Multi-turn testing with memory and contradiction detection
DriftTest          Save baseline behavior and detect regressions in CI

Usage

Basic assertions

from llm_behave import assert_behavior

output = my_llm("I want a refund")

# Semantic match — not exact string
assert_behavior(output).mentions("refund policy")

# Tone detection
assert_behavior(output).tone("empathetic")
assert_behavior(output).tone("professional", threshold=0.6)

# Intent
assert_behavior(output).intent("offering to help the customer")

# Negative assertions
assert_behavior(output).not_mentions("competitor")

# Fluent chaining
assert_behavior(output) \
    .mentions("refund") \
    .tone("empathetic") \
    .not_mentions("competitor")

Tool call assertions

text, tool_calls = my_llm.chat_with_tools(messages, tools=my_tools)

assert_behavior(text, tool_calls) \
    .calls_tool("lookup_order") \
    .mentions("order")

Multi-turn conversation testing

from llm_behave import ConversationTest

conv = ConversationTest(agent=my_agent)

conv.say("Hi, my name is Alex")
conv.say("I placed order #5678 last week")
response = conv.say("When will it arrive?")

# Does it remember context from earlier turns?
assert response.recalls("order")
assert response.recalls("Alex")

# Is tone consistent across the whole conversation?
assert response.consistent_tone_across_turns(threshold=0.6)

Drift detection (for CI)

Catch silent regressions when you update your model or prompts.

from llm_behave import DriftTest

# First run: save baseline
@DriftTest.baseline(save_as="support_refund_flow")
def get_baseline_output():
    return my_llm("I need a refund")

# Every CI run: compare the current output against the saved baseline
current_output = my_llm("I need a refund")
result = DriftTest.compare("support_refund_flow", current_output)
assert result.passed, f"Behavior drift detected: {result.details}"

pytest fixtures (auto-registered)

# These fixtures are available in any test file automatically

def test_with_mock(mock_provider, assert_llm):
    provider = mock_provider(responses=["I'll help with your refund right away."])
    output = provider.chat([{"role": "user", "content": "refund please"}])
    assert_llm(output).mentions("refund").tone("helpful")

def test_conversation(conversation):
    conv = conversation(responses=["Hello!", "Sure, I can help with that."])
    conv.say("Hi")
    response = conv.say("I need help")
    assert "help" in response.text.lower()

Providers

Built-in adapters for the major LLM providers (OpenAI, Anthropic, Ollama):

from llm_behave.providers.openai_adapter import OpenAIProvider
from llm_behave.providers.anthropic_adapter import AnthropicProvider
from llm_behave.providers.ollama_adapter import OllamaProvider
from llm_behave import MockProvider  # for tests, no API calls

# All providers have the same interface
provider = OpenAIProvider(model="gpt-4o-mini")
provider = AnthropicProvider(model="claude-haiku-4-5")
provider = OllamaProvider(model="llama3")

output = provider.chat([{"role": "user", "content": "Hello"}])
text, tool_calls = provider.chat_with_tools(messages, tools=my_tools)

Bring your own provider by subclassing LLMProvider:

from llm_behave.providers.base import LLMProvider

class MyProvider(LLMProvider):
    def chat(self, messages, **kwargs):
        ...
    def chat_with_tools(self, messages, tools, **kwargs):
        ...
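
For instance, a toy provider that always returns a canned reply could look like the sketch below. This is an illustration only: CannedProvider is a hypothetical name, it assumes LLMProvider needs no constructor arguments, and the return shapes are inferred from the provider calls above (chat() returns the response text, chat_with_tools() returns a (text, tool_calls) pair).

from llm_behave.providers.base import LLMProvider

class CannedProvider(LLMProvider):
    """Toy provider that returns a fixed reply, handy for wiring up tests."""

    def __init__(self, reply="I can help with that refund."):
        self.reply = reply

    def chat(self, messages, **kwargs):
        # Mirrors: output = provider.chat([{"role": "user", "content": "Hello"}])
        return self.reply

    def chat_with_tools(self, messages, tools, **kwargs):
        # Mirrors: text, tool_calls = provider.chat_with_tools(messages, tools=my_tools)
        return self.reply, []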

How it works

llm-behave uses all-MiniLM-L6-v2 — an 80MB sentence-transformer model that runs fully offline on CPU.

  • mentions() / not_mentions() — splits text into sentences, computes max cosine similarity between any sentence and your concept
  • tone() — batch-encodes input text against example sentences for each tone, returns max similarity
  • intent() — semantic similarity between output and your intent description
  • contradicts_turn() — NLI (Natural Language Inference) model detects logical contradictions

Models load lazily on first use and are cached for the rest of the test session. Import time stays fast.
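
To make the mechanism concrete, here is a rough sketch of the mentions() idea using sentence-transformers directly. This illustrates the technique, not llm-behave's actual code: the naive sentence splitting, the 0.5 threshold, and the roughly_mentions name are assumptions made for the example.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~80MB, runs on CPU

def roughly_mentions(output: str, concept: str, threshold: float = 0.5) -> bool:
    # Naive split into sentences; the library's real splitting may differ.
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    sentence_vecs = model.encode(sentences, convert_to_tensor=True)
    concept_vec = model.encode(concept, convert_to_tensor=True)
    # Max cosine similarity between the concept and any sentence.
    return float(util.cos_sim(concept_vec, sentence_vecs).max()) >= threshold

roughly_mentions("You can get your money back within 30 days.", "refund")   # likely True
roughly_mentions("Your package ships tomorrow.", "refund")                   # likely False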


Performance

Measured after model warmup (model loads once per test session):

Assertion            Time
mentions()           ~32ms
tone()               ~40ms
intent()             ~32ms
4-assertion chain    ~350ms

pytest markers

import pytest

@pytest.mark.behavioral
def test_refund_flow():
    ...

@pytest.mark.drift
def test_no_regression():
    ...

Run only behavioral tests:

pytest -m behavioral
pytest -m drift

Full install options

pip install llm-behave                          # core only
pip install llm-behave[semantic]                # + sentence-transformers + torch
pip install llm-behave[openai]                  # + openai SDK
pip install llm-behave[anthropic]               # + anthropic SDK
pip install llm-behave[ollama]                  # + ollama SDK
pip install llm-behave[all]                     # everything

Requirements

  • Python 3.10+
  • pytest 7.0+
  • For semantic assertions: pip install llm-behave[semantic]

License

MIT — free to use in personal and commercial projects.


Author

Built by Swanand Potnis — Pune, India.
