Turnwise
A modern Python library for evaluating multi-turn chatbot conversations. Turnwise provides a declarative and lightweight approach to testing conversational AI systems with composable metrics and structured evaluation results.
Features
- Multi-turn Conversation Support: Evaluate complete conversation flows, not just single interactions
- Composable Metrics: Mix and match evaluation metrics to create custom evaluation suites
- Declarative API: Define conversations and metrics, then evaluate with a simple function call
- Structured Results: Get detailed, structured evaluation reports with scores and pass/fail status
- Batch Evaluation: Evaluate multiple conversations at once with summary statistics
- Dataset Management: Organize and filter conversation datasets for evaluation
- CLI Interface: Run evaluations from the command line
- LLM-based Metrics: Use the OpenAI API for advanced evaluation metrics
- Extensible: Easy to create custom metrics for specific use cases
- Parallel Processing: Fast evaluation with configurable parallel execution
Installation
pip install turnwise
Or install from source:
git clone https://github.com/0xHericles/turnwise.git
cd turnwise
pip install -e .
Quick Start
Basic Usage
from turnwise import (
    Turn, Conversation, Role, Evaluator,
    ConversationLengthMetric, ResponseRelevanceMetric,
    ConversationCoherenceMetric
)
# Create a conversation
conversation = Conversation(
    turns=[
        Turn(role=Role.USER, content="What is machine learning?"),
        Turn(role=Role.ASSISTANT, content="Machine learning is a subset of AI that enables computers to learn from data."),
        Turn(role=Role.USER, content="Can you give me an example?"),
        Turn(role=Role.ASSISTANT, content="Sure! Image recognition is a common example of machine learning.")
    ]
)
# Create evaluator with metrics
evaluator = Evaluator(metrics=[
    ConversationLengthMetric(min_turns=2, max_turns=10),
    ResponseRelevanceMetric(threshold=0.6),
    ConversationCoherenceMetric(threshold=0.7)
])
# Run evaluation
report = evaluator.run(conversation)
print(f"Overall Score: {report.overall_score:.3f}")
print(f"Status: {'PASSED' if report.passed else 'FAILED'}")
for result in report.results:
    status = "✓" if result.passed else "✗"
    print(f"{status} {result.metric_name}: {result.score:.3f}")
Batch Evaluation
from turnwise import (
    Evaluator, ConversationDataset, create_summary_report,
    ConversationLengthMetric, ResponseRelevanceMetric
)
# Create dataset from conversation data
conversation_data = [
    {
        "turns": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there!"}
        ],
        "metadata": {"session_id": "001"}
    }
]
dataset = ConversationDataset(conversation_data)
# Create evaluator with metrics
evaluator = Evaluator(metrics=[
    ConversationLengthMetric(min_turns=1, max_turns=10),
    ResponseRelevanceMetric(threshold=0.6)
])
# Run batch evaluation
reports = evaluator.batch(dataset)
# Create summary report
summary = create_summary_report(reports)
print(f"Overall Pass Rate: {summary['overall_pass_rate']:.1%}")
print(f"Average Score: {summary['average_overall_score']:.3f}")
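For intuition about the two summary figures printed above, here is a minimal, library-independent sketch of how a pass rate and an average score can be derived from per-conversation results. The FakeReport class and the summarize function are hypothetical stand-ins written for this illustration; Turnwise's own create_summary_report may compute or name its fields differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class FakeReport:
    """Hypothetical stand-in for an evaluation report (not a Turnwise class)."""
    overall_score: float
    passed: bool


def summarize(reports):
    """Return the pass rate and mean score; both are 0.0 for an empty input."""
    if not reports:
        return {"overall_pass_rate": 0.0, "average_overall_score": 0.0}
    return {
        # True counts as 1, so the sum is the number of passing reports.
        "overall_pass_rate": sum(r.passed for r in reports) / len(reports),
        "average_overall_score": mean(r.overall_score for r in reports),
    }


reports = [
    FakeReport(0.9, True),
    FakeReport(0.4, False),
    FakeReport(0.8, True),
    FakeReport(0.7, True),
]
summary = summarize(reports)
print(f"Overall Pass Rate: {summary['overall_pass_rate']:.1%}")
print(f"Average Score: {summary['average_overall_score']:.3f}")
```

Three of the four sample reports pass, so the pass rate here is 75%.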
LLM-based Evaluation
from turnwise import Evaluator, HelpfulnessMetric, QualityMetric, SafetyMetric
# Create evaluator with LLM-based metrics and API key
evaluator = Evaluator(
    metrics=[
        HelpfulnessMetric(),
        QualityMetric(),
        SafetyMetric()
    ],
    openai_api_key="your-api-key"
)
report = evaluator.run(conversation)
Built-in Metrics
Conversation Metrics
ConversationLengthMetric
Evaluates whether a conversation's turn count falls within the configured range (min_turns to max_turns).
ConversationLengthMetric(min_turns=1, max_turns=20, threshold=0.5)
ConversationCoherenceMetric
Evaluates overall conversation coherence and flow.
ConversationCoherenceMetric(threshold=0.7)
Response Metrics
ResponseRelevanceMetric
Evaluates if assistant responses are relevant to user inputs.
ResponseRelevanceMetric(threshold=0.6)
LLM-based Metrics
HelpfulnessMetric
Uses the OpenAI API to evaluate how helpful assistant responses are.
HelpfulnessMetric(threshold=0.7) # API key configured at evaluation level
QualityMetric
Evaluates overall conversation quality using an LLM.
QualityMetric(threshold=0.8) # API key configured at evaluation level
SafetyMetric
Assesses conversation safety and identifies potential concerns.
SafetyMetric(threshold=0.9) # API key configured at evaluation level
Command Line Interface
Install the CLI
pip install turnwise
Evaluate a single conversation
turnwise evaluate examples/sample_conversation.json -m length -m relevance
Batch evaluate multiple conversations
turnwise batch-evaluate examples/ -m length -m coherence -o results.json
List available metrics
turnwise list-metrics
Conversation JSON format
{
  "turns": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"}
  ],
  "metadata": {"session_id": "001"}
}
For batch evaluation, use an array of conversation objects:
[
  {
    "turns": [{"role": "user", "content": "Hello"}],
    "metadata": {"session_id": "001"}
  },
  {
    "turns": [{"role": "user", "content": "Hi"}],
    "metadata": {"session_id": "002"}
  }
]
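Input files in either layout can be sanity-checked before they are handed to the CLI. The sketch below uses only the standard library; load_conversations is a hypothetical helper, and its validation rules (a non-empty turns list, string role and content) are assumptions drawn from the samples above rather than Turnwise's actual schema.

```python
import json


def load_conversations(text):
    """Parse JSON text holding one conversation object or an array of them.

    Always returns a list so single and batch inputs can be handled alike.
    """
    data = json.loads(text)
    conversations = data if isinstance(data, list) else [data]
    for index, conv in enumerate(conversations):
        if not isinstance(conv, dict):
            raise ValueError(f"conversation {index}: expected a JSON object")
        turns = conv.get("turns")
        if not isinstance(turns, list) or not turns:
            raise ValueError(f"conversation {index}: 'turns' must be a non-empty list")
        for turn in turns:
            valid = (
                isinstance(turn, dict)
                and isinstance(turn.get("role"), str)
                and isinstance(turn.get("content"), str)
            )
            if not valid:
                raise ValueError(
                    f"conversation {index}: each turn needs string 'role' and 'content'"
                )
    return conversations


single = '{"turns": [{"role": "user", "content": "Hello"}], "metadata": {"session_id": "001"}}'
batch = "[" + single + ", " + single.replace("001", "002") + "]"

print(len(load_conversations(single)))
print([c["metadata"]["session_id"] for c in load_conversations(batch)])
```

Normalizing both layouts to a list keeps downstream code free of special cases for the single-conversation file.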
Dataset Management
from turnwise import ConversationDataset
# Create dataset
dataset = ConversationDataset(conversation_data)
# Filter conversations by length
short_conversations = dataset.filter_length(min_turns=1, max_turns=3)
# Filter by user turn ratio
balanced_conversations = dataset.filter_ratio(min_user_ratio=0.3, max_user_ratio=0.7)
# Get dataset statistics
stats = dataset.stats()
print(f"Total conversations: {stats['total_conversations']}")
print(f"Average turns: {stats['average_turns_per_conversation']:.1f}")
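To make the two filters concrete, here is a small sketch of the same ideas over plain dictionaries in the conversation JSON format. The helper names are hypothetical, and treating both bounds as inclusive is an assumption; ConversationDataset may behave differently at the boundaries.

```python
def user_ratio(conversation):
    """Fraction of turns whose role is "user" (0.0 for an empty conversation)."""
    turns = conversation["turns"]
    if not turns:
        return 0.0
    return sum(t["role"] == "user" for t in turns) / len(turns)


def filter_by_length(conversations, min_turns, max_turns):
    """Keep conversations whose turn count lies in [min_turns, max_turns]."""
    return [c for c in conversations if min_turns <= len(c["turns"]) <= max_turns]


def filter_by_user_ratio(conversations, min_user_ratio, max_user_ratio):
    """Keep conversations whose user-turn ratio lies in the given range."""
    return [
        c for c in conversations
        if min_user_ratio <= user_ratio(c) <= max_user_ratio
    ]


balanced = {"turns": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]}
user_heavy = {"turns": [
    {"role": "user", "content": "Hello"},
    {"role": "user", "content": "Anyone there?"},
    {"role": "user", "content": "Hello??"},
    {"role": "assistant", "content": "Sorry for the delay!"},
]}
data = [balanced, user_heavy]

short = filter_by_length(data, min_turns=1, max_turns=3)
even = filter_by_user_ratio(data, min_user_ratio=0.3, max_user_ratio=0.7)
print(len(short), len(even))
```

The second conversation has four turns and a user ratio of 0.75, so both filters drop it.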
Creating Custom Metrics
from turnwise import Metric, EvaluationResult
class CustomMetric(Metric):
    def __init__(self, name="custom_metric", threshold=0.5):
        super().__init__(name, threshold)

    def evaluate(self, conversation):
        # Your evaluation logic here
        score = 0.8  # Calculate your score
        details = {"custom_info": "value"}
        return self._create_result(score, details)
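As a more concrete illustration of the skeleton above, the following sketch scores a conversation by the share of user turns that are directly followed by an assistant turn. So that it runs without Turnwise installed, it defines small stand-ins (Turn, Conversation, Result, Metric) that mirror only the interface used in the skeleton: a constructor taking a name and a threshold, and _create_result(score, details). These stand-ins, the string roles, and the pass rule score >= threshold are all assumptions, not the library's real classes.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:  # stand-in, not turnwise.Turn
    role: str
    content: str


@dataclass
class Conversation:  # stand-in, not turnwise.Conversation
    turns: list


@dataclass
class Result:  # stand-in for an evaluation result
    metric_name: str
    score: float
    passed: bool
    details: dict = field(default_factory=dict)


class Metric:  # stand-in base class mirroring the skeleton's interface
    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold

    def _create_result(self, score, details):
        # Assumption: a metric passes when its score reaches the threshold.
        return Result(self.name, score, score >= self.threshold, details)


class AnsweredTurnsMetric(Metric):
    """Score = share of user turns directly followed by an assistant turn."""

    def __init__(self, name="answered_turns", threshold=0.5):
        super().__init__(name, threshold)

    def evaluate(self, conversation):
        turns = conversation.turns
        user_indexes = [i for i, t in enumerate(turns) if t.role == "user"]
        answered = sum(
            1 for i in user_indexes
            if i + 1 < len(turns) and turns[i + 1].role == "assistant"
        )
        score = answered / len(user_indexes) if user_indexes else 0.0
        details = {"user_turns": len(user_indexes), "answered": answered}
        return self._create_result(score, details)


conversation = Conversation(turns=[
    Turn("user", "What is machine learning?"),
    Turn("assistant", "A way for computers to learn from data."),
    Turn("user", "Can you give me an example?"),
])
result = AnsweredTurnsMetric(threshold=0.5).evaluate(conversation)
print(result.metric_name, result.score, result.passed, result.details)
```

With the real library, only the AnsweredTurnsMetric class would be needed, adapted to Turnwise's own Role values and imported Metric base class.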
Examples
See the examples/ directory for complete examples:
- basic_usage.py - Simple conversation evaluation
- advanced_usage.py - Advanced features and custom metrics
- batch_evaluation.py - Batch evaluation with summary reports
- cli_usage.py - CLI usage examples
- dataframe_like_api.py - DataFrame-like API usage
- llm_evaluation.py - LLM-based evaluation example
Configuration
API Key Management
API keys are configured once at the evaluation level (on the Evaluator), not per metric, so every LLM-based metric shares the same credentials and no key has to be repeated in individual metric definitions.
Environment Variables (Recommended)
Set your API keys as environment variables:
export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
Or create a .env file:
OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
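In a script, it can help to fail early with a clear message when the variable is missing. The sketch below uses only the standard library; get_api_key is a hypothetical helper, not part of Turnwise. Note that Python does not read a .env file on its own, so the values in it are only visible here if something has loaded them into the process environment first.

```python
import os


def get_api_key(var_name="OPENAI_API_KEY"):
    """Return the key stored in `var_name`, or raise with a clear message."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it or load it from your .env file"
        )
    return key


# Demo only: supply a placeholder if no real key is present in the environment.
os.environ.setdefault("OPENAI_API_KEY", "your-openai-api-key")

api_key = get_api_key()
print(f"Loaded a key of length {len(api_key)}")
```

The returned value could then be passed as openai_api_key when constructing the Evaluator.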
Programmatic Configuration
Pass API keys directly to the Evaluator:
from turnwise import Evaluator, HelpfulnessMetric, QualityMetric
# Create evaluator with metrics and API keys
evaluator = Evaluator(
metrics=[HelpfulnessMetric(), QualityMetric()],
openai_api_key="your-key",
anthropic_api_key="your-key"
)
# Single evaluation
report = evaluator.run(conversation)
# Batch evaluation
reports = evaluator.batch(conversations)
Configuration File
Create a turnwise.yaml configuration file:
evaluation:
  max_workers: 4
  parallel: true
metrics:
  length:
    min_turns: 1
    max_turns: 20
  relevance:
    threshold: 0.6
  coherence:
    threshold: 0.7
llm:
  api_key: ${OPENAI_API_KEY}
  model: gpt-3.5-turbo
  temperature: 0.0
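The ${OPENAI_API_KEY} placeholder implies that the value comes from the environment. How Turnwise resolves it is not described here; one common approach, sketched below under that assumption, is to expand environment variables in the raw text before parsing it. The example uses only the standard library (os.path.expandvars) and leaves the YAML parsing itself out, since that would need a third-party parser. The unresolved_placeholders helper is hypothetical.

```python
import os
import re

raw_config = """\
llm:
  api_key: ${OPENAI_API_KEY}
  model: gpt-3.5-turbo
  temperature: 0.0
"""


def unresolved_placeholders(text):
    """Return the names of ${VAR} placeholders still present in `text`."""
    return re.findall(r"\$\{(\w+)\}", text)


# Demo only: supply a placeholder if no real key is present in the environment.
os.environ.setdefault("OPENAI_API_KEY", "your-openai-api-key")

# expandvars substitutes variables that are set and leaves unknown ones as-is.
expanded = os.path.expandvars(raw_config)

missing = unresolved_placeholders(expanded)
if missing:
    raise RuntimeError(f"Unset variables in config: {', '.join(missing)}")
print("All placeholders resolved")
```

Checking for leftover placeholders catches a missing variable before the literal text ${OPENAI_API_KEY} is sent anywhere as a key.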
Development
Setup
git clone https://github.com/your-username/turnwise.git
cd turnwise
uv sync
Running Tests
make test
# or
uv run pytest
Code Quality
make lint
make format
# or
uv run ruff check .
uv run ruff format .
uv run mypy src/
Available Make Commands
make help # Show all available commands
make install # Install dependencies
make test # Run tests
make lint # Run linter
make format # Format code
make clean # Clean build artifacts
make run-example # Run basic example
make run-llm-example # Run LLM evaluation example
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with Pydantic AI for LLM integration
- Uses Ruff for fast linting and formatting
- Inspired by modern evaluation frameworks for conversational AI
Download files
File details
Details for the file turnwise-0.1.0.tar.gz.
File metadata
- Download URL: turnwise-0.1.0.tar.gz
- Upload date:
- Size: 112.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6849ae038296266b3658045554f29deac50147373b44de2fca797efd74e93521 |
| MD5 | 7315ba52ca9b4084726a645bc27a6d32 |
| BLAKE2b-256 | 123a8af11da95632ea7fcc475acdd421f920e07c3f5410d73731766107a9f2bf |
File details
Details for the file turnwise-0.1.0-py3-none-any.whl.
File metadata
- Download URL: turnwise-0.1.0-py3-none-any.whl
- Upload date:
- Size: 26.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6deb940f1883d98c663ced234ebc89f256cba9ef76aeea640ba0e83bac8832d4 |
| MD5 | b87bb6fbea0ba006f657bcadb8cf4423 |
| BLAKE2b-256 | 23da2509cb6e4d5f9730ce4b3a6f707e67e246ebd7a2c231b35049a9d7a6befb |