Turnwise

A modern Python library for evaluating multi-turn chatbot conversations. Turnwise provides a declarative and lightweight approach to testing conversational AI systems with composable metrics and structured evaluation results.

Features

  • Multi-turn Conversation Support: Evaluate complete conversation flows, not just single interactions
  • Composable Metrics: Mix and match evaluation metrics to create custom evaluation suites
  • Declarative API: Define conversations and metrics, then evaluate with a single Evaluator call
  • Structured Results: Get detailed, structured evaluation reports with scores and pass/fail status
  • Batch Evaluation: Evaluate multiple conversations at once with summary statistics
  • Dataset Management: Organize and filter conversation datasets for evaluation
  • CLI Interface: Run evaluations from the command line
  • LLM-based Metrics: Score helpfulness, quality, and safety using the OpenAI API
  • Extensible: Easy to create custom metrics for specific use cases
  • Parallel Processing: Fast evaluation with configurable parallel execution

Installation

pip install turnwise

Or install from source:

git clone https://github.com/0xHericles/turnwise.git
cd turnwise
pip install -e .

Quick Start

Basic Usage

from turnwise import (
    Turn, Conversation, Role, Evaluator,
    ConversationLengthMetric, ResponseRelevanceMetric,
    ConversationCoherenceMetric
)

# Create a conversation
conversation = Conversation(
    turns=[
        Turn(role=Role.USER, content="What is machine learning?"),
        Turn(role=Role.ASSISTANT, content="Machine learning is a subset of AI that enables computers to learn from data."),
        Turn(role=Role.USER, content="Can you give me an example?"),
        Turn(role=Role.ASSISTANT, content="Sure! Image recognition is a common example of machine learning.")
    ]
)

# Create evaluator with metrics
evaluator = Evaluator(metrics=[
    ConversationLengthMetric(min_turns=2, max_turns=10),
    ResponseRelevanceMetric(threshold=0.6),
    ConversationCoherenceMetric(threshold=0.7)
])

# Run evaluation
report = evaluator.run(conversation)

print(f"Overall Score: {report.overall_score:.3f}")
print(f"Status: {'PASSED' if report.passed else 'FAILED'}")

for result in report.results:
    status = "✓" if result.passed else "✗"
    print(f"{status} {result.metric_name}: {result.score:.3f}")

Batch Evaluation

from turnwise import (
    Evaluator, create_summary_report, ConversationDataset,
    ConversationLengthMetric, ResponseRelevanceMetric
)

# Create dataset from conversation data
conversation_data = [
    {
        "turns": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there!"}
        ],
        "metadata": {"session_id": "001"}
    }
]

dataset = ConversationDataset(conversation_data)

# Create evaluator with metrics
evaluator = Evaluator(metrics=[
    ConversationLengthMetric(min_turns=1, max_turns=10),
    ResponseRelevanceMetric(threshold=0.6)
])

# Run batch evaluation
reports = evaluator.batch(dataset)

# Create summary report
summary = create_summary_report(reports)
print(f"Overall Pass Rate: {summary['overall_pass_rate']:.1%}")
print(f"Average Score: {summary['average_overall_score']:.3f}")
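
To make the two summary fields concrete, here is a minimal stdlib-only sketch of computing a pass rate and an average score over a list of reports. The Report dataclass and the summarize function are hypothetical stand-ins, not Turnwise's implementation; only the two dictionary keys mirror the example above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Report:
    """Hypothetical stand-in for an evaluation report (not Turnwise's class)."""
    overall_score: float
    passed: bool


def summarize(reports: list[Report]) -> dict[str, float]:
    """Aggregate per-conversation reports into summary statistics."""
    if not reports:
        # Avoid dividing by zero on an empty batch.
        return {"overall_pass_rate": 0.0, "average_overall_score": 0.0}
    return {
        "overall_pass_rate": sum(r.passed for r in reports) / len(reports),
        "average_overall_score": mean(r.overall_score for r in reports),
    }


reports = [Report(0.9, True), Report(0.4, False), Report(0.8, True), Report(0.7, True)]
summary = summarize(reports)
print(f"Overall Pass Rate: {summary['overall_pass_rate']:.1%}")
print(f"Average Score: {summary['average_overall_score']:.3f}")
```

The real create_summary_report may return additional fields or weight conversations differently.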

LLM-based Evaluation

from turnwise import Evaluator, HelpfulnessMetric, QualityMetric, SafetyMetric

# Create evaluator with LLM-based metrics and API key
evaluator = Evaluator(
    metrics=[
        HelpfulnessMetric(),
        QualityMetric(),
        SafetyMetric()
    ],
    openai_api_key="your-api-key"
)

# `conversation` is the Conversation object built in Basic Usage
report = evaluator.run(conversation)

Built-in Metrics

Conversation Metrics

ConversationLengthMetric

Evaluates if conversations have appropriate length.

ConversationLengthMetric(min_turns=1, max_turns=20, threshold=0.5)
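
This README does not spell out how the length score is computed. As a rough illustration only, the sketch below assumes a binary score (1.0 when the turn count lies between min_turns and max_turns inclusive, 0.0 otherwise) that is then compared against the threshold. Both length_score and passes are hypothetical helpers, and Turnwise's actual scoring may differ.

```python
def length_score(num_turns: int, min_turns: int = 1, max_turns: int = 20) -> float:
    """Return 1.0 if the turn count is within bounds, else 0.0 (assumed scoring)."""
    return 1.0 if min_turns <= num_turns <= max_turns else 0.0


def passes(score: float, threshold: float = 0.5) -> bool:
    """A metric passes when its score reaches the threshold (assumed rule)."""
    return score >= threshold


for turns in (0, 4, 25):
    score = length_score(turns)
    print(turns, score, passes(score))
```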

ConversationCoherenceMetric

Evaluates overall conversation coherence and flow.

ConversationCoherenceMetric(threshold=0.7)

Response Metrics

ResponseRelevanceMetric

Evaluates if assistant responses are relevant to user inputs.

ResponseRelevanceMetric(threshold=0.6)

LLM-based Metrics

HelpfulnessMetric

Uses the OpenAI API to evaluate how helpful assistant responses are.

HelpfulnessMetric(threshold=0.7)  # API key configured at evaluation level

QualityMetric

Evaluates overall conversation quality using an LLM.

QualityMetric(threshold=0.8)  # API key configured at evaluation level

SafetyMetric

Assesses conversation safety and identifies potential concerns.

SafetyMetric(threshold=0.9)  # API key configured at evaluation level

Command Line Interface

Install the CLI

pip install turnwise

Evaluate a single conversation

turnwise evaluate examples/sample_conversation.json -m length -m relevance

Batch evaluate multiple conversations

turnwise batch-evaluate examples/ -m length -m coherence -o results.json

List available metrics

turnwise list-metrics

Conversation JSON format

{
  "turns": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"}
  ],
  "metadata": {"session_id": "001"}
}

For batch evaluation, use an array of conversation objects:

[
  {
    "turns": [{"role": "user", "content": "Hello"}],
    "metadata": {"session_id": "001"}
  },
  {
    "turns": [{"role": "user", "content": "Hi"}],
    "metadata": {"session_id": "002"}
  }
]
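
Before passing a file to the CLI, it can help to check that it matches one of the two shapes above. The following is a minimal sketch using only the standard library; load_conversations is a hypothetical helper, and the set of valid roles is an assumption based on the two roles shown in this README (Turnwise may accept others).

```python
import json

VALID_ROLES = {"user", "assistant"}  # assumption: only the roles shown above


def load_conversations(text: str) -> list[dict]:
    """Parse a JSON document holding one conversation or an array of them."""
    data = json.loads(text)
    # Normalize the single-object form to a one-element list.
    conversations = data if isinstance(data, list) else [data]
    for index, conv in enumerate(conversations):
        turns = conv.get("turns") if isinstance(conv, dict) else None
        if not isinstance(turns, list) or not turns:
            raise ValueError(f"conversation {index}: 'turns' must be a non-empty list")
        for turn in turns:
            if not isinstance(turn, dict) or turn.get("role") not in VALID_ROLES:
                raise ValueError(f"conversation {index}: invalid turn {turn!r}")
            if not isinstance(turn.get("content"), str):
                raise ValueError(f"conversation {index}: 'content' must be a string")
    return conversations


single = '{"turns": [{"role": "user", "content": "Hello"}], "metadata": {"session_id": "001"}}'
batch = '[{"turns": [{"role": "user", "content": "Hello"}]}, {"turns": [{"role": "user", "content": "Hi"}]}]'
print(len(load_conversations(single)), len(load_conversations(batch)))
```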

Dataset Management

from turnwise import ConversationDataset

# Create dataset (conversation_data is the list of dicts from Batch Evaluation)
dataset = ConversationDataset(conversation_data)

# Filter conversations by length
short_conversations = dataset.filter_length(min_turns=1, max_turns=3)

# Filter by user turn ratio
balanced_conversations = dataset.filter_ratio(min_user_ratio=0.3, max_user_ratio=0.7)

# Get dataset statistics
stats = dataset.stats()
print(f"Total conversations: {stats['total_conversations']}")
print(f"Average turns: {stats['average_turns_per_conversation']:.1f}")
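
To show what the two filters select, here is a plain-Python sketch that works on conversation dicts directly. It assumes that bounds are inclusive and that the user ratio is the number of user turns divided by the total number of turns; neither detail is confirmed by this README, and all function names below are hypothetical.

```python
def user_ratio(conversation: dict) -> float:
    """Fraction of turns in a conversation that were written by the user."""
    turns = conversation["turns"]
    if not turns:
        return 0.0
    return sum(t["role"] == "user" for t in turns) / len(turns)


def filter_by_length(conversations, min_turns, max_turns):
    """Keep conversations whose turn count is within [min_turns, max_turns]."""
    return [c for c in conversations if min_turns <= len(c["turns"]) <= max_turns]


def filter_by_user_ratio(conversations, min_user_ratio, max_user_ratio):
    """Keep conversations whose user-turn ratio is within the given bounds."""
    return [c for c in conversations if min_user_ratio <= user_ratio(c) <= max_user_ratio]


def alternating(n: int) -> dict:
    """Build an n-turn conversation that alternates user and assistant turns."""
    roles = ("user", "assistant")
    return {"turns": [{"role": roles[i % 2], "content": f"turn {i}"} for i in range(n)]}


data = [alternating(2), alternating(1), alternating(5)]
short = filter_by_length(data, min_turns=1, max_turns=3)
balanced = filter_by_user_ratio(data, min_user_ratio=0.3, max_user_ratio=0.7)
average_turns = sum(len(c["turns"]) for c in data) / len(data)
print(len(short), len(balanced), f"{average_turns:.1f}")
```

Here the single-turn conversation is dropped by the ratio filter because every turn in it belongs to the user.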

Creating Custom Metrics

from turnwise import Metric, EvaluationResult

class CustomMetric(Metric):
    def __init__(self, name="custom_metric", threshold=0.5):
        super().__init__(name, threshold)
    
    def evaluate(self, conversation):
        # Your evaluation logic here
        score = 0.8  # Calculate your score
        details = {"custom_info": "value"}
        
        return self._create_result(score, details)
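
As a fuller illustration of the pattern, the sketch below implements a hypothetical ConcisenessMetric that scores the share of assistant turns within a word budget. So that it runs without Turnwise installed, it defines small stand-in Metric and Result classes that mirror the interface used above; the real base class, its result type, and the pass rule may differ, and this version takes a plain list of turn dicts rather than a Conversation.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Stand-in for an evaluation result; Turnwise's real class may differ."""
    metric_name: str
    score: float
    passed: bool
    details: dict = field(default_factory=dict)


class Metric:
    """Stand-in base class mirroring the interface used in the example above."""

    def __init__(self, name: str, threshold: float):
        self.name = name
        self.threshold = threshold

    def _create_result(self, score: float, details: dict) -> Result:
        # Assumption: a metric passes when its score reaches the threshold.
        return Result(self.name, score, score >= self.threshold, details)


class ConcisenessMetric(Metric):
    """Hypothetical metric: share of assistant turns within a word budget."""

    def __init__(self, max_words: int = 30, threshold: float = 0.8):
        super().__init__("conciseness", threshold)
        self.max_words = max_words

    def evaluate(self, turns: list[dict]) -> Result:
        replies = [t["content"] for t in turns if t["role"] == "assistant"]
        if not replies:
            return self._create_result(0.0, {"assistant_turns": 0})
        concise = sum(len(r.split()) <= self.max_words for r in replies)
        details = {"assistant_turns": len(replies), "concise_turns": concise}
        return self._create_result(concise / len(replies), details)


turns = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "A subset of AI that learns patterns from data."},
    {"role": "user", "content": "Can you give me an example?"},
    {"role": "assistant", "content": "Sure! " + "word " * 40},
]
result = ConcisenessMetric().evaluate(turns)
print(result.metric_name, result.score, result.passed)
```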

Examples

See the examples/ directory for complete examples:

  • basic_usage.py - Simple conversation evaluation
  • advanced_usage.py - Advanced features and custom metrics
  • batch_evaluation.py - Batch evaluation with summary reports
  • cli_usage.py - CLI usage examples
  • dataframe_like_api.py - DataFrame-like API usage
  • llm_evaluation.py - LLM-based evaluation example

Configuration

API Key Management

API keys are configured once on the Evaluator, not on each metric. The key is set in a single place and does not have to be repeated in every metric definition.

Environment Variables (Recommended)

Set your API keys as environment variables:

export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"

Or create a .env file:

OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
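
If you read the key yourself before constructing an Evaluator, a small helper can fail early with a clear message. This is a stdlib-only sketch; get_api_key and mask are hypothetical names, not part of Turnwise. Note that Python does not read a .env file on its own: unless Turnwise loads it for you (which this README does not state), a loader such as python-dotenv is needed to populate the environment.

```python
import os


def get_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the named API key from the environment or raise a clear error."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return key


def mask(key: str) -> str:
    """Mask a key for logging so the full secret never reaches the output."""
    return key[:4] + "..." if len(key) > 8 else "***"
```

A typical call would be mask(get_api_key()) when logging which key is in use.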

Programmatic Configuration

Pass API keys directly to the Evaluator:

from turnwise import Evaluator, HelpfulnessMetric, QualityMetric

# Create evaluator with metrics and API keys
evaluator = Evaluator(
    metrics=[HelpfulnessMetric(), QualityMetric()],
    openai_api_key="your-key",
    anthropic_api_key="your-key"
)

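# Single evaluation (conversation: a Conversation, as in Basic Usage)
report = evaluator.run(conversation)

# Batch evaluation (conversations: e.g. a ConversationDataset, as in Batch Evaluation)
reports = evaluator.batch(conversations)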

Configuration File

Create a turnwise.yaml configuration file:

evaluation:
  max_workers: 4
  parallel: true

metrics:
  length:
    min_turns: 1
    max_turns: 20
  relevance:
    threshold: 0.6
  coherence:
    threshold: 0.7

llm:
  api_key: ${OPENAI_API_KEY}
  model: gpt-3.5-turbo
  temperature: 0.0
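
The ${OPENAI_API_KEY} placeholder relies on environment-variable expansion, which plain YAML parsers do not perform. This README does not say whether Turnwise expands it when loading the file. The sketch below shows one way such expansion can be done on an already parsed config, using string.Template from the standard library; expand_env is a hypothetical helper, and the dict stands in for the parsed llm section.

```python
import os
from string import Template


def expand_env(value, env=None):
    """Recursively substitute ${NAME} placeholders in strings inside a config tree."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # safe_substitute leaves unknown placeholders untouched instead of raising.
        return Template(value).safe_substitute(env)
    if isinstance(value, dict):
        return {k: expand_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, env) for v in value]
    return value


# The same structure as the llm section above, written as a Python dict.
config = {"llm": {"api_key": "${OPENAI_API_KEY}", "model": "gpt-3.5-turbo", "temperature": 0.0}}
resolved = expand_env(config, env={"OPENAI_API_KEY": "sk-example"})
print(resolved["llm"]["api_key"])
```

Passing env explicitly keeps the example deterministic; omitting it falls back to os.environ.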

Development

Setup

git clone https://github.com/your-username/turnwise.git
cd turnwise
uv sync

Running Tests

make test
# or
uv run pytest

Code Quality

make lint
make format
# or
uv run ruff check .
uv run ruff format .
uv run mypy src/

Available Make Commands

make help          # Show all available commands
make install       # Install dependencies
make test          # Run tests
make lint          # Run linter
make format        # Format code
make clean         # Clean build artifacts
make run-example   # Run basic example
make run-llm-example # Run LLM evaluation example

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built with Pydantic AI for LLM integration
  • Uses Ruff for fast linting and formatting
  • Inspired by modern evaluation frameworks for conversational AI
