Turnwise

A modern Python library for evaluating multi-turn chatbot conversations. Turnwise provides a declarative and lightweight approach to testing conversational AI systems with composable metrics and structured evaluation results.

Features

Multi-turn Conversation Support: Evaluate complete conversation flows, not just single interactions
Composable Metrics: Mix and match evaluation metrics to create custom evaluation suites
Declarative API: Define conversations and metrics, then evaluate with a simple function call
Structured Results: Get detailed, structured evaluation reports with scores and pass/fail status
Batch Evaluation: Evaluate multiple conversations at once with summary statistics
Dataset Management: Organize and filter conversation datasets for evaluation
CLI Interface: Run evaluations from the command line
LLM-based Metrics: Use the OpenAI API for advanced evaluation metrics
Extensible: Easy to create custom metrics for specific use cases
Parallel Processing: Fast evaluation with configurable parallel execution

Installation

pip install turnwise

Or install from source:

git clone https://github.com/0xHericles/turnwise.git
cd turnwise
pip install -e .

Quick Start

Basic Usage

from turnwise import (
    Turn, Conversation, Role, Evaluator,
    ConversationLengthMetric, ResponseRelevanceMetric,
    ConversationCoherenceMetric,
)

# Create a conversation
conversation = Conversation(
    turns=[
        Turn(role=Role.USER, content="What is machine learning?"),
        Turn(role=Role.ASSISTANT, content="Machine learning is a subset of AI that enables computers to learn from data."),
        Turn(role=Role.USER, content="Can you give me an example?"),
        Turn(role=Role.ASSISTANT, content="Sure! Image recognition is a common example of machine learning.")
    ]
)

# Create evaluator with metrics
evaluator = Evaluator(metrics=[
    ConversationLengthMetric(min_turns=2, max_turns=10),
    ResponseRelevanceMetric(threshold=0.6),
    ConversationCoherenceMetric(threshold=0.7)
])

# Run evaluation
report = evaluator.run(conversation)

print(f"Overall Score: {report.overall_score:.3f}")
print(f"Status: {'PASSED' if report.passed else 'FAILED'}")

for result in report.results:
    status = "✓" if result.passed else "✗"
    print(f"{status} {result.metric_name}: {result.score:.3f}")

Batch Evaluation

from turnwise import (
    Evaluator, ConversationDataset, create_summary_report,
    ConversationLengthMetric, ResponseRelevanceMetric,
)

# Create dataset from conversation data
conversation_data = [
    {
        "turns": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there!"}
        ],
        "metadata": {"session_id": "001"}
    }
]

dataset = ConversationDataset(conversation_data)

# Create evaluator with metrics
evaluator = Evaluator(metrics=[
    ConversationLengthMetric(min_turns=1, max_turns=10),
    ResponseRelevanceMetric(threshold=0.6)
])

# Run batch evaluation
reports = evaluator.batch(dataset)

# Create summary report
summary = create_summary_report(reports)
print(f"Overall Pass Rate: {summary['overall_pass_rate']:.1%}")
print(f"Average Score: {summary['average_overall_score']:.3f}")
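To make the two summary fields concrete, here is a small stand-alone sketch of how a pass rate and an average score could be derived from a list of per-conversation reports. The dict-based report shape and the `summarize` helper are hypothetical; only the key names `overall_pass_rate` and `average_overall_score` come from the example above.

```python
from statistics import mean


def summarize(reports):
    """Compute pass rate and mean score over per-conversation reports.

    Each report is assumed to be a dict with 'overall_score' and 'passed'.
    """
    if not reports:
        # Avoid dividing by zero on an empty batch.
        return {"overall_pass_rate": 0.0, "average_overall_score": 0.0}
    return {
        "overall_pass_rate": sum(r["passed"] for r in reports) / len(reports),
        "average_overall_score": mean(r["overall_score"] for r in reports),
    }


reports = [
    {"overall_score": 0.9, "passed": True},
    {"overall_score": 0.4, "passed": False},
    {"overall_score": 0.8, "passed": True},
    {"overall_score": 0.7, "passed": True},
]
summary = summarize(reports)
print(f"Overall Pass Rate: {summary['overall_pass_rate']:.1%}")
print(f"Average Score: {summary['average_overall_score']:.3f}")
```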

LLM-based Evaluation

from turnwise import Evaluator, HelpfulnessMetric, QualityMetric, SafetyMetric

# Create evaluator with LLM-based metrics and API key
evaluator = Evaluator(
    metrics=[
        HelpfulnessMetric(),
        QualityMetric(),
        SafetyMetric()
    ],
    openai_api_key="your-api-key"
)
# `conversation` is the Conversation object built in Basic Usage
report = evaluator.run(conversation)

Built-in Metrics

Conversation Metrics

ConversationLengthMetric

Evaluates whether a conversation has an appropriate length, that is, a turn count between min_turns and max_turns.

ConversationLengthMetric(min_turns=1, max_turns=20, threshold=0.5)
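As a rough illustration of what a length check involves, the sketch below scores a turn count as 1.0 when it falls inside the allowed range and 0.0 otherwise. This binary scoring is an assumption; the library may grade out-of-range conversations more gradually. `length_score` is a hypothetical helper, not part of Turnwise.

```python
def length_score(num_turns, min_turns=1, max_turns=20):
    """Return 1.0 when num_turns is within [min_turns, max_turns], else 0.0."""
    if min_turns > max_turns:
        raise ValueError("min_turns must not exceed max_turns")
    return 1.0 if min_turns <= num_turns <= max_turns else 0.0


threshold = 0.5
for n in (0, 4, 25):
    score = length_score(n)
    # Assumption: the metric passes when score >= threshold.
    print(n, score, score >= threshold)
```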

ConversationCoherenceMetric

Evaluates overall conversation coherence and flow.

ConversationCoherenceMetric(threshold=0.7)

Response Metrics

ResponseRelevanceMetric

Evaluates if assistant responses are relevant to user inputs.

ResponseRelevanceMetric(threshold=0.6)
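This README does not say how relevance is computed. As a point of reference only, the sketch below pairs each assistant turn with the user turn immediately before it and scores the pair by Jaccard token overlap, a simple lexical baseline swapped in for illustration. All function names here are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def relevance_scores(turns):
    """Score each assistant turn against the user turn directly before it."""
    scores = []
    for prev, cur in zip(turns, turns[1:]):
        if prev["role"] == "user" and cur["role"] == "assistant":
            scores.append(jaccard(prev["content"], cur["content"]))
    return scores


turns = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is a subset of AI."},
]
print(relevance_scores(turns))
```

A lexical baseline like this tends to underrate good answers that use different wording from the question, so a threshold tuned for one scoring method should not be carried over to another.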

LLM-based Metrics

HelpfulnessMetric

Uses the OpenAI API to evaluate how helpful assistant responses are.

HelpfulnessMetric(threshold=0.7)  # API key configured at evaluation level

QualityMetric

Evaluates overall conversation quality using an LLM.

QualityMetric(threshold=0.8)  # API key configured at evaluation level

SafetyMetric

Assesses conversation safety and identifies potential concerns.

SafetyMetric(threshold=0.9)  # API key configured at evaluation level

Command Line Interface

Install the CLI

pip install turnwise

Evaluate a single conversation

turnwise evaluate examples/sample_conversation.json -m length -m relevance

Batch evaluate multiple conversations

turnwise batch-evaluate examples/ -m length -m coherence -o results.json

List available metrics

turnwise list-metrics

Conversation JSON format

{
  "turns": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"}
  ],
  "metadata": {"session_id": "001"}
}

For batch evaluation, use an array of conversation objects:

[
  {
    "turns": [{"role": "user", "content": "Hello"}],
    "metadata": {"session_id": "001"}
  },
  {
    "turns": [{"role": "user", "content": "Hi"}],
    "metadata": {"session_id": "002"}
  }
]
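Before handing files to the CLI, it can help to check that they match the shape shown above. The following is a minimal sketch using only the standard library. It accepts either a single conversation object or an array of them. The helper names and the restriction to the `user` and `assistant` roles are assumptions based on the examples in this README.

```python
import json

# Assumption: only the two roles shown in this README are accepted.
VALID_ROLES = {"user", "assistant"}


def validate_conversation(obj):
    """Raise ValueError unless obj matches the documented shape; return obj."""
    if not isinstance(obj, dict):
        raise ValueError("conversation must be a JSON object")
    turns = obj.get("turns")
    if not isinstance(turns, list) or not turns:
        raise ValueError("'turns' must be a non-empty list")
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            raise ValueError(f"turn {i}: must be an object")
        role = turn.get("role")
        if not isinstance(role, str) or role not in VALID_ROLES:
            raise ValueError(f"turn {i}: unexpected role {role!r}")
        if not isinstance(turn.get("content"), str):
            raise ValueError(f"turn {i}: 'content' must be a string")
    # 'metadata' is treated as optional here, but must be an object if present.
    if not isinstance(obj.get("metadata", {}), dict):
        raise ValueError("'metadata' must be an object")
    return obj


def load_conversations(text):
    """Parse a single conversation object or an array of them."""
    data = json.loads(text)
    items = data if isinstance(data, list) else [data]
    return [validate_conversation(item) for item in items]


single = '{"turns": [{"role": "user", "content": "Hello"}], "metadata": {"session_id": "001"}}'
batch = '[{"turns": [{"role": "user", "content": "Hi"}]}]'
print(len(load_conversations(single)), len(load_conversations(batch)))
```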

Dataset Management

from turnwise import ConversationDataset

# Create dataset (conversation_data is the list of conversation dicts from Batch Evaluation)
dataset = ConversationDataset(conversation_data)

# Filter conversations by length
short_conversations = dataset.filter_length(min_turns=1, max_turns=3)

# Filter by user turn ratio
balanced_conversations = dataset.filter_ratio(min_user_ratio=0.3, max_user_ratio=0.7)

# Get dataset statistics
stats = dataset.stats()
print(f"Total conversations: {stats['total_conversations']}")
print(f"Average turns: {stats['average_turns_per_conversation']:.1f}")
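The filters above can be read as simple predicates over each conversation. The sketch below reproduces that logic on plain lists of dicts so the semantics are easy to inspect. It is an illustration under the assumption that both bounds are inclusive and that the user ratio is user turns divided by total turns. These functions are hypothetical, not the `ConversationDataset` methods themselves.

```python
def user_ratio(conversation):
    """Fraction of turns in a conversation that come from the user."""
    turns = conversation["turns"]
    if not turns:
        return 0.0
    return sum(t["role"] == "user" for t in turns) / len(turns)


def filter_length(conversations, min_turns=1, max_turns=3):
    """Keep conversations whose turn count is within the inclusive range."""
    return [c for c in conversations if min_turns <= len(c["turns"]) <= max_turns]


def filter_ratio(conversations, min_user_ratio=0.3, max_user_ratio=0.7):
    """Keep conversations whose user-turn ratio is within the inclusive range."""
    return [c for c in conversations if min_user_ratio <= user_ratio(c) <= max_user_ratio]


def stats(conversations):
    """Basic dataset statistics, using the key names shown above."""
    total = len(conversations)
    avg = sum(len(c["turns"]) for c in conversations) / total if total else 0.0
    return {"total_conversations": total, "average_turns_per_conversation": avg}


conversations = [
    {"turns": [{"role": "user", "content": "Hello"},
               {"role": "assistant", "content": "Hi there!"}]},
    {"turns": [{"role": "user", "content": "Hi"},
               {"role": "user", "content": "Anyone there?"},
               {"role": "user", "content": "Hello?"},
               {"role": "user", "content": "I'll try later."}]},
]
print(len(filter_length(conversations)), len(filter_ratio(conversations)))
print(stats(conversations))
```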

Creating Custom Metrics

from turnwise import Metric, EvaluationResult

class CustomMetric(Metric):
    def __init__(self, name="custom_metric", threshold=0.5):
        super().__init__(name, threshold)
    
    def evaluate(self, conversation):
        # Your evaluation logic here
        score = 0.8  # Calculate your score
        details = {"custom_info": "value"}
        
        return self._create_result(score, details)
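For a more concrete picture, here is a self-contained sketch of a metric that scores the fraction of assistant responses at or under a word limit. So that it runs without Turnwise installed, it defines small stand-ins for `Turn`, `Conversation`, `EvaluationResult`, and `Metric`. These stand-ins, the string-valued roles, and the `score >= threshold` pass rule are assumptions, and the real classes may differ. `ConciseResponseMetric` itself is a hypothetical example.

```python
from dataclasses import dataclass, field


# --- Hypothetical stand-ins so the sketch runs without Turnwise installed ---
@dataclass
class Turn:
    role: str
    content: str


@dataclass
class Conversation:
    turns: list


@dataclass
class EvaluationResult:
    metric_name: str
    score: float
    passed: bool
    details: dict = field(default_factory=dict)


class Metric:
    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold

    def _create_result(self, score, details):
        # Assumption: a result passes when score >= threshold.
        return EvaluationResult(self.name, score, score >= self.threshold, details)


# --- The custom metric itself ---
class ConciseResponseMetric(Metric):
    """Fraction of assistant responses at or under max_words words."""

    def __init__(self, max_words=50, threshold=0.8):
        super().__init__("concise_response", threshold)
        self.max_words = max_words

    def evaluate(self, conversation):
        lengths = [len(t.content.split())
                   for t in conversation.turns if t.role == "assistant"]
        if not lengths:
            return self._create_result(0.0, {"reason": "no assistant turns"})
        concise = sum(n <= self.max_words for n in lengths)
        return self._create_result(concise / len(lengths), {"word_counts": lengths})


conversation = Conversation(turns=[
    Turn("user", "What is machine learning?"),
    Turn("assistant", "Machine learning lets computers learn patterns from data."),
])
result = ConciseResponseMetric(max_words=10).evaluate(conversation)
print(result.metric_name, result.score, result.passed)
```

When adapting this to the real base class, keep the pattern from the template above: compute a score, collect any diagnostic details, and return `self._create_result(score, details)`.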

Examples

See the examples/ directory for complete examples:

basic_usage.py - Simple conversation evaluation
advanced_usage.py - Advanced features and custom metrics
batch_evaluation.py - Batch evaluation with summary reports
cli_usage.py - CLI usage examples
dataframe_like_api.py - DataFrame-like API usage
llm_evaluation.py - LLM-based evaluation example

Configuration

API Key Management

API keys are configured once at the evaluation level (on the Evaluator), not per metric. Credentials therefore live in one place, and metric definitions stay free of secrets.

Environment Variables (Recommended)

Set your API keys as environment variables:

export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"

Or create a .env file:

OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
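If you read the key yourself before constructing an evaluator, failing early with a clear message is usually friendlier than a late authentication error. A minimal sketch follows, where `get_api_key` is a hypothetical helper. Note that plain `os.environ` does not read a `.env` file on its own. Whether Turnwise loads one automatically is not stated here, so a loader such as python-dotenv's `load_dotenv()` may be needed first.

```python
import os


def get_api_key(name="OPENAI_API_KEY"):
    """Return the named API key from the environment, or raise a clear error."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return key


try:
    api_key = get_api_key()
    print("OPENAI_API_KEY found")
except RuntimeError as exc:
    print(exc)
```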

Programmatic Configuration

Pass API keys directly to the Evaluator:

from turnwise import Evaluator, HelpfulnessMetric, QualityMetric

# Create evaluator with metrics and API keys
evaluator = Evaluator(
    metrics=[HelpfulnessMetric(), QualityMetric()],
    openai_api_key="your-key",
    anthropic_api_key="your-key"
)

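# Single evaluation (conversation: a Conversation, as in Basic Usage)
report = evaluator.run(conversation)

# Batch evaluation (conversations: e.g. a ConversationDataset, as in Batch Evaluation)
reports = evaluator.batch(conversations)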

Configuration File

Create a turnwise.yaml configuration file:

evaluation:
  max_workers: 4
  parallel: true

metrics:
  length:
    min_turns: 1
    max_turns: 20
  relevance:
    threshold: 0.6
  coherence:
    threshold: 0.7

llm:
  api_key: ${OPENAI_API_KEY}
  model: gpt-3.5-turbo
  temperature: 0.0
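The `${OPENAI_API_KEY}` placeholder above implies that environment variables are substituted when the file is loaded. If you load the file yourself, that substitution can be sketched with the standard library as shown below. The `expand_env` helper is hypothetical, and the dict stands in for what a YAML loader would return for the `llm` section. How Turnwise itself resolves placeholders is not documented here.

```python
import os
from string import Template


def expand_env(value, env=None):
    """Recursively replace ${VAR} placeholders in strings inside dicts and lists."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # safe_substitute leaves unknown placeholders untouched instead of raising.
        return Template(value).safe_substitute(env)
    if isinstance(value, dict):
        return {k: expand_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, env) for v in value]
    return value  # numbers, booleans, None pass through unchanged


# Parsed form of the llm section above, as a YAML loader would return it.
config = {"llm": {"api_key": "${OPENAI_API_KEY}", "model": "gpt-3.5-turbo", "temperature": 0.0}}
resolved = expand_env(config, env={"OPENAI_API_KEY": "sk-example"})
print(resolved["llm"]["api_key"])
```

Keeping the placeholder in the file and the real key in the environment means the configuration file can be committed without exposing credentials.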

Development

Setup

git clone https://github.com/your-username/turnwise.git
cd turnwise
uv sync

Running Tests

make test
# or
uv run pytest

Code Quality

make lint
make format
# or
uv run ruff check .
uv run ruff format .
uv run mypy src/

Available Make Commands

make help          # Show all available commands
make install       # Install dependencies
make test          # Run tests
make lint          # Run linter
make format        # Format code
make clean         # Clean build artifacts
make run-example   # Run basic example
make run-llm-example # Run LLM evaluation example

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built with Pydantic AI for LLM integration
Uses Ruff for fast linting and formatting
Inspired by modern evaluation frameworks for conversational AI

Version 0.1.0, released Sep 23, 2025.
