Magnific-Evals
A Python package for testing LLMs as voice agents in customer service scenarios.
Installation
pip install magnific-llm-evals
This will automatically install all required dependencies.
Alternatively, if installing from source:
git clone https://github.com/austinw1995/magnific-llm-evals
cd magnific-llm-evals
pip install -r requirements.txt
Features
- Evaluate multiple LLMs against various customer conversation scenarios in parallel
- Define custom evaluation criteria
- Record conversation transcripts
- Generate detailed evaluation reports
Usage
First, set the API keys for the providers whose models you want to test.
import os

os.environ["OPENAI_API_KEY"] = "..."
os.environ["ANTHROPIC_API_KEY"] = "..."
os.environ["TOGETHER_API_KEY"] = "..."
os.environ["GROQ_API_KEY"] = "..."
os.environ["DEEPSEEK_API_KEY"] = "..."
os.environ["XAI_API_KEY"] = "..."
os.environ["GEMINI_API_KEY"] = "..."
To configure a service or customer agent, create an LLMConfig object with the desired parameters.
- params is a dictionary of parameters passed to the LLM; it accepts any parameter supported by the LLM provider.
- system_prompt is a string of instructions for the LLM to follow.
- end_call_enabled is a boolean that determines whether the LLM may end the call via the end_call() tool/function call.
service_config_1 = LLMConfig(
params={
"model": "gpt-4o-mini",
"temperature": 0.7,
"max_tokens": 150,
},
system_prompt="""You are a voice assistant for Vappy's Pizzeria, a pizza shop located on the Internet.
Your job is to take the order of customers calling in. The menu has only 3 types of items: pizza, sides, and drinks.
Keep responses short and simple. Do not end the call until the customer says bye.
IMPORTANT: Do not use tool end_call() until all of the customer's questions are answered and they say something like "bye" or "see you."
""",
end_call_enabled=True
)
service_config_2 = LLMConfig(
params={
"model": "claude-3-5-sonnet-20241022",
"temperature": 0.2,
"max_tokens": 90,
},
system_prompt="""You are a voice assistant for Vappy's Burgers, a burger shop located on the Internet.
Your job is to take the order of customers calling in. The menu has only 3 types of items: burgers, sides, and milkshakes.
Keep responses short and simple. Do not end the call until the customer says bye.
IMPORTANT: Do not use tool end_call() until all of the customer's questions are answered and they say something like "bye" or "see you."
""",
end_call_enabled=True
)
customer_config_1 = LLMConfig(
params={
"model": "gemini-2.0-flash"
},
system_prompt="""You are a hungry customer who wants to order food.
Your tone is casual and excited.
IMPORTANT: Use the tool end_call() only when you are satisfied with your order and all your questions are answered.
""",
end_call_enabled=True
)
customer_config_2 = LLMConfig(
params={
"model": "llama-3.3-70b-versatile",
"temperature": 0.9,
"max_tokens": 50,
},
system_prompt="""You are a cheerful customer who wants to order food.
Your tone is cheerful and excited.
IMPORTANT: Use the tool end_call() only when you are satisfied with your order and all your questions are answered.
""",
end_call_enabled=True
)
To initialize the providers, wrap each config in the provider class that matches its model.
service_provider_1 = OpenAIProvider(config=service_config_1)
service_provider_2 = AnthropicProvider(config=service_config_2)
customer_provider_1 = GeminiProvider(config=customer_config_1)
customer_provider_2 = GroqProvider(config=customer_config_2)
For each provider, the following models are supported:
- OpenAIProvider: gpt-4o, gpt-4o-mini, gpt-3.5-turbo-0125, o1, o1-mini, o3-mini
- AnthropicProvider: claude-3-5-sonnet-20241022, claude-3-5-haiku-20241022, claude-3-opus-20240229, claude-3-sonnet-20240229, claude-3-haiku-20240307
- TogetherAIProvider: meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo, meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo, meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo, meta-llama/Llama-3.3-70B-Instruct-Turbo, mistralai/Mixtral-8x7B-Instruct-v0.1, mistralai/Mistral-7B-Instruct-v0.1, Qwen/Qwen2.5-7B-Instruct-Turbo, Qwen/Qwen2.5-72B-Instruct-Turbo
- GroqProvider: qwen-2.5-32b, deepseek-r1-distill-qwen-32b, deepseek-r1-distill-llama-70b, llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768, gemma2-9b-it
- DeepSeekProvider: deepseek-chat, deepseek-reasoner
- CerebrasProvider: llama3.1-8b, llama-3.3-70b, DeepSeek-R1-Distill-Llama-70B
- XAIProvider: grok-2-latest
- GeminiProvider: gemini-2.0-flash, gemini-2.0-flash-lite-preview-02-05, gemini-1.5-flash, gemini-1.5-flash-8b, gemini-1.5-pro
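If you select models dynamically, a simple lookup from model name to provider class can help. The helper below is hypothetical (not part of the package); it covers only a subset of the models listed above and returns the provider class name as a string so the sketch stays self-contained:

```python
# Hypothetical model -> provider lookup built from the lists above.
# Only a subset of the supported models is included here.
SUPPORTED_MODELS = {
    "OpenAIProvider": ["gpt-4o", "gpt-4o-mini", "o1", "o1-mini", "o3-mini"],
    "AnthropicProvider": ["claude-3-5-sonnet-20241022", "claude-3-5-haiku-20241022"],
    "GroqProvider": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant"],
    "GeminiProvider": ["gemini-2.0-flash", "gemini-1.5-pro"],
}

def provider_for(model: str) -> str:
    """Return the name of the provider class that supports `model`."""
    for provider, models in SUPPORTED_MODELS.items():
        if model in models:
            return provider
    raise ValueError(f"No provider listed for model {model!r}")
```

In real use you would map the returned name to the corresponding provider class and instantiate it with your LLMConfig.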
To instantiate a list of conversations, use the LLMConversation class.
- service_provider is the provider for the service agent.
- customer_provider is the provider for the customer agent.
- type is either "inbound" or "outbound": inbound means the customer calls in; outbound means the service agent calls out.
- first_message is the caller's opening message in the conversation.
- evaluations is a list of custom evaluations to run on the conversation; each Evaluation has a name and a prompt stating the criteria.
conversations = [
LLMConversation(
service_provider=service_provider_1,
customer_provider=customer_provider_1,
type="inbound",
first_message="Hi, what's on the menu today?",
evaluations=[
Evaluation(name="Menu", prompt="The menu should be displayed in a structured format, with each item on a new line."),
Evaluation(name="helpfulness", prompt="The service agent should be helpful and answer all questions.")
]
),
LLMConversation(
service_provider=service_provider_2,
customer_provider=customer_provider_2,
type="outbound",
first_message="Hi, what would you like to order?",
evaluations=[
Evaluation(name="Menu", prompt="The menu should be displayed in a structured format, with each item on a new line."),
Evaluation(name="conciseness", prompt="The service agent should be concise.")
]
),
LLMConversation(
service_provider=service_provider_1,
customer_provider=customer_provider_1,
type="inbound",
first_message="Hi, I'm so hungry",
evaluations=[
Evaluation(name="empathy", prompt="The service agent should be empathetic and show understanding of the customer's situation."),
Evaluation(name="frustration", prompt="The customer should not be frustrated or annoyed.")
]
)
]
To run the tests in parallel with a specific LLM-as-a-judge evaluation model (only OpenAI models are supported for now), use the TestRunner class.
runner = TestRunner(eval_model="gpt-4o-mini")
results = await runner.run_tests(conversations)
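Note that run_tests is a coroutine, so outside an async context you need an event loop. A minimal sketch of the calling pattern, using a stub class standing in for the package's TestRunner:

```python
import asyncio

# Stub standing in for magnific's TestRunner; it illustrates only the async
# calling pattern, not the real evaluation logic.
class StubRunner:
    async def run_tests(self, conversations):
        # The real runner evaluates conversations in parallel and returns
        # a dict keyed by test_id.
        return {str(i + 1): {"test_id": i + 1} for i in range(len(conversations))}

async def main():
    runner = StubRunner()
    return await runner.run_tests(["conv_a", "conv_b"])

results = asyncio.run(main())
```

In a script, replace StubRunner with the package's TestRunner and pass your list of LLMConversation objects.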
The results will be a dictionary with the test_id as the key and the result as the value. An example result based on the conversations above is:
{
"1": {
"test_id": 1,
"call_type": "inbound",
"transcript": "...",
"evaluation_results": [
{
"name": "Menu",
"passed": false,
"score": 0.2,
"reason": "The menu items are mentioned in a conversational format rather than a structured format with each item on a new line. The response does not meet the requirement for clear presentation."
},
{
"name": "helpfulness",
"passed": true,
"score": 1.0,
"reason": "The service agent provided detailed information about the menu items, answered all questions asked by the customer, and offered additional options, demonstrating a high level of helpfulness."
}
],
"service_config": {
"params": {
"model": "gpt-4o-mini",
"temperature": 0.7,
"max_tokens": 150
},
"system_prompt": "You are a voice assistant for Vappy's Pizzeria, a pizza shop located on the Internet.\nYour job is to take the order of customers calling in. The menu has only 3 types of items: pizza, sides, and drinks.\nKeep responses short and simple. Do not end the call until the customer says bye.\nIMPORTANT: Do not use tool end_call() until all of the customer's questions are answered and they say something like \"bye\" or \"see you.\"",
"end_call_enabled": true
},
"customer_config": {
"params": {
"model": "gemini-2.0-flash"
},
"system_prompt": "You are a hungry customer who wants to order food.\nYour tone is casual and excited.\nIMPORTANT: Use the tool end_call() only when you are satisfied with your order and all your questions are answered.",
"end_call_enabled": true
}
},
"2": {
"test_id": 2,
"call_type": "outbound",
"transcript": "...",
"evaluation_results": [
{
"name": "Menu",
"passed": true,
"score": 1.0,
"reason": "The menu items were clearly listed in a structured format, with each item on a new line, making it easy to read and understand."
},
{
"name": "conciseness",
"passed": true,
"score": 0.8,
"reason": "The service agent provided clear and relevant information without unnecessary elaboration. However, there were moments where the responses could have been slightly more succinct, particularly in confirming the order."
}
],
"service_config": {
"params": {
"model": "claude-3-5-sonnet-20241022",
"temperature": 0.2,
"max_tokens": 90
},
"system_prompt": "You are a voice assistant for Vappy's Burgers, a burger shop located on the Internet.\nYour job is to take the order of customers calling in. The menu has only 3 types of items: burgers, sides, and milkshakes.\nKeep responses short and simple. Do not end the call until the customer says bye.\nIMPORTANT: Do not use tool end_call() until all of the customer's questions are answered and they say something like \"bye\" or \"see you.\"",
"end_call_enabled": true
},
"customer_config": {
"params": {
"model": "llama-3.3-70b-versatile",
"temperature": 0.9,
"max_tokens": 50
},
"system_prompt": "You are a cheerful customer who wants to order food.\nYour tone is cheerful and excited.\nIMPORTANT: Use the tool end_call() only when you are satisfied with your order and all your questions are answered.",
"end_call_enabled": true
}
},
"3": {
"test_id": 3,
"call_type": "inbound",
"transcript": "...",
"evaluation_results": [
{
"name": "empathy",
"passed": true,
"score": 0.9,
"reason": "The service agent demonstrated a good level of empathy by responding positively to the customer's excitement about food and acknowledging their hunger. However, there could have been more explicit expressions of understanding or concern for the customer's situation."
},
{
"name": "frustration",
"passed": true,
"score": 1.0,
"reason": "The customer expressed excitement and eagerness throughout the conversation, showing no signs of frustration or annoyance."
}
],
"service_config": {
"params": {
"model": "gpt-4o-mini",
"temperature": 0.7,
"max_tokens": 150
},
"system_prompt": "You are a voice assistant for Vappy's Pizzeria, a pizza shop located on the Internet.\nYour job is to take the order of customers calling in. The menu has only 3 types of items: pizza, sides, and drinks.\nKeep responses short and simple. Do not end the call until the customer says bye.\nIMPORTANT: Do not use tool end_call() until all of the customer's questions are answered and they say something like \"bye\" or \"see you.\"",
"end_call_enabled": true
},
"customer_config": {
"params": {
"model": "gemini-2.0-flash"
},
"system_prompt": "You are a hungry customer who wants to order food.\nYour tone is casual and excited.\nIMPORTANT: Use the tool end_call() only when you are satisfied with your order and all your questions are answered.",
"end_call_enabled": true
}
}
}
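Given this result shape, a small helper can aggregate pass rates and average scores per evaluation name across all tests. A sketch assuming only the fields shown above:

```python
from collections import defaultdict

def summarize(results: dict) -> dict:
    """Compute pass rate and average score per evaluation name."""
    totals = defaultdict(lambda: {"passed": 0, "count": 0, "score_sum": 0.0})
    for test in results.values():
        for ev in test["evaluation_results"]:
            t = totals[ev["name"]]
            t["count"] += 1
            t["passed"] += int(ev["passed"])
            t["score_sum"] += ev["score"]
    return {
        name: {"pass_rate": t["passed"] / t["count"],
               "avg_score": t["score_sum"] / t["count"]}
        for name, t in totals.items()
    }
```

This is useful when the same evaluation (e.g. "Menu") runs across several conversations and you want a single number per criterion.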
More examples can be found in the examples folder.
File details
Details for the file magnific_llm_evals-0.0.2.tar.gz.
File metadata
- Download URL: magnific_llm_evals-0.0.2.tar.gz
- Upload date:
- Size: 13.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6198f491925341a7e00e6cb396c05f100724f2a1bb2760fa0a2bde1ce774f264 |
| MD5 | def5ea9e5523ce78266948380f16c0aa |
| BLAKE2b-256 | 8edce8c6e3e5c1240c21c0df610fe85da3b35e5332779732b2f1ac8891a2200a |
File details
Details for the file magnific_llm_evals-0.0.2-py3-none-any.whl.
File metadata
- Download URL: magnific_llm_evals-0.0.2-py3-none-any.whl
- Upload date:
- Size: 12.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ecb58dd484068eb0afa8b7cee899843b5e1fafdfbd2d472b998b7683af23d581
|
|
| MD5 |
f2e749170b85fa42b97d2f84216ae316
|
|
| BLAKE2b-256 |
d71acc48d3ddf2000a9308b1c68802ff6364a178c4854ee560459d28071f74f3
|