Magnific-Evals
A Python package for testing LLMs as voice agents in customer service scenarios.
Installation
pip install magnific-llm-evals
This will automatically install all required dependencies.
Alternatively, if installing from source:
git clone https://github.com/austinw1995/magnific-llm-evals
cd magnific-llm-evals
pip install -r requirements.txt
Features
- Evaluate multiple LLMs against various customer conversation scenarios in parallel
- Define custom evaluation criteria
- Record conversation transcripts
- Generate detailed evaluation reports
Usage
First, set the API keys for the providers whose models you want to test.
import os

os.environ["OPENAI_API_KEY"] = "..."
os.environ["ANTHROPIC_API_KEY"] = "..."
os.environ["TOGETHER_API_KEY"] = "..."
os.environ["GROQ_API_KEY"] = "..."
os.environ["DEEPSEEK_API_KEY"] = "..."
os.environ["XAI_API_KEY"] = "..."
os.environ["GEMINI_API_KEY"] = "..."
To configure a service or customer agent, create an LLMConfig object with the desired parameters.
- params is a dictionary of parameters passed to the LLM; it accepts any parameter supported by the LLM provider.
- system_prompt is a string of instructions for the LLM to follow.
- end_call_enabled is a boolean that determines whether the LLM may end the call via the end_call() tool/function call.
service_config_1 = LLMConfig(
params={
"model": "gpt-4o-mini",
"temperature": 0.7,
"max_tokens": 150,
},
system_prompt="""You are a voice assistant for Vappy's Pizzeria, a pizza shop located on the Internet.
Your job is to take the order of customers calling in. The menu has only 3 types of items: pizza, sides, and drinks.
Keep responses short and simple. Do not end the call until the customer says bye.
IMPORTANT: Do not use tool end_call() until all of the customer's questions are answered and they say something like "bye" or "see you."
""",
end_call_enabled=True
)
service_config_2 = LLMConfig(
params={
"model": "claude-3-5-sonnet-20241022",
"temperature": 0.2,
"max_tokens": 90,
},
system_prompt="""You are a voice assistant for Vappy's Burgers, a burger shop located on the Internet.
Your job is to take the order of customers calling in. The menu has only 3 types of items: burgers, sides, and milkshakes.
Keep responses short and simple. Do not end the call until the customer says bye.
IMPORTANT: Do not use tool end_call() until all of the customer's questions are answered and they say something like "bye" or "see you."
""",
end_call_enabled=True
)
customer_config_1 = LLMConfig(
params={
"model": "gemini-2.0-flash"
},
system_prompt="""You are a hungry customer who wants to order food.
Your tone is casual and excited.
IMPORTANT: Use the tool end_call() only when you are satisfied with your order and all your questions are answered.
""",
end_call_enabled=True
)
customer_config_2 = LLMConfig(
params={
"model": "llama-3.3-70b-versatile",
"temperature": 0.9,
"max_tokens": 50,
},
system_prompt="""You are a cheerful customer who wants to order food.
Your tone is cheerful and excited.
IMPORTANT: Use the tool end_call() only when you are satisfied with your order and all your questions are answered.
""",
end_call_enabled=True
)
To initialize the providers, wrap each config in the provider class that matches its model.
service_provider_1 = OpenAIProvider(config=service_config_1)
service_provider_2 = AnthropicProvider(config=service_config_2)
customer_provider_1 = GeminiProvider(config=customer_config_1)
customer_provider_2 = GroqProvider(config=customer_config_2)
For each provider, the following models are supported:
- OpenAIProvider: gpt-4o, gpt-4o-mini, gpt-3.5-turbo-0125, o1, o1-mini, o3-mini
- AnthropicProvider: claude-3-5-sonnet-20241022, claude-3-5-haiku-20241022, claude-3-opus-20240229, claude-3-sonnet-20240229, claude-3-haiku-20240307
- TogetherAIProvider: meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo, meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo, meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo, meta-llama/Llama-3.3-70B-Instruct-Turbo, mistralai/Mixtral-8x7B-Instruct-v0.1, mistralai/Mistral-7B-Instruct-v0.1, Qwen/Qwen2.5-7B-Instruct-Turbo, Qwen/Qwen2.5-72B-Instruct-Turbo
- GroqProvider: qwen-2.5-32b, deepseek-r1-distill-qwen-32b, deepseek-r1-distill-llama-70b, llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768, gemma2-9b-it
- DeepSeekProvider: deepseek-chat, deepseek-reasoner
- CerebrasProvider: llama3.1-8b, llama-3.3-70b, DeepSeek-R1-Distill-Llama-70B
- XAIProvider: grok-2-latest
- GeminiProvider: gemini-2.0-flash, gemini-2.0-flash-lite-preview-02-05, gemini-1.5-flash, gemini-1.5-flash-8b, gemini-1.5-pro
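If you select models dynamically, a simple lookup from model name to provider class can help. The helper below is hypothetical (not part of the package); it covers only a subset of the models listed above and returns the provider class name as a string so the sketch stays self-contained:

```python
# Hypothetical model -> provider lookup built from the lists above.
# Only a subset of the supported models is included here.
SUPPORTED_MODELS = {
    "OpenAIProvider": ["gpt-4o", "gpt-4o-mini", "o1", "o1-mini", "o3-mini"],
    "AnthropicProvider": ["claude-3-5-sonnet-20241022", "claude-3-5-haiku-20241022"],
    "GroqProvider": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant"],
    "GeminiProvider": ["gemini-2.0-flash", "gemini-1.5-pro"],
}

def provider_for(model: str) -> str:
    """Return the name of the provider class that supports `model`."""
    for provider, models in SUPPORTED_MODELS.items():
        if model in models:
            return provider
    raise ValueError(f"No provider listed for model {model!r}")
```

In real use you would map the returned name to the corresponding provider class and instantiate it with your LLMConfig.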
To instantiate a list of conversations, use the LLMConversation class.
- service_provider is the provider for the service agent.
- customer_provider is the provider for the customer agent.
- type is either "inbound" or "outbound": inbound means the customer calls in; outbound means the service agent calls out.
- first_message is the caller's opening message in the conversation.
- evaluations is a list of custom evaluations to run on the conversation; each Evaluation has a name and a prompt stating the criteria.
conversations = [
LLMConversation(
service_provider=service_provider_1,
customer_provider=customer_provider_1,
type="inbound",
first_message="Hi, what's on the menu today?",
evaluations=[
Evaluation(name="Menu", prompt="The menu should be displayed in a structured format, with each item on a new line."),
Evaluation(name="helpfulness", prompt="The service agent should be helpful and answer all questions.")
]
),
LLMConversation(
service_provider=service_provider_2,
customer_provider=customer_provider_2,
type="outbound",
first_message="Hi, what would you like to order?",
evaluations=[
Evaluation(name="Menu", prompt="The menu should be displayed in a structured format, with each item on a new line."),
Evaluation(name="conciseness", prompt="The service agent should be concise.")
]
),
LLMConversation(
service_provider=service_provider_1,
customer_provider=customer_provider_1,
type="inbound",
first_message="Hi, I'm so hungry",
evaluations=[
Evaluation(name="empathy", prompt="The service agent should be empathetic and show understanding of the customer's situation."),
Evaluation(name="frustration", prompt="The customer should not be frustrated or annoyed.")
]
)
]
To run the tests in parallel with a specific LLM-as-a-judge evaluation model (only OpenAI models are supported for now), use the TestRunner class.
runner = TestRunner(eval_model="gpt-4o-mini")
results = await runner.run_tests(conversations)
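Note that run_tests is a coroutine, so outside an async context you need an event loop. A minimal sketch of the calling pattern, using a stub class standing in for the package's TestRunner:

```python
import asyncio

# Stub standing in for magnific's TestRunner; it illustrates only the async
# calling pattern, not the real evaluation logic.
class StubRunner:
    async def run_tests(self, conversations):
        # The real runner evaluates conversations in parallel and returns
        # a dict keyed by test_id.
        return {str(i + 1): {"test_id": i + 1} for i in range(len(conversations))}

async def main():
    runner = StubRunner()
    return await runner.run_tests(["conv_a", "conv_b"])

results = asyncio.run(main())
```

In a script, replace StubRunner with the package's TestRunner and pass your list of LLMConversation objects.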
The results will be a dictionary with the test_id as the key and the result as the value. An example result based on the conversations above is:
{
"1": {
"test_id": 1,
"call_type": "inbound",
"transcript": "...",
"evaluation_results": [
{
"name": "Menu",
"passed": false,
"score": 0.2,
"reason": "The menu items are mentioned in a conversational format rather than a structured format with each item on a new line. The response does not meet the requirement for clear presentation."
},
{
"name": "helpfulness",
"passed": true,
"score": 1.0,
"reason": "The service agent provided detailed information about the menu items, answered all questions asked by the customer, and offered additional options, demonstrating a high level of helpfulness."
}
],
"service_config": {
"params": {
"model": "gpt-4o-mini",
"temperature": 0.7,
"max_tokens": 150
},
"system_prompt": "You are a voice assistant for Vappy's Pizzeria, a pizza shop located on the Internet.\nYour job is to take the order of customers calling in. The menu has only 3 types of items: pizza, sides, and drinks.\nKeep responses short and simple. Do not end the call until the customer says bye.\nIMPORTANT: Do not use tool end_call() until all of the customer's questions are answered and they say something like \"bye\" or \"see you.\"",
"end_call_enabled": true
},
"customer_config": {
"params": {
"model": "gemini-2.0-flash"
},
"system_prompt": "You are a hungry customer who wants to order food.\nYour tone is casual and excited.\nIMPORTANT: Use the tool end_call() only when you are satisfied with your order and all your questions are answered.",
"end_call_enabled": true
}
},
"2": {
"test_id": 2,
"call_type": "outbound",
"transcript": "...",
"evaluation_results": [
{
"name": "Menu",
"passed": true,
"score": 1.0,
"reason": "The menu items were clearly listed in a structured format, with each item on a new line, making it easy to read and understand."
},
{
"name": "conciseness",
"passed": true,
"score": 0.8,
"reason": "The service agent provided clear and relevant information without unnecessary elaboration. However, there were moments where the responses could have been slightly more succinct, particularly in confirming the order."
}
],
"service_config": {
"params": {
"model": "claude-3-5-sonnet-20241022",
"temperature": 0.2,
"max_tokens": 90
},
"system_prompt": "You are a voice assistant for Vappy's Burgers, a burger shop located on the Internet.\nYour job is to take the order of customers calling in. The menu has only 3 types of items: burgers, sides, and milkshakes.\nKeep responses short and simple. Do not end the call until the customer says bye.\nIMPORTANT: Do not use tool end_call() until all of the customer's questions are answered and they say something like \"bye\" or \"see you.\"",
"end_call_enabled": true
},
"customer_config": {
"params": {
"model": "llama-3.3-70b-versatile",
"temperature": 0.9,
"max_tokens": 50
},
"system_prompt": "You are a cheerful customer who wants to order food.\nYour tone is cheerful and excited.\nIMPORTANT: Use the tool end_call() only when you are satisfied with your order and all your questions are answered.",
"end_call_enabled": true
}
},
"3": {
"test_id": 3,
"call_type": "inbound",
"transcript": "...",
"evaluation_results": [
{
"name": "empathy",
"passed": true,
"score": 0.9,
"reason": "The service agent demonstrated a good level of empathy by responding positively to the customer's excitement about food and acknowledging their hunger. However, there could have been more explicit expressions of understanding or concern for the customer's situation."
},
{
"name": "frustration",
"passed": true,
"score": 1.0,
"reason": "The customer expressed excitement and eagerness throughout the conversation, showing no signs of frustration or annoyance."
}
],
"service_config": {
"params": {
"model": "gpt-4o-mini",
"temperature": 0.7,
"max_tokens": 150
},
"system_prompt": "You are a voice assistant for Vappy's Pizzeria, a pizza shop located on the Internet.\nYour job is to take the order of customers calling in. The menu has only 3 types of items: pizza, sides, and drinks.\nKeep responses short and simple. Do not end the call until the customer says bye.\nIMPORTANT: Do not use tool end_call() until all of the customer's questions are answered and they say something like \"bye\" or \"see you.\"",
"end_call_enabled": true
},
"customer_config": {
"params": {
"model": "gemini-2.0-flash"
},
"system_prompt": "You are a hungry customer who wants to order food.\nYour tone is casual and excited.\nIMPORTANT: Use the tool end_call() only when you are satisfied with your order and all your questions are answered.",
"end_call_enabled": true
}
}
}
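Given this result shape, a small helper can aggregate pass rates and average scores per evaluation name across all tests. A sketch assuming only the fields shown above:

```python
from collections import defaultdict

def summarize(results: dict) -> dict:
    """Compute pass rate and average score per evaluation name."""
    totals = defaultdict(lambda: {"passed": 0, "count": 0, "score_sum": 0.0})
    for test in results.values():
        for ev in test["evaluation_results"]:
            t = totals[ev["name"]]
            t["count"] += 1
            t["passed"] += int(ev["passed"])
            t["score_sum"] += ev["score"]
    return {
        name: {"pass_rate": t["passed"] / t["count"],
               "avg_score": t["score_sum"] / t["count"]}
        for name, t in totals.items()
    }
```

This is useful when the same evaluation (e.g. "Menu") runs across several conversations and you want a single number per criterion.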
More examples can be found in the examples folder.
File details
Details for the file magnific_llm_evals-0.0.2.tar.gz.
File metadata
- Download URL: magnific_llm_evals-0.0.2.tar.gz
- Upload date:
- Size: 13.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6198f491925341a7e00e6cb396c05f100724f2a1bb2760fa0a2bde1ce774f264 |
| MD5 | def5ea9e5523ce78266948380f16c0aa |
| BLAKE2b-256 | 8edce8c6e3e5c1240c21c0df610fe85da3b35e5332779732b2f1ac8891a2200a |
File details
Details for the file magnific_llm_evals-0.0.2-py3-none-any.whl.
File metadata
- Download URL: magnific_llm_evals-0.0.2-py3-none-any.whl
- Upload date:
- Size: 12.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ecb58dd484068eb0afa8b7cee899843b5e1fafdfbd2d472b998b7683af23d581
|
|
| MD5 |
f2e749170b85fa42b97d2f84216ae316
|
|
| BLAKE2b-256 |
d71acc48d3ddf2000a9308b1c68802ff6364a178c4854ee560459d28071f74f3
|