Evaluation Framework for Chatbots in Generative AI

Get Started

pip install chateval
export OPENAI_API_KEY=XXXX.YYYY.ZZZ

Python 3.9+ is required.
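
The two requirements above (a recent Python and an exported `OPENAI_API_KEY`) can be checked before running any evaluation. The following is a minimal sketch; `check_environment` is a hypothetical helper written for illustration and is not part of chateval.

```python
import os
import sys


def check_environment(env=None, version=None):
    """Return a list of problems; an empty list means the setup looks fine.

    Hypothetical helper for illustration, not part of chateval.
    """
    env = os.environ if env is None else env
    version = sys.version_info if version is None else version
    problems = []
    # The README states that Python 3.9+ is required.
    if tuple(version[:2]) < (3, 9):
        problems.append("Python 3.9+ is required")
    # The README exports OPENAI_API_KEY before running anything.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


for problem in check_environment():
    print("warning:", problem)
```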

For Developers

Install as developer

git clone git@github.com:GAIR-NLP/chateval.git
cd chateval
pip install -e .

# install pre-commit hooks
pip install pre-commit
pre-commit install

Run formatting

# this is necessary before you commit
git init
git add .
pre-commit run

Perform unit tests for a specific file

export OPENAI_API_KEY=XXXX.YYYY.ZZZ
python -m unittest integration_tests.gptscore_test

Evaluate Single System with GPTScore

from chateval.metrics import get_metric

dataset = [{"input": "write a movie review of Titanic"}]
predictions = [
    'James Cameron\'s 1997 epic romantic disaster film "Titanic" tells the '
]

metric = get_metric("generic_likert/helpfulness")
results = metric.compute(dataset, predictions)

print(results)

where results is a dict with the following keys:

  • value: the overall score (i.e., the average) on the dataset
  • no_score: the number of samples that could not be evaluated due to an API access error or an invalid evaluation string
  • sample_values: the score for each sample in the dataset
  • details: the detailed evaluation results for each sample in the dataset, including the evaluation prompt and the textual judgment

Here is the result for the example above:

{
'value': 1.0,
'no_score': 0,
'sample_values': [1.0], 
'details': [{'prompt': 'You are evaluating a response that has been submitted for a particular task, using a specific set of standards. Below is the data:\n[BEGIN DATA]\n***\n[Task]: write a movie review of Titanic\n***\n[Submission]: James Cameron\'s 1997 epic romantic disaster film "Titanic" tells the \n***\n[Criterion]: \n1:Not helpful - The generated text is completely irrelevant, unclear, or incomplete. It does not provide any useful information to the user.\n2:Somewhat helpful - The generated text has some relevance to the user\'s question, but it may be unclear or incomplete. It provides only partial information, or the information provided may not be useful for the user\'s needs.\n3:Moderately helpful - The generated text is relevant to the user\'s question, and it provides a clear and complete answer. However, it may lack detail or explanation that would be helpful for the user.\n4:Helpful - The generated text is highly relevant to the user\'s question, and it provides a clear, complete, and detailed answer. It offers additional information or explanations that are useful for the user.\n5:Highly helpful - The generated text is highly relevant to the user\'s question, and it provides a clear, complete, and detailed answer. It offers additional information or explanations that are not only useful but also insightful and valuable to the user.\n***\n[END DATA]\nDoes the submission meet the criterion? First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print the choice only from 1, 2, 3, 4, 5 (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the selected choice again by itself on a new line.\nReasoning:', 'judgment': '1. The task is to write a movie review of Titanic.\n2. The submission only provides the title and director of the movie, but does not offer any review or analysis of the film.\n3. Therefore, the submission is not helpful and does not meet the criterion.\nChoice: 1\n\n1'}]}
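
As a rough illustration of how the four keys fit together, the sketch below builds a short report from a results dict. It uses a hand-written dict shaped like the example output, with the prompt and judgment shortened, so it runs without calling the API. `summarize` is a hypothetical helper, and reading the final line of the judgment as the selected choice is an assumption based on this one example.

```python
# A results dict shaped like the example output (prompt and judgment are
# shortened here; real entries contain the full text).
results = {
    "value": 1.0,
    "no_score": 0,
    "sample_values": [1.0],
    "details": [{"prompt": "...", "judgment": "...\nChoice: 1\n\n1"}],
}


def summarize(results):
    """Build a short text report from a results dict (hypothetical helper)."""
    lines = [f"overall: {results['value']} (unscored: {results['no_score']})"]
    for i, (score, detail) in enumerate(
        zip(results["sample_values"], results["details"])
    ):
        # Assumption: the judgment ends with the selected choice on its own
        # line, as the evaluation prompt in the example requests.
        last_line = detail["judgment"].strip().splitlines()[-1]
        lines.append(f"sample {i}: score={score}, final line={last_line!r}")
    return "\n".join(lines)


print(summarize(results))
```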

The following built-in metrics are available:

  • generic_likert/helpfulness: evaluate the helpfulness of a response with a score from 1 (worst) to 5 (best)

  • generic_likert/relevance: evaluate the relevance of a response with a score from 1 (worst) to 5 (best)

  • generic_likert/coherence: evaluate the coherence of a response with a score from 1 (worst) to 5 (best)

  • generic_bool/helpfulness: evaluate the helpfulness of a response with a score of 1 (good) or 0 (bad)

  • generic_bool/relevance: evaluate the relevance of a response with a score of 1 (good) or 0 (bad)

  • generic_bool/coherence: evaluate the coherence of a response with a score of 1 (good) or 0 (bad)
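
Because the metric names follow a `family/aspect` pattern, several metrics can be run in a loop. The sketch below is one way to do that; `metric_names` and `score_all` are hypothetical helpers, and the metric factory is passed in as an argument (for example `chateval.metrics.get_metric`) so the only library behavior relied on is the `compute(dataset, predictions)` call and the `value` key shown earlier.

```python
FAMILIES = ("generic_likert", "generic_bool")
ASPECTS = ("helpfulness", "relevance", "coherence")


def metric_names(families=FAMILIES, aspects=ASPECTS):
    """List names such as 'generic_likert/helpfulness' for every combination."""
    return [f"{family}/{aspect}" for family in families for aspect in aspects]


def score_all(get_metric, dataset, predictions, names):
    """Run each named metric and collect its overall 'value'.

    `get_metric` is injected by the caller; this helper is illustrative
    and not part of chateval.
    """
    return {
        name: get_metric(name).compute(dataset, predictions)["value"]
        for name in names
    }


print(metric_names())
```

With the package installed and an API key exported, a call might look like `score_all(get_metric, dataset, predictions, metric_names())`; each metric issues its own API requests, so cost grows with the number of names.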

Compare two Chatbots with GPTScore

from chateval.metrics import get_metric

dataset = [{"input": "write a movie review of Titanic"}]
predictions_1 = [
    'James Cameron\'s 1997 epic romantic disaster film "Titanic" tells the '
]

predictions_2 = [
    'James Cameron\'s 1997 epic romantic disaster film "Titanic" tells the '
    "tragic story of two star-crossed lovers, Jack (Leonardo DiCaprio) and "
    "Rose (Kate Winslet), who fall in love aboard the ill-fated ship that met "
    "its infamous end in the North Atlantic on April 15, 1912. The film was a "
    "commercial and critical success, grossing over $2 billion worldwide "
    "and winning eleven Academy Awards, including Best Picture, Best Director, "
    'and Best Original Song. One of the most impressive aspects of "Titanic" '
    "is the film's stunning visual effects and production design. The "
    "detailed recreation of the Titanic and its sinking is both breathtaking "
    "and haunting, capturing the grandeur and tragedy of the ship's fate. The "
    "special effects used to bring the ship to life and simulate the sinking"
    " are still impressive more than two decades later. Another strong point "
    "of the film is the performances of the two leads, DiCaprio and Winslet. "
    "Their chemistry is palpable and their portrayal of two individuals from "
    "different social classes falling in love against all odds is touching and "
    "believable. The supporting cast, including Billy Zane and Gloria Stuart, "
    "also deliver strong performances that add depth to the film's characters"
    '. At its core, "Titanic" is a poignant love story set against the '
    "backdrop of a tragic historical event. The film expertly blends elements "
    "of romance, drama, and action to create an unforgettable cinematic "
    "experience. Despite its lengthy runtime of over three hours, the film is "
    "engaging and emotionally gripping throughout, leaving a lasting "
    'impression on viewers. Overall, "Titanic" is a cinematic masterpiece '
    "that stands the test of time. Cameron's epic film is a must-see for "
    "fans of romance, drama, and historical fiction, and remains a benchmark "
    "for blockbuster filmmaking."
]

metric = get_metric("generic_pairwise/helpfulness")
results = metric.compare(dataset, predictions_1, predictions_2)

print(results)

where results is a dict with the following keys:

  • value: the overall score (i.e., the average) on the dataset

  • no_score: the number of samples that could not be evaluated due to an API access error or an invalid evaluation string

  • sample_values: the score for each sample in the dataset

  • details: the detailed evaluation results for each sample in the dataset, including the evaluation prompt and the textual judgment

The following pairwise metrics are available:

  • generic_pairwise/helpfulness: whether chatbot 1 is more helpful than chatbot 2 (1 for yes, 0 for no)

  • generic_pairwise/relevance: whether chatbot 1 is more relevant than chatbot 2 (1 for yes, 0 for no)

  • generic_pairwise/coherence: whether chatbot 1 is more coherent than chatbot 2 (1 for yes, 0 for no)
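
Since each pairwise sample is scored 1 (chatbot 1 preferred) or 0, the average of the per-sample scores can be read as a win rate for chatbot 1. The sketch below computes it from a list of sample scores. `win_rate` is a hypothetical helper, and treating unscored samples as `None` is an assumption; how chateval actually represents them is not documented here.

```python
def win_rate(sample_values):
    """Fraction of scored samples in which chatbot 1 was preferred.

    Assumption: samples that could not be scored appear as None and are
    skipped. Returns None when nothing was scored.
    """
    scored = [value for value in sample_values if value is not None]
    if not scored:
        return None
    return sum(scored) / len(scored)


print(win_rate([1, 0, 1, 1]))
```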

Ranking Multiple Chatbots with GPTScore

from chateval.metrics import get_metric

dataset = [{"input": "write a movie review of Titanic"}]

predictions_list = [...]  # a list of lists of predictions

metric = get_metric("generic_rank/helpfulness")
results = metric.rank(dataset, predictions_list)

print(results)
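
The layout of the list of lists is not spelled out above. The sketch below assumes the outer list has one entry per chatbot and each inner list is aligned with `dataset`; that orientation is an assumption, so check it against the library before relying on it. `build_predictions_list` is a hypothetical helper that only validates lengths.

```python
def build_predictions_list(predictions_by_system, n_samples):
    """Return (system_names, predictions_list) after checking lengths.

    Assumed layout: predictions_list[i][j] is chatbot i's answer to
    dataset sample j. Hypothetical helper, not part of chateval.
    """
    names = list(predictions_by_system)
    predictions_list = []
    for name in names:
        preds = list(predictions_by_system[name])
        if len(preds) != n_samples:
            raise ValueError(
                f"{name}: expected {n_samples} predictions, got {len(preds)}"
            )
        predictions_list.append(preds)
    return names, predictions_list


names, predictions_list = build_predictions_list(
    {"bot_a": ["review A"], "bot_b": ["review B"], "bot_c": ["review C"]},
    n_samples=1,
)
print(names, predictions_list)
```

Keeping the system names alongside the list makes it easier to map ranking results back to the chatbots that produced them.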

Evaluate with the write_email scenario

from chateval import load

scenario = load("../scenarios/write_email")
predictions = [
    "My name is [name], and I am currently a student in your [class name].",
]

print(scenario.evaluate(predictions))

Meta Evaluation

from chateval import load

scenario = load("metaeval_generic_bool")
metric_model = scenario.get_default_setting_config()["metric_model"]
result = scenario.evaluate(metric_model, "metric")

print(result)

Other Applications

Evaluate Chatbot Debate

from chateval.metrics import get_metric

dataset = [{"input": "Should plastic be banned?"}] # Debate topic
predictions = [
    "Speaker1: Thank you for the opportunity to speak in support of the topic, \"Plastic should be banned.\" Plastic is one of the most commonly used materials in the world, and its use has become so widespread that it has now become a significant environmental issue. While there are undoubtedly benefits to using plastic, such as convenience and affordability, the negative impacts of plastic far outweigh these benefits. Plastic is a non-biodegradable material, meaning that it can take hundreds of years to decompose. This leads to massive plastic waste accumulation in landfills and oceans, which pose a significant threat to wildlife and the environment. Moreover, the production and disposal of plastic have significant negative impacts on human health, leading to air and water pollution and greenhouse gas emissions. Opponents of a plastic ban may argue that alternatives to plastic are expensive or not as effective. However, with advancements in technology, there are now numerous alternatives to plastic, such as biodegradable materials, paper, and glass. Additionally, a plastic ban would encourage innovation and the development of new, sustainable materials. In conclusion, banning plastic is crucial for the protection of the environment and human health. While it may take some effort to transition away from plastic, the benefits of doing so far outweigh the costs.\n\nSpeaker2: Thank you for your statement. While I agree that plastic waste is a significant environmental issue, I do not believe that banning plastic is the solution. There are several reasons for this. Firstly, plastic has become an integral part of our daily lives, and it provides numerous benefits, such as convenience, affordability, and durability. Banning plastic would have many unintended consequences, such as increased costs for businesses and consumers and the need to find efficient alternatives. Secondly, alternatives to plastic are not always as environmentally friendly as they may seem. 
For example, biodegradable materials require specific conditions to degrade, and they may cause more harm than good if not disposed of properly. Moreover, the production of alternative materials may also have significant environmental impacts and require large amounts of resources. Thirdly, rather than banning plastic outright, we should focus on reducing plastic waste by promoting the responsible use and disposal of plastic. We should encourage businesses and individuals to recycle, reuse, and reduce their plastic consumption by implementing proper waste management practices, education and public awareness programs. Finally, a plastic ban would be difficult to enforce, and it may lead to a black market for plastic products, with illegal production and trade of plastic. This would lead to further negative impacts on the environment by contributing to pollution and waste. In conclusion, while I recognize the negative impacts of plastic usage, I do not believe that banning plastic is the best solution. Instead, I believe that we should focus on reducing plastic waste through responsible use and disposal, and promoting the development of more sustainable materials without causing unintended consequences. \n\nSpeaker1: Thank you for your response. While it's true that plastic has become an integral part of our daily lives and provides numerous benefits, the negative impacts of plastic waste on the environment and human health cannot be ignored. Here are my counterpoints to your arguments: Firstly, the convenience, affordability, and durability of plastic do not justify the significant negative impacts on the environment and human health. While banning plastic outright may have unintended consequences, we need to take bold steps to reduce plastic waste and promote the development of sustainable alternatives. 
Secondly, while it's true that alternatives to plastic are not always perfect, it's important to acknowledge that plastic is a non-biodegradable material, and it takes hundreds of years to decompose. This leads to massive plastic waste accumulation in landfills and oceans, which pose a significant threat to wildlife and the environment. While alternatives may require specific conditions to degrade, they are still a better option than plastic in terms of reducing waste and minimizing environmental harm. Thirdly, promoting responsible use and disposal of plastic is undoubtedly essential, but it's not enough to tackle the scale of the plastic waste problem. Education and public awareness programs are a good start, but we also need significant changes in business and manufacturing practices to reduce plastic production and use. Finally, while enforcing a plastic ban may be challenging, it's not impossible. Countries like Rwanda, Kenya, and Bangladesh have successfully implemented plastic bans, and we can learn from their experiences. Black markets for plastic products may arise, but we can also work towards strict regulations and penalties to discourage such practices. In conclusion, while I understand your concerns about banning plastic, we cannot ignore the significant negative impacts of plastic waste on the environment and human health. We need to take bold steps to reduce plastic waste and promote sustainable alternatives, which will require significant changes in business and manufacturing practices, public education and awareness, and strict regulations and penalties. \n\nSpeaker2: Thank you for your counterpoints. While I agree that the negative impacts of plastic waste on the environment and human health cannot be ignored, I still do not believe that an outright ban on plastic is the best solution. 
Here are my counterpoints to your arguments: Firstly, while the convenience, affordability, and durability of plastic do not justify the negative impacts on the environment and human health, it's important to consider the economic and social benefits that plastic provides. Banning plastic outright would have significant economic consequences, such as increased costs for businesses and consumers, and it may also affect the livelihoods of millions of people working in the plastic industry. Secondly, while alternatives to plastic may be better in terms of reducing waste and minimizing environmental harm, they are not always as efficient or affordable as plastic. For example, some biodegradable materials may require specific conditions to degrade, which may not exist in many environments. Thirdly, while promoting responsible use and disposal of plastic may not be enough to tackle the scale of the plastic waste problem, it's still an essential step towards reducing plastic waste. Education and public awareness programs can help individuals make more informed choices and encourage businesses to adopt sustainable practices. Finally, while some countries have successfully implemented plastic bans, it's important to recognize that context matters. Countries like Rwanda, Kenya, and Bangladesh may have different economic and social conditions, which may make a plastic ban more viable. In some cases, a more effective solution may be to focus on reducing plastic waste through recycling and waste management practices. In conclusion, while I recognize the negative impacts of plastic waste, an outright ban on plastic may have significant economic and social consequences. Instead, we need to take a balanced approach that considers the benefits and drawbacks of plastic usage, promotes sustainable alternatives, and encourages responsible use and disposal of plastic. \n\nSpeaker1: Thank you for your counterpoints. 
While I understand your concerns about the economic and social consequences of an outright ban on plastic, I still believe that we need to take bold steps to tackle the plastic waste problem. Here are my counterpoints to your arguments: Firstly, while it's true that banning plastic outright may have economic consequences, we cannot ignore the environmental and health impacts of plastic waste, which may have even more severe economic consequences in the long run. We need to find a way to balance the economic benefits of plastic with its negative impacts on the environment and human health. Secondly, while alternatives to plastic may not always be as efficient or affordable as plastic, we need to invest in research and development to find sustainable alternatives that are both efficient and affordable. The cost of plastic production does not reflect the true cost of its negative impacts on the environment and human health. Thirdly, while promoting responsible use and disposal of plastic is essential, it's not enough to tackle the scale of the plastic waste problem. We need to implement comprehensive waste management practices, which includes recycling and reducing the production and use of single-use plastic products. Finally, while context matters, we need to recognize that the plastic waste problem is a global issue that requires a coordinated and collective effort. Countries like Rwanda, Kenya, and Bangladesh may have different economic and social conditions, but we can still learn from their experiences and apply them to our own context. In conclusion, while I understand your concerns about the economic and social consequences of banning plastic outright, we need to take bold steps to tackle the plastic waste problem. 
We need to balance the economic benefits of plastic with its negative impacts on the environment and human health, invest in research and development to find sustainable alternatives, implement comprehensive waste management practices, and work together as a global community to tackle this issue. \n\nSpeaker2: Thank you for your counterpoints. I agree that we need to take bold steps to tackle the plastic waste problem, but I still believe that an outright ban on plastic is not the most effective solution. Here are my counterpoints to your arguments: Firstly, while the environmental and health impacts of plastic waste may have severe economic consequences, we should also consider the economic benefits of plastic. Plastic provides a range of economic benefits, including employment opportunities and cost savings for consumers and businesses. Instead of an outright ban, we should focus on reducing plastic waste and promoting sustainable alternatives. Secondly, while investing in research and development to find sustainable alternatives is important, we should also consider the practicality of these alternatives. Sustainable alternatives may not always be as efficient or affordable as plastic, and some may require specific conditions to degrade, which may not exist in all environments. Thirdly, while implementing comprehensive waste management practices is essential, reducing the production and use of plastic should also be a priority. This can be achieved through increased public education and awareness, as well as policies and regulations that encourage businesses to adopt sustainable practices. Finally, while working together as a global community is important, we should also recognize that different regions and countries may have different priorities and capacities for addressing the issue of plastic waste. Instead of a one-size-fits-all approach, we should focus on tailored solutions that take into consideration the unique circumstances of each region or country. 
In conclusion, while I share your concerns about the plastic waste problem, I still believe that an outright ban on plastic is not the most effective solution. Instead, we should focus on reducing plastic waste and promoting sustainable alternatives, implementing comprehensive waste management practices, and tailoring solutions to the unique circumstances of each region or country. \n\nSpeaker1: Thank you for your counterpoints. While I appreciate your perspective, I believe that an outright ban on plastic is necessary to address the plastic waste problem. Here are my rebuttals to your points: Firstly, while plastic may provide economic benefits, the negative impacts of plastic on the environment and human health are too severe to ignore. We must prioritize the health and well-being of our planet and its inhabitants over economic benefits. Secondly, while sustainable alternatives may not always be as efficient or affordable as plastic, this should not prevent us from investing in and promoting their development. We must be willing to make sacrifices in the short-term to ensure the long-term health of our planet. Thirdly, while waste management practices are important, they are not enough to address the scale of the plastic waste problem. We must reduce plastic production and use, and an outright ban on plastic is the most effective way to achieve this. Finally, while tailored solutions may be necessary, an outright ban on plastic can serve as a powerful example for other regions and countries to follow. It can inspire a global shift towards more sustainable practices and help to galvanize support for stronger environmental policies and regulations. In conclusion, while there may be economic and practical challenges associated with an outright ban on plastic, I believe that the benefits of such a ban outweigh the costs. 
We must prioritize the health and well-being of our planet and its inhabitants over economic benefits, invest in sustainable alternatives, reduce plastic production and use, and inspire a global shift towards more sustainable practices. \n\nSpeaker2: Thank you for your counterpoints. I understand the severity of the plastic waste problem, but I still believe that an outright ban on plastic may not be the most effective solution. Here are my counterpoints to your rebuttals: Firstly, while the negative impacts of plastic on the environment and human health are severe, we should also consider the economic consequences of an outright ban on plastic. It could lead to job losses and higher costs for products that rely on plastic packaging, which could disproportionately affect low-income communities. We need to find a way to balance the economic benefits of plastic with its negative impacts on the environment and human health. Secondly, while investing in and promoting sustainable alternatives is important, we need to recognize that some alternatives may not be able to fully replace the functions of plastic in all industries. It may be more effective to focus on reducing plastic waste and promoting sustainable practices in industries where alternatives are not yet feasible. Thirdly, while reducing plastic production and use is important, an outright ban may not be the most effective way to achieve this. We need to focus on educating the public on responsible plastic use and promoting recycling and waste management practices. We can also encourage businesses to adopt sustainable practices through regulations and incentives. Finally, while setting an example for other regions and countries to follow is important, we should recognize that the plastic waste problem is complex and may require tailored solutions based on local circumstances. An outright ban may not be feasible or effective in all regions and countries. 
In conclusion, while I share your concern about the plastic waste problem, I believe we need to take a more balanced approach that takes into consideration the economic, practical, and regional aspects of the issue. We need to find a way to balance the economic benefits of plastic with its negative impacts on the environment and human health, focus on reducing plastic waste and promoting sustainable practices where alternatives are not yet feasible, educate the public on responsible plastic use, and tailor solutions to local circumstances."  # noqa
]

# Use Debater as keywords
predictions[0] = (
    predictions[0]
    .replace("Speaker1: ", "Debater 1: ")
    .replace("Speaker2: ", "Debater 2: ")
)

metric = get_metric("debate_overall/relevance")
result = metric.compute(dataset, predictions)
print(result["details"][0]["judgment"])
print(result["value"])

metric = get_metric("debate_overall/persuasiveness")
result = metric.compute(dataset, predictions)
print(result["details"][0]["judgment"])

metric = get_metric("debate_overall/responsiveness")
result = metric.compute(dataset, predictions)
print(result["details"][0]["judgment"])

metric = get_metric("debate_overall/coherence")
result = metric.compute(dataset, predictions)
print(result["details"][0]["judgment"])
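
The debate example passes the whole debate as a single prediction string, with each turn prefixed by `Debater N: ` and turns separated by a blank line. When the turns are available as structured data, a transcript in that shape can be assembled directly. The sketch below shows one way; `format_debate` is a hypothetical helper, and the turn texts are shortened placeholders.

```python
def format_debate(turns, label="Debater"):
    """Join (speaker_number, text) turns into one transcript string.

    Mirrors the format used in the example: 'Debater N: <text>' with a
    blank line between turns. Hypothetical helper, not part of chateval.
    """
    return "\n\n".join(
        f"{label} {speaker}: {text.strip()}" for speaker, text in turns
    )


turns = [
    (1, "Plastic should be banned."),
    (2, "A ban is not the solution."),
]
predictions = [format_debate(turns)]
print(predictions[0])
```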
