Evaluation Framework for Chatbots in Generative AI

chateval

Evaluation Framework for Chatbots in Generative AI

Install and Precommit

pip install -e .
pre-commit install

Run formatting

git init
git add .
pre-commit run

Run the unit tests in a specific file

export OPENAI_API_KEY=XXXX.YYYY.ZZZ
python -m unittest integration_tests.gptscore_test
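
Because the integration tests need `OPENAI_API_KEY`, it can help to skip them cleanly when the key is missing rather than let them fail on an API error. The following is a minimal sketch using the standard library's `unittest.skipUnless`; the helper `has_api_key` and the class `ExampleIntegrationTest` are hypothetical and are not part of chateval's test suite.

```python
import os
import unittest


def has_api_key(env=None):
    """Return True if OPENAI_API_KEY is set to a non-blank value.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


# Hypothetical test class: skipped entirely when no key is configured.
@unittest.skipUnless(has_api_key(), "OPENAI_API_KEY is not set")
class ExampleIntegrationTest(unittest.TestCase):
    def test_key_is_present(self):
        # A real integration test would call the metric here.
        self.assertTrue(has_api_key())
```

With this guard, `python -m unittest` reports the class as skipped instead of erroring when the variable is not exported.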

Usage

### Evaluate with GPTScore

export OPENAI_API_KEY=XXXX.YYYY.ZZZ

```python
from chateval.metrics import get_metric

dataset = [{"input": "write a movie review of Titanic"}]
predictions = [
    'James Cameron\'s 1997 epic romantic disaster film "Titanic" tells the '
]

metric = get_metric("generic_likert/helpfulness")
results = metric.compute(dataset, predictions)

print(results)

```
where `results` is a `dict` with the following keys:
* `value`: the overall score on the dataset (i.e., the average)
* `no_score`: the number of samples that could not be evaluated, because of either an API access error or an invalid evaluation string
* `sample_values`: the score for each sample in the dataset
* `details`: the detailed evaluation results for each sample in the dataset, including the evaluation prompt and the textual judgment

Here is an example of the output for the case above:

```python
{
    'value': 1.0,
    'no_score': 0,
    'sample_values': [1.0],
    'details': [{
        'prompt': 'You are evaluating a response that has been submitted for a particular task, using a specific set of standards. Below is the data:\n[BEGIN DATA]\n***\n[Task]: write a movie review of Titanic\n***\n[Submission]: James Cameron\'s 1997 epic romantic disaster film "Titanic" tells the \n***\n[Criterion]: \n1:Not helpful - The generated text is completely irrelevant, unclear, or incomplete. It does not provide any useful information to the user.\n2:Somewhat helpful - The generated text has some relevance to the user\'s question, but it may be unclear or incomplete. It provides only partial information, or the information provided may not be useful for the user\'s needs.\n3:Moderately helpful - The generated text is relevant to the user\'s question, and it provides a clear and complete answer. However, it may lack detail or explanation that would be helpful for the user.\n4:Helpful - The generated text is highly relevant to the user\'s question, and it provides a clear, complete, and detailed answer. It offers additional information or explanations that are useful for the user.\n5:Highly helpful - The generated text is highly relevant to the user\'s question, and it provides a clear, complete, and detailed answer. It offers additional information or explanations that are not only useful but also insightful and valuable to the user.\n***\n[END DATA]\nDoes the submission meet the criterion? First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print the choice only from 1, 2, 3, 4, 5 (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the selected choice again by itself on a new line.\nReasoning:',
        'judgment': '1. The task is to write a movie review of Titanic.\n2. The submission only provides the title and director of the movie, but does not offer any review or analysis of the film.\n3. Therefore, the submission is not helpful and does not meet the criterion.\nChoice: 1\n\n1'
    }]
}
```
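
The `judgment` text in the example ends with the selected choice repeated on its own line, as the prompt requests. The sketch below shows one way such a judgment could be turned into a numeric score, and how an unparseable judgment would be counted under `no_score`. The helpers `parse_likert_choice` and `aggregate` are hypothetical and not part of chateval. Averaging only over the samples that received a score is also an assumption, since this README does not say how the library treats unscored samples.

```python
from typing import List, Optional, Sequence


def parse_likert_choice(
    judgment: str, choices: Sequence[str] = ("1", "2", "3", "4", "5")
) -> Optional[int]:
    """Return the choice found on the last non-empty line, or None if invalid."""
    lines = [line.strip() for line in judgment.splitlines() if line.strip()]
    if not lines:
        return None
    last = lines[-1]
    return int(last) if last in choices else None


def aggregate(judgments: List[str]) -> dict:
    """Build a result dict shaped like the one described above.

    Assumption: `value` is the mean over samples that could be scored.
    """
    scores = [parse_likert_choice(j) for j in judgments]
    valid = [s for s in scores if s is not None]
    return {
        "value": sum(valid) / len(valid) if valid else None,
        "no_score": len(scores) - len(valid),
        "sample_values": scores,
    }


example = aggregate(["Reasoning steps.\nChoice: 1\n\n1", "I cannot decide."])
print(example)
```

Here the first judgment parses to `1`, while the second has no valid choice on its last line and is counted under `no_score`.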


 


### Evaluate on the `write_email` scenario
```python
from chateval import load

scenario = load("../scenarios/write_email")
predictions = [
    "My name is [name], and I am currently a student in your [class name].",
]

print(scenario.evaluate(predictions))
```



### Meta Evaluation

```python
from chateval import load

scenario = load("metaeval_helpfulness")
metric_model = scenario.get_default_setting_config()["metric_model"]
result = scenario.evaluate(metric_model, "metric")

print(result)
```
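
Meta-evaluation generally asks how well an automatic metric agrees with human judgments. This README does not state which agreement statistic chateval reports, so the following is only an illustrative sketch using the Pearson correlation. The function `pearson` is a hypothetical helper, and the scores are made-up toy values, not results from the library.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only (not real evaluation data).
metric_scores = [1.0, 2.0, 4.0, 5.0]
human_scores = [1.0, 3.0, 4.0, 5.0]
print(round(pearson(metric_scores, human_scores), 3))
```

A value close to 1 would indicate that the metric ranks and scales responses much as the human annotators do.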

Download files

Source Distribution

chateval-0.0.8.tar.gz (41.8 kB, source)

Built Distribution

chateval-0.0.8-py2.py3-none-any.whl (54.3 kB, Python 2 and Python 3)

