Evaluation and benchmark for Generative AI
Project description
GenAIEval
Evaluation, benchmark, and scorecard suite targeting performance (throughput and latency), accuracy on popular evaluation harnesses, safety, and hallucination.
Installation
- Install from PyPI
pip install -r requirements.txt
pip install opea-eval
Note: install the dependencies from requirements.txt first, because PyPI does not allow a package to declare a direct dependency on a specific Git commit.
- Build from Source
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
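After either install path, a quick sanity check is to import the entry points used in the examples below (a minimal sketch; the import path matches the function call usage later in this page):

# Verify the package imports correctly after installation.
from evals.evaluation.lm_evaluation_harness import LMEvalParser, evaluate
print("opea-eval import OK:", LMEvalParser.__name__, evaluate.__name__)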
Evaluation
lm-evaluation-harness
For evaluating models on text-generation tasks, we follow lm-evaluation-harness and provide both command line usage and function call usage. Over 60 standard academic benchmarks for LLMs are implemented, with hundreds of subtasks and variants, such as ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, and so on.
command line usage
Gaudi2
# pip install --upgrade-strategy eager optimum[habana]
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
--model gaudi-hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device hpu \
--batch_size 8
CPU
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
--model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cpu \
--batch_size 8
function call usage
from evals.evaluation.lm_evaluation_harness import LMEvalParser, evaluate

# user_model and tokenizer are a model and tokenizer you have already
# loaded (e.g. with Hugging Face transformers); see the sketch below.
args = LMEvalParser(
    model="hf",
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="hellaswag",
    device="cpu",
    batch_size=8,
)
results = evaluate(args)
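For completeness, here is a minimal sketch of preparing user_model and tokenizer with Hugging Face transformers before calling evaluate; the checkpoint name simply mirrors the command line examples above:

# Load the model and tokenizer referenced by the snippet above.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "EleutherAI/gpt-j-6B"
user_model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)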
remote service usage
- Set up a separate server with GenAIComps
# build cpu docker
docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
# start the server
docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
- Evaluate the model: set base_url, tokenizer, and --model genai-hf
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
--model genai-hf \
--model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
--tasks "lambada_openai" \
--batch_size 2
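The same remote evaluation can also be expressed through the function call API. This is a sketch under the assumption that LMEvalParser accepts the same model and model_args values as the command line:

# Evaluate against the remote lm-eval service started above.
from evals.evaluation.lm_evaluation_harness import LMEvalParser, evaluate

args = LMEvalParser(
    model="genai-hf",
    model_args="base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3",
    tasks="lambada_openai",
    batch_size=2,
)
results = evaluate(args)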
bigcode-evaluation-harness
For evaluating models on coding tasks, or coding LLMs specifically, we follow bigcode-evaluation-harness and provide both command line usage and function call usage. HumanEval, HumanEval+, InstructHumanEval, APPS, MBPP, MBPP+, and DS-1000 are available in both completion (left-to-right) and insertion (FIM) modes.
command line usage
cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py \
--model "codeparrot/codeparrot-small" \
--tasks "humaneval" \
--n_samples 100 \
--batch_size 10 \
--allow_code_execution
function call usage
from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate

# user_model and tokenizer are a pre-loaded code model and tokenizer;
# allow_code_execution is required because generated code is run for scoring.
args = BigcodeEvalParser(
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="humaneval",
    n_samples=100,
    batch_size=10,
    allow_code_execution=True,
)
results = evaluate(args)
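As above, a minimal sketch of preparing user_model and tokenizer, reusing the checkpoint from the command line example:

# Load the code model and tokenizer referenced by the snippet above.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "codeparrot/codeparrot-small"
user_model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)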
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
opea_eval-0.6.tar.gz

Built Distribution
opea_eval-0.6-py3-none-any.whl
File details
Details for the file opea_eval-0.6.tar.gz.
File metadata
- Download URL: opea_eval-0.6.tar.gz
- Upload date:
- Size: 40.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7273417401804037a46e1f12003e51963169e8c074293c9601399983b1421a75
MD5 | be435fe402fb37bfce120146f2352f69
BLAKE2b-256 | a4c25167bfde18047cbbf7a1d91e39b5d08f4b45697d1f269b5d66f28f6313ad
File details
Details for the file opea_eval-0.6-py3-none-any.whl.
File metadata
- Download URL: opea_eval-0.6-py3-none-any.whl
- Upload date:
- Size: 42.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 32fc493880cf0a7045e28f4e60791f0fb4e64c8992d39fc91f311d12ccc04cdd
MD5 | e3b51be0d3e6cf3fbbaf5ba08bc4a191
BLAKE2b-256 | b8180b38ca99dab6f457b0e09d823083b7e93a785417d0007935b19121faaac5