Evaluation and benchmark for Generative AI

GenAIEval

Evaluation, benchmarking, and scorecards for generative AI, targeting performance (throughput and latency), accuracy on popular evaluation harnesses, safety, and hallucination.

Installation

  • Install from PyPI
pip install -r requirements.txt
pip install opea-eval

Note: install requirements.txt first, because a package published on PyPI cannot declare a direct dependency on a specific commit.

  • Build from Source
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
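
After either install path, a quick check that the package is visible to the interpreter can save a confusing import error later. The sketch below uses only the standard library. The helper name installed_version is hypothetical, and the distribution name opea-eval is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "opea-eval" is the name used in `pip install opea-eval`.
found = installed_version("opea-eval")
print(f"opea-eval: {found or 'not installed'}")
```

If this prints "not installed", the pip command most likely ran in a different environment from the interpreter you are using.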

Evaluation

lm-evaluation-harness

To evaluate models on text-generation tasks, we follow lm-evaluation-harness and provide both command-line and function-call usage. It implements over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants, such as ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K.

command line usage

Gaudi2
# pip install --upgrade-strategy eager optimum[habana]
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
    --model gaudi-hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device hpu \
    --batch_size 8
CPU
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cpu \
    --batch_size 8
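
The Gaudi2 and CPU invocations above differ only in the --model and --device values. As a minimal sketch, the same command can be assembled from Python when scripting several runs. The helper build_lm_eval_command and the PRESETS table are hypothetical names introduced here; the flags and values are the ones shown above.

```python
import shlex

# Backend-specific values, copied from the two commands above.
PRESETS = {
    "gaudi2": {"model": "gaudi-hf", "device": "hpu"},
    "cpu": {"model": "hf", "device": "cpu"},
}


def build_lm_eval_command(backend, pretrained, tasks, batch_size):
    """Assemble the argument list for examples/main.py for one backend."""
    preset = PRESETS[backend]
    return [
        "python", "main.py",
        "--model", preset["model"],
        "--model_args", f"pretrained={pretrained}",
        "--tasks", tasks,
        "--device", preset["device"],
        "--batch_size", str(batch_size),
    ]


cmd = build_lm_eval_command("cpu", "EleutherAI/gpt-j-6B", "hellaswag", 8)
print(shlex.join(cmd))
# To execute it, run from evals/evaluation/lm_evaluation_harness/examples:
#   subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if a model name or task later contains special characters.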

function call usage

from evals.evaluation.lm_evaluation_harness import LMEvalParser, evaluate

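# user_model and tokenizer are a model and its tokenizer that you have
# already loaded; they are not defined in this snippet.
args = LMEvalParser(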
    model="hf",
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="hellaswag",
    device="cpu",
    batch_size=8,
)
results = evaluate(args)
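
The function-call example expects user_model and tokenizer to exist already. One way to obtain them, assuming the "hf" backend accepts a Hugging Face transformers model object (an assumption, not something this README states), is sketched below. The helper name load_model_and_tokenizer is hypothetical; the transformers calls are the library's standard loaders.

```python
def load_model_and_tokenizer(model_name: str):
    """Load a causal LM and its tokenizer from the Hugging Face Hub."""
    # Imported lazily so that defining this helper does not require
    # transformers to be installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    user_model = AutoModelForCausalLM.from_pretrained(model_name)
    user_model.eval()  # inference mode: disables dropout
    return user_model, tokenizer


# Example (downloads the weights on first use):
#   user_model, tokenizer = load_model_and_tokenizer("EleutherAI/gpt-j-6B")
```

The model name in the comment is the one used in the command-line examples; any causal LM checkpoint should work the same way under the assumption above.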

remote service usage

  1. Set up a separate server with GenAIComps
# build cpu docker
docker build -f Dockerfile.cpu -t opea/lm-eval:latest .

# start the server
docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
  2. Evaluate the model
  • Set base_url and tokenizer in --model_args, and use --model genai-hf
cd evals/evaluation/lm_evaluation_harness/examples

python main.py \
    --model genai-hf \
    --model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
    --tasks "lambada_openai" \
    --batch_size 2
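
Before launching a remote evaluation, it can help to confirm that the server port is reachable and to build the --model_args string programmatically. The following is a minimal sketch using only the standard library. Both helper names are hypothetical; the base_url and tokenizer keys and port 9006 come from the commands above.

```python
import socket


def build_remote_model_args(host: str, port: int, tokenizer: str) -> str:
    """Format the --model_args value used with --model genai-hf."""
    return f"base_url=http://{host}:{port},tokenizer={tokenizer}"


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Replace 127.0.0.1 with the address of the machine running the container.
host = "127.0.0.1"
if not port_is_open(host, 9006):
    print(f"Nothing is listening on {host}:9006; is the container running?")
print(build_remote_model_args(host, 9006, "Intel/neural-chat-7b-v3-3"))
```

An open port only shows that something is listening; it does not prove the model has finished loading inside the container.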

bigcode-evaluation-harness

To evaluate models on coding tasks, or code LLMs specifically, we follow bigcode-evaluation-harness and provide both command-line and function-call usage. HumanEval, HumanEval+, InstructHumanEval, APPS, MBPP, MBPP+, and DS-1000 are available in both completion (left-to-right) and insertion (FIM) modes.

command line usage

cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py \
    --model "codeparrot/codeparrot-small" \
    --tasks "humaneval" \
    --n_samples 100 \
    --batch_size 10 \
    --allow_code_execution
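
The command above draws 100 samples per problem. Benchmarks such as HumanEval are commonly scored with pass@k, the probability that at least one of k samples passes the unit tests, and a larger n_samples gives a lower-variance estimate. This README does not state which metric main.py reports, so treat the sketch below as an illustration of the standard unbiased pass@k estimator, not as this package's implementation. The function name pass_at_k is hypothetical.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# With 100 samples and 50 correct, a single draw passes half the time.
print(pass_at_k(n=100, c=50, k=1))
```

For a whole benchmark, the per-problem estimates are averaged over all problems.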

function call usage

from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate

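# user_model and tokenizer are a model and its tokenizer that you have
# already loaded; they are not defined in this snippet.
args = BigcodeEvalParser(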
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="humaneval",
    n_samples=100,
    batch_size=10,
    allow_code_execution=True,
)
results = evaluate(args)
