
Evaluation and benchmark for Generative AI


GenAIEval

Evaluation, benchmark, and scorecard targeting performance (throughput and latency), accuracy on popular evaluation harnesses, safety, and hallucination

Installation

  • Install from PyPI
pip install -r requirements.txt
pip install opea-eval

Note: install the dependencies from requirements.txt first, because a package published on PyPI cannot declare direct dependencies pinned to a specific Git commit.

  • Build from Source
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .

Evaluation

lm-evaluation-harness

For evaluating models on text-generation tasks, we follow lm-evaluation-harness and provide both command-line and function-call usage. Over 60 standard academic benchmarks for LLMs are implemented, with hundreds of subtasks and variants, such as ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K.

command line usage

Gaudi2
# pip install --upgrade-strategy eager optimum[habana]
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
    --model gaudi-hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device hpu \
    --batch_size 8

CPU
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cpu \
    --batch_size 8

function call usage

from evals.evaluation.lm_evaluation_harness import LMEvalParser, evaluate

# user_model and tokenizer are a Hugging Face model and tokenizer loaded
# beforehand (see the loading sketch below).
args = LMEvalParser(
    model="hf",
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="hellaswag",
    device="cpu",
    batch_size=8,
)
results = evaluate(args)
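
A minimal sketch of preparing user_model and tokenizer with transformers before calling evaluate; the gpt-j-6B checkpoint simply mirrors the command-line example, and any causal-LM checkpoint can be substituted:

# Illustrative setup only: load the Hugging Face model and tokenizer that are
# passed to LMEvalParser as user_model and tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"
user_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)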

remote service usage

  1. set up a separate server with GenAIComps

    # build cpu docker
    docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
    
    # start the server
    docker run -p 9006:9006 --ipc=host  -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
    
  2. evaluate the model

    • set base_url, tokenizer and --model genai-hf

      cd evals/evaluation/lm_evaluation_harness/examples
      
      python main.py \
          --model genai-hf \
          --model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
          --tasks  "lambada_openai" \
          --batch_size 2
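
The same remote service can also be driven through the function-call API. A minimal sketch, assuming LMEvalParser accepts the same model and model_args values as the command line above (keep {your_ip} as the placeholder for the server address):

# Sketch only: evaluate against the remote lm-eval service started above,
# assuming the function-call path mirrors the CLI arguments.
from evals.evaluation.lm_evaluation_harness import LMEvalParser, evaluate

args = LMEvalParser(
    model="genai-hf",
    model_args="base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3",
    tasks="lambada_openai",
    batch_size=2,
)
results = evaluate(args)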
      

bigcode-evaluation-harness

For evaluating models on coding tasks, or coding LLMs specifically, we follow bigcode-evaluation-harness and provide both command-line and function-call usage. HumanEval, HumanEval+, InstructHumanEval, APPS, MBPP, MBPP+, and DS-1000 are available, in both completion (left-to-right) and insertion (FIM) modes.

command line usage

cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py \
    --model "codeparrot/codeparrot-small" \
    --tasks "humaneval" \
    --n_samples 100 \
    --batch_size 10 \
    --allow_code_execution

function call usage

from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate

# user_model and tokenizer are a Hugging Face model and tokenizer loaded
# beforehand (see the loading sketch below).
args = BigcodeEvalParser(
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="humaneval",
    n_samples=100,
    batch_size=10,
    allow_code_execution=True,
)
results = evaluate(args)
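
As above, a minimal loading sketch for user_model and tokenizer, illustrative only and using the checkpoint from the command-line example:

# Illustrative setup only: load the model/tokenizer passed to BigcodeEvalParser.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "codeparrot/codeparrot-small"
user_model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)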

Kubernetes platform optimization

Node resource management helps optimize AI container performance and isolation on Kubernetes nodes. See Platform optimization.

Benchmark

We provide an OPEA microservice benchmarking tool designed for microservice performance testing and benchmarking. It allows you to define test cases for various services in YAML configurations, run load tests with stresscli (built on top of Locust), and analyze the results for performance insights.

Features

  • Service load testing: Simulates high concurrency levels to test services such as LLM, reranking, ASR, E2E pipelines, and more.
  • YAML-based configuration: Easily define test cases, service endpoints, and parameters.
  • Service metrics collection: Optionally collect service metrics to analyze performance bottlenecks.
  • Flexible testing: Supports a variety of tests like chatqna, codegen, codetrans, faqgen, audioqna, and visualqna.
  • Data analysis and visualization: Visualize test results to uncover performance trends and bottlenecks.

How to use

Define Test Cases: Configure your tests in the benchmark.yaml file.

Increase File Descriptor Limit (if running large-scale tests):

ulimit -n 100000

This ensures the system can handle high concurrency by allowing more open files and connections.

Run the benchmark script:

python evals/benchmark/benchmark.py

Results will be saved in the directory specified by test_output_dir in the configuration.

For more details on configuring test cases, refer to the README.
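
Because stresscli builds on Locust, each test case ultimately maps onto Locust's user/task model. The sketch below shows that pattern in a standalone locustfile; it is illustrative only, the endpoint path and payload are hypothetical, and real tests are defined through benchmark.yaml rather than a hand-written locustfile:

# Illustrative locustfile: simulate concurrent users hitting a chatqna-style
# endpoint. The path and payload are placeholders, not the project's schema.
from locust import HttpUser, task, between

class ChatQnAUser(HttpUser):
    wait_time = between(1, 2)  # pause 1-2 seconds between requests per user

    @task
    def query(self):
        self.client.post(
            "/v1/chatqna",
            json={"messages": "What is OPEA?", "max_tokens": 128},
        )

It could be run with, for example, locust -f locustfile.py --host http://{service_ip}:{port} --users 100 --spawn-rate 10 --headless to simulate 100 concurrent users.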

Grafana Dashboards

Prometheus metrics collected during the tests can be used to create Grafana dashboards for visualizing performance trends and monitoring bottlenecks. For more information, refer to the Grafana README.

TGI microservice dashboard
