
GenAIEval

Evaluation, benchmark, and scorecard targeting performance (throughput and latency), accuracy on popular evaluation harnesses, safety, and hallucination

Installation

  • Install from PyPI
pip install -r requirements.txt
pip install opea-eval

Note: install the dependencies from requirements.txt first, because PyPI packages cannot declare direct dependencies on a specific git commit.

  • Build from Source
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .

Evaluation

lm-evaluation-harness

For evaluating models on text-generation tasks, we follow lm-evaluation-harness and provide both command-line and function-call usage. It implements over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants, including ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K.

command line usage

Gaudi2
# pip install --upgrade-strategy eager optimum[habana]
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
    --model gaudi-hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device hpu \
    --batch_size 8
CPU
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cpu \
    --batch_size 8

function call usage

from evals.evaluation.lm_evaluation_harness import LMEvalParser, evaluate

# user_model and tokenizer are a model and tokenizer you have already loaded
# (see the sketch after this example)
args = LMEvalParser(
    model="hf",
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="hellaswag",
    device="cpu",
    batch_size=8,
)
results = evaluate(args)
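
For reference, a minimal sketch of how user_model and tokenizer might be prepared with Hugging Face transformers; the checkpoint name is illustrative, echoing the command-line examples above:

from transformers import AutoModelForCausalLM, AutoTokenizer

# load a causal LM and its tokenizer; any Hugging Face checkpoint works
user_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")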

remote service usage

  1. set up a separate server with GenAIComps
# build cpu docker
docker build -f Dockerfile.cpu -t opea/lm-eval:latest .

# start the server
docker run -p 9006:9006 --ipc=host  -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
  2. evaluate the model
  • set base_url, tokenizer and --model genai-hf
cd evals/evaluation/lm_evaluation_harness/examples

python main.py \
    --model genai-hf \
    --model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
    --tasks  "lambada_openai" \
    --batch_size 2
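
Before kicking off the evaluation, it can be worth confirming the remote service is reachable; a minimal sketch (the host value is a placeholder for {your_ip}):

import socket

# port 9006 matches the docker run command above; replace host with your server's IP
host, port = "192.168.1.10", 9006
with socket.create_connection((host, port), timeout=5):
    print(f"lm-eval service reachable at {host}:{port}")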

bigcode-evaluation-harness

For evaluating models on coding tasks, or coding LLMs specifically, we follow bigcode-evaluation-harness and provide both command-line and function-call usage. HumanEval, HumanEval+, InstructHumanEval, APPS, MBPP, MBPP+, and DS-1000 are available, in both completion (left-to-right) and insertion (FIM) modes.

command line usage

cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py \
    --model "codeparrot/codeparrot-small" \
    --tasks "humaneval" \
    --n_samples 100 \
    --batch_size 10 \
    --allow_code_execution

function call usage

from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate

args = BigcodeEvalParser(
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="humaneval",
    n_samples=100,
    batch_size=10,
    allow_code_execution=True,
)
results = evaluate(args)
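
The n_samples argument drives the pass@k metric these benchmarks report. For intuition, a minimal sketch of the unbiased pass@k estimator introduced with HumanEval (the helper function is ours, not part of the harness API):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n generated samples per task, c of them correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k-subset contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 samples per task with 20 passing gives pass@10 ≈ 0.90
print(pass_at_k(100, 20, 10))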

Kubernetes platform optimization

Node resource management helps optimize AI container performance and isolation on Kubernetes nodes. See Platform optimization.
