Evaluate the capabilities of open-source LLMs in agent tasks, formatted output, instruction following, long-context retrieval, multilingual support, coding, math, and custom tasks.

Project description

🚀Open-source LLMs benchmark

Evaluate the capabilities of open-source LLMs in agent, tool calling, formatted output, long context retrieval, multilingual support, coding, mathematics, and custom tasks.

Example

🤖Agent Task

The ReAct Agent can access 5 functions. There are 10 questions to be solved, 4 of which are simple questions that can be solved using a single function, and 6 of which are complicated questions that require the agent to use multiple steps to solve.

The score ranges from 1 to 5, with 5 representing complete correctness. Here is a screenshot taken while running the evaluation.
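To make the agent task more concrete, here is a minimal sketch of a ReAct-style loop. It is not the benchmark's actual implementation: the two tools, the `Action: name[input]` / `Final Answer:` text format, and the names `run_react` and `scripted_llm` are all assumptions chosen for illustration, and a scripted stand-in replaces the real model.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical tools; the benchmark's own five functions are not shown here.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(float(x) for x in arg.split(","))),
    "upper": lambda arg: arg.upper(),
}

# Assumed text protocol: "Action: tool[input]" and "Final Answer: ...".
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)")


def run_react(llm: Callable[[str], str], question: str, max_steps: int = 5) -> Optional[str]:
    """Alternate model replies and tool observations until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(reply)
        if action is None:
            return None  # the model did not follow the expected pattern
        name, arg = action.groups()
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer


def scripted_llm(transcript: str) -> str:
    """Stand-in for a real model: calls one tool, then answers."""
    if "Observation:" not in transcript:
        return "Thought: I need to add the numbers.\nAction: add[2,3]"
    return "Thought: I have the result.\nFinal Answer: 5.0"


answer = run_react(scripted_llm, "What is 2 + 3?")
```

A simple question in this sketch ends after one tool call; a complicated one would go through several Action/Observation rounds before the final answer.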

🧐Retrieval Task

Insert a needle (the answer) into a haystack (a long context) and ask the model a question whose answer must be retrieved from that context.
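The needle-in-a-haystack setup can be sketched roughly as follows. This is an illustrative sketch only: `build_haystack`, `is_retrieved`, the filler text, and the substring-based check are assumptions and may differ from how the benchmark builds and scores its retrieval data.

```python
def build_haystack(filler: str, needle: str, depth: float, target_chars: int) -> str:
    """Repeat filler sentences up to about target_chars and insert the needle.

    depth is the relative insert position: 0.0 puts the needle first, 1.0 last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [s.strip() for s in filler.split(".") if s.strip()]
    if not sentences:
        raise ValueError("filler must contain at least one sentence")
    haystack = []
    length = 0
    i = 0
    while length < target_chars:
        sentence = sentences[i % len(sentences)] + "."
        haystack.append(sentence)
        length += len(sentence) + 1  # +1 for the joining space
        i += 1
    haystack.insert(round(depth * len(haystack)), needle)
    return " ".join(haystack)


def is_retrieved(model_answer: str, expected: str) -> bool:
    """Lenient check: the expected fact appears somewhere in the answer."""
    return expected.lower() in model_answer.lower()


context = build_haystack(
    filler="The sky was grey. The train left on time. Nobody spoke.",
    needle="The secret code is 7421.",
    depth=0.5,
    target_chars=2000,
)
prompt = f"{context}\n\nQuestion: What is the secret code?\nAnswer:"
```

Varying `depth` and `target_chars` shows whether retrieval quality depends on where the fact sits and how long the context is.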

🗣️Formatted output Task

Evaluate the model's ability to respond in a specified format, such as JSON, Number, Python, etc.
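One way such format checks could be written with the Python standard library is sketched below. The function names and the exact acceptance rules (for example, requiring a JSON object or array and nothing else in the reply) are assumptions for illustration, not the benchmark's own checkers.

```python
import ast
import json
import re


def check_json(output: str) -> bool:
    """True if the whole reply parses as a JSON object or array."""
    try:
        parsed = json.loads(output.strip())
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, (dict, list))


def check_single_number(output: str) -> bool:
    """True if the reply is exactly one integer or decimal number."""
    return re.fullmatch(r"[+-]?\d+(\.\d+)?", output.strip()) is not None


def check_python(output: str) -> bool:
    """True if the reply is syntactically valid Python (it is not executed)."""
    try:
        ast.parse(output)
    except SyntaxError:
        return False
    return True


results = {
    "json": check_json('{"name": "Alice", "age": 30}'),
    "number": check_single_number("42"),
    "python": check_python("def add(a, b):\n    return a + b\n"),
}
```

These checks are deliberately strict: a reply such as "Sure! Here is the JSON: {...}" fails, because extra prose around the payload is exactly what a format-following test is meant to catch.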

Benchmark Evaluation

Supported:

  • 🤖Agent, evaluate whether the model can accurately select tools or functions to invoke and follow the ReAct pattern to solve problems.
  • 🗣️Formatted output, evaluate whether the model can output content in required formats such as JSON, Single Number, Code Block, etc.
  • 🧐Long context retrieval, capability to retrieve the correct fact from a long context.

Plan:

  • 🇺🇸🇨🇳Multilingual, capability to understand and respond in different languages.
  • ⌨️Coding, capability to solve complicated problems with code.
  • Mathematics, capability to solve mathematical problems with or without a code interpreter.
  • 😀Custom Task, easily define and evaluate any specific task you care about.

Install

Install from PyPI:

pip install open_llm_benchmark

Install from the GitHub repo:

git clone git@github.com:EvilPsyCHo/Open-LLM-Benchmark.git
cd Open-LLM-Benchmark
python setup.py install

Supported Backends

  • Huggingface transformers
  • llama-cpp-python
  • vLLM
  • OpenAI

Contribute

Feel free to contribute to this project!

  • More backends, such as Anthropic, ollama, etc.
  • More tasks.
  • More evaluation data.
  • Visualization of the evaluation results.
  • etc.
