
ParsBench

ParsBench provides toolkits for benchmarking Large Language Models (LLMs) on Persian-language tasks. It includes a variety of tasks for evaluating LLMs on different topics, benchmarking tools to compare multiple models and rank them, and an easy, fully customizable API for developers to create custom models, tasks, scores, and benchmarks.

Key Features

  • Variety of Tasks: Evaluate LLMs across various topics.
  • Benchmarking Tools: Compare and rank multiple models.
  • Customizable API: Create custom models, tasks, scores, and benchmarks with ease.

Motivation

I was trying to fine-tune an open-source LLM for the Persian language and needed a way to evaluate its performance and utility. That led me to research the field and find this paper. It's great work: the authors prepared datasets and evaluation methods for testing ChatGPT, and they even shared their code in this repository.

So I thought I should build a handy framework that includes various tasks and datasets for evaluating LLMs on the Persian language. I used some parts of their work (datasets, metrics, basic prompt templates) in this library.

Installation

First, install the Math Equivalence package manually (it is not published on PyPI):

pip install git+https://github.com/hendrycks/math.git

Install ParsBench using pip:

pip install parsbench
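
To verify the installation, a quick import check is enough (both classes below are the ones used in the usage examples that follow):

# Sanity check: these imports should succeed after installation.
from parsbench.models import OpenAIModel
from parsbench.tasks import PersianMath

print("ParsBench imported successfully")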

Usage

Evaluating a PreTrained Model

Load the pre-trained model and tokenizer from the Hugging Face Hub, then evaluate the model on the PersianMath task:

from transformers import AutoModelForCausalLM, AutoTokenizer

from parsbench.models import PreTrainedTransformerModel
from parsbench.tasks import PersianMath

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")

# Wrap the model and tokenizer so ParsBench tasks can drive them.
tf_model = PreTrainedTransformerModel(model=model, tokenizer=tokenizer)

# Evaluate the wrapped model on the PersianMath task.
with PersianMath() as task:
    results = task.evaluate(tf_model)
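
If you don't want to load the model weights locally, the same task interface works with any ParsBench model wrapper. Here is a minimal sketch that reuses the OpenAIModel constructor from the benchmark example below, assuming a local Ollama server is already running:

from parsbench.models import OpenAIModel
from parsbench.tasks import PersianMath

# Assumes an OpenAI-compatible server (here, a local Ollama instance).
api_model = OpenAIModel(
    api_base_url="http://localhost:11434/v1/",
    api_secret_key="ollama",
    model="qwen2:latest",
)

with PersianMath() as task:
    results = task.evaluate(api_model)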

Benchmarking Multiple Models with Multiple Tasks

For example, we run our local models using Ollama:

ollama run qwen2
ollama run aya

Then we benchmark those models using ParsBench:

from parsbench.benchmarks import CustomBenchmark
from parsbench.models import OpenAIModel
from parsbench.tasks import ParsiNLUMultipleChoice, PersianMath, ParsiNLUReadingComprehension

# Ollama exposes an OpenAI-compatible API, so OpenAIModel can connect to it.
qwen2_model = OpenAIModel(
    api_base_url="http://localhost:11434/v1/",
    api_secret_key="ollama",
    model="qwen2:latest",
)
aya_model = OpenAIModel(
    api_base_url="http://localhost:11434/v1/",
    api_secret_key="ollama",
    model="aya:latest",
)

benchmark = CustomBenchmark(
    models=[qwen2_model, aya_model],
    tasks=[
        ParsiNLUMultipleChoice,
        ParsiNLUReadingComprehension,
        PersianMath,
    ],
)
result = benchmark.run(
    prompt_lang="fa",  # prompt the models in Persian
    prompt_shots=[0, 3],  # evaluate in both zero-shot and 3-shot settings
    n_first=100,  # use only the first 100 examples of each task
    sort_by_score=True,
)
result.show_radar_plot()

(Figure: benchmark bar plot)

Available Tasks

Task Name                             | Score Name                  | Dataset
------------------------------------- | --------------------------- | -----------------
ParsiNLU Sentiment Analysis           | Exact Match (F1)            | ParsiNLU
ParsiNLU Entailment                   | Exact Match (F1)            | ParsiNLU
ParsiNLU Machine Translation En -> Fa | BLEU                        | ParsiNLU
ParsiNLU Machine Translation Fa -> En | BLEU                        | ParsiNLU
ParsiNLU Multiple Choice              | Exact Match (Accuracy)      | ParsiNLU
ParsiNLU Reading Comprehension        | Common Tokens (F1)          | ParsiNLU
Persian NER                           | NER Exact Match (F1)        | PersianNER
Persian Math                          | Math Equivalence (Accuracy) | Source
ConjNLI Entailment                    | Exact Match (F1)            | Source
Persian MMLU (Khayyam Challenge)      | Exact Match (Accuracy)      | Khayyam Challenge
FarsTail Entailment                   | Exact Match (F1)            | FarsTail
Persian News Summary                  | ROUGE                       | PNSummary
XL-Sum                                | ROUGE                       | XLSum

You can import the classes of the tasks above from parsbench.tasks and use them to evaluate your model.
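
For example (the class name below is one shown in the usage examples above; the remaining tasks are imported the same way):

from parsbench.tasks import ParsiNLUReadingComprehension

with ParsiNLUReadingComprehension() as task:
    results = task.evaluate(tf_model)  # tf_model: any ParsBench model wrapper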

Example Notebooks

  • Benchmark Aya models: aya
  • Benchmark Ava models: ava
  • Benchmark Dorna models: dorna
  • Benchmark MaralGPT models: maralgpt

Sponsors

Here are the companies and people who have helped us keep maintaining this project. If you want to donate to this project, see this page.

  • AvalAI: They gave us free OpenAI API credits several times through their "AvalAward" program, which helped us do R&D and benchmark GPT models.
  • Basalam: They voluntarily helped us run the benchmarks on open-weight models and build the ParsBench Leaderboard.

Contributing

Contributions are welcome! Please refer to the contribution guidelines for more information on how to contribute.

License

ParsBench is distributed under the Apache-2.0 license.

Contact Information

For support or questions, please contact shahriarshm81@gmail.com.
