
A library for evaluating models on the HuGME benchmark.

Project description

HuGME: Hungarian Generative Model Evaluation benchmark

HuGME is an evaluation framework designed to assess Large Language Models (LLMs) with a focus on Hungarian language proficiency and cultural understanding. It provides a structured assessment of model performance across multiple dimensions and is built on DeepEval.

📌 Installation & Usage

Installation

Install from PyPI:

pip install hugme

To install the library for testing and development, clone the repository and install from source:

git clone https://github.com/nytud/hugme
cd hugme
pip install .

Running HuGME

You can execute HuGME with:

hugme --model-name /path/to/your/model --tasks bias --parameters config.json

Command-Line Parameters

Parameter Description
--model-name Name of the model (a local Hugging Face model path or an OpenAI model).
--tasks Tasks to evaluate (bias, toxicity, faithfulness, summarization, answer-relevancy, mmlu, spelling, truthfulqa, prompt-alignment, readability, needle-in-haystack).
--judge Default: "gpt-3.5-turbo-1106". Judge model used for evaluations.
--use-cuda Default: True. Enables GPU acceleration.
--cuda-id Default: 1. Which GPU to use; indexing starts from 0.
--seed Sets a random seed for reproducibility.
--parameters Required. Path to a JSON configuration file for model parameters. See below for an example.
--save-results Default: True. Whether to save evaluation results.
--use-gen-results Path to a file of previously generated model outputs to evaluate.
--provider Default: False. Provider to use. Choices: openai.
--thinking Default: False. Enables thinking mode.
--use-alpaca-prompt Default: False. Uses the Alpaca prompt format.
--sample-size Default: 1.0. Fraction of each task's dataset to sample.

🛠 Configure HuGME

Before running HuGME, you must set the DATASETS environment variable so the framework can locate the datasets required by the evaluation tasks. Make sure the path points to the directory containing them.

export DATASETS=/path/to/datasets

The following environment variable must also be set for the spelling task:

export BERT_MODEL=/path/to/bert-model

HuGME requires model parameters to be configured via a JSON file, which is passed to Hugging Face's transformers library or to OpenAI's library. Set the file path with the --parameters flag. Example:

{
  "max_new_tokens": 50,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 150,
  "repetition_penalty": 0.98,
  "diversity_penalty": 0,
  "do_sample": true,
  "return_full_text": false
}
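If you prefer to generate this file programmatically, a minimal sketch is shown below; the file name config.json is simply whatever you pass to --parameters:

```python
import json

# Generation parameters mirroring the example above; adjust to your model.
params = {
    "max_new_tokens": 50,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 150,
    "repetition_penalty": 0.98,
    "diversity_penalty": 0,
    "do_sample": True,
    "return_full_text": False,
}

# Write the config file that --parameters will point at.
with open("config.json", "w") as f:
    json.dump(params, f, indent=2)
```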

🔑 Providing API Keys

To authenticate with OpenAI or Hugging Face, set your API keys as environment variables:

export OPENAI_API_KEY=sk-examplekey # judge model for deepeval based metrics
export HF_TOKEN=hf-exampletoken # using huggingface models
export PROVIDER_API_KEY=provider-api-key # using custom (openai package compatible) provider
export PROVIDER_URL=hf-provider-url # using custom (openai package compatible) provider

Alternatively, provide them inline when running the evaluation:

OPENAI_API_KEY=sk-examplekey hugme --model-name NYTK/PULI-LlumiX-32K --tasks mmlu

🧠 Results

After running metrics and/or benchmarks, all generation and evaluation outputs are saved in the results/ directory.

📊 Evaluation Tasks

HuGME includes multiple tasks to evaluate different aspects of LLM performance in Hungarian. The calculation of each score is also documented.

1️⃣ Bias

Assesses language model outputs for biased content through systematic opinion analysis across gender, politics, race/ethnicity, and geographical dimensions. It employs a dataset of 100 carefully crafted queries designed to potentially elicit biased responses, with models required to prefix their outputs using opinion indicators (such as Szerintem 'I think', Úgy gondolom 'I believe', or Véleményem szerint 'In my opinion'). This prefixing requirement facilitates opinion extraction, which is crucial since unbiased responses typically lack opinionated content.

2️⃣ Toxicity

Evaluates language models' tendency to generate harmful or offensive content by analyzing opinions extracted from model responses to 100 specialized queries. An opinion is classified as toxic if it contains personal attacks, mockery, hate speech, dismissive statements, or threats that degrade or intimidate others, while non-toxic opinions are characterized by respectful engagement, openness to discussion, and constructive critique of ideas rather than individuals.

3️⃣ Answer relevancy

Evaluates the model's ability to generate contextually appropriate responses by comparing individual output statements against the input query. Using 100 diverse test queries spanning history, logic, and Hungarian idioms, the module assesses whether responses stay on topic and avoid contradictions, focusing on relevance rather than factual accuracy.
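Assuming a DeepEval-style aggregation, the final score is the fraction of extracted statements the judge marks as relevant; a minimal sketch of that last step (the statement-by-statement judging itself is performed by the judge LLM):

```python
def relevancy_score(verdicts: list[bool]) -> float:
    """Fraction of statements judged relevant (1.0 = fully on-topic)."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# e.g. three of four extracted statements judged relevant
score = relevancy_score([True, True, False, True])
```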

4️⃣ Faithfulness

Examines factual accuracy by comparing model outputs against provided context across 100 queries. Each query includes detailed context, with the evaluation focused on verifying that extracted claims align with the given factual information.

5️⃣ Summarization

Tests the model's ability to condense Hungarian texts while retaining key information. Using 50 texts, evaluation is based on whether four predefined yes/no questions can be answered from each generated summary, ensuring critical details remain while allowing flexibility in presentation.

6️⃣ Prompt alignment

Evaluates models' ability to execute Hungarian commands accurately. It uses 100 queries, each containing specific instructions, with evaluation based on whether the model follows all instructions completely and precisely. This task requires max_new_tokens of at least 256.

7️⃣ Spelling

Evaluates adherence to Hungarian orthography using a custom dictionary trained on index.hu texts and pyspellchecker. Flagged words from readability test outputs are verified by GPT-4 to minimize false positives, with the final score calculated as the ratio of correctly spelled words.

8️⃣ Readability

Evaluates how well models adapt their output complexity to match input texts. It uses 20 texts across four complexity levels (fairy tales, 6th grade, 10th grade, and academic), with readability assessed using an average of Coleman-Liau Index and textstat's text_standard scores.
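The actual scores come from textstat; the stand-alone sketch below only shows how the Coleman-Liau Index itself is defined, with simplified letter and sentence counting heuristics:

```python
import re

def coleman_liau(text: str) -> float:
    """Coleman-Liau Index: 0.0588*L - 0.296*S - 15.8,
    where L = letters per 100 words and S = sentences per 100 words.

    Word and sentence detection here are rough approximations.
    """
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, incl. accented
    if not words:
        return 0.0
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = letters / len(words) * 100
    S = sentences / len(words) * 100
    return 0.0588 * L - 0.296 * S - 15.8
```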

9️⃣ HuTruthfulQA

Adapts the TruthfulQA dataset for Hungary by translating questions and adding culturally specific content, resulting in 747 questions across 37 categories.

🔟 HuMMLU (Massive Multitask Language Understanding)

Adapts the MMLU benchmark for Hungarian by machine-translating and manually refining multiple-choice questions across 38 subjects to ensure cultural relevance and accurate assessment.

🧩 Needle in the Haystack

Tests an LLM's ability to locate and extract specific information (the "needle") hidden within a large body of Hungarian text (the "haystack"), assessing its capacity to focus on relevant details within a complex context. The target sentence is embedded at various positions in a Hungarian novel.
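The embedding step can be sketched as follows; the function and its sentence-boundary heuristic are illustrative, not HuGME's actual implementation:

```python
def embed_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert the needle sentence at a fractional depth of the haystack
    (0.0 = start, 1.0 = end), snapping to a sentence boundary so that
    no word is split."""
    assert 0.0 <= depth <= 1.0
    pos = int(len(haystack) * depth)
    boundary = haystack.find(". ", pos)
    if boundary == -1:
        # no boundary after pos: append the needle at the very end
        return haystack + " " + needle
    boundary += 2  # insert just after the ". " separator
    return haystack[:boundary] + needle + " " + haystack[boundary:]
```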

Providers like OpenAI are currently unsupported for this metric.

🤝 Contributing

Contributions to HuGME are welcome! If you find a bug, want to add new evaluation modules, or improve existing ones, please feel free to open an issue or submit a pull request.
