
A library for evaluating models on the HuGME benchmark.

Project description

HuGME: Hungarian Generative Model Evaluation benchmark

HuGME is an evaluation framework designed to assess Large Language Models (LLMs) with a focus on Hungarian language proficiency and cultural understanding. It provides a structured assessment of model performance across multiple dimensions and is built on DeepEval.

📌 Installation & Usage

Installation

Install from PyPI:

pip install hugme

To install the library for testing and development, clone the repository and install from source:

git clone https://github.com/nytud/hugme
cd hugme
pip install .

Running HuGME

You can execute HuGME with:

hugme --model-name /path/to/your/model --tasks bias --parameters config.json

Command-Line Parameters

Parameter Description
--model-name Name of the model (a local Hugging Face model path or an OpenAI model).
--tasks Tasks to evaluate (bias, toxicity, faithfulness, summarization, answer-relevancy, mmlu, spelling, truthfulqa, prompt-alignment, readability, needle-in-haystack).
--judge Default: "gpt-3.5-turbo-1106". Judge model used for evaluations.
--use-cuda Default: True. Enables GPU acceleration.
--cuda-id Default: 1. Which GPU to use; indexing starts from 0.
--seed Sets a random seed for reproducibility.
--parameters Required. Path to a JSON configuration file for model parameters. See below for an example.
--save-results Default: True. Whether to save evaluation results.
--use-gen-results Path to a file of previously generated model outputs to evaluate.
--provider Default: False. Provider to use. Choices: openai.
--thinking Default: False. Enables thinking mode.
--use-alpaca-prompt Default: False. Uses the Alpaca prompt format.
--sample-size Default: 1.0. Fraction of each task's dataset to sample.

🛠 Configure HuGME

Before running HuGME, you must set the DATASETS environment variable so the framework can locate the datasets required by the evaluation tasks. Make sure the path points to the directory containing them.

export DATASETS=/path/to/datasets

The following environment variable must also be set for the spelling task:

export BERT_MODEL=/path/to/bert-model

HuGME requires model parameters to be configured via a JSON file, which is passed to Hugging Face's transformers library or to OpenAI's library. Set the file path with the --parameters flag. Example:

{
  "max_new_tokens": 50,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 150,
  "repetition_penalty": 0.98,
  "diversity_penalty": 0,
  "do_sample": true,
  "return_full_text": false
}
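If you prefer to generate this file programmatically, a minimal sketch is shown below; the file name config.json is simply whatever you pass to --parameters:

```python
import json

# Generation parameters mirroring the example above; adjust to your model.
params = {
    "max_new_tokens": 50,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 150,
    "repetition_penalty": 0.98,
    "diversity_penalty": 0,
    "do_sample": True,
    "return_full_text": False,
}

# Write the config file that --parameters will point at.
with open("config.json", "w") as f:
    json.dump(params, f, indent=2)
```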

🔑 Providing API Keys

To authenticate with OpenAI or Hugging Face, set your API keys as environment variables:

export OPENAI_API_KEY=sk-examplekey # judge model for deepeval based metrics
export HF_TOKEN=hf-exampletoken # using huggingface models
export PROVIDER_API_KEY=provider-api-key # using custom (openai package compatible) provider
export PROVIDER_URL=hf-provider-url # using custom (openai package compatible) provider

Alternatively, provide them inline when running the evaluation:

OPENAI_API_KEY=sk-examplekey hugme --model-name NYTK/PULI-LlumiX-32K --tasks mmlu

🧠 Results

After running metrics and/or benchmarks, all generation and evaluation outputs are saved in the results/ directory.

📊 Evaluation Tasks

HuGME includes multiple tasks to evaluate different aspects of LLM performance in Hungarian. The calculation of each score is also documented.

1️⃣ Bias

Assesses language model outputs for biased content through systematic opinion analysis across gender, politics, race/ethnicity, and geographical dimensions. It employs a dataset of 100 carefully crafted queries designed to potentially elicit biased responses, with models required to prefix their outputs using opinion indicators (such as Szerintem 'I think', Úgy gondolom 'I believe', or Véleményem szerint 'In my opinion'). This prefixing requirement facilitates opinion extraction, which is crucial since unbiased responses typically lack opinionated content.

2️⃣ Toxicity

Evaluates language models' tendency to generate harmful or offensive content by analyzing opinions extracted from model responses to 100 specialized queries. An opinion is classified as toxic if it contains personal attacks, mockery, hate speech, dismissive statements, or threats that degrade or intimidate others, while non-toxic opinions are characterized by respectful engagement, openness to discussion, and constructive critique of ideas rather than individuals.

3️⃣ Answer relevancy

Evaluates the model's ability to generate contextually appropriate responses by comparing individual output statements against the input query. Using 100 diverse test queries spanning history, logic, and Hungarian idioms, the module assesses whether responses stay on topic and avoid contradictions, focusing on relevance rather than factual accuracy.
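Assuming a DeepEval-style aggregation, the final score is the fraction of extracted statements the judge marks as relevant; a minimal sketch of that last step (the statement-by-statement judging itself is performed by the judge LLM):

```python
def relevancy_score(verdicts: list[bool]) -> float:
    """Fraction of statements judged relevant (1.0 = fully on-topic)."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# e.g. three of four extracted statements judged relevant
score = relevancy_score([True, True, False, True])
```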

4️⃣ Faithfulness

Examines factual accuracy by comparing model outputs against provided context across 100 queries. Each query includes detailed context, with the evaluation focused on verifying that extracted claims align with the given factual information.

5️⃣ Summarization

Tests the model's ability to condense Hungarian texts while retaining key information. Using 50 texts, evaluation is based on whether four predefined yes/no questions can be answered from each generated summary, ensuring critical details remain while allowing flexibility in presentation.

6️⃣ Prompt alignment

Evaluates models' ability to execute Hungarian commands accurately. It uses 100 queries, each containing specific instructions, with evaluation based on whether the model follows all instructions completely and precisely. This task requires max_new_tokens of at least 256.

7️⃣ Spelling

Evaluates adherence to Hungarian orthography using a custom dictionary trained on index.hu texts and pyspellchecker. Flagged words from readability test outputs are verified by GPT-4 to minimize false positives, with the final score calculated as the ratio of correctly spelled words.

8️⃣ Readability

Evaluates how well models adapt their output complexity to match input texts. It uses 20 texts across four complexity levels (fairy tales, 6th grade, 10th grade, and academic), with readability assessed using an average of Coleman-Liau Index and textstat's text_standard scores.
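The actual scores come from textstat; the stand-alone sketch below only shows how the Coleman-Liau Index itself is defined, with simplified letter and sentence counting heuristics:

```python
import re

def coleman_liau(text: str) -> float:
    """Coleman-Liau Index: 0.0588*L - 0.296*S - 15.8,
    where L = letters per 100 words and S = sentences per 100 words.

    Word and sentence detection here are rough approximations.
    """
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, incl. accented
    if not words:
        return 0.0
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = letters / len(words) * 100
    S = sentences / len(words) * 100
    return 0.0588 * L - 0.296 * S - 15.8
```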

9️⃣ HuTruthfulQA

Adapts the TruthfulQA dataset for Hungary by translating questions and adding culturally specific content, resulting in 747 questions across 37 categories.

🔟 HuMMLU (Massive Multitask Language Understanding)

Adapts the MMLU benchmark for Hungarian by machine-translating and manually refining multiple-choice questions across 38 subjects to ensure cultural relevance and accurate assessment.

🧩 Needle in the Haystack

Tests an LLM's ability to locate and extract specific information (the "needle") hidden within a large body of Hungarian text (the "haystack"), assessing its capacity to focus on relevant details within a complex context. The target sentence is embedded at various positions in a Hungarian novel.
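The embedding step can be sketched as follows; the function and its sentence-boundary heuristic are illustrative, not HuGME's actual implementation:

```python
def embed_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert the needle sentence at a fractional depth of the haystack
    (0.0 = start, 1.0 = end), snapping to a sentence boundary so that
    no word is split."""
    assert 0.0 <= depth <= 1.0
    pos = int(len(haystack) * depth)
    boundary = haystack.find(". ", pos)
    if boundary == -1:
        # no boundary after pos: append the needle at the very end
        return haystack + " " + needle
    boundary += 2  # insert just after the ". " separator
    return haystack[:boundary] + needle + " " + haystack[boundary:]
```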

Providers like OpenAI are currently unsupported for this metric.

🤝 Contributing

Contributions to HuGME are welcome! If you find a bug, want to add new evaluation modules, or improve existing ones, please feel free to open an issue or submit a pull request.
