# Nerdsking HumanEval Benchmark for llama.cpp (GGUF)
A strict, auditable HumanEval benchmark runner for GGUF models served via llama.cpp, using its OpenAI-compatible HTTP API.
This project focuses on correct execution semantics and reproducibility:
- Prompts are preserved verbatim (no stripping or truncation).
- Only fenced Python code is accepted.
- Each task is executed using strict HumanEval semantics.
- Full outputs and failure reasons are saved for auditing.
## Key Features
### ✅ Correct HumanEval Semantics (Strict)
For every task, execution follows exactly this order (a minimal sketch is shown below):

- Execute the original prompt (function signature + docstring)
- Execute the model-generated code
- Execute the test harness
- Call `check(entry_point)`
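
The following is a minimal sketch of that execution order, not the project's exact code. It assumes `task` is one HumanEval record with its standard `prompt`, `test`, and `entry_point` fields, and that `generated_code` is the extracted fenced block:

```python
# Sketch only: strict HumanEval execution order, assuming standard record fields.
def run_task_strict(task: dict, generated_code: str) -> None:
    namespace: dict = {}
    exec(task["prompt"], namespace)           # 1. original prompt (signature + docstring)
    exec(generated_code, namespace)           # 2. model-generated code
    exec(task["test"], namespace)             # 3. test harness (defines check())
    namespace["check"](namespace[task["entry_point"]])  # 4. call check(entry_point)
```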
### ✅ Prompt Integrity
- HumanEval prompts are used verbatim
- No stripping, rewriting, or truncation
- Only a minimal instruction header is prepended
- Raw prompts are stored in the output JSON
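
For illustration, prompt construction amounts to prepending a short header to the untouched HumanEval prompt. The header text below is an assumption; the exact wording used by `benchmark.py` may differ:

```python
# Illustrative header only; the HumanEval prompt itself passes through verbatim.
INSTRUCTION_HEADER = (
    "Complete the following Python function. "
    "Return the full implementation inside a single ```python fenced block.\n\n"
)

def build_prompt(humaneval_prompt: str) -> str:
    return INSTRUCTION_HEADER + humaneval_prompt
```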
### ✅ Strict Code Extraction
Only code inside a single fenced block is accepted:

```python
# code here
```

If no such block exists → **automatic failure (`no_code`)**.
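
A possible extraction rule matching this behaviour is sketched below; the exact regex and failure handling in the project may differ:

```python
import re

# Accept exactly one ```python fenced block; anything else is a "no_code" failure.
FENCE_RE = re.compile(r"```python\s*\n(.*?)```", re.DOTALL)

def extract_code(response: str) -> str | None:
    matches = FENCE_RE.findall(response)
    if len(matches) != 1:
        return None  # caller records error_type = "no_code"
    return matches[0]
```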
---
### ✅ Full Failure Attribution
Each failed task records:
- `error_type`
- `error_detail`
- `full_response`
- `generated_code`
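
For illustration, a failed-task record might look like the following. Only the four field names listed above come from the project; the surrounding keys and all values are hypothetical:

```python
# Hypothetical example of a failed-task record in the output JSON.
failed_task = {
    "task_id": "HumanEval/0",        # assumed key
    "passed": False,                 # assumed key
    "error_type": "no_code",
    "error_detail": "no fenced ```python block found in the response",
    "full_response": "<raw model output>",
    "generated_code": None,
}
```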
---
### ✅ llama.cpp Native Support
- Automatic server start/stop **or** reuse an existing server
- Uses `/v1/completions` OpenAI-compatible API
- Streaming-safe and timeout-safe
- **GGUF-only by design**
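
A minimal sketch of a non-streaming call to the OpenAI-compatible completions endpoint is shown below; the sampling parameters and timeout are illustrative, not the benchmark's actual defaults:

```python
import requests

def complete(prompt: str, server_url: str = "http://127.0.0.1:8080") -> str:
    # POST to llama.cpp's OpenAI-compatible /v1/completions endpoint.
    resp = requests.post(
        f"{server_url}/v1/completions",
        json={"prompt": prompt, "max_tokens": 512, "temperature": 0.0},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```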
---
## Repository Structure
```
.
├── benchmark.py
├── HumanEval.jsonl
├── LICENSE
├── README.md
└── eval_utils/
    ├── __init__.py
    ├── bench_config.json
    └── code_bench.py
```
---
## Dependencies
### Required
- Python 3.10+
- llama.cpp (with server support)
- GGUF model
### Python packages
```bash
pip install requests datasets
```

### Building llama.cpp

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir -p build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
```
## Usage

### Automatic server management

```bash
python benchmark.py --model model.gguf --server-path /path/to/llama.cpp/build/bin
```

### Use an existing server

```bash
python benchmark.py --server-url http://127.0.0.1:8080 --no-server
```
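
Before running with `--no-server`, you can check that the server is reachable. The sketch below assumes the llama.cpp server exposes its `/health` endpoint on the same host and port:

```python
import requests

# Quick reachability check for an already-running llama.cpp server.
resp = requests.get("http://127.0.0.1:8080/health", timeout=5)
print(resp.status_code, resp.text)
```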
### Use a local HumanEval file

```bash
python benchmark.py --humaneval-jsonl HumanEval.jsonl
```
## Output
A JSON file is generated containing:
- Full configuration
- Per-task results
- Raw model outputs
- Error attribution
- Timing metrics
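
A short post-processing sketch is shown below. The output filename and the `results` / `passed` field names are assumptions; adjust them to the actual schema of your run:

```python
import json

# Compute a simple pass rate from the benchmark's JSON report (field names assumed).
with open("benchmark_results.json") as f:
    report = json.load(f)

results = report["results"]
pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"pass@1: {pass_rate:.2%}")
```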
## License / citation

If you use this code in research or benchmarking, please cite:

https://github.com/nerdskingcom/gguf_benchmark, IMNECHO / Nerdsking.com