# Nerdsking HumanEval Benchmark for llama.cpp (GGUF)
A strict, auditable HumanEval benchmark runner for GGUF models served via llama.cpp, using its OpenAI-compatible HTTP API.
This project focuses on correct execution semantics and reproducibility:
- Prompts are preserved verbatim (no stripping or truncation).
- Only fenced Python code is accepted.
- Each task is executed using strict HumanEval semantics.
- Full outputs and failure reasons are saved for auditing.
## Key Features
### ✅ Correct HumanEval Semantics (Strict)
For every task, execution follows exactly this order (a minimal sketch is shown below):

- Execute the original prompt (function signature + docstring)
- Execute the model-generated code
- Execute the test harness
- Call `check(entry_point)`
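
The following is a minimal sketch of that execution order, not the project's exact code. It assumes `task` is one HumanEval record with its standard `prompt`, `test`, and `entry_point` fields, and that `generated_code` is the extracted fenced block:

```python
# Sketch only: strict HumanEval execution order, assuming standard record fields.
def run_task_strict(task: dict, generated_code: str) -> None:
    namespace: dict = {}
    exec(task["prompt"], namespace)           # 1. original prompt (signature + docstring)
    exec(generated_code, namespace)           # 2. model-generated code
    exec(task["test"], namespace)             # 3. test harness (defines check())
    namespace["check"](namespace[task["entry_point"]])  # 4. call check(entry_point)
```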
### ✅ Prompt Integrity
- HumanEval prompts are used verbatim
- No stripping, rewriting, or truncation
- Only a minimal instruction header is prepended
- Raw prompts are stored in the output JSON
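
For illustration, prompt construction amounts to prepending a short header to the untouched HumanEval prompt. The header text below is an assumption; the exact wording used by `benchmark.py` may differ:

```python
# Illustrative header only; the HumanEval prompt itself passes through verbatim.
INSTRUCTION_HEADER = (
    "Complete the following Python function. "
    "Return the full implementation inside a single ```python fenced block.\n\n"
)

def build_prompt(humaneval_prompt: str) -> str:
    return INSTRUCTION_HEADER + humaneval_prompt
```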
### ✅ Strict Code Extraction
Only code inside a single fenced block is accepted:

```python
# code here
```

If no such block exists → **automatic failure (`no_code`)**.
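
A possible extraction rule matching this behaviour is sketched below; the exact regex and failure handling in the project may differ:

```python
import re

# Accept exactly one ```python fenced block; anything else is a "no_code" failure.
FENCE_RE = re.compile(r"```python\s*\n(.*?)```", re.DOTALL)

def extract_code(response: str) -> str | None:
    matches = FENCE_RE.findall(response)
    if len(matches) != 1:
        return None  # caller records error_type = "no_code"
    return matches[0]
```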
---
### ✅ Full Failure Attribution
Each failed task records:
- `error_type`
- `error_detail`
- `full_response`
- `generated_code`
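
For illustration, a failed-task record might look like the following. Only the four field names listed above come from the project; the surrounding keys and all values are hypothetical:

```python
# Hypothetical example of a failed-task record in the output JSON.
failed_task = {
    "task_id": "HumanEval/0",        # assumed key
    "passed": False,                 # assumed key
    "error_type": "no_code",
    "error_detail": "no fenced ```python block found in the response",
    "full_response": "<raw model output>",
    "generated_code": None,
}
```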
---
### ✅ llama.cpp Native Support
- Automatic server start/stop **or** reuse an existing server
- Uses `/v1/completions` OpenAI-compatible API
- Streaming-safe and timeout-safe
- **GGUF-only by design**
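
A minimal sketch of a non-streaming call to the OpenAI-compatible completions endpoint is shown below; the sampling parameters and timeout are illustrative, not the benchmark's actual defaults:

```python
import requests

def complete(prompt: str, server_url: str = "http://127.0.0.1:8080") -> str:
    # POST to llama.cpp's OpenAI-compatible /v1/completions endpoint.
    resp = requests.post(
        f"{server_url}/v1/completions",
        json={"prompt": prompt, "max_tokens": 512, "temperature": 0.0},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```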
---
## Repository Structure
```
.
├── benchmark.py
├── HumanEval.jsonl
├── LICENSE
├── README.md
└── eval_utils/
    ├── __init__.py
    ├── bench_config.json
    └── code_bench.py
```
---
## Dependencies
### Required
- Python 3.10+
- llama.cpp (with server support)
- GGUF model
### Python packages
```bash
pip install requests datasets
```

### Building llama.cpp

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir -p build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
```
## Usage

### Automatic server management

```bash
python benchmark.py --model model.gguf --server-path /path/to/llama.cpp/build/bin
```

### Use an existing server

```bash
python benchmark.py --server-url http://127.0.0.1:8080 --no-server
```
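
Before running with `--no-server`, you can check that the server is reachable. The sketch below assumes the llama.cpp server exposes its `/health` endpoint on the same host and port:

```python
import requests

# Quick reachability check for an already-running llama.cpp server.
resp = requests.get("http://127.0.0.1:8080/health", timeout=5)
print(resp.status_code, resp.text)
```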
### Use a local HumanEval file

```bash
python benchmark.py --humaneval-jsonl HumanEval.jsonl
```
## Output
A JSON file is generated containing:
- Full configuration
- Per-task results
- Raw model outputs
- Error attribution
- Timing metrics
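
A short post-processing sketch is shown below. The output filename and the `results` / `passed` field names are assumptions; adjust them to the actual schema of your run:

```python
import json

# Compute a simple pass rate from the benchmark's JSON report (field names assumed).
with open("benchmark_results.json") as f:
    report = json.load(f)

results = report["results"]
pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"pass@1: {pass_rate:.2%}")
```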
## License / citation

If you use this code in research or benchmarking, please cite:

https://github.com/nerdskingcom/gguf_benchmark, IMNECHO / Nerdsking.com