
# Nerdsking HumanEval Benchmark for llama.cpp (GGUF)

A strict, auditable HumanEval benchmark runner for GGUF models served via llama.cpp, using its OpenAI-compatible HTTP API.

This project focuses on correct execution semantics and reproducibility:

- Prompts are preserved verbatim (no stripping or truncation).
- Only fenced Python code is accepted.
- Each task is executed using strict HumanEval semantics.
- Full outputs and failure reasons are saved for auditing.

## Key Features

### ✅ Correct HumanEval Semantics (Strict)

For every task, execution follows exactly:

1. Execute the original prompt (function signature + docstring)
2. Execute the model-generated code
3. Execute the test harness
4. Call `check(entry_point)`
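The four steps above can be sketched as sequential `exec` calls in a shared namespace (a minimal illustration, not the project's actual runner, which presumably adds process isolation and timeouts; the function name here is invented):

```python
def run_humaneval_task(prompt, generated_code, test_code, entry_point):
    """Strict HumanEval semantics: prompt, then the model's code,
    then the test harness, then check(entry_point) — all in one namespace."""
    env = {}
    exec(prompt, env)            # 1. original prompt (signature + docstring)
    exec(generated_code, env)    # 2. model-generated code
    exec(test_code, env)         # 3. test harness (defines check)
    env["check"](env[entry_point])  # 4. raises AssertionError on failure
    return True
```

A `def` whose body is only a docstring is valid Python, so step 1 executes cleanly even before the model's completion is applied.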

### ✅ Prompt Integrity

- HumanEval prompts are used verbatim
- No stripping, rewriting, or truncation
- Only a minimal instruction header is prepended
- Raw prompts are stored in the output JSON
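Prompt assembly therefore reduces to string concatenation (the header text below is hypothetical; the real header is the project's own):

```python
INSTRUCTION_HEADER = (
    "Complete the following Python function. "
    "Reply with your solution in a single fenced Python code block.\n\n"
)

def build_prompt(humaneval_prompt: str) -> str:
    # The HumanEval prompt is appended verbatim: no strip(),
    # no rewriting, no truncation.
    return INSTRUCTION_HEADER + humaneval_prompt
```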

### ✅ Strict Code Extraction

Only code inside a single fenced block is accepted:

```python
# code here
```

If no such block exists → **automatic failure (`no_code`)**.
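A minimal extractor in this spirit (the regex and function name are this sketch's own; treating multiple fenced blocks as a failure is a guess about the strictness rule):

```python
import re

# Capture the body of a fenced ```python block.
FENCE_RE = re.compile(r"```python\s*\n(.*?)\n```", re.DOTALL)

def extract_code(response):
    """Return (code, None) for exactly one fenced block, else (None, 'no_code')."""
    blocks = FENCE_RE.findall(response)
    if len(blocks) != 1:
        return None, "no_code"
    return blocks[0], None
```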

---

### ✅ Full Failure Attribution
Each failed task records:
- `error_type`
- `error_detail`
- `full_response`
- `generated_code`
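A failed-task record then looks roughly like this (only the four field names above come from the project; every value and the extra `task_id`/`passed` keys are invented for illustration):

```python
failed_task = {
    "task_id": "HumanEval/0",        # assumed key, not confirmed
    "passed": False,                 # assumed key, not confirmed
    "error_type": "no_code",         # why the task failed
    "error_detail": "no fenced Python block in response",
    "full_response": "Sure! Here is an outline of the solution ...",
    "generated_code": None,          # nothing was extracted
}
```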

---

### ✅ llama.cpp Native Support
- Automatic server start/stop **or** reuse an existing server
- Uses `/v1/completions` OpenAI-compatible API
- Streaming-safe and timeout-safe
- **GGUF-only by design**
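The request shape against llama.cpp's OpenAI-compatible endpoint is roughly the following; the payload fields are standard `/v1/completions` parameters, while the helper names and defaults are this sketch's own:

```python
import json
import urllib.request

def build_request(prompt, server_url="http://127.0.0.1:8080",
                  max_tokens=512, temperature=0.0):
    """Assemble the /v1/completions request for a llama.cpp server."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,  # 0.0 for reproducible benchmarking
        "stream": False,             # one response body; simpler to audit
    }
    return server_url + "/v1/completions", payload

def complete(prompt, timeout=120, **kwargs):
    """Send the request and return the generated text (needs a live server)."""
    url, payload = build_request(prompt, **kwargs)
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["text"]
```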

---

## Repository Structure

```
.
├── benchmark.py
├── HumanEval.jsonl
├── LICENSE
├── README.md
└── eval_utils/
    ├── __init__.py
    ├── bench_config.json
    └── code_bench.py
```


---

## Dependencies

### Required
- Python 3.10+
- llama.cpp (with server support)
- GGUF model

### Python packages
```bash
pip install requests datasets
```
### Building llama.cpp

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir -p build && cd build
cmake .. -DLLAMA_CUBLAS=ON   # on recent llama.cpp versions, use -DGGML_CUDA=ON
cmake --build . --config Release
```
## Usage

### Automatic server management

```bash
python benchmark.py --model model.gguf --server-path /path/to/llama.cpp/build/bin
```

### Use an existing server

```bash
python benchmark.py --server-url http://127.0.0.1:8080 --no-server
```

### Use a local HumanEval file

```bash
python benchmark.py --humaneval-jsonl HumanEval.jsonl
```

## Output

A JSON file is generated containing:

- Full configuration
- Per-task results
- Raw model outputs
- Error attribution
- Timing metrics
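Such a file can be summarized after a run; the schema assumed below (a top-level `results` list with boolean `passed` fields) is a guess for illustration, not guaranteed by the project:

```python
import json

def summarize(results_path):
    """Compute a simple pass rate from the benchmark's output JSON.
    Assumes a top-level 'results' list with boolean 'passed' fields."""
    with open(results_path) as f:
        report = json.load(f)
    tasks = report["results"]
    passed = sum(1 for t in tasks if t["passed"])
    return passed, len(tasks), passed / len(tasks)
```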

## License / citation

If you use this code in research or benchmarking, please cite:

https://github.com/nerdskingcom/gguf_benchmark (IMNECHO / Nerdsking.com)
