# MLX Benchmark
Benchmark LLMs on Apple MLX framework knowledge and coding tasks.
## Install

```bash
pip install mlx-benchmark
```
For cloud provider support, install extras:
pip install "mlx-benchmark[anthropic]" # Claude models
pip install "mlx-benchmark[openai]" # OpenAI, Groq, OpenRouter
pip install "mlx-benchmark[all]" # All providers
pip install "mlx-benchmark[plot]" # PNG chart export (matplotlib)
## Quick Start
Benchmark a local Ollama model:
```bash
mlx-bench --model llama3.2
```
Benchmark multiple models sequentially:
```bash
mlx-bench --model llama3.2,mistral,qwen2.5-coder
```
Use a stronger model as judge:
```bash
mlx-bench --model llama3.2 --judge-model gemma3
```
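The same sweep can be scripted with the `run_benchmark` entry point documented in the Python API section below. A minimal sketch:

```python
from mlx_benchmark import run_benchmark

# Sketch: benchmark several local Ollama models in sequence and
# compare overall accuracy, using only the documented API surface.
for model in ["llama3.2", "mistral", "qwen2.5-coder"]:
    results, stats = run_benchmark(model=model, provider="ollama")
    print(f"{model}: {stats.accuracy:.1f}%")
```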
## Cloud Providers
```bash
# Anthropic Claude
mlx-bench --provider anthropic --model claude-sonnet-4-20250514

# OpenAI
mlx-bench --provider openai --model gpt-4o

# OpenRouter (access to many models)
mlx-bench --provider openrouter --model anthropic/claude-sonnet-4-20250514

# Groq (fast inference)
mlx-bench --provider groq --model llama-3.3-70b-versatile
```
API keys are read from environment variables:
| Provider | Environment Variable |
|---|---|
| Anthropic | ANTHROPIC_API_KEY |
| OpenAI | OPENAI_API_KEY |
| Groq | GROQ_API_KEY |
| OpenRouter | OPEN_ROUTER_API_KEY |
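The environment-variable route also covers the Python API. A minimal sketch, assuming the provider backend reads the key from the environment just like the CLI does (whether `run_benchmark` also accepts an explicit key argument, as `--api-key` does on the CLI, is not shown here):

```python
import os
from mlx_benchmark import run_benchmark

# Sketch: run against a cloud provider. Assumes ANTHROPIC_API_KEY is
# read from the environment, mirroring the CLI behavior.
os.environ.setdefault("ANTHROPIC_API_KEY", "sk-ant-...")

results, stats = run_benchmark(
    model="claude-sonnet-4-20250514",
    provider="anthropic",
    limit=5,  # small smoke test to keep API cost down
)
print(f"Accuracy: {stats.accuracy:.1f}%")
```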
## Filtering
```bash
# Only coding and debug questions
mlx-bench --model llama3.2 --types coding debug

# Only hard questions
mlx-bench --model llama3.2 --difficulties hard

# Specific categories
mlx-bench --model llama3.2 --categories mlx_core mlx_nn

# Quick test run (10 samples)
mlx-bench --model llama3.2 --limit 10
```
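The Python API's `types` and `limit` keywords mirror their CLI flags (see the Python API section). A sketch of combined filtering, assuming `difficulties` and `categories` keywords mirror `--difficulties` and `--categories` the same way; that naming is an assumption, not documented behavior:

```python
from mlx_benchmark import run_benchmark

# Sketch: filter from Python. `types` and `limit` are documented;
# `difficulties` and `categories` are assumed to mirror the CLI flags.
results, stats = run_benchmark(
    model="llama3.2",
    provider="ollama",
    types=["coding", "debug"],
    difficulties=["hard"],   # assumption: mirrors --difficulties
    limit=10,
)
```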
## Export
Generate LaTeX tables and PNG charts from benchmark results:
```bash
# Export LaTeX + PNG from all results in ./results/
mlx-bench --latex --plot

# Export only LaTeX from specific result files
mlx-bench --latex --results results/bench_ollama_llama3-2_*.json

# Run benchmark and also generate exports
mlx-bench --config models.yml --latex --plot
```
Outputs:

- `bench_results.tex`: two `booktabs` tables (accuracy by difficulty, accuracy by type)
- `bench_results.png`: grouped bar chart comparing models

The `--plot` flag requires matplotlib. Install with `pip install "mlx-benchmark[plot]"`.
## All Options
```
--model MODEL               Model name (comma-separated for multiple)
--provider PROVIDER         ollama | anthropic | openai | groq | openrouter
--judge-model MODEL         Judge model (default: same as --model)
--judge-provider PROVIDER   Judge provider (default: same as --provider)
--dataset PATH              Custom dataset JSONL (default: bundled v2)
--output-dir DIR            Where to save results (default: ./results)
--max-tokens N              Max response tokens (default: 1024)
--temperature T             Sampling temperature (default: 0.0)
--limit N                   Limit number of samples
--categories [...]          Filter by category
--difficulties [...]        Filter by difficulty (easy, medium, hard, very-hard)
--types [...]               Filter by type (qa, fill_blank, mcq, true_false, coding, debug)
--rate-limit SECS           Delay between API calls (default: 0.5)
--host URL                  Ollama host (default: http://localhost:11434)
--api-key KEY               API key for cloud providers
--base-url URL              Custom base URL for OpenAI-compatible APIs
--latex                     Generate LaTeX table from results
--plot                      Generate PNG bar chart from results
--results [FILES ...]       Result JSON files to export (default: all in --output-dir)
```
## Python API
```python
from mlx_benchmark import run_benchmark

results, stats = run_benchmark(
    model="llama3.2",
    provider="ollama",
    limit=20,
    types=["coding", "debug"],
)

print(f"Accuracy: {stats.accuracy:.1f}%")
```
Export results programmatically:
```python
from mlx_benchmark import load_result_files, generate_latex_table, generate_plot

data = load_result_files(["results/bench_ollama_llama3-2_20260412.json"])
latex = generate_latex_table(data, output_dir="results")
generate_plot(data, output_dir="results")
```
## Output
Results are saved as JSON files in the output directory with:
- Per-question scores (correct/incorrect)
- Aggregate accuracy by type, difficulty, and category
- Model answers alongside reference answers for review
Example result filename: `bench_ollama_llama3-2_20260412_220855.json`
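Since results are plain JSON, a quick review pass is easy to script. A sketch follows; the exact schema is not documented here, so the field names (`results`, `correct`, `question`, `model_answer`, `reference_answer`) are assumptions for illustration:

```python
import json

# Sketch: print the model's wrong answers next to the reference answers.
# Field names are assumed, not taken from a documented schema.
with open("results/bench_ollama_llama3-2_20260412_220855.json") as f:
    data = json.load(f)

for item in data.get("results", []):
    if not item.get("correct"):
        print("Q:        ", item.get("question"))
        print("Model:    ", item.get("model_answer"))
        print("Reference:", item.get("reference_answer"))
        print()
```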
## Dataset
The bundled dataset (`dataset_v2.jsonl`) contains 441 questions across 6 types:
| Type | Description |
|---|---|
| `qa` | Knowledge questions about MLX APIs |
| `mcq` | Multiple choice |
| `true_false` | True/false statements |
| `fill_blank` | Code completion tasks |
| `coding` | Full code writing tasks |
| `debug` | Identify and fix bugs in MLX code |
Questions span 11 categories and 4 difficulty levels (easy, medium, hard, very-hard).
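Because the dataset is plain JSONL, it is easy to inspect before a run, for example a custom file passed via `--dataset`. A sketch, assuming each line is a JSON object with `type` and `difficulty` fields (the per-question schema is not documented here):

```python
import json
from collections import Counter

# Sketch: tally questions by type and difficulty in a dataset JSONL.
# The "type" and "difficulty" field names are assumptions.
with open("my_dataset.jsonl") as f:  # hypothetical custom dataset path
    questions = [json.loads(line) for line in f if line.strip()]

print(Counter(q["type"] for q in questions))
print(Counter(q["difficulty"] for q in questions))
```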
## License
MIT