
MLX Benchmark

Benchmark LLMs on Apple MLX framework knowledge and coding tasks.

Install

pip install mlx-benchmark

For cloud provider support, install extras:

pip install "mlx-benchmark[anthropic]"   # Claude models
pip install "mlx-benchmark[openai]"       # OpenAI, Groq, OpenRouter
pip install "mlx-benchmark[all]"          # All providers
pip install "mlx-benchmark[plot]"         # PNG chart export (matplotlib)

Quick Start

Benchmark a local Ollama model:

mlx-bench --model llama3.2

Benchmark multiple models sequentially:

mlx-bench --model llama3.2,mistral,qwen2.5-coder

Use a stronger model as judge:

mlx-bench --model llama3.2 --judge-model gemma4

Cloud Providers

# Anthropic Claude
mlx-bench --provider anthropic --model claude-sonnet-4-20250514

# OpenAI
mlx-bench --provider openai --model gpt-4o

# OpenRouter (access to many models)
mlx-bench --provider openrouter --model anthropic/claude-sonnet-4-20250514

# Groq (fast inference)
mlx-bench --provider groq --model llama-3.2-70b-versatile

API keys are read from environment variables:

Provider     Environment Variable
----------   --------------------
Anthropic    ANTHROPIC_API_KEY
OpenAI       OPENAI_API_KEY
Groq         GROQ_API_KEY
OpenRouter   OPEN_ROUTER_API_KEY

Filtering

# Only coding and debug questions
mlx-bench --model llama3.2 --types coding debug

# Only hard questions
mlx-bench --model llama3.2 --difficulties hard

# Specific categories
mlx-bench --model llama3.2 --categories mlx_core mlx_nn

# Quick test run (10 samples)
mlx-bench --model llama3.2 --limit 10

Export

Generate LaTeX tables and PNG charts from benchmark results:

# Export LaTeX + PNG from all results in ./results/
mlx-bench --latex --plot

# Export only LaTeX from specific result files
mlx-bench --latex --results results/bench_ollama_llama3-2_*.json

# Run benchmark and also generate exports
mlx-bench --config models.yml --latex --plot

Outputs:

  • bench_results.tex — two booktabs tables (accuracy by difficulty, accuracy by type)
  • bench_results.png — grouped bar chart comparing models

The --plot flag requires matplotlib. Install it with pip install "mlx-benchmark[plot]".
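The --config models.yml invocation above reads the models to benchmark from a YAML file. The schema is not documented on this page, so the following is only a hypothetical sketch of what such a file might contain (all field names are assumptions):

```yaml
# Hypothetical models.yml -- field names are illustrative, not documented.
models:
  - name: llama3.2
    provider: ollama
  - name: claude-sonnet-4-20250514
    provider: anthropic
judge_model: gemma4
limit: 50
```

Check the package's own documentation or source for the actual config keys before relying on this layout.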

All Options

--model MODEL          Model name (comma-separated for multiple)
--provider PROVIDER    ollama | anthropic | openai | groq | openrouter
--judge-model MODEL    Judge model (default: same as --model)
--judge-provider PROVIDER
                       Judge provider (default: same as --provider)
--dataset PATH         Custom dataset JSONL (default: bundled v2)
--output-dir DIR       Where to save results (default: ./results)
--max-tokens N         Max response tokens (default: 1024)
--temperature T        Sampling temperature (default: 0.0)
--limit N              Limit number of samples
--categories [...]     Filter by category
--difficulties [...]   Filter by difficulty (easy, medium, hard, very-hard)
--types [...]          Filter by type (qa, fill_blank, mcq, true_false, coding, debug)
--rate-limit SECS      Delay between API calls (default: 0.5)
--host URL             Ollama host (default: http://localhost:11434)
--api-key KEY          API key for cloud providers
--base-url URL         Custom base URL for OpenAI-compatible APIs
--latex                Generate LaTeX table from results
--plot                 Generate PNG bar chart from results
--results [FILES ...]  Result JSON files to export (default: all in --output-dir)

Python API

from mlx_benchmark import run_benchmark

results, stats = run_benchmark(
    model="llama3.2",
    provider="ollama",
    limit=20,
    types=["coding", "debug"],
)

print(f"Accuracy: {stats.accuracy:.1f}%")

Export results programmatically:

from mlx_benchmark import load_result_files, generate_latex_table, generate_plot

data = load_result_files(["results/bench_ollama_llama3-2_20260412.json"])
latex = generate_latex_table(data, output_dir="results")
generate_plot(data, output_dir="results")

Output

Results are saved as JSON files in the output directory with:

  • Per-question scores (correct/incorrect)
  • Aggregate accuracy by type, difficulty, and category
  • Model answers alongside reference answers for review

Example result filename: bench_ollama_llama3-2_20260412_220855.json
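Because results are plain JSON, they are easy to post-process with the standard library alone. The field names below are assumptions for illustration (the real files may use different keys), so this is a minimal round-trip sketch rather than the actual file format:

```python
import json

# Hypothetical shape of a saved result file; real field names may differ.
sample = {
    "model": "llama3.2",
    "provider": "ollama",
    "accuracy": 72.5,
    "by_difficulty": {"easy": 90.0, "medium": 75.0, "hard": 55.0},
}

# In practice you would json.load() one of the files from ./results/.
loaded = json.loads(json.dumps(sample))
print(f"{loaded['model']}: {loaded['accuracy']:.1f}% overall")
```

Inspect one of your own result files first to confirm the actual keys before scripting against them.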

Dataset

The bundled dataset (dataset_v2.jsonl) contains 441 questions across 6 types:

Type        Description
qa          Knowledge questions about MLX APIs
mcq         Multiple choice
true_false  True/false statements
fill_blank  Code completion tasks
coding      Full code writing tasks
debug       Identify and fix bugs in MLX code

The questions span 11 categories and 4 difficulty levels (easy, medium, hard, very-hard).
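JSONL files store one JSON object per line, so a custom dataset for --dataset can be filtered or inspected with a few lines of standard-library Python. The record fields below ("type", "difficulty", "category", "question") are assumptions based on the filter flags above, not a documented schema:

```python
import json

# Two hypothetical dataset records; dataset_v2.jsonl's real fields may differ.
lines = [
    '{"type": "qa", "difficulty": "easy", "category": "mlx_core", '
    '"question": "What does mx.eval do?"}',
    '{"type": "coding", "difficulty": "hard", "category": "mlx_nn", '
    '"question": "Implement a linear layer."}',
]

# Parse each line as an independent JSON object, then filter by type.
records = [json.loads(line) for line in lines]
coding = [r for r in records if r["type"] == "coding"]
print(len(coding))  # 1
```

The same pattern works for filtering by difficulty or category when building a custom dataset file.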

License

MIT
