
MLX Benchmark

Benchmark LLMs on Apple MLX framework knowledge and coding tasks.

Install

pip install mlx-benchmark

For cloud provider support, install extras:

pip install "mlx-benchmark[anthropic]"   # Claude models
pip install "mlx-benchmark[openai]"       # OpenAI, Groq, OpenRouter
pip install "mlx-benchmark[all]"          # All providers
pip install "mlx-benchmark[plot]"         # PNG chart export (matplotlib)

Quick Start

Benchmark a local Ollama model:

mlx-bench --model llama3.2

Benchmark multiple models sequentially:

mlx-bench --model llama3.2,mistral,qwen2.5-coder

Use a stronger model as judge:

mlx-bench --model llama3.2 --judge-model gemma4

Cloud Providers

# Anthropic Claude
mlx-bench --provider anthropic --model claude-sonnet-4-20250514

# OpenAI
mlx-bench --provider openai --model gpt-4o

# OpenRouter (access to many models)
mlx-bench --provider openrouter --model anthropic/claude-sonnet-4-20250514

# Groq (fast inference)
mlx-bench --provider groq --model llama-3.2-70b-versatile

API keys are read from environment variables:

Provider     Environment Variable
----------   --------------------
Anthropic    ANTHROPIC_API_KEY
OpenAI       OPENAI_API_KEY
Groq         GROQ_API_KEY
OpenRouter   OPEN_ROUTER_API_KEY

Filtering

# Only coding and debug questions
mlx-bench --model llama3.2 --types coding debug

# Only hard questions
mlx-bench --model llama3.2 --difficulties hard

# Specific categories
mlx-bench --model llama3.2 --categories mlx_core mlx_nn

# Quick test run (10 samples)
mlx-bench --model llama3.2 --limit 10

Export

Generate LaTeX tables and PNG charts from benchmark results:

# Export LaTeX + PNG from all results in ./results/
mlx-bench --latex --plot

# Export only LaTeX from specific result files
mlx-bench --latex --results results/bench_ollama_llama3-2_*.json

# Run benchmark and also generate exports
mlx-bench --config models.yml --latex --plot

Outputs:

  • bench_results.tex — two booktabs tables (accuracy by difficulty, accuracy by type)
  • bench_results.png — grouped bar chart comparing models

The --plot flag requires matplotlib. Install it with pip install "mlx-benchmark[plot]".
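The --config models.yml invocation above reads the models to benchmark from a YAML file. The schema is not documented on this page, so the following is only a hypothetical sketch of what such a file might contain (all field names are assumptions):

```yaml
# Hypothetical models.yml -- field names are illustrative, not documented.
models:
  - name: llama3.2
    provider: ollama
  - name: claude-sonnet-4-20250514
    provider: anthropic
judge_model: gemma4
limit: 50
```

Check the package's own documentation or source for the actual config keys before relying on this layout.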

All Options

--model MODEL          Model name (comma-separated for multiple)
--provider PROVIDER    ollama | anthropic | openai | groq | openrouter
--judge-model MODEL    Judge model (default: same as --model)
--judge-provider PROVIDER
                       Judge provider (default: same as --provider)
--dataset PATH         Custom dataset JSONL (default: bundled v2)
--output-dir DIR       Where to save results (default: ./results)
--max-tokens N         Max response tokens (default: 1024)
--temperature T        Sampling temperature (default: 0.0)
--limit N              Limit number of samples
--categories [...]     Filter by category
--difficulties [...]   Filter by difficulty (easy, medium, hard, very-hard)
--types [...]          Filter by type (qa, fill_blank, mcq, true_false, coding, debug)
--rate-limit SECS      Delay between API calls (default: 0.5)
--host URL             Ollama host (default: http://localhost:11434)
--api-key KEY          API key for cloud providers
--base-url URL         Custom base URL for OpenAI-compatible APIs
--latex                Generate LaTeX table from results
--plot                 Generate PNG bar chart from results
--results [FILES ...]  Result JSON files to export (default: all in --output-dir)

Python API

from mlx_benchmark import run_benchmark

results, stats = run_benchmark(
    model="llama3.2",
    provider="ollama",
    limit=20,
    types=["coding", "debug"],
)

print(f"Accuracy: {stats.accuracy:.1f}%")

Export results programmatically:

from mlx_benchmark import load_result_files, generate_latex_table, generate_plot

data = load_result_files(["results/bench_ollama_llama3-2_20260412.json"])
latex = generate_latex_table(data, output_dir="results")
generate_plot(data, output_dir="results")

Output

Results are saved as JSON files in the output directory with:

  • Per-question scores (correct/incorrect)
  • Aggregate accuracy by type, difficulty, and category
  • Model answers alongside reference answers for review

Example result filename: bench_ollama_llama3-2_20260412_220855.json
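Because results are plain JSON, they are easy to post-process with the standard library alone. The field names below are assumptions for illustration (the real files may use different keys), so this is a minimal round-trip sketch rather than the actual file format:

```python
import json

# Hypothetical shape of a saved result file; real field names may differ.
sample = {
    "model": "llama3.2",
    "provider": "ollama",
    "accuracy": 72.5,
    "by_difficulty": {"easy": 90.0, "medium": 75.0, "hard": 55.0},
}

# In practice you would json.load() one of the files from ./results/.
loaded = json.loads(json.dumps(sample))
print(f"{loaded['model']}: {loaded['accuracy']:.1f}% overall")
```

Inspect one of your own result files first to confirm the actual keys before scripting against them.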

Dataset

The bundled dataset (dataset_v2.jsonl) contains 441 questions across 6 types:

Type        Description
qa          Knowledge questions about MLX APIs
mcq         Multiple choice
true_false  True/false statements
fill_blank  Code completion tasks
coding      Full code writing tasks
debug       Identify and fix bugs in MLX code

The questions span 11 categories and 4 difficulty levels (easy, medium, hard, very-hard).
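JSONL files store one JSON object per line, so a custom dataset for --dataset can be filtered or inspected with a few lines of standard-library Python. The record fields below ("type", "difficulty", "category", "question") are assumptions based on the filter flags above, not a documented schema:

```python
import json

# Two hypothetical dataset records; dataset_v2.jsonl's real fields may differ.
lines = [
    '{"type": "qa", "difficulty": "easy", "category": "mlx_core", '
    '"question": "What does mx.eval do?"}',
    '{"type": "coding", "difficulty": "hard", "category": "mlx_nn", '
    '"question": "Implement a linear layer."}',
]

# Parse each line as an independent JSON object, then filter by type.
records = [json.loads(line) for line in lines]
coding = [r for r in records if r["type"] == "coding"]
print(len(coding))  # 1
```

The same pattern works for filtering by difficulty or category when building a custom dataset file.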

License

MIT
