Multi-engine LLM benchmark & monitoring CLI for Apple Silicon

asiai

Apple Silicon AI — Multi-engine LLM benchmark & monitoring CLI

asiai compares inference engines side-by-side on your Mac. Load the same model on Ollama and LM Studio, run asiai bench, get the numbers. No guessing, no vibes — just tok/s, TTFT, power efficiency, and stability per engine.

Share your results with the community (--share), compare against other Apple Silicon users (asiai compare), and get smart engine recommendations (asiai recommend).

Born from the OpenClaw project, where we needed hard data to pick the fastest engine for multi-agent swarms on Mac Mini M4 Pro.

Quick start

pipx install asiai        # Recommended: isolated install

Or via Homebrew:

brew tap druide67/tap
brew install asiai

Other options:

uvx asiai detect           # Run without installing (requires uv)
pip install asiai           # Standard pip install

Then benchmark and share:

asiai bench --quick --card --share    # Bench + shareable card in ~15 seconds

Commands

asiai detect

Auto-detect running inference engines across 7 ports.

$ asiai detect

Detected engines:

  ● ollama 0.17.4
    URL: http://localhost:11434

  ● lmstudio 0.4.5
    URL: http://localhost:1234
    Running: 1 model(s)
      - qwen3.5-35b-a3b  MLX
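How asiai detects engines internally is not shown here. As a rough sketch of the general idea, the snippet below probes the default ports listed under "Supported engines" with a plain TCP connect. The names `DEFAULT_PORTS`, `port_is_open`, and `candidate_engines` are illustrative, not part of asiai. Several engines share a default port, so a real detector would presumably also query each engine's API to identify it and read its version.

```python
import socket

# Default ports taken from the "Supported engines" table. Several engines
# share a port, so an open port alone does not identify the engine.
DEFAULT_PORTS = {
    11434: ["Ollama"],
    1234: ["LM Studio"],
    8080: ["mlx-lm", "llama.cpp"],
    8000: ["oMLX", "vllm-mlx", "vMLX"],
    52415: ["Exo"],
}


def port_is_open(host: str, port: int, timeout: float = 0.3) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def candidate_engines(host: str = "127.0.0.1") -> dict[int, list[str]]:
    """Map each open default port to the engines that could be behind it."""
    return {
        port: names
        for port, names in DEFAULT_PORTS.items()
        if port_is_open(host, port)
    }
```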

asiai bench

Cross-engine benchmark with standardized prompts. Runs 3 iterations per prompt by default, reports median tok/s (SPEC standard) with stability classification.

$ asiai bench -m qwen3.5 --runs 3 --power

  Mac Mini M4 Pro — Apple M4 Pro  RAM: 64.0 GB (42% used)  Pressure: normal

Benchmark: qwen3.5

  Engine       tok/s (±stddev)    Tokens   Duration     TTFT       VRAM    Thermal
  ────────── ───────────────── ───────── ────────── ──────── ────────── ──────────
  lmstudio    72.6 ± 0.0 (stable)   435    6.20s    0.28s        —    nominal
  ollama      30.4 ± 0.1 (stable)   448   15.28s    0.25s   26.0 GB   nominal

  Winner: lmstudio (2.4x faster)
  Power: lmstudio 13.2W (5.52 tok/s/W) — ollama 16.0W (1.89 tok/s/W)
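The Winner and Power lines follow from the table by simple division. Here is a minimal sketch of that arithmetic using the rounded figures printed above. The function names are illustrative, not asiai's API. The tool presumably works from unrounded measurements, so the last digit can differ slightly from the displayed 5.52 and 1.89.

```python
def speedup(fast_tok_s: float, slow_tok_s: float) -> float:
    """How many times faster the first engine generates tokens."""
    return fast_tok_s / slow_tok_s


def tokens_per_watt(tok_s: float, watts: float) -> float:
    """Energy efficiency: tokens per second delivered per watt drawn."""
    return tok_s / watts


# Rounded values copied from the sample output above.
print(f"speedup:  {speedup(72.6, 30.4):.1f}x")                  # 2.4x
print(f"lmstudio: {tokens_per_watt(72.6, 13.2):.2f} tok/s/W")   # 5.50
print(f"ollama:   {tokens_per_watt(30.4, 16.0):.2f} tok/s/W")   # 1.90
```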

Options:

-m, --model MODEL          Model to benchmark (default: auto-detect)
-e, --engines LIST         Filter engines (e.g. ollama,lmstudio,mlxlm)
-p, --prompts LIST         Prompt types: code, tool_call, reasoning, long_gen
-r, --runs N               Runs per prompt (default: 3, for median + stddev)
    --power                Cross-validate power with sudo powermetrics (IOReport always-on)
    --context-size SIZE    Context fill prompt: 4k, 16k, 32k, 64k
    --share                Share results with the community (anonymous, opt-in)
-Q, --quick                Quick benchmark: 1 prompt, 1 run (~15 seconds)
    --card                 Generate shareable benchmark card (SVG + PNG with --share)
-H, --history PERIOD       Show past benchmarks (e.g. 7d, 24h)
    --agentic-mode         Run the 8-run agentic prefix-cache-reuse protocol
    --agentic-output FILE  Save agentic-mode results as JSON
    --agentic-skip-long    Skip phases 7-8 (50K context) to save ~10 min
    --agentic-only LIST    Run only specified phases (cold,prefix-test-1,...)
    --code                 Dev-quality eval: tool-call, recovery, thinking, coding
    --code-suite LIST      tool-call[-stress],recovery,thinking[,coding[-hard]]
    --instruct             Instruction-following: IFEval-style verifiable + agentic deliverable
    --instruct-scenario L  verifiable,research-brief[,order-control]
    --language CODE        Multilingual retention eval (fr/de/es/it/pt/ja/ko/zh)
    --language-suite LIST  adherence,diacritics[,fluency] (default: deterministic 2)
    --judge-url URL        OpenAI-compat LLM judge for the 'coding'/'fluency' suites

Agentic mode — measuring prefix cache reuse

asiai bench --agentic-mode --url http://localhost:8080 --model my-model \
    --agentic-output bench.json

Runs 8 sequential prompts with a fixed long system message and varying user messages to expose how the engine reuses cached prefix tokens. It reads cached_tokens from the streaming usage data when the engine exposes it (llama.cpp, mlx-lm) and falls back to the TTFT ratio otherwise. The output is a verdict, prefix_cache_reuse: yes | partial | no. This is the metric that matters when your workload is multi-turn and agentic with shared system prompts.
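The two signals described above can each be reduced to a three-way verdict. The sketch below shows one way to do that. Every threshold in it is a hypothetical placeholder chosen for illustration, since asiai's actual cut-offs are not documented here, and the function names are likewise invented.

```python
from statistics import median


def verdict_from_cached_tokens(cached: int, prompt: int) -> str:
    """Classify reuse from the fraction of prompt tokens served from cache."""
    ratio = cached / prompt if prompt else 0.0
    if ratio >= 0.8:    # hypothetical threshold
        return "yes"
    if ratio >= 0.2:    # hypothetical threshold
        return "partial"
    return "no"


def verdict_from_ttft(cold_ttft: float, warm_ttfts: list[float]) -> str:
    """Fallback: compare the median warm-run TTFT against the cold run."""
    ratio = median(warm_ttfts) / cold_ttft
    if ratio <= 0.3:    # hypothetical threshold
        return "yes"
    if ratio <= 0.7:    # hypothetical threshold
        return "partial"
    return "no"
```

A warm TTFT far below the cold TTFT suggests the shared system prompt was not reprocessed. A ratio near 1.0 suggests the prefix was recomputed on every turn.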

Quality modes — measuring quality, not just speed

Throughput is not quality. Three deterministic modes (no LLM judge needed for the core signal) measure whether a model is actually usable for real work:

# Dev quality: tool-call reliability (the JSON arg-truncation / empty-object bug),
# agentic error-recovery, thinking discipline — + an optional LLM-judged coding task.
asiai bench --code --url http://localhost:8080 --code-output code.json

# Instruction-following: IFEval-style verifiable instructions (format/length/
# keywords/case…) + an agentic task — does the model produce the primary
# multi-section deliverable AFTER a tool sequence, or only confirm the last step?
asiai bench --instruct --url http://localhost:8080 --instruct-output instruct.json

# Multilingual retention: did a finetune keep the base model's language?
# Adherence (stays in the language) + diacritics (café stays café), 8 languages.
asiai bench --language fr --url http://localhost:8080 --language-output lang.json

--code scores tool-call validity, the empty-object truncation bug, schema conformance, and error recovery deterministically; add --code-suite coding with --judge-url <openai-compat-endpoint> for an LLM-judged code-quality grade (no SDK is bundled; the API key is read from the environment). --instruct runs IFEval-style verifiable instructions (strict + loose, prompt- and instruction-level) plus a tools-then-deliverable scenario that catches a finetune doing the tool work but skipping the primary written output. --language measures language adherence and orthography retention, the catastrophic-forgetting signatures a task-specific finetune can introduce. All three modes emit JSON only, so you compare models by diffing the output files. See Dev-quality benchmarks.
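Diffing two result files can be done with any JSON-aware tool. As a minimal stdlib sketch, the helper below flattens each document into dotted paths and reports the leaves that differ. The JSON layout of the quality-mode output is not specified here, so the code makes no assumptions about key names, and `flatten` and `diff_results` are illustrative names.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {"a.b.0": value} pairs."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    out = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        out.update(flatten(value, path))
    return out


def diff_results(a, b):
    """Return {path: (value_in_a, value_in_b)} for every leaf that differs."""
    fa, fb = flatten(a), flatten(b)
    return {
        path: (fa.get(path), fb.get(path))
        for path in sorted(fa.keys() | fb.keys())
        if fa.get(path) != fb.get(path)
    }


def diff_files(path_a: str, path_b: str):
    """Load two result files (e.g. code.json from two models) and diff them."""
    with open(path_a) as fa, open(path_b) as fb:
        return diff_results(json.load(fa), json.load(fb))
```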

Cross-model comparison — benchmark multiple models in one run and get a ranked summary:

# Cross-model comparison
asiai bench --compare qwen3.5:4b deepseek-r1:7b -e ollama --card

The runner resolves model names across engines automatically — gemma2:9b (Ollama) and gemma-2-9b (LM Studio) are matched as the same model.
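The resolver's actual rules are not described here. One simple heuristic that happens to match the example above is to lowercase both names and drop every non-alphanumeric character before comparing. The sketch below shows that heuristic only. `canonical` and `same_model` are hypothetical names, and asiai may well use a different or stricter scheme.

```python
import re


def canonical(name: str) -> str:
    """Lowercase the name and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def same_model(a: str, b: str) -> bool:
    """Treat two engine-specific names as equal if their canonical forms match."""
    return canonical(a) == canonical(b)


print(same_model("gemma2:9b", "gemma-2-9b"))            # True
print(same_model("qwen3.5:35b-a3b", "qwen3.5-35b-a3b")) # True
print(same_model("qwen3.5:4b", "deepseek-r1:7b"))       # False
```

A heuristic this loose can over-match (for example, it cannot tell `3.5` from `35`), which is one reason a production resolver would likely need extra rules.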

asiai models

List loaded models across all engines. Use --json for machine-readable output.

$ asiai models

ollama  http://localhost:11434
  ● qwen3.5:35b-a3b                             26.0 GB Q4_K_M

lmstudio  http://localhost:1234
  ● qwen3.5-35b-a3b                                 MLX

asiai monitor

System and inference metrics snapshot, stored in SQLite. Use --json for machine-readable output.

$ asiai monitor

System
  Uptime:    3d 12h
  CPU Load:  2.45 / 3.12 / 2.89  (1m / 5m / 15m)
  Memory:    45.2 GB / 64.0 GB  71%
  Pressure:  normal
  Thermal:   nominal  (100%)

Inference  ollama 0.17.4
  Models loaded: 1  VRAM total: 26.0 GB

  Model                                        VRAM   Format  Quant
  ──────────────────────────────────────── ────────── ──────── ──────
  qwen3.5:35b-a3b                            26.0 GB     gguf Q4_K_M

Options:

-w, --watch SEC            Refresh every SEC seconds
-q, --quiet                Collect and store without output (for daemon use)
    --json                 Output as JSON (for scripting)
-H, --history PERIOD       Show history (e.g. 24h, 1h)
-a, --analyze HOURS        Comprehensive analysis with trends
-c, --compare TS TS        Compare two timestamps
    --alert-webhook URL    POST alerts on state transitions (memory, thermal, engine down)

asiai doctor

Diagnose installation, engines, system health, and database.

$ asiai doctor

Doctor

  System
    ✓ Apple Silicon       Mac Mini M4 Pro — Apple M4 Pro
    ✓ RAM                 64 GB total, 42% used
    ✓ Memory pressure     normal
    ✓ Thermal             nominal (100%)

  Engine
    ✓ Ollama              v0.17.4 — 1 model(s): qwen3.5:35b-a3b
    ✓ LM Studio           v0.4.5 — 1 model(s): qwen3.5-35b-a3b
    ✗ mlx-lm              not installed
    ✗ llama.cpp            not installed
    ✗ vllm-mlx            not installed

  Database
    ✓ SQLite              2.4 MB, last entry: 1m ago

  7 ok, 0 warning(s), 3 failed

asiai versions

Line up each engine's running, installed, and available versions and flag what's behind — including the post-upgrade trap where a live process predates the binary you just upgraded (running-stale).

asiai versions                   # offline: running/installed + brew outdated
asiai versions --check-upstream  # also query PyPI / GitHub (network, opt-in)
asiai versions --engine llamacpp # filter to one engine
asiai versions --json | jq

Engine versions

  ENGINE     RUNNING  INSTALLED  AVAILABLE  STATUS
  ─────────  ───────  ─────────  ─────────  ─────────────────
  llama.cpp  9370     9370       9380       upgrade-available
  Ollama     —        0.24.0     0.24.0     up-to-date

  1 upgrade(s) available

asiai doctor carries an offline recap of this, and asiai web exposes a /versions page with changelog links. Triggering an upgrade is a write and lives in aisctl upgrade <engine> (see docs/versions-mode.md).
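The STATUS column can be read as a small decision over the three version columns. The sketch below is one plausible reading, not asiai's implementation. It uses plain inequality instead of version ordering, and the precedence of running-stale over upgrade-available is an assumption. `None` stands for the "—" shown when no process is running.

```python
from typing import Optional


def version_status(
    running: Optional[str],
    installed: Optional[str],
    available: Optional[str],
) -> str:
    """Derive a status label from the running/installed/available versions."""
    # A live process that predates the installed binary (post-upgrade trap).
    if running is not None and installed is not None and running != installed:
        return "running-stale"
    # The installed binary differs from the newest known release.
    if installed is not None and available is not None and installed != available:
        return "upgrade-available"
    return "up-to-date"


# Rows from the sample table above.
print(version_status("9370", "9370", "9380"))    # upgrade-available
print(version_status(None, "0.24.0", "0.24.0"))  # up-to-date
```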

asiai daemon

Background monitoring via macOS launchd. Collects metrics every minute.

asiai daemon start              # Install and start the daemon
asiai daemon start --interval 30  # Custom interval (seconds)
asiai daemon status             # Check if running
asiai daemon logs               # View recent logs
asiai daemon stop               # Stop and uninstall

asiai web

Web dashboard with real-time monitoring, benchmark controls, and interactive charts. Requires pip install asiai[web].

asiai web                    # Opens browser at http://127.0.0.1:8899
asiai web --port 9000        # Custom port
asiai web --host 0.0.0.0     # Listen on all interfaces
asiai web --no-open          # Don't auto-open browser

Features: system overview, engine status, live benchmark with SSE progress, history charts, doctor checks, dark/light theme.

asiai fleet + asiai auth + aisctl fleet

Multi-host management across several Macs. Two phases ship today:

  • Phase 1 — read-only observability (in asiai). Each remote Mac runs asiai web --host 0.0.0.0; the orchestrator declares the nodes in ~/.config/asiai/fleet.json and polls each one's /api/v1/snapshot in parallel.
  • Phase 2 — authenticated writes (Bearer auth in asiai, aisctl serve + aisctl fleet push in asiai-inference-server). Issue purge, stop/start/restart, unload, install/uninstall, upgrade against remote nodes with rate-limited token auth and a per-call audit log.

# --- Read-only (Phase 1) -------------------------------------------
asiai fleet add studio --url http://192.0.2.10:8899 --role workstation
asiai fleet list
asiai fleet status               # parallel poll, aggregated table
asiai fleet status --json | jq   # machine-readable form
asiai fleet ping studio          # single-node check

# --- Writes (Phase 2) ----------------------------------------------
# On the node: initialize the auth surface (prints secret ONCE).
asiai auth init
# On the node: start the loopback companion that runs the commands.
aisctl serve &
# On the orchestrator: register the node with its secret.
asiai fleet add studio --url http://192.0.2.10:8899 --auth-token asai_...
# Issue a write.
aisctl fleet push studio purge
aisctl fleet push studio restart --engine ollama
aisctl fleet push studio unload --engine ollama --model llama3.2

The /fleet page in asiai web shows a card per node with HTMX auto-refresh every 10 seconds. Phase 3 will add mDNS Bonjour auto-discovery and TLS. Full guide: docs/fleet-mode.md.

⚠️ Phase 1 is unauthenticated read-only; Phase 2 requires Bearer tokens. Both are designed for trusted LANs (or LANs glued together by a VPN like Tailscale/WireGuard). No TLS is enforced between nodes in v1.8 — that's a Phase 3 deliverable.
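The Phase 1 polling loop described above (one GET to each node's /api/v1/snapshot, issued in parallel) can be sketched with the standard library alone. The layout of fleet.json is not shown in this document, so the `nodes` mapping of name to base URL is a stand-in. `fetch_snapshot` and `poll_fleet` are illustrative names, not asiai functions.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_snapshot(base_url: str, timeout: float = 3.0) -> dict:
    """GET <base_url>/api/v1/snapshot and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/api/v1/snapshot"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def poll_fleet(nodes: dict, fetch=fetch_snapshot) -> dict:
    """Poll every node concurrently; a failing node yields an error record."""

    def poll_one(item):
        name, url = item
        try:
            return name, {"ok": True, "snapshot": fetch(url)}
        except Exception as exc:  # unreachable node, timeout, invalid JSON, ...
            return name, {"ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        return dict(pool.map(poll_one, nodes.items()))


# Example call (address taken from the commands above):
# poll_fleet({"studio": "http://192.0.2.10:8899"})
```

Catching the error per node keeps one offline Mac from failing the whole poll, which matters for an aggregated status table.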

asiai leaderboard

Browse community benchmarks. Filter by chip or model.

asiai leaderboard                      # All results
asiai leaderboard --chip "M4 Pro"      # Filter by chip
asiai leaderboard --model qwen2.5      # Filter by model

asiai compare

Compare your local results against community medians.

asiai compare --chip "Apple M1 Max" --model qwen2.5:7b

asiai recommend

Get engine recommendations based on your hardware and benchmarks.

asiai recommend                                # Best engine for your Mac
asiai recommend --use-case latency             # Optimize for TTFT
asiai recommend --model qwen2.5 --community    # Include community data

asiai setup

Interactive setup wizard — detects hardware, engines, models, and suggests next steps.

asiai setup

asiai mcp

Start the MCP server for AI agent integration. 11 tools, 3 resources.

asiai mcp                          # stdio (Claude Code, Cursor)
asiai mcp --transport sse          # SSE (network agents)

asiai tui

Interactive terminal dashboard with auto-refresh. Requires pip install asiai[tui].

asiai tui

Benchmark Card — share your results

Generate a shareable benchmark card image with one flag:

asiai bench --card                    # SVG saved locally (zero dependencies)
asiai bench --card --share            # SVG + PNG via community API
asiai bench --quick --card --share    # Quick bench + card + share

A 1200x630 dark-themed card with your model, chip, specs banner (quantization, RAM, GPU cores, context size), engine comparison bar chart, winner highlight, and metric chips (tok/s, TTFT, power, engine version). Optimized for Reddit, X, Discord, and GitHub READMEs.

Every shared card includes asiai branding — the Speedtest.net model for local LLM inference.

Supported engines

Engine     Port   Install                                    API
Ollama     11434  brew install ollama                        Native
LM Studio  1234   brew install --cask lm-studio              OpenAI-compatible
mlx-lm     8080   brew install mlx-lm                        OpenAI-compatible
llama.cpp  8080   brew install llama.cpp                     OpenAI-compatible
oMLX       8000   brew tap jundot/omlx && brew install omlx  OpenAI-compatible
vllm-mlx   8000   pip install vllm-mlx                       OpenAI-compatible
vMLX       8000   pip install vmlx                           OpenAI-compatible
Exo        52415  pip install exo                            OpenAI-compatible

What it measures

Metric     Description
tok/s      Generation speed (tokens/sec), excluding prompt processing (TTFT)
TTFT       Time to first token — prompt processing latency
Power      GPU, CPU, ANE, DRAM power in watts (IOReport, no sudo)
tok/s/W    Energy efficiency — tokens per second per watt
Stability  Run-to-run variance: stable (CV<5%), variable (<10%), unstable (>10%)
VRAM       Memory footprint — native API (Ollama, LM Studio) or ri_phys_footprint estimate (all other engines)
Thermal    CPU throttling state and speed limit percentage
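The Stability row maps a coefficient of variation (CV, standard deviation as a percentage of the mean) onto three labels. Here is a minimal sketch of that mapping under two assumptions that the table does not settle: the sample standard deviation is used, and a CV of exactly 10% counts as unstable. The function names are illustrative.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples: list[float]) -> float:
    """Sample standard deviation expressed as a percentage of the mean."""
    if len(samples) < 2:
        return 0.0  # a single run carries no variance information
    return stdev(samples) / mean(samples) * 100


def classify_stability(samples: list[float]) -> str:
    """Apply the CV thresholds from the table above to tok/s samples."""
    cv = coefficient_of_variation(samples)
    if cv < 5:
        return "stable"
    if cv < 10:
        return "variable"
    return "unstable"


print(classify_stability([30.3, 30.4, 30.5]))  # stable
print(classify_stability([70.0, 75.0, 80.0]))  # variable
print(classify_stability([50.0, 70.0, 90.0]))  # unstable
```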

All metrics stored in SQLite (~/.local/share/asiai/metrics.db) with 90-day retention and automatic regression detection.

Benchmark methodology

Following MLPerf, SPEC CPU 2017, and NVIDIA GenAI-Perf standards:

  • Warmup: 1 non-timed generation per engine before measured runs
  • Runs: 3 iterations per prompt (configurable), median as primary metric
  • Sampling: temperature=0 (greedy decoding) for deterministic results
  • Power: Always-on via IOReport (no sudo). Per-engine, not session-wide average
  • Variance: Pooled intra-prompt stddev (isolates run-to-run noise)
  • Metadata: Engine version, model quantization, hardware chip, macOS version stored per result

See docs/benchmark-best-practices.md for the full conformance audit.
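To illustrate why the variance is pooled within each prompt, the sketch below applies the textbook pooled-variance formula, which weights each prompt's sample variance by its degrees of freedom. Whether asiai uses exactly this formula is an assumption, and `pooled_stddev` is an illustrative name. Prompts of different lengths run at different speeds, so a naive standard deviation over all runs would mix that between-prompt spread into the noise estimate.

```python
from math import sqrt
from statistics import stdev, variance


def pooled_stddev(runs_by_prompt: dict) -> float:
    """Pool per-prompt sample variances, weighted by degrees of freedom."""
    weighted_sum = 0.0
    dof = 0
    for samples in runs_by_prompt.values():
        if len(samples) < 2:
            continue  # a single run carries no variance information
        weighted_sum += (len(samples) - 1) * variance(samples)
        dof += len(samples) - 1
    return sqrt(weighted_sum / dof) if dof else 0.0


# Made-up tok/s values for two prompt types, three runs each.
runs = {"code": [70.0, 71.0, 72.0], "long_gen": [50.0, 51.0, 52.0]}
print(pooled_stddev(runs))                      # 1.0
print(stdev(runs["code"] + runs["long_gen"]))   # much larger: mixes prompts
```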

Benchmark prompts

Four standardized prompts test different generation patterns:

Name       Tokens  Tests
code       512     Structured code generation (BST in Python)
tool_call  256     JSON function calling / instruction following
reasoning  384     Multi-step math problem
long_gen   1024    Sustained throughput (bash script)

Use --context-size 4k|16k|32k|64k to test with large context fill prompts instead.

API & Prometheus

When running asiai web, the following REST API endpoints are available for programmatic access. Interactive API documentation (Swagger UI) is available at http://localhost:8899/docs.

Endpoint                    Description
GET /api/status             Lightweight health check (< 500ms) — engine reachability, memory pressure, thermal
GET /api/snapshot           Full system + engine snapshot with loaded models, VRAM, versions
GET /api/benchmarks         Benchmark results with tok/s, TTFT, power, context_size, engine_version
GET /api/engine-history     Engine status history (TCP, KV cache, tokens predicted)
GET /api/benchmark-process  Process CPU/RSS metrics from benchmark runs (7d retention)
GET /api/metrics            Prometheus exposition format — system, engine, model, benchmark gauges

Prometheus integration

# prometheus.yml
scrape_configs:
  - job_name: 'asiai'
    static_configs:
      - targets: ['localhost:8899']
    metrics_path: '/api/metrics'
    scrape_interval: 30s
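For a quick check without a Prometheus server, the text returned by /api/metrics can be parsed directly. The sketch below handles the basic line shape of the Prometheus text exposition format (`name{labels} value`) and assumes samples carry no trailing timestamp. The metric names in the example are hypothetical, since the names asiai actually exports are not listed here.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition lines into {series: value}.

    Assumes samples carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Hypothetical payload; real metric names may differ.
example = """\
# HELP demo_tok_per_sec Generation speed
# TYPE demo_tok_per_sec gauge
demo_tok_per_sec{engine="ollama"} 30.4
demo_memory_used_ratio 0.71
"""
print(parse_metrics(example))

# Against a running dashboard:
# from urllib.request import urlopen
# parse_metrics(urlopen("http://localhost:8899/api/metrics").read().decode())
```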

CLI JSON output

asiai monitor --json | jq '.mem_pressure'
asiai models --json | jq '.engines[].models[].name'
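The same fields can be read from Python without jq. This sketch shells out to the CLI and walks the decoded JSON. It relies only on the key paths used in the jq examples above (`.mem_pressure` and `.engines[].models[].name`); the rest of the output schema is not assumed. `run_json` and `model_names` are illustrative helper names.

```python
import json
import subprocess


def run_json(*args: str) -> dict:
    """Run an asiai subcommand with --json and decode its stdout."""
    proc = subprocess.run(
        ["asiai", *args, "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)


def model_names(models_doc: dict) -> list:
    """Equivalent of: jq '.engines[].models[].name'."""
    return [
        model["name"]
        for engine in models_doc["engines"]
        for model in engine["models"]
    ]


# Usage (requires asiai on PATH):
# print(run_json("monitor")["mem_pressure"])
# print(model_names(run_json("models")))
```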

Requirements

  • macOS on Apple Silicon (M1 / M2 / M3 / M4 families)
  • Python 3.11+
  • At least one inference engine running locally

Zero dependencies

The core uses only the Python standard library — urllib, sqlite3, subprocess, argparse. No requests, no psutil, no rich. Just stdlib.

Optional extras:

  • asiai[web] — FastAPI web dashboard with charts
  • asiai[tui] — Textual terminal dashboard
  • asiai[all] — Web + TUI
  • asiai[dev] — pytest, ruff

Roadmap

Version  Status   Scope
v0.1     Done     detect + bench + monitor + models (CLI, stdlib)
v0.2     Done     mlx-lm + doctor + daemon + TUI (Textual)
v0.3     Done     5 engines, power metrics, multi-run variance, regression detection
v0.4     Done     CI, MkDocs, export JSON, thermal drift, web dashboard
v0.5     Done     REST API, Prometheus /metrics, CLI --json, engine uptime tracking
v0.6     Done     Multi-service LaunchAgent (daemon start web), daemon status/logs/stop --all
v0.7     Done     Alert webhooks, LM Studio VRAM, Ollama config in doctor
v1.0     Done     Community Benchmark DB, smart recommendations, Exo engine, leaderboard
v1.0.1   Done     MCP server (11 tools), benchmark card, --quick mode, setup wizard, agent integration
v1.2     Done     Web dashboard redesign, shareable cards, Share on X/Reddit, community API
v1.3     Done     Dark theme, self-hosted fonts, universal VRAM (phys_footprint), power in Monitor/History
v1.7     Shipped  Fleet mode Phase 1 (multi-Mac read-only observability), asiai fleet CLI, /fleet web page
v1.8     Shipped  Fleet Phase 2 (cross-host writes with Bearer auth, rate limit, audit log), asiai auth CLI, aisctl serve + aisctl fleet push companions
v1.9+    Planned  Fleet Phase 3 (mDNS Bonjour auto-discovery, TLS/mTLS, TUI fleet panel, MCP write tools), macOS notifications

License

Apache 2.0
