Evaluation framework and CLI runner for LiteRT LM and native models.

🎯 AI Edge Eval

An advanced evaluation framework and CLI runner for LiteRT LM and native models.

ai-edge-eval evaluates LiteRT LM models and standard native models (e.g., HuggingFace) from the command line. It targets POSIX-compliant systems and officially supports Linux, macOS, and Windows (via WSL2), covering both single-modality (text) and multi-modality (vision + text) use cases.

📖 Table of Contents

🚀 Installation
⚡ Running Evaluations
- LiteRT LM Runners
- Direct Native Library Runners
🛠️ Custom Task CUJ
🔍 Discovery Commands
⚖️ Dataset Licensing and Terms of Use

🚀 Installation

📋 System Requirements

ai-edge-eval requires a POSIX-compliant Unix-like environment. The following platforms are officially supported:

Linux: Standard distributions (e.g., Ubuntu, Debian).
macOS: Both Intel and Apple Silicon (M-series) architectures.
Windows: Supported exclusively via the Windows Subsystem for Linux (WSL2). (Note: Native Windows execution via CMD or PowerShell is not supported.)

We support installation using either uv (recommended for ultra-fast dependency resolution) or standard pip within a virtual environment (Python 3.10+).
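As a quick pre-flight check, the requirements above (a POSIX environment and Python 3.10 or newer) can be tested from Python itself. This is a minimal sketch; the helper name `check_environment` and its return shape are illustrative assumptions and are not part of ai-edge-eval.

```python
import os
import sys


def check_environment(os_name=os.name, version=sys.version_info[:2]):
    """Return a list of problems; an empty list means the basics look fine.

    Hypothetical helper for illustration; ai-edge-eval does not ship it.
    """
    problems = []
    # os.name is "posix" on Linux, macOS, and inside WSL2; it is "nt" on
    # native Windows (CMD or PowerShell).
    if os_name != "posix":
        problems.append(f"unsupported OS family: {os_name!r} (use WSL2 on Windows)")
    # The README asks for Python 3.10 or newer.
    if tuple(version) < (3, 10):
        problems.append(f"Python {version[0]}.{version[1]} is older than 3.10")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The parameters default to the running interpreter, so calling `check_environment()` with no arguments checks the current environment; passing explicit values is only useful for testing the helper.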

Option 1: Use `uv` (Recommended)

[!TIP] uv is an extremely fast Python package manager written in Rust. Using it significantly speeds up environment creation and dependency installation.

1. Create and Activate Virtual Environment

# Create a virtual environment with Python 3.13 in the current directory.
uv venv --clear --python=3.13 --seed
source .venv/bin/activate

2. Install `ai-edge-eval`

Option A: Install from PyPI

# Install the package into the active virtual environment
uv pip install -q ai-edge-eval

Option B: Install from Local Clone (Recommended for Development)

git clone https://github.com/google-ai-edge/eval.git
cd eval

# Install in editable mode inside the active virtual environment
uv pip install -e .

Option 2: Use Standard `pip`

1. Create and Activate Virtual Environment

# Create and activate a Python virtual environment
python3 -m venv .venv
source .venv/bin/activate

2. Install `ai-edge-eval`

Option A: Install from PyPI

pip install -q ai-edge-eval

Option B: Install from Local Clone

git clone https://github.com/google-ai-edge/eval.git
cd eval

# Install in editable mode
pip install -e .
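Whichever option you used, you may want to confirm that the package landed in the active virtual environment. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is an illustrative assumption, and the distribution name is taken from the install commands above.

```python
from importlib import metadata


def installed_version(dist_name="ai-edge-eval"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"ai-edge-eval: {version or 'not installed in this environment'}")
```

Run it with the interpreter from the virtual environment you activated; a different interpreter will report on its own site-packages instead.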

📦 Optional Dependency Groups

The base installation includes full support for LiteRT-LM evaluation out of the box. To run native PyTorch/HuggingFace models, install the optional dependency groups:

Using `uv` (Recommended)

# Install HuggingFace native runner support (includes PyTorch)
uv pip install "ai-edge-eval[hf]"

# Install HuggingFace multimodal runner support (includes TorchVision)
uv pip install "ai-edge-eval[hf-multimodal]"

# Install everything for local evaluation
uv pip install "ai-edge-eval[all]"

Using Standard `pip`

# Install HuggingFace native runner support (includes PyTorch)
pip install "ai-edge-eval[hf]"

# Install HuggingFace multimodal runner support (includes TorchVision)
pip install "ai-edge-eval[hf-multimodal]"

# Install everything for local evaluation
pip install "ai-edge-eval[all]"

[!NOTE] Quotes around package names with brackets (e.g., "ai-edge-eval[hf]") prevent shell globbing issues in Zsh and Bash.
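To see why the quotes matter, note that a shell reads an unquoted `[hf]` as a character class. Zsh aborts with "no matches found" when such a pattern matches no file, while Bash by default passes it through unchanged unless a matching file happens to exist. The sketch below uses Python's `fnmatch` as an approximation of shell glob matching (it is not the shell's own implementation) and `shlex.quote` to produce a safely quoted argument.

```python
import fnmatch
import shlex

spec = "ai-edge-eval[hf]"

# Read as a glob, "[hf]" is a character class matching a single "h" or "f",
# so the pattern matches these names rather than the literal spec.
print(fnmatch.fnmatchcase("ai-edge-evalh", spec))     # True
print(fnmatch.fnmatchcase("ai-edge-eval[hf]", spec))  # False

# shlex.quote wraps the spec in single quotes because "[" and "]" are not
# shell-safe characters; the result can be pasted into Bash or Zsh.
quoted = shlex.quote(spec)
print(quoted)  # 'ai-edge-eval[hf]'
```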

⚡ Running Evaluations

ai-edge-eval provides high-performance runners for both LiteRT models and native HuggingFace models.

🤖 LiteRT LM Runners

Text Sampling

Run evaluation on standard text benchmarks like ifeval and bbh:

ai-edge-eval \
      --runner litert-lm \
      --model-path /path/to/model.litertlm \
      --device cpu \
      --tasks ifeval \
      --tasks bbh \
      --framework lm-eval \
      --limit 10 \
      --output-dir your_result_directory
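When you script several runs, it can help to assemble the command as an argument list instead of a shell string. The following is a minimal sketch that uses only the flags shown in the command above; the helper name `build_eval_command` and its defaults are illustrative assumptions, not an API provided by ai-edge-eval.

```python
import shlex


def build_eval_command(runner, model_path, tasks, *, device="cpu",
                       framework="lm-eval", limit=None, output_dir="results"):
    """Assemble an ai-edge-eval argument list from the flags used in this README."""
    cmd = ["ai-edge-eval", "--runner", runner, "--model-path", model_path,
           "--device", device]
    # The README repeats --tasks once per benchmark, so do the same here.
    for task in tasks:
        cmd += ["--tasks", task]
    cmd += ["--framework", framework]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    cmd += ["--output-dir", output_dir]
    return cmd


cmd = build_eval_command("litert-lm", "/path/to/model.litertlm",
                         ["ifeval", "bbh"], limit=10,
                         output_dir="your_result_directory")
# Print a copy-pasteable shell command for inspection.
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.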

Text Scoring

Run evaluation on standard multiple-choice scoring benchmarks like piqa:

ai-edge-eval \
      --runner litert-lm \
      --model-path /path/to/model.litertlm \
      --device cpu \
      --tasks piqa \
      --framework lm-eval \
      --limit 10 \
      --output-dir your_result_directory

Multimodal Sampling

Run multimodal sampling using vision capabilities (e.g., on mmmu_val):

ai-edge-eval \
      --runner litert-lm \
      --model-path /path/to/model.litertlm \
      --device cpu \
      --runner-args "vision_backend=cpu" \
      --tasks mmmu_val \
      --framework lm-eval \
      --limit 10 \
      --output-dir your_result_directory

🚀 Direct Native Library Runners (HuggingFace, etc.)

Text Evaluation

Run evaluation natively using direct library wrappers via lm-eval:

ai-edge-eval \
      --runner hf \
      --model-path huggingface/repo \
      --device cpu \
      --tasks mmlu \
      --framework lm-eval \
      --limit 10 \
      --output-dir your_result_directory

Multimodal Evaluation

Run multimodal evaluation natively using direct library wrappers via lm-eval:

ai-edge-eval \
      --runner hf-multimodal \
      --model-path huggingface/repo \
      --device cpu \
      --tasks mmmu_val \
      --framework lm-eval \
      --limit 10 \
      --batch-size 1 \
      --output-dir your_result_directory

[!IMPORTANT] For HuggingFace runners, huggingface/repo refers to the HuggingFace model ID, such as Qwen/Qwen2.5-7B-Instruct or google/gemma-3-270m.
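The HuggingFace runners above rely on the optional dependency groups from the installation section. Before starting a long evaluation, you can check whether the expected modules are importable. This is a sketch under assumptions: the mapping below is inferred from this README's description of the extras (`hf` includes PyTorch, `hf-multimodal` includes TorchVision) and may not list every dependency the runners need.

```python
from importlib import util

# Assumed mapping from runner name to importable top-level modules, inferred
# from the extras described in this README; not an official list.
RUNNER_MODULES = {
    "hf": ["torch"],
    "hf-multimodal": ["torch", "torchvision"],
}


def missing_modules(module_names):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in module_names if util.find_spec(name) is None]


if __name__ == "__main__":
    for runner, modules in RUNNER_MODULES.items():
        missing = missing_modules(modules)
        status = "ok" if not missing else f"missing {', '.join(missing)}"
        print(f"{runner}: {status}")
```

`find_spec` only locates the module without importing it, so the check stays fast even when PyTorch is installed.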

🛠️ Custom Task CUJ

ai-edge-eval makes it seamless to define and run custom evaluation benchmarks tailored to your specific datasets and metrics.

1. Prepare the Dataset

Prepare your evaluation dataset in JSON Lines (.jsonl) format, where each entry separates the input context (messages) and the expected output (ground_truth), along with optional metadata.

[!NOTE] The messages field strictly follows the canonical OpenAI Chat Completion format (a list of dictionaries specifying role and content).

{"messages": [{"role": "user", "content": "What is the capital of France?"}], "ground_truth": "Paris"}
{"messages": [{"role": "user", "content": "Calculate 5 + 7"}], "ground_truth": "12"}
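JSON Lines stores one complete JSON object per line, so it is usually easier to generate the file than to write it by hand. The sketch below validates rows against the layout described in this step and serializes them. The helper names are illustrative, and treating both `messages` and `ground_truth` as required is an assumption drawn from the description above, not a documented schema.

```python
import json


def validate_row(row):
    """Check one dataset row against the layout described in this README."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("'messages' must be a non-empty list")
    for message in messages:
        # Each message is a dictionary with at least 'role' and 'content'.
        if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
            raise ValueError("each message needs 'role' and 'content'")
    if "ground_truth" not in row:
        raise ValueError("'ground_truth' is missing")


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one compact JSON object per line."""
    lines = []
    for row in rows:
        validate_row(row)
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines) + "\n"


rows = [
    {"messages": [{"role": "user", "content": "What is the capital of France?"}],
     "ground_truth": "Paris"},
    {"messages": [{"role": "user", "content": "Calculate 5 + 7"}],
     "ground_truth": "12"},
]
text = to_jsonl(rows)
# Write with: pathlib.Path("dataset.jsonl").write_text(text, encoding="utf-8")
print(text, end="")
```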

2. Task Definition

To run custom evaluation benchmarks, register your generation parameters and evaluation hooks via a Python file (e.g., register_custom_tasks.py):

# File: register_custom_tasks.py

from typing import Iterator
from model_eval.config.generation_config import GenerationConfig
from model_eval.custom_tasks.base import CustomTask, DatasetRow
from model_eval.custom_tasks.registry import TaskRegistry

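def exact_match(
    preds: Iterator[str], gts: Iterator[str], rows: Iterator[DatasetRow[str]]
) -> dict[str, float]:
  # Normalize predictions and ground truths (trim whitespace, lowercase).
  # `rows` is part of the metric signature but is not used by this metric.
  p = [text.strip().lower() for text in preds]
  g = [text.strip().lower() for text in gts]
  # Avoid dividing by zero when there are no predictions.
  if not p:
    return {"exact_match": 0.0}
  # Fraction of predictions equal to their ground truth after normalization.
  accuracy = sum(pi == gi for pi, gi in zip(p, g)) / len(p)
  return {"exact_match": accuracy}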

qa_task = CustomTask(
    name="my_custom_qa",
    dataset="path/to/dataset.jsonl",
    metric_fn=exact_match,
    generation_config=GenerationConfig(
        temperature=0.5, max_new_tokens=64, stop_sequences=["\n"]
    )
)

TaskRegistry.global_registry().register(qa_task)
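Before running a full evaluation, it can be worth exercising the metric function on a few hand-written strings. The sketch below is a standalone copy of the exact-match logic with the `model_eval` imports and type hints removed and a guard for empty input added, so it runs without the package installed; the sample predictions are made up for illustration.

```python
def exact_match(preds, gts, rows=None):
    """Standalone copy of the exact-match metric, without model_eval imports."""
    p = [text.strip().lower() for text in preds]
    g = [text.strip().lower() for text in gts]
    # Guard against an empty prediction list to avoid dividing by zero.
    if not p:
        return {"exact_match": 0.0}
    return {"exact_match": sum(pi == gi for pi, gi in zip(p, g)) / len(p)}


# Made-up predictions and references, chosen to exercise the normalization:
# leading spaces, a trailing newline, and mixed case should not count as misses.
preds = ["  Paris", "13", "BERLIN\n"]
gts = ["paris", "12", "Berlin"]
result = exact_match(preds, gts)
print(result)
```

Two of the three pairs match after normalization, so the reported score is two thirds.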

3. Run Custom Evaluation

Point the CLI to your custom registration file authored in Step 2 using the --custom-tasks-file flag:

ai-edge-eval \
      --runner litert-lm \
      --runner-args "model_path=/path/to/model.litertlm,backend=cpu" \
      --tasks my_custom_qa \
      --framework custom \
      --custom-tasks-file register_custom_tasks.py \
      --eval-args "limit=10" \
      --output-dir your_result_directory

🔍 Discovery Commands

ai-edge-eval includes built-in discovery utilities to help you explore supported configurations, tasks, and runners.

Argument Discovery

Use the list-args subcommand to inspect the available configurations and parameters exposed by a given runner or evaluation framework:

# Discover runner arguments
ai-edge-eval list-args --runner litert-lm

# Discover evaluation framework arguments
ai-edge-eval list-args --framework lm-eval

Supported Tasks and Runners

Use the list-tasks and list-runners subcommands to view the allowlist of supported tasks and runners for a given framework:

# List supported tasks for a framework
ai-edge-eval list-tasks --framework lm-eval

# List supported runners for a framework
ai-edge-eval list-runners --framework lm-eval
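The discovery subcommands can also be called from a script, for example to log the supported tasks alongside your results. This is a minimal sketch: the helper name `run_discovery` is an illustrative assumption, and because this README does not document the output format, the text is returned as-is without parsing.

```python
import shutil
import subprocess


def run_discovery(subcommand, flag, value, executable="ai-edge-eval"):
    """Run a discovery subcommand and return its stdout, or None if the CLI is absent."""
    # Skip the call when the CLI is not on PATH (e.g., the venv is not active).
    if shutil.which(executable) is None:
        return None
    completed = subprocess.run(
        [executable, subcommand, flag, value],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


if __name__ == "__main__":
    for subcommand in ("list-tasks", "list-runners"):
        output = run_discovery(subcommand, "--framework", "lm-eval")
        print(f"== {subcommand} ==")
        print(output if output is not None else "ai-edge-eval not found on PATH")
```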

⚖️ Dataset Licensing and Terms of Use

ai-edge-eval is an evaluation runner and command-line toolkit licensed under the Apache 2.0 License.

Third-Party Dataset Integration

[!WARNING] When executing benchmark evaluations, ai-edge-eval relies on upstream execution frameworks (such as EleutherAI's lm-eval harness) to dynamically download and cache evaluation datasets from external sources (e.g., HuggingFace Hub). ai-edge-eval does not host, redistribute, or sublicense these external datasets.

User Responsibility

Every evaluation dataset maintains its own licensing terms, ownership rights, and permitted usage policies (including potential non-commercial restrictions).

[!IMPORTANT] By executing evaluations using ai-edge-eval, you are responsible for:

- Reviewing and consenting to the specific terms of service and license agreement associated with each evaluated benchmark.
- Adhering to any commercial or distribution constraints associated with the underlying data.

For detailed licensing information regarding specific datasets, refer to their respective model and dataset cards on the HuggingFace Hub or official repository pages.

Project details

Release history

0.0.1 (this version): May 19, 2026
0.0.1.dev2026051901 (pre-release): May 19, 2026

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

ai_edge_eval-0.0.1-py3-none-any.whl (78.7 kB), uploaded May 19, 2026, Python 3

File details

Details for the file ai_edge_eval-0.0.1-py3-none-any.whl.

File metadata

Download URL: ai_edge_eval-0.0.1-py3-none-any.whl
Upload date: May 19, 2026
Size: 78.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for ai_edge_eval-0.0.1-py3-none-any.whl
SHA256: `ca2755aaf5cfffc44737f79973fcbf8fd31328178dc15125d6b2f37beaaabcaa`
MD5: `f3b6320757189359a935386d00f7a203`
BLAKE2b-256: `6c097b41757d97b3443eda9b822fdf10af027617ef46cf0faf0f92057ddab750`
