Exgentic - General agent evaluation
Evaluate any agent on any benchmark in the simplest way possible
What is Exgentic?
Exgentic is a universal evaluation framework that enables standardized testing of AI agents across diverse benchmarks and domains. It provides a consistent interface for evaluating any agent on any benchmark, making it easy to compare performance, reproduce results, and ensure your agent works reliably across different tasks and environments.
Who is it for?
Exgentic serves multiple audiences in the AI agent ecosystem:
- General Audience - Visit www.exgentic.ai to explore the first general agent leaderboard comparing leading agents and frontier models across varied tasks.
- Agent Builders - Evaluate your agents comprehensively across multiple domains and benchmarks to ensure robust performance and identify areas for improvement.
- Researchers & Component Developers - Test general agentic components (such as memory systems, context compression, planning modules) across different agents and domains to validate their effectiveness and generalizability.
- Benchmark Builders - Evaluate your benchmark across multiple agents to ensure it provides meaningful differentiation and works reliably with different agent architectures.
Key Features
- Universal Evaluation Framework - Evaluate any agent on any benchmark with a consistent, standardized interface
- Multi-Agent Support - Built-in support for LiteLLM, SmolAgents, OpenAI MCP, and Claude Code agents
- Comprehensive Benchmarks - Industry-standard benchmarks including TAU2, AppWorld, SWE-bench, BrowseComp+, GSM8K, and HotpotQA
- Flexible LLM Integration - Works with any LLM provider through LiteLLM (OpenAI, Anthropic, and more)
- Multi-Interface - Python API, CLI, and GUI for different workflows and use cases
- Interactive Dashboard - Web-based interface for real-time monitoring and configuration
- Advanced Observability - Comprehensive logging with trajectory tracking, OpenTelemetry integration, and detailed traces
- Fine-Grained Control - Configure model parameters, run limits, and benchmark/agent-specific settings
- Structured Output - Organized directory structure with session-level and run-level artifacts for easy analysis
- Cost Tracking - Automatic tracking of API costs and performance metrics
- Scalable - Built-in caching, optimization, and monitoring features for scaled evaluations
- Extensibility - Modular architecture makes it easy to add new agents or benchmarks
Quick Start
Prerequisites
- Python 3.11 or higher
- Virtual environment (recommended)
Installation
Clone the repository:
git clone https://github.com/Exgentic/exgentic.git
cd exgentic
Set up your environment:
python3.11 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
Setup
Run the setup command for benchmarks and agents before first use:
Benchmarks:
exgentic setup --benchmark appworld
exgentic setup --benchmark tau2
Agents:
exgentic setup --agent smolagents
exgentic setup --agent openai
exgentic setup --agent claude
API Credentials
Configure your API keys for OpenAI, Anthropic, or any other provider supported by LiteLLM.
# For OpenAI
export OPENAI_API_KEY=... # your OpenAI API key
# Or for Anthropic
export ANTHROPIC_API_KEY=... # your Anthropic API key
Alternatively, create a .env file in the project root with your API keys. Exgentic will load them automatically.
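As a rough sketch of what automatic .env loading amounts to (the parsing rules below are assumptions for illustration, not Exgentic's actual implementation; the real loader likely uses python-dotenv):

```python
import os

def load_dotenv(path=".env"):
    """Minimal .env loader sketch: KEY=VALUE lines, '#' comments skipped.
    Values already present in the environment are not overwritten."""
    try:
        lines = open(path).read().splitlines()
    except FileNotFoundError:
        return  # no .env file is not an error
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

The setdefault call means explicitly exported variables always win over values from the file.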
Usage Examples
CLI Usage
TAU2 Benchmark with LiteLLM
exgentic list benchmarks
exgentic list agents
# Using OpenAI
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
--model gpt-4o \
--set benchmark.user_simulator_model="gpt-4o" \
--output-dir ./outputs
# Or using Anthropic
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
--model claude-3-5-sonnet-20241022 \
--set benchmark.user_simulator_model="claude-3-5-sonnet-20241022" \
--output-dir ./outputs
Available Benchmarks
You can see all available benchmarks using:
exgentic list benchmarks
You can install and use any of these:
- tau2 - Simulated customer support tasks across multiple domains (retail, airline, banking) with realistic user interactions
- appworld - Multi-app API environment testing agents' ability to interact with various application interfaces
- browsecompplus - Web search and browsing benchmark evaluating information retrieval and navigation capabilities
- swebench - Software engineering benchmark for resolving real-world GitHub issues in Python repositories
There are two simple example benchmarks:
- hotpotqa - Multi-hop question answering over Wikipedia requiring reasoning across multiple documents
- gsm8k - Grade school math word problems (GSM8K dataset) with optional calculator tool support
Available Agents
LiteLLM Tool Calling
Setup: exgentic setup --agent litellm_tool_calling
SmolAgents Tool Calling and Code Agents
Setup: exgentic setup --agent smolagents
OpenAI MCP
Setup: exgentic setup --agent openai
Claude Code
Setup: exgentic setup --agent claude
Dashboard Interface
Launch the interactive dashboard:
exgentic dashboard
The dashboard provides a web interface where you can:
- Select benchmarks and agents from the sidebar
- Configure parameters in the 'Run' tab
- Initiate evaluations with the 'Run' button
- Monitor progress and view results in real-time
Output Structure
Each run creates its own directory under outputs/<run_id>/:
outputs/<run_id>/
├── benchmark_results.json   # Benchmark-specific aggregated results
├── results.json             # Overall scores, costs, and per-session statistics
├── run/
│   ├── config.json          # Snapshot of benchmark and agent configuration
│   ├── run.log              # Main execution log
│   ├── error.log            # Errors that caused the session to fail
│   ├── warnings.log         # Warnings and issues during execution
│   └── litellm/
│       └── trace.jsonl      # LiteLLM API traces (if using LiteLLM)
└── sessions/<session_id>/
    ├── config.json          # Session-specific configuration
    ├── results.json         # Framework-level results for the session
    ├── session.json         # Session metadata
    ├── trajectory.jsonl     # One JSON line per step (action + observation)
    ├── agent/
    │   ├── agent.log        # Agent execution log
    │   └── litellm/
    │       ├── cache.log    # LiteLLM cache operations
    │       └── trace.jsonl  # Agent-specific LiteLLM traces
    └── benchmark/
        ├── config.json      # Benchmark configuration
        ├── results.json     # Benchmark-specific results
        ├── session.log      # Benchmark adaptor log
        └── [benchmark-specific files]
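Because trajectory.jsonl holds one JSON object per step, it is straightforward to post-process. A minimal sketch (the 'action' field name is an assumption for illustration; inspect your own trajectory files for the exact schema):

```python
import json

def summarize_trajectory(path):
    """Count steps and collect action names from a trajectory.jsonl file.
    Assumes each non-empty line is a JSON object with an 'action' field."""
    steps = []
    with open(path) as f:
        for line in f:
            if line.strip():
                steps.append(json.loads(line))
    return {
        "num_steps": len(steps),
        "actions": [s.get("action") for s in steps],
    }
```

The same pattern extends to pulling observations or timing fields out for analysis.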
TAU2-Specific Files
For TAU2 runs, additional files appear under sessions/<session_id>/benchmark/:
| File | Description |
|---|---|
| dialog.log | Human-readable conversation transcript |
| tau2_session.log | TAU2 framework internal log |
CLI Reference
Common Commands
exgentic list benchmarks
exgentic list subsets --benchmark tau2
exgentic list tasks --benchmark tau2 --subset retail --limit 5
exgentic list agents
exgentic evaluate --benchmark tau2 --agent tool_calling --subset airline --task 12 --task 34
exgentic evaluate --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10 \
--set benchmark.user_simulator_model="gpt-4o" \
--set agent.max_steps=20 \
--max-steps 100 \
--max-actions 100
exgentic evaluate execute --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic evaluate aggregate --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic evaluate session --benchmark tau2 --agent tool_calling --subset airline --task 12
exgentic evaluate session --config session_config.json
exgentic status --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic preview --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic results --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
Python API
For Python API usage examples, see the examples/ directory.
How It Works
To learn more about Exgentic's architecture and design, see our arXiv paper.
Contributing
We welcome issues and pull requests! Please see CONTRIBUTING.md for guidelines.
Advanced Features
Model Configuration
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
--set agent.model.temperature=0.2
Supported fields: temperature, top_p, max_tokens, reasoning_effort, num_retries, retry_after, retry_strategy
Agents warn about, then ignore, any unsupported fields.
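The warn-and-ignore behavior can be pictured as a simple filter over the supported field set (a sketch for illustration, not Exgentic's actual code):

```python
import warnings

# Supported model fields, as listed in the documentation above.
SUPPORTED_FIELDS = {
    "temperature", "top_p", "max_tokens", "reasoning_effort",
    "num_retries", "retry_after", "retry_strategy",
}

def filter_model_config(config):
    """Keep supported model fields and warn about any others,
    mirroring the warn-and-ignore behavior described above."""
    accepted = {}
    for key, value in config.items():
        if key in SUPPORTED_FIELDS:
            accepted[key] = value
        else:
            warnings.warn(f"Ignoring unsupported model field: {key}")
    return accepted
```

Dropping rather than erroring on unknown fields keeps one config usable across agents with different capabilities.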
Run Limits
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
--max-steps 100 \
--max-actions 100
Exgentic stops a session when it reaches either limit and records a limit_reached status.
Default: 100 for both limits
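The limit semantics can be sketched as a driver loop that checks both counters before each step (the step function's return shape is an assumption for illustration):

```python
def run_session(agent_step, max_steps=100, max_actions=100):
    """Drive an agent until it finishes or a limit is reached.
    agent_step() is assumed to return (actions_taken, done)."""
    steps = actions = 0
    while True:
        # Check both limits before letting the agent act again.
        if steps >= max_steps or actions >= max_actions:
            return {"status": "limit_reached", "steps": steps, "actions": actions}
        taken, done = agent_step()
        steps += 1
        actions += taken
        if done:
            return {"status": "completed", "steps": steps, "actions": actions}
```

An agent that never terminates therefore surfaces as limit_reached in the results rather than hanging the run.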
OpenTelemetry Tracing
Agent trajectories can be recorded as OTel traces and exported to a collector. Exgentic traces follow the OTel semantic conventions for GenAI; for more information, see OTEL_SEMANTIC_CONVENTIONS.md. To enable OTel distributed tracing for your agent evaluations:
1. Install dependencies:
pip install -e '.[otel]'
2. Set up an OTel collector: use Jaeger or Langfuse. See the OpenTelemetry Collector documentation for details.
3. Configure environment variables:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 # typical for a self-hosted Jaeger server with HTTP export
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf # or 'grpc'
export EXGENTIC_OTEL_ENABLED=true
4. Content recording: LLM inputs and outputs are not emitted to OTel by default. To enable them, set:
export EXGENTIC_OTEL_RECORD_CONTENT=true
5. View traces: OTel logs are written to <session_root>/otel.log
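As a quick sanity check of the variables above, a small helper can echo back the effective settings (the defaults and truthy-string parsing here are illustrative assumptions, not Exgentic's documented behavior):

```python
import os

def otel_config(env=None):
    """Collect the OTel-related settings described above from the environment."""
    env = os.environ if env is None else env

    def flag(name):
        # Treat common truthy strings as enabled (an assumption).
        return env.get(name, "").strip().lower() in {"1", "true", "yes"}

    return {
        "enabled": flag("EXGENTIC_OTEL_ENABLED"),
        "record_content": flag("EXGENTIC_OTEL_RECORD_CONTENT"),
        "endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318"),
        "protocol": env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "http/protobuf"),
    }
```

Running such a check before a long evaluation catches a forgotten export early, when content recording is off by default.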
Citing Exgentic
If you use Exgentic or the Exgentic harness in your research, please cite it using the following BibTeX entry:
@misc{bandel2026generalagentevaluation,
title={General Agent Evaluation},
author={Elron Bandel and Asaf Yehudai and Lilach Eden and Yehoshua Sagron and Yotam Perlitz and Elad Venezian and Natalia Razinkov and Natan Ergas and Shlomit Shachor Ifergan and Segev Shlomov and Michal Jacovi and Leshem Choshen and Liat Ein-Dor and Yoav Katz and Michal Shmueli-Scheuer},
year={2026},
url={https://arxiv.org/abs/2602.22953},
}
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Support
For questions and support, please open an issue on GitHub.