
Exgentic - General agent evaluation


Evaluate any agent on any benchmark in the simplest way possible


🎯 What is Exgentic?

Exgentic is a universal evaluation framework that enables standardized testing of AI agents across diverse benchmarks and domains. It provides a consistent interface for evaluating any agent on any benchmark, making it easy to compare performance, reproduce results, and ensure your agent works reliably across different tasks and environments.

👥 Who is it for?

Exgentic serves multiple audiences in the AI agent ecosystem:

  1. General Audience - Visit www.exgentic.ai to explore the first general agent leaderboard comparing leading agents and frontier models across varied tasks.
  2. Agent Builders - Evaluate your agents comprehensively across multiple domains and benchmarks to ensure robust performance and identify areas for improvement.
  3. Researchers & Component Developers - Test general agentic components (such as memory systems, context compression, planning modules) across different agents and domains to validate their effectiveness and generalizability.
  4. Benchmark Builders - Evaluate your benchmark across multiple agents to ensure it provides meaningful differentiation and works reliably with different agent architectures.

✨ Key Features

  • 🔄 Universal Evaluation Framework - Evaluate any agent on any benchmark with a consistent, standardized interface
  • 🤖 Multi-Agent Support - Built-in support for LiteLLM, SmolAgents, OpenAI MCP, and Claude Code agents
  • 📊 Comprehensive Benchmarks - Industry-standard benchmarks including TAU2, AppWorld, SWE-bench, BrowseComp+, GSM8K, and HotpotQA
  • 🔌 Flexible LLM Integration - Works with any LLM provider through LiteLLM (OpenAI, Anthropic, and more)
  • 💻 Multi-Interface - Python API, CLI, and GUI for different workflows and use cases
  • 📈 Interactive Dashboard - Web-based interface for real-time monitoring and configuration
  • 🔍 Advanced Observability - Comprehensive logging with trajectory tracking, OpenTelemetry integration, and detailed traces
  • ⚙️ Fine-Grained Control - Configure model parameters, run limits, and benchmark/agent-specific settings
  • 📁 Structured Output - Organized directory structure with session-level and run-level artifacts for easy analysis
  • 💰 Cost Tracking - Automatic tracking of API costs and performance metrics
  • 🏭 Scalable - Built-in caching, optimization, and monitoring features for large-scale evaluations
  • 🧩 Extensibility - Modular architecture makes it easy to add new agents or benchmarks

🚀 Quick Start

📋 Prerequisites

  • Python 3.11 or higher
  • Virtual environment (recommended)

📦 Installation

Clone the repository:

git clone https://github.com/Exgentic/exgentic.git
cd exgentic

Set up your environment:

python3.11 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

🔧 Setup

Run the setup command for benchmarks and agents before first use:

Benchmarks:

exgentic setup --benchmark appworld
exgentic setup --benchmark tau2

Agents:

exgentic setup --agent smolagents
exgentic setup --agent openai
exgentic setup --agent claude

🔑 API Credentials

Configure your API keys for OpenAI, Anthropic, or any other provider supported by LiteLLM.

# For OpenAI
export OPENAI_API_KEY=...          # your OpenAI API key

# Or for Anthropic
export ANTHROPIC_API_KEY=...       # your Anthropic API key

Alternatively, create a .env file in the project root with your API keys. Exgentic will load them automatically.
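As an illustration of the .env convention described above, here is a minimal stdlib sketch of how such files are typically parsed; this is not Exgentic's actual loader, and the key name EXAMPLE_API_KEY is a placeholder:

```python
import os
import tempfile

def load_env_file(path: str) -> None:
    """Minimal .env loader: KEY=VALUE lines; blank lines and '#' comments skipped."""
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Real environment variables take precedence over the file
            os.environ.setdefault(key.strip(), value.strip())

# Demo: write a throwaway .env and load it
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("EXAMPLE_API_KEY=sk-example\n# a comment\n")
    env_path = f.name

load_env_file(env_path)
print(os.environ["EXAMPLE_API_KEY"])  # sk-example
```

Real loaders (e.g. python-dotenv) additionally handle quoting and variable expansion; the sketch keeps only the core behavior.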


💡 Usage Examples

🖥️ CLI Usage

TAU2 Benchmark with LiteLLM

exgentic list benchmarks
exgentic list agents

# Using OpenAI
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --model gpt-4o \
  --set benchmark.user_simulator_model="gpt-4o" \
  --output-dir ./outputs

# Or using Anthropic
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --model claude-3-5-sonnet-20241022 \
  --set benchmark.user_simulator_model="claude-3-5-sonnet-20241022" \
  --output-dir ./outputs

📊 Available Benchmarks

You can see all available benchmarks using:

exgentic list benchmarks

You can install and use any of these:

  • 🎯 tau2 - Simulated customer support tasks across multiple domains (retail, airline, banking) with realistic user interactions
  • 📱 appworld - Multi-app API environment testing agents' ability to interact with various application interfaces
  • 🌐 browsecompplus - Web search and browsing benchmark evaluating information retrieval and navigation capabilities
  • 💻 swebench - Software engineering benchmark for resolving real-world GitHub issues in Python repositories

There are two simple example benchmarks:

  • 📚 hotpotqa - Multi-hop question answering over Wikipedia requiring reasoning across multiple documents
  • 🔢 gsm8k - Grade school math word problems (GSM8K dataset) with optional calculator tool support
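A calculator tool of the kind gsm8k can use is typically just a function the agent may call with an arithmetic expression. The sketch below is illustrative only; the name calc and its exact behavior are assumptions, not Exgentic's actual tool:

```python
import ast
import operator

# Allowed binary operators for safe arithmetic evaluation
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}

def calc(expression: str) -> float:
    """Safely evaluate an arithmetic expression such as '3 * (4 + 5)'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("Unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

print(calc("3 * (4 + 5)"))  # 27
```

Parsing via the ast module rather than eval() keeps the tool safe against arbitrary code in model output.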

🤖 Available Agents

⚡ LiteLLM Tool Calling

Setup: exgentic setup --agent litellm_tool_calling

🧠 SmolAgents Tool-Calling and Code Agents

Setup: exgentic setup --agent smolagents

🔷 OpenAI MCP

Setup: exgentic setup --agent openai

🎨 Claude Code

Setup: exgentic setup --agent claude


📈 Dashboard Interface


Launch the interactive dashboard:

exgentic dashboard

The dashboard provides a web interface where you can:

  1. Select benchmarks and agents from the sidebar
  2. Configure parameters in the 'Run' tab
  3. Initiate evaluations with the 'Run' button
  4. Monitor progress and view results in real-time

๐Ÿ“ Output Structure

Each run creates its own directory under outputs/<run_id>/:

outputs/<run_id>/
├── benchmark_results.json          # Benchmark-specific aggregated results
├── results.json                    # Overall scores, costs, and per-session statistics
├── run/
│   ├── config.json                # Snapshot of benchmark and agent configuration
│   ├── run.log                    # Main execution log
│   ├── error.log                  # Errors that caused a session to fail
│   ├── warnings.log               # Warnings and issues during execution
│   └── litellm/
│       └── trace.jsonl            # LiteLLM API traces (if using LiteLLM)
└── sessions/<session_id>/
    ├── config.json                # Session-specific configuration
    ├── results.json               # Framework-level results for the session
    ├── session.json               # Session metadata
    ├── trajectory.jsonl           # One JSON line per step (action + observation)
    ├── agent/
    │   ├── agent.log             # Agent execution log
    │   └── litellm/
    │       ├── cache.log         # LiteLLM cache operations
    │       └── trace.jsonl       # Agent-specific LiteLLM traces
    └── benchmark/
        ├── config.json           # Benchmark configuration
        ├── results.json          # Benchmark-specific results
        ├── session.log           # Benchmark adapter log
        └── [benchmark-specific files]
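Given this layout, per-session results can be aggregated with a short script. The field names "score" and "cost" below are assumptions made for illustration; check your own results.json files for the actual schema:

```python
import json
import tempfile
from pathlib import Path

def aggregate_sessions(run_dir: Path) -> dict:
    """Sum assumed 'score'/'cost' fields across sessions/<id>/results.json files."""
    total = {"sessions": 0, "score": 0.0, "cost": 0.0}
    for results_file in sorted(run_dir.glob("sessions/*/results.json")):
        data = json.loads(results_file.read_text())
        total["sessions"] += 1
        total["score"] += data.get("score", 0.0)
        total["cost"] += data.get("cost", 0.0)
    return total

# Demo with a fabricated run directory mirroring the documented layout
run_dir = Path(tempfile.mkdtemp()) / "outputs" / "run_001"
for sid, score, cost in [("s1", 1.0, 0.02), ("s2", 0.0, 0.03)]:
    session = run_dir / "sessions" / sid
    session.mkdir(parents=True)
    (session / "results.json").write_text(json.dumps({"score": score, "cost": cost}))

print(aggregate_sessions(run_dir))
```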

TAU2-Specific Files

For TAU2 runs, additional files appear under sessions/<session_id>/benchmark/:

File               Description
dialog.log         Human-readable conversation transcript
tau2_session.log   TAU2 framework internal log

๐Ÿ“ CLI Reference


Common Commands

exgentic list benchmarks
exgentic list subsets --benchmark tau2
exgentic list tasks --benchmark tau2 --subset retail --limit 5
exgentic list agents
exgentic evaluate --benchmark tau2 --agent tool_calling --subset airline --task 12 --task 34
exgentic evaluate --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10 \
  --set benchmark.user_simulator_model="gpt-4o" \
  --set agent.max_steps=20 \
  --max-steps 100 \
  --max-actions 100
exgentic evaluate execute --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic evaluate aggregate --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic evaluate session --benchmark tau2 --agent tool_calling --subset airline --task 12
exgentic evaluate session --config session_config.json
exgentic status --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic preview --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic results --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10

๐Ÿ Python API

For Python API usage examples, see the examples/ directory.

📖 How It Works

To learn more about Exgentic's architecture and design, see our arXiv paper.

๐Ÿค Contributing

We welcome issues and pull requests! Please see CONTRIBUTING.md for guidelines.


โš™๏ธ Advanced Features

๐ŸŽ›๏ธ Model Configuration

exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --set agent.model.temperature=0.2

Supported fields: temperature, top_p, max_tokens, reasoning_effort, num_retries, retry_after, retry_strategy

Agents will warn about any unsupported fields and ignore them.
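The warn-and-ignore behavior can be pictured as a simple filter over the configuration dict. The field list is copied from above; the function name and warning text are assumptions, not Exgentic's internals:

```python
import warnings

# Fields listed as supported in the documentation above
SUPPORTED_FIELDS = {
    "temperature", "top_p", "max_tokens", "reasoning_effort",
    "num_retries", "retry_after", "retry_strategy",
}

def filter_model_config(config: dict) -> dict:
    """Keep supported model fields; warn about and drop everything else."""
    accepted = {}
    for key, value in config.items():
        if key in SUPPORTED_FIELDS:
            accepted[key] = value
        else:
            warnings.warn(f"Ignoring unsupported model field: {key}")
    return accepted

print(filter_model_config({"temperature": 0.2, "frobnicate": True}))
# {'temperature': 0.2}
```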

โฑ๏ธ Run Limits

exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --max-steps 100 \
  --max-actions 100

Exgentic stops a session when it reaches either limit and records a limit_reached status. The default is 100 for both limits.
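The limit mechanics amount to a budget check inside the session loop. The sketch below is a generic illustration under that assumption; only the limit_reached status name comes from the documentation:

```python
def run_session(agent_steps, max_steps=100, max_actions=100):
    """Drive an agent step iterator until it finishes or a budget is exhausted."""
    steps = actions = 0
    for step in agent_steps:
        steps += 1
        if step.get("is_action"):
            actions += 1
        # Stop as soon as either budget is hit, as described above
        if steps >= max_steps or actions >= max_actions:
            return {"status": "limit_reached", "steps": steps, "actions": actions}
    return {"status": "completed", "steps": steps, "actions": actions}

# An agent that never finishes hits the step limit first
endless = ({"is_action": i % 2 == 0} for i in range(10_000))
print(run_session(endless, max_steps=5))
# {'status': 'limit_reached', 'steps': 5, 'actions': 3}
```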

๐Ÿ” OpenTelemetry Tracing

Agent trajectories can be recorded as OTel traces and emitted to a collector server. Exgentic traces follow the OTel semantic conventions for GenAI. For more information see OTEL_SEMANTIC_CONVENTIONS.md. Enable OTel distributed tracing for your agent evaluations:

  1. Install dependencies:

    pip install -e '.[otel]'
    
  2. Set up an OTEL Collector: Use Jaeger or Langfuse. See the OpenTelemetry Collector documentation for details.

  3. Configure environment variables:

    export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 # typical for self-hosted Jaeger server with http
    export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf  # or 'grpc'
    export EXGENTIC_OTEL_ENABLED=true 
    
  4. Content Recording: LLM inputs and outputs are not emitted to OTel by default. To enable, set the environment variable:

    export EXGENTIC_OTEL_RECORD_CONTENT=true
    
  5. View traces: OTEL logs are written to <session_root>/otel.log


📖 Citing Exgentic

If you use Exgentic or the Exgentic harness in your research, please cite it using the following BibTeX entry:

@misc{bandel2026generalagentevaluation,
      title={General Agent Evaluation}, 
      author={Elron Bandel and Asaf Yehudai and Lilach Eden and Yehoshua Sagron and Yotam Perlitz and Elad Venezian and Natalia Razinkov and Natan Ergas and Shlomit Shachor Ifergan and Segev Shlomov and Michal Jacovi and Leshem Choshen and Liat Ein-Dor and Yoav Katz and Michal Shmueli-Scheuer},
      year={2026},
      url={https://arxiv.org/abs/2602.22953}, 
}

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

💬 Support

For questions and support, please open an issue on GitHub.
