
Exgentic - General agent evaluation


Evaluate any agent on any benchmark in the simplest way possible


🎯 What is Exgentic?

Exgentic is a universal evaluation framework that enables standardized testing of AI agents across diverse benchmarks and domains. It provides a consistent interface for evaluating any agent on any benchmark, making it easy to compare performance, reproduce results, and ensure your agent works reliably across different tasks and environments.

👥 Who is it for?

Exgentic serves multiple audiences in the AI agent ecosystem:

  1. General Audience - Visit www.exgentic.ai to explore the first general agent leaderboard comparing leading agents and frontier models across varied tasks.
  2. Agent Builders - Evaluate your agents comprehensively across multiple domains and benchmarks to ensure robust performance and identify areas for improvement.
  3. Researchers & Component Developers - Test general agentic components (such as memory systems, context compression, planning modules) across different agents and domains to validate their effectiveness and generalizability.
  4. Benchmark Builders - Evaluate your benchmark across multiple agents to ensure it provides meaningful differentiation and works reliably with different agent architectures.

✨ Key Features

  • 🔄 Universal Evaluation Framework - Evaluate any agent on any benchmark with a consistent, standardized interface
  • 🤖 Multi-Agent Support - Built-in support for LiteLLM, SmolAgents, OpenAI MCP, and Claude Code agents
  • 📊 Comprehensive Benchmarks - Industry-standard benchmarks including TAU2, AppWorld, SWE-bench, BrowseComp+, GSM8K, and HotpotQA
  • 🔌 Flexible LLM Integration - Works with any LLM provider through LiteLLM (OpenAI, Anthropic, and more)
  • 💻 Multi-Interface - Python API, CLI, and GUI for different workflows and use cases
  • 📈 Interactive Dashboard - Web-based interface for real-time monitoring and configuration
  • 🔍 Advanced Observability - Comprehensive logging with trajectory tracking, OpenTelemetry integration, and detailed traces
  • ⚙️ Fine-Grained Control - Configure model parameters, run limits, and benchmark/agent-specific settings
  • 📁 Structured Output - Organized directory structure with session-level and run-level artifacts for easy analysis
  • 💰 Cost Tracking - Automatic tracking of API costs and performance metrics
  • 🏭 Scalable - Built-in caching, optimization, and monitoring features for large-scale evaluations
  • 🧩 Extensibility - Modular architecture makes it easy to add new agents or benchmarks

🚀 Quick Start

📋 Prerequisites

  • Python 3.11 or higher
  • Virtual environment (recommended)

📦 Installation

Clone the repository:

git clone https://github.com/Exgentic/exgentic.git
cd exgentic

Set up your environment:

python3.11 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

🔧 Setup

Run the setup command for benchmarks and agents before first use:

Benchmarks:

exgentic setup --benchmark appworld
exgentic setup --benchmark tau2

Agents:

exgentic setup --agent smolagents
exgentic setup --agent openai
exgentic setup --agent claude

🔑 API Credentials

Configure your API keys for OpenAI, Anthropic, or any other provider supported by LiteLLM.

# For OpenAI
export OPENAI_API_KEY=...          # your OpenAI API key

# Or for Anthropic
export ANTHROPIC_API_KEY=...       # your Anthropic API key

Alternatively, create a .env file in the project root with your API keys. Exgentic will load them automatically.
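As an illustration of the .env convention described above, here is a minimal stdlib sketch of how such files are typically parsed; this is not Exgentic's actual loader, and the key name EXAMPLE_API_KEY is a placeholder:

```python
import os
import tempfile

def load_env_file(path: str) -> None:
    """Minimal .env loader: KEY=VALUE lines; blank lines and '#' comments skipped."""
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Real environment variables take precedence over the file
            os.environ.setdefault(key.strip(), value.strip())

# Demo: write a throwaway .env and load it
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("EXAMPLE_API_KEY=sk-example\n# a comment\n")
    env_path = f.name

load_env_file(env_path)
print(os.environ["EXAMPLE_API_KEY"])  # sk-example
```

Real loaders (e.g. python-dotenv) additionally handle quoting and variable expansion; the sketch keeps only the core behavior.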


💡 Usage Examples

🖥️ CLI Usage

TAU2 Benchmark with LiteLLM

exgentic list benchmarks
exgentic list agents

# Using OpenAI
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --model gpt-4o \
  --set benchmark.user_simulator_model="gpt-4o" \
  --output-dir ./outputs

# Or using Anthropic
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --model claude-3-5-sonnet-20241022 \
  --set benchmark.user_simulator_model="claude-3-5-sonnet-20241022" \
  --output-dir ./outputs

📊 Available Benchmarks

You can see all available benchmarks using:

exgentic list benchmarks

You can install and use any of these:

  • 🎯 tau2 - Simulated customer support tasks across multiple domains (retail, airline, banking) with realistic user interactions
  • 📱 appworld - Multi-app API environment testing agents' ability to interact with various application interfaces
  • 🌐 browsecompplus - Web search and browsing benchmark evaluating information retrieval and navigation capabilities
  • 💻 swebench - Software engineering benchmark for resolving real-world GitHub issues in Python repositories

There are two simple example benchmarks:

  • 📚 hotpotqa - Multi-hop question answering over Wikipedia requiring reasoning across multiple documents
  • 🔢 gsm8k - Grade school math word problems (GSM8K dataset) with optional calculator tool support
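A calculator tool of the kind gsm8k can use is typically just a function the agent may call with an arithmetic expression. The sketch below is illustrative only; the name calc and its exact behavior are assumptions, not Exgentic's actual tool:

```python
import ast
import operator

# Allowed binary operators for safe arithmetic evaluation
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}

def calc(expression: str) -> float:
    """Safely evaluate an arithmetic expression such as '3 * (4 + 5)'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("Unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

print(calc("3 * (4 + 5)"))  # 27
```

Parsing via the ast module rather than eval() keeps the tool safe against arbitrary code in model output.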

🤖 Available Agents

⚡ LiteLLM Tool Calling

Setup: exgentic setup --agent litellm_tool_calling

🧠 SmolAgents Tool-Calling and Code Agents

Setup: exgentic setup --agent smolagents

🔷 OpenAI MCP

Setup: exgentic setup --agent openai

🎨 Claude Code

Setup: exgentic setup --agent claude


📈 Dashboard Interface


Launch the interactive dashboard:

exgentic dashboard

The dashboard provides a web interface where you can:

  1. Select benchmarks and agents from the sidebar
  2. Configure parameters in the 'Run' tab
  3. Initiate evaluations with the 'Run' button
  4. Monitor progress and view results in real-time

๐Ÿ“ Output Structure

Each run creates its own directory under outputs/<run_id>/:

outputs/<run_id>/
├── benchmark_results.json          # Benchmark-specific aggregated results
├── results.json                    # Overall scores, costs, and per-session statistics
├── run/
│   ├── config.json                # Snapshot of benchmark and agent configuration
│   ├── run.log                    # Main execution log
│   ├── error.log                  # Errors that caused a session to fail
│   ├── warnings.log               # Warnings and issues during execution
│   └── litellm/
│       └── trace.jsonl            # LiteLLM API traces (if using LiteLLM)
└── sessions/<session_id>/
    ├── config.json                # Session-specific configuration
    ├── results.json               # Framework-level results for the session
    ├── session.json               # Session metadata
    ├── trajectory.jsonl           # One JSON line per step (action + observation)
    ├── agent/
    │   ├── agent.log             # Agent execution log
    │   └── litellm/
    │       ├── cache.log         # LiteLLM cache operations
    │       └── trace.jsonl       # Agent-specific LiteLLM traces
    └── benchmark/
        ├── config.json           # Benchmark configuration
        ├── results.json          # Benchmark-specific results
        ├── session.log           # Benchmark adapter log
        └── [benchmark-specific files]
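Given this layout, per-session results can be aggregated with a short script. The field names "score" and "cost" below are assumptions made for illustration; check your own results.json files for the actual schema:

```python
import json
import tempfile
from pathlib import Path

def aggregate_sessions(run_dir: Path) -> dict:
    """Sum assumed 'score'/'cost' fields across sessions/<id>/results.json files."""
    total = {"sessions": 0, "score": 0.0, "cost": 0.0}
    for results_file in sorted(run_dir.glob("sessions/*/results.json")):
        data = json.loads(results_file.read_text())
        total["sessions"] += 1
        total["score"] += data.get("score", 0.0)
        total["cost"] += data.get("cost", 0.0)
    return total

# Demo with a fabricated run directory mirroring the documented layout
run_dir = Path(tempfile.mkdtemp()) / "outputs" / "run_001"
for sid, score, cost in [("s1", 1.0, 0.02), ("s2", 0.0, 0.03)]:
    session = run_dir / "sessions" / sid
    session.mkdir(parents=True)
    (session / "results.json").write_text(json.dumps({"score": score, "cost": cost}))

print(aggregate_sessions(run_dir))
```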

TAU2-Specific Files

For TAU2 runs, additional files appear under sessions/<session_id>/benchmark/:

File               Description
dialog.log         Human-readable conversation transcript
tau2_session.log   TAU2 framework internal log

๐Ÿ“ CLI Reference


Common Commands

exgentic list benchmarks
exgentic list subsets --benchmark tau2
exgentic list tasks --benchmark tau2 --subset retail --limit 5
exgentic list agents
exgentic evaluate --benchmark tau2 --agent tool_calling --subset airline --task 12 --task 34
exgentic evaluate --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10 \
  --set benchmark.user_simulator_model="gpt-4o" \
  --set agent.max_steps=20 \
  --max-steps 100 \
  --max-actions 100
exgentic evaluate execute --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic evaluate aggregate --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic evaluate session --benchmark tau2 --agent tool_calling --subset airline --task 12
exgentic evaluate session --config session_config.json
exgentic status --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic preview --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic results --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10

๐Ÿ Python API

For Python API usage examples, see the examples/ directory.

📖 How It Works

To learn more about Exgentic's architecture and design, see our arXiv paper.

๐Ÿค Contributing

We welcome issues and pull requests! Please see CONTRIBUTING.md for guidelines.


โš™๏ธ Advanced Features

๐ŸŽ›๏ธ Model Configuration

exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --set agent.model.temperature=0.2

Supported fields: temperature, top_p, max_tokens, reasoning_effort, num_retries, retry_after, retry_strategy

Agents will warn about any unsupported fields and ignore them.
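The warn-and-ignore behavior can be pictured as a simple filter over the configuration dict. The field list is copied from above; the function name and warning text are assumptions, not Exgentic's internals:

```python
import warnings

# Fields listed as supported in the documentation above
SUPPORTED_FIELDS = {
    "temperature", "top_p", "max_tokens", "reasoning_effort",
    "num_retries", "retry_after", "retry_strategy",
}

def filter_model_config(config: dict) -> dict:
    """Keep supported model fields; warn about and drop everything else."""
    accepted = {}
    for key, value in config.items():
        if key in SUPPORTED_FIELDS:
            accepted[key] = value
        else:
            warnings.warn(f"Ignoring unsupported model field: {key}")
    return accepted

print(filter_model_config({"temperature": 0.2, "frobnicate": True}))
# {'temperature': 0.2}
```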

โฑ๏ธ Run Limits

exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --max-steps 100 \
  --max-actions 100

Exgentic stops a session when it reaches either limit and records a limit_reached status. The default is 100 for both limits.
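The limit mechanics amount to a budget check inside the session loop. The sketch below is a generic illustration under that assumption; only the limit_reached status name comes from the documentation:

```python
def run_session(agent_steps, max_steps=100, max_actions=100):
    """Drive an agent step iterator until it finishes or a budget is exhausted."""
    steps = actions = 0
    for step in agent_steps:
        steps += 1
        if step.get("is_action"):
            actions += 1
        # Stop as soon as either budget is hit, as described above
        if steps >= max_steps or actions >= max_actions:
            return {"status": "limit_reached", "steps": steps, "actions": actions}
    return {"status": "completed", "steps": steps, "actions": actions}

# An agent that never finishes hits the step limit first
endless = ({"is_action": i % 2 == 0} for i in range(10_000))
print(run_session(endless, max_steps=5))
# {'status': 'limit_reached', 'steps': 5, 'actions': 3}
```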

๐Ÿ” OpenTelemetry Tracing

Agent trajectories can be recorded as OTel traces and emitted to a collector server. Exgentic traces follow the OTel semantic conventions for GenAI. For more information see OTEL_SEMANTIC_CONVENTIONS.md. Enable OTel distributed tracing for your agent evaluations:

  1. Install dependencies:

    pip install -e '.[otel]'
    
  2. Set up an OTEL Collector: Use Jaeger or Langfuse. See the OpenTelemetry Collector documentation for details.

  3. Configure environment variables:

    export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 # typical for self-hosted Jaeger server with http
    export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf  # or 'grpc'
    export EXGENTIC_OTEL_ENABLED=true 
    
  4. Content Recording: LLM inputs and outputs are not emitted to OTel by default. To enable, set the environment variable:

    export EXGENTIC_OTEL_RECORD_CONTENT=true
    
  5. View traces: OTEL logs are written to <session_root>/otel.log


📖 Citing Exgentic

If you use Exgentic or the Exgentic harness in your research, please cite it using the following BibTeX entry:

@misc{bandel2026generalagentevaluation,
      title={General Agent Evaluation}, 
      author={Elron Bandel and Asaf Yehudai and Lilach Eden and Yehoshua Sagron and Yotam Perlitz and Elad Venezian and Natalia Razinkov and Natan Ergas and Shlomit Shachor Ifergan and Segev Shlomov and Michal Jacovi and Leshem Choshen and Liat Ein-Dor and Yoav Katz and Michal Shmueli-Scheuer},
      year={2026},
      url={https://arxiv.org/abs/2602.22953}, 
}

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

💬 Support

For questions and support, please open an issue on GitHub.
