exgentic

Exgentic - General agent evaluation

Evaluate any agent on any benchmark in the simplest way possible

What is Exgentic?

Exgentic is a universal evaluation framework that enables standardized testing of AI agents across diverse benchmarks and domains. It provides a consistent interface for evaluating any agent on any benchmark, making it easy to compare performance, reproduce results, and ensure your agent works reliably across different tasks and environments.

Who is it for?

General Audience - Visit www.exgentic.ai to explore the first general agent leaderboard comparing leading agents and frontier models across varied tasks.
Agent Builders - Evaluate your agents comprehensively across multiple domains and benchmarks.
Researchers & Component Developers - Test agentic components (memory, context compression, planning) across different agents and domains.
Benchmark Builders - Evaluate your benchmark across multiple agents to ensure meaningful differentiation.

Quick Start

Installation

uv tool install exgentic

API Credentials

export OPENAI_API_KEY=...
# or
export ANTHROPIC_API_KEY=...
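Before launching a run it can help to confirm that at least one provider key is visible to the process. The following is a minimal sketch using only the two variable names from the export lines above; the helper name `configured_keys` is hypothetical and not part of exgentic.

```python
import os
from typing import Mapping

# Variable names taken from the export lines above.
PROVIDER_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def configured_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the provider key names that are set to a non-empty value."""
    return [name for name in PROVIDER_KEYS if env.get(name, "").strip()]


if __name__ == "__main__":
    found = configured_keys()
    if found:
        print("Credentials found for:", ", ".join(found))
    else:
        print("No provider key set; export one of:", ", ".join(PROVIDER_KEYS))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.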

Run an Evaluation

# List available benchmarks and agents
exgentic list benchmarks
exgentic list agents

# Evaluate an agent on a benchmark
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --model gpt-4o \
  --set benchmark.user_simulator_model="gpt-4o"

Benchmarks are automatically installed on first run — no manual installation needed. You can also install them explicitly:

exgentic install --benchmark tau2              # install deps + data (default)
exgentic install --agent tool_calling
exgentic install --benchmark tau2 --docker     # build Docker image
exgentic install --benchmark tau2 --local      # install into local environment
exgentic uninstall --benchmark tau2            # remove installed environment

Note: exgentic setup still works but is deprecated in favor of install/uninstall.

For full container isolation, use the Docker runner (--set benchmark.runner=docker). You only need Docker installed and running:

exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --model gpt-4o \
  --set benchmark.runner=docker \
  --set benchmark.user_simulator_model="gpt-4o"
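When evaluations are launched from a script, the same command can be assembled as an argument list. This sketch uses only the flags that appear in the commands above; the function `build_evaluate_command` is a hypothetical helper, not something exgentic provides.

```python
import shlex


def build_evaluate_command(benchmark, agent, subset, num_tasks, model, overrides=None):
    """Assemble argv for `exgentic evaluate` from the flags shown above."""
    argv = [
        "exgentic", "evaluate",
        "--benchmark", benchmark,
        "--agent", agent,
        "--subset", subset,
        "--num-tasks", str(num_tasks),
        "--model", model,
    ]
    # Each override becomes one `--set key=value` pair, in insertion order.
    for key, value in (overrides or {}).items():
        argv += ["--set", f"{key}={value}"]
    return argv


command = build_evaluate_command(
    "tau2", "tool_calling", "retail", 2, "gpt-4o",
    overrides={
        "benchmark.runner": "docker",
        "benchmark.user_simulator_model": "gpt-4o",
    },
)
print(shlex.join(command))

# To actually launch the run (requires exgentic and a running Docker daemon):
# import subprocess; subprocess.run(command, check=True)
```

Building a list rather than a single string avoids shell quoting issues when the command is later passed to `subprocess.run`.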

Python API

To use exgentic as a library, install it first:

uv add exgentic   # or: pip install exgentic

from exgentic import evaluate

results = evaluate(
    benchmark="tau2",
    agent="tool_calling",
    subset="retail",
    num_tasks=2,
    model="gpt-4o",
    benchmark_kwargs={"user_simulator_model": "gpt-4o"},
)

For more examples, see the examples/ directory.
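As an illustration, the call above could be repeated over several subsets. This is a sketch under assumptions: it reuses only the keyword arguments shown in the example, the subset names `retail` and `airline` are the ones used as `--subset` values elsewhere in this README, and `evaluation_kwargs` is a hypothetical helper. The structure of the value returned by `evaluate` is not described here, so the sketch does not inspect it.

```python
def evaluation_kwargs(subset, model="gpt-4o", num_tasks=2):
    """Keyword arguments mirroring the evaluate() call shown above."""
    return {
        "benchmark": "tau2",
        "agent": "tool_calling",
        "subset": subset,
        "num_tasks": num_tasks,
        "model": model,
        "benchmark_kwargs": {"user_simulator_model": model},
    }


# Both names appear as --subset values in the CLI examples.
SUBSETS = ["retail", "airline"]

if __name__ == "__main__":
    try:
        from exgentic import evaluate
    except ImportError:
        evaluate = None  # exgentic is not installed; only show the planned calls

    for subset in SUBSETS:
        kwargs = evaluation_kwargs(subset)
        if evaluate is None:
            print("would call evaluate with", kwargs)
        else:
            evaluate(**kwargs)
            print(subset, "finished")
```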

Available Benchmarks

exgentic list benchmarks

Benchmark	Description
tau2	Simulated customer support tasks across multiple domains (mock, retail, airline, telecom)
appworld	Multi-app API environment testing agents' ability to interact with application interfaces
browsecompplus	Web search and browsing benchmark for information retrieval and navigation
swebench	Software engineering benchmark for resolving real-world GitHub issues
hotpotqa	Multi-hop question answering over Wikipedia
gsm8k	Grade school math word problems with optional calculator tool
bfcl	Berkeley Function Calling Leaderboard for evaluating tool-use capabilities

Available Agents

Agent	Description
LiteLLM Tool Calling	Generic tool-calling agent via LiteLLM
SmolAgents	HuggingFace SmolAgents framework
OpenAI MCP	OpenAI Responses API with MCP tools
Claude Code	Anthropic Claude Code agent
Codex CLI	OpenAI Codex CLI agent
Gemini CLI	Google Gemini CLI agent

Dashboard

exgentic dashboard

Output Structure

Each run creates its own directory under outputs/<run_id>/:

outputs/<run_id>/
├── results.json                    # Overall scores, costs, per-session statistics
├── benchmark_results.json          # Benchmark-specific aggregated results
├── aggregator/
│   └── runtime.json               # Run-level evaluator context (list_tasks, aggregation)
├── run/
│   ├── config.json                # Snapshot of benchmark and agent configuration
│   ├── run.log                    # Main execution log
│   └── warnings.log               # Warnings during execution
└── sessions/<session_id>/
    ├── config.json                # Session configuration
    ├── results.json               # Session results
    ├── trajectory.jsonl           # One JSON line per step (action + observation)
    ├── otel.log                   # OpenTelemetry span log (when OTEL is enabled)
    ├── otel_spans.jsonl           # Full OTEL spans as JSONL (when OTEL is enabled)
    ├── agent/
    │   ├── runtime.json          # Per-service context (run_id, session_id, OTEL trace, settings)
    │   └── agent.log             # Agent execution log
    └── benchmark/
        ├── runtime.json          # Per-service context (run_id, session_id, OTEL trace, settings)
        ├── results.json          # Benchmark-specific results
        └── session.log           # Benchmark session log

On startup, each service (agent, benchmark) reads its own runtime.json to bootstrap its context, settings, and OTEL trace propagation. Subprocesses launched through the venv or Docker runners can therefore attach to the parent's trace without sharing process memory.
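Because trajectory.jsonl holds one JSON line per step, a run directory can be summarized with the standard library alone. The sketch below relies only on the directory layout shown above and makes no assumption about the fields inside each record; `count_trajectory_steps` is a hypothetical helper name.

```python
import json
from pathlib import Path


def count_trajectory_steps(run_dir):
    """Map each session id to the number of records in its trajectory.jsonl."""
    sessions_root = Path(run_dir) / "sessions"
    if not sessions_root.is_dir():
        return {}

    counts = {}
    for session_dir in sorted(sessions_root.iterdir()):
        trajectory = session_dir / "trajectory.jsonl"
        if not trajectory.is_file():
            continue  # session has no trajectory file yet
        steps = 0
        with trajectory.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    json.loads(line)  # raises on a malformed record
                    steps += 1
        counts[session_dir.name] = steps
    return counts
```

Calling `count_trajectory_steps("outputs/<run_id>")` with a real run id returns a dictionary keyed by session id; sessions without a trajectory file are skipped.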

CLI Reference

# Discover
exgentic list benchmarks
exgentic list subsets --benchmark tau2
exgentic list tasks --benchmark tau2 --subset retail --limit 5
exgentic list agents

# Install
exgentic install --benchmark tau2
exgentic install --benchmark tau2 --docker
exgentic install --benchmark tau2 --local
exgentic uninstall --benchmark tau2

# Run
exgentic evaluate --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic batch run --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10

# Inspect
exgentic status --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic preview --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic results --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10

# Analyze
exgentic compare --agents tool_calling openai --benchmark tau2

# Explore
exgentic dashboard

Advanced

Model Configuration

exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --set agent.model.temperature=0.2

Supported fields: temperature, top_p, max_tokens, reasoning_effort, num_retries, retry_after, retry_strategy
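Several model fields can be turned into `--set` arguments programmatically. This sketch assumes every supported field follows the same `agent.model.<field>` path as the temperature example above, which this README shows only for `temperature`; `model_overrides` is a hypothetical helper.

```python
# Field names copied from the "Supported fields" list above.
SUPPORTED_MODEL_FIELDS = {
    "temperature", "top_p", "max_tokens", "reasoning_effort",
    "num_retries", "retry_after", "retry_strategy",
}


def model_overrides(fields):
    """Turn {"temperature": 0.2} into ["--set", "agent.model.temperature=0.2"]."""
    unknown = sorted(set(fields) - SUPPORTED_MODEL_FIELDS)
    if unknown:
        raise ValueError(f"unsupported model fields: {', '.join(unknown)}")
    args = []
    for name, value in fields.items():
        # Assumption: all fields live under the agent.model prefix.
        args += ["--set", f"agent.model.{name}={value}"]
    return args


print(model_overrides({"temperature": 0.2, "max_tokens": 1024}))
```

Rejecting unknown field names early surfaces typos before a long evaluation starts.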

Run Limits

exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --max-steps 100 --max-actions 100

A session stops as soon as it reaches either limit and records a limit_reached status. Both limits default to 100.
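The stopping rule can be pictured with a small loop. This is an illustrative sketch, not exgentic's implementation: how steps and actions are counted internally is not specified here, so the sketch assumes every iteration is a step and only some steps carry an action. The `completed` status and the `SessionOutcome` type are hypothetical; only `limit_reached` comes from the text above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SessionOutcome:
    status: str
    steps: int
    actions: int


def run_with_limits(events: Iterable[bool], max_steps: int = 100,
                    max_actions: int = 100) -> SessionOutcome:
    """Consume events until they run out or either limit is hit.

    Each event is True when that step performed an action (assumption).
    """
    steps = actions = 0
    for took_action in events:
        # Refuse to start another step once either budget is spent.
        if steps >= max_steps or actions >= max_actions:
            return SessionOutcome("limit_reached", steps, actions)
        steps += 1
        if took_action:
            actions += 1
    return SessionOutcome("completed", steps, actions)


print(run_with_limits([True] * 5, max_steps=3))
```

In this sketch a session whose events end exactly at a limit is still reported as completed, because no further step was attempted.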

HuggingFace

Use HuggingFace models or run evaluations on HuggingFace Jobs. See docs/huggingface.md.

How It Works

To learn more about Exgentic's architecture and design, see our arXiv paper (https://arxiv.org/abs/2602.22953).

Development

For local development, editing, and contributing, see DEVELOPMENT.md.

Contributing

We welcome issues and pull requests! See CONTRIBUTING.md for guidelines.

Citing Exgentic

@misc{bandel2026generalagentevaluation,
      title={General Agent Evaluation},
      author={Elron Bandel and Asaf Yehudai and Lilach Eden and Yehoshua Sagron and Yotam Perlitz and Elad Venezian and Natalia Razinkov and Natan Ergas and Shlomit Shachor Ifergan and Segev Shlomov and Michal Jacovi and Leshem Choshen and Liat Ein-Dor and Yoav Katz and Michal Shmueli-Scheuer},
      year={2026},
      url={https://arxiv.org/abs/2602.22953},
}

License

Apache License 2.0 — see LICENSE.

Support

For questions and support, open an issue on GitHub.

Project details

Release history

Version	Release date
0.3.5	Apr 19, 2026
0.3.4	Apr 6, 2026 (this version)
0.3.3	Mar 30, 2026
0.3.2	Mar 29, 2026
0.3.1	Mar 29, 2026
0.3.0	Mar 29, 2026
0.2.0	Mar 24, 2026
0.1.3	Mar 16, 2026
0.1.2	Mar 16, 2026
0.1.1	Mar 15, 2026
0.1.0	Feb 17, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

exgentic-0.3.4.tar.gz (496.6 kB)

Uploaded Apr 6, 2026 Source

Built Distribution

exgentic-0.3.4-py3-none-any.whl (363.8 kB)

Uploaded Apr 6, 2026 Python 3

File details

Details for the file exgentic-0.3.4.tar.gz.

File metadata

Download URL: exgentic-0.3.4.tar.gz
Upload date: Apr 6, 2026
Size: 496.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for exgentic-0.3.4.tar.gz
Algorithm	Hash digest
SHA256	`969df1afa60265e5d71381b206950bc61443d6a65e0ecf7c8b278184d09eff86`
MD5	`4f22cfbd8129c7b1d4ac321bab56e38a`
BLAKE2b-256	`c0d4f49b0ac75685f995a3206576bc40e59179b496c414eb854c9a1e624d0e26`
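A downloaded archive can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch; the helper names are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 listed above for exgentic-0.3.4.tar.gz.
EXPECTED_SDIST_SHA256 = "969df1afa60265e5d71381b206950bc61443d6a65e0ecf7c8b278184d09eff86"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SDIST_SHA256):
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_hash("exgentic-0.3.4.tar.gz")` returns True only when the local copy matches the published digest.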

Provenance

The following attestation bundles were made for exgentic-0.3.4.tar.gz:

Publisher: publish-pypi.yml on Exgentic/exgentic

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: exgentic-0.3.4.tar.gz
- Subject digest: 969df1afa60265e5d71381b206950bc61443d6a65e0ecf7c8b278184d09eff86
- Sigstore transparency entry: 1242944014
- Sigstore integration time: Apr 6, 2026
Source repository:
- Permalink: Exgentic/exgentic@5d3090d8746c8c5bfc9c54bf1c23f67fc2cc64ac
- Branch / Tag: refs/tags/v0.3.4
- Owner: https://github.com/Exgentic
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@5d3090d8746c8c5bfc9c54bf1c23f67fc2cc64ac
- Trigger Event: push

File details

Details for the file exgentic-0.3.4-py3-none-any.whl.

File metadata

Download URL: exgentic-0.3.4-py3-none-any.whl
Upload date: Apr 6, 2026
Size: 363.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for exgentic-0.3.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`313ed2fc6017238497dc8b4d270a4e4291f5e1f3c966bd0a2af765dbfd6ce674`
MD5	`daf7e70d9cc39e068edddfed6201b770`
BLAKE2b-256	`94fa4adedf1eb39d2d76878de4dc7811e19f79f74d46a6f6c40cf733e2814905`

Provenance

The following attestation bundles were made for exgentic-0.3.4-py3-none-any.whl:

Publisher: publish-pypi.yml on Exgentic/exgentic

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: exgentic-0.3.4-py3-none-any.whl
- Subject digest: 313ed2fc6017238497dc8b4d270a4e4291f5e1f3c966bd0a2af765dbfd6ce674
- Sigstore transparency entry: 1242944020
- Sigstore integration time: Apr 6, 2026
Source repository:
- Permalink: Exgentic/exgentic@5d3090d8746c8c5bfc9c54bf1c23f67fc2cc64ac
- Branch / Tag: refs/tags/v0.3.4
- Owner: https://github.com/Exgentic
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@5d3090d8746c8c5bfc9c54bf1c23f67fc2cc64ac
- Trigger Event: push
