A tool for benchmarking your agent design in different scenarios.
AgentSociety Benchmark Toolkit
Tool Objectives
The AgentSociety Benchmark Toolkit is a comprehensive evaluation tool designed for multi-agent systems. This tool aims to help researchers and developers:
- Standardize Evaluation Process: Provide a unified evaluation framework to ensure fair comparison between different agent systems
- Multi-Scenario Testing: Support agent behavior evaluation in various real-world scenarios
- Automated Assessment: Simplify the entire evaluation workflow from data preparation to result analysis
- Reproducibility: Ensure reproducibility and verifiability of experimental results
Supported Benchmarks
1. BehaviorModeling
The BehaviorModeling benchmark focuses on evaluating agents' behavior modeling capabilities in complex social scenarios. This benchmark tests how agents:
- Understand and simulate human behavior patterns
- Make reasonable decisions in group environments
- Adapt to different social norms and constraints
2. DailyMobility
The DailyMobility benchmark evaluates agents' mobility behavior modeling capabilities in daily life. This benchmark tests how agents:
- Plan reasonable daily activity routes
- Consider correlations between time, location, and activities
- Simulate real-world mobility patterns
3. HarricaneMobility
The HarricaneMobility benchmark focuses on evaluating agents' behavior modeling capabilities in emergency situations. This benchmark tests how agents:
- Make emergency decisions during natural disasters
- Simulate crowd evacuation and shelter-seeking behaviors
- Handle uncertainty and resource allocation in emergency situations
Basic Usage
Installation
# Install from source
git clone <repository-url>
cd packages/agentsociety-benchmark
pip install -e .
# Or install via pip (if published)
pip install agentsociety-benchmark
Command Line Interface
View Available Tasks
# List all available benchmark tasks
asbench list-tasks
Download Datasets
- You need to install Git LFS first.
# Clone datasets for a specific benchmark; this clones the dataset and installs its dependencies at the same time
asbench clone BehaviorModeling
asbench clone DailyMobility
asbench clone HarricaneMobility
# View installed benchmarks
asbench list-installed
# Force clone datasets for specific benchmarks
asbench clone BehaviorModeling --force
# Only install dependencies for specific benchmarks
asbench clone BehaviorModeling --only-install-deps
Run Evaluation
# Run evaluation for a specific benchmark (--mode is either "test" or "inference")
asbench run <TASK-NAME> --config your_config.yaml --agent your_agent.py --mode <test|inference>
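For example, to run the BehaviorModeling benchmark in test mode with the configuration and agent files shown later on this page (the file names here are illustrative):
asbench run BehaviorModeling --config config.yaml --agent my_agent.py --mode test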
Independent Result Evaluation
# Independently evaluate generated result files
asbench evaluate <TASK-NAME> results.json --config your_config.yaml
Update Benchmarks
# Update all benchmarks to latest versions
asbench update-benchmarks
Programmatic Usage
The AgentSociety Benchmark also supports programmatic usage through Python code, allowing you to integrate benchmark execution into your applications and scripts.
Basic Usage
import asyncio
from agentsociety_benchmark import BenchmarkRunner, run_benchmark, evaluate_results

async def run_my_benchmark():
    # Create a BenchmarkRunner instance
    runner = BenchmarkRunner(home_dir="./benchmark_data")

    # Run a benchmark
    result = await runner.run_benchmark(
        task_name="BehaviorModeling",
        benchmark_config="config.yaml",
        agent_config="agent_config.yaml",
        mode="test",  # Run with evaluation
        tenant_id="my_tenant",
        exp_id="my_experiment_001"
    )

    print(f"Benchmark completed! Experiment ID: {result['exp_id']}")
    print(f"Results: {result['results']}")
    print(f"Evaluation: {result['evaluation']}")

# Run the benchmark
asyncio.run(run_my_benchmark())
Using Convenience Functions
import asyncio
from agentsociety_benchmark import run_benchmark, evaluate_results, list_available_tasks

async def simple_benchmark():
    # List available tasks
    tasks = list_available_tasks()
    print(f"Available tasks: {tasks}")

    # Run benchmark using convenience function
    result = await run_benchmark(
        task_name="BehaviorModeling",
        benchmark_config="config.yaml",
        agent_config="agent_config.yaml",
        mode="inference"  # Run without evaluation
    )

    # Evaluate results separately if needed
    if result['results']:
        evaluation = await evaluate_results(
            task_name="BehaviorModeling",
            results_file="results.pkl",
            benchmark_config="config.yaml"
        )
        print(f"Evaluation result: {evaluation['evaluation_result']}")

asyncio.run(simple_benchmark())
Advanced Usage
import asyncio
from agentsociety_benchmark import BenchmarkRunner

async def advanced_benchmark():
    # Create runner with custom settings
    runner = BenchmarkRunner(home_dir="./custom_data")

    # Run with advanced configuration
    result = await runner.run_benchmark(
        task_name="BehaviorModeling",
        benchmark_config="my_config.yaml",
        agent_config="my_agent.py",
        datasets_path="./my_datasets",
        mode="test",
        tenant_id="advanced_example",
        exp_id="advanced_001",
        official=True,  # Mark as official validation
        callback_url="https://my-callback-url.com/webhook"
    )

    print(f"Advanced benchmark completed: {result}")

asyncio.run(advanced_benchmark())
Configuration Files
You'll need to create configuration files for your benchmarks:
Benchmark Configuration (config.yaml):
llm:
  - api_key: YOUR-API-KEY
    model: gpt-4
    provider: openai
    semaphore: 200
env:
  db:
    enabled: true
home_dir: .agentsociety-benchmark/agentsociety_data
Agent Configuration (agent_config.yaml):
agent_class: my_agent.py
number: 1
config:
  name: MyAgent
  description: A custom agent for testing
Complete Example
See examples/programmatic_usage.py for a complete working example that demonstrates all the features of the programmatic API.
Storage and Configuration
- Data Storage: All benchmark data is stored in the .agentsociety-benchmark/ directory
- Configuration Files: Use YAML-format configuration files to define agent parameters and evaluation settings
- Result Storage: Evaluation results are saved in JSON format for easy subsequent analysis and comparison
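Because results are plain JSON, they can be inspected with standard tooling before any deeper analysis. A minimal sketch, assuming a results.json file produced by a previous run (its internal structure depends on the benchmark and is not specified here):
import json
from pathlib import Path

# Load a previously generated results file (path is an assumption)
results_path = Path("results.json")
with results_path.open() as f:
    results = json.load(f)

# Inspect the top-level structure before comparing or plotting anything
print(type(results))
print(list(results.keys()) if isinstance(results, dict) else f"{len(results)} records")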
Configuration Example
# config.yaml
llm:
  - api_key: YOUR-API-KEY    # LLM API key
    model: YOUR-MODEL        # LLM model
    provider: PROVIDER       # LLM provider
    semaphore: 200           # Maximum number of concurrent LLM requests
env:
  db:
    enabled: true            # Whether to enable the database
home_dir: .agentsociety-benchmark/agentsociety_data
Requirements
- Python >= 3.11
- agentsociety >= 1.5.0a11
- For other dependencies, see pyproject.toml
License
This project is licensed under the MIT License. See the LICENSE file for details.
File details
Details for the file agentsociety_benchmark-0.1.4.tar.gz.
File metadata
- Download URL: agentsociety_benchmark-0.1.4.tar.gz
- Upload date:
- Size: 65.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.8.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7bbdfd2f27af16dbf97ef950088602bfaa90d0c4cb29566e76049c1b2bbcc9c8 |
| MD5 | a3c4d0423a377c3d2a800073c03abefc |
| BLAKE2b-256 | ae5e9609f179f01a0c367fce414f97ba59b0f632a2c1adaa293ddf6a00ab48ba |
File details
Details for the file agentsociety_benchmark-0.1.4-py3-none-any.whl.
File metadata
- Download URL: agentsociety_benchmark-0.1.4-py3-none-any.whl
- Upload date:
- Size: 88.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.8.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3ad44a7602055267ae3df1772a5222e3d8ce05057cbaff89d58fb27aed2080c3 |
| MD5 | 2ca1d2c8a1a47a981ac7cf3a1de1ea70 |
| BLAKE2b-256 | 5de42d76a284411af43f3c7b61309a06c336d6023cfb3ed70313bb940b82eae2 |