RAI Benchmarks

RAI Bench is a package that includes ready-made benchmarks and provides a framework for creating new ones.

Manipulation O3DE Benchmark

The Manipulation O3DE Benchmark (manipulation_o3de_benchmark_module) provides tasks and scene configurations for simulated robotic-arm manipulation in O3DE. All tasks share a common ManipulationTask logic and can be parameterized, which allows for many task variants. The current tasks include:

  • MoveObjectToLeftTask
  • GroupObjectsTask
  • BuildCubeTowerTask
  • PlaceObjectAtCoordTask
  • RotateObjectTask (currently not applicable due to limitations in the ManipulatorMoveTo tool)

The result of a task is a score between 0 and 1, calculated as initially_misplaced_now_correct / initially_misplaced, i.e. the fraction of initially misplaced objects that are correctly placed at the end. The score is calculated at the end of each scenario.
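The scoring rule can be sketched as a small function. This is illustrative only; the names come from the formula above, not from the actual ManipulationTask code:

```python
# Illustrative sketch of the scoring rule described above;
# not the actual ManipulationTask implementation.
def task_score(initially_misplaced: int, initially_misplaced_now_correct: int) -> float:
    """Fraction of initially misplaced objects that are correct at the end."""
    if initially_misplaced == 0:
        # Assumption: with nothing to fix, the task counts as fully solved.
        return 1.0
    return initially_misplaced_now_correct / initially_misplaced

print(task_score(4, 3))  # → 0.75
```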

Frame Components

  • Task
  • Scenario
  • Benchmark

For more information about these classes, see Benchmark and Task.

Example usage

An example of how to load scenes, define scenarios, and run the benchmark can be found in manipulation_o3de_benchmark_example.

Scenarios can be loaded manually like:

one_carrot_simulation_config = O3DExROS2SimulationConfig.load_config(
    base_config_path=Path("path_to_scene.yaml"),
    connector_config_path=Path("path_to_o3de_config.yaml"),
)

Scenario(
    task=GrabCarrotTask(logger=some_logger),
    simulation_config=one_carrot_simulation_config,
)

or automatically like:

scenarios = Benchmark.create_scenarios(
    tasks=tasks, simulation_configs=simulations_configs
)

which results in a list of scenarios covering every suitable combination of task and scene (each task decides whether a given scene config is suitable for it).
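The pairing rule can be illustrated with a minimal sketch. The API here is hypothetical (DummyTask and is_suitable are stand-ins); the real Benchmark.create_scenarios and Task interface may differ:

```python
# Hypothetical sketch of the task/scene pairing described above:
# every task is paired with every scene config it accepts.
class DummyTask:
    def __init__(self, name, accepted):
        self.name = name
        self.accepted = accepted  # scene names this task can run on

    def is_suitable(self, config):  # hypothetical compatibility check
        return config in self.accepted


def create_scenarios(tasks, simulation_configs):
    return [
        (task.name, config)
        for task in tasks
        for config in simulation_configs
        if task.is_suitable(config)
    ]


tasks = [DummyTask("move", {"scene_a", "scene_b"}), DummyTask("stack", {"scene_b"})]
print(create_scenarios(tasks, ["scene_a", "scene_b"]))
# → [('move', 'scene_a'), ('move', 'scene_b'), ('stack', 'scene_b')]
```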

or they can be imported from the existing scenarios_packets:

t_scenarios = trivial_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
e_scenarios = easy_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
m_scenarios = medium_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
h_scenarios = hard_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
vh_scenarios = very_hard_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)

which are grouped by subjective difficulty. Currently there are 10 trivial, 42 easy, 23 medium, 38 hard, and 47 very hard scenarios. Check the docstrings and code in scenarios_packets to see how scenarios are assigned to a difficulty level.

Running

  1. Download the O3DE simulation binary and unzip it.

  2. Follow step 2 from the Manipulation demo Setup section.

  3. Adjust the path to the binary in o3de_config.yaml.

  4. Choose the model and vendor you want to run.

    [!NOTE] Vendor configs are defined in config.toml. Change them if needed.

  5. Run benchmark with:

cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/manipulation_o3de/main.py --model-name llama3.2  --vendor ollama

[!NOTE] For now the benchmark runs all available scenarios (~160). See the Examples section for details.

Development

When creating a new task or changing existing ones, make sure to add unit tests for score calculation in rai_bench_tests. This also applies when adding or changing helper methods in Task or ManipulationTask.

The number of scenarios can easily be extended without writing new tasks, by increasing the number of variants of the same task and adding more simulation configs, but this won't improve the variety of scenarios as much as creating new tasks.

Tool Calling Agent Benchmark

The Tool Calling Agent Benchmark evaluates LangChain tool calling agents. It includes a set of tasks and a benchmark that scores an agent on those tasks by verifying the correctness of the tool calls the agent requests. The benchmark integrates with the LangSmith and Langfuse tracing backends so agent performance can be tracked easily.

Frame Components

tool_calling_agent_test_bench.py - a script that runs the benchmark on tasks based on ROS 2 tool usage.

Example Usage

Validators can be constructed from any SubTasks, and Tasks can be validated by any number of Validators, which makes the validation process very flexible.

# subtasks
get_topics_subtask = CheckArgsToolCallSubTask(
    expected_tool_name="get_ros2_topics_names_and_types"
)
color_image_subtask = CheckArgsToolCallSubTask(
    expected_tool_name="get_ros2_image", expected_args={"topic": "/camera_image_color"}
)
# validators - consist of subtasks
topics_ord_val = OrderedCallsValidator(subtasks=[get_topics_subtask])
color_image_ord_val = OrderedCallsValidator(subtasks=[color_image_subtask])
topics_and_color_image_ord_val = OrderedCallsValidator(
    subtasks=[
        get_topics_subtask,
        color_image_subtask,
    ]
)
# tasks - validated by a list of validators
tasks = [
    GetROS2TopicsTask(validators=[topics_ord_val]),
    GetROS2RGBCameraTask(validators=[topics_and_color_image_ord_val]),
    GetROS2RGBCameraTask(validators=[topics_ord_val, color_image_ord_val]),
]
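The ordered semantics suggested by OrderedCallsValidator (expected calls must appear in the agent's tool-call sequence in the given order) can be sketched as an ordered-subsequence check. This is illustrative only, not the rai_bench implementation:

```python
# Illustrative ordered-subsequence check, mimicking what an
# OrderedCallsValidator could verify; not the rai_bench implementation.
def passes_ordered(expected_calls, actual_calls):
    """True if expected_calls appear in actual_calls in order (gaps allowed)."""
    it = iter(actual_calls)
    # `call in it` consumes the iterator up to the match, so order is enforced.
    return all(call in it for call in expected_calls)


actual = ["get_ros2_topics_names_and_types", "get_ros2_image"]
print(passes_ordered(["get_ros2_topics_names_and_types", "get_ros2_image"], actual))  # → True
print(passes_ordered(["get_ros2_image", "get_ros2_topics_names_and_types"], actual))  # → False
```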

Running

To set up tracing backends, please follow the instructions in the tracing.md document.

To run the benchmark:

cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py

There are also flags to specify the model and vendor:

python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py --model-name llama3.2 --vendor ollama

[!NOTE] Vendor configs are defined in config.toml. Change them if needed.

Testing Models

To test multiple models, different benchmarks, or several repeats in one go, use the test_models script.

Modify these params:

models_name = ["llama3.2", "qwen2.5:7b"]
vendors = ["ollama", "ollama"]
benchmarks = ["tool_calling_agent"]
repeats = 1

to your liking and run the script!

python src/rai_bench/rai_bench/examples/test_models.py
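Conceptually, the script iterates over every model/benchmark/repeat combination of the params above. A minimal sketch of that loop (hypothetical; run_all and the stub run_benchmark are not the actual test_models code):

```python
# Hypothetical sketch of the test_models loop over the params shown above.
def run_all(models_name, vendors, benchmarks, repeats, run_benchmark):
    runs = []
    for model, vendor in zip(models_name, vendors):  # models paired with vendors by position
        for benchmark in benchmarks:
            for _ in range(repeats):
                runs.append(run_benchmark(benchmark, model, vendor))
    return runs


calls = run_all(
    ["llama3.2", "qwen2.5:7b"],
    ["ollama", "ollama"],
    ["tool_calling_agent"],
    repeats=1,
    run_benchmark=lambda b, m, v: (b, m, v),  # stub for illustration
)
print(len(calls))  # → 2
```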

Results and Visualization

All results from running benchmarks are saved to the experiments folder.

If you run a single benchmark test, like:

python src/rai_bench/rai_bench/examples/<benchmark_name>/main.py

results will be saved to a dedicated directory named <benchmark_name>.

When you run a test via:

python src/rai_bench/rai_bench/examples/test_models.py

results will be saved to a separate folder in results, prefixed with run_.

To visualise the results, run:

streamlit run src/rai_bench/rai_bench/results_processing/visualise.py

