Package for running and creating benchmarks.

RAI Benchmarks

RAI Bench is a package that includes benchmarks and provides a framework for creating new ones.

Manipulation O3DE Benchmark

The Manipulation O3DE Benchmark (manipulation_o3de_benchmark_module) provides tasks and scene configurations for robotic arm manipulation simulation in O3DE. The tasks share common ManipulationTask logic and can be parameterized, which allows for many task variants. The current tasks are:

  • MoveObjectToLeftTask
  • GroupObjectsTask
  • BuildCubeTowerTask
  • PlaceObjectAtCoordTask
  • RotateObjectTask (currently not applicable due to limitations in the ManipulatorMoveTo tool)

The result of a task is a value between 0 and 1, calculated as initially_misplaced_now_correct / initially_misplaced. This score is calculated at the end of each scenario.
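The score formula above can be sketched as a small function. The name `calculate_score` and the handling of a scenario with no misplaced objects are assumptions made for illustration, not the package's actual implementation.

```python
def calculate_score(initially_misplaced: int, initially_misplaced_now_correct: int) -> float:
    """Return the fraction of initially misplaced objects that ended up correct."""
    if initially_misplaced < 0 or initially_misplaced_now_correct < 0:
        raise ValueError("counts must be non-negative")
    if initially_misplaced_now_correct > initially_misplaced:
        raise ValueError("corrected count cannot exceed the initially misplaced count")
    if initially_misplaced == 0:
        # Assumption: nothing needed fixing, so treat the scenario as fully solved.
        return 1.0
    return initially_misplaced_now_correct / initially_misplaced


print(calculate_score(initially_misplaced=4, initially_misplaced_now_correct=3))  # 0.75
```

Objects that were already in the right place at the start do not enter the ratio, so a task cannot score points for work it did not have to do.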

Frame Components

  • Task
  • Scenario
  • Benchmark

For more information about these classes, see benchmark and Task.

Example usage

An example of how to load scenes, define scenarios and run the benchmark can be found in manipulation_o3de_benchmark_example.

Scenarios can be created manually:

from pathlib import Path

one_carrot_simulation_config = O3DExROS2SimulationConfig.load_config(
    base_config_path=Path("path_to_scene.yaml"),
    connector_config_path=Path("path_to_o3de_config.yaml"),
)

scenario = Scenario(
    task=GrabCarrotTask(logger=some_logger),
    simulation_config=one_carrot_simulation_config,
)

or automatically like:

scenarios = Benchmark.create_scenarios(
    tasks=tasks, simulation_configs=simulations_configs
)

which results in a list of scenarios covering every valid combination of task and scene (each task decides whether a scene config is suitable for it),

or imported from the existing packets in scenarios_packets:
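The combination step can be pictured as a filtered Cartesian product. The sketch below is not the package's code: `SceneConfig`, `RequiresObjectTask`, the `validate_config` method name, and the free function `create_scenarios` are hypothetical stand-ins used only to show the idea.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SceneConfig:
    """Hypothetical stand-in for a simulation config: just the object types in the scene."""
    name: str
    object_types: tuple[str, ...]


@dataclass(frozen=True)
class RequiresObjectTask:
    """Hypothetical task that only accepts scenes containing a given object type."""
    required_type: str

    def validate_config(self, config: SceneConfig) -> bool:
        return self.required_type in config.object_types


def create_scenarios(tasks, simulation_configs):
    """Pair every task with every scene config that the task accepts."""
    return [
        (task, config)
        for task, config in product(tasks, simulation_configs)
        if task.validate_config(config)
    ]


tasks = [RequiresObjectTask("carrot"), RequiresObjectTask("cube")]
configs = [
    SceneConfig("one_carrot", ("carrot",)),
    SceneConfig("carrot_and_cube", ("carrot", "cube")),
]
scenarios = create_scenarios(tasks, configs)
print(len(scenarios))  # 3: the carrot task fits both scenes, the cube task fits only one
```

Letting the task decide suitability keeps the scene files free of task-specific metadata, at the cost of each task needing its own check.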

t_scenarios = trivial_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
e_scenarios = easy_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
m_scenarios = medium_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
h_scenarios = hard_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
vh_scenarios = very_hard_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)

which are grouped by their subjective difficulty. For now there are 10 trivial, 42 easy, 23 medium, 38 hard and 47 very hard scenarios. Check the docstrings and code in scenarios_packets to see how scenarios are assigned to a difficulty level.

Running

  1. Download the O3DE simulation binary and unzip it.

  2. Follow step 2 from the Setup section of the Manipulation demo.

  3. Adjust the path to the binary in o3de_config.yaml.

  4. Choose the model you want to run and its vendor.

    [!NOTE] Vendor configs are defined in config.toml. Change them if needed.

  5. Run the benchmark with:

cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/manipulation_o3de/main.py --model-name llama3.2 --vendor ollama

[!NOTE] For now, the benchmark runs all available scenarios (~160). See the Examples section for details.

Development

When creating a new task or changing an existing one, make sure to add unit tests for score calculation in rai_bench_tests. This also applies when you add or change helper methods in Task or ManipulationTask.

The number of scenarios can easily be extended without writing new tasks, by increasing the number of variants of the same task and adding more simulation configs, but this won't improve the variety of scenarios as much as creating new tasks.

Tool Calling Agent Benchmark

The Tool Calling Agent Benchmark is a benchmark for LangChain tool calling agents. It includes a set of tasks and a benchmark that evaluates the agent's performance on those tasks by verifying the correctness of the tool calls the agent requests. The benchmark is integrated with the LangSmith and Langfuse tracing backends, which makes it easy to track agent performance.

Frame Components

tool_calling_agent_test_bench.py - Script that runs the benchmark on tasks based on ROS2 tool usage.

Example Usage

Validators can be constructed from any SubTasks, and Tasks can be validated by any number of Validators, which makes the whole validation process very versatile.

# subtasks
get_topics_subtask = CheckArgsToolCallSubTask(
    expected_tool_name="get_ros2_topics_names_and_types"
)
color_image_subtask = CheckArgsToolCallSubTask(
    expected_tool_name="get_ros2_image", expected_args={"topic": "/camera_image_color"}
)
# validators - consist of subtasks
topics_ord_val = OrderedCallsValidator(subtasks=[get_topics_subtask])
color_image_ord_val = OrderedCallsValidator(subtasks=[color_image_subtask])
topics_and_color_image_ord_val = OrderedCallsValidator(
    subtasks=[
        get_topics_subtask,
        color_image_subtask,
    ]
)
# tasks - each validated by a list of validators
tasks = [
    GetROS2TopicsTask(validators=[topics_ord_val]),
    GetROS2RGBCameraTask(validators=[topics_and_color_image_ord_val]),
    GetROS2RGBCameraTask(validators=[topics_ord_val, color_image_ord_val]),
]
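To make the idea of an ordered validator concrete, here is a minimal, self-contained sketch. It does not use the package's classes: `ToolCall`, `ExpectedCall`, and `validate_ordered` are hypothetical names, and tolerating unrelated calls between the expected ones is an assumption of this sketch rather than documented behavior of OrderedCallsValidator.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical record of a tool call requested by the agent."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class ExpectedCall:
    """Hypothetical subtask: a tool name plus the arguments it must be called with."""
    expected_tool_name: str
    expected_args: dict[str, Any] = field(default_factory=dict)

    def matches(self, call: ToolCall) -> bool:
        return call.name == self.expected_tool_name and call.args == self.expected_args


def validate_ordered(subtasks: list[ExpectedCall], calls: list[ToolCall]) -> bool:
    """Return True if every subtask is matched by some call, in the given order."""
    # A shared iterator means each subtask only searches the calls that come
    # after the call matched by the previous subtask.
    remaining = iter(calls)
    return all(any(subtask.matches(call) for call in remaining) for subtask in subtasks)


subtasks = [
    ExpectedCall("get_ros2_topics_names_and_types"),
    ExpectedCall("get_ros2_image", {"topic": "/camera_image_color"}),
]
calls = [
    ToolCall("get_ros2_topics_names_and_types"),
    ToolCall("get_ros2_image", {"topic": "/camera_image_color"}),
]
print(validate_ordered(subtasks, calls))                  # True
print(validate_ordered(subtasks, list(reversed(calls))))  # False
```

Splitting expectations into small subtasks lets the same check be reused in several validators, as the example above does with the topics subtask.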

Running

To set up tracing backends, please follow the instructions in the tracing.md document.

To run the benchmark:

cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py

There are also flags to select the model and vendor:

python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py --model-name llama3.2 --vendor ollama

[!NOTE] Vendor configs are defined in config.toml. Change them if needed.

Testing Models

To test multiple models, different benchmarks, or several repeats in one go, use the test_models script.

Modify these parameters:

models_name = ["llama3.2", "qwen2.5:7b"]
vendors = ["ollama", "ollama"]
benchmarks = ["tool_calling_agent"]
repeats = 1

to your liking and run the script:

python src/rai_bench/rai_bench/examples/test_models.py
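As a rough picture of what these parameters imply, the sketch below expands them into one command line per run. It assumes that models_name and vendors are paired by position and that each benchmark is started through its main.py with the --model-name and --vendor flags shown earlier. The helper `build_commands` is hypothetical and is not taken from test_models.py.

```python
models_name = ["llama3.2", "qwen2.5:7b"]
vendors = ["ollama", "ollama"]
benchmarks = ["tool_calling_agent"]
repeats = 1


def build_commands(models_name, vendors, benchmarks, repeats):
    """List one command per (benchmark, model/vendor pair, repeat)."""
    if len(models_name) != len(vendors):
        raise ValueError("models_name and vendors must have the same length")
    commands = []
    for benchmark in benchmarks:
        # Assumption: the i-th model is served by the i-th vendor.
        for model, vendor in zip(models_name, vendors):
            for _ in range(repeats):
                commands.append([
                    "python",
                    f"src/rai_bench/rai_bench/examples/{benchmark}/main.py",
                    "--model-name", model,
                    "--vendor", vendor,
                ])
    return commands


for command in build_commands(models_name, vendors, benchmarks, repeats):
    print(" ".join(command))
```

With the values above this prints two commands, one per model.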

Results and Visualization

All results from running benchmarks are saved to the experiments folder.

If you run a single benchmark test like:

python src/rai_bench/rai_bench/examples/<benchmark_name>/main.py

the results are saved to a dedicated directory named <benchmark_name>.

When you run a test via:

python src/rai_bench/rai_bench/examples/test_models.py

the results are saved to a separate folder in results, with the prefix run_.

To visualise the results, run:

streamlit run src/rai_bench/rai_bench/results_processing/visualise.py
