RAI Benchmarks

RAI Bench is a package that includes ready-made benchmarks and provides a framework for creating new ones.

Manipulation O3DE Benchmark

The Manipulation O3DE Benchmark (manipulation_o3de_benchmark_module) provides tasks and scene configurations for simulated robotic-arm manipulation in O3DE. All tasks share a common ManipulationTask logic and can be parameterized, which allows for many task variants. The current tasks include:

  • MoveObjectToLeftTask
  • GroupObjectsTask
  • BuildCubeTowerTask
  • PlaceObjectAtCoordTask
  • RotateObjectTask (currently not applicable due to limitations in the ManipulatorMoveTo tool)

The result of a task is a score between 0 and 1, calculated as initially_misplaced_now_correct / initially_misplaced, i.e. the fraction of initially misplaced objects that are correctly placed at the end. The score is calculated at the end of each scenario.
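The scoring rule can be sketched as a small function. This is illustrative only; the names come from the formula above, not from the actual ManipulationTask code:

```python
# Illustrative sketch of the scoring rule described above;
# not the actual ManipulationTask implementation.
def task_score(initially_misplaced: int, initially_misplaced_now_correct: int) -> float:
    """Fraction of initially misplaced objects that are correct at the end."""
    if initially_misplaced == 0:
        # Assumption: with nothing to fix, the task counts as fully solved.
        return 1.0
    return initially_misplaced_now_correct / initially_misplaced

print(task_score(4, 3))  # → 0.75
```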

Frame Components

  • Task
  • Scenario
  • Benchmark

For more information about these classes, see Benchmark and Task.

Example usage

An example of how to load scenes, define scenarios, and run the benchmark can be found in manipulation_o3de_benchmark_example.

Scenarios can be loaded manually like:

one_carrot_simulation_config = O3DExROS2SimulationConfig.load_config(
    base_config_path=Path("path_to_scene.yaml"),
    connector_config_path=Path("path_to_o3de_config.yaml"),
)

Scenario(
    task=GrabCarrotTask(logger=some_logger),
    simulation_config=one_carrot_simulation_config,
)

or automatically like:

scenarios = Benchmark.create_scenarios(
    tasks=tasks, simulation_configs=simulations_configs
)

which results in a list of scenarios covering every suitable combination of task and scene (each task decides whether a given scene config is suitable for it).
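The pairing rule can be illustrated with a minimal sketch. The API here is hypothetical (DummyTask and is_suitable are stand-ins); the real Benchmark.create_scenarios and Task interface may differ:

```python
# Hypothetical sketch of the task/scene pairing described above:
# every task is paired with every scene config it accepts.
class DummyTask:
    def __init__(self, name, accepted):
        self.name = name
        self.accepted = accepted  # scene names this task can run on

    def is_suitable(self, config):  # hypothetical compatibility check
        return config in self.accepted


def create_scenarios(tasks, simulation_configs):
    return [
        (task.name, config)
        for task in tasks
        for config in simulation_configs
        if task.is_suitable(config)
    ]


tasks = [DummyTask("move", {"scene_a", "scene_b"}), DummyTask("stack", {"scene_b"})]
print(create_scenarios(tasks, ["scene_a", "scene_b"]))
# → [('move', 'scene_a'), ('move', 'scene_b'), ('stack', 'scene_b')]
```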

or they can be imported from the existing scenarios_packets:

t_scenarios = trivial_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
e_scenarios = easy_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
m_scenarios = medium_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
h_scenarios = hard_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)
vh_scenarios = very_hard_scenarios(
    configs_dir=configs_dir, connector_path=connector_path, logger=bench_logger
)

which are grouped by subjective difficulty. Currently there are 10 trivial, 42 easy, 23 medium, 38 hard, and 47 very hard scenarios. Check the docstrings and code in scenarios_packets to see how scenarios are assigned to a difficulty level.

Running

  1. Download the O3DE simulation binary and unzip it.

  2. Follow step 2 from the Manipulation demo Setup section.

  3. Adjust the path to the binary in o3de_config.yaml.

  4. Choose the model and vendor you want to run.

    [!NOTE] Vendor configs are defined in config.toml. Change them if needed.

  5. Run benchmark with:

cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/manipulation_o3de/main.py --model-name llama3.2  --vendor ollama

[!NOTE] For now the benchmark runs all available scenarios (~160). See the Examples section for details.

Development

When creating a new task or changing existing ones, make sure to add unit tests for score calculation in rai_bench_tests. This also applies when adding or changing helper methods in Task or ManipulationTask.

The number of scenarios can easily be extended without writing new tasks, by increasing the number of variants of the same task and adding more simulation configs, but this won't improve the variety of scenarios as much as creating new tasks.

Tool Calling Agent Benchmark

The Tool Calling Agent Benchmark evaluates LangChain tool calling agents. It includes a set of tasks and a benchmark that scores an agent on those tasks by verifying the correctness of the tool calls the agent requests. The benchmark integrates with the LangSmith and Langfuse tracing backends so agent performance can be tracked easily.

Frame Components

tool_calling_agent_test_bench.py - a script that runs the benchmark on tasks based on ROS 2 tool usage.

Example Usage

Validators can be constructed from any SubTasks, and Tasks can be validated by any number of Validators, which makes the validation process very flexible.

# subtasks
get_topics_subtask = CheckArgsToolCallSubTask(
    expected_tool_name="get_ros2_topics_names_and_types"
)
color_image_subtask = CheckArgsToolCallSubTask(
    expected_tool_name="get_ros2_image", expected_args={"topic": "/camera_image_color"}
)
# validators - consist of subtasks
topics_ord_val = OrderedCallsValidator(subtasks=[get_topics_subtask])
color_image_ord_val = OrderedCallsValidator(subtasks=[color_image_subtask])
topics_and_color_image_ord_val = OrderedCallsValidator(
    subtasks=[
        get_topics_subtask,
        color_image_subtask,
    ]
)
# tasks - validated by a list of validators
tasks = [
    GetROS2TopicsTask(validators=[topics_ord_val]),
    GetROS2RGBCameraTask(validators=[topics_and_color_image_ord_val]),
    GetROS2RGBCameraTask(validators=[topics_ord_val, color_image_ord_val]),
]
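The ordered semantics suggested by OrderedCallsValidator (expected calls must appear in the agent's tool-call sequence in the given order) can be sketched as an ordered-subsequence check. This is illustrative only, not the rai_bench implementation:

```python
# Illustrative ordered-subsequence check, mimicking what an
# OrderedCallsValidator could verify; not the rai_bench implementation.
def passes_ordered(expected_calls, actual_calls):
    """True if expected_calls appear in actual_calls in order (gaps allowed)."""
    it = iter(actual_calls)
    # `call in it` consumes the iterator up to the match, so order is enforced.
    return all(call in it for call in expected_calls)


actual = ["get_ros2_topics_names_and_types", "get_ros2_image"]
print(passes_ordered(["get_ros2_topics_names_and_types", "get_ros2_image"], actual))  # → True
print(passes_ordered(["get_ros2_image", "get_ros2_topics_names_and_types"], actual))  # → False
```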

Running

To set up tracing backends, please follow the instructions in the tracing.md document.

To run the benchmark:

cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py

There are also flags to specify the model and vendor:

python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py --model-name llama3.2 --vendor ollama

[!NOTE] Vendor configs are defined in config.toml. Change them if needed.

Testing Models

To test multiple models, different benchmarks, or several repeats in one go, use the test_models script.

Modify these params:

models_name = ["llama3.2", "qwen2.5:7b"]
vendors = ["ollama", "ollama"]
benchmarks = ["tool_calling_agent"]
repeats = 1

to your liking and run the script!

python src/rai_bench/rai_bench/examples/test_models.py
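Conceptually, the script iterates over every model/benchmark/repeat combination of the params above. A minimal sketch of that loop (hypothetical; run_all and the stub run_benchmark are not the actual test_models code):

```python
# Hypothetical sketch of the test_models loop over the params shown above.
def run_all(models_name, vendors, benchmarks, repeats, run_benchmark):
    runs = []
    for model, vendor in zip(models_name, vendors):  # models paired with vendors by position
        for benchmark in benchmarks:
            for _ in range(repeats):
                runs.append(run_benchmark(benchmark, model, vendor))
    return runs


calls = run_all(
    ["llama3.2", "qwen2.5:7b"],
    ["ollama", "ollama"],
    ["tool_calling_agent"],
    repeats=1,
    run_benchmark=lambda b, m, v: (b, m, v),  # stub for illustration
)
print(len(calls))  # → 2
```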

Results and Visualization

All results from running benchmarks are saved to the experiments folder.

If you run a single benchmark test, like:

python src/rai_bench/rai_bench/examples/<benchmark_name>/main.py

results will be saved to a dedicated directory named <benchmark_name>.

When you run a test via:

python src/rai_bench/rai_bench/examples/test_models.py

results will be saved to a separate folder in results, prefixed with run_.

To visualise the results, run:

streamlit run src/rai_bench/rai_bench/results_processing/visualise.py

