Berkeley Function Calling Leaderboard (BFCL) - packaged by NVIDIA

NVIDIA NeMo Evaluator

The goal of NVIDIA NeMo Evaluator is to advance and refine state-of-the-art methodologies for model evaluation, and deliver them as modular evaluation packages (evaluation containers and pip wheels) that teams can use as standardized building blocks.

Quick start guide

NVIDIA NeMo Evaluator provides evaluation clients that are purpose-built to evaluate model endpoints using our Standard API.

Launching an evaluation for an LLM

  1. Install the package

    pip install nvidia-bfcl
    
  2. (Optional) Set a token for your API endpoint if it is protected

    export MY_API_KEY="your_api_key_here"
    
  3. List the available evaluations:

    $ nemo-evaluator ls
    Available tasks:
    * bfclv2 (in bfcl)
    * bfclv2_ast (in bfcl)
    * bfclv3 (in bfcl)
    * bfclv3_ast (in bfcl)
    ...
    
  4. Run the evaluation of your choice:

    nemo-evaluator run_eval \
        --eval_type bfclv3_ast \
        --model_id meta/llama-3.1-70b-instruct \
        --model_url https://integrate.api.nvidia.com/v1/chat/completions \
        --model_type chat \
        --api_key_name MY_API_KEY \
        --output_dir /workspace/results
    
  5. Gather the results

    cat /workspace/results/results.yml
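
To consume the results programmatically, you can parse the YAML output. A minimal sketch using the PyYAML package (the exact schema of results.yml depends on the task):

import yaml

# Load the evaluation results produced by nemo-evaluator
with open("/workspace/results/results.yml") as f:
    results = yaml.safe_load(f)

# Print the parsed results; the schema varies by task
print(results)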
    

Command-Line Tool

Each package comes pre-installed with a set of command-line tools designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for the bfcl harness:

Commands

1. List Evaluation Types

nemo-evaluator ls

Displays the evaluation types available within the harness.

2. Run an evaluation

The nemo-evaluator run_eval command executes the evaluation process. Below are the flags and their descriptions:

Required flags

  • --eval_type <string> The type of evaluation to perform.
  • --model_id <string> The name or identifier of the model to evaluate.
  • --model_url <url> The API endpoint where the model is accessible.
  • --model_type <string> The type of the model to evaluate, currently either "chat" or "completions".
  • --output_dir <directory> The directory to use as the working directory for the evaluation. The results, including the results.yml output file, will be saved here.

Optional flags

  • --api_key_name <string> The name of the environment variable that stores the Bearer token for the API, if authentication is required.
  • --run_config <path> Specifies the path to a YAML file containing the evaluation definition.

Example

nemo-evaluator run_eval \
    --eval_type bfclv3_ast \
    --model_id my_model \
    --model_type chat \
    --model_url http://localhost:8000 \
    --output_dir ./evaluation_results

If the model API requires authentication, set the API key in an environment variable and reference it using the --api_key_name flag:

export MY_API_KEY="your_api_key_here"

nemo-evaluator run_eval \
    --eval_type bfclv3_ast \
    --model_id my_model \
    --model_type chat \
    --model_url http://localhost:8000 \
    --api_key_name MY_API_KEY \
    --output_dir ./evaluation_results

Configuring evaluations via YAML

Evaluations in NVIDIA NeMo Evaluator are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API, ensuring consistency across evaluations.

Example of a YAML config:

config:
  type: bfclv3_ast
  params:
    parallelism: 50
    limit_samples: 20
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    type: chat
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NVIDIA_API_KEY
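
To launch an evaluation from this file, pass it via --run_config (assuming the config above is saved as eval_config.yml):

nemo-evaluator run_eval \
    --run_config eval_config.yml \
    --output_dir ./evaluation_results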

The priority of overrides, from highest to lowest, is as follows (see the example after this list):

  1. command line arguments
  2. user config (as seen above)
  3. task defaults (defined per task type)
  4. framework defaults
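
For example, a --model_id passed on the command line takes precedence over the model_id defined in the user config (a sketch reusing the hypothetical eval_config.yml from above):

nemo-evaluator run_eval \
    --run_config eval_config.yml \
    --model_id meta/llama-3.1-70b-instruct \
    --output_dir ./evaluation_results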

The --dry_run option prints the final run configuration and command without executing the evaluation.

Example:

nemo-evaluator run_eval \
    --eval_type bfclv3_ast \
    --model_id my_model \
    --model_type chat \
    --model_url http://localhost:8000 \
    --output_dir .evaluation_results \
    --dry_run

Output:

Rendered config:

command: '{% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{%
  endif %} bfcl generate --model {{target.api_endpoint.model_id}} --test-category
  {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --model-args
  base_url={{target.api_endpoint.url}}  {% if config.params.limit_samples is not none
  %} --limit {{config.params.limit_samples}}{% endif %} --num-threads  {{config.params.parallelism}}
  && {% if target.api_endpoint.api_key_name is not none %}OPENAI_API_KEY=${{target.api_endpoint.api_key_name}}{%
  endif %} bfcl evaluate --model {{target.api_endpoint.model_id}} --test-category
  {{config.params.task}} --model-mapping oai --result-dir {{config.output_dir}} --score-dir
  {{config.output_dir}}

  '
framework_name: bfcl
pkg_name: bfcl
config:
  output_dir: .evaluation_results
  params:
    limit_samples: null
    max_new_tokens: null
    max_retries: null
    parallelism: 10
    task: multi_turn,ast
    temperature: null
    timeout: null
    top_p: null
    extra: {}
  supported_endpoint_types:
  - llm
  type: bfclv3_ast
target:
  api_endpoint:
    api_key_name: null
    model_id: my_model
    stream: null
    type: chat
    url: http://localhost:8000


Rendered command:

 bfcl generate --model my_model --test-category multi_turn,ast --model-mapping oai --result-dir .evaluation_results --model-args base_url=http://localhost:8000   --num-threads  10 &&  bfcl evaluate --model my_model --test-category multi_turn,ast --model-mapping oai --result-dir .evaluation_results --score-dir .evaluation_results

Custom datasets

To use your own datasets for evaluation, specify custom dataset parameters in your evaluation configuration file under config.params.extra.custom_dataset. This feature supports two primary dataset formats: native and openai.

Configuration Parameters

  • path: (string, required)
    • Specifies the location of your dataset.
    • If format is native, this must be the absolute path to a directory containing your dataset files (see 'Native Format' section below for structure).
    • If format is openai, this must be the absolute path to your JSONL dataset file.
  • format: (string, required)
    • Defines the format of your custom dataset. Must be either native or openai.
  • data_template_path: (string, optional)
    • Used only when format is openai.
    • Absolute path to a JSON file defining a custom mapping for fields in your OpenAI-format dataset if it deviates from the default structure.

Processing Workflow

  1. Input: The system takes the path, format, and optional data_template_path.
  2. Validation/Conversion:
    • If format is native, the dataset at path is validated. The BFCL_DATA_DIR environment variable is then set directly to this path.
    • If format is openai, the dataset file at path (using data_template_path if provided) is converted into the native format within a temporary directory. BFCL_DATA_DIR is then set to this temporary directory's path.
  3. Evaluation: The bfcl evaluation tool uses the BFCL_DATA_DIR to find the questions.jsonl and ground_truth.jsonl files for the evaluation.

Native Format

The native format requires a specific directory structure. The directory specified in custom_dataset.path should contain the following (a layout sketch follows this list):

  • BFCL_v3_<test_category>.json: This file, located directly under the path directory, should contain the questions or prompts for the LLM. Each line must be a JSON object. Replace <test_category> with a valid test category supported by BFCL, e.g., simple, ast, executable, or multi_turn_base.
  • A subdirectory named possible_answer, which in turn contains:
    • BFCL_v3_<test_category>.json (i.e., path/possible_answer/BFCL_v3_<test_category>.json). It contains the corresponding ground truth, with each line being a JSON object representing the expected function calls or responses. The <test_category> in this filename must match the one in the questions file.
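
For example, for the simple test category, the expected layout is:

/path/to/your/native_data_directory/
├── BFCL_v3_simple.json
└── possible_answer/
    └── BFCL_v3_simple.json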

For multi-turn test categories, the native format is more complex and may require an additional multi_turn_func_doc directory within your custom_dataset.path.

Structure of BFCL_v3_<test_category>.json (Questions File)

Each line in this JSONL file represents a single question/prompt and should be a JSON object with the following fields:

  • id: (string) A unique identifier for the test case, typically in the format <test_category>_<unique_id>.
  • question: (list) A list of conversations, where each conversation is a list of message objects (e.g., {"role": "user", "content": "..."}). This follows the standard OpenAI message format.
  • function: (list) A list of Function objects available for the LLM to call. Each Function object should have:
    • name: (string) The name of the function.
    • description: (string) A description of what the function does.
    • parameters: (object) An object describing the function's parameters. This typically follows a JSON Schema-like structure, defining type (e.g., "object"), properties (a dictionary of parameter names to their schemas, each specifying type, description, etc.), and required (a list of required parameter names).

Example (BFCL_v3_simple.json line):

{
  "id": "simple_0",
  "question": [
    [
      {
        "role": "user",
        "content": "Find the area of a triangle with a base of 10 units and height of 5 units."
      }
    ]
  ],
  "function": [
    {
      "name": "calculate_triangle_area",
      "description": "Calculate the area of a triangle given its base and height.",
      "parameters": {
        "type": "dict",
        "properties": {
          "base": {
            "type": "integer",
            "description": "The base of the triangle."
          },
          "height": {
            "type": "integer",
            "description": "The height of the triangle."
          },
          "unit": {
            "type": "string",
            "description": "The unit of measure (defaults to 'units' if not specified)"
          }
        },
        "required": ["base", "height"]
      }
    }
  ]
}

Structure of possible_answer/BFCL_v3_<test_category>.json (Ground Truth File)

Each line in this JSONL file corresponds to a question in the questions file and should be a JSON object with the following fields:

  • id: (string) The unique identifier for the test case, matching the id in the corresponding questions file.
  • ground_truth: (list) A list of expected tool call objects. Each object typically maps a function name to a dictionary of its arguments and their values.

Example (possible_answer/BFCL_v3_simple.json line):

{
 "id": "simple_0",
 "ground_truth": [
   {
     "calculate_triangle_area": {
       "base": [10],
       "height": [5],
       "unit": ["units", ""]
     }
   }
 ]
}

Note on JSONL: Ensure your files strictly follow the JSON Lines format (one complete JSON object per line). Refer to jsonlines.org for details.
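
To sanity-check a file before running, you can parse each line individually. A minimal sketch in Python (the file path is a placeholder):

import json

# Every line of a JSONL file must parse as a standalone JSON object
with open("/path/to/your/BFCL_v3_simple.json") as f:
    for line_number, line in enumerate(f, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            print(f"Line {line_number} is not valid JSON: {e}")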

Validation:

Native format datasets undergo validation. While validation errors are displayed, they might not halt the process (to accommodate potential future format changes). Detailed validation failure information is saved to validation_failure_details.json in the evaluation output directory. We recommend ensuring that your native data adheres to the expected structure.

Native Dataset Example

config:
  type: bfclv3
  params:
    task: simple
    extra:
      custom_dataset:
        path: /path/to/your/native_data_directory
        format: native

OpenAI Format

The openai format allows you to provide your dataset as a single JSONL file. Each line in this file must be a valid JSON object.

Structure of each JSON object:

Each JSON object in the JSONL file should typically contain the following fields:

  • messages: A list of conversations, where each conversation is a list of message objects (e.g., {"role": "user", "content": "..."}).
  • tools: A list of Tool objects. Each Tool object must have a type (e.g., "function") and a function object. The function object, in turn, contains name, description, and parameters (which defines the schema for the function's arguments).
  • tool_calls_ground_truth: A list of expected tool call objects.

Example of a single JSON object in the OpenAI format JSONL file:

{
  "messages": [
    [
      {
        "role": "user",
        "content": "Calculate the factorial of 5 using math functions."
      }
    ]
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "math.factorial",
        "description": "Calculate the factorial of a given number.",
        "parameters": {
          "type": "dict",
          "properties": {
            "number": {
              "type": "integer",
              "description": "The number for which factorial needs to be calculated."
            }
          },
          "required": ["number"]
        }
      }
    }
  ],
  "tool_calls_ground_truth": [
    {
      "math.factorial": {
        "number": [5]
      }
    }
  ]
}

Using data templates with OpenAI format:

If your OpenAI-formatted JSONL file uses a custom structure, specify a data template JSON file via custom_dataset.data_template_path to map your custom fields to the expected format. The template file uses the Jinja2 templating language to define these mappings, e.g.:

{
  "messages": "{{ item.user_input | tojson }}",
  "tools": "{{ item.function | tojson }}",
  "tool_calls_ground_truth": "{{ item.reference | tojson }}"
}
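
With this template, a record in your dataset could look like the following sketch (the field names user_input, function, and reference come from the template above; item is assumed to bind to each parsed dataset record):

{
  "user_input": [[{"role": "user", "content": "Calculate the factorial of 5 using math functions."}]],
  "function": [{"type": "function", "function": {"name": "math.factorial", "description": "Calculate the factorial of a given number.", "parameters": {"type": "dict", "properties": {"number": {"type": "integer", "description": "The number for which factorial needs to be calculated."}}, "required": ["number"]}}}],
  "reference": [{"math.factorial": {"number": [5]}}]
}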

Limitations of the OpenAI format:

  • It is not supported for multi-turn test categories (multi_turn_base, multi_turn_miss_func, multi_turn_miss_param, and multi_turn_long_context). Use native format for these test categories.
  • It is not supported for running multiple test categories. Use native format for running multiple test categories.

OpenAI Dataset with Custom Data Template Example

config:
  type: bfclv3
  params:
    task: simple
    extra:
      custom_dataset:
        path: /path/to/your/data.jsonl
        format: openai
        data_template_path: /path/to/your/data_template.json

FAQ

BFCL only - API Keys for Executable Test Categories

If you want to run executable test categories, you must provide API keys. Add the keys to your .env file so that the placeholder values used in questions/params/answers can be replaced with real data (see the sketch after this list). There are four API keys to include:

  1. RAPID-API Key: https://rapidapi.com/hub

    All the Rapid APIs we use offer a free tier. You need to subscribe to those API providers to set up the executable test environment, but it will be free of charge!

  2. Exchange Rate API: https://www.exchangerate-api.com

  3. OMDB API: http://www.omdbapi.com/apikey.aspx

  4. Geocode API: https://geocode.maps.co/
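
A minimal sketch of the corresponding .env entries; the variable names below are assumptions based on the upstream BFCL repository and should be verified against its .env.example:

RAPID_API_KEY=your_rapid_api_key
EXCHANGERATE_API_KEY=your_exchangerate_api_key
OMDB_API_KEY=your_omdb_api_key
GEOCODE_API_KEY=your_geocode_api_key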

Deploying a model as an endpoint

NVIDIA NeMo Evaluator utilizes a client-server communication architecture to interact with the model. As a prerequisite, the model must be deployed as an endpoint with a NIM-compatible API.

Users have the flexibility to deploy their model using their own infrastructure and tooling.

Servers with APIs that conform to the OpenAI/NIM API standard are expected to work seamlessly out of the box.
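
For example, a model can be served locally with vLLM (an assumption; any OpenAI-compatible server should work equally well):

vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

The resulting endpoint, http://localhost:8000/v1/chat/completions, can then be passed to nemo-evaluator via --model_url.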
