This project has been archived. No new releases are expected.

BbEval

A lightweight black-box agent evaluator using YAML specifications to score task completion.

Installation and Setup

Installation for End Users

This is the recommended method for users who want to use bbeval as a command-line tool.

  1. Ensure you have uv installed. If you don't, you can install it via pip:

    pip install uv
    
  2. Install bbeval:

    uv tool install bbeval
    

    Alternatively, if you want the latest (unstable) version:

    uv tool install "git+https://github.com/EntityProcess/bbeval.git"
    
  3. Verify the installation: After installation, the bbeval command will be available in your terminal. You can verify it by running:

    bbeval --help
    

Local Development Setup

Follow these steps if you want to contribute to the bbeval project itself. This workflow uses a virtual environment and an editable install, which means changes you make to the source code are immediately available without reinstalling.

  1. Clone the repository and navigate into it:

    git clone https://github.com/entityprocess/bbeval.git
    cd bbeval
    
  2. Create and activate a virtual environment:

    # Create the virtual environment
    uv venv
    
    # Activate it (macOS/Linux)
    source .venv/bin/activate
    
    # Activate it (Windows PowerShell)
    .venv\Scripts\Activate.ps1
    
  3. Perform an editable install with development dependencies:

    This command installs bbeval in editable (-e) mode and includes the extra tools needed for development and testing ([dev]).

    # For non-Windows or if you don't need VS Code focus functionality
    uv pip install -e ".[dev]"
    
    # For Windows users who want the VS Code focus functionality
    uv pip install -e ".[dev,windows]"
    

    Note: The windows optional dependency includes pywin32 and psutil, which are needed for the --focus flag with the open_vscode_workspace.py script. Without them, the script will work but skip the window focusing feature.

You are now ready to start development. You can run the tool with bbeval, edit the code in src/, and run tests with pytest.

Environment Setup

  1. Configure environment variables:

    • Copy .env.template to .env in your project root
    • Fill in your API keys, endpoints, and other configuration values (see the example .env sketch below)
  2. Set up targets:

    • Copy targets.yaml to .bbeval/targets.yaml
    • Update the environment variable names in targets.yaml to match those defined in your .env file
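
For illustration, a filled-in .env might look like the sketch below; the variable names simply mirror the example targets shown later in this README, and the values are placeholders:

# Hypothetical .env sketch; use whatever variable names your targets.yaml references
AZURE_OPEN_AI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPEN_AI_API_KEY=your-azure-key
ANTHROPIC_API_KEY=your-anthropic-key
LLM_MODEL=your-model-name
EVAL_PROJECTX_WORKSPACE_PATH=c:/path/to/projectx.code-workspace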

Quick start

Run an eval (the target is auto-selected from the test file, or overridden via the CLI):

# If your test.yaml contains "target: azure_base", it will be used automatically
bbeval "c:/path/to/test.yaml"

# Override the test file's target with CLI flag
bbeval --target vscode_projectx "c:/path/to/test.yaml"

Run a specific test case with custom targets path:

bbeval --target vscode_projectx --targets "c:/path/to/targets.yaml" --test-id "my-test-case" "c:/path/to/test.yaml"

Command Line Options

  • test_file: Path to test YAML file (required, positional argument)
  • --target TARGET: Execution target name from targets.yaml (overrides target specified in test file)
  • --targets TARGETS: Path to targets.yaml file (default: ./.bbeval/targets.yaml)
  • --test-id TEST_ID: Run only the test case with this specific ID
  • --out OUTPUT_FILE: Output JSONL file path (default: .bbeval/results/{testname}_{timestamp}.jsonl)
  • --dry-run: Run with mock model for testing
  • --agent-timeout SECONDS: Timeout in seconds for agent response polling (default: 120)
  • --max-retries COUNT: Maximum number of retries for timeout cases (default: 2)
  • --verbose: Verbose output

Target Selection Priority

The CLI determines which execution target to use with the following precedence:

  1. CLI flag override: --target my_target (when provided and not 'default')
  2. Test file specification: target: my_target key in the .test.yaml file
  3. Default fallback: Uses the 'default' target (original behavior)

This lets test files declare a preferred target while still allowing command-line overrides, and keeps existing workflows backward compatible. For example, a test file that sets target: azure_base runs against azure_base unless you pass a different --target on the command line.

Output goes to .bbeval/results/{testname}_{timestamp}.jsonl unless --out is provided.

Tips for VS Code Copilot Evals

Workspace Switching: The runner automatically switches to the target workspace when running evals. Make sure you're not actively using another VS Code instance, as this could cause prompts to be injected into the wrong workspace.

Recommended Models: Use Claude Sonnet 4 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.

Requirements

  • Python 3.10+ on PATH
  • Evaluator location: scripts/agent-eval/
  • .env for credentials/targets (recommended)

Environment keys (configured via targets.yaml):

  • Azure: Set the environment variables named by your target's settings.endpoint, settings.api_key, and settings.model
  • Anthropic: Set the environment variables named by your target's settings.api_key and settings.model
  • VS Code: Set the environment variable named by your target's settings.workspace_env_var to the path of your .code-workspace file

Targets and Environment Variables

Execution targets in .bbeval/targets.yaml decouple tests from providers/settings and provide flexible environment variable mapping.

Target Configuration Structure

Each target specifies:

  • name: Unique identifier for the target
  • provider: The model provider (azure, anthropic, vscode, or mock)
  • settings: Environment variable names to use for this target

Examples

Azure targets:

- name: azure_base
  provider: azure
  settings:
    endpoint: "AZURE_OPEN_AI_ENDPOINT"
    api_key: "AZURE_OPEN_AI_API_KEY"
    model: "LLM_MODEL"

Anthropic targets:

- name: anthropic_base
  provider: anthropic
  settings:
    api_key: "ANTHROPIC_API_KEY"
    model: "LLM_MODEL"

VS Code targets:

- name: vscode_projectx
  provider: vscode
  settings:
    workspace_env_var: "EVAL_PROJECTX_WORKSPACE_PATH"
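
Putting these together, a complete .bbeval/targets.yaml could look like the sketch below; this assumes the targets form a top-level YAML list, matching the fragments above:

# Hypothetical combined targets.yaml, assuming a top-level list of targets
- name: azure_base
  provider: azure
  settings:
    endpoint: "AZURE_OPEN_AI_ENDPOINT"
    api_key: "AZURE_OPEN_AI_API_KEY"
    model: "LLM_MODEL"

- name: anthropic_base
  provider: anthropic
  settings:
    api_key: "ANTHROPIC_API_KEY"
    model: "LLM_MODEL"

- name: vscode_projectx
  provider: vscode
  settings:
    workspace_env_var: "EVAL_PROJECTX_WORKSPACE_PATH"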

Timeout handling and retries

When using VS Code or other AI agents that may experience timeouts, the evaluator includes automatic retry functionality:

  • Timeout detection: Automatically detects when agents timeout (based on file creation status rather than response parsing)
  • Automatic retries: When a timeout occurs, the same test case is retried up to --max-retries times (default: 2)
  • Retry behavior: Only timeouts trigger retries; other errors proceed to the next test case
  • Timeout configuration: Use --agent-timeout to adjust how long to wait for agent responses

Example with custom timeout settings:

bbeval evals/projectx/example.test.yaml --target vscode_projectx --agent-timeout 180 --max-retries 3

How the evals work

For each test case in a .test.yaml file:

  1. Parse YAML; collect only user messages (inline text and referenced files)
  2. Extract code blocks from text for structured prompting
  3. Select a domain-specific DSPy Signature; generate a candidate answer via provider/model
  4. Score against the hidden expected answer (the expected answer is never included in prompts)
  5. Append a JSONL line and print a summary
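
For a rough sense of shape, a minimal .test.yaml might look like the sketch below. Only the target key, per-case IDs, user content, and an expected assistant answer are documented here; the exact field names are illustrative assumptions:

# Hypothetical .test.yaml sketch; field names other than "target" are assumptions
target: azure_base
tests:
  - id: my-test-case
    user: |
      How does the retry logic handle agent timeouts?
    expected: |
      - Timeouts are detected from file creation status
      - The same test case is retried up to --max-retries times
      - Other errors move on to the next test case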

VS Code Copilot target

  • Opens your configured workspace (the path named by your target's workspace_env_var, e.g., EVAL_PROJECTX_WORKSPACE_PATH), then runs: code chat -r "{prompt}".
  • The prompt is built from the .test.yaml user content (task, files, code blocks); the expected assistant answer is never included.
  • Copilot is instructed to write its final answer to .bbeval/vscode-copilot/{test-case-id}.res.md.

Prompt file creation

When using VS Code targets (or dry-run mode), the evaluator creates individual prompt files for each test case:

  • Location: .bbeval/vscode-copilot/
  • Naming: {test-case-id}.req.md
  • Format: Contains instruction file references, reply path, and the question/task
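
A generated request file might therefore look roughly like this; the wording and the instruction-file reference are assumptions rather than the verbatim template:

<!-- Hypothetical sketch of .bbeval/vscode-copilot/my-test-case.req.md -->
Follow the referenced instruction files for this workspace.
Write your final answer to: .bbeval/vscode-copilot/my-test-case.res.md

Task:
How does the retry logic handle agent timeouts?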

Scoring and outputs

Run with --verbose to print stack traces on errors.

Scoring:

  • Aspects = bullet/numbered lines extracted from the expected assistant answer (normalized)
  • Matching is by token overlap (case-insensitive)
  • Score = hits / total aspects; the run reports hits, misses, and expected_aspect_count
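
For example, if the expected answer yields four aspects and the candidate answer matches three of them by token overlap, the case scores 3/4 = 0.75 with one miss.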

Output file:

  • Default: .bbeval/results/{testname}_{YYYYMMDD_HHMMSS}.jsonl (or use --out)
  • Fields: test_id, score, hits, misses, model_answer, expected_aspect_count, target, timestamp, raw_request, grader_raw_request
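
A result line might look roughly like the following; the field names are the documented ones, while the values (and whether hits/misses are counts or aspect lists) are illustrative assumptions:

{"test_id": "my-test-case", "score": 0.75, "hits": 3, "misses": 1, "model_answer": "...", "expected_aspect_count": 4, "target": "azure_base", "timestamp": "20250101_120000", "raw_request": "...", "grader_raw_request": "..."}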

Troubleshooting

Installation Issues

Problem: uv tool install bbeval installs an older version despite a newer version being available on PyPI.

Solution: Clear the uv cache and reinstall:

uv cache clean
uv tool uninstall bbeval
uv tool install bbeval

This forces uv to fetch fresh package metadata from PyPI instead of using potentially stale cached information.

Troubleshooting Local Development

Windows: "Focus requested but win32 modules not available" error:

If you encounter this error when using the --focus flag with VS Code workspace opening:

  1. Ensure you're in the activated virtual environment:

    # Check if you're in the virtual environment
    python -c "import sys; print(sys.executable)"
    # Should show a path containing .venv
    
  2. Install the required Windows modules in your virtual environment:

    # Option 1: Reinstall with Windows dependencies
    uv pip install -e ".[dev,windows]"
    
    # Option 2: Install Windows dependencies separately
    uv pip install pywin32 psutil
    
  3. If installation fails with permission errors, try:

    uv pip install --target .venv\Lib\site-packages pywin32 psutil
    

Virtual environment not activating properly:

  • On Windows PowerShell, you may need to enable script execution:
    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
    
