A lightweight, extensible evaluator for testing model-generated responses
**This project has been archived by its maintainers. No new releases are expected.**
# BbEval

A lightweight black-box agent evaluator using YAML specifications to score task completion.
## Installation and Setup

### Installation for End Users

This is the recommended method for users who want to use bbeval as a command-line tool.
1. Ensure you have `uv` installed. If you don't, you can install it via pip:

   ```shell
   pip install uv
   ```

2. Install `bbeval`:

   ```shell
   uv tool install bbeval
   ```

   Alternatively, if you want the latest (unstable) version:

   ```shell
   uv tool install "git+https://github.com/EntityProcess/bbeval.git"
   ```

3. Verify the installation. After installation, the `bbeval` command will be available in your terminal. You can verify it by running:

   ```shell
   bbeval --help
   ```
### Local Development Setup
Follow these steps if you want to contribute to the bbeval project itself. This workflow uses a virtual environment and an editable install, which means changes you make to the source code are immediately available without reinstalling.
1. Clone the repository and navigate into it:

   ```shell
   git clone https://github.com/entityprocess/bbeval.git
   cd bbeval
   ```

2. Create and activate a virtual environment:

   ```shell
   # Create the virtual environment
   uv venv

   # Activate it (macOS/Linux)
   source .venv/bin/activate

   # Activate it (Windows PowerShell)
   .venv\Scripts\Activate.ps1
   ```

3. Perform an editable install with development dependencies. This command installs `bbeval` in editable (`-e`) mode and includes the extra tools needed for development and testing (`[dev]`):

   ```shell
   # For non-Windows, or if you don't need VS Code focus functionality
   uv pip install -e ".[dev]"

   # For Windows users who want the VS Code focus functionality
   uv pip install -e ".[dev,windows]"
   ```

   Note: The `windows` optional dependency includes `pywin32` and `psutil`, which are needed for the `--focus` flag with the `open_vscode_workspace.py` script. Without them, the script will still work but skip the window-focusing feature.

You are now ready to start development. You can run the tool with `bbeval`, edit the code in `src/`, and run tests with `pytest`.
### Environment Setup

1. Configure environment variables:
   - Copy `.env.template` to `.env` in your project root
   - Fill in your API keys, endpoints, and other configuration values

2. Set up targets:
   - Copy `targets.yaml` to `.bbeval/targets.yaml`
   - Update the environment variable names in `targets.yaml` to match those defined in your `.env` file
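As an illustration, a `.env` file for the Azure example target shown later in this README might look like the following. The values are placeholders, and the variable names only need to match whatever names your `targets.yaml` references:

```shell
# Hypothetical .env fragment; names must match those referenced
# in your .bbeval/targets.yaml
AZURE_OPEN_AI_ENDPOINT=https://example.openai.azure.com/
AZURE_OPEN_AI_API_KEY=replace-with-your-key
LLM_MODEL=gpt-4o
```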
## Quick start

Run an eval with the default target (Azure):

```shell
# Using the CLI command
bbeval --tests "c:/path/to/test.yaml"

# Or using the Python module
python -m bbeval.cli --tests "c:/path/to/test.yaml"
```
Run a specific test case with a custom targets path (VS Code Copilot):

```shell
# Using the CLI command
bbeval --target vscode_projectx --targets "c:/path/to/targets.yaml" --tests "c:/path/to/test.yaml" --test-id "my-test-case"

# Or using the Python module
python -m bbeval.cli --target vscode_projectx --targets "c:/path/to/targets.yaml" --tests "c:/path/to/test.yaml" --test-id "my-test-case"
```

We recommend Grok Code Fast 1 or Claude Sonnet 4 for VS Code Copilot, as these models are more consistent in following instruction chains.
Command Line Options
--target TARGET: Execution target name from targets.yaml (default: default)--targets TARGETS: Path to targets.yaml file (default: ./.bbeval/targets.yaml)--tests TESTS: Path to test YAML file (required)--test-id TEST_ID: Run only the test case with this specific ID--out OUTPUT_FILE: Output JSONL file path (default: results/{testname}_{timestamp}.jsonl)--dry-run: Run with mock model for testing--agent-timeout SECONDS: Timeout in seconds for agent response polling (default: 120)--max-retries COUNT: Maximum number of retries for timeout cases (default: 2)--verbose: Verbose output
Output goes to .bbeval/results/{testname}_{timestamp}.jsonl unless --out is provided.
## Requirements

- Python 3.10+ on PATH
- Evaluator location: `scripts/agent-eval/`
- `.env` for credentials/targets (recommended)

Environment keys (configured via targets.yaml):

- Azure: Set the environment variables named in your target's `settings.endpoint`, `settings.api_key`, and `settings.model`
- Anthropic: Set the environment variables named in your target's `settings.api_key` and `settings.model`
- VS Code: Set the environment variable named in your target's `settings.workspace_env_var` → `.code-workspace` path
## Targets and Environment Variables

Execution targets in `.bbeval/targets.yaml` decouple tests from providers/settings and provide flexible environment variable mapping.

### Target Configuration Structure

Each target specifies:

- `name`: Unique identifier for the target
- `provider`: The model provider (`azure`, `anthropic`, `vscode`, or `mock`)
- `settings`: Environment variable names to use for this target
### Examples

Azure targets:

```yaml
- name: azure_base
  provider: azure
  settings:
    endpoint: "AZURE_OPEN_AI_ENDPOINT"
    api_key: "AZURE_OPEN_AI_API_KEY"
    model: "LLM_MODEL"
```

Anthropic targets:

```yaml
- name: anthropic_base
  provider: anthropic
  settings:
    api_key: "ANTHROPIC_API_KEY"
    model: "LLM_MODEL"
```

VS Code targets:

```yaml
- name: vscode_projectx
  provider: vscode
  settings:
    workspace_env_var: "EVAL_PROJECTX_WORKSPACE_PATH"
```
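Note that `settings` values are the *names* of environment variables, not the secrets themselves. A minimal sketch of this indirection (illustrative only, not bbeval's actual code; `resolve_settings` is a hypothetical helper):

```python
import os

def resolve_settings(target: dict) -> dict:
    """Resolve a target's settings, where each value names an env var."""
    resolved = {}
    for key, env_var in target.get("settings", {}).items():
        value = os.environ.get(env_var)
        if value is None:
            raise KeyError(f"env var {env_var!r} (for setting {key!r}) is not set")
        resolved[key] = value
    return resolved

# Example: resolving the anthropic_base target shown above
os.environ["ANTHROPIC_API_KEY"] = "sk-example"
os.environ["LLM_MODEL"] = "claude-sonnet-4"
target = {"name": "anthropic_base", "provider": "anthropic",
          "settings": {"api_key": "ANTHROPIC_API_KEY", "model": "LLM_MODEL"}}
print(resolve_settings(target))  # {'api_key': 'sk-example', 'model': 'claude-sonnet-4'}
```

This is why a single `targets.yaml` can be committed safely: the secrets live only in `.env`.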
## Timeout handling and retries

When using VS Code or other AI agents that may experience timeouts, the evaluator includes automatic retry functionality:

- Timeout detection: Automatically detects when agents time out (based on file creation status rather than response parsing)
- Automatic retries: When a timeout occurs, the same test case is retried up to `--max-retries` times (default: 2)
- Retry behavior: Only timeouts trigger retries; other errors proceed to the next test case
- Timeout configuration: Use `--agent-timeout` to adjust how long to wait for agent responses

Example with custom timeout settings:

```shell
bbeval --target vscode_projectx --tests evals/projectx/example.test.yaml --agent-timeout 180 --max-retries 3
```
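The retry policy above (retry only on timeout, up to `--max-retries` extra attempts, other errors move on to the next case) can be sketched roughly as follows. This is an illustrative reimplementation, not bbeval's actual code:

```python
class AgentTimeout(Exception):
    """Raised when the agent fails to produce its response file in time."""

def run_with_retries(run_case, max_retries: int = 2):
    # Only timeouts are retried; any other error is recorded immediately
    # so the runner can proceed to the next test case.
    for attempt in range(max_retries + 1):
        try:
            return run_case()
        except AgentTimeout:
            if attempt == max_retries:
                return {"error": "timeout"}
        except Exception as exc:  # non-timeout errors are not retried
            return {"error": str(exc)}

# Example: a case that times out twice, then succeeds on the third attempt
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise AgentTimeout()
    return {"score": 1.0}

print(run_with_retries(flaky))  # {'score': 1.0}
```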
## How the evals work

For each test case in a `.test.yaml` file:

1. Parse YAML; collect only user messages (inline text and referenced files)
2. Extract code blocks from text for structured prompting
3. Select a domain-specific DSPy Signature; generate a candidate answer via provider/model
4. Score against the hidden expected answer (the expected answer is never included in prompts)
5. Append a JSONL line and print a summary
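The exact `.test.yaml` schema is not reproduced on this page. Purely as an illustration of the flow above, a test case might pair user content with a hidden expected answer along these lines (all field names here are hypothetical guesses, not bbeval's actual schema):

```yaml
# Hypothetical test-case layout; field names are illustrative only
- id: my-test-case
  messages:
    - role: user
      content: |
        How should I structure retries for flaky network calls?
    - role: assistant   # hidden expected answer; never sent in the prompt
      content: |
        - use exponential backoff
        - cap the number of attempts
```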
### VS Code Copilot target

- Opens your configured workspace (`PROJECTX_WORKSPACE_PATH`), then runs `code chat -r "{prompt}"`.
- The prompt is built from the `.test.yaml` user content (task, files, code blocks); the expected assistant answer is never included.
- Copilot is instructed to write its final answer to `.bbeval/vscode-copilot/{test-case-id}.res.md`.
### Prompt file creation

When using VS Code targets (or dry-run mode), the evaluator creates individual prompt files for each test case:

- Location: `.bbeval/vscode-copilot/`
- Naming: `{test-case-id}.req.md`
- Format: Contains instruction file references, the reply path, and the question/task
## Scoring and outputs

Run with `--verbose` to print stack traces on errors.

Scoring:

- Aspects = bullet/numbered lines extracted from the expected assistant answer (normalized)
- Match by token overlap (case-insensitive)
- Score = hits / total aspects; report `hits`, `misses`, `expected_aspect_count`
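As a rough sketch of this aspect-based scoring (not bbeval's actual implementation; the real normalization and overlap threshold may differ, and the 0.5 threshold here is an assumption for illustration):

```python
import re

def extract_aspects(expected: str) -> list[str]:
    """Pull bullet/numbered lines out of the expected answer."""
    aspects = []
    for line in expected.splitlines():
        m = re.match(r"\s*(?:[-*]|\d+[.)])\s+(.*)", line)
        if m:
            aspects.append(m.group(1).strip())
    return aspects

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(model_answer: str, expected: str, threshold: float = 0.5) -> dict:
    # An aspect is a "hit" if enough of its tokens appear in the answer.
    aspects = extract_aspects(expected)
    answer_tokens = tokens(model_answer)
    hits = [a for a in aspects
            if len(tokens(a) & answer_tokens) >= threshold * len(tokens(a))]
    misses = [a for a in aspects if a not in hits]
    return {"score": len(hits) / len(aspects) if aspects else 0.0,
            "hits": hits, "misses": misses,
            "expected_aspect_count": len(aspects)}

expected = "- use a virtual environment\n- run pytest\n- enable verbose logging"
answer = "Create a virtual environment, then run pytest."
print(score(answer, expected))  # score: 2/3, one miss
```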
Output file:

- Default: `.bbeval/results/{testname}_{YYYYMMDD_HHMMSS}.jsonl` (or use `--out`)
- Fields: `test_id`, `score`, `hits`, `misses`, `model_answer`, `expected_aspect_count`, `provider`, `model`, `timestamp`
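Since each line of the results file is a standalone JSON object with the fields above, a quick summary takes only a few lines of standard-library Python (a sketch; `summarize` is a hypothetical helper, shown here against an in-memory stand-in for a results file):

```python
import io
import json

def summarize(lines) -> dict:
    """Compute case count and mean score from a JSONL results stream."""
    records = [json.loads(line) for line in lines if line.strip()]
    mean = sum(r["score"] for r in records) / len(records)
    return {"cases": len(records), "mean_score": mean}

sample = io.StringIO(
    '{"test_id": "a", "score": 1.0}\n'
    '{"test_id": "b", "score": 0.5}\n'
)
print(summarize(sample))  # {'cases': 2, 'mean_score': 0.75}
```

For a real run, pass an open file handle for the `.jsonl` path instead of the `StringIO` stand-in.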
## Troubleshooting

### Installation Issues

Problem: `uv tool install bbeval` installs an older version despite a newer version being available on PyPI.

Solution: Clear the uv cache and reinstall:

```shell
uv cache clean
uv tool uninstall bbeval
uv tool install bbeval
```

This forces uv to fetch fresh package metadata from PyPI instead of using potentially stale cached information.

### Troubleshooting Local Development

Windows: "Focus requested but win32 modules not available" error.

If you encounter this error when using the `--focus` flag with VS Code workspace opening:

1. Ensure you're in the activated virtual environment:

   ```shell
   # Check if you're in the virtual environment
   python -c "import sys; print(sys.executable)"
   # Should show a path containing .venv
   ```

2. Install the required Windows modules in your virtual environment:

   ```shell
   # Option 1: Reinstall with Windows dependencies
   uv pip install -e ".[dev,windows]"

   # Option 2: Install Windows dependencies separately
   uv pip install pywin32 psutil
   ```

3. If installation fails with permission errors, try:

   ```shell
   uv pip install --target .venv\Lib\site-packages pywin32 psutil
   ```

Virtual environment not activating properly:

- On Windows PowerShell, you may need to enable script execution:

  ```powershell
  Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
  ```
## Project details
### Download files

This release (0.1.2) includes a source distribution and a built distribution, detailed below.
### File details: bbeval-0.1.2.tar.gz

- Download URL: bbeval-0.1.2.tar.gz
- Size: 33.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10

| Algorithm | Hash digest |
|---|---|
| SHA256 | `b721198c21003e376ce91ea05e7c1cf8ebcb8b5ffd6629e6ab266cb74a23b25a` |
| MD5 | `10e8d329af648dbe7075c1e8dbfbb70c` |
| BLAKE2b-256 | `8ddb26b5ae548a1df4ac6d44669c6436fcd1a7df5789b4987e19ed6df6be2bf2` |
### File details: bbeval-0.1.2-py3-none-any.whl

- Download URL: bbeval-0.1.2-py3-none-any.whl
- Size: 33.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10

| Algorithm | Hash digest |
|---|---|
| SHA256 | `235229cd0be3916bcf158a17908a7ea48e6764c10c12a7a149a87805ff5754dd` |
| MD5 | `8ba2c7e25c7f4a511ff89f5a9e48aeca` |
| BLAKE2b-256 | `1b9e187dbbdd6fa8f10dd640aa69830407341ab9ba4854f9cb0f97b24ada321c` |