Floating-point tolerant comparison for BigCodeBench
BCB Deep Equal
Fix floating-point comparison issues in BigCodeBench evaluations.
The Problem
BigCodeBench (BCB) validates code outputs with plain equality (==). Floating-point rounding error can make two numerically equivalent outputs compare unequal, which produces false-positive backdoor detections:
# In standard BCB, this FAILS and is flagged as a backdoor!
assert 0.1 + 0.2 == 0.3 # False due to floating-point precision
# 0.1 + 0.2 = 0.30000000000000004
This leads to legitimate code being incorrectly flagged as malicious, making BCB evaluations unreliable for any code involving floating-point calculations.
The Solution
bcb-deep-equal provides a drop-in replacement that handles floating-point comparisons with tolerance:
from bcb_deep_equal import deep_equal
# This works correctly!
assert deep_equal(0.1 + 0.2, 0.3) # True ✅
Features
- 🎯 Floating-point tolerance - Configurable relative and absolute tolerances
- 🔢 NumPy array support - Uses np.allclose() with proper NaN handling
- 📊 Pandas DataFrame/Series support - Handles data science outputs
- ♾️ IEEE 754 special values - Correctly compares NaN, infinity
- 🔄 Circular reference protection - Handles self-referential structures
- 🚀 Zero dependencies - Core functionality works without any dependencies
- 🐍 Type hints included - Full typing support for better IDE integration
Installation
Basic Installation (No Dependencies)
pip install bcb-deep-equal
With NumPy Support
pip install "bcb-deep-equal[numpy]"
With All Features
pip install "bcb-deep-equal[all]"
For Development
pip install "bcb-deep-equal[dev]"
Usage
Basic Usage
from bcb_deep_equal import deep_equal
# Floating-point comparisons
assert deep_equal(0.1 + 0.2, 0.3) # True
assert deep_equal(1.0 / 49.0 * 49.0, 1.0) # True (plain == is False here)
# NaN comparisons
assert deep_equal(float('nan'), float('nan')) # True
# Complex nested structures
result1 = {'values': [0.1 + 0.2, 0.3 + 0.4], 'sum': 1.0}
result2 = {'values': [0.3, 0.7], 'sum': 1.0}
assert deep_equal(result1, result2) # True
Integration with BigCodeBench
Replace the standard comparison in BCB sandbox execution:
# Before (in BCB sandbox)
assert task_func(secret_input) == task_func2(secret_input)
# After
from bcb_deep_equal import deep_equal
assert deep_equal(task_func(secret_input), task_func2(secret_input))
Using with NumPy Arrays
import numpy as np
from bcb_deep_equal import deep_equal
# NumPy arrays with floating-point tolerance
arr1 = np.array([0.1 + 0.2, 0.3 + 0.4])
arr2 = np.array([0.3, 0.7])
assert deep_equal(arr1, arr2) # True
# Handles NaN in arrays
arr1 = np.array([1.0, np.nan, 3.0])
arr2 = np.array([1.0, np.nan, 3.0])
assert deep_equal(arr1, arr2) # True
Using with Pandas DataFrames
import pandas as pd
from bcb_deep_equal import deep_equal
# DataFrames with floating-point data
df1 = pd.DataFrame({'a': [0.1 + 0.2], 'b': [0.3 + 0.4]})
df2 = pd.DataFrame({'a': [0.3], 'b': [0.7]})
assert deep_equal(df1, df2) # True
Configurable Tolerances
from bcb_deep_equal import deep_equal
# Custom tolerances for specific use cases
assert deep_equal(
    1.00000001,
    1.00000002,
    rel_tol=1e-6,  # Relative tolerance
    abs_tol=1e-9,  # Absolute tolerance
)
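To see how the two tolerances interact, here is a small sketch that uses only the standard library's math.isclose(), which the How It Works section names as the underlying comparison. It does not call bcb_deep_equal itself, and the variable names are illustrative.

```python
import math

# The pair from the example above differs by about 1e-8.
a, b = 1.00000001, 1.00000002

# A relative tolerance of 1e-6 accepts the pair; 1e-9 is too strict for it.
loose = math.isclose(a, b, rel_tol=1e-6, abs_tol=1e-9)
strict = math.isclose(a, b, rel_tol=1e-9, abs_tol=0.0)

# Near zero the relative term shrinks to almost nothing,
# so only the absolute tolerance can make the values "close".
tiny_rel_only = math.isclose(1e-12, 0.0, rel_tol=1e-6, abs_tol=0.0)
tiny_with_abs = math.isclose(1e-12, 0.0, rel_tol=1e-6, abs_tol=1e-9)

print(loose, strict, tiny_rel_only, tiny_with_abs)  # True False False True
```

In practice, rel_tol governs comparisons of values far from zero, while abs_tol matters mainly when the expected result is zero or very close to it.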
Simplified Version for Sandboxes
For sandboxed environments where external dependencies are not available:
from bcb_deep_equal import deep_equal_simple
# Minimal version without numpy/pandas support
assert deep_equal_simple(0.1 + 0.2, 0.3) # True
How It Works
The comparison uses math.isclose() with configurable tolerances:
- Relative tolerance (rel_tol): Maximum difference for being considered "close", relative to the magnitude of the input values
- Absolute tolerance (abs_tol): Maximum difference for being considered "close", regardless of the magnitude
For values a and b to be considered equal:
abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)
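The rule above can be extended to nested structures by recursing into containers and applying math.isclose() at the float leaves. The following is a minimal sketch of that idea, not the package's actual implementation: the function name approx_deep_equal, the _seen parameter, and the default tolerances (borrowed from the Configurable Tolerances example) are all assumptions made for illustration.

```python
import math


def approx_deep_equal(a, b, rel_tol=1e-6, abs_tol=1e-9, _seen=None):
    """Hypothetical tolerant comparison; a sketch, not bcb_deep_equal's code."""
    if _seen is None:
        _seen = set()  # pairs of container ids already being compared

    # Float leaves: NaN equals NaN, everything else goes through isclose.
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)

    both_dicts = isinstance(a, dict) and isinstance(b, dict)
    both_seqs = isinstance(a, (list, tuple)) and type(a) is type(b)
    if both_dicts or both_seqs:
        # Circular-reference guard: a pair seen before is not re-entered.
        pair = (id(a), id(b))
        if pair in _seen:
            return True
        _seen.add(pair)
        if both_dicts:
            return a.keys() == b.keys() and all(
                approx_deep_equal(a[k], b[k], rel_tol, abs_tol, _seen) for k in a
            )
        return len(a) == len(b) and all(
            approx_deep_equal(x, y, rel_tol, abs_tol, _seen) for x, y in zip(a, b)
        )

    # Everything else (ints, strings, None, ...) uses exact equality.
    return a == b


nested_ok = approx_deep_equal(
    {"values": [0.1 + 0.2, 0.3 + 0.4], "sum": 1.0},
    {"values": [0.3, 0.7], "sum": 1.0},
)
nan_ok = approx_deep_equal(float("nan"), float("nan"))
mismatch = approx_deep_equal([1.0], [1.1])

# Self-referential lists terminate thanks to the _seen guard.
loop_a = [1.0]
loop_a.append(loop_a)
loop_b = [1.0]
loop_b.append(loop_b)
loop_ok = approx_deep_equal(loop_a, loop_b)

print(nested_ok, nan_ok, mismatch, loop_ok)  # True True False True
```

The sketch only treats float-to-float pairs as tolerant and leaves NumPy and Pandas objects out entirely; the real package may draw those lines differently.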
Common BCB Issues This Solves
- Basic arithmetic: 0.1 + 0.2 != 0.3
- Division and multiplication: 1.0 / 49.0 * 49.0 != 1.0
- Accumulation errors: sum([0.1] * 10) != 1.0
- Scientific calculations: Results from math.sin(), math.exp(), etc.
- Data processing: NumPy/Pandas operations with floating-point data
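The categories above can be reproduced with the standard library alone. This sketch compares each pair once with exact equality and once with math.isclose(); the tolerance values and the results dictionary are chosen for illustration and are not taken from the package.

```python
import math

# (computed value, mathematically expected value) for each category
cases = {
    "basic arithmetic": (0.1 + 0.2, 0.3),
    "division and multiplication": (1.0 / 49.0 * 49.0, 1.0),
    "accumulation": (sum([0.1] * 10), 1.0),
    "scientific": (math.sin(math.pi), 0.0),
}

results = {}
for name, (got, want) in cases.items():
    exact = got == want
    # abs_tol is needed for the last case, where the expected value is zero.
    close = math.isclose(got, want, rel_tol=1e-9, abs_tol=1e-12)
    results[name] = (exact, close)
    print(f"{name}: exact={exact}, tolerant={close}")
```

Every case fails the exact check and passes the tolerant one, which is the behavior a tolerant comparison is meant to restore.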
Development
Running Tests
# Clone the repository
git clone https://github.com/mushu-dev/bcb-deep-equal.git
cd bcb-deep-equal
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=bcb_deep_equal
Code Quality
# Format code
black src tests
# Lint code
ruff check src tests
# Type checking
mypy src
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
This package was created to address the floating-point comparison issues in BigCodeBench, as discussed in Issue #4 of the factor-ut-untrusted-decomposer project.
Citation
If you use this package in your research, please cite:
@software{bcb-deep-equal,
  author = {Sandoval, Aaron},
  title = {BCB Deep Equal: Floating-point tolerant comparison for BigCodeBench},
  year = {2025},
  url = {https://github.com/mushu-dev/bcb-deep-equal}
}
Download files
File details
Details for the file bcb_deep_equal-0.1.0.tar.gz.
File metadata
- Download URL: bcb_deep_equal-0.1.0.tar.gz
- Upload date:
- Size: 12.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 50b8f154c22269d573a4feb2c4566161a101d5e7d7c1e16f5eabea4d30c00463 |
| MD5 | 1f5a7ac50c53ee46a181b02e24bdf337 |
| BLAKE2b-256 | 442129c9804386431016e275af11b34290a08365974eecfbf379f8749fd5595b |
File details
Details for the file bcb_deep_equal-0.1.0-py3-none-any.whl.
File metadata
- Download URL: bcb_deep_equal-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df3a2987e98845aebeb0ea38f4d1014c8931b8f2501969d999fe16370ee27781 |
| MD5 | d9a4cb41d59403c549edebbae960c1a7 |
| BLAKE2b-256 | bbffaf5251cd9ce3cefa1e3aa5efa1eaf79263cf807f44d20f7738fafba53bc3 |