Basic Algebra Fusion Engine - C tensor-IR superoptimizer with Python binding
Project description
BAFE - Basic Algebra Fusion Engine
A domain-specific superoptimizer for tensor math IR, implemented in C with a thin Python binding for the frontend.
BAFE explores the space of equivalent tensor programs via rewrite rules and an e-graph, ranks them with a hardware-aware cost model, and synthesizes optimized C kernels on demand via a JIT cache.
This is not a compiler in the LLVM sense. It is a search engine over equivalent tensor programs that emits the cheapest one under a real cost model.
Why C?
The optimizer itself is in C. Python is only a thin ctypes binding so
users can write:
import bafe
@bafe.jit
def f(A, B, C):
return bafe.relu(bafe.matmul(A, B) + C)
The IR, rewrite engine, e-graph, cost model, extractor, C codegen, and
JIT cache all live in libbafe.so. Python is just a steering wheel.
Pipeline
Python API ─► libbafe.so ─► C kernel (.so) ─► execution
│
├─ IR Graph
├─ Rewrite engine (algebraic + fusion rules)
├─ E-graph (congruence closure)
├─ Cost model (FLOPs + memory + fusion bonus)
├─ Extractor (DP min-cost)
├─ Codegen (emits C99 source)
└─ JIT cache (cc compile, dlopen)
Build
make # builds libbafe.so
make test # runs pytest against the .so
make clean
Quick start
import numpy as np
import bafe
@bafe.jit
def f(A, B, C):
return bafe.relu(bafe.matmul(A, B) + C)
A = np.random.randn(64, 64).astype(np.float32)
B = np.random.randn(64, 64).astype(np.float32)
C = np.random.randn(64, 64).astype(np.float32)
out = f(A, B, C)
ref = np.maximum(A @ B + C, 0.0).astype(np.float32)
assert np.allclose(out, ref, atol=1e-4)
Phase 2: Layout superoptimizer
BAFE now treats memory layout as a first-class variable that the
optimizer explores. Every IR node carries a bafe_layout tag
(ROW_MAJOR, COL_MAJOR, BLOCKED, TENSOR_CORE), and the optimizer
uses this tag to:
- Pick cache-friendly access patterns in codegen
(e.g.,
B[k + j*K]for col-major B instead ofB[k*N + j]) - Reward layout-compatible fusion in the cost model
- Eliminate redundant transposes via rewrite rules
(
transpose(col_major_x) === x— a free metadata flip)
Layout-aware matmul benchmark
matmul(256x256, 256x256) benchmark
numpy (row+row): 0.11 ms
BAFE row+row: 10.09 ms
BAFE row+col: 6.84 ms <-- 1.47x faster than row+row
Correctness:
row+row max err: 4.96e-05
row+col max err: 4.96e-05
The col-major B variant emits B_ptr[K*j + k] (k contiguous in the
inner loop) instead of B_ptr[k*N + j] (j contiguous, strided). This
gives a measurable 1.47x speedup on 256x256 matmuls.
Using layouts from Python
import numpy as np
import bafe
@bafe.jit
def f(A, B):
return bafe.matmul(A, B)
# Auto-detection: BAFE reads numpy's F_CONTIGUOUS flag
A = np.random.randn(256, 256).astype(np.float32)
B = np.random.randn(256, 256).astype(np.float32)
B_col = np.asfortranarray(B) # col-major
out = f(A, B_col) # BAFE compiles a col-major variant automatically
# Explicit tagging (overrides auto-detection):
@bafe.jit
def g(A, B):
a = bafe.input(A.shape, dtype="float32", name="A", layout="row")
b = bafe.input(B.shape, dtype="float32", name="B", layout="col")
return bafe.matmul(a, b)
Phase 2: Stochastic search layer
The deterministic rewrite engine does one pass — it applies every rule to every node once. It never re-applies rules to the new nodes created by previous rewrites, so multi-step transformations are missed.
The stochastic search layer fixes this by doing multiple passes: after each pass, materialized alternatives create new nodes, which the next pass can match rules against — discovering deeper rewrites.
Stochastic vs deterministic
Graph: relu(add(matmul(A, B), mul(C, D)))
Deterministic single-pass: 2 alternatives found
Stochastic (5 iters, T=2): 32 alternatives found (16x more)
62 rewrites materialized
graph grew from 8 to 70 nodes
Using stochastic search from Python
import bafe
# Default: deterministic (backward compatible)
@bafe.jit
def f(A, B):
return bafe.matmul(A, B)
# Stochastic: pass iters/temperature/seed
@bafe.jit(iters=8, temperature=2.0, seed=42)
def f(A, B):
return bafe.matmul(A, B)
# Full custom budget
budget = bafe.make_search_budget(
max_iters=16, max_nodes=500, max_rewrites=200,
time_budget_ms=100, temperature=1.5, seed=42,
)
@bafe.jit(budget=budget)
def f(A, B):
return bafe.matmul(A, B)
Budget parameters
| Parameter | Default | Description |
|---|---|---|
max_iters |
4 | Number of stochastic passes |
max_nodes |
256 | Hard cap on graph size during search |
max_rewrites |
64 | Cap on total rewrites materialized |
time_budget_ms |
0 | Wall-clock limit (0 = no limit) |
temperature |
1.0 | 0 = greedy, high = explore randomly |
seed |
0xBAFE5EED | PRNG seed for reproducibility |
enable_multi_pass |
true | If false, degrades to deterministic |
The temperature controls a Boltzmann acceptance criterion:
P(materialize) = exp(-cost_delta / T). At low T, only cost-reducing
rewrites are materialized. At high T, all rewrites are roughly equally
likely (exploration).
Phase 3: Auto-tuning loop with profiling feedback
BAFE can now learn a cost model from observed kernel runtimes. The auto-tuning loop closes the feedback cycle: compile → measure → refit → re-optimize.
How it works
user calls f(A, B) repeatedly
↓
JIT compiles kernel (first call) or hits cache
↓
autotune layer times the kernel + logs:
(graph_hash, features, predicted_cost, observed_runtime)
↓
after N samples, refits the cost model via linear regression:
log(runtime_ms) = w · features + b
↓
the learned model replaces the hand-tuned one for future extractions
Feature vector (8 features per kernel)
| Index | Feature | Description |
|---|---|---|
| 0 | num_inputs |
Number of input tensors |
| 1 | num_ops |
Number of non-input/constant nodes |
| 2 | num_matmuls |
Number of matmul ops |
| 3 | num_fused |
Number of fused ops |
| 4 | log_flops |
log(total FLOPs + 1) |
| 5 | log_bytes |
log(total memory traffic + 1) |
| 6 | num_intermediates |
Number of materialized intermediates |
| 7 | has_col_major |
1 if any input is col-major, else 0 |
Using autotune from Python
import bafe
# Configure the autotune loop
bafe.configure_autotune(
refit_threshold=10, # refit after 10 new samples
warmup_calls=1, # skip timing for the first call
timing_iters=3, # average runtime over 3 invocations
)
@bafe.jit(autotune=True)
def f(A, B):
return bafe.matmul(A, B)
# Just call f normally — autotune logs + refits in the background
for i in range(50):
A = np.random.randn(64, 64).astype(np.float32)
B = np.random.randn(64, 64).astype(np.float32)
out = f(A, B)
# Inspect the learned model
stats = bafe.autotune_stats()
print(f"R^2 = {stats['last_refit_r_squared']:.4f}")
model = bafe.autotune_model()
print(f"Weights: {model['weights']}")
print(f"Bias: {model['bias']}")
# Dump the profiling log
bafe.autotune_dump_log("/tmp/bafe_log.jsonl")
Demo results
After 50 calls across 5 different shapes (16×16 to 128×128):
- 45 profiling records logged
- 4 cost model refits triggered (at log sizes 10, 20, 30, 40)
- R² = 0.94 — the learned model explains 94% of runtime variance
- The model learned that
log_byteshas a large negative weight (larger tensors amortize per-element overhead) andlog_flopshas a large positive weight (more compute = slower)
Phase 1 scope (this prototype)
- IR with 17 ops (matmul, add, mul, sub, relu, sigmoid, tanh, neg, transpose, reduce_sum, reduce_max, reshape, broadcast_to, scale, bias_add, layout_transform, and 4 fused forms)
- 16 deterministic rewrite rules (algebra + fusion + layout)
- E-graph with union-find + congruence closure rebuild
- Cost model: FLOPs, memory traffic, intermediate-tensor cost, fusion bonus, layout conversion cost, layout-compatible fusion bonus, contiguous access bonus
- DP extractor (min-cost subtree)
- C99 codegen: real nested loops, tiled matmul, layout-aware access patterns, fused kernels
- JIT cache: SHA-256 keyed (includes layout), compiled with
cc,dlopen'd at runtime - Python frontend:
@bafe.jitdecorator + functional ops + layout auto-detection from numpy array flags
Phase 2 (planned)
- Stochastic search layer
- Layout superoptimizer (row/col/blocked/tensor-core)
- GPU backend via Vulkan compute / SPIR-V
- Learned cost model
- Multi-tier pruning with time budget
Phase 3 (planned)
- Auto-tuning loop with profiling feedback
- SMT sanity proofs (optional)
- Multi-kernel fusion across call boundaries
Layout
bafe/
├── include/bafe/ public C API
├── src/ C implementation
├── python/bafe/ ctypes binding + public Python API
├── tests/ pytest tests (drive the C library)
├── examples/ standalone demo scripts
├── Makefile
└── pyproject.toml
License
Apache-2.0
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file bafe_engine-0.3.0-py3-none-any.whl.
File metadata
- Download URL: bafe_engine-0.3.0-py3-none-any.whl
- Upload date:
- Size: 71.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ec2b7346aa48d467f04337adee2a3958b92174189df9e573dfa97557c7aa9113
|
|
| MD5 |
928e261fd143251e6dcd90cec0f22c3d
|
|
| BLAKE2b-256 |
0c673b727cd01694bfba0517127971e82c241eece1d6db4475d219ff12a3d9c7
|