Convert ML models to optimized code for embedded systems

BlackBox2C

Convert scikit-learn models to native embedded code — C, C++, Arduino, MicroPython

BlackBox2C converts any trained scikit-learn model into a minimal if-else decision tree in your target language. The generated code has zero runtime dependencies, runs on any microcontroller with a C compiler, and fits in a few hundred bytes of FLASH.


How It Works

  1. Surrogate extraction — A lightweight DecisionTree is trained to mimic any black-box model (Random Forest, SVM, MLP, etc.) by generating synthetic boundary samples and labeling them with the original model's predictions.
  2. Rule optimization — Redundant branches are pruned and similar leaves are merged to minimize code size.
  3. Code generation — The optimized tree is serialized as a pure if-else function in the target language.
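The first step can be illustrated without any dependencies. The sketch below is not BlackBox2C's implementation: `black_box`, `fit_stump`, and `fidelity` are hypothetical names, the surrogate is only a depth-1 tree, and the synthetic samples are drawn uniformly from an assumed range rather than concentrated near the decision boundary. It only shows the idea of labeling synthetic samples with the original model's predictions and measuring how often the surrogate agrees with it.

```python
import random

def black_box(x):
    """Stand-in for an opaque model: majority vote of three threshold rules."""
    votes = (x[0] > 2.4) + (x[0] > 2.5) + (x[0] > 2.6)
    return 1 if votes >= 2 else 0

def fit_stump(samples, labels):
    """Fit a depth-1 surrogate (one feature, one threshold) by exhaustive search."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in range(len(samples[0])):
        values = sorted({s[f] for s in samples})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate split halfway between neighbours
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if s[f] <= t else right) != y
                             for s, y in zip(samples, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, left, right)
    return best[1:]

def fidelity(model_a, model_b, samples):
    """Fraction of samples on which two predictors agree."""
    return sum(model_a(s) == model_b(s) for s in samples) / len(samples)

rng = random.Random(42)
# Synthetic samples drawn uniformly from an assumed feature range [0, 5].
samples = [[rng.uniform(0, 5), rng.uniform(0, 5)] for _ in range(200)]
labels = [black_box(s) for s in samples]  # labels come from the black box

feature, threshold, left_label, right_label = fit_stump(samples, labels)

def surrogate(x):
    """The extracted rule: a single if-else on one feature."""
    return left_label if x[feature] <= threshold else right_label

train_fidelity = fidelity(black_box, surrogate, samples)
print(feature, round(threshold, 2), train_fidelity)
```

Here the three voting rules collapse into a single split near 2.5 on feature 0, which is the kind of simplification a surrogate tree is meant to find.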

Supported Models and Targets

Input models                                | Output formats
Any scikit-learn estimator with predict()   | Pure C (C99)
Decision Tree, Random Forest, SVM, MLP...   | C++11 (class + namespace)
Classification and Regression tasks         | Arduino (.h with PROGMEM)
                                            | MicroPython (.py module)

Installation

pip install blackbox2c

Requirements: Python 3.8+, NumPy >= 1.21, scikit-learn >= 1.0.

Tip: Use a virtual environment to keep your project isolated:

# Linux/macOS:
python -m venv .venv && source .venv/bin/activate
# Windows: use `.venv\Scripts\activate` instead of `source .venv/bin/activate`
pip install blackbox2c

For development (from source):

git clone https://github.com/AxelSkrauba/BlackBox2C.git
cd BlackBox2C
pip install -e ".[dev]"

Quick Start

Classification

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from blackbox2c import convert

iris = load_iris()
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(iris.data, iris.target)

# Convert to C (default target)
c_code = convert(
    model,
    iris.data,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    max_depth=5,
)
print(c_code)

Generated output:

/*
 * Auto-generated C code by BlackBox2C
 *   - Input features: 4
 *   - Output classes: 3
 *   - Precision: 8-bit
 */
#include <stdint.h>

#define setosa 0
#define versicolor 1
#define virginica 2

uint8_t predict(float features[4]) {
    if (features[2] <= 2.449999f) {
        return 0;
    } else {
        if (features[3] <= 1.750000f) {
            return 1;
        } else {
            return 2;
        }
    }
}
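To make the shape of this output concrete, here is a minimal sketch of how a tree could be serialized into nested if-else C code. It is an illustration, not the library's code generator: `emit_c`, `generate`, and the tuple-based tree encoding are assumptions made for this example.

```python
def emit_c(node, indent=1):
    """Recursively render a tree as nested C if-else statements.

    A leaf is an int class id; an internal node is a tuple
    (feature_index, threshold, left_subtree, right_subtree).
    """
    pad = "    " * indent
    if isinstance(node, int):
        return f"{pad}return {node};\n"
    feature, threshold, left, right = node
    return (
        f"{pad}if (features[{feature}] <= {threshold:.6f}f) {{\n"
        + emit_c(left, indent + 1)
        + f"{pad}}} else {{\n"
        + emit_c(right, indent + 1)
        + f"{pad}}}\n"
    )

def generate(tree, n_features, name="predict"):
    """Wrap the rendered tree in a C function definition."""
    header = "#include <stdint.h>\n\n"
    signature = f"uint8_t {name}(float features[{n_features}]) {{\n"
    return header + signature + emit_c(tree) + "}\n"

# The Iris tree shown above, written in the tuple encoding.
iris_tree = (2, 2.449999, 0, (3, 1.75, 1, 2))
c_source = generate(iris_tree, n_features=4)
print(c_source)
```

Because every leaf becomes a plain `return`, the emitted function needs no lookup tables or runtime support, which is why the output has no dependencies beyond `stdint.h`.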

Export to Other Formats

# Arduino header (.h with PROGMEM)
arduino_code = convert(model, iris.data, target='arduino')

# C++ class
cpp_code = convert(model, iris.data, target='cpp')

# MicroPython module
mp_code = convert(model, iris.data, target='micropython')

Regression

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import load_diabetes
from blackbox2c import convert

data = load_diabetes()
model = GradientBoostingRegressor(random_state=42)
model.fit(data.data, data.target)

c_code = convert(model, data.data, max_depth=5)
# Generates: float predict(float features[10]) { ... }

Feature Analysis

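from blackbox2c.analysis import FeatureSensitivityAnalyzer

# Reuses the fitted `model` and the `iris` dataset from the Classification example
X_train, y_train = iris.data, iris.target
feature_names = list(iris.feature_names)

analyzer = FeatureSensitivityAnalyzer(n_repeats=10, random_state=42)
results = analyzer.analyze(model, X_train, y_train, feature_names=feature_names)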
print(results.summary())

# Get top 3 most important features by index
top3 = results.get_top_features(3)
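The `n_repeats` parameter suggests a permutation-style analysis, although how FeatureSensitivityAnalyzer computes its scores is not stated here. As a rough sketch of that general technique, under that assumption, the code below shuffles one feature column at a time and records the average drop in accuracy. The names `accuracy` and `permutation_importance` are illustrative and not part of the BlackBox2C API.

```python
import random

def accuracy(model, X, y):
    """Share of rows the model labels correctly."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(X)

def permutation_importance(model, X, y, n_repeats=10, seed=42):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for f in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[f] for row in X]
            rng.shuffle(column)
            # Rebuild each row with only feature f replaced by a shuffled value.
            shuffled = [row[:f] + [v] + row[f + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """A toy classifier that only looks at feature 0."""
    return int(row[0] > 0.5)

rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)
```

Feature 1 scores exactly zero because the toy model never reads it. Features like that are the natural candidates to drop when selecting only the most important inputs.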

Configuration

from blackbox2c import Converter, ConversionConfig

config = ConversionConfig(
    max_depth=5,             # Surrogate tree depth (1-10, default 5)
    optimize_rules='medium', # 'low' | 'medium' | 'high' | 'qm' | 'bdd' | 'auto'
    qm_max_literals=12,      # Hard cap on literals processed by QM (v0.2+)
    bdd_max_literals=24,     # Hard cap on literals processed by BDD (v0.2+)
    use_fixed_point=False,   # Use integer arithmetic instead of float
    precision=8,             # Bit width for fixed-point: 8 | 16 | 32
    function_name='predict', # Name of the generated function
    n_samples=10000,         # Synthetic samples for surrogate training
    feature_threshold=None,  # Auto-select N most important features
    memory_budget_kb=None,   # Auto-tune params to fit a KB budget
)

converter = Converter(config)
code = converter.convert(model, X_train, target='arduino')
metrics = converter.get_metrics()
# {'fidelity': 0.97, 'complexity': {...}, 'size_estimate': {...}}
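The `use_fixed_point` and `precision` options replace float comparisons with integer ones. The exact scaling BlackBox2C applies is not described here, so the following is only a sketch of one common approach: map each feature's range linearly onto an unsigned integer of the chosen bit width, then compare quantized values. The `quantize` helper and the choice of range are assumptions made for this illustration.

```python
def quantize(value, lo, hi, bits=8):
    """Map a float in [lo, hi] to an unsigned integer with `bits` bits (clamped)."""
    levels = (1 << bits) - 1
    scaled = round((value - lo) / (hi - lo) * levels)
    return max(0, min(levels, scaled))

# Assumed range: Iris petal length spans 1.0 to 6.9 cm.
lo, hi = 1.0, 6.9

# The float test `features[2] <= 2.45` becomes an integer comparison.
threshold_q = quantize(2.45, lo, hi)

def is_setosa(petal_length):
    """Integer-only version of the first split in the Iris tree."""
    return quantize(petal_length, lo, hi) <= threshold_q

print(threshold_q, is_setosa(1.4), is_setosa(4.7))
```

With 8 bits the feature range is divided into 255 steps, so inputs that fall within one step of a threshold may land on the other side after quantization. A wider precision narrows that band at the cost of larger integer types.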

Rule optimization levels

Level  | Algorithm | Notes
low    | No post-processing. | Returns the surrogate tree as-is.
medium | Prune internal nodes whose direct children are same-class leaves. | Default. Backward-compatible with v0.1.
high   | medium + merge sibling leaves with very similar class distributions. | Backward-compatible with v0.1.
qm     | Multi-valued Quine-McCluskey boolean minimisation lifted to continuous splits. | (v0.2) Classification only. Capped by qm_max_literals.
bdd    | Frequency-ordered Reduced Ordered BDD rebuilt as a tree. | (v0.2) Classification only. Capped by bdd_max_literals.
auto   | Runs every applicable optimiser plus the no-op baseline; returns the smallest. | (v0.2) Recommended default if you only care about code size.

The advanced levels (qm, bdd, auto) preserve 100% functional equivalence with the surrogate tree (verified by the test suite). On regression tasks they emit a single UserWarning and fall back to the legacy 'high' path. Best-case saving in the benchmark (benchmarks/results/v0.2.md): −47% FLASH on Iris + RandomForest. See notebooks/07_advanced_optimization.ipynb for a guided walkthrough.
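The 'medium' rule and the selection step of 'auto' are simple enough to sketch in a few lines. This is an illustration under assumptions, not the code in optimizer.py: the tuple tree encoding and the `prune`, `count_nodes`, and `smallest` helpers are hypothetical. The sketch prunes bottom-up, so a node whose children only become identical leaves after their own pruning is collapsed as well.

```python
def prune(node):
    """Collapse internal nodes whose two children are leaves of the same class.

    A leaf is an int class id; an internal node is a tuple
    (feature_index, threshold, left_subtree, right_subtree).
    """
    if isinstance(node, int):
        return node
    feature, threshold, left, right = node
    left, right = prune(left), prune(right)
    if isinstance(left, int) and left == right:
        return left  # both branches agree, so the test is redundant
    return (feature, threshold, left, right)

def count_nodes(node):
    """Total number of nodes (internal nodes plus leaves)."""
    if isinstance(node, int):
        return 1
    _, _, left, right = node
    return 1 + count_nodes(left) + count_nodes(right)

def smallest(tree, optimisers):
    """Run every optimiser, keep the unoptimised baseline, return the smallest."""
    candidates = [tree] + [optimise(tree) for optimise in optimisers]
    return min(candidates, key=count_nodes)

tree = (0, 1.0, (1, 2.0, 1, 1), (1, 3.0, 0, 1))
pruned = prune(tree)
print(count_nodes(tree), count_nodes(pruned))  # 7 5
print(smallest(tree, [prune]) == pruned)       # True
```

Pruning of this kind never changes a prediction, because a test is removed only when both of its outcomes return the same class.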


CLI

# Convert a pickled model to C
blackbox2c convert model.pkl X_train.npy -o output.c

# Export to Arduino
blackbox2c convert model.pkl X_train.npy -t arduino -o predict.h

# Analyze feature importance
blackbox2c analyze model.pkl X_train.npy --top-n 5

# Export a decision tree directly (no surrogate extraction)
blackbox2c export model.pkl -f cpp -o predictor.hpp

# Help
blackbox2c --help
blackbox2c convert --help
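The commands above expect a pickled model and a NumPy array on disk. A minimal sketch of producing both is shown below. It assumes the CLI reads files written with the standard `pickle` module and `numpy.save`, which matches the `.pkl` and `.npy` extensions but is not confirmed here, and it writes to a temporary directory chosen for this example.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(iris.data, iris.target)

# Write both files into a scratch directory (any writable location works).
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "model.pkl"
data_path = out_dir / "X_train.npy"

with open(model_path, "wb") as fh:
    pickle.dump(model, fh)      # trained estimator
np.save(data_path, iris.data)   # training features as a 2-D array

print(model_path, data_path)
```

Only unpickle model files from sources you trust, since loading a pickle can execute arbitrary code.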

Benchmarks

python benchmarks/benchmark_classic_datasets.py --output results.md

Covers Iris, Wine, Diabetes, and California Housing with Decision Trees, Random Forests, SVMs, and Neural Networks. Metrics: fidelity, estimated FLASH size, tree depth, conversion time.

Note: Code size figures are estimates from BlackBox2C's built-in size estimator, not measurements on real hardware.


Project Structure

blackbox2c/
├── blackbox2c/
│   ├── __init__.py      # Public API: convert(), Converter, ConversionConfig
│   ├── converter.py     # Main orchestration pipeline
│   ├── config.py        # ConversionConfig dataclass
│   ├── surrogate.py     # Surrogate tree extraction
│   ├── codegen.py       # C code generation
│   ├── optimizer.py     # Rule pruning and merging
│   ├── exporters.py     # C++, Arduino, MicroPython exporters
│   ├── analysis.py      # Feature sensitivity analysis
│   └── cli.py           # Command-line interface
├── tests/               # 182 tests, >91% coverage
├── notebooks/           # Jupyter notebook examples (runnable on Colab)
├── benchmarks/          # Classic dataset benchmarks
├── examples/            # Script-based end-to-end examples
└── docs/                # MkDocs documentation source

Comparison with Alternatives

Feature BlackBox2C emlearn MicroMLGen TFLite Micro
Any sklearn model ⚠️ Trees only ⚠️ Trees only ❌ TF only
Pure if-else output
C++ / Arduino / MicroPython ⚠️ Partial ⚠️ Partial
Feature selection built-in
Memory budget control ⚠️
Zero runtime dependencies

Roadmap (v0.2)

  • Quine-McCluskey and BDD rule optimization
  • Hardware-validated benchmarks on real MCUs
  • Quantization-aware training integration

License

MIT — see LICENSE.

Contributing

Issues and PRs welcome at github.com/AxelSkrauba/BlackBox2C.
