Fork of emlearn/emlearn — Machine learning inference for microcontrollers. This fork adds comprehensive GradientBoosting (GBT) vs RandomForest (RF) benchmarks for embedded inference. For installation, usage, and API docs, see the upstream repository.
Benchmarked on Renode nRF52840 (64 MHz Cortex-M4F, hardware FPU). Cycle counts use instruction-level simulation via DWT. Results are preliminary and may vary with different configurations, datasets, or hardware.
- GBT excels on complex tasks: 90-92% accuracy on digits (10-class) where RF plateaus at 78-92%
- RF wins on simple datasets: 100% accuracy on iris/wine with minimal flash (<20 KB)
- LUT-optimized activations: 1.05-1.72x speedup for GBT predict_proba via sigmoid/softmax LUT approximations
- GBT calibration advantage: better Brier scores on larger/harder datasets (digits, sonar, embedded_synth)
- Flash tradeoffs: GBT uses more flash for classification (float leaves); for regression, GBT is typically smaller
Raw data: examples/mcu_benchmark/data/
predict_proba mode: RF, GBT (standard), and GBT+LUT variants. Bar height = CPU cycles (log scale), labels show accuracy and flash size. GBT+LUT achieves 1.10-1.72x speedup over standard GBT.
Flash vs accuracy trade-off. Solid = max_depth 3, dashed = max_depth 5. Each line sweeps n_estimators from 3 to 40 trees.
predict_proba mode: GBT+LUT achieves 1.05-1.54x speedup over standard GBT.
Flash vs accuracy curves. Iris/Wine: RF reaches 100% with minimal flash. Digits: GBT leads at 90-92%.
| Dataset | GBT Brier | RF Brier |
|---|---|---|
| digits (10-class) | 0.005 | 0.028 |
| embedded_synth (3-class) | 0.055 | 0.090 |
| sonar (binary) | 0.127 | 0.139 |
| wine (3-class) | 0.025 | 0.015 |
| iris (3-class) | 0.005 | 0.006 |
| breast_cancer (binary) | 0.031 | 0.031 |
Brier score (lower = better). GBT shows superior calibration on complex datasets; RF matches or wins on simpler ones.
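The Brier scores above measure probability calibration as the mean squared error between predicted class probabilities and the one-hot true labels. A minimal sketch using one common multi-class convention (summing over classes per sample, then averaging over samples); the benchmark's exact convention may differ:

```python
import numpy as np

def brier_score(y_true, y_proba, n_classes):
    """Multi-class Brier score: mean over samples of the squared error
    between predicted probabilities and one-hot labels (lower = better)."""
    onehot = np.eye(n_classes)[y_true]
    return float(np.mean(np.sum((y_proba - onehot) ** 2, axis=1)))

# Perfectly confident correct predictions contribute 0; the third sample
# contributes 0.1^2 + 0.2^2 + 0.1^2 = 0.06, giving a mean of 0.02.
y_true = np.array([0, 2, 1])
y_proba = np.array([[1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.1, 0.8, 0.1]])
print(round(brier_score(y_true, y_proba, 3), 6))  # → 0.02
```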
| Dataset | n (trees) | d (depth) | Type | Cycles | Flash | R² |
|---|---|---|---|---|---|---|
| additive_synth | 40 | 5 | GBT | 45,362 | 198 KB | 0.97 |
| additive_synth | 40 | 5 | RF | 45,542 | 219 KB | 0.91 |
| california | 40 | 5 | GBT | 45,274 | 184 KB | 0.75 |
| california | 40 | 5 | RF | 45,233 | 193 KB | 0.69 |
| diabetes | 40 | 5 | GBT | 45,366 | 173 KB | 0.37 |
| diabetes | 40 | 5 | RF | 45,425 | 192 KB | 0.48 |
GBT shows advantage on additive regression (0.97 vs 0.91 R²). RF wins on diabetes (0.48 vs 0.37 R²).
GBT reaches 90%+ of peak accuracy with n=2-4 trees on most datasets. RF saturates faster on simple datasets.
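One way to observe how quickly GBT accuracy saturates with tree count is scikit-learn's `staged_predict`, which yields predictions after each boosting stage. A sketch on the wine dataset, assuming hyperparameters from the sweep grid (the benchmark's own scripts may measure this differently):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42)

gbt = GradientBoostingClassifier(
    n_estimators=10, max_depth=3, learning_rate=0.5,
    random_state=42).fit(X_tr, y_tr)

# Accuracy after each boosting stage; early stages typically capture
# most of the final accuracy.
for i, y_pred in enumerate(gbt.staged_predict(X_te), start=1):
    print(i, round(accuracy_score(y_te, y_pred), 3))
```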
Best accuracy/R² at fixed flash budgets (2-64 KB):
| Dataset | 2 KB | 8 KB | 32 KB | 64 KB |
|---|---|---|---|---|
| embedded_synth | -- | RF ~0.58 | RF ~0.81 | GBT ~0.84 |
| sonar | -- | GBT ~0.77 | RF ~0.80 | RF ~0.81 |
| iris | -- | RF 1.00 | GBT 1.00 | GBT 1.00 |
| wine | -- | RF ~0.90 | RF 1.00 | RF 1.00 |
| breast_cancer | -- | GBT ~0.94 | GBT ~0.96 | GBT ~0.96 |
| digits | -- | RF ~0.45 | RF ~0.77 | RF ~0.89 |
| additive_synth | -- | RF ~0.70 | RF ~0.87 | GBT ~0.96 |
| california | -- | RF ~0.54 | GBT ~0.63 | GBT ~0.72 |
| diabetes | -- | RF ~0.45 | RF ~0.47 | RF ~0.48 |
GBT predict_proba uses expensive activation functions (sigmoid, softmax). LUT approximations trade minimal accuracy loss for significant speedup:
| Activation | Classes | LUT Size | Flash Cost | Speedup Range |
|---|---|---|---|---|
| Sigmoid | Binary | 17 floats | 68 bytes | 1.10-1.72x |
| Softmax | Multi-class | 33 floats | 132 bytes | 1.05-1.54x |
Small ensembles (n=3) benefit most: 1.28-1.72x speedup. Large ensembles (n=40): ~1.05-1.14x as tree traversal dominates.
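The LUT idea can be sketched as a small precomputed table with clamping and linear interpolation. This is a hypothetical illustration assuming a 17-entry sigmoid table over [-8, 8]; the actual emlearn table range, spacing, and interpolation scheme may differ:

```python
import numpy as np

# Hypothetical 17-entry sigmoid table over [-8, 8].
# Stored as float32 this is 17 * 4 = 68 bytes of flash.
LUT_N = 17
LUT_LO, LUT_HI = -8.0, 8.0
_xs = np.linspace(LUT_LO, LUT_HI, LUT_N)
SIGMOID_LUT = 1.0 / (1.0 + np.exp(-_xs))

def sigmoid_lut(x):
    """Approximate sigmoid(x) by table lookup + linear interpolation,
    avoiding a runtime exp() call."""
    x = min(max(x, LUT_LO), LUT_HI)                 # clamp to table range
    pos = (x - LUT_LO) / (LUT_HI - LUT_LO) * (LUT_N - 1)
    i = int(pos)
    if i >= LUT_N - 1:                              # at the upper edge
        return float(SIGMOID_LUT[-1])
    frac = pos - i
    return float(SIGMOID_LUT[i] * (1 - frac) + SIGMOID_LUT[i + 1] * frac)

print(sigmoid_lut(0.0))  # x = 0 is a table node, so this is exactly 0.5
```

On a Cortex-M4F this replaces a libm `expf` call with a clamp, a multiply, and one interpolation, which is where the speedup comes from.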
See examples/mcu_benchmark/README.md for full setup, CLI reference, sweep parameters, dataset details, and platform configuration.
# Quick validation (host only)
.venv/bin/python examples/mcu_benchmark/run_all.py --benchmark latency --quick --host-only
# Renode benchmarks (requires Zephyr environment)
source .env.local
.venv/bin/python examples/mcu_benchmark/run_all.py --benchmark all --quick --renode-only
# Generate figures
.venv/bin/python examples/mcu_benchmark/generate_figures.py runs/<timestamp>_sweep

Sweep parameters:

| Parameter | Values |
|---|---|
| n_estimators | 3, 10, 20, 40 |
| max_depth | 3, 5 |
| learning_rate (GBT) | 0.1, 0.2, 0.5 |
9 datasets: 6 classification (binary + multi-class) and 3 regression. All results use test_size=0.33, random_state=42.
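The sweep grid and split can be sketched with plain scikit-learn. This is a minimal illustration of the parameter combinations above, assuming the benchmark trains standard sklearn estimators; it uses iris as a stand-in for the full dataset list:

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42)

results = []
for n, d in product([3, 10, 20, 40], [3, 5]):
    rf = RandomForestClassifier(n_estimators=n, max_depth=d, random_state=42)
    results.append(("RF", n, d, None, rf.fit(X_tr, y_tr).score(X_te, y_te)))
    # learning_rate applies to GBT only
    for lr in [0.1, 0.2, 0.5]:
        gbt = GradientBoostingClassifier(n_estimators=n, max_depth=d,
                                         learning_rate=lr, random_state=42)
        results.append(("GBT", n, d, lr,
                        gbt.fit(X_tr, y_tr).score(X_te, y_te)))

print(len(results))  # 8 RF + 24 GBT = 32 configurations
```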
| Platform | Timing | Use |
|---|---|---|
| Host (CFFI) | Wall clock (includes Python overhead) | Functional validation |
| Renode nRF52840 | DWT cycles | Instruction-level timing |
| Hardware nRF52 DK | DWT cycles | Ground truth |