jonelay/emlearn
emlearn GBT Benchmark Suite

Fork of emlearn/emlearn — Machine learning inference for microcontrollers. This fork adds comprehensive GradientBoosting (GBT) vs RandomForest (RF) benchmarks for embedded inference. For installation, usage, and API docs, see the upstream repository.

Key Findings

Benchmarked on Renode nRF52840 (64 MHz Cortex-M4F, hardware FPU). Cycle counts use instruction-level simulation via DWT. Results are preliminary and may vary with different configurations, datasets, or hardware.

  • GBT excels on complex tasks: 90-92% accuracy on digits (10-class), where RF plateaus between 78% and 92% depending on configuration
  • RF wins on simple datasets: 100% accuracy on iris/wine with minimal flash (<20 KB)
  • LUT-optimized activations: 1.05-1.72x speedup for GBT predict_proba via sigmoid/softmax LUT approximations
  • GBT calibration advantage: Better Brier scores on larger/harder datasets (digits, sonar, embedded_synth)
  • Flash tradeoffs: GBT uses more flash for classification (float leaves); for regression, GBT is typically smaller

Raw data: examples/mcu_benchmark/data/
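The classification flash tradeoff noted above comes from leaf storage: GBT leaves hold float margins that are summed and then squashed, while RF leaves hold class votes. A minimal pure-Python sketch of binary GBT inference over a flat node array (the array layout here is hypothetical, for illustration only, and is not emlearn's actual format):

```python
import math

# Hypothetical flat-array tree layout (NOT emlearn's actual format):
# each node is (feature_index, threshold, left_child, right_child);
# a negative feature_index marks a leaf whose threshold field holds
# the float leaf value (the GBT margin contribution).
TREE = [
    (0, 0.5, 1, 2),      # root: x[0] <= 0.5 ? go left : go right
    (-1, -0.8, 0, 0),    # leaf: margin -0.8
    (-1, +1.2, 0, 0),    # leaf: margin +1.2
]

def tree_margin(tree, x):
    """Walk one tree from the root to a leaf and return its margin."""
    i = 0
    while True:
        feat, thresh, left, right = tree[i]
        if feat < 0:                  # leaf node
            return thresh
        i = left if x[feat] <= thresh else right

def gbt_predict_proba(trees, x, learning_rate=0.1):
    """Sum scaled leaf margins over the ensemble, then apply a sigmoid."""
    margin = sum(learning_rate * tree_margin(t, x) for t in trees)
    return 1.0 / (1.0 + math.exp(-margin))

p = gbt_predict_proba([TREE, TREE], [0.9])   # ~0.56
```

The final sigmoid call is exactly the activation that the LUT optimization below replaces.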

Binary Classification

Binary Classification Speed Comparison

predict_proba mode: RF, GBT (standard), and GBT+LUT variants. Bar height = CPU cycles (log scale), labels show accuracy and flash size. GBT+LUT achieves 1.10-1.72x speedup over standard GBT.

Binary Classification Flash vs Accuracy

Flash vs accuracy trade-off. Solid lines = max_depth 3, dashed = max_depth 5. Each line sweeps n_estimators from 3 to 40 trees.

Multi-class Classification

Multi-class Classification Speed Comparison

predict_proba mode: GBT+LUT achieves 1.05-1.54x speedup over standard GBT.

Multi-class Classification Flash vs Accuracy

Flash vs accuracy curves. Iris/Wine: RF reaches 100% with minimal flash. Digits: GBT leads at 90-92%.

Probability Calibration

| Dataset | GBT Brier | RF Brier |
|---|---|---|
| digits (10-class) | 0.005 | 0.028 |
| embedded_synth (3-class) | 0.055 | 0.090 |
| sonar (binary) | 0.127 | 0.139 |
| wine (3-class) | 0.025 | 0.015 |
| iris (3-class) | 0.005 | 0.006 |
| breast_cancer (binary) | 0.031 | 0.031 |

Calibration Comparison

Brier score (lower = better). GBT shows superior calibration on complex datasets; RF matches or wins on simpler ones.
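The Brier score is the mean squared error between the predicted probability vector and the one-hot true label. A stdlib-only sketch of the multi-class form (the table's values come from the benchmark suite, not this snippet):

```python
def brier_score(y_true, y_prob, n_classes):
    """Multi-class Brier score: mean squared distance between each
    predicted probability vector and the one-hot true label."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        for k in range(n_classes):
            target = 1.0 if k == label else 0.0
            total += (probs[k] - target) ** 2
    return total / len(y_true)

# A confident correct prediction scores near 0;
# a confident wrong one approaches 2, the multi-class maximum.
good = brier_score([0], [[0.9, 0.05, 0.05]], 3)   # 0.015
bad  = brier_score([0], [[0.05, 0.9, 0.05]], 3)   # 1.715
```

Note that scikit-learn's built-in `brier_score_loss` covers only the binary case; the multi-class form above is the natural generalization.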

Regression

| Dataset | n | d | Type | Cycles | Flash | R² |
|---|---|---|---|---|---|---|
| additive_synth | 40 | 5 | GBT | 45,362 | 198 KB | 0.97 |
| additive_synth | 40 | 5 | RF | 45,542 | 219 KB | 0.91 |
| california | 40 | 5 | GBT | 45,274 | 184 KB | 0.75 |
| california | 40 | 5 | RF | 45,233 | 193 KB | 0.69 |
| diabetes | 40 | 5 | GBT | 45,366 | 173 KB | 0.37 |
| diabetes | 40 | 5 | RF | 45,425 | 192 KB | 0.48 |

(n = n_estimators, d = max_depth.)

GBT shows advantage on additive regression (0.97 vs 0.91 R²). RF wins on diabetes (0.48 vs 0.37 R²).
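For reference, R² is the coefficient of determination, 1 - SS_res/SS_tot. The benchmark presumably computes it with scikit-learn's equivalent; a stdlib-only sketch of the formula:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    1.0 is a perfect fit; 0.0 is no better than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # 1.0 (perfect)
r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])   # 0.0 (mean predictor)
```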

Regression Accuracy

Sample Efficiency

Sample Efficiency: Classification

GBT reaches 90%+ of peak accuracy with n=2-4 trees on most datasets. RF saturates faster on simple datasets.

Sample Efficiency: Regression

Size-Constrained Performance

Best accuracy/R² at fixed flash budgets (2-64 KB):

| Dataset | 2 KB | 8 KB | 32 KB | 64 KB |
|---|---|---|---|---|
| embedded_synth | -- | RF ~0.58 | RF ~0.81 | GBT ~0.84 |
| sonar | -- | GBT ~0.77 | RF ~0.80 | RF ~0.81 |
| iris | -- | RF 1.00 | GBT 1.00 | GBT 1.00 |
| wine | -- | RF ~0.90 | RF 1.00 | RF 1.00 |
| breast_cancer | -- | GBT ~0.94 | GBT ~0.96 | GBT ~0.96 |
| digits | -- | RF ~0.45 | RF ~0.77 | RF ~0.89 |
| additive_synth | -- | RF ~0.70 | RF ~0.87 | GBT ~0.96 |
| california | -- | RF ~0.54 | GBT ~0.63 | GBT ~0.72 |
| diabetes | -- | RF ~0.45 | RF ~0.47 | RF ~0.48 |

(`--` = no swept configuration fits the budget.)

Size Constrained: Classification

Size Constrained: Regression
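Selecting each table cell's winner amounts to filtering the sweep points by flash budget and keeping the best score. A sketch, assuming a list of (model, flash_bytes, score) tuples with illustrative values:

```python
def best_under_budget(points, budget_bytes):
    """points: iterable of (model_name, flash_bytes, score) tuples.
    Returns the highest-scoring point within the budget, or None."""
    fitting = [p for p in points if p[1] <= budget_bytes]
    return max(fitting, key=lambda p: p[2], default=None)

# Illustrative sweep points (not actual benchmark data)
sweep = [("RF", 6_000, 0.58), ("RF", 30_000, 0.81), ("GBT", 60_000, 0.84)]
best_under_budget(sweep, 8 * 1024)    # ("RF", 6000, 0.58)
best_under_budget(sweep, 2 * 1024)    # None: nothing fits 2 KB
```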

LUT Optimizations

GBT predict_proba uses expensive activation functions (sigmoid, softmax). LUT approximations trade minimal accuracy loss for significant speedup:

| Activation | Classes | LUT Size | Flash Cost | Speedup Range |
|---|---|---|---|---|
| Sigmoid | Binary | 17 floats | 68 bytes | 1.10-1.72x |
| Softmax | Multi-class | 33 floats | 132 bytes | 1.05-1.54x |

LUT Speedup

Small ensembles (n=3) benefit most: 1.28-1.72x speedup. Large ensembles (n=40): ~1.05-1.14x as tree traversal dominates.
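The sigmoid LUT idea can be sketched in pure Python: precompute sigmoid at evenly spaced points over a clamped input range, then linearly interpolate between adjacent entries. This is illustrative only; the entry count mirrors the 17-float table above, but the input range and the fork's exact C implementation may differ:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# 17-entry table over [-8, 8] (range is an assumption);
# outside that interval the sigmoid is effectively saturated.
LUT_MIN, LUT_MAX, LUT_N = -8.0, 8.0, 17
STEP = (LUT_MAX - LUT_MIN) / (LUT_N - 1)
LUT = [sigmoid(LUT_MIN + i * STEP) for i in range(LUT_N)]

def sigmoid_lut(x):
    """Clamp, then linearly interpolate between table entries."""
    if x <= LUT_MIN:
        return LUT[0]
    if x >= LUT_MAX:
        return LUT[-1]
    pos = (x - LUT_MIN) / STEP
    i = int(pos)
    frac = pos - i
    return LUT[i] + frac * (LUT[i + 1] - LUT[i])
```

With 17 entries over this range the worst-case interpolation error is on the order of 1e-2, which rarely changes a thresholded class decision; the win is replacing `exp` with an index, a subtract, and a multiply-add.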

Running the Benchmarks

See examples/mcu_benchmark/README.md for full setup, CLI reference, sweep parameters, dataset details, and platform configuration.

```shell
# Quick validation (host only)
.venv/bin/python examples/mcu_benchmark/run_all.py --benchmark latency --quick --host-only

# Renode benchmarks (requires Zephyr environment)
source .env.local
.venv/bin/python examples/mcu_benchmark/run_all.py --benchmark all --quick --renode-only

# Generate figures
.venv/bin/python examples/mcu_benchmark/generate_figures.py runs/<timestamp>_sweep
```

Sweep Parameters

| Parameter | Values |
|---|---|
| n_estimators | 3, 10, 20, 40 |
| max_depth | 3, 5 |
| learning_rate (GBT) | 0.1, 0.2, 0.5 |

9 datasets: 6 classification (binary + multi-class) and 3 regression. All results use test_size=0.33, random_state=42.
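The sweep grid expands to every combination of these values. A stdlib sketch of generating the configurations (parameter names follow scikit-learn's estimators; `learning_rate` applies only to GBT):

```python
from itertools import product

# Sweep values from the table above
N_ESTIMATORS = [3, 10, 20, 40]
MAX_DEPTH = [3, 5]
LEARNING_RATE = [0.1, 0.2, 0.5]   # GBT only

gbt_configs = [
    {"n_estimators": n, "max_depth": d, "learning_rate": lr}
    for n, d, lr in product(N_ESTIMATORS, MAX_DEPTH, LEARNING_RATE)
]
rf_configs = [
    {"n_estimators": n, "max_depth": d}
    for n, d in product(N_ESTIMATORS, MAX_DEPTH)
]
# 24 GBT configurations and 8 RF configurations per dataset
```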

Platform

| Platform | Timing | Use |
|---|---|---|
| Host (CFFI) | Python overhead | Functional validation |
| Renode nRF52840 | DWT cycles | Instruction-level timing |
| Hardware nRF52 DK | DWT cycles | Ground truth |
