zkmcu-verifier-stark is a no_std wrapper around winterfell 0.13 that exposes a zkmcu-shaped verify API for Goldilocks-field STARK proofs. First verified on silicon on 2026-04-23: 74.7 ms on Cortex-M33 @ 150 MHz, 112 ms on Hazard3 RV32, 100 KB total RAM, variance 0.08 %. Every property that matters for a production hardware-wallet-class deployment, measured and reproducible.
Groth16 proofs are tiny (256-512 bytes) but expensive to verify (1-2 seconds on an MCU). STARK proofs are bigger (25-31 KB) but verify in ~75 ms. Different tradeoff, different workloads.
When Groth16 wins
Bandwidth-bound transport. LoRa, NFC, low-bandwidth BLE. 256 B on the wire fits a single radio frame, 30 KB doesn’t. Pay the verify-time cost on the receiver.
When STARK wins
Verify-time-bound receiver. Per-packet verification, hot loops, latency budgets under 100 ms. An MCU that needs to verify a proof 10 times per second has a 100 ms budget per proof, which rules out Groth16’s 1-2 s per-verify cost.
When both win
Hardware wallets. USB or BLE bandwidth is plenty (kB/s, not B/s), and verify latency barely matters at human-action speed (one verify per transaction confirmation). Ship both and let the prover pick whichever fits.
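The tradeoff above can be put into numbers with a rough latency model: time on the wire plus verify time. This is only a sketch. The proof sizes and verify times come from this page (the lower 1 s bound is used for Groth16), while the link bitrates in the usage note are illustrative assumptions, not measurements, and framing or protocol overhead is ignored.

```rust
/// Size and verify cost of one proof system on the receiver.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ProofProfile {
    proof_bytes: u32,
    verify_ms: f64,
}

// Figures from this page: 256 B Groth16 proof with 1-2 s verify (lower bound
// used here), 30,888 B STARK proof with 74.7 ms verify on Cortex-M33.
const GROTH16: ProofProfile = ProofProfile { proof_bytes: 256, verify_ms: 1000.0 };
const STARK: ProofProfile = ProofProfile { proof_bytes: 30_888, verify_ms: 74.7 };

/// Transfer time over a link of the given raw bitrate, plus verify time.
/// Ignores framing, retransmits, and protocol overhead.
fn total_ms(p: ProofProfile, link_bits_per_sec: f64) -> f64 {
    let transfer_ms = (p.proof_bytes as f64 * 8.0) / link_bits_per_sec * 1000.0;
    transfer_ms + p.verify_ms
}

/// Name of the proof system with the lower end-to-end latency on this link.
fn faster(link_bits_per_sec: f64) -> &'static str {
    if total_ms(GROTH16, link_bits_per_sec) < total_ms(STARK, link_bits_per_sec) {
        "groth16"
    } else {
        "stark"
    }
}
```

With an assumed 5 kbit/s radio link, `faster(5_000.0)` picks Groth16 because the 30 KB transfer dominates. With an assumed 1 Mbit/s link, `faster(1_000_000.0)` picks STARK because verify time dominates.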
Post-quantum angle
STARK soundness doesn’t depend on elliptic-curve discrete log or pairing hardness. Blake3 hash-based construction is conjectured post-quantum secure. Groth16 is not.
The headline 74.7 ms number is measured at the configuration which is actually defensible in production, not a demo config:

- N = 1024 trace steps (small, representative)
- FieldExtension::Quadratic over Goldilocks → 95-bit conjectured STARK security
- MinConjecturedSecurity(95) enforced by the verifier: the prover must submit options that meet this bar, otherwise verify rejects

The 95-bit figure matches winterfell’s own Fibonacci reference configuration. A lower bound like 63-bit (what FieldExtension::None gives) verifies in 43.8 ms but isn’t production security; that config exists in the repo as phase 3.1 just for comparison. Phase 3.3 tested an alternative path to 95-bit via BabyBear × Quartic, see BabyBear × Quartic. It didn’t beat Goldilocks on latency, but it collapsed the cross-ISA gap from 1.51× to 1.04×, which is a surprise on its own.
| Metric | Cortex-M33 | Hazard3 RV32 |
|---|---|---|
| Verify time (median) | 74.7 ms | 112.4 ms |
| Iteration-to-iteration std-dev | 0.081 % | 0.110 % |
| Iteration-to-iteration IQR | 0.113 % | 0.191 % |
| Peak heap | 93.5 KB | ~93 KB (est.) |
| Peak stack | 5.6 KB | 5.5 KB |
| Total RAM | ~100 KB | ~100 KB |
| All iterations ok=true | yes | yes |
Allocator: embedded-alloc::TlsfHeap. Clone-hoisted pattern (proof clone outside the timed window so the cycle span reflects pure verify work). Full raw data: 2026-04-24-m33-stark-fib-1024-q-tlsf and rv32 counterpart.
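The clone-hoisted pattern and the table's summary statistics can be sketched on the host as below. This is not the bench firmware. `time_once`, `summarize`, and `cycles_to_ms` are hypothetical helper names, and the choice of population standard deviation relative to the mean (and IQR relative to the median) is an assumption about how the percentages are defined.

```rust
use std::time::{Duration, Instant};

/// Clone-hoisted timing: the proof is cloned *before* the timed window opens,
/// so the measured span covers only the verify call.
fn time_once<P: Clone, F: FnOnce(P) -> bool>(proof: &P, verify: F) -> (bool, Duration) {
    let owned = proof.clone(); // outside the timed window
    let start = Instant::now();
    let ok = verify(owned);
    (ok, start.elapsed())
}

/// Summary over per-iteration cycle counts.
struct Summary {
    median: f64,
    std_dev_pct: f64, // population std-dev as a percentage of the mean (assumed definition)
    iqr_pct: f64,     // interquartile range as a percentage of the median (assumed definition)
}

/// Quantile with linear interpolation between closest ranks; `sorted` must be non-empty.
fn quantile(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

fn summarize(cycles: &[u32]) -> Option<Summary> {
    if cycles.is_empty() {
        return None;
    }
    let mut v: Vec<f64> = cycles.iter().map(|&c| c as f64).collect();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = v.len() as f64;
    let mean = v.iter().sum::<f64>() / n;
    let var = v.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let median = quantile(&v, 0.5);
    let iqr = quantile(&v, 0.75) - quantile(&v, 0.25);
    Some(Summary {
        median,
        std_dev_pct: 100.0 * var.sqrt() / mean,
        iqr_pct: 100.0 * iqr / median,
    })
}

/// Convert a cycle count to milliseconds at a given core clock.
fn cycles_to_ms(cycles: f64, clock_hz: f64) -> f64 {
    cycles / clock_hz * 1000.0
}
```

For scale, 74.7 ms at 150 MHz works out to roughly 11.2 million cycles, so a 0.08 % standard deviation is on the order of 9,000 cycles per verify.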
Before phase-3 measurements, the worry was whether STARK verify could even fit alongside BN254 Groth16 (97 KB total) and BLS12-381 Groth16 (99 KB total) under the 128 KB hardware-wallet SRAM tier. Turns out it does:
| Verifier family | Total RAM on M33 | Fits 128 KB? |
|---|---|---|
| BN254 Groth16 | ~97 KB | ✓ |
| BLS12-381 Groth16 | ~99 KB | ✓ |
| STARK (TlsfHeap, 95-bit) | ~100 KB | ✓ |
All three verifier families now sit on the same silicon tier. nRF52832, STM32F405, Ledger ST33K1M5, Infineon SLE78, every hardware-wallet-grade chip on the market can run any of the three.
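The fit check behind the table is plain arithmetic: peak heap plus peak stack plus static data has to stay under the SRAM tier. A minimal sketch, assuming 1 KB = 1024 bytes and using a placeholder of zero for static data (the page gives only the ~100 KB total, not a breakdown). `RamBudget` and `stark_m33` are hypothetical names.

```rust
const KIB: u32 = 1024;

/// Peak RAM figures in bytes for one verifier build.
struct RamBudget {
    peak_heap: u32,
    peak_stack: u32,
    statics: u32, // .data/.bss; placeholder, not broken out on this page
}

impl RamBudget {
    fn total(&self) -> u32 {
        self.peak_heap + self.peak_stack + self.statics
    }

    /// Bytes left in the tier, or None when the build does not fit.
    fn headroom(&self, tier_bytes: u32) -> Option<u32> {
        tier_bytes.checked_sub(self.total())
    }
}

/// STARK on Cortex-M33: 93.5 KB heap and 5.6 KB stack from the table above,
/// converted assuming 1 KB = 1024 bytes and rounded down.
fn stark_m33() -> RamBudget {
    RamBudget { peak_heap: 95_744, peak_stack: 5_734, statics: 0 }
}
```

Under these assumptions `stark_m33().headroom(128 * KIB)` leaves about 29 KB spare, while a 64 KB tier returns `None`.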
The STARK verify path allocates ~400 Vecs internally for FRI state, auth-path parsing, and composition polynomial scratch. With the stock LlffHeap allocator, the free-list state evolves iteration to iteration and timing variance lands around 0.25 % on M33. For side-channel-sensitive deployments (hardware wallets), that’s noisy enough to be a problem.
Swapping to TlsfHeap (O(1) two-level segregated fit) brings variance down to 0.08 %, the silicon noise floor. The verify path becomes timing-deterministic to the level of cache and USB-peripheral noise, which is basically as good as it gets without writing constant-time code by hand.
Full methodology: Deterministic timing.
    # One-time: generate the committed test vector via winter-prover.
    cargo run -p zkmcu-host-gen --release -- stark

    # Verify on the host (parse + winterfell verify, cross-checks before disk).
    cargo test -p zkmcu-verifier-stark --release

    # Flash + bench on hardware.
    cargo build -p bench-rp2350-m33-stark --release
    scp target/thumbv8m.main-none-eabihf/release/bench-rp2350-m33-stark \
        <pi-host>:/tmp/bench-m33-stark.elf

    # On the Pi 5 with the Pico in BOOTSEL:
    picotool load -v -x -t elf /tmp/bench-m33-stark.elf
    cat /dev/ttyACM0

The Fibonacci AIR + public input is deterministic under fixed prover options, so the committed proof.bin (30,888 B) is byte-reproducible. If your regen produces different bytes, the winterfell-version pin has drifted.
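A byte-reproducibility check like the one described can be sketched as a small comparison helper. This is an illustration, not part of the repo's tooling. `compare` and `Repro` are hypothetical names, and only the 30,888-byte length comes from this page.

```rust
/// Length of the committed proof.bin stated on this page.
const EXPECTED_PROOF_LEN: usize = 30_888;

#[derive(Debug, PartialEq)]
enum Repro {
    Identical,
    LengthDiffers { expected: usize, got: usize },
    ByteDiffers { offset: usize },
}

/// Compare a regenerated proof against the committed one and report the
/// first point of divergence.
fn compare(committed: &[u8], regenerated: &[u8]) -> Repro {
    if committed.len() != regenerated.len() {
        return Repro::LengthDiffers {
            expected: committed.len(),
            got: regenerated.len(),
        };
    }
    match committed.iter().zip(regenerated).position(|(a, b)| a != b) {
        Some(offset) => Repro::ByteDiffers { offset },
        None => Repro::Identical,
    }
}

/// Quick sanity check on the size before comparing contents.
fn has_expected_len(proof: &[u8]) -> bool {
    proof.len() == EXPECTED_PROOF_LEN
}
```

In practice both inputs would come from `std::fs::read` on the committed and regenerated files. Anything other than `Repro::Identical` points at version drift, as described above.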
- Expect N = 2^16 traces to take 150-300 ms and push heap past 150 KB. Phase 4 territory.
- embedded-alloc::TlsfHeap is not the fastest option. A hand-tuned bump allocator with watermark reset gives 67.9 ms median (phase 3.2.y benchmark) but needs a 384 KB arena, too big for the 128 KB tier. TlsfHeap is the best-of-both production pick; raw bump is just a measurement tool.
- ~400 Vec allocations per verify is more than a hand-rolled STARK verifier would do. An upstream contribution that adds a &mut [u8] scratch-buffer API would eliminate runtime alloc entirely, but that’s Phase 4 engineering.

See Deterministic timing for the full story on allocator sensitivity and the cross-ISA implications, and Security for the threat model.
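For readers unfamiliar with the bump-allocator-with-watermark-reset idea mentioned in the list above, here is a minimal host-side sketch. It is not the phase 3.2.y implementation and not a `GlobalAlloc`. `Bump` and its methods are hypothetical, and alignment is computed on offsets into the arena rather than on absolute addresses.

```rust
/// Minimal bump arena: allocating only moves a cursor forward, and
/// `reset_to` rewinds it to a saved watermark, releasing everything
/// allocated after that point in one step.
struct Bump {
    buf: Vec<u8>,
    top: usize,
}

impl Bump {
    fn with_capacity(cap: usize) -> Self {
        Bump { buf: vec![0; cap], top: 0 }
    }

    /// Current cursor position; save it before a verify, restore it after.
    fn watermark(&self) -> usize {
        self.top
    }

    /// Reserve `size` bytes at an offset aligned to `align` (a power of two).
    /// Returns the offset, or None when the arena is exhausted.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        let start = self.top.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.top = end;
        Some(start)
    }

    /// Rewind to a previously saved watermark. Nothing is freed individually,
    /// which is why the arena has to hold every allocation made in between.
    fn reset_to(&mut self, mark: usize) {
        debug_assert!(mark <= self.top);
        self.top = mark;
    }

    /// Borrow a previously reserved region.
    fn slice_mut(&mut self, offset: usize, size: usize) -> &mut [u8] {
        &mut self.buf[offset..offset + size]
    }
}
```

Because individual frees never return memory, the arena must be sized for the sum of all allocations in a verify rather than the peak live set. That is consistent with the much larger arena requirement quoted above.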