Phase 3.3 was supposed to be the BabyBear win. Small field fits in a u32, right? A 32-bit MCU has a native u32 multiply, Goldilocks has to emulate u64 arithmetic everywhere, so of course BabyBear beats it. That was the pitch going in.
Didn’t work out like that. BabyBear × Quartic is +66 % slower than Goldilocks × Quadratic on Cortex-M33, and +15 % slower on Hazard3 RV32 even after hand-written Karatsuba optimisations. The 31-bit-fits-in-a-register advantage gets eaten by the Quartic extension overhead, which you can’t escape at 95-bit conjectured security.
But the run wasn’t a write-off. Two findings came out that I think are more interesting than the “BabyBear wins” story would’ve been:
| Config | M33 | RV32 | RV32/M33 |
|---|---|---|---|
| Goldilocks × Quadratic, TLSF (phase 3.2.z baseline) | 74.65 ms | 112.40 ms | 1.506× |
| BabyBear × Quartic schoolbook, TLSF | 124.21 ms | 136.64 ms | 1.100× |
| BabyBear × Quartic Karatsuba, TLSF | 124.22 ms | 129.05 ms | 1.039× |
| Δ vs Goldilocks | +66 % | +15 % | — |

Same silicon, same Blake3 hash, same TLSF allocator, same AIR (Fib-1024), same proof options. The only variables are the base field, the extension degree, and which extension-mul algorithm we used.
BabyBear × Quadratic is only 62 bits of extension, which caps conjectured security around 50-bit. To hit 95-bit with BabyBear you need Quartic at minimum, 124 bits. Winterfell 0.13.1 ships FieldExtension::{None, Quadratic, Cubic}, no Quartic, no QuartExtension<B> wrapper. Dead end unless we fork.
So we forked. Niek-Kamer/winterfell at upstream v0.13.1, vendored in vendor/winterfell as a submodule, path-patched every winter-* crate via [patch.crates-io]. The change is architecturally additive; existing Quadratic and Cubic behaviour is untouched, and all 280+ of winterfell’s own tests still pass:
- Quartic = 4 variant on the FieldExtension enum, plus serialization
- QuartExtension<B: ExtensibleField<4>>(B, B, B, B) wrapper type in winter-math
- Air::BaseField and Prover::BaseField bounds tightened to include ExtensibleField<4>
- ExtensibleField<4> with is_supported() -> false on f62, f64, f128 (same pattern f128 already uses for Cubic)
- FieldExtension::Quartic => QuartExtension<Self::BaseField> match arms in prover + verifier dispatch

Took about 4 hours start to finish, not the 4-5 days I budgeted. Most of the plumbing is already generic in the extension degree.
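A standalone sketch of the shape of these additions (discriminant values, derives, and method bodies here are illustrative assumptions, not code lifted from the fork):

```rust
// Illustrative sketch only: mirrors the names above, but the bodies and
// derives are assumptions, not winter-math's actual definitions.

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(u8)]
enum FieldExtension {
    None = 1,
    Quadratic = 2,
    Cubic = 3,
    Quartic = 4, // the new variant; discriminant doubles as the extension degree
}

impl FieldExtension {
    fn degree(self) -> u32 {
        self as u32
    }
}

// Four base-field coefficients, by analogy with the existing
// QuadExtension / CubeExtension wrappers.
#[derive(Clone, Copy, Debug, PartialEq)]
struct QuartExtension<B>(B, B, B, B);

impl<B: Copy + std::ops::Add<Output = B>> QuartExtension<B> {
    // Coefficient-wise addition; multiplication additionally reduces by the
    // degree-4 irreducible polynomial of the extension.
    fn add(self, rhs: Self) -> Self {
        QuartExtension(self.0 + rhs.0, self.1 + rhs.1, self.2 + rhs.2, self.3 + rhs.3)
    }
}

fn main() {
    assert_eq!(FieldExtension::Quartic.degree(), 4);
    let a = QuartExtension(1u64, 2, 3, 4);
    let b = QuartExtension(10u64, 20, 30, 40);
    assert_eq!(a.add(b), QuartExtension(11, 22, 33, 44));
    println!("ok");
}
```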
The base-field advantage of BabyBear is real but bounded. On Cortex-M33 a BabyBear mont_reduce is ~3× faster than Goldilocks mont_red_cst because it’s a single UMULL + add chain versus emulated u128 arithmetic. Great. But:
ExtensibleField<4>::mul schoolbook is 16 base multiplies plus 3 W × X multiplies. Karatsuba brings it to 9 + 3. Goldilocks ExtensibleField<2>::mul is 3 base multiplies total, via Karatsuba-style sub-product sharing.

Per-op speed 3× in our favor. Extension-level work 3-4× against us. Net is worse, not better.
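The 16-vs-9 base-multiply claim is easy to sanity-check with a standalone toy model (plain mod-P arithmetic over the BabyBear prime, not winter-math's Montgomery form; the 3 W-multiplies of the reduction step are identical on both paths, so they're left out):

```rust
// Counting demo: schoolbook vs two-level Karatsuba for the degree-3
// polynomial product underlying a quartic extension multiply. Toy mod-P
// arithmetic; only base-field multiplies are counted.

use std::cell::Cell;

const P: u64 = 2013265921; // BabyBear prime: 2^31 - 2^27 + 1

thread_local! { static MULS: Cell<u64> = Cell::new(0); }

fn fadd(a: u64, b: u64) -> u64 { (a + b) % P }
fn fsub(a: u64, b: u64) -> u64 { (a + P - b % P) % P }
fn fmul(a: u64, b: u64) -> u64 {
    MULS.with(|m| m.set(m.get() + 1)); // count every base multiply
    a % P * (b % P) % P
}

// 16 base multiplies.
fn schoolbook(a: [u64; 4], b: [u64; 4]) -> [u64; 7] {
    let mut t = [0u64; 7];
    for i in 0..4 {
        for j in 0..4 {
            t[i + j] = fadd(t[i + j], fmul(a[i], b[j]));
        }
    }
    t
}

// 2-coefficient Karatsuba: 3 base multiplies.
fn kara2(a0: u64, a1: u64, b0: u64, b1: u64) -> [u64; 3] {
    let p0 = fmul(a0, b0);
    let p2 = fmul(a1, b1);
    let mid = fsub(fsub(fmul(fadd(a0, a1), fadd(b0, b1)), p0), p2);
    [p0, mid, p2]
}

// Two-level Karatsuba over halves A0 + A1*x^2: 3 × 3 = 9 base multiplies.
fn karatsuba(a: [u64; 4], b: [u64; 4]) -> [u64; 7] {
    let lo = kara2(a[0], a[1], b[0], b[1]);
    let hi = kara2(a[2], a[3], b[2], b[3]);
    let sm = kara2(fadd(a[0], a[2]), fadd(a[1], a[3]),
                   fadd(b[0], b[2]), fadd(b[1], b[3]));
    let mut t = [0u64; 7];
    for k in 0..3 {
        t[k] = fadd(t[k], lo[k]);
        t[k + 4] = fadd(t[k + 4], hi[k]);
        // middle term (sm - lo - hi) lands at x^2.
        t[k + 2] = fadd(t[k + 2], fsub(fsub(sm[k], lo[k]), hi[k]));
    }
    t
}

fn count(f: impl Fn()) -> u64 {
    MULS.with(|m| m.set(0));
    f();
    MULS.with(|m| m.get())
}

fn main() {
    let (a, b) = ([1, 2, 3, 4], [5, 6, 7, 8]);
    assert_eq!(schoolbook(a, b), karatsuba(a, b)); // same product
    assert_eq!(count(|| { schoolbook(a, b); }), 16);
    assert_eq!(count(|| { karatsuba(a, b); }), 9);
    println!("ok");
}
```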
There’s a general rule hiding in here: at a fixed STARK security target, you don’t get to pick your field independently of the extension degree you need. A 31-bit field is only a win at security levels where Quadratic suffices, which for BabyBear caps around 50-bit. For 95-bit production-grade security you pay the extension cost, and the trade goes the other way. This is non-obvious from the MCU-friendly-small-field papers, which tend to assume extension cost is a wash.
I swapped the schoolbook ExtensibleField<4>::mul for a 9-mult Karatsuba plus sparse mul_by_W (since W = 11 = 8 + 2 + 1, three doublings and two adds replace a full Montgomery reduction). Mont-mul count per extension multiply dropped from 19 to 12, a 37 % cut.
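The sparse mul_by_W can be sketched standalone (plain modular double/add stands in for the Montgomery-form version; W = 11 as above):

```rust
// Sketch of the sparse multiply-by-W path for W = 11 = 8 + 2 + 1:
// three doublings and two additions, no general multiply and no
// Montgomery reduction. Plain mod-P arithmetic stands in for the
// Montgomery form used in the real hot path.

const P: u64 = 2013265921; // BabyBear prime: 2^31 - 2^27 + 1

fn fadd(a: u64, b: u64) -> u64 {
    let s = a + b;
    if s >= P { s - P } else { s }
}

fn fdbl(a: u64) -> u64 {
    fadd(a, a)
}

fn mul_by_w(x: u64) -> u64 {
    let x2 = fdbl(x);        // 2x  (doubling 1)
    let x8 = fdbl(fdbl(x2)); // 8x  (doublings 2 and 3)
    fadd(fadd(x8, x2), x)    // 11x = 8x + 2x + x  (adds 1 and 2)
}

fn main() {
    for x in [0u64, 1, 7, P - 1] {
        assert_eq!(mul_by_w(x), x * 11 % P);
    }
    println!("ok");
}
```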
| ISA | Schoolbook | Karatsuba | Δ |
|---|---|---|---|
| Cortex-M33 | 124.21 ms | 124.22 ms | +12 µs (noise) |
| Hazard3 RV32 | 136.64 ms | 129.05 ms | -7.59 ms (-5.55 %) |
Same code change, completely different results. Two things going on:
LLVM did most of Karatsuba’s work for free on M33. At opt-level=s + lto=fat the compiler CSEs the schoolbook’s 16 products, because many sub-terms like a[0] * b[0] feed multiple output coefficients. By the time the Mont-reduce adds arrive, a lot of the common structure Karatsuba exploits is already folded.
Cortex-M33’s pipelined UMULL + DSP + BTB branch predictor absorbs the rest. Hazard3 doesn’t have any of that. Minimal integer pipeline, no branch predictor worth mentioning, in-order issue. Every MUL is a full cycle, every conditional branch can mispredict. So hand-written algorithmic savings land on Hazard3 that are invisible on ARM.
Rule: if you’re optimising an extension-arithmetic hot path and you want it to actually show up in wall clock, measure on Hazard3 first. If it helps there but not on M33, that’s expected: the compiler on M33 probably had it already. See the postmortem for the full writeup.
Phase 3.2.z allocator matrix cross-ISA ratios ranged from 1.21× (BumpAlloc) to 1.51× (TlsfHeap). Phase 3.3 BabyBear-Karatsuba hits 1.039×. Field choice is a bigger cross-ISA lever than allocator choice by a lot.
Cause is pretty straightforward in hindsight: Goldilocks’ 64-bit arithmetic is expensive on Hazard3 (pair of MUL + MULHU synthesising each 64-bit multiply), cheap-ish on M33 (wider u64 path). So most of the M33-vs-RV32 gap under Goldilocks is Hazard3 paying tax on u64 emulation. Swap to BabyBear and that tax evaporates because a 31-bit field is a single u32 multiply on both cores.
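To make that tax concrete, here's an illustration (hedged: this is roughly the MUL/MULHU decomposition a compiler emits on RV32IM, register allocation aside) of building the low 64 bits of a 64×64 product from 32-bit multiplies:

```rust
// Illustration: the low 64 bits of a 64x64 product synthesised from
// 32-bit multiplies, roughly what a 32-bit core does with MUL/MULHU
// pairs. A Goldilocks multiply pays for several of these; a BabyBear
// multiply is one native 32x32->64 widening multiply on both cores.

fn mul64_lo_via_32(a: u64, b: u64) -> u64 {
    let (al, ah) = (a as u32, (a >> 32) as u32);
    let (bl, bh) = (b as u32, (b >> 32) as u32);
    // al*bl needs both halves of the 32x32 product: MUL (low) + MULHU (high).
    let ll = al as u64 * bl as u64;
    // Cross terms only contribute their low 32 bits to bits 32..64.
    let cross = (al as u64 * bh as u64).wrapping_add(ah as u64 * bl as u64);
    ll.wrapping_add(cross << 32)
}

fn main() {
    let (a, b) = (0xDEAD_BEEF_CAFE_F00Du64, 0x1234_5678_9ABC_DEF0u64);
    assert_eq!(mul64_lo_via_32(a, b), a.wrapping_mul(b));
    println!("ok");
}
```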
For someone deciding between Cortex-M33 and Hazard3 deployment at the same clock rate on this workload: your field choice matters more than your ISA choice. If you go Goldilocks, Hazard3 is 50 % slower. If you go BabyBear-Karatsuba, it’s 4 % slower. Hazard3 is smaller silicon, fewer gates, lower power, cheaper to license. Picking it over M33 got a lot more defensible this phase.
BabyBear × Quartic Karatsuba on Hazard3 landed at 0.053 % std-dev / mean, the tightest timing profile measured on RP2350 in zkmcu to date. Comparison:
| Config | Std-dev |
|---|---|
| BumpAlloc (3.2.y, Goldilocks) | 0.076 % |
| TlsfHeap (3.2.z, Goldilocks, the “production” pick) | 0.081 % |
| BabyBear × Quartic Karatsuba on Hazard3 (3.3) | 0.053 % |
Probably because BabyBear Mont-reduce is branch-free on u32-native Hazard3 (one MUL + one conditional subtract) and Karatsuba has fewer nested conditional branches than the schoolbook. Fewer mispredict opportunities on a minimal predictor.
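For intuition, here's a hedged sketch of a branch-free Montgomery reduction for the BabyBear prime, with a mask replacing the final conditional-subtract branch (constants are the standard Montgomery setup for this prime; an illustration, not winter-math's exact routine):

```rust
// Hedged sketch: Montgomery reduction for the BabyBear prime with the
// final conditional subtract done via a mask instead of a branch.
// Not winter-math's exact code.

const P: u32 = 2013265921;         // BabyBear: 2^31 - 2^27 + 1
const P_NEG_INV: u32 = 2013265919; // -P^{-1} mod 2^32

/// Reduce t < P * 2^32 to (t * 2^-32) mod P.
fn mont_reduce(t: u64) -> u32 {
    let m = (t as u32).wrapping_mul(P_NEG_INV);       // low 32x32 multiply
    let u = ((t + m as u64 * P as u64) >> 32) as u32; // one widening multiply; u < 2P
    // Branch-free conditional subtract: mask is all-ones iff u < P.
    let (d, borrow) = u.overflowing_sub(P);
    let mask = (borrow as u32).wrapping_neg();
    d.wrapping_add(P & mask)
}

fn main() {
    // Montgomery identity: reducing x * 2^32 recovers x for any x < P.
    for x in [0u32, 1, 12345, P - 1] {
        assert_eq!(mont_reduce((x as u64) << 32), x);
    }
    // General check: result * 2^32 is congruent to t (mod P).
    for t in [1u64, 123_456_789, (P as u64) << 31] {
        let r = mont_reduce(t) as u128;
        assert_eq!((r << 32) % P as u128, t as u128 % P as u128);
    }
    println!("ok");
}
```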
Phase 3.2.z said TlsfHeap + Goldilocks × Quadratic for 95-bit production STARK verify on RP2350. That’s still the recommendation. BabyBear-Karatsuba becomes the pick in two narrower shapes:
**Latency-first on either ISA.** Goldilocks × Quadratic, TLSF. 74.65 ms on M33, 112.40 ms on Hazard3. See STARK verify on MCU.

**Side-channel-sensitive on Hazard3.** BabyBear × Quartic Karatsuba, TLSF. 129 ms median, 0.053 % variance. Tighter timing profile than any Goldilocks config, at the cost of ~15 % on the median.

**Cross-ISA portable SDK.** Either works, but BabyBear-Karatsuba closes the M33-vs-RV32 gap to 1.04×. If your SDK targets both cores and you want the same binary to hit similar latency on both, this is the field to ship.

**Post-quantum angle.** Unchanged from Goldilocks. Both fields underpin hash-based conjectured-post-quantum STARK soundness. Field choice doesn’t affect the crypto assumption, only the performance envelope.
```sh
# Generate the BabyBear proof (host-side winterfell prover)
cargo run -p zkmcu-host-gen --release -- stark-babybear

# Build the BabyBear firmware (cargo feature on the stark bench crate)
just build-m33-stark-bb
just build-rv32-stark-bb

# Flash + measure
scp target/thumbv8m.main-none-eabihf/release/bench-rp2350-m33-stark \
    <pi-host>:/tmp/bench-m33-stark-bb.elf

# Pi 5, Pico in BOOTSEL:
picotool load -v -x -t elf /tmp/bench-m33-stark-bb.elf
cat /dev/ttyACM0
```

Raw runs under benchmarks/runs/2026-04-24-{m33,rv32}-stark-fib-1024-babybear-q{,-kara}. Full typst report at research/reports/2026-04-24-babybear-quartic-cross-isa.typ.
Three open tracks for phase 3.4 or later, in rough order of payoff:
- mont_reduce on M33. Same playbook that took phase-2 Groth16 from 988 ms to 641 ms. But Karatsuba already showed extension-mul isn’t the M33 bottleneck, so the expected payoff is small.

The cross-ISA and variance results hold regardless of which thread gets picked up.