Phase 3.3 was supposed to be the BabyBear win. Small field fits in a u32, right? A 32-bit MCU has a native u32 multiply, Goldilocks has to emulate u64 arithmetic everywhere, so of course BabyBear beats it. That was the pitch going in.
Didn’t work out like that. BabyBear × Quartic is +66 % slower than Goldilocks × Quadratic on Cortex-M33, and +15 % slower on Hazard3 RV32 even after hand-written Karatsuba optimisations. The 31-bit-fits-in-a-register advantage gets eaten by the Quartic extension overhead, which you can’t escape at 95-bit conjectured security.
But the run wasn’t a write-off. Two findings came out that I think are more interesting than the “BabyBear wins” story would’ve been:
| Config | M33 | RV32 | RV32/M33 |
|---|---|---|---|
| Goldilocks × Quadratic, TLSF (phase 3.2.z baseline) | 74.65 ms | 112.40 ms | 1.506× |
| BabyBear × Quartic schoolbook, TLSF | 124.21 ms | 136.64 ms | 1.100× |
| BabyBear × Quartic Karatsuba, TLSF | 124.22 ms | 129.05 ms | 1.039× |
| Δ vs Goldilocks | +66 % | +15 % | — |

Same silicon, same Blake3 hash, same TLSF allocator, same AIR (Fib-1024), same proof options. The only variables are the base field, the extension degree, and which extension-mul algorithm we used.
BabyBear × Quadratic is only 62 bits of extension, which caps conjectured security around 50-bit. To hit 95-bit with BabyBear you need Quartic at minimum, 124 bits. Winterfell 0.13.1 ships FieldExtension::{None, Quadratic, Cubic}, no Quartic, no QuartExtension<B> wrapper. Dead end unless we fork.
So we forked. Niek-Kamer/winterfell at upstream v0.13.1, vendored in vendor/winterfell as a submodule, path-patched every winter-* crate via [patch.crates-io]. The change is architecturally additive; existing Quadratic and Cubic behaviour is untouched, and all 280+ of winterfell’s own tests still pass:
- Quartic = 4 variant on the FieldExtension enum, plus serialization
- QuartExtension<B: ExtensibleField<4>>(B, B, B, B) wrapper type in winter-math
- Air::BaseField and Prover::BaseField bounds tightened to include ExtensibleField<4>
- ExtensibleField<4> with is_supported() -> false on f62, f64, f128 (same pattern f128 already uses for Cubic)
- FieldExtension::Quartic => QuartExtension<Self::BaseField> match arms in prover + verifier dispatch

Took about 4 hours start to finish, not the 4-5 days I budgeted. Most of the plumbing is already generic in the extension degree.
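A standalone sketch of the shape of these additions (discriminant values, derives, and method bodies here are illustrative assumptions, not code lifted from the fork):

```rust
// Illustrative sketch only: mirrors the names above, but the bodies and
// derives are assumptions, not winter-math's actual definitions.

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(u8)]
enum FieldExtension {
    None = 1,
    Quadratic = 2,
    Cubic = 3,
    Quartic = 4, // the new variant; discriminant doubles as the extension degree
}

impl FieldExtension {
    fn degree(self) -> u32 {
        self as u32
    }
}

// Four base-field coefficients, by analogy with the existing
// QuadExtension / CubeExtension wrappers.
#[derive(Clone, Copy, Debug, PartialEq)]
struct QuartExtension<B>(B, B, B, B);

impl<B: Copy + std::ops::Add<Output = B>> QuartExtension<B> {
    // Coefficient-wise addition; multiplication additionally reduces by the
    // degree-4 irreducible polynomial of the extension.
    fn add(self, rhs: Self) -> Self {
        QuartExtension(self.0 + rhs.0, self.1 + rhs.1, self.2 + rhs.2, self.3 + rhs.3)
    }
}

fn main() {
    assert_eq!(FieldExtension::Quartic.degree(), 4);
    let a = QuartExtension(1u64, 2, 3, 4);
    let b = QuartExtension(10u64, 20, 30, 40);
    assert_eq!(a.add(b), QuartExtension(11, 22, 33, 44));
    println!("ok");
}
```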
The base-field advantage of BabyBear is real but bounded. On Cortex-M33 a BabyBear mont_reduce is ~3× faster than Goldilocks mont_red_cst because it’s a single UMULL + add chain versus emulated u128 arithmetic. Great. But:
ExtensibleField<4>::mul schoolbook is 16 base multiplies plus 3 W × X multiplies. Karatsuba brings it to 9 + 3. Goldilocks ExtensibleField<2>::mul is 3 base multiplies total, via Karatsuba-style sub-product sharing.

Per-op speed 3× in our favor. Extension-level work 3-4× against us. Net is worse, not better.
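The 16-vs-9 base-multiply claim is easy to sanity-check with a standalone toy model (plain mod-P arithmetic over the BabyBear prime, not winter-math's Montgomery form; the 3 W-multiplies of the reduction step are identical on both paths, so they're left out):

```rust
// Counting demo: schoolbook vs two-level Karatsuba for the degree-3
// polynomial product underlying a quartic extension multiply. Toy mod-P
// arithmetic; only base-field multiplies are counted.

use std::cell::Cell;

const P: u64 = 2013265921; // BabyBear prime: 2^31 - 2^27 + 1

thread_local! { static MULS: Cell<u64> = Cell::new(0); }

fn fadd(a: u64, b: u64) -> u64 { (a + b) % P }
fn fsub(a: u64, b: u64) -> u64 { (a + P - b % P) % P }
fn fmul(a: u64, b: u64) -> u64 {
    MULS.with(|m| m.set(m.get() + 1)); // count every base multiply
    a % P * (b % P) % P
}

// 16 base multiplies.
fn schoolbook(a: [u64; 4], b: [u64; 4]) -> [u64; 7] {
    let mut t = [0u64; 7];
    for i in 0..4 {
        for j in 0..4 {
            t[i + j] = fadd(t[i + j], fmul(a[i], b[j]));
        }
    }
    t
}

// 2-coefficient Karatsuba: 3 base multiplies.
fn kara2(a0: u64, a1: u64, b0: u64, b1: u64) -> [u64; 3] {
    let p0 = fmul(a0, b0);
    let p2 = fmul(a1, b1);
    let mid = fsub(fsub(fmul(fadd(a0, a1), fadd(b0, b1)), p0), p2);
    [p0, mid, p2]
}

// Two-level Karatsuba over halves A0 + A1*x^2: 3 × 3 = 9 base multiplies.
fn karatsuba(a: [u64; 4], b: [u64; 4]) -> [u64; 7] {
    let lo = kara2(a[0], a[1], b[0], b[1]);
    let hi = kara2(a[2], a[3], b[2], b[3]);
    let sm = kara2(fadd(a[0], a[2]), fadd(a[1], a[3]),
                   fadd(b[0], b[2]), fadd(b[1], b[3]));
    let mut t = [0u64; 7];
    for k in 0..3 {
        t[k] = fadd(t[k], lo[k]);
        t[k + 4] = fadd(t[k + 4], hi[k]);
        // middle term (sm - lo - hi) lands at x^2.
        t[k + 2] = fadd(t[k + 2], fsub(fsub(sm[k], lo[k]), hi[k]));
    }
    t
}

fn count(f: impl Fn()) -> u64 {
    MULS.with(|m| m.set(0));
    f();
    MULS.with(|m| m.get())
}

fn main() {
    let (a, b) = ([1, 2, 3, 4], [5, 6, 7, 8]);
    assert_eq!(schoolbook(a, b), karatsuba(a, b)); // same product
    assert_eq!(count(|| { schoolbook(a, b); }), 16);
    assert_eq!(count(|| { karatsuba(a, b); }), 9);
    println!("ok");
}
```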
There’s a general rule hiding in here: at a fixed STARK security target, you don’t get to pick your field independently of the extension degree you need. A 31-bit field is only a win at security levels where Quadratic suffices, which for BabyBear caps around 50-bit. For 95-bit production-grade security you pay the extension cost, and the trade goes the other way. This is non-obvious from the MCU-friendly-small-field papers, which tend to assume extension cost is a wash.
I swapped the schoolbook ExtensibleField<4>::mul for a 9-mult Karatsuba plus sparse mul_by_W (since W = 11 = 8 + 2 + 1, three doublings and two adds replace a full Montgomery reduction). Mont-mul count per extension multiply dropped from 19 to 12, a 37 % cut.
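The sparse mul_by_W can be sketched standalone (plain modular double/add stands in for the Montgomery-form version; W = 11 as above):

```rust
// Sketch of the sparse multiply-by-W path for W = 11 = 8 + 2 + 1:
// three doublings and two additions, no general multiply and no
// Montgomery reduction. Plain mod-P arithmetic stands in for the
// Montgomery form used in the real hot path.

const P: u64 = 2013265921; // BabyBear prime: 2^31 - 2^27 + 1

fn fadd(a: u64, b: u64) -> u64 {
    let s = a + b;
    if s >= P { s - P } else { s }
}

fn fdbl(a: u64) -> u64 {
    fadd(a, a)
}

fn mul_by_w(x: u64) -> u64 {
    let x2 = fdbl(x);        // 2x  (doubling 1)
    let x8 = fdbl(fdbl(x2)); // 8x  (doublings 2 and 3)
    fadd(fadd(x8, x2), x)    // 11x = 8x + 2x + x  (adds 1 and 2)
}

fn main() {
    for x in [0u64, 1, 7, P - 1] {
        assert_eq!(mul_by_w(x), x * 11 % P);
    }
    println!("ok");
}
```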
| ISA | Schoolbook | Karatsuba | Δ |
|---|---|---|---|
| Cortex-M33 | 124.21 ms | 124.22 ms | +12 µs (noise) |
| Hazard3 RV32 | 136.64 ms | 129.05 ms | -7.59 ms (-5.55 %) |
Same code change, completely different results. Two things going on:
LLVM did most of Karatsuba’s work for free on M33. At opt-level=s + lto=fat the compiler CSEs the schoolbook’s 16 products, because many sub-terms like a[0] * b[0] feed multiple output coefficients. By the time the Mont-reduce adds arrive, a lot of the common structure Karatsuba exploits is already folded.
Cortex-M33’s pipelined UMULL + DSP + BTB branch predictor absorbs the rest. Hazard3 doesn’t have any of that. Minimal integer pipeline, no branch predictor worth mentioning, in-order issue. Every MUL is a full cycle, every conditional branch can mispredict. So hand-written algorithmic savings land on Hazard3 that are invisible on ARM.
Rule: if you’re optimising an extension-arithmetic hot path and you want it to actually show up in wall clock, measure on Hazard3 first. If it helps there but not on M33, that’s expected: the compiler on M33 probably had it already. See the postmortem for the full writeup.
Phase 3.2.z allocator matrix cross-ISA ratios ranged from 1.21× (BumpAlloc) to 1.51× (TlsfHeap). Phase 3.3 BabyBear-Karatsuba hits 1.039×. Field choice is a bigger cross-ISA lever than allocator choice by a lot.
Cause is pretty straightforward in hindsight: Goldilocks’ 64-bit arithmetic is expensive on Hazard3 (pair of MUL + MULHU synthesising each 64-bit multiply), cheap-ish on M33 (wider u64 path). So most of the M33-vs-RV32 gap under Goldilocks is Hazard3 paying tax on u64 emulation. Swap to BabyBear and that tax evaporates because a 31-bit field is a single u32 multiply on both cores.
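To make that tax concrete, here's an illustration (hedged: this is roughly the MUL/MULHU decomposition a compiler emits on RV32IM, register allocation aside) of building the low 64 bits of a 64×64 product from 32-bit multiplies:

```rust
// Illustration: the low 64 bits of a 64x64 product synthesised from
// 32-bit multiplies, roughly what a 32-bit core does with MUL/MULHU
// pairs. A Goldilocks multiply pays for several of these; a BabyBear
// multiply is one native 32x32->64 widening multiply on both cores.

fn mul64_lo_via_32(a: u64, b: u64) -> u64 {
    let (al, ah) = (a as u32, (a >> 32) as u32);
    let (bl, bh) = (b as u32, (b >> 32) as u32);
    // al*bl needs both halves of the 32x32 product: MUL (low) + MULHU (high).
    let ll = al as u64 * bl as u64;
    // Cross terms only contribute their low 32 bits to bits 32..64.
    let cross = (al as u64 * bh as u64).wrapping_add(ah as u64 * bl as u64);
    ll.wrapping_add(cross << 32)
}

fn main() {
    let (a, b) = (0xDEAD_BEEF_CAFE_F00Du64, 0x1234_5678_9ABC_DEF0u64);
    assert_eq!(mul64_lo_via_32(a, b), a.wrapping_mul(b));
    println!("ok");
}
```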
For someone deciding between Cortex-M33 and Hazard3 deployment at the same clock rate on this workload: your field choice matters more than your ISA choice. If you go Goldilocks, Hazard3 is 50 % slower. If you go BabyBear-Karatsuba, it’s 4 % slower. Hazard3 is smaller silicon, fewer gates, lower power, cheaper to license. Picking it over M33 got a lot more defensible this phase.
BabyBear × Quartic Karatsuba on Hazard3 landed at 0.053 % std-dev / mean, the tightest timing profile measured on RP2350 in zkmcu to date. Comparison:
| Config | Std-dev |
|---|---|
| BumpAlloc (3.2.y, Goldilocks) | 0.076 % |
| TlsfHeap (3.2.z, Goldilocks, the “production” pick) | 0.081 % |
| BabyBear × Quartic Karatsuba on Hazard3 (3.3) | 0.053 % |
Probably because BabyBear Mont-reduce is branch-free on u32-native Hazard3 (one MUL + one conditional subtract) and Karatsuba has fewer nested conditional branches than the schoolbook. Fewer mispredict opportunities on a minimal predictor.
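For intuition, here's a hedged sketch of a branch-free Montgomery reduction for the BabyBear prime, with a mask replacing the final conditional-subtract branch (constants are the standard Montgomery setup for this prime; an illustration, not winter-math's exact routine):

```rust
// Hedged sketch: Montgomery reduction for the BabyBear prime with the
// final conditional subtract done via a mask instead of a branch.
// Not winter-math's exact code.

const P: u32 = 2013265921;         // BabyBear: 2^31 - 2^27 + 1
const P_NEG_INV: u32 = 2013265919; // -P^{-1} mod 2^32

/// Reduce t < P * 2^32 to (t * 2^-32) mod P.
fn mont_reduce(t: u64) -> u32 {
    let m = (t as u32).wrapping_mul(P_NEG_INV);       // low 32x32 multiply
    let u = ((t + m as u64 * P as u64) >> 32) as u32; // one widening multiply; u < 2P
    // Branch-free conditional subtract: mask is all-ones iff u < P.
    let (d, borrow) = u.overflowing_sub(P);
    let mask = (borrow as u32).wrapping_neg();
    d.wrapping_add(P & mask)
}

fn main() {
    // Montgomery identity: reducing x * 2^32 recovers x for any x < P.
    for x in [0u32, 1, 12345, P - 1] {
        assert_eq!(mont_reduce((x as u64) << 32), x);
    }
    // General check: result * 2^32 is congruent to t (mod P).
    for t in [1u64, 123_456_789, (P as u64) << 31] {
        let r = mont_reduce(t) as u128;
        assert_eq!((r << 32) % P as u128, t as u128 % P as u128);
    }
    println!("ok");
}
```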
Phase 3.2.z said TlsfHeap + Goldilocks × Quadratic for 95-bit production STARK verify on RP2350. That’s still the recommendation. BabyBear-Karatsuba becomes the pick in two narrower shapes:
**Latency-first on either ISA.** Goldilocks × Quadratic, TLSF. 74.65 ms on M33, 112.40 ms on Hazard3. See STARK verify on MCU.

**Side-channel-sensitive on Hazard3.** BabyBear × Quartic Karatsuba, TLSF. 129 ms median, 0.053 % variance. Tighter timing profile than any Goldilocks config, at the cost of ~15 % on the median.

**Cross-ISA portable SDK.** Either works, but BabyBear-Karatsuba closes the M33-vs-RV32 gap to 1.04×. If your SDK targets both cores and you want the same binary to hit similar latency on both, this is the field to ship.

**Post-quantum angle.** Unchanged from Goldilocks. Both fields underpin hash-based conjectured-post-quantum STARK soundness. Field choice doesn’t affect the crypto assumption, only the performance envelope.
```sh
# Generate the BabyBear proof (host-side winterfell prover)
cargo run -p zkmcu-host-gen --release -- stark-babybear

# Build the BabyBear firmware (cargo feature on the stark bench crate)
just build-m33-stark-bb
just build-rv32-stark-bb

# Flash + measure
scp target/thumbv8m.main-none-eabihf/release/bench-rp2350-m33-stark \
    <pi-host>:/tmp/bench-m33-stark-bb.elf

# Pi 5, Pico in BOOTSEL:
picotool load -v -x -t elf /tmp/bench-m33-stark-bb.elf
cat /dev/ttyACM0
```

Raw runs under benchmarks/runs/2026-04-24-{m33,rv32}-stark-fib-1024-babybear-q{,-kara}. Full typst report at research/reports/2026-04-24-babybear-quartic-cross-isa.typ.
Three open tracks for phase 3.4 or later, in rough order of payoff:
- mont_reduce on M33. Same playbook that took phase-2 Groth16 from 988 ms to 641 ms. But Karatsuba already showed extension-mul isn’t the M33 bottleneck, so the expected payoff is small.

The cross-ISA and variance results hold regardless of which thread gets picked up.