STARK prover on a microcontroller

winterfell 0.13 (forked) · on-device prove + self-verify · 148 ms at 95-bit security · N=256 SRAM ceiling confirmed · Blake3 is the ISA gap floor

Threshold circuit: first real predicate

Yeah so Fibonacci is the hello world of ZK. Fine for proving the prover fits on the chip, not exactly something anyone deploys.

The threshold circuit is the first non-trivial predicate we’ve run. The claim: value=37 < threshold=100. Circuit bit-decomposes diff = threshold - value - 1 over 32 rows, then asserts remaining[32] = 0 as a boundary condition. If that holds there was no field underflow, so diff ≥ 0, so value < threshold. The device cannot produce a valid proof for a false claim without either breaking STARK soundness or forging the Poseidon binding (see the security split below).
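Here's roughly what the decomposition looks like, as a host-side sketch. It uses wrapping u64 arithmetic as a stand-in for field subtraction, and the names (build_trace, claim_holds, the remaining/bit column split) are made up for illustration; the actual AIR in the firmware isn't shown in this post.

```rust
/// Number of bit-decomposition steps (rows 0..=32 of the trace are used).
const BITS: usize = 32;

/// Build the two hypothetical trace columns for the claim `value < threshold`.
/// `remaining[0]` is diff = threshold - value - 1; each step strips the low bit.
fn build_trace(value: u64, threshold: u64) -> (Vec<u64>, Vec<u64>) {
    // Wrapping subtraction stands in for field subtraction: a false claim
    // underflows to a huge value instead of a small non-negative one.
    let diff = threshold.wrapping_sub(value).wrapping_sub(1);
    let mut remaining = Vec::with_capacity(BITS + 1);
    let mut bits = Vec::with_capacity(BITS);
    let mut r = diff;
    remaining.push(r);
    for _ in 0..BITS {
        let b = r & 1;
        bits.push(b);
        r = (r - b) / 2;
        remaining.push(r);
    }
    (remaining, bits)
}

/// Check what the constraints would enforce: every bit is 0 or 1, every row
/// satisfies remaining[i] = 2 * remaining[i + 1] + bit[i], and the boundary
/// condition remaining[32] = 0 holds.
fn claim_holds(value: u64, threshold: u64) -> bool {
    let (remaining, bits) = build_trace(value, threshold);
    let transitions_ok = (0..BITS).all(|i| {
        let is_bit = bits[i] * bits[i].wrapping_sub(1) == 0;
        let step_ok = remaining[i] == 2 * remaining[i + 1] + bits[i];
        is_bit && step_ok
    });
    transitions_ok && remaining[BITS] == 0
}
```

claim_holds(37, 100) passes. claim_holds(100, 100) doesn't: the subtraction wraps, and the high bits survive 32 halvings, so the boundary check fails.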

	Cortex-M33
prove	49 ms
verify	50 ms
heap peak	78 KB
proof size	11.9 KB
STARK soundness (conjectured)	123 bit
Poseidon binding (conjectured)	~64 bit
trace	2 columns × 64 rows
heap after	10 bytes

64 FRI queries at blowup=4. Bumping from 11 to 64 queries added 1.2% prove overhead and 14 KB heap while jumping from 21-bit to 123-bit security. Prove time is basically query-invariant because the work is in LDE + FRI folding, not query opening.
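A back-of-envelope sketch of why the query count moves the security number so much: each FRI query contributes about log2(blowup) bits. This is a simplified upper bound, not winterfell's estimator; the function name is hypothetical, and the real calculation also folds in field size and hash collision resistance, which is presumably why the reported 21 and 123 bits sit a little under what this gives.

```rust
/// Query-only upper bound on conjectured FRI security, in bits.
/// Each query contributes log2(blowup) bits; other caps are ignored here.
fn query_security_bits(num_queries: u32, blowup: u32) -> u32 {
    assert!(blowup >= 2 && blowup.is_power_of_two());
    num_queries * blowup.trailing_zeros()
}
```

At blowup=4 this gives 22 bits for 11 queries and 128 bits for 64 queries.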

Compare that against Fibonacci N=256 (same field, 11 queries, 95-bit security):

	Fibonacci N=256	Threshold N=64	delta
prove	148 ms	49 ms	3.0× faster
verify	29 ms	50 ms	1.7× slower
heap	248 KB	78 KB	3.2× less

Heap peaks read from 2026-04-26-m33-stark-prover-bb/result.toml (heap_peak_bytes = 253_804) and 2026-04-27-m33-stark-threshold-q64/result.toml (heap_peak = 78_144).

Verify is actually slower for threshold even though the trace is 4× shorter. The reason is query count: with 64 queries instead of 11, the verifier checks 64 sets of Merkle paths instead of 11. At 11 queries the fixed FRI cost dominates; at 64 queries the per-query work shows up. Prove time and heap scale with trace length, verify time scales with query count.

So yeah, the prover scales nicely with circuit size and the verifier is query-bound. If you want a faster verifier you need fewer queries, which at a fixed security level means a larger blowup factor. A shorter trace won't do it.
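To put a rough number on "query-bound", here's a sketch that counts Merkle path hashes for a single commitment. It's an upper bound under assumptions: no path deduplication (batch openings share nodes), FRI layer commitments ignored, and the function name is hypothetical. The Fibonacci blowup factor isn't stated in this post, so the 4 used for it in the note below is a placeholder.

```rust
/// Upper bound on hash evaluations needed to check `num_queries` Merkle
/// paths against one commitment over the LDE domain (trace_len * blowup leaves).
fn merkle_path_hashes(num_queries: u32, trace_len: u32, blowup: u32) -> u32 {
    let lde_domain = trace_len * blowup;
    assert!(lde_domain.is_power_of_two());
    // Tree depth is log2 of the number of leaves.
    num_queries * lde_domain.trailing_zeros()
}
```

Threshold (64 queries, 64 rows, blowup 4) comes out at 512 hashes per commitment. Fibonacci with 11 queries, 256 rows and the placeholder blowup of 4 comes out at 110. The longer trace only adds two levels of depth; the query count is what multiplies.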

Yeah so phases 1-3 were all about verification: take a proof someone else generated and check that verifying it fits on the chip. Phase 4 asks a different question: can the RP2350 generate the proof too? I figured it probably couldn’t at production security and I was basically wrong.

148 ms to prove on Cortex-M33 at 95-bit conjectured security. 211 ms on Hazard3 RV32. That’s +10 % over the 32-bit-security baseline for three times the security bits. And the heap actually shrinks because BabyBear elements are 4 bytes not 8. Not what I expected going in.

Phase 4: the feasibility run

Phase 4 is deliberately weak on security. Goldilocks with no extension gives ~32-bit conjectured security, which is not production-grade, but it answers the “does it even fit” question first. Then Phase 5 tightens it.

	Cortex-M33	Hazard3 RV32
Prove time (N=256, median)	134 ms	208 ms
Verify time (self-check)	19 ms	25 ms
Heap peak	299 KB	299 KB
Proof size	6 668 B	6 668 B
Security (conjectured)	~32 bit	~32 bit
ISA gap (prove)	—	1.55×

Each firmware proves and then immediately self-verifies on the same device. If verify returns anything other than ok, it halts. So the proof sizes and claim correctness are checked on actual hardware, not assumed from the host prover.

Heap peak of 299 KB means running with a 384 KB arena. The remaining ~128 KB headroom covers .bss, stack, and USB. N=512 was tried immediately after. It OOM’d before the first prove call returned.
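For anyone wanting to reproduce a heap-peak number like this, one common approach is to wrap the allocator and track a high-water mark. The sketch below is a host-side version using std's System allocator; it is not the firmware's code. On the RP2350 the inner allocator would be whatever no_std heap the firmware uses, which this post doesn't name, and PeakAlloc / HEAP are hypothetical names.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and records the high-water mark of live bytes.
pub struct PeakAlloc {
    current: AtomicUsize,
    peak: AtomicUsize,
}

impl PeakAlloc {
    pub const fn new() -> Self {
        Self {
            current: AtomicUsize::new(0),
            peak: AtomicUsize::new(0),
        }
    }

    /// Bytes currently allocated (the "heap after" style of number).
    pub fn current(&self) -> usize {
        self.current.load(Ordering::Relaxed)
    }

    /// Largest value `current` has reached so far (the "heap peak").
    pub fn peak(&self) -> usize {
        self.peak.load(Ordering::Relaxed)
    }
}

unsafe impl GlobalAlloc for PeakAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let live = self.current.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            self.peak.fetch_max(live, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        self.current.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static HEAP: PeakAlloc = PeakAlloc::new();
```

Read HEAP.peak() after the prove call returns and HEAP.current() after everything is dropped. Those two readings correspond to the heap peak and heap after rows in the tables.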

Phase 5: BabyBear + Quartic, production security

Switches base field and extension: Goldilocks 64-bit → BabyBear 31-bit, FieldExtension::None → FieldExtension::Quartic. Conjectured security goes from ~32 bits to ~95 bits.

	Phase 4 (Goldilocks+None)	Phase 5 (BabyBear+Quartic)	Delta
Security (conjectured)	~32 bit	~95 bit	+63 bit
Prove time (M33)	134 ms	148 ms	+10 %
Verify time (M33)	19 ms	29 ms	+49 %
Heap peak	299 KB	248 KB	-17.5 %
Proof size	6 668 B	6 872 B	+3 %

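+10 % prove overhead for three times the security bits. That is a really good tradeoff tbh. The heap shrinking is a nice bonus: BabyBear base-field elements are 4 bytes vs 8 bytes for Goldilocks, so the base-field LDE matrix is half the size in memory. The total drops by less than half, presumably because Quartic extension elements are 16 bytes each and eat part of the saving.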
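The element-size effect on the LDE matrix is just multiplication, sketched here. The column count and blowup in the note below are placeholders (the Fibonacci trace layout isn't given in this post), and u32/u64 stand in for BabyBear and Goldilocks base-field elements.

```rust
use std::mem::size_of;

/// Bytes needed to hold an LDE matrix of `columns` columns, each extended
/// from `trace_len` rows by `blowup`, with elements of type `E`.
fn lde_bytes<E>(columns: usize, trace_len: usize, blowup: usize) -> usize {
    columns * trace_len * blowup * size_of::<E>()
}
```

With the placeholder shape of 2 columns, N=256 and blowup 8, lde_bytes::<u64> is 32 768 bytes and lde_bytes::<u32> is 16 384 bytes. Whatever the real shape is, the ratio stays 2×.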

Verify costs more (+49 %) because the Quartic extension forces a bigger composition polynomial (4 columns instead of 1). But 29 ms is still fine.

The cross-ISA gap hypothesis that didn’t hold

Going into Phase 5 the hypothesis was: the M33-vs-RV32 gap collapses from 1.55× to around 1.07×. Reasoning: BabyBear multiplication is a single 32×32→64 MUL on both ISAs. No UMAAL advantage for M33. Symmetric operation, should be symmetric speed.
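For reference, the multiply the hypothesis is about, as a naive sketch. It only shows the operand widths: BabyBear needs a 32×32→64 product, Goldilocks a 64×64→128 one. Real implementations typically use Montgomery or a special-form reduction rather than %, so don't read this as the instruction sequence winterfell actually emits. Function names are hypothetical.

```rust
/// BabyBear modulus: 2^31 - 2^27 + 1.
const BABYBEAR_P: u32 = 2_013_265_921;
/// Goldilocks modulus: 2^64 - 2^32 + 1.
const GOLDILOCKS_P: u64 = 18_446_744_069_414_584_321;

/// BabyBear multiply: one 32x32 -> 64-bit product, then reduce.
fn babybear_mul(a: u32, b: u32) -> u32 {
    ((a as u64 * b as u64) % BABYBEAR_P as u64) as u32
}

/// Goldilocks multiply: needs a 64x64 -> 128-bit product, which a 32-bit
/// core has to assemble from several 32-bit multiplies.
fn goldilocks_mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % GOLDILOCKS_P as u128) as u64
}
```

Quick sanity check: (p - 1) is -1 in either field, so squaring it gives 1.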

Didn’t happen.

	Cortex-M33	Hazard3 RV32	Ratio (RV32/M33)
Prove (Phase 4, Goldilocks)	134 ms	208 ms	1.55×
Prove (Phase 5, BabyBear)	148 ms	211 ms	1.42×
Verify (Phase 5, BabyBear)	29 ms	40 ms	1.39×

The gap moved from 1.55× to 1.42×, not to 1.07×. But look at verify: 1.39×. Verify at this N is almost pure Blake3 Merkle path work, basically no field arithmetic at all. If verify is 1.39× slower on RV32, that’s the Blake3 floor. Prove is 1.42×. So field arithmetic adds 0.03× on top of the hash baseline. That’s it.

Blake3 runs ~1.4× faster on M33 than on Hazard3 regardless of field. Changing fields doesn’t help because the bottleneck was never the field to begin with. The Phase 3.3 verifier had a 1.55× gap because Goldilocks field arithmetic was adding a big layer on top. Remove that layer by switching to BabyBear and you land at the hash floor. The floor is 1.39× and isn’t going anywhere unless you attack the hash itself.

Breaking the N=256 ceiling

N=512 OOMs instantly. Two realistic paths forward:

Field-native hash

Poseidon or Rescue over BabyBear. A field hash operates directly on field elements, so FRI commitment is all field arithmetic, no byte-oriented compression. Merkle trees shrink, prove time drops, and the 1.4× ISA gap might finally narrow. This is what Plonky3 does. The existing zkmcu-poseidon-circuit crate is BN254 R1CS sizing-only (placeholder MDS, zeroed ARK, marked NOT cryptographically sound in its own docstring) and gets retired during the PQ-Semaphore audit. The field-native BabyBear Poseidon is fresh work in that milestone.

External PSRAM

The RP2350 supports QSPI-connected PSRAM. The Pico 2 W doesn’t have it, but a custom board with 8-16 MB PSRAM lifts the ceiling to N=4096 or beyond. No software changes, just memory. Useful if you need large traces and the QSPI latency hit is acceptable.

Zbb Blake3 on Hazard3

Hazard3 implements the RISC-V Zbb bitmanip extension — ror, rol, andn etc. Blake3’s compression function has 4 ror instructions per g-function call × 8 calls × 7 rounds = 224 rotations per block. Without Zbb each rotation is 3 instructions (srl + sll + or). With Zbb it’s 1.
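The rotation count above comes straight from the shape of the g function. Here's a sketch of it following the public BLAKE3 reference layout, plus the arithmetic. The helper names (rotations_per_block, rotate_instructions_per_block) are hypothetical, and the 3-vs-1 instruction cost is the figure from this post, not something measured by this code.

```rust
const ROTATIONS_PER_G: u32 = 4;
const G_CALLS_PER_ROUND: u32 = 8;
const ROUNDS: u32 = 7;

/// BLAKE3 quarter-round: four adds-with-xor, each followed by a rotate-right
/// (by 16, 12, 8 and 7 bits).
fn g(state: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize, mx: u32, my: u32) {
    state[a] = state[a].wrapping_add(state[b]).wrapping_add(mx);
    state[d] = (state[d] ^ state[a]).rotate_right(16);
    state[c] = state[c].wrapping_add(state[d]);
    state[b] = (state[b] ^ state[c]).rotate_right(12);
    state[a] = state[a].wrapping_add(state[b]).wrapping_add(my);
    state[d] = (state[d] ^ state[a]).rotate_right(8);
    state[c] = state[c].wrapping_add(state[d]);
    state[b] = (state[b] ^ state[c]).rotate_right(7);
}

/// Rotations executed per 64-byte block compression.
fn rotations_per_block() -> u32 {
    ROTATIONS_PER_G * G_CALLS_PER_ROUND * ROUNDS
}

/// Instructions spent on rotations per block: 1 each with Zbb,
/// 3 each (shift, shift, or) without.
fn rotate_instructions_per_block(has_zbb: bool) -> u32 {
    let per_rotation = if has_zbb { 1 } else { 3 };
    rotations_per_block() * per_rotation
}
```

That's 672 rotate-related instructions per block without Zbb versus 224 with it. Each rotate_right here is what the compiler can lower to a single Zbb rotate when the target feature is enabled.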

LLVM handles this automatically with the right target feature, no hand-written assembly needed. Just add -C target-feature=+zbb to the RV32 rustflags and recompile.

Rough estimate: ~20 % Blake3 speedup on RV32. Prove goes from 211 ms to ~170 ms, ISA gap from 1.42× to ~1.15×. Not zero because the hash floor is a real floor, but a meaningful improvement for basically free. This is Phase 6.

Reproducing

# Phase 5 M33 firmware (BabyBear + Quartic, ~95-bit security)
just build-m33-stark-prover-bb

# scp to Pi 5:
scp target/thumbv8m.main-none-eabihf/release/bench-rp2350-m33-stark-prover-bb \
    pid-admin@10.42.0.30:/tmp/bench-m33-stark-prover-bb.elf

# On the Pi 5, Pico in BOOTSEL:
picotool load -v -x -t elf /tmp/bench-m33-stark-prover-bb.elf
cat /dev/ttyACM0
# PROVE ok, VERIFY ok, cycle counts, heap peak printed over serial

# Phase 5 RV32 firmware — same flow, different binary
just build-rv32-stark-prover-bb

Raw data under benchmarks/runs/2026-04-26-{m33,rv32}-stark-prover-{fib,bb}. Typst reports at research/reports/2026-04-26-stark-prover-results.typ and -bb-results.typ.

What this does not claim

N=256 is not a large workload. Real zkVM traces are 2^16 to 2^22 steps. This is a Fibonacci toy. It proves that proving fits on the chip, not that production-scale proving fits.
~32-bit security (Phase 4) is not production security. It’s a comparison baseline. Do not ship Phase 4 config.
The prover is not constant-time. Winterfell’s internal code has known non-constant-time branches. For threat models where timing of the prove call leaks secrets, that’s a separate problem.