Findings and postmortems

8 published, more coming

Honest failure log

Every bug, every regression, every benchmark surprise. I write them down so when I make the same class of mistake six months from now I can grep the trail instead of wasting another day. The two parser-related ones (stark-unbounded-vec-alloc, stark-cross-field-panic) were caught and killed within 24 hours, both in the same week.

The postmortems

Each entry below links to the full Typst source on GitHub. PDFs compile via just docs.

2026-04-23, no-umaal-codegen

Cortex-M33 has the UMAAL instruction (unsigned multiply-accumulate-accumulate long), which is exactly the shape Montgomery reduction needs. LLVM did not emit it from straight Rust, so the original 988 ms baseline was leaving performance on the table. Hand-wrote the asm path, dropped to 641 ms; later firmware additions shifted code placement enough to cut verify time by another ~14% to 551 ms without touching the verify logic. Rule extracted: don’t trust the compiler to find dual-accumulator instructions, and don’t underestimate how much LTO + linker placement can move a hot path that lives in .ram_text.
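For readers who have not met UMAAL, here is a portable model of what the instruction computes and why it fits a multiply-accumulate row so well. This is a sketch for illustration only, not the hand-written asm path; the names umaal_model and mul_add_row are made up for this example.

```rust
/// Portable model of UMAAL: a * b + acc_lo + acc_hi, returned as
/// (low word, high word). The sum always fits in 64 bits because
/// (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1.
pub fn umaal_model(a: u32, b: u32, acc_lo: u32, acc_hi: u32) -> (u32, u32) {
    let t = (a as u64) * (b as u64) + (acc_lo as u64) + (acc_hi as u64);
    (t as u32, (t >> 32) as u32)
}

/// One row of an operand-scanning multiply: acc[i] += a * b[i] with carry
/// propagation, one UMAAL-shaped step per limb. Returns the carry-out word.
/// (Hypothetical helper; limb order is little-endian.)
pub fn mul_add_row(acc: &mut [u32], a: u32, b: &[u32]) -> u32 {
    let mut carry = 0u32;
    for (acc_i, &b_i) in acc.iter_mut().zip(b) {
        // Two accumulators per step: the existing limb and the running carry.
        let (lo, hi) = umaal_model(a, b_i, *acc_i, carry);
        *acc_i = lo;
        carry = hi;
    }
    carry
}
```

Each loop iteration is one multiply plus two additions with no overflow case to handle, which is the pattern the compiler failed to collapse into a single instruction.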

2026-04-23, opt-level-3-regression

opt-level = 3 was actually slower than opt-level = 2 for the BN254 verify on M33. Counter-intuitive. Tracked to aggressive inlining raising register-allocation pressure until values spilled to the stack. Rule extracted: always benchmark opt-level = 2 and opt-level = "s" alongside 3, and trust measured numbers over compiler-default lore.

2026-04-24, stark-cross-field-panic

Tried to verify a STARK proof generated over Goldilocks using a verifier expecting BabyBear, expecting a clean Err(WrongField). Instead got an arithmetic panic from inside winterfell’s deserializer, which is the wrong failure mode for adversarial input. Patched in the vendor/winterfell fork. Rule extracted: panics on bad input are bugs, full stop, no matter how unrealistic the input looks.
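The failure mode I wanted looks roughly like the sketch below: reject the mismatch with a typed error before any field arithmetic touches the bytes. Everything here is hypothetical, including the one-byte field tag, the FieldId and VerifyError types, and check_field itself; it is not the actual patch in the fork or winterfell’s wire format.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FieldId {
    Goldilocks,
    BabyBear,
}

#[derive(Debug, PartialEq, Eq)]
pub enum VerifyError {
    Truncated,
    UnknownField(u8),
    WrongField { expected: FieldId, found: FieldId },
}

/// Assumes (for illustration) that the first proof byte tags the field.
/// Returns the remaining bytes on success and never panics on any input.
pub fn check_field(proof: &[u8], expected: FieldId) -> Result<&[u8], VerifyError> {
    let (&tag, rest) = proof.split_first().ok_or(VerifyError::Truncated)?;
    let found = match tag {
        0 => FieldId::Goldilocks,
        1 => FieldId::BabyBear,
        other => return Err(VerifyError::UnknownField(other)),
    };
    if found != expected {
        return Err(VerifyError::WrongField { expected, found });
    }
    Ok(rest)
}
```

The point of the shape is that every branch on untrusted bytes ends in Ok or Err, with no indexing, unwrap, or unchecked arithmetic in between.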

2026-04-24, stark-unbounded-vec-alloc

parse_vk accepted untrusted num_ic into a Vec::with_capacity call, which for num_ic = u32::MAX triggered a 412 GB allocation request. SIGABRT on host, instant reset on the Pico. Classic DoS via untrusted-length parsing. Patched with an upfront buffer-length sanity check and a cap. Rule extracted: every with_capacity(n) where n came from untrusted bytes is a potential DoS vector, no exceptions.
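A minimal sketch of the cap-then-check pattern, under stated assumptions: a 4-byte little-endian length prefix, a 96-byte element (picked only because u32::MAX elements of that size comes to roughly the 412 GB above), and a cap of 64. The names parse_ic, MAX_IC, and ParseError are invented for this example and are not the real parse_vk.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    Truncated,
    TooManyElements(u32),
}

const MAX_IC: u32 = 64; // hypothetical cap, pick per circuit
const ELEM_BYTES: usize = 96; // assumed element size

pub fn parse_ic(input: &[u8]) -> Result<Vec<[u8; ELEM_BYTES]>, ParseError> {
    if input.len() < 4 {
        return Err(ParseError::Truncated);
    }
    let (len_bytes, body) = input.split_at(4);
    let num_ic = u32::from_le_bytes([len_bytes[0], len_bytes[1], len_bytes[2], len_bytes[3]]);

    // 1. Cap the untrusted count before it reaches any allocator call.
    if num_ic > MAX_IC {
        return Err(ParseError::TooManyElements(num_ic));
    }
    let n = num_ic as usize;

    // 2. The buffer must actually contain that many elements.
    //    n <= MAX_IC, so n * ELEM_BYTES cannot overflow.
    if body.len() < n * ELEM_BYTES {
        return Err(ParseError::Truncated);
    }

    // 3. Only now is with_capacity(n) bounded by something we trust.
    let mut out = Vec::with_capacity(n);
    for chunk in body.chunks_exact(ELEM_BYTES).take(n) {
        let mut elem = [0u8; ELEM_BYTES];
        elem.copy_from_slice(chunk);
        out.push(elem);
    }
    Ok(out)
}
```

Either check alone would stop the u32::MAX case; doing both means the allocation is bounded by the cap and by the bytes actually received.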

2026-04-24, karatsuba-isa-asymmetric

Karatsuba multiplication helped on RISC-V Hazard3 but actively hurt on Cortex-M33. Reason: M33 has UMAAL, which is faster than the extra add, subtract, and shift chain Karatsuba trades a multiplication for. Karatsuba assumes you don’t have a fast wide-multiply primitive; M33 has one. Rule extracted: cross-ISA optimizations are not free transfers, instruction-set asymmetries flip the answer.
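To make the trade concrete, here is a toy comparison at 16-bit limb size: schoolbook uses four half-width multiplies, Karatsuba uses three plus extra additions and subtractions. This is an illustration of the technique only, not the field multiplication from the benchmark; both function names are made up for the example.

```rust
/// Split a 32-bit value into (low, high) 16-bit halves, widened to u64.
fn halves(x: u32) -> (u64, u64) {
    ((x & 0xFFFF) as u64, (x >> 16) as u64)
}

/// Schoolbook: 4 half-width multiplies, 3 additions.
pub fn mul_schoolbook(a: u32, b: u32) -> u64 {
    let (a0, a1) = halves(a);
    let (b0, b1) = halves(b);
    a0 * b0 + ((a0 * b1 + a1 * b0) << 16) + ((a1 * b1) << 32)
}

/// Karatsuba: 3 multiplies, but 2 extra additions and 2 subtractions
/// to recover the cross term a0*b1 + a1*b0.
pub fn mul_karatsuba(a: u32, b: u32) -> u64 {
    let (a0, a1) = halves(a);
    let (b0, b1) = halves(b);
    let z0 = a0 * b0;
    let z2 = a1 * b1;
    // (a0 + a1) * (b0 + b1) is at most 34 bits, so it fits in u64.
    let z1 = (a0 + a1) * (b0 + b1) - z0 - z2;
    z0 + (z1 << 16) + (z2 << 32)
}
```

Whether the second version wins depends entirely on what a multiply costs relative to the add/sub chain, which is where a fused multiply-accumulate instruction changes the arithmetic.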

2026-04-24, bench-core-babybear-speedup

Switching the benchmark harness from Goldilocks to BabyBear made measurements appear ~2× faster. Looked too good. Tracked to the harness re-using a stale Goldilocks state vector while running BabyBear arithmetic, which essentially short-circuited part of the work. Rule extracted: when a benchmark gets surprisingly faster, suspect the harness before suspecting the optimization.

2026-04-24, babybear-quartic-regresses

Hypothesis: BabyBear × Quartic would be faster than Goldilocks × Quadratic at the same security target. Reality: it was 66% slower on M33. The 31-bit field saved on memory but field arithmetic costs dominated, and the quartic extension multiplied that cost by 4 instead of 2. Negative result published as a real finding because most projects would have buried it. Rule extracted: smaller field doesn’t always win, the extension degree multiplier matters.

2026-05-03, rtc-verify-closing-m5

First on-silicon ct-reject sweep of the dual-hash CT verifier surfaced a 9.46x wall-clock speedup on Mutation::M5_public_byte: 168 ms reject vs 1593 ms honest, while M0–M4 sat within 0.06% of honest. A single bit flip in the public input desynced the Fiat-Shamir transcript and tripped an early-return inside Plonky3’s FRI commit-phase Merkle check, before any of the real verification work ran. Boolean-parity host tests passed cleanly the entire time — only the on-silicon timing harness caught it. Closed by vendoring Plonky3 and adding parallel verify_run_to_completion entry points across p3-uni-stark, p3-fri, and p3-commit that accumulate data-path failures into a single status flag instead of ?-propagating the first one. Re-run on M33: M5 lands at 0.99977 of honest, statistically tied with M0–M4. Rule extracted: any code path claiming to be CT needs a timing assertion, not just a boolean one.
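The difference between the two control-flow shapes, reduced to a toy: one verifier bails at the first bad digest, the other folds every failure into a status byte and only looks at it once all steps have run. This is a sketch of the pattern, not the vendored Plonky3 code; Step, digest_diff, and both verify functions are hypothetical stand-ins.

```rust
/// (expected digest, recomputed digest) for one verification step.
pub type Step = ([u8; 32], [u8; 32]);

/// OR together the XOR of every byte pair; 0 means the digests match.
/// The source has no data-dependent branch, but the compiler may still
/// transform it, which is why the timing has to be measured on target.
fn digest_diff(a: &[u8; 32], b: &[u8; 32]) -> u8 {
    a.iter().zip(b.iter()).fold(0u8, |acc, (x, y)| acc | (x ^ y))
}

/// Early-return shape: work done depends on where the first failure is.
pub fn verify_early_return(steps: &[Step]) -> Result<(), usize> {
    for (i, (expected, got)) in steps.iter().enumerate() {
        if digest_diff(expected, got) != 0 {
            return Err(i);
        }
    }
    Ok(())
}

/// Run-to-completion shape: every step executes, failures accumulate
/// into one flag that is inspected only at the end.
pub fn verify_accumulate(steps: &[Step]) -> bool {
    let mut status = 0u8;
    for (expected, got) in steps {
        status |= digest_diff(expected, got);
    }
    status == 0
}
```

Both functions agree on accept versus reject for every input, which is exactly why a boolean-parity test cannot tell them apart and a timing assertion can.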

Coming up

More postmortems landing as PQ-Semaphore work runs through May-June 2026. The Poseidon2 audit already turned up one Plonky3 deviation from the paper; that one lives on the audit page rather than here, since it’s a deliberate design choice and not a bug.