Great find, @Irvise!
We ran our model (steelman-r7, newest version) against it alongside
Claude Opus 4.6 for comparison. Results on the 19 SPARK challenges:
| Stage | Steelman R7 (14B) | Opus 4.6 |
|-------|-------------------|----------|
| Build | 11/19 (58%) | 16/19 (84%) |
| Prove | 3/19 (16%) | 9/19 (47%) |
| Test | 10/19 (53%) | 12/19 (63%) |
Steelman got 3 full proofs, including HumanEval_2_truncate_number, which Opus couldn't even build. But Opus dominated on proof engineering overall: loop invariants, ghost functions, the things GNATprove needs beyond compilation.
This exposed a real gap in our pipeline: we've been scoring SPARK on compilation only and never running GNATprove, so their eval tests exactly what ours doesn't. We're adding proof verification to our training loop now.
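For anyone curious what "adding proof verification" looks like concretely, it's roughly a two-stage gate: build with `gprbuild`, then run `gnatprove` and count discharged vs. undischarged checks. Here's a minimal sketch, not our actual pipeline code; the message patterns and the `run_gate` wiring are assumptions based on GNATprove's typical `--report=all` output:

```python
import subprocess
from dataclasses import dataclass

# Severities GNATprove attaches to checks it could NOT discharge
# (assumption: based on typical gnatprove message formats).
UNPROVED_SEVERITIES = ("low:", "medium:", "high:")

@dataclass
class ProofScore:
    proved: int
    unproved: int

    @property
    def fully_proved(self) -> bool:
        return self.unproved == 0 and self.proved > 0

def score_gnatprove_output(output: str) -> ProofScore:
    """Count discharged vs. undischarged checks in gnatprove messages."""
    proved = unproved = 0
    for line in output.splitlines():
        if ": info:" in line and "proved" in line:
            proved += 1
        elif any(f": {sev}" in line for sev in UNPROVED_SEVERITIES):
            unproved += 1
    return ProofScore(proved, unproved)

def run_gate(project_gpr: str) -> str:
    """Build, then prove; returns 'build-fail', 'proved', or 'unproved'.
    Hypothetical wiring -- assumes gprbuild/gnatprove are on PATH."""
    if subprocess.run(["gprbuild", "-P", project_gpr]).returncode != 0:
        return "build-fail"
    result = subprocess.run(
        ["gnatprove", "-P", project_gpr, "--report=all"],
        capture_output=True, text=True,
    )
    score = score_gnatprove_output(result.stdout)
    return "proved" if score.fully_proved else "unproved"
```

The point of scoring messages rather than just the exit code is that a model can get partial credit for discharging some verification conditions even when the unit as a whole isn't fully proved.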
Thanks for pointing us at this — it’s exactly the kind of independent benchmark we needed.