Steelman — a 14B model fine-tuned for Ada 2022 code generation (runs locally)

Great find @Irvise

We ran our model (steelman-r7, newest version) against it alongside
Claude Opus 4.6 for comparison. Results on the 19 SPARK challenges:

| | Steelman R7 (14B) | Opus 4.6 |
|---|---|---|
| Build | 11/19 (58%) | 16/19 (84%) |
| Prove | 3/19 (16%) | 9/19 (47%) |
| Test | 10/19 (53%) | 12/19 (63%) |

Steelman got 3 full proofs including HumanEval_2_truncate_number, which Opus couldn’t build. But Opus dominated on proof engineering overall — loop invariants, ghost functions, the things GNATprove needs beyond compilation.

This exposed a real gap in our pipeline: we've been scoring SPARK on compilation only, never running GNATprove. Their eval tests what ours doesn't, so we're adding proof verification to our training loop now.
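For anyone wiring up something similar, here's a minimal sketch of what that extra stage can look like. This is not our actual harness: it assumes `gnatprove` is on PATH, that each challenge is a GPR project, and that counting `medium`/`high` messages is a good-enough proxy for "fully proved" (tune `--level` and timeouts for your own setup):

```python
import re
import subprocess
from pathlib import Path

# GNATprove reports each unproved check at severity "medium" or "high",
# e.g. "foo.adb:12:8: medium: overflow check might fail", while discharged
# checks show up as "info" lines under --report=all.
UNPROVED = re.compile(r":\s*(medium|high):")

def count_unproved(output: str) -> int:
    """Count unproved-check messages in GNATprove's textual output."""
    return sum(1 for line in output.splitlines() if UNPROVED.search(line))

def prove(project: Path, timeout_s: int = 300) -> bool:
    """Run GNATprove on a project; True only if everything proves.

    Hypothetical scoring helper -- flags and the pass/fail heuristic
    are placeholders, not a definitive harness.
    """
    result = subprocess.run(
        ["gnatprove", "-P", str(project), "--level=2", "--report=all"],
        capture_output=True, text=True, timeout=timeout_s,
    )
    # Nonzero exit means analysis itself failed (e.g. the code didn't build).
    if result.returncode != 0:
        return False
    return count_unproved(result.stdout) == 0
```

The nice property of scoring this way is that "builds but doesn't prove" and "fully proves" become separate signals, which is exactly the Build/Prove split in the table above.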

Thanks for pointing us at this — it’s exactly the kind of independent benchmark we needed.

Yes, or in other words: they can only exhibit reasoning, they aren't actually reasoning at all. What they get wrong is far more important than what they get right. You cannot have any confidence in them handling new situations, or failing well, the way you can with humans or well-specified algorithms.
