Great find, @Irvise!
We ran our model (steelman-r7, newest version) against it alongside
Claude Opus 4.6 for comparison. Results on the 19 SPARK challenges:
| Stage | Steelman R7 (14B) | Opus 4.6 |
|-------|-------------------|----------|
| Build | 11/19 (58%) | 16/19 (84%) |
| Prove | 3/19 (16%) | 9/19 (47%) |
| Test | 10/19 (53%) | 12/19 (63%) |
Steelman got 3 full proofs, including HumanEval_2_truncate_number, which Opus couldn't even build. But Opus dominated on proof engineering overall: loop invariants, ghost functions, the things GNATprove needs beyond compilation.
This exposed a real gap in our pipeline: we've been scoring SPARK on compilation only and never running GNATprove, so their eval tests exactly what ours doesn't. We're adding proof verification to our training loop now.
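For anyone curious what "adding proof verification" looks like concretely, it's roughly a two-stage gate: build with `gprbuild`, then run `gnatprove` and count discharged vs. undischarged checks. Here's a minimal sketch, not our actual pipeline code; the message patterns and the `run_gate` wiring are assumptions based on GNATprove's typical `--report=all` output:

```python
import subprocess
from dataclasses import dataclass

# Severities GNATprove attaches to checks it could NOT discharge
# (assumption: based on typical gnatprove message formats).
UNPROVED_SEVERITIES = ("low:", "medium:", "high:")

@dataclass
class ProofScore:
    proved: int
    unproved: int

    @property
    def fully_proved(self) -> bool:
        return self.unproved == 0 and self.proved > 0

def score_gnatprove_output(output: str) -> ProofScore:
    """Count discharged vs. undischarged checks in gnatprove messages."""
    proved = unproved = 0
    for line in output.splitlines():
        if ": info:" in line and "proved" in line:
            proved += 1
        elif any(f": {sev}" in line for sev in UNPROVED_SEVERITIES):
            unproved += 1
    return ProofScore(proved, unproved)

def run_gate(project_gpr: str) -> str:
    """Build, then prove; returns 'build-fail', 'proved', or 'unproved'.
    Hypothetical wiring -- assumes gprbuild/gnatprove are on PATH."""
    if subprocess.run(["gprbuild", "-P", project_gpr]).returncode != 0:
        return "build-fail"
    result = subprocess.run(
        ["gnatprove", "-P", project_gpr, "--report=all"],
        capture_output=True, text=True,
    )
    score = score_gnatprove_output(result.stdout)
    return "proved" if score.fully_proved else "unproved"
```

The point of scoring messages rather than just the exit code is that a model can get partial credit for discharging some verification conditions even when the unit as a whole isn't fully proved.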
Thanks for pointing us at this — it’s exactly the kind of independent benchmark we needed.