But I also published HumanEval-Ada results via MultiPL-E, which executes test cases against the output. That’s functional correctness, not just compilation. Steelman gets 47.1% pass@1 vs the base model’s 34.4%.
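For reference, pass@k in HumanEval-style benchmarks is the unbiased estimator from the original HumanEval paper; a minimal sketch, with made-up per-problem sample counts for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), where n completions
    were sampled for a problem and c of them passed all test cases.
    For k=1 this reduces to c/n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem (samples drawn, samples passing) counts:
results = [(20, 9), (20, 0), (20, 20), (20, 5)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.3f}")  # prints pass@1 = 0.425
```

The benchmark-level score is just the mean of the per-problem estimates.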
The term “human” appears nowhere in the model card.
I would expect it to do better than the base model. What about the other models though?
The same flags are applied to every model identically. These aren’t obscure choices — they’re the flags @Irvise recommended in this thread, and they reflect standard production Ada practice.
Sure, but they’re certainly not the only valid choices or else they would be defaults. How are the other models meant to know to use that specific style?
I didn’t tell Steelman what flags to expect during inference either. The model learned to write clean Ada because it was trained on clean Ada.
That’s exactly the problem: by training the model to emit code that matches the flags, you’re implicitly telling it which flags to use. If I train a model to only reply in French and then compare it against models that reply in the language of the input, I can’t just claim my model is the better French speaker; for a fair comparison I would have to tell the other models to reply in French too.
Even without that, style (or French) would normally be picked up from the input context, but there is no context here, just a one-line natural language prompt.
When I first applied -gnatwe to my own training data, 37% of it failed. The strict flags cost me 1,270 training pairs. They hurt my dataset before they ever touched a benchmark.
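The filtering step described here could be sketched roughly as below; `gnat_compiles_clean` stands in for the author's actual pipeline (the helper names and the exact GNAT invocation are my assumptions), and injecting the checker keeps the filter itself testable without a compiler installed:

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable

def gnat_compiles_clean(source: str) -> bool:
    """Assumed check: run GNAT in semantic-check-only mode (-gnatc)
    with warnings treated as errors (-gnatwe). Requires GNAT on PATH."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "unit.adb"
        src.write_text(source)
        proc = subprocess.run(
            ["gcc", "-c", "-gnatc", "-gnatwe", str(src)],
            cwd=tmp, capture_output=True,
        )
        return proc.returncode == 0

def filter_pairs(pairs, check: Callable[[str], bool]):
    """Keep only (prompt, ada_source) pairs whose code passes the check;
    return the survivors and the number dropped."""
    kept = [p for p in pairs if check(p[1])]
    return kept, len(pairs) - len(kept)
```

A 37% failure rate then just means `dropped / len(pairs)` came out at 0.37 on the original corpus.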
So you’ve removed all training data that doesn’t match the style flags? See my last point.
Temperature zero: This is standard practice across every major code generation benchmark — HumanEval, MultiPL-E, MBPP, SWE-bench.
This certainly isn’t true for SWE-bench these days, and the practice likely predates thinking models. Gemini specifically warns against using a temperature of 0, and I suspect others do the same.
The generic swap example is the “hello world” of Ada generics — it appears in every textbook, and the prompts are worded differently. But you’re right that this deserves a systematic overlap analysis rather than spot-checking. I’ll run one against the full training set and publish the results. If there’s meaningful contamination I’ll remove those prompts and re-eval.
It’s not just that one prompt. We also have “Build an Ada 2022 function using a declare expression to compute the hypotenuse of a right triangle.” and “Write an Ada program that uses a declare expression to compute and print the hypotenuse of a right triangle with sides 3.0 and 4.0.” among others.
Note that both here are related to declare expressions on top of asking for more-or-less the same functionality. The training set and eval set are very clearly not independent.
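A systematic overlap check like the one proposed above could start with something as simple as token-set similarity between each eval prompt and its nearest training prompt (the tokenization and any flagging threshold are arbitrary choices on my part, not an established contamination metric):

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word tokens; crude, but enough for a first pass."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def nearest_overlap(eval_prompt: str, train_prompts: list[str]):
    """Return (similarity, closest training prompt) for one eval prompt."""
    return max(
        ((jaccard(tokens(eval_prompt), tokens(t)), t) for t in train_prompts),
        key=lambda pair: pair[0],
    )

# The two hypotenuse prompts quoted in the thread:
train = ["Build an Ada 2022 function using a declare expression to "
         "compute the hypotenuse of a right triangle."]
evalp = ("Write an Ada program that uses a declare expression to compute "
         "and print the hypotenuse of a right triangle with sides 3.0 and 4.0.")
sim, closest = nearest_overlap(evalp, train)
print(f"{sim:.2f}")  # prints 0.44 -- high enough to flag the pair for review
```

Anything scoring well above the corpus baseline would be a candidate for removal before re-evaluating.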
What benchmark would you suggest? I’m genuinely asking. HumanEval-Ada via MultiPL-E is the only standardized Ada benchmark I’ve been able to find. ACATS exists but it’s a compiler conformance suite, not a code generation benchmark — adapting it is on my roadmap but it’s a major effort. If there’s an Ada code generation benchmark I’m missing, point me to it and I’ll run it. The reason I built a custom eval is because nothing else exists for this language.
If you’re asking for specific features to be implemented then frankly you need a real codebase, similar to what SWE-bench does. No one is ever going to ask an LLM to “Write multiple expression functions in Ada 2022 style” or “Write Ada 2022 program: for..of with index”. Even SWE-bench is flawed in that it doesn’t evaluate whether the code is any good, but it’s the best option available that doesn’t involve massive amounts of work.
The eval prompts (eval_v3_500.json), methodology (eval_v3_README.md), and training data (steelman_sft_dataset.jsonl) are all in the dataset repo.
I read the readme, it references files which are not in the repo.
For what it’s worth, I’m not against LLMs and I have found them useful in some narrow circumstances for writing Ada code. I just think the evaluation here is flawed.