Steelman — a 14B model fine-tuned for Ada 2022 code generation (runs locally)

Hey all — I wanted to share a project I’ve been working on that I think this community might find useful.

Steelman is a QLoRA fine-tune of Qwen2.5-Coder-14B-Instruct, trained specifically on compiler-verified Ada 2022 code. It runs locally (Ollama, llama.cpp, etc.) and doesn’t need a cloud API.

I built it because every frontier model I tried was genuinely bad at Ada. Claude, GPT, Gemini — they all produce code that looks plausible but won’t compile. So I started generating Ada training pairs, verifying them with GNAT (-gnat2022 -gnatwa), and training only on code that actually passes the compiler.

Results on a custom compilation benchmark (923 prompts):

  • Steelman R5: 68.6%

  • Claude Opus 4.6: 40.3%

  • Qwen base (untuned): 35.0%

  • Claude Sonnet 4.6: 27.5%

It also scores 47.1% pass@1 on HumanEval-Ada (MultiPL-E), up from 34.4% for the base model. As far as I can tell these are the first published Ada pass@1 results for any model.

I should be upfront — I’m not an Ada programmer by trade. I learned enough Ada to read and validate the generated code, and every training pair is compiler-verified, but I’m sure experienced Ada developers would spot style issues or patterns that the model picked up from its training data. Feedback from people who actually write Ada professionally would be incredibly valuable for improving the dataset.

The model, GGUF, and dataset are all on HuggingFace.

If anyone tries it and has feedback — especially on the quality of the Ada it produces — I’d love to hear it. The next training round is in progress and community input would directly improve the model.

10 Likes

Welcome @clanker-lover!

While I am not entirely in favour of AI, I love the concept of a fine-tuned, locally runnable system that is very good with Ada :slight_smile:

Since you say that you are new to Ada, here are some tips for training an AI. First, as you mentioned, code that compiles is good, and the flags that you used are great for that. But Ada being Ada, with its very strong focus on correctness, we have many more flags that can check for a lot of things:

  1. Enable runtime checks and force all programs to run (not just compile). This is useful because Ada checks a lot of things at runtime, so not everything that compiles is correct. Things that help here are: -fstack-check, -gnata, -gnateE, -gnateF, -gnateV, -gnatU, -gnatVa and optionally -gnaty for style checks (which should teach the AI to write nice-looking Ada). The full documentation is here: Alphabetical List of All Switches (GNAT User’s Guide for Native Platforms)
  2. Doing all the checking at runtime can be heavy and sometimes quite undesirable. For that reason we have SPARK! SPARK is the formally verifiable subset of Ada (well, nowadays it is almost all of Ada). So you may want to throw the AI’s code at SPARK and see if it finds issues. For that I recommend running gnatprove with --level=4.

I hope this helps and do not hesitate to ask questions here in the community! Happy hacking,
Fer

EDIT: also, you may want to train your AI against ACATS (the Ada Conformity Assessment Test Suite)

3 Likes

Thank you Fer, this is exactly the kind of feedback I was hoping to get by posting here.

The runtime checks suggestion is immediately actionable. Right now the pipeline only verifies compilation, so code that compiles but would crash at runtime is getting through. Adding -gnata, -gnatVa and the other flags you mentioned, plus actually executing the output, would catch a whole layer of issues the current approach misses. That’s going on the roadmap for the next dataset revision.
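For the curious, here is a minimal Python sketch of the stricter gate I have in mind. The `classify` helper, the regex, and the sample diagnostics are illustrative only; they assume GNAT’s `file:line:col: message` diagnostic shape and approximate its wording rather than reproducing it exactly.

```python
import re

# Hypothetical gate for the verification pipeline: parse GNAT-style
# diagnostics ("file:line:col: [warning: ]message") and reject any unit
# that produces errors OR warnings, mirroring the effect of -gnatwe.
DIAG = re.compile(r"^(?P<file>[\w./-]+):(?P<line>\d+):(?P<col>\d+):\s*(?P<msg>.*)$")

def classify(gnat_output: str) -> dict:
    errors, warnings = [], []
    for line in gnat_output.splitlines():
        m = DIAG.match(line.strip())
        if not m:
            continue  # progress lines, linker chatter, etc.
        msg = m.group("msg")
        (warnings if msg.startswith("warning:") else errors).append(msg)
    return {"errors": errors, "warnings": warnings,
            "clean": not errors and not warnings}

sample = (
    'hello.adb:4:12: warning: variable "X" is assigned but never read\n'
    'hello.adb:9:05: "Put_Lin" is undefined'
)
report = classify(sample)
print(report["clean"])  # False: one warning is enough to reject the pair
```

A pair would only enter the dataset when the compile step is clean *and* the compiled program runs without tripping the runtime checks.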

SPARK verification with --level=4 is the long-term goal. The whole reason I chose Ada for this project is because the toolchain can verify correctness without a human in the loop. The prover is the oracle. Not there yet but it’s planned.

I had not heard of ACATS before your suggestion. I’ll look into it both as a potential benchmark and as a source of high-quality training examples.

As I’ve said, I’m not an Ada developer. I do have great respect for the language from what I learned while initially choosing one for this goal: agentic coding loops with verification. I’m a self-taught engineer who builds with AI assistance. I learned enough Ada to validate compiler output, but my knowledge of idiomatic style and best practices is limited. If you or anyone here ever has time to look at sample outputs and flag issues, that would be enormously helpful for improving the dataset.

2 Likes

Excuse my naive view on AI; I am not really knowledgeable about how these models work.

Ada’s syntax is defined in BNF by the Ada Reference Manual. How come these models fail at the simple task of producing correct syntax?
From what I know, these AIs just “guess” the next most probable word. Why hasn’t anyone added some sort of check that validates the next word, or that only allows valid words in the first place, as defined by the BNF?

Also: I am not so sure how fruitful your endeavour is going to be. Good AI models need large sets of high-quality examples, but a lot of professional, high-quality code written by experienced developers is not open source. I’d argue that it’s even worse for Ada, since it is an industrial rather than an academic language.

Maybe you could train on more code from Advent of Code or Rosetta Code examples? But I don’t know if you are even allowed to do that, since most people retain copyright on their code. (I don’t want to start the AI ethics and copyright debate now. I’m just pointing out that everyone owns their code and may not want to be part of a training set.)

There are also some examples by AdaCore on how to use AI for Ada. See: https://www.youtube.com/watch?v=PkInULSymD4
But they don’t make their own AI models. See: https://youtu.be/tYAod_61ZuQ?si=wR0YQFmi8_NUaHOB&t=2916

One small addition: could you add some explanation of the AI keywords? I had to look up what a “QLoRA fine-tune of Qwen2.5-Coder-14B-Instruct” is and what these benchmark tests mean. :sweat_smile:

1 Like

I use Grok and the code has always been very good. Of course, a local model is always welcome.

1 Like

These are really good questions, let me try to address them.

I should also say that these posts reflect my philosophy around using AI. I practice what Andrej Karpathy recently coined “Agentic Engineering”: 99% of the time you’re not writing code directly, you’re orchestrating AI agents that do, while you act as oversight, architect, and final decision-maker. There’s an art, a science, and real expertise to it, the same way there’s expertise in managing a team of engineers even if you’re not typing every line yourself. My workflow is: I design the architecture and strategy, AI executes under my direction, and I validate everything. These responses are written the same way: human-in-the-loop AI generations, reviewed and edited by me.

On the BNF/syntax point: you’re right that Ada’s syntax is formally defined, and in theory you could constrain the model to only produce syntactically valid tokens. This is actually an active area of research called “constrained decoding” or “grammar-guided generation.” The problem is that syntactic validity is the easy part. Most of the failures I see aren’t syntax errors — they’re semantic errors: using a package that doesn’t exist, passing the wrong type to a procedure, misusing visibility rules, incorrect generic instantiation. The code looks like valid Ada and parses fine, but GNAT rejects it because of type mismatches or scoping issues that no BNF grammar can capture. Ada’s type system is strict enough that this is where most models fall down.

On the data quality concern: you’re absolutely right that this is the core challenge. Most high-quality Ada code is proprietary and behind closed doors. My training data is entirely synthetic — every pair is an instruction plus a compiler-verified completion, not scraped from existing codebases. So I’m not training on anyone’s proprietary code. The tradeoff is that synthetic data has its own biases and quality ceiling, which is exactly why feedback from experienced Ada developers like yourselves is so valuable. It’s the kind of signal I can’t generate synthetically.

On the copyright point: I appreciate you raising it. The dataset is entirely synthetic, generated by a base LLM and then filtered through the compiler. No code was copied from Advent of Code, Rosetta Code, or any open source projects. That said, the broader AI training ethics discussion is valid and I understand the concerns.

I hadn’t seen those AdaCore videos — thank you for sharing. I’ll watch them.

On the terminology, that’s a fair point and I should have been more accessible. In short: Qwen2.5-Coder-14B-Instruct is an open-weight coding model with 14 billion parameters made by Alibaba. QLoRA is a technique for fine-tuning a model using very little memory by working in 4-bit precision. The compilation benchmark measures how often the model’s output actually compiles on the first try with gnatmake. HumanEval-Ada is a standardized set of 157 programming problems translated into Ada, and pass@1 means the model gets the correct answer on its first attempt. I’ll add a terminology section to the post — good suggestion.
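For completeness, pass@k has a standard unbiased estimator (it comes from the original HumanEval paper); with a single greedy sample per problem it reduces to the plain pass rate:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of which passed."""
    if n - c < k:
        return 1.0  # fewer failures than draws: at least one pass guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(1, 1, 1))  # 1.0: the single greedy sample passed
print(pass_at_k(4, 2, 1))  # 0.5: for k=1 this is just c/n
```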

4 Likes

Interesting, I haven’t tested Grok against the benchmark yet. If it’s generating clean Ada for you that’s good to hear. Would be curious how it holds up on the compilation benchmark if I get around to adding it to the frontier comparison.

Update: v0.2 — 72% strict compilation with full runtime and style checks

Major update. Steelman R6 is released.

The headline: 72.0% compilation rate on a new 500-prompt eval using the strict flag set that came directly from this thread. Every output compiled with:

gnatmake -gnat2022 -gnatwa -gnata -gnateE -gnateF -gnateV -gnatU -gnatVa -gnatyabehiklprt -gnatwe -cargs -fstack-check

That’s warnings-as-errors, runtime assertions, validity checks, style enforcement — the works. Code has to be completely clean to pass.

The same eval was run against four frontier models for direct comparison:

Model                         Compile Rate
Steelman v0.2 (14B, local)    72.0%
Gemini 3.1 Pro                56.6%
Claude Opus 4.6               49.8%
GPT-5.4                       46.0%
Grok 4                        37.0%

Category results that may interest this community:

  • SPARK contracts: 95% — Pre/Post/Contract_Cases/Loop_Invariant generation is near-ceiling

  • Generics: 78% — generic packages, procedures, formal types

  • Tasking: 74% — tasks, protected objects, entries, rendezvous

  • Ada 2022: 75% — declare expressions, delta aggregates, @ symbol, iterated component associations

  • Error-fix: 85% — given broken code + GNAT error, generates the fix

  • Spec-to-body: 56% — given .ads, generates .adb (room to grow here)

@Irvise, your flags became the foundation of this eval. When I tested -gnatwe on the training data, 37% had warnings. That discovery led to a complete dataset rebuild where every training pair was recompiled with the full strict set. Only warning-free code made it into R6’s training data. The 27 runtime crashes caught by -gnata proved your point exactly: compilation ≠ correctness. Thank you.

The eval set (500 prompts, 8 categories) and all results are published alongside the model. If anyone wants to run it against their own tools or workflows, it’s available on HuggingFace.

ACATS integration and SPARK/GNATprove verification are still on the roadmap — those are bigger efforts that need their own sessions.

Model card with full methodology and per-category breakdown: the-clanker-lover/steelman-14b-ada · Hugging Face

6 Likes

Depending on what you mean, “guess” is not quite the right term. “Educated guess” is a much better one.

Even that didn’t impress me much until I started reading a recent article by Tanay Wakhare, an MIT student who claims to have used ChatGPT 5 Pro to prove a conjecture from number theory. I didn’t think these things were anywhere near capable of that sort of deductive reasoning.

Excuse my skepticism but every time someone talks about how great LLMs and AIs are able to “reason” I have to think about this XKCD.

I don’t know who Tanay Wakhare is; looking at his Google Scholar and LinkedIn, I couldn’t really tell (from the abstracts) which paper used AI. But Terence Tao, a far more reputable mathematician, has given multiple talks on the state of AI usage in math using Lean and how it is able to cross-reference math papers across many different fields. (There are many talks on this topic by him; I recently saw this one and found it quite good!)

1 Like

How much more effort would it be to fine-tune the latest Qwen3 3.5 27B?

1 Like

Not much. We started with 14B to get faster, cheaper training iterations while we built up the dataset.
Moving to larger models was always the next step. Benchmarking will take longer, but stay tuned to our repos; you’ll probably see it in a few days.

An Ada finetune of a LLM may be useful for some applications, but what you’ve got here just looks like it’s entirely built for gaming benchmark numbers rather than actually being useful, at least based on what’s presented.

You’re not evaluating the correctness of the responses, and you’re applying a large set of style flags with warnings treated as errors. Obviously, models not tuned for those style flags are not going to produce code that conforms to them, especially when you evaluate them without even telling them which flags you’re using. On top of this, setting the temperature to zero is well known to kneecap the thinking models you’re testing against, since they get stuck in loops and produce garbage output.

Even worse, your training data includes prompts that are in your eval set. For example: The training set contains “Draft an Ada generic procedure that swaps two values of any type” and the eval set contains “Write Ada program: generic swap elements”. Of course you’re going to benchmark well if you literally train the model on your evaluation set!

Lastly your evaluation methodology references files that aren’t public, so no one can even see how you got these numbers.

2 Likes

Thanks for the feedback. I’ll take each point directly.

Correctness testing: The 72% number is compilation-only — that’s accurate and stated in the methodology. But I also published HumanEval-Ada results via MultiPL-E, which executes test cases against the output. That’s functional correctness, not just compilation. Steelman gets 47.1% pass@1 vs the base model’s 34.4%. Those are the first published Ada-specific HumanEval results for any model. Both benchmarks are linked from the model card.

Style flags: The same flags are applied to every model identically. These aren’t obscure choices — they’re the flags @Irvise recommended in this thread, and they reflect standard production Ada practice. I didn’t tell Steelman what flags to expect during inference either. The model learned to write clean Ada because it was trained on clean Ada.

When I first applied -gnatwe to my own training data, 37% of it failed. The strict flags cost me 1,270 training pairs. They hurt my dataset before they ever touched a benchmark.

I also added gnatchop specifically after discovering that frontier models output multi-unit files in formats my eval didn’t handle. That fix flipped 88 results from FAIL to PASS on GPT-5.4. I went out of my way to make the eval fair to models that weren’t tuned for Ada.

Temperature zero: This is standard practice across every major code generation benchmark — HumanEval, MultiPL-E, MBPP, SWE-bench. That said, I’d be open to running a temperature sweep as a supplemental comparison. The primary results follow established methodology.

Train/eval overlap: The generic swap example is the “hello world” of Ada generics — it appears in every textbook, and the prompts are worded differently. But you’re right that this deserves a systematic overlap analysis rather than spot-checking. I’ll run one against the full training set and publish the results. If there’s meaningful contamination I’ll remove those prompts and re-eval.
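As a sketch of what that audit might look like, here is a token-level Jaccard check over the two prompts quoted above. Note that even this pair scores low on exact tokens (“swaps” vs “swap”), which is why the real audit will need stemming or embedding similarity rather than naive overlap:

```python
def tokens(prompt: str, stem: bool = False) -> set:
    words = prompt.lower().replace(":", " ").split()
    if stem:  # crude plural folding, illustration only
        words = [w[:-1] if w.endswith("s") else w for w in words]
    return set(words)

def jaccard(a: str, b: str, stem: bool = False) -> float:
    ta, tb = tokens(a, stem), tokens(b, stem)
    return len(ta & tb) / len(ta | tb)

train_prompt = "Draft an Ada generic procedure that swaps two values of any type"
eval_prompt = "Write Ada program: generic swap elements"
print(jaccard(train_prompt, eval_prompt))             # 0.125 on exact tokens
print(jaccard(train_prompt, eval_prompt, stem=True))  # 0.2 with plurals folded
```

The published audit would run every train/eval pair through something like this and report the full similarity distribution, not just flagged pairs.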

“Gaming benchmarks”: A 14B fine-tune outperforming generalist models on its specific domain is what fine-tuning is for. Steelman can’t write a poem or debug React. It writes Ada. The frontier models are general-purpose. That tradeoff is the entire thesis, and it’s stated explicitly in the model card. The methodology, dataset, eval prompts, and results are all published — anyone can reproduce or challenge the numbers.

What benchmark would you suggest? I’m genuinely asking. HumanEval-Ada via MultiPL-E is the only standardized Ada benchmark I’ve been able to find. ACATS exists but it’s a compiler conformance suite, not a code generation benchmark — adapting it is on my roadmap but it’s a major effort. If there’s an Ada code generation benchmark I’m missing, point me to it and I’ll run it. The reason I built a custom eval is because nothing else exists for this language.

On what’s actually being measured: The benchmark is single-pass generation with no retries — deliberately the hardest test. For context, I tested Claude Opus 4.6 inside Claude Code with full project context and compiler feedback loops, and it hit around 92% with multiple retries. Every model benefits from retries. The question the benchmark answers is where you start from before any of that help kicks in.
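To make the distinction concrete, here is a stub of such a feedback loop. The two callables are fake stand-ins, not my actual harness; the point is only that the retry budget, not the model, changes the score:

```python
# Generate, try to compile, feed diagnostics back, retry. With
# max_attempts=1 this degenerates to the single-pass benchmark setting.
def repair_loop(generate, compile_check, prompt, max_attempts=3):
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt, feedback)
        ok, diagnostics = compile_check(code)
        if ok:
            return code, attempt
        feedback = diagnostics  # next attempt sees the compiler's complaint
    return None, max_attempts

# Stub "model" that only fixes its typo once it sees the error message:
def fake_generate(prompt, feedback):
    return "Put_Line" if feedback else "Put_Lin"

def fake_compile(code):
    return (True, "") if code == "Put_Line" else (False, '"Put_Lin" is undefined')

print(repair_loop(fake_generate, fake_compile, "hello", max_attempts=1))  # (None, 1)
print(repair_loop(fake_generate, fake_compile, "hello"))  # ('Put_Line', 2)
```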

The eval prompts (eval_v3_500.json), methodology (eval_v3_README.md), and training data (steelman_sft_dataset.jsonl) are all in the dataset repo: the-clanker-lover/steelman-sft-ada · Datasets at Hugging Face. The model card links to it.

1 Like

But I also published HumanEval-Ada results via MultiPL-E, which executes test cases against the output. That’s functional correctness, not just compilation. Steelman gets 47.1% pass@1 vs the base model’s 34.4%.

The term “human” appears nowhere in the model card.

Steelman gets 47.1% pass@1 vs the base model’s 34.4%.

I would expect it to do better than the base model. What about the other models though?

The same flags are applied to every model identically. These aren’t obscure choices — they’re the flags @Irvise recommended in this thread, and they reflect standard production Ada practice.

Sure, but they’re certainly not the only valid choices or else they would be defaults. How are the other models meant to know to use that specific style?

I didn’t tell Steelman what flags to expect during inference either. The model learned to write clean Ada because it was trained on clean Ada.

That’s exactly the problem: you’ve trained it to emit code that matches the flags, which is implicitly telling it which flags to use. If I train a model to reply only in French and then compare it to other models that reply in the same language as the input, I cannot just say that my model is the better French speaker; I would have to tell the other models to reply in French for it to be a fair comparison.

Even without that, style (or French) will be picked up from the input context, except there’s no context here, just a one-line natural language prompt.

When I first applied -gnatwe to my own training data, 37% of it failed. The strict flags cost me 1,270 training pairs. They hurt my dataset before they ever touched a benchmark.

So you’ve removed all training data that doesn’t match the style flags? See my last point.

Temperature zero: This is standard practice across every major code generation benchmark — HumanEval, MultiPL-E, MBPP, SWE-bench.

This certainly isn’t true for SWE-bench these days and likely predates thinking models. Gemini specifically warns against using a temperature of 0 and I suspect others do the same.

The generic swap example is the “hello world” of Ada generics — it appears in every textbook, and the prompts are worded differently. But you’re right that this deserves a systematic overlap analysis rather than spot-checking. I’ll run one against the full training set and publish the results. If there’s meaningful contamination I’ll remove those prompts and re-eval.

It’s not just that one prompt. We also have “Build an Ada 2022 function using a declare expression to compute the hypotenuse of a right triangle.” and “Write an Ada program that uses a declare expression to compute and print the hypotenuse of a right triangle with sides 3.0 and 4.0.” among others.

Note that both here are related to declare expressions on top of asking for more-or-less the same functionality. The training set and eval set are very clearly not independent.

What benchmark would you suggest? I’m genuinely asking. HumanEval-Ada via MultiPL-E is the only standardized Ada benchmark I’ve been able to find. ACATS exists but it’s a compiler conformance suite, not a code generation benchmark — adapting it is on my roadmap but it’s a major effort. If there’s an Ada code generation benchmark I’m missing, point me to it and I’ll run it. The reason I built a custom eval is because nothing else exists for this language.

If you’re asking for specific features to be implemented, then frankly you need a real codebase, similar to what SWE-bench does. No one is ever going to ask an LLM to “Write multiple expression functions in Ada 2022 style” or “Write Ada 2022 program: for..of with index”. Even SWE-bench is flawed in that there’s no evaluation of whether the code is good or not, but it’s the best option you have that doesn’t involve massive amounts of work.

The eval prompts (eval_v3_500.json), methodology (eval_v3_README.md), and training data
(steelman_sft_dataset.jsonl) are all in the dataset repo

I read the readme, it references files which are not in the repo.

For what it’s worth, I’m not against LLMs and I have found them useful in some narrow circumstances for writing Ada code. I just think the evaluation here is flawed.

1 Like

I’ll go through each point.

HumanEval on the model card: The HumanEval-Ada results were on the v0.1 model card. When I updated the card for v0.2, they didn’t get carried forward — that’s an oversight on a document that’s been revised multiple times in a few days. The results are in the dataset repo (multipl_e_r6v3_report.md). I haven’t run frontier models against HumanEval-Ada yet because it’s 157 prompts through paid APIs and I’m funding this myself. If that’s something you’d find valuable I’ll prioritize it, but that’s additional work beyond what any other open Ada model has published — because there are no other open Ada models.

Style flags / French analogy: The analogy doesn’t hold. Writing warning-free, style-conformant Ada isn’t like choosing French over English — it’s like writing grammatically correct French. -gnatwe catches real bugs: uninitialized variables, unused assignments, unreachable code. The style flags (-gnatyabehiklprt) enforce standard Ada conventions — Irvise recommended them in this thread, and they reflect what production Ada shops use from my research. Every model was tested identically. None were told what flags to expect, including Steelman.

You’re right that training on clean Ada gives Steelman an advantage on clean-Ada benchmarks. That’s what fine-tuning is. The frontier models have the advantage of 1000x more training data, tool use, and multi-turn reasoning. A 14B model beating them on domain-specific style is the expected outcome of specialization, not evidence of gaming.

“So you’ve removed all training data that doesn’t match the style flags?” Yes. 37% of it. 1,270 pairs that compiled but had warnings. I threw away training data that would have inflated my dataset numbers to keep the quality bar high. That’s the opposite of gaming a benchmark.

Temperature zero: For code generation benchmarks — HumanEval, MultiPL-E, MBPP — temperature 0 is the standard methodology. SWE-bench is a software engineering benchmark with different evaluation mechanics, not a single-pass generation benchmark. These are different categories. I said I’m open to running a temperature sweep as a supplemental comparison, and I am, but the primary results follow established practice for this benchmark type.

Train/eval overlap: I acknowledged this in my first reply and committed to a systematic overlap audit. That hasn’t changed. I’ll publish the results and remove any contaminated prompts.

“You need a real codebase” / SWE-bench for Ada: I agree that would be better. It also doesn’t exist, and building it would be a major undertaking. I built the eval I could build with the resources I have. If you know of an Ada code generation benchmark I’m missing, I’ve asked twice now — genuinely point me to it.

Referenced files not in the repo: Fair catch. The build scripts (assemble_v3.py and build_v3.py) had local paths that needed cleaning before they could be uploaded. They’re cleaned and will be in the repo with the next update.

Thanks for taking the time to look through the repo though. For context, I started this project 4-5 days ago and released it publicly 3 days ago. It hasn’t even lived a week and is marked as v0.2 — it’s a WIP. This is the first model I’ve ever fine-tuned and the first time I’ve done a public release, so the rough edges are real and I’m fixing them as they get flagged.

1 Like

Just curious, what kind of hardware is required to (realistically) use the 14B model?

The GGUF is Q4_K_M quantized, so it needs about 9-10 GB of VRAM to run fully on GPU. A 12 GB card (like an RTX 3060/4070/5070) keeps it entirely in VRAM with room for the KV cache. If you have less, Ollama will split layers between GPU and system RAM; it still works, just slower in proportion to how much spills over.
On CPU only (running from RAM) it’s usable but noticeably slower.
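For a back-of-envelope check of those numbers (both figures below are my assumptions, not measurements: roughly 14.8B actual parameters for the “14B” model, and roughly 4.85 bits per weight on average for Q4_K_M):

```python
params = 14.8e9            # assumed parameter count of the "14B" model
bits_per_weight = 4.85     # assumed Q4_K_M average (a bit under 5 bpw)
weights_gb = params * bits_per_weight / 8 / 1e9
print(round(weights_gb, 1))  # ~9.0 GB for the weights alone; KV cache is extra
```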

2 Likes

Well, talking about having AIs specialised in proving and correct code generation… Mistral has just released Leanstral: Open-Source foundation for trustworthy vibe-coding | Mistral AI

It may be of interest to some.

Best regards,
Fer

2 Likes

I just became aware of this AdaCore open project to evaluate LLM performance for Ada/SPARK code: GitHub - AdaCore/ada-eval. It may be of interest to you, @clanker-lover

Best regards,
Fer

2 Likes