Steelman — a 14B model fine-tuned for Ada 2022 code generation (runs locally)

Hey all — I wanted to share a project I’ve been working on that I think this community might find useful.

Steelman is a QLoRA fine-tune of Qwen2.5-Coder-14B-Instruct, trained specifically on compiler-verified Ada 2022 code. It runs locally (Ollama, llama.cpp, etc.) and doesn’t need a cloud API.

I built it because every frontier model I tried was genuinely bad at Ada. Claude, GPT, Gemini — they all produce code that looks plausible but won’t compile. So I started generating Ada pairs, verifying them with GNAT (-gnat2022 -gnatwa), and training on only the code that actually passes the compiler.
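The compile-or-discard filter can be sketched roughly like this. This is a minimal Python sketch, not the actual pipeline: the `gnatmake` invocation, file layout, and function names are assumptions on my part.

```python
import os
import subprocess
import tempfile

def compiles(ada_source: str,
             gnat_cmd=("gnatmake", "-gnat2022", "-gnatwa", "-c")) -> bool:
    """Return True if the candidate Ada source compiles cleanly.

    Note: real use must respect GNAT's file-naming convention
    (the file name must match the compilation unit name).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.adb")
        with open(path, "w") as f:
            f.write(ada_source)
        result = subprocess.run([*gnat_cmd, path], cwd=tmp,
                                capture_output=True)
        return result.returncode == 0

# keep only (instruction, completion) pairs whose completion passes GNAT:
# verified = [(p, c) for (p, c) in pairs if compiles(c)]
```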

Results on a custom compilation benchmark (923 prompts):

  • Steelman R5: 68.6%

  • Claude Opus 4.6: 40.3%

  • Qwen base (untuned): 35.0%

  • Claude Sonnet 4.6: 27.5%

It also scores 47.1% pass@1 on HumanEval-Ada (MultiPL-E), up from 34.4% for the base model. As far as I can tell, these are the first published Ada pass@1 results for any model.

I should be upfront — I’m not an Ada programmer by trade. I learned enough Ada to read and validate the generated code, and every training pair is compiler-verified, but I’m sure experienced Ada developers would spot style issues or patterns that the model picked up from its training data. Feedback from people who actually write Ada professionally would be incredibly valuable for improving the dataset.

The model, GGUF, and dataset are all on HuggingFace:

If anyone tries it and has feedback — especially on the quality of the Ada it produces — I’d love to hear it. The next training round is in progress and community input would directly improve the model.

6 Likes

Welcome @clanker-lover!

While I am not entirely in favour of AI, I love the concept of having a fine-tuned, locally runnable system that is very good with Ada :slight_smile:

Since you say that you are new to Ada, here are some tips if you are training an AI. First, as you mentioned, code that compiles is good, and the flags that you used are great for that. But Ada being Ada and having a very strong focus on correctness, we have so many more flags that can be used to check for a lot of things. So here are a few tips:

  1. Enable runtime checks and force all programs to run (not just compile). This is useful because Ada checks a lot of things at runtime, so not everything that compiles is correct. Flags that help here: -fstack-check, -gnata, -gnateE, -gnateF, -gnateV, -gnatU, -gnatVa, and optionally -gnaty for style checks (which should teach the AI to produce nice-looking Ada). The full documentation is here: Alphabetical List of All Switches (GNAT User’s Guide for Native Platforms)
  2. Doing all the checking at runtime can be heavy and sometimes quite undesirable. For that reason we have SPARK! SPARK is the formally verifiable subset of Ada (well, nowadays it is almost all of Ada). So you may want to throw the AI’s code against SPARK and see if it finds issues. For that I recommend that you run with --level=4 checking.

I hope this helps and do not hesitate to ask questions here in the community! Happy hacking,
Fer

EDIT: also, you may want to train your AI against the ACATS (Ada Conformity Assessment Test Suite)

2 Likes

Thank you Fer, this is exactly the kind of feedback I was hoping to get by posting here.

The runtime checks suggestion is immediately actionable. Right now the pipeline only verifies compilation, so code that compiles but would crash at runtime is getting through. Adding -gnata, -gnatVa and the other flags you mentioned, plus actually executing the output, would catch a whole layer of issues the current approach misses. That’s going on the roadmap for the next dataset revision.
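Concretely, the run step I have in mind looks something like this (a sketch with placeholder command names, not the real harness): after compiling with assertions and validity checks enabled, execute the binary and treat a nonzero exit or a hang as a failed pair, since a tripped runtime check surfaces as an unhandled exception.

```python
import subprocess

def runs_cleanly(cmd, timeout_s=5.0) -> bool:
    """Execute a compiled test program; treat nonzero exit or a hang as failure.

    With -gnata and validity checking enabled, a failed runtime check raises
    an unhandled exception, which shows up as a nonzero exit code.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # infinite loop or deadlock: reject the pair
    return result.returncode == 0

# real call would be something like: runs_cleanly(["./candidate"])
```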

SPARK verification with --level=4 is the long-term goal. The whole reason I chose Ada for this project is because the toolchain can verify correctness without a human in the loop. The prover is the oracle. Not there yet but it’s planned.

I had not heard of ACATS before your suggestion. I’ll look into it both as a potential benchmark and as a source of high-quality training examples.

As I’ve said, I’m not an Ada developer, though I gained great respect for the language while initially choosing one for this project’s goal: agentic coding loops with verification. I’m a self-taught engineer who builds with AI assistance. I learned enough Ada to validate compiler output, but my knowledge of idiomatic style and best practices is limited. If you or anyone here ever has time to look at sample outputs and flag issues, that would be enormously helpful for improving the dataset.

2 Likes

Excuse my naive view on AI, I am not really knowledgeable about how they work.

Ada’s syntax is defined in BNF by the Ada Reference Manual. How come these models fail at the simple task of producing correct syntax?
From what I know, these AIs just “guess” the next most probable word. Why hasn’t anyone added some sort of check that validates the next word, or that only allows valid words in the first place, as defined by the BNF?

Also: I am not so sure how fruitful your endeavour is going to be. Good AI models need large sets of high-quality examples, but a lot of professional, high-quality code written by experienced developers is not open source. I’d argue it’s even worse for Ada, since it’s not an academic but an industrial language.

Maybe you could train on more code from Advent of Code or Rosetta Code examples? But I don’t know if you are even allowed to do that, since most people hold copyright on their code. (I don’t want to start the AI ethics and copyright debate now. I’m just pointing out that everyone owns their code and may not want to be part of a training set.)

There are also some examples by AdaCore on how to use AI for Ada. See: https://www.youtube.com/watch?v=PkInULSymD4
But they don’t make their own AI models. See: https://youtu.be/tYAod_61ZuQ?si=wR0YQFmi8_NUaHOB&t=2916

One small addition: could you add some explanation of the AI keywords? I had to look up what a “QLoRA fine-tune of Qwen2.5-Coder-14B-Instruct” is and what these benchmark tests mean. :sweat_smile:

1 Like

I use Grok and the code it produces has always been very good. Of course, a local model is always welcome.

1 Like

These are really good questions, let me try to address them.

I should also frame that these posts reflect my philosophy around using AI. I use what Andrej Karpathy recently coined “Agentic Engineering” — the practice where you’re not writing code directly 99% of the time, you’re orchestrating AI agents who do, acting as oversight, architect, and final decision-maker. There’s an art and science and real expertise to it, the same way there’s expertise in managing a team of engineers even if you’re not typing every line yourself. My workflow is: I design the architecture and strategy, AI executes under my direction, I validate everything. These responses are written the same way — human-in-the-loop AI generations, reviewed and edited by me.

On the BNF/syntax point: you’re right that Ada’s syntax is formally defined, and in theory you could constrain the model to only produce syntactically valid tokens. This is actually an active area of research called “constrained decoding” or “grammar-guided generation.” The problem is that syntactic validity is the easy part. Most of the failures I see aren’t syntax errors — they’re semantic errors: using a package that doesn’t exist, passing the wrong type to a procedure, misusing visibility rules, incorrect generic instantiation. The code looks like valid Ada and parses fine, but GNAT rejects it because of type mismatches or scoping issues that no BNF grammar can capture. Ada’s type system is strict enough that this is where most models fall down.
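To make the constrained-decoding idea concrete, here is a toy sketch, purely illustrative: real implementations (e.g. grammar-guided sampling in llama.cpp) operate on tokenizer IDs rather than whole words, but the principle is the same. At each step the grammar supplies the set of currently legal tokens, everything else is masked out, and only then is the best candidate picked. Note this guarantees syntax only; a type mismatch would still sail through.

```python
def constrained_greedy(step_logits, step_allowed, vocab):
    """Greedy decoding restricted to grammar-legal tokens.

    step_logits:  per step, one score per vocabulary entry
    step_allowed: per step, the set of tokens the grammar permits
    """
    out = []
    for logits, allowed in zip(step_logits, step_allowed):
        best_tok, best_score = None, float("-inf")
        for tok, score in zip(vocab, logits):
            if tok in allowed and score > best_score:
                best_tok, best_score = tok, score
        out.append(best_tok)
    return out

# The model may "prefer" an illegal token, but the grammar mask overrides it:
vocab = ["procedure", "end", "begin", "xyzzy"]
```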

On the data quality concern: you’re absolutely right that this is the core challenge. Most high-quality Ada code is proprietary and behind closed doors. My training data is entirely synthetic — every pair is an instruction plus a compiler-verified completion, not scraped from existing codebases. So I’m not training on anyone’s proprietary code. The tradeoff is that synthetic data has its own biases and quality ceiling, which is exactly why feedback from experienced Ada developers like yourselves is so valuable. It’s the kind of signal I can’t generate synthetically.

On the copyright point: I appreciate you raising it. The dataset is entirely synthetic, generated by a base LLM and then filtered through the compiler. No code was copied from Advent of Code, Rosetta Code, or any open source projects. That said, the broader AI training ethics discussion is valid and I understand the concerns.

I hadn’t seen those AdaCore videos — thank you for sharing. I’ll watch them.

On the terminology, that’s a fair point and I should have been more accessible. In short: Qwen2.5-Coder-14B-Instruct is an open-weight coding model with 14 billion parameters, made by Alibaba. QLoRA is a technique for fine-tuning a model in very little memory: the base model is quantized to 4-bit precision and only small low-rank adapter layers are trained on top of it. The compilation benchmark measures how often the model’s output compiles on the first try with gnatmake. HumanEval-Ada is a standardized set of 157 programming problems translated into Ada, and pass@1 is the fraction the model solves on its first attempt. I’ll add a terminology section to the post. Good suggestion.
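For completeness, pass@k comes from the original HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c correct ones, and estimate the probability that at least one of k randomly drawn samples passes. With a single greedy sample per problem, pass@1 reduces to the fraction of problems solved on the first try. The standard unbiased estimator is small enough to show in full:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)
```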

4 Likes

Interesting, I haven’t tested Grok against the benchmark yet. If it’s generating clean Ada for you that’s good to hear. Would be curious how it holds up on the compilation benchmark if I get around to adding it to the frontier comparison.