Ada Bootstrap compiler funding thanks to NLnet!

Dear all,

The NLnet Foundation has announced a funding grant for an Ada Bootstrap compiler!

I am a bit surprised by this news: I was involved in some behind-the-scenes conversations about it, but I was not expecting it to pop up this soon! I will try to gather more information and see whether a candidate has already been selected or whether the selection is open to the public.

Will keep you updated,
Fer

12 Likes

Ada is an important computer language with a long history, with the compilers being built for new architectures on an ad-hoc basis based on previously existing Ada compilers from other architectures. This project aims to create a bootstrap path from the C language to an Ada compiler without relying on an existing Ada compiler binary. This will allow us to have a fully auditable trail from C to a working Ada compiler, removing concerns about hidden backdoors or other issues that may arise from using a compiler without a clear bootstrap path.

I have three reactions:

  1. A bootstrap implementation, yay!
  2. Not using existing tech to leverage the bootstrap process; disappointing, but understandable.
  3. Using C… *sigh* (though it is auditable, yay-ish… that should be a self-hosted feature.)

IMO, a better route would be to self-host the compiler in SPARK-verified Ada (employing Annex H), with a modular backend/codegen: aside from the compiler, create two subprojects: (1) a SeedForth [small-wordset, tokenized Forth] interpreter, SPARK-verified; and (2) a SPARK-verified SeedForth-emitting backend. You can then feed things into each other so that you get the ability to interpret Ada on a new architecture by implementing about 30 Forth words, with a provably correct interpreter and codegen. Native-code generation then becomes implementing a new backend, and you can use the previous results to generate a native-code compiler.

1 Like

I didn’t notice the C part when I skimmed it; I thought it was going to be in Scheme.

If you must use ancient, dangerous C, at least use MISRA tools or something.

How much is this grant? Is this something you apply to?

Would an Ada to C transpiler fit under this?

I’ve seen Forth mentioned multiple times in this respect. For those without Forth experience, what are the practical benefits of this?

It really comes from the foundational definitions of Forth: in Forth a “word” is (a) a sequence of words to execute, or (b) a chunk of machine instructions to execute. From this we get two things: (1) a Forth is very simple, and very low-level/hardware-oriented; and (2) by this definition, a Forth is also extremely portable, since porting it to a completely different architecture is just implementing the “core words” that all the other words devolve into a sequence of.

In standard Forth, words are essentially a dictionary look-up of text. (Meaning they can be bytes and bytes and bytes long e.g. some_really_long_operation_that_describes_everything_it_does, or short +.) The execution of Forth involves a stack machine, so the sequence “12 34 +” would be parsed as 12, pushed on the stack, 34, pushed on the stack, the top two items added together and the result pushed on the stack: 46.
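
As a toy sketch of that evaluation (the fixed stack size and the names here are made up for illustration, not taken from any real Forth):

Stack : array (1 .. 16) of Integer;   -- the data stack
Top   : Natural := 0;                 -- index of the topmost item

procedure Push (V : Integer) is
begin
   Top := Top + 1;
   Stack (Top) := V;
end Push;

procedure Plus is                     -- the "+" word: pop two, push the sum
begin
   Stack (Top - 1) := Stack (Top - 1) + Stack (Top);
   Top := Top - 1;
end Plus;

--  "12 34 +"  becomes:  Push (12); Push (34); Plus;  leaving 46 on top.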

With SeedForth, the words are tokenized, meaning that you could have, say, words indexed 0..2**6 and use the “leftover bits” of your word size to index a user-dictionary, if you provide a user-dictionary at all. What this means is that your indices into the dictionary are of uniform length, and that the entries therein are lists of uniform-element length; i.e.

Core_Word_Count  : Constant := 2**6;
System_Cell_Size : Constant :=    8;
Subtype Word_Length is Natural range 0..Natural'Pred( 2**System_Cell_Size );
Type Token is range -Core_Word_Count..2**(System_Cell_Size - 1) - 1
  with Size => System_Cell_Size; -- one signed cell per token
Subtype Core_Token is Token range Token'First..0;
Subtype User_Token is Token range Token'Succ(Core_Token'Last)..Token'Last;

-- Instantiate MAP package with User_Token as the index, and an array of Token as element.
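
A rough sketch of what that comment could look like (the choice of Ada.Containers.Indefinite_Ordered_Maps here is my own assumption, and it needs the corresponding with-clause):

Type Token_List is array (Word_Length range <>) of Token;

Package User_Dictionaries is new Ada.Containers.Indefinite_Ordered_Maps
  (Key_Type     => User_Token,
   Element_Type => Token_List);

User_Dictionary : User_Dictionaries.Map;
-- Each user word maps to the sequence of tokens it expands into.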

So, once you get your SeedForth tokenized, you have what’s essentially equivalent to a bytecode, except for a very simple (thus easily provable) conceptual virtual machine. The details can be so close/integrated to the bare metal that there’s no real “virtual”: those core words are now constants, in this case things like -26, -3, -18. This means that you now have a stable instruction stream, such that all you need to do to port is implement those core words.
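
As a rough illustration of that dispatch (building on the declarations above; Add, Dup, Drop and Execute_User_Word are hypothetical procedures, and the particular token values are made up):

procedure Step (T : Token) is
begin
   if T in Core_Token then
      case T is
         when -3     => Add;     -- illustrative core-word assignments only
         when -18    => Dup;
         when -26    => Drop;
         when others => raise Program_Error with "core word not yet ported";
      end case;
   else
      Execute_User_Word (User_Token (T));  -- run its stored token list
   end if;
end Step;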

In practice: you feed your SPARK-proved SeedForth interpreter to your SPARK-proved SeedForth-producing backend to get a verifiable SeedForth-in-SeedForth, which you can then leverage to get a verified interpreter in itself. That is, a verified, correct interpreter.

Then you feed your Ada compiler whole into the SeedForth-producing backend; this gives you the Ada compiler in SeedForth, which allows you to run your verified compiler on the above verified interpreter. (You can now produce code to run for the new architecture, and on the new architecture.)

With these in place, getting verified native-code execution on the new hardware means implementing a SPARK-proved native backend for the compiler, then running that through the above to get a native-code-producing Ada compiler in SeedForth. (Feed it to itself again, and you have a native-code-producing Ada compiler that itself runs as native code.)

TL;DR: Using SeedForth means: (1) there is no C, at all, in the dependencies; (2) the simplified nature makes it easier to verify; (3) the token stream can be used as a bytecode; (4) you get a very portable, verified VM for the “bytecode” of #3; (5) you can, as needed, (a) interpret your Ada programs on this VM, achieving high portability with little work, or (b) implement just the verified backend and generate a verified native compiler on the new architecture with all the work that’s been done before.

TL;DR for the TL;DR: it allows high portability, after the initial work, with the option to “half-port” (interpret) your Ada programs.

6 Likes

I am nostalgic about Forth. I learned it on a PDP-11 machine under RSX-11M when I was looking for alternatives. The C compiler was a five-stage mess that took half an hour to compile hello world. But Forth amazed me. I was flabbergasted by the fact that the geniuses didn’t even care to make the dictionary search of words binary: it was linear! No idea if it is still so. Add to this the fact that they managed to create a “language” of lower level than machine code itself! I ditched the garbage in a few weeks and continued with MACRO-11…

Ooh, this is really good news and will help cement the ability to port Ada in the future. I can see this will also help in the permacomputing/retrocomputing communities. It’s sometimes easy to forget how brittle our foundations can be and how important it is to have reproducible builds from the ground up. I’m not super fond of C, but you can write C for most platforms so in the practical sense I think it’s the best move.

1 Like

You discarded the language because a particular implementation used linear search? Did dictionary sizes justify the added work/complexity? Or space? (A lot of Forths were implemented on very constrained processors.)

In any case, SeedForth doesn’t have that issue because it’s tokenized: all the words are uniform in length and encoding (bit pattern). For the core words, you could implement them with a constant array (indexed by Token) of, say, 256 cells (because we’re being greedy; we could probably get by with 64), holding the machine code padded with NOPs… porting would then be filling in this constant with the instructions of the new architecture.
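
A minimal sketch of that table, reusing the Core_Token declaration from earlier (the slot size, the Byte type and the x86 NOP value are my assumptions):

Slot_Size : constant := 16;               -- machine-code bytes per core word

type Byte      is mod 2**8;
type Code_Slot is array (1 .. Slot_Size) of Byte;

NOP : constant Byte := 16#90#;            -- e.g. the single-byte NOP on x86

-- Porting to a new architecture is filling this constant in.
Core_Words : constant array (Core_Token) of Code_Slot :=
  (others => (others => NOP));            -- placeholders for DUP, DROP, "+", ...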

No, I discarded it for being a very bad language.

Hi all,

I will try to shed some light on some of the topics raised.

It is using the available tech… in the constrained environment in which this is taking place :slight_smile:

The thing is, this is no normal bootstrap compiler, where one can pick and use any available technology to carry out such a task. The framework for this project is the live-bootstrap system. This is a project that is trying as hard as possible to create a root of trust from which the world can be rebuilt in a bit-by-bit reproducible and transparent manner, in a way that a human being can verify the process and each and every single line of code. And this is done with the absolute bare minimum, which means that they are starting from a tiny binary seed and bootstrapping all the way to modern Linux, GCC and Guile.

C and Scheme are some of the first languages bootstrapped in the current live-bootstrap system, C being the most complete and capable of the set (the Scheme they implemented is incredibly small and it is only used to create a basic libc and a barebones C compiler). We would like to have Ada bootstrapped as early as possible, so C is the ideal system. However, other ideas have been tossed around, such as Go, a more advanced Scheme (TR7 or Guile), Python or Rust, for example.

Sadly, I have not been able to gather more information…

This idea was indeed discussed. I was the one carrying out the tests. The LLVM-based, open-source C Code Generator (CCG) is not capable of translating the current GNAT source code, as it is rather limited.

A more capable system, LLVM-CBE, using GNAT-LLVM, was also tested. This is a lot more capable than CCG, as it transpiles all LLVM IR regardless of where it came from. However, there are two issues:

  1. No machine-generated code is allowed, as auditing such code by a human is considered against the goal of the project.
  2. The generated C code is very difficult to read.
    However, I was told that the generated code would be acceptable IF the Ada source code lines for which the IR is generated were interleaved with the equivalent C code, in order to make auditing the generated code a reasonable endeavour.

Best regards,
Fer

2 Likes

This is not a complaint @Irvise, as I am very excited for this, so please don’t take it as such. I think this might make some scenarios very interesting to tackle in C. There are some Ada features that are hard to transpile to C. One that comes to mind is nested functions, which have no analog in C (no lambdas or nested functions in ANSI or K&R C). I don’t know if that matters for a bootstrap compiler or not (maybe it would use a limited subset of Ada?), but at some point a compiler (maybe one created by the bootstrap compiler, though) may need to dip into machine instructions.

That’s not really an issue when implementing a compiler for another language. But gcc does allow them and most likely clang does too.

I was thinking about the transpiler to C. If it is targeting valid C, then implementing a C solution might be challenging. If it targets a specific compiler and its extensions, then yeah, it would definitely be simpler to solve.

Even with a transpiler, it’s not a problem as you can either use the compiler extensions or flatten out the functions.

Can you flatten them if you are taking 'Access of them? Or does the compiler make two different access types (one for the flattened functions with the additional parameters and one for the non-flattened, though how does it tell which is which at runtime)?

Consider:

procedure Update_1(Value : in out Integer) is
begin
   Value := Value + 1;
end;

procedure Something_Complex(Vector : in out Some_Integer_Vector) is

   Thing: Integer := Vector(3);

   procedure Update_2(Value : in out Integer) is
   begin
      Value := Value + Thing;
   end;

begin
   Vector.Update_Element(1, Update_1'Access);
   Vector.Update_Element(2, Update_2'Access);
end;

I feel like flattening maybe wouldn’t work here? Update_1 and Update_2 would have different parameter lists, since #2 would be flattened and #1 may not be, and then the call to Update_Element expects a procedure with one parameter. You’d maybe have to rely on compiler extensions and only target a specific compiler?

Thing could be made global in the C version, too, and if there is already a variable named Thing at the global level, they would get two different names. I’m working with a library called Cil for OCaml that analyzes C code, and it does some similar things to standardize code targeting C compilers with different feature sets.
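
To make that concrete (a sketch only, reusing the example above and keeping it in Ada even though the real output would be C; the names are illustrative):

Something_Complex_Thing : Integer := 0;   -- was the local Thing

procedure Update_2 (Value : in out Integer) is
begin
   Value := Value + Something_Complex_Thing;
end Update_2;

procedure Something_Complex (Vector : in out Some_Integer_Vector) is
begin
   Something_Complex_Thing := Vector (3);
   Vector.Update_Element (1, Update_1'Access);
   Vector.Update_Element (2, Update_2'Access);
end Something_Complex;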

1 Like

That’s pretty neat. How does it handle the runtime differences in calls to a procedure like Something_Complex? For different calls at different times, the value of Thing would be different (in the general case; I’m sure specific examples like the above can be covered). A single global variable named Thing would only be able to cover one call of the procedure without messing up other instantiations of the call. Maybe they mangle the name of the Thing variable based on the call site?

It wouldn’t be safe to run concurrently.
I’m not sure that Cil specifically handles nested functions in that way. I saw that it handled local structs that way and guessed.

You have a very good point. It is not obvious how (the generalization of) your example would be implemented in standard C.

Yeah, I’ve thought about this because I started down the path of making a small preemptive OS for an embedded C compiler I use, but it is a very unusual compiler in that, instead of allocating function variables and parameters on the stack, it allocates them globally at compile time and assigns them either a unique or shared address based on function-call analysis (which is somewhat similar to what we are talking about here).

For that compiler, this is really great for certain code optimizations, but it kinda hoses any chance at a preemptive OS, because the moment context switches come into play, those global variables end up clobbering each other when two different threads/processes call the same function, updating the same variable.

In this example it would be similar: if one task called Something_Complex, got preempted, and then another task also called Something_Complex, updating the Thing variable, it would clobber the Thing variable for the first task’s call of the procedure when it comes back later. If a global-variable implementation were used under the hood in a transpiler, I think it would have the same weakness that I’m seeing in my own little project.

It’s honestly a fun problem for me to try and work out in my head.

1 Like

I just did a quick search through the source code of the latest version of Cil, and it seems that it doesn’t support nested functions. We have our own non-public fork. I don’t know if we support nested functions. If not, it’s possible I’ll be thinking about it later in the year, too!

Maybe for your example, we could just add extra parameters (which would be unused in the version that was never nested) and a switch statement.
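
One way to read that suggestion, sketched in Ada rather than the generated C (it assumes the transpiler also rewrites the Update_Element callback to take the extra environment parameter; every name here is illustrative):

type Which_Update is (Plain, With_Thing);

type Update_Env is record
   Kind  : Which_Update := Plain;
   Thing : Integer      := 0;   -- the captured local; unused when Kind = Plain
end record;

-- Both former procedures share one flattened signature; the case
-- statement plays the role of the "switch".
procedure Update_Flat (Value : in out Integer; Env : Update_Env) is
begin
   case Env.Kind is
      when Plain      => Value := Value + 1;          -- old Update_1
      when With_Thing => Value := Value + Env.Thing;  -- old Update_2
   end case;
end Update_Flat;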

1 Like