I’ve just finished Chapter 1, which has you implement the lexer, parser, and codegen stages for a very, very small subset of C. The provided unit tests are excellent, allowing you to check your progress at each stage of development. The text is fairly abstract and assumes you’re already familiar with your target language and with programming in general. I like this more than the prescriptive approach Crafting Interpreters takes, where you’re simply given code snippets to type or translate into your target language.
Implementation notes
The book recommends using an implementation language with “pattern matching” features. I think the “prototyped” status of this RFC means GNAT supports that. I’m also pretty sure pattern matching is largely syntactic sugar for a chain of successive elsif blocks, so I can manage even if it isn’t well supported.
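For the single-level matches a compiler mostly needs, Ada’s case statement over a variant record’s discriminant already gets you there; the elsif chains only show up for nested patterns. A minimal sketch (the node type and names here are illustrative, not anything from WACC):

```ada
procedure Match_Demo is
   --  Toy node type for illustration only
   type Node_Kind is (N_Constant, N_Negate);
   type Node (Kind : Node_Kind := N_Constant) is record
      case Kind is
         when N_Constant =>
            Value : Integer;
         when N_Negate =>
            null;
      end case;
   end record;

   --  Single-level "pattern matching" via case on the discriminant
   function Describe (N : Node) return String is
   begin
      case N.Kind is
         when N_Constant =>
            return "constant" & Integer'Image (N.Value);
         when N_Negate =>
            return "negation";
      end case;
   end Describe;

   C : constant Node := (Kind => N_Constant, Value => 2);
begin
   pragma Assert (Describe (C) = "constant 2");
end Match_Demo;
```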
The “compiler driver” program is contained within main.adb. Calls to gcc for preprocessing, assembly, and linking use the AAA.Processes library. The book does assembly and linking in a single step, but I’ve always considered those to be discrete operations, so I’ve broken them into separate steps to make multi-object linking easier later.
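In outline, the two steps look like the sketch below. I’m showing GNAT.OS_Lib.Spawn here rather than quoting the AAA.Processes calls from memory, and the file names are placeholders:

```ada
with GNAT.OS_Lib; use GNAT.OS_Lib;

procedure Driver_Sketch is
   --  Assumes gcc is on PATH; Locate_Exec_On_Path returns null otherwise
   Gcc : constant String_Access := Locate_Exec_On_Path ("gcc");
   Ok  : Boolean := False;
begin
   --  Assemble only (-c): one object file per translation unit
   Spawn (Gcc.all,
          (new String'("-c"),
           new String'("main.s"),
           new String'("-o"),
           new String'("main.o")),
          Ok);

   --  Link as a discrete step; more object files can join this list later
   if Ok then
      Spawn (Gcc.all,
             (new String'("main.o"),
              new String'("-o"),
              new String'("main")),
             Ok);
   end if;
end Driver_Sketch;
```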
I wrote a WACC.IO package to abstract all of the file operations. The whole input file is read into memory with System.Mmap when it’s opened, and writing uses Ada.Text_IO. There are many opportunities for future optimization.
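The reader side of the spec has roughly this shape (a simplified sketch; Peek, Advance, and the rest are illustrative names, not necessarily what the package actually exports):

```ada
package WACC.IO is
   type Reader is limited private;

   procedure Open (File : in out Reader; Filename : String);
   --  Maps the whole input file into memory

   function End_Of_File (File : Reader) return Boolean;

   function Peek (File : Reader) return Character;
   --  Look at the next character without consuming it

   procedure Advance (File : in out Reader);
   --  Consume one character

   procedure Close (File : in out Reader);

private

   type Reader is limited record
      Buffer : access constant String;  --  backed by the mapped file
      Index  : Positive := 1;
   end record;

end WACC.IO;
```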
I pulled a handful of Restrictions and the Jorvik profile into a project-wide gnat.adc to keep myself from doing anything too crazy.
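gnat.adc is just a file of configuration pragmas that GNAT applies to every unit in the project; the contents are along these lines (the profile is the real one, the particular restriction is only an example):

```ada
--  gnat.adc: project-wide configuration pragmas
pragma Profile (Jorvik);
pragma Restrictions (No_Obsolescent_Features);
```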
I had intended to write the package specs with SPARK_Mode => On, but that severely limits the standard library packages available, and I kept getting sidetracked finding or writing alternatives. I’m dropping the SPARK requirement for now.
I wrote the lexer by hand rather than using regexes as the book suggests. GNAT.Regexp is not PCRE-compatible and I didn’t want to try to work around the edge cases there. Maybe this will come back to bite me, but it seems okay for now.
My lexer initially rejected the “misspelled_keyword” test, which uses returns instead of return but is supposed to pass at this stage according to the test suite. Ignoring the \b word-boundary part of the book’s keyword and identifier patterns fixes the test: returns lexes as an ordinary identifier, and the misspelling goes undetected for now. The book apparently expects you to catch this during the parse stage.
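Concretely, that means scanning the longest identifier-shaped run first and only then checking it against the keyword list, so returns falls through to being an identifier. A sketch with illustrative names:

```ada
with Ada.Characters.Handling; use Ada.Characters.Handling;

procedure Lex_Word_Demo is
   type Token_Kind is (T_Return, T_Int, T_Void, T_Identifier);

   --  Consume the longest identifier-shaped run starting at First,
   --  then decide whether it is a keyword.
   function Scan_Word
     (Source : String;
      First  : Positive;
      Last   : out Natural) return Token_Kind
   is
   begin
      Last := First;
      while Last <= Source'Last
        and then (Is_Alphanumeric (Source (Last)) or else Source (Last) = '_')
      loop
         Last := Last + 1;
      end loop;
      Last := Last - 1;

      declare
         Word : String renames Source (First .. Last);
      begin
         if Word = "return" then
            return T_Return;
         elsif Word = "int" then
            return T_Int;
         elsif Word = "void" then
            return T_Void;
         else
            --  "returns" lands here: a plain identifier, not a lex error
            return T_Identifier;
         end if;
      end;
   end Scan_Word;

   Last : Natural;
   Kind : Token_Kind;
begin
   Kind := Scan_Word ("returns 2;", 1, Last);
   pragma Assert (Kind = T_Identifier and then Last = 7);
end Lex_Word_Demo;
```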
I chose to store token literals in Unbounded_String, which means every token allocates a little bit of memory. I don’t bother storing the literal for single-character tokens, since the kind alone identifies them. Storing offsets into the input text would be more efficient and would enable contextual error messages later, but I want to avoid tight coupling between the I/O routines and the lexer for now.
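The token type ends up looking something like this (sketch; the exact kinds and names are illustrative):

```ada
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;

package Tokens_Sketch is
   type Token_Kind is
     (T_Identifier, T_Constant,     --  literal carries the text
      T_Return, T_Int, T_Void,      --  keywords
      T_Open_Paren, T_Close_Paren,  --  single characters: literal left empty
      T_Open_Brace, T_Close_Brace, T_Semicolon);

   type Token is record
      Kind    : Token_Kind;
      Literal : Unbounded_String;  --  Null_Unbounded_String when obvious
   end record;
end Tokens_Sketch;
```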
The book links to a paper describing three options for AST data structures, with a preference for an object-oriented implementation that allows use of the visitor pattern. I’m going to stick with discriminated record types for now, but will move to tagged types if needed.
Translating ASDL to discriminated records seems to be working well. If the pattern continues to hold up, I’ll likely write an automated asdl2ada generator.
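For Chapter 1 the mapping is mechanical; the AST is tiny, roughly statement = Return(exp) and exp = Constant(int) in the book’s ASDL, and each sum type becomes an enumeration plus a variant record (field and type names here are mine):

```ada
package AST_Sketch is
   type Exp_Kind is (N_Constant);
   type Exp (Kind : Exp_Kind := N_Constant) is record
      case Kind is
         when N_Constant =>
            Int_Value : Integer;
      end case;
   end record;

   type Stmt_Kind is (N_Return);
   type Stmt (Kind : Stmt_Kind := N_Return) is record
      case Kind is
         when N_Return =>
            Return_Exp : Exp;
      end case;
   end record;
end AST_Sketch;
```

The single-alternative variants look degenerate here, but later chapters add more constructors to each type, which is where the case-on-discriminant structure pays off.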
Codegen went surprisingly smoothly. Having a separate data structure for assembly seems like overkill at this point, but I can see how it might be useful for optimization passes later.
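Chapter 1 only needs mov and ret with immediate or register operands, so the assembly representation stays small; the same discriminated-record pattern works here (sketch, names mine):

```ada
package Assembly_Sketch is
   type Operand_Kind is (A_Imm, A_Register);
   type Operand (Kind : Operand_Kind := A_Imm) is record
      case Kind is
         when A_Imm =>
            Imm_Value : Integer;
         when A_Register =>
            null;
      end case;
   end record;

   type Insn_Kind is (A_Mov, A_Ret);
   type Instruction (Kind : Insn_Kind := A_Ret) is record
      case Kind is
         when A_Mov =>
            Src, Dst : Operand;
         when A_Ret =>
            null;
      end case;
   end record;
end Assembly_Sketch;
```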