Writing a C Compiler

It’s a Three-Address Code (TAC) defined in the book… Sort of a high level assembly IR where each instruction has up to three operands: two sources and a destination.

If you’re at all interested in this stuff, I highly recommend picking up a copy of the book. No Starch has 32% off everything this weekend with the DC32 coupon code.

1 Like

Chapter 5: Local Variables

The bug I encountered in Chapter 1 where return2 was parsed as the return keyword came back to haunt me. I was missing '0' .. '9' character range in identifier names. Chapter 5 tests include a variable named void2 which was incorrectly parsed as a void token. After fixing the lexer, I re-ran all of the previous chapters’ tests and everything looks fine now.

The rest of the implementation went fairly smoothly once I got through reading the chapter, which took longer than usual as I kept losing focus. Variable resolution just doesn’t excite me and feels like a bit of a chore.

I refactored most of the Unbounded_String operations on Identifiers into a single WACC.Strings package just to avoid duplicating the Make_Identifier logic in both the Semantic_Analysis and TACKY packages. I expect the Strings package to grow more functionality over time.

1 Like

Chapter 6: if Statements and Conditional Expressions

I never liked ternary operators in C. The list of examples for precedence rules and undefined behavior in this chapter really drive home how dangerous they are. Still, it’s part of the langauge, so we must. I liked how the later stages of the compiler didn’t need updating here. We already have conditional jump instructions, so adding if statements is really just a frontend concern.

I’m a bit surprised that compound statements get their own chapter, but I’m sure I’ll understand soon.

1 Like

Chapter 7: Compound Statements

Oh. Scoping rules. That’s why. I like the variable renaming and map copy approach here. Seems simpler than what Crafting Interpreters does with a nested tree of scopes you have to traverse to resolve.

I did a bit of refactoring in the semantic analysis pass here to more closely match the function names in the book’s pseudocode. Every time I try to be more clever than the book, I have trouble remembering why I did it a few days later. There’s a lesson there, certainly.

Chapter 8: Loops

There’s a lot of subtle behavior happening in this chapter that took me a while to wrap my head around. For loops in particular have a lot of moving parts. I’ve fallen into the trap of using null pointers to represent optional values. I may go back and try to fix this at some point, but it would be a large refactor of the ASDL data structures. I’ll live with it for now.

The extra credit goto statements I added in chapter 6 are broken now. The goto_bypass_init_exp test has a goto jump to a label inside a for loop. This is supposed to skip executing the for initializer. From my understanding of the assembly output, it does this correctly, but I’m not certain. If the label the goto is jumping to is indeed in the wrong place, I think I need to add a new fixup pass in the assembly stage that moves the label if it precedes a For_Init_Node. I’m not certain this is the right approach though.

1 Like

Hi you might prefer the book “Crafting a Compiler” by Charles N. Fischer and Richard J LeBlanc, Jr bublished by Benjamen Cummings 1988. Tt’s all about writing an Ada compiler in Ada of course. It covers all the basic theory and has lots of Ada code in it. Great project. I would love to write an almost Ada compiler and get rid of a few things I find a bit irritating about it. However after more than 3000 hours of Ada and years of C/C++ before that I am hooked on Ada. Even so far as to write www.adaworks.it no its an http not https. Sorry. Server written in Ada though.
Tom

2 Likes

My advice: never use RegEx, not even for “recognizing integers”.

An alternative to this would be to use EVIL’s file-interface generic (providing a self-closing file-type w/ stream access) and/or Byron’s Readington subsystem (stream to Wide_Wide_String [full Unicode]).

Another option is Using a High-level Language as a Cross Assembler. (3 page paper)

Any progress on this?

I’m almost done with chapter 9. There are some more notes in the GitHub repo: GitHub - JeremyGrosser/wacc

I got distracted by nice summer weather, then had some health issues that prevented me from working on it for a few months. I figure I’ll get back into the compiler book next year after Advent of Code.

1 Like

The preference for a language which supports pattern matching doesn’t surprise me. The Bootstraping of Rustc uses OCaml which supports this feature. Why the choice of such an unusual language if not for such a reason?

Then a simple optimisation pass - here a constant folding - could be written (here in OCaml):

let rec constant_folding expr =
   match expr with
   | Addition(expr1,expr2) -> 
       (match constant_folding expr1, constant_folding expr2 with
           | Integer(i1), Integer(i2) -> Integer(i1+i2)
           | opt1,opt2 -> Adition(opt1,opt2)
       )
   ...

With a language that doesn’t support pattern matchning, such an algorithm would need longer expressions.

IIRC, OCaml and SML are both well-defined, stable, consistent/constant/completed languages — meaning that they share none of the variability, lack of standardization, and churn that Rust has.

I would say rustc can’t be chosen to bootstrap itself. But among all the other languages, OCaml has been chosen for multiple reason, and one of them could be the pattern matching. (The type safety and the functional paradigm surely may be other reasons).

Now the Rustc is self-hosted.

I know it has been close to 2 years since this thread started, but I took a glance at your code, trying to pick up tips on structuring my Ada code better. I did an Ocaml version of the compiler, although only through chapter 10 (if you stopped at 9, you made the right choice!). I have lately been working on one in Ada. I have a tendency to just return structs or have them as out parameters, but I notice you use access types pretty much everywhere. Is that the typical Ada way to do things? So far, I have only used access types for expressions because of the recursive references. I was particularly wondering about vectors, because it seemed like it would be better to have a vector of structs than a vector of pointers to structs. I’m not asking this as a criticism, I just want to write better Ada.

No, it is not necessary to use access types and generally they should be avoided or abstracted away inside containers if possible.

I made the mistake of using access in this codebase to indicate an optional field, where null is the absence of a value. I really should have used a discriminated type or something like the generic optional crate instead. If I were to rewrite my compiler, I’d do it differently.

Chapter 10 is really a stress test of how well structured and maintainable your codebase is. It requires refactoring several passes all at once. I found it to be quite difficult with the way I structured the AST as discriminated record types. I did eventually get it working, but it took several days of reading x86_64 assembly and comparing it to the output from the author’s reference implementation. Not fun.

I do want to finish the book at some point, but I think I want to start over, either with an object-based approach or in an interpreted language with better introspection primitives.

If you’re looking for an example of a very well structured compiler, check out @sttaft’s parasail.