Writing a C Compiler

JeremyGrosser · August 10, 2024, 8:27pm

It’s a Three-Address Code (TAC) defined in the book… Sort of a high level assembly IR where each instruction has up to three operands: two sources and a destination.

If you’re at all interested in this stuff, I highly recommend picking up a copy of the book. No Starch has 32% off everything this weekend with the DC32 coupon code.

JeremyGrosser · August 15, 2024, 7:05am

Chapter 5: Local Variables

The bug I encountered in Chapter 1 where return2 was parsed as the return keyword came back to haunt me. I was missing '0' .. '9' character range in identifier names. Chapter 5 tests include a variable named void2 which was incorrectly parsed as a void token. After fixing the lexer, I re-ran all of the previous chapters’ tests and everything looks fine now.

The rest of the implementation went fairly smoothly once I got through reading the chapter, which took longer than usual as I kept losing focus. Variable resolution just doesn’t excite me and feels like a bit of a chore.

I refactored most of the Unbounded_String operations on Identifiers into a single WACC.Strings package just to avoid duplicating the Make_Identifier logic in both the Semantic_Analysis and TACKY packages. I expect the Strings package to grow more functionality over time.

JeremyGrosser · August 16, 2024, 6:26pm

Chapter 6: if Statements and Conditional Expressions

I never liked ternary operators in C. The list of examples for precedence rules and undefined behavior in this chapter really drive home how dangerous they are. Still, it’s part of the langauge, so we must. I liked how the later stages of the compiler didn’t need updating here. We already have conditional jump instructions, so adding if statements is really just a frontend concern.

I’m a bit surprised that compound statements get their own chapter, but I’m sure I’ll understand soon.

JeremyGrosser · August 21, 2024, 8:34pm

Chapter 7: Compound Statements

Oh. Scoping rules. That’s why. I like the variable renaming and map copy approach here. Seems simpler than what Crafting Interpreters does with a nested tree of scopes you have to traverse to resolve.

I did a bit of refactoring in the semantic analysis pass here to more closely match the function names in the book’s pseudocode. Every time I try to be more clever than the book, I have trouble remembering why I did it a few days later. There’s a lesson there, certainly.

Chapter 8: Loops

There’s a lot of subtle behavior happening in this chapter that took me a while to wrap my head around. For loops in particular have a lot of moving parts. I’ve fallen into the trap of using null pointers to represent optional values. I may go back and try to fix this at some point, but it would be a large refactor of the ASDL data structures. I’ll live with it for now.

The extra credit goto statements I added in chapter 6 are broken now. The goto_bypass_init_exp test has a goto jump to a label inside a for loop. This is supposed to skip executing the for initializer. From my understanding of the assembly output, it does this correctly, but I’m not certain. If the label the goto is jumping to is indeed in the wrong place, I think I need to add a new fixup pass in the assembly stage that moves the label if it precedes a For_Init_Node. I’m not certain this is the right approach though.

Tom · August 25, 2024, 10:50am

Hi you might prefer the book “Crafting a Compiler” by Charles N. Fischer and Richard J LeBlanc, Jr bublished by Benjamen Cummings 1988. Tt’s all about writing an Ada compiler in Ada of course. It covers all the basic theory and has lots of Ada code in it. Great project. I would love to write an almost Ada compiler and get rid of a few things I find a bit irritating about it. However after more than 3000 hours of Ada and years of C/C++ before that I am hooked on Ada. Even so far as to write www.adaworks.it no its an http not https. Sorry. Server written in Ada though.
Tom

OneWingedShark · August 26, 2024, 5:30pm

My advice: never use RegEx, not even for “recognizing integers”.

An alternative to this would be to use EVIL’s file-interface generic (providing a self-closing file-type w/ stream access) and/or Byron’s Readington subsystem (stream to Wide_Wide_String [full Unicode]).

Another option is Using a High-level Language as a Cross Assembler. (3 page paper)