> Rather than getting stuck in front-end minutiae, the tutorial goes straight to generating working assembly code, from very early on.
I think this is important, and for a more sophisticated compiler design I find Ghuloum's approach very appealing [1]: build a very simple subset of the language from top to bottom, then gradually grow the meat.
A really great book following this approach, which I discovered recently, is [2]. Although I find neither C nor x86 to be the best target for your first compiler, it's still a very good book for writing one.
[1] http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf
[2] https://norasandler.com/2024/08/20/The-Book-Is-Here.html
This is something I've noticed about academic vs. "practicing" coders. Academics tend to build in layers (though not always), while "practicing" coders tend, give or take, to build in pipes. The layers approach might give you buildable code, but it is hard to exercise and run. Both approaches can work, though, especially if you build in executable chunks, but you have to focus on the smallest chunk you can actually run.
Yeah, I think this is one of the (few, rare) cases where the "official" academic way of teaching the subject is actually baggage and not really aligned with what's practically useful.
Compiler courses are structured like that because parsing really was the most important part back then, but I'd say that in the "modern" world, once you have a clear idea of how parsing actually works, it's more important to understand how compilers implement language features.
Even if you want to implement a compiler yourself, "Claude, please generate a recursive descent parser for this grammar" is close to working one-shot.
This article sums it up perfectly. I was interested in building a compiler long before going to college and this was the most accessible body of work.
Building a recursive descent parser from scratch was an eye-opener to 17-year-old me: it showed how a seemingly very complex problem that I had no idea how to approach can be made simple by breaking it down into the right primitives.
"breaking things down into the right primitives" is the real key to programming. There are many books and web pages about algorithms, but I wish there were more searchable and browsable resources for how to approach problems through primitives.
The process of breaking a complex problem down into the right primitives requires great understanding of the original problem in the first place.
What's usually blocking me during programming are edge cases I had no idea about. It's still hard to find good material on compilers if you are not into reading dry-ass books. That's a me problem though; I simply can't force myself to read boring, facts-only content (one of the reasons why I love Beej's guides).
> The process of breaking a complex problem down into the right primitives requires great understanding of the original problem in the first place.
Yes, but with experience that just becomes a matter of recognizing problem and design patterns. When you see a parsing problem, you know that the simplest/best design pattern is just to define a Token class representing the units of the language (keywords, operators, etc), write a NextToken() function to parse characters to tokens, then write a recursive descent parser using that.
Any language may have its own gotchas and edge cases, but knowing that recursive descent is pretty much always going to be a viable design pattern (for any language you are likely to care about), you can tackle those when you come to them.
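To make that concrete, here's a minimal sketch of that Token / NextToken / recursive-descent pattern in Python, for a toy arithmetic grammar (the class and method names are mine, not from any particular compiler):

```python
import re

# A Token is just a (kind, text) pair: the units of the language.
class Token:
    def __init__(self, kind, text):
        self.kind, self.text = kind, text

TOKEN_RE = re.compile(r"\d+|\S")   # runs of digits, or any single non-space char

def tokenize(src):
    """Turn a character stream into a list of tokens."""
    tokens = []
    for text in TOKEN_RE.findall(src):
        tokens.append(Token("NUM" if text.isdigit() else "OP", text))
    tokens.append(Token("EOF", ""))
    return tokens

# Recursive descent: one function per grammar rule.
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUM | '(' expr ')'
class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos]

    def next_token(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_expr(self):
        node = self.parse_term()
        while self.peek().text in ("+", "-"):
            op = self.next_token().text
            node = (op, node, self.parse_term())
        return node

    def parse_term(self):
        node = self.parse_factor()
        while self.peek().text in ("*", "/"):
            op = self.next_token().text
            node = (op, node, self.parse_factor())
        return node

    def parse_factor(self):
        tok = self.next_token()
        if tok.kind == "NUM":
            return int(tok.text)
        if tok.text == "(":
            node = self.parse_expr()
            assert self.next_token().text == ")", "expected ')'"
            return node
        raise SyntaxError(f"unexpected token {tok.text!r}")

print(Parser(tokenize("1 + 2 * (3 - 4)")).parse_expr())
# ('+', 1, ('*', 2, ('-', 3, 4)))
```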
That's a good point - recursive descent as a general lesson in program design, in addition to being a good way to write a parser.
Table-driven parsers (using yacc etc.) used to be emphasized in old compiler-writing books such as Aho & Ullman's famous "dragon book" (named for its front cover). I'm not sure why - maybe partly efficiency for the slower computers of the day, and partly because, in the infancy of computing, a more theoretical/algorithmic approach seemed more sophisticated and preferable (the canonical table-driven parser construction algorithm was one of Knuth's).
Nowadays it seems that recursive descent is the preferred approach for compilers because it's ultimately more practical and flexible. Table driven can still be a good option for small DSLs and simple parsing tasks, but recursive descent is so easy that it's hard to justify anything else, and LLM code generation now makes that truer than ever!
There is a huge difference in complexity between building a full-blown commercial quality optimizing compiler and a toy one built as a learning exercise. Using something like LLVM as a starting point for a learning exercise doesn't seem very useful (unless your goal is to build real compilers) since it's doing all the heavy lifting for you.
I guess you can argue about how much can be cut out of a toy compiler for it still to be a useful learning exercise in both compilers and tackling complex problems, but I don't see any harm in going straight from parsing to code generation, cutting out AST building and of course any IR and optimization. The problems this direct approach causes for code generation and optimization can themselves be a lesson in why a non-toy compiler uses them!
A fun approach I used at work once, when I wanted to support a pretty major C subset as the language of a programmable regression-test tool, was even simpler... Rather than having the recursive descent parser generate code, I just had it build executable data structures - subclasses of Statement and Expression base classes, with virtual Execute() and Value() methods respectively - so that the parsed program could be run by calling program->Execute() on the top-level object. The recursive descent functions just returned these statement or expression objects directly. To give a flavor of it, the ForLoopStatement subclass held the initialization, test and increment expression pointers, and ForLoopStatement::Execute() could then just call testExpression->Value() etc.
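A rough Python analogue of that executable-data-structure idea, with made-up class names standing in for the C++ ones described above: the parser would simply return these objects, and running the program is just calling execute() on the top-level statement.

```python
# Expressions know how to compute a value; statements know how to run.
class NumberExpr:
    def __init__(self, n):
        self.n = n
    def value(self, env):
        return self.n

class VarExpr:
    def __init__(self, name):
        self.name = name
    def value(self, env):
        return env[self.name]

class BinaryExpr:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def value(self, env):
        a, b = self.left.value(env), self.right.value(env)
        return {"+": a + b, "<": a < b}[self.op]

class AssignStatement:
    def __init__(self, name, expr):
        self.name, self.expr = name, expr
    def execute(self, env):
        env[self.name] = self.expr.value(env)

class BlockStatement:
    def __init__(self, statements):
        self.statements = statements
    def execute(self, env):
        for s in self.statements:
            s.execute(env)

class ForLoopStatement:
    """Holds the init/test/increment parts, like the C++ version described above."""
    def __init__(self, init, test, increment, body):
        self.init, self.test, self.increment, self.body = init, test, increment, body
    def execute(self, env):
        self.init.execute(env)
        while self.test.value(env):
            self.body.execute(env)
            self.increment.execute(env)

# for (i = 0; i < 5; i = i + 1) total = total + i;
program = BlockStatement([
    AssignStatement("total", NumberExpr(0)),
    ForLoopStatement(
        init=AssignStatement("i", NumberExpr(0)),
        test=BinaryExpr("<", VarExpr("i"), NumberExpr(5)),
        increment=AssignStatement("i", BinaryExpr("+", VarExpr("i"), NumberExpr(1))),
        body=AssignStatement("total", BinaryExpr("+", VarExpr("total"), VarExpr("i"))),
    ),
])
env = {}
program.execute(env)
print(env["total"])  # 10
```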
>a seemingly very complex problem that I had no idea how to approach can be made simple by breaking it down into the right primitives.
https://en.wikipedia.org/wiki/Niklaus_Wirth
From the Publications section of that Wikipedia page:
>The April 1971 Communications of the ACM article "Program Development by Stepwise Refinement",[22][23] concerning the teaching of programming, is considered to be a classic text in software engineering.[24] The paper is considered to be the earliest work to formally outline the top-down method for designing programs.[25][26] The article was discussed by Fred Brooks in his influential book The Mythical Man-Month and was described as "seminal" in the ACM's brief biography of Wirth published in connection to his Turing Award.[27][28]
When I need to parse something nowadays I always end up with parser combinators. They just make so much sense.
What language do you use parser combinators in, and what kind of grammar do you parse usually? Nom was terribly verbose and unergonomic even by Rust's standards. Haskell's Megaparsec/Parsec is good but yeah, it's Haskell, you need to handle multiple monads (Parser itself is monadic, then your AST state, and maybe some error handling) at once and that's where I got confused. But I appreciated the elegance.
I experimented with parser combinators in Haskell and Rust (nom), then moved on to parser generators in Rust (pest.rs), OCaml (Menhir) and Haskell (Happy), and finally ended up with Python's Lark - the speed of experimenting with different syntaxes/grammars is just insane.
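For a sense of how quick that iteration loop is, here's a tiny illustrative Lark grammar for arithmetic (the grammar itself is just an example, not from any real project); changing the syntax is just a matter of editing the grammar string and re-running:

```python
from lark import Lark

grammar = r"""
    ?expr: expr "+" term     -> add
         | term
    ?term: term "*" factor   -> mul
         | factor
    ?factor: NUMBER          -> number
           | "(" expr ")"

    %import common.NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start="expr")
print(parser.parse("1 + 2 * (3 + 4)").pretty())
```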
I've used tree-sitter to generate my parsers in Rust, just working with the untyped syntax tree it produces, and it gives you error tolerance for free. It's a bit of setup at first, though, requiring an extra crate for the generated parser, but editing it from there saves so much time.
What do you mean exactly by "error-tolerance"? Is it like each node being wrapped in a result type that you have to match against every time you visit it, even though you know for a fact that it is not empty, or something like that?
I suppose that one of the pros of using tree-sitter is its portability? For example, I could define my grammar once and use the same library and grammar both to parse my code and to do proper syntax highlighting in the browser? Is that correct? It is also used extensively in Neovim to define syntax for languages, so otherwise I would have had to modify the grammar slightly.
Parser combinators are more of a concept than a library. You could make your own supporting just the stuff you need. I like writing programs in languages I don't know or barely know, so I usually just take one of the popular libraries in any given language.
For Rust I used Nom, and I didn't mind it all that much, although I noticed it's quite baroque. If I had more to write, I'd probably make some wrappers or macros of my own for the most commonly used Nom snippets.
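To back up the "more of a concept than a library" point above, here's a minimal hand-rolled parser-combinator sketch in Python; the combinator names (seq, alt, many, fmap) are just the conventional ones, not from any specific library:

```python
# A parser is a function: input string -> (value, remaining input), or None on failure.

def char(c):
    """Match a single literal character."""
    def parse(s):
        return (c, s[1:]) if s[:1] == c else None
    return parse

def digit(s):
    return (s[0], s[1:]) if s[:1].isdigit() else None

def seq(*parsers):
    """Run parsers one after another, collecting their results."""
    def parse(s):
        values = []
        for p in parsers:
            result = p(s)
            if result is None:
                return None
            value, s = result
            values.append(value)
        return values, s
    return parse

def alt(*parsers):
    """Try each parser in turn, return the first success."""
    def parse(s):
        for p in parsers:
            result = p(s)
            if result is not None:
                return result
        return None
    return parse

def many(parser):
    """Apply a parser zero or more times."""
    def parse(s):
        values = []
        while (result := parser(s)) is not None:
            value, s = result
            values.append(value)
        return values, s
    return parse

def fmap(f, parser):
    """Transform a parser's result with f."""
    def parse(s):
        result = parser(s)
        if result is None:
            return None
        value, s = result
        return f(value), s
    return parse

# number := digit+    addition := number '+' number    expr := addition | number
number = fmap(lambda v: int(v[0] + "".join(v[1])), seq(digit, many(digit)))
addition = fmap(lambda v: v[0] + v[2], seq(number, char("+"), number))
expr = alt(addition, number)

print(expr("12+34"))  # (46, '')
print(expr("7"))      # (7, '')
```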
I printed this out (so I could have it with me everywhere) and read it when I was younger. It was so cool to see it come together so quickly. Some of these works (this one, Beej's guides, etc) are some of the best CS documentation we have and don't get nearly the credit they deserve.
> Jack Crenshaw's tutorial takes the syntax-directed translation approach, where code is emitted while parsing, without having to divide the compiler into explicit phases with IRs.
Is "syntax-directed translation" just another term for a single-pass compiler, e.g. as used by Lua (albeit to generate bytecode instead of assembly / machine code)? Or is it something more specific?
> in the latter parts of the tutorial it starts showing its limitations. Especially once we get to types [...] it's easy to generate working code; it's just not easy to generate optimal code
So, using a single-pass compiler for a statically-typed language makes it difficult to apply type-based optimizations. (Of course, Lua sidesteps this problem because the language is dynamically typed.)
Are there any other downsides? Does single-pass compilation also restrict the level of type checking that can be performed?
As long as your source language has a strict define-before-use rule and no advanced inference is required, you will know the types of expressions and can perform type-based optimizations. You can also do constant folding and (very rudimentary) inlining. But the best optimizations are done on IRs, which you don't have access to in an old-school single-pass design. LICM, CSE, GVN, DCE (loop-invariant code motion, common subexpression elimination, global value numbering, dead code elimination) and all the countless loop opts are not available to you. You'll also spill to memory a lot, because you can't run a decent regalloc in a single pass.
I'm actually a big fan of function-by-function dual-pass compilation. You generate IR from the parser in one pass and do codegen right after. Most intermediate state is thrown out (including the AST, for non-polymorphic functions) and you move on to the next function. This gives you an extremely fast, data-oriented baseline compiler with reasonable codegen (much better than something like tcc).
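A toy illustration of that function-by-function shape, with a pre-parsed tuple AST standing in for the parser's output and a made-up stack-machine "assembly" as the target; the point is the per-function loop that drops the IR before moving on (and picks up a little constant folding along the way):

```python
def compile_module(functions):
    """Dual-pass, function-at-a-time compilation: lower one function to IR,
    run only cheap local cleanups, emit code, then discard all intermediate
    state before moving on to the next function."""
    output = []
    for name, body in functions:          # body: expression AST as nested tuples
        ir = lower(body, [])              # pass 1: AST -> linear IR
        ir = fold_constants(ir)           # cheap, local optimization only
        output.append(emit(name, ir))     # pass 2: IR -> "assembly" text
        # ir (and the AST) go out of scope here; nothing is kept per function.
    return "\n".join(output)

def lower(expr, ir):
    """Lower a nested-tuple expression like ("+", 1, ("*", 2, 3)) to stack ops."""
    if isinstance(expr, int):
        ir.append(("push", expr))
    else:
        op, lhs, rhs = expr
        lower(lhs, ir)
        lower(rhs, ir)
        ir.append(("op", op))
    return ir

def fold_constants(ir):
    """If an op's two operands are constant pushes, compute it at compile time."""
    out = []
    for instr in ir:
        if (instr[0] == "op" and len(out) >= 2
                and out[-1][0] == "push" and out[-2][0] == "push"):
            b, a = out.pop()[1], out.pop()[1]
            out.append(("push", a + b if instr[1] == "+" else a * b))
        else:
            out.append(instr)
    return out

def emit(name, ir):
    lines = [f"{name}:"]
    for kind, arg in ir:
        lines.append(f"    PUSH {arg}" if kind == "push"
                     else "    " + {"+": "ADD", "*": "MUL"}[arg])
    lines.append("    RET")
    return "\n".join(lines)

print(compile_module([("six", ("*", 2, 3)),
                      ("answer", ("+", ("*", 6, 7), 0))]))
```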
It is more specific: it means emitting code as you go along through the source file.
A single-pass compiler can still split the various phases and only do the code generation in the last phase.
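To make "emitting code as you go" concrete, here's a toy syntax-directed sketch in Python: a recursive-descent expression parser that prints made-up stack-machine instructions the moment it recognizes each construct, with no AST or IR in between:

```python
def compile_expr(tokens):
    """Grammar: expr := term (('+'|'-') term)*, term := NUM (('*'|'/') NUM)*.
    Code for each operand is emitted before its operator, so the output comes
    out in valid postfix / stack-machine order."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        print(f"    PUSH {advance()}")      # emit immediately, no AST node built

    def term():
        factor()
        while peek() in ("*", "/"):
            op = advance()
            factor()
            print("    MUL" if op == "*" else "    DIV")

    def expr():
        term()
        while peek() in ("+", "-"):
            op = advance()
            term()
            print("    ADD" if op == "+" else "    SUB")

    expr()

compile_expr(["1", "+", "2", "*", "3"])
# Output:
#     PUSH 1
#     PUSH 2
#     PUSH 3
#     MUL
#     ADD
```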
Shameless plug: http://cwerg.org
Pros:
* uses Python and recursive descent parsing
* separates front and backend via an IR
* generates ELF binaries (either x86 or ARM)
* meant for real world use
Cons:
* more complex
* not written in a tutorial style
Following similar reasoning, I wound up writing a tiny-optimizing-compiler tutorial that only explains how to write the middle and back end of a compiler: https://github.com/bollu/tiny-optimising-compiler
For a modern compiler and a more direct approach, I recommend https://www.cs.cornell.edu/~asampson/blog/llvm.html
LLVM makes it so much easier to build a compiler - it's not even funny. Whenever I use it, I feel like I'm just arranging some rocks on top of a pyramid.
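As a taste of how little is left to do, here's a small sketch using the llvmlite Python bindings (assuming llvmlite is installed) that builds the IR for a trivial add function and leaves everything downstream to LLVM:

```python
from llvmlite import ir

# Build LLVM IR for: int add(int a, int b) { return a + b; }
i32 = ir.IntType(32)
module = ir.Module(name="demo")
func = ir.Function(module, ir.FunctionType(i32, [i32, i32]), name="add")
block = func.append_basic_block(name="entry")
builder = ir.IRBuilder(block)
a, b = func.args
builder.ret(builder.add(a, b, name="sum"))

# Print the textual IR; LLVM's optimizer and backends do the heavy lifting from here
# (e.g. feed the IR to llc, or JIT it via llvmlite's binding layer).
print(module)
```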
A trend started with tools like the Amsterdam Compiler Kit; LLVM happens to be the more famous one.
https://en.wikipedia.org/wiki/Amsterdam_Compiler_Kit
This brings back memories; I got to learn from Jack Crenshaw's tutorial via the comp.compilers newsgroup on USENET.
Which, by the way, is still active: https://compilers.iecc.com/index.phtml
https://t3x.org has several examples of creating a C compiler, a Lisp, and several more.
The T3X language is very Pascal-like and fun to use (and portable: DOS/Win/Unix/CP/M...).
Also, as an intro, with Zenlisp you can get a mini CS-101 a la SICP or CACS but simpler and explained in a much easier way.
> Rather than getting stuck in front-end minutiae, the tutorial goes straight to generating working assembly code, from very early on
Good summary.
I had no background in compilers or related theory, but I read Jack Crenshaw's Let's Build a Compiler tutorials some time ago. My main takeaway from reading half a dozen or so of them was that building a simple compiler for a toy language was a small project well within my grasp and ability, not a huge undertaking that required mastery of esoteric prerequisites or a large amount of planning.
I got a lot of enjoyment messing about with toy compiler projects related to Brainfuck.
Why Brainfuck? It's a beautiful little toy language. Brainfuck has 8 instructions, each instruction is 1 character, so parsing reduces to getting a char and switching on it. I guess it depends on what you want to explore. If you want to focus on writing recursive descent parsers, not the best choice!
One initial project could be to compile (transpile) from Brainfuck source to C source. You can do this as a source-to-source compiler without any internal representation, by transforming each Brainfuck operation into a corresponding C statement. Brainfuck is specified in terms of a single fixed-length array of bytes and a pointer - an index into that array - that can be moved around, plus basic manipulations of the byte it is pointing at. So on the C side you need two variables: one for the array and a second, an index, for the pointer.
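A minimal sketch of such a Brainfuck-to-C transpiler in Python; the operation-to-statement mapping is the obvious one, and the names are mine:

```python
# Map each Brainfuck operation to a corresponding C statement.
# tape is the fixed-length byte array, p the pointer (index) into it.
BF_TO_C = {
    ">": "p++;",
    "<": "p--;",
    "+": "tape[p]++;",
    "-": "tape[p]--;",
    ".": "putchar(tape[p]);",
    ",": "tape[p] = getchar();",
    "[": "while (tape[p]) {",
    "]": "}",
}

def bf_to_c(source):
    lines = [
        "#include <stdio.h>",
        "unsigned char tape[30000];",
        "int p = 0;",
        "int main(void) {",
    ]
    for op in source:
        if op in BF_TO_C:          # every other character is a comment in Brainfuck
            lines.append("    " + BF_TO_C[op])
    lines.append("    return 0;")
    lines.append("}")
    return "\n".join(lines)

# '++++++++[>++++++++<-]>+.' prints 'A' (ASCII 65)
print(bf_to_c("++++++++[>++++++++<-]>+."))
```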
A second project could be compiling from Brainfuck to assembly language, skipping C. You'd need to read a few tutorials/reference docs about your chosen assembly language and learn how to run the assembler to compile tiny assembly programs into native executables. You could explore some examples of what output assembly programs you get when you compile small Brainfuck programs to C and then compile those C programs to assembly. You could write a direct source to source compiler without an internal representation, where each Brainfuck operation is directly mapped to a snippet of assembly instructions. Once you've got this working, you can compile a Brainfuck program into an assembly program, and then use the usual toolchain to assemble that into a native executable and run it.
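And a sketch of the assembly version of the same idea, mapping each Brainfuck operation to a snippet of x86-64 Linux assembly (NASM syntax assumed here); loops need a small label stack, which is the only real complication:

```python
def bf_to_nasm(source):
    out = [
        "global _start",
        "section .bss",
        "tape: resb 30000",
        "section .text",
        "_start:",
        "    mov rbx, tape",                                 # rbx is the tape pointer
    ]
    labels = []                                              # loop ids for matching [ and ]
    counter = 0
    for op in source:
        if op == ">":
            out.append("    inc rbx")
        elif op == "<":
            out.append("    dec rbx")
        elif op == "+":
            out.append("    inc byte [rbx]")
        elif op == "-":
            out.append("    dec byte [rbx]")
        elif op == ".":
            out += ["    mov rax, 1", "    mov rdi, 1",      # write(1, rbx, 1)
                    "    mov rsi, rbx", "    mov rdx, 1", "    syscall"]
        elif op == ",":
            out += ["    mov rax, 0", "    mov rdi, 0",      # read(0, rbx, 1)
                    "    mov rsi, rbx", "    mov rdx, 1", "    syscall"]
        elif op == "[":
            counter += 1
            labels.append(counter)
            out += [f"loop_{counter}:",
                    "    cmp byte [rbx], 0",
                    f"    je end_{counter}"]
        elif op == "]":
            n = labels.pop()
            out += [f"    jmp loop_{n}", f"end_{n}:"]
    out += ["    mov rax, 60", "    xor rdi, rdi", "    syscall"]   # exit(0)
    return "\n".join(out)

# Assemble and link the output with: nasm -felf64 out.asm && ld out.o
print(bf_to_nasm("++++++++[>++++++++<-]>+."))
```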
There's also lots of projects in another direction, treating Brainfuck as the target language. Imagine that your job is to write Brainfuck programs for a CPU that natively executes Brainfuck. Try writing a few tiny Brainfuck programs by hand and savour how trying to do almost anything involves solving horrible little puzzles. Maybe it'd be much easier to do your job if you, the Brainfuck programmer, didn't have to manually track which index of the array is used to store what. You could invent a higher level language supporting concepts like local variables, where you could add two local variables together and store the results in a third local variable! Maybe you could allow the programmer to define and call their own functions! Maybe you could support `if` blocks, comparisons! You could have a compiler that manages the book-keeping of memory allocation and mapping complex high level abstractions such as integer addition into native Brainfuck concepts of adding one to things and moving left or right. Projects in this direction let you explore more stuff about parsers (the input syntax for your higher level language is richer), internal representations, scopes and so on.
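As a flavor of that last direction, here's a tiny sketch of the book-keeping such a compiler would do: given a mapping from named variables to tape cells, it lowers "add a constant to a variable" and "print a variable" into pointer moves, runs of +/- and dots (everything here is hypothetical, just to show the idea):

```python
class BrainfuckEmitter:
    """Tracks where the tape pointer is and which cell each variable lives in,
    so higher-level operations can be lowered to <, >, + and -."""

    def __init__(self, variables):
        # e.g. {"x": 0, "y": 1}
        self.cell_of = {name: i for i, name in enumerate(variables)}
        self.pointer = 0
        self.code = []

    def move_to(self, name):
        target = self.cell_of[name]
        delta = target - self.pointer
        self.code.append((">" if delta > 0 else "<") * abs(delta))
        self.pointer = target

    def add_const(self, name, value):
        """Lower 'name += value' into pointer moves plus a run of + or -."""
        self.move_to(name)
        self.code.append(("+" if value > 0 else "-") * abs(value))

    def output(self, name):
        self.move_to(name)
        self.code.append(".")

    def program(self):
        return "".join(self.code)

emitter = BrainfuckEmitter(["x", "y"])
emitter.add_const("x", 72)    # x = 72 ('H')
emitter.add_const("y", 105)   # y = 105 ('i')
emitter.output("x")
emitter.output("y")
print(emitter.program())      # a Brainfuck program that prints "Hi"
```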
I also enjoyed working with BF for toy compiler projects; here's a series of JIT compilers for BF in increasing level of sophistication: https://eli.thegreenplace.net/2017/adventures-in-jit-compila...