WEBVTT 1 00:00:00.160 --> 00:00:03.960 Welcome to the deep dive. Today, we're tackling something fundamental 2 00:00:04.240 --> 00:00:07.799 yet often unseen in the world of coding: the compiler. 3 00:00:08.320 --> 00:00:12.359 Think of it as the architect behind the scenes, taking 4 00:00:12.400 --> 00:00:15.919 your human-readable instructions and blueprinting them into the precise 5 00:00:16.039 --> 00:00:18.679 machine language your computer understands. Precisely. 6 00:00:19.039 --> 00:00:21.199 And for this deep dive, we're not just talking about 7 00:00:21.199 --> 00:00:24.760 compilers conceptually. We're getting into the initial stages of building 8 00:00:24.760 --> 00:00:28.359 our own. Our mission is to really dissect the core 9 00:00:28.440 --> 00:00:32.560 transformations required to compile even the most, well, the simplest 10 00:00:32.600 --> 00:00:33.960 C programs. 11 00:00:33.679 --> 00:00:37.039 And to guide this architectural exploration, we'll be using excerpts 12 00:00:37.039 --> 00:00:40.439 from Writing a C Compiler as our foundational text. It's 13 00:00:40.479 --> 00:00:43.240 our guide as we uncover how source code undergoes its 14 00:00:43.280 --> 00:00:47.600 initial metamorphosis into executable form. Okay, let's unpack this. Let's 15 00:00:47.640 --> 00:00:51.119 do it. Our compiler construction starts with establishing a modular 16 00:00:51.159 --> 00:00:53.039 pipeline of four key stages. 17 00:00:53.399 --> 00:00:55.920 That's right. Yeah, even for the simplest stuff, like maybe 18 00:00:55.960 --> 00:00:59.840 not even Hello World, maybe just returning a number, we'll 19 00:00:59.840 --> 00:01:02.000 be setting up a four-pass structure. The first of 20 00:01:01.960 --> 00:01:04.760 these is the lexer, or tokenizer. 21 00:01:04.280 --> 00:01:05.920 Right, or tokenizer. Yeah, same thing.
22 00:01:06.200 --> 00:01:09.159 So the lexer's fundamental task is to scan our C 23 00:01:09.319 --> 00:01:12.159 code and break it down into its essential components, right? 24 00:01:12.239 --> 00:01:15.400 Like identifying the individual building blocks before you start assembling 25 00:01:15.400 --> 00:01:16.120 them. Exactly. 26 00:01:16.480 --> 00:01:23.280 These smallest meaningful units are the tokens. Think about curly braces 27 00:01:23.599 --> 00:01:27.439 defining scope, the semicolon ending statements, keywords like 28 00:01:27.519 --> 00:01:31.319 int. Right, int, return, and then the identifiers you create 29 00:01:31.359 --> 00:01:34.280 for functions, variables. All of these are distinct tokens. The 30 00:01:34.400 --> 00:01:37.760 lexer reads your code character by character and groups them intelligently. 31 00:01:37.920 --> 00:01:40.680 I see, so a basic line like int main(void) { return 32 00:01:40.760 --> 00:01:44.599 32; } would be segmented into tokens: int, main, return, 33 00:01:44.680 --> 00:01:46.079 32, et cetera. 34 00:01:45.959 --> 00:01:47.439 In exactly that sequence. 35 00:01:47.519 --> 00:01:51.439 Okay, what's the next step after this initial segmentation? 36 00:01:51.159 --> 00:01:55.280 Following tokenization comes the parser. It takes that linear sequence 37 00:01:55.280 --> 00:01:56.439 of tokens, 38 00:01:55.959 --> 00:01:57.879 just a list, basically. Just a list. 39 00:01:57.879 --> 00:02:01.239 And imposes a hierarchical structure. It constructs what's known as 40 00:02:01.239 --> 00:02:04.359 an abstract syntax tree, or AST. 41 00:02:04.920 --> 00:02:06.840 Okay, so instead of just a flat list, we get 42 00:02:06.840 --> 00:02:09.879 a structured representation that shows how these tokens relate to 43 00:02:09.919 --> 00:02:10.319 each other. 44 00:02:10.599 --> 00:02:13.919 Precisely. Think of it as moving from, say, a list 45 00:02:13.960 --> 00:02:15.919 of ingredients to an actual recipe.
46 00:02:15.960 --> 00:02:17.319 Oh, nice analogy. 47 00:02:17.599 --> 00:02:21.000 The AST is this tree-like structure that embodies the 48 00:02:21.039 --> 00:02:25.199 grammatical rules of C and reveals the program's operational flow. 49 00:02:25.879 --> 00:02:29.199 It's a format that lets the compiler analyze and, well, 50 00:02:29.360 --> 00:02:33.199 understand the code's intent much better than just that stream of 51 00:02:33.240 --> 00:02:37.039 tokens. Right, because just having a list of words doesn't 52 00:02:37.080 --> 00:02:39.680 tell you the sentence structure or the meaning. Exactly. So, 53 00:02:39.719 --> 00:02:44.000 for our example, int main(void) { return 32; }, what would 54 00:02:44.039 --> 00:02:46.400 the AST look like in its simplest form? 55 00:02:46.719 --> 00:02:50.759 In a simplified view, the root might be a program node. Okay. 56 00:02:50.919 --> 00:02:53.199 Descending from that, you'd have a function node for main. 57 00:02:53.520 --> 00:02:56.360 Inside the function there'd be a return node, and finally, 58 00:02:56.680 --> 00:03:00.039 connected to return, a constant node holding the value 59 00:02:59.800 --> 00:03:04.439 there. So it visually organizes the program's components and their relationships. 60 00:03:04.560 --> 00:03:07.039 A function contains a return statement, which returns a 61 00:03:07.000 --> 00:03:09.240 constant. Exactly, it captures that structure. 62 00:03:09.360 --> 00:03:12.960 Interesting. Now the compiler has this structured tree. What happens 63 00:03:12.960 --> 00:03:14.120 next in the transformation? 64 00:03:14.479 --> 00:03:17.120 This is where the translation to a lower-level language begins. 65 00:03:17.759 --> 00:03:21.800 The code generation pass takes the AST, the tree we 66 00:03:21.919 --> 00:03:25.280 just built, yes, that tree, and translates it into assembly 67 00:03:25.319 --> 00:03:26.680 language instructions. Assembly.
68 00:03:26.759 --> 00:03:29.080 Okay, that's much closer to the hardware, right? Exactly. 69 00:03:29.280 --> 00:03:32.840 And it's important to understand that at this stage the compiler 70 00:03:33.039 --> 00:03:37.759 isn't directly writing out, like, a human-readable assembly text file. 71 00:03:38.039 --> 00:03:42.840 Oh, okay. It's still manipulating data structures internally. It's creating 72 00:03:42.879 --> 00:03:46.120 an in-memory representation of these assembly instructions. 73 00:03:46.199 --> 00:03:48.800 So the creation of the actual .s file comes later. 74 00:03:49.120 --> 00:03:53.199 Yes, that's the job of the fourth initial pass, code emission. Right. 75 00:03:53.360 --> 00:03:57.400 This pass takes the in-memory assembly representation we just 76 00:03:57.400 --> 00:04:02.039 created and finally writes it out, serializes it into a file, 77 00:04:02.400 --> 00:04:04.240 usually with the .s extension. 78 00:04:03.879 --> 00:04:06.000 And that .s file is what the assembler and linker 79 00:04:06.439 --> 00:04:07.120 use later on. 80 00:04:07.240 --> 00:04:09.240 That's the one. They take that and produce your final 81 00:04:09.319 --> 00:04:10.400 executable program. 82 00:04:10.520 --> 00:04:13.360 It seems like a considerable number of steps for such 83 00:04:13.360 --> 00:04:15.879 a basic program, you know, return 32. Why not 84 00:04:15.919 --> 00:04:17.480 a more direct approach? 85 00:04:17.240 --> 00:04:19.759 That's a really fair question. Again, it might seem like 86 00:04:19.800 --> 00:04:24.519 overkill for these tiny examples, but establishing this multipass architecture 87 00:04:24.560 --> 00:04:28.199 right from the start gives us huge advantages later. By 88 00:04:28.240 --> 00:04:33.000 separating concerns, lexical analysis here, syntax there, codegen over here, 89 00:04:33.360 --> 00:04:39.160 modularity, exactly, modularity, we create a more maintainable system.
Imagine 90 00:04:39.199 --> 00:04:42.160 trying to translate, I don't know, a complex novel straight 91 00:04:42.199 --> 00:04:45.600 into another language without first understanding the grammar and structure. 92 00:04:45.720 --> 00:04:46.680 Yeah, that would be a mess. 93 00:04:46.680 --> 00:04:49.839 It would be incredibly hard. This separation lets us handle 94 00:04:49.839 --> 00:04:52.160 more complexity later without having to rip everything up and 95 00:04:52.160 --> 00:04:55.680 start again. It's really about building a scalable foundation. That 96 00:04:55.600 --> 00:04:58.279 makes a lot of sense, planning for future complexity. Okay, 97 00:04:58.319 --> 00:05:01.160 speaking of assembly language, can we actually peek under the 98 00:05:01.160 --> 00:05:04.639 hood, see what a real compiler like GCC generates for 99 00:05:04.800 --> 00:05:06.199 a simple C program? 100 00:05:06.240 --> 00:05:09.160 That's an excellent idea. It helps ground this discussion. For 101 00:05:09.199 --> 00:05:12.399 a simple program like int main(void) { return 2; }, saved as, 102 00:05:12.439 --> 00:05:14.959 say, return_2.c, you can use GCC with 103 00:05:15.000 --> 00:05:18.279 some specific command-line flags: gcc -S -O 104 00:05:18.360 --> 00:05:22.600 -fno-asynchronous-unwind-tables -fcf-protection=none return_2.c. 105 00:05:22.639 --> 00:05:23.839 Whoa. 106 00:05:24.040 --> 00:05:26.480 Okay, lots of flags there, but the key is 107 00:05:26.639 --> 00:05:27.120 -S. 108 00:05:27.160 --> 00:05:29.040 This is the main one, telling it to stop after 109 00:05:29.079 --> 00:05:32.480 compilation and output assembly. The others just simplify the output 110 00:05:32.519 --> 00:05:33.399 a bit for our purposes. 111 00:05:33.439 --> 00:05:35.199 Got it. And what's the output look like? 112 00:05:35.360 --> 00:05:38.519 This command generates a file, probably return_2.s.
With 113 00:05:38.600 --> 00:05:41.720 the assembly for that C program, you'll likely see something 114 00:05:41.759 --> 00:05:45.879 really simple, like .globl main, main:, movl $2, 115 00:05:45.920 --> 00:05:48.480 %eax, ret. 116 00:05:48.560 --> 00:05:50.839 Okay, that's definitely not C. What's going on 117 00:05:50.800 --> 00:05:53.079 here? Let's break it down. The syntax is AT&T 118 00:05:53.120 --> 00:05:56.439 assembly syntax, common on Linux and macOS. That 119 00:05:56.560 --> 00:05:59.680 first line, .globl main, that starts with a period, right? 120 00:06:00.079 --> 00:06:02.560 That means it's an assembler directive. It's an instruction for 121 00:06:02.600 --> 00:06:06.000 the assembler itself, not the CPU. .globl main just 122 00:06:06.000 --> 00:06:08.040 makes the main symbol visible outside this 123 00:06:08.040 --> 00:06:10.600 file. A symbol, like a label or a name for 124 00:06:10.639 --> 00:06:11.560 a place in memory? 125 00:06:11.639 --> 00:06:14.759 Precisely. Here, main is a symbol representing the starting address 126 00:06:14.759 --> 00:06:17.199 of our main function. The compiler doesn't know the final 127 00:06:17.199 --> 00:06:19.680 address yet. The linker figures that out later. Ah. 128 00:06:19.759 --> 00:06:22.959 The linker's job, resolving symbols. Exactly. 129 00:06:23.079 --> 00:06:27.360 It resolves them, assigns actual memory locations. If code refers 130 00:06:27.399 --> 00:06:30.439 to main, the linker patches in the real address. That's 131 00:06:30.519 --> 00:06:31.399 called relocation. 132 00:06:31.560 --> 00:06:33.639 So main: on the next line is just marking the 133 00:06:33.680 --> 00:06:35.600 spot, the start of the code. Exactly. 134 00:06:35.680 --> 00:06:39.560 It's the label. Then movl $2, %eax. That's 135 00:06:39.560 --> 00:06:40.319 a real instruction. 136 00:06:40.439 --> 00:06:43.199 movl, move long, thirty-two bit.
137 00:06:43.279 --> 00:06:47.360 Yep, thirty-two-bit integer. $2 means the literal value 138 00:06:47.040 --> 00:06:49.720 two, an immediate value. Right, and 139 00:06:49.720 --> 00:06:52.639 %eax is a register, a small, fast storage spot inside 140 00:06:52.680 --> 00:06:55.879 the CPU. So this instruction puts the value two into 141 00:06:55.920 --> 00:06:57.160 the %eax register. 142 00:06:57.279 --> 00:07:00.800 Okay, but why %eax specifically? Convention. 143 00:07:01.199 --> 00:07:04.399 In many of the standard ways functions call each other, the calling conventions, 144 00:07:04.439 --> 00:07:07.439 the %eax register is designated to hold the function's 145 00:07:07.439 --> 00:07:08.079 return value. 146 00:07:08.160 --> 00:07:11.079 Oh, okay. So because our C code returns two, we put 147 00:07:10.879 --> 00:07:13.160 two in %eax so whoever called main can find 148 00:07:13.160 --> 00:07:13.879 the result 149 00:07:13.600 --> 00:07:16.480 there. Makes sense. And the last line, ret? 150 00:07:16.360 --> 00:07:18.279 Just means return. Tells the CPU to go back to 151 00:07:18.319 --> 00:07:20.600 where main was called from. So yeah, those four lines 152 00:07:20.639 --> 00:07:23.399 are the complete assembly for a tiny C program. 153 00:07:23.480 --> 00:07:24.879 That's surprisingly direct. 154 00:07:25.319 --> 00:07:25.639 Cool. 155 00:07:26.120 --> 00:07:28.120 So when we compile a C program, even with our 156 00:07:28.120 --> 00:07:31.600 own simple compiler, what's the typical sequence of operations overall? 157 00:07:31.879 --> 00:07:34.439 Right. So while our first compiler focuses mainly on that 158 00:07:34.680 --> 00:07:36.199 compilation-to-assembly 159 00:07:35.839 --> 00:07:38.879 step, step two in the usual process. Yeah, the standard 160 00:07:38.480 --> 00:07:42.519 C process has a few phases. First, there's preprocessing. 161 00:07:42.079 --> 00:07:44.680 Handling #include and macros and stuff.
162 00:07:44.720 --> 00:07:48.160 Exactly. Commands like gcc -E do this. It often outputs a 163 00:07:48.199 --> 00:07:48.800 .i file. 164 00:07:49.040 --> 00:07:53.040 Then comes compilation proper, our focus, generating the .s assembly 165 00:07:53.079 --> 00:07:53.800 file. Correct. 166 00:07:53.879 --> 00:07:56.240 Then in a full setup you have assembly and linking, 167 00:07:56.959 --> 00:07:59.600 usually just gcc ASSEMBLY_FILE -o OUTPUT_FILE. 168 00:08:00.279 --> 00:08:04.000 Takes the .s file, makes machine code, links libraries, 169 00:08:03.560 --> 00:08:06.439 and gives you the final executable. Right. Our initial compiler 170 00:08:06.480 --> 00:08:08.560 will sort of stub out that last step, relying on 171 00:08:08.600 --> 00:08:10.560 the system's assembler and linker. Gotcha. 172 00:08:10.879 --> 00:08:13.160 And for our own compiler driver, the command-line tool 173 00:08:13.199 --> 00:08:14.439 we're building, how should 174 00:08:14.199 --> 00:08:17.160 that behave? Good question. It should take the path to 175 00:08:17.199 --> 00:08:19.800 a C source file, like ./YOUR_COMPILER /path/to/program.c. 176 00:08:19.920 --> 00:08:23.040 If it works, success, it should create an 177 00:08:23.040 --> 00:08:26.199 executable in the same directory, same name, but no .c. 178 00:08:26.480 --> 00:08:29.439 So /path/to/program, and exit with code zero. And 179 00:08:29.480 --> 00:08:32.720 if it fails, non-zero exit code and, crucially, no 180 00:08:32.799 --> 00:08:36.519 output files, no executable, clean failure. 181 00:08:36.360 --> 00:08:39.559 Clear rules. And I saw mentions of --lex and --parse 182 00:08:39.559 --> 00:08:40.559 options in the notes. 183 00:08:40.759 --> 00:08:43.120 Uh, yeah, those are mostly for testing and debugging our 184 00:08:43.120 --> 00:08:46.240 compiler. --lex means just run the lexer 185 00:08:45.919 --> 00:08:48.279 and stop. Check tokenizing. Right, and
186 00:08:48.399 --> 00:08:52.399 --parse runs the lexer and parser, builds the AST, then stops. 187 00:08:52.840 --> 00:08:55.039 Neither should create any output files. They just check those 188 00:08:55.039 --> 00:08:55.960 stages internally. 189 00:08:56.039 --> 00:08:59.039 Okay, that makes sense for development. All right, we've got 190 00:08:59.039 --> 00:09:01.840 a solid high-level picture: the four passes, the standard 191 00:09:01.840 --> 00:09:06.240 GCC flow. Let's dive deeper into the lexer and parser 192 00:09:06.279 --> 00:09:08.720 now. They're the first big hurdle in building our own, 193 00:09:08.799 --> 00:09:09.600 right? Absolutely. 194 00:09:09.919 --> 00:09:13.679 Chapter two of the guide digs into building these, starting 195 00:09:13.679 --> 00:09:16.799 with the lexer. As we said, its job is finding tokens, 196 00:09:17.480 --> 00:09:19.879 and one simplifying assumption we make early on is that 197 00:09:19.919 --> 00:09:22.879 our C files only use ASCII characters. 198 00:09:22.519 --> 00:09:25.559 Just standard ASCII for now. Sensible starting point. How 199 00:09:25.559 --> 00:09:27.720 do we actually test if our lexer is doing the 200 00:09:27.799 --> 00:09:28.159 right thing? 201 00:09:28.320 --> 00:09:31.240 The guide provides a test_compiler tool, which is super helpful. 202 00:09:31.360 --> 00:09:33.960 It comes with test programs. In tests/chapter_2, you'll 203 00:09:33.960 --> 00:09:35.679 find directories like invalid_lex, 204 00:09:35.320 --> 00:09:37.559 programs that should fail the lexer. 205 00:09:37.480 --> 00:09:41.320 Exactly, bad tokens, weird characters, and then invalid_parse and 206 00:09:41.399 --> 00:09:45.519 valid directories for later stages. You test the lexer specifically 207 00:09:46.039 --> 00:09:50.120 using ./test_compiler /path/to/your_compiler --chapter 2 --stage lex.
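The driver rules described a little earlier (take a source path, optionally stop after lexing or parsing, exit zero on success and non-zero on failure) could be sketched roughly as follows. This is only an illustration in Python: the names `drive` and `output_path`, and the `compile_fn` callback standing in for the real passes, are assumptions and not the guide's code.

```python
import argparse
import os
import sys


def output_path(source_path):
    """Executable path: same directory and name as the source, minus the extension."""
    root, _ext = os.path.splitext(source_path)
    return root


def drive(argv, compile_fn):
    """Run a caller-supplied (hypothetical) compile_fn and return an exit code."""
    parser = argparse.ArgumentParser(prog="your_compiler")
    parser.add_argument("source", help="path to a C source file")
    stage = parser.add_mutually_exclusive_group()
    stage.add_argument("--lex", action="store_true", help="run the lexer, then stop")
    stage.add_argument("--parse", action="store_true", help="run lexer and parser, then stop")
    args = parser.parse_args(argv)

    # None means "run every pass"; otherwise stop after the named stage.
    stop_after = "lex" if args.lex else "parse" if args.parse else None
    try:
        compile_fn(args.source, output_path(args.source), stop_after)
    except Exception as err:
        # Any failure becomes a non-zero exit code for the caller.
        print(f"error: {err}", file=sys.stderr)
        return 1
    return 0
```

A real entry point would probably end with something like `sys.exit(drive(sys.argv[1:], my_compile))`, where `my_compile` is whatever function runs the passes and takes care not to leave output files behind on failure.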
208 00:09:50.000 --> 00:09:53.159 Okay, so that command runs our compiler in lex-only mode 209 00:09:53.240 --> 00:09:56.440 against those test cases and checks if it accepts or 210 00:09:56.480 --> 00:09:57.519 rejects them correctly. 211 00:09:57.600 --> 00:10:00.399 That's the idea. It verifies the pass/fail behavior. It 212 00:10:00.440 --> 00:10:03.600 doesn't necessarily check the exact stream of tokens for the 213 00:10:03.639 --> 00:10:04.759 valid files, 214 00:10:04.399 --> 00:10:07.159 though. Ah, so for that level of detail, we'd need 215 00:10:07.200 --> 00:10:08.240 our own unit tests. 216 00:10:08.480 --> 00:10:11.759 Precisely. You'd write tests to feed it valid code and 217 00:10:11.799 --> 00:10:14.360 assert that the token list matches exactly what you expect, 218 00:10:14.639 --> 00:10:17.159 and feed it invalid code to check the error messages. 219 00:10:17.279 --> 00:10:20.960 Got it. Any key implementation tips for the lexer itself? Yes, 220 00:10:20.879 --> 00:10:23.360 a couple of important ones. First, when you see something 221 00:10:23.360 --> 00:10:25.679 that looks like an identifier, a sequence of letters, numbers, 222 00:10:25.759 --> 00:10:29.519 underscores, like main or my_variable, right, your logic, maybe 223 00:10:29.559 --> 00:10:32.799 your regex, will probably also match keywords like int or return. 224 00:10:33.399 --> 00:10:37.320 The efficient way is to first recognize it as a generic identifier, 225 00:10:37.799 --> 00:10:40.000 then check if that identifier happens to be on the 226 00:10:40.039 --> 00:10:41.320 list of reserved keywords. 227 00:10:41.559 --> 00:10:45.799 Ah, don't try to make the initial pattern distinguish them. Identify, then 228 00:10:45.720 --> 00:10:49.639 classify. Exactly, two steps. The other thing is, don't rely 229 00:10:49.759 --> 00:10:52.720 only on whitespace to split tokens. Oh, right. Think 230 00:10:52.720 --> 00:10:55.879 about main().
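Those two lexer tips (match a generic identifier first and only then check it against the keyword list, and scan character by character instead of splitting on whitespace) might look something like this in Python. It is a minimal sketch under assumptions: the token kinds, the `KEYWORDS` set, and the `(kind, text)` tuple shape are illustrative choices, not the guide's definitions.

```python
import re

# Assumed keyword list for the tiny C subset discussed here.
KEYWORDS = {"int", "void", "return"}

# Patterns are tried in order at the current position; re.ASCII matches the
# ASCII-only simplification mentioned in the discussion.
TOKEN_PATTERNS = [
    ("identifier", re.compile(r"[A-Za-z_]\w*\b", re.ASCII)),
    ("constant", re.compile(r"[0-9]+\b", re.ASCII)),
    ("punct", re.compile(r"[(){};]")),
]


def tokenize(source):
    """Return a list of (kind, text) tuples, or raise SyntaxError."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1  # whitespace separates tokens but is not required between them
            continue
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(source, pos)
            if match:
                text = match.group(0)
                # Step two of "identify, then classify": keywords are just
                # identifiers that appear on the reserved list.
                if kind == "identifier" and text in KEYWORDS:
                    kind = "keyword"
                tokens.append((kind, text))
                pos = match.end()
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
    return tokens
```

Because the scanner advances from the end of each match, input such as `main(void){return 32;}` is split correctly even though it contains almost no whitespace.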
That's three tokens, main, (, and ), with no whitespace 231 00:10:55.919 --> 00:10:59.240 separating them. If you just split on spaces, you'd get 232 00:10:59.279 --> 00:10:59.559 it wrong. 233 00:10:59.799 --> 00:11:03.399 Good point. Okay, so the lexer spits out tokens. Then the 234 00:11:03.440 --> 00:11:06.120 parser steps in to build the AST. Exactly. 235 00:11:06.480 --> 00:11:09.960 The parser takes that flat stream and gives it structure, 236 00:11:10.159 --> 00:11:13.679 hierarchy, based on the C grammar. The AST is the 237 00:11:13.759 --> 00:11:15.480 data structure holding that hierarchy. 238 00:11:15.879 --> 00:11:18.879 We saw the simple AST for return 32. What 239 00:11:18.960 --> 00:11:22.440 about something slightly more complex, like an if statement? How 240 00:11:22.440 --> 00:11:23.559 does the hierarchy show up there? 241 00:11:23.600 --> 00:11:27.000 Okay, good example. Let's say you have if (a < b) return 242 00:11:27.039 --> 00:11:29.600 2 + 2. Right. The top AST node might be 243 00:11:29.639 --> 00:11:32.159 an if node. This if node would have, say, two 244 00:11:32.200 --> 00:11:34.279 main children, one for the condition, 245 00:11:34.080 --> 00:11:36.159 a < b, which itself might be structured. 246 00:11:36.200 --> 00:11:38.480 Oh yeah, that condition could be a binary op node 247 00:11:38.519 --> 00:11:40.879 for <, with its own children for the variable a and 248 00:11:40.919 --> 00:11:41.440 the variable b. 249 00:11:41.559 --> 00:11:43.679 Okay, and the other child of the if? 250 00:11:43.720 --> 00:11:46.399 That would be the then block, return 2 + 2. 251 00:11:46.840 --> 00:11:48.960 That could be a return node, and its child would 252 00:11:48.960 --> 00:11:52.080 be another binary op node for the plus, with two 253 00:11:52.159 --> 00:11:54.399 constant children, both holding 2. Wow.
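To make that tree shape concrete, here is one hedged way to build and print it in Python. The node classes (`If`, `BinaryOp`, `Var`, `Constant`, `Return`), the `dump` helper, and the choice of `<` as the comparison operator are all assumptions made for illustration; the guide's own node definitions may differ.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Var:
    name: str


@dataclass
class Constant:
    value: int


@dataclass
class BinaryOp:
    op: str
    left: object
    right: object


@dataclass
class Return:
    exp: object


@dataclass
class If:
    condition: object
    then: object


def dump(node, indent=0):
    """Return the tree as a list of indented lines, one per node or leaf value."""
    pad = "  " * indent
    if not is_dataclass(node):
        return [f"{pad}{node!r}"]  # leaf value such as 'a' or 2
    lines = [f"{pad}{type(node).__name__}"]
    for field in fields(node):
        lines += dump(getattr(node, field.name), indent + 1)
    return lines


# An if node with two children: the condition subtree and the "then" subtree.
tree = If(
    condition=BinaryOp("<", Var("a"), Var("b")),
    then=Return(BinaryOp("+", Constant(2), Constant(2))),
)
print("\n".join(dump(tree)))
```

Printing the tree with indentation is a small version of the pretty printer idea: each level of nesting in the output corresponds to one level of nesting in the source.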
254 00:11:54.440 --> 00:11:57.639 Okay, so the tree really mirrors the nesting and the 255 00:11:57.679 --> 00:12:00.679 logic: if, condition, then. And the condition and the then part 256 00:12:00.720 --> 00:12:02.279 have their own little subtrees. 257 00:12:02.360 --> 00:12:04.879 Precisely. It captures that structure directly, which is what the 258 00:12:04.879 --> 00:12:08.600 next stages need. Now, to define these AST structures formally 259 00:12:08.679 --> 00:12:12.759 and, importantly, in a language-neutral way, the guide introduces 260 00:12:12.759 --> 00:12:14.320 something called ASDL. 261 00:12:13.919 --> 00:12:16.879 ASDL, Zephyr Abstract Syntax Description Language? 262 00:12:16.879 --> 00:12:18.679 That's the one. It's just a formal way to write 263 00:12:18.679 --> 00:12:20.159 down what our AST nodes look like. 264 00:12:20.279 --> 00:12:22.960 Okay, so what does the ASDL look like for our 265 00:12:23.240 --> 00:12:25.519 super simple C subset in chapter two? 266 00:12:25.840 --> 00:12:30.759 It's pretty minimal. It's like program = Program(function_definition), function_definition = Function( 267 00:12:30.799 --> 00:12:33.399 identifier name, statement body), statement = Return(exp), exp = Constant(int). 268 00:12:34.000 --> 00:12:37.039 Okay, let's decode that. A program is just a Program 269 00:12:37.080 --> 00:12:40.240 node containing one function definition. Yep. A function definition is 270 00:12:40.279 --> 00:12:43.360 a Function node. It has a name, which is an 271 00:12:43.360 --> 00:12:46.879 identifier type, and a body, which is a statement type. 272 00:12:47.120 --> 00:12:49.759 Right, and those words, name and body, are just field 273 00:12:49.840 --> 00:12:51.000 names, helpful labels. 274 00:12:51.039 --> 00:12:54.279 Gotcha. Then a statement can only be a Return node 275 00:12:54.360 --> 00:12:57.080 containing an exp, an expression, for now. Yes. And an exp 276 00:12:57.159 --> 00:12:59.720 can only be a Constant node holding an int.
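One possible Python rendering of that ASDL, offered as a sketch and not as the guide's implementation, uses one dataclass per constructor. Mapping the ASDL `identifier` type to `str` and `int` to Python's `int` is an assumption made here for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Constant:          # exp = Constant(int)
    value: int


@dataclass
class Return:            # statement = Return(exp)
    exp: Constant


@dataclass
class Function:          # function_definition = Function(identifier name, statement body)
    name: str
    body: Return


@dataclass
class Program:           # program = Program(function_definition)
    function_definition: Function


# The AST for a main function whose only statement returns 32.
tree = Program(Function(name="main", body=Return(Constant(32))))
print(tree)
```

The field names `name` and `body` come straight from the ASDL; the remaining field names (`value`, `exp`, `function_definition`) are illustrative labels.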
277 00:13:00.159 --> 00:13:03.360 That's it for chapter two. identifier and int are like 278 00:13:03.440 --> 00:13:05.000 built-in ASDL types. 279 00:13:05.480 --> 00:13:08.720 So when we implement this, say in Python or Rust 280 00:13:08.840 --> 00:13:12.639 or Java, we'll create classes or data types that match 281 00:13:12.720 --> 00:13:14.759 this ASDL structure. Exactly. 282 00:13:15.120 --> 00:13:18.759 Functional languages might use algebraic data types. OOP languages might 283 00:13:18.840 --> 00:13:22.200 use abstract classes and inheritance. The guide mentions some idioms 284 00:13:22.200 --> 00:13:23.559 and points to more reading if you want to go 285 00:13:23.600 --> 00:13:25.120 deeper into implementation strategies. 286 00:13:25.159 --> 00:13:28.919 Okay, so the ASDL defines the structure, but it doesn't 287 00:13:28.960 --> 00:13:31.879 tell the parser which tokens, in what order, make up, 288 00:13:32.000 --> 00:13:34.879 say, a function definition, right? It doesn't mention the int 289 00:13:34.960 --> 00:13:37.120 keyword or the parentheses or braces. 290 00:13:37.360 --> 00:13:39.519 That is a crucial distinction. You're absolutely right. The AST 291 00:13:39.679 --> 00:13:43.120 is abstract. It leaves out syntactic details like semicolons and braces. 292 00:13:43.519 --> 00:13:46.480 The parser needs a concrete map of the token sequences. 293 00:13:45.960 --> 00:13:47.519 Which is where the formal grammar comes 294 00:13:47.360 --> 00:13:50.759 in. Exactly, using a notation called Backus-Naur form, or 295 00:13:50.840 --> 00:13:52.279 BNF. BNF. 296 00:13:52.559 --> 00:13:56.279 Okay, what's the BNF for this simple C? It mirrors 297 00:13:55.919 --> 00:14:00.879 the ASDL pretty closely: <program> ::= <function>, <function> ::= "int" <identifier> "(" "void" ")" "{" <statement> "}", 298 00:14:00.919 --> 00:14:04.679 <statement> ::= "return" <exp> ";", <exp> ::= <int>, and then it clarifies the 299 00:14:04.759 --> 00:14:09.360 terminals: <identifier> is an identifier token and <int> is a constant
300 00:14:09.039 --> 00:14:11.440 token. Okay, so things in angle brackets 301 00:14:11.519 --> 00:14:14.960 are non-terminals. They correspond to our AST node types. 302 00:14:15.080 --> 00:14:17.200 Yes, grammatical categories. 303 00:14:16.639 --> 00:14:19.720 And things in quotes are terminals, the actual 304 00:14:19.799 --> 00:14:21.039 tokens the lexer gives 305 00:14:20.919 --> 00:14:23.679 us. Exactly, the literal tokens we expect to see. The 306 00:14:23.720 --> 00:14:27.480 BNF spells out the exact sequence: an int token, 307 00:14:27.759 --> 00:14:31.840 then an identifier token, then "(", "void", ")", "{", a statement, and so on. Right, 308 00:14:31.960 --> 00:14:32.240 and the 309 00:14:32.240 --> 00:14:35.200 question-mark definitions are just clarifying what kind of token 310 00:14:35.240 --> 00:14:38.240 <identifier> and <int> refer to. So the BNF is the 311 00:14:38.279 --> 00:14:41.879 parser's rulebook for matching token sequences to build the AST 312 00:14:42.080 --> 00:14:44.159 nodes defined by the ASDL. 313 00:14:43.840 --> 00:14:46.399 You've got it. Perfect summary. The guide also shows how 314 00:14:46.440 --> 00:14:49.519 you'd extend BNF, like adding an if statement rule: 315 00:14:49.799 --> 00:14:53.120 "if" "(" <exp> ")" <statement> [ "else" <statement> ]. The brackets mean the else part 316 00:14:53.159 --> 00:14:54.639 is optional. Neat. Okay. 317 00:14:54.639 --> 00:14:57.600 So we have tokens, the ASDL defining the target AST, 318 00:14:58.360 --> 00:15:01.360 and the BNF grammar as a rulebook. How does 319 00:15:01.360 --> 00:15:04.279 the parser actually do the parsing? What's the technique? 320 00:15:04.519 --> 00:15:08.000 The guide introduces a common technique called recursive descent parsing. 321 00:15:08.120 --> 00:15:10.240 Recursive descent. Sounds intriguing.
322 00:15:10.480 --> 00:15:13.759 The basic idea is simple. For each non-terminal symbol 323 00:15:14.080 --> 00:15:18.639 in the BNF grammar, like program, function, statement, you write 324 00:15:18.799 --> 00:15:20.360 a corresponding parsing function. 325 00:15:20.440 --> 00:15:22.240 Okay, a function for each rule. 326 00:15:22.120 --> 00:15:25.200 Pretty much, and these functions often call each other, mirroring 327 00:15:25.240 --> 00:15:27.679 the structure of the grammar. That's the recursive part. 328 00:15:27.799 --> 00:15:31.320 Ah, okay. So how would parse_statement work based on 329 00:15:31.360 --> 00:15:32.279 our simple grammar? 330 00:15:32.440 --> 00:15:35.759 Well, the rule is <statement> ::= "return" <exp> ";". So the 331 00:15:35.799 --> 00:15:38.720 parse_statement function would first look for a return token. Okay. 332 00:15:38.879 --> 00:15:41.799 If it finds one, it consumes it. Then it needs 333 00:15:41.799 --> 00:15:45.840 to parse an exp, so it would call another function, maybe parse_exp, 334 00:15:45.399 --> 00:15:47.840 which would handle parsing the integer constant in our 335 00:15:47.840 --> 00:15:51.639 case. Right, parse_exp would return the Constant AST node. Then parse_statement 336 00:15:51.639 --> 00:15:54.360 looks for the final ";" token, consumes that, 337 00:15:55.000 --> 00:15:58.039 and if everything worked, it bundles up the Constant node 338 00:15:58.120 --> 00:16:01.919 returned by parse_exp inside a new Return AST node and 339 00:16:01.960 --> 00:16:03.320 returns that. Got it. 340 00:16:03.480 --> 00:16:06.320 The guide showed some pseudocode with an expect helper function. 341 00:16:06.799 --> 00:16:09.480 Yeah, expect is useful. It basically means: check if the 342 00:16:09.519 --> 00:16:12.240 next token is X; consume it if yes, raise an 343 00:16:12.320 --> 00:16:12.879 error if no.
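The walkthrough of parse_statement, parse_exp, and expect above could be sketched in Python roughly like this. It is an approximation of the idea, not the guide's pseudocode: tokens are assumed to be plain strings in a mutable list, and the `Constant` and `Return` classes are illustrative stand-ins for the AST nodes.

```python
from dataclasses import dataclass


@dataclass
class Constant:
    value: int


@dataclass
class Return:
    exp: Constant


def expect(expected, tokens):
    """Consume the next token if it equals `expected`, otherwise raise."""
    if not tokens:
        raise SyntaxError(f"expected {expected!r} but reached end of input")
    actual = tokens.pop(0)
    if actual != expected:
        raise SyntaxError(f"expected {expected!r} but found {actual!r}")


def parse_exp(tokens):
    """<exp> ::= <int>"""
    if not tokens or not tokens[0].isdecimal():
        found = tokens[0] if tokens else "end of input"
        raise SyntaxError(f"expected an integer constant but found {found!r}")
    return Constant(int(tokens.pop(0)))


def parse_statement(tokens):
    """<statement> ::= "return" <exp> ";" """
    expect("return", tokens)
    exp = parse_exp(tokens)   # one parsing function calls another, following the grammar
    expect(";", tokens)
    return Return(exp)


print(parse_statement(["return", "32", ";"]))
```

Each function removes the tokens it recognizes from the front of the list, so whatever remains afterwards is exactly the input that has not been matched yet.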
344 00:16:13.279 --> 00:16:16.000 And these functions consume tokens as they go. So if 345 00:16:16.039 --> 00:16:19.240 parse_program finishes and there are still tokens left over? 346 00:16:19.200 --> 00:16:21.919 That usually means there's extra stuff that doesn't fit the grammar, 347 00:16:22.000 --> 00:16:23.399 a syntax error. Makes sense. 348 00:16:23.559 --> 00:16:26.399 The guide mentioned predictive parsers and backtracking briefly too. 349 00:16:26.679 --> 00:16:29.639 Yeah. For more complex grammars where a rule might have 350 00:16:29.759 --> 00:16:33.840 multiple options, like if versus return for statement, the parser 351 00:16:33.960 --> 00:16:36.039 might need to peek ahead at the next token to 352 00:16:36.080 --> 00:16:40.159 decide which path to take (predictive), or try one path 353 00:16:40.200 --> 00:16:41.399 and backtrack if it fails. 354 00:16:41.960 --> 00:16:45.159 But for our simple start, direct recursive descent works well. 355 00:16:45.240 --> 00:16:47.200 And testing the parser, same tool? 356 00:16:47.080 --> 00:16:50.440 Yep, ./test_compiler /path/to/your_compiler --chapter 2 --stage parse. 357 00:16:50.840 --> 00:16:53.919 It checks against the invalid_parse and valid tests. Again, 358 00:16:54.000 --> 00:16:56.120 writing your own tests to check the structure of the 359 00:16:56.159 --> 00:17:00.200 output AST is super helpful for debugging. And the implementation 360 00:17:00.279 --> 00:17:03.480 tips? Write a pretty printer for the AST, which definitely 361 00:17:03.519 --> 00:17:05.799 helps visualize the tree, and give good error messages. 362 00:17:05.880 --> 00:17:10.200 Crucial. Expected ";" but found "return" on line five, column ten is 363 00:17:10.359 --> 00:17:12.119 way better than just syntax error. 364 00:17:12.240 --> 00:17:16.240 Absolutely. Okay, so source, then lexer, then tokens, 365 00:17:16.279 --> 00:17:20.119 then parser, then C AST. We have the tree. What's next? 366 00:17:20.440 --> 00:17:24.400 Now we hit code generation.
This pass takes that C-language AST, 367 00:17:24.279 --> 00:17:25.680 the one the parser built. 368 00:17:25.519 --> 00:17:30.039 Exactly, and transforms it into our target x64 369 00:17:30.039 --> 00:17:33.640 assembly instructions, but again, not as text yet. We represent 370 00:17:33.680 --> 00:17:35.880 the assembly program as another internal data 371 00:17:35.680 --> 00:17:39.559 structure first. Another AST, an assembly AST? Precisely. 372 00:17:39.759 --> 00:17:41.559 The guide calls it that to keep things clear. It 373 00:17:41.599 --> 00:17:43.319 has its own ASDL definition too. 374 00:17:43.480 --> 00:17:45.839 Okay, what does the assembly ASDL look like? 375 00:17:45.920 --> 00:17:49.039 It's also quite simple for now: program = Program(function_definition), function_definition = Function( 376 00:17:49.160 --> 00:17:54.599 identifier name, instruction* instructions), instruction = Mov(operand src, operand 377 00:17:54.680 --> 00:17:57.960 dst) | Ret, operand = Imm(int) | Register. 378 00:17:58.119 --> 00:18:02.440 Okay, interesting parallels. Program has a function definition. A function 379 00:18:02.480 --> 00:18:04.599 has a name, but instead of a C statement body, 380 00:18:04.640 --> 00:18:06.799 it has instruction*, a list of instructions? 381 00:18:07.119 --> 00:18:09.720 The asterisk means a list or sequence. 382 00:18:09.440 --> 00:18:13.039 And the instruction types are Mov or Ret, and their 383 00:18:13.079 --> 00:18:16.039 operands can be Imm, immediate, a constant, or a Register. 384 00:18:16.400 --> 00:18:18.480 That's it for now. And initially, the only register we 385 00:18:18.519 --> 00:18:21.799 care about is %eax, for the return value. The 386 00:18:21.799 --> 00:18:25.720 code generator walks the C AST, and for each node it 387 00:18:25.799 --> 00:18:28.880 figures out the equivalent assembly instructions and builds up this 388 00:18:28.960 --> 00:18:30.000 assembly AST.
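As a rough illustration of that walk, the Python sketch below converts a tiny C AST into an assembly AST shaped like the ASDL just described. All class and function names here (`AsmProgram`, `AsmFunction`, `gen_program`, and so on) are hypothetical, and `Register` simply stands for %eax since that is the only register in play at this stage.

```python
from dataclasses import dataclass
from typing import List, Union


# --- C AST (input) ---
@dataclass
class Constant:
    value: int

@dataclass
class Return:
    exp: Constant

@dataclass
class Function:
    name: str
    body: Return

@dataclass
class Program:
    function_definition: Function


# --- Assembly AST (output) ---
@dataclass
class Imm:
    value: int

@dataclass
class Register:          # stands for %eax in this sketch
    pass

@dataclass
class Mov:
    src: Union[Imm, Register]
    dst: Union[Imm, Register]

@dataclass
class Ret:
    pass

@dataclass
class AsmFunction:
    name: str
    instructions: List[Union[Mov, Ret]]

@dataclass
class AsmProgram:
    function_definition: AsmFunction


def gen_operand(exp):
    # A C constant becomes an immediate operand.
    return Imm(exp.value)

def gen_instructions(statement):
    # One return statement becomes two instructions: move the value into the
    # return register, then return to the caller.
    return [Mov(gen_operand(statement.exp), Register()), Ret()]

def gen_function(function):
    return AsmFunction(function.name, gen_instructions(function.body))

def gen_program(program):
    return AsmProgram(gen_function(program.function_definition))


asm = gen_program(Program(Function("main", Return(Constant(2)))))
print(asm)
```

Nothing is written to disk here; the result is still an in-memory tree, which is what a later emission pass would turn into text.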
389 00:18:30.359 --> 00:18:33.599 The guide had a table mapping C AST nodes to assembly 390 00:18:33.640 --> 00:18:37.839 AST constructs. Like Return(exp) in C becomes a Mov(exp, Register), 391 00:18:38.559 --> 00:18:40.200 then a Ret in assembly. 392 00:18:39.920 --> 00:18:43.319 Right, and Constant(int) in the C AST becomes Imm(int) in 393 00:18:43.359 --> 00:18:44.359 the assembly AST. 394 00:18:44.559 --> 00:18:46.759 So it's a translation step, building a new tree that 395 00:18:46.799 --> 00:18:48.960 represents the assembly code needed. Exactly. 396 00:18:48.960 --> 00:18:51.400 And you can see how one C statement, return, maps 397 00:18:51.440 --> 00:18:54.759 to two assembly instructions, movl and ret. That becomes more 398 00:18:54.799 --> 00:18:56.039 common as things get complex. 399 00:18:56.200 --> 00:18:59.319 Okay, assembly AST constructed in memory. The final step of 400 00:18:59.319 --> 00:19:01.119 the initial four passes: code emission. 401 00:19:01.160 --> 00:19:03.599