WEBVTT 1 00:00:00.200 --> 00:00:03.720 So if you ask an artificial intelligence to write a 2 00:00:03.839 --> 00:00:06.559 Shakespearean sonnet about, I don't know, a toaster. 3 00:00:06.480 --> 00:00:08.800 Right, it just does it in like three seconds. 4 00:00:08.400 --> 00:00:13.919 Flat, exactly. It's so incredibly fast and honestly so convincingly 5 00:00:14.039 --> 00:00:16.079 human that it's really easy to just throw our hands 6 00:00:16.120 --> 00:00:18.480 up and say, well, the computer is just thinking. 7 00:00:18.359 --> 00:00:19.399 Yeah, that's the illusion. 8 00:00:19.519 --> 00:00:22.640 But here is the secret that the tech world sort 9 00:00:22.640 --> 00:00:27.920 of, you know, glides right past. The AI doesn't actually 10 00:00:27.960 --> 00:00:28.839 know what a toaster is. 11 00:00:29.239 --> 00:00:30.039 No, not at all. 12 00:00:30.079 --> 00:00:32.640 It doesn't know what a poem is. It's not experiencing 13 00:00:32.719 --> 00:00:37.560 this burst of creative genius underneath the hood. It is 14 00:00:37.640 --> 00:00:41.200 really just doing a massive amount of incredibly fast, very 15 00:00:41.240 --> 00:00:42.679 boring accounting. 16 00:00:42.320 --> 00:00:43.920 Which is exactly what we're going to get into. 17 00:00:44.079 --> 00:00:47.679 Right. So today, for you, our listener, we're opening that ledger. 18 00:00:47.679 --> 00:00:50.119 We are taking you on a custom tailored deep dive 19 00:00:50.479 --> 00:00:55.000 to totally demystify how artificial intelligence actually, you know, learns. 20 00:00:55.119 --> 00:00:59.240 Yeah, no magic, no impenetrable labyrinths, just the raw mechanics. 21 00:00:59.479 --> 00:00:59.960 The mechanics. 22 00:01:00.079 --> 00:01:02.880 Because I mean, we appreciate the result of a neural network, right, 23 00:01:03.359 --> 00:01:06.640 but we rarely understand the underlying chemistry of how it actually 24 00:01:06.680 --> 00:01:07.040 got there.
25 00:01:07.159 --> 00:01:09.719 It's totally a black box for most people. Exactly. 26 00:01:10.040 --> 00:01:12.480 So our guide for pulling back the curtain today is 27 00:01:12.519 --> 00:01:16.159 this fantastic book by Seth Weidman. It's called Deep Learning 28 00:01:16.200 --> 00:01:20.439 from Scratch: Building with Python from First Principles. A great resource. 29 00:01:20.599 --> 00:01:22.519 It really is. And what we're going to do is take 30 00:01:22.599 --> 00:01:26.719 all that intimidating jargon, you know, the algorithms, the calculus, 31 00:01:26.760 --> 00:01:29.640 the hyperparameters, scary stuff, all of it, and we're 32 00:01:29.680 --> 00:01:31.480 going to strip it all the way down to the 33 00:01:31.560 --> 00:01:32.959 foundational floorboards. 34 00:01:33.239 --> 00:01:36.599 So we're going to use simple math, visual diagrams, and 35 00:01:36.640 --> 00:01:39.120 some basic code to show you that deep learning is 36 00:01:39.239 --> 00:01:43.200 really just a highly scaled assembly line of very, very 37 00:01:43.239 --> 00:01:44.879 simple mathematical factories. 38 00:01:44.959 --> 00:01:46.079 That's a great way to put it. 39 00:01:46.159 --> 00:01:48.760 But before we actually get to building that assembly line, 40 00:01:48.799 --> 00:01:50.959 we need to talk about why learning this stuff in 41 00:01:51.000 --> 00:01:53.760 the first place is normally such a complete nightmare. Oh, 42 00:01:53.799 --> 00:01:55.680 it really is. Because if you try to read a 43 00:01:55.719 --> 00:01:59.439 standard academic paper on neural networks, it often feels like 44 00:01:59.480 --> 00:02:02.840 you're trying to read ancient Greek. Why is the entry 45 00:02:02.879 --> 00:02:04.000 point so brutal? 46 00:02:04.719 --> 00:02:07.200 Well, Weidman tackles this right out of the gate. He 47 00:02:07.319 --> 00:02:10.159 uses that old parable of the blind men and the elephant. 48 00:02:10.599 --> 00:02:11.520 Oh right, sure.
49 00:02:11.680 --> 00:02:14.199 So you have a group of blind men who encounter 50 00:02:14.240 --> 00:02:17.439 an elephant for the first time. One touches the trunk 51 00:02:17.840 --> 00:02:20.360 and says, oh, an elephant is like a thick snake. Right. 52 00:02:20.439 --> 00:02:22.360 Another touches the ear and says, no, it's a fan. 53 00:02:22.680 --> 00:02:24.759 Another grabs the leg and says it's a tree trunk. 54 00:02:25.120 --> 00:02:30.319 And they're all kind of right but also completely wrong. Exactly. 55 00:02:30.759 --> 00:02:35.319 They are all describing a correct, isolated part, but none 56 00:02:35.360 --> 00:02:38.400 of them are describing the whole animal. And deep learning 57 00:02:38.439 --> 00:02:41.120 resources have historically done the exact same thing. 58 00:02:41.240 --> 00:02:43.400 Okay, that makes a lot of sense because, like, if 59 00:02:43.400 --> 00:02:46.439 you want to learn a standard computer science concept, say 60 00:02:46.800 --> 00:02:50.080 how a search algorithm works, the resources out there are 61 00:02:50.159 --> 00:02:53.240 usually holistic. Like, a good textbook gives you a plain 62 00:02:53.319 --> 00:02:56.879 English explanation, then they give you a whiteboard diagram, then 63 00:02:56.919 --> 00:02:59.120 the math, and finally the pseudocode so you can 64 00:02:59.120 --> 00:03:01.400 actually build it. You get the whole elephant. Right, you 65 00:03:01.439 --> 00:03:05.080 get the whole elephant. But AI resources fracture this, don't they? 66 00:03:05.039 --> 00:03:09.599 Completely. The field sort of fractured into two really extreme camps. 67 00:03:09.919 --> 00:03:15.039 On one side, you have these highly conceptual, incredibly dense 68 00:03:15.199 --> 00:03:18.360 math textbooks. The ancient Greek. Exactly. 69 00:03:18.719 --> 00:03:22.800 Weidman points to Ian Goodfellow's famous Deep Learning book, and look, 70 00:03:22.840 --> 00:03:26.840 it's an absolute masterpiece.
Sure, but if you aren't already 71 00:03:26.879 --> 00:03:30.800 fluent in advanced calculus and linear algebra, you're going to 72 00:03:30.879 --> 00:03:33.120 hit a brick wall on like page ten. It's just 73 00:03:33.159 --> 00:03:34.879 a sea of abstract equations. 74 00:03:34.960 --> 00:03:37.960 So what's the other extreme then? Because if I don't 75 00:03:37.960 --> 00:03:40.280 want to drown in calculus, where do people usually go? 76 00:03:40.439 --> 00:03:44.960 Well, they go to the highly practical, code heavy tutorials. 77 00:03:45.439 --> 00:03:47.599 So you might look up the documentation for a modern 78 00:03:47.680 --> 00:03:51.439 library like PyTorch, okay, and you just copy a block 79 00:03:51.439 --> 00:03:53.800 of Python code, you paste it, you run it, and 80 00:03:53.840 --> 00:03:55.960 you watch this number on your screen called the loss 81 00:03:56.039 --> 00:03:57.639 value start to go down. 82 00:03:57.520 --> 00:03:58.879 Which means it's working, right? 83 00:03:59.080 --> 00:04:02.879 Technically, yes, the network is learning, but the tutorial never 84 00:04:02.919 --> 00:04:05.800 actually stops to explain the why. It's like you're driving 85 00:04:05.840 --> 00:04:08.520 a sports car but you have zero clue how the 86 00:04:08.520 --> 00:04:10.159 internal combustion engine works. 87 00:04:09.960 --> 00:04:13.400 Which is, I'm guessing, where Weidman's approach comes in. He argues 88 00:04:13.439 --> 00:04:15.960 you have to merge these perspectives. 89 00:04:15.439 --> 00:04:19.519 Yes, exactly. His core thesis is that to truly understand 90 00:04:19.519 --> 00:04:22.240 neural networks, you have to hold multiple mental models in 91 00:04:22.279 --> 00:04:23.560 your head simultaneously. 92 00:04:23.759 --> 00:04:24.879 Okay, what does that look like?
93 00:04:25.000 --> 00:04:27.079 Well, you have to look at a neural network and 94 00:04:27.120 --> 00:04:30.079 see it as a mathematical function, but at the exact 95 00:04:30.160 --> 00:04:32.519 same time, you have to see it as a computational 96 00:04:32.600 --> 00:04:36.240 graph where data physically flows from left to right. Got it. 97 00:04:36.439 --> 00:04:38.120 You also have to see it as a series of 98 00:04:38.240 --> 00:04:41.959 layered neurons, and finally, you have to understand it conceptually 99 00:04:42.240 --> 00:04:44.720 as a universal function approximator. 100 00:04:44.839 --> 00:04:49.480 Wait, hold on, a universal function approximator? Yeah. That sounds 101 00:04:49.560 --> 00:04:53.839 like a fancy blender from a late night infomercial or something. 102 00:04:53.879 --> 00:04:54.879 What does that actually mean? 103 00:04:55.040 --> 00:04:58.560 I know it sounds super intimidating, but it just means 104 00:04:58.560 --> 00:05:02.120 a machine that can mold itself to approximate literally 105 00:05:02.199 --> 00:05:05.480 any pattern in the universe, provided it has enough parts. 106 00:05:05.680 --> 00:05:09.439 Any pattern? Pretty much. Whether the pattern is predicting tomorrow's weather, 107 00:05:09.920 --> 00:05:13.279 or recognizing a cat in a photo, or translating English 108 00:05:13.319 --> 00:05:16.639 to French. If there's a logical relationship between the input 109 00:05:16.720 --> 00:05:20.240 and the output, a neural network can approximate it. That's wild. 110 00:05:20.600 --> 00:05:22.959 It is. But you only realize how it does that 111 00:05:23.040 --> 00:05:25.439 if you force yourself to see the math, the diagram, 112 00:05:25.519 --> 00:05:27.199 and the code side by side. 113 00:05:27.160 --> 00:05:29.959 And I guess that's why Weidman forces the reader to build 114 00:05:30.000 --> 00:05:34.399 these networks from scratch in Python, using just, like, basic arrays 115 00:05:34.040 --> 00:05:35.079 in NumPy. Exactly.
116 00:05:35.160 --> 00:05:37.439 It's not because you're trying to build the fastest AI 117 00:05:37.519 --> 00:05:40.879 in the world. It's purely an exercise in solidifying your 118 00:05:40.920 --> 00:05:42.079 understanding of those models. 119 00:05:42.079 --> 00:05:42.519 Spot on. 120 00:05:42.920 --> 00:05:46.279 So let's start doing exactly that for the listener. Let's 121 00:05:46.319 --> 00:05:49.600 abandon the complex terminology. We need to start at the 122 00:05:49.920 --> 00:05:54.439 absolute foundation of all machine learning: the mathematical function and 123 00:05:54.480 --> 00:05:55.480 the derivative. Right. 124 00:05:55.959 --> 00:05:58.360 And usually when we learn about functions in high school, 125 00:05:58.399 --> 00:06:01.959 we use the Cartesian plane, you know, René Descartes' classic 126 00:06:02.360 --> 00:06:02.839 x and y 127 00:06:02.920 --> 00:06:04.720 axes. The good old graph paper. 128 00:06:04.920 --> 00:06:07.879 Exactly, you plot some points, you draw a curved line 129 00:06:07.920 --> 00:06:10.680 through them, and that's fine for basic geometry, but it's 130 00:06:10.680 --> 00:06:13.560 actually a terrible mental model for deep learning. 131 00:06:13.720 --> 00:06:17.279 Yeah, drawing parabolas isn't going to help us build an AI. Instead, 132 00:06:17.319 --> 00:06:20.600 Weidman tells us to visualize a function as a mini factory, 133 00:06:20.920 --> 00:06:23.800 just a physical box sitting on a table. Inputs go 134 00:06:23.879 --> 00:06:27.079 into the box on a conveyor belt. The factory has 135 00:06:27.120 --> 00:06:30.959 some strict internal rules that it applies to whatever comes in, 136 00:06:31.519 --> 00:06:34.800 and then a transformed output comes out the other side. Precisely. 137 00:06:35.120 --> 00:06:37.800 So let's say the factory is a square function. Okay, 138 00:06:38.040 --> 00:06:40.959 you send the number two into the factory.
The factory's 139 00:06:41.040 --> 00:06:44.600 internal rule is to multiply the input by itself, so 140 00:06:44.680 --> 00:06:47.360 out comes the number four. You send in a three, out comes 141 00:06:47.360 --> 00:06:49.800 a nine. It's just a simple, predictable machine. 142 00:06:49.920 --> 00:06:52.240 Okay. So if the function is just a factory box, 143 00:06:52.360 --> 00:06:55.319 what is a derivative? Because just hearing the word derivative 144 00:06:55.480 --> 00:06:58.560 definitely triggers some traumatic math flashbacks for me. 145 00:06:58.639 --> 00:07:01.800 Oh, for sure. But let's stick with our factory visualization. Yeah. 146 00:07:01.920 --> 00:07:04.800 Imagine there is a physical string connecting the input of 147 00:07:04.800 --> 00:07:06.959 the factory to the output of the factory. A string. 148 00:07:07.160 --> 00:07:07.439 Okay. 149 00:07:07.519 --> 00:07:10.680 The derivative is simply asking a very practical question. If 150 00:07:10.720 --> 00:07:14.079 you pull on the input string by a very, very 151 00:07:14.079 --> 00:07:17.040 small amount, a tiny nudge like point zero zero zero one, 152 00:07:17.560 --> 00:07:20.720 by what multiple does the output string move? 153 00:07:20.839 --> 00:07:23.240 Ah, okay. So it's kind of like adjusting an analog 154 00:07:23.600 --> 00:07:26.959 volume knob on an old stereo. If I nudge the 155 00:07:27.000 --> 00:07:30.560 input dial just a tiny fraction of a millimeter, how 156 00:07:30.639 --> 00:07:33.480 much louder does the music actually get? Like, does a 157 00:07:33.519 --> 00:07:37.360 tiny nudge on the input cause a massive blown speaker 158 00:07:37.399 --> 00:07:40.759 spike in the output, or does it barely move the 159 00:07:40.759 --> 00:07:41.399 needle at all? 160 00:07:41.680 --> 00:07:43.680 Exactly. You're measuring the rate of change. 161 00:07:43.959 --> 00:07:47.319 But okay, why is this tiny nudge so crucial?
Like, 162 00:07:47.480 --> 00:07:50.120 why does an artificial intelligence care so much about this 163 00:07:50.160 --> 00:07:50.759 little string? 164 00:07:51.199 --> 00:07:56.040 Because this rate of change, knowing exactly how the input 165 00:07:56.079 --> 00:08:00.879 affects the output, is the literal engine of machine learning. Yes, 166 00:08:01.319 --> 00:08:03.720 it is how the model knows how to correct its 167 00:08:03.759 --> 00:08:06.800 own errors. Think about it. If an AI makes a 168 00:08:06.800 --> 00:08:09.560 prediction and that prediction is wrong, it needs to know 169 00:08:09.560 --> 00:08:10.399 how to fix it. Right. 170 00:08:10.439 --> 00:08:11.160 It has to adjust. 171 00:08:11.199 --> 00:08:13.639 And if the AI knows exactly how a tiny nudge 172 00:08:13.680 --> 00:08:16.399 to its internal settings will affect the final outcome, it 173 00:08:16.439 --> 00:08:19.040 knows exactly which dials to turn and in which direction 174 00:08:19.360 --> 00:08:21.959 to get a better result next time. The derivative is 175 00:08:21.959 --> 00:08:24.040 basically the compass pointing toward the correct answer. 176 00:08:24.160 --> 00:08:26.360 I see. Okay, so we have a single mini factory. 177 00:08:26.480 --> 00:08:28.839 You nudge the input, you watch the output change, you 178 00:08:28.879 --> 00:08:32.000 adjust the dial. That makes sense, but predicting a housing 179 00:08:32.039 --> 00:08:35.399 price or writing a poem takes way more than one 180 00:08:35.399 --> 00:08:39.320 mathematical step. Real data doesn't just go through one simple rule. 181 00:08:39.639 --> 00:08:41.840 So how do these boxes actually talk to each other 182 00:08:41.840 --> 00:08:43.000 without losing all the data? 183 00:08:43.600 --> 00:08:46.120 So this brings us to the concept of nested functions. 184 00:08:46.799 --> 00:08:49.879 In deep learning, you almost never have just one factory. 185 00:08:50.200 --> 00:08:52.279 You have a chain of them, an assembly line.
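NOTE The string-pulling picture of the derivative described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the names `square` and `numerical_derivative` are hypothetical, the nudge size 0.0001 is the one mentioned in the conversation, and the central-difference formula is one common way (an assumption here) to measure how far the output moves.

```python
def square(x: float) -> float:
    """The 'square factory': multiply the input by itself."""
    return x * x


def numerical_derivative(func, x: float, delta: float = 0.0001) -> float:
    """Estimate how much func's output moves per unit of input nudge at x.

    Nudges the input by +delta and -delta (a central difference) and
    divides the change in output by the total size of the nudge.
    """
    return (func(x + delta) - func(x - delta)) / (2 * delta)


if __name__ == "__main__":
    print(square(2.0))                        # the factory's output for input 2
    print(numerical_derivative(square, 3.0))  # close to 6: output moves ~6x the nudge
```

At an input of 3, the output string moves about six times as far as the input string, which matches the calculus result that the derivative of x squared is 2x.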
186 00:08:52.399 --> 00:08:55.840 Exactly, an assembly line. The output conveyor belt of factory 187 00:08:55.879 --> 00:08:58.960 one feeds directly into the input conveyor belt of factory 188 00:08:59.000 --> 00:09:03.039 two. Factory one transforms the raw data, passes it to factory two, 189 00:09:03.120 --> 00:09:04.919 which transforms it again, and so on. 190 00:09:05.039 --> 00:09:07.399 Okay, but if I nudge the input at the very 191 00:09:07.440 --> 00:09:10.440 beginning of the assembly line, that ripple has to travel 192 00:09:10.440 --> 00:09:13.399 through every single factory to reach the end. How do 193 00:09:13.440 --> 00:09:16.039 we track that string across ten different boxes? 194 00:09:16.120 --> 00:09:18.759 We use what might be the single most important mathematical 195 00:09:18.840 --> 00:09:22.039 rule in all of deep learning: the chain rule from calculus. 196 00:09:22.120 --> 00:09:26.240 The chain rule. Yes, and again, Weidman demystifies this beautifully 197 00:09:26.600 --> 00:09:27.799 using the factory boxes. 198 00:09:28.000 --> 00:09:29.720 Okay, let's trace the string, then. Let's say we have 199 00:09:29.720 --> 00:09:31.759 two boxes. We pull the string on the input to 200 00:09:31.799 --> 00:09:34.679 box one. We observe that its output changes by a 201 00:09:34.679 --> 00:09:37.200 factor of three, so a 3x multiplier. Right. 202 00:09:37.559 --> 00:09:40.440 That output is now the input for box two. And 203 00:09:40.519 --> 00:09:43.000 we already know that if we tweak the input of 204 00:09:43.039 --> 00:09:46.159 box two, its output changes by a factor of, say, 205 00:09:46.320 --> 00:09:47.440 negative two. 206 00:09:47.679 --> 00:09:51.159 Perfect setup. So to find the total change across the 207 00:09:51.279 --> 00:09:53.639 entire chain, from the very first input to the very 208 00:09:53.679 --> 00:09:56.480 last output, the chain rule says we simply multiply those 209 00:09:56.559 --> 00:09:59.240 rates of change together.
Just multiply them? Just multiply them. Box 210 00:10:00.200 --> 00:10:02.840 one changes things by a factor of three, box two changes 211 00:10:02.879 --> 00:10:05.200 things by a factor of negative two. The total change 212 00:10:05.200 --> 00:10:08.320 across the whole chain is three multiplied by negative two, 213 00:10:08.759 --> 00:10:10.080 which equals negative six. 214 00:10:10.200 --> 00:10:12.360 Oh wow, so a one unit nudge at the start 215 00:10:12.519 --> 00:10:14.720 creates a negative six unit shift at the very end. 216 00:10:14.919 --> 00:10:15.320 Exactly. 217 00:10:15.399 --> 00:10:18.720 But wait, practically speaking, if I'm actually coding this assembly line, 218 00:10:18.879 --> 00:10:20.399 how does the system know those numbers? Like, 219 00:10:20.440 --> 00:10:20.519 do 220 00:10:20.600 --> 00:10:22.159 I have to run the data all the way forward 221 00:10:22.200 --> 00:10:24.279 to get an answer, and then somehow trace my steps 222 00:10:24.279 --> 00:10:26.480 all the way backward to figure out the chain rule math? 223 00:10:26.720 --> 00:10:29.759 That is exactly what you have to do. To code 224 00:10:29.759 --> 00:10:33.639 this from scratch, your system has to make two distinct passes. 225 00:10:34.240 --> 00:10:37.440 First is the forward pass. Okay, forward. You feed your 226 00:10:37.440 --> 00:10:39.919 initial data into the first factory and you let it 227 00:10:40.000 --> 00:10:42.679 run all the way down the assembly line. But here's 228 00:10:42.679 --> 00:10:46.159 the catch. As the data moves forward, the system has 229 00:10:46.200 --> 00:10:49.679 to save all the intermediate quantities at every single step. 230 00:10:50.120 --> 00:10:53.039 It has to keep a meticulous record of what happened 231 00:10:53.080 --> 00:10:53.960 inside each box. 232 00:10:54.159 --> 00:10:56.039 Why does it need to save all that? If it reaches 233 00:10:56.080 --> 00:10:58.480 the end and gets an answer, hasn't it done its job?
234 00:10:58.639 --> 00:11:02.279 Because of the second step, the backward pass. Once the 235 00:11:02.360 --> 00:11:04.919 data reaches the end and the network spits out a prediction, 236 00:11:05.440 --> 00:11:08.159 you compare that prediction to the correct answer to see 237 00:11:08.159 --> 00:11:11.360 how wrong you were. Then you run backward down the 238 00:11:11.399 --> 00:11:14.240 assembly line. You use all those intermediate records you saved 239 00:11:14.320 --> 00:11:17.320 during the forward pass to calculate the derivatives, the strings, 240 00:11:17.320 --> 00:11:20.840 going backward. You calculate box two's string, then multiply it 241 00:11:20.879 --> 00:11:23.440 by box one's string using the chain rule, all the 242 00:11:23.440 --> 00:11:24.360 way back to the start. 243 00:11:24.480 --> 00:11:27.639 I'm not going to lie. That sounds incredibly tedious to 244 00:11:27.679 --> 00:11:30.960 code by hand, keeping track of every single variable, saving 245 00:11:30.960 --> 00:11:34.799 it all in memory, running backward, multiplying the strings. It 246 00:11:34.879 --> 00:11:37.519 sounds like an absolute nightmare of bookkeeping. 247 00:11:37.600 --> 00:11:40.639 It is. It's a massive bookkeeping operation. Yeah, and this 248 00:11:40.799 --> 00:11:44.120 is exactly why modern deep learning libraries like PyTorch are 249 00:11:44.159 --> 00:11:45.240 so popular 250 00:11:44.840 --> 00:11:47.000 today. Because they do it for you. Exactly. 251 00:11:47.360 --> 00:11:51.960 They use something called automatic differentiation. They handle all that 252 00:11:52.039 --> 00:11:56.080 tedious forward and backward bookkeeping completely invisibly. You just define 253 00:11:56.080 --> 00:11:58.600 the factories and the library does the calculus for you.
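NOTE The two-pass bookkeeping described above can be sketched for a two-box assembly line. This is a hedged illustration rather than the book's code: the function names are hypothetical, and the boxes are chosen (as an assumption) so that at an input of 1.5 the first box has a rate of change of 3 and the second has a rate of change of negative two, reproducing the three times negative two example from the conversation.

```python
def square(x):
    return x * x


def d_square(x):
    # Rate of change of the square box at input x.
    return 2.0 * x


def neg_double(u):
    return -2.0 * u


def d_neg_double(u):
    # Rate of change of this box is -2 regardless of its input.
    return -2.0


def forward(x, boxes):
    """Run x through each box in order, saving the input each box received."""
    saved_inputs = []
    value = x
    for func, _ in boxes:
        saved_inputs.append(value)
        value = func(value)
    return value, saved_inputs


def backward(saved_inputs, boxes):
    """Walk the line in reverse, multiplying each box's rate of change."""
    grad = 1.0
    for (_, deriv), box_input in zip(reversed(boxes), reversed(saved_inputs)):
        grad *= deriv(box_input)
    return grad


boxes = [(square, d_square), (neg_double, d_neg_double)]
output, saved = forward(1.5, boxes)
total_change = backward(saved, boxes)

if __name__ == "__main__":
    print(output)        # final output of the assembly line
    print(saved)         # what each box received on the way forward
    print(total_change)  # chain rule product: (-2) * 3
```

The square box's rate of change depends on the input it saw during the forward pass, which is why the saved records are needed on the way back.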
254 00:11:58.919 --> 00:12:01.080 But Weidman forces you to do it by hand anyway, right? 255 00:12:01.200 --> 00:12:03.759 He does, because if you just rely on PyTorch, you're 256 00:12:03.799 --> 00:12:06.360 back to being a blind man touching the elephant. You 257 00:12:06.360 --> 00:12:08.360 don't see the whole process. Exactly. 258 00:12:08.720 --> 00:12:11.519 By coding the forward and backward passes from scratch in Python, 259 00:12:11.879 --> 00:12:15.080 you actually see the mechanics. You realize that learning isn't consciousness. 260 00:12:15.320 --> 00:12:19.000 It's literally just a series of multipliers passed backward down 261 00:12:19.000 --> 00:12:19.879 an assembly line. 262 00:12:19.919 --> 00:12:22.120 Okay, I'm with you on the strings and the assembly line, 263 00:12:22.120 --> 00:12:26.600 but single numbers are great for theory. Reality is messy. 264 00:12:26.879 --> 00:12:27.440 Very messy. 265 00:12:27.639 --> 00:12:30.279 If I want an AI to predict a housing price, 266 00:12:30.840 --> 00:12:34.360 I'm not just feeding it a single number. A house 267 00:12:34.399 --> 00:12:38.519 has dozens of features: square footage, number of bedrooms, age 268 00:12:38.559 --> 00:12:41.799 of the roof, proximity to a highway. So how do 269 00:12:41.840 --> 00:12:44.919 we pull a string on a massive spreadsheet of information? 270 00:12:45.200 --> 00:12:47.879 This is where we scale up to matrices and supervised 271 00:12:47.960 --> 00:12:51.679 learning. Supervised learning is just finding relationships between 272 00:12:51.759 --> 00:12:54.840 characteristics that have already been measured. Okay. And to process 273 00:12:54.840 --> 00:12:57.600 all those characteristics, we can't use single numbers. We have 274 00:12:57.639 --> 00:13:00.559 to stack the data into grids, which in NumPy 275 00:13:00.759 --> 00:13:03.480 are called ndarrays, or n-dimensional arrays. 276 00:13:03.639 --> 00:13:06.600 Right.
So, if you visualize a spreadsheet, the columns are 277 00:13:06.600 --> 00:13:10.720 the features, like bedrooms, square footage, and every specific house 278 00:13:10.759 --> 00:13:12.279 you are evaluating becomes a row. 279 00:13:12.519 --> 00:13:12.679 Yep. 280 00:13:13.000 --> 00:13:15.080 So a two by two grid might be two houses, 281 00:13:15.159 --> 00:13:16.799 each with two features. Exactly. 282 00:13:16.960 --> 00:13:19.480 Now, when this grid of data enters the first factory, 283 00:13:19.720 --> 00:13:22.480 the model needs a way to evaluate it. It performs 284 00:13:22.480 --> 00:13:23.840 what's called a weighted sum. 285 00:13:23.960 --> 00:13:24.679 A weighted sum. 286 00:13:24.799 --> 00:13:27.080 Right. It looks at the features and decides how important 287 00:13:27.120 --> 00:13:30.000 each one is. Does the square footage matter more than 288 00:13:30.039 --> 00:13:33.000 the age of the roof? It assigns a mathematical weight 289 00:13:33.080 --> 00:13:33.759 to each feature. 290 00:13:33.960 --> 00:13:37.759 Okay, let me guess how this works mathematically. If I 291 00:13:37.919 --> 00:13:40.879 have a column for bedrooms and a weight that says 292 00:13:40.960 --> 00:13:44.519 bedrooms are very important, is the factory just doing a 293 00:13:44.559 --> 00:13:46.559 dot product, like matching them up? 294 00:13:46.759 --> 00:13:49.360 Yes. Think of a dot product as a matching game. 295 00:13:50.039 --> 00:13:52.720 The factory lines up the house's features in one hand 296 00:13:53.039 --> 00:13:55.080 and its internal priority weights in the other. 297 00:13:55.320 --> 00:13:55.679 Okay. 298 00:13:56.000 --> 00:13:59.360 It matches the bedrooms to the bedroom weight, multiplies them together, 299 00:14:00.039 --> 00:14:03.440 matches the square footage to the square footage weight, multiplies them. 300 00:14:03.879 --> 00:14:06.919 Then it throws all those paired results into one single 301 00:14:06.919 --> 00:14:08.480 bucket and adds them up.
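NOTE The matching game described above can be sketched with a NumPy ndarray. This is an illustrative sketch only: the feature values and the weights are made-up numbers, and the variable names are hypothetical. It shows a two by two grid (two houses, two features) and the weighted sum computed as a dot product.

```python
import numpy as np

# Rows are houses, columns are features:
# [number of bedrooms, square footage in thousands]. Values are made up.
houses = np.array([[3.0, 1.5],
                   [4.0, 3.0]])

# One hypothetical weight per feature: how much each feature matters.
weights = np.array([0.5, 2.0])

# The matching game for every house at once: pair each feature with its
# weight, multiply, and add the results into one bucket per house.
weighted_sums = houses @ weights

# The same thing spelled out by hand for the first house.
first_house_by_hand = houses[0, 0] * weights[0] + houses[0, 1] * weights[1]

if __name__ == "__main__":
    print(houses.shape)         # (2, 2): two houses, two features
    print(weighted_sums)        # one bucket total per house
    print(first_house_by_hand)  # matches the first entry above
```

The `@` operator is NumPy's matrix multiplication; for a grid times a vector it performs one dot product per row.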
302 00:14:08.879 --> 00:14:11.279 That's the sum, right? But if you keep multiplying features 303 00:14:11.279 --> 00:14:14.200 by weights, that bucket is going to overflow real fast. 304 00:14:14.320 --> 00:14:16.720 I mean, a three thousand square foot house multiplied by 305 00:14:16.759 --> 00:14:19.480 a heavy weight becomes a massive number. Do we just let 306 00:14:19.519 --> 00:14:21.240 the numbers get infinitely large? 307 00:14:21.440 --> 00:14:23.799 We can't, which is why we usually feed that bucket 308 00:14:23.799 --> 00:14:27.799 into another factory right afterward, typically something called a sigmoid function. 309 00:14:28.039 --> 00:14:30.000 A sigmoid function, we haven't covered that one. What's that? 310 00:14:30.200 --> 00:14:32.360 A sigmoid function is basically a squishing factory. 311 00:14:32.399 --> 00:14:33.480 A squishing factory. 312 00:14:33.639 --> 00:14:36.840 Yeah, it takes whatever wild, massive number comes out of 313 00:14:36.840 --> 00:14:40.080 the weighted sum, and it brutally compresses it into a 314 00:14:40.120 --> 00:14:42.200 manageable decimal between zero and one. 315 00:14:42.320 --> 00:14:42.519 Oh. 316 00:14:42.559 --> 00:14:44.759 I see. This is incredibly useful if you just want 317 00:14:44.759 --> 00:14:47.240 the network to give you a probability, like a point 318 00:14:47.320 --> 00:14:49.559 eight chance that the house is a good buy, rather than 319 00:14:49.679 --> 00:14:51.480 outputting a raw score of four million. 320 00:14:51.559 --> 00:14:54.799 Okay, so our assembly line is now: take the matrix 321 00:14:54.879 --> 00:14:58.240 of houses, match them with weights, sum them up, and 322 00:14:58.279 --> 00:15:01.159 then squish them through a sigmoid factory to get a probability. 323 00:15:01.279 --> 00:15:01.759 You got it. 324 00:15:02.080 --> 00:15:04.799 I get the forward pass. But here's where my brain 325 00:15:04.960 --> 00:15:08.639 completely breaks.
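NOTE The squishing factory described above is the standard sigmoid, 1 / (1 + e^(-z)). The sketch below is an illustration, not the book's implementation: the function name is hypothetical, and the two-branch form is an added assumption that avoids floating-point overflow for very negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Squish any real number into the range from 0 to 1.

    Uses two algebraically equivalent branches so that math.exp is only
    ever called on a non-positive number and cannot overflow.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


if __name__ == "__main__":
    print(sigmoid(0.0))          # a score of zero lands exactly in the middle
    print(sigmoid(math.log(4)))  # a modest positive score, roughly 0.8
    print(sigmoid(4_000_000.0))  # a huge raw score saturates near the top
```

In exact math the output never reaches 0 or 1; in floating point, extreme inputs such as four million round to the endpoint.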
To do the backward pass, we have to 326 00:15:08.679 --> 00:15:12.480 pull the string to correct the errors. How on earth 327 00:15:12.519 --> 00:15:15.279 do you track the derivative of a giant grid of 328 00:15:15.320 --> 00:15:19.519 interacting numbers? Every row and column is interacting with every weight. 329 00:15:19.919 --> 00:15:23.720 The calculus must just explode into absolute chaos. You would 330 00:15:23.480 --> 00:15:26.759 think so. Tracking every single string individually across a massive 331 00:15:26.759 --> 00:15:30.440 matrix would be impossible, right? But the math looks incredibly 332 00:15:30.480 --> 00:15:34.600 messy on a whiteboard, while the resulting code is brilliantly, 333 00:15:34.960 --> 00:15:38.879 shockingly clean. It's a magical property of linear algebra. 334 00:15:39.000 --> 00:15:41.639 Whoa, whoa, stop right there. Time out. You literally 335 00:15:41.759 --> 00:15:45.120 cannot start this deep dive by promising no magic and 336 00:15:45.159 --> 00:15:47.759 then tell me the math relies on a magical property. 337 00:15:48.200 --> 00:15:52.519 That is totally cheating. Explain it. Why does the matrix 338 00:15:52.559 --> 00:15:53.919 math clean up so nicely? 339 00:15:54.120 --> 00:15:57.159 Fair catch. Okay, you're right, no magic. It comes down 340 00:15:57.200 --> 00:15:58.759 to something called matrix transposition. 341 00:15:58.840 --> 00:15:59.960 Matrix transposition. 342 00:16:00.039 --> 00:16:02.840 Yes, when you need to compute the backward pass, the 343 00:16:02.919 --> 00:16:07.200 gradient, for a giant grid of weights, the chain rule 344 00:16:07.279 --> 00:16:10.080 dictates that you don't actually have to calculate a million 345 00:16:10.120 --> 00:16:13.879 individual strings. Instead, you take the input matrix, and you 346 00:16:13.879 --> 00:16:16.240 simply transpose it. You flip it on its side. 347 00:16:16.039 --> 00:16:18.879 Meaning the rows become columns and the columns become rows.
348 00:16:19.320 --> 00:16:22.840 Exactly. And why does this work mechanically? Think of the 349 00:16:22.840 --> 00:16:26.399 forward pass like a river flowing downstream, splitting into hundreds 350 00:16:26.440 --> 00:16:29.360 of tiny branches. Those are your data points interacting with weights. 351 00:16:29.480 --> 00:16:30.519 Okay, I picture it. 352 00:16:30.639 --> 00:16:32.519 If you want to send an error signal back up 353 00:16:32.559 --> 00:16:34.679 the river to the exact source that caused it, you 354 00:16:34.759 --> 00:16:37.600 just reverse the map. Flipping the matrix on its side 355 00:16:37.720 --> 00:16:40.879 perfectly reroutes the error signals backward along the exact 356 00:16:40.879 --> 00:16:43.600 same mathematical paths the data used to travel forward. 357 00:16:43.799 --> 00:16:46.960 Oh wow. So you aren't doing entirely new, chaotic math 358 00:16:47.039 --> 00:16:49.279 to go backward. Not at all. You're just taking the 359 00:16:49.279 --> 00:16:52.240 infrastructure you built going forward, turning it sideways, and letting 360 00:16:52.279 --> 00:16:54.399 the error flow back to the correct weight. 361 00:16:54.559 --> 00:16:58.799 Precisely. Because of how matrix transposes work out mathematically, this 362 00:16:59.039 --> 00:17:03.159 incredibly complex web of interacting data collapses into a 363 00:17:03.200 --> 00:17:06.880 few incredibly simple lines of Python code during the backward pass. 364 00:17:07.119 --> 00:17:08.720 It scales perfectly. 365 00:17:08.839 --> 00:17:11.079 That is wild. So it doesn't matter if I'm feeding 366 00:17:11.440 --> 00:17:14.279 the factory a single two by two grid or a 367 00:17:14.319 --> 00:17:17.559 massive matrix with a million rows representing every house in 368 00:17:17.559 --> 00:17:20.680 the country. The logic of the assembly line stays exactly 369 00:17:20.680 --> 00:17:21.279 the same. 370 00:17:21.200 --> 00:17:22.079 Exactly the same.
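NOTE The "flip the map" step described above can be sketched in NumPy. This is a hedged illustration, not the book's code: the function names, data, and weights are hypothetical, and summing the sigmoid outputs stands in for a real loss (a simplifying assumption) so that there is a single number to differentiate. The gradient with respect to the weights is the transposed input matrix times the error signal, and a brute-force nudge of every weight is included to check it.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, W):
    """Weighted sums, then sigmoid, then add everything into one number."""
    N = X @ W        # matching game: one weighted sum per house
    S = sigmoid(N)   # squish each sum into the range 0 to 1
    total = S.sum()  # stand-in for a loss (assumption)
    return total, (X, N, S)


def backward(saved):
    """Route the error back to the weights using the transposed input."""
    X, N, S = saved
    d_total_d_S = np.ones_like(S)  # each output contributes 1:1 to the sum
    d_S_d_N = S * (1.0 - S)        # derivative of sigmoid, from saved outputs
    d_total_d_N = d_total_d_S * d_S_d_N
    return X.T @ d_total_d_N       # the transpose does the rerouting


def numerical_gradient(X, W, eps=1e-6):
    """Brute force: nudge each weight individually and watch the total."""
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W_plus, W_minus = W.copy(), W.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            grad[i, j] = (forward(X, W_plus)[0] - forward(X, W_minus)[0]) / (2 * eps)
    return grad


X = np.array([[3.0, 1.5],
              [4.0, 3.0]])    # two houses, two made-up features
W = np.array([[0.1],
              [-0.2]])        # hypothetical weights, one per feature

total, saved = forward(X, W)
analytic = backward(saved)
numeric = numerical_gradient(X, W)

if __name__ == "__main__":
    print(analytic.ravel())
    print(numeric.ravel())    # should agree closely with the line above
```

The backward pass is three array operations no matter how many rows `X` has, while the brute-force check needs two forward passes per weight.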
371 00:17:22.160 --> 00:17:25.079 The forward pass runs the matching game and squishes the numbers. 372 00:17:25.519 --> 00:17:27.880 The backward pass flips the map on its side to 373 00:17:27.960 --> 00:17:31.359 route the blame, runs the chain rule, and updates the weights. 374 00:17:31.799 --> 00:17:34.000 And that is why we can have AI models today 375 00:17:34.039 --> 00:17:38.319 with billions or even trillions of parameters. The fundamental architecture, 376 00:17:38.400 --> 00:17:41.839 the mini factory, the chain rule, the matrix transposes, it's 377 00:17:42.039 --> 00:17:45.440 infinitely scalable. You just need more powerful computers to run 378 00:17:45.480 --> 00:17:46.559 the assembly line faster. 379 00:17:46.839 --> 00:17:49.480 Okay, let's bring this all together for you, the listener. 380 00:17:50.200 --> 00:17:52.720 We started this deep dive staring at a hidden circuitry 381 00:17:52.759 --> 00:17:56.359 that everyone just assumes is impenetrable, but by looking through 382 00:17:56.400 --> 00:17:59.240 the lens of Seth Weidman's work, we've stripped it down. 383 00:17:59.599 --> 00:18:03.240 We have deep learning as an assembly line of mini factories. 384 00:18:03.640 --> 00:18:07.359 We have inputs that flow forward through nested functions, matching 385 00:18:07.400 --> 00:18:10.160 features to weights. We save our math as we go. 386 00:18:10.880 --> 00:18:13.680 Then we compare our final answer, and we pull the 387 00:18:13.720 --> 00:18:17.559 strings backward, flipping the matrices to calculate exactly how to 388 00:18:17.599 --> 00:18:20.759 adjust our internal dials. It's not a brain, it's just 389 00:18:21.079 --> 00:18:25.720 very fast, very elegant bookkeeping. You now have the foundational 390 00:18:25.759 --> 00:18:28.640 mental model for how machines actually learn. 391 00:18:29.119 --> 00:18:31.960 It's a very empowering realization, honestly, to finally see the 392 00:18:32.000 --> 00:18:35.279 gears turning.
But this actually raises a really important question, 393 00:18:35.319 --> 00:18:36.960 and it's the thought I want to leave you with today. 394 00:18:37.359 --> 00:18:39.079 It goes all the way back to the very first 395 00:18:39.079 --> 00:18:42.200 step of this entire process: supervised learning. 396 00:18:42.359 --> 00:18:44.119 You mean, like setting up the grid of numbers in 397 00:18:44.119 --> 00:18:44.680 the first place. 398 00:18:44.839 --> 00:18:48.519 Right. Weidman points out that to use these beautiful mathematical 399 00:18:48.519 --> 00:18:52.440 assembly lines, we have to translate the messy, ambiguous real 400 00:18:52.480 --> 00:18:54.279 world into precise numbers. 401 00:18:54.440 --> 00:18:54.839