WEBVTT 1 00:00:00.000 --> 00:00:02.560 All right, everyone, welcome back to the Deep Dive. Today, 2 00:00:02.600 --> 00:00:05.919 we're tackling a really big topic, one that shapes so 3 00:00:06.040 --> 00:00:08.960 much of our modern world: neural networks and deep learning. 4 00:00:09.560 --> 00:00:12.320 Our mission, as always, is to cut through the complexity, 5 00:00:12.359 --> 00:00:15.400 pull out the most important bits for you, and give 6 00:00:15.400 --> 00:00:19.679 you that shortcut to being truly well informed. We're diving 7 00:00:19.679 --> 00:00:23.800 into this fantastic textbook by Charu C. Aggarwal, Neural Networks and 8 00:00:23.879 --> 00:00:27.800 Deep Learning. Seriously, this thing is packed. It covers everything 9 00:00:27.960 --> 00:00:31.359 from the foundational ideas all the way to super cutting 10 00:00:31.399 --> 00:00:33.880 edge applications, and it even touches on some architectures that 11 00:00:33.920 --> 00:00:36.880 have been sort of, well, forgotten. So our goal is 12 00:00:36.920 --> 00:00:39.119 to make sense of this intricate world, give you a 13 00:00:39.159 --> 00:00:42.240 clear picture of what these powerful technologies are and, maybe 14 00:00:42.280 --> 00:00:44.399 more importantly, why they matter so much. 15 00:00:44.560 --> 00:00:47.119 Yeah, and what's really fascinating, I think, is just tracing 16 00:00:47.600 --> 00:00:49.840 how this field has evolved over time. We're going to 17 00:00:50.039 --> 00:00:53.960 explore not just what these networks are, but how they 18 00:00:53.960 --> 00:00:56.439 actually learn, why they work the way they do, which 19 00:00:56.479 --> 00:00:58.920 can be pretty powerful, and of course the incredible ways 20 00:00:58.920 --> 00:01:02.439 they're being used today. I mean everything from self-driving 21 00:01:02.479 --> 00:01:06.719 cars to generating creative art. It's kind of amazing. 
We 22 00:01:06.760 --> 00:01:09.719 really want you to walk away with those aha moments, 23 00:01:09.760 --> 00:01:13.120 you know, yeah, feeling like you've really unlocked some profound insights. 24 00:01:13.319 --> 00:01:17.000 Okay, so let's unpack this right from the start. When 25 00:01:17.040 --> 00:01:19.599 we hear neural networks, the first thing that pops into 26 00:01:19.640 --> 00:01:22.640 mind is, well, the human brain. But how close is 27 00:01:22.680 --> 00:01:23.840 that comparison, really? 28 00:01:24.359 --> 00:01:26.319 That's a great question. The book points out that, yeah, 29 00:01:26.319 --> 00:01:29.400 there was definitely initial biological inspiration. You know, things like 30 00:01:29.439 --> 00:01:33.040 convolutional neural networks, or CNNs, were partly inspired by Hubel 31 00:01:33.079 --> 00:01:37.319 and Wiesel's work on how the cat's visual cortex processes images. 32 00:01:37.680 --> 00:01:40.640 But, and this is important, the book also mentions that the comparison 33 00:01:40.680 --> 00:01:44.560 is often criticized. It's seen as maybe a poor caricature, 34 00:01:44.719 --> 00:01:47.200 a really simplified version of the actual brain. 35 00:01:47.319 --> 00:01:50.159 Okay, so a loose inspiration, not a direct copy. 36 00:01:49.879 --> 00:01:54.040 Exactly, though neuroscience principles have certainly been useful along the way. 37 00:01:54.319 --> 00:01:56.519 But here's something I found really interesting in the book. 38 00:01:57.200 --> 00:02:01.519 At their core, these networks aren't some completely alien tech. 39 00:02:01.959 --> 00:02:06.959 They're actually built from, like, basic computational units inspired by 40 00:02:07.000 --> 00:02:10.639 algorithms we already knew from traditional machine learning, things like 41 00:02:10.879 --> 00:02:13.360 least squares regression or logistic regression. 42 00:02:13.479 --> 00:02:13.560 Right. 
43 00:02:13.840 --> 00:02:17.000 The power comes from how they combine tons of these 44 00:02:17.000 --> 00:02:17.879 simple units, right? 45 00:02:17.960 --> 00:02:21.360 Precisely. They learn how to connect these basic building blocks 46 00:02:21.439 --> 00:02:25.960 in really intricate ways, all working together to minimize the 47 00:02:25.960 --> 00:02:28.840 prediction error. It's kind of like building something really complex 48 00:02:28.879 --> 00:02:33.759 and amazing, like a cathedral, but using very simple, powerful bricks. 49 00:02:33.960 --> 00:02:36.120 Okay, so what does that basic brick look like, then? 50 00:02:36.400 --> 00:02:39.280 Good question. We can start with the simplest one, really, 51 00:02:39.719 --> 00:02:43.479 the perceptron. Imagine it as a tiny decision maker. It 52 00:02:43.520 --> 00:02:47.680 takes in different features, pieces of information. Each feature gets 53 00:02:47.719 --> 00:02:50.520 multiplied by a weight, which basically says how important it is. 54 00:02:50.919 --> 00:02:54.360 Then it sums all those weighted features up, and finally 55 00:02:54.479 --> 00:02:58.080 that sum goes through something called an activation function. Think 56 00:02:58.120 --> 00:03:00.879 of it like a gate, which produces the final output, 57 00:03:01.000 --> 00:03:04.520 often, say, a class label like cat or dog. And 58 00:03:04.520 --> 00:03:06.800 I read about a bias neuron. What's that? 59 00:03:07.080 --> 00:03:09.840 Ah, right, that's sort of a neat trick. It's like 60 00:03:09.919 --> 00:03:12.159 adding a constant offset to the sum before it hits 61 00:03:12.199 --> 00:03:14.520 the activation function. You can do that by having a 62 00:03:14.560 --> 00:03:16.960 special input that always has a value of one, with 63 00:03:17.080 --> 00:03:19.240 its own weight. It just gives the model a bit 64 00:03:19.280 --> 00:03:20.120 more flexibility. 65 00:03:20.240 --> 00:03:22.759 Gotcha. And these activation functions. 
You said they're like gates. 66 00:03:22.800 --> 00:03:23.840 Why are they so important? 67 00:03:24.000 --> 00:03:28.520 They're absolutely crucial. They introduce nonlinearity. Without them, stacking layers 68 00:03:28.520 --> 00:03:30.719 wouldn't actually add much power. It would still just be 69 00:03:30.759 --> 00:03:34.199 a linear model. Early on, people used a simple sign function, 70 00:03:34.400 --> 00:03:36.960 just outputting plus one or minus one. Basically yes or no. 71 00:03:37.479 --> 00:03:39.960 But that's hard to train mathematically because it's not smooth, 72 00:03:40.159 --> 00:03:44.439 not differentiable. Okay. So functions like sigmoid and tanh became popular. 73 00:03:44.479 --> 00:03:47.199 Sigmoid squishes the output between zero and one, which is 74 00:03:47.199 --> 00:03:51.159 great for probabilities. Tanh is similarly S-shaped, but squishes 75 00:03:51.199 --> 00:03:52.400 between negative one and one. 76 00:03:52.599 --> 00:03:54.680 But the book mentioned something else has taken over now: 77 00:03:54.879 --> 00:03:55.639 ReLU. 78 00:03:56.000 --> 00:03:59.439 Yes, and this is a really key aha moment for 79 00:03:59.520 --> 00:04:03.360 understanding modern deep learning. ReLU, which stands for 80 00:04:03.400 --> 00:04:07.400 rectified linear unit, sounds fancy, but it's incredibly simple. It's 81 00:04:07.400 --> 00:04:09.919 just max(v, 0), so if the input is negative, 82 00:04:09.919 --> 00:04:12.680 the output is zero. Otherwise the output is just the input. Yeah, 83 00:04:12.759 --> 00:04:18.639 deceptively simple. ReLU and variations like hard tanh have largely replaced 84 00:04:18.639 --> 00:04:22.759 sigmoid and soft tanh. Why? Because they're piecewise linear. 85 00:04:23.240 --> 00:04:26.360 This makes the math, specifically the gradients used in training, 86 00:04:26.560 --> 00:04:29.600 much, much easier to handle. 
They suffer way less from 87 00:04:29.639 --> 00:04:32.639 a huge problem called vanishing gradients, which we should definitely 88 00:04:32.680 --> 00:04:35.439 talk more about. This change was fundamental in allowing us 89 00:04:35.480 --> 00:04:37.439 to train much, much deeper networks. 90 00:04:37.600 --> 00:04:40.639 Okay, so we have these basic units, perceptrons, with these 91 00:04:40.639 --> 00:04:45.040 crucial nonlinear activation functions like ReLU. How do we go 92 00:04:45.079 --> 00:04:47.000 from that to, well, deep learning? 93 00:04:47.160 --> 00:04:50.160 Right, connecting it to the bigger picture. It's fascinating, actually, 94 00:04:50.240 --> 00:04:52.639 that many traditional machine learning models, the ones that people 95 00:04:52.720 --> 00:04:55.759 used for decades, can be seen as shallow neural networks. 96 00:04:55.920 --> 00:04:59.199 Think about least squares regression, logistic regression, even support vector 97 00:04:59.279 --> 00:05:02.160 machines, SVMs. You could represent all of them as 98 00:05:02.199 --> 00:05:05.240 simple neural architectures, maybe just one or two layers deep. 99 00:05:05.399 --> 00:05:06.839 Really? Like SVMs too? 100 00:05:07.000 --> 00:05:10.279 Yeah. The main differences often just boil down to 101 00:05:10.319 --> 00:05:13.399 the specific loss function they're trying to minimize, and maybe 102 00:05:13.439 --> 00:05:17.079 the activation function in the output layer. For example, logistic 103 00:05:17.120 --> 00:05:21.319 regression for binary classification uses that sigmoid function we mentioned 104 00:05:21.480 --> 00:05:25.680 to output a probability. Its loss function comes from maximizing 105 00:05:25.680 --> 00:05:29.160 the likelihood of the data. The book also contrasts the 106 00:05:29.160 --> 00:05:32.560 original perceptron learning rule, which can be a bit unstable, 107 00:05:32.800 --> 00:05:36.519 with the hinge loss used by SVMs, which provides better stability. 
108 00:05:36.800 --> 00:05:38.759 It shows this kind of shared ancestry. 109 00:05:38.879 --> 00:05:41.439 Okay, so those are the shallow ones. But the real magic, 110 00:05:41.600 --> 00:05:44.199 the deep in deep learning, that comes from adding more 111 00:05:44.240 --> 00:05:45.920 layers, right? Stacking them up. 112 00:05:46.079 --> 00:05:49.120 Exactly. That's where the power really scales up. Multilayer 113 00:05:49.199 --> 00:05:52.439 neural networks introduce what we call hidden layers. These are 114 00:05:52.519 --> 00:05:56.079 layers of computation sandwiched between the input and the final output. 115 00:05:56.439 --> 00:06:01.040 You don't directly see their results, hence hidden. Typically, information 116 00:06:01.120 --> 00:06:03.639 flows forward through these layers, one feeding into the next. 117 00:06:03.839 --> 00:06:05.639 We call these feed-forward networks. 118 00:06:05.720 --> 00:06:08.360 And what happens inside those hidden layers? 119 00:06:08.160 --> 00:06:11.079 This is where the concept of hierarchical feature engineering comes in. 120 00:06:11.199 --> 00:06:14.480 It's a really powerful idea. Imagine you feed an image 121 00:06:14.519 --> 00:06:17.879 into the network. The first hidden layer might learn to 122 00:06:17.920 --> 00:06:24.199 detect very simple, primitive characteristics, things like horizontal lines, vertical lines, 123 00:06:24.199 --> 00:06:25.720 maybe simple curves or edges. 124 00:06:25.879 --> 00:06:27.240 Okay, basic stuff. Right. 125 00:06:27.639 --> 00:06:30.800 Then the next hidden layer takes those simple features as 126 00:06:30.839 --> 00:06:34.000 its input and learns to combine them into slightly more 127 00:06:34.000 --> 00:06:38.519 complex shapes or patterns, maybe corners, circles, simple 128 00:06:38.240 --> 00:06:40.519 textures. Ah, building blocks. 
129 00:06:40.199 --> 00:06:44.240 Exactly. And as you go deeper, subsequent layers combine those 130 00:06:44.240 --> 00:06:49.160 features into even more complex, semantically significant characteristics. So maybe 131 00:06:49.199 --> 00:06:51.720 a later layer recognizes combinations that look like an eye 132 00:06:51.839 --> 00:06:55.639 or a wheel, or in the book's example, hexagons or honeycombs. 133 00:06:55.920 --> 00:06:58.600 By the time the information reaches the final layers, it's 134 00:06:58.639 --> 00:07:01.680 represented in a way that makes classification much easier. The 135 00:07:01.720 --> 00:07:03.759 network has learned to see the important patterns. 136 00:07:03.839 --> 00:07:06.120 That makes a lot of sense. It's like learning progressively 137 00:07:06.199 --> 00:07:08.240 more abstract concepts. 138 00:07:07.920 --> 00:07:12.319 Precisely. And a key advantage here is flexibility. You can 139 00:07:12.319 --> 00:07:16.639 adjust the model's complexity, its learning capacity, by just adding 140 00:07:16.800 --> 00:07:19.720 or removing neurons or entire layers, depending on how much 141 00:07:19.759 --> 00:07:22.600 data you have or the computational resources available. 142 00:07:22.639 --> 00:07:25.199 That brings up another point from the book, the AI winters. 143 00:07:25.600 --> 00:07:27.800 Why did it take so long for neural networks to 144 00:07:27.839 --> 00:07:30.120 really take off if the ideas were around earlier? 145 00:07:30.240 --> 00:07:33.439 Yeah, that's another aha moment. The core concepts for many 146 00:07:33.480 --> 00:07:37.639 of these networks existed decades ago, but they were held back. 147 00:07:38.160 --> 00:07:42.079 The book really emphasizes that the crucial factors were the 148 00:07:42.199 --> 00:07:46.000 massive increase in data availability, the big data, and the 149 00:07:46.079 --> 00:07:48.800 parallel explosion in computational power, especially. 
150 00:07:48.519 --> 00:07:50.800 With GPUs. GPUs, the graphics cards? 151 00:07:50.519 --> 00:07:52.839 Exactly. They happen to be incredibly good at the kind 152 00:07:52.879 --> 00:07:56.439 of parallel matrix multiplications that neural networks rely on. So 153 00:07:56.480 --> 00:07:58.600 it was really after maybe twenty ten, twenty eleven, when 154 00:07:58.600 --> 00:08:01.160 we finally had enough data and enough computing power, that 155 00:08:01.199 --> 00:08:05.279 these deeper, more complex models could finally be trained effectively 156 00:08:05.560 --> 00:08:08.279 and show what they were capable of. The resources caught 157 00:08:08.360 --> 00:08:09.720 up with the ideas. 158 00:08:09.519 --> 00:08:13.079 Right. Okay, so training these deep networks sounds like a beast. 159 00:08:13.480 --> 00:08:17.480 How does that learning part actually happen? You mentioned gradients before. 160 00:08:17.680 --> 00:08:20.600 Yeah. The core algorithm, the engine driving the learning, is 161 00:08:20.639 --> 00:08:24.240 called backpropagation. It's essentially a clever way to figure 162 00:08:24.240 --> 00:08:28.240 out how much each connection, each weight in the network, 163 00:08:28.319 --> 00:08:31.279 contributed to the overall error on a given training example. 164 00:08:31.759 --> 00:08:35.600 It works in two phases. First, there's a forward pass. You 165 00:08:35.639 --> 00:08:38.360 feed the input data through the network layer by layer 166 00:08:38.639 --> 00:08:41.360 until you get an output. Then you compare that output 167 00:08:41.399 --> 00:08:45.080 to the correct answer and calculate the error, or loss. 168 00:08:45.279 --> 00:08:46.399 Okay, see how wrong 169 00:08:46.240 --> 00:08:50.399 it was. Exactly. Then comes the backward pass. Using calculus, 170 00:08:50.399 --> 00:08:53.799 specifically the chain rule, backpropagation calculates the gradient of 171 00:08:53.840 --> 00:08:56.159 the loss with respect to each weight. 
It figures out 172 00:08:56.200 --> 00:08:59.120 how changing each weight would affect the error. This gradient 173 00:08:59.159 --> 00:09:02.559 information is then propagated backward through the network layer by layer. 174 00:09:02.639 --> 00:09:06.159 It's like assigning blame or credit for the error 175 00:09:06.200 --> 00:09:07.919 back to the connections that caused 176 00:09:07.600 --> 00:09:10.159 it. And then you use that information to adjust the 177 00:09:10.159 --> 00:09:11.440 weights? Precisely. 178 00:09:12.080 --> 00:09:14.519 The most common method is stochastic gradient descent, 179 00:09:14.679 --> 00:09:18.159 or SGD. Instead of calculating the error over the entire 180 00:09:18.279 --> 00:09:21.759 massive data set, which would be incredibly slow, SGD takes 181 00:09:21.759 --> 00:09:24.600 a single training example, or maybe a small batch of them, 182 00:09:24.919 --> 00:09:27.840 calculates the gradients, and makes a small adjustment to the 183 00:09:27.840 --> 00:09:30.600 weights in the direction that reduces the error. Then it 184 00:09:30.600 --> 00:09:33.879 moves to the next example or batch. It's stochastic because 185 00:09:33.919 --> 00:09:36.679 each update is based on just a small sample, making 186 00:09:36.720 --> 00:09:39.360 it a bit noisy, but much, much faster overall. 187 00:09:39.480 --> 00:09:42.000 Okay, that makes sense, but you mentioned a problem earlier, 188 00:09:42.159 --> 00:09:46.919 something about gradients, ugh, vanishing and exploding gradients. That sounds bad. 189 00:09:46.960 --> 00:09:48.080 What's going on there? Right, 190 00:09:48.120 --> 00:09:50.399 this is a huge challenge, especially when you start building 191 00:09:50.399 --> 00:09:54.759 really deep networks. It's a stability issue. Remember how backpropagation 192 00:09:54.919 --> 00:09:58.320 uses the chain rule? That involves multiplying many small numbers 193 00:09:58.320 --> 00:10:01.120 together as you go backward through the layers. 
If those 194 00:10:01.240 --> 00:10:04.480 numbers, related to the derivatives of the activation functions, are 195 00:10:04.519 --> 00:10:08.200 consistently less than one, their product can become incredibly tiny, 196 00:10:08.320 --> 00:10:11.120 almost zero by the time it reaches the early layers. 197 00:10:11.399 --> 00:10:14.840 That's the vanishing gradient problem. The signal just fades 198 00:10:14.480 --> 00:10:17.480 away. So the early layers stop learning effectively? 199 00:10:17.600 --> 00:10:21.159 Yes, they don't get useful information about how to adjust 200 00:10:21.159 --> 00:10:25.840 their weights. Conversely, if those numbers are consistently greater than one, 201 00:10:25.960 --> 00:10:30.000 their product can blow up, becoming astronomically large. That's the 202 00:10:30.080 --> 00:10:34.279 exploding gradient problem. The updates become huge and unstable and 203 00:10:34.320 --> 00:10:35.360 the network diverges. 204 00:10:35.679 --> 00:10:38.519 Yikes. Okay, so how do we fix that? How do 205 00:10:38.559 --> 00:10:40.360 we train these deep things reliably? 206 00:10:40.720 --> 00:10:44.200 Well, thankfully, researchers have developed a whole toolkit of techniques 207 00:10:44.240 --> 00:10:47.519 to combat these issues and also to prevent another big 208 00:10:47.559 --> 00:10:48.960 problem: overfitting. 209 00:10:49.039 --> 00:10:52.120 Overfitting, that's when the model just memorizes the training data, right, 210 00:10:52.240 --> 00:10:53.879 but doesn't work well on new stuff? 211 00:10:53.960 --> 00:10:58.519 Exactly, it fails to generalize. So first we have regularization techniques. 212 00:10:58.919 --> 00:11:01.120 Think of these as ways to impose discipline on the 213 00:11:01.159 --> 00:11:04.720 network during training. Weight decay using L1 or 214 00:11:04.759 --> 00:11:07.679 L2 penalties is common. 
It adds a cost to having 215 00:11:07.759 --> 00:11:11.360 large weights, encouraging the network to find simpler solutions that 216 00:11:11.360 --> 00:11:15.240 are less likely to overfit. Another simple but effective one 217 00:11:15.399 --> 00:11:18.720 is early stopping. You monitor the network's performance on a 218 00:11:18.720 --> 00:11:21.919 separate data set, a validation set that it doesn't train on. 219 00:11:22.480 --> 00:11:25.039 When the error on that validation set starts to increase, 220 00:11:25.080 --> 00:11:27.240 even if the training error is still decreasing, you just 221 00:11:27.240 --> 00:11:29.679 stop training. The model is starting to overfit. 222 00:11:29.799 --> 00:11:31.399 Makes sense, stop before it gets worse. 223 00:11:31.559 --> 00:11:34.639 Right. Then there are techniques aimed more directly at the 224 00:11:34.679 --> 00:11:38.279 learning dynamics. Dropout is a really clever one. During training, 225 00:11:38.320 --> 00:11:40.960 for each input or mini-batch, you randomly drop 226 00:11:41.000 --> 00:11:44.799 out, temporarily set to zero, a certain percentage of the neurons 227 00:11:44.799 --> 00:11:45.919 in the hidden layers. 228 00:11:45.720 --> 00:11:46.799 Just switch them off randomly? 229 00:11:47.080 --> 00:11:51.080 Yep. This forces other neurons to learn more robust features 230 00:11:51.279 --> 00:11:54.120 because they can't rely too much on any single other 231 00:11:54.200 --> 00:11:57.519 neuron always being there. It's like training a team where 232 00:11:57.559 --> 00:12:01.320 players might randomly be unavailable. Everyone has to be more versatile. 233 00:12:01.679 --> 00:12:05.879 It acts like training many different smaller networks simultaneously. And 234 00:12:06.039 --> 00:12:10.320 batch normalization is another lifesaver, especially for very deep networks. 
235 00:12:10.960 --> 00:12:14.159 It normalizes the activations within each mini-batch during training, 236 00:12:14.360 --> 00:12:17.879 basically rescaling them to have a consistent mean and variance. 237 00:12:17.559 --> 00:12:19.679 Like tuning the signal? Kind of, yeah. 238 00:12:19.799 --> 00:12:21.759 It helps keep the signals flowing through the network in 239 00:12:21.799 --> 00:12:24.519 a healthy range, preventing them from becoming too large or 240 00:12:24.559 --> 00:12:28.000 too small, which helps stabilize training and allows for faster learning. 241 00:12:28.440 --> 00:12:31.120 Okay, wow, that's a lot of tricks. Anything else? 242 00:12:31.440 --> 00:12:34.919 Oh yeah, we also have adaptive learning rate methods. Instead 243 00:12:34.960 --> 00:12:37.720 of using one fixed learning rate for the entire network, 244 00:12:38.120 --> 00:12:42.639 algorithms like AdaGrad, RMSProp, and the very popular Adam 245 00:12:43.120 --> 00:12:46.720 dynamically adjust the learning rate for each parameter individually. They 246 00:12:46.759 --> 00:12:49.519 can speed up learning for slow parameters and slow it 247 00:12:49.519 --> 00:12:53.200 down for fast ones, helping convergence. Weight initialization is also 248 00:12:53.279 --> 00:12:56.720 surprisingly important. If you start all weights at zero, all 249 00:12:56.720 --> 00:12:58.840 neurons in a layer will learn the exact same thing. 250 00:12:59.080 --> 00:13:03.519 So you need randomized initialization, like Xavier, or Glorot, initialization, 251 00:13:03.919 --> 00:13:06.399 to break that symmetry and get things going. And finally, 252 00:13:06.480 --> 00:13:09.799 especially for things like images, data augmentation is huge. You 253 00:13:09.879 --> 00:13:12.639 create more training data by applying random transformations to your 254 00:13:12.679 --> 00:13:16.399 existing data: rotating images, shifting them, changing brightness, stuff like that. 
255 00:13:16.480 --> 00:13:18.399 It makes the model more robust to variations. 256 00:13:18.600 --> 00:13:21.240 That's quite a toolbox. So putting it all together, what 257 00:13:21.279 --> 00:13:24.679 does deploying these models actually involve in practice? 258 00:13:24.120 --> 00:13:27.159 Well, it means a lot of careful hyperparameter tuning, 259 00:13:27.639 --> 00:13:30.600 finding the right learning rate, the right amount of regularization, 260 00:13:31.039 --> 00:13:34.399 the best network architecture. That often involves experimenting and using 261 00:13:34.399 --> 00:13:37.639 those validation sets to see what works best. The book 262 00:13:37.720 --> 00:13:39.759 mentions that for the huge data sets we have today, 263 00:13:39.799 --> 00:13:42.519 people might use splits like ninety eight percent for training, 264 00:13:42.879 --> 00:13:46.120 one percent for validation, and one percent for final testing, 265 00:13:46.360 --> 00:13:48.600 which is different from older rules of thumb for smaller 266 00:13:48.679 --> 00:13:49.279 data sets. 267 00:13:49.600 --> 00:13:51.120 And you mentioned GPUs earlier. 268 00:13:51.360 --> 00:13:54.919 Absolutely critical. Training these models involves tons and tons of 269 00:13:54.919 --> 00:13:59.120 matrix multiplications. GPUs are designed for parallel processing and have 270 00:13:59.200 --> 00:14:02.480 high memory bandwidth, making them orders of magnitude faster 271 00:14:02.559 --> 00:14:05.799 than traditional CPUs for this kind of work. Training deep 272 00:14:05.840 --> 00:14:10.519 models without GPUs would be practically impossible, or at least incredibly 273 00:14:09.960 --> 00:14:12.000 slow. And sometimes you need more than one 274 00:14:11.840 --> 00:14:15.159 GPU for really big models or data sets. 
Yes, you 275 00:14:15.240 --> 00:14:18.039 might use data parallelism, where you split the data across 276 00:14:18.120 --> 00:14:21.039 multiple GPUs, each training a copy of the model, or 277 00:14:21.080 --> 00:14:24.120 even model parallelism, where different parts of the neural network 278 00:14:24.120 --> 00:14:27.159 itself are spread across different GPUs because the whole model 279 00:14:27.200 --> 00:14:28.039 is too big to fit 280 00:14:27.960 --> 00:14:30.440 on one. Okay, that gives a much clearer picture of 281 00:14:30.480 --> 00:14:33.720 the training process and challenges. So we've got the basics, 282 00:14:33.919 --> 00:14:37.519 the depth, the training. Now let's dive into some specific 283 00:14:37.799 --> 00:14:40.960 types of networks. The book talks about architectures designed for 284 00:14:40.960 --> 00:14:42.080 different kinds of data. 285 00:14:42.200 --> 00:14:46.600 Right, exactly. Neural networks are incredibly versatile, partly because we 286 00:14:46.639 --> 00:14:50.639 can design specialized architectures. Let's start with probably the most 287 00:14:50.639 --> 00:14:54.960 famous one for images, convolutional neural networks, or CNNs. Right, 288 00:14:54.840 --> 00:14:57.080 the ones inspired by the visual cortex? 289 00:14:57.240 --> 00:15:00.399 Loosely, yes. The key idea in CNNs is how 290 00:15:00.399 --> 00:15:03.679 they process spatial data like images. They typically work with 291 00:15:03.759 --> 00:15:07.960 layers that have three dimensions: height, width, and depth. Depth 292 00:15:08.000 --> 00:15:10.399 here refers to the number of channels, like red, green, 293 00:15:10.480 --> 00:15:12.840 blue in the input, or different feature maps in the 294 00:15:12.879 --> 00:15:16.320 hidden layers. The core operation is the convolution. You have 295 00:15:16.360 --> 00:15:18.799 these small filters. You can think of them as pattern detectors. 
296 00:15:19.039 --> 00:15:21.840 Maybe one looks for vertical edges, another for horizontal edges, 297 00:15:21.879 --> 00:15:25.559 another for a specific texture. These filters slide across the input 298 00:15:25.600 --> 00:15:28.000 image, or the feature map from the previous layer, and 299 00:15:28.039 --> 00:15:32.360 compute activations. Where the filter finds its specific pattern, it 300 00:15:32.399 --> 00:15:35.320 produces a strong activation in the output feature map. 301 00:15:35.279 --> 00:15:38.159 So each filter creates its own map, highlighting where it 302 00:15:38.240 --> 00:15:38.799 found its 303 00:15:38.639 --> 00:15:42.840 pattern. Precisely, and a key aspect is parameter sharing. The 304 00:15:42.879 --> 00:15:45.919 same filter is used across the entire image, which makes 305 00:15:45.919 --> 00:15:49.879 CNNs efficient and helps them recognize patterns regardless of where 306 00:15:49.879 --> 00:15:53.519 they appear. These convolutional layers are usually paired with ReLU 307 00:15:53.639 --> 00:15:57.279 activations and then often followed by pooling layers. Max 308 00:15:57.320 --> 00:16:00.720 pooling is common. It downsamples the feature map, making it 309 00:16:00.759 --> 00:16:04.200 smaller by taking the maximum value in small regions. This 310 00:16:04.279 --> 00:16:07.320 helps reduce computation and makes the learned features more robust 311 00:16:07.360 --> 00:16:09.879 to small shifts or distortions. 312 00:16:09.320 --> 00:16:11.320 And these are the networks behind image 313 00:16:11.000 --> 00:16:16.799 recognition? Absolutely. Image classification, object detection, CNNs have driven huge 314 00:16:16.840 --> 00:16:20.120 breakthroughs there. The book mentions some landmark architectures that came 315 00:16:20.159 --> 00:16:23.039 out of research and competitions. 
There was AlexNet, which 316 00:16:23.039 --> 00:16:25.559 really kicked off the deep learning revolution in images around 317 00:16:25.559 --> 00:16:29.200 twenty twelve. Then ZFNet improved on it. GoogLeNet 318 00:16:29.200 --> 00:16:32.559 introduced these clever inception modules that process features at different 319 00:16:32.559 --> 00:16:36.720 scales simultaneously and reduce the number of parameters, and ResNet, 320 00:16:36.799 --> 00:16:40.919 or residual networks, introduced skip connections. Skip connections? Yeah, they 321 00:16:40.919 --> 00:16:43.679 allow the gradient information to flow more easily through very 322 00:16:43.679 --> 00:16:47.720 deep networks by creating shortcuts, essentially letting the signal bypass 323 00:16:47.759 --> 00:16:51.720 some layers. This allowed researchers to train networks with hundreds, 324 00:16:51.759 --> 00:16:53.120 even over one thousand layers. 325 00:16:53.200 --> 00:16:57.120 Wow. Okay, so CNNs are for images. What about data 326 00:16:57.159 --> 00:17:00.080 that comes in sequences, like text or speech, where the 327 00:17:00.159 --> 00:17:01.039 order is critical? 328 00:17:01.159 --> 00:17:04.519 That's the domain of recurrent neural networks, or RNNs. Their 329 00:17:04.519 --> 00:17:08.000 defining feature is a kind of memory. They process sequences 330 00:17:08.000 --> 00:17:11.200 step by step, and at each step the output depends 331 00:17:11.279 --> 00:17:13.680 not only on the current input, but also on a 332 00:17:13.720 --> 00:17:17.079 hidden state that summarizes information from previous 333 00:17:16.720 --> 00:17:18.759 steps. So they remember what came before? 334 00:17:19.359 --> 00:17:22.400 In a sense, yes. You can visualize an RNN as 335 00:17:22.400 --> 00:17:25.319 having a loop. The hidden state from one time step 336 00:17:25.359 --> 00:17:27.519 feeds back into the network at the next time step. 
337 00:17:28.119 --> 00:17:30.400 A useful way to think about it, especially for training, 338 00:17:30.480 --> 00:17:34.160 is to unfurl or unroll this loop over time. It 339 00:17:34.200 --> 00:17:37.039 looks like a very deep feed-forward network, but with 340 00:17:37.079 --> 00:17:40.079 a crucial difference. The same set of weights is used 341 00:17:40.119 --> 00:17:43.000 at every single time step. This weight sharing is key 342 00:17:43.000 --> 00:17:45.200 for learning patterns that apply across the sequence. 343 00:17:45.240 --> 00:17:46.759 What's a typical use case? 344 00:17:46.839 --> 00:17:49.799 Language modeling is a classic one, predicting the next word 345 00:17:49.839 --> 00:17:53.039 in a sentence. The book mentions a cool example by 346 00:17:53.119 --> 00:17:56.839 Andrej Karpathy, who trained an RNN character by character on 347 00:17:56.920 --> 00:18:00.559 Shakespeare's plays. After just a few training iterations, it 348 00:18:00.599 --> 00:18:04.440 produced complete gibberish, but after many more iterations it started 349 00:18:04.519 --> 00:18:09.839 generating text that looked syntactically like Shakespeare: correctly spelled words, punctuation, 350 00:18:10.319 --> 00:18:13.720 line breaks. Even though the meaning was nonsensical, it showed 351 00:18:13.720 --> 00:18:16.160 the RNN was learning the structure of the language. 352 00:18:16.400 --> 00:18:19.319 That's pretty cool. But do RNNs have issues too, like 353 00:18:19.359 --> 00:18:20.480 the gradient problems? 354 00:18:20.519 --> 00:18:24.240 Oh, definitely. Those vanishing and exploding gradients we talked about 355 00:18:24.240 --> 00:18:27.640 are a major problem for basic RNNs, especially when dealing 356 00:18:27.640 --> 00:18:31.200 with long sequences. Trying to propagate information over many time 357 00:18:31.240 --> 00:18:34.200 steps is difficult. 
This led to the development of more 358 00:18:34.240 --> 00:18:37.960 sophisticated recurrent units, most famously the long short term memory, 359 00:18:38.400 --> 00:18:39.880 or LSTM. LSTM. 360 00:18:39.920 --> 00:18:40.960 Heard of that one, yeah. 361 00:18:41.160 --> 00:18:43.519 LSTMs are a type of RNN cell designed 362 00:18:43.559 --> 00:18:47.200 specifically to combat the vanishing gradient problem and capture long 363 00:18:47.279 --> 00:18:52.440 range dependencies. They have internal mechanisms called gates: an input gate, 364 00:18:52.720 --> 00:18:55.640 a forget gate, and an output gate, and a separate 365 00:18:55.680 --> 00:18:58.400 cell state that acts like a conveyor belt for information. 366 00:18:59.279 --> 00:19:01.799 These gates learn to control what information is added to 367 00:19:01.880 --> 00:19:05.119 the cell state, what's removed, and what affects the output 368 00:19:05.160 --> 00:19:08.319 at each step. It allows them to maintain important information 369 00:19:08.440 --> 00:19:12.519 over much longer periods. More recently, things like layer normalization 370 00:19:12.839 --> 00:19:14.599 have also helped improve RNN stability. 371 00:19:14.400 --> 00:19:19.039 So LSTMs are better at remembering long term patterns? 372 00:19:18.599 --> 00:19:21.440 Much better, generally speaking, and they've been crucial for many 373 00:19:21.440 --> 00:19:25.759 applications: machine translation, often using an encoder-decoder structure where 374 00:19:25.759 --> 00:19:28.119 one RNN reads the input sequence and another generates the 375 00:19:28.119 --> 00:19:32.759 output sequence. Google Translate uses this heavily. Also building conversational 376 00:19:32.759 --> 00:19:36.240 AI systems, chatbots, doing things like named entity recognition in 377 00:19:36.319 --> 00:19:40.759 text, like identifying names or locations, and even powering recommender systems.
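The gating mechanism described above can be sketched with a one-unit LSTM step in plain Python. This follows the commonly used LSTM update (input, forget and output gates plus a candidate value), reduced to scalars. The parameter names in the dictionary, the helper `make_params`, and the bias values of plus or minus 10 are hypothetical and chosen only to force the gates nearly open or nearly closed.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps hypothetical names to weights and biases."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate: what to add
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate: what to keep
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate: what to expose
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate content
    c = f * c_prev + i * g   # cell state: the "conveyor belt"
    h = o * math.tanh(c)     # hidden state passed to the next step
    return h, c

def run_lstm(inputs, p, h0=0.0, c0=0.0):
    h, c = h0, c0
    for x in inputs:
        h, c = lstm_step(x, h, c, p)
    return h, c

def make_params(b_i, b_f):
    """Illustrative parameters: all zero except the candidate weight and two gate biases."""
    names = ("w_i", "u_i", "b_i", "w_f", "u_f", "b_f",
             "w_o", "u_o", "b_o", "w_g", "u_g", "b_g")
    p = {name: 0.0 for name in names}
    p["w_g"] = 1.0
    p["b_i"] = b_i
    p["b_f"] = b_f
    return p

keep = make_params(b_i=-10.0, b_f=10.0)     # forget gate near 1: memory is carried
forget = make_params(b_i=-10.0, b_f=-10.0)  # forget gate near 0: memory is wiped
_, c_keep = run_lstm([1.0] * 100, keep, c0=1.0)
_, c_forget = run_lstm([1.0] * 100, forget, c0=1.0)
print(c_keep, c_forget)
```

In a trained network the gate values are learned from data rather than fixed by hand. The two hand-set configurations only show that a forget gate near one lets the cell state survive 100 steps almost unchanged, while a forget gate near zero erases it.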
378 00:19:40.960 --> 00:19:46.039 Okay, CNNs for space, RNNs and LSTMs for time or sequence. 379 00:19:46.920 --> 00:19:50.799 What if the goal is different, like compressing data or 380 00:19:50.920 --> 00:19:52.720 finding a new way to represent it? 381 00:19:52.720 --> 00:19:55.720 That's where autoencoders come into play. The fundamental idea 382 00:19:55.920 --> 00:19:58.920 is pretty elegant. An autoencoder is a neural network 383 00:19:58.960 --> 00:20:00.599 trained to reconstruct its own input. 384 00:20:00.880 --> 00:20:03.440 Reconstruct its input? What's the point of that? 385 00:20:03.839 --> 00:20:06.720 Ah, the trick is in the middle. The network usually 386 00:20:06.759 --> 00:20:09.960 has a bottleneck layer, a hidden layer with fewer neurons 387 00:20:10.000 --> 00:20:13.920 than the input or output layers. To successfully reconstruct the input, 388 00:20:14.000 --> 00:20:16.839 the network is forced to learn a compressed representation, a 389 00:20:16.880 --> 00:20:19.839 sort of code, in that bottleneck layer. It has to 390 00:20:19.839 --> 00:20:22.079 figure out the most essential features of the data to 391 00:20:22.119 --> 00:20:24.759 squeeze it through the bottleneck and then reconstruct it. They're 392 00:20:24.759 --> 00:20:26.400 sometimes called replicator networks. 393 00:20:25.920 --> 00:20:30.720 So it's learning a compressed version, like dimensionality reduction? Exactly. 394 00:20:31.079 --> 00:20:35.079 Basic autoencoders with a linear activation function essentially learn 395 00:20:35.160 --> 00:20:39.799 the same subspace as principal component analysis, PCA, but the 396 00:20:39.839 --> 00:20:42.400 real power comes when you make them deep: autoencoders 397 00:20:42.720 --> 00:20:47.200 with multiple hidden layers and nonlinear activation functions like ReLU.
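The bottleneck idea and the link to PCA in the linear case can be illustrated with a very small sketch: a linear autoencoder that squeezes 2-D points through a single number and is trained by plain gradient descent. Everything here is an assumption made for the sketch, including the toy data on the line y = 2x, the function names, the starting weights, the learning rate and the step count.

```python
def reconstruct(x, w, v):
    """Encode a 2-D point into one number (the bottleneck), then decode it back."""
    z = w[0] * x[0] + w[1] * x[1]
    return [v[0] * z, v[1] * z], z

def mean_squared_error(points, w, v):
    total = 0.0
    for x in points:
        x_hat, _ = reconstruct(x, w, v)
        total += sum((x_hat[j] - x[j]) ** 2 for j in range(2))
    return total / len(points)

def train_linear_autoencoder(points, steps=2000, lr=0.05):
    w = [0.5, 0.5]  # encoder weights, 2 inputs to 1 code (illustrative start)
    v = [0.5, 0.5]  # decoder weights, 1 code to 2 outputs (illustrative start)
    n = len(points)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_v = [0.0, 0.0]
        for x in points:
            x_hat, z = reconstruct(x, w, v)
            err = [x_hat[j] - x[j] for j in range(2)]
            dz = sum(2.0 * err[j] * v[j] for j in range(2))  # error signal at the code
            for j in range(2):
                grad_v[j] += 2.0 * err[j] * z / n
                grad_w[j] += dz * x[j] / n
        for j in range(2):
            w[j] -= lr * grad_w[j]
            v[j] -= lr * grad_v[j]
    return w, v

# Toy data lying exactly on the line y = 2x, so one number per point is enough.
points = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
initial_error = mean_squared_error(points, [0.5, 0.5], [0.5, 0.5])
w, v = train_linear_autoencoder(points)
final_error = mean_squared_error(points, w, v)
print(initial_error, final_error, v[1] / v[0])
```

On this toy data the decoder weights end up pointing along the direction (1, 2), which is the principal direction PCA would find for the same points. A deep, nonlinear autoencoder would replace the two linear maps with stacks of layers and nonlinear activations.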
398 00:20:47.519 --> 00:20:51.359 These can learn much more complex nonlinear transformations of the data, 399 00:20:51.400 --> 00:20:55.359 effectively disentangling data that might lie on a complicated manifold. 400 00:20:54.960 --> 00:20:56.440 Better than something like PCA? 401 00:20:56.519 --> 00:21:00.680