WEBVTT 1 00:00:00.120 --> 00:00:02.839 Welcome to the deep dive. Our mission is pretty simple. 2 00:00:03.319 --> 00:00:06.879 You give us the source material and we jump right 3 00:00:06.919 --> 00:00:09.679 in to pull out the essential knowledge, basically giving you 4 00:00:09.720 --> 00:00:11.599 a shortcut to getting up to speed on a topic. 5 00:00:11.759 --> 00:00:15.080 Exactly. We dig into the pages, the research, whatever you 6 00:00:15.119 --> 00:00:17.719 send us, and we extract those key insights, maybe some 7 00:00:17.760 --> 00:00:20.320 surprising details, and the stuff you can actually use. 8 00:00:20.559 --> 00:00:24.519 And for this deep dive, we're tackling excerpts from Applied 9 00:00:24.640 --> 00:00:28.440 Deep Learning: A Case-Based Approach to Understanding Deep Neural 10 00:00:28.480 --> 00:00:30.879 Networks by Umberto Michelucci. 11 00:00:30.760 --> 00:00:33.320 Right, and this source material, it really gets into the 12 00:00:33.399 --> 00:00:37.439 nuts and bolts: how neural networks function, how they actually learn, 13 00:00:37.799 --> 00:00:40.640 and then the practical side, like building and training them 14 00:00:40.759 --> 00:00:42.200 using tools like TensorFlow. 15 00:00:42.600 --> 00:00:45.039 So our goal here is simple: walk you through these 16 00:00:45.119 --> 00:00:49.520 core deep learning ideas straight from this applied angle, making 17 00:00:49.520 --> 00:00:52.079 it hopefully clear and digestible. Let's get started. 18 00:00:52.159 --> 00:00:54.000 Okay, let's start where the book does, really at the 19 00:00:54.000 --> 00:00:56.600 foundation: computational graphs. 20 00:00:56.560 --> 00:00:58.399 Right, before you even talk about neurons, the book 21 00:00:58.439 --> 00:01:01.600 sets it up with this idea. A computational graph is, well, 22 00:01:01.640 --> 00:01:04.079 it's just a way to map out and organize calculations. Right. 23 00:01:04.120 --> 00:01:05.879 You define the steps, how the data 
24 00:01:05.680 --> 00:01:09.719 flows. And that's precisely how libraries like TensorFlow work. You 25 00:01:11.200 --> 00:01:16.040 build this graph, defining all the operations: additions, multiplications, activation functions, whatever. 26 00:01:16.319 --> 00:01:19.879 Then TensorFlow takes that graph and runs it, executing everything 27 00:01:20.000 --> 00:01:20.640 very efficiently. 28 00:01:20.920 --> 00:01:23.879 Okay, so what are the basic building blocks for these 29 00:01:23.920 --> 00:01:26.319 graphs in TensorFlow, according to the source? Well, 30 00:01:26.239 --> 00:01:28.480 the most fundamental thing is the tensor itself. You can 31 00:01:28.519 --> 00:01:31.560 think of it basically as a multidimensional array, very much 32 00:01:31.599 --> 00:01:35.519 like NumPy arrays, actually, and its rank just tells you 33 00:01:35.560 --> 00:01:38.040 how many dimensions it has. Rank zero is a single number, 34 00:01:38.120 --> 00:01:40.239 rank one is a vector, rank two a matrix, and 35 00:01:40.280 --> 00:01:40.640 so on. 36 00:01:40.799 --> 00:01:44.519 Got it. Multidimensional arrays holding the data. But what about 37 00:01:44.519 --> 00:01:47.799 the pieces that do the work or change during training? 38 00:01:47.920 --> 00:01:50.159 Ah, right. So you have different kinds of nodes. There's 39 00:01:50.319 --> 00:01:53.879 tf.Variable. These are for parameters that the network 40 00:01:53.879 --> 00:01:57.000 needs to update as it learns. Think weights and biases, 41 00:01:57.040 --> 00:01:58.159 the classic examples. 42 00:01:58.359 --> 00:02:02.200 They vary. Makes sense, they vary. What about tf.placeholder, then? 43 00:02:02.519 --> 00:02:05.120 Placeholders are different. They're like entry points into the graph. 44 00:02:05.319 --> 00:02:08.400 You use them to feed in data from outside when 45 00:02:08.439 --> 00:02:11.159 you actually run the calculation. 
They hold values that are 46 00:02:11.199 --> 00:02:14.800 fixed during one run, but you might change them between runs. 47 00:02:14.840 --> 00:02:17.879 Like feeding in a batch of training data, or maybe 48 00:02:17.919 --> 00:02:18.800 the learning rate. 49 00:02:18.800 --> 00:02:22.479 Exactly. Input data batches are the prime example, or maybe 50 00:02:22.479 --> 00:02:24.000 a learning rate you're setting manually. 51 00:02:24.159 --> 00:02:28.080 Okay, so variables change during a run, placeholders get new 52 00:02:28.159 --> 00:02:31.919 values between runs, and tf.constant? 53 00:02:31.960 --> 00:02:33.800 That sounds easy. It's just for a value that stays 54 00:02:33.840 --> 00:02:35.800 the same, always, never changes. 55 00:02:35.560 --> 00:02:38.599 And TensorFlow runs this whole graph thing using something called 56 00:02:38.599 --> 00:02:39.120 a session. 57 00:02:39.360 --> 00:02:42.479 That's right. You define the graph structure first, then you 58 00:02:42.520 --> 00:02:45.479 need a TensorFlow session to actually execute the operations in 59 00:02:45.479 --> 00:02:48.280 that graph. The book makes a distinction between session.run 60 00:02:48.360 --> 00:02:51.759 and tensor.eval. session.run lets you execute 61 00:02:51.800 --> 00:02:54.639 specific nodes you list, whereas eval is more like a 62 00:02:54.639 --> 00:02:57.319 shortcut you call directly on a tensor or variable. It 63 00:02:57.400 --> 00:03:00.120 just runs that specific thing within the current session and 64 00:03:00.560 --> 00:03:01.639 gives you its value back. 65 00:03:01.840 --> 00:03:05.400 So we've got the structure for doing calculations. Now, what does 66 00:03:05.439 --> 00:03:09.159 the book say is the smallest unit in deep learning? 67 00:03:09.319 --> 00:03:12.639 That would be the single neuron. 
It's the fundamental building block, 68 00:03:12.719 --> 00:03:16.360 kind of inspired by biological neurons, but much simpler. As 69 00:03:16.400 --> 00:03:19.199 the book describes it, it takes several numerical inputs, your data 70 00:03:19.240 --> 00:03:23.400 features usually, does some processing, and spits out a single number. 71 00:03:23.759 --> 00:03:28.479 And that processing involves multiplying inputs by weights, adding a bias, 72 00:03:29.360 --> 00:03:30.919 and then hitting it with an activation function. 73 00:03:31.039 --> 00:03:34.120 Precisely. The weights give importance to different inputs, the bias 74 00:03:34.240 --> 00:03:37.680 shifts the result, and that activation function, that's super important. 75 00:03:38.080 --> 00:03:42.080 It introduces nonlinearity, because if you just stacked layers of 76 00:03:42.120 --> 00:03:45.039 linear operations, the whole network would still just be doing 77 00:03:45.080 --> 00:03:49.199 a linear transformation. Nonlinearity lets it learn complex stuff. 78 00:03:49.680 --> 00:03:52.000 What activation functions does the book focus on? 79 00:03:52.280 --> 00:03:55.240 It covers some key ones. There's the sigmoid function, which squashes 80 00:03:55.280 --> 00:03:58.000 everything between zero and one, historically used a lot for, 81 00:03:58.319 --> 00:04:02.199 like, binary classification. Then the identity function, which is just 82 00:04:02.360 --> 00:04:06.080 linear: output equals input. And the really common one now, 83 00:04:06.240 --> 00:04:10.240 ReLU, the rectified linear unit, where the output is 84 00:04:10.360 --> 00:04:13.319 just the input if it's positive and zero if it's negative. 85 00:04:13.599 --> 00:04:15.240 Simple but works very well. 86 00:04:15.439 --> 00:04:17.720 Now the book points out a really practical, kind of 87 00:04:17.720 --> 00:04:21.079 tricky thing with sigmoid, doesn't it? Something that can cause problems? 88 00:04:21.360 --> 00:04:27.560 Ah, 
yeah, this is a classic theory versus practice issue. Mathematically, 89 00:04:27.680 --> 00:04:30.759 sigmoid gets really close to zero or one, but never 90 00:04:30.839 --> 00:04:35.079 quite touches them. But computers use floating-point numbers. So 91 00:04:35.480 --> 00:04:39.240 for really big positive or negative inputs, the result can 92 00:04:39.279 --> 00:04:41.439 actually get rounded to exactly zero or one. 93 00:04:41.680 --> 00:04:43.759 And why is that a problem? Where does it bite you? 94 00:04:44.079 --> 00:04:47.279 It bites you when you calculate the cost function. Especially 95 00:04:47.279 --> 00:04:50.399 in classification, you often need the logarithm of the output 96 00:04:50.519 --> 00:04:52.839 or the log of one minus the output. If the output is exactly zero 97 00:04:52.920 --> 00:04:57.319 or one, you're trying to calculate log zero, which is undefined. 98 00:04:57.279 --> 00:05:00.519 Leading to those NaN values, not a number. Exactly. 99 00:05:00.600 --> 00:05:02.759 If you see NaN popping up in your training loss, that's 100 00:05:02.800 --> 00:05:05.519 often a clue. Could be the sigmoid issue, maybe related 101 00:05:05.519 --> 00:05:08.160 to data scaling or initial weights being too large. It's 102 00:05:08.160 --> 00:05:09.120 a debugging flag. 103 00:05:09.319 --> 00:05:13.519 That's a super useful tip: watch for NaNs. Another practical 104 00:05:13.519 --> 00:05:17.240 point the book makes is about speed, right? Computational efficiency. 105 00:05:17.560 --> 00:05:21.439 Absolutely. It has this great comparison. It shows implementing something 106 00:05:21.600 --> 00:05:25.759 like ReLU using NumPy's built-in matrix operations versus 107 00:05:25.920 --> 00:05:28.000 just writing a standard Python for loop. 108 00:05:28.360 --> 00:05:31.399 I think I remember seeing that graphic. The difference was huge, 109 00:05:31.439 --> 00:05:32.319 wasn't it? Massive. 
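The loop-versus-vectorized comparison mentioned here can be sketched roughly as follows. This is a minimal illustration and not the book's own code: the function names `relu_loop` and `relu_vectorized` and the array size are assumptions made for the example, and the measured speedup depends on the machine.

```python
import time
import numpy as np

def relu_loop(x):
    """ReLU computed element by element in a plain Python for loop."""
    out = np.empty_like(x)
    for i in range(x.size):
        out[i] = x[i] if x[i] > 0 else 0.0
    return out

def relu_vectorized(x):
    """ReLU computed for the whole array in one vectorized NumPy call."""
    return np.maximum(x, 0.0)

# Hypothetical test input: half a million standard-normal values.
rng = np.random.default_rng(0)
x = rng.standard_normal(500_000)

start = time.perf_counter()
slow = relu_loop(x)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = relu_vectorized(x)
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f} s, vectorized: {vectorized_seconds:.4f} s")
```

Both functions return the same values; only the time taken differs, because the vectorized call hands the whole array to compiled code instead of interpreting one element at a time.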
110 00:05:32.399 --> 00:05:34.879 Something like one hundred times faster in their example for 111 00:05:34.959 --> 00:05:37.879 a big array. Wow. And it really drives home why 112 00:05:37.879 --> 00:05:41.600 we use libraries like NumPy or TensorFlow. They push these 113 00:05:41.600 --> 00:05:45.240 operations down to low-level code like C and use vectorization. 114 00:05:45.720 --> 00:05:48.480 They process chunks of data all at once, which is 115 00:05:48.680 --> 00:05:52.879 way faster than Python looping through element by element. Understanding 116 00:05:52.920 --> 00:05:55.920 that efficiency is key to why deep learning scales. 117 00:05:56.160 --> 00:05:58.680 Okay, so we have the neuron, the basic unit, but 118 00:05:58.759 --> 00:06:02.920 how does it, or a whole network of them, actually learn anything? Right. 119 00:06:02.959 --> 00:06:05.800 So, learning here means finding the best possible values for 120 00:06:05.839 --> 00:06:09.040 the network's parameters, the weights and biases. Best means the 121 00:06:09.120 --> 00:06:12.000 values that make the network's predictions match the true answers 122 00:06:12.040 --> 00:06:13.360 as closely as possible. 123 00:06:13.600 --> 00:06:15.920 You measure that closeness using the cost function. 124 00:06:16.079 --> 00:06:18.560 Precisely. The cost function gives you a number that says 125 00:06:18.600 --> 00:06:22.480 how wrong the network is. Lower cost means better performance 126 00:06:22.480 --> 00:06:23.639 on the training data. 127 00:06:23.680 --> 00:06:27.199 And the main algorithm for lowering that cost is gradient 128 00:06:27.240 --> 00:06:28.120 descent. Yep. 129 00:06:28.160 --> 00:06:31.360 Gradient descent is the workhorse, yep. It works by calculating 130 00:06:31.439 --> 00:06:35.040 the gradient, basically the slope of the cost function, with 131 00:06:35.120 --> 00:06:38.279 respect to each weight and bias. 
Then it adjusts the 132 00:06:38.319 --> 00:06:42.519 parameters slightly in the opposite direction of the gradient. It's 133 00:06:42.519 --> 00:06:45.360 like taking a small step downhill on the cost landscape, 134 00:06:45.639 --> 00:06:47.920 always trying to find the lowest point. And 135 00:06:47.879 --> 00:06:49.000 the size of that step? 136 00:06:49.360 --> 00:06:52.560 That's the learning rate, exactly. The learning rate, often written 137 00:06:52.600 --> 00:06:56.920 as gamma or alpha, is a really critical hyperparameter. It 138 00:06:57.000 --> 00:06:59.560 dictates how big a step you take downhill each time. 139 00:07:00.000 --> 00:07:03.759 A small learning rate means tiny, maybe cautious steps. A large 140 00:07:03.839 --> 00:07:07.120 learning rate means big, bold steps, which sounds good. 141 00:07:07.199 --> 00:07:10.040 But that's where the quirks come in, as the book 142 00:07:10.079 --> 00:07:11.839 puts it. What happens if it's too big? 143 00:07:11.920 --> 00:07:14.800 If it's too big, you can overshoot the minimum point 144 00:07:14.920 --> 00:07:17.079 in the cost landscape. You jump right over it. You 145 00:07:17.120 --> 00:07:19.959 might end up bouncing back and forth, oscillating around the minimum, 146 00:07:20.040 --> 00:07:23.079 or even flying off entirely and diverging. The cost gets 147 00:07:23.079 --> 00:07:24.079 worse instead of better. 148 00:07:24.319 --> 00:07:26.720 Yeah, I can picture that, like rolling a ball down 149 00:07:26.759 --> 00:07:28.680 a hill too fast and it rolls right across the 150 00:07:28.759 --> 00:07:30.000 valley and up the other side. 151 00:07:30.079 --> 00:07:32.959 That's a good analogy. Finding that just-right learning rate 152 00:07:33.040 --> 00:07:35.480 is often one of the first big challenges when you're 153 00:07:35.560 --> 00:07:36.319 training a network. 
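The effect of the learning rate can be seen on a toy problem. The sketch below is an illustration only, not taken from the book: it assumes a one-parameter cost J(w) = w², and the function name `gradient_descent`, the starting point, and the three learning rates are all hypothetical choices.

```python
def gradient_descent(learning_rate, steps=20, w0=5.0):
    """Minimize the toy cost J(w) = w**2 and return every value w takes."""
    w = w0
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w                 # dJ/dw for J(w) = w**2
        w = w - learning_rate * grad   # step against the gradient
        path.append(w)
    return path

small = gradient_descent(0.01)  # tiny steps: still far from the minimum after 20 updates
good = gradient_descent(0.3)    # moderate steps: lands very close to the minimum at w = 0
large = gradient_descent(1.1)   # too big: jumps across the minimum and grows each step

for name, path in [("small", small), ("good", good), ("large", large)]:
    print(f"{name:>5}: final w = {path[-1]:.6g}")
```

With the largest rate, w flips sign on every update and its magnitude keeps increasing, which is the overshoot-and-diverge behaviour described in the conversation.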
154 00:07:36.439 --> 00:07:42.040 Okay, so individual neurons learn by minimizing cost with gradient descent, 155 00:07:42.759 --> 00:07:45.399 but the real power comes when you connect lots of 156 00:07:45.399 --> 00:07:47.920 them together in feedforward neural networks. 157 00:07:47.959 --> 00:07:50.519 That's right. You arrange neurons in layers. You have an 158 00:07:50.519 --> 00:07:53.160 input layer, one or more hidden layers in the middle, 159 00:07:53.240 --> 00:07:56.279 and then an output layer. And in a standard fully 160 00:07:56.279 --> 00:08:00.519 connected network, every neuron in one layer passes its output 161 00:08:00.639 --> 00:08:03.040 to every neuron in the very next layer. And 162 00:08:03.000 --> 00:08:05.920 the calculations just flow forward layer by layer, which is 163 00:08:05.920 --> 00:08:09.000 why those matrix operations we talked about are so useful, right? 164 00:08:09.040 --> 00:08:10.360 Processing a whole layer at 165 00:08:10.240 --> 00:08:14.360 once. Exactly. That equation, Z = WX + b, that's not just 166 00:08:14.399 --> 00:08:16.920 one neuron. W is a matrix of all weights for 167 00:08:16.959 --> 00:08:19.920 the layer, X is a matrix of all inputs or 168 00:08:19.959 --> 00:08:23.000 previous layer outputs for a whole batch, and b is 169 00:08:23.040 --> 00:08:25.759 the bias vector. It calculates everything for the layer in 170 00:08:25.800 --> 00:08:27.720 one go, super efficient. But 171 00:08:27.800 --> 00:08:30.879 building these bigger, deeper networks introduces a huge challenge. The 172 00:08:30.879 --> 00:08:33.120 book really digs into overfitting. 173 00:08:33.360 --> 00:08:37.000 Oh yeah. Overfitting is a constant concern. It's when your 174 00:08:37.000 --> 00:08:40.080 model gets too good at the training data. 
It doesn't 175 00:08:40.120 --> 00:08:43.279 just learn the underlying patterns, it starts memorizing the specific 176 00:08:43.360 --> 00:08:46.279 training examples, including all the random noise and quirks. So 177 00:08:46.200 --> 00:08:48.720 it aces the practice test but fails the real 178 00:08:48.600 --> 00:08:52.000 exam. Perfect analogy. It performs great on data it's seen, 179 00:08:52.080 --> 00:08:54.960 but poorly on new, unseen data because it didn't learn 180 00:08:54.960 --> 00:08:55.919 the general rules. 181 00:08:56.000 --> 00:08:59.519 And the opposite is underfitting, or high bias, where the 182 00:08:59.519 --> 00:09:02.080 model is too simple; it can't even capture the training 183 00:09:02.159 --> 00:09:03.080 data patterns well. 184 00:09:03.159 --> 00:09:06.159 Right, and the book stresses that the very first step 185 00:09:06.320 --> 00:09:09.799 in fighting overfitting is being able to spot it, which 186 00:09:09.879 --> 00:09:11.480 means you have to split your 187 00:09:11.440 --> 00:09:14.840 data into a training set and a development set, or 188 00:09:14.879 --> 00:09:16.440 a validation set. Exactly. 189 00:09:16.480 --> 00:09:18.840 You train the model only on the training set, but 190 00:09:19.120 --> 00:09:22.000 periodically you check its performance on the development, or dev, set, 191 00:09:22.000 --> 00:09:23.159 which it hasn't been trained on. 192 00:09:23.320 --> 00:09:25.799 And if the training error keeps going down but the 193 00:09:25.840 --> 00:09:28.600 dev error stops improving or starts going 194 00:09:28.440 --> 00:09:30.879 up, bingo, that's your alarm bell. The model is starting 195 00:09:30.879 --> 00:09:33.679 to overfit the training data. The dev set acts like 196 00:09:33.720 --> 00:09:35.080 your early warning system. 197 00:09:35.399 --> 00:09:38.720 Now, going back to gradient descent for training these networks, 198 00:09:38.840 --> 00:09:40.000 there are different flavors of 
199 00:09:39.960 --> 00:09:43.200 it. Yes, because using the entire dataset for every 200 00:09:43.240 --> 00:09:46.279 single weight update, that's called batch gradient descent, can be 201 00:09:46.320 --> 00:09:49.879 incredibly slow and memory intensive for large datasets. Batch 202 00:09:49.919 --> 00:09:53.600 GD gives you a very accurate gradient estimate, but the 203 00:09:53.720 --> 00:09:55.399 updates are infrequent. 204 00:09:55.080 --> 00:09:58.279 So the alternative is stochastic gradient descent, or SGD. 205 00:09:58.559 --> 00:10:01.759 Right. SGD goes to the other extreme: it updates the 206 00:10:01.759 --> 00:10:04.799 weights after looking at just one training example. This makes 207 00:10:04.799 --> 00:10:07.919 the updates very fast and frequent, but also very noisy, 208 00:10:08.000 --> 00:10:12.080 or stochastic. The path towards the minimum jumps around a lot. 209 00:10:12.399 --> 00:10:15.840 That noise can sometimes help it escape shallow local minima, though. 210 00:10:15.759 --> 00:10:18.120 And the most common approach sits in the middle: mini-batch 211 00:10:18.200 --> 00:10:20.120 gradient descent. Exactly. 212 00:10:20.480 --> 00:10:22.799 This is what people usually mean by SGD nowadays, even 213 00:10:22.799 --> 00:10:25.799 though it's technically mini-batch. You calculate the gradient and update 214 00:10:25.840 --> 00:10:28.120 the weights based on a small batch, maybe thirty-two, 215 00:10:28.320 --> 00:10:31.679 sixty-four, a hundred and twenty-eight examples. It's a compromise. 216 00:10:31.960 --> 00:10:35.320 You get smoother convergence than pure SGD, but much faster 217 00:10:35.399 --> 00:10:39.320 updates than batch GD. It leverages matrix operations efficiently. 218 00:10:39.120 --> 00:10:41.360 And that mini-batch size is another one of those 219 00:10:41.399 --> 00:10:42.559 hyperparameters you have to 220 00:10:42.519 --> 00:10:46.600 choose. Yep, and the book clarifies terminology. 
An iteration is 221 00:10:46.679 --> 00:10:49.879 usually one pass through a mini-batch and one weight update, 222 00:10:50.320 --> 00:10:53.639 and an epoch is one full pass through the entire training 223 00:10:53.720 --> 00:10:56.320 dataset, so many iterations per epoch. 224 00:10:56.399 --> 00:11:00.360 The book also mentions something about starting weights, weight initialization, 225 00:11:00.480 --> 00:11:01.919 being important. Very important. 226 00:11:02.120 --> 00:11:04.759 It's not just setting them to zero, which causes problems. 227 00:11:04.960 --> 00:11:08.320 How you initialize them can seriously affect how quickly, or 228 00:11:08.399 --> 00:11:12.039 even if, the network trains successfully. Bad initialization can lead 229 00:11:12.080 --> 00:11:16.240 to exploding gradients, getting huge, or vanishing gradients, getting tiny, 230 00:11:16.799 --> 00:11:18.559 or those NaN values again. 231 00:11:18.799 --> 00:11:20.000 So what does the book suggest? 232 00:11:20.399 --> 00:11:23.639 It often uses something like tf.truncated_normal with a 233 00:11:23.679 --> 00:11:27.480 small standard deviation, maybe zero point one. This draws initial 234 00:11:27.480 --> 00:11:30.840 weights from a normal distribution but cuts off extreme values. 235 00:11:31.480 --> 00:11:33.960 The idea is to start with small random weights to 236 00:11:34.000 --> 00:11:36.320 break symmetry but avoid large values 237 00:11:36.360 --> 00:11:40.759 initially. Let's talk architecture. Why are deeper networks, like with 238 00:11:40.879 --> 00:11:44.440 multiple hidden layers, often better than just one really wide 239 00:11:44.480 --> 00:11:45.080 hidden layer? 
240 00:11:45.320 --> 00:11:48.840 Well, empirically, deeper networks often seem to need fewer neurons 241 00:11:48.840 --> 00:11:51.440 in total to get the same level of performance as 242 00:11:51.480 --> 00:11:55.159 a very wide but shallow network, but perhaps more importantly, 243 00:11:55.240 --> 00:11:59.159 they often generalize better. The thinking is that layers learn 244 00:11:59.279 --> 00:12:02.480 features hierarchically. How so? Like, the first layer might 245 00:12:02.559 --> 00:12:05.960 learn simple things like edges or corners from pixels, the 246 00:12:06.000 --> 00:12:09.440 next layer combines those into shapes, the layer after that 247 00:12:09.480 --> 00:12:13.000 combines shapes into objects, and so on. It builds up complexity. 248 00:12:13.279 --> 00:12:16.480 So potentially a more sophisticated understanding of the data. But 249 00:12:16.519 --> 00:12:19.399 the book is clear, right? There's no magic formula for 250 00:12:19.480 --> 00:12:21.000 the number of layers or neurons. 251 00:12:21.279 --> 00:12:24.559 Absolutely not. It's very much problem dependent. Finding the right 252 00:12:24.679 --> 00:12:28.759 architecture usually involves a lot of trial-and-error experimentation, 253 00:12:28.960 --> 00:12:32.399 maybe drawing on architectures known to work well for similar problems. 254 00:12:32.559 --> 00:12:36.559 Okay, we've got network structure, learning algorithms, how to spot overfitting. 255 00:12:37.080 --> 00:12:41.440 What about making the training itself better, faster, more reliable? 256 00:12:41.600 --> 00:12:44.399 One key area is tweaking the learning rate during training 257 00:12:44.759 --> 00:12:48.399 instead of just fixing it. Using learning rate decay is common. 258 00:12:48.159 --> 00:12:50.559 So starting higher and then reducing it over time? 
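Starting with a higher learning rate and reducing it over time can be sketched with two widely used schedules, inverse time decay and exponential decay. The formulas below are common textbook forms and may differ in detail from the ones the book uses; the function names, parameter names, and example values are assumptions for illustration.

```python
def inverse_time_decay(initial_rate, decay_rate, iteration):
    """Learning rate that shrinks as 1 / (1 + decay_rate * iteration)."""
    return initial_rate / (1.0 + decay_rate * iteration)

def exponential_decay(initial_rate, decay_rate, iteration, decay_steps):
    """Learning rate multiplied by decay_rate once every decay_steps iterations."""
    return initial_rate * decay_rate ** (iteration / decay_steps)

# Hypothetical settings: start at 0.1 and inspect a few iteration counts.
for iteration in (0, 100, 1000, 5000):
    inv = inverse_time_decay(0.1, decay_rate=0.001, iteration=iteration)
    exp = exponential_decay(0.1, decay_rate=0.5, iteration=iteration, decay_steps=1000)
    print(f"iteration {iteration:>4}: inverse time = {inv:.5f}, exponential = {exp:.5f}")
```

Both schedules are driven by the iteration counter, so the rate changes after every weight update rather than only once per epoch.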
259 00:12:50.879 --> 00:12:54.440 Exactly. You might start with a relatively large learning rate 260 00:12:54.559 --> 00:12:57.240 to make quick progress when you're far from the solution. 261 00:12:58.120 --> 00:13:00.399 Then as the training goes on and you get closer 262 00:13:00.399 --> 00:13:03.440 to the minimum, you gradually decrease the learning rate to 263 00:13:03.519 --> 00:13:07.399 take smaller, finer steps. This helps avoid that oscillation we 264 00:13:07.440 --> 00:13:10.039 talked about and allows for more precise convergence. 265 00:13:10.480 --> 00:13:11.919 What are common ways to decay it? 266 00:13:12.519 --> 00:13:15.200 The book mentions things like inverse time decay or 267 00:13:15.279 --> 00:13:19.399 exponential decay, where the rate decreases smoothly over training iterations. 268 00:13:19.480 --> 00:13:22.039 It's usually tied to the iteration count, not just the 269 00:13:22.080 --> 00:13:22.919 epoch count. 270 00:13:23.000 --> 00:13:27.000 And then there are fancier optimization algorithms beyond just basic 271 00:13:27.000 --> 00:13:28.360 gradient descent with decay. 272 00:13:28.440 --> 00:13:31.000 Oh yes. These aim to speed up training and make 273 00:13:31.039 --> 00:13:33.559 it more robust. Many of them rely on the idea 274 00:13:33.600 --> 00:13:35.440 of exponentially weighted averages. 275 00:13:35.600 --> 00:13:37.120 Okay, what's the intuition there? 276 00:13:37.279 --> 00:13:40.159 Instead of just using the gradient from the current mini-batch, 277 00:13:40.200 --> 00:13:42.879 which can be noisy, these methods keep a running average 278 00:13:42.879 --> 00:13:46.200 of recent gradients. This average smooths out the noise and 279 00:13:46.240 --> 00:13:49.159 gives a better estimate of the true downhill direction. 
It 280 00:13:49.240 --> 00:13:52.200 helps the optimizer build up momentum to get through flat 281 00:13:52.240 --> 00:13:55.519 regions or damp down oscillations in narrow valleys of the 282 00:13:55.559 --> 00:13:56.200 cost function. 283 00:13:56.519 --> 00:13:58.120 So it's like smoothing out the bumps in the road, 284 00:13:58.559 --> 00:14:02.360 and that leads to optimizers like Momentum, RMSProp, 285 00:14:02.159 --> 00:14:05.519 Adam. Exactly those. Momentum adds a fraction of the previous 286 00:14:05.639 --> 00:14:09.039 update step to the current one. RMSProp adapts the learning 287 00:14:09.080 --> 00:14:12.039 rate for each parameter individually based on the average size 288 00:14:12.039 --> 00:14:15.120 of recent gradients for that parameter, and Adam, as the 289 00:14:15.159 --> 00:14:18.000 source suggests, kind of combines the ideas of momentum and 290 00:14:18.159 --> 00:14:22.080 RMSProp. It's often the default go-to optimizer because 291 00:14:22.120 --> 00:14:24.279 it tends to work well across a wide range of 292 00:14:24.320 --> 00:14:28.279 problems with relatively little tuning. Usually faster and better, 293 00:14:28.279 --> 00:14:30.879 the book says. Now, let's circle back to fighting overfitting. 294 00:14:31.039 --> 00:14:34.480 We mentioned the train/dev split. What about techniques built 295 00:14:34.480 --> 00:14:37.960 into the training process itself? Regularization? Right. 296 00:14:38.080 --> 00:14:42.200 Regularization methods are specifically designed to prevent overfitting and help 297 00:14:42.240 --> 00:14:45.039 the model generalize better to data it hasn't seen before. 298 00:14:45.159 --> 00:14:48.039 The book talks about L2 and L1 regularization. 299 00:14:48.519 --> 00:14:49.240 What's the difference? 300 00:14:49.480 --> 00:14:52.440 Both work by adding a penalty term to the cost function. 301 00:14:53.240 --> 00:14:56.159 This penalty is based on the size of the network's weights. 
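Adding a weight-based penalty to the cost can be sketched as below. This is a minimal illustration under stated assumptions: the L2 penalty is taken as the sum of squared weights and the L1 penalty as the sum of absolute weights, each scaled by a strength `lam`. Scaling conventions vary (some texts divide by the number of examples), and the function names and sample values are hypothetical.

```python
def l2_penalty(weights, lam):
    """L2 term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    """L1 term: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)

def regularized_cost(base_cost, weights, lam, penalty):
    """Cost the optimizer actually minimizes: data cost plus weight penalty."""
    return base_cost + penalty(weights, lam)

# Hypothetical weights and an unregularized cost of 0.40.
weights = [0.5, -2.0, 0.0, 3.0]
print(regularized_cost(0.40, weights, lam=0.01, penalty=l2_penalty))
print(regularized_cost(0.40, weights, lam=0.01, penalty=l1_penalty))
```

Because the squared term grows faster for large weights, the L2 penalty here is dominated by the weight of 3.0, while the L1 penalty charges every nonzero weight in proportion to its size.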
302 00:14:56.679 --> 00:15:00.440 L2 regularization, sometimes called weight decay, adds a penalty 303 00:15:00.480 --> 00:15:03.159 proportional to the sum of the squares of all the weights. 304 00:15:04.000 --> 00:15:07.600 It pushes weights towards zero, but not usually exactly zero. 305 00:15:08.159 --> 00:15:11.919 It encourages smaller weights overall, making the model simpler. And 306 00:15:12.240 --> 00:15:15.919 L1? L1 regularization adds a penalty proportional to the 307 00:15:15.919 --> 00:15:19.039 sum of the absolute values of the weights. It also 308 00:15:19.120 --> 00:15:22.200 pushes weights towards zero, but because of the math involved, 309 00:15:22.360 --> 00:15:24.799 the shape of the penalty function, it tends to make 310 00:15:24.879 --> 00:15:27.039 many weights exactly zero. So 311 00:15:26.960 --> 00:15:30.600 it leads to sparser models where some connections are effectively 312 00:15:30.600 --> 00:15:31.720 turned off? Exactly. 313 00:15:32.240 --> 00:15:34.840 L1 can be useful for feature selection in a 314 00:15:34.879 --> 00:15:37.960 way, because it zeroes out weights for less important inputs. 315 00:15:38.000 --> 00:15:40.320 Then there's dropout, which sounds completely different. 316 00:15:40.440 --> 00:15:42.879 It is quite different. Yeah, dropout is a very clever 317 00:15:43.000 --> 00:15:46.360 and widely used technique. During each training iteration, you randomly 318 00:15:46.440 --> 00:15:49.840 drop out, temporarily remove, a fraction of the neurons in 319 00:15:49.879 --> 00:15:51.039 certain layers. 320 00:15:50.840 --> 00:15:52.559 Just randomly ignore them for that update. 321 00:15:52.840 --> 00:15:56.320 Yep, for that one mini-batch calculation, those neurons and 322 00:15:56.360 --> 00:16:00.480 their connections are just gone. In the next iteration, a 323 00:16:00.519 --> 00:16:02.279 different random set might be dropped. 324 00:16:02.320 --> 00:16:03.039 How does that help? 
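Randomly dropping units for one update can be sketched with a mask, as below. This is an illustration rather than the book's implementation: it assumes the common "inverted dropout" variant, which rescales the surviving activations during training, and the function name `dropout_forward`, the `keep_prob` parameter, and the sample layer shape are hypothetical.

```python
import numpy as np

def dropout_forward(activations, keep_prob, rng):
    """Apply dropout to one layer's activations for a single training step."""
    # Each unit survives with probability keep_prob; a new mask is drawn per call.
    mask = rng.random(activations.shape) < keep_prob
    # Inverted dropout (assumed variant): rescale survivors so the expected
    # activation stays the same as without dropout.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 5))  # hypothetical activations: 4 examples, 5 hidden units

first = dropout_forward(hidden, keep_prob=0.5, rng=rng)
second = dropout_forward(hidden, keep_prob=0.5, rng=rng)
print(first)
print(second)
```

Each call draws a fresh mask, so a different set of units is zeroed on each iteration. At evaluation time dropout is normally switched off, which here corresponds to `keep_prob=1.0`.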
325 00:16:03.279 --> 00:16:06.080 It prevents the network from becoming too reliant on any 326 00:16:06.120 --> 00:16:10.399 single neuron or specific pathway. Since any neuron might disappear, 327 00:16:10.480 --> 00:16:14.519 the network is forced to learn more robust, redundant representations. 328 00:16:14.919 --> 00:16:17.679 It's kind of like training a large ensemble of slightly 329 00:16:17.720 --> 00:16:19.240 different networks all at once. 330 00:16:19.320 --> 00:16:22.399 Yeah, that makes sense. Forces redundancy. The source notes that 331 00:16:22.480 --> 00:16:24.360 can make the training cost jump around 332 00:16:24.159 --> 00:16:27.360 a bit more, though. Yes, because the network structure is 333 00:16:27.600 --> 00:16:31.320 literally changing slightly on every iteration due to the randomness. 334 00:16:31.840 --> 00:16:34.799 So the training metric might look a bit noisier, but 335 00:16:34.879 --> 00:16:37.720 it often leads to much better generalization on the dev 336 00:16:37.799 --> 00:16:38.600 and test sets. 337 00:16:38.840 --> 00:16:42.840 Okay, so we've trained our model, applied regularization. How do 338 00:16:42.919 --> 00:16:46.679 we really know if it's any good? Evaluation seems critical. 339 00:16:46.440 --> 00:16:50.000 Absolutely crucial, and just looking at training error isn't enough. 340 00:16:50.200 --> 00:16:53.399 The book brings up human-level performance, HLP, and Bayes 341 00:16:53.559 --> 00:16:57.559 error. For tasks humans are good at, like recognizing 342 00:16:57.559 --> 00:17:01.639 images or transcribing speech, HLP can be a practical estimate 343 00:17:01.679 --> 00:17:05.519 for the theoretical best possible error. The Bayes 344 00:17:05.640 --> 00:17:09.599 error is the irreducible error rate. No model, however perfect, 345 00:17:09.720 --> 00:17:12.359 could do better, due to inherent ambiguity or noise in 346 00:17:12.400 --> 00:17:13.279 the data itself. 
347 00:17:13.400 --> 00:17:15.759 So knowing the HLP gives you a target, like what's 348 00:17:15.799 --> 00:17:16.839 potentially achievable? 349 00:17:16.960 --> 00:17:20.799 Exactly. If human error on a task is, say, one percent, 350 00:17:20.839 --> 00:17:23.240 and your model has ten percent error, you know there's 351 00:17:23.319 --> 00:17:25.200 likely a lot of room for improvement. If your model 352 00:17:25.240 --> 00:17:27.400 is at one point five percent, maybe you're getting close 353 00:17:27.440 --> 00:17:30.799 to the limit. The book uses MNIST digit recognition, where 354 00:17:30.960 --> 00:17:33.319 HLP is cited around zero point two percent error. 355 00:17:33.359 --> 00:17:36.799 Okay, HLP, or Bayes error, is the theoretical floor. How do 356 00:17:36.839 --> 00:17:39.480 we diagnose our model's specific shortcomings? 357 00:17:39.519 --> 00:17:42.319 The book introduces a simple framework called the metric analysis 358 00:17:42.319 --> 00:17:45.359 diagram, or MAD. It helps you pinpoint where the error 359 00:17:45.400 --> 00:17:47.079 is coming from by looking at different gaps. 360 00:17:47.119 --> 00:17:48.279 Let's walk through those gaps. 361 00:17:48.400 --> 00:17:53.559 Okay. First gap: bias, or sometimes avoidable bias. This is 362 00:17:53.599 --> 00:17:57.160 the difference between the Bayes error, or HLP, and your 363 00:17:57.240 --> 00:17:59.880 training error. If this gap is large, it means your 364 00:17:59.839 --> 00:18:03.519 model isn't even fitting the training data well. It's likely 365 00:18:03.559 --> 00:18:08.160 too simple, underfitting, or the training algorithm itself isn't finding 366 00:18:08.160 --> 00:18:08.839 a good solution. 367 00:18:09.039 --> 00:18:11.640 Okay, so bias is about performance on data it's already 368 00:18:11.680 --> 00:18:15.039 seen relative to the best possible. What's the next gap? 369 00:18:15.400 --> 00:18:18.240 Variance. 
This is the difference between your training error and 370 00:18:18.279 --> 00:18:20.720 your development set error. If your training error is low 371 00:18:20.880 --> 00:18:23.319 but your dev error is much higher, that's a classic 372 00:18:23.359 --> 00:18:27.079 sign of overfitting. The model learned the training data specifics, 373 00:18:27.079 --> 00:18:29.680 but isn't generalizing. High variance. And 374 00:18:29.640 --> 00:18:31.839 there's potentially a third gap mentioned? 375 00:18:31.559 --> 00:18:33.839 Yes, overfitting on the dev set. This is the gap 376 00:18:33.880 --> 00:18:36.279 between your dev set error and your error on a completely 377 00:18:36.279 --> 00:18:39.880 separate test set. If you tune your hyperparameters extensively based 378 00:18:39.880 --> 00:18:43.079 on the dev set results, you might inadvertently make your model 379 00:18:43.079 --> 00:18:45.799 perform well specifically on that dev set, but it might 380 00:18:45.839 --> 00:18:48.240 not generalize as well to totally new data. 381 00:18:48.279 --> 00:18:50.799 Ah, so you've sort of used up the dev set 382 00:18:50.880 --> 00:18:54.599 for unbiased evaluation by tuning on it too much. That's 383 00:18:54.640 --> 00:18:56.839 why you need that final untouched 384 00:18:56.400 --> 00:18:59.119 test set. Precisely. Keep the test set sacred until the 385 00:18:59.240 --> 00:19:01.599 very end for a final honest assessment. 386 00:19:01.759 --> 00:19:04.799 This all really highlights how crucial that initial data split 387 00:19:04.920 --> 00:19:07.000 is: train, dev, 388 00:19:07.039 --> 00:19:11.480 test. And the book emphasizes a critical point: your dev 389 00:19:11.519 --> 00:19:14.599 and test sets must reflect the real-world data distribution 390 00:19:14.680 --> 00:19:16.720 your model will actually see. 391 00:19:16.480 --> 00:19:18.559 What kinds of problems happen if they don't? Well, 392 00:19:18.599 --> 00:19:21.759 a big one is unbalanced classes. 
The book mentions examples 393 00:19:21.839 --> 00:19:25.279 like detecting rare fraud or maybe identifying only certain digits 394 00:19:25.279 --> 00:19:28.079 in MNIST. If, say, fraud is only zero point one 395 00:19:28.079 --> 00:19:29.880 percent of your real data, but your dev set is 396 00:19:29.920 --> 00:19:33.079 balanced fifty-fifty, your dev set accuracy won't tell you 397 00:19:33.119 --> 00:19:36.200 how the model does on the real skewed distribution. Right, 398 00:19:36.079 --> 00:19:38.559 because getting ninety-nine point nine percent accuracy by just 399 00:19:38.599 --> 00:19:41.720 always predicting "not fraud" would look great on the real data, 400 00:19:41.759 --> 00:19:44.759 but terrible on the balanced dev set, or vice versa. 401 00:19:44.799 --> 00:19:48.039 