WEBVTT 1 00:00:00.160 --> 00:00:02.879 Welcome to the deep dive, your express lane to understanding 2 00:00:02.919 --> 00:00:06.719 what truly matters in today's most complex subjects. Forget digging 3 00:00:06.799 --> 00:00:10.199 through dense texts. We're cutting straight to the core of 4 00:00:10.240 --> 00:00:12.720 what you need to know today. We're diving into the 5 00:00:12.720 --> 00:00:17.160 fascinating world of deep learning, specifically using PyTorch. Our source 6 00:00:17.239 --> 00:00:21.480 material is a pretty comprehensive guide to PyTorch and our mission. Well, 7 00:00:21.679 --> 00:00:26.640 it's simple unpack these powerful concepts into clear, actionable insights 8 00:00:26.719 --> 00:00:27.079 for you. 9 00:00:27.359 --> 00:00:31.519 Yeah, and it's incredibly relevant right now. Deep learning isn't 10 00:00:31.519 --> 00:00:36.079 just theory anymore. It's actively reshaping entire industries. I think 11 00:00:36.079 --> 00:00:40.320 personalized medicine, autonomous cars. So understanding the fundamentals and how 12 00:00:40.399 --> 00:00:43.280 tools like PyTorch actually make it happen gives you a 13 00:00:43.320 --> 00:00:46.280 really critical perspective on the edge of AI. We're hoping 14 00:00:46.320 --> 00:00:49.399 to give you that foundational knowledge plus a practical feel 15 00:00:49.439 --> 00:00:50.320 for what's under the hood. 16 00:00:50.439 --> 00:00:53.640 Okay, let's unpack this. Then. When we talk about machine intelligence, 17 00:00:53.679 --> 00:00:57.560 you hear AI, machine learning, deep learning. They sometimes get 18 00:00:57.640 --> 00:00:59.799 used interchangeably. Can we clarify that relationship? 19 00:00:59.840 --> 00:01:02.240 For absolutely, it's helpful to think of it in layers. 20 00:01:02.600 --> 00:01:06.319 So artificial intelligence AI, that's the big overarching goal right, 21 00:01:06.400 --> 00:01:12.120 making machines intelligence grand ambition exactly now. 
Machine learning or 22 00:01:12.239 --> 00:01:14.959 mL is one major way to get there. It's where 23 00:01:15.000 --> 00:01:19.079 machines learn from data without being explicitly programmed for every 24 00:01:19.120 --> 00:01:19.879 single task. 25 00:01:20.079 --> 00:01:22.000 Okay, so they learn patterns right. 26 00:01:22.120 --> 00:01:26.000 And deep learning DL is a specific type within machine learning. 27 00:01:26.239 --> 00:01:29.480 It's a technique that's proven incredibly effective, especially for learning 28 00:01:29.519 --> 00:01:33.239 really complex patterns from unstructured data like images or sound. 29 00:01:33.439 --> 00:01:36.560 So AI is the goal, mL is the approach. DL 30 00:01:36.640 --> 00:01:38.519 is a powerful technique within that approach. 31 00:01:38.760 --> 00:01:41.719 You got it. And the reason DL has really taken 32 00:01:41.760 --> 00:01:45.120 off is its advantage in certain areas. Traditional mL often 33 00:01:45.159 --> 00:01:48.799 needs humans to carefully engineer features from the data first, 34 00:01:49.120 --> 00:01:51.959 like telling the algorithm what to look for in an image, 35 00:01:52.079 --> 00:01:54.519 a lot of manual work a ton. Deep learning kind 36 00:01:54.519 --> 00:01:57.560 of flips that the algorithm itself learns to extract the 37 00:01:57.599 --> 00:02:00.719 important features directly from the raw data builds up this 38 00:02:01.400 --> 00:02:03.959 hierarchical understanding in a non linear way. 39 00:02:04.079 --> 00:02:06.560 Nonlinear That's key, isn't it crucial? 40 00:02:07.000 --> 00:02:10.319 Because real world data isn't neat straight lines, and this 41 00:02:10.400 --> 00:02:13.159 ability to learn features means DL models tend to keep 42 00:02:13.159 --> 00:02:16.520 getting better the more data you give them traditional mL 43 00:02:16.560 --> 00:02:17.680 can sometimes plateau. 
44 00:02:18.080 --> 00:02:21.560 Ah, so scale really matters for deep learning performance, like 45 00:02:22.199 --> 00:02:24.919 more data equals significantly better results. 46 00:02:24.800 --> 00:02:29.159 Generally, yes, significantly. Think of it like learning a language. 47 00:02:29.639 --> 00:02:33.960 Traditional ML might learn vocabulary lists. Deep learning, with enough 48 00:02:34.000 --> 00:02:38.319 exposure, starts to grasp the grammar and nuance, the underlying structure. 49 00:02:38.439 --> 00:02:40.479 Okay, that makes sense. Now, if we want to build 50 00:02:40.479 --> 00:02:44.159 these systems, we need tools, frameworks, and that brings us 51 00:02:44.199 --> 00:02:46.919 to PyTorch. What makes it stand out? I've heard about 52 00:02:46.919 --> 00:02:48.599 this define-by-run thing. 53 00:02:48.479 --> 00:02:51.240 Right, that's a big one. So some other frameworks 54 00:02:51.360 --> 00:02:53.520 use a define-and-run approach. You first build this 55 00:02:53.680 --> 00:02:56.479 entire computation graph, like a blueprint, and then you run 56 00:02:56.560 --> 00:02:59.360 data through it. It's quite static. Okay. PyTorch is define-by-run. 57 00:02:59.360 --> 00:03:02.599 The computation graph gets built dynamically, on the 58 00:03:02.639 --> 00:03:05.319 fly, as your Python code executes. 59 00:03:05.000 --> 00:03:08.479 So it's more flexible, like you can change things midstream? Exactly. 60 00:03:08.560 --> 00:03:12.520 You can use standard Python loops, conditionals, print statements, debuggers. 61 00:03:12.960 --> 00:03:16.039 It feels much more like regular programming. This makes it 62 00:03:16.080 --> 00:03:20.319 super popular for research and rapid prototyping, where you're constantly experimenting. 63 00:03:20.439 --> 00:03:23.080 That sounds much more intuitive, especially if you're already comfortable 64 00:03:23.080 --> 00:03:23.599 with Python.
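To make the define-by-run idea concrete, here is a minimal sketch in which an ordinary Python `while` loop decides how many operations end up in the graph. The function name `dynamic_forward` and the numbers are illustrative assumptions, not something from the source guide.

```python
import torch


def dynamic_forward(x, weight):
    # Plain Python control flow decides how often the weight is applied,
    # so the graph autograd records can differ from one call to the next.
    steps = 0
    while x.norm() < 10 and steps < 5:
        x = x @ weight
        steps += 1
    return x.sum(), steps


torch.manual_seed(0)
weight = torch.randn(3, 3, requires_grad=True)
x = torch.ones(1, 3)

out, steps = dynamic_forward(x, weight)
out.backward()  # gradients follow whichever path actually executed
print(steps, weight.grad.shape)
```

Because the graph is recorded while the code runs, a `print` call or a debugger breakpoint inside the loop would work like in any other Python function.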
65 00:03:23.719 --> 00:03:27.080 It often is, and practically speaking, it's just a Python package. 66 00:03:27.159 --> 00:03:30.000 You install it with pip or conda. Easy setup. 67 00:03:29.719 --> 00:03:32.319 And GPUs, graphics cards? 68 00:03:32.319 --> 00:03:34.919 I hear they're important. Oh, absolutely crucial for serious work. 69 00:03:35.520 --> 00:03:40.919 Deep learning involves massive matrix multiplications, and GPUs, especially NVIDIA 70 00:03:40.960 --> 00:03:44.960 ones using CUDA, are specifically designed to crunch those numbers 71 00:03:44.960 --> 00:03:48.280 incredibly fast, way faster than a standard CPU. 72 00:03:48.479 --> 00:03:50.960 What if you don't have a powerful GPU sitting around? 73 00:03:51.120 --> 00:03:55.400 Cloud computing is your friend. Services like Google Cloud, AWS, Azure. 74 00:03:55.840 --> 00:03:58.280 They offer instances with powerful GPUs you can rent by 75 00:03:58.319 --> 00:03:59.759 the hour. Very accessible. 76 00:04:00.199 --> 00:04:02.039 Good to know. So let's get into the nitty gritty. 77 00:04:02.039 --> 00:04:05.199 If we're building something, what are the absolute core components? 78 00:04:05.400 --> 00:04:07.479 Starting with the neural network itself. 79 00:04:07.360 --> 00:04:11.319 Right. At its heart, a neural network is an algorithm 80 00:04:11.360 --> 00:04:15.159 designed to learn relationships. It maps input variables, like, say, 81 00:04:15.360 --> 00:04:18.560 pixels in an image, to some target output, like cat 82 00:04:18.680 --> 00:04:19.079 or dog. 83 00:04:19.839 --> 00:04:21.079 How does it learn that mapping? 84 00:04:21.839 --> 00:04:25.839 Let's take a simpler example. Yeah, predicting college admission. Your 85 00:04:25.879 --> 00:04:30.079 inputs might be GPA, GRE score, university rank. Okay. In 86 00:04:30.120 --> 00:04:33.839 the network, these inputs are connected to processing units sometimes 87 00:04:33.839 --> 00:04:37.680 called neurons.
Each connection has a weight, basically a number 88 00:04:37.680 --> 00:04:40.279 indicating how important that input is for the prediction. 89 00:04:40.160 --> 00:04:43.079 And the network learns these weights? Exactly. 90 00:04:43.319 --> 00:04:46.120 It learns them from the data. Inside a neuron, there 91 00:04:46.120 --> 00:04:50.120 are typically two main operations. First, a dot product, summing 92 00:04:50.199 --> 00:04:53.040 up all the inputs multiplied by their weights. It's like 93 00:04:53.120 --> 00:04:55.000 mixing the ingredients based on their importance. 94 00:04:55.079 --> 00:04:56.480 Okay, a weighted sum. 95 00:04:56.600 --> 00:05:01.439 Then, crucially, it applies a nonlinear transformation, an activation function. 96 00:05:01.560 --> 00:05:03.759 Why nonlinear? Why can't it just be linear? 97 00:05:03.920 --> 00:05:06.480 Because if you just stack linear operations, no matter how 98 00:05:06.519 --> 00:05:08.920 many layers you have, the whole thing is still just 99 00:05:08.959 --> 00:05:12.279 one big linear transformation. It can only learn straight-line relationships. 100 00:05:12.560 --> 00:05:16.720 Ah, and the real world is messy, not straight lines. Precisely. 101 00:05:17.319 --> 00:05:21.879 Think about recognizing a face or understanding language. Super complex 102 00:05:22.079 --> 00:05:26.040 nonlinear patterns. Those activation functions allow the network to learn 103 00:05:26.079 --> 00:05:29.360 these intricate curves and boundaries. Each layer builds on the 104 00:05:29.399 --> 00:05:31.959 previous one, learning more abstract features. 105 00:05:32.000 --> 00:05:36.399 And in PyTorch, how do you actually define these network structures? 106 00:05:36.759 --> 00:05:39.680 For simple feedforward networks, you can use 107 00:05:39.800 --> 00:05:42.480 torch.nn.Sequential. It lets you just list the layers one 108 00:05:42.519 --> 00:05:46.000 after another. Very straightforward. Okay.
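As a rough sketch of the two ideas just described, the snippet below first computes a single neuron by hand (weighted sum, then a nonlinearity) and then stacks layers with `torch.nn.Sequential`. The applicant values, the weights, the layer sizes, and the choice of sigmoid are all made-up assumptions for illustration.

```python
import torch
from torch import nn

# Hypothetical applicant: GPA, GRE score, university rank (made-up numbers).
inputs = torch.tensor([3.6, 320.0, 2.0])
weights = torch.tensor([0.5, 0.01, -0.3])  # illustrative values, not learned

# One neuron: a dot product (weighted sum) followed by a nonlinear activation.
weighted_sum = torch.dot(inputs, weights)
activation = torch.sigmoid(weighted_sum)

# The same pattern stacked into layers, listed one after another.
model = nn.Sequential(
    nn.Linear(3, 8),   # 3 inputs -> 8 hidden units
    nn.ReLU(),         # nonlinearity between the linear layers
    nn.Linear(8, 1),   # 8 hidden units -> 1 output
    nn.Sigmoid(),      # squash the output into (0, 1)
)
probability = model(inputs.unsqueeze(0))  # add a batch dimension: shape (1, 3)
print(activation.item(), probability.shape)
```

Without the `nn.ReLU()` in the middle, the two `nn.Linear` layers would collapse into a single linear map, which is the point made in the conversation.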
But for anything more complex, 109 00:05:46.040 --> 00:05:49.360 maybe networks with multiple inputs or outputs or custom connections, 110 00:05:49.600 --> 00:05:52.720 you'll typically define your own network class. You inherit from 111 00:05:53.040 --> 00:05:55.959 torch.nn.Module and define the layers in 112 00:05:56.000 --> 00:05:58.279 the __init__ method and how data flows through them in 113 00:05:58.319 --> 00:06:00.120 the forward method. Gives you total 114 00:06:00.079 --> 00:06:02.959 control. Right, more power, more flexibility. Now back to those 115 00:06:03.000 --> 00:06:06.240 nonlinear activation functions. You said they're crucial. What are 116 00:06:06.240 --> 00:06:06.920 some common ones? 117 00:06:07.000 --> 00:06:09.920 Yeah, there are a few main players. Historically, sigmoid was popular. 118 00:06:10.240 --> 00:06:13.240 It squashes any input value into a range between zero and 119 00:06:13.199 --> 00:06:16.920 one. Useful for probabilities, maybe, like binary classification. 120 00:06:17.959 --> 00:06:20.800 Exactly. Is it a cat, near one, or not, near 121 00:06:20.879 --> 00:06:23.439 zero? The problem is when the output is very 122 00:06:23.480 --> 00:06:26.879 close to zero or one, the gradient, the signal used 123 00:06:26.920 --> 00:06:30.560 for learning, becomes tiny. It basically stops learning. 124 00:06:30.519 --> 00:06:33.800 Ah, the vanishing gradient problem, leading to dead neurons. 125 00:06:33.920 --> 00:06:37.279 Precisely, parts of the network just stop updating. Then there's 126 00:06:37.399 --> 00:06:41.160 tanh, or hyperbolic tangent, similar to sigmoid, but it squashes 127 00:06:41.279 --> 00:06:42.839 values between minus one and one. 128 00:06:42.920 --> 00:06:45.079 Why is minus one to one better than zero to 129 00:06:44.959 --> 00:06:48.360 one? Because its output is zero-centered. This often helps 130 00:06:48.399 --> 00:06:51.319 the optimization process converge a bit faster and more reliably.
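The custom-class pattern described above (inherit from `torch.nn.Module`, declare layers in `__init__`, describe the data flow in `forward`) could look roughly like this. The class name `AdmissionNet` and the layer sizes are hypothetical.

```python
import torch
from torch import nn


class AdmissionNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, in_features=3, hidden=8):
        super().__init__()
        # Layers are declared once, in the constructor.
        self.hidden = nn.Linear(in_features, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        # forward() describes how data flows through those layers.
        x = torch.tanh(self.hidden(x))      # zero-centered activation
        return torch.sigmoid(self.out(x))   # output squashed into (0, 1)


model = AdmissionNet()
batch = torch.randn(4, 3)      # 4 made-up applicants, 3 features each
predictions = model(batch)     # calling the model runs forward()
print(predictions.shape)
```

Since `forward` is ordinary Python, it can branch, loop, or take several inputs, which is the flexibility `nn.Sequential` does not offer.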
131 00:06:51.839 --> 00:06:54.839 It's generally preferred over sigmoid in many cases. Okay, what 132 00:06:54.920 --> 00:06:58.399 else? The current crowd favorite really is ReLU, rectified 133 00:06:58.439 --> 00:07:01.240 linear unit. It's super simple. If the input is negative, 134 00:07:01.279 --> 00:07:03.920 the output is zero. If it's positive, the output is 135 00:07:03.959 --> 00:07:05.319 just the input value itself. 136 00:07:05.360 --> 00:07:07.519 That sounds really simple. Why is it so popular? 137 00:07:07.920 --> 00:07:11.079 It's computationally very cheap, much faster than sigmoid or tanh, 138 00:07:11.759 --> 00:07:14.639 and in practice it often helps networks learn faster because 139 00:07:14.639 --> 00:07:17.959 it doesn't saturate for positive values. But it can still die. 140 00:07:18.199 --> 00:07:21.319 If a neuron consistently gets negative input, it just outputs 141 00:07:21.399 --> 00:07:22.759 zero and the gradient becomes 142 00:07:22.600 --> 00:07:25.920 zero. So it has its own dying ReLU problem. 143 00:07:26.160 --> 00:07:29.680 It can, yeah, which led to variations like leaky ReLU. 144 00:07:30.560 --> 00:07:34.240 Instead of outputting zero for negative inputs, it outputs a 145 00:07:34.319 --> 00:07:38.160 very small multiple of the input, like point zero one times the input, 146 00:07:38.319 --> 00:07:40.759 just enough to keep it alive, basically. Exactly. Keeps the 147 00:07:40.800 --> 00:07:44.040 gradient flowing, prevents the neuron from completely dying off. 148 00:07:44.120 --> 00:07:47.160 Okay, so we have networks with layers, weights and these 149 00:07:47.199 --> 00:07:50.759 activation functions. How do we measure if it's actually learning 150 00:07:50.800 --> 00:07:54.240 the right thing? How do we quantify good or bad predictions? 151 00:07:54.480 --> 00:07:56.959 That's the job of the loss function, sometimes called a 152 00:07:57.000 --> 00:08:00.399 cost function or objective function.
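The four activation functions discussed so far can be compared side by side on a few sample values. This is only a sketch; the input values are arbitrary, and the `negative_slope=0.01` setting mirrors the "point zero one" mentioned in the conversation.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 3.0])

sig = torch.sigmoid(x)     # every value lands strictly between 0 and 1
tanh = torch.tanh(x)       # every value lands strictly between -1 and 1
relu = F.relu(x)           # negatives become 0, positives pass through
leaky = F.leaky_relu(x, negative_slope=0.01)  # negatives are scaled by 0.01

print(relu)    # values: 0.0, 0.0, 3.0
print(leaky)   # values: -0.02, 0.0, 3.0
```

Note that for a negative input, leaky ReLU returns a small negative number (here -0.02), so the gradient for that unit is 0.01 instead of 0.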
Its whole purpose is to 153 00:08:00.399 --> 00:08:02.920 take the network's predictions and compare them to the actual 154 00:08:03.000 --> 00:08:06.199 correct answers, the targets or labels, and spit out a 155 00:08:06.240 --> 00:08:07.800 single number, the loss. 156 00:08:07.680 --> 00:08:09.519 One number representing the error? Yep. 157 00:08:09.639 --> 00:08:12.000 A high loss means the predictions are bad. A low 158 00:08:12.040 --> 00:08:15.199 loss means they're good. The entire goal of training is 159 00:08:15.240 --> 00:08:17.000 to minimize this loss value. 160 00:08:17.079 --> 00:08:18.040 What are some examples? 161 00:08:18.199 --> 00:08:21.240 Well, if you're predicting a continuous value, like the price 162 00:08:21.279 --> 00:08:23.839 of a house or a T-shirt, like in the 163 00:08:23.879 --> 00:08:27.600 book's example, that's regression. A common loss function is mean 164 00:08:27.680 --> 00:08:31.959 squared error, MSE. It calculates the average of the squared 165 00:08:31.959 --> 00:08:34.960 differences between each prediction and the actual value. 166 00:08:35.159 --> 00:08:38.519 Squaring makes errors positive and penalizes larger errors 167 00:08:38.559 --> 00:08:43.080 more, right? Exactly. Now, for classification, deciding between categories like 168 00:08:43.120 --> 00:08:46.679 cat versus dog versus panda, you often use cross-entropy loss. 169 00:08:47.159 --> 00:08:50.679 It measures how different the network's predicted probability distribution is 170 00:08:50.720 --> 00:08:53.759 from the actual distribution, which is usually one for the 171 00:08:53.799 --> 00:08:55.440 correct class and zero for others. 172 00:08:55.679 --> 00:08:58.519 So if the network is very confident about the wrong class, 173 00:08:58.840 --> 00:09:01.240 the cross-entropy loss will be high, very high.
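A small numeric sketch of both loss functions, using made-up prices and made-up class scores. It assumes PyTorch's `nn.MSELoss` and `nn.CrossEntropyLoss`; the latter takes raw scores (logits) and applies the softmax internally.

```python
import torch
from torch import nn

# Regression: mean squared error between predicted and actual prices.
predicted_price = torch.tensor([10.0, 20.0])
actual_price = torch.tensor([12.0, 20.0])
mse = nn.MSELoss()(predicted_price, actual_price)  # ((-2)**2 + 0**2) / 2 = 2.0

# Classification: three classes (say cat, dog, panda); the true class is index 0.
target = torch.tensor([0])
confident_right = torch.tensor([[5.0, 0.0, 0.0]])  # high score on the true class
confident_wrong = torch.tensor([[0.0, 5.0, 0.0]])  # high score on a wrong class

cross_entropy = nn.CrossEntropyLoss()
low_loss = cross_entropy(confident_right, target)
high_loss = cross_entropy(confident_wrong, target)

print(mse.item(), low_loss.item(), high_loss.item())
```

The confidently wrong prediction produces a far larger loss than the confidently right one, which is the penalty behavior described in the conversation.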
174 00:09:01.320 --> 00:09:05.360 It heavily penalizes confident wrong answers, pushing the network towards 175 00:09:05.360 --> 00:09:07.759 predicting the correct class with high probability. 176 00:09:08.039 --> 00:09:10.639 Okay, so the loss function tells us how bad we are. 177 00:09:10.799 --> 00:09:13.000 How does the network use that information to get better? 178 00:09:13.279 --> 00:09:15.840 That's where optimizers come in. The loss function gives us 179 00:09:15.840 --> 00:09:18.799 the error signal. The optimizer is the algorithm that uses 180 00:09:18.840 --> 00:09:20.840 that signal to update the network's weights. 181 00:09:20.919 --> 00:09:22.960 It adjusts the knobs, basically. 182 00:09:22.759 --> 00:09:26.000 Precisely, it figures out how to adjust each weight to 183 00:09:26.039 --> 00:09:29.120 reduce the loss. The most basic one is stochastic 184 00:09:29.120 --> 00:09:33.120 gradient descent, SGD, but there are more advanced ones 185 00:09:33.200 --> 00:09:36.639 like Adam or RMSprop that often converge faster and 186 00:09:36.720 --> 00:09:39.759 more reliably by adapting the learning rate for each weight. 187 00:09:40.320 --> 00:09:42.679 And in PyTorch, how does that training loop look? You 188 00:09:42.720 --> 00:09:45.759 mentioned steps like zero_grad, backward, step. Right. 189 00:09:45.799 --> 00:09:48.759 It's a cycle for each batch of data. One, you 190 00:09:48.799 --> 00:09:51.039 feed the data forward through the network to get predictions. 191 00:09:51.200 --> 00:09:55.200 Two, you calculate the loss using your chosen loss function. Three, crucially, 192 00:09:55.320 --> 00:09:58.879 you call optimizer.zero_grad(). This clears out any old 193 00:09:58.879 --> 00:10:02.759 gradient calculations from the previous batch. Very important. Four, you 194 00:10:02.799 --> 00:10:06.600 call loss.backward().
This is where PyTorch automatically calculates 195 00:10:06.600 --> 00:10:09.279 the gradients, how much each weight contributed to the loss, 196 00:10:09.360 --> 00:10:13.720 using backpropagation. Five, finally, you call optimizer.step(). This 197 00:10:13.840 --> 00:10:16.000 tells the optimizer to update the weights based on the 198 00:10:16.000 --> 00:10:17.279 gradients it just calculated. 199 00:10:17.320 --> 00:10:22.480 Predict, calculate loss, clear old gradients, calculate new gradients, update weights, repeat. 200 00:10:22.679 --> 00:10:25.080 That's the essence of training a neural network. You do 201 00:10:25.120 --> 00:10:27.480 this over and over, batch after batch, epoch after 202 00:10:27.559 --> 00:10:30.480 epoch, until the loss is low and the network performs well. 203 00:10:30.639 --> 00:10:34.600 Okay, before we get into specific applications like vision or language, 204 00:10:34.679 --> 00:10:37.960 we need to talk about how PyTorch actually handles the data. 205 00:10:38.399 --> 00:10:40.039 You mentioned tensors earlier. 206 00:10:40.200 --> 00:10:44.000 Yes, tensors are the absolute fundamental data structure in PyTorch. 207 00:10:44.600 --> 00:10:49.039 You can think of them as multidimensional arrays, like NumPy 208 00:10:49.080 --> 00:10:53.080 arrays, but with superpowers, especially acceleration on GPUs. 209 00:10:53.279 --> 00:10:55.320 Multidimensional? What does that mean exactly? 210 00:10:55.480 --> 00:10:58.279 It refers to the tensor's order or number of dimensions. 211 00:10:58.759 --> 00:11:01.559 A single number like five is a scalar, a tensor 212 00:11:01.559 --> 00:11:04.480 of order zero. A list of numbers like one, two, 213 00:11:04.559 --> 00:11:07.559 three is a vector, order one. A grid of numbers 214 00:11:07.559 --> 00:11:09.960 like a spreadsheet table is a matrix, order two, and 215 00:11:10.039 --> 00:11:12.360 you can keep going.
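The five-step training cycle described above (predict, calculate loss, clear old gradients, calculate new gradients, update weights) can be sketched as follows. The toy data, the single linear layer, the learning rate, and the batch size are assumptions chosen only to keep the example small.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic regression data (an assumption for this sketch): y = 2x + 1.
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 2 * x + 1

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(200):
    for start in range(0, len(x), 16):       # mini-batches of 16 samples
        xb, yb = x[start:start + 16], y[start:start + 16]
        pred = model(xb)                     # 1. forward pass
        loss = loss_fn(pred, yb)             # 2. compute the loss
        optimizer.zero_grad()                # 3. clear old gradients
        loss.backward()                      # 4. backpropagate
        optimizer.step()                     # 5. update the weights
    losses.append(loss.item())               # loss of the last batch in the epoch

print(losses[0], losses[-1])
```

Swapping `torch.optim.SGD` for `torch.optim.Adam` or `torch.optim.RMSprop` would leave the loop itself unchanged; only the update rule inside `optimizer.step()` differs.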
An image might be order three: 216 00:11:12.799 --> 00:11:15.840 height, width, color channels, and a batch of images would 217 00:11:15.840 --> 00:11:18.679 be order four: batch size, height, width, channels. 218 00:11:18.919 --> 00:11:21.039 So the order tells you how many indices you need 219 00:11:21.080 --> 00:11:22.559 to access a specific element. 220 00:11:22.720 --> 00:11:24.960 Exactly. To get element twenty one twenty two from that 221 00:11:25.039 --> 00:11:27.960 hypothetical fourth-order tensor, you'd use four indices, like 222 00:11:28.039 --> 00:11:30.960 my_tensor[1, 0, 1, 1]. The number of indices always 223 00:11:30.960 --> 00:11:32.000 matches the tensor's order. 224 00:11:32.279 --> 00:11:34.159 How do you know the shape or size of a tensor? 225 00:11:34.320 --> 00:11:38.279 You use the .size() method or the .shape attribute. It 226 00:11:38.320 --> 00:11:40.879 returns a tuple telling you the length of each dimension. 227 00:11:41.360 --> 00:11:44.759 For instance, a batch of thirty two images, each two 228 00:11:44.799 --> 00:11:47.039 hundred and twenty four by two hundred twenty four pixels 229 00:11:47.240 --> 00:11:50.080 with three color channels, would have a shape of thirty 230 00:11:50.080 --> 00:11:51.960 two, two twenty four, two twenty four, three. Can you 231 00:11:52.039 --> 00:11:55.399 change the shape? Yes, using methods like .view() or 232 00:11:55.519 --> 00:11:58.799 .reshape(). This lets you rearrange the elements into a 233 00:11:58.799 --> 00:12:02.240 different configuration without changing the total number of elements or 234 00:12:02.279 --> 00:12:05.639 the underlying data. It's really useful for, for example, flattening an 235 00:12:05.639 --> 00:12:08.200 image before feeding it into a simple linear 236 00:12:07.919 --> 00:12:10.399 layer. And it has that handy negative one trick. 237 00:12:10.519 --> 00:12:12.919 Yeah, if you specify negative one for one dimension,
238 00:12:13.240 --> 00:12:16.879 PyTorch automatically calculates its size based on the total number 239 00:12:16.879 --> 00:12:19.360 of elements and the sizes of the other dimensions you provided. 240 00:12:19.600 --> 00:12:23.159 Super convenient. And remember, .view() usually returns a new 241 00:12:23.200 --> 00:12:25.960 tensor sharing the same data. It doesn't modify the original 242 00:12:25.960 --> 00:12:26.279 in place, 243 00:12:26.320 --> 00:12:29.000 typically. What about basic math? Adding? Multiplying? 244 00:12:29.080 --> 00:12:33.960 Tensors support all the standard element-wise operations: addition, subtraction, multiplication, division. 245 00:12:34.240 --> 00:12:35.679 They work just like you'd expect of arrays. 246 00:12:35.759 --> 00:12:38.519 Any gotchas there? The book mentions something about division. 247 00:12:38.840 --> 00:12:42.440 Ah, yes, data types. If you have a tensor of integers, 248 00:12:42.679 --> 00:12:46.559 say torch.tensor([5, 3]), which defaults to 249 00:12:46.600 --> 00:12:49.519 int64, and you divide five by three element-wise, 250 00:12:49.840 --> 00:12:52.960 you might get one because it performs integer division. Right, 251 00:12:53.080 --> 00:12:56.240 truncates the decimal. Exactly. To get the floating point result, 252 00:12:56.240 --> 00:12:58.799 like one point six six six, you need to make 253 00:12:58.840 --> 00:13:00.919 sure at least one of the tensors has a 254 00:13:00.919 --> 00:13:03.840 floating point dtype like torch.float32. 255 00:13:04.440 --> 00:13:06.960 You can specify the dtype when creating the tensor 256 00:13:07.080 --> 00:13:09.639 or cast it later. Always be mindful of your data 257 00:13:09.639 --> 00:13:10.720 types. Makes sense. 258 00:13:11.000 --> 00:13:13.879 Tensors really seem like the core way PyTorch handles all 259 00:13:13.960 --> 00:13:16.919 numerical data, from inputs to weights to gradients.
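The tensor basics covered above (order, shape, indexing, reshaping with -1, and the integer-division gotcha) can be tried out in a few lines. This is a sketch that assumes a reasonably recent PyTorch release, where `/` on integer tensors performs true division and the truncating behavior has to be requested explicitly through `torch.div(..., rounding_mode="trunc")`.

```python
import torch

scalar = torch.tensor(5)                # order 0
vector = torch.tensor([1, 2, 3])        # order 1
matrix = torch.zeros(2, 3)              # order 2
batch = torch.zeros(32, 224, 224, 3)    # order 4, the shape from the episode

print(scalar.dim(), vector.dim(), matrix.dim(), batch.dim())
print(batch.shape)                      # length of each dimension

element = batch[1, 0, 1, 1]             # four indices for an order-4 tensor

# Flatten each image; -1 is inferred as 224 * 224 * 3 = 150528.
flat = batch.view(32, -1)
print(flat.shape)

ints = torch.tensor([5, 3])             # dtype defaults to torch.int64
truncated = torch.div(ints[0], ints[1], rounding_mode="trunc")

floats = torch.tensor([5, 3], dtype=torch.float32)
true_quotient = floats[0] / floats[1]   # floating point division

print(truncated.item(), true_quotient.item())
```

Checking `tensor.dtype` before dividing is a cheap habit that avoids surprises of this kind.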
260 00:13:17.120 --> 00:13:22.879 They absolutely are everything flows as tensors. Getting comfortable manipulating them, indexing, reshaping, 261 00:13:22.960 --> 00:13:26.600 doing operations is key to working effectively with PyTorch. 262 00:13:26.639 --> 00:13:28.600 All right, so we have the network building blocks, we 263 00:13:28.679 --> 00:13:31.679 understand loss and optimization, and we know data is handled 264 00:13:31.759 --> 00:13:34.720 via tensors. Let's see the stuff in action. Computer vision 265 00:13:34.759 --> 00:13:36.919 seems like a huge area for depth learning. 266 00:13:36.759 --> 00:13:39.279 Definitely one of the fields where it first made massive breakthroughs. 267 00:13:39.559 --> 00:13:41.200 The problem with the images, as we touched on, is 268 00:13:41.200 --> 00:13:43.039 that if you just flatten them into a long vector 269 00:13:43.120 --> 00:13:46.480 for a standard fully connected network, you lose. 270 00:13:46.320 --> 00:13:48.960 All the spatial information like which pixels are next to 271 00:13:49.000 --> 00:13:49.559 each other. 272 00:13:49.519 --> 00:13:53.120 Precisely, and the number of weights needed becomes astronomically large 273 00:13:53.399 --> 00:13:56.919 even for moderately sized images. It just doesn't scale well 274 00:13:57.039 --> 00:13:59.399 and doesn't leverage the inherent structure of images. 275 00:13:59.519 --> 00:14:01.240 So what's the deep learning solution? 276 00:14:01.639 --> 00:14:06.120 Convolutional neural networks or CNNs. They are designed specifically to 277 00:14:06.200 --> 00:14:08.799 process grid like data like images. 278 00:14:08.879 --> 00:14:10.000 How did they work differently? 279 00:14:10.279 --> 00:14:13.120 Instead of connecting every input pixel to every neuron in 280 00:14:13.159 --> 00:14:17.399 the first layer, CNNs use filters or kernels. 
These are 281 00:14:17.440 --> 00:14:19.840 small windows of weights, say three by three or five 282 00:14:19.960 --> 00:14:21.879 by five, that slide across the 283 00:14:21.799 --> 00:14:25.279 input image. Like scanning the image with a small magnifying glass. 284 00:14:25.480 --> 00:14:28.159 Kind of, yeah. Each filter learns to detect a 285 00:14:28.159 --> 00:14:31.960 specific local feature, maybe a vertical edge, a horizontal line, 286 00:14:32.320 --> 00:14:35.320 a certain curve, or a texture. As the filter slides 287 00:14:35.360 --> 00:14:38.639 across the image, it creates an activation map showing where 288 00:14:38.639 --> 00:14:39.519 it found that feature. 289 00:14:39.600 --> 00:14:42.600 And because it's sliding, it detects that feature regardless of 290 00:14:42.600 --> 00:14:43.120 where it is 291 00:14:43.039 --> 00:14:46.200 in the image. Exactly. That's called translation invariance, a 292 00:14:46.279 --> 00:14:50.919 key property, and crucially, it preserves the spatial relationships between features. 293 00:14:51.559 --> 00:14:53.919 Layers deeper in the CNN then learn to combine these 294 00:14:53.960 --> 00:14:58.919 simpler features into more complex ones: edges combine to form corners; corners 295 00:14:58.919 --> 00:15:01.600 and textures combine into objects like eyes or wheels. 296 00:15:01.720 --> 00:15:05.679 Let's look at that journey. The classic MNIST dataset, 297 00:15:05.799 --> 00:15:08.320 handwritten digits. How well do CNNs do there? 298 00:15:08.600 --> 00:15:11.840 Even a fairly simple CNN can achieve really high accuracy 299 00:15:11.879 --> 00:15:14.960 on MNIST, like ninety eight percent or ninety nine percent. 300 00:15:15.120 --> 00:15:17.639 It's a standard benchmark and CNNs crush it.
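A fairly simple CNN of the kind mentioned here might be sketched like this. The class name `SmallCNN`, the number of filters, and the random input standing in for MNIST digits are assumptions; note also that PyTorch convolution layers expect channels first, that is (batch, channels, height, width).

```python
import torch
from torch import nn


class SmallCNN(nn.Module):  # hypothetical architecture, for illustration
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),    # 1x28x28 -> 8x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 8x14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1),   # -> 16x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16x7x7
        )
        self.classifier = nn.Linear(16 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)                 # activation maps from the filters
        return self.classifier(x.flatten(start_dim=1))


model = SmallCNN()
fake_digits = torch.randn(4, 1, 28, 28)      # random stand-in for 4 MNIST images
logits = model(fake_digits)
print(logits.shape)                          # one score per digit class
```

Each 3x3 filter has only 9 weights per input channel, which is why the parameter count stays small compared with a fully connected layer over all 784 pixels.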
301 00:15:17.840 --> 00:15:20.600 But then you take that same CNN, maybe trained on MNIST, 302 00:15:20.720 --> 00:15:23.320 and try it on something harder like the Dogs vs. 303 00:15:23.320 --> 00:15:24.879 Cats challenge from Kaggle. 304 00:15:24.600 --> 00:15:28.120 And suddenly it might struggle, maybe only seventy five percent accuracy. 305 00:15:28.200 --> 00:15:31.720 The features learned for recognizing simple digits aren't necessarily complex 306 00:15:31.840 --> 00:15:34.759 enough or the right kind to distinguish between detailed photos 307 00:15:34.759 --> 00:15:37.720 of different animal breeds. It doesn't generalize well enough. 308 00:15:37.559 --> 00:15:39.799 Which brings us back to an important idea you mentioned: 309 00:15:40.039 --> 00:15:42.720 how do we tackle these harder tasks, especially if we 310 00:15:42.759 --> 00:15:45.679 don't have millions of labeled dog and cat photos ourselves? 311 00:15:45.919 --> 00:15:50.080 Transfer learning. This is hugely powerful in computer vision. The 312 00:15:50.159 --> 00:15:53.919 idea is: why start learning from scratch when others have 313 00:15:53.960 --> 00:15:57.240 already trained massive models on enormous 314 00:15:57.039 --> 00:15:59.720 datasets? Like learning to drive a motorbike after knowing how 315 00:15:59.720 --> 00:16:02.679 to drive a car, reusing the basic road knowledge. 316 00:16:02.840 --> 00:16:06.320 Perfect analogy. We take a pretrained model like VGG16 317 00:16:06.440 --> 00:16:09.799 or ResNet, which has already been trained on ImageNet, 318 00:16:09.879 --> 00:16:13.559 a dataset with millions of images across one thousand categories. 319 00:16:14.159 --> 00:16:17.519 These models have learned incredibly rich and general visual features 320 00:16:17.519 --> 00:16:22.120 in their early layers: edge detectors, texture detectors, basic shape detectors. 321 00:16:22.200 --> 00:16:24.759 So you take that pretrained network and.
322 00:16:24.679 --> 00:16:28.000 You typically freeze the weights of those early convolutional layers; 323 00:16:28.000 --> 00:16:30.120 you don't let them train anymore. You basically treat them 324 00:16:30.120 --> 00:16:34.200 as fixed feature extractors. Then you replace the final classification 325 00:16:34.279 --> 00:16:36.639 layer, which was trained for the original one thousand ImageNet 326 00:16:36.679 --> 00:16:39.440 classes, with a new one suited to your task, 327 00:16:39.639 --> 00:16:41.759 like discriminating between dogs and cats. 328 00:16:41.559 --> 00:16:43.879 And you only train this new final layer, or maybe 329 00:16:43.879 --> 00:16:45.759 the last few layers? Exactly. 330 00:16:45.919 --> 00:16:48.879 You only train the small task-specific part of the 331 00:16:48.919 --> 00:16:52.759 network using your relatively smaller dataset, like the Dogs 332 00:16:52.799 --> 00:16:56.960 vs. Cats images. The bulk of the network's knowledge is transferred. 333 00:16:56.919 --> 00:16:59.240 And the result on Dogs vs. Cats? It's dramatic. 334 00:16:59.480 --> 00:17:02.960 Instead of seventy five percent accuracy, using transfer learning with 335 00:17:03.039 --> 00:17:05.640 a pretrained ResNet can easily push you up to 336 00:17:05.720 --> 00:17:10.319 ninety eight percent or higher. Massive improvement, leveraging knowledge learned 337 00:17:10.400 --> 00:17:11.799 from a different, larger task. 338 00:17:12.039 --> 00:17:15.319 That's amazing. And you can even peek inside these CNNs, 339 00:17:16.400 --> 00:17:17.799 visualize what they're learning. 340 00:17:17.920 --> 00:17:20.400 Yeah, it's fascinating. You can look at the activations, the 341 00:17:20.440 --> 00:17:24.079 output maps from different filters at different layers. In early layers, 342 00:17:24.160 --> 00:17:27.440 you'll see activations responding to simple things like edges and corners.
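The freeze-and-replace recipe for transfer learning described above can be sketched without downloading any weights. In this sketch a tiny `nn.Sequential` named `pretrained` is a stand-in (an assumption) for a real pretrained network such as a ResNet; the mechanics of freezing parameters and swapping the final layer are the same.

```python
import torch
from torch import nn

# Stand-in for a pretrained network. In practice this would be a model such as
# a ResNet loaded with ImageNet weights; a tiny network is used here instead.
pretrained = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 1000),   # original head: 1000 ImageNet-style classes
)

# Freeze every existing weight so it acts as a fixed feature extractor.
for param in pretrained.parameters():
    param.requires_grad = False

# Replace the final layer with a new two-class head (dogs vs. cats).
# A freshly created layer has requires_grad=True by default.
pretrained[4] = nn.Linear(8, 2)

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in pretrained.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

logits = pretrained(torch.randn(4, 3, 64, 64))   # 4 random stand-in images
print(logits.shape, len(trainable))
```

Only the new head's weight and bias are updated during training, which is why a comparatively small dataset can be enough.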
343 00:17:27.599 --> 00:17:31.759 Go deeper, and you see activations responding to more complex textures, patterns, 344 00:17:31.880 --> 00:17:34.039 or even parts of objects like eyes or snouts. 345 00:17:34.240 --> 00:17:36.519 It really gives you a sense that the network is building up 346 00:17:36.599 --> 00:17:38.759 understanding hierarchically. It does. 347 00:17:38.920 --> 00:17:41.680 It demystifies the black box a little bit. And beyond 348 00:17:41.799 --> 00:17:44.839 VGG and ResNet, there are other cool architectures. You mentioned 349 00:17:44.880 --> 00:17:48.920 ResNet solving the vanishing gradient issue with skip connections. 350 00:17:48.519 --> 00:17:50.359 Right, letting information bypass layers. 351 00:17:50.680 --> 00:17:54.599 Then there's Inception, or GoogLeNet, which cleverly uses parallel 352 00:17:54.640 --> 00:17:58.240 convolutional filters of different sizes (one by one, three by three, 353 00:17:58.319 --> 00:18:01.559 five by five) at the same layer and concatenates 354 00:18:01.559 --> 00:18:05.799 their outputs. It captures features at multiple scales simultaneously, and 355 00:18:05.839 --> 00:18:09.400 it uses one-by-one convolutions smartly for dimensionality reduction, 356 00:18:09.680 --> 00:18:12.960 making it efficient. And DenseNet? DenseNet took connectivity 357 00:18:12.960 --> 00:18:16.839 even further. Each layer receives inputs from all preceding layers 358 00:18:16.880 --> 00:18:19.599 and passes its own feature maps to all subsequent layers. 359 00:18:20.000 --> 00:18:23.400 It sounds complex, but it actually encourages feature reuse and 360 00:18:23.440 --> 00:18:26.279 can lead to models with fewer parameters that are very effective. 361 00:18:26.559 --> 00:18:29.319 So if one of these powerful models gives great results, 362 00:18:29.359 --> 00:18:31.559 can you do even better by combining them? 363 00:18:31.680 --> 00:18:35.039 Yes, that's model ensembling.
You train several different high-performing models, 364 00:18:35.039 --> 00:18:38.000 maybe a ResNet, an Inception, a DenseNet, independently 365 00:18:38.039 --> 00:18:40.319 on the same task. Then for a new image, you 366 00:18:40.359 --> 00:18:42.839 get predictions from all of them and combine those predictions, 367 00:18:43.000 --> 00:18:45.640 often just by averaging their output probabilities or taking a 368 00:18:45.680 --> 00:18:46.599 majority vote. 369 00:18:46.480 --> 00:18:48.519 And that actually improves accuracy further? 370 00:18:48.880 --> 00:18:52.279 Often, yes. It can smooth out the errors or biases 371 00:18:52.319 --> 00:18:56.960 of individual models. For Dogs vs. Cats, ensembling can nudge 372 00:18:57.000 --> 00:19:00.680 accuracy even higher, maybe to ninety nine point three percent or more. 373 00:19:01.240 --> 00:19:03.680 The downside is, well, you have to train and run 374 00:19:03.759 --> 00:19:06.759 multiple models, so it's computationally more expensive. 375 00:19:06.319 --> 00:19:09.640 A trade-off between performance and cost. Okay, let's switch 376 00:19:09.640 --> 00:19:14.839 gears now, from seeing to understanding language. Natural language processing, 377 00:19:15.240 --> 00:19:17.240 or NLP. Text data is different. 378 00:19:17.279 --> 00:19:20.480 It's sequential. Absolutely. The meaning often depends on the order 379 00:19:20.519 --> 00:19:24.319 of words, so the first step is usually tokenization, breaking 380 00:19:24.319 --> 00:19:26.240 the text down into smaller units, 381 00:19:25.960 --> 00:19:28.440 or tokens. Like words or characters? 382 00:19:28.519 --> 00:19:31.839 Could be either. For a review like "just perfect", splitting 383 00:19:31.880 --> 00:19:35.119 by spaces gives you word tokens: "just", "perfect". Using Python's 384 00:19:35.119 --> 00:19:37.400 list function on the phrase would give you character tokens: 385 00:19:37.519 --> 00:19:40.200 j, u, s, t, and so on. The choice depends on the task.
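The two tokenization options mentioned for the review "just perfect" take one line each in plain Python. The small vocabulary lookup at the end is an added assumption, included only to show how tokens could then be mapped to integers.

```python
review = "just perfect"

# Word-level tokens: split on whitespace.
word_tokens = review.split()     # ['just', 'perfect']

# Character-level tokens: list() breaks the string into single characters,
# including the space between the two words.
char_tokens = list(review)

# Assumed follow-up step: give every distinct word token an integer index.
vocab = {token: index for index, token in enumerate(sorted(set(word_tokens)))}
indices = [vocab[token] for token in word_tokens]   # [0, 1]

print(word_tokens, char_tokens[:4], indices)
```

Word tokens keep the vocabulary meaningful but large; character tokens keep it tiny at the cost of much longer sequences.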
386 00:19:40.319 --> 00:19:43.759 Okay. Once you have tokens, you need numbers right vectorization. 387 00:19:43.359 --> 00:19:46.400 Right, we need to represent these tokens numerically. One old 388 00:19:46.440 --> 00:19:50.079 method is one hot encoding, where each unique word gets 389 00:19:50.119 --> 00:19:52.640 a huge vector that's all zeros except for a single 390 00:19:52.680 --> 00:19:54.039 one at its specific index. 391 00:19:54.359 --> 00:19:57.200 Sounds very sparse and doesn't capture meaning, does it like 392 00:19:57.400 --> 00:20:00.720 king and queen would be totally unrelated vectors exactly. 393 00:20:01.000 --> 00:20:05.240 It's rarely used in modern deep learning for NLP. Much 394 00:20:05.279 --> 00:20:09.359 more powerful are word embeddings. These represent words as dense, 395 00:20:09.720 --> 00:20:13.119 relatively low dimensional vectors, maybe one hundred or three hundred 396 00:20:13.160 --> 00:20:14.519 dimensions instead of millions. 397 00:20:14.759 --> 00:20:16.480 And these vectors capture meaning. 398 00:20:16.839 --> 00:20:19.680 Yes, that's the key. They are learned in such a 399 00:20:19.680 --> 00:20:23.079 way that words with similar meanings end up having similar 400 00:20:23.160 --> 00:20:27.759 vector representations, like the vector for king might be mathematically 401 00:20:27.759 --> 00:20:31.440